ISCA-2000 Conference Trip Report


1

ISCA-2000 Conference Trip Report

Kenji Kise, Graduate School of Information Systems, The University of Electro-Communications. kis@is.uec.ac.jp

2

Conference Overview

• The 27th Annual International Symposium on Computer Architecture, Vancouver, Canada, June 10-14
– keynote: 1
– panel: 1
– regular papers: 29 (17% acceptance rate)

• 444 attendees (13 from Japan, 4 of them from universities)

• http://www.cs.rochester.edu/~ISCA2k/

3

Papers Covered

• Multiple-Banked Register File Architecture
• On the Value Locality of Store Instructions
• Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance
• Circuits for Wide-Window Superscalar Processors
• Trace Preconstruction
• A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots
• A Fully Associative Software-Managed Cache Design
• Performance Analysis of the Alpha 21264-based Compaq ES40 System
• Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

4

Multiple-Banked Register File Architecture

Jose-Lorenzo Cruz et al.

Universitat Politecnica de Catalunya, Spain

ISCA-2000 p.316-325

Session 8 – Microarchitecture Innovations

5

Register File Organization

(Figure: a 32-entry register file, Reg0 through Reg31, with a 5-bit decoder and tri-state buffers; a read supplies a 5-bit RegNo and returns a 64-bit value.)

The register file is characterized by:
• the number of registers
• the number of ports (read, write)

6

Motivation

• The register file access time is one of the critical delays

• The access time of the register file depends on the number of registers and the number of ports
– instruction window -> registers
– issue width -> ports

7

Goals

• Increase the number of register file ports
• Approach a register file that can be accessed in a single cycle

(Figure: machine-cycle timing of a register file access split into a request phase and a value phase across two RegFile stages.)

8

Impact of Register File Architecture

(Chart: IPC on SPECint95 as a function of physical register file size, from 48 to 256 registers; the y-axis spans 2.00 to 4.00.)

9

Observation

• The processor needs many physical registers, but only a very small number are actually required at a given moment:
– registers with no value
– values used by later instructions
– last use and overwrite
– bypass-only or never-read values

10

Multiple-Banked Register File Architecture

(Figure: (a) a one-level organization with Bank 1 and Bank 2 side by side; (b) a multi-level organization (the register file cache) with Bank 1 as the uppermost level and Bank 2 as the lowest level.)

11

Register File Cache

(Figure: Bank 1, the uppermost level, holds 16 registers; Bank 2, the lowest level, holds 128 registers.)

• The lowest level is always written.

• Data is moved only from lower to upper level.

• Cached in upper level based on heuristics.

• There is a prefetch mechanism.

12

Caching and Fetching Policy

• Non-bypass caching
– only results that were not read from the bypass logic are stored in the upper level
• Ready caching
– only values required by instructions that have not yet issued are stored in the upper level
• Fetch-on-demand
– a value is moved to the upper level at the moment it is needed
• Prefetch-first-pair -> next slide

Locality properties of registers and memory are very different.

13

Prefetch-first-pair

(1) P1 = P2 + P3
(2) P4 = P3 + P6
(3) P7 = P1 + P8

Instruction (3) is the first to use P1, the result register of instruction (1); prefetch P8, the other source register of instruction (3).

P8 is prefetched when instruction (1) issues.

• Instructions (1) through (3) have already passed the processor's rename stage.
• P1 through P8 are physical register numbers assigned by the renaming hardware.
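
To make these policies concrete, here is a minimal Python sketch of a two-level register file combining non-bypass caching with the prefetch-first-pair heuristic. The class, the naive eviction, and the instruction encoding are illustrative assumptions, not the paper's hardware design.

```python
class RegFileCache:
    """Assumed model of the two-level register file: a small upper bank
    (16 registers) backed by a large lower bank (128 registers)."""

    def __init__(self, upper_size=16):
        self.upper = {}              # uppermost level: physical reg -> value
        self.lower = {}              # lowest level: always written
        self.upper_size = upper_size

    def write(self, reg, value, read_from_bypass):
        self.lower[reg] = value      # the lowest level is always written
        if not read_from_bypass:     # non-bypass caching heuristic
            self._cache(reg, value)

    def _cache(self, reg, value):
        if reg not in self.upper and len(self.upper) >= self.upper_size:
            self.upper.pop(next(iter(self.upper)))  # naive oldest-first eviction
        self.upper[reg] = value

    def prefetch(self, reg):
        # move a value from the lowest to the uppermost level ahead of its use
        if reg in self.lower:
            self._cache(reg, self.lower[reg])

def prefetch_first_pair(instrs, producer, rf):
    """On issuing instrs[producer], find the first later instruction that
    reads its destination and prefetch that instruction's other source
    (the slide's example: issuing (1) prefetches P8)."""
    dest, _ = instrs[producer]
    for d, (s1, s2) in instrs[producer + 1:]:
        if dest in (s1, s2):
            other = s2 if s1 == dest else s1
            rf.prefetch(other)
            return other
    return None

# Instructions encoded as (dest, (src1, src2)) over physical register numbers.
instrs = [("P1", ("P2", "P3")), ("P4", ("P3", "P6")), ("P7", ("P1", "P8"))]
rf = RegFileCache()
rf.write("P8", 42, read_from_bypass=True)   # P8 lives only in the lower bank
assert prefetch_first_pair(instrs, 0, rf) == "P8"
assert "P8" in rf.upper                     # prefetched into the fast bank
```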

14

Evaluation Results (Configuration C3)

• One-cycle single-banked
– area 18855, cycle 5.22 ns (191 MHz)
– 4 read ports, 3 write ports
• Two-cycle single-banked
– area 18855, cycle 2.61 ns (383 MHz)
– 4 read ports, 3 write ports
• Register file cache
– area 20529, cycle 2.61 ns (382 MHz)
– upper level: 4 read ports, 4 write ports
– lower level: 4 write ports, 2 buses

15

Evaluation Results

(Chart: relative instruction throughput, 0.0 to 3.0, on SPECint95 and SPECfp95, comparing a 1-cycle register file, non-bypass caching + prefetch-first-pair, and a 2-cycle, 1-bypass configuration.)

16

Novelty

• Proposal of the register file cache
– operates at high speed
– brings the access latency close to one cycle by reducing the upper-level miss rate
• Proposal of two caching policies and two fetch policies
• Performance evaluation that accounts for area and cycle time

17

Comments

• The demands: huge register files, more ports, and shorter access times.
• Even the register file, traditionally a simple structure, now calls for a multi-level organization like a cache.
• Is growing complexity unavoidable in future large-scale ILP architectures?

18

On the Value Locality of Store Instructions

Kevin M. Lepak et al.

University of Wisconsin

ISCA-2000 p.182-191

Session 5a – Analysis of Workloads and Systems

19

Value Locality (Value Prediction)

• Value locality
– a program attribute that describes the likelihood of the recurrence of previously-seen program values

• The result (data value) an instruction produced last time is correlated with the data value it will produce this time.

(1) P1 = P2 + P3
(2) P4 = P3 + P6
(3) P7 = P1 + P8

History of P1's results: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1 ... and next?
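
A small Python sketch of how such value locality can be measured over a result history; the trace format and history depth are illustrative assumptions rather than the paper's methodology.

```python
from collections import defaultdict, deque

def value_locality(trace, depth=4):
    """Fraction of results that match one of the last `depth` values
    produced by the same static instruction (a simple form of value
    locality; the parameters here are illustrative)."""
    history = defaultdict(lambda: deque(maxlen=depth))
    hits = total = 0
    for pc, value in trace:          # (static instruction, produced value)
        total += 1
        if value in history[pc]:
            hits += 1
        history[pc].append(value)
    return hits / total if total else 0.0

# P1's result history from the slide: 1..5 repeating is highly local.
trace = [(0x400, v) for v in [1, 2, 3, 4, 5] * 3]
print(f"value locality: {value_locality(trace, depth=5):.2f}")
```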

20

Goals

• Most prior publications have focused on load instruction outcomes.
• Examine the implications of store value locality
– introduce the notion of silent stores
– introduce two new definitions of multiprocessor true sharing
– reduce multiprocessor data and address traffic

21

Memory-Centric and Producer-Centric Locality

• Program structure store value locality
– the locality of values written by a particular static store
• Message-passing store value locality
– the locality of values written to a particular address in data memory

22

20%-62% of stores are silent stores

(Chart: percentage of silent stores, 0-100%, for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, oltp, barnes, and ocean.)

A silent store is a store that does not change the system state.

23

Silent Store Removal Mechanism

• Realistic method
– all previous store addresses must be known
– load the data from the memory subsystem
– if the data values are equal, the store is update-silent: remove it from the LSQ and flag the RUU entry
– a silent store retires with no memory access
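
A minimal Python sketch of the comparison step, assuming simplified memory and LSQ structures (the real mechanism operates inside an out-of-order core's load/store queue and RUU):

```python
def retire_store(memory, lsq, store):
    """Return True if the store is squashed as update-silent.
    Assumes all earlier store addresses are already resolved."""
    addr, value = store["addr"], store["value"]
    if memory.get(addr) == value:   # loaded data equals the store data
        lsq.remove(store)           # drop it from the load/store queue
        store["silent"] = True      # flag the RUU entry: retire w/o memory access
        return True
    memory[addr] = value            # ordinary store: perform the write
    return False

memory = {0x1000: 7}
lsq = [{"addr": 0x1000, "value": 7}, {"addr": 0x1000, "value": 9}]
assert retire_store(memory, lsq, lsq[0]) is True    # 7 over 7: silent
assert retire_store(memory, lsq, lsq[0]) is False   # 9 over 7: a real write
```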

24

Evaluation Results

• Writeback reduction
– reductions range from 81% to 0%
– average 33% reduction
• Instruction throughput
– speedups of 6.3% and 6.9% for the realistic and perfect removal mechanisms

25

New Definition of False Sharing

• Multiprocessor applications
– all of the previous definitions rely on the specific addresses in the same block
– no attempt is made to determine when the invalidation of a block is unnecessary because the value stored in the line does not change
– silent stores and stochastically silent stores

26

Address-based Definition of Sharing [Dubois 1993]

• Cold miss
– the first miss to a given block by a processor
• Essential miss
– a cold miss is an essential miss
– if, during the lifetime of a block, the processor accesses a value defined by another processor since the last essential miss to that block, it is an essential miss
• Pure true sharing miss (PTS)
– an essential miss that is not cold
• Pure false sharing miss (PFS)
– a non-essential miss

27

Update-based False Sharing (UFS)

• Essential miss
– a cold miss is an essential miss
– if, during the lifetime of a block, the processor accesses an address which has had a different data value defined by another processor since the last essential miss to that block, it is an essential miss

28

Stochastic False Sharing (SFS)

• It seems intuitive that if we define false sharing to compensate for the effect of silent stores, we could also define it in the presence of stochastically silent stores (values that are trivially predictable via some mechanism).

29

Novelty

• Overall characterization of store value locality
• The notion of silent stores
• Uniprocessor speedup by squashing silent stores
• Definitions of UFS and SFS
• How to exploit UFS to reduce address and data bus traffic on shared-memory multiprocessors

30

Comments

• Brings together a range of findings on the value locality of store instructions.
• The evaluation uses a preliminary configuration; it serves to motivate future work.
• Exploiting this locality in parallel machines calls for more detailed study.

31

Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance

Ryan Rakvic et al.

Carnegie Mellon University

ISCA-2000 p.47-58

Session 2a – Exploiting Traces

32

Branch Prediction and Multiple Branch Prediction

(Figure: a control-flow graph of basic blocks connected by taken / not-taken edges. Conventional branch prediction answers "T or N?" for a single branch; multiple branch prediction answers "(T or N) (T or N) (T or N)?" for several branches along the predicted path at once.)

33

Motivation and Goals

• Wide instruction fetching: 4-way -> 8-way
• Multiple branch prediction
– branch execution rate: about 15%
– one branch prediction per cycle is not enough
• Tree-Based Multiple Branch Predictor (TMP)

34

Example of the Underlying Branch Predictor: gshare

(Figure: gshare. A global history of recent branch outcomes (e.g., T N N for branches B-3, B-2, B-1) is combined with the branch address to index a tagged table of 2-bit bimodal counters (2BC); each counter is a four-state saturating machine (00, 01, 10, 11) stepped by taken / not-taken outcomes, and its value predicts branch B0.)
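
For reference, a compact Python sketch of a gshare-style predictor: the global history register is XORed with the branch address to index a table of 2-bit saturating counters. The table size is illustrative, and the per-entry tags shown in the figure are omitted for brevity.

```python
class Gshare:
    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0                      # global branch history register
        self.pht = [1] * (1 << history_bits)  # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask   # XOR of PC and history

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2    # high bit set -> predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

bp = Gshare()
for outcome in [True, True, True, False] * 4:    # a branch with a T T T N pattern
    prediction = bp.predict(0x40)
    bp.update(0x40, outcome)
```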

35

Tree-Based Multiple Branch Predictor (TMP)

(Figure: the global history (T N N) leading to branch B0 selects an entry Tree(i) in the tree-based pattern history table (Tree-PHT). The stored tree spans branches B0 through B4, and walking it yields a multi-branch predicted path such as T T N T.)

36

Tree-based Pattern History Table

(Figure: each Tree-PHT entry holds a tag and a tree whose nodes are 2-bit bimodal counters (states 00, 01, 10, 11 with taken / not-taken transitions); walking the tree node by node yields the predicted path.)

37

Updating of the 2-bit Bimodal Tree Node

(Figure: at completion time, the recently completed path N T N T steps the 2-bit counters along the corresponding tree nodes, changing the old predicted path T N T T into a new predicted path.)

38

Tree-PHT with a Second-Level PHT (Node-PHT) for Tree-Node Prediction

(Figure: each tree node is predicted through a second-level node pattern history table (Node-PHT): n bits of local history select a 2-bit bimodal counter. The Node-PHT can be organized as global (g), per-branch (p), or shared (s).)

39

Novelty and Evaluation Results

• Proposal of TMP
– a three-level branch predictor
– maintains a tree structure
– completion-time update
• TMPs-best (shared)
– Tree-PHT entries: 2K
– local history bits: 6
– 72KB of memory
• 96%: 1 block
• 93%: 2 consecutive blocks
• 87%: 3 consecutive blocks
• 82%: 4 consecutive blocks

40

Comments

• Proposes a three-level prediction mechanism to predict the targets of multiple branch instructions per cycle.
• Branch prediction becomes still more complex,
• but the performance improvement is steady.

41

Circuits for Wide-Window Superscalar Processors

Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh and Rahul Sami

ISCA-2000 p.236-247

Session 6 – Circuit Considerations

42

Instruction Window

(Figure: an out-of-order superscalar processor and its instruction window. Instruction supply fills the window; execution results (tag, value) are broadcast back to it; instructions and data issue to the execution units. Each window entry holds Src1-Tag / Valid, Src2-Tag / Valid, and Op fields, e.g., instruction (1) P1 = P2 + P3 with sources Src1 and Src2. The window logic performs wake-up and scheduling.)

43

Motivation

• Enlarging the instruction window increases the potential for exploiting instruction-level parallelism.
• The Alpha 21264 window size is 35; the MIPS R10000 window size is ??
• Building a large instruction window is difficult:
– Power4: two 4-issue processors
– Intel Itanium: VLIW techniques

(Figure: the executed instruction stream passing through the instruction window.)

44

Goals

• Realize a large (128-entry) instruction window that operates at high speed
• Propose the log-depth cyclic segmented prefix (CSP) circuit
• Discuss the relationship between the log-depth cyclic segmented prefix circuit and cycle time
• Discuss the performance gains from a large instruction window

45

Gate-delay cyclic segmented prefix (CSP)

(Figure: a 4-entry wrap-around reordering buffer with an adjacent, linear gate-delay cyclic segmented prefix: each entry i has an input in_i, a segment bit s_i, and an output out_i, and Head and Tail pointers mark the window's extent.)
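
In software terms, a cyclic segmented prefix computes, for every window slot, an associative reduction over all slots from the head up to that slot, wrapping around the circular buffer. The Python sketch below shows the function the commit logic needs (an AND over done bits); the paper's contribution is computing it in logarithmic rather than linear gate delay, so the loop here only illustrates the function, not the circuit.

```python
def cyclic_prefix_and(done, head):
    """For each slot, AND of the done bits from the head up to and
    including that slot, walking the circular window in age order."""
    n = len(done)
    out = [False] * n
    acc = True
    for k in range(n):             # linear walk; the hardware is O(log n) depth
        i = (head + k) % n
        acc = acc and done[i]
        out[i] = acc               # True while every older slot is also done
    return out

done = [True, True, False, True]            # 4-entry window, head at slot 0
print(cyclic_prefix_and(done, head=0))      # [True, True, False, False]
# Commit logic retires exactly the leading run of True entries (slots 0 and 1);
# slot 3 must wait behind the not-done instruction in slot 2.
```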

46

Commit Logic Using CSP

(Figure: the 4-entry buffer computing commit decisions. Entries between Head and Tail carry done / not-done flags; the cyclic segmented prefix marks the leading run of done instructions from the Head, which may retire, while everything behind a not-done entry waits.)

47

Wake-up logic for logical register R5

(Figure: window entries A: R5=R8+R1 (done), B: R1=R5+R1 (not done), C: R5=R3+R3 (not done), D: R4=R5+R7 (not done); a segmented prefix from Head to Tail broadcasts the readiness of each definition of R5 to the consumers that follow it.)

48

Scheduler logic scheduling two FUs

(Figure: entries A, C, and D raise requests; a prefix count from the Head numbers the requests 1, 2, ... and the first two requesters are granted the two functional units. Logarithmic gate-delay implementations: see pp. 240-241.)

49

Evaluation Results

• Designed a 128-entry instruction window
• Commit logic: 1.41 ns (709 MHz)
• Wakeup logic: 1.54 ns (649 MHz)
• Schedule logic: 1.69 ns (591 MHz)
• Achieves operating speeds above 500 MHz in current process technology

50

Comments

• Demonstrated the feasibility of a 128-entry instruction window.
• Increasing the number of instruction window entries had been considered difficult.
• Interesting for overturning that assumption.

51

Trace Preconstruction

Quinn Jacobson et al.

Sun Microsystems

ISCA-2000 p.37-46

Session 2a – Exploiting Traces

52

Trace Cache

(Figure: the fetch path with a trace cache: a branch predictor and I-cache feed the instruction fetch and execution engine, a trace constructor builds traces, the trace cache stores them, and a next-trace predictor selects the next trace to fetch.)

Traces are snapshots of short segments of the dynamic instruction stream.

53

Trace Cache Slow Path

(Figure: the same fetch engine with the branch predictor / I-cache path highlighted as the slow path.)

If the trace cache does not have the needed trace, the slow path is used.

When the trace cache is busy, the slow path hardware is idle.

54

Goals

• The trace cache enables high-bandwidth, low-latency instruction supply.
• Trace cache performance may suffer due to capacity and compulsory misses.
• By performing a function analogous to prefetching, trace preconstruction augments the trace cache.

55

Preconstruction Method

• For preconstruction to be successful, region start points must identify instructions that the actual execution path will reach in the future.
• Heuristics: loop back edges, procedure calls

56

Preconstruction Example

(Figure: a control-flow graph with basic blocks a through j, a procedure call JAL, conditional branches Br1-Br4, and a JMP. Region start points at JAL, Br1, and Br2 split the future dynamic stream, e.g., a b c c c c c d e g h i i i ..., into Regions 1-3 that preconstruction can work on ahead of execution.)

57

Preconstruction Efficiency

• Slow-path dynamic branch predictor
– to reduce the number of paths
– assume a bimodal branch predictor
– only the strongly biased paths are followed during preconstruction
• Trace alignment
– ... c c c d e g ...
– <c c c> <d e g> or <. c c> <c d e> or ...
– for a pre-constructed trace to be useful, it must align with the actual execution path

58

Conclusions

• One implementation of trace preconstruction
– on SPECint95 benchmarks with large working-set sizes
– reduces trace cache miss rates by 30% to 80%
– yields 3% to 10% overall performance improvement

59

A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots

Matthew C. Merten et al.

Coordinated Science Lab

ISCA-2000 p.59-70

Session 2a – Exploiting Traces

60

Hot Spots

• Hot spots are the frequently executed portions of a program's code
– the 10/90 (or 1/99) rule

61

Goals

• Dynamic extraction and relayout of program hot spots
– building on "A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization," ISCA 1999
• Improve the instruction fetch rate

62

Branch Behavior Buffer with new fields shaded in gray

(Figure: each entry holds a Branch Tag, Exec Counter, Taken Counter, Cand Flag, Taken Valid Bit, Remapped Taken Addr, FT Valid Bit, Remapped FT Addr, Call ID, and Touched Bit. The branch address is matched against the tags; a saturating adder adjusts the Hot Spot Detection Counter (+I / -D) as branches execute, refresh and reset timers bound the observation window, and a hot spot is detected when the counter reaches zero.)

63

Memory-Based Code Cache

• Use a region of memory called a code cache to contain the translated code.
• Standard paging mechanisms are used to manage a set of virtual pages reserved for the code cache.
• The code cache pages can be allocated by the operating system and marked as read-only executable code.
• The remapping hardware must be allowed to write this region.

64

Hardware Interaction

(Figure: pipeline stages Fetch, Decode, ..., Execute, In-order Retire. The branch predictor, BTB, and I-cache feed the fetch stage; retirement supplies profile information to the Branch Behavior Buffer (BBB), which drives the Trace Generation Unit (TGU); the TGU writes traces into the code cache in memory and updates the BTB.)

65

Code Deployment

• The original code cannot be altered.
• Transfer to the optimized code is handled by the Branch Target Buffer:
– the BTB target for the entry-point branch is updated with the address of the entry-point target in the code cache.
• The optimized code consists only of the original code.

66

Trace Generation Algorithm

• Scan Mode
– search for a trace entry point; this is the initial mode following hot spot detection
• Fill Mode
– construct a trace by writing each retired instruction into the memory-based code cache
• Pending Mode
– pause trace construction until a new path is executed

67

State Transitions

• Hot spot detection moves the TGU from Profile Mode into Scan Mode.
• Scan Mode -> Fill Mode on a new entry point: (jcc || jmp) && candidate && taken
• Fill Mode -> Scan Mode on End Trace:
– 2: jcc && candidate && both_dir_remapped
– 3: jcc && !candidate && off_path_branch_cnt > max_off_path_allowed
– 4: ret && ret_addr_mismatch
– 5: jcc && candidate && recursive_call
• Fill Mode -> Pending Mode on a cold branch: (jcc || jmp) && !candidate
• Pending Mode -> Fill Mode on Merge: jcc && candidate && other_dir_not_remapped && exec_dir_remapped
• Pending Mode -> Fill Mode on Continue: jcc && addr_matches_pending_target && exec_dir_not_remapped
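
Read as a state machine, the transitions look roughly like the Python sketch below; the predicate names are booleans standing in for signals from the Branch Behavior Buffer, and the five end-trace rules are collapsed into a single flag for brevity.

```python
def next_mode(mode, ev):
    """ev: predicate bits for the currently retiring instruction."""
    if mode == "scan":
        # new entry point: (jcc || jmp) && candidate && taken
        if ev["is_jcc_or_jmp"] and ev["candidate"] and ev["taken"]:
            return "fill"
    elif mode == "fill":
        # cold branch: (jcc || jmp) && !candidate
        if ev["is_jcc_or_jmp"] and not ev["candidate"]:
            return "pending"
        if ev["end_trace"]:            # any of end-trace rules 2-5
            return "scan"
    elif mode == "pending":
        # merge or continue: a new path has been executed
        if ev["merge"] or ev["continue_fill"]:
            return "fill"
    return mode

# Example: a taken candidate branch moves the TGU from scan into fill mode.
ev = {"is_jcc_or_jmp": True, "candidate": True, "taken": True,
      "end_trace": False, "merge": False, "continue_fill": False}
assert next_mode("scan", ev) == "fill"
```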

68

Trace Example with Optimization (1/3)

(Figure (b): the original code layout, blocks A through D with their branches, annotated with the execution order during remapping and the taken branches.)

69

Trace Example with Optimization (2/3)

(Figure (c): the trace generated by basic remapping contains RM-A1, RM-B1, RM-A2, RM-C2, and RM-D1, entered at C1.)

70

Trace Example with Optimization (3/3)

(Figure (d): after applying two remapping optimizations, patching and branch replication, the trace contains RM-A1, RM-B1, RM-A2, RM-C3, RM-A3, RM-B2, RM-C3, and RM-D1 (RM-C3 appears twice due to branch replication), entered at C1.)

71

Fetched IPC for various fetch mechanisms

(Chart: fetched IPC, 0 to 8, for six fetch mechanisms: IC:64KB; IC:64KB with remapping; IC:64KB + TC:8KB; IC:64KB + TC:8KB with remapping; IC:64KB + TC:128KB; IC:64KB + TC:128KB with remapping.)

72

Conclusion

• Detect hot spots to perform
– code straightening
– partial function inlining
– loop unrolling
• Achieve significant fetch performance improvement at little extra hardware cost
• Create opportunities for more aggressive optimizations in the future

73

Comments

• Advances from last year's hot-spot detection work to actually optimizing the hot spots.
• Is the optimization algorithm too complex?
• Are the possible optimization variations endless?
– software-based optimization
– translation between instruction sets
– and so on

74

A Fully Associative Software-Managed Cache Design

Erik G. Hallnor et al.

The University of Michigan

ISCA-2000 p.107-116

Session 3 – Memory Hierarchy Improvement

75

Data Cache Hierarchy

(Figure: a 100-million-transistor MPU with an on-chip L1 data cache, a 1 MB L2 data cache, an off-chip data cache, and DRAM main memory; beside it, a 2-way set-associative data cache in which the data address splits into tag and offset and two tag/data ways are compared to select the data.)

76

Goals

• Processor-memory gap
– miss latencies approaching a thousand instruction execution times
• On-chip SRAM caches in excess of one megabyte [Alpha 21364]
• Re-examination of how secondary caches are organized and managed
– a practical, fully associative, software-managed secondary cache system

77

Indirect Index Cache (IIC)

(Figure: the data address splits into tag and offset; a hash of the tag indexes the primary hash table, whose entries hold four tag entries (TE) plus a chain pointer. Each TE stores TAG, STATUS, INDEX, and REPL fields; tag comparisons signal a hit, and the INDEX field points into a separate data array. Misses in the primary table follow chain pointers into secondary storage for chaining.)
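
A Python sketch of the lookup path under assumed structures: a hashed primary table of tag entries (TEs), chains into secondary storage on collision, and an INDEX field that points into the separate data array. The field names and the trivial hash are illustrative.

```python
HASH_BITS = 10   # illustrative; the real IIC hashes the full block address

def iic_lookup(primary, chains, data_array, addr):
    tag = addr >> HASH_BITS
    te = primary[addr & ((1 << HASH_BITS) - 1)]   # hashed primary entry
    while te is not None:                         # walk the tag-entry chain
        if te["tag"] == tag:
            return data_array[te["index"]]        # indirection into data array
        te = chains.get(te["chain"])              # follow the chain pointer
    return None                                   # miss: invoke software handler

data_array = ["block A", "block B"]               # data stored separately
chains = {1: {"tag": 0x7, "index": 1, "chain": None}}
primary = [None] * (1 << HASH_BITS)
primary[0x25] = {"tag": 0x3, "index": 0, "chain": 1}   # bucket with a chain

addr = (0x7 << HASH_BITS) | 0x25   # tag 0x7 hashes to bucket 0x25
assert iic_lookup(primary, chains, data_array, addr) == "block B"
```

The extra indirection (and possible chain walk) on a hit is the price of full associativity, which is why the scheme targets secondary caches, where some added hit latency is tolerable.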

78

Generational Replacement

• Blocks are grouped into a small number of prioritized pools.
– blocks that are referenced regularly are promoted into higher-priority pools
– unreferenced blocks are demoted into lower-priority pools

(Figure: a fresh pool feeds Pool 0 (lowest priority) through Pool 3 (highest priority); blocks with Ref = 1 move up, blocks with Ref = 0 move down.)

79

Generational Replacement Algorithm

• Each pool is variable-length FIFO queue of blocks.

• On a hit, only the block's reference bit is updated.• On a each miss, the algorithm checks the head of

each pool FIFO.– Head block's reference bit is set -> Next higher-priority

pool, reference bit=0

– Head block's reference bit is not set -> Next lower-priority pool, reference bit=0

80

Evaluation Results and Conclusion

• Substantial reduction in miss rates (8-85%) relative to a conventional 4-way associative LRU cache.
• An IIC/generational-replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform it as CPU-relative latencies grow.

81

Comments

• Assumes a large on-chip secondary cache.
• Although software assists in its management, it still looks like an ordinary cache to programmers and executable code.
• Approaches that enlist the instruction set or the programmer would also be worth examining.

82

Performance Analysis of the Alpha 21264-based Compaq ES40 System

Zarka Cvetanovic, R.E.Kessler

Compaq Computer Corporation

ISCA-2000 p.192-202

Session 5a – Analysis of Workloads and Systems

83

Motivation

(Chart: SPECfp_rate95 for 1-, 2-, and 4-CPU shared-memory multiprocessor configurations, comparing the Compaq ES40 / 21264 667 MHz, HP PA-8500 440 MHz, and Sun UltraSPARC-II 400 MHz, with the Compaq ES40 highlighted.)

84

Goals

• Evaluation of the Compaq ES40 shared-memory multiprocessor
– up to four Alpha 21264 CPUs
• Quantitatively characterize performance
– instructions per cycle
– branch mispredicts
– cache misses

85

Alpha 21264 Microprocessor

• 15 million transistors, 4-way out-of-order superscalar
• 80 in-flight instructions
• 35-entry instruction window
• hybrid (local and global) branch prediction
• two-cluster integer execution core
• load hit/miss and store/load prediction

86

Instruction Cache Miss Rate Comparison

(Chart: I-cache misses per 1000 retired instructions, 0 to 60, on SPECfp95, SPECint95, and TPM (transactions per minute), comparing the Alpha 21164 and Alpha 21264. The I-cache grew from 8KB direct-mapped to 64KB two-way associative.)

87

Branch Mispredicts Comparison

(Chart: branch mispredicts per 1000 retired instructions, 0 to 22, on SPECfp95, SPECint95, and TPM, comparing the Alpha 21164 and Alpha 21264. The predictor changed from a 2-bit predictor to a local-and-global hybrid predictor.)

88

IPC Comparison

(Chart: retired instructions per cycle, 0.0 to 2.0, on SPECfp95, SPECint95, and TPM, comparing the Alpha 21164 and Alpha 21264.)

89

Compaq ES40 Block Diagram

(Figure: four Alpha 21264 CPUs, each with an 8 MB L2 cache, connect over 64-bit links to a control chip and to eight crossbar-based data switches, which reach the memory banks over 256-bit paths; two PCI chips provide the I/O buses.)

90

Inter-processor Communication

• Write-invalidate cache coherence
– the 21264 passes L2 miss requests to the control chip
– the control chip simultaneously forwards the request to DRAM and to the other 21264s
– the other 21264s check for necessary coherence violations and respond
– the control chip responds with data from DRAM or from another 21264

91

STREAM Copy Bandwidth

• The STREAM benchmark measures the best memory bandwidth in megabytes per second:
– COPY: a(i) = b(i)
– SCALE: a(i) = q*b(i)
– SUM: a(i) = b(i) + c(i)
– TRIAD: a(i) = b(i) + q*c(i)
• The general rule for STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.
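
For reference, the COPY kernel as a small Python sketch; timing an interpreted array copy only illustrates the bookkeeping (bytes moved per second), not a real STREAM measurement, and N here is far below the sizing rule just quoted.

```python
import array
import time

N, q = 1_000_000, 3.0
a = array.array("d", bytes(8 * N))   # 8-byte doubles, zero-initialized
b = array.array("d", [1.0]) * N      # source arrays (b and c; q is the
c = array.array("d", [2.0]) * N      # scalar for SCALE and TRIAD below)

t0 = time.perf_counter()
a[:] = b[:]                          # COPY: a(i) = b(i)
t1 = time.perf_counter()

mbytes = 2 * 8 * N / 1e6             # one 8-byte read + one 8-byte write per element
print(f"COPY: {mbytes / (t1 - t0):.0f} MB/s")

# The remaining kernels, element-wise:
#   SCALE: a[i] = q * b[i]
#   SUM:   a[i] = b[i] + c[i]
#   TRIAD: a[i] = b[i] + q * c[i]
```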

92

STREAM Copy Bandwidth

(Chart: memory copy bandwidth in MB/sec, 0 to 3000, for 1 to 4 CPUs: Compaq ES40 / 21264 667 MHz, Compaq ES40 / 21264 500 MHz, SUN Ultra Enterprise 6000, and AlphaServer 4100 5/600. The ES40 approaches 3 GB/sec; labeled points include 1197 and 2547 MB/sec.)

93

Results

• Five times the memory bandwidth
• Microprocessor enhancements
• Compiler enhancements
• The Compaq ES40 provides 2 to 3 times the performance of the AlphaServer 4100

94

Comments

• The performance gains come from adopting a crossbar for the interconnect.
• Interesting as a reference on the behavior of the Alpha 21264.
• Interesting as a detailed performance evaluation of a real machine.
• The ideas themselves are not new.

95

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

Luiz Andre Barroso et al.

Compaq Computer Corporation

ISCA-2000 p.282-293

Session 7 – Extracting Parallelism

96

Motivation

• Complex processors
– higher development cost
– longer design times
• On-line transaction processing (OLTP)
– little instruction-level parallelism
– thread-level and process-level parallelism
• Semiconductor integration density
• Chip multiprocessing (CMP)

97

Goals

• Piranha: a research prototype at Compaq
– targeted at parallel commercial workloads
– chip multiprocessing architecture
– small team, modest investment, short design time
• General-purpose microprocessors or Piranha?

98

Single-chip Piranha processing node

(Figure: eight CPUs, CPU0-CPU7, each with iL1 and dL1 caches, connect through the intra-chip switch to eight L2 banks (L2_0-L2_7) and eight memory controllers (MC0-MC7) driving Direct Rambus RDRAM arrays; a home engine, remote engine, and system control sit on the switch, and a packet switch with input/output queues feeds a router with interconnect links.)

99

Alpha CPU Core and L1 Caches

• CPU: single-issue, in-order, 500 MHz datapath
• iL1, dL1: 64KB, two-way set-associative
– single-cycle latency
• TLB: 256 entries, 4-way set-associative
• 2-bit state field per cache line for the MESI protocol

100

Intra-Chip Switch (ICS)

• The ICS manages 27 clients.
• Conceptually, the ICS is a crossbar.
• Eight internal datapaths
• Capacity of 32 GB/sec
– 500 MHz, 64-bit (8-byte) buses, 8 datapaths
– 500 MHz x 8 bytes x 8 = 32 GB/sec
– 3 times the available memory bandwidth

101

Second-Level Cache

• L2: 1MB unified instruction/data cache
– physically partitioned into eight banks
• Each bank is 8-way set-associative.
• Non-inclusive on-chip cache hierarchy
– keeps a duplicate copy of the L1 tags and state
• The L2 controllers are responsible for coherence within a chip.

102

Memory Controller

• Eight memory controllers
• Each RDRAM channel provides a maximum of 1.6 GB/sec.
• Maximum local memory bandwidth of 12.8 GB/sec

103

Single-chip Piranha versus out-of-order processor

Normalized execution time (OOO = 100; smaller is faster):

Config   Clock     Issue     OLTP   DSS (TPC-D Query 6)
P1       500 MHz   1-issue    233    350
INO      1 GHz     1-issue    145    191
OOO      1 GHz     4-issue    100    100
P8       500 MHz   1-issue     34     44

(P1 and P8 are Piranha with 1 and 8 CPUs; INO and OOO are in-order and out-of-order uniprocessors.)

104

Example System Configuration

(Figure: a Piranha system with six processing chips (8 CPUs each) and two I/O chips connected by interconnect links.)

105

Novelty

• Detailed evaluation of database workloads in the context of CMP
• Simple processors, standard ASIC design
– short design time, small team, modest investment

106

Comments

• Transaction processing offers little instruction-level parallelism to exploit.
• Does that make CMP the stronger approach?
• Is its widespread adoption only a matter of time?
