ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ
ΗΜΥ 656: Advanced Computer Architecture
Spring Semester 2007
Lecture 5: Chip Multiprocessors – The New Era in Processor Architectures
Acknowledgements:
Wen-Mei Hwu, Kunle Olukotun, Shih-Hao Hung, Dezső Sima and the Stanford Hydra Group.
Microarchitecture: Overview
[Figure labels: Instruction Supply, Execution Mechanism, Data Supply]
Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow
Microarchitecture, 1990
• Short pipelines
• On-chip I and D caches, blocking
• Simple prediction
Microarchitecture, 2000
• Mechanisms to find parallel instructions
  – dynamic scheduling
  – static scheduling
• On-chip cache hierarchies, with non-blocking, higher-bandwidth caches
• Sophisticated branch prediction
Future Microarchitecture: One Perspective
[Figure labels: Instruction Supply, Execution Mechanism, Data Supply]
Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow
Where are we headed?
• More ILP: even wider, deeper
  – enabling technology: speculation, predication, compiler transformations, binary re-optimization, complexity-effective design
• Multithreading
  – enabling technology: speculation, subordinate threads, discovery of thread-level parallelism
• Chip Multiprocessors
  – enabling technology: speculation, discovery of thread-level, coarse-grained parallelism
More ILP
• Instruction Supply
  – Branches, cache misses, partial fetches
• Data Supply
  – Higher bandwidth, lower latency, memory ordering, non-blocking caches
• Execution
  – Reduction of redundant work, design complexity and partitioning
• Tolerating Latency
  – Can some things just take a long time?
Multithreading [Burton Smith, 1978]
[Figure: pipeline stages Fetch, Execute, and Write Back]
This is a snapshot of the pipeline during a single cycle. Each color represents instructions from a different thread.
B. Smith’s original concept was for a single-wide pipeline, but it extends naturally to a multiple-issue pipeline.
Simultaneous Multithreading [W. Yamamoto, 1994 / D. Tullsen, 1995]
[Figure: pipeline stages Fetch, Execute, and Write Back]
Simultaneous Multithreading, possible implementation
[Figure labels: Front End, Back End]
• Intel Hyperthreading in Pentium 4 [HotChips’14] is the first realization, with two threads
• Small ISA register file minimizes the effect of replication
• Replicated retirement logic
• Minimal hardware overhead but a major increase in verification cost
Chip Multiprocessor [K. Olukotun, 1996]
[Figure: pipeline stages Fetch, Execute, and Write Back; processors Proc A, Proc B, Proc C, and Proc D around a shared L2 cache]
A single processor die contains multiple CPUs, all of which share some resources, such as an L2 cache and chip pins.
Hardware Accelerators
Existing Solutions…
• Intel IXP1200 Network Processor
• Philips Nexperia (Viper)
• IBM Cell … what’s next? …
[Figure block labels: ARM, microengines, access ctl., MIPS, MPEG, VLIW, video, MSP]
Discussion/Thought Exercise
• What are the essential differences between the SMT model of execution and the CMP model?
  – What resources are shared, and in what manner?
  – What type of data movement exists in one but not the other?
  – What types of applications/situations are the best case for each model?
The Advent of Superscalar Architecture
Transistor densities increased at a stunning pace.
Is there a way to use those transistors to increase computing performance?
Put more than one ALU on a chip.
The IBM RS/6000, released in 1990, is often cited as the world's first superscalar CPU.
Most general-purpose CPUs developed since about 1998 are superscalar.
Technology ↔ Architecture
• Transistors are cheap, plentiful and fast
  – Moore’s law
  – 100 million transistors by 2000
• Wires are cheap, plentiful and slow
  – Wires get slower relative to transistors
  – Long cross-chip wires are especially slow
• Architectural implications
  – Plenty of room for innovation
  – Single-cycle communication requires localized blocks of logic
  – High communication bandwidth across the chip is easier to achieve than low latency
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size in instructions, on an axis from 1 to 1M]
Future Processors to use Coarse-Grain Parallelism
• Today's microprocessors exploit instruction-level parallelism through a deep instruction pipeline and through superscalar or VLIW multiple-issue techniques
• Today's (2001) technology: approx. 40 M transistors per chip; in the future (2012): 1.4 G transistors per chip
What next?
• Two directions:
  – Increase single-thread performance → use more speculative instruction-level parallelism
  – Increase multi-thread (multi-task) performance → use thread-level parallelism in addition to instruction-level parallelism
A "thread" in this lecture means a "HW thread", which can be a SW (POSIX) thread, a process, ...
• Far future (??): increase single-thread performance by using speculative instruction-level and thread-level parallelism
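As a rough worked example of the transistor figures quoted on this slide, the sketch below computes the growth factor and the doubling time they imply. The function name `doubling_time_years` is illustrative, and the calculation assumes steady exponential growth between the two data points, which the slide itself does not state.

```python
import math

def doubling_time_years(count_start, count_end, years):
    """Years per doubling implied by growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years / doublings

# Figures quoted on the slide: 40 M transistors (2001), 1.4 G transistors (2012).
growth_factor = 1.4e9 / 40e6
period = doubling_time_years(40e6, 1.4e9, 2012 - 2001)

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied doubling time: {period:.2f} years")
```

Under that assumption, the two data points imply a doubling time of a little over two years.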
Advanced Superscalar Processors for Billion Transistor Chips in Year 2005 - Characteristics
• Aggressive speculation, such as a very aggressive dynamic branch predictor,
• a large trace cache,
• very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
• a large number of reservation stations to accommodate 2,000 instructions,
• 24 to 48 highly optimized, pipelined functional units,
• sufficient on-chip data cache, and
• sufficient resolution and forwarding logic.
  – see: Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, Jaret Stark: One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, September 1997, pp. 51-57.
Requirements and Solutions
• Delivering optimal instruction bandwidth requires:
  – a minimal number of empty fetch cycles,
  – a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle,
  – and a minimal number of cycles in which the instructions fetched are subsequently discarded.
• Consuming this instruction bandwidth requires:
  – sufficient data supply,
  – and sufficient processing resources to handle the instruction bandwidth.
• Suggestions:
  – an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses),
  – a large trace cache for providing a logically contiguous instruction stream,
  – an aggressive multi-hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling.
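To make the idea of a hybrid branch predictor more concrete, here is a minimal sketch with two component predictors and a chooser. It is not the multi-hybrid design from the cited paper: the class names, table sizes, history length, and the use of 2-bit saturating counters are all assumptions chosen for illustration.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters (0..3); values >= 2 mean 'taken'."""

    def __init__(self, size):
        self.size = size
        self.counters = [1] * size  # start in the weakly not-taken state

    def predict(self, index):
        return self.counters[index % self.size] >= 2

    def update(self, index, taken):
        i = index % self.size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class HybridPredictor:
    """Two component predictors plus a chooser that learns which to trust."""

    def __init__(self, size=64, history_bits=4):
        self.by_address = TwoBitCounterTable(size)  # indexed by branch address
        self.by_history = TwoBitCounterTable(size)  # indexed by recent outcomes
        self.chooser = TwoBitCounterTable(size)     # >= 2 means trust by_history
        self.history = 0
        self.history_mask = (1 << history_bits) - 1

    def predict_and_update(self, pc, taken):
        """Predict one branch, train all tables, return True if correct."""
        p_addr = self.by_address.predict(pc)
        p_hist = self.by_history.predict(self.history)
        prediction = p_hist if self.chooser.predict(pc) else p_addr
        # Train the chooser only when the two components disagree.
        if p_addr != p_hist:
            self.chooser.update(pc, p_hist == taken)
        self.by_address.update(pc, taken)
        self.by_history.update(self.history, taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
        return prediction == taken


def accuracy(outcomes, pc=0x40):
    """Fraction of correctly predicted outcomes for one branch at address pc."""
    predictor = HybridPredictor()
    hits = sum(predictor.predict_and_update(pc, t) for t in outcomes)
    return hits / len(outcomes)


print(accuracy([True] * 100))         # always-taken branch
print(accuracy([True, False] * 50))   # strictly alternating branch
```

In this sketch a strictly alternating branch defeats the address-indexed counters, while the history-indexed table learns the pattern after a short warm-up and the chooser shifts to it. That is the motivation for tuning separate predictors to different classes of branches.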
Future Processors to use Coarse-Grain Parallelism
• Chip multiprocessors (CMPs) or multiprocessor chips
  – integrate two or more complete processors on a single chip,
  – every functional unit of a processor is duplicated
• Simultaneous multithreaded processors (SMTs)
  – store multiple contexts in different register sets on the chip,
  – the functional units are multiplexed between the threads,
  – instructions of different contexts are simultaneously executed
Chip Multiprocessors (CMPs)
Principal Chip Multiprocessor Alternatives
• symmetric multiprocessor (SMP),
• distributed shared memory multiprocessor (DSM),
• message-passing shared-nothing multiprocessor.
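Organizational principles of multiprocessors
[Figure: three organizations, classified by memory placement and address space]
• Symmetric multiprocessor (SMP): processors connected through an interconnection to a shared memory; global memory, shared address space
• Distributed-shared-memory multiprocessor (DSM): processors, each with local memory, connected through an interconnection; physically distributed memory, shared address space
• Message-passing (shared-nothing) multiprocessor: processors, each with local memory, communicating by send/receive over an interconnection; physically distributed memory, distributed address spaces
• The combination of global memory and distributed address spaces is marked empty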
Typical SMP
[Figure: four processors, each with its own primary cache and secondary cache, connected by a bus to global memory]
Shared memory candidates for CMPs
[Figure, shared main memory: four processors, each with its own primary cache and secondary cache, sharing global memory]
[Figure, shared secondary cache: four processors, each with its own primary cache, sharing one secondary cache in front of global memory]
Shared main memory and shared secondary cache
Shared memory candidates for CMPs
[Figure, shared primary cache: four processors sharing one primary cache, backed by a secondary cache and global memory]
Shared primary cache
Grain-levels for CMPs
• multiple processes in parallel
• multiple threads from a single application ⇒ implies a common address space for all threads
• extracting threads of control dynamically from a single instruction stream ⇒ see last lecture, multiscalar, trace processors, ...
Hydra: A Single-Chip Multiprocessor
[Figure: a single chip containing CPU 0 to CPU 3, each with a primary I-cache, a primary D-cache, and its own memory controller; centralized bus arbitration mechanisms; an on-chip secondary cache; a Rambus memory interface, an off-chip L3 interface, and an I/O bus interface with DMA. Off chip: DRAM main memory, cache SRAM array, I/O device]
Multithreaded Processors
• Aim: latency tolerance
• What is the problem?
• Load access latencies measured on an Alpha Server 4100 SMP with four 300 MHz Alpha 21164 processors are:
  – 7 cycles for a primary cache miss that hits in the on-chip L2 cache of the 21164 processor,
  – 21 cycles for an L2 cache miss that hits in the L3 (board-level) cache,
  – 80 cycles for a miss that is served by the memory, and
  – 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
• Multithreaded processors are able to bridge latencies by switching to another thread of control, in contrast to chip multiprocessors.
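To show what bridging these latencies can mean in numbers, the sketch below uses a simple textbook-style model: each thread runs for a fixed number of cycles, then waits for a load, and context switches are free. The run length of 20 cycles and the function names are assumptions for illustration. Only the latency values come from the slide.

```python
import math

# Latencies in cycles as quoted on the slide (Alpha Server 4100, Alpha 21164).
LATENCY_CYCLES = {"L2 hit": 7, "L3 hit": 21, "memory": 80, "dirty miss": 125}

def utilization(threads, run_length, latency):
    """Fraction of cycles doing useful work, ignoring context-switch cost."""
    return min(1.0, threads * run_length / (run_length + latency))

def threads_to_hide(run_length, latency):
    """Smallest thread count for which utilization reaches 1.0."""
    return math.ceil((run_length + latency) / run_length)

RUN_LENGTH = 20  # assumed: cycles a thread runs between long-latency loads

for name, latency in LATENCY_CYCLES.items():
    print(name,
          "single-thread utilization:", round(utilization(1, RUN_LENGTH, latency), 2),
          "threads needed:", threads_to_hide(RUN_LENGTH, latency))
```

With these assumed numbers, a single thread that misses to memory every 20 cycles keeps the pipeline busy only a fifth of the time, and five such threads are enough to cover the 80-cycle latency in this model.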
Multithreaded Processors
• Multithreading:
  – Provide several program counter registers (and usually several register sets) on chip
  – Fast context switching by switching to another thread of control
[Figure: Thread 1 to Thread 4, each with its own register set (1 to 4) and its own PC and PSR (1 to 4); one label reads FP]
Approaches of Multithreaded Processors
• Cycle-by-cycle interleaving
  – An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block interleaving
  – The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
• Simultaneous multithreading
  – Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  – Combines wide-issue superscalar instruction issue with multithreading.
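The first two approaches can be sketched as issue schedules for a scalar pipeline, one thread id per cycle. The function names and the way latency events are encoded as `(thread, cycle)` pairs are assumptions made for this illustration.

```python
def cycle_by_cycle(num_threads, cycles):
    """Issue from a different thread every cycle, round-robin."""
    return [cycle % num_threads for cycle in range(cycles)]

def block_interleaving(num_threads, cycles, latency_events):
    """Stay on one thread until it hits a latency event, then switch.

    latency_events is a set of (thread, cycle) pairs marking the cycle at
    which that thread issues an instruction that may cause latency.
    """
    schedule, current = [], 0
    for cycle in range(cycles):
        schedule.append(current)
        if (current, cycle) in latency_events:
            current = (current + 1) % num_threads  # context switch
    return schedule

print(cycle_by_cycle(3, 6))                          # [0, 1, 2, 0, 1, 2]
print(block_interleaving(3, 6, {(0, 1), (1, 3)}))    # [0, 0, 1, 1, 2, 2]
```

Simultaneous multithreading differs from both in that several threads can issue in the same cycle, so it cannot be written as one thread id per cycle.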
Comparison of Multithreading with Non-Multithreading Approaches:
(a) single-threaded scalar
(b) cycle-by-cycle interleaving multithreaded scalar
(c) block interleaving multithreaded scalar
[Figure: panels (a) to (c) show instruction issue over time (processor cycles); context switches are marked in panels (b) and (c)]
Comparison of Multithreading with Non-Multithreading Approaches:
(a) superscalar
(b) VLIW
(c) cycle-by-cycle interleaving
(d) cycle-by-cycle interleaving VLIW
[Figure: panels (a) to (d) show issue slots over time (processor cycles); some slots are marked N, and context switches are marked in panels (c) and (d)]
Comparison of Multithreading with Non-Multithreading:
simultaneous multithreading (SMT) and chip multiprocessor (CMP)
[Figure: issue slots over time (processor cycles) for SMT and CMP]
Cycle-by-Cycle Interleaving
• the processor switches to a different thread after each instruction fetch
• pipeline hazards cannot arise, and the processor pipeline can be easily built without the necessity of complex forwarding paths
• context-switching overhead is zero cycles
• memory latency is tolerated by not scheduling a thread until the memory transaction has completed
• requires at least as many threads as pipeline stages in the processor
• degrades single-thread performance if not enough threads are present
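The requirement of at least as many threads as pipeline stages can be illustrated with a small scheduling sketch. It assumes that each thread may have at most one instruction in the pipeline at a time, which is one way to rule out hazards without forwarding paths. The function names are hypothetical.

```python
def issue_schedule(num_threads, pipeline_stages, cycles):
    """Cycle-by-cycle interleaving, at most one instruction per thread in flight.

    Returns one entry per cycle: the thread that issues, or None for a bubble.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    schedule, next_thread = [], 0
    for cycle in range(cycles):
        issued = None
        for offset in range(num_threads):
            thread = (next_thread + offset) % num_threads
            if ready_at[thread] <= cycle:
                issued = thread
                # The thread's previous instruction must leave the pipeline first.
                ready_at[thread] = cycle + pipeline_stages
                next_thread = (thread + 1) % num_threads
                break
        schedule.append(issued)
    return schedule

def utilization(schedule):
    """Fraction of cycles in which some thread issued an instruction."""
    return sum(slot is not None for slot in schedule) / len(schedule)

print(issue_schedule(4, 4, 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
print(issue_schedule(2, 4, 8))  # [0, 1, None, None, 0, 1, None, None]
```

With four threads on a four-stage pipeline every cycle is used. With only two threads, half of the issue slots are bubbles in this model, and a single thread would issue only once every four cycles.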
Cycle-by-Cycle Interleaving - Improving single-thread performance
• The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  – The scheduler feeds instructions of the same thread that are neither data dependent nor control dependent successively into the pipeline.
• The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.
Single-core computer
Single-core CPU chip
[Figure label: the single core]
Multi-core architectures
• This lecture is about a new trend in computer architecture: replicate multiple processor cores on a single die.
[Figure: multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Figure: core 1, core 2, core 3, core 4]
The cores run in parallel
[Figure: core 1 to core 4, running thread 1 to thread 4 respectively]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: core 1 to core 4, each running several threads]
Interaction with OS
• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major OSes support multi-core today
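A program can ask the OS how many processors it sees. The sketch below uses Python's standard `os` module; the two helper names are illustrative. Note that the count is of logical processors, so on an SMT-capable chip it may exceed the number of physical cores.

```python
import os

def logical_processors():
    """Number of logical processors the OS reports."""
    return os.cpu_count() or 1  # os.cpu_count() may return None

def usable_processors():
    """Processors this process is allowed to run on, where the OS exposes it."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux, not everywhere
        return len(os.sched_getaffinity(0))
    return logical_processors()

print("logical processors seen by the OS:", logical_processors())
print("processors usable by this process:", usable_processors())
```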
Why multi-core?
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed of light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and physics in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
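The game example can be sketched with one thread per subsystem. Everything here is a placeholder: the subsystem names come from the slide, but the function `run_subsystems` and the counting workload are assumptions. In standard CPython builds the global interpreter lock limits how much of this work runs truly in parallel, so the sketch shows the program structure rather than a speedup.

```python
import threading

def run_subsystems(frames):
    """Run three hypothetical game subsystems as separate threads."""
    results = {}

    def subsystem(name):
        # Placeholder work: count the frames this subsystem processed.
        done = 0
        for _ in range(frames):
            done += 1
        results[name] = done  # each thread writes only its own key

    threads = [threading.Thread(target=subsystem, args=(name,))
               for name in ("ai", "graphics", "physics")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all three subsystems to finish
    return results

print(run_subsystems(1000))
```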
General context: Multiprocessors
• A multiprocessor is any computer with several processors
• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data
[Figure: Lemieux cluster, Pittsburgh Supercomputing Center]
Multiprocessor memory types
• Shared memory: in this model, there is one (large) common shared memory for all processors
• Distributed memory: in this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
A multi-core processor is a special kind of multiprocessor: all processors are on the same chip
• Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
• Multi-core is a shared-memory multiprocessor: all cores share the same memory
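A minimal sketch of this model: several workers operate on different parts of one array that they all share. The function name `parallel_square` and the slicing scheme are assumptions for illustration, and Python threads stand in for cores here.

```python
import threading

def parallel_square(values, num_workers=4):
    """Each worker squares its own slice of one shared list, in place."""
    data = list(values)                   # one array shared by all workers
    chunk = -(-len(data) // num_workers)  # ceiling division

    def worker(start):
        # Disjoint index ranges, so no two workers touch the same element.
        for i in range(start, min(start + chunk, len(data))):
            data[i] = data[i] * data[i]

    threads = [threading.Thread(target=worker, args=(w * chunk,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data

print(parallel_square(range(10)))
```

Because the workers write to disjoint parts of the shared list, no locking is needed in this sketch. Workers that update the same location would need synchronization.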
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
Each can run on its own core
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize
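One common way to quantify this caveat is Amdahl's law, which the slides do not mention by name. The sketch below applies it under the usual assumption that a fixed fraction of the work parallelizes perfectly and the rest stays serial. The function name and the example fractions are illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for fraction in (0.5, 0.95):
    print(f"parallel fraction {fraction}: "
          f"{amdahl_speedup(fraction, 4):.2f}x on 4 cores")
```

For a program that is only half parallelizable, four cores give a speedup of 1.6 in this model, well short of 4.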
A technique complementary to multi-core: Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating point (or integer) operation
  – Waiting for data to arrive from memory
Other execution units wait unused
[Figure (source: Intel): pipeline block diagram with BTB and I-TLB, Decoder, Trace Cache, uCode ROM, BTB, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, L2 Cache and Control, Bus]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core
• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
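The example above can be turned into a toy cycle count. The sketch assumes a core with exactly one integer unit and one floating-point unit, single-cycle operations, and two threads described only by the unit each operation needs. The function names and this encoding are assumptions, not a model of any real processor.

```python
def run_without_smt(thread_a, thread_b):
    """One thread at a time: total cycles is the sum of both op streams."""
    return len(thread_a) + len(thread_b)

def run_with_smt(thread_a, thread_b):
    """Each cycle, both threads may issue if they need different units.

    Ops are the strings "int" or "fp"; the core is assumed to have exactly
    one integer unit and one floating-point unit, each taking one cycle.
    """
    a, b = list(thread_a), list(thread_b)
    cycles = 0
    prefer_a = True  # alternate priority so neither thread starves
    while a or b:
        first, second = (a, b) if prefer_a else (b, a)
        if first and second and first[0] != second[0]:
            first.pop(0)
            second.pop(0)   # different units: both issue this cycle
        elif first:
            first.pop(0)    # same unit, or the other thread is done
        else:
            second.pop(0)
        cycles += 1
        prefer_a = not prefer_a
    return cycles

fp_thread = ["fp"] * 4
int_thread = ["int"] * 4
print(run_without_smt(fp_thread, int_thread))  # 8
print(run_with_smt(fp_thread, int_thread))     # 4
print(run_with_smt(int_thread, int_thread))    # 8
```

When the two threads need different units, they overlap completely in this model. When both need the single integer unit, SMT gives no gain.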
[Figure: pipeline block diagram (BTB and I-TLB through L2 Cache and Control, Bus) highlighting Thread 1: floating point]
Without SMT, only a single thread can run at any given time
Without SMT, only a single thread can run at any given time
[Figure: single-core pipeline block diagram]
Thread 2: integer operation
![Page 57: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/57.jpg)
SMT processor: both threads can run concurrently
[Figure: single-core pipeline block diagram]
Thread 1: floating point
Thread 2: integer operation
![Page 58: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/58.jpg)
But: Can’t simultaneously use the same functional unit
[Figure: single-core pipeline block diagram]
Thread 1, Thread 2
IMPOSSIBLE: this scenario is impossible with SMT on a single core (assuming a single integer unit)
![Page 59: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/59.jpg)
SMT not a “true” parallel processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
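A toy model can make the "single copy of each resource" rule concrete. The sketch below assumes a core with exactly one integer unit and one floating point unit, and describes each thread as a list of per-instruction unit requests. The function name `smt_issue`, the fixed thread priority, and the trace format are illustrative assumptions, not something taken from the slides.

```python
def smt_issue(threads, units=("int", "fp")):
    """Toy SMT issue model: one copy of each functional unit per core.

    threads: list of lists; threads[t] is the ordered sequence of units
             ("int" or "fp") that thread t needs, one per instruction.
    Returns the number of cycles until every thread has finished.
    """
    pending = [list(t) for t in threads]   # copy so the input is not mutated
    cycles = 0
    while any(pending):
        free = set(units)                  # every unit is free at cycle start
        for ops in pending:                # fixed priority: lower thread id first
            if ops and ops[0] in free:
                free.remove(ops[0])        # the unit is taken for this cycle
                ops.pop(0)                 # the thread issues its next instruction
        cycles += 1
    return cycles


fp_thread = ["fp"] * 4
int_thread = ["int"] * 4
mixed_cycles = smt_issue([fp_thread, int_thread])    # threads use different units
clash_cycles = smt_issue([int_thread, int_thread])   # threads compete for one unit
print(mixed_cycles, clash_cycles)
```

In this model the floating point and integer threads overlap completely, while two integer threads take as long as running them back to back, which is the contrast the preceding slides draw.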
![Page 60: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/60.jpg)
Multi-core: threads can run on separate cores
[Figure: two pipeline block diagrams, one per core]
Thread 1 Thread 3
![Page 61: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/61.jpg)
[Figure: two pipeline block diagrams, one per core]
Thread 2 Thread 4
Multi-core: threads can run on separate cores
![Page 62: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/62.jpg)
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
![Page 63: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/63.jpg)
SMT Dual-core: all four threads can run concurrently
[Figure: two pipeline block diagrams, one per core]
Thread 1 Thread 2 Thread 3 Thread 4
![Page 64: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/64.jpg)
Comparison: multi-core vs SMT
• Multi-core:
– Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
– However, great with thread-level parallelism
• SMT:
– Can have one large and fast superscalar core
– Great performance on a single thread
– Mostly still only exploits instruction-level parallelism
![Page 65: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/65.jpg)
Concepts – Multithreading
• Data Access Latency
– Cache misses (L1, L2)
– Memory latency (remote, local)
– Often unpredictable
• Multithreading (MT)
– Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work.
![Page 66: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/66.jpg)
Why Multithreading Today?
• ILP is exhausted, TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.
![Page 67: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/67.jpg)
Classical Problem, 1960s & 1970s
• I/O latency prompted multitasking
• IBM mainframes
• Multitasking
• I/O processors
• Caches within disk controllers
![Page 68: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/68.jpg)
Requirements of Multithreading
• Storage is needed to hold each context’s PC, registers, status word, etc.
• Coordination to match an event with a saved context
• A way to switch contexts
• Long latency operations must use resources not in use
![Page 69: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/69.jpg)
Processor Utilization vs. Latency
R = the run length to a long latency event
L = the amount of latency
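The plot from this slide is not reproduced in the transcript. A commonly used first-order model, stated here as an assumption rather than as the slide's own formula, is that a single context keeps the processor busy for R out of every R + L cycles, and that N identical contexts with zero-cost switching raise utilization to N·R/(R + L) until it saturates at 1. The function name and the sample values below are illustrative.

```python
def utilization(R, L, n_contexts=1):
    """First-order processor utilization with n_contexts hardware contexts.

    R: run length (cycles of useful work) between long-latency events
    L: latency (cycles) of each long-latency event
    Assumes context switches are free and every context behaves the same.
    """
    if R <= 0 or L < 0 or n_contexts < 1:
        raise ValueError("need R > 0, L >= 0, n_contexts >= 1")
    return min(1.0, n_contexts * R / (R + L))


# Utilization for a short run length and a long latency, as contexts are added.
for n in (1, 2, 4, 8):
    print(n, utilization(R=20, L=80, n_contexts=n))
```

Under these assumptions, utilization grows linearly with the number of contexts and then flattens once the other contexts can cover the whole latency.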
![Page 70: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/70.jpg)
Problem of the 1980s
• Problem was revisited due to the advent of graphics workstations
– Xerox Alto, TI Explorer
– Concurrent processes are interleaved to allow the workstations to be more responsive.
– These processes could drive or monitor display, input, file system, network, user processing
– Process switch was slow, so the subsystems were microprogrammed to support multiple contexts
![Page 71: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/71.jpg)
Scalable Multiprocessor (1990s)
• Dance hall: a shared interconnect with memory on one side and processors on the other.
• Or processors may have local memory
![Page 72: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/72.jpg)
How do the processors communicate?
• Shared Memory
– Potential long latency on every load
– Cache coherency becomes an issue
– Examples include NYU’s Ultracomputer, IBM’s RP3, BBN’s Butterfly, MIT’s Alewife, and later Stanford’s Dash.
– Synchronization occurs through shared variables, locks, flags, and semaphores.
• Message Passing
– The programmer deals with latency explicitly. This makes it possible to minimize the number of messages while maximizing their size, and to minimize delay by sending a message so that it reaches the receiver at the time the receiver expects it.
– Examples include Intel’s iPSC and Paragon, Caltech’s Cosmic Cube, and Thinking Machines’ CM-5
– Synchronization occurs through send and receive
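The two synchronization styles can be contrasted in a few lines. This is a minimal sketch using Python threads on one machine, not code for any of the systems named above: the shared-memory half guards a shared counter with a lock, and the message-passing half uses queues as stand-ins for send and receive. All names (`add_shared`, `add_by_message`, the `None` end marker) are hypothetical.

```python
import queue
import threading

# Shared memory: threads update one variable and synchronize with a lock.
counter = 0
counter_lock = threading.Lock()

def add_shared(n):
    global counter
    for _ in range(n):
        with counter_lock:          # the lock serializes the read-modify-write
            counter += 1

# Message passing: the worker owns its data and synchronizes via send/receive.
def add_by_message(inbox, outbox):
    total = 0
    while True:
        msg = inbox.get()           # receive blocks until a message arrives
        if msg is None:             # None is this sketch's "no more work" marker
            break
        total += msg
    outbox.put(total)               # send the result back

workers = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

inbox, outbox = queue.Queue(), queue.Queue()
receiver = threading.Thread(target=add_by_message, args=(inbox, outbox))
receiver.start()
inbox.put(4000)                     # one large message instead of 4000 small ones
inbox.put(None)
receiver.join()
message_total = outbox.get()

print(counter, message_total)
```

The single large message mirrors the point about minimizing the number of messages while maximizing their size.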
![Page 73: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/73.jpg)
Cycle-by-Cycle Interleaved Multithreading
• Denelcor HEP1 (1982), HEP2
• Horizon, which was never built
• Tera, MTA
![Page 74: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/74.jpg)
Cycle-by-Cycle Interleaved Multithreading
• Features
– An instruction from a different context is launched at each clock cycle
– No interlocks or bypasses thanks to a non-blocking pipeline
• Optimizations:
– Leaving context state in the processor (PC, register #, status)
– Assigning tags to remote requests and then matching them on completion
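The issue discipline can be sketched as a small simulation. The model below is an illustration under simplifying assumptions, not a description of the HEP or Tera hardware: every instruction takes `latency` cycles, a context may not issue again until its previous instruction has completed, and one ready context is picked per cycle in FIFO order. The function name `interleave` and its arguments are hypothetical.

```python
from collections import deque

def interleave(contexts, latency):
    """Toy cycle-by-cycle interleaving of several contexts.

    contexts: dict mapping a context name to its number of instructions
    latency:  cycles before the same context may issue again
    Returns the issue trace, one entry per cycle (None marks an idle cycle).
    """
    remaining = dict(contexts)
    ready = deque(name for name, n in contexts.items() if n > 0)
    waiting = []                            # (ready_at_cycle, name) pairs
    trace = []
    cycle = 0
    while ready or waiting:
        # Contexts whose previous instruction has completed become ready again.
        for item in [w for w in waiting if w[0] <= cycle]:
            waiting.remove(item)
            ready.append(item[1])
        if ready:
            name = ready.popleft()
            trace.append(name)
            remaining[name] -= 1
            if remaining[name] > 0:
                waiting.append((cycle + latency, name))
        else:
            trace.append(None)              # no context is ready: pipeline bubble
        cycle += 1
    return trace


full = interleave({"A": 2, "B": 2, "C": 2, "D": 2}, latency=4)
alone = interleave({"A": 3}, latency=4)
print(full)
print(alone)
```

With as many contexts as cycles of latency, every cycle issues an instruction; with a single context, most cycles are idle, which is the single-thread weakness of this scheme.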
![Page 75: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/75.jpg)
Challenges with this approach
• I-Cache:
– Instruction bandwidth
– I-Cache misses: since instructions are fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
• Register file access time:
– Register file access time increases because the register file must grow significantly to accommodate many separate contexts.
– In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.
• Single thread performance
– Single thread performance is significantly degraded since the context is forced to switch to a new thread even if none are available.
• Very high bandwidth network, which is fast and wide
• Retries on load empty or store full
![Page 76: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/76.jpg)
Improving Single Thread Performance
• Do more operations per instruction (VLIW)
• Allow multiple instructions to issue into the pipeline from each context.
– This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution.
– For Horizon & Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context if detected.
• Switch on load
• Switch on miss
– Switching on load or miss will increase the context switch time.
![Page 77: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/77.jpg)
Simultaneous Multithreading (SMT)
• Tullsen et al. (U. of Washington), ISCA ’95
• A way to utilize the pipeline with increased parallelism from multiple threads.
![Page 78: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/78.jpg)
Simultaneous Multithreading
![Page 79: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/79.jpg)
SMT Architecture
• Straightforward extension to conventional superscalar design:
– multiple program counters and some mechanism by which the fetch unit selects one each cycle,
– a separate return stack for each thread for predicting subroutine return destinations,
– per-thread instruction retirement, instruction queue flush, and trap mechanisms,
– a thread id with each branch target buffer entry to avoid predicting phantom branches, and
– a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
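The thread-id tag on branch target buffer entries is easy to picture in code. The class below is a deliberately simplified sketch: a real BTB is a fixed-size, set-associative hardware table, whereas this one is an unbounded dictionary, and the class and method names are hypothetical.

```python
class TaggedBTB:
    """Toy branch target buffer whose entries carry a thread id."""

    def __init__(self):
        self._entries = {}                       # (thread_id, pc) -> target

    def update(self, thread_id, pc, target):
        self._entries[(thread_id, pc)] = target  # record a taken branch

    def predict(self, thread_id, pc):
        # A hit requires both the PC and the thread id to match, so one
        # thread never sees a "phantom" branch recorded by another thread.
        return self._entries.get((thread_id, pc))


btb = TaggedBTB()
btb.update(thread_id=0, pc=0x400, target=0x480)
own_prediction = btb.predict(thread_id=0, pc=0x400)    # thread that trained it
other_prediction = btb.predict(thread_id=1, pc=0x400)  # same PC, other thread
print(own_prediction, other_prediction)
```

Without the tag, the second lookup would return a target that belongs to a different program, which is the phantom branch the slide refers to.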
![Page 80: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/80.jpg)
SMT Performance
Tullsen ’96
![Page 81: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/81.jpg)
Commercial Machines w/ MT Support
• Intel Hyper-Threading (HT)
– Dual threads
– Pentium 4, Xeon
• Sun CoolThreads
– UltraSPARC T1
– 4 threads per core
• IBM
– POWER5
![Page 82: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/82.jpg)
IBM Power5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
![Page 83: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/83.jpg)
IBM Power5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
![Page 84: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/84.jpg)
SMT Summary
• Pros:
– Increased throughput w/o adding much cost
– Fast response for multitasking environment
• Cons:
– Slower single processor performance
![Page 85: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/85.jpg)
Multicore
• Multiple processor cores on a chip
– Chip multiprocessor (CMP)
– Sun’s Chip Multithreading (CMT)
  • UltraSPARC T1 (Niagara)
– Intel’s Pentium D
– AMD dual-core Opteron
• Also a way to utilize TLP, but
– 2 cores means 2X cost
– No good for single thread performance
• Can be used together with SMT
![Page 86: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/86.jpg)
Chip Multithreading (CMT)
![Page 87: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/87.jpg)
Sun UltraSPARC T1 Processor
http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers
![Page 88: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/88.jpg)
8 Cores vs 2 Cores
• Is 8 cores too aggressive?
– Good for server applications, given
  • Lots of threads
  • Scalable operating environment
  • Large memory space (64-bit)
– Good for power efficiency
  • Simple pipeline design for each core
– Good for availability
– Not intended for PCs, gaming, etc.
![Page 89: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/89.jpg)
INTRODUCTION SINGLE CHIPS COMPARISON CONCLUSION
The Case for a Single-Chip Multiprocessor
Kunle Olukotun
Lance Hammond
Basem A. Nayfeh
Ken Wilson
Kunyung Chang
Computer Systems Laboratory
Stanford University
Stanford, CA 94305-4070
Hydra Group
http://www-hydra.stanford.edu
![Page 90: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/90.jpg)
Multiple Single Chip
Introduction
Microprocessor performance growth
• Increase in integration density
• Higher clock rates
• New microarchitectural innovations
– Multiple instruction issue
– Dynamic scheduling
– Speculative execution
– Non-blocking caches
![Page 91: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/91.jpg)
Superscalar Approach
The Limits of the Superscalar Approach
• Recent design trend: CPUs with multiple instruction issue (dynamic scheduling)
• Track register dependencies between instructions
![Page 92: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/92.jpg)
Superscalar Approach
[Figure: A dynamic superscalar CPU; recoverable labels: Instruction, Execution array]
![Page 93: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/93.jpg)
Superscalar Approach
Instructions
• 3 factors constrain instruction fetch
– Mispredicted branches
– Instruction misalignment
– Cache misses
• 2 ways to implement register renaming
– Use an explicit table for mapping architectural registers to physical registers
– Use a combination reorder buffer/instruction queue
• An instruction is inserted into the instruction issue queue, and is issued for execution once all of its operands are ready
![Page 94: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/94.jpg)
Superscalar Approach
Instructions
• O: the number of operands per instruction
• W: the issue width of the machine
• O × W: the number of access ports required by the mapping table structure
• Ex: an eight-wide issue machine with three operands per instruction requires a 24-port mapping table
![Page 95: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/95.jpg)
Superscalar Approach
Instructions
• n: the number of bits required to encode a register identifier
• Q: the size of the instruction issue queue
• n × Q × O × W: the number of comparators, which grows with the size of the instruction queue and the issue width
• Ex: an eight-wide issue machine with three operands per instruction, a 64-entry instruction queue, and 6-bit comparisons requires 8 × 3 × 64 × 6 = 9216 comparators
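The two hardware-cost formulas from these slides, O × W ports and n × Q × O × W comparators, can be checked with a few lines of arithmetic. The function and parameter names below are illustrative; only the formulas and the example values come from the slides.

```python
def mapping_table_ports(operands, issue_width):
    """Access ports needed by the register mapping table: O x W."""
    return operands * issue_width


def issue_queue_comparators(reg_bits, queue_size, operands, issue_width):
    """Comparators needed by the instruction issue queue: n x Q x O x W."""
    return reg_bits * queue_size * operands * issue_width


# The eight-wide, three-operand machine used as the example on the slides.
ports = mapping_table_ports(operands=3, issue_width=8)
comparators = issue_queue_comparators(
    reg_bits=6, queue_size=64, operands=3, issue_width=8
)
print(ports, comparators)
```

Doubling both the issue width and the queue size in this sketch quadruples the comparator count, which is the growth the slides go on to describe as the limit on wide-issue designs.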
![Page 96: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/96.jpg)
Superscalar Approach
Instructions
• As instruction issue widths increase, larger windows of instructions are required to find independent instructions that can issue in parallel and maintain the full issue bandwidth
→ A quadratic increase in the size of the instruction issue queue
→ will limit the cycle time of the processor
→ will limit the performance of wide issue superscalar machines
![Page 97: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/97.jpg)
Single-Chip Multiprocessor
Motivation: Technology push
• Technical issues, especially
– the delay of the complex issue queue
– multi-port register files
will limit the performance returns from a wide superscalar execution model.
• Need for a decentralized microarchitecture that maintains the performance growth of microprocessors
![Page 98: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/98.jpg)
Single-Chip Multiprocessor
Motivation: Application pull
• The microarchitecture that works best depends on the amount and characteristics of the parallelism in the applications.
• In terms of parallelism, 2 different kinds of applications require different execution models
• Application pull towards a single-chip multiprocessor
![Page 99: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/99.jpg)
Single-Chip Multiprocessor
Application parallelism comparison
• Class 1: Applications with low to moderate amounts of parallelism
– Under 10 instructions/cycle
VS
• Class 2: Applications with large amounts of parallelism
– Greater than 40 instructions/cycle
– Floating point apps, loop-level parallelism
![Page 100: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/100.jpg)
Single-Chip Multiprocessor
Application parallelism comparison
• Class 1: Applications with low to moderate amounts of parallelism
– Little parallelism to exploit
– These applications work best on moderately superscalar processors with very high clock rates
![Page 101: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/101.jpg)
Single-Chip Multiprocessor
Application parallelism comparison
• Class 2: Applications with large amounts of parallelism (greater than 40 instructions/cycle)
– These applications have large amounts of parallelism and see performance benefits from a variety of methods designed to exploit parallelism, such as superscalar, VLIW, and vector processing
– However, the recent advances in parallel compilers make a multiprocessor an efficient and flexible way to exploit the parallelism in these programs
![Page 102: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/102.jpg)
Way to use Multiprocessor
Application parallelism comparison
1st way to use a multiprocessor: execution of multiple processes in parallel to increase throughput in a multiprogramming environment
• This happens under the control of a multiprocessor-aware operating system; a number of commercially available OSes, such as Windows NT, IRIX, and Sun Solaris, have this capability.
• The increasingly widespread use of visualization and multimedia applications tends to increase the number of active processes or independent threads on a desktop machine or server at a particular point in time.
![Page 103: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/103.jpg)
Way to use Multiprocessor
Application parallelism comparison
2nd way to use a multiprocessor: execution of multiple threads in parallel that come from a single application.
Ex. 1: transaction processing
• The threads communicate using shared memory.
• Designed to run on parallel machines with communication latencies in the hundreds of CPU clock cycles.
• The threads do not communicate in a very fine-grained manner.
![Page 104: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/104.jpg)
Way to use Multiprocessor
Application parallelism comparison
2nd way to use a multiprocessor: execution of multiple threads in parallel that come from a single application.
Ex. 2: hand-parallelized floating point scientific applications
Note: hand code (code before compile program)
• When the instruction window size is very large and the branch prediction is perfect (because the existing parallelism is widely distributed)
• The parallelism exposed in this fine-grained manner cannot be exploited by a conventional multiprocessor architecture.
→ To exploit this, a single-chip multiprocessor architecture is available.
![Page 105: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/105.jpg)
Way to use Multiprocessor
Application parallelism comparison
3rd way to use a multiprocessor: accelerating the execution of sequential applications without manual intervention
• Automatic parallelization technology has been shown to be effective (usually used in Fortran).
• Automatic parallelization derives significant performance benefits from the low-latency inter-processor communication provided by a single-chip multiprocessor.
![Page 106: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/106.jpg)
The Difference of two systems
Key characteristics of the two microarchitectures
• The two-issue CPU will have a higher clock rate than the six-issue CPU
• But assume that the two processors have the same clock rate.
![Page 107: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/107.jpg)
The Difference of two systems
[Table: Key characteristics of the two microarchitectures]
![Page 108: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/108.jpg)
6-Way Superscalar Architecture
[Figure: Floor plan for the six-issue dynamic superscalar]
• Big overhead
![Page 109: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/109.jpg)
6-Way Superscalar Architecture
TERMS
TLB: The translation lookaside buffer (TLB) is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory. It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed.
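The "hot list" behaviour of a TLB can be sketched as a small cache in front of a page table. Everything specific below is an assumption made for illustration: the 4096-byte page size, the least-recently-used replacement policy, the dictionary page table, and the class name `TLB`. Real TLBs are hardware structures with their own organization and replacement rules.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Toy TLB: a small least-recently-used cache of page translations."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table        # virtual page -> physical frame
        self.entries = OrderedDict()        # the "hot list" of recent pages
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)   # mark as most recently used
        else:
            self.misses += 1                # fall back to the full page table
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = TLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
addresses = [0x0010, 0x0020, 0x1000, 0x0030, 0x2000, 0x1004]
physical = [tlb.translate(a) for a in addresses]
print(tlb.hits, tlb.misses)
```

Repeated accesses to the same page hit in the small table, while touching more pages than the table holds forces evictions and later misses.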
![Page 110: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/110.jpg)
4x2-Way Superscalar Multiprocessor Architecture
Floor plan for the four-way single-chip multiprocessor
Each processor also has a floating-point unit and an integer unit, with smaller overhead
L1 caches in each processor
Switch for data communication among the processors and memory through the cache
![Page 111: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/111.jpg)
Simulation Environment
![Page 112: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/112.jpg)
Simulation Environment
![Page 113: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/113.jpg)
Applications
Sample Applications
Integer applications
• compress
• eqntott
• m88ksim
• MPsim
Floating point applications
• applu
• apsi
• swim
• tomcatv
Multiprogramming applications
• pmake
![Page 114: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/114.jpg)
Applications
Sample Applications
compress – compresses and uncompresses a file in memory
eqntott – translates logic equations into truth tables
m88ksim – Motorola 88000 CPU simulator
MPsim – VCS compiled Verilog simulation of a multiprocessor
applu – solver for parabolic/elliptic partial differential equations
apsi – solves problems of temperature, wind, velocity, and distribution of pollutants
swim – shallow water model with 1K x 1K grid
tomcatv – mesh generation with Thompson solver
pmake – parallel make of gnuchess using C compiler
![Page 115: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/115.jpg)
Performance Comparison
How to compare
MPCI: misses per completed instruction – the cache miss rates
IPC: instructions per cycle
BP rate: branch prediction rate
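As a rough illustration of how the three comparison metrics relate to raw event counts, the sketch below computes them from a set of counters, assuming IPC denotes instructions per cycle and BP rate denotes the fraction of branches predicted correctly. The `Counters` type, the function names, and the sample values are hypothetical and are not taken from the slides or from any simulator output.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical event counts collected over one simulation run."""
    cycles: int
    instructions: int                  # completed instructions
    cache_misses: int
    branches: int
    branches_predicted_correctly: int


def ipc(c: Counters) -> float:
    """Instructions completed per clock cycle."""
    return c.instructions / c.cycles


def mpci(c: Counters) -> float:
    """Cache misses per completed instruction."""
    return c.cache_misses / c.instructions


def bp_rate(c: Counters) -> float:
    """Fraction of branches whose direction was predicted correctly."""
    return c.branches_predicted_correctly / c.branches


# Made-up numbers, for illustration only.
sample = Counters(cycles=2000, instructions=1500, cache_misses=30,
                  branches=300, branches_predicted_correctly=270)

print(f"IPC = {ipc(sample):.2f}, "
      f"MPCI = {mpci(sample):.3f}, "
      f"BP rate = {bp_rate(sample):.0%}")
```

MPCI normalizes misses by instructions rather than by memory accesses, so it can be compared across processors that issue different numbers of memory references for the same program.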
![Page 116: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/116.jpg)
Comparison of two micro architectures
IPC Breakdown for a single 2-issue processor
Performance of a single 2-issue superscalar processor
Performance Comparison
![Page 117: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/117.jpg)
IPC Breakdown for the 6-issue processor
Performance of the 6-issue superscalar processor
Performance Comparison
![Page 118: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/118.jpg)
Performance of the 4x2-issue processor
Performance Comparison
![Page 119: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/119.jpg)
Performance Comparison
Performance comparison of SS and MP (Hydra)
![Page 120: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/120.jpg)
Conclusion
In selecting the way to design microprocessors:
• A single-chip MP exploits parallelism more effectively at some levels than an SS microprocessor
• A single-chip MP architecture is more efficient than a superscalar architecture in the same physical space
• It provides up to 2x the performance on applications with higher levels of parallelism
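The dependence of the MP advantage on available parallelism can be sketched with a simple Amdahl-style model. This is only an illustration under stated assumptions: the function name and the per-core IPC values (`ipc_small` for a 2-issue core, `ipc_wide` for the 6-issue core) are placeholders, not measurements from the study, and the model ignores communication, synchronization, and cache effects.

```python
def mp_speedup_over_ss(parallel_fraction, cores=4, ipc_small=1.0, ipc_wide=1.5):
    """Speedup of a `cores`-way MP over one wide SS core for a fixed workload.

    parallel_fraction: share of the work that can be spread across all cores.
    ipc_small / ipc_wide: ASSUMED sustained IPC of one narrow core and of the
    wide superscalar core (placeholder values, not results from the slides).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    time_ss = 1.0 / ipc_wide                          # all work on the wide core
    serial = (1.0 - parallel_fraction) / ipc_small    # runs on one narrow core
    parallel = parallel_fraction / (cores * ipc_small)
    return time_ss / (serial + parallel)


for p in (0.0, 0.5, 0.9, 1.0):
    print(f"parallel fraction {p:.1f}: MP/SS speedup {mp_speedup_over_ss(p):.2f}")
```

Under these placeholder values the MP is slower than the SS when there is no thread-level parallelism and faster once most of the work can be spread across the cores, which matches the qualitative shape of the conclusion above; the crossover point depends entirely on the assumed IPC values.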
![Page 121: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/121.jpg)
Case examples
Intel MCPs (1)
The Move to Intel Multi-core
Timeline: 2005, 2006, 2007+
Platforms: Itanium® processor, MP Server, DP Server / WS, Desktop Client, Mobile Client
All products and dates are preliminary and subject to change without notice.
Refer to ‘fact sheet’ for specific product timings
Figure 5.1: The move to Intel multi-core
Source: A. Loktu: Itanium 2 for Enterprise Computing
http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
![Page 122: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/122.jpg)
Intel MCPs (2)
Figure 5.2: Processor specifications of Intel’s Pentium D family (90 nm)
Source: http://www.intel.com/products/processor/index.htm
![Page 123: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/123.jpg)
Intel MCPs (3)
EIST: Enhanced Intel SpeedStep Technology
First delivered in Intel’s mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased average power consumption and decreased average heat production.
VT: Virtualization Technology
It is a set of hardware enhancements to Intel’s server and client platforms that can improve the performance and robustness of traditional software-based virtualization solutions. Virtualization solutions will allow a platform to run multiple operating systems and applications in independent partitions. Using virtualization capabilities, one computer system can function as multiple "virtual" systems.
ED: Execute Disable Bit
Malicious buffer overflow attacks pose a significant security threat. In a typical attack, a malicious worm creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers. The Execute Disable Bit can help prevent certain classes of malicious buffer overflow attacks when combined with a supporting operating system. It allows the processor to classify areas in memory by where application code can execute and where it cannot. When a malicious worm attempts to insert code in the buffer, the processor disables code execution, preventing damage and worm propagation.
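Why lowering both voltage and frequency reduces average power can be seen from the usual first-order approximation for dynamic CMOS switching power, P ≈ a · C · V² · f. The sketch below applies that approximation to two operating points. The function name, the effective capacitance, and the voltage and frequency pairs are invented for illustration and are not EIST operating points of any real processor; static (leakage) power is ignored.

```python
def dynamic_power(c_eff, voltage, frequency, activity=1.0):
    """First-order dynamic switching power in watts: P = a * C * V^2 * f.

    c_eff: effective switched capacitance in farads (assumed value).
    voltage: supply voltage in volts.
    frequency: clock frequency in hertz.
    activity: average switching activity factor between 0 and 1.
    """
    return activity * c_eff * voltage ** 2 * frequency


# Two hypothetical operating points (illustrative only).
high = dynamic_power(c_eff=1e-9, voltage=1.3, frequency=3.0e9)
low = dynamic_power(c_eff=1e-9, voltage=1.0, frequency=2.0e9)
saving = 1.0 - low / high

print(f"high: {high:.2f} W, low: {low:.2f} W, saving: {saving:.0%}")
```

Because power scales with the square of the voltage, reducing voltage together with frequency saves proportionally more power than the frequency reduction costs in performance.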
![Page 124: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/124.jpg)
Intel MCPs (4)
Figure 5.3: Processor specifications of Intel’s Pentium D family (65 nm)
Source: http://www.intel.com/products/processor/index.htm
![Page 125: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/125.jpg)
Intel MCPs (5)
Figure 5.4: Specifications of Intel’s Pentium Processor Extreme Edition models 840/955/965
Source: http://www.intel.com/products/processor/index.htm
![Page 126: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/126.jpg)
Intel MCPs (6)
Figure 5.5: Processor specifications of Intel’s Yonah Duo (Core Duo) family
Source: http://www.intel.com/products/processor/index.htm
![Page 127: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/127.jpg)
Intel MCPs (7)
Figure 5.6: Specifications of Intel’s Core Processors
Source: http://www.intel.com/products/processor_number/chart/core2duo.htm
![Page 128: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/128.jpg)
Intel MCPs (8)

| Category | Code Name | Cores | Cache | Market |
|---|---|---|---|---|
| Desktop | Kentsfield | Dual core, multi-die | 4 MB | Mid 2007 |
| Desktop | Conroe | Dual core, single die | 4 MB shared | End 2006 |
| Desktop | Allendale | Dual core, single die | 2 MB shared | End 2006 |
| Desktop | Cedar Mill (NetBurst/P4) | Single core | 512 kB, 1 MB, 2 MB | Early 2006 |
| Desktop | Presler (NetBurst/P4) | Dual core, dual die | 4 MB | Early 2006 |
| Desktop/Mobile | Millville | Single core | 1 MB | Early 2007 |
| Mobile | Yonah2 | Dual core, single die | 2 MB | Early 2006 |
| Mobile | Yonah1 | Single core | 1/2 MB | Mid 2006 |
| Mobile | Stealey | Single core | 512 kB | Mid 2007 |
| Mobile | Merom | Dual core, single die | 2/4 MB shared | End 2006 |
| Enterprise | Sossaman | Dual core, single die | 2 MB | Early 2006 |
| Enterprise | Woodcrest | Dual core, single die | 4 MB | Mid 2006 |
| Enterprise | Clovertown | Quad core, multi-die | 4 MB | Mid 2007 |
| Enterprise | Dempsey (NetBurst/Xeon) | Dual core, dual die | 4 MB | Mid 2006 |
| Enterprise | Tulsa | Dual core, single die | 4/8/16 MB | End 2006 |
| Enterprise | Whitefield | Quad core, single die | 8 MB, 16 MB shared | Early 2008 |

Figure 5.7: Future 65 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
![Page 129: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/129.jpg)
Intel MCPs (9)

| Category | Code Name | Cores | Cache | Market |
|---|---|---|---|---|
| Desktop | Wolfdale | Dual core, single die | 3 MB shared | 2008 |
| Desktop | Ridgefield | Dual core, single die | 6 MB shared | 2008 |
| Desktop | Yorkfield | 8 cores, multi-die | 12 MB shared | 2008+ |
| Desktop | Bloomfield | Quad core, single die | - | 2008+ |
| Desktop/Mobile | Perryville | Single core | 2 MB | 2008 |
| Mobile | Penryn | Dual core, single die | 3 MB, 6 MB shared | 2008 |
| Mobile | Silverthorne | - | - | 2008+ |
| Enterprise | Harpertown | 8 cores, multi-die | 12 MB shared | 2008 |

Figure 5.8: Future 45 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
![Page 130: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/130.jpg)
Athlon 64 X2
Figure 5.9: AMD Athlon 64 X2 dual-core processor architecture
Source: AMD Athlon 64 X2 Dual-Core Processor for Desktop – Key Architecture Features, http:///www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041.00.html
![Page 131: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/131.jpg)
Sun’s UltraSPARC IV/IV+ (1)
Figure 5.10: UltraSPARC IV (Jaguar)
Source: C. Boussard: Architecture des processeurs
http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
ARB: Arbiter
![Page 132: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/132.jpg)
Sun’s UltraSPARC IV/IV+ (2)
Figure 5.11: UltraSPARC IV+ (Panther)
Source: C. Boussard: Architecture des processeurs
http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
![Page 133: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/133.jpg)
POWER4/POWER5 (1)
Figure 5.12: POWER4 chip logical view
Source: J.M. Tendler, S. Dodson, S. Fields, H. Le, B. Sinharoy: Power4 System Microarchitecture, IBM Server, Technical White Paper, October 2001
http://www-03.ibm.coom/servers/eserver/pseries/hardware/whitepapers/power4.pdf
Figure labels: Built-In Self-Test, Service Processor, Power-On Reset, Core Interface Unit (crossbar), Non-Cacheable Unit, Multi-Chip Module
![Page 134: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/134.jpg)
POWER4/POWER5 (2)
Figure 5.13: POWER4 chip
Source: R. Kalla, B. Sinharoy, J. Tendler: Simultaneous Multi-threading Implementation in Power5 –IBM’s Next Generation POWER Microprocessor, 2003
http://www.hotchips.org/archives/hc15/3_Tue/11.ibm.pdf
![Page 135: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/135.jpg)
POWER4/POWER5 (3)
Figure 5.14: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.
Figure label: Fabric Controller
![Page 136: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/136.jpg)
Cell
Figure 5.15: Cell (BE) microarchitecture
Source: IBM: „Cell Broadband Engine™ processor – based systems”, IBM corp. 2006
SPE: Synergistic Processing Element
EIB: Element Interconnect Bus
MFC: Memory Flow Controller
PPE: Power Processing Element
AUC: Atomic Update Cache
![Page 137: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/137.jpg)
Cell (2)
Figure 5.16: Cell SPE architecture
Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html
![Page 138: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/138.jpg)
Cell (3)
Figure 5.17: Cell floorplan
Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html
![Page 139: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/139.jpg)
Issues and Challenges
• Memory Organization
– Distributed? Shared?
• Interconnect and Communication Protocols
• Coherency and Consistency – Memory/Cache
• Scheduling, Load Balancing and Synchronization
• Reliability? Energy Efficiency?
• Will see all these through the rest of the course!
![Page 140: ΠΡΟΧΩΡΗΜΕΝΗΑΡΧΙΤΕΚΤΟΝΙΚΗ … · Architecture Transistor densities increased at a stunning pace. Any method to increase computing performance for using those](https://reader035.vdocuments.pub/reader035/viewer/2022071011/5fc8f5b938dba02c924a6816/html5/thumbnails/140.jpg)
Next Week
• We will talk about Scheduling and Load Balancing issues, with respect to multiple processing nodes (in effect covers CMPs as well).
• We will also talk about the possibility of quizzes…