Multithreaded Architectures
Uri Weiser and Yoav Etsion
Computer Architecture 2014 - Multithreading 2/40
Overview
Multithreaded Software
Multithreaded Architecture
Multithreaded Micro-Architecture
Conclusions
Computer Architecture 2014 - Multithreading 3/40
Multithreading Basics
Process: each process has its own unique address space and can consist of several threads
Thread: each thread has its own execution context: its own PC (sequencer), registers, and stack
All threads within a process share the same address space; a private heap is optional
Multithreaded apps: a process broken into threads. Example #1: transactions (databases, web servers)
Benefits: increased concurrency and partial blocking (your transaction blocks, mine doesn't have to), plus centralized (smarter) resource management (by process)
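A minimal Python sketch of these definitions; the transaction function and the shared dict are invented for illustration, not taken from the lecture:

```python
import threading

# Shared address space: every thread in this process sees the same dict.
counts = {"done": 0}
lock = threading.Lock()

def transaction(tid: int) -> None:
    local = tid * tid          # private: lives on this thread's own stack
    with lock:                 # shared heap data must be synchronized
        counts["done"] += 1

threads = [threading.Thread(target=transaction, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts["done"])          # -> 4
```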
Computer Architecture 2014 - Multithreading 5/40
Superscalar Execution Example – 5 stages, 4-wide machine
[Pipeline diagram: four instructions per cycle flow through the F, D1, D2, EX/RET, and WB stages.]
Computer Architecture 2014 - Multithreading 6/40
Superscalar Execution
Generally increases throughput, but decreases utilization
Computer Architecture 2014 - Multithreading 7/40
CMP – Chip Multiprocessor: multiple threads/tasks run on multiple processors/cores
Low utilization, but higher throughput
Computer Architecture 2014 - Multithreading 8/40
Multiple Hardware Threads
A thread can be viewed as a stream of instructions; its state is represented by the PC, stack pointer, and general-purpose registers
Equip multithreaded processors with multiple hardware contexts (threads); the OS then views the CPU as multiple logical processors
Instructions from different threads are interleaved in the pipeline; the interleaving policy is critical…
Computer Architecture 2014 - Multithreading 9/40
Blocked Multithreading: multiple threads/tasks run on a single core
May increase utilization and throughput, but must switch when the current thread enters a low-utilization/throughput section (e.g., an L2 cache miss)
How is this different from OS scheduling?
Computer Architecture 2014 - Multithreading 10/40
Fine-Grained Multithreading: multiple threads/tasks run on a single core
Increases utilization/throughput by reducing the impact of dependences
Computer Architecture 2014 - Multithreading 11/40
Simultaneous Multithreading (SMT): multiple threads/tasks run on a single core
Increases utilization/throughput
Computer Architecture 2014 - Multithreading 12/40
Blocked Multithreading (SoE-MT, Switch-on-Event MT, aka the "Poor Man's MT")
Critical decision: when to switch threads: when the current thread's utilization/throughput is about to drop (e.g., an L2 cache miss)
Requirement for throughput: (thread switch) + (pipe fill time) << blocking latency
We would like to get some work done before the other thread comes back
Fast thread switch: multiple register banks; fast pipe fill: a short pipe
Advantage: small changes to existing hardware. Drawback: high single-thread performance requires a long thread switch
Examples: the macro-dataflow machine, MIT Alewife, IBM Northstar
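To make the inequality concrete, a toy calculation; all cycle counts below are assumptions chosen for illustration:

```python
# Hypothetical cycle counts, chosen only to illustrate the requirement
# (thread switch) + (pipe fill) << blocking latency.
switch_cycles = 2       # register-bank switch, no state copying
pipe_fill_cycles = 5    # refill a short in-order pipe
l2_miss_cycles = 100    # the blocking event we are trying to hide

overhead = switch_cycles + pipe_fill_cycles
useful_cycles = l2_miss_cycles - overhead
print(f"overhead={overhead}, useful work window={useful_cycles}")
# overhead=7, useful work window=93 -> switching pays off
```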
Computer Architecture 2014 - Multithreading 13/40
Interleaved (Fine-Grained) Multithreading
Critical decision: none?
Requirements for throughput: enough threads to hide data-access latencies and dependences
Increasing the number of threads reduces single-thread performance
[Pipeline diagrams: in the superscalar pipeline, I1 and I2 of one thread enter back-to-back (F D E M W); in the multithreaded pipeline, I1 of T1, I1 of T2, and I2 of T1 enter in successive cycles, each flowing through F D E M W.]
Computer Architecture 2014 - Multithreading 14/40
Interleaved (Fine-Grained) Multithreading (cont.)
Advantages (with a flexible interleave): reasonable single-thread performance; high processor utilization (especially with many threads)
Drawbacks: complicated hardware; multiple contexts (states); with an inflexible interleave, single-thread performance is limited
Examples: Denelcor HEP: 8 threads (latencies were shorter then); TERA: 128 threads; MicroUnity: 5 threads at 1 GHz, each seeing 200 MHz-like latency
Became attractive for GPUs and network processors
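A sketch of the round-robin interleave in Python; the thread streams and instruction names are invented for illustration:

```python
def interleave(streams):
    """Round-robin (barrel) interleave: one instruction per thread per
    round, so consecutive instructions of any one thread never sit in
    adjacent pipeline stages."""
    iters = {tid: iter(s) for tid, s in streams.items()}
    while iters:
        for tid in list(iters):
            try:
                yield tid, next(iters[tid])
            except StopIteration:
                del iters[tid]   # thread finished; drop it from the rotation

streams = {"T1": ["i1", "i2"], "T2": ["j1", "j2"], "T3": ["k1", "k2"]}
print(list(interleave(streams)))
# [('T1','i1'), ('T2','j1'), ('T3','k1'), ('T1','i2'), ('T2','j2'), ('T3','k2')]
```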
Computer Architecture 2014 - Multithreading 15/40
Simultaneous Multi-threading (SMT)
Critical decision: the fetch-interleaving policy
Requirements for throughput: enough threads to utilize the resources, yet fewer than needed to stretch dependences
Examples: Compaq Alpha EV8 (cancelled); Intel Pentium® 4 Hyper-Threading Technology
Computer Architecture 2014 - Multithreading 16/40
SMT Case Study: EV8
8-issue OOO processor (a wide machine). SMT support:
Multiple sequencers (PCs): 4; fetch alternates among the threads
Separate register renaming for each thread, with more (4x) physical registers
Thread tags on all shared resources (ROB, LSQ, BTB, etc.) allow per-thread flush/trap/retirement
Process tags on all address-space resources: caches, TLBs, etc.
Notice: none of these things are in the "core"; the instruction queues, scheduler, and wakeup logic are SMT-blind
Computer Architecture 2014 - Multithreading 17/40
Basic EV8
[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with one PC, Icache, register map, register files, and Dcache. The whole pipeline is thread-blind.]
Computer Architecture 2014 - Multithreading 18/40
SMT EV8
[Pipeline diagram: the same stages, now with per-thread PCs and register maps at the front end; the Icache, Dcache, and register files are shared. Thread-blind? Not really: only the middle of the machine stays thread-blind.]
Computer Architecture 2014 - Multithreading 19/40
Performance Scalability
[Bar charts of speedup (0-300%) for 1T/2T/3T/4T configurations.
Multiprogrammed workload: SpecInt, SpecFP, mixed Int/FP.
Decomposed SPEC95 applications: Turb3d, Swm256, Tomcatv.
Multithreaded applications: Barnes, Chess, Sort, TP.]
Computer Architecture 2014 - Multithreading 20/40
Multithreading: "I'm afraid of this"
[Chart: SMT speedup (0.4 to 1.6) over a 25 Mcycle EV8 simulation by R. Espasa; the speedup fluctuates and at times drops below 1. Is CMP a solution?]
Computer Architecture 2014 - Multithreading 21/40
Performance Gain – when?
When threads do not "step" on each other's "predicted state": caches ($), branch prediction, data prefetching, …
When ample resources are available; a slow (scarce) resource can hamper performance
Computer Architecture 2014 - Multithreading 22/40
Scheduling in SMT
What if one thread gets "stuck"?
With round-robin fetch, it will eventually fill up the machine (resource contention)
ICOUNT: the thread with the fewest instructions in the pipe gets fetch priority, so a stuck thread does not get to fetch until it gets "unstuck"
Other policies have been devised to get the best balance of fairness and overall performance
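A sketch of the ICOUNT selection rule in Python; the per-thread in-flight counts and the stalled set are illustrative assumptions:

```python
def pick_fetch_thread(in_flight, stalled=frozenset()):
    """ICOUNT: fetch for the ready thread with the fewest instructions
    in the front end, so no thread can clog the queues."""
    candidates = [t for t in in_flight if t not in stalled]
    return min(candidates, key=lambda t: in_flight[t], default=None)

in_flight = {0: 12, 1: 3, 2: 7}     # instructions in the pipe, per thread
print(pick_fetch_thread(in_flight))              # -> 1 (fewest in flight)
print(pick_fetch_thread(in_flight, stalled={1})) # -> 2
```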
Computer Architecture 2014 - Multithreading 23/40
Scheduling in SMT
Variation: what if one thread is spinning? It is not really stuck, so it gets to keep fetching; it has to be stopped or slowed artificially (QUIESCE): sleep until a memory location changes
On the Pentium 4, use the PAUSE instruction
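The idea "sleep until a memory location changes" maps naturally onto a blocking primitive; a Python analogue, as a sketch of the semantics rather than the hardware mechanism:

```python
import threading

flag = threading.Event()        # stands in for the watched memory location

def waiter():
    # Instead of spinning (and consuming fetch slots on an SMT core),
    # block until the location is written.
    flag.wait()
    print("location changed; resuming")

t = threading.Thread(target=waiter)
t.start()
flag.set()                      # the "store" that wakes the sleeper
t.join()
```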
Computer Architecture 2014 - Multithreading 24/40
MT Application Data Sharing
Shared-memory apps
Decoupled (= private) caches: easier to implement; reuse an existing MP-like coherence protocol
Shared caches: faster communication among threads, no coherence overhead, and flexibility in allocating resources
[Diagram: a shared set-associative cache with tag and data arrays in ways 0 through 7.]
Computer Architecture 2014 - Multithreading 25/40
SMT vs. CMP: how to use 2X transistors? A better core with SMT, or a CMP?
SMT: better single-thread performance; better resource utilization
CMP: much easier to implement; avoids wire-delay problems; overall better throughput per area/power; more deterministic performance. BUT the program must be multithreaded
Rough numbers (ST performance / MT performance for 2X transistors):
SMT: 1.3X / 1.5-1.7X
CMP: 1X / 2X
[Diagram: Reference = Core A + 1MB L2; SMT = an SMT core + 2MB L2; CMP = Core A + Core B + 2MB L2.]
Computer Architecture 2014 - Multithreading 26/40
Intermediate summary
Multithreaded software; multithreaded architecture: advantageous cost/throughput …
Blocked MT: long-latency tolerance; good single-thread performance and good throughput; needs a fast thread switch and a short pipe
Interleaved MT: ILP increase; bad single-thread performance, good throughput; needs many threads and several loaded contexts …
Simultaneous MT: ILP increase; OK MT performance; good single-thread performance; good utilization …; needs fewer threads
Computer Architecture 2014 - Multithreading 28/40
Example: Pentium® 4 Hyper-Threading
Executes two threads simultaneously: two different applications, or two threads of the same application
The CPU maintains architectural state for two processors: two logical processors per physical processor
Implementation on the Intel® Xeon™ processor: two logical processors for < 5% additional die area
The processor presents itself as 2 cores in an MP shared-memory system
Computer Architecture 2014 - Multithreading 29/40
uArchitecture impact
Replicate some resources: all per-CPU architectural state, instruction pointers, and renaming logic, plus some smaller resources, e.g., the return-stack predictor, ITLB, etc.
Partition some resources (share by splitting in half per thread): the re-order buffer, load/store buffers, queues, etc.
Share most resources: the out-of-order execution engine and the caches
Computer Architecture 2014 - Multithreading 30/40
Hyper-Threading Performance
[Chart: relative performance with Hyper-Threading disabled (baseline 1.00) vs. enabled on Intel Xeon processor MP prototype systems in 2-way configurations: speedups of roughly 1.18 to 1.30 on Microsoft Active Directory, Exchange, SQL Server, and IIS, and 1.18 on a parallel GCC compile. Applications were not tuned or optimized for Hyper-Threading Technology. All trademarks and brands are the property of their respective owners.]
Performance varies, as expected, with the number of parallel threads and with resource utilization
Less aggressive MT than EV8, and less impressive scalability: the machine is optimized for single-thread performance
Computer Architecture 2014 - Multithreading 31/40
Multi-Thread Architecture (SoE)
[Timeline diagram: each thread alternates an execution burst with a memory access of penalty P; the period is one execution burst plus one memory access.]
Computer Architecture 2014 - Multithreading 32/40
Multi-Thread Architecture (SoE)
Memory latency is shielded by the execution of multiple threads
[Diagram: core C holding the threads' architectural states, connected to memory.]
Computer Architecture 2014 - Multithreading 33/40
Multi-Thread Architecture (SoE): model parameters
CPI_i = ideal CPI of core C
MI = memory accesses per instruction (so instructions per memory access = 1/MI)
P = memory access penalty (cycles)
n = number of threads
[Diagram: core C holding the threads' architectural states, connected to memory.]
Computer Architecture 2014 - Multithreading 34/40
Multi-Thread Architecture (SoE)
Per-thread cost, in cycles per instruction:
CPI_1 = (P + (1/MI) * CPI_i) / (1/MI) = CPI_i + MI * P
Period = P + (1/MI) * CPI_i
(CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory access penalty in cycles; n = number of threads.)
Computer Architecture 2014 - Multithreading 35/40
Multi-Thread Architecture (SoE)
Aggregate throughput with n threads:
IPC_T = n / CPI_1 = n / (CPI_i + MI * P)
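The model is easy to exercise numerically. A small sketch with assumed parameter values (CPI_i = 1, MI = 0.05, P = 100), adding the physical cap at the core's peak IPC of 1/CPI_i:

```python
def ipc_total(n, cpi_i=1.0, mi=0.05, p=100.0):
    """IPC_T = n / (CPI_i + MI*P), capped at the core limit 1/CPI_i."""
    return min(n / (cpi_i + mi * p), 1.0 / cpi_i)

for n in (1, 2, 4, 6, 8):
    print(n, round(ipc_total(n), 3))
# 1 0.167, 2 0.333, 4 0.667, 6 1.0, 8 1.0
# throughput grows linearly and saturates at 1/CPI_i once n >= 6
```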
Computer Architecture 2014 - Multithreading 36/40
Multi-Thread Architecture (SoE)
Memory latency is shielded by the execution of multiple threads
[Chart: performance (IPC) vs. number of threads n, rising with n. Max performance?]
Computer Architecture 2014 - Multithreading 37/40
Multi-Thread Architecture (SoE)
Memory latency is shielded by the execution of multiple threads
[Chart: performance (IPC) vs. number of threads n, rising linearly and saturating at max performance = 1/CPI_i at the thread count marked "N1?".]
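The saturation point, presumably the "N1" mark on the chart, follows directly from the model; a short derivation:

```latex
% IPC grows linearly with n until it hits the core's peak:
%   IPC_T(n) = n / (CPI_i + MI \cdot P), \qquad IPC_{max} = 1 / CPI_i
% Equating the two gives the thread count where the curve flattens:
\[
  \frac{n^\ast}{CPI_i + MI \cdot P} = \frac{1}{CPI_i}
  \quad\Longrightarrow\quad
  n^\ast = 1 + \frac{MI \cdot P}{CPI_i}
\]
```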
Computer Architecture 2014 - Multithreading 38/40
The unified machine – the valley
Three regions: the cache ($) region, the valley, and the MT region
[Chart: performance vs. number of threads, with a trend line; performance climbs in the cache region, dips into the valley once the threads' working sets overflow the cache, then climbs again in the MT region.]
Perf = n / (CPI_i + MI * MR(n) * P), where MR(n) is the cache miss rate as a function of the number of threads
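A sketch of the valley under this formula; the step-shaped MR(n) and all parameter values are invented purely to make the three regions visible:

```python
def miss_rate(n, cache_covers=8):
    # Toy MR(n): the cache covers the working sets of up to 8 threads,
    # after which almost everything misses.
    return 0.01 if n <= cache_covers else 1.0

def perf(n, cpi_i=1.0, mi=0.05, p=1000.0):
    return min(n / (cpi_i + mi * miss_rate(n) * p), 1.0 / cpi_i)

for n in (1, 2, 8, 9, 16, 32, 51):
    print(n, round(perf(n), 2))
# 1 0.67, 2 1.0, 8 1.0        <- cache region
# 9 0.18, 16 0.31, 32 0.63    <- the valley
# 51 1.0                      <- MT region: enough threads to hide misses
```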
Computer Architecture 2014 - Multithreading 39/40
Multi-Thread Architecture (SoE): conclusions
Applicable when many threads are available
Requires saving per-thread state
Can reach "high" performance (~1/CPI_i)
Bandwidth to memory is high, and so is power
Computer Architecture 2014 - Multithreading 40/40
BACKUP SLIDES
The unified machine; parameter: cache size (assuming unlimited BW to memory)
Increasing the cache size improves the cache's ability to hide memory latency and favors the MC (cache) region
[Chart: performance (GOPS, 0 to 1100) vs. number of threads (0 to 40,000) for different cache sizes: no cache, 16M, 32M, 64M, 128M, and a perfect $; larger caches raise the curves. At this point: unlimited BW to memory.]
Computer Architecture 2014 - Multithreading 42/40
Cache Size Impact
[Chart: performance (GOPS, 0 to 1100) vs. number of threads (0 to 20,000) for different cache sizes under limited memory BW: no $, 16M, 32M, 64M, 128M, and a perfect $.]
Increasing the cache size lets the cache cover more in-flight threads, extending the $ region
…AND caches are also valuable in the MT region: they reduce off-chip bandwidth and so delay the BW saturation point
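A back-of-envelope sketch of why caches delay the BW saturation point; off-chip traffic scales with the miss rate, and every number below is an assumption:

```python
def offchip_gb_per_s(ipc, clock_ghz, mi, miss_rate, line_bytes=64):
    # Off-chip traffic = misses per second * cache-line size
    return ipc * clock_ghz * mi * miss_rate * line_bytes

print(round(offchip_gb_per_s(1.0, 1.0, 0.05, miss_rate=1.0), 2))  # no cache: ~3.2 GB/s
print(round(offchip_gb_per_s(1.0, 1.0, 0.05, miss_rate=0.2), 2))  # with cache: ~0.64 GB/s
```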
Computer Architecture 2014 - Multithreading 43/40
Memory Latency Impact
Increasing memory latency hinders the MT region and emphasizes the importance of caches
[Chart: performance (GOPS, 0 to 1100) vs. number of threads (0 to 40,000) for memory latencies of zero, 50, 100, 300, and 1000 cycles, with unlimited BW to memory; longer latencies push the MT region's performance ramp to far higher thread counts.]
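The saturation-point formula from the model, n* = 1 + MI*P/CPI_i, quantifies this trend; a quick check with assumed MI and CPI_i and the chart's latencies:

```python
mi, cpi_i = 0.05, 1.0
for p in (50, 100, 300, 1000):          # the latencies from the chart
    n_star = 1 + mi * p / cpi_i
    print(f"P={p:4d} cycles -> n* = {n_star:.1f} threads")
# P=50 -> 3.5, P=100 -> 6.0, P=300 -> 16.0, P=1000 -> 51.0
```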