Why Parallel Computing?
SSE3054: Multicore Systems, Spring 2017
Jinkyu Jeong ([email protected]), Computer Systems Laboratory, Sungkyunkwan University, http://csl.skku.edu


Page 1

Why Parallel Computing?

Jinkyu Jeong ([email protected]), Computer Systems Laboratory
Sungkyunkwan University, http://csl.skku.edu

Page 2

What is Parallel Computing?

• Software with multiple threads

• Parallel vs. concurrent
  – Parallel computing executes multiple threads at the same time on multiple processors – for performance
  – Concurrent computing can either alternate multiple threads on a single processor or execute them in parallel on multiple processors – for convenient design
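A minimal sketch of the distinction (not from the slides; the array, its size, and the function names are illustrative): two POSIX threads each sum half of an array. On a multiprocessor the two threads can run at the same time (parallel execution, for performance); on a single processor the OS simply alternates them (concurrent execution).

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static long data[N];

    struct range { int lo, hi; long sum; };

    /* Each thread sums its own half of the array. */
    static void *partial_sum(void *arg) {
        struct range *r = arg;
        r->sum = 0;
        for (int i = r->lo; i < r->hi; i++)
            r->sum += data[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = i;

        struct range a = { 0, N / 2, 0 }, b = { N / 2, N, 0 };
        pthread_t t1, t2;
        pthread_create(&t1, NULL, partial_sum, &a);
        pthread_create(&t2, NULL, partial_sum, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("sum = %ld\n", a.sum + b.sum);   /* 499999500000 */
        return 0;
    }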

Page 3

What is Parallel Architecture?

• Machines with multiple processors

[Figure: Intel Quad Core i7 (Nehalem), Sony Playstation 3/4, IBM Blue Gene, Google data center locations, Facebook data center – parallel computing vs. distributed computing]

Page 4

What is Parallel Architecture?

• One definition
  – A parallel computer is a collection of processing elements that cooperate to solve large problems fast

• Some general issues
  – Resource allocation
  – Data access, communication and synchronization
  – Performance and scalability

Page 5

Why Study Parallel Computing?

• 10 years ago
  – Some important applications demand high performance
  – CPU clock frequency scaling alone is not enough
  – Parallel computing provides higher performance

• Today
  – Many interesting applications demand high performance
  – CPU clock rates are no longer increasing!
  – Instruction-level parallelism is not increasing either!
  – Parallel computing is the only way to achieve higher performance in the foreseeable future

Page 6

Why Study Parallel Architecture?

• Role of a computer architect
  – To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost

• Parallelism
  – Provides an alternative to a faster clock for performance
  – Applies at all levels of system design
    • Bits, instructions, loops, threads, …

Page 7

Why Parallel Computing?

• Application demands

• Architectural Trends

• The impact of technology and power

Page 8

Application Trends

• A positive feedback cycle between the two
  – Delivered performance
  – Applications’ demand for performance

• Example application domains
  – Scientific computing: computational fluid dynamics, biology, meteorology, …
  – General purpose computing: video, graphics, …
  – Commercial computing: databases, …

[Diagram: feedback cycle between “More Performance” and “New Applications”]

Page 9

Speedup

• Goal of applications in using parallel machines
  – Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time
  – Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
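A minimal sketch (not part of the slides) of measuring fixed-problem speedup: the same summation is timed once serially and once with an OpenMP parallel reduction, and the ratio T(1)/T(p) is reported. The problem size and the use of OpenMP here are illustrative assumptions.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const long n = 50000000;              /* fixed problem size (placeholder) */
        double *x = malloc(n * sizeof *x);
        if (!x) return 1;
        for (long i = 0; i < n; i++) x[i] = 1.0;

        /* Time (1 processor): plain serial loop. */
        double t0 = omp_get_wtime();
        double s1 = 0.0;
        for (long i = 0; i < n; i++) s1 += x[i];
        double t_serial = omp_get_wtime() - t0;

        /* Time (p processors): the same work as a parallel reduction. */
        t0 = omp_get_wtime();
        double sp = 0.0;
        #pragma omp parallel for reduction(+:sp)
        for (long i = 0; i < n; i++) sp += x[i];
        double t_parallel = omp_get_wtime() - t0;

        printf("sums %.0f / %.0f, speedup = T(1)/T(p) = %.2f\n",
               s1, sp, t_serial / t_parallel);
        free(x);
        return 0;
    }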

Page 10

Scientific Computing Demand

Page 11

Example: N-body Simulation

• Simulation of a dynamical system of particles
  – N particles (location, mass, velocity)
  – Influence of physical forces (e.g., gravity)
  – Physics, astronomy, …
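As a rough illustration of why such simulations need parallel machines (a sketch, not from the slides; the constants and data layout are placeholders), the core of a direct N-body step is an O(N²) pairwise-force loop whose outer iterations are independent and can be spread across cores:

    #include <math.h>

    typedef struct { double x, y, z, mass; } Body;

    /* Accumulate the gravitational force on each body from every other
     * body.  Iterations of the outer loop are independent, so they can
     * be divided among threads/cores (here with an OpenMP pragma). */
    void compute_forces(const Body *b, double (*f)[3], int n) {
        const double G = 6.674e-11;           /* gravitational constant */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            f[i][0] = f[i][1] = f[i][2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = b[j].x - b[i].x;
                double dy = b[j].y - b[i].y;
                double dz = b[j].z - b[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + 1e-9;  /* softening term */
                double s  = G * b[i].mass * b[j].mass / (r2 * sqrt(r2));
                f[i][0] += s * dx;
                f[i][1] += s * dy;
                f[i][2] += s * dz;
            }
        }
    }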

Page 12

Example: Graph Computation

• Google page-rank algorithm
  – Determining the importance of a web page by counting the number and quality of links to the page

• Relative importance of a web page u
  – Repeat until stabilized

• Trillions of pages in the Web

Credit: Wikipedia
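The update formula on this slide is embedded in an image; in its simplest form (as on the cited Wikipedia page, without the damping factor), the rule that is iterated until the values stabilize is

    PR(u) \;=\; \sum_{v \in B_u} \frac{PR(v)}{L(v)}

where B_u is the set of pages that link to u and L(v) is the number of outgoing links of page v.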

Page 13

Engineering Computing

• Large parallel machines are mainstays in many industries
  – Petroleum – reservoir analysis
  – Automotive – crash simulation, drag analysis, combustion efficiency
  – Aeronautics – airflow analysis, engine efficiency, structural mechanics, electromagnetism
  – Computer-aided design – VLSI layout
  – Pharmaceuticals – molecular modeling
  – Visualization – all of the above, entertainment, architecture
  – Financial modeling – yield and derivative analysis
  – …

Page 14

Commercial Computing

• Relies on parallelism for the high end
  – Computational power determines the scale of business that can be handled
  – e.g., databases, online transaction processing, decision support, data mining, data warehousing, …

• TPC benchmarks
  – TPC-C (order entry), TPC-D (decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size is no longer fixed as p increases, so throughput is used as the performance measure: transactions per minute (tpm)

Page 15

Commercial Computing

• Hardware threads in the Top-10 TPC-C machines

Page 16

Why Parallel Computing?

• Application demands

• Architectural Trends

• The impact of technology and power

Page 17

Architectural Trends

• Architecture translates technology’s gifts into performance and capability

• Four generations of architectural history
  – Tube, transistor, IC, VLSI
  – Greatest delineation in VLSI has been in the type of parallelism exploited

Page 18

Moore’s Law

• 1965: the number of transistors per square inch doubles every 1.5 years

Gordon Moore, co-founder of Intel Corp.

Credit: shigeru23 (CC-BY-SA-3.0)

Page 19

Evolution of Intel Microprocessors

Intel 8088, 1978: 1 core, no cache, 29K transistors
Intel Nehalem-EX, 2009: 8 cores, 24 MB cache, 2.3B transistors
Intel Knights Landing, 2016(?): 72 cores, 16 GB DDR3 LLC(?), 7.1B transistors

Figure credits: ExtremeTech; “The Future of Microprocessors,” Communications of the ACM, vol. 54, no. 5

Page 20

Transistors & Clock Frequency

Credit: The Future of Computing Performance: Game Over or Next Level?, The National Academies Press, 2011

Page 21

Impact of Device Shrinkage

• What happens when the transistor size shrinks by a factor of x?

• Clock rate increases by x because wires are shorter
  – Actually less than x, because of power consumption

• Transistors per unit area increase by x²

• Die size also tends to increase
  – Typically another factor of ~x

• Raw computing power of the chip goes up by ~x⁴!
  – Typically x³ of this is devoted to either on-chip
    • Parallelism: hidden parallelism such as ILP
    • Locality: caches

• So most programs run x³ times faster, without changing them
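Putting the factors above together as a single back-of-the-envelope expression (the same estimate the slide makes):

    \text{raw compute} \;\propto\; \underbrace{x}_{\text{clock}} \times \underbrace{x^{2}}_{\text{transistor density}} \times \underbrace{x}_{\text{die area}} \;=\; x^{4}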

Page 22

Architectural Trends

• Greatest trend in the VLSI generation is
  – Increase in parallelism
  – Up to 1985: bit-level parallelism
    • 4-bit → 8-bit → 16-bit, slows after 32-bit
    • Adoption of 64-bit now under way, 128-bit far off (not a performance issue)
  – Mid 80s to mid 90s: instruction-level parallelism
    • Pipelining and simple instruction sets + compiler advances (RISC)
    • On-chip caches and functional units ⇒ superscalar execution
    • Greater sophistication: out-of-order execution, speculation, prediction
      – To deal with control transfer and latency problems
  – Now: thread-level parallelism
    • Multicore processors

Page 23

Phases in “VLSI” Generation

[Chart: transistors per chip (1,000 to 100,000,000, log scale) vs. year (1970–2005), with processors from the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]
  – Bit-level parallelism (up to ~1985): 4 bits → 8 bits → 16 bits → 32 bits
  – Instruction-level parallelism (mid 80s to mid 90s): pipelining, superscalar, branch prediction, out-of-order execution, VLIW
  – Thread-level parallelism (?): multicore / manycore, heterogeneous multicore, GPGPU

Page 24

Superscalar Execution

• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel

2-way superscalar execution pipeline:
  1. Load  R1, @1000
  2. Load  R2, @1008
  3. Add   R1, @1004
  4. Add   R2, @100C
  5. Add   R1, R2
  6. Store R1, @2000

(Instructions 1 and 2 are independent and can issue together, as can 3 and 4; instruction 5 needs the results of both, and 6 needs the result of 5.)

Page 25

Out-of-order Execution

• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel
  – Reorders instructions around dependencies

2-way superscalar execution pipeline (the figure marks the dependencies among the instructions):
  1. Load  R1, @1000
  2. Load  R2, @1008
  3. Add   R1, @1004
  4. Add   R2, @100C
  5. Add   R1, R2
  6. Store R1, @2000

Page 26

Architecture-driven Advances

Credit: The Future of Microprocessors, Communications of the ACM, 2011

Page 27

Limitation in ILP

• Speedup begins to saturate after an issue width of 4
  – Assumption of an ideal machine
    • Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – But with realistic memory
    • Real caches and non-zero miss latencies

[Plots: fraction of total cycles (%) vs. number of instructions issued, and speedup vs. instructions issued per cycle (issue width)]

Page 28

Instruction-level Parallelism Concerns

• Issue waste

[Figure: issue slots (4-way) over cycles, showing full and empty issue slots; horizontal waste = 9 slots, vertical waste = 12 slots]
Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995

• Contributing factors
  – Instruction dependencies
  – Long-latency operations within a thread

Page 29

Some Sources of Wasted Issue Slots

• Memory hierarchy
  – TLB miss
  – Instruction cache miss
  – Data cache miss
  – Load delays (L1 hit latency)

• Control flow
  – Branch misprediction

• Instruction stream
  – Instruction dependency
  – Memory conflict

Page 30

Simulation of 8-issue Superscalar Processor

Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995

Highly underutilized processor cycles
  • In most of the SPEC92 applications
  • On average < 1.5 IPC (19%)
  • The dominant source of waste differs by application

Page 31

Single-Thread Performance Curve

The rate of single-thread performance improvement has decreased

[Figure: single-thread performance over time, with technology-driven and architecture-driven contributions marked; from Computer Architecture – Hennessy & Patterson]

Page 32

Single-chip Multiprocessor (Multicore)

• Limited instruction-level parallelism
• → Exploits thread-level parallelism instead

[Figure: a 6-way superscalar processor vs. a 4 × 2-way multiprocessor; “30% area” annotation]
Credit: K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The Case for a Single-Chip Multiprocessor,” ASPLOS, 1996

Page 33

Why Parallel Computing?

• Application demands

• Architectural Trends

• The impact of technology and power

Page 34

Desperate Cooling?

Page 35

Cooling

• Power delivery, packaging, and cooling costs
  – Increased cost of thermal packaging
    • $1 per watt for CPUs above 35 W [Tiwari, DAC98]
  – At the high end, 1 W of cooling for every 1 W of power

• Cooling
  – Physical solutions
    • Heatsink, thermal paste, fan
  – APM (Advanced Power Management)
    • BIOS manages clock speed
    • Triggered by thermal diodes
  – ACPI (Advanced Configuration and Power Interface)
    • OS manages power for all devices
    • Idle, nap, sleep modes

Page 36

Power Consumption Trend

• Reasons for the power trend
  – Smaller feature size
  – Increased compaction
  – Higher clock frequency
  – More complicated processors

[Table, source: Intel (2002)]
                      2000          2005          2010          2015
  Process             180 nm        90 nm         55 nm         3 nm
  Transistors         0.2 B         1.2 B         7.8 B         50 B
  Supply voltage      1.8-1.5 V     1.2-0.9 V     0.7 V         0.5 V
  Frequency           1 GHz         3 GHz         10 GHz        30 GHz
  Power               100 W         175 W         300 W         540 W

[Figure: power density (W/cm²) vs. year (1970–2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, approaching the power densities of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]

Page 37

Power Density Limits Serial Performance

• Single-core processor
  – Dynamic power = C·V²·f
    • V = supply voltage, f = frequency, C = capacitance
  – Increasing frequency (f) requires increasing the supply voltage (V) → cubic effect

• Multicore processor
  – Dynamic power = C·V²·f
  – Increasing the number of cores increases capacitance (C) only linearly

• More transistors by Moore’s law → more cores for higher performance instead, or the same performance at reduced power
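A compact restatement of the cubic effect (under the common simplifying assumption that the required supply voltage scales roughly linearly with frequency):

    P_{\text{dyn}} = C V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}

By contrast, adding cores at a fixed frequency grows C, and therefore dynamic power, only linearly, which is the slide’s argument for spending transistors on cores rather than on clock rate.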

Page 38

[Chart: power consumption (watts) of a Core 2 Duo 1.2 GHz (mobile), Core 2 Duo 2.4 GHz (mobile), Core 2 Duo 3.0 GHz, and Core 2 Quad 3.0 GHz. Source: MIT 6.198 IAP]

Page 39

Evolution of Microprocessors

• Chip density continues to increase 2× every 2 years
• Clock speed does not
• The number of cores increases
• Power no longer increases

Page 40

Recent Multicore Processors

• 2H 15: Intel Knights Landing
  – 60+ cores; 4-way SMT; 16 GB (on package)
• 2015: Oracle SPARC M7
  – 32 cores; 8-way fine-grain MT; 64 MB cache
• Fall 14: Intel Haswell
  – 18 cores; 2-way SMT; 45 MB cache
• June 14: IBM Power8
  – 12 cores; 8-way SMT; 96 MB cache
• Sept 13: SPARC M6
  – 12 cores; 8-way fine-grain MT; 48 MB cache
• May 12: AMD Trinity
  – 4 CPU cores; 384 graphics cores
• Q2 13: Intel Knights Corner (coprocessor)
  – 61 cores; 2-way SMT; 16 MB cache
• Feb 12: Blue Gene/Q
  – 16+1+1 cores; 4-way SMT; 32 MB cache

[Figure: Blue Gene/Q compute chip]
Credit: The IBM Blue Gene/Q Compute Chip, IEEE Micro 2012

Page 41

Simultaneous Multithreading (SMT)

• Permit several independent threads to issue to multiple functional units each cycle

Credit: R. Kalla, B. Sinharoy, J. Tendler, “Simultaneous Multi-threading Implementation in POWER5,” Hot Chips, 2003

Page 42

Simultaneous Multithreading (SMT)

• SMT is easily added to a superscalar architecture
  – Two sets of PCs and registers
  – The PCs share the instruction-fetch unit
  – Register renaming maps the second set of registers

Credit: R. Kalla, B. Sinharoy, J. Tendler, “Simultaneous Multi-threading Implementation in POWER5,” Hot Chips, 2003

Page 43

Large-scale Parallel Hardware

• Top 7 sites in Top500 supercomputers (Nov. 2014)

148th, 169th, 170th: Korea Meteorological Administration

Credit: top500.org

Page 44

Cores/Socket & Cores in Top500

[Chart: number of Top500 systems (0–500) over 1993–2008, broken down by core count, with bins from 1, 2, 3-4, 5-8, … up to 64k-128k and 128k+]
Credit: top500.org

Page 45

Blue Gene/Q Packing Hierarchy

1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single-chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s

Credit: The Blue Gene/Q Compute Chip, Hot Chips 2011

Page 46

Summary: Why Parallel Computing?

• If you can’t exploit parallelism, performance becomes a zero-sum game
  – If you want to add a new feature and maintain performance, you will need to remove something else.

• The future is “multicore” (i.e., parallel) according to all major processor vendors
  – We expect more and more cores to be delivered with each generation

• Starting now, everyone will need to know how to write parallel programs
  – It is no longer a niche area: it is mainstream.

Page 47

References

• Some slides are adopted and adapted from
  – CS194: Parallel Programming by Prof. Katherine Yelick at UC Berkeley
  – CS267: Applications of Parallel Computers by Prof. Jim Demmel at UC Berkeley
  – CS258: Parallel Computer Architecture by Prof. David E. Culler at UC Berkeley
  – COMP422: Parallel Computing by Prof. John Mellor-Crummey at Rice Univ.