heterogeneous cmp and its sea platform · 2013. 2. 26. · 2008 intel china multi-core academic...

2008 Intel China Multi-core Academic Forum Prof. Dongsheng Wang (汪东升) [email protected] Microprocessor&SoC Tech Center, Tsinghua University http://CPU.tsinghua.edu.cn Heterogeneous CMP and Its SEA Platform

Post on 29-Aug-2020

Page 1

2008 Intel China Multi-core Academic Forum

Heterogeneous CMP and Its SEA Platform

Prof. Dongsheng Wang (汪东升) [email protected]

Microprocessor & SoC Tech Center, Tsinghua University

http://CPU.tsinghua.edu.cn

Page 2

• Some facts
• Heterogeneous CMP with Reconfigurable Logic
• Research on Platform On Chip - NoC with cache coherent support
• Simulator/Emulator/Accelerator (SEA) Platform
• Summary

Page 4

New Moore’s Law

2X processors (“cores”) per socket every 2 years, at the same clock frequency

Conservative: 2007: 4 cores; 2009: 8 cores; 2011: 16 cores

• A sea change for the HW and SW industries, since it changes the programmer model and responsibilities
• The HW/SW industries have bet the farm that parallelism will succeed

--- RAMP Tutorial, ASPLOS’08
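The doubling rule on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cores_per_socket` and its parameter names are made up here, and the defaults (4 cores in 2007, doubling every 2 years) are the slide's conservative figures.

```python
def cores_per_socket(year, base_year=2007, base_cores=4, doubling_period_years=2):
    """Cores per socket under the slide's rule: 2X cores every 2 years.

    All names and defaults are illustrative; only the 2007/2009/2011
    data points come from the slide.
    """
    if year < base_year:
        raise ValueError("year must not be earlier than base_year")
    # Count only completed doubling periods since the base year.
    doublings = (year - base_year) // doubling_period_years
    return base_cores * 2 ** doublings


for year in (2007, 2009, 2011):
    print(year, cores_per_socket(year))
```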

Page 5

Future silicon Platform

Hardware is flexible, SW is hard to change
--- David A. Patterson

Design space

Multicore Processor

Page 6

Trends of semiconductor

[Figure: Makimoto’s curve, alternating between "standard" and "custom" across 1957, 1967, 1977, 1987, 1997 and 2007. Labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGA; Coarse grain RAs; "?"]

• hardwired: algorithm fixed, resource fixed
• Procedural programming: algorithm variable, resource fixed
• Structural programming: algorithm variable, resource variable

Reconfigurable system: production standardized, application customized

Reconfigurable computing combines standardized production with application-specific customization, and will be the direction in which future architectures develop. Moreover, because its multidimensional, time-and-space-based style of computation breaks through the limitations of the von Neumann architecture, reconfigurable computing will have strong vitality. It may break the semiconductor industry's fate of a turnover every 10 years and achieve continuous development.

Page 7

Design Trends vs. software challenge

At 90nm and 65nm, more than half the system challenge is designing the software
--- Nick Flaherty 2006

Design Trends:
• More cores
• More cache capacity
• ……

Software challenge:
• Language
• Compiler
• Profiling
• Debugging
• ……

Page 8

Research Area
• Design space exploration
• Memory hierarchy
• Cache coherence
• Programming model
• NoC (Network on Chip)
• Multi-core simulation & emulation
• OS and compiler
• Tuning and debugging…

Page 9

Collaborative Projects

Optimizing Memory Access in CMP with Transactional Memory

Heterogeneous CMP with Reconfigurable Logic

Research on Platform On Chip – NoC with cache coherent support

ASIM-based Heterogeneous CMP Simulator/Emulator/Accelerator (SEA) Platform

Page 10

Heterogeneous CMP with Reconfigurable Logic

Page 11

Get High Performance by Accelerating Kernel Code

[Figure labels: Application; Flow control; Loop 1, Loop 2, Loop 3; Kernels; GPP; Reconfigurable Units]

Page 12

CMP with Reconfigurable Logic

[Figure labels: high-ILP computation; low-ILP computation + OS + VM; CPU (multi-core); Reconfigurable Logic; Memory; Tight coupling]

Critical Issues
• Connection and Communication Between RL and CPU
• Memory Hierarchy
• Cache Coherence
• RL Organization
• Programming model

Page 13

Good Applications for RL

• Relatively small application graph
  - FPGAs have limited capacity
  - Simple control flow helps a lot
• Data Parallelism
  - Execute same computations on many independent data elements
  - Pipeline computations through the hardware
• Small and/or varying bit widths
  - Take advantage of the ability to customize the size of operators

{slide from UIUC lecture 15, 2007}

Page 14

Reconfigurable Computing Successes

• RSA Decryption
  - Programmable-Active-Memory machine set record for decryption of RSA-encrypted data
• DNA Sequence Matching
  - Reconfigurable hardware has achieved 1000x better performance than contemporary supercomputers
• Signal Processing
  - FPGA-based filters often get 10x better performance than DSP chips
  - Benefit from customization of hardware to the application
• Emulation
  - Use reconfigurable logic to simulate new processors at high speeds
• Cryptographic Attacks
  - High-performance low-cost implementations for breaking encryption algorithms

{slide from UIUC lecture 15, 2007}

Page 15

Simulation of Baseline Architecture

[Figure labels: CPU cores (four CPU Core blocks); Reconfigurable Fabrics (Fabric blocks); Network On Chip; Memory]

• The architecture in the figure is simulated on the GEMS simulator. It consists of the following modules:
  - CMP
  - RL
  - Memory
  - Interconnection

Page 16

Speed-up with rl%

[Figure: two plots of speed-up (y axis) against RL% from 55% to 99% (x axis), each with curves for k = 10, 20, 30, 40, 50, 60. Left: overhead = 4%, speed-up axis 0 to 16. Right: overhead = 8%, speed-up axis 0 to 10.]

• Speed-up is increasing with rl%
• When rl% > 90%, k plays a more important role in system performance improvement.

Page 17

Speed-up with overhead

Even when rl% is large (95%), speed-up is clearly limited by overhead. When overhead% exceeds 8%, speed-up increases little as k increases.

[Figure: two plots of speed-up (y axis) against overhead% from 1% to 10% (x axis), each with curves for k = 10, 20, 30, 40, 50, 60. Left: rl% = 85%, speed-up axis 0 to 7. Right: rl% = 95%, speed-up axis 0 to 30.]

Page 18

Ways to improve performance

• Let RL do more work
  - Increase rl%
• Reduce the overhead
  - Data prefetch
  - Schedule the config file (load)
• Improve the speed-up of kernels
  - Use high frequency Reconfigurable Logic
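The slides do not state the theoretical model behind the speed-up plots. The sketch below assumes an Amdahl-style model, which is an assumption made here for illustration and not the author's formula: `rl` is the fraction of the original run time moved to RL, `k` is the speed-up of those kernels on RL, and `overhead` is the extra time (configuration loading, data transfer) as a fraction of the original run time. The function names `speedup` and `speedup_limit` are hypothetical.

```python
def speedup(rl, k, overhead):
    """Assumed Amdahl-style model (not taken from the slides).

    rl       -- fraction of original run time executed on RL (0..1)
    k        -- speed-up of the kernels when run on RL (> 0)
    overhead -- extra time as a fraction of original run time (>= 0)
    """
    if not 0.0 <= rl <= 1.0 or k <= 0 or overhead < 0:
        raise ValueError("expected 0 <= rl <= 1, k > 0, overhead >= 0")
    return 1.0 / ((1.0 - rl) + rl / k + overhead)


def speedup_limit(rl, overhead):
    """Upper bound of the assumed model as k grows without limit."""
    return 1.0 / ((1.0 - rl) + overhead)


# One baseline point, then one change per lever from the list above.
baseline = speedup(rl=0.85, k=20, overhead=0.08)
more_rl = speedup(rl=0.95, k=20, overhead=0.08)        # let RL do more work
less_overhead = speedup(rl=0.85, k=20, overhead=0.04)  # reduce the overhead
faster_kernel = speedup(rl=0.85, k=40, overhead=0.08)  # faster kernels

for name, value in [("baseline", baseline), ("more rl%", more_rl),
                    ("less overhead", less_overhead),
                    ("larger k", faster_kernel)]:
    print(f"{name:14s} {value:.2f}")
```

Under this assumed model each lever maps to one parameter, and `speedup_limit` shows why overhead caps the benefit of a larger k: once `rl / k` is small, the remaining terms `(1 - rl) + overhead` dominate the run time.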

Page 19

Elliptic Curve Cryptography

ECC is the next generation asymmetric crypto after RSA with better performance.

ECC Protocol Stack

• Protocols
  - ECDH: 2 point mults/enc, 1 point mult/dec
  - ECDSA: 1 point mult/enc, 2 point mults/dec
• Point Multiplication
  - Montgomery Algo.: m point additions, m point doublings, 1 inversion
• Point add and double
  - Point addition: 4 multiplications, 1 square, 2 additions
  - Point doubling: 2 multiplications, 4 squares, 1 addition
• Finite field arithmetic
  - Prime field GF(p) & Binary field GF(2^m)
  - Addition, Multiplication, Square, Inversion

Ultimately, it is all about addition/multiplication/square on finite field
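To make the bottom layer of the stack concrete, here is a minimal software sketch of binary-field arithmetic. It is illustrative only: it uses the small field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 as a stand-in for the much larger field used in the talk, the function names are made up here, and the bit-serial method shown is not claimed to be the structure of the author's hardware units. Field elements are integers whose bits are polynomial coefficients.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, chosen here as a small example
M = 8             # field degree of the example field GF(2^8)


def gf_add(a, b):
    """Addition in GF(2^m) is coefficient-wise XOR."""
    return a ^ b


def gf_mul(a, b, poly=AES_POLY, m=M):
    """Shift-and-add multiplication, reducing modulo poly at each step.

    Both inputs are assumed to be already reduced (below 2**m).
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:  # degree reached m, so subtract (XOR) the modulus
            a ^= poly
    return result


def gf_sqr(a, poly=AES_POLY, m=M):
    """Squaring, written here simply as a multiplication by itself."""
    return gf_mul(a, a, poly, m)


def gf_inv(a, poly=AES_POLY, m=M):
    """Inversion by Fermat's little theorem: a^(2^m - 2)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result, base, exponent = 1, a, (1 << m) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base, poly, m)
        base = gf_sqr(base, poly, m)
        exponent >>= 1
    return result


product = gf_mul(0x57, 0x83)
inverse = gf_inv(0x53)
print(hex(product), hex(inverse))
```

Inversion built this way costs many multiplications and squarings, which is consistent with the stack above treating it as the most expensive field operation and using it only once per point multiplication.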

Page 20

Design Diagram

[Figure labels: Point Multiplier of GF(2^m); Point Addition; Point Double; Coordinates Converter; MUL GF(2^m) (three units); SQR GF(2^m) (three units); ADD GF(2^m) (three units); INV GF(2^m); X2 (three times); one further ADD GF(2^m) and SQR GF(2^m)]

Page 21

Speedup of RL (best performance) vs. General CPU

EC(2^409)   Pentium 4 3.4 GHz   Conroe 2.4 GHz   FPGA      FPGA/Conroe
GF a+b      10 ns               12 ns            3 ns      4
GF a^2      470 ns              235 ns           3.7 ns    64
GF a*b      3.2 µs              1.6 µs           10.2 ns   157
GF a^-1     9.3 ms              4.9 ms           224 ns    21000
P + Q       13.7 µs             6.9 µs           51 ns     141
P + P       8.3 µs              4.1 µs           51 ns     81
k * P       9.0 ms              4.5 ms           29.4 µs   153

Area: 151951 LUTs (219% of XCVLX110T)
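The FPGA/Conroe column can be recomputed from the Conroe and FPGA timings. The sketch below copies those timings from the table and converts them to nanoseconds; the names `TIMINGS_NS` and `fpga_speedup` are made up here. Because the printed timings are rounded, some recomputed ratios (for example P + Q and GF a^-1) come out a little different from the printed column.

```python
# Timings copied from the table, converted to nanoseconds.
# Each entry: operation -> (Conroe 2.4G time, FPGA time)
TIMINGS_NS = {
    "GF a+b": (12.0, 3.0),
    "GF a^2": (235.0, 3.7),
    "GF a*b": (1_600.0, 10.2),
    "GF a^-1": (4_900_000.0, 224.0),
    "P + Q": (6_900.0, 51.0),
    "P + P": (4_100.0, 51.0),
    "k * P": (4_500_000.0, 29_400.0),
}


def fpga_speedup(conroe_ns, fpga_ns):
    """Speed-up of the FPGA over the Conroe CPU for one operation."""
    return conroe_ns / fpga_ns


ratios = {op: fpga_speedup(c, f) for op, (c, f) in TIMINGS_NS.items()}

for op, ratio in ratios.items():
    print(f"{op:8s} {ratio:10.1f}")
```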

Page 22

ccNoC - Network on Chip with Cache Coherent Support

Page 23

ccNoC - Network on Chip with Cache Coherent Support

• Communication Mechanism of Future Multi-core System
• Also suitable for our baseline structure
• Maintain Cache Coherence by NoC
• Lighten the burden of CPU private cache
• Harmonize different cache protocols
• Performance, power consumption and scalability

Page 24

Structure of ccNoC

Page 25

Features
• Connect CPUs and RLs using Network on Chip (NoC)
  - Separate communication from computation
  - Scalability and flexibility
  - Easy Design (T2M)
• Communication
  - Message passing
  - Shared memory

[Figure: grid of alternating CPU and RL tiles, with nodes labelled S between them]

Page 26

Simulator/Emulator/Accelerator (SEA) Platform

Page 27

Problems with “Manycore” Sea Change

• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … are not ready for 1000 CPUs / chip
• ≈ Only companies can build HW, and it takes years
• Software people don’t start working hard until hardware arrives
• 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
• How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, …?
• Can we avoid waiting years between HW/SW iterations?

RAMP (Research Accelerator for Multiple Processors)

http://ramp.eecs.berkeley.edu/Publications/RAMP%20Implementation.ppt

-- by Patterson

Page 28

SEA Platform

Page 29

Summary
• Architectural simulation on GEMS simulator
• Simulation of ccNoC
• Theoretical Model to analyze the system
• Port ECC to Reconfigurable Logic to get the speed-ups
• Task/Resource Scheduling
• NSF Support
• Connections

Page 30

Connections

• PostDoc: Tao Wang
• Yuan Liu
• Intel hires
  - PhD Student: Peng Li (2003~2007)
  - Master Students (2004~2007): Yan Hao, Kebing Wang, Zhiqiang Liu, Changdong Cui, ……

Page 32

Future Work

• Porting more applications
  - Bioinformatics
  - Financial Analysis
• Provide Platform-on-Chip - NoC with Cache Coherent support
• Combining ASIM/GEMS Simulators and FPGAs to Support Heterogeneous CMP with RL
  - SPARC Core on ASIM
  - SEA (Multicore Simulator/Emulator/Accelerator) Platform
• Research on programming model on the above architecture

Page 33

THANKS
