heterogeneous cmp and its sea platform · 2013. 2. 26. · 2008 intel china multi-core academic...

2008 Intel China Multi-core Academic Forum

Prof. Dongsheng Wang (汪东升)

Microprocessor & SoC Tech Center, Tsinghua University

http://CPU.tsinghua.edu.cn

Heterogeneous CMP and Its SEA Platform


Some facts

Heterogeneous CMP with Reconfigurable Logic

Summary

Research on Platform On Chip-NoC with cache coherent support

Simulator/Emulator/Accelerator (SEA) Platform


New Moore’s Law

2X processors or “cores” per socket every 2 years, same clock frequency

Conservative: 2007 4 cores, 2009 8 cores, 2011 16 cores

•Sea change for HW and SW industries, since it changes the programmer model and responsibilities

•HW/SW industries have bet the farm that parallel will be successful

--- RAMP Tutorial, ASPLOS’08


Future silicon Platform

Hardware is flexible, SW is hard to change
--- David A. Patterson

Design space

Multicore Processor


?

Trends of semiconductor

[Figure: Makimoto’s curve. The time axis runs from 1957 to 2007 in ten-year steps, with the curve swinging between “standard” and “custom”. Labels along the curve: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGA; coarse grain RAs; and “?” after 2007.]

Reconfigurable system: production standardized, application customized

•Hardwired: algorithm fixed, resource fixed
•Procedural programming: algorithm variable, resource fixed
•Structural programming: algorithm variable, resource variable

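Reconfigurable computing combines standardized production with application-level customization, and is likely to be the direction in which future architectures develop. Moreover, because its multi-dimensional, time-and-space-based style of computation breaks through the limits of the von Neumann architecture, reconfigurable computing should prove highly durable. It may break the semiconductor industry's fate of alternating every 10 years and enjoy sustained development.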


Design Trends vs. software challenge

Design Trends:
•More cores
•More cache capacity
•……

Software challenge:
•Language
•Compiler
•Profiling
•Debugging
•……

At 90nm and 65nm, more than half the system challenge is designing the software
--- Nick Flaherty 2006


Research Area
• Design space exploration
• Memory hierarchy
• Cache coherence
• Programming model
• NoC (Network on Chip)
• Multi-core simulation & emulation
• OS and compiler
• Tuning and debugging
• …


Collaborative Projects

Optimizing Memory Access in CMP with Transactional Memory


Research on Platform On Chip – NoC with cache coherent support

ASIM-based Heterogeneous CMP Simulator/Emulator/Accelerator (SEA) Platform


Get High Performance by Accelerating Kernel Code

[Figure: an application split between a GPP (flow control) and reconfigurable units (kernels: Loop 1, Loop 2, Loop 3).]


CMP with Reconfigurable Logic

[Figure: a multi-core CPU (low-ILP computation + OS + VM) tightly coupled with reconfigurable logic (high-ILP computation), together with memory.]

Critical Issues

•Connection and Communication Between RL and CPU

•Memory Hierarchy

•Cache Coherence

•RL Organization

•Programming model


Good Applications for RL

•Relatively small application graph
  - FPGAs have limited capacity
  - Simple control flow helps a lot

•Data Parallelism
  - Execute same computations on many independent data elements
  - Pipeline computations through the hardware

•Small and/or varying bit widths
  - Take advantage of the ability to customize the size of operators

{slide from UIUC lecture 15, 2007}


Reconfigurable Computing Successes

•RSA Decryption
  - Programmable-Active-Memory machine set record for decryption of RSA-encrypted data

•DNA Sequence Matching
  - Reconfigurable hardware has achieved 1000x better performance than contemporary supercomputers

•Signal Processing
  - FPGA-based filters often get 10x better performance than DSP chips
  - Benefit from customization of hardware to the application

•Emulation
  - Use reconfigurable logic to simulate new processors at high speeds

•Cryptographic Attacks
  - High-performance low-cost implementations for breaking encryption algorithms

{slide from UIUC lecture 15, 2007}



Simulation of Baseline Architecture

[Figure: baseline architecture. Four CPU cores and six reconfigurable fabrics are connected by a Network on Chip, with memory attached.]

• The architecture above is simulated on the GEMS simulator. It consists of the following modules:
  - CMP
  - RL
  - Memory
  - Interconnection


Speed-up with rl%

[Figure: two panels plotting speed-up against rl% (55% to 99%), one for overhead = 4% (speed-up axis 0 to 16) and one for overhead = 8% (speed-up axis 0 to 10). Each panel has one curve per k = 10, 20, 30, 40, 50, 60.]

• Speed-up increases with rl%
• When rl% > 90%, k plays a more important role in system performance improvement.


Speed-up with overhead

Even when rl% is large (95%), speed-up is clearly restricted by overhead. When overhead% is more than 8%, speed-up increases little with increasing k.

[Figure: two panels plotting speed-up against overhead% (1% to 10%), one for rl% = 85% (speed-up axis 0 to 7) and one for rl% = 95% (speed-up axis 0 to 30). Each panel has one curve per k = 10, 20, 30, 40, 50, 60.]
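The slides do not give the formula behind these curves. As a rough illustration only, the sketch below assumes an Amdahl-style model in which a fraction rl of the original run time is moved to the reconfigurable logic and sped up k times, while overhead (configuration loading, data transfer) is added as a fraction of the original run time. The function name `speedup` and this exact formula are assumptions, not the authors' model, and the numbers it prints are not claimed to match the plots.

```python
def speedup(rl: float, k: float, overhead: float) -> float:
    """Amdahl-style estimate (an assumed model, not taken from the slides).

    rl       -- fraction of original run time moved to reconfigurable logic
    k        -- speed-up of the kernels on reconfigurable logic
    overhead -- extra time, as a fraction of the original run time
    """
    if not 0.0 <= rl <= 1.0:
        raise ValueError("rl must be a fraction between 0 and 1")
    if k <= 0 or overhead < 0:
        raise ValueError("k must be positive and overhead non-negative")
    # CPU part + accelerated part + overhead, all relative to a run time of 1.
    return 1.0 / ((1.0 - rl) + rl / k + overhead)


# Sweep the same parameter ranges as the plots (k = 10..60, overhead 1%..10%).
for rl in (0.85, 0.95):
    print(f"rl% = {rl:.0%}")
    for overhead in (0.01, 0.04, 0.08, 0.10):
        row = ", ".join(f"k={k}: {speedup(rl, k, overhead):.2f}"
                        for k in range(10, 61, 10))
        print(f"  overhead = {overhead:.0%} -> {row}")
```

Under this assumed model the speed-up can never exceed 1 / ((1 - rl) + overhead), however large k becomes, which is one way to read the observation that overhead limits the benefit of faster kernels.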


Ways to improve performance

•Let RL do more work
  - Increase rl%

•Reduce the overhead
  - Data prefetch
  - Schedule the config file (load)

•Improve the speed-up of kernels
  - Use high frequency Reconfigurable Logic


Elliptic Curve Cryptography

• Finite field arithmetic: prime field GF(p) & binary field GF(2^m)
  - Addition
  - Multiplication
  - Square
  - Inversion

• Point add and double
  - Point addition: 4 multiplications, 1 square, 2 additions
  - Point doubling: 2 multiplications, 4 squares, 1 addition

• Point multiplication
  - Montgomery algorithm: m point additions, m point doublings, 1 inversion
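The operation count quoted for the Montgomery algorithm (one point addition and one point doubling per bit of the scalar) can be seen in a generic ladder. The sketch below is illustrative only: the function name `montgomery_ladder` is made up here, and plain integers under addition stand in for elliptic-curve points so that the result is easy to check. It does not model the final inversion, which presumably comes from converting projective coordinates back to affine.

```python
def montgomery_ladder(k, point, add, double, zero):
    """Compute k * point with one add and one double per bit of k.

    add, double and zero describe the group; here they are supplied by the
    caller, so the same ladder works for any additive group.
    """
    r0, r1 = zero, point          # invariant: r1 - r0 == point
    for bit in bin(k)[2:]:        # scan the bits of k, most significant first
        if bit == "1":
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# Stand-in group: integers under addition, with operation counters.
counts = {"add": 0, "double": 0}

def int_add(a, b):
    counts["add"] += 1
    return a + b

def int_double(a):
    counts["double"] += 1
    return 2 * a

result = montgomery_ladder(13, 5, int_add, int_double, 0)
print(result, counts)
```

Because both branches perform the same two operations, the work per bit does not depend on the bit's value; for an m-bit scalar the ladder performs m additions and m doublings.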

ECC is the next generation asymmetric crypto after RSA with better performance.

Protocols

• ECDH
  - 2 point mults/enc
  - 1 point mult/dec

• ECDSA
  - 1 point mult/enc
  - 2 point mults/dec

ECC Protocol Stack

Ultimately, it is all about addition/multiplication/square on finite field
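To make the finite-field layer concrete, here is a minimal software sketch of GF(2^m) arithmetic with field elements stored as Python integers (bit i is the coefficient of x^i). The function names are invented for this example. The slides do not say which reduction polynomial was used, so two standard ones are assumed: the AES polynomial x^8 + x^4 + x^3 + x + 1 for a small check, and the NIST degree-409 polynomial x^409 + x^87 + 1. Inversion uses Fermat's little theorem, which is only one of several possible methods and is not necessarily what the hardware does.

```python
def gf2m_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR (no carries)."""
    return a ^ b


def gf2m_mul(a, b, poly, m):
    """Shift-and-add multiplication, reducing modulo poly at every step.

    poly must include the x^m term; a and b must be below 2^m.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:          # degree reached m: subtract (XOR) poly
            a ^= poly
    return result


def gf2m_sqr(a, poly, m):
    """Squaring, written here as a plain multiplication for clarity."""
    return gf2m_mul(a, a, poly, m)


def gf2m_inv(a, poly, m):
    """Inverse via Fermat: a^(2^m - 2) = a^2 * a^4 * ... * a^(2^(m-1))."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result, power = 1, a
    for _ in range(m - 1):
        power = gf2m_sqr(power, poly, m)
        result = gf2m_mul(result, power, poly, m)
    return result


AES_POLY = 0x11B                              # x^8 + x^4 + x^3 + x + 1
NIST_409_POLY = (1 << 409) | (1 << 87) | 1    # assumed polynomial for m = 409

print(hex(gf2m_mul(0x57, 0x83, AES_POLY, 8)))
x = 0x1234567890ABCDEF
print(gf2m_mul(x, gf2m_inv(x, NIST_409_POLY, 409), NIST_409_POLY, 409))
```

Addition needs no carry chain and squaring in hardware reduces to bit spreading plus reduction, which is consistent with those two operations being the cheapest rows in the timing table, while inversion is by far the most expensive on a general-purpose CPU.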


Design Diagram

[Figure: block diagram of the point multiplier over GF(2^m). Block labels as they appear: Point Multiplier of GF(2^m); Point Addition; Point Double; Coordinates Converter; MUL GF(2^m) (three blocks); SQR GF(2^m) (four blocks); ADD GF(2^m) (four blocks); INV GF(2^m); X2 (three blocks).]


Speedup of RL (best performance) vs. General CPU

EC(2^409)   Pentium 4 3.4G   Conroe 2.4G   FPGA      FPGA/Conroe
GF a+b      10ns             12ns          3ns       4
GF a^2      470ns            235ns         3.7ns     64
GF a*b      3.2us            1.6us         10.2ns    157
GF a^-1     9.3ms            4.9ms         224ns     21000
P + Q       13.7us           6.9us         51ns      141
P + P       8.3us            4.1us         51ns      81
k * P       9.0ms            4.5ms         29.4us    153

Area: 151951 LUTs (219% of XCVLX110T)


ccNoC: Network on Chip with Cache Coherent Support


ccNoC: Network on Chip with Cache Coherent Support

•Communication Mechanism of Future Multi-core System
  - Also suitable for our baseline structure

•Maintain Cache Coherence by NoC
  - Lighten the burden of CPU private cache
  - Harmonize different cache protocols
  - Performance, power consumption and scalability


Structure of ccNoC


•Features
  - Connect CPUs and RLs using Network on Chip (NoC)
  - Separate communication from computation
  - Scalability and flexibility
  - Easy Design (T2M)

•Communication
  - Message passing
  - Shared memory

[Figure: a grid of switches (S), each attached to a CPU or an RL tile, with CPU and RL tiles alternating.]


Simulator/Emulator/Accelerator (SEA) Platform


Problems with “Manycore” Sea Change

• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … are not ready for 1000 CPUs / chip

• ≈ Only companies can build HW, and it takes years

• Software people don’t start working hard until hardware arrives

• 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW

• How to get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, …?

• Can we avoid waiting years between HW/SW iterations? RAMP (Research Accelerator for Multiple Processors)

http://ramp.eecs.berkeley.edu/Publications/RAMP%20Implementation.ppt

-- by Patterson


SEA Platform


Summary
•Architectural simulation on GEMS simulator
•Simulation of ccNoC
•Theoretical Model to analyze the system
•Port ECC to Reconfigurable Logic to get the speed-ups
•Task/Resource Scheduling
•NSF Support
•Connections


Connections

•PostDoc: Tao Wang, Yuan Liu

•Intel hires
  - PhD Student: Peng Li (2003~2007)
  - Master Students (2004~2007): Yan Hao, Kebing Wang, Zhiqiang Liu, Changdong Cui, ……


Future Work

•Porting more applications
  - Bioinformatics
  - Financial Analysis

•Provide Platform-on-Chip: NoC with Cache Coherent support

•Combining ASIM/GEMS Simulators and FPGAs to Support Heterogeneous CMP with RL
  - SPARC Core on ASIM
  - SEA (Multicore Simulator/Emulator/Accelerator) Platform

•Research on programming model on the above architecture


THANKS
