heterogeneous cmp and its sea platform · 2013. 2. 26. · 2008 intel china multi-core academic...
TRANSCRIPT
2008 Intel China Multi-core Academic Forum
Prof. Dongsheng Wang (汪东升)
Microprocessor & SoC Tech Center, Tsinghua University
http://CPU.tsinghua.edu.cn
Heterogeneous CMP and Its SEA Platform
Some facts
Heterogeneous CMP with Reconfigurable Logic
Research on Platform On Chip - NoC with cache coherent support
Simulator/Emulator/Accelerator (SEA) Platform
Summary
New Moore’s Law
2X processors or “cores” per socket every 2 years, same clock frequency
Conservative: 2007 4 cores, 2009 8 cores, 2011 16 cores
• Sea change for HW and SW industries since changing programmer model, responsibilities
• HW/SW industries bet farm that parallel successful
--- RAMP Tutorial, ASPLOS’08
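The doubling rule on the slide above can be written as a one-line formula. The sketch below is only an illustration: the helper name `cores_per_socket` is hypothetical, and the baseline (4 cores in 2007) is taken from the slide's "conservative" projection.

```python
def cores_per_socket(year: int, base_year: int = 2007, base_cores: int = 4) -> int:
    """Projected cores per socket, doubling every 2 years from the baseline.

    The defaults (2007, 4 cores) come from the slide; the function itself
    is an illustrative assumption, not something shown in the talk.
    """
    if year < base_year:
        raise ValueError("year precedes the baseline year")
    doublings = (year - base_year) // 2  # completed 2-year periods
    return base_cores * 2 ** doublings


for y in (2007, 2009, 2011):
    print(y, cores_per_socket(y))
```

Years between two doubling points keep the count of the last completed period; that rounding choice is an assumption of this sketch.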
Future silicon Platform
Hardware is flexible, SW is hard to change - David A. Patterson
[Figure: Design space; Multicore Processor]
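Trends of semiconductor
[Figure: Makimoto’s curve, 1957 to 2007 in 10-year steps, swinging between "standard" and "custom". Labels: TTL; µproc., memory; LSI, MSI; ASICs, accel’s; FPGA; coarse grain RAs]
Reconfigurable system: Production standardized, Application customized
• hardwired: algorithm fixed, resource fixed
• Procedural programming: algorithm variable, resource fixed
• Structural programming: algorithm variable, resource variable
Reconfigurable computing combines standardized production with application customization, and will be the direction of future architecture development. Moreover, because its multidimensional time-space style of computation breaks through the limitations of the von Neumann architecture, reconfigurable computing will have strong vitality. It may break the semiconductor industry's fate of swinging over every 10 years and achieve sustained development.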
Design Trends vs. software challenge
Design Trends:
• More cores
• More cache capacity
• ……
Software challenge:
• Language
• Compiler
• Profiling
• Debugging
• ……
At 90nm and 65nm, more than half the system challenge is designing the software
--- Nick Flaherty 2006
Research Area
• Design space exploration
• Memory hierarchy
• Cache coherence
• Programming model
• NoC (Network on Chip)
• Multi-core simulation & Emulation
• OS and compiler
• Tuning and debugging
• …
Collaborative Projects
Optimizing Memory Access in CMP with Transactional Memory
Heterogeneous CMP with Reconfigurable Logic
Research on Platform On Chip – NoC with cache coherent support
ASIM-based Heterogeneous CMP Simulator/Emulator/Accelerator (SEA) Platform
Heterogeneous CMP with Reconfigurable Logic
Get High Performance by Accelerating Kernel Code
[Figure: an Application split between a GPP and Reconfigurable Units. Labels: Flow control; Kernels (Loop 1, Loop 2, Loop 3)]
CMP with Reconfigurable Logic
high-ILP computation
low-ILP computation + OS + VM
CPU (multi-core)
Reconfigurable Logic
Memory
Tight coupling
Critical Issues
• Connection and Communication Between RL and CPU
• Memory Hierarchy
• Cache Coherence
• RL Organization
• Programming model
Good Applications for RL
• Relatively small application graph
- FPGAs have limited capacity
- Simple control flow helps a lot
• Data Parallelism
- Execute same computations on many independent data elements
- Pipeline computations through the hardware
• Small and/or varying bit widths
- Take advantage of the ability to customize the size of operators
{slide from UIUC lecture 15, 2007}
Reconfigurable Computing Successes
• RSA Decryption
- Programmable-Active-Memory machine set record for decryption of RSA-encrypted data
• DNA Sequence Matching
- Reconfigurable hardware has achieved 1000x better performance than contemporary supercomputers
• Signal Processing
- FPGA-based filters often get 10x better performance than DSP chips
- Benefit from customization of hardware to the application
• Emulation
- Use reconfigurable logic to simulate new processors at high speeds
• Cryptographic Attacks
- High-performance low-cost implementations for breaking encryption algorithms
{slide from UIUC lecture 15, 2007}
Simulation of Baseline Architecture
[Figure: CPU cores and Reconfigurable Fabrics connected by a Network On Chip, with Memory]
• The architecture above, which consists of the following modules, is simulated on the GEMS simulator:
- CMP
- RL
- Memory
- Interconnection
Speed-up with rl%
[Figure: two plots of Speed-up vs. RL% (55% to 99%), one for overhead = 4% and one for overhead = 8%, each with curves for k = 10, 20, 30, 40, 50, 60]
• Speed-up increases with rl%
• When rl% > 90%, k plays a more important role in system performance improvement.
Speed-up with overhead
Even when rl% is large (95%), speed-up is clearly limited by overhead. When overhead% exceeds 8%, speed-up increases little as k increases.
[Figure: two plots of Speed-up vs. overhead% (1% to 10%), one for rl% = 85% and one for rl% = 95%, each with curves for k = 10, 20, 30, 40, 50, 60]
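The slides do not state the formula behind the speed-up curves. As a rough aid for reading them, the sketch below uses an Amdahl-style model, which is an assumption and not the authors' theoretical model: `rl` is the fraction of the original run time moved to reconfigurable logic, `k` is the kernel speed-up on that logic, and `overhead` is the extra cost (for example configuration loading and data transfer) expressed as a fraction of the original run time.

```python
def speedup(rl: float, k: float, overhead: float) -> float:
    """Amdahl-style speed-up estimate (assumed model, not from the slides).

    rl       -- fraction of original run time executed on reconfigurable logic
    k        -- speed-up of that fraction on the reconfigurable logic
    overhead -- added time, as a fraction of the original run time
    """
    if not 0.0 <= rl <= 1.0:
        raise ValueError("rl must be in [0, 1]")
    if k <= 0 or overhead < 0:
        raise ValueError("k must be positive and overhead non-negative")
    # Remaining CPU time + accelerated kernel time + fixed overhead.
    return 1.0 / ((1.0 - rl) + rl / k + overhead)


# Print one row per (overhead, k) pair across a few rl values.
for overhead in (0.04, 0.08):
    for k in (10, 60):
        row = [round(speedup(rl, k, overhead), 2) for rl in (0.55, 0.75, 0.95)]
        print(f"overhead={overhead:.0%} k={k}: {row}")
```

Under this assumed model the speed-up can never exceed 1 / ((1 - rl) + overhead), however large k becomes, which is one way to see why overhead caps the benefit of a faster kernel.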
Ways to improve performance
• Let RL do more work
- Increase rl%
• Reduce the overhead
- Data prefetch
- Schedule the config file (load)
• Improve the speed-up of kernels
- Use high frequency Reconfigurable Logic
Elliptic Curve Cryptography
ECC is the next generation asymmetric crypto after RSA with better performance.
ECC Protocol Stack
Protocols
• ECDH
- 2 point mults/enc
- 1 point mult/dec
• ECDSA
- 1 point mult/enc
- 2 point mults/dec
Point Multiplication
• Montgomery Algo.
- m point additions
- m point doublings
- 1 inversion
Point add and double
• Point addition
- 4 multiplications
- 1 square
- 2 additions
• Point doubling
- 2 multiplications
- 4 squares
- 1 addition
Finite field arithmetic
• Prime field GF(p) & Binary field GF(2^m)
- Addition
- Multiplication
- Square
- Inversion
Ultimately, it is all about addition/multiplication/square on finite field
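Since the whole stack reduces to binary-field arithmetic, a small software sketch of the four GF(2^m) operations may help. This is not the authors' hardware design: elements are held as Python integers (bit i is the coefficient of x^i), multiplication is a plain shift-and-add loop, and inversion uses Fermat's little theorem. The function names are hypothetical. The slides do not say which reduction polynomial is used for m = 409; the NIST polynomial x^409 + x^87 + 1 is assumed here, and the GF(2^8) polynomial from AES is included only because it has well-known check values.

```python
def gf2m_add(a: int, b: int) -> int:
    """Addition in GF(2^m): coefficient-wise XOR, no carries."""
    return a ^ b


def gf2m_mul(a: int, b: int, poly: int, m: int) -> int:
    """Shift-and-add multiplication modulo `poly` (which includes the x^m term).

    Both inputs must already be reduced, i.e. smaller than 2^m.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= poly            # reduce when degree reaches m
    return result


def gf2m_sqr(a: int, poly: int, m: int) -> int:
    """Squaring, done here as a plain multiplication for simplicity."""
    return gf2m_mul(a, a, poly, m)


def gf2m_inv(a: int, poly: int, m: int) -> int:
    """Inversion via Fermat: a^(2^m - 2) = a^2 * a^4 * ... * a^(2^(m-1))."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result = 1
    power = gf2m_sqr(a, poly, m)             # a^2
    for _ in range(m - 1):
        result = gf2m_mul(result, power, poly, m)
        power = gf2m_sqr(power, poly, m)
    return result


AES_POLY = 0x11B                              # x^8 + x^4 + x^3 + x + 1
NIST_409_POLY = (1 << 409) | (1 << 87) | 1    # assumed, not stated in the slides

x = (1 << 200) | 12345
print(gf2m_mul(x, gf2m_inv(x, NIST_409_POLY, 409), NIST_409_POLY, 409))
```

Dedicated hardware can make squaring much cheaper than general multiplication, which this sketch does not attempt to reflect.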
Design Diagram
[Figure: Point Multiplier of GF(2^m), composed of Point Addition and Point Double blocks and a Coordinates Converter, built on multiple MUL GF(2^m), SQR GF(2^m) and ADD GF(2^m) units plus one INV GF(2^m) unit]
Speedup of RL (best performance) vs. General CPU
EC(2^409)   Pentium 4 3.4G   Conroe 2.4G   FPGA      FPGA/Conroe
GF a+b      10ns             12ns          3ns       4
GF a^2      470ns            235ns         3.7ns     64
GF a*b      3.2us            1.6us         10.2ns    157
GF a^-1     9.3ms            4.9ms         224ns     21000
P + Q       13.7us           6.9us         51ns      141
P + P       8.3us            4.1us         51ns      81
k * P       9.0ms            4.5ms         29.4us    153
Area: 151951 LUTs (219% of XCVLX110T)
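The per-operation counts given on the ECC slide (m point additions, m point doublings and 1 inversion per point multiplication) can be tallied for m = 409. The sketch below is only bookkeeping under stated assumptions: the names `OpCount`, `point_mult_ops` and `sequential_time_ns` are hypothetical, the latencies are the FPGA column of the table, and the time estimate assumes one field operation at a time. It ignores any parallelism among the field units, so it is not expected to match the measured k * P figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpCount:
    """Number of GF(2^m) field operations."""
    mul: int
    sqr: int
    add: int
    inv: int = 0


# Counts per point operation, as listed on the ECC slide.
POINT_ADD = OpCount(mul=4, sqr=1, add=2)
POINT_DOUBLE = OpCount(mul=2, sqr=4, add=1)

# FPGA latencies in nanoseconds, copied from the table above.
FPGA_NS = {"mul": 10.2, "sqr": 3.7, "add": 3.0, "inv": 224.0}


def point_mult_ops(m: int) -> OpCount:
    """Field operations for one point multiplication: m adds, m doubles, 1 inversion."""
    return OpCount(
        mul=m * (POINT_ADD.mul + POINT_DOUBLE.mul),
        sqr=m * (POINT_ADD.sqr + POINT_DOUBLE.sqr),
        add=m * (POINT_ADD.add + POINT_DOUBLE.add),
        inv=1,
    )


def sequential_time_ns(ops: OpCount, latency: dict) -> float:
    """Time if every field operation ran strictly one after another (assumption)."""
    return (ops.mul * latency["mul"] + ops.sqr * latency["sqr"]
            + ops.add * latency["add"] + ops.inv * latency["inv"])


ops = point_mult_ops(409)
print(ops)
print(f"{sequential_time_ns(ops, FPGA_NS) / 1000:.1f} us (sequential estimate)")
```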
ccNoC - Network on Chip with Cache Coherent Support
ccNoC - Network on Chip with Cache Coherent Support
• Communication Mechanism of Future Multi-core System
- Also suitable for our baseline structure
• Maintain Cache Coherence by NoC
- Lighten the burden of CPU private cache
- Harmonize different cache protocols
- Performance, power consumption and scalability
Structure of ccNoC
Features
• Connect CPUs and RLs using Network on Chip (NoC)
- Separate communication from computation
- Scalability and flexibility
- Easy Design (T2M)
• Communication
- Message passing
- Shared memory
[Figure: grid of switches (S), each attached to a CPU or an RL]
Simulator/Emulator/Accelerator (SEA) Platform
Problems with “Manycore” Sea Change
• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready for 1000 CPUs / chip
• ≈ Only companies can build HW, and it takes years
• Software people don’t start working hard until hardware arrives
• 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for next iteration of HW/SW
• How to get 1000 CPU systems in hands of researchers to innovate in timely fashion in algorithms, compilers, languages, OS, architectures, … ?
• Can we avoid waiting years between HW/SW iterations?
RAMP (Research Accelerator for Multiple Processors)
http://ramp.eecs.berkeley.edu/Publications/RAMP%20Implementation.ppt
-- by Patterson
SEA Platform
Summary
• Architectural simulation on GEMS simulator
• Simulation of ccNoC
• Theoretical Model to analyze the system
• Port ECC to Reconfigurable Logic to get the speed-ups
• Task/Resource Scheduling
• NSF Support
• Connections
Connections
• PostDoc: Tao Wang, Yuan Liu
• Intel hires
- PhD Student: Peng Li (2003~2007)
- Master Students (2004~2007): Yan Hao, Kebing Wang, Zhiqiang Liu, Changdong Cui, ……
Future Work
• Porting more applications
- Bioinformatics
- Financial Analysis
• Provide Platform-on-Chip - NoC with Cache Coherent support
• Combining ASIM/GEMS Simulators and FPGAs to Support Heterogeneous CMP with RL
- SPARC Core on ASIM
- SEA (Multicore Simulator/Emulator/Accelerator) Platform
• Research on programming model on the above architecture
THANKS