High Performance Embedded Computing with Massively Parallel Processors
Yangdong Steve Deng (邓仰东)
[email protected]@tsinghua.edu.cn
Tsinghua University
Outline
Motivation and background
Morphing GPU into a network processor
High performance radar DSP processor
Conclusion
High Performance Embedded Computing
Future IT infrastructure demands even higher computing power
• Core Internet router throughput: up to 90Tbps
• 4G wireless base station: 1Gbit/s data rate per customer and up to 200 subscribers in service area
• CMU driverless car: 270GFLOPS (giga floating-point operations per second)
• …
~$1M
Fast Increasing IC Costs
Fabrication cost
• Moore’s Second Law: the cost of doubling circuit density increases in line with Moore's First Law.
Design cost
• Now $20-50M per product
• Will reach $75-120M at 32nm node
The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.
Implications of the Prohibitive Cost
ASICs would be unaffordable for many applications!
Scott McGregor, CEO of Broadcom:
• “Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive.”
David Turek, VP of IBM:
• “IBM will be pulling out of Cell development, with PowerXCell 8i to be the company’s last entrance in the technology.”
Multicore Machines Are Really Powerful!
Manufacturer | Processor Type | Model | Model Number | # Cores | GFLOPs FP64 | GFLOPs FP32
AMD | GPGPU | FireStream | 9270 | 160/800 | 240 | 1200
AMD | GPU | Radeon HD | 5870 | 320/1600 | 544 | 2720
AMD | GPU | Radeon HD | 5970 | 640/3200 | 928 | 4640
AMD | CPU | Magny-Cours | | 12 | 362.11 | 362.11
Fujitsu | CPU | SPARC64 | VII | 4 | 128 | 128
Intel | CPU | Core 2 Extreme | QX9775 | 4 | 51.2 | 51.2
nVidia | GPU | Fermi | 480 | 512 | 780 | 1560
nVidia | GPGPU | Tesla | C1060 | 240 | 77.76 | 933.12
nVidia | GPGPU | Tesla | C2050 | 448 | 515.2 | 1288
Tilera | CPU | TilePro | | 64 | 166 | 166
[Pictures: AMD 12-Core CPU, Tilera Tile Gx100 CPU, NVidia Fermi GPU]
GPU: Graphics Processing Unit
GPGPU: General Purpose GPU
Implications
An increasing number of applications would be implemented with multi-core devices
• Huawei: multi-core base stations
• Intel: cluster based Internet routers
• IBM: signal processing and radar applications on Cell processor
• …
Also meets the strong demands for customizability and extendibility
Software Routing with GPU
Background and motivation
GPU based routing processing
• Routing table lookup
• Packet classification
• Deep packet inspection
GPU microarchitecture enhancement
• CPU and GPU integration
• QoS-aware scheduling
Ever-Increasing Internet Traffic
Fast Changing Network Protocols/Services
New services are rapidly appearing
• Data-center, Ethernet forwarding, virtual LAN, …
Personal customization is often essential for QoS
However, today’s Internet heavily depends on 2 protocols
• Ethernet and IPv4, both developed in the 1970s!
Internet Router
…
Internet Router
Backbone network device
Packet forwarding and path finding
Connect multiple subnets
Key requirements
• High throughput: 40G-90Tbps
• High flexibility
Example: Cisco GSR 12416 (6ft x 19” x 2ft), capacity: 160Gb/s, power: 4.2kW
[Figure: Packets → Router → Packets]
Current Router Solutions
Hardware routers
• Fast
• Long design time
• Expensive
• And hard to maintain
Network processor based router
• Network processor: data parallel packet processor
• No good programming models
Software routers
• Extremely flexible
• Low cost
• But slow
Critical Path of Routing Processing
[Figure: block diagram of the routing critical path. Labels: Header Processing (IP Address Lookup, Update Header), Routing Table (IP Addr → Next Hop), Packet Classification, Rule Set (Hdr Fields → Flow), Deep Packet Inspection, Buffer Memory (Data, Hdr), Queue Packet, Switch Fabric]
GPU Based Software Router
[Figure: system diagram. Labels: CPU0-CPU3, Front Side Bus (FSB), North Bridge (memory controller), Main Memory, Memory Bus, two NICs (PCIe 4-lane each), Graphics Card with GPU and GPU Memory (PCIe 16-lane), Internet]
Data level parallelism = packet level parallelism
Routing Table Lookup
Routing table contains network topology information
• Find the output port according to destination IP address
Potentially large routing table (~1M entries)
• Can be updated dynamically
An exemplar routing table:
Destination Address Prefix | Next-Hop | Output Port
24.30.32/20 | 192.41.177.148 | 2
24.30.32.160/28 | 192.41.177.3 | 6
208.12.32/20 | 192.41.177.196 | 1
208.12.32.111/32 | 192.41.177.195 | 5
Routing Table Lookup
Longest prefix match
Memory bound
Usually based on a trie data structure
• Trie: a prefix tree with strings as keys
• A node’s position directly reflects its key
• Pointer operations
• Widely divergent branches!
[Figure: search trie built from the four prefixes of the exemplar routing table (24.30.32/20, 24.30.32.160/28, 208.12.32/20, 208.12.32.111/32)]
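As a rough illustration of longest prefix match on a trie, the sketch below builds a one-bit-per-level binary trie from the exemplar routing table and walks it for a destination address. The binary layout, the `TrieNode` class, and the function names are assumptions made for this example; the slides do not specify the trie variant actually used.

```python
import ipaddress


class TrieNode:
    """One node of a binary trie; children are indexed by the next address bit."""

    def __init__(self):
        self.children = [None, None]
        self.port = None  # output port if a prefix ends at this node


def insert(root, prefix, port):
    """Insert a prefix such as '24.30.32.0/20' together with its output port."""
    net = ipaddress.ip_network(prefix)
    addr = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        bit = (addr >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.port = port


def lookup(root, address):
    """Return the output port of the longest matching prefix, or None."""
    addr = int(ipaddress.ip_address(address))
    node, best = root, root.port
    for i in range(32):
        node = node.children[(addr >> (31 - i)) & 1]  # pointer chasing
        if node is None:
            break
        if node.port is not None:
            best = node.port  # remember the deepest match seen so far
    return best


root = TrieNode()
for prefix, port in [("24.30.32.0/20", 2), ("24.30.32.160/28", 6),
                     ("208.12.32.0/20", 1), ("208.12.32.111/32", 5)]:
    insert(root, prefix, port)

print(lookup(root, "24.30.32.165"))  # inside both 24.30.32/20 and 24.30.32.160/28
```

Every step of `lookup` is a dependent memory read with almost no arithmetic, which is consistent with the "memory bound" and "pointer operations" remarks on the slide.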
GPU Based Routing Table Lookup
Organize the search trie into an array
• Pointer converted to offset with regard to array head
6X speedup even with frequent routing table updates
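A minimal sketch of the array organization, assuming a binary trie stored as a flat list of three integers per node (left offset, right offset, port). The layout and the sentinel values `NO_CHILD` and `NO_PORT` are hypothetical choices for this example, not the layout used in the original work. Because a node is addressed by its offset from the array head, the whole table can be copied to GPU memory as one block.

```python
import ipaddress

NO_CHILD = 0   # offset 0 is the root, which is never a child
NO_PORT = -1


def build_flat_trie(routes):
    """Return a flat list laid out as [left, right, port, left, right, port, ...]."""
    table = [NO_CHILD, NO_CHILD, NO_PORT]  # root node at offset 0
    for prefix, port in routes:
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        node = 0
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            if table[node + bit] == NO_CHILD:
                table[node + bit] = len(table)  # offset from the array head
                table.extend([NO_CHILD, NO_CHILD, NO_PORT])
            node = table[node + bit]
        table[node + 2] = port
    return table


def lookup(table, addr):
    """Longest prefix match using integer offsets only; addr is a 32-bit int."""
    node, best = 0, table[2]
    for i in range(32):
        node = table[node + ((addr >> (31 - i)) & 1)]
        if node == NO_CHILD:
            break
        if table[node + 2] != NO_PORT:
            best = table[node + 2]
    return best


routes = [("24.30.32.0/20", 2), ("24.30.32.160/28", 6),
          ("208.12.32.0/20", 1), ("208.12.32.111/32", 5)]
table = build_flat_trie(routes)

# One lookup per packet; on a GPU each of these would be handled by its own thread.
packets = ["24.30.32.165", "208.12.32.9", "10.0.0.1"]
ports = [lookup(table, int(ipaddress.ip_address(p))) for p in packets]
print(ports)
```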
Packet Classification
Match header fields with predefined rules
Size of rule-sets can be huge (e.g., over 5000 rules)
Rule | Example
Priority | Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Packet filtering | Deny all traffic from ISP3 destined to 166.111.66.77
Traffic rate limit | Ensure ISP2 does not inject more than 10Mbps email traffic on interface 2
Accounting & billing | Treat video traffic to 166.111.X.X as highest priority and perform accounting
Packet Classification
Hardware solution
• Usually with Ternary CAM (TCAM)
• Expensive and power hungry
Software solutions
• Linear search
• Hash based
• Tuple space search: convert the rules into a set of exact matches
GPU Based Packet Classification
A linear search approach
• Scale to rule sets with 20,000 rules
Meta-programming
• Compile rules into CUDA code with PyCUDA
Example rule: treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Generated code:
if ((DA >= 0xA66F4246u) && (DA <= 0xA66F424Du))  /* 166.111.66.70 - 166.111.66.77 */
    priority = 0;
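The meta-programming idea can be sketched without a GPU: a small generator turns a rule list into the source of a linear-search classifier, with one `if` per rule, and then compiles it. The version below emits plain Python and compiles it with `exec` so that it runs anywhere; the slides instead emit CUDA code through PyCUDA. The rule format, the second rule, and the default priority are assumptions made for illustration.

```python
import ipaddress


def ip(text):
    """Convert a dotted-quad string to a 32-bit integer."""
    return int(ipaddress.ip_address(text))


# Hypothetical rule set: (lowest destination, highest destination, priority).
RULES = [
    ("166.111.66.70", "166.111.66.77", 0),
    ("166.111.0.0", "166.111.255.255", 1),
]


def generate_classifier(rules, default=7):
    """Emit the source of a linear-search classifier, one `if` per rule."""
    lines = ["def classify(da):"]
    for lo, hi, priority in rules:
        lines.append(f"    if {ip(lo)} <= da <= {ip(hi)}:  # {lo} - {hi}")
        lines.append(f"        return {priority}")
    lines.append(f"    return {default}")
    return "\n".join(lines)


source = generate_classifier(RULES)
namespace = {}
exec(source, namespace)  # "compile" the generated rules
classify = namespace["classify"]

print(source)
print(classify(ip("166.111.66.72")))
```

Baking the rule constants into the generated code means the classifier reads no rule table from memory at run time, which is one plausible reason a linear search can still scale on a GPU.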
GPU Based Packet Classification
~60X speedup
Deep Packet Inspection (DPI)
Core component for network intrusion detection
• Against viruses, spam, software vulnerabilities, …
[Figure: Snort data flow. Packet stream → Sniffing → Packet Decoder → Preprocessor (plug-ins) → Detection Engine (plug-ins) → Output Stage (plug-ins) → Alerts/Logs. The detection engine performs fixed string matching and regular expression matching.]
Example rule:
alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";
GPU Based Deep Packet Inspection (DPI)
Fixed string match
• Each rule is just a string that is disallowed
• Bloom-filter based search
• One warp for a packet and one thread for a string
• Throughput: 19.2Gbps (30X speed-up over SNORT)
[Figure: Bloom filter operation with three hash functions (Hash 1, Hash 2, Hash 3). Initial Bloom filter: 0 0 0 0 0 0 0 0 0 0 0 0. After pre-processing rules r1, r2, …: 0 1 0 0 1 0 1 0 1 0 0 1. Checking packet content: strings s1, s2, … are hashed and compared against the Bloom vector 0 1 0 0 1 0 1 0 1 0 0 1.]
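The Bloom-filter search can be sketched as follows: disallowed strings are hashed into a bit vector during pre-processing, and every window of the packet payload is then tested against that vector. The filter size, the use of salted SHA-256 as the three hash functions, the fixed window length, and the exact-match verification step are all assumptions of this example rather than details taken from the slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(bytes([k]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


def scan(payload, bloom, rules, length):
    """Return offsets where a window of `length` bytes is a disallowed string."""
    hits = []
    for i in range(len(payload) - length + 1):
        window = payload[i:i + length]
        # The exact check after the filter removes false positives.
        if bloom.might_contain(window) and window in rules:
            hits.append(i)
    return hits


rules = {b"evil", b"worm"}
bloom = BloomFilter()
for rule in rules:
    bloom.add(rule)

print(scan(b"an evil packet", bloom, rules, 4))
```

In the mapping described on the slide, each packet would be assigned to a warp and the per-string checks spread across its threads; the loop above performs the same checks sequentially.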
GPU Based Deep Packet Inspection (DPI)
Regular expression matching
• Each rule is a regular expression, e.g., a|b* = {ε, a, b, bb, bbb, ...}
Aho-Corasick algorithm
• Converts patterns into a finite state machine
• Matching is done by state traversal
• Example: P = {he, she, his, hers}
Memory bound
• Virtually no computation
Compress the state table
• Merging don’t-care entries
Throughput: 9.3Gbps (15X speed-up over SNORT)
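For the pattern set P = {he, she, his, hers}, a compact Aho-Corasick sketch looks like the following. It builds the goto, failure, and output tables and then matches by pure state traversal. Storing transitions in dictionaries is a choice made for readability here; a GPU version would more likely use a dense or compressed state table, as the slide suggests.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    queue = deque(goto[0].values())  # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
            queue.append(nxt)
    return goto, fail, out


def search(text, automaton):
    """Return (start offset, pattern) for every occurrence found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))
```

The inner loop of `search` only follows table entries, which matches the slide's observation that the workload is memory bound with virtually no computation.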
Limitation of GPU-Based Packet Processing
CPU-GPU communication overhead
No QoS guarantee
[Figure: the GPU based software router system diagram with a packet queue. Labels: CPU0-CPU3, Front Side Bus (FSB), North Bridge (memory controller), Main Memory, Memory Bus, two NICs (PCIe 4-lane each), Graphics Card with GPU and GPU Memory (PCIe 16-lane), Internet, Packet queue]
Microarchitectural Enhancements
CPU-GPU integration with a shared memory
• Maintain current CUDA interface
• Implemented on GPGPU-Sim*
*A. Bakhoda, et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.
[Figure: proposed architecture. Labels: Internet, NIC, CPU, NPGPU, GPU, CPU/GPU Shared Memory, Task FIFO, Delayed Commit Queue]
Microarchitectural Enhancements
Uniformly one thread for one packet
• No thread block necessary
• Directly schedule and issue warps
GPU fetches packet IDs from task queue when
• Either a sufficient number of packets are already collected
• Or a given interval passes after last fetch
[Figure labels: CPU-maintained task queue, Delayed Commit Queue, six GPU Cores]
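The two fetch conditions can be expressed as a small batching policy. The sketch below is a software model only: the `TaskQueue` class, its parameter names, and the integer microsecond timestamps are assumptions for illustration, not the hardware mechanism simulated in GPGPU-Sim.

```python
from collections import deque


class TaskQueue:
    """CPU-maintained queue of packet IDs that the GPU drains in batches."""

    def __init__(self, batch_size, max_wait_us):
        self.batch_size = batch_size    # the "sufficient number of packets"
        self.max_wait_us = max_wait_us  # the "given interval" since the last fetch
        self.pending = deque()
        self.last_fetch_us = 0

    def push(self, packet_id):
        self.pending.append(packet_id)

    def try_fetch(self, now_us):
        """Return a batch of packet IDs, or [] if neither condition holds yet."""
        enough = len(self.pending) >= self.batch_size
        timed_out = (bool(self.pending)
                     and now_us - self.last_fetch_us >= self.max_wait_us)
        if not (enough or timed_out):
            return []
        count = min(self.batch_size, len(self.pending))
        batch = [self.pending.popleft() for _ in range(count)]
        self.last_fetch_us = now_us
        return batch


q = TaskQueue(batch_size=4, max_wait_us=100)
q.push(0)
q.push(1)
print(q.try_fetch(50))   # too few packets and too early
print(q.try_fetch(100))  # interval elapsed, partial batch is released
```

The batch-size condition favors throughput, while the interval condition bounds how long a packet can wait under light load, which is the latency concern raised by the "No QoS guarantee" limitation.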
Results: Throughput
[Bar chart. Categories: Deep Packet Inspection, Packet Classification, Routing Table Lookup, Decrease TTL. Series: Line-card Rate, CPU/GPU, New Architecture. Vertical axis: 0 to 350.]
Results: Packet Latency
[Bar chart. Categories: Deep Packet Inspection, Packet Classification, Routing Table Lookup, Decrease TTL. Series: CPU/GPU, New Architecture. Vertical axis: 0 to 250.]
High Performance Radar DSP Processor
Motivation
Feasibility of GPU for DSP processing
Designing a massively parallel DSP processor
Research Objectives
High performance DSP processor
• For high-performance applications: radar, sonar, cellular baseband, …
Performance requirements
• Throughput ≥ 800GFLOPS
• Power efficiency ≥ 100GFLOPS/W
• Memory bandwidth ≥ 400Gbit/s
• Scale to multi-chip solutions
Current DSP Platforms
Processor | Frequency | # cores | Throughput | Memory Bandwidth | Power | Power Efficiency (GFLOPS/W)
TI TMS320C6472-700 | 500MHz | 6 | 33.6GMac/s | NA | 3.8W | 17.7
FreeScale MSC8156 | 1GHz | 6 | 48GMac/s | 1GB/s | 10W | 9.6
ADI TigerSHARC ADSP-TS201S | 600MHz | 1 | 4.8GMac/s | 38.4GB/s (on-chip) | 2.18W | 4.4
PicoChip PC205 | 260MHz | 1GPP+248DSPs | 31GMac/s | NA | <5W | 12.4
Intel Core i7 980XE | 3.3GHz | 6 | 107.5GFLOPS | 31.8GB/s | 130W | 0.8
Tilera Tile64 | 866MHz | 64 CPUs | 221GFLOPS | 6.25GB/s | 22W | 10.0
NVidia Fermi GPU | 1GHz | 512 scalar cores | 1536GFLOPS | 230GB/s* | 200W | 7.7
*GDDR5: Peak Bandwidth 28.2GB/s
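The power efficiency column can be recomputed from the throughput and power columns. The short check below assumes that one MAC counts as two floating-point operations and that "<5W" is taken as 5 W; both are assumptions inferred from the numbers, not statements from the slides. Under these assumptions the computed values agree with the listed ones to one decimal place.

```python
# (name, throughput, unit, power in watts); "<5W" is taken as 5 W (assumption).
PLATFORMS = [
    ("TI TMS320C6472-700", 33.6, "GMac/s", 3.8),
    ("FreeScale MSC8156", 48.0, "GMac/s", 10.0),
    ("ADI TigerSHARC ADSP-TS201S", 4.8, "GMac/s", 2.18),
    ("PicoChip PC205", 31.0, "GMac/s", 5.0),
    ("Intel Core i7 980XE", 107.5, "GFLOPS", 130.0),
    ("Tilera Tile64", 221.0, "GFLOPS", 22.0),
    ("NVidia Fermi GPU", 1536.0, "GFLOPS", 200.0),
]

FLOPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC


def power_efficiency(throughput, unit, watts):
    """Return GFLOPS per watt, converting GMac/s to GFLOPS where needed."""
    gflops = throughput * FLOPS_PER_MAC if unit == "GMac/s" else throughput
    return gflops / watts


for name, throughput, unit, watts in PLATFORMS:
    print(f"{name}: {power_efficiency(throughput, unit, watts):.1f} GFLOPS/W")
```

Against the stated objective of at least 100GFLOPS/W, every platform in the table falls short by roughly an order of magnitude or more.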
HPEC Challenge - Radar Benchmarks
Benchmark | Description
TDFIR | Time-domain finite impulse response filtering
FDFIR | Frequency-domain finite impulse response filtering
CT | Corner turn or matrix transpose to place radar data into a contiguous row for efficient FFT
QR | QR factorization: prevalent in target recognition algorithms
SVD | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference
CFAR | Constant false-alarm rate detection: find target in an environment with varying background noise
GA | Graph optimization via genetic algorithm: removing uncorrelated data relations
PM | Pattern Matching: identify stored tracks that match a target
DB | Database operations to store and query target tracks
GPU Implementation
Benchmark | Description
TDFIR | Loops of multiplication and accumulation (MAC)
FDFIR | FFT followed by MAC loops
CT | GPU based matrix transpose, extremely efficient
QR | Pipeline of CPU + GPU, Fast Givens algorithm
SVD | Based on QR factorization and fast matrix multiplication
CFAR | Accumulation of neighboring vector elements
GA | Parallel random number generator and inter-thread communication
PM | Vector level parallelism
DB | Binary tree operation, hard for GPU implementation
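To make the "loops of multiplication and accumulation" description of TDFIR concrete, here is a minimal time-domain FIR filter. The function name, the full-length output convention, and the use of complex accumulators are assumptions of this sketch; the actual benchmark kernels were written for the GPU.

```python
def tdfir(samples, taps):
    """Time-domain FIR filter: y[n] = sum over k of taps[k] * samples[n - k]."""
    out_len = len(samples) + len(taps) - 1
    output = [0j] * out_len
    for n in range(out_len):            # on a GPU, one thread per output sample
        acc = 0j
        for k, tap in enumerate(taps):  # the MAC loop
            if 0 <= n - k < len(samples):
                acc += tap * samples[n - k]
        output[n] = acc
    return output


print(tdfir([1, 2, 3], [1, 1]))
```

Each output sample depends only on the inputs, never on other outputs, so the outer loop parallelizes directly across threads; this independence is a likely reason the kernel maps well to a GPU.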
Performance Results
Kernel | Data Set | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup
TDFIR | Set 1 | 3.382 | 97.506 | 28.8
TDFIR | Set 2 | 3.326 | 23.130 | 6.9
FDFIR | Set 1 | 0.541 | 61.681 | 114.1
FDFIR | Set 2 | 0.542 | 11.955 | 22.1
CT | Set 1 | 1.194 | 17.177 | 14.3
CT | Set 2 | 0.501 | 35.545 | 70.9
PM | Set 1 | 0.871 | 7.761 | 8.9
PM | Set 2 | 0.281 | 21.241 | 75.6
CFAR | Set 1 | 1.154 | 2.234 | 1.9
CFAR | Set 2 | 1.314 | 17.319 | 13.1
CFAR | Set 3 | 1.313 | 13.962 | 10.6
CFAR | Set 4 | 1.261 | 8.301 | 6.6
GA | Set 1 | 0.562 | 1.177 | 2.1
GA | Set 2 | 0.683 | 8.571 | 12.5
GA | Set 3 | 0.441 | 0.589 | 1.4
GA | Set 4 | 0.373 | 2.249 | 6.0
QR | Set 1 | 1.704 | 54.309 | 31.8
QR | Set 2 | 0.901 | 5.679 | 6.3
QR | Set 3 | 0.904 | 6.686 | 7.4
SVD | Set 1 | 0.747 | 4.175 | 5.6
SVD | Set 2 | 0.791 | 2.684 | 3.4
DB | Set 1 | 112.3 | 126.8 | 1.13
DB | Set 2 | 5.794 | 8.459 | 1.46
*The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively.
Performance Comparison
GPU: NVIDIA Fermi, CPU: Intel Core 2 Duo (3.33GHz), DSP: ADI TigerSHARC 101
Instruction Profiling
Thread Profiling
Warp occupancy: number of active threads in an issued warp
• 32 threads per warp
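Under this definition, warp occupancy can be read directly from a warp's active mask. The sketch below assumes a profiler or simulator that reports one 32-bit active mask per issued warp; the function names and the trace format are hypothetical.

```python
WARP_SIZE = 32


def warp_occupancy(active_mask):
    """Number of active threads in one issued warp, given its 32-bit active mask."""
    return bin(active_mask & 0xFFFFFFFF).count("1")


def average_occupancy(masks):
    """Mean fraction of active lanes over a trace of issued warps."""
    return sum(warp_occupancy(m) for m in masks) / (WARP_SIZE * len(masks))


# A fully active warp followed by a warp where half the threads diverged away.
trace = [0xFFFFFFFF, 0x0000FFFF]
print(warp_occupancy(trace[1]), average_occupancy(trace))
```

Low occupancy indicates branch divergence: lanes that are masked off still occupy an issue slot but do no useful work.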
Off-Chip Memory Profiling
DRAM efficiency: the percentage of time spent on sending data across the pins of DRAM over the whole time of memory service.
Limitation
GPU suffers from low power efficiency (MFLOPS/W)
Key Idea - Hardware Architecture
Borrow the GPU microarchitecture
• Using a DSP core as the basic execution unit
• Multiprocessors organized in programmable pipelines
• Neighboring multiprocessors can be merged as wider datapaths
Key Idea – Parallel Code Generation
Meta-programming based parallel code generation
Foundation technologies
• GPU meta-programming frameworks: Copperhead (UC Berkeley) and PyCUDA (New York University)
• DSP code generation framework: Spiral (Carnegie Mellon University)
[Figure labels: runtime, DSP code generation, Source optimization, Compile]
Key Idea – Internal Representation as KPN
Kahn Process Network (KPN)
• A generic model for concurrent computation
• Solid theoretical foundation: process algebra
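A Kahn process network can be sketched with threads as processes and unbounded FIFO queues as channels, where every read blocks until data arrives. The three-stage network below (producer, scale, accumulate) and the `None` end-of-stream marker are illustrative assumptions; they do not represent the internal representation used by the authors' tool flow.

```python
import queue
import threading


def producer(out_ch, values):
    for v in values:
        out_ch.put(v)
    out_ch.put(None)  # end-of-stream marker (a convention of this sketch)


def scale(in_ch, out_ch, factor):
    while (v := in_ch.get()) is not None:  # blocking read, as KPN requires
        out_ch.put(v * factor)
    out_ch.put(None)


def accumulate(in_ch, result):
    total = 0
    while (v := in_ch.get()) is not None:
        total += v
        result.append(total)


def run_network(values, factor):
    a, b = queue.Queue(), queue.Queue()  # unbounded FIFO channels
    result = []
    processes = [
        threading.Thread(target=producer, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=accumulate, args=(b, result)),
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return result


print(run_network([1, 2, 3], 10))
```

Because processes communicate only through FIFO channels with blocking reads, the output is the same for any thread interleaving. That determinism is what makes the model attractive as a basis for automatic scheduling and mapping.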
Scheduling and Optimization on KPN
Automatic task and thread scheduling and mapping
• Extract data parallelism through process splitting
• Latency and throughput aware scheduling
• Performance estimation based on analytical models
[Figure labels: Ttotal, T1, T2, Ti]
Key Idea - Low Power Techniques
GPU-like processors are power hungry!
Potential low power techniques
• Aggressive memory coalescing
• Enable task-pipeline to avoid synchronization via global memory
• Operation chaining to avoid extra memory accesses
• ???
[Figure: used and unused portions of a DRAM line on a DRAM chip, comparing current coalescing with our coalescing solution]
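One way to quantify the used and unused portions of a DRAM line is the fraction of fetched bytes that a warp's threads actually consume. The sketch below assumes 4-byte accesses, a 128-byte line, distinct addresses, and accesses that do not straddle lines; these parameters are illustrative and the function does not model the coalescing scheme proposed in the slides.

```python
def line_utilization(addresses, access_bytes=4, line_bytes=128):
    """Fraction of fetched bytes that the threads of a warp actually use."""
    lines = {addr // line_bytes for addr in addresses}  # distinct lines fetched
    return len(addresses) * access_bytes / (len(lines) * line_bytes)


contiguous = [4 * t for t in range(32)]  # thread t reads word t
strided = [128 * t for t in range(32)]   # thread t reads the first word of line t

print(line_utilization(contiguous), line_utilization(strided))
```

With contiguous addresses the warp uses every byte of the single line it fetches, while the strided pattern fetches 32 lines and uses 4 bytes of each. Unused bytes still cost DRAM energy, which is why coalescing appears here as a low power technique.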
Conclusion
A new market of high performance embedded computing is emerging
Multi-core engines would be the work-horses
• Need both HW and SW research
Case study 1: GPU based Internet routing
Case study 2: Massively parallel DSP processor
Significant performance improvements
More work ahead
• Low power, scheduling, parallel programming model, legacy code, …