Kyushu University
KL, Malaysia
Hardware and Software Requirements for Implementing a High-Performance
Superconductivity Circuits-Based Accelerator
Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan

CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits
SFQ-LSRDP
• Kyushu Univ. (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka): architecture, compiler and applications
• Superconducting Research Lab. (SRL) (S. Nagasawa et al.): SFQ process
• Yokohama National Univ. (N. Yoshikawa et al.): SFQ-FPU chip, cell library
• Nagoya Univ. (A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
• Nagoya Univ. (N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits
Our mission: architecture, compiler and application development

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor
[Figure: a single flux quantum (magnetic flux quantum) in a superconducting loop with a Josephson junction]
SFQ features:
• High-speed switching and signal transmission
• Low power consumption
• Compact implementation (smaller area)
• Suitable for pipeline processing

How it works
[Figure: GPP, memory controller, memory, buffers and LSRDP, shown at each step of the code sequence]
Code sequence:
    inst; inst; …
    conf_LSRDP();
    Loop:
        rearrange_input_data();
        set_IO_info();
        run_LSRDP();
        inst; …
        sync_lsrdp();
        rearrange_output_data();
    End_Loop
    inst; …
Components involved at each step (figure labels):
• conf_LSRDP(): conf. bit-stream
• rearrange_input_data(): GPP, memory controller
• set_IO_info(): memory controller
• run_LSRDP(); inst; sync_lsrdp(): GPP waiting for the LSRDP; LSRDP terminating the operation
• rearrange_output_data(): GPP
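The call sequence of this slide can be mimicked in a few lines of Python. This is a minimal sketch, assuming a mock accelerator: the class `MockLSRDP`, the use of a Python callable as the "configuration", and the toy kernel are hypothetical stand-ins, not the real LSRDP interface. Only the order of the calls (`conf_LSRDP`, `rearrange_input_data`, `set_IO_info`, `run_LSRDP`, `sync_lsrdp`, `rearrange_output_data`) comes from the slide.

```python
from typing import Callable, List, Optional, Tuple


class MockLSRDP:
    """Hypothetical stand-in for the accelerator; not the real LSRDP interface."""

    def __init__(self) -> None:
        self.kernel: Optional[Callable[[List[float]], List[float]]] = None
        self.in_buf: List[float] = []
        self.out_buf: List[float] = []
        self.trace: List[str] = []

    def conf_LSRDP(self, kernel: Callable[[List[float]], List[float]]) -> None:
        # Stands in for loading the configuration bit-stream (done once).
        self.kernel = kernel
        self.trace.append("conf_LSRDP")

    def set_IO_info(self, data: List[float]) -> None:
        # Stands in for passing the I/O description to the memory controller.
        self.in_buf = list(data)
        self.trace.append("set_IO_info")

    def run_LSRDP(self) -> None:
        # Starts the data path; the mock simply computes the result immediately.
        if self.kernel is None:
            raise RuntimeError("LSRDP has not been configured")
        self.out_buf = self.kernel(self.in_buf)
        self.trace.append("run_LSRDP")

    def sync_lsrdp(self) -> List[float]:
        # The GPP would wait here until the LSRDP terminates the operation.
        self.trace.append("sync_lsrdp")
        return self.out_buf


def rearrange_input_data(block: List[float]) -> List[float]:
    return list(block)  # placeholder: identity reordering


def rearrange_output_data(results: List[float]) -> List[float]:
    return list(results)  # placeholder: identity reordering


def host_program(blocks: List[List[float]]) -> Tuple[List[List[float]], List[str]]:
    acc = MockLSRDP()
    # Toy kernel (assumption): sum of neighbouring elements.
    acc.conf_LSRDP(lambda xs: [a + b for a, b in zip(xs, xs[1:])])
    outputs = []
    for block in blocks:
        data = rearrange_input_data(block)
        acc.set_IO_info(data)
        acc.run_LSRDP()
        # ... other GPP instructions could execute here ...
        results = acc.sync_lsrdp()
        outputs.append(rearrange_output_data(results))
    return outputs, acc.trace
```

Configuration happens once, outside the loop, while the rearrange/set/run/sync steps repeat for every block of data, which matches the structure of the code sequence on the slide.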

Architecture Exploration
LSRDP layouts
[Figure: three candidate layouts; in each, rows of PEs are separated by ORNs]
• Layout-I: PEs labeled ADD/SUB and MUL together
• Layout-II: ADD/SUB PEs and MUL PEs interleaved
• Layout-III: ADD/SUB PEs and MUL PEs placed in separate groups
ORN structures
• MCL = 1: number of rows = 1.5×M, number of columns = 4×MCL
• MCL = 1: number of rows = 2×M, number of columns = 6×MCL+2
• MCL = 2: number of rows = 1.5×M, number of columns = 4×MCL+1
PE structures
[Figure: each PE combines an FU with one or more TUs]
• Basic PE arch.: 3 inputs / 2 outputs
• PE arch. I: 4 inputs / 3 outputs
• PE arch. II: 3 inputs / 3 outputs

LSRDP Tool Chain
Application C code → modifying application code (inserting LSRDP instructions in the code) → modified application code, which feeds two flows:
1: flow of the assembly code generation for GPP
   Modified application code + LSRDP library file (function definitions & declarations) → ISAcc or COINS compiler → binary code
2: flow of configuration bit-stream generation for the LSRDP
   Modified application code → DFG extraction → data flow graphs
   Data flow graphs + LSRDP architecture description → placing and routing tool → configuration file + various text & schematic reports
Simulator → performance evaluation

Mapping DFGs onto LSRDP
Inputs: DFG, LSRDP architecture description
Steps:
• Placing input nodes
• Placing operational & output nodes
• Routing nets
• Routing IO nets
Output: final map
[Figure label: longest connections]
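The placement steps can be illustrated with a small sketch. It assumes (this is not stated on the slide) that input nodes go to row 0 and that every other node is placed in the first row below all of its predecessors. The dict-based DFG format, the node names, and both function names are hypothetical. The second function reports how many rows each connection spans, which is one simple way to find the longest connections.

```python
from typing import Dict, List, Tuple


def place_by_level(dfg: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign a row to every node of an acyclic DFG.

    dfg maps each node to the list of its predecessors; nodes without
    predecessors are treated as inputs and placed in row 0.
    """
    rows: Dict[str, int] = {}

    def row_of(node: str) -> int:
        if node not in rows:
            preds = dfg[node]
            rows[node] = 0 if not preds else 1 + max(row_of(p) for p in preds)
        return rows[node]

    for node in dfg:
        row_of(node)
    return rows


def vertical_lengths(dfg: Dict[str, List[str]],
                     rows: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    """Number of rows spanned by each (predecessor, node) connection."""
    return {(p, n): rows[n] - rows[p] for n, preds in dfg.items() for p in preds}


# Toy DFG (assumption): out = (a + b) * c
toy_dfg = {
    "a": [], "b": [], "c": [],
    "add": ["a", "b"],
    "mul": ["add", "c"],
    "out": ["mul"],
}
toy_rows = place_by_level(toy_dfg)
toy_lengths = vertical_lengths(toy_dfg, toy_rows)
```

In the toy graph the connection from `c` to `mul` spans two rows while all others span one, so it is the longest connection. A real placer would also have to choose columns and respect the layout and MCL constraints, which this sketch ignores.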

Global routing algorithms
Routing DFG connections between source and destination PEs
[Figure: routes from src to dest through resources marked vacant or fully-occupied]
• Exhaustive search-based: very time consuming
• Branch and bound alg.: very fast
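The contrast between exhaustive search and branch and bound can be shown with a generic sketch. The slide does not give the details of the authors' algorithm, so everything here is an assumption: routing resources are modelled as a hypothetical grid whose cells are either vacant (0) or fully occupied (1), a route moves between adjacent cells, and the bound is the Manhattan distance to the destination. Without the pruning test the function degenerates into an exhaustive search over all simple paths.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]


def route_bnb(grid: List[List[int]], src: Cell, dest: Cell) -> Optional[List[Cell]]:
    """Shortest route from src to dest through vacant cells, or None.

    grid[r][c] == 0 means vacant, 1 means fully occupied (hypothetical model).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    best: Optional[List[Cell]] = None

    def lower_bound(cell: Cell) -> int:
        # Manhattan distance never overestimates the remaining number of steps.
        return abs(cell[0] - dest[0]) + abs(cell[1] - dest[1])

    def search(cell: Cell, path: List[Cell], visited: Set[Cell]) -> None:
        nonlocal best
        steps = len(path) - 1
        # Bound: give up if even an optimistic completion cannot beat the best route.
        if best is not None and steps + lower_bound(cell) >= len(best) - 1:
            return
        if cell == dest:
            best = list(path)
            return
        r, c = cell
        # Branch: try the neighbours closest to the destination first.
        neighbours = sorted([(r + 1, c), (r, c + 1), (r, c - 1), (r - 1, c)],
                            key=lower_bound)
        for nxt in neighbours:
            nr, nc = nxt
            if (0 <= nr < n_rows and 0 <= nc < n_cols
                    and grid[nr][nc] == 0 and nxt not in visited):
                visited.add(nxt)
                path.append(nxt)
                search(nxt, path, visited)
                path.pop()
                visited.remove(nxt)

    if grid[src[0]][src[1]] == 0 and grid[dest[0]][dest[1]] == 0:
        search(src, [src], {src})
    return best
```

Once a first route is found, every partial route that cannot improve on it is discarded immediately, which is where the speed-up over plain enumeration comes from.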

Micro-Routing: Problem Definition
Inputs
• LSRDP basic specifications
– Layout, width (W), MCL, PE arch., etc.
– List of connections between consecutive rows
• ORN structure, including
– The number of CBs and T2s in each row
– The number of CB rows
– Topology of connections among CBs
Output
• Detailed routes via cross-bar switches
– The list of CBs used for routing each connection
– Configuration of CBs
[Figure: the ORN between the i-th row and the (i+1)-th row of PEs]
A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III.

ORN Micro-routing
• CB: 2-input/2-output
• ½CB: 1-input/2-output
Example micro-nets:
• (PE1 → PE5)
• (PE2 → PE5, PE6, PE7)
• (PE3 → PE6, PE8)
• (PE4 → PE7, PE8)
[Figure: PE1 to PE4 connected to PE5 to PE8 through ½CBs and CBs; ports labeled 00, 01, 10, 11]
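The micro-nets of this example can be written down as a small data structure. The sketch below does not implement the micro-routing algorithm; it only derives fan-out, fan-in, and a lower bound on the number of signal splits. The dict layout and function names are my own, and the bound assumes that a single switch can copy a signal to at most two outputs, as the ½CB (1-input/2-output) and CB (2-input/2-output) definitions suggest.

```python
from collections import defaultdict
from typing import Dict, List

# Micro-nets from the example: source PE -> destination PEs in the next row.
micro_nets: Dict[str, List[str]] = {
    "PE1": ["PE5"],
    "PE2": ["PE5", "PE6", "PE7"],
    "PE3": ["PE6", "PE8"],
    "PE4": ["PE7", "PE8"],
}


def fan_out(nets: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of destinations driven by each source PE."""
    return {src: len(dests) for src, dests in nets.items()}


def fan_in(nets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Sources feeding each destination PE."""
    sources = defaultdict(list)
    for src, dests in nets.items():
        for dest in dests:
            sources[dest].append(src)
    return dict(sources)


def min_splits(nets: Dict[str, List[str]]) -> int:
    """Lower bound on two-way splits: a net with k destinations needs k - 1."""
    return sum(len(dests) - 1 for dests in nets.values())
```

For the four micro-nets above, every destination PE receives two operands, and at least four two-way splits are needed in total.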

ORN Micro-Routing Example: Heat 8x2, ORN between 3rd and 4th Rows
[Figure: routes through the ORN from the PEs in the 3rd row to the PEs in the 4th row]

Specifications of Attempted DFGs
DFG             Total # of nodes   # of inputs   # of outputs   # of ops
Heat-8x1        34                 6             4              16
Heat-8x2        60                 8             4              32
Heat-16x2       172                16            12             96
Poisson-3x3     62                 18            1              33
Vibration-4x2   48                 8             4              24
Vibration-8x2   136                16            12             72
Vibration-8x4   168                16            8              96
ERI-1           76                 16            9              51
ERI-2           67                 19            1              47

Example of a DFG Mapping: Vibration-8x2

Results of routing nets using the proposed algorithms
DFG             Avg. hor. C.L.   Avg./max. ver. C.L.   # of global/micro nets to route   Time to map (sec)
Heat-8x1        0.35             0.75/3                36/64                             0.015
Heat-8x2        0.44             1.32/5                68/114                            1.75
Heat-16x2       0.47             1.64/7                204/343                           1.05
Poisson-3x3     0.68             2.4/16                67/120                            2074.5
Vibration-4x2   0.46             1.58/9                50/88                             0.34
Vibration-8x2   0.42             2.15/10               154/332                           2.20
Vibration-8x4   2.48             3.72/16               348/610                           6721.3
ERI-1           0.75             2.21/9                111/374                           53.61
ERI-2           0.78             2.99/9                95/332                            0.327

Thank You for Your Attention!
Any Questions?

10 TFLOPS SFQ-RDP computer
• FPU SFQ RDP (32 PE × 32 chips, 2.5 GFLOPS/PE), 4.2 K
• 1024 FPU @ MCM (34 chips) × 4 MCM
• SFQ 0.5 μm process
• SMAC: streaming memory access controller
• CMOS CPU (one chip)
• Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
• 2 TB memory module (FB-DIMM [DDR3@1333MHz, 128 GB] × 16 modules)
[Figure: system diagram with SMACs, SB, and the RDP, in which rows of PEs alternate with Operand Routing Networks (ORN)]
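The headline figures of this slide follow from simple arithmetic, which the sketch below recomputes. All input numbers are taken from the slide; the function and key names are my own, and peak performance is assumed to be the plain product of FPU count and per-PE throughput.

```python
from typing import Dict


def system_figures() -> Dict[str, float]:
    pes_per_chip = 32          # "32 PE x 32 chips"
    chips = 32
    gflops_per_pe = 2.5        # "2.5 GFLOPS / PE"
    mcms = 4                   # "x 4 MCM"
    channel_gb_per_s = 16      # "16 GB/s x 16 channels"
    channels = 16
    module_gb = 128            # "128 GB x 16 modules"
    modules = 16

    fpus_per_mcm = pes_per_chip * chips
    return {
        "fpus_per_mcm": fpus_per_mcm,
        "total_fpus": fpus_per_mcm * mcms,
        # Assumption: peak = number of FPUs x throughput per PE.
        "peak_tflops": fpus_per_mcm * mcms * gflops_per_pe / 1000,
        "bandwidth_per_mcm_gb_s": channel_gb_per_s * channels,
        "memory_tb": module_gb * modules / 1024,
    }
```

The products reproduce the 1024 FPUs per MCM, the 256 GB/s per MCM, and the 2 TB of memory quoted on the slide, and give a peak of about 10.24 TFLOPS, consistent with the "10 TFLOPS" in the title.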

Development of RDP Architecture
Chip micro-architecture:
• Two types of PEs: FPA and FPM
• PE layout: checkered pattern
• PE: three inputs (A, B, C) → three outputs (A(*B), B, C)
• Three scales of RDP (small, medium and large)
[Figure: PE (i, j) connects through the ORN to PEs (i, j+1), (i+1, j+1), (i+2, j+1), …, (i+L, j+1); MCL = L. TU: data through]
RDP parameters (optimized by total number of JJs)
          # Input   # Output   Width   Height   MCL   Total JJs (∝ RDP size)
RDP-S     19        12         22      14       4     19387K
RDP-M     19        12         24      17       5     27027K
RDP-L     38        24         41      34       6     96374K

Development of RDP Compiler
Application C code → modifying application code (manual: inserting LSRDP instructions in the code) → modified code, which feeds two flows:
1: flow of the assembly code generation for GPP
   Modified code + RDP library file (function definitions & declarations) → ISAcc or COINS compiler → .asm code for MIPS-based GPP
2: flow of configuration bit-stream generation for the RDP
   Modified code → DFG extraction (semi-manual) → data flow graphs
   Data flow graphs + RDP architecture description → placement and routing tool → configuration file + various text and schematic reports
Simulator → performance evaluation

Development of RDP Oriented Algorithms
• One-dimensional heat and vibrational equations
• Two-dimensional heat and FDTD equations
• Two-electron repulsion integral calculation in quantum chemistry
• Runge-Kutta calculation for ordinary differential equations
Performance Evaluation
Two-dimensional heat equation (1024x1024 mesh): SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s
1) Evaluation method:
   RDP:
   – Execution time model
   – DFG has 21 inputs, 9 outputs, and 63 operations
   GPP:
   – Cycle-accurate processor simulator
   – BW: 159.0 GB/s
2) T. Aoki and A. Nukada, "CUDA programming premier," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).