Data Transfer and Programming on Novel Heterogeneous Parallel Computers
Yifeng Chen (陈一峯)
School of EECS, Peking University
Manycore-Accelerated Clusters
Tianhe-1A: 1 GPU / 2 CPUs
Tsubame: 3 GPUs / 2 CPUs
Mole-8.5: 6 GPUs / 2 CPUs
PKU McClus: 2 GPUs / 1 CPU
Various Manycore Designs
Fermi/Kepler
MIC
Single-Chip Cloud
APU
Larrabee
Cell
Tilera
Parallel Programming: Toolbox vs. Writing Case
CUDA
OpenMP
MPI
Irregular Structures
Array-Data-Parallel
Task-Parallel
“Only-need-to-learn” of PARRAY
• Dimensions in a tree
• A dimension may refer to another array type.
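To make this concrete, below is a minimal plain-C sketch (the helper names are hypothetical, not PARRAY output) of how two orderings of the same dimension tree map the same logical element to different linear offsets:

#include <stdio.h>

/* Hypothetical helpers: linear offset of element (i,j,k) in a
   [4][4][2] array under two orderings of the same dimension tree. */
static int offset_ij_k(int i, int j, int k) { return (i*4 + j)*2 + k; }  /* [[i][j]][k] */
static int offset_ji_k(int i, int j, int k) { return (j*4 + i)*2 + k; }  /* [[j][i]][k] */

int main(void) {
    /* The same logical element lands at different offsets; a PARRAY
       type reordering expresses exactly this relation. */
    printf("%d %d\n", offset_ij_k(1, 2, 0), offset_ji_k(1, 2, 0));  /* prints: 12 18 */
    return 0;
}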
Memory Layout Re-ordering
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        memcpy(b + i*8 + j*2, a + i*2 + j*8, 2 * sizeof(float));  /* copy one [2]-float row; memcpy counts bytes */
#parray {paged float [4][4][2]} D
#parray {paged float [[#D_0][#D_1]][#D_2]} A
#parray {paged float [[#D_1][#D_0]][#D_2]} B
#insert DataTransfer(a, A, b, B) {}
(Figure: the same [4][4][2] array under two dimension orderings, with the two outer [4] dimensions swapped.)
Network Communication
#parray {mpi[4]} M
#parray {paged float [4][2]} D
#parray {[[#M][#D_0]][#D_1]} A
#parray {[[#D_0][#M]][#D_1]} B
#insert DataTransfer(a, A, b, B) {}
(Figure: [4][4][2] blocks exchanged across the process dimension, mirroring the memory-layout example.)
MPI_Alltoall
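The A-to-B transfer above reduces to a single collective. As a minimal plain-MPI sketch of the equivalent exchange (run with 4 ranks; the initialization values are illustrative):

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each of the 4 ranks holds a paged float [4][2] block (type D). */
    float a[4][2], b[4][2];
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 2; k++)
            a[j][k] = rank * 100 + j * 10 + k;  /* tag each row by (rank, row) */

    /* Moving row j of rank i into row i of rank j is exactly an
       all-to-all with one [2]-float block per process pair. */
    MPI_Alltoall(a, 2, MPI_FLOAT, b, 2, MPI_FLOAT, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}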
PCI Data Transfer
(Figure: the same [4][4][2] reordering, here carried out across four GPUs.)
#parray {pthd[4]} P
#parray {dmem float [4][2]} D
#parray {[[#P][#D_0]][#D_1]} A
#parray {[[#D_0][#P]][#D_1]} B
#insert DataTransfer(a, A, b, B) {}
(Figure: pairwise cudaMemcpy(d2d) transfers among GPU0–GPU3.)
#parray {mpi[4]} M
#parray {paged float [4][2]} D
#parray {[[#M][#D_0]][#D_1]} A
#parray {[[#D_0][#M]][#D_1]} B
#mainhost {
    #detour M {
        float *a, *b;
        #create D(a)
        #create D(b)
        #insert DataTransfer(a, A, b, B) {}
        #destroy D(a)
        #destroy D(b)
    }
}
MPI or IB/verbs
#parray {pthd[4]} P
#parray {dmem float [4][2]} D
#parray {[[#P][#D_0]][#D_1]} A
#parray {[[#D_0][#P]][#D_1]} B
#mainhost {
    #detour P {
        float *a, *b;
        INIT_GPU($tid$);
        #create D(a)
        #create D(b)
        #insert DataTransfer(a, A, b, B) {}
        #destroy D(a)
        #destroy D(b)
    }
}
CUDA + Pthread
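On the device side, each generated cudaMemcpy(d2d) is an ordinary peer copy. A minimal CUDA sketch for two GPUs (the device numbers and single-row copy size are illustrative):

#include <cuda_runtime.h>

int main(void) {
    float *a, *b;
    int can01 = 0;

    /* One dmem float [4][2] block on each GPU. */
    cudaSetDevice(0); cudaMalloc((void **)&a, 4 * 2 * sizeof(float));
    cudaSetDevice(1); cudaMalloc((void **)&b, 4 * 2 * sizeof(float));

    /* Enable direct peer access where the hardware supports it;
       cudaMemcpyPeer stages through the host otherwise. */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    if (can01) { cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0); }

    /* Copy one [2]-float row from GPU 0 to GPU 1, device to device. */
    cudaMemcpyPeer(b, 1, a, 0, 2 * sizeof(float));

    cudaSetDevice(1); cudaFree(b);
    cudaSetDevice(0); cudaFree(a);
    return 0;
}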
Discontiguous Communication
#parray { mpi[7168] } M
#parray { pinned[2][14336][14336] } D
#parray {[[#M][#D_0][#D_1]][#D_2]} S
#parray {[[#D_1][#M][#D_0]][#D_2]} T
#insert DataTransfer(t, T, s, S) {}
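For comparison, expressing even one discontiguous piece of such a transfer in plain MPI requires a derived datatype. A minimal sketch with scaled-down, illustrative sizes (the real run moves [2][14336][14336] floats over 7168 ranks):

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 8 };            /* illustrative, not 14336 */
    float s[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s[i][j] = (float)(i * N + j);

    /* One column of s: N floats with stride N, i.e. a strided,
       discontiguous block described as an MPI vector datatype. */
    MPI_Datatype col;
    MPI_Type_vector(N, 1, N, MPI_FLOAT, &col);
    MPI_Type_commit(&col);

    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&s[0][0], 1, col, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            float t[N];        /* received contiguously */
            MPI_Recv(t, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Type_free(&col);
    MPI_Finalize();
    return 0;
}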
#mainhost {
    #parallel {
        #detour pthd[3] {
            ……
            #detour mpi[4] { …… }
        }
        ……
        #detour cuda[2][128] {
            ……
            #detour cuda[4][256] { …… }
            ……
        }
        ……
    }
}
Hierarchical SPMDs
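What one pthd level of such a hierarchy expands to can be sketched in plain C with pthreads (the body is illustrative; PARRAY exposes the thread index as $tid$):

#include <pthread.h>
#include <stdio.h>

/* One SPMD level: pthd[3] runs the same body on three threads, each
   knowing its own index; inner #detour levels would spawn their own
   groups inside the body. */
static void *body(void *arg) {
    long tid = (long)arg;
    printf("thread %ld running the shared program\n", tid);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    for (long i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, body, (void *)i);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}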
GPU SGEMM
(Figure: PARRAY dimension trees for the SGEMM operands A, B, C and the thread array T, splitting each WIDTH dimension into tiles of 256, 128, 32, 16, 4, and 2.)
(Figure: the refined SGEMM dimension trees; the resulting kernel reaches 620 Gflops on Fermi, compared against a C1060.)
Large FFT (ICS'10, PPoPP'12)
Direct Simulation of Turbulent Flows
Scale: 12 distributed arrays, 128 TB in total, on the entire Tianhe-1A with 7168 GPUs.
Progress: 4096³ completed; 8192³ half-way; 14336³ tested for performance.
Software: PARRAY (ACM PPoPP'12); the code is only 300 lines.
Discussions
• Performance transparency: macros are compiled out.
• Completeness: any index expressions using add/mul/mod/div/fcomp.
• Regular structures come from applications and target manycore hardware; irregular structures are allowed but better supported by other tools.
• Typical training = 3 days.
• Released at http://code.google.com/p/parray-parallel-array/