New Techniques for Programming GPU Clusters
Yifeng Chen
School of EECS, Peking University, China
Two Conflicting Approaches for Programmability in HPC
Top-down approach
- Core programming model is high-level (e.g. functional parallel languages).
- Must rely on heavy heuristic runtime optimization.
- Adds low-level program constructs to improve low-level control.
- Risks: programmers tend to avoid using the "extra" constructs, and the low-level controls do not fit well into the core model.
Bottom-up approach (PARRAY, PPoPP'12)
- Core programming model exposes the memory hierarchy.
- Same algorithm, same performance, same intellectual challenge, but shorter code.
GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU
[Figure: a 4096-row host array is split into two 2048-row halves, scattered to Proc0 and Proc1 via MPI_Scatter over the network, then copied to GPU 0 and GPU 1 via cudaMemcpyHostToDevice over PCI.]
Motivating Examples for PARRAY
- Basic notation
- Dimension trees
- Type references
Thread Arrays
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost
{
  #create P(p)
  #create H(host)
  #detour P(p)
  {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}
[Figure: generated thread structure: pthread_create spawns one thread per pthd element; sem_post/sem_wait synchronize the workers with the main thread; pthread_join reclaims the threads.]
Generating CUDA+Pthread
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhost
{
  #create M(m)
  #create H(host)
  #detour M(m)
  {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}
Generating MPI or IB/verbs
[Figure: generated MPI_Scatter call; the same scheme supports ALLTOALL and BCAST patterns.]
Other Communication Patterns
Generating Code for IB/verbs and YH
Communication Layer
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- Discontiguous RDMA communication pattern achieving zero-copy
Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem (before Nov 2011)
Direct Simulation of Turbulent Flows
Scale
- Up to 14336³ grid points, single precision
- 12 distributed arrays, each with 11 TB of data (128 TB total)
- Entire Tianhe-1A with 7168 nodes
Progress
- 4096³ completed; 8192³ half-way; 14336³ tested for performance
Software technologies
- PARRAY code only 300 lines
- Programming-level resilience technology for stable computation
Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Generated Code
Discussions
Other programming models?
- MPI (more expressive datatypes)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating zero-copy IB calls)
We need a software stack!
- Irregular structures must be encoded into arrays and then benefit from PARRAY.
- Runtime workflow is possible above PARRAY.
- Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros.
- Macros are compiled out: no performance loss.
- Typical training time is 3 days; friendly to engineers.