[若渴計畫] From GPU Hardware Concepts to Coding CUDA

AJ
2014.6.17

Can a GPU only serve as a graphics card? Can it be used for parallel computing?

Two major vendors
• NVIDIA
• AMD
• Both vendors offer open source projects that users can join
• What exactly can you join? I have not looked into it yet

• My lab only has NVIDIA cards, so I use NVIDIA
• Which programming model do NVIDIA cards use?
• Single-instruction, multiple-thread (SIMT) programming model

What design concepts does this model bring you?

Starting from NVIDIA GPU design concepts

In NVIDIA GPUs, SIMT can be viewed through three properties
• Single instruction, multiple register sets
• Single instruction, multiple addresses
• Single instruction, multiple flow paths

http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html


Single Instruction, Multiple Register Sets

for (i = 0; i < n; ++i) a[i] = b[i] + c[i];

__global__ void add(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = b[i] + c[i]; // no loop!
}

Costs:
• Each thread has its own register set, so values that are the same in every thread are stored redundantly.



Single Instruction, Multiple Addresses

__global__ void apply(short* a, short* b, short* lut) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = lut[b[i]]; // indirect memory access
    // a[i] = b[i];
}

Cost:
• For DRAM, random access is inefficient compared with sequential access.
• For shared memory, random access is slowed down by bank contention. (Shared memory is not discussed here.)



Single Instruction, Multiple Flow Paths

__global__ void find(int* vec, int len, int* ind, int* nfound, int nthreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int last = 0;
    int* myind = ind + tid*len;
    for (int i = tid; i < len; i += nthreads) {
        if (vec[i]) { // flow divergence
            myind[last] = i;
            last++;
        }
    }
    nfound[tid] = last;
}

[Figure: the threads read vec (length len) with coalesced accesses, then split into those where if(vec[i]) holds and those where it does not. Labels: thread id, nthreads, registers.]


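The above are the SIMT design properties. Next, look at the Kepler GK110 chip block diagram.

• 15 SMX (streaming multiprocessors) x 192 cores
• 4 warp schedulers per SMX
• 65536 registers per SMX

From the NVIDIA Kepler GK110 architecture whitepaper

• What is the warp scheduler for?
• Resource allocation inside an SMX

From the NVIDIA Kepler GK110 architecture whitepaper

[Figure: warp 1, warp 2]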

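Warps run in SIMT fashion
1. On NVIDIA hardware, a "warp" is a group of 32 threads that run together. Each thread needs its own registers.

2. Within a warp, execution is SIMT: all 32 threads execute the same instruction. On flow divergence, the hardware runs each branch path in turn while the threads that did not take that path are masked off, so performance drops. (James Balfour, "CUDA Threads and Atomics", p.13 to p.18)

Warps run in SIMT fashion (cont.)
1. Several warps form a "block". A block is mapped to one SMX, and the warp schedulers inside the SMX switch among the warps of the block. Each warp has its own register set.

2. As the figure shows, warp scheduling within a block has zero overhead (fast context switch), because all state is kept in the register sets. A warp is either active or suspended.

3. You can choose how many threads a block has, but the maximum number of threads per block depends on the compute capability the hardware supports.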


How thread IDs map to warps

• Warp ID (warpid)
• How do you know which warp a thread in a block belongs to? threadIdx.x / 32 (for a 1D block)
• Thread IDs 0 to 31: warp 0
• Thread IDs 32 to 63: warp 1
• …

http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf , p.2
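As a minimal sketch of the mapping above, the following program lets every thread of a 1D block record its warp index and its lane (its position inside the warp). The helper names `warp_index`, `lane_index`, and `record_warps` are made up for this illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp index of a thread inside a 1D block (32 threads per warp).
__host__ __device__ int warp_index(int tid) { return tid / 32; }

// Lane = position of the thread inside its warp.
__host__ __device__ int lane_index(int tid) { return tid % 32; }

// Every thread writes its own warp and lane index.
__global__ void record_warps(int *warp, int *lane) {
    int tid = threadIdx.x;
    warp[tid] = warp_index(tid);
    lane[tid] = lane_index(tid);
}

int main() {
    const int threads = 96;  // one block made of three warps
    int h_warp[threads], h_lane[threads];
    int *d_warp = NULL, *d_lane = NULL;
    cudaMalloc(&d_warp, sizeof(h_warp));
    cudaMalloc(&d_lane, sizeof(h_lane));
    record_warps<<<1, threads>>>(d_warp, d_lane);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_warp, d_warp, sizeof(h_warp), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_lane, d_lane, sizeof(h_lane), cudaMemcpyDeviceToHost);
    int samples[] = {0, 31, 32, 63, 64, 95};
    for (int k = 0; k < 6; ++k) {
        int t = samples[k];
        printf("thread %2d -> warp %d, lane %2d\n", t, h_warp[t], h_lane[t]);
    }
    cudaFree(d_warp);
    cudaFree(d_lane);
    return 0;
}
```

For 2D or 3D blocks the threads are first linearized (x fastest, then y, then z) before being cut into groups of 32, so the division would apply to that linear index rather than to threadIdx.x alone.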


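The GPU principles above (there are more, of course), combined with integration with the CPU, gave rise to the CUDA coding environment.

Things to watch when using CUDA
• Which NVIDIA GPU architecture you are using.
• NVIDIA Tesla K20c
• According to https://developer.nvidia.com/cuda-gpus, the Compute Capability of the Tesla K20c is 3.5.
• To install the CUDA environment, see http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#axzz33nDhVV00. The compiler is called nvcc.
• The latest CUDA version is 6.0, but I installed 5.0 (too lazy to upgrade).
• After installing the CUDA environment, run the bundled deviceQuery executable to check that the installation is correct.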





/usr/local/cuda-5.0/samples/1_Utilities/deviceQuery
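The deviceQuery sample prints far more than this, but a rough sketch of the same idea, using only the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`, could look like the following. The function name `describe_devices` is an illustrative choice, not part of the sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few properties of every CUDA device.
// Returns the number of devices, or -1 if the runtime reports an error.
int describe_devices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  compute capability:        %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:           %d\n", prop.multiProcessorCount);
        printf("  warp size:                 %d\n", prop.warpSize);
        printf("  registers per block:       %d\n", prop.regsPerBlock);
        printf("  max threads per block:     %d\n", prop.maxThreadsPerBlock);
        printf("  max threads per multiproc: %d\n", prop.maxThreadsPerMultiProcessor);
    }
    return count;
}

int main() {
    return describe_devices() < 0 ? 1 : 0;
}
```

On a Tesla K20c one would expect the compute capability line to read 3.5, matching the table on the NVIDIA site.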

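You may wonder: how can the same CUDA program run on GPUs with different numbers of SMXs?

Block Scalability
Blocks execute independently of one another, so the hardware can distribute them over however many SMXs the GPU has.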

Program Compilation

CUDA 5: Separate Compilation & Linking

From Introducing CUDA 5.pdf

Makefile example

##########################################################
# compiler setting
##########################################################
CC = gcc
CXX = g++
NVCC = nvcc
CFLAGS = -g -Wall
CXXFLAGS = $(CFLAGS) -Weffc++ -pg
LIBS = -lm -lglut -lGLU -lGL
INCPATH = -I./

OBJS = main.o \
       c_a.o \
       c_b.o \
       cpp_a.o \
       cpp_b.o \
       cu_a.o \
       cu_b.o

EXEC = output

all: $(OBJS)
	$(NVCC) $(OBJS) -o $(EXEC) $(LIBS) -pg

%.o: %.cu
	$(NVCC) -c $< -o $@ -g -G -arch=sm_35

%.o: %.cpp
	$(CXX) -c $(CXXFLAGS) $(INCPATH) $< -o $@

%.o: %.c
	$(CC) -c $(CFLAGS) $(INCPATH) $< -o $@

##########################################################

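Suppose you are handed someone else's parallel program. Here is a good method that may improve its performance.

The ILP method <= so the ILP we learned long ago can be used like this!

• Merge several threads -> ILP increases -> chance to coalesce global memory accesses -> number of blocks decreases -> registers used per thread increase -> occupancy decreases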

(Vasily Volkov, “Better Performance at Lower Occupancy”)

First, what is occupancy?

• Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently. (In other words: at any moment, does the number of threads you are running fill the maximum number of concurrent threads the GPU offers?)

From Optimizing CUDA – Part II © NVIDIA Corporation 2009

• Suppose one SMX of some GPU can run at most 1536 threads concurrently and has 32K registers
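With those two numbers, the effect of register use on occupancy can be worked out by hand. The host-only sketch below does the arithmetic under simplifying assumptions: registers are the only limiting resource, and register allocation granularity, block-size rounding, and shared memory are all ignored. The names `occupancy`, `kMaxThreadsPerSM`, and `kRegsPerSM` are invented for the example.

```cuda
#include <cstdio>
#include <algorithm>

// Numbers taken from the slide: 1536 resident threads and 32K registers per SMX.
const int kMaxThreadsPerSM = 1536;
const int kRegsPerSM = 32 * 1024;
const int kWarpSize = 32;

// Simplified occupancy when registers are the only limit.
double occupancy(int regs_per_thread) {
    int max_warps = kMaxThreadsPerSM / kWarpSize;                    // 48 warps
    int warps_by_regs = kRegsPerSM / (regs_per_thread * kWarpSize);  // warps the register file can hold
    return double(std::min(max_warps, warps_by_regs)) / max_warps;
}

int main() {
    int regs[] = {16, 21, 32, 64};
    for (int k = 0; k < 4; ++k) {
        printf("%2d registers/thread -> occupancy %.2f\n", regs[k], occupancy(regs[k]));
    }
    return 0;
}
```

Under these assumptions a thread may use up to 21 registers (32768 / 1536, rounded down) before occupancy starts to fall; at 32 registers only 32 of the 48 warps fit, and at 64 registers only 16.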

An NVIDIA engineer (http://stackoverflow.com/users/749748/harrism) says on Stack Overflow:
• In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces latency hiding ability in the kernel. GPUs thrive on parallelism and do so by covering memory latency with work from other threads.
• Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions).

http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable






In other words, whether occupancy is high or low, give memory reads a chance to coalesce.
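To make "a chance to coalesce" concrete, here is a small sketch contrasting two copy kernels. In the first, neighbouring threads of a warp read neighbouring addresses; in the second, they read addresses `stride` elements apart. All names (`copy_coalesced`, `copy_strided`, `strided_index`, `run_copies`) are hypothetical, and the sketch only checks correctness; measuring the speed difference is left to a profiler or timer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Element read by thread i in the strided variant.
__host__ __device__ int strided_index(int i, int stride, int n) {
    return (int)(((long long)i * stride) % n);
}

// Coalesced: thread i reads element i, so a warp touches one contiguous range.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` apart,
// so one warp touches many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[strided_index(i, stride, n)];
}

// Runs both kernels and checks their results on the host.
bool run_copies(int n, int stride) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int k = 0; k < n; ++k) h_in[k] = (float)k;
    size_t bytes = n * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess;
    if (ok) {
        int threads = 256, blocks = (n + threads - 1) / threads;
        cudaMemcpy(d_in, &h_in[0], bytes, cudaMemcpyHostToDevice);
        copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaMemcpy(&h_a[0], d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);
        ok = ok && cudaMemcpy(&h_b[0], d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; ok && k < n; ++k) {
            ok = h_a[k] == h_in[k] && h_b[k] == h_in[strided_index(k, stride, n)];
        }
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    printf("%s\n", run_copies(1 << 20, 32) ? "results match" : "mismatch or CUDA error");
    return 0;
}
```

Both kernels move the same amount of useful data; the difference lies in how many memory transactions each warp needs.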

Continuing with how ILP affects NVIDIA GPUs

http://continuum.io/blog/cudapy_ilp_opt

Moving data
• Core
• Memory controller

What code corresponds to the effect above?

With ILP = 2, expressed as pseudocode

# read
i = thread.id
ai = a[i]
bi = b[i]
j = i+5
aj = a[j]
bj = b[j]
# compute
ci = core(ai, bi)
cj = core(aj, bj)
# write
c[i] = ci
c[j] = cj

With ILP = 4, the practical effect => better utilization of the GPU pipeline
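The read / compute / write pattern of the ILP = 2 pseudocode can be written as a CUDA kernel roughly as follows. This is a sketch under assumptions: the operation is a plain addition standing in for `core`, and the second element is placed one full grid width away (rather than at `i+5` as in the pseudocode) so that each group of loads stays coalesced. The names `add_ilp2` and `run_add_ilp2` are invented here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread processes two independent elements (ILP = 2).
// Both pairs of loads are issued before any arithmetic, so the loads
// for element j can be in flight while those for element i complete.
__global__ void add_ilp2(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i + gridDim.x * blockDim.x;  // one full grid width away
    // read
    float ai = 0.0f, bi = 0.0f, aj = 0.0f, bj = 0.0f;
    if (i < n) { ai = a[i]; bi = b[i]; }
    if (j < n) { aj = a[j]; bj = b[j]; }
    // compute
    float ci = ai + bi;
    float cj = aj + bj;
    // write
    if (i < n) c[i] = ci;
    if (j < n) c[j] = cj;
}

// Launches half as many threads as there are elements and checks the sums.
bool run_add_ilp2(int n) {
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int k = 0; k < n; ++k) { a[k] = (float)k; b[k] = 2.0f * k; }
    size_t bytes = n * sizeof(float);
    float *da = NULL, *db = NULL, *dc = NULL;
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess &&
              cudaMalloc(&db, bytes) == cudaSuccess &&
              cudaMalloc(&dc, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(da, &a[0], bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, &b[0], bytes, cudaMemcpyHostToDevice);
        int threads = 128;
        int blocks = (n + 2 * threads - 1) / (2 * threads);  // two elements per thread
        add_ilp2<<<blocks, threads>>>(da, db, dc, n);
        ok = cudaMemcpy(&c[0], dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; ok && k < n; ++k) ok = c[k] == 3.0f * k;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return ok;
}

int main() {
    printf("%s\n", run_add_ilp2(1000) ? "ok" : "mismatch or CUDA error");
    return 0;
}
```

Compared with a one-element-per-thread kernel, this version halves the number of threads and roughly doubles the registers each thread needs, which is exactly the trade-off against occupancy described earlier.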




Summary of the main concepts above

• Hide latency = do other operations when waiting for latency
• Increase ILP
• Increase occupancy

As mentioned for the ILP method, the number of registers one thread uses is an important consideration.

Interpreting Output of --ptxas-options=-v

http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v
http://stackoverflow.com/questions/7241062/is-local-memory-slower-than-shared-memory-in-cuda

• Each CUDA thread is using 46 registers? Yes, correct
• There is no register spilling to local memory? Yes, correct (note that local memory is not shared memory; it resides in off-chip device memory)
• Is 72 bytes the sum-total of the memory for the stack frames of the __global__ (the parallel routine you write) and __device__ (routines called from __global__ functions) functions? Yes, correct


How do I limit the number of registers one thread uses?
• Control register usage with the nvcc flag: --maxrregcount
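Besides the file-wide --maxrregcount flag, CUDA also offers the per-kernel `__launch_bounds__` qualifier, which is not mentioned on the slide and is shown here only as a related option. The sketch below assumes a hypothetical file name kernel.cu and an illustrative saxpy kernel; the limits 256 and 4 are arbitrary example values.

```cuda
// Possible build line (kernel.cu is a hypothetical file name):
//   nvcc --ptxas-options=-v --maxrregcount=32 kernel.cu -o kernel
// --ptxas-options=-v prints the registers used per kernel;
// --maxrregcount caps registers per thread for every kernel in the file.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __launch_bounds__(256, 4): the kernel is launched with at most 256 threads
// per block, and we would like at least 4 blocks resident per multiprocessor.
// The compiler may restrict register use for this kernel to make that possible.
__global__ void __launch_bounds__(256, 4)
saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Runs the kernel and checks y = 2*x + 1 on the host.
bool run_saxpy(int n) {
    std::vector<float> x(n), y(n, 1.0f);
    for (int k = 0; k < n; ++k) x[k] = (float)k;
    size_t bytes = n * sizeof(float);
    float *dx = NULL, *dy = NULL;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess &&
              cudaMalloc(&dy, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dx, &x[0], bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, &y[0], bytes, cudaMemcpyHostToDevice);
        int threads = 256, blocks = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(2.0f, dx, dy, n);
        ok = cudaMemcpy(&y[0], dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; ok && k < n; ++k) ok = y[k] == 2.0f * k + 1.0f;
    }
    cudaFree(dx);
    cudaFree(dy);
    return ok;
}

int main() {
    printf("%s\n", run_saxpy(1000) ? "ok" : "mismatch or CUDA error");
    return 0;
}
```

Forcing the register count too low makes the compiler spill to local memory, so it is worth comparing the -v output before and after changing either setting.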

If the registers allocated to threads add up to more than the GPU has, what does the compiler do?

A Stack Overflow expert says:
• PTX level allows many more virtual registers than the hardware. Those are mapped to hardware registers at load time. The register limit you specify allows you to set an upper limit on the hardware resources used by the generated binary. It serves as a heuristic for the compiler to decide when to spill (see below) registers when compiling to PTX already so certain concurrency needs can be met.
• For Fermi GPUs there are at most 64 hardware registers. The 64th is used by the ABI as the stack pointer and thus for "register spilling" (it means freeing up registers by temporarily storing their values on the stack and happens when more registers are needed than available) so it is untouchable.




We just said that spending more registers buys memory coalescing. But using too many registers increases memory access time through spilling. What to do?

Ha! Talk is cheap; you only find out by coding.

Where can my program place the data it needs?

Mohamed Zahran, “Lecture 6: CUDA Memories”

• Access speed: shared memory > constant memory > global memory >

How should data be declared to select which memory it lives in?

Note: the description is not exact; the actual placement depends on the compiler.

A Stack Overflow expert says:
• Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable.
• Scalar variables are automatically stored in registers by the compiler.
• Statically-indexed (i.e. where the index can be determined at compile time), small arrays (say, less than 16 floats) may be stored in registers by the compiler.
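As a rough illustration of how declarations select a memory space, the sketch below uses one variable of each kind in a single kernel. The names (`scale`, `bias`, `tile`, `memory_spaces`, `run_memory_spaces`) are made up for the example, and as the note above says, where scalars and small arrays finally end up is the compiler's decision.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128

__constant__ float scale[4];   // constant memory, filled from the host
__device__ float bias = 1.0f;  // a variable in global memory

// in and out point to global memory allocated with cudaMalloc.
__global__ void memory_spaces(const float *in, float *out, int n) {
    __shared__ float tile[THREADS];                 // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar, normally kept in a register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // global -> shared
    __syncthreads();                                // wait until the whole block has written
    if (i < n) out[i] = tile[threadIdx.x] * scale[i % 4] + bias;
}

// Runs the kernel on 256 elements and checks the result on the host.
bool run_memory_spaces() {
    const int n = 256;
    float h_scale[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_in[n], h_out[n];
    for (int k = 0; k < n; ++k) { h_in[k] = (float)k; h_out[k] = 0.0f; }
    float *d_in = NULL, *d_out = NULL;
    bool ok = cudaMalloc(&d_in, sizeof(h_in)) == cudaSuccess &&
              cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess;
    if (ok) {
        cudaMemcpyToSymbol(scale, h_scale, sizeof(h_scale));  // host -> constant memory
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        memory_spaces<<<n / THREADS, THREADS>>>(d_in, d_out, n);
        ok = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; ok && k < n; ++k) ok = h_out[k] == h_in[k] * h_scale[k % 4] + 1.0f;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    printf("%s\n", run_memory_spaces() ? "ok" : "mismatch or CUDA error");
    return 0;
}
```

The shared array is pointless for performance in such a tiny kernel; it is there only to show the declaration and the `__syncthreads()` that must separate the write from the read.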




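Let's look at a simple example

Summing two vectors

Jason Sanders, Edward Kandrot, “CUDA by Example”

Where does the data come from? It is copied from CPU memory to global memory

How do I call the parallel routine I wrote?

• The call must specify how many threads each block has and how many blocks the grid has

• The launch shown on the slide means a grid of N blocks, each block running 1 thread

[Figure: the kernel launch with its blocks and threads arguments labeled]

Write the results from GPU global memory back to CPU memory for processing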


Summary of the flow above

http://en.wikipedia.org/wiki/CUDA
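The whole flow (copy inputs to the GPU, launch the kernel, copy the result back) can be sketched in one short program. It follows the spirit of the vector sum in "CUDA by Example", with N blocks of one thread each, but it is a reconstruction rather than the book's listing; the function name `sum_vectors` and the input values are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10

// One thread per block, so the block index selects the element.
__global__ void add(const int *a, const int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

bool sum_vectors() {
    int a[N], b[N], c[N];
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    // 1. Allocate global memory on the GPU.
    bool ok = cudaMalloc(&dev_a, sizeof(a)) == cudaSuccess &&
              cudaMalloc(&dev_b, sizeof(b)) == cudaSuccess &&
              cudaMalloc(&dev_c, sizeof(c)) == cudaSuccess;
    if (ok) {
        // 2. Fill the inputs on the CPU and copy them to global memory.
        for (int i = 0; i < N; ++i) { a[i] = -i; b[i] = i * i; }
        cudaMemcpy(dev_a, a, sizeof(a), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, sizeof(b), cudaMemcpyHostToDevice);

        // 3. Launch: <<<blocks per grid, threads per block>>>.
        add<<<N, 1>>>(dev_a, dev_b, dev_c);

        // 4. Copy the result from global memory back to CPU memory.
        ok = cudaMemcpy(c, dev_c, sizeof(c), cudaMemcpyDeviceToHost) == cudaSuccess;

        // 5. Use the result on the CPU.
        for (int i = 0; ok && i < N; ++i) {
            printf("%d + %d = %d\n", a[i], b[i], c[i]);
            ok = c[i] == a[i] + b[i];
        }
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok;
}

int main() {
    return sum_vectors() ? 0 : 1;
}
```

One thread per block wastes 31 of the 32 lanes in every warp, so a practical version would put many threads in each block and index with blockIdx.x * blockDim.x + threadIdx.x as in the earlier kernels.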



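Why can the thread count and block count be specified in 1D, 2D, or 3D?

• 1 block 4

• Each block is 9x9; with 100 threads, that takes two blocks

• 2 blocks

When the thread count is not a multiple of 32, choosing among 1D, 2D, and 3D layouts is a matter of comparing which one fills its warps more fully!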
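The comparison can be done with a little host-side arithmetic. The sketch below counts the warps a block shape occupies and the share of warp lanes that do useful work for 100 work items, using the two layouts from the slide. The helper names `warps_per_block` and `lane_utilization` are invented, and the model simply assumes every launched warp costs 32 lanes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warps occupied by one block: its threads are linearized and cut into groups of 32.
int warps_per_block(dim3 block) {
    int threads = (int)(block.x * block.y * block.z);
    return (threads + 31) / 32;
}

// Fraction of all launched warp lanes that carry one of the `work` items.
double lane_utilization(int work, dim3 grid, dim3 block) {
    int blocks = (int)(grid.x * grid.y * grid.z);
    int lanes = blocks * warps_per_block(block) * 32;
    return double(work) / lanes;
}

int main() {
    const int work = 100;

    // Layout A: one 1D block of 100 threads.
    dim3 grid_a(1), block_a(100);
    // Layout B: two 2D blocks of 9x9 = 81 threads each.
    dim3 grid_b(2), block_b(9, 9);

    printf("1D, 1 block of 100:  %d warps, utilization %.2f\n",
           warps_per_block(block_a), lane_utilization(work, grid_a, block_a));
    printf("2D, 2 blocks of 9x9: %d warps per block, utilization %.2f\n",
           warps_per_block(block_b), lane_utilization(work, grid_b, block_b));
    return 0;
}
```

Under this model the single 100-thread block needs 4 warps (128 lanes) while the two 9x9 blocks need 6 warps (192 lanes), so the 1D layout fills its warps better for this particular thread count.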

How to measure GPU run time

Profiling tool: nvprof
nvprof --events warps_launched,threads_launched ./<executable> <executable arguments> > result
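nvprof measures from outside the program. To time a kernel from inside the code, CUDA events are a common alternative; the following is a minimal sketch of that approach, which the slides do not cover. The kernel `scale_kernel` and the function `time_kernel_ms` are placeholders invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload: multiply every element by s.
__global__ void scale_kernel(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Returns the kernel's elapsed time in milliseconds, or -1 on allocation failure.
float time_kernel_ms(int n) {
    float *d_x = NULL;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256, blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);                 // marker placed before the kernel
    scale_kernel<<<blocks, threads>>>(d_x, n, 2.0f);
    cudaEventRecord(stop);                  // marker placed after the kernel
    cudaEventSynchronize(stop);             // wait until the stop marker is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // time between the two markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return ms;
}

int main() {
    printf("kernel time: %.3f ms\n", time_kernel_ms(1 << 20));
    return 0;
}
```

The first launch in a process often includes one-time setup cost, so timing a second or later launch usually gives a more representative number.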

Q&A-1: Discussion of flow divergence
• The JIT approach
• Profile the program to learn which conditions turn out true or false, then hand the cases separately and concurrently to the JIT for execution
• Browsers use this approach to speed up processing
• This approach consumes a lot of memory

Q&A-2: NVIDIA/AMD

• NVIDIA
• Laptops, servers

• AMD
• Mobile phones

Q&A-3: Discussion of Single Instruction, Multiple Addresses
• How a compiler handles random access
• Point analysis

Q&A-4:

• CUDA LLVM Compiler
• CUDA currently does not support OpenCL 2.0
• https://developer.nvidia.com/opencl

Q&A-5: Discussion of tracing code
• cuda-gdb
• http://docs.nvidia.com/cuda/cuda-gdb/#axzz34ufkPsqt
• EX:

• Note: For disassembly instruction to work properly, cuobjdump must be installed and present in your $PATH.


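Q&A-6: Where is the GPU machine code placed for execution?

I do not know whether locality is discussed for GPUs.

Q&A-7: Is there any benefit to splitting a function apart and parallelizing the pieces?
• Function() function1() function2() function3()
• ?

Q&A-8: Parallelizing collision avoidance for a 5-axis machine
• Each time the cutter takes a step, use the GPU to check for a collision
• Problem: the GPU draws power continuously
• If the 5-axis machine carves for a whole day, won't the GPU's power consumption be alarming?
• Trade-off: power consumption / speed

CUDA Toolkit Documentation

• http://docs.nvidia.com/cuda/index.html#axzz33uurtJU9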
