GPU Parallel Computing - suwonvmlab.suwon.ac.kr/mwlee/data2/file/(12)gpu_computing_etc.pdf
TRANSCRIPT
Terminology

GPGPU? General-Purpose computing on a GPU (Graphics Processing Unit) - that is, using graphics hardware for non-graphics computation.

NVIDIA's CUDA? Compute Unified Device Architecture: a software architecture for data-parallel programming.
Why GPGPU?
CPU vs. GPU

CPU: "multi-core" - fast cache, branching adaptability, high per-core performance.

GPU: "many-core" (hundreds of cores) - many ALUs, fast onboard memory (roughly 10x the speed of main memory), high throughput on parallel tasks.

The CPU excels at task parallelism; the GPU excels at data parallelism.
CPU vs. GPU - Hardware

The GPU devotes more of its hardware to data processing.
GPU Architecture
Processing Element
Processing element = thread processor = ALU
Memory Architecture

[Diagram: the device executes a grid of thread blocks. Each thread has its own registers and local memory; each block has a shared memory accessible by all threads in the block; all threads in the grid access the device's global, constant, and texture memory, which the host can also read and write.]

- Registers, local memory (per thread)
- Shared memory (per block)
- Global memory
- Constant memory
- Texture memory
Data-parallel Programming

Think of the GPU as a massively-threaded co-processor. Write "kernel" functions that execute on the device, processing multiple data elements in parallel.

Keep it busy! (massive threading) Keep your data close! (local memory)
Requirements

Hardware: a CUDA-capable NVIDIA graphics card in a PCI-Express slot.

Software & tools: the CUDA device driver, the CUDA toolkit (nvcc compiler, ...), and the CUDA SDK.
Host vs. Device

Host: the main computer (CPU + main memory).
Device: the graphics card (GPU + graphics memory).

CUDA source code is written in C/C++ (file name ~.cu) and consists of two parts:
- host code: runs on the CPU
- device code (the "kernel"): runs on the GPU

Compile with: nvcc VectorAdd.cu
How to Compute

1. Allocate the variables the CPU will use in main memory.
2. Allocate the variables the GPU will use in the graphics card's memory.
3. Copy the data from the host computer's main memory to the graphics card's memory.
4. The GPU creates thousands to tens of millions of threads and performs the computation using the graphics card's memory.
5. Copy the results back to the host computer's main memory; the CPU then uses them for further work or output, and the job is done.
[Diagram sequence, host's memory on the left and the GPU card's memory on the right:

1. Initially: "array" exists only in the host's memory.
2. Allocate memory in the GPU card: "array_d" is created in the GPU card's memory.
3. Copy content from the host's memory to the GPU card's memory: array -> array_d.
4. Execute code on the GPU: the GPU's multiprocessors (MPs) operate on array_d.
5. Copy results back to the host memory: array_d -> array.]
// VectorAdd.cu
#include <stdio.h>

__global__ void VectorAdd(int *a, int *b, int *c)   // device code (kernel)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

int main()
{
    const int size = 512 * 65535;
    const int BufferSize = size * sizeof(int);
    int *InputA, *InputB, *Result;

    InputA = (int*)malloc(BufferSize);   // Assign host memory
    InputB = (int*)malloc(BufferSize);
    Result = (int*)malloc(BufferSize);

    int i = 0;
    int *dev_A; int *dev_B; int *dev_R;

    for (int i = 0; i < size; i++) {     // Input data
        InputA[i] = i; InputB[i] = i; Result[i] = 0;
    }

    cudaMalloc((void**)&dev_A, size * sizeof(int));  // Assign device memory
    cudaMalloc((void**)&dev_B, size * sizeof(int));
    cudaMalloc((void**)&dev_R, size * sizeof(int));

    // Transfer data from host memory to device memory
    cudaMemcpy(dev_A, InputA, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, InputB, size * sizeof(int), cudaMemcpyHostToDevice);

    // Create 65535x512 threads and perform computation on the GPU
    VectorAdd<<<65535, 512>>>(dev_A, dev_B, dev_R);

    // Transfer data from device memory to host memory
    cudaMemcpy(Result, dev_R, size * sizeof(int), cudaMemcpyDeviceToHost);

    // Print results
    for (i = 0; i < 5; i++) {
        printf(" Result[%d] : %d\n", i, Result[i]);
    }
    printf(" ......\n");
    for (i = size - 5; i < size; i++) {
        printf(" Result[%d] : %d\n", i, Result[i]);
    }

    // Free device memory
    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_R);

    // Free host memory
    free(InputA); free(InputB); free(Result);

    return 0;
}
Some Example: multiplication of 1024 x 1024 matrices

[pspark@para kias]$ ./MatrixMul-c
Matrix C (Results)
0.389147 0.418741 : 257.658
0.574162 0.669713 : 254.338
0.674025 0.867991 : 261.301
0.468286 0.619271 : 256.432
Total elapsed time on the CPU chip 10.3449

[pspark@para kias]$ ./MatrixMul-cuda
grid : 32 32 : block : 32 32
Matrix C (Results)
0.389147 0.418741 : 2.93874e-39
0.574162 0.669713 : 3.30608e-39
0.674025 0.867991 : 3.67342e-39
0.468286 0.619271 : 4.04076e-39
Total elapsed time on the GPU card 0.0469801
References

Miruware
http://www.miruware.com/

NVIDIA Developer CUDA Zone
http://developer.nvidia.com/category/zone/cuda-zone
http://ko.wikipedia.org/wiki/CUDA

OpenCL
http://www.khronos.org/opencl/
http://ko.wikipedia.org/wiki/OpenCL

Intel Larrabee
http://ko.wikipedia.org/wiki/%EB%9D%BC%EB%9D%BC%EB%B9%84_(%EB%A7%88%EC%9D%B4%ED%81%AC%EB%A1%9C%EC%95%84%ED%82%A4%ED%85%8D%EC%B2%98)
"Intel's 'Larrabee': the big threat to AMD and NVIDIA"
http://uzys2011.tistory.com/337
"Larrabee GPU development finally halted"
http://www.kbench.com/hardware/?no=84965

Quantum computers
http://mirror.enha.kr/wiki/%EC%96%91%EC%9E%90%EC%BB%B4%ED%93%A8%ED%84%B0
D-Wave Systems
http://www.dwavesys.com/
"Google acquires a D-Wave 2"
http://www.zdnet.co.kr/news/news_view.asp?artice_id=20130704161219
Brook+
SC07 BOF Session
November 13, 2007
What is Brook+?
Brook is an extension to the C-language for stream programming originally developed by Stanford University
Brook+ is an implementation by AMD of the Brook GPU spec on AMD's compute abstraction layer with some enhancements
Example

kernel void sum(float a<>, float b<>, out float c<>)
{
c = a + b;
}
int main(int argc, char** argv)
{
int i, j;
float a<10, 10>;
float b<10, 10>;
float c<10, 10>;
float input_a[10][10];
float input_b[10][10];
float input_c[10][10];
for(i=0; i<10; i++) {
for(j=0; j<10; j++) {
input_a[i][j] = (float) i;
input_b[i][j] = (float) j;
}
}
streamRead(a, input_a);
streamRead(b, input_b);
sum(a, b, c);
streamWrite(c, input_c);
...
}
Kernels - program functions that operate on stream elements.

Streams - collections of data elements of the same type which can be operated on in parallel.

streamRead / streamWrite - Brook+ access functions.
Brook+ Compiler
Converts Brook+ files into C++ code. Kernels, written in C, are compiled to AMD’s IL code for the GPU or C code for the CPU.
Brook+ Runtime
IL code is executed on the GPU. The backend is written in CAL.
Brook+ Features
Brook+ is an extension to the Brook for GPUs source code.
Features of Brook for GPUs relevant to modern graphics hardware are maintained.
Kernels are compiled to AMD’s IL.
Runtime uses CAL for the GPU backend.
Original CPU backend also included.
Folding@Home Stats
Folding@Home client using Brook+
Currently 39 TFLOPS on 664 GPU clients
Avg. 60 GFLOPS per GPU client
Compared to:
Avg. 25 GFLOPS per PS3 client
Avg. 1 GFLOPS per CPU client
Brook+ Release
Brook+ package:
– Compiler and runtime binaries
– Source code and build environments
– Sample applications
Source code released under the BSD License.
Project will also reside on SourceForge.net.
Brook+ Moving Forward
Double precision - FireStream 9170
Mem-export (scatter)
Graphics API interoperability
Multi-GPU support
Other operating systems (Linux, Vista, 64-bit)
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2007 Advanced Micro Devices, Inc. All rights reserved.
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATIONCONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.