Ministry of Education Intelligent Electronics Integrated Talent Cultivation Program, Advanced Application Processor (AP) Alliance Center
TRANSCRIPT
-
Ministry of Education Intelligent Electronics Integrated Talent Cultivation Program
Advanced Application Processor (AP) Alliance Center
Topic Area: Processor Hardware/Software Core Systems
Course Module: Compiler Implementation Module
Material Title: Branch Divergence Reduction of OpenCL Programs Using LLVM
Developing Instructor: 游逸平 (Yi-Ping You)
Developing Students: 蔡也寧, 趙硯廷, 周昆霖
School/Department: Department of Computer Science, National Chiao Tung University
Contact Phone: 03-571-2121 ext. 56688
Address: No. 1001, Daxue Rd., Hsinchu City
Submission Date: February 1, 2014
Keywords: LLVM, compiler backend, optimization, GPGPU, OpenCL
-
BRANCH DIVERGENCE REDUCTION OF OPENCL PROGRAMS USING LLVM
Department of Computer Science, National Chiao Tung University
蔡也寧, 周昆霖, 趙硯廷; Advisor: 游逸平 (Yi-Ping You)
1
-
OUTLINE
• Branch Divergence Reduction of OpenCL Programs Using LLVM
• Building Up the Environment
• Introduction to LLVM
2
-
OUTLINE
• Motivation
• OpenCL Introduction
• Problem – Warp & SIMT
• Solution – Writing a pass
• Algorithm – Main issue
• Benchmark
3
-
MOTIVATION
4
-
HETEROGENEOUS WORLD
• A computer may have more than one GPU
• GPUs have become more and more powerful
• A GPU has many more cores than a CPU
• We can utilize the cores on a GPU through OpenCL
→ use all computational resources in the system
→ an efficient parallel programming model
• But some problems exist!
5
-
OPENCL INTRODUCTION
6
-
OPENCL
• Open Computing Language (OpenCL) is a framework.
• An OpenCL program usually consists of one host code and one or more kernel codes.
• OpenCL programs execute across heterogeneous platforms.
Host Code
Kernel Code
7
-
KERNEL
• An OpenCL program contains one or more kernels and any supporting routines that run on a target device.
• An OpenCL kernel is the basic unit of parallel code that can be executed on a target device.
• Kernels are typically simple functions that transform input memory objects into output memory objects.
8
-
OPENCL PLATFORM MODEL
• Host: CPU
• Compute Device: GPU
• Compute Unit (work-group): core
• Processing Element (work-item, thread)
→ each PE runs the kernel code
[Figure: CPU as the host; GPU cores as compute units]
9
-
PROBLEM
10
-
FEATURE OF EXECUTION
• Warp
• SIMT (Single Instruction, Multiple Threads)
11
-
WARP & SIMT
• Each work-group (running on a core) is divided into groups of 32 threads called warps.
• Threads in the same warp always execute the same instruction (SIMT).
→ This is where the problem is!
12
-
BRANCH DIVERGENCE
• Instead of the warp splitting so that each thread follows its own path, all warp threads take one of the two paths of the branch together.
• Threads that should take the other path simply do nothing, delaying their computation until the warp later executes that path.
13
-
OPTIMIZE BRANCH DIVERGENCE
• If Block 1 and Block 2 contain a common sequence of code, we can move it to their common predecessor or successor.
Move up (to the predecessor) → code hoisting
Move down (to the successor) → code sinking
• Reducing the code size inside the branch blocks reduces the time threads waste executing masked-off instructions.
• Common subexpressions can be moved.
[Figure: b1 & b2's predecessor above Block 1 (b1) and Block 2 (b2), with b1 & b2's successor below]
14
-
SOLUTION
15
-
WRITING A PASS
• Write a pass that reduces branch divergence at the level of LLVM IR (intermediate representation)
16
-
WRITING A PASS
• Add an optimization pass here that implements our algorithm.
Front end → Optimization Pass → Back end
17
-
WHAT DO WE USE
• LLVM - the LLVM core libraries provide a modern source- and target-independent optimizer, along with code generation support.
→ Used in the front end to generate LLVM IR, and in the back end to generate NVPTX code.
• Clang - an "LLVM native" C/C++/Objective-C compiler, which aims to deliver amazingly fast compiles.
→ Used in the front end with LLVM.
• libclc - a library that aims to implement the OpenCL standard library.
→ Used in the front end with LLVM and Clang so that OpenCL constructs in kernel code are recognized.
18
-
WHY DO WE CHOOSE LLVM IR
• The original source code may be complex and hard to read.
• LLVM IR is easy to read and more straightforward, and LLVM provides many APIs that we can use.
19
-
20
-
HOW TO WRITE A PASS
21
-
WRITING A PASS
• Under YOUR LLVM DIRECTORY/lib/Transforms/
• Create a folder (with mkdir); we need some files:
→ some .cpp files and header files
→ a Makefile (http://llvm.org/docs/WritingAnLLVMPass.html#setting-up-the-build-environment)
• More about it: http://llvm.org/docs/WritingAnLLVMPass.html
22
-
MAKE IT
• Run make in the directory you created.
• This produces a .so file at YOUR LLVM DIRECTORY/Debug+Asserts/lib/xxx.so
23
-
HOW TO USE
• opt -load ../../../Debug+Asserts/lib/xxx.so -flag < YOUR_IR.ll > /dev/null
• More about it: http://llvm.org/docs/WritingAnLLVMPass.html#running-a-pass-with-opt
24
-
ALGORITHM
25
-
ALGORITHM
• How to traverse the whole kernel code in an efficient way (the most important issue)
• Code hoisting – some dependency problems
• Code sinking – some dependency problems
26
-
A RECURSIVE ALGORITHM
• Recursively called
27
-
• Each circle is a basic block
28
-
CODE HOISTING
29
-
CODE SINKING
30
-
BENCHMARK
31
-
SOFTWARE ENVIRONMENT
Component           Version
Operating System    Ubuntu 12.04.2 LTS
CPU                 AMD Athlon(tm) II X4 630 Processor
OpenCL              1.1
Compute capability  2.1
LLVM                3.4
32
-
GPU INFORMATION
GPU                     GeForce GTX 560
Compute Units           7
Global Memory           1024 MB
Max Allocatable Memory  256 MB
Local Memory            49152 bytes
33
-
TESTDATA1 – MATRIX
• In this kernel, we randomly initialize the contents of arrays "a" and "b" and pass both arrays to the kernel, which does the following: if an element of "a" is bigger than the corresponding element of "b", it subtracts b's value from a's value and assigns the difference to the array "result"; otherwise it does the opposite.
• Total code hoisting performed: 13 times
• Total code sinking performed: 2 times
34
-
35
[Chart: kernel execution time in milliseconds (0 to 0.25) versus matrix size (5000 to 600000), comparing the unoptimized and optimized kernels]
-
TESTDATA2 – CALCULATE PI
• In this kernel, we compute the value of PI. First we randomly initialize the contents of arrays "a" and "b" with values between 0 and 1; these two arrays hold the x and y coordinates of points in a coordinate system. We then pass them to the kernel for execution. If a[i]*a[i] + b[i]*b[i] is less than 1, we assign 1 to the array "result", meaning the point lies inside the quarter circle inscribed in the unit square; otherwise we assign 0. Afterwards we compute (points inside the quarter circle) / (total points), multiply by four, and obtain the PI value we want.
• Total code hoisting performed: 7 times
• Total code sinking performed: 1 time
36
-
37
[Chart: kernel execution time in milliseconds (0 to 0.18) versus number of samples (5000 to 600000), comparing the unoptimized and optimized kernels]
-
BUILDING UP THE ENVIRONMENT
-
REQUIRED
• A Unix-like system
• OpenCL (for host code)
• LLVM, Clang, and libclc (for kernel code)
-
CHECKOUT LLVM
• Change directory to where you want the llvm directory placed.
• svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
-
CHECKOUT CLANG
• cd llvm/tools
• svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
-
BUILD LLVM AND CLANG
• From a separate build directory beside the llvm source tree you just checked out:
• ../llvm/configure --enable-targets=host,nvptx
• make
-
LIBCLC
• svn checkout http://llvm.org/svn/llvm-project/libclc/trunk libclc
or
• git clone http://llvm.org/git/libclc.git
-
BUILD LIBCLC WITH LLVM-CONFIG
• cd libclc
• ./configure.py --with-llvm-config=/"YOUR LLVM DIRECTORY"/Debug+Asserts/bin/llvm-config
• make
-
INSTALL OPENCL
• Install the NVIDIA CUDA Toolkit and Display Driver
• Download cuda_5.0.35_linux_64_ubuntu11.10-1.run from https://developer.nvidia.com/cuda-toolkit
• Run the script in the directory where you stored the above file: sudo sh ./cuda_5.0.35_linux_64_ubuntu11.10-1.run
-
INSTALL OPENCL
• This script will install 3 items for you:
• a. CUDA Toolkit
• b. CUDA Samples
• c. NVIDIA Display Driver
You need to stop the X server so that the NVIDIA display driver can be installed, with this command:
• sudo stop lightdm
You may also find that the Samples fail to install due to missing required libraries:
• sudo apt-get install freeglut3
• sudo ln -s /usr/lib/x86_64-linux-gnu/libglut.so.3 /usr/lib/libglut.so
-
INSTALL OPENCL
• After all the libs are installed, set PATH and LD_LIBRARY_PATH in ~/.bashrc by adding:
• export PATH=$PATH:/usr/local/cuda-5.0/bin:/usr/local/cuda/bin
• export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib:/usr/local/cuda/lib
• Enter ~/NVIDIA_CUDA-5.0_Samples and build the samples: sudo make (optional)
• Install the NVIDIA GPU Computing SDK
• Download http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run from https://developer.nvidia.com/cuda-toolkit-40
• You will then see the ~/NVIDIA_GPU_Computing_SDK directory; enter its OpenCL/ subdirectory and run make
-
KERNEL.CL -> LLVM IR
• To compile a kernel to LLVM IR with target triple = "nvptx", use this command:
/"YOUR LLVM DIRECTORY"/Debug+Asserts/bin/clang \
  -I/"YOUR LIBCLC DIRECTORY"/generic/include -include clc/clc.h \
  -Dcl_clang_storage_class_specifiers -target nvptx--nvidiacl \
  -Xclang -mlink-bitcode-file \
  -Xclang /"YOUR LIBCLC DIRECTORY"/built_libs/nvptx--nvidiacl.bc \
  -S -emit-llvm kernel.cl -o kernel.ll
-
LLVM IR -> NVPTX
• /"YOUR LLVM DIRECTORY"/Debug+Asserts/bin/llc -march=nvptx -mcpu=sm_20 kernel.ll -o kernel.ptx
-
COMPILE HOST CODE
• /"YOUR LLVM DIRECTORY"/Debug+Asserts/bin/clang++ host.cpp -o host -lOpenCL
-
Introduction to LLVM IR
51
-
Features
• LLVM IR is a Static Single Assignment (SSA) based representation.
• Every definition creates a new variable.
• An infinite set of virtual registers.
52
-
Three different forms
• in-memory compiler IR
• on-disk bitcode representation
• human-readable assembly language representation
53
The three different forms of LLVM IR are all equivalent.
-
Structure
• Module
• Function
• Basic block
54
-
Module Structure
� LLVM programs are composed of module’s , each of which is a translation unit of
the input programs.
55
-
Identifiers
• Global identifiers begin with @
• Local identifiers begin with %
• LLVM IR also has reserved words, like other languages
56
%result = mul i32 %X, 8
-
Functions
• An LLVM function definition starts with the "define" keyword.
• An LLVM function declaration starts with the "declare" keyword.
57
-
Attributes
• Parameter attributes are simple keywords that follow the parameter type:
declare i32 @printf(i8* noalias nocapture, ...)
• Function attributes are simple keywords that follow the function signature:
define void @f() noinline optsize { ... }
58
-
Data Layout
• A module may specify a target-specific data layout string that specifies how data is to be laid out in memory:
target datalayout = "layout specification"
59
-
Type System
• The LLVM type system is one of the most important features of the intermediate representation.
• A strong type system makes it easier to read the generated code.
60
-
Array Type
• The array type is a very simple derived type that arranges elements sequentially in memory.
61
-
Pointer Type
• The pointer type is used to specify memory locations. Pointers are commonly used to reference objects in memory.
62
-
Constant
• Boolean constants: 'true' and 'false', of the i1 type
• Integer constants: such as '4'
• Floating-point constants
• Null pointer constants: 'null' is recognized as a null pointer constant
• Some other complex constants, such as structure constants, array constants, vector constants, etc.
63
-
Terminator Instructions
Every basic block in a program ends with a "Terminator" instruction.
• br instructions (used to cause control flow to transfer to a different basic block in the current function):
br i1 <cond>, label <iftrue>, label <iffalse>
br label <dest> ; Unconditional branch
64
-
Instruction Reference
• Terminator Instructions
• Binary Operations
• Bitwise Binary Operations
• Vector Operations
• Aggregate Operations
• Memory Access and Addressing Operations
• Conversion Operations
65
-
LLVM IR EXAMPLE
66
-
Kernel.cl
67
__kernel void vector_add_gpu(__global const float* src_a,
                             __global const float* src_b,
                             __global float* res,
                             const int num)
{
    const int idx = get_global_id(0);
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
    else
        res[idx] = src_a[idx] - src_b[idx];
}
-
Kernel.ll
define void @vector_add_gpu(float addrspace(1)* nocapture %src_a, float addrspace(1)* nocapture %src_b, float addrspace(1)* nocapture %res, i32 %num) #0 {
entry:
  %call = tail call i32 @get_global_id(i32 0) #2
  %cmp = icmp slt i32 %call, %num
  %arrayidx = getelementptr inbounds float addrspace(1)* %src_a, i32 %call
  %0 = load float addrspace(1)* %arrayidx, align 4, !tbaa !2
  %arrayidx1 = getelementptr inbounds float addrspace(1)* %src_b, i32 %call
  %1 = load float addrspace(1)* %arrayidx1, align 4, !tbaa !2
  br i1 %cmp, label %if.then, label %if.else
68
-
Kernel.ll
if.then:                                   ; preds = %entry
  %add = fadd float %0, %1
  %arrayidx2 = getelementptr inbounds float addrspace(1)* %res, i32 %call
  store float %add, float addrspace(1)* %arrayidx2, align 4, !tbaa !2
  br label %if.end
if.else:                                   ; preds = %entry
  %sub = fsub float %0, %1
  %arrayidx5 = getelementptr inbounds float addrspace(1)* %res, i32 %call
  store float %sub, float addrspace(1)* %arrayidx5, align 4, !tbaa !2
  br label %if.end
if.end:                                    ; preds = %if.else, %if.then
  ret void
}
69
-
LLVM Class Introduction
70
-
Instruction class
• Base class for all the LLVM instructions (IR instructions)
• [Figure: collaboration diagram for llvm::Instruction, showing the operand part and the list structure that acts as a container for instructions]
-
User class
• This class defines the interface that anything which uses a Value must implement.
• Each instance of the Value class keeps track of which Users have handles to it.
• Instructions are the largest class of Users.
• Each instruction is the "user" of its operands (Values).
• We get the operands of an instruction through the User interface; OperandList is a container of uses (Use*).
-
Use class
• The Use class represents the operand of an instruction or some other User instance which refers to a Value. The Use class keeps the "use list" of the referenced Value up to date. (A Use wraps a Value so that operations can be performed on the Value.)
-
Value class
• It is the base class of all values computed by a program that may be used as operands to other values.
• Value is the superclass of other important classes such as Instruction and Function.
• All Values have a Type.
• Setting the name on a Value automatically updates the module's symbol table.
• Every Value has a "use list" that keeps track of which other Values are using this Value.
-
How to find the API you need?
• Step 1: Go to the LLVM documentation, e.g. http://llvm.org/docs/doxygen/html/index.html
• Step 2: Search for the keyword (like instruction, basic block, dominator tree, ...)
→ There may be public functions you can use.
• Step 3: For more detail, you can also refer to the source code of the functions provided.
• Step 4: Figuring out and tracing the inheritance diagram as well as the related classes can help you even more.
-
For more
• Refer to: http://llvm.org/docs/doxygen/html/index.html
-
The end