Ministry of Education Program on Smart Electronics Integrated Talent Cultivation — Advanced Application Processor (AP) Alliance Center

Topic area: Processor Hardware/Software Core Systems
Course module: Compiler Implementation Module
Teaching material: Branch Divergence Reduction of OpenCL Programs Using LLVM
Instructor: 游逸平 (Yi-Ping You)
Students: 蔡也寧、趙硯廷、周昆霖
Department: Department of Computer Science, National Chiao Tung University
Phone: 03-571-2121 ext. 56688
Address: 1001 University Road, Hsinchu City
Submission date: February 1, 2014
Keywords: LLVM, compiler backend, optimization, GPGPU, OpenCL



  • BRANCH DIVERGENCE REDUCTION OF OPENCL PROGRAMS USING LLVM

    Department of Computer Science, National Chiao Tung University
    蔡也寧、周昆霖、趙硯廷; Advisor: 游逸平

  • OUTLINE

    - Branch Divergence Reduction of OpenCL Programs Using LLVM

    - Building Up the Environment

    - Introduction to LLVM

  • OUTLINE

    - Motivation

    - OpenCL Introduction

    - Problem – Warp & SIMT

    - Solution – Writing a pass

    - Algorithm – Main issue

    - Benchmark

  • MOTIVATION


  • HETEROGENEOUS WORLD

    - A computer may have more than one GPU

    - GPUs are becoming more and more powerful

    - GPUs have more cores than CPUs

    - We can utilize the cores on a GPU through OpenCL

    → use all computational resources in the system

    → an efficient parallel programming model

    - But some problems exist!

  • OPENCL INTRODUCTION


  • OPENCL

    - Open Computing Language (OpenCL) is a framework.

    - An OpenCL program usually consists of one piece of host code and multiple kernels.

    - Programs execute across heterogeneous platforms.

    [Diagram: Host Code → Kernel Code]

  • KERNEL

    - An OpenCL program contains one or more kernels and any supporting routines that run on a target device.

    - An OpenCL kernel is the basic unit of parallel code that can be executed on a target device.

    - Kernels are typically simple functions that transform input memory objects into output memory objects.

  • OPENCL PLATFORM MODEL

    - Host: CPU

    - Compute Device: GPU

    - Compute Unit (work-group): core

    - Processing Element (work-item, thread)

    → each PE runs the kernel code

    [Diagram: CPU host driving GPU cores]

  • PROBLEM


  • FEATURE OF EXECUTION

    - Warp

    - SIMT (Single Instruction, Multiple Threads)

  • WARP & SIMT

    - Work-groups (cores) are divided into groups of 32 threads called warps.

    - Threads in the same warp always perform the same instruction (SIMT).

    → this is where the problem is!

  • BRANCH DIVERGENCE

    - Instead of all warp threads going through both paths of a branch, they all take one of the paths.

    - Those that should take the other path simply do nothing, delaying their computations to a subsequent iteration.

  • OPTIMIZE BRANCH DIVERGENCE

    - If Block 1 and Block 2 share a common sequence of code, we can move it to their predecessor or their successor.

    Move up (to the predecessor) → code hoisting

    Move down (to the successor) → code sinking

    - Reducing the code size in the branch blocks reduces the time threads waste on masked-off instructions.

    - Common sub-expressions can be moved.

    [Diagram: b1 & b2's predecessor → Block 1 (b1) | Block 2 (b2) → b1 & b2's successor]
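    The transformation can be sketched in plain C (a hypothetical scalar stand-in for a kernel body; the names `divergent` and `hoisted` are illustrative, not part of the actual pass):

    ```c
    #include <assert.h>

    /* Divergent version: both arms repeat the common sub-expression a * s. */
    float divergent(float a, float b, float s) {
        float r;
        if (a > b)
            r = a * s + b;   /* common sub-expression: a * s */
        else
            r = a * s - b;   /* common sub-expression: a * s */
        return r;
    }

    /* After hoisting: a * s is computed once in the common predecessor,
     * and only the short + b / - b tails remain in the divergent region. */
    float hoisted(float a, float b, float s) {
        float t = a * s;     /* hoisted to the predecessor block */
        float r;
        if (a > b)
            r = t + b;
        else
            r = t - b;
        return r;
    }
    ```

    Both versions compute the same result; the hoisted one simply leaves less code inside the region where half the warp sits idle.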

  • SOLUTION


  • WRITING A PASS

    - Write a pass that optimizes the branch divergence problem on LLVM IR (intermediate representation).

  • WRITING A PASS

    - Add an optimization pass here which implements our algorithm.

    [Diagram: Front end → Optimization Pass → Back end]

  • WHAT DO WE USE

    - LLVM – the LLVM Core libraries provide a modern source- and target-independent optimizer, along with a code generator.

    → used in the front end to generate LLVM IR, and in the back end to generate NVPTX code

    - Clang – an "LLVM native" C/C++/Objective-C compiler, which aims to deliver amazingly fast compiles.

    → used in the front end with LLVM

    - libclc – a library that aims to implement the OpenCL standard library.

    → used in the front end with LLVM and Clang in order to recognize OpenCL in kernel code

  • WHY DO WE CHOOSE LLVM IR

    - The original source code may be complex and hard to read.

    - LLVM IR is easy to read, more straightforward, and LLVM provides lots of APIs that we can use.


  • HOW TO WRITE A PASS


  • WRITING A PASS

    - Under YOUR_LLVM_DIRECTORY/lib/Transforms/

    - Create a folder (with mkdir); we need some files:

    → some .cpp files and other header files

    → a Makefile (http://llvm.org/docs/WritingAnLLVMPass.html#setting-up-the-build-environment)

    - More at http://llvm.org/docs/WritingAnLLVMPass.html

  • MAKE IT

    - Run make under the directory you created.

    - A .so file is then produced at YOUR_LLVM_DIRECTORY/Debug+Asserts/lib/xxx.so

  • HOW TO USE

    - opt -load ../../../Debug+Asserts/lib/xxx.so -flag < YOUR_IR.ll > /dev/null

    - More at http://llvm.org/docs/WritingAnLLVMPass.html#running-a-pass-with-opt

  • ALGORITHM


  • ALGORITHM

    - How to run through the whole kernel code in an efficient way (most important)

    - Code hoisting – some dependency problems

    - Code sinking – some dependency problems
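    The "run through the whole kernel efficiently" issue can be pictured as a depth-first walk that visits each basic block exactly once. The toy C sketch below is a hypothetical illustration (the real pass walks LLVM's CFG through its API; `succ`, `visit`, and `walk` are invented names):

    ```c
    #include <assert.h>

    /* A toy CFG: succ[i] lists the successors of block i (-1 = none). */
    #define NBLOCKS 4
    static const int succ[NBLOCKS][2] = {
        {1, 2},    /* block 0 branches to 1 and 2 (the divergent paths)  */
        {3, -1},   /* block 1 falls through to the join block 3          */
        {3, -1},   /* block 2 falls through to the join block 3          */
        {-1, -1},  /* block 3: exit                                      */
    };

    /* Recursive depth-first visit; a real pass would attempt hoisting
     * and sinking at each block it processes. */
    static void visit(int b, int *seen, int *order, int *n) {
        if (b < 0 || seen[b]) return;   /* no successor, or already done */
        seen[b] = 1;
        order[(*n)++] = b;              /* process block b here          */
        visit(succ[b][0], seen, order, n);
        visit(succ[b][1], seen, order, n);
    }

    /* Walk the whole CFG from the entry block; returns blocks visited. */
    int walk(int *order) {
        int seen[NBLOCKS] = {0};
        int n = 0;
        visit(0, seen, order, &n);
        return n;
    }
    ```

    Marking blocks as seen is what keeps the walk linear in the size of the kernel: every block is processed once even when paths rejoin.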

  • A RECURSIVE ALGORITHM

    - Implemented as recursive calls over the basic blocks

  • [Diagram slides] Each circle is a basic block.

  • CODE HOISTING


  • CODE SINKING


  • BENCHMARK


  • SOFTWARE ENVIRONMENT

    Item                 Version
    Operating System     Ubuntu 12.04.2 LTS
    CPU                  AMD Athlon(tm) II X4 630 Processor
    OpenCL               1.1
    Compute capability   2.1
    LLVM                 3.4

  • GPU INFORMATION

    GPU                     GeForce GTX 560
    Compute Units           7
    Global Memory           1024 MB
    Max Allocatable Memory  256 MB
    Local Memory            49152 bytes (48 KB)

  • TESTDATA1 – MATRIX

    - In this kernel, we randomly initialize the contents of arrays a and b and pass both arrays to the kernel, which does the following: if an element of a is bigger than the corresponding element of b, it subtracts b's value from a's and assigns the difference to the array result; otherwise it does the opposite.

    - Total times code hoisting was applied – 13

    - Total times code sinking was applied – 2
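    What the slide describes, rendered as a scalar C loop for illustration (the actual test is an OpenCL kernel where each index is one work-item; `diff_kernel` is an invented name):

    ```c
    #include <assert.h>

    /* Scalar version of the test kernel's body. The branch on a[i] > b[i]
     * is what diverges on the GPU; both arms share the same subtraction
     * pattern, so the pass can shrink the divergent region. */
    void diff_kernel(const int *a, const int *b, int *result, int n) {
        for (int i = 0; i < n; i++) {   /* on the GPU, each i is a work-item */
            if (a[i] > b[i])
                result[i] = a[i] - b[i];
            else
                result[i] = b[i] - a[i];
        }
    }
    ```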

  • [Chart: kernel execution time in milliseconds vs. matrix size (5000–600000), comparing the unoptimized and optimized kernels]

  • TESTDATA2 – CALCULATE PI

    - In this kernel, we compute the value of π. First we randomly initialize arrays a and b with values between 0 and 1; they hold the x and y coordinates of points in a coordinate system. We pass them to the kernel, which, for each point, assigns 1 to the array result if a[i]*a[i] + b[i]*b[i] is less than 1 – meaning the point falls inside the quarter circle inscribed in the unit square – and 0 otherwise. Afterwards we compute (points inside the quarter circle) / (total points), multiply by four, and obtain the π value we want.

    - Total times code hoisting was applied – 7

    - Total times code sinking was applied – 1
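    The same computation, written as host-side C for illustration (an assumption-laden sketch: `estimate_pi` is an invented name, and the real kernel classifies one point per work-item):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Monte Carlo estimate of pi: classify each random point (x, y) in the
     * unit square as inside or outside the quarter circle, then scale the
     * inside ratio by 4. The inside/outside branch is the divergent one. */
    double estimate_pi(int samples, unsigned seed) {
        srand(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y < 1.0)   /* inside the quarter circle */
                inside++;
        }
        return 4.0 * inside / samples;
    }
    ```

    With a few hundred thousand samples the estimate lands near 3.14, which matches the benchmark's sample counts.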

  • [Chart: kernel execution time in milliseconds vs. number of samples (5000–600000), comparing the unoptimized and optimized kernels]

  • BUILDING UP THE ENVIRONMENT

  • REQUIRED

    - Unix-like system

    - OpenCL (for host code)

    - LLVM, Clang & libclc (for kernel code)

  • CHECKOUT LLVM

    - Change directory to where you want the llvm directory placed.

    - svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

  • CHECKOUT CLANG

    - cd llvm/tools

    - svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

  • BUILD LLVM AND CLANG

    - From a build directory alongside the llvm source tree you just checked out:

    - ../llvm/configure --enable-targets=host,nvptx

    - make

  • LIBCLC

    - svn checkout http://llvm.org/svn/llvm-project/libclc/trunk libclc

    or

    - git clone http://llvm.org/git/libclc.git

  • BUILD LIBCLC WITH LLVM-CONFIG

    - cd libclc

    - ./configure.py --with-llvm-config="YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/llvm-config

    - make

  • INSTALL OPENCL

    - Install the NVIDIA CUDA Toolkit and display driver.

    - Download cuda_5.0.35_linux_64_ubuntu11.10-1.run from https://developer.nvidia.com/cuda-toolkit

    - Run the script in the directory where you stored the above file: sudo sh ./cuda_5.0.35_linux_64_ubuntu11.10-1.run

  • INSTALL OPENCL (CONT.)

    - This script will install 3 items for you:

    a. CUDA Toolkit

    b. CUDA Samples

    c. NVIDIA display driver

    - It is always necessary to stop the X server so that the NVIDIA display driver can be installed:

    sudo stop lightdm

    - You may also find that the Samples cannot be installed due to missing required libraries:

    sudo apt-get install freeglut3

    sudo ln -s /usr/lib/x86_64-linux-gnu/libglut.so.3 /usr/lib/libglut.so

  • INSTALL OPENCL (CONT.)

    - After all the libraries are installed, specify PATH and LD_LIBRARY_PATH in ~/.bashrc by adding:

    export PATH=$PATH:/usr/local/cuda-5.0/bin:/usr/local/cuda/bin

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib:/usr/local/cuda/lib

    - Enter ~/NVIDIA_CUDA-5.0_Samples and build the samples: sudo make (optional)

    - Install the NVIDIA GPU Computing SDK: download http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run from https://developer.nvidia.com/cuda-toolkit-40

    - You will then see the ~/NVIDIA_GPU_Computing_SDK directory; enter its OpenCL/ subdirectory and run make

  • KERNEL.CL -> LLVM IR

    - To compile the kernel to LLVM IR with target triple "nvptx", run:

    "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/clang \
      -I"YOUR_LIBCLC_DIRECTORY"/generic/include -include clc/clc.h \
      -Dcl_clang_storage_class_specifiers -target nvptx--nvidiacl \
      -Xclang -mlink-bitcode-file \
      -Xclang "YOUR_LIBCLC_DIRECTORY"/built_libs/nvptx--nvidiacl.bc \
      -S -emit-llvm kernel.cl -o kernel.ll

  • LLVM IR -> NVPTX

    - "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/llc -march=nvptx -mcpu=sm_20 kernel.ll -o kernel.ptx

  • COMPILE HOST CODE

    - "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/clang++ host.cpp -o host -lOpenCL

  • Introduction to LLVM IR


  • Features

    - LLVM IR is a Static Single Assignment (SSA) based representation.

    - Every definition creates a new variable.

    - Infinite virtual registers

  • Three different forms

    - in-memory compiler IR

    - on-disk bitcode representation

    - human-readable assembly language representation

    The three different forms of LLVM IR are all equivalent.

  • Structure

    - Module

    - Function

    - Basic block

  • Module Structure

    - LLVM programs are composed of modules, each of which is a translation unit of the input program.

  • Identifiers

    - Global identifiers begin with @

    - Local identifiers begin with %

    - There are reserved words, as in other languages

    %result = mul i32 %X, 8

  • Functions

    - An LLVM function definition starts with the "define" keyword.

    - An LLVM function declaration starts with the "declare" keyword.

  • Attributes

    - Parameter attributes are simple keywords that follow the specified type:

    declare i32 @printf(i8* noalias nocapture, ...)

    - Function attributes are simple keywords that follow the specified type:

    define void @f() noinline optsize { ... }

  • Data Layout

    - A module may specify a target-specific data layout string that specifies how data is to be laid out in memory:

    target datalayout = "layout specification"

  • Type System

    - The LLVM type system is one of the most important features of the intermediate representation.

    - A strong type system makes it easier to read the generated code.

  • Array Type

    - The array type is a very simple derived type that arranges elements sequentially in memory.

  • Pointer Type

    - The pointer type is used to specify memory locations. Pointers are commonly used to reference objects in memory.

  • Constant

    - Boolean constants: 'true' and 'false', of the i1 type

    - Integer constants: such as '4'

    - Floating-point constants

    - Null pointer constants: 'null' is recognized as a null pointer constant

    - Some other complex constants, such as structure constants, array constants, vector constants, etc.

  • Terminator Instructions

    - Every basic block in a program ends with a "terminator" instruction.

    - br instruction (used to cause control flow to transfer to a different basic block in the current function):

    br i1 <cond>, label <iftrue>, label <iffalse>

    br label <dest> ; unconditional branch

  • Instruction Reference

    - Terminator Instructions

    - Binary Operations

    - Bitwise Binary Operations

    - Vector Operations

    - Aggregate Operations

    - Memory Access and Addressing Operations

    - Conversion Operations

  • LLVM IR EXAMPLE


  • Kernel.cl


    __kernel void vector_add_gpu(__global const float* src_a,
                                 __global const float* src_b,
                                 __global float* res,
                                 const int num)
    {
        const int idx = get_global_id(0);
        if (idx < num)
            res[idx] = src_a[idx] + src_b[idx];
        else
            res[idx] = src_a[idx] - src_b[idx];
    }

  • Kernel.ll

    define void @vector_add_gpu(float addrspace(1)* nocapture %src_a, float addrspace(1)* nocapture %src_b, float addrspace(1)* nocapture %res, i32 %num) #0 {
    entry:
      %call = tail call i32 @get_global_id(i32 0) #2
      %cmp = icmp slt i32 %call, %num
      %arrayidx = getelementptr inbounds float addrspace(1)* %src_a, i32 %call
      %0 = load float addrspace(1)* %arrayidx, align 4, !tbaa !2
      %arrayidx1 = getelementptr inbounds float addrspace(1)* %src_b, i32 %call
      %1 = load float addrspace(1)* %arrayidx1, align 4, !tbaa !2
      br i1 %cmp, label %if.then, label %if.else

  • Kernel.ll

    if.then:                                  ; preds = %entry
      %add = fadd float %0, %1
      %arrayidx2 = getelementptr inbounds float addrspace(1)* %res, i32 %call
      store float %add, float addrspace(1)* %arrayidx2, align 4, !tbaa !2
      br label %if.end

    if.else:                                  ; preds = %entry
      %sub = fsub float %0, %1
      %arrayidx5 = getelementptr inbounds float addrspace(1)* %res, i32 %call
      store float %sub, float addrspace(1)* %arrayidx5, align 4, !tbaa !2
      br label %if.end

    if.end:                                   ; preds = %if.else, %if.then
      ret void
    }

  • LLVM Class Introduction


  • Instruction class

    - Base class for all the LLVM instructions (IR instructions)

    - A container for instructions (list structure)

    [Collaboration diagram for llvm::Instruction, showing the operand part]

  • User class

    - This class defines the interface that anything which uses a Value must implement.

    - Each instance of the Value class keeps track of which Users have handles to it.

    - Instructions are the largest class of Users.

    - Each instruction is the "user" of its operands (Values).

    - We get the operands of an instruction through the User interface: OperandList is a container of Use (Use*).

  • Use class

    - The Use class represents the operand of an instruction or some other User instance which refers to a Value. The Use class keeps the "use list" of the referenced value up to date. (A Use wraps a Value so that the Value can be operated on.)

  • Value class

    - It is the base class of all values computed by a program that may be used as operands to other values.

    - Value is the superclass of other important classes such as Instruction and Function.

    - All Values have a Type.

    - Setting the name on a Value automatically updates the module's symbol table.

    - Every Value has a "use list" that keeps track of which other Values are using this Value.

  • How to find API you need?

    - Step 1: Go to the LLVM documentation, e.g. http://llvm.org/docs/doxygen/html/index.html

    - Step 2: Search for the keyword (like instruction, basic block, dominator tree, ...)

    → there may be some public functions you can use

    - Step 3: For more detail you can also refer to the source code of the function provided.

    - Step 4: Figuring out and tracing the inheritance diagram as well as related classes can help you even more.

  • For more

    - Refer to: http://llvm.org/docs/doxygen/html/index.html

  • The end
