Ministry of Education Program on Smart Electronics Integrated Talent Cultivation — Advanced Application Processor (AP) Alliance Center

Topic area: Processor Hardware/Software Core Systems
Course module: Compiler Implementation Module
Teaching material: Branch Divergence Reduction of OpenCL Programs Using LLVM
Instructor: 游逸平 (Yi-Ping You)
Students: 蔡也寧、趙硯廷、周昆霖
Department: Department of Computer Science, National Chiao Tung University
Phone: 03-571-2121 ext. 56688
Address: 1001 University Road, Hsinchu City
Submission date: February 1, 2014
Keywords: LLVM, compiler backend, optimization, GPGPU, OpenCL



  • BRANCH DIVERGENCE REDUCTION OF OPENCL PROGRAMS USING LLVM

    Department of Computer Science, National Chiao Tung University
    蔡也寧、周昆霖、趙硯廷; Advisor: 游逸平

  • OUTLINE

    - Branch Divergence Reduction of OpenCL Programs Using LLVM

    - Building Up the Environment

    - Introduction to LLVM

  • OUTLINE

    - Motivation

    - OpenCL Introduction

    - Problem – Warp & SIMT

    - Solution – Writing a pass

    - Algorithm – Main issue

    - Benchmark

  • MOTIVATION


  • HETEROGENEOUS WORLD

    - A computer may have more than one GPU

    - GPUs are becoming more and more powerful

    - GPUs have more cores than CPUs

    - We can utilize the cores on a GPU through OpenCL

    → use all computational resources in the system

    → an efficient parallel programming model

    - But some problems exist!

  • OPENCL INTRODUCTION


  • OPENCL

    - Open Computing Language (OpenCL) is a framework.

    - An OpenCL program usually consists of one piece of host code and multiple kernels.

    - Programs execute across heterogeneous platforms.

    [Diagram: Host Code → Kernel Code]

  • KERNEL

    - An OpenCL program contains one or more kernels and any supporting routines that run on a target device.

    - An OpenCL kernel is the basic unit of parallel code that can be executed on a target device.

    - Kernels are typically simple functions that transform input memory objects into output memory objects.

  • OPENCL PLATFORM MODEL

    - Host: CPU

    - Compute Device: GPU

    - Compute Unit (work-group): core

    - Processing Element (work-item, thread)

    → each PE runs the kernel code

    [Diagram: CPU host driving GPU cores]

  • PROBLEM


  • FEATURE OF EXECUTION

    - Warp

    - SIMT (Single Instruction, Multiple Threads)

  • WARP & SIMT

    - Work-groups (cores) are divided into groups of 32 threads called warps.

    - Threads in the same warp always perform the same instruction (SIMT).

    → this is where the problem is!

  • BRANCH DIVERGENCE

    - Instead of all warp threads going through both paths of a branch, they all take one of the paths.

    - Those that should take the other path simply do nothing, delaying their computations to a subsequent iteration.

  • OPTIMIZE BRANCH DIVERGENCE

    - If Block 1 and Block 2 share a common sequence of code, we can move it to their predecessor or their successor.

    Move up (to the predecessor) → code hoisting

    Move down (to the successor) → code sinking

    - Reducing the code size in the branch blocks reduces the time threads waste on masked-off instructions.

    - Common sub-expressions can be moved.

    [Diagram: b1 & b2's predecessor → Block 1 (b1) | Block 2 (b2) → b1 & b2's successor]
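    The transformation can be sketched in plain C (a hypothetical scalar stand-in for a kernel body; the names `divergent` and `hoisted` are illustrative, not part of the actual pass):

    ```c
    #include <assert.h>

    /* Divergent version: both arms repeat the common sub-expression a * s. */
    float divergent(float a, float b, float s) {
        float r;
        if (a > b)
            r = a * s + b;   /* common sub-expression: a * s */
        else
            r = a * s - b;   /* common sub-expression: a * s */
        return r;
    }

    /* After hoisting: a * s is computed once in the common predecessor,
     * and only the short + b / - b tails remain in the divergent region. */
    float hoisted(float a, float b, float s) {
        float t = a * s;     /* hoisted to the predecessor block */
        float r;
        if (a > b)
            r = t + b;
        else
            r = t - b;
        return r;
    }
    ```

    Both versions compute the same result; the hoisted one simply leaves less code inside the region where half the warp sits idle.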

  • SOLUTION


  • WRITING A PASS

    - Write a pass that optimizes the branch divergence problem on LLVM IR (intermediate representation).

  • WRITING A PASS

    - Add an optimization pass here which implements our algorithm.

    [Diagram: Front end → Optimization Pass → Back end]

  • WHAT DO WE USE

    - LLVM – the LLVM Core libraries provide a modern source- and target-independent optimizer, along with a code generator.

    → used in the front end to generate LLVM IR, and in the back end to generate NVPTX code

    - Clang – an "LLVM native" C/C++/Objective-C compiler, which aims to deliver amazingly fast compiles.

    → used in the front end with LLVM

    - libclc – a library that aims to implement the OpenCL standard library.

    → used in the front end with LLVM and Clang in order to recognize OpenCL in kernel code

  • WHY DO WE CHOOSE LLVM IR

    - The original source code may be complex and hard to read.

    - LLVM IR is easy to read, more straightforward, and LLVM provides lots of APIs that we can use.


  • HOW TO WRITE A PASS


  • WRITING A PASS

    - Under YOUR_LLVM_DIRECTORY/lib/Transforms/

    - Create a folder (with mkdir); we need some files:

    → some .cpp files and other header files

    → a Makefile (http://llvm.org/docs/WritingAnLLVMPass.html#setting-up-the-build-environment)

    - More at http://llvm.org/docs/WritingAnLLVMPass.html

  • MAKE IT

    - Run make under the directory you created.

    - A .so file is then produced at YOUR_LLVM_DIRECTORY/Debug+Asserts/lib/xxx.so

  • HOW TO USE

    - opt -load ../../../Debug+Asserts/lib/xxx.so -flag < YOUR_IR.ll > /dev/null

    - More at http://llvm.org/docs/WritingAnLLVMPass.html#running-a-pass-with-opt

  • ALGORITHM


  • ALGORITHM

    - How to run through the whole kernel code in an efficient way (most important)

    - Code hoisting – some dependency problems

    - Code sinking – some dependency problems
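    The "run through the whole kernel efficiently" issue can be pictured as a depth-first walk that visits each basic block exactly once. The toy C sketch below is a hypothetical illustration (the real pass walks LLVM's CFG through its API; `succ`, `visit`, and `walk` are invented names):

    ```c
    #include <assert.h>

    /* A toy CFG: succ[i] lists the successors of block i (-1 = none). */
    #define NBLOCKS 4
    static const int succ[NBLOCKS][2] = {
        {1, 2},    /* block 0 branches to 1 and 2 (the divergent paths)  */
        {3, -1},   /* block 1 falls through to the join block 3          */
        {3, -1},   /* block 2 falls through to the join block 3          */
        {-1, -1},  /* block 3: exit                                      */
    };

    /* Recursive depth-first visit; a real pass would attempt hoisting
     * and sinking at each block it processes. */
    static void visit(int b, int *seen, int *order, int *n) {
        if (b < 0 || seen[b]) return;   /* no successor, or already done */
        seen[b] = 1;
        order[(*n)++] = b;              /* process block b here          */
        visit(succ[b][0], seen, order, n);
        visit(succ[b][1], seen, order, n);
    }

    /* Walk the whole CFG from the entry block; returns blocks visited. */
    int walk(int *order) {
        int seen[NBLOCKS] = {0};
        int n = 0;
        visit(0, seen, order, &n);
        return n;
    }
    ```

    Marking blocks as seen is what keeps the walk linear in the size of the kernel: every block is processed once even when paths rejoin.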

  • A RECURSIVE ALGORITHM

    - Implemented as recursive calls over the basic blocks

  • [Diagram slides] Each circle is a basic block.

  • CODE HOISTING


  • CODE SINKING


  • BENCHMARK


  • SOFTWARE ENVIRONMENT

    Item                 Version
    Operating System     Ubuntu 12.04.2 LTS
    CPU                  AMD Athlon(tm) II X4 630 Processor
    OpenCL               1.1
    Compute capability   2.1
    LLVM                 3.4

  • GPU INFORMATION

    GPU                     GeForce GTX 560
    Compute Units           7
    Global Memory           1024 MB
    Max Allocatable Memory  256 MB
    Local Memory            49152 bytes (48 KB)

  • TESTDATA1 – MATRIX

    - In this kernel, we randomly initialize the contents of arrays a and b and pass both arrays to the kernel, which does the following: if an element of a is bigger than the corresponding element of b, it subtracts b's value from a's and assigns the difference to the array result; otherwise it does the opposite.

    - Total times code hoisting was applied – 13

    - Total times code sinking was applied – 2
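    What the slide describes, rendered as a scalar C loop for illustration (the actual test is an OpenCL kernel where each index is one work-item; `diff_kernel` is an invented name):

    ```c
    #include <assert.h>

    /* Scalar version of the test kernel's body. The branch on a[i] > b[i]
     * is what diverges on the GPU; both arms share the same subtraction
     * pattern, so the pass can shrink the divergent region. */
    void diff_kernel(const int *a, const int *b, int *result, int n) {
        for (int i = 0; i < n; i++) {   /* on the GPU, each i is a work-item */
            if (a[i] > b[i])
                result[i] = a[i] - b[i];
            else
                result[i] = b[i] - a[i];
        }
    }
    ```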

  • [Chart: kernel execution time in milliseconds vs. matrix size (5000–600000), comparing the unoptimized and optimized kernels]

  • TESTDATA2 – CALCULATE PI

    - In this kernel, we compute the value of π. First we randomly initialize arrays a and b with values between 0 and 1; they hold the x and y coordinates of points in a coordinate system. We pass them to the kernel, which, for each point, assigns 1 to the array result if a[i]*a[i] + b[i]*b[i] is less than 1 – meaning the point falls inside the quarter circle inscribed in the unit square – and 0 otherwise. Afterwards we compute (points inside the quarter circle) / (total points), multiply by four, and obtain the π value we want.

    - Total times code hoisting was applied – 7

    - Total times code sinking was applied – 1
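    The same computation, written as host-side C for illustration (an assumption-laden sketch: `estimate_pi` is an invented name, and the real kernel classifies one point per work-item):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Monte Carlo estimate of pi: classify each random point (x, y) in the
     * unit square as inside or outside the quarter circle, then scale the
     * inside ratio by 4. The inside/outside branch is the divergent one. */
    double estimate_pi(int samples, unsigned seed) {
        srand(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y < 1.0)   /* inside the quarter circle */
                inside++;
        }
        return 4.0 * inside / samples;
    }
    ```

    With a few hundred thousand samples the estimate lands near 3.14, which matches the benchmark's sample counts.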

  • [Chart: kernel execution time in milliseconds vs. number of samples (5000–600000), comparing the unoptimized and optimized kernels]

  • BUILDING UP THE ENVIRONMENT

  • REQUIRED

    - Unix-like system

    - OpenCL (for host code)

    - LLVM, Clang & libclc (for kernel code)

  • CHECKOUT LLVM

    - Change directory to where you want the llvm directory placed.

    - svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

  • CHECKOUT CLANG

    - cd llvm/tools

    - svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

  • BUILD LLVM AND CLANG

    - From a build directory alongside the llvm source tree you just checked out:

    - ../llvm/configure --enable-targets=host,nvptx

    - make

  • LIBCLC

    - svn checkout http://llvm.org/svn/llvm-project/libclc/trunk libclc

    or

    - git clone http://llvm.org/git/libclc.git

  • BUILD LIBCLC WITH LLVM-CONFIG

    - cd libclc

    - ./configure.py --with-llvm-config="YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/llvm-config

    - make

  • INSTALL OPENCL

    - Install the NVIDIA CUDA Toolkit and display driver.

    - Download cuda_5.0.35_linux_64_ubuntu11.10-1.run from https://developer.nvidia.com/cuda-toolkit

    - Run the script in the directory where you stored the above file: sudo sh ./cuda_5.0.35_linux_64_ubuntu11.10-1.run

  • INSTALL OPENCL (CONT.)

    - This script will install 3 items for you:

    a. CUDA Toolkit

    b. CUDA Samples

    c. NVIDIA display driver

    - It is always necessary to stop the X server so that the NVIDIA display driver can be installed:

    sudo stop lightdm

    - You may also find that the Samples cannot be installed due to missing required libraries:

    sudo apt-get install freeglut3

    sudo ln -s /usr/lib/x86_64-linux-gnu/libglut.so.3 /usr/lib/libglut.so

  • INSTALL OPENCL (CONT.)

    - After all the libraries are installed, specify PATH and LD_LIBRARY_PATH in ~/.bashrc by adding:

    export PATH=$PATH:/usr/local/cuda-5.0/bin:/usr/local/cuda/bin

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-5.0/lib:/usr/local/cuda/lib

    - Enter ~/NVIDIA_CUDA-5.0_Samples and build the samples: sudo make (optional)

    - Install the NVIDIA GPU Computing SDK: download http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run from https://developer.nvidia.com/cuda-toolkit-40

    - You will then see the ~/NVIDIA_GPU_Computing_SDK directory; enter its OpenCL/ subdirectory and run make

  • KERNEL.CL -> LLVM IR

    - To compile the kernel to LLVM IR with target triple "nvptx", run:

    "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/clang \
      -I"YOUR_LIBCLC_DIRECTORY"/generic/include -include clc/clc.h \
      -Dcl_clang_storage_class_specifiers -target nvptx--nvidiacl \
      -Xclang -mlink-bitcode-file \
      -Xclang "YOUR_LIBCLC_DIRECTORY"/built_libs/nvptx--nvidiacl.bc \
      -S -emit-llvm kernel.cl -o kernel.ll

  • LLVM IR -> NVPTX

    - "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/llc -march=nvptx -mcpu=sm_20 kernel.ll -o kernel.ptx

  • COMPILE HOST CODE

    - "YOUR_LLVM_DIRECTORY"/Debug+Asserts/bin/clang++ host.cpp -o host -lOpenCL

  • Introduction to LLVM IR


  • Features

    - LLVM IR is a Static Single Assignment (SSA) based representation.

    - Every definition creates a new variable.

    - Infinite virtual registers

  • Three different forms

    - in-memory compiler IR

    - on-disk bitcode representation

    - human-readable assembly language representation

    The three different forms of LLVM IR are all equivalent.

  • Structure

    - Module

    - Function

    - Basic block

  • Module Structure

    - LLVM programs are composed of modules, each of which is a translation unit of the input program.

  • Identifiers

    - Global identifiers begin with @

    - Local identifiers begin with %

    - There are reserved words, as in other languages

    %result = mul i32 %X, 8

  • Functions

    - An LLVM function definition starts with the "define" keyword.

    - An LLVM function declaration starts with the "declare" keyword.

  • Attributes

    - Parameter attributes are simple keywords that follow the specified type:

    declare i32 @printf(i8* noalias nocapture, ...)

    - Function attributes are simple keywords that follow the specified type:

    define void @f() noinline optsize { ... }

  • Data Layout

    - A module may specify a target-specific data layout string that specifies how data is to be laid out in memory:

    target datalayout = "layout specification"

  • Type System

    - The LLVM type system is one of the most important features of the intermediate representation.

    - A strong type system makes it easier to read the generated code.

  • Array Type

    - The array type is a very simple derived type that arranges elements sequentially in memory.

  • Pointer Type

    - The pointer type is used to specify memory locations. Pointers are commonly used to reference objects in memory.

  • Constant

    - Boolean constants: 'true' and 'false', of the i1 type

    - Integer constants: such as '4'

    - Floating-point constants

    - Null pointer constants: 'null' is recognized as a null pointer constant

    - Some other complex constants, such as structure constants, array constants, vector constants, etc.

  • Terminator Instructions

    - Every basic block in a program ends with a "terminator" instruction.

    - br instruction (used to cause control flow to transfer to a different basic block in the current function):

    br i1 <cond>, label <iftrue>, label <iffalse>

    br label <dest> ; unconditional branch

  • Instruction Reference

    - Terminator Instructions

    - Binary Operations

    - Bitwise Binary Operations

    - Vector Operations

    - Aggregate Operations

    - Memory Access and Addressing Operations

    - Conversion Operations

  • LLVM IR EXAMPLE


  • Kernel.cl


    __kernel void vector_add_gpu(__global const float* src_a,
                                 __global const float* src_b,
                                 __global float* res,
                                 const int num)
    {
        const int idx = get_global_id(0);
        if (idx < num)
            res[idx] = src_a[idx] + src_b[idx];
        else
            res[idx] = src_a[idx] - src_b[idx];
    }

  • Kernel.ll

    define void @vector_add_gpu(float addrspace(1)* nocapture %src_a, float addrspace(1)* nocapture %src_b, float addrspace(1)* nocapture %res, i32 %num) #0 {
    entry:
      %call = tail call i32 @get_global_id(i32 0) #2
      %cmp = icmp slt i32 %call, %num
      %arrayidx = getelementptr inbounds float addrspace(1)* %src_a, i32 %call
      %0 = load float addrspace(1)* %arrayidx, align 4, !tbaa !2
      %arrayidx1 = getelementptr inbounds float addrspace(1)* %src_b, i32 %call
      %1 = load float addrspace(1)* %arrayidx1, align 4, !tbaa !2
      br i1 %cmp, label %if.then, label %if.else

  • Kernel.ll

    if.then:                                  ; preds = %entry
      %add = fadd float %0, %1
      %arrayidx2 = getelementptr inbounds float addrspace(1)* %res, i32 %call
      store float %add, float addrspace(1)* %arrayidx2, align 4, !tbaa !2
      br label %if.end

    if.else:                                  ; preds = %entry
      %sub = fsub float %0, %1
      %arrayidx5 = getelementptr inbounds float addrspace(1)* %res, i32 %call
      store float %sub, float addrspace(1)* %arrayidx5, align 4, !tbaa !2
      br label %if.end

    if.end:                                   ; preds = %if.else, %if.then
      ret void
    }

  • LLVM Class Introduction


  • Instruction class

    - Base class for all the LLVM instructions (IR instructions)

    - A container for instructions (list structure)

    [Collaboration diagram for llvm::Instruction, showing the operand part]

  • User class

    - This class defines the interface that anything which uses a Value must implement.

    - Each instance of the Value class keeps track of which Users have handles to it.

    - Instructions are the largest class of Users.

    - Each instruction is the "user" of its operands (Values).

    - We get the operands of an instruction through the User interface: OperandList is a container of Use (Use*).

  • Use class

    - The Use class represents the operand of an instruction or some other User instance which refers to a Value. The Use class keeps the "use list" of the referenced value up to date. (A Use wraps a Value so that the Value can be operated on.)

  • Value class

    - It is the base class of all values computed by a program that may be used as operands to other values.

    - Value is the superclass of other important classes such as Instruction and Function.

    - All Values have a Type.

    - Setting the name on a Value automatically updates the module's symbol table.

    - Every Value has a "use list" that keeps track of which other Values are using this Value.

  • How to find API you need?

    - Step 1: Go to the LLVM documentation, e.g. http://llvm.org/docs/doxygen/html/index.html

    - Step 2: Search for the keyword (like instruction, basic block, dominator tree, ...)

    → there may be some public functions you can use

    - Step 3: For more detail you can also refer to the source code of the function provided.

    - Step 4: Figuring out and tracing the inheritance diagram as well as related classes can help you even more.

  • For more

    - Refer to: http://llvm.org/docs/doxygen/html/index.html

  • The end
