Undergraduate Capstone Project, Class of 101: CUDA

PARALLELIZING THE PROBLEM OF SET INTERSECTION BY GPU COMPUTING

Advisor: Prof. 伍朝欽

Team members: 張力升, 黃迺翔, 林承祖, 黃勛賢

Outline

• CUDA
• Apriori algorithm
• Memory coalescing
• Algorithm
• Experimental results
• Conclusion

CUDA

• Background
• GPU versus CPU
• Architecture
• Programming Model
• Streaming Multiprocessor

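Background

CUDA is NVIDIA's architecture for parallel computing on the GPU. Programmers can implement parallel processing either in a high-level language or through the driver API.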

GPU versus CPU

[Figure: side-by-side block diagrams of a CPU and a GPU, each built from Control, Cache, ALU, and DRAM blocks.]

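CUDA Architecture

[Figure: software stack, top to bottom: Application, CUDA Library, CUDA Runtime, CUDA Driver, running on the CPU and GPU.]

CUDA consists of three parts: the library, the runtime, and the driver. A program controls and uses the GPU's computing power through these three layers.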

Programming Model

[Figure: the CPU (host) calls Kernel 1 and Kernel 2 on the GPU (device); one kernel runs as one grid (Grid 1, Grid 2). One grid contains Block(0, 0) to Block(1, 1), the other contains Block(0, 0) to Block(1, 2), and one block is expanded into Thread(0, 0) to Thread(2, 1).]

Blocks can be arranged in up to a two-dimensional array, while threads can be arranged in up to three dimensions.
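As a rough illustration of the block and thread hierarchy, the following Python sketch enumerates the global (x, y) coordinate of every thread in a 2-D grid of 2-D blocks, the way a CUDA kernel usually combines `blockIdx`, `blockDim`, and `threadIdx`. The function name `global_coords` and the launch shape are made up for illustration, and this is plain Python rather than CUDA code.

```python
from itertools import product

def global_coords(grid_dim, block_dim):
    """Return {(block, thread): (gx, gy)} for a 2-D grid of 2-D blocks.

    Mirrors the usual CUDA formula:
        gx = blockIdx.x * blockDim.x + threadIdx.x
        gy = blockIdx.y * blockDim.y + threadIdx.y
    """
    coords = {}
    for by, bx in product(range(grid_dim[1]), range(grid_dim[0])):
        for ty, tx in product(range(block_dim[1]), range(block_dim[0])):
            gx = bx * block_dim[0] + tx
            gy = by * block_dim[1] + ty
            coords[((bx, by), (tx, ty))] = (gx, gy)
    return coords

# Hypothetical launch shape: a 2 x 2 grid of blocks, each with 3 x 2 threads.
coords = global_coords(grid_dim=(2, 2), block_dim=(3, 2))
print(len(coords))                  # 24 threads in total
print(coords[((1, 0), (2, 1))])     # (5, 1)
```

Every thread ends up with a distinct global coordinate, which is what lets each one pick its own piece of the data.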

Memory Model

[Figure: a device grid containing Block (0, 0) and Block (1, 0). Each block has its own shared memory, and each thread, Thread(0, 0) and Thread(1, 0), has its own registers and local memory. Global, constant, and texture memory belong to the whole grid and connect to the host.]

• Read/write per thread: registers
• Read/write per block: shared memory
• Read/write per thread: local memory (DRAM)
• Read/write per grid: global memory (DRAM)
• Read-only per grid: constant and texture memories (DRAM)

Streaming Multiprocessor

• Multithreaded issue unit: schedules instructions
• Instruction cache and constant cache
• 8 streaming processors (SP): each SP handles one thread
• 2 special function units (SFU): transcendental operations (e.g. sin, cos)
• 16 KB read/write shared memory: software-managed data storage

[Figure: G80 Streaming Multiprocessor (SM), showing the I cache, C cache, MT issue unit, eight SPs, two SFUs, and shared memory.]

Apriori Algorithm

• Data Mining
• Association Analysis
• Find Frequent Patterns
• Apriori Algorithm

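Data Mining (1)

Concept: extract understandable, useful information from large volumes of stored data to help decision-makers reach better-informed decisions.

[Figure: Data → Data Mining → Information.]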

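Data Mining (2)

Scope: models that turn data into information by describing the features and relationships in the data.

• Classification
• Association
• Clustering
• Trend Analysis
• Sequence Pattern

Association analysis examines the relationship between two data items.

E.g. a retail store manager has to choose a promotion strategy. If the transaction records show that soda and potato chips are bought together unusually often, the promotion can be built on that information.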

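Find Frequent Patterns (1)

Concept: find the data that occurs most frequently.

E.g. identify which combinations of products many customers buy together.

Definitions:
• Minimum support: the minimum number of times an itemset must occur
• K-itemset: a set made up of K elements, e.g. the 2-itemsets {A, B} and {A, C}
• Transaction: the index of an itemset, e.g.

Transaction   Itemset
T1            {A, B, C}
T2            {A, D}

Example with minimum support = 3:

Items: {A} potato chips, {B} soda, {C} instant noodles, {D} milk, {E} eggs

[Table: transaction ID and transaction record columns.]

K-itemsets that occur in 3 or more transaction records:
1-itemset   {A}, {B}, {C}, {D}, {E}
2-itemset   {A,B}, {B,C}, {B,D}, {B,E}, {C,D}
3-itemset   {B,C,D}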
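To make the definitions of support and K-itemset concrete, here is a minimal Python sketch that counts support by brute force. The transaction records are hypothetical (the slide's own records are not available), and the helper names `support` and `frequent_k_itemsets` are made up for this example.

```python
from itertools import combinations

# Hypothetical transaction records (not the data from the slide).
transactions = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "B", "D"},
    "T3": {"B", "C", "D"},
    "T4": {"A", "B", "C", "D"},
    "T5": {"B", "E"},
}

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def frequent_k_itemsets(k, transactions, min_support):
    """Brute force: test every k-combination of the items that occur."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for combo in combinations(items, k):
        count = support(set(combo), transactions)
        if count >= min_support:
            result[combo] = count
    return result

print(frequent_k_itemsets(1, transactions, min_support=3))
print(frequent_k_itemsets(2, transactions, min_support=3))
```

Brute force tests every combination, which grows quickly with the number of items; that cost is what the Apriori pruning principle is meant to reduce.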

Apriori Algorithm (1)

Apriori pruning principle: if an itemset is infrequent, none of its supersets should be generated or tested (Agrawal & Srikant, VLDB '94; Mannila et al., KDD '94).

[Figure: Data → Handle (Search, Sort) → Result.]



Minimum support = 3

Original        Sorted
{A}  3          {B}  6
{B}  6          {C}  4
{C}  4          {D}  4
{D}  4          {A}  3
{E}  3          {E}  3


Original        Sorted
{A,B}  3        {B,C}  4
{A,C}  2        {B,D}  4
{A,D}  2        {A,B}  3
{A,E}  1        {B,E}  3
{B,C}  4        {C,D}  3
{B,D}  4        {A,C}  2
{B,E}  3        {A,D}  2
{C,D}  3        {C,E}  2
{C,E}  2        {D,E}  2
{D,E}  2        {A,E}  1

Minimum support = 3
1-itemset: {A}, {B}, {C}, {D}, {E}



Original          Sorted
{A,B,C}  2        {B,C,D}  3
{A,B,D}  2        {A,B,C}  2
{A,B,E}  1        {A,B,D}  2
{B,C,D}  3        {B,C,E}  2
{B,C,E}  2        {B,D,E}  2
{B,D,E}  2        {C,D,E}  2
{C,D,E}  2        {A,B,E}  1

Minimum support = 3
3-itemset: {B,C,D}
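The level-by-level counting walked through in these tables can be sketched in Python as follows. This is a minimal textbook-style Apriori with a join step and a prune step, not the project's own implementation; the function name `apriori` and the transaction list are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset."""
    def count(candidates):
        # Count each candidate, then keep only those meeting min_support.
        counts = {c: 0 for c in candidates}
        for items in transactions:
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = set().union(*transactions)
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical transactions, chosen so that {B, C, D} is the only frequent 3-itemset.
data = [
    {"B", "C", "D"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "B"}, {"A", "B", "E"}, {"B", "E"},
]
result = apriori(data, min_support=3)
print(result[frozenset("BCD")])     # 3
```

With this data, {A,B,C} and {A,B,D} are never counted, because {A,C} and {A,D} are already infrequent at the 2-itemset level.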

Memory Coalescing

• Concept
• Compare

Concept

To maximize global memory bandwidth:
• Minimize the number of bus transactions
• Coalesce memory accesses

Coalescing: memory transactions are issued per half-warp (16 threads).

Compare (1)

[Figure: a half-warp of threads 0 to 15 reading addresses 128, 132, 136, 140, …, 188, one 4-byte word per thread. Two cases are shown: all threads participate, and some threads do not participate.]

Compare (2)

[Figure: the same threads 0 to 15 and addresses 128 to 188. Two cases are shown: permuted access by threads, and a misaligned starting address (not a multiple of 64).]
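The four cases pictured above can be modelled with a small Python check. It assumes the strict half-warp rule used by the earliest CUDA devices (compute capability 1.0 and 1.1): for 4-byte words, thread k must read the k-th word of a 64-byte-aligned segment, and idle threads are allowed. Later hardware relaxes this rule, so treat the sketch only as a model of these slides; the name `is_coalesced` is made up.

```python
def is_coalesced(addresses, word_size=4, half_warp=16):
    """Check the strict half-warp rule assumed in the lead-in.

    `addresses[k]` is the byte address read by thread k, or None if that
    thread does not participate.
    """
    segment = word_size * half_warp          # 64 bytes for 4-byte words
    active = [(k, a) for k, a in enumerate(addresses) if a is not None]
    if len(addresses) != half_warp or not active:
        return False
    k0, a0 = active[0]
    base = a0 - k0 * word_size               # address thread 0 would use
    if base % segment != 0:                  # misaligned starting address
        return False
    # Thread k must read exactly the k-th word of the segment.
    return all(a == base + k * word_size for k, a in active)

all_threads = [128 + 4 * k for k in range(16)]
some_idle = [a if k not in (3, 7) else None for k, a in enumerate(all_threads)]
permuted = all_threads[:]
permuted[1], permuted[2] = permuted[2], permuted[1]
misaligned = [132 + 4 * k for k in range(16)]

for name, case in [("all", all_threads), ("some idle", some_idle),
                   ("permuted", permuted), ("misaligned", misaligned)]:
    print(name, is_coalesced(case))
```

Under this model the first two cases coalesce into one transaction, while the permuted and misaligned cases do not.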

Algorithm

• Objective
• Parameter definitions
• Experimental algorithm

Objective

• Solve the Find Frequent Patterns problem in parallel
• Optimize the program with memory coalescing

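Parameter Definitions

• Target[]: the shorter set
• Set[]: the longer set
• Result: the result of intersecting the two sets
• Begin: comparison starts at Set[Begin]
• End: comparison does not go past Set[End]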
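A sequential Python sketch of how Target, Set, Begin, and End might fit together is shown below. It assumes that End is inclusive and that Result holds one True/False flag per Target element; both are readings of the slide rather than confirmed details, and the function name `intersect` is made up.

```python
def intersect(target, set_, begin, end):
    """Flag which elements of `target` occur in set_[begin..end] (inclusive).

    Returns (flags, result): one boolean per target element, and the
    elements of `target` that were found.
    """
    window = set_[begin:end + 1]
    flags = [t in window for t in target]
    result = [t for t, hit in zip(target, flags) if hit]
    return flags, result

target = ["B", "C", "E", "F"]               # shorter set
set_ = ["A", "B", "C", "D", "E", "G"]       # longer set
flags, result = intersect(target, set_, begin=0, end=len(set_) - 1)
print(flags)    # [True, True, True, False]
print(result)   # ['B', 'C', 'E']
```

Narrowing Begin and End restricts the comparison to a slice of Set, which is one way the work could be divided among threads.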

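Experimental Algorithm (1)

[Figure: Target = B C E F is compared with Set = A B C D E G, starting at Begin; Result holds a True or False flag for each comparison. A second pair of rows shows G H I J and A B C D E G H.]

Experimental Algorithm (2): CPU

[Figure: CPU processing of the rows G H I J and A B C D E G.]

Experimental Algorithm (3): GPU without memory coalescing

[Figure: Thread_1, Thread_2, and Thread_3 processing the rows G H I J and A B C D E G.]

Experimental Algorithm (4): GPU with memory coalescing

[Figure: Thread_1, Thread_2, and Thread_3 with coalesced memory access.]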
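One common way the two GPU variants differ is in how array indices are handed to threads. The Python sketch below assumes that the non-coalesced version gives each thread a contiguous chunk and the coalesced version interleaves indices across threads. The slides do not spell this out, so this is an assumption, and the function names are hypothetical.

```python
def blocked_indices(n, num_threads):
    """Each thread scans one contiguous chunk (neighbouring threads far apart)."""
    chunk = (n + num_threads - 1) // num_threads
    return [list(range(t * chunk, min((t + 1) * chunk, n)))
            for t in range(num_threads)]

def interleaved_indices(n, num_threads):
    """Thread t reads t, t + T, t + 2T, ... (neighbouring threads adjacent)."""
    return [list(range(t, n, num_threads)) for t in range(num_threads)]

def step(assignment, i):
    """Indices touched by all threads at their i-th iteration."""
    return [idx[i] for idx in assignment if i < len(idx)]

n, threads = 6, 3                      # e.g. six Set elements, three threads
print(blocked_indices(n, threads))     # [[0, 1], [2, 3], [4, 5]]
print(interleaved_indices(n, threads)) # [[0, 3], [1, 4], [2, 5]]
print(step(blocked_indices(n, threads), 0))      # [0, 2, 4]
print(step(interleaved_indices(n, threads), 0))  # [0, 1, 2]
```

In the interleaved layout the threads touch adjacent elements in the same step, which is the access pattern that coalescing rewards.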

Experimental Results

• Platform
• Data Size, Block & Thread
• CPU versus GPU
• Memory Coalescing Effect

Platform (CPU)

Number of cores             4
Number of threads           4
Clock speed                 2 GHz
Memory size                 6 GB
Memory type                 DDR3 800
Number of memory channels   3
Max memory bandwidth        19.2 GB/s
Cache                       4 MB

Platform (GPU)

Number of GPUs              1
Number of processor cores   240
Clock speed                 1300 MHz
Memory size                 4 GB
Memory type                 GDDR3
Memory clock                800 MHz
Max memory bandwidth        102.4 GB/s
Compute capability          1.3
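As a sanity check on the two bandwidth figures, the arithmetic can be sketched in Python. The 64-bit width of a DDR3 channel is standard. The 512-bit GPU memory bus is an assumption that does not appear on the slide, chosen because it reproduces the quoted number. The helper name `peak_bandwidth_gb_s` is made up.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits, channels=1):
    """Peak bandwidth in GB/s = transfers/s * bytes per transfer * channels."""
    return transfers_per_second * (bus_width_bits / 8) * channels / 1e9

# CPU: DDR3-800 performs 800 million transfers/s on a 64-bit channel.
cpu = peak_bandwidth_gb_s(800e6, bus_width_bits=64, channels=3)

# GPU: 800 MHz GDDR3 is double data rate, so 1600 million transfers/s.
# The 512-bit bus width is an assumption, not stated on the slide.
gpu = peak_bandwidth_gb_s(2 * 800e6, bus_width_bits=512)

print(round(cpu, 1), round(gpu, 1))   # 19.2 102.4
```

Under these assumptions the GPU has a little over five times the peak memory bandwidth of the CPU.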

Data Size, Block & Thread

[Figure: charts titled "Data Size = 10, Block fixed", "Data Size = 10K, Block fixed", "Data Size = 10K, Thread fixed", "Data Size = 10", "Data Size = 100K, Block fixed", "Data Size = 100K, Thread fixed", and "Block = 1, Thread = 1".]




CPU versus GPU




Memory Coalescing Effect

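Conclusion

• With the CUDA architecture, the search step of data mining takes only about one third of the original CPU program's running time when the data set is large.
• The CUDA program optimized with memory coalescing is about three times as fast as the original CUDA program, and nearly ten times as fast as the CPU program.
• Future work: parallelize more algorithms on the CUDA architecture to optimize them further.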

Thank you for listening!
