Undergraduate Capstone Project (Class of 101): CUDA
TRANSCRIPT
PARALLELIZING THE PROBLEM OF SET INTERSECTION BY GPU COMPUTING
Advisor: Prof. 伍朝欽
Team members: 張力升, 黃迺翔, 林承祖, 黃勛賢
Outline
CUDA
Apriori algorithm
Memory coalescing
Algorithm
Experimental results
Conclusion
CUDA
Background
GPU versus CPU
Architecture
Programming Model
Streaming Multiprocessor
Background
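CUDA is NVIDIA's architecture for parallel computing on the GPU. Programmers can implement parallel processing through either a high-level language or the driver API.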
GPU versus CPU
[Figure: CPU and GPU chip layouts side by side, with blocks labeled Control, Cache, ALU, and DRAM.]
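CUDA Architecture
[Figure: software stack with the layers Application, CUDA Library, CUDA Runtime, and CUDA Driver, running on the CPU and GPU.]
CUDA consists of three parts: the library, the runtime, and the driver. A program controls and uses the GPU's computing power through these three layers.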
Programming Model
[Figure: the CPU (host) calls Kernel 1 and Kernel 2 on the GPU (device). Each kernel runs as one grid: Grid 1 holds Block(0, 0) to Block(1, 1), Grid 2 holds Block(0, 0) to Block(1, 2), and each block holds Thread(0, 0) to Thread(2, 1).]
One kernel corresponds to one grid.
Blocks can be arranged in up to two dimensions, while threads can be arranged in up to three dimensions.
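The grid, block, and thread hierarchy determines which data element each thread works on. The following is a minimal sketch in plain Python (not CUDA) of the usual one-dimensional index calculation. The names global_thread_index, launch, block_idx, block_dim, and thread_idx are illustrative assumptions, and the loop runs sequentially where a real GPU would run the threads in parallel.

```python
def global_thread_index(block_idx, block_dim, thread_idx):
    """Flatten a (block, thread) pair into one global index (1-D case)."""
    return block_idx * block_dim + thread_idx


def launch(grid_dim, block_dim, kernel):
    """Sequentially emulate a 1-D kernel launch over grid_dim blocks."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(global_thread_index(block_idx, block_dim, thread_idx))


# Two blocks of three threads each cover the indices 0..5 exactly once.
visited = []
launch(grid_dim=2, block_dim=3, kernel=visited.append)
print(visited)  # [0, 1, 2, 3, 4, 5]
```

Each thread receives a distinct index, so a kernel can use it to pick the array element it is responsible for.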
Memory Model
[Figure: a device grid containing Block (0, 0) and Block (1, 0). Each block has its own shared memory; each thread, Thread (0, 0) and Thread (1, 0), has its own registers and local memory. Global, constant, and texture memory serve the whole grid and are reachable from the host.]
• Read/write per thread: registers
• Read/write per block: shared memory
• Read/write per thread: local memory (DRAM)
• Read/write per grid: global memory (DRAM)
• Read-only per grid: constant and texture memories (DRAM)
Streaming Multiprocessor
• Multithreaded issue unit: schedules instructions
• Instruction cache and constant cache
• 8 streaming processors (SP): each SP handles one thread
• 2 Special Function Units (SFU): transcendental operations (e.g. sin, cos)
• 16 KB read/write shared memory: software-managed data storage
[Figure: G80 Streaming Multiprocessor (SM), showing the instruction cache (I cache), constant cache (C cache), MT issue unit, 8 SPs, 2 SFUs, and shared memory.]
Apriori Algorithm
Data Mining
Association Analysis
Find Frequent Patterns
Apriori Algorithm
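Data Mining (1)
Concept: extract understandable information from large volumes of stored data to help decision makers reach better-informed decisions.
[Figure: Data → Data Mining → Information]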
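Data Mining (2)
Areas: models by which data yields information, describing the features and relationships in the data.
• Classification
• Association
• Clustering
• Trend Analysis
• Sequence Pattern
Association analysis: analyzes the relationship between two data items.
E.g. Suppose a retail store manager has to choose a promotion strategy. If the transaction records show that carbonated drinks and potato chips are bought together unusually often, the manager can base the promotion on that information.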
Find Frequent Patterns (1)
Concept: find the data that occur most frequently.
E.g. Identify which combinations of products many customers buy together.
Find Frequent Patterns (2)
Definitions:
Minimum support: the minimum number of occurrences required
K-itemset: a set made up of K items
E.g. 2-itemset = {A, B}, {A, C}
Transactions: the index of each itemset
E.g.
Transactions   Itemset
T1             {A, B, C}
T2             {A, D}
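To make the definition of support concrete, here is a minimal Python sketch that counts how many transactions contain a given itemset, using the two example transactions T1 and T2 above. The function name support and the dictionary layout are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for record in transactions.values() if items <= record)


# The two example transactions from the slide.
transactions = {"T1": {"A", "B", "C"}, "T2": {"A", "D"}}

print(support({"A"}, transactions))       # 2
print(support({"A", "B"}, transactions))  # 1
print(support({"B", "D"}, transactions))  # 0
```

An itemset is frequent when this count reaches the minimum support.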
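Find Frequent Patterns (3)
Example: minimum support = 3
Items: {A} potato chips, {B} carbonated drinks, {C} instant noodles, {D} milk, {E} eggs
[Table: transaction ID and transaction record]
K-itemsets that appear in at least 3 transaction records:
1-itemset: {A}, {B}, {C}, {D}, {E}
2-itemset: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}
3-itemset: {B,C,D}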
Apriori Algorithm (1)
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested (Agrawal & Srikant, VLDB '94; Mannila et al., KDD '94).
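The pruning principle can be sketched in a few lines of Python. This is a simplified illustration, not the join step from the cited papers: it enumerates every k-item combination and keeps only those whose (k-1)-item subsets are all frequent. The function name generate_candidates is an assumption; the input is the list of frequent 2-itemsets used in this presentation.

```python
from itertools import combinations


def generate_candidates(frequent, k):
    """Return k-itemset candidates whose (k-1)-subsets are all frequent."""
    frequent = {frozenset(s) for s in frequent}
    items = sorted(set().union(*frequent))
    candidates = []
    for combo in combinations(items, k):
        # Pruning: one infrequent subset is enough to discard the candidate.
        if all(frozenset(sub) in frequent
               for sub in combinations(combo, k - 1)):
            candidates.append(set(combo))
    return candidates


frequent_2 = [{"A", "B"}, {"B", "C"}, {"B", "D"}, {"B", "E"}, {"C", "D"}]
result = generate_candidates(frequent_2, 3)
print([sorted(c) for c in result])  # [['B', 'C', 'D']]
```

Only {B,C,D} survives, because every other 3-item combination contains at least one infrequent pair such as {A,C} or {C,E}.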
Apriori Algorithm (2)
[Figure: processing flow with the stages Data, Search, Sort, Handle, and Result.]
Apriori Algorithm (3)
Minimum support = 3
Original      Sorted
{A} 3         {B} 6
{B} 6         {C} 4
{C} 4         {D} 4
{D} 4         {A} 3
{E} 3         {E} 3
1-itemset: {A}, {B}, {C}, {D}, {E}
Apriori Algorithm (4)
Minimum support = 3
1-itemset: {A}, {B}, {C}, {D}, {E}
Original      Sorted
{A,B} 3       {B,C} 4
{A,C} 2       {B,D} 4
{A,D} 2       {A,B} 3
{A,E} 1       {B,E} 3
{B,C} 4       {C,D} 3
{B,D} 4       {A,C} 2
{B,E} 3       {A,D} 2
{C,D} 3       {C,E} 2
{C,E} 2       {D,E} 2
{D,E} 2       {A,E} 1
2-itemset: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}
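The sort-then-filter step shown for the 2-itemsets can be reproduced with a short Python sketch. The counts are the ones listed on the slide; the function name rank_and_filter and the string keys such as "AB" are assumptions made for brevity. Python's sort is stable, so itemsets with equal counts keep their original relative order.

```python
def rank_and_filter(counts, min_support):
    """Sort itemsets by descending count, then keep the frequent ones."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    frequent = [itemset for itemset, n in ranked if n >= min_support]
    return ranked, frequent


# Support counts of the 2-itemsets, in the slide's original order.
counts_2 = {"AB": 3, "AC": 2, "AD": 2, "AE": 1, "BC": 4,
            "BD": 4, "BE": 3, "CD": 3, "CE": 2, "DE": 2}

ranked, frequent = rank_and_filter(counts_2, min_support=3)
print([name for name, _ in ranked])
# ['BC', 'BD', 'AB', 'BE', 'CD', 'AC', 'AD', 'CE', 'DE', 'AE']
print(sorted(frequent))  # ['AB', 'BC', 'BD', 'BE', 'CD']
```

Once the list is sorted, the frequent itemsets form a prefix, so the scan can stop at the first count below the minimum support.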
Apriori Algorithm (5)
Minimum support = 3
1-itemset: {A}, {B}, {C}, {D}, {E}
2-itemset: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}
Original        Sorted
{A,B,C} 2       {B,C,D} 3
{A,B,D} 2       {A,B,C} 2
{A,B,E} 1       {A,B,D} 2
{B,C,D} 3       {B,C,E} 2
{B,C,E} 2       {B,D,E} 2
{B,D,E} 2       {C,D,E} 2
{C,D,E} 2       {A,B,E} 1
3-itemset: {B,C,D}
Memory Coalescing
Concept
Compare
Concept
• To maximize global memory bandwidth:
  - Minimize the number of bus transactions
  - Coalesce memory accesses
• Coalescing: memory transactions are issued per half-warp (16 threads)
Compare (1)
Half-warp, all threads participate:
Address   128  132  136  140  …  188
Thread      0    1    2    3  …   15
Half-warp, some threads do not participate:
Address   128  132  136  140  …  188
Thread      0    1    2    3  …   15
Compare (2)
Permuted access by threads:
Address   128  132  136  140  …  188
Thread      0    1    2    3  …   15
Misaligned starting address (not a multiple of 64):
Address   128  132  136  140  …  188
Thread      0    1    2    3  …   15
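The four access patterns compared above can be checked with a small Python model. This is a simplified sketch of the strict rule these slides illustrate (thread k of a half-warp reads the k-th 4-byte word of a segment that starts on a 64-byte boundary), not a model of every GPU generation; later hardware relaxes the rule. The function name is_coalesced and the use of None for an idle thread are assumptions.

```python
HALF_WARP = 16


def is_coalesced(addresses, word_size=4):
    """addresses[k] is the byte address read by thread k, or None if idle."""
    segment = HALF_WARP * word_size            # 64 bytes for 4-byte words
    active = [(k, a) for k, a in enumerate(addresses) if a is not None]
    if not active:
        return True
    k0, a0 = active[0]
    base = a0 - k0 * word_size                 # address thread 0 would read
    if base % segment != 0:
        return False                           # misaligned starting address
    # Every active thread k must read exactly the k-th word of the segment.
    return all(a == base + k * word_size for k, a in active)


in_order = [128 + 4 * k for k in range(HALF_WARP)]
some_idle = [a if k % 2 == 0 else None for k, a in enumerate(in_order)]
permuted = in_order[:]
permuted[1], permuted[2] = permuted[2], permuted[1]
misaligned = [132 + 4 * k for k in range(HALF_WARP)]

print(is_coalesced(in_order))    # True
print(is_coalesced(some_idle))   # True
print(is_coalesced(permuted))    # False
print(is_coalesced(misaligned))  # False
```

Under this model the two cases of Compare (1) are served as one transaction, while the two cases of Compare (2) are not.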
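Algorithm
Purpose
Parameter definitions
Experimental algorithm
Purpose
• Solve the Find Frequent Patterns problem in parallel
• Optimize the program with memory coalescing
1-itemset: {A}, {B}, {C}, {D}, {E}
2-itemset: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}
3-itemset: {B,C,D}
Parameter definitions
Target[]: the shorter set
Set[]: the longer set
Result: the result of intersecting the two sets
Begin: comparison starts at Set[Begin]
End: comparison does not go beyond Set[End]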
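The parameters above suggest a simple sequential baseline, sketched here in Python. It is an assumption that Result holds one True/False flag per Target element and that End is inclusive; the function name intersect is also hypothetical. The sample data are the Target and Set values from the experimental algorithm slides.

```python
def intersect(target, set_, begin, end):
    """Return one flag per Target element: is it found in Set[begin..end]?"""
    result = []
    for item in target:
        found = False
        for i in range(begin, end + 1):   # never go beyond Set[End]
            if set_[i] == item:
                found = True
                break
        result.append(found)
    return result


target = ["B", "C", "E", "F"]
set_ = ["A", "B", "C", "D", "E", "G"]

flags = intersect(target, set_, begin=0, end=len(set_) - 1)
print(flags)  # [True, True, True, False]
print([t for t, hit in zip(target, flags) if hit])  # ['B', 'C', 'E']
```

Each Target element is checked independently of the others, which is what makes the problem straightforward to split across GPU threads.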
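Experimental Algorithm (1)
[Figure: Target = B C E F is compared against Set = A B C D E G, starting at Begin; the Result row records a True or False flag for each comparison.]
Experimental Algorithm (2): CPU
[Figure: the sets G H I J and A B C D E G H, processed by the CPU.]
Experimental Algorithm (2): GPU, non-coalesced memory access
[Figure: the sets G H I J and A B C D E G, processed by Thread_1, Thread_2, and Thread_3.]
Experimental Algorithm (3): GPU, coalesced memory access
[Figure: the sets G H I J and A B C D E G, processed by Thread_1, Thread_2, and Thread_3.]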
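The slides do not spell out how elements are assigned to threads in the two GPU versions. One common way the difference arises is sketched below in Python, purely as an assumption: in a chunked layout each thread walks its own contiguous block, so neighbouring threads read addresses far apart; in an interleaved layout neighbouring threads read neighbouring elements at every step. Both function names are hypothetical.

```python
def chunked_indices(n_threads, chunk, step):
    """Thread t owns a contiguous chunk; at `step` it reads t*chunk + step."""
    return [t * chunk + step for t in range(n_threads)]


def interleaved_indices(n_threads, step):
    """At `step`, thread t reads element step*n_threads + t."""
    return [step * n_threads + t for t in range(n_threads)]


# Three threads sharing a six-element set, two steps each.
for step in range(2):
    print(step, chunked_indices(3, 2, step), interleaved_indices(3, step))
# 0 [0, 2, 4] [0, 1, 2]
# 1 [1, 3, 5] [3, 4, 5]
```

Both layouts cover all six elements, but only the interleaved one has the threads touching consecutive addresses in the same step, which is the pattern memory coalescing rewards.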
Experimental Results
Platform
Data Size, Block & Thread
CPU versus GPU
Memory Coalescing Effect
Platform (CPU)
# of Cores               4
# of Threads             4
Clock Speed              2 GHz
Memory Size              6 GB
Memory Type              DDR3 800
# of Memory Channels     3
Max Memory Bandwidth     19.2 GB/s
Cache                    4 MB
Platform (GPU)
Number of GPUs               1
Number of processor cores    240
Clock Speed                  1300 MHz
Memory Size                  4 GB
Memory Type                  GDDR3
Memory Clock                 800 MHz
Max Memory Bandwidth         102.4 GB/s
Compute capability           1.3
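The two peak bandwidth figures in the platform tables follow from the memory parameters. The Python sketch below shows the arithmetic under stated assumptions: a 64-bit DDR3 channel on the CPU side, and on the GPU side a double-data-rate 512-bit memory bus. The 512-bit width is not given in the slides; it is an assumption chosen because it is consistent with the listed 102.4 GB/s. The function name peak_bandwidth_gb_s is hypothetical.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits, channels=1):
    """Peak bandwidth in GB/s = transfers/s * bytes per transfer * channels."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# CPU: DDR3 800 makes 800 million transfers/s per 64-bit channel, 3 channels.
cpu = peak_bandwidth_gb_s(800_000_000, 64, channels=3)

# GPU: 800 MHz GDDR3, two transfers per clock; 512-bit bus is an assumption.
gpu = peak_bandwidth_gb_s(2 * 800_000_000, 512)

print(cpu, gpu)             # 19.2 102.4
print(round(gpu / cpu, 2))  # 5.33
```

On paper the GPU has roughly five times the memory bandwidth of the CPU, which is why access patterns that waste bus transactions matter so much for the results that follow.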
Data Size, Block & Thread
[Charts: Data Size = 10, Block fixed; Data Size = 10K, Block fixed; Data Size = 10K, Thread fixed; Data Size = 10; Data Size = 100K, Block fixed; Data Size = 100K, Thread fixed]
CPU versus GPU
[Charts for the configurations: Block = 1, Thread = 1; Block = 1, Thread = 10; Block = 10, Thread = 100; Block = 10, Thread = 512]
Memory Coalescing Effect
[Charts for the configurations: Block = 1, Thread = 10; Block = 1, Thread = 512; Block = 10, Thread = 512]
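Conclusion
• With the CUDA architecture, the search step of data mining takes only about one third of the original CPU program's running time when the data set is large.
• The CUDA program improved with memory coalescing performs about three times better than the original CUDA version, and is nearly ten times more efficient than the CPU program.
• Future work: parallelize more algorithms on the CUDA architecture to achieve further optimization.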
Thank you for listening!