Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices
Tomonari MASADA (正田 備也)
Nagasaki University
[email protected]
Overview
What is CVB?
Parallelization of CVB for LDA
Implementation for GPGPU (GPGPU = Nvidia CUDA compatible devices)
LDA (latent Dirichlet allocation)
Latent Dirichlet allocation [Blei et al. 02]
Bayesian multi-topic document model
multi-topic
document = mixture of K topics
Bayesian
introducing a prior
obtaining a posterior
posterior distribution
p(x, z, θ, φ | α, β) = Πj p(θj | α) · Πk p(φk | β) · Πj Πi p(zji | θj) p(xji | zji, φ)
p(z, θ, φ | x, α, β) = p(x, z, θ, φ | α, β) / p(x | α, β)
(posterior: unknown; joint: known)
Inference methods for LDA
Variational Bayesian inference [Blei et al. 02]
Approximating the posterior by a variational method
Collapsed Gibbs sampling [Griffiths et al. 04]
Marginalizing θjk and φkw
Sampling zji
Collapsed variational Bayesian inference (CVB) [Teh et al. 06]
Marginalizing θjk and φkw
Approximating the posterior by a variational method
[Figure: example documents; every word token in every document (e.g., vote, party, prime, stock, ratio, celeb) carries one parameter γjwk per topic, e.g., γ111, γ112, γ113 for the first word of document 1 with K = 3 topics.]
Interpretation of γjwk
γjwk = how strongly word w in document j relates to topic k
Algorithm of CVB
for each dj
  for each vw in dj
    for each tk
      Update γjwk
    next
  next
next
O(MK) time
M : # of unique doc-word pairs
K : # of topics
(j: doc id, w: word id, k: topic id)
Updating posterior parameters
γjwk ∝ (α + E[njk]) · (β + E[nkw]) / (Wβ + E[nk])
      · exp[ −Var[njk] / (2(α + E[njk])²)
             − Var[nkw] / (2(β + E[nkw])²)
             + Var[nk] / (2(Wβ + E[nk])²) ]
(j: doc id, w: word id, k: topic id)
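The update above can be written as a tiny function. A minimal sketch follows, assuming a flat argument list; the name unnormalized_gamma and the sample numbers in main are illustrative, not the talk's actual code. The caller must still normalize the K values for each doc-word pair.

#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Unnormalized CVB update for a single (doc j, word w, topic k) triple,
// following the formula on the slide above.
__host__ __device__
float unnormalized_gamma(float alpha, float beta, int W,
                         float Enjk, float Vnjk,   // E[njk], Var[njk]
                         float Enkw, float Vnkw,   // E[nkw], Var[nkw]
                         float Enk,  float Vnk)    // E[nk],  Var[nk]
{
    float a = alpha + Enjk;        // document-topic factor
    float b = beta  + Enkw;        // topic-word factor
    float c = W * beta + Enk;      // topic normalizer
    // second-order correction from the Gaussian approximation
    float corr = -Vnjk / (2.0f * a * a)
                 - Vnkw / (2.0f * b * b)
                 + Vnk  / (2.0f * c * c);
    return a * (b / c) * expf(corr);
}

int main()
{
    // Host-side check with made-up statistics; W matches the text-mining vocabulary size.
    float g = unnormalized_gamma(0.1f, 0.01f, 40158,
                                 3.0f, 1.2f, 5.0f, 2.0f, 900.0f, 300.0f);
    printf("unnormalized gamma = %f\n", g);
    return 0;
}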
Approximation by Gaussian
Means and variances:
E[njk] = Σw nwj γjwk,    Var[njk] = Σw nwj γjwk (1 − γjwk)
E[nkw] = Σj nwj γjwk,    Var[nkw] = Σj nwj γjwk (1 − γjwk)
E[nk]  = Σw,j nwj γjwk,  Var[nk]  = Σw,j nwj γjwk (1 − γjwk)
njk : # of word tokens which relate to topic k and appear in document j
nkw : # of tokens of word w which relate to topic k
nk : # of word tokens which relate to topic k
(j: doc id, w: word id, k: topic id)
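For reference, a minimal host-side sketch of accumulating these means and variances from the counts nwj and the current γjwk. The dense array layouts and the loop over all (j, w) cells are simplifying assumptions; the talk only touches the M unique doc-word pairs.

#include <vector>

void accumulate_statistics(int J, int W, int K,
                           const std::vector<float>& gamma,  // gamma[((j*W)+w)*K + k]
                           const std::vector<int>&   n,      // n_wj at n[j*W + w]
                           std::vector<float>& Enjk, std::vector<float>& Vnjk, // [j*K + k]
                           std::vector<float>& Enkw, std::vector<float>& Vnkw, // [k*W + w]
                           std::vector<float>& Enk,  std::vector<float>& Vnk)  // [k]
{
    for (int j = 0; j < J; ++j)
        for (int w = 0; w < W; ++w) {
            int cnt = n[(size_t)j * W + w];
            if (cnt == 0) continue;                       // skip absent doc-word pairs
            for (int k = 0; k < K; ++k) {
                float g    = gamma[((size_t)j * W + w) * K + k];
                float mean = cnt * g;                     // n_wj * gamma_jwk
                float var  = cnt * g * (1.0f - g);        // n_wj * gamma_jwk * (1 - gamma_jwk)
                Enjk[(size_t)j * K + k] += mean;  Vnjk[(size_t)j * K + k] += var;
                Enkw[(size_t)k * W + w] += mean;  Vnkw[(size_t)k * W + w] += var;
                Enk[k]                  += mean;  Vnk[k]                  += var;
            }
        }
}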
[Figure: sizes of the posterior statistics]
E[njk], Var[njk] : O(JK) size
E[nkw], Var[nkw] : O(KW) size
E[nk], Var[nk] : O(K) size
γjwk : O(MK) size
(j: doc id, w: word id, k: topic id)
Details of CVB for LDA
for each dj
  for each vw in dj
    for each tk
      1. E[njk] −= nwj*γjwk;  Var[njk] −= nwj*γjwk*(1−γjwk)
      2. Update γjwk
      3. E[njk] += nwj*γjwk;  Var[njk] += nwj*γjwk*(1−γjwk)
    next
  next
next
Update the other two types of E[·]s and Var[·]s in a similar manner.
(j: doc id, w: word id, k: topic id)
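A host-side sketch of steps 1-3 for one doc-word pair, reusing the unnormalized_gamma helper from the earlier sketch. One liberty taken here: the updated contribution is added back only after normalizing over topics, which the slide folds into steps 2-3. Array layouts are assumptions.

// Declared in the earlier sketch; repeated here so this file stands alone.
float unnormalized_gamma(float alpha, float beta, int W,
                         float Enjk, float Vnjk, float Enkw, float Vnkw,
                         float Enk,  float Vnk);

void update_one_pair(int cnt /* n_wj */, int K, int W,
                     float alpha, float beta,
                     float* gamma,              // gamma_jw1 .. gamma_jwK
                     float* Enjk, float* Vnjk,  // row j of E[njk], Var[njk]
                     float* Enkw, float* Vnkw,  // column w, stride W
                     float* Enk,  float* Vnk)   // E[nk], Var[nk]
{
    float sum = 0.0f;
    for (int k = 0; k < K; ++k) {
        float g = gamma[k];
        // 1. remove this pair's current contribution from all statistics
        Enjk[k]     -= cnt * g;  Vnjk[k]     -= cnt * g * (1.0f - g);
        Enkw[k * W] -= cnt * g;  Vnkw[k * W] -= cnt * g * (1.0f - g);
        Enk[k]      -= cnt * g;  Vnk[k]      -= cnt * g * (1.0f - g);
        // 2. recompute the (unnormalized) responsibility for topic k
        gamma[k] = unnormalized_gamma(alpha, beta, W, Enjk[k], Vnjk[k],
                                      Enkw[k * W], Vnkw[k * W], Enk[k], Vnk[k]);
        sum += gamma[k];
    }
    for (int k = 0; k < K; ++k) {
        gamma[k] /= sum;         // normalize: gamma_jw1 + ... + gamma_jwK = 1
        float g = gamma[k];
        // 3. add the updated contribution back
        Enjk[k]     += cnt * g;  Vnjk[k]     += cnt * g * (1.0f - g);
        Enkw[k * W] += cnt * g;  Vnkw[k * W] += cnt * g * (1.0f - g);
        Enk[k]      += cnt * g;  Vnk[k]      += cnt * g * (1.0f - g);
    }
}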
Parallelization of CVB for LDA
“as many threads as topics”
Parallelization of CVB
for each dj          ← conventional parallelization (parallelize over documents)
  for each vw in dj
    for each tk      ← proposed parallelization (parallelize over topics)
      Update γjwk
    next
  next
next
γjwk ∝ (α + E[njk]) · (β + E[nkw]) / (Wβ + E[nk])
      · exp[ −Var[njk] / (2(α + E[njk])²)
             − Var[nkw] / (2(β + E[nkw])²)
             + Var[nk] / (2(Wβ + E[nk])²) ]
Strategy: "different topics for different threads"
γjw1 + γjw2 + ・・・ + γjwK = 1
Normalization is required!
O(MK) → O(M log K)
Reduction for normalization
O(logK)
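A rough CUDA illustration (not the talk's actual kernel) of this reduction: one thread per topic, with a log2(K)-step tree reduction in shared memory to obtain the normalizing constant. The one-pair-per-block layout and the power-of-two K are simplifying assumptions.

#include <cuda_runtime.h>

// "Different topics for different threads": thread k owns topic k; the sum
// needed for normalization is computed by a shared-memory tree reduction
// in O(log K) steps instead of an O(K) serial sum.
__global__ void normalize_gamma(float* gamma, int K)   // gamma: M x K, unnormalized
{
    extern __shared__ float partial[];                  // K floats, one per thread
    int pair = blockIdx.x;                              // one (j, w) pair per block
    int k    = threadIdx.x;                             // one topic per thread
    float g  = gamma[pair * K + k];
    partial[k] = g;
    __syncthreads();

    for (int stride = K / 2; stride > 0; stride >>= 1) {
        if (k < stride) partial[k] += partial[k + stride];
        __syncthreads();
    }
    gamma[pair * K + k] = g / partial[0];               // the K values now sum to one
}

// Launch sketch (M doc-word pairs, K assumed a power of two):
//   normalize_gamma<<<M, K, K * sizeof(float)>>>(d_gamma, K);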
Accelerating CVB for LDA by GPGPU
Nvidia CUDA (Compute Unified Device Architecture)
[Figure: CUDA architecture — a grid of blocks, each with its own shared memory and threads with registers, all accessing device memory; documents are mapped to blocks and topics to threads.]
Device memory access latency
[Figure: the same memory hierarchy; device memory accesses are slow, while each block's shared memory (16KB) is fast but small.]
Data transfer latency
[Figure: transfers between host memory and device memory also incur latency.]
Transfer one large block instead of many smaller ones!
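A host-side sketch of this advice, assuming the topic-word statistics are packed into a single staging buffer; names and layout are illustrative. The point is to pay the fixed transfer latency once, rather than once per array.

#include <cuda_runtime.h>
#include <vector>

void upload_statistics(const std::vector<float>& Enkw,
                       const std::vector<float>& Vnkw,
                       const std::vector<float>& Enk,
                       const std::vector<float>& Vnk,
                       float* d_buffer)                  // preallocated device buffer
{
    // Pack everything into one contiguous host buffer ...
    std::vector<float> staging;
    staging.reserve(Enkw.size() + Vnkw.size() + Enk.size() + Vnk.size());
    staging.insert(staging.end(), Enkw.begin(), Enkw.end());
    staging.insert(staging.end(), Vnkw.begin(), Vnkw.end());
    staging.insert(staging.end(), Enk.begin(),  Enk.end());
    staging.insert(staging.end(), Vnk.begin(),  Vnk.end());

    // ... and move it with a single call, so the per-transfer latency is paid once.
    cudaMemcpy(d_buffer, staging.data(),
               staging.size() * sizeof(float), cudaMemcpyHostToDevice);
}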
[Figure: parameters of the approximated posterior and their sizes]
E[njk], Var[njk] : O(JK) size
E[nkw], Var[nkw] : O(KW) size
E[nk], Var[nk] : O(K) size
γjwk : O(MK) size
(j: doc id, w: word id, k: topic id)
Where to store?
Posterior parameters γjwk : O(K) size → registers (shared memory for the summation)
Means and variances E[njk], Var[njk] : O(K) size for a fixed doc → registers
E[nkw], Var[nkw] : O(KW) size → device memory
E[nk], Var[nk] : O(K) size → registers
[Figure: E[nkw] and Var[nkw] (O(KW) size) live in device memory; blocks processing different documents update the same entries through their γjwk and γj'wk, causing write conflicts.]
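The slide only points out the conflict and does not spell out its resolution here. One generic CUDA option (not necessarily what the talk does) is atomicAdd; note that float atomicAdd needs compute capability 2.0 or later, i.e. newer GPUs than the ones used in this work.

#include <cuda_runtime.h>

// Serialize conflicting writes from different blocks onto the same cell of
// E[nkw] / Var[nkw] in device memory. Requires compute capability >= 2.0.
__device__ void add_word_topic_contribution(float* Enkw, float* Vnkw,
                                            int k, int w, int W,
                                            float cnt, float g)
{
    atomicAdd(&Enkw[k * W + w], cnt * g);                // E[nkw]   += n_wj * gamma
    atomicAdd(&Vnkw[k * W + w], cnt * g * (1.0f - g));   // Var[nkw] += n_wj * gamma * (1 - gamma)
}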
Experiments
Text mining: articles from Mainichi and Asahi Web news
56,755 docs, 40,158 words (applying MeCab + removing stop words)
M = 5,053,978 unique doc-word pairs; 3,387,822 pairs for training
ASUS EN8800GT/HTDP/1G
+ Core2Quad Q9450
Evaluating by test data perplexity
[Plots: test data perplexity over 64 iterations on CPU vs. GPU, for 16, 32, and 64 topics.]
Image mining: 1.5 million tiny images
http://people.csail.mit.edu/torralba/tinyimages/
Only the first 32,768 images
Uniform color quantization: 16x16x16
Original image size: 32x32; word = (R, G, B, Xpos, Ypos), i.e. 16x16x16x32x32 possible words (see the sketch below)
30 topics
8 PCs (GeForce GTX260 in each PC); CUDA + MPICH2 + OpenMP (perplexity computation)
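A small sketch of the visual-word construction just described; the exact encoding order is an assumption made for this sketch.

#include <stdio.h>

// A pixel's word id is built from its 16x16x16-quantized color and its
// (x, y) position in the 32x32 image, giving 16*16*16*32*32 possible words.
int visual_word_id(int r, int g, int b, int x, int y)   // r,g,b in 0..255; x,y in 0..31
{
    int rq = r / 16, gq = g / 16, bq = b / 16;          // uniform color quantization
    return (((rq * 16 + gq) * 16 + bq) * 32 + x) * 32 + y;
}

int main()
{
    printf("vocabulary size = %d\n", 16 * 16 * 16 * 32 * 32);      // 4,194,304
    printf("example word id = %d\n", visual_word_id(200, 30, 99, 5, 17));
    return 0;
}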
Image mining statistics
J = 32,768 docs
W = 2,090,223 unique words
M = 33,554,432 unique document-word pairs
Running time: 8,191 sec for 100 iterations
LEADTEK WinFast GTX 260 896MB + Core2Quad Q9550
http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
Summary
Discussions:
Larger device memory is better.
Data transfer latency between CPU and GPU
A single GPU is not enough for scalability: GPGPU + PC cluster (MPICH2)
"fine-grained": topic <-> thread; "coarse-grained": data subset <-> node
Future work
Collapsed Gibbs sampling on GPU?
Collapsed Gibbs sampling for LDA is too simple to obtain speed-up by GPGPU.
Non-parametric Bayes on GPU? Hierarchical Dirichlet Processes [Teh et al. 06]
How to keep topic numbering consistent among different threads?
Thank you for your attention!
Thank you very much (非常感謝)!!!