Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices
Tomonari MASADA (正田 備也)
Nagasaki University
[email protected]
Overview
What is CVB?
Parallelization of CVB for LDA
Implementation for GPGPU (GPGPU = Nvidia CUDA compatible devices)
LDA (latent Dirichlet allocation)
Latent Dirichlet allocation [Blei et al. 02]
Bayesian multi-topic document model
multi-topic
document = mixture of K topics
Bayesian
introducing a prior
obtaining a posterior
posterior distribution
p(x, z, θ, φ | α, β) = Πj p(θj | α) · Πk p(φk | β) · Πj Πi p(zji | θj) p(xji | zji, φ)
p(z, θ, φ | x, α, β) = p(x, z, θ, φ | α, β) / p(x | α, β)
(posterior: unknown; joint: known)
Inference methods for LDA
Variational Bayesian inference [Blei et al. 02]
Approximating the posterior by a variational method
Collapsed Gibbs sampling [Griffiths et al. 04]
Marginalizing θjk and φkw
Sampling zji
Collapsed variational Bayesian inference (CVB) [Teh et al. 06]
Marginalizing θjk and φkw
Approximating the posterior by a variational method
[Figure: example documents; every word token in every document (e.g., vote, party, prime, stock, ratio, celeb) carries one parameter γjwk per topic, e.g., γ111, γ112, γ113 for the first word of document 1 with K = 3 topics.]
Interpretation of γjwk
γjwk = how strongly word w in document j relates to topic k
Algorithm of CVB
for each dj
  for each vw in dj
    for each tk
      Update γjwk
    next
  next
next
O(MK) time
M : # of unique doc-word pairs
K : # of topics
(j: doc id, w: word id, k: topic id)
Updating posterior parameters
γjwk ∝ (α + E[njk]) · (β + E[nkw]) / (Wβ + E[nk])
      · exp[ −Var[njk] / (2(α + E[njk])²)
             − Var[nkw] / (2(β + E[nkw])²)
             + Var[nk] / (2(Wβ + E[nk])²) ]
(j: doc id, w: word id, k: topic id)
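The update above can be written as a tiny function. A minimal sketch follows, assuming a flat argument list; the name unnormalized_gamma and the sample numbers in main are illustrative, not the talk's actual code. The caller must still normalize the K values for each doc-word pair.

#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Unnormalized CVB update for a single (doc j, word w, topic k) triple,
// following the formula on the slide above.
__host__ __device__
float unnormalized_gamma(float alpha, float beta, int W,
                         float Enjk, float Vnjk,   // E[njk], Var[njk]
                         float Enkw, float Vnkw,   // E[nkw], Var[nkw]
                         float Enk,  float Vnk)    // E[nk],  Var[nk]
{
    float a = alpha + Enjk;        // document-topic factor
    float b = beta  + Enkw;        // topic-word factor
    float c = W * beta + Enk;      // topic normalizer
    // second-order correction from the Gaussian approximation
    float corr = -Vnjk / (2.0f * a * a)
                 - Vnkw / (2.0f * b * b)
                 + Vnk  / (2.0f * c * c);
    return a * (b / c) * expf(corr);
}

int main()
{
    // Host-side check with made-up statistics; W matches the text-mining vocabulary size.
    float g = unnormalized_gamma(0.1f, 0.01f, 40158,
                                 3.0f, 1.2f, 5.0f, 2.0f, 900.0f, 300.0f);
    printf("unnormalized gamma = %f\n", g);
    return 0;
}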
Approximation by Gaussian
Means and variances:
E[njk] = Σw nwj γjwk,    Var[njk] = Σw nwj γjwk (1 − γjwk)
E[nkw] = Σj nwj γjwk,    Var[nkw] = Σj nwj γjwk (1 − γjwk)
E[nk]  = Σw,j nwj γjwk,  Var[nk]  = Σw,j nwj γjwk (1 − γjwk)
njk : # of word tokens which relate to topic k and appear in document j
nkw : # of tokens of word w which relate to topic k
nk : # of word tokens which relate to topic k
(j: doc id, w: word id, k: topic id)
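For reference, a minimal host-side sketch of accumulating these means and variances from the counts nwj and the current γjwk. The dense array layouts and the loop over all (j, w) cells are simplifying assumptions; the talk only touches the M unique doc-word pairs.

#include <vector>

void accumulate_statistics(int J, int W, int K,
                           const std::vector<float>& gamma,  // gamma[((j*W)+w)*K + k]
                           const std::vector<int>&   n,      // n_wj at n[j*W + w]
                           std::vector<float>& Enjk, std::vector<float>& Vnjk, // [j*K + k]
                           std::vector<float>& Enkw, std::vector<float>& Vnkw, // [k*W + w]
                           std::vector<float>& Enk,  std::vector<float>& Vnk)  // [k]
{
    for (int j = 0; j < J; ++j)
        for (int w = 0; w < W; ++w) {
            int cnt = n[(size_t)j * W + w];
            if (cnt == 0) continue;                       // skip absent doc-word pairs
            for (int k = 0; k < K; ++k) {
                float g    = gamma[((size_t)j * W + w) * K + k];
                float mean = cnt * g;                     // n_wj * gamma_jwk
                float var  = cnt * g * (1.0f - g);        // n_wj * gamma_jwk * (1 - gamma_jwk)
                Enjk[(size_t)j * K + k] += mean;  Vnjk[(size_t)j * K + k] += var;
                Enkw[(size_t)k * W + w] += mean;  Vnkw[(size_t)k * W + w] += var;
                Enk[k]                  += mean;  Vnk[k]                  += var;
            }
        }
}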
[Figure: sizes of the posterior statistics]
E[njk], Var[njk] : O(JK) size
E[nkw], Var[nkw] : O(KW) size
E[nk], Var[nk] : O(K) size
γjwk : O(MK) size
(j: doc id, w: word id, k: topic id)
Details of CVB for LDA
for each dj
  for each vw in dj
    for each tk
      1. E[njk] −= nwj*γjwk;  Var[njk] −= nwj*γjwk*(1−γjwk)
      2. Update γjwk
      3. E[njk] += nwj*γjwk;  Var[njk] += nwj*γjwk*(1−γjwk)
    next
  next
next
Update the other two types of E[·]s and Var[·]s in a similar manner.
(j: doc id, w: word id, k: topic id)
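A host-side sketch of steps 1-3 for one doc-word pair, reusing the unnormalized_gamma helper from the earlier sketch. One liberty taken here: the updated contribution is added back only after normalizing over topics, which the slide folds into steps 2-3. Array layouts are assumptions.

// Declared in the earlier sketch; repeated here so this file stands alone.
float unnormalized_gamma(float alpha, float beta, int W,
                         float Enjk, float Vnjk, float Enkw, float Vnkw,
                         float Enk,  float Vnk);

void update_one_pair(int cnt /* n_wj */, int K, int W,
                     float alpha, float beta,
                     float* gamma,              // gamma_jw1 .. gamma_jwK
                     float* Enjk, float* Vnjk,  // row j of E[njk], Var[njk]
                     float* Enkw, float* Vnkw,  // column w, stride W
                     float* Enk,  float* Vnk)   // E[nk], Var[nk]
{
    float sum = 0.0f;
    for (int k = 0; k < K; ++k) {
        float g = gamma[k];
        // 1. remove this pair's current contribution from all statistics
        Enjk[k]     -= cnt * g;  Vnjk[k]     -= cnt * g * (1.0f - g);
        Enkw[k * W] -= cnt * g;  Vnkw[k * W] -= cnt * g * (1.0f - g);
        Enk[k]      -= cnt * g;  Vnk[k]      -= cnt * g * (1.0f - g);
        // 2. recompute the (unnormalized) responsibility for topic k
        gamma[k] = unnormalized_gamma(alpha, beta, W, Enjk[k], Vnjk[k],
                                      Enkw[k * W], Vnkw[k * W], Enk[k], Vnk[k]);
        sum += gamma[k];
    }
    for (int k = 0; k < K; ++k) {
        gamma[k] /= sum;         // normalize: gamma_jw1 + ... + gamma_jwK = 1
        float g = gamma[k];
        // 3. add the updated contribution back
        Enjk[k]     += cnt * g;  Vnjk[k]     += cnt * g * (1.0f - g);
        Enkw[k * W] += cnt * g;  Vnkw[k * W] += cnt * g * (1.0f - g);
        Enk[k]      += cnt * g;  Vnk[k]      += cnt * g * (1.0f - g);
    }
}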
Parallelization of CVB for LDA
“as many threads as topics”
Parallelization of CVB
for each dj          ← conventional parallelization (parallelize over documents)
  for each vw in dj
    for each tk      ← proposed parallelization (parallelize over topics)
      Update γjwk
    next
  next
next
γjwk ∝ (α + E[njk]) · (β + E[nkw]) / (Wβ + E[nk])
      · exp[ −Var[njk] / (2(α + E[njk])²)
             − Var[nkw] / (2(β + E[nkw])²)
             + Var[nk] / (2(Wβ + E[nk])²) ]
Strategy: "different topics for different threads"
γjw1 + γjw2 + ・・・ + γjwK = 1
Normalization is required!
O(MK) → O(M log K)
Reduction for normalization
O(logK)
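A rough CUDA illustration (not the talk's actual kernel) of this reduction: one thread per topic, with a log2(K)-step tree reduction in shared memory to obtain the normalizing constant. The one-pair-per-block layout and the power-of-two K are simplifying assumptions.

#include <cuda_runtime.h>

// "Different topics for different threads": thread k owns topic k; the sum
// needed for normalization is computed by a shared-memory tree reduction
// in O(log K) steps instead of an O(K) serial sum.
__global__ void normalize_gamma(float* gamma, int K)   // gamma: M x K, unnormalized
{
    extern __shared__ float partial[];                  // K floats, one per thread
    int pair = blockIdx.x;                              // one (j, w) pair per block
    int k    = threadIdx.x;                             // one topic per thread
    float g  = gamma[pair * K + k];
    partial[k] = g;
    __syncthreads();

    for (int stride = K / 2; stride > 0; stride >>= 1) {
        if (k < stride) partial[k] += partial[k + stride];
        __syncthreads();
    }
    gamma[pair * K + k] = g / partial[0];               // the K values now sum to one
}

// Launch sketch (M doc-word pairs, K assumed a power of two):
//   normalize_gamma<<<M, K, K * sizeof(float)>>>(d_gamma, K);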
Accelerating CVB for LDA by GPGPU
Nvidia CUDA (Compute Unified Device Architecture)
[Figure: CUDA architecture — a grid of blocks, each with its own shared memory and threads with registers, all accessing device memory; documents are mapped to blocks and topics to threads.]
Device memory access latency
[Figure: the same memory hierarchy; device memory accesses are slow, while each block's shared memory (16KB) is fast but small.]
Data transfer latency
[Figure: transfers between host memory and device memory also incur latency.]
Transfer one large block instead of many smaller ones!
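A host-side sketch of this advice, assuming the topic-word statistics are packed into a single staging buffer; names and layout are illustrative. The point is to pay the fixed transfer latency once, rather than once per array.

#include <cuda_runtime.h>
#include <vector>

void upload_statistics(const std::vector<float>& Enkw,
                       const std::vector<float>& Vnkw,
                       const std::vector<float>& Enk,
                       const std::vector<float>& Vnk,
                       float* d_buffer)                  // preallocated device buffer
{
    // Pack everything into one contiguous host buffer ...
    std::vector<float> staging;
    staging.reserve(Enkw.size() + Vnkw.size() + Enk.size() + Vnk.size());
    staging.insert(staging.end(), Enkw.begin(), Enkw.end());
    staging.insert(staging.end(), Vnkw.begin(), Vnkw.end());
    staging.insert(staging.end(), Enk.begin(),  Enk.end());
    staging.insert(staging.end(), Vnk.begin(),  Vnk.end());

    // ... and move it with a single call, so the per-transfer latency is paid once.
    cudaMemcpy(d_buffer, staging.data(),
               staging.size() * sizeof(float), cudaMemcpyHostToDevice);
}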
[Figure: parameters of the approximated posterior and their sizes]
E[njk], Var[njk] : O(JK) size
E[nkw], Var[nkw] : O(KW) size
E[nk], Var[nk] : O(K) size
γjwk : O(MK) size
(j: doc id, w: word id, k: topic id)
Where to store?
Posterior parameters γjwk : O(K) size → registers (shared memory for the summation)
Means and variances E[njk], Var[njk] : O(K) size for a fixed doc → registers
E[nkw], Var[nkw] : O(KW) size → device memory
E[nk], Var[nk] : O(K) size → registers
[Figure: E[nkw] and Var[nkw] (O(KW) size) live in device memory; blocks processing different documents update the same entries through their γjwk and γj'wk, causing write conflicts.]
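The slide only points out the conflict and does not spell out its resolution here. One generic CUDA option (not necessarily what the talk does) is atomicAdd; note that float atomicAdd needs compute capability 2.0 or later, i.e. newer GPUs than the ones used in this work.

#include <cuda_runtime.h>

// Serialize conflicting writes from different blocks onto the same cell of
// E[nkw] / Var[nkw] in device memory. Requires compute capability >= 2.0.
__device__ void add_word_topic_contribution(float* Enkw, float* Vnkw,
                                            int k, int w, int W,
                                            float cnt, float g)
{
    atomicAdd(&Enkw[k * W + w], cnt * g);                // E[nkw]   += n_wj * gamma
    atomicAdd(&Vnkw[k * W + w], cnt * g * (1.0f - g));   // Var[nkw] += n_wj * gamma * (1 - gamma)
}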
Experiments
Text mining: articles from Mainichi and Asahi Web news
56,755 docs, 40,158 words (applying MeCab + removing stop words)
M = 5,053,978 unique doc-word pairs; 3,387,822 pairs for training
ASUS EN8800GT/HTDP/1G
+ Core2Quad Q9450
Evaluating by test data perplexity
[Plots: test data perplexity over 64 iterations on CPU vs. GPU, for 16, 32, and 64 topics.]
Image mining: 1.5 million tiny images
http://people.csail.mit.edu/torralba/tinyimages/
Only the first 32,768 images
Uniform color quantization: 16x16x16
Original image size: 32x32; word = (R, G, B, Xpos, Ypos), i.e. 16x16x16x32x32 possible words (see the sketch below)
30 topics
8 PCs (GeForce GTX260 in each PC); CUDA + MPICH2 + OpenMP (perplexity computation)
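A small sketch of the visual-word construction just described; the exact encoding order is an assumption made for this sketch.

#include <stdio.h>

// A pixel's word id is built from its 16x16x16-quantized color and its
// (x, y) position in the 32x32 image, giving 16*16*16*32*32 possible words.
int visual_word_id(int r, int g, int b, int x, int y)   // r,g,b in 0..255; x,y in 0..31
{
    int rq = r / 16, gq = g / 16, bq = b / 16;          // uniform color quantization
    return (((rq * 16 + gq) * 16 + bq) * 32 + x) * 32 + y;
}

int main()
{
    printf("vocabulary size = %d\n", 16 * 16 * 16 * 32 * 32);      // 4,194,304
    printf("example word id = %d\n", visual_word_id(200, 30, 99, 5, 17));
    return 0;
}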
Image mining statistics
J = 32,768 docs
W = 2,090,223 unique words
M = 33,554,432 unique document-word pairs
Running time: 8,191 sec for 100 iterations
LEADTEK WinFast GTX 260 896MB + Core2Quad Q9550
http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
Summary
Discussions:
Larger device memory is better.
Data transfer latency between CPU and GPU
A single GPU is not enough for scalability: GPGPU + PC cluster (MPICH2)
"fine-grained": topic <-> thread; "coarse-grained": data subset <-> node
Future work
Collapsed Gibbs sampling on GPU?
Collapsed Gibbs sampling for LDA is too simple to obtain speed-up by GPGPU.
Non-parametric Bayes on GPU? Hierarchical Dirichlet Processes [Teh et al. 06]
How to keep topic numbering consistent among different threads?
Thank you for your attention!
Thank you very much (非常感謝)!!!