
Post on 27-Jan-2015


Page 1: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices

Tomonari MASADA (正田 備也)

Nagasaki University
[email protected]

Page 2: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Overview

What is CVB?

Parallelization of CVB for LDA

Implementation for GPGPU (GPGPU = Nvidia CUDA compatible devices)

Tomonari MASADA (IEA-AIE 2009) 2

Page 3: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

LDA (latent Dirichlet allocation)

Page 4: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Latent Dirichlet allocation [Blei et al. 02]

Bayesian multi-topic document model

multi-topic: a document is a mixture of K topics

Bayesian: introducing a prior, obtaining a posterior

Page 5: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

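Posterior distribution

Joint distribution:

\[ p(x, z, \theta, \phi \mid \alpha, \beta) = \prod_j p(\theta_j \mid \alpha) \prod_k p(\phi_k \mid \beta) \prod_j \prod_i p(z_{ji} \mid \theta_j) \, p(x_{ji} \mid z_{ji}, \phi) \]

Posterior:

\[ p(z, \theta, \phi \mid x, \alpha, \beta) = \frac{p(x, z, \theta, \phi \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \]

In the posterior, z, θ, φ are unknown and x, α, β are known.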

Page 6: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Inference methods for LDA

Variational Bayesian inference [Blei et al. 02]
Approximating the posterior by a variational method

Collapsed Gibbs sampling [Griffiths et al. 04]
Marginalizing θjk and φkw
Sampling zji

Collapsed variational Bayesian inference (CVB) [Teh et al. 06]
Marginalizing θjk and φkw
Approximating the posterior by a variational method

Page 7: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

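[Figure: every word w in every document j carries one parameter γjwk per topic k; the example uses three topics.]

document j   word (id w)   γjw1   γjw2   γjw3
1            vote (1)      γ111   γ112   γ113
1            party (2)     γ121   γ122   γ123
1            prime (3)     γ131   γ132   γ133
2            stock (4)     γ241   γ242   γ243
2            ratio (5)     γ251   γ252   γ253
2            prime (3)     γ231   γ232   γ233
3            party (2)     γ321   γ322   γ323
3            celeb (6)     γ361   γ362   γ363
3            prime (3)     γ331   γ332   γ333
...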

Page 8: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Interpretation of γjwk

γjwk = how strongly word w in document j relates to topic k

Page 9: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Algorithm of CVB

for each document dj
  for each word vw in dj
    for each topic tk
      Update γjwk
    next
  next
next

O(MK) time
M: # of unique doc-word pairs
K: # of topics

j: doc id
w: word id
k: topic id

Page 10: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Updating posterior parameters

\[ \gamma_{jwk} \propto (\alpha + E[n_{jk}]) \cdot \frac{\beta + E[n_{kw}]}{W\beta + E[n_k]} \cdot \exp\left[ -\frac{\mathrm{Var}[n_{jk}]}{2(\alpha + E[n_{jk}])^2} - \frac{\mathrm{Var}[n_{kw}]}{2(\beta + E[n_{kw}])^2} + \frac{\mathrm{Var}[n_k]}{2(W\beta + E[n_k])^2} \right] \]
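The update rule on this slide can be sketched in plain Python for a single (document, word) pair. This is an illustrative CPU sketch, not the author's CUDA code. The function name `cvb_update` and the list-based argument layout are assumptions. Each list holds one value per topic and is assumed to already exclude the contribution of the pair being updated.

```python
import math

def cvb_update(alpha, beta, W, e_njk, v_njk, e_nkw, v_nkw, e_nk, v_nk):
    """Return gamma_{jwk} for all topics k, normalized to sum to 1.

    e_* and v_* are length-K lists of means and variances of the counts
    n_jk, n_kw and n_k (hypothetical layout chosen for this sketch).
    """
    unnormalized = []
    for k in range(len(e_njk)):
        a = alpha + e_njk[k]
        b = beta + e_nkw[k]
        c = W * beta + e_nk[k]
        # Second-order correction terms inside the exponential.
        correction = (-v_njk[k] / (2 * a * a)
                      - v_nkw[k] / (2 * b * b)
                      + v_nk[k] / (2 * c * c))
        unnormalized.append(a * b / c * math.exp(correction))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With all variances set to zero the exponential factor is 1, so the result reduces to the normalized product of the first two factors.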

Page 11: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Approximation by Gaussian

Means and variances

\[ E[n_{jk}] = \sum_w n_{wj} \gamma_{jwk}, \qquad \mathrm{Var}[n_{jk}] = \sum_w n_{wj} \gamma_{jwk} (1 - \gamma_{jwk}) \]

\[ E[n_{kw}] = \sum_j n_{wj} \gamma_{jwk}, \qquad \mathrm{Var}[n_{kw}] = \sum_j n_{wj} \gamma_{jwk} (1 - \gamma_{jwk}) \]

\[ E[n_k] = \sum_{w,j} n_{wj} \gamma_{jwk}, \qquad \mathrm{Var}[n_k] = \sum_{w,j} n_{wj} \gamma_{jwk} (1 - \gamma_{jwk}) \]

njk: # of word tokens which relate to topic k and appear in document j
nkw: # of tokens of word w which relate to topic k
nk: # of word tokens which relate to topic k
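The sums on this slide can be computed directly from γ and the word counts. The sketch below is a minimal illustration, assuming that nwj is the number of times word w occurs in document j and that the data is stored in dictionaries keyed by (j, w). The function name and the storage layout are assumptions made for this example.

```python
from collections import defaultdict

def count_statistics(counts, gamma, K):
    """Compute means and variances of n_jk, n_kw and n_k.

    counts[(j, w)] = n_wj; gamma[(j, w)] = list of K topic probabilities.
    Returns (e_njk, v_njk, e_nkw, v_nkw, e_nk, v_nk).
    """
    e_njk = defaultdict(lambda: [0.0] * K)
    v_njk = defaultdict(lambda: [0.0] * K)
    e_nkw = defaultdict(lambda: [0.0] * K)
    v_nkw = defaultdict(lambda: [0.0] * K)
    e_nk = [0.0] * K
    v_nk = [0.0] * K
    for (j, w), n in counts.items():
        for k in range(K):
            g = gamma[(j, w)][k]
            mean = n * g
            var = n * g * (1.0 - g)
            e_njk[j][k] += mean   # sum over words of document j
            v_njk[j][k] += var
            e_nkw[w][k] += mean   # sum over documents containing w
            v_nkw[w][k] += var
            e_nk[k] += mean       # sum over all pairs
            v_nk[k] += var
    return e_njk, v_njk, e_nkw, v_nkw, e_nk, v_nk
```

Every pair contributes the same mean and variance to all three statistics; only the index over which the contributions are accumulated differs.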

Page 12: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

E[njk], Var[njk]: O(JK) size
E[nkw], Var[nkw]: O(KW) size
E[nk], Var[nk]: O(K) size
γjwk: O(MK) size

Page 13: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Details of CVB for LDA

for each document dj
  for each word vw in dj
    for each topic tk
      1. E[njk] −= nwj*γjwk; Var[njk] −= nwj*γjwk*(1−γjwk)
      2. Update γjwk
      3. E[njk] += nwj*γjwk; Var[njk] += nwj*γjwk*(1−γjwk)
    next
  next
next

Update the other two types of E[]s and Var[]s (those of nkw and nk) in a similar manner.
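The three steps on this slide (subtract the old contribution, update γ, add the new contribution) can be sketched as one sequential sweep in Python. This is a CPU illustration under assumed data structures, not the GPU implementation. `stats`, `add_pair`, `init_stats` and `cvb_sweep` are hypothetical names, and each statistic is stored as a `[mean, variance]` pair.

```python
import math

def add_pair(stats, j, w, n, g, sign):
    """Add (sign=+1) or remove (sign=-1) one (j, w) pair's contribution."""
    for k, gk in enumerate(g):
        mean = sign * n * gk
        var = sign * n * gk * (1.0 - gk)
        for cell in (stats["njk"][j][k], stats["nkw"][w][k], stats["nk"][k]):
            cell[0] += mean
            cell[1] += var

def init_stats(counts, gamma, K):
    """Build all means and variances from the initial gamma."""
    stats = {"njk": {}, "nkw": {}, "nk": [[0.0, 0.0] for _ in range(K)]}
    for (j, w), n in counts.items():
        stats["njk"].setdefault(j, [[0.0, 0.0] for _ in range(K)])
        stats["nkw"].setdefault(w, [[0.0, 0.0] for _ in range(K)])
        add_pair(stats, j, w, n, gamma[(j, w)], +1)
    return stats

def cvb_sweep(counts, gamma, stats, alpha, beta, W):
    """One pass over all unique (document, word) pairs, updating in place."""
    for (j, w), n in counts.items():
        add_pair(stats, j, w, n, gamma[(j, w)], -1)        # step 1
        unnormalized = []
        for (ej, vj), (ew, vw), (ek, vk) in zip(
                stats["njk"][j], stats["nkw"][w], stats["nk"]):
            a, b, c = alpha + ej, beta + ew, W * beta + ek
            unnormalized.append(a * b / c * math.exp(
                -vj / (2 * a * a) - vw / (2 * b * b) + vk / (2 * c * c)))
        total = sum(unnormalized)
        gamma[(j, w)] = [u / total for u in unnormalized]  # step 2
        add_pair(stats, j, w, n, gamma[(j, w)], +1)        # step 3
```

Because every pair is removed and re-added with a γ that sums to 1, the means of nk keep summing to the total number of tokens after each sweep.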

Page 14: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Parallelization of CVB for LDA

“as many threads as topics”

Page 15: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Parallelization of CVB

for each document dj          (conventional parallelization)
  for each word vw in dj
    for each topic tk         (proposed parallelization)
      Update γjwk
    next
  next
next

Page 16: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

\[ \gamma_{jwk} \propto (\alpha + E[n_{jk}]) \cdot \frac{\beta + E[n_{kw}]}{W\beta + E[n_k]} \cdot \exp\left[ -\frac{\mathrm{Var}[n_{jk}]}{2(\alpha + E[n_{jk}])^2} - \frac{\mathrm{Var}[n_{kw}]}{2(\beta + E[n_{kw}])^2} + \frac{\mathrm{Var}[n_k]}{2(W\beta + E[n_k])^2} \right] \]

Strategy: “different topics for different threads”

γjw1 + γjw2 + ・・・ + γjwK = 1

Normalization is required!

O(MK) → O(M log K)

Page 17: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Reduction for normalization: O(log K)
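A pairwise (tree) reduction is one common way to obtain the sum needed for normalization in O(log K) steps. The sketch below simulates it sequentially in Python. It assumes K is a power of two, as in the 16-, 32- and 64-topic experiments, and the list `shared` merely stands in for per-block shared memory. The slide does not show the actual kernel, so this only illustrates the general technique.

```python
def tree_reduce_sum(values):
    """Sum a power-of-two-length list in log2(K) pairwise steps.

    Returns (total, number_of_steps).
    """
    shared = list(values)  # stands in for shared memory (assumption)
    k = len(shared)
    assert k > 0 and k & (k - 1) == 0, "sketch assumes K is a power of two"
    steps = 0
    stride = k // 2
    while stride > 0:
        # On a GPU these additions would run concurrently, one per thread.
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2
        steps += 1
    return shared[0], steps

def normalize(unnormalized):
    """Divide every entry by the reduced sum so the entries sum to 1."""
    total, _ = tree_reduce_sum(unnormalized)
    return [u / total for u in unnormalized]
```

For K = 8 the reduction takes three steps instead of seven sequential additions. With one thread per topic, each (document, word) pair therefore costs O(log K) parallel steps.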

Page 18: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Accelerating CVB for LDA by GPGPU

Page 19: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Nvidia CUDA (Compute Unified Device Architecture)

[Figure: CUDA memory hierarchy. A grid with device memory contains blocks; each block has its own shared memory and several threads; each thread has its own registers. The figure is annotated with "documents" and "topics".]

Page 20: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Device memory access latency

[Figure: the same memory hierarchy (grid, device memory, blocks with shared memory, threads with registers), annotated with "16KB".]

Page 21: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Data transfer latency

[Figure: the same memory hierarchy with host memory added.]

Transfer one large block instead of many smaller ones!

Page 22: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

E[njk], Var[njk]: O(JK) size
E[nkw], Var[nkw]: O(KW) size
E[nk], Var[nk]: O(K) size
γjwk: parameters of approximated posterior

Page 23: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Where to store?

Posterior parameters
γjwk: O(K) size

Means and variances
E[njk], Var[njk]: O(K) size for a fixed doc
E[nkw], Var[nkw]: O(KW) size
E[nk], Var[nk]: O(K) size

Storage labels on the slide: registers, shared memory (for summation), registers, device memory

Page 24: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

[Figure: write conflicts. The updates for γjwk and γj’wk both write to E[nkw] and Var[nkw] (O(KW) size).]

Page 25: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Experiments

Page 26: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Text mining

Articles from Mainichi and Asahi Web news
56,755 docs
40,158 words (applying MeCab + removing stop words)
M = 5,053,978 unique doc/word pairs
3,387,822 pairs for training
ASUS EN8800GT/HTDP/1G + Core2Quad Q9450
Evaluating by test data perplexity
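The slides do not state how test data perplexity was computed. The sketch below uses the standard definition, exp(−Σ log p(w|j) / N), with p(w|j) = Σk θjk φkw. It assumes the point estimates θjk = (α + E[njk]) / (Kα + Σk E[njk]) and φkw = (β + E[nkw]) / (Wβ + E[nk]) taken from the training statistics. The function name and the dictionary layout are hypothetical.

```python
import math

def heldout_perplexity(test_counts, e_njk, e_nkw, e_nk, alpha, beta, W):
    """Perplexity of held-out (document, word) counts.

    test_counts[(j, w)] = held-out count of word w in document j.
    e_njk[j], e_nkw[w] and e_nk are length-K lists of training means.
    """
    K = len(e_nk)
    zeros = [0.0] * K
    log_likelihood = 0.0
    n_tokens = 0
    for (j, w), n in test_counts.items():
        doc = e_njk.get(j, zeros)
        word = e_nkw.get(w, zeros)   # word unseen in training: prior only
        doc_total = K * alpha + sum(doc)
        p = 0.0
        for k in range(K):
            theta = (alpha + doc[k]) / doc_total
            phi = (beta + word[k]) / (W * beta + e_nk[k])
            p += theta * phi
        log_likelihood += n * math.log(p)
        n_tokens += n
    return math.exp(-log_likelihood / n_tokens)
```

As a sanity check, a model with no training counts assigns every word probability 1/W, so its perplexity equals the vocabulary size W. Lower values indicate a better fit to the held-out data.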

Page 27: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

[Figure: 16 topics; 64 iterations on CPU vs. 64 iterations on GPU]

Page 28: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

[Figure: 32 topics; 64 iterations on CPU vs. 64 iterations on GPU]

Page 29: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

[Figure: 64 topics; 64 iterations on CPU vs. 64 iterations on GPU]

Page 30: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Image mining

1.5 million tiny images
http://people.csail.mit.edu/torralba/tinyimages/
Only first 32,768 images
Uniform color quantization: 16x16x16
Original image size: 32x32
word = (R, G, B, Xpos, Ypos): 16x16x16x32x32
30 topics
8 PCs (GeForce GTX260 for each PC)
CUDA + MPICH2 + OpenMP (perplexity computation)

Page 31: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Image mining: statistics

J = 32,768 docs
W = 2,090,223 unique words
M = 33,554,432 unique document-word pairs
Running time: 8,191 sec for 100 iterations
LEADTEK WinFast GTX 260 896MB + Core2Quad Q9550

http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
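A short arithmetic check relates these statistics to the image encoding described on the previous slide. The variable names are illustrative. The reading that every pixel yields its own document-word pair is an inference from the numbers, since the pixel position is part of the word.

```python
images = 32_768                 # J: number of images used as documents
pixels_per_image = 32 * 32      # original image size
pairs = images * pixels_per_image

# Upper bound on the vocabulary: quantized color times pixel position.
vocabulary_bound = 16 * 16 * 16 * 32 * 32

seconds_per_iteration = 8_191 / 100

print(pairs)                    # equals the reported M
print(vocabulary_bound)         # compare with the reported W = 2,090,223
print(seconds_per_iteration)
```

The product 32,768 × 1,024 equals the reported M = 33,554,432 exactly. The reported W = 2,090,223 is about half of the 4,194,304 possible words, and the running time corresponds to roughly 82 seconds per iteration.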

Page 32: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices


Page 33: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Summary

Page 34: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Discussions

Larger device memory is better.
Data transfer latency between CPU and GPU

GPU is not enough for scalability.
GPGPU + PC cluster (MPICH2)
“fine-grained”: topic <-> thread
“coarse-grained”: data subset <-> node

Page 35: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Future work

Collapsed Gibbs sampling on GPU?
Collapsed Gibbs sampling for LDA is too simple to obtain speed-up by GPGPU.

Non-parametric Bayes on GPU?
Hierarchical Dirichlet Processes [Teh et al. 06]
How to keep topic numbering consistent among different threads?

Page 36: Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices

Thank you for your attention!

Thank you very much!!!