Clustering Search Results Using PLSA


Page 1: Clustering Search Results Using PLSA


Clustering Search Results Using PLSA

洪春涛

Page 2: Clustering Search Results Using PLSA


Outline

• Motivation

• Introduction to document clustering and PLSA algorithm

• Work in progress and test results

Page 3: Clustering Search Results Using PLSA


Motivation

• Current Internet search engines give us too much information

• Clustering the search results may help users find the desired information quickly

Page 4: Clustering Search Results Using PLSA


A demo of the search results from Google: a query for ‘Truman Capote’ returns pages about both the writer Truman Capote and the film Truman Capote.

Page 5: Clustering Search Results Using PLSA


Document clustering

• Put the ‘similar’ documents together

=> How do we define ‘similar’?

Page 6: Clustering Search Results Using PLSA


Vector Space Model of documents

The Vector Space Model (VSM) sees a document as a vector of terms:

Doc1: I see a bright future.

Doc2: I see nothing.

         I   see   a   bright   future   nothing
doc1     1    1    1      1        1        0
doc2     1    1    0      0        0        1

Page 7: Clustering Search Results Using PLSA


Cosine as Distance Between Documents

The distance between doc1 and doc2 is then defined by the cosine of the angle between their term vectors:

cos(doc1, doc2) = (doc1 · doc2) / (|doc1| * |doc2|)
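A minimal sketch of this measure in C++ for two term-count vectors; the function name and use of std::vector are illustrative assumptions, not code from the presentation.

#include <cmath>
#include <vector>

// Cosine similarity between two documents in the vector space model.
double cosineSimilarity(const std::vector<double>& doc1,
                        const std::vector<double>& doc2) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (size_t i = 0; i < doc1.size(); i++) {   // both vectors use the same term order
        dot   += doc1[i] * doc2[i];
        norm1 += doc1[i] * doc1[i];
        norm2 += doc2[i] * doc2[i];
    }
    if (norm1 == 0.0 || norm2 == 0.0) return 0.0;
    return dot / (std::sqrt(norm1) * std::sqrt(norm2));
}

// With the vectors from the previous slide:
//   doc1 = {1,1,1,1,1,0}, doc2 = {1,1,0,0,0,1}  =>  cos = 2 / (sqrt(5) * sqrt(3)) ≈ 0.52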

Page 8: Clustering Search Results Using PLSA


Problems with cosine similarity

• Synonymy: different words may have the same meaning
– car manufacturer = automobile maker

• Polysemy: a word may have several different meanings
– ‘Truman Capote’ may mean the writer or the film

=> We need a model that reflects the ‘meaning’

Page 9: Clustering Search Results Using PLSA


Probabilistic Latent Semantic Analysis

Graphical model of PLSA:

P(d, w) = P(d) P(w|d)

P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)

D: document
Z: latent class
W: word

These can also be written as:

P(d, w) = Σ_{z∈Z} P(z) P(w|z) P(d|z)

(Diagram: documents D, latent classes Z, and words W drawn as nodes, with edges labelled by example probabilities.)

Page 10: Clustering Search Results Using PLSA


• Through Maximum Likelihood estimation, one gets the estimated parameters:

P(d|z) – this is what we want: a document-topic matrix that reflects the meanings of the documents

P(w|z)

P(z)

Page 11: Clustering Search Results Using PLSA


Our approach

1. Get the P(d|z) matrix by PLSA, and

2. Run the k-means clustering algorithm on that matrix (a sketch follows below)
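A minimal sketch of step 2, assuming plain Lloyd's k-means with Euclidean distance over the rows of the D*Z matrix P(d|z); the function name, seeding strategy, and fixed iteration count are illustrative choices, not details taken from the slides.

#include <limits>
#include <vector>

// Cluster documents by their topic distributions: p_d_z is row-major, numDocs * numZ.
// Returns one cluster id per document.
std::vector<int> clusterDocuments(const std::vector<double>& p_d_z,
                                  int numDocs, int numZ, int k, int iters = 20) {
    std::vector<double> centers(k * numZ);
    for (int c = 0; c < k; c++) {                         // seed: k evenly spaced documents
        int d = (c * numDocs) / k;
        for (int z = 0; z < numZ; z++) centers[c * numZ + z] = p_d_z[d * numZ + z];
    }
    std::vector<int> label(numDocs, 0);
    for (int it = 0; it < iters; it++) {
        // Assignment step: nearest center by squared Euclidean distance.
        for (int d = 0; d < numDocs; d++) {
            double best = std::numeric_limits<double>::max();
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int z = 0; z < numZ; z++) {
                    double diff = p_d_z[d * numZ + z] - centers[c * numZ + z];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; label[d] = c; }
            }
        }
        // Update step: move each center to the mean of its cluster.
        std::vector<double> sum(k * numZ, 0.0);
        std::vector<int> cnt(k, 0);
        for (int d = 0; d < numDocs; d++) {
            cnt[label[d]]++;
            for (int z = 0; z < numZ; z++) sum[label[d] * numZ + z] += p_d_z[d * numZ + z];
        }
        for (int c = 0; c < k; c++)
            if (cnt[c] > 0)
                for (int z = 0; z < numZ; z++) centers[c * numZ + z] = sum[c * numZ + z] / cnt[c];
    }
    return label;
}

Documents that end up in the same cluster share similar topic distributions, which is how the clustering reflects ‘meaning’ rather than raw term overlap.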

Page 12: Clustering Search Results Using PLSA


Problems with this approach

• PLSA takes too much time

– Solution: optimization & parallelization

Page 13: Clustering Search Results Using PLSA


Algorithm Outline

Expectation Maximization (EM) Algorithm:

Tempered EM:

E-step:

M-step:
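For reference, the tempered EM updates for the symmetric PLSA parameterization above are usually written as follows (β is the tempering exponent; β = 1 gives plain EM; n(d,w) is the number of times word w occurs in document d):

E-step:
  P(z|d,w) = [P(z) P(d|z) P(w|z)]^β / Σ_{z'∈Z} [P(z') P(d|z') P(w|z')]^β

M-step:
  P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
  P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
  P(z)   ∝ Σ_{d,w} n(d,w) P(z|d,w)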

Page 14: Clustering Search Results Using PLSA


Basic Data Structures

p_w_z_current, p_w_z_prev: dense double matrix, W*Z

p_d_z_current, p_d_z_prev: dense double matrix, D*Z

p_z_current, p_z_prev: double array of length Z

n_d_w: sparse integer matrix of document-word counts (N non-zero entries)
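A minimal sketch of one possible in-memory layout for these structures, assuming row-major dense matrices stored as flat arrays and CSR-style storage for the sparse counts; the struct and field names are illustrative, not taken from the actual implementation.

#include <vector>

struct PlsaModel {
    int numDocs, numTerms, numZ;
    std::vector<double> p_w_z_current, p_w_z_prev;   // dense, numTerms * numZ (row-major)
    std::vector<double> p_d_z_current, p_d_z_prev;   // dense, numDocs * numZ (row-major)
    std::vector<double> p_z_current, p_z_prev;       // length numZ
};

struct SparseCounts {                                // n_d_w: document-word co-occurrence table
    std::vector<int> rowStart;                       // per-document offset into termId/count (numDocs + 1)
    std::vector<int> termId;                         // term id of each non-zero entry (N entries)
    std::vector<int> count;                          // n(d, w) for each non-zero entry
};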

Page 15: Clustering Search Results Using PLSA

Lemur Implementation

• On-demand calculation of p_z_d_w (recomputed every time it is needed)

• Computational complexity: O(W*D*Z^2)

• For the new3 dataset (9,558 documents, 83,487 unique terms), a single TEM iteration takes days to finish


Page 16: Clustering Search Results Using PLSA

Optimization of the Algorithm

• Reduce complexity
– calculate p_z_d_w just once per iteration
– complexity reduced to O(N*Z), where N is the number of non-zero entries in n_d_w

• Reduce cache misses by reordering the loops (a fuller sketch of the resulting fused loop follows below):

for (int d = 0; d < numDocs; d++) {
    for (int w = 0; w < numTermsInThisDoc; w++) {
        for (int z = 0; z < numZ; z++) {
            // ... E-step / M-step updates for this (d, w, z) ...
        }
    }
}
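A minimal sketch of what such a fused, O(N*Z) tempered-EM iteration could look like, with P(z|d,w) computed on the fly for each non-zero (d, w) pair and the z-loop innermost so that rows of p_w_z and p_d_z are scanned contiguously; the function name, CSR layout, and normalization details are assumptions, not the presenter's actual code.

#include <algorithm>
#include <cmath>
#include <vector>

struct SparseCounts {                  // CSR-style n(d, w), as sketched earlier
    std::vector<int> rowStart;         // numDocs + 1 entries
    std::vector<int> termId;           // N non-zero entries
    std::vector<int> count;
};

// One tempered-EM iteration: reads the *_prev parameters, writes the *_cur ones.
void temIteration(const SparseCounts& n, int numDocs, int numTerms, int numZ,
                  double beta,                              // tempering exponent
                  const std::vector<double>& p_w_z_prev,    // numTerms * numZ, row-major
                  const std::vector<double>& p_d_z_prev,    // numDocs * numZ, row-major
                  const std::vector<double>& p_z_prev,      // numZ
                  std::vector<double>& p_w_z_cur,
                  std::vector<double>& p_d_z_cur,
                  std::vector<double>& p_z_cur) {
    std::fill(p_w_z_cur.begin(), p_w_z_cur.end(), 0.0);
    std::fill(p_d_z_cur.begin(), p_d_z_cur.end(), 0.0);
    std::fill(p_z_cur.begin(), p_z_cur.end(), 0.0);
    std::vector<double> post(numZ);                         // P(z|d,w) for the current pair

    for (int d = 0; d < numDocs; d++) {
        for (int i = n.rowStart[d]; i < n.rowStart[d + 1]; i++) {
            int w = n.termId[i];
            double cnt = n.count[i];
            double norm = 0.0;
            for (int z = 0; z < numZ; z++) {                // E-step for this (d, w)
                post[z] = std::pow(p_z_prev[z] * p_d_z_prev[d * numZ + z]
                                               * p_w_z_prev[w * numZ + z], beta);
                norm += post[z];
            }
            if (norm == 0.0) continue;
            for (int z = 0; z < numZ; z++) {                // fused M-step accumulation
                double q = cnt * post[z] / norm;
                p_w_z_cur[w * numZ + z] += q;
                p_d_z_cur[d * numZ + z] += q;
                p_z_cur[z]              += q;
            }
        }
    }
    // Normalize the accumulators into probability distributions.
    for (int z = 0; z < numZ; z++) {
        double total = p_z_cur[z];
        if (total == 0.0) continue;
        for (int w = 0; w < numTerms; w++) p_w_z_cur[w * numZ + z] /= total;
        for (int d = 0; d < numDocs; d++) p_d_z_cur[d * numZ + z] /= total;
    }
    double grand = 0.0;
    for (int z = 0; z < numZ; z++) grand += p_z_cur[z];
    if (grand > 0.0)
        for (int z = 0; z < numZ; z++) p_z_cur[z] /= grand;
}

This keeps the per-iteration work proportional to the number of non-zero (d, w) pairs, N, rather than to W*D.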


Page 17: Clustering Search Results Using PLSA

Parallelization: Access Pattern


• Data race

– Solution: divide the co-occurrence table into blocks
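A minimal sketch of one way blocks of the co-occurrence table can be dispatched without races, assuming documents and terms are each split into numThreads ranges and round r assigns block (t, (t + r) mod numThreads) to thread t, so no two concurrent threads touch the same rows of p_d_z or p_w_z; this is an illustrative schedule, not the dispatching algorithm from the following slides.

#include <functional>
#include <thread>
#include <vector>

// processBlock(docBegin, docEnd, termBegin, termEnd) is assumed to run the
// fused E/M-step loop restricted to that rectangle of the co-occurrence table.
void runBlockedIteration(int numDocs, int numTerms, int numThreads,
                         const std::function<void(int, int, int, int)>& processBlock) {
    for (int round = 0; round < numThreads; round++) {
        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; t++) {
            int docBegin  = (t * numDocs) / numThreads;        // thread t always owns
            int docEnd    = ((t + 1) * numDocs) / numThreads;  // the same document range
            int col       = (t + round) % numThreads;          // rotate over term ranges
            int termBegin = (col * numTerms) / numThreads;
            int termEnd   = ((col + 1) * numTerms) / numThreads;
            workers.emplace_back(processBlock, docBegin, docEnd, termBegin, termEnd);
        }
        for (auto& w : workers) w.join();   // barrier: next round starts only when
                                            // every block of this round is done
    }
}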

Page 18: Clustering Search Results Using PLSA

Block Dispatching Algorithm


Page 19: Clustering Search Results Using PLSA

Block Dividing Algorithm


(Figure: block division on the cranmed dataset.)

Page 20: Clustering Search Results Using PLSA

Experiment Setup


Page 21: Clustering Search Results Using PLSA

Speedup


(Figure: speedup on the two test machines, HPC134 and Tulsa.)

Page 22: Clustering Search Results Using PLSA

Memory Bandwidth Usage


Page 23: Clustering Search Results Using PLSA

Memory Related Pipeline Stalls


Page 24: Clustering Search Results Using PLSA

Available Memory Bandwidth of the Two Machines


Page 25: Clustering Search Results Using PLSA

END


Page 26: Clustering Search Results Using PLSA


Backup slides

Page 27: Clustering Search Results Using PLSA


Test Results

Dataset     PLSA     VSM
Tr23        0.4977   0.5273
K1b         0.8473   0.5724
sports      0.7575   0.5563

Table 1. F-score of PLSA and VSM

Page 28: Clustering Search Results Using PLSA


sizeZ        10    20     50    100
Lemur        29    48    263   1015
Optimized     2   3.2      7     13

Table 2. Time for one EM iteration (in seconds)

Uses the k1b dataset (2,340 docs, 21,247 unique terms, 530,374 terms in total)

Page 29: Clustering Search Results Using PLSA


Thanks!