Clustering Search Results Using PLSA



洪春涛 (Hong Chuntao)


Outline

• Motivation

• Introduction to document clustering and the PLSA algorithm

• Implementation progress and test results


Motivation

• Current Internet search engines return too much information

• Clustering the search results may help users find the desired information quickly


[Figure: a demo of Google search results for 'Truman Capote', mixing pages about the writer and pages about the film]


Document clustering

• Group ‘similar’ documents together

=> How do we define ‘similar’?


Vector Space Model of documents

The Vector Space Model (VSM) sees a document as a vector of terms:

Doc1: I see a bright future.

Doc2: I see nothing.

         I   see   a   bright   future   nothing
doc1     1    1    1     1        1         0
doc2     1    1    0     0        0         1


Cosine as Distance Between Documents

The distance between doc1 and doc2 is then defined via the cosine of the angle between their vectors:

cos(doc1, doc2) = (doc1 · doc2) / (|doc1| * |doc2|)
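As an illustration (not part of the original slides), a minimal C++ sketch of this cosine computation over dense term vectors like doc1 and doc2 above:

    #include <cmath>
    #include <vector>

    // cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
    double cosineSimilarity(const std::vector<double>& v1,
                            const std::vector<double>& v2) {
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (std::size_t i = 0; i < v1.size(); ++i) {
            dot   += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        return dot / (std::sqrt(norm1) * std::sqrt(norm2));
    }

For the example above, doc1 = (1,1,1,1,1,0) and doc2 = (1,1,0,0,0,1) give cos = 2 / (√5 · √3) ≈ 0.52.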


Problems with cosine similarity

• Synonymy: different words may have the same meaning
  – 'car manufacturer' = 'automobile maker'

• Polysemy: a word may have several different meanings
  – 'Truman Capote' may mean the writer or the film

=> We need a model that reflects the 'meaning'


Probabilistic Latent Semantic Analysis

Graphical model of PLSA (D: document, Z: latent class, W: word):

P(d, w) = P(d) P(w|d)

P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)

These can also be written as:

P(d, w) = Σ_{z∈Z} P(z) P(w|z) P(d|z)

[Figure: the asymmetric and symmetric PLSA graphical models, with example edge probabilities]


• Through maximum likelihood estimation, one obtains the estimated parameters:

P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents.

P(w|z)

P(z)


Our approach

1. Get the P(d|z) matrix by PLSA, and

2. Run the k-means clustering algorithm on the rows of that matrix (a sketch follows below)
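As a concrete illustration of step 2 (a minimal sketch, not the authors' implementation; the random seeding and iteration count are arbitrary choices), plain k-means over the rows of the P(d|z) matrix could look like this in C++:

    #include <cstdlib>
    #include <limits>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>; // D rows (documents), Z columns (topics)

    // Squared Euclidean distance between two topic vectors.
    static double dist2(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Cluster the rows of p_d_z into k groups; returns a cluster label per document.
    std::vector<int> kmeans(const Matrix& p_d_z, int k, int numIters = 20) {
        const std::size_t D = p_d_z.size(), Z = p_d_z[0].size();
        Matrix centers(k);
        for (int c = 0; c < k; ++c)
            centers[c] = p_d_z[std::rand() % D]; // naive random seeding
        std::vector<int> label(D, 0);
        for (int iter = 0; iter < numIters; ++iter) {
            // Assignment step: each document goes to its nearest center.
            for (std::size_t d = 0; d < D; ++d) {
                double best = std::numeric_limits<double>::max();
                for (int c = 0; c < k; ++c) {
                    double dd = dist2(p_d_z[d], centers[c]);
                    if (dd < best) { best = dd; label[d] = c; }
                }
            }
            // Update step: recompute each center as the mean of its documents.
            Matrix sums(k, std::vector<double>(Z, 0.0));
            std::vector<int> counts(k, 0);
            for (std::size_t d = 0; d < D; ++d) {
                ++counts[label[d]];
                for (std::size_t z = 0; z < Z; ++z)
                    sums[label[d]][z] += p_d_z[d][z];
            }
            for (int c = 0; c < k; ++c)
                if (counts[c] > 0)
                    for (std::size_t z = 0; z < Z; ++z)
                        centers[c][z] = sums[c][z] / counts[c];
        }
        return label;
    }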


Problems with this approach

• PLSA takes too much time
  – Solution: optimization and parallelization


Algorithm Outline

Expectation-Maximization (EM) algorithm:

E-step:

P(z|d,w) = P(z) P(w|z) P(d|z) / Σ_{z'∈Z} P(z') P(w|z') P(d|z')

M-step:

P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
P(z) ∝ Σ_d Σ_w n(d,w) P(z|d,w)

Tempered EM (TEM) dampens the E-step with a parameter β to avoid overfitting:

P(z|d,w) ∝ [P(z) P(w|z) P(d|z)]^β


Basic Data Structures

p_w_z_current, p_w_z_prev: dense double matrix, W × Z

p_d_z_current, p_d_z_prev: dense double matrix, D × Z

p_z_current, p_z_prev: double array of length Z

n_d_w: sparse integer matrix of document-term co-occurrence counts, with N non-zero entries
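As a rough sketch of how these could be declared (assumed C++ types, not the original code):

    #include <vector>

    // One non-zero entry of the sparse document-term matrix n(d, w).
    struct Cooccurrence {
        int doc;    // document id d
        int term;   // term id w
        int count;  // n(d, w)
    };

    // Dense probability tables, double-buffered between iterations.
    std::vector<std::vector<double>> p_w_z_current, p_w_z_prev; // W x Z
    std::vector<std::vector<double>> p_d_z_current, p_d_z_prev; // D x Z
    std::vector<double> p_z_current, p_z_prev;                  // length Z

    // The N non-zero counts of n_d_w, stored as a coordinate list.
    std::vector<Cooccurrence> n_d_w;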

Lemur Implementation

• On-demand calculation of p_z_d_w

• Computational complexity: O(W*D*Z²)

• For the new3 dataset (9,558 documents, 83,487 unique terms), one TEM iteration takes days to finish


Optimization of the Algorithm

• Reduce complexity
  – calculate p_z_d_w just once per iteration
  – complexity reduced to O(N*Z)

• Reduce cache misses by reordering the loops:

for (int d = 0; d < numDocs; d++) {
    for (int w = 0; w < numTermsInThisDoc; w++) {
        for (int z = 0; z < numZ; z++) {
            ...
        }
    }
}
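Continuing the declarations sketched earlier, here is an illustrative (assumed, not the authors' actual code) O(N*Z) tempered-EM iteration: p_z_d_w is computed exactly once per non-zero (d, w) pair and immediately scattered into the accumulators, with z innermost for cache-friendly access.

    #include <cmath>
    #include <vector>

    int numZ; // number of latent classes Z, assumed set elsewhere

    // One TEM iteration over the N non-zero (d, w) pairs: O(N*Z).
    // Reads the *_prev tables; accumulates unnormalized values into *_current.
    void temIteration(double beta) {
        // (Assumes the *_current tables were zeroed beforehand.)
        std::vector<double> p_z_dw(numZ);
        for (const Cooccurrence& e : n_d_w) {
            // E-step for this (d, w), computed once rather than on demand:
            // p(z|d,w) ∝ [P(z) P(w|z) P(d|z)]^beta
            double norm = 0.0;
            for (int z = 0; z < numZ; ++z) {
                double v = p_z_prev[z] * p_w_z_prev[e.term][z] * p_d_z_prev[e.doc][z];
                p_z_dw[z] = std::pow(v, beta);
                norm += p_z_dw[z];
            }
            // M-step accumulation, weighted by the count n(d, w).
            for (int z = 0; z < numZ; ++z) {
                double inc = e.count * p_z_dw[z] / norm;
                p_w_z_current[e.term][z] += inc;
                p_d_z_current[e.doc][z]  += inc;
                p_z_current[z]           += inc;
            }
        }
        // Finally, normalize p_w_z_current, p_d_z_current, and p_z_current
        // so that each distribution sums to 1 (omitted here).
    }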


Parallelization: Access Pattern


Data Race

• Concurrent threads accumulate into the same rows of p_d_z and p_w_z, causing data races

• Solution: divide the co-occurrence table into blocks so that blocks processed concurrently share no document rows or term columns

Block Dispatching Algorithm
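A hedged reconstruction of one way such a dispatcher can work (illustrative only; the slides' exact algorithm is in the figure): split the documents and the terms into T ranges each, and run T rounds in which thread t processes block (t, (t + round) mod T). Within a round, no two active blocks share a document row or a term column, so the updates to p_d_z and p_w_z cannot race.

    #include <omp.h>

    // Hypothetical worker: runs the EM updates restricted to one
    // (document-range, term-range) block of the co-occurrence table.
    void processBlock(int docBlock, int termBlock);

    // T rounds of T concurrent, non-conflicting blocks.
    void runBlocked(int T) {
        for (int round = 0; round < T; ++round) {
            #pragma omp parallel for num_threads(T)
            for (int t = 0; t < T; ++t) {
                processBlock(/*docBlock=*/t, /*termBlock=*/(t + round) % T);
            }
        }
    }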


Block Dividing Algorithm


Experiment Setup

[Figure: experiment setup, including the cranmed dataset]


Speedup


Memory Bandwidth Usage

[Figure: memory bandwidth usage on the two test machines, HPC134 and Tulsa]


Memory-Related Pipeline Stalls


Available Memory Bandwidth of the Two Machines


END


Backup slides


Test Results

          PLSA     VSM
tr23      0.4977   0.5273
k1b       0.8473   0.5724
sports    0.7575   0.5563

Table 1. F-score of PLSA and VSM


sizeZ        10    20    50    100
Lemur        29    48    263   1015
Optimized    2     3.2   7     13

Table 2. Time used in one EM iteration (in seconds)

Uses the k1b dataset (2,340 docs, 21,247 unique terms, 530,374 total terms)


Thanks!
