23/4/21 1

Clustering Search Results Using PLSA

Hong Chuntao (洪春涛)

Outline

• Motivation

• Introduction to document clustering and the PLSA algorithm

• Work in progress and test results

Motivation

• Current Internet search engines are giving us too much information

• Clustering the search results may help find the desired information quickly

Figure: A sample of Google search results. Some hits refer to the writer Truman Capote, others to the film Truman Capote.

Document clustering

• Put the ‘similar’ documents together

=> How do we define ‘similar’?

Vector Space Model of documents

The Vector Space Model (VSM) represents a document as a vector of term counts:

Doc1: I see a bright future.

Doc2: I see nothing.

        I    see    a    bright    future    nothing
doc1    1    1      1    1         1         0
doc2    1    1      0    0         0         1

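Cosine as Distance Between Documents

The distance between doc1 and doc2 is then defined through the cosine of the angle between their term vectors (a larger cosine means more similar documents):

$$\cos(\mathrm{doc}_1, \mathrm{doc}_2) = \frac{\mathrm{doc}_1 \cdot \mathrm{doc}_2}{|\mathrm{doc}_1| \, |\mathrm{doc}_2|}$$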
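As a rough illustration, the cosine measure can be sketched in C++ as below. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions made for this sketch, not part of the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between two term-count vectors of equal length.
// Assumption: an all-zero vector yields 0.0 instead of a division by zero.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

For the doc1 and doc2 vectors above, the dot product is 2 and the norms are √5 and √3, so the cosine is 2/√15, roughly 0.516.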

Problems with cosine similarity

• Synonymy: different words may have the same meaning

– "car manufacturer" = "automobile maker"

• Polysemy: a word may have several different meanings

– ‘Truman Capote’ may mean the writer or the film

=> We need a model that reflects the ‘meaning’

Probabilistic Latent Semantic Analysis

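Graphical model of PLSA:

D: document

Z: latent class

W: word

$$P(d, w) = P(d)\, P(w \mid d)$$

$$P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d)$$

These can also be written as:

$$P(d, w) = \sum_{z \in Z} P(z)\, P(w \mid z)\, P(d \mid z)$$

Figure: example graphs over document nodes (D1, D2), a latent class node (Z1), and a word node (W1), with edge probabilities 0.1, 0.9, 0.3, 0.7, 0.8, and 0.2.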
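The two ways of writing P(d, w) agree by Bayes' rule. A short derivation, using only the symbols defined on this slide:

```latex
\begin{align*}
P(d, w) &= P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \\
        &= \sum_{z \in Z} P(w \mid z)\, \bigl[ P(z \mid d)\, P(d) \bigr] \\
        &= \sum_{z \in Z} P(z)\, P(w \mid z)\, P(d \mid z),
        \qquad \text{since } P(z \mid d)\, P(d) = P(d, z) = P(d \mid z)\, P(z).
\end{align*}
```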

• Through maximum likelihood estimation, one obtains the estimated parameters:

– P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents

– P(w|z)

– P(z)

Our approach

1. Get the P(d|z) matrix by PLSA, and

2. Use the k-means clustering algorithm on the rows of that matrix (one row per document)
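The second step can be sketched as follows. This is a minimal version of Lloyd's k-means, assuming each document is given as one row of topic weights; the names `Row`, `squared_distance`, and `kmeans`, and the choice of the first k rows as initial centroids, are assumptions of this sketch rather than details from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Row = std::vector<double>;

// Squared Euclidean distance between two rows of equal length.
double squared_distance(const Row& a, const Row& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Lloyd's k-means on the rows of `points` (one row per document).
// Assumption: the first k rows are the initial centroids, which keeps the
// sketch deterministic; real code would use a better initialization.
std::vector<int> kmeans(const std::vector<Row>& points, std::size_t k,
                        int max_iterations = 100) {
    assert(k > 0 && k <= points.size());
    std::vector<Row> centroids(points.begin(), points.begin() + k);
    std::vector<int> label(points.size(), -1);
    for (int iter = 0; iter < max_iterations; ++iter) {
        bool changed = false;
        // Assignment step: attach each document to its nearest centroid.
        for (std::size_t i = 0; i < points.size(); ++i) {
            int best = 0;
            double best_dist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double dist = squared_distance(points[i], centroids[c]);
                if (dist < best_dist) {
                    best_dist = dist;
                    best = static_cast<int>(c);
                }
            }
            if (label[i] != best) {
                label[i] = best;
                changed = true;
            }
        }
        if (!changed) break;
        // Update step: move each centroid to the mean of its members.
        for (std::size_t c = 0; c < k; ++c) {
            Row mean(points[0].size(), 0.0);
            std::size_t count = 0;
            for (std::size_t i = 0; i < points.size(); ++i) {
                if (label[i] != static_cast<int>(c)) continue;
                for (std::size_t j = 0; j < mean.size(); ++j) mean[j] += points[i][j];
                ++count;
            }
            if (count == 0) continue;  // an empty cluster keeps its old centroid
            for (double& m : mean) m /= static_cast<double>(count);
            centroids[c] = mean;
        }
    }
    return label;
}
```

Calling `kmeans(rows, 2)` on four toy rows such as (0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8) groups the first and third rows together and the second and fourth together. These toy values are illustrative only and are not normalized the way a real P(d|z) matrix would be.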

Problems with this approach

• PLSA takes too much time

solution: optimization & parallelization

Algorithm Outline

Expectation Maximization (EM) Algorithm:

Tempered EM:

E-step:

M-step:
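For reference, the standard tempered EM updates for the symmetric PLSA parameterization used in this deck (P(z), P(w|z), P(d|z)) can be written as below. This is the textbook form, assumed here rather than copied from the slide; n(d, w) is the count of word w in document d, and β is the tempering exponent, with β = 1 giving plain EM.

```latex
\begin{align*}
\text{E-step:}\quad
P(z \mid d, w) &= \frac{P(z)\,\bigl[P(d \mid z)\, P(w \mid z)\bigr]^{\beta}}
                       {\sum_{z' \in Z} P(z')\,\bigl[P(d \mid z')\, P(w \mid z')\bigr]^{\beta}} \\[6pt]
\text{M-step:}\quad
P(w \mid z) &= \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}
                    {\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')} \\
P(d \mid z) &= \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}
                    {\sum_{d'} \sum_{w} n(d', w)\, P(z \mid d', w)} \\
P(z) &= \frac{\sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)}
             {\sum_{d} \sum_{w} n(d, w)}
\end{align*}
```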

Basic Data Structures

p_w_z_current, p_w_z_prev: dense double matrix, W*Z

p_d_z_current, p_d_z_prev: dense double matrix, D*Z

p_z_current, p_z_prev: double array, Z

n_d_w: sparse integer matrix, N non-zero entries
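One possible layout for these structures is sketched below. The type names `DenseMatrix`, `SparseCounts`, and `PlsaState`, the row-major storage, and the compressed sparse row (CSR) format for n_d_w are all assumptions of this sketch; the slides only give the variable names and shapes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Dense row-major matrix of doubles, e.g. W x Z or D x Z.
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;
    DenseMatrix(std::size_t r, std::size_t c, double init = 0.0)
        : rows(r), cols(c), data(r * c, init) {}
    double& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Sparse D x W count matrix in CSR form (an assumed format): the entries
// of document d sit at positions [row_start[d], row_start[d + 1]).
struct SparseCounts {
    std::vector<std::size_t> row_start;  // size D + 1
    std::vector<std::size_t> term;       // size N, term index of each entry
    std::vector<int> count;              // size N, n(d, w) of each entry
    std::size_t num_docs() const { return row_start.size() - 1; }
    std::size_t num_entries() const { return term.size(); }
};

// Current and previous parameter estimates plus the co-occurrence counts.
struct PlsaState {
    DenseMatrix p_w_z_current, p_w_z_prev;      // W x Z
    DenseMatrix p_d_z_current, p_d_z_prev;      // D x Z
    std::vector<double> p_z_current, p_z_prev;  // Z
    SparseCounts n_d_w;                         // N non-zero entries
    PlsaState(std::size_t D, std::size_t W, std::size_t Z, SparseCounts counts)
        : p_w_z_current(W, Z), p_w_z_prev(W, Z),
          p_d_z_current(D, Z), p_d_z_prev(D, Z),
          p_z_current(Z), p_z_prev(Z), n_d_w(std::move(counts)) {}
};
```

Keeping a current and a previous copy of each parameter lets one iteration read the old estimates while writing the new ones, and the two can be swapped afterwards.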

Lemur Implementation

• On-demand calculation of p_z_d_w

• Computational complexity: O(W*D*Z^2)

• For the new3 dataset (9558 documents, 83487 unique terms), a single TEM iteration takes days to finish

Optimization of the Algorithm

• Reduce complexity

– calculate p_z_d_w just once in an iteration

– complexity reduced to O(N*Z)

• Reduce cache misses by reordering the loops

for (int d = 1; d < numDocs; d++) {              // documents
    for (int w = 0; w < numTermsInThisDoc; w++) { // terms occurring in document d
        for (int z = 0; z < numZ; z++) {          // latent classes
            ….
        }
    }
}
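To make the two optimizations concrete, here is a minimal sketch of one plain EM iteration (tempering omitted) that computes the posterior for each non-zero (d, w) pair once and reuses it for all three parameter updates, with z as the innermost loop. The `Plsa` and `Counts` types, the row-major layout, and the CSR count format are assumptions of this sketch, not the actual Lemur or optimized code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model container; both matrices are row-major with Z columns.
struct Plsa {
    std::size_t D, W, Z;
    std::vector<double> p_d_z;  // D x Z
    std::vector<double> p_w_z;  // W x Z
    std::vector<double> p_z;    // Z
};

// Non-zero counts n(d, w) in CSR form (an assumed format).
struct Counts {
    std::vector<std::size_t> row_start;  // size D + 1
    std::vector<std::size_t> term;       // term index of each entry
    std::vector<int> count;              // n(d, w) of each entry
};

// One EM iteration in O(N * Z) time, N being the number of non-zero entries.
Plsa em_iteration(const Plsa& prev, const Counts& n) {
    assert(n.row_start.size() == prev.D + 1);
    const std::size_t Z = prev.Z;
    Plsa next{prev.D, prev.W, Z,
              std::vector<double>(prev.D * Z, 0.0),
              std::vector<double>(prev.W * Z, 0.0),
              std::vector<double>(Z, 0.0)};
    std::vector<double> post(Z);  // unnormalized P(z | d, w) for one pair
    double total = 0.0;
    for (std::size_t d = 0; d < prev.D; ++d) {
        for (std::size_t e = n.row_start[d]; e < n.row_start[d + 1]; ++e) {
            const std::size_t w = n.term[e];
            // E-step for this pair: computed once, reused by all updates.
            double norm = 0.0;
            for (std::size_t z = 0; z < Z; ++z) {
                post[z] = prev.p_z[z] * prev.p_d_z[d * Z + z] * prev.p_w_z[w * Z + z];
                norm += post[z];
            }
            if (norm == 0.0) continue;  // pair has no support under the model
            // M-step accumulation; z is innermost, so accesses are contiguous.
            for (std::size_t z = 0; z < Z; ++z) {
                const double weight = static_cast<double>(n.count[e]) * post[z] / norm;
                next.p_d_z[d * Z + z] += weight;
                next.p_w_z[w * Z + z] += weight;
                next.p_z[z] += weight;
            }
            total += static_cast<double>(n.count[e]);
        }
    }
    // Normalize P(d|z) and P(w|z) by the mass of class z, and P(z) by the total.
    for (std::size_t z = 0; z < Z; ++z) {
        const double mass = next.p_z[z];
        if (mass > 0.0) {
            for (std::size_t d = 0; d < prev.D; ++d) next.p_d_z[d * Z + z] /= mass;
            for (std::size_t w = 0; w < prev.W; ++w) next.p_w_z[w * Z + z] /= mass;
        }
        if (total > 0.0) next.p_z[z] /= total;
    }
    return next;
}
```

Because only the non-zero entries of the co-occurrence table are visited and each costs O(Z) work, the per-iteration cost follows the O(N*Z) bound stated above.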

Parallelization: Access Pattern

Data Race

solution: divide the co-occurrence table into blocks

Block Dispatching Algorithm
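One common way to dispatch blocks without data races is a diagonal (round-robin) schedule, sketched below. This is an assumed scheme for illustration and not necessarily the dispatching algorithm used in this work; the function name `diagonal_schedule` is hypothetical. With P workers the co-occurrence table is split into P x P blocks, and in round r worker t processes the block in row t and column (t + r) mod P.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Block = std::pair<std::size_t, std::size_t>;  // (document block, word block)

// Returns, for each round, the block assigned to each worker. Within a round
// no two workers share a document block or a word block, so their updates to
// the per-document and per-word rows cannot collide.
std::vector<std::vector<Block>> diagonal_schedule(std::size_t workers) {
    assert(workers > 0);
    std::vector<std::vector<Block>> rounds(workers);
    for (std::size_t r = 0; r < workers; ++r) {
        for (std::size_t t = 0; t < workers; ++t) {
            rounds[r].emplace_back(t, (t + r) % workers);
        }
    }
    return rounds;
}
```

After P rounds every block has been processed exactly once. Sums over all blocks, such as the accumulators for p_z, would still need per-worker partial sums that are merged at the end of the iteration.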

Block Dividing Algorithm

Experiment Setup

Dataset: cranmed

Speedup

Memory Bandwidth Usage

Figure panels: HPC134, Tulsa

Memory Related Pipeline Stalls

Available Memory Bandwidth of the Two Machines

END

Backup slides

Test Results

Dataset    PLSA      VSM
Tr23       0.4977    0.5273
K1b        0.8473    0.5724
sports     0.7575    0.5563

Table 1. F-score of PLSA and VSM

sizeZ        10    20     50     100
Lemur        29    48     263    1015
Optimized    2     3.2    7      13

Table 2. Time used in one EM iteration (in seconds), using the k1b dataset (2340 docs, 21247 unique terms, 530374 terms)
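As a quick check on Table 2, dividing the Lemur time by the optimized time gives the per-iteration speedup for each number of latent classes. The gap widens as sizeZ grows:

```latex
\begin{align*}
\text{sizeZ} = 10:&\quad \frac{29}{2} = 14.5 \\
\text{sizeZ} = 20:&\quad \frac{48}{3.2} = 15 \\
\text{sizeZ} = 50:&\quad \frac{263}{7} \approx 37.6 \\
\text{sizeZ} = 100:&\quad \frac{1015}{13} \approx 78.1
\end{align*}
```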

Thanks!