1
Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement
Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In Proceedings of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000.
Presenter: 游斯涵
2
Introduction
• Some studies apply modified or generalized SVD (improving the precision of similarities):
– SDD (semi-discrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
– R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: user feedback can be integrated into LSI models.
• Other studies give theoretical treatments of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.
3
Introduction
• The problem with SVD:
– The topics underlying outlier documents tend to be lost as we choose a lower number of dimensions.
• Dimensionality reduction removes information from two sources:
– outlier documents
– minor terms
• The idea of this paper:
– Do not treat outlier documents as "noise"; all documents are assumed to be equally important.
– Try to eliminate the noise from the minor terms without eliminating the influence of the outlier documents.
Outlier documents: documents very different from the other documents.
4
Introduction
5
Compare with SVD
• Same:
– Both try to find a smaller set of basis vectors for a reduced space.
• Different:
– The new algorithm scales the length of each residual vector.
– It treats documents and terms in a nonsymmetric way.
6
Algorithm-basis vector creation
• Input: an m×n term-document matrix D (terms × documents), scaling factor q
• Output: basis vectors b_1, b_2, ...

R ← D
For (i = 1; until reaching some criterion; i = i+1)
    R_s ← [ |r_1|^q r_1, ..., |r_n|^q r_n ]   (r_j: the j-th residual column of R, scaled by its length to the power q)
    b_i ← the first unit eigenvector of R_s R_s^T   (an m×m matrix)
    R ← R − b_i b_i^T R
End for

(The first unit eigenvector can be obtained from the SVD: if A = U Σ V^T, then A A^T = U Σ² U^T, so the first column of U is the first unit eigenvector of A A^T.)
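The loop above can be sketched in NumPy. This is a minimal sketch, assuming a fixed number of basis vectors as the stopping criterion; `create_basis` is a hypothetical helper name, not the paper's code.

```python
import numpy as np

def create_basis(D, q, num_vectors):
    """Iterative rescaling sketch: returns [b_1 ... b_k] as columns.
    (Hypothetical helper; the paper's stopping criterion may differ.)"""
    R = D.astype(float)                       # residual matrix, starts as D
    basis = []
    for _ in range(num_vectors):
        lengths = np.linalg.norm(R, axis=0)   # |r_j| for each residual column
        Rs = R * lengths**q                   # scale r_j by |r_j|^q
        # b_i = first unit eigenvector of Rs Rs^T = first left singular vector.
        b = np.linalg.svd(Rs, full_matrices=False)[0][:, 0]
        basis.append(b)
        R = R - np.outer(b, b) @ R            # remove the captured direction
    return np.column_stack(basis)

# Example: the worked example's 4x5 matrix, with q = 0 (no scaling),
# in which case the loop reproduces the leading SVD directions one at a time.
D = np.array([[2., 0., 0., 1., 0.],
              [1., 2., 0., 1., 0.],
              [1., 1., 0., 2., 1.],
              [0., 0., 1., 1., 2.]])
B = create_basis(D, q=0, num_vectors=2)
```

With q > 0, longer residual columns (documents poorly represented so far) get extra weight, which is what keeps outlier documents from being lost.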
7
Algorithm-basis vector creation
R ← R − b_i b_i^T R

Shapes: R is m×n; b_i is m×1 and b_i^T is 1×m, so b_i b_i^T R is also m×n. Subtracting this rank-one component removes the direction b_i from every residual column while keeping R the same m×n shape.
8
Algorithm-document vector creation
• Dimension reduction:

d̂_i = [b_1, ..., b_k]^T d_i

Stacking all documents, the k×n matrix of reduced document vectors is B_k^T D, where B_k = [b_1, ..., b_k] is m×k and D is m×n.

There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
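For example, reducing one document vector against a toy orthonormal basis (the matrix values here are illustrative only):

```python
import numpy as np

# d_hat = [b_1, ..., b_k]^T d: project an m-dim document vector onto k basis vectors.
B_k = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])   # toy basis with m = 3, k = 2
d = np.array([3.0, 4.0, 5.0])  # toy document vector
d_hat = B_k.T @ d              # k-dimensional reduced vector
```

Inter-document similarity is then measured between the reduced vectors d̂_i rather than the original m-dimensional ones.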
9
example
D =
[ 2  0  0  1  0
  1  2  0  1  0
  1  1  0  2  1
  0  0  1  1  2 ]

Find the eigenvectors of D D^T (the scaling step is omitted here, i.e. the columns are used unscaled). The unit eigenvectors, by decreasing eigenvalue (15.68294, 5.45757, 2.47192, 0.38757), are:

b_1 = ( 0.422265,  0.507251,   0.655611,   0.366827 )^T
      ( 0.328195,  0.430733,  -0.0761991, -0.837228 )^T
      ( 0.801517, -0.594321,  -0.0644068,  0.0142943 )^T
      ( 0.26749,   0.451605,  -0.748478,   0.405318 )^T

b_1 b_1^T =
[ 0.178308  0.214194  0.276842  0.154898
  0.214194  0.257304  0.332559  0.186073
  0.276842  0.332559  0.429826  0.240496
  0.154898  0.186073  0.240496  0.134562 ]

b_i b_i^T R = b_1 b_1^T D =
[ 0.847652  0.70523   0.154898  1.10108   0.586638
  1.01825   0.847167  0.186073  1.32269   0.704705
  1.31607   1.09494   0.240496  1.70955   0.910818
  0.736365  0.612642  0.134562  0.956525  0.50962  ]
10
example
Subtracting the rank-one component gives the new residual matrix:

R ← D − b_1 b_1^T D =
[  1.15235   -0.70523   -0.154898  -0.10108   -0.586638
  -0.01825    1.15283   -0.186073  -0.32269   -0.704705
  -0.31607   -0.09494   -0.240496   0.29045    0.089182
  -0.736365  -0.612642   0.865438   0.04345    1.49038  ]

Find its eigenvectors. Since R R^T = D D^T − λ_1 b_1 b_1^T, its eigenvectors are the same as those of D D^T (with b_1's eigenvalue dropping to 0). The nonzero eigenvalues are 5.45757, 2.47192, and 0.38757, with unit eigenvectors

      ( 0.328195,  0.430733,  -0.0761991, -0.837228 )^T
      ( 0.801517, -0.594321,  -0.0644068,  0.0142943 )^T
      ( 0.26749,   0.451605,  -0.748478,   0.405318 )^T

so the next basis vector b_2 is the first of these (eigenvalue 5.45757).
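This worked example can be checked numerically; the script below recomputes b_1, the residual R, and the eigenvalues of R R^T.

```python
import numpy as np

# The example's 4x5 term-document matrix.
D = np.array([[2., 0., 0., 1., 0.],
              [1., 2., 0., 1., 0.],
              [1., 1., 0., 2., 1.],
              [0., 0., 1., 1., 2.]])
U = np.linalg.svd(D)[0]
b1 = U[:, 0]                          # first unit eigenvector of D D^T
R = D - np.outer(b1, b1) @ D          # residual after removing b1
evals = np.linalg.eigvalsh(R @ R.T)   # ascending; smallest is ~0 (the b1 direction)
```

The three nonzero eigenvalues of R R^T come out to 5.45757, 2.47192, and 0.38757, matching the slide.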
11
Probabilistic model
• Basis vectors: B_k = {b_1, ..., b_k}
• The document vectors are assumed to follow a Gaussian-style distribution over the reduced space:

p(d | B_k) = exp( Σ_{j=1}^k (b_j^T d)² ) / Z(B_k)

where the normalizing constant is

Z(B_k) = ∫ exp( Σ_{j=1}^k (b_j^T x)² ) dx

• Compare with the multivariate normal (MVN) distribution:

f(x) = exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ) / ( (2π)^{p/2} |Σ|^{1/2} )
12
Probabilistic model
• The log likelihood for the document vectors reduced to dimension k is computed as (Ding):

l_k = log Π_{i=1}^n p(d_i | B_k)
    = Σ_{i=1}^n Σ_{j=1}^k (b_j^T d_i)² − n log Z(B_k)

Since p(d | B_k) = exp( ||B_k^T d||² ) / Z(B_k), the first term can be written as

Σ_{i=1}^n ||B_k^T d_i||² = tr( B_k^T D D^T B_k )

The n log Z(B_k) term is negligible because it changes slowly; maximize the first term.
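The identity behind the dominant term, Σ_i ||B_k^T d_i||² = tr(B_k^T D D^T B_k), can be spot-checked with arbitrary toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))                      # toy 5x8 term-document matrix
B_k = np.linalg.qr(rng.standard_normal((5, 3)))[0]   # toy orthonormal m x k basis
# sum over documents of the squared length of the reduced vector ...
direct = sum(np.sum((B_k.T @ D[:, i])**2) for i in range(D.shape[1]))
# ... equals the trace form used in the log-likelihood.
trace_form = np.trace(B_k.T @ D @ D.T @ B_k)
```

The trace form is convenient because it avoids looping over documents.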
13
parameter
• q:
set from 1 to 10, in increments of 1.
• k:
selection of the dimension by the log likelihood

l_k = Σ_{i=1}^n Σ_{j=1}^k (b_j^T d_i)² − n log Z(B_k)

choosing the k that maximizes

f(k) = Σ_{i=k−n·c/2}^{k+n·c/2} ( l̂_k − l̂_i ) / (n·c),  with c = 0.25

i.e., the amount by which the (normalized) log likelihood l̂_k exceeds its average over a window of width n·c around k.
14
experiment
• Test data:
– TREC collections
– 20 topics
– Total number of documents is 684
• The documents are split into two disjoint pools:
– pool 1: 15 document sets (test data)
– pool 2: 15 document sets (training data)
– Each set ranges from 31 to 126 documents; the number of topics per set ranges from 6 to 20.
15
Baseline algorithm
• Three algorithms:
– SVD, taking the left singular vectors as the basis vectors
– Term-document matrix without any basis conversion (raw term frequency)
– This paper's algorithm
16
evaluation
• Assumption:
– Similarity should be higher for any document pair relevant to the same topic (an intra-topic pair).
• Sort all document pairs p_1, p_2, ... by decreasing similarity; at each intra-topic pair p_j, measure

precision(p_j) = (# of intra-topic pairs p_k with k ≤ j) / j
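A sketch of this measure on made-up data (the similarity scores and intra-topic flags below are toy values): sort the pairs by decreasing similarity and evaluate the precision at each intra-topic pair.

```python
# (similarity, is_intra_topic) for six hypothetical document pairs
pairs = [(0.9, True), (0.8, True), (0.7, False),
         (0.6, True), (0.3, False), (0.1, False)]
pairs.sort(key=lambda p: -p[0])             # decreasing similarity
precisions = []
intra_seen = 0
for j, (_, intra) in enumerate(pairs, start=1):
    if intra:
        intra_seen += 1
        precisions.append(intra_seen / j)   # precision(p_j) at rank j
```

A better similarity measure pushes intra-topic pairs toward the top of the ranking, yielding higher precision values.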
17
evaluation
• Preservation rate (of document length):

preservation rate = f² / n

where f is the F-norm of the matrix Â of the reduced document vectors, f² = Σ_{i,j} Â[i,j]², and n is the number of documents.
• Reduction rate (the larger, the better):
1 − preservation rate
• Dimensional reduction rate (the larger, the better):
1 − (# of dimensions / max # of dimensions)
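For a toy matrix of reduced document vectors (illustrative values; the original document vectors are assumed unit length), the rates work out as:

```python
import numpy as np

A_hat = np.array([[0.6, 1.0, 0.0],
                  [0.8, 0.0, 0.5]])   # toy 2x3 matrix of reduced document vectors
n = A_hat.shape[1]                    # number of documents
f2 = np.sum(A_hat**2)                 # squared F-norm
preservation_rate = f2 / n
reduction_rate = 1.0 - preservation_rate
```

Here f² = 2.25 over n = 3 documents gives a preservation rate of 0.75 and a reduction rate of 0.25.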
18
Selection dimension
• Log-likelihood method: choose the k maximizing

f(k) = Σ_{i=k−n·c/2}^{k+n·c/2} ( l̂_k − l̂_i ) / (n·c),  with c = 0.25

• Training-based method: choose the dimension that makes the preservation rate closest to the average preservation rate of the training data.
• Random guess-based method.
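One possible reading of the log-likelihood method can be sketched as follows; `select_dimension`, the window handling at the array edges, and the likelihood curve are all assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def select_dimension(l_hat, c=0.25):
    """Pick the 1-based k maximizing f(k): how much l_hat[k] exceeds the
    average of l_hat over a window of width ~n*c around k (assumed form)."""
    n = len(l_hat)
    half = max(1, int(n * c / 2))
    best_k, best_f = 1, -np.inf
    for k in range(n):
        window = l_hat[max(0, k - half):min(n, k + half + 1)]
        f = l_hat[k] - np.mean(window)
        if f > best_f:
            best_k, best_f = k + 1, f
    return best_k

# Made-up log-likelihood curve with a clear elbow at k = 2.
l_hat = np.array([0.0, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6])
k_star = select_dimension(l_hat)
```

The intuition: once the log likelihood plateaus, adding dimensions no longer stands out against the local average, so f(k) peaks at the elbow.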
19
result
Precision of similarity measurement improved by up to 17.8%.
20
result
The dimensional reduction rate is 43% higher than SVD on average; this algorithm also shows a 35.8% higher reduction rate than SVD.
21
result
22
conclusion
• This algorithm achieved higher precision of similarity measurement (up to 17.8%) with a higher reduction rate (43% higher) than the baseline algorithms.
• The scaling factor q could be made dynamic to further improve performance.