
Page 1: Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement

Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In the 23rd Annual International ACM SIGIR Conference (SIGIR'2000), 2000.

Presenter: 游斯涵

Page 2: Introduction

• Some studies apply modified or generalized SVD (to improve the precision of similarities):
– SDD (semi-discrete decomposition)
• T. G. Kolda and D. P. O'Leary
• Proposed to reduce the storage and computational costs of LSI.
– R-SVD (Riemannian SVD)
• E. P. Jiang and M. W. Berry
• User feedback can be integrated into LSI models.
• Some studies address the theory of LSI:
– MDS (multidimensional scaling), Bayesian regression models, probabilistic models.

Page 3: Introduction

• The problem with SVD:
– The topics underlying outlier documents tend to be lost as we choose a lower number of dimensions.
– The information removed by dimensionality reduction comes from two sources: outlier documents and minor terms.
• The idea of this paper:
– Do not treat outlier documents as "noise"; all documents are assumed to be equally important.
– Try to eliminate the noise from the minor terms without eliminating the influence of the outlier documents.

Outlier documents: documents very different from all the other documents.

Page 4: Introduction

[Figure-only slide.]

Page 5: Compare with SVD

• Same:
– Both try to find a smaller set of basis vectors for a reduced space.
• Different:
– This algorithm scales the length of each residual vector.
– It treats documents and terms in a non-symmetrical way.

Page 6: Algorithm: basis vector creation

• Input: term-document matrix D (m terms x n documents), scaling factor q
• Output: basis vectors b_1, b_2, ...

R := D
For (i = 1; until reaching some criterion; i = i + 1)
    R_s := [ |r_1|^q r_1, ..., |r_n|^q r_n ]    (scale each residual column r_j by its length raised to the power q)
    b_i := the first unit eigenvector of R_s R_s^T    (an m x m matrix)
    R := R - b_i b_i^T R
End for

(Compare SVD: A = UΣV^T gives AA^T = UΣ^2 U^T, so the left singular vectors of A are the unit eigenvectors of AA^T.)
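A minimal NumPy sketch of this loop. The function name irr_basis, the fixed num_vectors stopping criterion, and the eigh-based eigenvector step are my assumptions for illustration, not the paper's code:

    import numpy as np

    def irr_basis(D, q, num_vectors):
        """Sketch of iterative rescaling: build basis vectors from the
        term-document matrix D (m terms x n documents), scaling factor q."""
        R = D.astype(float).copy()      # residual matrix, initially R = D
        basis = []
        for _ in range(num_vectors):    # "until reaching some criterion"
            # scale each residual column r_j by its length to the power q
            lengths = np.linalg.norm(R, axis=0)
            Rs = R * lengths**q         # R_s = [|r_1|^q r_1, ..., |r_n|^q r_n]
            # b_i = first unit eigenvector of R_s R_s^T
            eigvals, eigvecs = np.linalg.eigh(Rs @ Rs.T)
            b = eigvecs[:, -1]          # eigh returns ascending order
            basis.append(b)
            # remove the captured direction: R <- R - b b^T R
            R = R - np.outer(b, b) @ R
        return np.column_stack(basis)   # m x k matrix B_k = [b_1, ..., b_k]

With q = 0 every residual column gets weight 1 and the loop reduces to SVD-style deflation; larger q gives long (outlier-dominated) residuals more influence on each basis vector.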

Page 7: Algorithm: basis vector creation

[Diagram: dimensions in the update R := R - b_i b_i^T R. Since b_i is m x 1 and R is m x n, b_i (b_i^T R) is m x n, the same shape as R.]

Page 8: Algorithm: document vector creation

• Dimension reduction:

d̂_i = [b_1, ..., b_k]^T d_i

([b_1, ..., b_k]^T is k x m and d_i is m x 1, so each reduced document vector d̂_i is k x 1; stacking all documents gives a k x n matrix.)

There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
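Continuing the sketch above, document vector creation is a single projection. The toy matrix and the values of k and q below are arbitrary assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.random((100, 30))               # toy term-document matrix: m=100 terms, n=30 docs
    D /= np.linalg.norm(D, axis=0)          # normalize documents to unit length

    k, q = 5, 2                             # the two key variables: dimensions and scaling factor
    B_k = irr_basis(D, q=q, num_vectors=k)  # m x k basis from the sketch on page 6
    D_hat = B_k.T @ D                       # k x n: column i is d_hat_i = [b_1,...,b_k]^T d_i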

Page 9: Example

Term-document matrix A (4 terms x 5 documents):

A = [ 2 1 1 0 0
      1 2 0 1 1
      0 1 0 2 1
      0 1 0 0 2 ]

Find the eigenvectors of AA^T (one per line, ordered by eigenvalue):

λ = 15.6829:  ( 0.366827,  0.655611,  0.507251,  0.422265)
λ =  5.45757: ( 0.837228,  0.0761991, -0.430733, -0.328195)
λ =  2.47192: (-0.0142943, 0.0644068,  0.594321, -0.801517)
λ =  0.38757: ( 0.405318, -0.748478,   0.451605,  0.26749)

The first unit eigenvector is b_1 = (0.366827, 0.655611, 0.507251, 0.422265)^T, so

b_1 b_1^T = [ 0.134562 0.240496 0.186073 0.154898
              0.240496 0.429826 0.332559 0.276842
              0.186073 0.332559 0.257304 0.214194
              0.154898 0.276842 0.214194 0.178308 ]

b_1 b_1^T R (with R = A on the first iteration) =

[ 0.50962  0.956525 0.134562 0.612642 0.736365
  0.910818 1.70955  0.240496 1.09494  1.31607
  0.704705 1.32269  0.186073 0.847167 1.01825
  0.586638 1.10108  0.154898 0.70523  0.847652 ]

Page 10: Example

R := R - b_1 b_1^T R = A - b_1 b_1^T A =

[  1.49038   0.043475  0.865438 -0.612642 -0.736365
   0.089182  0.29045  -0.240496 -0.09494  -0.31607
  -0.704705 -0.32269  -0.186073  1.152833 -0.01825
  -0.586638 -0.10108  -0.154898 -0.70523   1.152348 ]

Find its eigenvectors: the nonzero eigenvalues of the new RR^T are 5.45757, 2.47192, and 0.38757, and the corresponding eigenvectors are exactly the remaining eigenvectors of AA^T:

λ = 5.45757: ( 0.837228,  0.0761991, -0.430733, -0.328195)
λ = 2.47192: (-0.0142943, 0.0644068,  0.594321, -0.801517)
λ = 0.38757: ( 0.405318, -0.748478,   0.451605,  0.26749)
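The worked example can be checked in a few lines of NumPy. Note the slides apply the eigenvector and deflation steps to A directly, without the residual rescaling (effectively q = 0); signs of eigenvectors are arbitrary:

    import numpy as np

    A = np.array([[2., 1., 1., 0., 0.],
                  [1., 2., 0., 1., 1.],
                  [0., 1., 0., 2., 1.],
                  [0., 1., 0., 0., 2.]])

    vals, vecs = np.linalg.eigh(A @ A.T)   # eigenpairs of AA^T, ascending
    b1 = vecs[:, -1]                       # first unit eigenvector (largest eigenvalue)
    print(np.abs(b1))                      # ~ [0.366827 0.655611 0.507251 0.422265]

    R = A - np.outer(b1, b1) @ A           # deflation step: R = A - b1 b1^T A
    print(np.linalg.eigvalsh(R @ R.T))     # ~ [0. 0.38757 2.47192 5.45757]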

Page 11: Probabilistic model

• Basis vectors: B_k = {b_1, ..., b_k}
• Document vectors are assumed to follow a Gaussian-style distribution determined by B_k:

p(d | B_k) = exp( Σ_{j=1}^{k} (d^T b_j)^2 ) / Z(B_k)

Z(B_k) = ∫ exp( Σ_{j=1}^{k} (x^T b_j)^2 ) dx

• Compare the multivariate normal (MVN) density:

f(x) = exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) ) / ( (2π)^{p/2} |Σ|^{1/2} )
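As a sanity check on the MVN density (the standard formula, not anything specific to the paper), the closed form matches SciPy's implementation; mu, Sigma, and x below are arbitrary:

    import numpy as np
    from scipy.stats import multivariate_normal

    p = 3
    mu, Sigma = np.zeros(p), np.eye(p)     # arbitrary parameters
    x = np.array([0.5, -0.2, 0.1])

    diff = x - mu
    f = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
        / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

    print(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # identical values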

Page 12: Probabilistic model

• The log-likelihood of the document vectors reduced to dimension k is computed as (Ding):

l_k = log Π_{i=1}^{n} p(d_i | B_k) = Σ_{i=1}^{n} Σ_{j=1}^{k} (d_i^T b_j)^2 - n log Z(B_k)

Derivation:

p(d_i | B_k) = exp( Σ_{j=1}^{k} (d_i^T b_j)^2 ) / Z(B_k)

log p(d_i | B_k) = Σ_{j=1}^{k} (d_i^T b_j)^2 - log Z(B_k) = (B_k^T d_i)^T (B_k^T d_i) - log Z(B_k)

Σ_{i=1}^{n} (B_k^T d_i)^T (B_k^T d_i) = tr( B_k^T D D^T B_k )

so

l_k = tr( B_k^T D D^T B_k ) - n log Z(B_k)

Maximize the first term; the n log Z(B_k) term is negligible because it changes slowly.
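Dropping the slowly-changing n log Z(B_k) term, the quantity to maximize is just a trace. A small sketch (function name mine):

    import numpy as np

    def log_likelihood_term(D, B_k):
        # sum_i sum_j (d_i^T b_j)^2 = trace(B_k^T D D^T B_k),
        # i.e. l_k up to the n*log Z(B_k) term treated as negligible
        P = B_k.T @ D                  # k x n projections, entry (j, i) = b_j^T d_i
        return float(np.sum(P ** 2))   # squared Frobenius norm = the trace above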

Page 13: Parameters

• q: set from 1 to 10, in increments of 1.
• k: dimension selected by log-likelihood:

k* = argmax_{1 ≤ k ≤ nc/2} f(k),  f(k) = l_k - ( l_1 + ... + l_{nc/2} ) / (nc/2),  c = 0.25

where l_k is the estimated log-likelihood from the previous page:

l_k = Σ_{i=1}^{n} Σ_{j=1}^{k} (d_i^T b_j)^2 - n log Z(B_k)

That is, choose the dimension whose log-likelihood stands out most against the average over the candidate range.
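Since f(k) subtracts the same average from every candidate, selection reduces to picking the best-scoring dimension. A sketch (names mine), with the l_k values assumed precomputed:

    import numpy as np

    def select_dimension(l_hat):
        # l_hat[j] = estimated log-likelihood for dimension j+1, j = 0 .. nc/2 - 1
        f = l_hat - np.mean(l_hat)     # f(k) = l_k - average over candidates
        return int(np.argmax(f)) + 1   # the mean shift does not change the argmax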

Page 14: Experiment

• Test data:
– TREC collections
– 20 topics
– Total number of documents is 684
• The documents are split into two disjoint pools:
– pool1: 15 document sets (test data)
– pool2: 15 document sets (training data)
– Each set ranges from 31 to 126 documents, and the number of topics per set ranges from 6 to 20.

Page 15: Baseline algorithms

• Three algorithms are compared:
– SVD, taking the left singular vectors as the basis vectors
– The term-document matrix without any basis conversion (raw term frequency)
– This paper's algorithm

Page 16: Evaluation

• Assumption:
– Similarity should be higher for any document pair relevant to the same topic (an intra-topic pair).

Sort all document pairs by similarity. For the intra-topic pair p_j ranked at position j:

precision(p_j) = (# of intra-topic pairs p_k with k ≤ j) / j
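A direct implementation of this measure (names mine; cosine similarity of the reduced vectors is assumed for the pair ranking):

    import numpy as np
    from itertools import combinations

    def intra_topic_precision(D_hat, topics):
        # rank all document pairs by cosine similarity of reduced vectors
        X = D_hat / np.linalg.norm(D_hat, axis=0)
        sims = X.T @ X
        pairs = sorted(combinations(range(len(topics)), 2),
                       key=lambda ij: -sims[ij])
        # precision(p_j) = (# intra-topic pairs p_k with k <= j) / j
        precisions, intra = [], 0
        for j, (a, b) in enumerate(pairs, start=1):
            if topics[a] == topics[b]:
                intra += 1
                precisions.append(intra / j)
        return precisions              # often summarized by the average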

Page 17: Evaluation

• Preservation rate (of document length):

preservation rate = f^2 / n

where f is the Frobenius norm of the matrix Â of the reduced document vectors: f^2 = Σ_{i,j} Â[i,j]^2.

• Reduction rate (the larger, the better):
1 - preservation rate
• Dimensional reduction rate (the larger, the better):
1 - ( # of dimensions / max # of dimensions )
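All three rates in a few lines (assuming, as in the setup above, that the original document vectors have unit length, so f^2/n measures how much of their length survives the projection; names mine):

    import numpy as np

    def rates(D_hat, k, k_max):
        n = D_hat.shape[1]
        preservation = np.linalg.norm(D_hat, 'fro') ** 2 / n  # f^2 / n
        reduction = 1.0 - preservation                        # larger is better
        dim_reduction = 1.0 - k / k_max                       # larger is better
        return preservation, reduction, dim_reduction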

Page 18: Dimension selection

• Log-likelihood method:

k* = argmax_{1 ≤ k ≤ nc/2} f(k),  f(k) = l_k - ( l_1 + ... + l_{nc/2} ) / (nc/2),  c = 0.25

• Training-based method:
– Choose the dimension that makes the preservation rate closest to the average preservation rate observed on the training data.
• Random guess-based method.
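A sketch of the training-based method (function and argument names are mine; target is the average preservation rate measured on the training pool):

    import numpy as np

    def training_based_k(preservation_by_k, target):
        # preservation_by_k[j] = preservation rate at dimension j+1
        gaps = np.abs(np.asarray(preservation_by_k) - target)
        return int(np.argmin(gaps)) + 1   # dimension closest to the training average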

Page 19: Result

[Figure: precision of similarity measurement. The proposed algorithm improves precision by up to 17.8% over the baselines.]

Page 20: Result

[Figure: reduction rates vs. SVD.]

• The dimensional reduction rate is 43% higher than SVD's on average.
• This algorithm shows a 35.8% higher reduction rate than SVD.

Page 21: Result

[Figure-only slide.]

22

conclusion

• This algorithm achieved higher precision (up 17.8%) of similarity measurement with higher reduction rate (43% higher) than the baseline algorithm.

• Scaling factor can become dynamic to improve the performance.

q