1
Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement
Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In Proceedings of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000.
Presenter: 游斯涵
2
Introduction
• Some studies apply modified or generalized SVD (improving the precision of similarities):
– SDD (semi-discrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
– R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: user feedback can be integrated into LSI models.
• Other studies give theoretical treatments of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.
3
Introduction
• The problem with SVD:
– The topics underlying outlier documents tend to be lost as we choose a lower number of dimensions.
• Dimensionality reduction removes information from two sources:
– outlier documents
– minor terms
• The idea of this paper:
– Do not treat outlier documents as "noise"; all documents are assumed to be equally important.
– Try to eliminate the noise from the minor terms without eliminating the influence of the outlier documents.
Outlier documents: documents very different from the other documents.
4
Introduction
5
Compare with SVD
• Same:
– Both try to find a smaller set of basis vectors for a reduced space.
• Different:
– The new algorithm scales the length of each residual vector.
– It treats documents and terms in a nonsymmetric way.
6
Algorithm-basis vector creation
• Input: an m×n term-document matrix D (terms × documents), scaling factor q
• Output: basis vectors b_1, b_2, ...

R ← D
For (i = 1; until reaching some criterion; i = i+1)
    R_s ← [ |r_1|^q r_1, ..., |r_n|^q r_n ]   (r_j: the j-th residual column of R, scaled by its length to the power q)
    b_i ← the first unit eigenvector of R_s R_s^T   (an m×m matrix)
    R ← R − b_i b_i^T R
End for

(The first unit eigenvector can be obtained from the SVD: if A = U Σ V^T, then A A^T = U Σ² U^T, so the first column of U is the first unit eigenvector of A A^T.)
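The loop above can be sketched in NumPy. This is a minimal sketch, assuming a fixed number of basis vectors as the stopping criterion; `create_basis` is a hypothetical helper name, not the paper's code.

```python
import numpy as np

def create_basis(D, q, num_vectors):
    """Iterative rescaling sketch: returns [b_1 ... b_k] as columns.
    (Hypothetical helper; the paper's stopping criterion may differ.)"""
    R = D.astype(float)                       # residual matrix, starts as D
    basis = []
    for _ in range(num_vectors):
        lengths = np.linalg.norm(R, axis=0)   # |r_j| for each residual column
        Rs = R * lengths**q                   # scale r_j by |r_j|^q
        # b_i = first unit eigenvector of Rs Rs^T = first left singular vector.
        b = np.linalg.svd(Rs, full_matrices=False)[0][:, 0]
        basis.append(b)
        R = R - np.outer(b, b) @ R            # remove the captured direction
    return np.column_stack(basis)

# Example: the worked example's 4x5 matrix, with q = 0 (no scaling),
# in which case the loop reproduces the leading SVD directions one at a time.
D = np.array([[2., 0., 0., 1., 0.],
              [1., 2., 0., 1., 0.],
              [1., 1., 0., 2., 1.],
              [0., 0., 1., 1., 2.]])
B = create_basis(D, q=0, num_vectors=2)
```

With q > 0, longer residual columns (documents poorly represented so far) get extra weight, which is what keeps outlier documents from being lost.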
7
Algorithm-basis vector creation
R ← R − b_i b_i^T R

Shapes: R is m×n; b_i is m×1 and b_i^T is 1×m, so b_i b_i^T R is also m×n. Subtracting this rank-one component removes the direction b_i from every residual column while keeping R the same m×n shape.
8
Algorithm-document vector creation
• Dimension reduction:

d̂_i = [b_1, ..., b_k]^T d_i

Stacking all documents, the k×n matrix of reduced document vectors is B_k^T D, where B_k = [b_1, ..., b_k] is m×k and D is m×n.

There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
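For example, reducing one document vector against a toy orthonormal basis (the matrix values here are illustrative only):

```python
import numpy as np

# d_hat = [b_1, ..., b_k]^T d: project an m-dim document vector onto k basis vectors.
B_k = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])   # toy basis with m = 3, k = 2
d = np.array([3.0, 4.0, 5.0])  # toy document vector
d_hat = B_k.T @ d              # k-dimensional reduced vector
```

Inter-document similarity is then measured between the reduced vectors d̂_i rather than the original m-dimensional ones.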
9
example
D =
[ 2  0  0  1  0
  1  2  0  1  0
  1  1  0  2  1
  0  0  1  1  2 ]

Find the eigenvectors of D D^T (the scaling step is omitted here, i.e. the columns are used unscaled). The unit eigenvectors, by decreasing eigenvalue (15.68294, 5.45757, 2.47192, 0.38757), are:

b_1 = ( 0.422265,  0.507251,   0.655611,   0.366827 )^T
      ( 0.328195,  0.430733,  -0.0761991, -0.837228 )^T
      ( 0.801517, -0.594321,  -0.0644068,  0.0142943 )^T
      ( 0.26749,   0.451605,  -0.748478,   0.405318 )^T

b_1 b_1^T =
[ 0.178308  0.214194  0.276842  0.154898
  0.214194  0.257304  0.332559  0.186073
  0.276842  0.332559  0.429826  0.240496
  0.154898  0.186073  0.240496  0.134562 ]

b_i b_i^T R = b_1 b_1^T D =
[ 0.847652  0.70523   0.154898  1.10108   0.586638
  1.01825   0.847167  0.186073  1.32269   0.704705
  1.31607   1.09494   0.240496  1.70955   0.910818
  0.736365  0.612642  0.134562  0.956525  0.50962  ]
10
example
Subtracting the rank-one component gives the new residual matrix:

R ← D − b_1 b_1^T D =
[  1.15235   -0.70523   -0.154898  -0.10108   -0.586638
  -0.01825    1.15283   -0.186073  -0.32269   -0.704705
  -0.31607   -0.09494   -0.240496   0.29045    0.089182
  -0.736365  -0.612642   0.865438   0.04345    1.49038  ]

Find its eigenvectors. Since R R^T = D D^T − λ_1 b_1 b_1^T, its eigenvectors are the same as those of D D^T (with b_1's eigenvalue dropping to 0). The nonzero eigenvalues are 5.45757, 2.47192, and 0.38757, with unit eigenvectors

      ( 0.328195,  0.430733,  -0.0761991, -0.837228 )^T
      ( 0.801517, -0.594321,  -0.0644068,  0.0142943 )^T
      ( 0.26749,   0.451605,  -0.748478,   0.405318 )^T

so the next basis vector b_2 is the first of these (eigenvalue 5.45757).
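This worked example can be checked numerically; the script below recomputes b_1, the residual R, and the eigenvalues of R R^T.

```python
import numpy as np

# The example's 4x5 term-document matrix.
D = np.array([[2., 0., 0., 1., 0.],
              [1., 2., 0., 1., 0.],
              [1., 1., 0., 2., 1.],
              [0., 0., 1., 1., 2.]])
U = np.linalg.svd(D)[0]
b1 = U[:, 0]                          # first unit eigenvector of D D^T
R = D - np.outer(b1, b1) @ D          # residual after removing b1
evals = np.linalg.eigvalsh(R @ R.T)   # ascending; smallest is ~0 (the b1 direction)
```

The three nonzero eigenvalues of R R^T come out to 5.45757, 2.47192, and 0.38757, matching the slide.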
11
Probabilistic model
• Basis vectors: B_k = {b_1, ..., b_k}
• The document vectors are assumed to follow a Gaussian-style distribution over the reduced space:

p(d | B_k) = exp( Σ_{j=1}^k (b_j^T d)² ) / Z(B_k)

where the normalizing constant is

Z(B_k) = ∫ exp( Σ_{j=1}^k (b_j^T x)² ) dx

• Compare with the multivariate normal (MVN) distribution:

f(x) = exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ) / ( (2π)^{p/2} |Σ|^{1/2} )
12
Probabilistic model
• The log likelihood for the document vectors reduced to dimension k is computed as (Ding):

l_k = log Π_{i=1}^n p(d_i | B_k)
    = Σ_{i=1}^n Σ_{j=1}^k (b_j^T d_i)² − n log Z(B_k)

Since p(d | B_k) = exp( ||B_k^T d||² ) / Z(B_k), the first term can be written as

Σ_{i=1}^n ||B_k^T d_i||² = tr( B_k^T D D^T B_k )

The n log Z(B_k) term is negligible because it changes slowly; maximize the first term.
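The identity behind the dominant term, Σ_i ||B_k^T d_i||² = tr(B_k^T D D^T B_k), can be spot-checked with arbitrary toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))                      # toy 5x8 term-document matrix
B_k = np.linalg.qr(rng.standard_normal((5, 3)))[0]   # toy orthonormal m x k basis
# sum over documents of the squared length of the reduced vector ...
direct = sum(np.sum((B_k.T @ D[:, i])**2) for i in range(D.shape[1]))
# ... equals the trace form used in the log-likelihood.
trace_form = np.trace(B_k.T @ D @ D.T @ B_k)
```

The trace form is convenient because it avoids looping over documents.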
13
parameter
• q:
set from 1 to 10, in increments of 1.
• k:
selection of the dimension by the log likelihood

l_k = Σ_{i=1}^n Σ_{j=1}^k (b_j^T d_i)² − n log Z(B_k)

choosing the k that maximizes

f(k) = Σ_{i=k−n·c/2}^{k+n·c/2} ( l̂_k − l̂_i ) / (n·c),  with c = 0.25

i.e., the amount by which the (normalized) log likelihood l̂_k exceeds its average over a window of width n·c around k.
14
experiment
• Test data:
– TREC collections
– 20 topics
– Total number of documents is 684
• The documents are split into two disjoint pools:
– pool 1: 15 document sets (test data)
– pool 2: 15 document sets (training data)
– Each set ranges from 31 to 126 documents; the number of topics per set ranges from 6 to 20.
15
Baseline algorithm
• Three algorithms:
– SVD, taking the left singular vectors as the basis vectors
– Term-document matrix without any basis conversion (raw term frequency)
– This paper's algorithm
16
evaluation
• Assumption:
– Similarity should be higher for any document pair relevant to the same topic (an intra-topic pair).
• Sort all document pairs p_1, p_2, ... by decreasing similarity; at each intra-topic pair p_j, measure

precision(p_j) = (# of intra-topic pairs p_k with k ≤ j) / j
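A sketch of this measure on made-up data (the similarity scores and intra-topic flags below are toy values): sort the pairs by decreasing similarity and evaluate the precision at each intra-topic pair.

```python
# (similarity, is_intra_topic) for six hypothetical document pairs
pairs = [(0.9, True), (0.8, True), (0.7, False),
         (0.6, True), (0.3, False), (0.1, False)]
pairs.sort(key=lambda p: -p[0])             # decreasing similarity
precisions = []
intra_seen = 0
for j, (_, intra) in enumerate(pairs, start=1):
    if intra:
        intra_seen += 1
        precisions.append(intra_seen / j)   # precision(p_j) at rank j
```

A better similarity measure pushes intra-topic pairs toward the top of the ranking, yielding higher precision values.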
17
evaluation
• Preservation rate (of document length):

preservation rate = f² / n

where f is the F-norm of the matrix Â of the reduced document vectors, f² = Σ_{i,j} Â[i,j]², and n is the number of documents.
• Reduction rate (the larger, the better):
1 − preservation rate
• Dimensional reduction rate (the larger, the better):
1 − (# of dimensions / max # of dimensions)
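For a toy matrix of reduced document vectors (illustrative values; the original document vectors are assumed unit length), the rates work out as:

```python
import numpy as np

A_hat = np.array([[0.6, 1.0, 0.0],
                  [0.8, 0.0, 0.5]])   # toy 2x3 matrix of reduced document vectors
n = A_hat.shape[1]                    # number of documents
f2 = np.sum(A_hat**2)                 # squared F-norm
preservation_rate = f2 / n
reduction_rate = 1.0 - preservation_rate
```

Here f² = 2.25 over n = 3 documents gives a preservation rate of 0.75 and a reduction rate of 0.25.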
18
Selection dimension
• Log-likelihood method: choose the k maximizing

f(k) = Σ_{i=k−n·c/2}^{k+n·c/2} ( l̂_k − l̂_i ) / (n·c),  with c = 0.25

• Training-based method: choose the dimension that makes the preservation rate closest to the average preservation rate of the training data.
• Random guess-based method.
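One possible reading of the log-likelihood method can be sketched as follows; `select_dimension`, the window handling at the array edges, and the likelihood curve are all assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def select_dimension(l_hat, c=0.25):
    """Pick the 1-based k maximizing f(k): how much l_hat[k] exceeds the
    average of l_hat over a window of width ~n*c around k (assumed form)."""
    n = len(l_hat)
    half = max(1, int(n * c / 2))
    best_k, best_f = 1, -np.inf
    for k in range(n):
        window = l_hat[max(0, k - half):min(n, k + half + 1)]
        f = l_hat[k] - np.mean(window)
        if f > best_f:
            best_k, best_f = k + 1, f
    return best_k

# Made-up log-likelihood curve with a clear elbow at k = 2.
l_hat = np.array([0.0, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6])
k_star = select_dimension(l_hat)
```

The intuition: once the log likelihood plateaus, adding dimensions no longer stands out against the local average, so f(k) peaks at the elbow.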
19
result
Precision of similarity measurement improved by up to 17.8%.
20
result
The dimensional reduction rate is 43% higher than SVD on average; this algorithm also shows a 35.8% higher reduction rate than SVD.
21
result
22
conclusion
• This algorithm achieved higher precision of similarity measurement (up to 17.8%) with a higher reduction rate (43% higher) than the baseline algorithms.
• The scaling factor q could be made dynamic to further improve performance.