VSM: Vector Space Model
DESCRIPTION
Information Retrieval Systems lecture notes (Prof. Kang Seung-Shik) TRANSCRIPT
Chapter 2
Modeling
http://nlp.kookmin.ac.kr/
Contents
Introduction
A Taxonomy of IR Models
Retrieval: Ad hoc, Filtering
Formal Characterization of IR Models
Classic IR Models
Alternative Set Theoretic Models
Alternative Algebraic Models
Alternative Probabilistic Models
Contents (Cont.)
Structured Text Retrieval Models
Models for Browsing
Trends and Research Issues
2.1 Introduction
Traditional IR System
– Adopt index terms to index and retrieve documents
Index Term
– Restricted sense
  • Keyword which has some meaning of its own (usually a noun)
– General form
  • Any word which appears in the text of a document
Ranking Algorithm
– Attempt to establish a simple ordering of the documents retrieved
– Operate according to basic premises regarding the notion of document relevance
2.2 A Taxonomy of IR Models
[Figure: taxonomy of IR models, organized by user task]
User Task
– Retrieval: Ad hoc, Filtering
  • Classic Models: boolean, vector, probabilistic
    – Set Theoretic: Fuzzy, Extended Boolean
    – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
    – Probabilistic: Inference Network, Belief Network
  • Structured Models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext
A Taxonomy of IR Models (Cont.)
Retrieval models
– Most frequently associated with distinct combinations of a document logical view and a user task:

USER TASK vs. Logical View of Documents
– Retrieval
  • Index Terms: Classic, Set theoretic, Algebraic, Probabilistic
  • Full Text: Classic, Set theoretic, Algebraic, Probabilistic
  • Full Text + Structure: Structured
– Browsing
  • Index Terms: Flat
  • Full Text: Flat, Hypertext
  • Full Text + Structure: Structure Guided, Hypertext
2.3 Retrieval
Ad hoc
– The documents in the collection remain relatively static while new queries are submitted to the system
– The most common form of user task
Filtering
– The queries remain relatively static while new documents come into the system (and leave)
– User profile
  • Describes the user's preferences
– Routing: a variation of filtering that ranks the filtered documents
2.4 A Formal Characterization of IR Models
IR Model
– An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where:
  • D: set composed of logical views for the documents
  • Q: set composed of logical views for the user information needs
  • F: framework for modeling documents, queries, and their relationships
  • R(q_i, d_j): ranking function which associates a real number with query q_i and document d_j
– boolean model: sets of documents and operations on sets
– vector model: t-dimensional vector space and linear algebra operations
– probabilistic model: sets, probabilistic operations, and Bayes' theorem
2.5 Classic Information Retrieval
Boolean Model
– Based on set theory and Boolean algebra
– Queries are specified as Boolean expressions
– Model considers that index terms are present or absent in a document
Vector Model
– Partial matching is possible
– Assigns non-binary weights to index terms
– Term weights are used to compute the degree of similarity
Probabilistic Model
– Given a query, the model assigns to each document d_j, as a measure of similarity to the query, P(d_j relevant to q) / P(d_j non-relevant to q), which computes the odds of the document d_j being relevant to the query q
2.5.1 Basic Concepts

Index Term
– Word whose semantics helps in remembering the document's main themes
– Mainly nouns
  • Nouns have meaning by themselves
– Weights
  • All terms are not equally useful for describing the document
– Definition
  • K = {k_1, ..., k_t}: set of index terms
  • d_j = (w_1j, w_2j, ..., w_tj): document
  • w_ij: weight associated with index term k_i of document d_j
  • g_i: function that returns the weight associated with index term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_ij)
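A minimal sketch of these definitions in Python; the index term names and weight values here are hypothetical:

```python
# Hypothetical index term set K = {k_1, ..., k_t} and a document
# vector d_j = (w_1j, ..., w_tj), as defined above.
K = ("parallel", "program", "system")
d_j = (0.5, 0.0, 0.8)

# g_i(d_j) = w_ij: returns the weight of index term k_i in any
# t-dimensional vector (1-based index, matching the notation).
def g(i, vec):
    return vec[i - 1]

print(g(1, d_j))  # 0.5
print(g(3, d_j))  # 0.8
```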
Basic Concepts (Cont.)
Mutual Independence
– Index term weights are usually assumed to be mutually independent
– Knowing the weight wij associated with the pair (ki, dj) tells us nothing about the weight w(i+1)j associated with the pair (ki+1, dj)
– It does simplify the task of computing index term weights and allows for fast ranking computation
2.5.2 Boolean Model
Base
– Simple retrieval model based on set theory and Boolean algebra
– Operations: and, or, not
Advantages
– Clean formalism
– Boolean query expressions have precise semantics
Disadvantages
– Binary decision (no notion of a partial match)
  • Retrieves too few or too many documents
– Users find it difficult to express their query requests in terms of Boolean expressions
Boolean Model (Cont.)
Definition
– d_j = (w_1j, w_2j, ..., w_tj), w_ij ∈ {0, 1}
– sim(d_j, q) = 1 if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc)); 0 otherwise

Example
– q = k_a ∧ (k_b ∨ ¬k_c)
– q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
Boolean Model (Cont.)
– q = parallel ∧ (program ∨ system)  [병렬 ∧ (프로그램 ∨ 시스템)]
– q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,1)

Document–term table (index terms: parallel, program, system, ...) and similarity:

Document  parallel  program  system  ...  similarity
001       1         0        1       ...  1
002       0         0        1       ...  0
003       0         1        1       ...  0
004       1         1        0       ...  1
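The Boolean matching in the example above can be sketched in Python; the term order (parallel, program, system) and the helper names are assumptions:

```python
from itertools import product

# Query from the example above: parallel AND (program OR system).
def query(parallel, program, system):
    return bool(parallel and (program or system))

# q_dnf: the set of binary term vectors that satisfy the query
# (its disjunctive normal form components).
q_dnf = {bits for bits in product((0, 1), repeat=3) if query(*bits)}
# {(1, 1, 1), (1, 1, 0), (1, 0, 1)}

# Binary document vectors from the table above.
docs = {
    "001": (1, 0, 1),
    "002": (0, 0, 1),
    "003": (0, 1, 1),
    "004": (1, 1, 0),
}

# sim(d_j, q) = 1 iff d_j equals some conjunctive component of q_dnf.
sims = {name: int(vec in q_dnf) for name, vec in docs.items()}
print(sims)  # {'001': 1, '002': 0, '003': 0, '004': 1}
```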
2.5.3 Vector model
Motivation
– Binary weights are too limiting
• Assign non-binary weights to index terms
– A framework in which partial matching is possible
• Instead of attempting to predict whether a document is relevant or not
• Rank the documents according to their degree of similarity to the query
Vector model (Cont.)
Definition
– q = (w_1q, w_2q, ..., w_tq), w_iq ≥ 0
– d_j = (w_1j, w_2j, ..., w_tj), w_ij ≥ 0
– sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
             = Σ_{i=1..t} w_ij × w_iq / (sqrt(Σ_{i=1..t} w_ij²) × sqrt(Σ_{i=1..t} w_iq²))
– 0 ≤ sim(d_j, q) ≤ 1 (cosine similarity)
– |q|: does not affect the ranking
– |d_j|: normalization in the space of the documents
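A minimal sketch of the cosine similarity formula above, assuming non-negative weight vectors of equal length (the function name is hypothetical):

```python
import math

# sim(d_j, q) = (d_j . q) / (|d_j| * |q|)
def cosine_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

print(cosine_sim((1.0, 2.0, 0.0), (2.0, 4.0, 0.0)))  # ~1.0: same direction
print(cosine_sim((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # 0.0: no shared terms
```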
Vector model (Cont.)
Clustering Problem
– Intra-cluster similarity
• What are the features which better describe the objects
– Inter-cluster similarity
• What are the features which better distinguish the objects
IR Problem
– Intra-cluster similarity (tf factor)
• Raw frequency of a term ki inside a document dj
– Inter-cluster similarity (idf factor)
• Inverse of the frequency of a term ki among the documents
Vector model (Cont.)

Weighting Scheme
– Term Frequency (tf)
• Measure of how well that term describes the document contents
– Inverse Document Frequency (idf)
• Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
– f_ij = freq_ij / max_l freq_lj
  • freq_ij: raw frequency of term k_i in the document d_j
– idf_i = log(N / n_i)
  • n_i: number of documents in which the index term k_i appears
  • N: total number of documents
Vector model (Cont.)
Best known index term weighting scheme
– Balance tf and idf (tf-idf scheme)
– w_ij = f_ij × idf_i

Query term weighting scheme
– w_iq = (0.5 + 0.5 × f_iq) × idf_i
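These weighting formulas can be sketched directly; the function names are hypothetical, and a base-10 log is assumed (matching the example values on the following slides):

```python
import math

def tf(freq_ij, max_freq_j):
    # f_ij = freq_ij / max_l freq_lj (normalized term frequency)
    return freq_ij / max_freq_j

def idf(N, n_i):
    # idf_i = log(N / n_i), base 10 assumed here
    return math.log10(N / n_i)

def doc_weight(freq_ij, max_freq_j, N, n_i):
    # w_ij = f_ij * idf_i (tf-idf document weight)
    return tf(freq_ij, max_freq_j) * idf(N, n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    # w_iq = (0.5 + 0.5 * f_iq) * idf_i (query term weight)
    return (0.5 + 0.5 * tf(freq_iq, max_freq_q)) * idf(N, n_i)

# e.g., a term at maximum frequency in 1 of 3 documents:
print(round(doc_weight(2, 2, 3, 1), 3))  # 0.477
```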
Vector model (Cont.)
Example (N = 3)
– D1: "Shipment of gold damaged in a fire"
– D2: "Delivery of silver arrived in a silver truck"
– D3: "Shipment of gold arrived in a truck"
– Q: "gold silver truck"
– idf_i = log(N / n_i), log base 10:

Term:  a    arrived  damaged  delivery  fire  gold  in   of   silver  shipment  truck
idf:   0    .176     .477     .477      .477  .176  0    0    .477    .176      .176

– Weights: w_ij = f_ij × idf_i and w_iq = f_iq × idf_i (raw frequencies in this example)
Vector model (Cont.)
Term weights (t1..t11 = a, arrived, damaged, delivery, fire, gold, in, of, silver, shipment, truck):

     t1   t2    t3    t4    t5    t6    t7   t8   t9    t10   t11
D1   0    0     .477  0     .477  .176  0    0    0     .176  0
D2   0    .176  0     .477  0     0     0    0    .954  0     .176
D3   0    .176  0     0     0     .176  0    0    0     .176  .176
Q    0    0     0     0     0     .176  0    0    .477  0     .176

SC(Q, D_j) = Σ_{i=1..t} w_iq × w_ij

SC(Q, D1) = (0)(0) + (0)(0) + (0)(.477) + (0)(0) + (0)(.477) + (.176)(.176)
          + (0)(0) + (0)(0) + (.477)(0) + (0)(.176) + (.176)(0)
          = (.176)² ≈ 0.031
SC(Q, D2) = (.477)(.954) + (.176)(.176) ≈ 0.486
SC(Q, D3) = (.176)(.176) + (.176)(.176) ≈ 0.062

Hence, the ranking would be D2, D3, D1
(Document vectors are not normalized)
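The worked example above can be reproduced end to end; this sketch uses raw term frequencies and a base-10 log, as the example's numbers imply:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
terms = sorted({t for text in docs.values() for t in text.split()})
# n_i: number of documents containing term i; idf_i = log10(N / n_i)
n = {t: sum(t in text.split() for text in docs.values()) for t in terms}
idf = {t: math.log10(N / n[t]) for t in terms}

def weights(text):
    # w_i = freq_i * idf_i (raw frequency, as in the example)
    return {t: text.split().count(t) * idf[t] for t in terms}

dvecs = {name: weights(text) for name, text in docs.items()}
qvec = weights(query)

def sc(q, d):
    # SC(Q, D_j) = sum_i w_iq * w_ij (unnormalized inner product)
    return sum(q[t] * d[t] for t in terms)

scores = {name: round(sc(qvec, dvecs[name]), 3) for name in docs}
print(scores)   # {'D1': 0.031, 'D2': 0.486, 'D3': 0.062}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```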
Vector model (Cont.)
Advantages
– The term-weighting scheme improves retrieval performance
– The partial matching strategy allows retrieval of documents that approximate the query conditions
– The cosine ranking formula sorts the documents according to their degree of similarity to the query
Disadvantages
– Index terms are assumed to be mutually independent
  • The tf-idf scheme does not account for index term dependencies
  • However, in practice, consideration of term dependencies might be a disadvantage