VSM: Vector Space Model
DESCRIPTION
Information Retrieval Systems lecture notes (Prof. Kang Seung-Shik) TRANSCRIPT
Chapter 2
Modeling
http://nlp.kookmin.ac.kr/
Contents
Introduction
A Taxonomy of IR Models
Retrieval: Ad hoc, Filtering
Formal Characterization of IR Models
Classic IR Models
Alternative Set Theoretic Models
Alternative Algebraic Models
Alternative Probabilistic Models
Contents (Cont.)
Structured Text Retrieval Models
Models for Browsing
Trends and Research Issues
2.1 Introduction
Traditional IR System
– Adopt index terms to index and retrieve documents
Index Term
– Restricted sense
  • Keyword which has some meaning of its own (usually a noun)
– General form
  • Any word which appears in the text of a document
Ranking Algorithm
– Attempt to establish a simple ordering of the documents retrieved
– Operate according to basic premises regarding the notion of document relevance
2.2 A Taxonomy of IR Models
[Figure: taxonomy of IR models, organized by user task]
User Task
– Retrieval: Ad hoc, Filtering
  • Classic Models: boolean, vector, probabilistic
    – Set Theoretic: Fuzzy, Extended Boolean
    – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
    – Probabilistic: Inference Network, Belief Network
  • Structured Models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext
A Taxonomy of IR Models (Cont.)
Retrieval models
– Most frequently associated with distinct combinations of a document logical view and a user task:

USER TASK vs. Logical View of Documents
– Retrieval
  • Index Terms: Classic, Set theoretic, Algebraic, Probabilistic
  • Full Text: Classic, Set theoretic, Algebraic, Probabilistic
  • Full Text + Structure: Structured
– Browsing
  • Index Terms: Flat
  • Full Text: Flat, Hypertext
  • Full Text + Structure: Structure Guided, Hypertext
2.3 Retrieval
Ad hoc
– The documents in the collection remain relatively static while new queries are submitted to the system
– The most common form of user task
Filtering
– The queries remain relatively static while new documents come into the system (and leave)
– User profile
  • Describes the user's preferences
– Routing: a variation of filtering that ranks the filtered documents
2.4 A Formal Characterization of IR Models
IR Model
– An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where:
  • D: set composed of logical views for the documents
  • Q: set composed of logical views for the user information needs
  • F: framework for modeling documents, queries, and their relationships
  • R(q_i, d_j): ranking function which associates a real number with query q_i and document d_j
– boolean model: sets of documents and operations on sets
– vector model: t-dimensional vector space and linear algebra operations
– probabilistic model: sets, probabilistic operations, and Bayes' theorem
2.5 Classic Information Retrieval
Boolean Model
– Based on set theory and Boolean algebra
– Queries are specified as Boolean expressions
– Model considers that index terms are present or absent in a document
Vector Model
– Partial matching is possible
– Assigns non-binary weights to index terms
– Term weights are used to compute the degree of similarity
Probabilistic Model
– Given a query, the model assigns to each document d_j, as a measure of similarity to the query, P(d_j relevant to q) / P(d_j non-relevant to q), which computes the odds of the document d_j being relevant to the query q
2.5.1 Basic Concepts

Index Term
– Word whose semantics helps in remembering the document's main themes
– Mainly nouns
  • Nouns have meaning by themselves
– Weights
  • All terms are not equally useful for describing the document
– Definition
  • K = {k_1, ..., k_t}: set of index terms
  • d_j = (w_1j, w_2j, ..., w_tj): document
  • w_ij: weight associated with index term k_i of document d_j
  • g_i: function that returns the weight associated with index term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_ij)
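A minimal sketch of these definitions in Python; the index term names and weight values here are hypothetical:

```python
# Hypothetical index term set K = {k_1, ..., k_t} and a document
# vector d_j = (w_1j, ..., w_tj), as defined above.
K = ("parallel", "program", "system")
d_j = (0.5, 0.0, 0.8)

# g_i(d_j) = w_ij: returns the weight of index term k_i in any
# t-dimensional vector (1-based index, matching the notation).
def g(i, vec):
    return vec[i - 1]

print(g(1, d_j))  # 0.5
print(g(3, d_j))  # 0.8
```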
Basic Concepts (Cont.)
Mutual Independence
– Index term weights are usually assumed to be mutually independent
– Knowing the weight wij associated with the pair (ki, dj) tells us nothing about the weight w(i+1)j associated with the pair (ki+1, dj)
– It does simplify the task of computing index term weights and allows for fast ranking computation
2.5.2 Boolean Model
Base
– Simple retrieval model based on set theory and Boolean algebra
– Operations: and, or, not
Advantages
– Clean formalism
– Boolean query expressions have precise semantics
Disadvantages
– Binary decision (no notion of a partial match)
  • Retrieves too few or too many documents
– Users find it difficult to express their query requests in terms of Boolean expressions
Boolean Model (Cont.)
Definition
– d_j = (w_1j, w_2j, ..., w_tj), w_ij ∈ {0, 1}
– sim(d_j, q) = 1 if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc)); 0 otherwise

Example
– q = k_a ∧ (k_b ∨ ¬k_c)
– q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
Boolean Model (Cont.)
– q = parallel ∧ (program ∨ system)  [병렬 ∧ (프로그램 ∨ 시스템)]
– q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,1)

Document–term table (index terms: parallel, program, system, ...) and similarity:

Document  parallel  program  system  ...  similarity
001       1         0        1       ...  1
002       0         0        1       ...  0
003       0         1        1       ...  0
004       1         1        0       ...  1
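The Boolean matching in the example above can be sketched in Python; the term order (parallel, program, system) and the helper names are assumptions:

```python
from itertools import product

# Query from the example above: parallel AND (program OR system).
def query(parallel, program, system):
    return bool(parallel and (program or system))

# q_dnf: the set of binary term vectors that satisfy the query
# (its disjunctive normal form components).
q_dnf = {bits for bits in product((0, 1), repeat=3) if query(*bits)}
# {(1, 1, 1), (1, 1, 0), (1, 0, 1)}

# Binary document vectors from the table above.
docs = {
    "001": (1, 0, 1),
    "002": (0, 0, 1),
    "003": (0, 1, 1),
    "004": (1, 1, 0),
}

# sim(d_j, q) = 1 iff d_j equals some conjunctive component of q_dnf.
sims = {name: int(vec in q_dnf) for name, vec in docs.items()}
print(sims)  # {'001': 1, '002': 0, '003': 0, '004': 1}
```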
2.5.3 Vector model
Motivation
– Binary weights are too limiting
• Assign non-binary weights to index terms
– A framework in which partial matching is possible
• Instead of attempting to predict whether a document is relevant or not
• Rank the documents according to their degree of similarity to the query
Vector model (Cont.)
Definition
– q = (w_1q, w_2q, ..., w_tq), w_iq ≥ 0
– d_j = (w_1j, w_2j, ..., w_tj), w_ij ≥ 0
– sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
             = Σ_{i=1..t} w_ij × w_iq / (sqrt(Σ_{i=1..t} w_ij²) × sqrt(Σ_{i=1..t} w_iq²))
– 0 ≤ sim(d_j, q) ≤ 1 (cosine similarity)
– |q|: does not affect the ranking
– |d_j|: normalization in the space of the documents
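A minimal sketch of the cosine similarity formula above, assuming non-negative weight vectors of equal length (the function name is hypothetical):

```python
import math

# sim(d_j, q) = (d_j . q) / (|d_j| * |q|)
def cosine_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

print(cosine_sim((1.0, 2.0, 0.0), (2.0, 4.0, 0.0)))  # ~1.0: same direction
print(cosine_sim((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # 0.0: no shared terms
```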
Vector model (Cont.)
Clustering Problem
– Intra-cluster similarity
• What are the features which better describe the objects
– Inter-cluster similarity
• What are the features which better distinguish the objects
IR Problem
– Intra-cluster similarity (tf factor)
• Raw frequency of a term ki inside a document dj
– Inter-cluster similarity (idf factor)
• Inverse of the frequency of a term ki among the documents
Vector model (Cont.)

Weighting Scheme
– Term Frequency (tf)
• Measure of how well that term describes the document contents
– Inverse Document Frequency (idf)
• Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
– f_ij = freq_ij / max_l freq_lj
  • freq_ij: raw frequency of term k_i in the document d_j
– idf_i = log(N / n_i)
  • n_i: number of documents in which the index term k_i appears
  • N: total number of documents
Vector model (Cont.)
Best known index term weighting scheme
– Balance tf and idf (tf-idf scheme)
– w_ij = f_ij × idf_i

Query term weighting scheme
– w_iq = (0.5 + 0.5 × f_iq) × idf_i
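These weighting formulas can be sketched directly; the function names are hypothetical, and a base-10 log is assumed (matching the example values on the following slides):

```python
import math

def tf(freq_ij, max_freq_j):
    # f_ij = freq_ij / max_l freq_lj (normalized term frequency)
    return freq_ij / max_freq_j

def idf(N, n_i):
    # idf_i = log(N / n_i), base 10 assumed here
    return math.log10(N / n_i)

def doc_weight(freq_ij, max_freq_j, N, n_i):
    # w_ij = f_ij * idf_i (tf-idf document weight)
    return tf(freq_ij, max_freq_j) * idf(N, n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    # w_iq = (0.5 + 0.5 * f_iq) * idf_i (query term weight)
    return (0.5 + 0.5 * tf(freq_iq, max_freq_q)) * idf(N, n_i)

# e.g., a term at maximum frequency in 1 of 3 documents:
print(round(doc_weight(2, 2, 3, 1), 3))  # 0.477
```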
Vector model (Cont.)
Example (N = 3)
– D1: "Shipment of gold damaged in a fire"
– D2: "Delivery of silver arrived in a silver truck"
– D3: "Shipment of gold arrived in a truck"
– Q: "gold silver truck"
– idf_i = log(N / n_i), log base 10:

Term:  a    arrived  damaged  delivery  fire  gold  in   of   silver  shipment  truck
idf:   0    .176     .477     .477      .477  .176  0    0    .477    .176      .176

– Weights: w_ij = f_ij × idf_i and w_iq = f_iq × idf_i (raw frequencies in this example)
Vector model (Cont.)
Term weights (t1..t11 = a, arrived, damaged, delivery, fire, gold, in, of, silver, shipment, truck):

     t1   t2    t3    t4    t5    t6    t7   t8   t9    t10   t11
D1   0    0     .477  0     .477  .176  0    0    0     .176  0
D2   0    .176  0     .477  0     0     0    0    .954  0     .176
D3   0    .176  0     0     0     .176  0    0    0     .176  .176
Q    0    0     0     0     0     .176  0    0    .477  0     .176

SC(Q, D_j) = Σ_{i=1..t} w_iq × w_ij

SC(Q, D1) = (0)(0) + (0)(0) + (0)(.477) + (0)(0) + (0)(.477) + (.176)(.176)
          + (0)(0) + (0)(0) + (.477)(0) + (0)(.176) + (.176)(0)
          = (.176)² ≈ 0.031
SC(Q, D2) = (.477)(.954) + (.176)(.176) ≈ 0.486
SC(Q, D3) = (.176)(.176) + (.176)(.176) ≈ 0.062

Hence, the ranking would be D2, D3, D1
(Document vectors are not normalized)
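The worked example above can be reproduced end to end; this sketch uses raw term frequencies and a base-10 log, as the example's numbers imply:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
terms = sorted({t for text in docs.values() for t in text.split()})
# n_i: number of documents containing term i; idf_i = log10(N / n_i)
n = {t: sum(t in text.split() for text in docs.values()) for t in terms}
idf = {t: math.log10(N / n[t]) for t in terms}

def weights(text):
    # w_i = freq_i * idf_i (raw frequency, as in the example)
    return {t: text.split().count(t) * idf[t] for t in terms}

dvecs = {name: weights(text) for name, text in docs.items()}
qvec = weights(query)

def sc(q, d):
    # SC(Q, D_j) = sum_i w_iq * w_ij (unnormalized inner product)
    return sum(q[t] * d[t] for t in terms)

scores = {name: round(sc(qvec, dvecs[name]), 3) for name in docs}
print(scores)   # {'D1': 0.031, 'D2': 0.486, 'D3': 0.062}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```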
Vector model (Cont.)
Advantages
– The term-weighting scheme improves retrieval performance
– The partial matching strategy allows retrieval of documents that approximate the query conditions
– The cosine ranking formula sorts the documents according to their degree of similarity to the query
Disadvantages
– Index terms are assumed to be mutually independent
  • The tf-idf scheme does not account for index term dependencies
  • However, in practice, consideration of term dependencies might be a disadvantage