Set Model · Extended Boolean Model

http://nlp.kookmin.ac.kr/

2.6 Alternative Set Theoretic Models: Fuzzy Set Model, Extended Boolean Model


Lecture notes for an Information Retrieval Systems course (Prof. Kang Seung-Shik)


Page 1: Set Model · Extended Boolean Model


2.6 Alternative Set Theoretic Models

Fuzzy Set Model

Extended Boolean Model

Page 2

2.6.1 Fuzzy Set Model

Fuzzy Set Theory

– Deals with the representation of classes whose boundaries are not well defined

– Membership in a fuzzy set is a notion intrinsically gradual instead of abrupt (as in conventional Boolean logic)

[Figure: membership as a function of height for the sets "tall" and "very tall". Fuzzy membership rises gradually from 0 to 1, while conventional (Boolean) membership jumps abruptly between 0 and 1.]

Page 3

Fuzzy Set Model (Cont.)

Definition

– A fuzzy subset A of a universe of discourse U is characterized by a membership function μ_A : U → [0, 1], which associates with each element u of U a number μ_A(u) in the interval [0, 1]

Definition

– Let U be the universe of discourse, let A and B be two fuzzy subsets of U, and let Ā be the complement of A relative to U. Also, let u be an element of U. Then:

  μ_Ā(u) = 1 − μ_A(u)

  μ_(A∪B)(u) = max(μ_A(u), μ_B(u))

  μ_(A∩B)(u) = min(μ_A(u), μ_B(u))
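As a minimal sketch, the three membership rules can be written directly in Python (the function names are mine, not from the lecture):

```python
def mu_complement(mu_a: float) -> float:
    """Membership in the complement of A: 1 - mu_A(u)."""
    return 1.0 - mu_a

def mu_union(mu_a: float, mu_b: float) -> float:
    """Membership in A union B: max of the two memberships."""
    return max(mu_a, mu_b)

def mu_intersection(mu_a: float, mu_b: float) -> float:
    """Membership in A intersect B: min of the two memberships."""
    return min(mu_a, mu_b)

# Example: an element u with mu_A(u) = 0.7 and mu_B(u) = 0.4
complement = mu_complement(0.7)        # ≈ 0.3
union = mu_union(0.7, 0.4)             # 0.7
intersection = mu_intersection(0.7, 0.4)  # 0.4
```

Note that max/min generalize Boolean OR/AND: restricted to memberships of exactly 0 or 1, they reproduce the conventional truth tables.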

Page 4

Fuzzy Set Model (Cont.)

Fuzzy information retrieval

– Representing documents and queries through sets of keywords yields descriptions which are only partially related to the real semantic contents of the respective documents and queries

– Each query term defines a fuzzy set

– Each document has a degree of membership in this set

Rank the documents relative to the user query

Example (query terms t and s):

  D_t = {(d1, 0.8), (d2, 0.5)},  D_s = {(d1, 0.5), (d2, 0.4)}

  Q(s ∧ t) = D_s ∩ D_t = {(d1, 0.5), (d2, 0.4)}

  Q(s ∨ t) = D_s ∪ D_t = {(d1, 0.8), (d2, 0.5)}
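The example can be reproduced with a few lines of Python, representing each term's fuzzy set as a dict from document to membership degree (a sketch; the helper names are mine):

```python
# Fuzzy membership of each document in the sets defined by terms t and s
# (values from the slide's example)
D_t = {"d1": 0.8, "d2": 0.5}
D_s = {"d1": 0.5, "d2": 0.4}

def fuzzy_intersection(a, b):
    """Conjunctive query (t AND s): min of memberships per document."""
    return {d: min(a[d], b[d]) for d in a}

def fuzzy_union(a, b):
    """Disjunctive query (t OR s): max of memberships per document."""
    return {d: max(a[d], b[d]) for d in a}

q_and = fuzzy_intersection(D_s, D_t)  # {"d1": 0.5, "d2": 0.4}
q_or = fuzzy_union(D_s, D_t)          # {"d1": 0.8, "d2": 0.5}

# Rank documents by their degree of membership in the query's fuzzy set
ranking = sorted(q_or, key=q_or.get, reverse=True)  # ["d1", "d2"]
```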

Page 5

2.6.2 Extended Boolean Model

Motivation

– Boolean Model

• Simple and elegant

• No provision for term weighting

• No ranking of the answer set

• Output might be too large or too small

– Vector space Model

• Simple, fast, better retrieval performance

– Extended Boolean Model

• Combines Boolean query formulations with characteristics of the vector model

Page 6

Extended Boolean Model (Cont.)

The model is based on a critique of a basic assumption of Boolean logic

– Conjunctive Boolean query q = k_x ∧ k_y :

• A document which contains either the term k_x or the term k_y is considered as irrelevant as a document which contains neither of them

– Disjunctive Boolean query q = k_x ∨ k_y :

• A document which contains either the term k_x or the term k_y is considered as relevant as a document which contains both of them

Page 7

Extended Boolean Model (Cont.)

When only two terms are considered, queries and documents can be plotted on a two-dimensional map

[Figure: documents d_j and d_j+1 plotted on two maps over axes k_x and k_y with corners (0,0), (1,0), (0,1), (1,1): one map for the query "k_x and k_y", one for "k_x or k_y".]

Page 8

Extended Boolean Model (Cont.)

Disjunctive query q_or = k_x ∨ k_y :

– The point (0,0) is the spot to be avoided

– Measure of similarity: the distance from the point (0,0)

  sim(q_or, d) = sqrt( (x² + y²) / 2 )

Conjunctive query q_and = k_x ∧ k_y :

– The point (1,1) is the most desirable spot

– Measure of similarity: the complement of the distance from the point (1,1)

  sim(q_and, d) = 1 − sqrt( ((1 − x)² + (1 − y)²) / 2 )
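The two similarity measures are easy to check at the corners of the map; a minimal sketch (x and y are the document's weights for k_x and k_y):

```python
import math

def sim_or(x: float, y: float) -> float:
    """Disjunctive query: normalized distance from the undesirable point (0,0)."""
    return math.sqrt((x**2 + y**2) / 2)

def sim_and(x: float, y: float) -> float:
    """Conjunctive query: complement of the normalized distance from (1,1)."""
    return 1 - math.sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# Sanity checks at the corners of the map
worst_or = sim_or(0, 0)    # 0.0: the spot to be avoided
best_and = sim_and(1, 1)   # 1.0: the most desirable spot
partial = sim_or(1, 0)     # ≈ 0.707: one matching term earns partial credit
```

Unlike the pure Boolean model, a document matching only one of the two terms gets an intermediate score instead of being treated the same as a document matching neither (for OR) or both (for AND).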

Page 9

Extended Boolean Model (Cont.)

P-norm Model

– Generalizes the notion of distance to include not only Euclidean distance but also p-distances

– The p value (1 ≤ p ≤ ∞) is specified at query time

– Generalized disjunctive query:  q_or = k_1 ∨^p k_2 ∨^p … ∨^p k_m

– Generalized conjunctive query:  q_and = k_1 ∧^p k_2 ∧^p … ∧^p k_m

Page 10

Extended Boolean Model (Cont.)

P-norm Model query-document similarity

  sim(q_or, d_j) = ( (x_1^p + x_2^p + … + x_m^p) / m )^(1/p)

  sim(q_and, d_j) = 1 − ( ((1 − x_1)^p + (1 − x_2)^p + … + (1 − x_m)^p) / m )^(1/p)

Example: for the query q = (k_1 ∧^p k_2) ∨^p k_3,

  sim(q, d_j) = ( ( (1 − ( ((1 − x_1)^p + (1 − x_2)^p) / 2 )^(1/p) )^p + x_3^p ) / 2 )^(1/p)
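A direct sketch of the p-norm similarities, including the nested example query (function names are mine; xs is the list of term weights x_1 … x_m for a document):

```python
def p_sim_or(xs, p):
    """P-norm disjunctive similarity: ((x1^p + ... + xm^p) / m)^(1/p)."""
    m = len(xs)
    return (sum(x**p for x in xs) / m) ** (1 / p)

def p_sim_and(xs, p):
    """P-norm conjunctive similarity: 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    m = len(xs)
    return 1 - (sum((1 - x)**p for x in xs) / m) ** (1 / p)

def sim_example(x1, x2, x3, p):
    """Nested query q = (k1 AND^p k2) OR^p k3, evaluated inside-out."""
    return p_sim_or([p_sim_and([x1, x2], p), x3], p)

# With p = 2 and two terms these reduce to the Euclidean formulas
# of the previous slide, e.g. p_sim_or([1.0, 0.0], 2) ≈ 0.707
```

With p = 1 both formulas collapse to the average of the term weights (vector-model-like behavior), and as p grows the OR and AND forms approach the strict max and min of Boolean semantics, which is why p can usefully be chosen per query.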

Page 11

2.7 Alternative Algebraic Models

Generalized Vector Space Model

Latent Semantic Indexing Model

Neural Network Model

Page 12

2.7.1 Generalized Vector Space Model

Three classic models

– Assume independence of index terms

Generalized vector space model

– Index term vectors are assumed linearly independent but are not pairwise orthogonal  (k_i · k_j ≠ 0)

– Co-occurrence of index terms inside documents in the collection induces dependencies among these index terms

– Document ranking is based on combining the standard term-document weights with the term-term correlation factors
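The full generalized vector space model expands term vectors over minterms; as a much simpler sketch of the idea, the co-occurrence matrix below induces nonzero term-term correlations that can be folded into the ranking (all data and the scoring form here are illustrative, not from the lecture):

```python
import numpy as np

# Toy term-document weight matrix A (rows: 3 terms, cols: 4 documents)
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
])

# Co-occurrence of terms inside documents induces term-term correlations,
# so k_i . k_j != 0 in general (here C[0, 1] != 0 because terms 0 and 1
# co-occur in document 0)
C = A @ A.T

# Ranking combines term-document weights with the correlation factors:
# a query on term 0 also gives credit to documents containing correlated terms
q = np.array([1.0, 0.0, 0.0])
scores = A.T @ (C @ q)
ranking = np.argsort(-scores)  # document indices, best first
```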

Page 13

2.7.2 Latent Semantic Indexing Model

Motivation: problems of lexical matching

• There are many ways to express a given concept (synonymy)

– Relevant documents which are not indexed by any of the query keywords are not retrieved

• Most words have multiple meanings (polysemy)

– Many unrelated documents might be included in the answer set

Idea

– Map each document and query vector into a lower-dimensional space which is associated with concepts

• This can be done by Singular Value Decomposition (SVD)
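The SVD step can be sketched with NumPy on a toy term-document matrix (the matrix, the query, and the standard fold-in formula q_k = q U_k S_k^{-1} are illustrative assumptions, not values from the lecture):

```python
import numpy as np

# Toy term-document matrix (rows: 5 terms, cols: 4 documents)
A = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
], dtype=float)

# A = U S V^T; keep only the k largest singular values ("concepts")
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Documents in the k-dimensional concept space (one row per document)
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T

# Fold a query vector into the same space
q = np.array([1, 1, 0, 0, 0], dtype=float)
q_k = q @ U[:, :k] @ np.diag(1.0 / s[:k])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity in concept space; documents sharing
# concepts with the query can score well even without exact keyword overlap
scores = [cosine(q_k, d) for d in docs_k]
```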

Page 14

2.7.3 Neural Network Model

Motivation

– In a conventional IR system,

• Document vectors are compared with query vectors for the computation of a ranking

• Index terms in documents and queries have to be matched and weighted for computing this ranking

– Neural networks are known to be good pattern matchers and can be an alternative IR model

– A neural network is a simplified graph representation of the mesh of interconnected neurons in the human brain

• Node: processing unit; edge: synaptic connection

• Weight: strength of a connection

• Spreading activation

Page 15

Neural Network Model (Cont.)

Three layers

– query terms, document terms, documents

Spreading activation process

– In the first phase, the query term nodes initiate the process by sending signals to the document term nodes, and the document term nodes then generate signals to the document nodes

– The document nodes generate new signals back to the document term nodes, and then the document term nodes again fire new signals to the document nodes (repeat this process)

– Signals become weaker at each iteration and the process eventually halts
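The back-and-forth phases can be sketched with a term-document weight matrix and a decay factor (the matrix, query, decay value, and iteration count are all illustrative assumptions):

```python
import numpy as np

# Term-document weight matrix W (rows: 3 terms, cols: 3 documents)
W = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
query = np.array([1.0, 0.0, 1.0])  # activation of the query term nodes
decay = 0.4                        # signals become weaker at each iteration

# First phase: query terms fire at document term nodes,
# which fire at document nodes
doc_signal = W.T @ query

# Repeated back-and-forth between document nodes and document term nodes;
# each round's contribution shrinks by decay**2, so activation settles down
for _ in range(5):
    term_signal = decay * (W @ doc_signal)
    doc_signal = doc_signal + decay * (W.T @ term_signal)

ranking = np.argsort(-doc_signal)  # documents ordered by accumulated activation
```

In a real implementation the process would run until the signals fall below a threshold rather than for a fixed number of rounds.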

Page 16

Neural Network Model (Cont.)

Example

– D1: Cats and dogs eat.

– D2: The dog has a mouse.

– D3: Mice eat anything.

– D4: Cats play with mice and rats.

– D5: Cats play with rats.

– Query: Do cats play with mice?

Page 17

2.8 Alternative Probabilistic Models

Bayesian Networks

Inference Network Model

Belief Network Model

Page 18

2.8.1 Bayesian Networks

Bayesian networks are directed acyclic graphs (DAGs)

– nodes: random variables

• The parents of a node are those judged to be direct causes for it

– arcs: causal relationships between variables

• The strengths of causal influences are expressed by conditional probabilities

[Figure: an example DAG over variables x1 through x5, arranged in three levels: x1; x2, x3; x4, x5.]

Page 19

2.8.2 Inference Network Model

Use evidential reasoning to estimate the probability that a document will be relevant to a query

The ranking of a document dj with respect to a query q is a measure of how much evidential support the observation of dj provides to the query q

Page 20

Inference Network Model (Cont.)

Simple inference networks

[Figure: a simple inference network. X has parents A, B, C; Y has parents D, E; F has parents X and Y.]

With lowercase letters denoting the probabilities that the corresponding parent nodes are true, and p_111 denoting the probability that the child node is true given that all three parents are true:

  P(X = true) = p_111·abc + p_110·ab(1−c) + p_101·a(1−b)c + p_011·(1−a)bc
              + p_100·a(1−b)(1−c) + p_010·(1−a)b(1−c) + p_001·(1−a)(1−b)c
              + p_000·(1−a)(1−b)(1−c)

  P(Y = true) = p_11·de + p_10·d(1−e) + p_01·(1−d)e + p_00·(1−d)(1−e)

  P(F = true) = p_11·xy + p_10·x(1−y) + p_01·(1−x)y + p_00·(1−x)(1−y)
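These closed forms all have the same shape: sum, over every truth assignment of the parents, the link value for that assignment times the probability of the assignment (parents assumed independent). A generic sketch:

```python
from itertools import product

def p_true(link, parent_probs):
    """P(child = true): sum over all parent configurations of
    link[config] * P(config), with independent parents."""
    total = 0.0
    for config in product((True, False), repeat=len(parent_probs)):
        p_config = 1.0
        for on, p in zip(config, parent_probs):
            p_config *= p if on else (1.0 - p)
        total += link[config] * p_config
    return total

# Two-parent example: a child that is true iff at least one parent is true
link = {(True, True): 1.0, (True, False): 1.0,
        (False, True): 1.0, (False, False): 0.0}
p = p_true(link, [0.5, 0.5])  # 0.75 = 1 - (0.5)(0.5)
```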

Page 21

Inference Network Model (Cont.)

Link matrices

– Indicate the strength by which parents (either by themselves or in conjunction with other parents) affect children in the inference network

Example: the link matrix for node Y with parents D and E, where P(D) = 0.8 and P(E) = 0.4:

              D=false,E=false   D=false,E=true   D=true,E=false   D=true,E=true
  Y = true         0.05              0.2              0.8              0.9
  Y = false        0.95              0.8              0.2              0.1

  P(Y = true) = L_11·de + L_10·d(1−e) + L_01·(1−d)e + L_00·(1−d)(1−e)
              = (0.9)(0.8)(0.4) + (0.8)(0.8)(0.6) + (0.2)(0.2)(0.4) + (0.05)(0.2)(0.6)
              = 0.694

In general, for a node N with n parents whose truth assignments range over the configurations R_1, …, R_(2^n):

  P(N = true) = Σ_i L(R_i) · P(R_i)
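The P(Y = true) = 0.694 computation checks out directly:

```python
# Link-matrix entries for Y, indexed by (D true?, E true?)
L = {(True, True): 0.9, (True, False): 0.8,
     (False, True): 0.2, (False, False): 0.05}
p_d, p_e = 0.8, 0.4

# Sum over the four parent configurations
p_y = sum(
    L[(d, e)] * (p_d if d else 1 - p_d) * (p_e if e else 1 - p_e)
    for d in (True, False)
    for e in (True, False)
)
# p_y ≈ 0.694
```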

Page 22

Inference Network Model (Cont.)

Inference network example

– Three layers: document layer, concept (term) layer, and query layer

– Documents are represented as nodes, and a link exists from a document to each of its terms

[Figure: documents d1, d2, d3 in the document layer, linked to terms t1, t2, t3, t4 in the concept layer, which are linked to the query node Q.]

Page 23

Inference Network Model (Cont.)

Relevance ranking with an inference network

– Processing begins when a document, say D1, is instantiated (we believe D1 has been observed)

– This instantiates all term nodes in D1

– All links emanating from the term nodes just activated are instantiated, and the query node is activated

– The query node then computes the belief in the query given D1; this is used as the similarity coefficient for D1

– This process continues until all documents have been instantiated

Page 24

Inference Network Model (Cont.)

Example of computing the similarity coefficient

Q : "gold silver truck"

D1: "Shipment of gold damaged in a fire."

D2: "Delivery of silver arrived in a silver truck."

D3: "Shipment of gold arrived in a truck."

         t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
  idf    0     0.41  1.10  1.10  1.10  0.41  0     0     0.41  0.41  0.41
  nidf   0     0.37  1     1     1     0.37  0     0     0.37  0.37  0.37
  D1     1     0     1     0     1     1     1     1     0     1     0
  D2     0.5   0.5   0     0.5   0     0     0.5   0.5   1     0     0.5
  D3     1     1     0     0     0     1     1     1     0     1     1

Page 25

Inference Network Model (Cont.)

Constructing link matrices for terms

– Computing the belief in a given term k_i, given a document d_j:

  P_ij = 0.5 + 0.5 (ntf_ij)(nidf_i)

  e.g.  P_gold,3 = 0.5 + 0.5 (0.37)(1) = 0.685

– Link matrices (the both-documents-observed column is not needed here, since documents are instantiated one at a time):

  gold (parents D1, D3):
              neither   D1 only   D3 only
    True        0        0.685     0.685
    False       1        0.315     0.315

  silver (parent D2):
              D2 absent   D2 observed
    True         0           0.685
    False        1           0.315

  truck (parents D2, D3):
              neither   D2 only   D3 only
    True        0        0.592     0.685
    False       1        0.408     0.315
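The P_ij formula is a one-liner; the 0.5 floor means even an unobserved match contributes nothing beyond the default belief (the function name is mine):

```python
def term_belief(ntf: float, nidf: float) -> float:
    """Belief in term k_i given document d_j: P_ij = 0.5 + 0.5 * ntf_ij * nidf_i."""
    return 0.5 + 0.5 * ntf * nidf

p_gold_3 = term_belief(0.37, 1.0)   # 0.685, the P_gold,3 value from the slide
p_absent = term_belief(0.0, 1.0)    # 0.5, the default when the term is absent
p_truck_2 = term_belief(0.5, 0.37)  # ≈ 0.592, matching the truck link matrix
```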

Page 26

Inference Network Model (Cont.)

Computing the similarity coefficient

– A link matrix for the query node, with columns indexed by which of gold (g), silver (s), truck (t) are true:

            none   g     s     g,s   t     g,t   s,t   g,s,t
  True      0.1    0.3   0.3   0.5   0.5   0.7   0.7   0.9
  False     0.9    0.7   0.7   0.5   0.5   0.3   0.3   0.1

– bel(gold|D1) = 0.685, bel(silver|D1) = 0, bel(truck|D1) = 0

  Bel(Q|D1) = 0.1(0.315)(1)(1) + 0.3(0.685)(1)(1) + 0.3(0.315)(0)(1) + 0.5(0.685)(0)(1)
            + 0.5(0.315)(1)(0) + 0.7(0.685)(1)(0) + 0.7(0.315)(0)(0) + 0.9(0.685)(0)(0)
            = 0.237

– bel(gold|D2) = 0, bel(silver|D2) = 0.685, bel(truck|D2) = 0.592;  Bel(Q|D2) = 0.589

– bel(gold|D3) = 0.685, bel(silver|D3) = 0, bel(truck|D3) = 0.685;  Bel(Q|D3) = 0.511