Set Models and the Extended Boolean Model
DESCRIPTION
Information Retrieval Systems lecture notes, Prof. Kang Seung-Shik
TRANSCRIPT
http://nlp.kookmin.ac.kr/
2.6 Alternative Set Theoretic Models
Fuzzy Set Model
Extended Boolean Model
2.6.1 Fuzzy Set Model
Fuzzy Set Theory
– Deals with the representation of classes whose boundaries are not well defined
– Membership in a fuzzy set is a notion intrinsically gradual instead of abrupt (as in conventional Boolean logic)
[Figure: membership functions for "tall" and "very tall" over height: gradual curves under fuzzy membership versus an abrupt 0/1 step under conventional membership]
Fuzzy Set Model (Cont.)
Definition
– A fuzzy subset A of a universe of discourse U is characterized by a membership function $\mu_A: U \rightarrow [0,1]$ which associates with each element u of U a number $\mu_A(u)$ in the interval $[0,1]$.
Definition
– Let U be the universe of discourse, A and B be two fuzzy subsets of U, and $\bar{A}$ be the complement of A relative to U. Also, let u be an element of U. Then:
$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
Fuzzy Set Model (Cont.)
Fuzzy information retrieval
– Representing documents and queries through sets of keywords yields descriptions which are only partially related to the real semantic contents of the respective documents and queries
– Each query term defines a fuzzy set
– Each document has a degree of membership in this set
Rank the documents relative to the user query
Example: for query terms t and s,
$D_t = \{(d_1, 0.8), (d_2, 0.5)\}$, $D_s = \{(d_1, 0.5), (d_2, 0.4)\}$
$Q = s \wedge t$: $D_s \cap D_t = \{(d_1, 0.5), (d_2, 0.4)\}$ (min of memberships)
$Q = s \vee t$: $D_s \cup D_t = \{(d_1, 0.8), (d_2, 0.5)\}$ (max of memberships)
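A minimal sketch of this ranking idea in Python (the membership degrees repeat the example above; the fuzzy_and/fuzzy_or helpers are my own names, not from the lecture): conjunction takes the minimum membership, disjunction the maximum, and documents are ranked by the resulting degree.

```python
# Fuzzy Boolean retrieval sketch: each query term defines a fuzzy set of
# documents; AND/OR are interpreted as min/max over membership degrees.

# Membership degrees from the example above (hypothetical values).
D_t = {"d1": 0.8, "d2": 0.5}
D_s = {"d1": 0.5, "d2": 0.4}

def fuzzy_and(*sets):
    """Membership in the intersection: minimum of the memberships."""
    docs = set().union(*sets)
    return {d: min(s.get(d, 0.0) for s in sets) for d in docs}

def fuzzy_or(*sets):
    """Membership in the union: maximum of the memberships."""
    docs = set().union(*sets)
    return {d: max(s.get(d, 0.0) for s in sets) for d in docs}

# Rank documents for the conjunctive and disjunctive queries.
print(sorted(fuzzy_and(D_t, D_s).items(), key=lambda x: -x[1]))  # [('d1', 0.5), ('d2', 0.4)]
print(sorted(fuzzy_or(D_t, D_s).items(), key=lambda x: -x[1]))   # [('d1', 0.8), ('d2', 0.5)]
```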
2.6.2 Extended Boolean Model
Motivation
– Boolean Model
• Simple and elegant
• No provision for term weighting
• No ranking of the answer set
• Output might be too large or too small
– Vector Space Model
• Simple, fast, better retrieval performance
– Extended Boolean Model
• Combines Boolean query formulations with characteristics of the vector model
Extended Boolean Model (Cont.)
The model is based on a critique of a basic assumption of Boolean logic
– Conjunctive Boolean query $q = k_x \wedge k_y$:
• A document which contains either the term kx or the term ky is as irrelevant as another document which contains neither of them
– Disjunctive Boolean query $q = k_x \vee k_y$:
• A document which contains either the term kx or the term ky is as relevant as another document which contains both of them
Extended Boolean Model (Cont.)
When only two terms are considered, queries and documents are plotted in a two-dimensional map
[Figure: documents dj and dj+1 plotted in the two-dimensional (kx, ky) space with corners (0,0), (0,1), (1,0), (1,1); one panel for the conjunctive query kx AND ky, one for the disjunctive query kx OR ky]
Extended Boolean Model (Cont.)
Disjunctive query $q_{or} = k_x \vee k_y$:
– Point (0,0) is the spot to be avoided
– Measure of similarity
• Distance from the point (0,0)
$sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
Conjunctive query $q_{and} = k_x \wedge k_y$:
– Point (1,1) is the most desirable spot
– Measure of similarity
• Complement of the distance from the point (1,1)
$sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$
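A small sketch of the two-term similarity measures above, assuming document weights x and y in [0,1] (the function names are my own):

```python
from math import sqrt

def sim_or(x, y):
    """Disjunctive query kx OR ky: normalized distance from the point (0,0)."""
    return sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    """Conjunctive query kx AND ky: complement of the distance from (1,1)."""
    return 1 - sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# A document with weight 1 for kx and 0 for ky gets an intermediate score
# for both query types instead of an all-or-nothing Boolean decision.
print(sim_or(1, 0), sim_and(1, 0))   # ~0.707, ~0.293
```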
Extended Boolean Model (Cont.)
P-norm Model
– Generalizes the notion of distance to include not only Euclidean distance but also p-distances
– p value is specified at query time
– Generalized disjunctive query: $q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m$
– Generalized conjunctive query: $q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m$
Extended Boolean Model (Cont.)
P-norm Model query-document similarity
$sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \cdots + x_m^p}{m} \right)^{1/p}$
$sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \cdots + (1-x_m)^p}{m} \right)^{1/p}$
Example: $q = (k_1 \wedge^p k_2) \vee^p k_3$
$sim(q, d_j) = \left( \frac{\left(1 - \left( \frac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p}{2} \right)^{1/p}$
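A hedged sketch of the p-norm similarities, evaluated bottom-up for the example query (the helper names and the sample weights x1, x2, x3 are illustrative assumptions):

```python
def pnorm_or(weights, p):
    """Generalized disjunction: ((x1^p + ... + xm^p) / m)^(1/p)."""
    m = len(weights)
    return (sum(x**p for x in weights) / m) ** (1 / p)

def pnorm_and(weights, p):
    """Generalized conjunction: 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    m = len(weights)
    return 1 - (sum((1 - x)**p for x in weights) / m) ** (1 / p)

def sim_example(x1, x2, x3, p):
    """Example query q = (k1 AND^p k2) OR^p k3, evaluated bottom-up."""
    return pnorm_or([pnorm_and([x1, x2], p), x3], p)

# Toy document weights; p = 2 recovers the Euclidean-distance case above.
print(sim_example(0.5, 0.7, 0.2, p=2))
```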
2.7 Alternative Algebraic Models
Generalized Vector Space Model
Latent Semantic Indexing Model
Neural Network Model
2.7.1 Generalized Vector Space Model
Three classic models
– Assume independence of index terms
Generalized vector space model
– Index term vectors are assumed linearly independent but are not pairwise orthogonal ($\vec{k}_i \cdot \vec{k}_j \neq 0$)
– Co-occurrence of index terms inside documents in the collection induces dependencies among these index terms
– Document ranking is based on the combination of the standard term-document weights with the term-term correlation factors (a simplified sketch follows)
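As a simplified illustration only (this is not the full GVSM minterm construction), the sketch below derives term-term correlation factors from co-occurrence in a toy term-document matrix and folds them into the ranking; the matrix values and variable names are assumptions.

```python
import numpy as np

# Toy term-document weight matrix: rows = index terms, columns = documents.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])

# Co-occurrence-induced term-term correlation factors.
C = A @ A.T

q = np.array([1.0, 1.0, 0.0])   # query term weights
scores = q @ C @ A              # combine correlations with document weights
print(scores)                   # one score per document
```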
2.7.2 Latent Semantic Indexing Model
Motivation
– Problem of the lexical matching method
• There are many ways to express a given concept (synonymy)
– Relevant documents which are not indexed by any of the query keywords are not retrieved
• Most words have multiple meanings (polysemy)
– Many unrelated documents might be included in the answer set
Idea
– Map each document and query vector into a lower-dimensional space which is associated with concepts
• Can be done by Singular Value Decomposition (see the sketch below)
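A minimal sketch of the SVD step, assuming NumPy and a toy term-document matrix (the query fold-in follows the usual q^T U_k S_k^{-1} convention; all values are illustrative):

```python
import numpy as np

# Term-document matrix: rows = terms, columns = documents (toy values).
A = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # number of latent concepts kept
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents in the k-dimensional concept space (one column per document).
docs_k = np.diag(sk) @ Vtk

# Fold a query vector into the same space.
q = np.array([1, 0, 1, 0], dtype=float)
q_k = q @ Uk @ np.linalg.inv(np.diag(sk))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank documents by cosine similarity in the concept space.
scores = [cos(q_k, docs_k[:, j]) for j in range(docs_k.shape[1])]
print(scores)
```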
2.7.3 Neural Network Model
Motivation
– In a conventional IR system,
• Document vectors are compared with query vectors for the computation of a ranking
• Index terms in documents and queries have to be matched and weighted for computing this ranking
– Neural networks are known to be good pattern matchers and can be an alternative IR model
– A neural network is a simplified graph representation of the mesh of interconnected neurons in the human brain
• Node: processing unit, edge: synaptic connections
• Weight: strength of connection
• Spread activation
Neural Network Model (Cont.)
Three layers
– query terms, document terms, documents
Spread activation process
– In the first phase, the query term nodes initiate the process by sending signals to the document term nodes, and then the document term nodes generate signals to the document nodes
– The document nodes generate new signals back to the document term nodes, and then the document term nodes again fire new signals to the document nodes (this process repeats)
– Signals become weaker at each iteration and the process eventually halts (a minimal sketch of this iteration follows below)
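A minimal sketch of such a spreading-activation loop, with assumed term-document weights and an explicit decay factor (the matrix W, the decay value, and the stopping threshold are illustrative choices, not the lecture's figures):

```python
import numpy as np

# W[i, j]: strength of the connection between document term i and document j.
W = np.array([
    [0.8, 0.0, 0.4],
    [0.0, 0.9, 0.5],
    [0.3, 0.2, 0.0],
])

term_signal = np.array([1.0, 0.0, 1.0])   # activation arriving from the query term nodes
doc_signal = np.zeros(W.shape[1])
decay = 0.3                               # signals become weaker at each iteration

for _ in range(20):
    doc_signal += W.T @ term_signal                   # term nodes fire signals to document nodes
    term_signal = decay * (W @ (W.T @ term_signal))   # documents fire back to term nodes, then decay
    if term_signal.max() < 1e-6:                      # process halts once signals die out
        break

print(doc_signal)   # final accumulated activation is used as the document ranking
```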
Neural Network Model (Cont.)
Example
– D1: Cats and dogs eat.
– D2: The dog has a mouse.
– D3: Mice eat anything.
– D4: Cats play with mice and rats.
– D5: Cats play with rats.
– Query: Do cats play with mice?
2.8 Alternative Probabilistic Models
Bayesian Networks
Inference Network Model
Belief Network Model
2.8.1 Bayesian Networks
Bayesian networks are directed acyclic graphs (DAGs)
– nodes: random variables
• The parents of a node are those judged to be direct causes for it.
– arcs: causal relationships between variables
• The strengths of causal influences are expressed by conditional probabilities.
[Figure: example Bayesian network DAG over nodes x1, x2, x3, x4, x5]
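One consequence worth stating: the joint distribution encoded by a Bayesian network factorizes over the DAG, with each variable conditioned only on its parents. The exact edge set of the example figure is not recoverable here, so only the general form is shown:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(x_i))$$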
2.8.2 Inference Network Model
Use evidential reasoning to estimate the probability that a document will be relevant to a query
The ranking of a document dj with respect to a query q is a measure of how much evidential support the observation of dj provides to the query q
Inference Network Model (Cont.)
Simple Inference Networks
[Figure: a simple inference network with root nodes A, B, C, D, E; node X with parents A, B, C; node Y with parents D, E; and node F with parents X and Y]
$P(X = true) = x_{111}abc + x_{110}ab(1-c) + x_{101}a(1-b)c + x_{011}(1-a)bc + x_{100}a(1-b)(1-c) + x_{010}(1-a)b(1-c) + x_{001}(1-a)(1-b)c + x_{000}(1-a)(1-b)(1-c)$
where $x_{111}$ is the probability that the child node is true given that all three parents are true.
$P(Y = true) = y_{11}de + y_{10}d(1-e) + y_{01}(1-d)e + y_{00}(1-d)(1-e)$
$P(F = true) = f_{11}xy + f_{10}x(1-y) + f_{01}(1-x)y + f_{00}(1-x)(1-y)$
Inference Network Model (Cont.)
Link Matrices
– Indicate the strength by which parents (either by themselves or in conjunction with other parents) affect children in the inference network
Link matrix for Y (parents D and E), with P(D) = 0.8 and P(E) = 0.4:

            ¬D¬E   ¬DE    D¬E    DE
  Y false   0.95   0.8    0.2    0.1
  Y true    0.05   0.2    0.8    0.9

$P(Y = true) = y_{11}de + y_{10}d(1-e) + y_{01}(1-d)e + y_{00}(1-d)(1-e)$
$= (0.9)(0.8)(0.4) + (0.8)(0.8)(0.6) + (0.2)(0.2)(0.4) + (0.05)(0.2)(0.6) = 0.694$
In general, for a node N with link matrix L and parent configurations $R_1, \ldots, R_n$:
$P(N = true) = \sum_{i=1}^{n} L_{true, R_i} \, P(R_i)$
(a small computational sketch follows)
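A small sketch of the computation above in Python (the dictionary layout and helper names are my own; the probabilities are the slide's values):

```python
p_d, p_e = 0.8, 0.4

# P(Y = true | parent configuration), indexed by the truth values of (D, E).
link_true = {
    (False, False): 0.05,
    (False, True):  0.2,
    (True,  False): 0.8,
    (True,  True):  0.9,
}

def prob(value, p):
    """Probability of a parent taking the given truth value."""
    return p if value else 1 - p

p_y_true = sum(
    link_true[(d, e)] * prob(d, p_d) * prob(e, p_e)
    for d in (False, True) for e in (False, True)
)
print(p_y_true)   # 0.694, matching the computation above
```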
Inference Network Model (Cont.)
Inference Network Example
– Three layers: document layer, term layer, and query layer
– Documents are represented as nodes, and a link exists from a document to a term.
[Figure: example inference network with a document layer (d1, d2, d3), a concept/term layer (t1, t2, t3, t4), and a query layer (Q)]
Inference Network Model (Cont.)
Relevance Ranking with the Inference Network
– Processing begins when a document, say D1, is instantiated (we believe D1 has been observed)
– This instantiates all term nodes in D1
– All links emanating from the term nodes just activated are instantiated, and the query node is activated
– The query node then computes the belief in the query given D1; this is used as the similarity coefficient for D1
– This process continues until all documents have been instantiated
Inference Network Model (Cont.)
Example of computing similarity coefficient
Q : “gold silver truck”
D1: “Shipment of gold damaged in a fire.”
D2: “Delivery of silver arrived in a silver truck.”
D3: “Shipment of gold arrived in a truck.”
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
idf 0 0.41 1.10 1.10 1.10 0.41 0 0 0.41 0.41 0.41
nidf 0 0.37 1 1 1 0.37 0 0 0.37 0.37 0.37
D1 1 0 1 0 1 1 1 1 0 1 0
D2 0.5 0.5 0 0.5 0 0 0.5 0.5 1 0 0.5
D3 1 1 0 0 0 1 1 1 0 1 1
Inference Network Model (Cont.)
Constructing Link Matrices for Terms
– Computing the belief in a given term (ki)
• Given a document (dj)
• Pij = 0.5 + 0.5(ntfij)(nidfi)
• Pgold3 = 0.5 + 0.5(0.37)(1) = 0.685
– Link Matrix
Link matrix for gold (parents D1, D3):

            ¬D1¬D3   ¬D1 D3   D1 ¬D3
  False        1      0.315    0.315
  True         0      0.685    0.685

Link matrix for silver (parent D2):

            ¬D2      D2
  False        1     0.315
  True         0     0.685

Link matrix for truck (parents D2, D3):

            ¬D2¬D3   ¬D2 D3   D2 ¬D3
  False        1      0.315    0.408
  True         0      0.685    0.592

(a small sketch of the term-belief formula follows)
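A one-line sketch of the term-belief formula, reproducing the Pgold3 value above (the function name is my own):

```python
def term_belief(ntf, nidf):
    """P_ij = 0.5 + 0.5 * ntf_ij * nidf_i."""
    return 0.5 + 0.5 * ntf * nidf

print(term_belief(1.0, 0.37))   # Pgold3 = 0.685
```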
Inference Network Model (Cont.)
Computing the Similarity Coefficient
– A link matrix for the query node Q (parents: gold g, silver s, truck t):

            ¬g¬s¬t   g¬s¬t   ¬gs¬t   gs¬t   ¬g¬st   g¬st   ¬gst   gst
  Q false     0.9     0.7     0.7     0.5    0.5     0.3    0.3    0.1
  Q true      0.1     0.3     0.3     0.5    0.5     0.7    0.7    0.9

– bel(gold|D1) = 0.685, bel(silver|D1) = 0, bel(truck|D1) = 0
  Bel(Q|D1) = 0.1(0.315)(1)(1) + 0.3(0.685)(1)(1) + 0.3(0.315)(0)(1) + 0.5(0.685)(0)(1) + 0.5(0.315)(1)(0) + 0.7(0.685)(1)(0) + 0.7(0.315)(0)(0) + 0.9(0.685)(0)(0) = 0.237
– bel(gold|D2) = 0, bel(silver|D2) = 0.685, bel(truck|D2) = 0.592
  Bel(Q|D2) = 0.589
– bel(gold|D3) = 0.685, bel(silver|D3) = 0, bel(truck|D3) = 0.685
  Bel(Q|D3) = 0.511
(a small sketch of this computation follows)
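A small sketch of the belief computation, reproducing Bel(Q|D1) and Bel(Q|D3) from the slide (the dictionary layout and function name are my own):

```python
from itertools import product

# P(Q = true | g, s, t) for the query node, indexed by (gold, silver, truck).
link_true = {
    (False, False, False): 0.1,
    (True,  False, False): 0.3,
    (False, True,  False): 0.3,
    (True,  True,  False): 0.5,
    (False, False, True):  0.5,
    (True,  False, True):  0.7,
    (False, True,  True):  0.7,
    (True,  True,  True):  0.9,
}

def bel_q(p_gold, p_silver, p_truck):
    """Sum over all parent configurations, weighted by the term beliefs."""
    probs = (p_gold, p_silver, p_truck)
    total = 0.0
    for config in product((False, True), repeat=3):
        weight = 1.0
        for value, p in zip(config, probs):
            weight *= p if value else 1 - p
        total += link_true[config] * weight
    return total

print(round(bel_q(0.685, 0.0, 0.0), 3))    # Bel(Q|D1) = 0.237
print(round(bel_q(0.685, 0.0, 0.685), 3))  # Bel(Q|D3) = 0.511
```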