Knowledge Base ―Information Retrieval― Masaharu Yoshioka

Page 1:

Knowledge Base ―Information Retrieval―

Masaharu Yoshioka

Page 2:

Information Retrieval vs. Database Retrieval

n Database retrieval
– Retrieve data that satisfy a given condition

n Information retrieval
– Retrieve documents based on the user's information need
• An information need is ambiguous and difficult to represent as a set of keywords.

• E.g., Ramen Sapporo

[This photo by an unknown author is licensed under CC BY-NC-ND]

[This photo by an unknown author is licensed under CC BY-SA]

Page 3:

Query Processing

n The initial query may not be good enough to represent the user's information need.
– Query processing (a minimal sketch follows below)
• Query rewriting
• Normalization
• Query expansion
• …
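As a rough illustration of these steps, here is a minimal Python sketch of normalization followed by dictionary-based query expansion. The synonym table is a made-up stand-in for a real thesaurus or query log, not something from the slides.

```python
# Minimal query-processing sketch: normalization + dictionary-based expansion.
# SYNONYMS is a hypothetical toy table standing in for a real thesaurus.

SYNONYMS = {
    "ramen": ["noodles"],
    "sapporo": ["hokkaido"],
}

def normalize(query: str) -> list[str]:
    """Lowercase the query and split it into keyword terms."""
    return query.lower().split()

def expand(terms: list[str]) -> list[str]:
    """Query expansion: add synonyms of each term to broaden the query."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand(normalize("Ramen Sapporo")))
# ['ramen', 'sapporo', 'noodles', 'hokkaido']
```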

Page 4:

Retrieval Model

n Boolean Retrieval Model
– Check whether each document satisfies the condition provided as a Boolean query
n Ranked Retrieval Model
– Calculate a score for each document based on the model and return a ranked list for the given query
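To make the contrast concrete, here is a minimal sketch of both models on a toy corpus; the plain term-frequency scoring is only a placeholder for a real ranking model such as TF-IDF or BM25.

```python
# Boolean model: exact condition matching. Ranked model: score and sort.

docs = {
    "d1": "ramen shop in sapporo",
    "d2": "sapporo is in hokkaido",
    "d3": "ramen ramen ramen",
}

def boolean_and(query_terms, docs):
    """Boolean retrieval: return documents containing ALL query terms."""
    return [d for d, text in docs.items()
            if all(t in text.split() for t in query_terms)]

def ranked(query_terms, docs):
    """Ranked retrieval: score each document (term frequency) and sort."""
    scores = {d: sum(text.split().count(t) for t in query_terms)
              for d, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(boolean_and(["ramen", "sapporo"], docs))  # ['d1']
print(ranked(["ramen", "sapporo"], docs))       # [('d3', 3), ('d1', 2), ('d2', 1)]
```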

Page 5:

IR Evaluation Measure

n Precision and Recall
– Precision: the fraction of retrieved documents that are relevant. Recall: the fraction of relevant documents that are retrieved.
– Precision and recall tend to have a trade-off relationship (see the sketch below).
• When the number of retrieved documents increases, we expect the number of retrieved relevant documents to increase only slightly.
• Methods used for improving recall, such as stemming, may deteriorate precision.
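A minimal sketch of the two measures for a single query, showing the trade-off: retrieving more documents here raises recall but lowers precision.

```python
# Precision = relevant retrieved / retrieved; Recall = relevant retrieved / relevant.

def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = len([d for d in retrieved if d in relevant])
    return hits / len(retrieved), hits / len(relevant)

relevant = {"d1", "d2", "d3"}
print(precision_recall(["d1", "d9"], relevant))                    # (0.5, 0.333...)
print(precision_recall(["d1", "d9", "d8", "d7", "d2"], relevant))  # (0.4, 0.666...)
```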

Page 6:

Information Retrieval for WWW

n Characteristics of WWW information
– Frequent updates
– Various types of documents (length, language, information source)
– Documents with links
• Relations among documents
n Characteristics of WWW information retrieval users
– They use a small number of keywords (3 words at most)
– They check only the top 10 ranked documents (they don't care about recall)
– They don't use complex functions such as Boolean expressions

Page 7:

Information Retrieval with WWW Link Structure

n PageRank (proposed by Google)
– Assumptions:
• A good page may have links from many pages
• Pages that are linked from a good page may be good ones
– Model
• A user clicks links in documents at random (random walk hypothesis); check the population of users in the convergence state
• Construct a transition matrix based on the probability of a link click (1/m for each link in a document with m links)
• Calculate the convergence state using the following formula, where $M$ is the transition matrix and $\vec{r}_i$ represents the population of users at step $i$:

$\vec{r}_{i+1} = M \vec{r}_i$
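A minimal power-iteration sketch of this model, assuming a made-up three-page link graph; every page has at least one out-link, so the plain formula converges without extra handling for dangling pages.

```python
# PageRank by power iteration: r_{i+1} = M r_i on a tiny hypothetical graph.

# links[p] = pages that page p links to
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# Transition matrix M: a page with m out-links sends 1/m of its users along each.
M = [[0.0] * n for _ in range(n)]
for p, outs in links.items():
    for q in outs:
        M[q][p] = 1.0 / len(outs)

# Start from a uniform population and iterate to (approximate) convergence.
r = [1.0 / n] * n
for _ in range(100):
    r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]

print([round(x, 3) for x in r])  # [0.4, 0.2, 0.4]: the PageRank scores
```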

Page 8:

Problems of PageRank

n Problem of closed link structures
n Users may change pages without using links
– Usage of bookmarks or typed-in URLs
n Topic-sensitive PageRank
– Calculate the initial value by selecting topic-representative pages as a list of bookmarked pages for the topic
• Topic-representative pages were constructed from the Open Directory
• Estimate the query topic using the similarity between the query terms and the documents belonging to the topic
– The following formula represents a mixture of transitions based on links and bookmarks; α represents the probability of a transition using bookmarks, and $\vec{v}$ is the distribution over bookmarked pages:

$\vec{r}_{i+1} = (1 - \alpha) M \vec{r}_i + \alpha \vec{v}$
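The same power-iteration sketch extended with the bookmark mixture; the graph, the value of α, and the bookmark vector $\vec{v}$ are illustrative assumptions.

```python
# Topic-sensitive PageRank sketch: r_{i+1} = (1 - alpha) M r_i + alpha v.

links = {0: [1], 1: [2], 2: [0]}
n = len(links)
M = [[0.0] * n for _ in range(n)]
for p, outs in links.items():
    for q in outs:
        M[q][p] = 1.0 / len(outs)

alpha = 0.15
v = [1.0, 0.0, 0.0]  # all bookmark mass on page 0, the topic-representative page

r = [1.0 / n] * n
for _ in range(100):
    # with prob. (1 - alpha) follow a link, with prob. alpha jump via bookmarks
    r = [(1 - alpha) * sum(M[i][j] * r[j] for j in range(n)) + alpha * v[i]
         for i in range(n)]

print([round(x, 3) for x in r])  # [0.389, 0.33, 0.281]: page 0 is boosted
```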

Page 9:

HTML as a Structured Document

n Utilization of tag information
– Text in the TITLE tag may represent a summary of the page
– Add more weight to such text
n Utilization of anchor text
– Anchor text: the text of a link that users click
Example: <a href="https://www.hokudai.ac.jp">Hokkaido University</a>
– It can also be used as a summary of the linked page
n Utilization of domain information
– Check the site name to identify whether the site is official or not
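A minimal sketch of pulling out the TITLE text and anchor texts with Python's standard html.parser; the sample page is made up.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects the TITLE text and (href, anchor text) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.current_href = None
        self.title = ""
        self.anchors = []  # list of [href, anchor text]

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            self.current_href = dict(attrs).get("href")
            self.anchors.append([self.current_href, ""])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if self.in_title:
            self.title += data           # candidate page summary (extra weight)
        elif self.current_href is not None:
            self.anchors[-1][1] += data  # usable as a summary of the linked page

p = TagTextExtractor()
p.feed('<html><head><title>Campus Guide</title></head><body>'
       '<a href="https://www.hokudai.ac.jp">Hokkaido University</a></body></html>')
print(p.title)    # Campus Guide
print(p.anchors)  # [['https://www.hokudai.ac.jp', 'Hokkaido University']]
```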

Page 10:

Query Logs and Click-through Data

n Click-through data
– Users may click the links for pages that seem useful based on the snippet (and may skip links that seem useless). The system uses this click information to evaluate the pages.
n Query suggestion
– The system tries to clarify the query by suggesting candidate additional query words.
• The system suggests additional keywords for clarifying the query by using other users' queries.

Page 11:

Search Results Diversification

n Search results for query keywords with ambiguity (multiple meanings) and/or multiple aspects (e.g., user side, developer side, …)
– When the ranking results contain only one meaning or one aspect of the query, users who would like to know about other meanings or aspects may not be satisfied with those results.
– Quite few users check the 2nd page of the retrieval results.
n Add new search results using the following evaluation criteria (a greedy sketch follows below).
– Novelty: a text that contains information similar to a higher-ranked one does not add much novelty.
– Coverage: it is preferable to make a combination of retrieval results that covers various meanings and aspects.
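The slides do not name a specific algorithm, so the sketch below uses a common greedy scheme in the spirit of maximal marginal relevance (MMR): repeatedly pick the candidate whose relevance score is high but whose similarity to the already selected results is low (novelty). The word-overlap similarity is a crude illustrative assumption.

```python
def jaccard(a: str, b: str) -> float:
    """Crude word-overlap similarity between two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def diversify(scores: dict[str, float], texts: dict[str, str],
              k: int, lam: float = 0.5) -> list[str]:
    """Greedy re-ranking: trade relevance against similarity to picked docs."""
    remaining = dict(scores)  # don't mutate the caller's dict
    selected: list[str] = []
    while remaining and len(selected) < k:
        def value(d):
            # novelty penalty: similarity to the most similar selected result
            penalty = max((jaccard(texts[d], texts[s]) for s in selected),
                          default=0.0)
            return lam * remaining[d] - (1 - lam) * penalty
        best = max(remaining, key=value)
        selected.append(best)
        del remaining[best]
    return selected

texts = {
    "d1": "jaguar the car maker",
    "d2": "jaguar car maker dealers",
    "d3": "jaguar the big cat in the wild",
}
print(diversify({"d1": 1.0, "d2": 0.9, "d3": 0.8}, texts, k=2))
# ['d1', 'd3']: d2 repeats d1's aspect, so the animal page takes the 2nd slot
```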

Page 12:

Comparison of Information Retrieval Systems

n Comparison of information retrieval systems
– Comparison of retrieval results using benchmark problems
• Cranfield Experiments (1960s)
– Comparison of retrieval results using the same document dataset (1,400 documents) and the same set of queries
– The first experiments using a test collection

Page 13:

Test Collection

n Dataset for evaluating an information retrieval system
– Document set
– A set of queries
– A list of relevant documents for each query
n Utilization of a test collection
– Database construction based on the document set
– Query formulation from the set of queries
– Evaluation of the retrieval results using the relevant documents lists

Page 14:

Construction of Test Collection

n Selection of a document set
– Web pages, newspaper articles, research papers
n Relevance judgment by the user who creates the query
– The user can judge relevance when the number of documents is small.
– It is impossible to judge the relevance of all documents when the number of documents is large.

Page 15:

Construction of Test Collection by Evaluation Workshop

n Selection of relevant document candidates using pooling (a sketch follows below)
– Candidates are selected by using various kinds of information retrieval systems.
– The candidate list (pool) is built from the top-ranked retrieval results of the various systems; the pooled documents are then judged to produce the relevant documents list.

[Diagram: retrieval results from System A, System B, and System C are merged into a pool, from which the relevant documents list is produced]
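A minimal pooling sketch with made-up runs: the pool is the union of each system's top-k results, and in a real workshop only the pooled documents are judged by assessors.

```python
runs = {
    "System A": ["d1", "d2", "d3", "d4"],
    "System B": ["d2", "d5", "d1", "d6"],
    "System C": ["d7", "d2", "d8", "d1"],
}

def make_pool(runs: dict[str, list[str]], k: int) -> set[str]:
    """Union of the top-k documents of every submitted run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

print(sorted(make_pool(runs, k=2)))  # ['d1', 'd2', 'd5', 'd7'] get judged
```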

Page 16:

Test Collections with Large Document Sets

n English
– TREC Collection
• Constructed by the evaluation workshop organized by NIST (U.S.)
– Collections based on newspapers, the Web, Twitter, …
– Monolingual, cross-lingual
n Japanese
– NTCIR Collection
• Constructed by the evaluation workshop organized by NII (Japan)
– Collections based on newspapers, the Web, Twitter, patents, …
– Monolingual, cross-lingual
– IREX Collection
• Constructed by the evaluation workshop organized by IREX (Japan)
– Collection based on newspapers

Page 17:

Examples of Evaluation Workshop Tasks (1)

n Ad hoc retrieval (monolingual retrieval)
– Retrieve documents from a database using queries in the same language
n Cross-lingual retrieval
– Retrieve documents in multiple languages from a database using queries in one language
– Information for handling multiple languages
– Language resources: dictionaries, translation systems
– Corpora
» Parallel corpora (aligned or non-aligned)
» Comparable corpora

Page 18:

Examples of Evaluation Workshop Tasks (2)

n Web retrieval task
– Utilization of a large document database and link structure analysis
– Focus on the precision of the top-ranked documents
n Patent retrieval
– Relevant patent retrieval for a given patent
– Retrieve relevant patents for a given news release

Page 19:

Example of a Test Collection (TREC Web Track)

n Query Example

<topic number="101" type="faceted">
  <query>ritz carlton lake las vegas</query>
  <description>
    Find information about the Ritz Carlton resort at Lake Las Vegas.
  </description>
  <subtopic number="1" type="inf">
    Find information about the Ritz Carlton resort at Lake Las Vegas.
  </subtopic>
  <subtopic number="2" type="nav">
    Find a site where I can determine room price and availability.
  </subtopic>
  <subtopic number="3" type="nav">
    Find directions to the Ritz Carlton Lake Las Vegas.
  </subtopic>
  <subtopic number="4" type="inf">
    Find reviews of the Ritz Carlton Lake Las Vegas.
  </subtopic>
</topic>
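Since a topic is plain XML, it can be read with Python's standard xml.etree.ElementTree; a minimal sketch:

```python
import xml.etree.ElementTree as ET

# The topic from the slide, abbreviated to two subtopics for brevity.
topic_xml = """
<topic number="101" type="faceted">
  <query>ritz carlton lake las vegas</query>
  <description>Find information about the Ritz Carlton resort at Lake Las Vegas.</description>
  <subtopic number="1" type="inf">Find information about the Ritz Carlton resort at Lake Las Vegas.</subtopic>
  <subtopic number="2" type="nav">Find a site where I can determine room price and availability.</subtopic>
</topic>
"""

topic = ET.fromstring(topic_xml)
print(topic.get("number"), topic.find("query").text)
for sub in topic.findall("subtopic"):
    # type="inf" marks an informational subtopic, type="nav" a navigational one
    print(sub.get("number"), sub.get("type"), sub.text.strip())
```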

Page 20:

Example of a Test Collection (TREC Web Track)

n Document database
– ClueWeb09: http://lemurproject.org/clueweb09/
• 1,040,809,705 web pages in 10 languages
• 5 TB compressed (25 TB uncompressed)
n Relevant documents list
– Multi-grade relevance model
• Relevant documents with multiple facets (informational, navigational)

Page 21:

Issues for Large-Scale Test Collections

n Consistency of relevance judgments
– It is difficult to keep the evaluation criteria for relevance judgments consistent.
n Comprehensiveness of relevant documents
– Various systems may find a variety of relevant documents, but there is no guarantee that the pool covers all relevant documents.
– When the size of the pool is small, the relevant documents list may miss documents that require semantic analysis.

Page 22:

Evaluation of the Retrieval System (1)

n Average precision
– Calculate precision each time a new relevant document is found; average this precision over all relevant documents (see the sketch below).
n 11-point average precision (recall-precision graph)
– Calculate (interpolated) precision at recall levels 0.0 to 1.0 in steps of 0.1.
– Precision at recall 0.0 is almost the same as the probability that the top-ranked document is relevant.

[Recall-precision graph: precision (0.0–1.0) plotted against recall (0.0–1.0) for two runs, labeled "original" and "relevance-based generalization"]
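A minimal sketch of both measures for a single ranked list. The interpolation rule used here (precision at recall level l is the maximum precision at any recall ≥ l) is the standard one for recall-precision graphs.

```python
def average_precision(ranking: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each newly found relevant document."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def eleven_point(ranking: list[str], relevant: set[str]) -> list[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]

ranking = ["d1", "x", "d2", "x", "x", "d3"]
relevant = {"d1", "d2", "d3"}
print(average_precision(ranking, relevant))  # (1/1 + 2/3 + 3/6) / 3 = 0.722...
print(eleven_point(ranking, relevant))       # [1.0, 1.0, 1.0, 1.0, 0.667, ...]
```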

Page 23:

Evaluation of the Retrieval System (2)

n λ precision
– Precision of the top λ (e.g., 5, 10, 15, 20, 30, 100) retrieval results
n R precision
– Precision of the top R (= number of all relevant documents) retrieval results
n Break-even point
– The value at which precision and recall are equal
n F measure
– Harmonic mean of precision and recall
• An evaluation measure that considers the balance between precision and recall

$F = \dfrac{2}{\dfrac{1}{Recall} + \dfrac{1}{Precision}}$
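A minimal sketch of the cutoff-based measures and the F measure. Note that at the R-precision cutoff, precision equals recall, which is also where the break-even point lies.

```python
def precision_at(k: int, ranking: list[str], relevant: set[str]) -> float:
    """λ precision: precision of the top-k results (R precision when k = |relevant|)."""
    return len([d for d in ranking[:k] if d in relevant]) / k

def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 / (1 / recall + 1 / precision)

ranking = ["d1", "x", "d2", "x", "x", "d3"]
relevant = {"d1", "d2", "d3"}
print(precision_at(5, ranking, relevant))              # λ = 5 -> 0.4
print(precision_at(len(relevant), ranking, relevant))  # R precision -> 0.666...
print(f_measure(2 / 3, 2 / 3))                         # 0.666... (P = R here)
```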

Page 24:

Evaluation of the Retrieval System (3)

n Evaluation measure for finding one relevant document
– Reciprocal rank: the reciprocal of the rank at which the top-ranked relevant document is found
n Evaluation measures that take the relevance level into account
– Cumulative gain (CG)
• Gain points based on the multi-grade relevance level (perfectly relevant: 3, relevant: 2, partially relevant: 1)
– Normalized cumulative gain (n-CG)
• Normalize cumulative gain by the maximal points gained by the ideal retrieval results
– Discounted cumulative gain (DCG)
• Discount the gain by the rank (e.g., divide by log2(rank))
– Normalized discounted cumulative gain (n-DCG)
• Normalize discounted cumulative gain by the maximal points gained by the ideal retrieval results

Page 25:

Calculation of the Cumulative Gain Family

n Two retrieval results (all relevant documents: A (perfectly relevant, gain 3), B (relevant, gain 2), C (partially relevant, gain 1))

Rank                 1     2     3     4     5     6
----------------------------------------------------
Result 1: A B C X X X
  CG                 3     5     6     6     6     6
  n-CG               1     1     1     1     1     1
  DCG                3     5     5.63  5.63  5.63  5.63
  n-DCG              1     1     1     1     1     1
Result 2: X X C B A X
  CG                 0     0     1     3     6     6
  n-CG               0     0     1/6   3/6   1     1
  DCG                0     0     0.63  1.63  2.92  2.92
  n-DCG              0     0     0.11  0.29  0.52  0.52
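A minimal sketch of the cumulative gain family that reproduces the numbers in the table above; it assumes the usual convention that no discount is applied at rank 1 (log2(1) would be zero).

```python
from math import log2

def cg(gains):
    """Cumulative gain at every rank."""
    out, total = [], 0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains):
    """Discounted cumulative gain: divide the gain at rank r >= 2 by log2(r)."""
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank == 1 else g / log2(rank)
        out.append(total)
    return out

def normalize(values, ideal):
    """Divide by the scores of the ideal ranking at every rank."""
    return [v / i for v, i in zip(values, ideal)]

gain = {"A": 3, "B": 2, "C": 1, "X": 0}
ideal = [gain[c] for c in "ABCXXX"]  # best possible ordering
run2 = [gain[c] for c in "XXCBAX"]

print(cg(run2))                          # [0, 0, 1, 3, 6, 6]
print(normalize(cg(run2), cg(ideal)))    # [0.0, 0.0, 0.166..., 0.5, 1.0, 1.0]
print([round(x, 2) for x in dcg(run2)])  # [0.0, 0.0, 0.63, 1.63, 2.92, 2.92]
print([round(x, 2) for x in normalize(dcg(run2), dcg(ideal))])
# [0.0, 0.0, 0.11, 0.29, 0.52, 0.52]
```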

Page 26:

Evaluation of the Retrieval System (4)

n Evaluation measures for finding one relevant document that take the relevance level into account
– Weighted reciprocal rank
• Revise the denominator using the relevance level of the found document: β_relevant = 2, β_partially relevant = 4
– Normalized weighted reciprocal rank

$Weighted\ reciprocal\ rank = \dfrac{1}{r - 1/\beta_g}$

$Normalized\ weighted\ reciprocal\ rank = \dfrac{1 - 1/\beta_{g^*}}{r - 1/\beta_g}$

Here r is the rank of a found relevant document, β_g is the weight for its relevance grade, and β_{g*} is the weight for the highest relevance grade, so an ideal result (a highest-grade document at rank 1) scores 1. When several relevant documents are found, the largest value is used (see the example on the next page).

Page 27:

Weighted Reciprocal Rank

n Four retrieval results (all relevant documents: A, B (relevant, β = 2), C (partially relevant, β = 4))

Result    Weighted Reciprocal Rank    Normalized Weighted Reciprocal Rank
ABCXXX    2                           1
XXAXXX    2/5                         1/5
XCBXAX    4/7                         2/7
XXCBAX    4/11                        2/11
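A minimal sketch that reproduces the table above. The slide's formula is read here as taking the best value of 1/(r − 1/β) over all relevant documents found, since that reading matches every entry in the table; treat this as an assumption recovered from the example.

```python
beta = {"A": 2, "B": 2, "C": 4}  # relevance weight per document; X is non-relevant

def wrr(ranking):
    """Best score 1 / (rank - 1/beta) over all relevant documents found."""
    scores = [1 / (rank - 1 / beta[d])
              for rank, d in enumerate(ranking, start=1) if d in beta]
    return max(scores, default=0.0)

def nwrr(ranking):
    """Normalize by the ideal score: a highest-grade document at rank 1."""
    ideal = 1 / (1 - 1 / min(beta.values()))  # smallest beta = highest grade
    return wrr(ranking) / ideal

for run in ["ABCXXX", "XXAXXX", "XCBXAX", "XXCBAX"]:
    print(run, round(wrr(run), 3), round(nwrr(run), 3))
# ABCXXX 2.0 1.0 / XXAXXX 0.4 0.2 / XCBXAX 0.571 0.286 / XXCBAX 0.364 0.182
```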

Page 28:

Summary

n Information Retrieval
– Retrieve documents based on the user's information need
• Retrieval model
• Query processing
• Evaluation framework