Enhancing Biomedical Text Rankers by
Term Proximity Information
劉瑞瓏 (Rey-Long Liu), Department of Medical Informatics, Tzu Chi University
2012/06/13
Outline
• Background
– Text ranking
– Biomedical information needs
• An approach to enhancing text rankers in the biomedical domain
• Evaluation
• Conclusion
Research Background
Text Ranking
• Goal
– Given a query q and a set T of texts retrieved for q, rank the texts in T according to their degrees of relevance to q
• Motivation
– Reducing information overload, since T is often quite large even when a smart search engine is used
– Text ranking is a key issue in information retrieval, and often a “secret” component of search engines
An Example Ranker
Biomedical Information Need
• Biomedical research requires relevant evidence from the huge and ever-growing biomedical literature
• Retrieving that evidence requires a system that
– Accepts a natural language query for a biomedical information need, and
– Ranks relevant texts higher for access or processing
An Example
• Query: urinary tract infection, criteria for treatment and admission (from OHSUMED)
– A disease as the target concept (i.e., urinary tract infection)
– Two concepts about the scenario of the information need (i.e., treatment and admission)
• Neither specific to nor related to any particular disease
Contextual Completeness
• Biomedical queries need to be well-formed, and so call for a retrieval system that considers contextual completeness of each query concept t in the text d
– Contextual completeness of t in d is the extent to which the query concepts other than t appear in nearby areas in d
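The definition above can be illustrated with a deliberately simple sketch. It assumes that "nearby" means a fixed window of tokens around an occurrence of t, and that completeness is the best fraction of the other query concepts found in any such window. The function name `contextual_completeness`, the `window` parameter, and the sample text are hypothetical; this is not the measure PRE itself uses.

```python
from typing import List, Sequence


def contextual_completeness(term: str, query_terms: Sequence[str],
                            tokens: List[str], window: int = 5) -> float:
    """Best fraction of the other query terms found within `window` tokens
    of any single occurrence of `term` (0.0 if `term` is absent from the text)."""
    others = {q for q in query_terms if q != term}
    if not others:
        return 0.0
    best = 0.0
    for pos, tok in enumerate(tokens):
        if tok != term:
            continue
        # Hard window around this occurrence (an assumption of this sketch).
        lo, hi = max(0, pos - window), pos + window + 1
        nearby = set(tokens[lo:hi])
        best = max(best, len(others & nearby) / len(others))
    return best


# Hypothetical text and query, loosely modelled on the OHSUMED example.
doc = "criteria for admission and treatment of urinary infection in adults".split()
query = ["infection", "treatment", "admission"]
score = contextual_completeness("infection", query, doc, window=5)
```

Shrinking `window` lowers the score for the same text, which is the behaviour the definition asks for: the farther away the other concepts are, the less complete the context of t.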
An Example
• In children with an acute febrile illness, what is the efficacy of single medication therapy with acetaminophen or ibuprofen in reducing fever? [From Lin & Demner-Fushman, 2006]
[Figure labels: PICO, Task, Answer, Strength]
An Approach to Improving Rankers for Biomedical Info Needs
Goals
• An approach PRE (Proximity-based Ranker Enhancer) that
– Measures contextual completeness of query concepts appearing in a nearby area in the text
– Serves as a supplement to improve existing rankers
Contrast with Related Work
• Biomedical text ranking
– Using synonyms and considering diversity of passages, without considering term proximity
• Text ranking
– Individual text scoring techniques (e.g., BM25) and learning to rank techniques (e.g., Ranking SVM), without considering term proximity
• Improving ranking by term proximity
– Term proximity is employed, but contextual completeness is not considered
System Overview
[Figure: system overview. Labels: Text Ranker Development; Training; Testing; Underlying Ranker; PRE; Text Ranking; TF in d; User; Query (q); Text (d); TF (Term Frequency) Assessment; Training Data; Ranked Texts]
TF Assessment
• Three types of term proximity
– Overall proximity (QTermTF)
– Individual proximity (IndiP)
– Collective proximity (CollP)
• A term t may get a large TF increment in d, if
– Many query terms appear frequently in d
– Query terms are individually near to t at some places, and
– Query terms collectively appear at a place near to t
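• $\mathrm{RTF}(t,d,q) = \mathrm{TF}(t,d) + \mathrm{TFincrement}(t,d,q)$
• $\mathrm{TFincrement}(t,d,q) = \mathrm{QTermTF}(d,q)^{\mathrm{IndiP}(t,d,q) \times \mathrm{CollP}(t,d,q)}$
• $\mathrm{QTermTF}(d,q)$ = Total TF of query terms in d
• $\mathrm{IndiP}(t,d,q) = \sum_{m \in M - \{t\}} \mathrm{SigmoidWeight}(\mathrm{Mindist}(t,m)) / \mathrm{MaxIndiP}$
• $\mathrm{Mindist}(x,y)$ = shortest distance between x and y in d
• $\mathrm{SigmoidWeight}(dt) = 1 / (1 + e^{-((|q|-1) - dt)})$
• $\mathrm{CollP}(t,d,q) = \max_{k \in K} \{ \sum_{m \in M - \{t\}} \mathrm{SigmoidWeight}(\mathrm{dist}(t,k,m)) \} / \mathrm{MaxCollP}$, where K is the set of positions at which t appears in d
• $\mathrm{dist}(t,k,m)$ = Distance between t (at position k) and m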
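The TF adjustment on this slide can be sketched in a few lines of Python. Several points are assumptions that the slide does not settle: TFincrement is read as QTermTF raised to the power IndiP × CollP; M is taken to be the query terms that occur in d; dist(t,k,m) is taken to be the distance from position k to the nearest occurrence of m; and MaxIndiP and MaxCollP, which the slide leaves undefined, are both set to the value reached when every other query term sits at distance 1 from t. The names `sigmoid_weight`, `term_positions`, and `rtf` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def sigmoid_weight(dist: float, query_len: int) -> float:
    """SigmoidWeight(dt) = 1 / (1 + e^{-((|q|-1) - dt)})."""
    return 1.0 / (1.0 + math.exp(-((query_len - 1) - dist)))


def term_positions(tokens: Sequence[str]) -> Dict[str, List[int]]:
    """Map each token to the list of positions at which it occurs."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return positions


def rtf(term: str, query_terms: Sequence[str], tokens: Sequence[str]) -> float:
    """RTF(t,d,q) = TF(t,d) + TFincrement(t,d,q), under the stated assumptions."""
    pos = term_positions(tokens)
    k_positions = pos.get(term, [])          # K: positions of t in d
    tf = len(k_positions)
    if tf == 0:
        return 0.0                           # assumption: no adjustment if t is absent
    distinct = set(query_terms)
    q_len = len(distinct)
    # M - {t}: the other query terms that occur in d (assumption about M).
    others = [m for m in distinct if m != term and m in pos]
    # Overall proximity: total TF of query terms in d.
    qterm_tf = sum(len(pos.get(m, [])) for m in distinct)
    # Assumed normaliser for both IndiP and CollP.
    max_prox = (q_len - 1) * sigmoid_weight(1, q_len)
    if not others or max_prox == 0:
        indi_p = coll_p = 0.0
    else:
        # Individual proximity: each m at its own closest approach to t.
        indi_p = sum(
            sigmoid_weight(min(abs(k - j) for k in k_positions for j in pos[m]), q_len)
            for m in others
        ) / max_prox
        # Collective proximity: the single position k of t with the best total.
        coll_p = max(
            sum(sigmoid_weight(min(abs(k - j) for j in pos[m]), q_len) for m in others)
            for k in k_positions
        ) / max_prox
    return tf + qterm_tf ** (indi_p * coll_p)


# Hypothetical texts: query terms adjacent to each other vs. spread far apart.
q = ["infection", "treatment", "admission"]
near = "patients with infection treatment admission criteria were reviewed".split()
far = ("infection was common but many other unrelated words appear here "
       "before any treatment or admission").split()
rtf_near = rtf("infection", q, near)
rtf_far = rtf("infection", q, far)
```

Both texts contain each query term once, so TF and QTermTF are identical; only the proximity terms differ, and the text with adjacent query terms receives the larger revised TF. The revised value would then replace TF(t,d) as input to the underlying ranker.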
Empirical Evaluation
Experimental Data
• OHSUMED
– A popular database of biomedical queries and references
– 106 queries
– 348,566 references
– 16,140 query-reference pairs
• Definitely relevant
• Possibly relevant
• Not relevant
• TREC Genomics 2006
– 28 queries (topics) and 27,999 query-passage pairs
• Definitely relevant, possibly relevant, and not relevant
– 13,993 query-reference pairs
• TREC Genomics 2007
– 36 queries and 35,996 query-passage pairs
• Relevant and not relevant
– 22,913 query-reference pairs
Underlying Rankers
Baseline Ranker Enhancer
• Three state-of-the-art techniques that enhanced text rankers by term proximity
– The t-function: t() [Tao & Zhai, 2007]
– The p-function: p() [Cummins & O’Riordan, 2009]
– The proximity language model: PLM [Zhao & Yun, 2009]
Evaluation Criteria
• Evaluating how relevant references are ranked higher for users to access
– Mean average precision (MAP)
– Normalized discounted cumulative gain at X (NDCG@X)
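As a reminder of how the two criteria are computed, here is a minimal sketch. It assumes each ranked list contains all judged references for its query, so average precision is normalised by the number of relevant items in the list. For NDCG it assumes the common gain form (2^grade − 1) / log2(rank + 1) and a hypothetical grade mapping of 2, 1, 0 for the three relevance levels. The slides do not say which variants were used in the experiments.

```python
import math
from typing import Sequence


def average_precision(relevant: Sequence[bool]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """MAP: average of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)


def dcg_at(grades: Sequence[int], x: int) -> float:
    """Discounted cumulative gain over the top x ranks (assumed gain form)."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:x], start=1))


def ndcg_at(grades: Sequence[int], x: int) -> float:
    """DCG@x divided by the DCG@x of the ideal (grade-sorted) ordering."""
    ideal = dcg_at(sorted(grades, reverse=True), x)
    return dcg_at(grades, x) / ideal if ideal > 0 else 0.0


# Hypothetical ranked lists for illustration only.
ap = average_precision([True, False, True])
map_score = mean_average_precision([[True], [False, True]])
ndcg = ndcg_at([2, 0, 1], 3)
```

Both measures reward placing relevant references near the top of the list; NDCG additionally distinguishes graded relevance levels, which matters for OHSUMED and TREC Genomics 2006.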
Results
Conclusion
• Contextual completeness of query concepts in the texts is essential in ranking biomedical texts
• To measure contextual completeness, it is helpful to integrate three types of term proximity
– Overall proximity
– Individual proximity
– Collective proximity
• Existing rankers may be comprehensively enhanced
Thank You!