Phrase Identification from Queries and Its Use for Web Search
Fuchun Peng, Microsoft Bing, 7/23/2010
1
Phrase Identification from Queries and Its Use for Web Search
Fuchun Peng, Microsoft Bing, 7/23/2010
2
Motivation
Query is often treated as a bag of words
But when people are formulating queries, they use “concepts” as building blocks
Q: simmons college sports psychology
A1: “simmons college”, “sports psychology” (simmons college’s sports psychology course)
A2: “college sports”
• Can we automatically segment the query to recover the concepts?
3
Outline
Summary of Segmentation approaches
Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
Conclusions
4
Supervised Segmentation
Supervised learning (Bergsma et al., EMNLP-CoNLL07)
◦ Binary decision at each possible segmentation point
◦ Features: POS, web counts, the, and, …
w1 w2 w3 w4 w5
  N  N  Y  Y
• Problem:
– Limited-range context
– Features specifically designed for noun phrases
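A minimal sketch of how the per-gap Y/N decisions above turn into a segmentation. The function name is an illustrative assumption, and the classifier that produces the decisions (the part described by Bergsma et al.) is not shown.

```python
def decisions_to_segments(words, decisions):
    """Turn per-gap boundary decisions into segments.

    decisions[i] is True ("Y") when a boundary is placed between
    words[i] and words[i + 1], so len(decisions) == len(words) - 1.
    """
    if len(decisions) != len(words) - 1:
        raise ValueError("need exactly one decision per gap between words")
    segments, current = [], [words[0]]
    for word, is_boundary in zip(words[1:], decisions):
        if is_boundary:
            segments.append(current)
            current = [word]
        else:
            current.append(word)
    segments.append(current)
    return segments


# Decisions N N Y Y over w1..w5, as drawn on the slide.
print(decisions_to_segments(["w1", "w2", "w3", "w4", "w5"],
                            [False, False, True, True]))
```

Each decision only looks at one gap, which is where the limited-range context problem noted above comes from.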
5
Training Data Annotation
Manual Data Preparation
◦ Linguistic driven: [San jose international airport]
◦ Relevance driven: [San jose] [international airport]
6
Mutual-information based (Risvik et al. WWW 2003)
w1 w2 w3 w4 w5
[Figure: MI of adjacent word pairs (1,2), (2,3), (3,4), (4,5) plotted against a threshold]
MI(w1,w2) = P(w1w2) / (P(w1) P(w2))
Insert segment boundary: w1 w2 | w3 w4 w5
• Problem:
– Only captures short-range correlation (between adjacent words)
– What about "my heart will go on"?
Iterative update
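A minimal sketch of threshold-based segmentation with the MI ratio defined on this slide. It assumes a boundary is inserted wherever the MI of two adjacent words falls below the threshold; the probabilities and the threshold value are toy numbers invented for illustration, and the iterative update step is not modeled.

```python
def mi_segment(words, unigram_p, bigram_p, threshold):
    """Split a query wherever MI of two adjacent words is below threshold.

    MI(w1, w2) = P(w1 w2) / (P(w1) * P(w2)), as on the slide.
    """
    segments, current = [], [words[0]]
    for left, right in zip(words, words[1:]):
        joint = bigram_p.get((left, right), 0.0)
        mi = joint / (unigram_p[left] * unigram_p[right])
        if mi < threshold:
            segments.append(current)   # weak association: start a new segment
            current = [right]
        else:
            current.append(right)      # strong association: keep together
    segments.append(current)
    return segments


# Toy probabilities, invented for illustration only.
unigram_p = {"new": 0.01, "york": 0.002, "cheap": 0.005, "hotels": 0.004}
bigram_p = {("new", "york"): 0.0015,
            ("york", "cheap"): 0.00001,
            ("cheap", "hotels"): 0.001}

print(mi_segment(["new", "york", "cheap", "hotels"],
                 unigram_p, bigram_p, threshold=10.0))
```

Only adjacent pairs are scored, so a long phrase whose inner pairs are individually weak can be split even when the phrase as a whole is a single concept.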
7
Frequency Based Approach (Hagen et al. SIGIR 2010)
8
LM Based Approach (Tan & Peng WWW 2008)
Assume the query is generated by independent sampling from a probability distribution of concepts (unigram model):
[simmons college] [sports psychology]
P(simmons college) = 0.000016, P(sports psychology) = 0.000002
P = 0.000016 × 0.000002
>
[simmons] [college sports] [psychology]
P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024
P = 0.000007 × 0.000006 × 0.000024
• Enumerate all possible segmentations; rank by probability of being generated by the unigram model
• How to estimate parameters P(w) for the unigram model?
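The enumerate-and-rank step can be sketched as follows, using the five concept probabilities quoted on this slide. The function names and the `floor` probability given to concepts that are not in the table are assumptions made for this illustration; brute-force enumeration is only practical for short queries, since a query of n words has 2^(n-1) segmentations.

```python
from itertools import product
from math import prod


def segmentations(words):
    """Yield every way to split words into contiguous segments."""
    for cuts in product([False, True], repeat=len(words) - 1):
        segs, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                segs.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segs.append(" ".join(current))
        yield segs


def rank_segmentations(words, concept_p, floor=1e-20):
    """Score each segmentation by the product of its concept probabilities.

    Concepts missing from concept_p get the (assumed) floor probability.
    """
    scored = [(prod(concept_p.get(s, floor) for s in segs), segs)
              for segs in segmentations(words)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Concept probabilities quoted on the slide.
concept_p = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}

ranked = rank_segmentations("simmons college sports psychology".split(), concept_p)
for score, segs in ranked[:2]:
    print(score, segs)
```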
9
Parameter (Concept Prob.) Estimation I
We have ngram (n=1..5) counts in a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer ngrams are often computable: e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
  #(ABC) = #(AB) + #(BC) - #(AB OR BC) >= #(AB) + #(BC) - #(B)
  Solved by DP
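A minimal sketch of how the bound above could be applied recursively to n-grams longer than the stored order. The lower bound is the inequality from the slide; the upper bound min(#(AB), #(BC)) is an added assumption (a longer n-gram cannot occur more often than any of its parts). The recursion with memoization is one possible reading of "solved by DP", and the counts are toy numbers, not corpus statistics.

```python
from functools import lru_cache


def make_bounder(counts, max_n):
    """Return a function giving (lower, upper) bounds on an n-gram's count.

    counts maps space-joined n-grams of length <= max_n to exact counts.
    max_n must be >= 2 so that the shared middle part B is never empty.
    """
    if max_n < 2:
        raise ValueError("max_n must be at least 2")

    @lru_cache(maxsize=None)
    def bounds(tokens):
        # tokens must be a tuple so that results can be memoized
        if len(tokens) <= max_n:
            c = counts.get(" ".join(tokens), 0)
            return c, c
        ab_lo, ab_hi = bounds(tokens[:-1])   # AB: drop the last word
        bc_lo, bc_hi = bounds(tokens[1:])    # BC: drop the first word
        _, b_hi = bounds(tokens[1:-1])       # B: the shared middle
        lower = max(0, ab_lo + bc_lo - b_hi)  # #(ABC) >= #(AB) + #(BC) - #(B)
        upper = min(ab_hi, bc_hi)
        return lower, upper

    return bounds


# Toy counts, invented for illustration; only n-grams up to length 2 are stored.
toy_counts = {"new york": 100, "york city": 80, "york": 120}
bounds = make_bounder(toy_counts, max_n=2)
print(bounds(("new", "york", "city")))
```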
10
Parameter Estimation
Maximum Likelihood Estimate: P_MLE(t) = #(t) / N
Problem:
◦ #(potter and the goblet of) = 6765
◦ P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
◦ Not the prob. of seeing t in text, but the prob. of seeing t as a self-contained concept in text
11
Parameter Estimation: Query-relevant web corpus
Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)
t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)
θ = argmax_θ P(D|θ) P(θ)   (posterior prob.)
  = argmax_θ [ log P(D|θ) + log P(θ) ]   (first term: DL of corpus; second term: DL of parameters)
log P(D|θ) = ∑_t C(t) · log P(t|θ)
P(t|θ) = ∑_{s(t)} P(s(t)|θ)

ngram                                  longest matching count   raw frequency
harry                                  1657108                  2003112
harry potter                           277736                   346004
harry potter and                       10436                    68268
harry potter and the                   51330                    57832
harry potter and the goblet            101                      6502
harry potter and the goblet of         618                      6401
harry potter and the goblet of fire    5783                     5783
...                                    ...                      ...
fire                                   4200957                  4478774
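A minimal sketch of evaluating the corpus likelihood term for a given θ. P(t|θ) sums over all segmentations of t, which a forward-style dynamic program computes without enumerating them. The function names and the toy θ are assumptions for illustration; the prior term log P(θ) and the optimization over θ are not shown.

```python
import math


def substring_prob(tokens, theta):
    """P(t | theta): sum over all segmentations of t of the product of
    concept probabilities, computed with a forward DP over end positions."""
    n = len(tokens)
    alpha = [0.0] * (n + 1)   # alpha[k]: total probability of tokens[:k]
    alpha[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            p = theta.get(" ".join(tokens[start:end]), 0.0)
            alpha[end] += alpha[start] * p
    return alpha[n]


def log_likelihood(corpus, theta):
    """log P(D | theta) = sum over (t, C(t)) of C(t) * log P(t | theta).

    Raises ValueError if some t has zero probability under theta.
    """
    return sum(count * math.log(substring_prob(t.split(), theta))
               for t, count in corpus)


# Toy parameters and counts, invented for illustration only.
toy_theta = {"harry": 0.5, "potter": 0.3, "harry potter": 0.2}
toy_corpus = [("harry potter", 2), ("harry", 1)]
print(log_likelihood(toy_corpus, toy_theta))
```

Here P(harry potter | θ) combines the single-concept reading (0.2) with the two-concept reading (0.5 × 0.3).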
12
System Architecture
13
Evaluation – Data sets
Three human-segmented datasets
◦ 3 data sets, for training, validation, and testing, 500 queries for each set
◦ Segmented by three editors A, B, C
14
Evaluation – metrics
Evaluation metric:
◦ Boundary classification accuracy
  w1 w2 w3 w4 w5
    N  N  Y  Y
◦ Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
◦ Segment accuracy: the percentage of segments being recovered
  Truth: [abc] [de] [fg]
  Prediction: [abc] [de fg]: precision
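The three metrics can be sketched as below. The function names are illustrative, and the example reads the slide's [abc] [de] [fg] as seven one-letter words, which is an assumption about the notation.

```python
def boundaries(segments):
    """Word offsets at which a segment boundary is placed."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def spans(segments):
    """Set of (start, end) word offsets, one per segment."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out


def boundary_accuracy(truth, pred):
    """Fraction of gaps between words whose Y/N decision matches the truth."""
    n_words = sum(len(seg) for seg in truth)
    if n_words < 2:
        return 1.0  # a one-word query has no gaps to classify
    t, p = boundaries(truth), boundaries(pred)
    agree = sum((g in t) == (g in p) for g in range(1, n_words))
    return agree / (n_words - 1)


def segment_precision_recall(truth, pred):
    """Precision and recall of predicted segments against true segments."""
    t, p = spans(truth), spans(pred)
    correct = len(t & p)
    return correct / len(p), correct / len(t)


def whole_query_accuracy(pairs):
    """Fraction of (truth, pred) pairs that are segmented identically."""
    return sum(truth == pred for truth, pred in pairs) / len(pairs)


truth = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
pred = [["a", "b", "c"], ["d", "e", "f", "g"]]
print(boundary_accuracy(truth, pred))
print(segment_precision_recall(truth, pred))
```

In this example only one of the two predicted segments is a true segment, and only one of the three true segments is recovered.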
15
Results
16
Results I
17
Outline
Summary of Segmentation approaches
Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
Conclusions
18
Use for Improving Relevance
Phrase Proximity Boosting
Phrase Level Query Expansion
19
Phrase Proximity Boosting
Classifying a segment into one of three categories
◦ Strong concept: no word reordering, no word insertion/deletion
  Treat the whole segment as a single unit in matching and ranking
◦ Weak concept: allow word reordering or deletion/insertion
  Boost documents matching the weak concepts
◦ Not a concept
  Do nothing
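A rough sketch of the difference between matching a strong concept and a weak concept against a document. The function names, the window-based definition of a weak match, and the `slack` parameter are assumptions made for illustration; the slide does not specify how matching is implemented, and word deletion is not modeled here.

```python
def matches_strong(doc_tokens, phrase_tokens):
    """True if the phrase occurs contiguously and in the same order."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def matches_weak(doc_tokens, phrase_tokens, slack=2):
    """True if all phrase words fall inside some window of
    len(phrase_tokens) + slack tokens, in any order."""
    needed = set(phrase_tokens)
    width = len(phrase_tokens) + slack
    return any(needed <= set(doc_tokens[i:i + width])
               for i in range(len(doc_tokens)))


doc = "flights to san jose international airport".split()
print(matches_strong(doc, ["san", "jose"]))   # exact phrase, same order
print(matches_strong(doc, ["jose", "san"]))   # reordered words
print(matches_weak(doc, ["jose", "san"]))     # reordering tolerated
```

A ranker could treat a strong match as a hard requirement and a weak match as a score boost, in line with the two treatments listed above.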
20
Phrase Proximity Boosting
Concept based BM25
◦ Weighted by the confidence of concepts
Concept based min coverage
◦ Weighted by the confidence of concepts
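The slide does not give a formula for concept based BM25, so the following is only one possible reading: the standard Okapi BM25 term score, computed with a concept (phrase) in place of a single word and multiplied by the segmenter's confidence in that concept. The function names, the default k1 and b values, and all inputs are assumptions for illustration.

```python
import math


def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one term (here: one concept) in one document.

    tf: occurrences of the concept in the document
    df: number of documents containing the concept
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm


def concept_bm25(concepts, n_docs, doc_len, avg_doc_len):
    """Sum of per-concept BM25 scores, each scaled by its confidence.

    concepts: iterable of (tf, df, confidence) tuples, one per query concept.
    """
    return sum(conf * bm25_term(tf, df, n_docs, doc_len, avg_doc_len)
               for tf, df, conf in concepts)


# Hypothetical statistics for a two-concept query.
print(concept_bm25([(3, 100, 0.9), (1, 2000, 0.4)],
                   n_docs=10000, doc_len=120, avg_doc_len=100))
```

Under this reading, a low-confidence concept contributes proportionally less to the score, so a doubtful segmentation cannot dominate ranking.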
21
Phrase Based Expansion
Phrase level replacement
◦ [San Francisco] -> [sf]
◦ [red eye flight] -> [late night flight]
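A minimal sketch of phrase level replacement on an already segmented query. The two rewrite entries are the examples from the slide; the function name, the table layout, and the choice to generate every combination of original and rewritten segments are assumptions for illustration.

```python
from itertools import product

# Rewrite table; the two entries are the examples from the slide.
REWRITES = {
    "san francisco": ["sf"],
    "red eye flight": ["late night flight"],
}


def expand_query(segments, rewrites=REWRITES):
    """Return all query variants obtained by keeping or replacing each segment.

    The first variant is always the original query.
    """
    options = [[seg] + rewrites.get(seg, []) for seg in segments]
    return [" ".join(choice) for choice in product(*options)]


for variant in expand_query(["red eye flight", "san francisco"]):
    print(variant)
```

Because replacement works on whole segments, "red eye flight" is rewritten as a unit, and the word "red" is never rewritten on its own.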
22
Relevance Results
Significant relevance boosting
◦ Affects 40% query traffic
◦ Significant DCG gain (1.5% for affected queries)
◦ Significant online CTR gain (0.5% over all)
23
Outline
Summary of Segmentation approaches
Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
Conclusions
24
Conclusions
Data is important for query segmentation
Phrases are important for improving relevance
25
References
Bergsma et al., EMNLP-CoNLL07
Risvik et al., WWW 2003
Hagen et al., SIGIR 2010
Tan & Peng, WWW 2008
Thank you!
26
27
Parameter Estimation II
Solution 1: Offline segment the web corpus, then collect counts for ngrams being segments
... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ...
harry potter and the goblet of fire += 1
potter and the goblet of += 0
• Technical difficulties
C. G. de Marcken, Unsupervised Language Acquisition, 96
Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01
28
Parameter Estimation III
Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...
Q = harry potter and the goblet of fire
harry potter and the goblet of fire += 1
the += 2
harry potter += 1
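The counting on this slide can be reproduced with a simple scan. This is a sketch under an assumption: matches are found greedily from left to right, taking at each text position the longest run of tokens that also appears contiguously in the query, and matches do not overlap. The function name and the whitespace tokenization are illustrative choices.

```python
from collections import Counter


def longest_match_counts(text_tokens, query_tokens):
    """Count, for each query substring, how often it is the longest match
    found while scanning the text greedily from left to right."""
    counts = Counter()
    i = 0
    while i < len(text_tokens):
        best = 0
        # Try every start position in the query; keep the longest match.
        for s in range(len(query_tokens)):
            k = 0
            while (s + k < len(query_tokens)
                   and i + k < len(text_tokens)
                   and query_tokens[s + k] == text_tokens[i + k]):
                k += 1
            best = max(best, k)
        if best:
            counts[" ".join(text_tokens[i:i + best])] += 1
            i += best          # skip past the matched span
        else:
            i += 1             # token does not occur in the query
    return counts


text = ("Harry Potter and the Goblet of Fire is the fourth novel in the "
        "Harry Potter series written by J.K. Rowling").lower().split()
query = "harry potter and the goblet of fire".split()
print(dict(longest_match_counts(text, query)))
```

On the slide's sentence this yields the same three increments: the full title once, "the" twice, and "harry potter" once.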
29
30
Parameter Estimation III
Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...
Q = potter and the goblet
potter and the goblet += 1
the += 2
potter += 1
Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
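One way to read "directly compute longest matching counts using raw ngram frequency" is by inclusion-exclusion: take the raw frequency of the substring, subtract the occurrences that can be extended to the left or to the right within the query, and add back those that can be extended both ways. This formula is an assumption of this sketch, not something the slides state, and it assumes each substring occurs only once in the query. With O(|Q|^2) substrings and a constant number of lookups each, it matches the stated complexity. The raw frequencies below are the prefix rows of the table on the earlier parameter estimation slide.

```python
def longest_matching_count(query_tokens, i, j, raw):
    """Longest matching count of query_tokens[i:j] from raw frequencies.

    raw maps space-joined n-grams to raw corpus frequencies. A missing
    n-gram raises KeyError rather than being silently treated as zero.
    """
    def freq(a, b):
        if a < 0 or b > len(query_tokens):
            return 0  # no such extension exists inside the query
        return raw[" ".join(query_tokens[a:b])]

    return (freq(i, j)
            - freq(i - 1, j)        # extendable one word to the left
            - freq(i, j + 1)        # extendable one word to the right
            + freq(i - 1, j + 1))   # extendable on both sides


query = "harry potter and the goblet of fire".split()
raw = {
    "harry": 2003112,
    "harry potter": 346004,
    "harry potter and": 68268,
    "harry potter and the": 57832,
    "harry potter and the goblet": 6502,
    "harry potter and the goblet of": 6401,
    "harry potter and the goblet of fire": 5783,
}

# Only prefixes of the query are evaluated, since the table lists raw
# frequencies for those n-grams only.
prefix_counts = [longest_matching_count(query, 0, j, raw)
                 for j in range(1, len(query) + 1)]
print(prefix_counts)
```

For these prefixes the result agrees with the longest matching count column of that table, for example 2003112 - 346004 = 1657108 for "harry".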