phrase identification from queries and its use for web search

1

Phrase Identification from Queries and Its Use for Web Search

Fuchun Peng, Microsoft Bing, 7/23/2010

2

Motivation
A query is often treated as a bag of words
But when people formulate queries, they use “concepts” as building blocks

Q: simmons college sports psychology
A1: “simmons college”, “sports psychology” (simmons college’s sports psychology (course))
A2: “college sports”

• Can we automatically segment the query to recover the concepts?

3

Outline
Summary of segmentation approaches
Use for improving search relevance
◦ Query rewriting
◦ Ranking features
Conclusions

4

Supervised Segmentation
Supervised learning (Bergsma et al., EMNLP-CoNLL 2007)
◦ Binary decision at each possible segmentation point
◦ Features: POS, web counts, the, and, …

w1 w2 w3 w4 w5
  N  N  Y  Y

• Problem:
– Limited-range context
– Features specifically designed for noun phrases
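The binary-decision view can be illustrated with a minimal Python sketch that turns one Y/N decision per gap into segments. It assumes the figure's labels read left to right as N, N, Y, Y; the function name `segments_from_decisions` is hypothetical, and the classifier that would produce the decisions is not shown.

```python
def segments_from_decisions(words, decisions):
    """Split `words` using one "Y"/"N" decision per gap between adjacent words."""
    if len(decisions) != len(words) - 1:
        raise ValueError("need exactly one decision per gap")
    segments, current = [], [words[0]]
    for word, decision in zip(words[1:], decisions):
        if decision == "Y":       # boundary before `word`: close the segment
            segments.append(current)
            current = [word]
        else:                      # "N": `word` extends the current segment
            current.append(word)
    segments.append(current)
    return segments


words = ["w1", "w2", "w3", "w4", "w5"]
print(segments_from_decisions(words, ["N", "N", "Y", "Y"]))
```

Under that reading of the figure, the five words fall into the segments w1 w2 w3, w4, and w5.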

5

Training Data Annotation
Manual data preparation
◦ Linguistically driven: [San jose international airport]
◦ Relevance driven: [San jose] [international airport]

6

Mutual-information based (Risvik et al., WWW 2003)

w1 w2 w3 w4 w5
[Figure: MI of each adjacent word pair (1,2), (2,3), (3,4), (4,5), compared against a threshold]

MI(w1,w2) = P(w1w2) / (P(w1) P(w2))

Insert segment boundary: w1w2 | w3w4w5

Iterative update

• Problem:
– Only captures short-range correlation (between adjacent words)
– What about “my heart will go on”?
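A minimal Python sketch of the thresholding step, under the assumption that a boundary is inserted wherever the MI ratio of an adjacent pair falls below the threshold. The function name `mi_segment`, the probabilities, and the threshold are made up for illustration and do not come from any corpus.

```python
def mi_segment(words, unigram_p, bigram_p, threshold):
    """Segment `words` by thresholding MI(w_i, w_i+1) = P(w_i w_i+1) / (P(w_i) P(w_i+1))."""
    segments, current = [], [words[0]]
    for left, right in zip(words, words[1:]):
        mi = bigram_p.get((left, right), 0.0) / (unigram_p[left] * unigram_p[right])
        if mi < threshold:         # weak association: assumed to mark a boundary
            segments.append(current)
            current = [right]
        else:                      # strong association: keep the words together
            current.append(right)
    segments.append(current)
    return segments


# Hypothetical probabilities, chosen only to make the example readable.
unigram_p = {"simmons": 1e-5, "college": 1e-3, "sports": 1e-3, "psychology": 1e-4}
bigram_p = {
    ("simmons", "college"): 5e-6,
    ("college", "sports"): 2e-6,
    ("sports", "psychology"): 1e-5,
}
query = "simmons college sports psychology".split()
print(mi_segment(query, unigram_p, bigram_p, threshold=10.0))
```

Because only adjacent pairs are scored, a long concept such as “my heart will go on” can be split at any weakly associated pair inside it, which is the short-range limitation noted on the slide.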

7

Frequency Based Approach (Hagen et al., SIGIR 2010)

8

LM Based Approach (Tan & Peng, WWW 2008)

Assume the query is generated by independent sampling from a probability distribution of concepts:

Unigram model

[simmons college] [sports psychology]
P(simmons college) = 0.000016, P(sports psychology) = 0.000002
P = 0.000016 × 0.000002

>

[simmons] [college sports] [psychology]
P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024
P = 0.000007 × 0.000006 × 0.000024

• Enumerate all possible segmentations; Rank by probability of being generated by the unigram model

• How to estimate parameters P(w) for the unigram model?
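The enumerate-and-rank step can be sketched in a few lines of Python. The concept probabilities are the ones quoted on the slide; the function names are hypothetical, and skipping any segmentation that contains a segment without a listed probability is an assumption made to keep the example small.

```python
from itertools import product
from math import prod


def enumerate_segmentations(words):
    """Yield every segmentation of `words` (2^(n-1) ways to cut n words)."""
    n = len(words)
    for cuts in product([False, True], repeat=n - 1):
        segs, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:                          # a boundary after word i
                segs.append(" ".join(words[start:i]))
                start = i
        segs.append(" ".join(words[start:]))
        yield segs


def rank_segmentations(words, concept_p):
    """Score each segmentation as a product of concept probabilities, best first."""
    scored = []
    for segs in enumerate_segmentations(words):
        if all(s in concept_p for s in segs):   # assumption: skip unknown segments
            scored.append((prod(concept_p[s] for s in segs), segs))
    return sorted(scored, reverse=True)


concept_p = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}
for p, segs in rank_segmentations("simmons college sports psychology".split(), concept_p):
    print(f"{p:.3e}", segs)
```

Exhaustive enumeration is exponential in the query length, which is tolerable for short web queries; a dynamic program over prefixes would be the usual alternative for longer inputs.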

9

Parameter (Concept Prob.) Estimation I
We have ngram (n=1..5) counts in a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer ngrams are often computable: e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
   #(ABC) = #(AB) + #(BC) - #(AB OR BC)
          >= #(AB) + #(BC) - #(B)
   Solved by DP
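A small Python sketch of the bound for a single split ABC. The lower bound is the inequality on the slide; the upper bound (an ngram cannot occur more often than any of its sub-ngrams) is added here as an assumption about how the interval is closed. The function name and the counts are hypothetical, and the DP over all split points is not shown.

```python
def ngram_count_bounds(count_ab, count_bc, count_b):
    """Bounds on #(ABC) from the counts of its overlapping parts AB and BC."""
    # Slide inequality: #(ABC) >= #(AB) + #(BC) - #(B); a count is never negative.
    lower = max(0, count_ab + count_bc - count_b)
    # Assumed upper bound: every occurrence of ABC contains one AB and one BC.
    upper = min(count_ab, count_bc)
    return lower, upper


# Hypothetical counts, for illustration only.
print(ngram_count_bounds(count_ab=120, count_bc=100, count_b=150))
```

With these made-up counts the interval is 70 to 100; the tighter the overlap count #(B) is relative to #(AB) and #(BC), the narrower the interval becomes.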

10

Parameter Estimation
Maximum likelihood estimate: P_MLE(t) = #(t) / N

Problem:
◦ #(potter and the goblet of) = 6765
◦ P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
◦ We want not the prob. of seeing t in text, but the prob. of seeing t as a self-contained concept in text

11

Parameter Estimation
Query-relevant web corpus

Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)

t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)

θ = argmax P(D|θ) P(θ)                (posterior prob.)
  = argmax log P(D|θ) + log P(θ)      (DL of corpus, DL of parameters)

log P(D|θ) = ∑_t C(t) log P(t|θ)
P(t|θ) = ∑_{s(t)} P(s(t)|θ)

ngram                                  longest matching count   raw frequency
harry                                  1657108                  2003112
harry potter                           277736                   346004
harry potter and                       10436                    68268
harry potter and the                   51330                    57832
harry potter and the goblet            101                      6502
harry potter and the goblet of         618                      6401
harry potter and the goblet of fire    5783                     5783
...                                    ...                      ...
fire                                   4200957                  4478774
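The two likelihood terms on this slide can be evaluated with a short Python sketch: P(t|θ) is a sum over all segmentations of t, computed here with a forward dynamic program, and log P(D|θ) weights each log probability by C(t). The function names and the values in `theta` and `corpus` are hypothetical; the prior P(θ) and the optimization over θ are not shown.

```python
from math import log


def marginal_prob(words, theta):
    """P(t|theta): sum over segmentations of the product of segment probabilities."""
    n = len(words)
    forward = [0.0] * (n + 1)    # forward[k] = total probability of the first k words
    forward[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            segment = " ".join(words[start:end])
            forward[end] += forward[start] * theta.get(segment, 0.0)
    return forward[n]


def log_likelihood(corpus, theta):
    """log P(D|theta) = sum over t of C(t) * log P(t|theta)."""
    return sum(c * log(marginal_prob(t.split(), theta)) for t, c in corpus.items())


# Hypothetical parameters and counts, for illustration only.
theta = {"harry": 0.1, "potter": 0.05, "harry potter": 0.2}
corpus = {"harry potter": 2, "harry": 3}
print(marginal_prob("harry potter".split(), theta))   # 0.2 + 0.1 * 0.05
print(log_likelihood(corpus, theta))
```

The sketch assumes every t in the corpus has nonzero probability under θ; a zero marginal would make the logarithm undefined.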

12

System Architecture

13

Evaluation – Data sets
Three human-segmented datasets
◦ 3 data sets, for training, validation, and testing, 500 queries for each set
◦ Segmented by three editors A, B, C

14

Evaluation – Metrics
Evaluation metrics:
◦ Boundary classification accuracy
   w1 w2 w3 w4 w5
     N  N  Y  Y
◦ Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
◦ Segment accuracy: the percentage of segments being recovered
   Truth: [abc] [de] [fg]
   Prediction: [abc] [de fg]: precision
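A minimal Python sketch of the three metrics on the slide's example, assuming each letter in [abc] [de] [fg] stands for one word and that a predicted segment counts as recovered only when both of its ends match the truth. The function names are hypothetical, and the exact definitions used in the evaluation may differ.

```python
def boundaries(segments):
    """Word offsets at which a segment ends, excluding the end of the query."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_accuracy(truth, pred):
    """Fraction of gaps between adjacent words with the correct Y/N decision."""
    n_words = sum(len(seg) for seg in truth)
    if n_words < 2:
        return 1.0                       # a one-word query has no gaps to classify
    t, p = boundaries(truth), boundaries(pred)
    correct = sum((gap in t) == (gap in p) for gap in range(1, n_words))
    return correct / (n_words - 1)


def segment_precision_recall(truth, pred):
    """Exact-span match: a segment is recovered only if both ends agree."""
    def spans(segments):
        out, pos = set(), 0
        for seg in segments:
            out.add((pos, pos + len(seg)))
            pos += len(seg)
        return out
    t, p = spans(truth), spans(pred)
    hits = len(t & p)
    return hits / len(p), hits / len(t)


truth = [list("abc"), list("de"), list("fg")]
pred = [list("abc"), list("defg")]
print(boundary_accuracy(truth, pred))          # 5 of 6 gaps agree
print(truth == pred)                           # whole-query match
print(segment_precision_recall(truth, pred))   # 1 of 2 predicted, 1 of 3 true
```

Whole query accuracy over a data set is then the share of queries for which `truth == pred` holds.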

15

Results

16

Results I

17



Conclusions

18

Use for Improving Relevance
◦ Phrase proximity boosting
◦ Phrase level query expansion

19

Phrase Proximity Boosting
Classifying a segment into one of three categories
◦ Strong concept: no word reordering, no word insertion/deletion
   – Treat the whole segment as a single unit in matching and ranking
◦ Weak concept: allow word reordering or deletion/insertion
   – Boost documents matching the weak concepts
◦ Not a concept
   – Do nothing

20

Phrase Proximity Boosting
Concept based BM25
◦ Weighted by the confidence of concepts
Concept based min coverage
◦ Weighted by the confidence of concepts

21

Phrase Based Expansion
Phrase level replacement
◦ [San Francisco] -> [sf]
◦ [red eye flight] -> [late night flight]

22

Relevance Results
Significant relevance boosting
◦ Affects 40% of query traffic
◦ Significant DCG gain (1.5% for affected queries)
◦ Significant online CTR gain (0.5% overall)

23



Conclusions

24

Conclusions
Data is important for query segmentation
Phrases are important for improving relevance

25

References
Bergsma et al., EMNLP-CoNLL 2007
Risvik et al., WWW 2003
Hagen et al., SIGIR 2010
Tan & Peng, WWW 2008

Thank you!

26

27

Parameter Estimation II
Solution 1: Offline segment the web corpus, then collect counts for ngrams being segments

... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ...

harry potter and the goblet of fire += 1
potter and the goblet of += 0

• Technical difficulties

C. G. de Marcken, Unsupervised Language Acquisition, 96
Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01

28

Parameter Estimation III
Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...

Q = harry potter and the goblet of fire

harry potter and the goblet of fire += 1
the += 2
harry potter += 1

30

Parameter Estimation III
Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...

Q = potter and the goblet

potter and the goblet += 1
the += 2
potter += 1

Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
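One way to obtain longest matching counts from raw frequencies is inclusion-exclusion over the left and right extensions of each query substring. The Python sketch below is an assumption about how such a computation could look, not a confirmed description of the method; it also assumes that no word repeats within the query. `make_raw_counter` is a hypothetical stand-in for a raw ngram frequency lookup, backed here by the example sentence from the slide.

```python
def make_raw_counter(tokens):
    """Raw ngram frequency over a toy token list (stand-in for web-corpus counts)."""
    def raw_count(ngram):
        target = ngram.split()
        k = len(target)
        return sum(tokens[p:p + k] == target for p in range(len(tokens) - k + 1))
    return raw_count


def longest_matching_counts(query_words, raw_count):
    """C(t) for every substring t of the query, using O(|Q|^2) substrings."""
    n = len(query_words)

    def raw(i, j):
        if i < 0 or j > n:       # no extension exists beyond the query ends
            return 0
        return raw_count(" ".join(query_words[i:j]))

    counts = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Occurrences of Q[i:j] that extend neither left nor right within Q.
            counts[" ".join(query_words[i:j])] = (
                raw(i, j) - raw(i - 1, j) - raw(i, j + 1) + raw(i - 1, j + 1)
            )
    return counts


text = ("harry potter and the goblet of fire is the fourth novel in the "
        "harry potter series written by j.k. rowling")
raw_count = make_raw_counter(text.split())
for q in ("harry potter and the goblet of fire", "potter and the goblet"):
    counts = longest_matching_counts(q.split(), raw_count)
    print(q, "->", {t: c for t, c in counts.items() if c})
```

For query prefixes the formula reduces to the difference between consecutive raw frequencies, which agrees with the harry potter table (for example, 6401 - 5783 = 618).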
