Phrase Identification from Queries and Its Use for Web Search

Fuchun Peng, Microsoft Bing, 7/23/2010

Page 1: Phrase Identification  from Queries and Its  Use for Web Search

1

Phrase Identification from Queries and Its Use for Web Search

Fuchun Peng, Microsoft Bing, 7/23/2010

Page 2: Phrase Identification  from Queries and Its  Use for Web Search

2

Motivation
• A query is often treated as a bag of words
• But when people are formulating queries, they use “concepts” as building blocks
  ◦ Q: simmons college sports psychology
  ◦ A1: “simmons college”, “sports psychology”
  ◦ A2: “college sports”
  ◦ Callouts on the slide: simmons college’s; sports psychology (course)
• Can we automatically segment the query to recover the concepts?

Page 3: Phrase Identification  from Queries and Its  Use for Web Search

3

Outline
• Summary of segmentation approaches
• Use for improving search relevance
  ◦ Query rewriting
  ◦ Ranking features
• Conclusions

Page 4: Phrase Identification  from Queries and Its  Use for Web Search

4

Supervised Segmentation
• Supervised learning (Bergsma et al., EMNLP-CoNLL07)
  ◦ Binary decision at each possible segmentation point
  ◦ Features: POS, web counts, “the”, “and”, …

[Figure: words w1 w2 w3 w4 w5 with a Y/N decision at each boundary between adjacent words]

• Problem:
  – Limited-range context
  – Features specifically designed for noun phrases

Page 5: Phrase Identification  from Queries and Its  Use for Web Search

5

Training Data Annotation
• Manual data preparation
  ◦ Linguistic driven: [San jose international airport]
  ◦ Relevance driven: [San jose] [international airport]

Page 6: Phrase Identification  from Queries and Its  Use for Web Search

6

Mutual-information based (Risvik et al., WWW 2003)

[Figure: MI of each adjacent pair (1,2), (2,3), (3,4), (4,5) of w1 w2 w3 w4 w5, compared against a threshold]

MI(w1,w2) = P(w1 w2) / (P(w1) P(w2))

• Insert a segment boundary where MI falls below the threshold, e.g. w1 w2 | w3 w4 w5
• Iterative update
• Problem:
  – Only captures short-range correlation (between adjacent words)
  – What about “my heart will go on”?
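The thresholding step above can be sketched in a few lines of Python. This is a minimal single-pass illustration, not the method of Risvik et al.: it skips the iterative update, and the function name `mi_segment` and all probabilities are hypothetical values chosen so that only the pair (w2, w3) falls below the threshold.

```python
def mi_segment(words, p_unigram, p_bigram, threshold):
    """Split `words` wherever the MI of an adjacent pair is below `threshold`."""
    if not words:
        return []
    segments = [[words[0]]]
    for left, right in zip(words, words[1:]):
        # Ratio form used on the slide: P(w1 w2) / (P(w1) P(w2)).
        mi = p_bigram.get((left, right), 0.0) / (p_unigram[left] * p_unigram[right])
        if mi < threshold:
            segments.append([right])    # weak association: start a new segment
        else:
            segments[-1].append(right)  # strong association: extend the segment
    return [" ".join(seg) for seg in segments]


# Hypothetical probabilities, for illustration only.
words = ["w1", "w2", "w3", "w4", "w5"]
p_unigram = {w: 0.1 for w in words}
p_bigram = {
    ("w1", "w2"): 0.05,
    ("w2", "w3"): 0.005,
    ("w3", "w4"): 0.05,
    ("w4", "w5"): 0.05,
}
print(mi_segment(words, p_unigram, p_bigram, threshold=1.0))
```

Because each decision looks only at one adjacent pair, a long phrase whose interior pairs are individually weak gets split, which is the short-range problem noted on the slide.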

Page 7: Phrase Identification  from Queries and Its  Use for Web Search

7

Frequency Based Approach (Hagen et al., SIGIR 2010)

Page 8: Phrase Identification  from Queries and Its  Use for Web Search

8

LM Based Approach (Tan & Peng, WWW 2008)

Assume the query is generated by independent sampling from a probability distribution of concepts (a unigram model):

[simmons college] [sports psychology]
P(simmons college) = 0.000016, P(sports psychology) = 0.000002
P = 0.000016 × 0.000002 = 3.2 × 10^-11

>

[simmons] [college sports] [psychology]
P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024
P = 0.000007 × 0.000006 × 0.000024 ≈ 1.0 × 10^-15

• Enumerate all possible segmentations; rank by probability of being generated by the unigram model
• How to estimate parameters P(w) for the unigram model?
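The enumerate-and-rank step can be illustrated with a short Python sketch. The concept probabilities are the five values shown on the slide; treating every other segment as having probability 0 is an assumption made only to keep the example small, and the names `segmentations` and `rank_segmentations` are hypothetical.

```python
from itertools import product
from math import prod


def segmentations(words):
    """Yield every way to split `words` into contiguous segments."""
    for cuts in product([False, True], repeat=len(words) - 1):
        segments, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:  # boundary between words[i-1] and words[i]
                segments.append(" ".join(words[start:i]))
                start = i
        segments.append(" ".join(words[start:]))
        yield segments


def rank_segmentations(words, concept_prob):
    """Score each segmentation as a product of concept probabilities."""
    scored = [
        (prod(concept_prob.get(seg, 0.0) for seg in segs), segs)
        for segs in segmentations(words)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Probabilities from the slide; unseen segments default to 0 (assumption).
CONCEPT_PROB = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}

ranked = rank_segmentations("simmons college sports psychology".split(), CONCEPT_PROB)
for score, segs in ranked[:2]:
    print(score, segs)
```

A query of n words has 2^(n-1) segmentations, so brute-force enumeration is only practical for short queries.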

Page 9: Phrase Identification  from Queries and Its  Use for Web Search

9

Parameter (Concept Prob.) Estimation I
• We have ngram (n = 1..5) counts in a web corpus
  ◦ 464M documents; L = 33B tokens
  ◦ Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
    #(ABC) = #(AB) + #(BC) - #(AB OR BC) >= #(AB) + #(BC) - #(B)
  ◦ Solved by DP
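The inequality above gives a lower bound on the count of a longer ngram from the counts of two overlapping shorter ones. The sketch below applies it once. The upper bound min(#(AB), #(BC)) is not on the slide and is added here as an assumption, justified by the fact that every occurrence of ABC contains both AB and BC. The function name and the counts are hypothetical, and the DP that chains such bounds is not reproduced.

```python
def overlap_count_bounds(count_ab, count_bc, count_b):
    """Bound #(ABC) given counts of the overlapping pieces AB, BC and B."""
    # Slide's bound: #(ABC) >= #(AB) + #(BC) - #(B); a count is never negative.
    lower = max(0, count_ab + count_bc - count_b)
    # Assumption (not on the slide): ABC cannot occur more often than AB or BC.
    upper = min(count_ab, count_bc)
    return lower, upper


# Hypothetical counts, for illustration only.
print(overlap_count_bounds(count_ab=120, count_bc=100, count_b=150))
```

For a 7-word ngram built from 5-gram counts, AB would be the first five words, BC the last five, and B the three words they share.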

Page 10: Phrase Identification  from Queries and Its  Use for Web Search

10

Parameter Estimation
• Maximum Likelihood Estimate: P_MLE(t) = #(t) / N
• Problem:
  ◦ #(potter and the goblet of) = 6765
  ◦ P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
  ◦ Not prob. of seeing t in text, but prob. of seeing t as a self-contained concept in text

Page 11: Phrase Identification  from Queries and Its  Use for Web Search

11

Parameter Estimation
Query-relevant web corpus

Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)

t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)

$$\theta = \arg\max_{\theta} \underbrace{P(D \mid \theta)\, P(\theta)}_{\text{posterior prob.}} = \arg\max_{\theta} \Big[ \underbrace{\log P(D \mid \theta)}_{\text{DL of corpus}} + \underbrace{\log P(\theta)}_{\text{DL of parameters}} \Big]$$

$$\log P(D \mid \theta) = \sum_{t} C(t)\, \log P(t \mid \theta), \qquad P(t \mid \theta) = \sum_{s(t)} P(s(t) \mid \theta)$$

ngram                                 longest matching count   raw frequency
harry                                 1657108                  2003112
harry potter                          277736                   346004
harry potter and                      10436                    68268
harry potter and the                  51330                    57832
harry potter and the goblet           101                      6502
harry potter and the goblet of        618                      6401
harry potter and the goblet of fire   5783                     5783
...                                   …                        …
fire                                  4200957                  4478774
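The sum over segmentations in P(t|θ) does not need explicit enumeration. The sketch below computes it with a left-to-right dynamic program, assuming that P(s(t)|θ) is simply the product of the probabilities of the segments in s(t). That factorization, the function name `substring_prob`, the `max_len` cap and the toy probabilities are all assumptions made for illustration, not details taken from the slide.

```python
def substring_prob(tokens, ngram_prob, max_len=5):
    """P(t | theta): sum over all segmentations of the product of segment probs."""
    n = len(tokens)
    # alpha[k] = total probability of generating the first k tokens.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = " ".join(tokens[start:end])
            alpha[end] += alpha[start] * ngram_prob.get(segment, 0.0)
    return alpha[n]


# Toy parameters (hypothetical): "a b" is either one concept or two.
theta = {"a": 0.5, "b": 0.25, "a b": 0.125}
print(substring_prob(["a", "b"], theta))
```

With these toy values the two segmentations contribute 0.5 × 0.25 and 0.125, so the result is 0.25. The log-likelihood term then weights log P(t|θ) by the longest matching count C(t) of each query substring.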

Page 12: Phrase Identification  from Queries and Its  Use for Web Search

12

System Architecture

Page 13: Phrase Identification  from Queries and Its  Use for Web Search

13

Evaluation – Data sets
• Three human-segmented datasets
  ◦ 3 data sets, for training, validation, and testing, 500 queries for each set
  ◦ Segmented by three editors A, B, C

Page 14: Phrase Identification  from Queries and Its  Use for Web Search

14

Evaluation – metrics
• Evaluation metric:
  ◦ Boundary classification accuracy
  ◦ Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
  ◦ Segment accuracy: the percentage of segments being recovered
    Truth: [abc] [de] [fg]; Prediction: [abc] [de fg]: precision

[Figure: words w1 w2 w3 w4 w5 with a Y/N decision at each boundary between adjacent words]
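The three metrics can be made concrete with a small Python sketch. It assumes each segmentation is a list of token lists covering the same tokens, and it reads "segment accuracy" as segment-level precision and recall over exact spans. The helper names are hypothetical, and the example is the slide's Truth/Prediction pair with each letter treated as one word.

```python
def spans(segments):
    """Return the set of (start, end) token spans of a segmentation."""
    out, start = set(), 0
    for seg in segments:
        out.add((start, start + len(seg)))
        start += len(seg)
    return out


def boundary_accuracy(truth, pred):
    """Fraction of between-word positions where the Y/N decisions agree."""
    n = sum(len(seg) for seg in truth)
    if n < 2:
        return 1.0
    cuts_truth = {end for _, end in spans(truth)}
    cuts_pred = {end for _, end in spans(pred)}
    agree = sum((k in cuts_truth) == (k in cuts_pred) for k in range(1, n))
    return agree / (n - 1)


def whole_query_correct(truth, pred):
    """True when every boundary decision matches."""
    return spans(truth) == spans(pred)


def segment_precision_recall(truth, pred):
    """Precision and recall over exactly recovered segments."""
    t, p = spans(truth), spans(pred)
    hits = len(t & p)
    return hits / len(p), hits / len(t)


truth = [list("abc"), list("de"), list("fg")]
pred = [list("abc"), list("defg")]
print(boundary_accuracy(truth, pred))
print(whole_query_correct(truth, pred))
print(segment_precision_recall(truth, pred))
```

On this pair, one of the two predicted segments and one of the three true segments is recovered, while five of the six boundary decisions agree, so the boundary metric is the most lenient of the three.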

Page 15: Phrase Identification  from Queries and Its  Use for Web Search

15

Results

Page 16: Phrase Identification  from Queries and Its  Use for Web Search

16

Results I

Page 17: Phrase Identification  from Queries and Its  Use for Web Search

17

Outline
• Summary of segmentation approaches
• Use for improving search relevance
  ◦ Query rewriting
  ◦ Ranking features
• Conclusions

Page 18: Phrase Identification  from Queries and Its  Use for Web Search

18

Use for Improving Relevance
• Phrase Proximity Boosting
• Phrase Level Query Expansion

Page 19: Phrase Identification  from Queries and Its  Use for Web Search

19

Phrase Proximity Boosting
• Classifying a segment into one of three categories
  ◦ Strong concept: no word reordering, no word insertion/deletion
    – Treat the whole segment as a single unit in matching and ranking
  ◦ Weak concept: allow word reordering or deletion/insertion
    – Boost documents matching the weak concepts
  ◦ Not a concept
    – Do nothing
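The difference between matching a strong and a weak concept can be sketched as follows. This is an illustrative reading only: a strong concept is matched as an exact contiguous phrase, and a weak concept is matched when all of its words fall inside a small window in any order. The `window` parameter and both function names are hypothetical, word deletion is not handled, and the slides do not say how the production system implements either case.

```python
def strong_match(doc_tokens, segment):
    """Exact phrase match: same words, same order, nothing in between."""
    k = len(segment)
    return any(doc_tokens[i:i + k] == segment for i in range(len(doc_tokens) - k + 1))


def weak_match(doc_tokens, segment, window):
    """All segment words occur within `window` consecutive tokens, in any order."""
    needed = set(segment)
    return any(
        needed <= set(doc_tokens[i:i + window]) for i in range(len(doc_tokens))
    )


segment = ["late", "night", "flight"]
doc = "cheap flight late at night from sf".split()
print(strong_match(doc, segment), weak_match(doc, segment, window=4))
```

In a ranker, a strong match would be treated as a single matching unit, while a weak match would only contribute a proximity boost.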

Page 20: Phrase Identification  from Queries and Its  Use for Web Search

20

Phrase Proximity Boosting
• Concept based BM25
  ◦ Weighted by the confidence of concepts
• Concept based min coverage
  ◦ Weighted by the confidence of concepts

Page 21: Phrase Identification  from Queries and Its  Use for Web Search

21

Phrase Based Expansion
• Phrase level replacement
  ◦ [San Francisco] -> [sf]
  ◦ [red eye flight] -> [late night flight]
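Phrase-level replacement operates on whole segments rather than on individual words. The sketch below generates query variants by swapping one segment at a time, using the two rewrite pairs from the slide. The table format, the function name `expand_query` and the one-replacement-per-variant policy are assumptions for illustration.

```python
# Rewrite pairs taken from the slide; the dict layout is an assumption.
REWRITES = {
    "san francisco": ["sf"],
    "red eye flight": ["late night flight"],
}


def expand_query(segments, rewrites):
    """Return the original query plus variants with one segment replaced."""
    variants = [" ".join(segments)]
    for i, seg in enumerate(segments):
        for alt in rewrites.get(seg.lower(), []):
            variants.append(" ".join(segments[:i] + [alt] + segments[i + 1:]))
    return variants


print(expand_query(["red eye flight", "san francisco"], REWRITES))
```

Because lookups are keyed on full segments, a word such as "eye" is never rewritten on its own, which is the point of doing expansion after segmentation.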

Page 22: Phrase Identification  from Queries and Its  Use for Web Search

22

Relevance Results
• Significant relevance boosting
  ◦ Affects 40% query traffic
  ◦ Significant DCG gain (1.5% for affected queries)
  ◦ Significant online CTR gain (0.5% over all)

Page 23: Phrase Identification  from Queries and Its  Use for Web Search

23

Outline
• Summary of segmentation approaches
• Use for improving search relevance
  ◦ Query rewriting
  ◦ Ranking features
• Conclusions

Page 24: Phrase Identification  from Queries and Its  Use for Web Search

24

Conclusions
• Data is important for query segmentation
• Phrases are important for improving relevance

Page 25: Phrase Identification  from Queries and Its  Use for Web Search

25

References
• Bergsma et al., EMNLP-CoNLL07
• Risvik et al., WWW 2003
• Hagen et al., SIGIR 2010
• Tan & Peng, WWW 2008

Page 26: Phrase Identification  from Queries and Its  Use for Web Search

Thank you!

26

Page 27: Phrase Identification  from Queries and Its  Use for Web Search

27

Parameter Estimation II
• Solution 1: Offline segment the web corpus, then collect counts for ngrams being segments

  ... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ...

  harry potter and the goblet of fire += 1
  potter and the goblet of += 0

• Technical difficulties
  ◦ C. G. de Marcken, Unsupervised Language Acquisition, 96
  ◦ Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01

Page 28: Phrase Identification  from Queries and Its  Use for Web Search

28

Parameter Estimation III
• Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

  ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...

  Q = harry potter and the goblet of fire

  harry potter and the goblet of fire += 1
  the += 2
  harry potter += 1
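The counts on this slide can be reproduced by scanning the text and, at each position, taking the longest run of tokens that also appears contiguously in the query. The sketch below is a greedy left-to-right version written for illustration. Whitespace tokenization, lowercasing and the name `longest_match_counts` are assumptions, and a production system would work from ngram counts rather than raw text.

```python
from collections import Counter


def longest_match_counts(text_tokens, query_tokens):
    """Count maximal runs of text that are contiguous substrings of the query."""
    n = len(query_tokens)
    substrings = {
        tuple(query_tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)
    }
    counts = Counter()
    i = 0
    while i < len(text_tokens):
        best = 0
        for k in range(1, min(n, len(text_tokens) - i) + 1):
            if tuple(text_tokens[i:i + k]) in substrings:
                best = k
            else:
                break  # a longer run cannot match if this prefix does not
        if best:
            counts[" ".join(text_tokens[i:i + best])] += 1
            i += best  # skip past the matched run
        else:
            i += 1
    return counts


TEXT = (
    "Harry Potter and the Goblet of Fire is the fourth novel in the "
    "Harry Potter series written by J.K. Rowling"
).lower().split()

print(longest_match_counts(TEXT, "harry potter and the goblet of fire".split()))
```

Words inside the full title match, such as "and" or "of", receive no count of their own, which is what separates a longest matching count from a raw frequency.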

Page 29: Phrase Identification  from Queries and Its  Use for Web Search

29

Page 30: Phrase Identification  from Queries and Its  Use for Web Search

30

Parameter Estimation III
• Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

  ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...

  Q = potter and the goblet

  potter and the goblet += 1
  the += 2
  potter += 1

• Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
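One way to read "directly compute from raw ngram frequency" is inclusion-exclusion: an occurrence of a query substring is a longest match unless the text extends it to the left or right with the neighbouring query word. The sketch below applies that reading. It is an assumption rather than the published algorithm, and it presumes each substring occurs at only one position in the query. The raw frequencies are the values from the ngram table on the earlier Parameter Estimation slide, and the function name is hypothetical.

```python
def longest_match_count(query, i, j, raw):
    """Longest matching count of query[i:j] from raw ngram frequencies."""
    def freq(a, b):
        return raw[" ".join(query[a:b])]

    n = len(query)
    count = freq(i, j)
    if i > 0:
        count -= freq(i - 1, j)      # occurrences extended to the left
    if j < n:
        count -= freq(i, j + 1)      # occurrences extended to the right
    if i > 0 and j < n:
        count += freq(i - 1, j + 1)  # extended both ways, subtracted twice
    return count


QUERY = "harry potter and the goblet of fire".split()
RAW = {
    "harry": 2003112,
    "harry potter": 346004,
    "harry potter and": 68268,
    "harry potter and the": 57832,
    "harry potter and the goblet": 6502,
    "harry potter and the goblet of": 6401,
    "harry potter and the goblet of fire": 5783,
}

# Only prefixes of the query are shown; other substrings need counts
# that the slide elides.
prefix_counts = [longest_match_count(QUERY, 0, j, RAW) for j in range(1, len(QUERY) + 1)]
print(prefix_counts)
```

Each of the O(|Q|^2) substrings needs a constant number of lookups, which is consistent with the stated complexity, and for the prefixes the results agree with the longest matching count column of the table.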