SQUAD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia Chin-Yew LIN [email protected]

Post on 23-Feb-2016


Page 1: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

SQuaD: the starting point of web intelligence

Natural Language Computing, Microsoft Research Asia

Chin-Yew LIN [email protected]

Page 2: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Web 2.0

• Web as a platform: connect people and services anywhere, anytime, on any device
• Harnessing collective intelligence: aggregated grassroots contribution
• Data is the next “Intel Inside”: data-centric computing

Tim O’Reilly’s “What is Web 2.0”: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html?page=1

How do we turn DATA into VALUE?

Page 3: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Content

Community

Technology

Value

Page 4: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Baidu Zhidao (百度知道)

• 17,012,767 resolved questions in two years of operation
• 8,921,610 are knowledge related
• 96.7% of questions are resolved
• 10,000,000 daily visitors
• 71,308 new questions per day
• 3.14 answers per question

http://www.searchlab.com.cn (中国人搜索行为研究 / User Research Lab of Chinese Search)

[Figure: Baidu Zhidao Top 10 Question Types. Bar chart, axis from 0 to 900,000. Categories: Cell Phone, Music, Software, Computer, Relationship, Language, OS, Hardware, Education, Internet. Bar values: 359,285; 409,447; 468,268; 481,882; 500,762; 574,001; 579,133; 709,438; 732,976; 768,668.]

[Figure: Baidu Zhidao Question Types Distribution. Pie chart with segments 50.70%, 26.00%, 17.60%, and 5.70%. Legend: Knowledge/知识, Life Style/生活, Entertainment/娱乐, Other/其他.]

Page 5: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

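Stickiness of Baidu Zhidao

According to a survey by Zhengwang Consulting (正望咨询), Baidu Zhidao is closely tied to search and does a great deal to increase search stickiness; by its statistics, Baidu Zhidao has become one of Baidu's core products. “50% of Baidu users search Zhidao; its user base has already surpassed that of Baidu Tieba and is comparable to that of Baidu's MP3 search.”

• 50% of Baidu users search Baidu Zhidao.
• Zhidao search traffic is comparable to MP3 search.

(http://news.csdn.net/n/20080425/115453.html; 04/25/2008)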

Page 6: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

A Traditional QA Architecture

A QA system gives direct answers to a question instead of documents.

Falcon QA system (LCC)
• Moldovan et al., ACL 2000
• Surdeanu et al., IEEE Trans. PDS 2002
• Best QA system in TREC 8 & 9

Average question answering time
• TREC 8: 48 seconds
• TREC 9: 94 seconds

Falcon QA system module analysis: processing time

Module   TREC8               TREC9
QP       1.1%                1.2%
PR       44.4% (21.3 sec)    26.5% (24.9 sec)
PS       5.4%                2.2%
PO       0.1%                0.1%
AP       48.7% (23.4 sec)    69.7% (65.5 sec)

Traditional IR: not scalable
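The percentages in the module table can be cross-checked against the average answering times with a few lines of arithmetic. The sketch below uses only the figures shown on the slide; the names `TOTAL_SECONDS`, `MODULE_SECONDS`, and `time_share` are illustrative and not part of the Falcon system.

```python
# Average answering time per question, in seconds (from the slide).
TOTAL_SECONDS = {"TREC8": 48.0, "TREC9": 94.0}

# Seconds spent in the two dominant modules (from the slide).
MODULE_SECONDS = {
    "TREC8": {"PR": 21.3, "AP": 23.4},
    "TREC9": {"PR": 24.9, "AP": 65.5},
}


def time_share(seconds: float, total: float) -> float:
    """Return a module's share of the total answering time, in percent."""
    return 100.0 * seconds / total


shares = {
    run: {module: time_share(sec, TOTAL_SECONDS[run]) for module, sec in modules.items()}
    for run, modules in MODULE_SECONDS.items()
}

for run, modules in shares.items():
    for module, pct in modules.items():
        print(f"{run} {module}: {pct:.2f}% of answering time")
```

The computed shares agree with the table to within rounding, and they show that PR and AP together account for more than 90% of the answering time in both runs.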

Page 7: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Scalable Question Answering & Distillation

Goal: Create a scalable question answering service

Methods:
• Index all question and answer pairs on the web
• Enrich QnA through summarization

Page 8: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Challenges

• Question Mining & Question Answering (ACL 2008, SIGIR 2008)
• Question Utility (AAAI 2008)
• Question Search (ACL 2008)
• Question Recommendation (WWW 2008)
• Answer Summarization (COLING 2008)

Page 9: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

List of Papers Accepted

• Recommending Questions Using the MDL-based Tree Cut Model – Cao et al.; WWW 2008
• Searching Questions by Identifying Question Topic and Question Focus – Duan et al.; ACL 2008
• Using Conditional Random Fields to Extract Contexts and Answers of Questions from Online Forums – Ding et al.; ACL 2008
• Finding Question Answer Pairs from Online Forums – Cong et al.; SIGIR 2008
• Question Utility: A Novel Static Ranking of Question Search – Song et al.; AAAI 2008
• Answer Summarization: Understanding and Summarizing Answers in Community-Based Question Answering Services – Liu et al.; COLING 2008

Page 10: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

QA Pairs in Online Forums

[Figure: labeled regions CONTEXT, QUESTIONS, and ANSWERS]

Page 11: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Question Mining & Answering (ACL 2008 & SIGIR 2008)

Extract question and answer pairs

• Community QnA
  • Create a resolved question list
  • Extract & index question, best answer, and other answers
  • Yahoo! Answers, Baidu Zhidao, …
• Forum
  • Extract and index threads and postings, find questions and their answers
  • 6 travel forums

Page 12: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Question Utility (AAAI 2008)

Motivation
• How useful is a question?
• How should we rank questions without queries?

Definition
• How likely is it that a question will be asked again?

$$\arg\max_{Q} p(Q \mid Q') = \arg\max_{Q} \frac{p(Q' \mid Q)\, p(Q)}{p(Q')} = \arg\max_{Q} p(Q' \mid Q)\, p(Q)$$

• $p(Q' \mid Q)$: the probability of generating query $Q'$ from question $Q$ (relevance score)
• $p(Q)$: the prior probability of question $Q$, reflecting a static rank of the question, i.e., question utility
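The ranking rule above, relevance times a query-independent prior, can be sketched in a few lines. This is only an illustration under stated assumptions: the relevance term uses an add-one smoothed unigram model of the question, and the utility values are made-up numbers. Neither the estimator nor the function names (`log_query_likelihood`, `rank_questions`) come from the slides or the AAAI 2008 paper.

```python
import math
from collections import Counter


def log_query_likelihood(query: str, question: str, vocab_size: int) -> float:
    """log p(Q'|Q) under an add-one smoothed unigram model of the question (assumed estimator)."""
    tokens = question.lower().split()
    counts = Counter(tokens)
    return sum(
        math.log((counts[word] + 1) / (len(tokens) + vocab_size))
        for word in query.lower().split()
    )


def rank_questions(query: str, utility: dict) -> list:
    """Rank stored questions by log p(Q) + log p(Q'|Q).

    `utility` maps each stored question Q to a hypothetical prior p(Q).
    """
    vocab = set(query.lower().split())
    for question in utility:
        vocab.update(question.lower().split())

    def score(question: str) -> float:
        return math.log(utility[question]) + log_query_likelihood(query, question, len(vocab))

    return sorted(utility, key=score, reverse=True)


# Hypothetical priors: the first question is assumed to be asked more often.
utility = {"where to stay in paris": 0.7, "where to eat in paris": 0.3}
ranked = rank_questions("paris hotels", utility)
print(ranked)
```

When two stored questions match the query equally well, as in this toy example, the prior alone decides the order. This is the sense in which question utility acts as a static rank.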

Page 13: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Answer Summarization (COLING 2008)

• Example: “Where to stay in Paris?”
  • 1,822 answers (Yahoo! Answers, 06/23/08)
  • Is the “best answer” the best answer?
• Question clustering
  • Find similar questions
• Answer summarization
  • Aggregate answers for a question cluster

[Figure labels: Answer Taxonomy; Question Taxonomy]

Page 14: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Question Search & Recommendation

Yunbo CAO & Chin-Yew LIN

WWW 2008 & ACL 2008

Page 15: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Question Search & Recommendation (ACL 2008 & WWW 2008)

Query
• We would like to know what will be available to see in the Forbidden City because we understand that it will be under repairs.

Question search
• Is it true that the Forbidden City is undergoing renovation & we won't be allow to enter?

Question recommendation
• Would you get a lower price by not needing a guide for the Forbidden City and etc?
• Can anybody recommend a budget hotel near Forbidden City?

Question = Topic + Focus + Others (TFO)
• Search: same topic, similar focuses
• Recommend: same topic, different focuses

How can we discriminate topic from focus?

Page 16: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Identifying Topic and Focus

Specificity: the inverse of the entropy of the topic term's distribution over the sub-categories.

Order topic terms by their specificity.

[Figure: category tree. Travel @Yahoo! Answers → Asia Pacific (China, Japan), Europe, …]

China
1. Anyone know where to see the Dragon Boat Festival in Beijing?
2. Where is a good (Less expensive) place to shop in Beijing?
3. What's the cheapest way to get from Beijing to Hong Kong?

Europe
4. How far is it from Berlin to Hamburg?
5. What is the cheapest way from Berlin to Hamburg?
6. Where to see between Hamburg and Berlin?
7. How long does it take from Hamburg to Berlin?
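The specificity definition can be illustrated with a short sketch. It assumes that a term's distribution is estimated from raw counts per sub-category and that a small constant `eps` is added to the entropy so that a term seen in only one sub-category does not cause a division by zero. Both choices, and the counts themselves (read off the seven example questions), are assumptions made for illustration.

```python
import math


def specificity(category_counts: dict, eps: float = 1e-3) -> float:
    """Inverse entropy of a term's distribution over sub-categories.

    `eps` is an assumed smoothing constant, not taken from the slides.
    """
    total = sum(category_counts.values())
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return 1.0 / (entropy + eps)


# Counts per sub-category, read off the seven example questions.
term_counts = {
    "Berlin": {"Europe": 4},
    "Beijing": {"China": 3},
    "cheapest way": {"China": 1, "Europe": 1},
    "where to see": {"China": 1, "Europe": 1},
}

# Order topic terms from most to least specific.
ordered = sorted(term_counts, key=lambda term: specificity(term_counts[term]), reverse=True)
print(ordered)
```

Terms concentrated in one sub-category (place names here) get a high specificity and come first in the ordering. Terms spread evenly across sub-categories, such as "cheapest way", get a low specificity and end up at the tail.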

Page 17: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Query: Any cool clubs in Hamburg or Berlin?
• Topic Terms: cool clubs, Hamburg, Berlin
• Topic Chain: Hamburg → Berlin → cool clubs

Related questions:
• Where to see in Hamburg or Berlin?
  • Topic Terms: where to see, Hamburg, Berlin
  • Topic Chain: Hamburg → Berlin → where to see
• How far is it from Berlin to Hamburg?
  • Topic Chain: Hamburg → Berlin → how far

Order Topic Terms by Specificity

[Figure: question tree Hamburg → Berlin → {cool clubs, where to see, how far}, annotated with Question Topic and Question Focus]

Page 18: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Determine the Cut on a Question Tree

The Use of MDL (Minimum Description Length) Based Tree Cut Model (Li & Abe 1998)

[Figure: question tree. Nodes: ROOT, Hamburg, Berlin, Berlin. Leaves with counts: cheap hotel (1), fun club (1), cool club (1), nice hotel (1), how long does it take (1).]

Page 19: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

The MDL-based Tree Cut Model

(Li & Abe, CL 1998)
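The slide body for the tree cut model did not survive in this transcript, so the following is only a simplified sketch of an MDL-style tree cut. It assumes the commonly stated form of the description length: a parameter cost of (number of classes in the cut / 2) · log2 |S| plus a data cost of −Σ log2 p(leaf), where the probability of a class is shared uniformly among its leaves. The `Node` class, the function names, and the toy counts are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    count: int = 0                       # observed frequency (leaves only)
    children: list = field(default_factory=list)


def leaves(node: Node) -> list:
    """Return all leaf nodes below (or equal to) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def description_length(cut: list, sample_size: int) -> float:
    """Parameter cost plus data cost of describing the sample with this cut."""
    param_cost = len(cut) / 2 * math.log2(sample_size)
    data_cost = 0.0
    for cls in cut:
        members = leaves(cls)
        freq = sum(leaf.count for leaf in members)
        if freq == 0:
            continue
        # Class probability is spread uniformly over the leaves of the class.
        p_leaf = freq / sample_size / len(members)
        data_cost -= freq * math.log2(p_leaf)
    return param_cost + data_cost


def find_mdl_cut(node: Node, sample_size: int) -> list:
    """Bottom-up search: keep a node if it is cheaper than the best cut below it."""
    if not node.children:
        return [node]
    child_cut = [c for child in node.children for c in find_mdl_cut(child, sample_size)]
    if description_length([node], sample_size) < description_length(child_cut, sample_size):
        return [node]
    return child_cut


# Toy tree with made-up counts: "hotel" is evenly split, "club" is skewed.
root = Node("ROOT", children=[
    Node("hotel", children=[Node("cheap hotel", 10), Node("nice hotel", 10)]),
    Node("club", children=[Node("fun club", 19), Node("cool club", 1)]),
])
sample_size = sum(leaf.count for leaf in leaves(root))
cut = [node.name for node in find_mdl_cut(root, sample_size)]
print(cut)
```

In this toy tree the evenly split "hotel" subtree is generalized to a single class, because merging its leaves loses no information and saves one parameter. The skewed "club" subtree stays split into its leaves.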

Page 20: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

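Scoring the Candidates

Given a queried question $q$ and a candidate $\tilde{q}$:

The search relevance score is

$$r(\tilde{q} \mid q) = \lambda \cdot \mathrm{sim}(T(\tilde{q}) \mid T(q)) + (1 - \lambda) \cdot \mathrm{sim}(F(\tilde{q}) \mid F(q))$$

The recommendation score is

$$r(\tilde{q} \mid q) = \lambda \cdot \mathrm{sim}(T(\tilde{q}) \mid T(q)) - (1 - \lambda) \cdot \mathrm{sim}(F(\tilde{q}) \mid F(q))$$

where $T(\cdot)$ is the question topic and $F(\cdot)$ is the question focus.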
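A small sketch can make the two scores concrete. It rests on several assumptions that are not in the slides: word-level Jaccard overlap stands in for the similarity function, a weight `lam` interpolates the topic and focus terms, and the recommendation score subtracts the focus term so that candidates with the same topic but a different focus rank higher. The function names and the value `lam=0.3` are illustrative only.

```python
def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity; a stand-in for the unspecified sim(.|.)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def search_score(q_topic, q_focus, c_topic, c_focus, lam=0.3):
    """Reward candidates that share both topic and focus with the query."""
    return lam * jaccard(c_topic, q_topic) + (1 - lam) * jaccard(c_focus, q_focus)


def recommendation_score(q_topic, q_focus, c_topic, c_focus, lam=0.3):
    """Reward a shared topic but penalize a shared focus (assumed sign)."""
    return lam * jaccard(c_topic, q_topic) - (1 - lam) * jaccard(c_focus, q_focus)


# Query: "any cool clubs in Berlin or Hamburg?"
query_topic, query_focus = {"hamburg", "berlin"}, {"cool", "club"}

# Two candidates from the running example, split by hand into topic and focus.
fun_club = ({"hamburg"}, {"fun", "club"})
where_to_see = ({"hamburg", "berlin"}, {"where", "to", "see"})

search = {
    "fun club": search_score(query_topic, query_focus, *fun_club),
    "where to see": search_score(query_topic, query_focus, *where_to_see),
}
recommend = {
    "fun club": recommendation_score(query_topic, query_focus, *fun_club),
    "where to see": recommendation_score(query_topic, query_focus, *where_to_see),
}
print(search, recommend)
```

Under these assumptions the "fun club" question wins for search because its focus overlaps with the query, while the "where to see" question wins for recommendation because it shares the topic but asks about something else.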

Page 21: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Flow of Question Search/Recommendation

Query: any cool clubs in Berlin or Hamburg?

STEP 1: Retrieve related questions from the index

Related Questions:
1. Where to see between Hamburg and Berlin?
2. How far is it from Berlin to Hamburg?
3. Any good hostels in Hamburg or Berlin?
4. What are the most/best fun club in Hamburg?

STEP 2: Discriminate Question Topic from Question Focus

[Figure: question tree over Hamburg and Berlin with leaves cool club, where to see, how far, good hostel, fun club]

STEP 3: Rank questions on the basis of the cut

Search:
1. What are the most/best fun club in Hamburg?

Recommendation:
1. Where to see between Hamburg and Berlin?
2. How far is it from Berlin to Hamburg?
3. Any good hostels in Hamburg or Berlin?

Page 22: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Experimental Results (Search)

Data (Yahoo! Answers)
• Query: 200 questions about ‘travel’; 200 questions about ‘computers & internet’
• Relevance: human judgment

Baselines
• VSM (Vector Space Model), LMIR (Language Model for Information Retrieval)

Results

Travel

Methods        MAP     R-Precision   MRR
VSM            0.198   0.138         0.228
LMIR           0.203   0.154         0.248
Our approach   0.236   0.192         0.279

Computers & Internet

Methods        MAP     R-Precision   MRR
VSM            0.236   0.175         0.289
LMIR           0.248   0.191         0.304
Our approach   0.279   0.230         0.341

Page 23: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Experimental Results (Recommendation)

Data (Yahoo! Answers)
• Query: 100 questions about ‘travel’; 100 questions about ‘computers & internet’
• Relevance: human judgment

Baselines
• VSM (Vector Space Model), PVSM (Phrase-based Vector Space Model)

Results

Travel

Methods        MAP     R-Precision   P@5
VSM            0.321   0.235         0.226
PVSM           0.291   0.276         0.234
Our approach   0.350   0.324         0.290

Computers & Internet

Methods        MAP     R-Precision   P@5
VSM            0.307   0.216         0.200
PVSM           0.257   0.242         0.214
Our approach   0.316   0.316         0.248
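For readers less familiar with the measures reported in the search and recommendation results, here is a minimal sketch of their standard definitions for a single query. MAP and MRR are the means of average precision and reciprocal rank over all queries. The question ids in the example are made up, and this is not the evaluation code used for the experiments.

```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked: list, relevant: set) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked: list, relevant: set) -> float:
    """Precision at rank R, where R is the number of relevant items."""
    return precision_at_k(ranked, relevant, len(relevant)) if relevant else 0.0


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# One hypothetical query: a ranked list of question ids and the judged-relevant set.
ranked = ["q3", "q1", "q7", "q2", "q9"]
relevant = {"q1", "q2"}

ap = average_precision(ranked, relevant)
rp = r_precision(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
p5 = precision_at_k(ranked, relevant, 5)
print(ap, rp, rr, p5)
```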

Page 24: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Error Analysis (Search)

Statistics on question topic/focus identification errors

Data                   No Question Focus   Inaccurate Specificity   Total
Travel                 59                  10                       69
Computers & Internet   47                  18                       65

The reason: data sparseness (more than 0.04 MAP drop)

• No question focus (data sparseness over question topics)
  • Does anyone know anything about West Suburban Dialysis in Chicago?
  • West Suburban Dialysis → Chicago → anything
  • Remedy: search question descriptions and answers as well as question titles
• Inaccurate specificity (data sparseness over question foci)
  • Any nightlife activities near Generator Hostel, Berlin?
  • Incorrect: Generator Hostel → nightlife activity → Berlin
  • Correct: Generator Hostel → Berlin → nightlife activity
  • Remedy: cluster topic terms (e.g., nightlife activity vs. night life activity)

Page 25: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Knowledge Distillation & Dissemination

• Scalable Question Answering and Distillation
• Highly Structured QnA: FAQ
• Structured QnA: QnA
• Semi-structured QnA: Forum
• Unstructured QnA: Web

Page 26: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

QUESTION AND ANSWER = KNOWLEDGE

KNOWLEDGE = POWER

• Q&A = Knowledge = Power
• Q&A is a complement to web keyword search
• Q&A can enhance existing QnA and search services
• Leverage existing knowledge in question-and-answer form

Page 27: SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia

Discussion