Post on 19-Jan-2016

Department of Computer Science, Tsinghua University

Answer Generating Methods for Community Question and Answering Portals

{Tao Haoxiong, Hao Yu, Zhu Xiaoyan}

@Tsinghua University

Outline

• Introduction and Related Work

• List-type Question

– Answer generating method

– Method result and analysis

• Solution-type Question

– Visible list

– Select the best list

– Experiment and analysis

• Conclusion

• Future Work

Introduction

• Online community question answering (cQA) portals, such as Soso Wenwen and Baidu Zhidao, have become a popular way to acquire information.

• But they have some limitations:

– Answers are not available in real time.

– Many answers are of low quality.

Related Work

• To work around the lack of real-time answers, cQA portals offer a search service.

– Users must click through links to see the full answers.

– Finding useful information takes a long time.

Related Work

• To return high-quality answers

– Predict the quality of cQA answers.

• User profile features, text features, etc.

– Use multi-document summarization to summarize answers.

• More comprehensive but less readable.

– To improve answer quality, almost all well-performing systems introduce a question taxonomy.

Related Work

• The question taxonomy proposed by Fan Bu contains 6 question types:

Type          Proportion
List          23.8%
Solution      19.7%
Reason        18.1%
Navigation    14.8%
Fact          14.4%
Definition    7.5%

• Examples:

– List-type: List the Nobel Prize winners of the 1990s.

– Solution-type: How do you make pizza?

Research Framework

• We propose answer generating methods for both List-type and Solution-type questions.

List-type Question

• Each answer will be a single phrase or a list of phrases.

Answer Generating Method

• Two characteristics of answers:

– The “Best Answer” often does not contain all answer points.

– Answer points that are high-quality or relevant to the question often appear in more than one answer.

• We propose a method based on clustering answer points.
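
The clustering idea above can be illustrated with a minimal sketch. The slides do not specify the similarity measure, the clustering algorithm, or the threshold, so everything here is an assumption made for illustration: answers are taken as already split into answer points, similarity is Jaccard overlap of character bigrams, clustering is a greedy single pass, and the names `cluster_answer_points`, `generate_answer`, and the default `threshold=0.5` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams of a lowercased, whitespace-stripped answer point."""
    s = "".join(text.lower().split())
    if len(s) < 2:
        return {s} if s else set()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two bigram sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_answer_points(answers: List[List[str]],
                          threshold: float = 0.5) -> List[Dict]:
    """Greedy single-pass clustering of answer points.

    `answers` holds one list of already-split answer points per answer.
    Clusters are returned sorted by how many distinct answers support them.
    """
    clusters: List[Dict] = []
    for answer_id, points in enumerate(answers):
        for point in points:
            grams = char_bigrams(point)
            if not grams:
                continue
            # Find the most similar existing cluster, if any.
            best = max(clusters,
                       key=lambda c: jaccard(grams, c["grams"]),
                       default=None)
            if best is not None and jaccard(grams, best["grams"]) >= threshold:
                best["answers"].add(answer_id)
            else:
                clusters.append({"rep": point, "grams": grams,
                                 "answers": {answer_id}})
    # sorted() is stable, so ties keep first-seen order.
    return sorted(clusters, key=lambda c: len(c["answers"]), reverse=True)


def generate_answer(answers: List[List[str]],
                    threshold: float = 0.5,
                    max_points: Optional[int] = None) -> List[str]:
    """Ranked representative answer points, optionally truncated."""
    reps = [c["rep"] for c in cluster_answer_points(answers, threshold)]
    return reps if max_points is None else reps[:max_points]


sample = [["Python", "Java"], ["python", "Rust"], ["Java", "Python"]]
ranked = generate_answer(sample)
top_two = generate_answer(sample, max_points=2)
```

Ranking clusters by the number of supporting answers reflects the second characteristic above, and truncating the ranked list is one simple way to control answer length.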

Example of the Method Result

Method Result and Analysis

• The result contains more answer points than the “Best Answer”.

• Outputs are ranked, so the answer length is easy to control.

• Further research is needed on:

– Splitting an answer into answer points.

– Choosing the clustering threshold.

Solution-type Question

• Visible List

– We chose 1179 solved Solution-type questions from Baidu Zhidao; 30% of them have answers that contain visible lists.

– The average length of the “Best Answer” is above 1400 words, while the average length of a visible list is about 600 words.

– 55% of the questions with visible lists have more than one visible list. We propose a method to select the best list.

Select the Best List

• Features:

– FirstList

• 1 if the list is the first list in the answer, otherwise 0.

– GuideSimilarity

• Cosine similarity between the guide words and the question title.

– Guide words: "List 4: Three clever ways to treat chronic pharyngitis" (列表四:三种方法巧疗慢性咽炎)

– Question title: "Question: How is chronic pharyngitis treated?" (问题:慢性咽炎怎么治疗?)

– ContentSimilarity

• Cosine similarity between the list content and the question.
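
The FirstList feature and the two cosine-similarity features can be sketched as follows. The slides do not state the term weighting or the tokenizer, so this sketch assumes raw term counts over text that has already been split into tokens (Chinese text would first need a word segmenter); the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine similarity of two bag-of-words vectors built from raw counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def first_list(list_index: int) -> float:
    """FirstList: 1 for the first list in the answer, otherwise 0."""
    return 1.0 if list_index == 0 else 0.0


# Illustrative tokens only; not taken from the experiment data.
guide_words = ["treat", "chronic", "pharyngitis"]
question_title = ["chronic", "pharyngitis"]
guide_similarity = cosine_similarity(guide_words, question_title)
```

With non-negative counts the similarity always falls in [0, 1], which matches the feature range used when the features are combined.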

Select the Best List

• Features:

– VPRatio

• Proportion of words in the list content that are verbs or prepositions.

– SummaryScore

• The summarized answer contains N sentences. A visible list that contains k of those N sentences gets a summary score of k/N.

• Method:

– Each feature takes a value in [0, 1]. We use a Learning to Rank model to learn the weight of every feature.
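
A rough sketch of VPRatio, SummaryScore, and the final selection step follows. The slides name a Learning to Rank model but not which one, so this sketch assumes a plain linear scoring function whose weights were learned elsewhere. The tag names `VERB` and `PREP`, the function names, and the example weights are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def vp_ratio(tagged_tokens: Sequence[Tuple[str, str]]) -> float:
    """VPRatio: fraction of (word, tag) tokens tagged as verb or preposition."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag in ("VERB", "PREP"))
    return hits / len(tagged_tokens)


def summary_score(list_sentences: Sequence[str],
                  summary_sentences: Sequence[str]) -> float:
    """SummaryScore: k / N, where k of the N summary sentences occur in the list."""
    if not summary_sentences:
        return 0.0
    present = set(list_sentences)
    k = sum(1 for s in summary_sentences if s in present)
    return k / len(summary_sentences)


def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values (an assumed linear ranking function)."""
    return sum(weights[name] * value for name, value in features.items())


def select_best_list(candidates: List[Dict[str, float]],
                     weights: Dict[str, float]) -> int:
    """Index of the candidate list with the highest score."""
    return max(range(len(candidates)),
               key=lambda i: score(candidates[i], weights))


# Hypothetical weights and feature values, for illustration only.
weights = {"FirstList": 0.5, "VPRatio": 1.0}
candidates = [
    {"FirstList": 1.0, "VPRatio": 0.2},
    {"FirstList": 0.0, "VPRatio": 0.9},
]
best_index = select_best_list(candidates, weights)
```

Because every feature lies in [0, 1], the learned weights are directly comparable as a rough indication of each feature's influence.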

Experiment and Analysis

• Dataset:

– We chose 1179 questions from Baidu Zhidao; 358 of them (30%) have visible lists.

– 196 of those 358 questions (55%) have more than one visible list.

– We manually labeled the 196 questions with more than one visible list:

• 1: high quality; 0: low quality.

• Two evaluations:

– Evaluate the method of selecting the best list.

– Evaluate the quality of the visible list as the answer.
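
The two percentages in the dataset description use different denominators, which a quick check makes explicit. The counts are taken from the slide; the variable names are only illustrative.

```python
total_questions = 1179   # questions chosen from Baidu Zhidao
with_lists = 358         # questions whose answers contain visible lists
multi_list = 196         # questions with more than one visible list

# Share of all questions that have visible lists.
share_with_lists = with_lists / total_questions
# Share of the questions with visible lists that have more than one.
share_multi = multi_list / with_lists

print(f"{share_with_lists:.1%}")  # 30.4%
print(f"{share_multi:.1%}")       # 54.7%
```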

Result of Selected Visible-lists

* Random selection: 51.7%

Evaluate Visible List as Answer

• We manually compared the quality of the “Best Answer” and the visible list for each question:

– The comparison focuses mainly on relevance to the question, completeness, and whether the answer contains redundant information.

• The average length of a visible list is about 600 words, while the average length of the “Best Answer” is more than 1400 words.

Conclusion

• Relying on similar questions and their answers from cQA portals, we propose answer generating methods for List-type and Solution-type questions:

– List-type questions: based on clustering answer points.

– Solution-type questions: based on visible lists.

Future Work

• List-type questions:

– Split answers into answer points more robustly.

• Solution-type questions:

– Introduce more semantic features to improve the semantic relevance between the selected list and the question.

• Other types of questions:

– Investigate how to generate high-quality answers.

Thanks
