Department of Computer Science, Tsinghua University
Answer Generating Methods for Community Question and Answering Portals
{Tao Haoxiong, Hao Yu, Zhu Xiaoyan}
@Tsinghua University
Outline
• Introduction and Related Work
• List-type Question
– Answer generating method
– Method result and analysis
• Solution-type Question
– Visible list
– Select the best list
– Experiment and analysis
• Conclusion
• Future Work
Introduction
• Online community question answering (cQA) portals, such as Soso Wenwen and Baidu Zhidao, have become a popular way to acquire information.
• But they have some limitations:
– Answers are not available in real time.
– Many answers are of low quality.
Related Work
• To work around the lack of real-time answers, cQA portals provide a search service.
– Users must click through links to read the complete answers.
– Finding useful information takes a long time.
Related Work
• To return high-quality answers:
– Predict the quality of cQA answers from user profile features, text features, etc.
– Summarize answers with multi-document summarization; the result is more comprehensive but less readable.
– To improve answer quality, almost all well-performing systems introduce a question taxonomy.
Related Work
• The question taxonomy proposed by Fan Bu contains 6 question types:
Type          Proportion
List          23.8%
Solution      19.7%
Reason        18.1%
Navigation    14.8%
Fact          14.4%
Definition    7.5%
• Examples:
– List-type: List the Nobel Prize winners of the 1990s.
– Solution-type: How to make pizzas?
Research Framework
• We propose answer generating methods for both List-type and Solution-type questions.
List-type Question
• Each answer will be a single phrase or a list of phrases.
Answer Generating Method
• Two characteristics of answers:
– The “Best Answer” often does not contain all answer points.
– Answer points that are high-quality or relevant to the question often appear in more than one answer.
• We propose a method based on clustering of answer points.
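The slides do not spell out how answer points are split or clustered, so the following is only a minimal Python sketch of the idea under stated assumptions: answers are split on common list separators, answer points are compared by character-bigram cosine similarity, and a greedy single-pass clustering with a hypothetical `threshold` of 0.6 groups them. Clusters are ranked by how many distinct answers mention them. All function names are illustrative and not taken from the original work.

```python
import re
from collections import Counter
from math import sqrt


def split_answer_points(answer):
    """Split one answer into candidate answer points on common separators (assumption)."""
    parts = re.split(r"[,;\n、,;]+", answer)
    return [p.strip().lower() for p in parts if p.strip()]


def bigram_counts(text):
    """Count character bigrams of a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two strings over character-bigram counts."""
    if a == b:
        return 1.0
    va, vb = bigram_counts(a), bigram_counts(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def cluster_answer_points(answers, threshold=0.6):
    """Greedy single-pass clustering; returns (representative, support) pairs,
    ranked by the number of distinct answers that contain the point."""
    clusters = []  # each cluster: {"rep": str, "answers": set of answer indices}
    for idx, answer in enumerate(answers):
        for point in split_answer_points(answer):
            for cluster in clusters:
                if cosine(point, cluster["rep"]) >= threshold:
                    cluster["answers"].add(idx)
                    break
            else:
                # No existing cluster is similar enough: start a new one.
                clusters.append({"rep": point, "answers": {idx}})
    ranked = sorted(clusters, key=lambda c: len(c["answers"]), reverse=True)
    return [(c["rep"], len(c["answers"])) for c in ranked]
```

Because the output is ranked by support, truncating it gives a shorter answer; the threshold trades merging near-duplicate points against wrongly merging distinct ones.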
Example of the Method Result
Method Result and Analysis
• The result contains more answer points than the “Best Answer”.
• Outputs are ranked, so the answer length is easy to control.
• Further research is needed on:
– Splitting an answer into answer points.
– Choosing the clustering threshold.
Solution-type Question
• Visible List
– We chose 1179 solved Solution-type questions from Baidu Zhidao; 30% of them have answers containing visible lists.
– The average length of a “Best Answer” is above 1400 words, while the average length of a visible list is about 600 words.
– 55% of the questions with visible lists have more than one visible list, so we propose a method to select the best list.
Select the Best List
• Features:
– FirstList: 1 if the list is the first list in the answer, otherwise 0.
– GuideSimilarity: cosine similarity between the guide words and the question title. Example: guide words “List 4: Three clever methods for treating chronic pharyngitis” (列表四:三种方法巧疗慢性咽炎) and question title “Question: How is chronic pharyngitis treated?” (问题:慢性咽炎怎么治疗?).
– ContentSimilarity: cosine similarity between the list content and the question.
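As a hedged illustration of the similarity features, the sketch below computes cosine similarity over raw term counts in Python. The slides do not say how text is tokenized or weighted, so the use of single characters as tokens (standing in for a real Chinese word segmenter) and of unweighted counts are assumptions; the function names are illustrative.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * vb[token] for token, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def first_list(list_index):
    """FirstList: 1 if this is the first list in the answer, otherwise 0."""
    return 1.0 if list_index == 0 else 0.0


# Character tokens stand in for a real word segmenter (assumption of this sketch).
guide_words = list("三种方法巧疗慢性咽炎")
question_title = list("慢性咽炎怎么治疗")
guide_similarity = cosine_similarity(guide_words, question_title)
```

ContentSimilarity can reuse `cosine_similarity` with the tokens of the list content and of the question.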
Select the Best List
• Features:
– VPRatio: ratio of verbs and prepositions among the words in the list content.
– SummaryScore: if the summarized answer contains N sentences and a visible list contains k of them, the list's summary score is k/N.
• Method:
– Each feature takes a value in [0, 1]; we use a Learning to Rank model to learn the weight of each feature.
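The remaining features and the final selection step can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation: the POS tag names `"v"` and `"p"`, exact sentence matching for SummaryScore, and a linear weighted sum as the scoring function are all hypothetical choices, and the weights passed to `select_best_list` would in practice come from the trained Learning to Rank model.

```python
def vp_ratio(pos_tags):
    """VPRatio: share of words tagged as verb or preposition.

    `pos_tags` holds one tag per word from any POS tagger; the tag
    names "v" and "p" are assumptions of this sketch.
    """
    if not pos_tags:
        return 0.0
    return sum(tag in ("v", "p") for tag in pos_tags) / len(pos_tags)


def summary_score(list_sentences, summary_sentences):
    """SummaryScore: k/N, where k of the N summary sentences occur in the list."""
    if not summary_sentences:
        return 0.0
    in_list = set(list_sentences)
    k = sum(sentence in in_list for sentence in summary_sentences)
    return k / len(summary_sentences)


FEATURES = ["FirstList", "GuideSimilarity", "ContentSimilarity", "VPRatio", "SummaryScore"]


def select_best_list(candidates, weights):
    """Return the index of the candidate list with the highest weighted feature sum.

    `candidates` is a list of dicts mapping feature name to a value in [0, 1];
    `weights` maps feature name to a learned weight (hypothetical values here).
    """
    def score(features):
        return sum(weights[name] * features[name] for name in FEATURES)

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```

Keeping every feature in [0, 1] makes the learned weights directly comparable across features.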
Experiment and Analysis
• Dataset:
– We chose 1179 questions from Baidu Zhidao; 358 (30%) of them have visible lists.
– Of these, 196 (55%) have more than one visible list.
– The 196 questions with more than one visible list were manually labeled with a score (1: high quality; 0: low quality).
• Two evaluations:
– Evaluate the method of selecting the best list.
– Evaluate the quality of the visible list as the answer.
Result of Selected Visible-lists
* Random selection: 51.7%
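The transcript does not state how the selection result or the random baseline was computed. One plausible reading, sketched here in Python purely as an assumption, is that each visible list carries a 0/1 quality label, precision is the share of questions whose selected list is labeled 1, and the random baseline is the expected precision of picking one list per question uniformly at random. The function names and data layout are hypothetical.

```python
def selection_precision(labels_per_question, selected_indices):
    """Share of questions whose selected list is labeled 1 (high quality)."""
    hits = sum(labels[i] for labels, i in zip(labels_per_question, selected_indices))
    return hits / len(labels_per_question)


def random_baseline(labels_per_question):
    """Expected precision when one list per question is chosen uniformly at random."""
    per_question = [sum(labels) / len(labels) for labels in labels_per_question]
    return sum(per_question) / len(per_question)
```

Under this reading, a selection method is useful only if its precision clearly exceeds the random baseline on the same labeled questions.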
Evaluate Visible List as Answer
• Manually compare the quality of the “Best Answer” and the visible list for each question:
– The comparison focuses mainly on relevance to the question, completeness, and whether the text contains redundant information.
• The average length of a visible list is about 600 words, while the average length of a “Best Answer” is more than 1400 words.
Conclusion
• Relying on similar questions and their answers from cQA portals, we propose answer generating methods for List-type and Solution-type questions:
– List-type questions: based on the clustering of answer points.
– Solution-type questions: based on visible lists.
Future Work
• List-type questions:
– Split answers into answer points more robustly.
• Solution-type questions:
– Introduce more semantic features to improve the semantic relevance between the selected list and the question.
• Other types of questions:
– Investigate how to generate high-quality answers.
Thanks