知识图谱的构建与应用 -...
TRANSCRIPT
Contents
1
• Information & Knowledge
2
• Challenges
3
• Ideas & Achievements
Information & Knowledge
What do we use the internet for?
Access the Internet for getting news,
URL, etc.
INFORMATION SERVICE
Build knowledge graph for
generating answers
News Microblog
PictureVideo
Encyclopedia Forum
Information Services
• Baidu Box Computing
• Google Knowledge Graph
Knowledge
• What is knowledge?– Entity + Rule
– Triple (predicate, semantic network, rule, framework, ……)
Abstracted Information
Coded InformationDrawback: Lack of ability to communicate with machine
DIRECTLY!
知识的作用:帮助信息计算、理解、评价
Challenges
Scientific Perspective
• Our Goal
– Information Measurable
– Knowledge Computable
Information/
Knowledge
Acquisition
Measure Mining
Structurization
Application Perspective
Information Acquisition• How to measure the information
Knowledge Base• How to build knowledge base
– To build from original materials– To extend and update– To merge difference databases– To verify the knowledge
• How to use knowledge bases– For answer generation– For answer re-ranking– For inference– For recommendation– For vertical search– ……
Ideas and Achievements
Knowledge Base Construction
• Knowledge Base Construction– General Knowledge Base Construction
– Information Organization for Multi-source UGCs
– Domain specific Knowledge Base
• Key Points– Organization with semantic information
• Topic Hierarchy structure
– Practicability• Timeliness, Iteratively Update
• Customization, Query Oriented
Information Organization for Multi-source UGCs via Topic Hierarchy Construction
• Multi-source UGCs
– Huge volume & diversified topics
– Complementary
• Topic Hierarchy
– root node: user topic
– non-root node: sub topics
– leaf node: link to UGCs
Can we organize Multi-source UGCs by a unified structure, which can effectively guide users to their required knowledge?
Power Of Statistics : Find a Topic
Quality of Contents : Describe a Topic
…
…
… policypoll Islam
medic care
Information Organization for Multi-source UGCs via Topic Hierarchy Construction (cont.)
• Topic Extraction– Candidate selection via Multi-source
based Voter
– Sub-topic relation via Multi-source evidence
learning and combination
• Topic Hierarchy Construction– Iteratively we do :
• Select a node that maximize the
relatedness with the current hierarchy
• Update the edges’ weights on
the resultant hierarchy
• Adopt Edmond’s algorithm to
remove the potential cycles
Tax
p = 0.013
p = 0.009
p = 0.006
p = 0.015
Barack Obama
Tax
Debate
Resultant Hierarchy
Barack Obama
Policy
Tax
Debate
Policy
Published on SIGIR2013
Knowledge Base Applications
• KB for social media content presentation
• KB for recommendation system
• KB for vertical search engine
Problem of iPhone 5
… …
Customized Organization of Social Media Contents using Focused Topic Hierarchy
• Motivation
– Users’ information need: Specific, fine-grained, precise
– Information on the Web: Wide-coverage, coarse-grained, noisyTo organize the information from social media sources in a user-friendly way,
so that the user can fast and easily acquire their required information.
Submitted to SIGKDD2014
• Graph-based Topic Discovering– Weighted Topic Graph Generation
– Relevant Topic Detection via Graph Propagation
• Customized Hierarchy Construction– Using the detected topics, iteratively we do:
• Try to add one topic into the current hierarchy
• Remove the newly added edge if:
(1) it is duplicated with
existing edge
(2) it is part of a directed
cycle on the hierarchy
• Adopt Edmond’s algorithm to eliminate un-directed cycles
• Choose the resultant hierarchy that has the maximize likelihood as the final result of this iteration
Customized Organization of Social Media Contents using Focused Topic Hierarchy (cont.)
or: cut the edge
or
Enhance Personalized Recommendation using Topic Hierarchies
• Motivation– Purchase records and user ratings are used for recommendation systems
without detail semantic information
– UGCs included semantic information which are unstructured and noised
I like Pixar movies
UGCs
I like funny movies
Animation
Fun Entertainment
Disney Comedy
…
• huge volume• timeliness• crowded
Use topic hierarchies generated from UGCs to enhance recommenders with fine-grained, structured and dynamic background knowledge
Enhance Personalized Recommendation using General Topic Hierarchies (cont.)
• Topic Hierarchy Alignment among Items
– Topics matching via topic relatedness calculation
– Confliction resolution via optimum branching
• Topic Hierarchy based Recommender– Learning user-topic preference and item-topic feature matrixes via constrained
matrix factorization
Learn user’s positive preference on topics from users’ purchase records
Learn user’s negative preference on topics from users’ rating of items
Update item’s features from topic hierarchy and collaborative filtering
Initiate item’s features from items’ specific topic hierarchies
Knowledge Base for Vertical Search Engine
• Vertical Search Platform– Unified Knowledge Representation
• Triples (compatible with Freebase)
– General Information Processing Pipeline• High reusability of most modules• Easily and rapidly portable to different vertical domains if provided
enough domain data
– Accommodating Heterogeneous Source of Data• Knowledge Base• CQA • FAQs• Query logs • Encyclopedia (Wikipedia, Baidu Baike)• Free texts, books • APIs
Seed Production Rule Generation
Production Rule Learning
Knowledge Base
Seed Production Rules
RawParaphrase
Production RulesLexicon
Lexical Mapping Paraphrase
PERSON: who, J.K RowlingBOOK: Harry PotterWRITE: wrote, author
WRITER(J.K Rowling, Harry Potter)
WRITER-> PERSON WRITE BOOK
Who write Harry Potter?Who is the author of Harry Potter?Who is Harry Potter’s author?
PEOPLE WRITE BOOK?PEOPLE is the WRITE of BOOK?Who is BOOK’s WRITE?
WRITER -> PERSON WRITE BOOKWRITER -> PERSON is the WRITE of BOOK?WRITER -> PERSON is BOOK’s WRITE?
Semantic Parser
Production Rules CFG Parser
Questions
SyntacticStructure
Query Generation
Query:SPARQL
Semantic Parser (cont.)
Vertical Search Platform
Query
七里香是谁唱的?
Subject Relation Object
……
七里香 歌手 周杰伦
……
Semantic Parser
PatternGeneration
KnowledgeBase
Answer
周杰伦
Structured Query
?x <perform> ?y?y <name>七里香
Dialog Management
Domain-specific
Grammar
PreconditionCompletion
Applications
• Public Health Care (Database + CQA)
公众健康问答
• Music Search (Database) 音乐问答
• Mobile Services (Query log) 业务助手
• Weather (CQA + APIs) 天气自动问答THU
• Open Domain 清华小智
• Leaks & Exploits(Free text + Domain knowledge/rules)
• College Recruitment(FAQ)
微信公共账号
Examples of the Application (1)
• Music QA
– Resource:Domain Database,CQA
Examples of the Application (2)
• Health QA
– Resource:Domain Database,CQA,Baidu Baike
Acknowledgement
• Prof. Minlie Huang and Dr. Yu Hao,
• Mr. Xian Zhang, Chong Long, Fan Bu, Hao Xiong, Chao Han, Zhicheng Zheng, Xingwei Zhu, Tanche Li, Yicheng Liu, Yipeng Jiang, and Yang Tang ……
Thanks
对话管理
结构化查询
其它来源答案
候选答案
知识库
领域相关语法
模式生成
语义解析引擎
?x <perform> ?y?y <name>刘欢
领域语法与模式
答案验证与生成
条件填充刘欢的代表作是?
Subject Relation Object
……
好汉歌 歌手 刘欢
……
语义解析
Vertical Search Platform
Information Distance
Kolmogorov Complexity based Information Distance
Dmax(x,y)
Dmax(x,y|c)dmax(x,y|c)dmax(x,y)
Dmin(x,y)dmin(x,y)
Dmax(x1,x2,…)
29
Information Metrics
Applications of Information Distance
• Answer re-ranking
• Concepts measure, relatedness evaluation
• Similarity measure of question answer pairs
• Multiple document summarization
Published on:SIGKDD2006, COLING2010, IJCAI2011, CIKM2008, ICDM2009, WI2009;Knowledge and Information Systems, JCST, and Communications of ACM
Knowledge Base Construction
• Knowledge Extraction
– Semi-structured text
• Tables, info-boxes, etc.
– Free text
• Knowledge Transfer
– Open knowledge bases
• Freebase, Yago, DBPedia, etc.
– Inter-language transfer
• Wikipedia Multi-language linker
• Google translation
Subject Relation Object
……
巴拉克奥巴马 拥有国籍 美利坚合众国
……
Knowledge Base Construction
• Knowledge Base
– Knowledge structure
• Triple
– entity
– relation
– Size of depository
• Resources
– Baidu Baike
– Freebase
– Wikipedia
• Contains:
– 250,000 entities
– 1660,000 triples
Subject Relation object
……
巴拉克奥巴马 拥有国籍 美利坚合众国
……
巴拉克奥巴马 美利坚合众国拥有国籍
Extracted From Baidu Baike:
120, 000 entities, 650,000 triples
Transfer From Freebase:
130, 000 entities, 1000,000 triples