知识图谱的构建与应用 -...

32
知识图谱的构建与应用 清华大学计算机系 朱小燕 [email protected], @朱小燕THU 2014.4.18

Upload: others

Post on 28-Jul-2020

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

知识图谱的构建与应用

清华大学计算机系朱小燕

[email protected], @朱小燕THU2014.4.18

Page 2: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Contents

1

• Information & Knowledge

2

• Challenges

3

• Ideas & Achievements

Page 3: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Information & Knowledge

Page 4: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

What do we use the internet for?

Access the Internet for getting news,

URL, etc.

INFORMATION SERVICE

Build knowledge graph for

generating answers

News Microblog

PictureVideo

Encyclopedia Forum

Page 5: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Information Services

• Baidu Box Computing

• Google Knowledge Graph

Page 6: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge

• What is knowledge?– Entity + Rule

– Triple (predicate, semantic network, rule, framework, ……)

Abstracted Information

Coded InformationDrawback: Lack of ability to communicate with machine

DIRECTLY!

知识的作用:帮助信息计算、理解、评价

Page 7: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Challenges

Page 8: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Scientific Perspective

• Our Goal

– Information Measurable

– Knowledge Computable

Information/

Knowledge

Acquisition

Measure Mining

Structurization

Page 9: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Application Perspective

Information Acquisition• How to measure the information

Knowledge Base• How to build knowledge base

– To build from original materials– To extend and update– To merge difference databases– To verify the knowledge

• How to use knowledge bases– For answer generation– For answer re-ranking– For inference– For recommendation– For vertical search– ……

Page 10: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Ideas and Achievements

Page 11: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge Base Construction

• Knowledge Base Construction– General Knowledge Base Construction

– Information Organization for Multi-source UGCs

– Domain specific Knowledge Base

• Key Points– Organization with semantic information

• Topic Hierarchy structure

– Practicability• Timeliness, Iteratively Update

• Customization, Query Oriented

Page 12: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Information Organization for Multi-source UGCs via Topic Hierarchy Construction

• Multi-source UGCs

– Huge volume & diversified topics

– Complementary

• Topic Hierarchy

– root node: user topic

– non-root node: sub topics

– leaf node: link to UGCs

Can we organize Multi-source UGCs by a unified structure, which can effectively guide users to their required knowledge?

Power Of Statistics : Find a Topic

Quality of Contents : Describe a Topic

… policypoll Islam

medic care

Page 13: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Information Organization for Multi-source UGCs via Topic Hierarchy Construction (cont.)

• Topic Extraction– Candidate selection via Multi-source

based Voter

– Sub-topic relation via Multi-source evidence

learning and combination

• Topic Hierarchy Construction– Iteratively we do :

• Select a node that maximize the

relatedness with the current hierarchy

• Update the edges’ weights on

the resultant hierarchy

• Adopt Edmond’s algorithm to

remove the potential cycles

Tax

p = 0.013

p = 0.009

p = 0.006

p = 0.015

Barack Obama

Tax

Debate

Resultant Hierarchy

Barack Obama

Policy

Tax

Debate

Policy

Published on SIGIR2013

Page 14: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge Base Applications

• KB for social media content presentation

• KB for recommendation system

• KB for vertical search engine

Page 15: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Problem of iPhone 5

… …

Customized Organization of Social Media Contents using Focused Topic Hierarchy

• Motivation

– Users’ information need: Specific, fine-grained, precise

– Information on the Web: Wide-coverage, coarse-grained, noisyTo organize the information from social media sources in a user-friendly way,

so that the user can fast and easily acquire their required information.

Submitted to SIGKDD2014

Page 16: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

• Graph-based Topic Discovering– Weighted Topic Graph Generation

– Relevant Topic Detection via Graph Propagation

• Customized Hierarchy Construction– Using the detected topics, iteratively we do:

• Try to add one topic into the current hierarchy

• Remove the newly added edge if:

(1) it is duplicated with

existing edge

(2) it is part of a directed

cycle on the hierarchy

• Adopt Edmond’s algorithm to eliminate un-directed cycles

• Choose the resultant hierarchy that has the maximize likelihood as the final result of this iteration

Customized Organization of Social Media Contents using Focused Topic Hierarchy (cont.)

or: cut the edge

or

Page 17: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Enhance Personalized Recommendation using Topic Hierarchies

• Motivation– Purchase records and user ratings are used for recommendation systems

without detail semantic information

– UGCs included semantic information which are unstructured and noised

I like Pixar movies

UGCs

I like funny movies

Animation

Fun Entertainment

Disney Comedy

• huge volume• timeliness• crowded

Use topic hierarchies generated from UGCs to enhance recommenders with fine-grained, structured and dynamic background knowledge

Page 18: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Enhance Personalized Recommendation using General Topic Hierarchies (cont.)

• Topic Hierarchy Alignment among Items

– Topics matching via topic relatedness calculation

– Confliction resolution via optimum branching

• Topic Hierarchy based Recommender– Learning user-topic preference and item-topic feature matrixes via constrained

matrix factorization

Learn user’s positive preference on topics from users’ purchase records

Learn user’s negative preference on topics from users’ rating of items

Update item’s features from topic hierarchy and collaborative filtering

Initiate item’s features from items’ specific topic hierarchies

Page 19: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge Base for Vertical Search Engine

• Vertical Search Platform– Unified Knowledge Representation

• Triples (compatible with Freebase)

– General Information Processing Pipeline• High reusability of most modules• Easily and rapidly portable to different vertical domains if provided

enough domain data

– Accommodating Heterogeneous Source of Data• Knowledge Base• CQA • FAQs• Query logs • Encyclopedia (Wikipedia, Baidu Baike)• Free texts, books • APIs

Page 20: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Seed Production Rule Generation

Production Rule Learning

Knowledge Base

Seed Production Rules

RawParaphrase

Production RulesLexicon

Lexical Mapping Paraphrase

PERSON: who, J.K RowlingBOOK: Harry PotterWRITE: wrote, author

WRITER(J.K Rowling, Harry Potter)

WRITER-> PERSON WRITE BOOK

Who write Harry Potter?Who is the author of Harry Potter?Who is Harry Potter’s author?

PEOPLE WRITE BOOK?PEOPLE is the WRITE of BOOK?Who is BOOK’s WRITE?

WRITER -> PERSON WRITE BOOKWRITER -> PERSON is the WRITE of BOOK?WRITER -> PERSON is BOOK’s WRITE?

Semantic Parser

Page 21: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Production Rules CFG Parser

Questions

SyntacticStructure

Query Generation

Query:SPARQL

Semantic Parser (cont.)

Page 22: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Vertical Search Platform

Query

七里香是谁唱的?

Subject Relation Object

……

七里香 歌手 周杰伦

……

Semantic Parser

PatternGeneration

KnowledgeBase

Answer

周杰伦

Structured Query

?x <perform> ?y?y <name>七里香

Dialog Management

Domain-specific

Grammar

PreconditionCompletion

Page 23: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Applications

• Public Health Care (Database + CQA)

公众健康问答

• Music Search (Database) 音乐问答

• Mobile Services (Query log) 业务助手

• Weather (CQA + APIs) 天气自动问答THU

• Open Domain 清华小智

• Leaks & Exploits(Free text + Domain knowledge/rules)

• College Recruitment(FAQ)

微信公共账号

Page 24: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Examples of the Application (1)

• Music QA

– Resource:Domain Database,CQA

Page 25: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Examples of the Application (2)

• Health QA

– Resource:Domain Database,CQA,Baidu Baike

Page 26: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Acknowledgement

• Prof. Minlie Huang and Dr. Yu Hao,

• Mr. Xian Zhang, Chong Long, Fan Bu, Hao Xiong, Chao Han, Zhicheng Zheng, Xingwei Zhu, Tanche Li, Yicheng Liu, Yipeng Jiang, and Yang Tang ……

Page 27: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Thanks

Page 28: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

对话管理

结构化查询

其它来源答案

候选答案

知识库

领域相关语法

模式生成

语义解析引擎

?x <perform> ?y?y <name>刘欢

领域语法与模式

答案验证与生成

条件填充刘欢的代表作是?

Subject Relation Object

……

好汉歌 歌手 刘欢

……

语义解析

Vertical Search Platform

Page 29: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Information Distance

Kolmogorov Complexity based Information Distance

Dmax(x,y)

Dmax(x,y|c)dmax(x,y|c)dmax(x,y)

Dmin(x,y)dmin(x,y)

Dmax(x1,x2,…)

29

Information Metrics

Page 30: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Applications of Information Distance

• Answer re-ranking

• Concepts measure, relatedness evaluation

• Similarity measure of question answer pairs

• Multiple document summarization

Published on:SIGKDD2006, COLING2010, IJCAI2011, CIKM2008, ICDM2009, WI2009;Knowledge and Information Systems, JCST, and Communications of ACM

Page 31: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge Base Construction

• Knowledge Extraction

– Semi-structured text

• Tables, info-boxes, etc.

– Free text

• Knowledge Transfer

– Open knowledge bases

• Freebase, Yago, DBPedia, etc.

– Inter-language transfer

• Wikipedia Multi-language linker

• Google translation

Subject Relation Object

……

巴拉克奥巴马 拥有国籍 美利坚合众国

……

Page 32: 知识图谱的构建与应用 - bj.bcebos.combj.bcebos.com/cips-upload/贵阳战略研讨会/知识图谱的构建... · 知识图谱的构建与应用 清华大学计算机系 朱小燕

Knowledge Base Construction

• Knowledge Base

– Knowledge structure

• Triple

– entity

– relation

– Size of depository

• Resources

– Baidu Baike

– Freebase

– Wikipedia

• Contains:

– 250,000 entities

– 1660,000 triples

Subject Relation object

……

巴拉克奥巴马 拥有国籍 美利坚合众国

……

巴拉克奥巴马 美利坚合众国拥有国籍

Extracted From Baidu Baike:

120, 000 entities, 650,000 triples

Transfer From Freebase:

130, 000 entities, 1000,000 triples