101035 中文信息处理 (Chinese Information Processing), Chinese NLP, Lecture 15. Application: Information Extraction (信息抽取)


Upload: russell-ferdinand-mclaughlin

Post on 27-Dec-2015


101035 中文信息处理

Chinese NLP

Lecture 15


Application: Information Extraction (应用: 信息抽取)

• Concepts (基本概念)

• IE Tasks (信息抽取的任务)

• History and Benchmarks (历史和基准)

• IE Process (信息抽取的过程)

• IE vs IR (信息抽取和信息检索)


Concepts (基本概念)

• Information extraction (IE) analyzes unrestricted texts in order to extract information about pre-specified types of events, entities and relations, and to create a structured output from unstructured texts.

• IE is an essential NLP technique that serves information retrieval (信息检索), automatic summarization (自动摘要), question answering (自动问答), etc.


• The Object of IE

• IE typically deals with natural language text, especially unstructured text.

• In a broad sense, IE deals with speech, image, video, and other types of data besides electronic text.

• In a narrow sense, IE deals only with natural language text.


IE Tasks (信息抽取的任务)

• Named Entity Detection and Recognition

• It finds the named entities in the text and classifies them into pre-defined categories, such as persons, organizations, locations, expressions of time, quantities, monetary values, and percentages.

Example: … banks in Boston and New York.

Named entities: Boston, New York


• Co-Reference Resolution

• It identifies the identity relations between the entities in the text.

Example: Jim bought 300 shares of Acme Corp. in 2006. He sold them in 2008.

Entities: Jim (person), 300 (quantity), Acme Corp. (organization), 2006 (date)

Co-references: He → Jim; them → 300 shares
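A minimal sketch of how the co-reference links in this example could be stored, assuming hand-labelled chains rather than an automatic resolver. The `Mention` class, the `chains` list, and the `antecedent` helper are illustrative names, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str      # surface string of the mention
    sentence: int  # index of the sentence containing it


# Hand-labelled chains (an assumption): each list groups the mentions
# that refer to the same entity, in order of appearance.
chains = [
    [Mention("Jim", 0), Mention("He", 1)],
    [Mention("300 shares", 0), Mention("them", 1)],
]


def antecedent(mention_text, chains):
    """Return the first mention of the chain that contains mention_text."""
    for chain in chains:
        if any(m.text == mention_text for m in chain):
            return chain[0].text
    return None


print(antecedent("He", chains))    # Jim
print(antecedent("them", chains))  # 300 shares
```

A real resolver would have to build the chains itself; the sketch only shows the identity relation as a data structure.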


• Entity Relation Detection and Characterization

• It finds the relations between entities in the text and classifies them into pre-defined categories, such as AT, NEAR, PART, GROUP, AFFILIATION, and POSITION.

Example: … banks in Boston and New York.

Entity relations: banks located at Boston; banks located at New York
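The "located at" relation in this example can be approximated with a single surface rule. The sketch below is only an illustration under strong assumptions: place names are taken to be runs of capitalized words, and the names `PLACE`, `IN_PATTERN`, and `located_at` are made up here.

```python
import re

# Crude stand-in for a location noun phrase: capitalized words (an assumption).
PLACE = r"[A-Z]\w*(?: [A-Z]\w*)*"
# Rule: "<head> in <place> (and <place>)*"
IN_PATTERN = re.compile(rf"(\w+) in ({PLACE}(?: and {PLACE})*)")


def located_at(text):
    """Return (head, 'located at', place) triples found by the rule above."""
    triples = []
    for match in IN_PATTERN.finditer(text):
        head, places = match.group(1), match.group(2)
        for place in places.split(" and "):
            triples.append((head, "located at", place))
    return triples


print(located_at("… banks in Boston and New York."))
```

One rule covers only one way of expressing one relation type; a practical system would need typed entities and many more patterns or a trained classifier.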


• Event Detection and Characterization

• It detects the events in which the entities participate, together with their arguments (such as agent, object, source, and target) and attributes (such as time, location, instrument, and purpose), and classifies the identified events into pre-defined categories, such as CREATION, MOVEMENT, TRANSFER, and INTERACTION.

Example: In 1997, the company hired John D. Idol to take over as chief executive.

Event: hired

employee: John D. Idol

employer: the company

position: chief executive

time: 1997
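One possible way to hold such an event as structured output is sketched below. The `Event` class, the hypothetical category label "HIRE", and the split of the slots into arguments versus attributes are assumptions made for illustration; the slot fillers are copied by hand from the example sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    category: str                                   # pre-defined event category
    trigger: str                                    # word that signals the event
    arguments: dict = field(default_factory=dict)   # participants of the event
    attributes: dict = field(default_factory=dict)  # circumstances such as time

    def slots(self):
        """Flatten arguments and attributes into one slot -> filler mapping."""
        return {**self.arguments, **self.attributes}


# Hand-filled from the example sentence; "HIRE" is a hypothetical label.
hire = Event(
    category="HIRE",
    trigger="hired",
    arguments={
        "employer": "the company",
        "employee": "John D. Idol",
        "position": "chief executive",
    },
    attributes={"time": "1997"},
)

print(hire.slots())
```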


In-Class Exercise

• Please read the following sentence and hand-label the entity relations (the labels are up to you, so you can be creative!). The entities are set in bold.

经过一晚的休息,我们一行 10人,早起从丰大出发前往黄山南大门,大概半小时左右到达黄山脚下的东岭换乘中心,然后乘坐大巴到云谷寺选择坐索道上山。


History and Benchmarks (历史和基准)

• MUC (Message Understanding Conference)

• From 1987 to 1997, it was sponsored by DARPA (Defense Advanced Research Projects Agency, TIPSTER Program).

• Datasets: News Domains (Military Messages, Terrorist Events, Corporate Joint Ventures, Airplane crashes, etc.)


• MUC Extraction Tasks

• Named Entity (NE): Find proper names, such as person, organization and location names, and quantities of interest, such as dates, times, percentages, and monetary amounts.

• Co-reference (CO)

• Template Element (TE): Fill slots of entity attributes, such as name, type, descriptor, and category.

• Template Relation (TR): Find the relations between TEs, such as employee_of, product_of, location_of.

• Scenario Template (ST): Build a template around an event in which entities participated.


• ACE (Automatic Content Extraction) Evaluation

• From 1999 to 2008, it was sponsored by NIST (National Institute of Standards and Technology).

• The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events.

• Entity: A Real Object in the World

• Mention: Named (e.g., “George Bush”), Nominal (e.g., “our president”), and Pronominal (e.g., “he”)


• ACE Extraction Tasks

• ACE 2000: Entity Detection and Tracking (EDT)

• ACE 2001: Entity Detection and Tracking (EDT) + Relation Detection and Characterization (RDC)

• ACE 2002: The Same as ACE 2001

• ACE 2003: EDT (for English, Chinese, Arabic) + RDC (for English, Chinese)

• ACE 2004: Entity Detection and Recognition (EDR) + Relation Detection and Recognition (RDR) + Time Expression Recognition and Normalization (TERN)

• ACE 2005: EDR + RDR + TERN + Value Detection and Recognition (VAL) + Event Detection and Recognition (VDR)

• ACE 2007: The Same as ACE 2005

• ACE 2008: Local (Within-Document) EDR and RDR (for English and Arabic) + Global (Cross-Document) EDR and RDR (for English and Arabic)


• TAC (Text Analysis Conference) - KBP (Knowledge Base Population) Track

• Since 2009, it has been sponsored by NIST (National Institute of Standards and Technology).

• TAC Extraction Tasks

• Entity Linking: Determines, for each query (name string), which knowledge base (KB) entity is being referred to, or whether the entity is absent from the reference KB (mono-lingual vs. cross-lingual).

• Slot Filling: Involves collecting a pre-defined set of information regarding certain attributes of an entity, which may be a person or some type of organization.
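The entity linking decision described above can be sketched as a lookup against a toy knowledge base. Everything in the sketch is hypothetical: the KB contents, the IDs, the alias sets, and the choice of the string "NIL" to mark a query that has no entry in the KB.

```python
# Toy reference KB; the IDs, names, and aliases are made up for illustration.
KB = {
    "E001": {"name": "Acme Corp.", "aliases": {"acme", "acme corp", "acme corp."}},
    "E002": {"name": "Boston", "aliases": {"boston"}},
}


def link(query, kb):
    """Return the id of the KB entity whose aliases contain the query, else 'NIL'."""
    key = query.strip().lower()
    for entity_id, entry in kb.items():
        if key in entry["aliases"]:
            return entity_id
    return "NIL"


print(link("Acme Corp.", KB))
print(link("Kenilworth", KB))
```

Exact alias matching ignores ambiguity (one name string, several candidate entities), which is the hard part of the task; the sketch only shows the input and output shape.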


IE Process (信息抽取的过程)

Text → Tokenization → Morphological and Lexical Processing → Syntactic Analysis → Entity Detection and Recognition → Relation Detection and Recognition → Event Detection and Recognition → Templates

(Diagram label: Natural Language Processing (NLP))


• Tokenization and Word Segmentation

• In the first step, the text is divided into sentences and tokens (word occurrences).

• For Chinese, tokenization also includes word segmentation.

Sam, Schwartz, retired, as, executive, …
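A small sketch of both steps. The English tokenizer is a plain regular expression, and the Chinese segmenter uses forward maximum matching against a word list, a simple classical method chosen here only for illustration and not named in the lecture. The lexicon and the function names are assumptions.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment_fmm(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word.

    A single character is emitted when no lexicon word matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


print(tokenize("Sam Schwartz retired as executive."))

# Toy lexicon (an assumption); real segmenters use large dictionaries or models.
lexicon = {"中文", "信息", "抽取", "信息抽取"}
print(segment_fmm("中文信息抽取", lexicon))
```

Greedy matching prefers the longest dictionary word, so with this lexicon "信息抽取" is kept as one word rather than split into "信息" and "抽取".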


• Morphological and Lexical Processing

• Each token may be looked up in a dictionary to determine its possible POS and features (both syntactic and semantic).

• The system may utilize several special purpose dictionaries, such as dictionaries of major place names, common first names, and common company suffixes (such as ‘Inc’).

Input: Sam, Schwartz, retired, as, executive, …

Output: retired -> retire, Sam -> NAME, Inc -> COMPANY SUFFIX, retired -> VBD, …
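The dictionary lookups can be sketched as follows, assuming two tiny hand-made dictionaries. `LEXICON`, `SPECIAL`, and `look_up` are illustrative names; the entries mirror the slide's example.

```python
# Tiny hand-made dictionaries; a real system would load much larger ones.
LEXICON = {"retired": {"lemma": "retire", "pos": "VBD"}}   # general dictionary
SPECIAL = {"Sam": "NAME", "Inc": "COMPANY SUFFIX"}          # special-purpose lists


def look_up(token):
    """Collect whatever the dictionaries know about a token."""
    features = dict(LEXICON.get(token.lower(), {}))
    if token in SPECIAL:
        features["type"] = SPECIAL[token]
    return features


for token in ["Sam", "Schwartz", "retired", "Inc"]:
    print(token, look_up(token))
```

Tokens missing from every dictionary, such as "Schwartz" here, come back with no features, which is why later steps cannot rely on lookup alone.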


• Syntactic Analysis

• It identifies the syntactic structure of the text.

• The contents to be extracted often correspond to the phrases (mainly noun phrases) in the text.


• Named Entity (NE) Recognition

• NE systems identify all the names of people, places, and organizations, as well as dates, amounts of money, etc.

• It is useful for answering “What”, “Who”, “When”, and “Where” questions.

• Entity Extraction Approaches

• Rule-based

• Learning-based (Classification or Sequential Tagging)
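In the sequential tagging formulation, a tagger assigns one label per token and the entities are read off the label sequence. The sketch below assumes the common B-/I-/O tag encoding, which the lecture does not specify, and takes the tags as given instead of predicting them; `bio_to_entities` is an illustrative name.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (text, label) entities."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag whose label differs from the open entity, starts a new entity.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new or not tag.startswith("I-"):
            if current:  # close the entity that was open
                entities.append((" ".join(current), label))
            current, label = ([token], tag[2:]) if starts_new else ([], None)
        else:
            current.append(token)  # I- tag continuing the open entity
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
tags = ["B-PER", "O", "B-QUANT", "O", "O", "B-ORG", "I-ORG", "O", "B-DATE", "O"]
print(bio_to_entities(tokens, tags))
```

The tag names (PER, QUANT, ORG, DATE) are hypothetical abbreviations of the categories person, quantity, organization, and date.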


In-Class Exercise

• Sequential tagging also applies to ___________.

A) word segmentation

B) POS tagging

C) dependency parsing

D) text classification


• Scenario Pattern Matching

• Scenario pattern matching extracts events relevant to the scenario using patterns specific to the task.

• Templates are often used for the purpose.

Matching rules:

PERSON retires as POSITION

PERSON is succeeded by PERSON

An event template:

EVENT: succession

Slots: Person in, Person out, Position, Organization

“retire” and “succeeded” are the trigger verbs.


• Scenario Pattern Matching

• When the typed elements of a rule (PERSON and POSITION, or PERSON and PERSON) match noun phrases of the associated types, the event is identified, and the associated information is filled into the template slots, such as Position and Person out.

(Figure: the event template with slots Person in, Person out, Position, and Organization, shown once for a PERSON/POSITION match and once for a PERSON/PERSON match.)


• Scenario Pattern Matching

Matching rule: PERSON retires as POSITION

Dowd retires as chief of Kenilworth Police Department.

Rocky Marciano retires as world heavyweight champion.

Filled template (first sentence): Person out: Dowd; Position: Chief of Kenilworth Police Department
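The rule "PERSON retires as POSITION" can be sketched as a regular expression that fills a succession template. This is a simplification: a run of capitalized words stands in for a noun phrase typed as PERSON, and the class and function names are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionEvent:
    person_in: Optional[str] = None
    person_out: Optional[str] = None
    position: Optional[str] = None
    organization: Optional[str] = None


# Assumption: capitalized words approximate a PERSON noun phrase.
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
RETIRES_AS = re.compile(rf"^(?P<person>{NAME}) retires as (?P<position>.+?)\.?$")


def match_retirement(sentence):
    """Fill Person out and Position when the sentence matches the rule, else return None."""
    m = RETIRES_AS.match(sentence.strip())
    if m is None:
        return None
    return SuccessionEvent(person_out=m.group("person"), position=m.group("position"))


for s in [
    "Dowd retires as chief of Kenilworth Police Department.",
    "Rocky Marciano retires as world heavyweight champion.",
]:
    print(match_retirement(s))
```

The slots Person in and Organization stay empty because this rule carries no information about them; other rules or later processing would have to supply them.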


IE vs IR (信息抽取和信息检索)

• Information Retrieval (IR, 信息检索)

• IR retrieves a collection or a subset of documents which are hopefully relevant to a query, based on keyword searching.

• IR is the essential technique underlying search engines (Google, Baidu, Bing, etc.) and many IT successes.


• IE vs IR

Information Retrieval gets sets of relevant documents; you then analyze the documents.

Information Extraction gets facts out of documents; you then analyze the facts.
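The contrast can be made concrete on a three-document toy collection. In this sketch, retrieval is reduced to substring keyword matching and extraction to one surface pattern; both are deliberate simplifications, and `DOCS`, `retrieve`, and `extract` are illustrative names.

```python
import re

DOCS = {
    "d1": "Dowd retires as chief of Kenilworth Police Department.",
    "d2": "Rocky Marciano retires as world heavyweight champion.",
    "d3": "Jim bought 300 shares of Acme Corp. in 2006.",
}


def retrieve(query, docs):
    """IR: return the ids of documents that contain every query keyword."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]


def extract(docs):
    """IE: return structured facts (who retired from which position)."""
    pattern = re.compile(r"^(.+?) retires as (.+?)\.?$")
    facts = []
    for doc_id, text in docs.items():
        m = pattern.match(text)
        if m:
            facts.append({"doc": doc_id, "person": m.group(1), "position": m.group(2)})
    return facts


print(retrieve("retires", DOCS))  # document ids; the reader still has to read them
print(extract(DOCS))              # facts; ready to be stored in a table
```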


Wrap-Up

• Concepts (基本概念)

• IE Tasks (信息抽取的任务)

  • Named Entity Detection and Recognition

  • Co-Reference Resolution

  • Entity Relation Detection and Characterization

  • Event Detection and Characterization

• History and Benchmarks (历史和基准)

  • MUC

  • ACE

  • TAC

• IE Process (信息抽取的过程)

• IE vs IR (信息抽取和信息检索)