
Post on 27-Dec-2015


101035 Chinese Information Processing

Chinese NLP

Lecture 15

Application: Information Extraction

• Basic Concepts

• IE Tasks

• History and Benchmarks

• IE Process

• IE vs IR

Basic Concepts

• Information extraction (IE) analyzes unrestricted texts in order to extract information about pre-specified types of events, entities and relations, and to create a structured output from unstructured texts.

• IE is an essential NLP technique that supports information retrieval, automatic summarization, question answering, and other applications.

• Object of IE

• IE typically deals with natural language text, especially unstructured text.

• In a broad sense, IE deals with speech, image, video, and other types of data besides electronic text.

• In a narrow sense, IE deals only with natural language text.

IE Tasks

• Named Entity Detection and Recognition

• It finds the named entities in the text and classifies them into pre-defined categories, such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages.

Example (Named Entity): … banks in Boston and New York.

Named entities: Boston, New York

• Co-Reference Resolution

• It identifies which expressions in the text refer to the same entity (identity relations between entities).

Example: Jim bought 300 shares of Acme Corp. in 2006. He sold them in 2008.

Entities: Jim (person), 300 shares (quantity), Acme Corp. (organization), 2006 (date)

Co-reference: He refers to Jim; them refers to 300 shares

• Entity Relation Detection and Characterization

• It finds the relations between entities in the text and classifies them into pre-defined categories, such as AT, NEAR, PART, GROUP, AFFILIATION, and POSITION.

Example (Entity Relation): … banks in Boston and New York.

banks located at Boston

banks located at New York

• Event Detection and Characterization

• It detects the events in which the entities participate, together with their arguments (such as agent, object, source, and target) and attributes (such as time, location, instrument, and purpose), and classifies the identified events into pre-defined categories, such as CREATION, MOVEMENT, TRANSFER, and INTERACTION.

Example: In 1997, the company hired John D. Idol to take over as chief executive.

Event: hired

employee: John D. Idol

employer: the company

position: chief executive

time: 1997
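
The labeled hiring sentence can also be written down as a structured record. The following is only a minimal sketch: the `Event` class, its field names, the `to_template` helper, and the `HIRE` label are hypothetical names chosen for illustration, not part of the MUC or ACE specifications.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A hypothetical event record: type, trigger word, and role fillers."""
    event_type: str
    trigger: str
    arguments: dict = field(default_factory=dict)   # e.g. employee, employer
    attributes: dict = field(default_factory=dict)  # e.g. time, location

    def to_template(self) -> dict:
        """Flatten the event into a single slot -> filler mapping."""
        slots = {"event": self.event_type, "trigger": self.trigger}
        slots.update(self.arguments)
        slots.update(self.attributes)
        return slots


# Hand-labeled record for:
# "In 1997, the company hired John D. Idol to take over as chief executive."
hire = Event(
    event_type="HIRE",  # illustrative label, not an official category
    trigger="hired",
    arguments={
        "employee": "John D. Idol",
        "employer": "the company",
        "position": "chief executive",
    },
    attributes={"time": "1997"},
)

if __name__ == "__main__":
    for slot, filler in hire.to_template().items():
        print(f"{slot}: {filler}")
```

Keeping arguments and attributes in separate dictionaries mirrors the distinction the task definition draws between the two.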

In-Class Exercise

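• Please read the following sentence and hand-label the entity relations (you can be creative!). The entities are set in bold.

经过一晚的休息,我们一行10人,早起从丰大出发前往黄山南大门,大概半小时左右到达黄山脚下的东岭换乘中心,然后乘坐大巴到云谷寺选择坐索道上山。

(English gloss: After a night's rest, our party of 10 set out early from Fengda for the South Gate of Huangshan, reached the Dongling Transfer Center at the foot of Huangshan in about half an hour, and then took a bus to Yungu Temple, where we chose to go up the mountain by cable car.)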

History and Benchmarks

• MUC (Message Understanding Conference)

• From 1987 to 1997, it was sponsored by DARPA (Defense Advanced Research Projects Agency, TIPSTER Program).

• Datasets: news domains (military messages, terrorist events, corporate joint ventures, airplane crashes, etc.)

• MUC Extraction Tasks

• Named Entity (NE): Find proper names, such as person, organization and location names, and quantities of interest, such as dates, times, percentages, and monetary amounts.

• Co-reference (CO)

• Template Element (TE): Fill slots of entity attributes, such as name, type, descriptor, and category.

• Template Relation (TR): Find the relations between TEs, such as employee_of, product_of, location_of.

• Scenario Template (ST): Build a template around an event in which entities participated.


• ACE (Automatic Content Extraction) Evaluation

• From 1999 to 2008, it was sponsored by NIST (National Institute of Standards and Technology).

• The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events.

• Entity: A Real Object in the World

• Mention: Named (e.g., “George Bush”), Nominal (e.g., “our president”), and Pronominal (e.g., “he”)

• ACE Extraction Tasks

• ACE 2000: Entity Detection and Tracking (EDT)

• ACE 2001: Entity Detection and Tracking (EDT) + Relation Detection and Characterization (RDC)

• ACE 2002: The Same as ACE 2001

• ACE 2003: EDT (for English, Chinese, Arabic) + RDC (for English, Chinese)

• ACE 2004: Entity Detection and Recognition (EDR) + Relation Detection and Recognition (RDR) + Time Expression Recognition and Normalization (TERN)

• ACE 2005: EDR + RDR + TERN + Value Detection and Recognition (VAL) + Event Detection and Recognition (VDR)

• ACE 2007: The Same as ACE 2005

• ACE 2008: Local (Within-Document) EDR and RDR (for English and Arabic) + Global (Cross-Document) EDR and RDR (for English and Arabic)


• TAC (Text Analysis Conference) - KBP (Knowledge Base Population) Track

• Since 2009, it has been sponsored by NIST (National Institute of Standards and Technology).

• TAC Extraction Tasks

• Entity Linking: Determines, for each query (a name string), which knowledge base (KB) entity is being referred to, or whether the entity is absent from the reference KB (mono-lingual vs. cross-lingual).

• Slot Filling: Involves collecting a pre-defined set of information regarding certain attributes of an entity, which may be a person or some type of organization.
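
To make the entity linking task concrete, here is a minimal sketch under strong assumptions: the toy knowledge base, its entity ids, and the `link_entity` function are all hypothetical, and matching is done by exact alias lookup. Real systems also have to disambiguate between several candidate entities, which this sketch does not attempt.

```python
from typing import Optional

# A toy knowledge base: entity id -> set of known name strings (all made up).
TOY_KB = {
    "E1": {"acme corp.", "acme corporation", "acme"},
    "E2": {"boston"},
    "E3": {"new york", "new york city", "nyc"},
}


def link_entity(query: str, kb: dict = TOY_KB) -> Optional[str]:
    """Return the id of the KB entity whose aliases contain the query,
    or None (commonly reported as NIL) when no entry matches."""
    name = " ".join(query.lower().split())  # normalize case and whitespace
    for entity_id, aliases in kb.items():
        if name in aliases:
            return entity_id
    return None


if __name__ == "__main__":
    for q in ["New York", "Acme Corp.", "Kenilworth"]:
        print(q, "->", link_entity(q) or "NIL")
```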


IE Process

Figure (IE pipeline): Text → Tokenization → Morphological and Lexical Processing → Syntactic Analysis → Entity Detection and Recognition → Relation Detection and Recognition → Event Detection and Recognition → Templates

Figure label: Natural Language Processing (NLP)

• Tokenization and Word Segmentation

• In the first step, the text is divided into sentences and tokens (word occurrences).

• For Chinese, tokenization also includes word segmentation.

Sam, Schwartz, retired, as, executive, …
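
A minimal sketch of both steps follows. The slides do not name a segmentation method, so forward maximum matching is used here purely as an assumed, simple illustration; the function names and the tiny lexicon are hypothetical.

```python
import re


def tokenize_english(text: str) -> list:
    """Split text into word and punctuation tokens with a simple regex."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment_fmm(text: str, lexicon: set, max_len: int = 4) -> list:
    """Forward maximum matching: at each position take the longest
    lexicon word; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    print(tokenize_english("Sam Schwartz retired as executive."))
    toy_lexicon = {"中文", "信息", "抽取", "信息抽取"}  # made-up lexicon
    print(segment_fmm("中文信息抽取", toy_lexicon))
```

Because the matcher is greedy, it prefers 信息抽取 over 信息 + 抽取 whenever the longer word is in the lexicon; this is the usual trade-off of dictionary-based segmentation.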


• Morphological and Lexical Processing

• Each token may be looked up in a dictionary to determine its possible POS and features (both syntactic and semantic).

• The system may utilize several special purpose dictionaries, such as dictionaries of major place names, common first names, and common company suffixes (such as ‘Inc’).

Sam, Schwartz, retired, as, executive, …

retired -> retire, Sam -> NAME, Inc -> COMPANY SUFFIX, retired -> VBD, …
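
The dictionary lookups described above can be sketched as follows. All dictionaries here are toy data invented for illustration, and `lookup` is a hypothetical helper, not the interface of any particular system.

```python
# Toy dictionaries, made up for this illustration.
LEMMAS = {"retired": "retire"}
POS_TAGS = {"retired": ["VBD", "VBN"], "as": ["IN"], "executive": ["NN", "JJ"]}
FIRST_NAMES = {"Sam"}
COMPANY_SUFFIXES = {"Inc", "Corp", "Ltd"}


def lookup(token: str) -> dict:
    """Collect the lemma, possible POS tags, and dictionary type of a token."""
    key = token.lower()
    features = {
        "lemma": LEMMAS.get(key, token),
        "pos": POS_TAGS.get(key, []),
    }
    # Special-purpose dictionaries add a semantic type when they match.
    if token in FIRST_NAMES:
        features["type"] = "NAME"
    elif token.rstrip(".") in COMPANY_SUFFIXES:
        features["type"] = "COMPANY SUFFIX"
    return features


if __name__ == "__main__":
    for tok in ["Sam", "Schwartz", "retired", "as", "executive", "Inc"]:
        print(tok, lookup(tok))
```

The lookup returns every possible POS tag rather than one; choosing among them is left to later stages.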

• Syntactic Analysis

• It identifies the syntactic structure of the text.

• The contents to be extracted often correspond to the phrases (mainly noun phrases) in the text.


• Named Entity (NE) Recognition

• NE systems identify all the names of people, places, and organizations, as well as dates, amounts of money, etc.

• Useful for answering the questions about “What”, “Who”, “When” and “Where”.

• Entity Extraction Approaches

• Rule-based

• Learning-based (Classification or Sequential Tagging)
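
In the sequential tagging formulation, each token receives one tag and entity spans are read off the tag sequence. The sketch below assumes the common BIO encoding (B- begins an entity, I- continues it, O is outside), which the slides do not name explicitly; the tags are written by hand here, whereas in practice they would come from a trained sequence model. The function name and tag labels are illustrative.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (entity_text, type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the span that was open
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
tags = ["B-PER", "O", "B-QTY", "I-QTY", "O", "B-ORG", "I-ORG", "O", "B-DATE", "O"]

if __name__ == "__main__":
    for text, label in bio_to_spans(tokens, tags):
        print(f"{label}: {text}")
```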


In-Class Exercise

• Sequential tagging also applies to ___________.

A) word segmentation

B) POS tagging

C) dependency parsing

D) text classification


• Scenario Pattern Matching

• Scenario pattern matching extracts events relevant to the scenario using patterns specific to the task.

• Templates are often used for the purpose.

Matching rules:

PERSON retires as POSITION

PERSON is succeeded by PERSON

“retire” and “succeeded” are the trigger verbs.

An event template:

EVENT: succession

Slots: Person in, Person out, Position, Organization

• When the placeholders of a rule (for example, PERSON and POSITION) match noun phrases with the associated types, the event is identified, and the associated information is filled into the template, such as the Position and Person out slots.

PERSON retires as POSITION: fills the Person out and Position slots.

PERSON is succeeded by PERSON: fills the Person out and Person in slots.

Example (rule: PERSON retires as POSITION):

Dowd retires as chief of Kenilworth Police Department.

Rocky Marciano retires as world heavyweight champion.

Filled template for the first sentence: Person out: Dowd; Position: Chief of Kenilworth Police Department
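
The two matching rules and the succession template can be sketched with regular expressions. This is only an illustration under simplifying assumptions: `NAME_PART`, `PERSON`, `RULES`, `SLOTS`, and `extract_succession` are hypothetical names, and a capitalized-word regex stands in for the typed noun phrases that the earlier pipeline stages would normally supply.

```python
import re

# Crude stand-in for a PERSON noun phrase: capitalized words or initials.
NAME_PART = r"(?:[A-Z][a-z]+|[A-Z]\.)"
PERSON = rf"{NAME_PART}(?: {NAME_PART})*"

# One regex per matching rule; group names are the template slots they fill.
RULES = [
    re.compile(rf"(?P<person_out>{PERSON}) retires as (?P<position>[^.]+)"),
    re.compile(rf"(?P<person_out>{PERSON}) is succeeded by (?P<person_in>{PERSON})"),
]

SLOTS = ("person_in", "person_out", "position", "organization")


def extract_succession(sentence: str):
    """Return a filled succession template, or None if no rule matches."""
    for rule in RULES:
        match = rule.search(sentence)
        if match:
            # Start with every slot empty, then fill what the rule captured.
            template = {"event": "succession", **dict.fromkeys(SLOTS)}
            template.update(match.groupdict())
            return template
    return None


if __name__ == "__main__":
    for s in [
        "Dowd retires as chief of Kenilworth Police Department.",
        "Rocky Marciano retires as world heavyweight champion.",
    ]:
        print(extract_succession(s))
```

Slots that a rule does not capture, such as the organization, stay empty (`None`) until another rule or a later sentence fills them.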

IE vs IR

• Information Retrieval (IR)

• IR retrieves a collection or a subset of documents which are hopefully relevant to a query, based on keyword searching.

• IR is the essential technique underlying search engines (Google, Baidu, Bing, etc.) and many IT successes.

• IE vs IR

Information Retrieval gets sets of relevant documents; you then analyze the documents.

Information Extraction gets facts out of documents; you then analyze the facts.

Wrap-Up

• Basic Concepts

• IE Tasks

  • Named Entity Detection and Recognition

  • Co-Reference Resolution

  • Entity Relation Detection and Characterization

  • Event Detection and Characterization

• History and Benchmarks

  • MUC

  • ACE

  • TAC

• IE Process

• IE vs IR