101035 Chinese Information Processing (中文信息处理), Chinese NLP, Lecture 8. Sentences: Grammatical Analysis (1)


Upload: pamela-mccoy

Post on 24-Dec-2015



101035 Chinese Information Processing (中文信息处理)

Chinese NLP

Lecture 8


Sentences: Grammatical Analysis (1)

• Basics of grammatical analysis (语法分析基础)

• Formal grammars (形式语法)

• Context-free grammars (上下文无关语法)

• Dependency grammar (依存语法)


Basics of Grammatical Analysis (语法分析基础)

• Constituency (句子成分)

• Grammar, or strictly speaking syntax, is about how words are put together to make sentences.

• A constituent is a group of words that behaves as a single unit and fills a certain syntactic role.

• A constituent stands in certain grammatical relations to other constituents.


• Examples of Constituents

• English noun phrases

Harry the Horse
a high-class spot such as Mindy’s
the Broadway coppers
the reason he comes into the Hot Box

• English noun phrases appear in similar syntactic environments, for example before a verb.

a high-class spot such as Mindy’s attracts . . .
the Broadway coppers love . . .

• But the individual words that make up a noun phrase cannot appear in those environments on their own.

* a high-class attracts . . .
* the love . . .


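• Examples of Constituents

• Chinese phrases

• “把……” (bǎ) and “被……” (bèi) constructions

• Structural account: whether a paraphrase is possible depends on how the sentence groups into constituents.

老师被迟到的学生逗乐了。 (The teacher was amused by the students who came late.)
= 迟到的学生把老师逗乐了。 (The students who came late amused the teacher.)
≠ * 老师被迟到的学生被逗乐了。

老师被冤枉的事情传开了。 (The news that the teacher had been wronged spread.)
≠ * 冤枉的事情把老师传开了。
= 老师被冤枉的事情被传开了。 (The news that the teacher had been wronged was spread.)

电话被监听的老师找到了。 (Ambiguous between the two readings below.)
= 监听的老师把电话找到了。 (The teacher who was eavesdropping found the phone.)
= 电话被监听的老师被找到了。 (The teacher whose phone was tapped was found.)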


Formal Grammars (形式语法)

• Enumeration

• In principle, the grammar of a language could be given as the set of all its sentences, listed one by one.

• But we cannot exhaust all possible sentences this way, nor deal with new sentences.

• Rather, we should use recursive rules that describe sentences in terms of their internal structure.


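• Regular expressions

• Symbols of a language (parts of speech, POS)

ART (article), PRON (pronoun), N (noun), V (verb), ADJ (adjective), ADV (adverb)

• Combination patterns of the symbols (a "+" between two symbols means concatenation)

ART+N ; ART+N+V ; ART+ADJ+N+V

• Regular expression symbols

• *: the preceding symbol occurs zero or more times

ART+ADJ*+N

• +: the preceding symbol occurs one or more times

ART+ADJ++N (the first "+" after ADJ is the repetition operator, the second is concatenation)

• ( ): the enclosed symbol occurs zero times or once

ART+(ADJ)+N

• |: disjunction

N | PRON + V (a noun or a pronoun, followed by a verb)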
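The patterns above can be tried out on tag sequences. The following is a minimal sketch, assuming a sentence is given as a list of POS tags; the helper `matches` and the translation of each slide pattern into Python's `re` syntax are illustrative assumptions, not part of the lecture. Note that the slide's ( ) for optionality corresponds to `?` in Python.

```python
import re

# Slide notation -> Python regex over a space-separated tag string.
# The regexes are this sketch's own translation, not part of the lecture.
PATTERNS = {
    "ART+ADJ*+N": r"ART( ADJ)* N",    # zero or more adjectives
    "ART+ADJ++N": r"ART( ADJ)+ N",    # one or more adjectives
    "ART+(ADJ)+N": r"ART( ADJ)? N",   # at most one adjective
    "N | PRON + V": r"(N|PRON) V",    # noun or pronoun, then a verb
}


def matches(pattern, tags):
    """Return True if the whole tag sequence fits the named slide pattern."""
    return re.fullmatch(PATTERNS[pattern], " ".join(tags)) is not None


print(matches("ART+ADJ*+N", ["ART", "ADJ", "ADJ", "N"]))   # True
print(matches("ART+(ADJ)+N", ["ART", "ADJ", "ADJ", "N"]))  # False
```

Matching against the whole string with `re.fullmatch` keeps a pattern from accepting a sequence merely because some prefix of it fits.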


In-Class Exercise

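• Write a regular expression that describes all of the following sentences.

老张是一个环卫工。 (Lao Zhang is a sanitation worker.)
老张是一个聪明的环卫工。 (Lao Zhang is a clever sanitation worker.)
老张是一个聪明勤劳的环卫工。 (Lao Zhang is a clever, hardworking sanitation worker.)
他是一个聪明的人。 (He is a clever person.)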


• Rules in a Formal Grammar

• A set of rules, or productions, expresses the ways in which symbols of the language can be grouped and ordered together.

• S (sentence), NP (noun phrase), VP (verb phrase), PP (prepositional phrase)

• Formal Definition of a Formal Grammar

• N: a set of non-terminal symbols (or variables)

• Σ: a set of terminal symbols (disjoint from N)

• R: a set of rules or productions, each of the form A → β, where A is a non-terminal and β is a string of symbols drawn from the infinite set of strings (Σ ∪ N)*

• S: a designated start symbol

S → NP VP, NP → Det N, VP → V NP, PP → Prep NP
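The four-part definition maps directly onto a small data structure. The sketch below is illustrative only: the `Grammar` type, the helper names, and the four lexical rules (Det → the, N → dog, V → sees, Prep → with) are assumptions added so that the slide's rules can be expanded all the way down to words.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset   # N
    terminals: frozenset      # Sigma
    rules: tuple              # R: pairs (A, beta), beta a tuple of symbols
    start: str                # S


def is_well_formed(g):
    """Check the side conditions of the formal definition."""
    symbols = g.nonterminals | g.terminals
    return (
        g.nonterminals.isdisjoint(g.terminals)
        and g.start in g.nonterminals
        and all(lhs in g.nonterminals for lhs, _ in g.rules)
        and all(sym in symbols for _, rhs in g.rules for sym in rhs)
    )


def derive(g):
    """Leftmost derivation that always picks the first rule listed for a symbol."""
    form = (g.start,)
    steps = [form]
    while any(sym in g.nonterminals for sym in form):
        i = next(k for k, sym in enumerate(form) if sym in g.nonterminals)
        rhs = next(rhs for lhs, rhs in g.rules if lhs == form[i])
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps


toy = Grammar(
    nonterminals=frozenset({"S", "NP", "VP", "PP", "Det", "N", "V", "Prep"}),
    terminals=frozenset({"the", "dog", "sees", "with"}),  # assumed toy lexicon
    rules=(
        ("S", ("NP", "VP")), ("NP", ("Det", "N")),
        ("VP", ("V", "NP")), ("PP", ("Prep", "NP")),
        ("Det", ("the",)), ("N", ("dog",)),               # assumed lexical rules
        ("V", ("sees",)), ("Prep", ("with",)),
    ),
    start="S",
)

for step in derive(toy):
    print(" ".join(step))
```

Each printed line is one sentential form; the last one contains only terminal symbols.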


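Context-Free Grammars (上下文无关语法)

• Definition

• As a kind of formal grammar, Context-Free Grammars (CFGs) are the most commonly used mathematical system for modeling the constituent structure of a language. They are also called Phrase-Structure Grammars. Every rule rewrites a single non-terminal, so it applies regardless of the symbols around it; this is what "context-free" means.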


• Parse tree

• A parse tree is a tree structure that shows how the rules of a CFG are applied in sequence to expand a non-terminal node into terminal nodes.

NP → Det Nominal
Det → a
Nominal → Noun
Noun → flight
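The four rules above determine a single parse tree for "a flight". As a minimal sketch, assuming each non-terminal has exactly one rule (true here, not in general), the tree can be built by expanding symbols recursively; the function names and the bracketed output format are illustrative choices.

```python
# One rule per non-terminal, copied from the slide.
RULES = {
    "NP": ["Det", "Nominal"],
    "Det": ["a"],
    "Nominal": ["Noun"],
    "Noun": ["flight"],
}


def build_tree(symbol):
    """Expand a symbol into a (label, children) tree; terminals stay strings."""
    if symbol not in RULES:
        return symbol
    return (symbol, [build_tree(child) for child in RULES[symbol]])


def bracketed(tree):
    """Render a tree in labelled-bracket notation."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "[" + label + " " + " ".join(bracketed(c) for c in children) + "]"


def leaves(tree):
    """Read the terminal symbols off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


tree = build_tree("NP")
print(bracketed(tree))   # [NP [Det a] [Nominal [Noun flight]]]
print(leaves(tree))      # ['a', 'flight']
```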


• An English Example: I prefer a morning flight.

• Lexicon

• Grammar

• Parse Tree


• Chinese Examples


• Treebanks

• A Treebank is a corpus in which every sentence is syntactically annotated with a parse tree.

• Treebanks are invaluable resources for NLP, especially parsing.

• The Penn Treebank, produced by the Penn Treebank Project, is a representative treebank.

• Samples from Penn Treebank.
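Treebank trees are usually stored as bracketed text. The sketch below reads one such string into a nested structure and recovers the words. The `SAMPLE` string is a hand-written illustration in Penn Treebank style, not an actual corpus entry, and the small reader is a simplified assumption that ignores traces and function tags.

```python
def parse_bracketed(text):
    """Read one bracketed tree into nested (label, children) pairs."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                              # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """Collect the words of a tree from left to right."""
    words = []
    for child in tree[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


# Hypothetical sample in Penn Treebank style (not taken from the corpus).
SAMPLE = "(S (NP (PRP I)) (VP (VBP prefer) (NP (DT a) (NN morning) (NN flight))) (. .))"

tree = parse_bracketed(SAMPLE)
print(tree[0])        # S
print(leaves(tree))   # ['I', 'prefer', 'a', 'morning', 'flight', '.']
```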


• Chomsky Normal Form

• A CFG is in Chomsky Normal Form (CNF) if each production is either of the form A → B C or A → a. That is, the right-hand side of each rule either has two non-terminal symbols or one terminal symbol.

• Conversion to CNF

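VP → VBD NP PP
VP → VP PP
VP → VBD NP PP*

The recursive rule VP → VP PP attaches one further PP at a time, so the first two rules stand in for the whole series VP → VBD NP PP, VP → VBD NP PP PP, and so on, which is loosely abbreviated as VP → VBD NP PP*.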
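One step of converting a grammar to CNF is splitting right-hand sides that are longer than two symbols. The sketch below shows only that binarization step; a full conversion also has to deal with unit productions and with rules that mix terminals and non-terminals. The fresh symbol names X1, X2, ... are an assumption and must not already occur in the grammar.

```python
def binarize(rules):
    """Split every right-hand side longer than two symbols into binary rules."""
    result = []
    fresh = 0
    for lhs, rhs in rules:
        rhs = tuple(rhs)
        while len(rhs) > 2:
            fresh += 1
            new_symbol = f"X{fresh}"               # assumed-unused helper symbol
            result.append((new_symbol, rhs[:2]))   # e.g. X1 -> VBD NP
            rhs = (new_symbol,) + rhs[2:]          # e.g. VP -> X1 PP
        result.append((lhs, rhs))
    return result


for lhs, rhs in binarize([("VP", ("VBD", "NP", "PP"))]):
    print(lhs, "->", " ".join(rhs))
# X1 -> VBD NP
# VP -> X1 PP
```

The new grammar generates the same strings as the old one, because X1 can only be expanded back into VBD NP.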


Dependency Grammar (依存语法)

• Definition

• It is a kind of grammar in which the syntactic structure of a sentence is described purely in terms of words and binary semantic or syntactic relations between these words.

• Dependency relations are directional: each one links a head (governor) to a dependent.

• There are no phrasal levels or non-terminal nodes as in a CFG.


• A Chinese Example

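那个小孩喜欢通俗歌曲 (That child likes popular songs.)

• Dependency Tree: 喜欢 is the root; 小孩 and 歌曲 depend on 喜欢; 那个 depends on 小孩; 通俗 depends on 歌曲.

• Dependency Graph: the same structure drawn as labelled arcs over the sentence: HED (Root → 喜欢), SBV (喜欢 → 小孩), VOB (喜欢 → 歌曲), ATT (小孩 → 那个), ATT (歌曲 → 通俗).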
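A dependency analysis like this one is commonly stored as one head index and one label per word. The following sketch encodes the example sentence that way; the segmentation into five words follows the nodes of the slide's tree, while the array layout and the `arcs` helper are illustrative assumptions.

```python
WORDS = ["那个", "小孩", "喜欢", "通俗", "歌曲"]
# HEADS[i] is the 1-based position of the head of word i+1; 0 stands for Root.
HEADS = [2, 3, 0, 5, 3]
LABELS = ["ATT", "SBV", "HED", "ATT", "VOB"]


def arcs(words, heads, labels):
    """List each relation as (label, head word, dependent word)."""
    names = ["Root"] + words
    return [(labels[i], names[heads[i]], words[i]) for i in range(len(words))]


for label, head, dependent in arcs(WORDS, HEADS, LABELS):
    print(f"{label}({head}, {dependent})")
# ATT(小孩, 那个)
# SBV(喜欢, 小孩)
# HED(Root, 喜欢)
# ATT(歌曲, 通俗)
# VOB(喜欢, 歌曲)
```

Because every word has exactly one entry in `HEADS`, the representation itself guarantees that no word has two heads.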


• Axioms of Dependency

• Only one constituent in a sentence is independent.

• All the other constituents in the sentence are dependent on some constituent.

• No constituent is dependent on two or more other constituents.

• If A is dependent on B and C is situated between A and B in the sentence, then either C is dependent on A or B, or C is dependent on a constituent between A and B.
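The four axioms can be checked mechanically. The sketch below assumes a sentence is encoded as a list of head positions (1-based, with 0 marking the independent word); that encoding and the function name are assumptions. It tests the axioms literally as stated and does not separately test for cycles.

```python
def satisfies_axioms(heads):
    """Check the four dependency axioms on a list of 1-based head positions."""
    n = len(heads)

    # Axiom 1: exactly one word is independent (its head is 0).
    if sum(1 for h in heads if h == 0) != 1:
        return False

    # Axioms 2 and 3: every other word names one valid head, and not itself.
    # (A single head per word is already enforced by the list encoding.)
    for dependent, head in enumerate(heads, start=1):
        if not 0 <= head <= n or head == dependent:
            return False

    # Axiom 4: for A depending on B, any C between them must depend on
    # A, on B, or on something else between A and B.
    for a, b in enumerate(heads, start=1):
        if b == 0:
            continue
        low, high = min(a, b), max(a, b)
        for c in range(low + 1, high):
            if not low <= heads[c - 1] <= high:
                return False
    return True


print(satisfies_axioms([2, 3, 0, 5, 3]))  # True
print(satisfies_axioms([3, 4, 0, 3]))     # False: word 2 lies inside the arc 1-3 but depends on word 4
```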


• Conditions of Dependency Tree

• Single Type Node: A dependency tree has only terminal nodes and no non-terminal nodes.

• Single Parent Node: The root node has no parent node; every other node has exactly one parent node.

• Unique Root Node: A dependency tree has only one root node, which governs all the other nodes.

• Non-overlapping: The branches of a dependency tree cannot cross each other.

• Mutual exclusiveness: The relations of governing and preceding are exclusive. If two nodes have a “governing” relation between them, they cannot have a “preceding” relation.


• Dependency Relations

• The Stanford Parser distinguishes more than 50 dependency relations for English.

Dependency relation (meaning): example sentence, relation instance

• amod (adjectival modifier): Sam eats red meat, amod(meat, red)

• dobj (direct object): She gave me a raise, dobj(gave, raise)

• nsubj (nominal subject): Clinton defeated Dole, nsubj(defeated, Clinton)

• pcomp (prepositional complement): They heard about you missing classes, pcomp(about, missing)

• tmod (temporal modifier): Last night, I swam in the pool, tmod(swam, night)


In-Class Exercise

• Given the sentence "The sausage was eaten by his dog", complete the following dependency relations by choosing from the list {nsubj, amod, dobj, pcomp, tmod}.

_____(eat, sausage)

_____(eat, dog)


• Heads and Dependency

• Each syntactic constituent can be associated with a lexical head, the word that determines its syntactic behavior.

• N is the head of an NP, V is the head of a VP …

Workers dumped sacks into a bin.


• Heads and Dependency

• A dependency graph can be automatically derived from a context-free parse by using the head rules.

Vinken will join the board as a nonexecutive director Nov 29.
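The derivation from a context-free parse can be sketched as head percolation: find the head child of every constituent, pass its head word upward, and make the head words of the other children depend on it. The example below uses the sentence "Workers dumped sacks into a bin"; the tree encoding, the Penn-style tags, and the very small `HEAD_RULES` table are simplifying assumptions, not the full head rules used in practice.

```python
# (label, children) for phrases, (tag, word) for pre-terminals.
TREE = ("S", [
    ("NP", [("NNS", "Workers")]),
    ("VP", [
        ("VBD", "dumped"),
        ("NP", [("NNS", "sacks")]),
        ("PP", [("IN", "into"),
                ("NP", [("DT", "a"), ("NN", "bin")])]),
    ]),
])

# Assumed toy head rules: child labels to look for, in priority order.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD"], "NP": ["NN", "NNS"], "PP": ["IN"]}


def percolate(node, arcs):
    """Return the head word of node and append (head, dependent) arcs."""
    label, children = node
    if isinstance(children, str):        # pre-terminal: the word is its own head
        return children
    child_heads = [percolate(child, arcs) for child in children]
    head_index = next(
        i
        for wanted in HEAD_RULES[label]
        for i, child in enumerate(children)
        if child[0] == wanted
    )
    head = child_heads[head_index]
    for i, other in enumerate(child_heads):
        if i != head_index:
            arcs.append((head, other))   # the other children depend on the head
    return head


arcs = []
root = percolate(TREE, arcs)
print("root:", root)
for head, dependent in arcs:
    print(f"{head} -> {dependent}")
```

Words are used directly as node names here, which works only because no word occurs twice in the sentence; a real implementation would use word positions.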


Wrap-Up

• Basics of grammatical analysis (语法分析基础): Constituents

• Formal grammars (形式语法): Regular Expressions; Symbols and Rules; Formal Definition

• Context-free grammars (上下文无关语法): Parse Tree; Examples; Treebanks

• Dependency grammar (依存语法): Axioms; Dependency Tree and Graph; Dependency Relations; Heads and Dependency