101035 Chinese Information Processing (中文信息处理), Chinese NLP, Lecture 5. Words (词): Automatic Word Segmentation (自动分词) (3), Chinese Word...


Upload: coleen-white

Post on 28-Dec-2015


Page 1: 101035 中文信息处理 Chinese NLP Lecture 5. 词 —— 自动分词( 3 ) Chinese Word Segmentation (3) 未登录词( Out-of-Vocabulary words, or OOV words) 未登录词的获取(

101035 中文信息处理

Chinese NLP

Lecture 5


Chinese Word Segmentation (3) (词：自动分词 (3))

• Out-of-vocabulary (OOV) words (未登录词)

• OOV word acquisition (未登录词的获取)

• Statistical Chinese word acquisition (基于统计的中文词汇获取)

• Segmentation evaluation (中文分词评测)


Out-of-Vocabulary (OOV) Words (未登录词)

• Definition

• OOV words, or unknown words, are character strings that are not in the lexicon but should be identified as segments, such as person names, location names, organization names, and their abbreviations.

Example: 张韧弦 (a person name that is not in the lexicon) → 张 韧 弦 (wrongly split into single characters)


• Types of OOV Words

• Morphologically derived words (MDWs).

Affixes (前后缀): 朋友 → 朋友们

Repetition (叠词): 走 → 走一走, 漂亮 → 漂漂亮亮

Splitting (离合): 吃饭 → 吃了饭

Merging (合并): 上班 + 下班 → 上下班

Attachment (附加): 走出 → 走出去

To recognize them, build MDW dictionaries and use statistical information such as frequency, mutual information, and context.


• Types of OOV Words

• Factoids (FTs).

Date: 6 月 20 日

Duration: 20 多分钟

Time point: 十二点三十分

Fraction: 百分之五十 (50%), 八分之一 (1/8)

Currency: 1000 美元

Numeral: 三次, 三百, 第一个

Numeral + unit: 28 岁, 55 千克, 3 千米, 20 度

Email: [email protected]

Phone number: +86-21-65982200

URL: www.tongji.edu.cn

To recognize them, use regular expressions or finite-state machines.
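As a rough illustration of the regular-expression approach, the sketch below matches a few of the factoid types listed on this slide. The pattern set, the names `FACTOID_PATTERNS` and `find_factoids`, and the sample sentence are all assumptions made for this example; a real recognizer would need many more patterns (Chinese numerals, more units, and so on).

```python
import re

# Hypothetical patterns, one per factoid type; deliberately far from exhaustive.
FACTOID_PATTERNS = {
    "date": re.compile(r"\d{1,2}\s*月\s*\d{1,2}\s*日"),
    "currency": re.compile(r"\d+\s*美元"),
    "numeral_unit": re.compile(r"\d+\s*(?:岁|千克|千米|度)"),
    "phone": re.compile(r"\+\d{1,3}(?:-\d+)+"),
    "url": re.compile(r"www\.[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"),
}


def find_factoids(text):
    """Return (type, matched string) pairs for every pattern hit in text."""
    hits = []
    for name, pattern in FACTOID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits


# Made-up sample sentence that reuses the examples from the slide.
sample = "6 月 20 日 他花了 1000 美元 ,电话 +86-21-65982200 ,网址 www.tongji.edu.cn ,他 28 岁"
hits = find_factoids(sample)
for kind, value in hits:
    print(kind, value)
```

Each matched span can then be treated as a single segment before dictionary-based segmentation runs on the rest of the text.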


• Types of OOV Words

• Named Entities (NEs).

Person name: 毛泽东, 林肯

Place name: 北京, 纽约

Organization name: 联合国安理会, 毕加索博物馆

Some person names, especially transliterated ones, can be recognized easily with special lexicons, but recognizing other types of NEs is rather difficult.


• Types of OOV Words

• New Words (NWs).

Initials: 非典, 高大上

Dialect: 埋单, 忽悠

Coined word: 给力, 屌丝

Jargon: 禽流感, 三聚氰胺

Loanword: 克隆 (clone), 欧巴 (오빠)

Foreign word: APEC, NBA

Word with new meaning: 任性, 土豪

Recognizing the many kinds of NWs is difficult.


In-Class Exercise

• Classify each of the following as an MDW, FT, NE, or NW:

嘉定, 美眉, 左右手, iPad, 嘚瑟, 5 升, 整整齐齐, 章子怡, 脱氧核糖核酸, 春秋航空, 一半, 点赞


OOV Word Acquisition (未登录词的获取)

• Introduction

• Words do not appear alone. Their appearance can be captured by statistical patterns regarding word collocation and word co-occurrence.

• Many Chinese OOV word acquisition methods are borrowed from the statistical methods for discovering English word collocations.


• Frequency-based

• If two characters occur next to each other many times, their bigram is a possible word.

Count(W1W2)   W1   W2
1317          的   一
1219          新   华
1178          华   社
1125          日   电
858           国   的
797           了   一
746           是   一
719           这   一
26            造   就

Frequency alone is not reliable: some high-frequency bigrams, such as 的一, are not words, while some lower-frequency bigrams, such as 造就, are.
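A minimal sketch of how such a bigram frequency table could be produced is shown below. The toy corpus is made up for illustration (it is not the corpus behind the counts above), and the function name `bigram_counts` is an assumption.

```python
from collections import Counter


def bigram_counts(text):
    """Count every pair of adjacent characters in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


# Toy text, invented for this example only.
toy_corpus = "新华社北京电新华社上海电"
counts = bigram_counts(toy_corpus)
for bigram, count in counts.most_common(3):
    print(count, bigram)
```

On a real corpus one would probably avoid counting bigrams that span punctuation or sentence boundaries; that filtering is omitted here.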


• Mean and Variance-Based

• In a collocation, the two words may not appear at fixed positions relative to each other. The frequency-based method cannot handle this, but the mean and variance-based method can.

进来请敲门
别敲我的门了
你去敲他的门
有人在敲门


• Mean and Variance-Based

• Compute the mean and variance of the offsets between two characters. Decide whether a character collocation (word) exists by observing the mean and variance.

• The mean is the average of all character collocation offsets.

• The variance measures how much individual offsets deviate from the mean.

• The standard deviation s, the square root of the variance $s^2$, is also used.

$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad s^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}$$

n is the number of co-occurrences
$d_i$ is the offset of the i-th co-occurrence
$\bar{d}$ is the mean of the offsets in the sample


• Mean and Variance-Based

• Example: 敲门

The mean offset is greater than 1 and s is small, which means the two characters tend to co-occur at approximately the same distance.

进来请敲门
别敲我的门了
你去敲他的门
有人在敲门
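The offsets for 敲 and 门 in the four example sentences can be computed directly. The sketch below assumes the sentences split as 进来请敲门 / 别敲我的门了 / 你去敲他的门 / 有人在敲门 and uses the sample variance with an n − 1 denominator; the helper name `offsets` is invented for this example.

```python
import math

sentences = ["进来请敲门", "别敲我的门了", "你去敲他的门", "有人在敲门"]


def offsets(sentences, first, second):
    """Distance from each `first` character to the next `second` in the same sentence."""
    result = []
    for s in sentences:
        i = s.find(first)
        while i != -1:
            j = s.find(second, i + 1)
            if j != -1:
                result.append(j - i)
            i = s.find(first, i + 1)
    return result


d = offsets(sentences, "敲", "门")
n = len(d)
mean = sum(d) / n
# Sample variance with n - 1 in the denominator (an assumption of this sketch).
variance = sum((x - mean) ** 2 for x in d) / (n - 1)
sd = math.sqrt(variance)
print(d, mean, round(sd, 3))
```

With these four sentences the offsets are 1, 3, 3, 1, so the mean offset is 2 and s is about 1.15.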


• Hypothesis Testing-Based

• A character string or word detected by frequency, mean, and variance may have arisen by chance.

• To show that the finding is statistically reliable (significant), we often use hypothesis testing.

• H0: the two characters W1 and W2 are independent, so the probability of their co-occurrence is P(W1W2) = P(W1)P(W2).


• T-test

• The t-test is based on the sample mean and variance.

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

• If t is sufficiently large, the null hypothesis H0 can be rejected.

$\bar{x}$ is the sample mean
$s^2$ is the sample variance
N is the sample size
$\mu$ is the distribution mean


• T-test

• Example

In the corpus, 着 appears 2457 times, 生 appears 5540 times, and there are 1505259 characters in total.

P(着) = 2457 / 1505259, P(生) = 5540 / 1505259

The null hypothesis is that 着 and 生 are independent.

H0: P(着生) = P(着) P(生) ≈ 6.0075 × 10⁻⁶

If the null hypothesis is correct, define for a randomly chosen bigram a random variable that is 1 if the bigram is 着生 and 0 otherwise. This variable follows a Bernoulli distribution with

$\mu = p = 6.0075 \times 10^{-6}$, $\sigma^2 = p(1 - p) \approx p$ (since p is very small)


• T-test

• Example

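In the corpus, 着生 appears 16 times, therefore

$\bar{x} = 16 / 1505259 \approx 1.0629 \times 10^{-5}$, $s^2 = \bar{x}(1 - \bar{x}) \approx \bar{x}$

The t value is

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{1.0629 \times 10^{-5} - 6.0075 \times 10^{-6}}{\sqrt{1.0629 \times 10^{-5} / 1505259}} \approx 1.739$$

Since t < 2.576, the critical value for α = 0.005, the null hypothesis cannot be rejected. Therefore, 着生 is not taken to be a word.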
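The arithmetic of this example can be checked with a few lines of code. This is a sketch that plugs the counts quoted on the slides into the one-sample t statistic $t = (\bar{x} - \mu) / \sqrt{s^2 / N}$, taking N to be the total number of characters; the variable names are chosen for this example.

```python
import math

N = 1505259          # total number of characters in the corpus
c_zhe = 2457         # count of 着
c_sheng = 5540       # count of 生
c_bigram = 16        # count of 着生

mu = (c_zhe / N) * (c_sheng / N)   # expected P(着生) under independence (H0)
x_bar = c_bigram / N               # observed P(着生)
s2 = x_bar * (1 - x_bar)           # Bernoulli variance, approximately x_bar
t = (x_bar - mu) / math.sqrt(s2 / N)

print(f"mu = {mu:.4e}, x_bar = {x_bar:.4e}, t = {t:.3f}")
```

The resulting t is about 1.74, below the critical value 2.576 quoted on the slide.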


• T-test for Difference

• T-test can also be used to find the difference between two characters according to their co-occurrence patterns.

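• H0: the average difference is 0, so $\mu_1 = \mu_2$.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}$$

$\bar{x}_1$ and $\bar{x}_2$ are the means of sample 1 and sample 2
$s_1^2$ and $s_2^2$ are the variances of sample 1 and sample 2
$N_1$ and $N_2$ are the sizes of sample 1 and sample 2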


• T-test for Difference

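• Suppose the samples follow a Bernoulli distribution, w is the character of interest, and v1 and v2 are the two characters to be compared.

$\bar{x}_1 = s_1^2 = P(v_1 w)$, $\bar{x}_2 = s_2^2 = P(v_2 w)$

With $N_1 = N_2 = N$, the t value simplifies to

$$t \approx \frac{C(v_1 w) - C(v_2 w)}{\sqrt{C(v_1 w) + C(v_2 w)}}$$

C(x) is the count of x in the corpus.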
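Under the Bernoulli assumption and equal sample sizes, the two-sample statistic reduces to $(C(v_1 w) - C(v_2 w)) / \sqrt{C(v_1 w) + C(v_2 w)}$, which depends only on the two bigram counts. The sketch below evaluates that approximation; the function name and the counts 30 and 10 are hypothetical and do not come from the lecture corpus.

```python
import math


def t_difference(count_v1w, count_v2w):
    """Approximate t for the difference between bigrams v1+w and v2+w."""
    return (count_v1w - count_v2w) / math.sqrt(count_v1w + count_v2w)


# Hypothetical counts, for illustration only.
t = t_difference(30, 10)
print(round(t, 3))
```

A large positive t suggests w is more strongly associated with v1 than with v2; a value near 0 suggests no clear difference.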


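• Pearson’s χ² Test

• χ² is a statistical test to evaluate how likely it is that any observed difference between the sets arises by chance.

$$\chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

• For a 2×2 table, χ² can be calculated simply by

$$\chi^2 = \frac{N (O_{1,1} O_{2,2} - O_{1,2} O_{2,1})^2}{(O_{1,1} + O_{1,2})(O_{1,1} + O_{2,1})(O_{1,2} + O_{2,2})(O_{2,1} + O_{2,2})}$$

$O_{i,j}$ is the observed frequency of cell (i, j)
$E_{i,j}$ is the expected frequency of cell (i, j)
N is the sum of all observed frequencies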


• Pearson’s χ² Test

• Example

             w1 = 上                w1 ≠ 上
w2 = 自      16 (上自)              2946 (e.g., 来自)
w2 ≠ 自      6310 (e.g., 上海)      1495987 (e.g., 教师)

For this table, χ² ≈ 1.02. At the significance level α = 0.05, the critical value is χ² = 3.841 (df = 1). So the null hypothesis cannot be rejected, i.e., 上 and 自 do not make a word.
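The χ² value for this table can be checked with the standard closed-form shortcut for 2×2 tables, $\chi^2 = N (O_{11} O_{22} - O_{12} O_{21})^2 / \big((O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})\big)$. The sketch below assumes the four cells are 16 (上自), 2946 (other character + 自), 6310 (上 + other character), and 1495987 (neither); the function name is chosen for this example.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Cell counts taken from the slide's table for 上 and 自.
chi2 = chi_square_2x2(16, 2946, 6310, 1495987)
print(round(chi2, 2))
```

The statistic comes out at roughly 1.02, well below 3.841, the df = 1 critical value for α = 0.05.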


• T-Test vs χ² Test

• When testing word collocation, both give approximately the same result.

• The t-test assumes a normal distribution, which usually does not hold in practice.

• The χ² test is also suitable for large probabilities, for which the normality assumption required by the t-test does not hold. Its drawback is that small calculated (expected) values lead to unconvincing results.

• Other Statistical Methods

• Likelihood ratio

• Pointwise mutual information (PMI)
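For reference, pointwise mutual information is commonly defined as $\mathrm{PMI}(x, y) = \log_2 \frac{P(xy)}{P(x) P(y)}$. The sketch below applies that definition to the 着生 counts quoted in the t-test example; the function name `pmi` and the choice of a single N for all three probabilities are assumptions of this illustration.

```python
import math


def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information (base 2) estimated from raw counts."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))


# Counts for 着生, 着, 生 and the corpus size from the t-test example.
value = pmi(16, 2457, 5540, 1505259)
print(round(value, 3))
```

A PMI near 0 means the pair co-occurs about as often as independence would predict; here the value is about 0.82, i.e., the bigram is less than twice as frequent as expected by chance.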


Statistical Chinese Word Acquisition (基于统计的中文词汇获取)

• Candidate List

• Generally, we extract candidate words from the fragments that remain between multi-character words or high-frequency single-character words (e.g., 是, 的, 在), which act as boundaries.

• Statistics

• Let the text string be C = C1C2 … Cn, where n is its length. The set of all Chinese characters is H = {H1, H2, …, Hm}.


• Statistics

• Make the following Bernoulli assumptions:

• For any position Ci, any character Hk occurs with probability Pk.

• Different positions in the text emit a character with fixed probabilities. They are mutually independent.

• If the characters in a candidate string are highly collocated, they can make a word.

• Deciding Word Categories

• We must decide the category of each statistically acquired word, e.g., person name, place name, or domain word.


Segmentation Evaluation (中文分词评测)

• Binary Decision

• In segmenting a string of text, a system must make a boundary-placement decision after each character.

C1 ? C2 ? C3 ? … Cn

“?” marks a decision on whether or not to insert a boundary.

accuracy = (number of correct binary decisions) / (total number of binary decisions)

standard:       计算机 会 下 象棋      decisions: 0 0 1 1 1 0
system output:  计算 机会 下 象棋      decisions: 0 1 0 1 1 0

accuracy = 4/6 = 0.667
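The binary-decision accuracy can be reproduced in a few lines. The sketch assumes the example segmentations are 计算机 / 会 / 下 / 象棋 (standard) and 计算 / 机会 / 下 / 象棋 (system output), which is what the decision vectors 0 0 1 1 1 0 and 0 1 0 1 1 0 imply; the helper name `boundary_decisions` is invented here.

```python
def boundary_decisions(words):
    """One 0/1 decision per gap between characters: 1 if a boundary follows the character."""
    decisions = []
    for word in words:
        decisions.extend([0] * (len(word) - 1) + [1])
    return decisions[:-1]  # no decision is made after the last character


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

gold = boundary_decisions(standard)
pred = boundary_decisions(system)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(gold, pred, round(accuracy, 3))
```

Both segmentations must cover the same character string, so the two decision lists have the same length.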


• Boundary Recall/Precision

• Recall measures the percentage of the actual boundaries identified.

• Precision measures the percentage of the identified boundaries that are in fact the actual boundaries.

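[Figure: two overlapping sets, the actual boundaries (green) and the identified boundaries (yellow); their overlap contains the boundaries that are identified and correct.]

Recall = overlap / green

Precision = overlap / yellow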


• Boundary Recall/Precision

standard:       计算机 会 下 象棋
system output:  计算 机会 下 象棋

Recall = (number of correctly identified boundaries) / (total number of boundaries in the standard)

Precision = (number of correctly identified boundaries) / (total number of identified boundaries)

recall = 2/3 = 0.667, precision = 2/3 = 0.667
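Boundary recall and precision can be computed by comparing the sets of boundary positions. The sketch below assumes the example segmentations are 计算机 / 会 / 下 / 象棋 (standard) and 计算 / 机会 / 下 / 象棋 (system output) and counts only internal boundaries, which is what the 2/3 figures imply; the helper name `boundaries` is an assumption.

```python
def boundaries(words):
    """Set of character offsets at which an internal word boundary falls."""
    positions = set()
    pos = 0
    for word in words[:-1]:  # the end of the string is not counted as a boundary
        pos += len(word)
        positions.add(pos)
    return positions


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

actual = boundaries(standard)
identified = boundaries(system)
correct = actual & identified

recall = len(correct) / len(actual)
precision = len(correct) / len(identified)
print(sorted(correct), round(recall, 3), round(precision, 3))
```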


• Word Recall/Precision

• Evaluate words rather than boundaries.

standard:       计算机 会 下 象棋
system output:  计算 机会 下 象棋

Recall = (number of correctly identified words) / (total number of words in the standard)

Precision = (number of correctly identified words) / (total number of identified words)

recall = 2/4 = 0.5, precision = 2/4 = 0.5
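For word-level scores, a word counts as correct only if both of its boundaries match, so it is convenient to compare (start, end) spans rather than strings. The sketch below again assumes the segmentations 计算机 / 会 / 下 / 象棋 (standard) and 计算 / 机会 / 下 / 象棋 (system output); the helper name `word_spans` is invented for this example.

```python
def word_spans(words):
    """Set of (start, end) character offsets, one per word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

gold_spans = word_spans(standard)
pred_spans = word_spans(system)
correct = gold_spans & pred_spans

recall = len(correct) / len(gold_spans)
precision = len(correct) / len(pred_spans)
print(sorted(correct), recall, precision)
```

Comparing spans instead of word strings avoids crediting a word that appears in both segmentations but at different positions.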


In-Class Exercise

• For the given string “这样的人才能经受考验”, the correct segmentation is “这样 的 人 才 能 经受 考验”, but your system segments it as “这样 的 人才 能 经受 考验”. What are the boundary recall/precision and the word recall/precision?


Wrap-Up

• OOV words (未登录词)
  • Types

• OOV word acquisition (未登录词的获取)
  • Frequency-based
  • Mean and variance-based
  • Hypothesis testing-based
    • T-test
    • T-test for difference
    • Pearson’s χ² test

• Statistical Chinese word acquisition (基于统计的中文词汇获取)

• Segmentation evaluation (中文分词评测)
  • Binary decision
  • Boundary recall/precision
  • Word recall/precision