101035 中文信息处理 chinese nlp lecture 5. 词 —— 自动分词（ 3 ） chinese word...

101035 Chinese Information Processing (中文信息处理)

Chinese NLP

Lecture 5

Words: Chinese Word Segmentation (3)

• Out-of-vocabulary (OOV) words (未登录词)

• OOV word acquisition (未登录词的获取)

• Statistics-based Chinese word acquisition (基于统计的中文词汇获取)

• Segmentation evaluation (中文分词评测)

Out-of-Vocabulary (OOV) Words (未登录词)

• Definition

• OOV words, or unknown words, are character strings that are not in the lexicon but should be identified as segments, such as person names, location names, organization names, and their abbreviations.

Example: 张韧弦

• Types of OOV Words

• Morphologically derived words (MDWs).

Affixes (前后缀): 朋友 → 朋友们

Repetition (叠词): 走 → 走一走, 漂亮 → 漂漂亮亮

Splitting (离合): 吃饭 → 吃了饭

Merging (合并): 上班 + 下班 → 上下班

Attachment (附加): 走出 → 走出去

To recognize them, make MDW dictionaries and use statistical information such as frequency, mutual information and context.


• Factoids (FTs).

Date: 6 月 20 日

Duration: 20 多分钟

Time point: 十二点三十分

Fraction: 百分之五十 (50%), 八分之一 (1/8)

Currency: 1000 美元

Numeral: 三次, 三百, 第一个

Numeral + unit: 28 岁, 55 千克, 3 千米, 20 度

Email: [email protected]

Phone number: +86-21-65982200

URL: www.tongji.edu.cn

To recognize them, use regular expressions or finite state machines.
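The regular-expression approach to factoids can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the names `FACTOID_PATTERNS` and `find_factoids` are hypothetical, and each pattern is an assumption that covers only the surface forms of the examples listed above (Arabic digits, a fixed set of units).

```python
import re

# Illustrative patterns only: each one covers just the example forms
# (Arabic digits, a handful of units), not every real-world variant.
FACTOID_PATTERNS = [
    ("Date", re.compile(r"\d{1,2}\s*月\s*\d{1,2}\s*日")),
    ("Currency", re.compile(r"\d+(?:\.\d+)?\s*美元")),
    ("Phone", re.compile(r"\+\d{1,3}(?:-\d+)+")),
    ("URL", re.compile(r"www\.[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")),
    ("NumeralUnit", re.compile(r"\d+\s*(?:岁|千克|千米|度)")),
]


def find_factoids(text):
    """Return (type, matched string) pairs in order of appearance."""
    hits = []
    for label, pattern in FACTOID_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    hits.sort()  # order by start position in the text
    return [(label, matched) for _, label, matched in hits]


if __name__ == "__main__":
    print(find_factoids("会议在 6 月 20 日举行，电话 +86-21-65982200"))
```

Factoids written in Chinese numerals (for example 十二点三十分) would need additional patterns or a small finite state machine, which this sketch leaves out.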


• Named Entities (NEs).

Person Name 毛泽东，林肯

Place Name 北京，纽约

Organization Name 联合国安理会，毕加索博物馆

Some person names, especially the translated ones, can be easily recognized by using special lexicons. But recognizing other types of NE is rather difficult.



• New Words (NWs).

Initials: 非典, 高大上

Dialect: 埋单, 忽悠

Coined word: 给力, 屌丝

Jargon: 禽流感, 三聚氰胺

Loanword: 克隆 (clone), 欧巴 (오빠)

Foreign word: APEC, NBA

Word with new meaning: 任性, 土豪

It is difficult to recognize the various kinds of NWs.

In-Class Exercise

• Which type is each of the following: MDW, FT, NE, or NW?

嘉定, 美眉, 左右手, iPad, 嘚瑟, 5 升, 整整齐齐, 章子怡, 脱氧核糖核酸, 春秋航空, 一半, 点赞

OOV Word Acquisition (未登录词的获取)

• Introduction

• Words do not appear alone. Their appearance can be captured by statistical patterns regarding word collocation and word co-occurrence.

• Many Chinese OOV word acquisition methods are borrowed from the statistical methods for discovering English word collocations.


• Frequency-based

• If two characters occur next to each other many times, their bigram is a possible word.

Count(W1W2)   W1W2
1317          的一
1219          新华
1178          华社
1125          日电
858           国的
797           了一
746           是一
719           这一
26            造就

High frequency alone is not a reliable criterion: frequent bigrams such as 的一 are not words, whereas less frequent ones such as 造就 are.
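Counting adjacent character pairs, the first step of the frequency-based method, can be sketched as below. The function name `bigram_counts` and the toy corpus are illustrative assumptions; the counts in the table above come from the lecture's corpus, which is not reproduced here.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent character pairs, skipping pairs that touch punctuation or spaces."""
    counts = Counter()
    for first, second in zip(text, text[1:]):
        # str.isalnum() is True for Chinese characters and False for "。" or spaces
        if first.isalnum() and second.isalnum():
            counts[first + second] += 1
    return counts


if __name__ == "__main__":
    toy_corpus = "新华社北京电。新华社上海电。"  # toy data, not the lecture corpus
    for bigram, count in bigram_counts(toy_corpus).most_common(3):
        print(count, bigram)
```

Ranking by raw count puts function-word pairs at the top in a real corpus, which is why the later slides add statistical tests on top of the counts.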

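• Mean and Variance-Based

• In a collocation, the two words may not appear at fixed positions. The frequency-based method cannot handle this, but the mean and variance-based method can.

进来请敲门 (Please knock before coming in)
别敲我的门了 (Stop knocking on my door)
你去敲他的门 (Go and knock on his door)
有人在敲门 (Someone is knocking at the door)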


• Compute the mean and variance of the offset between two characters. Decide whether a character collocation (word) exists by observing the mean and variance.

• The mean is the average of all collocation offsets: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$

• The variance measures how much the individual offsets deviate from the mean: $s^2 = \frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}$

• The standard deviation (SD) $s$, the square root of the variance $s^2$, is also used.

$n$ is the number of co-occurrences, $d_i$ is the offset of the $i$-th co-occurrence, and $\bar{d}$ is the mean of the offsets in the sample.


• Example: 敲门

Mean > 1 and s is small, which means the two characters tend to co-occur at approximately the same distance.

进来请敲门
别敲我的门了
你去敲他的门
有人在敲门
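The offsets for 敲 and 门 in the four example sentences can be computed with a short sketch. The helper name `offsets` is hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption about which variance estimate the slide uses.

```python
from statistics import mean, stdev

sentences = ["进来请敲门", "别敲我的门了", "你去敲他的门", "有人在敲门"]


def offsets(sentences, first, second):
    """Offset in characters from `first` to the next `second` in each sentence."""
    result = []
    for sentence in sentences:
        i = sentence.find(first)
        j = sentence.find(second, i + 1) if i != -1 else -1
        if j != -1:
            result.append(j - i)
    return result


d = offsets(sentences, "敲", "门")
d_mean = mean(d)
d_sd = stdev(d)  # sample standard deviation, divides by n - 1

if __name__ == "__main__":
    print(d, d_mean, round(d_sd, 3))
```

With only four sentences the estimate is rough; in practice the offsets would be collected over a whole corpus within a fixed window.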

• Hypothesis Testing-Based

• A character string or word detected by frequency, mean, and variance may be due to chance.

• To show that the finding is statistically reliable (significant), we often use hypothesis testing.

• H0: the two characters W1 and W2 are independent, so the probability of their co-occurrence is P(W1W2) = P(W1)P(W2).

• T-test

• The t-test is based on the sample mean and variance: $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$

• If t is sufficiently large, the null hypothesis H0 can be rejected.

$\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size, and $\mu$ is the distribution mean.

• T-test

• Example

In the corpus, 着 appears 2457 times, 生 appears 5540 times, and there are 1505259 characters in total.

$P(着) = 2457 / 1505259$, $P(生) = 5540 / 1505259$

The null hypothesis is that 着 and 生 are independent.

H0: $P(着生) = P(着)\,P(生) = \frac{2457}{1505259} \times \frac{5540}{1505259} \approx 6.0075 \times 10^{-6}$

If the null hypothesis is correct, then for a randomly drawn bigram the corresponding random variable is 1 if the bigram is 着生 and 0 otherwise. This gives a Bernoulli distribution with

$\mu = p \approx 6.0075 \times 10^{-6}$, $\sigma^2 = p(1 - p) \approx p$ (since $p$ is very small)

• T-test

• Example

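In the corpus, 着生 appears 16 times, therefore $\bar{x} = 16 / 1505259 \approx 1.0629 \times 10^{-5}$.

The t value, approximating $s^2$ by $\bar{x}$ and taking $N = 1505259$, is $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{1.0629 \times 10^{-5} - 6.0075 \times 10^{-6}}{\sqrt{1.0629 \times 10^{-5} / 1505259}} \approx 1.739$

Since t < 2.576, the critical value for α = 0.005, the null hypothesis cannot be rejected. Therefore, 着 and 生 are not considered to form a word.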
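The 着生 example can be checked numerically with the sketch below. It assumes that the number of bigrams is approximately the number of characters and that the sample variance is approximated by the sample mean (the Bernoulli approximation p(1 - p) ≈ p); the variable names are illustrative.

```python
from math import sqrt

N = 1505259                          # corpus size in characters, used as the number of bigrams
count_zhe, count_sheng = 2457, 5540  # counts of 着 and 生
count_bigram = 16                    # count of 着生

mu = (count_zhe / N) * (count_sheng / N)  # expected bigram probability under H0
x_bar = count_bigram / N                  # observed bigram probability
s2 = x_bar                                # assumption: s^2 is approximated by x_bar
t = (x_bar - mu) / sqrt(s2 / N)

CRITICAL_VALUE = 2.576                    # critical value quoted on the slide (alpha = 0.005)
reject_h0 = t > CRITICAL_VALUE

if __name__ == "__main__":
    print(f"mu={mu:.4e} x_bar={x_bar:.4e} t={t:.3f} reject_h0={reject_h0}")
```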

• T-test for Difference

• The t-test can also be used to find the difference between two characters according to their co-occurrence patterns.

• H0: the average difference is 0, so $\mu_1 = \mu_2$.

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}}$

$\bar{x}_1$ and $\bar{x}_2$ are the means of sample 1 and sample 2, $s_1^2$ and $s_2^2$ are the variances of sample 1 and sample 2, and $N_1$ and $N_2$ are the sizes of sample 1 and sample 2.

• T-test for Difference

• Suppose the samples follow a Bernoulli distribution, w is the character of interest, and v1 and v2 are the two characters to be compared.

$s_1^2 = P(v_1 w)$, $s_2^2 = P(v_2 w)$

$t \approx \frac{C(v_1 w) - C(v_2 w)}{\sqrt{C(v_1 w) + C(v_2 w)}}$

C(x) is the count of x in the corpus.
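Under the Bernoulli assumption, with both samples drawn from the same corpus (so N1 = N2), the t statistic for the difference depends only on the two bigram counts. A minimal sketch follows; the function name `t_difference` is illustrative and the counts in the usage line are hypothetical, not taken from the lecture's corpus.

```python
from math import sqrt


def t_difference(count_v1w, count_v2w):
    """t statistic comparing how strongly v1 and v2 co-occur with w.

    Assumes Bernoulli samples of equal size, so that
    t is approximately (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w)).
    """
    total = count_v1w + count_v2w
    if total == 0:
        raise ValueError("at least one of the bigrams must occur in the corpus")
    return (count_v1w - count_v2w) / sqrt(total)


if __name__ == "__main__":
    # Hypothetical counts: v1 w seen 50 times, v2 w seen 14 times.
    print(t_difference(50, 14))
```

A large positive value suggests that w combines more readily with v1 than with v2; a value near zero suggests no clear preference.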

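• Pearson’s χ² Test

• The χ² test evaluates how likely it is that an observed difference between the sets arises by chance: $\chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$

• For a 2×2 table, χ² can be calculated simply by $\chi^2 = \frac{N\,(O_{1,1} O_{2,2} - O_{1,2} O_{2,1})^2}{(O_{1,1} + O_{1,2})(O_{1,1} + O_{2,1})(O_{1,2} + O_{2,2})(O_{2,1} + O_{2,2})}$

$O_{i,j}$ is the observed frequency of cell (i, j), $E_{i,j}$ is the expected frequency, and $N$ is the total number of observations.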


• Example

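              w1 = 上              w1 ≠ 上
w2 = 自       16 (上自)            2946 (e.g., 来自)
w2 ≠ 自       6310 (e.g., 上海)    1495987 (e.g., 教师)

At the significance level α = 0.05, the critical value is χ² = 3.841 (df = 1). The χ² value computed from this table is about 1.02, which is below the critical value, so the null hypothesis cannot be rejected, i.e., 上 and 自 do not make a word.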
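The χ² value for the 上自 table can be computed with the 2×2 shortcut formula, as sketched below. The function name `chi_square_2x2` is illustrative; 3.841 is the df = 1 critical value at the 0.05 level.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 contingency table, using the shortcut formula."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Cells from the table: (上, 自), (not 上, 自), (上, not 自), (not 上, not 自)
chi2 = chi_square_2x2(16, 2946, 6310, 1495987)
CRITICAL_VALUE = 3.841          # df = 1, alpha = 0.05
reject_h0 = chi2 > CRITICAL_VALUE

if __name__ == "__main__":
    print(f"chi2={chi2:.3f} reject_h0={reject_h0}")
```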

• T-Test vs χ² Test

• When testing word collocation, both give approximately the same result.

• The t-test assumes a normal distribution, which usually does not hold in practice.

• The χ² test is also suitable for large probabilities, for which the normality assumption required by the t-test is not met. Its drawback is that small computed values lead to unconvincing results.

• Other Statistical Methods

• Likelihood ratio

• Pointwise mutual information (PMI)
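As a brief illustration of the last item, pointwise mutual information compares the observed bigram probability with the probability expected under independence, PMI(x, y) = log2(P(xy) / (P(x)P(y))). The sketch below uses base-2 logarithms and maximum-likelihood probabilities from raw counts, both of which are assumptions, and reuses the 着生 counts from the t-test example.

```python
from math import log2


def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information log2(P(xy) / (P(x) P(y))) from raw counts."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return log2(p_xy / (p_x * p_y))


pmi_zhe_sheng = pmi(16, 2457, 5540, 1505259)

if __name__ == "__main__":
    print(round(pmi_zhe_sheng, 3))
```

A PMI near 0 means the pair occurs about as often as independence predicts; strongly associated pairs have clearly positive values.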


Statistics-Based Chinese Word Acquisition (基于统计的中文词汇获取)

• Candidate List

• Generally, we extract candidate words from the fragments delimited by the boundaries of multi-character words or high-frequency single-character words (e.g., 是, 的, 在).

• Statistics

• Let the text string be C = C1C2…Cn, where n is its length. The set of all Chinese characters is H = {H1, H2, …, Hm}.

• Statistics

• Make the following Bernoulli assumptions:

• For any Ci, any Hk occurs with probability Pk.

• Each position in the text emits a character with fixed probabilities, and the positions are mutually independent.

• If the characters in a candidate string are highly collocated, they can make a word.

• Deciding Word Categories

• We must decide the category for the statistically acquired word, i.e., person, place, domain word, etc.

Segmentation Evaluation (中文分词评测)

• Binary Decision

• In segmenting a string of text, a system must make a boundary-placement decision after each character.

C1 ? C2 ? C3 ? … Cn

“?” marks a decision on whether to insert a boundary or not.

accuracy = number of correct binary decisions / total number of binary decisions

Standard: 计算机 / 会 / 下 / 象棋, decisions 0 0 1 1 1 0

System output: 计算 / 机会 / 下 / 象棋, decisions 0 1 0 1 1 0

accuracy = 4/6 = 0.667
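The binary-decision accuracy can be reproduced with a short sketch that encodes each segmentation as one 0/1 decision after every character except the last. The function names are illustrative, not part of the lecture.

```python
def boundary_decisions(words):
    """Encode a segmentation as 0/1 decisions, one after each character except the last."""
    decisions = []
    for word in words:
        decisions.extend([0] * (len(word) - 1) + [1])
    return decisions[:-1]  # the end of the string is not a decision point


def decision_accuracy(standard, system):
    """Fraction of boundary decisions on which the system agrees with the standard."""
    gold = boundary_decisions(standard)
    pred = boundary_decisions(system)
    if len(gold) != len(pred):
        raise ValueError("both segmentations must cover the same string")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

if __name__ == "__main__":
    print(boundary_decisions(standard), boundary_decisions(system))
    print(round(decision_accuracy(standard, system), 3))
```

Because most decisions are "no boundary", accuracy tends to look high even for weak systems, which motivates the recall and precision measures that follow.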

• Boundary Recall/Precision

• Recall measures the percentage of the actual boundaries identified.

• Precision measures the percentage of the identified boundaries that are in fact the actual boundaries.

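[Figure: Venn diagram of the actual boundaries (green) and the identified boundaries (yellow); their overlap is the set of boundaries that are identified and correct. Recall = overlap / green; Precision = overlap / yellow.]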

• Boundary Recall/Precision



Recall = number of correctly identified boundaries / total number of boundaries (in standard)

Precision = number of correctly identified boundaries / total number of identified boundaries

For 计算机 / 会 / 下 / 象棋 (standard) and 计算 / 机会 / 下 / 象棋 (system output): recall = 2/3 = 0.667, precision = 2/3 = 0.667
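Boundary recall and precision for the same example can be sketched by comparing sets of boundary positions. The function names are illustrative; returning 0.0 when a segmentation has no internal boundary is an assumption made here to avoid division by zero.

```python
def boundary_positions(words):
    """Set of internal boundary positions, counted in characters from the start."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_recall_precision(standard, system):
    """Return (recall, precision) over internal word boundaries."""
    gold = boundary_positions(standard)
    pred = boundary_positions(system)
    correct = len(gold & pred)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    return recall, precision


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

if __name__ == "__main__":
    print(boundary_recall_precision(standard, system))
```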

• Word Recall/Precision

• Evaluate words rather than boundaries.



Recall = number of correctly identified words / total number of words (in standard)

Precision = number of correctly identified words / total number of identified words

For 计算机 / 会 / 下 / 象棋 (standard) and 计算 / 机会 / 下 / 象棋 (system output): recall = 2/4 = 0.5, precision = 2/4 = 0.5
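Word-level recall and precision can be sketched by representing each word as a (start, end) character span, so that a word counts as correct only when both of its boundaries match the standard. The function names are illustrative.

```python
def word_spans(words):
    """Set of (start, end) character spans, one per word."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def word_recall_precision(standard, system):
    """Return (recall, precision) over whole words."""
    gold = word_spans(standard)
    pred = word_spans(system)
    correct = len(gold & pred)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    return recall, precision


standard = ["计算机", "会", "下", "象棋"]
system = ["计算", "机会", "下", "象棋"]

if __name__ == "__main__":
    print(word_recall_precision(standard, system))
```

Comparing spans rather than word strings avoids counting a word as correct when the same string appears at a different position in the sentence.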

In-Class Exercise

• For the given string “这样的人才能经受考验”, the correct segmentation is “这样的人才能经受考验”, but your system segments it as “这样的人才能经受考验”. What are the boundary recall/precision and the word recall/precision?

Wrap-Up

• OOV words (未登录词)
  • Types

• OOV word acquisition (未登录词的获取)
  • Frequency-based
  • Mean and variance-based
  • Hypothesis testing-based
    • T-test
    • T-test for difference

• Statistics-based Chinese word acquisition (基于统计的中文词汇获取)

• Segmentation evaluation (中文分词评测)
  • Binary decision
  • Boundary recall/precision
  • Word recall/precision
