pattern mining to unknown word extraction (10

Pattern Mining to Chinese Unknown Word Extraction

Computer Science and Information Engineering, third-year master's student, 955202037

楊傑程, 2008/10/14

Outline

Introduction

Related Works

Unknown Word Detection

Unknown Word Extraction

Experiments

Conclusions

Introduction

With the growing popularity of Chinese, Chinese text processing has attracted a great deal of interest in recent years.

Before knowledge can be extracted from Chinese texts, some preprocessing, such as Chinese word segmentation, must be done, because Chinese texts have no blanks to mark word boundaries.

Introduction

Chinese Word Segmentation encounters two major problems: Ambiguity and Unknown Words.

Ambiguity: one unsegmented Chinese character string can have different segmentations depending on its context.

Unknown words: also known as out-of-vocabulary (OOV) words; mostly unfamiliar proper nouns or newly coined words. Ex: the sentence “王義氣熱衷於研究生命” would be segmented into “王 義氣 熱衷 於 研究 生命” because “王義氣” is an uncommon personal name that is not in the vocabulary.

Introduction- types of unknown words

In this paper, we focus on the Chinese unknown word problem.

Types of Chinese unknown words:

Proper names: personal names (Ex: 王小明) and organization names (Ex: 華碩電腦)

Abbreviations. Ex: 中油、中大

Derived words. Ex: 總經理、電腦化

Compounds. Ex: 電腦桌、搜尋法

Numeric-type compounds. Ex: 1986年、19巷

Introduction- unknown word identification

Chinese word segmentation process:

1. Initial segmentation (dictionary-assisted). Correctly identified words are called known words. Unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation.

2. Unknown word identification. Characters that belong to one unknown word should be recombined. Ex: recombine 王 小 明 into 王小明.
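The initial segmentation step can be illustrated with a small sketch. The slides do not say which dictionary-based algorithm is used, so forward maximum matching is assumed here, and the toy `lexicon` and the `max_len` parameter are hypothetical. It shows how a name that is missing from the dictionary falls apart into smaller pieces.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a lexicon.

    Characters that start no lexicon entry are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until we find a
        # lexicon entry or are left with a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (assumption): the personal name 王義氣 is deliberately absent.
lexicon = {"義氣", "熱衷", "研究", "生命"}
tokens = forward_max_match("王義氣熱衷於研究生命", lexicon)
print(" ".join(tokens))
```

The leftover single characters, such as 王 here, are what unknown word identification later tries to recombine.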

Introduction- unknown word identification

How does unknown word identification work?

A character can be a word (馬) or part of an unknown word (馬 + 英 + 九).

1. Unknown word detection: find detection rules that distinguish monosyllabic words from monosyllabic morphemes.

2. Unknown word extraction: focus on the detected morphemes and combine them.

Introduction- applied techniques

In this paper, we apply continuity pattern mining to discover unknown word detection rules.

Then we apply machine-learning-based methods (classification algorithms and sequential learning methods) to extract unknown words.

We utilize syntactic information, context information, and heuristic statistical information.

Our unknown word identification method is general: it is not limited to specific types of unknown words.

Related Works- particular methods

Research on Chinese word segmentation has been going on for a decade.

First, researchers applied different kinds of information to discover particular kinds of unknown words.

Proper nouns (Chinese personal names, transliterated names, organization names) [Chen & Li, 1996], [Chen & Chen, 2000]: patterns, frequency, and context information.

Related Works- general methods (Rule-based)

Then researchers began to develop methods that extract all kinds of unknown words.

Rule-based detection and extraction:

[Chen et al., 1998]: distinguish monosyllabic words from monosyllabic morphemes.

[Chen et al., 2002]: combine morphological rules with statistical rules to extract personal names, transliterated names, and compound nouns. (Precision: 89%, Recall: 68%)

[Ma et al., 2003]: utilize the context-free grammar concept and propose a bottom-up merging algorithm; adopt morphological rules and general rules to extract all kinds of unknown words. (Precision: 76%, Recall: 57%)

Related Works- general methods (Machine Learning-based)

Sequential learning [T. G. Dietterich, 2002]: transform the sequential learning problem into a classification problem.

Direct methods, such as HMM and CRF:

[Goh et al., 2006]: HMM+SVM (Precision: 63.8%, Recall: 58.3%)

[Tsai et al., 2006]: CRF (Recall: 73%)

Indirect methods, such as sliding windows and recurrent sliding windows.

Related Works- imbalanced data

Imbalanced data problem:

Ensemble method [C. Li, 2007]: combine the learning ability of multiple base classifiers by voting.

Cost-sensitive learning and sampling [G. M. Weiss et al., 2007]: focus more on minority-class examples.

[C. Drummond et al., 2003]: under-sampling is more sensitive than over-sampling.

[Seyda et al., 2007]: select the most informative instances.

Unknown Word Detection & Extraction

Our idea is similar to [Chen et al., 2002]:

Unknown word detection: continuity pattern mining to derive detection rules.

Unknown word extraction: machine-learning-based classification algorithms and sequential learning (indirect).

We call unknown word detection “Phase 1” and unknown word extraction “Phase 2”.

Unknown Word Detection & Extraction

[Figure: system flow in two phases. Phase 1, Unknown Word Detection (detection rule mining): the 8/10 corpus (initial segmentation) goes through the mining tool (Prowl) to produce rules, which are judged on a 1/10 corpus (validation). Phase 2, Unknown Word Extraction (machine learning, classification): the 8/10 corpus + detection tags is POS-tagged and used for training a model; a 1/10 corpus (initial segmentation) + detection tags is POS-tagged and used for testing, with another 1/10 corpus for validation; the output is the classification decision.]

Unknown Word Detection

Mine detection rules:

Learn from the 8/10 corpus.

Continuity pattern mining.

Focus on monosyllables.

Unknown word detection- Pattern Mining

Pattern mining:

Sequential pattern: “因為… , 所以…” (“because ..., therefore ...”). The required items must match the pattern order; noise is allowed between the required items.

Continuity pattern: “打 * 球” => “打棒球”: match, “打躲避球”: no match. The definition is strict about each item and about the order, which makes pattern mining efficient.
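The difference between the two pattern types can be made concrete with a short sketch. It assumes that `*` in a continuity pattern stands for exactly one character and that the pattern is compared against the whole string; both function names are hypothetical.

```python
def matches_continuity(pattern, text):
    """Strict match: same length, and every non-wildcard position agrees."""
    return len(pattern) == len(text) and all(
        p == "*" or p == c for p, c in zip(pattern, text)
    )


def matches_sequential(items, text):
    """Loose match: the items occur in order, with any noise in between."""
    pos = 0
    for item in items:
        pos = text.find(item, pos)
        if pos < 0:
            return False
        pos += len(item)
    return True


print(matches_continuity("打*球", "打棒球"))
print(matches_continuity("打*球", "打躲避球"))
print(matches_sequential(["打", "球"], "打躲避球"))
```

Under these assumptions, 打躲避球 fails the continuity pattern because two characters sit between 打 and 球, while a sequential pattern would still accept it.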

Unknown word detection- Continuity Pattern Mining

Prowl [Huang et al., 2004]:

Starts with frequent patterns of length 1.

Extends to length-2 patterns by joining two adjacent frequent length-1 patterns, then evaluates their frequency.

Iteratively extends to longer patterns.
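The level-wise idea described above can be sketched as follows. This is not Prowl itself, whose internals are not given in the slides; it is a minimal illustration that assumes a pattern is a contiguous tuple of items and that a longer pattern is counted only when both of its adjacent shorter sub-patterns are frequent. The function name and the `min_freq` and `max_len` parameters are hypothetical.

```python
from collections import Counter


def mine_continuity_patterns(sequences, min_freq, max_len=3):
    """Return {pattern: frequency} for frequent contiguous patterns."""
    # Level 1: count single items and keep the frequent ones.
    counts = Counter((item,) for seq in sequences for item in seq)
    current = {p: c for p, c in counts.items() if c >= min_freq}
    frequent = dict(current)

    for length in range(2, max_len + 1):
        counts = Counter()
        for seq in sequences:
            for i in range(len(seq) - length + 1):
                gram = tuple(seq[i:i + length])
                # Extend only when both adjacent shorter patterns are frequent.
                if gram[:-1] in current and gram[1:] in current:
                    counts[gram] += 1
        current = {p: c for p, c in counts.items() if c >= min_freq}
        if not current:
            break
        frequent.update(current)
    return frequent


# Toy corpus (assumption): three words split into characters.
patterns = mine_continuity_patterns(
    [list("葡萄"), list("葡萄皮"), list("葡萄酒")], min_freq=2
)
print(patterns)
```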

Encoding

Original segmentation: label the words based on lexicon matching, known (Y) or unknown (N).

“葡萄” is in the lexicon => “葡萄” is labeled as a known word (Y).

“葡萄皮” is not in the lexicon => “葡萄皮” is labeled as an unknown word (N).

Encoding examples:

葡萄 (Na) => 葡 (Na) Y + 萄 (Na) Y

葡萄皮 (Na) => 葡 (Na) N + 萄 (Na) N + 皮 (Na) N
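The encoding examples can be reproduced with a few lines of code. This sketch assumes, as the examples suggest, that every character inherits the POS tag and the known/unknown label of the word it came from; the `encode` function and the one-word `lexicon` are hypothetical.

```python
def encode(word, pos, lexicon):
    """Split a word into (character, pos, label) triples.

    The label is "Y" when the whole word is in the lexicon, otherwise "N".
    """
    label = "Y" if word in lexicon else "N"
    return [(ch, pos, label) for ch in word]


lexicon = {"葡萄"}  # toy lexicon (assumption)
print(encode("葡萄", "Na", lexicon))
print(encode("葡萄皮", "Na", lexicon))
```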

Create Detection Rules

Rule pattern: character, pos, label. Max length = 3. The character within “{ }” is the primary character of the rule.

Ex: ({葡}, 萄): “葡” is a known word when “葡萄” appears.

Rule accuracy. Ex: ({葡 (Na)}, 萄 (Na)): accuracy = #(葡 (Na) is a known word in (葡 (Na), 萄 (Na))) / #(葡 (Na), 萄 (Na)). With the pattern counts below, this is 1/2.

(葡 (Na), 萄 (Na)) : 2

(葡 (Na) Y, 萄 (Na)) : 1

(葡 (Na) N, 萄 (Na) N) : 1

(葡 (Na) Y, 萄 (Na) Y) : 1

(葡, 萄) : 2

(葡 (Na), 萄) : 2

(葡, 萄 (Na)) : 2
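As a worked example of the rule accuracy, the sketch below counts how often the primary character is labeled known (Y) when it is directly followed by the context character. It assumes a corpus already encoded as (character, pos, label) triples and handles only two-item rules with the primary character first; `rule_accuracy` is a hypothetical helper.

```python
def rule_accuracy(encoded_corpus, primary, follower):
    """Accuracy of the rule ({primary}, follower).

    primary and follower are (character, pos) pairs. The result is the share
    of adjacent (primary, follower) occurrences in which primary is labeled "Y".
    """
    matches = 0
    known = 0
    for seq in encoded_corpus:
        for (c1, p1, l1), (c2, p2, _) in zip(seq, seq[1:]):
            if (c1, p1) == primary and (c2, p2) == follower:
                matches += 1
                if l1 == "Y":
                    known += 1
    return known / matches if matches else 0.0


# The two encoded words from the slide: 葡萄 (known) and 葡萄皮 (unknown).
corpus = [
    [("葡", "Na", "Y"), ("萄", "Na", "Y")],
    [("葡", "Na", "N"), ("萄", "Na", "N"), ("皮", "Na", "N")],
]
accuracy = rule_accuracy(corpus, ("葡", "Na"), ("萄", "Na"))
print(accuracy)
```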

Unknown Word Extraction

Machine learning: classification and sequential learning.

Unknown Word Extraction- feature (Pos)

We use the TnT POS tagger to assign part-of-speech (POS) tags to terms.

Kinds of POS tags:

Nouns (Na, Nb, …)

Verbs (VA, VB, VC, …)

Adjectives (A, …)

Punctuation (Comma, Period, …)

…

Unknown Word Extraction- feature (term_attribute)

After initial segmentation and application of the detection rules, each term carries a “term_attribute” label. The six “term_attributes” are as follows:

ms(): monosyllabic word. Ex: 你、我、他

ms(?): morpheme of an unknown word. Ex: “王”, “小”, “明” in “王小明”

ds(): double-syllabic word. Ex: 學校

ps(): poly-syllabic word. Ex: 筆記型電腦

dot(): punctuation. Ex: “,”, “。”, …

none(): none of the above information, or a new term

Target of unknown word extraction: a candidate must contain at least one ms(?).

| 運動會 | ‧ | 四年 | 甲班 | 王 | 姿 | 分 | ‧ | 本校 | 為 | 響 | 應 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ps() | dot() | ds() | ds() | ms(?) | ms(?) | ms(?) | dot() | ds() | ms() | ms() | ms() |
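The six labels can be assigned with a simple decision function. The sketch below is an assumption about how the labels might be derived, not the author's implementation: it treats one character as one syllable, takes a `flagged` argument to stand for "marked by a detection rule", and uses a small hypothetical punctuation set.

```python
PUNCTUATION = set("，。、‧；：？！")  # hypothetical punctuation set


def term_attribute(term, known=True, flagged=False):
    """Return the term_attribute label for one segmented term.

    known:   the term was found during initial segmentation.
    flagged: a detection rule marked the term as a possible morpheme.
    """
    if term in PUNCTUATION:
        return "dot()"
    if not known:
        return "none()"
    if len(term) == 1:
        return "ms(?)" if flagged else "ms()"
    if len(term) == 2:
        return "ds()"
    return "ps()"


terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校", "為", "響", "應"]
flagged = {"王", "姿", "分"}  # morphemes marked by the detection phase
labels = [term_attribute(t, flagged=t in flagged) for t in terms]
print(labels)
```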

Data Processing- Sliding Window

Sequential supervised learning.

Indirect method: transform sequential learning into classification learning.

Sliding window:

We provide three SVM models to extract unknown words of different lengths, n = 2, 3, 4.

Each time, we choose n+2 terms (the n-gram plus a prefix and a suffix) as one window, then shift one token to the right to generate the next window, and so on.

Window: n+2 terms (prefix + n-gram + suffix).

N-gram: n terms, at least one of which must be ms(?).

Window layout for n = 3: t0 (prefix), t1 t2 t3 (3-gram), t4 (suffix).
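Window generation as described above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that no padding is added at the sequence boundaries, which the slides do not specify.

```python
def sliding_windows(terms, attrs, n):
    """Yield (prefix, n-gram, suffix) windows of n+2 terms.

    A window is kept only if its n-gram contains at least one ms(?) term.
    """
    windows = []
    for start in range(len(terms) - (n + 2) + 1):
        window = terms[start:start + n + 2]
        ngram_attrs = attrs[start + 1:start + 1 + n]
        if "ms(?)" in ngram_attrs:
            windows.append((window[0], window[1:-1], window[-1]))
    return windows


terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校", "為", "響", "應"]
attrs = ["ps()", "dot()", "ds()", "ds()", "ms(?)", "ms(?)", "ms(?)",
         "dot()", "ds()", "ms()", "ms()", "ms()"]
windows = sliding_windows(terms, attrs, n=3)
for prefix, ngram, suffix in windows:
    print(prefix, ngram, suffix)
```

With this input, the window whose 3-gram is 王 姿 分 has prefix 甲班 and suffix ‧; it is the one a 3-gram model would need to label as positive.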

EX: 3-gram Model

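Source terms: 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 ()

| Window (t0 t1 t2 t3 t4) | 3-gram (t1 t2 t3) | Label |
|---|---|---|
| 運動會 ‧ 四年 甲班 王(?) | ‧ 四年 甲班 | discard (no ms(?) in the 3-gram) |
| ‧ 四年 甲班 王(?) 姿(?) | 四年 甲班 王(?) | negative |
| 四年 甲班 王(?) 姿(?) 分(?) | 甲班 王(?) 姿(?) | negative |
| 甲班 王(?) 姿(?) 分(?) ‧ | 王(?) 姿(?) 分(?) | positive |
| 王(?) 姿(?) 分(?) ‧ 本校 | 姿(?) 分(?) ‧ | negative |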

Unknown Word Extraction- feature (Statistical Information)

Statistical information (exemplified by the 3-gram model):

1. Frequency of the 3-gram.

2. p( prefix | 3-gram), e.g. p( prefix | t1~t3)

3. p( suffix | 3-gram), e.g. p( suffix | t1~t3)

4. p( first term of n | other n-1 consecutive terms), e.g. p( t1 | t2~t3)

5. p( last term of n | other n-1 preceding terms), e.g. p( t3 | t1~t2)

6. p( pos_freq(prefix) / pos_freq(prefix in training positive))

7. p( pos_freq(suffix) / pos_freq(suffix in training positive))

Window layout: t0 (prefix), t1 t2 t3 (3-gram), t4 (suffix).
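Features 1 to 5 can be estimated from n-gram counts. The sketch below assumes that each conditional probability is a relative frequency, for example p(prefix | 3-gram) = count(prefix + 3-gram) / count(3-gram); the slides do not state the estimator. Features 6 and 7 are left out because their exact definition is not clear from the slide. All names and the two toy sentences are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_len):
    """Count every contiguous term sequence of length 1..max_len."""
    counts = Counter()
    for terms in sentences:
        for size in range(1, max_len + 1):
            for i in range(len(terms) - size + 1):
                counts[tuple(terms[i:i + size])] += 1
    return counts


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def window_statistics(counts, prefix, ngram, suffix):
    """Relative-frequency estimates of statistical features 1 to 5."""
    ngram = tuple(ngram)
    freq = counts[ngram]
    return {
        "freq": freq,
        "p_prefix": ratio(counts[(prefix,) + ngram], freq),
        "p_suffix": ratio(counts[ngram + (suffix,)], freq),
        "p_first": ratio(freq, counts[ngram[1:]]),
        "p_last": ratio(freq, counts[ngram[:-1]]),
    }


sentences = [["甲班", "王", "姿", "分", "‧"], ["王", "姿", "分", "說"]]
counts = count_ngrams(sentences, max_len=4)
stats = window_statistics(counts, "甲班", ["王", "姿", "分"], "‧")
print(stats)
```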

Data presentation

Format for machine learning:

Dimension: accumulative

| Position in window | Features (dimensions) |
|---|---|
| prefix | pos (55), term_attribute (6) |
| t1 | pos (55), term_attribute (6) |
| t2 | pos (55), term_attribute (6) |
| …… | … |
| suffix | pos (55), term_attribute (6) |
| statistics | (7) |
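One way to realize this layout is to concatenate, for each term in the window, a POS block and a term_attribute block, and then append the statistics. The sketch below assumes one-hot encoding for both blocks, which the slide does not state, and uses placeholder POS tag names because the full list of 55 tags is not given.

```python
POS_TAGS = [f"POS{i}" for i in range(55)]  # placeholder names (assumption)
ATTRIBUTES = ["ms()", "ms(?)", "ds()", "ps()", "dot()", "none()"]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value."""
    vec = [0] * len(vocabulary)
    if value in vocabulary:
        vec[vocabulary.index(value)] = 1
    return vec


def window_vector(window, statistics):
    """window: list of (pos, term_attribute) pairs, prefix first, suffix last."""
    vec = []
    for pos, attr in window:
        vec += one_hot(pos, POS_TAGS) + one_hot(attr, ATTRIBUTES)
    return vec + list(statistics)


# A 3-gram window has 5 terms: prefix, t1, t2, t3, suffix.
window = [("POS0", "ds()"), ("POS1", "ms(?)"), ("POS1", "ms(?)"),
          ("POS1", "ms(?)"), ("POS2", "dot()")]
vector = window_vector(window, [0.0] * 7)
print(len(vector))
```

Under these assumptions a 3-gram window yields 5 × (55 + 6) + 7 = 312 dimensions, and the 2-gram and 4-gram models yield 251 and 373.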

Experiments

Unknown word detection. Unknown word extraction.

Unknown Word Detection

8/10 of the balanced corpus (460m words) is used as training data.

Pattern mining tool: Prowl [Huang et al., 2004].

1/10 of the balanced corpus is used as validation data. Accuracy and frequency serve as thresholds for the detection rules.

1/10 of the balanced corpus is used as real test data (for Phase 2): 60.3% precision and 93.6% recall.

| Threshold (Accuracy) | Precision | Recall | F-measure (our system) | F-measure (AS system) |
|---|---|---|---|---|
| 0.7 | 0.9324 | 0.4305 | 0.589035 | 0.71250 |
| 0.8 | 0.9008 | 0.5289 | 0.66648 | 0.752447 |
| 0.9 | 0.8343 | 0.7148 | 0.769941 | 0.76955 |
| 0.95 | 0.764 | 0.8288 | 0.795082 | 0.76553 |
| 0.98 | 0.686 | 0.8786 | 0.770446 | 0.744036 |

| Frequency >= | Precision | Recall | F-measure |
|---|---|---|---|
| 3 | 0.764 | 0.8288 | 0.795082 |
| 7 | 0.7113 | 0.8819 | 0.787466 |
| 11 | 0.6924 | 0.8932 | 0.780085 |
| 19 | 0.6736 | 0.8995 | 0.77033 |
| 29 | 0.6552 | 0.9092 | 0.76158 |
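The F-measure values reported here are consistent with the balanced F1 score, the harmonic mean of precision and recall. A short check, with `f_measure` as a hypothetical helper name:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Row with accuracy threshold 0.95: precision 0.764, recall 0.8288.
print(round(f_measure(0.764, 0.8288), 6))
```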

Unknown Word Extraction

8/10 of the balanced corpus (460m words) is used as training data.

1/10 of the balanced corpus is used as testing data.

Imbalanced data solution: ensemble method (voting) + under-sampling (random).

Another 1/10 of the balanced corpus is used as validation data to find the sampling ratio (positive : negative):

2-gram: 1:2

3-gram: 1:3

4-gram: 1:6
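Random under-sampling and voting can be sketched without committing to a particular classifier. In the sketch below, `undersample` keeps all positive examples and draws negatives at a fixed ratio, and `majority_vote` combines 0/1 predictions from several base classifiers. Both function names are hypothetical, and resolving a tie as negative is an assumption.

```python
import random


def undersample(positives, negatives, neg_per_pos, rng):
    """Keep all positives and a random subset of negatives at the given ratio."""
    k = min(len(negatives), neg_per_pos * len(positives))
    return positives + rng.sample(negatives, k)


def majority_vote(predictions):
    """Combine 0/1 votes; a tie counts as negative (assumption)."""
    return int(sum(predictions) * 2 > len(predictions))


rng = random.Random(0)
positives = list(range(10))
negatives = list(range(100, 200))

# 3-gram ratio from the slide: 1 positive to 3 negatives.
sample = undersample(positives, negatives, neg_per_pos=3, rng=rng)
print(len(sample))
print(majority_vote([1, 1, 0]))
```

Each base classifier would be trained on its own random sample, so the ensemble sees more of the negative class than any single under-sampled model does.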

Unknown Word Extraction

Judging overlaps and conflicts among different combinations of unknown words:

[Chen et al., 2002]: frequency(w) * length(w). Ex: “律師 班 奈 特” => freq(律師 + 班) * 3 : freq(班 + 奈 + 特) * 3

Our method:

First, solve overlaps between identical N-grams: P(combine | overlap). Ex: “單 親 家庭”: P(單親 | 親) : P(親家庭 | 親)

Then, solve conflicts between different N-grams with the real frequency: freq(X) - freq(Y), if X is included in Y. Ex: X = “醫學”, “學院”; Y = “醫學院”
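The real-frequency adjustment is a one-line subtraction. The sketch below uses made-up counts purely for illustration; `real_frequency` is a hypothetical helper.

```python
def real_frequency(freq, candidate, longer):
    """freq(X) - freq(Y) when X is included in Y, otherwise freq(X)."""
    if candidate in longer:
        return freq[candidate] - freq[longer]
    return freq[candidate]


# Hypothetical counts for illustration only.
freq = {"醫學": 10, "學院": 8, "醫學院": 5}
print(real_frequency(freq, "醫學", "醫學院"))
print(real_frequency(freq, "學院", "醫學院"))
```

The subtraction removes the occurrences in which the shorter candidate appears only as part of the longer one, so the two candidates can be compared fairly.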

Extraction result

Comparison: [Ma et al., 2003]: morphological rules + statistical rules + context-free grammar rules. Precision: 76%, Recall: 57%

Our result:

| n-gram | Precision | Recall | F1-score |
|---|---|---|---|
| 4-gram | 30.6% | 70.3% | 0.426 |
| 3-gram | 63.3% | 80% | 0.707 |
| 2-gram | 56.7% | 67.1% | 0.614 |
| Total | 58.1% | 68.2% | 0.627 |

Ensemble Method Improvement

| Classification model | 2-gram Precision | 2-gram Recall | 2-gram F1-Score | 3-gram Precision | 3-gram Recall | 3-gram F1-Score | 4-gram Precision | 4-gram Recall | 4-gram F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 0.518 | 0.64 | 0.572 | 0.542 | 0.808 | 0.649 | 0.252 | 0.419 | 0.315 |
| C2 | 0.569 | 0.657 | 0.61 | 0.627 | 0.791 | 0.7 | 0.219 | 0.743 | 0.338 |
| C3 | 0.535 | 0.633 | 0.58 | 0.563 | 0.81 | 0.664 | 0.222 | 0.378 | 0.28 |
| C4 | 0.557 | 0.645 | 0.598 | 0.574 | 0.796 | 0.667 | 0.305 | 0.676 | 0.42 |
| C5 | 0.555 | 0.66 | 0.603 | 0.549 | 0.779 | 0.644 | 0.205 | 0.554 | 0.299 |
| C6 | 0.536 | 0.636 | 0.582 | 0.568 | 0.735 | 0.641 | 0.23 | 0.608 | 0.333 |
| C7 | 0.557 | 0.66 | 0.604 | 0.611 | 0.691 | 0.648 | 0.211 | 0.703 | 0.325 |
| C8 | 0.541 | 0.673 | 0.6 | 0.579 | 0.813 | 0.676 | 0.226 | 0.486 | 0.309 |
| C9 | 0.548 | 0.657 | 0.598 | 0.587 | 0.715 | 0.645 | 0.215 | 0.635 | 0.321 |
| C10 | 0.543 | 0.661 | 0.596 | 0.599 | 0.723 | 0.655 | 0.232 | 0.662 | 0.344 |
| C11 | 0.533 | 0.668 | 0.593 | 0.607 | 0.74 | 0.667 | 0.24 | 0.554 | 0.335 |
| C12 | 0.538 | 0.645 | 0.587 | 0.587 | 0.776 | 0.669 | 0.299 | 0.662 | 0.412 |
| C average | 0.544 | 0.653 | 0.594 | 0.583 | 0.765 | 0.66 | 0.238 | 0.59 | 0.336 |
| C ensemble | 0.567 | 0.671 | 0.614 | 0.633 | 0.8 | 0.707 | 0.306 | 0.703 | 0.426 |

Experiment- One phase

What if unknown word detection is skipped?

| Classification | Precision | Recall | F-score |
|---|---|---|---|
| One Phase | 40.8% | 71.4% | 0.52 |
| Two Phases | 58.1% | 68.2% | 0.627 |

Two phases do work better.

Conclusions

We adopt a two-phase method to solve the unknown word problem:

Unknown word detection: continuity pattern mining to derive detection rules.

Unknown word extraction: machine-learning-based classification algorithms and sequential learning (indirect), with an imbalanced data solution.

Our experiments show that two phases work better than one phase.

Future work:

Apply machine learning to detection.

Utilize more information (patterns, rules) to improve extraction precision.