1
Pattern Mining for Chinese Unknown Word Extraction
Third-year CS master's student, 955202037
楊傑程, 2008/10/14
2
Outline
Introduction
Related Works
Unknown Word Detection
Unknown Word Extraction
Experiments
Conclusions
3
Introduction
With the growing popularity of Chinese, Chinese text processing has drawn a great amount of interest in recent years.
Before the knowledge in Chinese texts can be utilized, some preprocessing work must be done, such as Chinese word segmentation: there are no blanks to mark word boundaries in Chinese texts.
4
Introduction
Chinese word segmentation encounters two major problems: ambiguity and unknown words.
Ambiguity: one unsegmented Chinese character string has different segmentations according to different context information.
Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words.
Ex: the sentence "王義氣熱衷於研究生命" would be segmented into "王 義氣 熱衷 於 研究 生命" because "王義氣" is an uncommon personal name that is not in the vocabulary.
5
Introduction - types of unknown words
In this paper, we focus on the Chinese unknown word problem.
Types of Chinese unknown words:
Proper names
  Personal names, Ex: 王小明
  Organization names, Ex: 華碩電腦
  Abbreviations, Ex: 中油、中大
Derived words, Ex: 總經理、電腦化
Compounds, Ex: 電腦桌、搜尋法
Numeric-type compounds, Ex: 1986 年、19 巷
6
Introduction - unknown word identification
Chinese word segmentation process:
1. Initial segmentation (dictionary-assisted): correctly identified words are called known words, while unknown words are wrongly segmented into two or more parts.
Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation.
2. Unknown word identification: characters that belong to one unknown word should be re-combined.
Ex: re-combine 王 小 明 into 王小明.
7
Introduction - unknown word identification
How does unknown word identification work?
A character can be a word (馬) or part of an unknown word (馬 + 英 + 九).
1. Unknown word detection: find detection rules that distinguish monosyllabic words from monosyllabic morphemes.
2. Unknown word extraction: focus on the detected morphemes and combine them.
8
Introduction - applied techniques
In this paper, we apply continuity pattern mining to discover unknown word detection rules.
Then, we apply machine learning based methods (classification algorithms and sequential learning methods) to extract unknown words.
We utilize syntactic information, context information, and heuristic statistical information.
Our unknown word identification method is a general method, not limited to specific types of unknown words.
9
Related Works - particular methods
So far, research on Chinese word segmentation has lasted for a decade.
At first, researchers applied different kinds of information to discover particular kinds of unknown words:
Proper nouns (Chinese personal names, transliterated names, organization names) [Chen & Li, 1996], [Chen & Chen, 2000]
Using patterns, frequency, and context information.
10
Related Works - general methods (rule-based)
Later, researchers began developing methods that extract all kinds of unknown words.
Rule-based detection and extraction:
[Chen et al., 1998] distinguish monosyllabic words from monosyllabic morphemes.
[Chen et al., 2002] combine morphological rules with statistical rules to extract personal names, transliterated names, and compound nouns (precision: 89%, recall: 68%).
[Ma et al., 2003] utilize the context-free grammar concept and propose a bottom-up merging algorithm, adopting morphological rules and general rules to extract all kinds of unknown words (precision: 76%, recall: 57%).
11
Related Works - general methods (machine learning based)
Sequential learning [T. G. Dietterich, 2002]: transform the sequential learning problem into a classification problem.
Direct methods, like HMM and CRF:
[Goh et al., 2006] HMM + SVM (precision: 63.8%, recall: 58.3%)
[Tsai et al., 2006] CRF (recall: 73%)
Indirect methods, like the sliding window and recurrent sliding windows.
12
Related Works - imbalanced data
The imbalanced data problem:
Ensemble methods [C. Li, 2007]: combine the learning ability of multiple base classifiers by voting.
Cost-sensitive learning and sampling [G. M. Weiss et al., 2007]: focus more on minority class examples.
[C. Drummond et al., 2003]: under-sampling is more sensitive than over-sampling.
[Seyda et al., 2007]: select the most informative instances.
13
Unknown Word Detection & Extraction
Our idea is similar to [Chen et al., 2002]:
Unknown word detection: continuity pattern mining to derive detection rules.
Unknown word extraction: machine learning based classification algorithms and (indirect) sequential learning.
We call unknown word detection "Phase 1" and unknown word extraction "Phase 2".
14
Unknown Word Detection & Extraction
(System flow diagram)
Phase 1 - unknown word detection (detection rule mining): 8/10 of the corpus, after initial segmentation and POS tagging, is fed to the mining tool (Prowl) to derive detection rules; 1/10 of the corpus (validation) is used to judge the rules.
Phase 2 - unknown word extraction (machine learning classification): the 8/10 corpus plus detection tags is used to train the classification model; the 1/10 corpus plus detection tags is used for testing the classification decision, with another 1/10 corpus (validation) used to judge it.
15
Unknown Word Detection
Mine detection rules: learn from 8/10 of the corpus with continuity pattern mining, focusing on monosyllables.
16
Unknown word detection - pattern mining
Pattern mining:
Sequential patterns, e.g. "因為… , 所以…": required items must match the pattern order, but noise is allowed between the required items.
Continuity patterns, e.g. "打 * 球": "打棒球" matches, "打躲避球" does not. Each item and its position are strictly defined, which allows efficient pattern mining (see the sketch below).
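To make the contiguity constraint concrete, here is a minimal matching sketch in Python; treating "*" as exactly one arbitrary character is an assumption based on the slide's example.

```python
def matches_continuity(pattern, seq):
    """Return True if `seq` contains `pattern` as one contiguous run.
    '*' stands for exactly one arbitrary item, so ['打', '*', '球']
    matches '打棒球' but not '打躲避球' (two characters in between)."""
    m = len(pattern)
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        if all(p == "*" or p == s for p, s in zip(pattern, window)):
            return True
    return False

print(matches_continuity(["打", "*", "球"], list("打棒球")))    # True
print(matches_continuity(["打", "*", "球"], list("打躲避球")))  # False
```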
17
Unknown word detection - continuity pattern mining
Prowl [Huang et al., 2004]:
Starts with 1-frequent patterns, extends to 2-patterns by joining two adjacent 1-frequent patterns and evaluating their frequency, then iteratively extends to longer patterns, as sketched below.
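The following is a small level-wise sketch of that idea, not the actual Prowl implementation; the function name and counting by sequence occurrence are assumptions.

```python
def mine_continuity_patterns(sequences, min_freq):
    """Level-wise continuity pattern mining in the spirit of Prowl:
    keep frequent 1-patterns, join adjacent frequent patterns into
    candidates one item longer, and prune by frequency each round."""
    def freq(pattern):
        # number of sequences containing `pattern` as a contiguous run
        m = len(pattern)
        return sum(
            any(tuple(s[i:i + m]) == pattern for i in range(len(s) - m + 1))
            for s in sequences
        )

    level = {(item,) for s in sequences for item in s}
    level = {p for p in level if freq(p) >= min_freq}
    result = set(level)
    while level:
        # two length-k frequent patterns overlapping on k-1 items
        # form one length-(k+1) candidate
        candidates = {p + q[-1:] for p in level for q in level if p[1:] == q[:-1]}
        level = {c for c in candidates if freq(c) >= min_freq}
        result |= level
    return result

corpus = [list("打棒球"), list("打棒球"), list("打籃球")]
print(mine_continuity_patterns(corpus, min_freq=2))
```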
18
Encoding
Starting from the original segmentation, label each word based on lexicon matching as known (Y) or unknown (N):
"葡萄" is in the lexicon => "葡萄" is labeled as a known word (Y).
"葡萄皮" is not in the lexicon => "葡萄皮" is labeled as an unknown word (N).
Encoding examples:
葡萄 (Na) => 葡 (Na) Y + 萄 (Na) Y
葡萄皮 (Na) => 葡 (Na) N + 萄 (Na) N + 皮 (Na) N
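A minimal sketch of this encoding step (representing the segmented input as (word, pos) pairs is an assumption):

```python
def encode(segmented, lexicon):
    """Label each initially segmented word as known (Y) if it is in the
    lexicon and unknown (N) otherwise, then expand the word into one
    (character, pos, label) triple per character."""
    triples = []
    for word, pos in segmented:
        label = "Y" if word in lexicon else "N"
        triples.extend((ch, pos, label) for ch in word)
    return triples

lexicon = {"葡萄"}
print(encode([("葡萄", "Na")], lexicon))
# [('葡', 'Na', 'Y'), ('萄', 'Na', 'Y')]
print(encode([("葡萄皮", "Na")], lexicon))
# [('葡', 'Na', 'N'), ('萄', 'Na', 'N'), ('皮', 'Na', 'N')]
```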
19
Create Detection Rules
Rule pattern: (character, pos, label), with max length 3. The character within "{ }" is the primary character of the rule.
Ex: ({葡}, 萄): "葡" is a known word when "葡萄" appears.
Rule accuracy:
Ex: ({葡(Na)}, 萄(Na)) = P(葡(Na) is a known word | pattern (葡(Na), 萄(Na)) appears).
Counts from the corpus:
(葡(Na), 萄(Na)): 2
(葡(Na) Y, 萄(Na)): 1
(葡(Na) N, 萄(Na) N): 1
(葡(Na) Y, 萄(Na) Y): 1
(葡, 萄): 2
(葡(Na), 萄): 2
(葡, 萄(Na)): 2
So this rule's accuracy is 1/2 (a small computation sketch follows).
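A sketch of the accuracy computation over the counts above; the count-table representation is an assumption.

```python
from collections import Counter

# counts taken from the slide's example
counts = Counter({
    ("葡(Na)", "萄(Na)"): 2,   # occurrences of the pattern
    ("葡(Na)Y", "萄(Na)"): 1,  # occurrences where the primary character is a known word
})

def rule_accuracy(counts, pattern, pattern_with_primary_known):
    """accuracy = P(primary character is known | pattern appears)"""
    return counts[pattern_with_primary_known] / counts[pattern]

print(rule_accuracy(counts,
                    ("葡(Na)", "萄(Na)"),
                    ("葡(Na)Y", "萄(Na)")))  # 0.5
```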
20
Unknown Word Extraction
Machine learning: classification, sequential learning.
21
Unknown Word Extraction - feature (POS)
We use the TnT POS tagger to assign part-of-speech (POS) tags to terms.
Kinds of POS tags: nouns (Na, Nb, …), verbs (VA, VB, VC, …), adjectives (A, …), punctuation (comma, period, …), etc.
22
Unknown Word Extraction - feature (term_attribute)
After initial segmentation and applying the detection rules, each term carries a "term_attribute" label. The six term_attributes are as follows:
ms(): monosyllabic word, Ex: 你、我、他
ms(?): morpheme of an unknown word, Ex: "王"、"小"、"明" in "王小明"
ds(): disyllabic word, Ex: 學校
ps(): polysyllabic word, Ex: 筆記型電腦
dot(): punctuation, Ex: "，"、"。"…
none(): none of the above, or a new term
A target unknown word contains at least one ms(?).
Ex: 運動會 ps() ‧ dot() 四年 ds() 甲班 ds() 王 ms(?) 姿 ms(?) 分 ms(?) ‧ dot() 本校 ds() 為 ms() 響 ms() 應 ms()
23
Data Processing - Sliding Window
Sequential supervised learning, indirect method: transform sequential learning into classification learning via a sliding window.
We build three lengths of SVM models to extract different lengths of unknown words, n = 2, 3, 4.
Each time we take n + 2 terms (prefix + n-gram + suffix) as one window, then shift one token to the right to generate the next window, and so on.
Window: n + 2 terms (n + prefix + suffix); n-gram: the n core terms.
At least one ms(?) must exist among the n core terms (see the sketch below).
Window layout for n = 3: prefix t0 | 3-gram t1 t2 t3 | suffix t4
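A minimal window-generation sketch under the representation above; the parallel term/attribute lists are an assumed encoding.

```python
def windows(terms, attrs, n):
    """Slide over the sequence and emit (prefix, n-gram, suffix) windows
    of n + 2 terms; a window is kept only if its n-gram core contains
    at least one ms(?) morpheme, otherwise it is discarded."""
    for i in range(len(terms) - (n + 2) + 1):
        core_attrs = attrs[i + 1:i + 1 + n]
        if any(a == "ms(?)" for a in core_attrs):
            yield terms[i], terms[i + 1:i + 1 + n], terms[i + 1 + n]

terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校"]
attrs = ["ps()", "dot()", "ds()", "ds()", "ms(?)", "ms(?)", "ms(?)", "dot()", "ds()"]
for prefix, core, suffix in windows(terms, attrs, n=3):
    print(prefix, core, suffix)
```

The four windows printed here correspond to the kept windows in the 3-gram example on the next slide; the first window (whose core has no ms(?)) is discarded.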
24
Ex: 3-gram model
Sentence: 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 ()

Window (prefix | 3-gram | suffix)       Label
運動會 | ‧ 四年 甲班 | 王(?)            discard (no ms(?) in the 3-gram)
‧ | 四年 甲班 王(?) | 姿(?)             negative
四年 | 甲班 王(?) 姿(?) | 分(?)         negative
甲班 | 王(?) 姿(?) 分(?) | ‧            positive (the 3-gram is exactly the unknown word 王姿分)
王(?) | 姿(?) 分(?) ‧ | 本校            negative
25
Unknown Word Extraction - feature (statistical information)
Statistical information (exemplified by the 3-gram model; window layout: prefix t0 | 3-gram t1 t2 t3 | suffix t4):
1. Frequency of the 3-gram.
2. p(prefix | 3-gram), e.g. p(prefix | t1~t3).
3. p(suffix | 3-gram), e.g. p(suffix | t1~t3).
4. p(first term of n | other n-1 consecutive terms), e.g. p(t1 | t2~t3).
5. p(last term of n | other n-1 preceding terms), e.g. p(t3 | t1~t2).
6. pos_freq(prefix) / pos_freq(prefix in training positives).
7. pos_freq(suffix) / pos_freq(suffix in training positives).
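A sketch of how features 1 through 5 could be estimated from raw n-gram counts; the estimation scheme is an assumption, and features 6 and 7 are omitted because they additionally need POS frequencies over the training positives.

```python
from collections import Counter

def count_ngrams(corpus, n):
    """Count every contiguous n-gram over a corpus of token lists."""
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

def statistics(corpus, prefix, core, suffix):
    """Features 1-5 for one window, where `core` is the (t1, t2, t3)
    3-gram; counts are positive when the window comes from the corpus."""
    c2, c3, c4 = (count_ngrams(corpus, n) for n in (2, 3, 4))
    f = c3[core]
    return {
        "freq_3gram": f,                           # 1. frequency of the 3-gram
        "p_prefix": c4[(prefix,) + core] / f,      # 2. p(prefix | t1~t3)
        "p_suffix": c4[core + (suffix,)] / f,      # 3. p(suffix | t1~t3)
        "p_first": f / c2[core[1:]],               # 4. p(t1 | t2~t3)
        "p_last": f / c2[core[:2]],                # 5. p(t3 | t1~t2)
    }
```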
26
Data presentation
Format for machine learning usage (dimensions accumulate across the window):

Term:      prefix | t1 | t2 | …… | suffix | statistics
Features:  pos (55) + term_attribute (6) per term | statistics (7)

Each term thus contributes 55 + 6 = 61 dimensions, so a 3-gram window (prefix + t1~t3 + suffix) yields 5 * 61 + 7 = 312 dimensions.
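A sketch of assembling one window's vector under this layout; the tag lists below are placeholders (the real POS set has 55 tags), so the printed dimensionality differs from the full 312.

```python
import numpy as np

POS_TAGS = ["Na", "Nb", "VA", "VB", "VC", "A", "COMMA", "PERIOD"]  # placeholder subset of the 55 tags
ATTRS = ["ms()", "ms(?)", "ds()", "ps()", "dot()", "none()"]       # the 6 term_attributes

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def encode_window(pos_tags, attrs, stats):
    """Concatenate one block per term (one-hot POS + one-hot
    term_attribute), then append the 7 statistical features; with the
    full 55-tag set, a 3-gram window yields 5 * (55 + 6) + 7 = 312 dims."""
    blocks = [np.concatenate([one_hot(p, POS_TAGS), one_hot(a, ATTRS)])
              for p, a in zip(pos_tags, attrs)]
    return np.concatenate(blocks + [np.asarray(stats, dtype=float)])

x = encode_window(["Nb", "Na", "Na", "Na", "COMMA"],
                  ["ds()", "ms(?)", "ms(?)", "ms(?)", "dot()"],
                  stats=[2, 0.5, 0.5, 0.4, 0.4, 0.1, 0.1])
print(x.shape)  # (77,) with the placeholder tag set
```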
27
Experiments
Unknown word detection
Unknown word extraction
28
Unknown Word Detection
8/10 of the balanced corpus (460m words) as training data, mined with the pattern mining tool Prowl [Huang et al., 2004].
1/10 of the balanced corpus as validation data; rule accuracy and frequency serve as thresholds for keeping detection rules.
1/10 of the balanced corpus as the real test data (for Phase 2): 60.3% precision and 93.6% recall.

Threshold (accuracy)   Precision   Recall   F-measure (our system)   F-measure (AS system)
0.7                    0.9324      0.4305   0.589035                 0.71250
0.8                    0.9008      0.5289   0.66648                  0.752447
0.9                    0.8343      0.7148   0.769941                 0.76955
0.95                   0.764       0.8288   0.795082                 0.76553
0.98                   0.686       0.8786   0.770446                 0.744036

Freq >=   Precision   Recall   F-measure
3         0.764       0.8288   0.795082
7         0.7113      0.8819   0.787466
11        0.6924      0.8932   0.780085
19        0.6736      0.8995   0.77033
29        0.6552      0.9092   0.76158
29
Unknown Word Extraction
8/10 of the balanced corpus (460m words) as training data, 1/10 of the balanced corpus as testing data.
Imbalanced data solution: ensemble method (voting) + random under-sampling.
We use another 1/10 of the balanced corpus as validation data to find the sampling ratios (positive : negative): 2-gram 1:2, 3-gram 1:3, 4-gram 1:6. A sketch of this setup follows.
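A minimal sketch of this training setup, assuming feature vectors as numpy rows; scikit-learn's LinearSVC stands in for the slides' SVM models, and the helper names are assumptions.

```python
import random
import numpy as np
from sklearn.svm import LinearSVC

def train_ensemble(pos, neg, ratio, n_models=12, seed=0):
    """Random under-sampling + voting ensemble: every base classifier
    sees all positives plus a fresh random negative sample at the
    validated positive:negative ratio (e.g. ratio=3 for the 3-gram model)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sampled = rng.sample(neg, min(len(neg), len(pos) * ratio))
        X = np.vstack(list(pos) + list(sampled))
        y = np.array([1] * len(pos) + [0] * len(sampled))
        models.append(LinearSVC().fit(X, y))
    return models

def predict(models, x):
    """C_ensemble decision: majority vote over the base classifiers."""
    votes = sum(int(m.predict([x])[0]) for m in models)
    return int(2 * votes > len(models))
```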
30
Unknown Word Extraction
Judging the overlap and conflict problem among different combinations of unknown words:
[Chen et al., 2002] use frequency(w) * length(w). Ex: for "律師 班 奈 特", compare freq(律師 + 班) * 3 against freq(班 + 奈 + 特) * 3.
Our method:
First solve overlaps within identical n-grams using P(combine | overlap). Ex: for "單 親 家庭", compare P(單親 | 親) against P(親家庭 | 親).
Then solve conflicts between different n-grams using the real frequency: freq(X) - freq(Y) if X is included in Y. Ex: X = "醫學"、"學院", Y = "醫學院" (see the sketch below).
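A tiny sketch of the real-frequency rule; the counts are hypothetical, made up for illustration only.

```python
def real_freq(freq, x, y):
    """Discounted count for a conflict between different n-grams:
    if candidate x is contained in the longer candidate y, occurrences
    of x inside y are subtracted, since they already belong to y."""
    return freq[x] - freq[y] if x in y else freq[x]

# hypothetical counts for illustration only
freq = {"醫學": 30, "學院": 25, "醫學院": 20}
print(real_freq(freq, "醫學", "醫學院"))  # 10
print(real_freq(freq, "學院", "醫學院"))  # 5
```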
31
Extraction result
Comparison with [Ma et al., 2003] (morphological rules + statistical rules + context-free grammar rules): precision 76%, recall 57%.
Our results:

n-gram   Precision   Recall   F1-score
4-gram   30.6%       70.3%    0.426
3-gram   63.3%       80%      0.707
2-gram   56.7%       67.1%    0.614
Total    58.1%       68.2%    0.627
32
Ensemble Method Improvement
Classification models (C1~C12 are the base classifiers; C_ensemble is the voting ensemble):

             2-gram                         3-gram                         4-gram
Model        Precision  Recall  F1-score   Precision  Recall  F1-score   Precision  Recall  F1-score
C1           0.518      0.64    0.572      0.542      0.808   0.649      0.252      0.419   0.315
C2           0.569      0.657   0.61       0.627      0.791   0.7        0.219      0.743   0.338
C3           0.535      0.633   0.58       0.563      0.81    0.664      0.222      0.378   0.28
C4           0.557      0.645   0.598      0.574      0.796   0.667      0.305      0.676   0.42
C5           0.555      0.66    0.603      0.549      0.779   0.644      0.205      0.554   0.299
C6           0.536      0.636   0.582      0.568      0.735   0.641      0.23       0.608   0.333
C7           0.557      0.66    0.604      0.611      0.691   0.648      0.211      0.703   0.325
C8           0.541      0.673   0.6        0.579      0.813   0.676      0.226      0.486   0.309
C9           0.548      0.657   0.598      0.587      0.715   0.645      0.215      0.635   0.321
C10          0.543      0.661   0.596      0.599      0.723   0.655      0.232      0.662   0.344
C11          0.533      0.668   0.593      0.607      0.74    0.667      0.24       0.554   0.335
C12          0.538      0.645   0.587      0.587      0.776   0.669      0.299      0.662   0.412
C_average    0.544      0.653   0.594      0.583      0.765   0.66       0.238      0.59    0.336
C_ensemble   0.567      0.671   0.614      0.633      0.8     0.707      0.306      0.703   0.426
33
Experiment - One Phase
What if we skip unknown word detection (Phase 1)?
Two phases do work better:

Classification   Precision   Recall   F-score
One phase        40.8%       71.4%    0.52
Two phases       58.1%       68.2%    0.627
34
Conclusions
We adopt a two-phase method to solve the unknown word problem:
Unknown word detection: continuity pattern mining to derive detection rules.
Unknown word extraction: machine learning based classification algorithms and (indirect) sequential learning, with an imbalanced data solution.
Our experiments prove that two phases work better than one phase.
Future work: utilize machine learning in detection; utilize more information (patterns, rules) to improve extraction precision.