Fast Methods for Kernel-based Text Analysis
DESCRIPTION
Fast Methods for Kernel-based Text Analysis. Taku Kudo (工藤 拓), Yuji Matsumoto (松本 裕治), NAIST (Nara Institute of Science and Technology). 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.
TRANSCRIPT
1
Fast Methods for Kernel-based Text Analysis
Taku Kudo (工藤 拓), Yuji Matsumoto (松本 裕治), NAIST (Nara Institute of Science and Technology)
41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan
2
Background
Kernel methods (e.g., SVM) have become popular. They can incorporate prior knowledge independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product), and they achieve high accuracy.
3
Problem
Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because of their inefficiency at testing time. Some kernel-based parsers run at only 2-3 seconds per sentence.
4
Goals
Build fast but still accurate kernel-based text analyzers, making it possible to apply them to a wider range of NL applications.
5
Outline
Polynomial Kernel of degree d
Fast Methods for Polynomial Kernels: PKI, PKE
Experiments
Conclusions and Future Work
6
Outline
Polynomial Kernel of degree d
Fast Methods for Polynomial Kernels: PKI, PKE
Experiments
Conclusions and Future Work
7
Kernel Methods
No need to represent examples by an explicit feature vector:

f(X) = Σ_{i=1}^{L} α_i φ(X_i)·φ(X) = Σ_{i=1}^{L} α_i K(X_i, X)

Training data: T = {X_1, X_2, …, X_L}
Complexity of testing is O(L·|X|)
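The classifier on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' code: sets stand in for feature vectors, and a plain set-intersection kernel K(X, X') = |X ∩ X'| stands in for a general kernel; the example sets and weights are made up.

```python
# Sketch of a kernel classifier f(X) = sum_i alpha_i * K(X_i, X),
# where examples are sets and the kernel is a set-intersection
# dot product (illustrative stand-in for a general kernel).

def kernel(x, x2):
    """Dot product of two bag-of-features examples: |X ∩ X'|."""
    return len(x & x2)

def classify(support_vectors, alphas, x):
    """f(X) = sum over training examples of alpha_i * K(X_i, X)."""
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas))

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
print(classify(svs, alphas, {"a", "c", "e"}))  # 1*2 + 0.5*1 - 2*1 = 0.5
```

Note how testing touches every support vector and every feature of X, which is exactly the O(L·|X|) cost the slide states.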
8
Kernels for Sets (1/3)
Focus on the special case where examples are represented as sets
The instances in NLP are usually represented as sets (e.g., bag-of-words)

Feature set: F = {f_1, f_2, …, f_N}
Training data: T = {X_1, X_2, …, X_L}, where X_j ⊆ F
9
Kernels for Sets (2/3)

X_1 = {a, b, c, d}, X_2 = {a, b, d, e}

Simple definition: K(X_1, X_2) = |X_1 ∩ X_2| = |{a, b, d}| = 3

Combinations (subsets) of features:
2nd order: {{a,b}, {a,d}, {b,d}}
3rd order: {{a,b,d}}
10
Kernels for Sets (3/3)
I ate a cake
PRP VBD DT NN
Dependent (+1) or independent (-1)? (head ← modifier)

Basic features:
X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN}

After heuristic selection of combinations:
X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, …}

Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks. Previous approaches select the combinations heuristically.
11
Polynomial Kernel of degree d
Implicit form:
K_d(X_1, X_2) = (|X_1 ∩ X_2| + 1)^d,  d ∈ {0, 1, 2, …}

Explicit form:
K_d(X_1, X_2) = Σ_{r=0}^{d} c_d(r) · |P_r(X_1) ∩ P_r(X_2)|

where
c_d(r) = Σ_{l=r}^{d} C(d, l) · Σ_{m=0}^{r} (-1)^{r-m} · C(r, m) · m^l

P_r(X) is the set of all subsets of X with exactly r elements
c_d(r) is the prior weight (subset weight) given to the subsets of size r
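The subset weight c_d(r) is easy to check numerically. A minimal sketch (not the authors' code) that evaluates the formula above with `math.comb` for the binomial coefficients:

```python
from math import comb

def c(d, r):
    """Subset weight c_d(r) = sum_{l=r}^{d} C(d,l) * sum_{m=0}^{r}
    (-1)^(r-m) * C(r,m) * m^l, as defined on this slide."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l
                         for m in range(r + 1))
        for l in range(r, d + 1)
    )

print([c(3, r) for r in range(4)])  # [1, 7, 12, 6], matching the next slide
```

The cubic-kernel values 1, 7, 12, 6 reproduced here are exactly the c_3(r) used in the worked example that follows.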
12
Example (Cubic Kernel, d = 3)

X_1 = {a, b, c, d}, X_2 = {a, b, d, e}

Implicit form:
K_3(X_1, X_2) = (|X_1 ∩ X_2| + 1)^3 = (3 + 1)^3 = 64

Explicit form:
c_3(0) = 1,  P_0(X_1) ∩ P_0(X_2) = {∅}
c_3(1) = 7,  P_1(X_1) ∩ P_1(X_2) = {{a}, {b}, {d}}
c_3(2) = 12, P_2(X_1) ∩ P_2(X_2) = {{a,b}, {a,d}, {b,d}}
c_3(3) = 6,  P_3(X_1) ∩ P_3(X_2) = {{a,b,d}}

K_3(X_1, X_2) = 1·1 + 7·3 + 12·3 + 6·1 = 64

Subsets of size up to 3 are used as the new features
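Both forms of the cubic kernel can be verified against each other on this slide's sets. A sketch, using the observation that the size-r subsets shared by X_1 and X_2 are exactly the size-r subsets of X_1 ∩ X_2, so |P_r(X_1) ∩ P_r(X_2)| = C(|X_1 ∩ X_2|, r):

```python
from math import comb

def c(d, r):
    # subset weight c_d(r) as defined on the previous slide
    return sum(comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l
                                for m in range(r + 1))
               for l in range(r, d + 1))

def implicit(x1, x2, d):
    """K_d(X1, X2) = (|X1 ∩ X2| + 1)^d."""
    return (len(x1 & x2) + 1) ** d

def explicit(x1, x2, d):
    """Sum over r of c_d(r) * |P_r(X1) ∩ P_r(X2)|."""
    n = len(x1 & x2)
    return sum(c(d, r) * comb(n, r) for r in range(d + 1))

x1, x2 = {"a", "b", "c", "d"}, {"a", "b", "d", "e"}
print(implicit(x1, x2, 3), explicit(x1, x2, 3))  # 64 64
```

The explicit sum 1·1 + 7·3 + 12·3 + 6·1 = 64 agrees with the implicit (3+1)^3.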
13
Outline
Polynomial Kernel of degree d
Fast Methods for Polynomial Kernels: PKI, PKE
Experiments
Conclusions and Future Work
14
Toy Example
Feature set: F = {a, b, c, d, e}

Examples (#SVs L = 3):
  X_1 = {a, b, c}, α_1 = 1
  X_2 = {a, b, d}, α_2 = 0.5
  X_3 = {b, c, d}, α_3 = -2

Test example: X = {a, c, e}
Kernel: K_3(X_j, X) = (|X_j ∩ X| + 1)^3
15
PKB (Baseline)
X_1 = {a, b, c}, α_1 = 1
X_2 = {a, b, d}, α_2 = 0.5
X_3 = {b, c, d}, α_3 = -2

K(X_j, X) = (|X_j ∩ X| + 1)^3
Test example: X = {a, c, e}

f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 27 + 4 - 16 = 15

Complexity is always O(L·|X|)
16
PKI (Inverted Representation)
X_1 = {a, b, c}, α_1 = 1
X_2 = {a, b, d}, α_2 = 0.5
X_3 = {b, c, d}, α_3 = -2

K(X_j, X) = (|X_j ∩ X| + 1)^3

Inverted index:
  a → {1, 2}
  b → {1, 2, 3}
  c → {1, 3}
  d → {2, 3}

Test example: X = {a, c, e}
f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15

Average complexity is O(B·|X| + L), where B is the average size of the inverted-index lists
Efficient if the feature space is sparse; suitable for many NL tasks
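The inverted representation on this slide can be sketched as follows. This is an illustration, not the authors' implementation: the trick is that an SV with zero overlap still contributes α_j·(0+1)^d = α_j, so we add Σα_j once and then only correct the SVs the index actually touches.

```python
from collections import defaultdict

# Sketch of PKI: an inverted index maps each feature to the support
# vectors (SVs) containing it, so only SVs sharing at least one
# feature with X are examined per test example.

def build_index(svs):
    index = defaultdict(list)
    for j, sv in enumerate(svs):
        for feature in sv:
            index[feature].append(j)
    return index

def classify_pki(alphas, index, x, d=3):
    overlap = defaultdict(int)            # |X_j ∩ X| for touched SVs only
    for feature in x:
        for j in index.get(feature, []):
            overlap[j] += 1
    # every SV contributes at least alpha_j * (0+1)^d = alpha_j;
    # add the remainder for SVs with nonzero overlap
    base = sum(alphas)
    extra = sum(alphas[j] * ((n + 1) ** d - 1) for j, n in overlap.items())
    return base + extra

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
index = build_index(svs)
print(classify_pki(alphas, index, {"a", "c", "e"}))  # 15.0, as on the slide
```

The inner loops run once per (feature of X, SV containing it) pair, which matches the stated O(B·|X| + L) average cost.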
17
PKE (Expanded Representation)
f(X) = Σ_{i=1}^{L} α_i K(X_i, X)
     = Σ_{i=1}^{L} α_i φ(X_i)·φ(X)
     = w·φ(X),  where w = Σ_{i=1}^{L} α_i φ(X_i)

Convert into linear form by calculating the vector w in advance; φ(X) projects X into its subsets space
18
PKE (Expanded Representation)
K(X, X') = (|X ∩ X'| + 1)^3
c_3(0) = 1, c_3(1) = 7, c_3(2) = 12, c_3(3) = 6

X_1 = {a, b, c}, α_1 = 1
X_2 = {a, b, d}, α_2 = 0.5
X_3 = {b, c, d}, α_3 = -2

W (Expansion Table):
  ∅: -0.5   {a}: 10.5   {b}: -3.5   {c}: -7   {d}: -10.5
  {a,b}: 18   {a,c}: 12   {a,d}: 6   {b,c}: -12   {b,d}: -18   {c,d}: -24
  {a,b,c}: 6   {a,b,d}: 3   {a,c,d}: 0   {b,c,d}: -12

e.g., w({b,d}) = 12·(0.5 - 2) = -18

Test example: X = {a, c, e}
Subsets of X: {∅, {a}, {c}, {e}, {a,c}, {a,e}, {c,e}, {a,c,e}}
f(X) = -0.5 + 10.5 - 7 + 12 = 15

Complexity is O(|X|^d), independent of the number of SVs (L)
Efficient if the number of SVs is large
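The expanded representation on this slide can be sketched for d = 3. This is illustrative code, not the authors' implementation; the c_3 values are the subset weights from the earlier slide.

```python
from itertools import combinations

# Sketch of PKE (d = 3): precompute the expansion table w over all
# subsets (up to size 3) of each SV, then classify by summing the
# weights of the test example's subsets.

C3 = {0: 1, 1: 7, 2: 12, 3: 6}            # subset weights c_3(r)

def subsets(x, d=3):
    """All subsets of x with at most d elements, as sorted tuples."""
    for r in range(d + 1):
        yield from combinations(sorted(x), r)

def expansion_table(svs, alphas):
    """w(s) = c_3(|s|) * sum of alpha_i over SVs containing s."""
    w = {}
    for sv, a in zip(svs, alphas):
        for s in subsets(sv):
            w[s] = w.get(s, 0.0) + C3[len(s)] * a
    return w

def classify_pke(w, x):
    """f(X) = sum of w(s) over all subsets s of X (up to size 3)."""
    return sum(w.get(s, 0.0) for s in subsets(x))

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
w = expansion_table(svs, alphas)
print(w[("b", "d")])                      # -18.0, as computed on the slide
print(classify_pke(w, {"a", "c", "e"}))   # 15.0
```

Classification never touches the support vectors, only the table: that is the O(|X|^d) cost, independent of L.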
19
PKE in Practice
It is hard to calculate the Expansion Table exactly, so we use an approximated Expansion Table: subsets with smaller |w| can be removed, since |w| represents the contribution to the final classification. A subset mining (a.k.a. basket mining) algorithm is used for efficient calculation.
20
Subset Mining Problem

Transaction Database:
  id  set
  1   {a, c, d}
  2   {a, b, c}
  3   {a, b, d}
  4   {b, c, e}

Results (σ = 2):
  {a}:3  {b}:3  {c}:3  {d}:2  {a,b}:2  {b,c}:2  {a,c}:2  {a,d}:2

Extract all subsets that occur in no fewer than σ sets of the transaction database
With no size constraints the problem is NP-hard, but efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
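An Apriori-flavored miner for this problem can be sketched in a few lines. This is a simplified illustration, not PrefixSpan or a production Apriori implementation: candidates are grown level by level, and only frequent subsets are extended (the Apriori pruning idea).

```python
# Minimal frequent-subset miner: keep subsets that occur in at least
# min_support transactions, growing candidates one size at a time.

def frequent_subsets(transactions, min_support, max_size):
    freq = {}
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    size = 1
    while current and size <= max_size:
        survivors = []
        for cand in current:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                freq[cand] = support
                survivors.append(cand)
        # grow the next level only from frequent subsets (simplified join)
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == size + 1}
        size += 1
    return freq

# the transaction database from this slide, sigma = 2
db = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
freq = frequent_subsets(db, min_support=2, max_size=2)
for s, n in sorted((tuple(sorted(s)), n) for s, n in freq.items()):
    print(s, n)
```

On this database the miner returns exactly the eight subsets listed on the slide; {b,d}, {c,d}, and {e} fall below σ = 2 and are pruned.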
21
Feature Selection as Mining
• Can efficiently build the approximated table
• σ controls the rate of approximation

X_1 = {a, b, c}, α_1 = 1
X_2 = {a, b, d}, α_2 = 0.5
X_3 = {b, c, d}, α_3 = -2

Exhaustive generation and testing → impractical!
Full table W over all subsets s with weights w:
  ∅: -0.5   {a}: 10.5   {b}: -3.5   {c}: -7   {d}: -10.5
  {a,b}: 18   {a,c}: 12   {a,d}: 6   {b,c}: -12   {b,d}: -18   {c,d}: -24
  {a,b,c}: 6   {a,b,d}: 3   {a,c,d}: 0   {b,c,d}: -12

Direct generation with subset mining (σ = 10):
  {a}: 10.5   {d}: -10.5   {a,b}: 18   {a,c}: 12
  {b,c}: -12   {b,d}: -18   {c,d}: -24   {b,c,d}: -12
22
Outline
Polynomial Kernel of degree d
Fast Methods for Polynomial Kernels: PKI, PKE
Experiments
Conclusions and Future Work
23
Experimental Settings
Three NL tasks:
  English Base-NP Chunking (EBC)
  Japanese Word Segmentation (JWS)
  Japanese Dependency Parsing (JDP)

Kernel settings:
  Quadratic kernel for EBC
  Cubic kernel for JWS and JDP
24
Results (English Base-NP Chunking)
Method          Time (sec./sent.)  Speedup ratio  F-score
PKB             .164               1.0            93.84
PKI             .020               8.3            93.84
PKE (σ=.01)     .0016              105.2          93.79
PKE (σ=.005)    .0016              101.3          93.85
PKE (σ=.001)    .0017              97.7           93.84
PKE (σ=.0005)   .0017              96.8           93.84
25
Results (Japanese Word Segmentation)
Method          Time (sec./sent.)  Speedup ratio  Accuracy (%)
PKB             .85                1.0            97.94
PKI             .49                1.7            97.94
PKE (σ=.01)     .0024              358.2          97.93
PKE (σ=.005)    .0028              300.1          97.95
PKE (σ=.001)    .0034              242.6          97.94
PKE (σ=.0005)   .0035              238.8          97.94
26
Results (Japanese Dependency Parsing)
Method          Time (sec./sent.)  Speedup ratio  Accuracy (%)
PKB             .285               1.0            89.29
PKI             .0226              12.6           89.29
PKE (σ=.01)     .0042              66.8           88.91
PKE (σ=.005)    .0060              47.8           89.05
PKE (σ=.001)    .0086              33.3           89.26
PKE (σ=.0005)   .0090              31.8           89.29
27
Results
PKI yields a 2-12 fold speedup and PKE a 30-300 fold speedup. Accuracy is preserved when an appropriate σ is chosen.
28
Comparison with related work
XQK [Isozaki et al. 02]: same concept as PKE, but designed only for the quadratic kernel; exhaustively creates the expansion table.

PKE: designed for general polynomial kernels; uses subset mining algorithms to create the expansion table.
29
Conclusions
We proposed two fast methods for the polynomial kernel of degree d: PKI (inverted) and PKE (expanded).
PKI achieves a 2-12 fold speedup and PKE a 30-300 fold speedup, while preserving accuracy.
30
Future Work
Examine the effectiveness on general machine learning datasets
Apply PKE to other convolution kernels, e.g., the Tree Kernel [Collins 00]: a dot product between trees whose feature space is all sub-trees; apply a sub-tree mining algorithm [Zaki 02]
31
English Base-NP Chunking
Extract non-overlapping noun phrases from text:
[NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

BIO representation (seen as a tagging task): B = beginning of chunk, I = non-initial chunk, O = outside
Pair-wise method for the 3-class problem
Training: wsj15-18, test: wsj20 (standard set)
32
Japanese Word Segmentation
Sentence:   太 郎 は 花 子 に 本 を 読 ま せ た  (Taro made Hanako read a book)
Boundaries: ↑ between each pair of adjacent characters

X_i = {c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}, c_{i+3}}
Y_i = +1 if there is a boundary between c_i and c_{i+1}, otherwise Y_i = -1

The relative position of each character is distinguished
The character types of Japanese are also used
Training: KUC 01-08, Test: KUC 09
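The boundary examples above can be sketched as follows. The (offset, character) encoding is an illustrative assumption, not the paper's exact feature format; it simply makes the relative position of each character part of the feature, as the slide requires.

```python
# Sketch of the word-segmentation examples: for the boundary between
# chars[i] and chars[i+1], X_i collects the surrounding characters
# c_{i-2} .. c_{i+3}, each tagged with its position relative to i.

def boundary_example(chars, i):
    """Feature set for the boundary between chars[i] and chars[i+1]."""
    x = set()
    for offset in range(-2, 4):           # c_{i-2} ... c_{i+3}
        j = i + offset
        if 0 <= j < len(chars):
            x.add((offset, chars[j]))     # same char, different position
    return x

sentence = "太郎は花子に本を読ませた"
x = boundary_example(sentence, 2)         # the boundary after "は"
print(sorted(x))
```

Each boundary thus becomes a set-represented example, exactly the input format the polynomial kernel over sets expects.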
33
Japanese Dependency Parsing
私は ケーキを 食べる  (I eat a cake)
I-top cake-acc. eat

Identify the correct dependency relations between two bunsetsu (base phrases)
Linguistic features related to the modifier and head (word, POS, POS-subcat, inflections, punctuation, etc.)
Binary classification (+1 dependent, -1 independent)
Cascaded Chunking Model [Kudo et al. 02]
Training: KUC 01-08, Test: KUC 09
34
Kernel Methods (1/2)
Suppose a learning task g: X → {+1, -1}, with training examples T = {X_1, …, X_L}

g(X) = sgn(f(X)),  f(X) = Σ_{i=1}^{L} α_i φ(X_i)·φ(X)

X: the example to be classified
X_i: training examples
α_i: weight for example X_i
φ: a function mapping examples to another vectorial space
35
PKE (Expanded Representation)
f(X) = Σ_{i=1}^{L} α_i Σ_{r=0}^{d} c_d(r) · |P_r(X_i) ∩ P_r(X)|

If we calculate in advance, for all subsets s ∈ Γ_d(F) = ∪_{r=0}^{d} P_r(F),

w(s) = c_d(|s|) · Σ_{i=1}^{L} α_i · I(s ∈ P_{|s|}(X_i))

(I is the indicator function), then

f(X) = Σ_{s ∈ Γ_d(X)} w(s),  where Γ_d(X) = ∪_{r=0}^{d} P_r(X)
36
TRIE representation
Approximated table (σ = 10):
  {a}: 10.5   {d}: -10.5   {a,b}: 18   {a,c}: 12
  {b,c}: -12   {b,d}: -18   {c,d}: -24   {b,c,d}: -12

[TRIE diagram: from the root, edges labeled a, b, c, d lead to nodes storing the weights above; shared prefixes are merged, e.g., root → a branches to the terminal nodes for {a}, {a,b}, and {a,c}]

Compress redundant structures
Classification can be done by simply traversing the TRIE