Fast Methods for Kernel-based Text Analysis

Taku Kudo (工藤 拓), Yuji Matsumoto (松本 裕治), NAIST (Nara Institute of Science and Technology). 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.


Page 1: Fast Methods for  Kernel-based Text Analysis

Fast Methods for Kernel-based Text Analysis

Taku Kudo (工藤 拓), Yuji Matsumoto (松本 裕治)
NAIST (Nara Institute of Science and Technology)

41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

Page 2: Fast Methods for  Kernel-based Text Analysis

Background

• Kernel methods (e.g., SVMs) have become popular
  • Prior knowledge can be incorporated independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product)
  • High accuracy

Page 3: Fast Methods for  Kernel-based Text Analysis

Problem

• Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because they are inefficient at test time
• Some kernel-based parsers take 2-3 seconds per sentence

Page 4: Fast Methods for  Kernel-based Text Analysis

Goals

• Build fast but still accurate kernel-based text analyzers
• Make them usable in a wider range of NL applications

Page 5: Fast Methods for  Kernel-based Text Analysis

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE
• Experiments
• Conclusions and Future Work

Page 6: Fast Methods for  Kernel-based Text Analysis

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE
• Experiments
• Conclusions and Future Work

Page 7: Fast Methods for  Kernel-based Text Analysis

Kernel Methods

Training data: $T = \{X_1, X_2, \ldots, X_L\}$

$$f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X)$$

• No need to represent an example as an explicit feature vector
• Complexity of testing is O(L · |X|)

Page 8: Fast Methods for  Kernel-based Text Analysis

Kernels for Sets (1/3)

• Focus on the special case where examples are represented as sets
• Instances in NLP are usually represented as sets (e.g., bag-of-words)

Feature set: $F = \{i_1, i_2, \ldots, i_N\}$

Training data: $T = \{X_1, X_2, \ldots, X_L\}$, with $X_j \subseteq F$

Page 9: Fast Methods for  Kernel-based Text Analysis

Kernels for Sets (2/3)

$X_1 = \{a, b, c, d\}$, $X_2 = \{a, b, d, e\}$

Simple definition:

$$K(X_1, X_2) = |X_1 \cap X_2| = |\{a, b, d\}| = 3$$

Combinations (subsets) of features:

• 2nd order: $\{\{a, b\}, \{a, d\}, \{b, d\}\}$
• 3rd order: $\{\{a, b, d\}\}$
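As a small illustration, the shared features and their 2nd- and 3rd-order combinations can be enumerated directly. This is only a sketch in Python; the helper name `subsets_of_size` is an assumption made for this example and does not come from the slides.

```python
from itertools import combinations


def subsets_of_size(items, r):
    """Return all subsets of `items` with exactly r elements (hypothetical helper)."""
    return [set(combo) for combo in combinations(sorted(items), r)]


X1 = {"a", "b", "c", "d"}
X2 = {"a", "b", "d", "e"}

common = X1 & X2                            # features shared by both examples
simple_kernel = len(common)                 # K(X1, X2) = |X1 ∩ X2|
second_order = subsets_of_size(common, 2)   # pairs of shared features
third_order = subsets_of_size(common, 3)    # triples of shared features

print(simple_kernel, second_order, third_order)
```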

Page 10: Fast Methods for  Kernel-based Text Analysis

Kernels for Sets (3/3)

Example: "I ate a cake" (PRP VBD DT NN), with head "ate" and modifier "cake". Dependent (+1) or independent (-1)?

Basic features:

X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN}

After heuristic selection of combinations:

X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, …}

• Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks
• Previous approaches select combinations heuristically

Page 11: Fast Methods for  Kernel-based Text Analysis

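Polynomial Kernel of degree d

Implicit form:

$$K_d(X_1, X_2) = (|X_1 \cap X_2| + 1)^d, \quad d \in \{0, 1, 2, \ldots\}$$

Explicit form:

$$K_d(X_1, X_2) = \sum_{r=0}^{d} c_d(r) \cdot |P_r(X_1 \cap X_2)|$$

$$c_d(r) = \sum_{l=r}^{d} \binom{d}{l} \left( \sum_{m=0}^{r} (-1)^{r-m} \cdot m^l \binom{r}{m} \right)$$

• $P_r(X)$ is the set of all subsets of $X$ with exactly $r$ elements
• $c_d(r)$ is the prior weight for subsets of size $r$ (subset weight)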
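The subset weight c_d(r) can be evaluated directly from its double sum. The following Python sketch does so; the function name `subset_weight` is an assumption for illustration, and the code relies on Python treating `0 ** 0` as 1 for integers.

```python
from math import comb


def subset_weight(d, r):
    """c_d(r): prior weight of size-r subsets under the degree-d polynomial kernel."""
    return sum(
        comb(d, l)
        * sum((-1) ** (r - m) * m ** l * comb(r, m) for m in range(r + 1))
        for l in range(r, d + 1)
    )


cubic_weights = [subset_weight(3, r) for r in range(4)]
quadratic_weights = [subset_weight(2, r) for r in range(3)]
print(cubic_weights)      # [1, 7, 12, 6]
print(quadratic_weights)  # [1, 3, 2]
```

For d = 3 this gives the weights 1, 7, 12 and 6 that are used in the cubic-kernel slides.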

Page 12: Fast Methods for  Kernel-based Text Analysis

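Example (Cubic Kernel, d = 3)

$X_1 = \{a, b, c, d\}$, $X_2 = \{a, b, d, e\}$

Implicit form:

$$K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3 = (3 + 1)^3 = 64$$

Explicit form:

• $c_3(0) = 1$, $P_0(X_1 \cap X_2) = \{\emptyset\}$
• $c_3(1) = 7$, $P_1(X_1 \cap X_2) = \{\{a\}, \{b\}, \{d\}\}$
• $c_3(2) = 12$, $P_2(X_1 \cap X_2) = \{\{a, b\}, \{a, d\}, \{b, d\}\}$
• $c_3(3) = 6$, $P_3(X_1 \cap X_2) = \{\{a, b, d\}\}$

$$K_3(X_1, X_2) = 1 \cdot 1 + 7 \cdot 3 + 12 \cdot 3 + 6 \cdot 1 = 64$$

Subsets of up to 3 elements are used as new features.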
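The agreement between the implicit and explicit forms can be checked numerically. In this Python sketch the weights 1, 7, 12, 6 are taken from the slide, |P_r(S)| is computed as the binomial coefficient C(|S|, r), and the function names are assumptions for illustration.

```python
from math import comb

C3 = [1, 7, 12, 6]  # c_3(0) .. c_3(3), as given on the slide


def implicit_kernel(x1, x2, d=3):
    """(|X1 ∩ X2| + 1)^d"""
    return (len(x1 & x2) + 1) ** d


def explicit_kernel(x1, x2, weights=C3):
    """Sum over r of c_d(r) * |P_r(X1 ∩ X2)|, with |P_r(S)| = C(|S|, r)."""
    n = len(x1 & x2)
    return sum(c * comb(n, r) for r, c in enumerate(weights))


X1 = {"a", "b", "c", "d"}
X2 = {"a", "b", "d", "e"}
print(implicit_kernel(X1, X2), explicit_kernel(X1, X2))  # 64 64
```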

Page 13: Fast Methods for  Kernel-based Text Analysis

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE
• Experiments
• Conclusions and Future Work

Page 14: Fast Methods for  Kernel-based Text Analysis

Toy Example

Feature set: F = {a, b, c, d, e}

Examples (#SVs L = 3):

j    α_j    X_j
1    1      {a, b, c}
2    0.5    {a, b, d}
3    -2     {b, c, d}

Kernel: $K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3$

Test example: X = {a, c, e}

Page 15: Fast Methods for  Kernel-based Text Analysis

PKB (Baseline)

K(X, X') = (|X ∩ X'| + 1)^3

j    α_j    X_j
1    1      {a, b, c}
2    0.5    {a, b, d}
3    -2     {b, c, d}

Test example: X = {a, c, e}

K(X_j, X) is evaluated for every support vector X_j:

f(X) = 1 · (2+1)^3 + 0.5 · (1+1)^3 - 2 · (1+1)^3 = 15

Complexity is always O(L · |X|)
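The baseline computation on the toy example can be sketched in a few lines of Python. The names `poly_kernel` and `pkb_classify` are assumptions for this illustration; the data are the three support vectors and weights from the slide.

```python
SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def poly_kernel(x1, x2, d=3):
    """Polynomial kernel over sets: (|X1 ∩ X2| + 1)^d."""
    return (len(x1 & x2) + 1) ** d


def pkb_classify(x, support_vectors, d=3):
    """Baseline: evaluate the kernel against every support vector."""
    return sum(alpha * poly_kernel(xj, x, d) for xj, alpha in support_vectors)


score = pkb_classify({"a", "c", "e"}, SUPPORT_VECTORS)
print(score)  # 15.0
```

Every call touches all L support vectors, which is where the O(L · |X|) cost comes from.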

Page 16: Fast Methods for  Kernel-based Text Analysis

PKI (Inverted Representation)

K(X, X') = (|X ∩ X'| + 1)^3

j    α_j    X_j
1    1      {a, b, c}
2    0.5    {a, b, d}
3    -2     {b, c, d}

Inverted index (feature → ids of the support vectors that contain it):

a    {1, 2}
b    {1, 2, 3}
c    {1, 3}
d    {2, 3}

Test example: X = {a, c, e}

f(X) = 1 · (2+1)^3 + 0.5 · (1+1)^3 - 2 · (1+1)^3 = 15

• Average complexity is O(B · |X| + L), where B is the average size of an inverted-index entry
• Efficient if the feature space is sparse
• Suitable for many NL tasks
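A minimal Python sketch of the inverted-index idea on the toy example follows. The function names and the dictionary layout are assumptions for illustration; the slides do not show an implementation.

```python
from collections import defaultdict

SUPPORT_VECTORS = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}, 3: {"b", "c", "d"}}
ALPHAS = {1: 1.0, 2: 0.5, 3: -2.0}


def build_inverted_index(support_vectors):
    """Map each feature to the ids of the support vectors that contain it."""
    index = defaultdict(list)
    for j, features in support_vectors.items():
        for feature in sorted(features):
            index[feature].append(j)
    return index


def pki_classify(x, index, alphas, d=3):
    """Count overlaps through the index, then apply the polynomial kernel."""
    overlap = defaultdict(int)
    for feature in x:
        for j in index.get(feature, ()):   # only SVs sharing this feature
            overlap[j] += 1
    # Support vectors with no overlap still contribute alpha_j * (0 + 1)^d.
    return sum(alpha * (overlap.get(j, 0) + 1) ** d for j, alpha in alphas.items())


index = build_inverted_index(SUPPORT_VECTORS)
score = pki_classify({"a", "c", "e"}, index, ALPHAS)
print(dict(index), score)
```

The inner loop visits only the index entries of the features present in X, which is the B · |X| term; the final sum over all support vectors is the + L term.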

Page 17: Fast Methods for  Kernel-based Text Analysis

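PKE (Expanded Representation)

$$f(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = \left( \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \right) \cdot \phi(X) = \mathbf{w} \cdot \phi(X)$$

• Convert f(X) into linear form by calculating the vector $\mathbf{w} = \sum_{i=1}^{L} \alpha_i \, \phi(X_i)$
• $\phi(X)$ projects X into its subset space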

Page 18: Fast Methods for  Kernel-based Text Analysis

PKE (Expanded Representation)

K(X, X') = (|X ∩ X'| + 1)^3

c_3(0) = 1, c_3(1) = 7, c_3(2) = 12, c_3(3) = 6

j    α_j    X_j
1    1      {a, b, c}
2    0.5    {a, b, d}
3    -2     {b, c, d}

W (Expansion Table):

s          c_3(|s|)    w
∅          1           -0.5
{a}        7           10.5
{b}        7           -3.5
{c}        7           -7
{d}        7           -10.5
{a,b}      12          18
{a,c}      12          12
{a,d}      12          6
{b,c}      12          -12
{b,d}      12          -18
{c,d}      12          -24
{a,b,c}    6           6
{a,b,d}    6           3
{a,c,d}    6           0
{b,c,d}    6           -12

For example, w({b,d}) = 12 · (0.5 - 2) = -18

Test example: X = {a, c, e}

Subsets of X: {∅, {a}, {c}, {e}, {a,c}, {a,e}, {c,e}, {a,c,e}}

f(X) = -0.5 + 10.5 - 7 + 12 = 15

• Complexity is O(|X|^d), independent of the number of SVs (L)
• Efficient if the number of SVs is large
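The expansion table and the resulting linear classifier can be reproduced for the toy example with the Python sketch below. The function names are assumptions for illustration, and the table is built exhaustively, which is only feasible because the example is tiny. Subsets that occur in no support vector (such as {a,c,d}, whose weight is 0) are simply absent from the dictionary.

```python
from itertools import combinations

C3 = [1, 7, 12, 6]  # subset weights c_3(0) .. c_3(3)
SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def subsets_up_to(items, d):
    """Yield every subset of `items` with at most d elements."""
    for r in range(d + 1):
        for combo in combinations(sorted(items), r):
            yield frozenset(combo)


def build_expansion_table(support_vectors, weights):
    """w(s) = c_d(|s|) * sum of alpha_i over support vectors containing s."""
    d = len(weights) - 1
    table = {}
    for features, alpha in support_vectors:
        for s in subsets_up_to(features, d):
            table[s] = table.get(s, 0.0) + weights[len(s)] * alpha
    return table


def pke_classify(x, table, d=3):
    """f(X) = sum of w(s) over all subsets s of X with at most d elements."""
    return sum(table.get(s, 0.0) for s in subsets_up_to(x, d))


table = build_expansion_table(SUPPORT_VECTORS, C3)
score = pke_classify({"a", "c", "e"}, table)
print(table[frozenset("bd")], score)  # -18.0 15.0
```

Classification no longer iterates over support vectors; only the subsets of the test example are looked up.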

Page 19: Fast Methods for  Kernel-based Text Analysis

PKE in Practice

• The Expansion Table is hard to calculate exactly
• Use an approximated Expansion Table
  • Subsets with smaller |w| can be removed, since |w| represents the contribution to the final classification
  • Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation

Page 20: Fast Methods for  Kernel-based Text Analysis

Subset Mining Problem

Transaction database:

id    set
1     {a c d}
2     {a b c}
3     {a b d}
4     {b c e}

Results (minimum support 2):

{a}:3 {b}:3 {c}:3 {d}:2 {a b}:2 {b c}:2 {a c}:2 {a d}:2

• Extract all subsets that occur in no fewer than a given number of sets (the minimum support) of the transaction database
• With a minimum support of 1 and no size constraints, the problem is NP-hard
• Efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
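To make the mining step concrete, here is a small level-wise search in the spirit of Apriori, run on the transaction database above. It is a simplified Python sketch, not the algorithm used in the talk; the function name `frequent_subsets` and the `min_support` parameter are assumptions for illustration.

```python
from itertools import combinations


def frequent_subsets(transactions, min_support):
    """Return {subset: count} for subsets found in >= min_support transactions."""

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        counts = {c: support(c) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join step: merge frequent k-subsets into (k+1)-element candidates.
        candidates = {
            a | b for a, b in combinations(frequent, 2) if len(a | b) == len(a) + 1
        }
        # Prune step: every k-element subset of a candidate must be frequent.
        current = [
            c
            for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
        ]
    return result


database = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
result = frequent_subsets(database, min_support=2)
for subset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), count)
```

The pruning step relies on the fact that a subset cannot be frequent unless all of its own subsets are frequent, so most candidates are never counted.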

Page 21: Fast Methods for  Kernel-based Text Analysis

Feature Selection as Mining

j    α_j    X_j
1    1      {a, b, c}
2    0.5    {a, b, d}
3    -2     {b, c, d}

Exhaustive generation and testing (→ impractical!) yields the full table W:

s          w
∅          -0.5
{a}        10.5
{b}        -3.5
{c}        -7
{d}        -10.5
{a,b}      18
{a,c}      12
{a,d}      6
{b,c}      -12
{b,d}      -18
{c,d}      -24
{a,b,c}    6
{a,b,d}    3
{a,c,d}    0
{b,c,d}    -12

Direct generation with subset mining (σ = 10) yields the approximated table:

s          w
{a}        10.5
{d}        -10.5
{a,b}      18
{a,c}      12
{b,c}      -12
{b,d}      -18
{c,d}      -24
{b,c,d}    -12

• Can efficiently build the approximated table
• σ controls the rate of approximation
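The effect of σ on the toy example can be reproduced with the Python sketch below. Note the assumptions: the table is generated exhaustively and then filtered (the route the slide calls impractical, used here only because the example is tiny), and σ is read as an absolute threshold on |w|, which matches the toy numbers but may differ from the definition used in the experiments. All function names are hypothetical.

```python
from itertools import combinations

C3 = [1, 7, 12, 6]  # subset weights c_3(0) .. c_3(3)
SUPPORT_VECTORS = [("abc", 1.0), ("abd", 0.5), ("bcd", -2.0)]


def expansion_table(support_vectors, weights):
    """Exhaustively accumulate w(s) for every subset s of every support vector."""
    d = len(weights) - 1
    table = {}
    for features, alpha in support_vectors:
        for r in range(d + 1):
            for s in combinations(sorted(features), r):
                table[s] = table.get(s, 0.0) + weights[r] * alpha
    return table


def prune(table, sigma):
    """Keep only subsets whose weight satisfies |w| >= sigma (assumed reading of σ)."""
    return {s: w for s, w in table.items() if abs(w) >= sigma}


full = expansion_table(SUPPORT_VECTORS, C3)
approx = prune(full, sigma=10)
for s, w in approx.items():
    print(s, w)
```

With σ = 10, eight subsets survive; for example w({a,b}) = 12 · (1 + 0.5) = 18 is kept, while w({a,d}) = 6 is dropped.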

Page 22: Fast Methods for  Kernel-based Text Analysis

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE
• Experiments
• Conclusions and Future Work

Page 23: Fast Methods for  Kernel-based Text Analysis

Experimental Settings

• Three NL tasks
  • English Base-NP Chunking (EBC)
  • Japanese Word Segmentation (JWS)
  • Japanese Dependency Parsing (JDP)
• Kernel settings
  • Quadratic kernel is applied to EBC
  • Cubic kernel is applied to JWS and JDP

Page 24: Fast Methods for  Kernel-based Text Analysis

Results (English Base-NP Chunking)

Method           Time (sec./sent.)   Speedup ratio   F-score
PKB              .164                1.0             93.84
PKI              .020                8.3             93.84
PKE (σ=.01)      .0016               105.2           93.79
PKE (σ=.005)     .0016               101.3           93.85
PKE (σ=.001)     .0017               97.7            93.84
PKE (σ=.0005)    .0017               96.8            93.84

Page 25: Fast Methods for  Kernel-based Text Analysis

Results (Japanese Word Segmentation)

Method           Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .85                 1.0             97.94
PKI              .49                 1.7             97.94
PKE (σ=.01)      .0024               358.2           97.93
PKE (σ=.005)     .0028               300.1           97.95
PKE (σ=.001)     .0034               242.6           97.94
PKE (σ=.0005)    .0035               238.8           97.94

Page 26: Fast Methods for  Kernel-based Text Analysis

Results (Japanese Dependency Parsing)

Method           Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .285                1.0             89.29
PKI              .0226               12.6            89.29
PKE (σ=.01)      .0042               66.8            88.91
PKE (σ=.005)     .0060               47.8            89.05
PKE (σ=.001)     .0086               33.3            89.26
PKE (σ=.0005)    .0090               31.8            89.29

Page 27: Fast Methods for  Kernel-based Text Analysis

Results

• 2- to 12-fold speedup with PKI
• 30- to 300-fold speedup with PKE
• Accuracy is preserved when an appropriate σ is set

Page 28: Fast Methods for  Kernel-based Text Analysis

Comparison with related work

• XQK [Isozaki et al. 02]
  • Same concept as PKE
  • Designed only for the Quadratic Kernel
  • Exhaustively creates the expansion table
• PKE
  • Designed for general Polynomial Kernels
  • Uses subset mining algorithms to create the expansion table

Page 29: Fast Methods for  Kernel-based Text Analysis

Conclusions

• Proposed two fast methods for the polynomial kernel of degree d
  • PKI (Inverted)
  • PKE (Expanded)
• 2- to 12-fold speedup with PKI, 30- to 300-fold speedup with PKE
• Accuracy is preserved

Page 30: Fast Methods for  Kernel-based Text Analysis

Future Work

• Examine the effectiveness on general machine learning datasets
• Apply PKE to other convolution kernels
  • Tree Kernel [Collins 00]
    • Dot product between trees
    • Feature space is all sub-trees
    • Apply a sub-tree mining algorithm [Zaki 02]

Page 31: Fast Methods for  Kernel-based Text Analysis

English Base-NP Chunking

• Extract non-overlapping noun phrases from text

  [NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

• BIO representation (treating chunking as a tagging task)
  • B: beginning of a chunk
  • I: non-initial word of a chunk
  • O: outside any chunk
• Pair-wise method for the 3-class problem
• Training: wsj15-18, test: wsj20 (standard set)
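The BIO encoding of the example sentence can be derived mechanically from the bracketed form. The Python sketch below assumes that every chunk-initial word is tagged B and that tokens are whitespace-separated; the function name `bracketed_to_bio` is hypothetical.

```python
def bracketed_to_bio(text):
    """Convert '[NP ... ]' bracket notation into (token, tag) pairs."""
    pairs = []
    inside = False   # currently within an NP chunk
    first = False    # next token is the first word of the chunk
    for token in text.split():
        if token == "[NP":
            inside, first = True, True
        elif token == "]":
            inside = False
        elif not inside:
            pairs.append((token, "O"))
        elif first:
            pairs.append((token, "B"))
            first = False
        else:
            pairs.append((token, "I"))
    return pairs


sentence = (
    "[NP He ] reckons [NP the current account deficit ] will narrow to "
    "[NP only # 1.8 billion ] in [NP September ] ."
)
tagged = bracketed_to_bio(sentence)
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
```

Each token then becomes one classification instance with three possible labels, which is why a pair-wise method is needed to apply a binary classifier.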

Page 32: Fast Methods for  Kernel-based Text Analysis

Japanese Word Segmentation

Sentence: 太 郎 は 花 子 に 本 を 読 ま せ た ("Taro made Hanako read a book")

Boundaries: 太郎 | は | 花子 | に | 本 | を | 読ま | せ | た

$$X_i = \{c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}, c_{i+3}\}$$

If there is a boundary between $c_i$ and $c_{i+1}$, $Y_i = +1$; otherwise $Y_i = -1$

• Distinguish the relative position
• Also use the character types of Japanese
• Training: KUC 01-08, Test: KUC 09
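A possible way to build the feature sets and labels for this task is sketched below in Python. Several things here are assumptions for illustration: the window of characters c_{i-2} to c_{i+3} around position i, the `offset:character` encoding used to distinguish relative positions, the omission of character-type features, and the word segmentation of the example sentence.

```python
def window_features(chars, i):
    """Feature set X_i: characters c_{i-2} .. c_{i+3}, tagged with relative position."""
    return {
        f"{k:+d}:{chars[i + k]}"
        for k in range(-2, 4)
        if 0 <= i + k < len(chars)   # skip positions outside the sentence
    }


def boundary_labels(words):
    """Y_i = +1 if a word boundary follows character i, otherwise -1."""
    labels = []
    for word in words:
        labels.extend([-1] * (len(word) - 1) + [+1])
    return labels[:-1]   # no decision is made after the last character


# Assumed segmentation of the example sentence, for illustration only.
words = ["太郎", "は", "花子", "に", "本", "を", "読ま", "せ", "た"]
chars = "".join(words)

features = window_features(chars, 1)   # position of 郎
labels = boundary_labels(words)
print(sorted(features), labels)
```

Because each feature carries its offset, the same character at different distances from the candidate boundary yields different set elements.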

Page 33: Fast Methods for  Kernel-based Text Analysis

Japanese Dependency Parsing

Example: 私は ケーキを 食べる (I-top cake-acc. eat; "I eat a cake")

• Identify the correct dependency relations between two bunsetsu (roughly, base phrases in English)
• Linguistic features related to the modifier and head (word, POS, POS-subcat, inflections, punctuation, etc.)
• Binary classification (+1 dependent, -1 independent)
• Cascaded Chunking Model [Kudo et al. 02]
• Training: KUC 01-08, Test: KUC 09

Page 34: Fast Methods for  Kernel-based Text Analysis

Kernel Methods (1/2)

Suppose a learning task $g: \mathcal{X} \to \{+1, -1\}$ with training examples $T = \{X_1, \ldots, X_L\}$:

$$g(X) = \mathrm{sgn}(f(X))$$

$$f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X)$$

• $X$: example to be classified
• $X_i$: training examples
• $\alpha_i$: weight for examples
• $\phi$: a function to map examples to another vectorial space

Page 35: Fast Methods for  Kernel-based Text Analysis

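PKE (Expanded Representation)

$$f(X) = \sum_{i=1}^{L} \alpha_i \sum_{r=0}^{d} c_d(r) \cdot |P_r(X_i \cap X)|$$

If we calculate in advance ($I$ is the indicator function)

$$w(s) = \sum_{i=1}^{L} \alpha_i \, c_d(|s|) \, I(s \in P_{|s|}(X_i))$$

for all subsets $s \in \Gamma_d(F) = \bigcup_{r=0}^{d} P_r(F)$, then

$$f(X) = \sum_{s \in \Gamma_d(X)} w(s), \qquad \Gamma_d(X) = \bigcup_{r=0}^{d} P_r(X)$$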

Page 36: Fast Methods for  Kernel-based Text Analysis

TRIE representation

Approximated expansion table:

s          w
{a}        10.5
{d}        -10.5
{a,b}      18
{a,c}      12
{b,c}      -12
{b,d}      -18
{c,d}      -24
{b,c,d}    -12

TRIE (node weights in parentheses):

root
  a (10.5)
    b (18)
    c (12)
  b
    c (-12)
      d (-12)
    d (-18)
  c
    d (-24)
  d (-10.5)

• Compress redundant structures
• Classification can be done by simply traversing the TRIE
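A TRIE over sorted feature sequences can be sketched in Python as follows. The nested-dictionary node layout and the function names are assumptions for illustration; the weights are those of the approximated table for the toy example, with w({a,b}) = 12 · (1 + 0.5) = 18.

```python
TABLE = {
    ("a",): 10.5, ("d",): -10.5, ("a", "b"): 18.0, ("a", "c"): 12.0,
    ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0, ("b", "c", "d"): -12.0,
}


def build_trie(table):
    """Insert each subset as a path of sorted features; store w at the end node."""
    root = {"weight": 0.0, "children": {}}
    for subset, weight in table.items():
        node = root
        for feature in sorted(subset):
            node = node["children"].setdefault(
                feature, {"weight": 0.0, "children": {}}
            )
        node["weight"] = weight
    return root


def trie_classify(x, root):
    """Sum the weights of all stored subsets of x by walking the trie."""
    features = sorted(x)
    total = 0.0
    stack = [(root, 0)]
    while stack:
        node, start = stack.pop()
        total += node["weight"]
        # Only features later in sorted order can extend the current path.
        for k in range(start, len(features)):
            child = node["children"].get(features[k])
            if child is not None:
                stack.append((child, k + 1))
    return total


trie = build_trie(TABLE)
print(trie_classify({"a", "c", "e"}, trie))  # 22.5
```

For X = {a, c, e} the traversal reaches {a} and {a,c}, giving 10.5 + 12 = 22.5 under this approximated table, compared with 15 for the exact table.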
