北海道大学 Hokkaido University
Lecture on Information Knowledge Network, 2011/11/29
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA


Page 1:

Lecture on Information Knowledge Network
"Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science,
Graduate School of Information Science and Technology, Hokkaido University

Takuya KIDA

Page 2:

The 7th: Development of a new compression method for pattern matching ~ Improving VF codes ~

– Which comp. methods are suitable?
– VF code
– STVF code
– Improvement with allowing incomplete internal nodes
– Improvement with iterative learning
– Conclusion

Page 3:

Which comp. methods are suitable?

Key features for fast CPM:
– having clear code boundaries: end-tagged codes, byte-oriented codes, or fixed-length codes
– using a static and compact dictionary: methods like Huffman coding
– achieving high compression ratios, to reduce the amount of data to be processed

VF codes (variable-length-to-fixed-length codes):
– Tunstall code (Tunstall 1967)
– AIVF code (Yamamoto & Yokoo 2001)
– STVF code (Klein & Shapira 2008, Kida 2009)

Page 4:

VF code and the others

VV codes are the mainstream in terms of compression ratio. It is difficult for VF codes to achieve high compression ratios since the codewords are of fixed length.
– There has been no practical use of existing VF codes!

Classification by the lengths of input blocks (source symbols) and output codewords:

| Input \ Output | Fixed length | Variable length |
| Fixed length | FF code | FV code (e.g., Huffman code) |
| Variable length | VF code (e.g., Tunstall code) | VV code (e.g., the LZ family) |

Page 5:

VF coding using a parse tree

Each leaf of the parse tree corresponds to a string and is numbered; the number represents a codeword.

Repeat the following:
1. Read symbols from the input text one by one, traversing the parse tree.
2. Parse off a block when the traversal reaches a leaf.
3. Encode the parsed block as the number of the leaf.

[Figure: parse tree T over {a, b, c} with leaves numbered 0-8]

Ex) Input text: abbaabbaaacacc → Coded text: 3 5 1 5 0 6 8
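The parse-and-number procedure above can be sketched with a plain dictionary standing in for the tree. The codebook below is hypothetical (it is not the tree in the figure); a real VF coder would derive it from a parse tree.

```python
def vf_encode(text, codebook):
    """Greedy longest-match parsing: repeatedly take the longest prefix of the
    remaining text that is in the codebook and emit its (fixed-length) number."""
    codes = []
    i = 0
    max_len = max(len(s) for s in codebook)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            block = text[i:i + length]
            if block in codebook:
                codes.append(codebook[block])
                i += length
                break
        else:
            raise ValueError(f"no codebook entry matches at position {i}")
    return codes

def vf_decode(codes, codebook):
    """Decoding is one table lookup per codeword, which is what makes
    VF codes attractive for compressed pattern matching."""
    inverse = {number: block for block, number in codebook.items()}
    return "".join(inverse[c] for c in codes)

# Hypothetical codebook covering every single symbol plus two digrams.
codebook = {"a": 0, "b": 1, "c": 2, "ab": 3, "ba": 4}
codes = vf_encode("abbac", codebook)   # parsed as ab . ba . c -> [3, 4, 2]
assert vf_decode(codes, codebook) == "abbac"
```

Because every codeword has the same length, the coded sequence has clear code boundaries, matching the CPM requirement on Page 3.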

Page 6:

Tunstall code
B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, 1967.

It uses a complete k-ary tree for the source alphabet ∑ (|∑| = k). It is the optimal VF code for memoryless information sources, where the occurrence probability of each symbol is given and unchanged.

Each leaf number is represented by a fixed-length binary code; for codeword length ℓ, the number of internal nodes is m = ⌊(2^ℓ − 1)/(k − 1)⌋.

[Figure: Tunstall tree Tm for ∑ = {a, b, c}, k = 3, m = 3 internal nodes, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3. The seven leaves get the 3-bit codewords 000-110 (2³ = 8).]

Ex) Code the input S = abacaacbabcc with Tm:
S = ab・ac・aa・cb・ab・cc
Z = 001 010 000 101 001 110

Page 7:

How to construct the Tunstall tree

Construct the optimal tree Tm* that maximizes the average block length.
– Given an integer m ≥ 1 and the occurrence probabilities P(a) (a ∈ ∑), Tm* is constructed as follows:
1. Let T1* be the initial parse tree: the complete k-ary tree of depth 1.
2. Repeat the following steps for i = 2, …, m:
  A) Choose the leaf v = vi* whose probability is maximum among all the leaves of the current parse tree Ti−1*.
  B) Add T1* onto vi* to make Ti*.

Ex) Tunstall tree Tm* for ∑ = {a, b, c}, k = 3, m = 4, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3
[Figure: the root's children a, b, c (0.5, 0.2, 0.3) are expanded in probability order; e.g., expanding a yields leaves with probabilities 0.25, 0.1, 0.15.]
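The construction above can be sketched with a heap, so that each step expands the most probable leaf. The sorted-order numbering of leaves at the end is my own choice for illustration; the slide's codeword assignment may differ.

```python
import heapq

def tunstall_leaves(probs, m):
    """Build the Tunstall tree Tm* for a memoryless source by the procedure
    above: start from the complete k-ary tree of depth 1 (the root is the
    only internal node), then expand the most probable leaf m - 1 times.
    Returns the leaf strings, each of which gets a fixed-length codeword."""
    heap, counter = [], 0        # max-heap of leaves via negated probabilities
    for sym, p in probs.items():
        heap.append((-p, counter, sym))
        counter += 1
    heapq.heapify(heap)
    for _ in range(m - 1):
        neg_p, _, s = heapq.heappop(heap)      # most probable leaf
        for sym, p in probs.items():           # expand it with all k symbols
            heapq.heappush(heap, (neg_p * p, counter, s + sym))
            counter += 1
    return sorted(s for _, _, s in heap)

# The example from Page 6: P(a)=0.5, P(b)=0.2, P(c)=0.3, m = 3 internal nodes.
leaves = tunstall_leaves({"a": 0.5, "b": 0.2, "c": 0.3}, m=3)
# k + (m-1)(k-1) = 3 + 2*2 = 7 leaves: a and c are expanded, b stays a leaf.
print(leaves)   # ['aa', 'ab', 'ac', 'b', 'ca', 'cb', 'cc']
```

Each expansion turns one leaf into an internal node and adds k leaves, so the leaf count is k + (m−1)(k−1), consistent with m = ⌊(2^ℓ − 1)/(k − 1)⌋ filling the 2^ℓ codewords.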

Page 8:

Basic idea

Utilize the suffix tree of the input text as a parse tree.
– The suffix tree [Weiner, 1973] for T has the complete statistical information about T.

The suffix tree ST(T) is a dictionary tree for ANY substring of T. → We cannot use ST(T) as is, since it includes T itself!
– We have to prune the tree properly to make a parse tree.
– The pruning must be done so that nodes with high frequencies remain.

Note that we have to store the pruned suffix tree, since we need it when we decompress the encoded text.

Kida presented this at DCC 2009. Unfortunately, however, there was prior work very similar to mine:
– Klein, S.T. and Shapira, D., "Improved variable-to-fixed length codes," In: SPIRE 2008, pp. 39-50, 2008.

Page 9:

STVF codes: parse tree by pruned suffix tree

Suffix tree:
– A compacted dictionary trie for all the substrings of the input text.
– For an input text of length n, its size is O(n).
– Using suffix tree ST(T), any substring can be searched in time linear in the length of the substring.
– There are O(n) online construction algorithms: E. Ukkonen, "Constructing suffix trees on-line in linear time," Proc. of IFIP '92, pp. 484-492.

[Figure: suffix tree ST(S) for S = abbcabcabbcabd]

Page 10: (Same content as Page 9.)

Page 11:

STVF codes: parse tree by pruned suffix tree

The suffix tree ST(S) for S = abbcabcabbcabd (as on Page 9) is pruned into a parse tree.

[Figure: suffix tree ST(S) and the resulting parse tree PT(S), whose entries carry the 3-bit codewords 000-111]

Ex) S = abbcab・cab・bcab・d → Coded text: 000 011 110 100

Page 12:

Frequency-based pruning algorithm

1. Construct the suffix tree ST(T$).
2. Let T1' be the initial candidate tree: the pruned suffix tree ST_{k+1}(T$), which consists of the root of ST(T$) and its children.
3. Choose the node v whose frequency in ST(T$) is the highest among all the leaves of Ti' = ST_{Li}(T$), where Li is the total number of leaves in Ti'. Let Cv be the number of children of v.
4. If Li + Cv − 1 ≤ 2^ℓ holds, add all the children of v onto Ti' as new leaves to make the new candidate Ti+1'. If a child u of v is a leaf in ST(T$), however, chop off the label from v to u except for its first character.
5. Repeat Steps 3 and 4 while Ti' can be extended.
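As a rough illustration of the frequency-driven growth, the sketch below uses a substring counter in place of a real suffix tree (so there is no $ terminator and no edge-label chopping). The `max_len` cap and the `frozen` set are my own simplifications, not part of the algorithm above.

```python
from collections import Counter

def build_parse_dict(text, ell, max_len=8):
    """Grow a prefix-free set of leaf strings, always expanding the most
    frequent leaf, while the leaf count stays within the 2**ell available
    codewords (the condition Li + Cv - 1 <= 2**ell from Step 4)."""
    # Frequencies of all substrings up to max_len stand in for the
    # statistics a suffix tree would provide.
    freq = Counter(text[i:i + n]
                   for n in range(1, max_len + 1)
                   for i in range(len(text) - n + 1))
    alphabet = sorted(set(text))
    leaves = set(alphabet)      # initial tree: the root and its children
    frozen = set()              # leaves we decided not to expand
    limit = 2 ** ell
    while True:
        cand = [v for v in leaves - frozen if len(v) < max_len]
        if not cand:
            break
        v = max(cand, key=lambda s: freq[s])      # most frequent leaf
        children = [v + c for c in alphabet if freq[v + c] > 0]
        if children and len(leaves) - 1 + len(children) <= limit:
            leaves.remove(v)          # v becomes an internal node
            leaves.update(children)
        else:
            frozen.add(v)             # cannot (or should not) expand v
    return {s: i for i, s in enumerate(sorted(leaves))}

d = build_parse_dict("abracadabra" * 50, ell=4)
```

Every dictionary entry occurs in the text and the dictionary never exceeds 2^ℓ entries, so each entry fits an ℓ-bit codeword.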

Page 13:

Experimental results (comp. ratios)

| Methods | E.coli | bible.txt | world192.txt | dazai.utf.txt | 1000000.txt |
| Huffman code | 25.00% | 54.82% | 63.03% | 57.50% | 59.61% |
| Tunstall code (8) | 27.39% | 72.70% | 85.95% | 100.00% | 76.39% |
| Tunstall code (12) | 26.47% | 64.89% | 77.61% | 69.47% | 68.45% |
| Tunstall code (16) | 26.24% | 61.55% | 70.29% | 70.98% | 65.25% |
| STVF code (8) | 25.09% | 66.59% | 80.76% | 73.04% | 74.25% |
| STVF code (12) | 25.10% | 50.25% | 62.12% | 52.99% | 68.90% |
| STVF code (16) | 28.90% | 42.13% | 49.93% | 41.37% | 78.99% |

Figures in ( ) indicate the codeword length in bits.

| Text data | Size (bytes) | |∑| | Contents |
| E.coli | 4,638,690 | 4 | Complete genome of the E. coli bacterium |
| bible.txt | 4,047,392 | 63 | The King James version of the Bible |
| world192.txt | 2,473,400 | 94 | The CIA World Factbook |
| dazai.utf.txt | 7,268,943 | 141 | The complete works of Osamu Dazai (UTF-8 encoded) |
| 1000000.txt | 1,000,000 | 26 | A randomly generated string |

※ From the Canterbury Corpus (http://corpus.canterbury.ac.nz/) and J-TEXTS (http://www.j-texts.com/)

Page 14:

Comp. ratio to codeword length

[Figure: compression ratio (%) vs. codeword length (8-16 bits) for the Tunstall and STVF codes. The text used is bible.txt (the King James version of the Bible; 3.85MB).]

Page 15:

Improvement by combining with a range coder

Compare the compression ratios, compression times, and decompression times of:
– STVF coding
– Tunstall + range coder
– STVF coding + range coder

Data: English text (the King James Bible, 4MB, |Σ| = 63)
Environment: Intel Xeon 3.00GHz dual-core CPU, 12GB memory, Red Hat Enterprise Linux ES Release 4
Codeword length: ℓ = 8-16 bits

Page 16:

Results (codeword length vs. comp. ratio)

[Figure: compression ratio (%) vs. codeword length (8-16 bits) for Tunstall, Tunstall + range coder, STVF, and STVF + range coder.]

Page 17:

Results (codeword length vs. comp. time)

[Figure: compression time (sec) vs. codeword length (8-16 bits) for STVF, STVF + range coder, Tunstall, and Tunstall + range coder.]

Page 18:

Results (codeword length vs. decomp. time)

[Figure: decompression time (sec) vs. codeword length (8-16 bits) for Tunstall + range coder, Tunstall, STVF + range coder, and STVF.]

Page 19:

Take a breath

[Photo: Sendai Dai-Kannon at Daikanmitsu-ji, 2011.11.03]

Page 20:

Take a breath

Coming soon!

Page 21:

Problem of the original STVF code

– The bottleneck is the shape of the parse tree.
– Even if we choose a node whose frequency is high, not all of its children have high frequencies!

[Figure: a node with f = 1000 whose children have f = 500, 400, 3, 2, 1. In STVF coding, all the children are added: useless leaves!]

Page 22:

Improvement with allowing incomplete internal nodes

Adding nodes one by one in frequency order should be better!
– We allow a codeword to be assigned to an incomplete internal node.
– The coding procedure is modified so that it can encode instantaneously (like the AIVF code [Yamamoto & Yokoo 2001]): output a codeword when the traversal fails, then resume the traversal from the root with the character that caused the failure.

[Figure: under a node with f = 1000, only the children whose frequencies are high enough (f = 500, f = 300) are chosen.]
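The instantaneous coding described above might be sketched as follows. The trie layout (nested dicts) and the codeword values are hypothetical, and I assume every single symbol is a child of the root so that resuming always succeeds.

```python
def encode_instantaneous(text, root):
    """Traverse the parse trie; every stored node, complete or incomplete,
    carries a codeword. On a failed transition, emit the current node's
    codeword and resume from the root with the character that failed."""
    out = []
    node = root
    for ch in text:
        nxt = node["next"].get(ch)
        if nxt is None:
            out.append(node["code"])   # codeword of the deepest match so far
            nxt = root["next"][ch]     # resume with the failing character
        node = nxt
    out.append(node["code"])           # flush the last block
    return out

# Hypothetical parse trie storing "a" (code 0), "b" (1), and "ab" (2);
# "a" here is an incomplete internal node that still has a codeword.
trie = {"next": {
    "a": {"code": 0, "next": {"b": {"code": 2, "next": {}}}},
    "b": {"code": 1, "next": {}},
}}
print(encode_instantaneous("abab", trie))   # [2, 2] : parsed as ab . ab
```

Because the failing character itself restarts the traversal, no input character is consumed without contributing to the next block, which is what makes the code instantaneous.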

Page 23:

Difference between the original and the improved method

(T = BABCABABBABCBAC, k = 3)

[Figure: the parse tree of the original STVF code vs. the improved parse tree; in the improved tree, longer strings are likely to be added.]

In the one-by-one method, less frequent leaves are chopped off and longer strings tend to be assigned codewords.

Page 24:

Experiments

Text data: bible.txt (the Canterbury Corpus), 3.85MB
Method: compare with the original STVF code at a codeword length of 16 bits.
Result: the one-by-one method improves the compression ratio by 18.7% and the compressed pattern matching speed by 22.2%.

| Methods | Comp. time | Comp. ratio | Comp. PM time |
| original STVF | 6109ms | 42.1% | 7.27ms |
| One-by-one | 6593ms | 34.2% | 5.67ms |

On Intel Core 2 Duo T7700, 2GB RAM, Windows Vista, Visual C++ 2008

Page 25:

Improvement with iterative learning

Substrings that occur frequently in the input text do not necessarily appear frequently in the coded sequence of blocks!

Shall we construct the optimal parse tree?
⇒ We would have to choose the substrings that are actually used when we encode the text, and enter them in the dictionary.
⇒ But the boundaries of the parsed blocks vary with the dictionary.
⇒ Which comes first, the chicken or the egg? ⇒ hard (NP-complete)

How shall we brush up a parse tree?
– Encode the text iteratively and choose useful nodes to brush it up!

Page 26:

Idea of brushing up a parse tree

The idea is to encode the text iteratively and then swap nodes that turn out to be useless for nodes that are expected to be used.

Two criteria:
– Accept count: measures the contribution of a node by encoding the input text once. When a block p = T[i..j] is emitted, A(p) ← A(p) + 1.
– Failure count: estimates the expectation of a node that is not in the parse tree yet. When the parse stops at p just before character T[j+1], F(p・T[j+1]) ← F(p・T[j+1]) + 1.
– Swap a node whose accept count is low for a candidate whose failure count is high.
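One counting pass might look as follows. Greedy longest-match parsing stands in for the tree traversal, and the codebook is a hypothetical string-to-codeword dict; I assume every single symbol is an entry so the parse never gets stuck.

```python
from collections import Counter

def count_usage(text, codebook):
    """One encoding pass collecting the two criteria above: A(p) is
    incremented each time dictionary string p is emitted as a block, and
    F(p.T[j+1]) is incremented when the parse stopped at p because the
    one-character extension p.T[j+1] is absent from the dictionary."""
    A, F = Counter(), Counter()
    max_len = max(len(s) for s in codebook)
    i = 0
    while i < len(text):
        length = min(max_len, len(text) - i)
        while length > 1 and text[i:i + length] not in codebook:
            length -= 1
        p = text[i:i + length]       # longest match (single symbols assumed)
        A[p] += 1
        # Only count a failure when the extension was actually tried, i.e.
        # the stop was not forced by the codeword-length limit or text end.
        if length < max_len and i + length < len(text):
            F[p + text[i + length]] += 1
        i += length
    return A, F

A, F = count_usage("aba", {"a": 0, "b": 1, "aa": 2})
# blocks: a . b . a ; both extensions "ab" and "ba" failed once each
```

A brush-up iteration would then compare A of current leaves against F of candidate extensions and swap where the expectation exceeds the contribution.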

Page 27:

Experiments

| Texts | Size (bytes) | |Σ| | Contents |
| GBHTG119 | 81,173,787 | 4 | DNA sequences |
| DBLP2003 | 90,510,236 | 97 | XML data |
| Reuters-21578 | 18,805,335 | 103 | English texts |
| Mainichi1991 | 78,911,178 | 256 | Japanese texts (UTF-8) |

We compared with BPEX, ETDC, SCDC, gzip, and bzip2 on the above text data.
C++ compiled with GNU g++ 3.4, on an Intel Xeon 3GHz with 12GB of RAM, running Red Hat Enterprise Linux ES Release 4.

Page 28:

Compression ratios

[Figure: compression ratios (%) of Tunstall, STVF, Tunstall-100, STVF-100, BPEX, ETDC, SCDC, gzip, and bzip2 on GBHTG119, DBLP2003, Reuters-21578, and Mainichi1991.]

Page 29:

Compression times

[Figure: compression times (sec, log scale) of Tunstall, STVF, Tunstall-100, STVF-100, BPEX, ETDC, SCDC, gzip, and bzip2 on the four texts.]

Page 30:

Decompression times

[Figure: decompression times (sec) of the same nine methods on the four texts.]

Page 31:

The 7th summary

Key features for fast CPM:
– having clear code boundaries
– using a static and compact dictionary
– achieving high compression ratios

VF coding is a promising compression method!
– We developed a new VF code, named STVF code, which uses a pruned suffix tree as a parse tree.
– Improvement with allowing incomplete internal nodes
– Improvement with iterative learning
– VF codes reach the level of state-of-the-art methods like gzip and BPEX in compression ratio!

Future work:
– reach the level of bzip2!
– develop efficient pattern matching algorithms for the improved STVF codes (implementation of a BM-type algorithm on STVF codes)