北海道大学 Hokkaido University
Lecture on Information Knowledge Network, 2011/11/29
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science,
Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
The 7th: Development of a new compression method for pattern matching ~ Improving VF codes ~
– Which compression methods are suitable?
– VF code
– STVF code
– Improvement with allowing incomplete internal nodes
– Improvement with iterative learning
– Conclusion
Which comp. methods are suitable?
Key features for fast CPM:
– having clear code boundaries (end-tagged codes, byte-oriented codes, or fixed-length codes)
– using a static and compact dictionary (methods like Huffman coding)
– achieving high compression ratios, to reduce the amount of data to be processed
VF codes (Variable-length-to-Fixed-length codes)
– Tunstall code (Tunstall 1967)
– AIVF code (Yamamoto & Yokoo 2001)
– STVF code (Klein & Shapira 2008, Kida 2009)
VF code and the others
VV codes are the mainstream in terms of compression ratio.
It is difficult for VF codes to achieve high compression ratios, since the codewords have a fixed length.
– There had been no practical use of existing VF codes!
Classification by the lengths of source symbols and codewords:

Input text (source symbol) \ Compressed text (codeword) | Fixed length                  | Variable length
Fixed length                                            | FF code                       | FV code (e.g., Huffman code)
Variable length                                         | VF code (e.g., Tunstall code) | VV code (e.g., LZ family)
VF coding using a parse tree
Repeat the following:
1. Read symbols from the input text one by one, traversing the parse tree.
2. Parse off a block when the traversal reaches a leaf.
3. Encode the parsed block as the number of the leaf.
Each leaf of the parse tree corresponds to a string and is numbered; the leaf number serves as the codeword.
Parse tree T: [Figure: a ternary parse tree over {a, b, c} with leaves numbered 0–8]
Ex) Input text: abbaabbaaacacc → Coded text: 3 5 1 5 0 6 8
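The parsing loop above can be sketched in Python. The nested-dict tree and its leaf numbering are illustrative toys, not the lecture's actual data structures; internal nodes are dicts and leaves are integers (codeword numbers).

```python
# Minimal sketch of VF encoding with a parse tree. The tree maps each
# character to either a subtree (internal node, a dict) or a leaf number
# (an int, the codeword). We assume the text ends exactly at a block
# boundary; a trailing incomplete block is silently dropped in this sketch.

def vf_encode(text, tree):
    """Parse `text` with the tree and emit one leaf number per block."""
    codes = []
    node = tree
    for c in text:
        node = node[c]
        if isinstance(node, int):   # reached a leaf: emit its number
            codes.append(node)
            node = tree             # restart the traversal from the root
    return codes

# Toy parse tree with leaves for "aa", "ab", "ac", "b", "c"
toy = {"a": {"a": 0, "b": 1, "c": 2}, "b": 3, "c": 4}
print(vf_encode("abacb", toy))  # [1, 2, 3]
```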
Tunstall code: B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, 1967.
It uses a complete k-ary tree for the source alphabet Σ (|Σ| = k).
It is the optimal VF code for memoryless information sources, where the occurrence probability of each symbol is given and unchanged.
Ex) Tunstall tree Tm: Σ = {a, b, c}, k = 3, #internal nodes m = 3, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3.
[Figure: Tunstall tree; the root and the nodes a and c are expanded, giving 7 leaves with probabilities 0.25, 0.1, 0.15 (under a), 0.2 (b), and 0.15, 0.06, 0.09 (under c), assigned codewords 000–110]
Each leaf number is represented by a fixed-length binary code of length 3 (2³ = 8 ≥ 7).
For codeword length ℓ, m = ⌊(2^ℓ − 1)/(k − 1)⌋.
Ex) Encode the input S = abacaacbabcc with Tm:
S = ab・ac・aa・cb・ab・cc
Z = 001 010 000 101 001 110
How to construct the Tunstall tree
Construct the optimal tree Tm* that maximizes the average block length.
– Given an integer m ≥ 1 and the occurrence probabilities P(a) (a ∈ Σ), Tm* is constructed as follows:
1. Let T1* be the initial parse tree: the complete k-ary tree of height 1.
2. Repeat the following steps for i = 2, …, m:
A) Choose the leaf v = vi* whose probability is the maximum among all the leaves of the current parse tree Ti−1*.
B) Add T1* onto vi* to make Ti*.
Ex) Tunstall tree Tm*: Σ = {a, b, c}, k = 3, m = 4, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3.
[Figure: the root, then the nodes a (0.5), c (0.3), and aa (0.25) are expanded in turn, giving leaf probabilities 0.25, 0.1, 0.15 / 0.15, 0.06, 0.09 / 0.125, 0.05, 0.075]
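The construction above can be sketched with a max-heap of leaves keyed by probability. Representing tree nodes by their parse strings is an illustrative simplification, not the lecture's implementation.

```python
import heapq

# Sketch of Tunstall-tree construction for a memoryless source: starting
# from the complete k-ary tree of height 1, expand the most probable leaf
# m - 1 more times. Python's heapq is a min-heap, so probabilities are
# negated to pop the maximum first.

def tunstall_leaves(probs, m):
    """Return the parse strings (leaves) of Tm* after m expansions."""
    heap = [(-p, s) for s, p in probs.items()]   # expansion 1: the root
    heapq.heapify(heap)
    for _ in range(m - 1):
        p, s = heapq.heappop(heap)               # most probable leaf
        for c, pc in probs.items():              # replace it with its k children
            heapq.heappush(heap, (p * pc, s + c))
    return sorted(s for _, s in heap)

# P(a) = 0.5, P(b) = 0.2, P(c) = 0.3, m = 3 internal nodes
print(tunstall_leaves({"a": 0.5, "b": 0.2, "c": 0.3}, 3))
# ['aa', 'ab', 'ac', 'b', 'ca', 'cb', 'cc']
```

With m = 3 and k = 3 this yields m(k − 1) + 1 = 7 leaves, matching the example tree above (a and c expanded).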
Basic idea
Utilize the suffix tree of the input text as a parse tree.
– The suffix tree [Weiner, 1973] for T has complete statistical information about T.
The suffix tree ST(T) is a dictionary tree for ANY substring of T → we cannot use ST(T) as-is, since it includes T itself!
– We have to prune the tree properly to make a parse tree.
– The pruning must be done so that the remaining nodes have high frequencies.
Note that we have to store the pruned suffix tree, since we need it to decompress the encoded text.
Kida presented this at DCC 2009. Unfortunately, however, there was prior work that is very similar to mine:
– Klein, S. T. and Shapira, D., "Improved variable-to-fixed length codes," in: SPIRE 2008, pp. 39-50, 2008.
Suffix tree
– A compacted dictionary trie for all the substrings of the input text.
– For an input text of length n, its size is O(n).
– Using the suffix tree ST(T), any substring can be searched in time linear in the length of the substring.
– There are O(n) online construction algorithms: E. Ukkonen, "Constructing suffix trees on-line in linear time," Proc. of IFIP'92, pp. 484-492.
STVF codes: a parse tree from a pruned suffix tree
[Figure: suffix tree ST(S) for S = abbcabcabbcabd]
Parse tree PT(S)
[Figure: the parse tree obtained by pruning ST(S); its leaves are assigned the 3-bit codewords 000–111]
Ex) S = abbcab・cab・bcab・d → Z = 000 011 110 100
Frequency-based pruning algorithm
1. Construct the suffix tree ST(T$).
2. Let T1′ be the initial candidate tree: the pruned suffix tree ST_{k+1}(T$) consisting of the root of ST(T$) and its children.
3. Choose the leaf v of Ti′ = ST_{Li}(T$) whose frequency in ST(T$) is the highest. Let Li be the total number of leaves in Ti′, and let Cv be the number of children of v.
4. If Li + Cv − 1 ≤ 2^ℓ holds, add all the children of v onto Ti′ as new leaves, making the new candidate T_{i+1}′. If a child u of v is a leaf in ST(T$), however, chop off the label from v to u except its first character.
5. Repeat Steps 3 and 4 while Ti′ can be extended.
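Steps 3–5 can be sketched as a heap-driven loop. The `Node` class and field names below are hypothetical stand-ins for a real suffix-tree node (built elsewhere, with edge-label chopping omitted), not the lecture's actual code.

```python
import heapq
import itertools

# Sketch of the frequency-based pruning loop: keep a max-heap of current
# leaves keyed by frequency, and expand the most frequent leaf whenever
# the leaf count stays within the codeword budget 2^l.

class Node:
    def __init__(self, freq, children=None):
        self.freq = freq
        self.children = children or {}

def prune(root, codeword_len):
    """Return the leaf set of the pruned candidate tree."""
    budget = 2 ** codeword_len
    tiebreak = itertools.count()          # makes heap entries comparable
    leaves = set(root.children.values())  # initial tree: root + children
    heap = [(-v.freq, next(tiebreak), v) for v in leaves]
    heapq.heapify(heap)
    while heap:
        _, _, v = heapq.heappop(heap)
        cv = len(v.children)
        # expanding v removes one leaf (v itself) and adds cv new leaves
        if cv == 0 or len(leaves) + cv - 1 > budget:
            continue
        leaves.discard(v)
        for u in v.children.values():
            leaves.add(u)
            heapq.heappush(heap, (-u.freq, next(tiebreak), u))
    return leaves
```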
Experimental results (comp. ratios)
Methods            | E.coli | bible.txt | world192.txt | dazai.utf.txt | 1000000.txt
Huffman code       | 25.00% | 54.82%    | 63.03%       | 57.50%        | 59.61%
Tunstall code (8)  | 27.39% | 72.70%    | 85.95%       | 100.00%       | 76.39%
Tunstall code (12) | 26.47% | 64.89%    | 77.61%       | 69.47%        | 68.45%
Tunstall code (16) | 26.24% | 61.55%    | 70.29%       | 70.98%        | 65.25%
STVF code (8)      | 25.09% | 66.59%    | 80.76%       | 73.04%        | 74.25%
STVF code (12)     | 25.10% | 50.25%    | 62.12%       | 52.99%        | 68.90%
STVF code (16)     | 28.90% | 42.13%    | 49.93%       | 41.37%        | 78.99%

Figures in () indicate the codeword length (bits).

Text data    | Size (byte) | |Σ| | Contents
E.coli       | 4638690     | 4   | Complete genome of the E. coli bacterium
bible.txt    | 4047392     | 63  | The King James version of the Bible
world192.txt | 2473400     | 94  | The CIA world fact book
dazai.utf.txt| 7268943     | 141 | The complete works of Osamu Dazai (UTF-8 encoded)
1000000.txt  | 1000000     | 26  | A randomly generated string

※ From the Canterbury Corpus (http://corpus.canterbury.ac.nz/) and J-TEXTS (http://www.j-texts.com/)
Comp. ratio to codeword length
[Chart: compression ratio (%) vs. codeword length (8–16 bits) for the Tunstall and STVF codes]
The text used is bible.txt (the King James version of the Bible; 3.85MB).
Improvement by combining with a range coder
Compression methods:
– STVF coding
– Tunstall + range coder
– STVF coding + range coder
Data:
– English text (the King James Bible, 4MB, |Σ| = 63)
Environment:
– CPU: Intel® Xeon® 3.00GHz dual core; Memory: 12GB; OS: Red Hat Enterprise Linux ES Release 4
Codeword length:
– ℓ = 8–16 bits
We compare the compression ratios, compression times, and decompression times of each method.
Results (codeword length-comp. ratio)
[Chart: compression ratio (%) vs. codeword length (8–16 bits) for Tunstall, Tunstall + RangeCoder, STVF, and STVF + RangeCoder]
Results (codeword length-comp. time)
[Chart: compression time (sec., 0–9) vs. codeword length (8–16 bits) for STVF, STVF + RangeCoder, Tunstall, and Tunstall + RangeCoder]
Results (codeword length-decomp. time)
[Chart: decompression time (sec., 0–0.6) vs. codeword length (8–16 bits) for Tunstall + RangeCoder, Tunstall, STVF + RangeCoder, and STVF]
Take a breath
[Photo: 2011.11.03 Sendai Dai-Kannon @ Daikanmitsu-ji]
Take a breath
[Photo: Coming soon!]
Problem of the original STVF code
– The bottleneck is the shape of the parse tree.
– Even if we choose a node whose frequency is high, not all of its children have high frequencies!
[Figure: a node with f = 1000 whose children have f = 500, 400, 3, 2, 1. In STVF coding, all the children are added, producing useless leaves!]
Improvement with allowing incomplete internal nodes
Adding nodes one by one in frequency order should do better!
– We allow a codeword to be assigned to an incomplete internal node.
– The coding procedure is modified so that it encodes instantaneously (like the AIVF code [Yamamoto & Yokoo 2001]): output a codeword when the traversal fails, then resume the traversal from the root with the character that caused the failure.
[Figure: under the node with f = 1000, only the children whose frequencies are high enough (f = 500, f = 300) are chosen]
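The modified coding procedure can be sketched as follows. The transition table and codeword map are hypothetical toy structures, and we assume every character has a transition from the root (every single character is in the dictionary).

```python
# Sketch of encoding with incomplete internal nodes: every node, not only
# the leaves, may carry a codeword. When the traversal fails on a
# character, we immediately output the current node's codeword and resume
# from the root with the failing character.

def encode_incomplete(text, children, code):
    """children: node -> {char: child node}; code: node -> codeword number."""
    out = []
    node = "root"
    for c in text:
        nxt = children.get(node, {}).get(c)
        if nxt is None:                 # traversal fails: emit and resume
            out.append(code[node])
            node = children["root"][c]  # restart with the failing character
        else:
            node = nxt
    out.append(code[node])              # flush the final block
    return out

# Toy dictionary over {A, B, C} with the extra string "BA"
children = {"root": {"A": "A", "B": "B", "C": "C"}, "B": {"A": "BA"}}
code = {"A": 0, "B": 1, "C": 2, "BA": 3}
print(encode_incomplete("BABC", children, code))  # [3, 1, 2]
```

Note that "B" here acts as an incomplete internal node: it has a child ("BA") yet still owns the codeword 1, which is emitted when the traversal from it fails.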
Difference between the original and the improved method
Parse tree of the original STVF code vs. the improved parse tree (T = BABCABABBABCBAC, k = 3).
[Figure: two ternary parse trees over {A, B, C} with codewords 0–7; the improved tree keeps deeper nodes]
In the one-by-one method, less frequent leaves are chopped off, so longer strings tend to be added to the tree and assigned codewords.
Experiments
Text data to be used:
– bible.txt (The Canterbury Corpus), 3.85MB
Method:
– Compare with the original STVF code at a codeword length of 16 bits.
Result:
– The one-by-one method improves the compression ratio by 18.7% and the compressed pattern matching speed by 22.2%.

Methods       | Comp. time | Comp. ratio | Comp. PM time
Original STVF | 6109 ms    | 42.1%       | 7.27 ms
One-by-one    | 6593 ms    | 34.2%       | 5.67 ms

On Intel Core 2 Duo T7700, 2GB RAM, Windows Vista, Visual C++ 2008.
Improvement with iterative learning
Substrings that occur frequently in the input text do not necessarily appear frequently in the coded sequence of blocks!
Shall we construct the optimal parse tree?
⇒ We would have to choose the substrings that are actually used when encoding the text, and enter them in the dictionary.
⇒ But the boundaries of the parsed blocks vary with the dictionary.
⇒ Which comes first, the chicken or the egg? It seems as hard as NP-complete problems.
How shall we brush up a parse tree?
– Encode the text iteratively and choose useful nodes to brush it up!
Idea of brushing-up a parse tree
The idea is to encode the text iteratively, and then swap nodes that turn out to be useless for nodes that are expected to be useful.
Two criteria:
– Accept count: measures the contribution of a node, computed by encoding the input text once. When a block p = T[i..j] is parsed, A(p) ← A(p) + 1.
– Failure count: estimates the expected usefulness of a node that is not in the parse tree yet. When the traversal for p = T[i..j] fails on the next character T[j+1], F(p·T[j+1]) ← F(p·T[j+1]) + 1.
– Swap when the contribution (accept count) of a node in the tree falls below the expectation (failure count) of a candidate outside it.
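One learning pass over the text can be sketched as follows. A prefix-closed string set stands in for the parse tree, and `learning_pass` with its structures is an illustrative assumption, not the lecture's code; we assume every single character is in the dictionary.

```python
from collections import defaultdict

# Sketch of one iteration of the learning step: encode the text once with
# greedy longest-match parsing, recording an accept count A(p) whenever a
# block p is parsed off and a failure count F(p.c) whenever the traversal
# from p fails on the next character c. After the pass, low-A dictionary
# strings can be swapped for high-F outside strings.

def learning_pass(text, dictionary):
    A = defaultdict(int)   # accept counts for strings in the dictionary
    F = defaultdict(int)   # failure counts for strings just outside it
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and text[i:j + 1] in dictionary:
            j += 1                   # extend the block while it stays in
        p = text[i:j]                # longest match p = T[i..j-1]
        A[p] += 1
        if j < len(text):
            F[text[i:j + 1]] += 1    # p . T[j] just failed
        i = j                        # resume at the failing character
    return A, F
```

For example, `learning_pass("abab", {"a", "b", "c", "ab"})` parses the text as ab・ab, giving A("ab") = 2 and F("aba") = 1, which suggests "aba" as a candidate for entering the dictionary.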
Experiments
Texts         | Size (byte) | |Σ| | Contents
GBHTG119      | 81,173,787  | 4   | DNA sequences
DBLP2003      | 90,510,236  | 97  | XML data
Reuters-21578 | 18,805,335  | 103 | English texts
Mainichi1991  | 78,911,178  | 256 | Japanese texts (UTF-8)

We compared with BPEX, ETDC, SCDC, gzip, and bzip2 on the above text data.
Environment: C++ compiled by GNU g++ 3.4; Intel Xeon® 3GHz with 12GB of RAM, running Red Hat Enterprise Linux ES Release 4.
Compression ratios
[Chart: compression ratios (%) of Tunstall, STVF, Tunstall-100, STVF-100, BPEX, ETDC, SCDC, gzip, and bzip2 on GBHTG119, DBLP2003, Reuters-21578, and Mainichi1991]
Compression times
[Chart: compression times (sec., logarithmic scale 0.1–10000) of the same nine methods on the four texts]
Decompression times
[Chart: decompression times (sec., 0–6) of the same nine methods on the four texts]
Summary of the 7th lecture
Key features for fast CPM:
– having clear code boundaries
– using a static and compact dictionary
– achieving high compression ratios
VF coding is a promising compression method!
– We developed a new VF code, named STVF code, which uses a pruned suffix tree as a parse tree.
– Improvement by allowing incomplete internal nodes.
– Improvement by iterative learning.
– VF codes now reach the level of state-of-the-art methods like gzip and BPEX in compression ratio!
Future work:
– reach the level of bzip2!
– develop efficient pattern matching algorithms for the improved STVF codes (implementation of a BM-type algorithm on STVF codes).