wavelet tree wo tears
TRANSCRIPT
echizen_tm Mar. 24, 2012
Self-introduction (1 slide)
Succinct data structures (2 slides), full-text search (2 slides), FM-Index (13 slides)
Wavelet tree (12 slides), (1 slide), (1 slide)
Shellinford (2 slides), (1 slide)
ID: echizen_tm
EchizenBlog-Zwei (http://d.hatena.ne.jp/echizen_tm/)
web ()
(1/2) () LOUDS (Information Theoretical Lower Bound = ITLB) (O(1)O(logN)) ic
(2/2) (DSIRNLP#2)
LOUDS (DSIRNLP#1)
(DSIRNLP#3)
(1/2) (Full-Text Search Engine) ()
(Inverted Index) (Suffix Array)
(2/2) FM-Index
(Suffix Array) (4)
FM-Index (0.3)
FM-Index (1/13): the FM-Index was proposed by Ferragina and Manzini [Ferragina+ 2000] (Ferragina & Manzini Index)
It builds on the Burrows-Wheeler transform (BWT)
(self-index) ()
[Ferragina+ 2004]
FM-Index (2/13): building the suffix array (Suffix Array) of mississippi
(1) List every suffix (Suffix), written here as a rotation of mississippi#, with its start position:
 0  mississippi#
 1  ississippi#m
 2  ssissippi#mi
 3  sissippi#mis
 4  issippi#miss
 5  ssippi#missi
 6  sippi#missis
 7  ippi#mississ
 8  ppi#mississi
 9  pi#mississip
10  i#mississipp
11  #mississippi
The same rows in lexicographic order:
11  #mississippi
10  i#mississipp
 7  ippi#mississ
 4  issippi#miss
 1  ississippi#m
 0  mississippi#
 9  pi#mississip
 8  ppi#mississi
 6  sippi#missis
 3  sissippi#mis
 5  ssippi#missi
 2  ssissippi#mi
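FM-Index (3/13): building the suffix array (Suffix Array) of mississippi
(2) (3) Sorted table; the start positions read top to bottom (11 10 7 4 1 0 9 8 6 3 5 2) form the suffix array:
11  #mississippi
10  i#mississipp
 7  ippi#mississ
 4  issippi#miss
 1  ississippi#m
 0  mississippi#
 9  pi#mississip
 8  ppi#mississi
 6  sippi#missis
 3  sissippi#mis
 5  ssippi#missi
 2  ssissippi#mi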
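The construction above can be sketched in a few lines of C++. This is a naive sort of suffix start positions, not how a production index is built; the function name build_suffix_array is made up for this illustration, and the text is assumed to end with a unique terminator (#) that is smaller than every other character.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the start positions by the suffix they point to.
// Assumption: `text` ends with a unique terminator smaller than all other
// characters (here '#'), as in the slides. Worst case O(N^2 log N).
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    assert(!text.empty());
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});  // 0, 1, ..., N-1
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        // Lexicographic comparison of the suffixes starting at a and b.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}
```

Calling build_suffix_array("mississippi#") returns 11 10 7 4 1 0 9 8 6 3 5 2, the same order as the sorted table.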
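FM-Index (4/13): for a text of size N, the suffix array takes 4N
Text (N) + suffix array (4N) = 5N (5 times the text); in order notation, text (O(N)) + suffix array (O(N log N)) = O(N log N)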
FM-Index(5/13) FM-Index Burrows-Wheeler(BWT) BWT(N) BWT()
( o(N)) FM-Index N + o(N) (o(N)0.3 = 1.33)
FM-Index (6/13): Burrows-Wheeler transform (BWT)
sorted rotation   last character
#mississippi      i
i#mississipp      p
ippi#mississ      s
issippi#miss      s
ississippi#m      m
mississippi#      #
pi#mississip      p
ppi#mississi      i
sippi#missis      s
sissippi#mis      s
ssippi#missi      i
ssissippi#mi      i
BWT: the last column read top to bottom, ipssm#pissii
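The BWT of mississippi# shown above (ipssm#pissii) can be reproduced with a short sketch. It sorts all rotations explicitly, which is fine for a slide-sized example but not for large texts; the function name bwt is an assumption of this sketch, and the input is assumed to end with a unique smallest terminator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// BWT via sorted rotations: output the last character of each rotation,
// taken in sorted order. Assumption: `text` ends with a unique smallest
// terminator such as '#'.
std::string bwt(const std::string& text) {
    assert(!text.empty());
    const std::size_t n = text.size();
    const std::string doubled = text + text;  // rotation i is doubled[i, i + n)
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return doubled.compare(a, n, doubled, b, n) < 0;
    });
    std::string last;
    last.reserve(n);
    for (std::size_t start : order) {
        last.push_back(doubled[start + n - 1]);  // last character of the rotation
    }
    return last;
}
```

For the slide's input, bwt("mississippi#") yields ipssm#pissii.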
FM-Index (7/13): restoring the text from the BWT string T in O(N), with LF() at O(1) per step
 i  sorted rotation  T[i]
 0  #mississippi     i
 1  i#mississipp     p
 2  ippi#mississ     s
 3  issippi#miss     s
 4  ississippi#m     m
 5  mississippi#     #
 6  pi#mississip     p
 7  ppi#mississi     i
 8  sippi#missis     s
 9  sissippi#mis     s
10  ssippi#missi     i
11  ssissippi#mi     i
Following LF() from row 0 and reading T[i] at each step spells mississippi backwards:
LF(0) = 1     T[0] = i
LF(1) = 6     T[1] = p
LF(6) = 7     T[6] = p
LF(7) = 2     T[7] = i
LF(2) = 8     T[2] = s
LF(8) = 10    T[8] = s
LF(10) = 3    T[10] = i
LF(3) = 9     T[3] = s
LF(9) = 11    T[9] = s
LF(11) = 4    T[11] = i
LF(4) = 5     T[4] = m
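The LF walk on this slide can be written out as a small sketch. It assumes byte characters and a unique terminator (#) that sorts first, so row 0 is the rotation starting with the terminator; the name inverse_bwt and the table names are made up for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Restore the original text (without the terminator) from its BWT string t.
// Assumption: the terminator occurs exactly once and is the smallest character.
std::string inverse_bwt(const std::string& t, char terminator = '#') {
    assert(std::count(t.begin(), t.end(), terminator) == 1);
    // smaller[c] = number of characters in t that are smaller than c.
    std::array<std::size_t, 256> count{};
    for (unsigned char c : t) ++count[c];
    std::array<std::size_t, 256> smaller{};
    std::size_t sum = 0;
    for (std::size_t c = 0; c < 256; ++c) {
        smaller[c] = sum;
        sum += count[c];
    }
    // lf[i] = smaller[t[i]] + (occurrences of t[i] before position i).
    std::vector<std::size_t> lf(t.size());
    std::array<std::size_t, 256> seen{};
    for (std::size_t i = 0; i < t.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(t[i]);
        lf[i] = smaller[c] + seen[c]++;
    }
    // Walk from row 0; every step yields the preceding character of the text.
    std::string reversed;
    for (std::size_t i = 0; t[i] != terminator; i = lf[i]) {
        reversed.push_back(t[i]);
    }
    std::reverse(reversed.begin(), reversed.end());
    return reversed;
}
```

With T = ipssm#pissii the walk visits rows 0, 1, 6, 7, 2, 8, 10, 3, 9, 11, 4 and returns mississippi.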
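FM-Index (8/13): computing LF()
LF(i) = (number of characters in T smaller than T[i]) + (number of occurrences of T[i] in T before position i)
Example with T = ipssm#pissii: T[9] = s, so LF(9) = (characters smaller than s: # 1 + i 4 + m 1 + p 2 = 8) + (s before position 9: T[2], T[3], T[8] = 3) = 8 + 3 = 11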
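As a check of the LF(9) arithmetic, the two terms of LF can be counted directly by scanning T. This is only a definitional sketch (O(N) per call); the function name lf is an assumption, and characters are compared as unsigned bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// LF(i) straight from the definition, by scanning T once.
// Not an index structure: each call costs O(N).
std::size_t lf(const std::string& t, std::size_t i) {
    assert(i < t.size());
    const unsigned char target = static_cast<unsigned char>(t[i]);
    std::size_t smaller = 0;  // characters in T smaller than T[i]
    std::size_t before = 0;   // occurrences of T[i] in T[0..i)
    for (std::size_t j = 0; j < t.size(); ++j) {
        const unsigned char c = static_cast<unsigned char>(t[j]);
        if (c < target) ++smaller;
        if (j < i && c == target) ++before;
    }
    return smaller + before;
}
```

For T = ipssm#pissii, lf(T, 9) gives 8 + 3 = 11.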
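FM-Index (9/13): computing LF()
LF(i) = (number of characters in T smaller than T[i]) + (number of occurrences of T[i] in T before position i)
The first term fits in a table of 256 entries (one per byte value); tabulating the second term for every position would take 256 x N entries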
FM-Index(10/13) TiT[i]
FM-Index(11/13) DSIRNLP#2
(O(1), O(log N)) rank(i) = number of 1s among the first i bits; select(i) = position of the i-th 1
rank()
ic
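A minimal bit-vector sketch of rank and select follows. It is not a succinct structure and does not reach the bounds quoted on the slide: rank uses one stored count per 8-bit block plus a short scan, and select is a binary search over rank. The class name BitVector, the method names, and the block size are all assumptions of this sketch; positions returned by select1 are 0-based.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class BitVector {
public:
    // Build from a string of '0' and '1' characters.
    explicit BitVector(const std::string& bits) {
        for (char b : bits) {
            assert(b == '0' || b == '1');
            if (bits_.size() % kBlock == 0) block_rank_.push_back(ones_);
            bits_.push_back(b == '1');
            if (b == '1') ++ones_;
        }
    }

    // Number of 1s among the first i bits (positions 0 .. i-1).
    std::size_t rank1(std::size_t i) const {
        assert(i <= bits_.size());
        const std::size_t block = i / kBlock;
        if (block == block_rank_.size()) return ones_;  // i is the full length
        std::size_t count = block_rank_[block];
        for (std::size_t j = block * kBlock; j < i; ++j) {
            if (bits_[j]) ++count;
        }
        return count;
    }

    // Number of 0s among the first i bits.
    std::size_t rank0(std::size_t i) const { return i - rank1(i); }

    // 0-based position of the k-th 1 (k starts at 1).
    std::size_t select1(std::size_t k) const {
        assert(k >= 1 && k <= ones_);
        std::size_t lo = 1, hi = bits_.size();  // smallest n with rank1(n) >= k
        while (lo < hi) {
            const std::size_t mid = lo + (hi - lo) / 2;
            if (rank1(mid) >= k) hi = mid; else lo = mid + 1;
        }
        return lo - 1;
    }

private:
    static constexpr std::size_t kBlock = 8;
    std::vector<bool> bits_;
    std::vector<std::size_t> block_rank_;  // 1s before each block start
    std::size_t ones_ = 0;
};
```

For the bits 00110011, rank1(5) is 2, rank0(5) is 3, and select1(3) is 6.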
FM-Index(12/13) LOUDS
BP
DFUDS
FM-Index(13/13) /
(4) BWT(FM-Index) BWT ic
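(1/12) Wavelet tree (Wavelet Tree): for length N, O(N) + o(N) space; O(1), O(log N) time; rank(i, c) = number of occurrences of c among the first i characters; select(i, c) = position of the i-th c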
FM-Indexrankrank
(2/12) 012
Alphabet: a, b, c, d (4 characters)
(3/12) Example string: abcdabdc
rank(5, a) = 2
The first 5 characters of abcdabdc contain 2 a's
(4/12) Computing rank(5, a) on abcdabdc
Split the 4 characters into 2 groups: abcdabdc => abab (a, b) and cddc (c, d)
(5/12) abab => 0101, cddc => 0110
Each group is now a bit vector, so bit-vector rank applies
rank(5, a) on abcdabdc becomes rank(i, a) on abab
i is also found with rank
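(6/12) The first 5 characters of abcdabdc contain 2 a's and 1 b, so 3 characters from {a, b}
rank(5, a) counts the a's among the first 5 characters
Those 5 characters contain 3 of a or b, so the answer is rank(3, a) on abab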
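(7/12) Counting the a's and b's (the abab part) among the first 5 characters of abcdabdc
Mark characters that go to abab with 0 and characters that go to cddc with 1: abcdabdc => 00110011
rank(5, 0) = 3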
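The split into abab and cddc, together with the three bit strings, can be sketched as one small helper. The struct Split and the function split_by are hypothetical names; the pivot character stands in for "first half versus second half of the alphabet".

```cpp
#include <cassert>
#include <string>

struct Split {
    std::string bits;  // '0' or '1' per input character
    std::string zero;  // characters marked 0, original order kept
    std::string one;   // characters marked 1, original order kept
};

// Characters smaller than `pivot` are marked 0, all others 1.
Split split_by(const std::string& s, char pivot) {
    Split out;
    for (char ch : s) {
        if (ch < pivot) {
            out.bits.push_back('0');
            out.zero.push_back(ch);
        } else {
            out.bits.push_back('1');
            out.one.push_back(ch);
        }
    }
    assert(out.bits.size() == s.size());  // one bit per character
    return out;
}
```

split_by("abcdabdc", 'c') gives bits 00110011 with the parts abab and cddc; splitting abab at b gives 0101, and splitting cddc at d gives 0110.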
(8/12) rank(5, a) on abcdabdc
= rank(3, a) on abab
= rank(3, 0) on 0101; rank(3, 0) = 2
(9/12) rank(5, a) on abcdabdc: split into abab and cddc with a, b => 0 and c, d => 1; on 00110011, rank(5, 0) = 3
On abab, a => 0 and b => 1; on 0101, rank(3, 0) = 2
(10/12) bv = abcdabdc
x = 00110011 (abcdabdc), y[0] = 0101 (abab), y[1] = 0110 (cddc)
a = {0, 0}, b = {0, 1}, c = {1, 0}, d = {1, 1}
bv.rank(5, a) = y[a[0]].rank(x.rank(5, a[0]), a[1])
              = y[0].rank(x.rank(5, 0), 0)
              = y[0].rank(3, 0)
              = 2
(11/12) bv = abcdabdc
x = 00110011 (abcdabdc), y[0] = 0101 (abab), y[1] = 0110 (cddc)
a = {0, 0}, b = {0, 1}, c = {1, 0}, d = {1, 1}
bv.rank(6, c) = y[c[0]].rank(x.rank(6, c[0]), c[1])
              = y[1].rank(x.rank(6, 1), 0)
              = y[1].rank(2, 0)
              = 1
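The formula bv.rank(i, c) = y[c[0]].rank(x.rank(i, c[0]), c[1]) can be tried out with a two-level sketch restricted to the alphabet a, b, c, d. A linear scan over a '0'/'1' string stands in for a real bit vector here; the names bit_rank and TwoLevelWaveletTree are assumptions of this sketch, while x and y follow the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Count `bit` among the first i entries of a '0'/'1' string.
// A linear scan stands in for a succinct bit vector.
std::size_t bit_rank(const std::string& bits, std::size_t i, char bit) {
    assert(i <= bits.size());
    std::size_t n = 0;
    for (std::size_t j = 0; j < i; ++j) {
        if (bits[j] == bit) ++n;
    }
    return n;
}

struct TwoLevelWaveletTree {
    std::string x;                 // top-level bits, one per character
    std::array<std::string, 2> y;  // second-level bits for each top-level side

    // Codes: a = {0, 0}, b = {0, 1}, c = {1, 0}, d = {1, 1}.
    explicit TwoLevelWaveletTree(const std::string& s) {
        for (char ch : s) {
            assert(ch >= 'a' && ch <= 'd');
            const int code = ch - 'a';
            const int hi = code >> 1;
            const int lo = code & 1;
            x.push_back(static_cast<char>('0' + hi));
            y[hi].push_back(static_cast<char>('0' + lo));
        }
    }

    // Number of occurrences of ch among the first i characters.
    std::size_t rank(std::size_t i, char ch) const {
        assert(ch >= 'a' && ch <= 'd');
        const int code = ch - 'a';
        const int hi = code >> 1;
        const int lo = code & 1;
        const std::size_t j = bit_rank(x, i, static_cast<char>('0' + hi));  // x.rank(i, c[0])
        return bit_rank(y[hi], j, static_cast<char>('0' + lo));             // y[c[0]].rank(j, c[1])
    }
};
```

Built from abcdabdc, the struct holds x = 00110011, y[0] = 0101, y[1] = 0110, and returns rank(5, 'a') = 2 and rank(6, 'c') = 1 as on the two slides.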
(12/12) 4
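rank(i, c) costs one bit-vector rank per bit of the code of c: 4 characters => 2, 256 characters => 8
With 1 character = 1 byte, one rank takes 8 bit-vector ranks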
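The 256 => 8 case can be sketched as a pointer-based tree with one level per bit of a byte. This is an illustration only: each node keeps a plain prefix-sum array instead of a succinct bit vector, so it does not reach the space bound discussed earlier, and the class name WaveletTree and its internals are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class WaveletTree {
    struct Node {
        std::vector<std::size_t> ones;  // ones[k] = 1 bits among the first k characters
        std::unique_ptr<Node> child[2];
        std::size_t rank(std::size_t i, bool b) const { return b ? ones[i] : i - ones[i]; }
    };

    // Build the node that splits `s` on the given bit (7 = most significant).
    static std::unique_ptr<Node> build(const std::string& s, int bit) {
        if (s.empty()) return nullptr;
        std::unique_ptr<Node> node(new Node);
        node->ones.assign(1, 0);
        std::string part[2];
        for (char ch : s) {
            const bool b = ((static_cast<unsigned char>(ch) >> bit) & 1) != 0;
            node->ones.push_back(node->ones.back() + (b ? 1 : 0));
            part[b ? 1 : 0].push_back(ch);
        }
        if (bit > 0) {
            node->child[0] = build(part[0], bit - 1);
            node->child[1] = build(part[1], bit - 1);
        }
        return node;
    }

    std::size_t size_;
    std::unique_ptr<Node> root_;

public:
    explicit WaveletTree(const std::string& s) : size_(s.size()), root_(build(s, 7)) {}

    // Occurrences of c among the first i characters: at most 8 bit-level ranks.
    std::size_t rank(std::size_t i, char c) const {
        assert(i <= size_);
        const unsigned char code = static_cast<unsigned char>(c);
        const Node* node = root_.get();
        for (int bit = 7; bit >= 0 && node != nullptr; --bit) {
            const bool b = ((code >> bit) & 1) != 0;
            i = node->rank(i, b);
            node = node->child[b ? 1 : 0].get();
        }
        return i;
    }
};
```

On the BWT string ipssm#pissii, rank(9, 's') returns 3, the same count used in the LF(9) example.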
FM-Index
FM-Index FM-IndexBurrows-Wheeler
FM-Index
LOUDS
The Burrows-Wheeler Transform BWT ( )
(1/2) FM-IndexShellinford Shellinford
Shellinford()
(2/2)
shellinford::fm_index fm;
fm.push_back();
fm.push_back();
fm.push_back();
fm.search(, values);
i = values.begin();
while (i != values.end()) { cout first)