Semi-dynamic compact index for short patterns and succinct van Emde Boas tree 1 Yoshiaki Matsuoka 1 , Tomohiro I 2 , Shunsuke Inenaga 1 , Hideo Bannai 1 , Masayuki Takeda 1 ( 1 Kyushu University) ( 2 TU Dortmund)


Semi-dynamic compact index for short patterns and succinct van Emde Boas tree


Yoshiaki Matsuoka1, Tomohiro I2, Shunsuke Inenaga1, Hideo Bannai1, Masayuki Takeda1

(1 Kyushu University) (2 TU Dortmund)


Overview

There exist many space-efficient indices (e.g., the FM-index [Ferragina & Manzini, 2000]), but most of them are static.

Some (e.g., the Dynamic FM-index [Salson et al., 2010]) are dynamic but consume more space than their static counterparts.

We propose a self-index for searching patterns of limited length, which:

• is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text), and searches,

• is compact, i.e., requires only O(n log σ) bits of space, where n is the text length and σ is the alphabet size, and

• can be constructed in an online manner.


Problem

Preprocess: text T of length n over an alphabet of size σ.

Query: pattern P of length at most r.

Answer: all occurrences of P in T.

Example.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

If P = baa, then we output {5, 9, 14, 19} (in any order).

A naïve algorithm

Since we would like to search for any pattern of length at most r, a naïve solution would be to store all occurrences of all r-grams in T.

This naïve algorithm requires at least n log n bits.

Example (r = 3).

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

r-grams   Occurrences
aaa       6, 10, 15, 16
aab       7, 11, 17
aba       4, 8, 18
…         …
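The table above can be reproduced with a few lines of Python. This is only a sketch of the naïve approach, assuming the text is held as an ordinary string; the function name `build_rgram_index` is illustrative and does not come from the slides.

```python
from collections import defaultdict

# Example text from the slides (positions 0..21).
T = "abbbabaaabaaabbaaaabaa"

def build_rgram_index(text, r):
    """Map every r-gram of text to the list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(text) - r + 1):
        index[text[i:i + r]].append(i)
    return index

RGRAMS = build_rgram_index(T, 3)
print(RGRAMS["aaa"])  # [6, 10, 15, 16]
print(RGRAMS["baa"])  # [5, 9, 14, 19]
```

Every text position is stored explicitly, which is where the n log n bits come from.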

Sampling of q-grams

To reduce the space, we only store the beginning positions divisible by some k (> 1).

We also sample longer substrings (of length q = r + k − 1) so that occurrences of substrings of length at most r are not missed.

Example (r = 3, k = 4, q = 6).

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

q-grams   Occurrences at positions divisible by k
aaabaa    16
abaaab    4, 8
abbaaa    12
abbbab    0
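As a small illustration, the sampled table for this example can be computed as follows. The sketch assumes that only full q-grams are sampled (the last one starts at position 16); the name `sample_qgrams` is hypothetical.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides
r, k = 3, 4
q = r + k - 1                # q = 6

def sample_qgrams(text, q, k):
    """Collect the q-grams starting at positions divisible by k."""
    sampled = {}
    for x in range(0, len(text) - q + 1, k):
        sampled.setdefault(text[x:x + q], []).append(x)
    return sampled

SAMPLED = sample_qgrams(T, q, k)
for w in sorted(SAMPLED):
    print(w, SAMPLED[w])
```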

Sampling of q-grams

For any pattern P of length at most r, if w is a sampled q-gram at position x in T and P has an occurrence in w with relative position d (i.e., w[d .. d+|P|−1] = P), then x + d is an occurrence of P in T.

Example (P = baa, r = 3, k = 4, q = 6).

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

• abaaab (sampled at 4 and 8) contains P at d = 1: occurrences at 4+1 and 8+1.

• abbaaa (sampled at 12) contains P at d = 2: occurrence at 12+2.

• aaabaa (sampled at 16) contains P at d = 3: occurrence at 16+3.
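The rule x + d can be checked directly by scanning every sampled q-gram. This is a deliberately slow sketch, not the search procedure of the index. It restricts d to 0 ≤ d < k so that each text position is reported once, and it assumes that positions after the last full sampled q-gram are out of scope. The function name is illustrative.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides

def occurrences_via_samples(text, pattern, r, k):
    """Report x + d for every sampled q-gram at x that has pattern at offset d < k."""
    q = r + k - 1
    assert 1 <= len(pattern) <= r
    found = []
    for x in range(0, len(text) - q + 1, k):   # sampled positions
        w = text[x:x + q]
        for d in range(k):                      # relative positions
            if w[d:d + len(pattern)] == pattern:
                found.append(x + d)
    return sorted(found)

print(occurrences_via_samples(T, "baa", r=3, k=4))  # [5, 9, 14, 19]
```

Because q = r + k − 1, a pattern of length at most r that starts at any offset d < k always fits inside the sampled q-gram.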

Set of q-grams Q_{P,d}

Let Q_{P,d} be the set of all (not only sampled) q-grams w in T where P has an occurrence in w with relative position d, i.e., w[d .. d+|P|−1] = P.

For example, consider the following string T:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

In this example, if k = 4, q = 6 and P = baa, then

• Q_{P,0} = {baaaab, baaaba, baaabb},

• Q_{P,1} = {abaaab, bbaaaa},

• Q_{P,2} = {aabaaa, abbaaa, babaaa}, and

• Q_{P,3} = {aaabaa, aabbaa, bbabaa}.

Observation

• Q_{P,0} ∪ Q_{P,1} ∪ … ∪ Q_{P,k−1} contains all sampled q-grams which contain P (with its offset).

• |Q_{P,d}| ≤ #occ for any 0 ≤ d < k.
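The four sets can be verified by brute force over all q-grams of T. The sketch below follows the definition literally and is not how the index computes them; `q_sets` is a hypothetical helper name.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides

def q_sets(text, pattern, r, k):
    """Return [Q_{P,0}, ..., Q_{P,k-1}] as sorted lists, by definition."""
    q = r + k - 1
    qgrams = {text[i:i + q] for i in range(len(text) - q + 1)}
    return [
        sorted(w for w in qgrams if w[d:d + len(pattern)] == pattern)
        for d in range(k)
    ]

for d, members in enumerate(q_sets(T, "baa", r=3, k=4)):
    print(d, members)
```

Each set has at most 3 members here, consistent with |Q_{P,d}| ≤ #occ = 4.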

Basic strategy of our search algorithm

To compute all occurrences of P in T, we incrementally compute Q_{P,0}, Q_{P,1}, …, Q_{P,k−1} and output occurrences of P when we encounter sampled q-grams in each Q_{P,d}.

This relies on the following observation:

• Q_{P,0} ∪ Q_{P,1} ∪ … ∪ Q_{P,k−1} contains all sampled q-grams which contain P (with its offset).

• |Q_{P,d}| ≤ #occ for any 0 ≤ d < k.

q-gram transition graph

To compute Q_{P,1}, …, Q_{P,k−1}, we consider a directed graph G = (Σ^q, E), which we call a q-gram transition graph.

A q-gram transition graph is a subgraph of the de Bruijn graph of T s.t. the indegree of each vertex is at most 1.

q-gram transition graph

Example (r = 3, k = 4, q = 6).

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

[Figure: the q-gram transition graph of T. Its vertices are the q-grams occurring in T: abbbab, bbbaba, bbabaa, babaaa, abaaab, baaaba, baaabb, aaabba, aabbaa, abbaaa, bbaaaa, baaaab, aaaaba, aaabaa, and aabaaa.]

We limit the indegree of each vertex to at most 1, so one edge of the de Bruijn graph (marked in the figure) is not constructed.

Positions of sampled q-grams: abbbab at 0; abaaab at 4 and 8; abbaaa at 12; aaabaa at 16.
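One possible way to build such a graph online is sketched below. It assumes, following the edge description given later in the slides (v = c u[0..q−2]), that an edge goes from the q-gram u at position i to the q-gram v at position i − 1, and that an edge is dropped whenever v already has an incoming edge. The dictionary layout and the names are illustrative, not the representation used by the authors.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides

def build_transition_graph(text, q):
    """Return (out_edges, skipped).

    out_edges[u] lists the characters c such that the edge u -> c + u[:-1] exists.
    skipped lists de Bruijn edges (u, v) dropped to keep indegree(v) <= 1.
    """
    out_edges, in_edge, skipped = {}, {}, []
    for i in range(1, len(text) - q + 1):
        v = text[i - 1:i - 1 + q]   # q-gram at position i - 1
        u = text[i:i + q]           # q-gram at position i
        if v in in_edge:
            if in_edge[v] != u:
                skipped.append((u, v))
            continue
        in_edge[v] = u
        out_edges.setdefault(u, []).append(v[0])
    return out_edges, skipped

OUT_EDGES, SKIPPED = build_transition_graph(T, 6)
print(SKIPPED)              # [('baaabb', 'abaaab')]
print(OUT_EDGES["aaabaa"])  # ['b', 'a']
```

Under these assumptions exactly one edge is dropped for the example text: abaaab occurs at positions 4 and 8 with two different successors, and only the first is kept.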


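Computing Q_{P,0}, …, Q_{P,k−1}

Example (P = baa, r = 3, k = 4, q = 6).

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

T = a b b b a b a a a b a a a b b a a a a b a a

[Figure: traversal of the q-gram transition graph from Q_{P,0} to Q_{P,3}.]

Q_{P,0}: baaaab, baaaba, baaabb

Q_{P,1}: abaaab (sampled at positions 4, 8), bbaaaa

Q_{P,2}: aabaaa, abbaaa (sampled at position 12), babaaa

Q_{P,3}: aaabaa (sampled at position 16), aabbaa, bbabaa

This edge (marked in the figure) does not exist, therefore abaaab is enumerated only once.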
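The whole traversal can be put together in a short sketch. It makes several simplifying assumptions that are not taken from the slides: q-grams are kept as Python strings, Q_{P,0} is found by a linear scan instead of the bit array B, edges run from the q-gram at position i to the one at position i − 1, and the special handling needed at the very end of the text is ignored. All names are illustrative.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides

def build_index(text, r, k):
    """Build the transition graph and the sampled positions in one left-to-right pass."""
    q = r + k - 1
    out_edges, has_in_edge, sampled, qgrams = {}, set(), {}, set()
    for i in range(len(text) - q + 1):
        u = text[i:i + q]
        qgrams.add(u)
        if i % k == 0:                        # sampled position
            sampled.setdefault(u, []).append(i)
        if i > 0:
            v = text[i - 1:i - 1 + q]
            if v not in has_in_edge:          # keep indegree(v) <= 1
                has_in_edge.add(v)
                out_edges.setdefault(u, []).append(v[0])
    return out_edges, sampled, qgrams

def search(index, pattern, k):
    """Walk Q_{P,0}, ..., Q_{P,k-1} and report x + d for sampled q-grams."""
    out_edges, sampled, qgrams = index
    level = [w for w in qgrams if w.startswith(pattern)]   # Q_{P,0}
    result = []
    for d in range(k):
        for w in level:
            result.extend(x + d for x in sampled.get(w, []))
        # Follow edges u -> c + u[:-1] to obtain Q_{P,d+1}.
        level = [c + u[:-1] for u in level for c in out_edges.get(u, [])]
    return sorted(result)

INDEX = build_index(T, r=3, k=4)
print(search(INDEX, "baa", k=4))  # [5, 9, 14, 19]
```

Since every vertex has at most one incoming edge, no q-gram is visited twice within a level, which is what bounds the work per level by #occ.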

Computing Q_{P,0}

Given a pattern P, we first need to compute the source Q_{P,0} of the q-gram transition graph, i.e., all q-grams in T which begin with P.

Consider all q-grams in lexicographical order. For any w ∈ Σ^q (not necessarily appearing in T), we denote by rank(w) the lexicographical rank of w.

For any pattern P, there exists a single range [sp(P), ep(P)] s.t. a q-gram w begins with P iff rank(w) ∈ [sp(P), ep(P)]. This range can be computed easily.

Consider a bit array B of size σ^q s.t. B[rank(w)] = 1 iff w appears in T.

Then, w ∈ Q_{P,0} iff rank(w) ∈ [sp(P), ep(P)] and B[rank(w)] = 1, so we need to output all such w.

Example: the q-grams that begin with baa are those with ranks from sp(baa) = 32 to ep(baa) = 39.

w        rank(w)   B[rank(w)]
aaaaaa   0         0
aaaaab   1         1
aaaaba   2         0
…        …         …
abbbbb   31        1
baaaaa   32        0
baaaab   33        1
baaaba   34        0
baaabb   35        1
…        …         …
baabbb   39        0
…        …         …
bbbbbb   63        0
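A minimal sketch of the rank, the range [sp(P), ep(P)], and the bit array B, computed for the running example text T with Σ = {a, b}. The linear scan in `q_p0` only stands in for the successor queries discussed on the later slides, and all function names are illustrative.

```python
T = "abbbabaaabaaabbaaaabaa"  # example text from the slides
SIGMA = "ab"                 # alphabet in lexicographical order

def rank(w):
    """Lexicographical rank of w among all strings of length len(w) over SIGMA."""
    value = 0
    for ch in w:
        value = value * len(SIGMA) + SIGMA.index(ch)
    return value

def sp_ep(pattern, q):
    """Range of ranks of the q-grams that begin with pattern."""
    width = len(SIGMA) ** (q - len(pattern))
    sp = rank(pattern) * width
    return sp, sp + width - 1

def build_B(text, q):
    """B[rank(w)] = 1 iff the q-gram w appears in text."""
    B = [0] * len(SIGMA) ** q
    for i in range(len(text) - q + 1):
        B[rank(text[i:i + q])] = 1
    return B

def q_p0(B, pattern, q):
    """Ranks of the q-grams in Q_{P,0} (linear scan over the range)."""
    sp, ep = sp_ep(pattern, q)
    return [x for x in range(sp, ep + 1) if B[x]]

print(sp_ep("baa", 6))               # (32, 39)
print(q_p0(build_B(T, 6), "baa", 6)) # [33, 34, 35]
```

For this T, the ranks 33, 34, and 35 correspond to baaaab, baaaba, and baaabb.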


Summary of our index

We need to store:

a) the q-gram transition graph,

b) the bit array B[0 .. σ^q − 1] for computing Q_{P,0}, and

c) the positions of sampled q-grams.

We can represent

a) in O(σ^q log σ) bits,

b) in σ^q + O(σ^q / ω) bits, and

c) in (n / k + σ^q) log(n / k) bits.

We can search any pattern in O(k × #occ + log_σ n) time.

n : length of T.

σ : alphabet size.

q : length of sampled substrings.

k : sampling distance.

ω : machine word size.

The representations of (a) and (b) are explained next.


Representation of (a)

Since the q-gram transition graph is a subgraph of the de Bruijn graph, from each node u it is enough to store the character c s.t. v = c u[0..q−2] if an edge (u, v) exists.

Since the number of vertices is σ^q and the indegree of each vertex is at most 1, the number of edges is at most σ^q. We can represent this graph in O(σ^q log σ) bits by using some tables.

[Figure: part of the q-gram transition graph (vertices abaaab, baaaba, baaaab, aaaaba, aaabaa, aabaaa), with each edge labeled by its stored character c.]
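When each q-gram is represented by an integer, as stated later under Complexities, the target of an edge can be obtained arithmetically from u and c. The sketch below assumes the rank encoding with Σ = {a, b}; the helper names `encode`, `decode`, and `edge_target` are hypothetical.

```python
SIGMA = "ab"  # alphabet in lexicographical order

def encode(w):
    """Integer code (lexicographical rank) of the q-gram w."""
    value = 0
    for ch in w:
        value = value * len(SIGMA) + SIGMA.index(ch)
    return value

def decode(x, q):
    """Inverse of encode for q-grams of length q."""
    chars = []
    for _ in range(q):
        x, digit = divmod(x, len(SIGMA))
        chars.append(SIGMA[digit])
    return "".join(reversed(chars))

def edge_target(u_code, c, q):
    """Code of v = c u[0..q-2]: drop the last character of u, prepend c."""
    sigma = len(SIGMA)
    return SIGMA.index(c) * sigma ** (q - 1) + u_code // sigma

u = encode("aaabaa")
print(decode(edge_target(u, "b", 6), 6))  # baaaba
print(decode(edge_target(u, "a", 6), 6))  # aaaaba
```

Only the character c (log σ bits) has to be stored per edge; the rest of v is implied by u.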


Representation of (b)

By data structure (b), we output all w s.t. rank(w) ∈ [sp(P), ep(P)] and B[rank(w)] = 1, where rank(w) is the lexicographical rank of w. For example, sp(baa) = 32 and ep(baa) = 39.

So, using a fast successor data structure, we can compute all such q-grams w.

We need a dynamic successor data structure to support online updates to T.

We can use a van Emde Boas tree, but it requires Θ(σ^q) words = Θ(σ^q ω) bits. We want to reduce the space.

Representation of (b)

We present a succinct variant of the van Emde Boas tree.

We divide B into blocks of size ω^h, where ω is the machine word size and h (> 1) is some constant integer.

We maintain an ω-ary tree of height h (bottom tree) for each block, and a van Emde Boas tree (top tree) over the bottom trees.

[Figure: the bit array B split into blocks of ω^h bits (e.g., 10101100…1, 00000000…0, 00100000…0), one ω-ary tree of height h per block, and a van Emde Boas tree on top holding one bit per block (here 1, 0, 1).]

Representation of (b): bottom tree

Each bottom tree is a complete ω-ary tree.

Each node has a bit array A of length ω s.t. A[j] = 1 iff the j-th child of the node contains 1.

[Figure: a block of size ω^h and the bit arrays A of the tree nodes above it.]
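A bottom tree of this kind can be sketched with Python integers standing in for machine words. The sketch assumes that the lowest level of arrays A holds the bits of the block itself and that each higher level summarizes the one below. It uses ω = 4 and h = 3 purely so that one block covers the 64 ranks of the running example; a real implementation would use the actual word size. The class and method names are illustrative.

```python
def lowest_set_bit(word):
    """Index of the least significant 1 bit of a nonzero integer."""
    return (word & -word).bit_length() - 1

class BottomTree:
    """omega-ary summary tree over a block of omega**h bits."""

    def __init__(self, omega, h):
        self.omega, self.h = omega, h
        # levels[0] holds the block itself; levels[h-1] is the single root word.
        self.levels = [[0] * omega ** (h - 1 - l) for l in range(h)]

    def insert(self, i):
        """Set bit i and the summary bits on the path to the root."""
        for level in range(self.h):
            i, bit = divmod(i, self.omega)
            self.levels[level][i] |= 1 << bit

    def successor(self, i):
        """Smallest j >= i whose bit is set, or None if there is none."""
        pos = i
        for level in range(self.h):
            word_idx, bit = divmod(pos, self.omega)
            start = bit if level == 0 else bit + 1
            word = self.levels[level][word_idx] >> start
            if word:
                node = word_idx * self.omega + start + lowest_set_bit(word)
                # Descend to the leftmost set bit below the chosen child.
                for lower in range(level - 1, -1, -1):
                    node = node * self.omega + lowest_set_bit(self.levels[lower][node])
                return node
            pos = word_idx
        return None

T = "abbbabaaabaaabbaaaabaa"
tree = BottomTree(omega=4, h=3)
for i in range(len(T) - 6 + 1):
    # Rank of the q-gram over {a, b}: read it as a binary number.
    tree.insert(int(T[i:i + 6].replace("a", "0").replace("b", "1"), 2))

print(tree.successor(32))  # 33
print(tree.successor(36))  # 40
```

Each query touches at most h words on the way up and h words on the way down, which matches the O(h) term in the stated query time; the van Emde Boas top tree is not modeled here.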

Representation of (b)

Data structure (b) can be represented in σ^q + o(σ^q) bits.

• The bottom trees require σ^q + O(σ^q / ω) = σ^q + o(σ^q) bits and the top tree requires O(σ^q / ω^{h−1}) = o(σ^q) bits, assuming the machine word size ω = Θ(log n).

Updates of a single bit in B and successor queries can be done in O(h + log log σ^q) = O(log log σ^q) time.

• If σ^q ≤ n, then this is O(log log n) time.
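The two space terms can be accounted for as follows. This is a sketch under the assumption that level l of the bottom trees stores σ^q / ω^l bits in total (level 0 being B itself) and that the top van Emde Boas tree uses a constant number of ω-bit words per block.

```latex
\begin{align*}
\text{bottom trees: } & \sum_{l=0}^{h-1} \frac{\sigma^q}{\omega^{l}}
   \;<\; \sigma^q \cdot \frac{\omega}{\omega - 1}
   \;=\; \sigma^q + \frac{\sigma^q}{\omega - 1}
   \;=\; \sigma^q + O\!\left(\frac{\sigma^q}{\omega}\right) \text{ bits},\\
\text{top tree: } & \Theta\!\left(\frac{\sigma^q}{\omega^{h}}\right) \text{ words}
   \;=\; \Theta\!\left(\frac{\sigma^q}{\omega^{h}} \cdot \omega\right)
   \;=\; \Theta\!\left(\frac{\sigma^q}{\omega^{h-1}}\right) \text{ bits}.
\end{align*}
```

With ω = Θ(log n) and h > 1, both extra terms are o(σ^q).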

Complexities

We represent each q-gram by an integer, and we do not store the original text T.

We assume that σ = polylog(n), k ≥ 1, q = k + r − 1 and q ≤ log_σ n − log_σ log_σ n.

If we choose k = Θ(log_σ n), then the space complexity is O(n log σ) bits, and hence our index is compact.

Construction time   O(n)
Searching time      O(k × #occ + log_σ n)
Space (in bits)     (n / k + σ^q) log(n / k) + o(n)
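A short derivation shows why the stated space bound becomes O(n log σ) bits when k = Θ(log_σ n). It uses only the assumptions listed on this slide.

```latex
\begin{align*}
\sigma^q &\le \sigma^{\log_\sigma n - \log_\sigma \log_\sigma n}
          = \frac{n}{\log_\sigma n},
\qquad
\frac{n}{k} = \Theta\!\left(\frac{n}{\log_\sigma n}\right),\\[4pt]
\left(\frac{n}{k} + \sigma^q\right)\log\frac{n}{k}
  &= O\!\left(\frac{n}{\log_\sigma n}\cdot \log n\right)
   = O(n \log \sigma),
\qquad\text{since } \frac{\log n}{\log_\sigma n} = \log \sigma .
\end{align*}
```

The remaining o(n) term does not change the bound.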


Experimental results of construction

[Figure: time for construction (in seconds, axis ticks from 0.1 to 1000) against text size n (in megabytes, axis ticks 12.5 and 125). Compared: Our index (r=6, k=4); Our index (r=6, k=6); Suffix Array; FM-index; Dynamic FM-index; Rice, Re-Pair (block size = 8192) [Claude et al., 2010]; Rice, Plain (block size = 8192) [Claude et al., 2010].]

Our index is the fastest to construct.


Experimental results of searching

[Figure: average time for searching, using 100 patterns of length 6 (in seconds, axis ticks from 1E-8 to 1E-1) against text size n (in megabytes, axis ticks 12.5 and 125). Compared: Our index (r=6, k=4); Our index (r=6, k=6); Suffix Array; FM-index; Dynamic FM-index; Rice, Re-Pair (block size = 8192) [Claude et al., 2010]; Rice, Plain (block size = 8192) [Claude et al., 2010].]

Ours is the fastest compact/compressed index to search.


Experimental results of memory usage

[Figure: memory usage (in megabytes, axis ticks from 10 to 1000) against text size n (in megabytes, axis ticks 12.5 and 125). Compared: Our index (r=6, k=4); Our index (r=6, k=6); Suffix Array; FM-index; Dynamic FM-index; Rice, Re-Pair (block size = 8192) [Claude et al., 2010]; Rice, Plain (block size = 8192) [Claude et al., 2010].]

Ours is much more space-efficient than the Dynamic FM-index.

Conclusion

We proposed a q-gram based self-index for searching patterns of limited length. Our self-index:

• is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text), and searches,

• is compact, i.e., requires only O(n log σ) bits of space, where n is the text length and σ is the alphabet size, and

• can be constructed in an online manner.

When the text is a human DNA sequence (i.e., σ = 4 and n ≈ 10^9), the practical limit of the pattern length is about 10 for our index.

Can we further reduce the space complexity?