Succinct Data Structure for Analyzing Document Collection Preferred Infrastructure Inc. Daisuke Okanohara [email protected] 2011/12/1 ALSIP2011@Takamatsu


DESCRIPTION

Invited talk at the workshop ALSIP 2011. Topics include rank/select data structures and their implementations, wavelet trees and their applications, suffix arrays, suffix trees, the Burrows-Wheeler transform, the FM-index, and top-k document queries.

TRANSCRIPT

Page 1: Succinct Data Structure for Analyzing Document Collection

Succinct Data Structure for Analyzing Document Collection

Preferred Infrastructure Inc.

Daisuke Okanohara

[email protected]

2011/12/1 ALSIP2011@Takamatsu

Page 2:

Agenda

Succinct Data Structures (20 min.): Rank/Select, Wavelet Tree

Full-text Index (30 min.): Suffix Array/Tree, BWT

Compressed Full-text Index (20 min.): FM-index

Analyzing Document Collection (20 min.): Document Statistics (for mining), Top-k query

Succinct Data Structures -> Full-text Index -> Compressed Full-text Index -> Analyzing Document Collection

Page 3:

Background

Very large document collections appear everywhere. Twitter: 200 billion tweets per year (6K tweets per sec.). Modern DNA sequencing: e.g., an Illumina HiSeq 2000 produces 40G bases per day.

Analyzing these large document collections requires new data structures and algorithms.

Page 4:

Notation

T[0, n] : string T[0]T[1]…T[n]
T[i, j] : subarray from T[i] to T[j] inclusive
T[i, j) : subarray from T[i] to T[j-1]

lg m : log2 m (# of bits to represent m cases)

Σ : the alphabet of the string
σ := |Σ| : the alphabet size
n : length of the input string
m : length of the query string

Page 5:

Preliminaries: Empirical Entropy (1/2)

Empirical entropy measures the compressibility of a string. Zero-order empirical entropy of T:

H0(T) := ∑c (n(c)/n) lg (n/n(c))

n(c) : the number of occurrences of c in T
n : the length of T

This is the lower bound for the average code length of a compressor that uses no context information.

T = abccbaabcab, n(a)=4, n(b)=4, n(c)=3, H0(T) ≈ 1.57 bits
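As a sanity check, H0 can be computed directly from the character counts; a small sketch:

```python
from collections import Counter
from math import log2

def h0(t: str) -> float:
    # H0(T) = sum over characters c of (n(c)/n) * lg(n / n(c))
    n = len(t)
    return sum((m / n) * log2(n / m) for m in Counter(t).values())

# T = abccbaabcab with n(a)=4, n(b)=4, n(c)=3
print(round(h0("abccbaabcab"), 2))  # 1.57
```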

Page 6:

Empirical Entropy (2/2)

k-th order empirical entropy of T:

Hk(T) = ∑w∈Σ^(k-1) (nw/n) H0(Tw)

Tw : the characters preceding the occurrences of the substring w
nw : the number of occurrences of the string w in T

This is the lower bound for the average code length of a compressor that uses an adjacent substring of length k-1 as context information.

T = abccbaabcab, Ta = bac, Tb = acaa, Tc = bcb, H2(T) ≈ 0.33 bits

Hierarchy of entropies: lg σ ≧ H0(T) ≧ H1(T) ≧ … ≧ Hk(T). For typical English text, lg σ = 8, H0 ≈ 4.5, H5 ≈ 1.7.

Page 7:

Succinct Data Structure

Page 8:

Rank/Select Dictionary

For a bit vector B[0, n), B[i] ∈ {0, 1}, consider the following operations:
Lookup(B, i) : B[i]
Rankc(B, i) : the number of c's in B[0, i)
Selectc(B, i) : the position of the (i+1)-th occurrence of c in B

Example:
i 0123456789
B 0111011001

Rank1(B, 6) = 4 : the number of 1's in B[0, 6)
Select0(B, 2) = 7 : the position of the (2+1)-th 0 in B
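A naive O(n)-time reference implementation that pins down these semantics (real succinct structures answer both queries in O(1)):

```python
def rank(B, c, i):
    # Rank_c(B, i): number of c's in B[0, i)
    return B[:i].count(c)

def select(B, c, i):
    # Select_c(B, i): position of the (i+1)-th occurrence of c in B
    seen = -1
    for pos, b in enumerate(B):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("fewer than i+1 occurrences")

B = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1]
assert rank(B, 1, 6) == 4      # ones in B[0, 6)
assert select(B, 0, 2) == 7    # the (2+1)-th 0 sits at position 7
```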

Page 9:

Rank/Select Dictionary Usage (1/2): Subset

For U = {0, 1, 2, …, n-1}, a subset V ⊆ U can be represented by a bit vector B[0, n) s.t. B[i] = 1 if i ∈ V, and B[i] = 0 otherwise. Select(B, i) is the (i+1)-th smallest element of V, and Rank(B, i) = |{e | e ∈ V, e < i}|.

V = {0, 4, 7, 10, 15}

i 0123456789012345
B 1000100100100001

Page 10:

Rank/Select Dictionary Usage (2/2): Partial Sums

For a non-negative integer array P[0, m), concatenate the unary codes of the P[i] to build a bit array B[0, n). The unary code of a non-negative integer p ≧ 0 is 0^p 1; e.g., the unary codes of 0, 3, 10 are 1, 0001, 00000000001. By construction, n = m + ∑P[i] and the number of ones is m.

P[i] = Select1(B, i) - Select1(B, i-1) - 1
Cum[k] = ∑i=0…k P[i] = Select1(B, k) - k (# of 0's before the (k+1)-th 1)
Find r s.t. Cum[r] ≦ x < Cum[r+1] by Rank1(B, Select0(B, x))

P = {3, 5, 2, 0, 4, 7}

B = 000100000100110000100000001

Cum[4] = Select1(B, 4) - 4 = 14
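The partial-sums identities can be checked with a direct simulation (naive select; a sketch):

```python
def unary_bits(P):
    # concatenate the unary codes 0^p 1 of each p in P
    bits = []
    for p in P:
        bits += [0] * p + [1]
    return bits

def select1(B, i):
    # position of the (i+1)-th one
    return [pos for pos, b in enumerate(B) if b == 1][i]

P = [3, 5, 2, 0, 4, 7]
B = unary_bits(P)
assert len(B) == len(P) + sum(P)                  # n = m + sum P[i]
assert select1(B, 4) - 4 == sum(P[:5])            # Cum[4] = 14
assert select1(B, 1) - select1(B, 0) - 1 == P[1]  # P[i] from consecutive selects
```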

Page 11:

Data structures for Rank/Select

A bit vector B[0, n) with m ones can be represented in lg C(n, m) bits, where lg C(n, m) ≈ m lg e + m lg (n/m) when m << n.

Implementations of the rank/select dictionary are divided into three cases: the bit vector is dense, sparse, or very sparse.

Case 1. Dense: 01101011010100110101100101
Case 2. Sparse (RRR): 00001001000010001000100010
Case 3. Very sparse (SDArray): 00000000000010000000001000

Page 12:

Rank/Select Dictionary: Case 1. Dense

Divide B into large blocks of length l = (log n)^2, and each large block into small blocks of length s = (log n)/2. For each large block, store L[i] = Rank1(B, il). For each small block, store S[i] = Rank1(B, is) - Rank1(B, ⌊is/l⌋·l). Build a table P(X) storing the number of ones in each bit pattern X of length s.

Rank1(B, i) = L[i/l] + S[i/s] + P(B[⌊i/s⌋·s, i)) : three table lookups, O(1) time.

The total size of L, S, P is o(n), so rank uses n + o(n) bits. Select can also be supported in O(1) time within n + o(n) bits.

In practice, P(X) is often replaced with popcount(X).
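A simplified version of this directory with a single block level plus an in-block scan standing in for the popcount table (the real structure adds the second level of small blocks to reach o(n) extra space); a sketch:

```python
class RankDict:
    def __init__(self, bits, block=8):
        self.bits = bits
        self.block = block
        self.L = [0]                       # sampled cumulative popcounts
        ones = 0
        for i, b in enumerate(bits):
            ones += b
            if (i + 1) % block == 0:
                self.L.append(ones)

    def rank1(self, i):
        # ones in bits[0, i): sampled count + scan inside the final block
        j = i // self.block
        return self.L[j] + sum(self.bits[j * self.block:i])

B = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
rd = RankDict(B)
assert all(rd.rank1(i) == sum(B[:i]) for i in range(len(B) + 1))
```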

Page 13:

Rank/Select Dictionary: Case 2. Sparse (RRR)

Case 1 is redundant when m << n. Each small block X is represented by a class and an offset. The class C(X) is the number of ones in X, e.g., X = 101, C(X) = 2. The offset O(X) is the lexicographic rank of X among all blocks of the same class, e.g., X0 = 011, X1 = 101, X2 = 110, O(X0) = 0, O(X1) = 1, O(X2) = 2.

Store a pointer every (log n)^2/2 bits and an offset for each block.

B = 111 110 100 101 010 000 100 111 010
C =  3   2   1   2   1   0   1   3   1
O =  0   2   2   1   1   0   2   0   1

Page 14:

Rank/Select Implementation: Case 2. Sparse (RRR)

Idea: when m << n, many classes (# of ones in a small block) are small, and the offsets are also small.

The total size is nH0(B) + o(n) bits, and rank is supported in O(1) time. Select can also be supported in O(1) time within the same working space.

Note: RRR is also efficient when the ones and zeros are clustered. The following bit vector is not sparse, but can be stored efficiently with RRR encoding:

000000100000100011111111110111110111111111100000000010000001
(many 0s, then many 1s, then many 0s)

Page 15:

Rank/Select Implementation: Case 3. Very Sparse (SDArray)

Case 2 is not efficient in the very sparse case: the o(n) term is not negligible when nH0 is very small. Let S[i] := Select1(B, i) and t = lg (n/m). L[i] := the lowest t bits of S[i]; P[i] := S[i] / 2^t.

Since P is a non-decreasing integer sequence, represent it with the partial-sums structure: concatenate the unary codes of P[i] - P[i-1] to form H.

B = 0001011000000001
S[0, 3] = [3, 5, 6, 15], t = lg(16/4) = 2
L[0, 3] = [3, 1, 2, 3]
P[0, 3] = [0, 1, 1, 3], H = 1011001

Page 16: Succinct Data Structure for Analyzing Document Collection

L is stored explicitly, and H is stored with a rank/select dictionary. Size:
H : at most 2m bits (m ones plus P[m-1] ≦ m zeros)
L : m lg (n/m) bits

Select(B, i) = (Select1(H, i) - i)·2^t + L[i] : O(1) time

Rank(B, i) : linear search starting from 1 + Rank1(H, Select0(H, i)) : expected O(1) time
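The L/H decomposition and the select formula can be reproduced directly (naive bit scans; this sketch assumes n/m is a power of two so that t = lg(n/m) is integral):

```python
import math

def build_sdarray(positions, n):
    # positions: sorted positions of the ones
    m = len(positions)
    t = int(math.log2(n // m))
    L = [p & ((1 << t) - 1) for p in positions]   # low t bits
    P = [p >> t for p in positions]               # high parts, non-decreasing
    H, prev = [], 0
    for p in P:                                   # unary codes of the gaps
        H += [0] * (p - prev) + [1]
        prev = p
    return L, H, t

def select1(B, i):
    return [pos for pos, b in enumerate(B) if b][i]

def sd_select(L, H, t, i):
    # select(B, i) = (select1(H, i) - i) * 2^t + L[i]
    return ((select1(H, i) - i) << t) + L[i]

L, H, t = build_sdarray([3, 5, 6, 15], 16)
assert (L, H, t) == ([3, 1, 2, 3], [1, 0, 1, 1, 0, 0, 1], 2)
assert [sd_select(L, H, t, i) for i in range(4)] == [3, 5, 6, 15]
```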

Page 17:

Rank/Select for large alphabets

Consider rank/select on a string over a large alphabet. I will explain only wavelet trees here, since they are extremely useful!

I will not discuss the other rank/select approaches for large alphabets: multi-ary wavelet trees [Ferragina+ ACM TALG 07], the permutation-based approach [Golynski+ SODA 06], alphabet partitioning [Barbay+ ISAAC 10].

Page 18:

Wavelet Tree

Wavelet Tree = alphabet tree + bit vectors in each node

T = ahcbedghcfaehcgd

a b c d e f g h

Page 19:

Wavelet Tree

Wavelet Tree = alphabet tree + a bit vector in each node

ahcbedghcfaehcgd
010010110 (bits for the first nine characters)

acbdc | hegh

a b c d e f g h

Filter the characters according to their values: for each character c := T[i] (i = 0…n), move c to the left child and store 0 if c belongs to the left child's subtree; otherwise move c to the right child and store 1.

Page 20:

Wavelet Tree

Wavelet Tree = alphabet tree + a bit vector in each node. Recursively continue this process, and store an alphabet tree together with the bit vector of each node.

ahcbedghcfaehcgd
0100101101011010

acbdcacd heghfehg
01011011 10110011

aba cdccd efe hghhg
010 01001 010 10110

a b c d e f g h

0 1
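A pointer-based sketch of this construction together with the Lookup and Rank walks described on the next slides (naive O(n) bit counts stand in for succinct rank dictionaries; symbols are mapped to 0…σ-1):

```python
def build(s, lo, hi):
    # node over symbol range [lo, hi); stored as (bits, left, right)
    if hi - lo == 1:
        return None
    mid = (lo + hi) // 2
    bits = [0 if c < mid else 1 for c in s]
    left = build([c for c in s if c < mid], lo, mid)
    right = build([c for c in s if c >= mid], mid, hi)
    return (bits, left, right)

def wt_lookup(node, lo, hi, pos):
    while hi - lo > 1:
        bits, left, right = node
        b = bits[pos]
        pos = bits[:pos].count(b)          # Rank_b(B, pos)
        mid = (lo + hi) // 2
        node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
    return lo

def wt_rank(node, lo, hi, c, pos):
    # number of c's among the first pos symbols
    while hi - lo > 1:
        bits, left, right = node
        mid = (lo + hi) // 2
        b = 0 if c < mid else 1
        pos = bits[:pos].count(b)
        node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
    return pos

T = [ord(c) - ord('a') for c in "ahcbedghcfaehcgd"]
wt = build(T, 0, 8)
assert wt_lookup(wt, 0, 8, 8) == ord('c') - ord('a')    # T[8] = 'c'
assert wt_rank(wt, 0, 8, ord('c') - ord('a'), 9) == 2   # two c's in T[0, 9)
```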

Page 21:

Wavelet Tree (contd.)

The total bit length is n lg σ bits: the height is lg σ, and there are n bits at each depth.

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

Height is lg σ

Page 22:

Wavelet Tree (Lookup)

Lookup(T, pos), e.g. Lookup(T, 8)

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

pos := 8
b := B[pos] = 0, pos := Rankb(B, pos) = 4
b := B[pos] = 1, pos := Rankb(B, pos) = 2
b := B[pos] = 0

Return “c”

Page 23:

Wavelet Tree (Rank)

Rankc(T, pos), e.g. Rankc(T, 9)

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

c goes left at the root: pos := Rank0(B, 9) = 5
c goes right: pos := Rank1(B, 5) = 3
c goes left: pos := Rank0(B, 3) = 2

Return 2

ahcbedghcfaehcgd

Page 24:

Wavelet Tree (Select)

Selectc(T, i), e.g. Selectc(T, 1) : walk bottom-up from the leaf of c

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

i = 1
i := Select0(B, 1) = 2 (in the node for {c, d})
i := Select1(B, 2) = 4 (in the node for {a, b, c, d})
i := Select0(B, 4) = 8 (at the root)

Return 8

Page 25:

Wavelet Tree (Quantile List)

Quantile(T, p, sp, ep) : return the (p+1)-th smallest value in T[sp, ep)

E.g. Quantile(T, 5, 4, 12)

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

# of zeros in B[4, 12) is 3 ≦ 5, so go right with p := 5 - 3 = 2
# of zeros in B[1, 6) is 3 > 2, so go left
# of zeros in B[0, 3) is 2 ≦ 2, so go right with p := 0

Return “f”

The wavelet tree acts as a doubly-sorted array.
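The quantile walk can be simulated on the raw sequence, each iteration mirroring one wavelet-tree node (counting a range's zeros becomes counting values below the midpoint); a sketch:

```python
def quantile(s, p, lo, hi):
    # (p+1)-th smallest value in s, with values drawn from [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = [c for c in s if c < mid]   # the "zeros" of this node
        if p < len(left):
            s, hi = left, mid              # answer lies in the left child
        else:
            p -= len(left)                 # skip the left child's elements
            s = [c for c in s if c >= mid]
            lo = mid
    return lo

T = [ord(c) - ord('a') for c in "ahcbedghcfaehcgd"]
sub = T[4:12]
assert chr(quantile(sub, 5, 0, 8) + ord('a')) == 'f'
assert [chr(quantile(sub, p, 0, 8) + ord('a')) for p in range(8)] == sorted("edghcfae")
```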

Page 26:

Wavelet Tree (top-k Frequent List )

FreqList(T, k, sp, ep) : return the k most frequent values in T[sp, ep)

E.g. FreqList(T, 5, 4, 12)

0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

Greedy search on the tree: larger ranges are examined first.

Q := priority queue of (# of elems, node)
Q.push((ep - sp, root))
while q := Q.pop()
  if q is a leaf, report it (stop after k reports)
  else
    Q.push((# of 0's in q's range, q.left))
    Q.push((# of 1's in q's range, q.right))
end while
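The greedy best-first search can likewise be simulated with Python's heapq, carrying the sub-sequences explicitly in place of wavelet-tree ranges; a sketch:

```python
import heapq

def freq_list(s, k, lo, hi):
    # best-first search: always expand the node covering the most elements
    # of the range; leaves therefore pop in non-increasing frequency order
    heap = [(-len(s), lo, hi, s)]
    out = []
    while heap and len(out) < k:
        neg, l, h, seq = heapq.heappop(heap)
        if h - l == 1:
            out.append((l, -neg))          # (symbol, frequency)
            continue
        mid = (l + h) // 2
        left = [c for c in seq if c < mid]
        right = [c for c in seq if c >= mid]
        if left:
            heapq.heappush(heap, (-len(left), l, mid, left))
        if right:
            heapq.heappush(heap, (-len(right), mid, h, right))
    return out

T = [ord(c) - ord('a') for c in "ahcbedghcfaehcgd"]
top = freq_list(T[4:12], 2, 0, 8)
assert top[0] == (ord('e') - ord('a'), 2)  # 'e' is the most frequent in T[4, 12)
```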

Page 27:

Wavelet Tree (RangeFreq, RangeList)

RangeList(T, sp, ep, min, max) : return {T[i]} s.t. min ≦ T[i] < max, sp ≦ i < ep
RangeFreq(T, sp, ep, min, max) : return |RangeList(T, sp, ep, min, max)|

E.g. RangeList(T, 4, 12, c, e)

i 0123456789012345
B 0100101101011010

01011011 10110011

010 01001 010 10110

a b c d e f g h

0 1

Enumerate the nodes whose value ranges fall between min and max.

For RangeList, the number of enumerated nodes is O(log σ + occ).
For RangeFreq, the number of enumerated nodes is O(log σ).

Page 28:

Wavelet Tree Summary

Input T[0…n), 0 ≦ T[i] < σ. Many operations can be supported in O(log σ) time using a wavelet tree.

Operation | Description | Time
Lookup(T, i) | T[i] | O(log σ)
Rankc(T, p) | the number of c in T[0…p) | O(log σ)
Selectc(T, i) | the position of the (i+1)-th c in T | O(log σ)
Quantile(T, s, e, r) | (r+1)-th smallest value in T[s, e) | O(log σ)
FreqList(T, k, s, e) | k most frequent values in T[s, e) | expected O(k log σ), worst O(σ)
RangeFreq(T, s, e, x, y) | # of items x ≦ T[i] < y in T[s, e) | O(log σ)
RangeList(T, s, e, x, y) | list T[i] s.t. x ≦ T[i] < y in T[s, e) | O(log σ + occ)
NextValue(T, s, e, x) | smallest T[r] > x s.t. s ≦ r ≦ e | O(log σ)

Page 29:

Wavelet tree improvements

Huffman-shaped Wavelet Tree (HWT): if the alphabet tree is the Huffman tree for T, the size of the wavelet tree becomes nH0 + o(nH0) bits. The expected query time is also O(H0) when the query distribution matches the distribution in T.

Wavelet trees for large alphabets: for a large alphabet (e.g., σ ≈ n), the overhead of the alphabet tree and the per-node bit vectors can be large. Concatenate the bit vectors of each depth; during traversal, iteratively compute the range within the concatenated bit array.

Page 30:

2D Search using Wavelet Tree

Given a 2D data set D = {(xi, yi)}, find all points in [sx, ex] × [sy, ey]. Using a wavelet tree, we can solve this in O(log ymax + occ) time.

(figure: a 7×7 grid of points, and the same grid with its columns expanded so that each column holds exactly one point; expanded x-coordinates 0 1 2 2 4 4 6 6 6)

Remove the columns with no data. If there is more than one point in the same column, copy the column so that each column holds exactly one point. Since each column now has one value, we can convert the grid to the string T = 313613046, and store the x information in a bit array Xb = 101011001100111.

Page 31:

2D Search using Wavelet Tree (contd.)

The [sx, ex] × [sy, ey] region is found by
s’ := Select0(X, sx) + 1, e’ := Select0(X, ex) + 1
s’’ := Rank1(X, s’), e’’ := Rank1(X, e’)
and then performing RangeFreq(s’’, e’’, sy, ey).

T = 313613046
X = 101011001100111

(figure: the 7×7 grid of points)

Page 32:

Full-text Index

Page 33:

Full-text Index

Given a text T[0, n) and a query Q[0, m), support the following operations:
occ(T, Q) : report the number of occurrences of Q in T
locate(T, Q) : report the positions of the occurrences of Q in T

Example:
i 0123456789A
T abracadabra

occ(T, “abra”) = 2
locate(T, “abra”) = {0, 7}

Page 34:

Suffix Array

Let Si = T[i, n] be the i-th suffix of T. Sort {Si}i=0…n lexicographically, and store the sorted indices in SA[0…n] using n lg n bits (S_SA[i] < S_SA[i+1]).

S0 abracadabra$, S1 bracadabra$, S2 racadabra$, S3 acadabra$, S4 cadabra$, S5 adabra$, S6 dabra$, S7 abra$, S8 bra$, S9 ra$, S10 a$, S11 $

Sorted: S11 $, S10 a$, S7 abra$, S0 abracadabra$, S3 acadabra$, S5 adabra$, S8 bra$, S1 bracadabra$, S4 cadabra$, S6 dabra$, S9 ra$, S2 racadabra$

SA = [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]

Page 35:

Search using SA

Given a query Q[0…m), find the range [sp, ep) such that Q is a prefix of S_SA[i] exactly for sp ≦ i < ep. Since the suffixes {S_SA[i]} are sorted, sp and ep can be found by binary search in O(m log n) time; each string comparison requires O(m) time.

The occurrence positions are SA[sp, ep); if ep = sp, then Q does not occur in T.

occ(T, Q) : O(m log n) time. locate(T, Q) : O(m log n + occ) time.

Q = “ab”, sp = 2, ep = 4
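The construction and binary search above can be sketched as follows (quadratic construction and materialized suffixes for clarity; real indexes use linear-time algorithms such as SA-IS and compare in place):

```python
import bisect

def suffix_array(t):
    # O(n^2 log n) construction, for clarity only
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_range(t, sa, q):
    # [sp, ep): the block of sorted suffixes having q as a prefix
    suffixes = [t[i:] for i in sa]
    sp = bisect.bisect_left(suffixes, q)
    ep = bisect.bisect_right(suffixes, q + "\uffff")
    return sp, ep

T = "abracadabra$"
sa = suffix_array(T)
assert sa == [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
sp, ep = sa_range(T, sa, "abra")
assert (ep - sp, sorted(sa[sp:ep])) == (2, [0, 7])   # occ = 2, locate = {0, 7}
```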

Page 36:

Suffix Tree

A trie of all suffixes of T, with every node that has only one child removed. Store the suffix indexes (SA) at the leaves, and depth information at the internal nodes; the edge labels can be recovered from the depths and SA.

Fact: the number of internal nodes is at most n-1.
Proof: consider inserting the suffixes one by one; each insertion creates at most one internal node.

Many string operations can be supported using a suffix tree.

(figure: the suffix tree of abracadabra$, with leaves labeled 11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2)

Page 37:

Burrows Wheeler Transform (BWT)

Convert T into BWT as follows:
BWT[i] := T[SA[i] - 1]; if SA[i] = 0, BWT[i] := T[n]

abracadabra$ -> ard$rcaaaabb

i SA BWT Suffix
0 11 a $
1 10 r a$
2 7 d abra$
3 0 $ abracadabra$
4 3 r acadabra$
5 5 c adabra$
6 8 a bra$
7 1 a bracadabra$
8 4 a cadabra$
9 6 a dabra$
10 9 b ra$
11 2 b racadabra$
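With a suffix array in hand, the transform is one line; conveniently, Python's t[i - 1] with i = 0 wraps around to the last character, which is exactly T[n] as the definition requires:

```python
def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt(t):
    # BWT[i] = T[SA[i] - 1]; t[-1] yields T[n] for the row with SA[i] = 0
    return "".join(t[i - 1] for i in suffix_array(t))

assert bwt("abracadabra$") == "ard$rcaaaabb"
```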

Page 38:

Characteristics of BWT

Reversible transformation: we can recover T from the BWT alone.

Easily compressed: since similar suffixes tend to share preceding characters, identical characters tend to cluster in the BWT (this is exploited by bzip2).

Implicit suffix information: the BWT carries the same information as the suffix array, and can be used as a full-text index (the so-called FM-index).

Page 39:

Example : Before BWT

When Farmer Oak smiled, the corners of his mouth spread till they were within an unimportant distance of his ears, his eyes were reduced to chinks, and diverging wrinkles appeared round them, extending upon his countenance like the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and general good character. On Sundays he was a man of misty views, rather given to postponing, and hampered by his best clothes and umbrella : upon the whole, one who felt himself to occupy morally that vast middle space of Laodicean neutrality which lay between the Communion people of the parish and the drunken section, -- that is, he went to church, but yawned privately by the time the congregation reached the Nicene …

Page 40:

Example : After BWT

ooooooioororooorooooooooorooorromrrooomooroooooooormoorooor

orioooroormmmmmuuiiiiiIiuuuuuuuiiiUiiiiiioooooooooooorooooiiiioooi

oiiiiiiiiiiioiiiiiieuiiiiiiiiiiiiiiiiiouuuuouuUUuuuuuuooouuiooriiiriirriiiiriiiiiiaii

iiioooooooooooooiiiouioiiiioiiuiiuiiiiiiiiiiiiiiiiiiiiiiioiiiiioiuiiiiiiiiiiiiioiiiiiiiiiiiiio

iiiiiiuiiiioiiiiiiiiiiiioiiiiiiiiiioiiiioiiiiiiioiiiaiiiiiiiiiiiiiiiiioiiiiiioiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiii

oiiiiiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuuuiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiioiiiiuioiuiiiiiiioiiiiiiiu

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiioaoiiiiioioiiiiiiiioooiiiiiooioiiioiiiiiouiiiiiiiiiiiiooiiiiiiiiiiiiiiiiii

iiiiiiiiiiiiioiiiiiiiiiiiiiiiiiiioiooiiiiiiiiiiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiiiiiiiioiiiuiii

iiiiiiioiiiiiiiiiiiiuoiiioiiioiiiiiiiiiiiiiiiiiiiiiiuiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuiuiiiiiuuiiiii

iiiiiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiioiiiiiiiiiiiiiiiiiiiiioiiiiiiiiioiiiiuiiiioiiiioiiiiiiiiii

iiiiiiiiiiiiiiiiiiiiiiiiiiiioiioiiiiiiuiiiiiiiiiiiiiiiooiiiiiiiiiiiiiiiiiiiioooiiiiiiiioiiiiouiiiiiiiiiiiiiiii

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiioiiiiiiioiiiiiioiiiiuiuoiiiiiiiiioiiiiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiioi

oi….

Page 41:

LF-mapping

Let F[i] = T[SA[i]] : the first characters of the sorted suffixes.

Fact: the order of equal characters in BWT and in F is the same.

i BWT Suffix (First char is F)

0 a1 $

1 r a1$

2 d a2bra$

3 $ a3bracadabra$

4 r a4cadabra$

5 c a5dabra$

6 a2 bra$

7 a3 bracadabra$

8 a4 cadabra$

9 a5 dabra$

10 b ra$

11 b racadabra$

Page 42:

LF-mapping (1/3)

Let F[i] = T[SA[i]] First characters of suffixes

Fact : the order of same chars

in BWT and F is same

Proof: the order of chars in BWT

is determined by the following

suffixes.

the order of chars in F is also

determined by the following

suffixes. These are same !

i BWT Suffix (First char is F)

0 a1

1 r a1

2 d a2

3 $ a3bracadabra$

4 r a4

5 c a5

6 a2

7 a3 bracadabra$

8 a4

9 a5

10

11

Same

Page 43:

LF-mapping (2/3)

Let F[i] = T[SA[i]] : the first characters of the sorted suffixes. Let C[c] be the number of characters smaller than c in T; C[c] is the beginning position of the run of c in F.

For BWT[i] = c, the corresponding position in F is C[c] + Rankc(BWT, i).

i BWT Suffix (First char is F)

0 a $

1 r a$

2 d abra$

3 $ abracadabra$

4 r acadabra$

5 c adabra$

6 a bra$

7 a bracadabra$

8 a cadabra$

9 a dabra$

10 b ra$

11 b racadabra$

Page 44:

LF-mapping (3/3)

Let ISA[0, n) s.t. ISA[r] = p if SA[p] = r. SA converts a rank into a position, and ISA vice versa.

Define two functions:
LF(r) := C[c] + Rankc(BWT, r) for c := BWT[r], = ISA[SA[r] - 1]
PSI(r) := Selectc(BWT, r - C[c]) for c := F[r], = ISA[SA[r] + 1]

LF and PSI are inverses of each other. Using LF and PSI, we can traverse T in position order: PSI moves forward by one character, LF moves backward by one character.

e.g. T = abracadabra : SA[r], SA[PSI[r]], SA[LF[r]]

Page 45:

Decoding of BWT using LF-mapping

r := beg // the rank of T[0, n), i.e. ISA[0]

while

Output BWT[r]

r := PSI(r)

until r != beg

That’s all !

In practice, PSI values are pre-computed using O(n) time
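A sketch of the inversion; unlike the loop above it walks with LF (backward), so the characters come out reversed and are flipped at the end. The starting row is the one holding '$' in the BWT, i.e. the row whose suffix is the whole text:

```python
def inverse_bwt(b):
    n = len(b)
    # C[c]: number of characters in T smaller than c
    C, total = {}, 0
    for c in sorted(set(b)):
        C[c] = total
        total += b.count(c)
    # LF[i] = C[c] + Rank_c(BWT, i) for c = BWT[i]
    LF, seen = [], {}
    for c in b:
        LF.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    r = b.index('$')          # the row with SA[r] = 0, since BWT[r] = T[n] = '$'
    out = []
    for _ in range(n):
        out.append(b[r])      # emits T[n], T[n-1], ..., T[0]
        r = LF[r]
    return "".join(reversed(out))

assert inverse_bwt("ard$rcaaaabb") == "abracadabra$"
```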

Page 46:

Compressed Full-text Index

Page 47:

Compressed Full-Text Index

A full-text index can be compressed; moreover, we can recover the original text from the compressed index (a self-index).

I explain only the FM-index here. Other important compressed full-text indexes: Compressed Suffix Arrays, and the LZ-index and other LZ77-based indexes.

Page 48:

FM-index

Use the LF-mapping to search for a pattern. Assume a range [sp, ep) corresponds to a pattern P. Then the occurrences of a pattern P’ = cP correspond to exactly those positions in [sp, ep) preceded by c.

i BWT Suffix

0 a $

1 r a$

2 d abra$

3 $ abracadabra$

4 r acadabra$

5 c adabra$

6 a bra$

7 a bracadabra$

8 a cadabra$

9 a dabra$

10 b ra$

11 b racadabra$

P = a appears in [1, 6)

P’ = ra appears in [10, 12)

Page 49:

FM-index

Use the LF-mapping to search for a pattern. Assume a range [sp, ep) corresponds to a pattern P. Then the occurrences of P’ = cP correspond to exactly those positions in [sp, ep) preceded by c.

The corresponding range of suffixes is [C[c] + Rankc(BWT, sp), C[c] + Rankc(BWT, ep)).

i BWT Suffix

0 a $

1 r1 a$

2 d abra$

3 $ abracadabra$

4 r2 acadabra$

5 c adabra$

6 a bra$

7 a bracadabra$

8 a cadabra$

9 a dabra$

10 b r1a$

11 b r2acadabra$

P = a appears in [1, 6)

P’ = ra appears in [10, 12)

Page 50:

FM-index

Input query: Q[0, m). Output: the range [sp, ep) corresponding to Q.

sp := 0, ep := n
for i = m-1 downto 0
  c := Q[i]
  sp := C[c] + Rankc(BWT, sp)
  ep := C[c] + Rankc(BWT, ep)
end for
return (sp, ep)

occ(T, Q) = ep - sp
locate(T, Q) = SA[sp, ep)
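The backward-search loop, with naive Rank over the BWT string standing in for the wavelet tree (a sketch; C[] is built by a cumulative count over the sorted alphabet):

```python
def backward_search(b, C, q):
    # maintain [sp, ep): the rows whose suffixes start with the current suffix of q
    sp, ep = 0, len(b)
    for c in reversed(q):
        sp = C[c] + b[:sp].count(c)   # C[c] + Rank_c(BWT, sp)
        ep = C[c] + b[:ep].count(c)
        if sp >= ep:
            break                     # pattern does not occur
    return sp, ep

b = "ard$rcaaaabb"                    # BWT of "abracadabra$"
C, total = {}, 0
for c in sorted(set(b)):              # C[c] = # of characters smaller than c
    C[c] = total
    total += b.count(c)

assert backward_search(b, C, "a") == (1, 6)
assert backward_search(b, C, "abra") == (2, 4)
```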

Page 51:

Storing Sampled SA’s

Store sampled SA values instead of the original SA: sample SA[p] s.t. SA[p] = tk (t = 0, 1, …). We also store a bit array R to mark the indexes of the sampled SA values.

Recall that LF(r) = ISA[SA[r] - 1] for c = BWT[r]. If SA[r] is not sampled, we use LF to move along SA:
SA[r] = SA[LF(r)] + 1 = SA[LF^2(r)] + 2 = … = SA[LF^j(r)] + j
We move along SA until a sampled point is found, using at most k LF operations.

Set k = log^(1+ε) n. The size of the sample array is (n/k) log n = o(n) bits, and looking up SA[i] requires O(log^(1+ε) n · log σ) time.

i SA R
3 0 1
6 8 1
8 4 1
(all other rows have R = 0 and store no SA value)
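Locating with sampled SA values can be sketched on the running example: walk with LF until a sampled row is hit, then add back the number of steps taken (SA[r] = SA[LF^j(r)] + j):

```python
def locate(b, C, samples, r):
    # step back with LF until a sampled row is reached
    steps = 0
    while r not in samples:
        c = b[r]
        r = C[c] + b[:r].count(c)     # LF(r) = C[c] + Rank_c(BWT, r)
        steps += 1
    return samples[r] + steps

b = "ard$rcaaaabb"                    # BWT of "abracadabra$"
C = {'$': 0, 'a': 1, 'b': 6, 'c': 8, 'd': 9, 'r': 10}
samples = {3: 0, 6: 8, 8: 4}          # rows whose SA value is a multiple of k = 4
# recovers the full suffix array [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
assert [locate(b, C, samples, r) for r in range(12)] == [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```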

Page 52:

Grammar Compression on SA [R. González+ CPM 07]

SA can also be compressed using grammar-based compression. Let DSA[i] := SA[i] - SA[i-1] (for i > 0). If BWT[i] = BWT[i+1] = … = BWT[i+run], then DSA[i+1, i+run] = DSA[k+1, k+run] for k := LF[i]; hence DSA contains many repetitions when BWT has many runs, and can be compressed with a grammar compressor. [R. González+ CPM 07] used Re-Pair, but any grammar compressor that supports fast decompression can be used.

Page 53:

Compression Boosting in FM-index

The size of an FM-index using wavelet trees is nH0 + o(nH0) bits; compression boosting improves this to nHk.

Compression Boosting [P. Ferragina+ ACM TALG 07]: divide BWT into blocks using contexts of length k, and build a separate wavelet tree for each block.

Implicit Compression Boosting [Mäkinen+ SPIRE 07]: a wavelet tree whose bit vectors are compressed with RRR. The size is close to optimal: the overhead is at most σ^(k+1) log n, which is o(n) when k ≦ (1-ε) log_σ n - 1.

(figure: BWT divided into blocks aaaa.., aabb.., aacc.., aacd.., abcd.., abcc.. according to the following suffix of length 2)

Page 54:

Compression Boosting using fixed-length blocks [Kärkkäinen+ SPIRE 11]

Dividing BWT into blocks of fixed length can also achieve nHk; the block size is σ (log n)^2. To analyze the working space, one calculates the overhead incurred by converting the context-sensitive partition into the fixed-length partition.

Note: this encoding is very similar to the encoding used in bzip2.

Page 55:

SSA : Huffman-shaped wavelet tree
AFFMI : context-sensitive partition of BWT
Fixed Block : fixed-length blocks
+ RRR : bit vectors in the wavelet tree represented with RRR encoding

Page 56:

Summary of state-of-the-art

Given an input string T, compute BWT(T) = B. Divide B into blocks of fixed length, store each block in a Huffman-shaped wavelet tree, and store each bit vector in the wavelet trees in RRR representation. For English text: 2-3 bits per character, about 0.06 msec per pattern.

Page 57:

Analyzing Document Collection

Page 58:

Document Query

DL(D, P), Document Listing : list the distinct documents in which P appears.
DLF(D, P), Document Listing with Frequency : the result of DL(D, P) plus the frequency of P in each document.
Top(k, D, P), Top-k Retrieval : list the k documents containing P with the largest scores S(D, P); e.g., S(D, P) may be term frequency or a static document score (e.g., PageRank).

In the previous sections we considered hit positions; here we consider hit documents. We need to convert hit positions into hit documents, and this is not easy.

Page 59:

Index for document collection

For a document collection D = {d1, d2, …, dD}, build T = d1#d2#d3#…#dD and build a full-text index for T. We also use a bit array to mark the beginning positions of the documents, so a hit position in T can be converted into a hit document in O(1) time.

Baseline for DL and DLF: perform locate(T, Q) and convert each hit position into a hit document independently. However, this requires O(occ(T, Q)) time, and occ(T, Q) can be much larger than the number of hit documents.

Page 60:

DocArray and PrevArray (1/2)[Muthukrishnan SODA 02]

DocArray : D[0, n] s.t. D[i] is the document ID of SA[i]. DL(D, Q) now amounts to listing all distinct values in the hit range D[sp, ep).

PrevArray : P[0, n] s.t. P[i] = max{j < i | D[j] = D[i]} (or -1 if none). P[i] can be simulated as selectD[i](D, rankD[i](D, i) - 1).

i 0 1 2 3 4 5 6
D 66 77 66 88 77 77 88
P -1 -1 0 -1 1 4 3

Page 61:

DocArray and PrevArray (2/2)

Alg. DocList reports all distinct document IDs in [sp, ep) in O(|DL|) time. Call DocList(sp, sp, ep).

i 0 1 2 3 4 5 6
D 66 77 66 88 77 88 88
P -1 -1 0 -1 1 3 5

RMQ(P, 2, 7) = -1 (= P[3]) : report D[3]
RMQ(P, 2, 3) = 0 (= P[2]) : report D[2]
RMQ(P, 4, 7) = 1 (= P[4]) : report D[4]
RMQ(P, 5, 7) = 3 (= P[5]) ≧ sp : return

DocList(gsp, sp, ep)
  if sp = ep return
  (min, pos) := RMQ(P, sp, ep)
  if min ≧ gsp return
  report D[pos]
  DocList(gsp, sp, pos)
  DocList(gsp, pos + 1, ep)
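Muthukrishnan's recursion, with the RMQ simulated by a linear scan (a real implementation uses a constant-time RMQ structure, so the total time is proportional to the output size):

```python
def doc_listing(D, sp, ep):
    # PrevArray: P[i] = previous position with the same document ID, or -1
    P, last = [], {}
    for i, d in enumerate(D):
        P.append(last.get(d, -1))
        last[d] = i
    out = []
    def rec(s, e):
        if s >= e:
            return
        pos = min(range(s, e), key=lambda i: P[i])   # RMQ(P, s, e), simulated
        if P[pos] >= sp:
            return                                    # no new document in [s, e)
        out.append(D[pos])
        rec(s, pos)
        rec(pos + 1, e)
    rec(sp, ep)
    return out

D = [66, 77, 66, 88, 77, 88, 88]
assert sorted(doc_listing(D, 2, 7)) == [66, 77, 88]
```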

Page 62:

Wavelet Tree + Grammar Compression[Navarro+ SEA 11]

Build a wavelet tree for D. Using FreqList(k, sp, ep), we can solve DL and DLF in O(log σ) time per reported document. However, storing D requires n lg D bits, which is very large.

As with the grammar compression of suffix arrays, D can also be compressed by grammar compression: apply a grammar compressor to the bit arrays in the wavelet tree, and support rank/select on the compressed representation.

Currently, this is the fastest/smallest approach in practice.

Page 63:

Suffix Tree based approach (1/4) [W.K. Hon+ FOCS 09]

Let ST be the generalized suffix tree of the document collection. A leaf l is marked with document d if l corresponds to a suffix of d. An internal node v is marked with d if at least two children of v contain leaves marked with d; an internal node can be marked with many documents.

(figure: a generalized suffix tree with leaves 1-14 and edge labels a, b, c)

Page 64:

Suffix Tree based approach (2/4)

For every node v marked with d, we store a pointer ptr(v, d) to its lowest ancestor u that is also marked with d. ptr(v, d) also carries a weight corresponding to a relevance score. If no such ancestor exists, ptr(v, d) points to a super-root node.

(figure: the same tree; e.g. ptr(7, a) = 2 and ptr(13, a) = 2)

Page 65:

Suffix Tree Based Approach (3/4)

Lemma 1. The total number of pointers ptr(*, *) in T is at most 2n.

Lemma 2. Assume that document d contains a pattern P and v is the node corresponding to P. Then there exists a unique pointer ptr(u, d) such that u is in the subtree of v and ptr(u, d) points to an ancestor of v.

(figure) Example: for v = 10, ptr(10, c) = 2 and ptr(13, b) = 2; these pointers correspond to the occurrences of P in documents b and c.

Page 66:

Suffix Tree Based Approach (4/4)

To enumerate the documents corresponding to a node x, find all pointers that begin in the subtree of x and point to an ancestor of x. By the lemma, these pointers correspond one-to-one to the hit documents.

Page 67:

Geometric Interpretation

(figure: the suffix tree laid out with the leaf indexes 1…14 on the x-axis and node depth 0…3 on the y-axis)

pd(v, c) := the depth of the target of ptr(v, c). Place every pointer ptr(v, c) as a point (x, pd(v, c)), where x is the index of the node v in the left-to-right layout.

Page 68:

Document query = three-sided 2D query

For v := locus(Q), let d := depth(v), and let r be the maximal index assigned to v or its descendants. DL can then be solved by RangeList(v, r, 0, d-1).

(figure: the pointer points for v = 2 and v = 10 in the depth-indexed layout)

Page 69:

Top-K query

Each ptr(v, d) can store any score function depended on Q, D. For each node v, f ptr(x, d) points to vTop-K

Theorem [Navarro+ SODA 12] k most highly weighted points in

the range [a, b] x [0, h] can be reported in decreasing order of

their weights in O(h+k) time

In practice, we can use a priority queue on each node.

Page 70:

Bibliography

[Muthukrishnan SODA 02] Efficient Algorithms for Document Retrieval Problems, S. Muthukrishnan, SODA 2002

[Golynski+ SODA 06] Rank/Select Operations on Large Alphabets: a Tool for Text Indexing, A. Golynski, J. I. Munro, S. S. Rao, SODA 2006

[Ferragina+ ACM TALG 07] Compressed Representations of Sequences and Full-Text Indexes, P. Ferragina, G. Manzini, V. Mäkinen, G. Navarro, ACM Transactions on Algorithms 2007

[R. González+ CPM 07] Compressed Text Indexes with Fast Locate, R. González, G. Navarro, CPM 2007

[Mäkinen+ SPIRE 07] Implicit Compression Boosting with Applications to Self-Indexing, V. Mäkinen, G. Navarro, SPIRE 2007

[W.K. Hon+ FOCS 09] Space-Efficient Framework for Top-k String Retrieval Problems, W. K. Hon, R. Shah, J. S. Vitter, FOCS 2009

[Barbay+ ISAAC 10] Alphabet Partitioning for Compressed Rank/Select and Applications, J. Barbay, T. Gagie, G. Navarro, Y. Nekrich, ISAAC 2010

[Gagie+ TCS 11] New Algorithms on Wavelet Trees and Applications to Information Retrieval, T. Gagie, G. Navarro, S. J. Puglisi, Theoretical Computer Science 2011

[Navarro+ SEA 11] Practical Compressed Document Retrieval, G. Navarro, S. J. Puglisi, D. Valenzuela, SEA 2011

[G. Navarro+ 11] Practical Top-K Document Retrieval in Reduced Space, 2011

[Kärkkäinen+ SPIRE 11] Fixed Block Compression Boosting in FM-indexes, J. Kärkkäinen, S. J. Puglisi, SPIRE 2011

[Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear Space, G. Navarro, Y. Nekrich, SODA 2012