
PSC2015

Computing Left-Right Maximal Generic Words

Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Kyushu University, Japan

August 24-26, 2015

A string that is common to several documents characterizes that set of documents.

Characteristic String of Documents

T1 = p r a g u e s t r i n g a b c

T2 = b a c o m p s c i e n c e a p

T3 = a p s c r e e n a p s c i t e

T4 = s t r c o n f e r e n c e a b

T5 = w e p s c o m p r e s s e d a

WD(x) : number of distinct strings in D which have x as a substring.
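A minimal sketch of WD(x) as a naive substring scan, assuming the five example strings above are written without spaces. The function name doc_frequency is hypothetical and is not taken from the slides.

```python
def doc_frequency(docs, x):
    """W_D(x): number of distinct strings in docs that contain x as a substring."""
    return sum(1 for t in set(docs) if x in t)


# The five example strings T1..T5, with the spaces removed.
DOCS = [
    "praguestringabc",
    "bacompscienceap",
    "apscreenapscite",
    "strconferenceab",
    "wepscompresseda",
]

if __name__ == "__main__":
    for word in ("str", "psc", "comp"):
        print(word, doc_frequency(DOCS, word))
```

This scan takes time proportional to the total length of D per query; it only illustrates the definition.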

d-Right-Maximal Generic Words

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.

Problem [Kucherov et al., SPIRE 2012]

A string x is a d-right-maximal extension of P if
• P is a prefix of x
• WD(x) ≥ d
• WD(xa) < d for any character a.

Example



T1 = a b a b a a b a a a a c b

T2 = c b a a b a c a b a a b c

T3 = b b a b a a c a

P = aa, d = 2

output = {aaba, aac}
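The example can be checked with a brute-force sketch that follows the definition directly: extend P one character at a time while the extension still occurs in at least d strings. The function names are hypothetical, and this is not the O(n)-space data structure of Kucherov et al.

```python
def doc_frequency(docs, x):
    """W_D(x): number of strings in docs that contain x (docs assumed distinct)."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, p, d):
    """All x with prefix p, W_D(x) >= d, and W_D(xa) < d for every character a."""
    alphabet = sorted(set("".join(docs)))
    answers = set()
    stack = [p] if doc_frequency(docs, p) >= d else []
    while stack:
        x = stack.pop()
        # One-character right extensions that still occur in at least d strings.
        frequent = [x + a for a in alphabet if doc_frequency(docs, x + a) >= d]
        if frequent:
            stack.extend(frequent)
        else:
            answers.add(x)  # no extension reaches d, so x is d-right-maximal
    return answers


if __name__ == "__main__":
    docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
    print(sorted(right_maximal_extensions(docs, "aa", 2)))
```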


There exists an O(n)-space data structure which can compute all d-right-maximal extensions of P in O(|P| + rocc) time. The data structure can be constructed in O(n) time.

Theorem [Kucherov et al., SPIRE 2012]

n : total length of strings in D
rocc : number of d-right-maximal extensions of P

Each d-right-maximal extension corresponds to a branching node in the generalized suffix tree of D.

Each leaf of generalized suffix tree of D corresponds to a suffix of a string in D.

Generalized Suffix Tree (GST)

Example T1 = aabaab, T2 = aabab, T3 = babaaa

[Figure: generalized suffix tree of T1$1, T2$2 and T3$3.]

Notations on generalized suffix tree of D

Generalized Suffix Tree (GST)

GSTD : generalized suffix tree of D

GSTD(u) : subtree rooted at a node u

strD(u) : string which is represented by a node u in GSTD

weightD(u) := WD(strD(u))

maxchildD(u) : maximum weight among the children of u

L(P) : locus of P

Each answer corresponds to a branching node in GSTD(L(P)).



[Figure: the same generalized suffix tree, with the locus L(P) and the answer node marked.]

P = ab, d = 2, output = {abaa}

weightD(u) ≥ 2, maxchildD(u) < 2
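The condition above can be tried out on a small stand-in for GSTD. The sketch below uses an uncompacted suffix trie (an assumption made for brevity, not the suffix tree itself) and stores in each node the set of strings passing through it, so that weight and maxchild can be read off directly. All names are hypothetical.

```python
def build_suffix_trie(docs):
    """Uncompacted trie of all suffixes; each node records which documents reach it."""
    root = {"children": {}, "docs": set()}
    for i, t in enumerate(docs):
        for start in range(len(t)):
            node = root
            node["docs"].add(i)
            for ch in t[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "docs": set()})
                node["docs"].add(i)
    return root


def weight(node):
    return len(node["docs"])


def maxchild(node):
    return max((weight(c) for c in node["children"].values()), default=0)


def locus(root, p):
    """Node reached by spelling p from the root, or None if p does not occur."""
    node = root
    for ch in p:
        node = node["children"].get(ch)
        if node is None:
            return None
    return node


def answers_below(root, p, d):
    """Strings below the locus of p with weight >= d and maxchild < d."""
    start = locus(root, p)
    if start is None:
        return set()
    out, stack = set(), [(start, p)]
    while stack:
        node, s = stack.pop()
        if weight(node) < d:
            continue  # weights never increase going down, so prune here
        if maxchild(node) < d:
            out.add(s)
        for ch, child in node["children"].items():
            stack.append((child, s + ch))
    return out


if __name__ == "__main__":
    trie = build_suffix_trie(["aabaab", "aabab", "babaaa"])
    print(answers_below(trie, "ab", 2))
```

The trie can take quadratic space, so it is only suitable for checking small examples.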

New Problem

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.

Problem

Example

T1 = a b a b a a b a a a a c b

T2 = c b a a b a c a b a a b c

T3 = b b a b a a c a

P = aa, d = 2 output = {baaba, abaab, babaa}

Our Contribution


Problem

There exists an O(n log n)-space data structure which can compute all d-left-right-maximal extensions of P in O(|P| + occ log² n + rocc log n) time.

Theorem

n : total length of strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P

An answer does not necessarily correspond to a branching node in GSTD(L(P)).

d-Left-Right-Maximal Generic Words


[Figure: the same generalized suffix tree, with the answers for P = ab marked.]

P = ab, d = 2, output = {abaa, aaba}

Main Idea

Each d-left-right-maximal extension of P has a right (not necessarily maximal) extension of P as a suffix.

If we compute the d-left-maximal extensions of all right extensions of P, we obtain all answers.

We consider such extensions on GST.

For each branching right (not necessarily maximal) extension of P, we compute its d-left-maximal extensions.

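[Figure: GSTD with the locus L(P) and a right extension u of weight ≥ d below it, next to GSTDR with the node r(u) and the nodes v below it.]

DR = {T1R, …, TmR}, where TiR is the reversal of Ti

r(u) = L(str(u)R) : locus of the reversed string of u in GSTDR

weightDR(v) ≥ d

maxchildDR(v) < d

Such nodes v are candidates for answers.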
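The role of the reversed strings can be sketched by brute force: extending a string to the left in D is the same as extending its reversal to the right in DR. The helper names below are hypothetical, and the naive scan stands in for the search in GSTDR.

```python
def doc_frequency(docs, x):
    """W_D(x): number of strings in docs that contain x (docs assumed distinct)."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, p, d):
    """Brute-force d-right-maximal extensions of p (p is a prefix of each result)."""
    alphabet = sorted(set("".join(docs)))
    answers = set()
    stack = [p] if doc_frequency(docs, p) >= d else []
    while stack:
        x = stack.pop()
        frequent = [x + a for a in alphabet if doc_frequency(docs, x + a) >= d]
        if frequent:
            stack.extend(frequent)
        else:
            answers.add(x)
    return answers


def left_maximal_extensions(docs, s, d):
    """d-left-maximal extensions of s (s is a suffix of each result), via D^R."""
    rev_docs = [t[::-1] for t in docs]  # D^R
    return {x[::-1] for x in right_maximal_extensions(rev_docs, s[::-1], d)}


if __name__ == "__main__":
    docs = ["aabaab", "aabab", "babaaa"]
    # "aba" is a right extension of P = ab; extend it to the left.
    print(left_maximal_extensions(docs, "aba", 2))
```

For D = {aabaab, aabab, babaaa} and d = 2, the right extension aba of P = ab yields the candidate aaba.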

cand(u) : set of candidate nodes v in GSTDR(r(u)) for a node u

REx : set of right extensions u of P in GSTD(L(P))

Cand(REx) = ∪u∈REx cand(u) : set of candidates

Cand(REx) may contain non-answers.

We want to remove such nodes from Cand(REx), so we characterize them.

Cand(REx)

The nodes in Cand(REx) which are not answers are not d-right-maximal.

Non-answers


We should check whether each candidate is d-right-maximal or not. To do so, we need information about the node r’(v) for each node v in GSTDR.

r’(v) : node in GSTD s.t. str(v)R = str(r’(v)) (it may be an implicit node)

Remove non-answers

[Figure: GSTD and GSTDR as before, with a candidate node v in GSTDR(r(u)) and its counterpart r’(v) in GSTD.]

We check whether the node v is d-right-maximal or not by checking maxchildD(r’(v)).

A d-left-maximal node v is also d-right-maximal iff maxchildD(r’(v)) < d.

We define the following subset of answers.


cand’(u) = {v | v∈cand(u) and maxchildD(r’(v)) < d}

We compute cand’(u) by using a range reporting query.

preord(v) : preorder rank of v in GSTD’

end(v) : maximum preorder rank in GSTD’(v)
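A minimal sketch of preord and end on an arbitrary rooted tree, assuming the tree is given as a dictionary from a node to its list of children (a hypothetical representation). A node v lies in the subtree of r exactly when preord(r) ≤ preord(v) ≤ end(r).

```python
def preorder_ranks(children, root):
    """Return (preord, end): preorder rank of each node and max rank in its subtree."""
    preord, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants have been numbered; the last rank handed out is end(node).
            end[node] = counter - 1
            continue
        preord[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return preord, end


def in_subtree(preord, end, r, v):
    """True iff v is in the subtree rooted at r."""
    return preord[r] <= preord[v] <= end[r]


if __name__ == "__main__":
    tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
    preord, end = preorder_ranks(tree, "root")
    print(preord, end)
```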

Computing cand’(u)

The nodes v in cand’(u) satisfy the following.

• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• weight(v) ≥ d
• maxchild(v) < d
• maxchild(r’(v)) < d

Equivalently:

• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)

We compute the nodes which satisfy these formulas by using a segment intersection query.

The nodes in GSTD’ correspond to horizontal segments. The query corresponds to a vertical segment.

Segment Intersection Query Problem

[Figure: one horizontal segment per node v, and a vertical query segment at d spanning preord(r(u)) to end(r(u)).]

preord(r(u)) ≤ preord(v) ≤ end(r(u))
max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)

The number of horizontal segments is O(n).
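The reduction can be sketched with a linear scan in place of a real segment intersection structure. Each node v becomes a horizontal segment at height preord(v) covering the integer thresholds max{maxchild(v), maxchild(r’(v))} + 1, …, weight(v); the query is the vertical segment at d between preord(r(u)) and end(r(u)). The names and the sample values below are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    y: int      # preord(v)
    x_lo: int   # max{maxchild(v), maxchild(r'(v))} + 1
    x_hi: int   # weight(v)
    node: str


def make_segment(node, preord, weight, maxchild, maxchild_r):
    """Horizontal segment of thresholds d with max{...} < d <= weight (may be empty)."""
    return Segment(preord, max(maxchild, maxchild_r) + 1, weight, node)


def stab(segments, d, y_lo, y_hi):
    """Nodes whose segment crosses the vertical segment x = d, y_lo <= y <= y_hi."""
    return [s.node for s in segments
            if y_lo <= s.y <= y_hi and s.x_lo <= d <= s.x_hi]


# Hypothetical node data: (name, preord, weight, maxchild(v), maxchild(r'(v))).
SEGMENTS = [
    make_segment("v1", 2, 3, 1, 1),
    make_segment("v2", 3, 2, 2, 0),
    make_segment("v3", 4, 2, 1, 2),
    make_segment("v4", 7, 2, 1, 1),
]

if __name__ == "__main__":
    print(stab(SEGMENTS, d=2, y_lo=1, y_hi=5))
```

The scan costs O(n) per query; it only shows which nodes the query should report.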

Computing cand’(u)

A segment intersection query can be answered in O(loglog n + k) time with an O(n)-space data structure, where n is the number of segments and k is the size of the output.

Lemma [Chan, 2013]

For any node u in REx, cand’(u) can be computed in O(loglog n + |cand’(u)|) time with an O(n)-space data structure.

Lemma

We can obtain the set of answers by computing cand’(u) for all nodes u in REx.

There may be duplicates, as well as nodes u s.t. cand’(u) = ∅.

We can skip such right extensions by using a range reporting query and a binary search on GST.

Meaningful Right Extensions


Theorem

Conclusion


Problem


Theorem

n : total length of strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P

Consider a more efficient algorithm.

Can the single-document version, a special case of this problem, be solved more easily?

Consider the minimal discriminating words problem for left-right extensions.

Future Work

Thank You !

Cand’(REx) may contain duplicates because of the definition of REx.

About Cand’(REx)

We want to remove such nodes from Cand’(REx), so we characterize them.

If there exists an answer in which P occurs at least twice, there exist duplicated answers.

Duplicated Answers


Let u be a node in REx s.t. P occurs in str(u) at least twice. For any node v s.t. str(v) is a proper suffix of str(u), cand’(u) ⊆ cand’(v).

Lemma


We use the following lemma.

Checking P’s

Let u be a node in REx. preord(u1) < beg(L(P)) ≤ end(L(P)) < preord(u2) iff P occurs in str(u) exactly once (as a prefix of str(u)).

Lemma

k : index s.t. SAstr(u)[k] = 1, where SAstr(u) is the suffix array of str(u)

u1 : str(u1) = str(u)[SAstr(u)[k−1]..|str(u)|]

u2 : str(u2) = str(u)[SAstr(u)[k+1]..|str(u)|]

[Figure: GSTD with L(P) and u, and the suffix array SA of str(u) with adjacent entries i, 1, j for str(u1), str(u) and str(u2); in the second case, P = str(u2).]