Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan August 24-26, 2015 PSC2015



PSC2015

Computing Left-Right Maximal Generic Words

Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Kyushu University, Japan

August 24-26, 2015


Characteristic String of Documents

A string that is common to several documents characterizes that set of documents.

T1 = p r a g u e s t r i n g a b c

T2 = b a c o m p s c i e n c e a p

T3 = a p s c r e e n a p s c i t e

T4 = s t r c o n f e r e n c e a b

T5 = w e p s c o m p r e s s e d a


d-Right-Maximal Generic Words

Problem [Kucherov et al., SPIRE 2012]

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.

W_D(x) : number of distinct strings in D that have x as a substring.

A string x is a d-right-maximal extension of P if
• P is a prefix of x,
• W_D(x) ≥ d, and
• W_D(xa) < d for any character a.


d-Right-Maximal Generic Words

Example

T1 = a b a b a a b a a a a c b

T2 = c b a a b a c a b a a b c

T3 = b b a b a a c a

P = aa, d = 2

output = {aaba, aac}
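
The example above can be checked by brute force. The following is a minimal sketch that applies the definition directly (P is a prefix of x, W_D(x) ≥ d, and W_D(xa) < d for every character a). The function names `doc_count` and `right_maximal_extensions` are illustrative assumptions, and the running time is far from the O(|P| + rocc) bound of the data structure discussed in the talk.

```python
def doc_count(docs, x):
    """W_D(x): number of documents that contain x as a substring."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, pattern, d):
    """Brute-force d-right-maximal extensions of `pattern` (illustration only)."""
    alphabet = sorted(set("".join(docs)))
    answers = set()
    stack = [pattern]
    while stack:
        x = stack.pop()
        if doc_count(docs, x) < d:
            continue  # x occurs in fewer than d documents, so do its extensions
        # One-character right extensions that still occur in at least d documents.
        frequent = [x + a for a in alphabet if doc_count(docs, x + a) >= d]
        if frequent:
            stack.extend(frequent)
        else:
            answers.add(x)  # W_D(x) >= d and W_D(xa) < d for every character a
    return answers


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
print(sorted(right_maximal_extensions(docs, "aa", 2)))  # ['aaba', 'aac']
```

Every string on the stack has P as a prefix, so the first condition of the definition holds by construction.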


d-Right-Maximal Generic Words

Theorem [Kucherov et al., SPIRE 2012]

There exists an O(n)-space data structure that can compute all d-right-maximal extensions of P in O(|P| + rocc) time. The data structure can be constructed in O(n) time.

n : total length of the strings in D
rocc : number of d-right-maximal extensions of P

Each d-right-maximal extension corresponds to a branching node in the generalized suffix tree of D.


Generalized Suffix Tree (GST)

Each leaf of the generalized suffix tree of D corresponds to a suffix of a string in D.

Example: T1 = aabaab, T2 = aabab, T3 = babaaa

[Figure: generalized suffix tree of T1, T2, T3 with end markers $1, $2, $3; each leaf is labeled with the number (1, 2, or 3) of the string its suffix belongs to.]


Generalized Suffix Tree (GST)

Notation on the generalized suffix tree of D:

GST_D : generalized suffix tree of D
GST_D(u) : subtree rooted at a node u
str_D(u) : string represented by a node u in GST_D
weight_D(u) : W_D(str_D(u))
maxchild_D(u) : maximum weight among the children of u
L(P) : locus of P


d-Right-Maximal Generic Words

Each answer corresponds to a branching node in GST_D(L(P)).

Example: T1 = aabaab, T2 = aabab, T3 = babaaa

[Figure: the generalized suffix tree of T1, T2, T3, with the locus L(P) and the answer node u marked.]

P = ab, d = 2, output = {abaa}

The answer node u satisfies weight_D(u) ≥ 2 and maxchild_D(u) < 2.
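
The query on this slide can be sketched on a naive tree. The code below is only an illustration under simplifying assumptions: it uses an uncompacted suffix trie without the end markers $1, $2, $3 instead of a linear-size generalized suffix tree, and the names `Node`, `build_suffix_trie`, `locus`, and `query` are made up for this sketch. It descends to L(P) and reports every node u below it with weight(u) ≥ d and maxchild(u) < d.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.docs = set()   # indices of the documents that contain str(node)


def build_suffix_trie(docs):
    """Insert every suffix of every document (quadratic size, illustration only)."""
    root = Node()
    for i, t in enumerate(docs):
        for start in range(len(t)):
            node = root
            for ch in t[start:]:
                node = node.children.setdefault(ch, Node())
                node.docs.add(i)
    return root


def weight(node):
    return len(node.docs)


def maxchild(node):
    return max((weight(c) for c in node.children.values()), default=0)


def locus(root, pattern):
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def query(root, pattern, d):
    """Strings of the nodes below L(pattern) with weight >= d and maxchild < d."""
    start = locus(root, pattern)
    if start is None or weight(start) < d:
        return []
    out, stack = [], [(start, pattern)]
    while stack:
        node, s = stack.pop()
        if maxchild(node) < d:
            out.append(s)
            continue
        for ch, child in node.children.items():
            if weight(child) >= d:  # prune subtrees that are already too rare
                stack.append((child, s + ch))
    return sorted(out)


root = build_suffix_trie(["aabaab", "aabab", "babaaa"])
print(query(root, "ab", 2))  # ['abaa']
```

Because weights never increase along a downward path, pruning children with weight below d does not lose any answer.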


New Problem

Problem

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.

Example

T1 = a b a b a a b a a a a c b

T2 = c b a a b a c a b a a b c

T3 = b b a b a a c a

P = aa, d = 2, output = {baaba, abaab, babaa}


Our Contribution

Problem

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.

Theorem

There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.

n : total length of the strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P


d-Left-Right-Maximal Generic Words

An answer does not necessarily correspond to a branching node in GST_D(L(P)).

Example: T1 = aabaab, T2 = aabab, T3 = babaaa

[Figure: the generalized suffix tree of T1, T2, T3.]

P = ab, d = 2, output = {abaa, aaba}




Main Idea

Each d-left-right-maximal extension of P has a right extension of P (not necessarily maximal) as a suffix.

[Figure: an occurrence of P in a string Ti.]

If we compute the d-left-maximal extensions of all right extensions of P, we obtain all answers.

We consider such extensions on the GST.


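Main Idea

For every branching right extension u of P (not necessarily maximal), we compute its d-left-maximal extensions.

D^R = {T1^R, …, Tm^R}
r(u) = L(str(u)^R) : locus of str(u)^R in GST_{D^R}

The nodes v in the subtree of r(u) that satisfy weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d are the candidates for answers.

[Figure: GST_D with L(P) and a node u below it, next to GST_{D^R} with r(u); nodes are annotated with whether their weight is ≥ d or < d.]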
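
The role of the reversed set D^R can be illustrated at the string level. The sketch below assumes that a d-left-maximal extension is the mirror image of the d-right-maximal definition (s is a suffix of x, W_D(x) ≥ d, and W_D(ax) < d for every character a); the slides do not spell this definition out. Under that assumption, the left-maximal extensions of s in D are the reversals of the right-maximal extensions of s^R in D^R. All function names are illustrative.

```python
def doc_count(docs, x):
    """W_D(x): number of documents that contain x as a substring."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, s, d):
    """Brute-force d-right-maximal extensions of s (illustration only)."""
    if doc_count(docs, s) < d:
        return set()
    alphabet = sorted(set("".join(docs)))
    answers, stack = set(), [s]
    while stack:
        x = stack.pop()
        frequent = [x + a for a in alphabet if doc_count(docs, x + a) >= d]
        if frequent:
            stack.extend(frequent)
        else:
            answers.add(x)
    return answers


def left_maximal_extensions(docs, s, d):
    """Extend s to the left by extending s^R to the right in D^R."""
    reversed_docs = [t[::-1] for t in docs]  # D^R
    found = right_maximal_extensions(reversed_docs, s[::-1], d)
    return {x[::-1] for x in found}


docs = ["aabaab", "aabab", "babaaa"]
print(left_maximal_extensions(docs, "aba", 2))   # {'aaba'}
print(left_maximal_extensions(docs, "abaa", 2))  # {'abaa'}
```

Here "aba" and "abaa" are right extensions of P = ab; their left-maximal extensions for d = 2 are the two strings listed in the earlier example output.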


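Main Idea

cand(u) : set of candidate nodes in GST_{D^R}(r(u)) for a node u
REx : set of nodes u below L(P) that represent the right extensions of P
Cand(REx) = ∪_{u ∈ REx} cand(u) : set of candidates

[Figure: GST_D with L(P), u, and REx, next to GST_{D^R} with r(u) = L(str(u)^R) and cand(u).]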


Cand(REx)

Cand(REx) may contain non-answers.

We want to remove such nodes from Cand(REx), so we characterize them.


Non-answers

The nodes in Cand(REx) that are not answers are not d-right-maximal.

[Figure: an occurrence of P in a string Ti, with rejected extensions marked ×.]

We should check whether each candidate is d-right-maximal. To do so, we need the node r'(v) for each node v in GST_{D^R}.

r'(v) : node in GST_D s.t. str(v)^R = str(r'(v)). (It may be an implicit node.)


Remove non-answers

We check whether a candidate node v is d-right-maximal by checking maxchild_D(r'(v)).

• weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d : d-left-maximal
• maxchild_D(r'(v)) < d : d-right-maximal

[Figure: GST_D with L(P), u, and r'(v), next to GST_{D^R} with r(u) = L(str(u)^R) and v.]


Remove non-answers

We define the following set of answers among the candidates of u.

cand'(u) = {v | v ∈ cand(u) and maxchild_D(r'(v)) < d}

We compute cand'(u) by using a range reporting query.
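
The filter in the definition of cand'(u) can be sketched on strings. In the code below a node v of GST_{D^R} is represented by its string str(v), and r'(v) by the reversed string str(v)^R. The helper names and the hand-picked candidate set are assumptions made for this illustration; the scan is linear and is not the range reporting structure mentioned above.

```python
def weight(docs, x):
    """W_D(x): number of documents that contain x as a substring."""
    return sum(1 for t in docs if x in t)


def maxchild(docs, x, alphabet):
    """Largest weight among the one-character right extensions of x."""
    return max((weight(docs, x + a) for a in alphabet), default=0)


def cand_prime(docs, cand, d):
    """Keep str(v) only if str(v)^R = str(r'(v)) is d-right-maximal in D."""
    alphabet = sorted(set("".join(docs)))
    return {v for v in cand if maxchild(docs, v[::-1], alphabet) < d}


docs = ["aabaab", "aabab", "babaaa"]
# Hypothetical candidates, written as strings of the reversed tree:
# "baa" is the reversal of "aab", and "abaa" is the reversal of "aaba".
cand = {"baa", "abaa"}
print(cand_prime(docs, cand, 2))  # {'abaa'}
```

The candidate "baa" is dropped because "aab" can still be extended to "aaba", which occurs in two documents; "abaa" survives, and its reversal "aaba" is reported.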


Computing cand'(u)

preord(v) : preorder rank of v in GST_D'
end(v) : maximum preorder rank in GST_D'(v)

The nodes v in cand'(u) satisfy the following:
• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• weight(v) ≥ d
• maxchild(v) < d
• maxchild(r'(v)) < d

Equivalently:
• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• max{maxchild(v), maxchild(r'(v))} < d ≤ weight(v)

We find the nodes that satisfy these conditions by using a segment intersection query.


Segment Intersection Query Problem

The nodes in GST_D' correspond to horizontal segments. The query corresponds to a vertical segment.

[Figure: horizontal segments for the nodes and a vertical query segment at d that spans preord(r(u)) to end(r(u)).]

preord(r(u)) ≤ preord(v) ≤ end(r(u))
max{maxchild(v), maxchild(r'(v))} < d ≤ weight(v)
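
The reduction can be made concrete with a linear scan. In this sketch each node v becomes a horizontal segment at height preord(v) that covers the thresholds d with max{maxchild(v), maxchild(r'(v))} < d ≤ weight(v), and a query is a vertical segment at x = d between two preorder ranks. The `Segment` type, the `stab` function, and the sample values are hypothetical; a real implementation would replace the scan with a dedicated segment intersection data structure.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    preord: int  # preorder rank of node v (height of the segment)
    low: int     # max{maxchild(v), maxchild(r'(v))}
    high: int    # weight(v)


def stab(segments, lo_rank, hi_rank, d):
    """Ranks of the segments crossed by the vertical query at x = d."""
    return [s.preord for s in segments
            if lo_rank <= s.preord <= hi_rank and s.low < d <= s.high]


# Made-up values for five nodes, listed in preorder.
segments = [
    Segment(1, 2, 3),
    Segment(2, 1, 2),
    Segment(3, 0, 1),
    Segment(4, 1, 3),
    Segment(5, 2, 2),
]
print(stab(segments, 2, 4, 2))  # [2, 4]
```

The subtree of r(u) occupies a contiguous range of preorder ranks, which is why one vertical segment is enough to describe the query.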


Computing cand'(u)

The number of horizontal segments is O(n).

Lemma [Chan, 2013]

A segment intersection query can be answered in O(log log n + k) time with an O(n)-space data structure, where n is the number of segments and k is the size of the output.

Lemma

For any node u in REx, cand'(u) can be computed in O(log log n + |cand'(u)|) time with an O(n)-space data structure.


Meaningful Right Extensions

We can obtain the set of answers by computing cand'(u) for all nodes u in REx.

However, the same answer may be reported more than once, and there are nodes u s.t. cand'(u) = ∅.

We can skip such right extensions by using a range reporting query and a binary search on the GST.

Theorem

There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.


Conclusion

Problem

Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.

Theorem

There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.

n : total length of the strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P


Future Work

• Consider a more efficient algorithm.
• Can the single-document version (a special case of this problem) be solved more easily?
• Consider the minimal discriminating words problem for left-right extensions.

Thank you!


About Cand'(REx)

Cand'(REx) may contain duplicates because of the definition of REx.

We want to remove such duplicates from Cand'(REx), so we characterize them.


Duplicated Answers

If there exists an answer in which P occurs at least twice, then there are duplicated answers.

[Figure: a string Ti with several occurrences of P; duplicated extensions are marked ×.]

Lemma

Let u be a node in REx s.t. P occurs in str(u) at least twice. For any node v s.t. str(v) is a proper suffix of str(u), cand'(u) ⊆ cand'(v).


Checking P's

We use the following lemma.

Lemma

Let u be a node in REx. Then preord(u1) < beg(L(P)) ≤ end(L(P)) < preord(u2) iff P occurs in str(u) exactly once (as a prefix of str(u)).

k : index s.t. SA_{str(u)}[k] = 1
u1 : node s.t. str(u1) = str(u)[SA_{str(u)}[k−1]..|str(u)|]
u2 : node s.t. str(u2) = str(u)[SA_{str(u)}[k+1]..|str(u)|]
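
A string-level reading of this lemma can be sketched as follows. The suffixes that start with P are contiguous in lexicographic order, and str(u) itself starts with P, so P occurs only once exactly when neither lexicographic neighbor of str(u) starts with P. The function below is a naive illustration of that idea with a sorted list of suffixes standing in for the suffix array; it is not the preorder-rank comparison on GST_D used in the slides, and the name `occurs_once` is an assumption.

```python
def occurs_once(s, p):
    """True iff p, assumed to be a prefix of the non-empty string s, occurs in s exactly once."""
    if not s or not s.startswith(p):
        raise ValueError("p must be a prefix of a non-empty s")
    suffixes = sorted(s[i:] for i in range(len(s)))  # naive suffix order
    k = suffixes.index(s)                            # rank of the whole string
    neighbors = []
    if k > 0:
        neighbors.append(suffixes[k - 1])            # plays the role of str(u1)
    if k + 1 < len(suffixes):
        neighbors.append(suffixes[k + 1])            # plays the role of str(u2)
    return not any(t.startswith(p) for t in neighbors)


print(occurs_once("abaa", "ab"))     # True
print(occurs_once("aabaab", "aab"))  # False
```

When a neighbor does not exist, the corresponding side of the inequality is treated as satisfied.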


Checking P's

[Figure: GST_D with L(P) and u, and the suffix array SA of str(u) with consecutive entries i, 1, j that correspond to str(u1), str(u), and str(u2).]


Checking P's

[Figure: the same picture of GST_D and the suffix array SA of str(u), for the case P = str(u2).]