PSC2015
Computing Left-Right Maximal Generic Words
Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
Kyushu University, Japan
August 24-26, 2015
A string that is common to some of the documents characterizes that set of documents.
Characteristic String of Documents
T1 = p r a g u e s t r i n g a b c
T2 = b a c o m p s c i e n c e a p
T3 = a p s c r e e n a p s c i t e
T4 = s t r c o n f e r e n c e a b
T5 = w e p s c o m p r e s s e d a
WD(x) : number of distinct strings in D which have x as a substring.
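A minimal sketch of WD(x) on the five example documents, assuming the spaced letters on the slide are plain strings; the function name `doc_frequency` is ours, not from the talk.

```python
def doc_frequency(docs, x):
    """W_D(x): number of distinct strings in docs that contain x as a substring."""
    return sum(1 for t in set(docs) if x in t)


# The five documents from the slide, written without the spacing (an assumption).
docs = [
    "praguestringabc",
    "bacompscienceap",
    "apscreenapscite",
    "strconferenceab",
    "wepscompresseda",
]

print(doc_frequency(docs, "psc"))  # occurs in T2, T3 and T5
print(doc_frequency(docs, "str"))  # occurs in T1 and T4
```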
d-Right-Maximal Generic Words
Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.
Problem [Kucherov et al., SPIRE 2012]
A string x is a d-right-maximal extension of P if
• P is a prefix of x
• WD(x) ≥ d
• WD(xa) < d for any character a.
Example
T1 = a b a b a a b a a a a c b
T2 = c b a a b a c a b a a b c
T3 = b b a b a a c a
P = aa, d = 2
output = {aaba, aac}
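The example can be checked by brute force straight from the definition. This is a quadratic-time sketch for illustration only, not the suffix-tree method of Kucherov et al.; the function names are ours.

```python
def doc_frequency(docs, x):
    """W_D(x): number of distinct strings in docs that contain x as a substring."""
    return sum(1 for t in set(docs) if x in t)


def right_maximal_extensions(docs, pattern, d):
    """All x with prefix pattern, W_D(x) >= d and W_D(xa) < d for every character a."""
    alphabet = set("".join(docs))
    # Any answer occurs in some document, so it is a substring starting at an
    # occurrence of the pattern.
    candidates = set()
    for t in docs:
        start = t.find(pattern)
        while start != -1:
            for end in range(start + len(pattern), len(t) + 1):
                candidates.add(t[start:end])
            start = t.find(pattern, start + 1)
    return {
        x
        for x in candidates
        if doc_frequency(docs, x) >= d
        and all(doc_frequency(docs, x + a) < d for a in alphabet)
    }


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
print(sorted(right_maximal_extensions(docs, "aa", 2)))  # ['aaba', 'aac']
```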
d-Right-Maximal Generic Words
There exists an O(n)-space data structure that can compute all d-right-maximal extensions of P in O(|P| + rocc) time. The data structure can be constructed in O(n) time.
Theorem [Kucherov et al., SPIRE 2012]
n : total length of strings in D
rocc : number of d-right-maximal extensions of P
Each d-right-maximal extension corresponds to a branching node in the generalized suffix tree of D.
Each leaf of the generalized suffix tree of D corresponds to a suffix of a string in D.
Generalized Suffix Tree (GST)
Example T1 = aabaab, T2 = aabab, T3 = babaaa
[Figure: the generalized suffix tree of T1, T2 and T3, with end markers $1, $2, $3]
Notations on generalized suffix tree of D
Generalized Suffix Tree (GST)
GSTD : generalized suffix tree of D
GSTD(u) : subtree rooted at a node u
strD(u) : string represented by node u in GSTD
weightD(u) := WD(strD(u))
maxchildD(u) : maximum weight among the children of u
L(P) : locus of P
Each answer corresponds to a branching node in GSTD(L(P)).
d-Right-Maximal Generic Words
Example T1 = aabaab, T2 = aabab, T3 = babaaa
[Figure: the same generalized suffix tree, with L(P) and the answer node marked]
P = ab, d = 2, output = {abaa}
The answer node u lies in GSTD(L(P)) and satisfies weightD(u) ≥ 2 and maxchildD(u) < 2.
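The notation above can be tried out on a small scale. The sketch below uses an uncompacted suffix trie (quadratic size) instead of a real generalized suffix tree, and assumes the documents are pairwise distinct; `Node`, `build_suffix_trie` and `right_maximal_below` are illustrative names, not from the talk.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.docs = set()  # ids of the documents containing this node's string


def build_suffix_trie(docs):
    """Insert every suffix of every document; each node records its documents."""
    root = Node()
    for doc_id, t in enumerate(docs):
        for i in range(len(t)):
            node = root
            for ch in t[i:]:
                node = node.children.setdefault(ch, Node())
                node.docs.add(doc_id)
    return root


def weight(node):
    return len(node.docs)


def maxchild(node):
    return max((weight(c) for c in node.children.values()), default=0)


def locus(root, pattern):
    """Node spelling the pattern, or None if the pattern does not occur."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def right_maximal_below(root, pattern, d):
    """Strings below the locus with weight >= d and maxchild < d."""
    start = locus(root, pattern)
    if start is None:
        return set()
    answers, stack = set(), [(start, pattern)]
    while stack:
        node, s = stack.pop()
        if weight(node) < d:
            continue  # weights never increase on the way down
        if maxchild(node) < d:
            answers.add(s)
        stack.extend((child, s + ch) for ch, child in node.children.items())
    return answers


root = build_suffix_trie(["aabaab", "aabab", "babaaa"])
print(right_maximal_below(root, "ab", 2))  # {'abaa'}
```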
New Problem
Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Problem
Example
T1 = a b a b a a b a a a a c b
T2 = c b a a b a c a b a a b c
T3 = b b a b a a c a
P = aa, d = 2 output = {baaba, abaab, babaa}
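The slides do not spell out the definition of a d-left-right-maximal extension. The sketch below assumes the natural one: P is a substring of x, WD(x) ≥ d, and both WD(ax) < d and WD(xa) < d for every character a. Under that assumption, each string listed in the output passes the check.

```python
def doc_frequency(docs, x):
    """W_D(x): number of distinct strings in docs that contain x as a substring."""
    return sum(1 for t in set(docs) if x in t)


def is_left_right_maximal(docs, pattern, d, x):
    """Assumed definition: x contains the pattern, occurs in at least d documents,
    and every one-character extension to the left or right occurs in fewer than d."""
    alphabet = set("".join(docs))
    return (
        pattern in x
        and doc_frequency(docs, x) >= d
        and all(doc_frequency(docs, a + x) < d for a in alphabet)
        and all(doc_frequency(docs, x + a) < d for a in alphabet)
    )


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
for x in ["baaba", "abaab", "babaa"]:
    print(x, is_left_right_maximal(docs, "aa", 2, x))
# "aaba" is d-right-maximal but can still be extended to the left by b.
print("aaba", is_left_right_maximal(docs, "aa", 2, "aaba"))
```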
Our Contribution
Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Problem
There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log² n + rocc log n) time.
Theorem
n : total length of strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P
An answer does not necessarily correspond to a branching node in GSTD(L(P)).
d-Left-Right-Maximal Generic Words
Example T1 = aabaab, T2 = aabab, T3 = babaaa
[Figure: the same generalized suffix tree, with L(P) and the answer nodes marked]
P = ab, d = 2, output = {abaa, aaba}
Main Idea
Every d-left-right-maximal extension of P has a right extension of P (not necessarily maximal) as a suffix.
If we compute the d-left-maximal extensions of all right extensions of P, we obtain all answers.
We consider such extensions on the GST.
For each branching right extension of P (not necessarily maximal), we compute its d-left-maximal extensions.
Main Idea
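[Figure: GSTD with L(P) and a node u below it (weight ≥ d), next to GSTDR with the node r(u) and the nodes v below it]
DR = {T1R, …, TmR} (TiR : the reverse of Ti)
r(u) = L(str(u)R) : the locus of str(u)R in GSTDR
weightDR(v) ≥ d
maxchildDR(v) < d
Such nodes v are candidates for answers.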
Main Idea
[Figure: GSTD with L(P) and the set REx of nodes u below L(P), next to GSTDR with r(u) = L(str(u)R) and the candidate set cand(u) below r(u)]
Cand(REx) = ∪u∈REx cand(u) : set of candidates
Cand(REx) may contain non-answers.
We want to remove such nodes from Cand(REx), so we characterize them.
Cand(REx)
The nodes in Cand(REx) that are not answers are not d-right-maximal.
Non-answers
We must check whether each candidate is d-right-maximal. To do so, we need the node r’(v) for each node v in GSTDR.
r’(v) : node in GSTD s.t. str(v)R = str(r’(v)) (it may be an implicit node)
Remove non-answers
[Figure: GSTD and GSTDR as before, with a candidate node v below r(u) in GSTDR and the corresponding node r’(v) in GSTD]
weightDR(v) ≥ d and maxchildDR(v) < d : v is d-left-maximal
maxchildD(r’(v)) < d : v is d-right-maximal
We check whether the node v is d-right-maximal by checking maxchildD(r’(v)).
We define the following subset of answers.
Remove non-answers
cand’(u) = {v | v∈cand(u) and maxchildD(r’(v)) < d}
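The candidate-then-filter idea behind cand(u) and cand’(u) can be imitated on plain strings, with no suffix trees. In this brute-force sketch, a d-left-maximal extension of a right extension y is found as a d-right-maximal extension of the reversed y in the reversed documents, and is kept only if it is also d-right-maximal in the original documents. All function names are ours, and the definition of the answers is the assumed one (both one-character extensions drop below d).

```python
def doc_frequency(docs, x):
    return sum(1 for t in set(docs) if x in t)


def right_extensions(docs, pattern, d):
    """All y with prefix pattern and W_D(y) >= d (not necessarily maximal)."""
    found = set()
    for t in docs:
        start = t.find(pattern)
        while start != -1:
            for end in range(start + len(pattern), len(t) + 1):
                y = t[start:end]
                if doc_frequency(docs, y) >= d:
                    found.add(y)
            start = t.find(pattern, start + 1)
    return found


def is_right_maximal(docs, d, x):
    alphabet = set("".join(docs))
    return doc_frequency(docs, x) >= d and all(
        doc_frequency(docs, x + a) < d for a in alphabet
    )


def left_right_maximal_via_reversal(docs, pattern, d):
    rev_docs = [t[::-1] for t in docs]  # plays the role of D^R
    answers = set()
    for y in right_extensions(docs, pattern, d):  # plays the role of REx
        for x_rev in right_extensions(rev_docs, y[::-1], d):
            if not is_right_maximal(rev_docs, d, x_rev):
                continue  # not a candidate: not d-left-maximal
            x = x_rev[::-1]
            if is_right_maximal(docs, d, x):  # the cand'(u) filter
                answers.add(x)
    return answers


docs = ["aabaab", "aabab", "babaaa"]
result = left_right_maximal_via_reversal(docs, "ab", 2)
print("abaa" in result, "aaba" in result)
```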
We compute cand’(u) by using range reporting query.
preord(v) : rank of v in a preorder traversal of GSTD’
end(v) : maximum preorder rank in the subtree GSTD’(v)
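With these two values, "v lies in the subtree of w" becomes the range condition preord(w) ≤ preord(v) ≤ end(w). A small sketch on a hypothetical tree given as a child-list dictionary, using 0-based ranks:

```python
def preorder_ranks(children, root):
    """Return (preord, end) for a tree given as {node: [children in order]}."""
    preord, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            end[node] = counter - 1  # last rank handed out inside this subtree
            continue
        preord[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return preord, end


# Hypothetical tree: r has children a and b; a has children c and d.
children = {"r": ["a", "b"], "a": ["c", "d"]}
preord, end = preorder_ranks(children, "r")


def in_subtree(v, w):
    return preord[w] <= preord[v] <= end[w]


print(in_subtree("d", "a"), in_subtree("b", "a"))  # True False
```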
Computing cand’(u)
The nodes v in cand’(u) satisfy the following.
• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• weight(v) ≥ d
• maxchild(v) < d
• maxchild(r’(v)) < d
Equivalently:
• preord(r(u)) ≤ preord(v) ≤ end(r(u))
• max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)
We compute the nodes that satisfy these formulas by using a segment intersection query.
The nodes in GSTD’ correspond to horizontal segments. The query corresponds to a vertical segment.
Segment Intersection Query Problem
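[Figure: horizontal segments for the nodes v, crossed by a vertical query segment at d that spans preord(r(u)) to end(r(u))]
preord(r(u)) ≤ preord(v) ≤ end(r(u))
max{maxchild(v), maxchild(r’(v))} < d ≤ weight(v)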
The number of horizontal segments is O(n).
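The reduction can be illustrated with a linear scan that reports every horizontal segment crossed by the vertical query segment. This is only a sketch of what the query asks for, not the O(log log n + k) structure; the `Segment` type and the sample values are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    node: str
    y: int     # preord(v)
    x_lo: int  # max{maxchild(v), maxchild(r'(v))}, exclusive left end
    x_hi: int  # weight(v), inclusive right end


def report(segments, d, y_lo, y_hi):
    """Nodes whose horizontal segment is crossed by the vertical segment
    at x = d spanning y_lo..y_hi, i.e. preord(r(u))..end(r(u))."""
    return [
        s.node
        for s in segments
        if y_lo <= s.y <= y_hi and s.x_lo < d <= s.x_hi
    ]


# Hypothetical segments, for illustration only.
segments = [
    Segment("v1", 3, 1, 2),
    Segment("v2", 4, 2, 3),
    Segment("v3", 9, 0, 5),
]
print(report(segments, 2, 2, 6))   # ['v1']
print(report(segments, 3, 0, 10))  # ['v2', 'v3']
```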
Computing cand’(u)
A segment intersection query can be answered in O(log log n + k) time with an O(n)-space data structure, where n is the number of segments and k is the size of the output.
Lemma [Chan, 2013]
For any node u in REx, cand’(u) can be computed in O(log log n + |cand’(u)|) time with an O(n)-space data structure.
Lemma
We can obtain the set of answers by computing cand’(u) for every node u in REx.
However, there may be duplicates, and nodes u s.t. cand’(u) = ∅.
We can skip such right extensions by using a range reporting query and a binary search on the GST.
Meaningful Right Extensions
There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log² n + rocc log n) time.
Theorem
Conclusion
Let D = {T1, …, Tm} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Problem
There exists an O(n log n)-space data structure that can compute all d-left-right-maximal extensions of P in O(|P| + occ log² n + rocc log n) time.
Theorem
n : total length of strings in D
rocc : number of d-right-maximal extensions of P
occ : number of d-left-right-maximal extensions of P
Consider a more efficient algorithm.
Can the single-document version, a special case of this problem, be solved more easily?
Consider the minimal discriminating words problem for left-right extensions.
Future Work
Thank You !
Cand’(REx) may contain duplicates because of the definition of REx.
About Cand’(REx)
We want to remove such nodes from Cand’(REx), so we characterize them.
If there exists an answer s.t. P occurs in the answer at least two times, there exist duplicated answers.
Duplicated Answers
Let u be a node in REx s.t. P occurs in str(u) at least two times. For any node v s.t. str(v) is a proper suffix of str(u), cand’(u) ⊆ cand’(v).
Lemma
We use the following lemma.
Checking P’s
Let u be a node in REx. Then preord(u1) < beg(L(P)) ≤ end(L(P)) < preord(u2) iff P occurs in str(u) exactly once (as a prefix of str(u)).
Lemma
k : the index s.t. SAstr(u)[k] = 1, where SAstr(u) is the suffix array of str(u)
u1 : str(u1) = str(u)[SAstr(u)[k−1]..|str(u)|]
u2 : str(u2) = str(u)[SAstr(u)[k+1]..|str(u)|]
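The idea behind the lemma can be sketched on a single string: the suffixes that begin with P are contiguous in the suffix array, so P occurs only as a prefix exactly when neither lexicographic neighbour of the whole string begins with P. The sketch uses a naive suffix array and 0-based positions (the slide's SA[k] = 1 becomes `sa[k] == 0`), and compares strings directly rather than preorder ranks in GSTD; the function names are ours.

```python
def suffix_array(s):
    """Naive suffix array: starting positions of the suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def occurs_only_as_prefix(s, pattern):
    """True iff the pattern, assumed to be a prefix of s, occurs nowhere else in s."""
    if not s.startswith(pattern):
        raise ValueError("pattern must be a prefix of s")
    sa = suffix_array(s)
    k = sa.index(0)  # position of the whole string in the suffix array
    neighbours = []
    if k > 0:
        neighbours.append(s[sa[k - 1]:])  # plays the role of str(u1)
    if k + 1 < len(sa):
        neighbours.append(s[sa[k + 1]:])  # plays the role of str(u2)
    return not any(n.startswith(pattern) for n in neighbours)


print(occurs_only_as_prefix("abaa", "ab"))   # True
print(occurs_only_as_prefix("abaab", "ab"))  # False
```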
Checking P’s
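[Figure: GSTD with L(P) and u; the suffix array SA of str(u), in which the entry 1 (str(u) itself, beginning with P) lies between the entries i and j, the starting positions of str(u1) and str(u2)]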
Checking P’s
[Figure: the same picture, with P marked on str(u2)]