Inferring Strings from Suffix Trees and Links on a Binary Alphabet
Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
Kyushu University, Japan


Post on 23-Feb-2016


Page 1: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Kyushu University, Japan

Page 2: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Outline

Reverse Problems on String Data Structures
Suffix Tree, Suffix Links
Reverse Problem on Suffix Trees
Efficient Solution
• Inferring a Labeling Function
• Suffix Tour Graph
• On a Binary Alphabet

Page 3: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problems on String Data Structures

Hot Topic

Direct Problem: Given a string, compute its data structure.

Reverse Problem: Given a data structure, compute its string.

Examples of data structures: border arrays, suffix arrays, DAWG, etc.

Solving reverse problems could lead to a deeper understanding of strings and data structures.

[Diagram: string to data structure (direct problem), data structure to string (reverse problem).]

Page 4: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problems on String Data Structures

Border Array: [Franek et al., 2002], [Duval et al., 2005]

Suffix Array: [Duval and Lefebvre, 2002], [Bannai et al., 2003], [Schürmann et al., 2005]

DAWG: [Bannai et al., 2003]

Parameterized Border Array: [I et al., 2009], [I et al., 2010]

KMP Failure Function: [Gawrychowski et al., 2010]

Runs: [Matsubara et al., 2010]

Palindromic Structure: [I et al., 2010]

Prefix Table: [Clement et al., 2009]

Cover Array: [Crochemore et al., 2010]

LPF Table: [He et al., 2011]

We consider the reverse problem on suffix trees.

Page 5: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Suffix Tree, Suffix Links

The suffix tree of w is the compacted trie which represents the suffixes of w.
The suffix link of a node points to the node that represents the substring obtained by deleting the first character.

[Figure: the suffix tree of w = ababaaa$ (positions 1 to 8). Each leaf carries the index of its suffix, and suffix links are drawn as arrows between nodes.]

Page 6: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Direct Problem on Suffix Trees

Input: A string w.
Output: The suffix tree of w.

It can be solved in linear time [e.g. Ukkonen, 1995].

[Figure: the string w = ababaaa$ and its suffix tree.]

Page 7: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problem on Suffix Trees

Input: An unlabeled ordered rooted tree T.
Output: A string which realizes T (if such exists).

A string w is said to realize T if the suffix tree of w is isomorphic to T.

Page 8: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problem on Suffix Trees

Input: An unlabeled ordered rooted tree T and links f.
Output: A string which realizes T and f (if such exists).

A string w is said to realize (T, f) if the suffix tree of w and its suffix links are isomorphic to T and f.

[Figure: an unlabeled tree T with the link function f drawn as arrows.]

Page 9: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problem on Suffix Trees

[Figure: the tree T and link function f of the previous slide, with the leaves numbered by suffix index and the edge labels filled in; it is realized by w = ababaaa$ (positions 1 to 8).]

Page 10: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Reverse Problem on Suffix Trees

Input: An unlabeled ordered rooted tree T and links f for inner nodes.
Output: A string which realizes T and f (if such exists).

A string w is said to realize (T, f) if the suffix tree of w and its suffix links for inner nodes are isomorphic to T and f.

[Figure: a tree T with the link function f given for inner nodes only; it is realized by each of ababaaa$, aaababa$, aababaa$ and abaaaba$.]
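The definition of "realize" can be checked by brute force on small strings. The sketch below is not the linear-time construction cited earlier; it builds the suffix tree naively from the sorted suffixes, and the function name `suffix_tree_shape` and the encoding of nodes as child-index paths are illustrative assumptions. It confirms that the four strings ababaaa$, aaababa$, aababaa$ and abaaaba$ give the same unlabeled tree and the same links for inner nodes.

```python
from itertools import groupby


def suffix_tree_shape(w):
    """Return (shape, links) of the suffix tree of w, built naively.

    shape: nested tuples, one tuple per node and () for a leaf; children are
           ordered by the first character of their edge ('$' < 'a' < 'b').
    links: {inner node: suffix link target}, where a node is named by the
           tuple of child indices on the path from the root (root = ()).
    """
    assert w.endswith("$") and w.count("$") == 1
    path_of = {}  # string spelled by an inner node -> its child-index path

    def build(suffixes, depth, path):
        if len(suffixes) == 1:
            return ()  # a single suffix left: this is a leaf
        # Compaction: skip characters shared by every suffix in the group.
        while len({s[depth] for s in suffixes}) == 1:
            depth += 1
        path_of[suffixes[0][:depth]] = path
        groups = groupby(suffixes, key=lambda s: s[depth])
        return tuple(build(list(grp), depth + 1, path + (i,))
                     for i, (_, grp) in enumerate(groups))

    shape = build(sorted(w[i:] for i in range(len(w))), 0, ())
    # The suffix link of the node spelling x points to the node spelling x[1:].
    links = {path_of[x]: path_of[x[1:]] for x in path_of if x}
    return shape, links


words = ["ababaaa$", "aaababa$", "aababaa$", "abaaaba$"]
results = [suffix_tree_shape(w) for w in words]
print(all(r == results[0] for r in results))  # prints True
```

The quadratic-size list of suffixes keeps the sketch short; it is only meant for experimenting with tiny inputs.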

Page 11: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

How can we solve this problem?

Input: An unlabeled ordered rooted tree T and links f for inner nodes.
Output: A string which realizes T and f (if such exists).

Page 13: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

How can we solve this problem?

Input: An unlabeled ordered rooted tree T and links f for inner nodes.
Output: A string which realizes T and f (if such exists).

[Figure: the input tree with its leaves numbered by suffix index, next to w = ababaaa$ (positions 1 to 8).]

If we can infer a “correct” order of leaves, we can get a string.

A naïve solution of considering all permutations takes O(n!) time.
We need to take into account some “constraints” on the order of leaves, which are implicitly given by the input (T, f).
We introduce suffix tour graphs to capture the constraints.
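Why a correct leaf order yields a string can be sketched as follows. Suffix i starts with the i-th character of w, and that character labels the edge from the root to the subtree containing leaf i. The helper name `string_from_leaf_order`, its parameters, and the letter names a and b are assumptions made for illustration; the children of the root are assumed to be labeled $, a, b from left to right.

```python
def string_from_leaf_order(root_child_leaf_counts, suffix_index_of_leaf,
                           letters="$ab"):
    """Read a string off a numbering of the leaves.

    root_child_leaf_counts[j]: number of leaves below the j-th child of the root.
    suffix_index_of_leaf[i]:   1-based suffix index given to the i-th leaf,
                               leaves taken from left to right.
    letters: assumed labels of the root's child edges, in sorted order.
    """
    n = len(suffix_index_of_leaf)
    assert sum(root_child_leaf_counts) == n
    w = [None] * n
    leaf = 0
    for letter, count in zip(letters, root_child_leaf_counts):
        for _ in range(count):
            # Suffix i starts with the letter of the root child above leaf i.
            w[suffix_index_of_leaf[leaf] - 1] = letter
            leaf += 1
    return "".join(w)


# Example tree: the root has three children with 1, 5 and 2 leaves below them.
print(string_from_leaf_order([1, 5, 2], [8, 7, 6, 5, 3, 1, 4, 2]))  # ababaaa$
```

The hard part of the reverse problem is finding such a numbering, which is what the constraints on the order of leaves are for.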

Page 15: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Notations

Input (T, f): an ordered rooted tree T and a link function \( f : V_{in} \setminus \{\mathit{root}\} \to V_{in} \).

\( V \): the set of nodes of T
\( E \): the set of edges of T
\( \mathit{root} \): the root node of T
\( V_{in} \): the set of inner nodes of T
\( V_{leaf} \): the set of leaf nodes of T

For any \( v \in V \):
\( V(v) \), \( V_{in}(v) \) and \( V_{leaf}(v) \) respectively represent the sets of nodes, inner nodes and leaf nodes of the subtree rooted at v.
\( \mathit{children}(v) \): the set of children of v.
\( \mathit{ch}_i(v) \): the i-th child of v.
\( \mathit{par}(v) \): the parent of v.

Page 16: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

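Preconditions of an Input

1. The first child of the root node is a leaf: \( \mathit{ch}_1(\mathit{root}) \in V_{leaf} \).

2. There exists a path of function f from any node \( v_0 \in V_{in} \setminus \{\mathit{root}\} \) to the root node: \( \forall v_0 \in V_{in} \setminus \{\mathit{root}\},\ \exists v_1, v_2, \ldots, v_k \) s.t. \( v_k = \mathit{root} \) and \( v_i = f(v_{i-1}) \) for any \( 1 \le i \le k \).

[Figure: two example inputs, one that satisfies the preconditions and one that does not.]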
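The two preconditions translate into a short check. This is a minimal sketch under assumed conventions that are not part of the slides: nodes are integers, `children` maps each inner node to the ordered list of its children (leaves are not keys), and `f` maps each non-root inner node to its link target. `CHILDREN` and `F` encode the tree and links of the running example ababaaa$.

```python
def check_preconditions(children, f, root):
    """Check Preconditions 1 and 2 for an input (T, f)."""
    inner = set(children)
    # Precondition 1: the first child of the root is a leaf.
    if children[root][0] in inner:
        return False
    # Precondition 2: following f from any non-root inner node reaches the root.
    for v in inner - {root}:
        steps = 0
        while v != root:
            # More steps than inner nodes means f loops without reaching the root.
            if v not in f or f[v] not in inner or steps > len(inner):
                return False
            v = f[v]
            steps += 1
    return True


# Hypothetical encoding of the example: node 0 is the root; 2, 3, 5, 6 are
# inner nodes spelling a, ba, aa, aba; all other nodes are leaves.
CHILDREN = {0: [1, 2, 3], 2: [4, 5, 6], 5: [7, 8], 6: [9, 10], 3: [11, 12]}
F = {2: 0, 5: 2, 6: 3, 3: 2}
print(check_preconditions(CHILDREN, F, 0))  # True
```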

Page 17: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

In what follows…

Infer the first character of the string of each edge, namely, a labeling function \( g : E \to \Sigma \cup \{\$\} \).

[Figure: the example tree with the string of each edge; g keeps only the first character of each edge string.]

Page 18: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Conditions for g to hold

1. The edge from the root to its first child is labeled with $: \( g((\mathit{root}, \mathit{ch}_1(\mathit{root}))) = \$ \).

Page 19: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Conditions for g to hold

2. The labels for the children are sorted in lexicographical order: \( \forall v \in V,\ 1 \le i < |\mathit{children}(v)|,\ g((v, \mathit{ch}_i(v))) < g((v, \mathit{ch}_{i+1}(v))) \).

[Figure: a node v whose outgoing edge labels increase from left to right.]

Page 20: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Conditions for g to hold

3. Condition on links of parent-child nodes: \( \forall v \in V \setminus \{\mathit{root}\} \) with \( v_p = \mathit{par}(v) \), there exists \( u \in \mathit{children}(f(v_p)) \) s.t. \( g((v_p, v)) = g((f(v_p), u)) \). In addition, if \( v \in V_{in} \) then \( f(v) \in V(u) \).

[Figure: the edges (v_p, v) and (f(v_p), u) carry the same label c, and f(v) lies in the subtree rooted at u.]

Page 21: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Labels for Inner Edges

By Condition 3, the labels for inner edges (edges from inner nodes to inner nodes) can be uniquely determined.
If the determined labels contradict Condition 2, the input turns out to be invalid.

[Figure: two inputs with their determined inner-edge labels; the second one contradicts Condition 2 and is invalid.]
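One way to propagate the labels of inner edges with Condition 3 is sketched below. Several points are assumptions for illustration and not the authors' procedure: the integer node encoding, the names `label_inner_edges` and `consistent_with_condition_2`, and the convention that edges leaving the root are named $, a, b by rank. For an inner edge (v_p, v) with v_p not the root, u is the child of f(v_p) whose subtree contains f(v), and the edge inherits the label of (f(v_p), u). The preconditions are assumed to hold, so the recursion ends at the root.

```python
def label_inner_edges(children, f, root, letters="$ab"):
    """Return {(parent, child): label} for every edge leading to an inner node."""
    inner = set(children)
    parent = {c: p for p, cs in children.items() for c in cs}
    g = {}

    def label(vp, v):
        if (vp, v) not in g:
            if vp == root:
                # Assumption: root edges are named by rank ($, a, b).
                g[(vp, v)] = letters[children[root].index(v)]
            else:
                # Condition 3: climb from f(v) to the child u of f(vp).
                u = f[v]
                while u != root and parent[u] != f[vp]:
                    u = parent[u]
                if u == root:
                    raise ValueError("f(v) is not below f(par(v))")
                g[(vp, v)] = label(f[vp], u)
        return g[(vp, v)]

    for vp, cs in children.items():
        for v in cs:
            if v in inner:
                label(vp, v)
    return g


def consistent_with_condition_2(children, g):
    """Labels known so far must strictly increase among the children of a node."""
    for v, cs in children.items():
        labs = [g[(v, c)] for c in cs if (v, c) in g]
        if any(x >= y for x, y in zip(labs, labs[1:])):
            return False
    return True


CHILDREN = {0: [1, 2, 3], 2: [4, 5, 6], 5: [7, 8], 6: [9, 10], 3: [11, 12]}
F = {2: 0, 5: 2, 6: 3, 3: 2}
print(label_inner_edges(CHILDREN, F, 0))
```

With the links of the example, the two inner children of the node spelling a receive the labels a and b. Redirecting the link of node 6 to node 2 would give both of them the label a, which contradicts Condition 2.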

Page 23: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

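Lg and Dg

When a labeling function g satisfies Conditions 1 to 3, we define the following values for any node v.

\( L_g(v) = |\{ u \in V_{leaf} \mid f(\mathit{par}(u)) = \mathit{par}(v),\ g((\mathit{par}(u), u)) = g((\mathit{par}(v), v)) \}| \)

\( D_g(v) = \sum_{y \in V(v)} L_g(y) \)

\( L_g(v) \) is the number of leaves u in the following situation: the link of par(u) points to par(v), and the edges (par(u), u) and (par(v), v) carry the same label c.

Constraints on the order of leaves: the next leaf of u is in \( V_{leaf}(v) \).

\( D_g(v) \) of the leaves in \( V_{leaf}(v) \) are constrained by such leaves u.

[Figure: the example tree annotated with the values of L_g and D_g at each node.]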
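The two values can be computed directly from their definitions. In this sketch the integer node encoding and the name `leaf_counts_L_and_D` are illustrative. One more assumption is labeled in the code: a leaf hanging directly off the root is counted at the root itself, since f is not defined at the root and the next leaf may then lie anywhere in the tree. This choice reproduces the total of 8 shown at the root in the example.

```python
def leaf_counts_L_and_D(children, f, g, root):
    """Return (L, D): dictionaries with L_g(v) and D_g(v) for every node v."""
    parent = {c: p for p, cs in children.items() for c in cs}
    nodes = [root] + list(parent)
    L = dict.fromkeys(nodes, 0)
    for u in nodes:
        if u in children:  # inner nodes contribute nothing
            continue
        p = parent[u]
        if p == root:
            # Assumption: leaves directly below the root are counted at the root.
            L[root] += 1
        else:
            # The unique child v of f(par(u)) whose edge has the same label;
            # unpacking raises ValueError if there is not exactly one.
            (v,) = [x for x in children[f[p]] if g[(f[p], x)] == g[(p, u)]]
            L[v] += 1

    D = {}

    def subtree_sum(v):
        D[v] = L[v] + sum(subtree_sum(x) for x in children.get(v, []))
        return D[v]

    subtree_sum(root)
    return L, D


# Hypothetical encoding of the suffix tree of ababaaa$ with its labeling g.
CHILDREN = {0: [1, 2, 3], 2: [4, 5, 6], 5: [7, 8], 6: [9, 10], 3: [11, 12]}
F = {2: 0, 5: 2, 6: 3, 3: 2}
G = {(0, 1): "$", (0, 2): "a", (0, 3): "b", (2, 4): "$", (2, 5): "a",
     (2, 6): "b", (5, 7): "$", (5, 8): "a", (6, 9): "a", (6, 10): "b",
     (3, 11): "a", (3, 12): "b"}
L, D = leaf_counts_L_and_D(CHILDREN, F, G, 0)
print(D[0], D[2], D[3])  # 8 4 2
```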

Page 24: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Conditions for g to hold

4. The number of leaves of the subtree rooted at v must be at least \( D_g(v) \): \( |V_{leaf}(v)| - D_g(v) \ge 0 \).

[Figure: the example tree annotated with \( L_g(v) \), \( D_g(v) \) and \( |V_{leaf}(v)| - D_g(v) \) at each node.]

Page 27: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

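Suffix Tour Graph

When a labeling function g satisfies Conditions 1 to 4, we define the suffix tour graph \( STG_g = (V_G, E_G) \) w.r.t. g.

\( V_G = V \)

\( E_G = \{ (u, v) \mid u \in V_{leaf},\ f(\mathit{par}(u)) = \mathit{par}(v),\ g((\mathit{par}(u), u)) = g((\mathit{par}(v), v)) \} \cup \{ (u, v)^k \mid (u, v) \in E,\ k = |V_{leaf}(v)| - D_g(v) \} \)

Here \( (u, v)^k \) denotes k copies of the edge (u, v).

\( STG_g \) is an Eulerian graph (possibly disconnected).

[Figure: \( STG_g \) for the example tree; each tree edge is annotated with its multiplicity k, which is 0 or 1 in this example.]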
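The edge multiset of the suffix tour graph can be assembled from the definition. In the sketch below the encoding and the names `suffix_tour_graph` and `is_balanced` are illustrative. The edge from a leaf directly below the root back to the root is an assumption: it is consistent with a tour that passes through the root, but the formula as written does not spell it out. The final check confirms that every node has as many incoming as outgoing edges.

```python
from collections import Counter


def suffix_tour_graph(children, f, g, root):
    """Return the edges of STG_g as a Counter {(u, v): multiplicity}."""
    parent = {c: p for p, cs in children.items() for c in cs}
    leaves = [v for v in parent if v not in children]
    edges = Counter()
    # Edges from each leaf u to the node v below which the next leaf must lie.
    for u in leaves:
        p = parent[u]
        if p == root:
            v = root  # assumption: leaves below the root point back to the root
        else:
            (v,) = [x for x in children[f[p]] if g[(f[p], x)] == g[(p, u)]]
        edges[(u, v)] += 1
    L = Counter(v for _, v in edges)  # L_g(v) = number of leaf edges into v

    D, n_leaves = {}, {}

    def visit(v):
        kids = children.get(v, [])
        for x in kids:
            visit(x)
        D[v] = L[v] + sum(D[x] for x in kids)
        n_leaves[v] = sum(n_leaves[x] for x in kids) if kids else 1

    visit(root)
    # Tree edges (par(v), v) with multiplicity k = |V_leaf(v)| - D_g(v).
    for v, p in parent.items():
        k = n_leaves[v] - D[v]
        if k < 0:
            raise ValueError("Condition 4 is violated at node %r" % (v,))
        if k > 0:
            edges[(p, v)] += k
    return edges


def is_balanced(edges):
    """True if every node has equal in-degree and out-degree."""
    out_deg, in_deg = Counter(), Counter()
    for (u, v), k in edges.items():
        out_deg[u] += k
        in_deg[v] += k
    return out_deg == in_deg


CHILDREN = {0: [1, 2, 3], 2: [4, 5, 6], 5: [7, 8], 6: [9, 10], 3: [11, 12]}
F = {2: 0, 5: 2, 6: 3, 3: 2}
G = {(0, 1): "$", (0, 2): "a", (0, 3): "b", (2, 4): "$", (2, 5): "a",
     (2, 6): "b", (5, 7): "$", (5, 8): "a", (6, 9): "a", (6, 10): "b",
     (3, 11): "a", (3, 12): "b"}
STG = suffix_tour_graph(CHILDREN, F, G, 0)
print(is_balanced(STG))  # True
```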

Page 28: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Necessary and Sufficient Condition for (T, f) and g to be valid

\( STG_g \) is an Eulerian graph (possibly disconnected).

Condition: \( STG_g \) has an Eulerian cycle that contains the root and all leaves.

When there exists such a cycle, a correct order of leaves that realizes (T, f) and g can be obtained by the order of visiting leaves on the cycle.

[Figure: \( STG_g \) for the example tree with an Eulerian cycle through the root and all leaves; the leaves are numbered by suffix index in visiting order.]

Page 29: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Necessary and Sufficient Condition for (T, f) and g to be valid

Example for an invalid labeling function g.

[Figure: \( STG_g \) for another labeling g of the same tree; it has no Eulerian cycle that contains the root and all leaves.]

Page 30: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Computing an Eulerian Cycle

An Eulerian cycle can be computed in time linear in the graph size.
We also showed that the size of \( STG_g \) is linear in the input size.

⇒ Given g, we can check whether g is valid by constructing \( STG_g \) and computing an Eulerian cycle, in time linear in the input size.

What remains is to find a valid labeling function g.
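The validity check for a given g can be sketched with Hierholzer's algorithm, a standard linear-time method for Eulerian cycles; the slides do not name the method used, so this choice is an assumption. The adjacency lists in `STG` and the table `TOP_LETTER` (the label of the root edge above each leaf) are a hypothetical encoding of the example graph for ababaaa$.

```python
from collections import Counter


def eulerian_cycle(adj, start):
    """Hierholzer's algorithm on a directed multigraph {node: successors}.

    Returns the cycle as a list of nodes beginning and ending at start,
    or None if no closed walk from start uses every edge exactly once.
    """
    remaining = {u: list(vs) for u, vs in adj.items()}
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())         # dead end: emit the node
    cycle.reverse()
    used = Counter(zip(cycle, cycle[1:]))
    expected = Counter((u, v) for u, vs in adj.items() for v in vs)
    if cycle[0] != cycle[-1] or used != expected:
        return None
    return cycle


# Edges of the example graph: node 0 is the root, leaves are 1, 4, 7..12.
STG = {0: [2], 2: [6], 6: [9, 10], 5: [7, 8], 1: [0], 4: [1], 7: [4],
       8: [5], 9: [11], 10: [12], 11: [5], 12: [6]}
# Label of the root edge above each leaf (assumed names $, a, b).
TOP_LETTER = {1: "$", 4: "a", 7: "a", 8: "a", 9: "a", 10: "a", 11: "b", 12: "b"}

cycle = eulerian_cycle(STG, 0)
# Leaves in visiting order are suffixes 1, 2, ..., n; read off first letters.
w = "".join(TOP_LETTER[v] for v in cycle if v in TOP_LETTER)
print(w)  # ababaaa$
```

A graph whose edges fall into two separate cycles makes `eulerian_cycle` return `None`, which corresponds to an invalid labeling function.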

Page 31: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

On a Binary Alphabet

In the case of binary alphabets, due to Conditions 1 to 4 it suffices to consider at most five labeling functions.

[Figures, one per slide on pages 31 to 35: the candidate labeling functions g for the example tree, each shown with its \( STG_g \).]

Theorem: On a binary alphabet, the reverse problem on suffix trees can be solved in linear time.

Page 37: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Summary

We introduced suffix tour graphs, which lead to an efficient solution of the reverse problem on suffix trees. (Note that they can also be applied to non-binary cases.)
On a binary alphabet, we showed that the problem can be solved in linear time in the input size.

Open Problems

What about non-binary cases?
⇒ It seems to be difficult, since the number of labeling functions g increases combinatorially.

What about the problem in which suffix links are not given?
⇒ I do not have any idea.

Page 38: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Compute a string which realizes this tree and links.

Exercise?

Page 39: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Hints

These labels are determined uniquely.

[Figure: the exercise tree with the uniquely determined edge labels ($, a, b) filled in.]

Page 40: Inferring Strings from Suffix Trees and Links on a Binary Alphabet

Exercise?

Compute a string which realizes this tree and links.

Strings shown: babaabaaababaa$, babaababaaabaa$, babaaababaabaa$