The Longest Common Subsequence Problem and Its Variants, Chang-Biau Yang, Department of Computer Science and Engineering, National Sun Yat-sen University


Post on 20-Dec-2015


The Longest Common Subsequence Problem and

Its Variants

Chang-Biau Yang, Department of Computer Science and Engineering, National Sun Yat-sen University

http://www.nsysu.edu.tw

Outline

Introduction to Bioinformatics
Traditional LCS Algorithms
Our Works
    Block Edit Problems
    LCS of Run-Length Encoded Strings
    Merged LCS Problem
    Mosaic LCS Problem
Conclusions

Introduction to Bioinformatics

Animal cell: nucleus, cytoplasm, cell membrane. DNA is located inside the cell nucleus.

DNA and RNA

Nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)

DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)

DNA Double Helix

DNA Length

The total length of the human DNA is about 3 × 10^9 (3 billion) base pairs.

Only 1% to 1.5% of the DNA sequence is useful. Number of human genes: 30,000 to 40,000.

Conclusion from the Human Genome Project (1990 to 2003): the number originally expected was 100,000.

From DNA via RNA to Protein


DNA, Genes and Proteins

DNA: program for cell processes Proteins: execute cell processes

[Figure: a gene is a segment of the DNA sequence, and it is expressed as a protein.]

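Promoter and Gene

[Figure: structure of a gene. The promoter (with the TATA and TTG motifs) lies upstream; the gene runs from the transcriptional start site to the transcriptional termination site downstream, with the ATG start codon, the TAG stop codon, introns and exons marked.]

Amino Acids

Amino acids are the basic units of proteins; there are 20 kinds.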

Protein Structure

Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem


The Longest Common Subsequence (LCS) Problem

A string: S1 = "TAGTCACG"

A subsequence of S1: obtained by deleting 0 or more symbols (not necessarily consecutive) from S1, e.g. G, AGC, TATC, AGACG.

Common subsequences of S1 = "TAGTCACG" and S2 = "AGACTGTC": GG, AGC, AGACG.

Longest common subsequence (LCS):

S1:  TAGTCACG
S2:  AGACTGTC
LCS: AGACG

Applications of LCS

The edit distance of two strings or files (number of deletions and insertions):

S1:        TAGTCAC-G--
S2:        -AG--ACTGTC
Operation: DMMDDMMIMII (D: deletion, M: match, I: insertion)

Spoken word recognition
Similarity of two biological sequences (DNA or protein)
Sequence alignment

The Traditional LCS Algorithm

Let $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$. $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.

Dynamic programming:

\[
A_{i,j} = \begin{cases}
A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
\max\{ A_{i-1,j},\ A_{i,j-1} \} & \text{if } a_i \ne b_j
\end{cases}
\]

$A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$.

Time complexity: O(mn)
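The recurrence for $A_{i,j}$ can be sketched in Python as follows. This is a minimal illustration rather than code from the talk; the function names `lcs_table` and `lcs` are hypothetical, and the traceback shown is just one of several valid ways to recover an LCS.

```python
def lcs_table(s1, s2):
    """Fill A[i][j] = length of an LCS of s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])
    return A


def lcs(s1, s2):
    """Trace back through the table to recover one LCS."""
    A = lcs_table(s1, s2)
    i, j = len(s1), len(s2)
    out = []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("TAGTCACG", "AGACTGTC"))
```

For the running example, the table's bottom-right cell is 5, matching the length of AGACG. Both the time and the space used by this sketch are O(mn).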


LCS and Edit Distance

Edit distance = |S1| + |S2| - 2 * |LCS(S1, S2)|

LCS table (rows: TAGTCACG, columns: AGACTGTC):

      -  A  G  A  C  T  G  T  C
   -  0  0  0  0  0  0  0  0  0
   T  0  0  0  0  0  1  1  1  1
   A  0  1  1  1  1  1  1  1  1
   G  0  1  2  2  2  2  2  2  2
   T  0  1  2  2  2  3  3  3  3
   C  0  1  2  2  3  3  3  3  4
   A  0  1  2  3  3  3  3  3  4
   C  0  1  2  3  4  4  4  4  4
   G  0  1  2  3  4  4  5  5  5

Edit distance table (rows: TAGTCACG, columns: AGACTGTC):

      -  A  G  A  C  T  G  T  C
   -  0  1  2  3  4  5  6  7  8
   T  1  2  3  4  5  4  5  6  7
   A  2  1  2  3  4  5  6  7  8
   G  3  2  1  2  3  4  5  6  7
   T  4  3  2  3  4  3  4  5  6
   C  5  4  3  4  3  4  5  6  5
   A  6  5  4  3  4  5  6  7  6
   C  7  6  5  4  3  4  5  6  7
   G  8  7  6  5  4  5  4  5  6
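The identity between the edit distance and the LCS length can be checked numerically. The sketch below is an illustration under the slide's assumption that only insertions and deletions are allowed (no substitutions); the function names are hypothetical.

```python
def lcs_length(s1, s2):
    """Length of an LCS, keeping only one DP row at a time."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance_table(s1, s2):
    """D[i][j] = minimum number of insertions and deletions turning s1[:i] into s2[:j]."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i symbols
    for j in range(n + 1):
        D[0][j] = j  # insert all j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1])
    return D


if __name__ == "__main__":
    s1, s2 = "TAGTCACG", "AGACTGTC"
    d = indel_distance_table(s1, s2)[-1][-1]
    print(d, len(s1) + len(s2) - 2 * lcs_length(s1, s2))
```

For the example strings, both expressions evaluate to 6 = 8 + 8 - 2 × 5.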

Sequence Alignment

S1 = TAGTCACG
S2 = AGACTGTC

Alignment 1:
----TAGTCACG
AGACT-GTC---

Alignment 2:
TAGTCAC-G--
-AG--ACTGTC

Which one is better? We can set different gap penalties as parameters for different purposes.

Gap Penalty for Sequence Alignment

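Suppose $\sigma(x, -)$ or $\sigma(-, x)$ is the gap penalty.

\[
A(i, j) = \max \begin{cases}
A(i-1, j-1) + \sigma(a_i, b_j) \\
A(i-1, j) + \sigma(a_i, -) \\
A(i, j-1) + \sigma(-, b_j)
\end{cases}
\]

\[
A(i, 0) = i \cdot \sigma(x, -), \qquad A(0, j) = j \cdot \sigma(-, x)
\]

\[
\sigma(x, y) = \begin{cases}
2 & \text{if } x = y \\
-1 & \text{if } x \ne y \text{ (including } \sigma(x, -) \text{ or } \sigma(-, x) \text{)}
\end{cases}
\]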

Example for Sequence Alignment

TAGTCAC-G--
-AG--ACTGTC

       -   A   G   A   C   T   G   T   C
   -   0  -1  -2  -3  -4  -5  -6  -7  -8
   T  -1  -1  -2  -3  -4  -2  -3  -4  -5
   A  -2   1   0   0  -1  -2  -3  -4  -5
   G  -3   0   3   2   1   0   0  -1  -2
   T  -4  -1   2   2   1   3   2   2   1
   C  -5  -2   1   1   4   3   2   1   4
   A  -6  -3   0   3   3   3   2   1   3
   C  -7  -4  -1   2   5   4   3   2   3
   G  -8  -5  -2   1   4   4   6   5   4
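The example table can be reproduced with a short sketch of the alignment recurrence. It assumes the scoring used in the example (match 2, mismatch or gap -1); `sigma`, `alignment_table` and `score_alignment` are hypothetical names chosen for illustration.

```python
GAP = "-"


def sigma(x, y):
    """Score of one alignment column: 2 for a match, -1 for a mismatch or a gap."""
    return 2 if x == y and x != GAP else -1


def alignment_table(s1, s2):
    """A[i][j] = best alignment score of s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = A[i - 1][0] + sigma(s1[i - 1], GAP)
    for j in range(1, n + 1):
        A[0][j] = A[0][j - 1] + sigma(GAP, s2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + sigma(s1[i - 1], s2[j - 1]),
                A[i - 1][j] + sigma(s1[i - 1], GAP),
                A[i][j - 1] + sigma(GAP, s2[j - 1]),
            )
    return A


def score_alignment(row1, row2):
    """Score a given gapped alignment column by column."""
    return sum(sigma(x, y) for x, y in zip(row1, row2))


if __name__ == "__main__":
    print(alignment_table("TAGTCACG", "AGACTGTC")[-1][-1])
    print(score_alignment("TAGTCAC-G--", "-AG--ACTGTC"))
    print(score_alignment("----TAGTCACG", "AGACT-GTC---"))
```

Under this scoring the alignment TAGTCAC-G-- / -AG--ACTGTC reaches the optimal score 4, while ----TAGTCACG / AGACT-GTC--- scores 0. Changing `sigma` is the place to experiment with other gap penalties.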

PAM250 Score Matrix for Protein Alignment

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A   2
C  -2  12
D   0  -5   4
E   0  -5   3   4
F  -4  -4  -6  -5   9
G   1  -3   1   0  -5   5
H  -1  -3   1   1  -2  -2   6
I  -1  -2  -2  -2   1  -3  -2   5
K  -1  -5   0   0  -5  -2   0  -2   5
L  -2  -6  -4  -3   2  -4  -2   2  -3   6
M  -1  -5  -3  -2   0  -3  -2   2   0   4   6
N   0  -4   2   1  -4   0   2  -2   1  -3  -2   2
P   1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6
Q   0  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4
R  -2  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6
S   1   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2
T   1  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3
V   0  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4
W  -6  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17
Y  -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10

MSA, ET and LCS

Multiple sequence alignment

LCS

Phylogeny (evolutionary tree)

Hunt-Szymanski LCS Algorithm

By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for solving the longest increasing subsequence problem, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches.

This algorithm is faster than the traditional dynamic programming if r is small.

The Pairs of Matching in Hunt-Szymanski Algorithm

Input sequences: TAGTCACG and AGACTGTC

Pairs of matching (i, j), where i is a position in TAGTCACG and j is a position in AGACTGTC:

T: (1,5) (1,7)
A: (2,1) (2,3)
G: (3,2) (3,6)
T: (4,5) (4,7)
C: (5,4) (5,8)
A: (6,1) (6,3)
C: (7,4) (7,8)
G: (8,2) (8,6)

Example for Hunt-Szymanski Algorithm

The insertion order is row major and column backward.

Time complexity: O(r log n), where r is the number of matches. Each match needs O(log n) time for binary search.

State of the list after each insertion (L is the LCS length):

Inserted  (1,7) (1,5) (2,3) (2,1) (3,6) (3,2) (4,7) (4,5) (5,8) (5,4)
L = 1     (1,7) (1,5) (2,3) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1)
L = 2                             (3,6) (3,2) (3,2) (3,2) (3,2) (3,2)
L = 3                                         (4,7) (4,5) (4,5) (5,4)
L = 4                                                     (5,8) (5,8)
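The insertion process in the example can be sketched as follows. This is a simplified illustration that returns only the LCS length; it keeps just the column of each list entry, and the name `hunt_szymanski_length` is hypothetical. Building the position lists adds O(n) work on top of the O(r log n) insertions.

```python
from bisect import bisect_left


def hunt_szymanski_length(s1, s2):
    """LCS length via matches processed row major, column backward."""
    # positions[ch] = increasing list of 1-based columns where ch occurs in s2
    positions = {}
    for j, ch in enumerate(s2, start=1):
        positions.setdefault(ch, []).append(j)

    # thresholds[k] = smallest column at which a common subsequence
    # of length k + 1 can end, given the rows processed so far
    thresholds = []
    for ch in s1:
        # Visiting columns backward keeps two matches from the same row
        # from being chained together.
        for j in reversed(positions.get(ch, [])):
            k = bisect_left(thresholds, j)  # binary search, O(log n)
            if k == len(thresholds):
                thresholds.append(j)
            else:
                thresholds[k] = j
    return len(thresholds)


if __name__ == "__main__":
    print(hunt_szymanski_length("TAGTCACG", "AGACTGTC"))
```

After the first five rows of the example, `thresholds` holds the columns 1, 2, 4, 8, which correspond to the entries (2,1), (3,2), (5,4), (5,8) in the last column of the table.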

Time and Space Complexities for LCS

Block Edit Problems


Motivation – Finding Similar Codes

Program 1:

    #include <stdio.h>

    int partition(int array[], int top, int bottom);
    void quicksort(int array[], int top, int bottom);

    int partition(int array[], int top, int bottom)
    {
        int x = array[top];
        int i = top - 1;
        int j = bottom + 1;
        do {
            while (x < array[--j]) {}
            while (x > array[++i]) {}
            if (i < j) {
                int temp = array[i];
                array[i] = array[j];
                array[j] = temp;
            }
        } while (i < j);
        return j;
    }

    void quicksort(int array[], int top, int bottom)
    {
        int middle;
        if (top < bottom) {
            middle = partition(array, top, bottom);
            quicksort(array, top, middle);
            quicksort(array, middle+1, bottom);
        }
    }

    int main()
    {
        int data[] = {3,1,7,6,4};
        quicksort(data, 0, 4);
        for (int i=0; i<5; i++)
            printf("%d ", data[i]);
    }

Program 2:

    #include <stdio.h>

    void QS(int ARR[], int LL, int RR);
    int PP(int ARR[], int LL, int RR);

    void QS(int ARR[], int LL, int RR)
    {
        int MM;
        if (LL < RR) {
            MM = PP(ARR, LL, RR);
            QS(ARR, LL, MM);
            QS(ARR, MM+1, RR);
        }
    }

    int PP(int ARR[], int LL, int RR)
    {
        int ii = LL - 1;
        int Y = ARR[LL];
        int jj = RR + 1;
        do {
            while (Y < ARR[--jj]) {}
            while (Y > ARR[++ii]) {}
            if (ii < jj) {
                int TT = ARR[ii];
                ARR[ii] = ARR[jj];
                ARR[jj] = TT;
            }
        } while (ii < jj);
        return jj;
    }

    int main()
    {
        int DD[] = {3,1,7,6,4};
        QS(DD, 0, 4);
        for (int ii=0; ii<5; ii++)
            printf("%d ", DD[ii]);
    }

Block Edit Problems

Operations: block copy, block deletion and block move.

Shapira and Storer (2002) proved that the problem is NP-hard when recursive block-move operations are allowed.

Various approximations have been proposed.

Our assumptions (restricted edit sequence):
    A series of edit operations is performed from left to right on the source string X.
    No two block-edit operations are performed on overlapping regions of X.

A Series of Block Edit Operations

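[Figure: the source string X = W0 is transformed step by step through the working strings W1, ..., Wi, Wi+1, Wi+2, Wi+3, ... into the target string Y = Wk.]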

Restricted Edit Sequence

(a) General (recursive) edit operations:

X  = ab
W1 = abab
W2 = abababab
Y  = ababababbbbb

(b) Restricted edit sequence:

X  = ab
W1 = abb
W2 = abbbb
W3 = abbbba
W4 = abbbbaa
Y  = abbbbaaabbbb

Definitions of the Problems (1/2)

Let P(o, c) denote a block edit problem:
    o: a composition of block-edit operations
    c: the class of cost measures

The block-copy operations:
    External copy: copy a substring of X to Wi
    Internal copy: copy a valid substring of Wi-1 to Wi
    Shifted copy: copy a shifted substring

Example of a shifted copy: "aabbd" and "ccddf" have the same differences between consecutive symbols.

a a b b d  ->  a 0 1 0 2
c c d d f  ->  c 0 1 0 2
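The shifted-copy example can be made concrete with a small sketch. It assumes that symbols are compared by their character codes, which is an assumption made here for illustration; `differential` and `is_shifted_copy` are hypothetical names.

```python
def differential(s):
    """Return (first symbol, differences between consecutive symbols)."""
    if not s:
        return "", []
    diffs = [ord(b) - ord(a) for a, b in zip(s, s[1:])]
    return s[0], diffs


def is_shifted_copy(u, v):
    """True if v equals u with every symbol shifted by the same amount."""
    return len(u) == len(v) and differential(u)[1] == differential(v)[1]


if __name__ == "__main__":
    print(differential("aabbd"))
    print(differential("ccddf"))
    print(is_shifted_copy("aabbd", "ccddf"))
```

Once both strings are replaced by their difference sequences, finding shifted copies reduces to finding ordinary common substrings of those sequences.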

Definitions of the Problems (2/2)

The cost measures that can be chosen:
    Constant cost: p_copy
    Linear cost: p_s + k × p_e
    Nested cost: p_copy + d_c(A, B)

Three problems are defined in our work:
    P(EIS,C)
    P(EI,L)
    P(EI,N)

Problem 1 -- P(EIS,C): External, Internal, Shifted, Constant

External and internal copies are allowed in constant cost.
Shifted copies are allowed in constant cost.
It can be solved by a straightforward DP algorithm in O(nm^2 (n + m) |Σ|) time.
We propose an O(nm) time DP algorithm with
    O(n + m^2) preprocessing time in the worst case
    O(n + m log m) preprocessing time in the average case

Recurrence DP Formula for P(EIS,C)

Straightforward implementation: O(nm^2 (n + m) |Σ|) time.

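Functions and Operations (1)

[Figure: character operations (each adds +1 to the cost) and block deletions (each adds + p_delete) applied between Wi-1 and Wi; the time bounds shown in the figure are O(n) and O(1).]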

Functions and Operations (2)

[Figure: external copies from X into Wi and internal copies from Wi-1 into Wi, each adding + p_copy to the cost; the time bounds shown in the figure are O(nm) and O(m^2).]

Functions and Operations (3) Shifted copies:

[Figure: shifted copies from X or Wi-1 into Wi, each adding + p_copy to the cost.]

)( 2 mnmO


Preprocessing for P(EIS,C)

For external copies:
    Build a suffix tree T(X^R # Y^R $) to find the common substrings between X and Y.
For internal copies:
    Build a suffix tree T(Y^R) to find the valid common substrings to be copied from working string Wi to Wi+1.
For shifted copies:
    Compute the differential strings X' and Y' of X and Y.
    Find the valid common substrings for external / internal copies.

Preprocessing - Suffix Trees

[Figure: suffix tree of S1 = ABBDAC, built from its suffixes 1: ABBDAC, 2: BBDAC, 3: BDAC, 4: DAC, 5: AC, 6: C.]

Preprocessing – Longest Common Prefixes (LCP) and Suffix trees

[Figure: (a), (b) suffix trees used to find the longest common prefixes of the reversed prefixes Xh^R and Yj^R.]

Finding and Maintaining the Range Minimum in Constant Time

[Figure: (a), (b) row i of the DP table over columns 12 to 20, with a list of candidate values from which the range minimum (plus p_copy) is maintained as the range moves.]

Problem 2 -- P(EI,L): External, Internal, Linear

The cost of each copy or deletion is an initial penalty plus a linear extension penalty.

Problem 3 -- P(EI,N): External, Internal, Nested

The copied strings can be further edited with character-edit operations.

Summary of Block Edit Problems

Problem    Straightforward DP     Preprocessing time                                        Our methods
P(EIS,C)   O(nm^2 (n + m) |Σ|)    O(n + m^2) in worst case, O(n + m log m) in average case  O(nm)
P(EI,L)    O(nm^2 (n + m))        O(n + m^2) in worst case, O(n + m log m) in average case  O(nm log m)
P(EI,N)    O(n^2 m^3)             O((n + m) m^2)                                            O(nm^2)

LCS of Run-Length Encoded Strings

Run-length encoding (RLE) compression: aaaaabbbccccdd -> a^5 b^3 c^4 d^2

Input:
    RLE string X: length n, k runs
    RLE string Y: length m, l runs

Output:
    LCS between X and Y.
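Run-length encoding itself is simple to sketch. The example below uses a list of (symbol, run length) pairs as the encoded form, which is a representation chosen here for illustration; `rle_encode` and `rle_decode` are hypothetical names.

```python
from itertools import groupby


def rle_encode(s):
    """Encode a string as a list of (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def rle_decode(runs):
    """Expand a list of (symbol, run length) pairs back into a string."""
    return "".join(ch * count for ch, count in runs)


if __name__ == "__main__":
    runs = rle_encode("aaaaabbbccccdd")
    print(runs)
    print(rle_decode(runs))
```

In the notation of the problem statement, a string of length n is stored as k pairs, so an algorithm that works on runs rather than symbols can gain when k is much smaller than n.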

Dark & Light Blocks

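Divide the DP lattice into k × l blocks.

Dark blocks: matched blocks
Light blocks: mismatched blocks

[Figure: DP lattice of two run-length encoded strings divided into dark and light blocks; run labels shown in the figure: a8 b3 c4 a5 b8 c4 a4, b6, c4, a12, a3.]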

Results of Bunke and Csirik (1995)

Lemma 1 (Dark block):

\[ LCS(Xa^r, Ya^r) = LCS(X, Y) + r \]

Lemma 2 (Light block), for $a \ne b$:

\[ LCS(Xa^r, Yb^s) = \max\{ LCS(Xa^r, Y),\ LCS(X, Yb^s) \} \]

Only the boundaries of the blocks are needed.

Time complexity: O(km + nl)

Results of Liu et al. (2008)

A complex modified DP formula which computes the DP lattice row by row.

Only the bottom boundaries of the blocks are needed.

Time complexity: O(min{km, nl})

Additional Lemmas

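Lemma 3 (Monotonicity):

\[ LCS(X_{1..i'}, Y_{1..j'}) \le LCS(X_{1..i}, Y_{1..j}) \quad \text{for } 1 \le i' \le i,\ 1 \le j' \le j \]

Lemma 4 (Merged light blocks):

\[
LCS(X a_{i_1}^{r_{i_1}} \cdots a_{i_2}^{r_{i_2}},\ Y b_{j_1}^{s_{j_1}} \cdots b_{j_2}^{s_{j_2}})
= \max\{ LCS(X a_{i_1}^{r_{i_1}} \cdots a_{i_2}^{r_{i_2}},\ Y),\ LCS(X,\ Y b_{j_1}^{s_{j_1}} \cdots b_{j_2}^{s_{j_2}}) \}
\]

if $a_{i'} \ne b_{j'}$ for $i_1 \le i' \le i_2$, $j_1 \le j' \le j_2$.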

Proof of Lemma 4

53

Basic Idea

C(v) denotes the number of occurrences of the matched symbol to the right of v.

n_i denotes the length of the current run of X.

[Figure: a dark block of the DP lattice with the nodes u0, u1, u2 and v0, ..., v8 and the distances d0, ..., d3; the LCS value at the current node is obtained as the maximum over candidate paths through these nodes, expressed with the values LCS(v), C(v) and the distances d.]

Dummy Nodes & Candidate Paths

Some dummy nodes are considered, too.
Divide the candidate paths into two sets.

[Figure: the same lattice with the dummy nodes v6' and v8' added.]

Range Minimum / Maximum Query (RMQ)

Given an array A and a range [i, j], find the maximum in the range [i, j].
It can be solved in O(n) preprocessing time and O(1) query time.

Index:   1    2    3    4    5    6    7    8    9
Value:   3   -7   99    2  -20    8    9  -99    1

Finding the Maximum from the Candidate Paths

The value of u0 can be computed by Lemma 4.
The maximum of the second set can be found by precomputing an array Li and then applying RMQ (Range Maximum Query) on it.

],0[ ],[],1[][ mjmjCjiMjL jii

57

How Fast Is It?

The elements that need to be computed:
    Bottom-right corners of all blocks.
    Bottom boundaries of the dark blocks.

Let p1 and p2 denote the numbers of elements in the bottom and right boundaries of the dark blocks, respectively. The time complexity of our algorithm is O(kl + min{p1, p2}).

The Merged LCS Problem

Motivation -- Riffle Shuffle

[Figure: a riffle shuffle interleaves two decks A and B into a merged deck E(A, B).]

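Relationship among Decks (1)

[Figure: a deck T is compared with the merged deck E(A, B): either T == E(A, B) or T != E(A, B); the relationship is measured by LCS(T, E(A, B)).]

Relationship among Decks (2)

[Figure: given decks A, B and T, is the relationship LCS(T, E(A, B)) or LCS(T, A, B)?]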

Nested Genes

Fruit fly -- Drosophila melanogaster
Gene dcp-1 (Dmel_CG5370)
Gene pita (Dmel_CG3941)
(LOCUS AE003461)

[Figure: Drosophila melanogaster chromosome 2R, positions 70000 to 75000, showing the nested genes pita and dcp-1; positions marked in the figure: 70154, 72504, 74745, 75907.]

[Laundrie et al., Genetics 165, 2003]

Whole Genome Duplication

[Figure: a common ancestor with genes 1 to 10 leads to the lineage of species T and the lineage of species Y. In species Y the genome is duplicated into copy A and copy B; after gene loss the two copies keep genes 2 3 5 8 10 and genes 1 3 4 6 7 9, while species T keeps genes 1 to 10.]

[Kellis et al., Nature 428(6983), 2004]

Doubly Conserved Synteny Block

Two yeast species: Kluyveromyces waltii and Saccharomyces cerevisiae

Block?

[Kellis et al., Nature 428(6983), 2004]

Merged Sequence

An interleaving sequence obtained by merging sequences A and B, denoted as E(A, B).
The merged sequence is not unique.

A = cgatacc    B = aattcgc

E1(A, B) = cgataaacgc
E2(A, B) = aattcgcgcatacc
E3(A, B) = cgaaatactcgc

Merged-LCS Problem

To find the relationship among sequences T, A, and B, denoted as LCS(T, E(A, B)).

T = atacgcgctt    A = cgatacc    B = aattcgc

A = -----cg---at-acc
T = ata--cgcgc-tt---
B = a-att--cgc------
    a a  cgcgc t      = LCS(T, E(A, B))

E1(A, B) = cgataaacgc
E2(A, B) = aattcgcgcatacc
E3(A, B) = cgaaatactcgc

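Algorithm MergedLCS

Dynamic programming formula:

\[
L(i, j, k) = \max \begin{cases}
0 & \text{if } i = 0,\ j = 0, \text{ or } k = 0 \\
L(i-1, j-1, k) + 1 & \text{if } T[i] = A[j] \\
L(i-1, j, k-1) + 1 & \text{if } T[i] = B[k] \\
\max\{ L(i-1, j, k),\ L(i, j-1, k),\ L(i, j, k-1) \} & \text{if } T[i] \ne A[j] \text{ and } T[i] \ne B[k]
\end{cases}
\]

Time complexity: O(nm^2), where n = |T| and m = max{|A|, |B|}

Space complexity: O(nm) [Hirschberg 1975, divide-and-conquer]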
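The three-dimensional recurrence can be sketched as follows. For simplicity this illustration always takes the maximum over every applicable option (skip a symbol of T, A or B, or use a match), which is a slightly more permissive form than a case-by-case formula; it stores the full O(n|A||B|) table rather than the reduced space mentioned on the slide. The name `merged_lcs_length` is hypothetical.

```python
def merged_lcs_length(t, a, b):
    """Length of the longest common subsequence of t and some interleaving of a and b."""
    n, la, lb = len(t), len(a), len(b)
    # L[i][j][k]: best length using t[:i], a[:j] and b[:k]
    L = [[[0] * (lb + 1) for _ in range(la + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(la + 1):
            for k in range(lb + 1):
                best = L[i - 1][j][k]  # leave t[i-1] unmatched
                if j > 0:
                    best = max(best, L[i][j - 1][k])  # leave a[j-1] unmatched
                    if t[i - 1] == a[j - 1]:
                        best = max(best, L[i - 1][j - 1][k] + 1)
                if k > 0:
                    best = max(best, L[i][j][k - 1])  # leave b[k-1] unmatched
                    if t[i - 1] == b[k - 1]:
                        best = max(best, L[i - 1][j][k - 1] + 1)
                L[i][j][k] = best
    return L[n][la][lb]


if __name__ == "__main__":
    print(merged_lcs_length("atacgcgctt", "cgatacc", "aattcgc"))
```

For T = atacgcgctt, A = cgatacc and B = aattcgc this returns 8, the length of aacgcgct. With an empty B the function reduces to the ordinary LCS length of T and A.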

Blocked Merged Sequence

An interleaving block sequence obtained by merging block sequences A and B, denoted as E^b(A, B).
The blocked merged sequence is not unique.

A^b = cgat acc (blocks A^b_1 A^b_2)    B^b = aat tc gc (blocks B^b_1 B^b_2 B^b_3)

E^b_4(A^b, B^b) = A^b_1 B^b_1 B^b_2 A^b_2 = cgataattcacc
E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aatcgatacctc
E^b_6(A^b, B^b) = B^b_1 B^b_2 A^b_1 B^b_3 A^b_2 = aattccgatgcacc

Blocked Merged LCS Problem

To find the relationship among block sequences T, A^b, and B^b, denoted as bLCS(T, E^b(A^b, B^b)).

T = atacgcgctt
A^b = cgat acc    B^b = aat tc gc

E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aat cgat acc tc

T               = a-ta cg-- -cgc t-t
E^b_5(A^b, B^b) = aat- cgat ac-c tc-
bLCS            = a t  cg    c c t

Consider the symbol EOB (end of block).

Complexity: O(n m m_b), where n = |T|, m = max{|A^b|, |B^b|}, and m_b is the maximum number of blocks in A^b and B^b.

Algorithm for Block Merged LCS


Improved Algorithm BMergedLCS+

Step 1. Compute the S-tables St(T, A^b_i) and St(T, B^b_j). O(nm)
Step 2. Initialize L^b(i, 0, 0) = 0. O(n)
Step 3. V^b(j, k) = max{ V^b(j-1, k) ⊕ St(T, A^b_j), V^b(j, k-1) ⊕ St(T, B^b_k) }. O(n m_b^2)
Step 4. Return L^b(|T|, , ). O(1) or O(n)

Complexity: O(nm + n m_b^2), where n = |T|, m = max{|A^b|, |B^b|}, and m_b is the maximum number of blocks in A^b and B^b.

Experimental Results (1)

                Sequence length (bp)     Number of blocks    Running time (sec.)
Data set        |T|     |A|     |B|      |A|     |B|         MergedLCS    BMergedLCS+
dodA            1629    687     942      6       7           52.69        0.70
pita & dcp-1    6000    2480    1756     3       3           1312.29      13.25

Experimental Results (2)

[Figure, panels (a) to (d): gene map of pita and dcp-1 (marked positions 70154, 72504, 74745, 75907) and alignments of the sequences gi|24762322, gi|73917619 and gi|24762318 over positions 72001 to 72100 and 74031 to 74130, as produced by Clustal W, BMergedLCS+ and MergedLCS.]

Summary – Merged LCS

The merged-LCS problem LCS(T, E(A, B))
    MergedLCS: O(nm^2)

The blocked merged-LCS problem bLCS(T, E^b(A^b, B^b))
    BMergedLCS: O(n m m_b)
    BMergedLCS+: O(nm + n m_b^2)

n = |T|, m = max{|A^b|, |B^b|}
m_b: maximum number of blocks in A^b and B^b

The Mosaic LCS Problem

Chimera

"Chimera of Arezzo": an Etruscan bronze (a creature from ancient Greek mythology)

Chimeric Alignment

Komatsoulis and Waterman, 1997
For detecting chimeric sequences

[Figure: a target sequence T aligned against segments taken from the sequences S1, S2, S3 and S4.]

k-mosaic LCS Problem

Input: target sequence T, mosaic number k, sequence set S.

Output: maximal LCS(T, C), where C = C1 C2 … Ck and each Ci ∈ S.

e.g. for k = 4: max{ LCS(T, C1C2C3C4) | Ci ∈ S }

[Figure: T is divided into 4 segments that are matched against S4, S2, S3 and S2 respectively; LCS(T, S4S2S3S2) is maximal.]

Algorithm for k-mosaic LCS (1)

Compute LCS(T[p, q], Sj) for 0 ≤ p ≤ q ≤ n and every Sj ∈ S, where |Sj| = m.

Time: O(n^2 m |S|)

Algorithm for k-mosaic LCS (2)

Recursive doubling scheme:

LCS(T[p, r], C1C2) = max over q of { LCS(T[p, q], C1) + LCS(T[q, r], C2) }, for 0 ≤ p ≤ q ≤ r ≤ n, Ci ∈ S. Time: O(n^3)

Doubling the number of segments: (1, 1) -> 2, (2, 2) -> 4, (4, 4) -> 8, (8, 8) -> 16
(C1, C2) -> C1C2, (C1C2, C3C4) -> C1C2C3C4

Total time: O(n^3 log k)

Summary – Mosaic LCS

Mosaic LCS Problem: LCS(T, C<1,k>)

Straightforward DP: O(n^2 m |S| + n^3 log k)
Improved algorithm with S-table: O(n(m + k) |S|)

[Figure: the target T, the sequence set S, and a mosaic sequence C = C1C2C3.]

Conclusions

Other related problems:
    Constrained LCS problem
    Longest increasing subsequence problem
    Longest common increasing subsequence problem of two sequences
    Near optimal alignment
    Alignment with multiple scoring functions
    Multiple sequence alignment
    Fast LCS of multiple sequences

References (1)

Block Edit Distance

[Ukkonen, 1985] Algorithms for approximate string matching, Information and Control, Vol. 64, pp. 100-118, 1985.

[Shapira and Storer, 2007] Edit distance with move operations, Journal of Discrete Algorithms, Vol. 5, No. 2, pp. 380-392, 2007.

[Ann 2007] Hsing-Yen Ann, Chang-Biau Yang, Yung-Hsing Peng and Bern-Cherng Liaw, "Efficient Algorithms for the Block Edit Problems," Proc. of the 24th Workshop on Combinatorial Mathematics and Computation Theory, pp. 201-208, Nantou, Taiwan, April 27-28, 2007.

LCS of Run-Length Encoded Strings

[Bunke and Csirik, 1995] An improved algorithm for computing the edit distance of run-length coded strings, Information Processing Letters, Vol. 54, No. 2, pp. 93-96, 1995.

[Liu et al., 2008] Finding a longest common subsequence between a run-length-encoded string and an uncompressed string, Journal of Complexity, Vol. 24, No. 2, pp. 173-184, 2008.

[Ann 2008] Hsing-Yen Ann, Chang-Biau Yang, Chiou-Ting Tseng and Chiou-Yi Hor, "A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings," Information Processing Letters, Vol. 108, pp. 360-364, 2008.

References (2)

Merged LCS and Mosaic LCS

[Huang et al. 2007] Kuo-Si Huang, Chang-Biau Yang*, Kuo-Tsung Tseng, Yung-Hsing Peng and Hsing-Yen Ann, "Dynamic Programming Algorithms for the Mosaic Longest Common Subsequence Problem," Information Processing Letters, Vol. 102, pp. 99-103, 2007.

[Huang et al. 2008] Kuo-Si Huang, Chang-Biau Yang*, Kuo-Tsung Tseng, Hsing-Yen Ann and Yung-Hsing Peng, "Efficient Algorithms for Finding Interleaving Relationship between Sequences," Information Processing Letters, Vol. 105 (5), pp. 188-193, 2008.

References (3)

Suffix Tree and Range Minimum Query

[Bender and Farach-Colton, 2000] The LCA problem revisited, in: LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, 2000, pp. 88-94.

[Weiner, 1973] Linear pattern matching algorithm, in Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1-11, 1973.

Genome

[Laundrie et al., 2003] Germline cell death is inhibited by P-element insertions disrupting the dcp-1/pita nested gene pair in Drosophila, Genetics, Vol. 165, No. 4, pp. 1881-1888, 2003.

[Kellis et al., 2004] Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature, Vol. 428, pp. 617-624, 2004.

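Thank you for listening
UAA UAG UGA

~ The End ~

Chang-Biau Yang, Department of Computer Science and Engineering, National Sun Yat-sen University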

Finding and Maintaining the Range Minimum with Linear Penalties

[Figure: four panels showing row i of the DP table over columns 1 to 10, where each candidate carries a linear extension penalty (pe, 2pe, ..., 6pe, or equivalently the offsets -pe, -2pe, ..., -8pe), so that the range minimum can be maintained as the range grows.]

Finding the Substring Edit Distance in Constant Time

X = gattac    Y = gtaaca

[Figure: DP table of X against the substring Y3..6 = aaca, from which substring edit distances such as dsub(X, Y3..3) and dsub(X, Y3..5) are read off.]

Diagram for Blocked Merged LCS (1/2)

[Figure: cases of the recurrence, relating T to the current blocks A^b_i and B^b_j and to the preceding blocks A^b_{i-1} and B^b_{j-1}.]

Diagram for Blocked Merged LCS (2/2)

[Figure: further cases of the recurrence, relating T to the current blocks A^b_i and B^b_j and to the preceding blocks A^b_{i-1} and B^b_{j-1}.]

S-table

Running example: T = atacgcgctt, A^b = cgat acc (blocks A^b_1, A^b_2), B^b = aat tc gc (blocks B^b_1, B^b_2, B^b_3).

[Figure: the S-tables St(T, A^b_1) and St(T, B^b_1). Row p of an S-table lists, for each LCS length, the smallest position i at which LCS(T[p+1, i], block) reaches that length.]

Examples of single rows:

LCS(T[3+1, i], A^b_1) for i = 3..10:  0 1 2 2 2 2 3 3
LCS(T[1+1, i], B^b_1) for i = 1..10:  0 1 1 1 1 1 1 1 2 2

Computing V^b(1, 1):

i                                                     =  0 1 2 3 4 5 6 7 8 9 10
V^b(1, 0) = LCS(T[1, i], A^b_1)                       =  0 1 2 2 2 2 2 2 2 3 3
V^b(0, 1) = LCS(T[1, i], B^b_1)                       =  0 1 2 2 2 2 2 2 2 3 3
V^b(1, 0) ⊕ St(T, B^b_1) = LCS(T[1, i], A^b_1 B^b_1)  =  0 1 2 3 3 3 3 3 3 4 4
V^b(0, 1) ⊕ St(T, A^b_1) = LCS(T[1, i], B^b_1 A^b_1)  =  0 1 2 3 3 4 4 4 4 5 5
V^b(1, 1) = max of the two rows above                 =  0 1 2 3 3 4 4 4 4 5 5

Improved Algorithm

V_l = minE{ V_{l-1} ⊕ St(T, Si) } for each Si ∈ S and 1 ≤ l ≤ k

S-table St(T, Si): O(nm|S|)

V_l = minE{ V_{l-1} ⊕ St(T, Si) }: O(n|S|) for each l

Time complexity: O(n(m + k)|S|)

Example of Algorithm Formosa2

T = agactagtc

S = {S1 = agc, S2 = act, S3 = aatg, S4 = ttcg}

T  = agactagtc
S1 = 12-3-----  (0,1,2,4)
S2 = 1--23----  (0,1,4,5)
S3 = 12--3-4--  (0,1,2,5,7)
S4 = -1----2-3  (0,2,7,9)
V1 = 12-3--4--  (0,1,2,4,7)

Example of Algorithm Formosa2

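St(T, S1), with T = agactagtc and S1 = agc:

Row 0: 0 1 2 4
Row 1: 1 2 4 9
Row 2: 2 3 4 9
Row 3: 3 4 7 9
Row 4: 4 6 7 9
Row 5: 5 6 7 9
Row 6: 6 7 9
Row 7: 7 9
Row 8: 8 9

V1 = (0, 1, 2, 4, 7)

V1 ⊕ St(T, S1) uses the rows that start at the positions stored in V1:

0 1 2 4
1 2 4 9
2 3 4 9
4 6 7 9
7 9

V1 ⊕ St(T, S1) = (0, 1, 2, 3, 4, 7, 9)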
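The two Formosa2 example slides can be reproduced with the sketch below. It is an illustration only: the S-table is built naively here (one LCS computation per row, O(n^2 m) per sequence) instead of with the efficient preprocessing, and the names `s_table`, `extend`, `elementwise_min` and `mosaic_lcs_length` are hypothetical. `extend` plays the role of ⊕ and `elementwise_min` the role of minE.

```python
def prefix_lcs_lengths(x, y):
    """r[i] = LCS length of x[:i] and y, for i = 0..len(x)."""
    prev = [0] * (len(y) + 1)
    r = [0]
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
        r.append(prev[-1])
    return r


def s_table(t, s):
    """Row p = [p, i1, i2, ...]: i_h is the smallest i with LCS(t[p:i], s) >= h."""
    table = []
    for p in range(len(t)):
        row = [p]
        for i, v in enumerate(prefix_lcs_lengths(t[p:], s)):
            if v == len(row):  # length len(row) is reached for the first time
                row.append(p + i)
        table.append(row)
    return table


def extend(v, table):
    """v[l] = smallest end position giving length l; append one more sequence."""
    best = list(v)
    for l, p in enumerate(v):
        if p >= len(table):  # nothing of t is left after position p
            continue
        for extra, i in enumerate(table[p][1:], start=1):
            if l + extra == len(best):
                best.append(i)
            else:
                best[l + extra] = min(best[l + extra], i)
    return best


def elementwise_min(vectors):
    longest = max(len(v) for v in vectors)
    return [min(v[i] for v in vectors if i < len(v)) for i in range(longest)]


def mosaic_lcs_length(t, seqs, k):
    tables = [s_table(t, s) for s in seqs]
    v = [0]
    for _ in range(k):
        v = elementwise_min([extend(v, tab) for tab in tables])
    return len(v) - 1


if __name__ == "__main__":
    T, S = "agactagtc", ["agc", "act", "aatg", "ttcg"]
    print(mosaic_lcs_length(T, S, 1), mosaic_lcs_length(T, S, 2))
```

With T = agactagtc and the four sequences of the example, one round yields V1 = (0, 1, 2, 4, 7), and extending V1 with St(T, S1) yields (0, 1, 2, 3, 4, 7, 9), as on the slides.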