The Longest Common Subsequence Problem and Its Variants, Chang-Biau Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University
Post on 20-Dec-2015
The Longest Common Subsequence Problem and
Its Variants
Chang-Biau Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University
http://www.nsysu.edu.tw
Outline
  Introduction to Bioinformatics
  Traditional LCS Algorithms
  Our Works
    Block Edit Problems
    LCS of Run-Length Encoded Strings
    Merged LCS Problem
    Mosaic LCS Problem
  Conclusions
DNA and RNA
Nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)
DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, GU)
DNA Length
The total length of human DNA is about 3 × 10^9 (3 billion) base pairs.
Only 1% to 1.5% of the DNA sequence is useful.
Number of human genes: 30,000 to 40,000
  Conclusion from the Human Genome Project (1990-2003)
  The originally expected number was 100,000.
DNA, Genes and Proteins
DNA: program for cell processes
Proteins: execute cell processes
[Figure: a gene inside the DNA sequence TCCAA CGGTGC TGAGGT GCAC and the protein it produces.]
Promoter and Gene
[Figure: a gene with its promoter upstream and the transcribed region downstream. Labels: TATA, TTG, transcriptional start site, ATG, exon, intron, TAG, transcriptional termination site.]
The Longest Common Subsequence (LCS) Problem
A string: S1 = "TAGTCACG"
A subsequence of S1: obtained by deleting 0 or more symbols (not necessarily consecutive) from S1,
  e.g. G, AGC, TATC, AGACG
Common subsequences of S1 = "TAGTCACG" and S2 = "AGACTGTC":
  e.g. GG, AGC, AGACG
Longest common subsequence (LCS):
  S1: TAGTCACG
  S2: AGACTGTC
  LCS: AGACG
Applications of LCS
The edit distance of two strings or files (number of deletions and insertions):
  S1:        TAGTCAC-G--
  S2:        -AG--ACTGTC
  Operation: DMMDDMMIMII   (D: delete, M: match, I: insert)
Spoken word recognition
Similarity of two biological sequences (DNA or protein)
Sequence alignment
The Traditional LCS Algorithm
S1 = a_1 a_2 ... a_m and S2 = b_1 b_2 ... b_n
A_{i,j} denotes the length of the longest common subsequence of a_1 a_2 ... a_i and b_1 b_2 ... b_j.
Dynamic programming:
$$A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\ A_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
$$A_{0,0} = A_{0,j} = A_{i,0} = 0 \quad \text{for } 1 \le i \le m,\ 1 \le j \le n$$
Time complexity: O(mn)
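The recurrence for A_{i,j} can be sketched in Python as follows. This is a minimal illustration, not code from the slides; the function name `lcs` and the tie-breaking rule in the traceback are assumptions, so it returns one longest common subsequence, which need not be the one shown on the slide.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2 (hypothetical helper)."""
    m, n = len(s1), len(s2)
    # table[i][j] = length of an LCS of s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(len(lcs("TAGTCACG", "AGACTGTC")))
```

Filling the table takes O(mn) time and space, matching the complexity stated above.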
LCS and Edit Distance
Edit distance = |S1| + |S2| - 2 * |LCS(S1, S2)|

LCS (rows: S1 = TAGTCACG, columns: S2 = AGACTGTC)
     -  A  G  A  C  T  G  T  C
  -  0  0  0  0  0  0  0  0  0
  T  0  0  0  0  0  1  1  1  1
  A  0  1  1  1  1  1  1  1  1
  G  0  1  2  2  2  2  2  2  2
  T  0  1  2  2  2  3  3  3  3
  C  0  1  2  2  3  3  3  3  4
  A  0  1  2  3  3  3  3  3  4
  C  0  1  2  3  4  4  4  4  4
  G  0  1  2  3  4  4  5  5  5

Edit distance
     -  A  G  A  C  T  G  T  C
  -  0  1  2  3  4  5  6  7  8
  T  1  2  3  4  5  4  5  6  7
  A  2  1  2  3  4  5  6  7  8
  G  3  2  1  2  3  4  5  6  7
  T  4  3  2  3  4  3  4  5  6
  C  5  4  3  4  3  4  5  6  5
  A  6  5  4  3  4  5  6  7  6
  C  7  6  5  4  3  4  5  6  7
  G  8  7  6  5  4  5  4  5  6
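The relation between the two tables can be checked with a short Python sketch. The function names `lcs_length` and `indel_distance` are illustrative assumptions; the edit distance here counts insertions and deletions only, as on the slide.

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence, computed row by row."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1, s2):
    """Edit distance with insertions and deletions only: |S1| + |S2| - 2|LCS|."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


print(indel_distance("TAGTCACG", "AGACTGTC"))  # 8 + 8 - 2 * 5 = 6
```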
Sequence Alignment
S1 = TAGTCACG
S2 = AGACTGTC

Alignment 1:          Alignment 2:
  ----TAGTCACG          TAGTCAC-G--
  AGACT-GTC---          -AG--ACTGTC

Which one is better? We can set different gap penalties as parameters for different purposes.
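Gap Penalty for Sequence Alignment
Suppose $\pi(x, -)$ or $\pi(-, x)$ is the gap penalty.
$$A(i, 0) = i \cdot \pi(x, -), \qquad A(0, j) = j \cdot \pi(-, x)$$
$$A(i, j) = \max \begin{cases} A(i-1, j-1) + \pi(a_i, b_j) \\ A(i-1, j) + \pi(a_i, -) \\ A(i, j-1) + \pi(-, b_j) \end{cases}$$
$$\pi(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{if } x \neq y \text{ (including gaps)} \end{cases}$$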
Example for Sequence Alignment
  TAGTCAC-G--
  -AG--ACTGTC

      -   A   G   A   C   T   G   T   C
  -   0  -1  -2  -3  -4  -5  -6  -7  -8
  T  -1  -1  -2  -3  -4  -2  -3  -4  -5
  A  -2   1   0   0  -1  -2  -3  -4  -5
  G  -3   0   3   2   1   0   0  -1  -2
  T  -4  -1   2   2   1   3   2   2   1
  C  -5  -2   1   1   4   3   2   1   4
  A  -6  -3   0   3   3   3   2   1   3
  C  -7  -4  -1   2   5   4   3   2   3
  G  -8  -5  -2   1   4   4   6   5   4
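The table above can be reproduced with a small Python sketch of global alignment scoring. It assumes the scoring read off the example: +2 for a match and -1 for a mismatch or a gap. The names `pair_score` and `alignment_table` are hypothetical.

```python
def pair_score(x, y):
    """Score of aligning symbol x with symbol y (assumed: match +2, mismatch -1)."""
    return 2 if x == y else -1


def alignment_table(s1, s2, gap=-1):
    """Fill the global alignment table; entry [i][j] scores s1[:i] against s2[:j]."""
    m, n = len(s1), len(s2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * gap          # s1[:i] aligned against gaps only
    for j in range(1, n + 1):
        table[0][j] = j * gap          # s2[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + pair_score(s1[i - 1], s2[j - 1]),
                table[i - 1][j] + gap,     # s1[i-1] aligned with a gap
                table[i][j - 1] + gap,     # s2[j-1] aligned with a gap
            )
    return table


table = alignment_table("TAGTCACG", "AGACTGTC")
print(table[-1][-1])  # 4, the bottom-right entry of the table above
```

Changing `gap` or `pair_score` is how different gap penalties would be explored in this sketch.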
PAM250 Score Matrix for Protein Alignment
     A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A    2
C   -2  12
D    0  -5   4
E    0  -5   3   4
F   -4  -4  -6  -5   9
G    1  -3   1   0  -5   5
H   -1  -3   1   1  -2  -2   6
I   -1  -2  -2  -2   1  -3  -2   5
K   -1  -5   0   0  -5  -2   0  -2   5
L   -2  -6  -4  -3   2  -4  -2   2  -3   6
M   -1  -5  -3  -2   0  -3  -2   2   0   4   6
N    0  -4   2   1  -4   0   2  -2   1  -3  -2   2
P    1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6
Q    0  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4
R   -2  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6
S    1   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2
T    1  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3
V    0  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4
W   -6  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17
Y   -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10
Hunt-Szymanski LCS Algorithm
By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches.
This algorithm is faster than the traditional dynamic programming if r is small.
The Pairs of Matching in Hunt-Szymanski Algorithm
Input sequences: TAGTCACG and AGACTGTC
Pairs of matching (i, j), where the i-th symbol of TAGTCACG equals the j-th symbol of AGACTGTC:
  T: (1,5), (1,7)
  A: (2,1), (2,3)
  G: (3,2), (3,6)
  T: (4,5), (4,7)
  C: (5,4), (5,8)
  A: (6,1), (6,3)
  C: (7,4), (7,8)
  G: (8,2), (8,6)
Example for Hunt-Szymanski Algorithm
The insertion order is row major and column backward.
Time complexity: O(r log n), where r is the number of matches. Each match needs O(log n) time for binary search.
Insertion order for rows 1 to 5: (1,7), (1,5), (2,3), (2,1), (3,6), (3,2), (4,7), (4,5), (5,8), (5,4)
Contents of the list L as the matches are inserted:
  L[1]: (1,7), (1,5), (2,3), (2,1), then unchanged
  L[2]: (3,6), (3,2), then unchanged
  L[3]: (4,7), (4,5), (5,4)
  L[4]: (5,8)
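The insertion procedure can be sketched in Python with the standard `bisect` module. This is an illustrative sketch under the assumption that the list keeps, for each length k, the smallest column index ending a common subsequence of length k; the name `hunt_szymanski_length` is hypothetical, and only the column indices are stored rather than the full (i, j) pairs.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_length(s1, s2):
    """LCS length via threshold lists: one binary search per matching pair."""
    # positions[ch] = increasing list of 1-based positions of ch in s2
    positions = defaultdict(list)
    for j, ch in enumerate(s2, start=1):
        positions[ch].append(j)

    # thresh[k] = smallest column that ends a common subsequence of length k + 1
    thresh = []
    for ch in s1:                                   # row-major order
        for j in reversed(positions.get(ch, [])):   # columns backward within a row
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)                    # a longer common subsequence is found
            else:
                thresh[k] = j                       # a smaller ending column for length k + 1
    return len(thresh)


print(hunt_szymanski_length("TAGTCACG", "AGACTGTC"))  # 5
```

Visiting the columns of one row in decreasing order prevents two matches of the same row from being chained together.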
Motivation – Finding Similar Codes
#include <stdio.h>

int partition(int array[], int top, int bottom);
void quicksort(int array[], int top, int bottom);

int partition(int array[], int top, int bottom)
{
    int x = array[top];
    int i = top - 1;
    int j = bottom + 1;
    do {
        while (x < array[--j]) {}
        while (x > array[++i]) {}
        if (i < j) {
            int temp = array[i];
            array[i] = array[j];
            array[j] = temp;
        }
    } while (i < j);
    return j;
}

void quicksort(int array[], int top, int bottom)
{
    int middle;
    if (top < bottom) {
        middle = partition(array, top, bottom);
        quicksort(array, top, middle);
        quicksort(array, middle+1, bottom);
    }
}

int main()
{
    int data[] = {3,1,7,6,4};
    quicksort(data, 0, 4);
    for (int i=0; i<5; i++)
        printf("%d ", data[i]);
}
#include <stdio.h>

void QS(int ARR[], int LL, int RR);
int PP(int ARR[], int LL, int RR);

void QS(int ARR[], int LL, int RR)
{
    int MM;
    if (LL < RR) {
        MM = PP(ARR, LL, RR);
        QS(ARR, LL, MM);
        QS(ARR, MM+1, RR);
    }
}

int PP(int ARR[], int LL, int RR)
{
    int ii = LL - 1;
    int Y = ARR[LL];
    int jj = RR + 1;
    do {
        while (Y < ARR[--jj]) {}
        while (Y > ARR[++ii]) {}
        if (ii < jj) {
            int TT = ARR[ii];
            ARR[ii] = ARR[jj];
            ARR[jj] = TT;
        }
    } while (ii < jj);
    return jj;
}

int main()
{
    int DD[] = {3,1,7,6,4};
    QS(DD, 0, 4);
    for (int ii=0; ii<5; ii++)
        printf("%d ", DD[ii]);
}
Block Edit Problems
Operations: Block copy, block deletion and block move.
Shapira and Storer (2002) proved that it is NP-hard when recursive block-move operations are allowed.
Various approximations were proposed.
Our assumptions (restricted edit sequence):
  A series of edit operations is performed from left to right on the source string X.
  No two block-edit operations are performed on overlapping regions of X.
Restricted Edit Sequence
[Figure: (a) general (recursive) edit operations: X = ab, W1 = abab, W2 = abababab, Y = ababababbbbb; (b) restricted edit sequence: X = ab, W1 = abb, W2 = abbbb, W3 = abbbba, W4 = abbbbaa, Y = abbbbaaabbbb.]
Definitions of the Problems (1/2)
Let P(o, c) denote a block edit problem:
  o: a composition of block-edit operations
  c: the class of cost measures
The block-copy operations:
  External copy: copy a substring of X to W_i
  Internal copy: copy a valid substring of W_{i-1} to W_i
  Shifted copy: copy a shifted substring
    e.g. aabbd and ccddf have the same differences between adjacent symbols: a 0 1 0 2 and c 0 1 0 2
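The shifted-copy example can be reproduced with a short Python sketch. It assumes the differential string keeps the first symbol and then stores the difference between each pair of adjacent character codes; the function name `differential` is hypothetical.

```python
def differential(s):
    """First symbol followed by the differences between adjacent symbols (assumed definition)."""
    if not s:
        return []
    return [s[0]] + [ord(b) - ord(a) for a, b in zip(s, s[1:])]


print(differential("aabbd"))  # ['a', 0, 1, 0, 2]
print(differential("ccddf"))  # ['c', 0, 1, 0, 2]
```

Two strings are shifted copies of each other when their differential strings agree after the first symbol, which is why common substrings of the differential strings are searched.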
Definitions of the Problems (2/2)
The cost measures that can be chosen:
  Constant cost: p_copy
  Linear cost: p_s + k × p_e
  Nested cost: p_copy + d_c(A, B)
Three problems are defined in our work:
  P(EIS,C)
  P(EI,L)
  P(EI,N)
Problem 1 -- P(EIS,C) -- External, Internal, Shifted, Constant
External and internal copies are allowed at constant cost.
Shifted copies are allowed at constant cost.
It can be solved by a straightforward DP algorithm in O(nm^2 (n + m) |Σ|) time.
We propose an O(nm) time DP algorithm with
  O(n + m^2) preprocessing time in the worst case
  O(n + m log m) preprocessing time in the average case
Functions and Operations (1)
[Figure: character operations (cost +1 each) and block deletions (cost + p_delete) transforming W_{i-1} into W_i; the two cases are annotated O(n) and O(1).]
Functions and Operations (2)
[Figure: external copies (from X into W_i) and internal copies (from W_{i-1} into W_i), each with cost + p_copy; the two cases are annotated O(nm) and O(m^2).]
Preprocessing for P(EIS,C)
For external copies:
  Build a suffix tree T(X^R # Y^R $) to find the common substrings between X and Y.
For internal copies:
  Build a suffix tree T(Y^R) to find the valid common substrings to be copied from working string W_i to W_{i+1}.
For shifted copies:
  Compute the differential strings X' and Y' of X and Y.
  Find the valid common substrings for external / internal copies.
Preprocessing - Suffix Trees
[Figure: suffix tree of S1 = ABBDAC, whose suffixes ABBDAC, BBDAC, BDAC, DAC, AC and C start at positions 1 to 6.]
Preprocessing – Longest Common Prefixes (LCP) and Suffix trees
[Figure: (a) suffix tree examples with nodes a1 to a5; (b) the longest common prefixes of the reversed prefixes X_h^R and Y_j^R, located at nodes v1, ..., vp of the suffix tree.]
Finding and Maintaining the Range Minimum in Constant Time
[Figure: (a), (b) the range minimum of row i over columns 12 to 20 is maintained as the range advances; each candidate adds + p_copy.]
Problem 2 -- P(EI,L) -- External, Internal, Linear
The cost of each copy or deletion is an initial penalty plus a linear extension penalty.
Problem 3 -- P(EI,N) -- External, Internal, Nested
The copied strings can be further edited with character-edit operations.
Summary of Block Edit Problems
Problem    Straightforward DP     Preprocessing time                                        Our methods
P(EIS,C)   O(nm^2 (n + m) |Σ|)    O(n + m^2) in worst case, O(n + m log m) in average case  O(nm)
P(EI,L)    O(nm^2 (n + m))        O(n + m^2) in worst case, O(n + m log m) in average case  O(nm log m)
P(EI,N)    O(n^2 m^3)             O((n + m) m^2)                                            O(nm^2)
LCS of Run-Length Encoded Strings
Run-length encoding (RLE) compression:
  aaaaabbbccccdd → a^5 b^3 c^4 d^2
Input:
  RLE string X: length n, k runs
  RLE string Y: length m, l runs
Output:
  LCS between X and Y.
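Run-length encoding itself is easy to sketch in Python with `itertools.groupby`. The function name `run_length_encode` and the list-of-pairs output format are assumptions made for illustration.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


print(run_length_encode("aaaaabbbccccdd"))
# [('a', 5), ('b', 3), ('c', 4), ('d', 2)]
```

In the notation of the slide, a string of length n with k runs yields a list of k pairs.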
Dark & Light Blocks
Divide the DP lattice into k × l blocks.
Dark blocks: matched blocksLight blocks: mismatched blocks
[Figure: DP lattice whose block boundaries are labeled with the runs a^8 b^3 c^4 a^5 b^8 c^4 a^4 and b^6, c^4, a^12, a^3.]
Results of Bunke and Csirik (1995)
Lemma 1 (Dark block):
$$LCS(X a^r, Y a^r) = LCS(X, Y) + r$$
Lemma 2 (Light block):
$$LCS(X a^r, Y b^s) = \max\{LCS(X a^r, Y),\ LCS(X, Y b^s)\}$$
Only the boundaries of the blocks are needed.
Time complexity: $O(km + nl)$
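The two block lemmas can be checked numerically with a brute-force Python sketch. It uses the ordinary quadratic LCS computation on the decompressed strings, not the run-length algorithm of Bunke and Csirik; the names `lcs_length` and `check_block_lemmas` are hypothetical.

```python
def lcs_length(s1, s2):
    """Plain O(|s1| |s2|) LCS length on uncompressed strings."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def check_block_lemmas(x, y, a, b, r, s):
    """Check Lemma 1 (dark block) and Lemma 2 (light block) for symbols a != b."""
    dark = lcs_length(x + a * r, y + a * r) == lcs_length(x, y) + r
    light = lcs_length(x + a * r, y + b * s) == max(
        lcs_length(x + a * r, y), lcs_length(x, y + b * s)
    )
    return dark, light


print(check_block_lemmas("abcab", "bcaab", "a", "b", 3, 2))
```

Both lemmas express a value at the corner of a block through values on the block boundary, which is why only the boundaries need to be stored.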
Results of Liu et al. (2008)
A complex modified DP formula which computes the DP lattice row by row.
Only the bottom boundaries of the blocks are needed.
Time complexity: $O(\min\{km,\ nl\})$
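Additional Lemmas
Lemma 3 (Monotonicity):
$$LCS(X_{1..i}, Y_{1..j}) \le LCS(X_{1..i'}, Y_{1..j'}) \quad \text{for } 1 \le i \le i',\ 1 \le j \le j'$$
Lemma 4 (Merged light blocks): if $a_{i'} \neq b_{j'}$ for $1 \le i' \le i$, $1 \le j' \le j$, then
$$LCS(X a_1^{r_1} a_2^{r_2} \cdots a_i^{r_i},\ Y b_1^{s_1} b_2^{s_2} \cdots b_j^{s_j}) = \max\{LCS(X a_1^{r_1} a_2^{r_2} \cdots a_i^{r_i},\ Y),\ LCS(X,\ Y b_1^{s_1} b_2^{s_2} \cdots b_j^{s_j})\}$$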
Basic Idea
C(v) denotes the number of occurrences of the matched symbol to the right of v.
n_i denotes the length of the current run of X.
[Figure: DP lattice for the runs a^8 b^3 c^4 a^5 b^8 c^4 a^4 and b^6, c^4, a^12, a^3, with nodes v0 to v8, u0 to u2 and distances d0 to d3. LCS(v0) is expressed as the maximum over the candidates v6, v7, v8 and u0, first in terms of d1, d2 and d3, and then in terms of C(v), d0 and n_i.]
Dummy Nodes & Candidate Paths
Some dummy nodes are considered, too.
Divide the candidate paths into two sets.
[Figure: DP lattice with nodes v0 to v8, u0, distances d0 to d3, and the dummy nodes v6' and v8'.]
Range Minimum / Maximum Query (RMQ)
Given an array A and a range [i, j], find the maximum in the range [i, j].
Can be solved in O(n) preprocessing time and O(1) query time.
  index:  1   2   3   4    5   6   7    8   9
  value:  3  -7  99   2  -20   8   9  -99   1
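A range maximum query structure can be sketched in Python with a sparse table. Note that this sketch swaps in a simpler technique: the sparse table needs O(n log n) preprocessing rather than the O(n) bound quoted above, while still answering each query in O(1). The function names and the 0-based inclusive indexing are assumptions.

```python
def build_sparse_table(values):
    """table[k][i] = max of values[i : i + 2**k]."""
    n = len(values)
    table = [list(values)]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        # Combine two adjacent windows of length span into one of length 2 * span.
        table.append([max(prev[i], prev[i + span]) for i in range(n - 2 * span + 1)])
        span *= 2
    return table


def range_max(table, lo, hi):
    """Maximum of values[lo..hi], inclusive and 0-based, from two overlapping windows."""
    level = (hi - lo + 1).bit_length() - 1
    return max(table[level][lo], table[level][hi - (1 << level) + 1])


values = [3, -7, 99, 2, -20, 8, 9, -99, 1]
table = build_sparse_table(values)
print(range_max(table, 3, 8))  # maximum of 2, -20, 8, 9, -99, 1 is 9
```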
56
Finding the Maximum from the Candidate Paths
The value of u0 can be computed by Lemma 4.
The maximum of the second set can be found by precomputing an array L_i and then applying RMQ (Range Maximum Query) on it.
[Figure: DP lattice with nodes v0 to v8, u0 to u2, distances d0 to d3, and dummy nodes v6' and v8'; the array L_i[j], j ∈ [0, m], is built from M[i-1, j] and C[j].]
How Fast Is It?
The elements that need to be computed:
  Bottom-right corners of all blocks.
  Bottom boundaries of the dark blocks.
Let p1 and p2 denote the numbers of elements in the bottom and right boundaries of the dark blocks. The time complexity of our algorithm is $O(kl + \min\{p_1, p_2\})$.
Nested Genes
Fruit fly -- Drosophila melanogaster
  Gene dcp-1 (Dmel_CG5370)
  Gene pita (Dmel_CG3941)
[Figure: Drosophila melanogaster chromosome 2R (LOCUS AE003461), positions 70000 to 75000: pita spans 70154 to 75907 and dcp-1 is nested inside it at 72504 to 74745.]
[Laundrie et al., Genetics 165, 2003]
Whole Genome Duplication
[Figure: a common ancestor with genes 1 to 10 gives rise to the lineages of species T and species Y. In the lineage of species Y, genome duplication is followed by gene loss, leaving two copies (A and B) with genes 1 3 4 6 7 9 and 2 3 5 8 10, while species T keeps genes 1 to 10.]
[Kellis et al., Nature 428(6983), 2004]
Doubly Conserved Synteny Block
Two yeast species:
  Kluyveromyces waltii
  Saccharomyces cerevisiae
[Kellis et al., Nature 428(6983), 2004]
Merged Sequence
An interleaving sequence obtained by merging sequences A and B, denoted as E(A, B).
The merged sequence is not unique.
A = cgatacc B = aattcgc
E1(A, B) = cgataaacgc
E2(A, B) = aattcgcgcatacc
E3(A, B) = cgaaatactcgc
Merged-LCS Problem
To find the relationship among sequences T, A, and B, denoted as LCS(T, E(A, B)).
T = atacgcgctt
A = cgatacc    B = aattcgc

A = -----cg---at-acc
T = ata--cgcgc-tt---
B = a-att--cgc------
    a a  cgcgc t      = LCS(T, E(A, B))
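Algorithm MergedLCS
Dynamic programming formula:
$$L(i,j,k) = \max \begin{cases} 0 & \text{if } i = 0, \text{ or } j = 0 \text{ and } k = 0, \\ L(i-1,j-1,k) + 1 & \text{if } T[i] = A[j], \\ L(i-1,j,k-1) + 1 & \text{if } T[i] = B[k], \\ \max\{L(i-1,j,k),\ L(i,j-1,k),\ L(i,j,k-1)\} & \text{if } T[i] \neq A[j] \text{ and } T[i] \neq B[k]. \end{cases}$$
Terms with index j-1 (or k-1) apply only when j ≥ 1 (or k ≥ 1).
Time complexity: O(nm^2), n = |T|, m = max{|A|, |B|}
Space complexity: O(nm) [Hirschberg 1975, divide-and-conquer]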
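A three-dimensional dynamic program for the merged LCS length can be sketched in Python as below. This is an illustration under the assumption that L(i, j, k) is the best value over all interleavings of A[1..j] and B[1..k] against T[1..i]; the function name `merged_lcs_length` is hypothetical, and the sketch stores the full O(n m^2) table instead of using the space-saving divide-and-conquer variant.

```python
def merged_lcs_length(t, a, b):
    """Length of LCS(T, E(A, B)) maximized over all merged sequences E(A, B)."""
    n, ma, mb = len(t), len(a), len(b)
    # L[i][j][k]: best value for T[:i] against any merge of A[:j] and B[:k]
    L = [[[0] * (mb + 1) for _ in range(ma + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(ma + 1):
            for k in range(mb + 1):
                best = L[i - 1][j][k]                      # skip T[i]
                if j > 0:
                    best = max(best, L[i][j - 1][k])       # skip A[j]
                    if t[i - 1] == a[j - 1]:
                        best = max(best, L[i - 1][j - 1][k] + 1)
                if k > 0:
                    best = max(best, L[i][j][k - 1])       # skip B[k]
                    if t[i - 1] == b[k - 1]:
                        best = max(best, L[i - 1][j][k - 1] + 1)
                L[i][j][k] = best
    return L[n][ma][mb]


print(merged_lcs_length("atacgcgctt", "cgatacc", "aattcgc"))
```

For the slide's example the result is at least 8, since a common subsequence of that length with a merged sequence is exhibited on the Merged-LCS Problem slide.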
Blocked Merged Sequence
An interleaving block sequence obtained by merging block sequences A^b and B^b, denoted as E^b(A^b, B^b).
The blocked merged sequence is not unique.
A^b = cgat acc (blocks A^b_1, A^b_2)    B^b = aat tc gc (blocks B^b_1, B^b_2, B^b_3)
E^b_4(A^b, B^b) = A^b_1 B^b_1 B^b_2 A^b_2 = cgataattcacc
E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aatcgatacctc
E^b_6(A^b, B^b) = B^b_1 B^b_2 A^b_1 B^b_3 A^b_2 = aattccgatgcacc
Blocked Merged LCS Problem
To find the relationship among block sequences T, A^b, and B^b, denoted as bLCS(T, E^b(A^b, B^b)).
T = atacgcgctt
A^b = cgat acc    B^b = aat tc gc
E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aat cgat acc tc

T               = a-ta cg-- -cgc t-t
E^b_5(A^b, B^b) = aat- cgat ac-c tc-
                  a t  cg    c c t      = bLCS(T, E^b(A^b, B^b))

Algorithm for Block Merged LCS
Consider the symbol EOB (end of block).
Complexity: O(n m m_b)
n = |T|, m = max{|A^b|, |B^b|}, m_b: maximum number of blocks in A^b and B^b
Improved Algorithm BMergedLCS+
Step 1. Compute the S-tables St(T, A^b_i) and St(T, B^b_j). O(nm)
Step 2. Initialize L^b(i, 0, 0) = 0. O(n)
Step 3. V^b(j, k) = max{V^b(j-1, k) ⊕ St(T, A^b_j), V^b(j, k-1) ⊕ St(T, B^b_k)}. O(n m_b^2)
Step 4. Return L^b(|T|, , ). O(1) or O(n)
Complexity: O(nm + n m_b^2)
n = |T|, m = max{|A^b|, |B^b|}, m_b: maximum number of blocks in A^b and B^b
Experimental Results (1)
Data set        Sequence length (bp)      Number of blocks    Running time (sec.)
                |T|     |A|     |B|       A       B           MergedLCS    BMergedLCS+
dodA            1629    687     942       6       7           52.69        0.70
pita & dcp-1    6000    2480    1756      3       3           1312.29      13.25
Experimental Results (2)
[Figure: alignments of the sequences gi|24762322|ref, gi|73917619:700 and gi|24762318:c24 over positions 72001 to 72100 and 74031 to 74130 of the pita / dcp-1 region (panels (a) to (d)), comparing Clustal W, BMergedLCS+ and MergedLCS.]
Summary – Merged LCS
The merged-LCS problem LCS(T, E(A, B))
  MergedLCS: O(nm^2)
The blocked merged-LCS problem bLCS(T, E^b(A^b, B^b))
  BMergedLCS: O(n m m_b)
  BMergedLCS+: O(nm + n m_b^2)
n = |T|, m = max{|A^b|, |B^b|}
m_b: maximum number of blocks in A^b and B^b
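Mosaic LCS Problem
Input: target sequence T, a mosaic number, and a sequence set S.
Output: a sequence C, formed by concatenating as many sequences Ci ∈ S as the mosaic number specifies, such that LCS(T, C) is maximal.
e.g. with mosaic number 4: max{ LCS(T, C1C2C3C4) | Ci ∈ S }
[Figure: target T and sequence set S with mosaic number 4; LCS(T, S4S2S3S2) is maximal.]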
Algorithm for Mosaic LCS (1)
Compute LCS(T[p, q], Sj) for 0 ≤ p ≤ q ≤ n, Sj ∈ S, |Sj| = m.
Time complexity: O(n^2 m |S|)
Algorithm for Mosaic LCS (2)
Recursive doubling scheme:
LCS(T[p, r], C1C2) = max{ LCS(T[p, q], C1) + LCS(T[q, r], C2) }, 0 ≤ p ≤ q ≤ r ≤ n, Ci ∈ S
Time complexity: O(n^3)
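The splitting identity used by the recursive doubling scheme can be checked with a short Python sketch. It assumes p, q and r are cut points between symbols, so T[p, q] and T[q, r] correspond to the Python slices `t[p:q]` and `t[q:r]`; the names `lcs_length` and `split_lcs` are hypothetical, and the sample strings are T, S1 and S2 from the Formosa2 example.

```python
def lcs_length(s1, s2):
    """Plain quadratic LCS length."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def split_lcs(t, c1, c2):
    """max over cut points q of LCS(t[:q], c1) + LCS(t[q:], c2)."""
    return max(
        lcs_length(t[:q], c1) + lcs_length(t[q:], c2) for q in range(len(t) + 1)
    )


t, c1, c2 = "agactagtc", "agc", "act"
print(split_lcs(t, c1, c2) == lcs_length(t, c1 + c2))  # True
```

Trying every cut point for every pair (p, r) is what leads to the O(n^3) cost of one doubling step.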
(1, 1) = 2, (2, 2) = 4, (4, 4) = 8, (8, 8) = 16
(C1, C2) = C1C2, (C1C2, C3C4) = C1C2 C3C4O(n3 log )
Summary – Mosaic LCS
Mosaic LCS ProblemLCS(T, C<1,>)
Straightforward DP: O(n2m|S| + n3 log ) Improved Algorithm with S-table:
O(n(m+) |S|)
[Figure: target T, sequence set S, and a mosaic sequence C = C1C2C3.]
Conclusions
Other related problems:
  Constrained LCS Problem
  Longest Increasing Subsequence Problem
  Longest Common Increasing Subsequence Problem of Two Sequences
  Near Optimal Alignment
  Alignment with Multiple Scoring Functions
  Multiple Sequence Alignment
  Fast LCS of Multiple Sequences
References (1)
Block Edit Distance
[Ukkonen, 1985] Algorithms for approximate string matching, Information and Control, Vol. 64, pp. 100-118, 1985.
[Shapira and Storer, 2007] Edit distance with move operations, Journal of Discrete Algorithms, Vol. 5, No. 2, pp. 380-392, 2007.
[Ann 2007] Hsing-Yen Ann, Chang-Biau Yang, Yung-Hsing Peng and Bern-Cherng Liaw, "Efficient Algorithms for the Block Edit Problems," Proc. of the 24th Workshop on Combinatorial Mathematics and Computation Theory, pp. 201-208, Nantou, Taiwan, April 27-28, 2007.
LCS of Run-Length Encoded Strings
[Bunke and Csirik, 1995] An improved algorithm for computing the edit distance of run-length coded strings, Information Processing Letters, Vol. 54, No. 2, pp. 93-96, 1995.
[Liu et al., 2008] Finding a longest common subsequence between a run-length-encoded string and an uncompressed string, Journal of Complexity, Vol. 24, No. 2, pp. 173-184, 2008.
[Ann 2008] Hsing-Yen Ann, Chang-Biau Yang, Chiou-Ting Tseng and Chiou-Yi Hor, "A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings," Information Processing Letters, Vol. 108, pp. 360-364, 2008.
References (2)
Merged LCS and Mosaic LCS
[Huang et al. 2007] Kuo-Si Huang, Chang-Biau Yang*, Kuo-Tsung Tseng, Yung-Hsing Peng and Hsing-Yen Ann, "Dynamic Programming Algorithms for the Mosaic Longest Common Subsequence Problem," Information Processing Letters, Vol. 102, pp. 99-103, 2007.
[Huang et al. 2008] Kuo-Si Huang, Chang-Biau Yang*, Kuo-Tsung Tseng, Hsing-Yen Ann and Yung-Hsing Peng, "Efficient Algorithms for Finding Interleaving Relationship between Sequences," Information Processing Letters, Vol. 105, No. 5, pp. 188-193, 2008.
References (3)
Suffix Tree and Range Minimum Query
[Bender and Farach-Colton, 2000] The LCA problem revisited, in: LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, 2000, pp. 88-94.
[Weiner, 1973] Linear pattern matching algorithm, in: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1-11, 1973.
Genome
[Laundrie et al., 2003] Germline cell death is inhibited by P-element insertions disrupting the dcp-1/pita nested gene pair in Drosophila, Genetics, Vol. 165, No. 4, pp. 1881-1888, 2003.
[Kellis et al., 2004] Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature, Vol. 428, pp. 617-624, 2004.
Finding and Maintaining the Range Minimum with Linear Penalties
[Figure: four snapshots of row i (columns 1 to 10) showing how the range minimum is maintained when each candidate carries a linear penalty that is a multiple of p_e.]
Finding the Substring Edit Distance in Constant Time
X = gattac, Y = gtaaca
[Figure: DP table between X = gattac and the substring Y_{3..6} = aaca, from which the values d_sub(X, Y_{3..3}) to d_sub(X, Y_{3..5}) are read off.]
S-table
T = atacgcgctt
A^b = cgat acc (blocks A^b_1, A^b_2)    B^b = aat tc gc (blocks B^b_1, B^b_2, B^b_3)
[Figure: the S-table St(T, A^b_1), from which the row LCS(T[3+1, i], A^b_1) for i = 1, ..., 10 is read off.]
[Figure: the S-table St(T, A^b_1) for T = atacgcgctt, A^b = cgat acc and B^b = aat tc gc.]
i                               = 0 1 2 3 4 5 6 7 8 9 10
V^b(1, 0) = LCS(T[1, i], A^b_1) = 0 1 2 2 2 2 2 2 2 3 3
[Figure: the S-table St(T, B^b_1) for T = atacgcgctt, A^b = cgat acc and B^b = aat tc gc.]
i                               = 0 1 2 3 4 5 6 7 8 9 10
V^b(0, 1) = LCS(T[1, i], B^b_1) = 0 1 2 2 2 2 2 2 2 3 3
T = atacgcgctt, A^b = cgat acc, B^b = aat tc gc
i                                                     = 0 1 2 3 4 5 6 7 8 9 10
V^b(1, 0) = LCS(T[1, i], A^b_1)                       = 0 1 2 2 2 2 2 2 2 3 3
V^b(0, 1) = LCS(T[1, i], B^b_1)                       = 0 1 2 2 2 2 2 2 2 3 3
V^b(1, 0) ⊕ St(T, B^b_1) = LCS(T[1, i], A^b_1 B^b_1)  = 0 1 2 3 3 3 3 3 3 4 4
V^b(0, 1) ⊕ St(T, A^b_1) = LCS(T[1, i], B^b_1 A^b_1)  = 0 1 2 3 3 4 4 4 4 5 5
V^b(1, 1) = max{V^b(1, 0) ⊕ St(T, B^b_1), V^b(0, 1) ⊕ St(T, A^b_1)}
                                                      = 0 1 2 3 3 4 4 4 4 5 5
Improved Algorithm
Vl = minE{Vl−1 ⊕ St(T, Si)} for each Si ∈ S and 1 ≤ l ≤
S-table
St(T, Si): O(nm|S|)
Vl = minE{Vl−1 ⊕ St(T, Si)}: O(n|S|)
Time Complexity: O(n(m+)|S|)
Example of Algorithm Formosa2
T = agactagtc
S = {S1=agc, S2=act, S3=aatg, S4=ttcg}
T = agactagtc
S1 = 12-3----- (0,1,2,4)
S2 = 1--23---- (0,1,4,5)
S3 = 12--3-4-- (0,1,2,5,7)
S4 = -1----2-3 (0,2,7,9)
12-3--4-- (0,1,2,4,7)=V1