dynamic-programming strategies for analyzing biomolecular sequences kun-mao chao ( 趙坤茂 )...

98
Dynamic-Programming Strateg ies for Analyzing Biomolecu lar Sequences Kun-Mao Chao ( 趙趙趙 ) Department of Computer Scienc e and Information Engineering National Taiwan University, T aiwan E-mail: [email protected] WWW: http://www.csie.ntu.edu.tw/~k mchao

Upload: linda-hortense-mckinney

Post on 30-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

Dynamic-Programming Strategies for Analyzing Biomolecular Sequenc

es

Kun-Mao Chao (趙坤茂 )Department of Computer Science an

d Information EngineeringNational Taiwan University, Taiwan

E-mail: [email protected]

WWW: http://www.csie.ntu.edu.tw/~kmchao

Page 2: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

2

Dynamic Programming

• Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure.

• Richard Bellman was one of the principal founders of this approach.

Page 3: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

3

Two key ingredients

• Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution:

Each substructure is optimal.

(Principle of optimality)

1. optimal substructures

2. overlapping subproblems

Subproblems are dependent.

(otherwise, a divide-and-conquer approach is the choice.)

Page 4: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

4

Three basic components

• The development of a dynamic-programming algorithm has three basic components:– The recurrence relation (for defining the val

ue of an optimal solution);– The tabular computation (for computing th

e value of an optimal solution);– The traceback (for delivering an optimal sol

ution).

Page 5: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

5

Fibonacci numbers

.for

21

11

00

i>1i

Fi

FiF

F

F

The Fibonacci numbers are defined by the following recurrence:

Page 6: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

6

How to compute F10?

F10

F9

F8

F8

F7

F7

F6

……

Page 7: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

7

Tabular computation

• The tabular computation can avoid recompuation.

F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

0 1 1 2 3 5 8 13 21 34 55

Page 8: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

8

Maximum-sum interval

• Given a sequence of real numbers a1a2…an , find a consecutive subsequence with the maximum sum.9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

Page 9: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

9

O-notation: an asymptotic upper bound

• f(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0

cg(n)

f(n)

n0

Page 10: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

10

How functions grow?

30n 92n log n 26n2 0.68n3 2n

1000.003

sec.0.003 sec.

0.0026 sec.

0.68 sec.4 x 1016

yr.

100,000 3.0 sec. 2.6 min. 3.0 days 22 yr.

For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!

n

function

(Assume one million operations per second.)

Page 11: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

11

Maximum-sum interval(The recurrence relation)

• Define S(i) to be the maximum sum of the intervals ending at position i.

0

)1(max)(

iSaiS i

ai

If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

Page 12: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

12

Maximum-sum interval(Tabular computation)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum sum

Page 13: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

13

Maximum-sum interval(Traceback)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum-sum interval: 6 -2 8 4

Page 14: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

14

Two fundamental problems we recently solved (I) (joint work with Lin and Jiang)

• Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm.U = 3

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

Page 15: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

15

Joint work with Huang, Jiang and Lin

• Xiaoqiu Huang (黃曉秋 )Iowa State University, USA

• Tao Jiang (姜濤 ) University of California Riverside, USA

• Yaw-Ling Lin (林耀鈴 )Providence University, Taiwan

Page 16: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

16

Two fundamental problems we recently solved (II) (joint work with Lin and Jiang)

• Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm.

L = 4

3 2 14 6 6 2 10 2 6 6 14 2 1

Page 17: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

17

Another example

Given a sequence as follows:2, 6.6, 6.6, 3, 7, 6, 7, 2

and L = 2, the highest-average interval is the squared area, which has the average value 20/3.

2, 6.6, 6.6, 3, 7, 6, 7, 2

Page 18: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

18

C+G rich regions

• Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time.

ATGACTCGAGCTCGTCA

00101011011011010

Search for an interval of length at least L with the highest average.

Page 19: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

19

Length-unconstrained version

• Maximum-average interval

3 2 14 6 6 2 10 2 6 6 14 2 1

The maximum element is the answer. It can be done in O(n) time.

Page 20: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

20

Q: computing density in O(1) time?

• prefix-sum(i) = S[1]+S[2]+…+S[i],– all n prefix sums are computable in O(n) time.

• sum(i, j) = prefix-sum(j) – prefix-sum(i-1)

• density(i, j) = sum(i, j) / (j-i+1)

prefix-sum(j)

i j

prefix-sum(i-1)

Page 21: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

21

A naive algorithm

• A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time

• Try L, L+1, L+2, ..., n. In total, O(n2).

Page 22: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

22

An O(nL)-time algorithm (Huang, CABIOS’ 94)

• Observing that there exists an optimal interval of length bounded by 2L, we immediately have an O(nL)-time algorithm.

We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.

Page 23: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

23

Good partners

• Finding good partner g(i) for each i=1, 2, …, n-L+1.

L

g(i)

maximing avg[i, g(i)]

i + L

Page 24: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

24

Right-Skew Decomposition

• Partition S into substrings S1,S2,…,Sk such that– each Si is a right-skew substring of S

• the average of any prefix is always less than or equal to the average of the remaining suffix.

– density(S1) > density(S2) > … > density(Sk)

• [Lin, Jiang, Chao]– Unique– Computable in linear time.

Page 25: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

25

An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02)

• Decreasingly right-skew decomposition (O(n) time)

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

Page 26: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

26

An O(n log L)-time algorithm

• Jumping tables that allows binary search.– O(log L) time for each position to find its good

partner, therefore the total running time is O(n log L).

• We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.

Page 27: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

27

Goldwasswer, Kao, and Lu’s recent progress

• An O(n) time algorithm for the problem– Appeared in Proceedings of the Second Worksh

op on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002.

Page 28: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

28

A new important observation

• i < j < g(j) < g(i) implies– density(i, g(i)) is no more than density(j, g(j))

i g(i)j g(j)

Page 29: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

29

k non-overlapping maximum-average segments

(Lin, Huang, Jiang, Chao, Bioinformatics)

• Maintain all candidates in a priority queue.

• A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

Page 30: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

30

Longest increasing subsequence(LIS)

• The longest increasing subsequence is to find a longest increasing subsequence of a given sequence of distinct integers a1a2…an .

e.g. 9 2 5 3 7 11 8 10 13 6

2 3 7

5 7 10 13

9 7 11

3 5 11 13

are increasing subsequences.

are not increasing subsequences.

We want to find a longest one.

Page 31: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

31

A naive approach for LIS• Let L[i] be the length of a longest increasing

subsequence ending at position i.

L[i] = 1 + max j = 0..i-1{L[j] | aj < ai}(use a dummy a0 = minimum, and L[0]=0)

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 ?

Page 32: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

32

A naive approach for LIS

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 4 5 6 3

L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai}

The maximum length

The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.

This method runs in O(n2) time.

Page 33: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

33

Binary search

• Given an ordered sequence x1x2 ... xn, where x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time.

n...

n/2n/4

Page 34: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

34

Binary search

• How many steps would a binary search reduce the problem size to 1?n n/2 n/4 n/8 n/16 ... 1

How many steps? O(log n) steps.

ns

n s

2log

12/

Page 35: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

35

An O(n log n) method for LIS

• Define BestEnd[k] to be the smallest number of an increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 69 2 2

5

2

3

2

3

7

2

3

7

11

2

3

7

8

2

3

7

8

10

2

3

7

8

10

13

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

Page 36: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

36

An O(n log n) method for LIS

• Define BestEnd[k] to be the smallest number of an increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 69 2 2

5

2

3

2

3

7

2

3

7

11

2

3

7

8

2

3

7

8

10

2

3

7

8

10

13

2

3

6

8

10

13

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).

Page 37: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

37

Longest Common Subsequence (LCS)

• A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent.

• The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

Page 38: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

38

LCS

For instance,

Sequence 1: president

Sequence 2: providence

Its LCS is priden.

president

providence

Page 39: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

39

LCS

Another example:

Sequence 1: algorithm

Sequence 2: alignment

One of its LCS is algm.

a l g o r i t h m

a l i g n m e n t

Page 40: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

40

How to compute LCS?

• Let A=a1a2…am and B=b1b2…bn .

• len(i, j): the length of an LCS between a1a2…ai and b1b2…bj

• With proper initializations, len(i, j)can be computed as follows.

,

. and 0, if)),1(),1,(max(

and 0, if1)1,1(

,0or 0 if0

),(

ji

ji

bajijilenjilen

bajijilen

ji

jilen

Page 41: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

41

p r o c e d u r e L C S - L e n g t h ( A , B )

1 . f o r i ← 0 t o m d o l e n ( i , 0 ) = 0

2 . f o r j ← 1 t o n d o l e n ( 0 , j ) = 0

3 . f o r i ← 1 t o m d o

4 . f o r j ← 1 t o n d o

5 . i f ji ba t h e n

" "),(

1)1,1(),(

jiprev

jilenjilen

6 . e l s e i f )1,(),1( jilenjilen

7 . t h e n

" "),(

),1(),(

jiprev

jilenjilen

8 . e l s e

" "),(

)1,(),(

jiprev

jilenjilen

9 . r e t u r n l e n a n d p r e v

Page 42: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

42

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

Page 43: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

43

p r o c e d u r e O u tp u t - L C S ( A , p r e v , i , j )

1 i f i = 0 o r j = 0 t h e n r e t u r n

2 i f p r e v ( i , j ) = ” “ t h e n

ia

jiprevALCSOutput

print

)1,1,,(

3 e l s e i f p r e v ( i , j ) = ” “ t h e n O u tp u t - L C S ( A , p r e v , i - 1 , j )

4 e l s e O u tp u t - L C S ( A , p r e v , i , j - 1 )

Page 44: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

44

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

Output: priden

Page 45: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

45

Bioinformatics

Page 46: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

46

Bioinformatics and Computational Biology-Related Journals:

• Bioinformatics (previously called CABIOS)• Bulletin of Mathematical Biology• Computers and Biomedical Research• Genome Research• Genomics• Journal of Bioinformatics and Computational Biology• Journal of Computational Biology• Journal of Molecular Biology• Nature• Science

Page 47: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

47

Bioinformatics and Computational Biology-Related Conferences:

• the first IEEE Computer Society Bioinformatics Conference (CSB 2002, CA, USA)

• Intelligent Systems for Molecular Biology (ISMB 2003, Brisbane, Australia)

• Pacific Symposium on Biocomputing(PSB 2003, Kauai, Hawaii, USA)

• The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2003, Berlin, Germany)

Page 48: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

48

Bioinformatics and Computational Biology-Related Books:

• Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995)

• Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995)

• Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996)

• Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)

• Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)

• Introduction to Bioinformatics, by Arthur M. Lesk (2002)

Page 49: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

49

Useful Websites• MIT Biology Hypertextbook

– http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html

• The International Society for Computational Biology:– http://www.iscb.org/

• National Center for Biotechnology Information (NCBI, NIH):– http://www.ncbi.nlm.nih.gov/

• European Bioinformatics Institute (EBI):– http://www.ebi.ac.uk/

• DNA Data Bank of Japan (DDBJ):– http://www.ddbj.nig.ac.jp/

Page 50: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

50

Sequence Alignment

Page 51: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

51

Dot MatrixSequence A: CTTAACT

Sequence B: CGGATCATC G G A T C A T

C

T

T

A

A

C

T

Page 52: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

52

C---TTAACTCGGATCA--T

Pairwise AlignmentSequence A: CTTAACTSequence B: CGGATCAT

An alignment of A and B:

Sequence A

Sequence B

Page 53: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

53

C---TTAACTCGGATCA--T

Pairwise AlignmentSequence A: CTTAACTSequence B: CGGATCAT

An alignment of A and B:

Insertion gap

Match Mismatch

Deletion gap

Page 54: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

54

Alignment GraphSequence A: CTTAACT

Sequence B: CGGATCATC G G A T C A T

C

T

T

A

A

C

T

C---TTAACTCGGATCA--T

Page 55: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

55

A simple scoring scheme

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)

C - - - T T A A C TC G G A T C A - - T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Alignment score

Page 56: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

56

An optimal alignment-- the alignment of maximum score

• Let A=a1a2…am and B=b1b2…bn .

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computedas follows.

),(

),(

),(

max

1,1

1,

,1

,

jiji

jji

iji

ji

baws

bws

aws

s

Page 57: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

57

Computing Si,j

i

j

w(ai,-)

w(-,bj)

w(ai,b

j)

Sm,n

Page 58: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

58

Initializations

0 -3 -6 -9 -12 -15 -18 -21 -24

-3

-6

-9

-12

-15

-18

-21

C G G A T C A T

C

T

T

A

A

C

T

Page 59: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

59

S3,5 = ?

0 -3 -6 -9 -12 -15 -18 -21 -24

-3 8 5 2 -1 -4 -7 -10 -13

-6 5 3 0 -3 7 4 1 -2

-9 2 0 -2 -5 ?

-12

-15

-18

-21

C G G A T C A T

C

T

T

A

A

C

T

Page 60: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

60

S3,5 = ?

0 -3 -6 -9 -12 -15 -18 -21 -24

-3 8 5 2 -1 -4 -7 -10 -13

-6 5 3 0 -3 7 4 1 -2

-9 2 0 -2 -5 5 -1 -4 9

-12 -1 -3 -5 6 3 0 7 6

-15 -4 -6 -8 3 1 -2 8 5

-18 -7 -9 -11 0 -2 9 6 3

-21 -10 -12 -14 -3 8 6 4 14

C G G A T C A T

C

T

T

A

A

C

T

optimal score

Page 61: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

61

C T T A A C – TC G G A T C A T

0 -3 -6 -9 -12 -15 -18 -21 -24

-3 8 5 2 -1 -4 -7 -10 -13

-6 5 3 0 -3 7 4 1 -2

-9 2 0 -2 -5 5 -1 -4 9

-12 -1 -3 -5 6 3 0 7 6

-15 -4 -6 -8 3 1 -2 8 5

-18 -7 -9 -11 0 -2 9 6 3

-21 -10 -12 -14 -3 8 6 4 14

C G G A T C A T

C

T

T

A

A

C

T

8 – 5 –5 +8 -5 +8 -3 +8 = 14

Page 62: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

62

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

Page 63: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

63

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computedas follows.

),(

),(),(

0

max

1,1

1,

,1

,

jiji

jji

iji

ji

baws

bwsaws

s

Page 64: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

64

local alignment

0 0 0 0 0 0 0 0 0

0 8 5 2 0 0 8 5 2

0 5 3 0 0 8 5 3 13

0 2 0 0 0 8 5 2 11

0 0 0 0 8 5 3 ?

0

0

0

C G G A T C A T

C

T

T

A

A

C

T

Match: 8

Mismatch: -5

Gap symbol: -3

Page 65: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

65

local alignment

0 0 0 0 0 0 0 0 0

0 8 5 2 0 0 8 5 2

0 5 3 0 0 8 5 3 13

0 2 0 0 0 8 5 2 11

0 0 0 0 8 5 3 13 10

0 0 0 0 8 5 2 11 8

0 8 5 2 5 3 13 10 7

0 5 3 0 2 13 10 8 18

C G G A T C A T

C

T

T

A

A

C

T

Match: 8

Mismatch: -5

Gap symbol: -3

The best

score

Page 66: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

66

0 0 0 0 0 0 0 0 0

0 8 5 2 0 0 8 5 2

0 5 3 0 0 8 5 3 13

0 2 0 0 0 8 5 2 11

0 0 0 0 8 5 3 13 10

0 0 0 0 8 5 2 11 8

0 8 5 2 5 3 13 10 7

0 5 3 0 2 13 10 8 18

C G G A T C A T

C

T

T

A

A

C

T

The best

score

A – C - TA T C A T8-3+8-3+8 = 18

Page 67: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

67

Affine gap penalties• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)

• Each gap is charged an extra gap-open penalty: -4.

C - - - T T A A C TC G G A T C A - - T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

-4 -4

Alignment score: 12 – 4 – 4 = 4

Page 68: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

68

Affine gap panalties• A gap of length k is penalized x + k·y.

gap-open penalty

gap-symbol penaltyThree cases for alignment endings:

1. ...x...x

2. ...x...-

3. ...-...x

an aligned pair

a deletion

an insertion

Page 69: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

69

Affine gap penalties

• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.

• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.

• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

Page 70: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

70

Affine gap penalties

),(

),(

),()1,1(

max),(

)1,(

)1,(max),(

),1(

),1(max),(

jiI

jiD

bawjiS

jiS

yxjiS

yjiIjiI

yxjiS

yjiDjiD

ji

Page 71: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

71

Affine gap penalties

SI

D

SI

D

SI

D

SI

D

-y-x-y

-x-y

-y

w(ai,bj)

Page 72: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

72

k best local alignments

• Smith-Waterman(Smith and Waterman, 1981; Waterman and Eggert, 1987)

• FASTA(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

• BLAST(Altschul et al., 1990; Altschul et al., 1997)

Page 73: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

73

k best local alignments

• Smith-Waterman(Smith and Waterman, 1981; Waterman and Eggert, 1987)– linear-space version: sim (Huang and Miller, 1991)

– linear-space variants: sim2 (Chao et al., 1995); sim3 (Chao et al., 1997)

– Chaining local alignments: Gap3 (Huang and Chao, 2003)

• FASTA(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)– linear-space band alignment (Chao et al., 1992)

• BLAST(Altschul et al., 1990; Altschul et al., 1997)– Enhanced PatternHunter (Yang et al., coming soon)

Page 74: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

74

FASTA

1) Find runs of identities, and identify regions with the highest density of identities.

2) Re-score using PAM matrix, and keep top scoring segments.

3) Eliminate segments that are unlikely to be part of the alignment.

4) Optimize the alignment in a band.

Page 75: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

75

FASTA

Step 1: Find runes of identities, and identify regions with the highest density of identities.

Sequence A

Sequence B

Page 76: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

76

FASTA

Step 2: Re-score using PAM matrix, andkeep top scoring segments.

Page 77: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

77

FASTA

Step 3: Eliminate segments that are unlikely to be part

of the alignment.

Page 78: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

78

FASTA

Step 4: Optimize the alignment in a band.

Page 79: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

79

BLAST

Basic Local Alignment Search Tool(by Altschul, Gish, Miller, Myers and Lipman)

The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

Page 80: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

80

The maximal segment pair measure

A maximal segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences.(for DNA: Identities: +5; Mismatches: -4)

the highest scoring pair

•The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming.

•BLAST heuristically attempts to calculate the MSP score.

Page 81: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

81

BLAST

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

3) Extend hits.

Page 82: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

82

BLASTStep 1: Build the hash table for Sequence A. (3-tuple example)

For DNA sequences:

Seq. A = AGATCGAT 12345678AAAAAC..AGA 1..ATC 3..CGA 5..GAT 2 6..TCG 4..

TTT

For protein sequences:

Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) T;≧Add xyz to the hash table if Score(xyz, LVI) T;≧Add xyz to the hash table if Score(xyz, VIS) T;≧

Page 83: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

83

BLASTStep2: Scan sequence B for hits.

Page 84: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

84

BLASTStep2: Scan sequence B for hits.

Step 3: Extend hits.

hit

Terminate if the score of the sxtension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

BLAST 2.0 saves the time spent in extension, and

considers gapped alignments.

Page 85: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

85

Remarks

• Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.

• The idea of filtration was used in both FASTA and BLAST.

Page 86: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

86

Linear-space ideasHirschberg, 1975; Myers and Miller, 1988

m/2

Page 87: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

87

Two subproblems½ original problem size

m/2

m/4

3m/4

Page 88: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

88

Four subproblems¼ original problem size

m/2

m/4

3m/4

Page 89: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

89

Time and Space Complexity

• Space: O(M+N)

• Time:

O(MN)*(1+ ½ + ¼ + …) = O(MN)

2

Page 90: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

90

Band Alignment(Joint work with W. Pearson and W. Miller)

Sequence B

Sequence A

Page 91: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

91

Band Alignment in Linear Space The remaining subproblems are no

longer only half of the original problem. In worst case, this could cause an additional log n factor in time.

Page 92: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

92

Band Alignment in Linear Space

Page 93: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

93

Multiple sequence alignment (MSA)

• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC

Seq2: AC

Seq3: GATC

GC-TC

A---C

G-ATC

Page 94: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

94

How to score an MSA?

• Sum-of-Pairs (SP-score)

GC-TC

A---C

G-ATC

GC-TC

A---C

GC-TC

G-ATC

A---C

G-ATC

Score =

Score

Score

Score

+

+

Page 95: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

95

MSA for three sequences

• an O(n3) algorithm

Page 96: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

96

General MSA

• For k sequences of length n: O(nk)

• NP-Complete (Wang and Jiang)

• The exact multiple alignment algorithms for many sequences are not feasible.

• Some approximation algorithms are given.(e.g., 2- l/k for any fixed l by Bafna et al.)

Page 97: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

97

Progressive alignment

• A heuristic approach proposed by Feng and Doolittle.• It iteratively merges the most similar pairs.• “Once a gap, always a gap”

A B C D E

The time for progressive alignment in most cases is roughly the order of the time for computing all pairwise alignme

nt, i.e., O(k2n2) .

Page 98: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National

98

Concluding remarks

• Three essential components of the dynamic-programming approach:– the recurrence relation– the tabular computation– the traceback

• The dynamic-programming approach has been used in a vast number of computational problems in bioinformatics.