sequence alignment

Page 1: Sequence Alignment

Sequence Alignment

Kun-Mao Chao (趙坤茂)

Department of Computer Science and Information Engineering

National Taiwan University, Taiwan

E-mail: [email protected]

WWW: http://www.csie.ntu.edu.tw/~kmchao

Page 2: Sequence Alignment

2

Bioinformatics

Page 3: Sequence Alignment

3

Bioinformatics and Computational Biology-Related Journals:

• Bioinformatics (previously called CABIOS)
• Bulletin of Mathematical Biology
• Computers and Biomedical Research
• Genome Research
• Genomics
• Journal of Bioinformatics and Computational Biology
• Journal of Computational Biology
• Journal of Molecular Biology
• Nature
• Nucleic Acids Research
• Science

Page 4: Sequence Alignment

4

Bioinformatics and Computational Biology-Related Conferences:

• Intelligent Systems for Molecular Biology (ISMB)
• Pacific Symposium on Biocomputing (PSB)
• The Annual International Conference on Research in Computational Molecular Biology (RECOMB)
• The IEEE Computer Society Bioinformatics Conference (CSB)
• ...

Page 5: Sequence Alignment

5

Bioinformatics and Computational Biology-Related Books:

• Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995)

• Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995)

• Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996)

• Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)

• Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)

• Introduction to Bioinformatics, by Arthur M. Lesk (2002)

Page 6: Sequence Alignment

6

Useful Websites

• MIT Biology Hypertextbook
– http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html

• The International Society for Computational Biology:
– http://www.iscb.org/

• National Center for Biotechnology Information (NCBI, NIH):
– http://www.ncbi.nlm.nih.gov/

• European Bioinformatics Institute (EBI):
– http://www.ebi.ac.uk/

• DNA Data Bank of Japan (DDBJ):
– http://www.ddbj.nig.ac.jp/

Page 7: Sequence Alignment

7

Sequence Alignment

Page 8: Sequence Alignment

8

Dot Matrix

Sequence A: CTTAACT

Sequence B: CGGATCAT

[Figure: dot matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows.]
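The dot matrix for these two sequences can be sketched in a few lines of Python. This is only an illustration: it assumes the usual convention that a cell is marked when the two symbols are identical, and the function name `dot_matrix` is made up for this example.

```python
def dot_matrix(a: str, b: str) -> list[str]:
    """Return one text row per symbol of `a`; '*' marks cells where a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


seq_a = "CTTAACT"   # rows
seq_b = "CGGATCAT"  # columns

rows = dot_matrix(seq_a, seq_b)
print("  " + seq_b)
for symbol, row in zip(seq_a, rows):
    print(symbol + " " + row)
```

Runs of marks along a diagonal correspond to stretches where the two sequences agree without gaps.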

Page 9: Sequence Alignment

9

Pairwise Alignment

Sequence A: CTTAACT
Sequence B: CGGATCAT

An alignment of A and B:

C---TTAACT (Sequence A)
CGGATCA--T (Sequence B)

Page 10: Sequence Alignment

10

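Pairwise Alignment

Sequence A: CTTAACT
Sequence B: CGGATCAT

An alignment of A and B:

C---TTAACT
CGGATCA--T

The columns of this alignment fall into four kinds:
• Match: identical symbols, e.g. the first column (C over C)
• Mismatch: different symbols, e.g. the sixth column (T over C)
• Insertion gap: gap symbols in A, here "---" opposite GGA
• Deletion gap: gap symbols in B, here "--" opposite AC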

Page 11: Sequence Alignment

11

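Alignment Graph

Sequence A: CTTAACT

Sequence B: CGGATCAT

[Figure: alignment graph with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows, shown together with the alignment:]

C---TTAACT
CGGATCA--T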

Page 12: Sequence Alignment

12

A simple scoring scheme

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)

C - - - T T A A C T
C G G A T C A - - T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Alignment score
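The column-by-column sum above can be checked with a short Python sketch. The helper names `w` and `alignment_score` are chosen for this example only; the scores are the ones from the slide.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # the simple scoring scheme from the slide


def w(x: str, y: str) -> int:
    """Score of one alignment column; '-' denotes a gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equally long alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```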

Page 13: Sequence Alignment

13

An optimal alignment: the alignment of maximum score

• Let A = a1a2…am and B = b1b2…bn.

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
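A minimal Python sketch of this recurrence follows. It assumes the simple scoring scheme (match +8, mismatch -5, gap symbol -3) and boundary values that are multiples of the gap score; the function name `global_table` is hypothetical.

```python
def global_table(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Fill the (len(a)+1) x (len(b)+1) table S of optimal global alignment scores."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: a1..ai aligned to gaps only
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):          # first row: b1..bj aligned to gaps only
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,       # a_i against a gap symbol
                S[i][j - 1] + gap,       # b_j against a gap symbol
                S[i - 1][j - 1] + pair,  # a_i against b_j
            )
    return S


table = global_table("CTTAACT", "CGGATCAT")
print(table[3][5])    # 5
print(table[-1][-1])  # 14
```

The table has (m+1)(n+1) cells and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.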

Page 14: Sequence Alignment

14

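Computing Si,j

[Figure: cell (i, j) of the alignment matrix is reached by a vertical edge of weight w(ai,-), a horizontal edge of weight w(-,bj), and a diagonal edge of weight w(ai,bj); the final cell holds Sm,n.]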

Page 15: Sequence Alignment

15

Initializations

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Page 16: Sequence Alignment

16

S3,5 = ?

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

Page 17: Sequence Alignment

17

S3,5 = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

Optimal score: S7,8 = 14 (bottom-right cell)

Page 18: Sequence Alignment

18

C T T A A C - T
C G G A T C A T

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
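An optimal alignment such as the one above can be recovered by tracing back from the bottom-right cell to the top-left cell. The sketch below is one way to do it, assuming the same scoring scheme; the name `global_align` and the tie-breaking order (diagonal first, then up, then left) are choices made for this example, so other optimal alignments may exist.

```python
def global_align(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return (score, row_a, row_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + pair)

    # Trace back: at each cell, follow an edge that explains the cell's value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = (match if a[i - 1] == b[j - 1] else mismatch) if i > 0 and j > 0 else None
        if pair is not None and S[i][j] == S[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(score)
print(row_a)
print(row_b)
```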

Page 19: Sequence Alignment

19

Now try this example in class

Sequence A: CAATTGA
Sequence B: GAATCTGC

Their optimal alignment?

Page 20: Sequence Alignment

20

Initializations

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
A   -6
A   -9
T  -12
T  -15
G  -18
A  -21

Page 21: Sequence Alignment

21

S4,2 = ?

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14    ?
T  -15
G  -18
A  -21

Page 22: Sequence Alignment

22

S5,5 = ?

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16    ?
G  -18
A  -21

Page 23: Sequence Alignment

23

S5,5 = 14

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16   14   24   21   18
G  -18   -7   -9    2   13   11   21   32   29
A  -21  -10    1   -1   10    8   18   29   27

Optimal score: S7,8 = 27 (bottom-right cell)

Page 24: Sequence Alignment

24

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16   14   24   21   18
G  -18   -7   -9    2   13   11   21   32   29
A  -21  -10    1   -1   10    8   18   29   27

C A A T - T G A
G A A T C T G C

-5 +8 +8 +8 -3 +8 +8 -5 = 27

Page 25: Sequence Alignment

25

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

Page 26: Sequence Alignment

26

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
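The local recurrence differs from the global one only in the extra 0 option and in where the answer is read: the best score is the maximum over all cells. A minimal Python sketch, assuming match +8, mismatch -5, gap symbol -3 and an all-zero first row and column; the name `local_align` and the tie-breaking order in the traceback are choices made for this example.

```python
def local_align(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return (best_score, row_a, row_b) for one optimal local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + pair)
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)

    # Trace back from the best cell until a cell of score 0 is reached.
    top, bottom = [], []
    i, j = best_cell
    while i > 0 and j > 0 and S[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))
print(local_align("CAATTGA", "GAATCTGC"))
```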

Page 27: Sequence Alignment

27

local alignment

Match: 8
Mismatch: -5
Gap symbol: -3

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3    ?
A    0
C    0
T    0

Page 28: Sequence Alignment

28

local alignment

Match: 8
Mismatch: -5
Gap symbol: -3

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18

The best score: 18 (bottom-right cell)

Page 29: Sequence Alignment

29

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18

The best score: 18

A - C - T
A T C A T

8 - 3 + 8 - 3 + 8 = 18

Page 30: Sequence Alignment

30

Now try this example in class

Sequence A: CAATTGA
Sequence B: GAATCTGC

Their optimal local alignment?

Page 31: Sequence Alignment

31

Did you get it right?

          G    A    A    T    C    T    G    C
     0    0    0    0    0    0    0    0    0
C    0    0    0    0    0    8    5    2    8
A    0    0    8    8    5    5    3    0    5
A    0    0    8   16   13   10    7    4    2
T    0    0    5   13   24   21   18   15   12
T    0    0    2   10   21   19   29   26   23
G    0    8    5    7   18   16   26   37   34
A    0    5   16   13   15   13   23   34   32

Page 32: Sequence Alignment

32

          G    A    A    T    C    T    G    C
     0    0    0    0    0    0    0    0    0
C    0    0    0    0    0    8    5    2    8
A    0    0    8    8    5    5    3    0    5
A    0    0    8   16   13   10    7    4    2
T    0    0    5   13   24   21   18   15   12
T    0    0    2   10   21   19   29   26   23
G    0    8    5    7   18   16   26   37   34
A    0    5   16   13   15   13   23   34   32

A A T - T G
A A T C T G

8 + 8 + 8 - 3 + 8 + 8 = 37

Page 33: Sequence Alignment

33

Affine gap penalties

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)

• Each gap is charged an extra gap-open penalty: -4.

C - - - T T A A C T
C G G A T C A - - T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Gap-open penalties: -4 for the gap "- - -" in A and -4 for the gap "- -" in B

Alignment score: 12 - 4 - 4 = 4
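The same arithmetic can be written as a small Python sketch. It treats every maximal run of gap symbols in either row as one gap; the name `affine_alignment_score` and the use of a regular expression to count runs are choices made for this example.

```python
import re


def affine_alignment_score(row_a: str, row_b: str, match: int = 8, mismatch: int = -5,
                           gap_symbol: int = -3, gap_open: int = -4) -> int:
    """Score an alignment: column scores plus one gap-open charge per gap."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap_symbol
        elif x == y:
            score += match
        else:
            score += mismatch
    # A gap is a maximal run of '-' within one row.
    gaps = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    return score + gaps * gap_open


print(affine_alignment_score("C---TTAACT", "CGGATCA--T"))  # 4
```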

Page 34: Sequence Alignment

34

Affine gap penalties

• A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.

Three cases for alignment endings:

1. ...x over ...x (an aligned pair)

2. ...x over ...- (a deletion)

3. ...- over ...x (an insertion)

Page 35: Sequence Alignment

35

Affine gap penalties

• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.

• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.

• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

Page 36: Sequence Alignment

36

Affine gap penalties

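$$
\begin{aligned}
D(i,j) &= \max
\begin{cases}
D(i-1,j) - y \\
S(i-1,j) - x - y
\end{cases} \\
I(i,j) &= \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases} \\
S(i,j) &= \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
\end{aligned}
$$

(A gap of length k is penalized x + k·y.)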
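A minimal Python sketch of the three-matrix recurrence follows. The slides do not spell out the boundary values, so the ones used here are an assumption: a prefix aligned entirely to gap symbols is a single gap of that length, and impossible states are set to minus infinity. The function name `affine_global_score` is hypothetical, and x = 4, y = 3 are taken from the earlier example.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, match: int = 8, mismatch: int = -5,
                        x: int = 4, y: int = 3):
    """Optimal global score when a gap of length k is penalized x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):  # assumed boundary: one deletion gap of length i
        D[i][0] = S[i][0] = -(x + i * y)
    for j in range(1, n + 1):  # assumed boundary: one insertion gap of length j
        I[0][j] = S[0][j] = -(x + j * y)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)  # extend or open
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)  # extend or open
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


print(affine_global_score("CTTAACT", "CGGATCAT"))       # 10
print(affine_global_score("CTTAACT", "CGGATCAT", x=0))  # 14
```

With x = 0 the recurrence reduces to the simple scoring scheme, which is why the second call reproduces the earlier optimal score of 14. With x = 4, the alignment CTTAAC-T over CGGATCAT has a single gap and scores 14 - 4 = 10.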

Page 37: Sequence Alignment

37

Affine gap penalties

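[Figure: each cell holds three nodes S, D and I. Edges that extend a gap (D to D, I to I) have weight -y, edges that open a gap (S to D, S to I) have weight -x-y, and the diagonal edge into S has weight w(ai,bj).]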

Page 38: Sequence Alignment

38

Constant gap penalties

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: 0 (w(-,x)=w(x,-)=0)

• Each gap is charged a constant penalty: -4.

C - - - T T A A C T
C G G A T C A - - T

+8 0 0 0 +8 -5 +8 0 0 +8 = +27

Gap penalties: -4 for the gap "- - -" in A and -4 for the gap "- -" in B

Alignment score: 27 - 4 - 4 = 19

Page 39: Sequence Alignment

39

Constant gap penalties

• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.

• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.

• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

Page 40: Sequence Alignment

40

Constant gap penalties

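$$
\begin{aligned}
D(i,j) &= \max
\begin{cases}
D(i-1,j) \\
S(i-1,j) - x
\end{cases} \\
I(i,j) &= \max
\begin{cases}
I(i,j-1) \\
S(i,j-1) - x
\end{cases} \\
S(i,j) &= \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
\end{aligned}
$$

where x is a constant gap penalty for a gap.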
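A Python sketch of this constant-gap recurrence, under assumptions: the boundary values (a prefix aligned only to gap symbols costs one gap, that is -x) are not given on the slides, and the function name `constant_gap_score` is made up for this example.

```python
NEG_INF = float("-inf")


def constant_gap_score(a: str, b: str, match: int = 8, mismatch: int = -5, x: int = 4):
    """Optimal global score when every gap costs x regardless of its length."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):  # assumed boundary: a single deletion gap
        D[i][0] = S[i][0] = -x
    for j in range(1, n + 1):  # assumed boundary: a single insertion gap
        I[0][j] = S[0][j] = -x
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j], S[i - 1][j] - x)  # extending is free
            I[i][j] = max(I[i][j - 1], S[i][j - 1] - x)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


print(constant_gap_score("ACGT", "AT"))  # 12: A/A, one gap for "CG", T/T
print(constant_gap_score("CTTAACT", "CGGATCAT"))
```

Because long gaps are no more expensive than short ones, the optimum for the running example is at least 20: the alignment C---TTAAC-T over CGGAT---CAT has four matches and three gaps.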

Page 41: Sequence Alignment

41

Restricted affine gap penalties

• A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c.

Five cases for alignment endings:

1. ...x over ...x (an aligned pair)

2. ...x over ...- (a deletion)

3. ...- over ...x (an insertion)

4. and 5. for long gaps

Page 42: Sequence Alignment

42

Restricted affine gap penalties

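$$
\begin{aligned}
D(i,j) &= \max
\begin{cases}
D(i-1,j) - y \\
S(i-1,j) - x - y
\end{cases} \\
D'(i,j) &= \max
\begin{cases}
D'(i-1,j) \\
S(i-1,j) - x - cy
\end{cases} \\
I(i,j) &= \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases} \\
I'(i,j) &= \max
\begin{cases}
I'(i,j-1) \\
S(i,j-1) - x - cy
\end{cases} \\
S(i,j) &= \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j);\ D'(i,j) \\
I(i,j);\ I'(i,j)
\end{cases}
\end{aligned}
$$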

Page 43: Sequence Alignment

43

D(i, j) vs. D’(i, j)

• Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length <= c. Then D(i, j) >= D’(i, j).

• Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c. Then D(i, j) <= D’(i, j).
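The two cases can be read off the penalty function itself: the restricted affine penalty is the smaller of the ordinary affine charge x + k·y (tracked by D) and the capped charge x + c·y (tracked by D’). A short Python sketch; the function name and the values x = 4, y = 3, c = 5 are illustrative assumptions, not taken from the slides.

```python
def restricted_affine_penalty(k: int, x: int = 4, y: int = 3, c: int = 5) -> int:
    """Penalty x + f(k)*y with f(k) = k for k <= c and f(k) = c for k > c."""
    return x + min(k, c) * y


for k in (1, 2, 5, 6, 9):
    affine_charge = 4 + k * 3   # what the D matrix charges for a gap of length k
    capped_charge = 4 + 5 * 3   # what the D' matrix charges for any gap
    # The restricted penalty is always the cheaper of the two charges.
    assert restricted_affine_penalty(k) == min(affine_charge, capped_charge)
    print(k, restricted_affine_penalty(k))
```

Taking the maximum of D(i, j) and D’(i, j) in the recurrence for S(i, j) therefore applies the cheaper charge automatically, whichever case holds.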

Page 44: Sequence Alignment

44

k best local alignments

• Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)

• FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

• BLAST (Altschul et al., 1990; Altschul et al., 1997)

Page 45: Sequence Alignment

45

FASTA

1) Find runs of identities, and identify regions with the highest density of identities.

2) Re-score using PAM matrix, and keep top scoring segments.

3) Eliminate segments that are unlikely to be part of the alignment.

4) Optimize the alignment in a band.
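Step 1 can be illustrated with a rough Python sketch. It is not FASTA's actual implementation: it assumes that runs of identities are located by looking up shared words of length k (k = 2 here) and counting them per diagonal j - i, and the name `diagonal_hits` is made up for this example.

```python
from collections import defaultdict


def diagonal_hits(a: str, b: str, k: int = 2) -> dict[int, int]:
    """Count shared k-letter words on each diagonal (diagonal = j - i, 0-based)."""
    positions = defaultdict(list)           # word -> start positions in a
    for i in range(len(a) - k + 1):
        positions[a[i:i + k]].append(i)
    hits = defaultdict(int)
    for j in range(len(b) - k + 1):
        for i in positions.get(b[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


hits = diagonal_hits("CAATTGA", "GAATCTGC")
print(sorted(hits.items()))  # [(-5, 1), (0, 2), (1, 1)]
```

Diagonals with many hits are the candidate regions with the highest density of identities; here the main diagonal (0) collects the shared words AA and AT.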

Page 46: Sequence Alignment

46

FASTA

Step 1: Find runs of identities, and identify regions with the highest density of identities.

[Figure: dot plot of Sequence A against Sequence B]

Page 47: Sequence Alignment

47

FASTA

Step 2: Re-score using PAM matrix, and keep top scoring segments.

Page 48: Sequence Alignment

48

FASTA

Step 3: Eliminate segments that are unlikely to be part of the alignment.

Page 49: Sequence Alignment

49

FASTA

Step 4: Optimize the alignment in a band.

Page 50: Sequence Alignment

50

BLAST

Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)

The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

Page 51: Sequence Alignment

51

The maximal segment pair measure

A maximal segment pair (MSP) is defined to be the highest scoring pair of identical-length segments chosen from 2 sequences (for DNA: identities: +5; mismatches: -4).

[Figure: the highest scoring pair of segments marked on the two sequences]

• The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?) Even so, an exact procedure is too time consuming.

• BLAST heuristically attempts to calculate the MSP score.
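One possible answer to the "How?" above: since the two segments have identical length and contain no gaps, the local alignment recurrence can be restricted to its diagonal term. The Python sketch below is an illustration under that reading, with the DNA scores from the slide; the name `msp_score` is hypothetical.

```python
def msp_score(a: str, b: str, identity: int = 5, mismatch: int = -4) -> int:
    """Best score of any gap-free pair of equal-length segments of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # scores of segment pairs ending in the previous row
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Either extend the segment pair ending at the diagonal neighbour
            # or start afresh (score 0).
            cur[j] = max(0, prev[j - 1] + (identity if x == y else mismatch))
            best = max(best, cur[j])
        prev = cur
    return best


print(msp_score("AGATCGAT", "GATC"))          # 20: the shared segment GATC
print(msp_score("ACGTACGT", "ACGAACGT"))      # 31: 7 identities, 1 mismatch
```

Each of the len(a) x len(b) cells is filled in constant time, which gives the running time proportional to the product of the lengths.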

Page 52: Sequence Alignment

52

BLAST

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

3) Extend hits.

Page 53: Sequence Alignment

53

BLAST

Step 1: Build the hash table for Sequence A. (3-tuple example)

For DNA sequences:

Seq. A = AGATCGAT
         12345678

AAA
AAC
..
AGA 1
..
ATC 3
..
CGA 5
..
GAT 2 6
..
TCG 4
..
TTT

For protein sequences:

Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
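The DNA table above can be reproduced with a short Python sketch. It stores only the words that actually occur, with 1-based positions as on the slide; the name `build_word_table` is made up for this example, and the protein variant with the threshold T is not covered.

```python
from collections import defaultdict


def build_word_table(seq: str, k: int = 3) -> dict[str, list[int]]:
    """Map every k-letter word of `seq` to its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        table[seq[start:start + k]].append(start + 1)
    return dict(table)


table = build_word_table("AGATCGAT")
for word in sorted(table):
    print(word, *table[word])
```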

Page 54: Sequence Alignment

54

BLAST

Step 2: Scan Sequence B for hits.

Page 55: Sequence Alignment

55

BLAST

Step 2: Scan Sequence B for hits.

Step 3: Extend hits.

[Figure: a hit between the two sequences]

Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
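Steps 2 and 3 can be sketched in Python as follows. This is a simplified illustration, not the BLAST implementation: hits are exact 3-letter word matches, extension is ungapped, and the `drop` threshold of 10 as well as the names `find_hits` and `extend_hit` are assumptions made for this example. Positions are 0-based.

```python
from collections import defaultdict


def find_hits(a: str, b: str, k: int = 3) -> list[tuple[int, int]]:
    """Scan b and return (i, j) pairs with a[i:i+k] == b[j:j+k]."""
    table = defaultdict(list)
    for i in range(len(a) - k + 1):
        table[a[i:i + k]].append(i)
    return [(i, j) for j in range(len(b) - k + 1) for i in table.get(b[j:j + k], [])]


def extend_hit(a: str, b: str, i: int, j: int, k: int = 3,
               identity: int = 5, mismatch: int = -4, drop: int = 10):
    """Extend a hit without gaps; return (best score, start in a, end in a)."""
    def w(x: str, y: str) -> int:
        return identity if x == y else mismatch

    best = run = sum(w(a[i + t], b[j + t]) for t in range(k))
    start, end = i, i + k
    p, q = i + k, j + k
    while p < len(a) and q < len(b) and best - run <= drop:   # extend to the right
        run += w(a[p], b[q]); p += 1; q += 1
        if run > best:
            best, end = run, p
    run = best
    p, q = i - 1, j - 1
    while p >= 0 and q >= 0 and best - run <= drop:           # extend to the left
        run += w(a[p], b[q])
        if run > best:
            best, start = run, p
        p -= 1; q -= 1
    return best, start, end


a, b = "AGATCGAT", "CGATCGTT"
print(find_hits(a, b))
print(extend_hit(a, b, 1, 1))  # (26, 1, 8): GATCGAT against GATCGTT
```

An extension stops as soon as the running score has fallen more than `drop` below the best score seen so far, which is the "fades away" rule stated above.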

Page 56: Sequence Alignment

56

Remarks

• Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.

• The idea of filtration was used in both FASTA and BLAST.