Page 1

Chapter 2: Data Searches and Pairwise Alignments

Department of Computer Science and Information Engineering, National Chi Nan University (暨南大學資訊工程學系)
黃光璿
2004/03/08

Page 2

Introduction

What is the difference between acctga and agcta?

a c c t g a

a g c t g a

a g c t - a

Page 3

Nomenclature

Page 4

2.1 Dot Plots

Page 5

2.2 Simple Alignments

No gap

Page 6

mutation (substitution): common
insertion, deletion: gap, indel (rare)
scoring scheme: match score, mismatch score
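A simple (ungapped) alignment can be scored by comparing the two sequences position by position. The sketch below is only an illustration: the function names and the default scores (match = 1, mismatch = 0) are assumptions, not values taken from the slides.

```python
def ungapped_score(a, b, match=1, mismatch=0):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs equal-length sequences")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def best_ungapped_offset(a, b, match=1, mismatch=0):
    """Slide b along a and return (best_score, offset) over all overlaps."""
    best = None
    for offset in range(-(len(b) - 1), len(a)):
        # Work out the overlapping region of a and b at this offset.
        start_a = max(0, offset)
        start_b = max(0, -offset)
        length = min(len(a) - start_a, len(b) - start_b)
        score = ungapped_score(a[start_a:start_a + length],
                               b[start_b:start_b + length],
                               match, mismatch)
        if best is None or score > best[0]:
            best = (score, offset)
    return best


print(ungapped_score("acctga", "agctga"))        # five matching positions
print(best_ungapped_offset("acctga", "agcta"))   # best overlap without gaps
```

Because no gaps are allowed, the only freedom is the relative offset of the two sequences, so the search is linear in the number of offsets.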

Page 7

2.3 Gaps

Page 8

2.3.1 Gap Penalty

uniform gap penalty
affine gap penalty
  origination penalty
  length penalty
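The two gap models can be compared with a small sketch. The function names and the numeric defaults are hypothetical; conventions also differ on whether the origination penalty already covers the first gapped position, and this sketch assumes it does not.

```python
def uniform_gap_penalty(length, per_position=1):
    """Uniform model: every gapped position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, origination=2, per_position=1):
    """Affine model: a one-off origination penalty plus a length penalty."""
    if length == 0:
        return 0
    return origination + per_position * length


# One gap of length 4 versus two separate gaps of length 2.
print(uniform_gap_penalty(4), 2 * uniform_gap_penalty(2))
print(affine_gap_penalty(4), 2 * affine_gap_penalty(2))
```

Under the uniform model the two cases cost the same, while the affine model favours a single long gap over several short ones.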

Page 9

2.4 Scoring Matrices

Page 10

Problems with modeling: does nature really operate according to these rules?

Page 11

Modeling

Page 12

Page 13

Define the odds ratio as
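One standard way to write this definition, following the style of Durbin et al., compares the probability of the aligned pair (x, y) under a related model M with its probability under a random model R. The symbols p_{ab} (joint probability of aligned residues a and b), q_a (background frequency of a) and s(a, b) are assumptions here and may differ from the notation used in the lecture.

```latex
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \, \prod_i q_{y_i}}
  = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}},
\qquad
S = \sum_i s(x_i, y_i),
\quad
s(a, b) = \log \frac{p_{ab}}{q_a q_b}
```

Taking logarithms turns the product into a sum, which is why scoring matrices hold log-odds values that can be added along an alignment.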

Page 14

2.4.1 PAM Matrices

Dayhoff, Schwartz, Orcutt (1978)

Point Accepted Mutation

Based on observed substitution rates (Box 2.1)

Input: a set of observed substitution rates

Output: PAM-1 matrix (log-odds matrix)

Page 15

Multiple Alignment

(1) Group the sequences with high similarity (> 85% identity).

Page 16

Phylogenetic Tree

(2) For each group, build the corresponding phylogenetic tree.

Page 17

Mutation Frequency

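(3) Count the observed substitutions for each pair of amino acids.

Observed substitutions: A->G, I->L, A->G, A->L, C->S, G->A

F_{G,A} = 3 (A->G twice plus G->A once; both directions are counted together)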
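The count for step (3) can be reproduced with a short sketch. It assumes, as the value for the G/A pair suggests, that a substitution and its reverse are tallied in the same entry; the function name is hypothetical.

```python
from collections import Counter


def mutation_frequencies(substitutions):
    """Count substitutions symmetrically: X->Y and Y->X share one entry."""
    counts = Counter()
    for old, new in substitutions:
        counts[frozenset((old, new))] += 1
    return counts


# The six substitutions listed on the slide.
observed = [("A", "G"), ("I", "L"), ("A", "G"),
            ("A", "L"), ("C", "S"), ("G", "A")]
F = mutation_frequencies(observed)
print(F[frozenset(("G", "A"))])  # A->G twice plus G->A once
```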

Page 18

Relative Mutability

(4)

Page 19

Mutation Probability

(5)

Page 20

Odds Ratio

(6)

Page 21

Log-Odds Ratio

(7)
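Steps (4) to (7) can be sketched in one common formulation of the Dayhoff procedure. The symbols below are assumptions and need not match the notation of Box 2.1: F_{ij} is the symmetric substitution count from step (3), f_j the relative frequency of amino acid j, m_j its relative mutability (given only up to normalization), and \lambda a scaling constant chosen so that 1% of residues change.

```latex
% (4) relative mutability, up to a normalizing constant
m_j \;\propto\; \frac{\sum_{i \neq j} F_{ij}}{f_j}

% (5) mutation probability (j replaced by i), and probability of no change
M_{ij} = \lambda \, m_j \, \frac{F_{ij}}{\sum_{k \neq j} F_{kj}} \quad (i \neq j),
\qquad
M_{jj} = 1 - \lambda \, m_j,
\qquad
\sum_j f_j \,(1 - M_{jj}) = 0.01

% (6) odds ratio
R_{ij} = \frac{M_{ij}}{f_i}

% (7) log-odds ratio (Dayhoff's scaling)
S_{ij} = 10 \log_{10} R_{ij}
```

Each column of M sums to 1, so M can be read as a one-step transition matrix for an evolutionary distance of 1 PAM.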

Page 22

Which PAM matrix is the most appropriate? It depends on
  the length of the sequences
  how closely the sequences are believed to be related

PAM 120 for database searches
PAM 200 for comparing two specific proteins

Page 23

2.4.2 BLOSUM Matrices

Henikoff & Henikoff (1992)

PAM-k: the larger k is, the less similar the sequences
BLOSUM-k: the larger k is, the more similar the sequences
BLOSUM62: for ungapped matching
BLOSUM50: for gapped matching

Page 24

2.5 Dynamic Programming

The Needleman and Wunsch Algorithm (Global Alignment)

Page 25

Page 26

Alignment Graph

Page 27

Page 28

A C - - T C G

A C A G T A G
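A minimal sketch of Needleman-Wunsch global alignment for the two sequences above follows. The scoring values (match = 1, mismatch = 0, gap = -1) and the function name are assumptions chosen for illustration, and the traceback returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(score)
print(top)
print(bottom)
```

The table has (m + 1) x (n + 1) cells and each is filled in constant time, so both time and space grow with the product of the two sequence lengths.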

Page 29

Complexity

Page 30

2.6 Global and Local Alignments

Semi-global alignment
Local alignment

Page 31

2.6.1 Semi-global Alignments

A A C A C G T G T C T
- - - A C G T - - - -
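In a semi-global alignment the gaps at either end of the alignment are not penalized. A minimal sketch, assuming match = 1, mismatch = 0 and gap = -1 (values and function name are illustrative only): the first row and column of the table stay at zero, and the result is read from the best cell in the last row or last column.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at zero: leading gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free too: take the best cell on the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


print(semiglobal_score("AACACGTGTCT", "ACGT"))
```

With the same scores, a global alignment of these two sequences would also pay for the seven terminal gap positions.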

Page 32

Page 33

2.6.2 Local Alignment

The Smith-Waterman Algorithm
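A minimal sketch of Smith-Waterman local alignment is given below. The scores (match = 1, mismatch = -1, gap = -1), the function name and the test sequences are assumptions for illustration. The differences from global alignment are that cell values are clamped at zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Clamp at zero so a poor prefix never drags a local alignment down.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AACACGTGTCT", "XXACGTYY"))
```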

Page 34

Page 35

2.7 Database Searches

BLAST and its relatives
FASTA and related algorithms

Page 36

2.7.1 BLAST and Its Relatives

Program   Database                Query
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Protein                 Nucleotide -> Protein
TBLASTN   Nucleotide -> Protein   Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein

(Nucleotide -> Protein: the nucleotide sequence is translated into protein before comparison.)

Page 37

BLASTP

Using PAM or BLOSUM matrices

Page 38

2.7.2 FASTA and Related Algorithms

An improvement on dot plots and band search

1. Preprocess the target sequence. Identify the positions of each word (for amino acids and word length 1, a 20-entry array).
2. Scan the query sequence. Compute the shifts of the query needed to align each word with the target.
3. Find the mode of the shifts.
4. Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.

Page 39

Target: FAMLGFIKYLPGCM
Query:  TGFIKYLPGACT
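The first three steps of the FASTA outline (indexing the target, computing shifts, finding the mode) can be sketched for this target and query with word length 1. The function names are hypothetical, and the shift is taken here as target position minus query position, which is one of several possible conventions.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Step 1: index every length-k word of the target by its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def shift_histogram(target, query, k=1):
    """Step 2: for every shared word, count the shift that lines it up."""
    index = word_positions(target, k)
    shifts = Counter()
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            shifts[i - j] += 1  # target position minus query position
    return shifts


target = "FAMLGFIKYLPGCM"
query = "TGFIKYLPGACT"
shifts = shift_histogram(target, query)
mode_shift, votes = shifts.most_common(1)[0]  # step 3: the mode of the shifts
print(mode_shift, votes)
```

The mode identifies the diagonal of the dot plot with the most matches, so the expensive local alignment in step 4 only needs to examine a narrow band around it.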

Page 40

2.7.3 Alignment Scores and Statistical Significance of Database Searches

related model vs. random model

S-score: the alignment score
E-score: expected number of sequences with score >= S by random chance
P-score: probability that one or more sequences with score >= S would be found randomly

Lower E and P are better.
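The relationship between the three scores can be sketched with the Karlin-Altschul form commonly used for BLAST statistics. The parameter values K and lam below are placeholders, not fitted values; in practice they depend on the scoring matrix and gap penalties. The conversion from E to P assumes that the number of chance hits follows a Poisson distribution.

```python
import math


def expected_hits(score, m, n, K=0.1, lam=0.3):
    """E: expected number of chance alignments scoring at least `score`.

    m and n are the query and database lengths; K and lam are hypothetical
    placeholder parameters.
    """
    return K * m * n * math.exp(-lam * score)


def p_from_e(e_value):
    """P: probability of at least one chance hit, assuming Poisson counts."""
    return 1.0 - math.exp(-e_value)


e = expected_hits(score=60, m=300, n=1_000_000)
print(e, p_from_e(e))
```

For small E the two values are nearly equal, which is why either can be used to judge a strong hit.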

Page 41

length correction

Scores

Page 42

PAM 120 (scores in units of (ln 2)/2 nats)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
R  -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
N  -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
D   0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
C  -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
Q  -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
E   0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
G   1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
H  -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8
Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8  -3  -3  -5  -3  -8
V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5  -3  -3  -1  -8
B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4   2  -1  -8
Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4  -1  -8
X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8

Page 43

Applications

Reconstructing long sequences of DNA from overlapping sequence fragments

Determining physical and genetic maps from probe data under various experiment protocols

Database searching

Comparing two or more sequences for similarities

Page 44

Protein structure prediction (building profiles)

Comparing the same gene sequenced by two different labs

Page 45

2.8 Multiple Sequence Alignments

CLUSTAL
  D. G. Higgins & P. M. Sharp, 1988

CLUSTALW
  Sequences are weighted according to how divergent they are from the most closely related pair of sequences.
  Gaps are weighted for different sequences.

Page 46

Summary

notion of similarity
the scoring system used to rank alignments
the algorithms used to find optimal scoring alignments
the statistical method used to evaluate the significance of an alignment score

Page 47

References and Figure Sources

1. Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.

2. BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (distributed by 天瓏)

3. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.

4. Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.