Chapter 2: Data Searches and Pairwise Alignments
Post on 10-Jan-2016
1
Chapter 2: Data Searches and Pairwise Alignments
Department of Computer Science and Information Engineering, National Chi Nan University
黃光璿, 2004/03/08
2
Introduction
What is the difference between acctga and agcta?
a c c t g a
a g c t g a
a g c t - a
3
Nomenclature
4
2.1 Dot Plots
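A dot plot marks every cell (i, j) where residue i of one sequence equals residue j of the other; runs along a diagonal suggest similar regions. A minimal sketch, using the two sequences from the Introduction (the function name `dot_plot` is illustrative, not from the slides):

```python
def dot_plot(seq_a, seq_b):
    """Return one string per residue of seq_a; '*' marks a match with seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


seq_a, seq_b = "acctga", "agcta"
rows = dot_plot(seq_a, seq_b)

# Print seq_b across the top and seq_a down the side.
print("  " + seq_b)
for residue, row in zip(seq_a, rows):
    print(residue + " " + row)
```

Real dot plots usually compare windows of several residues rather than single residues, which filters out much of the background noise.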
5
2.2 Simple Alignments
No gap
6
mutation (substitution): common
insertion, deletion: gap, indel (rare)
scoring scheme
  match score
  mismatch score
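With no gaps allowed, an alignment score is simply the sum of per-column match and mismatch scores. A minimal sketch; the score values (match = 1, mismatch = 0) are hypothetical choices, not taken from the slides:

```python
def score_ungapped(seq_a, seq_b, match=1, mismatch=0):
    """Score two equal-length sequences column by column, with no gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Five matching columns and one mismatch (c/g in the second column).
print(score_ungapped("acctga", "agctga"))
print(score_ungapped("acctga", "agctga", match=1, mismatch=-1))
```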
7
2.3 Gaps
8
2.3.1 Gap Penalty
uniform gap penalty
affine gap penalty
  origination penalty
  length penalty
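The two penalty schemes can be compared by scoring an already-gapped alignment. The sketch below assumes an affine gap costs one origination penalty plus one length penalty per gap position; all numeric values are hypothetical. Setting `origination=0` gives the uniform scheme.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0,
                    origination=-2, length_penalty=-1):
    """Score a gapped alignment given as two equal-length rows ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            if gap_row != prev_gap_row:
                score += origination      # a new gap is opened
            score += length_penalty       # every gap position is charged
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


print(score_alignment("acctga", "agct-a"))                 # affine
print(score_alignment("acctga", "agct-a", origination=0))  # uniform
```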
9
2.4 Scoring Matrices
10
Problems with Modeling
Does nature really operate according to these rules?
11
Modeling
12
13
Define the odds ratio as
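The formula itself did not survive in this transcript. One standard form, written here in the notation of Durbin et al. (reference 3) and therefore an assumption about what the slide showed, compares a match model M with a random model R for an ungapped alignment of x and y. Here q_a is the background frequency of residue a and p_ab is the probability that a and b are aligned because they derive from a common ancestor:

```latex
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \, \prod_i q_{y_i}}
  = \prod_i \frac{p_{x_i y_i}}{q_{x_i} \, q_{y_i}},
\qquad
S = \sum_i s(x_i, y_i),
\quad
s(a, b) = \log \frac{p_{ab}}{q_a \, q_b}
```

Taking the logarithm turns the product into a sum, which is why substitution matrices store log-odds scores s(a, b).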
14
2.4.1 PAM Matrices
Dayhoff, Schwartz, Orcutt (1978)
Point Accepted Mutation
Based on observed substitution rates (Box 2.1)
Input: a set of observed substitution rates
Output: PAM-1 matrix (log-odds matrix)
15
Multiple Alignment
(1) Group the sequences with high similarity (> 85% identity).
16
Phylogenetic Tree
(2) For each group, build the corresponding phylogenetic tree.
17
Mutation Frequency
A->G, I->L, A->G, A->L, C->S, G->A
(3)
F_{G,A} = 3 (A->G twice plus G->A once; substitutions are counted in both directions)
18
Relative Mutability
(4)
19
Mutation Probability
(5)
20
Odds Ratio
(6)
21
Log-Odds Ratio
(7)
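The formulas for steps (3) to (7) are missing from this transcript. The sketch below follows one common textbook formulation and should be read as an assumption, not as the slides' exact definitions: relative mutability m_j is the number of changes involving j divided by (number of substitutions x 2 x relative frequency of j x 100), mutation probability is M_ij = m_j F_ij / sum_i F_ij, the odds ratio is R_ij = M_ij / f_i, and the score is log10 of R_ij. The relative frequency 0.10 is a hypothetical value.

```python
import math
from collections import Counter

# The six substitutions listed on the Mutation Frequency slide.
SUBSTITUTIONS = [("A", "G"), ("I", "L"), ("A", "G"),
                 ("A", "L"), ("C", "S"), ("G", "A")]


def mutation_frequencies(substitutions):
    """Step 3: count each substitution in both directions (F[i, j] == F[j, i])."""
    freq = Counter()
    for old, new in substitutions:
        freq[(old, new)] += 1
        freq[(new, old)] += 1
    return freq


def relative_mutability(freq, j, n_substitutions, rel_freq_j, scale=100):
    """Step 4 (assumed form): changes involving j over its exposure to mutation."""
    changes = sum(count for (_, col), count in freq.items() if col == j)
    return changes / (n_substitutions * 2 * rel_freq_j * scale)


def mutation_probability(freq, i, j, m_j):
    """Step 5 (assumed form): M_ij = m_j * F_ij / sum over i of F_ij."""
    column_total = sum(count for (_, col), count in freq.items() if col == j)
    return m_j * freq[(i, j)] / column_total


F = mutation_frequencies(SUBSTITUTIONS)
m_A = relative_mutability(F, "A", len(SUBSTITUTIONS), rel_freq_j=0.10)  # hypothetical f_A
M_GA = mutation_probability(F, "G", "A", m_A)
R_GA = M_GA / 0.10        # step 6: odds ratio, with a hypothetical f_G = 0.10
S_GA = math.log10(R_GA)   # step 7: log-odds score
print(F[("G", "A")], m_A, M_GA, R_GA, S_GA)
```

A negative log-odds score, as here, means the substitution is seen less often than chance alone would predict under these assumed frequencies.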
22
Which PAM matrix is the most appropriate?
  the length of the sequences
  how closely the sequences are believed to be related
PAM 120 for database search
PAM 200 for comparing two specific proteins
23
2.4.2 BLOSUM Matrices
Henikoff & Henikoff (1992)
PAM-k: the larger k is, the more divergent the sequences
BLOSUM-k: the larger k is, the more similar the sequences
BLOSUM62: for ungapped matching
BLOSUM50: for gapped matching
24
2.5 Dynamic Programming
The Needleman and Wunsch Algorithm (Global Alignment)
25
26
Alignment Graph
27
28
A C - - T C G
A C A G T A G
29
Complexity
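A minimal sketch of Needleman-Wunsch global alignment with a uniform gap penalty. Filling the (m+1) x (n+1) table takes O(mn) time and space for sequences of lengths m and n. The scoring values are hypothetical, and ties in the traceback are broken arbitrarily, so the returned alignment is one of possibly several optimal ones.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    m, n = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = m, n
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)
print(top)
print(bottom)
```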
30
2.6 Global and Local Alignments
Semi-global alignment
Local alignment
31
2.6.1 Semi-global Alignments
A A C A C G T G T C T
- - - A C G T - - - -
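Semi-global alignment can be sketched as global alignment in which gaps at either end of either sequence are free: the first row and column of the table start at zero, and the best score is read from anywhere in the last row or last column. The scoring values below are hypothetical.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when leading and trailing gaps are not penalized."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading gaps are free.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free: take the best cell in the last row or last column.
    return max(max(score[m]), max(row[n] for row in score))


# The short sequence sits inside the long one, as in the example above.
print(semiglobal_score("AACACGTGTCT", "ACGT"))
```

Under the same scores, a global alignment of these two sequences would pay for seven terminal gap positions, which is what the semi-global variant avoids.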
32
33
2.6.2 Local Alignment
The Smith-Waterman Alignment
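A minimal sketch of Smith-Waterman local alignment. It differs from the global recurrence in two ways: every cell is floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores are hypothetical; the mismatch and gap scores must be negative so that poor regions are cut off. The test pair is the target and query from the FASTA example in section 2.7.2.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("FAMLGFIKYLPGCM", "TGFIKYLPGACT"))
```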
34
35
2.7 Database Searches
BLAST and its relatives
FASTA and related algorithms
36
2.7.1 BLAST and Its Relatives
Program   Database                             Query
BLASTN    Nucleotide                           Nucleotide
BLASTP    Protein                              Protein
BLASTX    Protein                              Nucleotide (translated to Protein)
TBLASTN   Nucleotide (translated to Protein)   Protein
TBLASTX   Nucleotide (translated to Protein)   Nucleotide (translated to Protein)
37
BLASTP
Using PAM or BLOSUM matrices
38
2.7.2 FASTA and Related Algorithms
Improves on dot plots and band search
1. Preprocess the target sequence. Identify the position of each word. (For amino acids and word length = 1, a 20-entry array.)
2. Scan the query sequence. Compute the shifts of the query needed to align each word with the target.
3. Find the mode of the shifts.
4. Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.
39
Target: FAMLGFIKYLPGCM
Query:  TGFIKYLPGACT
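Steps 1 to 3 can be sketched for this target and query with word length 1. The function names are illustrative, and the sign convention (shift = target position minus query position) is an assumption; step 4, the full local alignment around the chosen shift, is not shown.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Step 1: map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def shift_counts(target, query, k=1):
    """Step 2: for each query word, count the shifts that align it with the target."""
    index = word_positions(target, k)
    shifts = Counter()
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], []):
            shifts[t_pos - q_pos] += 1
    return shifts


target = "FAMLGFIKYLPGCM"
query = "TGFIKYLPGACT"
shifts = shift_counts(target, query)

# Step 3: the mode of the shifts points at the most promising diagonal.
mode_shift, hits = shifts.most_common(1)[0]
print(mode_shift, hits)
```

Here the mode corresponds to the shared run GFIKYLPG, which starts at position 4 of the target and position 1 of the query (counting from zero).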
40
2.7.3 Alignment Scores and Statistical Significance of Database Searches
related model vs. random model
S-score: the alignment score
E-score: expected number of sequences with score >= S by random chance
P-score: probability that one or more sequences with score >= S would be found randomly
Low E and P are better.
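The slides do not give formulas for these scores. The sketch below assumes the Karlin-Altschul statistics used by BLAST, where E = K m n exp(-lambda S) for a query of length m and a database of length n, and P = 1 - exp(-E). The values of K and lambda depend on the scoring system; the ones used here are hypothetical.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance hits scoring >= score (assumed Karlin-Altschul form)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given E expected hits (Poisson)."""
    return -math.expm1(-e)  # equals 1 - exp(-e), but accurate for tiny e


# Hypothetical parameters: K = 0.1, lambda = 0.3, query of 100, database of 1e6.
for s in (40, 60, 80):
    e = e_value(s, m=100, n=1_000_000, K=0.1, lam=0.3)
    print(s, e, p_value(e))
```

For small E the two numbers are nearly equal, which is why E alone is usually reported.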
41
length correction
Scores
42
PAM 120, (ln 2)/2 nats
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  3 -3 -1  0 -3 -1  0  1 -3 -1 -3 -2 -2 -4  1  1  1 -7 -4  0  0 -1 -1 -8
R -3  6 -1 -3 -4  1 -3 -4  1 -2 -4  2 -1 -5 -1 -1 -2  1 -5 -3 -2 -1 -2 -8
N -1 -1  4  2 -5  0  1  0  2 -2 -4  1 -3 -4 -2  1  0 -4 -2 -3  3  0 -1 -8
D  0 -3  2  5 -7  1  3  0  0 -3 -5 -1 -4 -7 -3  0 -1 -8 -5 -3  4  3 -2 -8
C -3 -4 -5 -7  9 -7 -7 -4 -4 -3 -7 -7 -6 -6 -4  0 -3 -8 -1 -3 -6 -7 -4 -8
Q -1  1  0  1 -7  6  2 -3  3 -3 -2  0 -1 -6  0 -2 -2 -6 -5 -3  0  4 -1 -8
E  0 -3  1  3 -7  2  5 -1 -1 -3 -4 -1 -3 -7 -2 -1 -2 -8 -5 -3  3  4 -1 -8
G  1 -4  0  0 -4 -3 -1  5 -4 -4 -5 -3 -4 -5 -2  1 -1 -8 -6 -2  0 -2 -2 -8
H -3  1  2  0 -4  3 -1 -4  7 -4 -3 -2 -4 -3 -1 -2 -3 -3 -1 -3  1  1 -2 -8
I -1 -2 -2 -3 -3 -3 -3 -4 -4  6  1 -3  1  0 -3 -2  0 -6 -2  3 -3 -3 -1 -8
L -3 -4 -4 -5 -7 -2 -4 -5 -3  1  5 -4  3  0 -3 -4 -3 -3 -2  1 -4 -3 -2 -8
K -2  2  1 -1 -7  0 -1 -3 -2 -3 -4  5  0 -7 -2 -1 -1 -5 -5 -4  0 -1 -2 -8
M -2 -1 -3 -4 -6 -1 -3 -4 -4  1  3  0  8 -1 -3 -2 -1 -6 -4  1 -4 -2 -2 -8
F -4 -5 -4 -7 -6 -6 -7 -5 -3  0  0 -7 -1  8 -5 -3 -4 -1  4 -3 -5 -6 -3 -8
P  1 -1 -2 -3 -4  0 -2 -2 -1 -3 -3 -2 -3 -5  6  1 -1 -7 -6 -2 -2 -1 -2 -8
S  1 -1  1  0  0 -2 -1  1 -2 -2 -4 -1 -2 -3  1  3  2 -2 -3 -2  0 -1 -1 -8
T  1 -2  0 -1 -3 -2 -2 -1 -3  0 -3 -1 -1 -4 -1  2  4 -6 -3  0  0 -2 -1 -8
W -7  1 -4 -8 -8 -6 -8 -8 -3 -6 -3 -5 -6 -1 -7 -2 -6 12 -2 -8 -6 -7 -5 -8
Y -4 -5 -2 -5 -1 -5 -5 -6 -1 -2 -2 -5 -4  4 -6 -3 -3 -2  8 -3 -3 -5 -3 -8
V  0 -3 -3 -3 -3 -3 -3 -2 -3  3  1 -4  1 -3 -2 -2  0 -8 -3  5 -3 -3 -1 -8
B  0 -2  3  4 -6  0  3  0  1 -3 -4  0 -4 -5 -2  0  0 -6 -3 -3  4  2 -1 -8
Z -1 -1  0  3 -7  4  4 -2  1 -3 -3 -1 -2 -6 -1 -1 -2 -7 -5 -3  2  4 -1 -8
X -1 -2 -1 -2 -4 -1 -1 -2 -2 -1 -2 -2 -2 -3 -2 -1 -1 -5 -3 -1 -1 -1 -2 -8
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
43
Applications
Reconstructing long sequences of DNA from overlapping sequence fragments
Determining physical and genetic maps from probe data under various experiment protocols
Database searching
Comparing two or more sequences for similarities
44
Protein structure prediction (building profiles)
Comparing the same gene sequenced by two different labs
45
2.8 Multiple Sequence Alignments
CLUSTAL
  D. G. Higgins & P. M. Sharp, 1988
CLUSTALW
  Sequences are weighted according to how divergent they are from the most closely related pair of sequences.
  Gaps are weighted for different sequences.
46
Summary
notion of similarity
the scoring system used to rank alignments
the algorithms used to find optimal scoring alignments
the statistical method used to evaluate the significance of an alignment score
47
References and Figure Sources
1. Fundamental Concepts of Bioinformatics, Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.
2. BLAST, I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (distributed by Tenlong)
3. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.
4. Biochemistry, J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.