
Chapter 2: Data Searches and Pairwise Alignments

Department of Computer Science and Information Engineering, National Chi Nan University

黃光璿

2004/03/08

Introduction

What is the difference between acctga and agcta?

a c c t g a

a g c t g a

a g c t - a

Nomenclature

2.1 Dot Plots
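A dot plot places one sequence along each axis and marks every cell where the two residues are identical; diagonal runs of marks indicate similar regions. The following is a minimal text-mode sketch, reusing the two sequences from the Introduction. The function name `dot_plot` and the `*` / `.` symbols are illustrative choices, not part of the original slides.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_a; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


rows = dot_plot("acctga", "agcta")
print("  agcta")
for residue, row in zip("acctga", rows):
    print(residue + " " + row)
```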

2.2 Simple Alignments

No gap

mutation (substitution): common

insertion, deletion: gap, indel (rare)

scoring scheme: match score, mismatch score
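A simple (gap-free) alignment can be scored by comparing the sequences position by position and sliding one along the other to find the best offset. The sketch below assumes a match score of 1 and a mismatch score of 0; these values and the function names are illustrative only.

```python
def ungapped_score(s, t, match=1, mismatch=0):
    """Score two sequences position by position, with no gaps allowed."""
    return sum(match if a == b else mismatch for a, b in zip(s, t))


def best_offset(s, t, match=1, mismatch=0):
    """Slide t along s; return (best score, offset of t relative to s)."""
    best = None
    for offset in range(-(len(t) - 1), len(s)):
        if offset >= 0:
            score = ungapped_score(s[offset:], t, match, mismatch)
        else:
            score = ungapped_score(s, t[-offset:], match, mismatch)
        if best is None or score > best[0]:
            best = (score, offset)
    return best


print(ungapped_score("acctga", "agctga"))
print(best_offset("acctga", "agcta"))
```

Only the overlapping positions are scored at each offset, because `zip` stops at the end of the shorter slice.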

2.3 Gaps

2.3.1 Gap Penalty

uniform gap

affine gap: origination penalty, length penalty
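The two gap models can be compared with a short sketch. It assumes the affine penalty is the origination penalty plus the length penalty multiplied by the gap length; the parameter names and default values are hypothetical.

```python
def uniform_gap_penalty(length, per_position=1):
    """Every gapped position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, origination=2, length_penalty=1):
    """Opening a gap costs `origination`; each gapped position adds `length_penalty`."""
    if length == 0:
        return 0
    return origination + length_penalty * length


for n in (1, 2, 3):
    print(n, uniform_gap_penalty(n), affine_gap_penalty(n))
```

Under the affine model one gap of length 3 is cheaper than three separate gaps of length 1, which reflects the observation that indels are rare events.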

2.4 Scoring Matrices

The problem with modeling: does nature really operate according to these rules?

Modeling

Define the odds ratio as
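One common formulation, assuming the related-model versus random-model notation of Durbin et al. (the symbols below are an assumption and are not taken from the slides): let p_ab be the probability of seeing residues a and b aligned under the related model, and q_a the background frequency of a under the random model. For an ungapped alignment of x and y, the odds ratio and the resulting additive log-odds score can be written as:

```latex
\frac{P(x, y \mid \text{related})}{P(x, y \mid \text{random})}
  = \prod_{i} \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}},
\qquad
s(a, b) = \log \frac{p_{ab}}{q_a\, q_b},
\qquad
S = \sum_{i} s(x_i, y_i)
```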

2.4.1 PAM Matrices

Dayhoff, Schwartz, Orcutt (1978) Point Accepted Mutation

Based on observed substitution rates (Box 2.1)

Input: a set of observed substitution rates

Output: PAM-1 matrix (log-odds matrix)

Multiple Alignment

(1) Group the sequences with high similarity (> 85% identity).

Phylogenetic Tree

(2) For each group, build the corresponding phylogenetic tree.

Mutation Frequency

A->G, I->L, A->G, A->L, C->S, G->A

F_{G,A} = 3 (A->G twice, G->A once; the direction of the change is ignored)
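The count above can be reproduced with a short sketch that tallies each observed substitution without regard to direction. The function name and the use of `frozenset` keys are illustrative choices.

```python
from collections import Counter


def mutation_frequencies(mutations):
    """Count substitutions per unordered residue pair (A->G and G->A are pooled)."""
    counts = Counter()
    for old, new in mutations:
        counts[frozenset((old, new))] += 1
    return counts


observed = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]
freq = mutation_frequencies(observed)
print(freq[frozenset("AG")])
```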

Relative Mutability

Mutation Probability

Odds Ratio

Log-Odds Ratio
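A minimal sketch of the last three steps, assuming the usual Dayhoff-style definitions: mutation probability M_ij = m_j F_ij / sum_i F_ij, odds ratio R_ij = M_ij / f_i, and log-odds score S_ij = 10 log10 R_ij. These formulas are an assumption about what the steps compute; the function names and all numbers are hypothetical.

```python
import math


def mutation_probability(mutability_j, f_ij, total_j):
    """M_ij: probability that residue j is replaced by residue i (assumed form)."""
    return mutability_j * f_ij / total_j


def odds_ratio(m_ij, freq_i):
    """R_ij: how much more often i replaces j than expected from i's frequency."""
    return m_ij / freq_i


def log_odds(m_ij, freq_i):
    """S_ij = 10 * log10(R_ij); positive means more frequent than chance."""
    return 10 * math.log10(odds_ratio(m_ij, freq_i))


# Hypothetical inputs, for illustration only.
m = mutation_probability(mutability_j=0.01, f_ij=3, total_j=4)
print(m, log_odds(m, freq_i=0.075))
```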

Which PAM matrix is the most appropriate?

the length of the sequences

how closely the sequences are believed to be related

PAM 120 for database search

PAM 200 for comparing two specific proteins

2.4.2 BLOSUM Matrices

Henikoff & Henikoff (1992)

PAM-k: the larger k, the less similar the sequences

BLOSUM-k: the larger k, the more similar the sequences

BLOSUM62: for ungapped matching

BLOSUM50: for gapped matching

2.5 Dynamic Programming

The Needleman and Wunsch Algorithm (Global Alignment)

Alignment Graph

A C - - T C G

A C A G T A G

Complexity
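The dynamic programming table has (m+1) x (n+1) cells and each cell is filled in constant time, so the algorithm takes O(mn) time and O(mn) space for sequences of lengths m and n. The following is a minimal sketch with a uniform gap penalty; the scoring values (match 1, mismatch 0, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(s, t, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned s, aligned t)."""
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: s aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, n + 1):          # first row: t aligned against gaps only
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[m][n], "".join(reversed(a)), "".join(reversed(b))


best, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)
print(top)
print(bottom)
```

Several alignments may share the optimal score, so the traceback may return a different arrangement of gaps than the one drawn in the alignment graph.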

2.6 Global and Local Alignments

Semi-global alignment

Local alignment

2.6.1 Semi-global Alignments

A A C A C G T G T C T

- - - A C G T - - - -
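In a semi-global alignment, gaps at either end of the sequences are not penalized. One way to sketch this, assuming a uniform gap penalty and illustrative scores (match 1, mismatch -1, gap -1), is to leave the first row and column of the table at zero and take the best value found in the last row or last column.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at 0, so leading gaps cost nothing.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trailing gaps are free: the alignment may end anywhere on the last row or column.
    return max(max(score[m]), max(row[n] for row in score))


print(semiglobal_score("AACACGTGTCT", "ACGT"))
```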

2.6.2 Local Alignment

The Smith-Waterman Algorithm (Local Alignment)
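Local alignment differs from global alignment in two ways: a cell is never allowed to drop below zero, and the alignment ends at the highest-scoring cell anywhere in the table. The following is a minimal sketch, assuming a uniform gap penalty and illustrative scores (match 1, mismatch -1, gap -1).

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j
    # Trace back from the best cell until a zero is reached.
    a, b, i, j = [], [], best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


best_local, part_s, part_t = smith_waterman("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
print(best_local, part_s, part_t)
```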

2.7 Database Searches

BLAST and its relatives

FASTA and related algorithms

2.7.1 BLAST and Its Relatives

Program   Database                Query
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Protein                 Nucleotide -> Protein
TBLASTN   Nucleotide -> Protein   Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein

BLASTP

Using PAM or BLOSUM matrices

2.7.2 FASTA and Related Algorithms

Improves on dot plot & band search

1. Preprocess the target sequence. Identify the position for each word. (for amino acid & word length=1, a 20-entry array)

2. Scan the query sequence. Compute the shifts of query to align each word with the target.

3. Find the mode (the most frequent value) of the shifts.

4. Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.

Target: FAMLGFIKYLPGCM

Query: TGFIKYLPGACT
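Steps 1 to 3 can be sketched on this target and query with word length 1. The shift is taken here as the target position minus the query position (0-based); that sign convention and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def index_words(target, k=1):
    """Step 1: record every position at which each length-k word occurs."""
    positions = defaultdict(list)
    for i in range(len(target) - k + 1):
        positions[target[i:i + k]].append(i)
    return positions


def shift_counts(target, query, k=1):
    """Step 2: count the shifts that align each query word with the target."""
    positions = index_words(target, k)
    shifts = Counter()
    for j in range(len(query) - k + 1):
        for i in positions.get(query[j:j + k], []):
            shifts[i - j] += 1
    return shifts


shifts = shift_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
mode, hits = shifts.most_common(1)[0]   # Step 3: the most frequent shift
print(mode, hits)
```

The most frequent shift identifies the diagonal around which the full local alignment of step 4 would then be run.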

2.7.3 Alignment Scores and Statistical Significance of Database Searches

related model vs. random model

S-score: the alignment score

E-score: expected number of sequences with score >= S by random chance

P-score: probability that one or more sequences with score >= S would be found randomly

Low E & P are better.

length correction
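The relationship between the three scores can be sketched as follows, assuming the Karlin-Altschul form E = K m n exp(-lambda S) and a Poisson model for chance hits, so that P = 1 - exp(-E). The values of `K` and `lam` below are hypothetical placeholders; real values depend on the scoring matrix and gap penalties, and `m` and `n` stand for the (length-corrected) query and database sizes.

```python
import math


def expected_hits(score, m, n, K, lam):
    """E: expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """P: probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)   # equals 1 - exp(-E), accurate for small E


# Hypothetical parameters, for illustration only.
e = expected_hits(score=60, m=250, n=1_000_000, K=0.1, lam=0.3)
print(e, p_value(e))
```

For small E the two values are nearly equal, which is why database search reports often quote only E.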

Scores

PAM 120, scale (ln 2)/2 nats

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A   3 -3 -1  0 -3 -1  0  1 -3 -1 -3 -2 -2 -4  1  1  1 -7 -4  0  0 -1 -1 -8
R  -3  6 -1 -3 -4  1 -3 -4  1 -2 -4  2 -1 -5 -1 -1 -2  1 -5 -3 -2 -1 -2 -8
N  -1 -1  4  2 -5  0  1  0  2 -2 -4  1 -3 -4 -2  1  0 -4 -2 -3  3  0 -1 -8
D   0 -3  2  5 -7  1  3  0  0 -3 -5 -1 -4 -7 -3  0 -1 -8 -5 -3  4  3 -2 -8
C  -3 -4 -5 -7  9 -7 -7 -4 -4 -3 -7 -7 -6 -6 -4  0 -3 -8 -1 -3 -6 -7 -4 -8
Q  -1  1  0  1 -7  6  2 -3  3 -3 -2  0 -1 -6  0 -2 -2 -6 -5 -3  0  4 -1 -8
E   0 -3  1  3 -7  2  5 -1 -1 -3 -4 -1 -3 -7 -2 -1 -2 -8 -5 -3  3  4 -1 -8
G   1 -4  0  0 -4 -3 -1  5 -4 -4 -5 -3 -4 -5 -2  1 -1 -8 -6 -2  0 -2 -2 -8
H  -3  1  2  0 -4  3 -1 -4  7 -4 -3 -2 -4 -3 -1 -2 -3 -3 -1 -3  1  1 -2 -8
I  -1 -2 -2 -3 -3 -3 -3 -4 -4  6  1 -3  1  0 -3 -2  0 -6 -2  3 -3 -3 -1 -8
L  -3 -4 -4 -5 -7 -2 -4 -5 -3  1  5 -4  3  0 -3 -4 -3 -3 -2  1 -4 -3 -2 -8
K  -2  2  1 -1 -7  0 -1 -3 -2 -3 -4  5  0 -7 -2 -1 -1 -5 -5 -4  0 -1 -2 -8
M  -2 -1 -3 -4 -6 -1 -3 -4 -4  1  3  0  8 -1 -3 -2 -1 -6 -4  1 -4 -2 -2 -8
F  -4 -5 -4 -7 -6 -6 -7 -5 -3  0  0 -7 -1  8 -5 -3 -4 -1  4 -3 -5 -6 -3 -8
P   1 -1 -2 -3 -4  0 -2 -2 -1 -3 -3 -2 -3 -5  6  1 -1 -7 -6 -2 -2 -1 -2 -8
S   1 -1  1  0  0 -2 -1  1 -2 -2 -4 -1 -2 -3  1  3  2 -2 -3 -2  0 -1 -1 -8
T   1 -2  0 -1 -3 -2 -2 -1 -3  0 -3 -1 -1 -4 -1  2  4 -6 -3  0  0 -2 -1 -8
W  -7  1 -4 -8 -8 -6 -8 -8 -3 -6 -3 -5 -6 -1 -7 -2 -6 12 -2 -8 -6 -7 -5 -8
Y  -4 -5 -2 -5 -1 -5 -5 -6 -1 -2 -2 -5 -4  4 -6 -3 -3 -2  8 -3 -3 -5 -3 -8
V   0 -3 -3 -3 -3 -3 -3 -2 -3  3  1 -4  1 -3 -2 -2  0 -8 -3  5 -3 -3 -1 -8
B   0 -2  3  4 -6  0  3  0  1 -3 -4  0 -4 -5 -2  0  0 -6 -3 -3  4  2 -1 -8
Z  -1 -1  0  3 -7  4  4 -2  1 -3 -3 -1 -2 -6 -1 -1 -2 -7 -5 -3  2  4 -1 -8
X  -1 -2 -1 -2 -4 -1 -1 -2 -2 -1 -2 -2 -2 -3 -2 -1 -1 -5 -3 -1 -1 -1 -2 -8
*  -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8

Applications

Reconstructing long sequences of DNA from overlapping sequence fragments

Determining physical and genetic maps from probe data under various experiment protocols

Database searching

Comparing two or more sequences for similarities

Protein structure prediction (building profiles)

Comparing the same gene sequenced by two different labs

2.8 Multiple Sequence Alignments

CLUSTAL: D. G. Higgins & P. M. Sharp, 1988

CLUSTALW

Sequences are weighted according to how divergent they are from the most closely related pair of sequences.

Gaps are weighted for different sequences.

Summary

notion of similarity

the scoring system used to rank alignments

the algorithms used to find optimal scoring alignments

the statistical method used to evaluate the significance of an alignment score

References and Figure Sources

1. Fundamental Concepts of Bioinformatics, by D. E. Krane and M. L. Raymer, Benjamin/Cummings, 2003.

2. BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (distributed by 天瓏)

3. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.

4. Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.
