Homology Search Tools
Kun-Mao Chao (趙坤茂)
Department of Computer Science and Information Engineering
National Taiwan University, Taiwan
WWW: http://www.csie.ntu.edu.tw/~kmchao
-
Homology Search Tools
Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
BLAST (Altschul et al., 1990; Altschul et al., 1997)
BLAT (Kent, 2002)
PatternHunter (Li et al., 2004)
-
Finding Exact Word Matches
Hash Tables
Suffix Trees
Suffix Arrays
-
Hash Tables
Sequence: GATCCATCTT (positions 1-10)
3-mer hash table, two bits per base (A = 00, C = 01, G = 10, T = 11):
Word  Code         Start positions
AAA   000000 (0)
ATC   001101 (13)  2, 6
CAT   010011 (19)  5
CCA   010100 (20)  4
CTT   011111 (31)  8
GAT   100011 (35)  1
TCC   110101 (53)  3
TCG   110110 (54)
TCT   110111 (55)  7
TTT   111111 (63)
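The table on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the two-bit encoding implied by the slide's codes (ATC maps to 001101 = 13); the function names `encode` and `build_kmer_table` are illustrative and not from the slides.

```python
from collections import defaultdict

# Two bits per base, chosen to match the codes shown on the slide.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Pack a DNA word into an integer, two bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value

def build_kmer_table(seq, k=3):
    """Map each k-mer code to the 1-based positions where that k-mer starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[encode(seq[i:i + k])].append(i + 1)
    return dict(table)

table = build_kmer_table("GATCCATCTT")
print(table[encode("ATC")])  # start positions of ATC
```

Looking up a word is then a single dictionary access on its code; words that never occur (AAA, TCG, TTT in the slide) simply have no entry.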
-
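Suffix Trees (I)
Sequence: GATCCATCTT (positions 1-10)
[Figure: suffix tree of the sequence; edge labels with the suffix start position at each leaf]
root
  ATC
    CATCTT (2)
    TT (6)
  C
    ATCTT (5)
    CATCTT (4)
    TT (8)
  GATCCATCTT (1)
  T (10)
    C
      CATCTT (3)
      TT (7)
    T (9)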
-
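Suffix Trees (II)
Sequence: GATCCATCTT$ (positions 1-11, with the terminator $ at position 11)
[Figure: suffix tree of the terminated sequence; every suffix now ends at its own leaf]
root
  ATC
    CATCTT$ (2)
    TT$ (6)
  C
    ATCTT$ (5)
    CATCTT$ (4)
    TT$ (8)
  GATCCATCTT$ (1)
  T
    C
      CATCTT$ (3)
      TT$ (7)
    T$ (9)
    $ (10)
  $ (11)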
-
Suffix Arrays
Sequence: GATCCATCTT (positions 1-10)
Suffixes in lexicographic order, with their start positions:
ATCCATCTT   2
ATCTT       6
CATCTT      5
CCATCTT     4
CTT         8
GATCCATCTT  1
T           10
TCCATCTT    3
TCTT        7
TT          9
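The suffix array above, and an exact word lookup on it, can be sketched as follows. This is an illustrative sketch only: construction is done by naively sorting the suffixes, which is fine for a ten-letter example but not how large indexes are built, and the names `build_suffix_array` and `find_word` are assumptions rather than anything from the slides.

```python
def build_suffix_array(seq):
    """Return the 1-based start positions of all suffixes in lexicographic order."""
    return sorted(range(1, len(seq) + 1), key=lambda p: seq[p - 1:])

def find_word(seq, sa, word):
    """Binary-search the suffix array and return every start position of word."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the leftmost suffix whose prefix is >= word
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if seq[start:start + len(word)] < word:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Suffixes sharing the prefix are contiguous in the array.
    while lo < len(sa) and seq[sa[lo] - 1:].startswith(word):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

seq = "GATCCATCTT"
sa = build_suffix_array(seq)
print(sa)
print(find_word(seq, sa, "ATC"))
```

Because all suffixes that begin with the same word sit next to each other in the array, one binary search followed by a short scan finds every occurrence.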
-
FASTA
Find runs of identities, and identify regions with the highest density of identities.
Re-score using PAM matrix, and keep top scoring segments.
Eliminate segments that are unlikely to be part of the alignment.
Optimize the alignment in a band.
-
FASTA
Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: axes labeled Sequence A and Sequence B]
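One common way to realize Step 1 is to look up short shared words and count them per diagonal, since a run of identities lies on a single diagonal of the comparison matrix. The sketch below assumes that approach; the word length `k`, the function name `diagonal_hit_counts`, and the example sequences are illustrative choices, not values taken from the slides.

```python
from collections import Counter, defaultdict

def diagonal_hit_counts(seq_a, seq_b, k=2):
    """Count shared k-letter words on each diagonal d = i - j (0-based offsets)."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            counts[i - j] += 1
    return counts

counts = diagonal_hit_counts("GATCCATCTT", "ATCCATC", k=2)
best_diagonal, hits = counts.most_common(1)[0]
print(best_diagonal, hits)
```

The diagonal with the most word hits is the candidate region of highest identity density; the later steps re-score and refine only those candidates.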
-
FASTA
Step 2: Re-score using PAM matrix, and keep top scoring segments.
-
FASTA
Step 3: Eliminate segments that are unlikely to be part of the alignment.
-
FASTA
Step 4: Optimize the alignment in a band.
-
BLAST
Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
-
The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5; mismatches -4.)
The MSP score may be computed in time proportional to the product of the lengths of the two sequences. (How?)
An exact procedure is too time-consuming.
BLAST heuristically attempts to calculate the MSP score.
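One answer to the "(How?)" above is a dynamic program that, for every cell of the comparison matrix, keeps the best score of an ungapped segment pair ending at that cell. The sketch below assumes the DNA scores from the slide (+5 for an identity, -4 for a mismatch); the function name `msp_score` is illustrative.

```python
def msp_score(seq_a, seq_b, match=5, mismatch=-4):
    """Best score of any ungapped segment pair, in O(len(seq_a) * len(seq_b)) time."""
    best = 0
    prev = [0] * (len(seq_b) + 1)  # best scores ending in the previous row
    for a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            s = match if a == b else mismatch
            # Either extend the segment along this diagonal or restart at zero.
            curr[j] = max(0, prev[j - 1] + s)
            best = max(best, curr[j])
        prev = curr
    return best

print(msp_score("GATCCATCTT", "ATCCATC"))
```

Each cell is filled in constant time from its diagonal neighbor, which gives the running time proportional to the product of the two lengths.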
-
A matrix of similarity scores
[Figure: matrix of pairwise scores between the bases of two DNA sequences; each cell holds 5 for an identity and -4 for a mismatch]
-
A maximum-scoring segment
[Figure: the same matrix of 5 / -4 scores, with rows numbered 1-9 and columns numbered 1-11, highlighting a maximum-scoring segment along a diagonal]
-
BLAST
Build the hash table for Sequence A.
Scan Sequence B for hits.
Extend hits.
-
BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences:
Seq. A = AGATCGAT
         12345678
AAA
AAC
..
AGA  1
..
ATC  3
..
CGA  5
..
GAT  2 6
..
TCG  4
..
TTT
For protein sequences:
Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
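Both cases of Step 1 can be sketched with one function: exact words for DNA, and all words scoring at least T against a query word for proteins. The scoring function `toy_score`, the five-letter alphabet, and the threshold value below are hypothetical stand-ins chosen to keep the example small; a real protein search would score with a substitution matrix such as PAM or BLOSUM over the full amino-acid alphabet.

```python
from collections import defaultdict
from itertools import product

def toy_score(x, y, match=5, mismatch=-4):
    """Hypothetical residue score, used here in place of a substitution matrix."""
    return match if x == y else mismatch

def build_word_table(query, w=3, threshold=None, alphabet="ACGT", score=toy_score):
    """Map words to the 1-based query positions they should trigger.

    With threshold=None only the exact query words are stored (the DNA case).
    Otherwise every word xyz with Score(xyz, query word) >= threshold is stored.
    """
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if threshold is None:
            table[word].append(i + 1)
            continue
        for cand in product(alphabet, repeat=w):
            if sum(score(x, y) for x, y in zip(cand, word)) >= threshold:
                table["".join(cand)].append(i + 1)
    return dict(table)

dna = build_word_table("AGATCGAT")
protein = build_word_table("ELVIS", threshold=6, alphabet="EILSV")
print(dna["GAT"], sorted(w for w, pos in protein.items() if pos == [1]))
```

Enumerating every candidate word is exponential in `w`, which is acceptable for the short words used here; it is meant to show the rule on the slide, not an efficient construction.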
-
BLAST
Step 2: Scan sequence B for hits.
-
BLAST
Step 2: Scan sequence B for hits.
Step 3: Extend hits.
[Figure label: hit]
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
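Steps 2 and 3 can be sketched as a scan for exact word hits followed by an ungapped extension that stops once the score has faded. This is a simplified illustration, not BLAST's actual implementation: the drop-off parameter `x_drop`, the +5 / -4 scores, the function names, and the example sequences are all assumptions.

```python
def scan_hits(seq_a, seq_b, w=3):
    """Yield 0-based (i, j) offsets where a w-letter word of A occurs in B."""
    index = {}
    for i in range(len(seq_a) - w + 1):
        index.setdefault(seq_a[i:i + w], []).append(i)
    for j in range(len(seq_b) - w + 1):
        for i in index.get(seq_b[j:j + w], ()):
            yield i, j

def extend_hit(seq_a, seq_b, i, j, w, x_drop=10, match=5, mismatch=-4):
    """Extend a hit without gaps; return (score, start_a, start_b, length)."""
    def walk(step, a_pos, b_pos):
        best = run = best_len = n = 0
        while 0 <= a_pos < len(seq_a) and 0 <= b_pos < len(seq_b):
            run += match if seq_a[a_pos] == seq_b[b_pos] else mismatch
            n += 1
            if run > best:
                best, best_len = run, n
            if best - run > x_drop:  # the score has faded away: stop
                break
            a_pos += step
            b_pos += step
        return best, best_len

    seed = sum(match if seq_a[i + t] == seq_b[j + t] else mismatch for t in range(w))
    right, r_len = walk(+1, i + w, j + w)
    left, l_len = walk(-1, i - 1, j - 1)
    return seed + left + right, i - l_len, j - l_len, l_len + w + r_len

a, b = "TTGATCCATCAA", "CCGATCCATCGG"
best = max(extend_hit(a, b, i, j, 3) for i, j in scan_hits(a, b))
print(best)
```

Each direction keeps the best prefix it has seen, so positions scanned after the peak and before the drop-off do not lower the reported score.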
-
Gapped BLAST (I)
The two-hit method
-
Gapped BLAST (II)
Confining the dynamic programming
[Figure: an HSP with score at least S, the seed residue pair chosen within it, and the dynamic-programming region confined by X]
-
BLAT
[Figure: diagram with the labels database, index, and query]
-
PatternHunter (I)
-
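PatternHunter (II)
Sequence: GATCCATCTT (positions 1-10)
Hash table of spaced words: each word takes the 1st, 2nd, and 4th base of a 4-base window (the entries are consistent with the spaced seed 11*1).
Word  Code         Start positions
AAA   000000 (0)
ATC   001101 (13)  2
ATG   001110 (14)
ATT   001111 (15)  6
CAC   010001 (17)  5
CCT   010111 (23)  4
GAC   100001 (33)  1
TCA   110100 (52)  3
TCT   110111 (55)  7
TTT   111111 (63)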
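The words in the PatternHunter (II) table (GAC, ATC, TCA, CCT, CAC, ATT, TCT at positions 1-7) are what one obtains by sampling the first, second, and fourth base of every 4-base window of GATCCATCTT. The sketch below indexes a sequence that way; the seed string "11*1" is inferred from those table entries rather than stated on the slide, and the function name `spaced_seed_table` is illustrative.

```python
from collections import defaultdict

def spaced_seed_table(seq, seed="11*1"):
    """Index seq under a spaced seed: '1' positions are sampled, '*' are skipped."""
    sampled = [t for t, c in enumerate(seed) if c == "1"]
    table = defaultdict(list)
    for i in range(len(seq) - len(seed) + 1):
        word = "".join(seq[i + t] for t in sampled)
        table[word].append(i + 1)  # 1-based start of the window
    return dict(table)

table = spaced_seed_table("GATCCATCTT")
print(table)
```

Passing a seed made only of '1' characters, such as "111", reduces this to the contiguous-word table used on the Hash Tables slide.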
-
Remarks
Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.
The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.