Homology Search Tools


  • Homology Search Tools

    Kun-Mao Chao (趙坤茂)

    Department of Computer Science and Information Engineering

    National Taiwan University, Taiwan

    WWW: http://www.csie.ntu.edu.tw/~kmchao

  • Homology Search Tools

    Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)

    FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

    BLAST (Altschul et al., 1990; Altschul et al., 1997)

    BLAT (Kent, 2002)

    PatternHunter (Li et al., 2004)

  • Finding Exact Word Matches

    Hash Tables

    Suffix Trees

    Suffix Arrays

  • Hash Tables

    Sequence: GATCCATCTT (positions 1 to 10). Each 3-mer is encoded with two bits per base (A = 00, C = 01, G = 10, T = 11) and points to the positions where it starts.

    AAA  000000 (0)
    ...
    ATC  001101 (13)   2, 6
    ...
    CAT  010011 (19)   5
    CCA  010100 (20)   4
    ...
    CTT  011111 (31)   8
    ...
    GAT  100011 (35)   1
    ...
    TCC  110101 (53)   3
    TCG  110110 (54)
    TCT  110111 (55)   7
    ...
    TTT  111111 (63)
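The hash table on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular tool; the names `encode` and `build_index` are hypothetical, and the two-bit codes are chosen to reproduce the values shown above (ATC = 13, GAT = 35).

```python
from collections import defaultdict

# Two-bit codes matching the slide: AAA -> 0, ATC -> 13, TTT -> 63.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Pack a DNA word into an integer, two bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value

def build_index(seq, k=3):
    """Map each k-mer code to the 1-based positions where the k-mer starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[encode(seq[i:i + k])].append(i + 1)
    return index

index = build_index("GATCCATCTT")
print(index[encode("ATC")])  # [2, 6]
```

Looking up a word is then a single table access, which is why exact word matches are cheap to find once the index is built.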

  • Suffix Trees (I)

    [Figure: suffix tree of GATCCATCTT (positions 1 to 10). Edge labels include ATC, C, T, CATCTT, TT, ATCTT, and GATCCATCTT; leaves are numbered by the start position of their suffix.]

  • Suffix Trees (II)

    [Figure: suffix tree of GATCCATCTT$ (positions 1 to 11), where $ is an end marker. Edge labels include ATC, C, T, $, CATCTT$, TT$, T$, ATCTT$, and GATCCATCTT$; leaves are numbered by the start position of their suffix.]
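A small check, written as a sketch, of why the second tree appends the end marker $: without it, some suffix can be a prefix of a longer suffix and so would not end at a leaf of its own. The function name `suffixes_that_are_prefixes` is hypothetical.

```python
def suffixes_that_are_prefixes(text):
    """Return 1-based start positions of suffixes that are prefixes of another suffix."""
    n = len(text)
    hidden = []
    for i in range(n):
        suffix = text[i:]
        # Any other suffix that starts with this one must be longer.
        if any(text[j:].startswith(suffix) for j in range(n) if j != i):
            hidden.append(i + 1)
    return hidden

print(suffixes_that_are_prefixes("GATCCATCTT"))   # [10]
print(suffixes_that_are_prefixes("GATCCATCTT$"))  # []
```

For GATCCATCTT only the suffix T at position 10 is affected; with $ appended, no suffix is a prefix of another.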

  • Suffix Arrays

    Sequence: GATCCATCTT (positions 1 to 10). Suffixes in lexicographic order, with their start positions:

    ATCCATCTT    2
    ATCTT        6
    CATCTT       5
    CCATCTT      4
    CTT          8
    GATCCATCTT   1
    T           10
    TCCATCTT     3
    TCTT         7
    TT           9
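The suffix array above can be reproduced with a naive sort, and a word can then be located by binary search. This is only a sketch under simple assumptions (quadratic-time construction, hypothetical function names); practical tools use faster construction algorithms.

```python
def build_suffix_array(text):
    """Return 1-based suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find_occurrences(text, sa, word):
    """Binary-search the suffix array for all suffixes that start with word."""
    def prefix(i):
        return text[i - 1:i - 1 + len(word)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= word
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < word:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > word
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "GATCCATCTT"
sa = build_suffix_array(text)
print(sa)                                 # [2, 6, 5, 4, 8, 1, 10, 3, 7, 9]
print(find_occurrences(text, sa, "ATC"))  # [2, 6]
```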

  • FASTA

    Find runs of identities, and identify regions with the highest density of identities.

    Re-score using a PAM matrix, and keep the top-scoring segments.

    Eliminate segments that are unlikely to be part of the alignment.

    Optimize the alignment in a band.

  • FASTA

    Step 1: Find runs of identities, and identify regions with the highest density of identities.

    [Figure axes: Sequence A, Sequence B.]
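One common way to realize Step 1 is to count short word identities per diagonal, since a run of identities between the two sequences stays on one diagonal. The sketch below assumes a word length `k` of 2 and a hypothetical function name `diagonal_hits`; it illustrates the idea and is not FASTA's actual code.

```python
from collections import Counter, defaultdict

def diagonal_hits(seq_a, seq_b, k=2):
    """Count k-tuple identities on each diagonal i - j (0-based start positions)."""
    positions = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions[seq_a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in positions.get(seq_b[j:j + k], []):
            counts[i - j] += 1
    return counts

counts = diagonal_hits("GATCCATCTT", "CCATCT")
print(counts.most_common(1))  # [(3, 5)]
```

The diagonal with the most hits marks the region with the highest density of identities, which is the candidate passed on to the re-scoring step.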

  • FASTA

    Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.

  • FASTA

    Step 3: Eliminate segments that are unlikely to be part of the alignment.

  • FASTA

    Step 4: Optimize the alignment in a band.
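Step 4 can be illustrated with a local alignment restricted to a band of diagonals. The sketch below is an assumption-laden example, not FASTA's implementation: it uses the DNA scores +5 and -4, a hypothetical linear gap penalty of -6, and hypothetical parameter names `center` and `width` for the band.

```python
def banded_local_score(a, b, center, width, match=5, mismatch=-4, gap=-6):
    """Best local alignment score using only cells with |(i - j) - center| <= width."""
    prev = {}
    best = 0
    for i in range(len(a) + 1):
        cur = {}
        lo = max(0, i - center - width)
        hi = min(len(b), i - center + width)
        for j in range(lo, hi + 1):
            score = 0  # a local alignment may start at any cell in the band
            if i > 0 and j > 0 and (j - 1) in prev:
                s = match if a[i - 1] == b[j - 1] else mismatch
                score = max(score, prev[j - 1] + s)   # diagonal move
            if i > 0 and j in prev:
                score = max(score, prev[j] + gap)     # gap in b
            if j > 0 and (j - 1) in cur:
                score = max(score, cur[j - 1] + gap)  # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

print(banded_local_score("GATCCATCTT", "CCATCT", center=3, width=1))  # 30
```

Only about (2 × width + 1) cells per row are filled, so the cost grows with the band width rather than with the full product of the sequence lengths.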

  • BLAST

    Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)

    The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

  • The maximal segment pair measure

    A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5; mismatches -4.)

    The MSP score can be computed exactly in time proportional to the product of the two sequence lengths. (How?)

    An exact procedure is too time-consuming.

    BLAST heuristically attempts to calculate the MSP score.
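One possible answer to the "How?" above: a segment pair has no gaps, so it lies on a single diagonal, and the best segment ending at each cell follows from the best segment ending at the previous cell on the same diagonal. The sketch below (hypothetical name `msp_score`) fills one cell per pair of positions, which gives time proportional to the product of the lengths.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score: best ungapped segment pair over all diagonals."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # cur[j]: best score of a segment pair ending at a[i-1], b[j-1];
            # either extend the segment on the same diagonal or start afresh.
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best

print(msp_score("ACGTA", "ACCTA"))  # 16
```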

  • A matrix of similarity scores

    [Figure: matrix of similarity scores for two DNA sequences; each cell holds 5 for an identity and -4 for a mismatch.]

  • A maximum-scoring segment

    [Figure: the same matrix of similarity scores (5 for an identity, -4 for a mismatch), with numbered rows and columns and a maximum-scoring segment marked.]

  • BLAST

    Build the hash table for Sequence A.

    Scan Sequence B for hits.

    Extend hits.

  • BLAST

    Step 1: Build the hash table for Sequence A. (3-tuple example)

    For DNA sequences:

    Seq. A = AGATCGAT
             12345678

    AAA
    AAC
    ..
    AGA  1
    ..
    ATC  3
    ..
    CGA  5
    ..
    GAT  2 6
    ..
    TCG  4
    ..
    TTT

    For protein sequences:

    Seq. A = ELVIS

    Add xyz to the hash table if Score(xyz, ELV) ≥ T;
    Add xyz to the hash table if Score(xyz, LVI) ≥ T;
    Add xyz to the hash table if Score(xyz, VIS) ≥ T;
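The rule "add xyz if Score(xyz, word) ≥ T" can be sketched by enumerating every candidate word and scoring it against each word of Sequence A. The example is an illustration only: `neighborhood_index` and `toy_score` are hypothetical names, and the toy +5/-4 scoring on DNA stands in for the substitution matrix that a protein search would use.

```python
from itertools import product

def toy_score(x, y):
    """Stand-in scoring function (assumption): +5 for identity, -4 otherwise."""
    return 5 if x == y else -4

def neighborhood_index(query, alphabet, score, w=3, threshold=15):
    """Map each word scoring >= threshold against a query w-mer to 1-based query positions."""
    index = {}
    for pos in range(len(query) - w + 1):
        target = query[pos:pos + w]
        for letters in product(alphabet, repeat=w):
            word = "".join(letters)
            if sum(score(x, y) for x, y in zip(word, target)) >= threshold:
                index.setdefault(word, []).append(pos + 1)
    return index

exact = neighborhood_index("AGATCGAT", "ACGT", toy_score, threshold=15)
print(exact["GAT"])   # [2, 6]
fuzzy = neighborhood_index("AGATCGAT", "ACGT", toy_score, threshold=6)
print(fuzzy["GAA"])   # [2, 6]
```

With the threshold at 15 only exact 3-mers qualify, which reproduces the DNA table above; lowering it to 6 also admits words with one mismatch, as in the protein case.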

  • BLAST

    Step 2: Scan Sequence B for hits.

  • BLAST

    Step 3: Extend hits.

    Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

    BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
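The termination rule for extension can be sketched as follows. This is a minimal ungapped, one-directional example; the function name `extend_right` and the drop-off parameter `x_drop` (the "certain distance" in the text) are hypothetical, and the scores are the DNA values +5 and -4.

```python
def extend_right(a, b, i, j, x_drop=10, match=5, mismatch=-4):
    """Extend an ungapped hit rightward from a[i], b[j] (0-based).

    Stops once the running score falls more than x_drop below the best
    score seen so far; returns (best_score, best_length).
    """
    score = best = 0
    length = best_len = 0
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
    return best, best_len

print(extend_right("ACGTTTTT", "ACGAAAAA", 0, 0))  # (15, 3)
```

A full extension would run the same loop leftward from the hit and add the two best scores.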

  • Gapped BLAST (I)

    The two-hit method
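As a rough sketch of the two-hit idea: an extension is triggered only when two non-overlapping word hits fall on the same diagonal within a fixed distance of each other. The function name `two_hit_seeds`, the word length `w`, and the `window` value are assumptions made for illustration.

```python
def two_hit_seeds(hits, w=3, window=40):
    """Return hits preceded by a non-overlapping hit on the same diagonal within window.

    hits: iterable of (i, j) start positions of word hits in the two sequences.
    """
    last_on_diag = {}
    seeds = []
    for i, j in sorted(hits):
        d = i - j
        prev = last_on_diag.get(d)
        if prev is not None:
            dist = i - prev
            if dist < w:
                continue  # overlaps the remembered hit; keep the older one
            if dist <= window:
                seeds.append((i, j))
        last_on_diag[d] = i
    return seeds

print(two_hit_seeds([(0, 0), (5, 5), (9, 2)]))  # [(5, 5)]
```

Requiring two hits lets the word threshold be lowered without extending every isolated hit, which is where the saving in extension time comes from.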

  • Gapped BLAST (II)

    Confining the dynamic programming

    [Figure: an HSP with score at least S, a seed residue pair, and the region confined by X.]

  • BLAT

    [Figure labels: database, index, query.]

  • PatternHunter (I)

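  • PatternHunter (II)

    Sequence: GATCCATCTT (positions 1 to 10). Each key takes the 1st, 2nd, and 4th letters of a 4-letter window (spaced seed 1101) and is encoded with two bits per base (A = 00, C = 01, G = 10, T = 11).

    AAA  000000 (0)
    ...
    ATC  001101 (13)   2
    ATG  001110 (14)
    ATT  001111 (15)   6
    ...
    CAC  010001 (17)   5
    ...
    CCT  010111 (23)   4
    ...
    GAC  100001 (33)   1
    ...
    TCA  110100 (52)   3
    ...
    TCT  110111 (55)   7
    ...
    TTT  111111 (63)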
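The keys on this slide are consistent with sampling each 4-letter window of GATCCATCTT at the positions marked 1 in the pattern 1101. A minimal sketch of such a spaced-seed index follows; the function name `spaced_index` is hypothetical and the seed is the one inferred from the slide's keys.

```python
from collections import defaultdict

def spaced_index(seq, seed="1101"):
    """Index seq by the letters under the 1-positions of a spaced seed."""
    care = [k for k, c in enumerate(seed) if c == "1"]
    index = defaultdict(list)
    for i in range(len(seq) - len(seed) + 1):
        key = "".join(seq[i + k] for k in care)
        index[key].append(i + 1)   # 1-based window start
    return index

index = spaced_index("GATCCATCTT")
print(dict(index))
# {'GAC': [1], 'ATC': [2], 'TCA': [3], 'CCT': [4], 'CAC': [5], 'ATT': [6], 'TCT': [7]}
```

Setting `seed="111"` turns this back into the contiguous 3-mer index of the Hash Tables slide.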

  • Remarks

    Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.

    The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.