606finalpres.ppt

Upload: rednri

Post on 02-Jun-2018

TRANSCRIPT

  • 8/10/2019 606FinalPres.ppt

    1/32

    Hidden Markov Models in Bioinformatics

    Example Domain: Gene Finding

    Colin Cherry

    colinc@cs

    2/32

    To recap last episode

    Hidden Markov Models (HMMs)

    Protein Family Characterization

    Profile HMMs for protein family characterization

    How profile HMMs can do homology search

    3/32

    ...picking up where we left off

    Profile HMMs were good to start with

    Today's goal: Introduce HMMs as general tools in bioinformatics

    I will use the problem of Gene Finding as an example of an ideal HMM problem domain

    4/32

    Learning Objectives

    When I'm done you should know:

    1. When is an HMM a good fit for a problem space?

    2. What materials are needed before work can begin with an HMM?

    3. What are the advantages and disadvantages of using HMMs?

    4. What are the general objectives and challenges in the gene finding task?

    5/32

    Outline

    HMMs as Statistical Models

    The Gene Finding task at a glance

    Good problems for HMMs

    HMM Advantages

    HMM Disadvantages

    Gene Finding Examples

    6/32

    Statistical Models

    Definition:

    Any mathematical construct that attempts to parameterize a random process

    Example: A normal distribution

    Assumptions

    Parameters

    Estimation

    Usage

    HMMs are just a little more complicated
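To make the four headings concrete for the normal distribution, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The sample values are invented for illustration; the assumption is that they are independent draws from one bell-shaped distribution.

```python
from statistics import NormalDist

# Invented measurements, assumed to be independent draws from one normal distribution.
samples = [4.8, 5.1, 5.0, 4.9, 5.2]

# Estimation: fit the two parameters (mean and standard deviation) to the data.
model = NormalDist.from_samples(samples)

# Parameters: everything the model knows is stored in these two numbers.
print("mean:", model.mean, "stdev:", model.stdev)

# Usage: score new observations under the fitted model.
print("density at 5.0:", model.pdf(5.0))
print("P(4.8 <= x <= 5.2):", model.cdf(5.2) - model.cdf(4.8))
```

An HMM follows the same pattern, only with more parameters and a more involved estimation step.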

    7/32

    HMM Assumptions

    Observations are ordered

    Random process can be represented by a stochastic finite state machine with emitting states.

    8/32

    HMM Parameters

    Using weather example

    Modeling daily weather for a year

    Ra Ra Su Su Su Ra ...

    Lots of parameters

    One for each table entry

    Represented in two tables.

    One for emissions

    One for transitions
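The two tables can be sketched as nested dictionaries. The slides do not name the hidden states, so the "High" and "Low" pressure states below are hypothetical, and every probability is invented for illustration. The sampler walks the state machine and emits one weather symbol per day.

```python
import random

# Hypothetical hidden states (not named in the slides): atmospheric pressure.
STATES = ["High", "Low"]
SYMBOLS = ["Su", "Ra"]  # observed weather: sunny or rainy

# Transition table: P(next state | current state). Values are invented.
TRANSITIONS = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}

# Emission table: P(observation | state). Values are invented.
EMISSIONS = {
    "High": {"Su": 0.9, "Ra": 0.1},
    "Low": {"Su": 0.2, "Ra": 0.8},
}

# Initial state distribution (also an assumption).
START = {"High": 0.5, "Low": 0.5}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} table."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_weather(days, seed=0):
    """Walk the state machine for `days` steps, emitting one symbol per day."""
    rng = random.Random(seed)
    state = draw(START, rng)
    observations = []
    for _ in range(days):
        observations.append(draw(EMISSIONS[state], rng))
        state = draw(TRANSITIONS[state], rng)
    return observations


year = sample_weather(365)
print(" ".join(year[:6]))
```

Each row of each table must sum to 1, which is why a table with k columns contributes only k - 1 free parameters per row.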

    9/32

    HMM Estimation

    Called training, it falls under machine learning

    Feed an architecture (given in advance) a set of observation sequences

    The training process will iteratively alter its parameters to fit the training set

    The trained model will assign the training sequences high probability
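The slides do not name a training algorithm. One common choice for unlabeled data is Baum-Welch (an expectation-maximization procedure), sketched below under several assumptions: a fully connected two-state architecture, observations encoded as integers (0 = "Su", 1 = "Ra"), invented training sequences and starting parameters, and sequences short enough that no numerical scaling is needed.

```python
import math


def forward(obs, start, trans, emit):
    """alpha[t][i] = P(obs[0..t], state at time t is i)."""
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[j][symbol] * sum(prev[i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    """beta[t][i] = P(obs[t+1..] | state at time t is i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][symbol] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def normalize(row):
    """Rescale expected counts so the row sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def baum_welch_step(seqs, start, trans, emit):
    """One EM update; also returns the log-likelihood under the old parameters."""
    n, m = len(start), len(emit[0])
    start_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    log_likelihood = 0.0
    for obs in seqs:
        alpha = forward(obs, start, trans, emit)
        beta = backward(obs, trans, emit)
        p_obs = sum(alpha[-1])
        log_likelihood += math.log(p_obs)
        for t, symbol in enumerate(obs):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / p_obs  # P(state i at t | obs)
                emit_acc[i][symbol] += gamma
                if t == 0:
                    start_acc[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):
                        trans_acc[i][j] += (alpha[t][i] * trans[i][j]
                                            * emit[j][obs[t + 1]]
                                            * beta[t + 1][j] / p_obs)
    return (normalize(start_acc), [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc], log_likelihood)


# Toy training set: 0 = "Su", 1 = "Ra" (invented data).
sequences = [[1, 1, 0, 0, 0, 1], [0, 0, 0, 1, 1, 1, 1, 0], [1, 0, 0, 0, 0, 1, 1]]
start = [0.6, 0.4]
trans = [[0.6, 0.4], [0.3, 0.7]]
emit = [[0.7, 0.3], [0.4, 0.6]]

history = []
for _ in range(15):
    start, trans, emit, log_likelihood = baum_welch_step(sequences, start, trans, emit)
    history.append(log_likelihood)

print(history[0], history[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which is the sense in which the trained model assigns the training sequences high probability. It can still stall at a local maximum.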

    10/32

    HMM Usage

    Two major tasks

    Evaluate the probability of an observation sequence given the model (Forward)

    Find the most likely path through the model for a given observation sequence (Viterbi)
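Both tasks can be sketched in a few lines of dynamic programming. The model below reuses the weather symbols from the earlier slide; the hidden states "High" and "Low" and all probabilities are assumptions made up for this example.

```python
# Hypothetical two-state weather model; all numbers are invented.
STATES = ("High", "Low")
START = {"High": 0.5, "Low": 0.5}
TRANS = {"High": {"High": 0.8, "Low": 0.2}, "Low": {"High": 0.3, "Low": 0.7}}
EMIT = {"High": {"Su": 0.9, "Ra": 0.1}, "Low": {"Su": 0.2, "Ra": 0.8}}


def forward_probability(obs):
    """P(obs | model), summed over every state path (Forward algorithm)."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


def viterbi_path(obs):
    """Single most likely state path for obs (Viterbi algorithm)."""
    best = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        pointers, scores = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: best[r] * TRANS[r][s])
            pointers[s] = prev
            scores[s] = best[prev] * TRANS[prev][s] * EMIT[s][symbol]
        backpointers.append(pointers)
        best = scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
print(forward_probability(observations))
print(viterbi_path(observations))
```

Forward gives a score for the whole sequence, while Viterbi gives one label per observation. With these invented numbers the rainy days are labeled "Low" and the sunny days "High".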

    11/32

    Gene Finding

    (An Ideal HMM Domain)

    Our Objective:

    To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

    Our Motivation:

    Assist in the annotation of genomic data produced by genome sequencing methods

    Gain insight into the mechanisms involved in transcription, splicing and other processes

    12/32

    Gene Finding Terminology

    A string of DNA nucleotides containing a gene will have separate regions (lines):

    Introns: non-coding regions within a gene

    Exons: coding regions

    Separated by functional sites (boxes)

    Start and stop codons

    Splice sites: acceptors and donors

    13/32

    Gene Finding Challenges

    Need the correct reading frame

    Introns can interrupt an exon in mid-codon

    There is no hard and fast rule for identifying donor and acceptor splice sites

    Signals are very weak
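The reading-frame problem can be illustrated with a toy sequence. Everything below is invented for the example: the DNA string, the exon coordinates (0-based, end-exclusive), and the helper names. The intron starts after five coding bases, so it splits the second codon.

```python
# Toy gene: exon "ATGGC", intron "GTAAGTTTCAG", exon "TTGGTAA" (all invented).
DNA = "ATGGCGTAAGTTTCAGTTGGTAA"
EXONS = [(0, 5), (16, 23)]


def splice(dna, exons):
    """Join the exon slices into one coding sequence."""
    return "".join(dna[start:end] for start, end in exons)


def codons(sequence):
    """Split a sequence into complete codons, reading from position 0."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def intron_phases(exons):
    """For each intron, how many bases of the interrupted codon precede it."""
    phases, coding_length = [], 0
    for start, end in exons[:-1]:
        coding_length += end - start
        phases.append(coding_length % 3)
    return phases


print(codons(splice(DNA, EXONS)))  # frame of the spliced coding sequence
print(codons(DNA))                 # naive read straight through the intron
print(intron_phases(EXONS))
```

Reading the unspliced string in frame hits a stop codon (TAA) inside the intron, while the spliced sequence reads through to the real stop. A gene finder therefore has to carry the codon position across each intron.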

    14/32

    What makes a good HMM problem space?

    Characteristics:

    Classification problems

    There are two main types of output from an HMM:

    Scoring of sequences

    (Protein family modeling)

    Labeling of observations within a sequence

    (Gene Finding)

    15/32

    HMM Problem Characteristics

    Continued

    The observations in a sequence should have a clear and meaningful order

    Unordered observations will not map easily to states

    It's beneficial, but not necessary, for the observations to follow some sort of grammar

    Makes it easier to design an architecture

    Gene Finding

    Protein Family Modeling

    16/32

    HMM Requirements

    So you've decided you want to build an HMM; here's what you need:

    An architecture

    Probably the hardest part

    Should be biologically sound & easy to interpret

    A well-defined success measure

    Necessary for any form of machine learning

    17/32

    HMM Requirements

    Continued

    Training data

    Labeled or unlabeled: it depends

    You do not always need a labeled training set to do observation labeling, but it helps

    Amount of training data needed is:

    Directly proportional to the number of free parameters in the model

    Inversely proportional to the size of the training sequences
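To get a feel for how quickly the number of free parameters grows, here is a small sketch that counts them for a fully connected HMM. Full connectivity is an assumption; a hand-designed architecture that forbids most transitions has fewer. Each probability row must sum to 1, so a row with k entries contributes k - 1 free parameters.

```python
def free_parameters(n_states, n_symbols):
    """Free parameters of a fully connected HMM with a learned start distribution."""
    transitions = n_states * (n_states - 1)  # one row per state, n_states - 1 free each
    emissions = n_states * (n_symbols - 1)   # one row per state, n_symbols - 1 free each
    initial = n_states - 1                   # the start distribution
    return transitions + emissions + initial


print(free_parameters(2, 2))   # two-state weather model, two symbols
print(free_parameters(10, 4))  # ten states over the DNA alphabet
```

The transition term grows quadratically with the number of states, which is one reason a constrained architecture is easier to train.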

    18/32

    Why HMMs might be a good fit for Gene Finding

    Classification: Classifying observations within a sequence

    Order: A DNA sequence is a set of ordered observations

    Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here:

    Success measure: # of complete exons correctly labeled

    Training data: Available from various genome annotation projects
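The success measure, the number of complete exons correctly labeled, can be sketched as follows. The one-letter label scheme (N = intergenic, E = exon, I = intron) and the two label strings are hypothetical. An exon counts as correct only if both of its boundaries match exactly.

```python
from itertools import groupby


def exon_intervals(labels, exon_label="E"):
    """Return the set of (start, end) spans, end-exclusive, labeled as exon."""
    spans, position = set(), 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == exon_label:
            spans.add((position, position + length))
        position += length
    return spans


def exon_level_scores(true_labels, predicted_labels):
    """Count exactly matching exons, plus sensitivity and precision."""
    true_exons = exon_intervals(true_labels)
    predicted_exons = exon_intervals(predicted_labels)
    correct = len(true_exons & predicted_exons)
    sensitivity = correct / len(true_exons) if true_exons else 0.0
    precision = correct / len(predicted_exons) if predicted_exons else 0.0
    return correct, sensitivity, precision


truth = "NNEEEEIIIEEENN"
predicted = "NNEEEEIIEEEENN"
print(exon_level_scores(truth, predicted))
```

In this toy case the second predicted exon starts one base early, so only one of the two exons counts as correct even though almost every individual base is labeled correctly.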

    19/32

    HMM Advantages

    Statistical Grounding

    Statisticians are comfortable with the theory behind hidden Markov models

    Freedom to manipulate the training and verification processes

    Mathematical / theoretical analysis of the results and processes

    HMMs are still very powerful modeling tools, far more powerful than many statistical methods

    20/32

    HMM Advantages continued

    Modularity

    HMMs can be combined into larger HMMs

    Transparency of the Model

    Assuming an architecture with a good design

    People can read the model and make sense of it

    The model itself can help increase understanding

    21/32

    HMM Advantages continued

    Incorporation of Prior Knowledge

    Incorporate prior knowledge into the architecture

    Initialize the model close to something believed to be correct

    Use prior knowledge to constrain training process
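One simple way to fold prior knowledge into parameter estimates is to add pseudocounts to the observed counts before normalizing. The slides do not say which mechanism the speaker had in mind, so this is only one possible reading, and the counts below are invented.

```python
def estimate_emissions(counts, pseudocounts):
    """Emission estimate: (observed count + pseudocount) / total."""
    total = sum(counts[s] + pseudocounts[s] for s in counts)
    return {s: (counts[s] + pseudocounts[s]) / total for s in counts}


# Invented counts of nucleotides emitted by one state in a tiny training set.
observed = {"A": 3, "C": 0, "G": 1, "T": 0}

no_prior = estimate_emissions(observed, {s: 0 for s in observed})
with_prior = estimate_emissions(observed, {s: 1 for s in observed})

print(no_prior)
print(with_prior)
```

Without the prior, C and T get probability zero and the model can never emit them from this state. With one pseudocount each, they stay possible while the observed counts still dominate as more data arrives.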

    22/32

    How does Gene Finding make use of HMM advantages?

    Statistics:

    Many systems alter the training process to better suit their success measure

    Modularity:

    Almost all systems use a combination of models, each individually trained for each gene region

    Prior Knowledge:

    A fair amount of prior biological knowledge is built into each architecture

    23/32

    HMM Disadvantages

    Markov Chains

    States are supposed to be independent

    P(y) must be independent of P(x), and vice versa

    This usually isn't true

    Can get around it when relationships are local

    Not good for RNA folding problems

    [Figure: P(x) and P(y)]

    24/32

    HMM Disadvantages

    continued

    Standard Machine Learning Problems

    Watch out for local maxima

    Model may not converge to a truly optimal parameter set for a given training set

    Avoid over-fitting

    You're only as good as your training set

    More training is not always good

    25/32

    HMM Disadvantages

    continued

    Speed!!!

    Almost everything one does in an HMM involves enumerating all possible paths through the model

    There are efficient ways to do this

    Still slow in comparison to other methods
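The gap between naive enumeration and the efficient approach can be seen in a small sketch. It computes the same sequence probability twice: once by listing every state path, and once with the Forward recursion. The two-state model and its probabilities are invented, and observations are encoded as integers.

```python
from itertools import product

# Hypothetical two-state model; all numbers are invented.
START = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.3, 0.7]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]  # columns: symbol 0, symbol 1


def brute_force_probability(obs):
    """Sum the joint probability over every state path; also count the paths."""
    n = len(START)
    total, paths = 0.0, 0
    for path in product(range(n), repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][symbol]
        total += p
        paths += 1
    return total, paths


def forward_probability(obs):
    """Same quantity via dynamic programming, one position at a time."""
    n = len(START)
    alpha = [START[s] * EMIT[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in range(n))
                 for s in range(n)]
    return sum(alpha)


obs = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
slow, path_count = brute_force_probability(obs)
fast = forward_probability(obs)
print(path_count, slow, fast)
```

For N states and T observations there are N^T paths, while the recursion needs on the order of N^2 * T operations. That is still quadratic in the number of states, which is why large architectures on genome-length sequences remain slow.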

    26/32

    HMM Gene Finders:

    VEIL

    A straight HMM Gene Finder

    Takes advantage of grammatical structure and modular design

    Uses many states that can only emit one symbol to get around state independence

    27/32

    HMM Gene Finders:

    HMMGene

    Uses an extended HMM called a CHMM

    CHMM = HMM with classes

    Takes full advantage of being able to modify the statistical algorithms

    Uses high-order states

    Trains everything at once

    28/32

    HMM Gene Finders:

    Genie

    Uses a generalized HMM (GHMM)

    Edges in model are complete HMMs

    States can be any arbitrary program

    States are actually neural networks specially designed for signal finding

    29/32

    Conclusions

    HMMs have problems where they excel, and problems where they do not

    You should consider using one if:

    Problem can be phrased as classification

    Observations are ordered

    The observations follow some sort of grammatical structure (optional)

    30/32

    Conclusions

    Advantages:

    Statistics

    Modularity

    Transparency

    Prior Knowledge

    Disadvantages:

    State independence

    Over-fitting

    Local maxima

    Speed

    31/32

    Some final words

    Lots of problems can be phrased as classification problems

    Homology search, sequence alignment

    If an HMM does not fit, there are all sorts of other methods to try with ML/AI:

    Neural Networks, Decision Trees, Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics

    32/32

    Questions

    Any Questions?