606finalpres.ppt
TRANSCRIPT
-
8/10/2019 606FinalPres.ppt
1/32
Hidden Markov Models in Bioinformatics
Example Domain: Gene Finding
Colin Cherry
colinc@cs
-
To recap last episode
Hidden Markov Models (HMMs)
Protein Family Characterization
Profile HMMs for protein family
characterization
How profile HMMs can do homology search
-
...picking up where we left off
Profile HMMs were good to start with
Today's goal: introduce HMMs as general tools in bioinformatics
I will use the problem of Gene Finding as an example of an ideal HMM problem domain
-
Learning Objectives
When I'm done, you should know:
1. When is an HMM a good fit for a problem space?
2. What materials are needed before work can
begin with an HMM?
3. What are the advantages and disadvantages of
using HMMs?
4. What are the general objectives and challenges in the gene finding task?
-
Outline
HMMs as Statistical Models
The Gene Finding task at a glance
Good problems for HMMs
HMM Advantages
HMM Disadvantages
Gene Finding Examples
-
Statistical Models
Definition:
Any mathematical construct that attempts to parameterize
a random process
Example: A normal distribution
Assumptions
Parameters
Estimation
Usage
HMMs are just a little more complicated
-
HMM Assumptions
Observations are ordered
Random process can be represented by a
stochastic finite state machine with emitting states.
-
HMM Parameters
Using a weather example: modeling daily weather for a year
Ra Ra Su Su Su Ra ...
Represented in two tables: one for emissions, one for transitions
Lots of parameters: one for each table entry
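As a concrete illustration, the two tables for a toy two-state weather model might look like this in Python. The probability values are made up for the example, not taken from the slides:

```python
# Toy weather HMM parameters (illustrative numbers, not from the slides).
# Hidden states: "Rainy", "Sunny"; observations: the daily weather record,
# which may occasionally mislabel the true state.

# Transition table: P(next state | current state), one row per state.
transitions = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# Emission table: P(observation | state), one row per state.
emissions = {
    "Rainy": {"Ra": 0.9, "Su": 0.1},
    "Sunny": {"Ra": 0.2, "Su": 0.8},
}

# Every table entry is one parameter; each row must sum to 1,
# which removes one degree of freedom per row.
for table in (transitions, emissions):
    for state, row in table.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9
```

Each entry is a free parameter (up to the sum-to-one constraint per row), which is why even small models accumulate "lots of parameters".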
-
HMM Estimation
Called training, it falls under machine learning
Feed an architecture (given in advance) a set of observation sequences
The training process will iteratively alter its
parameters to fit the training set
The trained model will assign the training
sequences high probability
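A minimal sketch of the labeled case: when the training sequences come with state labels, estimation reduces to normalized counting. The iterative scheme the slide describes is the unlabeled case (Baum-Welch/EM), which replaces these hard counts with expected counts. Names and data here are illustrative:

```python
from collections import defaultdict

def estimate(labeled_seqs):
    """Estimate HMM tables from state-labeled data by normalized counting.

    labeled_seqs: list of sequences, each a list of (state, observation)
    pairs. (Unlabeled training -- Baum-Welch -- iteratively replaces these
    hard counts with expected counts from the Forward-Backward algorithm.)
    """
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for seq in labeled_seqs:
        for i, (state, obs) in enumerate(seq):
            emit[state][obs] += 1            # count emission state -> obs
            if i + 1 < len(seq):
                trans[state][seq[i + 1][0]] += 1  # count state -> next state

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {s: {k: v / sum(row.values()) for k, v in row.items()}
                for s, row in table.items()}

    return normalize(trans), normalize(emit)

# A labeled toy weather sequence: (hidden state, recorded observation).
data = [[("Rainy", "Ra"), ("Rainy", "Ra"), ("Sunny", "Su"), ("Sunny", "Su")]]
trans, emit = estimate(data)
```

By construction the resulting tables assign the training sequences high probability, which is exactly the behavior described above.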
-
HMM Usage
Two major tasks
Evaluate the probability of an observation
sequence given the model (Forward)
Find the most likely path through the model
for a given observation sequence (Viterbi)
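Both tasks can be sketched in a few lines of Python over toy weather tables (toy numbers, not from the slides). Viterbi is run in log space to avoid underflow on long sequences:

```python
import math

def forward(obs, states, start, trans, emit):
    """Total probability of an observation sequence (the Forward algorithm)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state path for an observation sequence."""
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {s: (math.log(start[s] * emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((lp + math.log(trans[p][s] * emit[s][o]), path + [s])
                        for p, (lp, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

# Toy weather tables (illustrative numbers, not from the slides).
states = ["Rainy", "Sunny"]
start = {"Rainy": 0.5, "Sunny": 0.5}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"Ra": 0.9, "Su": 0.1},
        "Sunny": {"Ra": 0.2, "Su": 0.8}}

path = viterbi(["Ra", "Ra", "Su"], states, start, trans, emit)
# -> ["Rainy", "Rainy", "Sunny"]
```

Forward sums over all paths (scoring a sequence); Viterbi keeps only the best path (labeling each observation) -- the two outputs the next slides build on.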
-
Gene Finding
(An Ideal HMM Domain)
Our Objective:
To find the coding and non-coding regions of an
unlabeled string of DNA nucleotides
Our Motivation:
Assist in the annotation of genomic data produced
by genome sequencing methods
Gain insight into the mechanisms involved in transcription, splicing, and other processes
-
Gene Finding Terminology
A string of DNA nucleotides containing a gene
will have separate regions (lines):
Introns: non-coding regions within a gene
Exons: coding regions
Separated by functional sites (boxes)
Start and stop codons
Splice sites: acceptors and donors
-
Gene Finding Challenges
Need the correct reading frame
Introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak
-
What makes a good HMM
problem space?
Characteristics:
Classification problems
There are two main types of output from an HMM:
Scoring of sequences
(Protein family modeling)
Labeling of observations within a sequence
(Gene Finding)
-
HMM Problem Characteristics
Continued
The observations in a sequence should have a clear and meaningful order
Unordered observations will not map easily to states
It's beneficial, but not necessary, for the observations to follow some sort of grammar
Makes it easier to design an architecture
Examples: Gene Finding, Protein Family Modeling
-
HMM Requirements
So you've decided you want to build an HMM;
here's what you need:
An architecture: probably the hardest part
Should be biologically sound & easy to interpret
A well-defined success measure: necessary for any form of machine learning
-
HMM Requirements
Continued
Training data
Labeled or unlabeled: it depends
You do not always need a labeled training set to do
observation labeling, but it helps
Amount of training data needed is:
Directly proportional to the number of free parameters
in the model
Inversely proportional to the size of the training
sequences
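For a fully connected HMM, the free-parameter count behind that rule of thumb is easy to write down. This is a sketch that ignores the initial-state distribution and any tied or zeroed parameters:

```python
def free_parameters(n_states, n_symbols):
    """Free parameters in a fully connected HMM.

    Each of the n_states transition rows has n_states entries with one
    sum-to-one constraint; each emission row likewise has n_symbols
    entries with one constraint. (Initial-state distribution and any
    parameter tying are ignored in this sketch.)
    """
    return n_states * (n_states - 1) + n_states * (n_symbols - 1)

# The two-state, two-symbol weather toy has only 4 free parameters;
# a hypothetical 7-region gene model over the 4 nucleotides has 63.
weather = free_parameters(2, 2)
gene = free_parameters(7, 4)
```

The more such parameters an architecture has, the more training sequences it takes to estimate them reliably, which is the proportionality the slide states.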
-
Why HMMs might be a good fit for
Gene Finding
Classification: Classifying observations within a sequence
Order: A DNA sequence is a set of ordered observations
Grammar / Architecture: Our grammatical structure (and the
beginnings of our architecture) is right here:
Success measure: # of complete exons correctly labeled
Training data: Available from various genome annotation
projects
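The grammar figure itself is not in this transcript, but a typical textbook version of the gene structure it alludes to can be sketched as a table of legal region transitions. The region names here are illustrative, not the exact architecture from the slide:

```python
# A sketch of a typical gene grammar (the slide's figure is not in this
# transcript, so region names and transitions are a textbook layout,
# not the exact architecture shown in the talk).
gene_grammar = {
    "intergenic":    ["intergenic", "start_codon"],
    "start_codon":   ["exon"],
    "exon":          ["exon", "donor_site", "stop_codon"],
    "donor_site":    ["intron"],
    "intron":        ["intron", "acceptor_site"],
    "acceptor_site": ["exon"],
    "stop_codon":    ["intergenic"],
}

def is_legal_parse(labels):
    """Check that a sequence of region labels obeys the grammar."""
    return all(b in gene_grammar[a] for a, b in zip(labels, labels[1:]))
```

Encoding the grammar as allowed transitions is the beginning of an HMM architecture: states correspond to regions, and any transition not listed gets probability zero.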
-
HMM Advantages
Statistical Grounding
Statisticians are comfortable with the theory
behind hidden Markov models
Freedom to manipulate the training and
verification processes
Mathematical / theoretical analysis of the results
and processes
HMMs are still very powerful modeling tools, far more powerful than many statistical methods
-
HMM Advantages continued
Modularity
HMMs can be combined into larger HMMs
Transparency of the Model
Assuming an architecture with a good design
People can read the model and make sense of it
The model itself can help increase understanding
-
HMM Advantages continued
Incorporation of Prior Knowledge
Incorporate prior knowledge into the architecture
Initialize the model close to something believed to
be correct
Use prior knowledge to constrain training process
-
How does Gene Finding make
use of HMM advantages?
Statistics:
Many systems alter the training process to better
suit their success measure
Modularity:
Almost all systems use a combination of models,
each individually trained for each gene region
Prior Knowledge:
A fair amount of prior biological knowledge is built
into each architecture
-
HMM Disadvantages
Markov Chains
States are supposed to be independent
P(y) must be independent of P(x), and vice versa
This usually isn't true
Can get around it when relationships are local
Not good for RNA folding problems
-
HMM Disadvantages
continued
Standard Machine Learning Problems
Watch out for local maxima
Model may not converge to a truly optimal parameter set for a given training set
Avoid over-fitting
You're only as good as your training set
More training is not always good
-
HMM Disadvantages
continued
Speed!!!
Almost everything one does in an HMM involves:
enumerating all possible paths through the
model
There are efficient ways to do this
Still slow in comparison to other methods
-
HMM Gene Finders:
VEIL
A straight HMM Gene Finder
Takes advantage of grammatical structure and
modular design
Uses many states that can only emit one symbol to get around state independence
-
HMM Gene Finders:
HMMGene
Uses an extended HMM called a CHMM
CHMM = HMM with classes
Takes full advantage of being able to modify the statistical algorithms
Uses high-order states
Trains everything at once
-
HMM Gene Finders:
Genie
Uses a generalized HMM (GHMM)
Edges in model are complete HMMs
States can be any arbitrary program
States are actually neural networks specially designed for signal finding
-
Conclusions
HMMs have problems where they excel, and
problems where they do not
You should consider using one if:
Problem can be phrased as classification
Observations are ordered
The observations follow some sort of grammatical
structure (optional)
-
Conclusions
Advantages:
Statistics
Modularity
Transparency
Prior Knowledge
Disadvantages:
State independence
Over-fitting
Local maxima
Speed
-
Some final words
Lots of problems can be phrased as
classification problems
Homology search, sequence alignment
If an HMM does not fit, there are all sorts of other methods to try from ML/AI:
Neural Networks, Decision Trees, Probabilistic Reasoning, and Support Vector Machines have all been applied to Bioinformatics
-
Questions
Any Questions?