606finalpres.ppt

8/10/2019 606FinalPres.ppt

1/32

Hidden Markov Models in Bioinformatics

Example Domain: Gene Finding

Colin Cherry

colinc@cs


2/32

To recap last episode

Hidden Markov Models (HMMs)

Protein Family Characterization

Profile HMMs for protein family characterization

How profile HMMs can do homology search


3/32

...picking up where we left off

Profile HMMs were good to start with

Today's goal: Introduce HMMs as general tools in bioinformatics

I will use the problem of Gene Finding as an example of an ideal HMM problem domain


4/32

Learning Objectives

When I'm done, you should know:

1. When is an HMM a good fit for a problem space?

2. What materials are needed before work can begin with an HMM?

3. What are the advantages and disadvantages of using HMMs?

4. What are the general objectives and challenges in the gene finding task?


5/32

Outline

HMMs as Statistical Models

The Gene Finding task at a glance

Good problems for HMMs

HMM Advantages

HMM Disadvantages

Gene Finding Examples


6/32

Statistical Models

Definition:

Any mathematical construct that attempts to parameterize a random process

Example: A normal distribution

Assumptions

Parameters

Estimation

Usage

HMMs are just a little more complicated


7/32

HMM Assumptions

Observations are ordered

The random process can be represented by a stochastic finite state machine with emitting states.


8/32

HMM Parameters

Using weather example

Modeling daily weather for a year

Ra Ra Su Su Su Ra ...

Lots of parameters

One for each table entry

Represented in two tables:

One for emissions

One for transitions
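
As a rough illustration of the two tables, here is a minimal Python sketch. Only the Ra/Su symbols come from the slide; the hidden states (High and Low pressure) and every probability are assumptions invented for the example.

```python
# Hypothetical two-state weather model; all numbers are illustrative assumptions.
states = ["High", "Low"]   # hidden states (assumed: pressure systems)
symbols = ["Ra", "Su"]     # observed symbols: rain, sun

# Transition table: transitions[s][t] = P(next state is t | current state is s)
transitions = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}

# Emission table: emissions[s][k] = P(observing symbol k | current state is s)
emissions = {
    "High": {"Ra": 0.1, "Su": 0.9},
    "Low": {"Ra": 0.7, "Su": 0.3},
}


def count_parameters(table):
    """Count the entries of a nested table (one parameter per entry)."""
    return sum(len(row) for row in table.values())


def rows_are_distributions(table, tol=1e-9):
    """Check that every row of a table sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())


n_params = count_parameters(transitions) + count_parameters(emissions)
```

Because each row must sum to 1, the number of free parameters is smaller than the number of table entries: one entry per row is determined by the others.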


9/32

HMM Estimation

Called training, it falls under machine learning

Feed an architecture (given in advance) a set of observation sequences

The training process will iteratively alter the model's parameters to fit the training set

The trained model will assign the training sequences high probability
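
The slide does not name a training algorithm. One common choice for unlabeled data is Baum-Welch re-estimation, sketched below for a single short sequence. The two-state architecture, the starting tables, and the training sequence are all assumptions, and the code skips the numerical scaling that longer sequences would need.

```python
def forward_table(obs, states, start, trans, emit):
    """alpha[i][s] = P(obs[0..i], state at position i is s)."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({t: emit[t][o] * sum(prev[s] * trans[s][t] for s in states)
                      for t in states})
    return alpha


def backward_table(obs, states, trans, emit):
    """beta[i][s] = P(obs[i+1..] | state at position i is s)."""
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][t] * emit[t][o] * nxt[t] for t in states)
                        for s in states})
    return beta


def baum_welch_step(obs, states, symbols, start, trans, emit):
    """One re-estimation step. Returns the updated tables and the likelihood
    of obs under the tables that were passed in."""
    alpha = forward_table(obs, states, start, trans, emit)
    beta = backward_table(obs, states, trans, emit)
    likelihood = sum(alpha[-1].values())
    n = len(obs)
    # gamma[i][s] = P(state at position i is s | obs)
    gamma = [{s: alpha[i][s] * beta[i][s] / likelihood for s in states}
             for i in range(n)]
    new_start = dict(gamma[0])
    new_trans, new_emit = {}, {}
    for s in states:
        visits = sum(gamma[i][s] for i in range(n - 1))
        new_trans[s] = {
            t: sum(alpha[i][s] * trans[s][t] * emit[t][obs[i + 1]] * beta[i + 1][t]
                   for i in range(n - 1)) / (likelihood * visits)
            for t in states
        }
        total = sum(gamma[i][s] for i in range(n))
        new_emit[s] = {k: sum(gamma[i][s] for i in range(n) if obs[i] == k) / total
                       for k in symbols}
    return new_start, new_trans, new_emit, likelihood


# Hypothetical architecture, starting tables, and training sequence.
states, symbols = ["High", "Low"], ["Ra", "Su"]
start = {"High": 0.6, "Low": 0.4}
trans = {"High": {"High": 0.7, "Low": 0.3}, "Low": {"High": 0.4, "Low": 0.6}}
emit = {"High": {"Ra": 0.3, "Su": 0.7}, "Low": {"Ra": 0.6, "Su": 0.4}}
obs = "Ra Ra Su Su Su Ra Ra Ra Su Su".split()

likelihoods = []
for _ in range(10):
    start, trans, emit, likelihood = baum_welch_step(
        obs, states, symbols, start, trans, emit)
    likelihoods.append(likelihood)
```

Each pass should leave the likelihood of the training sequence no lower than before, which is the sense in which the trained model assigns the training data high probability.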


10/32

HMM Usage

Two major tasks

Evaluate the probability of an observation sequence given the model (Forward)

Find the most likely path through the model for a given observation sequence (Viterbi)
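
The first task, evaluation, can be sketched with the Forward algorithm. The model below is a hypothetical two-state weather HMM; its states, start probabilities, and table values are assumptions made up for illustration.

```python
# Hypothetical weather model; every probability is an illustrative assumption.
STATES = ["High", "Low"]
SYMBOLS = ["Ra", "Su"]
START = {"High": 0.5, "Low": 0.5}
TRANS = {"High": {"High": 0.8, "Low": 0.2}, "Low": {"High": 0.3, "Low": 0.7}}
EMIT = {"High": {"Ra": 0.1, "Su": 0.9}, "Low": {"Ra": 0.7, "Su": 0.3}}


def forward(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return P(obs | model), summed over all state paths by dynamic programming."""
    # alpha[s] = P(observations so far, current state is s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())


p_week = forward(["Ra", "Ra", "Su", "Su", "Su", "Ra"])
```

A useful sanity check is that the probabilities of all observation sequences of a fixed length add up to 1.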


11/32

Gene Finding

(An Ideal HMM Domain)

Our Objective:

To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

Our Motivation:

Assist in the annotation of genomic data produced by genome sequencing methods

Gain insight into the mechanisms involved in transcription, splicing and other processes


12/32

Gene Finding Terminology

A string of DNA nucleotides containing a gene will have separate regions (lines):

Introns: non-coding regions within a gene

Exons: coding regions

Separated by functional sites (boxes)

Start and stop codons

Splice sites: acceptors and donors


13/32

Gene Finding Challenges

Need the correct reading frame

Introns can interrupt an exon in mid-codon

There is no hard and fast rule for identifying donor and acceptor splice sites

Signals are very weak
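
To make the mid-codon point concrete, here is a small sketch. The two exon strings are invented for the example; in genomic DNA an intron would sit between them.

```python
def codons_with_split_flag(exons):
    """Join the exons, cut the result into codons, and flag every codon
    that straddles an exon boundary (an intron in the genomic DNA)."""
    coding = "".join(exons)
    boundaries = set()
    position = 0
    for exon in exons[:-1]:
        position += len(exon)
        boundaries.add(position)
    usable = len(coding) - len(coding) % 3  # ignore a trailing partial codon
    result = []
    for i in range(0, usable, 3):
        split = any(i < b < i + 3 for b in boundaries)
        result.append((coding[i:i + 3], split))
    return result


# Hypothetical gene: the first exon ends two bases into the second codon.
codons = codons_with_split_flag(["ATGGC", "TGAATAA"])
```

Here the codon GCT is assembled from the end of the first exon and the start of the second, so a gene finder has to carry the reading frame across the intron.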


14/32

What makes a good HMM problem space?

Characteristics:

Classification problems

There are two main types of output from an HMM:

Scoring of sequences

(Protein family modeling)

Labeling of observations within a sequence

(Gene Finding)
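
Labeling corresponds to finding the most likely state path. Below is a toy Viterbi sketch with two states, E (exon) and I (intron). The architecture and all probabilities are assumptions chosen for illustration; real gene finders use far richer models.

```python
import math

STATES = ["E", "I"]  # E = exon (coding), I = intron (non-coding)
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # assumed GC-rich
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # assumed AT-rich
}


def viterbi(seq):
    """Return one state label per nucleotide for the most likely path."""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][t] = best predecessor of state t at position i + 1
    for symbol in seq[1:]:
        pointers, new_score = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: score[s] + math.log(TRANS[s][t]))
            pointers[t] = best
            new_score[t] = (score[best] + math.log(TRANS[best][t])
                            + math.log(EMIT[t][symbol]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


labels = viterbi("GCGCGCGCGC" + "ATATATATAT")
```

With these tables the GC-rich half is labeled E and the AT-rich half I, since one state switch costs less than emitting ten unlikely symbols.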


15/32

HMM Problem Characteristics

Continued

The observations in a sequence should have a clear and meaningful order

Unordered observations will not map easily to states

It's beneficial, but not necessary, for the observations to follow some sort of grammar

Makes it easier to design an architecture

Gene Finding

Protein Family Modeling


16/32

HMM Requirements

So you've decided you want to build an HMM; here's what you need:

An architecture

Probably the hardest part

Should be biologically sound & easy to interpret

A well-defined success measure

Necessary for any form of machine learning


17/32

HMM Requirements

Continued

Training data

Labeled or unlabeled: it depends

You do not always need a labeled training set to do observation labeling, but it helps

Amount of training data needed is:

Roughly proportional to the number of free parameters in the model

Roughly inversely proportional to the length of the training sequences
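
When the training set is labeled, one simple option is to estimate both tables by counting. The slides do not prescribe this method, so treat it as an assumption. The function name, the pseudocount, and the single training example below are hypothetical.

```python
def estimate_from_labeled(examples, states, symbols, pseudocount=1.0):
    """Estimate transition and emission tables from (sequence, labels) pairs.

    Each table entry starts at `pseudocount`, so entries never seen in the
    training data still get a small non-zero probability."""
    trans_counts = {s: {t: pseudocount for t in states} for s in states}
    emit_counts = {s: {k: pseudocount for k in symbols} for s in states}
    for seq, labels in examples:
        for i, (symbol, state) in enumerate(zip(seq, labels)):
            emit_counts[state][symbol] += 1
            if i + 1 < len(labels):
                trans_counts[state][labels[i + 1]] += 1

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {s: {k: v / sum(row.values()) for k, v in row.items()}
                for s, row in table.items()}

    return normalize(trans_counts), normalize(emit_counts)


# Hypothetical labeled example: E = exon, I = intron.
trans_table, emit_table = estimate_from_labeled(
    [("GCGCATAT", "EEEEIIII")], states="EI", symbols="ACGT")
```

Every table entry needs enough observed counts to be estimated well, which is why the amount of data needed grows with the number of free parameters.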


18/32

Why HMMs might be a good fit for Gene Finding

Classification: Classifying observations within a sequence

Order: A DNA sequence is a set of ordered observations

Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here:

Success measure: # of complete exons correctly labeled

Training data: Available from various genome annotation projects


19/32

HMM Advantages

Statistical Grounding

Statisticians are comfortable with the theory behind hidden Markov models

Freedom to manipulate the training and verification processes

Mathematical / theoretical analysis of the results and processes

HMMs are still very powerful modeling tools, far more powerful than many statistical methods


20/32

HMM Advantages continued

Modularity

HMMs can be combined into larger HMMs

Transparency of the Model

Assuming an architecture with a good design

People can read the model and make sense of it

The model itself can help increase understanding


21/32

HMM Advantages continued

Incorporation of Prior Knowledge

Incorporate prior knowledge into the architecture

Initialize the model close to something believed to be correct

Use prior knowledge to constrain the training process


22/32

How does Gene Finding make use of HMM advantages?

Statistics:

Many systems alter the training process to better suit their success measure

Modularity:

Almost all systems use a combination of models, each individually trained for each gene region

Prior Knowledge:

A fair amount of prior biological knowledge is built into each architecture


23/32

HMM Disadvantages

Markov Chains

States are supposed to be independent: given the current state, the next state does not depend on any earlier state

P(y) must be independent of P(x), and vice versa

This usually isn't true

Can get around it when relationships are local

Not good for RNA folding problems

[Figure: diagram labeled P(x) and P(y)]


24/32

HMM Disadvantages

continued

Standard Machine Learning Problems

Watch out for local maxima

Model may not converge to a truly optimal parameter set for a given training set

Avoid over-fitting

You're only as good as your training set

More training is not always good


25/32

HMM Disadvantages

continued

Speed!!!

Almost everything one does in an HMM involves enumerating all possible paths through the model

There are efficient ways to do this (dynamic programming, as in Forward and Viterbi)

Still slow in comparison to other methods
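
A back-of-the-envelope comparison shows why the efficient methods matter. The operation count below is a simplified assumption (one multiply-add per pair of states per step, for a fully connected model), not a benchmark.

```python
def count_paths(n_states, length):
    """Number of distinct state paths for a sequence of the given length."""
    return n_states ** length


def dp_operations(n_states, length):
    """Approximate multiply-adds for a Forward or Viterbi pass: one per state
    to initialize, then one per (previous state, next state) pair per step."""
    return n_states + n_states ** 2 * (length - 1)


small_paths = count_paths(2, 10)       # naive enumeration, tiny model
small_ops = dp_operations(2, 10)       # dynamic programming, same model
large_paths = count_paths(10, 100)     # naive enumeration, larger model
large_ops = dp_operations(10, 100)     # dynamic programming, larger model
```

Dynamic programming grows linearly with sequence length but still quadratically with the number of states, so large models on genome-scale input remain expensive.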


26/32

HMM Gene Finders:

VEIL

A straight HMM Gene Finder

Takes advantage of grammatical structure and modular design

Uses many states that can only emit one symbol to get around state independence


27/32

HMM Gene Finders:

HMMGene

Uses an extended HMM called a CHMM

CHMM = HMM with classes

Takes full advantage of being able to modify the statistical algorithms

Uses high-order states

Trains everything at once


28/32

HMM Gene Finders:

Genie

Uses a generalized HMM (GHMM)

Edges in model are complete HMMs

States can be any arbitrary program

States are actually neural networks specially designed for signal finding


29/32

Conclusions

HMMs have problems where they excel, and problems where they do not

You should consider using one if:

Problem can be phrased as classification

Observations are ordered

The observations follow some sort of grammatical structure (optional)


30/32

Conclusions

Advantages:

Statistics

Modularity

Transparency

Prior Knowledge

Disadvantages:

State independence

Over-fitting

Local Maxima

Speed


31/32

Some final words

Lots of problems can be phrased as classification problems

Homology search, sequence alignment

If an HMM does not fit, there are all sorts of other methods to try with ML/AI:

Neural Networks, Decision Trees, Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics


32/32

Questions

Any Questions?
