http://cs173.stanford.edu [BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 12: Chains & Nets, Conservation & Func


TRANSCRIPT

Page 1: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

MW 11:00-12:15 in Beckman B302

Prof: Gill Bejerano

TAs: Jim Notwell & Harendra Guturu

CS173

Lecture 12: Chains & Nets, Conservation & Function

Page 2: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Announcements
• HW2 due today
• As are project assignments
• The coming Monday (2/25) lecture has been moved to LK101 (building next door; we'll post instructions)

Page 3: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Inferring Genomic Mutations From Alignments of Genomes

Page 4: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Terminology

Orthologs: Genes related via speciation (e.g. C, M, H3)

Paralogs: Genes related through duplication (e.g. H1, H2, H3)

Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3)

[Figure: a species tree and a gene tree descending from a single ancestral gene, with events labeled Speciation, Duplication, and Loss]

Page 5: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

• What?
  • Compare whole genomes
  • Compare two genomes
    • Within (intra) species
    • Between (inter) species
  • Compare genome to itself
• Why?
  • Comparison reveals functional and neutral regions
  • Homologous regions most often have similar functions
  • Modification of functional regions can reveal
    • Disease susceptibility
    • Adaptation
• How?

Page 6: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Every Genome is Different

DNA replication is imperfect, so genomes differ between individuals of the same species, and even between the cells of one individual.

[Figure: the sequence ...ACGTACGACTGACTAGCATCGACTACGA... passed from chicken to egg to chicken, acquiring changes (TT, CAT). In "junk" regions "anything goes"; in functional regions many changes are not tolerated.]

Page 7: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Similarity is often measured using “%id”, or percent identity:

%id = number of matching bases / number of alignment columns

where every alignment column is a match, a mismatch, or an indel, and indel = insertion or deletion (deciding which one requires an outgroup).

The same two sequences, unaligned:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
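
The %id formula can be illustrated with a short Python sketch. It assumes the two alignment rows are given as equal-length strings with "-" marking gaps; the function name `percent_identity` is made up for this example and is not part of any tool mentioned in the lecture.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity: matching columns / total alignment columns * 100."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both rows carry the same base (not a gap).
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


# Toy alignment: 3 matching columns, 1 indel column, 1 mismatch column.
print(percent_identity("AC-GT", "ACTGA"))
```

Note that indel columns count in the denominator, so gaps lower %id just as mismatches do.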

Page 8: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

What to expect from genome comparisons?

[Figure: alignment matrix with human on one axis and lizard on the other]

Objective: find local alignment blocks that are likely homologous (share a common origin)

O(mn): examine the full matrix using dynamic programming (DP)

O(m+n): heuristics based on seeding + extension, trading sensitivity for speed
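
To make "seeding + extension" concrete, here is a toy Python sketch. It is a simplification under stated assumptions: seeds are exact k-mer matches and extension stops at the first mismatch, whereas real aligners such as blastz score mismatches and use more elaborate seeds. The names `find_seeds` and `extend_seed` are hypothetical.

```python
from collections import defaultdict


def find_seeds(target: str, query: str, k: int):
    """Return sorted (target_pos, query_pos) pairs where an exact k-mer is shared."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


def extend_seed(target: str, query: str, t: int, q: int, k: int):
    """Grow a seed left and right while bases match; return (t_start, q_start, length)."""
    start_t, start_q = t, q
    while start_t > 0 and start_q > 0 and target[start_t - 1] == query[start_q - 1]:
        start_t -= 1
        start_q -= 1
    end_t, end_q = t + k, q + k
    while end_t < len(target) and end_q < len(query) and target[end_t] == query[end_q]:
        end_t += 1
        end_q += 1
    return (start_t, start_q, end_t - start_t)


seeds = find_seeds("GATTACA", "TTACAG", k=4)
print(seeds)
print(extend_seed("GATTACA", "TTACAG", *seeds[0], k=4))
```

Only positions that share a seed are ever extended, which is where the speed comes from; a homologous region with no exact k-mer in common is missed, which is the sensitivity cost.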

Page 9: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


“Raw” (B)lastz track (no longer displayed)

Protease Regulatory Subunit 3

Alignment = homologous regions

Page 10: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Chaining co-linear alignment blocks

[Figure: alignment matrix with human on one axis and lizard on the other]

Objective: find local alignment blocks that are likely homologous (share a common origin)

Chaining strings together co-linear blocks in the target genome to which we are comparing. A gap between blocks is drawn as a double line when there is unalignable sequence in the other species, and as a single line when there isn't.

Page 11: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Reference genome perspective, The Use of an Outgroup

[Figure: blocks A B C D E shown in the outgroup, human, and mouse sequences, with a variant block B'. Two browser views follow. In the Human Browser the human sequence is implicit and mouse chains are drawn against it; in the Mouse Browser the mouse sequence is implicit and human chains are drawn against it.]

Page 12: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Gap Types: Single vs Double sided

[Figure: blocks A B C D E shown in the ancestral, human, and mouse sequences, with a variant block B'. Two browser views follow. In the Human Browser the human sequence is implicit and mouse chains are drawn against it; in the Mouse Browser the mouse sequence is implicit and human chains are drawn against it.]

Page 13: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Conservation Track Documentation

Page 14: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Chains join together related local alignments

Protease Regulatory Subunit 3

likely ortholog

likely paralogs

shared domain?

Page 15: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Note: repeats are a nuisance

[Figure: alignment matrix with human on one axis and mouse on the other]

If, for example, human and mouse each have 10,000 copies of the same repeat, we will obtain, and need to output, 10^8 alignments of all these copies to each other.

Note that for the sake of this comparison, interspersed repeats and simple repeats are equal nuisances.

Also note that simple repeats, but not interspersed repeats, violate the assumption that similar sequences are homologous.

Solution:
1. Discover all repetitive sequences in each genome.
2. Mask them when doing genome-to-genome comparison.
3. Chain your alignments.
4. Add back to the alignments only repeat matches that lie within pre-computed chains.
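
The arithmetic and the masking step can be sketched in Python. This is only an illustration: repeat coordinates are assumed to be already known (step 1), masking is done by lowercasing ("soft-masking", a common convention), and the function names are invented for the example.

```python
def pairwise_repeat_alignments(copies_a: int, copies_b: int) -> int:
    """Every copy in one genome can align to every copy in the other."""
    return copies_a * copies_b


def soft_mask(seq: str, repeats):
    """Lowercase the half-open (start, end) repeat intervals; keep the rest uppercase."""
    out = list(seq.upper())
    for start, end in repeats:
        out[start:end] = seq[start:end].lower()
    return "".join(out)


def unmasked_kmers(seq: str, k: int):
    """Yield (position, k-mer) only for k-mers that contain no masked base."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer.isupper():
            yield i, kmer


print(pairwise_repeat_alignments(10_000, 10_000))   # 10,000 x 10,000 copies
masked = soft_mask("ACGTACGT", [(2, 5)])
print(masked, list(unmasked_kmers(masked, 2)))
```

Because the masked bases are kept (only lowercased), they can still be added back later for matches that fall inside pre-computed chains, as in step 4.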

Page 16: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Chains
• a chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
• double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
• not just orthologs, but paralogs too, can result in good chains. but that's useful!
• chains should be symmetrical, e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
• chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
• chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).

[Angie Hinrichs, UCSC wiki]
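
The first three chain properties can be checked mechanically. The Python sketch below assumes a chain is a list of gapless blocks written as (target_start, query_start, length) tuples; this representation and the function names are assumptions made for illustration, not the UCSC chain file format.

```python
def is_valid_chain(blocks) -> bool:
    """True if consecutive blocks never overlap in target or in query coordinates."""
    for (t0, q0, n0), (t1, q1, _) in zip(blocks, blocks[1:]):
        if t1 < t0 + n0 or q1 < q0 + n0:
            return False
    return True


def gap_types(blocks):
    """Classify each gap between consecutive blocks as 'none', 'single' or 'double' sided."""
    kinds = []
    for (t0, q0, n0), (t1, q1, _) in zip(blocks, blocks[1:]):
        dt = t1 - (t0 + n0)   # unaligned bases on the target side
        dq = q1 - (q0 + n0)   # unaligned bases on the query side
        if dt > 0 and dq > 0:
            kinds.append("double")
        elif dt > 0 or dq > 0:
            kinds.append("single")
        else:
            kinds.append("none")
    return kinds


chain = [(0, 0, 10), (10, 15, 5), (20, 30, 5)]
print(is_valid_chain(chain), gap_types(chain))
print(is_valid_chain([(0, 0, 10), (5, 12, 5)]))   # second block overlaps in target
```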

Page 17: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Before and After Chaining

Page 18: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Chaining Algorithm

Input: blocks of gapless alignments from blastz

Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$$

Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Running time is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).

See [Kent et al, 2003] “Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes”
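
The recurrence can be tried out with a naive Python sketch. It is deliberately not the KD-tree algorithm: it compares every pair of blocks, so it runs in O(N^2). The block representation (target_start, query_start, length), the linear gap cost, the default parameter values, and the base case (a block may start a new chain, scoring just match(B_i)) are all assumptions made for illustration.

```python
def chain_blocks(blocks, match=1.0, gap_open=2.0, gap_ext=0.1):
    """Return (best_score, best_chain) using the score(B_i) recurrence, naively."""
    if not blocks:
        return 0.0, []
    blocks = sorted(blocks)                  # predecessors come first in target order
    n = len(blocks)
    score = [0.0] * n
    prev = [None] * n
    for i, (ti, qi, li) in enumerate(blocks):
        score[i] = match * li                # base case: B_i starts its own chain
        for j in range(i):
            tj, qj, lj = blocks[j]
            dt, dq = ti - (tj + lj), qi - (qj + lj)
            if dt < 0 or dq < 0:
                continue                     # B_j does not precede B_i in both genomes
            gap = 0.0 if dt == 0 and dq == 0 else gap_open + gap_ext * (dt + dq)
            candidate = score[j] + match * li - gap
            if candidate > score[i]:
                score[i], prev[i] = candidate, j
    best = max(range(n), key=score.__getitem__)
    chain, k = [], best
    while k is not None:                     # trace back through best predecessors
        chain.append(blocks[k])
        k = prev[k]
    return score[best], chain[::-1]


best_score, best_chain = chain_blocks([(0, 0, 10), (12, 11, 10), (5, 50, 3)])
print(best_score, best_chain)
```

In this toy input the block at (5, 50) is left out of the best chain because it is not co-linear with the other two.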

Page 19: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Netting Alignments

Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions.

Net finds the best mouse match for each human region.
Highest scoring chains are used first.
Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
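
The "highest scoring chains first, lower scoring chains fill the gaps" idea can be sketched as a greedy assignment. This toy Python version works base by base on a tiny target and does not record net levels, so it should be read as an illustration of the principle rather than of the actual netting program; the data layout and names are hypothetical.

```python
from itertools import groupby


def net_chains(target_len, chains):
    """chains: list of (score, name, [(t_start, t_end), ...]) with half-open blocks.

    Returns, for each target base, the name of the chain that claims it (or None).
    """
    owner = [None] * target_len
    for _score, name, blocks in sorted(chains, key=lambda c: c[0], reverse=True):
        for start, end in blocks:
            for pos in range(start, end):
                if owner[pos] is None:       # only fill bases no better chain covers
                    owner[pos] = name
    return owner


def runs(owner):
    """Compress the per-base assignment into (name, start, end) intervals."""
    out, pos = [], 0
    for name, group in groupby(owner):
        n = len(list(group))
        out.append((name, pos, pos + n))
        pos += n
    return out


chains = [(100, "A", [(0, 3), (7, 10)]), (40, "B", [(2, 6)])]
print(runs(net_chains(10, chains)))
```

Each target base ends up with at most one chain, which is the single-coverage property of a net in the target genome.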

Page 20: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Page 21: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Page 22: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Nets attempt to capture the ortholog

(they also hide everything else)

Page 23: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Retroposed Genes and Pseudogenes

Pseudogenes (“dead genes”):
Genomic sequences that resemble (and originated from) genes, but no longer make proteins.

Retrogenes (“retrotranscribed”):
Protein-coding RNA that was reverse transcribed and inserted back into the genome.
The RNA can be grabbed at any stage (partial/full transcript, before/during/after all introns are spliced).

Page 24: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. The domain structure of the resulting predicted protein must be interesting!

[Figure label: gene pred.]

Page 25: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Nets/chains can reveal retrogenes (and when they jumped in!)

Page 26: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Nets
• a net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
• a net is single-coverage for target but not for query.
• because it's single-coverage in the target, it's no longer symmetrical.
• the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
• nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
• GB: for human inspection always prefer looking at the chains!

[Angie Hinrichs, UCSC wiki]

Page 27: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Before and After Netting

Page 28: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Convert / LiftOver

"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.

LiftOver – batch utility
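
Conceptually, lifting a coordinate means finding the aligned block that contains it and applying that block's offset. The Python sketch below shows only this core idea for a single forward-strand chain; the block layout (old_start, new_start, length) and the name `lift_position` are assumptions, and the real LiftOver utility handles strands, whole intervals, and multiple chains.

```python
from bisect import bisect_right


def lift_position(pos, blocks):
    """Map a position in the old assembly to the new one through gapless blocks.

    blocks: list of (old_start, new_start, length), sorted by old_start.
    Returns None if the position falls in a gap between blocks.
    """
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1        # last block starting at or before pos
    if i < 0:
        return None
    old_start, new_start, length = blocks[i]
    offset = pos - old_start
    return new_start + offset if offset < length else None


blocks = [(100, 500, 50), (200, 560, 30)]
print(lift_position(120, blocks))   # inside the first block
print(lift_position(170, blocks))   # in the gap between blocks
```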

Page 29: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


What nets can’t show, but chains will

Page 30: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Same Region…

same in all the other fish

Page 31: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Drawbacks

• Inversions not handled optimally

[Figure: the Chains track shows two forward-strand (>) chr1 segments, a reverse-strand (<) chr1 chain, and a reverse-strand (<) chr5 chain; the Nets track shows the two forward-strand chr1 segments and only the reverse-strand chr5 chain.]

Page 32: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Drawbacks

• High copy number genes can break orthology


Page 33: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Self Chain reveals paralogs

(self net is meaningless)

Page 34: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu


Conservation and Function

Page 35: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Evolution = Mutation + Selection

Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in.

[Figure: the sequence ...ACGTACGACTGACTAGCATCGACTACGA... passed from chicken to egg to chicken, acquiring changes (TT, CAT). In "junk" regions "anything goes"; in functional regions many changes are not tolerated.]

Conservation implies function!
(But what function?)

Page 36: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Vertebrates: what to sequence?

[Figure: vertebrate phylogeny from Human Molecular Genetics, 3rd Edition, with labels "you are here", Opossum, Lizard, and Stickleback, and ranges marked "too far", "sweet spot", and "too close"]

Which species to compare to?

Too close and sequence under purifying selection will be largely indistinguishable from sequence evolving at the neutral rate.

Too far and many functional orthologs will diverge beyond our ability to accurately align them.

Page 37: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Searching Near And Far

Search too near (e.g. human to chimp or orang, above) and you cannot distinguish neutral sequence from sequence under purifying selection.
Search further still (e.g. mouse) and the two distributions pry apart.
But now you’ve lost younger functional sequences born after the split.
I.e., conservation implies function, but lack of conservation does NOT imply lack of function!

Page 38: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Phylogenetic Shadowing

[Figure: vertebrate phylogeny with labels "you are here", Opossum, Lizard, Stickleback, and "too close"]

“Too close” can actually be a boon if you have enough closely related genomes.

Page 39: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

PhastCons Conserved Elements


Page 40: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Distant homologies

When species diverge too much (e.g. chicken and beyond, above), confident alignments can no longer be detected at the DNA level.
E.g.: all SPI1 and SLC39A13 exons are there in chicken & fish.

Page 41: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Distant homologies search strategies

Here it is much better to search a gene model from species A (e.g. human) against the genome of species B (e.g. chicken).

This searches amino acids, in all their possible codons, against a gene with unknown exon-intron structure.

(e.g. TBLASTN, translated BLAT)
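
The idea behind a translated search can be sketched in Python: translate the genomic DNA in all six reading frames and look for the protein in each. This is a toy under strong assumptions (exact substring matching, no introns, standard genetic code), so it is far simpler than what TBLASTN or translated BLAT actually do; all names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translate(dna: str):
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_protein(protein: str, dna: str):
    """Return (frame, amino-acid offset) for every frame containing the protein exactly."""
    hits = []
    for frame, translated in six_frame_translate(dna).items():
        pos = translated.find(protein)
        if pos != -1:
            hits.append((frame, pos))
    return hits


print(six_frame_translate("CCATGGCTTAAGG"))
print(find_protein("MA", "CCATGGCTTAAGG"))
```

Because many codons encode the same amino acid, two DNA sequences can diverge considerably while their translations still match, which is why protein-level searches reach further than DNA-level alignment.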

Page 42: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Distant homologies

Find the most distantly related genes using gene models in both species:
1. Search amino acid sequences against each other (e.g. using BLASTP).
2. Map your hits back to the two respective genomes, anchored on the amino acid alignment (respecting any exon-intron gene body structure change).
3. Examine co-linear homology of flanking genes to try to tell orthologs from paralogs.

Page 43: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

RNA homology searches

1. Define a mathematical construct that describes potential homologs.
2. Go search for them (efficiently!).
3. Examine genomic context.

Page 44: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Enhancer remote homologs

[Figure: "Enhancer =" schematic]

Gene regulatory sequences in general are the most challenging to search for:
• Individual binding sites are very flexible.
• Gaps between binding sites may evolve (semi) neutrally, making DNA alignment seeding particularly frail.
• Binding site gain/loss and shuffling may or may not be allowed; we need a better understanding of the underlying logic.

Page 45: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Exceptionally Old Enhancers Exist


But how many of these really exist?

Page 46: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

Ultraconservation: No known function requires this much conservation

[Figure: sequence conservation panels labeled CDS, ncRNA, TFBS, and "?"]

Page 47: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

“Gene” Finding III: Comparative Genomics


Page 48: MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim  Notwell  & Harendra  Guturu

The challenge: map code to output

[Figure: genome (3*10^9 letters) mapped to person (10^13 cells)]

Ultimately we sequence genomes, and study their function in detail, to understand genome to phenotype relationships:
• Minus side: Genomic contribution to disease
• Plus side: Adaptation and speciation

To be continued…