Fly ModENCODE data integration update Manolis Kellis, MIT Computer Science & Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard

Upload: garey-woods

Post on 17-Dec-2015

TRANSCRIPT

Page 1: Fly ModENCODE data integration update Manolis Kellis, MIT MIT Computer Science & Artificial Intelligence Laboratory Broad Institute of MIT and Harvard

Fly ModENCODE data integration update

Manolis Kellis, MIT

MIT Computer Science & Artificial Intelligence Laboratory

Broad Institute of MIT and Harvard

Page 2

modENCODE integration goals

• Annotate all functional elements
  – Enhancers, promoters, insulators, silencers
  – Protein-coding genes, RNA genes, alternative splice forms

• Understand their dynamics
  – Tissue- and stage-specific activity of each type of element

• Mechanisms
  – Relative roles of histones, chromatin, specific/general TFs
  – Sequence specificity, regulatory motifs and grammars

• Community involvement will be key
  – Seeking both computational and experimental partners
  – Large-scale: Complementary datasets / computation
  – Small-scale: Directed follow-up studies / genes, pathways

• Drosophila 2009 modENCODE workshop discussion

Page 3

Each dataset is supported by all others

• Each type of element requires multiple data types
  – Protein genes
  – RNA genes
  – Promoters
  – Enhancers
  – Transcripts
  – Heterochromatin
  – Initiation sites

[Figure: modENCODE fly data types (Replication, Chromatin, Nucleosomes, Small RNAs, Transcripts, TFs/Chromatin) and groups (Karpen, Henikoff, Celniker, White, Lai, MacAlpine) arranged around "Data Integration efforts"; legend: Already presented, Underway]

Page 4

modENCODE is not alone

• Community data types
  – Boundaries
  – DNase HS sites, low buoyant density (protein binding)
  – Evolutionary properties (correlations with conserved/non-conserved properties)
  – Dam mapping
  – Small RNAs

• Techniques and functional genomics
  – Gene Disruption projects
  – RNAi collection
  – Recombineering
  – Computational analyses

[Figure: modENCODE data types (Replication, Chromatin, Nucleosomes, Small RNAs, Transcripts, TFs/Chromatin) and groups (Karpen, Henikoff, Celniker, White, Lai, MacAlpine), surrounded by community resources: Boundaries, DNase HS, 12 flies (+8 flies), Dam mapping, etc.]

Page 5

Comparative resources for Drosophila genomes

• Identify functional elements by their evolutionary signatures: complement experimental studies

Legend: done, priority 1, priority 2

New species       Dist
D. ficusphila     0.80
D. biarmipes      0.70
D. elegans        0.72
D. kikkawai       0.89
D. eugracilis     0.76
D. takahashii     0.65
D. rhopaloa       0.66
D. bipectinata    0.99

Page 6

Evolutionary signatures for diverse functions

Protein-coding genes

- Codon Substitution Frequencies

- Reading Frame Conservation

RNA structures

- Compensatory changes

- Silent G-U substitutions

microRNAs

- Shape of conservation profile

- Structural features: loops, pairs

- Relationship with 3’UTR motifs

Regulatory motifs

- Mutations preserve consensus

- Increased Branch Length Score

- Genome-wide conservation

Stark et al, Nature 2007; Clark et al, Nature 2007

Page 7

Functional annotation of Novel Transcripts using evo. sigs

[Figure: two histograms of CSF Score (best 30 aa window), x-axis from -20 to 60; y-axes: Fraction and Frequency]

73 Putative protein coding

57 Putative non-coding

CSF = Heuristic metric for codon substitution frequency

Mike Lin, Jane Landolin, Sue Celniker
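The "best 30 aa window" score on this slide can be illustrated with a small sliding-window sketch. It assumes that a per-codon score has already been computed for each aligned codon of a transcript (how those scores are derived is not covered here), and it simply reports the highest total over any run of 30 consecutive codons. The function name and the input format are hypothetical, not taken from the CSF software.

```python
def best_window_score(codon_scores, window=30):
    """Highest sum of per-codon scores over any contiguous window.

    codon_scores: hypothetical list of numbers, one per aligned codon.
    window: window length in codons (30 aa on the slide).
    """
    if not codon_scores:
        raise ValueError("codon_scores must not be empty")
    if len(codon_scores) <= window:
        # Transcript shorter than the window: score the whole thing.
        return sum(codon_scores)
    current = sum(codon_scores[:window])
    best = current
    for i in range(window, len(codon_scores)):
        # Slide the window one codon to the right.
        current += codon_scores[i] - codon_scores[i - window]
        best = max(best, current)
    return best
```

Scoring the best window rather than the full transcript means a short conserved coding stretch is not diluted by flanking untranslated sequence.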

Page 8

Columns: #, Consensus, MCS, Matches to known, Tissue-specific target expression, Promoters, Enhancers

1 CTAATTAAA 65.6 engrailed (en) 25.4 2

2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2

3 WATTRATTK 54.9 araucan (ara) 11.7 2.6

4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5

5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3

6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3

7 TGATTAAT 45.7 apterous (ap) 7.1 1.7

8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2

9 AAACNNGTT 41.2 20.1 4.3

10 RATTKAATT 40 3.9 0.7

11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9

12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7

13 AATTRMATTA 38.2 19.5 1.2

14 TATGCWAAT 37.8 5.8 2

15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4

16 CATNAATCA 36.9 1.8 1.7

17 TTACATAA 36.9 5.4

18 RTAAATCAA 36.3 3.2 2.8

19 AATKNMATTT 36 3.6 0

20 ATGTCAAHT 35.6 2.4 4.6

21 ATAAAYAAA 35.5 57.2 -0.5

22 YYAATCAAA 33.9 5.3 0.6

23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6

24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7

25 TGTMAATA 33.2 8.9 1.6

26 TAAYGAG 33.1 4.7 2.7

27 AAAKTGA 32.9 7.6 0.3

28 AAANNAAA 32.9 449.7 0.8

29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8

30 TTATTTAYR 32.9 Deformed (Dfd) 30.7

Discover motifs associated with binding

Ability to discover full dictionary of regulatory motifs de novo

Stark et al, Nature, 2007
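The consensus sequences in the table use IUPAC degenerate nucleotide codes (for example K = G or T, W = A or T, N = any base). As a minimal sketch of how such a consensus could be matched against a sequence, the code below expands each code into a regular-expression character class and reports match positions. The function names are hypothetical, only the forward strand is scanned, and this is not the motif discovery method of Stark et al.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_instances(consensus, sequence):
    """Return 0-based start positions of all forward-strand matches.

    A lookahead is used so that overlapping matches are all reported.
    """
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Motif 2 from the table, matched against a made-up sequence.
hits = find_instances("TTKCAATTAA", "GGTTGCAATTAACC")
```

A full scan would also search the reverse complement; that step is omitted here to keep the sketch short.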

Page 9

Initial regulatory network for an animal genome

• ChIP-grade quality
  – Similar functional enrichment
  – High sensitivity, high specificity

• Systems-level
  – 81% of Transc. Factors
  – 86% of microRNAs
  – 8k + 2k targets
  – 46k connections

• Lessons learned
  – Pre- and post-transcriptional regulation are correlated (hi/hi, lo/lo)
  – Regulators are heavily targeted, feedback loop

Kheradpour et al, Genome Research, 2007

Sushmita Roy

Page 10

Temporal latencies in regulatory networks

• TF-specific latencies, coherent with TF function

• Latencies associated with network motifs

• Extensions to tissue-specific networks

Rogerio Candeias

Page 11

Incorporating ENCODE functional datasets

Pouya Kheradpour, Jason Ernst,

Chris Bristow, Rachel Sealfon

Page 12

modENCODE and gene regulation

Goal: Understand the DNA elements responsible for gene regulation:
• The regulators: TFs, GFs, miRNAs, their specificities
• The regions: enhancers, promoters, insulators
• The targets: individual regulatory motif instances
• The grammars: combinations predictive of tissue-specific activity

Building blocks of gene regulation

Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
• Chromatin signatures for integrating histone modification datasets
• TFs, GFs, motifs, instances associated with tissue-specific activity
• Infer regulatory networks, their temporal and spatial dynamics

Integrate diverse datasets

Page 13

Sequence motifs predictive of insulators

• Understand specificity of each factor

• How predictive are these motifs of binding

• Motif combinations and grammars

GAF, check

CTCF, check

Su(Hw), check

BEAF-32, variant

Mod(mdg4), novel

CP190, novel

Motifs specific to each insulator

Pouya Kheradpour

Page 14

Motif instances correlate with ChIP peaks

• CTCF motif instances correlate strongly with narrow peak calls from multiple peak callers, even at 40bp window

• Correlation extends down the rank list (to all 50,000 peaks)

• Implications for peak calling and for motif discovery

[Figure: Fraction overlapping CTCF motif instances (y-axis) versus Narrow Peak Interval Rank, x10^4 (x-axis); SPP, 40bp window]

[Figure: Recovery of CTCF inst. at 90% confid.; Performance, higher is better (y-axis) versus Peak size (x-axis)]

Pouya Kheradpour, Ben Brown
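The rank-versus-overlap curve described on this slide can be sketched as follows. The code assumes peaks are given as (chromosome, summit) pairs already sorted from strongest to weakest, that motif instances are given as sorted positions per chromosome, and that the "40bp window" means 20 bp on either side of the summit. All names, the input layout, and the bin size are illustrative assumptions, not the pipeline used for the slide.

```python
from bisect import bisect_left

def has_motif_nearby(positions, summit, half_window):
    """True if any sorted position lies within summit +/- half_window."""
    i = bisect_left(positions, summit - half_window)
    return i < len(positions) and positions[i] <= summit + half_window

def overlap_fraction_by_rank(ranked_peaks, motifs, window=40, bin_size=1000):
    """Fraction of peaks with a nearby motif instance, per rank bin.

    ranked_peaks: list of (chrom, summit), best peak first (assumed format).
    motifs: dict mapping chrom to a sorted list of motif positions.
    """
    half = window // 2
    fractions = []
    for start in range(0, len(ranked_peaks), bin_size):
        chunk = ranked_peaks[start:start + bin_size]
        hits = sum(
            has_motif_nearby(motifs.get(chrom, []), summit, half)
            for chrom, summit in chunk
        )
        fractions.append(hits / len(chunk))
    return fractions

# Toy data: four ranked peaks and three motif instances.
toy_peaks = [("2L", 100), ("2L", 500), ("2R", 100), ("X", 900)]
toy_motifs = {"2L": [110, 700], "2R": [130]}
toy_curve = overlap_fraction_by_rank(toy_peaks, toy_motifs, window=40, bin_size=2)
```

Plotting the returned fractions against bin index gives a curve of the kind shown in the figure; a slow decline with rank suggests that low-ranked peaks still carry motif signal.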

Page 15

Motifs and tissue-specific chromatin marks

Fold enrichment or overexpression

• The NF-κB motif is enriched in H3K4me2 regions found uniquely in GM12878 cells

• It is likewise enriched in the uniquely bound regions for other active marks

• Conversely, it is enriched in the uniquely unbound regions for the repressive mark H3K27me3

• We find that NF-κB is also overexpressed in GM12878, suggesting a causative explanation

NF-κB motif

Active marks

Repressive mark

Pouya Kheradpour
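Fold enrichment of a motif in a set of regions is commonly computed as the motif's frequency in those regions divided by its frequency in a background set. The sketch below shows that ratio under the assumption that simple counts of regions with and without a motif instance are available. The function name and arguments are hypothetical, and the statistic actually used for the figure may differ (for example by correcting for region length or composition).

```python
def fold_enrichment(fg_hits, fg_total, bg_hits, bg_total):
    """Ratio of motif frequency in foreground vs background regions.

    fg_hits / fg_total: regions with a motif instance / all regions in the
        set of interest (e.g. regions marked uniquely in one cell type).
    bg_hits / bg_total: the same counts for a background region set.
    """
    if fg_total <= 0 or bg_total <= 0:
        raise ValueError("region totals must be positive")
    if bg_hits <= 0:
        raise ValueError("background has no motif hits; ratio undefined")
    # Cross-multiplied form of (fg_hits / fg_total) / (bg_hits / bg_total).
    return (fg_hits * bg_total) / (fg_total * bg_hits)

# Made-up counts: 30 of 100 foreground regions vs 10 of 100 background.
example = fold_enrichment(30, 100, 10, 100)
```

Values above 1 indicate enrichment in the foreground set and values below 1 indicate depletion.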

Page 16

Motifs and stage-specific chromatin marks

Fold enrichment or overexpression

• abd-A motif is enriched in new H3K27me3 regions at L2
  – Coincides with a drop in the expression of abd-A
  – Model: sites gain H3K27me3 as abd-A binding is lost

• Additional intriguing stories found, to be explored

H3K27me3

Page 17

What about combinations of chromatin marks?

Jason Ernst

Page 18

A hidden Markov model for chromatin state

[Figure: a DNA track spanning an Enhancer, a Transcription Start Site, and a Transcribed Region; rows show the Observed Histone Modifications and the Most likely Hidden State at each position (states 1 to 6); a side panel lists the Highly Likely Modifications in State for states 1 to 6, with probabilities between 0.7 and 0.9]

Even when a modification is not observed at a position, the correct state can still be inferred, because a position is likely to be in the same state as its neighbors.
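The point that a missing observation can still be assigned the right state follows from how a hidden Markov model is decoded. The sketch below is a generic Viterbi decoder for binary mark vectors with independent per-mark emission probabilities. The two-state toy model, its probabilities, and all names are invented for illustration and are not the parameters or the software behind the slide.

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely hidden-state path for a sequence of binary mark vectors.

    obs: list of tuples of 0/1, one entry per mark at each position.
    init[s]: probability of starting in state s.
    trans[s][t]: probability of moving from state s to state t.
    emit[s][m]: probability that mark m is present in state s.
    """
    n_states = len(init)

    def log_emit(s, o):
        return sum(
            math.log(emit[s][m] if present else 1.0 - emit[s][m])
            for m, present in enumerate(o)
        )

    score = [math.log(init[s]) + log_emit(s, obs[0]) for s in range(n_states)]
    back = []
    for o in obs[1:]:
        new_score, ptr = [], []
        for t in range(n_states):
            # Best predecessor state for landing in state t.
            best = max(range(n_states),
                       key=lambda s: score[s] + math.log(trans[s][t]))
            ptr.append(best)
            new_score.append(score[best] + math.log(trans[best][t]) + log_emit(t, o))
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy model (assumed values): state 0 usually carries the mark, state 1 rarely.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist along the genome
emit = [[0.8], [0.1]]              # a single mark for simplicity

# The mark is missing at the third position only.
gap_path = viterbi([(1,), (1,), (0,), (1,), (1,)], init, trans, emit)
# The mark disappears for a long stretch.
switch_path = viterbi([(1,), (1,)] + [(0,)] * 6, init, trans, emit)
```

In this toy model the single unobserved position is still assigned to state 0 because its neighbors are, while a sustained absence of the mark makes the path switch to state 1.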

Page 19

20 distinct chromatin states, combinations of marks

• Combinations of chromatin marks
  – More informative than individual marks (A&B ≠ A&C)
  – Small number of states (20 instead of all 2^21 ≈ 2 million)
  – Allow study of co-occurrence patterns, independence…

Page 20

Each chromatin state associated w/ distinct function

• Reveals active/repressed promoters & enhancers

• Distinct enrichments for 5’UTR/3’UTR/transcripts

• Distinct chromatin properties of exons / introns

Tentative annotations

Page 21

Transcriptional unit enrichment

Page 22

Transcription start site (TSS) enrichment

Page 23

Transcription termination site (TTS) enrichment

Page 24

Transcriptional unit enrichment

Page 25

Chromatin signatures as context for TF analysis

• TF role in establishing chromatin states

• Chromatin role in modulating TF function

Page 26

Specific enrichment for DV and AP factors

Page 27

Functions of 20 distinct chromatin states in fly

DV enhancers AP enhancers General TFs Insulators Replication Motifs

Chromatin marks

Page 28

The grand challenge ahead

[Figure axes: Anterior-Posterior, Dorsal-Ventral]

Annotations & images for all expression patterns

Expression domain primitives reveal underlying logic

Binding sites of every developmental regulator

Sequence motifs for every regulator
  – GAF, check
  – Su(Hw), check
  – BEAF-32, variant
  – Mod(mdg4), novel
  – CP190, novel
  – CTCF, check

Understand regulatory logic specifying development

Page 29

Summary of our lab’s experience in (mod)ENCODE

• Protein-coding genes (Mike Lin)
  – Hubbard: Predict new genes, evaluate novel genes
  – Celniker: Distinguish coding/non-coding transcripts

• Chromatin domains (Jason Ernst)
  – Karpen: Chromatin states in Drosophila
  – Bernstein: Chromatin states in Human

• Motif and grammar discovery (Pouya Kheradpour)
  – White: Motifs associated with insulator proteins
  – Bernstein: Tissue-specific chromatin states
  – White: Expression and Binding Time-course

• Tissue-specific gene expression (Chris Bristow)
  – Celniker: Embryo expression domains
  – All: Predictive models of gene expression

Page 30

Acknowledgements

Alex Stark, Pouya Kheradpour, Mike Lin, Jason Ernst, Chris Bristow

TFs/Insul.: Kevin White, Bing Ren, Nicolas Negre, Par Shah, Jim Posakony
12+8-flies: Andy Clark, Mike Eisen, Bill Gelbart, Doug Smith, Peter Cherbas
Chromatin: Gary Karpen, Aki Minoda, Nicole Riddle, Peter Park + Kharchenko
Prot.Genes: BDGP: Sue Celniker, Jane Landolin; FlyBase: Bill Gelbart

Funding: ENCODE, modENCODE, NHGRI, NSF, Sloan Foundation