Fly ModENCODE data integration update
Manolis Kellis, MIT
MIT Computer Science & Artificial Intelligence Laboratory
Broad Institute of MIT and Harvard
modENCODE integration goals
• Annotate all functional elements
– Enhancers, promoters, insulators, silencers
– Protein-coding genes, RNA genes, alternative splice forms
• Understand their dynamics
– Tissue- and stage-specific activity of each type of element
• Mechanisms
– Relative roles of histones, chromatin, specific/general TFs
– Sequence specificity, regulatory motifs and grammars
• Community involvement will be key
– Seeking both computational and experimental partners
– Large-scale: Complementary datasets / computation
– Small-scale: Directed follow-up studies / genes, pathways
• Drosophila 2009 modENCODE workshop discussion
Each dataset is supported by all others
• Each type of element requires multiple data types
– Protein genes
– RNA genes
– Promoters
– Enhancers
– Transcripts
– Heterochromatin
– Initiation sites
[Figure: data types (Replication, Chromatin, Nucleosomes, Small RNAs, Transcripts, TFs/Chromatin) and groups (Karpen, Henikoff, Celniker, White, Lai, MacAlpine); legend: already presented, underway]
Data Integration efforts
modENCODE is not alone
• Community data types
– Boundaries
– DNase HS sites, low buoyant density (protein binding)
– Evolutionary properties (correlations with conserved/non-conserved properties)
– Dam mapping
– Small RNAs
• Techniques and functional genomics
– Gene Disruption projects
– RNAi collection
– Recombineering
– Computational analyses
[Figure: modENCODE data types and groups as above, joined by community resources: Boundaries, DNase HS, 12 flies (+8 flies), Dam mapping, etc.]
Comparative resources for Drosophila genomes
• Identify functional elements by their evolutionary signatures: complement experimental studies
Legend: done, priority 1, priority 2

New species       Dist
D. ficusphila     0.80
D. biarmipes      0.70
D. elegans        0.72
D. kikkawai       0.89
D. eugracilis     0.76
D. takahashii     0.65
D. rhopaloa       0.66
D. bipectinata    0.99
Evolutionary signatures for diverse functions
Protein-coding genes
- Codon Substitution Frequencies
- Reading Frame Conservation
RNA structures
- Compensatory changes
- Silent G-U substitutions
microRNAs
- Shape of conservation profile
- Structural features: loops, pairs
- Relationship with 3’UTR motifs
Regulatory motifs
- Mutations preserve consensus
- Increased Branch Length Score
- Genome-wide conservation
Stark et al, Nature 2007; Clark et al, Nature 2007
Functional annotation of novel transcripts using evolutionary signatures
[Figure: two histograms of CSF score (best 30 aa window), x-axis from -20 to 60; y-axes: Fraction and Frequency]
73 Putative protein coding
57 Putative non-coding
CSF = Heuristic metric for codon substitution frequency
Mike Lin, Jane Landolin, Sue Celniker
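The "best 30 aa window" statistic can be illustrated with a sliding-window maximum. This is only a sketch under stated assumptions: the per-codon scores are taken as given (how CSF derives them from codon substitution frequencies is not reproduced here), the window score is assumed to be a plain sum, and the threshold of 0 and the function names are hypothetical.

```python
from typing import Sequence


def best_window_score(codon_scores: Sequence[float], window: int = 30) -> float:
    """Return the highest sum over `window` consecutive per-codon scores.

    Assumption: the window score is a simple sum. Transcripts shorter than
    `window` are scored over their full length.
    """
    if not codon_scores:
        raise ValueError("codon_scores must not be empty")
    w = min(window, len(codon_scores))
    current = sum(codon_scores[:w])
    best = current
    # Slide the window one codon at a time, updating the running sum.
    for i in range(w, len(codon_scores)):
        current += codon_scores[i] - codon_scores[i - w]
        best = max(best, current)
    return best


def classify(codon_scores: Sequence[float], threshold: float = 0.0,
             window: int = 30) -> str:
    """Label a transcript by its best-window score (threshold is illustrative)."""
    if best_window_score(codon_scores, window) > threshold:
        return "putative protein coding"
    return "putative non-coding"
```

Taking the best window rather than the whole-transcript total lets a short coding region stand out inside a long, mostly non-coding transcript.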
Rank | Consensus | MCS | Matches to known | Tissue-specific target expression | Promoters | Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
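The consensus sequences in this table use IUPAC degenerate nucleotide codes (for example K = G or T, W = A or T, N = any base). A minimal sketch of how instances of such a consensus could be located on both strands of a sequence follows; the function names are hypothetical, and exact consensus matching is an assumption made for illustration, not a description of how the instances in the talk were called.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(consensus: str) -> str:
    """Reverse-complement a consensus, including ambiguity codes."""
    return consensus.upper().translate(_COMPLEMENT)[::-1]


def consensus_to_regex(consensus: str) -> "re.Pattern[str]":
    """Compile a consensus into a lookahead regex so overlapping hits are found."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_instances(consensus: str, sequence: str):
    """Return sorted (start, strand) pairs, 0-based, for matches on either strand."""
    seq = sequence.upper()
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        for match in consensus_to_regex(motif).finditer(seq):
            hits.add((match.start(), strand))
    return sorted(hits)
```

For example, `find_instances("TTKCAATTAA", "GGTTGCAATTAAGG")` reports one forward-strand instance starting at position 2.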
Discover motifs associated with binding
Ability to discover full dictionary of regulatory motifs de novo
Stark et al, Nature, 2007
• ChIP-grade quality
– Similar functional enrichment
– High sensitivity, high specificity
• Systems-level
– 81% of transcription factors
– 86% of microRNAs
– 8k + 2k targets
– 46k connections
• Lessons learned
– Pre- and post-transcriptional regulation are correlated (hi/hi, lo/lo)
– Regulators are heavily targeted, feedback loop
Kheradpour et al, Genome Research, 2007
Sushmita Roy
Initial regulatory network for an animal genome
Temporal latencies in regulatory networks
• TF-specific latencies, coherent with TF function
• Latencies associated with network motifs
• Extensions to tissue-specific networks
Rogerio Candeias
Incorporating ENCODE functional datasets
Pouya Kheradpour, Jason Ernst,
Chris Bristow, Rachel Sealfon
modENCODE and gene regulation
Goal: Understand the DNA elements responsible for gene regulation:
• The regulators: TFs, GFs, miRNAs, their specificities
• The regions: enhancers, promoters, insulators
• The targets: individual regulatory motif instances
• The grammars: combinations predictive of tissue-specific activity
Building blocks of gene regulation
Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
• Chromatin signatures for integrating histone modification datasets
• TFs, GFs, motifs, instances associated with tissue-specific activity
• Infer regulatory networks, their temporal and spatial dynamics
Integrate diverse datasets
Sequence motifs predictive of insulators
• Understand specificity of each factor
• How predictive are these of binding
• Motif combinations and grammars
GAF, check
CTCF, check
Su(Hw), check
BEAF-32, variant
Mod(mdg4), novel
CP190, novel
Motifs specific to each insulator
Pouya Kheradpour
Motif instances correlate with ChIP peaks
• CTCF motif instances correlate strongly with narrow peak calls from multiple peak callers, even at 40bp window
• Correlation extends down the rank list (to all 50,000 peaks)
• Implications for peak calling and for motif discovery
[Figure: fraction overlapping CTCF motif instances vs. narrow peak interval rank (x10^4); SPP, 40bp window]
Pouya Kheradpour, Ben Brown
[Figure: performance (higher is better) vs. peak size; recovery of CTCF instances at 90% confidence]
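The kind of rank-versus-overlap curve described on this slide can be sketched as follows. This is an illustration under assumptions, not the analysis used in the talk: peaks are represented by summit coordinates on a single chromosome, already sorted from strongest to weakest, the 40bp window is centred on the summit, and a motif instance counts only if it lies fully inside the window. All function and parameter names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def overlaps_motif(summit: int, motif_starts_sorted: Sequence[int],
                   motif_len: int, window: int = 40) -> bool:
    """True if a motif instance lies fully inside the window centred on `summit`."""
    lo = summit - window // 2
    hi = summit + window // 2  # exclusive end of the window
    # First instance starting at or after the window start; if it does not
    # fit, no later instance can fit either.
    i = bisect_left(motif_starts_sorted, lo)
    return i < len(motif_starts_sorted) and motif_starts_sorted[i] + motif_len <= hi


def overlap_by_rank(ranked_summits: Sequence[int], motif_starts: Sequence[int],
                    motif_len: int, bin_size: int, window: int = 40) -> List[float]:
    """Fraction of peaks with a motif instance, per consecutive rank bin."""
    starts = sorted(motif_starts)
    fractions = []
    for b in range(0, len(ranked_summits), bin_size):
        chunk = ranked_summits[b:b + bin_size]
        hits = sum(overlaps_motif(s, starts, motif_len, window) for s in chunk)
        fractions.append(hits / len(chunk))
    return fractions
```

Plotting the returned fractions against bin index gives a curve that stays high only as long as lower-ranked peaks still carry motif instances.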
Motifs and tissue-specific chromatin marks
• The NF-κB motif is enriched in H3K4me2 regions found uniquely in GM12878 cells
• It is likewise enriched in the uniquely bound regions for other active marks
• Conversely, it is enriched in the uniquely unbound regions for the repressive mark H3K27me3
• NF-κB is also overexpressed in GM12878, suggesting a causative explanation
[Figure: fold enrichment or overexpression for the NF-κB motif; active marks and repressive mark]
Pouya Kheradpour
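A fold-enrichment value of the kind plotted here can be sketched as a ratio of motif densities. The definition below is an assumption for illustration (instances per base pair in the regions of interest, divided by the same density in a background set); the talk does not state the exact formula or background used, and the function names and pseudocount are hypothetical.

```python
import math


def fold_enrichment(fg_instances: int, fg_bp: int,
                    bg_instances: int, bg_bp: int,
                    pseudocount: float = 0.0) -> float:
    """Motif density in foreground regions divided by density in background.

    Values above 1 mean the motif is enriched in the foreground regions.
    A positive pseudocount avoids division by zero for rare motifs.
    """
    if fg_bp <= 0 or bg_bp <= 0:
        raise ValueError("region sizes must be positive")
    numerator = (fg_instances + pseudocount) * bg_bp
    denominator = (bg_instances + pseudocount) * fg_bp
    if denominator == 0:
        raise ValueError("no background instances; use a pseudocount")
    return numerator / denominator


def log2_fold_enrichment(fg_instances: int, fg_bp: int,
                         bg_instances: int, bg_bp: int,
                         pseudocount: float = 1.0) -> float:
    """Symmetric version: positive means enriched, negative means depleted."""
    return math.log2(
        fold_enrichment(fg_instances, fg_bp, bg_instances, bg_bp, pseudocount)
    )
```

For instance, 30 instances in 10 kb of cell-type-specific regions against 100 instances in 100 kb of background gives a fold enrichment of 3.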
Motifs and stage-specific chromatin marks
• abd-A motif is enriched in new H3K27me3 regions at L2
– Coincides with a drop in the expression of abd-A
– Model: sites gain H3K27me3 as abd-A binding lost
• Additional intriguing stories found, to be explored
[Figure: fold enrichment or overexpression; H3K27me3]
What about combinations of chromatin marks?
Jason Ernst
A hidden Markov model for chromatin state
[Figure: DNA track with Enhancer, Transcription Start Site, and Transcribed Region; rows: Observed Histone Modifications; Most likely Hidden State (states 1 to 6)]
Even where a modification was not observed, the correct state can still be inferred from neighboring locations, because a position is likely to share the state of its neighbors.
[Figure: Highly Likely Modifications in State, for states 1 to 6, with probabilities from 0.7 to 0.9]
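The idea that a missing modification can be compensated by neighboring positions can be sketched with Viterbi decoding on a toy hidden Markov model. Everything in this sketch is an assumption made for illustration: the two states `S1` and `S2`, the two binary marks, the independent per-mark emission probabilities, and the transition probabilities are invented, and the model from the talk is not reproduced.

```python
import math


def log_emission(obs, probs):
    """Log-probability of a binary mark vector, marks assumed independent."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))


def viterbi(observations, states, start, trans, emit):
    """Most likely hidden-state path for a sequence of binary mark vectors."""
    score = {s: math.log(start[s]) + log_emission(observations[0], emit[s])
             for s in states}
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(trans[best][s])
                        + log_emission(obs, emit[s]))
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model (all numbers hypothetical): S1 usually carries mark 1,
# S2 usually carries mark 2, and states tend to persist along the genome.
STATES = ["S1", "S2"]
START = {"S1": 0.5, "S2": 0.5}
TRANS = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.1, "S2": 0.9}}
EMIT = {"S1": (0.9, 0.1), "S2": (0.1, 0.8)}

# The third position has no marks observed at all.
OBSERVED = [(1, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 1)]
PATH = viterbi(OBSERVED, STATES, START, TRANS, EMIT)
```

In this toy model the unmarked third position is, taken alone, better explained by `S2`, yet the decoded path assigns it `S1` because both neighbors are `S1` and state changes are penalized.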
20 distinct chromatin states, combinations of marks
• Combinations of chromatin marks
– More informative than individual marks (A&B ≠ A&C)
– Small number of states (20 instead of all 2^21 ≈ 2 million)
– Allow study of co-occurrence patterns, independence…
Each chromatin state associated w/ distinct function
• Reveals active/repressed promoters & enhancers
• Distinct enrichments for 5’UTR/3’UTR/transcripts
• Distinct chromatin properties of exons / introns
[Figure panels (tentative annotations): transcriptional unit enrichment; transcription start site (TSS) enrichment; transcription termination site (TTS) enrichment]
Chromatin signatures as context for TF analysis
• TF role in establishing chromatin states
• Chromatin role in modulating TF function
Specific enrichment for DV and AP factors
Functions of 20 distinct chromatin states in fly
[Figure: 20 chromatin states against DV enhancers, AP enhancers, General TFs, Insulators, Replication, Motifs; Chromatin marks]
The grand challenge ahead
[Figure: Anterior-Posterior and Dorsal-Ventral axes]
Annotations & images for all expression patterns
Expression domain primitives reveal underlying logic
Binding sites of every developmental regulator
Sequence motifs for every regulator
[Figure: insulator motifs: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check]
Understand regulatory logic specifying development
Summary of our lab’s experience in (mod)ENCODE
• Protein-coding genes (Mike Lin)
– Hubbard: Predict new genes, evaluate novel genes
– Celniker: Distinguish coding/non-coding transcripts
• Chromatin domains (Jason Ernst)
– Karpen: Chromatin states in Drosophila
– Bernstein: Chromatin states in Human
• Motif and grammar discovery (Pouya Kheradpour)
– White: Motifs associated with insulator proteins
– Bernstein: Tissue-specific chromatin states
– White: Expression and Binding Time-course
• Tissue-specific gene expression (Chris Bristow)
– Celniker: Embryo expression domains
– All: Predictive models of gene expression
Acknowledgements
Alex Stark, Pouya Kheradpour, Mike Lin, Jason Ernst, Chris Bristow
TFs/Insul.: Kevin White, Bing Ren, Nicolas Negre, Par Shah, Jim Posakony
12+8-flies: Andy Clark, Mike Eisen, Bill Gelbart, Doug Smith, Peter Cherbas
Chromatin: Gary Karpen, Aki Minoda, Nicole Riddle, Peter Park + Kharchenko
Prot. Genes: BDGP: Sue Celniker, Jane Landolin; FlyBase: Bill Gelbart
Funding: ENCODE, modENCODE, NHGRI, NSF, Sloan Foundation