7/29/2019 blanchettegr.ppt
1/50
Genome-wide computational prediction of transcriptional
regulatory modules reveals new insights into human gene
expression
Mathieu Blanchette et al.
Presented By:
Manish Agrawal
-
Outline
Introduction
Cis-regulatory module (CRM) prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted modules
Location of CRMs relative to genes
Conclusions
-
Gene Regulation
Chromosomal activation/deactivation
Transcriptional regulation
Splicing regulation
mRNA degradation
mRNA transport regulation
Control of translation initiation
Post-translational modification
Source: Lecture Notes by Prof. Saurabh Sinha, UIUC
-
GENE
ACAGTGA
TRANSCRIPTION
FACTOR
PROTEIN
Transcriptional regulation
Source: Lecture Notes by Prof. Saurabh Sinha, UIUC
-
Transcription factors (TFs)
They generally have affinity for short, degenerate DNA sequences (5-15 bp).
Experiments have enabled identification of consensus binding motifs for hundreds of TFs.
The binding motifs are generally represented by position weight matrices (PWMs).
-
Binding site sequence alignment
Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
-
Alignment matrix for a binding site
Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
-
Position weight matrix (PWM)
Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
-
Position weight matrix (PWM)
To transform the elements of the alignment matrix into the weight matrix, the following formula is used:
$$\mathrm{weight}_{i,j} = \ln\frac{(n_{i,j} + p_i)/(N + 1)}{p_i} \approx \ln\frac{f_{i,j}}{p_i}$$
$N$: total number of sequences (15 in this example).
$n_{i,j}$: number of times nucleotide $i$ was observed in position $j$ of the alignment.
$f_{i,j} = n_{i,j}/N$: frequency of letter $i$ at position $j$.
$p_i$: a priori probability of letter $i$.
In this example $p_T = p_A = 0.3$ and $p_C = p_G = 0.2$ (the overall frequencies of the letters within the Drosophila melanogaster genome).
Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
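As a rough illustration of the transformation above, the following Python sketch builds a weight matrix from a set of aligned sites. The function names and the four-site toy alignment are hypothetical (they are not the 15 sites of the slide example); only the formula and the background probabilities come from the slide.

```python
import math

# Background probabilities quoted on the slide (Drosophila melanogaster).
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def alignment_counts(sites):
    """Count how often each nucleotide occurs at each alignment position."""
    length = len(sites[0])
    counts = [{base: 0 for base in BACKGROUND} for _ in range(length)]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1
    return counts

def weight_matrix(sites, background=BACKGROUND):
    """Return one dict per position j with weight = ln(((n_ij + p_i) / (N + 1)) / p_i)."""
    n_sites = len(sites)
    return [
        {
            base: math.log(((col[base] + background[base]) / (n_sites + 1)) / background[base])
            for base in background
        }
        for col in alignment_counts(sites)
    ]

# Hypothetical toy alignment, for illustration only.
toy_sites = ["ACAGTGA", "ACAGTGA", "ACTGTGA", "TCAGTGA"]
pwm = weight_matrix(toy_sites)
```

Note that the sketch indexes the matrix as `pwm[j][base]` (position first), whereas the slide writes the letter index first. Adding $p_i$ to the count keeps the logarithm finite for letters that never occur at a position.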
-
Position weight matrix (PWM)
The weight matrix can be used to evaluate the resemblance of any L-bp DNA sequence to the training set of binding sites.
The score for a sequence is calculated as the sum of the values that each base of the sequence has in the weight matrix.
Any sequence with a score higher than the predefined cut-off is a potential new binding site.
Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
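The scoring rule described on this slide can be sketched in a few lines of Python. The 3-bp weight matrix, the test sequence, and the cut-off below are all made-up values chosen so the arithmetic is easy to follow; the function names are hypothetical.

```python
def score_site(pwm, site):
    """Sum the weight of each base of the site at its position in the matrix."""
    return sum(column[base] for column, base in zip(pwm, site))

def find_hits(pwm, sequence, cutoff):
    """Slide a window of the motif width along the sequence; keep windows scoring above the cut-off."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_site(pwm, sequence[start:start + width])
        if score > cutoff:
            hits.append((start, score))
    return hits

# Hypothetical 3-bp weight matrix that favours the word "ACG".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = find_hits(toy_pwm, "TTACGTACG", cutoff=2.0)
print(hits)  # [(2, 3.0), (6, 3.0)]
```

Only the two windows that spell "ACG" exceed the cut-off; every other window scores -3.0.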
-
Complications in identification of TF-binding sites (TFBSs)
The binding of a TF also depends on other factors, such as the chromatin environment and cooperation or competition with other DNA-binding proteins.
In higher eukaryotes, TFs rarely operate by themselves; instead, a combination of TFs acts together to achieve the desired gene expression. The DNA footprint of this set of factors is called a cis-regulatory module (CRM).
-
Cis-regulatory module
-
Features of CRMs
CRMs generally consist of several binding sites for a TF.
CRMs, and in particular the binding sites they contain, are generally more evolutionarily conserved than their flanking intergenic regions.
Genes regulated by a common set of TFs tend to be co-expressed.
-
Outline
Introduction
CRM prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted modules
Location of CRMs relative to genes
Conclusions
-
Predicting CRMs
Different combinations of these CRM features have been used, often with PWM information, to predict regulatory elements for specific TFs.
However, very few existing methods are designed to be applied on a genome-wide scale without prior knowledge about sets of interacting TFs or sets of co-regulated genes.
Previous work generally took 5-10 PWMs and looked for clusters of their binding sites in the genome. Such studies have been reported for embryo development in Drosophila.
-
Goals and challenges
The goal of this study is to perform a genome-wide analysis and identify CRMs in the human genome without any prior knowledge about interactions between TFs.
The new algorithm uses only the features of CRMs (mentioned earlier) for its prediction.
Although CRMs predicted in this way may contain a significant number of false positives, the whole-genome approach provides sufficient statistical power to formulate specific biological hypotheses.
-
Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668
CRM prediction algorithm (Overview)
-
CRM prediction algorithm
1. A set of 481 vertebrate PWMs from Transfac 7.2 was used for the analysis. The PWMs were grouped into 229 families.
2. A genome-wide multiple alignment of the human, mouse and rat genomes was built with the MULTIZ program. Only the regions within the MULTIZ alignment were considered in the later parts of the study. These regions cover 34% of the human genome.
3. For each of the 481 PWMs, individual binding sites were first predicted. The human, mouse and rat genomes were scanned separately on both strands, and a log-likelihood score was computed in the standard way.
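Step 3 can be illustrated with a minimal sketch that scans both strands of a sequence. The slide does not spell out the "standard way" of scoring, so the sketch assumes the usual log-likelihood ratio of motif frequency to background frequency; the 2-bp frequency matrix, the uniform background, and all function names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def log_likelihood_ratio(freqs, background, site):
    """Assumed scoring: sum over positions of ln(motif frequency / background frequency)."""
    return sum(math.log(col[base] / background[base]) for col, base in zip(freqs, site))

def scan_both_strands(freqs, background, sequence):
    """For each window, report the better-scoring strand as (start, strand, score)."""
    width = len(freqs)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        fwd = log_likelihood_ratio(freqs, background, window)
        rev = log_likelihood_ratio(freqs, background, reverse_complement(window))
        results.append((start, "+", fwd) if fwd >= rev else (start, "-", rev))
    return results

# Hypothetical 2-bp motif that prefers "AC", with a uniform background.
toy_freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
results = scan_both_strands(toy_freqs, uniform, "ACGT")
```

In the toy sequence, the window "GT" scores best on the minus strand because its reverse complement is "AC".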
-
CRM prediction algorithm
4. For each species and each PWM, a hit score was computed. A weighted combination of the human, mouse and rat scores was then used to define a hit score for each alignment column $p$ and PWM $m$:
$$\mathrm{hitScore}_{aln}(m,p) = \mathrm{hitScore}_{Hum}(m,p) + \tfrac{1}{2}\max\bigl(0,\ \mathrm{hitScore}_{Mou}(m,p) + \mathrm{hitScore}_{Rat}(m,p)\bigr)$$
5. The human hit score is given a higher weight to allow prediction of human-specific binding sites, provided that they are very good matches to the PWM considered.
6. Only positions with $\mathrm{hitScore}_{aln}(m,p) > 10$ are retained to construct modules. This threshold is somewhat arbitrary, but it results in the total number of bases predicted in pCRMs being ~2.88% of the genome.
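The combined hit score and the threshold can be written out directly. This is a minimal sketch of the formula on the slide; the function names and the example scores are invented for illustration.

```python
THRESHOLD = 10.0  # cut-off quoted on the slide

def hit_score_aln(human, mouse, rat):
    """hitScore_aln = human score + 1/2 * max(0, mouse score + rat score)."""
    return human + 0.5 * max(0.0, mouse + rat)

def is_retained(human, mouse, rat, threshold=THRESHOLD):
    """A position is kept for module construction only if its combined score exceeds the threshold."""
    return hit_score_aln(human, mouse, rat) > threshold

# Hypothetical per-species scores for three alignment columns.
conserved = hit_score_aln(8.0, 3.0, 3.0)         # 8 + 0.5 * 6 = 11.0
human_specific = hit_score_aln(11.0, -5.0, -4.0)  # rodent term clipped to 0, giving 11.0
weak = hit_score_aln(6.0, 2.0, 1.0)               # 6 + 0.5 * 3 = 7.5
```

The `max(0, ...)` term means that poor rodent scores never penalize a strong human match, which is how a human-specific site can still pass the threshold.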
-
CRM prediction algorithm:
Computation of module score
We need $\mathrm{moduleScore}(p_1 \ldots p_2)$ for the alignment region going from position $p_1$ to $p_2$ of the human sequence.
Define $\mathrm{TotalScore}(m, p_1 \ldots p_2)$ to be the sum of the $\mathrm{hitScore}_{aln}$ values of all non-overlapping hits for $m$ in the region $p_1 \ldots p_2$.
The optimization problem of choosing the best set of non-overlapping hits is solved heuristically using a greedy algorithm, which iteratively selects the hit with the maximal score that does not overlap any previously chosen hit.
For each matrix and each region, a P-value is assigned.
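The greedy selection of non-overlapping hits might look like the sketch below. Hits are represented as hypothetical `(start, end, score)` tuples with an exclusive end; this representation and the toy values are assumptions made for illustration.

```python
def greedy_nonoverlapping(hits):
    """Repeatedly take the highest-scoring hit that overlaps none of the hits already chosen."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        start, end, _ = hit
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append(hit)
    return sorted(chosen)

def total_score(hits):
    """TotalScore for one matrix in one region: sum of the scores of the selected hits."""
    return sum(score for _, _, score in greedy_nonoverlapping(hits))

# Hypothetical hits for a single PWM within one region.
toy_hits = [(0, 10, 12.0), (5, 15, 15.0), (14, 24, 11.0), (30, 40, 13.0)]
selected = greedy_nonoverlapping(toy_hits)
print(selected)               # [(5, 15, 15.0), (30, 40, 13.0)]
print(total_score(toy_hits))  # 28.0
```

The toy input also shows why the procedure is a heuristic: the greedy choice yields 28.0, whereas the three mutually non-overlapping hits at 0, 14 and 30 would sum to 36.0.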
-
CRM prediction algorithm:
Computation of module score
The score for a module is computed based on one to five PWMs, called tags.
The first tag for region $p_1 \ldots p_2$ is the matrix with the most significant TotalScore, i.e.,
$$\mathrm{tag}_1 = \arg\min_{m \in \mathrm{PWMs}} \mathrm{pValue}(\mathrm{TotalScore}(m, p_1 \ldots p_2)).$$
The regions belonging to $\mathrm{tag}_1$ are then masked out and the TotalScores for each matrix are recomputed, excluding hits overlapping those of $\mathrm{tag}_1$.
The second tag is then the matrix with the most significant recomputed TotalScore. The process is repeated until five tags are selected, if possible.
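The iterative tag selection can be sketched as follows. The slides do not say how the P-value of a TotalScore is obtained, so the sketch takes it as a caller-supplied function; the `exp(-total)` stand-in used in the example is purely a placeholder, and the hit tuples and PWM names are hypothetical.

```python
import math

def overlaps(a, b):
    """True if two (start, end, score) hits share at least one base (end exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def greedy_hits(hits):
    """Greedy choice of non-overlapping hits, highest score first."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if not any(overlaps(hit, c) for c in chosen):
            chosen.append(hit)
    return chosen

def select_tags(hits_by_pwm, p_value, max_tags=5):
    """Pick up to max_tags PWMs, each time masking the hits of the tag just chosen."""
    masked, tags = [], []
    remaining = dict(hits_by_pwm)
    while remaining and len(tags) < max_tags:
        best = None
        for name, hits in remaining.items():
            # Recompute TotalScore using only hits that avoid the masked regions.
            free = [h for h in hits if not any(overlaps(h, m) for m in masked)]
            chosen = greedy_hits(free)
            if not chosen:
                continue
            p = p_value(name, sum(h[2] for h in chosen))
            if best is None or p < best[0]:
                best = (p, name, chosen)
        if best is None:
            break  # no matrix has any unmasked hit left
        p, name, chosen = best
        tags.append((name, p))
        masked.extend(chosen)
        del remaining[name]
    return tags

# Hypothetical hits; the P-value function is a placeholder, not the one used in the paper.
toy = {
    "ER": [(0, 10, 15.0), (40, 50, 12.0)],
    "E2F": [(5, 15, 20.0)],
    "STAT3": [(60, 70, 11.0)],
}
tags = select_tags(toy, p_value=lambda name, total: math.exp(-total))
print([name for name, _ in tags])  # ['ER', 'STAT3']
```

In the toy data, the single E2F hit overlaps a hit of the first tag and is therefore masked, so fewer than five tags are returned. Choosing how many of the selected tags to keep in the final module score is a separate step not covered here.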
-
CRM prediction algorithm:
Computation of module score
Finally, we define totalModuleScore as a function of the P-values of the individual tags.
So, a module can consist of one to five tags, depending on which number of tags yields the highest statistical significance.
The above algorithm was used to search for modules of maximal length 100, 200, 500, 1000 and 2000 bp.
-
Results
The algorithm identified about 118,000 putative CRMs covering 2.88% of the genome.
This constitutes one of the first genome-wide, non-promoter-centric sets of human cis-regulatory modules.
The biological relevance of the pCRMs was evaluated by measuring the extent to which they overlap known regulatory elements in databases such as TRRD, Transfac and GALA.
-
Outline
Introduction
CRM prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted modules
Location of CRMs relative to genes
Conclusions
-
Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668
In silico validation of predicted modules
-
Comparison to other genome-wide predictions
The ability of the algorithm to take advantage of interspecies TFBS conservation contributes in good part to its accuracy.
The 34% of the human genome that lies within an alignment block with the mouse and rat genomes contains 90% of the bases within Transfac sites, 67% of those within TRRD modules, and 87% of those within GALA regulatory regions.
-
Outline
Introduction
CRM prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted
modules
Location of CRMs relative to genes
Conclusions
-
Experimental validation of predicted modules
The predictions were verified experimentally by ChIP-chip analysis.
This method allows large-scale identification of protein-DNA interactions as they occur in vivo.
-
ChIP-chip analysis
Buck et al. Genome Biol. 2005; 6(11): R97
-
Experimental validation of predicted modules
They selected modules predicted to be bound by the estrogen receptor (ER), the E2F transcription factor (E2F4), STAT3 and HIF1 to print a DNA microarray.
The microarray contains 758, 1370, 860 and 1882 modules predicted to be bound by ER, E2F4, STAT3 and HIF1, respectively.
In the current study, the microarray was then probed by ChIP-chip for ER and E2F4.
Approximately 3% of the 758 ER-predicted pCRMs on the microarray actually proved to be bound by ER, while 17% of the 1370 E2F4-predicted pCRMs on the microarray were bound by E2F4.
-
Experimental validation of predicted modules
These numbers need to be considered an underestimate of the actual specificity of the algorithm, since the protein-DNA interactions were tested in a single cell type, while TFs are known to regulate different sets of genes in different cell types, physiological conditions, and times in development.
In addition, the experiment was conducted under a single set of conditions (concentration of estradiol, time of treatment, etc.). For all of these reasons, it is difficult to determine the real accuracy of the algorithm.
-
Experimental validation of predicted modules
As the microarray contained predicted modules for four different TFs, the data can be used to assess the specificity of the TFBS predictions.
Among the 55 modules bound by ER, 44% were indeed selected for their ER-binding sites, and among the 433 modules bound by E2F4, 54% were selected for that factor.
In addition to false-positive ChIP-chip signals or the failure of the algorithm to detect some binding sites, it is likely that binding of TFs through alternative mechanisms such as protein-protein interactions contributes to this result.
The present algorithm can only predict the binding of a TF through direct DNA-binding interactions.
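The percentages quoted on the last few slides can be cross-checked against each other, since "predicted for a factor and bound by it" is counted from both directions. The short calculation below uses only numbers from the slides; because the percentages are rounded, the two estimates should agree only approximately.

```python
# Modules predicted for the factor that were found bound (slide: 3% of 758, 17% of 1370).
er_from_predicted = 0.03 * 758      # about 23 modules
e2f4_from_predicted = 0.17 * 1370   # about 233 modules

# Bound modules that had been selected for that factor (slide: 44% of 55, 54% of 433).
er_from_bound = 0.44 * 55           # about 24 modules
e2f4_from_bound = 0.54 * 433        # about 234 modules

print(round(er_from_predicted), round(er_from_bound))      # 23 24
print(round(e2f4_from_predicted), round(e2f4_from_bound))  # 233 234
```

Both routes give roughly 23-24 ER modules and 233-234 E2F4 modules, so the two sets of percentages are mutually consistent.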
-
Outline
Introduction
CRM prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted modules
Location of CRMs relative to genes
Conclusions
-
Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668
Distribution of pCRMs along a region of chromosome 11
-
Global view of the gene regulatory landscape
The module density varies widely across the genome, with an average of four modules per 100 kb and a maximum of 44 modules per 100-kb window, covering from 0% to 55% of such a region.
As illustrated in the previous figure, some regions are rich in modules but relatively poor in genes. In some cases, this could reflect the presence of many unknown protein-coding genes, or at least of many alternative TSSs. Another possible explanation is that some of these modules may be regulating the transcription of noncoding transcripts.
Finally, this observation may be due to the presence of long-range enhancers, which may affect transcription of genes up to several hundreds of kilobases away.
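As a back-of-the-envelope check, the average density quoted here can be combined with the totals reported in the Results slide (about 118,000 modules covering 2.88% of the genome). All inputs are rounded figures from the slides, so the outputs are rough estimates only.

```python
n_modules = 118_000            # putative CRMs reported in the Results slide
covered_fraction = 0.0288      # fraction of the genome covered by those modules
modules_per_100kb = 4          # average density quoted on this slide

# Genome length implied by the module count and the average density.
implied_genome_bp = n_modules / modules_per_100kb * 100_000

# Mean module length implied by the coverage.
mean_module_bp = covered_fraction * implied_genome_bp / n_modules

print(f"{implied_genome_bp:.3g} bp")  # 2.95e+09 bp
print(f"{mean_module_bp:.0f} bp")     # 720 bp
```

The implied genome length of about 2.95 Gb and mean module length of about 720 bp are consistent with each other and with the 100-2000 bp maximal module lengths used by the algorithm.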
-
Regulatory modules are preferentially located in specific regions relative to genes
The position of pCRMs with respect to their closest gene was studied.
The genome was divided into several types of noncoding regions, i.e., upstream of a gene, 5' UTR, first intron, internal introns, last intron, 3' UTR, and downstream region.
Within each type of region, they computed the fraction of bases included in a pCRM as a function of the distance to a reference point for each type of region.
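One way to picture this computation is as a binned coverage profile. The sketch below is an assumption about how such a profile could be computed, not the authors' code: each region is described by its length and by module intervals expressed as distances from the reference point, and the bin size and toy regions are invented.

```python
def coverage_profile(regions, bin_size, n_bins):
    """Fraction of bases inside a module, per distance bin, pooled over all regions.

    regions: list of (length, modules), where modules is a list of non-overlapping
    (start, end) intervals in distance coordinates, end exclusive.
    """
    covered = [0] * n_bins
    total = [0] * n_bins
    for length, modules in regions:
        for k in range(n_bins):
            lo, hi = k * bin_size, min((k + 1) * bin_size, length)
            if hi <= lo:
                break  # this region is shorter than the start of bin k
            total[k] += hi - lo
            for start, end in modules:
                covered[k] += max(0, min(end, hi) - max(start, lo))
    return [c / t if t else None for c, t in zip(covered, total)]

# Two hypothetical regions of the same type (for example, first introns).
toy_regions = [
    (300, [(0, 50), (250, 300)]),
    (150, [(20, 70)]),
]
profile = coverage_profile(toy_regions, bin_size=100, n_bins=3)
print(profile)  # [0.5, 0.0, 0.5]
```

Regions shorter than a given distance simply stop contributing to the more distant bins, so each bin is normalized by the number of bases actually present at that distance.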
-
Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668
Distribution of pCRMs relative to specific regions of the genes
-
Observations
1. Regions immediately surrounding TSSs are highly enriched for predicted modules. This was expected, as this region often contains the promoter of the gene. Surprisingly, there are also modules immediately downstream of TSSs. These may represent alternative promoters for initiation downstream from the annotated transcripts.
2. Regions surrounding the sites of termination are also enriched for modules. 3' UTRs are essentially as enriched for pCRMs as 5' UTRs. Two reasons may explain this. First, these may represent enhancer-type regulatory elements that activate the upstream gene via a DNA-looping mechanism. Second, these may represent promoter elements driving noncoding transcripts that are antisense relative to the coding gene. Such antisense transcripts may regulate gene expression by a post-transcriptional mechanism.
-
Observations
Another surprising observation is that the density of modules is lowest in regions located 10-50 kb upstream of the TSS and, symmetrically, 10-30 kb downstream of the end of transcription. This is unexpected, as one would expect that these regions (at least those upstream of the TSS) would be prime real estate for transcriptional regulation.
However, this is confirmed by the density of interspecies conserved elements, which is also at its lowest in those regions. Being close to the TSS, regulatory elements in these regions may be allowed to contain fewer binding sites (or binding sites with less affinity), making them difficult to detect using the current method.
-
Observations
Alternatively, these regions (10-50 kb upstream) may actually be depleted of regulatory elements. This could be due to constraints imposed by the chromatin structure or the nuclear architecture, making it more difficult for the DNA of these regions to come into physical proximity with the TSS.
Another notable observation is that the density of predicted modules in intronic regions is very low in the close vicinity of exons (except the first and the last one), but increases with the distance to the closest exon.
-
Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668
TFs target different regions relative to their target genes.
RED => Highly enriched for TFBSs, BLUE => Depleted in TFBSs
-
TFs target different regions relative to their target genes.
The previous figure shows that more than 70 of the 229 TF families considered exhibit a significant enrichment for one or more types of genomic regions.
A number of TFs show a preference for distal positions, mostly those located more than 100 kb upstream of the TSS, and are also enriched within introns. This set of TFs is enriched for factors containing homeodomains or basic helix-loop-helix domains, which are often involved in regulating development.
-
TFs target different regions relative to their target genes.
A second set of TFs preferentially binds within 1 kb of the TSSs. This set is enriched for leucine zipper TFs and factors from the Ets family. Notably, most of these factors, contrary to what is observed for those binding distal sites, are involved in basic cellular functions.
-
Outline
Introduction
CRM prediction algorithm
In silico validation of predicted modules
Experimental validation of predicted modules
Location of CRMs relative to genes
Conclusions
-
Conclusions
Blanchette et al. have identified a set of rules describing the architecture of DNA regulatory elements and used them to build an algorithm allowing them to explore the regulatory potential of the human genome.
Although the false-positive rate in CRM prediction is likely to be high, the statistical power obtained through a large-scale, genome-wide approach revealed new insights about transcriptional regulation.
It was noted that a significant number of TFs have a strong bias for regulating genes either from a great distance or from promoter-proximal binding sites.
-
Conclusions
Noteworthy is the fact that most TFs that preferentially work from a large distance are involved in development, while those predicted to work from promoter-proximal sites tend to regulate genes involved in basic cellular processes.
It is expected that the database containing the modules presented in this study may speed up the discovery and experimental validation of CRMs.
-
THANK YOU