7/29/2019 blanchettegr.ppt

1/50

Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression

Mathieu Blanchette et al.

Presented By:

Manish Agrawal


2/50

Outline

Introduction

Cis-regulatory module (CRM) prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions


3/50

Gene Regulation

Chromosomal activation/deactivation

Transcriptional regulation

Splicing regulation

mRNA degradation

mRNA transport regulation

Control of translation initiation

Post-translational modification

Source: Lecture Notes by Prof. Saurabh Sinha, UIUC


4/50

[Figure: a transcription factor (protein) binding the DNA motif ACAGTGA near a gene]




5/50

[Figure: a transcription factor (protein) binding the DNA motif ACAGTGA near a gene]




6/50

Transcription Factors (TFs)

They generally have affinity for short, degenerate DNA sequences (5-15 bp).

Experiments have enabled identification of consensus binding motifs for hundreds of TFs.

The binding motifs are generally represented by position weight matrices (PWMs).


7/50

Binding site sequence alignment

Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html


8/50

Alignment matrix for a binding site



9/50

Position weight matrix (PWM)



10/50


Position weight matrix (PWM)

To transform elements of the alignment matrix into the weight matrix, the following formula is used:

$$\mathrm{weight}_{i,j} = \ln\frac{(n_{i,j} + p_i)/(N + 1)}{p_i} \approx \ln\frac{f_{i,j}}{p_i}$$

$N$ - total number of sequences (15 in this example)

$n_{i,j}$ - number of times nucleotide $i$ was observed in position $j$ of the alignment

$f_{i,j} = n_{i,j}/N$ - frequency of letter $i$ at position $j$

$p_i$ - a priori probability of letter $i$

In this example $p_A = p_T = 0.3$ and $p_C = p_G = 0.2$ (the overall frequencies of the letters within the Drosophila melanogaster genome).
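The weight formula above can be sketched in a few lines of Python. This is only an illustration: the function name `weight_matrix` and the toy counts are assumptions made up for this example, not data from the slides; only N = 15 and the background probabilities come from the text.

```python
import math

def weight_matrix(counts, background):
    """Convert alignment counts n[i][j] into log-odds weights.

    counts: dict mapping base -> list of counts, one per alignment position.
    background: dict mapping base -> a priori probability p_i.
    """
    # N is the number of aligned sites; every alignment column sums to N.
    n_sites = sum(column[0] for column in counts.values())
    return {
        base: [
            math.log(((n + background[base]) / (n_sites + 1)) / background[base])
            for n in column
        ]
        for base, column in counts.items()
    }

# Hypothetical counts for a 3-bp motif built from 15 aligned sites.
toy_counts = {
    "A": [15, 0, 5],
    "C": [0, 0, 5],
    "G": [0, 15, 2],
    "T": [0, 0, 3],
}
toy_background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
toy_weights = weight_matrix(toy_counts, toy_background)
```

Adding $p_i$ to the count acts as a pseudocount, so a base that was never observed at a position gets the finite weight ln(1/(N+1)) instead of minus infinity.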



11/50


Position weight matrix (PWM)

The weight matrix can be used to evaluate the resemblance of any L-bp DNA sequence to the training set of binding sites.

The score for a sequence is calculated as the sum of the values that each base of the sequence has in the weight matrix.

Any sequence with a score higher than the predefined cut-off is a potential new binding site.
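A minimal sketch of this scoring rule, assuming a small hand-made weight matrix. The weights, the example sequence, the cut-off and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def score(weights, site):
    """Sum the weight of each base of `site` at its position in the matrix."""
    return sum(weights[base][j] for j, base in enumerate(site))

def scan(weights, sequence, cutoff):
    """Slide an L-bp window along `sequence`; keep windows scoring above `cutoff`."""
    length = len(next(iter(weights.values())))
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        window_score = score(weights, window)
        if window_score > cutoff:
            hits.append((start, window, window_score))
    return hits

# Hypothetical weight matrix for a 3-bp motif (L = 3).
example_weights = {
    "A": [1.0, -2.0, 0.5],
    "C": [-2.0, -2.0, 0.5],
    "G": [-2.0, 1.5, -1.0],
    "T": [-2.0, -2.0, -0.5],
}
example_hits = scan(example_weights, "TTAGACC", cutoff=2.0)
```

With these toy numbers only the window "AGA" at offset 2 exceeds the cut-off, with a score of 1.0 + 1.5 + 0.5 = 3.0.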



12/50

Complications in identification of TF-binding sites (TFBSs)

The binding of a TF also depends on other factors, such as the chromatin environment and cooperation or competition with other DNA-binding proteins.

In higher eukaryotes, TFs rarely operate by themselves; instead, a combination of TFs acts together to achieve the desired gene expression. The DNA footprint of this set of factors is called a cis-regulatory module (CRM).


13/50

Cis-regulatory module


14/50

Features of CRMs

CRMs generally consist of several binding sites for a TF.

CRMs, and in particular the binding sites they contain, are generally more evolutionarily conserved than their flanking intergenic regions.

Genes regulated by a common set of TFs tend to be co-expressed.


15/50

Outline

Introduction

CRM prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions


16/50

Predicting CRMs

Different combinations of these CRM features have been used, often with PWM information, to predict regulatory elements for specific TFs.

However, very few existing methods are designed to be applied on a genome-wide scale without prior knowledge about sets of interacting TFs or sets of co-regulated genes.

Previous work generally took 5-10 PWMs and looked for clusters of matches to these PWMs in the genome. Such studies have been reported for embryo development in Drosophila.


17/50

Goals and challenges

The goal of this study is to perform a genome-wide analysis and identify CRMs in the human genome without any prior knowledge about interactions between TFs.

The new algorithm uses only the features of CRMs (mentioned earlier) for its prediction.

Although CRMs predicted in this way may contain a significant number of false positives, the whole-genome approach provides sufficient statistical power to formulate specific biological hypotheses.


18/50

Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

CRM prediction algorithm (Overview)


19/50


1. A set of 481 vertebrate PWMs from Transfac 7.2 was used for the analysis. The PWMs were grouped into 229 families.

2. A genome-wide multiple alignment of the human, mouse and rat genomes was built with the MULTIZ program. Only the regions within the MULTIZ alignment were considered in the later part of the study. These regions cover 34% of the human genome.

3. For each of the 481 PWMs, individual binding sites were first predicted. The human, mouse and rat genomes were scanned separately on both strands, and a log-likelihood score was computed in the standard way.
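Step 3 can be illustrated with a short sketch of a two-strand scan. The slides do not spell out the "standard" log-likelihood score, so this example assumes it is the log ratio of the PWM probability to a background probability; the function names and the toy 2-bp matrix are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def log_likelihood(pwm, background, site):
    """Assumed score: sum over positions of ln(P_pwm(base) / P_background(base))."""
    return sum(math.log(pwm[j][base] / background[base]) for j, base in enumerate(site))

def scan_both_strands(pwm, background, seq):
    """For each window, report the better-scoring strand as (start, strand, score)."""
    width = len(pwm)
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        forward = log_likelihood(pwm, background, window)
        reverse = log_likelihood(pwm, background, reverse_complement(window))
        results.append((start, "+", forward) if forward >= reverse else (start, "-", reverse))
    return results

# Hypothetical 2-bp PWM that prefers "AC"; uniform background.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
strand_hits = scan_both_strands(toy_pwm, uniform, "ACGT")
```

In the toy sequence "ACGT" the motif matches the forward strand at offset 0 and the reverse strand at offset 2, because "GT" is the reverse complement of "AC".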


20/50


4. For each species and each PWM, a hit score was computed. A weighted average of the human, mouse and rat scores was then used to define a hit score for each alignment column $p$ and PWM $m$:

5. $$\mathrm{hitScore}_{aln}(m,p) = \mathrm{hitScore}_{Hum}(m,p) + \tfrac{1}{2}\max\bigl(0,\ \mathrm{hitScore}_{Mou}(m,p) + \mathrm{hitScore}_{Rat}(m,p)\bigr)$$

6. The human hit score is given a higher weight to allow prediction of human-specific binding sites, provided that they are very good matches to the PWM considered.

7. Only positions with $\mathrm{hitScore}_{aln}(m,p) > 10$ are retained to construct modules. This threshold is somewhat arbitrary, but it results in the total number of bases predicted in pCRMs being ~2.88% of the genome.
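The combined hit score and the threshold from steps 4-7 can be written directly as a small function. The function names and the three example score triples are made up for illustration; only the formula and the threshold of 10 come from the slide.

```python
HIT_SCORE_THRESHOLD = 10  # threshold stated on the slide

def hit_score_aln(human, mouse, rat):
    """Human score plus half of the (non-negative) sum of mouse and rat scores."""
    return human + 0.5 * max(0.0, mouse + rat)

def is_retained(human, mouse, rat):
    """A position is kept for module construction only if its score exceeds 10."""
    return hit_score_aln(human, mouse, rat) > HIT_SCORE_THRESHOLD

# Hypothetical per-species scores for one PWM at three alignment columns.
conserved_site = hit_score_aln(8.0, 4.0, 2.0)         # 8 + 0.5 * 6 = 11.0
human_specific_site = hit_score_aln(12.0, -5.0, -3.0)  # rodents contribute 0
weak_site = hit_score_aln(8.0, -5.0, 1.0)              # stays at 8.0
```

The `max(0, ...)` term means poor rodent matches never lower the score, which is why a strong human-only site (12.0 here) can still pass the threshold.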


21/50

CRM prediction algorithm: Computation of module score

We need $\mathrm{moduleScore}(p_1 \ldots p_2)$ for the alignment region going from position $p_1$ to $p_2$ of human.

Define $\mathrm{TotalScore}(m, p_1 \ldots p_2)$ to be the sum of the $\mathrm{hitScore}_{aln}$ values of all non-overlapping hits for $m$ in the region $p_1 \ldots p_2$.

The optimization problem of choosing the best set of non-overlapping hits is solved heuristically using a greedy algorithm. This greedy algorithm iteratively selects the hit with the maximal score that does not overlap with the hits previously chosen.

For each matrix and each region, a P-value is assigned.
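The greedy selection described above might look as follows. This is a sketch under assumptions: hits are represented as hypothetical `(start, end, score)` tuples with an exclusive end, and the example hits are invented.

```python
def greedy_total_score(hits):
    """Greedy TotalScore for one matrix in one region.

    hits: list of (start, end, score) with `end` exclusive.
    Repeatedly takes the highest-scoring hit that does not overlap
    any hit already chosen.
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        start, end, _ = hit
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append(hit)
    return sum(h[2] for h in chosen), sorted(chosen)

# Hypothetical hits for one PWM in one region.
example_region_hits = [(0, 10, 12.0), (5, 15, 15.0), (12, 22, 11.0), (30, 40, 13.0)]
total, selected = greedy_total_score(example_region_hits)
```

In this toy case the greedy choice keeps the hits at 5-15 and 30-40 for a total of 28.0, while the non-overlapping set 0-10, 12-22, 30-40 would sum to 36.0. That gap is the price of a heuristic: it is fast but not guaranteed to be optimal.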


22/50


CRM prediction algorithm: Computation of module score

The score for a module is computed based on one to five PWMs, called tags.

The first tag for region $p_1 \ldots p_2$ is the matrix with the most significant TotalScore, i.e.,

$$\mathrm{tag}_1 = \arg\min_{m \in \mathrm{PWMs}} \mathrm{pValue}\bigl(\mathrm{TotalScore}(m, p_1 \ldots p_2)\bigr)$$

The regions belonging to $\mathrm{tag}_1$ are then masked out and the TotalScores for each matrix are recomputed, excluding hits overlapping those of $\mathrm{tag}_1$.

The second tag is then the matrix with the most significant TotalScore. The process is repeated until five tags are selected, if possible.
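The iterative tag selection can be sketched as below. The slides do not say how the P-value is computed, so `p_value` is a caller-supplied placeholder, and the toy version used here (`exp(-total)`) is purely a stand-in. Dropping a matrix from the candidates once it has been chosen as a tag is also an assumption of this sketch. The matrix names and hits are hypothetical.

```python
import math

def overlaps(a, b):
    """True if half-open intervals a and b (start, end, ...) share a base."""
    return a[0] < b[1] and b[0] < a[1]

def greedy_hits(hits, masked):
    """Best-first set of hits overlapping neither each other nor masked regions."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if not any(overlaps(hit, other) for other in masked + chosen):
            chosen.append(hit)
    return chosen

def select_tags(hits_by_matrix, p_value, max_tags=5):
    """Pick up to `max_tags` matrices, masking each tag's hits before the next round."""
    masked, tags = [], []
    remaining = set(hits_by_matrix)
    while remaining and len(tags) < max_tags:
        best = None
        for matrix in sorted(remaining):
            chosen = greedy_hits(hits_by_matrix[matrix], masked)
            if not chosen:
                continue
            p = p_value(matrix, sum(h[2] for h in chosen))
            if best is None or p < best[0]:
                best = (p, matrix, chosen)
        if best is None:
            break  # no matrix has an unmasked hit left
        p, matrix, chosen = best
        tags.append((matrix, p))
        masked.extend(chosen)
        remaining.discard(matrix)
    return tags

# Hypothetical hits (start, end, score) and a stand-in P-value function.
toy_hits = {
    "ER": [(0, 10, 14.0), (20, 30, 12.0)],
    "E2F": [(5, 15, 20.0)],
    "STAT": [(40, 50, 11.0)],
}
toy_tags = select_tags(toy_hits, lambda matrix, total: math.exp(-total))
```

Here "ER" becomes the first tag (total 26.0), its hits mask the overlapping "E2F" hit, and "STAT" becomes the second tag; no third tag is possible.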


23/50


CRM prediction algorithm: Computation of module score

Finally, totalModuleScore is defined as a function of the P-values of the individual tags.

So, a module can consist of one to five tags, depending on which number of tags yields the highest statistical significance.

The above algorithm was used to search for modules of maximal length 100, 200, 500, 1000 and 2000 bp.


24/50


CRM prediction algorithm (Overview)


25/50

Results

The algorithm identified about 118,000 putative CRMs (pCRMs) covering 2.88% of the genome.

This constitutes one of the first genome-wide, non-promoter-centric sets of human cis-regulatory modules.

The biological relevance of the pCRMs was evaluated by measuring the extent to which they overlap known regulatory elements in databases such as TRRD, Transfac and GALA.


26/50

Outline

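Introduction

CRM prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions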


27/50




28/50

Comparison to other genome-wide predictions

The ability of the algorithm to take advantage of interspecies TFBS conservation contributes in good part to its accuracy.

The 34% of the human genome that lies within an alignment block with the mouse and rat genomes contains 90% of bases within Transfac sites, 67% of those within TRRD modules, and 87% of those within GALA regulatory regions.


29/50

Outline

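Introduction

CRM prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions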


30/50

Experimental validation of predicted modules

The predictions were verified experimentally by ChIP-chip analysis.

This method allows for the large-scale identification of protein-DNA interactions as they occur in vivo.


31/50

ChIP-chip Analysis

Buck et al. Genome Biol. 2005; 6(11): R97


32/50


Experimental validation of predicted modules

They selected modules predicted to be bound by the estrogen receptor (ER), the E2F transcription factor (E2F4), STAT3 and HIF1 to print a DNA microarray.

The microarray contains 758, 1370, 860 and 1882 modules predicted to be bound by ER, E2F4, STAT3 and HIF1, respectively.

In the current study, the microarray was then probed by ChIP-chip for ER and for E2F4.

Approximately 3% of the 758 ER-predicted pCRMs on the microarray actually proved to be bound by ER, while 17% of the 1370 E2F4-predicted pCRMs on the microarray were bound by E2F4.


33/50


Experimental validation of predicted modules

These numbers should be considered an underestimate of the actual specificity of the algorithm, since the protein-DNA interactions were tested in a single cell type, while TFs are known to regulate different sets of genes in different cell types, physiological conditions, and times in development.

In addition, the experiment was conducted under a single set of conditions (concentration of estradiol, time of treatment, etc.). For all of these reasons, it is difficult to determine the real accuracy of the algorithm.


34/50


Experimental validation of predicted modules

As the microarray contained predicted modules for four different TFs, the data can be used to assess the specificity of TFBS predictions.

Among the 55 modules bound by ER, 44% were indeed selected for their ER-binding sites, and among the 433 modules bound by E2F4, 54% were selected for that factor.

In addition to false-positive ChIP-chip signals or the failure of the algorithm to detect some binding sites, it is likely that binding of TFs through alternative mechanisms, such as protein-protein interactions, contributes to this result.

The present algorithm can only predict the binding of a TF through direct DNA-binding interactions.
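As a quick consistency check on the reported percentages, the number of modules that were both predicted for a factor and bound by it can be estimated in two ways. The variable names are invented for this example, and the percentages on the slides are rounded, so only approximate agreement should be expected.

```python
# Numbers taken from the slides (percentages are rounded there).
er_predicted, e2f4_predicted = 758, 1370   # modules on the array per factor
er_bound, e2f4_bound = 55, 433             # modules bound in ChIP-chip

# Estimate 1: fraction of predicted modules that were bound.
er_from_predicted = 0.03 * er_predicted        # about 23 modules
e2f4_from_predicted = 0.17 * e2f4_predicted    # about 233 modules

# Estimate 2: fraction of bound modules that were predicted for that factor.
er_from_bound = 0.44 * er_bound                # about 24 modules
e2f4_from_bound = 0.54 * e2f4_bound            # about 234 modules

print(round(er_from_predicted), round(er_from_bound))
print(round(e2f4_from_predicted), round(e2f4_from_bound))
```

Both routes give roughly the same counts for ER and for E2F4, so the two sets of percentages describe the same overlap seen from opposite directions.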


35/50

Outline

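Introduction

CRM prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions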


36/50


Distribution of pCRMs along a region of chromosome 11


37/50

Global view of the gene regulatory landscape

The module density varies widely across the genome, with an average of four modules per 100 kb and a maximum of 44 modules per 100-kb window, covering from 0% to 55% of such a region.

As illustrated in the previous figure, some regions are rich in modules but relatively poor in genes. In some cases, this could reflect the presence of many unknown protein-coding genes, or at least of many alternative TSSs. Another possible explanation is that some of these modules may be regulating the transcription of noncoding transcripts.

Finally, this observation may be due to the presence of long-range enhancers, which may affect transcription of genes up to several hundreds of kilobases away.


38/50

Regulatory modules are preferentially located in specific regions relative to genes

The position of pCRMs with respect to their closest gene was studied.

The genome was divided into several types of noncoding regions, i.e., upstream of a gene, 5' UTR, first intron, internal introns, last intron, 3' UTR, and downstream region.

Within each type of region, they computed the fraction of bases included in a pCRM as a function of the distance to a reference point for each type of region.
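The "fraction of bases in a pCRM as a function of distance" measurement could be sketched like this. The coordinates, the bin size and the function names are hypothetical; a real analysis would pool many regions of the same type rather than a single toy region.

```python
from collections import defaultdict

def in_module(pos, modules):
    """True if base `pos` lies inside any module (start inclusive, end exclusive)."""
    return any(start <= pos < end for start, end in modules)

def fraction_by_distance(region, reference, modules, bin_size):
    """Fraction of bases covered by modules, binned by distance to `reference`."""
    covered = defaultdict(int)
    total = defaultdict(int)
    region_start, region_end = region
    for pos in range(region_start, region_end):
        distance_bin = abs(pos - reference) // bin_size
        total[distance_bin] += 1
        covered[distance_bin] += in_module(pos, modules)
    return {b * bin_size: covered[b] / total[b] for b in sorted(total)}

# Hypothetical 300-bp upstream region whose reference point (e.g. a TSS) is its last base.
toy_modules = [(250, 300), (100, 120)]
profile = fraction_by_distance((0, 300), reference=299, modules=toy_modules, bin_size=100)
```

In this toy region the bin closest to the reference point is half covered by modules, the next bin is 20% covered, and the farthest bin contains no module bases.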


39/50


Distribution of pCRMs relative to specific regions of the genes


40/50

Observations

1. Regions immediately surrounding TSSs are highly enriched for predicted modules. This was expected, as this region often contains the promoter of the gene. Surprisingly, there are modules immediately downstream of TSSs. These may represent alternative promoters for initiation downstream from the annotated transcripts.

2. Regions surrounding the sites of termination are also enriched for modules. 3' UTRs are essentially as enriched for pCRMs as 5' UTRs. Two reasons may explain this. First, these may represent enhancer-type regulatory elements that activate the upstream gene via a DNA-looping mechanism. Second, these may represent promoter elements driving noncoding transcripts, antisense relative to the coding gene. Such antisense transcripts may regulate gene expression by a post-transcriptional mechanism.


41/50


Distribution of pCRMs relative to specific regions of the genes


42/50

Observations

Another surprising observation is that the density of modules is the lowest in regions located 10-50 kb upstream of the TSS and, symmetrically, 10-30 kb downstream of the end of transcription. This is unexpected, as one would expect that these regions (at least those upstream of the TSS) would be prime real estate for transcriptional regulation.

However, this is confirmed by the density of interspecies conserved elements, which is also at its lowest in those regions.

Being close to the TSS, regulatory elements in these regions may be allowed to contain fewer binding sites (or binding sites with less affinity), making them difficult to detect using the current method.


43/50

Observations

Alternatively, these regions (10-50 kb upstream) may actually be depleted for regulatory elements. This could be due to constraints imposed by the chromatin structure of the nuclear architecture, making it more difficult for the DNA of these regions to come in physical proximity to the TSS.

Another notable observation is that the density of predicted modules in intronic regions is very low in the close vicinity of exons (except the first and the last one), but increases with the distance to the closest exon.


44/50


TFs target different regions relative to their target genes.

RED => Highly enriched for TFBSs, BLUE => Depleted in TFBSs


45/50

TFs target different regions relative to their target genes.

The previous figure shows that more than 70 of the 229 TF families considered exhibit a significant enrichment for one or more types of genomic regions.

A number of TFs show a preference for distal positions, mostly those located more than 100 kb upstream of the TSS, and are also enriched within introns. This set of TFs is enriched for factors containing homeodomains or basic helix-loop-helix domains, which are often involved in regulating development.


46/50

TFs target different regions relative to their target genes.

A second set of TFs preferentially binds within 1 kb of the TSSs. This set is enriched for leucine zipper TFs and factors from the Ets family. Notably, most of these factors, contrary to what is observed for those binding distal sites, are involved in basic cellular functions.


47/50

Outline

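Introduction

CRM prediction algorithm

In silico validation of predicted modules

Experimental validation of predicted modules

Location of CRMs relative to genes

Conclusions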


48/50

Conclusions

Blanchette et al. have identified a set of rules describing the architecture of DNA regulatory elements and used them to build an algorithm allowing them to explore the regulatory potential of the human genome.

Although the false-positive rate in CRM prediction is likely to be high, the statistical power obtained through a large-scale, genome-wide approach revealed new insights about transcriptional regulation.

It was noted that a significant number of TFs have a strong bias for regulating genes either from a great distance or from promoter-proximal binding sites.


49/50

Conclusions

Noteworthy is the fact that most TFs that preferentially work from a large distance are involved in development, while those predicted to work from promoter-proximal sites tend to regulate genes involved in basic cellular processes.

It is expected that the database containing the modules presented in this study may speed up the discovery and experimental validation of CRMs.


50/50

THANK YOU
