blanchettegr.ppt

Upload: mohammad-rameez

Post on 04-Apr-2018

TRANSCRIPT

  • 7/29/2019 blanchettegr.ppt

    1/50

    Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression

    Mathieu Blanchette et al.

    Presented By:

    Manish Agrawal

  • 7/29/2019 blanchettegr.ppt

    2/50

    Outline

    Introduction

    Cis-regulatory module (CRM) prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    3/50

    Gene Regulation

    Chromosomal activation/deactivation

    Transcriptional regulation

    Splicing regulation

    mRNA degradation

    mRNA transport regulation

    Control of translation initiation

    Post-translational modification

    Source: Lecture Notes by Prof. Saurabh Sinha, UIUC

  • 7/29/2019 blanchettegr.ppt

    4/50

    GENE

    ACAGTGA

    TRANSCRIPTION

    FACTOR

    PROTEIN

    Transcriptional regulation

    Source: Lecture Notes by Prof. Saurabh Sinha, UIUC

  • 7/29/2019 blanchettegr.ppt

    6/50

    Transcription Factors (TFs)

    They generally have affinity for short, degenerate DNA sequences (5-15 bp).

    Experiments have enabled identification of consensus-binding motifs for hundreds of TFs.

    The binding motifs are generally represented by position-weight matrices (PWMs).

  • 7/29/2019 blanchettegr.ppt

    7/50

    Binding site sequence alignment

    Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html

  • 7/29/2019 blanchettegr.ppt

    8/50

    Alignment matrix for a binding site

    Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html

  • 7/29/2019 blanchettegr.ppt

    9/50

    Position weight matrix (PWM)

    Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html

  • 7/29/2019 blanchettegr.ppt

    10/50

    Position weight matrix (PWM)

    To transform elements of the alignment matrix to the weight matrix, the following formula is used:

    $$\mathrm{weight}_{i,j} = \ln\frac{(n_{i,j} + p_i)/(N + 1)}{p_i} \approx \ln\frac{f_{i,j}}{p_i}$$

    $N$ - total number of sequences (15 in this example)

    $n_{i,j}$ - number of times nucleotide $i$ was observed in position $j$ of the alignment

    $f_{i,j} = n_{i,j}/N$ - frequency of letter $i$ at position $j$

    $p_i$ - a priori probability of letter $i$

    In this example $p_T$ and $p_A$ are equal to 0.3, and $p_C$ and $p_G$ are equal to 0.2 (overall frequency of the letters within the Drosophila melanogaster genome).

    Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
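The transformation on this slide, weight = ln(((n + p)/(N + 1))/p), can be sketched in a few lines of Python. This is only an illustration: the function name `weight_matrix` and the toy alignment are assumptions made here, and only the background probabilities (0.3 for A/T, 0.2 for C/G) come from the slide.

```python
import math

# Background probabilities from the slide's Drosophila example.
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}

def weight_matrix(sites, background=BACKGROUND):
    """Return one {base: weight} dict per column of the aligned sites."""
    n_seqs = len(sites)      # N in the slide's formula
    length = len(sites[0])   # all aligned sites are assumed to share one length
    matrix = []
    for j in range(length):
        column = [site[j] for site in sites]
        weights = {}
        for base, p in background.items():
            n_ij = column.count(base)
            # weight = ln( ((n_ij + p_i) / (N + 1)) / p_i )
            weights[base] = math.log(((n_ij + p) / (n_seqs + 1)) / p)
        matrix.append(weights)
    return matrix

# Hypothetical toy alignment (not the 15 sites used in the slide).
toy_sites = ["ACAGTGA", "ACAGTGA", "ACTGTGA", "GCAGTGA"]
toy_matrix = weight_matrix(toy_sites)
```

Adding p to the count and 1 to N keeps the logarithm finite for bases that never appear in a column; such bases simply receive a negative weight.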

  • 7/29/2019 blanchettegr.ppt

    11/50

    Position weight matrix (PWM)

    A weight matrix can be used to evaluate the resemblance of any L-bp DNA sequence to the training set of binding sites.

    The score for a sequence is calculated as the sum of the values that each base of the sequence has in the weight matrix.

    Any sequence with a score higher than the predefined cut-off is a potential new binding site.

    Source: http://trantor.bioc.columbia.edu/Target_Explorer/manual/matrix.html
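As a rough illustration of the scoring rule described on this slide, the sketch below sums per-position weights over a sliding window and keeps windows that score above a cut-off. The names `score_sequence` and `find_sites`, the 3-bp matrix, and the cut-off value are all hypothetical choices made for the example.

```python
def score_sequence(matrix, seq):
    """Sum the weight that each base of seq has at its position in the matrix."""
    return sum(column[base] for column, base in zip(matrix, seq))

def find_sites(matrix, dna, cutoff):
    """Return (start, window, score) for every window scoring above cutoff."""
    width = len(matrix)
    hits = []
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        score = score_sequence(matrix, window)
        if score > cutoff:
            hits.append((start, window, score))
    return hits

# Hypothetical 3-bp weight matrix that favours the word "ACG".
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = find_sites(toy_matrix, "TTACGTACA", cutoff=2.0)
```

Here only the window starting at position 2 (`ACG`, score 3.0) exceeds the cut-off; the partial match `ACA` at position 6 scores 1.0 and is rejected.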

  • 7/29/2019 blanchettegr.ppt

    12/50

    Complications in identification of TF-binding sites (TFBSs)

    The binding of a TF also depends on other factors, such as the chromatin environment and cooperation or competition with other DNA-binding proteins.

    In higher eukaryotes, TFs rarely operate by themselves; a combination of TFs acts together to achieve the desired gene expression. The DNA footprint of this set of factors is called a cis-regulatory module (CRM).

  • 7/29/2019 blanchettegr.ppt

    13/50

    Cis-regulatory module

  • 7/29/2019 blanchettegr.ppt

    14/50

    Features of CRMs

    CRMs generally consist of several binding sites for a TF.

    CRMs, and in particular the binding sites they contain, are generally more evolutionarily conserved than their flanking intergenic regions.

    Genes regulated by a common set of TFs tend to be co-expressed.

  • 7/29/2019 blanchettegr.ppt

    15/50

    Outline

    Introduction

    CRM prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    16/50

    Predicting CRMs

    Different combinations of these features (of CRMs) have been used, often with PWM information, to predict regulatory elements for specific TFs.

    However, very few existing methods are designed to be applied on a genome-wide scale without prior knowledge about sets of interacting TFs or sets of co-regulated genes.

    Previous work generally took 5-10 PWMs and looked for clusters of matches to these PWMs in the genome. Such studies have been reported for embryo development in Drosophila.

  • 7/29/2019 blanchettegr.ppt

    17/50

    Goals and challenges

    The goal of this study is to do a genome-wide analysis and identify CRMs in the human genome without any prior knowledge about interactions of TFs.

    The new algorithm uses only the features of CRMs (mentioned earlier) for its prediction.

    Although CRMs predicted like this may contain a significant number of false positives, the whole-genome approach provides sufficient statistical power to formulate specific biological hypotheses.

  • 7/29/2019 blanchettegr.ppt

    18/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    CRM prediction algorithm (Overview)

  • 7/29/2019 blanchettegr.ppt

    19/50

    CRM prediction algorithm

    1. A set of 481 vertebrate PWMs from Transfac 7.2 was used for the analysis. PWMs were grouped into 229 families.

    2. The genome-wide multiple alignment of the human, mouse and rat genomes was done by the MULTIZ program. Only the regions within MULTIZ alignments were considered in the later part of the study. These regions cover 34% of the human genome.

    3. For each of the 481 PWMs, individual binding sites were first predicted. The human, mouse and rat genomes were scanned separately on both strands, and a log-likelihood score was computed in the standard way.
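Step 3 does not spell out the scoring formula. One common reading of "the standard way" is a log-likelihood ratio of the motif model against a background model, evaluated on both strands. The sketch below assumes that reading; the function names, the 2-bp motif and the uniform background are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def log_likelihood(pwm, background, window):
    """Log-likelihood ratio of window under the motif versus the background."""
    return sum(math.log(col[b] / background[b]) for col, b in zip(pwm, window))

def scan_both_strands(pwm, background, dna):
    """For each start position, report the better-scoring strand and its score."""
    width = len(pwm)
    results = []
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        fwd = log_likelihood(pwm, background, window)
        rev = log_likelihood(pwm, background, reverse_complement(window))
        strand, score = ("+", fwd) if fwd >= rev else ("-", rev)
        results.append((start, strand, score))
    return results

# Hypothetical 2-bp motif "AC" given as per-position base probabilities.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
results = scan_both_strands(toy_pwm, uniform, "GTAC")
```

In this toy scan the window `GT` at position 0 matches on the minus strand (its reverse complement is `AC`), while `AC` at position 2 matches on the plus strand with the same score.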

  • 7/29/2019 blanchettegr.ppt

    20/50

    CRM prediction algorithm

    4. For each species and each PWM, a hit score was computed. A weighted average of the human, mouse and rat scores was then used to define a hit score for each alignment column p and PWM m.

    5. $\mathrm{hitScore}_{aln}(m,p) = \mathrm{hitScore}_{Hum}(m,p) + \tfrac{1}{2}\max\bigl(0,\ \mathrm{hitScore}_{Mou}(m,p) + \mathrm{hitScore}_{Rat}(m,p)\bigr)$

    6. The human hit score is given higher weight to allow prediction of human-specific binding sites, provided that they are very good matches to the PWM considered.

    7. Only positions with $\mathrm{hitScore}_{aln}(m,p) > 10$ are retained to construct modules. This threshold is somewhat arbitrary but results in the total number of bases predicted in pCRMs being ~2.88% of the genome.
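The combination rule in step 5 and the threshold in step 7 are simple enough to check by hand. The sketch below transcribes them directly; the function names and the example scores are hypothetical.

```python
THRESHOLD = 10.0  # cut-off quoted in step 7

def hit_score_aln(human, mouse, rat):
    """Human score plus half of the (non-negative) mouse + rat support."""
    return human + 0.5 * max(0.0, mouse + rat)

def is_retained(human, mouse, rat, threshold=THRESHOLD):
    """True if the aligned position is kept for module construction."""
    return hit_score_aln(human, mouse, rat) > threshold

# Moderate human hit, rescued by conservation in mouse and rat.
conserved = hit_score_aln(6.0, 5.0, 5.0)        # 6 + 0.5 * 10 = 11
# Strong human-specific hit; negative rodent scores are clipped to zero.
human_only = hit_score_aln(11.0, -4.0, -3.0)    # 11 + 0.5 * 0 = 11
# Moderate human hit with weak rodent support.
weak = hit_score_aln(6.0, 1.0, 1.0)             # 6 + 0.5 * 2 = 7
```

The `max(0, ...)` term means poor rodent matches never penalise a human hit, which is how a very good human-specific site can still pass the threshold on its own.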

  • 7/29/2019 blanchettegr.ppt

    21/50

    CRM prediction algorithm:

    Computation of module score

    We need $\mathrm{moduleScore}(p_1 \ldots p_2)$ for the alignment region going from position $p_1$ to $p_2$ of human.

    Define $\mathrm{TotalScore}(m, p_1 \ldots p_2)$ to be the sum of the $\mathrm{hitScore}_{aln}$ values of all non-overlapping hits for $m$ in the region $p_1 \ldots p_2$.

    The optimization problem of choosing the best set of non-overlapping hits is solved heuristically using a greedy algorithm. This greedy algorithm iteratively selects the hit with the maximal score that does not overlap with the hits previously chosen.

    For each matrix and each region, a P-value is assigned.
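The greedy selection of non-overlapping hits described on this slide can be sketched as follows. Hits are represented here as hypothetical `(start, end, score)` tuples with an exclusive end coordinate; that representation and the function names are assumptions of this example.

```python
def greedy_non_overlapping(hits):
    """Pick hits in decreasing score order, skipping any that overlap a chosen hit.

    hits: list of (start, end, score) tuples, end exclusive.
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        start, end, _ = hit
        # Two intervals are disjoint if one ends before the other starts.
        if all(end <= c_start or start >= c_end for c_start, c_end, _ in chosen):
            chosen.append(hit)
    return chosen

def total_score(hits):
    """Sum of the scores of the greedily chosen non-overlapping hits."""
    return sum(score for _, _, score in greedy_non_overlapping(hits))

# Hypothetical hits for one PWM within one region.
toy_hits = [(0, 8, 12.0), (5, 13, 15.0), (13, 21, 11.0), (10, 18, 14.0)]
selected = greedy_non_overlapping(toy_hits)
```

On the toy data the best hit (5 to 13) is taken first, which blocks the two hits overlapping it, and the hit from 13 to 21 is then added for a total of 26.0. Being a heuristic, the greedy choice is not guaranteed to find the maximum-scoring set in general.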

  • 7/29/2019 blanchettegr.ppt

    22/50

    CRM prediction algorithm:

    Computation of module score

    The score for a module is computed based on one to five PWMs called tags.

    The first tag for region $p_1 \ldots p_2$ is the matrix with the most significant TotalScore, i.e.,

    $$\mathrm{tag}_1 = \operatorname*{argmin}_{m \in \mathrm{PWMs}} \mathrm{pValue}\bigl(\mathrm{TotalScore}(m, p_1 \ldots p_2)\bigr)$$

    The regions belonging to $\mathrm{tag}_1$ are then masked out and the TotalScores for each matrix are recomputed, excluding hits overlapping those of $\mathrm{tag}_1$.

    The second tag is then the matrix with the most significant TotalScore. The process is repeated until five tags are selected, if possible.
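The iterative tag selection can be sketched as below. The slides do not say how the P-value is computed, so it is passed in as a caller-supplied function; the `exp(-total)` stand-in, the PWM names `M1` to `M3`, and the hit tuples are all hypothetical placeholders rather than anything taken from the paper.

```python
import math

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end

def greedy_hits(hits, masked):
    """Greedy non-overlapping hits that also avoid all masked intervals."""
    chosen = []
    for start, end, score in sorted(hits, key=lambda h: h[2], reverse=True):
        blocked = any(overlaps(start, end, s, e) for s, e in masked)
        clash = any(overlaps(start, end, s, e) for s, e, _ in chosen)
        if not blocked and not clash:
            chosen.append((start, end, score))
    return chosen

def select_tags(hits_by_pwm, p_value, max_tags=5):
    """Repeatedly pick the PWM with the smallest P-value, then mask its hits."""
    masked, tags = [], []
    remaining = set(hits_by_pwm)
    while remaining and len(tags) < max_tags:
        best = None
        for pwm in sorted(remaining):
            chosen = greedy_hits(hits_by_pwm[pwm], masked)
            if not chosen:
                continue  # every hit of this PWM is masked out
            p = p_value(pwm, sum(score for _, _, score in chosen))
            if best is None or p < best[0]:
                best = (p, pwm, chosen)
        if best is None:
            break
        p, pwm, chosen = best
        tags.append((pwm, p))
        masked.extend((s, e) for s, e, _ in chosen)
        remaining.discard(pwm)
    return tags

# Placeholder P-value: any function that decreases with the total score.
toy_p_value = lambda pwm, total: math.exp(-total)
toy_hits = {
    "M1": [(0, 10, 20.0), (30, 40, 15.0)],
    "M2": [(5, 15, 18.0), (50, 60, 12.0)],
    "M3": [(32, 38, 11.0)],
}
tags = select_tags(toy_hits, toy_p_value)
```

In the toy run `M1` is chosen first; masking its hits removes one hit of `M2` and the only hit of `M3`, so `M2` becomes the second tag on the strength of its remaining hit and selection stops with two tags.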

  • 7/29/2019 blanchettegr.ppt

    23/50

    CRM prediction algorithm:

    Computation of module score

    Finally, we define totalModuleScore as a function of the P-values of the individual tags.

    So, a module can consist of one to five tags, depending on which number of tags yields the highest statistical significance.

    The above algorithm was used to search for modules of maximal length 100, 200, 500, 1000 and 2000 bp.

  • 7/29/2019 blanchettegr.ppt

    24/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    CRM prediction algorithm (Overview)

  • 7/29/2019 blanchettegr.ppt

    25/50

    Results

    The algorithm identified about 118,000 putative CRMs (pCRMs) covering 2.88% of the genome.

    This constitutes one of the first genome-wide, non-promoter-centric sets of human cis-regulatory modules.

    The biological relevance of the pCRMs was evaluated by measuring the extent to which they overlap known regulatory elements in databases such as TRRD, Transfac and GALA.

  • 7/29/2019 blanchettegr.ppt

    26/50

    Outline

    Introduction

    CRM prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    27/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    In silico validation of predicted modules

  • 7/29/2019 blanchettegr.ppt

    28/50

    Comparison to other genome-wide predictions

    The ability of the algorithm to take advantage of interspecies TFBS conservation contributes in good part to its accuracy.

    The 34% of the human genome that lies within an alignment block with the mouse and rat genomes contains 90% of bases within Transfac sites, 67% of those within TRRD modules, and 87% of those within GALA regulatory regions.

  • 7/29/2019 blanchettegr.ppt

    29/50

    Outline

    Introduction

    CRM prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted

    modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    30/50

    Experimental validation of predicted modules

    The predictions were verified experimentally by ChIP-chip analysis.

    This method allows for the large-scale identification of protein-DNA interactions as they occur in vivo.

  • 7/29/2019 blanchettegr.ppt

    31/50

    ChIP-chip Analysis

    Buck et al. Genome Biol. 2005; 6(11): R97

  • 7/29/2019 blanchettegr.ppt

    32/50

    Experimental validation of predicted modules

    Modules predicted to be bound by the estrogen receptor (ER), the E2F transcription factor (E2F4), STAT3 and HIF1 were selected to print a DNA microarray.

    The microarray contains 758, 1370, 860 and 1882 modules predicted to be bound by ER, E2F4, STAT3 and HIF1, respectively.

    In the current study, the microarray was then probed by ChIP-chip for ER and E2F4.

    Approx. 3% of the 758 ER-predicted pCRMs on the microarray proved to be bound by ER, while 17% of the 1370 E2F4-predicted pCRMs on the microarray were bound by E2F4.

  • 7/29/2019 blanchettegr.ppt

    33/50

    Experimental validation of predicted modules

    These numbers should be considered an underestimate of the actual specificity of the algorithm, since the protein-DNA interactions were tested in a single cell type, while TFs are known to regulate different sets of genes in different cell types, physiological conditions, and times in development.

    In addition, the experiment was conducted under a single set of conditions (concentration of estradiol, time of treatment, etc.). For all of these reasons, it is difficult to determine the real accuracy of the algorithm.

  • 7/29/2019 blanchettegr.ppt

    34/50

    Experimental validation of predicted modules

    As the microarray contained predicted modules for four different TFs, the data can be used to assess the specificity of TFBS predictions.

    Among the 55 modules bound by ER, 44% were indeed selected for their ER-binding sites, and among the 433 modules bound by E2F4, 54% were selected for that factor.

    In addition to false-positive ChIP-chip signals or the failure of the algorithm to detect some binding sites, it is likely that binding of TFs through alternative mechanisms such as protein-protein interactions contributes to this result.

    The present algorithm can only predict the binding of a TF through direct DNA-binding interactions.
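The ChIP-chip percentages are quoted from two directions: as a share of the modules printed for a factor (about 3% of 758 for ER, 17% of 1370 for E2F4) and as a share of all modules bound by that factor (44% of 55 for ER, 54% of 433 for E2F4). Assuming both describe the same set of modules, those predicted for a factor and also bound by it, the implied counts should roughly agree. The short check below uses only the figures from the slides; the percentages are rounded, so the counts are approximate.

```python
# Figures quoted in the slides.
printed = {"ER": 758, "E2F4": 1370}            # modules printed per factor
bound_share = {"ER": 0.03, "E2F4": 0.17}       # share of printed modules bound
bound_total = {"ER": 55, "E2F4": 433}          # all modules bound by the factor
selected_share = {"ER": 0.44, "E2F4": 0.54}    # share of bound modules selected for it

def implied_counts(tf):
    """Predicted-and-bound module count estimated from each direction."""
    from_printed = round(printed[tf] * bound_share[tf])
    from_bound = round(bound_total[tf] * selected_share[tf])
    return from_printed, from_bound

er_counts = implied_counts("ER")
e2f4_counts = implied_counts("E2F4")
```

Both routes give about 23 to 24 modules for ER and about 233 to 234 for E2F4, so the two sets of percentages are mutually consistent within rounding.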

  • 7/29/2019 blanchettegr.ppt

    35/50

    Outline

    Introduction

    CRM prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    36/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    Distribution of pCRMs along a region of chromosome 11

  • 7/29/2019 blanchettegr.ppt

    37/50

    Global view of the gene regulatory landscape

    The module density varies widely across the genome, with an average of four modules per 100 kb and a maximum of 44 modules per 100-kb window, covering from 0% to 55% of such a region.

    As illustrated in the previous figure, some regions are rich in modules but relatively poor in genes. In some cases, this could reflect the presence of many unknown protein-coding genes, or at least of many alternative TSSs. Another possible explanation is that some of these modules may be regulating the transcription of noncoding transcripts.

    Finally, this observation may be due to the presence of long-range enhancers, which may affect transcription of genes up to several hundreds of kilobases away.
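A per-window module density of the kind quoted on this slide can be computed by binning module start coordinates. The sketch assumes non-overlapping 100-kb windows and assigns each module to the window containing its start; the slides do not state how windows were actually defined, and the coordinates below are made up.

```python
from collections import Counter

WINDOW = 100_000  # 100-kb windows, as in the slide

def modules_per_window(module_starts, window=WINDOW):
    """Count modules per window, keyed by window index (start // window)."""
    return Counter(start // window for start in module_starts)

# Hypothetical module start coordinates on one chromosome.
toy_starts = [1_000, 50_000, 99_999, 100_000, 250_000]
density = modules_per_window(toy_starts)
densest_window, densest_count = density.most_common(1)[0]
```

With these toy coordinates, window 0 holds three modules and windows 1 and 2 hold one each.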

  • 7/29/2019 blanchettegr.ppt

    38/50

    Regulatory modules are preferentially located in specific regions relative to genes

    The position of pCRMs with respect to their closest gene was studied.

    The genome was divided into several types of noncoding regions, i.e., upstream of a gene, 5' UTR, first intron, internal introns, last intron, 3' UTR, and downstream region.

    Within each type of region, they computed the fraction of bases included in a pCRM as a function of the distance to a reference point for each type of region.
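The quantity described here, the fraction of bases inside a pCRM as a function of distance to a reference point, can be sketched with simple distance bins. The data layout is an assumption of this example: each region is given as its length plus a list of pCRM intervals in coordinates relative to the reference point (0 = reference point, end exclusive).

```python
def pcrm_fraction_by_distance(regions, bin_size, n_bins):
    """Fraction of bases covered by a pCRM in each distance bin.

    regions: list of (length, [(crm_start, crm_end), ...]) with coordinates
    measured from the region's reference point. Bins with no bases yield None.
    """
    covered = [0] * n_bins
    total = [0] * n_bins
    for length, crm_intervals in regions:
        in_crm = [False] * length
        for start, end in crm_intervals:
            for pos in range(max(0, start), min(length, end)):
                in_crm[pos] = True
        # Pool bases from all regions of this type, bin by bin.
        for pos in range(min(length, bin_size * n_bins)):
            b = pos // bin_size
            total[b] += 1
            covered[b] += in_crm[pos]
    return [c / t if t else None for c, t in zip(covered, total)]

# Two hypothetical regions of the same type (e.g. upstream regions).
toy_regions = [
    (20, [(0, 5)]),
    (10, [(2, 4), (8, 10)]),
]
fractions = pcrm_fraction_by_distance(toy_regions, bin_size=10, n_bins=2)
```

Shorter regions contribute only to the bins they reach, so in the toy data the first bin pools 20 bases (9 of them in a pCRM) while the second bin pools 10 bases from the longer region alone.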

  • 7/29/2019 blanchettegr.ppt

    39/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    Distribution of pCRMs relative to specific regions of the genes

  • 7/29/2019 blanchettegr.ppt

    40/50

    Observations

    1. Regions immediately surrounding TSSs are highly enriched for predicted modules. This was expected, as this region often contains the promoter of the gene. Surprisingly, there are modules immediately downstream of TSSs. These may represent alternative promoters for initiation downstream from the annotated transcripts.

    2. Regions surrounding the sites of termination are also enriched for modules. 3' UTRs are essentially as enriched as 5' UTRs for pCRMs. Two reasons may explain this. First, these may represent enhancer-type regulatory elements that activate the upstream gene via a DNA-looping mechanism. Second, these may represent promoter elements driving noncoding transcripts, antisense relative to the coding gene. Such antisense transcripts may regulate gene expression by a post-transcriptional mechanism.

  • 7/29/2019 blanchettegr.ppt

    41/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    Distribution of pCRMs relative to specific regions of the genes

  • 7/29/2019 blanchettegr.ppt

    42/50

    Observations

    Another surprising observation is that the density of modules is lowest in regions located 10-50 kb upstream of the TSS and, symmetrically, 10-30 kb downstream of the end of transcription. This is unexpected, as one would expect these regions (at least those upstream of the TSS) to be prime real estate for transcriptional regulation.

    However, this is confirmed by the density of interspecies conserved elements, which is also at its lowest in those regions. Being close to the TSS, regulatory elements in these regions may be allowed to contain fewer binding sites (or binding sites with less affinity), making them difficult to detect using the current method.

  • 7/29/2019 blanchettegr.ppt

    43/50

    Observations

    Alternatively, these regions (10-50 kb upstream) may actually be depleted of regulatory elements. This could be due to constraints imposed by the chromatin structure of the nuclear architecture, making it more difficult for the DNA of these regions to come into physical proximity with the TSS.

    Another notable observation is that the density of predicted modules in intronic regions is very low in the close vicinity of exons (except the first and the last one), but increases with the distance to the closest exon.

  • 7/29/2019 blanchettegr.ppt

    44/50

    Mathieu Blanchette et al. Genome Res. 2006; 16: 656-668

    TFs target different regions relative to their target genes.

    RED => Highly enriched for TFBSs, BLUE => Depleted in TFBSs

  • 7/29/2019 blanchettegr.ppt

    45/50

    TFs target different regions relative to their target genes.

    The previous figure shows that more than 70 of the 229 TF families considered exhibit a significant enrichment for one or more types of genomic regions.

    A number of TFs show a preference for distal positions, mostly those located more than 100 kb upstream of the TSS, and are also enriched within introns. This set of TFs is enriched for factors containing homeo domains or basic helix-loop-helix domains, which are often involved in regulating development.

  • 7/29/2019 blanchettegr.ppt

    46/50

    TFs target different regions relative to their target genes.

    A second set of TFs preferentially binds within 1 kb of the TSSs. This set is enriched for leucine zipper TFs and factors from the Ets family. Notably, most of these factors, contrary to what is observed for those binding distal sites, are involved in basic cellular functions.

  • 7/29/2019 blanchettegr.ppt

    47/50

    Outline

    Introduction

    CRM prediction algorithm

    In silico validation of predicted modules

    Experimental validation of predicted modules

    Location of CRMs relative to genes

    Conclusions

  • 7/29/2019 blanchettegr.ppt

    48/50

    Conclusions

    Blanchette et al. have identified a set of rules describing the architecture of DNA regulatory elements and used them to build an algorithm allowing them to explore the regulatory potential of the human genome.

    Although the false-positive rate in CRM prediction is likely to be high, the statistical power obtained through a large-scale, genome-wide approach revealed new insights about transcriptional regulation.

    It was noted that a significant number of TFs have a strong bias for regulating genes either from a great distance or from promoter-proximal binding sites.

  • 7/29/2019 blanchettegr.ppt

    49/50

    Conclusions

    Noteworthy is the fact that most TFs that preferentially work from a large distance are involved in development, while those predicted to work from promoter-proximal sites tend to regulate genes involved in basic cellular processes.

    It is expected that the database containing the modules presented in this study may speed up the discovery and experimental validation of CRMs.

  • 7/29/2019 blanchettegr.ppt

    50/50

    THANK YOU