Alternative Splicing from ESTs
Eduardo Eyras
Bioinformatics UPF – February 2004
Intro
ESTs
Prediction of Alternative Splicing from ESTs
[Figure: transcription of exons and introns into pre-mRNA (5’ to 3’), splicing into mature mRNA (5’ cap, poly-A tail) and translation into a peptide. A second panel shows a different splicing of the same pre-mRNA, giving a different peptide.]
Alt splicing as a mechanism of gene regulation
Functional domains can be added/subtracted, giving protein diversity
Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
It can modify the activity of transcription factors, affecting the expression of genes
It is observed in nearly all metazoans
Estimated to occur in 30%-40% of human genes
Forms of alternative splicing
Exon skipping / inclusion
Alternative 3’ splice site
Alternative 5’ splice site
Mutually exclusive exons
Intron retention
[Figure legend: constitutive exon; alternatively spliced exons]
How to study alternative splicing?
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
It may contain coding and/or non-coding regions
ESTs
Cells from a specific organ, tissue or developmental stage
[Figure: EST library construction, showing RNA strands (poly-A at the 3’ end) and DNA strands (poly-T at the 5’ end). Steps shown:]
mRNA extraction
Add oligo-dT primer
Reverse transcriptase
Ribonuclease H
DNA polymerase, Ribonuclease H
Double stranded cDNA
ESTs
Clone cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads: 5’ EST, 3’ EST
Splice variants
[Figure: genomic sequence, primary transcript, splicing, cDNA clones, and the 5’ and 3’ EST sequences read from them]
Alternative Splicing from ESTs
ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs (DB searches):
Known gene
Similar to known gene
Contaminant
Novel gene
ESTs provide expression data
eVOC Ontologies http://www.sanbi.ac.za/evoc/
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample from which the sample was prepared. Examples are: normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
Linking the expression vocabulary to gene annotations
ESTs
Genes
Normalized vs. non-normalized libraries
The downside of ESTs
Cannot detect lowly/rarely expressed genes or non-expressed (regulatory) sequences
Random sampling: the more ESTs we sequence, the fewer new useful sequences we get
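The diminishing return of random sampling can be illustrated with a small calculation. This is only a sketch under a strong assumption that is not in the slides: every gene is equally likely to be hit by each EST. Real libraries are dominated by highly expressed genes, so this uniform model is optimistic. The function name and parameters are hypothetical.

```python
def expected_new_genes(num_genes: int, already_sequenced: int, batch: int) -> float:
    """Expected number of not-yet-seen genes hit by the next `batch` ESTs.

    Assumes (for illustration only) that each EST samples one of
    `num_genes` genes uniformly at random.
    """
    def expected_seen(n: int) -> float:
        # Expected number of distinct genes observed after n random ESTs.
        return num_genes * (1.0 - (1.0 - 1.0 / num_genes) ** n)

    return expected_seen(already_sequenced + batch) - expected_seen(already_sequenced)


# The same batch of 100 ESTs yields fewer new genes as sequencing proceeds.
early = expected_new_genes(1000, 0, 100)
middle = expected_new_genes(1000, 1000, 100)
late = expected_new_genes(1000, 5000, 100)
```

Under this model the gain from each additional batch shrinks steadily, which is the behaviour the slide describes.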
Gene Hunting
Sequencing of the Human Genome (HGP) EST Sequencing
Origin of the ESTs
Science. 1991 Jun 21;252(5013):1651-6
Complementary DNA sequencing: expressed sequence tags and human genome project.
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.
Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.
Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.
EST-sequencing explosion
Merck and WashU (1994)
public ESTs
GenBank
dbEST
non-exclusivity (1992)
Number of public entries: 20,039,613
Summary by organism
Homo sapiens (human) 5,472,005
Mus musculus + domesticus (mouse) 4,056,481
Rattus sp. (rat) 583,841
Triticum aestivum (wheat) 549,926
Ciona intestinalis 492,511
Gallus gallus (chicken) 460,385
Danio rerio (zebrafish) 450,652
Zea mays (maize) 391,417
Xenopus laevis (African clawed frog) 359,901
…
dbEST release 20 February 2004
EST lengths
[Figure: human EST length distribution (dbEST, Sep. 2003), annotated at ~450 bp]
Recover the mRNA from the ESTs
What is an EST cluster?
A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity
Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene. (Burke, Davison, Hide, Genome Research 1999).
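The idea of consolidating ESTs by sequence similarity can be sketched as single-linkage clustering with a union-find structure. This is an illustration, not the method used by any particular clustering resource: it assumes the pairwise similarity hits have already been computed by some alignment tool, and the names `cluster_ests` and `similar_pairs` are hypothetical.

```python
from collections import defaultdict


def cluster_ests(est_ids, similar_pairs):
    """Group sequence ids into clusters, given pairs judged similar.

    Two sequences end up in the same cluster if they are linked by a
    chain of similar pairs (single linkage).
    """
    parent = {e: e for e in est_ids}

    def find(e):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for e in est_ids:
        clusters[find(e)].append(e)
    return sorted(sorted(members) for members in clusters.values())


# ESTs e1 and e2 join the known mRNA m1; e3 and e4 form a second cluster.
example = cluster_ests(
    ["e1", "e2", "e3", "e4", "m1"],
    [("e1", "e2"), ("e2", "m1"), ("e3", "e4")],
)
```

Single linkage is simple but will also chain together chimeric or paralogous sequences, which is one motivation for the genome-based approach later in the slides.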
EST pre-processing
Vector, repeats, mitochondrial, xenocontaminants
EST Clustering
UniGene (NCBI) www.ncbi.nlm.nih.gov/UniGene
TIGR Human Gene Index www.tigr.org
(The Institute for Genomic Research)
StackDB www.sanbi.ac.za
(South African Bioinformatics Institute)
UniGene
Species UniGene Entries
Homo sapiens 118,517
Mus musculus 82,482
Rattus norvegicus 43,942
Sus scrofa 20,426
Gallus gallus 11,970
Xenopus laevis 21,734
Xenopus tropicalis 17,102
…
ESTs and the Genome
ESTs aligned to the genome
Some advantages:
• It defines the location of exons and introns
• We can verify the splice sites of introns (e.g. GT-AG), and hence also check the correct strand of spliced ESTs
• It helps prevent chimeras
• It can avoid putting together ESTs from paralogous genes
• We can avoid including pseudogenes in our analysis
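The strand check mentioned in the list above can be sketched as follows. It rests on one fact: a canonical GT-AG intron on the reverse strand reads CT-AC on the forward genomic sequence. The function name, the 0-based half-open intron coordinates and the majority vote are assumptions made for this illustration, and only the GT-AG class is handled.

```python
def infer_strand(genome: str, introns):
    """Guess the strand of a spliced alignment from its intron boundaries.

    `introns` is a list of (start, end) pairs, 0-based and half-open,
    on the forward genomic sequence. Returns "+", "-" or None when the
    boundaries give no majority.
    """
    votes = {"+": 0, "-": 0}
    for start, end in introns:
        first_two = genome[start:start + 2].upper()
        last_two = genome[end - 2:end].upper()
        if (first_two, last_two) == ("GT", "AG"):
            votes["+"] += 1  # donor/acceptor as seen on the forward strand
        elif (first_two, last_two) == ("CT", "AC"):
            votes["-"] += 1  # reverse complement of GT...AG
    if votes["+"] > votes["-"]:
        return "+"
    if votes["-"] > votes["+"]:
        return "-"
    return None


forward = infer_strand("AAAAGTCCCCAGTTTT", [(4, 12)])
reverse = infer_strand("AAAACTCCCCACTTTT", [(4, 12)])
```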
Aligning ESTs to the Genome
Many ESTs: fast programs, fast computers
Nearly exact matches: coverage >= 97%, percent_id >= 97%
Splice sites: GT-AG, AT-AC, GC-AG
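A minimal sketch of these acceptance criteria, assuming each alignment has already been summarised by its coverage, percent identity and the dinucleotide pair at each intron. The `EstAlignment` record and `passes_filters` function are hypothetical, and rejecting alignments with any other splice-site pair is an assumption about how the listed pairs are used.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Donor/acceptor pairs listed on the slide.
ALLOWED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}


@dataclass
class EstAlignment:
    est_id: str
    coverage: float        # percent of the EST that is aligned
    percent_id: float      # percent identity over the aligned part
    splice_sites: List[Tuple[str, str]] = field(default_factory=list)


def passes_filters(aln: EstAlignment,
                   min_coverage: float = 97.0,
                   min_percent_id: float = 97.0) -> bool:
    """Keep nearly exact matches whose introns all use an allowed pair."""
    if aln.coverage < min_coverage or aln.percent_id < min_percent_id:
        return False
    return all((donor.upper(), acceptor.upper()) in ALLOWED_SPLICE_SITES
               for donor, acceptor in aln.splice_sites)


good = EstAlignment("est1", 98.5, 99.0, [("GT", "AG"), ("gc", "ag")])
low_coverage = EstAlignment("est2", 96.0, 99.0)
odd_intron = EstAlignment("est3", 99.0, 99.0, [("GT", "TT")])
```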
Extra pre-processing of ESTs: clip poly A tails, clip 20 bp from either end
Best in genome
Remove potential processed pseudogenes
Give preference to ESTs that are spliced
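The clipping steps can be sketched in a few lines. The minimum run length of 10 and the treatment of a leading poly-T run (as seen when a 3’ read is reported on the opposite strand) are assumptions for illustration; the slides only state that poly-A tails and 20 bp from either end are clipped. Function names are hypothetical.

```python
import re


def clip_poly_a(seq: str, min_run: int = 10) -> str:
    """Remove a trailing poly-A run or a leading poly-T run."""
    seq = re.sub(r"A{%d,}$" % min_run, "", seq, flags=re.IGNORECASE)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq, flags=re.IGNORECASE)
    return seq


def clip_ends(seq: str, n: int = 20) -> str:
    """Drop n bases from each end; return an empty string if too short."""
    if len(seq) <= 2 * n:
        return ""
    return seq[n:len(seq) - n]


tail_clipped = clip_poly_a("ACGTACGT" + "A" * 15)
ends_clipped = clip_ends("N" * 20 + "ACGT" + "N" * 20)
```

Short runs of A at the end of a read are left alone, since they may be genuine sequence rather than a tail.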
Human ESTGenes
[Figure: genomic length distribution of aligned human ESTs, annotated at ~400 bp, with a tail up to ~800 kb]
The Problem
What are the transcripts represented in this set of mapped ESTs?
[Figure: ESTs aligned to the genome, and the transcript predictions to be derived from them]
Predict Transcripts from ESTs
Merge ESTs according to splicing structure compatibility
Representation
[Figure: ESTs x, y and z illustrating the two relations between compatible ESTs: extension and inclusion]
Sort by the smallest coordinate ascending and by the largest coordinate descending
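This ordering is straightforward to express with a compound sort key. The sketch below assumes each EST is summarised as a hypothetical `(id, start, end)` tuple of genomic coordinates.

```python
def sort_ests(ests):
    """Sort by smallest coordinate ascending, then largest descending."""
    return sorted(ests, key=lambda est: (est[1], -est[2]))


ordered = sort_ests([
    ("a", 100, 200),
    ("b", 50, 300),
    ("c", 50, 400),
    ("d", 100, 500),
])
ordered_ids = [est[0] for est in ordered]
```

With this order, an EST whose span contains another one always comes before it, so a candidate container has already been seen when the contained EST is processed.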
Any two ESTs in a genomic cluster either represent the same splicing (redundant) or do not
The redundancy relation is a graph
[Figure: graph with edges between ESTs x and y, and between x and z]
Criteria of merging
Allow internal mismatches
Allow intron mismatches
Allow edge-exon mismatches
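A simplified sketch of the pairwise comparison behind merging. It classifies two ESTs as an inclusion, an extension or not mergeable, by requiring that the introns falling in their shared genomic range are identical. Unlike the criteria listed above, it tolerates no mismatches at all; the exon representation (sorted lists of inclusive `(start, end)` pairs) and the return labels are assumptions made for the example.

```python
def introns(exons):
    """Introns as (end of exon, start of next exon) pairs."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def compare(x, y):
    """Return 'x includes y', 'y includes x', 'extension' or None."""
    lo = max(x[0][0], y[0][0])      # start of the shared range
    hi = min(x[-1][1], y[-1][1])    # end of the shared range
    if lo > hi:
        return None                 # no overlap, nothing to merge

    def shared(exons):
        # Introns that have at least one base inside the shared range.
        return [i for i in introns(exons) if i[0] < hi and i[1] > lo]

    if shared(x) != shared(y):
        return None                 # different splicing in the overlap
    if x[0][0] <= y[0][0] and y[-1][1] <= x[-1][1]:
        return "x includes y"
    if y[0][0] <= x[0][0] and x[-1][1] <= y[-1][1]:
        return "y includes x"
    return "extension"


x = [(100, 200), (300, 400), (500, 600)]
extends_x = [(350, 400), (500, 650)]
inside_x = [(320, 400), (500, 550)]
skips_exon = [(100, 200), (500, 600)]
```

Here `skips_exon` overlaps `x` but lacks its middle exon, so the two represent different splicing and are not merged.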
Transitivity
[Figure: transitivity of the extension and inclusion relations, illustrated with ESTs x, y, z and w]
This reduces the number of comparisons needed
ClusterMerge graph
[Figure: graph over ESTs x, y, z and w showing extension and inclusion edges]
Each node defines an inclusion sub-tree
Extensions form acyclic graphs
Recovering the Solution
[Figure: example graph with nodes 1 to 9, with the root and the leaves marked]
Mergeable sets of ESTs can be recovered as special paths in the graph
Leaf: not-extended and root of an inclusion tree
Root: does not extend any node
Any set of ESTs in a path from a root to a leaf is mergeable
Add the inclusion tree attached to each node in the path
Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9)
This representation minimizes the necessary comparisons between ESTs
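The recovery step can be sketched as a walk over every root-to-leaf path of the extension graph, adding the inclusion tree of each node on the path. The edges in the example are hypothetical: the original figure is not recoverable from the transcript, so they were chosen only so that the sketch reproduces the two lists quoted above. The dictionary layout and function name are likewise assumptions.

```python
def mergeable_lists(extended_by, includes):
    """Enumerate mergeable sets of ESTs.

    extended_by[n]: nodes that extend node n (extension edges).
    includes[n]:    nodes directly included in node n (inclusion tree).
    """
    nodes = set(extended_by) | set(includes)
    for children in list(extended_by.values()) + list(includes.values()):
        nodes.update(children)
    extenders = {c for cs in extended_by.values() for c in cs}
    included = {c for cs in includes.values() for c in cs}
    # Roots extend no other node; included nodes are reached via their parent.
    roots = sorted(n for n in nodes if n not in extenders and n not in included)

    def inclusion_tree(n):
        members = []
        for child in includes.get(n, []):
            members.append(child)
            members.extend(inclusion_tree(child))
        return members

    def paths(n):
        nxt = extended_by.get(n, [])
        if not nxt:                      # leaf: not extended by anything
            return [[n]]
        return [[n] + rest for child in nxt for rest in paths(child)]

    results = []
    for root in roots:
        for path in paths(root):
            members = set(path)
            for n in path:
                members.update(inclusion_tree(n))
            results.append(sorted(members))
    return results


# Hypothetical edges: 1 is extended by 2, 2 by 3, 3 by 8 and by 9.
lists = mergeable_lists(
    extended_by={1: [2], 2: [3], 3: [8, 9]},
    includes={2: [4, 5], 3: [6], 6: [7]},
)
```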
How to build the graph
Mutual Recursion
Search graph (leaves)
Recursion search along extension branch
Search sub-graph
Inclusion => go up in the tree
Example
[Figure sequence: a graph of ESTs 1 to 6 into which a new EST, 7, is placed. The successive slides are labelled, in order: Leaves, Inclusion, Inclusion, Extension, Inclusion, Place, Inclusion, tagged as visited - skip.]
Possible sub-trees beyond 1 or 3 remain unseen!
The representation minimizes the necessary comparisons
Deriving the transcripts from the lists
Internal Splice Sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
Splice Sites: are set to the most common coordinate
5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most
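These rules can be sketched for one mergeable list of ESTs. Each EST is assumed to be a sorted list of inclusive `(start, end)` exons; exons that overlap are treated as the same transcript exon. The outer ends of each EST's first and last exon do not vote on splice sites, splice sites take the most common coordinate (ties broken towards the smaller value, an arbitrary choice), and the transcript ends take the outermost coordinate seen. All names are hypothetical.

```python
from collections import Counter


def most_common(coords):
    """Most frequent coordinate; the smallest one wins a tie."""
    counts = Counter(coords)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


def derive_transcript(ests):
    # Tag each exon with whether its start / end is an internal splice site.
    tagged = []
    for exons in ests:
        last = len(exons) - 1
        for i, (start, end) in enumerate(exons):
            tagged.append((start, end, i > 0, i < last))
    tagged.sort()

    # Group overlapping exons: each group becomes one transcript exon.
    groups = []
    for item in tagged:
        if groups and item[0] <= groups[-1]["end"]:
            groups[-1]["items"].append(item)
            groups[-1]["end"] = max(groups[-1]["end"], item[1])
        else:
            groups.append({"end": item[1], "items": [item]})

    transcript = []
    for group in groups:
        items = group["items"]
        starts = [it[0] for it in items if it[2]]
        ends = [it[1] for it in items if it[3]]
        # Without splice-site evidence, extend the transcript end the most.
        start = most_common(starts) if starts else min(it[0] for it in items)
        end = most_common(ends) if ends else max(it[1] for it in items)
        transcript.append((start, end))
    return transcript


merged = derive_transcript([
    [(100, 200), (300, 400)],
    [(150, 200), (300, 400), (500, 580)],
    [(120, 201), (300, 400), (500, 600)],
    [(350, 400), (500, 590)],
])
```

In the example, the donor site at 201 is outvoted by two ESTs supporting 200, and the transcript ends are set to 100 and 600.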
Single exon transcripts
Reject resulting single exon transcripts when using ESTs
Annotation with ESTs
ESTs aligned to the genome can provide information about UTRs and alternative splicing
EST-Transcripts at www.ensembl.org
Results for Human and Mouse
Human EST-genes (assembly ncbi33):
38,581 Genes
122,247 Transcripts (42% with full CDS)
Mouse EST-genes (assembly ncbi30):
32,848 Genes
103,664 Transcripts (36% with full CDS)
How many transcripts are conserved?
Is Alternative Splicing conserved?
EST-transcript pairs
42,625 transcript pairs (in 18,242 gene pairs)
Of the gene pairs:
78% with one transcript pair conserved
22% with more than one transcript pair conserved
For 22% of the gene pairs, some form of alt. splicing is conserved
Conservation of Alt. Splicing
Take gene-pairs with more than one transcript-pair
19% of alt. variants in human are conserved in mouse
32% of alt. variants in mouse are conserved in human
\[
\%\,\text{conservation} = \frac{\sum \left(\text{number of paired transcripts} - 1\right)}{\sum \left(\text{number of transcripts} - 1\right)}
\]
∑ = sum over genes in a gene pair with more than one variant (the −1 subtracts the ‘main’ transcript form)
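The conservation measure can be computed directly from per-gene counts. This sketch assumes the caller has already selected the genes of one species that belong to the gene pairs of interest, and supplies for each a hypothetical `(number of transcripts, number of paired transcripts)` tuple. The factor of 100 is added here to express the ratio as a percentage.

```python
def percent_conservation(genes):
    """Percentage of alternative variants conserved in the other species.

    `genes` is a list of (n_transcripts, n_paired_transcripts) tuples.
    One transcript per gene is subtracted as the 'main' form.
    """
    numerator = sum(paired - 1 for _, paired in genes)
    denominator = sum(total - 1 for total, _ in genes)
    if denominator == 0:
        raise ValueError("no alternative variants in the selected genes")
    return 100.0 * numerator / denominator


# Three genes with 4, 3 and 2 transcripts, of which 2, 1 and 2 are paired.
value = percent_conservation([(4, 2), (3, 1), (2, 2)])
```

In the example, 2 of the 6 alternative variants are paired, giving one third conserved.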
How many predicted ‘novel’ genes are validated by Human-Mouse comparison?
[Figure: overlap of three sets of ESTGenes]
ESTGenes not in Ensembl: 13,174
Human ESTGenes validated by comparison to mouse: 18,242
ESTGenes with at least one complete ORF: 24,201
Novel genes (ESTGenes not in Ensembl, validated by comparison to mouse, with a complete ORF): 984
THE END