alternative splicing from ests eduardo eyras bioinformatics upf – february 2004

Post on 20-Dec-2015


Alternative Splicing from ESTs

Eduardo Eyras

Bioinformatics UPF – February 2004

Intro

ESTs

Prediction of Alternative Splicing from ESTs

[Figure: transcription produces a pre-mRNA made of exons and introns; splicing yields the mature mRNA (5’ cap, poly-A tail); translation yields the peptide]

[Figure: the same pre-mRNA spliced differently yields a different mature mRNA (5’ cap, poly-A tail), and translation yields a different peptide]

Alt splicing as a mechanism of gene regulation

Functional domains can be added or removed, increasing protein diversity

Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs

It can modify the activity of transcription factors, affecting the expression of genes

It is observed in nearly all metazoans

Estimated to occur in 30%-40% of human genes

Forms of alternative splicing

Exon skipping / inclusion

Alternative 3’ splice site

Alternative 5’ splice site

Mutually exclusive exons

Intron retention

[Figure legend: constitutive exon; alternatively spliced exons]

How to study alternative splicing?

ESTs (Expressed Sequence Tags)

Single-pass sequencing of a small (end) piece of cDNA

Typically 200-500 nucleotides long

It may contain coding and/or non-coding region

ESTs

Cells from a specific organ, tissue or developmental stage

mRNA extraction

Add oligo-dT primer

Reverse transcriptase

Ribonuclease H

DNA polymerase, Ribonuclease H

Double stranded cDNA

[Figure: RNA and DNA strands, with poly-A (AAAAAA) and oligo-dT (TTTTTT) ends, shown at each step]

ESTs

Clone cDNA into a vector

Multiple cDNA clones

Single-pass sequence reads: 5’ EST and 3’ EST

Splice variants

[Figure: genomic sequence, primary transcript, splicing, cDNA clones, and the resulting 5’ and 3’ EST sequences]

Alternative Splicing from ESTs


ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)

EST sequencing

Is fast and cheap

Gives direct information about the gene sequence

Partial information

Resulting ESTs (DB searches):

Known gene

Similar to known gene

Contaminant

Novel gene

ESTs provide expression data

eVOC Ontologies http://www.sanbi.ac.za/evoc/

Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.

Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.

Pathology: The pathological state of the sample from which the sample was prepared. Examples are: normal, lymphoma, and congenital.

Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.

Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

Linking the expression vocabulary to gene annotations

ESTs

Genes

Normalized vs. non-normalized libraries

The downside of ESTs

Cannot detect weakly or rarely expressed genes, or non-expressed sequences (e.g. regulatory regions)

Random sampling: the more ESTs we sequence, the fewer new useful sequences we get
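The diminishing return of random sampling can be illustrated with a small calculation. This is only a sketch under an assumed model: each EST is drawn independently, with probability proportional to the abundance of its gene. The function name `expected_genes_seen` and the abundance values are made up for illustration and do not come from the slides.

```python
def expected_genes_seen(abundances, n_ests):
    """Expected number of distinct genes hit at least once by n_ests random ESTs.

    Assumes each EST is an independent draw, with gene i chosen with
    probability abundances[i] / sum(abundances).
    """
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** n_ests for a in abundances)


# Hypothetical library: 10 highly expressed genes and 990 weakly expressed ones.
abundances = [1000.0] * 10 + [1.0] * 990

for n in (1_000, 10_000, 100_000):
    print(n, round(expected_genes_seen(abundances, n)))
```

Under this model, each additional batch of ESTs adds fewer new genes than the previous one, and the weakly expressed genes are the last to be sampled.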

Gene Hunting

Sequencing of the Human Genome (HGP) EST Sequencing

Origin of the ESTs

Science. 1991 Jun 21;252(5013):1651-6

Complementary DNA sequencing: expressed sequence tags and human genome project.

Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.

Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.

Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.

EST-sequencing explosion

Merck and WashU (1994)

public ESTs

GenBank

dbEST

non-exclusivity (1992)

dbEST release 20 February 2004

Number of public entries: 20,039,613

Summary by organism

Homo sapiens (human) 5,472,005

Mus musculus + domesticus (mouse) 4,056,481

Rattus sp. (rat) 583,841

Triticum aestivum (wheat) 549,926

Ciona intestinalis 492,511

Gallus gallus (chicken) 460,385

Danio rerio (zebrafish) 450,652

Zea mays (maize) 391,417

Xenopus laevis (African clawed frog) 359,901

…

EST lengths

Human EST length distribution (dbEST Sep. 2003 )

~ 450 bp

Recover the mRNA from the ESTs

What is an EST cluster?

A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity

Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene.  (Burke, Davison, Hide, Genome Research 1999).

EST pre-processing

Screen for: vector, repeats, mitochondrial sequences, xenocontaminants

EST Clustering

UniGene (NCBI) www.ncbi.nlm.nih.gov/UniGene

TIGR Human Gene Index www.tigr.org

(The Institute for Genomic Research)

StackDB www.sanbi.ac.za

(South African Bioinformatics Institute)

UniGene

Species UniGene Entries

Homo sapiens 118,517

Mus musculus 82,482

Rattus norvegicus 43,942

Sus scrofa 20,426

Gallus gallus 11,970

Xenopus laevis 21,734

Xenopus tropicalis 17,102

ESTs and the Genome

ESTs aligned to the genome

Some advantages:

• It defines the location of exons and introns

• We can verify the splice sites of introns (e.g. GT-AG), and hence also check the correct strand of spliced ESTs

• It helps prevent chimeras

• It can avoid putting together ESTs from paralogous genes

• We can prevent including pseudogenes in our analysis
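The strand check from splice sites can be sketched as follows. This is an illustrative sketch, not the code behind the slides: the function names, the 0-based end-exclusive coordinates and the toy sequence are assumptions. It accepts the donor/acceptor pairs GT-AG, GC-AG and AT-AC, and reports the strand on which the intron ends match one of them.

```python
# Donor/acceptor dinucleotide pairs accepted as valid intron ends.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_strand(genome, start, end):
    """Strand implied by the intron genome[start:end] (0-based, end-exclusive).

    Returns '+' if the intron ends match an accepted pair on the forward
    strand, '-' if they match on the reverse strand, and None otherwise.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return None
    if (intron[:2], intron[-2:]) in CANONICAL:
        return "+"
    rc = revcomp(intron)
    if (rc[:2], rc[-2:]) in CANONICAL:
        return "-"
    return None


# Toy genomic sequence with a GT...AG intron at positions 3..10.
toy_genome = "AAAGTCCCCAGTTT"
print(intron_strand(toy_genome, 3, 11))
```

A spliced EST whose introns all report the same strand can be oriented accordingly; an EST with introns on conflicting strands, or with no recognisable splice sites, is a candidate for rejection.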

Aligning ESTs to the Genome

Many ESTs: fast programs, fast computers

Nearly exact matches: coverage >= 97%, percent_id >= 97%

Splice sites: GT-AG, AT-AC, GC-AG
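The two thresholds can be applied as a simple filter. The sketch below is an assumption about how the quantities are defined: coverage is taken as the percentage of EST bases included in the alignment, and percent_id as the percentage of aligned bases identical to the genome. The `Alignment` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    est_id: str
    est_length: int      # length of the EST in bases
    aligned_bases: int   # EST bases included in the alignment
    matches: int         # aligned bases identical to the genome

    @property
    def coverage(self):
        """Percentage of the EST that is aligned (assumed definition)."""
        return 100.0 * self.aligned_bases / self.est_length

    @property
    def percent_id(self):
        """Percentage of aligned bases matching the genome (assumed definition)."""
        if self.aligned_bases == 0:
            return 0.0
        return 100.0 * self.matches / self.aligned_bases


def passes(aln, min_coverage=97.0, min_percent_id=97.0):
    """True if the alignment meets both thresholds."""
    return aln.coverage >= min_coverage and aln.percent_id >= min_percent_id


alignments = [
    Alignment("est_a", 400, 396, 390),  # high coverage, high identity
    Alignment("est_b", 400, 380, 380),  # coverage too low
    Alignment("est_c", 400, 400, 380),  # identity too low
]
kept = [a.est_id for a in alignments if passes(a)]
print(kept)
```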

Aligning ESTs to the Genome

Extra pre-processing of ESTs:

Clip poly-A tails

Clip 20 bp from either end

Best in genome

Remove potential processed pseudogenes

Give preference to ESTs that are spliced
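The clipping step can be sketched in a few lines. Several choices here are assumptions rather than details from the slides: both clips are applied in sequence, a tail counts as poly-A only if it has at least `min_tail` bases, and a leading poly-T run is treated as a reverse-complemented tail. The function name `clip_est` is hypothetical.

```python
import re


def clip_est(seq, end_clip=20, min_tail=10):
    """Remove a poly-A tail (or leading poly-T), then clip both ends.

    end_clip: bases removed from each end (20 bp in the slides).
    min_tail: minimum run length treated as a tail (an assumption).
    Returns an empty string if nothing would be left after clipping.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)  # poly-A at the 3' end
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)  # poly-T at the 5' end
    if len(seq) <= 2 * end_clip:
        return ""
    return seq[end_clip:len(seq) - end_clip]


toy_est = "C" * 20 + "ACGT" * 10 + "G" * 20 + "A" * 15
clipped = clip_est(toy_est)
print(len(toy_est), len(clipped))
```

Clipping the ends discards the part of a single-pass read where sequencing errors and residual vector are most likely, at the cost of losing a few genuine bases.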

Human ESTGenes

[Figure: genomic length distribution of aligned human ESTs; labels: ~400 bp, tail up to ~800 kb]

The Problem

What are the transcripts represented in this set of mapped ESTs?

[Figure: ESTs aligned to the genome, and the transcript predictions to be derived from them]

Predict Transcripts from ESTs

Merge ESTs according to splicing structure compatibility

Representation

[Figure: the two relations between compatible ESTs, extension and inclusion, drawn for ESTs x, y and z]

Sort by the smallest coordinate ascending and by the largest coordinate descending

Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not

The redundancy relation is a graph:

[Figure: graph linking x to y and x to z]
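The sorting rule and the two relations can be sketched as below. This is a strict simplification, not the ClusterMerge implementation: two ESTs are treated as compatible only if they have exactly the same introns in the region where they overlap, with no tolerance for mismatches. An EST is a list of `(start, end)` exons with inclusive coordinates, and all function names are hypothetical.

```python
def introns(exons):
    """Introns implied by consecutive exons, as inclusive (start, end) pairs."""
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def sort_key(exons):
    """Smallest coordinate ascending, then largest coordinate descending."""
    return (exons[0][0], -exons[-1][1])


def relation(x, y):
    """Relation of y to x, where x sorts before y.

    Returns 'inclusion' if y is compatible with x and lies within its span,
    'extension' if y is compatible and reaches beyond the end of x,
    and None if they do not overlap or their splicing differs.
    """
    x_end = x[-1][1]
    y_end = y[-1][1]
    lo, hi = max(x[0][0], y[0][0]), min(x_end, y_end)
    if lo > hi:
        return None  # no genomic overlap

    def introns_in_overlap(est):
        return {(s, e) for s, e in introns(est) if s <= hi and e >= lo}

    if introns_in_overlap(x) != introns_in_overlap(y):
        return None  # different splicing in the shared region
    return "inclusion" if y_end <= x_end else "extension"


x = [(100, 200), (300, 400), (500, 600)]
y_included = [(300, 400), (500, 550)]
y_extending = [(550, 600), (700, 800)]
y_skipping = [(150, 200), (500, 600)]  # skips the middle exon of x

ordered = sorted([y_extending, y_included, x], key=sort_key)
print(relation(x, y_included), relation(x, y_extending), relation(x, y_skipping))
```

Sorting first means each EST only has to be compared with ESTs that start at or before it, which is what allows the relation to be evaluated in one direction only.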

Criteria of merging

Allow internal mismatches

Allow intron mismatches

Allow edge-exon mismatches

Transitivity

[Figure: extension and inclusion relations chained across ESTs x, y, z and w]

This reduces the number of comparisons needed

ClusterMerge graph

[Figure: example ClusterMerge graph over ESTs x, y, z and w]

Each node defines an inclusion sub-tree

Extensions form acyclic graphs

Recovering the Solution

[Figure: example graph with nodes 1 to 9, its root and its leaves marked]

Mergeable sets of ESTs can be recovered as special paths in the graph

Root: does not extend any node

Leaf: not-extended and root of an inclusion tree

Any set of ESTs in a path from a root to a leaf is mergeable

Add the inclusion tree attached to each node in the path

Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9)

This representation minimizes the necessary comparisons between ESTs
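Recovering the lists can be sketched as a walk over the extension graph. The sketch below is an illustration under assumptions, not the original code: `extended_by[x]` lists the ESTs that extend `x`, `includes[x]` lists the ESTs directly included in `x`, and the toy graph with nodes `a` to `f` is made up (it is not the graph from the slides).

```python
def inclusion_subtree(node, includes):
    """The node plus everything in the inclusion tree hanging from it."""
    found = [node]
    for child in includes.get(node, []):
        found.extend(inclusion_subtree(child, includes))
    return found


def mergeable_lists(extended_by, includes):
    """One list of ESTs per root-to-leaf path of the (acyclic) extension graph.

    A root is a node that extends no other node; a leaf is a node that is
    not extended by any other node.
    """
    nodes = set(extended_by)
    extenders = set()
    for targets in extended_by.values():
        nodes.update(targets)
        extenders.update(targets)
    roots = sorted(nodes - extenders)

    results = []

    def walk(node, path):
        path = path + [node]
        successors = extended_by.get(node, [])
        if not successors:  # leaf reached: collect the path and its inclusions
            merged = set()
            for n in path:
                merged.update(inclusion_subtree(n, includes))
            results.append(sorted(merged))
            return
        for nxt in successors:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return results


# Hypothetical toy graph: b extends a; c and d both extend b.
extended_by = {"a": ["b"], "b": ["c", "d"]}
includes = {"a": ["e"], "c": ["f"]}
lists = mergeable_lists(extended_by, includes)
print(lists)
```

An EST with no extension relations at all has to appear as a key with an empty list in `extended_by` to be reported as its own root and leaf.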

How to build the graph

Mutual Recursion

Search graph (leaves)

Recursion search along extension branch

Search sub-graph

Inclusion => go up in the tree

Example

[Figure sequence: step-by-step placement of a new EST (7) in a graph built from ESTs 1 to 6. The steps are labelled, in order: Leaves, Inclusion, Inclusion, Extension, Inclusion, Place, Inclusion, tagged as visited - skip]

Possible sub-trees beyond 1 or 3 remain unseen!

The representation minimizes the necessary comparisons

Deriving the transcripts from the lists

Internal Splice Sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute


Splice sites: set to the most common coordinate

5’ and 3’ coordinates: set to the exon coordinate that extends the potential UTR the most
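These two rules can be sketched for a single splice site and for the transcript ends. This is a minimal illustration under assumptions: the coordinates supporting one splice site have already been grouped, and "extends the potential UTR the most" is read as taking the smallest start and the largest end in genomic coordinates. The function names and numbers are hypothetical.

```python
from collections import Counter


def consensus_splice_site(coords):
    """Most common coordinate among the ESTs supporting one splice site."""
    return Counter(coords).most_common(1)[0][0]


def transcript_ends(starts, ends):
    """Outermost start and end among the merged ESTs."""
    return min(starts), max(ends)


# Three ESTs support the same donor site, one of them with an offset.
site = consensus_splice_site([201, 201, 203])
ends = transcript_ends([100, 120, 90], [600, 640, 610])
print(site, ends)
```

Following the slide, only internal exon boundaries should be passed to `consensus_splice_site`; the external coordinates of the first and last exons go to `transcript_ends` instead.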

Single exon transcripts

Reject resulting single exon transcripts when using ESTs

Annotation with ESTs

ESTs aligned to the genome can provide information about UTRs and alternative splicing


EST-Transcripts at www.ensembl.org


Results for Human and Mouse

Human EST-genes (assembly ncbi33):

38,581 Genes

122,247 Transcripts (42% with full CDS)

Mouse EST-genes (assembly ncbi30):

32,848 Genes

103,664 Transcripts (36% with full CDS)

How many transcripts are conserved?

Is Alternative Splicing conserved?

EST-transcript pairs

42,625 transcript pairs (in 18,242 gene pairs)

Gene pairs:

78% with one transcript pair conserved

22% with more than one transcript pair conserved

For 22% of the gene pairs, some form of alt. splicing is conserved

Conservation of Alt. Splicing

Take gene-pairs with more than one transcript-pair

19% of alt. variants in human are conserved in mouse

32% of alt. variants in mouse are conserved in human

\[
\%\,\text{conservation} = \frac{\sum \left( \text{number of paired transcripts} - 1 \right)}{\sum \left( \text{number of transcripts} - 1 \right)}
\]

∑ = sum over genes in a gene pair with more than one variant (subtract the ‘main’ transcript form)
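The conservation measure can be computed directly from per-gene counts. In this sketch the result is multiplied by 100 to express it as a percentage, which is an assumption about the intended scale; the function name and the counts are made up for illustration.

```python
def percent_conservation(genes):
    """Percentage of alternative variants that are conserved.

    genes: (number of transcripts, number of paired transcripts) for each
    gene in a gene pair with more than one variant. One transcript per
    gene is subtracted as the 'main' form.
    """
    genes = list(genes)
    paired = sum(n_paired - 1 for _, n_paired in genes)
    total = sum(n_transcripts - 1 for n_transcripts, _ in genes)
    return 100.0 * paired / total


# Hypothetical counts for three genes.
toy_genes = [(4, 2), (3, 2), (5, 3)]
print(round(percent_conservation(toy_genes), 1))
```

Because each species contributes its own transcript counts to the denominator, the measure is not symmetric, which is consistent with the different values reported for human and mouse.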

How many predicted ‘novel’ genes are validated by Human-Mouse comparison?

ESTGenes with at least one complete ORF: 24,201

ESTGenes not in Ensembl: 13,174

Human ESTGenes validated by comparison to mouse: 18,242

Novel genes (ESTGenes not in Ensembl, validated by comparison to mouse, with a complete ORF): 984

THE END
