Understanding genes using mathematical tools Adam Sartiel COMPUGEN

Upload: hope-hines

Post on 02-Jan-2016


Understanding genes using mathematical tools

Adam Sartiel, COMPUGEN

2

Short History Of Compugen

• 1993: Founded
• 1994: First Bioccelerator sold (Merck)
• 1997: LEADS project initiated
• 1998: Pfizer collaboration
• 1999: USPTO agreement; LabOnWeb launched
• 2000: Launch of Z3; IPO
• 2001: Gencarta and OligoLibraries launched; Novartis collaboration

3

Unique R&D Team

• Substantial
– 120 professionals
– 32 PhD/MD, 37 M.Sc.
• Multidisciplinary
– Algorithm development, Molecular biology, Software engineering, Statistics, Physics, Chemistry
• Integrated
– Synergy between disciplines and feedback

4

Gene analysis using mathematics

• Drug discovery and Bioinformatics
• Principles of sequence alignment
• The EST opportunity and the Transcriptome
• Applications (Gencarta and DNA chips)

5

Cellular pathways are highly complex

- identified targets

6

$500M

The Drug Development Process

7

Some definitions

• ‘Drug’ – protein, lipid, antibody, or small organic molecule which has proven effect and approved safety level.

• ‘Lead’ – A molecule in development which may one day become a drug

• ‘Target’ – A protein (in most cases) whose activity a drug lead would affect, in order to create a desirable effect on the body.

• ‘Validated target’ – A target which has a proven, demonstrated effect on a disease or condition.

8

30,000 GENES?

• Fewer genes than initially thought?
• Some complexity due to alternative splicing
• Gene prediction is problematic
• Complex genes (interleaved, nested,...) are especially difficult to identify
• Both HGP and Celera tried to minimize false positives
• Conclusion: more genes may be found

Wright et al., Genome Biology 2001 2(7): There are 65,000 – 75,000 genes

9

ONE GENE ONE PROTEIN???

Old Dogma

Gene → mRNA → Protein

Current understanding

Gene → several mRNAs (including edited mRNA) → several proteins (including modified protein)

10

Gene identification using sequence comparison

              A    C    G    C    T    G
         0    1    2    3    4    5    6
0        0   -1   -2   -3   -4   -5   -6
1   C   -1   -1    1    0   -1   -2   -3
2   A   -2    1    0    0   -1   -2   -3
3   T   -3    0    0   -1   -1    1    0
4   G   -4   -1   -1    2    1    0    3
5   T   -5   -2   -2    1    1    3    2
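The score matrix on this slide can be reproduced with a standard global-alignment (Needleman-Wunsch style) recurrence. The slide does not state its scoring scheme; the values match +2, mismatch -1 and gap -1 used below are an assumption inferred from the numbers in the matrix, and the function name `nw_matrix` is illustrative only. A minimal sketch:

```python
def nw_matrix(rows_seq, cols_seq, match=2, mismatch=-1, gap=-1):
    """Fill a global-alignment score matrix (scoring values are assumed)."""
    n, m = len(rows_seq), len(cols_seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = rows_seq[i - 1] == cols_seq[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in the column sequence
            left = score[i][j - 1] + gap    # gap in the row sequence
            score[i][j] = max(diag, up, left)
    return score


# Row sequence CATGT and column sequence ACGCTG, as labelled on the slide.
matrix = nw_matrix("CATGT", "ACGCTG")
for label, row in zip(" CATGT", matrix):
    print(label, " ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment; tracing back from it along the choices that produced each maximum recovers the alignment itself.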

11

Similar sequences, common ancestor...

... common ancestor, similar function

Understand genes = know your targets

12

The genetic code is redundant

13

Proteins ‘see’ deeper

Unrelated DNA sequences?

TTACTCCGTCATGATGGGGTG

CTGATAAGGAAAGAAGGCTAT

Highly related proteins!

LeuLeuArgHisAspGlyVal

LeuIleArgLysGluGlyTyr
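The point of the last two slides can be checked directly: translating both DNA sequences with the standard genetic code gives the two peptides shown, and the same table shows the redundancy (several codons per amino acid). The sketch below is illustrative; the helper names `translate` and `identity` are not from the talk, and a U in the input is read as T.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon into one-letter amino acids."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


dna1 = "TTACTCCGTCATGATGGGGTG"
dna2 = "CTGATAAGGAAAGAAGGCTAT"
prot1, prot2 = translate(dna1), translate(dna2)

# Redundancy: every codon that encodes leucine.
leu_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")

print(prot1, prot2)
print("identical DNA positions:", identity(dna1, dna2), "of", len(dna1))
print("identical residues:", identity(prot1, prot2), "of", len(prot1))
print("leucine codons:", leu_codons)
```

Plain identity counting understates how related the peptides are, because several of the differing positions are chemically similar pairs (Leu/Ile, Asp/Glu, and the two basic residues His/Lys). Protein alignment scores reward such substitutions, which is why comparison at the protein level ‘sees’ deeper than comparison at the DNA level.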

14

How to align proteins?

MARQGEFPSILK

M-RHGEFP-LLKWC

‘Good’

‘Bad’

Running a good algorithm against 2001-size databases requires supercomputers

15

Another direction: find genes by sequence

ACGATCGAGCATGCATCATCAGCATCTAGCGATCAGCAGGCATCGAGCAGCTAGCATGCATG

TGCTAGCACGTACGTAGTAGTCGTAGATCGCTAGTCGTCCGTAGCTCGTCGATCGTACGTCAC

- Gene regions have a different nucleotide composition than non-coding regions.
- Introns and exons are distinct in sequence.
- Splice junctions are clearly detectable.
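The first bullet, composition differences, can be illustrated with the simplest possible compositional statistic: GC fraction in a sliding window. This is only a sketch of the idea; real gene finders use much richer statistics, and the function names, window size and step below are illustrative assumptions rather than anything described in the talk.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def windowed_gc(seq, window=20, step=10):
    """GC fraction of each window, as (start position, fraction) pairs."""
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


# Top strand of the sequence shown on the slide.
slide_seq = "ACGATCGAGCATGCATCATCAGCATCTAGCGATCAGCAGGCATCGAGCAGCTAGCATGCATG"
for start, fraction in windowed_gc(slide_seq):
    print(f"{start:3d}  {fraction:.2f}")
```

A shift in the windowed value along a long genomic sequence is one weak hint of a boundary between regions with different composition; on its own it is far from sufficient to call a gene.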

16

One step ahead: the story of the ESTs

Genomic DNA (exon 1, exon 2, exon 3) → mRNA → cDNA → cDNA clone → ESTs

Public domain ESTs (Expressed Sequence Tags): > 5,000,000

Craig Venter

17

The ESTs: Rough Diamonds?

• Short, inaccurate, badly annotated
• Abundant with repeats, alternative splicing
• Too many…
• The shredder effect

18

USING ESTS TO GET THE TRANSCRIPTOME

Input: GenBank - a pool of ESTs and mRNAs

Process 1 - Clustering

Cluster 1 Cluster 2 Cluster 3 Cluster 4

Process 2 - Assembly

Output: The transcriptome
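The clustering step can be sketched as grouping ESTs that share sequence. The talk does not describe how LEADS clusters, so the approach below (linking any two ESTs that share an exact k-mer, using union-find) is a stand-in chosen for brevity; the toy sequences, the function names and the value of k are all assumptions. It also ignores reverse-complement matches, sequencing errors and repeats, which a real pipeline must handle.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=6):
    """Group ESTs (by index) that are linked through shared exact k-mers."""
    parent = list(range(len(ests)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq, k):
            if kmer in first_owner:
                parent[find(idx)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Toy reads: 0 and 1 overlap, 2 and 3 overlap, the two pairs are unrelated.
toy_ests = [
    "ACGATCGAGCATGCA",
    "GCATGCATCATCAGC",
    "TTTTGGGGCCCCAAAA",
    "GGCCCCAAAATTCG",
]
clusters = cluster_ests(toy_ests)
print(clusters)
```

Assembly would then take each cluster separately and merge its overlapping members into one or more transcript sequences.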

19

The Transcriptome - Definition

“The mRNA collection content, present at any given moment in a cell or a tissue, and its behavior over time and cell states”

20

Introducing the Transcriptome

• The Genome:
– Index to the range of possible proteins
– Useful as map and for inter-organisms analysis
• The Proteome:
– Describes what actually happens in the cell
– Complex tools, partial results
• The Transcriptome:
– “Golden path”: Proteome information in DNA technology.

21

Transcriptome applications

• Discovery of new proteins
– Which are present in specific tissues
– Which have specific cell locations
– Which respond to specific cell states
• Discovery of new variants
– Of important genes
– Which work to increase/decrease the activity of the ‘native’ protein.

22

Example: Alternative Splicing

One Gene - Multiple mRNAs

Pre mRNA: exons 1 2 3 4 5 6

Alternative Splicing

Various Mature mRNA Transcripts:

exons 1 2 4 5 6

exons 1 2 3 5 6

exons 1 2 3 4 5 6

Tissue labels in the figure: (tissue A), (tissue B), (Other tissues)

23

Alternative Splicing vs. “Contiging”

“Contiging”:

“Assembling”: Contig impossible

24

Extreme example of alternative splicing

Mature PSA

PSA precursor

PSA RNA

Genomic

Modified mRNA

LM precursor

Mature LM protein

Stop codon

Stop codon

Signal peptide

Signal peptide

Alternative splicing

Though coded by the same gene, mature proteins PSA and LM have not one residue in common!

25

Is This The Only Example?

PSA genomic: exon 1, exon 2, exon 3, exon 4, exon 5 (LM)

KLK-2 genomic: exon 1, exon 2, exon 3, exon 4, exon 5 (KLM)

* Stop codon

26

Validation: Northern Blot

• Like PSA, LM expression is restricted to prostate tissue
• Multiple bands may reflect conserved regions or alternative splicing

27

Example: receptor with DN

Dominant Negative

28

Natural Antisense – a regulation mechanism?

29

LEADS Antisense Prediction

• When analyzing EST data for Antisense:

– Use original EST orientation annotation

– Check splicing signals on both strands

– Examine library description for enzymes used

– Mark PolyA signals and PolyA tails (compare to genomic PolyA)

– Take into account NotI sites
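One of the checks above, splicing signals on both strands, can be sketched as follows. Most introns begin with GT and end with AG on the strand that is transcribed; seen from the opposite genomic strand the same intron begins with CT and ends with AC. The function names and the example intron are hypothetical, and this is not presented as the LEADS rule set, which the talk does not spell out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def intron_strand(intron):
    """Guess the transcribed strand from the intron's boundary dinucleotides.

    Returns '+' for GT...AG, '-' for CT...AC (GT...AG on the other strand),
    and None when neither canonical signal is present.
    """
    intron = intron.upper()
    if len(intron) < 4:
        return None
    if intron.startswith("GT") and intron.endswith("AG"):
        return "+"
    if intron.startswith("CT") and intron.endswith("AC"):
        return "-"
    return None


# Hypothetical intron sequence, as it would appear on each genomic strand.
forward_intron = "GTAAGTCCCTTTCAG"
print(intron_strand(forward_intron))
print(intron_strand(reverse_complement(forward_intron)))
```

When two overlapping transcripts at the same locus give opposite answers, that supports a genuine sense/antisense pair rather than a mislabelled EST orientation, which is why the splice signal is worth checking alongside the annotation and PolyA evidence.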

30

Example: A Putative SNP

Cluster T07189 Position 347

31

Cluster T07189

Position 347

SNP Verification

32

Using Compugen’s Transcriptome Technology

• Large-scale collaborations: Pfizer, Novartis
• Co-development of molecules: TNF, Chemokine receptors, kinases, GPCRs
• Academic research: UCSF, NYU, TAU.
• Database products
• DNA chip design
• Mass-spec analysis
• Gene Ontology

33

Chip Design on Alternative Splicing

Variant-specific or common probes can be designed
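One way to picture the choice between variant-specific and common probes is to compare exon-exon junctions across splice variants: a junction present in every variant is a candidate site for a common probe, and a junction found in only one variant is a candidate for a variant-specific probe. The sketch below assumes three variants of a six-exon gene like the one in the earlier alternative splicing example (exon 3 skipped, exon 4 skipped, all exons kept); the variant names and the function names are hypothetical, and real probe design also has to weigh melting temperature, cross-hybridization and similar constraints.

```python
def junctions(exons):
    """Exon-exon junctions of a transcript given its ordered exon numbers."""
    return set(zip(exons, exons[1:]))


def classify_junctions(variants):
    """Split junctions into those shared by all variants and those unique to one."""
    per_variant = {name: junctions(exons) for name, exons in variants.items()}
    common = set.intersection(*per_variant.values())
    specific = {}
    for name, own in per_variant.items():
        others = set().union(*(j for n, j in per_variant.items() if n != name))
        specific[name] = own - others
    return common, specific


# Assumed exon layouts of three splice variants of one gene.
variants = {
    "skip_exon_3": [1, 2, 4, 5, 6],
    "skip_exon_4": [1, 2, 3, 5, 6],
    "all_exons": [1, 2, 3, 4, 5, 6],
}
common, specific = classify_junctions(variants)
print("common junctions:", sorted(common))
for name, unique in specific.items():
    print(name, "specific junctions:", sorted(unique))
```

In this layout no single exon is unique to one variant, so only junction-spanning probes can tell the three transcripts apart.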

34

How many ‘genes’ are there really?

• Raw data:
– 3,770,969 human sequences
– 2,061,357 mouse sequences
– 297,568 rat sequences
• Non-singleton ‘clusters’: 120,372 H, 63,043 M, 33,396 R
• % with splice variants: 26% (H), 32% (M), 23% (R)
• Homology (to SwissProt+Trembl, InterPro, other GC proteins): 20% (H+M), 27% (R).
• Total unique proteins: 236,797 (H), 106,119 (M), 32,352 (R)

35

The Novartis Agreement

• Signed August 2001
• Novartis non-exclusively licensed the LEADS platform and related software, and plans to use it for:
– In-silico drug target identification and prioritization
– Genome wide chip design
• Agreement was signed after a detailed pilot study run in November 2000
– Discovered novel genes and splice variants using Incyte and Celera data
• Genes were subsequently verified in Novartis laboratory.

36

GENCARTA

• Result of LEADS applied to:
– Public genome information
– Published mRNA
– ESTs

• In-house designed interface, Oracle-based infrastructure.

• Installed: Kyowa-Hakko, Avalon Pharma, Weizmann Institute, YU

• Version 2.2 out in October 2001.

37

Let’s go for the real thing…

• Gencarta Demonstration
• OligoLibrary Demonstration

38

Conclusion: Advantages of the Transcriptome

• Identify new drug targets
• Understand splice variant behavior
• Isolate “natural” drugs
• Annotate Proteomics experiments
• Design better DNA chips

Solve the real bottlenecks in drug discovery and development
