Scale of the ‘unknown’ gene problem


Upload: conner

Post on 30-Jan-2016


Page 1: Scale of the ‘unknown’ gene problem

Comparative genomics outline

• Scale of the ‘unknown’ gene problem

• Shared plant-prokaryote genes

• Comparative genomics
  - When Blast tells you nothing…
  - The ‘guilt by association’ principle
  - ‘Two-dimensional’ gene annotation
  - SEED subsystems

• Plant-prokaryote examples
  - Filling ‘pathway holes’ – FolQ
  - Linking new functions to known systems – COG0354

Page 2: Scale of the ‘unknown’ gene problem

Whole genome sequencing progress

[Figure: number of genomes (0 to 10000), complete and ongoing, from Dec 1997 to Mar 2011. Source: www.genomesonline.org]

● Functional annotation of genes has nowhere near kept pace

● Functional annotations are often absent, vague, or wrong

Page 3: Scale of the ‘unknown’ gene problem

Orphan enzymes

• 1437/3736 enzymes (38%) with EC numbers have no associated genes

Orphan genes

• 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc.)

Page 4: Scale of the ‘unknown’ gene problem

The unknown protein problem in various groups

[Figure: percentage of known and unknown proteins (0 to 100%) encoded by diverse genomes. Bacteria: Escherichia coli, Lactobacillus casei, Staphylococcus aureus, Chlamydia trachomatis, Acidobacterium, Solibacter usitatus, Synechocystis. Archaea: Pyrococcus abyssi, Haloarcula marismortui. Eukarya: Human, Arabidopsis]

Data from The SEED http://theseed.uchicago.edu/

Page 5: Scale of the ‘unknown’ gene problem

Plants & prokaryotes share many (unknown) genes

Source of genes     Number of genes   % of genome
Cyanobacteria       5470              21.0
Proteobacteria      1170              4.6
Gram+ bacteria      2280              9.1
Other bacteria      1160              4.6
Archaea             1090              4.4
Total               11170             43.4

From de Crecy-Lagard & Hanson Trends Microbiol 15: 563 (2007)

● Shared genes are identifiably from various groups

● Estimates for Arabidopsis vary – but all are many thousands

● Functions of most shared genes are metabolic

● Plants are conglomerates of microbial metabolic genes

● Many opportunities for comparative genomics

Page 6: Scale of the ‘unknown’ gene problem

The power of comparative genomics

● Suppose you have an unknown plant protein:

● BlastP search gives various prokaryote hits

● None of them has a clear function. Dead end?

● No! This is the beginning of comparative genomics

● Predicts functions via ‘guilt by association’ principle

● Genes of related function are associated in various ways

● e.g. enzymes in a pathway, proteins in a complex

● Whatever a gene’s associates do, it probably does too
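
The ‘guilt by association’ idea can be sketched as a simple vote: collect the annotated functions of a gene's associates and rank them by how often they occur. The gene names, function labels and the `predict_function` helper are hypothetical and only illustrate the principle; a real analysis would weight the different kinds of evidence rather than count them equally.

```python
from collections import Counter


def predict_function(associates, annotations):
    """Rank candidate functions for an unknown gene by counting the
    annotated functions of its associated genes (one vote per associate)."""
    votes = Counter(annotations[g] for g in associates if g in annotations)
    return votes.most_common()


# Hypothetical data: an unknown gene with four associates,
# three of which carry a functional annotation.
annotations = {
    "geneA": "folate synthesis",
    "geneB": "folate synthesis",
    "geneC": "Fe/S cluster assembly",
}
associates = ["geneA", "geneB", "geneC", "geneD"]  # geneD is unannotated

ranking = predict_function(associates, annotations)
print(ranking)
```

The top-ranked function is only a prediction; as the slides stress, it still has to be tested by genetics or biochemistry.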

Page 7: Scale of the ‘unknown’ gene problem

Association evidence

[Diagram: sources of association evidence that feed predictions]

Genomic evidence: gene clustering, gene fusion, shared regulatory sites, phylogenetic occurrence

Post-genomic evidence: protein-protein interactions, organelle proteomes, co-expression, structures, essentiality & other phenome data

Predictions, then testing (genetics, biochemistry)
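
One of the genomic evidence types, phylogenetic occurrence, can be sketched as a comparison of presence/absence profiles: genes that occur in the same set of genomes are candidates for a shared function. The genome IDs, gene names and the choice of Jaccard similarity are assumptions made for this illustration, not the method used in the slides.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two sets of genomes in which a gene occurs."""
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical occurrence data: genome IDs in which each gene is found.
occurrence = {
    "geneX": {"g1", "g2", "g4"},
    "geneY": {"g1", "g2", "g4"},  # same distribution as geneX
    "geneZ": {"g3"},              # never found together with geneX
}

score_xy = jaccard(occurrence["geneX"], occurrence["geneY"])
score_xz = jaccard(occurrence["geneX"], occurrence["geneZ"])
print(score_xy, score_xz)
```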

Page 8: Scale of the ‘unknown’ gene problem

Two-dimensional gene annotation

• ‘Dimensions’ are:
  - Molecular function (e.g., an enzyme activity with EC no.)
  - Functional context (e.g., other enzymes of a pathway)

• ‘2-Dimensions good, 1-dimension bad’
  - Even an EC no. function may be wrong if pathway not there
  - Pathway context may be wrong if certain enzymes missing

• GenBank etc annotations are 1-dimensional (mol. function)
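
A minimal sketch of the second dimension: given a molecular-function call, check how much of its pathway context is present in the same genome. The role names are taken from the folate pathway shown later in the deck; the two genomes and the `context_support` helper are hypothetical.

```python
def context_support(role, genome_roles, pathway_roles):
    """Fraction of the other roles of a pathway that are also
    annotated in the genome (the functional context of `role`)."""
    others = [r for r in pathway_roles if r != role]
    if not others:
        return 1.0
    return sum(r in genome_roles for r in others) / len(others)


pathway = ["FolE", "FolB", "FolK", "FolP", "FolC", "FolA"]

# Hypothetical genomes: one with the whole pathway, one with a lone FolK call.
genome_with_pathway = {"FolE", "FolB", "FolK", "FolP", "FolC", "FolA"}
genome_lone_call = {"FolK"}

support_full = context_support("FolK", genome_with_pathway, pathway)
support_lone = context_support("FolK", genome_lone_call, pathway)
print(support_full, support_lone)
```

A call with no supporting context, like the lone FolK here, is the kind of one-dimensional annotation that deserves a second look.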

Page 9: Scale of the ‘unknown’ gene problem

SEED subsystems

• Subsystems (SSs) capture both annotation dimensions
  - Sets of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)

• SSs cover many genomes, have form of spreadsheet:
  - Columns are molecular functions
  - Rows are genomes
  - Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome

[Figure: folate biosynthesis subsystem spreadsheet, with a pathway hole marked]
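
The spreadsheet layout described above can be sketched as a nested mapping, with a pathway hole showing up as an empty cell. This is an illustration only, not the SEED's data format or API; the genome names and gene IDs are hypothetical.

```python
# Rows are genomes, columns are molecular functions (roles),
# and each cell lists the genes assigned to that role.
subsystem = {
    "genome_1": {"FolE": ["g1_0101"], "FolQ": [], "FolB": ["g1_0102"]},
    "genome_2": {"FolE": ["g2_0450"], "FolQ": ["g2_0453"], "FolB": ["g2_0451"]},
}


def pathway_holes(subsystem):
    """Return {genome: [roles with no assigned gene]} for genomes with holes."""
    holes = {}
    for genome, row in subsystem.items():
        missing = [role for role, genes in row.items() if not genes]
        if missing:
            holes[genome] = missing
    return holes


holes = pathway_holes(subsystem)
print(holes)
```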

Page 10: Scale of the ‘unknown’ gene problem

Plant – prokaryote examples

• Prokaryote association evidence is mainly genomic

• Plant association evidence is mainly post-genomic

• Post-genomic evidence is noisier but very useful

• Superb plant post-genomic resources:
  - Microarrays, RNAseq (organ- and environment-specific)
  - Organellar targeting prediction, proteomics (location can rule out function)
  - Phenome databases (chlorosis, lethality can support function)
  - Huge EST databases
  - Vast plant metabolism bibliome

Page 11: Scale of the ‘unknown’ gene problem

FolQ – Filling a pathway hole

[Figure: folate synthesis pathway. GTP → DHN-P3 (FolE) → DHN-P (FolQ) → DHN ([P-ase]) → HMDHP (FolB) → HMDHP-P2 (FolK) → DHP (FolP, adds pABA) → DHF (FolC, adds Glu) → THF (FolA). Side branch: Chorismate → ADC (PabAB) → pABA (PabC)]

• FolQ universally missing (prokaryotes, plants, fungi, protists)

• Missing step known to be a pyrophosphohydrolase, ~17 kDa

• Search genomes for small hydrolase clustered with fol genes

• YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa

• YlgG has a plant homolog – At1g68760

[Figure: Lactococcus lactis folate gene cluster: folC, folEK, folP, ylgG]
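
The search strategy on this slide (a small hydrolase clustered with fol genes) can be sketched as a filter over a gene list. The thresholds, gene positions, most protein masses and the two decoy genes are hypothetical; only the 19 kDa Nudix hydrolase YlgG next to fol genes comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    position: int    # ordinal position along the chromosome
    mass_kda: float  # predicted protein mass
    annotation: str


def candidates(genes, max_mass_kda=25.0, max_gap=3):
    """Names of small hydrolases located within `max_gap` genes of a fol gene."""
    fol_positions = [g.position for g in genes if g.name.startswith("fol")]
    hits = []
    for g in genes:
        if g.name.startswith("fol"):
            continue
        near_fol = any(abs(g.position - p) <= max_gap for p in fol_positions)
        is_hydrolase = "hydrolase" in g.annotation.lower()
        if near_fol and is_hydrolase and g.mass_kda <= max_mass_kda:
            hits.append(g.name)
    return hits


genes = [
    Gene("folC", 10, 47.0, "folylpolyglutamate synthase"),  # mass hypothetical
    Gene("folP", 11, 30.0, "dihydropteroate synthase"),     # mass hypothetical
    Gene("ylgG", 12, 19.0, "Nudix hydrolase family protein"),
    Gene("bigH", 13, 60.0, "hydrolase"),   # hypothetical decoy: too large
    Gene("farH", 500, 18.0, "hydrolase"),  # hypothetical decoy: not clustered
]

found = candidates(genes)
print(found)
```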

Page 12: Scale of the ‘unknown’ gene problem

FolQ – Experimental tests

• ylgG KO accumulates DHN-P3

[Figure: fluorescence (0 to 240) against minutes (2 to 6) for WT and KO, with the DHN-P3 peak labeled]

• YlgG & At1g68760 act on DHN-P3

• Recombinant proteins release DHN-P + PPi

[Figure: product formation (nmol/assay) of DHNP, Pi and PPi for YlgG (0 to 1.5) and At1g68760 (0 to 0.9)]

[Figure: folate synthesis pathway, repeated from the previous slide]

Page 13: Scale of the ‘unknown’ gene problem

COG0354 – Linking a new function to known system

• In all kingdoms of life
  - Bacteria
  - Archaea
  - Fungi
  - Animals
  - Plants

• 2 plant proteins
  - 1 related to rickettsias (mitochondria)
  - 1 related to cyanobacteria (plastids)

• Homolog of GcvT protein
  - But clearly a distinct clade

[Figure: phylogenetic tree of COG0354 proteins from Mouse, Fly, Yeast, Leishmania, Arabidopsis (At4g12130, At1g60990), Haloarcula, Natronomonas, Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, Synechococcus, Nostoc, Corynebacterium, Streptomyces, Solibacter, Blastopirellula and Pirellula, with the GcvT proteins (yeast, mouse, Arabidopsis, rice) as a separate clade. Figure label: Folate-dependent]

COG0354 – A folate protein for Fe/S cluster repair in oxidative stress

Page 14: Scale of the ‘unknown’ gene problem

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis
  - Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
  - And with ferritin 2 (Fe storage)

[Figure: expression across a developmental series, mitochondrial COG0354 plotted with mitochondrial frataxin and with ferritin 2. Source: Arabidopsis Transcriptome DB (Max Planck Institute, Golm)]
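
Co-expression of the kind shown on this slide is commonly summarized with a correlation coefficient across samples. The sketch below computes a Pearson correlation by hand; the expression values are invented for illustration and are not taken from the Arabidopsis Transcriptome DB, and the slide does not say which statistic was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression values across five developmental stages.
cog0354 = [1.0, 2.0, 3.0, 4.0, 5.0]
frataxin = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises and falls with cog0354
unrelated = [5.0, 1.0, 4.0, 1.0, 5.0]  # no linear relation to cog0354

r_frataxin = pearson(cog0354, frataxin)
r_unrelated = pearson(cog0354, unrelated)
print(round(r_frataxin, 3), round(r_unrelated, 3))
```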

Page 15: Scale of the ‘unknown’ gene problem

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis

• Clusters with Fe/S proteins

[Figure: gene clusters in which COG0354 (labeled 0354) lies beside genes for Fe/S proteins or Fe/S partners]

● Nif cluster in Methylococcus capsulatus (nifQ, fd, nifX, nifN, nifE, fd, nifH, nifD, nifK)

● Suf cluster in Rubrobacter xylanophilus (sufC, sufB, sufD, sufS, thiC)

● Sdh operon in Stenotrophomonas maltophilia (sdhC, sdhD, sdhA, sdhB)

● NAD synthesis cluster in Pelagibacter ubique (nadA, nadC)

● MiaB (Radical SAM) in Buchnera aphidicola

Page 16: Scale of the ‘unknown’ gene problem

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis

• Clusters with Fe/S proteins

• Only occurs if IscA is present
  - IscA proteins are scaffolds in Fe/S cluster assembly

[Figure: presence or absence (gene present / gene absent) of COG0354 and IscA across taxa.
Bacteria: Firmicutes (Clostridiales, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae), Fusobacteria, Actinobacteria (Bifidobacterium), Cyanobacteria, Acidobacteria, δ/ε-Proteobacteria (Campylobacterales, Bdellovibrionales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, Desulfobacterales), α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes (Bacteroidales, Flavobacteria, Sphingobacteria), Deinococcus/Thermus, Chloroflexi, Thermotogae, Spirochaetes.
Archaea: Nanoarchaeota, Euryarchaeota (Methanococci, Methanomicrobia, Archaeoglobi, Halobacteria, Methanobacteria, Methanopyri, Thermococci, Thermoplasmata), Crenarchaeota]
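
The statement "only occurs if IscA is present" is a dependency that can be checked against a presence/absence table: no taxon should have COG0354 without IscA. The table below is a hypothetical stand-in, not the data behind the slide, and `dependency_violations` is an illustrative helper.

```python
def dependency_violations(presence, gene, required):
    """Taxa in which `gene` is present although `required` is absent."""
    return [
        taxon
        for taxon, genes in presence.items()
        if gene in genes and required not in genes
    ]


# Hypothetical presence data: the set of genes found in each taxon.
presence = {
    "taxon_1": {"COG0354", "IscA"},
    "taxon_2": {"IscA"},  # IscA alone does not break the rule
    "taxon_3": set(),     # neither gene present
}

violations = dependency_violations(presence, "COG0354", "IscA")
print(violations)
```

An empty list is consistent with the dependency; any taxon returned would be a counterexample worth inspecting.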

Page 17: Scale of the ‘unknown’ gene problem

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis

• Clusters with Fe/S proteins

• Only occurs if IscA is present

• Associated with aerobic lifestyle

Page 18: Scale of the ‘unknown’ gene problem

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis

• Clusters with Fe/S proteins

• Only occurs if IscA is present

• Associated with aerobic lifestyle

• H2O2-induced in E. coli

• High-throughput screens
  - Essentiality & phenomics
    ● Essential gene in: Mycobacterium tuberculosis, Haemophilus influenzae, Pseudomonas aeruginosa
    ● Important gene in: E. coli (slow growth), Yeast (petite)
  - Proteomics
    ● Plant proteins both expressed
    ● Cyano-like protein in plastids

● E. coli protein has folate site

Page 19: Scale of the ‘unknown’ gene problem

COG0354 – Predictions & Experimental Validation

COG0354 PREDICTIONS

● Is a folate-dependent enzyme

● Combats oxidative stress

● Helps make/repair Fe/S clusters

● Function is ancient & ubiquitous (like Fe/S proteins themselves)

Experimental validation

● Folate mutations abolish activity

● Mutant is oxidative stress-sensitive

● Mutant has many Fe/S enzyme defects

● Complementation by genes from all kingdoms

[Figure: complementation test on LB + plumbagin (oxidative stress). Controls: vector, E. coli. Plant & mammal: plant M, plant C, mammal. Fungi, protist, Archaea: yeast, protist, Archaea]

Page 20: Scale of the ‘unknown’ gene problem

The power of comparative genomics

“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”

William Whewell (1794-1866)
English scientist, philosopher, Anglican priest
An early influence on Charles Darwin
Coined the term “scientist”

The string: a hypothesis that connects and unifies observations