scale of the ‘unknown’ gene problem
Comparative genomics outline
• Scale of the ‘unknown’ gene problem
• Shared plant-prokaryote genes
• Comparative genomics
  • When Blast tells you nothing…
  • The ‘guilt by association’ principle
  • ‘Two-dimensional’ gene annotation
  • SEED subsystems
• Plant-prokaryote examples
  • Filling ‘pathway holes’ – FolQ
  • Linking new functions to known systems – COG0354
Whole genome sequencing progress
● Functional annotation of genes has nowhere near kept pace
● Functional annotations are often absent, vague, or wrong
[Figure: number of genomes (0 to 10,000), ongoing and complete, Dec 1997 to Mar 2011. Source: www.genomesonline.org]
Orphan enzymes
• 1437/3736 enzymes (38%) with EC numbers have no associated genes
Orphan genes
• 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc.)
The unknown protein problem in various groups
[Figure: percentage of unknown vs. known proteins (0 to 100%) encoded by diverse genomes. Bacteria: Escherichia coli, Lactobacillus casei, Staphylococcus aureus, Chlamydia trachomatis, Acidobacterium, Solibacter usitatus, Synechocystis. Archaea: Pyrococcus abyssi, Haloarcula marismortui. Eukarya: Human, Arabidopsis.]
Data from The SEED http://theseed.uchicago.edu/
Plants & prokaryotes share many (unknown) genes

Source of genes     Number of genes   % of genome
Cyanobacteria       5470              21.0
Proteobacteria      1170              4.6
Gram+ bacteria      2280              9.1
Other bacteria      1160              4.6
Archaea             1090              4.4
Total               11170             43.4

From de Crecy-Lagard & Hanson Trends Microbiol 15: 563 (2007)
● Estimates for Arabidopsis vary – but all are many thousands
● Functions of most shared genes are metabolic
● Shared genes identifiably from various groups
● Plants are conglomerates of microbial metabolic genes
● Many opportunities for comparative genomics
The power of comparative genomics
● Suppose you have an unknown plant protein:
● BlastP search gives various prokaryote hits
● None of them have clear functions
● Dead end?
● No! This is the beginning of comparative genomics
● Predicts functions via ‘guilt by association’ principle
● Genes of related function are associated in various ways
● e.g. enzymes in a pathway, proteins in a complex
● Whatever a gene’s associates do, it probably does too
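The ‘guilt by association’ idea can be sketched as a simple vote among a gene's associates. This is only an illustration: the gene names, the function labels, and the `predict_function` helper are hypothetical and do not come from the talk, and real pipelines weigh different evidence types rather than counting them equally.

```python
from collections import Counter

def predict_function(associates, known_functions):
    """Return (most common known function among associates, share of annotated associates)."""
    # Only associates that already have an annotation can vote.
    votes = Counter(
        known_functions[gene] for gene in associates if gene in known_functions
    )
    if not votes:
        return None, 0.0
    function, count = votes.most_common(1)[0]
    return function, count / sum(votes.values())

# Hypothetical annotations and associations, for illustration only.
known_functions = {
    "geneA": "folate biosynthesis",
    "geneB": "folate biosynthesis",
    "geneC": "Fe/S cluster assembly",
}
associates_of_unknown = ["geneA", "geneB", "geneC", "geneD"]  # geneD is itself unknown

prediction, support = predict_function(associates_of_unknown, known_functions)
print(prediction, round(support, 2))
```

The result is a prediction to be tested by genetics or biochemistry, not an annotation in its own right.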
Association evidence
[Figure: types of association evidence feeding into predictions.
Genomic evidence: gene clustering, gene fusion, shared regulatory sites, phylogenetic occurrence.
Post-genomic evidence: protein-protein interactions, organelle proteomes, co-expression, structures, essentiality & other phenome data.
Evidence → Predictions → Testing (genetics, biochemistry)]
Two-dimensional gene annotation
• ‘Dimensions’ are:
  • Molecular function (e.g., an enzyme activity with EC no.)
  • Functional context (e.g., other enzymes of a pathway)
• ‘2-Dimensions good, 1-dimension bad’
  • Even an EC no. function may be wrong if pathway not there
  • Pathway context may be wrong if certain enzymes missing
• GenBank etc. annotations are 1-dimensional (mol. function)
SEED subsystems
• Subsystems (SSs) capture both annotation dimensions
  • Sets of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)
• SSs cover many genomes, have form of spreadsheet:
  • Columns are molecular functions
  • Rows are genomes
  • Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome
[Figure: folate biosynthesis subsystem spreadsheet, with a pathway hole marked]
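The spreadsheet layout described above (rows are genomes, columns are functional roles, cells are gene lists) maps naturally onto a nested dictionary, and a ‘pathway hole’ is then an empty cell. The sketch below is a toy model, not the SEED data format or API; the genome names and gene IDs are invented, and only the role names are borrowed from the folate pathway slides.

```python
# Columns of the subsystem: functional roles (names from the folate pathway slides).
ROLES = ["FolE", "FolQ", "FolB", "FolK", "FolP", "FolC", "FolA"]

# Rows are genomes; each cell lists the gene IDs filling that role.
# All genome names and gene IDs below are made up for illustration.
subsystem = {
    "genome_1": {
        "FolE": ["g1_0101"], "FolB": ["g1_0102"], "FolK": ["g1_0103"],
        "FolP": ["g1_0104"], "FolC": ["g1_0230"], "FolA": ["g1_0411"],
    },
    "genome_2": {
        "FolE": ["g2_0007"], "FolQ": ["g2_0011"], "FolB": ["g2_0008"],
        "FolK": ["g2_0009"], "FolP": ["g2_0010"], "FolC": ["g2_0150"],
        "FolA": ["g2_0300"],
    },
}

def pathway_holes(subsystem, roles):
    """Map each genome to the roles that have no assigned gene (its pathway holes)."""
    holes = {}
    for genome, cells in subsystem.items():
        missing = [role for role in roles if not cells.get(role)]
        if missing:
            holes[genome] = missing
    return holes

holes = pathway_holes(subsystem, ROLES)
print(holes)
```

A genome that has every other role filled but one empty cell is a natural place to look for a missing gene.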
• Prokaryote association evidence is mainly genomic
• Plant association evidence is mainly post-genomic
• Post-genomic evidence is noisier but very useful
• Superb plant post-genomic resources:
  • Microarrays, RNAseq (organ- and environment-specific)
  • Organellar targeting prediction, proteomics (location can rule out function)
  • Phenome databases (chlorosis, lethality can support function)
  • Huge EST databases
  • Vast plant metabolism bibliome
Plant – prokaryote examples
FolQ – Filling a pathway hole
• Missing step known to be a pyrophosphohydrolase, ~17 kDa
• Search genomes for small hydrolase clustered with fol genes
• YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa
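The search described in these bullets (a small hydrolase sitting near fol genes) might be sketched as a filter over an ordered gene list. Everything here is an assumption made for illustration: the `Gene` record, the locus tags, the masses other than the 19 kDa quoted for YlgG, and the cutoffs of 25 kDa and three gene positions. The talk does not say how the search was actually run.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    locus: str        # hypothetical locus tag
    name: str         # gene name, "" if unnamed
    annotation: str   # free-text annotation
    mass_kda: float   # predicted protein mass

def candidates_near_fol(genome, max_mass_kda=25.0, window=3):
    """Return loci of small hydrolases within `window` gene positions of any fol gene."""
    fol_positions = [i for i, g in enumerate(genome) if g.name.startswith("fol")]
    hits = []
    for i, gene in enumerate(genome):
        if gene.name.startswith("fol"):
            continue  # known folate genes are the bait, not candidates
        small_hydrolase = (
            "hydrolase" in gene.annotation.lower() and gene.mass_kda <= max_mass_kda
        )
        near_fol = any(abs(i - p) <= window for p in fol_positions)
        if small_hydrolase and near_fol:
            hits.append(gene.locus)
    return hits

# Toy chromosome segment in gene order; all values except YlgG's 19 kDa are invented.
genome = [
    Gene("locus_1", "folC", "folate gene", 48.0),
    Gene("locus_2", "folEK", "folate gene", 40.0),
    Gene("locus_3", "folP", "folate gene", 30.0),
    Gene("locus_4", "ylgG", "Nudix hydrolase family", 19.0),
    Gene("locus_5", "", "hydrolase", 60.0),          # nearby but too large
    Gene("locus_6", "", "transporter", 35.0),
    Gene("locus_7", "", "hypothetical protein", 12.0),
    Gene("locus_8", "", "kinase", 28.0),
    Gene("locus_9", "", "hydrolase", 18.0),          # small but far from fol genes
]

hits = candidates_near_fol(genome)
print(hits)
```

Combining the two filters (size and annotation from the biochemistry, position from gene clustering) is what narrows the list to a testable candidate.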
Folate synthesis pathway
GTP →(FolE)→ DHN-P3 →(FolQ)→ DHN-P →([P-ase])→ DHN →(FolB)→ HMDHP →(FolK)→ HMDHP-P2 →(FolP, + pABA)→ DHP →(FolC, + Glu)→ DHF →(FolA)→ THF
Chorismate →(PabAB)→ ADC →(PabC)→ pABA
• YlgG has a plant homolog – At1g68760
• FolQ universally missing (prokaryotes, plants, fungi, protists)
Lactococcus lactis folate gene cluster: folC folEK folP ylgG
FolQ – Experimental tests
• ylgG KO accumulates DHN-P3
[Figure: fluorescence (0 to 240) vs. minutes for WT and KO, with the DHN-P3 peak labeled]
• YlgG & At1g68760 act on DHN-P3
• Recombinant proteins release DHN-P + PPi
[Figure: product formation (nmol/assay) of DHNP, Pi, and PPi for YlgG and for At1g68760]
COG0354 – Linking a new function to known system
COG0354 – A folate protein for Fe/S cluster repair in oxidative stress
• In all kingdoms of life
  - Bacteria
  - Archaea
  - Fungi
  - Animals
  - Plants
• 2 plant proteins
  - 1 related to rickettsias (mitochondria)
  - 1 related to cyanobacteria (plastids)
• Homolog of GcvT protein
  - But clearly a distinct clade
[Figure: phylogenetic tree. COG0354 proteins: Mouse, Fly, Yeast, Leishmania, At4g12130, At1g60990, Haloarcula, Natronomonas, Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, Synechococcus, Nostoc, Corynebacterium, Streptomyces, Solibacter, Blastopirellula, Pirellula. GcvT proteins: Yeast, Mouse, Arabidopsis, Rice. Label: Folate-dependent.]
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
  - Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
  - And with ferritin 2 (Fe storage)
[Figure: developmental series from the Arabidopsis Transcriptome DB (Max Planck Institute, Golm). Panels: mitochondrial COG0354 vs. mitochondrial frataxin; mitochondrial COG0354 vs. ferritin 2.]
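Co-expression of this kind is commonly scored with a Pearson correlation across samples. The sketch below assumes that measure; the talk does not state which statistic was used, and the expression values are invented numbers standing in for a developmental series.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)

# Invented expression levels across five developmental stages (illustration only).
cog0354 = [1.0, 2.0, 4.0, 3.0, 5.0]
frataxin = [1.2, 2.1, 3.9, 3.2, 4.8]

r = pearson(cog0354, frataxin)
print(round(r, 3))
```

A high coefficient across many conditions is association evidence only; as noted earlier in the talk, post-genomic evidence is noisier than genomic evidence and needs experimental follow-up.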
• Clusters with Fe/S proteins
  ● Nif cluster in Methylococcus capsulatus: nifQ fd nifX nifN nifE fd nifH nifD nifK
  ● Suf cluster in Rubrobacter xylanophilus: sufC sufB sufD sufS thiC
  ● Sdh operon in Stenotrophomonas maltophilia: sdhC sdhD sdhA sdhB
  ● NAD synthesis cluster in Pelagibacter ubique: nadA nadC
  ● MiaB (Radical SAM) in Buchnera aphidicola
[Figure: gene-cluster maps showing COG0354 (0354) next to the genes listed. Legend: COG0354, Fe/S protein, Fe/S partner.]
• Only occurs if IscA is present
  - IscA proteins are scaffolds in Fe/S cluster assembly
[Figure: phylogenetic occurrence (gene absent / gene present) of COG0354 and IscA. Bacteria: Firmicutes (Clostridiales, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae), Fusobacteria, Actinobacteria (Bifidobacterium), Cyanobacteria, Acidobacteria, δ/ε-Proteobacteria (Campylobacterales, Bdellovibrionales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, Desulfobacterales), α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes (Bacteroidales, Flavobacteria, Sphingobacteria), Deinococcus/Thermus, Chloroflexi, Thermotogae, Spirochaetes. Archaea: Nanoarchaeota, Euryarchaeota (Methanococci, Methanomicrobia, Archaeoglobi, Halobacteria, Methanobacteria, Methanopyri, Thermococci, Thermoplasmata), Crenarchaeota.]
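A phylogenetic-occurrence claim such as "COG0354 only occurs if IscA is present" can be checked as an implication over presence/absence profiles. The sketch below uses invented taxon names and profiles; it shows the form of the check only and does not reproduce the data behind the figure.

```python
def occurrence_violations(profiles, gene, required):
    """Return taxa where `gene` is present but `required` is absent."""
    return sorted(
        taxon
        for taxon, genes in profiles.items()
        if gene in genes and required not in genes
    )

# Hypothetical presence/absence profiles: taxon -> set of gene families present.
profiles = {
    "taxon_1": {"COG0354", "IscA"},
    "taxon_2": {"IscA"},     # IscA alone is allowed by the rule
    "taxon_3": set(),        # both absent
    "taxon_4": {"COG0354", "IscA"},
}

violations = occurrence_violations(profiles, gene="COG0354", required="IscA")
print(violations)
```

An empty list means the profiles are consistent with the one-way dependence. Taxa with IscA but no COG0354 do not count against it.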
• Associated with aerobic lifestyle
• H2O2-induced in E. coli
• High-throughput screens
  - Essentiality & phenomics
    ● Essential gene in:
      – Mycobacterium tuberculosis
      – Haemophilus influenzae
      – Pseudomonas aeruginosa
    ● Important gene in:
      – E. coli (slow growth)
      – Yeast (petite)
  - Proteomics
    ● Plant proteins both expressed
    ● Cyano-like protein in plastids
● E. coli protein has folate site
COG0354 – Predictions & Experimental Validation
COG0354 predictions, each with its experimental result:
● Is a folate-dependent enzyme → Folate mutations abolish activity
● Combats oxidative stress → Mutant oxidative stress-sensitive
● Helps make/repair Fe/S clusters → Mutant has many Fe/S enzyme defects
● Function is ancient & ubiquitous (like Fe/S proteins themselves) → Complementation by all kingdoms
[Figure: complementation test on LB + plumbagin (oxidative stress). Controls: vector, E. coli. Plant & mammal: plant M, plant C, mammal. Fungi, protist, Archaea: protist, yeast, Archaea.]
The power of comparative genomics
William Whewell (1794-1866)
English scientist, philosopher, Anglican priest
An early influence on Charles Darwin
Coined the term “scientist”
“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”
Hypothesis that connects and unifies observations