ChEBI
Kirill Degtyarenko, EMBL-EBI / EPO
• Rafael Alcántara
• Michael Ashburner *
• Volker Ast *
• Michael Darsow *
• Paula de Matos
• Marcus Ennis
• Janna Hastings
• Alan McNaught *
• Inma Spiteri
• Christoph Steinbeck
• Martin Zbinden *
The team
Chemical Entities of Biological Interest – an EBI database/dictionary of ‘biochemical compounds’
ChEBI: What is it?
Can be defined as consisting of
“molecules not directly encoded by the genome ... that are either the products of nature or are synthetic products used ... to intervene in the processes of living organisms”
[Michael Ashburner]
What are the ‘biochemical compounds’?
“Any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity”
[IUPAC “Gold Book”]
Molecular entity
• Molecular entities: trans-vaccenic acid
• Groups: trans-vaccenoyl group
• Classes: fatty acids
In fact, ChEBI contains
‘Small molecules’?
Yes, but big molecules as well!
• alumina
• amylose
• metaborate
• poly(vinyl alcohol)
Current status (17.12.08)
[Bar chart: number of items per category, axis 0 to 50,000]
Structures          14,274
Database Links       9,196
Formulae            13,163
Registry Numbers    15,773
IUPAC names         14,847
Synonyms            43,880
ChEBI entries       16,618
1-D ChEBI
• Numeric ID
• Carefully checked terminology
• Unambiguous ChEBI name
• IUPAC names
• Cross-references to free resources
Unambiguous ChEBI name
CHEBI:28918
L-adrenaline
not just ‘adrenaline’
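The "numeric ID plus unambiguous name" idea can be sketched as a small record type. This is only an illustration: the names `parse_chebi_id` and `Entry` and the field layout are hypothetical, not ChEBI's actual schema, and the ID pattern is inferred from the examples on these slides. Only CHEBI:28918 and the name L-adrenaline come from the slide.

```python
import re
from dataclasses import dataclass, field

# Assumption: an identifier is the prefix "CHEBI:" followed by digits,
# as in the examples shown on these slides.
_CHEBI_ID = re.compile(r"CHEBI:(\d+)")


def parse_chebi_id(text: str) -> int:
    """Return the numeric part of an identifier such as 'CHEBI:28918'."""
    match = _CHEBI_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a ChEBI identifier: {text!r}")
    return int(match.group(1))


@dataclass
class Entry:
    """Hypothetical 1-D record: one numeric ID, one unambiguous name."""
    chebi_id: int
    name: str
    synonyms: list[str] = field(default_factory=list)

    @property
    def accession(self) -> str:
        # Rebuild the textual form of the identifier from the number.
        return f"CHEBI:{self.chebi_id}"


adrenaline = Entry(parse_chebi_id("CHEBI:28918"), "L-adrenaline")
print(adrenaline.accession, adrenaline.name)
```

Keeping the number as the key and the name as an attribute means a name can be corrected without breaking cross-references that point at the ID.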
2-{[3-(trifluoromethyl)phenyl]amino}benzoic acid
[Structure diagram with ring locants 1 to 6 on each ring]
Systematic Name (IUPAC)
• flufenamic acid (INN English)
• acide flufénamique (INN French)
• ácido flufenámico (INN Spanish)
• acidum flufenamicum (INN Latin)
• Flufenaminsäure (German)
[Structure diagram]
Common Name
The Unpronounceables
CHEBI:48935
(E)-roxithromycin
IUPAC name:
(3R,4S,5S,6R,7R,9R,10E,11S,12R,13S,14R)-4-(2,6-dideoxy-3-C-methyl-3-O-methyl-α-L-ribo-hexopyranosyloxy)-14-ethyl-7,12,13-trihydroxy-10-{[(2-methoxyethoxy)methoxy]imino}-6-[3,4,6-trideoxy-3-(dimethylamino)-β-D-xylo-hexopyranosyloxy]-3,5,7,9,11,13-hexamethyloxacyclotetradecan-2-one
[Structure diagrams]
CHEBI:32109 (Z)-roxithromycin
What is the common name of roxithromycin?
CHEBI:48935 (E)-roxithromycin (INN: roxithromycin)
[Structure diagrams]
CHEBI:48844 roxithromycin
(E)-roxithromycin
[Structure diagram]
(Z)-roxithromycin
What is thiamine?
CHEBI:18385 thiamine(1+), aka thiamine
[Structure diagram]
CHEBI:33283 thiamine(1+) chloride, INN: thiamine
[Structure diagram]
CHEBI:49105 thiamine(2+) dichloride, aka thiamine chloride hydrochloride, aka thiamine hydrochloride
[Structure diagram]
• “Better to see the face than to hear the name” (Zen proverb)
• Structures and identifiers based on structures offer new ways of crosslinking to other databases
• Structure search
Need for 2-D
ChEBI
  9 10  0  0  0  0 999 V2000
   11.8219   -7.2713    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.8219   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6074   -7.0165    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -6.8574    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6039   -8.3505    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -8.5027    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.0886   -7.6818    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3923   -7.2713    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   10.3888   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2  5  1  0  0  0  0
  2  6  1  0  0  0  0
  3  7  1  0  0  0  0
  4  8  2  0  0  0  0
  6  9  2  0  0  0  0
  5  7  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
Connection table
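To show what a connection table carries, here is a minimal Python sketch that reads the counts line, atom block and bond block shown on this slide. The names `CTAB` and `parse_ctab` are hypothetical. The parser splits on whitespace and drops the trailing zero columns, which is a simplification: real molfiles use fixed-width columns and three header lines that the slide omits.

```python
from collections import Counter

# Counts line, atom block and bond block from the slide, one record per
# line. Trailing zero columns are dropped here for brevity (assumption:
# they carry no information needed for this sketch).
CTAB = """\
9 10 0 0 0 0 999 V2000
11.8219 -7.2713 0.0000 C
11.8219 -8.0922 0.0000 C
12.6074 -7.0165 0.0000 N
11.1072 -6.8574 0.0000 C
12.6039 -8.3505 0.0000 N
11.1072 -8.5027 0.0000 N
13.0886 -7.6818 0.0000 C
10.3923 -7.2713 0.0000 N
10.3888 -8.0922 0.0000 C
1 2 2
1 3 1
1 4 1
2 5 1
2 6 1
3 7 1
4 8 2
6 9 2
5 7 2
8 9 1
M END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a whitespace-separated V2000-style table."""
    lines = text.strip().splitlines()
    n_atoms, n_bonds = (int(x) for x in lines[0].split()[:2])
    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
composition = Counter(symbol for symbol, *_ in atoms)
# For a connected graph, bonds - atoms + 1 gives the number of rings.
rings = len(bonds) - len(atoms) + 1
print(dict(composition), rings)
```

The table lists heavy atoms only; hydrogens are implicit, which is why the parsed composition has no H entry.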
[Structure diagram: purine]
2-D ChEBI
• One or more 2-D (or 3-D) connection tables
• One is default
• Autogenerated images (PNG)
• Default diagrams should be unambiguous
The Fine Art of chemical drawing
Linear forms of monosaccharides
[Structure diagrams]
Pyranose forms of monosaccharides
[Structure diagrams]
Fused systems
(R)-camphor
[Structure diagrams: an ambiguous and an unambiguous drawing]
Square planar geometry
[Structure diagrams: cisplatin and transplatin]
SMILES
InChI
From 2-D back to 1-D
• Simplified Molecular Input Line Entry System
• Developed by David Weininger in 1988
• Extended by others (e.g. Daylight)
• String of standard ASCII characters
• A number of valid SMILES can be produced for the same molecule
SMILES (1)
SMILES (2)
[Structure diagram: purine]
N1C=NC2=C1C=NC=N2
c1ncc2ncnc2n1
C=1N\C=N/C\2=N/C=N\C=1/2
c1ncnc2/N=C\Nc12
n1cc2c(nc1)ncn2
[H]c1nc([H])c2n([H])c([H])nc2n1
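A quick sanity check on the six strings from this slide: whatever the notation, each should contain the same heavy atoms. The sketch below tokenises the atoms with a regular expression and counts them. The names `SMILES`, `ATOM` and `heavy_atom_counts` are hypothetical, and the tokeniser covers only what these examples need (no isotopes or charges). Equal counts are a necessary condition, not proof that two strings denote the same molecule; that requires canonicalisation in a cheminformatics toolkit.

```python
import re
from collections import Counter

# The six SMILES strings from the slide (raw strings keep the backslashes).
SMILES = [
    r"N1C=NC2=C1C=NC=N2",
    r"c1ncc2ncnc2n1",
    r"C=1N\C=N/C\2=N/C=N\C=1/2",
    r"c1ncnc2/N=C\Nc12",
    r"n1cc2c(nc1)ncn2",
    r"[H]c1nc([H])c2n([H])c([H])nc2n1",
]

# Bracket atoms first, then two-letter organic-subset atoms, then
# one-letter aliphatic or aromatic atoms.
ATOM = re.compile(r"\[([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br)|([BCNOPSFI]|[bcnops])")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element, ignoring bonds and ring digits."""
    counts = Counter()
    for bracket, two_letter, one_letter in ATOM.findall(smiles):
        symbol = (bracket or two_letter or one_letter).capitalize()
        if symbol != "H":  # explicit [H] atoms are not heavy atoms
            counts[symbol] += 1
    return counts


for smiles in SMILES:
    print(smiles, dict(heavy_atom_counts(smiles)))
```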
InChI (1)
• IUPAC International Chemical Identifier or InChI
• Open source
• Developed by Stein, Heller, Tchekhovskoi and McNaught
• Used by NIST, PubChem, CML… and ChEBI
InChI (2)
[Structure diagram: purine]
InChI=1/C5H4N4/c1-4-5(8-2-6-1)9-3-7-4/h1-3H,(H,6,7,8,9)/f/h7H
InChIKey=KDCGOANMDULRCW-QDQILVOLCG
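The InChI above is built from slash-separated layers, which a few lines of Python can pull apart. This is a rough sketch, not the official InChI software: `split_inchi` and the key names it produces are hypothetical, and only the layer prefixes that occur in this string (c, h, f) are considered. Everything after the /f marker is filed under a separate fixed-H section.

```python
INCHI = "InChI=1/C5H4N4/c1-4-5(8-2-6-1)9-3-7-4/h1-3H,(H,6,7,8,9)/f/h7H"


def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError(f"not an InChI string: {inchi!r}")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    section = "main"
    for layer in rest:
        if not layer:
            continue
        tag, value = layer[0], layer[1:]
        if tag == "f":
            # Layers after /f describe one fixed-hydrogen form.
            section = "fixed-H"
            continue
        layers[f"{section}/{tag}"] = value
    return layers


for key, value in split_inchi(INCHI).items():
    print(f"{key:10s} {value}")
```

In this example the main hydrogen layer records a mobile hydrogen shared by atoms 6 to 9, while the fixed-H section pins it to atom 7, which is how one string can describe both the tautomeric family and one specific tautomer.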
Limitations (1)
• Stereochemistry other than sp3 tetrahedral and sp2 trigonal planar
• Polymers
• Conformers
• Radicals/different spin state
• Topological isomers
• Mixtures
• Markush structures
Limitations (2)
InChI=1/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2
[Structure diagrams: cisplatin and transplatin]
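What the InChI on this slide does encode can be worked out from its formula and proton layers. The sketch below (function name `net_composition` is hypothetical; only the formula and /p layers are read) sums the dot-separated components with their multipliers and then applies the proton balance. The result is a composition only: nothing in the string says which ligands sit next to each other around platinum, which appears to be the limitation the slide illustrates with the two isomers.

```python
import re
from collections import Counter

INCHI = "InChI=1/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2"


def net_composition(inchi):
    """Element counts from the formula layer, adjusted by the /p layer."""
    body = inchi.split("=", 1)[1]
    _version, formula, *layers = body.split("/")
    counts = Counter()
    for component in formula.split("."):
        # A leading number multiplies the whole component, e.g. "2ClH".
        multiplier, rest = re.fullmatch(r"(\d*)(.*)", component).groups()
        n = int(multiplier) if multiplier else 1
        for symbol, k in re.findall(r"([A-Z][a-z]?)(\d*)", rest):
            counts[symbol] += n * (int(k) if k else 1)
    for layer in layers:
        if layer.startswith("p"):
            # "/p-2" means two protons removed from the components.
            counts["H"] += int(layer[1:])
    return counts


print(dict(net_composition(INCHI)))
```

The totals, Cl2 H6 N2 Pt, match PtCl2(NH3)2, the composition shared by cisplatin and transplatin.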
3-D ChEBI
cisplatin
Compositional uncertainty
Positional uncertainty
Configurational uncertainty
Conformational uncertainty
Uncertainty and ambiguity in chemistry
Examples
an alkali metal cation
vanadate(V) anion
[2H]ethanol
Compositional uncertainty
Examples
L-bromohistidine residue
pteroic acid (several tautomers)
Positional uncertainty
Examples
androstane
rel-(2R,3R)-2-amino-3-methylpentanoic acid
tetradec-11-enoic acid
Configurational uncertainty
Examples
cyclohexane: chair, boat, twist
protein secondary structure: α, β, …
Conformational uncertainty
• Molecular structure ontology
• Subatomic particle ontology
• Role ontology
    Biological role
    Application
ChEBI ontology
Molecular structure ontology: catecholamines
Biological role: hormone
Application: antiglaucoma, bronchodilator, cardiostimulant
L-adrenaline
The family relations
L-cysteine
L-cysteine(•)
L-cysteinate(2–)
L-cysteinate(1–)
L-cysteinyl
L-cysteinium
L-cysteino
L-cystein-S-yl
L-cysteine residue
L-cysteinate residue
D-cysteine
cysteine
L-cysteine zwitterion
Relationships in ChEBI
Symbol  Relationship               Kind
∆       Is A                       generic
⋄       Has Part                   generic
♯       Is Conjugate Acid Of       specific
♭       Is Conjugate Base Of       specific
        Is Enantiomer Of           specific
        Is Tautomer Of             specific
ℛ       Is Substituent Group From  specific
ℋ       Has Parent Hydride         specific
ℱ       Has Functional Parent      specific
        Has Role                   generic?
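These relationships can be modelled as typed edges between named entities. The sketch below is a hypothetical in-memory store (`Ontology`, `INVERSE` and the method names are illustrative, not ChEBI's implementation). It assumes that the paired relations named on the slides are inverses of each other and that the enantiomer and tautomer relations are symmetric, so adding one direction also records the other. The example facts are the L-cysteine relations shown on the following slides.

```python
from collections import defaultdict

# Inverse pairs named on the slides; symmetric relations map to themselves.
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "has part": "is part of",
    "is part of": "has part",
    "is enantiomer of": "is enantiomer of",
    "is tautomer of": "is tautomer of",
    "has parent hydride": "is parent hydride of",
    "is parent hydride of": "has parent hydride",
    "has functional parent": "is functional parent of",
    "is functional parent of": "has functional parent",
}


class Ontology:
    """Minimal triple store keyed by (subject, relation)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)
        inverse = INVERSE.get(relation)
        if inverse is not None:
            # Record the reverse direction so either side can be queried.
            self._edges[(obj, inverse)].add(subject)

    def objects(self, subject, relation):
        return set(self._edges.get((subject, relation), ()))


onto = Ontology()
onto.add("L-cysteine", "is a", "cysteine")
onto.add("L-cysteine", "is enantiomer of", "D-cysteine")
onto.add("L-cysteinium", "is conjugate acid of", "L-cysteine")
onto.add("L-cysteine hydrochloride", "has part", "L-cysteinium")

print(onto.objects("D-cysteine", "is enantiomer of"))
print(onto.objects("L-cysteine", "is conjugate base of"))
print(onto.objects("L-cysteinium", "is part of"))
```

Relations without an entry in `INVERSE`, such as "is a", are stored in one direction only.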
Is A relationship
L-cysteine ∆ cysteine (L-cysteine is a cysteine)
[Structure diagrams]
Is Enantiomer Of
L-cysteine is enantiomer of D-cysteine
L-cysteine ∆ cysteine; D-cysteine ∆ cysteine
[Structure diagrams]
Has Part
L-cysteine hydrochloride ⋄ L-cysteinium (L-cysteine hydrochloride has part L-cysteinium)
L-cysteinium is part of L-cysteine hydrochloride
[Structure diagrams: L-cysteinium and Cl-]
Is Conjugate Acid Of
L-cysteinium ♯ L-cysteine ♯ L-cysteinate(1–) ♯ L-cysteinate(2–)
(♯ = is conjugate acid of)
[Structure diagrams]
Is Conjugate Base Of
L-cysteinate(2–) ♭ L-cysteinate(1–) ♭ L-cysteine ♭ L-cysteinium
(♭ = is conjugate base of)
[Structure diagrams]
Acid/base relationships
L-cysteinium ♯ L-cysteine ♯ L-cysteinate(1–) ♯ L-cysteinate(2–)
L-cysteinate(2–) ♭ L-cysteinate(1–) ♭ L-cysteine ♭ L-cysteinium
[Structure diagrams]
Is Tautomer Of
L-cysteine is tautomer of L-cysteine zwitterion
[Structure diagrams]
Is Tautomer Of
1H-pyrrole, 2H-pyrrole and 3H-pyrrole are tautomers of one another
[Structure diagrams]
Has Parent Hydride
salutaridinol ℋ morphinan (salutaridinol has parent hydride morphinan)
morphinan is parent hydride of salutaridinol
[Structure diagrams]
Has Functional Parent
7-O-acetylsalutaridinol ℱ salutaridinol (7-O-acetylsalutaridinol has functional parent salutaridinol)
salutaridinol is functional parent of 7-O-acetylsalutaridinol
[Structure diagrams]
Is Substituent Group From
L-cysteinyl ℛ L-cysteine
L-cysteine residue ℛ L-cysteine
L-cysteino ℛ L-cysteine
(ℛ = is substituent group from)
[Structure diagrams]
The family relations
[Diagram: the same entities as in the earlier "family relations" list, now linked by ∆, ♯, ♭, ℛ and ℱ relationships]
Ontology of L-cysteine
Ontology of L-cysteine (1)
Ontology of L-cysteine (2)
Thank you