Enriching the Gene Ontology via the Dissection of Labels using the Ontology Pre-Processor Language
Authors: J.T. Fernandez-Breis, L. Iannone, I. Palmisano, A. Rector, R. Stevens. Presented at the 17th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2010.
Enriching the Gene Ontology via the Dissection of Labels using the
Ontology Pre-Processor Language
Jesualdo Tomás Fernández-Breis, Luigi Iannone,
Ignazio Palmisano, Alan L. Rector, and Robert Stevens
October 12th 2010, Lisbon, Portugal
Motivation
• Biomedical ontologies
– The OBO Foundry
• More than 200 biomedical ontologies
• Some properties
– Delineated content
– Reuse of existing ontologies
– Textual definitions
– Systematic naming convention
• Limited explicit semantics
Gene Ontology Consortium
Enrichment of GO Molecular Function
[Figure: workflow from the original GO MF to the enriched GO MF. Steps shown: Original GO MF; Dissection of the Ontology; Analysis of Labels; Identification of Linguistic Patterns; Design of Knowledge Patterns; Execution of the Knowledge Patterns; Enriched GO MF]
Dissection of the ontology into its semantic axes
• Normalization
• Analysis of the labels
– Biochemical substances
– Biological processes
– Cellular component
• Reuse and combination of existing ontologies
[Figure: MyAuxiliarOntology. Labels in the figure: Biological Process, MySubstances, FMA, Relations Ontology, CHEBI, MyProtein, EC-Primitive, Aminoacid, Biochemical Complex, Cellular Component]
Design of linguistic patterns from labels
• Manual analysis of the structure of the labels by taxonomies
• Some linguistic patterns
– “X binding”
– “X codon amino acid adaptor activity”
– “base pairing with X”
– “translation X factor activity”
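The four label patterns above can be read as templates with a single slot X. A minimal sketch of that reading in Python, assuming each pattern is encoded as a regular expression over the whole label; the regexes, the `match_label` helper and the sample labels are illustrative assumptions, not part of the original study:

```python
import re

# Hypothetical regex encodings of the linguistic patterns listed on the slide.
# The slot X is captured as the named group "x".
LINGUISTIC_PATTERNS = {
    "X binding": re.compile(r"(?P<x>.+) binding"),
    "X codon amino acid adaptor activity": re.compile(
        r"(?P<x>.+) codon amino acid adaptor activity"
    ),
    "base pairing with X": re.compile(r"base pairing with (?P<x>.+)"),
    "translation X factor activity": re.compile(
        r"translation (?P<x>.+) factor activity"
    ),
}


def match_label(label):
    """Return (pattern name, value of X) for the first pattern that matches
    the whole label, or None when no pattern applies."""
    for name, regex in LINGUISTIC_PATTERNS.items():
        match = regex.fullmatch(label)
        if match is not None:
            return name, match.group("x")
    return None


if __name__ == "__main__":
    # Sample labels chosen for illustration only.
    for label in ["zinc ion binding", "base pairing with mRNA", "kinase activity"]:
        print(label, "->", match_label(label))
```

Matching the whole label rather than a substring keeps a pattern from firing on labels that merely contain the phrase; whether the authors applied the same restriction is not stated in the slides.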
Design of knowledge patterns
• Some knowledge patterns
binding = molecular_function and enables some (binds some chemical_substance or binds some cellular_component)
triplet_codon_amino_acid_adaptor_activity = molecular_function and enables some (adapts some (amino_acid and recognizes some triplet))
Execution of the knowledge patterns
• OPPL Version 2
– http://oppl2.sourceforge.net/
• Bulk manipulation of OWL ontologies
– Enrichment, Verification, Patterns
– Manchester OWL Syntax
• Declarative
– OWL Axioms, variables, regular expressions
OPPL Use case
[Figure labels: OWL axioms, Values, OPPL Script, Lean, Rich]
Egaña et al. OWLED 2008 & EKAW 2008, Iannone ESWC 2009
A pattern as an OPPL script
?y:CLASS=Match("((\w+))_codon_amino_acid_adaptor_activity"),
?x:CLASS=create(?y.GROUPS(1))
SELECT ?y subClassOf Thing
WHERE ?y Match("((\w+))_codon_amino_acid_adaptor_activity")
BEGIN
ADD ?y subClassOf molecular_function,
ADD ?y subClassOf enables some (adapts some (amino_acid and recognizes some ?x))
END;
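To make the effect of the script concrete, here is a rough Python emulation of what it does to a list of class names: classes whose name matches the regular expression receive two new subclass axioms, with the captured group used as the filler of `recognizes`. This is only a sketch under assumptions: it works on plain strings rather than an OWL ontology, it treats the match as a full match of the class name, and the `enrich` function is hypothetical. It is not the OPPL engine.

```python
import re

# Same regular expression as in the script, with a single capturing group.
CODON_ADAPTOR = re.compile(r"(\w+)_codon_amino_acid_adaptor_activity")


def enrich(class_names):
    """Return the axioms (as Manchester-style strings) that the pattern
    would add for every class name matching the regular expression."""
    axioms = []
    for name in class_names:
        match = CODON_ADAPTOR.fullmatch(name)
        if match is None:
            continue  # class is outside the scope of this pattern
        x = match.group(1)  # plays the role of ?y.GROUPS(1) in the script
        axioms.append(f"{name} subClassOf molecular_function")
        axioms.append(
            f"{name} subClassOf enables some "
            f"(adapts some (amino_acid and recognizes some {x}))"
        )
    return axioms


if __name__ == "__main__":
    for axiom in enrich(["triplet_codon_amino_acid_adaptor_activity", "binding"]):
        print(axiom)
```

In the real script the class bound to `?x` is created in the ontology if it does not exist; the emulation only reuses the captured text as a name.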
Results – Scope
• The “source” Gene Ontology
– Version 1550
– 8548 classes, 5 object properties (OP), 5 data properties (DP) and 9954 subclass axioms
– Classification time: < 1 sec (FaCT++)
• Scope of this study (approx. 18% of GO MF)
– binding
– structural molecule activity
– chaperone activity
– proteasome regulator activity
– electron carrier activity
– enzyme regulator activity
– translation regulator activity
• Complete results: http://miuras.inf.um.es/~mfoppl/
Results – Effectiveness
– 1567 descendant classes of binding
– Knowledge patterns:
• Binding: 1228/1567 (78%)
• Base pairing: 6/84
– Molecular adaptor activity (71/72)
• Triplet codon amino acid activity (64/64)
• All the 7 binding patterns: 1336/1567 (85%)
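The percentages on this slide follow directly from the reported counts. A small sketch that recomputes them, using only the matched/total pairs given above; the `coverage` helper is a hypothetical name introduced for illustration:

```python
def coverage(matched, total):
    """Percentage of classes covered by a pattern, rounded to the nearest integer."""
    return round(100 * matched / total)


# (matched, total) pairs as reported on the slide.
REPORTED = {
    "Binding": (1228, 1567),
    "Base pairing": (6, 84),
    "Molecular adaptor activity": (71, 72),
    "Triplet codon amino acid activity": (64, 64),
    "All the 7 binding patterns": (1336, 1567),
}

if __name__ == "__main__":
    for name, (matched, total) in REPORTED.items():
        print(f"{name}: {matched}/{total} = {coverage(matched, total)}%")
```

With all seven binding patterns applied, 1567 − 1336 = 231 descendant classes of binding remain uncovered.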
Results – Enrichment (I)
[Figure: the ontology before and after enrichment]
Results – Enrichment (II)
• The enriched GO MF
– 58624 classes, 254 OP, 16 DP, 107631 subclass axioms, 264 equivalent class axioms and 488 disjoint class axioms
– Classification time: approx. 2 minutes (FaCT++)
– Due to the patterns
• 584 new classes
– Suboptimal auxiliary ontologies: D1 Dopamine
– Use of abbreviated forms in GO MF: MAPK, IgX
• 13 new OP
• 3608 new subclass axioms
Results – Querying (III)
• We can make queries that were not possible with the original ontology:
– Example: Molecular functions that bind substances that play a chemical role
Results – Time (IV)
• Execution time of the binding patterns
Conclusions
• Patterns and OPPL are useful for supporting ontology enrichment processes
• The structure of the labels in biomedical ontologies embeds knowledge that can be extracted
• Benefits of encoding knowledge into patterns: modularity, maintenance and evolution
• Critical factor: the auxiliary ontologies
Further work
• Bio-evaluation of the patterns
• Identification of linguistic patterns using text mining techniques
• Application to the rest of GO MF and the other GO ontologies
• Alignment with efforts of the GO Consortium
Jesualdo Tomás Fernández Breis [email protected]
http://webs.um.es/jfernand
Thanks for your attention!
Acknowledgements