Software Development by the Genomic Standards Consortium
DESCRIPTION
Presentation held at the M3 SIG meeting at ISMB in Stockholm, 2009. Its purpose is to show the audience the software development activities of the Genomic Standards Consortium. See also http://gensc.org
TRANSCRIPT
1
Bringing Standards to Life:
Software Development by the Genomic Standards Consortium
Renzo Kottmann
Microbial Genomics Group
Max Planck Institute for Marine Microbiology
M3 SIG Stockholm July 2009
2
Genomic Standards Consortium (GSC)
Goal
• Promote mechanisms that standardize the description of genomes and the exchange and integration of genomic data
Open-membership, international working body
• Established in Sept 2005
• Participants include DDBJ, EMBL, GenBank, Sanger, JCVI, JGI, EBI and a range of US, UK and EU research institutions
• Organized a series of workshops
http://gensc.org and http://gensc.org/gc_wiki/index.php/GSC_Membership
3
Minimum Information about a Genome Sequence (MIGS) Specification
MIGS extends what DDBJ/EMBL/GenBank request upon submission of a genome sequence
• Examples:
Description of geographic location of a sample and habitat
“Minimum Information about a Metagenomic Sequence” (MIMS)
– Temperature
– pH
Description of sequence generation
– Sequencing method
– Assembly method
Field et al. Nat Biotechnol. 2008
4
MIGS Checklist 2.0
Field et al. Nat Biotechnol. 2008
5
MIGS Checklist 2.0
Field et al. Nat Biotechnol. 2008
M = mandatory
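A minimal sketch of how a checklist with mandatory ("M") items could be checked in software. The field names, and the choice of which ones are mandatory, are illustrative assumptions based on the examples on slide 3; they are not the actual MIGS 2.0 checklist.

```python
# Hypothetical checklist: field name -> True if the item is mandatory ("M").
# Names and flags are illustrative only, not taken from MIGS 2.0 itself.
CHECKLIST = {
    "geographic_location": True,
    "habitat": True,
    "sequencing_method": True,
    "assembly_method": True,
    "temperature": False,
    "ph": False,
}


def missing_mandatory(report):
    """Return the mandatory checklist fields that are absent or empty."""
    return sorted(
        name
        for name, mandatory in CHECKLIST.items()
        if mandatory and report.get(name) in (None, "")
    )


def is_compliant(report):
    """A report counts as compliant when no mandatory field is missing."""
    return not missing_mandatory(report)


# Example report with one empty and one absent mandatory field.
report = {
    "geographic_location": "Kiel Fjord, Baltic Sea, Germany",
    "habitat": "Aquatic: Marine",
    "sequencing_method": "",
}
print(missing_mandatory(report))  # ['assembly_method', 'sequencing_method']
```

Optional items such as temperature or pH are simply ignored by the check, which mirrors the idea of a minimum set of information.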
10
Software Development for MIGS/MIMS
Mechanisms for achieving compliance are needed:
• Such mechanisms involve an appropriate reporting structure for capturing and exchanging data, software, databases, and controlled vocabularies and/or ontologies for defining the terms used in the annotations.
Supporting Projects:
• Habitat-Lite (Ontology specification)
• Genomic Rosetta Stone (Identifier Mapping)
• GCDML (MIGS/MIMS specification in XML)
• Genome Catalogue (Database and Web Server)
Field et al. Nat Biotechnol. 2008
11
Habitat-Lite (= EnvO-Lite)
Easy-to-use (small) set of terms
• Captures high-level information about habitat
• Derived from the Environment Ontology (EnvO).
Meet the needs of multiple users
• Annotators, database providers, biologists, and bioinformaticians alike who need to search and employ such data in comparative analyses.
Terms: Aquatic, Aquatic: Freshwater, Aquatic: Marine, Terrestrial, Air, Fossil, Food, Organism-Associated, Extreme Habitat, Other
Hirschman et al. OMICS. 2008
12
Habitat-Lite
First level:
• Aquatic
• Aquatic: Freshwater
• Aquatic: Marine
• Terrestrial
• Air
• Fossil
• Food
• Organism-Associated
• Extreme Habitat
• Other
Second level:
• soil
• sediment
• sludge
• waste water
• hot spring
• hydrothermal vent
• biofilm
• microbial mat
< 20 terms
Hirschman et al. OMICS. 2008
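A small sketch of how an annotation tool might check free-text habitat input against the Habitat-Lite terms listed on this slide. The terms are copied from the slide; the function names and the normalization rule (ignore case and extra whitespace) are assumptions for illustration. The slide does not show which second-level term belongs under which first-level term, so the two levels are treated as independent lists here.

```python
# Terms as listed on the slide (Hirschman et al. OMICS. 2008).
FIRST_LEVEL = (
    "Aquatic", "Aquatic: Freshwater", "Aquatic: Marine", "Terrestrial", "Air",
    "Fossil", "Food", "Organism-Associated", "Extreme Habitat", "Other",
)
SECOND_LEVEL = (
    "soil", "sediment", "sludge", "waste water", "hot spring",
    "hydrothermal vent", "biofilm", "microbial mat",
)


def _normalize(term):
    """Collapse whitespace and ignore case so 'aquatic:  marine' still matches."""
    return " ".join(term.split()).casefold()


def canonical_term(term, vocabulary):
    """Return the canonical spelling of term in vocabulary, or None if unknown."""
    lookup = {_normalize(entry): entry for entry in vocabulary}
    return lookup.get(_normalize(term))


def annotate(first, second=None):
    """Validate a habitat annotation; raise ValueError on unknown terms."""
    first_ok = canonical_term(first, FIRST_LEVEL)
    if first_ok is None:
        raise ValueError(f"unknown first-level term: {first!r}")
    second_ok = None
    if second is not None:
        second_ok = canonical_term(second, SECOND_LEVEL)
        if second_ok is None:
            raise ValueError(f"unknown second-level term: {second!r}")
    return first_ok, second_ok


print(annotate("aquatic:  marine", "Hydrothermal Vent"))
```

Keeping the vocabulary this small is what makes such a check practical for annotators who are not ontology experts.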
13
Habitat-Lite applied
http://www.megx.net/genomes
14
Genomic Rosetta Stone (GRS)
Create a unified mapping between different genomic resources
Improve navigation across these resources
Enable the integration of this information in the near future
Van Brabant et al. OMICS. 2008
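The idea of a unified identifier mapping can be sketched as a lookup table that links the identifiers different resources use for the same genome. This is only an illustration of the concept: the resource names, identifiers, and function names below are made-up placeholders, not actual GRS content or its real data model.

```python
# Each row links the identifiers that different resources use for one genome.
# Resource names and identifiers are made-up placeholders.
MAPPING_ROWS = [
    {"resource_a": "A-0001", "resource_b": "B-17", "resource_c": "C-xyz"},
    {"resource_a": "A-0002", "resource_b": "B-42"},
]


def build_index(rows):
    """Index every (resource, identifier) pair to the row it belongs to."""
    index = {}
    for row in rows:
        for resource, identifier in row.items():
            index[(resource, identifier)] = row
    return index


def translate(index, resource, identifier, target):
    """Translate an identifier from one resource to another.

    Returns None when the identifier is unknown or the target resource
    has no entry for that genome.
    """
    row = index.get((resource, identifier))
    if row is None:
        return None
    return row.get(target)


INDEX = build_index(MAPPING_ROWS)
print(translate(INDEX, "resource_b", "B-42", "resource_a"))  # A-0002
```

With such an index, navigating from any one resource to any other takes a single lookup, which is the navigation benefit named on the slide.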
15
Genomic Rosetta Stone (GRS)
Van Brabant et al. OMICS. 2008
16
Genomic Rosetta Stone (GRS)
Enable the integration of this information in the near future
Van Brabant et al. OMICS. 2008
17
Genomic Contextual Data Markup Language (GCDML)
An Extensible Markup Language (XML)
Aim
• Implement MIGS/MIMS
• Provide even more descriptors
• Facilitate exchange and integration of genomic data
Kottmann et al. OMICS. 2008
18
GCDML Example (excerpt)
<gcdml:originalSample>
  <gcdml:physicalMaterial>
    <gcdml:samplingTime><gcdml:notGiven>unknown</gcdml:notGiven></gcdml:samplingTime>
    <gcdml:samplePointLocation>
      <gml:LocationKeyWord>Baltic Sea</gml:LocationKeyWord>
      <gml:LocationString>Kiel Fjord, Baltic Sea, Germany</gml:LocationString>
      <gcdml:pos2D>54.329 10.149</gcdml:pos2D>
      <gcdml:determinationMethod>derived from literature</gcdml:determinationMethod>
    </gcdml:samplePointLocation>
    <gcdml:marineHabitat>
      <gcdml:waterBody>
        <gcdml:depth>
          <gcdml:measure min="0.00" max="0.05"><gcdml:values uom="m">0.00 0.05</gcdml:values></gcdml:measure>
        </gcdml:depth>
      </gcdml:waterBody>
    </gcdml:marineHabitat>
    <gcdml:materialType>seawater</gcdml:materialType>
    <gcdml:amount><gcdml:measure><gcdml:values uom="ml">100</gcdml:values></gcdml:measure></gcdml:amount>
  </gcdml:physicalMaterial>
</gcdml:originalSample>
Kottmann et al. OMICS. 2008
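To illustrate how such a report could be consumed by software, here is a rough sketch that reads a few values from an abridged copy of the excerpt using Python's standard library. The excerpt does not show namespace declarations, so the gcdml namespace URI below is a placeholder assumption, the function name is hypothetical, and reading pos2D as latitude followed by longitude is an assumption that happens to match the location of Kiel.

```python
import xml.etree.ElementTree as ET

# The gcdml URI is a placeholder (the slide does not show the real one);
# the gml URI is the one commonly used for OGC GML.
NS = {
    "gcdml": "http://example.org/gcdml",
    "gml": "http://www.opengis.net/gml",
}

# Abridged copy of the excerpt, with namespace declarations added on the root.
EXCERPT = """
<gcdml:originalSample xmlns:gcdml="http://example.org/gcdml"
                      xmlns:gml="http://www.opengis.net/gml">
  <gcdml:physicalMaterial>
    <gcdml:samplePointLocation>
      <gml:LocationString>Kiel Fjord, Baltic Sea, Germany</gml:LocationString>
      <gcdml:pos2D>54.329 10.149</gcdml:pos2D>
    </gcdml:samplePointLocation>
    <gcdml:marineHabitat>
      <gcdml:waterBody>
        <gcdml:depth>
          <gcdml:measure min="0.00" max="0.05"><gcdml:values uom="m">0.00 0.05</gcdml:values></gcdml:measure>
        </gcdml:depth>
      </gcdml:waterBody>
    </gcdml:marineHabitat>
    <gcdml:materialType>seawater</gcdml:materialType>
  </gcdml:physicalMaterial>
</gcdml:originalSample>
"""


def extract_sample(xml_text):
    """Pull location, position, depth range and material type from a report."""
    root = ET.fromstring(xml_text.strip())
    # Assumed order: latitude first, then longitude.
    lat, lon = (float(v) for v in root.findtext(".//gcdml:pos2D", namespaces=NS).split())
    values = root.find(".//gcdml:depth/gcdml:measure/gcdml:values", NS)
    return {
        "location": root.findtext(".//gml:LocationString", namespaces=NS),
        "latitude": lat,
        "longitude": lon,
        "depth": [float(v) for v in values.text.split()],
        "depth_unit": values.get("uom"),
        "material": root.findtext(".//gcdml:materialType", namespaces=NS),
    }


sample = extract_sample(EXCERPT)
print(sample["location"], sample["depth"], sample["depth_unit"])
```

Because units travel with the values (the uom attribute), a consumer does not have to guess whether a depth is given in metres or centimetres.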
21
Genome Catalogue
Online system for capturing MIGS/MIMS-compliant reports
Field et al. Nature 2008
22
Genome Catalogue
Requirements
• Rich toolkit, user-friendly
• Designed to give credit to all contributors
• XML-based (GCDML): able to maintain all versions of GCDML schemas
• Web services-based: supporting the automated exchange of content
• Serve as the international GCAT identifier authority
• Comprehensive: containing reports for all taxa and metagenomes
• Ontology-supportive
• Shared by the GSC
23
Current Status
We have specifications:
• MIGS/MIMS
• Habitat-Lite
• Genomic Rosetta Stone
Work on supporting software is ongoing:
• Genome Catalogue is in prototype status
• Funding: this is a long-term endeavour that cannot be done on a voluntary basis
24
Discussion
Software is needed for:
• Creation of MIGS/MIMS data
• Storage
• Analysis
Expand standardization efforts to
• Software specification/development
• Work on a standardized genomic data management architecture / cyberinfrastructure
Data-intensive science is successful if it works towards one community with one vision
• World Wide Genomics project
25
Acknowledgements
All members of the GSC, incl.
Dawn Field
Peter Sterk
Saul Kravitz
Tanya Gray
Megx.net team
Frank Oliver Glöckner
Ivaylo Kostadinov
Melissa Beth Duhaime
Pier Luigi Buttigieg
Wolfgang Hankeln
Pelin Yilmaz
26
END
Looking forward to the discussion
Join the GSC
http://gensc.org