


EMBL Outstation – The European Bioinformatics Institute

MIAME and ArrayExpress – a standard for microarray data annotation and a database to store it

Helen Parkinson
Microarray Informatics Team
European Bioinformatics Institute, Hinxton


Three parts of my talk

• Microarray data standards
• Ontologies for gene expression data
• ArrayExpress – a public database for microarray data
• Analysis tools at the EBI


The size of the datasets

Experiments:
– ~100 000 different transcripts in human
– ~320 cell types
– 2000 compounds
– 3 time points
– 2 concentrations
– 2 replicates

Data:
– 8 x 10^11 data-points
– 1 x 10^15 = 1 Peta Byte for Affymetrix

(data from Jerry Lanfear)
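The data-point estimate on this slide follows from multiplying the experimental dimensions together. A minimal Python sketch of that arithmetic; the variable names are illustrative, and the bytes-per-data-point figure is a back-calculation from the slide's numbers, not something stated in the talk:

```python
from math import prod

# Experimental dimensions quoted on the slide (approximate figures).
dimensions = {
    "transcripts": 100_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

# Total number of measurements: the product of all dimensions.
data_points = prod(dimensions.values())
print(f"{data_points:.2e} data points")  # on the order of 8 x 10^11

# Back-calculation (an assumption, not from the talk): the storage per
# data point implied by the 1 petabyte figure.
PETABYTE = 10**15
bytes_per_point = PETABYTE / data_points
print(f"~{bytes_per_point:.0f} bytes per data point")
```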


Microarray data

• Microarrays are widely used in experiments and already producing massive amounts of data
• These data have to be stored in a well organised and standard way, if they are to be accessed and analysed by the wide research community
• There is a general consensus that there is a need for a public repository for microarray data
• It is much less clear what exactly should be stored in such a repository


A gene expression database from the data analyst’s point of view

[Figure: a gene expression matrix of gene expression levels, with axes labelled Genes and Samples, accompanied by gene annotations and sample annotations.]


Three parts of a gene expression database

• Gene annotation – can be given by links to gene sequence databases and GO (function, process, cell compartment) – not perfect, but let's not worry about it
• Sample annotation – we do not have any external databases for sample description (except species taxonomy) – problem 1
• Gene expression matrix – what are the measurement units for gene expression levels? – problem 2
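The three parts can be pictured as one small data structure. The following is a toy Python sketch, not the ArrayExpress schema: the class name, field names, gene and sample identifiers, and all values are hypothetical, and it assumes genes are rows and samples are columns.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDatabase:
    """Toy container: genes are rows, samples are columns."""
    genes: list[str]
    samples: list[str]
    matrix: list[list[float]]  # matrix[i][j]: level of gene i in sample j
    gene_annotations: dict[str, dict] = field(default_factory=dict)
    sample_annotations: dict[str, dict] = field(default_factory=dict)

    def level(self, gene: str, sample: str) -> float:
        """Look up one expression level by gene and sample name."""
        return self.matrix[self.genes.index(gene)][self.samples.index(sample)]


db = ExpressionDatabase(
    genes=["geneA", "geneB"],
    samples=["sample1", "sample2", "sample3"],
    # Made-up values; the units are deliberately unspecified (problem 2).
    matrix=[[1.0, 2.5, 0.8], [0.3, 0.4, 3.1]],
    gene_annotations={"geneA": {"GO": ["hypothetical term"]}},
    sample_annotations={"sample1": {"organism": "Mus musculus"}},
)
print(db.level("geneB", "sample3"))
```

The matrix alone says nothing about what a value of 3.1 means or what "sample3" was, which is why the two annotation parts and the unit question matter.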


Problem/consideration 1 – sample annotation

Gene expression data only have meaning in the context of detailed sample descriptions

If the data is going to be interpreted by independent parties, sample information has to be searchable and in the database

Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc) are needed for unambiguous sample description


Sample annotation – what can be done?

• Few CVs and ontologies for sample description are available (species taxonomy, model organisms)
• Some use of free-text descriptions is unavoidable (curation workload)
• Existing efforts to create such ontologies should be coordinated (MGED ontology working group)
• Use existing ontologies and CVs wherever possible


Problem 2 – the lack of gene expression measurement units

What we would like to have
– gene expression levels expressed in some standard units (e.g. molecules per cell)
– reliability measure associated with each value (e.g. standard deviation)

What have we got
– each experiment using different units
– no reliability information
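As an illustration of the reliability measure asked for above, here is a minimal Python sketch that attaches a standard deviation to an expression level computed from replicate measurements. The replicate values are made up and in arbitrary units.

```python
from statistics import mean, stdev

# Hypothetical replicate measurements for one gene in one sample.
replicates = [10.0, 12.0, 11.0, 13.0]

level = mean(replicates)    # reported expression level
spread = stdev(replicates)  # sample standard deviation as the reliability measure
print(f"{level:.2f} +/- {spread:.2f}")
```

This only works when replicates exist, which is one reason replicate measurements matter for the measurement-unit problem.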


Comparing expression data

[Figures on three consecutive slides, not preserved in the transcript; the surviving labels are "cm", "inc" and "? ?".]


What to do in the absence of standard measurement units?

• Record raw, intermediate and final analysis data together with the detailed annotation of how the analysis has been performed
• This effectively passes on the responsibility for interpreting the final analysis data to the user


Three levels of microarray data processing

[Figure: three levels of data]
– Raw data: array scans
– Quantitations: quantitation matrices of spot quantitations (axis label: Spots)
– Gene expression data: matrix of gene exp. levels (axis labels: Genes, Samples)
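A hedged Python sketch of how the three levels relate, with the processing annotation stored next to the final values. The file name, spot identifiers and intensities are hypothetical, and the processing chosen here (background subtraction followed by averaging replicate spots) is an assumption for illustration only; real pipelines differ.

```python
from collections import defaultdict
from statistics import mean

# Level 1: raw data (array scan), referenced only by a hypothetical file name.
scan_file = "scan_001.tiff"

# Level 2: spot quantitations derived from the scan (made-up values).
# Each entry: (spot id, gene printed on that spot, foreground, background).
spot_quantitations = [
    ("spot1", "geneA", 1200.0, 200.0),
    ("spot2", "geneA", 1000.0, 200.0),
    ("spot3", "geneB", 450.0, 150.0),
]

# Level 3: one expression value per gene.
# Assumed processing: subtract background, then average replicate spots.
by_gene = defaultdict(list)
for _spot, gene, foreground, background in spot_quantitations:
    by_gene[gene].append(foreground - background)

expression_levels = {gene: mean(values) for gene, values in by_gene.items()}

# Annotation of how level 3 was obtained, kept alongside the data.
provenance = {
    "source_scan": scan_file,
    "steps": ["background subtraction", "mean over replicate spots"],
}
print(expression_levels)
print(provenance)
```

Keeping all three levels plus the annotation lets a user redo the final step with a different method if they do not trust the one that was applied.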


Measurement units

In perspective:
– standard controls for experiments (on chips and in the samples) should be introduced
– replicate measurements will become a norm

Temporary solution:
– storing intermediate analysis results (including the images) and annotations of how they were obtained
– Standards within experiments themselves (standard controls and protocols)


Standards for microarray data

Standards are needed to build a well organised microarray database
– Standards for annotation
– Standards for data exchange
– Standards for controls in the experiment and data normalisation

www.dnachip.org/mged/normalization.html


How to create microarray data standards

1. To understand thoroughly what is the minimum information about a microarray experiment that is needed to interpret it unambiguously and what is the structure of this information (objects and relationships)

2. To create the technical data format able to capture this information

3. To find appropriate controlled vocabularies


Standardisation of microarray data and annotations – MGED group

The goal of the group is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalisation methods. The group includes most of the world's largest microarray laboratories and companies (TIGR, Affymetrix, Stanford, Sanger, Agilent, etc.)

www.mged.org


MGED

• MGED 2 meeting in Heidelberg in 2000, MGED 3 in Stanford in 2001, both ~ 300 participants
• Minimum Information About a Microarray Experiment – MIAME version 1.0 posted
• Collaboration with OMG on data formats: MAML + GEML = MAGE-ML and MAGE-OM
• MGED 4 meeting in February 2002, Boston
• MGED will become an ISCB Special Interest Group


MIAME – Minimum Information About a Microarray Experiment

6 parts of a microarray experiment

[Figure: diagram with the labels Experiment, Hybridisation, Array, Sample, Data and Normalisation, plus the labels Publication, External links, Gene (e.g., EMBL) and Source (e.g., Taxonomy).]

www.mged.org


MIAME Section on Sample Source and Treatment

• sample source and treatment ID as used in section 1
• organism (NCBI taxonomy)
• additional "qualifier, value, source" list; the list includes:
  – cell source - provider type (if derived from primary sources (s))
  – sex
  – age
  – growth conditions
  – development stage
  – organism part (tissue)
  – animal/plant strain or line
  – genetic variation (e.g., gene knockout, transgenic variation)
  – individual
  – individual genetic characteristics (e.g., disease alleles, polymorphisms)
  – disease state or normal
  – target cell type
  – cell line and source (if applicable)
  – in vivo treatments (organism or individual treatments)
  – in vitro treatments (cell culture conditions)
  – treatment type (e.g., small molecule, heat shock, cold shock, food deprivation)
  – compound
  – is additional clinical information available (link)
  – separation technique (e.g., none, trimming, microdissection, FACS)
• laboratory protocol for sample treatment
• ……
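The "qualifier, value, source" list can be sketched as a list of triples. This is an illustrative Python sketch, not a MIAME or MAGE-ML data format; the class and function names are hypothetical, and the example values are borrowed from the sample description excerpt shown later in the talk.

```python
from typing import NamedTuple


class Qualifier(NamedTuple):
    qualifier: str
    value: str
    source: str  # the vocabulary or ontology the value is drawn from


sample_description = [
    Qualifier("organism", "Mus musculus", "NCBI taxonomy"),
    Qualifier("sex", "female", "MGED"),
    Qualifier("organism part", "thymus", "GXD Mouse Anatomical Dictionary"),
]


def lookup(description, qualifier):
    """Return all values recorded for one qualifier."""
    return [q.value for q in description if q.qualifier == qualifier]


print(lookup(sample_description, "sex"))
```

Recording the source with each value is what makes a term such as "thymus" unambiguous: a reader knows which vocabulary defines it.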


What is an ontology?

• An ontology is a specification of concepts that includes the relationships between those concepts.
• Provides semantics and constraints
• Allows for computational inferences and reliable comparisons


MGED Biomaterial Ontology

• Under construction by Chris Stoeckert
  – Using OILed (may use others)
• Motivated by MIAME and coordinated with the database model
• Extend classes, provide constraints, define terms, provide terms to use, develop CVs for submissions (EBI)


Use case scenario

OWG Use Cases

• Return a summary of all experiments that use a specified type of biosource.
  – Group the experiments according to treatment.
• Return a summary of all experiments done examining effects of a specified treatment.
  – Group the experiments according to biosource.
• Return a summary of all experiments measuring the expression of a specified gene.
  – Indicate when experiments confirm results, provide new information, or conflict.
• Generate a distance metric for experiment types
• Generate an error estimation for experimental descriptions
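The first two use cases amount to a filter followed by a grouping. A minimal Python sketch under stated assumptions: the experiment records, field names and the `summarise` helper are hypothetical and do not reflect any ArrayExpress query interface.

```python
from collections import defaultdict

# Hypothetical experiment records; field names are illustrative.
experiments = [
    {"id": "E1", "biosource": "thymus", "treatment": "dexamethasone"},
    {"id": "E2", "biosource": "thymus", "treatment": "heat shock"},
    {"id": "E3", "biosource": "liver", "treatment": "dexamethasone"},
    {"id": "E4", "biosource": "thymus", "treatment": "dexamethasone"},
]


def summarise(records, filter_key, filter_value, group_key):
    """Select records where filter_key equals filter_value, grouped by group_key."""
    groups = defaultdict(list)
    for record in records:
        if record[filter_key] == filter_value:
            groups[record[group_key]].append(record["id"])
    return dict(groups)


# Use case 1: experiments on a given biosource, grouped by treatment.
by_treatment = summarise(experiments, "biosource", "thymus", "treatment")
# Use case 2: experiments with a given treatment, grouped by biosource.
by_biosource = summarise(experiments, "treatment", "dexamethasone", "biosource")
print(by_treatment)
print(by_biosource)
```

Exact string matching like this only returns complete results when every submitter uses the same term for the same thing, which is the practical argument for controlled vocabularies.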


Ontology Example

Concept=Age
def=in standard units referenced to an identifiable time point from (class) developmental stage

Age=6 {units=days}, {dev_stage}=dauer
Hierarchy=Dev_stage->larva->dauer
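The hierarchy in this example is what allows a computational inference such as "a dauer is a larva". A small Python sketch of that idea; the dictionary layout and the `ancestors` and `is_a` helpers are assumptions for illustration, not part of the MGED ontology tooling.

```python
# Child -> parent links for the hierarchy Dev_stage -> larva -> dauer.
parent = {
    "dauer": "larva",
    "larva": "Dev_stage",
}


def ancestors(term):
    """Walk up the hierarchy from a term to the root."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain


def is_a(term, concept):
    """True if term is the concept itself or one of its descendants."""
    return term == concept or concept in ancestors(term)


# The annotated instance from the example.
annotation = {"Age": 6, "units": "days", "dev_stage": "dauer"}

print(ancestors("dauer"))
print(is_a(annotation["dev_stage"], "larva"))
```

With this, a query for samples at the larva stage can also return samples annotated with the more specific term dauer.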


Excerpts from a Sample Description
courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences

Organism: mus musculus [ NCBI taxonomy browser ]
Cell source: in-house bred mice (contact: [email protected])
Sex: female [ MGED ]
Age: 3 - 4 weeks after birth [ MGED ]
Growth conditions: normal
  controlled environment
  20 - 22 °C average temperature
  housed in cages according to EU legislation
  specified pathogen free conditions (SPF)
  14 hours light cycle
  10 hours dark cycle
Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ]
Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ]
Strain or line: C57BL/6 [ International Committee on Standardized Genetic Nomenclature for Mice ]
Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [ International Committee on Standardized Genetic Nomenclature for Mice ]
Treatment: in vivo [ MGED ] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse
Compound: drug [ MGED ] synthetic glucocorticoid Dexamethasone, dissolved in PBS


ArrayExpress conceptual model

[Figure: diagram with the labels Experiment, Hybridisation, Array, Sample, Data and Normalisation, plus the labels Publication, External links, Gene (e.g., EMBL) and Source (e.g., Taxonomy).]


ArrayExpress object model


ArrayExpress – the state of the art

• ArrayExpress Object model supporting MIAME requirements developed
• Data model implemented in Oracle
• Data loader from MAML file format
• Expression Profiler – data analysis tool already available


ArrayExpress – plans and schedule

• EU grant – new staff being recruited
• A web based query interface – under development
• A web based submission tool – under test
• Participation in OMG – MAGE-OM & MAGE-ML
• MAGE-ML will replace MAML in October
• Full scale database operation expected to start at the beginning of 2002
• Expression Profiler to link to ArrayExpress


Microarray data analysis

Expression Profiler – a web based gene expression data analysis tool: www.ebi.ac.uk/microarray/


Expression Profiler – web based tool for microarray data analysis
http://www.ebi.ac.uk/microarray/

[Figure: Expression Profiler components]
– EPCLUST: cluster expression profiles
– GENOMES: sequence, function, annotation
– SPEXS (Sequence Pattern Exhaustive Search): novel patterns
– PATMATCH: known patterns
– URLMAP: provide links
– Other labels: Expression data; External data, tools (pathways, function, etc.)


Conclusions

• Microarray standardisation is a challenge and an imperative
• Join MGED to contribute to this process: www.mged.org
• Participate in the development of ontologies and controlled vocabularies
• Send me your protocols
• Make your data available
• Feedback on MIAME, it’s up for discussion


Acknowledgments

Microarray Informatics Team, EBI: Alvis Brazma, Katja Kivinen, Helen Parkinson, Olga Perez, Johan Rung, Ugis Sarkans, Thomas Schlitt, Mohammad Shojatalab, Lev Soinov, Koichi Tazaki, Jaak Vilo

Industry Support team, EBI: Alan Robinson

MGED steering committee
MIAME working group
Chris Stoeckert, U. Penn. and MGED


Useful URLs

www.mged.org
www.tigr.org
www.ebi.ac.uk/array
www.geneontology.org
www.hgmp.mrc.ac.uk
www.dnachip.org/mged/normalization.html
[email protected]