why life is difficult, and what we might do about it

Why Research Data Management May Save Science. Anita de Waard, VP Research Data Collaborations, http://researchdata.elsevier.com/. Why Life is Difficult, And What We Can Do About It

Upload: anita-de-waard

Post on 10-May-2015


DESCRIPTION

Keynote ISMB 2013 Bio-ontologies Workshop - http://www.bio-ontologies.org.uk/programme

TRANSCRIPT

Page 1: Why Life is Difficult, and What We Might Do About It

Why Research Data Management May Save Science

Anita de Waard
VP Research Data Collaborations
http://researchdata.elsevier.com/

Why Life is Difficult, And What We Can Do About It

Page 2:

Outline:

• The problem: life is difficult.
• One approach to tackling this: claim-evidence networks.
  – How do we find claims?
  – How do we find evidence?
  – How do we connect the two?
• What is still missing?
• Call to action!

Page 3:

The Problem

Page 4:

Problem 1: a rose is not a rose:

• “…there was significant variability of the injected venom composition from specimen to specimen, in spite of their common biogeographic origin.”

Jose A. Rivera-Ortiz, Herminsul Cano, Frank Marí, Intraspecies variability of the injected venom of Conus ermineus, doi:10.1016/j.peptides.2010.11.014

• “…Strains DV-3/84 DV-7/84 (group 3) showed 76.6% similarity to each other and were similar to all other strains at the 67.6% level.”

Zofia Dzierżewicz et al., Intraspecies variability of Desulfovibrio desulfuricans strains determined by the genetic profiles, FEMS Microbiology Letters, Volume 219, Issue 1, 14 February 2003, Pages 69–74, doi:10.1016/S0378-1097(02)01199-0

=> A specimen is not a species!

Page 5:

Problem 2: gene expression varies with:

Age: “SIRT1-Associated genes are deregulated in the aged brain”

Philipp Oberdoerffer et al., SIRT1 Redistribution on Chromatin Promotes Genomic Stability but Alters Gene Expression during Aging, Cell, Volume 135, Issue 5, 28 November 2008, Pages 907–918, doi:10.1016/j.cell.2008.10.025

Smell: “…major urinary proteins […] mediate the pregnancy blocking effects of male urine”

P.A. Brennan, et al, Patterns of expression of the immediate-early gene egr-1 in the accessory olfactory bulb of female mice exposed to pheromonal constituents of male urine, Neuroscience, Volume 90, Issue 4, June 1999, P 1463–1470, doi:10.1016/S0306-4522(98)00556-9

Hunger: “Out of the ~30K genes, about 10K are differentially expressed in liver cells when an animal is in different states of satiety.”

Zhang F, Xu X, Zhou B, He Z, Zhai Q (2011) Gene Expression Profile Change and Associated Physiological and Pathological Effects in Mouse Liver Induced by Fasting and Refeeding. PLoS ONE 6(11): e27553. doi:10.1371/journal.pone.002755

Light: “Longer-term enrichment training also altered the mRNA levels of many genes associated with structural changes that occur during neuronal growth.”

Cailotto C., et al. (2009) Effects of Nocturnal Light on (Clock) Gene Expression in Peripheral Organs: A Role for the Autonomic Innervation of the Liver. PLoS ONE 4(5): e5650. doi:10.1371/journal.pone.0005650

=> Knowing genes is not knowing how they are expressed!

Page 6:

Problem 3: No man (or mouse) is an island…

• “We found the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals.”

The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234

• “Colonization of an infant’s gastrointestinal tract begins at birth. The acquisition and normal development of the neonatal microflora is vital for the healthy maturation of the immune system.”

Mackie RI, Sghir A, Gaskins HR., Developmental microbial ecology of the neonatal gastrointestinal tract. Am J Clin Nutr. 1999 May;69(5):1035S-1045S

=> An animal is an ecosystem!

Page 7:

Problem 4: Interactions create more complexity:

• Computing cancer: “No amount of information about what happens inside a single cell can ever tell you what a tissue is going to do,” [Glazier] said. “Much of the information and complexity of tissues and life is embedded in the way cells talk to each other and the extracellular environment.”

• Megadata: “These complex emergent systems are impossible to understand,” “[we] founded Applied Proteomics to create a protein diagnostic that reveals not just where a cancer is, but how it interacts with the body.”

Nature Special Issue, Vol. 491, No. 7425, ‘Physical Scientists Take On Cancer’

=> The whole is more than the sum of its parts!

Page 8:

Big problems in biology:

http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg

1. Intraspecies variability > A specimen is not a species!
2. Gene expression variability > Knowing genes is not knowing how they are expressed!
3. Microbiome > An animal is an ecosystem!
4. Systems biology > The whole is more than the sum of its parts!
5. Models vs. experiment > Are we talking about the same things? In a way we can all use?
6. Dynamics > Life is not in equilibrium!

Life is complicated! Reductionism doesn’t work for living systems.

Page 9:

Statistics could help! With enough observations, trends and anomalies can be detected:

• “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.”

The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234

• “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.”

Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.

Page 10:

But biological research is insular!

• Biology is small: size 10^-5 – 10^2 m, so a scientist can work alone (‘King’ and ‘subjects’).
• Biology is messy: it doesn’t happen behind a terminal.
• Biology is competitive: many people with similar skill sets, vying for the same grants.
• In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, high-energy physics or astronomy (and they’re not all they’re cracked up to be, either…)).

[Figure: research workflow stages: Prepare, Observe, Analyze, Ponder, Communicate]

Page 11:

How Can We Connect This Knowledge?

Page 12:

Claim-Evidence Networks Offer A Model for Connecting Knowledge:

Experimental Evidence
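As a rough illustration of what a claim-evidence network could look like as a data structure, here is a minimal sketch. It treats the network as a bipartite graph; the class name, method names and identifiers such as "claim:A" are hypothetical and are not taken from any system mentioned in this talk.

```python
from collections import defaultdict


class ClaimEvidenceNetwork:
    """Toy bipartite graph: claims on one side, evidence items on the other."""

    def __init__(self):
        self.supports = defaultdict(set)   # claim id -> ids of supporting evidence
        self.supported = defaultdict(set)  # evidence id -> ids of claims it supports

    def link(self, claim, evidence):
        """Record that `evidence` is offered in support of `claim`."""
        self.supports[claim].add(evidence)
        self.supported[evidence].add(claim)

    def evidence_for(self, claim):
        """All evidence linked to a claim, sorted for stable output."""
        return sorted(self.supports.get(claim, ()))

    def claims_sharing_evidence(self, claim):
        """Other claims that rest on at least one of the same evidence items."""
        related = set()
        for item in self.supports.get(claim, ()):
            related |= self.supported[item]
        related.discard(claim)
        return sorted(related)


# Hypothetical identifiers, for illustration only.
net = ClaimEvidenceNetwork()
net.link("claim:A", "dataset:1")
net.link("claim:A", "figure:2")
net.link("claim:B", "dataset:1")
```

Once claims and evidence are linked this way, asking which claims depend on the same dataset becomes a simple lookup, which is the kind of connection the rest of the talk is after.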

Page 13:

Converging on Claim/Evidence/Networks, e.g. here:

• The Karyotype Ontology: a computational representation for human cytogenetic patterns. Jennifer Warrender and Phillip Lord
• Lexical Analysis and Characterization of the OBOFoundry Ontologies. Manuel Quesada-Martínez, Jesualdo Tomás Fernández-Breis and Robert Stevens
• Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison. Peter Robinson, Sebastian Köhler, Anika Oellrich, Kai Wang, Chris Mungall, Suzanna E. Lewis, Sebastian Bauer, Dominik Seelow, Peter Krawitz, Christian Gilissen, Melissa Haendel and Damian Smedley
• BioAssay Ontology (BAO): Modularization, Integration and Applications. Uma Vempati, Hande Kucuk, Saminda Abeyruwan, Ubbo Visser, Vance Lemmon, Ahsan Mir and Stephan Schürer
• eXframe: A Semantic Web Platform for Genomics Experiments. Emily Merrill, Stephane Corlosquet, Paolo Ciccarese, Tim Clark and Sudeshna Das
• Ovopub: Modular data publication with minimal provenance. Alison Callahan and Michel Dumontier
• Zooma – A tool for automated ontology annotation. Tony Burdett, Simon Jupp, James Malone, Helen Parkinson, Eleanor Williams and Adam Faulconbridge
• A Probabilistic Framework for Ontology-Based Annotation in Neuroimaging Literature. Chayan Chakrabarti, Thomas B. Jones, Jiawei F. Xu, George F. Luger, Angela R. Laird, Matthew D. Turner and Jessica A. Turner
• Preserving sequence annotations across reference sequences. Zuotian Tatum, Andrew Gibson, Marco Roos, Peter E.M. Taschner, Mark Thompson, Erik A. Schultes and Jeroen F. J. Laros
• A Taxonomy for Immunologists. James A. Overton, Randi Vita, Jason A. Greenbaum, Heiko Dietze, Alessandro Sette and Bjoern Peters
• Health Data Ontology Trunk: A middle-layer ontology for health-care. Ulf Schwarz, Luc Schneider, Emilio Sanfilippo, Holger Stenzhorn and Nikolina Koleva
• Structured representation of scientific evidence using semantic web techniques – a biochemistry use case. Christian Bölling, Michael Weidlich and Hermann-Georg Holzhütter
• Synthetic Biology Open Language Visual: an ontological use case. Jacqueline Quinn, Michal Galdzicki, Robert Sidney Cox, Jacob Beal, Kevin Clancy, Nathan Hillson and Larisa Soldatova

Page 14:

Step 1: Find claims:
E.g., using XIP for discourse analysis:

In contrast with previous hypotheses compact plaques form before significant deposition of diffuse A beta, suggesting that different mechanisms are involved in the deposition of diffuse amyloid and the aggregation into plaques.

[Figure labels: Entities, Relationships, Temporality, Connections, thematic roles, Status; core information (proposition); information extraction; rhetorical metadiscourse; discourse analysis; discourse structure]

Sándor, Àgnes and de Waard, Anita, (2012).

Page 15:

Finding Claimed Knowledge Updates:

Sandor, A. and de Waard, A. (2012)

Here we used mass spectrometry to identify HuD as a novel neuronal SMN-interacting partner

Our analysis of known HuD-associated mRNAs in neurons identified cpg15 mRNA as a highly abundant mRNA in HuD IPs

Our finding that SMN protein associates with HuD protein and the HuD target cpg15 mRNA in neurons …

Definition:
1) A CKU expresses a verbal or nominal proposition about biological entities.
2) A CKU is a new proposition.
3) The authors present the CKU as factual.
4) A CKU is derived from the experimental work described in the article.
5) The ownership of the proposition is attributed to the author(s) of the article.
6) 4) and 5) are either explicitly expressed or are implicitly conveyed by a structural position as title, section or caption title.

Page 16:

Allow for Hedging and Uncertainty:
Ontology of Reasoning, Certainty and Attribution (ORCA)

For a Proposition P, an epistemically marked clause E is an evaluation of P, where $E_{V,B,S}(P)$, with:

– V = Value: 3 = Assumed true, 2 = Probable, 1 = Possible, 0 = Unknown (-1 = Possibly untrue, -2 = Probably untrue, -3 = Assumed untrue)
– B = Basis: Reasoning, Data
– S = Source:
  A = speaker is author A, explicit
  IA = speaker is author A, implicit
  N = other author N, explicit
  NN = other author NN, implicit

Based on a conversation with Ed Hovy; de Waard, A. and Schneider, J. (2012)
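To make the Value/Basis/Source triple concrete, here is a minimal sketch of how such an evaluation might be held in code. The class and member names are illustrative assumptions, not part of ORCA itself. The "Unknown" members for Basis and Source are added only because a later slide uses "Source = Unknown, Basis = Unknown" in an example, and the example record below copies that slide's evaluation of the miRNA statement.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Value(IntEnum):
    """Certainty scale as listed on the slide, from -3 to 3."""
    ASSUMED_UNTRUE = -3
    PROBABLY_UNTRUE = -2
    POSSIBLY_UNTRUE = -1
    UNKNOWN = 0
    POSSIBLE = 1
    PROBABLE = 2
    ASSUMED_TRUE = 3


class Basis(Enum):
    REASONING = "Reasoning"
    DATA = "Data"
    UNKNOWN = "Unknown"  # assumption: taken from the later worked example


class Source(Enum):
    AUTHOR_EXPLICIT = "A"
    AUTHOR_IMPLICIT = "IA"
    OTHER_EXPLICIT = "N"
    OTHER_IMPLICIT = "NN"
    UNKNOWN = "Unknown"  # assumption: taken from the later worked example


@dataclass(frozen=True)
class EpistemicEvaluation:
    """One evaluation E of a proposition P, i.e. E_{V,B,S}(P)."""
    proposition: str
    value: Value
    basis: Basis
    source: Source


evaluation = EpistemicEvaluation(
    proposition="miR-372 directly inhibits expression of LATS2",
    value=Value.POSSIBLE,
    basis=Basis.UNKNOWN,
    source=Source.UNKNOWN,
)
```

Using an integer-backed scale for Value means evaluations can be compared or sorted by certainty, which is handy when the same proposition is evaluated differently in different papers.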

Page 17:

Turning claims into formal representations:

Biological statement with BEL/epistemic markup

Biological statement: These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.
BEL representation:
  r(MIR:miR-372) -| (tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family")))
  Increased abundance of miR-372 decreases abundance of LATS2:
  r(MIR:miR-372) -| r(HUGO:LATS2)
Epistemic evaluation: Value = Possible; Source = Unknown; Basis = Unknown

Biological statement with MedScan/epistemic markup

Biological statement: Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).
MedScan representation:
  IL-6 NUCB2 (nesfatin-1)
  Relation: MolTransport
  Effect: Positive
  CellType: Adipocytes
  Cell Line: 3T3-L1
Epistemic evaluation: Value = Probable; Source = Author; Basis = Data

Page 18:

Claims Link to Evidence:

Page 19:

The evidence is in data. To structure this:

• There are many different research databases – both generic (Dryad, Dataverse, DataBank, Zenodo, etc.) and specific (NIF, IEDA, PDB)

• There are many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever)

• There are many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)

• There are scores of projects, committees, standards, bodies, grants, initiatives and conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)

• … you could make a living out of this!

Page 20:

…but this is what most scientists do:

Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.

Page 21:

One attempt to structure data: CMU Urban Legend

de Waard, A., Burton, S. et al., 2013

Page 22:

Connecting experimental results:

[Figure labels: Prepare, Analyze, Communicate (twice); Observations (three times)]

Across labs, experiments: track reagents and how they are used

Page 23:

Connecting experimental results:

[Figure labels: Prepare, Analyze, Communicate (twice); Observations (three times)]

Compare outcome of interactions with these entities

Page 24:

Connecting experimental results:

[Figure labels: Prepare, Analyze, Communicate (twice); Observations (three times); Think]

Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments.

Reason collectively!
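The idea running through these three slides (track reagents across labs, then compare what happened to them) can be sketched in a few lines. Everything here is a hypothetical illustration: the record layout, the lab and experiment identifiers, and the outcome strings are made up, and a real system would need shared reagent identifiers to work at all.

```python
from collections import defaultdict

# Hypothetical records: (lab, experiment id, reagent, observed outcome).
observations = [
    ("lab-1", "exp-01", "antibody-X", "binds target"),
    ("lab-2", "exp-07", "antibody-X", "no signal"),
    ("lab-2", "exp-08", "cell-line-Y", "proliferates"),
    ("lab-3", "exp-02", "antibody-X", "binds target"),
]


def index_by_reagent(records):
    """Group observations by reagent, regardless of which lab produced them."""
    index = defaultdict(list)
    for lab, experiment, reagent, outcome in records:
        index[reagent].append((lab, experiment, outcome))
    return dict(index)


def outcome_profile(records, reagent):
    """Count how often each outcome was reported for one reagent."""
    counts = defaultdict(int)
    for _lab, _experiment, outcome in index_by_reagent(records).get(reagent, []):
        counts[outcome] += 1
    return dict(counts)


profile = outcome_profile(observations, "antibody-X")
```

A per-reagent profile of outcomes across experiments is one simple reading of the ‘virtual reagent spectrogram’; the talk does not specify how such a spectrogram would actually be computed.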

Page 25:

NIF Antibodies Registry collects antibody information:

Page 26:

Step 3: Connect Claims and Evidence

Example: Hunter et al., Hanalyzer:

Page 27:

Example: Drug-Drug Interactions

Step 1: Manually identify DDIs and drug names in wide collection of content sources

Step 2: Develop a model of Drug-Drug Interaction and define candidates

Step 3: Automate this process and store as Linked Data

Boyce, Schroeder et al., 2013

Page 28:

Example: Connect recommendations in clinical guidelines to underlying evidence

Hoekstra, de Waard and Vdovjak, 2012

Page 29:

Example: do science ON the web:

Using what is known about interactions in fly & yeast, predict new interactions with a human protein –
running over data on the web that he neither created nor knew about!

Given a protein P in Species X:

• Find proteins similar to P in Species Y
• Retrieve interactors in Species Y
• Sequence-compare Y-interactors with Species X genome
• (1) Keep only those with a homologue
• Find proteins similar to P in Species Z
• Retrieve interactors in Species Z
• Sequence-compare Z-interactors with (1)
• Result: putative interactors in Species X
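The workflow on this slide can be sketched with toy lookup tables standing in for the web services. All protein identifiers and table contents are made up, the function names are hypothetical, and reading "sequence-compare Z-interactors with (1)" as a set intersection is an assumption of this sketch rather than something the slide states.

```python
# Toy lookup tables standing in for web services; all identifiers are made up.
SIMILAR = {  # (protein, other species) -> similar proteins in that species
    ("P", "fly"): {"fP"},
    ("P", "yeast"): {"yP"},
}
INTERACTORS = {  # protein -> interaction partners within its own species
    "fP": {"fA", "fB"},
    "yP": {"yA", "yC"},
}
HOMOLOGUE_IN_X = {  # protein in another species -> homologue in Species X, if any
    "fA": "A",
    "fB": "B",
    "yA": "A",
    "yC": None,
}


def candidates_via(protein, species):
    """Species-X homologues of the interactors of `protein`'s relatives in `species`."""
    found = set()
    for relative in SIMILAR.get((protein, species), set()):
        for partner in INTERACTORS.get(relative, set()):
            homologue = HOMOLOGUE_IN_X.get(partner)
            if homologue is not None:  # keep only partners with a homologue
                found.add(homologue)
    return found


def putative_interactors(protein, species_y, species_z):
    """Candidates supported by both species: set (1) narrowed by the Z-derived set."""
    step_one = candidates_via(protein, species_y)  # set (1) on the slide
    return step_one & candidates_via(protein, species_z)


result = putative_interactors("P", "fly", "yeast")
```

With these toy tables the fly route yields homologues A and B, the yeast route yields only A, so A is the single candidate supported by both species.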

Page 30:

Great! So we’re almost done, right – and we can all go home!

Not so fast…

Page 31:

What is a claim? In a paragraph?

Both seminomas and the EC component of nonseminomas share features with ES cells. To exclude that the detection of miR-371-3 merely reflects its expression pattern in ES cells, we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004). In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8), suggesting that miR-371-3 expression is a selective event during tumorigenesis.

The same paragraph, segmented into clauses:

Both seminomas and the EC component of nonseminomas share features with ES cells.
To exclude that
the detection of miR-371-3 merely reflects its expression pattern in ES cells,
we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004).
In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8),
suggesting that
miR-371-3 expression is a selective event during tumorigenesis.

Segment labels: Fact, Hypothesis, Method, Result, Implication, Goal, Reg-Implication
Knowledge types: Conceptual knowledge, Experimental Evidence

Page 32:

• Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.”

• Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).”

• Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).”

“[Y]ou can transform .. fiction into fact, just by adding or subtracting references”, Latour, 1987

What is the claim? Who makes it?
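The drift in certainty across the three sentences quoted on this slide can be made visible with a very crude check for hedging words. This is only a sketch: the list of hedge markers is an illustrative assumption, not the feature set of XIP or of any other system cited in this talk, and real epistemic analysis needs far more than keyword matching.

```python
import re

# Illustrative hedge markers only (an assumption for this sketch).
HEDGES = re.compile(
    r"\b(possibly|probably|perhaps|may|might|suggest\w*)\b", re.IGNORECASE
)

# The three sentences quoted on the slide, in order of publication.
citations = [
    ("Voorhoeve et al., 2006",
     "These miRNAs neutralize p53-mediated CDK inhibition, possibly through "
     "direct inhibition of the expression of the tumor suppressor LATS2."),
    ("Kloosterman and Plasterk, 2006",
     "In a genetic screen, miR-372 and miR-373 were found to allow "
     "proliferation of primary human cells that express oncogenic RAS and "
     "active p53, possibly by inhibiting the tumor suppressor LATS2."),
    ("Okada et al., 2011",
     "Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the "
     "expression of Lats2, thereby allowing tumorigenic growth in the "
     "presence of p53."),
]


def hedge_markers(sentence):
    """Return the hedge words found in a sentence, lower-cased, in order."""
    return [match.group(0).lower() for match in HEDGES.finditer(sentence)]


report = [(who, hedge_markers(text)) for who, text in citations]
```

The first two sentences keep the hedge "possibly"; the third carries none, even though it cites the same original paper.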

Page 33:

Evidence is largely lost….

> 50 My papers; 2 M scientists; 2 My papers/year

• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My)
• A small portion of data (1-2%?) stored in small, topic-focused data repositories (MiRB: 25k; PetDB: 1,5 k; TAIR: 72,1 k; PDB: 88,3 k; SedDB: 0.6 k)

Page 34:

…or buried..

Page 35:

…and you can’t extract it after the fact.

• In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al, submitted)

• The good news: we can find automatically what we can find manually

• Proposal (NIH, June 2013):
  – Author is asked to add methods section to a tool
  – Tool extracts likely reagents / resources
  – User interface asks author to confirm or select

[Figure labels: 49 publications, 193 publications, 76 publications, 214 publications, 210 publications]

Entity Type   Precision   Recall
Antibody      87.5        63.3
Resource      95.6        98.9
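One way to read the precision and recall figures together is to combine them into an F1 score (their harmonic mean). The F1 values are not reported on the slide; they are derived here from the two reported columns, on the assumption that the figures are percentages.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as shown on the slide, assumed to be percentages.
reported = {"Antibody": (87.5, 63.3), "Resource": (95.6, 98.9)}

scores = {entity: round(f1(p, r), 1) for entity, (p, r) in reported.items()}
```

The lower combined score for antibodies comes almost entirely from recall: the extractions that are made are mostly right, but many antibodies mentioned in the text are missed.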

Page 36:

Even if we can link to evidence:

• Is it true?

Page 37:

In Summary:

We’re not out of the woods (or a job) just yet!

Page 38:

We need to improve claim networks:

• Can we make systems of computer-readable meaning that still represent the fullness of natural language?
  >> Let’s work with computational linguists!

• Trace claims across publications:
  >> Let’s work with legal/political argumentation specialists! Sentiment analysis!

Page 39:

Improve evidence: scale up data curation!

> 50 My papers; 2 M scientists; 2 My papers/year

• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My)
• A small portion of data (1-2%?) stored in small, topic-focused data repositories (MiRB: 25k; PetDB: 1,5 k; TAIR: 72,1 k; PDB: 88,3 k; SedDB: 0.6 k)

INCREASE DATA DIGITISATION
IMPROVE DATA USABILITY
DEVELOP SUSTAINABLE MODELS
IMPROVE REPOSITORY INTEROPERABILITY

Page 40:

Keep asking big questions:

• Is this true?
• Does it matter?
• To whom?

“Let us now build systems that allow a kid in Mali who wants to learn about proteomics to not be overwhelmed by the irrelevant and the untrue.”

- John Perry Barlow, iAnnotate, SF 2013

Page 41:

In Memoriam Douglas C. Engelbart, 1925-2013:

“This is an initial summary report of a project taking a new and systematic approach to improving the intellectual effectiveness of the individual human being. A detailed conceptual framework explores the nature of the system composed of the individual and the tools, concepts, and methods that match his basic capabilities to his problems. One of the tools that shows the greatest immediate promise is the computer, when it can be harnessed for direct on-line assistance, integrated with new concepts and methods.”

Page 42:

Summary:

• The problem: life is difficult.
• One approach to tackle this: claim-evidence networks:
  – Find claims
  – Identify evidence
  – Connect the two.
• But we still need:
  – Better ways to represent subtlety of natural language
  – Better evidence: more structured, better connected
  – Focus on the big questions.
• There’s a lot of work to do!

Page 43:

Collaborations and discussions gratefully acknowledged:

• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
• UCSD: Phil Bourne, Brian Shoettlander, Ilya Zaslavsky
• NIF: Maryann Martone, Anita Bandrowski
• MSU: Brian Bothner
• OHSU: Melissa Haendel, Nicole Vasilevsky
• CDL: Carly Strasser, John Kunze, Stephen Abrams
• Harvard/MGH: Tim Clark, Paolo Ciccarese
• VU: Rinke Hoekstra, Frank van Harmelen, Paul Groth
• Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
• University of Pittsburgh: Richard Boyce
• Xerox Research Europe: Agnes Sandor
• DERI: Jodi Schneider

Thank you!

Page 44:

References:

• de Waard, Buckingham Shum, Park, Samwald, Sandor, 2009: Hypotheses, Evidence and Relationships, ISWC2009
• Biological Expression Language – http://www.openbel.org
• Latour, B. and Woolgar, S., Laboratory Life: the Social Construction of Scientific Facts, 1979, Sage Publications
• Latour, B., Science in Action, 1987
• de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
• Data2Semantics project: http://www.data2semantics.org/
• Sándor, Àgnes and de Waard, Anita, (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles, Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
• de Waard, A. and Schneider, J. (2012) Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA), Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012
• de Waard, A., Burton, S.D., Gerkin, R.C., Harviston, M., Marques, D., Tripathy, S.J., Urban, N.N., Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration, Discovery Informatics, 2013
• Boyce, R.D., Horn, J.R., Hassanzadeh, O., de Waard, A., Schneider, J., Luciano, J.S., Liakata, M., Dynamic enhancement of drug product labels to support drug safety, efficacy, and effectiveness. Jnl of Biomedical Semantics, 2013, 4:5.
• Hoekstra, R., de Waard, A., Vdovjak, R. (2012) Annotating Evidenced Based Clinical Guidelines - A Lightweight Ontology, Proceedings of SWAT4LS 2012, Paris, Adrian Paschke, Albert Burger, Paolo Roma, M. Scott Marshall, Andrea Splendiani (ed.), Springer.

http://researchdata.elsevier.com/