how data sharing leads to - keio university...the ontology should enable data integration across...

Post on 30-May-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

How data sharing leads toHow data sharing leads to knowledgeg

M. Scott Marshall, Ph.D.W3C HCLS IG h iW3C HCLS IG co‐chair 

Leiden University Medical CenterUniversity of Amsterdam

http://staff.science.uva.nl/~marshall

Semantic Web Anno 1999Semantic Web Anno 1999

DIY (Do It Yourself)

Reasoning across a Federationg(Semantic Web 2012?)

SPARQLクエリを分解して分散コンピューSPARQLクエリを分解して分散コンピューティングを行なう

The Semantic Web is the New GlobalThe Semantic Web is the New Global Web of Knowledge

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It makes possible the answeringsophisticated questions usingsophisticated questions using

background knowledge

Source: Michel Dumontier

Some of the forces at workSome of the forces at work• Pharmaceutical industry changing strategyy g g gy

– David Cox Strategy: Academic / Industry partnershippartnership

– Pistoia Alliance, Vocabulary Services Initiative

• Personalized Medicine and EHRs

• US NIH: NCBCs and NCBOUS NIH: NCBCs and NCBO

• NCI Semantic Infrastructure

• European Innovative Medicine Initiatives (IMI)

Biology in a nutshell: Bigger isn’t bettergy gg

• DNA Dogma– Transcription = DNA ‐> mRNA ‐> Protein

• Molecular pathways allow biologists to ‘connect’ one molecule to another.

• Huntington’s mutation mapped in 1993 yet there is still no understanding of the mechanism that causesstill no understanding of the mechanism that causes the neurodegeneration.

• Semantic models are necessary to create a ‘systems• Semantic models are necessary to create a  systems view’ of biology. 

Where is biomedical knowledge?Where is biomedical knowledge?

Can be extracted from:

• PeoplePeople

• LiteratureM t f th f• Diagrams

• Clinical reports

Most of these sources of biomedical knowledge are notmachine‐readableClinical reports 

• Databases

notmachine readable

• Excel sheets

•• …

ExamplesExamples

1. “find all images with a big lesion in the right frontal lobe » ? g g g

2. Provide the neurosurgeon with landmarks Hmm, but what’s nearby?

Source: Christine Golbreich

Many tasks are still a challenge!With existing Web and Health IT: 

• Find and integrate informationFind and integrate information 

– “Although a plethora of resources (tools, databases materials) for neuroscientists is nowdatabases, materials) for neuroscientists is now available on the web, finding these resources among the billions of possible web pagesamong the billions of possible web pages continues to be a challenge.” [M. Martone, NCBO Seminar Series, 4 Nov 2009], ]

• Make multiple inferences based on background knowledgeknowledge 

– to obtain more complete answers

t di k l d– to discover knowledgeSource: Christine Golbreich

ExamplesExamples– in a medical record system

“find all patients which radiology exhibits a fracture of femur”

d– in genomic data

“find all genes annotated with a molecular function or f it d d t d hi h i i t d ithany of its descendants and which is associated with any

form of a given disease (see genes associated to muscular dystrophy [Sahoo et al. 2007])muscular dystrophy [Sahoo et al. 2007])

– find, share, annotate images

Source: Christine Golbreich

Biological and medical ontologiesBiological and medical ontologies• Medical domain is *very* lucky☺

a large number of terminologies and reference ontologies, E.g., FMA, NCI, GO, SNOMED‐CT, etc.

• Web PortalsBioportal library contains ~200 ontologies in different languages: OBO– Bioportal library contains 200 ontologies in different languages: OBO, Protégé Frames, RDF, OWL http://bioportal.bioontology.org/

– Bioportal now provides SPARQL access to ontologies: http://sparql bioontology orghttp://sparql.bioontology.org 

– Open Biomedical Ontologies (OBO) Foundry, http://obofoundry.org/

Source: Christine Golbreich

MotivationMotivation

Science is based on knowledge:Science is based on knowledge: 

knowledge capture, knowledge sharing, i.e. i i f fi dicommunication of findings. 

Semantic Web provides a basis for knowledge h i th h hi d bl dsharing through machine‐readable and reason‐able annotation of resources.

What is knowledge ?What is knowledge ?

“data”, “information”, “facts”, “knowledge”

Knowledge is a statement 

that can be tested for truth.

(by a machine)

Otherwise, computing can’t add much

RDF : a web format for knowledge

RDF is a W3C language to expressRDF is a W3C language to express 

statements.RDF Triple:

Subject     Predicate     Objectj j

Graph of Knowledge:Graph of Knowledge:

Node           Edge           Node

Round trip knowledge engineering (Vision)

In a linked open data environment:In a linked open data environment:– How will I find the resources? 

H ill I i t t th i t li ti ?– How will I integrate them into an application?

– How will I use existing knowledge in my analysis?

– How will I publish the new knowledge resulting from my analysis?

– And link to the evidence (data) from the new knowledge?knowledge?

Programmable Web:A few thousand APIs

http://www.programmableweb.com/

SPARQL Interoperability:A single API

• a query language for RDF that matches graph patternsp

• SPARQL endpoint is comparable to a database connector for triplesconnector for triples

• SPARQL endpoint enables data layer interoperability

Background of the HCLS IGBackground of the HCLS IG

• Originally chartered in 2005– Chairs: Eric Neumann and Tonya Hongsermeier

• Re‐chartered in 2008• Re‐chartered in 2008– Chairs: Scott Marshall and Susie Stephens

– Team contact: Eric Prud’hommeaux

• Broad industry participation– Over 100 members 

Mailing list of over 600– Mailing list of over 600

• Background Information– http://www.w3.org/blog/hclsp g g

– http://esw.w3.org/topic/HCLSIG

Mission of HCLS IGMission of HCLS IG

•The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for

Biological science– Biological science

– Translational medicine

– Health care

•These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on h i bili f i f i f d i dthe interoperability of information from many domains and processes for efficient decision support

How does the HCLS IG work?How does the HCLS IG work?

• Task forces

• Regular teleconferences using teleconferenceRegular teleconferences using teleconference bridge (Zakim), IRC, minutes

F 2F (F2F) i• Face2Face (F2F) twice a year

• Procedures for publishing W3C notes p g

Current Task ForcesCurrent Task Forces

Bi RDF f d i ( i ) k l d b• BioRDF – federating (neuroscience) knowledge bases– M. Scott Marshall (Leiden University Medical Center / University of Amsterdam)

• Clinical Observations Interoperability – patient recruitment in trials– Vipul Kashyap (Cigna Healthcare) 

• Linking Open Drug Data – aggregation of Web‐based drug data – Susie Stephens (Johnson & Johnson)Susie Stephens (Johnson & Johnson)

• Translational Medicine Ontology – high level patient‐centric ontology– Michel Dumontier (Carleton University) 

S i tifi Di b ildi iti th h t ki• Scientific Discourse – building communities through networking– Tim Clark (Harvard University) 

• Terminology – Semantic Web representation of existing resources– John Madden (Duke University)

COI: Translating across domainsCOI: Translating across domains

EHR

Microarray AlzForum

MRI PubMed

COI: Use CaseCOI: Use Case

Pharmaceutical companies pay a lot to test drugsg

Pharmaceutical companies express protocol in CDISCCDISC

‐‐ precipitous gap –

H it l h i f ti i HL7/RIMHospitals exchange information in HL7/RIM

Hospitals have relational databases

Source: Eric Prud’hommeaux

Inclusion Criteria

T 2 di b t di t d i th• Type 2 diabetes on diet and exercise therapy or• monotherapy with metformin, insulin

t l h l id i hibit• secretagogue, or alpha-glucosidase inhibitors, or• a low-dose combination of these at 50%

i l d D i i t bl f 8 k i• maximal dose. Dosing is stable for 8 weeks prior• to randomization.• …• ?patient takes metformin .

Source: Holger Stenzhorn

Exclusion CriteriaExclusion Criteria

Use of warfarin (Coumadin), clopidogrel(Plavix) or other anticoagulants.( ) g…?patient doesNotTake anticoagulant .?patient doesNotTake anticoagulant .

Source: Holger Stenzhorn

Criteria in SPARQL?medication1 sdtm:subject ?patient ;

Criteria in SPARQL

spl:activeIngredient ?ingredient1 .?ingredient1 spl:classCode 6809 . #metformin

OPTIONAL {OPTIONAL {

?medication2 sdtm:subject ?patient ;spl:activeIngredient ?ingredient2 spl:activeIngredient ?ingredient2 .

?ingredient2 spl:classCode 11289 .#anticoagulant

} FILTER (!BOUND(?medication2))

Source: Holger Stenzhorn

TMO: Translating across domainsTMO: Translating across domains

EHR

Microarray AlzForum

MRI PubMed

Questions & ProblemsThe Drug Development Pipeline

“A virtual space odyssey”, Cath O'Driscoll  (2004) 

http://www.nature.com/horizon/chemicalspace/background/odyssey.html

• The road is long, and costly.

H d t i t d d l b tt d ?• How do we contain costs and develop better drugs?

Source: Elgar Pichler

Translational Medicine OntologyMission

• Focuses on the development of a high level patient‐centric ontology for the pharmaceutical industry. The ontology should enable data integration across discovery research, hypothesis management,integration across discovery research, hypothesis management, experimental studies, compounds, formulation, drug development, market size, competitive data, population data, etc. This would enable i i i d i i i ifiscientists to answer new questions, and to answer existing scientific 

questions more quickly. 

• This will help pharmaceutical companies to model patient‐centricThis will help pharmaceutical companies to model patient centric information, which is essential for the tailoring of drugs, and for early detection of compounds that may have sub‐optimal safety profiles. The ontology should link to existing publicly available domain ontologies.

TMO StructureTMO Structure

Source: Susie Stephens

Translational Medicine KBTranslational Medicine KB

Source: Susie Stephens

TMO QueryTMO Query

How many patients experienced side effects while taking Donepezil?

Source: Susie Stephens

Terminology: Translating across domains

EHR

Microarray AlzForum

MRI PubMed

TerminologyTerminology

• Representation of medical documents (OWL and SKOS))

• SKOS enables ‘fuzzier’ statements, with ‘broader’ and ‘narrower’broader  and  narrower

• Mammogram: Represent both radiology and pathology report to discover discrepancies

• Use Translational Medicine Ontology RadLex• Use Translational Medicine Ontology, RadLex, SNOMED in the RDF

BioRDF: Translating across domainsBioRDF: Translating across domains

EHR

Microarray AlzForum

MRI PubMed

BioRDF: Looking for Targets for Alzheimer’s

• Signal transduction pathways are considered to be rich in “druggable” targetstargets • CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease• Casting a wide net, can we find candidate genes known to be involvedcandidate genes known to be involved in signal transduction and active in Pyramidal Neurons?

Source: Alan Ruttenberg

BioRDF: Integrating Heterogeneous DataBioRDF: Integrating Heterogeneous Data 

NeuronDB

PDSPki

R t NeuronDB

BAMS

Gene Ontology

BrainPharmAntibodies

Reactome

Allen Brain

Literature

Entrez Gene

M li

BrainPharmAntibodies

PubChem

MESHAtlas

Homologene

SWAN

Mammalian Phenotype

AlzGene

Source: Susie Stephens

“find me genes involved in signal transduction that arefind me genes involved in signal transduction that are related to pyramidal neurons”

BioRDF: Results: Genes ProcessesBioRDF: Results: Genes, Processes

•DRD1, 1812 adenylate cyclase activation•ADRB2, 154 adenylate cyclase activation•ADRB2, 154 arrestin mediated desensitization of G‐protein coupled receptor protein signaling pathway•DRD1IP, 50632 dopamine receptor signaling pathway•DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway•DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway•GRM7 2917 G‐protein coupled receptor protein signaling pathwayGRM7, 2917 G protein coupled receptor protein signaling pathway•GNG3, 2785 G‐protein coupled receptor protein signaling pathway•GNG12, 55970 G‐protein coupled receptor protein signaling pathway•DRD2, 1813 G‐protein coupled receptor protein signaling pathway•ADRB2, 154 G‐protein coupled receptor protein signaling pathway•CALM3, 808 G‐protein coupled receptor protein signaling pathway•HTR2A, 3356 G‐protein coupled receptor protein signaling pathwayDRD1 1812 G i i li l d li l id d

Many of the genes are related to AD through gamma •DRD1, 1812 G‐protein signaling, coupled to cyclic nucleotide second messenger

•SSTR5, 6755 G‐protein signaling, coupled to cyclic nucleotide second messenger•MTNR1A, 4543 G‐protein signaling, coupled to cyclic nucleotide second messenger•CNR2, 1269 G‐protein signaling, coupled to cyclic nucleotide second messenger•HTR6, 3362 G‐protein signaling, coupled to cyclic nucleotide second messenger•GRIK2, 2898 glutamate signaling pathway•GRIN1, 2902 glutamate signaling pathway

g gsecretase (presenilin) activity

, g g g p y•GRIN2A, 2903 glutamate signaling pathway•GRIN2B, 2904 glutamate signaling pathway•ADAM10, 102 integrin‐mediated signaling pathway•GRM7, 2917 negative regulation of adenylate cyclase activity•LRP1, 4035 negative regulation of Wnt receptor signaling pathway•ADAM10, 102 Notch receptor processing•ASCL1 429 Notch signaling pathway•ASCL1, 429 Notch signaling pathway•HTR2A, 3356 serotonin receptor signaling pathway•ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)•PTPRG, 5793 ransmembrane receptor protein tyrosine kinase signaling pathway•EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway•NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway•CTNND1, 1500 Wnt receptor signaling pathway

Source: Alan Ruttenberg

Automatic query decompositionAutomatic query decomposition

• Decompose query into components

• Locate sources of informationLocate sources of information

• Distribute queries across information sources

• Aggregate results

A SPARQL query for processes involved in pyramidal neurons

prefix go: <http://purl org/obo/owl/GO#>prefix go: <http://purl.org/obo/owl/GO#>prefix rdfs: <http://www.w3.org/2000/01/rdf‐schema#>prefix owl: <http://www.w3.org/2002/07/owl#>prefix mesh: <http://purl.org/commons/record/mesh/>prefix sc: <http://purl.org/science/owl/sciencecommons/>prefix ro: <http://www.obofoundry.org/ro/ro.owl#> MeSH: Pyramidal Neuronsselect ?genename ?processnamewhere{  graph <http://purl.org/commons/hcls/pubmesh>{ ?paper ?p mesh:D017966 .?article sc:identified_by_pmid ?paper.?gene sc:describes_gene_or_gene_product_mentioned_by ?article.}

Pubmed: Journal Articles}graph <http://purl.org/commons/hcls/goa>{ ?protein rdfs:subClassOf ?res.?res owl:onProperty ro:has_function.?res owl:someValuesFrom ?res2.?res2 owl:onProperty ro:realized_as.?res2 owl:someValuesFrom ?process.h htt // l / /h l /20070416/ l l ti Entrez Gene: Genesgraph <http://purl.org/commons/hcls/20070416/classrelations>

{{?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166} union{?process rdfs:subClassOf go:GO_0007166 }} ?protein rdfs:subClassOf ?parent.?parent owl:equivalentClass ?res3.?res3 owl:hasValue ?gene.

Entrez Gene: Genes

GO: Signal Transductiong}

graph <http://purl.org/commons/hcls/gene>{ ?gene rdfs:label ?genename }graph <http://purl.org/commons/hcls/20070416>{ ?process rdfs:label ?processname}

}

GO: Signal Transduction

Inference required

Recipe for a Semantic WebRecipe for a Semantic Web

• Follow Linked Open Data principles• Follow Linked Open Data principles

• Attempt to use Shared Names (same URI’s)

• Query rewriting to map from:                 – SPARQL > (query language)– SPARQL  ‐>  (query language)

– SPARQL (term1) ‐> SPARQL (term2)

• Add provenance information about graphsp g p

Interlinking in LODDInterlinking in LODD

http://esw.w3.org/HCLSIG/LODD/Interlinking

LOD cloud – a few challengesLOD cloud – a few challenges

• It’s a (nice) data warehouse in RDF– http://lod.openlinksw.com/p // p /

• Ad hoc identifiersP f h i i i d h d– Prefer authoritative, persistent, and shared

• Redundancyy

• Lacks provenance – Who created a given RDF rendering? Using what method? Version?rendering? Using what method? Version?

So what’s missing?So, what s missing?

• Data stewardship – serve (annotated) data back to communityy

• Plug‐n‐play SPARQL access to relational stores 

B i f d f d i f SPARQL• Best practice for data federation of SPARQL endpoints

• Best practice for important data sources: microarray data variety of image datamicroarray data, variety of image data

• Shared Identifiers

ProvenanceProvenance

• Represent knowledge so that others can discover where a fact (or triple) came from and evaluate how to use it – link facts to data as evidence

• Named graphs – talk about what’s in them

• Example use: “Find the latest RDF version of DrugBank” from SPARQL endpoints accessible to me.ug a o S Q e dpo s access b e o e

Provenance for Federationor “What’s in a graph?”

• Minimum information about a graph (MIAG)

• Is it an ontology or RDF data?Is it an ontology or RDF data?

• Version, License, ..

• Who converted it to RDF?

• Rendering: “I’m looking for SNOMED in SKOSRendering:  I m looking for SNOMED in SKOS (not OWL)”

• What is the most populated class?

• Graph statisticsGraph statistics

Translating across domainsTranslating across domains

EHR

Microarray AlzForum

MRI PubMed

Provenance types are perspectives on the data 

Source: Helena Deus

A Bottom‐up Approachp pp

Provenance Workflow, Domain ontologiesCommunitymodels

Workflow, experimental design

Domain ontologies(DO, GO…)

Communitymodels

Which genes are markers gfor neurodegenerative 

diseases?

Was gene ALG2

Provenance of Microarray experiment Was gene ALG2 

differentially expressed in multiple experiments?

experiment

Wh f d

QuestionsWhat software was used to 

analyse the data?

Raw Data

Results How can the experiment be replicated?

Raw Data

Source: Helena Deus

BioRDF: finding latest versionBioRDF: finding latest versionFILTER (?date_issued2 > ?date_issued) 

FILTER (!BOUND(?srvc2)) 

#Get associated diseases from most recently updated Diseasomey pserver. 

SERVICE ?srvc2 {                            #  <‐ Federation here

?diseasomeGene rdfs:label ?geneLabel . 

?disease diseasome:associatedGene ?diseasomeGene. 

?disease rdfs:label ?diseaseName . 

Myth bustingMyth busting

• Creating a Semantic Web application does not necessarily require converting and storing all y q g gdata in RDF 

• “Leave the data where it lives”• Leave the data where it lives

• Create views of the data accessible via RDF export or (BETTER) a SPARQL endpoint

SWObjectsSWObjects

• Creator: Eric Prud’hommeaux

• SWObjects creates a SPARQL endpointSWObjects creates a SPARQL endpoint

• Query rewriting to SQL and SPARQL

• Uses SPARQL Constructs as a mapping language

• JNI interfaceJNI interface

• http://en.wikipedia.org/wiki/SWObjects

• http://precedings.nature.com/documents/5538/

Semantic Web is a frameworkSemantic Web is a framework

that  provides common languages for defining vocabularies and querying across data that q y guses them.

However,  common identifiers must be established if we are to link data without a million mapping files.million mapping files.

Shared IdentifiersShared Identifiers

• Must use common URI’s in order to link data

• Provenance related identifiers still needed:Provenance related identifiers still needed:– Identifiers for people (researchers)

Id ifi f di– Identifiers for diseases

– Identifiers for terms (Terminology servers)

– Identifiers for programs, processes, workflows

– Identifiers for chemical compoundsIdentifiers for chemical compounds

• Shared Names http://sharednames.org

Emerging Best Practices for RDFEmerging Best Practices for RDF

d d d d f h1. Convert unstructured source data into a structured data format, such as relational tables or XML 

2. Identify persistent URLs (PURLS) for concepts in existing ontologies and y p ( ) p g gcreate a map from the structured data into an RDF view 

3. Customize the map manually if necessary

i h i i h d4. Map concepts in the new RDF mapping to concepts in other RDF data sources relying as much as possible on ontologies 

5. Publish the RDF data through a SPARQL endpoint or as Linked Data g p

6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [Prud’hommeaux et al. 2010] that enables SPARQL queries over the relational schemaSPARQL queries over the relational schema

7. Create Semantic Web applications using the published data 

Create RDF views that a stranger can useCreate RDF views that a stranger can use

• Use a mapping language to create an RDF view of the data when possible, rather than customized code.

Wh ibl b l i th t l il bl• When possible, use vocabularies that are openly available from an authoritative server. For the life sciences, we have OBO and NCBOOBO and NCBO.

• Use rdfs:label generously to provide information to user interfaces

Publish RDF so that it can be discoveredPublish RDF so that it can be discovered

• Publish open access data whenever possible, as well as any associated software.

• Publish a URL to the software and mappings that you used to create the RDF.

• Register your data at CKAN. If it’s biomedical, register it in BioCatalogue.o a a ogue

• Assign a graph URI to the RDF graph and add provenance and metadata about the graph URI toprovenance and metadata about the graph URI to the graph itself. This practice makes it possible for visitors and crawlers to find out what is in the graphvisitors and crawlers to find out what is in the graph.

AcknowledgementsAcknowledgements• W3C Health Care and Life Science Interest Group, http://www.w3.org/blog/hcls

• National Center for Biomedical Ontologiesg

• Concept Web Alliance

• Authors of all contributed slides• Authors of all contributed slides

The End

“Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house ”stones is a house.

Henri Poincaré– Henri Poincaré, 

Science and Hypothesis, 1905

top related