
Big data, Semantic Web and Ontologies

Mélanie Courtot, PhD
Nov 12th 2014

mcourtot@sfu.ca


About me

Overview

• Big Data
– Big Data is BIG
– Issues in research
• Semantic Web
– Standards: URIs, RDF, SPARQL, OWL
– Linked data
• Ontologies
– Definition and reasoning
– OBO Foundry
– Example of existing ontologies
– Pharmacovigilance
– Publishing ontologies on the Semantic Web
• IRIDA
– The IRIDA platform
– Adding standards to IRIDA
• Take home message


Big data

Big data is data that is too large and complex to process with conventional data tools.


2005


2013


What is a Zettabyte?

1,000,000,000,000 gigabytes
1,000,000,000 terabytes
1,000,000 petabytes
1,000 exabytes
1 zettabyte
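As a quick check of the scale, the sketch below converts one zettabyte into smaller units. It assumes decimal (SI) prefixes, where each step up is a factor of 1,000; the `UNITS` table and the `convert` helper are names made up for this illustration.

```python
# Decimal (SI) storage units expressed in bytes (assumption: powers of 1,000,
# not the binary powers of 1,024 used by some tools).
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert a whole number of units, rounding down to a whole result."""
    return amount * UNITS[from_unit] // UNITS[to_unit]


if __name__ == "__main__":
    for unit in UNITS:
        print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,} {unit}(s)")
```

Integer arithmetic is used so that the large values stay exact.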


How big is big?

• Facebook: 25 Terabytes of logged data per day, Google (2008): 20 Petabytes per day

• Over 90% of all the data in the world was created in the past 2 years [1]

• Today 3.2 zettabytes. 2020: 40 zettabytes.[2]

• Good news: jobs! [3]

1. http://www-01.ibm.com/software/data/bigdata/
2. http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3. http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Issues with research data (1): data availability

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Issues with research data (2): data reproducibility

http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU


A solution: the Semantic Web

"The Semantic Web is an... extension of the current web in which... information is given well-defined meaning,... better enabling computers and people to work in cooperation."

The Semantic Web
Tim Berners-Lee, James Hendler and Ora Lassila
Scientific American, May 2001

http://www.scientificamerican.com/article/the-semantic-web/

The Semantic Web in a nutshell

Adds to Web standards and practices (currently only for documents and services) encouraging
• Unambiguous names for things, classes, and relationships
• Well organized and documented in ontologies
• With data expressed using uniform knowledge representation languages (e.g. OWL)
• To enable computationally assisted exploitation of information
• That can be easily integrated from different sources


Some Semantic Web successes

• In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!

• A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.

• Siri (Speech Interpretation and Recognition Interface), launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, draws heavily on work on ontologies, knowledge representation, and reasoning.

http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf


Uniform Resource Identifiers (URIs)

• Two different uses:
– Unambiguous name for something
– Location of a document
• Examples:
– http://example.org/wiki/Main_Page
– ftp://example.org/resource.txt
– mailto:someone@example.com
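The three example URIs share one generic syntax: a scheme, an optional authority, and a path. As a small sketch, Python's standard `urllib.parse.urlsplit` can take them apart; the `describe` helper is a name invented for this illustration.

```python
from urllib.parse import urlsplit

EXAMPLES = [
    "http://example.org/wiki/Main_Page",
    "ftp://example.org/resource.txt",
    "mailto:someone@example.com",
]


def describe(uri):
    """Split a URI into its scheme, authority (host) and path components."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(uri, "->", describe(uri))
```

The `mailto:` URI has no authority at all, a reminder that a URI can name something without pointing at a document on a server.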

Resource Description Framework (RDF)

• Resources (= nodes)
– Identified by a Uniform Resource Identifier (URI)
• Properties (= edges)
– Identified by a Uniform Resource Identifier (URI)
– Binary relations between 2 resources

http://elmonline.ca/sw/sparql/social.ttl
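The node-and-edge model can be sketched without any RDF library by storing each statement as a (subject, predicate, object) tuple. This is only an illustration: the data is a handful of FOAF statements about the presenter, literals are kept as plain strings (a simplification), and the `objects` helper is a hypothetical name.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "http://www.linkedin.com/in/mcourtot"

# An RDF graph is a set of triples; every edge (predicate) is itself a URI.
TRIPLES = {
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "name", "Melanie Courtot"),
    (ME, FOAF + "knows", "http://elmonline.ca/luke"),
    (ME, FOAF + "knows", "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def objects(triples, subject, predicate):
    """Return, sorted, every object reachable from subject via predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


if __name__ == "__main__":
    for person in objects(TRIPLES, ME, FOAF + "knows"):
        print(person)
```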

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.linkedin.com/in/mcourtot> a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .

SPARQL

A query language for RDF

SELECT ?person
WHERE {
    <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}

----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------

• An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/
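At its core, a SPARQL SELECT matches a triple pattern against the graph and reports the values bound to each variable. The sketch below mimics that for a single pattern in plain Python; it is not a SPARQL engine, and the `match` function and the tuple-based graph are assumptions made for illustration.

```python
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"
ME = "http://www.linkedin.com/in/mcourtot"

GRAPH = {
    (ME, NAME, "Melanie Courtot"),
    (ME, KNOWS, "http://elmonline.ca/luke"),
    (ME, KNOWS, "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def match(pattern, triples):
    """Yield one {variable: value} binding per triple matching the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


if __name__ == "__main__":
    for row in match((ME, KNOWS, "?person"), GRAPH):
        print(row["?person"])
```

Real queries combine several such patterns and join their bindings; this sketch covers the single-pattern case only.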

The Web Ontology Language (OWL)

• Knowledge representation language
• Based on Description Logics: fragments of first-order logic with decidable and well-defined computational properties
• Sound, complete, terminating reasoners available


Linked open data cloud


Biological resources in LOD


Examples of issues in linking data incorrectly

• http://dbpedia.org/resource/Welsh owl:sameAs
– <http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
– <http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
– <http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
– <http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>
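Because owl:sameAs is symmetric and transitive, linking one resource to four others also equates those four with each other: the ethnic group, the word, the language and the act of cheating all collapse into a single individual. The sketch below computes that closure with a small union-find; `same_as_classes` is a hypothetical helper, not the way any particular triple store works.

```python
CYC = "http://sw.cyc.com/2006/07/27/cyc/"
DBPEDIA_WELSH = "http://dbpedia.org/resource/Welsh"

SAME_AS = [
    (DBPEDIA_WELSH, CYC + "EthnicGroupOfWelsh"),
    (DBPEDIA_WELSH, CYC + "Welsh-TheWord"),
    (DBPEDIA_WELSH, CYC + "WelshLanguage"),
    (DBPEDIA_WELSH, CYC + "Welshing-Cheating"),
]


def same_as_classes(pairs):
    """Group resources into equivalence classes under sameAs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


if __name__ == "__main__":
    for group in same_as_classes(SAME_AS):
        print(len(group), "resources treated as one:", sorted(group))
```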


Ontologies

• Representation of important things in a specific domain
– Describes types of entities (e.g. cells) and relations between them (e.g. prokaryotic cells and eukaryotic cells are cells) and their instances (e.g. the specific cells in my sample)
• An active computational artifact
– A mathematical model based on a subset of first-order logic
– Tools can automatically process ontologies
• A communication tool
– Provides a dictionary for collaborators, a shared understanding
– Allows data sharing

Reasoning is critical

• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell

Unsatisfiability
Solution: clarify spore as spore (sensu Mycetozoa) AND actinomycete-type spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
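The spore problem can be reproduced with a few lines of code: a class that inherits from two classes declared disjoint can never have instances. The sketch below only walks named subclass links and checks pairwise disjointness, which is a tiny fraction of what an OWL reasoner does; the dictionaries and function names are assumptions made for illustration.

```python
# Asserted subclass links (child -> direct parents), taken from the slide.
SUBCLASS_OF = {
    "eukaryotic cell": {"cell"},
    "prokaryotic cell": {"cell"},
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls):
    """Return every direct or indirect superclass of cls."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes():
    """List classes that fall under both members of a disjoint pair."""
    result = []
    for cls in SUBCLASS_OF:
        supers = ancestors(cls) | {cls}
        if any(a in supers and b in supers for a, b in DISJOINT):
            result.append(cls)
    return result


if __name__ == "__main__":
    print("Unsatisfiable:", unsatisfiable_classes())
```

Only "spore" is flagged, because it reaches "eukaryotic cell" through "fungal cell" while also being asserted under "prokaryotic cell".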


Logics

• Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf

• Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl

• Manipulation done using Protégé: http://protege.stanford.edu


Family ontology


Logics of a grandfather


Reasoning


Inferred class hierarchy


Explanations


A wrong assertion


Unsatisfiability


OBO Foundry

A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure

• tight connection to the biomedical basic sciences

• compatibility

• interoperability, common relations

• formal robustness

• support for logic-based reasoning

http://www.obofoundry.org

RELATION TO TIME (columns) by GRANULARITY (rows)

| GRANULARITY | CONTINUANT: INDEPENDENT | CONTINUANT: DEPENDENT | OCCURRENT |
| ORGAN AND ORGANISM | Organism (NCBI Taxonomy?); Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO); Phenotypic Quality (PaTO) | Organism-Level Process (GO) |
| CELL AND CELLULAR COMPONENT | Cell (CL); Cellular Component (FMA, GO) | Cellular Function (GO) | Cellular Process (GO) |
| MOLECULE | Molecule (ChEBI, SO, RnaO, PrO) | Molecular Function (GO) | Molecular Process (GO) |

Slide credit: Barry Smith

Minimum Information to Reuse an External Ontology Term (MIREOT)

• OBO and Semantic Web promote reuse of resources
• Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support.
• MIREOT used across the OBO library
– OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
– PR (Protein Ontology): 23,000 mireoted terms

• http://ontofox.hegroup.org
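The minimum information MIREOT asks for is small: the IRI of the source ontology, the IRI of the term being reused, and the IRI of the class it should sit under in the importing ontology. A sketch of that record follows; the `MireotImport` class, the helper function and the example IRIs are illustrative assumptions, not output from OntoFox or any other tool.

```python
from dataclasses import dataclass

RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


@dataclass(frozen=True)
class MireotImport:
    source_ontology: str    # IRI of the ontology the term comes from
    source_term: str        # IRI of the external term being reused
    target_superclass: str  # IRI of its parent in the importing ontology


def to_subclass_triple(imp):
    """Return the one axiom that places the imported term in the hierarchy."""
    return (imp.source_term, RDFS_SUBCLASS_OF, imp.target_superclass)


# Illustrative values only: GO 'biological_process' placed under BFO 'process'.
EXAMPLE = MireotImport(
    source_ontology="http://purl.obolibrary.org/obo/go.owl",
    source_term="http://purl.obolibrary.org/obo/GO_0008150",
    target_superclass="http://purl.obolibrary.org/obo/BFO_0000015",
)

if __name__ == "__main__":
    print(to_subclass_triple(EXAMPLE))
```

Keeping the original term IRI, rather than minting a new one, is what lets data annotated in different ontologies still be integrated.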

Examples of OBO ontologies

• OBI, the Ontology for Biomedical Investigations
• VO, the Vaccine Ontology
• AERO, the Adverse Event Reporting Ontology

Ontology for Biomedical Investigations (OBI)

• OBI is a multi-community project driven by the practical needs of its members, with the goal of building a high-quality, interoperable reference ontology

• OBI high level classes are in place - solidified over several years - that cover all aspects of biomedical investigations

• OBI is expanded to enable member applications and based on term requests


High level class hierarchy (partial)

Slide credit: OBI Consortium

Slide credit: Alan Ruttenberg

Slide credit: OBI Consortium

Representing vaccine data – the Vaccine Ontology (VO)

Picture credit: Yongqun He


Representing pharmacovigilance data

• The Adverse Event Reporting Ontology (AERO)

• Encodes existing clinical guidelines (Brighton Collaboration)

Background and problem statement

• Surveillance of Adverse Events Following Immunization is important
– Detection of issues with vaccine
– Importance of vaccine-risk communication
• Analysis of AE reports is a subjective, time-consuming and costly process
– Manual review of the textual reports

Workflow

• Hypothesis: the AERO I developed can be used to annotate and classify a dataset
• VAERS dataset
– Vaccine Adverse Event Reporting System
– 6032 reports: ~5800 negative, ~230 positive
– Post H1N1 immunization 2009/2010
– Manually classified for anaphylaxis
• MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings


Automated Diagnosis workflow

Results

At best cut-off point:
Sensitivity 57%
Specificity 97%
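For reference, sensitivity is the share of truly positive reports that the classifier flags, and specificity is the share of truly negative reports that it leaves alone. The sketch below shows the two formulas; the counts in the usage example are made-up toy numbers, not the study's confusion matrix.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were correctly flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were correctly left unflagged."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, fn, tn, fp = 8, 2, 90, 10
    print(f"sensitivity = {sensitivity(tp, fn):.0%}")
    print(f"specificity = {specificity(tn, fp):.0%}")
```

With a dataset as imbalanced as the one described (far more negative than positive reports), a high specificity matters: even a small false-positive rate produces many reports to re-check by hand.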

AE classification can be improved through the use of ontologies

• Manual analysis: 3 months for 12 medical officers
• Ontology-based analysis: once data collected (2 months), almost instantaneous (2h on laptop)

=> Could allow for earlier detection of safety issues and better understanding of adverse events

2h automated vs. 3 months manual

http://dx.doi.org/10.1371/journal.pone.0092632


IRI dereferencing


Ontobee: publishing biomedical resources on the Semantic Web

HTML for humans …

… RDF for machines
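Serving HTML to people and RDF to programs from the same IRI is commonly done with HTTP content negotiation: the client states the format it wants in the Accept header. A minimal sketch of the two requests follows. The term IRI is only an illustrative OBO-style PURL, `build_request` is a hypothetical helper, no request is actually sent, and how a given server responds is not verified here.

```python
from urllib.request import Request

# Illustrative term IRI (assumption); any dereferenceable IRI would do.
TERM_IRI = "http://purl.obolibrary.org/obo/OBI_0000070"


def build_request(iri, want_rdf):
    """Prepare (but do not send) a GET request asking for RDF or HTML."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(iri, headers={"Accept": accept})


if __name__ == "__main__":
    for want_rdf in (False, True):
        req = build_request(TERM_IRI, want_rdf)
        print(req.full_url, "Accept:", req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the dereferencing step itself.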


The Integrated Rapid Infectious Disease Analysis (IRIDA) project

• Goal: automate infectious disease outbreak detection and investigation

• Issues:
– Integrate WGS, clinical and lab info
– Provide relevant tools and validate pipeline
• Methods:
– Data standards for information exchange
– Analysis pipeline (Galaxy based)
– User interface
– Additional tools:
  • IslandViewer
  • GenGIS

Building the IRIDA data standards

• Interview with key personnel at BCCDC
• Review of existing resources
• Identify “holes”, i.e., missing bits
• Collect existing data
• Liaise with implementation team
• Generate cohesive resource
• Validate

Relevant data standards

• TypON, the Typing Ontology
• OBI, the Ontology for Biomedical Investigations
• NGSOnto, Next Generation Sequencing Ontology
• NIAIS-GS-BRC core metadata
• MIxS ontology
• TRANS, Pathogen Transmission Ontology
• ExO, Exposure Ontology
• EPO, Epidemiology Ontology
• IDO, Infectious Disease Ontology
• Food: USDA, EFSA?

Relevant international efforts

• MIxS standard
• Global Microbial Identifier
• Global Alliance for Genomics and Health
• NCBI BioSample
• European Nucleotide Archive
• …

Remaining challenges

• Trust, provenance
– Ability to track origin of data to assess whether it is trustworthy
• Data sharing, reuse, policy
– Social and legal issues in getting access to data
• Confidentiality
– Privacy concerns when linking data


Take home message

Big data is a big challenge, but we can deal with it if done properly: that will be your responsibility

DO NOT build a black box
DO annotate and describe your data
DO make your data openly available

Acknowledgements

• Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
• The Brinkman^2 labs
• Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
• Colleagues at Public Health Agency of Canada (Ms Lafleche, Dr Law)
• The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)

Mélanie Courtot, PhD
mcourtot@sfu.ca
@mcourtot
http://purl.org/net/mcourtot