360° Semantic Technologies: Web Mining, Text Analysis, Linked Data Search and Reasoning
SemTech 2010 Workshop, San Francisco, Jun 2010


Presentation Outline

•  SEMANTIC DATA MANAGEMENT (1:10-1:50)
   –  OWLIM semantic repository
   –  Linked data, FactForge
   –  Life Sciences, LinkedLifeData
•  SEMANTIC ANNOTATION FOR SEARCH (2:00-3:10)
   –  Text mining
   –  Semantic annotation indexing
   –  Search showcases
•  PARTNER PRESENTATIONS (3:20-4:20)
   –  BPEng - Business process management (Italy)
   –  Structured Dynamics (Canada)
   –  TopQuadrant (USA)
•  WRAP-UP (4:20-4:30)

Ontotext

•  Ontotext is a semantic technology provider
•  Global leader in semantic search and semantic databases
•  Established in year 2000
   –  Part of Sirma, a top-3 software company in Bulgaria
•  Staff: 50 employees and multiple contractors
•  Investment acquired in July 2008
   –  A financial investor obtained a minority share in a deal for 2.5M EURO
•  Ontotext is involved in two joint ventures:
   –  Innovantage: online recruitment intelligence provider in UK
   –  Namerimi: national search engine in Bulgaria

Research Projects

•  Ontotext is the most successful Bulgarian company in FP6
   –  Ontotext has participated in 20+ EC research projects
   –  >100M Euro is the total budget of the projects Ontotext is part of
   –  This is above 10% of the EC projects related to semantics
•  Partnering with the leading research centers and companies
   –  SAP, Software AG, IBM, ATOS Origin, CapGemini, Wikimedia, …
•  About 3M Euro funding from EC projects for 2010-2012

Ontotext Positioning

•  Leading semantic technology provider
   –  Top-10 core technology provider
   –  Offering engines and components to vendors and solution developers
•  Unique technology portfolio:
   –  Semantic Databases: high-performance RDF DBMS, scalable reasoning
   –  Semantic Search: text-mining (IE), Information Retrieval (IR)
   –  Web Mining: focused crawling, screen scraping, data fusion
   –  Web Services and BPM: WS annotation, discovery, etc.
•  Good recognition in the SemTech community
   –  Ontotext pages are ranked 1st for “semantic annotation” and “semantic repository” at Google and Yahoo

What do we do?

We build upon lightweight semantics that is easy to understand, deploy, and manage.

For instance, think of ontologies as database schemata with simple interpretation rules.

Plenty of obvious (but useful) implicit facts can be inferred and match queries right away.

It is simple

[Figure: RDF graph around myData:Maria and myData:Ivan. Labels in the figure: ptop:Agent, ptop:Person, ptop:Woman, ptop:childOf, ptop:parentOf, relativeOf, rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty. Some edges are marked "inferred".]

Physical data representation: RDF vs. RDBMS

RDBMS tables

Person
ID   Name       Gender
1    Maria P.   F
2    Ivan Jr.   M
3    …

Parent
ParID   ChiID
1       2

Spouse
S1ID   S2ID   From   To
1      3

RDF

Statement
Subject      Predicate     Object
myo:Person   rdf:type      rdfs:Class
myo:gender   rdf:type      rdf:Property
myo:parent   rdfs:range    myo:Person
myo:spouse   rdfs:range    myo:Person
myd:Maria    rdf:type      myo:Person
myd:Maria    rdfs:label    “Maria P.”
myd:Maria    myo:gender    “F”
myd:Ivan     rdfs:label    “Ivan Jr.”
myd:Ivan     myo:gender    “M”
myd:Maria    myo:parent    myd:Ivan
myd:Maria    myo:spouse    myd:John
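The correspondence between the two representations can be sketched in a few lines of Python. This is only an illustration: the `rows_to_triples` helper, the in-memory row dictionaries and the `uri` field are assumptions made for the example and are not part of OWLIM or of any RDF library.

```python
# Hypothetical illustration: flatten relational rows like those on the slide
# into (subject, predicate, object) statements.
PERSON = [
    {"id": 1, "uri": "myd:Maria", "name": "Maria P.", "gender": "F"},
    {"id": 2, "uri": "myd:Ivan", "name": "Ivan Jr.", "gender": "M"},
]
PARENT = [(1, 2)]  # (ParID, ChiID)


def rows_to_triples(persons, parents):
    """Return the RDF-style statements equivalent to the table rows."""
    uri_by_id = {p["id"]: p["uri"] for p in persons}
    triples = []
    for p in persons:
        triples.append((p["uri"], "rdf:type", "myo:Person"))
        triples.append((p["uri"], "rdfs:label", p["name"]))
        triples.append((p["uri"], "myo:gender", p["gender"]))
    for par_id, chi_id in parents:
        # A foreign-key pair becomes a single statement between two URIs.
        triples.append((uri_by_id[par_id], "myo:parent", uri_by_id[chi_id]))
    return triples


TRIPLES = rows_to_triples(PERSON, PARENT)
```

Every cell of a row turns into one statement, so adding a new column needs no schema change on the RDF side, only a new predicate.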

Get more facts – Match more queries

[Figure: the Maria/Ivan RDF graph (ptop:Agent, ptop:Person, ptop:Woman, ptop:childOf, ptop:parentOf, relativeOf), with the inferred edges highlighted.]

The database will return Ivan as a result of the query "Maria relativeOf ?x" when the fact asserted was "Ivan childOf Maria".

The Semantics is Encoded in Simple Rules

[Figure: the Maria/Ivan RDF graph (ptop:Agent, ptop:Person, ptop:Woman, ptop:childOf, ptop:parentOf, relativeOf), with the inferred edges highlighted.]

<C1, rdfs:subClassOf, C2> <C2, rdfs:subClassOf, C3>  ⇒  <C1, rdfs:subClassOf, C3>
<I, rdf:type, C1> <C1, rdfs:subClassOf, C2>  ⇒  <I, rdf:type, C2>
<P1, owl:inverseOf, P2> <I1, P1, I2>  ⇒  <I2, P2, I1>
<P1, rdf:type, owl:SymmetricProperty>  ⇒  <P1, owl:inverseOf, P1>
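To make the four rules concrete, here is a minimal sketch of naive forward chaining in Python. It is not how OWLIM implements inference; the `forward_chain` function is a hypothetical helper, and the class hierarchy in `EXPLICIT` is an assumed reading of the figure.

```python
def forward_chain(explicit):
    """Apply the four slide rules repeatedly until no new triple appears."""
    facts = set(explicit)
    while True:
        new = set()
        for (s, p, o) in facts:
            # <P1 rdf:type owl:SymmetricProperty> => <P1 owl:inverseOf P1>
            if p == "rdf:type" and o == "owl:SymmetricProperty":
                new.add((s, "owl:inverseOf", s))
            for (s2, p2, o2) in facts:
                # subClassOf is transitive
                if p == p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdfs:subClassOf", o2))
                # instances inherit the super-classes of their class
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdf:type", o2))
                # <P1 owl:inverseOf P2> <I1 P1 I2> => <I2 P2 I1>
                if p == "owl:inverseOf" and p2 == s:
                    new.add((o2, o, s2))
        if new <= facts:
            return facts
        facts |= new


# Assumed example data, loosely following the figure.
EXPLICIT = {
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
    ("myData:Ivan", "ptop:relativeOf", "myData:Maria"),
}
CLOSURE = forward_chain(EXPLICIT)
```

After materialization, a query for Maria's children or relatives matches directly against `CLOSURE`, even though only Ivan's side of each relationship was asserted.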

Scalable Reasoning Map (Jun’09)

[Figure: bubble chart. X axis: dataset size (billions of explicit statements), 0 to 20. Y axis: loading speed (1000 st./sec, higher is better), 0 to 140. Engines plotted: BigOWLIM, AllegroGraph, Virtuoso, Jena TDB, BigData, ORACLE. Bubble size indicates loading complexity (bigger is better). Hardware annotations: sub-$10,000 8-core server (two bubbles), sub-$2000 4-core desktop, cluster of 14 8-core blades.]

Interlinking Text and Data

We link your data, your content, and the web!

Elevator Pitch

In 10 weeks we can build a solution which:
- integrates 10 databases with the linked data cloud
- mines 10 million documents and web pages

and lets you search and navigate all this information
- in 10 different ways
- from a $10,000 server


Semantic Repository for RDFS and OWL

•  OWLIM is a family of scalable semantic repositories
•  SwiftOWLIM: in-memory, fastest, scales to ~100 million statements
•  BigOWLIM: file-based, sameAs & query optimizations, scales to 20 billion statements
•  OWLIM provides
   –  Management, integration and analysis of heterogeneous data
   –  Combined with light-weight, high-performance reasoning
   –  The inference is based on logical rule-entailment
   –  Full RDFS, restricted OWL Lite, OWL Horst and OWL 2 RL are supported
   –  Custom semantics can be defined via rules and axiomatic triples

Naïve OWL Fragments Map

[Figure: map of OWL fragments ordered by complexity, with a "DL" side and a "Rules, LP" side. Languages shown: OWL Full, OWL DL, OWL Lite, OWL Horst / Tiny, OWL DLP, RDFS, SWRL, OWL/WSML Flight, Datalog, OWL Lite- / DHL, OWL 2 RL. A marked region shows the expressivity supported by OWLIM.]

OWLIM in Use

•  BigOWLIM is used for data-integration in life sciences
•  SwiftOWLIM is bundled as an ontology service in GATE 4.0
•  OWLIM is used as a semantic repository in KIM
•  TopBraid Composer bundles OWLIM as a reasoner
•  OWLIM is used in more than 10 European research projects
•  FactForge (http://factforge.net) is based on BigOWLIM
•  BigOWLIM has been successfully integrated into the high performance Semantic Web Publishing stack powering the BBC’s 2010 World Cup Website

BigOWLIM Statistics (based on published results)

•  BigOWLIM is the only engine that can reason with more than 10 billion statements
•  BigOWLIM’s query performance is at least as good as that of any other engine that can handle semantics on 1 billion statements
•  BigOWLIM is the only engine for which full-cycle loading and query evaluation results are published for LUBM(8000)
•  BigOWLIM successfully passes LUBM(90000) – over 20 billion explicit and implicit statements

Outline

•  Introduction to OWLIM
   –  Versions, Major Features, Information
   –  Dialects of OWL and combinations with RDFS and Rules
•  BigOWLIM
   –  Advanced features
•  Hands-on
   –  Try out some features using FactForge

BigOWLIM Replication Cluster

•  Distribution through data replication is used to:
   –  Improve scalability of concurrent user requests
   –  Provide resilience: failover, online configuration
•  How does it work?
   –  Every user write request is pushed into a transaction queue
   –  Each data write request is multiplexed to all repository instances
   –  Each read request is dispatched to one instance only
   –  To ensure load-balancing, each read request is sent to the instance with the smallest execution queue at this point in time
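The dispatching policy described above can be sketched as follows. This is a toy, single-process model: the `Worker` and `Master` classes and their method names are hypothetical and do not reflect the actual BigOWLIM cluster API.

```python
from collections import deque


class Worker:
    """Stand-in for one repository instance (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.statements = set()
        self.pending_reads = deque()  # execution queue of read requests


class Master:
    def __init__(self, workers):
        self.workers = list(workers)
        self.tx_queue = deque()  # transaction queue for write requests

    def write(self, statement):
        # Every write request is pushed into the transaction queue first.
        self.tx_queue.append(statement)

    def flush(self):
        # Each queued write is multiplexed to all repository instances.
        while self.tx_queue:
            statement = self.tx_queue.popleft()
            for worker in self.workers:
                worker.statements.add(statement)

    def read(self, query):
        # Each read goes to one instance only: the one whose execution
        # queue is currently the shortest.
        target = min(self.workers, key=lambda w: len(w.pending_reads))
        target.pending_reads.append(query)
        return target.name


WORKERS = [Worker("w1"), Worker("w2"), Worker("w3")]
MASTER = Master(WORKERS)
MASTER.write(("myd:Maria", "myo:parent", "myd:Ivan"))
MASTER.flush()
DISPATCHED = [MASTER.read(f"query-{i}") for i in range(3)]
```

Because every worker holds the full data, write throughput stays that of a single instance while read throughput adds up across workers.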

Replication Cluster - Behaviour

•  The total loading/modification performance of the cluster is equal to that of one instance
•  The data scalability of the cluster is determined by the amount of RAM of the weakest instance
•  The query performance of the cluster represents the sum of the throughputs that can be handled by each of the instances
•  Failover:
   –  In case of failure of one or more instances, the performance degradation is graceful
   –  The cluster is fully operational even when there is only one instance working
•  Cluster can be reconfigured when running

Replication Cluster - Types of Nodes

•  Two types of nodes
•  Flexible topologies possible
•  Resilience to failure of workers and masters

[Figure: cluster topology. Master: dispatches queries and updates to workers (read/write). Master (hot standby): dispatches queries to workers (read only). Worker 1, Worker 2, Worker 3: standard BigOWLIM instances. Arrows are labelled "Queries & updates" and "Queries only".]

Replication Cluster - Performance

•  Performance benchmarks
   –  To be published soon
   –  Initial indications look good
   –  Concurrent query evaluation appears to scale linearly with no. of nodes
•  At the BBC
   –  Millions of read requests per day
   –  Thousands of updates per hour

Smooth Invalidation

•  In previous versions of BigOWLIM, whenever a statement was deleted, the entire deductive closure was invalidated
   –  This triggered full re-inference
•  A partial invalidation mechanism has now been implemented
   –  It performs a sequence of backward and forward-chaining iterations to figure out what part of the deductive closure is no longer supported
   –  It does NOT require any ‘truth maintenance information’
•  The complexity of the invalidation is comparable to the complexity of the addition (doesn’t work for owl:sameAs)
   –  It is slower, but it is still in the same order of magnitude
   –  Removing ‘key’ statements can still be painful

RDFRank

•  BigOWLIM includes a routine which allows for efficient calculation of a modification of PageRank over RDF graphs
•  The computation of the RDFRanks for FactForge (400M LOD statements) takes 310 sec
   –  201 sec for reading the RDF graph from disk-based structures into a specific in-memory representation
   –  98 sec were spent in 27 RDFRank iterations
•  Results are available through a system predicate
•  Example: get the 100 most important nodes in the RDF graph

   SELECT ?n {?n onto:hasRDFRank ?r}
   ORDER BY DESC(?r) LIMIT 100
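The slides describe RDFRank only as "a modification of PageRank", so the sketch below shows plain PageRank over subject-to-object links rather than the BigOWLIM routine itself. The function name, the damping factor and the toy graph are assumptions; the iteration count of 27 simply mirrors the figure quoted above.

```python
def rank_nodes(triples, damping=0.85, iterations=27):
    """Plain PageRank over subject -> object links (not RDFRank itself)."""
    nodes = sorted({s for s, _, o in triples} | {o for s, _, o in triples})
    out_links = {n: [] for n in nodes}
    for s, _, o in triples:
        out_links[s].append(o)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                nxt[target] += damping * rank[n] / len(out_links[n])
        rank = nxt
    return rank


TOY_GRAPH = [("a", "p", "c"), ("b", "p", "c"), ("c", "p", "a")]
RANKS = rank_nodes(TOY_GRAPH)
TOP_NODE = max(RANKS, key=RANKS.get)
```

In this toy graph node `c` receives links from both other nodes and ends up ranked highest, which is the kind of ordering the `ORDER BY DESC(?r)` query relies on.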

RDF Priming

•  Scalable and customizable implementation of ‘Activation Spreading’
•  Allows ‘priming’ of large datasets with respect to concepts relevant to the context and to the query
•  Controlled using special ASK queries that
   –  Initiate priming from specified nodes, with decay factors and thresholds
•  e.g. activate from this node

   PREFIX onto: <http://www.ontotext.com#>
   PREFIX dbpedia3: <http://dbpedia.org/resource/>
   ASK { dbpedia3:1955_Ford onto:activateNode dbpedia3:Ford_Motor_Company }

•  To return more specific results for this query

   SELECT * WHERE { ?x <http://dbpedia.org/property/class> <http://dbpedia.org/resource/V8> . }
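As a rough illustration of activation spreading with a decay factor and a threshold, consider the sketch below. It is a generic textbook-style version written for this note, not the BigOWLIM implementation; the function name, the default parameter values and the toy edge list are all assumptions.

```python
def spread_activation(edges, seeds, decay=0.5, threshold=0.1):
    """Propagate activation from seed nodes until it falls below threshold."""
    neighbours = {}
    for s, o in edges:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    activation = {seed: 1.0 for seed in seeds}
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        passed_on = activation[node] * decay
        if passed_on < threshold:
            continue  # too weak to activate anything further
        for other in neighbours.get(node, ()):
            # Keep only the strongest activation seen for each node.
            if passed_on > activation.get(other, 0.0):
                activation[other] = passed_on
                frontier.append(other)
    return activation


CHAIN = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
PRIMED = spread_activation(CHAIN, ["a"])
```

Nodes that never receive activation above the threshold stay out of the primed set, which is how priming narrows a later query to the relevant part of the graph.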

Full-Text Search

•  Alternative information access method (different indices)
•  Find information based on string elements (tokens)
•  Two approaches: Node Search and RDF Search
•  Node Search:
   –  URIs, literals
   –  Criteria are a list of tokens with a predicate (exact, ignore case, prefix, …)
   –  Result available as a variable binding

   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   PREFIX onto: <http://www.ontotext.com/>
   SELECT ?x ?label
   WHERE {
     ?x rdfs:label ?label .
     <3d:> onto:prefixMatchIgnoreCase ?label .
   }

Full-Text Search (2)

•  Alternative information access method (different indices)
•  Find information based on string elements (tokens)
•  Two approaches: Node Search and RDF Search
•  RDF Search:
   –  Text representation of a node’s ‘RDF molecule’
   –  Criteria are Lucene tokens with special ‘lucene’ predicate
   –  Results are URIs (ordered by RDF rank if available)

   PREFIX gossip: <http://www.....gossipdb.owl#>
   PREFIX onto: <http://www.ontotext.com/>
   SELECT * WHERE {
     ?person gossip:name ?name .
     ?name onto:luceneQuery "American AND life~" .
   }

Consistency Checks

•  Inconsistency checking is performed when a transaction is committed
•  There are two kinds of checks:

   IF <premises> => CHECK <constraints>
   –  Similar to entailment rules, except that whenever the premises are satisfied, a check is made that the ‘inferred’ triples exist in the repository

   IF <premises> => Inconsistent!
   –  Whenever the premises are satisfied, a consistency violation is logged
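The two kinds of checks can be mimicked with a small Python sketch. It simplifies heavily: premises are limited to a single triple, and `check_consistency`, `parent_must_be_person` and `not_own_parent` are hypothetical names invented for this example rather than OWLIM rule syntax.

```python
def check_consistency(triples, check_rules, inconsistency_rules):
    """Return violation messages for a set of (s, p, o) triples.

    check_rules:         callables mapping a triple to the triples that
                         must already exist (IF premises => CHECK).
    inconsistency_rules: callables returning True when a triple signals
                         a contradiction (IF premises => Inconsistent!).
    """
    store = set(triples)
    violations = []
    for triple in sorted(store):
        for rule in check_rules:
            for required in rule(triple):
                if required not in store:
                    violations.append(f"missing {required} required by {triple}")
        for rule in inconsistency_rules:
            if rule(triple, store):
                violations.append(f"inconsistent: {triple}")
    return violations


def parent_must_be_person(triple):
    s, p, o = triple
    return [(s, "rdf:type", "myo:Person")] if p == "myo:parent" else []


def not_own_parent(triple, store):
    s, p, o = triple
    return p == "myo:parent" and s == o


DATA = [
    ("myd:Maria", "myo:parent", "myd:Ivan"),
    ("myd:Maria", "rdf:type", "myo:Person"),
    ("myd:Bob", "myo:parent", "myd:Bob"),
]
VIOLATIONS = check_consistency(DATA, [parent_must_be_person], [not_own_parent])
```

The first kind of rule reports a statement that should be present but is not; the second reports a combination that should never occur.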

Notifications

•  The client can subscribe for notifications for incoming statements matching desired graph patterns
•  The patterns are then used to filter incoming statements
   –  Notify the subscriber about those statements that help form a new solution of at least one of the graph patterns
   –  Inferred statements are treated in the same way
•  The subscriber should not rely on any particular order or distinctness of the statement notifications
   –  Retraction of statements is not notified
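A much reduced model of this subscription mechanism is sketched below. It matches single triple patterns instead of full graph patterns, and the `NotificationHub` class is a hypothetical stand-in, not the OWLIM notification API.

```python
class NotificationHub:
    """Hypothetical in-memory hub; None in a pattern acts as a wildcard."""

    def __init__(self):
        self.subscriptions = []  # list of (pattern, callback)

    def subscribe(self, pattern, callback):
        self.subscriptions.append((pattern, callback))

    def add_statement(self, statement):
        # Explicit and inferred statements would both pass through here;
        # removals are deliberately not reported.
        for pattern, callback in self.subscriptions:
            if all(p is None or p == v for p, v in zip(pattern, statement)):
                callback(statement)


HUB = NotificationHub()
SEEN = []
HUB.subscribe((None, "rdf:type", "myo:Person"), SEEN.append)
HUB.add_statement(("myd:Maria", "rdf:type", "myo:Person"))
HUB.add_statement(("myd:Maria", "myo:gender", "F"))
```

Only the first statement reaches the subscriber, because the second does not match the subscribed pattern.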

Sample FactForge Queries 1

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
   PREFIX dbp-ont: <http://dbpedia.org/ontology/>
   PREFIX owlim: <http://www.ontotext.com/>
   PREFIX geo-ont: <http://www.geonames.org/ontology#>
   PREFIX dbpedia: <http://dbpedia.org/resource/>
   SELECT * WHERE {
     ?Person dbp-ont:birthPlace ?BirthPlace ;
             rdf:type opencyc:Entertainer ;
             owlim:hasPageRank ?RR .
     ?BirthPlace geo-ont:parentFeature dbpedia:Germany .
   }
   ORDER BY DESC(?RR) LIMIT 100

•  Who are the most important German entertainers?
•  This query involves data from DBPedia, Geonames, and UMBEL (OpenCyc)
•  It involves inference over types, sub-classes, and transitive relationships
•  Ranking the results by ‘importance’ – RDF rank

Sample FactForge Queries 2

   PREFIX onto: <http://www.ontotext.com/>
   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   SELECT * WHERE {
     ?entity <http://xmlns.com/foaf/0.1/topic> ?topic .
     ?entity rdfs:label ?label .
     ?topic onto:luceneQuery "American AND life~" .
   }

•  Get the descriptions of entities that have a FOAF topic containing ‘American’ and something that is similar to ‘Life’

OWLIM

http://www.ontotext.com/owlim

Based on published results and independent evaluations: OWLIM is the most scalable and the most efficient semantic repository in the world and offers the most comprehensive reasoning support


Linking Data Across Different Servers

[Figure: the Maria/Ivan RDF graph (myData:Maria, myData:Ivan, ptop:Agent, ptop:Person, ptop:Woman, ptop:childOf, ptop:parentOf, relativeOf), with rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf and owl:SymmetricProperty edges and inferred edges marked.]

Linking Open Data

•  Linking Open Data (LOD) W3C SWEO Community project
   http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
•  Initiative for publishing “linked data” which already includes 50+ interlinked datasets and more than 10 billion facts

Presentation Outline

•  Linked Data
   –  Introduction to linked data and LOD
   –  Why People Do Not Use Linked Data?
•  Reason-able Views
•  FactForge: gateway to the center of the web of data
   –  Contents: the largest body of common-sense knowledge
   –  RDF Search: probably the best way to explore unknown datasets
   –  Loading and Inference Statistics
   –  Querying multiple datasets, considering their semantics
      •  Or: Who is the most popular German entertainer?
   –  The Modigliani test for the Semantic Web

Why People Do Not Use Linked Data?

•  Plenty of people in the IT world have heard about linked data and like the idea
•  However, the impact of linked data in the enterprises is still very limited
•  Because:
   –  There are no well established opinions on what linked data can “buy” for the enterprise, nor best practices for using it
      •  What are the concrete benefits?
   –  It is not clear what it would cost
      •  What are the problems?
      •  What are the associated risks?

Linked Data in the Enterprise: Why?

•  To facilitate data integration
   –  One can use LOD as “interlingua” for enterprise data integration
   –  Additional public information can help alignment and linking
•  To add value to proprietary data
   –  Public data can allow more analytics on top of proprietary data
   –  For instance, by linking to spatial data from Geonames
   –  Better description and access to content, e.g. search for images
•  Make enterprise data more open
   –  To make the enterprise data easier to use outside the enterprise
   –  Public identifiers and vocabularies can be used to access them

Linked Data in the Enterprise: Challenges

•  LOD is hard to comprehend
   –  Diversity comes at a price
   –  One needs to make a query against 200 different schemata and hundreds of thousands of classes and properties
•  LOD is unreliable
   –  Most of the servers behind LOD today are slow
      •  High downtime
   –  Dealing with data distributed on the web is slow
      •  A federated SPARQL query that uses 2-3 servers within several joins can be *very* slow
   –  No sort of consistency is guaranteed
      •  Low commitment to the formal semantics and intended usage of the ontologies and schemata


Reason-able Views to the Web of Data

•  Reason-able views represent an approach for reasoning and management of linked data
•  Key ideas:
   –  Group selected datasets and ontologies in a compound dataset
      •  Clean up, post-process and enrich the datasets if necessary
      •  Do this conservatively, in a clearly documented and automated manner, so that:
         –  the operation can easily be repeated each time a new version of one of the datasets appears
         –  users can easily understand the intervention made to the original dataset
   –  Load the compound dataset in a single semantic repository
   –  Perform inference with respect to tractable OWL dialects
   –  Define a set of sample queries against the compound dataset
      •  Those determine the “level of service” or the “scope of consistency” contract offered by the reason-able view

Reason-able Views: Objectives

•  Make reasoning and query evaluation feasible
•  Guarantee a basic level of consistency
   –  The sample queries guarantee the consistency of the data in the same way in which regression tests do for the quality of the software
•  Guarantee availability
   –  In the same way in which web search engines are usually more reliable than most of the web sites
•  Easier exploration and querying of unseen data
   –  Lower the cost of entry through URI auto-complete and RDF search
   –  Sample queries provide re-usable extraction patterns, which reduce the time for acquaintance with the datasets and their interconnections

Two Reason-able Views to the Web of Linked Data

•  FactForge: Linked Data Semantic Repository (in red)
   –  Some of the central LOD datasets
   –  General-purpose information (not specific to a domain)
   –  1.2B explicit plus .9B inferred indexed, 4B retrievable statements
   –  The largest upper-level knowledge base
   –  http://www.ontotext.com/FactForge/
•  Linked Life Data - PIKB (in yellow)
   –  20+ of the most popular life-science datasets
   –  Complemented by gluing ontologies
   –  2.7B explicit and 1.4B inferred, total of 4.1B indexed statements
   –  The largest body of knowledge that was used for reasoning
   –  http://www.linkedlifedata.com

Linking Open Data Datasets and Views (red and yellow)

[Figure: Linking Open Data dataset diagram, with the datasets covered by the two views marked in red and yellow.]


Linked Data Semantic Repository

•  Datasets: DBPedia, Freebase, Geonames, UMBEL, MusicBrainz, Wordnet, CIA World Factbook, Lingvoj
•  Ontologies: Dublin Core, SKOS, RSS, FOAF
•  Inference: materialization with respect to OWL 2 RL
   –  Seems to completely cover the semantics of the data
   –  owl:sameAs optimization in BigOWLIM allows reduction of the indices, without loss of semantics or performance
•  Free public service at http://FactForge.ontotext.com
   –  Incremental URI auto-suggest
   –  Query and explore through Forest and Tabulator
   –  RDF Search: retrieve ranked list of URIs by keywords
   –  SPARQL end-point

FactForge Loading and Inference Statistics

Columns: Dataset; Explicit Indexed Triples ('000); Inferred Indexed Triples ('000); Total # of Stored Triples ('000); Entities ('000 of nodes in the graph); Inferred closure ratio

Dataset                    Explicit    Inferred      Total   Entities   Ratio
Schemata and ontologies          11           7         18          6     0.6
DBpedia (categories)          2,877      42,587     45,464      1,144    14.8
DBpedia (sameAs)              5,544         566      6,110      8,464     0.1
UMBEL                         5,162      42,212     47,374        500     8.2
Lingvoj                          20         863        883         18    43.8
CIA Factbook                     76           4         80         25     0.1
Wordnet                       2,281       9,296     11,577        830     4.1
Geonames                     91,908     125,025    216,933     33,382     1.4
DBpedia core                560,096     198,043    758,139    127,931     0.4
Freebase                    463,689      40,840    504,529     94,810     0.1
MusicBrainz                  45,536     421,093    466,630     15,595     9.2
Total                     1,177,961     881,224  2,058,185    283,253     0.7
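The "inferred closure ratio" column appears to be the number of inferred triples divided by the number of explicit triples. That reading is an assumption, and the short Python check below tests it against three rows of the table; the `closure_ratio` helper is introduced only for this illustration.

```python
# Figures copied from the table, in thousands of triples: (explicit, inferred).
ROWS = {
    "DBpedia (categories)": (2877, 42587),
    "Geonames": (91908, 125025),
    "MusicBrainz": (45536, 421093),
}


def closure_ratio(explicit, inferred):
    """Inferred statements per explicit statement, to one decimal place."""
    return round(inferred / explicit, 1)


RATIOS = {name: closure_ratio(*counts) for name, counts in ROWS.items()}
```

For these three rows the computed values agree with the published column, which suggests that a ratio above 1 means inference more than doubled the size of the dataset.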

Post-processing

•  Several kinds of post-processing were performed
   −  Goal: to allow easier navigation and browsing
   −  Mechanisms: the results are available through system predicates
   −  For instance: preferred labels, text snippets and RDF Ranks for all nodes
•  Final Statistics
   −  Number of entities (RDF graph nodes): 405M
   −  Number of inserted statements (NIS): 1.2B
   −  Number of stored statements (NSS): 2.2B
   −  Number of retrievable statements (NRS): 9.8B
      −  7.6B statements (NRS minus NSS) “compressed” through BigOWLIM’s owl:sameAs optimisation


Guess who is the most popular German entertainer?

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX dbp-ont: <http://dbpedia.org/ontology/>
   PREFIX dbpedia: <http://dbpedia.org/resource/>
   PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
   PREFIX geo-ont: <http://www.geonames.org/ontology#>
   PREFIX FactForge: <http://www.ontotext.com/>
   SELECT * WHERE {
     ?Person dbp-ont:birthPlace ?BirthPlace ;
             rdf:type opencyc:Entertainer ;
             FactForge:hasPageRank ?RR .
     ?BirthPlace geo-ont:parentFeature dbpedia:Germany .
   }
   ORDER BY DESC(?RR) LIMIT 100

•  Involves data from: DBPedia, Geonames, UMBEL, and MusicBrainz
•  Requires inference over types, sub-classes, and transitive relationships
•  Without FactForge, getting an answer to such queries in real time is impossible
•  And the most popular entertainer born in Germany is: F. Nietzsche
   –  Asking factual questions to a global knowledge base can bring unexpected and strange, but formally correct results – one should be precise and consider contexts

The Modigliani Test for the Semantic Web

•  ReadWriteWeb’s founder Richard MacManus: “… the tipping point for the Semantic Web may be when one can … deliver – using Linked Data – a comprehensive list of locations of original Modigliani artworks …”
   http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php

The FactForge Query Passing the Modigliani Test

   PREFIX fb: <http://rdf.freebase.com/ns/>
   PREFIX dbpedia: <http://dbpedia.org/resource/>
   PREFIX dbp-prop: <http://dbpedia.org/property/>
   PREFIX dbp-ont: <http://dbpedia.org/ontology/>
   PREFIX umbel-sc: <http://umbel.org/umbel/sc/>
   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX ot: <http://www.ontotext.com/>
   SELECT DISTINCT ?painting_l ?owner_l ?city_fb_con ?city_db_loc ?city_db_cit
   WHERE {
     ?p fb:visual_art.artwork.artist dbpedia:Amedeo_Modigliani ;
        fb:visual_art.artwork.owners [ fb:visual_art.artwork_owner_relationship.owner ?ow ] ;
        ot:preferredLabel ?painting_l .
     ?ow ot:preferredLabel ?owner_l .
     OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
     OPTIONAL { ?ow dbp-prop:location ?loc .
                ?loc rdf:type umbel-sc:City ;
                     ot:preferredLabel ?city_db_loc }
     OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
   }


Structured Dynamics

•  Structured Dynamics is the developer of UMBEL
   –  A derivative of Cyc, which is one of the datasets in FactForge
•  Like Ontotext, Structured Dynamics also uses GATE
•  Structured Dynamics and Ontotext are starting a partnership to provide a better, more useful semantic structure to Wikipedia via an updated version of UMBEL
   –  As discussed in the Wikipedia's Structured data challenge session


Thank you!

We develop core semantic technology
Ontotext invested 200 person-years, partnered with 100 leading groups, created some of the most popular tools, and delivered multiple solutions.

We know what works and what doesn’t
Ontotext set many benchmarks and advanced the frontiers of the semantic databases.

We invented the “semantic annotation” – linking text with data

Now we are prepared to interlink your data, your content, and the web