360° Semantic Technologies: Web Mining, Text Analysis, Linked Data Search and Reasoning
SemTech 2010 Workshop, San Francisco, June 2010

Presentation Outline
• SEMANTIC DATA MANAGEMENT (1:10-1:50)
– OWLIM semantic repository
– Linked data, FactForge
– Life Sciences, Linked Life Data
• SEMANTIC ANNOTATION FOR SEARCH (2:00-3:10)
– Text mining
– Semantic annotation indexing
– Search showcases
• PARTNER PRESENTATIONS (3:20-4:20)
– BPEng - Business process management (Italy)
– Structured Dynamics (Canada)
– TopQuadrant (USA)
• WRAP-UP (4:20-4:30)

Ontotext
• Ontotext is a semantic technology provider
• Global leader in semantic search and semantic databases
• Established in 2000
– Part of Sirma, a top-3 software company in Bulgaria
• Staff: 50 employees and multiple contractors
• Investment acquired in July 2008
– A financial investor obtained a minority share in a deal for 2.5M EUR
• Ontotext is involved in two joint ventures:
– Innovantage: online recruitment intelligence provider in the UK
– Namerimi: national search engine in Bulgaria

Research Projects
• Ontotext is the most successful Bulgarian company in FP6
– Ontotext has participated in 20+ EC research projects
– More than 100M EUR is the total budget of the projects Ontotext is part of
– This is above 10% of the EC projects related to semantics
• Partnering with the leading research centers and companies
– SAP, Software AG, IBM, ATOS Origin, CapGemini, Wikimedia, …
• About 3M EUR funding from EC projects for 2010-2012

Ontotext Positioning
• Leading semantic technology provider
– Top-10 core technology provider
– Offering engines and components to vendors and solution developers
• Unique technology portfolio:
– Semantic Databases: high-performance RDF DBMS, scalable reasoning
– Semantic Search: text mining (IE), Information Retrieval (IR)
– Web Mining: focused crawling, screen scraping, data fusion
– Web Services and BPM: WS annotation, discovery, etc.
• Good recognition in the SemTech community
– Ontotext pages are ranked 1st for “semantic annotation” and “semantic repository” at Google and Yahoo

What do we do?
We build upon lightweight semantics that is easy to understand, deploy, and manage.
For instance, think of ontologies as database schemata with simple interpretation rules.
Plenty of obvious (but useful) implicit facts can be inferred and match queries right away.

It is simple
[Figure: RDF graph of the running example. Nodes: myData:Maria, myData:Ivan, ptop:Woman, ptop:Person, ptop:Agent, ptop:childOf, ptop:parentOf, owl:relativeOf, owl:SymmetricProperty. Edge labels: rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf. Some edges are marked as inferred.]

Physical data representation: RDF vs. RDBMS

Person
ID  Name      Gender
1   Maria P.  F
2   Ivan Jr.  M
3   …

Parent
ParID  ChiID
1      2
…

Spouse
S1ID  S2ID  From  To
1     3
…

Statement
Subject     Predicate    Object
myo:Person  rdf:type     rdfs:Class
myo:gender  rdf:type     rdf:Property
myo:parent  rdfs:range   myo:Person
myo:spouse  rdfs:range   myo:Person
myd:Maria   rdf:type     myo:Person
myd:Maria   rdfs:label   “Maria P.”
myd:Maria   myo:gender   “F”
myd:Ivan    rdfs:label   “Ivan Jr.”
myd:Ivan    myo:gender   “M”
myd:Maria   myo:parent   myd:Ivan
myd:Maria   myo:spouse   myd:John
…
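
The mapping from the relational tables to the single Statement table can be sketched in a few lines of Python. This is only an illustration: the `IDS` lookup from primary keys to resource names and the choice of `rdfs:label` for the Name column are assumptions, not something the slide specifies.

```python
# Rows copied from the Person and Parent tables on the slide.
PERSON_ROWS = [(1, "Maria P.", "F"), (2, "Ivan Jr.", "M")]
PARENT_ROWS = [(1, 2)]  # (ParID, ChiID)

# Assumed mapping from primary keys to resource names (hypothetical).
IDS = {1: "myd:Maria", 2: "myd:Ivan"}


def rows_to_triples(person_rows, parent_rows, ids):
    """Flatten relational rows into (subject, predicate, object) statements."""
    triples = []
    for pid, name, gender in person_rows:
        subject = ids[pid]
        triples.append((subject, "rdf:type", "myo:Person"))
        triples.append((subject, "rdfs:label", name))
        triples.append((subject, "myo:gender", gender))
    for par_id, chi_id in parent_rows:
        # A foreign-key pair becomes a statement linking two resources.
        triples.append((ids[par_id], "myo:parent", ids[chi_id]))
    return triples


TRIPLES = rows_to_triples(PERSON_ROWS, PARENT_ROWS, IDS)
for triple in TRIPLES:
    print(*triple)
```

Every column value becomes one statement, so adding a new attribute needs no schema change, only a new predicate.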

Get more facts – Match more queries
[Figure: the RDF graph of myData:Maria and myData:Ivan from the previous slide, including the statements marked as inferred.]
The database will return Ivan as a result of the query Maria relativeOf ?x when the asserted fact was Ivan childOf Maria.

The Semantics is Encoded in Simple Rules
[Figure: the same RDF graph of myData:Maria and myData:Ivan, shown next to the entailment rules.]
<C1, rdfs:subClassOf, C2> <C2, rdfs:subClassOf, C3> ⇒ <C1, rdfs:subClassOf, C3>
<I, rdf:type, C1> <C1, rdfs:subClassOf, C2> ⇒ <I, rdf:type, C2>
<P1, owl:inverseOf, P2> <I1, P1, I2> ⇒ <I2, P2, I1>
<P1, rdf:type, owl:SymmetricProperty> ⇒ <P1, owl:inverseOf, P1>
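
A minimal sketch of how such rules produce implicit facts, written as naive forward chaining in Python. It is not OWLIM's engine. The four rules from the slide are implemented as stated; a fifth rule for rdfs:subPropertyOf is added as an assumption (it is the standard RDFS rule, and the figure shows an rdfs:subPropertyOf edge), and the `ptop:relativeOf` name and the schema triples are illustrative.

```python
def closure(triples):
    """Apply the entailment rules repeatedly until no new statement appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdfs:subClassOf":
                for s2, p2, o2 in facts:
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new.add((s, "rdfs:subClassOf", o2))  # rule 1
                    if p2 == "rdf:type" and o2 == s:
                        new.add((s2, "rdf:type", o))         # rule 2
            elif p == "owl:inverseOf":
                for s2, p2, o2 in facts:
                    if p2 == s:
                        new.add((o2, o, s2))                 # rule 3
            elif p == "rdf:type" and o == "owl:SymmetricProperty":
                new.add((s, "owl:inverseOf", s))             # rule 4
            elif p == "rdfs:subPropertyOf":
                for s2, p2, o2 in facts:
                    if p2 == s:
                        new.add((s2, o, o2))                 # assumed extra rule
        if new <= facts:
            return facts
        facts |= new


EXPLICIT = {
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:childOf", "rdfs:subPropertyOf", "ptop:relativeOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
}
INFERRED = closure(EXPLICIT) - EXPLICIT
for triple in sorted(INFERRED):
    print(*triple)
```

With these inputs the only asserted instance relation is Ivan childOf Maria, yet a query for Maria relativeOf ?x finds Ivan in the closure.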

Scalable Reasoning Map (Jun ’09)
[Chart: loading speed (1,000 statements/sec, higher is better) against dataset size (billions of explicit statements) for BigOWLIM, AllegroGraph, Virtuoso, Jena TDB, BigData and ORACLE. Bubble size indicates loading complexity (bigger is better). Hardware annotations: sub-$2,000 4-core desktop; sub-$10,000 8-core server (two entries); cluster of 14 8-core blades.]

Elevator Pitch
We link your data, your content, and the web!
In 10 weeks we can build a solution which:
- integrates 10 databases with the linked data cloud
- mines 10 million documents and web pages
and lets you search and navigate all this information
- in 10 different ways
- from a $10,000 server

Semantic Repository for RDFS and OWL
• OWLIM is a family of scalable semantic repositories
• SwiftOWLIM: in-memory, fastest, scales to ~100 million statements
• BigOWLIM: file-based, sameAs & query optimizations, scales to 20 billion statements
• OWLIM provides
– Management, integration and analysis of heterogeneous data
– Combined with light-weight, high-performance reasoning
– The inference is based on logical rule-entailment
– Full RDFS, restricted OWL Lite, OWL Horst and OWL 2 RL are supported
– Custom semantics can be defined via rules and axiomatic triples

Naïve OWL Fragments Map
[Figure: map of OWL fragments arranged by complexity across the DL and Rules/LP families: OWL Full, OWL DL, OWL Lite, OWL Horst / Tiny, OWL DLP, RDFS, SWRL, OWL/WSML Flight, Datalog, OWL Lite- / DHL, OWL 2 RL. The figure marks the expressivity supported by OWLIM.]

OWLIM in Use
• BigOWLIM is used for data integration in life sciences
• SwiftOWLIM is bundled as an ontology service in GATE 4.0
• OWLIM is used as a semantic repository in KIM
• TopBraid Composer bundles OWLIM as a reasoner
• OWLIM is used in more than 10 European research projects
• FactForge (http://factforge.net) is based on BigOWLIM
• BigOWLIM has been successfully integrated into the high-performance Semantic Web Publishing stack powering the BBC’s 2010 World Cup website

BigOWLIM Statistics (based on published results)
• BigOWLIM is the only engine that can reason with more than 10 billion statements
• BigOWLIM’s query performance is at least as good as that of any other engine that can handle semantics on 1 billion statements
• BigOWLIM is the only engine for which full-cycle loading and query evaluation results are published for LUBM(8000)
• BigOWLIM successfully passes LUBM(90000): over 20 billion explicit and implicit statements

Outline
• Introduction to OWLIM
– Versions, Major Features, Information
– Dialects of OWL and combinations with RDFS and Rules
• BigOWLIM
– Advanced features
• Hands-on
– Try out some features using FactForge

BigOWLIM Replication Cluster
• Distribution through data replication is used to:
– Improve scalability of concurrent user requests
– Provide resilience: failover, online configuration
• How does it work?
– Every user write request is pushed into a transaction queue
– Each data write request is multiplexed to all repository instances
– Each read request is dispatched to one instance only
– To ensure load balancing, each read request is sent to the instance with the smallest execution queue at that point in time
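
The dispatching policy described above can be sketched as follows. The `Worker` and `Master` classes are hypothetical stand-ins written for this illustration and do not reflect the BigOWLIM API; queue lengths are set by hand to show the read-routing decision.

```python
from collections import deque


class Worker:
    """Stand-in for one repository instance (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.statements = set()
        self.queue_length = 0  # requests currently waiting on this instance


class Master:
    def __init__(self, workers):
        self.workers = list(workers)
        self.transactions = deque()

    def write(self, statement):
        # Every write request is first pushed into the transaction queue.
        self.transactions.append(statement)

    def flush(self):
        # Each queued write is multiplexed to all instances, in arrival order.
        while self.transactions:
            statement = self.transactions.popleft()
            for worker in self.workers:
                worker.statements.add(statement)

    def pick_reader(self):
        # A read goes to exactly one instance: the one with the shortest queue.
        return min(self.workers, key=lambda w: w.queue_length)


workers = [Worker("w1"), Worker("w2"), Worker("w3")]
master = Master(workers)
master.write(("myd:Ivan", "ptop:childOf", "myd:Maria"))
master.flush()

workers[0].queue_length = 3
workers[1].queue_length = 1
workers[2].queue_length = 2
print(master.pick_reader().name)
```

Because every instance holds the full data, any of them can answer a read, which is why read throughput adds up across instances while write throughput does not.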

Replication Cluster - Behaviour
• The total loading/modification performance of the cluster is equal to that of one instance
• The data scalability of the cluster is determined by the amount of RAM of the weakest instance
• The query performance of the cluster represents the sum of the throughputs that can be handled by each of the instances
• Failover:
– In case of failure of one or more instances, the performance degradation is graceful
– The cluster is fully operational even when there is only one instance working
• The cluster can be reconfigured while running

Replication Cluster - Types of Nodes
• Two types of nodes
• Flexible topologies possible
• Resilience to failure of workers and masters
[Figure: cluster topology. A master receives queries and updates and dispatches them to the workers (read/write). A hot-standby master receives queries only and dispatches them to the workers (read only). Worker 1, Worker 2 and Worker 3 are standard BigOWLIM instances.]

Replication Cluster - Performance
• Performance benchmarks
– To be published soon
– Initial indications look good
– Concurrent query evaluation appears to scale linearly with the number of nodes
• At the BBC
– Millions of read requests per day
– Thousands of updates per hour

Smooth Invalidation
• In previous versions of BigOWLIM, whenever a statement was deleted, the entire deductive closure was invalidated
– This triggered full re-inference
• A partial invalidation mechanism has now been implemented
– It performs a sequence of backward- and forward-chaining iterations to figure out what part of the deductive closure is no longer supported
– It does NOT require any ‘truth maintenance information’
• The complexity of the invalidation is comparable to the complexity of the addition (doesn’t work for owl:sameAs)
– It is slower, but it is still in the same order of magnitude
– Removing ‘key’ statements can still be painful

RDFRank
• BigOWLIM includes a routine which allows for efficient calculation of a modification of PageRank over RDF graphs
• The computation of the RDFRanks for FactForge (400M LOD statements) takes 310 sec
– 201 sec for reading the RDF graph from disk-based structures into a specific in-memory representation
– 98 sec were spent in 27 RDFRank iterations
• Results are available through a system predicate
• Example: get the 100 most important nodes in the RDF graph
SELECT ?n {?n onto:hasRDFRank ?r}
ORDER BY DESC(?r) LIMIT 100
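
To give a feel for what ranking the nodes of an RDF graph involves, here is a sketch of plain PageRank over a set of triples in Python. The slide says RDFRank is a modification of PageRank but does not describe the modification, so this is the textbook algorithm, not Ontotext's routine. The damping factor of 0.85 and the sample graph are assumptions; the iteration count of 27 simply echoes the figure quoted on the slide.

```python
def page_rank(triples, damping=0.85, iterations=27):
    """Textbook PageRank where each triple (s, p, o) is a link from s to o."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    out_links = {n: [] for n in nodes}
    for s, _, o in triples:
        out_links[s].append(o)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        updated = {n: base for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                updated[target] += damping * rank[n] / len(out_links[n])
        rank = updated
    return rank


GRAPH = [
    ("ex:a", "ex:links", "ex:hub"),
    ("ex:b", "ex:links", "ex:hub"),
    ("ex:c", "ex:links", "ex:hub"),
    ("ex:hub", "ex:links", "ex:a"),
]
RANKS = page_rank(GRAPH)
for node in sorted(RANKS, key=RANKS.get, reverse=True):
    print(node, round(RANKS[node], 3))
```

Sorting by the computed value mirrors what the ORDER BY DESC(?r) query does against the stored ranks.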

RDF Priming
• Scalable and customizable implementation of ‘Activation Spreading’
• Allows ‘priming’ of large datasets with respect to concepts relevant to the context and to the query
• Controlled using special ASK queries that
– Initiate priming from specified nodes, with decay factors and thresholds
• e.g. activate from this node
PREFIX onto: <http://www.ontotext.com#>
PREFIX dbpedia3: <http://dbpedia.org/resource/>
ASK { dbpedia3:1955_Ford onto:activateNode dbpedia3:Ford_Motor_Company }
• To return more specific results for this query
SELECT * WHERE { ?x <http://dbpedia.org/property/class> <http://dbpedia.org/resource/V8> . }
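
The idea of activation spreading with a decay factor and a threshold can be sketched generically in Python. This is not BigOWLIM's implementation: the function name, the undirected treatment of edges, and the parameter defaults are all assumptions made for the illustration.

```python
from collections import defaultdict


def spread_activation(edges, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed nodes, weakening it at every hop."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)
    for _ in range(max_steps):
        next_frontier = {}
        for node, level in frontier.items():
            passed = level * decay
            if passed < threshold:
                continue  # too weak to activate anything further
            for other in neighbours[node]:
                if passed > activation.get(other, 0.0):
                    activation[other] = passed
                    next_frontier[other] = passed
        frontier = next_frontier
    return activation


CHAIN = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
PRIMED = spread_activation(CHAIN, ["A"], decay=0.5, threshold=0.2)
print(PRIMED)
```

Nodes that never receive activation above the threshold stay out of the primed set, which is how priming narrows a large dataset to the part relevant to the context.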

Full-Text Search
• Alternative information access method (different indices)
• Find information based on string elements (tokens)
• Two approaches: Node Search and RDF Search
– URIs, literals
– Criteria are a list of tokens with a predicate (exact, ignore case, prefix, …)
– Result available as a variable binding
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX onto: <http://www.ontotext.com/>
SELECT ?x ?label
WHERE {
  ?x rdfs:label ?label .
  <3d:> onto:prefixMatchIgnoreCase ?label .
}

Full-Text Search (2)
• Alternative information access method (different indices)
• Find information based on string elements (tokens)
• Two approaches: Node Search and RDF Search
– Text representation of a node’s ‘RDF molecule’
– Criteria are Lucene tokens with a special ‘lucene’ predicate
– Results are URIs (ordered by RDF rank if available)
PREFIX gossip: <http://www.....gossipdb.owl#>
PREFIX onto: <http://www.ontotext.com/>
SELECT * WHERE {
  ?person gossip:name ?name .
  ?name onto:luceneQuery "American AND life~" .
}

Consistency Checks
• Inconsistency checking is performed when a transaction is committed
• There are two kinds of checks:
IF <premises> => CHECK <constraints>
• Similar to entailment rules, except that whenever the premises are satisfied, a check is made that the ‘inferred’ triples exist in the repository
IF <premises> => Inconsistent!
• Whenever the premises are satisfied, a consistency violation is logged
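
The two kinds of checks can be illustrated with a small Python sketch. The rule encoding (plain functions over a set of triples), the rule names and the `ex:` vocabulary are all hypothetical; the slide does not show OWLIM's actual rule syntax beyond the IF ... => ... form.

```python
def run_checks(facts, check_rules, inconsistency_rules):
    """Return a log of violations; an empty log means nothing was violated."""
    facts = set(facts)
    log = []
    # IF <premises> => CHECK <constraints>: the required triple must exist.
    for name, premises, constraint in check_rules:
        for binding in premises(facts):
            if constraint(binding) not in facts:
                log.append((name, binding))
    # IF <premises> => Inconsistent!: any match is a violation.
    for name, premises in inconsistency_rules:
        for binding in premises(facts):
            log.append((name, binding))
    return log


# IF <x ex:parentOf y> => CHECK <y ex:childOf x>
def parent_pairs(facts):
    return [(s, o) for s, p, o in facts if p == "ex:parentOf"]


def child_statement(binding):
    parent, child = binding
    return (child, "ex:childOf", parent)


# IF <x rdf:type ex:Man> <x rdf:type ex:Woman> => Inconsistent!
def man_and_woman(facts):
    return [s for s, p, o in facts
            if p == "rdf:type" and o == "ex:Man"
            and (s, "rdf:type", "ex:Woman") in facts]


FACTS = {
    ("ex:Maria", "ex:parentOf", "ex:Ivan"),
    ("ex:Alex", "rdf:type", "ex:Man"),
    ("ex:Alex", "rdf:type", "ex:Woman"),
}
LOG = run_checks(
    FACTS,
    [("inverse-present", parent_pairs, child_statement)],
    [("disjoint-gender", man_and_woman)],
)
print(LOG)
```

The first kind of rule differs from an entailment rule only in its consequence: instead of adding the missing triple, it reports that the triple is absent.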

Notifications
• The client can subscribe for notifications for incoming statements matching desired graph patterns
• The patterns are then used to filter incoming statements
– Notify the subscriber about those statements that help form a new solution of at least one of the graph patterns
– Inferred statements are treated in the same way
• The subscriber should not rely on any particular order or distinctness of the statement notifications
– Retractions of statements are not notified
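
One way to read "statements that help form a new solution" is sketched below in Python: an incoming statement triggers a notification when adding it produces a solution of a subscribed graph pattern that did not exist before. This is a simplified interpretation, not the BigOWLIM mechanism; the `Repository` class and the naive pattern matcher are hypothetical, and re-evaluating every pattern on each insert would be far too slow for a real store.

```python
def match(pattern, facts, binding=None):
    """Yield variable bindings that satisfy every triple pattern in `pattern`."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for fact in facts:
        candidate = dict(binding)
        for term, value in zip(first, fact):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    break  # variable already bound to something else
            elif term != value:
                break      # constant does not match
        else:
            yield from match(rest, facts, candidate)


def solutions(pattern, facts):
    return {tuple(sorted(b.items())) for b in match(pattern, facts)}


class Repository:
    def __init__(self):
        self.facts = set()
        self.subscriptions = []  # (graph pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self.subscriptions.append((pattern, callback))

    def add(self, statement):
        after = self.facts | {statement}
        for pattern, callback in self.subscriptions:
            if solutions(pattern, after) - solutions(pattern, self.facts):
                callback(statement)
        self.facts = after


RECEIVED = []
repo = Repository()
repo.subscribe(
    [("?p", "rdf:type", "ex:Person"), ("?p", "ex:livesIn", "ex:Sofia")],
    RECEIVED.append,
)
repo.add(("ex:maria", "rdf:type", "ex:Person"))    # no complete solution yet
repo.add(("ex:maria", "ex:livesIn", "ex:Sofia"))   # completes a solution
repo.add(("ex:ivan", "ex:livesIn", "ex:Plovdiv"))  # does not match
print(RECEIVED)
```

The sketch only handles additions, in line with the note that retractions are not notified.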

Sample FactForge Queries 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX owlim: <http://www.ontotext.com/>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT * WHERE {
  ?Person dbp-ont:birthPlace ?BirthPlace ;
          rdf:type opencyc:Entertainer ;
          owlim:hasPageRank ?RR .
  ?BirthPlace geo-ont:parentFeature dbpedia:Germany .
}
ORDER BY DESC(?RR) LIMIT 100
• Who are the most important German entertainers?
• This query involves data from DBPedia, Geonames, and UMBEL (OpenCyc)
• It involves inference over types, sub-classes, and transitive relationships
• Ranking the results by ‘importance’: RDF rank

Sample FactForge Queries 2
PREFIX onto: <http://www.ontotext.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  ?entity <http://xmlns.com/foaf/0.1/topic> ?topic .
  ?entity rdfs:label ?label .
  ?topic onto:luceneQuery "American AND life~" .
}
• Get the descriptions of entities that have a FOAF topic containing ‘American’ and something that is similar to ‘Life’

OWLIM
http://www.ontotext.com/owlim
Based on published results and independent evaluations:
OWLIM is the most scalable and the most efficient semantic repository in the world and offers the most comprehensive reasoning support

Linking Data Across Different Servers
[Figure: the RDF graph of myData:Maria and myData:Ivan used earlier, with the ptop: classes and properties, the rdf:, rdfs: and owl: vocabulary, and the edges marked as inferred.]

Linking Open Data
• Linking Open Data (LOD) W3C SWEO Community project
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
• Initiative for publishing “linked data” which already includes 50+ interlinked datasets and more than 10 billion facts

Presentation Outline
• Linked Data
– Introduction to linked data and LOD
– Why People Do Not Use Linked Data?
• Reason-able Views
• FactForge: gateway to the center of the web of data
– Contents: the largest body of common-sense knowledge
– RDF Search: probably the best way to explore unknown datasets
– Loading and Inference Statistics
– Querying multiple datasets, considering their semantics
  • Or: Who is the most popular German entertainer?
– The Modigliani test for the Semantic Web

Why People Do Not Use Linked Data?
• Plenty of people in the IT world have heard about linked data and like the idea
• However, the impact of linked data in enterprises is still very limited
• Because:
– There are no well-established opinions on what linked data can “buy” for the enterprise, nor best practices for using it
  • What are the concrete benefits?
– It is not clear what it would cost
  • What are the problems?
  • What are the associated risks?

Linked Data in the Enterprise: Why?
• To facilitate data integration
– One can use LOD as an “interlingua” for enterprise data integration
– Additional public information can help alignment and linking
• To add value to proprietary data
– Public data can allow more analytics on top of proprietary data
– For instance, by linking to spatial data from Geonames
– Better description of and access to content, e.g. search for images
• To make enterprise data more open
– To make the enterprise data easier to use outside the enterprise
– Public identifiers and vocabularies can be used to access them

Linked Data in the Enterprise: Challenges
• LOD is hard to comprehend
– Diversity comes at a price
– One needs to make a query against 200 different schemata and hundreds of thousands of classes and properties
• LOD is unreliable
– Most of the servers behind LOD today are slow
  • High downtime
– Dealing with data distributed on the web is slow
  • A federated SPARQL query that uses 2-3 servers within several joins can be *very* slow
– No sort of consistency is guaranteed
  • Low commitment to the formal semantics and intended usage of the ontologies and schemata

Reason-able Views to the Web of Data
• Reason-able views represent an approach for reasoning and management of linked data
• Key ideas:
– Group selected datasets and ontologies in a compound dataset
  • Clean up, post-process and enrich the datasets if necessary
  • Do this conservatively, in a clearly documented and automated manner, so that:
    – the operation can easily be performed each time a new version of one of the datasets appears
    – users can easily understand the intervention you made to the original dataset
– Load the compound dataset in a single semantic repository
– Perform inference with respect to tractable OWL dialects
– Define a set of sample queries against the compound dataset
  • Those determine the “level of service” or the “scope of consistency” contract offered by the reason-able view

Reason-able Views: Objectives
• Make reasoning and query evaluation feasible
• Guarantee a basic level of consistency
– The sample queries guarantee the consistency of the data in the same way in which regression tests do for the quality of software
• Guarantee availability
– In the same way in which web search engines are usually more reliable than most web sites
• Easier exploration and querying of unseen data
– Lower the cost of entry through URI auto-complete and RDF search
– Sample queries provide re-usable extraction patterns, which reduce the time needed to get acquainted with the datasets and their interconnections

Two Reason-able Views to the Web of Linked Data
• FactForge: Linked Data Semantic Repository (in red)
– Some of the central LOD datasets
– General-purpose information (not specific to a domain)
– 1.2B explicit plus 0.9B inferred indexed, 4B retrievable statements
– The largest upper-level knowledge base
– http://www.ontotext.com/FactForge/
• Linked Life Data - PIKB (in yellow)
– 20+ of the most popular life-science datasets
– Complemented by gluing ontologies
– 2.7B explicit and 1.4B inferred, total of 4.1B indexed statements
– The largest body of knowledge that was used for reasoning
– http://www.linkedlifedata.com

Linked Data Semantic Repository
• Datasets: DBPedia, Freebase, Geonames, UMBEL, MusicBrainz, Wordnet, CIA World Factbook, Lingvoj
• Ontologies: Dublin Core, SKOS, RSS, FOAF
• Inference: materialization with respect to OWL 2 RL
– Seems to completely cover the semantics of the data
– owl:sameAs optimization in BigOWLIM allows reduction of the indices, without loss of semantics or performance
• Free public service at http://FactForge.ontotext.com
– Incremental URI auto-suggest
– Query and explore through Forest and Tabulator
– RDF Search: retrieve ranked list of URIs by keywords
– SPARQL end-point

FactForge Loading and Inference Statistics
Columns: Dataset; Explicit Indexed Triples ('000); Inferred Indexed Triples ('000); Total # of Stored Triples ('000); Entities ('000 of nodes in the graph); Inferred closure ratio

Dataset                  Explicit   Inferred   Total stored  Entities  Ratio
Schemata and ontologies  11         7          18            6         0.6
DBpedia (categories)     2,877      42,587     45,464        1,144     14.8
DBpedia (sameAs)         5,544      566        6,110         8,464     0.1
UMBEL                    5,162      42,212     47,374        500       8.2
Lingvoj                  20         863        883           18        43.8
CIA Factbook             76         4          80            25        0.1
Wordnet                  2,281      9,296      11,577        830       4.1
Geonames                 91,908     125,025    216,933       33,382    1.4
DBpedia core             560,096    198,043    758,139       127,931   0.4
Freebase                 463,689    40,840     504,529       94,810    0.1
MusicBrainz              45,536     421,093    466,630       15,595    9.2
Total                    1,177,961  881,224    2,058,185     283,253   0.7

Post-processing
• Several kinds of post-processing were performed
– Goal: to allow easier navigation and browsing
– Mechanisms: the results are available through system predicates
– For instance: preferred labels, text snippets and RDF Ranks for all nodes
• Final Statistics
– Number of entities (RDF graph nodes): 405M
– Number of inserted statements (NIS): 1.2B
– Number of stored statements (NSS): 2.2B
– Number of retrievable statements (NRS): 9.8B
– 7.6B statements “compressed” through BigOWLIM’s owl:sameAs optimisation
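
Why stored and retrievable counts can differ so much is easier to see on a toy example. The Python sketch below shows one possible way to exploit owl:sameAs: keep each statement once, over a single representative per equivalence class, and expand on retrieval. The slides do not describe how BigOWLIM implements its optimisation, so the union-find approach, the function name and the sample data are assumptions; the count also ignores the owl:sameAs statements themselves.

```python
from collections import Counter


def compress(triples, same_as):
    """Store triples over class representatives; count what they expand to."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    stored = {(find(s), p, find(o)) for s, p, o in triples}
    class_size = Counter(find(x) for x in list(parent))
    # Each stored triple stands for every combination of equivalent
    # subjects and objects.
    retrievable = sum(class_size[s] * class_size[o] for s, _, o in stored)
    return stored, retrievable


SAME_AS = [("dbpedia:Berlin", "geo:Berlin"), ("geo:Berlin", "fb:Berlin")]
TRIPLES = [
    ("dbpedia:Berlin", "ex:country", "dbpedia:Germany"),
    ("geo:Berlin", "ex:country", "dbpedia:Germany"),
    ("fb:Berlin", "ex:population", "3.4M"),
]
STORED, RETRIEVABLE = compress(TRIPLES, SAME_AS)
print(len(STORED), RETRIEVABLE)
```

In this toy case two stored statements answer for six retrievable ones, the same kind of gap as between the NSS and NRS figures above.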

Guess who is the most popular German entertainer?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX FactForge: <http://www.ontotext.com/>
SELECT * WHERE {
  ?Person dbp-ont:birthPlace ?BirthPlace ;
          rdf:type opencyc:Entertainer ;
          FactForge:hasPageRank ?RR .
  ?BirthPlace geo-ont:parentFeature dbpedia:Germany .
}
ORDER BY DESC(?RR) LIMIT 100
• Involves data from: DBPedia, Geonames, UMBEL, and MusicBrainz
• Requires inference over types, sub-classes, and transitive relationships
• Without FactForge, getting an answer to such queries in real time is impossible
• And the most popular entertainer born in Germany is: F. Nietzsche
– Asking factual questions to a global knowledge base can bring unexpected and strange, but formally correct results. One should be precise and consider contexts.

The Modigliani Test for the Semantic Web
• ReadWriteWeb’s founder Richard MacManus: “… the tipping point for the Semantic Web may be when one can … deliver – using Linked Data – a comprehensive list of locations of original Modigliani artworks …”
http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php

The FactForge Query Passing the Modigliani Test
PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-prop: <http://dbpedia.org/property/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX umbel-sc: <http://umbel.org/umbel/sc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ot: <http://www.ontotext.com/>
SELECT DISTINCT ?painting_l ?owner_l ?city_fb_con ?city_db_loc ?city_db_cit
WHERE {
  ?p fb:visual_art.artwork.artist dbpedia:Amedeo_Modigliani ;
     fb:visual_art.artwork.owners [ fb:visual_art.artwork_owner_relationship.owner ?ow ] ;
     ot:preferredLabel ?painting_l .
  ?ow ot:preferredLabel ?owner_l .
  OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
  OPTIONAL { ?ow dbp-prop:location ?loc .
             ?loc rdf:type umbel-sc:City ;
                  ot:preferredLabel ?city_db_loc }
  OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
}

Structured Dynamics
• Structured Dynamics is the developer of UMBEL
– A derivative of Cyc, which is one of the datasets in FactForge
• Like Ontotext, Structured Dynamics also uses GATE
• Structured Dynamics and Ontotext are starting a partnership to provide a more useful semantic structure to Wikipedia via an updated version of UMBEL
– As discussed in the Wikipedia Structured Data Challenges session

Thank you!
We develop core semantic technology
Ontotext invested 200 person-years, partnered with 100 leading groups, created some of the most popular tools, and delivered multiple solutions.
We know what works and what doesn’t
Ontotext set many benchmarks and advanced the frontiers of semantic databases.
We invented “semantic annotation”: linking text with data
Now we are prepared to interlink your data, your content, and the web