mid-ontology learning from linked data @jist2011
Inter-University Research Institute Corporation, Research Organization of Information and Systems
National Institute of Informatics
Mid-Ontology Learning from Linked Data
Lihua Zhao and Ryutaro Ichise
JIST2011, 12.05.2011, Hangzhou
Outline
Introduction
Mid-Ontology Learning Approach
Experimental Evaluation
Related Work
Conclusion and Future Work
Introduction
Linked Open Data
  295 data sets, 31 billion RDF triples (as of Sep. 2011)
  7 domains (cross-domain, geographic, media, life sciences, government, user-generated content, and publications)
  Interlinked Instances (owl:sameAs)
Introduction
Challenging Problem
  Each data set has specific ontology schema
    DBpedia: http://dbpedia.org/property/population
    Geonames: http://www.geonames.org/ontology#population
  Time-consuming to learn all the ontology schema
    DBpedia: 320 classes and thousands of properties.
  Heterogeneity of ontology schema
    http://dbpedia.org/property/populationTotal
    http://dbpedia.org/property/population
Introduction
Objective
Collected data based on “http://dbpedia.org/resource/Berlin”.
Predicate                                         Object
http://dbpedia.org/property/name                  Berlin
http://dbpedia.org/property/population            3439100
http://dbpedia.org/property/plz                   10001-14199
http://dbpedia.org/ontology/postalCode            10001-14199
http://dbpedia.org/ontology/populationTotal       3439100
...                                               ...
http://www.geonames.org/ontology#alternateName    Berlin
http://www.geonames.org/ontology#alternateName    Berlyn@af
http://www.geonames.org/ontology#population       3426354
...                                               ...
http://www.w3.org/2004/02/skos/core#prefLabel     Berlin (Germany)
http://data.nytimes.com/elements/first_use        2004-09-12
http://data.nytimes.com/elements/latest_use       2010-06-13
Introduction
Simple ontology for various data sets: Mid-Ontology
  Investigation on linked instances
    owl:sameAs links identical or related instances
    Scale down the data set
  Automatic ontology learning
    Integrate ontologies from diverse domain data sets
    Automate the ontology construction process
    Adapt to linked open data sets
Mid-Ontology Learning Approach
Data Collection
We scale down the data sets by collecting only linked instances, from which we can extract related information.
Extract data linked with owl:sameAs
  Select a core data set (inward & outward links)
  Collect all instances that have owl:sameAs
Remove noisy instances of the core data set
  Noisy instances: without any meaningful triple
Collect predicates and objects
  Collect <predicate, object> (PO) pairs from collected instances
  Collect PO pairs from linked instances (other data sets)
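The collection steps above can be sketched over an in-memory list of triples. This is only an illustration, not the authors' implementation (the slides report Java and a Virtuoso SPARQL endpoint): the function name, the prefix-based test for "core data set", the `noisy_predicates` parameter, and the shortened identifiers such as `geo:2950159` are all assumptions made for the example.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

def collect_po_pairs(triples, core_prefix, noisy_predicates=frozenset()):
    """Collect <predicate, object> pairs for core instances linked via owl:sameAs."""
    # Step 1: core instances with an inward or outward owl:sameAs link.
    links = defaultdict(set)
    for s, p, o in triples:
        if p != SAME_AS:
            continue
        if s.startswith(core_prefix) and not o.startswith(core_prefix):
            links[s].add(o)  # outward link
        elif o.startswith(core_prefix) and not s.startswith(core_prefix):
            links[o].add(s)  # inward link

    # Index all remaining triples by subject.
    by_subject = defaultdict(list)
    for s, p, o in triples:
        if p != SAME_AS:
            by_subject[s].append((p, o))

    collected = {}
    for core, linked in links.items():
        own = [(p, o) for p, o in by_subject[core] if p not in noisy_predicates]
        if not own:
            continue  # Step 2: noisy instance, no meaningful triple left
        pairs = list(own)
        for other in sorted(linked):  # Step 3: PO pairs of linked instances
            pairs.extend(by_subject[other])
        collected[core] = pairs
    return collected

# Toy data; identifiers are abbreviated and partly hypothetical.
triples = [
    ("dbpedia:Berlin", SAME_AS, "geo:2950159"),
    ("nyt:N509", SAME_AS, "dbpedia:Berlin"),
    ("dbpedia:Berlin", "db-prop:population", "3439100"),
    ("geo:2950159", "geo-onto:population", "3426354"),
    ("nyt:N509", "skos:prefLabel", "Berlin (Germany)"),
    ("dbpedia:Ghost", SAME_AS, "geo:1"),
    ("dbpedia:Ghost", "db-prop:hasPhotosCollection", "broken"),
]
collected = collect_po_pairs(triples, "dbpedia:", {"db-prop:hasPhotosCollection"})
```

In this toy run only `dbpedia:Berlin` survives; `dbpedia:Ghost` is dropped because its only non-sameAs triple uses a predicate treated as noise.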
An Example of Collected Data
dbpedia:Berlin owl:sameAs http://sws.geonames.org/2950159/
http://data.nytimes.com/N50987186835223032381 owl:sameAs dbpedia:Berlin
Collected data based on “http://dbpedia.org/resource/Berlin”.
Predicate                                         Object
http://dbpedia.org/property/name                  Berlin
http://dbpedia.org/property/population            3439100
http://dbpedia.org/property/plz                   10001-14199
http://dbpedia.org/ontology/postalCode            10001-14199
http://dbpedia.org/ontology/populationTotal       3439100
...                                               ...
http://www.geonames.org/ontology#alternateName    Berlin
http://www.geonames.org/ontology#alternateName    Berlyn@af
http://www.geonames.org/ontology#population       3426354
...                                               ...
http://www.w3.org/2004/02/skos/core#prefLabel     Berlin (Germany)
http://data.nytimes.com/elements/first_use        2004-09-12
http://data.nytimes.com/elements/latest_use       2010-06-13
Mid-Ontology Learning Approach
Predicate Grouping
Grouping related predicates from different ontology schema, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations
Predicate Grouping
Grouping related predicates from different ontology schema, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
  One predicate may have various objects
  Different predicates may have the same object value
Prune groups by similarity matching
Refine groups using extracted relations
Group Predicates by Exact Matching
Create initial groups (Gi) of PO pairs
  e.g. Gi.predicates = { db-prop:name, geo-onto:alternateName }
       Gi.objects = { Berlin, Berlyn@af }
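One way to read exact matching is as connected components: PO pairs that share a predicate or an identical object value end up in the same initial group. The sketch below uses a small union-find for this. The component reading and the function name are assumptions made for illustration, and the prefix `db-onto:` is an abbreviation introduced here for http://dbpedia.org/ontology/.

```python
def group_by_exact_matching(po_pairs):
    """Group PO pairs that share a predicate or an identical object value."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every predicate node with the object nodes it occurs with.
    for predicate, obj in po_pairs:
        union(("p", predicate), ("o", obj))

    groups = {}
    for predicate, obj in po_pairs:
        root = find(("p", predicate))
        group = groups.setdefault(root, {"predicates": set(), "objects": set()})
        group["predicates"].add(predicate)
        group["objects"].add(obj)
    return list(groups.values())

po_pairs = [
    ("db-prop:name", "Berlin"),
    ("geo-onto:alternateName", "Berlin"),
    ("geo-onto:alternateName", "Berlyn@af"),
    ("db-prop:population", "3439100"),
    ("db-onto:populationTotal", "3439100"),
    ("geo-onto:population", "3426354"),
]
initial_groups = group_by_exact_matching(po_pairs)
```

On the Berlin pairs this yields three groups, including the one from the slide with predicates { db-prop:name, geo-onto:alternateName } and objects { Berlin, Berlyn@af }. Note that geo-onto:population stays alone, because 3426354 does not exactly match 3439100.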
Predicate Grouping
Grouping related predicates from different ontology schema, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
  Exact matching may ignore
    Terms of predicates or objects written in different languages
    Semantically identical or related predicates
Refine groups using extracted relations
Prune Groups by Similarity Matching
Ontology similarity matching at the concept level
String-based similarity measure: StrSim(O(Gi), O(Gj))
  O(Gi): objects in Gi
  Prefix, Suffix, Levenshtein distance, and n-gram.
Knowledge-based similarity measure: WNSim(T(Gi), T(Gj))
  T(Gi): pre-processed terms of predicates in Gi
  Natural Language Processing: tokenizing terms, removing stop words, and stemming.
  WordNet-based similarity measures: LCH, RES, HSO, JCN, LESK, PATH, WUP, LIN, and VECTOR
Prune Groups by Similarity Matching
Similarity between initial groups {G1, G2, ..., Gk}
$$Sim(G_i, G_j) = \frac{StrSim(O(G_i), O(G_j)) + WNSim(T(G_i), T(G_j))}{2}$$
Prune initial groups Gi
  If Sim(Gi, Gj) is higher than the predefined similarity threshold, we merge Gi and Gj.
  If an initial group Gi has not been merged and has only one PO pair, we remove Gi.
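The two pruning rules can be sketched as a single greedy pass over the initial groups. The slides do not give the threshold value or the merge order, so both are assumptions here, and `toy_sim` is a hypothetical stand-in for Sim(Gi, Gj); `db-onto:` abbreviates http://dbpedia.org/ontology/.

```python
def prune_groups(groups, sim, threshold):
    """Merge groups whose similarity exceeds the threshold; drop unmerged singletons.

    Each group is a list of (predicate, object) pairs.
    """
    merged = [list(g) for g in groups]
    was_merged = [False] * len(merged)
    alive = [True] * len(merged)
    for i in range(len(merged)):
        if not alive[i]:
            continue
        for j in range(i + 1, len(merged)):
            if alive[j] and sim(merged[i], merged[j]) > threshold:
                merged[i].extend(merged[j])  # rule 1: merge Gj into Gi
                alive[j] = False
                was_merged[i] = True
    # Rule 2: remove groups that were never merged and hold a single PO pair.
    return [g for k, g in enumerate(merged)
            if alive[k] and (was_merged[k] or len(g) > 1)]

def toy_sim(a, b):
    """Hypothetical similarity: high only if every predicate mentions 'population'."""
    return 0.9 if all("population" in p.lower() for p, _ in a + b) else 0.1

groups = [
    [("db-prop:population", "3439100"), ("db-onto:populationTotal", "3439100")],
    [("geo-onto:population", "3426354")],
    [("nyt:first_use", "2004-09-12")],
]
pruned = prune_groups(groups, toy_sim, threshold=0.5)
```

With this toy similarity the two population groups are merged into one group of three PO pairs, and the unmerged single-pair NYTimes group is removed.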
An Example of Similarity Calculation
Group  Predicate                                      Object
Gi     http://dbpedia.org/property/population         3439100
       http://dbpedia.org/ontology/populationTotal    3439100
Gj     http://www.geonames.org/ontology#population    3426354

Example of String-based similarity measures on pairwise objects.
Pairwise Objects        prefix  suffix  Levenshtein distance  n-gram
“3439100”, “3426354”    0.29    0       0                     0.29

Example of WordNet-based similarity measures on pairwise terms.
Pairwise Terms          LCH  RES  HSO  JCN   LESK  PATH  WUP   LIN  VECTOR
population, population  1    1    1    1     1     1     1     1    1
population, total       0.4  0    0    0.06  0.03  0.11  0.33  0    0.06

$$Sim(G_i, G_j) = \frac{0.145 + 0.5825}{2} = 0.36375$$
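Part of the arithmetic in this example can be checked directly. The sketch below assumes the prefix and suffix measures are the shared prefix or suffix length divided by the longer string's length; the slides do not state the normalization, but this choice reproduces the reported 0.29 and 0. The slides also do not say how the four string measures are combined; their plain mean happens to equal the 0.145 used in the final formula. The Levenshtein, n-gram, and WordNet values are copied from the slide, not recomputed.

```python
def common_prefix_ratio(a, b):
    """Shared prefix length divided by the longer string's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))

def common_suffix_ratio(a, b):
    """Same measure applied to the reversed strings."""
    return common_prefix_ratio(a[::-1], b[::-1])

prefix = common_prefix_ratio("3439100", "3426354")  # shared prefix "34": 2 / 7
suffix = common_suffix_ratio("3439100", "3426354")  # no shared suffix

# Values copied from the slide tables.
str_sim = (0.29 + 0 + 0 + 0.29) / 4  # mean of prefix, suffix, Levenshtein, n-gram
sim = (0.145 + 0.5825) / 2           # Sim(Gi, Gj) as defined on the previous slide
```

Rounded to two decimals, `prefix` is 0.29 and `suffix` is 0, matching the table; `sim` evaluates to 0.36375 up to floating-point error.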
Predicate Grouping
Grouping related predicates from different ontology schema, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations
  Divide pruned groups according to rdfs:domain and rdfs:range.
  Keep groups with high frequency
Mid-Ontology Learning Approach
Mid-Ontology Construction
Select terms for Mid-Ontology
  Collect all the terms of predicates in each refined group Gi.
  Collect all the pre-processed terms of P(Gi) (predicates in Gi).
  Choose the term with the highest frequency; among equally frequent terms, prefer the longest.
    e.g. “area” and “areaCode” are totally different
Construct Relations
  mo-prop:hasMembers to link Mid-Ontology classes and integrated predicates
Construct Mid-Ontology
  Automatically construct Mid-Ontology using selected terms and mo-prop:hasMembers.
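A minimal sketch of the term-selection step, under several assumptions that the slides do not spell out: candidate terms are a predicate's local name plus its lowercase camelCase tokens, each predicate counts a term at most once, and ties in frequency go to the longest term. The helper names are hypothetical, and stop-word removal and stemming are omitted.

```python
import re
from collections import Counter

def candidate_terms(predicate):
    """Local name of the predicate plus its lowercase camelCase tokens."""
    local = re.split(r"[/#:]", predicate)[-1]
    tokens = {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", local)}
    return tokens | {local}

def select_term(predicates):
    """Pick the most frequent term; ties go to the longest, then alphabetical order."""
    counts = Counter(t for p in predicates for t in candidate_terms(p))
    return max(sorted(counts), key=lambda t: (counts[t], len(t)))

population_group = [
    "http://dbpedia.org/property/population",
    "http://dbpedia.org/property/popLatest",
    "http://dbpedia.org/property/populationTotal",
    "http://dbpedia.org/ontology/populationTotal",
    "http://dbpedia.org/property/einwohner",
    "http://www.geonames.org/ontology#population",
]
area_code_group = [
    "http://dbpedia.org/property/areaCode",
    "http://dbpedia.org/ontology/areaCode",
]
chosen_population = select_term(population_group)
chosen_area_code = select_term(area_code_group)
```

For the population predicates this picks "population". For the two areaCode predicates the terms "area", "code", and "areaCode" are equally frequent, so the longest one wins, which keeps "areaCode" from collapsing into the unrelated "area".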
Experimental Evaluation
Evaluate the Mid-Ontology approach from four different aspects:
Evaluation of Data Reduction
Evaluation of Ontology Quality
Evaluation with A SPARQL Example
Analysis of Mid-Ontology Approach
Implementation
Environment
  Linux Ubuntu 10.10, 16GB Memory, 1 TB Disk
  Core i7 CPU 880 3.07GHz
  Java, Netbeans 6.9
Virtuoso
  High-performance server for RDF storage
  SPARQL query endpoint
WordNet::Similarity
  Implemented in Perl
  Knowledge-based similarity measures
Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs
Geonames: geographical domain, 7 million URIs
NYTimes: media domain, 10,467 subject news
Choose DBpedia as the core data set, because of its wealth of inward and outward links to other data sets.
Evaluation of Data Reduction
Evaluate the effectiveness of data reduction during the data collection phase by comparing the number of instances.

Number of distinct instances during data collection phase.
Data set   Before reduction   owl:sameAs retrieval   Noisy data removal
DBpedia    8,955,728          135,749 (1.52%)        88,506 (0.99%)
Geonames   7,479,714          128,961 (1.72%)        82,054 (1.10%)
NYTimes    10,467             9,226 (88.14%)         8,535 (81.54%)

Evaluation Analysis
  The data sets are dramatically scaled down by keeping only linked instances that share related information.
  Successfully removed noisy instances, which may affect the quality of the Mid-Ontology.
    e.g. Removed instances with only db-prop:hasPhotosCollection (broken link) and owl:sameAs link.
Evaluation of Ontology Quality
Evaluate the quality of Mid-Ontology by validating whether predicates in each class share related information.
Accuracy of Mid-Ontology
  $$ACC(MO) = \frac{\sum_{i=1}^{n} \frac{|\text{Correct Predicates in } C_i|}{|C_i|}}{n}$$
  n: the number of classes
  |Ci|: the number of predicates in class Ci.
Cardinality
  $$Cardinality = \frac{|\text{Number of Predicates}|}{|\text{Number of Classes}|}$$
Evaluation of Ontology Quality
Improvement achieved by our approach
  MO_no_p_r: with exact matching (without the pruning and refining processes)
  MO: with both pruning and refining processes

MO          Number of Classes   Number of Predicates   Cardinality   Accuracy
MO_no_p_r   11                  300                    27.27         68.78%
MO          29                  180                    6.21          90.10%

Evaluation Analysis
  Significantly improved the accuracy
  Decreased the cardinality (fewer predicates and more classes)
  Successfully removed unrelated predicates
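The two quality measures are simple enough to restate in code. The sketch below re-derives the reported cardinalities from the class and predicate counts in the table (11 classes with 300 predicates, 29 classes with 180 predicates). The per-class counts passed to `accuracy` are hypothetical, since the slides report only the aggregate accuracy.

```python
def accuracy(classes):
    """ACC(MO): mean over classes of (correct predicates in Ci) / |Ci|.

    classes: list of (correct_predicates, total_predicates), one entry per class.
    """
    return sum(correct / total for correct, total in classes) / len(classes)

def cardinality(num_predicates, num_classes):
    """Average number of predicates per Mid-Ontology class."""
    return num_predicates / num_classes

card_exact_only = round(cardinality(300, 11), 2)  # exact matching only
card_full = round(cardinality(180, 29), 2)        # with pruning and refining

# Hypothetical two-class ontology: one class fully correct, one half correct.
toy_accuracy = accuracy([(2, 2), (1, 2)])
```

The rounded cardinalities come out as 27.27 and 6.21, in agreement with the table.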
Evaluation with A SPARQL Example
Evaluate the effectiveness of information retrieval with the Mid-Ontology constructed with our approach.
Predicates grouped in mo-onto:population.
  <rdf:Description rdf:about="mid-onto:population">
    <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/population"/>
    <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/popLatest"/>
    <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/populationTotal"/>
    <mo-prop:hasMembers rdf:resource="http://dbpedia.org/ontology/populationTotal"/>
    <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/einwohner"/>
    <mo-prop:hasMembers rdf:resource="http://www.geonames.org/ontology#population"/>
  </rdf:Description>
Evaluation with A SPARQL Example
SPARQL: Find places with a population of more than 10 million.
  SELECT DISTINCT ?places
  WHERE {
    mid-onto:population mo-prop:hasMembers ?prop.
    ?places ?prop ?population.
    FILTER (xsd:integer(?population) > 10000000).
  }

Single property for population                 Number of Results
http://dbpedia.org/property/population         177
http://dbpedia.org/property/popLatest          1
http://dbpedia.org/property/populationTotal    107
http://dbpedia.org/ontology/populationTotal    129
http://dbpedia.org/property/einwohner          1
http://www.geonames.org/ontology#population    244

Evaluation Analysis
  Find 517 places with mid-onto:population.
  Fewer results with each single predicate under the same condition.
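The effect of the query can be imitated without a SPARQL endpoint: first expand the Mid-Ontology class into its member predicates, then collect every subject whose value under any of those predicates exceeds the threshold. The sketch below is an in-memory illustration only; the `ex:` places and their values are invented, and `mid-onto:dbOnly` is a hypothetical single-member class used for comparison.

```python
def places_over(triples, mid_class, threshold):
    """Subjects whose value under any member predicate of mid_class exceeds threshold."""
    members = {o for s, p, o in triples
               if s == mid_class and p == "mo-prop:hasMembers"}
    places = set()
    for s, p, o in triples:
        if p not in members:
            continue
        try:
            if int(o) > threshold:
                places.add(s)
        except (TypeError, ValueError):
            pass  # non-numeric value: skipped, like a failed xsd:integer cast
    return places

triples = [
    ("mid-onto:population", "mo-prop:hasMembers", "db-prop:population"),
    ("mid-onto:population", "mo-prop:hasMembers", "geo-onto:population"),
    ("mid-onto:dbOnly", "mo-prop:hasMembers", "db-prop:population"),
    ("ex:A", "db-prop:population", "12000000"),
    ("ex:B", "geo-onto:population", "15000000"),
    ("ex:C", "geo-onto:population", "3426354"),
    ("ex:D", "db-prop:population", "unknown"),
]
via_mid_ontology = places_over(triples, "mid-onto:population", 10_000_000)
via_single_predicate = places_over(triples, "mid-onto:dbOnly", 10_000_000)
```

In the toy data the integrated class finds both `ex:A` and `ex:B`, while the single DBpedia predicate finds only `ex:A`. This mirrors the pattern in the slide, where the integrated class returns more places than any single predicate.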
Analysis of Mid-Ontology Approach
Analyze whether we can successfully identify how data sets are connected.

Sample classes in the Mid-Ontology
DBpedia              DBpedia & Geonames    DBpedia & Geonames & NYTimes
mo-onto:birthdate    mo-onto:population    mo-onto:name
mo-onto:deathdate    mo-onto:prominence    mo-onto:long
mo-onto:motto        mo-onto:postal

Evaluation Analysis
  Predicates in DBpedia are heterogeneous.
  Linked instances between DBpedia and Geonames are about places.
  Linked instances among DBpedia, Geonames, and NYTimes are about events, persons, or places.
Possible Application
Find missing owl:sameAs links
  e.g. Find missing owl:sameAs link with mo-onto:population
    http://dbpedia.org/resource/Cyclades   db-prop:population       “119549”
    http://dbpedia.org/resource/Cyclades   db-prop:name             “Cyclades”
    http://sws.geonames.org/259819/        geo-onto:population      “119549”
    http://sws.geonames.org/259819/        geo-onto:alternateName   “Cyclades”
Add owl:sameAs link
  http://dbpedia.org/resource/Cyclades owl:sameAs http://sws.geonames.org/259819/
  http://sws.geonames.org/259819/ owl:sameAs http://dbpedia.org/resource/Cyclades
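A possible way to automate this application is to propose a link whenever two instances agree on the values of several Mid-Ontology classes. The slides do not specify the matching criterion, so requiring agreement on at least two classes (`min_shared=2`) is an assumption, as are the function name and the dictionary layout of the Mid-Ontology.

```python
from collections import defaultdict

def candidate_same_as(triples, mid_ontology, min_shared=2):
    """Propose instance pairs that share values under at least min_shared classes."""
    pred_to_class = {p: c for c, members in mid_ontology.items() for p in members}
    by_value = defaultdict(set)  # (class, value) -> instances carrying that value
    for s, p, o in triples:
        if p in pred_to_class:
            by_value[(pred_to_class[p], o)].add(s)
    shared = defaultdict(set)  # (instance a, instance b) -> classes they agree on
    for (cls, _value), instances in by_value.items():
        ordered = sorted(instances)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)].add(cls)
    return sorted(pair for pair, classes in shared.items()
                  if len(classes) >= min_shared)

mid_ontology = {
    "mo-onto:population": {"db-prop:population", "geo-onto:population"},
    "mo-onto:name": {"db-prop:name", "geo-onto:alternateName"},
}
triples = [
    ("http://dbpedia.org/resource/Cyclades", "db-prop:population", "119549"),
    ("http://dbpedia.org/resource/Cyclades", "db-prop:name", "Cyclades"),
    ("http://sws.geonames.org/259819/", "geo-onto:population", "119549"),
    ("http://sws.geonames.org/259819/", "geo-onto:alternateName", "Cyclades"),
]
candidates = candidate_same_as(triples, mid_ontology)
```

On the Cyclades triples from the slide this proposes exactly one pair, the DBpedia and Geonames resources, which agree on both population and name. A real run would still need manual or statistical validation before the owl:sameAs links are added.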
Related Work
Construct intermediate-layer ontology from geospatial, zoology, and genetics data resources. [Parundekar, et al., 2010]
  Limited to a specific domain
Construct intermediate-level ontology by enriching upper ontology (by adding new classes and properties). [Damova, et al., 2010]
  Still too large
Analysis of basic properties of SameAs network, Pay-Level-Domain network and Class-Level Similarity network. [Ding, et al., 2010]
  Only frequent types are considered to analyze how data are connected
Conclusion and Future Work
Conclusion
  Learning heterogeneous ontology schema in the linked open data sets is not feasible.
  An automatic Mid-Ontology learning approach can solve the heterogeneity problem by integrating related predicates.
  The Mid-Ontology has a high accuracy and is effective for searching across various data sets.
  A simple Mid-Ontology can be constructed without learning the entire ontology schema.
Future Work
  Billion Triple Challenge (BTC) data set
  Crawl links at two or three depths without a core data set
Questions?
Lihua Zhao, [email protected]
Ryutaro Ichise, [email protected]