
UNIVERSIDAD DE VALLADOLID

EDIFICIO DE LAS TECNOLOGÍAS DE LA INFORMACIÓN Y LAS COMUNICACIONES

MASTER'S THESIS (TRABAJO FIN DE MÁSTER)

MÁSTER UNIVERSITARIO EN INVESTIGACIÓN EN TECNOLOGÍAS DE LA INFORMACIÓN Y LAS COMUNICACIONES

Analysis and comparison of strategies for ontology alignment
(Análisis y comparación de estrategias para el alineamiento entre ontologías)

Author: D. Fco. Javier Delgado del Hoyo

Tutor: Dra. Mercedes Martínez González

Co-tutor: Dr. Javier Finat Codes

Valladolid, 6 July 2011


TITLE: Analysis and comparison of strategies for ontology alignment (Análisis y comparación de estrategias para el alineamiento entre ontologías)

AUTHOR: D. Fco. Javier Delgado del Hoyo
TUTOR: Dra. Mercedes Martínez González
DEPARTMENT: Informática & Álgebra, Geometría y Topología

Examining committee (Tribunal)
PRESIDENT: Dr. D. Pablo de la Fuente Redondo
MEMBER: Dr. D. Belarmino Pulido Junquera
SECRETARY: Dr. D. Guillermo Vega Gorgojo
DATE: 6 July 2011
GRADE:

Summary of the Master's Thesis (Resumen del TFM)

Heterogeneity between information systems hinders their interoperability. Ontologies arose from the initiative known as the Semantic Web as a way to solve the problem by tagging information with a common vocabulary. However, the proliferation of ontologies representing the same knowledge has shifted the problem toward the alignment between ontologies. An ontology frequently depends on its local context, and the idea of a single global ontology for a domain is unfeasible. For these reasons, semantic alignment emerges as a line of research for automatically establishing correspondences between the concepts of two ontologies. One of the application fields of ontologies is Geographic Information Systems, where integration with Building Information and the Geospatial Web has long been a challenge. This work aims to evaluate the performance of different ontology alignment techniques applied to the integration between Geographic Information Systems and Building Information Modeling.

Keywords (Palabras clave)

semantic alignment, Semantic Web, interoperability, systems integration

Abstract

Heterogeneity between information systems limits their interoperability. Ontologies, which emerged from the Semantic Web, seem to be the key to solving interoperability through the use and sharing of a common knowledge representation. Owing to the growing interest, a large number of different ontologies have been developed to represent the same domain. An ontology is usually developed in a biased way that depends on a local context, and the use of a single global ontology is unfeasible. Thus, ontology matching emerges as a new field of research for automatically establishing correspondences between the concepts of two ontologies. One domain that exemplifies this problem is Geographic Information Systems, whose integration with Building Information and the Geospatial Web poses a challenge. In this work, the compliance and performance of different ontology matching techniques are evaluated when they are applied to the integration of information from Geographic Information Systems and Building Information Modeling.

Keywords

ontology matching, semantic mapping, ontology alignment, GIS BIM integration, CityGML IFC


Acknowledgements

To my family and friends, for providing me with the strength and knowledge needed to finish this work. Also to Mercedes and Javier, for their support and advice over the last four months. And finally, to the ontology matching research community, for the huge amount of materials (software, publications and tutorials) that are publicly available for further research.

This work was also conducted in part thanks to the Protégé resource, which is supported by grant LM007885 from the United States National Library of Medicine.


Contents

1 Introduction
    1.1 Motivation
    1.2 Project goals
    1.3 Structure of this document

2 Background
    2.1 Semantic Web
        2.1.1 Languages and technologies
        2.1.2 Ontologies and Web Ontology Language
    2.2 Ontology Matching
        2.2.1 Semantic heterogeneity
        2.2.2 The matching problem
        2.2.3 Matching techniques and algorithms
        2.2.4 String-based techniques
        2.2.5 Linguistic resources techniques
        2.2.6 Matching systems
        2.2.7 Alignment representation
        2.2.8 Evaluation
    2.3 Geographic Information Systems
        2.3.1 CityGML
        2.3.2 Building Information Modeling

3 Methodology
    3.1 Problem description
    3.2 General methodology
        3.2.1 Scheduling
        3.2.2 Literature reviewing
    3.3 Design of experiments
        3.3.1 Evaluation measures
        3.3.2 Selection of datasets
        3.3.3 Reference alignments
        3.3.4 Auxiliary tools
    3.4 Frameworks for ontology matching
    3.5 Method for experimentation

4 Experimentation
    4.1 Characterization of the ontologies
    4.2 Adaptation of the ontologies
    4.3 Construction of the reference alignments
    4.4 Experimental environment
    4.5 Analysis of the results
        4.5.1 Alignment between CityGML and IFC
        4.5.2 Alignment between CityGML and GbXML
        4.5.3 Alignment between CityGML and DBPedia
        4.5.4 Alignment between CityGML and LinkedGeoData ontology
        4.5.5 Performance evaluation

5 Conclusion and future work
    5.1 Conclusions
    5.2 Limitations
    5.3 Future work


List of Figures

1.1 Semantic Web Hype curve

2.1 Stack of Semantic Web Technologies [6]
2.2 Course catalogue integration
2.3 Representation of the Ontology Matching process
2.4 Classification of elementary matching techniques (extracted from [67])
2.5 S-Match example
2.6 Minimal mapping example
2.7 SPSM course catalog example
2.8 Matching of two Web Services using SPSM (functions are in rectangles)
2.9 Example of Wikipedia categorization for BLOOMS trees
2.10 Sample of RDF file in the Alignment API format
2.11 CityGML modules
2.12 Classes of CityGML building module
2.13 Example of CityGML dataset
2.14 Levels of detail of CityGML
2.15 IFC building classes
2.16 BIM and GIS applications along building life cycle
2.17 BIM and GIS integration scenario

4.1 String-based techniques for alignment between CityGML-IFC
4.2 Linguistic techniques for alignment between CityGML-IFC
4.3 Matching systems for alignment between CityGML-IFC
4.4 String-based techniques for alignment between CityGML-GbXML
4.5 Linguistic techniques for alignment between CityGML-GbXML
4.6 Matching systems for alignment between CityGML-GbXML
4.7 String-based techniques for alignment between CityGML-DBPedia
4.8 Linguistic techniques for alignment between CityGML-DBPedia
4.9 Matching systems for alignment between CityGML-DBPedia
4.10 String-based techniques for alignment between CityGML-LinkedGeoData
4.11 Linguistic techniques for alignment between CityGML-LinkedGeoData
4.12 Matching systems for alignment between CityGML-LinkedGeoData


List of Tables

2.1 Approaches for integrating IFC and CityGML

4.1 Comparison of ontology metrics for the dataset
4.2 Number of correspondences of the reference alignments
4.3 Speed and memory consumption of the different techniques


Chapter 1

Introduction

Summary

Berners-Lee's vision of the Semantic Web views the Web as a large database where information is stored in a distributed repository. One of the main obstacles to achieving this vision is interoperability between information silos. Ontologies, which come from Artificial Intelligence, seem to be the key to getting all systems to speak the same language. Since the OWL representation language appeared in 2004, many ontologies have emerged to represent the same domain, which has only shifted the problem to a new one: how can two ontologies be aligned? Ontology alignment, or Ontology Matching, arises as a new research field to address this problem. It is still a growing field that has not yet reached sufficient maturity or stability, as can be seen in Figure 1.1.

This new field has numerous applications, such as information integration, Web Service discovery and P2P exchange. In this work we are interested in studying the possibilities of the first of these, especially in the field of Geographic Information Systems related to Building Information Modeling and the Geospatial Web. This field is of special interest because of the lack of research on solving the problem automatically, because interest in it has grown in recent years, and because the ontologies have a complex structure and very technical terms. The ontologies of interest are CityGML, IFC, GbXML, DBPedia and the ontology of the Linked Geo Data initiative. CityGML acts as the reference ontology in the field, so the alignments are centered on it. Currently, integration is carried out through CityGML Application Domain Extensions (ADE), which are developed with the assistance of experts. Automating the process would ease this work.

The goal of this work is to analyze the behavior and performance of several alignment techniques on these ontologies. It also aims to characterize beforehand the ontologies involved, as well as the current approaches for integrating CityGML with IFC manually.

The rest of this report is organized as follows. Chapter 2 describes the theoretical knowledge needed to understand the rest of the research. Chapter 3 explains the methodology followed to carry out the research, including the experimentation part. Chapter 4 is devoted to the results of the work carried out during the experimentation phase, whose conclusions are summarized in Chapter 5, which also indicates the limitations and future work.

1.1 Motivation

The Semantic Web vision of Berners-Lee [7] (often called the Web of Data) sees the Web as a collection of unrestricted linked data. Another key concept of the original idea was automatic reasoning by machines instead of humans, allowing data to be processed automatically. Thus, data have to be represented and modeled according to some restrictions, by means of an upper-level schema or data dictionary that represents the concepts and not only the instances. The answer to these challenges was the inclusion of ontologies in the Semantic Web.

Ontologies, a concept from philosophy and artificial intelligence, provide reasoning and modeling capabilities to the Semantic Web. The term was defined in 1993 by T. R. Gruber as "a specification of a conceptualization"¹ in [40]. Since then, ontology engineers around the world have developed several ontologies for different particular domains: government, education, bioinformatics, entertainment, publications and geospatial data. They are included in many applications, such as recommendation systems, semantic search and e-commerce. However, in open and dynamically changing systems like the Web, different communities usually adopt different ontologies.

For the same domain, many ontologies (even versions of the same ontology) are usually developed, causing new issues in their maintenance, sharing and exchange. Thus, merely using ontologies does not reduce the heterogeneity problem; a preprocessing step is often necessary in which ontology reconciliation takes place by means of some confidence measure. This measure is usually computed through techniques coming from, for example, machine learning, natural language processing or graph matching. The research field that emerged to tackle this challenge is called Ontology Matching², a broad topic addressed in [29].

The European Knowledge Web project studies the present and future of Semantic Web technologies. The deliverable [19], published in 2007, includes a Gartner hype curve³ along which the Semantic Web technologies are placed (see Figure 1.1). One of the topics included in the curve is ontology matching, referred to there as alignment. The main remark is that both researchers and practitioners agree on locating this topic just before the peak of inflated expectations, with a long-term horizon (5 to 10 years) to mainstream adoption. Hence, there are still many challenges to be addressed before ontology matching technology can be considered mature enough. This is one of the basic requirements for the present work: the research introduced here must continue in the future. Comparing the papers published in 2010⁴ in major conferences and journals with those of previous years, the overall trend shows that ontology matching keeps growing, which confirms that it is an interesting research topic.

¹ http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
² http://ontologymatching.org
³ http://www.gartner.com/pages/story.php.id.8795.s.8.jsp
⁴ http://www.ontologymatching.org/publications.html

Figure 1.1: Hype curve: comparison between the researchers' and the practitioners' viewpoints (from [19])

Despite the efforts to develop internationally accepted ontologies, it is very difficult to avoid two organizations using different ontologies to represent the same reality. An ontology alignment is a set of mappings between concepts of two different ontologies (one could be a version of the other). To allow the exchange of information between distributed heterogeneous systems, an alignment between the ontologies is needed. The alignments can later be used for ontology merging or integration, the two main operations. Many applications would benefit from this research, such as query answering, database or catalog integration, peer-to-peer communication systems and web service discovery. Our contribution in this work is to evaluate the suitability of ontology matching techniques for automatic merging in the domain of Geographic Information Systems (GIS).

In recent years GIS have gained wide attention within and outside the geospatial information communities. Their main applications are the visualization and analysis of the territory in decision support systems (e.g. emergency management or land planning). Nowadays, 3D digitization techniques such as Light Detection and Ranging (LIDAR) [14], photogrammetry and Computer Aided Design (CAD) are cheaper and more accessible, so the amount of 3D information is growing, which is paving the way from 2D to 3D GIS. This creates the need for an international standard to exchange 3D city models between different stakeholders, such as public administrations, enterprises, professionals and citizens. After some years of research, CityGML [52], developed since 2004, was adopted as a standard in 2008 by the Open Geospatial Consortium (OGC)⁵.

⁵ http://www.opengeospatial.org/standards/citygml


It is now easier to provide open access to geospatial data thanks to CityGML, which is usually implemented by means of Web Services from different sources. However, the information modeled with CityGML is not enough to address some more specific problems, such as the management of utility networks, hydrological information or building information. The traditional way of extending CityGML consists of designing and developing tailored Application Domain Extensions (ADE), which include the required knowledge from the overlapping external domain. The development of these extensions is a manual process that requires expert assessment to extend the core model. Owing to the rapid growth of the Semantic Web, it is common to find ontologies in which the required knowledge is already modeled and which overlap with CityGML. In general, the development of the extensions is expensive, difficult and static, i.e. every change or revision of the ontologies makes the extensions incompatible with the previous ones. Thus, the challenge is how to discover and merge joint points between CityGML and other ontologies based on locally defined semantics, which can be formulated as a problem of semantic heterogeneity.

The optimal solution would be the automatic matching of data from CityGML with data from other specific domains that share a common overlap. This can be posed as a heterogeneous data source interoperability problem, an active research field addressed by database and digital library researchers. In recent years this problem has been tackled with Semantic Web technologies.

Despite not being encoded in a specific ontology language, CityGML is commonly accepted as a geospatial ontology [61]. As a matter of fact, its specification includes the essential components of any ontology: classes, properties and relationships. Furthermore, the expert knowledge needed to develop CityGML extensions is usually available and modeled by other ontologies. Integrating information from related domains that partially overlap can improve the automation of some tasks, such as emergency response and decision making by means of intelligent agents. At the same time, the integration of CityGML with the ontologies proposed for the Geospatial Semantic Web [23] could mean the fusion of two worlds which traditionally share data but have different target users.

In summary, the two main reasons motivating this research are the following: 1) the interoperability between GIS and CAD when semantics are included is a relevant problem that has not been addressed automatically by exploiting the ontological view of the data schemas; 2) the ontologies on which the techniques are evaluated are not conventional: they are highly refined standards with many technical lexical terms, a complex structure and a low degree of overlap.

1.2 Project goals

The main goal of this research is to evaluate the application of ontology matching techniques for integrating information coming from the GIS and BIM professional fields. The emerging Semantic GeoSpatial Web field is also taken into account, since it shares knowledge with them although its target users differ. Previous approaches in the literature highlight the importance of solving interoperability between GIS and CAD, which justifies its selection as the case study. This goal can be decomposed into the following secondary goals:

1. The characterization of current approaches for solving interoperability between GIS and CAD, especially when they include semantics (that is, the concepts and properties of the objects).

2. The characterization of the ontologies involved in the experiments, pointing out the differences from other ontologies used for evaluating ontology matching.

3. The evaluation of the performance, and the understanding of the behavior, of ontology matching algorithms when they are applied to complex ontologies related to the GIS and CAD fields.

1.3 Structure of this document

The rest of the document is structured as follows:

Chapter 2 clarifies notions and concepts and explains the background needed to understand the research work. This involves a survey of Semantic Web technologies and languages such as OWL. The ontology matching framework is formally introduced, including classification criteria for elementary techniques. The theoretical foundation of the matching strategies evaluated in Chapter 4 is also introduced in this chapter. Later, the ways of representing alignments are discussed. Finally, the chapter introduces concepts about 3D Geographic Information Systems and Building Information Modeling, pointing out the interoperability problem investigated in this work.

Chapter 3 describes in detail the research methodology followed to achieve the goals described in Section 1.2. Here, the decisions taken at each step of the entire research process are summarized. The chapter also includes the evaluation measures used in the experiments, the description of the ontologies used as datasets with their main features, the matching systems (matching implementations) used to perform the experiments, and the auxiliary tools for adapting the ontologies, executing the experiments and representing the final results.

Chapter 4 explains the manipulation and parametrization performed on the datasets and the behavior of the tools and frameworks documented in the previous chapter. It shows and discusses the results returned by the experiments. Several tables and figures present the experimental results, which help to explain the conclusions of the next chapter.

Finally, Chapter 5 concludes the work with the conclusions and lessons learned, sketching new ideas for future work that should improve the results. Here, we highlight the advantages and drawbacks of ontology matching techniques for integrating information between GIS and CAD.


Chapter 2

Background

Summary

This chapter gives an overview of the current state of the art of the technology involved in this research. It is divided according to the three main research areas involved: the Semantic Web, Ontology Matching and Geographic Information Systems.

The Semantic Web is a new vision of the Web that gives machines a more prominent role than humans, by tagging and publishing content using standard vocabularies. These vocabularies, such as RDF, make it easier to link content (in the so-called Linked Open Data), search it, retrieve it, and so on. Other languages such as RDFS or OWL are intended for defining ontologies, which allow controlled vocabularies and relationships between terms to be specified in order to enrich the semantics of content published in RDF. The new OWL2 specification introduces some changes with respect to the previous one, such as three new specification levels that differ in the description-logic inference power each one allows. These languages, together with other technologies, are usually represented by the stack in Figure 2.1.

The proliferation of different ontologies related to the same domain creates the need to address the alignment problem. This problem is an instance of the semantic heterogeneity that can exist between two systems. The alignment process can be defined as a function that depends on two input ontologies, a collection of parameters and, optionally, a previous alignment and external knowledge sources. The result of an alignment is a set of correspondences (or mappings) between terms of the two ontologies. A multitude of basic techniques exist to implement this alignment, and they are usually combined following different strategies.
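The functional view of the alignment process described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the names `Correspondence` and `match`, the `threshold` parameter, the fixed confidence values and the representation of an ontology as a plain list of term labels are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Correspondence:
    """One mapping between a term of each ontology (hypothetical structure)."""
    entity1: str
    entity2: str
    relation: str = "="
    confidence: float = 1.0


def match(onto1: Iterable[str],
          onto2: Iterable[str],
          params: Optional[Dict[str, float]] = None,
          input_alignment: Optional[Iterable[Correspondence]] = None,
          resources: Optional[Dict[str, Set[str]]] = None) -> Set[Correspondence]:
    """Toy matcher: two ontologies, parameters, and an optional previous
    alignment and external resource go in; a set of correspondences comes out."""
    params = params or {}
    resources = resources or {}                 # assumed: lower-case term -> synonyms
    threshold = params.get("threshold", 0.5)    # assumed parameter name and default
    alignment = set(input_alignment or ())      # start from the previous alignment
    known = {(c.entity1, c.entity2) for c in alignment}
    terms2 = list(onto2)                        # allow repeated iteration
    for e1 in onto1:
        for e2 in terms2:
            if (e1, e2) in known:
                continue                        # keep the pre-existing correspondence
            a, b = e1.lower(), e2.lower()
            if a == b:
                confidence = 1.0                # identical labels
            elif b in resources.get(a, set()):
                confidence = 0.8                # arbitrary value for a synonym hit
            else:
                confidence = 0.0
            if confidence > 0.0 and confidence >= threshold:
                alignment.add(Correspondence(e1, e2, "=", confidence))
    return alignment
```

For instance, `match(["Building", "Door"], ["building", "Entrance"], resources={"door": {"entrance"}})` yields two correspondences, one from the label comparison and one from the external resource. Real matchers replace the inner comparison with the basic techniques discussed in this chapter and combine several of them.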

Alignment techniques and algorithms can be classified according to different aspects: aspects of the input, such as the format (XSD, RDF, OWL) or the level of operation (instances or schema); aspects of the process, such as the use of external resources or whether it is probabilistic or deterministic; and aspects of the output, such as whether it provides a similarity measure or determines different types of relations. This classification can be seen in Figure 2.4. For this work we have selected representative techniques from all the groups in order to compare their performance. Notable among them are the substring distance, the synonym distance and S-Match, which exploits the logical part of the ontology (viewed as a collection of rules). For a more detailed description see Section 2.2.3. The techniques and algorithms are evaluated annually in the OAEI contest, which every year proposes new collections of ontologies to measure the progress of each one and its strengths or weaknesses.
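As an illustration of the simplest of these techniques, the sketch below computes a substring similarity between two labels. It assumes the commonly used definition based on the longest common substring t of two strings x and y, namely 2|t| / (|x| + |y|), with the corresponding distance being one minus that value; the variant evaluated in this work may differ, and the function names are hypothetical.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def substring_similarity(x: str, y: str) -> float:
    """Assumed definition: 2 * |t| / (|x| + |y|), case-insensitive."""
    x, y = x.lower(), y.lower()
    if not x and not y:
        return 1.0  # two empty labels are treated as identical
    return 2.0 * longest_common_substring_length(x, y) / (len(x) + len(y))


def substring_distance(x: str, y: str) -> float:
    """Distance counterpart of the similarity above."""
    return 1.0 - substring_similarity(x, y)
```

With this definition, `substring_similarity("Building", "BuildingPart")` is 2 * 8 / (8 + 12) = 0.8, while labels with no characters in common score 0.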

The generated alignments must be represented in a common format that can be reused in other systems for some application, or simply to compare results. The de facto standard is the Alignment format, which is implemented and supported in the INRIA Alignment API. There are other proposals, some of them more powerful ontologies for representing alignments, such as the Semantic Bridge Ontology, and others that are widely used for representing knowledge, such as SKOS. There are even some for representing rules, such as OntoMorph.
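To give an idea of what such a common format looks like, the sketch below builds a single correspondence as an XML element with Python's standard library. The element names (`Cell`, `entity1`, `entity2`, `relation`, `measure`) and the namespace URI are assumptions recalled from the Alignment format and should be checked against the Alignment API documentation and the sample in Figure 2.10; the function `cell_element` and the example URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the Alignment format; verify before relying on it.
ALIGN_NS = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

ET.register_namespace("align", ALIGN_NS)
ET.register_namespace("rdf", RDF_NS)


def cell_element(entity1: str, entity2: str,
                 relation: str = "=", measure: float = 1.0) -> ET.Element:
    """Build one correspondence: two entity URIs, a relation and a confidence."""
    cell = ET.Element(f"{{{ALIGN_NS}}}Cell")
    e1 = ET.SubElement(cell, f"{{{ALIGN_NS}}}entity1")
    e1.set(f"{{{RDF_NS}}}resource", entity1)
    e2 = ET.SubElement(cell, f"{{{ALIGN_NS}}}entity2")
    e2.set(f"{{{RDF_NS}}}resource", entity2)
    rel = ET.SubElement(cell, f"{{{ALIGN_NS}}}relation")
    rel.text = relation
    conf = ET.SubElement(cell, f"{{{ALIGN_NS}}}measure")
    conf.set(f"{{{RDF_NS}}}datatype", XSD_FLOAT)
    conf.text = str(float(measure))
    return cell


if __name__ == "__main__":
    # Example URIs are placeholders, not identifiers from the real ontologies.
    cell = cell_element("http://example.org/citygml#Building",
                        "http://example.org/ifc#IfcBuilding", "=", 0.8)
    print(ET.tostring(cell, encoding="unicode"))
```

A complete alignment file wraps many such cells together with metadata about the two ontologies; that wrapper is omitted here to keep the sketch within what the text describes.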

The last section describes the state of the art of ontologies in the application domain: Geographic Information Systems. In this field the Open Geospatial Consortium is the standardization body, and in 2008 it proposed CityGML as a standard for representing the semantics and geometry of urban objects. Although integrating the geometry is a problem in itself, this work focuses on the semantics. One of the main current challenges is the integration of Building Information, whose most important standard is IFC. This integration would improve processes in architecture, engineering and construction, as well as decision making, thanks to the enrichment of the information. While CityGML targets the urban scale, IFC focuses only on the building, so the ontologies overlap only partially. The different levels of detail supported by CityGML allow building information to be integrated at level 4. GbXML is another ontology with building information, but it is aimed at simulation applications and therefore contains a multitude of parameters about materials, equipment and installations.

This chapter introduces the state of the art in the three research areas that converge in this work. First, the Semantic Web is the general framework in which information is exchanged between systems, which usually make use of ontologies. Secondly, ontology matching is an emerging field within the Semantic Web which provides algorithms, languages and tools to find correspondences between concepts of two different ontologies. Finally, 3D Geographic Information Systems are the particular kind of information systems showing interoperability issues which can be solved by means of ontology matching techniques.

2.1 Semantic Web

The Semantic Web or the Web of Data [7] introduces the basic idea of annotating traditional web contents with semantic information or machine-readable metadata. The Semantic Web allows machine reasoning and understanding of the content meaning. Thus, data can be processed directly or indirectly by machines, simplifying the interoperability between information systems or autonomous intelligent agents.


Figure 2.1: Stack of Semantic Web Technologies [6]

2.1.1 Languages and technologies

Since its conception, many technologies and languages have been developed by the World Wide Web Consortium (W3C) in order to provide the necessary support to speed up the adoption of the Semantic Web. Ontologies are a key part of this new conception of the Web and are directly related to this work, so we focus on the technologies for their representation. Figure 2.1 shows the architecture of the Semantic Web technologies, which is usually represented as a stack because each one is built on top of its predecessor. Although many technologies exist, three of them are the most relevant for the aims of this work:

• The Resource Description Framework [51] (RDF) was designed to support the annotation and publishing of metadata on the Web. There are different serializations for representing RDF contents, such as RDF/XML, Turtle, N3 and N-Triples.

• The RDF Schema [58] (RDFS) provides a basic ontology that formally describes concepts, properties and relationships for modeling the knowledge in a domain of interest. Its expressiveness is powerful enough for representing taxonomies but not for representing some advanced restrictions (e.g. cardinality).

• The Web Ontology Language (OWL) [5] was released in its first version in 2004. It was intended to overcome the expressiveness limitations of RDFS by creating a powerful (but also complex) modeling language. This complexity makes reasoning over the full language undecidable, so different dialects had to be defined. The next section gives more information about these dialects.

2.1.2 Ontologies and Web Ontology Language

Ontology is a concept from philosophy that, according to T. Gruber [39,40], in the context of information science "defines a set of representational primitives (classes, attributes and relationships) with which to model a domain of knowledge or discourse". In the Semantic Web, ontologies define a formal structure for the metadata and taxonomies for concepts, constraining the RDF triples that can be considered valid.

The Web Ontology Language (OWL) is a family of knowledge representation languages characterized by formal semantics and RDF/XML serialization. Its first specification (OWL 1.0 [5]) was proposed in 2004 by the W3C and the second one (OWL 2.0 [60]) in 2009 by the same organization. In version 1.0 three different dialects were considered:

• OWL Lite, which includes a classification hierarchy and simple constraints for basic usages. It is not used in practice because most of its expressiveness constraints amount to syntactic inconveniences, making it at least as difficult to implement as OWL DL.

• OWL DL, which provides maximum expressiveness while retaining computational completeness and decidability, so that it remains computable in practice. It corresponds to the description logic that forms the foundation of OWL.

• OWL Full, which has different semantics that are fully compatible with RDFS, but for which no complete reasoning support is available.

OWL 2 provides several profiles which can be implemented more simply and/or efficiently. Each profile defines some restrictions on the structure of OWL 2 ontologies:

• OWL 2 EL is suitable for ontologies defining a large number of classes, allowing ontology consistency, class expression subsumption and instance checking to be performed in polynomial time.

• OWL 2 QL provides the necessary features to represent conceptual models such as UML or ER diagrams. Based on the DL-Lite family of description logics, it allows data stored in a relational database system to be queried and accessed by rewriting the query into an SQL query that is answered without changing the data.

• OWL 2 RL provides a trade-off between expressiveness and scalable reasoning needed by some kinds of applications. It defines a syntactic subset of OWL 2 that is implementable using current rule-based technologies, and it presents a partial axiomatization in the form of first-order implications.

The vocabulary introduced by OWL extends the expressiveness of RDFS to a higher level, allowing the definition of:

• Classes: the most basic concepts in a domain should correspond to classes represented with owl:Class. Every individual in an ontology is a member of owl:Thing. Complex taxonomies are represented by means of rdfs:subClassOf.

• Properties: they assert general facts about the members of classes as binary relations, using owl:ObjectProperty and owl:DatatypeProperty. The first one models relations between instances of two classes and the second one models relations between instances of a class and RDF literals.


• Property characteristics: they are used to further specify properties, providing a powerful mechanism to enhance reasoning about a property. They include transitive properties (owl:TransitiveProperty), symmetric properties (owl:SymmetricProperty), functional properties (owl:FunctionalProperty) and inverse properties (owl:inverseOf).

• Property restrictions: they constrain a property in the context of an owl:Restriction, indicating the restricted property with owl:onProperty. Some examples are owl:allValuesFrom, owl:someValuesFrom, owl:cardinality and owl:hasValue.

OWL also defines a vocabulary for ontology mapping that facilitates the sharing, reuse and composition of ontology collections. This mechanism allows different existing ontologies to be merged and integrated, avoiding the hard work in the ontology development process where classes and properties have to be hooked together to maximize implications. Moreover, it allows bridges between two ontologies to be built automatically after the ontology matching process. This vocabulary consists of the following terms: owl:equivalentClass, owl:equivalentProperty, owl:sameAs, owl:differentFrom, owl:AllDifferent.

2.2 Ontology Matching

Interoperability among people of different cultures and languages, having different viewpoints and using different terminology for modeling the same knowledge, has always been a huge problem. With the advent of the Web and the consequent information explosion, which changes dynamically every day, the problem has become more acute. People face the concrete problems of retrieving, disambiguating and integrating information coming from a wide variety of sources.

Generally, ontology matching is the process of determining correspondences between concepts arising from two or more ontologies or schemas. It represents a fundamental technique in many application areas such as resource discovery, data integration, data migration, query translation, peer-to-peer networks, agent communication, and schema and ontology merging. It has been proposed as a valid solution to the semantic heterogeneity problem (see Section 2.2.1), namely managing the diversity in knowledge.

Figure 2.2 shows two extracts of University course catalogues which serve as motivation. The matching between both course catalogues could be useful in the case of a transfer of a student from one University to another, where the latter has to decide which courses to recognize from the former University. This is a classical example of catalogue integration, one of the applications of ontology matching.

The next sections give a complete overview of the state of the art in different aspects of ontology matching, such as elementary techniques, matching strategies, evaluation and representation of the alignment. During the last 10 years a large community has formed around this problem, which has led to several publications and conferences around the world such as [17, 49, 67, 68].


Figure 2.2: Example of two course catalogues matching

2.2.1 Semantic heterogeneity

The problem of data heterogeneity is frequently addressed in distributed systems when they must exchange information not represented in the same terms, structure and rules (or a combination of them). Depending on the source of heterogeneity [29], there are three different types:

• Terminological, if the names referring to the same concepts are different. For example, Paper and Article.

• Syntactical, if the structures or language used to represent data are different. For example, OWL and XML.

• Conceptual, if there are differences in the modelling of the same domain of interest. There are three main reasons for these to hold: differences in coverage (possibly overlapping), granularity (level of detail) or perspective (thematic interest).

The rise of the Semantic Web has led to the development of several particular ontologies, each encoding a particular modelling of a knowledge field, called contextual ontologies [12]. Thus, the problem of system heterogeneity could not be solved by means of ontologies alone. The last decade of continuous development of ontologies has shown the need to discover, represent and maintain alignments between ontologies (and database schemas) in a semi-automatic way (see figure 1.1). Semantic matching, or alignment, is the response of the scientific community to this need. It establishes the best correspondence between pairs of terms from two different ontologies.

It is important to note that the semantic bridges (links between terms) are created between concepts from the ontology and not between the data. This last approach is closer to the Linked Open Data (LOD) [9] community, where the contents of RDF triples are matched to discover non schema-based relations between them. In fact, while there is a huge amount of RDF data linked in the Semantic Web framework, ontologies in the Semantic Web still remain unlinked.


Figure 2.3: Representation of the Ontology Matching process

2.2.2 The matching problem

The Ontology Matching process can be formally defined as a tuple of elements (see figure 2.3) involved in the achievement of an alignment between terms of two ontologies. More concretely, the alignment is defined as the result of the matching process in the following terms: Alignment = Matching(o, o′, A, p, r), where:

• o and o′ are the ontologies or schemas to be matched

• A is an optional input alignment to enhance

• p is a set of parameters (weights, thresholds, etc)

• r is a set of external resources (thesauri, etc)

The alignment is composed of a set of mapping elements, each of which is defined as Mapping = 〈id, e, e′, n, R〉, where:

• id is an identifier

• e and e′ are the entities

• n is a confidence measure

• R is a relation between e and e′ (≡, ⊒, ⊥, ⊓)
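
As a rough illustration of the two definitions above, the following Python sketch models a mapping element as a small record and an alignment as a list of such records. The class name, the field names, the sample entities and the threshold filter are assumptions made for this example; they are not taken from any particular matching system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mapping:
    """One mapping element <id, e, e', n, R> (field names are illustrative)."""
    id: str
    e: str          # entity from the first ontology o
    e_prime: str    # entity from the second ontology o'
    n: float        # confidence measure, assumed here to lie in [0, 1]
    R: str          # relation between e and e', written as a plain label


def filter_alignment(alignment: List[Mapping], threshold: float) -> List[Mapping]:
    """Keep only the mappings whose confidence reaches the threshold."""
    return [m for m in alignment if m.n >= threshold]


# Hypothetical alignment between two toy publication ontologies.
alignment = [
    Mapping("m1", "o1#Paper", "o2#Article", 0.92, "equivalent"),
    Mapping("m2", "o1#Person", "o2#Author", 0.55, "more-general"),
]
confident = filter_alignment(alignment, threshold=0.8)
```

Here the threshold plays the role of one of the parameters p of the matching process; only the first mapping survives the filter.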

2.2.3 Matching techniques and algorithms

Matching algorithms can be classified along several independent dimensions. According to the definition of the matching process shown in figure 2.3, there are three main criteria for classifying the algorithms:

• Kind of input

– Conceptual model for expressing ontologies, such as Entity-Relationship, XML, RDF or OWL.


– Kind of information exploited, such as schema-level information, instance-level information or a combination of both.

• Matching process

– The computation model can be exact or approximate (to achieve better performance).

– Interpretation of the input data can be classified into syntactic (intrinsic input), external (resources) and semantic (semantic theory of the considered entities).

• Output form

– Kind of answer such as graded (confidence measure) or all-or-nothing.

– Kind of entity relations such as equivalence (=), subsumption (⊑) and incompatibility (⊥).

Elementary matching techniques are those that form the basis for creating more complex strategies to perform ontology matching. According to [67], these techniques can be classified using two synthetic classifications inspired by the above matching dimensions. Elementary techniques are represented within the leaves shared by two trees corresponding to the two classifications (see figure 2.4). Both classifications are explained in the next paragraphs.

Granularity / Input Interpretation classification

This classification is represented as a two-level tree where the first level divides techniques based on the matcher granularity, i.e., element-level or structure-level. The first kind includes those techniques which compute correspondences using isolated entities, while the second one uses the relations between entities. Next, the second level divides the techniques based on the interpretation of the input information, i.e., syntactic (if they interpret the input by its sole structure), external (if they exploit auxiliary resources or common knowledge) or semantic (if they use formal semantics grounded on some theoretical model).

Kind of input classification

Like the previous one, this classification is represented as a two-level tree based on the kind of input considered by the technique. At the first level, techniques are categorized by the kind of data the algorithms work on, such as strings (terminological), structure (structural) or models (semantic). While the first two can be found in the ontology, the third one uses a reasoner to deduce the correspondences. The second level is not always mandatory. It decomposes the upper categories into the following subcategories:

• Terminological methods are subdivided into string-based (terms are treated as a sequence of characters) or linguistic (terms are treated as linguistic objects).

• The structural category splits the methods into internal (considering the internal structure of entities) and relational (considering the relations between entities).


Figure 2.4: Classification of elementary matching techniques (extracted from [67])


Selecting techniques from different categories broadens the scope of the research. The matching systems or matchers described in section 2.2.6 are based mainly on structural techniques (both at schema level and instance level) but they also include linguistic distances (based on WordNet) and string similarities. However, their combination does not make it possible to measure the degree of importance of each kind of technique, so these techniques must be evaluated separately. In most cases, these techniques are the simplest and at the same time the most effective. The next two sections describe two very common types of techniques: string-based techniques and techniques based on linguistic resources.

2.2.4 String-based techniques

These techniques are used for matching names and comments of the ontology entities by considering strings as sequences of letters in an alphabet. They are typically based on the following intuition: the more similar the strings, the more likely they are to denote the same concepts. There exist several distance functions which map a pair of strings to a real number. Usually, a smaller value of the real number indicates a greater similarity between the strings, since the distance is opposite to the similarity. The next subsections introduce the techniques we have chosen, which are extensively used in several matching systems. There are three main categories in which the distances can be framed:

• String equality is the most basic measure, which returns 1 if the compared strings are identical and 0 otherwise. It requires previous normalization to lowercase, using the same character encoding and removing accents.

• Substring distance is a variation of the string equality which considers strings very similar when one is a substring of another.

• String edit distance is the minimal cost of the operations for transforming one string into the other. It is specially suited for measuring similarity in the presence of spelling mistakes. The transformations, which include insertion, replacement and deletion of a character, each have an assigned cost, so the distance between two strings is the sum of the costs of the operations in the least costly set of operations.
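
The normalization required by string equality can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and stripping accents through Unicode decomposition is one possible choice among several.

```python
import unicodedata


def normalize(label: str) -> str:
    """Lowercase a label and strip accents (one possible normalization)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower().strip()


def string_equality(s: str, t: str) -> int:
    """Return 1 if the normalized strings are identical and 0 otherwise."""
    return 1 if normalize(s) == normalize(t) else 0
```

For instance, under this normalization the labels "Café" and "cafe" are considered equal, while "cafe" and "coffee" are not.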

Hamming distance

The string equality measure does not explain how strings differ. The Hamming distance, which counts the number of positions in which the two strings differ [42], is a more sophisticated way of comparing two strings.

Definition 1. The normalized Hamming distance is a distance d : S × S → [0, 1] such that

$$d(s, t) = \frac{\left(\sum_{i=1}^{\min(|s|,|t|)} s[i] \neq t[i]\right) + \bigl|\,|s| - |t|\,\bigr|}{\max(|s|, |t|)}$$
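
Definition 1 translates almost directly into code. The following is a minimal sketch; the function name and the convention of returning 0 for two empty strings are assumptions of this example.

```python
def normalized_hamming(s: str, t: str) -> float:
    """Normalized Hamming distance between two strings, in [0, 1]."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0  # assumption: two empty strings are at distance 0
    # zip stops at the shorter string, i.e. at min(|s|, |t|) positions.
    mismatches = sum(a != b for a, b in zip(s, t))
    # The length difference counts as additional mismatching positions.
    return (mismatches + abs(len(s) - len(t))) / longest
```

For example, "paper" and "papers" share their first five characters and differ by one in length, so the distance is 1/6.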


Substring similarity

The substring distance is usually implemented by measuring the ratio of the common subpart between two strings. The definition can be used for building functions based on the longest common prefix or longest common suffix.

Definition 2. Substring similarity is a similarity σ : S × S → [0, 1] such that ∀x, y ∈ S, with t the longest common substring of x and y:

$$\sigma(x, y) = \frac{2|t|}{|x| + |y|}$$
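
A possible implementation of Definition 2 is sketched below. The longest common substring is found with a standard dynamic programming table; the function names and the value returned for two empty strings are assumptions of this example.

```python
def longest_common_substring_length(x: str, y: str) -> int:
    """Length of the longest contiguous substring shared by x and y."""
    best = 0
    previous = [0] * (len(y) + 1)
    for a in x:
        current = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common substring ending at the previous characters.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def substring_similarity(x: str, y: str) -> float:
    """Substring similarity 2|t| / (|x| + |y|), in [0, 1]."""
    if not x and not y:
        return 1.0  # assumption: two empty strings are fully similar
    return 2 * longest_common_substring_length(x, y) / (len(x) + len(y))
```

For "article" and "articles" the longest common substring has length 7, so the similarity is 14/15.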

N-gram similarity

This measure consists of computing the number of common n-grams, i.e., sequences of n characters, between two strings. For instance, the 3-grams for the string "building" are: bui, uil, ild, ldi, din, ing. This measure is specially useful when only some characters are missing, and it penalizes transformations in random characters.

Definition 3. Let ngram(s, n) be the set of substrings of s of length n. The n-gram normalized similarity is a similarity measure σ : S × S → R such that:

$$\sigma(s, t) = \frac{|ngram(s, n) \cap ngram(t, n)|}{\min(|s|, |t|) - n + 1}$$
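
The n-gram similarity of Definition 3 can be sketched as follows. The function names are hypothetical, and returning 0 when one of the strings is shorter than n is an assumption made to avoid a non-positive denominator.

```python
from typing import Set


def ngrams(s: str, n: int) -> Set[str]:
    """Set of substrings of s of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(s: str, t: str, n: int = 3) -> float:
    """Number of shared n-grams divided by min(|s|, |t|) - n + 1."""
    denominator = min(len(s), len(t)) - n + 1
    if denominator <= 0:
        return 0.0  # assumption: strings shorter than n share no n-gram
    return len(ngrams(s, n) & ngrams(t, n)) / denominator
```

With n = 3, "building" yields exactly the six 3-grams listed above, and all of them also occur in "buildings", so the similarity between both strings is 1.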

Levenshtein distance

This distance, defined in [55], measures the minimum number of insertions, deletions, and substitutions of characters required to transform one string into the other. It represents the basic edit distance with all costs equal to 1. The Needleman-Wunsch distance [62] is another modification of the edit distance, with a higher cost for insertion and deletion of characters.
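
The Levenshtein distance is usually computed with dynamic programming. The sketch below keeps only two rows of the table; the function name is illustrative, and all three operations have cost 1 as described above.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions from s to t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]  # transforming s[:i] into the empty string needs i deletions
        for j, b in enumerate(t, start=1):
            substitution_cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,                      # deletion of a
                current[j - 1] + 1,                   # insertion of b
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

A normalized variant in [0, 1] can be obtained by dividing the result by max(|s|, |t|), which makes it comparable with the other measures of this section.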

Jaro measure

The Jaro [47] measure allows strings with similar spelling mistakes to be matched. It does not follow the edit distance model, but it is based on the number and proximity of the common characters between two strings. This measure cannot be considered a similarity because it is not symmetric. Its formal definition is quite complex compared with the other measures:

Definition 4. The Jaro measure is a non-symmetric measure σ : S × S → [0, 1] such that

$$\sigma(s, t) = \frac{1}{3} \times \left( \frac{|com(s, t)|}{|s|} + \frac{|com(t, s)|}{|t|} + \frac{|com(s, t)| - |transp(s, t)|}{|com(s, t)|} \right)$$

where

$$s[i] \in com(s, t) \iff \exists j \in \left[\, i - \frac{\min(|s|, |t|)}{2},\; i + \frac{\min(|s|, |t|)}{2} \,\right] \text{ such that } s[i] = t[j]$$

and transp(s, t) represents the elements of com(s, t) which occur in a different order in s and t.


Note that the function com(s, t) only considers as common those characters that fall in nearby positions. Later, the measure was improved by the Jaro-Winkler [73] measure, which favors matches between strings with longer common prefixes. This provides a more realistic model of mistakes that penalizes the comparisons less.
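
For illustration, the sketch below implements the Jaro measure in the formulation most commonly found in string matching libraries, which differs in some details from the definition above: the matching window is ⌊max(|s|, |t|)/2⌋ − 1, each character of t is matched at most once, and the number of transpositions is half the number of common characters that appear in a different order. The function name is an assumption of this example.

```python
def jaro(s: str, t: str) -> float:
    """Jaro measure in its commonly used formulation, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Common characters must lie within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        low = max(0, i - window)
        high = min(len(t), i + window + 1)
        for j in range(low, high):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare the common characters in the order they appear in each string.
    s_common = [c for c, m in zip(s, s_matched) if m]
    t_common = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3
```

With the classical example pair "MARTHA" and "MARHTA", all six characters are common and one transposition (T and H) is found, giving (1 + 1 + 5/6) / 3 ≈ 0.944.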

Smoa measure

Another edit distance measure is Smoa [69]. It is a distance specialized for matching ontology identifiers, adapted to the way computer users define identifiers. It is based on the lengths of common and non-common substrings, where the second part is subtracted from the first one (commonality - dissimilarity). In the original definition, this measure takes a value in [−1, 1], which can be adapted to [0, 1]. The improvement defined by Winkler for the Jaro measure can also be applied to it.

2.2.5 Linguistic resources techniques

Since the concepts represented in the ontologies are frequently collected in a thesaurus database (or some kind of external resource), it is feasible to consider the matching problem as the search for linguistic relations between words of a natural language. These linguistic resources, such as lexicons or domain-specific thesauri, are used for matching two words based on linguistic relations between them, e.g., synonyms, hypernyms/hyponyms (superconcept/subconcept), meronyms (part-of relations), etc.

The most widely used external resource for matching is WordNet [59]. WordNet is the most famous lexical database for English (it has been adapted to other languages), based on the notion of synsets (sets of synonyms). A synset denotes a concept or a sense in a group of terms. It also provides textual descriptions of the concepts (glosses) containing definitions and examples. Three families of methods can be distinguished based on which criterion is used for measuring similarity: 1) terms belonging to the same synset; 2) hypernym structure between synsets of the terms; 3) definitions of concepts between the synsets associated with two terms.

Next, we describe the similarity measures based on WordNet relations between terms which are evaluated in our research.

Synonym similarity

Simple measures can be defined by considering only synonyms because they are the basis of WordNet synsets (but other relationships can be used as well). The synonym similarity is the simplest measure based on synonyms, and it is defined as follows.

Definition 5. The synonym similarity is a similarity measure σ : S × S → [0, 1] between two terms s and t, using a synonym resource Σ, such that:

$$\sigma(s, t) = \begin{cases} 1 & \text{if } \Sigma(s) \cap \Sigma(t) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

A simple variation of this similarity measure is the basic synonym distance, which replaces zero values with the result of applying a basic string-based distance between the two terms.


Cosynonym similarity

This is a more refined measure which indicates how far apart non-synonymous objects are. Since synonymy is a relation between two terms, all the measures on the graph of relations can be applied to synonyms. Thus, this measure is defined in the following terms:

Definition 6. The cosynonymy is a similarity σ : S × S → [0, 1] between terms s and t based on a synonym resource Σ such that:

$$\sigma(s, t) = \frac{|\Sigma(s) \cap \Sigma(t)|}{|\Sigma(s) \cup \Sigma(t)|}$$
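
Both the synonym similarity and the cosynonymy can be sketched over a toy synonym resource. The dictionary below is invented for illustration and does not reproduce WordNet synsets; in practice Σ would be backed by WordNet or a similar lexical database. Treating an unknown term as its own only synonym is also an assumption of this example.

```python
from typing import Dict, Set

# Hypothetical synonym resource: each term maps to its set of synonyms.
SYNONYMS: Dict[str, Set[str]] = {
    "paper": {"paper", "article"},
    "article": {"article", "paper", "clause"},
    "building": {"building", "edifice"},
}


def synonyms(term: str, resource: Dict[str, Set[str]]) -> Set[str]:
    """Synonyms of a term; an unknown term is only a synonym of itself."""
    return resource.get(term, {term})


def synonym_similarity(s: str, t: str, resource: Dict[str, Set[str]]) -> float:
    """1 if the two terms share at least one synonym, 0 otherwise."""
    return 1.0 if synonyms(s, resource) & synonyms(t, resource) else 0.0


def cosynonymy(s: str, t: str, resource: Dict[str, Set[str]]) -> float:
    """Shared synonyms divided by all synonyms of both terms."""
    a, b = synonyms(s, resource), synonyms(t, resource)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With this toy resource, "paper" and "article" have synonym similarity 1 and cosynonymy 2/3, while "paper" and "building" score 0 on both measures.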

Basic gloss overlap

Another way of comparing two terms is to use the definition (gloss) of the terms given by WordNet. A dictionary entry s ∈ Σ is identified by the set of words corresponding to the gloss λ(s). Then string-based measures can be used for comparing the strings.

Definition 7. The gloss overlap σ : S × S → [0, 1] between terms s and t using the synonym resource Σ is defined by the similarity between their glosses, such as:

$$\sigma(s, t) = \frac{|\lambda(s) \cap \lambda(t)|}{|\lambda(s) \cup \lambda(t)|}$$
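
The gloss overlap can be sketched in the same way. The glosses below are made up for this example and are not the actual WordNet definitions; the tokenization (lowercase alphabetic words, no stop-word removal or stemming) is also a simplifying assumption.

```python
import re
from typing import Dict, Set

# Hypothetical glosses; real ones would come from WordNet.
GLOSSES: Dict[str, str] = {
    "paper": "a scholarly article describing the results of research",
    "article": "a piece of nonfictional prose forming part of a publication",
}


def gloss_words(term: str, glosses: Dict[str, str]) -> Set[str]:
    """Set of lowercase words in the gloss of a term (empty if unknown)."""
    return set(re.findall(r"[a-z]+", glosses.get(term, "").lower()))


def gloss_overlap(s: str, t: str, glosses: Dict[str, str]) -> float:
    """Shared gloss words divided by all gloss words of both terms."""
    a, b = gloss_words(s, glosses), gloss_words(t, glosses)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In this toy example the two glosses only share the words "a" and "of" out of 14 distinct words, which shows why stop words are usually removed before computing the overlap.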

Wu-Palmer similarity

There exist other measures which consider the fact that terms can be part of several synsets to measure the distance in the hyponym/hypernym hierarchy between synsets. For example, the simplest measure (called edge-count) counts the number of edges separating two synsets in Σ (also called the structural topological dissimilarity on hierarchies). More elaborate measures of this kind weight the edge count with the position of synsets in the hierarchy, such as the one proposed by Wu and Palmer in [74].

This distance is based on the following assumption: two classes near the root of a hierarchy are closer in terms of edges but they can be very different conceptually, while two classes under one of them which are separated by a larger number of edges should be closer conceptually.

Definition 8. The Wu-Palmer similarity σ : o × o → R is a similarity over a hierarchy H = 〈o, ≤〉, such that:

$$\sigma(c, c') = \frac{2 \times \lambda(c \wedge c', \rho)}{\lambda(c, c \wedge c') + \lambda(c', c \wedge c') + 2 \times \lambda(c \wedge c', \rho)}$$

where ρ represents the root class of the hierarchy, λ(c, c′) is the number of intermediate edges between classes c and c′, and c ∧ c′ = {c′′ ∈ o; c ≤ c′′ ∧ c′ ≤ c′′} is the set of common hypernyms of both classes.
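
The following sketch computes the Wu-Palmer similarity over a small tree-shaped hierarchy. The hierarchy is invented for illustration, each class is assumed to have a single parent, and c ∧ c′ is taken to be the deepest common hypernym, which is the usual reading of the formula.

```python
from typing import Dict, List

# Hypothetical hierarchy given as child -> parent; "Thing" is the root.
PARENT: Dict[str, str] = {
    "Document": "Thing",
    "Person": "Thing",
    "Paper": "Document",
    "Article": "Document",
}


def path_to_root(c: str, parent: Dict[str, str]) -> List[str]:
    """Classes from c up to the root, c included."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path


def wu_palmer(c1: str, c2: str, parent: Dict[str, str]) -> float:
    """Wu-Palmer similarity using the deepest common hypernym."""
    path2 = path_to_root(c2, parent)
    edges_from_c2 = {node: i for i, node in enumerate(path2)}
    for edges_from_c1, node in enumerate(path_to_root(c1, parent)):
        if node in edges_from_c2:
            # node is the deepest common hypernym; count its edges to the root.
            depth = len(path_to_root(node, parent)) - 1
            denominator = edges_from_c1 + edges_from_c2[node] + 2 * depth
            return 1.0 if denominator == 0 else 2 * depth / denominator
    raise ValueError("the classes have no common hypernym")
```

In this hierarchy, "Paper" and "Article" meet at "Document" (depth 1), giving 2 / (1 + 1 + 2) = 0.5, whereas "Paper" and "Person" only meet at the root and obtain 0.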


2.2.6 Matching systems

Sections 2.2.4 and 2.2.5 describe elementary techniques for ontology matching categorized according to the taxonomy introduced in figure 2.4. In practice, it is quite rare for these techniques to be used alone for matching. Instead, the combination of elementary techniques following different strategies (e.g. combination, composition, aggregation) improves the performance of the matching process and the robustness of the matching system. Thus, the basic techniques introduced above represent the building blocks for developing entire matching systems. The alignment is usually obtained from the similarity computed for each pair of ontology entities. This combination of techniques raises other relevant problems, treated in [29, Chapter 5]. Some examples are:

• the aggregation and combination (matcher composition) of similarity measures computed by the elementary techniques;

• the automatic learning from data of the best method and the best parameters for matching;

• the use of probabilistic methods to combine matchers or to derive missing correspondences;

• the generation of the alignments from the resulting similarity. It is even possible to generate different alignments from the same similarity measures.
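
The first and last items of this list can be illustrated with a minimal sketch: several elementary matchers are aggregated through a weighted average, and an alignment is then extracted with a threshold. The two matchers, the weights, the threshold and the entity names are all assumptions of this example; real systems use more elaborate aggregation and extraction strategies.

```python
from typing import Callable, List, Tuple

Matcher = Callable[[str, str], float]


def equality(s: str, t: str) -> float:
    """1 if the lowercase labels are identical, 0 otherwise."""
    return 1.0 if s.lower() == t.lower() else 0.0


def prefix_similarity(s: str, t: str) -> float:
    """Length of the common prefix divided by the longest label length."""
    s, t = s.lower(), t.lower()
    common = 0
    for a, b in zip(s, t):
        if a != b:
            break
        common += 1
    longest = max(len(s), len(t))
    return common / longest if longest else 1.0


def aggregate(s: str, t: str, matchers: List[Tuple[Matcher, float]]) -> float:
    """Weighted average of the similarities returned by each matcher."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * m(s, t) for m, weight in matchers) / total_weight


def extract_alignment(entities1: List[str], entities2: List[str],
                      matchers: List[Tuple[Matcher, float]],
                      threshold: float) -> List[Tuple[str, str, float]]:
    """Keep every pair whose aggregated similarity reaches the threshold."""
    scored = ((e1, e2, aggregate(e1, e2, matchers))
              for e1 in entities1 for e2 in entities2)
    return [(e1, e2, score) for e1, e2, score in scored if score >= threshold]


matchers = [(equality, 1.0), (prefix_similarity, 1.0)]
alignment = extract_alignment(["Course", "Teacher"], ["Courses", "Room"],
                              matchers, threshold=0.4)
```

Changing the threshold or the weights produces a different alignment from the same similarity values, which is the point made by the last item of the list.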

Although several matching systems have been reviewed, due to the limited time available for this work we have selected only the most relevant ones according to the results published in the literature. The next paragraphs describe them in more detail. Further information regarding each technique and references to publications are provided in [29].

Semantic Matching (S-Match)

Semantic Matching 1 is a specific ontology matching technique that relies on semantic information encoded in lightweight ontologies (graph-like structures, like classifications, database or XML schemas and ontologies) to identify semantically related nodes. For example, a term labeled "paper" is semantically equivalent to another term labeled "article" because they are synonyms in English according to some linguistic resource or oracle like WordNet [59].

S-Match can be considered as a semantic matching operator in which the correspondences are computed in two steps: (i) translating the entities of the ontologies into formal propositional formulas using an artificial and unambiguous language, and (ii) reducing the matching problem to a propositional validity problem. Formulas make it possible to represent the concept descriptions as they are encoded in the ontology structure and in external resources. Thus, the matching problem is translated into a propositional validity problem, which takes advantage of state-of-the-art propositional satisfiability solvers such as SAT4J [54]. Finally, the output of S-Match is an alignment in which the following relations are considered: disjointness, equivalence, more specific and less specific.

1 http://semanticmatching.org/semantic-matching.html

Figure 2.5: Complete semantic matching of two course catalogs

Figure 2.6: Minimal mapping between two course catalogs

Let us consider the example of figure 2.5 to illustrate the operation of S-Match. If we focused only on the node College of Arts and Sciences under the node labeled Courses, we would say that the meaning of the node College of Arts and Sciences is in fact Courses of College of Arts and Sciences, since it inherits the semantics of its parent. Thus, this is translated into the logical formula Courses AND College of Arts and Sciences.
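
The idea that a node inherits the semantics of its ancestors can be sketched as follows. This is a deliberately simplified illustration and not the S-Match implementation: labels are kept as plain strings instead of being translated into propositional formulas over word senses, and the tree fragment and function names are assumptions of this example.

```python
from typing import Dict, List

# Hypothetical fragment of a course catalogue, given as child -> parent.
PARENT: Dict[str, str] = {
    "College of Arts and Sciences": "Courses",
}


def path_from_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Labels from the root down to the given node."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))


def node_formula(node: str, parent: Dict[str, str]) -> str:
    """Conjunction of the labels found on the path from the root to the node."""
    return " AND ".join(path_from_root(node, parent))


formula = node_formula("College of Arts and Sciences", PARENT)
```

For the node discussed above, the resulting string is "Courses AND College of Arts and Sciences", matching the formula given in the text.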

Figure 2.5 shows all possible correspondences between concepts returned by the S-Match algorithm when it is executed without filtering the results using a minimal mapping filter. The minimal mapping resulting from the above semantic matching is shown in figure 2.6. It collapses the links, returning only the most important mappings, those not inferred from other mappings. The set of mappings is drastically reduced, providing clear usability advantages. It is more human-readable, gives a better visualization in graphical interfaces and corresponds to what a person would expect to see as the result of the semantic matcher. Furthermore, its maintenance is much easier, faster and less error prone. For a formal definition of minimal and redundant mappings, a proof of their existence and uniqueness, and an algorithm for computing them, see the work done in [32] by Giunchiglia et al.

In addition to a matching operator or algorithm, S-Match is also the name of the semantic matching framework publicly available as Open Source Software 2. Its modular design simplifies the use of different semantic matching algorithms; furthermore, it provides a framework for developing new algorithms. For this, S-Match provides the core for computing semantic relations by only customizing and connecting single components. Its main architecture and basic idea were initially explained in [35]. In the first version [36], the system was basically a re-implementation of CtxMatch [13]. Despite its later evolution, which includes more element-level and structure-level matchers, alignment explanation and iterative semantic matching, S-Match is limited to tree-like structures without considering properties or roles.

The behavior of S-Match is the following: it takes two graph-like structures (several formats are accepted, but both inputs must use the same one) and returns the logical relations, in terms of equivalence and subsumption, between their entities. The input ontologies are preprocessed by a module with the help of oracles (e.g., WordNet or UMLS2) which provide external lexical and domain knowledge. Internally, it combines several element-level matchers in parallel. As output of the matching process, an enriched tree is obtained and stored in an internal database where it can be browsed, edited and manipulated.

Currently, the S-Match libraries contain around 20 basic element-level matchers in three categories: string-based (e.g., n-gram, edit distance), WordNet sense-based, and WordNet gloss-based matchers. Structure-level matchers include SAT solvers (e.g., SAT4J) and ad-hoc reasoning methods. Its main advantage is that it is still an active project, since the number of relevant contributions keeps growing.

Structure Preserving Semantic Matching

Structure Preserving Semantic Matching (SPSM 3) is a matching technique which compares two tree-like structures and returns a similarity score between both trees together with a set of correspondences between their nodes. These correspondences present two main features: the cardinality of the relation between nodes is one-to-one, and leaf nodes are matched only against leaf nodes, in the same way that internal nodes are matched only against internal nodes. Figure 2.7 shows the set of correspondences between the two university course catalogs of figure 2.2 when they are computed using the SPSM algorithm.

Although its main application is the comparison of function definitions, where each function name is an internal node and each parameter is represented as a leaf node, it can be applied to any kind of tree-like structure. The structural properties of SPSM guarantee that an internal (respectively, leaf) node of the first tree is mapped to only one internal (respectively, leaf) node of the second tree. In addition, one-to-one correspondences make it possible to generate adapters which translate calls to one function into calls to the other on-line, making both interoperable.

SPSM was intended to match functions representing different Web Services, where the parameters are the leaves of the tree and the name of the function is an internal node. For example, let us consider the two Web Services represented in figure 2.8, which can be written as get_wine(Region, Country, Color, Price, Number_of_bottles) and get_wine(Region(Country, Area), Colour, Cost, Year, Quantity). However, this approach still requires a threshold, established through empirical methods, which guarantees that Web Services are only integrated when they are similar enough according to the global similarity measure given by SPSM.

2 http://semanticmatching.org
3 http://semanticmatching.org/spsm.html

Figure 2.7: Two course catalogs example aligned by SPSM algorithm. Note how the set of mappings preserve one-to-one, leaf-to-leaf and internal-internal constraints

Figure 2.8: Matching of two Web Services using SPSM (functions are in rectangles)
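The structural constraints that SPSM imposes on its output (one-to-one, leaf-to-leaf and internal-to-internal) can be checked with a short Python sketch. It does not implement SPSM itself: the two trees encode the get_wine signatures as child-to-parent dictionaries, and the candidate correspondences are hypothetical pairs written by hand for illustration, not the output of the real algorithm.

```python
def split_nodes(tree):
    """Split the nodes of a child -> parent tree into (leaves, internal nodes)."""
    internal = {p for p in tree.values() if p is not None}
    return set(tree) - internal, internal


def respects_spsm_constraints(pairs, tree1, tree2):
    """Check the one-to-one, leaf-to-leaf and internal-to-internal constraints."""
    leaves1, internal1 = split_nodes(tree1)
    leaves2, internal2 = split_nodes(tree2)
    sources = [a for a, _ in pairs]
    targets = [b for _, b in pairs]
    one_to_one = len(set(sources)) == len(sources) and len(set(targets)) == len(targets)
    same_kind = all(
        (a in leaves1 and b in leaves2) or (a in internal1 and b in internal2)
        for a, b in pairs
    )
    return one_to_one and same_kind


# get_wine(Region, Country, Color, Price, Number_of_bottles)
tree1 = {"get_wine": None, "Region": "get_wine", "Country": "get_wine",
         "Color": "get_wine", "Price": "get_wine", "Number_of_bottles": "get_wine"}
# get_wine(Region(Country, Area), Colour, Cost, Year, Quantity)
tree2 = {"get_wine": None, "Region": "get_wine", "Country": "Region", "Area": "Region",
         "Colour": "get_wine", "Cost": "get_wine", "Year": "get_wine", "Quantity": "get_wine"}

# Hypothetical candidate correspondences (illustration only).
pairs = [("get_wine", "get_wine"), ("Country", "Country"), ("Color", "Colour"),
         ("Price", "Cost"), ("Number_of_bottles", "Quantity")]
print(respects_spsm_constraints(pairs, tree1, tree2))
```

Adding the pair ("Region", "Region") makes the check fail, because Region is a leaf in the first signature and an internal node in the second.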

The SPSM approach was introduced by Giunchiglia et al. in [34] and divides the matching in two steps: (i) node matching and (ii) tree matching. Node matching uses the S-Match approach introduced in the previous section to compute the similarity between nodes by considering labels and contextual information in the domain of interest. Tree matching, in turn, uses the results of the node matching step and the structure of the trees to obtain an approximate matching (in open and real environments exact matching is almost impossible). The SPSM algorithm incorporates ideas from previous works in fields related to tree matching, such as the theory of abstraction [37] (a categorization of various kinds of abstraction operations for estimating the similarity between two tree structures) and the tree-edit distance [8]. Following both theories, two trees T1 and T2 approximately match if there is at least one node n1i in T1 and one node n2j in T2 such that: (i) n1i approximately matches n2j, and (ii) all ancestors of n1i are approximately matched to the ancestors of n2j. Note that the order of siblings is not preserved by the matching, since this would be a limitation for functions where the only change is the order of the parameters in the declaration.

Bootstrapping Matching

BLOOMS [45] introduces a new bootstrapping approach which exploits the Wikipedia category hierarchy for aligning ontologies. In short, it constructs a forest (a set of trees) TA for each matching candidate A, which roughly corresponds to a selection of super-categories of the class name. Next, the forests TA and TB are compared to determine the kind of relation between the concepts A and B. The operation of BLOOMS is now explained in more detail. The input of BLOOMS consists of two ontologies, over which it performs the following steps:

1. Restrictions, individuals, and properties of the ontologies are removed. Composite class names are tokenized in order to normalize the class label into a list of words. This includes the replacement of underscores and hyphens by spaces, splitting at capital letters, and the removal of several types of stop words.

2. A forest is built for each class name, using information from Wikipedia. The system calls the Wikipedia Web service using the words extracted in step 1 as input. The service returns the set of Wikipedia pages resulting from the search. Next, each disambiguation page is replaced by all the Wikipedia pages mentioned in it. Then, a new tree is built for each element of the resulting set as follows:

• The root of the tree is the element.

• Its children are exactly all its categories according to Wikipedia.

• Each node in the tree corresponding to a subcategory has all its Wikipedia categories as children.

• The resulting tree is cut at level 4. Deeper trees include irrelevant categories because they are too general.

3. The forests are compared in order to decide which of them are aligned. To achieve this goal, a function is defined which assigns a real number in the unit interval to each (ordered) pair of trees. This function is explained later.

4. The obtained results are post-processed with the help of the Alignment API [28] to find alignments between the input ontologies. The mappings with a confidence value greater than 0.95 are kept and added to the results. Then, the reasoning capabilities of Jena are used to compute the transitive closure of the alignments, which is returned in the Alignment API format.
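Two of the steps above are simple enough to sketch in Python: the normalization of class names (step 1) and the confidence filter (step 4). This is an illustrative sketch, not the BLOOMS code. The function names, the tiny stop-word list and the triple layout of a mapping (entity, entity, confidence) are assumptions made here, and no call to Wikipedia, the Alignment API or Jena is involved.

```python
import re

# Tiny illustrative stop-word list; the list actually used by BLOOMS is not given here.
STOP_WORDS = {"of", "the", "and", "a", "an"}


def normalize_class_name(name):
    """Turn a composite class name into a space-separated list of words."""
    spaced = re.sub(r"[_\-]+", " ", name)                    # underscores/hyphens -> spaces
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)  # split at capital letters
    words = [w for w in spaced.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)


def keep_confident(mappings, threshold=0.95):
    """Keep only the mappings whose confidence is greater than the threshold."""
    return [m for m in mappings if m[2] > threshold]


print(normalize_class_name("JazzFestival"))
print(keep_confident([("A", "B", 0.97), ("C", "D", 0.90)]))
```

The regular expression only splits where a lower-case letter or digit is followed by a capital, so acronyms written in capitals are left untouched.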

The value of overlap(Ts, Tt) is defined as follows. First, all nodes of Ts for which there is a parent node which occurs in Tt are removed, since they do not provide any additional information. All leaves of the resulting tree T′s are either of level 4 or occur in Tt. Then overlap(Ts, Tt) = n / (k − 1), where n is the number of common nodes between T′s and Tt, and k is the total number of nodes in T′s without the root. Finally, the decision on an alignment is made as follows: if there exists a pair of trees for the two concepts which can be considered equal, then the concepts are equivalent; if the lowest overlap between two trees for the two concepts is greater than a predefined threshold, then the concept whose trees present the lowest overlap value is subsumed by the other.
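The overlap computation can be sketched in Python as follows. The sketch rests on three assumptions which are not stated in the text: the pruning removes every node lying below a node that occurs in Tt (so that such nodes become leaves), k counts every node of the pruned tree so that k − 1 leaves the root out, and n counts common nodes other than the root. The tree encoding (child-to-parent dictionary) and the function names are hypothetical.

```python
def prune(parent, target_nodes):
    """Drop the nodes of the source tree lying below a node that occurs in the target tree."""
    kept = {}
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None and ancestor not in target_nodes:
            ancestor = parent[ancestor]
        if ancestor is None:  # no ancestor of this node occurs in the target tree
            kept[node] = parent[node]
    return kept


def overlap(parent, target_nodes):
    """overlap(Ts, Tt) = n / (k - 1) on the pruned source tree."""
    pruned = prune(parent, target_nodes)
    k = len(pruned)  # assumption: k counts the root, hence the k - 1 below
    n = sum(1 for node, p in pruned.items() if p is not None and node in target_nodes)
    return n / (k - 1) if k > 1 else 0.0  # assumption: a root-only tree has overlap 0


# Toy source tree (hypothetical labels): A is the root.
source = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "C"}
target_nodes = {"B", "F", "X"}
print(sorted(prune(source, target_nodes)), overlap(source, target_nodes))
```

In the toy example D and E are removed because they lie below B, which occurs in the target tree; the pruned tree has four nodes, two of which (B and F) are common, so the overlap is 2/3.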

Now, the process is illustrated with an example using the class names "Event" and "JazzFestival", taken from the LOD datasets DBpedia and Music Ontology respectively.

1. "JazzFestival" is transformed into "Jazz Festival", whereas "Event" is not modified at all.

2. The search for "Event" in Wikipedia returns "Event", "Eventing", "Sport", "NFL Draft", "News", "Festival", "Event-driven programming", "Rodeo", "Athletics at the Summer Olympics", and "Extinction event". Figure 2.9 represents the generated trees.

3. The values of the overlap function are the following:

overlap(TEvent, TJazzFestival) = 3/4, overlap(TJazzFestival, TEvent) = 5/5

From the comparison of both values, "Event" subsumes "JazzFestival".

4. The Alignment API determines no correspondence between "JazzFestival" and "Event", so the transitive closure is computed and the output is given in the Alignment API format.

One of the strengths of BLOOMS is that it computes the alignment using noisy community-generated data available on the Web. Although it currently uses the Wikipedia category hierarchy, it would be technically feasible to use other inputs, such as existing upper-level ontologies or thesauri. Some of the alternatives would be the following: ontologies such as Cyc or SUMO [57]; thesauri such as WordNet [59]; taxonomies created from Wikipedia, such as the one reported in [64]; or efforts like the Open Directory Project 4 or YAGO [70]. Such a change only introduces some bias in the alignment, which can be exploited to address the problem in specific thematic domains. However, because the queries are executed on-line, the method performs poorly and is not suitable for applications with big ontologies or real-time requirements. The authors of BLOOMS justify the use of Wikipedia in the following terms:

• It offers high thematic coverage.

• It is community built and maintained, so it is permanently updated.

• It provides a search Web Service which simplifies the creation of forests.

4 http://www.dmoz.org

Figure 2.9: BLOOMS trees for categories from Wikipedia until level 4 [45]

Association Rule Ontology Matching Approach

Association Rule Ontology Matching Approach (AROMA) [21] is a hybrid, extensional and asymmetric matching method which has been designed to find relations (equivalence and subsumption) between entities from two OWL ontologies. AROMA is based on the association rule paradigm, a well-known model for Knowledge Discovery in Databases which is both asymmetric and extensional. It selects the relevant terms contained in the ontologies to discover equivalence and subsumption relations between concepts and properties, which are modeled as rules. The essential innovation relies on the implication intensity measure, a probabilistic model of deviation from independence, guided by two criteria which assess the implication quality and the generativity of the rule. Although this method was not originally intended for dealing with ontologies, it was adapted to OWL ontology matching.

The AROMA method consists of two parts: (1) the acquisition and selection of relevant terms for each concept; (2) the discovery of significant implications between the two hierarchies.

The first stage generates a set of relevant terms for each concept in a hierarchy. These terms are extracted from the documents indexed under the concepts and selected by evaluating association rules t → c, each of which states that "if a document includes the term t, then this document is associated with the concept c".

In the second stage, implicative matching relations between concepts are discovered by evaluating association rules between their respective sets of relevant terms. The algorithm takes the preprocessed hierarchies as input and considers only the terms shared by the two structures. It performs a top-down search of association rules and uses two criteria to select significant rules. A rule a → b (between the concepts a ∈ C1 and b ∈ C2) is significant if it satisfies the two following criteria:

• The implication intensity of the rule a → b, evaluated against a given threshold. It is computed according to the expected number of relevant terms for concept a which are not relevant for concept b.

• The generativity of the rule, which reduces redundancy in the mined rule set. A valid rule is deleted if there exists a more generative rule whose implication intensity is greater than or equal to its own. A rule x → y is more generative than a rule u → v if u ≤ x ∧ y ≤ v (when they are not the same). For example, the rule vehicle → transportObject can be more generative than car → auto.
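The generativity criterion can be illustrated with a small Python sketch. It is not the AROMA implementation: it assumes that ≤ means "is the same concept as, or a descendant of" within each hierarchy, it takes the implication intensities as given numbers (the values below are invented for illustration), and the hierarchies and concept names are hypothetical.

```python
def is_below(x, y, parent):
    """True if concept x equals y or is a descendant of y in a child -> parent hierarchy."""
    while x is not None:
        if x == y:
            return True
        x = parent.get(x)
    return False


def more_generative(rule_xy, rule_uv, h1, h2):
    """x -> y is more generative than u -> v if u <= x and y <= v (and the rules differ)."""
    (x, y), (u, v) = rule_xy, rule_uv
    return rule_xy != rule_uv and is_below(u, x, h1) and is_below(y, v, h2)


def prune_rules(rules, h1, h2):
    """Delete each rule dominated by a more generative rule of at least equal intensity."""
    return {
        rule: intensity
        for rule, intensity in rules.items()
        if not any(
            more_generative(other, rule, h1, h2) and other_intensity >= intensity
            for other, other_intensity in rules.items()
        )
    }


# Hypothetical hierarchies (child -> parent) and rule intensities.
h1 = {"car": None, "sedan": "car"}
h2 = {"transport_object": None, "automobile": "transport_object"}
rules = {("car", "automobile"): 0.9, ("sedan", "transport_object"): 0.8}
print(prune_rules(rules, h1, h2))
```

Here the rule car → automobile makes sedan → transport_object redundant, since sedan ≤ car and automobile ≤ transport_object, and its intensity is not lower.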

Further details about the mathematical theory behind this method can be found in the work of David et al. [21].

Ontology Mapping by Particle Swarm Optimisation (MapPSO)

MapPSO [11] is an ontology matching system and an algorithm based on discrete particle swarm optimization. It was developed for the purpose of aligning large ontologies, motivated by the fact that ontologies and schema information such as thesauri or dictionaries are not only becoming more numerous on the web, but are also becoming increasingly large. The algorithm is highly scalable on the new parallel architectures.

MapPSO translates the ontology alignment problem into an optimization problem, which allows the application of a discrete variant of particle swarm optimization [18, 50], a population-based optimization paradigm inspired by the social interaction between swarming animals. This method provides some interesting benefits for ontology matching tasks. The most relevant are: 1) the population-based structure provides high scalability on parallel systems; 2) the method belongs to the group of algorithms which can be interrupted at any time and still provide the best answer available so far (which is interesting when an alignment application is subject to time constraints).

The key of the optimization is the objective function, which supplies a fitness value for each candidate alignment. MapPSO defines a set of particles, where each particle is a candidate alignment comprising a set of initially random one-to-one mappings. Each particle remembers the best mappings it has previously found (personal best), and the swarm maintains the best known alignment (global best). In each iteration, the correspondences of each particle are updated in a guided random way. Correspondences which are in both the global best set and the personal best set are more likely to be kept, since they have a very good evaluation. In contrast, the worst correspondences are more likely to be replaced with other correspondences, which are randomly created or recommended by the best alignments (personal best and global best). The fitness of each candidate alignment is computed as the sum of the quality measures of its correspondences.

The quality score of a correspondence is calculated by aggregating, with a weighted average operator (a kind of matching strategy), the scores of a configurable set of base matchers which provide distance / similarity measures. MapPSO has the following base matchers available:

• SMOA string distance for entity names / labels

• WordNet distance for entity names / labels

• Vector space similarity for entity comments

• Hierarchy distance to propagate similarity of super / subclasses and super / sub-properties

• Structural similarity of classes / properties based on properties such as domain or range classes

• Similarity of classes from individuals that are instances of them

• Similarity of properties derived from individuals that are subjects or objects of them

• Similarity of individuals derived from property assertions, such as the values of data properties or the object / subject (individuals) of object properties where the individual is asserted as subject / object
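The aggregation of base matcher scores and the fitness of a candidate alignment can be sketched in Python as follows. This is a minimal illustration, not the MapPSO code: all scores are assumed to be similarities in [0, 1] (a distance would have to be converted first), and the matcher names, weights and entity names are hypothetical.

```python
def correspondence_quality(scores, weights):
    """Weighted average of the base matcher scores of one correspondence."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * score for name, score in scores.items()) / total_weight


def fitness(alignment, weights):
    """Fitness of a candidate alignment: sum of the quality of its correspondences."""
    return sum(correspondence_quality(scores, weights) for scores in alignment.values())


# Hypothetical weights and a candidate alignment (one particle) with two correspondences.
weights = {"smoa": 2.0, "wordnet": 1.0}
alignment = {
    ("a1", "b1"): {"smoa": 0.9, "wordnet": 0.6},
    ("a2", "b2"): {"smoa": 0.3, "wordnet": 0.6},
}
print(fitness(alignment, weights))
```

With these numbers the two correspondences score about 0.8 and 0.4, so the fitness of the particle is about 1.2; a swarm would compare such values to update the personal and global best.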

The initial fixed weight assigned to each base distance is automatically adjusted before the process starts, according to the characteristics of the ontologies. Besides, in the MapPSO implementation the optimization of each particle runs in a separate thread, in which the fitness is computed and the particle is updated. Despite this parallelization, a sequential synchronization is still required after each iteration to determine the global best alignment from the fitness values of the particles.

2.2.7 Alignment representation

Alignments have a high value once they are developed, so a common representation is needed for sharing them during their maintenance and reuse along their entire life cycle. Since the literature lacks a well defined classification of languages and representation formats, we structure the reviewed approaches into three different categories. The following categorization can be considered a minor contribution of this work:

• A common mapping language or ontology, such as the Semantic Bridge Ontology [56], Context OWL [13] or the Alignment format [25] (figure 2.10 shows an example). The Alignment format is used by the Ontology Alignment Evaluation Initiative and supported by the Alignment API [20]. This is also the format we choose for experimentation, due to its wide acceptance in other works in the literature and its compatibility with matching implementations.

• An ontology for knowledge organization and representation which defines relationships between terms, such as PRONTO or SKOS [31]. Even OWL itself is a language for representing mappings between ontologies by means of vocabulary entries such as owl:sameAs.

• A knowledge rule language from artificial intelligence, such as OntoMorph [16].
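A file in the Alignment format (figure 2.10) is plain RDF/XML, so its cells can be read with a generic XML parser. The following Python sketch uses only the standard library and does not rely on the Alignment API; the embedded sample is a shortened copy of one cell of figure 2.10, and the function name read_cells is hypothetical.

```python
import xml.etree.ElementTree as ET

ALIGN = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened sample with a single cell taken from figure 2.10.
SAMPLE = f"""<rdf:RDF xmlns='{ALIGN}' xmlns:rdf='{RDF}'>
<Alignment>
  <map>
    <Cell>
      <entity1 rdf:resource='http://www.opengis.net/citygml/1.0#AbstractCityObject'/>
      <entity2 rdf:resource='http://www.gbxml.org/schema#City'/>
      <relation>=</relation>
      <measure rdf:datatype='http://www.w3.org/2001/XMLSchema#float'>0.6818181818181819</measure>
    </Cell>
  </map>
</Alignment>
</rdf:RDF>"""


def read_cells(xml_text):
    """Return (entity1, entity2, relation, measure) for every Cell element."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.iter(f"{{{ALIGN}}}Cell"):
        entity1 = cell.find(f"{{{ALIGN}}}entity1").get(f"{{{RDF}}}resource")
        entity2 = cell.find(f"{{{ALIGN}}}entity2").get(f"{{{RDF}}}resource")
        relation = cell.find(f"{{{ALIGN}}}relation").text
        measure = float(cell.find(f"{{{ALIGN}}}measure").text)
        cells.append((entity1, entity2, relation, measure))
    return cells


cells = read_cells(SAMPLE)
print(cells)
```

Element names are qualified with the default alignment namespace, while the rdf:resource attribute lives in the RDF namespace, which is why both URIs appear in the lookups.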

<rdf:RDF xmlns='http://knowledgeweb.semanticweb.org/heterogeneity/alignment'
         xmlns:ns0='http://knowledgeweb.semanticweb.org/heterogeneity/alignment'
         xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:xsd='http://www.w3.org/2001/XMLSchema#'
         xmlns:align='http://knowledgeweb.semanticweb.org/heterogeneity/alignment#'>
<Alignment>
  <xml>yes</xml>
  <level>0</level>
  <type>**</type>
  <ns0:method>fr.inrialpes.exmo.align.impl.method.SMOANameAlignment</ns0:method>
  <onto1>
    <Ontology rdf:about="http://www.opengis.net/citygml/1.0">
      <location>http://www.opengis.net/citygml/1.0</location>
      <formalism>
        <Formalism align:name="OWL 2.0" align:uri="http://www.w3.org/2002/07/owl#"/>
      </formalism>
    </Ontology>
  </onto1>
  <onto2>
    <Ontology rdf:about="http://www.gbxml.org/schema">
      <location>http://www.gbxml.org/schema</location>
      <formalism>
        <Formalism align:name="OWL 2.0" align:uri="http://www.w3.org/2002/07/owl#"/>
      </formalism>
    </Ontology>
  </onto2>
  <map>
    <Cell>
      <entity1 rdf:resource='urn:oasis:names:tc:ciq:xsdschema:xAL:2.0#hasAddressDetails'/>
      <entity2 rdf:resource='http://www.gbxml.org/schema#email1Address'/>
      <relation>=</relation>
      <measure rdf:datatype='http://www.w3.org/2001/XMLSchema#float'>0.7753222836095763</measure>
    </Cell>
  </map>
  <map>
    <Cell>
      <entity1 rdf:resource='http://www.opengis.net/citygml/1.0#AbstractCityObject'/>
      <entity2 rdf:resource='http://www.gbxml.org/schema#City'/>
      <relation>=</relation>
      <measure rdf:datatype='http://www.w3.org/2001/XMLSchema#float'>0.6818181818181819</measure>
    </Cell>
  </map>
  ...

Figure 2.10: Sample of RDF file in the Alignment API format

2.2.8 Evaluation

The increasing number of ontology matching techniques raises a common research problem: the systematic comparison between different results. The mechanisms to achieve this goal are the definition of common datasets and evaluation measures (see section 3.3 for more information). The Ontology Alignment Evaluation Initiative (OAEI) 5 is a coordinated international initiative to forge a consensus for evaluating the performance of techniques through controlled experimental evaluation. This initiative organizes a yearly evaluation event and publishes the tests and results of the event for further analysis, in order to achieve the following goals:

• Assessing strength and weakness of matching techniques developed across the world

• Comparing performance of techniques

• Improving and motivating the work on ontology alignment/matching

The methodology adopted in this work has been inspired by the white paper [44] published by the same initiative. In it, the authors propose an entire life cycle for the methodology and provide a list of measures which serves as a guide for the evaluation of alignments against a gold standard.
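As an illustration of what evaluating an alignment against a gold standard involves, the following Python sketch computes precision and recall under their usual set-based definitions. It is not taken from the white paper: the representation of a correspondence as a triple (entity1, entity2, relation) and the sample data are assumptions made here.

```python
def precision_recall(found, reference):
    """Set-based precision and recall of an alignment against a reference alignment."""
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical correspondences: (entity of ontology 1, entity of ontology 2, relation).
reference = [("a", "x", "="), ("b", "y", "="), ("c", "z", "="), ("d", "w", "="), ("e", "v", "=")]
found = [("a", "x", "="), ("b", "y", "="), ("c", "z", "="), ("d", "q", "=")]
print(precision_recall(found, reference))
```

In this toy case three of the four returned correspondences are correct and three of the five reference correspondences are found.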

2.3 Geographic Information Systems

Geographic Information Systems (GIS) are systems which capture, store, analyze, manage, share and display data referred to a geographic location in order to support decision making. In the last decades it has been a field of intense research activity involving cartography, statistical analysis, image processing and database systems. These systems can be applied to archeology, cartography, geography, land surveying, infrastructure management, logistics, navigation, landscape analysis, environmental simulations, photography or agriculture, among others.

Thus, GIS covers a wide area which includes different applications, with several difficulties for achieving interoperability between them. Another concept derived from GIS is that of Spatial Data Infrastructures (SDI), a set of services, standards and specifications which ensure a minimum level of compatibility. SDI technologies are standardized by the Open Geospatial Consortium (OGC), which is composed of several enterprises, universities and public administrations. In the last decade, the OGC has defined some important services, such as the Web Map Service, the Web Feature Service and the Web Processing Service, and some XML-based languages such as GML or CityGML (section 2.3.1).

Despite the powerful set of potential applications of GIS, they can only be applied at metropolitan or wide area scale. Hence, information from other fields, such as Building Information Modeling, improves the classical solutions offered by GIS. Now that the formats include the semantic meaning of the data and not only the geometry, the interoperability between both fields is a problem which can be addressed through ontology matching techniques.

5 http://oaei.ontologymatching.org

2.3.1 CityGML

The Geography Markup Language (GML) is a standard promoted by the OGC to represent the classical geometric primitives and features in a reference system. GML allows application schemas to be defined on top of it, including semantic information or new specific primitives. In 2006, CityGML [52] emerged as an application schema for the semantic modeling of 3D urban objects and infrastructures, in response to the growing demand for 3D GIS. In 2008 it was approved as a standard and included into the SDI by the OGC. Figure 2.11 shows the modular architecture of CityGML, while the building module is expanded in figure 2.12. To get an idea of the kind of information represented by CityGML, see figure 2.13, a screenshot of CityVu 6, a CityGML visualizer developed in Java.

Figure 2.11: CityGML architecture decomposed in modules (from [38])

CityGML supports up to five Levels of Detail (LoD), shown in figure 2.14, although level 0 does not represent information about buildings. Although the different LoD mainly affect the geometry (the physical representation of the object), they also add or remove some semantic relations and concepts. For example, level 1 only includes information about the building, without relations, while level 2 represents boundary surfaces; level 3 additionally includes roofs, doors and windows, and level 4 adds the remaining semantics such as equipment, interior walls, rooms, installations, etc.

In order to adapt the model defined by CityGML to the different needs and requirements of applications, the specification includes an extension mechanism called Application Domain Extension (ADE). It is an extension point of CityGML for merging other ontologies which provide more detailed information about certain concepts in other domains. Currently, several extensions are available 7 for merging information from other domains into CityGML. They include underground infrastructures, utility networks, noise information, hydrological resources and building information, among others. The developed extensions show how significant it is to extend the CityGML model to other domains.

6 http://cityvu.3dgis.it
7 http://www.citygmlwiki.org/index.php/CityGML-ADEs

Figure 2.12: Classes of CityGML building module (from [52])

Figure 2.13: CityGML dataset visualized with CityVu

Figure 2.14: Illustration of the different levels of details for buildings in CityGML

2.3.2 Building Information Modeling

Building Information Modeling (BIM) was established in 2000 as a new paradigm for generating and managing building data along its entire life cycle. It emerged in response to the waste of resources caused by failures and inconsistencies between the different agents involved in the construction process in the USA. Developing this idea required a standard data model with which the agents can share and exchange building information. Besides, the model has to include information not only about the building elements but also about the processes and the agents involved. In order to achieve this goal, the buildingSMART alliance proposed the Industry Foundation Classes (IFC) as a standard, which is equivalent to an ontology for BIM.

Figure 2.15: Subset of IFC classes for building information (from [24])

This discipline is nearer to CAD than to GIS. The interoperability between both worlds has long been a research topic, addressed in works like [63], but it remains a challenge even with the new standards (CityGML and IFC). Obviously, this integration only affects the building module of CityGML, which overlaps with IFC in many concepts and is also the most relevant module. Translating one model into the other presents substantial difficulties, because the two standards were designed with the different requirements of CAD and GIS in mind. For example, the geometry is modeled using different representations, their scale of application is very different, and the target applications range from urban planning (GIS) to energy simulations (CAD). However, there are several applications in which information from both standards is required. Figure 2.16 presents the applications in which information from GIS and BIM improves the performance, structured by the project phases in which both fields are needed. Also, figure 2.17 outlines a possible scenario in which the integration of GIS and BIM would be beneficial. Leaving aside the geometry for representing the basic primitives, the two standards introduce heterogeneity at several levels. Figures 2.12 and 2.15 show the class diagrams focused on building entities, which illustrate different kinds of heterogeneity:

• Terminological, where the same concepts are named differently. For instance, IfcWindow in IFC and Window in CityGML.

• Syntactical, where the taxonomies and relationships between entities differ. For example, IfcDoor and IfcWindow are subclasses of IfcBuildingElement in IFC, while Door and Window are subclasses of Opening in CityGML.

• Semantical, where terms differ in their scope or classification. For example, IfcSlab in IFC is a superclass of GroundSurface, FloorSurface and CeilingSurface in CityGML.

Due to the relevance of this challenge, several commercial solutions are available, such as Autodesk LandXplorer 8 or Safe Software 9. Although these solutions claim that the problem of integration between CAD and GIS is solved, the literature still shows new proposals to tackle it. Most of these approaches only partially solve the main problem: they provide unidirectional conversion (IFC to CityGML) [72], offer discussions about what should be done without a concrete implementation, or integrate IFC into a concrete LoD of CityGML [43].

8 http://www.3dgeo.de/citygml.aspx
9 http://www.safe.com

Figure 2.16: Requirements for GIS and BIM systems applications in the building life cycle (from http://www.opengeospatial.org)

Figure 2.17: An example of scenario where the integration of information from BIM and GIS is needed (from http://cadbim.usace.army.mil)

These limitations lead to the following conclusion: a formal framework for strict semantic and geometry conversion is required for a complete and robust integration of CityGML and IFC. To achieve this goal it is necessary to integrate the geometric models and to harmonize the semantics through formal mapping [24]. The main reason is that very heterogeneous semantics make it difficult to achieve interoperability through direct matching. Table 2.1 shows a comparison between state of the art approaches, whose characterization has been done along this work. It establishes a framework in which we can insert and compare our own approach. The selected criteria and values are the following ones:

• Integration refers to the direction of the proposed translation, which can be unidirectional (usually from IFC to CityGML) or bidirectional.

• Domain extension indicates whether the approach adds new terms from BIM to CityGML or only considers the terms already included in CityGML.

• Realization level is equivalent to the level of maturity of the approach, which can be implemented (a software implementing the approach has been developed), designed (a formal framework is established) or discussed (only theoretical requirements are introduced).

• Level of Detail indicates which LoD of CityGML are integrated with IFC. The values therefore match exactly the LoD of CityGML.

• Ontological denotes whether the approach uses an ontological engineering methodology or, at least, considers the problem as one related to ontologies. Although this is a merely perceptual characteristic, it points out the originality of considering the problem from the ontology matching viewpoint.

• Expert knowledge refers to whether expert assessment has been used for the development of the proposed approach, as is usual in ontological engineering methods.

• Geometry concerns whether the geometry primitives placed below the semantics of the building object are translated or not.

• Precision is the same term as the measure used for the evaluation of results (see section 3.3.1). This feature indicates whether the proposal establishes deterministic correspondences between concepts (value 1) or a confidence measure for each correspondence (a value inside the interval [0, 1]).

Analyzing the summary presented in table 2.1, we can draw some conclusions about the current approaches in the state of the art. One of the first observations is that most of the approaches include bidirectional translation between IFC and CityGML. This implies that there is a perfect equivalence between the terms of CityGML and the terms of IFC (usually after some preprocessing steps). However, the unidirectional approach shown in [72] allows the inclusion of new terms from IFC into CityGML, which enriches the semantics of the building module. At the same time, this represents a domain extension for CityGML developed through the merging of both ontologies.

Except for the approach in [43], the remaining approaches provide an implementation of the proposal applied to real use cases. With regard to the LoD at which the two ontologies are integrated, all of them support LoD 4 integration, which includes details about the internal structure and objects of the building. Only the approaches in [24, 43] allow the integration of IFC data at lower levels of CityGML, such as 2 and 3, which only include details about the exterior structure of the building. Although the approach detailed in [24]


| Criterion         | UBM [24]      | AQI [2]       | GeoBIM [72] | Zlatanova [43] |
|-------------------|---------------|---------------|-------------|----------------|
| Integration       | Bidirectional | Bidirectional | To CityGML  | Bidirectional  |
| Domain extension  | No            | No            | Yes         | No             |
| Realization level | Implemented   | Implemented   | Implemented | Discussed      |
| Level of Detail   | LoD 2-4       | LoD 4         | LoD 4       | LoD 2-4        |
| Ontological       | Yes           | No            | No          | No             |
| Expert knowledge  | Yes           | Yes           | Yes         | Yes            |
| Geometry          | No            | Yes           | No          | Yes            |
| Precision         | 1             | 1             | 1           | 1              |

Table 2.1: Approaches for integrating IFC and CityGML. Comparison of literature approaches for solving interoperability between CityGML and IFC.

is the most complete approach so far, it does not consider details about geometry conversion because these details were covered in previous approaches. The same happens in [72], while the other two approaches cover geometry transformation aspects. We adopt the same position as [24] during the development of our research because geometry conversion falls outside our research scope. One of the similarities between the approach of [24] and the one introduced here is that both address the problem within the ontological framework. However, their approach uses ontological engineering while this work applies ontology matching techniques.

Finally, the most significant characteristics are the use of expert assessment and the precision of the translation. They are the key differences between our approach and the others reviewed in the literature. As table 2.1 shows, all the approaches have the same values of precision and expert knowledge because all of them draw on the expert knowledge that exists in the field. As a logical consequence, the precision of the translated model is 1 due to the use of a reference alignment for merging both ontologies. At the same time, this implies that the merging is never automatic; moreover, it relies on a completely manual development process.

Our approach introduces two major differences with respect to the surveyed approaches: 1) it works automatically, without the assessment of expert users; 2) the correspondences between concepts obtained from ontology matching techniques are probabilistic, instead of the deterministic correspondences established by experts. Obviously, the results could be unacceptable for certain kinds of tasks, such as direct file translation, but they are promising for others, such as the automatic development of application domain extensions or the discovery of relationships in the framework of the Linked Open Data initiative. This contributes to a high degree to the originality of our research.
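The contrast between deterministic and probabilistic correspondences can be illustrated with a minimal Python sketch. The class, the entity names and the confidence values below are hypothetical and serve only as an illustration; they are not taken from any of the surveyed systems nor from the experiments of this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Correspondence:
    """One correspondence between two ontology entities (hypothetical model)."""
    source: str
    target: str
    relation: str      # "=" for equivalence, "<" for subsumption
    confidence: float  # 1.0 when fixed by an expert, in [0, 1] from a matcher


def select(candidates: List[Correspondence], threshold: float) -> List[Correspondence]:
    """Keep only the correspondences whose confidence reaches the threshold."""
    return [c for c in candidates if c.confidence >= threshold]


# Expert-defined correspondence: deterministic, the confidence is always 1.
expert = [Correspondence("citygml:Door", "ifc:IfcDoor", "=", 1.0)]

# Matcher output: every candidate carries its own confidence (made-up values).
automatic = [
    Correspondence("citygml:Door", "ifc:IfcDoor", "=", 0.93),
    Correspondence("citygml:Room", "ifc:IfcSpace", "=", 0.61),
    Correspondence("citygml:Room", "ifc:IfcRoof", "=", 0.35),
]

accepted = select(automatic, threshold=0.6)
print([(c.source, c.target) for c in accepted])
```

Under these assumptions, raising the threshold discards doubtful candidates at the risk of also losing correct ones, which is why automatically obtained correspondences suit exploratory tasks better than direct file translation.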


Chapter 3

Methodology

Summary

This chapter describes the research methodology followed throughout the work, from the definition of its objectives to the final experiments that test the initial hypothesis.

The first step is to state the problem to be solved, which stems from the motivation given in the first chapter. The problem is the automatic integration of information between Geographic Information Systems and other data models represented by ontologies. It is posed as an application of Ontology Matching techniques and algorithms whose feasibility we want to study. To this end we followed the classical engineering method, which consists of studying the problem and the usual solutions to it, detecting their limitations, proposing a new solution and evaluating the results experimentally.

To implement this methodology, a series of more specific tasks was carried out to ensure that the research work met the established objectives within the time initially planned. The first step is to plan the work during the first two weeks, fixing the calendar of meetings with the tutor, the deliverables and the main milestones. The next step is to review the state of the art of the field from bibliographic documentation in order to shape the secondary objectives of the work. This work, which was initially intended to evaluate a collection of techniques and contribute some improvement over the existing ones, was reformulated to add a degree of originality. After this review the objectives were defined as they are written in this document. This stage resulted in a characterization of the current proposals for integrating CityGML and IFC.

Next, the experimental methodology used to test the stated hypothesis was defined. It includes both the metrics for evaluating the quality of the alignments (precision, recall and F-measure, for example) and the selection of the datasets (ontologies such as CityGML, IFC, DBPedia or the LinkedGeoData ontology) on which the alignment techniques are evaluated. The reference alignments to be generated from the ontologies in order to evaluate the quality of the alignment results are also presented. All of them use CityGML, since it is the reference ontology for Geographic Information Systems.

Then the software tools that make up the experimental environment are described, both those for manipulating ontologies and alignments (Protege, NeOn Toolkit and TopBraidComposer) and those for implementing the alignment techniques and algorithms to be evaluated (Alignment API, S-Match, AROMA, BLOOMS and MapPSO). In each case, the features that motivate the selection of these tools rather than other equally valid ones are described. Some tools such as RIBOM, SOBOM or OLA, whose algorithms could not be evaluated for various reasons, are also described.

Finally, the experimentation methodology followed to carry out the experiments, whose results are reported in the next chapter, is described in depth. It consists of obtaining the ontologies, adapting them, generating the reference alignments, applying the different techniques to the ontologies, collecting the resulting alignments and extracting the values of the quality metrics, and finally generating summary charts that ease the analysis and the drawing of the final conclusions (which appear in the last chapter).

3.1 Problem description

The problem of system heterogeneity emerges when two or more systems represent the same knowledge using different representations, either in terms of the language or in terms of the semantics, among others. We focus on the second source of heterogeneity. The traditional way of addressing this problem consists of establishing some kind of mapping rules which translate instances from one representation to another, allowing the reuse of the knowledge. This moves the problem towards the extraction of these mappings, which ideally is done automatically. This is the approach adopted by ontology matching.

This kind of problem is exhibited by GIS and CAD representations, where data encoding the same information are represented using different standards. In this sense, the integration of both fields improves the tasks in which information from both fields is mandatory (see figure 2.16). Besides, the information emerging from the Geospatial Semantic Web (a specific community inside the Semantic Web) would be very useful if it were merged with professional GIS information sources. Currently, the problem is being solved by the manual development of specific mappings which, in terms of the semantic quality of the results, is a good solution but scales poorly (at least in the way required by the exponential growth of the Web). Hence, the main question which motivates the research and raises the goals of this project is whether ontology matching techniques can be applied to solve (or at least to narrow) this problem.

3.2 General methodology

Every research work needs to follow a methodology which, following the scientific method, establishes a hypothesis and decides whether it is valid or not. At the beginning of this work the author had very little knowledge about Ontology Matching. The starting goals of this project were different from what they finally are. The next paragraphs describe the general methodology followed for the whole work instead of focusing only on the experimentation (addressed in the last step). A huge amount of work was devoted to reviewing the literature, classifying current approaches and finding limitations to which we could make some contribution. Thus, experimentation is only the final step after a large amount of previous research work.

The research methodology followed in this work is widely known as the "engineering method" [1, 65]. This methodology enforces four basic steps, which are cyclically repeated during the research:

1. The study of current solutions in the literature, trying to find documented issues, limitations and contributions to solve the target problem in the Semantic Matching research field.

2. The search for a new approach which solves, or at least improves on, the current proposals. In this step a solution that addresses the existing problems in an innovative way has to be found.

3. The development of the solution, which usually implies building some artifact in which the hypothesis is implemented.

4. The evaluation of the performance and compliance of the solution. This step is intended to evaluate the validity of the hypothesis and the quality of the improvements introduced by the new solution.

During the execution of the first step, the first goal of the project (the study, evaluation and classification of different matching techniques) was reformulated to introduce a degree of originality. The main reason was that a lot of this work had already been done in different surveys published four or five years ago in prestigious journals and proceedings. Besides, Shvaiko and Euzenat [68] identify the great effort required for a small improvement as one of the biggest current challenges in the ontology matching field. Taking into account the potential of ontology matching for solving interoperability problems and its relevance for the GIS and CAD fields, the goal of the work was reoriented towards evaluating the suitability of ontology matching to solve interoperability between GIS and CAD. This problem is of special interest for the scientific community involved in Geospatial Information Science, as proposed by the cotutor of the project.

Thus, according to the new goals, the first step also includes the review of current proposals for solving the interoperability problem between GIS and CAD. The second step is addressed thanks to the knowledge acquired in ontology matching, which can be proposed as a possible solution for automatic matching. Then, the third step can be fulfilled by choosing the best of the reviewed techniques, saving valuable time and effort not available for this kind of research work (this is not a PhD thesis). Finally, the fourth step implies comparing the proposed approach based on ontology matching with the existing ones and measuring the performance of the proposed methods. In summary, improving current techniques for ontology matching is a task which requires a huge amount of work; it is more realistic to try to apply this knowledge to solving problems between CAD and GIS in a way that has not been addressed until now.


This is just another example of how the research process is not linear but cyclic, and of how initial assumptions cannot be made without a good understanding of the real situation of the researched field. It also shows the difficulty of finding the goals and a relevant, original and feasible research question that can be answered in a limited amount of time.

We can briefly summarize the methodology in the following high-level steps:

a) the work is scheduled before it starts by setting deadlines and deliverables;
b) the state of the art in Ontology Matching is reviewed by collecting bibliographic resources;
c) the interoperability problem between GIS and CAD is detected and the current approaches are reviewed;
d) the goal of the project is reshaped while the research is going on;
e) the datasets and standards in the field are surveyed, with special regard to GIS, CAD and the Geospatial Web;
f) the experiments are drafted and evaluation measures are established;
g) the reference alignments are created using CityGML as an "umbrella";
h) the matching systems are evaluated on the datasets using different strategies;
i) the results are analyzed to draw the final conclusions, limitations and future work.

The next sections give a more detailed view of the decisions taken and the steps followed during the research.

3.2.1 Scheduling

This must be the first step in almost any job which requires a certain amount of development work. As the name suggests, it consists of scheduling the steps necessary for the successful achievement of the goals of the project. This includes some control actions such as meetings with the tutor every two weeks, the deliverables which have to be partially or totally developed, and the main goal on which each step is focused.

In our case a timetable with a low level of detail was developed at the beginning of the project, when the project was granted by the tutor. All the events described in Section 3.2 appear in this table, which can be seen as a project road map. The main function of this artifact is to keep track of delays and achievements in order to guarantee the delivery of the project before the deadline. Moreover, a brief description of the goals of the project was written, although they were refined several times afterwards, as explained in the section above.

3.2.2 Literature reviewing

This is an essential step in any research work. It consists of reviewing the available documentation sources (books, journals, proceedings, technical reports, etc.) in the ontology matching field to acquire the knowledge necessary to formulate the research problem. Then, the same procedure is repeated to review the solutions proposed for interoperability problems between GIS and CAD. The acquired knowledge is summarized in chapter 2 ("Background") of this document.

At the beginning some relevant surveys were reviewed to get acquainted with the field. A dedicated website for ontology matching 1 was found, which serves as an entry point for different queries. It is a repository of documentation and information sources related to ontology matching, including projects, publications, frequently asked questions and evaluation initiatives, among others. Moreover, it shows the relevant authors in the field (Jerome Euzenat and Pavel Shvaiko), who are also the authors of the book titled "Ontology Matching" [29].

1 http://www.ontologymatching.org

Using the surveys and the book as a reference, articles about several topics treated in the book were reviewed. Most of them focused on the development of ontology matching strategies for improving alignment results, which shows that there is little room for a relevant contribution in this field because, according to [68], a small improvement requires a lot of work. However, the same paper shows that many challenges still remain unsolved, such as the one addressed by this work.

While the algorithms and strategies were being reviewed, other issues concerning ontology matching emerged. Some of them are, for example, the representation of alignments, the toolboxes or applications available for manipulating ontologies and alignments, the systems and frameworks for generating alignments, the evaluation of the quality of alignments and the main applications of ontology matching. Each of them was reviewed separately during the first and second months of the project. Also, a large amount of time was devoted to understanding and debugging the ontology matching systems and frameworks which implement the algorithms used in the evaluation phase, since most of them lack good documentation or a simple graphical interface.

The review process also includes the understanding of some new terms and concepts from the knowledge engineering field which are a usual source of ambiguity and misunderstanding in the literature, such as matching, alignment, mapping, ontology integration and merging, semantic heterogeneity, local ontology, etc. All these terms are used in this work in the sense defined by Euzenat and Shvaiko in [29, Section 2.4].

After the second tutor of the project showed the relevance of the interoperability problem between CAD and GIS representations, a problem that can be tackled by ontology matching, a second literature review was started to identify the current approaches to this problem. During this second review the main goal was to identify, characterize and classify current approaches, which are mainly focused on the two most prominent standards: CityGML and IFC (see section 2.3 for more information). Several recent references justify the interest in this kind of problem, such as [22, 24, 43, 53, 76].

As a result of this reviewing process a limitation was detected: the extension of CityGML to other overlapping domains demands a lot of manual work. If these extensions have to deal with a continuously changing environment such as the Web, the maintenance and scalability of the information system become unfeasible. Besides, there is a need to merge expert information coming from 3D GIS with collaborative information coming from the Geospatial Web community to improve the richness of the knowledge for both communities.

3.3 Design of experiments

This section explains the experiments that make up the last step of the research introduced in this work. The experiments have been designed to satisfy the objectives of the previous section. Basically, they consist of an analysis of the characteristics of the ontologies and a comparison of the results obtained by applying different tools and techniques against an established reference alignment.


3.3.1 Evaluation measures

The main question addressed in this section is: how can we measure the quality of the results of the ontology matching algorithms? There are both qualitative and quantitative measures for evaluating several features of automatically generated alignments, as introduced in [44] and reviewed in [29, Section 7.3]. They are classified into the following categories:

• Compliance measures evaluate the degree of compliance of the alignment with regard to some standard. They are used for computing the quality of the output compared to a reference output. However, this reference output is not always available, not always useful and not always consensual, although it is always desirable for benchmarking.

• Performance measures capture non-functional features of the algorithms such as resource consumption. These measures depend on the processing environment and the system, which makes it difficult to obtain objective measures.

• User-related measures include user evaluation, overall aggregation, and measures to evaluate a specific application. They are especially interesting when algorithms require some kind of user interaction, ranging from the use of the alignment to user input during the alignment process. This brings the user into the evaluation loop, which makes it even more difficult to obtain an objective evaluation.

This work does not include user-related measures due to the difficulty of finding true experts in the domain of interest with a relevant opinion. Instead, we consider only measures related to the alignment process, i.e. compliance and performance measures. The approach proposed in the literature consists of comparing the alignment A returned by the algorithm with a reference alignment R (a "gold standard"). We will only consider the following compliance measures:

• Precision, which represents the ratio of true positives to all retrieved pairs. Adapted from information retrieval research, it measures the correctness of the technique. It is defined as follows:

\[ P(A,R) = \frac{|R \cap A|}{|A|} \]

• Recall, which measures the ratio of true positives to the total number of expected correspondences expressed in the reference alignment. Thus, it determines the degree of completeness of the alignment. Like precision, this measure also comes from information retrieval and it is formally defined as follows:

\[ R(A,R) = \frac{|R \cap A|}{|R|} \]

• F-measure allows direct comparison between two systems which are often not comparable based solely on precision and recall, because these usually pull in opposite directions: higher precision implies lower recall and vice versa. This measure aggregates the results of precision and recall using a parameter α between 0 and 1 and it is defined as:

\[ M_\alpha(A,R) = \frac{P(A,R) \times R(A,R)}{(1-\alpha) \times P(A,R) + \alpha \times R(A,R)} \]

Then, the higher α (α → 1), the more importance is given to precision, and the lower α (α → 0), the more importance is given to recall. A value of 0.5 means that the F-measure corresponds to the harmonic mean of precision and recall, representing a common scenario where the two measures are equally important.
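The three compliance measures can be sketched in a few lines of Python. This is a minimal illustration which assumes that an alignment is represented as a set of (source, target, relation) tuples; the entity names and the toy alignments are hypothetical and do not correspond to the reference alignments built in this work.

```python
from typing import Set, Tuple

# A correspondence is modelled here as (source_entity, target_entity, relation).
Correspondence = Tuple[str, str, str]


def precision(found: Set[Correspondence], reference: Set[Correspondence]) -> float:
    """P(A,R) = |R ∩ A| / |A|; taken as 0.0 when the matcher returns nothing."""
    return len(found & reference) / len(found) if found else 0.0


def recall(found: Set[Correspondence], reference: Set[Correspondence]) -> float:
    """R(A,R) = |R ∩ A| / |R|; taken as 0.0 when the reference is empty."""
    return len(found & reference) / len(reference) if reference else 0.0


def f_measure(found: Set[Correspondence], reference: Set[Correspondence],
              alpha: float = 0.5) -> float:
    """M_alpha(A,R) = P * R / ((1 - alpha) * P + alpha * R)."""
    p = precision(found, reference)
    r = recall(found, reference)
    denominator = (1 - alpha) * p + alpha * r
    return (p * r) / denominator if denominator else 0.0


# Toy reference alignment R with four expected correspondences (illustrative).
reference = {
    ("citygml:Building", "ifc:IfcBuilding", "="),
    ("citygml:Room", "ifc:IfcSpace", "="),
    ("citygml:Door", "ifc:IfcDoor", "="),
    ("citygml:Window", "ifc:IfcWindow", "="),
}

# Toy matcher output A: two correct correspondences and one wrong one.
found = {
    ("citygml:Building", "ifc:IfcBuilding", "="),
    ("citygml:Door", "ifc:IfcDoor", "="),
    ("citygml:Room", "ifc:IfcRoof", "="),
}

print(precision(found, reference))        # 2 correct out of 3 returned
print(recall(found, reference))           # 2 correct out of 4 expected
print(f_measure(found, reference, 0.5))   # harmonic mean of the two
```

In this toy case precision is 2/3, recall is 1/2 and the F-measure with α = 0.5 is 4/7; setting α to 1 or to 0 reduces the F-measure to precision or to recall, respectively.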

Moreover, the performance of the algorithms is taken into account using the following performance measures:

• Speed is measured as the processing time (in seconds) used by the test machine to obtain the alignment through the execution of the alignment process.

• Memory considers the amount of memory used by the process performing the alignment. However, the memory considered here includes not only the memory used for the execution of the algorithm but also the memory used by the underlying ontology management system, which introduces a bias in the measures taken.
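As an illustration of how these two performance measures can be collected, the following Python sketch wraps an arbitrary matching function and records its wall-clock time and peak memory. It is only an analogy under stated assumptions: `measure` and `toy_matcher` are hypothetical helpers, and `tracemalloc` only tracks memory allocated by the Python interpreter, so it does not reproduce the measurement of the matching systems evaluated in this work.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(matcher: Callable[..., Any], *args: Any) -> Tuple[Any, float, int]:
    """Run matcher(*args) and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = matcher(*args)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) sizes in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def toy_matcher(source_labels, target_labels):
    """Hypothetical stand-in: pair the labels that are equal ignoring case."""
    targets = {label.lower(): label for label in target_labels}
    return {(s, targets[s.lower()]) for s in source_labels if s.lower() in targets}


alignment, seconds, peak_bytes = measure(
    toy_matcher, ["Building", "Door", "Roof"], ["building", "window", "door"]
)
print(sorted(alignment), seconds, peak_bytes)
```

As noted above for memory, the peak value includes everything allocated while the matcher runs, not only the data structures of the algorithm itself.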

Since the evaluation is performed in the context of a particular application, two complementary means are used to design the evaluation procedure: (i) using a specific test set and experiment design; (ii) interpreting the results with an application-oriented bias (we are aware of this). As a matter of fact, some applications require high recall (e.g. interactive merging) while others require high precision (e.g. the automatic connection of two web services). Following the classification introduced by Euzenat in [29, Chapter 1], we frame the application in the system integration category due to the number of similarities between both application fields. According to this, high precision and recall are preferred, while the level of automation and the speed can be lower. In our case, the level of automation is a requirement and a motivation of the research, so we do not consider user intervention. Hence, leaving speed aside, the focus will be on achieving the highest possible precision and recall.

3.3.2 Selection of datasets

This section enumerates the ontologies which make up the dataset used for experimentation. Besides, it explains the features which justify the relevance of the selected datasets. This is a crucial step for every experimental validation, since the conclusions have to be drawn taking into account which datasets are selected. For example, if the data were not reliable or available, the conclusions would be significantly constrained.

We chose the following ontologies:

• CityGML 2. This is an international standard for city object modeling supported by the OGC, and it is an application schema of GML. It includes more than 1000 concepts, but the building module includes 134, which is in practice the only subset of terms used for the experiments. More information is provided in section 2.3.1.

2 http://schemas.opengis.net/citygml/

• IFC 3. This is also a standard for BIM, supported by the buildingSMART alliance, with more than 7000 terms about the building and construction domains. It is more widely used than alternatives such as AECxml or CIMsteel Integration Standards. The number of applications that support it has been growing every year since it was approved as a standard.

• GbXML 4. This international standard is a specific version of IFC specially suited to energy efficiency analysis and accepted by many simulation tools.

• DBPedia 5 [3, 10]. This is the most famous and popular ontology available on the Web, and it publishes data from the info boxes of Wikipedia. It is a cross-domain ontology with many concepts related to the geospatial domain that provide general information about places or things.

• LinkedGeoData 6 is a geospatial ontology developed by the Linked Geospatial Data initiative which models places, toponyms and objects in geospatial environments for linking with other knowledge bases inside the Linked Open Data initiative. It models data extracted from the Open Street Map (OSM) [41] project, which is published as RDF triples (more than 2 billion at the moment). The data was linked to DBPedia automatically by expanding the user-created links from OSM to Wikipedia using machine learning based on a heuristic that combines type information, spatial distance and name similarity, as detailed in [4].

Please note that the choice of the datasets was made a priori and imposed by the field of study. The selection was not tailored to favor any specific system. However, if the results were interpreted in terms of ontology matching techniques, they would be application-biased (i.e. we cannot draw general conclusions about the performance of the techniques). Despite this fact, the main goal of this work is to evaluate the behavior of the techniques in a specific application domain. Thus, the datasets selected for the experiments will yield valid conclusions because they are representative enough of the field and they fulfill the following important features:

• Size. Datasets that are too small are not relevant, but datasets that are too big are unwieldy.

• Operative. Both datasets and tools must be fully compatible and work in at least a significant number of tests. This implies the collection of non-null results with minimum adaptation of the datasets to the tools or vice versa.

• Representative. The datasets must be adopted by as many entities as possible. They must include a large number of different entries in order to achieve significant results.

3 http://buildingsmart-tech.org/ifcXML/IFC2x3/FINAL/IFC2X3.xsd
4 http://www.gbxml.org/schema/0-36/GreenBuildingXML.xsd
5 http://downloads.dbpedia.org/3.6/dbpedia_3.6.owl.bz2
6 http://linkedgeodata.org


• Reliable. The publisher of the dataset must be reliable on the international scene and the samples have to be modified as little as possible.

• Reproducible. The dataset must be publicly available so that the tests can be repeated and the results compared. Besides, this eliminates any unfair advantage in the conclusions we might obtain as a result of "self-created ontologies".

• Static. When several versions of the dataset exist, one of them has to be selected and identified, including a dump of the data if needed.

• Real. Synthetic datasets are useful to test the strengths and weaknesses of different algorithms on a well-established set of challenges, but they are not valid when we are trying to test the robustness of the same algorithms in a real-world situation. Results frequently tend to be very different.

Some other relevant ontologies were rejected taking into account the features mentioned above. For example, GeoWordNames [33] provides a big RDF database of information about places which can be related to CityGML, but its ontology only defines 8 categories without any hierarchy.

3.3.3 Reference alignments

The alignments are created around CityGML because its large number of modules makes it generic enough to encapsulate various kinds of domains, so it can be matched to a large number of schemas, having an umbrella function in this sense. Besides, this work tries to evaluate the automatic creation of extensions for CityGML, which justifies centering the alignments on that standard. Specifically, we evaluate the following alignments:

• CityGML-IFC. The building module of CityGML can be merged with IFC. In fact, this is one way to address the challenge of interoperability between GIS and BIM systems.

• CityGML-GbXML. Again, merging the building module of CityGML with energy efficiency information is a valuable integration from the expert viewpoint.

• CityGML-DBPedia. The generic information from DBPedia complements the technical information represented by CityGML. This information should be important for improving decision making because it merges expert information with community-driven datasets.

• CityGML-LinkedGeoData. This alignment is similar to the CityGML-DBPedia alignment since both merge expert and detailed information with informal data provided by users. Unlike DBPedia, the LinkedGeoData ontology is focused on the geospatial domain, to which it contributes a lightweight ontology (a taxonomy of concepts). Facilities, building types and building parts are some of the concepts shared by both ontologies which have a clear correspondence.


The development of the reference alignments requires three general steps: 1) documenting and reviewing the specification details of the ontologies involved; 2) defining equivalence and subsumption (subclass) relations between terms of the ontologies; 3) iteratively refining the alignment and computing the transitive closure of the well-established correspondences. More details about the work carried out and the procedure followed are provided in section 4.3.
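The transitive closure mentioned in step 3 can be illustrated with a minimal Python sketch. It assumes that subsumption correspondences are stored as (subclass, superclass) pairs; the function name and the placeholder terms are hypothetical and are not taken from the reference alignments.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (subclass, superclass) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a is below b and b is below d, so a is also below d.
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Placeholder terms for illustration only (not real correspondences).
subsumptions = {("o1:A", "o2:B"), ("o2:B", "o2:C")}
closed = transitive_closure(subsumptions)
```

With these placeholder pairs the closure adds the implied pair ("o1:A", "o2:C") to the two original ones.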

3.3.4 Auxiliary tools

The transformation and adaptation of the datasets require the use of various ontology development environments, since none of them alone meets all the requirements of the ontology matching tools. In the last 10 years several ontology frameworks and tools have been developed; according to their popularity, robustness and functionality, we have chosen the following:

• Protege 7. An open source environment for ontology development which facilitates the editing and management of ontologies. It is used to inspect and complete the ontologies.

• TopBraid Composer 8. A commercial environment for ontology development with a free version available. It includes a robust XSD to OWL importer that transforms XSD schemas into OWL ontologies, importing the required ontologies automatically.

• NeOn Toolkit 9. An open source framework for ontology development which includes special plug-ins for ontology matching. The framework is a result of the NeOn Project of the FP6 EU Research program, and it exports alignments in the format specified by the Alignment API [20].

3.4 Frameworks for ontology matching

The evaluated matching techniques are already implemented across several frameworks and tools. Some of them are implemented standalone, such as BLOOMS or AROMA, while others are included in a more general framework, such as the WordNet distance in the Alignment API or S-Match and SPSM in the S-Match framework. The literature offers many different proposals, so a selection has to be made for the experimentation. This selection is based on the following criteria:

• Availability, which means that the software implementation is Open Source or, at least, that the binary executable is available for testing.

• Flexibility, which implies that the software is customizable in the sense that, at least, the ontology loader and the mapping renderer are configurable.

7 http://protege.stanford.edu
8 http://www.topquadrant.com/products/TB_Composer.html
9 http://neon-toolkit.org/


• Automation, which requires that the ontology alignment can be produced without user guidance. Several tools follow semi-automatic approaches that improve the quality of the alignment by means of user interaction: the alignment is better because the user reviews the basic alignment provided by the system.

• OWL compatibility, which means that the software can import ontologies encoded in the OWL language, because the ontologies used to generate the alignments are encoded in OWL.

• Performance, which means that the tools provided relevant results in other evaluations, even if they were obtained for specific ontologies.

• Variety, which requires that the selected tools use different techniques, so that their comparison offers more relevant conclusions. Besides, to achieve more breadth in our evaluation, we have included tools such as S-Match that fall outside the OAEI contests.

Taking into account these criteria and considering the algorithms and techniques described in section 2.2.3, the following systems have been selected for the experiments:

• S-Match 10 [36] is a semantic matching framework written in Java and developed at the University of Trento. It provides the basic components for building more complex matchers quickly. Some of the most relevant modules included are context renderers, mapping filters, context loaders and basic matchers. S-Match includes the implementation of two matchers used in the experiments: S-Match and SPSM. It is available as Open Source under the GPL license 11.

• Alignment API 12 [20] is an implementation for expressing and sharing ontology alignments, which allows various ontology matchers to share the same format and interface for accessing matching results. It provides the implementation of several basic matching techniques, such as the linguistic similarity measure based on WordNet distance. It is developed in Java and available as Open Source under the GPL license 13.

• AROMA [21] is an INRIA project, developed in Java, which uses the Alignment API for ontology manipulation. AROMA ranked second in the OAEI contest of 2008. It is also available as an Open Source project under the GPL license 14.

• MapPSO 15 [11] is an ontology matching system for OWL ontologies based on Particle Swarm Optimization, described in section 2.2.6. Developed in Java, it combines different elementary techniques by means of strategies such as the weighted average to measure similarities between concepts at each step of the optimization process. Each execution requires the specification of more than 20 parameters in a file params.xml, which controls the similarity measure and the optimization process. The source code is available under the GPL3 license and the project is still active.

• BLOOMS 16 [45] is a semantic matcher for OWL ontologies which implements the technique based on bootstrapping information from Wikipedia, described in section 2.2.3. It is a Java application available only as a binary download; the source code has not been published yet.

10 http://semanticmatching.org/s-match.html
11 http://sourceforge.net/projects/s-match
12 http://alignapi.gforge.inria.fr
13 https://gforge.inria.fr/projects/alignapi
14 https://gforge.inria.fr/projects/aroma
15 http://sourceforge.net/projects/mappso/

Some of the systems that we could not evaluate on our datasets are: ASMOV [48] (the demo is still unavailable), RIMOM 17 [71] (it can only run on the benchmark dataset of OAEI 2006), SOBOM 18 [75] (the available code is full of errors and was developed for specific examples) and OLA [30] (the Java heap memory was always exceeded during the experiments).

As its own authors claim, the Alignment API is not a matcher; it only provides elementary techniques for matching two ontologies. Comparing it against dedicated matchers is therefore neither fair nor generally meaningful (the result is meaningful only when the other matchers perform worse than the Alignment API). Stating that these systems have been compared with "the Alignment API" without qualification would not be appropriate, so we want to clarify these terms before the evaluation.

3.5 Method for experimentation

The final stage in our methodology is the execution of the experiments. The next paragraphs explain the steps followed to accomplish this stage.

The first task was to acquire a copy of the datasets. Since they are all publicly available, this poses no problem. However, the latest versions of GbXML and IFC are available only to members of their respective alliances, so the last minor version has been used. The others are downloaded directly as XML files from the URLs specified in Section 3.3.2. While CityGML, GbXML and IFC are XML schemas (XSD), DBPedia and LinkedGeoData are OWL ontologies serialized as RDF/XML.

The next task was the translation of the XSD files for CityGML, IFC and GbXML into OWL ontologies using the auxiliary tool TopBraid Composer (see Section 3.3.4). Then, the following alignments were created: CityGML-IFC, CityGML-GbXML, CityGML-DBPedia. For this, heuristics and personal knowledge were used, as explained in section 4.3. The NeOn Toolkit was used to generate partial alignments with the Alignment API, which gave some clues about unnoticed correspondences. The alignments are stored in the format of the Alignment API, which simplifies the later comparison: the same Alignment API provides an evaluator for obtaining the measures specified in Section 3.3.1.

16 http://wiki.knoesis.org/index.php/BLOOMS
17 http://keg.cs.tsinghua.edu.cn/project/RiMOM/
18 http://mlg.hit.edu.cn:8080/Ontology/Welcome.jsp

Then, the automatic alignments were executed. Using the tools and frameworks of section 3.4, the algorithms described in section 2.2.3 were tested, modifying their parameters to measure the effect on the result. Before the compliance measures, the performance measures were obtained: execution time and memory, each taken as the average of three executions. Later, the Alignment API was used to compute the compliance measures automatically.
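The averaging of three executions can be sketched as follows. This is only an illustrative Python harness under stated assumptions: the evaluated tools are Java programs and the actual measurement scripts are not reproduced here, so `measure` and `dummy_matcher` are hypothetical names and the workload is a stand-in.

```python
import time
import tracemalloc
from statistics import mean


def measure(task, runs=3):
    """Run task several times; return average wall time and peak memory."""
    times, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        task()
        times.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return {"seconds": mean(times), "peak_bytes": mean(peaks)}


def dummy_matcher():
    # Stand-in workload (assumption): compares every pair of 200 integers.
    return [a == b for a in range(200) for b in range(200)]


result = measure(dummy_matcher)
```

Note that `tracemalloc` only tracks memory allocated by the Python interpreter; measuring an external Java process would require operating system tools instead.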

Once the execution of the alignments has finished, the results are displayed graphically for a better understanding. The Open Source software gnuplot [66] is used for this task. Tables are written directly in LaTeX for a better presentation. All these results can be seen in Section 4.5. Finally, the results, the experiences and the anomalies detected in each algorithm during the experimentation and confirmed by the results are discussed.


Chapter 4

Experimentation

Summary

This chapter gathers the results obtained from the experimental work. The last section compares the different techniques and analyzes the causes of the good or bad behaviour of each one.

The ontologies selected for these experiments are representative of the domain under study, and they also show some significant qualitative differences with respect to others. Among these characteristics, CityGML, IFC and GbXML are international standards that have gone through a long refinement process and are highly polished. They also represent information at different levels of detail or scale, with a lower percentage of overlap than usual. In addition, they have a very complex structure (in the relations between objects and in the taxonomy) and are encoded in different languages (XSD and OWL). Finally, DBPedia and LinkedGeoData come from the Web and do not follow a controlled refinement process like the rest, so their vocabulary is much less technical than that of the others. For a comparison of their characteristics, see table 4.1.

Since all the implementations of matching techniques and algorithms import data serialized in RDF/OWL, the ontologies have been adapted using the TopBraid Composer tool. The alignments specified in the previous chapter were then built using the Alignment API plugin available for the NeOn Toolkit ontology development environment. Starting from a first version, each alignment was refined through successive iterations using the available documentation about the formats (white papers), personal experience and occasional consultations with experts in the field. A final check of the uniqueness of the correspondences and of the syntax is enough to complete the reference alignment.

All the experiments were run on the same machine, an Intel Core 2 Duo at 2.8 GHz with 3 GB of RAM running Ubuntu Server 10.04. Some techniques require a prior tuning of their parameters, which was done experimentally by choosing the value that gives the best results in each case. Following this methodology, the "lexicalThreshold" of AROMA was set to 0.6. When the number of parameters was too large, as in the case of MapPSO, the default parameters were chosen. See section 4.4 for more information.


Once all the experiments had been run and all the result plots obtained, the results were analyzed. The evaluated techniques were grouped into 3 groups: those based on string distances, those based on external linguistic resources (WordNet) and matching systems (which combine different techniques through different strategies). These 3 groups are kept throughout the 4 experiments, whose results are compared with their respective reference alignments.

From the point of view of the techniques, the string-based ones obtain the best results, especially the substring distance and the edit distance. The WordNet-based techniques show better precision, but their recall is noticeably lower. The matching systems only worked in the experiments of CityGML with IFC and with GbXML, and produced no results for DBPedia or LinkedGeoData. In both of the first two cases the result is rather poor in both precision and recall, partly because of the influence of the structure-based methods, where these ontologies are highly heterogeneous.

From the point of view of the 4 alignments, the one with the best results was CityGML with the LinkedGeoData ontology, because these are the two ontologies with the greatest overlap. In fact, the two ontologies are meant to represent the same information but are modelled with different target users in mind. This is one of the cases for which ontology matching is intended. Confirming this reasoning, the worst alignment is CityGML with DBPedia, because DBPedia is a multi-domain, general-purpose ontology with a low degree of overlap. For the other two experiments the result is very similar, since IFC and GbXML represent the same domain with minimal differences: IFC uses an "Ifc" prefix in all its terms, and GbXML includes material-specific parameters with a much more technical vocabulary, which is a challenge for lookups in thesauri such as WordNet.

Regarding performance, measured in terms of memory consumption and CPU time, the basic techniques never need more than 60 seconds to complete their execution, even with IFC, which is the ontology with the largest vocabulary. The matching systems generally need much more memory (twice as much) and take more time. AROMA, for example, needs 1080 seconds to align CityGML and IFC while providing little more than 10 correspondences. When external services are accessed, as with BLOOMS, the time soars, taking up to 8 hours to align CityGML and IFC. This time depends on the number of terms and on their specificity, since the whole hierarchy of terms has to be retrieved through queries to the Wikipedia Web service.

In general, the techniques work better the greater the overlap between the input ontologies, and when the ontologies are very complex (in terms of the number of concepts and the complexity of the relation graph) the basic matching techniques provide more satisfactory results.

All the results obtained and the parameters of the matching systems, as well as the adapted ontologies and the generated alignments, are publicly available at the URL http://lfa.mobivap.uva.es/~fradelg

This chapter explains the steps followed during the experimentation according to the methodology detailed in the previous chapter. Experimentation is the method employed in this work for validating the hypothesis and extracting valid conclusions. The methodology guarantees that the procedure is systematic and reproducible.

4.1 Characterization of the ontologies

One of the reasons why these ontologies were selected for the experimentation is that they introduce some particularities with respect to other experiments in the literature. To the best of our knowledge, these differences have not been taken into account in other evaluations of ontology matching. The qualitative differences with respect to other ontologies are the following:

• CityGML, IFC and GbXML are international standards that have been refined and improved over several years by large and strong organizations such as the OGC or the buildingSMART alliance.

• Excluding LinkedGeoData, the others are standards (DBPedia and GbXML are de facto standards). Other experiments use synthetic or special-purpose ontologies which do not reflect real knowledge organization.

• They exhibit different levels of detail in the representation, with a low degree of overlap. For example, CityGML is intended for modeling city objects, while IFC and GbXML focus on the building level. Only DBPedia is a general-domain ontology, while the others are ontologies of specific domains.

• CityGML, IFC and GbXML are ontologies from professional applications, developed with expert knowledge, which include many technical terms and specific definitions. This is a challenge for some techniques, such as those based on linguistic resources like WordNet.

• They introduce high heterogeneity at the syntactic level, both in their structure and in their encoding language (XML and OWL). The former is especially relevant for evaluating structure-based matching techniques.

Table 4.1 shows the count of the different kinds of ontology elements for our datasets. At a glance, the CityGML ontology seems to be the simplest one, because the experiments are performed using only the building module. IFC stands out among the others in the number of classes. Most of them are technical terms describing specific properties of materials, equipment and installations. GbXML is a smaller ontology which shares some common concepts with IFC and has one advantage: the labels of its terms do not carry a prefix such as Ifc. DBPedia has just over 300 classes, which seems few for a cross-domain ontology. The LinkedGeoData ontology for the Semantic Geospatial Web has a high number of classes. Most of them are categories of a large taxonomy of places, sites and objects which can be geographically referenced. We think that merging this ontology with CityGML would be very useful for enriching the information about places with the information provided by CityGML.


Ontology        Class   SubClassOf   Object Property   Data Property
CityGML           134          338                61               4
IFC              7519         7519              1088              43
GbXML             517         1733               333             144
DBPedia           303          272               706             734
LinkedGeoData    1295         1314                 9            1663

Table 4.1: Comparison of ontology metrics for the dataset

4.2 Adaptation of the ontologies

A prior translation of the ontologies was required since they are encoded using the XML Schema Definition language (XSD), which in fact represents a syntactic heterogeneity. All of the evaluated tools require as input an OWL ontology serialized in RDF/XML or a similar language. The translation resolves the syntactic heterogeneity which arises at the first step, when data models are encoded using different languages. It was performed through the following steps:

1. The ontology / schema was downloaded from the website of its maintainer (see the footnotes in section 3.3.2).

2. If the schema was encoded in XSD, it was translated to OWL using TopBraid Composer. Every schema imported through xsd:import was also converted to OWL following the same procedure.

3. The ontologies of the dataset and the imported ontologies were serialized in RDF/XML and stored in a folder on disk.

4.3 Construction of the reference alignments

The evaluations performed by the OAEI use a reference alignment established by experts in the domain of interest. Due to the lack of reference alignments for the datasets used in this work, the alignments have been developed using only personal knowledge and experience in the field. More concretely, the following procedure was followed to develop the reference alignments:

1. The NeOn Toolkit was used to generate an initial version of the alignment in the Alignment API format. Then, successive iterations over this alignment tried to improve the final quality by increasing the number of correspondences.

2. The documentation of the publisher or standardizing organization was read. This includes both the bibliographic resources available in the form of white papers and some other papers which show how to take advantage of the formats.

3. The available information models or knowledge diagrams were reviewed to analyze the relations and the meaning of the concepts.


Alignment                  # Mappings   # Subsumptions
CityGML - IFC                      35               18
CityGML - GbXML                    31               12
CityGML - DBPedia                  26               18
CityGML - LinkedGeoData            23               14

Table 4.2: Number of correspondences of the reference alignments

4. The classification of the terms was reviewed using Protege.

5. Personal experience and knowledge were used to determine the relationship between two entities. The concepts from one ontology were related to those of the other via a subclass or an equivalence relationship.

6. This led to some redundancy in the first versions of the alignments, when the established relations between concepts were subsumed by relations between more abstract concepts. Thus, in order to avoid duplicate mappings, the alignments were reviewed in successive iterations.

7. Finally, the syntax was validated using the Alignment API parser before the alignment was ready for evaluating the results of the experiments.
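The final checks of steps 6 and 7 (no duplicate mappings, well-formed syntax) can be approximated outside the Alignment API with a short Python sketch. The embedded sample is a simplified assumption about the RDF/XML layout of the alignment format (Cell, entity1, entity2, measure and relation elements) and should be checked against the Alignment API documentation; the example.org URIs and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Simplified sample (assumed layout, placeholder entities).
SAMPLE = """<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Alignment>
    <map><Cell>
      <entity1 rdf:resource="http://example.org/o1#A"/>
      <entity2 rdf:resource="http://example.org/o2#B"/>
      <measure>1.0</measure>
      <relation>=</relation>
    </Cell></map>
    <map><Cell>
      <entity1 rdf:resource="http://example.org/o1#C"/>
      <entity2 rdf:resource="http://example.org/o2#D"/>
      <measure>1.0</measure>
      <relation>&lt;</relation>
    </Cell></map>
  </Alignment>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def read_cells(xml_text):
    """Parse the text (ParseError if malformed) and list its correspondences."""
    cells = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "Cell":
            continue
        fields = {local_name(child.tag): child for child in element}
        cells.append((
            fields["entity1"].get(f"{{{RDF_NS}}}resource"),
            fields["entity2"].get(f"{{{RDF_NS}}}resource"),
            fields["relation"].text.strip(),
            float(fields["measure"].text),
        ))
    return cells


def duplicated_pairs(cells):
    """Return the entity pairs that appear in more than one correspondence."""
    seen, duplicates = set(), set()
    for entity1, entity2, _relation, _measure in cells:
        pair = (entity1, entity2)
        (duplicates if pair in seen else seen).add(pair)
    return duplicates


cells = read_cells(SAMPLE)
```

Matching on local element names keeps the sketch independent of the exact namespace URI used by a given version of the format.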

Obviously, the terms with no correspondence in the other ontology were rejected beforehand. The degree of similarity was not approximate but exact, which means that only equivalence or subclass relations were established. This process is subjective to some extent but, in the absence of a community-agreed reference alignment, there is no better way of evaluating the performance of ontology matching techniques applied to the field of GIS. A similar methodology was employed previously in [57].

Table 4.2 summarizes the number of correspondences of the developed reference alignments, indicating how many of them are subsumptions. The number of correspondences in these alignments is lower than in the alignments proposed by the OAEI (see section 2.2.8), since the latter present a higher level of overlap. The table also shows that the most complex alignment is CityGML with IFC, while the simplest is CityGML with DBPedia, because DBPedia is a cross-domain ontology with a low number of classes. The high number of subsumptions between CityGML and the LinkedGeoData ontology arises because most of the correspondences are established between Building and the subclasses of Building in the LinkedGeoData ontology.

4.4 Experimental environment

As explained before, we have performed a comprehensive evaluation of state-of-the-art ontology matching systems over ontologies for GIS, BIM and the Geospatial Web. These systems, described in section 3.4, are developed in Java, which simplifies their deployment as a single test application (used for the Alignment API) or as shell scripts (used for the rest of the tools) for performing the experiments. All the experiments were executed on a machine with an Intel Core 2 Duo E7400 CPU at 2.8 GHz, 3 GB of RAM and 3 MB of cache, running Ubuntu Server 10.04. Performance measures were taken while the tools were running alongside other server processes.

All the techniques share a common parameter: the similarity threshold. Hereafter, for brevity, when we talk about threshold values we refer to the similarity threshold by which the set of correspondences is trimmed: a correspondence is dropped when its confidence measure is lower than the established threshold. Its influence on precision and recall was analyzed as part of the experiments. Besides, the matchers AROMA and MapPSO introduce certain tunable parameters whose values were selected beforehand according to the criteria explained in the next paragraphs.
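As a minimal Python sketch of this trimming, assume that a matcher returns a confidence value per candidate correspondence and that precision, recall and F-measure follow their standard definitions. The function names and the sample correspondences are hypothetical, chosen only for illustration.

```python
def trim(candidates, threshold):
    """Keep the pairs whose confidence is not lower than the threshold."""
    return {pair for pair, confidence in candidates.items()
            if confidence >= threshold}


def evaluate(found, reference):
    """Return (precision, recall, F-measure) against a reference alignment."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical matcher output and reference alignment.
candidates = {("a1", "b1"): 0.9, ("a2", "b2"): 0.7, ("a3", "b9"): 0.4}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

precision, recall, f_measure = evaluate(trim(candidates, 0.6), reference)
```

In this example the threshold 0.6 drops the only wrong candidate, so precision reaches 1.0 while recall stays at 2/3 because one reference correspondence is never proposed.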

The "lexicalThreshold" parameter of AROMA for the syntax alignment procedure was set to 0.6. The experiments performed with values below 0.5 resulted in alignments with very poor precision, while higher thresholds, such as 0.8, identified very few results (poor recall). Furthermore, to make the different results comparable, the same version 3.0 of the WordNet lexical database was used in both S-Match and the Alignment API. This is relevant because S-Match includes version 2.0 of WordNet in its distribution. This was changed in the parametrization file of S-Match.

MapPSO introduces too many parameters 1 (both for combining the matchers and for controlling the particle swarm execution) to evaluate the effect of changing each one on the resulting alignment, since this would have required more time than we had available. The tests executed beforehand, changing the matchers and weights and varying other parameters, showed the same poor precision and recall. Thus, the selection of the parameters does not seem to be significant in terms of experimental results. In the end, the default parameters of MapPSO were used in the experiments.

Finally, the N-gram distance is in fact a 3-gram distance (i.e. it considers only string fragments of 3 characters). 2-grams increase the probability of matching unrelated terms, while 4-grams (or longer) make it too difficult to find valid correspondences (only equal words are matched).
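A minimal Python sketch of a 3-gram similarity is shown below. It assumes a Dice coefficient over sets of lowercased character trigrams, which is not necessarily the formula implemented by the Alignment API; the function names are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over n-gram sets (assumed formula, see lead-in)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        # Strings shorter than n have no n-grams: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


prefixed = ngram_similarity("Building", "IfcBuilding")
unrelated = ngram_similarity("Wall", "Roof")
```

Here "Building" and "IfcBuilding" share 6 of their 6 and 9 trigrams respectively, giving 12/15 = 0.8, while "Wall" and "Roof" share none.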

4.5 Analysis of the results

This section summarizes and discusses the results obtained from the experiments. It is divided into 4 subsections, each of which corresponds to a single alignment between CityGML and another ontology (as detailed in the title of the subsection). CityGML acts as an "umbrella" matched against the other ontologies. For each experiment, the same set of techniques and systems was evaluated. Further information about the theoretical background of the techniques is provided in section 2.2.

1 Parameter set specification is provided in http://sourceforge.net/apps/mediawiki/mappso/index.php?title=Params.xml

For a clearer presentation and understanding of the results, the techniques are clustered into three groups, following the classification introduced in section 2.2.3. For each group we analyze two aspects: the first one evaluates the F-measure values of the techniques at different similarity thresholds (10 samples are taken in the interval [0,1]); the second one compares the precision and recall of the techniques in the best case (i.e. at the threshold giving the highest F-measure). The three groups are the following:

1. String techniques, which include techniques based on string distances such as the Hamming, substring, n-gram, SMOA, Levenshtein and Jaro-Winkler similarity measures.

2. Linguistic techniques, which include techniques based on WordNet relations between names such as basic synonym similarity and distance, cosynonym similarity, basic gloss overlap and Wu-Palmer similarity.

3. Matching systems, which include systems that combine different elementary techniques such as S-Match, SPSM, AROMA and MapPSO.
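The two aspects analyzed for each group (the F-measure sampled at 10 thresholds and the best case) can be sketched in Python as follows. The sample points 0.1, 0.2, ..., 1.0 are an assumption, since the exact sampling is not stated, and the candidate correspondences are hypothetical.

```python
def f_measure(found, reference):
    """Harmonic mean of precision and recall (0.0 if nothing is correct)."""
    correct = len(found & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(found)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


def sweep(candidates, reference, samples=10):
    """Return (threshold, F-measure) for thresholds 1/samples ... 1.0."""
    curve = []
    for step in range(1, samples + 1):
        threshold = step / samples
        found = {pair for pair, confidence in candidates.items()
                 if confidence >= threshold}
        curve.append((threshold, f_measure(found, reference)))
    return curve


# Hypothetical matcher output and reference alignment.
candidates = {("a1", "b1"): 0.95, ("a2", "b2"): 0.75,
              ("a3", "b9"): 0.55, ("a5", "b7"): 0.35}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

curve = sweep(candidates, reference)
# Best case: the first threshold that reaches the highest F-measure.
best_threshold, best_f = max(curve, key=lambda point: point[1])
```

For this toy data the F-measure rises while wrong candidates are being dropped, peaks at threshold 0.6 and falls again once correct correspondences start to be trimmed.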

BLOOMS, described in section 2.2.6, was one of the most promising matching systems according to the results in the literature. It had already been applied to ontologies showing partial overlap in [45, 46]. Nevertheless, the execution of BLOOMS depends on the Wikipedia Web Services, which have changed since BLOOMS was developed. These changes throw exceptions that affect the rendering of the alignment, so that the output includes neither the kind of relation between two concepts nor the similarity between them. Because BLOOMS is not open-source software (although its website says that the source code will be available soon), it was impossible to fix the bug. After two requests for help sent to the contact e-mail provided by BLOOMS, no positive answer was obtained. Hence, the evaluation of this matching system, which was originally scheduled, could not be performed and is left as possible future work.

The reference alignments, the generated alignments and the parametrization / configuration files of the matchers are publicly available at http://lfa.mobivap.uva.es/~fradelg so that the results analyzed in the next subsections can be reproduced. They are too large to be included as an appendix at the end of this document, so we decided to leave them in digital form, available through the Web.

Before analyzing the results in depth, a first observation is that the values of precision and recall are lower than other results in the literature [15, 27, 28]. Probably, the main reason is that the ontologies do not represent exactly the same domain; they show a high degree of overlap, but many terms in one of them have no correspondence in the other.

4.5.1 Alignment between CityGML and IFC

This experiment matches the CityGML building ontology with the IFC ontology. Both are large ontologies, formed not only by taxonomies but by a more complex structure with several object relationships. The results are divided into three groups, which are explained in the next subsections.

String techniques

These techniques achieve the best results in this experiment. The main reason is that the terms in the ontologies show a high lexical similarity while having a high structural dissimilarity.


Figure 4.1: Results of string-based techniques for alignment between CityGML-IFC. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

Figure 4.1 shows that the optimal threshold is located inside the interval [0.5, 0.8], which is in line with previous results in the literature and with the other experiments. Here, the best F-measure is achieved by the substring distance technique, although the Levenshtein distance is better at lower thresholds. The Jaro-Winkler measure exhibits the worst F-measure value. This is due to the fact that it is not a symmetric measure, so that many important correspondences are missing, especially with higher thresholds. As a curiosity, every technique (except the Jaro-Winkler measure) presents the same F-measure value (0.055) at threshold 1.0, which corresponds to the terms with the same name in both ontologies.
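The threshold sweep behind the left-hand plots can be sketched as follows. This is a minimal illustration and not the evaluation code used in the experiments: it assumes that a correspondence is kept when its similarity is greater than or equal to the threshold and that the F-measure is the balanced harmonic mean of precision and recall. The function name `evaluate` and the toy term pairs are hypothetical.

```python
def evaluate(alignment, reference, threshold):
    """Trim an alignment at `threshold` and score it against a reference.

    alignment: dict mapping (source_term, target_term) -> similarity in [0, 1]
    reference: set of (source_term, target_term) pairs considered correct
    Returns (precision, recall, f_measure).
    """
    # Keep only the correspondences whose similarity reaches the threshold.
    kept = {pair for pair, sim in alignment.items() if sim >= threshold}
    correct = kept & reference
    precision = len(correct) / len(kept) if kept else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data, not taken from the experiments.
reference = {("Door", "IfcDoor"), ("Window", "IfcWindow"), ("Roof", "IfcRoof")}
alignment = {
    ("Door", "IfcDoor"): 0.57,
    ("Window", "IfcWindow"): 0.67,
    ("Room", "IfcRoof"): 0.60,  # spurious correspondence
}

if __name__ == "__main__":
    # Sample the F-measure at thresholds 0.1, 0.2, ..., 1.0.
    for step in range(1, 11):
        threshold = step / 10
        p, r, f = evaluate(alignment, reference, threshold)
        print(f"threshold={threshold:.1f} P={p:.2f} R={r:.2f} F={f:.2f}")
```

Raising the threshold removes the spurious pair first (precision rises) and then the valid ones (recall falls), which is why the F-measure typically peaks at an intermediate threshold.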

Jaro-Winkler and Hamming distances achieve the maximum precision but a lower recall, since they only match terms represented by the same string. The best recall is achieved by the Levenshtein distance, since most of the correspondences have the same term name with a prefix "Building" or "Ifc".
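The effect of a constant prefix on an edit distance can be illustrated with a short sketch. It assumes that the Levenshtein distance is normalized by the length of the longer term; the normalization actually applied by the evaluated implementation may differ, and the term pairs below are only illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Illustrative pairs: the same name with and without the "Ifc" prefix.
    for source, target in [("Door", "IfcDoor"), ("Window", "IfcWindow")]:
        print(source, target, round(levenshtein_similarity(source, target), 3))
```

Under this normalization the prefix costs three insertions, so "Door"/"IfcDoor" scores 1 - 3/7 and "Window"/"IfcWindow" scores 1 - 3/9. Both values fall inside the interval [0.5, 0.8], which is consistent with such pairs surviving moderate thresholds but not a threshold of 1.0.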

Linguistic techniques

These techniques use different similarity measures based on the synonym relations of WordNet. The left part of figure 4.2 shows a comparison of the F-measure values, in which the best values are obtained by the synonym distance. This is the only measure not entirely based on WordNet, since it includes a basic string distance. A big improvement over the basic synonym similarity can be seen in the graph, especially within the interval [0.5, 0.8]. In general, this group of techniques stands out for its stability at different threshold values.

The cosynonym similarity does not work here, since it is very difficult to find common synonyms between two technical and specific terms. The basic gloss overlap is the second best measure. This is because many terms like "Slab" and "Surface" share a big part of their definitions. Taking a look at the right part of figure 4.2, we see that, despite having the best F-measure, the synonym distance is the only measure which does not provide the highest precision. The other measures are more restrictive about the correspondences, delivering only a low number of correct mappings without failures. The low recall is another consequence of that behavior.

Figure 4.2: Results of linguistic techniques for alignment between CityGML-IFC. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

Matching systems

The matching systems lose both precision and recall with respect to the other techniques. The left part of figure 4.3 shows that AROMA is the only matcher with a significant F-measure, almost independent of the threshold value. MapPSO does not deliver any good correspondence among the 164 returned mappings. S-Match suffers from retrieving too many incorrect results. The correspondences returned are not significant, such as those established with "Thing". SPSM is the second best result but far from AROMA, probably because it was developed for matching simpler structures such as Web Service signatures.

The right part of the figure shows that AROMA performs best in terms of both precision and recall, but the difference between them is very high. SPSM achieves the same recall as S-Match but creates fewer spurious correspondences, which leads to a better precision. This recall is not too far from that of AROMA, but the precision is very low.

The reason behind the bad results of these methods is that all of them consider the relations between terms as an important factor for matching. Since the structure of the ontologies is very different, it is very difficult for these methods to find correspondences. A similar thing happens in the other experiments.

4.5.2 Alignment between CityGML and GbXML

In this test, the objective is to match the CityGML building module with the GbXML ontology. In general, the results are slightly better than in the previous experiment with IFC.


Figure 4.3: Results of matching systems for alignment between CityGML-IFC. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

String techniques

The left side of figure 4.4 shows the variation of the F-measure with different thresholds. The best result is achieved by the Levenshtein distance, with a maximum F-measure value of 0.4 at a threshold of 0.5. This is the best result achieved in the experiments. The substring distance measure is worse than the N-grams, although it was the best in the previous experiment. The difference could lie in the absence of a prefix such as "Ifc" in the terms, which increases the similarity between terms for techniques like N-grams or the Levenshtein distance. The Jaro-Winkler measure exhibits the same behavior as for IFC when the threshold is bigger than 0.5. This is because the similarity value is never high for valid correspondences. Thus, trimming by a high threshold value removes many valid mappings between terms.

The right side of figure 4.4 shows the precision and recall in the best cases. Here, we observe that the recall value of the Levenshtein distance (0.66) is acceptable, since it is close to the results shown in other works. The Hamming distance obtains the worst recall (0.1) and the best precision at the same time, because it delivers only 4 correspondences, 3 of which are valid. The N-gram and Levenshtein measures have a complementary behavior for almost the same F-measure: while the first shows a high precision but a low recall, the second shows a low precision but a good recall. For an integration task, a higher recall would be preferable.
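As a worked example of how precision and recall combine, the Hamming figures quoted above can be plugged into the F-measure. This assumes the balanced F-measure (harmonic mean of precision and recall); the resulting value is derived from the numbers in the text and is not read from figure 4.4.

```latex
P = \frac{3}{4} = 0.75, \qquad R = 0.1, \qquad
F = \frac{2PR}{P + R} = \frac{2 \cdot 0.75 \cdot 0.1}{0.75 + 0.1}
  = \frac{0.15}{0.85} \approx 0.18
```

Under that assumption, the high precision barely compensates for the low recall, since the harmonic mean is dominated by the smaller of the two values.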

These results show that the most basic techniques can sometimes give the best performance in terms of the quality of the results.

Linguistic techniques

Figure 4.5 compares WordNet-based distances. In general, they deliver a higher precision than string-based techniques but a lower recall, which leads to a lower F-measure (0.28). The same analysis performed for the previous alignment is valid here too. The synonym distance is again the best measure, but here the cosynonym similarity returns a non-empty set of correspondences. This is because the techniques do not tokenize the names, and the prefix "Ifc" made it difficult to find synonyms for IFC terms. In fact, the F-measure value for the synonym distance is also higher than in the previous experiment. Besides, the gloss overlap gives correspondences with higher similarity, which are only trimmed when a threshold of 0.9 is applied. This means that all correspondences have values higher than 0.8.

Figure 4.4: Results of string-based techniques for alignment between CityGML-GbXML. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

If we take a look at the right side of the figure, we can see the precision and recall of the best F-measure for each technique. Here, gloss overlap and cosynonym show the highest precision thanks to the few correspondences returned (which leads to a low recall at the same time). Every technique shows a precision higher than its recall except for the synonym distance, which consequently has the best F-measure. The Wu-Palmer similarity behaves similarly to the synonym similarity.

Matching systems

In figure 4.6 we observe that S-Match provides good results for recall, but its precision is affected by a plethora of invalid correspondences. SPSM does not generate any valid mapping for these ontologies, probably because there are too many structural differences between CityGML and GbXML, although we think they are no greater than for IFC, where this technique delivers some good correspondences. None of the 30 correspondences generated by SPSM is a valid mapping. Both algorithms show impressive results on multilingual thesauri and in some tracks of the OAEI, but here they show very bad results, in the same way as in [45].

The AROMA technique exhibits a relatively good precision at the expense of a very low recall. This is due to the low number of correspondences (10) detected between both ontologies. MapPSO, as in the previous experiment, does not detect any valid mapping out of a total of 161 generated. The left side of figure 4.6 shows that the correspondences delivered by S-Match have a similarity value of 1.0, while AROMA weights the valid correspondences with a similarity value lower than 0.5.


Figure 4.5: Results of linguistic techniques for alignment between CityGML-GbXML. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

Figure 4.6: Results of matching systems for alignment between CityGML-GbXML. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case


Figure 4.7: Results of string-based techniques for alignment between CityGML-DBPedia. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

4.5.3 Alignment between CityGML and DBPedia

This experiment matches the CityGML and DBPedia ontologies. The aim is to evaluate the performance achieved when CityGML is aligned with a general-purpose, cross-domain ontology. It allows the behavior of the matching techniques to be studied in a different situation. The main motivation is that merging the information of DBPedia with CityGML helps to speed up the Geospatial Web by integrating general information about buildings with GIS information.

The precision and recall achieved are lower than in previous experiments and they are all below 0.5.

String techniques

Figure 4.7 shows that the N-grams distance is the best at the threshold of 0.6 (where most of the techniques reach the optimal F-measure). The Levenshtein and substring distances exhibit a behavior similar to that of N-grams. The main reason for the advantage of N-grams is that the correspondences are established between terms made of two or more concatenated words in a different order (e.g. "BuildingUsage" vs "UsageOfBuilding"). In such cases the N-gram distance is lower than any edit distance, since there are many changes between one string and another. The Jaro-Winkler measure behaves in the same way as in previous experiments because of its asymmetry.
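The following sketch contrasts an N-gram similarity with an edit-distance similarity on the reordered pair mentioned above. It is only an illustration under stated assumptions: the N-gram measure is taken to be the Dice coefficient over sets of character 3-grams, and the Levenshtein distance is normalized by the longer string. Neither definition is guaranteed to match the implementation that was evaluated.

```python
def trigrams(term):
    """Set of character 3-grams of the lower-cased term."""
    term = term.lower()
    return {term[i:i + 3] for i in range(len(term) - 2)}


def ngram_similarity(a, b):
    """Dice coefficient over 3-gram sets (an assumed definition)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Edit distance normalized by the longer string (an assumption)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    pair = ("BuildingUsage", "UsageOfBuilding")
    print("3-gram similarity:     ", ngram_similarity(*pair))
    print("Levenshtein similarity:", levenshtein_similarity(*pair))
```

With these definitions the pair shares 9 of its 11 and 13 distinct 3-grams, giving a 3-gram similarity of 0.75, while the edit-distance similarity is lower because reordering the words forces many character edits.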

As in the other experiments, if we look at the right part of figure 4.7, basic string-based distance techniques achieve better results than many advanced matching systems. The N-gram and Levenshtein techniques compete in recall, but the precision of N-gram is better. The other techniques show better precision values, but the recall is very poor due to the lower number of terms.

The results are worse than for GbXML due to the short length of the names of the DBPedia ontology, which generates many spurious correspondences with the long terms of CityGML.


Figure 4.8: Results of linguistic techniques for alignment between CityGML-DBPedia. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

Linguistic techniques

As figure 4.8 shows, the behavior of each technique is similar to that of its counterpart in the other experiments. Again, the synonym distance is the technique with the best F-measure, at the threshold value of 0.7. As happens with the string-based techniques, the best F-measure value is worse than in the previous experiment with GbXML (0.22 vs 0.28). Here, the main cause is that the generic terms of DBPedia share many common synonyms with other terms of CityGML which are not really part of correspondences. This is confirmed by the right side of figure 4.8, which shows that the precision is half of that obtained in the previous experiment.

The precision of these techniques is very similar: there is a group of techniques formed by Wu-Palmer, synonym similarity and synonym distance which give a value around 0.35, while the other group, formed by cosynonym similarity and gloss overlap, gives a value of 0.5. However, this second group achieves a bad recall due to the low number of terms, which leads to a low F-measure. In the first group, the synonym distance gives a higher recall, which makes it the best method again thanks to the combination of WordNet and basic distance measures.

Matching systems

In this experiment the SPSM and AROMA methods fail, not because they do not deliver any correspondence, but because the correspondences (2 for AROMA and 9 for SPSM) are not valid (see figure 4.9). As in the previous experiments, methods based on the structure do not work well with CityGML. This might be due to the high number of object properties in the ontology. The results delivered by S-Match are exactly the same as in the previous experiment. S-Match falls into the same problem as in previous experiments: it considers the subclass relation between the class "Thing" of CityGML and every class included in DBPedia, and vice versa (532 correspondences). Besides, in this experiment none of the correspondences is valid, which makes the application of S-Match inappropriate. MapPSO suffers from the same problem as S-Match, which both also exhibit in previous experiments: it gives 163 mappings but none of them makes sense, due to the use of an elementary technique based on the hierarchical structure of the ontologies. The differences in structure between CityGML and DBPedia are even bigger than in the previous experiments.

Figure 4.9: Results of matching systems for alignment between CityGML-DBPedia. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

4.5.4 Alignment between CityGML and LinkedGeoData ontology

The last experiment involves the CityGML building module and the ontology for modeling RDF data published from Open Street Map in the Linked Geo Data initiative (in a similar way to how DBPedia publishes Wikipedia data). The LinkedGeoData ontology belongs to the LOD initiative and shares many terms with CityGML. Both of them model geographic information for two different kinds of users: experts and common users. The integration of both ontologies would enrich the information available. The LinkedGeoData ontology is a lightweight ontology with a structure very different from that of CityGML.

String techniques

These techniques show the best performance with respect to the other techniques, as figure 4.10 shows. A key difference between this experiment and the others is that the optimal threshold (which gives the best F-measure) is reached at 0.8 or 0.9. This means that the techniques give a higher similarity or confidence value to the valid correspondences. This is because most of the correspondences are between CityGML "Building" and different kinds of buildings in the LinkedGeoData ontology, such as "HistoricBuilding" or "BuildingUniversity". The other correspondences are between terms with the same name.

The use of a common prefix and suffix makes the substring distance the technique which achieves the best F-measure, at the threshold 0.9. This is also the highest value of the performed experiments. The second one is the Levenshtein measure, repeating a common trend in the other experiments. The N-gram technique performs worse than in previous experiments due to the terms with repeated 3-grams such as "BuildingBuilding". For instance, "BuildingBuilding" matches many invalid terms of CityGML, for example "BuildingPart", with a similarity value of 1.0. The Jaro-Winkler measure suffers again from the problem of asymmetry and assigns low similarities to valid correspondences (in fact, it can be considered a similarity measure).

Figure 4.10: Results of string-based techniques for alignment between CityGML-LinkedGeoData. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

A deeper analysis of precision and recall in figure 4.10 shows that the substring distance is the best neither in precision nor in recall, but its balance gives it the best F-measure. The Smoa and Jaro-Winkler measures present a higher precision, but the very few correspondences (only 5) lead to a low recall (0.17). In fact, they only deliver correspondences between terms labeled with the same name. The Hamming and Levenshtein distances give worse precision than the others, since they consider many correspondences which make no sense but have many characters in common, such as between "BuildingPart" and "BuildingFarm" or "Sanitation" and "Substation".
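A small sketch shows why character-level measures rate the spurious pairs quoted above so highly. It assumes a Hamming similarity defined as the share of positions holding the same character, with positions beyond the shorter string counted as mismatches; this convention is an assumption and may differ from the evaluated implementation.

```python
def hamming_similarity(a, b):
    """Share of positions holding the same character.

    Positions beyond the end of the shorter string count as mismatches
    (an assumed convention for strings of different length).
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for ca, cb in zip(a, b) if ca == cb)
    return matches / longest


if __name__ == "__main__":
    # Semantically unrelated pairs taken from the text.
    for source, target in [("BuildingPart", "BuildingFarm"),
                           ("Sanitation", "Substation")]:
        print(source, target, round(hamming_similarity(source, target), 2))
```

"BuildingPart" and "BuildingFarm" agree in 10 of 12 positions (about 0.83) and "Sanitation" and "Substation" in 7 of 10 (0.70), so both unrelated pairs receive high scores although they share no meaning.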

The comparison of these results with the DBPedia experiment shows that the results for ontologies with a higher degree of overlap are better. This suggests that the ontology overlap is a critical factor for the success of ontology matching techniques, because it is inevitable that many spurious relations between terms of the regions without overlap are inserted in the alignment.

Linguistic techniques

Figure 4.11 shows the evolution of WordNet-based linguistic techniques at different thresholds. The synonym distance, which was the best in the other experiments, is surpassed by the cosynonym similarity, which shows impressive results in this experiment. What really happens is that the synonym distance performs badly with the LinkedGeoData ontology, whereas the cosynonym similarity only finds alignments between terms with the same name (4 valid alignments). Some incorrect correspondences are created by the synonym distance, especially when there are prefixes like "Building" or short words like "Car", which are aligned with "BuildingUsage" and "Maintenance".

Figure 4.11: Results of linguistic techniques for alignment between CityGML-LinkedGeoData. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

The synonym similarity performs worst because it does not consider the distance between the synonyms of terms. For example, it provides a similarity of 1.0 for the terms "Habitation" and "Home", while the cosynonym similarity gives them a value of 0.05. Its results remain constant across thresholds, since the similarity is either 1.0 or 0.0 for each pair of terms. With respect to the cosynonym similarity, the Wu-Palmer similarity adds invalid correspondences whose terms have common synonyms, such as "Culture" and "Yield". Thus, when they are trimmed by a threshold of 1.0, the results converge towards the cosynonym results. Finally, the basic gloss overlap suffers, as in the other experiments, from its lack of a graded similarity measure (it always gives a similarity of 1.0). Besides, it introduces many invalid correspondences between unrelated terms, such as route names which contain "Street" with "Building", because they share terms in their definitions.

From the viewpoint of the precision and recall in the best case, we observe in figure 4.11 that all the techniques show a low and similar recall, while the precision of the cosynonym similarity is the best (0.8). Although the recall of the gloss overlap is acceptable, its precision is awful for the reason explained in the previous paragraph. The precision and recall achieved by the other techniques are quite similar.

Matching systems

In this experiment, as in the DBPedia one, the techniques based on the structure do not work. AROMA and MapPSO do not find any mapping (valid or not), while SPSM shows only 5 correspondences, of which only the trivial "Thing" with "Thing" could be considered valid. S-Match presents almost 1500 trivial correspondences revolving around the classes "Thing" and "Object". This may be due to the huge syntactical heterogeneity between both ontologies, as occurs in the previous experiments. The conception of an XML schema is very different from that of an OWL ontology, and the structural differences are too big to use techniques which rely on logic rules or hierarchy matching. Hence, these strategies for ontology matching cannot be applied for matching ontologies with many structural differences and a low degree of overlap.

Figure 4.12: Results of matching systems for alignment between CityGML-LinkedGeoData. Left: F-measure sampled at different thresholds. Right: precision and recall in the best case

4.5.5 Performance evaluation

The evaluated techniques exhibit different consumption of computational resources. Since this consumption does not really depend on the selected threshold, we decided to discuss it in a separate section.

Table 4.3 shows the megabytes of memory consumed by each technique and the seconds of computation time needed to accomplish the alignment. The elementary matching techniques require fewer resources than the matching systems, since the latter are often composed of several elementary matchers. Also, the WordNet-based distances need slightly more memory than the string distance techniques because they have to load the WordNet thesaurus. Among the WordNet-based techniques, the synonym distance takes a long computing time while the memory is approximately the same. Taking a look at the definition of the distance we find the answer: when the synonym similarity is 0, which happens many times, the basic string distance is computed between the terms. This is a hybrid method which combines WordNet synonyms with string-based distances.
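The hybrid behavior described above can be sketched as follows. This is not the evaluated implementation: a small hand-written dictionary (`TOY_SYNONYMS`, hypothetical) stands in for WordNet, and a normalized Levenshtein distance stands in for the basic string distance, whose exact definition is not restated here.

```python
# Hypothetical stand-in for WordNet: each term maps to a set of synonyms.
TOY_SYNONYMS = {
    "habitation": {"home", "dwelling"},
    "home": {"habitation", "dwelling"},
}


def are_synonyms(a, b, synonyms=TOY_SYNONYMS):
    """True when the terms are equal or listed as synonyms of each other."""
    a, b = a.lower(), b.lower()
    return a == b or b in synonyms.get(a, set()) or a in synonyms.get(b, set())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def string_distance(a, b):
    """Normalized edit distance, used here as the assumed fallback."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def synonym_distance(a, b):
    """0.0 for synonyms; otherwise fall back to the string distance."""
    if are_synonyms(a, b):
        return 0.0
    # Extra work performed for every pair that is not a synonym pair.
    return string_distance(a, b)


if __name__ == "__main__":
    print(synonym_distance("Habitation", "Home"))
    print(synonym_distance("Door", "IfcDoor"))
```

Because most term pairs are not synonyms, the fallback runs for almost every pair on top of the thesaurus lookup, which is consistent with the longer computing time reported for the synonym distance.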

If we focus on the alignments, the CityGML-IFC alignment needs more time and memory to be computed, although this difference is more significant in the matching systems, which use structure-based methods, than in the elementary techniques. The main reason is that the complexity of the XML schemas of CityGML and GbXML is bigger than that of the schemas of DBPedia or LinkedGeoData. The time needed by AROMA stands out over the others, showing worse scalability. This may be due to the fact that AROMA checks rule sets formed by all possible pairs of concepts, so that its order of complexity is quadratic. MapPSO shows a good performance, but considering its poor compliance in the results it is on the whole worse than the others.

The next alignment by complexity after CityGML-IFC is CityGML-LinkedGeoData. WordNet-based techniques have similar performance requirements, while these are slightly smaller in the other cases. A good explanation is that, for string-based measures, there are fewer terms in the LinkedGeoData ontology than in IFC, and for matching systems the structure is simpler due to the lower number of object properties in the LinkedGeoData ontology. The DBPedia and GbXML alignments show similar speed for string and WordNet-based techniques, but the memory is greater for GbXML due to the larger number of terms. With respect to the matching systems, the more complex structure of GbXML (in terms of the subclass and object property relations described in section 4.1) requires twice as much computational time, while the memory is slightly bigger.

                    IFC            GbXML          DBPedia      LinkedGeoData
Technique       Time   Mem      Time   Mem      Time   Mem      Time   Mem
Hamming           10   200         6   224         5   156         8   150
Smoa              14   164         7   253         7   165         9   175
Levenshtein       10   220         6   194         6   160         8   164
Subdistance       10   204         6   224         6   156         8   152
Jaro-Winkler      10   208         6   220         6   160         8   168
N-Grams           11   268         6   252         5   144         8   152
Syn. Dis.         70   276        29   268        37   172        70   164
Syn. Sim.         11   248         7   220         6   198         9   180
Cosyn             15   320         8   228         8   216        12   192
Wu                15   232        10   280        10   240        15   212
Gloss             20   340        10   232        11   232         8   224
S-Match           56   501        11   384         5   388        20   324
SPSM              56   688        12   432         6   384        25   440
AROMA           1080   308        61   200         5   156         5   200
MapPSO            11   396         8   292         7   280         6   220

Table 4.3: Computation time (in seconds) and memory consumption (in MB) of the matching techniques for the different alignments


Chapter 5

Conclusion and future work

Summary

This chapter gathers the conclusions obtained after the research process in relation to the hypothesis stated at its beginning, as well as the limitations of the study and the possible future work that could not be carried out here.

According to the results shown in the previous chapter, and considering that information integration as an application field requires high precision and recall values, the matching techniques could not be used to solve the problem automatically; expert assistance would always be necessary, although the techniques can make the solution to the problem easier. If we considered precision as the most important measure (to minimize the number of invalid relations), disregarding recall, some of the evaluated techniques, such as the Jaro-Winkler distance or the cosynonym similarity, provide precisions of almost 100%. In this case the problem would be the low number of relations discovered for merging both ontologies, which would give the user a low degree of satisfaction.

The research carried out here has some limitations, especially related to the validity of the reference alignments used, the quality metrics defined and the techniques evaluated. Subjective measures for evaluating the quality of the alignment from the point of view of a domain expert have not been included either. In any case, the context of the research must be considered in terms of available time, length and objectives of the work.

In the future, the research could continue with the development of a prototype for automatically discovering relations between ontologies in the Semantic Web, which would be applicable to the automatic generation of extensions of CityGML. From the research point of view, the aim would be to develop new matching strategies / techniques for ontology merging that are much more robust than current matching systems in the presence of concepts not shared between both ontologies, and that especially improve the recall achieved.


5.1 Conclusions

Ontology matching studies how to establish alignments between the concepts of two ontologies. These alignments are intended to reconcile two ontologies which represent exactly the same domain of interest with some kind of terminological, structural or semantic differences. This work presents an evaluation of the compliance and performance of these techniques when they are applied to integrate Geographic Information with Building Information and the Geospatial Web by means of ontology merging. The ontologies selected for representing these fields are CityGML, IFC, GbXML, DBPedia and the LinkedGeoData ontology. The use of these ontologies is motivated by the increased interest in the field in recent years, the lack of automation in the integration of the information and the complexity of the ontologies.

The results show that a very low recall and an acceptable precision can be achieved by means of string-based techniques and the use of external linguistic resources such as WordNet. However, the most advanced techniques, which compete in the OAEI contest with the best results, have proved very ineffective, especially for merging expert ontologies with ontologies of the Web of Data. This illustrates that the techniques do not perform equally well when the ontologies have a low degree of overlap, a situation that ontology merging has to handle. The most robust techniques of each group, in terms of average performance, have been the substring distance, the synonym distance, and AROMA, which mixes structure with logic rule matching. These techniques have been able to tackle the challenges posed by the different experiments.

The performance of the different techniques varies widely. While matching systems like AROMA require up to 1000 seconds for the CityGML and IFC alignment, elementary techniques like the Levenshtein distance take only 10 seconds. The same happens with the memory consumption. Other techniques like BLOOMS, whose results could not finally be evaluated, have to query external resources like Wikipedia or external oracles, which increases the time needed to compute the alignment up to 10 hours.

Although information integration, as an ontology matching application, demands both high precision and high recall, a higher precision is still more important because it avoids adding spurious connections between entities. From this point of view, we could consider the results as positive, since the precision has usually been over 0.5. However, the excessively low recall (under 0.4) would lead to unsatisfactory results for the users, since too little information from one ontology could be moved to the other. These results still have to be improved in order to achieve a successful application in an open scenario such as the Web.

Finally, we want to remark that the research presented in this document illustrates the multidisciplinarity which is usual in any research work. This research involves different knowledge areas, such as computer science for ontology matching, architecture for Building Information Modeling and geography for Geographic Information Systems. Moreover, within the computer science field there are many disciplines involved, such as the Semantic Web for ontology and alignment representation, information theory for measuring the distance between terms, logic inference for evaluating the similarity between two ontologies, and information retrieval for evaluating the quality of the final results.


5.2 Limitations

There are some limitations in this research that have to be taken into account when interpreting the results. For example, the ontology matching techniques have only been evaluated with publicly available datasets, selected from the review of the literature. This work does not intend to be a complete and exhaustive evaluation. Besides, the evaluation does not take into account the preservation of structural relations between terms, since it only measures the number of correspondences correctly established between pairs of terms. More sophisticated metrics would be necessary to consider this kind of issue.

Moreover, the lack of reference alignments in the fields of interest forced the development of new alignments, which have not been validated in depth through expert assessment. Only occasional support has been provided for resolving ambiguities and conflicts between the semantics of some terms.

Finally, as Euzenat et al. state in [26], conventional precision and recall metrics from information retrieval are not especially suited to ontology matching, so the results appear worse than they really are. For this reason, other measures such as Relaxed Precision and Recall have been defined. However, if we understand the task of information integration as the bidirectional translation of information, a perfect matching between terms is needed. This is the main reason why the evaluation shown in this work uses the classical measures.

5.3 Future work

One limitation that can be overcome with more time is the development and validation of the reference alignments with expert assessment. It would require the establishment of a new methodology, which could greatly improve the validity of the conclusions previously introduced.

In the area of development, a prototype could emerge from this work for the automatic creation of Application Domain Extensions for CityGML, using matching techniques to update the bridges between concepts of CityGML and related ontologies, such as IFC, GbXML, DBPedia or the LinkedGeoData ontology.

One of the aspects left outside the scope of this work is the management of the logical rules which are also part of the ontologies. This means the study of approaches for improving the interoperability between the different kinds of logics (predicate, class and description logics). Although it also belongs to the ontology matching field, it is closer to mathematics and philosophy than to information systems, and it is not needed for information integration.


Bibliography

[1] ADRION, W. Research methodology in software engineering. In Summary of the Dagstuhl Workshop on Future Directions in Software Engineering (1993), vol. 18, ACM SIGSOFT Software Engineering Notes.

[2] AKINCI, B., KARIMI, H., PRADHAN, A., WU, C., AND FICHTL, G. CAD and GIS interoperability through semantic web services. CAD and GIS Integration (2009), 199.

[3] AUER, S., BIZER, C., KOBILAROV, G., LEHMANN, J., CYGANIAK, R., AND IVES, Z. Dbpedia: A nucleus for a web of open data. The Semantic Web (2007), 722–735.

[4] AUER, S., LEHMANN, J., AND HELLMANN, S. Linkedgeodata: Adding a spatial dimension to the web of data. The Semantic Web-ISWC 2009 (2009), 731–746.

[5] BECHHOFER, S., VAN HARMELEN, F., HENDLER, J., HORROCKS, I., MCGUINNESS, D., PATEL-SCHNEIDER, P., STEIN, L., ET AL. OWL web ontology language reference. W3C recommendation 10 (2004), 2006–01.

[6] BERNERS-LEE, T. Semantic web on xml. http://www.w3.org/2000/talks/1206-xml2k-tbl [31-05-2011]. Keynote presentation for XML (2000).

[7] BERNERS-LEE, T., HENDLER, J., LASSILA, O., ET AL. The semantic web. Scientific american 284, 5 (2001), 28–37.

[8] BILLE, P. A survey on tree edit distance and related problems. Theoretical computer science 337, 1-3 (2005), 217–239.

[9] BIZER, C., HEATH, T., AND BERNERS-LEE, T. Linked data-the story so far. Int. J. Semantic Web Inf. Syst. 5, 3 (2009), 1–22.

[10] BIZER, C., LEHMANN, J., KOBILAROV, G., AUER, S., BECKER, C., CYGANIAK, R., AND HELLMANN, S. DBpedia-A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web 7, 3 (2009), 154–165.

[11] BOCK, J., AND HETTENHAUSEN, J. Discrete particle swarm optimisation for ontology alignment. Information Sciences (2010).


[12] BOUQUET, P., DONA, A., SERAFINI, L., AND ZANOBINI, S. Contextualized local ontologies specification via ctxml. In Working Notes of the AAAI-02 workshop on Meaning Negotiation. Edmonton (Canada) (2002).

[13] BOUQUET, P., GIUNCHIGLIA, F., HARMELEN, F., SERAFINI, L., AND STUCKENSCHMIDT, H. C-owl: Contextualizing ontologies. The Semantic Web-ISWC 2003 (2003), 164–179.

[14] BOYER, K., AND KAK, A. Color-encoded structured light for rapid active ranging. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 1 (1987), 14–28.

[15] CARACCIOLO, C., EUZENAT, J., HOLLINK, L., ICHISE, R., ISAAC, A., MALAISE, V., MEILICKE, C., PANE, J., SHVAIKO, P., STUCKENSCHMIDT, H., ET AL. Results of the ontology alignment evaluation initiative 2008. In The 7th International Semantic Web Conference (2008), vol. 33, Citeseer, p. 73.

[16] CHALUPSKY, H. Ontomorph: A translation system for symbolic knowledge. In International Conference of Principles of Knowledge Representation and Reasoning (2000), Morgan Kaufmann Publishers, pp. 471–482.

[17] CHOI, N., SONG, I., AND HAN, H. A survey on ontology mapping. ACM Sigmod Record 35, 3 (2006), 34–41.

[18] CORREA, E., FREITAS, A., AND JOHNSON, C. A new discrete particle swarm algorithm applied to attribute selection in a bioinformatics data set. In Proceedings of the 8th annual conference on Genetic and evolutionary computation (2006), ACM, pp. 35–42.

[19] CUEL, R., DELTEIL, A., LOUIS, V., AND RIZZI, C. Knowledge web white paper: The technology roadmap of the semantic web. KnowledgeWeb Project, http://knowledgeweb.semanticweb.org/o2i/menu/KWTR-whitepaper-43-final.pdf [31-05-2011] (2007).

[20] DAVID, J., EUZENAT, J., SCHARFFE, F., AND TROJAHN DOS SANTOS, C. The Alignment API 4.0. Semantic Web (2011).

[21] DAVID, J., GUILLET, F., AND BRIAND, H. Matching directories and owl ontologies with aroma. In Proceedings of the 15th ACM international conference on Information and knowledge management (2006), ACM, pp. 830–831.

[22] DOLLNER, J., AND HAGEDORN, B. Integrating urban gis, cad, and bim data by service-based virtual 3d city models. Urban and Regional Data Management UDMS 2007 Annual (2007), 157.

[23] EGENHOFER, M. Toward the semantic geospatial web. In Proceedings of the 10th ACM international symposium on Advances in geographic information systems (2002), ACM, pp. 1–4.


[24] EL-MEKAWY, M., OSTMAN, A., AND SHAHZAD, K. Towards interoperating citygml and ifc building models: A unified model based approach. Advances in 3D Geo-Information Sciences (2011), 73–93.

[25] EUZENAT, J. Towards composing and benchmarking ontology alignments. In Proc. ISWC-2003 workshop on semantic information integration, Sanibel Island (FL US) (2003), vol. 165, Citeseer, p. 166.

[26] EUZENAT, J. Semantic precision and recall for ontology alignment evaluation. In Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI) (2007), pp. 348–353.

[27] EUZENAT, J., FERRARA, A., HOLLINK, L., ISAAC, A., JOSLYN, C., ET AL. Results of the ontology alignment evaluation initiative 2009. In Fourth International Workshop on Ontology Matching, Washington, DC (2009), Citeseer.

[28] EUZENAT, J., FERRARA, A., MEILICKE, C., PANE, J., SCHARFFE, F., SHVAIKO, P., STUCKENSCHMIDT, H., SVAB-ZAMAZAL, O., SVATEK, V., AND TROJAHN, C. First Results of the Ontology Alignment Evaluation Initiative 2010. Ontology Matching (2010), 85.

[29] EUZENAT, J., AND SHVAIKO, P. Ontology matching. Springer-Verlag New York Inc, 2007.

[30] EUZENAT, J., AND VALTCHEV, P. Similarity-based ontology alignment in owl-lite. In 16th European Conference on Artificial Intelligence, Valencia, Spain (2004), Ios Pr Inc, p. 333.

[31] GASEVIC, D., AND HATALA, M. Searching Web Resources Using Ontology Mappings. In Integrating Ontologies Workshop Proceedings (2005), Citeseer, p. 33.

[32] GIUNCHIGLIA, F., MALTESE, V., AND AUTAYEU, A. Computing minimal mappings. In At the 4th Ontology Matching Workshop at the ISWC (2009), Citeseer.

[33] GIUNCHIGLIA, F., MALTESE, V., FARAZI, F., AND DUTTA, B. GeoWordNet: a resource for geo-spatial applications. The Semantic Web: Research and Applications (2010), 121–136.

[34] GIUNCHIGLIA, F., MCNEILL, F., YATSKEVICH, M., PANE, J., BESANA, P., AND SHVAIKO, P. Approximate structure-preserving semantic matching. On the Move to Meaningful Internet Systems: OTM 2008 (2008), 1217–1234.

[35] GIUNCHIGLIA, F., AND SHVAIKO, P. Semantic matching. The Knowledge Engineering Review 18, 03 (2003), 265–280.

[36] GIUNCHIGLIA, F., SHVAIKO, P., AND YATSKEVICH, M. S-match: an algorithm and an implementation of semantic matching. The semantic web: research and applications (2004), 61–75.


[37] GIUNCHIGLIA, F., AND WALSH, T. A theory of abstraction. Artificial Intelligence 57, 2-3 (1992), 323–389.

[38] GROGER, G., KOLBE, T., AND CZERWINSKI, A. Opengis citygml implementation specification, 2006.

[39] GRUBER, T. What is an Ontology. Encyclopedia of Database Systems 1 (2008).

[40] GRUBER, T., ET AL. A translation approach to portable ontology specifications. Knowledge acquisition 5, 2 (1993), 199–220.

[41] HAKLAY, M., AND WEBER, P. Openstreetmap: User-generated street maps. IEEE Pervasive Computing (2008), 12–18.

[42] HAMMING, R. Error detecting and error correcting codes. Bell System Technical Journal 29, 2 (1950), 147–160.

[43] ISIKDAG, U., AND ZLATANOVA, S. Towards defining a framework for automatic generation of buildings in CityGML using Building Information Models. 3D Geo-Information Sciences (2009), 79–96.

[44] J., E., M., E., AND R., G.-C. Towards a methodology for evaluating alignment and matching algorithms. http://oaei.ontologymatching.org/doc/oaei-methods.1.pdf [31-05-2011]. Tech. rep., Ontology Alignment Evaluation Initiative, 2005.

[45] JAIN, P., HITZLER, P., SHETH, A., VERMA, K., AND YEH, P. Ontology alignment for linked open data. The Semantic Web–ISWC 2010 (2010), 402–417.

[46] JAIN, P., YEH, P., VERMA, K., VASQUEZ, R., ET AL. Contextual ontology alignment of lod with an upper ontology: A case study with proton. ESWC.

[47] JARO, M. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. Journal of the American Statistical Association 84, 406 (1989), 414–420.

[48] JEAN-MARY, Y., SHIRONOSHITA, E., AND KABUKA, M. Ontology matching with semantic verification. Web Semantics: Science, Services and Agents on the World Wide Web 7, 3 (2009), 235–251.

[49] KALFOGLOU, Y., AND SCHORLEMMER, M. Ontology mapping: the state of the art. The knowledge engineering review 18, 01 (2003), 1–31.

[50] KENNEDY, J., AND EBERHART, R. A discrete binary version of the particle swarm algorithm. In Systems, Man, and Cybernetics, 1997. 'Computational Cybernetics and Simulation', 1997 IEEE International Conference on (1997), vol. 5, IEEE, pp. 4104–4108.

[51] KLYNE, G., AND CARROLL, J. Resource description framework (rdf): Concepts and abstract syntax. Changes (2004).


[52] KOLBE, T. Representing and exchanging 3D city models with CityGML. 3D Geo-Information Sciences (2009), 15–31.

[53] KOLBE, T., GROGER, G., AND PLUMER, L. Citygml - interoperable access to 3d city models. In First International Symposium on Geo-Information for Disaster Management GI4DM (2005), Citeseer.

[54] LE BERRE, D., AND PARRAIN, A. The sat4j library, release 2.2 system description. Journal on Satisfiability, Boolean Modeling and Computation 7 (2010), 59–64.

[55] LEVENSHTEIN, V. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady (1966), vol. 10 (8), pp. 707–710.

[56] MAEDCHE, A., MOTIK, B., SILVA, N., AND VOLZ, R. MAFRA: a mapping framework for distributed ontologies. Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (2002), 69–75.

[57] MASCARDI, V., LOCORO, A., AND ROSSO, P. Automatic ontology matching via upper ontologies: A systematic evaluation. IEEE Transactions on Knowledge and Data Engineering (2009), 609–623.

[58] MCBRIDE, B. The resource description framework (rdf) and its vocabulary description language rdfs. Handbook on Ontologies (2004), 51–66.

[59] MILLER, G. Wordnet: a lexical database for english. Communications of the ACM 38, 11 (1995), 39–41.

[60] MOTIK, B., PATEL-SCHNEIDER, P., PARSIA, B., BOCK, C., FOKOUE, A., HAASE, P., HOEKSTRA, R., HORROCKS, I., RUTTENBERG, A., SATTLER, U., ET AL. OWL 2 web ontology language: Structural specification and functional-style syntax. W3C Recommendation 27 (2009).

[61] NAGEL, C., STADLER, A., AND KOLBE, T. Conceptual Requirements for the Automatic Reconstruction of Building Information Models from Uninterpreted 3d Models. In Proc. of the Academic Track of the Geoweb 2009 Conference on Cityscapes (2009), Citeseer, pp. 27–31.

[62] NEEDLEMAN, S., AND WUNSCH, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48, 3 (1970), 443–453.

[63] PEACHAVANISH, R., KARIMI, H., AKINCI, B., AND BOUKAMP, F. An ontological engineering approach for integrating cad and gis in support of infrastructure management. Advanced Engineering Informatics 20, 1 (2006), 71–88.

[64] PONZETTO, S., AND STRUBE, M. Deriving a large scale taxonomy from wikipedia. In Proceedings of the national conference on artificial intelligence (2007), vol. 22 (2), Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, p. 1440.


[65] POTTS, C. Software-engineering research revisited. Software, IEEE 10, 5 (1993), 19–28.

[66] RACINE, J. Gnuplot 4.0: a portable interactive plotting utility. Journal of Applied Econometrics 21, 1 (2006), 133–141.

[67] SHVAIKO, P., AND EUZENAT, J. A survey of schema-based matching approaches. Journal on Data Semantics IV (2005), 146–171.

[68] SHVAIKO, P., AND EUZENAT, J. Ten challenges for ontology matching. On the Move to Meaningful Internet Systems: OTM 2008 (2008), 1164–1182.

[69] STOILOS, G., STAMOU, G., AND KOLLIAS, S. A string metric for ontology alignment. The Semantic Web–ISWC 2005 (2005), 624–637.

[70] SUCHANEK, F., KASNECI, G., AND WEIKUM, G. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (2007), ACM, pp. 697–706.

[71] TANG, J., LIANG, B., LI, J., AND WANG, K. Risk minimization based ontology mapping. Content Computing (2004), 469–480.

[72] VAN BERLO, L., AND DE LAAT, R. Integration of BIM and GIS: The development of the CityGML GeoBIM extension. Advances in 3D Geo-Information Sciences (2011).

[73] WINKLER, W. The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau (1999), Citeseer.

[74] WU, Z., AND PALMER, M. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (1994), Association for Computational Linguistics, pp. 133–138.

[75] XU, P., WANG, Y., CHENG, L., AND ZANG, T. Alignment results of sobom for oaei 2010. Ontology Matching (2010), 203.

[76] ZLATANOVA, S. 3d gis for urban development. ITC dissertation series (Netherlands) (2000).