dh101 2013/2014 course 6 - semantic coding, rdf, cidoc-crm
Post on 06-May-2015
Digital Humanities 101 - 2013/2014 - Course 6
Digital Humanities Laboratory
Frederic Kaplan
frederic.kaplan@epfl.ch
Semester 1 : Content of each course
• (1) 19.09 Introduction to the course / Live Tweeting and Collective note
taking
• (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment
• (3) 2.10 Introduction to the Venice Time Machine project / Zotero
•9.10 No course
• (4) 16.10 Digitization techniques / Deadline first assignment
• (5) 23.10 Datafication / Presentation of projects
• (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first
assignment
• (7) 6.11 Pattern recognition / OCR / Semantic disambiguation
• (8) 13.11 Historical Geographical Information Systems, Procedural modelling
/ City Engine / Deadline Project selection
• (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap
• (10) 27.11 Cultural heritage interfaces and visualisation / Museographic
experiences
•4.12 Group work on the projects
•11.12 Oral exam / Presentation of projects / Deadline Project blog
•18.12 Oral exam / Presentation of projects
Objective of today’s course
•Show you the beauty and make you feel the power of semantic coding
•Give you a quick idea about what is behind the following strange acronyms :
RDF, URI, OWL, SPARQL, SWRL, CIDOC-CRM
•Motivate you to look deeper.
A short introduction to semantic coding
•Many good books exist. I recommend
this one.
• I will reuse some of their examples in the
following slides.
Doris Stockly
incanti.dhlab.ch
The simplest kind of dataset, that everyone is familiar with, is tabular data (any data kept in a table such as an Excel spreadsheet).
Data kept in a table is easy to display, sort, print and edit.
You might not even think of data in an Excel spreadsheet as modeled. But there are semantics in a data table. Where ?
There are also obvious limitations with this kind of storage.
You cannot search for the routes that stay more than 2 days at Corfu. Sorting the columns does not capture the deeper meaning of the text we entered.
Relational databases are a solution. Many very mature products exist like Oracle DB, MySQL and PostgreSQL. A relational database allows multiple tables to be joined in a standardized way.
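As a minimal sketch of such a join, the following uses Python's built-in sqlite3 module. The table and column names (route, stop, days) and the sample rows are assumptions made for illustration, not the schema shown on the slides.

```python
import sqlite3

# Hypothetical schema and rows, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE route (id TEXT PRIMARY KEY, departure TEXT);
    CREATE TABLE stop (route_id TEXT REFERENCES route(id), place TEXT, days INTEGER);
    INSERT INTO route VALUES ('R1', 'Venice'), ('R2', 'Venice');
    INSERT INTO stop VALUES ('R1', 'Corfu', 3), ('R2', 'Corfu', 1), ('R2', 'Candia', 4);
""")

# Join the two tables to find the routes that stay more than 2 days at Corfu.
long_corfu_stops = conn.execute("""
    SELECT route.id
    FROM route JOIN stop ON stop.route_id = route.id
    WHERE stop.place = 'Corfu' AND stop.days > 2
    ORDER BY route.id
""").fetchall()
print(long_corfu_stops)
```

Once the length of stay is a typed column rather than free text, the Corfu question becomes a one-line condition.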
But as our project evolves, we may need to reformat our tables. This is called schema migration, a painful process.
For big databases, schemas can get incredibly complex.
Trying to normalize these databases in a single schema is a labor-intensive process.
How to make future-proof schemata
•With this mode of coding we can easily add new properties (price of
route, captain, etc.). The schema is future-proof.
• In addition, the data about the data (i.e. the metadata, the names of
columns) is now part of the data itself.
•This is ideal for projects in perpetual beta.
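Assuming "this mode of coding" means storing one (subject, property, value) row per fact, the idea can be sketched in plain Python. The captain identifier C1 and the price value are hypothetical.

```python
# Each fact is one (subject, property, value) row.
triples = [
    ("R1", "departure", "Venice"),
    ("R1", "departure-date", "2.7.1422"),
]

# Adding new properties needs no schema change: we simply append rows.
triples.append(("R1", "captain", "C1"))   # hypothetical captain identifier
triples.append(("R1", "price", "100"))    # hypothetical value

# The former column names are now ordinary data that can be listed and queried.
predicates = sorted({p for (_, p, _) in triples})
print(predicates)
```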
And most importantly, it makes a direct and simple connection with a well-developed research field : logic.
Indeed, this can be written in a different way
RDF statements
• (Subject Predicate Object)
• (R1 departure Venice)
•This is called an RDF statement, an atomic relation in a database
• (R1 departure-date 2.7.1422)
This is a graph
As RDF statements can be understood both as logic statements and as parts of a graph, one can use many tools and ideas from logic and graph theory to manipulate them.
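The two readings can be sketched side by side in plain Python. This is only an illustration using the two statements from the slides; the variable names are mine.

```python
from collections import defaultdict

triples = [
    ("R1", "departure", "Venice"),
    ("R1", "departure-date", "2.7.1422"),
]

# Logic view: each triple reads as a binary predicate, predicate(subject, object).
facts = [f"{p}({s}, {o})" for s, p, o in triples]

# Graph view: subjects and objects are nodes, predicates label the directed edges.
edges = defaultdict(list)
for s, p, o in triples:
    edges[s].append((p, o))

print(facts)
print(dict(edges))
```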
URIs
•The nodes of the graph are called resources.
•When you want to coordinate multiple datasets it can become
increasingly difficult to guarantee unique and consistent identifiers for
each node.
•R1 that we use in our database may mean something else in another
database.
•For naming resources, RDF uses URIs (Uniform Resource Identifiers) and
an optional fragment identifier.
•You are probably familiar with URLs (Uniform Resource Locators), the
strings used to specify how web pages are retrieved.
•URIs generalize this concept further by saying that anything, whether you
can retrieve it electronically or not, can be uniquely identified in a similar
way.
Since URIs can identify anything as a resource, the subject of an RDF statement can be a resource, the object can be a resource and, most importantly, predicates are always resources.
An example of a URIRef for a common RDF predicate
It is common in RDF to shorten URIs by assigning a namespace prefix to the base URI and writing only the distinctive part of the identifier. The previous URI can be written in a shorter manner : rdf:type
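A small sketch of how a prefixed name expands back to a full URI. The two namespace URIs are the standard W3C ones for RDF and RDF Schema; the helper name expand is my own.

```python
# Standard W3C namespace URIs for the rdf: and rdfs: prefixes.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(qname: str) -> str:
    """Turn a prefixed name such as 'rdf:type' into its full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("rdf:type"))
```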
Serialization
•While the data model that RDF uses is very simple, the serialized
representation tends to get complicated when an RDF graph is saved in a
file or sent over a network.
•Different serialization formats exist, including N3, RDF/XML (the most
frequently used) and RDFa (RDF in attributes)
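To give a feel for what serialization means, here is a simplified sketch that writes triples in N-Triples, a line-based W3C format chosen here only because it is the easiest to produce by hand. The base URI http://example.org/ is hypothetical, and the function does not escape special characters in literals.

```python
EX = "http://example.org/"  # hypothetical base URI, not from the slides

def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) rows as N-Triples lines."""
    lines = []
    for s, p, o, is_literal in triples:
        # Literals are quoted, resources are wrapped in angle brackets.
        obj = f'"{o}"' if is_literal else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

data = [
    (EX + "R1", EX + "departure", EX + "Venice", False),
    (EX + "R1", EX + "departure-date", "2.7.1422", True),
]
serialized = to_ntriples(data)
print(serialized)
```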
Vocabularies
•A set of URIRefs is known as a vocabulary.
•We can design a specific vocabulary for our maritime route examples.
•There are also famous vocabularies like the RDF vocabulary (the set of
URIRefs describing the RDF concepts, e.g. rdf:resource, rdf:type)
SPARQL
•Just as SQL provides a standard query language across relational
databases, SPARQL provides a query language for RDF graphs.
(pronounced "sparkle")
•SPARQL queries attempt to match patterns in the graph and bind
wildcard variables as they find solutions.
•Departure(?x1, Venice)
•Captain(?x1, ?x2), Gender(?x2, Woman)
•Semantic coding is all about asking bigger questions.
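The pattern-matching idea can be sketched with a tiny matcher in plain Python. This is not a SPARQL engine, only an illustration of binding variables against triples; the routes, captains and the ex: vocabulary in the comment are hypothetical.

```python
# The same question in SPARQL syntax, assuming a hypothetical ex: vocabulary:
#   PREFIX ex: <http://example.org/>
#   SELECT ?x1 ?x2 WHERE {
#     ?x1 ex:departure ex:Venice .
#     ?x1 ex:captain ?x2 .
#     ?x2 ex:gender ex:Woman .
#   }

def match(pattern, triple, binding):
    """Unify one pattern with one triple; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings

triples = [
    ("R1", "departure", "Venice"), ("R1", "captain", "C1"), ("C1", "gender", "Woman"),
    ("R2", "departure", "Venice"), ("R2", "captain", "C2"), ("C2", "gender", "Man"),
]
solutions = query(
    [("?x1", "departure", "Venice"), ("?x1", "captain", "?x2"), ("?x2", "gender", "Woman")],
    triples,
)
print(solutions)
```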
SWRL
•With RDF coding, we can also write rules to infer new triples.
• If hasParent(?x1, ?x2) and hasBrother(?x2, ?x3) then hasUncle(?x1, ?x3)
•This is also a way of detecting possible inconsistencies in the
knowledge coded in the triple store (e.g. actors doing things after their death).
•One standard language to do this is SWRL (Semantic Web Rule
Language).
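The uncle rule can be sketched as a single forward-chaining step in plain Python. This is not SWRL itself, only the same inference written by hand; the sample names are illustrative.

```python
def infer_uncles(triples):
    """Apply: hasParent(x1, x2) and hasBrother(x2, x3) => hasUncle(x1, x3)."""
    parents = [(s, o) for s, p, o in triples if p == "hasParent"]
    brothers = [(s, o) for s, p, o in triples if p == "hasBrother"]
    return {(x1, "hasUncle", x3)
            for x1, x2 in parents
            for y2, x3 in brothers
            if x2 == y2}

family = [
    ("Marco", "hasParent", "Niccolo"),
    ("Niccolo", "hasBrother", "Maffeo"),
]
inferred = infer_uncles(family)
print(inferred)
```

The inferred triples can then be added back to the store, where they are queried like any other statement.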
Ontologies
•An ontology provides a special vocabulary with which knowledge can be
represented.
•This vocabulary allows us to specify which entities will be represented,
how they can be grouped and which relationships connect them together.
• (Venice isa Place), (Corfu isa Place), (Place haslat latitude), (Place
haslong longitude)
•Now, something very beautiful...
An ontology can be expressed as RDF triples and stored in a graph alongside the data it describes.
OWL
•OWL (Web Ontology Language) is an ontology language layered on top
of RDF and RDFS
•Terminology statements
• ex:Bridge rdf:type rdfs:Class
• ex:Bridge rdfs:subClassOf ex:Place
•Assertion statements
• ex:Rialto rdf:type ex:Bridge
• ex:RialtoCons ex:broughtIntoExistence ex:Rialto
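Because terminology and assertions live in the same graph, a program can combine them. The sketch below follows rdfs:subClassOf links by hand to conclude that the Rialto is also a Place; a real reasoner would do this for you, and the function name types_of is my own.

```python
triples = {
    # Terminology statements
    ("ex:Bridge", "rdf:type", "rdfs:Class"),
    ("ex:Bridge", "rdfs:subClassOf", "ex:Place"),
    # Assertion statement
    ("ex:Rialto", "rdf:type", "ex:Bridge"),
}

def types_of(resource, triples):
    """Return all classes of a resource, following rdfs:subClassOf upward."""
    found = set()
    todo = [o for s, p, o in triples if s == resource and p == "rdf:type"]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(o for s, p, o in triples
                        if s == cls and p == "rdfs:subClassOf")
    return found

print(types_of("ex:Rialto", triples))
```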
It is relatively easy to create your own ontology using software like Protégé. But some ontologies aim at being universal.
CIDOC-CRM
•CIDOC-CRM is an ontology for Cultural heritage.
•About 20 years of work.
•An ISO standard (ISO 21127).
•100+ schema. Very stable.
•CIDOC-CRM is an attempt to formalise an underlying semantics common
to many classifications. It includes very interesting ideas.
CIDOC-CRM : Events
• In CIDOC-CRM, the modelling is event-centric.
•The underlying idea is to model change, not state. Therefore, temporal
entities play a central role.
• Instead of coding the birthdate of an actor, it is better to code the event
of its birth.
Actors relate to things only via temporal entities and events.
•The participation or presence of several non-temporal entities in an event
e1 allows us to conclude that they were in the same time interval and
space, even without knowledge of the particular time or space.
•They must have existed at that time. They have not been somewhere
else at that time (with electronic communication, the space volume in
which events occur can become very large).
•The events e0i of creation of each participant i have happened before or
at the time of e1. The events e2i of destruction (or vanishing) of each
participant have happened after or at the time of e1.
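These ordering constraints can be checked mechanically. The sketch below simplifies events to single years and flags participants whose creation and destruction do not bracket the event; the actor names and years are invented.

```python
def impossible_participants(event_year, lifespans):
    """Return participants whose existence does not cover the event year.

    lifespans maps a name to (creation_year, destruction_year).
    """
    return sorted(name for name, (created, destroyed) in lifespans.items()
                  if not (created <= event_year <= destroyed))

# Hypothetical participants of an event dated 1422.
lifespans = {"ActorA": (1380, 1440), "ActorB": (1390, 1420)}
suspects = impossible_participants(1422, lifespans)
print(suspects)
```

Here ActorB is flagged because the recorded destruction (1420) precedes the event, which signals an error somewhere in the coded knowledge.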
CIDOC-CRM : Properties
•The property P11 had participants denotes active or passive involvement
of Actors, whereas P12 occurred in the presence of ranges from objects
just being there (e.g. a desk where a treaty was signed)
•The properties P92 brought into existence, P93 took out of existence are
limiting the existence of things which have a persistent existence.
CIDOC-CRM : Place
•CIDOC-CRM has also implemented a very interesting model for places.
What is hard about places ?
•The question "where is it ?" can be answered in natural language by relation
to two different kinds of entities : geometric areas or objects.
In France, in Athens, 39N 124E. Points given by spatial coordinates are typically understood as the centre of a wider, extended area.
On Mount St Helens, at the Rhine river.
on Queen Elizabeth (the ship), in my suitcase, at home.
•Following the CIDOC CRM, geometric areas (E53 Place) can only be
defined relative to larger objects, including the surface of earth.
•Those objects in turn may be located at different times at different places
(relative to a larger object).
•The cultural interest is in the relation to other things and not to an
abstract absolute space. Absolute coordinates seem to make no sense
when the reference objects move.
•As historical information is incomplete and sparse, and many reference
objects move, normalization of place information to absolute coordinates
should not replace the primary information, which is typically relative.
CIDOC-CRM : Influence
•Another problematic issue is the notion of influence. It is difficult to
develop a systematic understanding of the different forms of influence
and their mutual relations
•Some are more physical, like using a mould or a tool. The influence of a
mould on a produced object is strong and can often be verified on the
object afterwards. The influence of a hammer is less specific.
•Similarly, making a copy of a painting has a strong influence on the
product, copying the idea of a painting, a weak one. The latter is more
an intellectual influence than a physical one.
• If a real influence existed, a temporal sequence can be deduced.
Summary : Guidelines for coding historical data
(1) Prefer events to properties. Actors do not have properties, they participate in events. Instead of coding the birthdate of an actor, it is better to code the event of its birth.
(2) Code date intervals instead of dates. This is much more flexible and makes it possible to detect inconsistencies.
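One way to see the benefit: two interval claims about the same event can be intersected. A non-empty intersection narrows the dating, an empty one reveals an inconsistency. The sketch uses the 1588-1591 span from the Rialto example; the other intervals are hypothetical.

```python
def intersect(a, b):
    """Return the overlap of two (start_year, end_year) intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None

rialto = (1588, 1591)
print(intersect(rialto, (1590, 1595)))  # compatible claim: the dating narrows
print(intersect(rialto, (1600, 1610)))  # incompatible claim: no overlap
```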
(3) Code places in a relative manner and not an absolute manner. The cultural interest is in the relation to other things and not to an abstract absolute space. Absolute coordinates seem to make no sense when the reference objects move.
All this is very beautiful, but is it sufficient to do the kind of historical modeling we want to do ? We have an issue. Which one ?
Metaknowledge : Knowledge about how knowledge is produced.
How can we encode metaknowledge ?
•Expressed knowledge (RDF triples) is not in the same space as resources
(URIs). We can easily attach new information to a resource but not to a
triple.
• It is not easy to represent metaknowledge such as the origin of the
uncertainty attached to a piece of information.
•To overcome this issue we need to introduce two levels of knowledge and
use a trick.
Reified RDF vs. Standard RDF
•An expressed RDF statement (RialtoReconstruction hasTimeSpan 1588-1591) can
be transformed into 3 reified triples
• (s1 rdf:subject RialtoReconstruction)
• (s1 rdf:predicate hasTimeSpan)
• (s1 rdf:object 1588-1591)
• (s1 metardf:reliability 0.8)
• (s1 metardf:creator FredericKaplan)
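The transformation can be sketched as a small helper. The metardf: prefix is taken from the slide; the function name reify is my own. The standard RDF reification vocabulary also defines a fourth triple, (s1 rdf:type rdf:Statement), which is left out here to match the slide.

```python
def reify(statement_id, triple, **meta):
    """Turn one triple into reified triples plus metadata about the statement."""
    s, p, o = triple
    reified = [
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]
    # The statement now has its own identifier, so metadata can point at it.
    reified += [(statement_id, f"metardf:{key}", value) for key, value in meta.items()]
    return reified

reified = reify(
    "s1",
    ("RialtoReconstruction", "hasTimeSpan", "1588-1591"),
    reliability=0.8,
    creator="FredericKaplan",
)
for row in reified:
    print(row)
```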
Possible historical spaces
•Now our RDF store includes both historical knowledge and knowledge
about the creation of this historical knowledge.
•These kinds of metainformation can document all the construction
phases (whether realized by humans or machines)
•With this approach, we can extract through queries the historical
knowledge corresponding to some specific sources and thus create a
possible historical reality.
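A sketch of such an extraction in plain Python: reified statements are filtered on their metadata and the surviving ones are rebuilt into plain triples. Statement s2, its creator and its values are hypothetical, as is the reliability threshold.

```python
def possible_reality(reified, keep):
    """Rebuild plain triples from reified statements whose metadata passes keep()."""
    statements = {}
    for sid, p, o in reified:
        statements.setdefault(sid, {})[p] = o
    return [(d["rdf:subject"], d["rdf:predicate"], d["rdf:object"])
            for d in statements.values() if keep(d)]

store = [
    ("s1", "rdf:subject", "RialtoReconstruction"),
    ("s1", "rdf:predicate", "hasTimeSpan"),
    ("s1", "rdf:object", "1588-1591"),
    ("s1", "metardf:reliability", 0.8),
    ("s1", "metardf:creator", "FredericKaplan"),
    # Hypothetical competing statement from another source.
    ("s2", "rdf:subject", "RialtoReconstruction"),
    ("s2", "rdf:predicate", "hasTimeSpan"),
    ("s2", "rdf:object", "1585-1590"),
    ("s2", "metardf:reliability", 0.3),
    ("s2", "metardf:creator", "OtherSource"),
]

# One possible historical reality: keep only statements judged reliable enough.
reliable = possible_reality(store, lambda d: d.get("metardf:reliability", 0) >= 0.5)
print(reliable)
```

Changing the filter (for instance selecting on metardf:creator instead) yields a different reconstruction from the same store.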
Summary
Encoding metahistorical information
•We must not only model historical information, but model each step of
the construction of historical knowledge.
•There is a need for a semantic framework capable of coding historical
information and meta-historical information.
•Coding meta-historical information implies documenting the choice of
sources, transcription phases, interpretation processes realized by humans
or machines.
No unique global truth but fully documented possible historical reconstructions