presentation for aginfra hackathon in athens 12th december 2013


Upload: jane-bromley

Post on 08-Jul-2015


DESCRIPTION

Using SPARQL to locate specific educational material on OpenLearn (from the Open University)

TRANSCRIPT

Page 1: Presentation for agINFRA Hackathon in Athens 12th December 2013

1

Supported by EU projects

12/12/2013 Athens, Greece

Open Data in Agriculture

Hands-on with data infrastructures that can power your agricultural data products

Page 2: Presentation for agINFRA Hackathon in Athens 12th December 2013

2

OpenLearn and the SPARQL endpoint

Page 3: Presentation for agINFRA Hackathon in Athens 12th December 2013

Maths, Computing and Technology Faculty The Open University Walton Hall Milton Keynes MK7 6AA

www.open.ac.uk
mct-research.open.ac.uk

Jane Bromley    David King      David Morse

Presenter
Presentation Notes
Page 4: Presentation for agINFRA Hackathon in Athens 12th December 2013

4

*open*

Presenter
Presentation Notes
A lot of interesting information about the OU is available here: http://www.open.ac.uk/about/main/
The OU was founded in 1969 to open up higher education to all, regardless of their circumstances or where they live. It is a distance learning and research university with more than 240k students. There are no formal entry requirements. Its aim is to widen participation in education. We have a lot of older students, but more younger ones are also coming, attracted by lower costs.
As part of this "open" mission, the OU is making an increasing amount of its teaching and learning resources available free of charge to anyone with access to the internet, no matter where in the world they live.
The four biggest open access schemes are:
iTunes U http://www.open.ac.uk/itunes/
OUView on YouTube http://www.youtube.com/ou
Open Research Online http://oro.open.ac.uk/ (one of the largest university research publications collections in the UK)
And… (next slide)
Page 5: Presentation for agINFRA Hackathon in Athens 12th December 2013

5

Objectives

An introduction to the Open University’s free material

• Show available metadata

• Talk about RDF – the format used for graph databases

• How to query the material through SPARQL

Page 6: Presentation for agINFRA Hackathon in Athens 12th December 2013

6

http://www.open.edu/openlearn/body-mind/the-real-story-behind-cereals

Presenter
Presentation Notes
OpenLearn makes some OU course material and other educational resources available free of charge to potential learners anywhere in the world. They don't need to register as students.
What can you do on OpenLearn?
Explore: learning material created by the OU with input from experts, much of it linked to TV and radio programmes made with the BBC; comment and share with friends.
Try: prepare for studying with the OU (or somewhere else) by enrolling on one of the 650 free courses.
You will find short articles like this one about cereals. It's "posted under" (i.e. tagged) Body & Mind, Money & Management, Health, Business Studies, Environmental Decision Making, Technology, The Environment, Nature & Environment.
The OU doesn't have an agricultural research department, but related research and teaching is carried out. This is the material that is being made available to agINFRA and here today.
Page 7: Presentation for agINFRA Hackathon in Athens 12th December 2013

7

http://www.open.edu/openlearn/nature-environment/good-food-destroying-biodiversity

Presenter
Presentation Notes
This one was made for the International Day for Biodiversity (22 May 2013). OpenLearn is a dynamic site: it is tied not just to courses but also to radio and TV shows and to events. This page is a good example. So this isn't a site that you harvest once; you will need to return to it "regularly".
Page 8: Presentation for agINFRA Hackathon in Athens 12th December 2013

8

http://www.open.edu/openlearn/science-maths-technology/science/biofuels/content-section-0

Presenter
Presentation Notes
There are whole units, e.g. this one on Biofuels: http://www.open.edu/openlearn/science-maths-technology/science/biofuels/content-section-0
5 hours at Introductory Level. Posted under: Science.
You can see a link to study it as part of the course "Learn about plants and people", course code S173, a 10-point course.
Page 9: Presentation for agINFRA Hackathon in Athens 12th December 2013

9

Open Research Online – publications originating from OU researchers
OU Podcasts
Course Descriptions
Some KMi datasets
And…

Presenter
Presentation Notes
All very nice, but how can other people use this material easily?
KMi, the Knowledge Media Institute (http://kmi.open.ac.uk/), are specialists in knowledge management and the semantic web. KMi set up data.open.ac.uk to be the home of open linked data from The Open University. It was developed to extract, interlink and expose data available in various institutional repositories of the University and make it openly available for reuse.
data.open.ac.uk hosts datasets obtained from public data repositories at the Open University, and applications making use of these data. The datasets currently relate to publications, courses and audio/video material produced at the Open University, as well as the people involved in making them. They are available through standard formats (RDF and SPARQL) and, in most cases, under an open license (Creative Commons Attribution 3.0 Unported License).
The data itself is hosted under various named graphs and namespaces, but uses a consistent URI scheme (under http://data.open.ac.uk) to ensure the connection between the exposed datasets.
Page 10: Presentation for agINFRA Hackathon in Athens 12th December 2013

10

http://data.open.ac.uk/site/datasets.html

Available through standard formats (RDF and SPARQL)

Page 11: Presentation for agINFRA Hackathon in Athens 12th December 2013

11

Resource Description Framework
• one of the basic building blocks forming the web of semantic data
• defines a graph data model
• format defines statements comprising:

Subject is the T-shirt
Predicate (property) is the colour
Object is white

The subject -> predicate -> object relationship is called a triple.

RDF

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:feature="http://www.linkeddatatools.com/clothing-features#">

  <rdf:Description rdf:about="http://www.linkeddatatools.com/clothes#t-shirt">
    <feature:color rdf:resource="http://www.linkeddatatools.com/colors#white"/>
  </rdf:Description>

</rdf:RDF>

RDF/XML - the XML form of RDF
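To make the subject, predicate and object concrete, here is a minimal Python sketch that reads the T-shirt example with the standard library XML parser and prints the triple it contains. It is an illustration only: the helper name extract_triples is hypothetical, and it handles just this flat rdf:Description pattern rather than RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The T-shirt example from the slide.
RDF_XML = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:feature="http://www.linkeddatatools.com/clothing-features#">
  <rdf:Description rdf:about="http://www.linkeddatatools.com/clothes#t-shirt">
    <feature:color rdf:resource="http://www.linkeddatatools.com/colors#white"/>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    triples = []
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree stores tags as "{namespace}local"; joining both
            # parts gives the full predicate URI.
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            # The object is either a resource URI or a literal text value.
            obj = prop.get("{%s}resource" % RDF_NS) or prop.text
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

For real data a dedicated RDF library would be a better choice than hand-rolled XML handling; the point here is only to show where the three parts of the triple sit in the markup.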

Presenter
Presentation Notes
This is an intro slide for RDF. See http://www.w3.org/RDF/ and the primer http://www.w3.org/TR/rdf-primer/. http://www.linkeddatatools.com/introducing-rdf is a good background introduction.

// The first line says you need an XML parser to interpret this document, and that the character
// set used is UTF-8, i.e. a wide range of characters, not just the Latin alphabet.
<?xml version="1.0" encoding="UTF-8"?>
// States that the enclosing document is an RDF document.
<rdf:RDF
// The rdf namespace resides at the standard W3.org URI http://www.w3.org/1999/02/22-rdf-syntax-ns#,
// which is conventionally associated with the QName prefix rdf. When "rdf" is used subsequently, it means this namespace.
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
// Defines the XML namespace feature by giving it the namespace URI http://www.linkeddatatools.com/clothing-features#.
xmlns:feature="http://www.linkeddatatools.com/clothing-features#">
// The rdf:Description tag (can contain one or more statements about the same subject).
<rdf:Description
// "I'm going to describe something (a subject) and I'm giving it the unique ID http://www.linkeddatatools.com/clothes#t-shirt".
rdf:about="http://www.linkeddatatools.com/clothes#t-shirt">
// The subject has a property/predicate named feature:color with object http://www.linkeddatatools.com/colors#white.
<feature:color rdf:resource="http://www.linkeddatatools.com/colors#white"/>
</rdf:Description>
</rdf:RDF>

// Or more clearly, it has this structure:
<rdf:Description rdf:about="subject">
<predicate rdf:resource="object" />
<predicate>literal value</predicate>
</rdf:Description>

Glossary
XML = Extensible Markup Language http://www.w3.org/TR/2000/REC-xml-20001006
Character encoding UTF-8: UCS Transformation Format, 8-bit, is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. It allows for any character set, e.g. Greek, Japanese…
Page 12: Presentation for agINFRA Hackathon in Athens 12th December 2013

12

http://data.open.ac.uk/query

The SPARQL endpoint

Presenter
Presentation Notes
The query syntax is SPARQL (SPARQL Protocol and RDF Query Language, pronounced "sparkle"). This is the SPARQL endpoint: the target URI where you can submit your query. It is useful for trying out and developing your queries before writing the code.
Page 13: Presentation for agINFRA Hackathon in Athens 12th December 2013

13

select distinct ?props
from <http://data.open.ac.uk/context/openlearn>
where { ?subj ?props ?obj }

Presenter
Presentation Notes
Are you familiar with SQL or similar? Then this should look familiar. See http://www.w3.org/TR/sparql11-query/ for a full description of SPARQL.
This SPARQL query returns a list of the properties used in OpenLearn. The query consists of these parts:
The SELECT clause identifies the variables to appear in the query results.
FROM specifies the dataset to be queried.
The WHERE clause provides the basic graph pattern to match against the data graph.
DISTINCT means you want unique results, no duplicates.
The graph pattern is a triple pattern (subject, predicate and object). Variables are prefixed with a question mark.
Let's explore this further by taking a closer look at the example above. This SELECT DISTINCT statement requests the unique values of the variable ?props. FROM restricts the selection to OpenLearn material. ?subj and ?obj are variables too, but they are not returned. All three variables are matched against the data graph: the subject, predicate and object of each matching triple are bound to the corresponding variable names. Hence, in the above SPARQL query, ?props returns all the predicates (properties) in the data graph.
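Once a query works in the endpoint's web form, the same query can be sent from code. The sketch below only builds the request URL for the properties query and does not contact the server. The query and output parameter names follow the URLs used elsewhere in this talk; the helper name build_query_url is hypothetical, and whether the endpoint still answers at this address is an assumption.

```python
import urllib.parse

# Endpoint address as shown on the slide.
ENDPOINT = "http://data.open.ac.uk/query"

QUERY = """
select distinct ?props
from <http://data.open.ac.uk/context/openlearn>
where { ?subj ?props ?obj }
"""


def build_query_url(query, output="json"):
    """Return the endpoint URL with the query percent-encoded as a parameter."""
    # Collapse the line breaks so the query travels as a single line.
    one_line = " ".join(query.split())
    params = urllib.parse.urlencode({"query": one_line, "output": output})
    return "{}?{}".format(ENDPOINT, params)


url = build_query_url(QUERY)
print(url)
```

The resulting URL can be pasted into a browser or passed to urllib.request.urlopen.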
Page 14: Presentation for agINFRA Hackathon in Athens 12th December 2013

14

Presenter
Presentation Notes
So these are all the predicates / properties in http://data.open.ac.uk/context/openlearn
Result sets are illustrated in tabular form:

x          y                     z
"Alice"    <http://example/a>

A 'binding' is a pair (variable, RDF term). In this result set, there are three variables: x, y and z (shown as column headers). Each solution is shown as one row in the body of the table. Here, there is a single solution, in which variable x is bound to "Alice", variable y is bound to <http://example/a>, and variable z is not bound to an RDF term. Variables are not required to be bound in a solution.
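The same result set can be read programmatically. As a rough sketch, the Python below takes a result in the standard SPARQL JSON results layout (head.vars plus results.bindings) and turns each solution into a row, leaving unbound variables as None. The sample data mirrors the Alice example above; the helper name bindings_to_rows is hypothetical.

```python
import json

# The single-solution example from the notes, in SPARQL JSON results layout.
SAMPLE = """
{
  "head": {"vars": ["x", "y", "z"]},
  "results": {
    "bindings": [
      {
        "x": {"type": "literal", "value": "Alice"},
        "y": {"type": "uri", "value": "http://example/a"}
      }
    ]
  }
}
"""


def bindings_to_rows(results):
    """Return (variables, rows); each row maps a variable to its value or None."""
    variables = results["head"]["vars"]
    rows = []
    for solution in results["results"]["bindings"]:
        # A variable that is missing from a solution is simply unbound.
        rows.append({var: solution.get(var, {}).get("value") for var in variables})
    return variables, rows


variables, rows = bindings_to_rows(json.loads(SAMPLE))
print(variables)
print(rows)
```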
Page 15: Presentation for agINFRA Hackathon in Athens 12th December 2013

15

http://www.open.edu/openlearn/science-maths-technology/science/biofuels/content-section-0

Page 16: Presentation for agINFRA Hackathon in Athens 12th December 2013

16

http://data.open.ac.uk/page/openlearn/s173_1

Presenter
Presentation Notes
These are the predicates / properties and objects of the subject s173_1. You can see that what you get back could be a string or a URI.
Page 17: Presentation for agINFRA Hackathon in Athens 12th December 2013

17

How to find agriculturally useful  material in OpenLearn?

Presenter
Presentation Notes
On the previous slide, the property Title has the value Biofuels.
Subject has the values:
ou-topic:wood
ou-topic:change
ou-topic:fuel
ou-topic:energy
Science
ou-topic:biogas
ou-topic:biofuels
ou-topic:climate
ou-topic:photosynthesis
ou-topic:global_warming
I decided the "subject" value was probably the best way to decide if the document was related to agriculture.
Page 18: Presentation for agINFRA Hackathon in Athens 12th December 2013

18

A three step process:

1. Find all the subjects and choose those relevant to agriculture

2. Find all the OpenLearn Units that have these subjects

3. Collect the metadata for each of the selected OpenLearn units

Presenter
Presentation Notes
The results are only as good as the subject keywords that were given to the material.
Page 19: Presentation for agINFRA Hackathon in Athens 12th December 2013

19

Presenter
Presentation Notes
select distinct ?t where {
  ?olu a <http://data.open.ac.uk/openlearn/ontology/OpenLearnUnit> .
  ?olu <http://purl.org/dc/terms/subject> ?t
}

http://www.w3.org/TR/rdf-sparql-query/
Most forms of SPARQL query contain a set of triple patterns called a basic graph pattern. Triple patterns are like RDF triples except that each of the subject, predicate and object may be a variable. A basic graph pattern matches a subgraph of the RDF data when RDF terms from that subgraph may be substituted for the variables and the result is an RDF graph equivalent to the subgraph.
"a" means rdf:type, i.e. http://www.w3.org/1999/02/22-rdf-syntax-ns#type

Or, in the browser address bar:
http://data.open.ac.uk/query?query=select distinct ?t where {?olu a <http://data.open.ac.uk/openlearn/ontology/OpenLearnUnit>. ?olu <http://purl.org/dc/terms/subject> ?t }&output=xml&stylesheet=/openrdf-workbench/transformations/tuple.xsl

Or David's Python code:

import json
import urllib.parse
import urllib.request


def get_subjects():
    ''' returns data.open.ac.uk subject values '''
    query_text = ('select distinct ?topic where { '
                  '?olu a <http://data.open.ac.uk/openlearn/ontology/OpenLearnUnit>. '
                  '?olu <http://purl.org/dc/terms/subject> ?topic }')
    # percent-encode the query so it can travel in the URL
    query_encoded = urllib.parse.quote(query_text)
    req = urllib.request.Request('http://data.open.ac.uk/query?query={}&output=json'.format(query_encoded))
    f = urllib.request.urlopen(req)
    return f.read().decode('utf-8')


if __name__ == '__main__':
    with open('subjects.txt', 'w', encoding='utf-8', newline='\n') as f:
        results = json.loads(get_subjects())
        # keep only the last path segment of each topic URI
        subjects = sorted([topic['topic']['value'].rpartition('/')[2] for topic in results['results']['bindings']])
        f.write('\n'.join(subjects))
Page 20: Presentation for agINFRA Hackathon in Athens 12th December 2013

20

(1130) as of end of October 2013
http://data.open.ac.uk/topic/psychology
http://data.open.ac.uk/topic/sociology
http://data.open.ac.uk/topic/social_care
http://data.open.ac.uk/topic/educational_practice
http://data.open.ac.uk/topic/biology
http://data.open.ac.uk/topic/herbicides
http://data.open.ac.uk/topic/energy
official1342688874openlearn_teamadmin
http://data.open.ac.uk/topic/units
default1330523206frank_siebertzz884926
http://data.open.ac.uk/topic/pre_course_work
default1263940536linda_smithlps32
http://data.open.ac.uk/topic/employment
official1342688874richard_howesrh4685
http://data.open.ac.uk/topic/using_maths
default1231080717peter_mcalisterzz298445
http://data.open.ac.uk/topic/numbers
default1330523196elizabeth_ellisee944
http://data.open.ac.uk/topic/nuclear
official1342688874lucy_hendylmf7
http://data.open.ac.uk/topic/environmental_science
http://data.open.ac.uk/topic/audio
http://data.open.ac.uk/topic/cctv
http://data.open.ac.uk/topic/social_work
http://data.open.ac.uk/topic/scotland
http://data.open.ac.uk/topic/personalisation
http://data.open.ac.uk/topic/religious_studies
http://data.open.ac.uk/topic/religion
…

Presenter
Presentation Notes
As you can see, there are some funny/strange topics. I edited this list by hand to select topics that seemed relevant to agriculture. However, material is constantly being added to OpenLearn, so this step may have to be repeated from time to time.
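Because the hand selection has to be repeated as material is added, it may help to review only the subjects that are new since the last pass. The sketch below is one possible way to do that and is not part of the original scripts: the function name new_subjects is hypothetical, and "soil_science" is a made-up example value.

```python
def new_subjects(current, reviewed):
    """Return the subjects in `current` that are not in `reviewed`, sorted."""
    return sorted(set(current) - set(reviewed))


# Subjects already looked at in an earlier pass.
reviewed = ["agriculture", "biology", "herbicides"]
# A fresh subject list from the endpoint; "soil_science" is a hypothetical addition.
current = ["agriculture", "biology", "herbicides", "soil_science"]

to_review = new_subjects(current, reviewed)
print(to_review)
```

Only the subjects printed here would need a manual relevance check on the next pass.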
Page 21: Presentation for agINFRA Hackathon in Athens 12th December 2013

21

40 topics chosen:
<http://data.open.ac.uk/topic/agriculture>,
<http://data.open.ac.uk/topic/environment>,
<http://data.open.ac.uk/topic/the_environment>,
<http://data.open.ac.uk/topic/nature_&amp_environment>,
<http://data.open.ac.uk/topic/environmental_science>,
<http://data.open.ac.uk/topic/herbicides>,
<http://data.open.ac.uk/topic/ecology>,
<http://data.open.ac.uk/topic/genetics>,
<http://data.open.ac.uk/topic/diversity>,
<http://data.open.ac.uk/topic/global_warming>,
<http://data.open.ac.uk/topic/biodiversity>,
<http://data.open.ac.uk/topic/pollution>,
<http://data.open.ac.uk/topic/conservation>,
<http://data.open.ac.uk/topic/the_environment>,
<http://data.open.ac.uk/topic/climate>,
<http://data.open.ac.uk/topic/environmental_studies>,
<http://data.open.ac.uk/topic/climate_change>,
<http://data.open.ac.uk/topic/sustainability>,
<http://data.open.ac.uk/topic/biogas>,
<http://data.open.ac.uk/topic/biofuels>,
<http://data.open.ac.uk/topic/photosynthesis>,
<http://data.open.ac.uk/topic/waste_management>,
<http://data.open.ac.uk/topic/landfill>,
<http://data.open.ac.uk/topic/economic_growth>,
<http://data.open.ac.uk/topic/waste>,
<http://data.open.ac.uk/topic/acid_rain>,
<http://data.open.ac.uk/topic/weather>,
<http://data.open.ac.uk/topic/meteorology>,
<http://data.open.ac.uk/topic/natural_resources>,
<http://data.open.ac.uk/topic/animals>,
<http://data.open.ac.uk/topic/ecological_sustainability>,
<http://data.open.ac.uk/topic/overfishing>,
<http://data.open.ac.uk/topic/ecosystem>,
<http://data.open.ac.uk/topic/the_end_of_nature>,
<http://data.open.ac.uk/topic/survival_of_the_fittest>,
<http://data.open.ac.uk/topic/barter>,
<http://data.open.ac.uk/topic/plants>,
<http://data.open.ac.uk/topic/freshwater>,
<http://data.open.ac.uk/topic/maps>,
<http://data.open.ac.uk/topic/food>

Topics relevant to agriculture?

Page 22: Presentation for agINFRA Hackathon in Athens 12th December 2013

22

A three step process:

1. Find all the subjects and choose those relevant to agriculture

2. Find all the OpenLearn Units that have these subjects

3. Collect the metadata for each of the selected OpenLearn units

Page 23: Presentation for agINFRA Hackathon in Athens 12th December 2013

23

select distinct ?olu
from <http://data.open.ac.uk/context/openlearn>
where {
  ?olu <http://purl.org/dc/terms/subject> ?topic .
  filter ( ?topic in (
    <http://data.open.ac.uk/topic/agriculture>,
    <http://data.open.ac.uk/topic/environment>,
    ....etc.
  ) )
}

→ 85 OpenLearn units

Units are extracts from OU courses with multiple pages of  material and expected to take many hours of study.

Presenter
Presentation Notes
Get all the subjects (OpenLearn Units) with a topic from the list on the previous slide. An OpenLearn Unit can match multiple topics, so we only want the unique (DISTINCT) ones.
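With 40 topics, typing the filter list by hand is error-prone, so the query text can be generated from the chosen topic URIs. The following is a minimal sketch of that idea. The function name build_unit_query is hypothetical, and only the first two topics from the slide are used as sample input.

```python
def build_unit_query(topic_uris):
    """Return a SPARQL query selecting OpenLearn units tagged with any given topic."""
    # Each topic URI is wrapped in angle brackets and separated by commas.
    topic_list = ",\n".join("    <{}>".format(uri) for uri in topic_uris)
    lines = [
        "select distinct ?olu",
        "from <http://data.open.ac.uk/context/openlearn>",
        "where {",
        "  ?olu <http://purl.org/dc/terms/subject> ?topic .",
        "  filter ( ?topic in (",
        topic_list,
        "  ) )",
        "}",
    ]
    return "\n".join(lines)


topics = [
    "http://data.open.ac.uk/topic/agriculture",
    "http://data.open.ac.uk/topic/environment",
]
query = build_unit_query(topics)
print(query)
```

The printed text can be pasted into the endpoint's query form or sent from a script.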
Page 24: Presentation for agINFRA Hackathon in Athens 12th December 2013

24

http://data.open.ac.uk/openlearn/s250_3
http://data.open.ac.uk/openlearn/sdk125_1
http://data.open.ac.uk/openlearn/t123_1
http://data.open.ac.uk/openlearn/t206_2
http://data.open.ac.uk/openlearn/t213_1
http://data.open.ac.uk/openlearn/s173_1
http://data.open.ac.uk/openlearn/u116_3
http://data.open.ac.uk/openlearn/s278_19
http://data.open.ac.uk/openlearn/t306_3
http://data.open.ac.uk/openlearn/s189_1
http://data.open.ac.uk/openlearn/s344_1
http://data.open.ac.uk/openlearn/s324_1
http://data.open.ac.uk/openlearn/s250_2
……

Presenter
Presentation Notes
85 of these
Page 25: Presentation for agINFRA Hackathon in Athens 12th December 2013

25

http://data.open.ac.uk/openlearn/s250_2
http://www.open.edu/openlearn/science-maths-technology/science/environmental-science/social-issues-and-gm-crops/content-section-0

This unit is an adapted extract from the course Science in context (S250)

Presenter
Presentation Notes
For example S250_2
Page 26: Presentation for agINFRA Hackathon in Athens 12th December 2013

26

A three step process:

1. Find all the subjects and choose those relevant to agriculture

2. Find all the OpenLearn Units that have these subjects

3. Collect the metadata for each of the selected OpenLearn units

Page 27: Presentation for agINFRA Hackathon in Athens 12th December 2013

27

import urllib.parse
import urllib.request

# To run: python get_SPARQL_from_OpenData.py
# Edit this file in two places to choose output format as json or rdf/xml


def run_SPARQL(course_id):
    ''' returns results of SPARQL query'''
    # EDIT HERE
    # place course_id in request
    # req = urllib.request.Request('http://data.open.ac.uk/openlearn/{}'.format(course_id),
    #                              headers={'Accept': 'application/rdf+json'})
    req = urllib.request.Request('http://data.open.ac.uk/openlearn/{}'.format(course_id),
                                 headers={'Accept': 'application/rdf+xml'})
    # fire off the query
    f = urllib.request.urlopen(req)
    # pass back the query result having rendered it readable first
    return f.read().decode('utf-8')


if __name__ == '__main__':
    llist = ['a180_2', 'b823_1', 'd837_1', 'dd100_7', 'e500_11', 'k111_1']  # … list truncated on the slide
    for course_id in llist:
        print(course_id)
        # run query with chosen course id
        result = run_SPARQL(course_id)
        # EDIT HERE
        # with open('{}.json'.format(course_id), 'w', encoding='utf-8', newline='\n') as f:
        with open('{}.xml'.format(course_id), 'w', encoding='utf-8', newline='\n') as f:
            f.write(result)

Python script to dump the metadata

Presenter
Presentation Notes
Then get the metadata for each unit out of OpenLearn.
I initially ran a SPARQL query for each unit and then cut and pasted the result. Note they put in a copy-to-clipboard function for me.

select * from <http://data.open.ac.uk/context/openlearn>
WHERE { <http://data.open.ac.uk/openlearn/s250_2> ?predicate ?object }

But this is too slow for 85 units. Also, the output is a SPARQL results table in XML or JSON, and Vassilis Protonotarios and team want it in RDF.
To streamline this step we used a Python script to make a URL request. The output can be either XML or JSON, e.g. s250_2.json or s250_2.xml.
Page 28: Presentation for agINFRA Hackathon in Athens 12th December 2013

28

json format

{
  "http://data.open.ac.uk/openlearn/s250_2" : {
    "http://purl.org/dc/terms/language" : [ {
      "type" : "literal" ,
      "value" : "en-gb" ,
      "datatype" : "http://www.w3.org/2001/XMLSchema#string"
    } ] ,
    "http://data.open.ac.uk/openlearn/ontology/relatesToCourse" : [ {
      "type" : "uri" ,
      "value" : "http://data.open.ac.uk/course/s250"
    } ] ,
    "http://purl.org/dc/terms/title" : [ {
      "type" : "literal" ,
      "value" : "Social issues and GM crops" ,
      "datatype" : "http://www.w3.org/2001/XMLSchema#string"
    }
……

rdf/xml format

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://dbpedia.org/property/"
    xmlns:j.1="http://xmlns.com/foaf/0.1/"
    xmlns:j.3="http://web.resource.org/cc/"
    xmlns:j.2="http://www.w3.org/TR/2010/WD-mediaont-10-20100608/"
    xmlns:j.4="http://purl.org/dc/terms/"
    xmlns:j.5="http://data.open.ac.uk/openlearn/ontology/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <j.1:Document rdf:about="http://data.open.ac.uk/openlearn/s250_2">
    <j.2:locator rdf:resource="http://www.open.edu/openlearn/nature-environment/the-environment/environmental-science/social-issues-and-gm-crops/content-section-0"/>
    <j.5:relatesToCourse rdf:resource="http://data.open.ac.uk/course/s250"/>
    <j.4:creator rdf:resource="http://data.open.ac.uk/organization/the_open_university"/>
    <j.4:subject rdf:resource="http://data.open.ac.uk/topic/risk"/>
    <j.4:published rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-06-02T23:00:00Z</j.4:published>
……
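The json output nests each subject URI over its predicates, and each predicate over a list of value objects. As a hedged illustration of reading that layout, the sketch below pulls the values of one property out of a saved file. The sample is a cut-down copy of the s250_2 excerpt on this slide, and the helper name property_values is hypothetical.

```python
import json

# Cut-down copy of the s250_2 excerpt shown on the slide.
SAMPLE = """
{
  "http://data.open.ac.uk/openlearn/s250_2": {
    "http://purl.org/dc/terms/language": [
      {"type": "literal", "value": "en-gb",
       "datatype": "http://www.w3.org/2001/XMLSchema#string"}
    ],
    "http://data.open.ac.uk/openlearn/ontology/relatesToCourse": [
      {"type": "uri", "value": "http://data.open.ac.uk/course/s250"}
    ],
    "http://purl.org/dc/terms/title": [
      {"type": "literal", "value": "Social issues and GM crops",
       "datatype": "http://www.w3.org/2001/XMLSchema#string"}
    ]
  }
}
"""


def property_values(document, subject, predicate):
    """Return every value recorded for one predicate of one subject."""
    # Missing subjects or predicates give an empty list rather than an error.
    return [entry["value"] for entry in document.get(subject, {}).get(predicate, [])]


doc = json.loads(SAMPLE)
unit = "http://data.open.ac.uk/openlearn/s250_2"
titles = property_values(doc, unit, "http://purl.org/dc/terms/title")
print(titles)
```

The same helper works for any predicate in the file, for example the subject or language properties.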

Page 29: Presentation for agINFRA Hackathon in Athens 12th December 2013

29

Summary:

A three step process:

1. Find all subjects/keywords relevant to agriculture
2. Identify OpenLearn Units with these subjects
3. Collect the metadata for each OpenLearn unit

All the scripts (and more) are available