dspace oai-pmh
TRANSCRIPT
![Page 2: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/2.jpg)
Harvesting Statistical Metadata from an Online Repository for Data Analysis and Visualization
![Page 3: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/3.jpg)
Outline
Goal and Motivation
Theseus.fi
Dspace
Getting Data out from Dspace
Dspace OAI-PMH as a Data provider for Theseus
Request Types (Verbs)
Flow Control
Harvesting Data from Theseus’s Data provider
Project Result
Final thoughts
![Page 4: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/4.jpg)
Goal
Harvest metadata of thesis documents from Theseus
author name, title, keywords, submission year, ...
Store the harvested data in a separate MySQL database.
Build a Web portal out of this stored data
Goal and Motivation
Why conduct this project?
Thesis data analysis and visualization of overall statistical facts.
Compare thesis documents
Compare universities and departments
Analyse trending keywords used by students every year
![Page 5: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/5.jpg)
Theseus.fi
Digital libraries are now commonly used by academic institutions worldwide.
Theseus provides online access to theses and publications from Finnish universities of applied sciences.
End users can search, browse and upload thesis documents to Theseus.
![Page 6: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/6.jpg)
...
Theseus also has an API that can be used by third party organizations to utilize theses data.
Theseus is powered by Dspace, a pioneering open source digital asset management system.
Functionalities and features of Theseus are inherited from Dspace.
![Page 7: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/7.jpg)
Dspace
Dspace is an open source software platform that provides stable, long-term storage, most commonly for digital intellectual material.
Many academic institutions worldwide use Dspace to offer their users easy access to their digital resources.
Dspace can be freely downloaded and used or even modified to store digital materials.
![Page 8: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/8.jpg)
Abbreviations
OAI: Open Archives Initiative
PMH: Protocol for Metadata Harvesting
![Page 9: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/9.jpg)
Getting Data out from Dspace
OAI-PMH is an HTTP-based protocol that defines methods for sharing, publishing and archiving metadata from Dspace repositories.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to programmatically access data from Dspace.
![Page 10: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/10.jpg)
Dspace OAI-PMH as a Data provider for Theseus
Dspace repositories have an 'OAI Base URL' in addition to the URL for human users.
OAI Base URL : http://publications.theseus.fi/oai/request?
URL for human users : https://www.theseus.fi/
The OAI Base URL is used in machine-to-machine communication between data providers and harvesters.
When a harvesting request is made using the OAI Base URL, Theseus’s data provider returns XML-formatted metadata of thesis documents.
![Page 11: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/11.jpg)
…
Theseus OAI-PMH exposes thesis documents in twelve unique metadata formats.
KansalliKirjasto format:
<kk:field schema="dc" element="contributor" qualifier="author" language="none" value=" Denut, Nicolae "/>
OAI Dublin Core format : <dc:creator> Denut, Nicolae </dc:creator>
Each metadata format can be queried to get any data from Theseus’s data provider.
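The two formats above carry the same author name in different places: as element text in OAI Dublin Core, and as a `value` attribute in the KansalliKirjasto format. A minimal Python sketch of reading both follows. It assumes the fragments are wrapped in a `<metadata>` element for illustration. The `dc` namespace URI is the standard Dublin Core one, while the `kk` namespace URI is a placeholder, since the slides do not give it.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core namespace
KK_NS = "urn:example:kk"  # placeholder: the real kk namespace is not given in the slides

# Sample fragments built from the two lines shown on the slide.
OAI_DC_SAMPLE = (
    f'<metadata xmlns:dc="{DC_NS}">'
    "<dc:creator> Denut, Nicolae </dc:creator>"
    "</metadata>"
)
KK_SAMPLE = (
    f'<metadata xmlns:kk="{KK_NS}">'
    '<kk:field schema="dc" element="contributor" qualifier="author" '
    'language="none" value=" Denut, Nicolae "/>'
    "</metadata>"
)


def authors_from_oai_dc(xml_text):
    """Return author names stored as text of dc:creator elements."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{{{DC_NS}}}creator") if el.text]


def authors_from_kk(xml_text):
    """Return author names stored in the value attribute of kk:field elements."""
    root = ET.fromstring(xml_text)
    return [
        el.get("value", "").strip()
        for el in root.iter(f"{{{KK_NS}}}field")
        if el.get("element") == "contributor" and el.get("qualifier") == "author"
    ]
```

Both helpers strip the surrounding whitespace, so the two formats yield the same author string.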
![Page 12: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/12.jpg)
Request Types (Verbs)
OAI-PMH defines six request types (verbs) that can be appended to an OAI base URL to access different repository contents.
Theseus implements all six request types to provide thesis metadata to harvesters.
1. Identify: fetches information about the Theseus data provider itself
2. ListMetadataFormats: returns a list of the metadata formats supported by the Theseus data provider
3. ListIdentifiers: lists thesis record identifiers
![Page 13: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/13.jpg)
…
4. ListSets: retrieves the set structure (list of universities and departments)
5. ListRecords: gets a list of complete metadata records of thesis documents from Theseus
6. GetRecord: retrieves the metadata of an individual thesis document
By attaching any one of these request types to Theseus’s OAI base URL, a request URL can be formed.
![Page 14: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/14.jpg)
OAI Base URL + Request type => Request URL
![Page 15: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/15.jpg)
http://publications.theseus.fi/oai/request?verb=ListSets
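The "base URL + request type" rule can be sketched in Python as below. The function name `build_request_url` is made up for illustration. The `metadataPrefix=oai_dc` argument in the second call is the standard OAI-PMH way to ask for Dublin Core; which arguments a given verb needs is defined by the protocol, not by this helper.

```python
from urllib.parse import urlencode

# OAI base URL from the slides (without the trailing "?").
OAI_BASE_URL = "http://publications.theseus.fi/oai/request"

# The six OAI-PMH request types (verbs).
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListIdentifiers",
    "ListSets",
    "ListRecords",
    "GetRecord",
}


def build_request_url(verb, base_url=OAI_BASE_URL, **arguments):
    """Attach a verb and optional arguments to the OAI base URL."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    # Tolerate a base URL that already ends with "?".
    return f"{base_url.rstrip('?')}?{query}"


print(build_request_url("ListSets"))
print(build_request_url("ListRecords", metadataPrefix="oai_dc"))
```

The first call reproduces the ListSets request URL shown on the slide.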
![Page 16: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/16.jpg)
Flow control
The three request types ListIdentifiers, ListSets and ListRecords return large lists from Theseus.
In such cases, it is practical to partition them among a series of requests and responses.
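Resumption tokens are an OAI-PMH mechanism that lets a data provider split a long list response into parts: each incomplete response carries a token, and the harvester sends that token back to request the next part.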
![Page 17: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/17.jpg)
Resumption token work flow
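The resumption token loop can be sketched in Python roughly as follows. This is an illustrative sketch, not the project's actual harvester: the names `harvest` and `fetch_xml` are made up, and the `fetch` parameter exists only so the loop can be tried without network access. It relies on two protocol rules: a follow-up request carries only the verb and the `resumptionToken`, and an empty or missing token marks the last part.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def fetch_xml(url):
    """Download one response document from the data provider."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, verb, fetch=fetch_xml, **arguments):
    """Yield every partial response of a list request, following resumption tokens."""
    params = {"verb": verb, **arguments}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield root
        token = root.find(f".//{OAI_NS}resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": verb, "resumptionToken": token.text.strip()}
```

A harvester would then iterate, for example, over `harvest("http://publications.theseus.fi/oai/request", "ListIdentifiers", metadataPrefix="oai_dc")` and process each partial response as it arrives.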
![Page 18: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/18.jpg)
Harvesting Data from Theseus’s Data provider
Simple HTML DOM parser is an open source parser library written in PHP that reads, modifies and returns structured content from external data sources.
This parser library can create a Document Object Model by loading structured data from a URL.
To get nodes of the DOM object, the library provides a method called find().
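The project itself used the PHP library described above. As a rough analogue, the same "load the XML, then find nodes" step can be sketched in Python with the standard library's `xml.etree.ElementTree`. The sample ListSets response and its set identifiers and names are invented for illustration; only the element names (`set`, `setSpec`, `setName`) and the namespace come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}  # OAI-PMH 2.0 namespace

# Made-up ListSets response; real setSpec values and names will differ.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>com_1</setSpec><setName>Example University</setName></set>
    <set><setSpec>col_1_2</setSpec><setName>Example Department</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (
            node.findtext("oai:setSpec", namespaces=NS),
            node.findtext("oai:setName", namespaces=NS),
        )
        for node in root.findall(".//oai:set", NS)
    ]


print(parse_sets(SAMPLE_LIST_SETS))
```

Here `findall` plays the role that find() plays in the PHP library: it selects the nodes to read values from.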
![Page 19: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/19.jpg)
Universities: identifier (setSpec), university name, ListSets request URLs, total number of papers
Departments: identifier (setSpec), department name, ListSets request URLs, total number of papers, university identifiers
Thesis documents: thesis identifier, author names, titles, GetRecord request URLs, department identifiers, university identifiers, keywords, subjects (official keywords), number of pages, year, language
Summary of gathered theses metadata
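One way to picture how this metadata could be stored is the sketch below. The project stored its data in MySQL; this sketch swaps in Python's built-in `sqlite3` so that it runs without a database server. All table and column names are assumptions chosen to mirror the summary above, not the project's real schema.

```python
import sqlite3

# Hypothetical schema mirroring the three groups of gathered metadata.
SCHEMA = """
CREATE TABLE university (
    set_spec TEXT PRIMARY KEY,
    name TEXT,
    total_papers INTEGER
);
CREATE TABLE department (
    set_spec TEXT PRIMARY KEY,
    name TEXT,
    total_papers INTEGER,
    university_spec TEXT REFERENCES university(set_spec)
);
CREATE TABLE thesis (
    identifier TEXT PRIMARY KEY,
    title TEXT,
    authors TEXT,
    year INTEGER,
    language TEXT,
    pages INTEGER,
    department_spec TEXT REFERENCES department(set_spec),
    university_spec TEXT REFERENCES university(set_spec)
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def papers_per_year(conn, university_spec):
    """Count thesis documents per year for one university."""
    cursor = conn.execute(
        "SELECT year, COUNT(*) FROM thesis "
        "WHERE university_spec = ? GROUP BY year ORDER BY year",
        (university_spec,),
    )
    return cursor.fetchall()
```

A query such as `papers_per_year` is the kind of aggregation a web portal needs to chart how many papers each school publishes every year.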
![Page 20: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/20.jpg)
84,391 Whoa! That’s a big number, aren’t you proud?
![Page 21: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/21.jpg)
Project Result
• How many Thesis documents are in Theseus?
• Which school has what amount of papers in Theseus?
• How many papers is each school publishing every year?
• What departments are there in each school?
• How many papers belong to which department?
• How many pages does each paper have?
• In what language is the paper written?
• How many times has each paper been downloaded by Theseus visitors?
• What are the keywords of each thesis document?
![Page 22: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/22.jpg)
The built Web portal aims to give better insights on the contribution of each school to Theseus on its front page.
![Page 23: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/23.jpg)
Web portal showing
![Page 24: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/24.jpg)
Departments versus number of Thesis documents in Metropolia UAS
![Page 25: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/25.jpg)
Analysing Keywords is also easy
I want to analyse keywords
Fill out a form
See results
![Page 26: Dspace OAI-PMH](https://reader036.vdocuments.pub/reader036/viewer/2022062313/55c1257fbb61eb26098b4625/html5/thumbnails/26.jpg)
Keyword fetching form