the UPS protoproto project
herbert van de sompel, michael nelson, thomas krichel
UPS 1 Meeting Santa Fe - October 21st 1999
project description
demo the UPS protoproto
dex the data exchange framework
project why a protoproto?
• UPS: enable cross-archive end-user services
• protoproto:
  – facilitate discussions
  – identify issues involved in creating cross-archive services
  – experiment with digital object concepts for archive material
  – does not claim to be a solution
• protoproto is multi-disciplinary
  – a special instance of cross-archive
  – there is a market
  – promotional value
project who?
• coordination: herbert van de sompel, michael nelson, thomas krichel
• involvement of:
  – Old Dominion U & NASA Langley
  – U of Surrey
  – U of Ghent
  – Los Alamos National Laboratory - Library
  – Russian Academy of Science - Siberian branch
project sponsors
• Los Alamos National Laboratory - Research Library
• JISC eLib WoPEc project
project datasets
– metadata only
– full text remains at archives
– static dumps obtained ca. July 99

archive      objects   full-text   !organization
the arXiv     85,223      85,223          17,983
CogPrints        742         659              14
NACA           3,036       3,036             100
NCSTRL        29,184       9,084              93
NDLTD          1,590         951               1
RePEc         73,367      13,582           2,453
Total        193,142     112,535
project metadata formats
archive      format
the arXiv    internal
CogPrints    internal
NACA         Refer
NCSTRL       RFC1807
NDLTD        MARC
RePEc        ReDIF
project metadata extraction
• Getting metadata out of archives
  – not all archives support metadata extraction
    • some archives have undocumented metadata extraction procedures
  – not all archives support rich criteria for extraction
    • single dump concept only
• Intellectual property and use rights not always clear
project metadata quality
• Metadata has problems with:
  – record duplication
  – missing crucial fields
  – internal errors
  – ambiguous references to people, places, and publications
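Two of the quality problems listed above, record duplication and missing crucial fields, can be screened for mechanically. The following is a minimal sketch, not the procedure used in UPS: it assumes records are plain dictionaries, and the field names, the `CRUCIAL_FIELDS` tuple, the function `find_problems`, and the sample records are all made up for illustration.

```python
# Assumption: these are the "crucial" fields; the slides do not name them.
CRUCIAL_FIELDS = ("title", "authors", "date")

def normalize(text):
    """Lower-case and collapse whitespace so trivially different titles match."""
    return " ".join(text.lower().split())

def find_problems(records):
    """Return (duplicates, missing) for a list of record dictionaries.

    duplicates: lists of positions whose normalized titles are identical.
    missing: maps a record's position to the crucial fields it lacks.
    """
    by_title = {}
    missing = {}
    for pos, rec in enumerate(records):
        absent = [f for f in CRUCIAL_FIELDS if not rec.get(f)]
        if absent:
            missing[pos] = absent
        if rec.get("title"):
            by_title.setdefault(normalize(rec["title"]), []).append(pos)
    duplicates = [group for group in by_title.values() if len(group) > 1]
    return duplicates, missing

# Made-up sample records, for illustration only.
sample = [
    {"title": "Smart Objects, Dumb Archives", "authors": ["A. Author"], "date": "1999"},
    {"title": "smart objects,  dumb archives", "authors": ["A. Author"], "date": "1998"},
    {"title": "", "authors": ["B. Author"], "date": "1998"},
]
duplicates, missing = find_problems(sample)
```

Title matching of this kind only flags candidates. Ambiguous references to people and places would still need review by hand.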
project metadata conversion
• data enhancements:
  • creation of unique identifier
  • addition of raw subject-classification
  • normalization of publication types
• all datasets converted to ReDIF:
  • essential to have a single format for the creation of services
  • supply by archives in a single format was not realistic
  • no downgrading of data
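As an illustration of the conversion step, the sketch below renders one record as a ReDIF-style template while applying the three enhancements above. It is an assumption-laden example, not the UPS converter: the `ups:archive:id` handle layout, the `X-Subject-Raw` and `X-Publication-Type` field names, the `PUBLICATION_TYPES` vocabulary, and the function `to_redif` are all hypothetical.

```python
# Hypothetical mapping from archive-specific type labels to a shared vocabulary.
PUBLICATION_TYPES = {
    "tech report": "report",
    "technical report": "report",
    "preprint": "preprint",
    "e-print": "preprint",
    "thesis": "thesis",
    "dissertation": "thesis",
}

def to_redif(archive, local_id, record):
    """Render one record as a ReDIF-style template ('Field: value' lines)."""
    raw_type = record.get("type", "").strip().lower()
    lines = ["Template-Type: ReDIF-Paper 1.0"]
    lines += [f"Author-Name: {name}" for name in record.get("authors", [])]
    lines.append(f"Title: {record['title']}")
    if record.get("subject"):
        # Raw subject classification: the archive's own label, copied unchanged.
        lines.append(f"X-Subject-Raw: {record['subject']}")
    # Normalized publication type; unknown labels fall back to "other".
    lines.append(f"X-Publication-Type: {PUBLICATION_TYPES.get(raw_type, 'other')}")
    # Unique identifier built from archive name and local id (hypothetical layout).
    lines.append(f"Handle: ups:{archive}:{local_id}")
    return "\n".join(lines)

example = to_redif(
    "ncstrl",
    "tr-1",
    {"title": "An example report", "authors": ["A. Author"],
     "type": "Tech Report", "subject": "digital libraries"},
)
```

Copying the source subject label unchanged, rather than mapping it, matches the "no downgrading of data" goal: nothing the archive supplied is lost in conversion.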
project re-creation of archives
• creation of archives for ReDIF-ed metadata
• using intelligent digital objects: “buckets”
[figure labels: arXiv, RePEc, NCSTRL]
project buckets
• Buckets were chosen to study the implications of using rich, intelligent objects in UPS
• Buckets are:
  – DL protocol / system independent
  – self-contained and mobile
  – handle their own display, enforcement of terms and conditions, and dissemination of their contents
  – designed for bundling multiple data representations and data instance types
• The aggregative nature of buckets is well suited for adding value-added services at the object level
project creation of end-user service
• NCSTRL+ digital library service
• indexing buckets in archives by requesting their metadata
• enhanced user-interface
• NCSTRL+ search results point at buckets
• buckets auto-display
• buckets provide link to full-text in native archive
project scaling problems
• UPS contains 193K objects
  – using buckets consumed inodes (~60 inodes per bucket)
    • filesystem reformatted with more generous amount of inodes
  – Solaris and Dienst conflict
    • Dienst wants each object in a publishing authority to be in a single directory
    • Solaris has a hard limit of 32K objects in a directory
    • resolution: use many (100+) authorities for UPS
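The figures above can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: it reads "32K" as 32,768 entries, and the hash-based `authority_for` function with its `ups000`-style names is a hypothetical way to spread objects over authorities, not necessarily how UPS assigned them.

```python
import math
import zlib

OBJECTS = 193_142          # total objects in UPS (datasets slide)
INODES_PER_BUCKET = 60     # approximate figure from the slide
DIR_LIMIT = 32 * 1024      # assumption: "32K" read as 32,768 entries per directory

# Inodes consumed if every object becomes one bucket.
inodes_needed = OBJECTS * INODES_PER_BUCKET

# Fewest authorities (one directory each) that stay under the directory limit.
min_authorities = math.ceil(OBJECTS / DIR_LIMIT)

def authority_for(object_id, n_authorities=128):
    """Pick an authority for an object by hashing its identifier (hypothetical)."""
    index = zlib.crc32(object_id.encode("utf-8")) % n_authorities
    return f"ups{index:03d}"
```

Under these assumptions the buckets need about 11.6 million inodes, and the directory limit alone forces at least 6 authorities. Using 100+ authorities leaves each directory far below the limit.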
project addition of linking service
• integrate the archives with the traditional communication mechanism
• context-sensitive linking to deliver extended services via SFX technology
project SFX linking service
[figure: system A and system B; labels: metadata, evaluate metadata, extended services]
project SFX linking database
• buckets for arXiv, NCSTRL and RePEc are SFX-aware
• CogPrints, NACA, NDLTD not SFX-aware
• SLAC/SPIRES is SFX-aware
• linking services for preprint metadata + for published version
project addition of linking service
demo the UPS protoproto
http://ups.cs.odu.edu:8000/
• will be available from the beginning of November
• UPS list will be notified
• disclaimer: “not a production system”
http://ups.cs.odu.edu
dex some issues (I)
• data exchange framework
• data provision vs. data implementation
• central searching, distributed archives
• need for a framework by which archives can describe themselves:
  • content
  • terms and conditions
  • protocols, criteria supported to extract (meta)data
  • metadata scheme, subject classification scheme, material-type scheme, ...
dex some issues (II)
• need for an identifier scheme for archives and archive objects
  • (cf. ISSN, ISBN, DOI)
• metadata quality obstructs the creation of services
• desirable to extend metadata with citation information
• smart objects
  • archived objects that are active, not passive
dex providing vs. implementing data
• Providing data:
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • also provide a non-technical context for sharing information
• Implementing data:
  – harvest metadata from providers
  – implement user interface to data
• Even if provided by the same DL, these are distinct functions
dex providing vs. implementing data
• Provider with input interface and native end-user interface: no machine-based way to extract metadata…
• Provider with input interface, native end-user interface, and native harvesting interface: machine and user interfaces for extracting metadata…
dex providing vs. implementing data
• Provider with input interface and native harvesting interface: native end-user interface optional (e.g., RePEc)
• Provider with input interface, native end-user interface, and native harvesting interface
• Implementor with native end-user interface: input and harvesting interfaces optional
dex self-describing archives
• Much of the learning about the constituent UPS archives occurred out of band…
• Given an unknown archive, we should be able to algorithmically determine the archive’s metadata...
[figure: Provider with input interface, native end-user interface, and native harvesting interface]
• Where possible, the harvesting interface should provide the same criteria as the end-user interface
dex self-describing archives
• Recommended criteria for metadata extraction:
  – subject classification
  – accession date
  – publication date
• Criteria for archive description
  – metadata formats employed
  – contact information for archive
  – publication type scheme
  – identifier scheme
  – subject classification scheme
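One way to picture a self-describing archive is as a small machine-readable record plus a harvesting call that honours the recommended criteria. The sketch below is only an illustration under stated assumptions: the `ArchiveDescription` class, its field names, the `harvest` function, and the sample archive and records are hypothetical and do not correspond to any interface defined in the slides.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ArchiveDescription:
    """What an archive publishes about itself (all field names are illustrative)."""
    name: str
    contact: str
    metadata_formats: list
    publication_type_scheme: str
    identifier_scheme: str
    subject_classification_scheme: str
    # Criteria the harvesting interface accepts, e.g. "subject".
    extraction_criteria: list = field(default_factory=list)

def harvest(records, description, subject=None,
            accessioned_since=None, published_since=None):
    """Select records by the recommended criteria, refusing unsupported ones."""
    requested = {
        "subject": subject,
        "accession_date": accessioned_since,
        "publication_date": published_since,
    }
    for name, value in requested.items():
        if value is not None and name not in description.extraction_criteria:
            raise ValueError(f"{description.name} does not support criterion {name!r}")
    selected = []
    for rec in records:
        if subject is not None and rec["subject"] != subject:
            continue
        if accessioned_since is not None and rec["accession_date"] < accessioned_since:
            continue
        if published_since is not None and rec["publication_date"] < published_since:
            continue
        selected.append(rec)
    return selected

# Made-up archive and records, for illustration only.
example = ArchiveDescription(
    name="example-archive",
    contact="admin@example.org",
    metadata_formats=["ReDIF"],
    publication_type_scheme="local",
    identifier_scheme="local",
    subject_classification_scheme="local",
    extraction_criteria=["subject", "accession_date"],
)
records = [
    {"id": "1", "subject": "physics",
     "accession_date": date(1999, 3, 1), "publication_date": date(1999, 2, 1)},
    {"id": "2", "subject": "economics",
     "accession_date": date(1999, 6, 1), "publication_date": date(1998, 1, 1)},
]
recent = harvest(records, example, accessioned_since=date(1999, 5, 1))
```

Because the description lists which criteria are supported, a harvester can find out before asking that this example archive cannot filter by publication date, instead of learning it out of band.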
dex identifiers
• Useful in:
  – reference linking
  – can be used in citations
  – resolving duplications
    • UPS duplications were removed by hand
  – tracking publication lifecycle
• Need the ability for an object to have multiple unique identifiers
  – organization, discipline, etc.
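If every object carries a set of unique identifiers, duplicates can be resolved by grouping records that share any identifier. The sketch below shows one possible approach using a union-find structure; it is not how UPS removed duplicates (that was done by hand), and the function name and the identifier strings are made up.

```python
def group_by_shared_identifier(records):
    """Group record positions that share at least one identifier.

    records: a list where each entry is the set of identifiers of one record.
    Returns a sorted list of groups, each a sorted list of positions.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, identifiers in enumerate(records):
        for ident in identifiers:
            if ident in first_seen:
                # Same identifier seen before: merge the two groups.
                parent[find(idx)] = find(first_seen[ident])
            else:
                first_seen[ident] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

# Made-up identifiers: records 0, 1 and 2 are linked through shared ones.
sample = [
    {"org:lab/0001"},
    {"org:lab/0001", "disc:phys/42"},
    {"disc:phys/42"},
    {"org:other/7"},
]
groups = group_by_shared_identifier(sample)
```

Records 0 and 2 share no identifier directly. They end up in one group only because record 1 carries both an organizational and a disciplinary identifier, which is the case the last bullet above argues for.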
dex smart objects
• Premise: Objects are more important than the archives that hold them
• SODA: Smart Objects, Dumb Archives
• Objects should be the canonical authority for
  • metadata
  • contents
  • use
• Objects should be able to grow and change
  • correct metadata
  • add new formats
  • add new services
  • reflect the lifecycle of the object
dex smart objects
• It would be beneficial if the archived objects could be heterogeneous:
  • with their own “look-and-feel”
  • unique functionality / services
    – e.g., the data archiving needs of an atmospheric scientist can differ from those of a computer scientist, engineer or medical researcher
• yet maintain a standard API for:
  • extracting metadata
  • content retrieval
  • resource discovery on the object
  • terms and conditions
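The idea of heterogeneous objects behind a standard API can be sketched as an abstract interface with one method per capability listed above. This is a hypothetical illustration: the class and method names below are invented for the example and are not the actual bucket API.

```python
from abc import ABC, abstractmethod

class SmartObject(ABC):
    """Hypothetical common interface; the method names are not the bucket API."""

    @abstractmethod
    def metadata(self, fmt):
        """Return the object's metadata in the requested format."""

    @abstractmethod
    def representations(self):
        """Resource discovery: list the content representations held."""

    @abstractmethod
    def content(self, representation):
        """Return one representation of the content."""

    @abstractmethod
    def terms_and_conditions(self):
        """Return the terms under which the content may be used."""

class SimpleObject(SmartObject):
    """One possible implementation; others may look and behave differently."""

    def __init__(self, metadata, contents, terms):
        self._metadata = metadata      # format name -> metadata text
        self._contents = contents      # representation name -> bytes
        self._terms = terms

    def metadata(self, fmt):
        return self._metadata[fmt]

    def representations(self):
        return sorted(self._contents)

    def content(self, representation):
        return self._contents[representation]

    def terms_and_conditions(self):
        return self._terms

    def add_representation(self, name, data):
        """Objects can grow: add a new format after creation."""
        self._contents[name] = data

# Made-up example object.
obj = SimpleObject({"redif": "Title: Example"}, {"pdf": b"%PDF"}, "free for research use")
obj.add_representation("ps", b"%!PS")
```

A service that only calls the four abstract methods works with any implementation, so individual disciplines can add their own functionality without breaking cross-archive services.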
dex lessons learned
• A strong distinction between the provision of data, and the implementation of data
  – also, a socio-legal context for sharing metadata
• Open, “self-describing” archives
• A universal, unique identifier name space
• Archived objects with more intelligence and flexibility