En toen was er niets meer … (And Then There Was Nothing Left …)
Herbert Van de Sompel, LANL & DANS, @hvdsomp
VOGIN-IP, Amsterdam, Netherlands, March 9 2017
Yet, the Web Exists in a Perpetual Now
Traces of the Past Web Exist
• Content Management Systems
• Web Archives
• Transactional archives
• Search engine caches
• …
But Past and Current Web(s) are Parallel Universes
The Memento Protocol Integrates the Current and Past Web
http://mementoweb.org/guide/rfc/
[Figure: date selector showing Today, Select Date: Mar 9 1999, and Feb 8 1999; snapshot from the Bibliotheca Alexandrina Web Archive]
Memento: Access Versions via the Original URI and a Datetime
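As a rough illustration of the idea on this slide: a Memento client asks for the original URI together with a desired datetime, which travels in an Accept-Datetime request header (RFC 7089). The sketch below only builds that header. The helper name and the choice of Mar 9 1999 are illustrative assumptions, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build request headers for datetime negotiation (hypothetical helper name)."""
    # Accept-Datetime uses the HTTP date format and is expressed in GMT.
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


original_uri = "http://www.vogin.nl/"  # the original URI, not an archive URL
headers = memento_headers(datetime(1999, 3, 9, tzinfo=timezone.utc))
print(original_uri, headers)
```

The point of the design is that the client never needs to know an archive-specific URL up front: the original URI plus a datetime is enough to ask for a past version.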
vogin.nl in 1999
http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/
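The snapshot URL above follows the common pattern of an archive prefix, a 14-digit capture timestamp, and the original URI. A minimal sketch for pulling those parts apart is shown below; the regular expression and function name are assumptions made for illustration, and other archives may use different URL layouts.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <archive>/web/<YYYYMMDDhhmmss>/<original URI>
SNAPSHOT_RE = re.compile(r"^(?P<archive>https?://[^/]+/web/)(?P<ts>\d{14})/(?P<original>.+)$")


def parse_snapshot_uri(uri: str):
    """Return (original_uri, capture_datetime) or None if the layout does not match."""
    match = SNAPSHOT_RE.match(uri)
    if match is None:
        return None
    captured = datetime.strptime(match["ts"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return match["original"], captured


parsed = parse_snapshot_uri(
    "http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/"
)
print(parsed)
```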
Memento for Chrome
http://bit.ly/memento-for-chome
Hyperlinks
Eric Sieverts (2017) https://vogin-ip-lezing.net/2017/01/17/linkrot-linkroest-en-webarchieven/
Link Rot
http://404-resto.com/typo3temp/pics/7580ea80fa.jpg
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
Content Drift
http://dl00.org in 2000, 2004, 2005, 2008
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
The Web, All Hyperlinks Subject to Link Rot, Content Drift
The Web, All Hyperlinks Subject to Reference Rot
• Reference Rot hinders our ability to follow links as they were intended when they were put in place:
  • Link rot: A link stops working altogether
  • Content drift: The linked content changes over time and may eventually no longer be representative of the content that was originally linked
Creating Pockets of Persistence
• How to maintain the integrity of links?
• This challenge exists for the entire web. Some communities with well-managed collections care about addressing it because they consider it a Quality of Service issue:
  • Scholarly communication
  • Cultural heritage
  • Legal publications
  • Government communication
  • Journalism
  • Wikipedia
  • …
• What can these communities do to create Pockets of Persistence?
A Managed Collection Desires Reliable Outlinks
Links to another Managed Collection
Exploring Link Rot & Content Drift
Preamble 2 - Hiberlink Study of Reference Rot in STM Articles
PMC articles published 1997-2012:
• Total: 479,194
• With links to articles: 240,857
• With links to web-at-large resources: 156,160
PMC links:
• To articles: 744,678
• To web-at-large resources: 480,853
Number of Articles & Links - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
Links to Articles & to Web At Large Resources - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
Exploring Link Rot & Content Drift
Link Rot Occurs when B Moves to C
Link to PID(B); HTTP Redirect from PID(B) to B
When B moves to C: HTTP Redirect from PID(B) to C
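The PID mechanism on these slides can be sketched as a lookup table that answers every PID request with a redirect to the current location. The class below is a toy model under that assumption; its name, the status codes chosen, and the example identifiers are all hypothetical and do not describe any real resolver.

```python
class PidResolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid: str, location: str) -> None:
        # Called at minting time and again whenever the resource moves.
        self._locations[pid] = location

    def resolve(self, pid: str):
        """Return (status, headers) as an HTTP server would."""
        location = self._locations.get(pid)
        if location is None:
            return 404, {}
        return 302, {"Location": location}


resolver = PidResolver()
resolver.register("pid:B", "http://b.example.org/paper")   # link to PID(B) redirects to B
before = resolver.resolve("pid:B")
resolver.register("pid:B", "http://c.example.org/paper")   # B moves to C; PID stays the same
after = resolver.resolve("pid:B")
print(before, after)
```

Links that were made to the PID keep working after the move; links that were made directly to B do not benefit from the table at all.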
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
Core assumption in the PID solution: PIDs will be used to establish links.
But are they?
A Disconcerting Observation
• When classifying links extracted from PMC as linking to articles, we assumed that filtering on http://dx.doi.org/* would do the trick
• But we found a lot of e.g. http://link.springer.com/article/*
• For example:
  • http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
  • http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these extracted links as linking to articles
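The classification problem described here can be sketched as follows. The host sets are small hypothetical stand-ins (a real study would rely on a lookup service such as the reverse domain lookup mentioned above), and the Springer path handling is an assumption based only on the example URL on this slide.

```python
from urllib.parse import unquote, urlsplit

DOI_HOSTS = {"dx.doi.org", "doi.org"}
# Hypothetical stand-in for a publisher-domain lookup.
PUBLISHER_HOSTS = {"link.springer.com"}


def classify(uri: str) -> str:
    """Label a link as a PID link, a location link to an article, or web at large."""
    host = urlsplit(uri).hostname or ""
    if host in DOI_HOSTS:
        return "article (PID link)"
    if host in PUBLISHER_HOSTS:
        return "article (location link)"
    return "web at large"


def doi_link_from_springer(uri: str):
    """Recover the DOI link from a Springer article URL, if the path matches."""
    path = unquote(urlsplit(uri).path)  # %2F in the path decodes to '/'
    prefix = "/article/"
    if not path.startswith(prefix):
        return None
    return "http://dx.doi.org/" + path[len(prefix):]


location_link = "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"
print(classify(location_link), doi_link_from_springer(location_link))
```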
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
Cartoon by Patrick Hochstenbach
A Proposal to Get PIDs Used: Signposting
http://signposting.org
Signposting: HTTP Link with identifier Relation Type
http://signposting.org/identifier/
Signposting: Use HTTP Link with identifier Relation Type
curl -I http://www.dlib.org/dlib/november15/vandesompel/11vandesompel.html
HTTP/1.1 200 OK
Date: Wed, 26 Oct 2016 12:36:37 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 19 Nov 2015 14:50:19 GMT
ETag: "205a5e-f5ef-524e5e0ab80c0"
Accept-Ranges: bytes
Content-Length: 62959
Content-Type: text/html; charset=UTF-8
Link: <http://doi.org/10.1045/november2015-vandesompel> ; rel="identifier"
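A client that wants to use the PID rather than the address-bar URI could read it from the Link header shown above. The parser below is a deliberately simplified sketch (it does not handle every corner of the Link header syntax, such as ";" or "<" inside quoted parameter values), and the function names are assumptions.

```python
import re


def parse_link_header(value: str):
    """Return a list of (target, params) pairs from a Link header value (simplified)."""
    links = []
    for match in re.finditer(r"<([^>]*)>([^<]*)", value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for part in raw_params.split(";"):
            part = part.strip().rstrip(",").strip()
            if "=" in part:
                key, _, val = part.partition("=")
                params[key.strip().lower()] = val.strip().strip('"')
        links.append((target, params))
    return links


def find_identifier(value: str):
    """Return the first link target whose rel includes 'identifier', else None."""
    for target, params in parse_link_header(value):
        if "identifier" in params.get("rel", "").split():
            return target
    return None


header = '<http://doi.org/10.1045/november2015-vandesompel> ; rel="identifier"'
pid = find_identifier(header)
print(pid)
```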
PID Alternative - When B Moves to C: HTTP Redirect from B to C
• Custodian of C needs to hold on to domain of B
• Custodian of C needs to establish redirection patterns, often rather simple rules
• The problem of getting links established to PID(B) does not arise; the URI in the browser address bar (initially B, later C) is just fine
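The "rather simple rules" mentioned above might look like pattern-based rewrites from the old URI space to the new one. The sketch below assumes hypothetical domains and paths (old.example.org, new.example.org) and a permanent redirect status; a production setup would normally express the same rules in the web server configuration.

```python
import re

# Hypothetical rule set: (pattern on the old URI B, template for the new URI C).
REDIRECT_RULES = [
    (re.compile(r"^http://old\.example\.org/papers/(\d+)\.html$"),
     r"http://new.example.org/articles/\1"),
]


def redirect_for(uri: str):
    """Return (status, headers) for a request to the old domain."""
    for pattern, template in REDIRECT_RULES:
        if pattern.match(uri):
            return 301, {"Location": pattern.sub(template, uri)}
    return 404, {}


moved = redirect_for("http://old.example.org/papers/42.html")
print(moved)
```

This only works for as long as the custodian of C keeps control of the old domain, which is the first condition listed on the slide.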
Exploring Link Rot & Content Drift
Content Drift Occurs when B Changes over Time
• Was not really considered an issue because:
  • the objects that receive PIDs were typically static, e.g. scientific papers
  • when a (substantially) new version of an object is published, a new PID is assigned
• But:
  • PID links (typically) lead to landing pages, not the identified objects
  • landing pages are increasingly rich and aggregate comments, discussion, and annotations; they do change over time
Custodian of B Takes Snapshots of B as it Evolves over Time
Custodian of B Ensures Snapshots of B as it Evolves over Time
• This does not happen for PID-identified objects, AFAIK
• Version Control Systems (e.g. Wikipedia) hold on to all versions; snapshots are local.
• Pro-active archiving solutions for web servers that create snapshots when e.g. new content is published/visited or at regular intervals:
  • on-demand archiving of a web server, cf. archiefweb.eu, archive-it.org
  • self-archiving web server, cf. SiteStory
• How to access the snapshots of B? Memento!
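Conceptually, accessing snapshots through Memento comes down to choosing, from all known snapshots of B, the one closest to a requested datetime. The sketch below shows only that selection step. The list of snapshot datetimes is made up for illustration, apart from the Feb 8 1999 capture that appears earlier in this talk.

```python
from datetime import datetime, timezone


def closest_snapshot(snapshots, requested):
    """Pick the snapshot datetime nearest to the requested datetime."""
    return min(snapshots, key=lambda captured: abs(captured - requested))


utc = timezone.utc
snapshots_of_b = [
    datetime(1998, 12, 2, 9, 0, 0, tzinfo=utc),    # hypothetical capture
    datetime(1999, 2, 8, 2, 12, 57, tzinfo=utc),   # capture shown earlier in the talk
    datetime(2001, 6, 30, 18, 30, 0, tzinfo=utc),  # hypothetical capture
]
best = closest_snapshot(snapshots_of_b, datetime(1999, 3, 9, tzinfo=utc))
print(best.isoformat())
```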
SiteStory Transactional Archive & Memento
https://mementoweb.github.io/SiteStory/
SiteStory, Wikipedia, Web Archive, Memento in Action
http://lanlsource.lanl.gov/hello
Exploring Link Rot & Content Drift
Scholarly Context Not Found
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
Link Rot - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
Exploring Link Rot & Content Drift
Scholarly Context Adrift
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
How to Assess Content Drift?
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
Step 2: Select Representative Mementos
Text Similarity Measures
• Compute aggregate text similarity scores (values between 0 and 100) for:
  • Simhash
  • Jaccard
  • Sørensen-Dice
  • Cosine
• If the aggregate score is 100, we decide that the Pre/Post Mementos are representative
• We find 313K URI references with representative Mementos
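To make the similarity test concrete, here is a minimal sketch of three of the listed measures (Jaccard, Sørensen-Dice, and cosine) on word tokens, each scaled to 0-100. Simhash is left out for brevity, and the tokenization and the use of a plain mean as the aggregate are assumptions of this sketch, not necessarily what the study did.

```python
import math
import re
from collections import Counter


def tokens(text: str):
    return re.findall(r"\w+", text.lower())


def jaccard(a, b) -> float:
    sa, sb = set(a), set(b)
    union = sa | sb
    return 100.0 * len(sa & sb) / len(union) if union else 100.0


def sorensen_dice(a, b) -> float:
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 100.0 * 2 * len(sa & sb) / total if total else 100.0


def cosine(a, b) -> float:
    ca, cb = Counter(a), Counter(b)
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


def aggregate_score(text_a: str, text_b: str) -> float:
    """Mean of the three measures (the mean is an assumption of this sketch)."""
    a, b = tokens(text_a), tokens(text_b)
    return (jaccard(a, b) + sorensen_dice(a, b) + cosine(a, b)) / 3


def is_representative(pre_text: str, post_text: str) -> bool:
    # Pre/Post Mementos count as representative only when the texts are identical.
    return aggregate_score(pre_text, post_text) >= 100.0


same = is_representative("The IceCube neutrino observatory", "the icecube neutrino observatory")
changed = is_representative("The IceCube neutrino observatory", "Page not found")
print(same, changed)
```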
URI References without Representative Mementos - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
Step 3: Dereference Live Web Version of URI
Step 4: Representative Memento vs. Live Version
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
Exploring Link Rot & Content Drift
Uncertainty Regarding the Future of B when A Links to It
Custodian of A Takes a Snapshot of B when Linking to It
Taking Snapshots of B: Automation is Key
• Web archive APIs for on-demand archiving
  • perma.cc, Internet Archive, archive.is, webcitation
• Amber for WordPress & Drupal archives resources linked in a page
  • http://amberlink.org/
• Hiberlink’s experimental Zotero extension archives bookmarked URLs
  • http://hiberlink.org/zotero.html
• Hiberlink’s experimental HiberActive archives all URLs referenced in a newly submitted paper
  • https://www.slideshare.net/martinklein0815/hiberactive
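The automation described in these bullets typically starts by collecting the outlinks of a page or paper and handing each one to an archiving service. The sketch below covers the collection step with the standard library; the submit callback is a hypothetical placeholder for whatever on-demand archiving API is used, and it is not modeled on any of the tools named above.

```python
from html.parser import HTMLParser


class OutlinkCollector(HTMLParser):
    """Collect absolute http(s) link targets from anchor tags, without duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(("http://", "https://")) and href not in self.links:
            self.links.append(href)


def outlinks(html_text: str):
    collector = OutlinkCollector()
    collector.feed(html_text)
    return collector.links


def archive_all(html_text: str, submit):
    """Call submit(url) for every outlink; submit is a hypothetical archiving callback."""
    return {url: submit(url) for url in outlinks(html_text)}


page = '<p><a href="http://www.vogin.nl/">VOGIN</a> <a href="#top">top</a> <a href="http://www.vogin.nl/">again</a></p>'
receipts = archive_all(page, submit=lambda url: "snapshot-requested:" + url)
print(receipts)
```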
Linking to Snapshot of B = Potentially Creating a Rotten Link
• Existing practice for linking to snapshots:
<a href="URL of snapshot of B">
• Problems with existing practice:
  o Impossible to visit the original URI, if desired
  o Requires the permanent existence/uptime of the archive that holds the snapshot
    - One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
Permanent Existence/Uptime of Archives?
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
Link to Snapshot of B and Decorate the Link
• Desired practice for linking to captures is to decorate the link so it provides a variety of options:
<a href="URL of snapshot of B" data-originalurl="B" data-versiondate="datetime of snapshot of B">
• Supports:
  o Revisiting the original URL
  o Finding snapshots in any web archive (original URL)
  o Finding a temporally appropriate snapshot in any web archive (original URL & snapshot datetime)
  o Automatically accessing a temporally appropriate snapshot in any web archive (Memento, original URL & snapshot datetime)
http://robustlinks.mementoweb.org/spec/
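A decorated link of this kind can be generated and read back with a few lines of code. The sketch below uses the snapshot of vogin.nl shown earlier in the talk. The helper names are hypothetical, and writing the version date as an ISO 8601 date is an assumption of this sketch that should be checked against the specification.

```python
from html import escape
from html.parser import HTMLParser


def robust_link(snapshot_url: str, original_url: str, version_date: str, text: str) -> str:
    """Render an anchor that links to the snapshot and carries the two decorations."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


class RobustLinkReader(HTMLParser):
    """Collect the options a decorated link offers to a reader or a script."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "data-originalurl" in attributes:
            self.found.append({
                "snapshot": attributes.get("href"),
                "original": attributes["data-originalurl"],
                "versiondate": attributes.get("data-versiondate"),
            })


markup = robust_link(
    "http://web.archive.org/web/20121101043952/http://vogin.nl",
    "http://vogin.nl",
    "2012-11-01",
    "VOGIN",
)
reader = RobustLinkReader()
reader.feed(markup)
print(markup)
print(reader.found)
```

If the archive holding the snapshot disappears, the original URL and the version date remain in the page, so another archive can still be asked for a copy from around that time.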
Robust Links: Link Decoration in Action
Van de Sompel H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In: D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the link decorations actionable