How accurate are IR usage statistics?

Leabharlann UCD An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire UCD Library University College Dublin, Belfield, Dublin 4, Ireland Joseph Greene Research Repository Librarian University College Dublin [email protected] http://researchrepository.ucd.ie How accurate are IR usage statistics? Open Repositories 2016 Dublin, 16 June


Page 1: How Accurate are IR Usage Statistics?

Leabharlann UCD

An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire

UCD Library

University College Dublin, Belfield, Dublin 4, Ireland

Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
http://researchrepository.ucd.ie

How accurate are IR usage statistics?

Open Repositories 2016
Dublin, 16 June

Page 2: How Accurate are IR Usage Statistics?

Usage statistics are important for OA repositories

• How is the service used overall?
• Advocacy
  – Connects with authors on what is most important to them: the use of their research
• KPI for return on investment
  – Usage of a Library service
  – Visibility of university’s research

Page 3: How Accurate are IR Usage Statistics?
Page 4: How Accurate are IR Usage Statistics?

Monthly email sent to all depositors

Page 5: How Accurate are IR Usage Statistics?

Infographic distributed semi-annually by College Liaison Librarians

Page 6: How Accurate are IR Usage Statistics?

How accurate are they? Web robots

• Some follow rules
  – Search engines, Internet Archive, link checkers, Twitterbot, etc.
  – robots.txt, naming themselves in the user agent string
• Others do not
  – Email spammers, comment spammers, dictionary attackers, phishers, etc.
  – Often mimic human users
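As a rough illustration of the first group, a robot that names itself can be caught with a simple user agent check. This is a minimal sketch, not the method used by any of the systems discussed here; the token list and the function name `names_itself_as_robot` are hypothetical.

```python
# Hypothetical, deliberately short token list; production systems rely on
# maintained lists of known robot user agents.
ROBOT_TOKENS = ("bot", "crawler", "spider", "slurp")


def names_itself_as_robot(user_agent: str) -> bool:
    """Return True if the user agent string contains a known robot token."""
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)


examples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Twitterbot/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
for ua in examples:
    print(names_itself_as_robot(ua), ua)
```

A robot that copies a browser's user agent string passes this check untouched, which is why the second group is the harder problem.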

Page 7: How Accurate are IR Usage Statistics?

Experimental study

• Simple random sample of 2 years of UCD repository’s download data
  – n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection technique
  – U. Minho DSpace Stats Add-on
  – Monthly outlier exclusion (manual)

Greene, J. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016
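Drawing a simple random sample of this size can be sketched as follows. The function name, the seed and the idea of sampling row indices from a download log are assumptions for illustration; the slides do not say how the sample was drawn in practice.

```python
import random


def draw_simple_random_sample(population_size: int, sample_size: int, seed=None):
    """Pick `sample_size` distinct row indices, each row equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index is repeated.
    return sorted(rng.sample(range(population_size), sample_size))


# N = 3.3 million download records, n = 341 rows to check by hand.
rows_to_check = draw_simple_random_sample(3_300_000, 341, seed=2016)
print(len(rows_to_check), rows_to_check[:5])
```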

Page 8: How Accurate are IR Usage Statistics?

First finding

85% of the Research Repository UCD’s unfiltered downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found that 85% of downloads came from robots

Page 9: How Accurate are IR Usage Statistics?

Catching more robots improves stats (But how much depends on the number of robots)

[Figure: accuracy of download stats (inverse precision, 0 to 1) plotted against recall (robots, 0 to 1). Annotations: "Catch more robots", "Get better stats". Curves: Typical website, 15% robot traffic; OA journal, 40% robot; OA repositories, 85% robot; Internet Archive, 91% robot.]
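The relationship in this figure can be approximated with a few lines of arithmetic. The sketch below makes a simplifying assumption that is not stated in the slides: the detector never flags a human as a robot, so every reported download is either a human or a missed robot. The function name `inverse_precision` is illustrative.

```python
def inverse_precision(robot_share: float, recall: float) -> float:
    """Share of reported downloads that are human, assuming no false positives."""
    humans = 1.0 - robot_share
    missed_robots = robot_share * (1.0 - recall)
    return humans / (humans + missed_robots)


sites = [
    ("Typical website", 0.15),
    ("OA journal", 0.40),
    ("OA repositories", 0.85),
    ("Internet Archive", 0.91),
]
for label, share in sites:
    row = ", ".join(
        f"recall {r:.1f} -> {inverse_precision(share, r):.2f}" for r in (0.5, 0.9, 1.0)
    )
    print(f"{label} ({share:.0%} robot): {row}")
```

With few robots, even a weak detector leaves the statistics mostly human; with 85% robots, each missed robot weighs heavily on the reported figures.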

Page 10: How Accurate are IR Usage Statistics?

How did we do at UCD?

• What proportion of robot downloads did we catch? (Recall)
  – Our method catches 94% of all robots
• How often were we correct -- how many flagged downloads are actually human? (Precision)
  – 98.9% of downloads that we label robots really are robots
• How accurate are the download stats -- how many are actually made by human beings? (Inverse precision)
  – 73% of the download statistics as reported are human
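The three measures can be written out from a confusion matrix. The counts below are not taken from the slides: they are hypothetical values back-calculated so that, for a sample of 341, they reproduce the reported percentages. The class and field names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    robots_caught: int   # robots labelled as robots (true positives)
    humans_flagged: int  # humans labelled as robots (false positives)
    robots_missed: int   # robots left in the statistics (false negatives)
    humans_kept: int     # humans left in the statistics (true negatives)

    def recall(self) -> float:
        return self.robots_caught / (self.robots_caught + self.robots_missed)

    def precision(self) -> float:
        return self.robots_caught / (self.robots_caught + self.humans_flagged)

    def inverse_precision(self) -> float:
        return self.humans_kept / (self.humans_kept + self.robots_missed)


# Hypothetical counts summing to 341, chosen to match the reported figures.
ucd = Confusion(robots_caught=275, humans_flagged=3, robots_missed=17, humans_kept=46)
print(f"recall            {ucd.recall():.3f}")             # 0.942
print(f"precision         {ucd.precision():.3f}")          # 0.989
print(f"inverse precision {ucd.inverse_precision():.3f}")  # 0.730
```

Inverse precision is the figure that matters to a reader of the statistics: it looks only at the downloads that survive filtering and asks how many of them are human.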

Page 11: How Accurate are IR Usage Statistics?

How does that compare?

• Who knows? There are no other studies like this on repositories!

• Applied DSpace's and EPrints' web robot detection algorithms to our data
  – Experimental
  – Real data
  – Same dataset used for each ‘system’
  – Algorithms easy to mimic in vitro
  – But SEO, crawl behaviour may be different for different systems

Page 12: How Accurate are IR Usage Statistics?

Robot detection techniques used

Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on
(check marks listed as transcribed; superscript footnote numbers shown in parentheses)

• Rate of requests: ✓ (3)
• User agent string: ✓ ✓ ✓
• robots.txt access: ✓
• Volume of requests: ✓ (2), ✓ (3)
• List of known robot IP addresses: ✓ ✓
• Reverse DNS name lookup: ✓ (1)
• Trap file: ✓
• User agents per IP address: (no check marks)
• Width of traversal in the URL space: ✓ (3)

(1) Only implemented nominally or experimentally
(2) Via the repeat download or ‘double-click’ filter
(3) Data available as a configurable report for manual decision making
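One of the techniques named in the footnotes, the repeat download or ‘double-click’ filter, can be sketched as below. This is an illustration only: the 30-second window, the tuple layout and the function name are assumptions, and the code is not taken from any of the systems compared.

```python
from datetime import datetime, timedelta


def drop_repeat_downloads(events, window=timedelta(seconds=30)):
    """Keep a download only if the same IP has not requested the same file
    within `window` of its previous request for that file.

    `events` is an iterable of (timestamp, ip, file_id) tuples.
    """
    last_seen = {}
    kept = []
    for ts, ip, file_id in sorted(events):
        key = (ip, file_id)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window:
            kept.append((ts, ip, file_id))
        # Remember every request, kept or dropped, so rapid bursts stay collapsed.
        last_seen[key] = ts
    return kept


t0 = datetime(2016, 6, 16, 9, 0, 0)
events = [
    (t0, "192.0.2.1", "paper.pdf"),
    (t0 + timedelta(seconds=10), "192.0.2.1", "paper.pdf"),  # repeat, dropped
    (t0 + timedelta(minutes=2), "192.0.2.1", "paper.pdf"),   # later visit, kept
    (t0 + timedelta(seconds=5), "192.0.2.2", "paper.pdf"),   # other IP, kept
]
print(len(drop_repeat_downloads(events)))  # 3
```

A filter like this limits the volume a single address can add to one item's count, but it does nothing about a robot that spreads its requests across many items or addresses.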

Page 13: How Accurate are IR Usage Statistics?

Results

Page 14: How Accurate are IR Usage Statistics?

[Bar chart: Robots detected (Recall), scale 0 to 1]

• DSpace: 0.897
• EPrints: 0.911
• Minho (no manual outlier checking): 0.890
• Minho plus monthly manual checking (UCD): 0.942

Page 15: How Accurate are IR Usage Statistics?

[Bar chart: Accuracy of detection (Precision), scale 0 to 1]

• DSpace: 1.000
• EPrints: 0.940
• Minho (no manual outlier checking): 0.989
• Minho plus monthly manual checking (UCD): 0.989

Page 16: How Accurate are IR Usage Statistics?

[Bar chart: Accuracy of download stats (Inverse precision), scale 0 to 1]

• DSpace: 0.620
• EPrints: 0.552
• Minho (no manual outlier checking): 0.590
• Minho plus monthly manual checking (UCD): 0.730
• Without filtration: 0.144

I.e. 38% of DSpace’s reported downloads are made by robots, etc.

Page 17: How Accurate are IR Usage Statistics?

Robot detection in OA IR systems

[Grouped bar chart, scale 0 to 1. Series: Recall; Precision; Negative precision (accuracy of download stats). Groups: DSpace; EPrints; Minho; Minho with monthly manual checking (UCD); No robot detection.]

Page 18: How Accurate are IR Usage Statistics?

Thank you!