#iCanHazRobot?: Improved robot detection for IR usage statistics

Leabharlann UCD An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire UCD Library University College Dublin, Belfield, Dublin 4, Ireland Joseph Greene Research Repository Librarian University College Dublin [email protected] http://researchrepository.ucd.ie #iCanHazRobot? Improved robot detection for IR usage statistics Open Repositories 2016 Dublin, 14 June

Upload: ucd-library

Post on 11-Apr-2017


Page 1: #iCanHazRobot?: improved robot detection for IR usage statistics

Leabharlann UCD

An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire

UCD Library

University College Dublin, Belfield, Dublin 4, Ireland

Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
http://researchrepository.ucd.ie

#iCanHazRobot?
Improved robot detection for IR usage statistics

Open Repositories 2016
Dublin, 14 June

Page 2: #iCanHazRobot?: improved robot detection for IR usage statistics

Overview and take-home points

• Usage stats are important
  – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated

Page 3: #iCanHazRobot?: improved robot detection for IR usage statistics

Experimental study

• Simple random sample of 2 years of UCD repository’s download data
  – n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
  – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
  – The data is real, live data, and the algorithms were very easy to simulate
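A simple random sample of download records can be sketched as follows. This is a minimal illustration, assuming the records are already loaded into a Python sequence; the function name, the seed, and the stand-in population are hypothetical and are not taken from the study.

```python
import random


def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement; every record is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(records), n)


# Hypothetical stand-in for download records: plain record IDs.
population = range(10_000)
sample = simple_random_sample(population, 341, seed=42)
print(len(sample))
```

Each sampled record would then be checked by hand and labelled robot or human.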

Page 4: #iCanHazRobot?: improved robot detection for IR usage statistics

First finding

85% of unfiltered repository downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% of downloads to be robots

Page 5: #iCanHazRobot?: improved robot detection for IR usage statistics

Catching more robots improves stats
(But how much depends on the number of robots)

[Figure: line chart. x-axis: Recall (robots), 0 to 1 ("Catch more robots"). y-axis: Accuracy of download stats (inverse precision), 0 to 1 ("Get better stats"). One curve per site type:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• OA repositories, 85% robot
• Internet Archive, 91% robot
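The shape of these curves can be reproduced under one simplifying assumption, made here for illustration only and not stated on the slide: the filter never discards a human download. If a share r of all downloads comes from robots and the filter catches a fraction R of them (recall), the share of reported downloads made by humans is (1 − r) / ((1 − r) + r(1 − R)). The function name below is hypothetical; the robot shares are the four from the figure.

```python
def stats_accuracy(robot_share, recall):
    """Share of reported downloads made by humans after robot filtering.

    Assumption (illustrative only): no human downloads are wrongly discarded.
    """
    humans = 1.0 - robot_share
    robots_left = robot_share * (1.0 - recall)
    remaining = humans + robots_left
    if remaining == 0:
        raise ValueError("no downloads remain after filtering")
    return humans / remaining


# Robot shares taken from the figure; recall values chosen for illustration.
for label, share in [("Typical website", 0.15), ("OA journal", 0.40),
                     ("OA repositories", 0.85), ("Internet Archive", 0.91)]:
    row = [round(stats_accuracy(share, r), 2) for r in (0.0, 0.5, 0.9, 1.0)]
    print(label, row)
```

With few robots the curve is nearly flat; with 85% robots, most of the gain in accuracy only arrives as recall approaches 1.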

Page 6: #iCanHazRobot?: improved robot detection for IR usage statistics

Robot detection techniques used

Systems compared: DSpace; EPrints; Minho DSpace Statistics Add-on

• Rate of requests: ✓ (note 3)
• User agent string: ✓ ✓ ✓
• robots.txt access: ✓
• Volume of requests: ✓ (note 2), ✓ (note 3)
• List of known robot IP addresses: ✓ ✓
• Reverse DNS name lookup: ✓ (note 1)
• Trap file: ✓
• User agents per IP address
• Width of traversal in the URL space: ✓ (note 3)

Notes:
1 Only implemented nominally or experimentally
2 Via the repeat download or ‘double-click’ filter
3 Data available as a configurable report for manual decision making

Page 7: #iCanHazRobot?: improved robot detection for IR usage statistics

Measurements used in robot detection

• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
  – I can haz robot?
• Precision: true positives in robot detection
  – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats measured as inverse precision:
  – Proportion of stats that are actually made by humans
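The three measurements can be computed from the four outcomes of a manual check, as in the sketch below. The function name and the example counts are hypothetical and are not the study's data.

```python
def detection_metrics(tp, fp, tn, fn):
    """Recall, precision and inverse precision for a robot filter.

    tp: robots flagged as robots      fp: humans flagged as robots
    tn: humans counted as humans      fn: robots counted as humans
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Inverse precision: of the downloads that stay in the stats,
    # the proportion that were really made by humans.
    inverse_precision = tn / (tn + fn) if (tn + fn) else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "inverse_precision": inverse_precision,
    }


# Hypothetical counts for 100 manually checked downloads.
metrics = detection_metrics(tp=60, fp=5, tn=10, fn=25)
print(metrics)
```

In this made-up case the filter catches about 71% of robots, yet fewer than a third of the reported downloads are human.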

Page 8: #iCanHazRobot?: improved robot detection for IR usage statistics

How they perform, out-of-the-box

[Figure: bar chart, "Robot detection in OA IR systems". y-axis: 0 to 1. Systems: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection. Series: Recall; Precision; Negative precision (accuracy of download stats).]

Page 9: #iCanHazRobot?: improved robot detection for IR usage statistics

Room for improvement?

Page 10: #iCanHazRobot?: improved robot detection for IR usage statistics

1. Ability to manually check for outliers

• At UCD, once a month, we check:
  – Daily downloads for the last 2-4 months
  – Top 10 most downloaded items
  – Top 20 downloading IP addresses for the last 2-4 months
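The two "top N" checks amount to simple counts over the download records. The sketch below assumes each record is an (IP address, item ID) pair; the function names and sample data are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
from collections import Counter


def top_downloaders(records, limit=20):
    """Return the IP addresses with the most downloads, highest first."""
    return Counter(ip for ip, _item in records).most_common(limit)


def top_items(records, limit=10):
    """Return the most downloaded items, highest first."""
    return Counter(item for _ip, item in records).most_common(limit)


# Hypothetical download records: (ip, item_id).
records = [
    ("192.0.2.1", "item-a"), ("192.0.2.1", "item-a"), ("192.0.2.1", "item-b"),
    ("198.51.100.7", "item-a"), ("203.0.113.5", "item-c"),
]
print(top_downloaders(records, limit=2))
print(top_items(records, limit=2))
```

An IP address or item that sits far above the rest of the list is a candidate for manual review and exclusion.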

Page 11: #iCanHazRobot?: improved robot detection for IR usage statistics
Page 12: #iCanHazRobot?: improved robot detection for IR usage statistics
Page 13: #iCanHazRobot?: improved robot detection for IR usage statistics
Page 14: #iCanHazRobot?: improved robot detection for IR usage statistics
Page 15: #iCanHazRobot?: improved robot detection for IR usage statistics

[Figure: two bar charts, y-axis 0 to 1. Series in both: Out-of-the-box; With manual checking (outlier exclusion).]
• Robots caught (Recall): DSpace, EPrints, Minho
• Accuracy of reported download stats (Inverse precision): DSpace, EPrints, Minho, Without robot detection

Page 16: #iCanHazRobot?: improved robot detection for IR usage statistics

2. Recalibrate the EPrints repeat-download (double-click) filter

[Figure: bar chart, "Effect of double-click filter on EPrints’ robot detection and stats". y-axis: 0 to 1. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*. Formula shown on the slide: $\frac{T_p + T_n}{n}$]
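A repeat-download filter of this kind can be sketched as below, with the time window exposed as the parameter to recalibrate. This is an illustration and not the EPrints implementation: the function name, the event layout, the window values, and the choice to restart the window on every repeat are all assumptions.

```python
def filter_repeat_downloads(events, window_seconds):
    """Drop repeat downloads of the same item by the same IP address.

    events: iterable of (timestamp_seconds, ip, item_id), sorted by timestamp.
    A download is kept only if the same IP did not request the same item
    within the previous window_seconds.
    """
    last_seen = {}
    kept = []
    for ts, ip, item in events:
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((ts, ip, item))
        # Assumption: every request, kept or not, restarts the window.
        last_seen[key] = ts
    return kept


# Hypothetical events: one IP fetching the same item four times, plus one other.
events = [
    (0, "192.0.2.1", "item-a"),
    (5, "198.51.100.7", "item-a"),
    (10, "192.0.2.1", "item-a"),
    (100, "192.0.2.1", "item-a"),
    (4000, "192.0.2.1", "item-a"),
]
short_window = filter_repeat_downloads(events, window_seconds=30)
long_window = filter_repeat_downloads(events, window_seconds=3600)
print(len(short_window), len(long_window))
```

Widening the window removes more repeats from a single address, at the risk of discarding genuine repeat visits by humans.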

Page 17: #iCanHazRobot?: improved robot detection for IR usage statistics

3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints

• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
  – Similar to EPrints' $is_robot variable in Robots.pm
  – Could be modified to update the DSpace 'isBot' field in the SOLR usage events document
• Requires 2 database tables to store learned agents and IPs
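The general idea of a log parser that reads Apache Combined Log Format and sets a robot flag can be sketched as follows. This is not Minho's code (which is a Java class): the detection rules, the agent markers, the function names, and the sample log lines are assumptions, and an in-memory set stands in for the database tables of learned IPs.

```python
import re

# Apache Combined Log Format:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def flag_robots(lines, agent_markers=("bot", "crawler", "spider")):
    """Return (entry, is_robot) pairs for every parseable line.

    Pass 1 learns robot IPs from user-agent markers and robots.txt requests
    (illustrative rules). Pass 2 flags every request from a learned IP.
    """
    entries = [e for e in map(parse_line, lines) if e is not None]
    learned_ips = set()
    for entry in entries:
        parts = entry["request"].split()
        path = parts[1] if len(parts) > 1 else ""
        agent = entry["agent"].lower()
        if path == "/robots.txt" or any(m in agent for m in agent_markers):
            learned_ips.add(entry["host"])
    return [(entry, entry["host"] in learned_ips) for entry in entries]


# Hypothetical log lines.
log_lines = [
    '192.0.2.10 - - [14/Jun/2016:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleFetcher/1.0"',
    '192.0.2.10 - - [14/Jun/2016:10:00:05 +0100] "GET /bitstream/123/paper.pdf HTTP/1.1" 200 52344 "-" "ExampleFetcher/1.0"',
    '198.51.100.7 - - [14/Jun/2016:10:01:00 +0100] "GET /bitstream/456/thesis.pdf HTTP/1.1" 200 99000 "http://example.org/" "Mozilla/5.0 (X11; Linux x86_64)"',
    '203.0.113.5 - - [14/Jun/2016:10:02:00 +0100] "GET /bitstream/789/report.pdf HTTP/1.1" 200 100 "-" "Examplebot/2.1"',
]
results = flag_robots(log_lines)
flags = [is_robot for _entry, is_robot in results]
print(flags)
```

In a repository the flag would be written back to the usage record instead of being returned, for example to the 'isBot' field mentioned above.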

Page 18: #iCanHazRobot?: improved robot detection for IR usage statistics

[Figure: two bar charts, y-axis 0 to 1. Series in both: Out-of-the-box; With Minho log parser.]
• Robots caught (Recall): DSpace, EPrints, Minho
• Accuracy of reported download stats (Inverse precision): DSpace, EPrints, Minho, Without robot detection

Page 19: #iCanHazRobot?: improved robot detection for IR usage statistics

4. Combine two or more techniques

[Figure: bar chart, "Robots caught (Recall)". y-axis: 0 to 1. Systems: DSpace, EPrints, Minho. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*

Page 20: #iCanHazRobot?: improved robot detection for IR usage statistics

4. Combine two or more techniques

[Figure: bar chart, "Accuracy of reported download stats (Inverse precision)". y-axis: 0 to 1. Systems: DSpace, EPrints, Minho, Without robot detection. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*

Page 21: #iCanHazRobot?: improved robot detection for IR usage statistics

Thank you!