Demonstration of Prototype

Upload: may-barnett

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Demonstration of Prototype. Some challenges (and some solutions…) Classification – self selection vs. categorisation – Solution, for now, is a combination

Demonstration of Prototype


Some challenges (and some solutions…)

• Classification – self selection vs. categorisation
– Solution, for now, is a combination of approaches (more in a second)

• Expectation Management
– Might have been handled better from the outset: making our expectations clear is probably important
– ‘Prototype’ status has its issues

• Relating themes to specific events/projects
– Have begun incorporating events & projects into the system, using the same sort of vocabulary as that used for themes & researchers


Classification – the solution (?)

Mixture of controlled classification schemes:

• RCUK research classification scheme
– Cross-disciplinary
– Hierarchical
– Tied to funding
– Relational MySQL version of the scheme created, and shared on the blog
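
To illustrate what a relational version of a hierarchical scheme can look like, here is a minimal sketch. It is not the schema shared on the blog: Python's sqlite3 stands in for MySQL, the table and column names are hypothetical, and the labels are placeholders rather than real RCUK entries. Each row simply points at its parent.

```python
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE classification (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES classification(id),
        label     TEXT NOT NULL
    )
    """
)

# Placeholder labels, not real entries from the RCUK scheme.
rows = [
    (1, None, "Top-level area A"),
    (2, 1, "Sub-area A1"),
    (3, 1, "Sub-area A2"),
    (4, 2, "Topic A1a"),
]
conn.executemany("INSERT INTO classification VALUES (?, ?, ?)", rows)


def path_to_root(conn, node_id):
    """Return labels from the given node up to its top-level ancestor."""
    query = """
        WITH RECURSIVE ancestors(id, parent_id, label, depth) AS (
            SELECT id, parent_id, label, 0 FROM classification WHERE id = ?
            UNION ALL
            SELECT c.id, c.parent_id, c.label, a.depth + 1
            FROM classification c
            JOIN ancestors a ON c.id = a.parent_id
        )
        SELECT label FROM ancestors ORDER BY depth
    """
    return [label for (label,) in conn.execute(query, (node_id,))]


def children_of(conn, node_id):
    """Return labels of the direct children of a node, ordered by id."""
    query = "SELECT label FROM classification WHERE parent_id = ? ORDER BY id"
    return [label for (label,) in conn.execute(query, (node_id,))]
```

Calling path_to_root(conn, 4) walks from a leaf topic back up to its top-level area, which is the sort of lookup needed when a researcher's chosen topic has to be rolled up to a broader theme. On a database without recursive queries, the same walk can be done with a loop of parent lookups.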


Classification – the solution (?)

• Some of the other classification schemes we considered include:

– The University’s own College/School structure
  • Lacked granularity. Recently re-structured...

– Eurostat’s Classifications metadata
  • Focus on economic activity

– The EU’s Nomenclature for the Analysis and Comparison of Scientific Programmes and Budgets (NABS) classification
  • Largely science-based

– The Universal Decimal Classification Summary (udcS)
  • Probably closest to our needs
  • Perhaps lacked familiar nomenclature


Classification – the solution (?)

• ESRC National Centre for Research Methods (NCRM)
– Degree of top-down approval (Research Council)
– Provides an implicit hierarchy

• None of the potential schemes we found was exhaustive
– The NCRM scheme, despite its Social Sciences focus, actually includes a pretty comprehensive list of qualitative and quantitative methods


Some technical points

• Text extraction (from PDF) was less trivial than expected
– Decoding streams, dealing with odd characters, etc.

• Authentication was somewhat problematic
– More of an institutional hurdle than a technical challenge

• Search and comparison algorithms have been improved by incorporation of stemming and fuzzy search
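
On the text-extraction point above, here is a rough sketch of what "decoding streams" involves. It is written in Python rather than the project's PHP, it is not the prototype's code, and it assumes the simplest case only: Flate-compressed content streams with literal strings shown via the Tj operator. A real extractor would also need to honour the stream's /Length entry, other filters, string escapes and font encodings.

```python
import re
import zlib

# Matches the raw bytes between the 'stream' and 'endstream' keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Matches literal strings shown with the Tj operator, e.g. (Hello) Tj.
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def extract_text(pdf_bytes):
    """Collect text from Flate-compressed content streams (simplified)."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or not a stream we can read
        for raw in TJ_RE.findall(content):
            # Latin-1 never fails to decode; unprintable characters are
            # replaced rather than dropped so they stay visible for clean-up.
            text = raw.decode("latin-1")
            chunks.append("".join(ch if ch.isprintable() else "?" for ch in text))
    return " ".join(chunks)


# A hand-made fragment standing in for a real PDF object.
content = b"BT /F1 12 Tf (Research themes) Tj ET"
fragment = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
```

Running extract_text(fragment) recovers the string inside the compressed stream. Even in this toy form, the "odd characters" problem shows up as soon as a string contains bytes outside plain ASCII.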


Stemming

• Using a version of the Porter stemming algorithm

• Used to suggest keywords from publications and project descriptions

• Much more useful (in my opinion!) when used to conflate search results

• Stemming can optionally be switched on in the search engine
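
To show what conflating results by stem means, here is a small Python sketch. Note that it does not implement the Porter algorithm: crude_stem is a much simpler, hypothetical suffix-stripper standing in for it, and the suffix list is an assumption chosen for the example.

```python
from collections import defaultdict

# Longest suffixes first so that 'ations' wins over 's'.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def crude_stem(word):
    """Strip one common English suffix; a stand-in for a real Porter stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(keywords):
    """Group keywords that share a stem."""
    groups = defaultdict(list)
    for keyword in keywords:
        groups[crude_stem(keyword)].append(keyword)
    return dict(groups)
```

For example, conflate(["Networks", "networking", "network"]) puts all three keywords under the single stem "network", so a search for any one of them can return results tagged with the others.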


Fuzzy Search

• Experimented with an implementation of Jaro–Winkler distance
– Also tried PHP’s built-in similar_text function

• Finally settled on Levenshtein distance
– Wrapped up the native PHP function with some additional parameters for acceptable distances

• Fuzzy search is another option in the search engine
– But quite useful as another conflation tool behind the scenes
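
The idea of wrapping Levenshtein distance with "acceptable distance" parameters can be sketched as follows. This is a Python illustration, not the prototype's PHP wrapper: the distance is computed by hand instead of calling PHP's native function, and the parameter names max_distance and max_ratio (and their default values) are assumptions made up for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, candidate, max_distance=2, max_ratio=0.34):
    """True if candidate is within an acceptable edit distance of query.

    max_distance and max_ratio are hypothetical parameters: an absolute cap,
    and a cap relative to the longer string so short words are not over-matched.
    """
    query, candidate = query.lower(), candidate.lower()
    distance = levenshtein(query, candidate)
    longest = max(len(query), len(candidate), 1)
    return distance <= max_distance and distance / longest <= max_ratio


def fuzzy_search(query, candidates, **limits):
    """Return the candidates that fuzzily match the query, in input order."""
    return [c for c in candidates if fuzzy_match(query, c, **limits)]
```

With these defaults, fuzzy_search("ethnography", ["Ethnography", "ethnografy", "geography"]) keeps the first two and rejects the third. The relative cap matters for short terms: "at" and "it" are only one edit apart, but that is half the word, so they are not treated as a match.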


• Demo…