
Demonstration of Prototype

Some challenges (and some solutions…)

• Classification – self-selection vs. categorisation

– Solution, for now, is a combination of approaches (more in a second)

• Expectation management

– Might have been handled better from the outset: making our expectations clear is probably important

– ‘Prototype’ status has its issues

• Relating themes to specific events/projects

– Have begun incorporating events & projects into the system, using the same sort of vocabulary as that used for themes & researchers

Classification – the solution (?)

Mixture of controlled classification schemes:

• RCUK research classification scheme

– Cross-disciplinary

– Hierarchical

– Tied to funding

– Relational MySQL version of the scheme created, and shared on the blog
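A hierarchical scheme like this can be held relationally as an adjacency list, where each row points at its parent. The sketch below is illustrative only: it uses Python's built-in sqlite3 module rather than MySQL, and the table name, column names and placeholder labels are assumptions, not the schema shared on the blog or real RCUK categories.

```python
import sqlite3

# In-memory database standing in for the MySQL version (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES category(id),  -- NULL for top-level entries
    label     TEXT NOT NULL
);
-- Placeholder rows, not real entries from any scheme.
INSERT INTO category (id, parent_id, label) VALUES
    (1, NULL, 'Top-level area'),
    (2, 1,    'Subject'),
    (3, 2,    'Topic');
""")

def path_to_root(conn, category_id):
    """Return the labels from the top of the hierarchy down to category_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(id, parent_id, label, depth) AS (
            SELECT id, parent_id, label, 0 FROM category WHERE id = ?
            UNION ALL
            SELECT c.id, c.parent_id, c.label, a.depth + 1
            FROM category c JOIN ancestors a ON c.id = a.parent_id
        )
        SELECT label FROM ancestors ORDER BY depth DESC
        """,
        (category_id,),
    ).fetchall()
    return [label for (label,) in rows]

print(path_to_root(conn, 3))
```

Walking up the parent links gives the full path for any category, which is what makes the hierarchy usable when tagging a researcher or theme at any level. Recursive queries of this kind need MySQL 8.0 or later; on older versions the walk would have to be done in application code.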

Classification – the solution (?)

• Some of the other classification schemes we considered include:

– The University’s own College/School structure

    • Lacked granularity. Recently re-structured...

– Eurostat’s Classifications metadata

    • Focus on economic activity

– The EU’s Nomenclature for the Analysis and Comparison of Scientific Programmes and Budgets (NABS) classification

    • Largely science-based

– The Universal Decimal Classification Summary (udcS)

    • Probably closest to our needs

    • Perhaps lacked familiar nomenclature

Classification – the solution (?)

• ESRC National Centre for Research Methods (NCRM)

– Degree of top-down approval (Research Council)

– Provides an implicit hierarchy

• We found none of the potential schemes to be exhaustive

– Social Sciences focus of the NCRM scheme actually includes a pretty comprehensive list of qualitative and quantitative methods

Some technical points

• Text extraction (from PDF) was less trivial than expected

– Decoding streams, dealing with odd characters, etc.

• Authentication was somewhat problematic

– More of an institutional hurdle than a technical challenge

• Search and comparison algorithms have been improved by incorporation of stemming and fuzzy search
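To give a flavour of the stream decoding mentioned above: PDF content streams are commonly compressed with the FlateDecode filter, which is zlib/deflate. The following is a minimal sketch in Python, not the project's own code. The function names are hypothetical, and it handles only Flate-compressed streams and simple `(text) Tj` operators, ignoring font encodings, escapes and other filters that real PDFs use.

```python
import re
import zlib

# "stream" keyword (not the tail of "endstream"), its line break, then the data.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings shown with the Tj text operator.
SHOWN_TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)

def decode_flate_streams(pdf_bytes):
    """Decompress every stream that turns out to be valid zlib data."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing line break after the data.
            decoded.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed (or damaged): skip it
    return decoded

def clean(raw):
    """Decode bytes and blank out non-printable ("odd") characters."""
    text = raw.decode("latin-1")
    return "".join(ch if ch.isprintable() else " " for ch in text).strip()

def extract_text(pdf_bytes):
    return [
        clean(shown)
        for stream in decode_flate_streams(pdf_bytes)
        for shown in SHOWN_TEXT_RE.findall(stream)
    ]

# A tiny hand-built fragment standing in for a real PDF object.
content = b"BT /F1 12 Tf (Research themes) Tj ET"
sample = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
print(extract_text(sample))
```

Even this toy version shows why the task is less trivial than it looks: the text only appears after decompression, and it is still wrapped in drawing operators that have to be parsed.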

Stemming

• Using a version of the Porter stemming algorithm

• Used to suggest keywords from publications and project descriptions

• Much more useful (in my opinion!) when used to conflate search results

• Stemming can optionally be switched on in the search engine
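The conflation idea can be illustrated with a deliberately simplified suffix-stripping stemmer. This is not the Porter algorithm, which applies several ordered rule phases; it is a toy stand-in written in Python, and the function names and sample documents are invented for the example.

```python
import re
from collections import defaultdict

# Toy rule set: far cruder than Porter's rules (assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Stem the query too, so word variants land on the same entry."""
    return sorted(index.get(toy_stem(query), set()))

docs = {
    "pub1": "Searching archives of funded projects",
    "pub2": "We searched one project description",
    "pub3": "Unrelated text",
}
index = build_index(docs)
print(search(index, "searches"))
```

Because both the indexed text and the query are reduced to stems, "searching", "searched" and "searches" all return the same documents, which is the conflation effect described above.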

Fuzzy Search

• Experimented with an implementation of Jaro–Winkler distance

• Also tried PHP’s built-in similar_text function

• Finally settled on Levenshtein distance

– Wrapped up the native PHP function with some additional parameters for acceptable distances

• Fuzzy search is another option in the search engine

– But quite useful as another conflation tool behind the scenes

• Demo…
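As a rough illustration of wrapping Levenshtein distance with an acceptable-distance parameter, here is a minimal Python sketch. The prototype used PHP's native function; this version computes the distance itself, and the wrapper name `fuzzy_match` and its `max_distance` parameter are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def fuzzy_match(query, candidates, max_distance=2):
    """Return (candidate, distance) pairs within max_distance, closest first."""
    scored = [(c, levenshtein(query.lower(), c.lower())) for c in candidates]
    return sorted(
        [(c, d) for c, d in scored if d <= max_distance],
        key=lambda pair: pair[1],
    )

keywords = ["classification", "categorisation", "stemming"]
print(fuzzy_match("clasification", keywords))
```

A fixed threshold is the simplest choice. Short keywords may need a tighter limit than long ones, since two edits can turn a short word into an unrelated one.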
