Charlie Hull, Managing Director, Flax, 1st November 2012

Charlie Hull
Managing Director, Flax
1st November 2012

charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Search, plus

building taxonomy, autoclassification and media monitoring tools with open source software

Who are Flax?

Search engine specialists with decades of experience

Developers, innovators and strategists based in Cambridge, UK

Technology agnostic – but open source exponents

UK Authorized Partner of Lucid Imagination

Customers include Reed Specialist Recruitment, Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen, Accenture, University of Cambridge, Cabinet Office...

Come to

Who am I?

Wrote my first saleable software at age 14

Electronic engineer, Windows device driver developer, mobile networks, pro sound, racing cars...

Muscat (Bayesian search)

Helped build a half-billion-page web search engine

Co-founder and CEO of Flax

What I'll cover today

Search – the state of play

Clade – an open source taxonomy based classifier

Flax Media Monitor

Some other crazy ideas

Conclusions

Search – the state of play

Types of search project:
– Website search
– Intranet search
– Database search

Closed source engines either:
– Sold up
– Repositioned
– In trouble!

Open source engines:
– Apache Lucene/Solr
– ElasticSearch
– (ish!) Attivio/Lucidworks/...

Let's talk about something more interesting...

Clade: classifying data into a taxonomy with a search engine
– Developed as a proof of concept
– Based on Apache Solr & Stanford NLP
– Written in jQuery & Python

Caveats:
– We don't know much about library science!
– Something like this may already exist (not that we could find it...)
– This is an alpha version only
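As a rough illustration of what "classifying data into a taxonomy with a search engine" can mean, the sketch below attaches a few keywords to each taxonomy node and runs them as an OR query against a tiny in-memory inverted index. This is not Clade's code: the taxonomy, the keywords, the scoring rule and all function names are assumptions made for the example, and a real system would use Solr rather than a Python dictionary.

```python
from collections import defaultdict
import re

# Hypothetical taxonomy: each node is defined by a handful of keywords.
TAXONOMY = {
    "Transport/Aviation": ["airport", "flight", "airline"],
    "Transport/Rail": ["train", "railway", "station"],
    "Leisure/Hotels": ["hotel", "resort", "spa"],
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: term -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def classify(docs, taxonomy, min_hits=1):
    """Run each node's keywords as an OR query.

    A document's score for a node is the number of distinct node
    keywords it contains (an assumed, deliberately simple scoring rule).
    """
    index = build_index(docs)
    result = defaultdict(dict)
    for node, keywords in taxonomy.items():
        scores = defaultdict(int)
        for keyword in keywords:
            for doc_id in index.get(keyword, ()):
                scores[doc_id] += 1
        for doc_id, score in scores.items():
            if score >= min_hits:
                result[doc_id][node] = score
    return dict(result)


DOCS = {
    "d1": "The new airport hotel offers a spa and a short flight transfer.",
    "d2": "Delays at the railway station after a train fault.",
}
CLASSIFIED = classify(DOCS, TAXONOMY)
```

Here `CLASSIFIED["d1"]` lists both the aviation and the hotel node, which shows why a document can sit in more than one place in the taxonomy.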

Clade demo....

What Clade doesn't do (yet)

Talk standard taxonomy formats

Output anything

Multiple users

Rules-based classification

Look pretty

http://www.flax.co.uk/the_software to try it out...

Media monitoring

Standard search – few queries over many documents

Monitoring search – many queries over each document

Customers' interests are manually turned into queries

Humans probably still have the final say on relevance

Eventual result is a list of articles emailed to (or even printed for) customers
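The inversion from "few queries over many documents" to "many queries over each document" can be sketched as follows. The stored queries are indexed rather than the articles: each query is filed under one of its required terms, so an incoming article only has to verify the queries whose filing term it actually contains. This is a common way to avoid brute force and is offered only as an assumption about the general approach, not as a description of how Flax's software works; the `Monitor` class and the all-terms-required query model are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Monitor:
    """Holds many stored queries and matches them against one article at a time."""

    def __init__(self):
        self.queries = {}                # query_id -> set of required terms
        self.by_term = defaultdict(set)  # filing term -> query_ids

    def add_query(self, query_id, terms):
        """Store a query that requires ALL of the given terms."""
        required = {t.lower() for t in terms}
        self.queries[query_id] = required
        # File the query under a single term (alphabetically first here;
        # picking the rarest term would prune more candidates).
        self.by_term[min(required)].add(query_id)

    def match(self, article):
        """Return the IDs of all stored queries that the article satisfies."""
        words = set(tokenize(article))
        candidates = set()
        for word in words:
            candidates |= self.by_term.get(word, set())
        # Only the candidates are checked in full.
        return sorted(q for q in candidates if self.queries[q] <= words)


monitor = Monitor()
monitor.add_query("q1", ["vodafone", "merger"])
monitor.add_query("q2", ["hotel", "spa"])
monitor.add_query("q3", ["vodafone", "tariff"])

MATCHES = monitor.match("Vodafone confirms merger talks")
```

In this run only `q1` is ever verified in full; `q2` and `q3` are skipped because their filing terms do not occur in the article.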

Media monitoring - parameters

Tens of thousands of stored expressions or keywords
– Can't rewrite these so must use same syntax!

Hundreds of thousands of articles to monitor every day

Source data can sometimes be scanned & OCR'd

False positives cost human operator time: false negatives cost customers!

Traditional approach is brute force using standard search engine software

Media monitoring – a Keyword

("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR "HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR "TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*" OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD" OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR "HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR "!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE" OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR "!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT* & SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR "AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR "OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR "CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR "TRANSFER*"))))

Media monitoring – another Keyword

((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))
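To give a feel for what evaluating such an expression involves, here is a minimal sketch of one proximity clause. It assumes that `A W/n B` means "some match for A occurs within n words of some match for B" and that a trailing `*` is a prefix wildcard; the slides do not define the syntax, so both readings are assumptions. The `!`, `%` and `?` markers, multi-word phrases, `AND NOT` and nesting are deliberately left out, and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into words (keeping '&')."""
    return re.findall(r"[a-z0-9&]+", text.lower())


def positions(tokens, pattern):
    """Token positions matching one single-word pattern.

    A trailing '*' is treated as a prefix wildcard (assumed meaning).
    """
    pattern = pattern.lower()
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(stem)]
    return [i for i, tok in enumerate(tokens) if tok == pattern]


def any_positions(tokens, patterns):
    """OR over several patterns: the union of their match positions."""
    found = set()
    for pattern in patterns:
        found.update(positions(tokens, pattern))
    return sorted(found)


def within(tokens, left, right, n):
    """Assumed reading of `left W/n right`: two matches at most n words apart."""
    left_pos = any_positions(tokens, left)
    right_pos = any_positions(tokens, right)
    return any(abs(i - j) <= n for i in left_pos for j in right_pos)


TEXT = "Flights from London depart daily, with a cruise offer available in spring."
TOKENS = tokenize(TEXT)

# Mirrors the shape of: ("FLIGHT*" OR "CRUISE*") W/4 ("OFFER" OR "DEPART*")
NEAR = within(TOKENS, ["FLIGHT*", "CRUISE*"], ["OFFER", "DEPART*"], 4)
FAR = within(TOKENS, ["FLIGHT*"], ["SPRING"], 4)
```

`NEAR` holds because "Flights" and "depart" are three words apart, while `FAR` fails because "Flights" and "spring" are separated by more than four words.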

Flax Media Monitor

Based on a modification of Apache Lucene/Solr

Runs a separate Solr server for archiving

Consumes XML articles & keywords

Outputs matches as XML

REST API for status & configuration

Allows you to test new Keywords on old content

Flax Media Monitor demo...

Flax Media Monitor - performance

For simple keywords (<20 terms):
– 70,000 keywords applied per second to an article
– Tested on a MacBook
– 20 times faster than previous implementation

For more complex keywords (some run to three pages!)
– 20,000 keywords applied in 0.5 seconds
– Approx 2000 docs/hour

Can be scaled horizontally for high load (and needs a lot less hardware)

Archive can store tens to hundreds of millions of articles

Some other crazy ideas...

Combine media monitoring with Clade: very fast expression-based classification!

We can parse syntax from other search engines...

How to store rapidly changing classification data in a search engine index:

1. Re-index all documents affected (expensive)

2. Store the classifications somewhere else: how about a Lucene codec backed by a NoSQL Database?

http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
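The second option can be pictured with a small sketch in which the document text lives in the index while the fast-changing classifications live in a separate key-value store and are joined in at query time. This is not the Redis-backed Lucene codec described in the blog post above; it only illustrates the separation. A plain dictionary stands in for the NoSQL database, and `SidecarClassifications`, `TinyIndex` and the `index_writes` counter are hypothetical names invented for the example.

```python
class SidecarClassifications:
    """Stand-in for an external key-value store (a plain dict here)."""

    def __init__(self):
        self._store = {}

    def set(self, doc_id, labels):
        """Overwrite the labels for one document; the index is not touched."""
        self._store[doc_id] = set(labels)

    def get(self, doc_id):
        return self._store.get(doc_id, set())


class TinyIndex:
    """Toy search index that reads classifications from the side store."""

    def __init__(self, sidecar):
        self.docs = {}
        self.sidecar = sidecar
        self.index_writes = 0  # counts (re-)indexing operations

    def add(self, doc_id, text):
        self.docs[doc_id] = text.lower()
        self.index_writes += 1

    def search(self, term, label=None):
        """Find documents containing the term, optionally filtered by label."""
        hits = [d for d, text in self.docs.items() if term.lower() in text.split()]
        if label is not None:
            hits = [d for d in hits if label in self.sidecar.get(d)]
        return sorted(hits)


sidecar = SidecarClassifications()
index = TinyIndex(sidecar)
index.add("a1", "Vodafone announces merger")
index.add("a2", "Hotel group announces merger")

sidecar.set("a1", ["Telecoms"])
sidecar.set("a2", ["Telecoms"])  # initial (wrong) classification
BEFORE = index.search("merger", label="Telecoms")

sidecar.set("a2", ["Leisure"])   # reclassify without re-indexing
AFTER = index.search("merger", label="Telecoms")
```

After the reclassification `index.index_writes` is still 2: the label change is visible to searches without any document being re-indexed, which is the point of option 2.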

Conclusions

Search isn't just about “search”

Taxonomy management is ready for open source

Media monitoring can be done at low cost for high volume with open source
– Classification maybe as a special case of monitoring?

It's all much more fun than 'vanilla' search!

Thank you!

Any questions?

charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch