Charlie Hull, Managing Director, Flax, 1st November 2012


Page 1: Charlie Hull Managing Director, Flax 1 st  November 2012

Charlie Hull

Managing Director, Flax

1st November 2012

[email protected]

www.flax.co.uk/blog

+44 (0) 8700 118334

Twitter: @FlaxSearch

Search, plus: building taxonomy, autoclassification and media monitoring tools with open source software

Page 2

Who are Flax?

Search engine specialists with decades of experience

Developers, innovators and strategists based in Cambridge, UK

Technology agnostic – but open source exponents

UK Authorized Partner of Lucid Imagination

Customers include Reed Specialist Recruitment, Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen, Accenture, University of Cambridge, Cabinet Office...

Come to

Page 3

Who am I?

Wrote my first saleable software at age 14

Electronic engineer, Windows device driver developer, mobile networks, pro sound, racing cars....

Muscat (Bayesian search)

Helped build a half-billion-page web search engine

Co-founder and CEO of Flax

Page 5

What I'll cover today

Search – the state of play

Clade – an open source, taxonomy-based classifier

Flax Media Monitor

Some other crazy ideas

Conclusions

Page 6

Search – the state of play

Types of search project:
– Website search
– Intranet search
– Database search

Closed source engines either:
– Sold up
– Repositioned
– In trouble!

Open source engines:
– Apache Lucene/Solr
– ElasticSearch
– (ish!) Attivio/Lucidworks/...

Page 9

Let's talk about something more interesting...

Clade: classifying data into a taxonomy with a search engine
– Developed as a proof of concept
– Based on Apache Solr & Stanford NLP
– Written in jQuery & Python

Caveats:
– We don't know much about library science!
– Something like this may already exist (not that we could find it...)
– This is an alpha version only
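The slides do not show how Clade scores documents, so the following is only a minimal sketch of the general idea of classifying text into taxonomy nodes by matching keywords attached to each node. The taxonomy, its keywords and the function names are all hypothetical, and a plain in-memory term count stands in for the Solr queries a real system would use.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category path -> keywords describing that category.
TAXONOMY = {
    "Science/Physics": ["quantum", "particle", "energy"],
    "Science/Biology": ["cell", "species", "gene"],
    "Sport/Football": ["goal", "league", "striker"],
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(text, taxonomy=TAXONOMY, min_score=1):
    """Return (category, score) pairs, best first.

    The score is simply the number of keyword occurrences in the text;
    a search engine would normally supply a relevance score instead.
    """
    counts = Counter(tokenize(text))
    scores = {cat: sum(counts[k] for k in kws) for cat, kws in taxonomy.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(cat, score) for cat, score in ranked if score >= min_score]


if __name__ == "__main__":
    print(classify("The striker scored a goal in the league"))
```

Treating each taxonomy node as a stored query is what makes a search engine a natural fit here: adding a category means adding a query, not retraining a model.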

Page 10

Clade demo....

Page 11

What Clade doesn't do (yet)

Talk standard taxonomy formats

Output anything

Multiple users

Rules-based classification

Look pretty

http://www.flax.co.uk/the_software to try it out...

Page 12

Media monitoring

Standard search – few queries over many documents

Monitoring search – many queries over each document

Customers' interests are manually turned into queries

Humans probably still have the final say on relevance

Eventual result is a list of articles emailed to (or even printed for) customers
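The inversion described on this slide (many stored queries run against each incoming document, rather than one query over many documents) can be sketched as below. This is an illustration under simplifying assumptions, not Flax's implementation: the stored queries are hypothetical and are reduced to "all of these terms, none of those terms".

```python
import re


def tokenize(text):
    """Return the set of lower-cased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical stored queries: name -> (terms that must all appear, terms that must not).
STORED_QUERIES = {
    "tourism-board": ({"palm", "beach"}, {"obituary"}),
    "telecoms-desk": ({"vodafone"}, set()),
}


def monitor(article, queries=STORED_QUERIES):
    """Run every stored query against one article; return the names that match."""
    terms = tokenize(article)
    return sorted(
        name
        for name, (required, excluded) in queries.items()
        if required <= terms and not (excluded & terms)
    )


if __name__ == "__main__":
    print(monitor("Vodafone opens a store in Palm Beach"))
```

The loop over every stored query is the brute-force approach; its cost grows with the number of queries, which is why query counts in the tens of thousands make performance the central problem.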

Page 13

Media monitoring - parameters

Tens of thousands of stored expressions or keywords
– Can't rewrite these so must use same syntax!

Hundreds of thousands of articles to monitor every day

Source data can sometimes be scanned & OCR'd

False positives cost human operator time: false negatives cost customers!

Traditional approach is brute force using standard search engine software

Page 14

Media monitoring – a Keyword

("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR "HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR "TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*" OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD" OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR "HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR "!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE" OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR "!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT* & SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR "AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR "OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR "CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR "TRANSFER*"))))
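To make the operators in the expression above more concrete, here is a minimal sketch of evaluating one sub-clause, ("FLIGHT*" OR ...) W/4 ("DEPART*" OR "FROM" OR ...). It rests on assumptions the slides do not confirm: that W/n means "within n words, in either order", and that * and ? are the usual multi- and single-character wildcards. The ! and % prefixes are not explained in the slides and are not modelled, and only single-word patterns are handled.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Upper-case the text and split it into word tokens."""
    return re.findall(r"[A-Z0-9&]+", text.upper())


def positions(tokens, patterns):
    """Indexes of tokens matching any wildcard pattern, e.g. 'HOTEL*' or 'CHAIRM?N'."""
    return [
        i
        for i, tok in enumerate(tokens)
        if any(fnmatchcase(tok, pattern) for pattern in patterns)
    ]


def within(tokens, left, right, n):
    """True if some `left` match and some `right` match are at most n tokens apart.

    Assumed semantics for W/n: distance is measured in words, in either direction.
    """
    right_positions = positions(tokens, right)
    return any(
        abs(i - j) <= n for i in positions(tokens, left) for j in right_positions
    )


if __name__ == "__main__":
    tokens = tokenize("Cheap flights depart from Gatwick to Florida hotels")
    print(within(tokens, ["FLIGHT*", "FLY"], ["DEPART*", "FROM"], 4))
    print(within(tokens, ["FLIGHT*"], ["HOTEL*"], 4))
```

A full evaluator would also need phrase matching, nesting, OR and AND NOT; this sketch covers only the proximity test between two groups of alternatives.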

Page 15

Media monitoring – another Keyword

((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))

Page 16

Flax Media Monitor

Based on a modification of Apache Lucene/Solr

Runs a separate Solr server for archiving

Consumes XML articles & keywords

Outputs matches as XML

REST API for status & configuration

Allows you to test new Keywords on old content
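The slides do not show the XML formats involved, so the sketch below only illustrates the shape of the pipeline: an article arrives as XML, stored keywords are applied to it, and the matches are emitted as XML. Every element and attribute name (article, body, matches, match, keyword) is hypothetical, and plain substring matching stands in for the real query engine.

```python
import xml.etree.ElementTree as ET


def match_article(article_xml, keywords):
    """Apply stored keywords to one XML article and return the matches as XML.

    `keywords` maps a keyword id to a phrase. Matching is a case-insensitive
    substring test, used here only as a placeholder for real query evaluation.
    """
    article = ET.fromstring(article_xml)
    body = (article.findtext("body") or "").lower()
    result = ET.Element("matches", {"article": article.get("id", "")})
    for keyword_id, phrase in keywords.items():
        if phrase.lower() in body:
            ET.SubElement(result, "match", {"keyword": keyword_id})
    return ET.tostring(result, encoding="unicode")


if __name__ == "__main__":
    xml_in = '<article id="42"><body>New hotel opens in Palm Beach County</body></article>'
    print(match_article(xml_in, {"k1": "palm beach county", "k2": "vodafone"}))
```

Because the function takes any article and any keyword set, the same call could be pointed at archived articles to try out a new keyword before it goes live.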

Page 17

Flax Media Monitor demo...

Page 18

Flax Media Monitor - performance

For simple keywords (<20 terms):
– 70,000 keywords applied per second to an article
– Tested on a MacBook
– 20 times faster than previous implementation

For more complex keywords (some run to three pages!):
– 20,000 keywords applied in 0.5 seconds
– Approx 2000 docs/hour

Can be scaled horizontally for high load (and needs a lot less hardware)

Archive can store tens to hundreds of millions of articles

Page 19

Some other crazy ideas...

Combine media monitoring with Clade: very fast expression-based classification!

We can parse syntax from other search engines...

How to store rapidly changing classification data in a search engine index:

1. Re-index all documents affected (expensive)

2. Store the classifications somewhere else: how about a Lucene codec backed by a NoSQL Database?

http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
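The second option can be illustrated with a much simpler technique than the Lucene codec described in the linked post: keep the volatile classifications in a separate key-value store and join them to search hits at query time. In this sketch a plain dict stands in for the NoSQL database and another dict stands in for the search index; all names are hypothetical.

```python
class ClassificationStore:
    """Stand-in for an external key-value store holding per-document labels."""

    def __init__(self):
        self._labels = {}

    def set_labels(self, doc_id, labels):
        """Replace the labels for one document without touching the index."""
        self._labels[doc_id] = set(labels)

    def labels(self, doc_id):
        return self._labels.get(doc_id, set())


def search(index, store, term, label=None):
    """Find documents containing `term`, optionally filtered by a stored label.

    `index` maps doc_id -> text and stands in for a real search index.
    """
    hits = [doc for doc, text in index.items() if term.lower() in text.lower().split()]
    if label is not None:
        hits = [doc for doc in hits if label in store.labels(doc)]
    return sorted(hits)


if __name__ == "__main__":
    index = {"a": "solr search engine", "b": "search taxonomy"}
    store = ClassificationStore()
    store.set_labels("a", ["tech"])
    print(search(index, store, "search", label="tech"))
    store.set_labels("b", ["tech"])  # reclassify: no re-indexing needed
    print(search(index, store, "search", label="tech"))
```

The trade-off is that filtering happens after the text search, so this join is only cheap when the hit list is small; a codec-level integration avoids that by making the external values visible inside the index itself.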

Page 20

Conclusions

Search isn't just about “search”

Taxonomy management is ready for open source

Media monitoring can be done at low cost for high volume with open source
– Classification maybe as a special case of monitoring?

It's all much more fun than 'vanilla' search!

Page 21

Thank you!

Any questions?

[email protected]

www.flax.co.uk/blog

+44 (0) 8700 118334

Twitter: @FlaxSearch