how to build the next 1000 search engines?!

Download How to build the next 1000 search engines?!

If you can't read please download the document

Upload: arjen-de-vries

Post on 18-Dec-2014

546 views

Category:

Technology


2 download

DESCRIPTION

Opening keynote at the ECIR 2012 Industry Day: How to build the next 1000 search engines?! Includes preview of new minimal indexing approach.

TRANSCRIPT

  • 1. How to build the next 1000search engines?! Arjen P. de [email protected] Wiskunde & Informatica Delft University of TechnologySpinque B.V.

2. Search is everywhere 3. Search is everywhere Yet it only works well on the web 4. Complications Heterogeneous data sources WWW, wikipedia, news, e-mail, patents, twitter, personal information, Varying result types Documents, tweets, courses, people, experts, gene expressions, temperatures, Multiple dimensions of relevance Topicality, recency, reading level, 5. Complications Many search tasks require a mix withinthese dimensions: News and patents Companies and their CEOs Recent and on topic Many search tasks also require a mixacross these dimensions: Patents assigned to our top 3 competitors inmarket segments mentioned in the recentpress releases issued by our top 10 clients 6. Systems internal information representation Linguistic annotations Named entities, sentiment, dependencies, Knowledge resources Wikipedia, Freebase, IDC9, IPTC, Links to related documents Citations, urls Anchors that describe the URI Anchor text Queries that lead to clicks on the URI Session, user, dwell-time, Tweets that mention the URI Time, location, user, Other social media that describe the URI User, rating Tag, organisation of `folksonomy + UNCERTAINTY ALL OVER! 7. What goes in the black box? Document Collection:AnchorsEntity typesSentimentTweets BM25Cited documentsBM25F LMRMRankedVSM listDFRofanswers QIR?User Learning to rank? ContextECIR / CIKM / SIGIR / ICTIR / WSDM papers! 8. Rarely & scarcely addressedStudent: How do I build it? Professor: Who will build it for me?Last session of the conference 9. Search System 10. Parameterised Search SystemCornacchia, De Vries, ECIR 2007A Parametrised Search System 11. Parameterised Search SystemCannot we removethis IR engineer (orscientist!) from theloop, like DBMS software removes the data engineer from the loop?Cornacchia, De Vries, ECIR 2007A Parametrised Search SystemAnd three (four?) children, a startup and 5 years later, a PhD defense! 12. Search by Strategy Visually construct search strategies byconnecting building blocks 13. Search by Strategy Visually construct search strategies byconnecting building blocks Each block describes either data or actionsupon that data Connection points (pins) are typed:doc / sec / term / ne (named entity) / tuple Actions are expressed as scripts (later more) 14. Strategy Builder 15. From Patent to Inventor 16. ReportsVisits 17. Generate Search Engine!Or, really, generate a REST API from the strategy specification! 18. Demo(Showed demo of childrens search engine) 19. How Strategies Help Strategies improve communication betweensearch intermediary and user Encapsulate domain expert knowledge Abstract representation of search expert knowledge Analyze information seeking process at any stage Strategies facilitate knowledge management Store / share / publish / refine Strategies mix exact (DB) and ranked (IR)searches Avoid the need for human (probabilistic) joins 20. Search Intermediaries Travel agency Task complexity Real estate agents Recruiters Librarians Archivists Digital forensics detectives Patent information specialists 21. Exploratory Search Search & (Faceted) Browsing Help discover schema, ontology, etc. Help discover the relevant sources Within-collection (by year/location, by type, ) Across multiple collections (by source) 22. Probabilistic faceted browsingTraditional (booleanfilters)Probabilistic Price Price 100K - 200K 100K - 200K 200K - 300K 200K - 300K 300K - 400K 300K - 400K Rooms Rooms 3 3 4 4 5 5 SizeSize 100 - 150 m2 100 - 150 m2 150 - 200 m2 150 - 200 m2 200 - 250 m2 200 - 250 m2Good when user knows exactly Good for exploratory search which filters to apply Will see perfect-match resultsWill see perfect-match resultsWont see interesting results Will also see interesting results 23. Dynamic facetsPre-indexedDynamicPrice Price 100K - 200K 100K - 200K 200K - 300K 200K - 300K 300K - 400K 300K - 400KRooms Rooms 3 3 4 4 5 5SizeSize 100 - 150 m2 100 - 150 m2 150 - 200 m2 150 - 200 m2 200 - 250 m2 200 - 250 m2 Pre-defined ad-hoc indices Facets decided from result setintersected with result set Challenge: dynamically adapt granularity Challenge: many indices to maintain Different price ranges for villa/garage! Challenge: heavy concurrent queries to DB 24. Demo(Showed Spinques Real-estate searchdemo) 25. Limitations Search & Browse Faceted exploration does not include joins Cannot construct new data sources fromexisting ones! Only the pre-defined paths through theinformation space can actually be traversed 26. Who needs a Join? You!!! whenever relevance cues are typed: People (e.g., inventors) Companies (e.g., assignees) Categories (e.g., IPTC) Time (e.g., expiry date) Location (e.g., country) or whenever multiple sources are to be combined E.g., patents & news, patents & Wikipedia, 27. Patents on X by Y(y)by Y(y) 28. Interactive Information Access Feedback: Interaction improves informationrepresentation Faceted Browsing: Interaction can let user take over wheremachine would fail Search by Strategy: Interaction can let user take over wheresystem designer would fail 29. Conclusion No idealized one-shot search engine Empower the user! 30. Under the Hood 31. From Strategies to DB Queriesin1 in2 in3 Strategy Data flowBB1(in1,in2,in3, u1,u2)out in1 BB2(in1)Spinque: strategyoutCREATE VIEW a ASSELECT .. Query: strategy made operationalCREATE VIEW b ASSELECT ..CREATE VIEW c ASSpinque: PRASELECT .. DatabaseSpinque: RDBMS (MonetDB) Relational DB 32. Probabilistic Relational Algebra Strategy x = Project DISTINCT PRA: probabilistic [$1,$3](y); relational algebra (Fuhr and Roelleke, TOIS 2001) CREATE VIEW x AS SELECT a1, a3, SQL 1-prod(1-prob) AS prob FROM yexplicit probabilities GROUP BY a1, a3; Relational DB 33. Whats in the DB? Text-based ranking T Df term-doc-freq relations (inverted file)t0d3 3 One per language, stemming, section t0d510 Domain-independent, click and indext1d2 4 Entity ranking subjpred/attrobj/valuep Probabilistic triplesArjen speaks_toyou 0.95 Domain-aware youfollowArjen 0.5 speechminutes 45 0.8 Needs supervised indexing Content-based (MM) retrieval Img_id f1 fN Feature vectors, click and index00.12 0.8410.540.3120.230.1 34. VIEWS and TABLES User Stored relationparameter CREATE VIEWTABLE a AS SELECT FROM term-doc ; CREATE VIEWb AS SELECT FROM a WHERE a.x = u1 ; CREATE VIEWTABLE c AS SELECT FROM a WHERE a.x = 42 ; CREATE VIEWd AS SELECT FROM b ; No userparameter Pre-computable BB content: sequence of VIEW definitions relation A VIEW is pre-computable when All the relations addressed are pre-computable / stored No dependency on user parameters Pre-computable VIEWs can become TABLEs (or MATERIALIZEDVIEWs) Query-independent computations are performed only once, then read from TABLEs at each query Recognition of these patterns is fully automatic Extends MonetDBs per-session caching to across-sessions caching 35. What Next? 36. Current Situation index ;Schema definition repeat {specify ;retrieveSearch & explore } until 37. Traditional Indexing Preprocessing determines to large extend howsearch request form will be processed Especially regarding tokenization, stemming, etc. Fast and scalable, but inflexible E.g., entity search hard-coded on top of engine,advertisements matched on different data, etc. 38. Search by Strategy Flexible: generate arbitrary engine on the fly Not as fast as highly optimized and very wellengineered inverted file based systems 39. Desirable Situation repeat {index ; Mixed Initiativespecify ; Schema definition Search & exploreretrieve } until 40. Non-Indexed Search Grep Very flexible Use it all the time on my mh mail folders when gmail fails me! Not scalable, little or no structure 41. Minimal Indexing How to reduce pre-processing necessary tocreate a search engine over a new collection? Can we do without a keyword index? Can we avoid hardwired decisions for tokenization,language detection, stemming, 42. Suffix Array Pros: provides many core search functions: termstatistics, keyword search, phrase search. no upfront tokenization needed (access atcharacter level) no upfront language detection needed Cons: difficult to build for large corpora expensive w.r.t. disk space 43. Demo(Showed patent search demo) 44. Real Code 45. Patents on X by Y(y)by Y(y) 46. PRAs__STRATEGY___filter_DOC_with_NE_nes =Project [$2,$3]( Join [$1 = $2](s__STRATEGY___clef_ip_patents_DATA_result,Project [$1,$3]( Select [$2 = "ipcr-classification"](s__STRATEGY___clef_ip_patents_DATA_ne_doc )) )); 47. CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS SELECTtmp_1814091754.a2 AS a1,tmp_1814091754.a3 AS a2,tmp_1814091754.prob AS prob FROM ( SELECT s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1, tmp__1652836708.a1 AS a2, tmp__1652836708.a2 AS a3,s__STRATEGY___clef_ip_patents_DATA_result.prob * tmp__1652836708.prob AS probFROMs__STRATEGY___clef_ip_patents_DATA_result,(SELECT tmp_1444787941.a1 AS a1, tmp_1444787941.a3 AS a2, tmp_1444787941.prob AS prob FROM (SELECTs__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS probFROM s__STRATEGY___clef_ip_patents_DATA_ne_docWHERE s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 =ipcr-classification) AS tmp_1444787941 ) AS tmp__1652836708WHERE s__STRATEGY___clef_ip_patents_DATA_result.a1= tmp__1652836708.a2 ) AS tmp_1814091754 ORDER BY a1 WITH DATA; 48. [email protected]/spinque