leveraging lucene/solr as a knowledge graph and intent engine: presented by trey grainger,...

Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine Trey Grainger Director of Engineering, Search & Recommendations 2015.10.15

Upload: lucidworks

Post on 19-Feb-2017


Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine Trey Grainger

Director of Engineering, Search & Recommendations

2015.10.15

About Me

Trey Grainger
Director of Engineering, Search & Recommendations
• Joined CareerBuilder in 2007 as a Software Engineer
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Mining Massive Datasets (in progress) – Stanford University

Fun outside of CB:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Agenda
• Introduction
• Defining the problem – the need for Semantic Search
• Building an Intent Engine
    - Type-ahead prediction
    - Spelling Correction
    - Entity / Entity-type Resolution
    - Semantic Query Parsing
    - Query Augmentation
    - The Knowledge Graph
• Conclusion

At CareerBuilder, Solr Powers...

Search by the Numbers
• Powering 50+ search experiences, including: ...and many more
• 100 million+ searches per day
• 30+ software developers, data scientists + analysts
• 500+ search servers
• 1.5 billion+ documents indexed and searchable
• 1 global search technology platform

What's the problem we're trying to solve today?

User's Query:
    machine learning research and development Portland, OR software engineer AND hadoop, java

Traditional Query Parsing:
    (machine AND learning AND research AND development AND portland)
    OR (software AND engineer AND hadoop AND java)

Semantic Query Parsing:
    "machine learning" AND "research and development" AND "Portland, OR"
    AND "software engineer" AND hadoop AND java

Semantically Expanded Query:
    ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
    AND ("research and development"^10 OR "r&d")
    AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
    AND ("software engineer"^10 OR "software developer")
    AND (hadoop^10 OR "big data" OR hbase OR hive)
    AND (java^10 OR j2ee)
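As a rough illustration of how the semantically expanded query above could be assembled, here is a minimal Python sketch. It assumes the entities have already been recognized in the query. The EXPANSIONS table and the function names are hypothetical, with the related terms copied from the slide; the geo filter for the location entity is left out for brevity. This is not the production implementation.

```python
# Hypothetical expansion table; the related terms are copied from the slide.
EXPANSIONS = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term):
    """Quote a term when it contains a space or '&', as the slide does."""
    return f'"{term}"' if " " in term or "&" in term else term


def expand_clause(entity, boost=10):
    """One clause: the original entity boosted, OR-ed with its related terms."""
    parts = [f"{quote(entity)}^{boost}"]
    parts += [quote(term) for term in EXPANSIONS.get(entity, [])]
    return "(" + " OR ".join(parts) + ")"


def expand_query(entities, boost=10):
    """AND together one expanded clause per recognized entity."""
    return " AND ".join(expand_clause(entity, boost) for entity in entities)


if __name__ == "__main__":
    print(expand_query(["machine learning", "hadoop", "java"]))
```

Boosting the user's own phrase above its expansions keeps exact matches ranked first while still letting related documents match.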

But we also really want "things", not "strings"…

Entity types shown: Job Level, Job Title, Company; Job Title, Company, School + Degree

Knowledge Graph and Intent Engine (architecture diagram)
• Search Box
• Intent Engine: Type-ahead Prediction, Spelling Correction, Entity / Entity Type Resolution, Semantic Query Parsing, Query Augmentation, Knowledge Graph
• Relevancy Engine ("re-expressing intent"): Machine-learned Ranking, Query Re-writing
• Search Results
• User Feedback (Clarifying Intent)

Type-ahead Predictions

Semantic Autocomplete
• Shows top terms for any search
• Breaks out job titles, skills, companies, related keywords, and other categories
• Understands abbreviations, alternate forms, misspellings
• Supports full Boolean syntax and multi-term autocomplete
• Enables fielded search on entities, not just keywords
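To make the "breaks out suggestions by category" idea concrete, here is a minimal in-memory Python sketch. The entries, popularity counts, aliases, and the suggest function are all hypothetical; a real deployment would sit on a search index rather than scanning a list. It only illustrates category grouping and alternate-form handling.

```python
from collections import defaultdict

# Hypothetical suggestion entries: (text, category, popularity).
ENTRIES = [
    ("java developer", "job title", 900),
    ("java", "skill", 1500),
    ("javascript", "skill", 1200),
    ("javelin staffing", "company", 80),  # made-up company name
    ("registered nurse", "job title", 2000),
]

# Hypothetical alternate forms mapped to the text they stand for.
ALIASES = {"rn": "registered nurse", "r.n.": "registered nurse"}


def suggest(prefix, limit=3):
    """Return {category: [suggestions]} for a typed prefix, most popular first."""
    prefix = prefix.strip().lower()
    prefix = ALIASES.get(prefix, prefix)  # resolve abbreviations before matching
    grouped = defaultdict(list)
    for text, category, _popularity in sorted(ENTRIES, key=lambda e: -e[2]):
        if text.startswith(prefix) and len(grouped[category]) < limit:
            grouped[category].append(text)
    return dict(grouped)


if __name__ == "__main__":
    print(suggest("jav"))
    print(suggest("rn"))
```

Because each suggestion carries its category, the selected item can be sent as a fielded search on that entity type instead of a plain keyword.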

Spelling Correction*

*Google "Solr Spell Check Component"

Entity / Entity-type Resolution

Differentiating related terms

Synonyms:
    cpa     =>  certified public accountant
    rn      =>  registered nurse
    r.n.    =>  registered nurse

Ambiguous Terms*:
    driver  =>  driver (trucking)   ~80% likelihood
    driver  =>  driver (software)   ~20% likelihood

Related Terms:
    r.n.    =>  nursing, bsn
    hadoop  =>  mapreduce, hive, pig

*differentiated based upon user and query context

Building a Taxonomy of Entities

Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
• Buy a dictionary (often doesn't work for domain-specific search problems)
• …

Our strategy: Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain [1]

[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Entity-type Recognition

Build classifiers trained on external data sources (Wikipedia, DBPedia, WordNet, etc.), as well as from our own domain. The subject for a future talk / research paper…

Example terms: java developer, registered nurse, emergency room, director, Portland, OR, part-time
Entity types: job title, skill, job level, location, work type

Semantic Query Parsing

Query Parsing: The whole is greater than the sum of the parts

    project manager     vs.   "project" AND "manager"
    building architect  vs.   "building" AND "architect"
    software architect  vs.   "software" AND "architect"

Consider:
    a "software architect" designs and builds software
    a "building architect" uses software to design architecture

User's Query:
    machine learning research and development Portland, OR software engineer AND hadoop java

Traditional Query Parsing:
    (machine AND learning AND research AND development AND portland)
    OR (software AND engineer AND hadoop AND java)

Identifying the correct phrase (not just the parts) is crucial here!

Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example: senior java developer hadoop

Possible Parsings:
    senior, java, developer, hadoop
    "senior java", developer, hadoop
    "senior java developer", hadoop
    "senior java developer hadoop"
    "senior java", "developer hadoop"
    senior, "java developer", hadoop
    senior, java, "developer hadoop"

Input: senior hadoop developer java ruby on rails perl
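One simple way to choose among the possible parsings is to enumerate every contiguous segmentation and score each with a phrase-level language model. The Python sketch below does that with a made-up probability table. PHRASE_PROB, UNSEEN, and the scoring rule (a product of phrase probabilities) are assumptions for illustration, not the parser described in the talk.

```python
import math
from itertools import product

# Hypothetical phrase probabilities, e.g. estimated from query logs.
PHRASE_PROB = {
    "senior": 0.05,
    "java": 0.08,
    "developer": 0.06,
    "hadoop": 0.04,
    "java developer": 0.03,
    "senior java developer": 0.001,
}
UNSEEN = 1e-6  # assumed floor for phrases missing from the table


def segmentations(tokens):
    """Yield every way to split tokens into contiguous phrases."""
    if not tokens:
        return
    # Each of the n-1 gaps between tokens is either a split or a join.
    for gaps in product([True, False], repeat=len(tokens) - 1):
        parse, current = [], [tokens[0]]
        for token, split in zip(tokens[1:], gaps):
            if split:
                parse.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        parse.append(" ".join(current))
        yield parse


def score(parse):
    """Product of the phrase probabilities in one candidate parsing."""
    return math.prod(PHRASE_PROB.get(phrase, UNSEEN) for phrase in parse)


def best_parse(query):
    """Return the highest-scoring segmentation of the query."""
    return max(segmentations(query.lower().split()), key=score)


if __name__ == "__main__":
    print(best_parse("senior java developer hadoop"))
```

For a four-token query this enumerates 2^3 = 8 candidates; longer queries would need dynamic programming rather than brute force.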

Semantic Search Architecture – Query Parsing

1) Generate the previously discussed taxonomy of domain-specific phrases
    • You can mine query logs or actual text of documents for significant phrases within your domain [1]

2) Feed these phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups)

3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also possible)

4) Also invoke probabilistic parser to dynamically identify unknown phrases from a corpus of data (language model)

5) Shown on next slides: Pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding

[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
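Steps 2 and 3 can be approximated in a few lines for experimentation. The Python sketch below is a stand-in for the tagging step, not SolrTextTagger's API: it replaces the Lucene FST with a plain dictionary keyed by token tuples and does greedy longest-match tagging. The function names and the phrase list are hypothetical.

```python
def build_phrase_index(phrases):
    """Index known phrases by token tuple and record the longest phrase length."""
    index = {tuple(phrase.lower().split()): phrase for phrase in phrases}
    max_len = max((len(key) for key in index), default=0)
    return index, max_len


def tag(query, index, max_len):
    """Greedy longest-match tagging; returns (entities, leftover_tokens)."""
    tokens = query.lower().split()
    entities, leftover = [], []
    i = 0
    while i < len(tokens):
        # Try the longest window first so multi-word phrases win over their parts.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in index:
                entities.append(index[candidate])
                i += size
                break
        else:
            # No known phrase starts here; hand the token to later stages.
            leftover.append(tokens[i])
            i += 1
    return entities, leftover


if __name__ == "__main__":
    index, max_len = build_phrase_index(
        ["machine learning", "research and development", "software engineer", "hadoop", "java"]
    )
    print(tag("machine learning research and development software engineer hadoop java perl",
              index, max_len))
```

Tokens left untagged (here `perl`) are the ones a probabilistic parser, as in step 4, would still need to examine.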

Query Augmentation

Semantic Search Architecture – Query Augmentation (diagram)

Keywords: machine learning

Components shown: Search Behavior, Application Behavior, etc.; Job Title Classifier, Skills Extractor, Job Level Classifier, etc.; Query Enrichment; Document Enrichment; known keyword phrases (java developer, machine learning, registered nurse) held in an FST; Knowledge Graph

Related Phrases
    machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }

Common Job Titles
    machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }

Related Occupations
    machine learning: {
        15-1031.00  .58  Computer Software Engineers, Applications
        15-1011.00  .55  Computer and Information Scientists, Research
        15-1032.00  .52  Computer Software Engineers, Systems Software }

Modified Query:
    keywords:((machine learning)^10 OR
        { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) })
    { BOOST_TO_TOP: ( job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }

Knowledge Graph

Serves as a "data science toolkit" API that allows dynamically navigating and pivoting through multiple levels of relationships between items in our domain. Compare the relationships of skills to keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience level, etc.

Knowledge Graph API

Core similarity engine, exposed via API
Any product can leverage our core relationship scoring engine to score any list of entities against any other list

Full domain support
Keywords, job titles, skills, companies, job levels, locations, and all other taxonomies.

Intersections, overlaps, & relationship scoring, many levels deep
Users can either provide a list of items to score, or else have the system dynamically discover the most related items (or both).

So how does it work?

Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

$$ z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \left(1 - \mathrm{probBG}(x)\right)}} $$

{ "type":"keywords", "values":[
    { "value":"hive", "relatedness":0.9773, "popularity":369 },
    { "value":"java", "relatedness":0.9236, "popularity":15653 },
    { "value":".net", "relatedness":0.5294, "popularity":17683 },
    { "value":"bee", "relatedness":0.0, "popularity":0 },
    { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
    { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }

We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus)
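The z-score formula above translates directly into code. The Python sketch below computes the raw z value from a term's foreground count, the foreground document total, and the term's background probability. The function name and the zero-variance convention are assumptions. The slides do not show how the raw z-score is mapped to the bounded "relatedness" values in the example response, so that step is not attempted here.

```python
import math


def relatedness_z(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count against its background probability.

    count_fg:      documents in the foreground set containing the term
    total_docs_fg: total documents in the foreground set
    prob_bg:       probability that a background document contains the term
    """
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1.0 - prob_bg)
    if variance <= 0:
        # Assumed convention: no signal when the term never (or always)
        # appears in the background, or the foreground set is empty.
        return 0.0
    return (count_fg - expected) / math.sqrt(variance)


if __name__ == "__main__":
    # Term in 60 of 100 foreground docs but only 10% of the background:
    # strongly positive.
    print(relatedness_z(60, 100, 0.1))
    # Term appears exactly as often as the background predicts: zero.
    print(relatedness_z(10, 100, 0.1))
    # Term is rarer in the foreground than in the background: negative.
    print(relatedness_z(1, 100, 0.1))
```

Positive values mark terms over-represented in the foreground set, values near zero mark terms that are equally likely in the background, and negative values mark terms the foreground context tends to exclude.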

Foreground Query: "Hadoop"

Knowledge Graph – Potential Use Cases

Cross-walk between Types
• Have an ID field, but want to enable free text search on the most associated entity with that ID?
• Have a "state" (geo) search box, but want to accept any free-text location and map it to the right state?
• Have an old classification taxonomy and want to know how the values from the old system now map into the new values?

Build User Profiles from Search Logs
• If someone searches for "Java", and then "JQuery", and then "CSS", and then "JSP", what do those have in common?
• What if they search for "Java", and then "C++", and then "Assembly"?

Discover Relationships Between Anything
• If I want to become a data scientist and know Python, what libraries should I learn?
• If my last job was mid-level software engineer and my current job is Engineering Lead, what are my most likely next roles?

Traverse arbitrarily deep, Sort on anything
• Build an instant co-occurrence matrix, sort the top values by their relatedness, and then add in any number of additional dimensions (RAM permitting).

Data Cleansing
• Have dirty taxonomies and need to figure out which items don't belong?
• Need to understand the conceptual cohesion of a document (vs spammy or off-topic content)?

2014 - 2015 Publications & Presentations

Books:
Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr

Research papers:
● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon - 2014
● Towards a Job Title Classification System - 2014
● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior - 2014
● sCooL: A system for academic institution name normalization - 2014
● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014
● SKILL: A System for Skill Identification and Normalization - 2015
● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015
● WebScalding: A Framework for Big Data Web Services - 2015
● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015
● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015
● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data - 2015
● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015

Speaking Engagements:
● Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data 2015 (x6), Lucene/Solr Revolution 2015

So What's Next?

(The Semantic Search Architecture – Query Augmentation diagram is shown again, with one piece highlighted.)

This Piece: How do you construct the best possible queries?
The answer… Learning to Rank (Machine-learned Ranking)
That can be a topic for next time…


Additional References:

Contact Info

Yes, WE ARE HIRING. Come talk with me if you are interested…

Trey Grainger
[email protected]
@treygrainger

http://solrinaction.com
Conference discount (43% off): lusorevcftw

Other presentations: http://www.treygrainger.com