lucene solr talk at java user group karlsruhe

70
Suchen und Finden mit Lucene und Solr Florian Hopf 04.07.2012

Upload: florian-hopf

Post on 26-Jan-2015

108 views

Category:

Art & Photos


1 download

DESCRIPTION

German slides introducing Lucene and Solr.

TRANSCRIPT

Page 1: Lucene Solr talk at Java User Group Karlsruhe

Suchen und Finden mit Lucene und Solr

Florian Hopf04.07.2012

Page 2: Lucene Solr talk at Java User Group Karlsruhe

http://techcrunch.com/2010/08/04/schmidt-data/

Page 3: Lucene Solr talk at Java User Group Karlsruhe
Page 4: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Page 5: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

Page 6: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

SQL

Page 7: Lucene Solr talk at Java User Group Karlsruhe
Page 8: Lucene Solr talk at Java User Group Karlsruhe

Talktitlecontenttalkdate

Speaker

name

Category

name

*

*

*

*

Page 9: Lucene Solr talk at Java User Group Karlsruhe

mysql> select * from talk where title like "%apache%";+----+-------------------------------------------+---------+------------+| id | title | content | talkdate |+----+-------------------------------------------+---------+------------+| 1 | Apache Karaf | ... | 2012-04-25 || 2 | Integration ganz einfach mit Apache Camel | ... | 2012-04-04 |+----+-------------------------------------------+---------+------------+2 rows in set (0.00 sec)

Page 10: Lucene Solr talk at Java User Group Karlsruhe

mysql> select * from talk t join talk_category tc join category c where t.id = tc.talk and tc.category = c.id and (c.name like '%OSGi%' or t.title like '%OSGi%' or t.content like '%OSGi%');+----+-------------------------------------------+---------+------------+------+----------+----+------+| id | title | content | talkdate | talk | category | id | name |+----+-------------------------------------------+---------+------------+------+----------+----+------+| 1 | Apache Karaf | ... | 2012-04-25 | 1 | 1 | 1 | OSGi || 2 | Integration ... | ... | 2012-04-04 | 2 | 1 | 1 | OSGi |+----+-------------------------------------------+---------+------------+------+----------+----+------+2 rows in set (0.00 sec)

Page 11: Lucene Solr talk at Java User Group Karlsruhe

● Performance?● Ranking?● False Positives● Ähnlichkeitssuche (Meyer <=> Meier)● Flexibilität?● Wartbarkeit?

Page 12: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

Page 13: Lucene Solr talk at Java User Group Karlsruhe

File directory = new File(dir); File[] textFiles = directory.listFiles(new TextFiles());

for (File probableMatch : textFiles) { String text = readText(probableMatch); if (text.matches(".*" + Pattern.quote(term) + ".*")) { filesWithMatches.add(probableMatch.getAbsolutePath()); } }

Page 14: Lucene Solr talk at Java User Group Karlsruhe

● Skalierbarkeit?● Unterschiedliche Formate?● Ranking?● Kombination mit Datenbank?

Page 15: Lucene Solr talk at Java User Group Karlsruhe

Suche Go

Ergebnis 1In Ergebnis 1 taucht der Suchbegriff auf...

Ergebnis 3… hier steht der Suchbegriff aus Ergebnis 3...

Ergebnis 2In Ergebnis 2 taucht der Suchbegriff auch auf ...

SQL

Page 16: Lucene Solr talk at Java User Group Karlsruhe

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

Page 17: Lucene Solr talk at Java User Group Karlsruhe

Die 1

Stadt 1,2

liegt 1

in 1

den 1

Bergen 1

Vom 2

Berg 2

kann 2

man 2

die 2

sehen 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

Page 18: Lucene Solr talk at Java User Group Karlsruhe

die 1,2

stadt 1,2

liegt 1

in 1

den 1

bergen 1

vom 2

berg 2

kann 2

man 2

sehen 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

2. Lowercasing

Page 19: Lucene Solr talk at Java User Group Karlsruhe

die 1,2

stadt 1,2

liegt 1

in 1

den 1

berg 1,2

vom 2

kann 2

man 2

seh 2

Dokument 1

Die Stadtliegt in den

Bergen.

Dokument 2

Vom Berg kann man die Stadt

sehen.

1. Tokenization

2. Lowercasing

3. Stemming

Page 20: Lucene Solr talk at Java User Group Karlsruhe
Page 21: Lucene Solr talk at Java User Group Karlsruhe

● Java-Bibliothek● Invertierter Index● Analyzer● Query-Syntax● Relevanz-Algorithmus● KEIN Crawler● KEIN Document-Extractor

Page 22: Lucene Solr talk at Java User Group Karlsruhe

Quelle: http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/

Page 23: Lucene Solr talk at Java User Group Karlsruhe

● Indexieren:● Erstellen eines Documents● Festlegen des Analyzers● Indexieren über IndexWriter

● Suchen:● Verwendung des selben Analyzers● Parsen der Query mit QueryParser ● Auslesen über IndexSearcher/IndexReader ● Ausgabe über Document

Page 24: Lucene Solr talk at Java User Group Karlsruhe

Document

Field Name 1 Value 1title Value

Document

Field Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Field Name 1 Value 1title ValueField Name 1 Value 1date 20120404

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache Camel

Document

Field Name 1 Value 1title Value

Document

Field Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Field Name 1 Value 1title ValueField Name 1 Value 1date 20120425

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1title Apache Karaf

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Achim Nierbeck

Field Name 1 Value 1title ValueField Name 1 Value 1title Integration ganz einfach mit Apache CamelField Name 1 Value 1title ValueField Name 1 Value 1speaker Christian Schneider

Page 25: Lucene Solr talk at Java User Group Karlsruhe

● Index● ANALYZED● NOT_ANALYZED● NO

● Store● YES/NO

● Feldtyp● String, Numeric, Boolean

Page 26: Lucene Solr talk at Java User Group Karlsruhe
Page 27: Lucene Solr talk at Java User Group Karlsruhe

Tokenizer

TokenFilter

TokenFilter

TokenFilter

TokenFilter

Analyzer

Page 28: Lucene Solr talk at Java User Group Karlsruhe

StandardTokenizer

GermanNormalizationFilter

LowercaseFilter

StandardFilter

GermanLightStemFilter

Analyzer

Tokenizer

TokenFilter

TokenFilter

TokenFilter

TokenFilter

Analyzer

Page 29: Lucene Solr talk at Java User Group Karlsruhe

Directory

IndexWriter

Analyzer

Document

Page 30: Lucene Solr talk at Java User Group Karlsruhe
Page 31: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 32: Lucene Solr talk at Java User Group Karlsruhe

Directory

IndexWriter

Analyzer

Document

IndexReaderIndexSearcher

QueryParser

Query

Page 33: Lucene Solr talk at Java User Group Karlsruhe
Page 34: Lucene Solr talk at Java User Group Karlsruhe

● TermQuery ● Apache ● title:Apache

● Boolean Query ● Apache AND Karaf

● PhraseQuery ● "Apache Karaf"

Page 35: Lucene Solr talk at Java User Group Karlsruhe

● WildcardQuery ● Integ* ● Te?t

● RangeQuery ● date:[20120705 TO 20121231]

● FuzzyQuery ● Schneyder~

Page 36: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

Page 37: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

BooleanQueryAND

RangeQuerydate:[...]

FuzzyQueryspeaker:schneyder

TermQuerytitle:apach

Page 38: Lucene Solr talk at Java User Group Karlsruhe

title:Apache AND speaker:schneyder~ AND date:[20120401 TO 20120430]

BooleanQueryAND

RangeQuerydate:[...]

FuzzyQueryspeaker:schneyder

TermQuerytitle:apach

Page 39: Lucene Solr talk at Java User Group Karlsruhe

● FilterQueries● Ausschlusskriterium, kann gecacht werden

● Sortierung● Boosting

● Indexing-Time● Query-Time

Page 40: Lucene Solr talk at Java User Group Karlsruhe

score(q ,d )=coord (q ,d )∗queryNorm (q )∗∑t∈q

(tf (t , d )∗idf (t)2∗t.boost∗norm (t , d ))

Page 41: Lucene Solr talk at Java User Group Karlsruhe

score(q ,d )=coord (q ,d )∗queryNorm (q )∗∑t∈q

(tf (t , d )∗idf (t)2∗t.boost∗norm (t , d ))

Anzahl Term im Dokument

Feldlänge,Index-Boost

Invers zu Anzahl Dokumente, die

den Term enthalten

Anzahl der Matches imDokument

Query-Boost

Page 42: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 43: Lucene Solr talk at Java User Group Karlsruhe
Page 44: Lucene Solr talk at Java User Group Karlsruhe

● Parser API● Zahlreiche Formate● Integriert OpenSource-Libs● Betrieb embedded oder über Server

Page 45: Lucene Solr talk at Java User Group Karlsruhe
Page 46: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 47: Lucene Solr talk at Java User Group Karlsruhe
Page 48: Lucene Solr talk at Java User Group Karlsruhe

● Enterprise Search Server● Basiert auf Lucene● HTTP API● Index-Schema● Integriert häufig verwendete Lucene-Module● Facettierung● Dismax Query Parser● Admin-Interface

Page 49: Lucene Solr talk at Java User Group Karlsruhe

WebappWebapp

Client Solr Lucenehttp

XML, JSON, JavaBin, Ruby, ...

Page 50: Lucene Solr talk at Java User Group Karlsruhe

schema.xml solr-config.xml

dataconf

Solr Home

Lucene

Page 51: Lucene Solr talk at Java User Group Karlsruhe

schema.xml

Field Types

Fields

Page 52: Lucene Solr talk at Java User Group Karlsruhe
Page 53: Lucene Solr talk at Java User Group Karlsruhe
Page 54: Lucene Solr talk at Java User Group Karlsruhe

SearchHandler

DIH

SearchComp.

Monitoring

Lucene

SolrCell

UpdateRequestHandler

Replication

Search Index Index StartIndexing

DBURLFiles

Caches

Page 55: Lucene Solr talk at Java User Group Karlsruhe

solrconfig.xml

Lucene ConfigCaches

Request Handler

Search Components

Page 56: Lucene Solr talk at Java User Group Karlsruhe
Page 57: Lucene Solr talk at Java User Group Karlsruhe

WebappWebapp

Client Solr Lucenehttp

XML, JSON, JavaBin, Ruby, ...

Page 58: Lucene Solr talk at Java User Group Karlsruhe
Page 59: Lucene Solr talk at Java User Group Karlsruhe
Page 60: Lucene Solr talk at Java User Group Karlsruhe

DEMO

Page 61: Lucene Solr talk at Java User Group Karlsruhe
Page 62: Lucene Solr talk at Java User Group Karlsruhe

Master

SlaveSlave

Loadbalancer

Search

Replicate

Index

Page 63: Lucene Solr talk at Java User Group Karlsruhe

● Geospatial Search● More like this● Spellchecker● Suggester● Result Grouping● Function Queries● Sharding

Page 64: Lucene Solr talk at Java User Group Karlsruhe
Page 65: Lucene Solr talk at Java User Group Karlsruhe

● Suchserver basierend auf Apache Lucene● RESTful API● Dokumentenorientiert (JSON)● Schemafrei● Distributed Search● Near Realtime Search● No Commits (Transaction Log)

Page 66: Lucene Solr talk at Java User Group Karlsruhe

curl -XPOST 'http://localhost:9200/jugka/talk/' -d '{ "speaker" : "Florian Hopf", "date" : "2012-07-04T19:30:00", "title" : "Suchen und Finden mit Lucene und Solr"}'

{"ok":true,"_index":"jugka","_type":"talk","_id":"CeltdivQRGSvLY_dBZv1jw","_version":1}

Page 67: Lucene Solr talk at Java User Group Karlsruhe

curl -XGET 'http://localhost:9200/jugka/talk/_search?q=solr'{"took":29,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.054244425,"hits":[{"_index":"jugka","_type":"talk","_id":"CeltdivQRGSvLY_dBZv1jw","_score":0.054244425, "_source" : { "speaker" : "Florian Hopf", "date" : "2012-07-04T19:30:00", "title" : "Suchen und Finden mit Lucene und Solr"}

Page 68: Lucene Solr talk at Java User Group Karlsruhe

● http://lucene.apache.org● http://tika.apache.org● http://lucene.apache.org/solr/● http://elasticsearch.org● http://github.com/fhopf/lucene-solr-talk

Page 69: Lucene Solr talk at Java User Group Karlsruhe

http://nlp.stanford.edu/IR-book/

Page 70: Lucene Solr talk at Java User Group Karlsruhe

Vielen Dank!

http://www.florian-hopf.de@fhopf