Solr Black Belt Pre-conference

code4lib conference 2010 - Asheville, NC
Erik Hatcher, Lucid Imagination
Naomi Dushay, Stanford

Page 1: Solr Black Belt Pre-conference

Solr Black Belt
code4lib conference 2010 - Asheville, NC

Erik Hatcher, Lucid Imagination
Naomi Dushay, Stanford

1

Page 2: Solr Black Belt Pre-conference

What’s new in Solr 1.4?

• Java-based replication

• VelocityResponseWriter (Solritas)

• Logging switched to SLF4J

• Rollback, since last commit

• StatsComponent

• TermVectorComponent

• Configurable Directory provider

• CharFilter

• TermsComponent

• Rich document indexing, via Tika (Solr Cell)

• Greatly improved faceting performance

• Exact/near duplicate document handling

• Support added for Lucene's omitTf

• "trie" range query support

2

Page 3: Solr Black Belt Pre-conference

Performance Improvements

• Caching

• Concurrent file access

• Per-segment index updates

• Faceting

• DocSet generation, avoids scoring

• Streaming updates for SolrJ

3

Page 4: Solr Black Belt Pre-conference

Lucene 2.9

• IndexReader#reopen()

• Faster filter performance, by 300% in some cases

• Per-segment FieldCache

• Reusable token streams

• Faster numeric/date range queries, thanks to trie

• and tons more, see Lucene 2.9's CHANGES.txt

4

Page 5: Solr Black Belt Pre-conference

Deployment Architectures

5

Page 6: Solr Black Belt Pre-conference

JVM

• -server

• -XmxNNNNm

• Java 1.6 (latest point release)

• garbage collector

• 64-bit?

• Tools: JVM GC logging, jconsole

6

Page 7: Solr Black Belt Pre-conference

Useful JVM switches

• -Xloggc:gc.out: Writes GC information to a file named “gc.out”.

• -XX:+PrintGC: Outputs basic information at every garbage collection.

• -XX:+PrintGCDetails: Outputs more detailed information at every garbage collection.

• -XX:+PrintGCTimeStamps: Outputs a time stamp at the start of each garbage collection event. Used with -XX:+PrintGC or -XX:+PrintGCDetails to show when each garbage collection begins.

• -XX:+HeapDumpOnOutOfMemoryError: Dumps the heap to a file when java.lang.OutOfMemoryError is thrown.

7

Page 8: Solr Black Belt Pre-conference

Indexing Performance

• Tricks of the trade:

• multithread/multiprocess

• batch documents

• separate Solr server and indexers

• Indexing master + replicants

• StreamingUpdateSolrServer + javabin
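As a rough illustration of the "batch documents" tip, the sketch below groups documents into fixed-size batches and renders each batch as one Solr XML <add> message, so that a single HTTP request carries many documents. The helper names (batches, to_add_xml), the batch size, and the sample field names are assumptions made for illustration; posting the XML to the update handler is left out.

```python
import xml.etree.ElementTree as ET


def batches(docs, size):
    """Yield consecutive slices of `docs`, each holding at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def to_add_xml(batch):
    """Render one batch as a single Solr XML <add> message."""
    add = ET.Element("add")
    for doc in batch:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes one <field> element per value (multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample records; a real indexer would read these from MARC, a database, etc.
docs = [{"id": str(i), "title": "Record %d" % i} for i in range(5)]
messages = [to_add_xml(b) for b in batches(docs, 2)]
```

Each entry in messages would be sent as the body of one update request. The right batch size depends on document size and is best found by measurement.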

8

Page 9: Solr Black Belt Pre-conference

MARC indexing strategies

• SolrMarc

• Future? DataImportHandler hooks

9

Page 10: Solr Black Belt Pre-conference

Index Settings

• useCompoundFile: set to false

• mergeFactor: 10 or lower, generally

• ramBufferSizeMB: size of the buffer that holds added documents before they are flushed to the directory; more predictable than maxBufferedDocs. Benchmarking shows <= 128 is best.

• maxMergeDocs: maximum number of documents for a single segment

• maxFieldLength: generally the maximum int value (2147483647) is desired

• maxWarmingSearchers: 1 is best

10

Page 11: Solr Black Belt Pre-conference

Searching Performance

• javabin - binary protocol for Java clients

• caches: filterCache most relevant here

• autowarm

• FastLRUCache

• warming queries: firstSearcher, newSearcher

• sorting, faceting

11

Page 12: Solr Black Belt Pre-conference

debugQuery=true

• parsed queries

• scoring explanations

• search component timings

12

Page 13: Solr Black Belt Pre-conference

Query Parsing

• defType

• applies to main query only

• fq parsed as "lucene" unless individually overridden

• {!parser local=params}query string

13

Page 14: Solr Black Belt Pre-conference

Solr Query Parser (lucene)

• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions

• Kitchen sink parser, includes advanced user-unfriendly syntax

• Syntax errors throw parse exceptions back to client

• Example: title:ipod* AND price:[0 TO 100]

• http://wiki.apache.org/solr/SolrQuerySyntax

14

Page 15: Solr Black Belt Pre-conference

SolrQueryParser

• Default query parser

• schema.xml

• <defaultSearchField>text</defaultSearchField>

• <solrQueryParser defaultOperator="OR"/>

• Adds _query_:"..." and _val_:"..." hooks

• Supports leading wildcards with ReversedWildcardFilterFactory

15

Page 16: Solr Black Belt Pre-conference

Dismax Query Parser (dismax)

• Simplified syntax: loose text “quote phrases” -prohibited +required

• Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)

16

Page 17: Solr Black Belt Pre-conference

dismax: q and q.alt

• odd number of quotes is parsed as if there were no quotes

• wildcards, fuzzy, etc not supported

• q.alt: alternate query; "lucene" parsed, used when q is omitted; useful as *:* to get collection-wide facet counts

17

Page 18: Solr Black Belt Pre-conference

dismax: qf and pf

• query fields / phrase fields

• syntax: field[^boost]...

• example: title^2 body

• pf for boosting where terms in q are in close proximity; entire q string is used as phrase implicitly

18

Page 19: Solr Black Belt Pre-conference

dismax: qs and ps

• qs: query slop; used for explicit "phrase queries"

• ps: phrase slop; used for implicit phrase query added for pf fields

19

Page 20: Solr Black Belt Pre-conference

dismax: mm

• minimum match, for optional clauses

• default = 100% (pure AND)

• Examples:

• pure OR: mm=0 or mm=0%

• at least two should match: mm=2

• at least 75% should match: mm=75%

• 1-3 clauses: all must match; 4 or more clauses: 90% must match: mm=3<90%

• 1-2 clauses: all required; 3-9 clauses: all but 25% must match; more than 9 clauses: all but 3 are required: mm=2<-25% 9<-3

• 1-2 clauses: all must match; 3-5 clauses: one less than the number of clauses must match; 6 or more clauses: 80% must match, rounded down: mm=2<-1 5<80%

http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
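To make the mm rules concrete, here is a minimal Python sketch of how a specification might be evaluated for a given number of optional clauses. It follows the rules described at the URL above as understood here (percentages truncated, negative values meaning "all but", conditional "N<value" forms applying only above N clauses). The function name min_should_match is an assumption, and this is not Solr's own code.

```python
def min_should_match(optional_clauses, spec):
    """Return how many of `optional_clauses` must match under an mm `spec`."""
    result = optional_clauses
    if "<" in spec:
        # Conditional form: "N<value" applies `value` only when there are
        # more than N optional clauses; later conditions override earlier ones.
        for condition in spec.split():
            upper, value = condition.split("<", 1)
            if optional_clauses <= int(upper):
                return result
            result = min_should_match(optional_clauses, value)
        return result

    if spec.endswith("%"):
        # Percentages are truncated toward zero.
        calc = int(optional_clauses * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all but this many".
    result = optional_clauses + calc if calc < 0 else calc
    # Clamp to the range 0..optional_clauses.
    return max(0, min(optional_clauses, result))
```

For example, min_should_match(5, "2<-25% 9<-3") gives 4 (all but 25% of five clauses, truncated), and min_should_match(12, "2<-25% 9<-3") gives 9 (all but 3).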

20

Page 21: Solr Black Belt Pre-conference

dismax: tie

• tiebreaker

• more than one field may match, each scored based on term frequency

• how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.

• A value of "0.0" makes the query a pure "disjunction max query" -- only the maximum scoring sub query contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn't matter what the maximum scoring sub query is, the final score is the sum of the sub scores. Typically a low value (e.g. 0.1) is useful.

21

Page 22: Solr Black Belt Pre-conference

dismax: tie

• The “tie” (tie breaker) parameter is very important, but not easy to understand. It may be useful to visualize it as a “slider” control between 0 and 1, with a value of 0 being a “pure disjunction max” query, and a value of “1” being a “pure disjunction sum” query. So the “max” score is added to the sum of all other scores multiplied by the tie breaker:

• If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is 0:

• score = 2.12 + ((1.7 + 0.5) * 0 ) = 2.12

• If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is 1:

• score = 2.12 + ((1.7 + 0.5) * 1) = 4.32
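The arithmetic above can be written as a small Python function. This is only a sketch of the formula as stated on this slide (maximum score plus the tie breaker times the sum of the other scores); the function name dismax_score is an assumption.

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores: the best score plus `tie` times the rest."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# The slide's scores, with the tie breaker at the two extremes and at a typical low value.
pure_max = dismax_score([2.12, 1.7, 0.5], 0.0)
pure_sum = dismax_score([2.12, 1.7, 0.5], 1.0)
typical = dismax_score([2.12, 1.7, 0.5], 0.1)
```

With a tie breaker of 0.1 the lower scoring fields add only 0.22 to the top score of 2.12, which is usually enough to separate otherwise tied documents without letting weak fields dominate.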

22

Page 23: Solr Black Belt Pre-conference

dismax: bq

• boosting query

• "lucene" query parsed, by default

• combined (optionally) with the user's query to boost matching documents

• warning: when a single bq parses to a boolean query with the default boost of 1.0, its clauses are added to the main query as-is, which can be problematic if they include required/prohibited clauses; supplying more than one bq parameter avoids this

• Example: bq=library:music^2

23

Page 24: Solr Black Belt Pre-conference

dismax: bf

• boost function

• same as using _val_:"function(...)" in bq parameter

• example: bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)

• but be careful about adding versus multiplying scores: bf is additive; to multiply instead, see the "boost" query parser
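To see what the bf example computes, the Python sketch below evaluates recip(x, m, a, b) = a / (m * x + b) for a few document ages. The constant 3.16e-11 is roughly one divided by the number of milliseconds in a year, so the value is 1 for a brand-new document and about 0.5 for a one-year-old one. The helper names are assumptions for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function as used in Solr function queries: a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """Boost for a document whose date field lies `age_ms` milliseconds in the past."""
    return recip(age_ms, 3.16e-11, 1, 1)


new_doc = date_boost(0)
year_old = date_boost(MS_PER_YEAR)
decade_old = date_boost(10 * MS_PER_YEAR)
```

Because bf adds this value to the relevancy score, its effect depends on how large the other scores are; that is the reason for the warning about additive versus multiplicative boosting.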

24

Page 25: Solr Black Belt Pre-conference

local params

• {!parser p=param}expression

• OR {!parser p=param v=expression}

• Indirect parameter values with $ syntax:

• {!parser p=$p}expression&p=param

• Real example:

• _query_:"{!dismax qf=$qf_author pf=$pf_author}[advanced author search box field value]", where qf_author and pf_author are defined in the request handler mapping; combined with other clauses or similar _query_ clauses for other field groups
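A request using indirect parameter values might be assembled as in the Python sketch below. The field names and boosts given for qf_author and pf_author are hypothetical, as is the search term; in the real setup described above those two parameters live in the request handler configuration rather than in the URL.

```python
from urllib.parse import urlencode

params = {
    # The main query hands the author box to dismax; $qf_author and $pf_author
    # are resolved by Solr from other request (or handler) parameters.
    "q": '_query_:"{!dismax qf=$qf_author pf=$pf_author}hatcher"',
    # Hypothetical field names and boosts, shown here only to make the request complete.
    "qf_author": "author_search^20 author_unstem^50",
    "pf_author": "author_search^200",
}

# urlencode percent-escapes the braces, quotes, and $ signs for use in a URL.
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
```

Keeping the field lists in named parameters means the client only ever sends the user's text, and relevancy tuning stays on the server side.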

25

Page 26: Solr Black Belt Pre-conference

Raw query parser

• {!raw f=field}Foo Bar

• exact TermQuery, no analysis or transformations

• ideal for typical fq usage

• fq={!raw f=format}Musical Score

• avoids query parsing escaping madness

26

Page 27: Solr Black Belt Pre-conference

request handler ninjitsu

<requestHandler class="solr.SearchHandler" name="/document">
  <lst name="invariants">
    <str name="q">{!raw f=id v=$id}</str>
    <str name="rows">1</str>
    <str name="fl">*</str>
  </lst>
</requestHandler>

http://localhost:8983/solr/document?id=...

27

Page 28: Solr Black Belt Pre-conference

Field query parser

• {!field f=field}Foo Bar

• generally equivalent to field:"Foo Bar"

• parses to term or phrase query, depending on analysis for field

28

Page 29: Solr Black Belt Pre-conference

Prefix query parser

• {!prefix f=field}foo

• no analysis or transformation performed

• generally equivalent to field:foo*

29

Page 30: Solr Black Belt Pre-conference

Function query parser

• {!func}log(foo)

• Used for _val_ expressions in "lucene" parser

30

Page 31: Solr Black Belt Pre-conference

Boost query parser

• {!boost b=log(popularity)}foo

• Multiplies score, rather than additive

• Example:

• ?q={!boost b=$dateboost v=$qq defType=dismax}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)&qf=text&pf=text&qq=ipod

31

Page 32: Solr Black Belt Pre-conference

extended dismax (edismax)

• Solr 1.5 (currently trunk)

• Supports full lucene query syntax in the absence of syntax errors: AND/OR/NOT, wildcards, fuzzy...; lowercase "and"/"or" are also recognized as operators

• When there are syntax errors, special characters are partially escaped in a smart way; fielded queries, +/-, and phrases are still supported

• shingles phrases specified in pf2 and pf3 parameters

• advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.

32

Page 33: Solr Black Belt Pre-conference

edismax: pf2 and pf3

• shingles into two and three term phrases

• avoids the problem that a document needs 100% of the query words, all in a single field, to get any phrase boost
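The shingling idea can be sketched in a few lines of Python: every run of two (pf2) or three (pf3) consecutive query words becomes its own phrase, so a document can earn a boost for matching just part of the query as a phrase. The function name shingles and the sample query are assumptions, and this illustrates the concept rather than the parser's actual code.

```python
def shingles(query, size):
    """Return every run of `size` consecutive words in `query` as a phrase."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


query = "solr black belt conference"
pf2_phrases = shingles(query, 2)  # two-word phrases, as used for pf2 fields
pf3_phrases = shingles(query, 3)  # three-word phrases, as used for pf3 fields
```

A query shorter than the shingle size simply produces no phrases of that size.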

33

Page 34: Solr Black Belt Pre-conference

edismax: boost

• wraps generated query with boost query

• like the dismax bf param, but multiplies the function query instead of adding it in

34

Page 35: Solr Black Belt Pre-conference

Nested queries

• Naomi's "A Better Advanced Search", Wednesday, 13:00

• http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

• Example:

• _query_:"{!dismax qf=$qf1}query1" AND _query_:"{!dismax qf=$qf2}query2"
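An advanced search form could build such a query with a small helper like the Python sketch below. The function names are assumptions, and the escaping shown (backslash-escaping backslashes and double quotes inside the quoted _query_ value) is a minimal precaution, not a complete treatment of user input.

```python
def nested_dismax(qf_param, user_text):
    """Wrap `user_text` in a nested dismax clause that reads its qf from `$qf_param`."""
    # Inside the double-quoted _query_ value, backslashes and double quotes
    # in the user's text must be backslash-escaped.
    escaped = user_text.replace("\\", "\\\\").replace('"', '\\"')
    return '_query_:"{!dismax qf=$%s}%s"' % (qf_param, escaped)


def all_of(clauses):
    """Require every clause by joining them with AND."""
    return " AND ".join(clauses)


q = all_of([
    nested_dismax("qf1", "query1"),
    nested_dismax("qf2", "query2"),
])
```

Each search box in the form contributes one clause, and each clause gets its own dismax field list through its own qf parameter.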

35

Page 36: Solr Black Belt Pre-conference

Useful request handlers

• dump, ping, luke, system, plugins, threads, properties, file

36

Page 37: Solr Black Belt Pre-conference

Dump

• http://localhost:8983/solr/debug/dump

• Echoes parameters, content streams, and Solr web context

• Careful: with content streams enabled, a client could retrieve the contents of any file on the server or accessible network! [Solution: disable the dump request handler]

37

Page 38: Solr Black Belt Pre-conference

Ping

• http://localhost:8983/solr/admin/ping

• If healthcheck configured and file not available, error is reported

• Executes single configured request and reports failure or OK

38

Page 39: Solr Black Belt Pre-conference

Luke

• http://localhost:8983/solr/admin/luke

• Introspects Lucene index structure and schema relationships

• See an individual document:

• ?id=<key> or ?docId=<lucene doc #>

• Schema details: ?show=schema

• Admin schema browser uses Luke request handler

• See also: original Luke tool - http://www.getopt.org/luke/

39

Page 40: Solr Black Belt Pre-conference

System

• http://localhost:8983/solr/admin/system

• core info, Lucene version, JVM details, uptime, operating system info

40

Page 41: Solr Black Belt Pre-conference

Plugins

• http://localhost:8983/solr/admin/plugins

• Configuration details of Solr core, available query and update handlers, cache settings

41

Page 42: Solr Black Belt Pre-conference

Threads

• http://localhost:8983/solr/admin/threads

• JVM thread details

42

Page 43: Solr Black Belt Pre-conference

Properties

• http://localhost:8983/solr/admin/properties

• All JVM system properties, or single property value (?name=os.arch)

43

Page 44: Solr Black Belt Pre-conference

File

• http://localhost:8983/solr/admin/file?file=/

• See fetchable directory tree

• http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain

44

Page 45: Solr Black Belt Pre-conference

Search components

• Standard: query, facet, mlt, highlight, stats, debug

• Others: elevation, clustering, term, term vector

45

Page 46: Solr Black Belt Pre-conference

Clustering

• Dynamic grouping of documents into labeled sets

• http://localhost:8983/solr/clustering?q=*:*&rows=10

• http://wiki.apache.org/solr/ClusteringComponent

• Requires additional steps to install (see documentation) with Apache Solr distro; baked fully into Lucid certified distro

46

Page 47: Solr Black Belt Pre-conference

Terms

• Enumerates terms from specified fields

• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

47

Page 48: Solr Black Belt Pre-conference

Term Vectors

• Details term vector information: term frequency, document frequency, position and offset information

• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true

48

Page 49: Solr Black Belt Pre-conference

stats.jsp

• Not technically a “request handler”, outputs only XML

• http://localhost:8983/solr/admin/stats.jsp

• Index stats such as number of documents, searcher open time

• Request handler details, number of requests and errors, average request time, average requests per second, number of pending docs, etc, etc

49

Page 50: Solr Black Belt Pre-conference

Analysis Tricks

• CharFilters: MappingCharFilterFactory, PatternReplaceCharFilterFactory, HTMLStripCharFilterFactory

• ReversedWildcardFilterFactory, see example schema.xml "text_rev" field type

• a *thing query searches for gniht*

• PositionFilterFactory

• "can be used with a query Analyzer to prevent expensive Phrase and MultiPhraseQueries" or "all words and shingles to be placed at the same position, so that all shingles to be treated as synonyms of each other."

• CommonGramsFilterFactory - Makes shingles by combining common tokens and regular tokens

• CollationKeyFilterFactory (Solr 1.5) - locale based sorting

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

50

Page 51: Solr Black Belt Pre-conference

Faceting

• multi-select

• hierarchical

51

Page 52: Solr Black Belt Pre-conference

Multi-select

• &facet.field=facet_field&fq=facet_field:(value1 OR value2)

• to exclude filters from facet counts:

• &facet.field={!ex=group}facet_field&fq={!tag=group}facet_field:value2
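The tag/exclude pattern above could be generated as in this Python sketch. The function name, the default tag name, and the sample values are assumptions; values containing spaces or special characters would additionally need quoting or escaping, which is not handled here.

```python
from urllib.parse import urlencode


def multiselect_params(field, selected, tag="group"):
    """Filter on `selected` values of `field` but keep its facet counts unfiltered."""
    params = [
        ("facet", "true"),
        # {!ex=...} makes this facet ignore any filter carrying the same tag.
        ("facet.field", "{!ex=%s}%s" % (tag, field)),
    ]
    if selected:
        values = " OR ".join(selected)
        # {!tag=...} labels the filter so the facet above can exclude it.
        params.append(("fq", "{!tag=%s}%s:(%s)" % (tag, field, values)))
    return params


params = multiselect_params("facet_field", ["value1", "value2"])
query_string = urlencode(params)
```

With the filter excluded, the facet still lists counts for the values the user has not selected, which is what lets a checkbox-style interface offer them.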

52

Page 53: Solr Black Belt Pre-conference

Hierarchical

• http://wiki.apache.org/solr/HierarchicalFaceting

53

Page 54: Solr Black Belt Pre-conference

Facet paging

• Blacklight trick, requesting one more than page size
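The trick works because Solr does not report how many facet values exist in total: asking for one value more than the page displays reveals whether a "next" link is needed. A minimal Python sketch, using the facet.offset and facet.limit parameters and assuming zero-based page numbers and hypothetical helper names:

```python
def facet_page_params(field, page, page_size):
    """Request one value more than the page shows, to detect a following page."""
    return {
        "facet": "true",
        "facet.field": field,
        "facet.offset": page * page_size,
        "facet.limit": page_size + 1,
    }


def split_page(values, page_size):
    """Return the values to display and whether another page exists."""
    return values[:page_size], len(values) > page_size


# Hypothetical example: third page (index 2) of the "format" facet, ten values per page.
params = facet_page_params("format", 2, 10)
```

The extra value is never displayed; it only signals that a further page exists.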

54

Page 55: Solr Black Belt Pre-conference

i18n

• CJK

• SmartChineseAnalyzer

• German

• DictionaryCompoundWordTokenFilterFactory

• To watch:

• http://code.google.com/p/lucene-hunspell/

55

Page 56: Solr Black Belt Pre-conference

Testing

• Automate

• Relevancy

• Performance

• Solr log analysis: zero results queries, slow queries
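Log analysis for zero-result and slow queries could start from something like the Python sketch below. It assumes request log lines of the form "path=... params={...} hits=N status=N QTime=N"; the sample lines are made up, and the regular expression would need adjusting to the actual log format in use. The slow-query threshold is also an assumption.

```python
import re

REQUEST = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} "
    r"hits=(?P<hits>\d+) status=\d+ QTime=(?P<qtime>\d+)"
)


def analyze(lines, slow_ms=1000):
    """Return (zero_result_params, slow_params) found in request log lines."""
    zero_results, slow = [], []
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue  # not a search request line in the assumed format
        if int(match.group("hits")) == 0:
            zero_results.append(match.group("params"))
        if int(match.group("qtime")) >= slow_ms:
            slow.append(match.group("params"))
    return zero_results, slow


# Made-up sample lines in the assumed format.
sample = [
    "INFO: [] webapp=/solr path=/select params={q=ipod} hits=3 status=0 QTime=12",
    "INFO: [] webapp=/solr path=/select params={q=ipdo} hits=0 status=0 QTime=4",
    "INFO: [] webapp=/solr path=/select params={q=*:*&facet=true} hits=900 status=0 QTime=2300",
]
zero_results, slow = analyze(sample)
```

Zero-result queries are candidates for spelling suggestions or synonym work, and slow queries are candidates for warming queries or cache tuning.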

56

Page 57: Solr Black Belt Pre-conference

Questions

• One subject that's of some interest to me is paging through facets.  It drives me a little crazy that Solr lets you page through facets, yet it won't give you a total count of how many facets you are paging through, which makes presenting a fully functional paging mechanism rather problematic.  I've heard that Bobo-browse may be helpful here but haven't dug into it too deeply.  Maybe this is too narrow a topic to be worth spending much time on, but if anybody has any thoughts or solutions, I'd love to discuss them

• What if we wanted to implement a traditional browse with Solr?  Like a call number browse to simulate shelf browsing? Is there a way to leverage Solr for something like that?  I'd think the trie structure would make this possible, but how it could be exposed in that manner is a mystery.

• that inner query/nested query stuff that Naomi is using for advanced search would be one thing I'd add to the list.  Continues to confuse me every time I look at it.

• Another idea, approaches for figuring out how much RAM solr needs, and how big to make the various Solr query caches. I know it depends on a lot and is different for every index, but I don't even know how to get started figuring out what it should be for my index.  Not sure if this makes sense as an issue or not, just an idea.

57

Page 58: Solr Black Belt Pre-conference

Questions

• We're currently using 1.3, so the biggest changes/improvements in 1.4 would be good.

• I'm also interested in fulltext indexing.  We have some documents (newspapers and dissertations) that are quite large (hundreds of MB of plaintext).  Is there a good rule-of-thumb for how much text we should index?  How large is too large?  Is uncorrected OCR'd text worth indexing?

• The other topic I'm particularly interested in is update performance.  Most of our data is currently batch-loaded and batch-indexed, but we are moving to interactive editing for some of our data, with the expectation that the solr index be kept updated in realtime (or near-realtime).  Should we use a separate server (or core) to keep the updates from impacting read-only performance?  Do we need to optimize the index (this can take 20+ seconds for our main index) frequently?

58

Page 59: Solr Black Belt Pre-conference

Questions

• One other thing: we're using the web service interface which seems fast and reliable.  Is the SolrJ interface significantly faster or better?

• DidYouMean/Spellcheck

59

Page 60: Solr Black Belt Pre-conference

Questions

• Does it make sense to use fixtures or fixture scenarios like Rails? Does it make sense to set up a separate 'testing' core that can be dynamically dumped and rebuilt through the APIs by your test suite?

60

Page 61: Solr Black Belt Pre-conference

Questions

1. What methods and tools can be used to determine whether configuration or physical resource changes might improve performance? E.g. increasing filter cache, adding more memory, going to 64 bit architecture, adding another disk drive to the array, etc.

2. Best procedures to make these configuration changes. E.g. these two parameters work in conjunction with each other, change this one then that one, this one should be set to X percent of your physical memory, don't touch this one unless you really know what you are doing, etc.

61

Page 62: Solr Black Belt Pre-conference

Questions

- Scaling issues: millions of records, trying to keep data reasonably current

- Distributed search

- Considerations for non-Roman data mixed in with Roman data?  We have CJK data, Cyrillic, Hebrew, Arabic.  Is there a sensible way to set up the analyzers?

- Any considerations for merging heterogeneous data (MARC, OAI-DC, EAD, web spidering) that may be particular to Solr?  (I don't expect so, it's all going into one schema, but maybe you've run into something.)

62

Page 63: Solr Black Belt Pre-conference

Questions

Indexing strategies:

* Performance tuning or configuring Solr for indexing (as opposed to a copy of Solr a search app runs on). Which config options make a difference? What JVM options matter?

* Merging a 'build' copy of an index into a search app's copy. Is this the replication piece?

* Using multiple threads when writing to Solr. Using StreamingUpdateSolrServer effectively/safely.

Advanced features on retrieval side:

* Info about facets: can Solr retrieve the global count number for a facet in addition to the count number within a filtered search result set? Only with 2 queries?

* Doing Google-like autosuggest against facet values for subject terms (not like facet.prefix method in the Solr 1.4 book). Best to use a multicore setup and have an index or two dedicated to autosuggestions?

Multiple index design:

As my colleague Eric put it: big generalized index + N extreme indexes = Righteous Discovery Platform || High Folly?

This is a question we are dealing with. As librarians and researchers learn what we are doing on our campus a lot of people are offering up data. Some of which is *highly* specialized. For example, metadata based on a microscopy data standard. We expect that these researchers would like us to create an expert search tool with advanced features tailored to their data model

63

Page 64: Solr Black Belt Pre-conference

Questions

Getting a better understanding of Solr memory use would be very helpful for us. (Or perhaps tools and tips for understanding Solr memory use.) Right now we can watch the Tomcat/Solr JVM with jconsole and see heap use suddenly increasing and decreasing, but we don't understand why, so our main technique is to wait until we get an OutOfMemoryError and then increase the memory we give to the Solr/Tomcat JVM. (That and continuing to buy more memory :)

The dismax/edismax and how folks are using them to tweak relevance ranking (based on MARC fields) is also of great interest.

A couple of topics that may or may not be of interest to other folks and may or may not be appropriate for the workshop.  The context of these is that we are trying to understand scalability and performance issues with very large indexes (300GB x 10) and multiple shards (5 million full-text docs and growing.)

1) I'd like to get a bit of a better understanding of how filter queries are implemented. (and how that relates to faceting)

2) I'd like to get a better understanding of how distributed search is implemented.  In particular, I'd like to understand the traffic that goes between the head shard and the shards it distributes the query to.  For example in the tomcat logs we can see traffic with the isShard=true  and ids="abc","def" parameters.

64

Page 65: Solr Black Belt Pre-conference

Questions

• Call number -> shelf key

• Reverse sorting fields

• termsComponent queries

• Terms -> documents

• Can we apply facets?

65

Page 66: Solr Black Belt Pre-conference

Books

66

Page 67: Solr Black Belt Pre-conference

e-book now available!

print coming soon

http://www.manning.com/lucene

67

Page 68: Solr Black Belt Pre-conference

LucidWorks for Solr

• Certified Distribution

• Value-added integration

• KStemmer

• Carrot2 clustering

• LucidGaze for Solr

• installer

• Reference Manual

• Solr 1.4++ certified

68

Page 69: Solr Black Belt Pre-conference

LucidGaze for Solr

• Monitoring tool, captures, stores, and interactively views Solr performance metrics

• requests/second

• time/request

69

Page 70: Solr Black Belt Pre-conference

70

Page 71: Solr Black Belt Pre-conference

LucidFind

http://search.lucidimagination.com/?q=code4lib

71

Page 72: Solr Black Belt Pre-conference

72

Page 73: Solr Black Belt Pre-conference

http://www.flickr.com/photos/mikeoliveri/2036797884/

73