Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies [email protected], 5/25/2011
Searching the United States Code
§ Who are we:
• Paul Nelson, Chief Architect
• Ronald Matamoros, Lead Engineer
§ Our Mission: Replace Personal Librarian Search
• A 20-Year-Old Search Engine!
§ Key Challenges
• How to index this massive, complex, 85-year-old document?
• How to replicate 20-Year-Old search features?
§ Government Documents are Fun!
Search Technologies
§ The largest independent provider of enterprise search expertise and services
§ 80 full-time dedicated search engine experts
§ 200+ customers
§ Technology Neutral
• (yeah, we know Sphinx too)
§ Offices All Over
• DC, NY, CA, MD, OH, UK, CR…
A Quick Civics Lesson…
§ The United States Code
• The general & permanent laws of the U.S. Government – All in one place
• 51 titles
  § Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
• First Version: 1926
§ The Office of the Law Revision Counsel (OLRC)
• 20 lawyers who author the U.S. Code
• They report to the Speaker of the House of Representatives
§ Bonus Question: Which Title is the largest?
Major Challenges
1. Document Parsing
• A 50 Volume Table Of Contents!
2. Query Parsing
• Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3. Searching & Highlighting Fields
• Some fields are embedded in the document
• These fields must be highlighted in context
[Three screenshot slides; images not reproduced]
Part The First: Document Processing
Document Processing / Indexing
[Diagram: indexing pipeline with stages USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Store (Repository), Xform & Index, Solr]
Field Type 1: Extracted to Index
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …
Callouts on the sample: Page Numbers, Title Heading, Source Credit
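As a rough illustration of "Extracted to Index" fields, the key:value pairs in the leading comments of the sample could be pulled out along these lines. This is a minimal sketch, not the presenters' parser: the function name `extract_metadata` and the splitting rule (a new pair starts at whitespace followed by `word:`) are assumptions made for this example.

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)

def extract_metadata(xhtml):
    """Collect key:value pairs from HTML comments, skipping field markers.

    Assumption: one comment may hold several pairs, and a new pair begins
    wherever whitespace is followed by letters and a colon.
    """
    meta = {}
    for body in COMMENT_RE.findall(xhtml):
        for part in re.split(r"\s+(?=[A-Za-z]+:)", body.strip()):
            key, sep, value = part.partition(":")
            # field-start / field-end comments delimit in-document fields,
            # so they are not treated as metadata here.
            if sep and not key.startswith("field-"):
                meta[key] = value.strip()
    return meta

SAMPLE = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 --> "
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 --> "
    "<!-- itemsortkey:140AAAD --> "
    '<!-- field-start:head --><h3 class="section-head">§1. Establishment of Coast Guard</h3>'
    "<!-- field-end:head -->"
)
meta = extract_metadata(SAMPLE)
```

Values that contain spaces, such as `itempath`, survive because the split only happens before a `word:` token.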
[Diagram: Title 14 granularized into a tree; levels shown: ch. 1, ch. 2, ch. 3, …; pt. A, pt. B, pt. C, …; sec. 1, sec. 2, sec. 3, …]
Field Type 2: Embedded Refs
Callouts on the same sample document: Public Law, Other USC Refs, Statute at Large
Document Processing / Indexing
§ /US-Code
  § /2010
    § /title2
      § /USC-title2-section1532.htm
      § /USC-title2-node3-rule5.htm
Part The Second: Token Processing
Token Processing 1 xhtml tag tokenizer
<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002—Pub. L. 107–296 substituted “Department of … <!-- field-end:amendment-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
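The tokenization shown above can be approximated with a single regular expression. This is only a sketch in plain Python, not the Lucene tokenizer used in the project; the name `tokenize` and the rule "comments and tags are kept whole, everything else is split into word tokens" are assumptions read off the slide.

```python
import re

# Comments first, then tags, then runs of word characters.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|\w+", re.S)

def tokenize(xhtml):
    """Split XHTML into comment tokens, tag tokens, and word tokens."""
    return TOKEN_RE.findall(xhtml)

SAMPLE = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002—Pub. L. 107–296 substituted “Department of … '
    "<!-- field-end:amendment-note -->"
)
tokens = tokenize(SAMPLE)
```

Punctuation such as the dashes, periods, and quotation marks is dropped because it matches none of the three alternatives.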
Field Type 3: Marked Within Doc
[Same sample document as under Field Type 1; the field-start and field-end comments mark each field within the document]
Token Processing 2 Mark Start and End Tags
S/amendment
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
E/amendment
Token Processing 3 Remove XHTML Tags
S/amendment
Amendments
2002
Pub
L
107
296
Substituted
Department
of
E/amendment
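Steps 2 and 3 can be sketched together as one pass over the token list: field comments become S/ and E/ marker tokens, and the remaining XHTML tags are dropped. The function name `mark_and_strip` is hypothetical, and shortening "amendment-note" to "amendment" is an assumption made to match the marker names on the slides.

```python
import re

FIELD_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")

def mark_and_strip(tokens):
    """Turn field comments into S/ and E/ markers and drop other tags."""
    out = []
    for tok in tokens:
        m = FIELD_RE.fullmatch(tok)
        if m:
            name = m.group(2)
            # Assumption: the slides show "S/amendment" for the
            # "amendment-note" field, so a trailing "-note" is removed.
            if name.endswith("-note"):
                name = name[: -len("-note")]
            out.append(("S/" if m.group(1) == "start" else "E/") + name)
        elif tok.startswith("<"):
            continue  # XHTML tag or unrelated comment: not indexed
        else:
            out.append(tok)
    return out

marked = mark_and_strip([
    "<!-- field-start:amendment-note -->",
    '<h4 class="note-head">', "Amendments", "</h4>",
    '<p class="note-body">', "2002", "Pub", "L",
    "<!-- field-end:amendment-note -->",
])
```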
Token Processing 4 Tag Original Case & Lower Case
S/amendment
O/Amendments L/amendments
O/2002 L/2002
O/Pub L/pub
O/L L/l
O/107 L/107
O/296 L/296
O/Substituted L/substituted
O/Department L/department
O/of L/of
E/amendment
Token Processing 5 Lemmatize
Uses a dictionary-based lemmatizer built on GCIDE and WordNet
S/amendment
O/Amendments L/amendments amendment
O/2002 L/2002 2002
O/Pub L/pub pub
O/L L/l l
O/107 L/107 107
O/296 L/296 296
O/Substituted L/substituted substitute
O/Department L/department department
O/of L/of of
E/amendment
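Steps 4 and 5 expand every word into three forms: the original case (O/), the lower case (L/), and the lemma. A minimal sketch follows. The tiny `LEMMAS` dictionary stands in for the GCIDE/WordNet-based lemmatizer, and grouping the three forms per token position is an assumption about how they would be stacked in the index.

```python
# Toy stand-in for the dictionary-based lemmatizer (assumption).
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}

def expand(tokens, lemmas=LEMMAS):
    """Return one list of index terms per token position."""
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append([tok])  # field markers pass through unchanged
        else:
            low = tok.lower()
            # Unknown words fall back to their lower-case form as the lemma.
            positions.append(["O/" + tok, "L/" + low, lemmas.get(low, low)])
    return positions

expanded = expand(["S/amendment", "Amendments", "Pub", "E/amendment"])
```

With this layout an exact-case query matches the O/ term, a case-insensitive exact query matches the L/ term, and an ordinary query matches the lemma.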
Part The Third: Query Processing
Query Processing
[Diagram: Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search]
§ Communicates via generic QNode Class
• Simpler to manipulate than Lucene operators
§ Can produce FAST FQL as well
• (cue the derisive catcalls)
§ But most importantly:
• It is a Query Processing Pipeline
§ Mix and match query processing modules
(not all stages shown)
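The idea of a generic query node passed through interchangeable stages might look roughly like this. Only the name QNode comes from the slides; its fields, the stage functions, and the toy lemma table are assumptions for illustration, applied to the sample query exact:FOIA “top secret” amendment:RECORDS.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QNode:
    op: str                      # "and", "phrase", "exact", "field", "term"
    value: str = ""              # term text or field name
    children: List["QNode"] = field(default_factory=list)

def mark_original(node):
    # exact:X becomes a term that matches the original-case token O/X.
    if node.op == "exact":
        return QNode("term", "O/" + node.children[0].value)
    return QNode(node.op, node.value, [mark_original(c) for c in node.children])

def mark_lowercase(node, in_phrase=False):
    if node.op == "term":
        if node.value.startswith("O/"):
            return node
        low = node.value.lower()
        # Assumption: phrase terms match the L/ form, others go on to lemmatization.
        return QNode("term", "L/" + low if in_phrase else low)
    inside = in_phrase or node.op == "phrase"
    return QNode(node.op, node.value, [mark_lowercase(c, inside) for c in node.children])

def lemmatize(node, lemmas={"records": "record"}):
    if node.op == "term" and "/" not in node.value:
        return QNode("term", lemmas.get(node.value, node.value))
    return QNode(node.op, node.value, [lemmatize(c, lemmas) for c in node.children])

def run_pipeline(node, stages):
    for stage in stages:         # stages can be mixed and matched freely
        node = stage(node)
    return node

tree = QNode("and", children=[
    QNode("exact", children=[QNode("term", "FOIA")]),
    QNode("phrase", children=[QNode("term", "top"), QNode("term", "secret")]),
    QNode("field", "amendment", [QNode("term", "RECORDS")]),
])
result = run_pipeline(tree, [mark_original, mark_lowercase, lemmatize])
```

Because each stage takes a tree and returns a new tree, stages can be reordered or omitted without touching the others.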
Query Processing
[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]
Query: exact:FOIA “top secret” amendment:RECORDS
After parse:
and
  exact:
    |FOIA|
  phrase
    |top| |secret|
  amendment:
    |RECORDS|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After mark original:
and
  O/FOIA
  phrase
    |top| |secret|
  amendment:
    |RECORDS|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After mark lowercase:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |records|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After lemmatize:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |record|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After query template:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  between
    S/amendment
    E/amendment
    |record|
The between() Operator
§ between(start-tag, end-tag, pos-clause, neg-clause)
§ start-tag → Starting tag, e.g. “S/amendment”
§ end-tag → Ending tag, e.g. “E/amendment”
§ pos-clause → words which must occur between start and end
• Note: Requires a nested ScanAnd() operator
§ neg-clause → words which must not occur between start and end
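The semantics of between() can be illustrated over a plain token list. This sketch is not the Lucene operator described on the slide, and it does not model the nested ScanAnd() operator. It only shows the matching rule, and the function signature is an assumption that mirrors the four arguments listed above.

```python
def between(tokens, start_tag, end_tag, pos_clause, neg_clause=()):
    """True if some start..end span holds every pos word and no neg word."""
    required, forbidden = set(pos_clause), set(neg_clause)
    span = None                      # words seen since the last start tag
    for tok in tokens:
        if tok == start_tag:
            span = set()
        elif tok == end_tag and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None
        elif span is not None:
            span.add(tok)
    return False

doc = ["S/amendment", "amendment", "record", "E/amendment",
       "S/statute", "secret", "E/statute"]
hit = between(doc, "S/amendment", "E/amendment", ["record"])
miss = between(doc, "S/amendment", "E/amendment", ["secret"])
```

Here "record" is found inside the amendment field, while "secret" occurs only in another field and therefore does not match.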
Part The Fourth: Hierarchical Navigation
[Screenshot slide; image not reproduced]
Hierarchies: Requirements
§ Any number of levels
  § Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
§ Levels vary across titles
  § Title 1: 3 levels
  § Title 26: 8 levels
§ Multiple views:
  § Children
  § Ancestors
  § Ancestor’s Siblings
§ Multiple search scopes:
  § Only children, all descendants, everything
Hierarchies: Ancestor-Siblings
§ US-Code
  • Title 1
  • Title 2
    § Chapter 1
    § Chapter 2
      – Part 1
      – Part 2
        • Section 2.1
        • Section 2.2
      – Part 3
      – Part 4
    § Chapter 3
    § Chapter 4
  • Title 3
Hierarchies: Fields
§ ancestors
• Searching
  § USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§ encodedAncestors – for display only
• Where the node exists within the hierarchy
  § id;heading;subjectTitle//id;heading;subjectTitle//...
  § USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§ parentId – ID of the parent node
  § USC-title2-chapter25-subchapter2
§ treesort – Hierarchical sort field, e.g. “13/000/0/00882”
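The three path-derived fields can be built from the chain of nodes above a document. This is a sketch under assumptions: the helper `hierarchy_fields` and its input shape (a list of id, heading, subjectTitle tuples ending at the parent) are invented for this example; only the field names and separators come from the slide.

```python
def hierarchy_fields(path):
    """Build ancestors, encodedAncestors, and parentId for one node.

    path: list of (id, heading, subjectTitle) tuples, ordered from the
    top of the hierarchy down to the node's parent (assumed input shape).
    """
    return {
        # Space-separated ids, so each ancestor id is searchable as a term.
        "ancestors": " ".join(node_id for node_id, _, _ in path),
        # Display-only encoding: fields joined by ";" and levels by "//".
        "encodedAncestors": "//".join(";".join(entry) for entry in path),
        "parentId": path[-1][0] if path else None,
    }

fields = hierarchy_fields([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
```

Keeping the display string separate from the searchable id list means the breadcrumb can be rendered from a single stored field without extra lookups.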
Hierarchies: Tree Sort
§ Sorting In Print Order
• Front Matter → Titles → Tables → etc.
• Everything padded to fixed-length
§ Example: 01/011/1/02032
• 01 = USC Title
• 011 = Title 11
• 1 = An Appendix
• 02032 = Sequence # in file
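A fixed-width key of this shape sorts correctly as a plain string, which is the point of the padding. The sketch below assumes the widths seen in the two sample keys (2, 3, 1, and 5 digits); the function and parameter names are hypothetical.

```python
def treesort_key(matter, title, appendix, sequence):
    """Build a zero-padded sort key such as '01/011/1/02032'.

    matter:   numeric code for the kind of matter (01 = USC Title)
    title:    title number
    appendix: True if the node belongs to an appendix
    sequence: sequence number within the file
    """
    return f"{matter:02d}/{title:03d}/{int(appendix)}/{sequence:05d}"

key = treesort_key(1, 11, True, 2032)

# Because every part is padded, string order equals numeric order.
rows = [(1, 11, True, 2032), (1, 2, False, 15), (1, 11, False, 7)]
by_string = sorted(treesort_key(*r) for r in rows)
by_number = [treesort_key(*r) for r in sorted(rows)]
```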
Hierarchies: Sample Searches
§ Assuming Node = “USC-title2-chapter25”
§ Search Children
• parentId:USC-title2-chapter25
§ Search All Descendants
• ancestors:USC-title2-chapter25
§ Ancestor Siblings
• (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
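The three sample searches can be generated from a node id alone. The helper names below are hypothetical, and the sketch assumes that ids are hyphen-joined paths whose level names contain no hyphens, as in the examples on the slide.

```python
def children_query(node_id):
    """Documents whose direct parent is node_id."""
    return f"parentId:{node_id}"

def descendants_query(node_id):
    """Documents anywhere below node_id."""
    return f"ancestors:{node_id}"

def ancestor_siblings_query(node_id):
    """Children of the node and of each of its ancestors."""
    parts = node_id.split("-")
    # Every prefix of the id is an ancestor id (assumed id scheme).
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"

node = "USC-title2-chapter25"
queries = [children_query(node), descendants_query(node),
           ancestor_siblings_query(node)]
```

If ids could contain hyphens within a level, the prefixes would need to come from the stored ancestors field instead of being split out of the id.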
Contact
§ Paul Nelson
• [email protected]
§ Ronald Matamoros
• [email protected]
§ Search Technologies
• http://searchtechnologies.com