Searching the United States Code with Solr/Lucene

Post on 03-Jul-2015


DESCRIPTION

What are the challenges in searching an 85-year-old document? The United States Code was published by the United States Congress in 1926 as a single bound volume containing all of the general and permanent laws of the United States Government. It has been updated every year since and has grown into a 30-volume set of some 40,000 pages divided into 50 titles.

TRANSCRIPT

Searching The United States Code with Solr/Lucene

Paul Nelson / Ronald Matamoros, Search Technologies, 5/25/2011

pnelson@searchtechnologies.com / rmatamoros@searchtechnologies.com

Searching the United States Code

§  Who are we:
   •  Paul Nelson, Chief Architect
   •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
   •  A 20-Year-Old Search Engine!
§  Key Challenges
   •  How to index this massive, complex, 85-year-old document?
   •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!

Search Technologies
§  The largest independent provider of enterprise search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD, OH, UK, CR…

A Quick Civics Lesson…
§  The United States Code
   •  The general & permanent laws of the U.S. Government – All in one place
   •  51 titles
      §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
   •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
   •  20 lawyers who author the U.S. Code
   •  They report to the Speaker of the House of Representatives
§  Bonus Question: Which Title is the largest?

Major Challenges
1.  Document Parsing
   •  A 50 Volume Table Of Contents!
2.  Query Parsing
   •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3.  Searching & Highlighting Fields
   •  Some fields are embedded in the document
   •  These fields must be highlighted in context


Part The First: Document Processing


Document Processing / Indexing

[Pipeline diagram: USC Title → Parse & Granularize → Construct XHTML → Embed Refs → Store (Repository) → Xform & Index → Solr]

Field Type 1: Extracted to Index

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …

Slide callouts (fields extracted to the index): Page Numbers, Title Heading, Source Credit
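As an illustration of the "extracted to index" idea, the leading comment block of the sample document can be read into index fields with a few lines of Python. This is only a sketch: the function name `extract_header_fields` and the list of known keys are assumptions taken from the sample shown on the slide, not the presenters' actual parser.

```python
import re

# Keys seen in the sample header comments on the slide (assumed to be the full set).
KNOWN_KEYS = ("documentid", "usckey", "currentthrough", "documentPDFPage",
              "itempath", "itemsortkey", "expcite")

COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
KEY = re.compile(r"\b(" + "|".join(KNOWN_KEYS) + r"):")


def extract_header_fields(xhtml):
    """Collect key:value pairs from header comments (hypothetical helper)."""
    fields = {}
    for body in COMMENT.findall(xhtml):
        # field-start / field-end comments delimit in-document fields, not metadata.
        if body.startswith(("field-start:", "field-end:")):
            continue
        matches = list(KEY.finditer(body))
        for i, match in enumerate(matches):
            # A value runs until the next known key or the end of the comment.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
            fields[match.group(1)] = body[match.end():end].strip()
    return fields


sample = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 --> "
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 --> "
    "<!-- itemsortkey:140AAAD --> "
    "<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD"
    "!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 --> "
    "<!-- field-start:head -->"
)
fields = extract_header_fields(sample)
# The expcite value appears to use "!@!" as a level separator.
citation_levels = fields["expcite"].split("!@!")
```

Splitting `expcite` on `!@!` yields one entry per hierarchy level, which is one plausible way to obtain the title heading for display.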

Document Processing / Indexing

Title 14
   ch. 1   ch. 2   ch. 3
      pt. A   pt. B   pt. C
         sec. 1   sec. 2   sec. 3
            …   …

Field Type 2: Embedded Refs

(Same Title 14 sample document as shown under Field Type 1; the slide callouts mark the references embedded in its text.)

§  Statute at Large: 63 Stat. 496
§  Public Law: Pub. L. 94–546; Pub. L. 107–296 (two occurrences)
§  Other USC Refs: title 14, U.S.C., 1946 ed., §1
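A rough sketch of how such references might be spotted before they are embedded as links follows. The two regular expressions and the name `find_refs` are illustrative assumptions covering only the citation shapes visible in the sample; a production reference extractor for the Code would need many more patterns.

```python
import html
import re

# Assumed citation shapes, based only on the sample text on the slide.
PUBLIC_LAW = re.compile(r"Pub\. L\. \d+[\u2013-]\d+")
STATUTE_AT_LARGE = re.compile(r"\d+ Stat\. \d+")


def find_refs(fragment):
    """Return (kind, citation) pairs in document order (hypothetical helper)."""
    text = html.unescape(fragment)  # turns &ndash; into an en dash, &sect; into §
    found = []
    for kind, pattern in (("public-law", PUBLIC_LAW),
                          ("statute-at-large", STATUTE_AT_LARGE)):
        for match in pattern.finditer(text):
            found.append((match.start(), kind, match.group(0)))
    return [(kind, cite) for _, kind, cite in sorted(found)]


source_credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),"
refs = find_refs(source_credit)
```

Decoding the entities first keeps the patterns simple, since the source stores the dash in a Public Law number as `&ndash;`.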


Document Processing / Indexing

Repository layout:
§  /US-Code
   §  /2010
      §  /title2
         §  /USC-title2-section1532.htm
         §  /USC-title2-node3-rule5.htm

Part The Second: Token Processing


Token Processing 1: XHTML Tag Tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->

Output tokens (one per line):
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->

Field Type 3: Marked Within Doc

(Same Title 14 sample document as shown under Field Type 1. Fields such as head, statute, sourcecredit, notes, and amendment-note are delimited inside the document by <!-- field-start:NAME --> and <!-- field-end:NAME --> comments.)

Token Processing 2: Mark Start and End Tags

Before                                   After
<!-- field-start:amendment-note -->      S/amendment
<h4 class="note-head">                   <h4 class="note-head">
Amendments                               Amendments
</h4>                                    </h4>
<p class="note-body">                    <p class="note-body">
2002                                     2002
Pub                                      Pub
L                                        L
107                                      107
296                                      296
Substituted                              Substituted
Department                               Department
of                                       of
<!-- field-end:amendment-note -->        E/amendment

Token Processing 3: Remove XHTML Tags

Before                     After
S/amendment                S/amendment
<h4 class="note-head">     (removed)
Amendments                 Amendments
</h4>                      (removed)
<p class="note-body">      (removed)
2002                       2002
Pub                        Pub
L                          L
107                        107
296                        296
Substituted                Substituted
Department                 Department
of                         of
E/amendment                E/amendment
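Steps 1 to 3 can be sketched in plain Python as three small functions. This is a hedged illustration, not the Solr/Lucene tokenizer and filters from the talk: the function names are hypothetical, and shortening `amendment-note` to `amendment` in the marker tokens is an assumption inferred from the slides.

```python
import html
import re

# Step 1 pattern: comments, then tags, then runs of letters or digits.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|[A-Za-z0-9]+", re.DOTALL)
FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def tokenize(xhtml):
    """Step 1: split into comment, tag, and word tokens."""
    return TOKEN.findall(html.unescape(xhtml))


def mark_fields(tokens):
    """Step 2: turn field-start/field-end comments into S/ and E/ tokens."""
    marked = []
    for token in tokens:
        match = FIELD_MARK.fullmatch(token)
        if match is None:
            marked.append(token)
            continue
        name = match.group(2)
        if name.endswith("-note"):  # assumption: slides show "S/amendment"
            name = name[: -len("-note")]
        marked.append(("S/" if match.group(1) == "start" else "E/") + name)
    return marked


def strip_tags(tokens):
    """Step 3: drop every remaining tag or comment token."""
    return [token for token in tokens if not token.startswith("<")]


sample = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    '&ldquo;Department of \u2026 <!-- field-end:amendment-note -->'
)
stream = strip_tags(mark_fields(tokenize(sample)))
```

Keeping the S/ and E/ markers in the same token stream as the words is what later allows a field-restricted search to be expressed as a positional constraint.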

Token Processing 4: Tag Original Case & Lower Case

Before            After
S/amendment       S/amendment
Amendments        O/Amendments L/amendments
2002              O/2002 L/2002
Pub               O/Pub L/pub
L                 O/L L/l
107               O/107 L/107
296               O/296 L/296
Substituted       O/Substituted L/substituted
Department        O/Department L/department
of                O/of L/of
E/amendment       E/amendment

Token Processing 5: Lemmatize

Uses a dictionary-based lemmatizer built on GCIDE and WordNet

Before                           After
S/amendment                      S/amendment
O/Amendments L/amendments        O/Amendments L/amendments amendment
O/2002 L/2002                    O/2002 L/2002 2002
O/Pub L/pub                      O/Pub L/pub pub
O/L L/l                          O/L L/l l
O/107 L/107                      O/107 L/107 107
O/296 L/296                      O/296 L/296 296
O/Substituted L/substituted      O/Substituted L/substituted substitute
O/Department L/department        O/Department L/department department
O/of L/of                        O/of L/of of
E/amendment                      E/amendment
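Steps 4 and 5 amount to emitting several terms for each word. The sketch below assumes that the terms shown side by side on the slides share one token position, and it replaces the GCIDE/WordNet dictionary with a two-entry toy table; `expand` and `TOY_LEMMAS` are hypothetical names.

```python
# Toy stand-in for the dictionary-based lemmatizer (assumption for illustration).
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def expand(tokens):
    """Return one list of terms per token position.

    Field markers pass through unchanged. Every other word yields its
    original-case form (O/), its lower-case form (L/), and its lemma.
    """
    positions = []
    for token in tokens:
        if token.startswith(("S/", "E/")):
            positions.append([token])
            continue
        lower = token.lower()
        positions.append(["O/" + token, "L/" + lower, TOY_LEMMAS.get(lower, lower)])
    return positions


expanded = expand(["S/amendment", "Amendments", "2002", "Pub", "E/amendment"])
```

With this layout an exact-case query can target the `O/` term, a case-insensitive exact query the `L/` term, and an ordinary query the bare lemma, all within a single indexed field.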

Part The Third: Query Processing


Query Processing

[Pipeline diagram: Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search]

§  Communicates via generic QNode Class
   •  Simpler to manipulate than Lucene operators
§  Can produce FAST FQL as well
   •  (cue the derisive catcalls)
§  But most importantly:
   •  It is a Query Processing Pipeline
      §  Mix and match query processing modules

(not all stages shown)

Query Processing (stage: parse)

[Pipeline diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   exact:
      |FOIA|
   phrase
      |top| |secret|
   amendment:
      |RECORDS|

Query Processing (stage: mark original)

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |top| |secret|
   amendment:
      |RECORDS|

Query Processing (stage: mark lowercase)

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   amendment:
      |records|

Query Processing (stage: lemmatize)

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   amendment:
      |record|

Query Processing (stage: query template)

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   between
      S/amendment
      E/amendment
      |record|
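The stage-by-stage rewriting shown in the Query Processing slides can be imitated with a small tree type and a list of functions. This is a sketch under stated assumptions: the `QNode` fields, the stage function names, and the one-entry lemma table are all hypothetical, and parsing the query string and building the final Lucene query are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QNode:
    op: str                      # "and", "phrase", "field", "between", or "term"
    value: Optional[str] = None  # term text, or the field name for "field" nodes
    children: List["QNode"] = field(default_factory=list)


def term(text):
    return QNode("term", text)


def mark_original(node):
    """exact:X becomes the original-case term O/X."""
    kids = [mark_original(child) for child in node.children]
    if node.op == "field" and node.value == "exact":
        marked = [term("O/" + kid.value) for kid in kids]
        return marked[0] if len(marked) == 1 else QNode("and", None, marked)
    return QNode(node.op, node.value, kids)


def mark_lowercase(node, in_phrase=False):
    """Lower-case untagged terms; phrase terms also get the L/ prefix."""
    if node.op == "term":
        if node.value.startswith("O/"):
            return node
        lower = node.value.lower()
        return term("L/" + lower if in_phrase else lower)
    inside = in_phrase or node.op == "phrase"
    return QNode(node.op, node.value, [mark_lowercase(c, inside) for c in node.children])


TOY_LEMMAS = {"records": "record"}  # stand-in for the real dictionary


def lemmatize(node):
    """Lemmatize terms that carry no O/ or L/ tag."""
    if node.op == "term":
        if "/" in node.value:  # crude check for an existing tag
            return node
        return term(TOY_LEMMAS.get(node.value, node.value))
    return QNode(node.op, node.value, [lemmatize(c) for c in node.children])


def apply_template(node):
    """Rewrite field:X as between(S/field, E/field, X)."""
    kids = [apply_template(child) for child in node.children]
    if node.op == "field":
        markers = [term("S/" + node.value), term("E/" + node.value)]
        return QNode("between", None, markers + kids)
    return QNode(node.op, node.value, kids)


PIPELINE = [mark_original, mark_lowercase, lemmatize, apply_template]


def run(node):
    for stage in PIPELINE:
        node = stage(node)
    return node


def render(node):
    if node.op == "term":
        return node.value
    return node.op + "(" + ", ".join(render(c) for c in node.children) + ")"


# Parsed form of: exact:FOIA "top secret" amendment:RECORDS
parsed = QNode("and", None, [
    QNode("field", "exact", [term("FOIA")]),
    QNode("phrase", None, [term("top"), term("secret")]),
    QNode("field", "amendment", [term("RECORDS")]),
])
final = render(run(parsed))
```

Because each stage takes a tree and returns a tree, stages can be reordered or swapped by editing `PIPELINE`, which is the "mix and match" property the slides emphasize.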

The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)
   §  start-tag → Starting tag, e.g. “S/amendment”
   §  end-tag → Ending tag, e.g. “E/amendment”
   §  pos-clause → words which must occur between start and end
      •  Note: Requires a nested ScanAnd() operator
   §  neg-clause → words which must not occur between start and end
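The semantics of between() can be illustrated over a plain list of tokens. The function below is only a model of the behaviour described above, written for an in-memory token list; it is not the Lucene operator from the talk, and it does not attempt to reproduce ScanAnd().

```python
def between(tokens, start_tag, end_tag, pos_words, neg_words=()):
    """Return True if some start_tag..end_tag span holds every word in
    pos_words and none of the words in neg_words (illustrative model only)."""
    required = set(pos_words)
    forbidden = set(neg_words)
    span = None  # words seen since the most recent start tag, or None if outside
    for token in tokens:
        if token == start_tag:
            span = set()
        elif token == end_tag and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None
        elif span is not None:
            span.add(token)
    return False


doc = ["S/statute", "secret", "E/statute",
       "S/amendment", "amendment", "2002", "record", "E/amendment"]

hit = between(doc, "S/amendment", "E/amendment", ["record"])
miss_outside = between(doc, "S/amendment", "E/amendment", ["secret"])
miss_negated = between(doc, "S/amendment", "E/amendment", ["record"], ["2002"])
```

Here "secret" occurs in the document but outside the amendment span, so a search for it restricted to amendments does not match.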


Part the Fourth: Hierarchical Navigation


Hierarchies: Requirements
§  Any number of levels
   §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
§  Levels vary across titles
   §  Title 1: 3 levels
   §  Title 26: 8 levels
§  Multiple views:
   §  Children
   §  Ancestors
   §  Ancestor’s Siblings
§  Multiple search scopes:
   §  Only children, all descendants, everything

Hierarchies: Ancestor-Siblings
§  US-Code
   •  Title 1
   •  Title 2
      §  Chapter 1
      §  Chapter 2
         –  Part 1
         –  Part 2
            •  Section 2.1
            •  Section 2.2
         –  Part 3
         –  Part 4
      §  Chapter 3
      §  Chapter 4
   •  Title 3

Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§  parentId – ID of the parent node
   §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. “13/000/0/00882”
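A minimal sketch of how these field values could be assembled for one node is shown below. It assumes that node IDs nest by hyphen-separated prefix, as the example values suggest; the function names are hypothetical and the real indexing code is not shown in the slides.

```python
def ancestor_ids(parent_id):
    """List every ID from the root down to parent_id, assuming each ID
    extends its parent's ID with one more hyphen-separated part."""
    parts = parent_id.split("-")
    return ["-".join(parts[:count]) for count in range(1, len(parts) + 1)]


def encode_ancestors(entries):
    """Join (id, heading, subjectTitle) triples into the display-only format."""
    return "//".join(";".join(entry) for entry in entries)


def hierarchy_fields(parent_id, display_entries):
    return {
        "ancestors": " ".join(ancestor_ids(parent_id)),
        "encodedAncestors": encode_ancestors(display_entries),
        "parentId": parent_id,
    }


example = hierarchy_fields(
    "USC-title2-chapter25-subchapter2",
    [
        ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
        ("USC-title2-chapter25-subchapter2", "Subchapter II",
         "Regulatory Accountability and Reform"),
    ],
)
```

Storing `ancestors` as a whitespace-separated list of IDs lets a single term query on any ancestor ID match the node.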


Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

Example: 01/011/1/02032
§  01 = USC Title
§  011 = Title 11
§  1 = An Appendix
§  02032 = Sequence # in file
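The padding matters because the field is sorted as a string. A short sketch, assuming the widths seen in the example (2, 3, 1, and 5 digits) and using a hypothetical function name:

```python
def treesort_key(doc_type, title, is_appendix, sequence):
    """Build a fixed-width key so that string order equals print order.
    Field widths are taken from the example 01/011/1/02032."""
    return f"{doc_type:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"


key = treesort_key(1, 11, True, 2032)

# Zero padding keeps Title 2 ahead of Title 11 under plain string comparison.
padded_in_order = treesort_key(1, 2, False, 5) < treesort_key(1, 11, False, 1)
unpadded_in_order = "1/2/0/5" < "1/11/0/1"
```

Without padding, "1/2" sorts after "1/11", which would place Title 2 behind Title 11 in the result list.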

Hierarchies: Sample Searches
§  Assuming Node = “USC-title2-chapter25”
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
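The three sample searches can be generated from a node ID with simple string building. The helper names are hypothetical, and the ancestor-siblings query assumes that IDs nest by hyphen-separated prefix, as in the examples above.

```python
def children_query(node_id):
    return f"parentId:{node_id}"


def descendants_query(node_id):
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """OR together the children of the node and of each of its ancestors."""
    parts = node_id.split("-")
    chain = ["-".join(parts[:count]) for count in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{ancestor}" for ancestor in chain) + ")"


node = "USC-title2-chapter25"
queries = [children_query(node), descendants_query(node), ancestor_siblings_query(node)]
```

The ancestor-siblings form returns, in one request, every node needed to draw the expanded navigation tree around the current node.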


Contact
§  Paul Nelson
   •  pnelson@searchtechnologies.com
§  Ronald Matamoros
   •  rmatamoros@searchtechnologies.com
§  Search Technologies
   •  http://searchtechnologies.com
