record-boundary discovery in web documents d.w. embley, y. jiang, y.-k. ng data-extraction group*...

28
Record-Boundary Discovery in Web Documents D.W. Embley, Y. Jiang, Y.-K. Ng Data-Extraction Group* Department of Computer Science Brigham Young University Provo, UT, USA nded in part by Novell, Inc., Ancestry.com, Inc., and Faneuil Resear

Post on 22-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Record-Boundary Discovery in Web Documents

D.W. Embley, Y. Jiang, Y.-K. NgData-Extraction Group*

Department of Computer Science

Brigham Young UniversityProvo, UT, USA

*Funded in part by Novell, Inc., Ancestry.com, Inc., and Faneuil Research.

Record-Boundary DiscoveryLarger Goal: Information Extraction<html><head><title>The Salt Lake Tribune … </title></head><body bgcolor=“#FFFFFF”><h1 align=”left”>Domestic Cars</h1>…<hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr>…</body></html>

#####

#####

#####

Year Make Model PhoneNr

Desired ObjectiveQuery the Web Like a Database

Example: Get the year, make, model, and price for 1987 or later cars that are red or white.

Year Make Model Price-----------------------------------------------------------------------97 CHEVY Cavalier 11,99594 DODGE 4,99594 DODGE Intrepid 10,00091 FORD Taurus 3,50090 FORD Probe88 FORD Escort 1,000

for a page of unstructured records, rich in data and narrow in ontological breadth

Approach and LimitationsAutomatic Ontology-Based

Wrapper Generation

ApplicationOntology

OntologyParser

Constant/KeywordRecognizer

Database-InstanceGenerator

UnstructuredRecords

Constant/KeywordMatching Rules

Data-Record Table

Record-Level Objects,Relationships, and Constraints

DatabaseScheme

PopulatedDatabase

Record Extractor

Web Page

Application Ontology:Object-Relationship Model Instance

Car [-> object];Car [0..1] has Model [1..*];Car [0..1] has Make [1..*];Car [0..1] has Year [1..*];Car [0..1] has Price [1..*];Car [0..1] has Mileage [1..*];PhoneNr [1..*] is for Car [0..1];PhoneNr [0..1] has Extension [1..*];Car [0..*] has Feature [1..*];

Year Price

Make Mileage

Model

Feature

PhoneNr

Extension

Car

hashas

has

has is for

has

has

has

1..*

0..1

1..*

1..* 1..*

1..*

1..*

1..*

0..1 0..10..1

0..1

0..1

0..1

0..*

1..*

Application Ontology: Data Frames

Make matches [10] case insensitive constant { extract “chev”; }, { extract “chevy”; }, { extract “dodge”; }, …end;Model matches [16] case insensitive constant { extract “88”; context “\bolds\S*\s*88\b”; }, …end;Mileage matches [7] case insensitive constant { extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; }, … keyword “\bmiles\b”, “\bmi\b “\bmi.\b”;end;...

Ontology Parser

Make : chevy…KEYWORD(Mileage) : \bmiles\b...

create table Car ( Car integer, Year varchar(2), … );create table CarFeature ( Car integer, Feature varchar(10)); ...

Object: Car;...Car: Year [0..1];Car: Make [0..1];…CarFeature: Car [0..*] has Feature [1..*];

ApplicationOntology

OntologyParser

Constant/KeywordMatching Rules

Record-Level Objects,Relationships, and Constraints

DatabaseScheme

Record Extractor

<html>…<h4> ‘97 CHEVY Cavalier, Red, 5 spd, … </h4><hr><h4> ‘85 DODGE Daytona, needs paint, … </h4><hr>….</html>

…#####‘97 CHEVY Cavalier, Red, 5 spd, …#####‘85 DODGE Daytona, needs paint, …#####...

UnstructuredRecords

Record Extractor

Web Page

Record Extractor:High Fan-Out Heuristic

<html><head><title>The Salt Lake Tribune … </title></head><body bgcolor=“#FFFFFF”><h1 align=”left”>Domestic Cars</h1>…<hr><h4> ‘97 CHEVY Cavalier, Red, … </h4><hr><h4> ‘85 DODGE Daytona, needs … </h4><hr>…</body></html>

html

head

title

body

… hr h4 b hr h4 ...h1

Candidate Separator Tags

Record Extractor:Record-Separator Heuristics

IT: Identifiable “html separator” Tags

HT: Highest-count Tags

SD: Standard Deviation

OM: Ontological Match

RP: Repeating-tag Patterns

IT: Identifiable “html separator” Tags

<h1 align=”left”>Domestic Cars</h1><hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr><h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4><hr>

hr tr td a table p br h4 h1 strong b i

HT: Highest-count Tags

<h1 align=”left”>Domestic Cars</h1><hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr><h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4><hr>

Tag Count----------------- hr 4 h4 3 b 1

SD: Standard Deviation

<h1 align=”left”>Domestic Cars</h1><hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr><h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4><hr>

hr ( = 45.5)-------------------159 characters 63 characters 62 characters

h4 ( = 48.0)--------------------159 characters 63 characters

OM: Ontological Match

<h1 align=”left”>Domestic Cars</h1><hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr><h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4><hr>

Record Estimator: average of count of Year, Make, and Model = 3.

Closest candidate separator count: h4 = 3, hr = 4, b = 1.

RP: Repeating-tag Patterns

<h1 align=”left”>Domestic Cars</h1><hr><h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! <b>Asking only $11,995.</b> #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4><hr><h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4><hr><h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4><hr>

3 pairs: <hr><h4>

Of the tags in the repeating pattern, h4 is closest with 3, then hr with 4.

Record Extractor:Consensus Heuristic

Certainty is a generalization of: C(E1) + C(E2) - C(E1)C(E2).C denotes certainty and Ei is the evidence for an observation.

Our certainties are based on observations from 10 differentsites for 2 different applications (car ads and obituaries)

Correct Tag RankHeuristic 1 2 3 4 IT 96.0% 4.0% 0% 0% HT 49.0% 32.5% 16.5% 2.0% SD 65.5% 22.5% 12.0% 0% OM 84.5% 12.5% 2.0% 1.0% RP 77.5% 12.5% 9.0% 1.0%

Record Extractor:Example Consensus Heuristic

Rank Computed IT HT SD OM RP Certainty Factorhr 1 1 1 2 2 .994h4 2 2 2 1 1 .983b 3 3 - 3 - .182

e.g., b: 0 + .165 + .02 - 0.165 - 0.02 - .165.02 + 0.165.02 = .1817

Correct Tag RankHeuristic 1 2 3 4 IT 96.0% 4.0% 0% 0% HT 49.0% 32.5% 16.5% 2.0% SD 65.5% 22.5% 12.0% 0% OM 84.5% 12.5% 2.0% 1.0% RP 77.5% 12.5% 9.0% 1.0%

Record Extractor: Results

4 different applications (car ads, job ads, obituaries,university courses) with 5 new/different sites for eachapplication

Heuristic Success Rate IT 95% HT 45% SD 65% OM 80% RP 75%Consensus 100%

Constant/Keyword Recognizer

Descriptor/String/Position(start/end)

‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.Previous owner heart broken! Asking only $11,995. #1415JERRY SEINER MIDVALE, 566-3800 or 566-3888

Year|97|2|3Make|CHEV|5|8Make|CHEVY|5|9Model|Cavalier|11|18Feature|Red|21|23Feature|5 spd|26|30Mileage|7,000|38|42KEYWORD(Mileage)|miles|44|48Price|11,995|100|105Mileage|11,995|100|105PhoneNr|566-3800|136|143PhoneNr|566-3888|148|155

Constant/KeywordRecognizer

UnstructuredRecords

Constant/KeywordMatching Rules

Data-Record Table

Database Instance Generator

• Keyword proximity• Subsumed and overlapping

constants• Functional relationships• Nonfunctional relationships• First occurrence without

constraint violation

Heuristics

Descriptor/String/Position(start/end)

Year|97|2|3Make|CHEV|5|8Make|CHEVY|5|9Model|Cavalier|11|18Feature|Red|21|23Feature|5 spd|26|30Mileage|7,000|38|42KEYWORD(Mileage)|miles|44|48Price|11,995|100|105Mileage|11,995|100|105PhoneNr|566-3800|136|143PhoneNr|566-3888|148|155

Database-InstanceGenerator

Data-Record Table

Record-Level Objects,Relationships, and Constraints

=2 {

} =52

Year|97|2|3Make|CHEV|5|8Make|CHEVY|5|9Model|Cavalier|11|18Feature|Red|21|23Feature|5 spd|26|30Mileage|7,000|38|42KEYWORD(Mileage)|miles|44|48Price|11,995|100|105Mileage|11,995|100|105PhoneNr|566-3800|136|143PhoneNr|566-3888|148|155

Database-Instance Generator

insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “556-3800”)insert into CarFeature values(1001, “Red”)insert into CarFeature values(1001, “5 spd”)

Database-InstanceGenerator

Data-Record Table

Record-Level Objects,Relationships, and Constraints

DatabaseScheme

PopulatedDatabase

Recall & Precision

N

CRecall

IC

C

Precision

N = number of facts in sourceC = number of facts declared correctlyI = number of facts declared incorrectly

(of facts available, how many did we find?)

(of facts retrieved, how many were relevant?)

Results: Car Ads

Training set for tuning ontology: 100Test set: 116

Salt Lake Tribune

Recall % Precision %Year 100 100Make 97 100Model 82 100Mileage 90 100Price 100 100PhoneNr 94 100Extension 50 100Feature 91 99

Car Ads: Comments

• Unbounded sets– missed: MERC, Town Car, 98 Royale– could use lexicon of makes and models

• Unspecified variation in lexical patterns– missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)– could adjust lexical patterns

• Misidentification of attributes– classified AUTO in AUTO SALES as automatic transmission– could adjust exceptions in lexical patterns

• Typographical errors– “Chrystler”, “DODG ENeon”, “I-15566-2441”– could look for spelling variations and common typos

Results: Computer Job Ads

Training set for tuning ontology: 50Test set: 50

Los Angeles Times

Recall % Precision %Degree 100 100Skill 74 100Email 91 83Fax 100 100Voice 79 92

Results: Obituaries

Training set for tuning ontology: ~ 24Test set: 90

Arizona Daily Star

Recall % Precision %DeceasedName* 100 100Age 86 98BirthDate 96 96DeathDate 84 99FuneralDate 96 93FuneralAddress 82 82FuneralTime 92 87…Relationship 92 97RelativeName* 95 74

*partial or full name

Cautions

• Ontology Creation and Tuning– Regular expressions (tool for experimentation)– Category specialization and cultural localization

• Record Separation– Web page has multiple records satisfying an ontology– (HTML) record separator exists

• Attribute-Value Pair Generation– Context-sensitive recognizable/categorizable constants– Topic switches within records

Conclusions• Given an ontology and a Web page with multiple records, it is

possible to extract and structure the data automatically.• Record Separation Results: 100%• Recall and Precision Results

– Car Ads: ~ 94% recall and ~ 99% precision– Job Ads: ~ 84% recall and ~ 98% precision– Obituaries: ~ 90% recall and ~ 95% precision (except names: ~ 73% precision)

• Future Work– Find and categorize pages of interest.– Relax restrictions for record separation.– Strengthen heuristics for extraction.– Add richer conversions and additional constraints to data frames.

http://www.deg.byu.edu/