Visually Extracting Data Records from the Deep Web
Neil Anderson and Jun Hong Queen’s University Belfast, UK
Data Record Extraction
Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.
Previous Approaches
• Common theme is to identify repeated patterns
• Source code and regular expressions
  – JavaScript makes this tricky
• Supervised learning with annotated pages
  – Wrapper induction
• Tag tree representation (DOM)
  – Hierarchical representation of the page, designed for the browser, not for humans
  – Doesn’t mirror the displayed structure; modern complex web pages make this difficult
Layout Engine
What now?
Our Visual Approach
• Mimic human intuition
• Make use of the common sources of evidence on displayed pages that humans use, including:
  – Structural regularity
  – Visual and content similarity between data records
Previous Approaches Need to Identify Data Rich Section
Pitfalls: How to Identify the Data Rich Section
DRS does not contain all the records
DRS contains noise as well as records
Our Approach
• We find records, not the Data Rich Section
• Extract data records individually on displayed query result pages, while excluding noise items
• Records in a grid or a column
• Use clustering algorithms and a set of similarity measures to:
  – Identify records
  – Exclude noise
Our Approach
[Diagram: system components as listed on the slide: Web Page Renderer, Visual Block Modeller, Seed Block Selector, Data Record Block Selector, Record Boundary Drawer. Technology labels shown: WebKit, JavaScript, jQuery.]
Green and blue blocks
Selecting Other Candidate Containers
Filter the set of all container blocks on the page (blue blocks): discard blocks that don’t match the width of any candidate container block (orange blocks), then cluster the remaining blocks by width.
Why width? Web pages are designed for vertical, not horizontal, scrolling.
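The filter-then-cluster step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Block` structure, the `cluster_by_width` helper and the `tolerance` parameter are hypothetical names introduced here, and the widths are assumed to be pixel values already obtained from the renderer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered rectangle on the page (hypothetical structure)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def cluster_by_width(containers, candidates, tolerance=0.0):
    """Discard containers whose width matches no candidate block,
    then group the survivors by the candidate width they match.

    `tolerance` (an assumption, not from the slides) allows a small
    pixel difference when comparing widths.
    """
    candidate_widths = sorted({c.width for c in candidates})
    clusters = defaultdict(list)
    for block in containers:
        for width in candidate_widths:
            if abs(block.width - width) <= tolerance:
                clusters[width].append(block)
                break  # each container joins at most one cluster
    return dict(clusters)


# Illustrative data only: one candidate block and four container blocks.
candidates = [Block("seed", 0, 0, 600, 120)]
containers = [
    Block("result-1", 0, 0, 600, 120),
    Block("result-2", 0, 130, 600, 118),
    Block("sidebar", 620, 0, 200, 400),
    Block("header", 0, -80, 820, 60),
]
clusters = cluster_by_width(containers, candidates)
```

In this made-up example only the two 600-pixel blocks survive; the sidebar and header are discarded because no candidate block shares their width.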
Selecting Record Containers
Block content similarity measure
Block A – Candidate record block (orange)
Block B – Container block (blue) with the same width as A
The cluster with the maximum number of similar blocks is the winner!
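The winner selection can be sketched as below. The slides do not define the block content similarity measure, so this sketch substitutes Jaccard similarity over sets of content features as a stand-in. The feature names, the `pick_winning_cluster` helper and the `threshold` value of 0.5 are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature collections (stand-in measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_winning_cluster(candidate_features, clusters, threshold=0.5):
    """Return (key, similar_blocks) for the cluster holding the most
    blocks whose similarity to the candidate reaches `threshold`.

    `clusters` maps a cluster key (for example a width) to a list of
    per-block feature collections. Returns (None, []) if nothing matches.
    """
    best_key, best_similar = None, []
    for key, blocks in clusters.items():
        similar = [b for b in blocks
                   if jaccard(candidate_features, b) >= threshold]
        if len(similar) > len(best_similar):
            best_key, best_similar = key, similar
    return best_key, best_similar


# Illustrative data only: features of Block A and two width clusters.
block_a = {"img", "link", "price", "text"}
width_clusters = {
    600: [{"img", "link", "price", "text"},
          {"img", "link", "text"},
          {"link", "text"}],
    200: [{"link"},
          {"img", "link", "price"}],
}
winner, records = pick_winning_cluster(block_a, width_clusters)
```

Here the 600-pixel cluster wins with three similar blocks against one in the 200-pixel cluster. Blocks that fall below the threshold are treated as noise and left out of the result.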
Visual Block Model
Visual Block Model - Clean
Conclusions: Main Contributions
• Visual approach that directly accesses a rendering engine to get positional and visual features, rather than source code or tag trees
• No need to identify the data rich section
• Use observations on visual and content similarity, and structural regularity, to group data items into records
Future Work
• Use a domain schema from schema.org, or a domain ontology to annotate data records
• Use a domain schema or ontology to annotate query forms too
• Solve label incompleteness and inconsistency issues
• Similarity threshold – set by machine learning