
Visually Extracting Data Records from the Deep Web

Neil Anderson and Jun Hong
Queen’s University Belfast, UK

Data Record Extraction

Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.

Previous Approaches

•  Common theme is to identify repeated patterns
•  Source code and regular expressions
   –  JavaScript makes this tricky
•  Supervised learning with annotated pages
   –  Wrapper induction
•  Tag tree representation (DOM)
   –  Hierarchical representation of the page, designed for the browser, not for humans
   –  Doesn’t mirror the displayed structure; modern complex web pages make this difficult

Layout Engine

What now?

Our Visual Approach

•  Mimic human intuition
•  Make use of the common sources of evidence on displayed pages that humans use, including
   –  Structural regularity
   –  Visual and content similarity between data records

Previous Approaches Need to Identify the Data Rich Section

Pitfalls: How to Identify the Data Rich Section

The Data Rich Section (DRS) does not contain all the records
The DRS contains noise as well as records

Our Approach

•  We find records, not the Data Rich Section
•  Extract data records individually on displayed query result pages, while excluding noise items
•  Records in a grid or a column
•  Use clustering algorithms and a set of similarity measures to:
   –  Identify records
   –  Exclude noise

Our Approach

[Figure: processing pipeline. Components, in order: Web Page Renderer, Visual Block Modeller, Seed Block Selector, Data Record Block Selector, Record Boundary Drawer. Technologies labelled in the diagram: WebKit, JavaScript, jQuery.]

Green and blue blocks


Selecting Other Candidate Containers

Filter the set of all container blocks on the page (blue blocks) and discard blocks that don’t match the width of any candidate container block (orange blocks). Cluster the remaining blocks by width.

Why width? Web pages are designed for vertical, not horizontal, scrolling.
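The width filter and width clustering described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Block` class, the `cluster_by_width` function, the `tolerance` parameter, and the sample coordinates are all hypothetical, and blocks are assumed to carry pixel positions obtained from the rendering engine.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered block with its on-screen box (hypothetical model, pixels)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def cluster_by_width(containers, candidates, tolerance=0):
    """Keep containers whose width matches a candidate width, grouped by that width.

    `tolerance` (an assumption, not from the slides) allows small pixel differences.
    """
    candidate_widths = sorted({c.width for c in candidates})
    clusters = defaultdict(list)
    for block in containers:
        # First candidate width within the tolerance, or None if there is none.
        match = next(
            (w for w in candidate_widths if abs(block.width - w) <= tolerance),
            None,
        )
        if match is not None:
            clusters[match].append(block)  # blocks without a match are discarded
    return dict(clusters)


# Made-up example page: one candidate (orange) block and four container (blue) blocks.
candidates = [Block("seed", 20, 200, 600, 120)]
containers = [
    Block("header", 0, 0, 960, 80),
    Block("result-1", 20, 200, 600, 120),
    Block("result-2", 20, 330, 600, 120),
    Block("sidebar", 640, 200, 300, 500),
]
clusters = cluster_by_width(containers, candidates)
print({width: [b.name for b in blocks] for width, blocks in clusters.items()})
```

In this made-up page the header and sidebar are dropped because their widths match no candidate, leaving a single 600-pixel cluster containing the two result blocks.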

Selecting Record Containers

Block content similarity measure
Block A – Candidate record block (orange)
Block B – Container block (blue) with the same width as A

The cluster with the maximum number of similar blocks is the winner!
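A rough sketch of the "winning cluster" rule follows. The slides do not define the block content similarity measure, so token-set Jaccard similarity is swapped in here purely as an assumed stand-in; the function names, the `threshold` value of 0.3, and the sample text are likewise hypothetical.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity of two strings (assumed stand-in measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pick_record_cluster(candidate_text, clusters, threshold=0.3):
    """Return (key, similar_texts) for the cluster with the most blocks similar to the candidate.

    `clusters` maps a cluster key (for example a width) to the text content of
    each block in that cluster. `threshold` is an illustrative assumption.
    """
    best_key, best_similar = None, []
    for key, texts in clusters.items():
        similar = [t for t in texts if jaccard(candidate_text, t) >= threshold]
        if len(similar) > len(best_similar):
            best_key, best_similar = key, similar
    return best_key, best_similar


# Made-up text content for Block A (candidate) and two width clusters of Block B's.
block_a = "Title: Alpha Author: Smith Price: 10"
width_clusters = {
    600: [
        "Title: Beta Author: Jones Price: 12",
        "Title: Gamma Author: Lee Price: 8",
        "Sponsored links",
    ],
    300: ["Refine your search", "Price: low to high"],
}
winner, records = pick_record_cluster(block_a, width_clusters)
print(winner, len(records))
```

Here the 600-pixel cluster wins with two similar blocks, and the "Sponsored links" block in that cluster is excluded as noise because it shares no tokens with the candidate.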


Visual Block Model

Visual Block Model - Clean

Conclusions: Main Contributions

•  Visual approach that directly accesses a rendering engine to get positional and visual features, rather than relying on source code or tag trees

•  No need to identify the data rich section
•  Use observations on visual and content similarity, and structural regularity, to group data items into records

Future Work

•  Use a domain schema from schema.org, or a domain ontology, to annotate data records

•  Use a domain schema or ontology to annotate query forms too

•  Solve label incompleteness and inconsistency issues

•  Similarity threshold – set by machine learning
