Extracting Schema from Semistructured Data

Nestorov, Abiteboul, and Motwani at Stanford

Perspective

• This paper is new work.

• More than the details, look at the issues:
   – What are their goals?
   – What does this contribute?
   – Do they attain their goals?
   – Why do we need this?

Sample Database

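[Figure: graph of a sample database. One restaurant object has Name “The Keg”, Entree “Steak”, and Manager “Jim”. A second has Name “BurgerKing”, Entree “Fries”, and a Manager edge, apparently to an object with CompanyName “AA+Management” and Phone 543-7798.]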

Schema = Types

Where does semistructured data come from?

• Document collections

• Biological data

• HTML

• Bibtex, etc.

Who needs structure?

• For the user
   – To know what queries are possible
   – Browsing the database
   – Type checking

• Storage
   – Data layout to facilitate querying
      • E.g. place similar objects on the same page
   – Indexes

Who Needs Structure? (2)

• Query optimization
   – All the relational query optimization tricks
      • Maintaining statistics per data type: cardinality, # of pages, index cardinality, etc.
      • Estimating the cost/size of the result of query plans
   – Efficient processing of path expressions

• Other?

Their Goals

Approximate typing (schema extraction) of semistructured data.

Given a database (such as the sample database), an example (little lie) typing program:

Restaurant(X) :- Link(X,A,B,C) & Name-atom(A) & Entrée-atom(B) & Manager-atom(C)

Outline of the Algorithm

1. Find the perfect typing program.
   – This typing might be too large, so we:

2. Coalesce similar types into k types.

3. Assign a type to objects in database.

4. Deduce meaningful names for the types.

Typing

[Figure: a restaurant object (labelled 7) with Name “The Keg”, Entree “Steak”, and Manager “Jim”.]

The two base relations:

- link(FromObj, ToObj, Label)

- atomic(Obj, Value)

These are the only two EDB’s of the typing program.

Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C, Cp)

Typing 2

Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C, Cp)

EDB: link(7, 8, Name), atomic(8, “The Keg”)

IDB (intensional relations): defined by the typing program

Extension of an IDB:

Restaurant(1)
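As a rough illustration of how the extension of an IDB follows from the EDB, the sketch below evaluates the body of the Restaurant(X) rule over a tiny in-memory database. Only link(7, 8, Name) and atomic(8, “The Keg”) come from the slide; the other facts, the object ids 9 and 10, and the helper names are assumptions made for the example.

```python
# EDB facts. Only the Name link and its atom come from the slide;
# ids 9 and 10 and their values are assumed for illustration.
link = {(7, 8, "Name"), (7, 9, "Entree"), (7, 10, "Manager")}
atomic = {(8, "The Keg"), (9, "Steak"), (10, "Jim")}


def has_atomic_link(obj, label):
    """True if obj has an outgoing `label` edge to some atomic object."""
    atomic_ids = {o for (o, _value) in atomic}
    return any(src == obj and lab == label and dst in atomic_ids
               for (src, dst, lab) in link)


def restaurant_extension():
    """Objects X that satisfy every conjunct in the body of Restaurant(X)."""
    sources = {src for (src, _dst, _lab) in link}
    return {x for x in sources
            if all(has_atomic_link(x, lab)
                   for lab in ("Name", "Entree", "Manager"))}


print(restaurant_extension())
```

With these facts the only object that satisfies the rule body is object 7.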

Restriction on Types

Arbitrary type programs are not allowed.

Rules type_i(X) can only be built from the following:

1. link(Y, X, c) & type_j(Y)
2. link(X, Y, c) & type_j(Y)
3. link(X, Y, c) & atomic(Y, Z)

Types can only express local characteristics.

The collection of typed links is a set. (2 entrées = 1 entrée)
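One way to picture this restriction is to collect, for a single object, the set of typed links around it. The sketch below is an assumption-laden illustration: the function name `local_picture`, the tuple encoding, the `type_of` mapping, and the sample facts (including the "Owns" label) are all hypothetical. It shows that two Entree links collapse into one element because the result is a set.

```python
def local_picture(obj, link, atomic_ids, type_of):
    """Set of typed links around obj, using only the three allowed forms."""
    picture = set()
    for (src, dst, label) in link:
        if src == obj and dst in atomic_ids:
            picture.add(("out-atomic", label))           # form 3
        elif src == obj and dst in type_of:
            picture.add(("out", label, type_of[dst]))    # form 2
        if dst == obj and src in type_of:
            picture.add(("in", label, type_of[src]))     # form 1
    return picture


# Hypothetical data: restaurant 1 has two entrées and one incoming link.
link = {(1, 2, "Name"), (1, 3, "Entree"), (1, 4, "Entree"), (5, 1, "Owns")}
atomic_ids = {2, 3, 4}
type_of = {1: "restaurant", 5: "person"}

picture = local_picture(1, link, atomic_ids, type_of)
print(sorted(picture))
```

The two Entree edges contribute a single `("out-atomic", "Entree")` entry, so the picture has three elements rather than four.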

Semantics of Type Program

The greatest fixpoint of a datalog program on a database defines the semantics of the typing.

Fixpoint = extensions of IDB’s + EDB’s

– Least fixpoint
   • start with a model of only EDB’s
   • at each step, union into the model anything new.

Greatest Fixpoint

1. Start with a model of EDB’s and all possible extensions.
2. At each step, remove any extensions not derived by applying the rules.

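Least fixpoint doesn’t work:

person(X) :- link(X, Y, is-manager-of) & firm(Y) &
             link(X, Yp, name) & atomic(Yp, Z)

firm(X) :- link(X, Y, is-managed-by) & person(Y) &
           link(X, Yp, name) & atomic(Yp, Z)

Each rule needs the other type to be non-empty before it can fire, so starting from only the EDB’s neither person nor firm is ever derived.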
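The difference between the two semantics can be sketched by iterating the person/firm rules from two starting points: empty extensions (least fixpoint) and "every object has every type" (greatest fixpoint). The database below, the object ids, and the function names are assumptions for illustration, not data from the paper.

```python
# Hypothetical EDB: object 1 manages object 2, and both have an atomic name.
link = {(1, 2, "is-manager-of"), (2, 1, "is-managed-by"),
        (1, 3, "name"), (2, 4, "name")}
atomic = {(3, "Jim"), (4, "AA+Management")}
objects = {1, 2, 3, 4}
atomic_ids = {o for (o, _value) in atomic}


def has_name(x):
    """True if x has a name link to an atomic object."""
    return any(s == x and lab == "name" and d in atomic_ids
               for (s, d, lab) in link)


def step(ext):
    """Apply both rules once against the current extensions."""
    person = {x for x in objects if has_name(x) and
              any(s == x and lab == "is-manager-of" and d in ext["firm"]
                  for (s, d, lab) in link)}
    firm = {x for x in objects if has_name(x) and
            any(s == x and lab == "is-managed-by" and d in ext["person"]
                for (s, d, lab) in link)}
    return {"person": person, "firm": firm}


def fixpoint(start):
    """Re-apply the rules until the extensions stop changing."""
    ext = start
    while True:
        nxt = step(ext)
        if nxt == ext:
            return ext
        ext = nxt


# Least fixpoint: start empty, so neither rule can ever fire.
least = fixpoint({"person": set(), "firm": set()})
# Greatest fixpoint: start with everything and let the rules prune.
# The rules are monotone, so each step can only shrink the extensions.
greatest = fixpoint({"person": set(objects), "firm": set(objects)})

print(least)
print(greatest)
```

Under these assumptions the least fixpoint leaves both types empty, while the greatest fixpoint keeps object 1 as a person and object 2 as a firm.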

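Imperfect Types

Defect: a measure of how well an object fits a given type.

Defect = excess + deficit

[Figure: type1 is defined by name, entree, and manager links to atomic values (labels name0, entree0, manager0). One object has Name, Entree, and Manager links; another has Name, Entree, and # seats links. Atomic values shown: “McD”, “Steak”, “Jim”, “The Keg”, “biscuit”, 53.]

Defect is 2 for assigning 11 to type1.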

Imperfect Types (2)

[Figure: the same example objects as on the previous slide.]

• Excess: # of EDB’s not used to validate any object’s type.

• Deficit: Minimum # of ground facts that need to be added to make all type derivations possible.
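A simplified, per-object reading of these definitions can be sketched as follows. The slide defines excess and deficit over ground facts in the whole database; here, as an assumption, an object and a type are each reduced to a set of link labels, and the names `defect`, `type1`, and `obj` are hypothetical.

```python
def defect(object_links, type_links):
    """Return (defect, excess, deficit) for one object against one type."""
    excess = object_links - type_links    # links the type does not account for
    deficit = type_links - object_links   # links that would have to be added
    return len(excess) + len(deficit), excess, deficit


# A type with three atomic links, and an object that has "# seats"
# where the type expects "Manager" (illustrative data).
type1 = {"Name", "Entree", "Manager"}
obj = {"Name", "Entree", "# seats"}

total, excess, deficit = defect(obj, type1)
print(total, excess, deficit)
```

For this object the excess is the "# seats" link and the deficit is the missing "Manager" link, giving a defect of 2.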

Perfect Typing Program (Stage 1)

Multiple Roles

[Figure: three objects O1, O2, O3 with Name, Country, Team, and Movie links. Values shown: Scholes, England, Man Utd, Cantona, Star Trek, France, Binoche, RockyHorror.]

How hard is it to choose the types for the cover? How do you quantify atomization?

Clustering (Stage 2)

Define a distance function between two types:

First approximation is the difference between the bodies of their rule definitions.

t1 :- a0, b2
t2 :- a0, b1
t3 :- b2, b1, b3

d(t1, t2) = 2
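A minimal sketch of this first approximation, assuming "difference" means the number of typed links that appear in exactly one of the two rule bodies (the symmetric difference). The function name `distance` is hypothetical; the rule bodies are the ones from the slide.

```python
def distance(body1, body2):
    """Number of typed links that occur in exactly one of the two bodies."""
    return len(set(body1) ^ set(body2))


t1 = {"a0", "b2"}
t2 = {"a0", "b1"}
t3 = {"b2", "b1", "b3"}

print(distance(t1, t2))  # 2, matching d(t1, t2) = 2 on the slide
print(distance(t1, t3))  # 3
print(distance(t2, t3))  # 3
```

Under this reading t1 and t2 differ only in b2 versus b1, which reproduces the slide's value of 2.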

A Better Function

Include some measure of the weight of a type (# of objects of that type):

t2 ~> t1

Some desirable properties:

• increasing in d = coalesce similar types
• decreasing in w1 = compensate for ‘expected noise’
• increasing in w2 = maintain types with large extents

Choosing what to coalesce is hard!

[Formula residue: a function of the weights (w1, w2); the rest of the expression is not recoverable.]
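To make the three desirable properties concrete, the sketch below uses a stand-in cost for coalescing t2 into t1, `d * w2 / w1`, and greedily merges the cheapest pair until k types remain. This cost is purely an assumption chosen because it is increasing in d, decreasing in w1, and increasing in w2; it is not the function from the paper, and the weights and the greedy strategy are also hypothetical.

```python
from itertools import permutations


def distance(body1, body2):
    """Assumed distance: typed links in exactly one of the two bodies."""
    return len(body1 ^ body2)


def cost(d, w1, w2):
    """Hypothetical cost of coalescing t2 (weight w2) into t1 (weight w1)."""
    return d * w2 / w1


def coalesce_to_k(bodies, weights, k):
    """Greedily merge the cheapest (target, source) pair until k types remain."""
    assert k >= 1
    bodies, weights = dict(bodies), dict(weights)
    while len(bodies) > k:
        target, source = min(
            permutations(sorted(bodies), 2),
            key=lambda p: cost(distance(bodies[p[0]], bodies[p[1]]),
                               weights[p[0]], weights[p[1]]))
        weights[target] += weights[source]   # the target absorbs the source's objects
        del bodies[source], weights[source]
    return bodies, weights


bodies = {"t1": {"a0", "b2"}, "t2": {"a0", "b1"}, "t3": {"b2", "b1", "b3"}}
weights = {"t1": 50, "t2": 3, "t3": 40}   # illustrative object counts

merged_bodies, merged_weights = coalesce_to_k(bodies, weights, k=2)
print(sorted(merged_bodies), merged_weights)
```

With these illustrative weights the small type t2 is folded into the nearby, heavily populated t1, while t3 survives because of its large extent.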

Recasting (Stage 3)

Assign each object to types within the k types formed from stage 2.

(optional) choose a better value of k and rerun step 2.
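One simple way to read this stage is to give each object the type it fits with the fewest mismatched links. The sketch below assumes that reading and assigns exactly one type per object, whereas the slide allows several; the names `recast` and `mismatch`, the type definitions, and the object ids are hypothetical.

```python
def mismatch(object_links, type_links):
    """Assumed fit measure: links present on exactly one side."""
    return len(object_links ^ type_links)


def recast(objects, types):
    """Map each object id to the type with the smallest mismatch."""
    assignment = {}
    for obj, links in objects.items():
        # sorted() makes tie-breaking deterministic (alphabetical by type name).
        assignment[obj] = min(sorted(types),
                              key=lambda t: mismatch(links, types[t]))
    return assignment


# Illustrative types (k = 2) and objects.
types = {"restaurant": {"Name", "Entree", "Manager"},
         "company": {"CompanyName", "Phone"}}
objects = {11: {"Name", "Entree", "# seats"},
           12: {"CompanyName", "Phone"}}

assignment = recast(objects, types)
print(assignment)
```

Object 11 is an imperfect restaurant (it has "# seats" instead of "Manager") but still lands in the restaurant type, since that is its closest match.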

Results

• Heavy use of synthetic data.
   – Create a type definition and generate instances that are perturbed randomly in some way.

• What do the graphs show?
   – Are the data sets realistic?

Conclusions

• Paper problems:
   – The algorithm isn’t completely explained.
   – Many comments are not elaborated.

• But it’s an important problem and a good first approach.
