lecture #6 xml november 2 nd, 2000. administration thanks for the mid-term comments comment on the...

72
Lecture #6 XML November 2 nd , 2000

Post on 21-Dec-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Lecture #6 XML

November 2nd, 2000

Administration

• Thanks for the mid-term comments

• Comment on the book & readings

• Project #2

• Project #1

• Homework #4

• Homework #5

Outline

• XML:– Introduction– Syntax, DTDs, – Politics, – Exporting relational data to XML.

• Querying XML:– Xpath, XML-QL, Quilt, XSLT

• XML data management:– Storing XML in relational databases.

What is XML ?From HTML to XML

HTML describes the presentation: easy for humans

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

HTML is hard for applications

XML<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

</bibliography>XML describes the content: easy for applications

XML

• eXtensible Markup Language

• Roots: comes from SGML (very nasty language).

• After the roots: a format for sharing data

• Emerging format for data exchange on the Web and between applications

XML Applications

• Sharing data between different components of an application.

• Format for storing all data in Office 2000.• EDI: electronic data exchange:

– Transactions between banks

– Producers and suppliers sharing product data (auctions)

– Extranets: building relationships between companies

– Scientists sharing data about experiments.

XML Syntax

• Very simple:<db> <book> <title>Complete Guide to DB2</title> <author>Chamberlin</author> </book> <book> <title>Transaction Processing</title> <author>Bernstein</author> <author>Newcomer</author> </book> <publisher> <name>Morgan Kaufman</name> <state>CA</state> </publisher></db>

XML Terminology

• tags: book, title, author, …

• start tag: <book>, end tag: </book>

• start tags must correspond to end tags, and conversely

XML Terminology• an element: everything between tags

– example element:

<title>Complete Guide to DB2</title>– example element:

<book> <title> Complete Guide to DB2 </title> <author>Chamberlin</author> </book>• elements may be nested• empty element: <red></red> abbreviated <red/>• an XML document has a unique root element

well formed XML document: if it has matching tags

The XML Tree

db

book book publisher

title author title author author name state“CompleteGuideto DB2”

“Chamberlin” “TransactionProcessing”

“Bernstein” “Newcomer”“MorganKaufman”

“CA”

Tags on nodesData values on leaves

More XML Syntax: Attributes

<book price = “55” currency = “USD”>

<title> Complete Guide to DB2 </title>

<author> Chamberlin </author>

<year> 1998 </year>

</book>

price, currency are called attributes

Replacing Attributes with Elements

<book>

<title> Complete Guide to DB2 </title>

<author> Chamberlin </author>

<year> 1998 </year>

<price> 55 </price>

<currency> USD </currency>

</book>

attributes are alternative ways (worse ) to represent data

XML References

<db> <book ID="b1" pub="mkp"> <title>Complete Guide to DB2</title> <author>Chamberlin</author> </book> <book ID="b2" pub="mkp"> <title>Transaction Processing</title> <author>Bernstein</author> <author>Newcomer</author> </book> <publisher ID="mkp"> <name>Morgan Kaufman</name> <state>CA</state> </publisher></db>

ID’s and IDREFs are used to reference objects.

“Types” (or “Schemas”) for XML

• Document Type Definition – DTD

• Define a grammar for the XML document, but we use it as substitute for types/schemas

• Will be replaced by XML-Schema (will extend DTDs)

An Example DTD<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (name, state)> <!ELEMENT name (#PCDATA)> <!ELEMENT state (#PCDATA)> <!ATTLIST book pub IDREF #IMPLIED>]>

• PCDATA means Parsed Character Data (a mouthful for string)

DTDs as Grammarsdb ::= (book|publisher)*book ::= (title,author*,year?)title ::= stringauthor ::= stringyear ::= stringpublisher ::= (name, state)name ::= stringstate ::= string

• A DTD is a EBNF (Extended BNF) grammar• An XML tree is precisely a derivation tree

XML Documents that have a DTD and conform to it are called valid

More on DTDs as Grammars

<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>

<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>

<paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section></paper>

XML documents can be nested arbitrarily deep

XML for Representing Data

<persons>

<row> <name>John</name>

<phone> 3634</phone></row>

<row> <name>Sue</name>

<phone> 6343</phone>

<row> <name>Dick</name>

<phone> 6363</phone></row>

</persons>

n a m e p h o n e

J o h n 3 6 3 4

S u e 6 3 4 3

D i c k 6 3 6 3

row row row

name name namephone phone phone

“John” 3634 “Sue” “Dick”6343 6363

persons XML: persons

XML vs Data Models

• XML is self-describing

• Schema elements become part of the data– Reational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part

of the data, and are repeated many times

• Consequence: XML is much more flexible

• XML = semi-structured data

Semi-structured Data Explained

• Missing attributes:– <person> <name> John</name>

<phone>1234</phone>

</person>

– <person><name>Joe</name></person> no phone !

• Repeated attributes– <person> <name> Mary</name>

<phone>2345</phone>

<phone>3456</phone>

</person>

S-S Data Further Explained

• Attributes with different types in different objects– <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person>

• Nested collections• Heterogeneous collections:

– <db> contains both <book>s and <publisher>s

XML Data v.s. E/R, ODL, Relational

• Q: is XML better or worse ?• A: serves different purposes

– E/R, ODL, Relational models:• For centralized processing, when we control the data

– XML:• Data sharing between different systems• we do not have control over the entire data• E.g. on the Web

• Do NOT use XML to model your data ! Use E/R, ODL, or relational instead.

Why Do People Like XML so much?

• It’s easy to learn.• It’s human readable. No need for proprietary formats

anymore.• Lots of tools out there for basis manipulations, editing,

validation, simple querying.• It’s very flexible:

– Data is self-describing– Can add attributes easily– Data can be irregular

• Note: (common fallacy) without common DTD’s data sharing is not solved!

Why are we DB’ers interested?

• It’s data, stupid. That’s us.• Proof by Altavista:

– database+XML -- 40,000 pages.

• Database issues:– How are we going to model XML? (trees).

– How are we going to query XML? (XML-QL,Quilt)

– How are we going to store XML (in a relational database? object-oriented?)

– How are we going to process XML efficiently? (uh… well..., um..., ah..., get some good grad students!)

Exporting Relational Data to XML

Data Sharing with XML: Easy

Data source(e.g. relational

Database)

ApplicationWeb

XML

Exporting Relational Data to XML

• Product(pid, name, weight)

• Company(cid, name, address)

• Makes(pid, cid, price)

product companymakes

Export data grouped by companies

<db><company> <name> GizmoWorks </name> <address> Tacoma </address> <product> <name> gizmo </name> <price> 19.99 </price> </product> <product> …</product> …</company><company> <name> Bang </name> <address> Kirkland </address> <product> <name> gizmo </name> <price> 22.99 </price> </product> …</company>…

</db>

Redundant representation of products

The DTD

<!ELEMENT db (company*)>

<!ELEMENT company (name, address, product*)>

<!ELEMENT product (name,price)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT price (#PCDATA)>

Export Data by Products<db> <product> <name> Gizmo </name> <manufacturer> <name> GizmoWorks </name> <price> 19.99 </price> <address> Tacoma </address> </manufacturer> <manufacturer> <name> Bang </name> <price> 22.99 </price> <address> Kirkland </address> </manufacturer> … </product> <product> <name> OneClick </name> …</db>

Redundant representation of companies

Which One Do We Choose ?

• The structure of the XML data is determined by agreement, with our partners, or dictated by committees– Many XML dialects (called applications)

• XML Data is often nested, irregular, etc

• No normal forms for XML

Querying XML

• At first, there was XQL (XPath): weak, impoverished.

• Then, Dbers noticed that XML is a variant on semi-structured data: invented XML-QL.

• A lot of people got excited – formed a committee.

• Committee is still working:– Inside scoop: Quilt – the best of XML-QL,

SQL and Xpath.

Running Example<bib>

<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

XPath Introduction

• Syntax for XML document navigation and node selection

• A recommendation of the W3C (i.e., a standard)

• Building block for other W3C standards:– XSL Transformations (XSLT) – XML Link (XLink)– XML Pointer (XPointer)

XPath Traversal

/bib/book/year

Result: <year> 1995 </year>

<year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)

XPath Unbounded Traversal

//author

Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>

/bib//first-nameResult: <first-name> Rick </first-name>

XPath Text Selection

/bib/book/author/text()

Result: Serge Abiteboul

Jeffrey D. Ullman

Rick Hull doesn’t appear because he has firstname, lastname

XPath Wild Star

//author/*

Result: <first-name> Rick </first-name>

<last-name> Hull </last-name>

* Matches any element

XPath Attribute Navigation

/bib/book/@price

Result: “55”

@price means that price is has to be an attribute

XPath Existential Constraint

/bib/book/author[firstname]

Result: <author> <first-name> Rick </first-name>

<last-name> Hull </last-name>

</author>

XPath Existentials

/bib/book[@price < “60”]

/bib/book[author/@age < “25”]

/bib/book[author/text()]

XPath Expressions Summarybib matches a bib element

* matches any element

/ matches the root element

/bib matches a bib element under root

bib/paper matches a paper in bib

bib//paper matches a paper in bib, at any depth

//paper matches a paper at any depth

paper|book matches a paper or a book

@price matches a price attribute

bib/book/@price matches price attribute in book, in bib

bib/book/[@price<“55”]/author/lastname matches…

XML-QL Hello World

• Matching data using elements patterns.WHERE <book>

<publisher><name>Addison-Wesley</></>

<title> $t </>

<author> $a </>

</book> IN “www.a.b.c/bib.xml”

CONSTRUCT $a

XML-QL Element-As

• Matching data using elements patterns.WHERE <book>

<publisher><name>Addison-Wesley</></>

<title> $t </>

<author> $a </>

</book> ELEMENT-AS $e

IN “www.a.b.c/bib.xml”

CONSTRUCT $e

Constructing XML Data

WHERE <book>

<publisher><name>Addison-Wesley</></>

<title> $t </>

<author> $a </>

</> IN “www.a.b.c/bib.xml

CONSTRUCT <result>

<author> $a </>

<title> $t</>

</>

Grouping with Nested Queries

WHERE <book><title> $t </>,<publisher><name>Addison-Wesley</></>

</> CONTENT_AS $p IN “www.a.b.c/bib.xml” CONSTRUCT <result>

<titre> $t </>WHERE <author> $a </> IN $pCONSTRUCT <auteur> $a</>

</>

Joining Elements by Value

WHERE <article> <author>

<firstname> $f </> <lastname> $l </>

</> </> ELEMENT_AS $e IN “www.a.b.c/bib.xml”

<book year=$y> <author>

<firstname> $f </> <lastname> $l </>

</> </> IN “www.a.b.c/bib.xml” , y > 1995

CONSTRUCT $e Find all articles whose writers also published a book

after 1995.

Tag Variables

WHERE <article> <author>

<firstname> $f </> <lastname> $l </>

</> </> ELEMENT_AS $e IN “www.a.b.c/bib.xml”

<$t year=$y> <author>

<firstname> $f </> <lastname> $l </>

</> </> IN “www.a.b.c/bib.xml” , y > 1995

CONSTRUCT $e Find all articles whose writers have done something

after 1995.

Regular Path Expressions

WHERE

<part*>

<name>$r</>

<brand>Ford</> </>

IN "www.a.b.c/bib.xml"

CONSTRUCT

<result>$r</>Find all parts whose brand is Ford, no matter what level

they are in the hierarchy.

Regular Path Expressions

WHERE

<part+.(subpart|component.piece)>$r</>

IN "www.a.b.c/parts.xml"

CONSTRUCT

<result> $r </>

XML Data Integration

WHERE <person>

<name></> ELEMENT_AS $n

<ssn> $ssn </>

</> IN “www.a.b.c/data.xml”

<taxpayer>

<ssn> $ssn </>

<income></> ELEMENT_AS $I

</> IN “www.irs.gov/taxpayers.xml”

CONSTRUCT <result> $n $I </>

Query can access more than one XML document.

Quilt: Hello World

List all titles of books published by Morgan Kaufmann in 1998:

FOR $b IN document(“bib.xml”)/book

WHERE $b/publisher = “Morgan Kaufmann”

AND $b/year = “1998”

RETURN $b/title

Quilt: Creating XML Output

• Find all names with a firstname and lastname; group them in a <name>

FOR $a IN document(“bib.xml”)//author, $f IN $a/firstName, $l IN $a/lastNameRETURN <name> <fn> $f </fn> <ln> $l </ln> </name>

Quilt: Joins• Retrieve the titles of the books written by Laing before

1967, together with their reviews.

FOR $b in document(“bib.xml”)//book[@year<1967],

$r in document(“reviews.xml”)//review

WHERE $b/authors/lastname=“Laing” and $b/@ISBN=$r/@ISBN

RETURN

<resultBook ISBN=$b/@ISBN>

<title> $b/title/text() </title>,

$r

</resultBook>

Quilt: FLWR Expressions• Retrieve the titles of the books written by Laing before 1967 together with

their reviews.

FOR $b in document(“input.xml”)//book[@year<1967]

LET $R = document(“input.xml”)//review[@isbn=$b/@isbn]

WHERE $b/authors/lastname=“Laing”

RETURN

<resultBook ISBN=$b/@ISBN>

<resultTitle> $t </resultTitle>

<bookReviews> $R </bookReviews>

</resultBook>

Quilt: another example

• List all authors that published both in 1998 and 1999

FOR $a IN distinct(document(“bib.xml”)/book/author,

WHERE contains(document(“bib.xml”)/book[year=1998]/author, $a)

AND contains(document(“bib.xml”)/book[year=1999]/author, $a)

RETURN $a

XSL

• A.k.a XSLT

• A recommendation of the W3C (standard)

• Initial goal: translate XML to HTML

• Became: translate XML to XML– HTML is just a special case

• Interesting politics

Query Processing For XML

• Approach 1: store XML in a relational database. Translate an XML-QL/Quilt query into a set of SQL queries.– Leverage 20 years of research & development.

• Approach 2: store XML in an object-oriented database system.– OO model is closest to XML, but systems do not

perform well and are not well accepted.

• Approach 3: build a native XML query processing engine. – Still in the research phase; see Zack next week.

Relational Approach

• Step 1: given a DTD, create a relational schema.

• Step 2: map the XML document into tuples in the relational database.

• Step 3: given a query Q in XML-QL/Quilt, translate it to a set of queries P over the relational database.

• Step 4: translate the tuples returned from the relational database into XML elements.

Which Relational Schema?• The key question! Affects performance.

• No magic solution.

• Some options:– The EDGE table: put everything in one table– The Attribute tables: create a table for every tag

name.– The inlining method: inline as much data into

the tables.

An Example DTD<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (name, state)> <!ELEMENT name (#PCDATA)> <!ELEMENT state (#PCDATA)> <!ATTLIST book pub IDREF #IMPLIED>]>

• PCDATA means Parsed Character Data (a mouthful for string)

The Edge Approach

sourceID tag destID destValue

- Don’t need a DTD.- Very simple to implement.

The Attribute Approach

rootID bookId

bookID title

rootID pubID

pubID pubName

bookID author

Book

Title

Author

Publisher

pubID state

PubName

PubState

The In-lining Approach

bookID title pubName pubState

bookID authorBookAuthor

Book

sourceID tag destID destValuePublisher

Let the Querying Begin!

• Matching data using elements patterns.WHERE <book>

<author> Bernstein </author>

<title> $t </title>

</book> IN “www.a.b.c/bib.xml”

CONSTRUCT

<bernsteinBook> $t </bernsteinBook>

The Edge Approach

SELECT e3.destValueFROM E as e1, E as e2, E as e3WHERE e1.tag = “book” and e1.destID=e2.sourceID and e2.tag=“title” and e1.destID=e3.sourceID and e3.tag=“author” and e2.author=“Bernstein”

The Attribute Approach

SELECT Title.titleFROM Book, Title, Author WHERE Book.bookID = Author.bookID and Book.bookID = Title.bookID and Author.author = “Bernstein”

The In-lining Approach

SELECT Book.titleFROM Book, BookAuthor WHERE Book.bookID =BookAuthor.bookID and BookAuthor.author = “Bernstein”

Reconstructing XML Elements

• Matching data using elements patterns.WHERE <book>

<author> Bernstein </author>

<title> $t </title>

</book> ELEMENT-AS $e

IN “www.a.b.c/bib.xml”

CONSTRUCT

$e

Open Questions

• Native query processing for XML• To order or not to order?• Combining IR-style keyword queries with

DB-style structured queries• Updates• Automatic selection of a relational schema• How should we extend relational engines to

better support XML storage and querying?