
Organizing and Searching Information with XML

Selectivity Estimation for XML Queries

Thomas Beer, Christian Linz, Mostafa Khabouze

Outline

• Definition: Selectivity Estimation
• Motivation
• Algorithms for Selectivity Estimation
  o Path Trees
  o Markov Tables
  o XPathLearner
  o XSketches

• Summary

Selectivity: Definition

The selectivity σ(p) of a path expression p is defined as the number of paths in the XML data tree that match the tag sequence in p.

A

B C

E DD

Example: σ(A/B/D) = 2

Motivation

• Estimating the size of query results and intermediate results is necessary for effective query optimization
• Knowing the selectivities of sub-queries helps to identify cheap query evaluation plans
• Internet context: quick feedback about the expected result size before evaluating the full query result

Example

XQuery-Expression:

FOR $f IN document("personnel.xml")//department/faculty
WHERE count($f/TA) > 0 AND count($f/RA) > 0
RETURN $f

This expression matches all faculty members that have at least one TA and one RA.

• One join is computed for every edge of the query

Presumption

• The number of nodes is known
• Join algorithm: nested loop

Query tree: Department → Faculty → {RA, TA}

Node        Count
Department  1
Faculty     3
RA          7
TA          2

[Figure: example document tree for personnel.xml with one Department node, three Faculty nodes, a Secretary, a Scientist, Name nodes, seven RA nodes and two TA nodes]

Evaluating the join

Method 1
Join 1: (Faculty) – TA
Join 2: (Result of Join 1) – RA
Join 3: (Result of Join 2) – Dep.

Number of operations:
Join 1: 3 * 2 = 6
Join 2: 1 * 7 = 7
Join 3: 1 * 1 = 1
Total = 14

Method 2
Join 1: (Faculty) – Dep.
Join 2: (Result of Join 1) – RA
Join 3: (Result of Join 2) – TA

Number of operations:
Join 1: 3 * 1 = 3
Join 2: 3 * 7 = 21
Join 3: 3 * 2 = 6
Total = 30
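The two totals can be re-derived with a few lines of Python. This is only a sketch of the slide's cost model (a nested-loop join costs left size times right size); the function name and the list layout are illustrative assumptions, the input sizes are the ones used on the slide.

```python
def nested_loop_cost(plan):
    """Total cost of a plan: sum of |left input| * |right input| per join."""
    return sum(left * right for left, right in plan)

# (size of left input, size of right input) for each join, as on the slide.
method_1 = [(3, 2), (1, 7), (1, 1)]  # Faculty-TA, then RA, then Dep.
method_2 = [(3, 1), (3, 7), (3, 2)]  # Faculty-Dep., then RA, then TA

print(nested_loop_cost(method_1))  # 14
print(nested_loop_cost(method_2))  # 30
```

Method 1 is cheaper because its first join is the most selective one: only one faculty member survives it, so the later joins work on a small intermediate result.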


Representing XML data structure

Two summary structures: Path Trees and Markov Tables

Path Trees

Example document:

<A>
  <B></B>
  <B> <D></D> </B>
  <C> <D></D> <E></E> <E></E> <E></E> </C>
</A>

Path tree (node: frequency):

A: 1
  B: 2
    D: 1
  C: 1
    D: 1
    E: 3
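A path tree for a small document can be computed by counting every root-to-node tag path. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_path_tree` and the representation as a counter keyed by tag tuples are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from the slide.
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"

def build_path_tree(xml_text):
    """Map each root-to-node tag path to the number of elements it reaches."""
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts

tree = build_path_tree(DOC)
for path, freq in sorted(tree.items()):
    print("/".join(path), freq)
```

Each key of the counter corresponds to one node of the path tree, and its value is the frequency stored at that node.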

Problem: The Path Tree may become larger than the available memory

The tree has to be summarized

Summarizing a Path Tree

Four different algorithms:
• Sibling-*
• Level-*
• Global-*
• No-*

Delete the nodes with the lowest frequencies and replace them with a "*" (star node) to preserve some structural information.

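Sibling-*

Operation breakdown:
• Mark the nodes with the lowest frequencies for deletion
• Check the siblings of each marked node; marked siblings are coalesced into one *-node
• Each *-node records the number of coalesced nodes n and their total frequency f (e.g. n=2, f=6)
• Finally, traverse the tree and compute the average frequency of each *-node (here 6 / 2 = 3)

[Figure: example path tree before and after Sibling-* summarization]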

Level-*

[Figure: example path tree before and after Level-* summarization]

• As before, delete the nodes with the lowest frequency

• One *-node for every level

Global-*

• Delete the nodes with the lowest frequency
• One *-node for the complete tree

[Figure: example path tree before and after Global-* summarization]

No-*

• Low-frequency nodes are deleted and not replaced
• The tree may become a forest with many roots

No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree.

Selectivity Estimation

[Figure: summarized path tree with *-nodes]

• Find all nodes whose tags match the path expression (a *-node matches any tag)
• Estimated selectivity = total frequency of these nodes

Examples:
σ(A/B/F) = 15 + 6 = 21
σ(A/B/Z) = 6
σ(A/C/Z/K) = 11
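The matching rule can be sketched as a short recursive function. The tree below is an assumed reconstruction: only the frequencies 15, 6 and 11 are taken from the three examples, while the exact shape and the remaining frequencies are hypothetical. The names `TREE` and `estimate` are illustrative.

```python
# Each node is (tag, frequency, children). Assumed reconstruction: only the
# frequencies 15, 6 and 11 come from the examples; the rest is hypothetical.
TREE = ("A", 1, [
    ("B", 9, [("F", 15, []), ("*", 6, [])]),
    ("C", 8, [("*", 3, [("K", 11, [])])]),
])

def estimate(node, path):
    """Sum the frequencies of all nodes that match the last tag of the path."""
    tag, freq, children = node
    if tag != "*" and tag != path[0]:
        return 0          # this branch cannot match
    if len(path) == 1:
        return freq       # end of the path: count this node
    return sum(estimate(child, path[1:]) for child in children)

print(estimate(TREE, ["A", "B", "F"]))       # 21
print(estimate(TREE, ["A", "B", "Z"]))       # 6
print(estimate(TREE, ["A", "C", "Z", "K"]))  # 11
```

A *-node is treated as a wildcard, so A/B/F collects both the exact node F and the star sibling, which reproduces the 15 + 6 of the first example.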


What are Markov Tables ?

• A table containing all distinct paths in the data of length up to m, together with their selectivity
• m ≥ 2
• Order: m - 1
• Markov Table = Markov Histogram

[Figure: example data tree over the tags A, B, C and D]

Path  Sel.    Path  Sel.
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8
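For m = 2, such a table can be collected in one pass over a document by counting every tag and every parent/child tag pair. This is a sketch under that assumption; the function name `markov_table` is illustrative, and the small document is the one from the path tree slide, not the data behind the table above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Small example document (taken from the path tree slide).
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"

def markov_table(xml_text):
    """Count all tag paths of length 1 and 2 (Markov table with m = 2)."""
    table = Counter()

    def visit(elem):
        table[(elem.tag,)] += 1
        for child in elem:
            table[(elem.tag, child.tag)] += 1
            visit(child)

    visit(ET.fromstring(xml_text))
    return table

table = markov_table(DOC)
for path, sel in sorted(table.items()):
    print("".join(path), sel)
```

Unlike the path tree, the table does not distinguish where a path starts: both D elements are counted under the single entry D, regardless of their parents.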

Selectivity Estimation

• The table provides selectivity estimates for all paths of length up to m
• Assumption: the occurrence of a particular tag in a path depends only on the m-1 tags occurring before it
• The selectivity of longer path expressions is estimated with the following formula

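Selectivity Estimation

$$\sigma(t_1, t_2, \ldots, t_n) \approx N \cdot P[t_n] \cdot \prod_{i=1}^{n-1} P[t_i \mid t_{i+1}]$$

• P[tn]: probability of tag tn occurring in the XML data tree
• N: total number of nodes in the XML data tree
• P[ti|ti+1]: probability of tag ti occurring before tag ti+1
• E (the factor N · P[tn]): predictand for the occurrence of tag tn
• E1 (the factors P[ti|ti+1]): predictand for the occurrence of tag ti before tag ti+1

Markov chain: t1 → t2 → t3 → …

With f(p) = selectivity of path p:

$$P[t_i] = \frac{f(t_i)}{N}, \qquad P[t_i \mid t_{i+1}] = \frac{f(t_i, t_{i+1})}{f(t_{i+1})}$$

$$f(t_1, t_2, \ldots, t_n) = \left( \prod_{i=1}^{n-2} \frac{f(t_i, t_{i+1})}{f(t_{i+1})} \right) \cdot f(t_{n-1}, t_n)$$

Example:

$$f(B, C, D) = \frac{f(B, C)}{f(C)} \cdot f(C, D) = \frac{9}{15} \cdot 8 = 4.8$$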
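The estimate for a longer path can be computed directly from the table entries. The sketch below uses the example Markov table (A 1, B 11, C 15, D 19, AB 11, AC 6, AD 4, BC 9, BD 7, CD 8); the names `TABLE` and `estimate` are illustrative, and treating a missing entry as zero is an assumption that corresponds to the No-* variant.

```python
# Example Markov table (m = 2): single tags and tag pairs with selectivities.
TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}

def estimate(path, table=TABLE):
    """f(t1..tn) = prod_{i=1}^{n-2} f(ti,ti+1)/f(ti+1) * f(tn-1,tn)."""
    path = tuple(path)
    if len(path) <= 2:
        return float(table.get(path, 0))       # stored exactly
    result = float(table.get(path[-2:], 0))    # f(t_{n-1}, t_n)
    for i in range(len(path) - 2):
        denom = table.get((path[i + 1],), 0)
        if denom == 0:
            return 0.0                         # unknown tag: assume zero
        result *= table.get((path[i], path[i + 1]), 0) / denom
    return result

print(round(estimate(["B", "C", "D"]), 2))  # 4.8
```

The estimate need not be an integer: it is the expected number of matches under the Markov assumption, not a count of actual paths.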

Summarizing Markov Tables

The paths with the lowest selectivity are deleted and, except in No-*, replaced by *-paths.

Three algorithms:

• Suffix-*

• Global-*

• No-*

Suffix-*

• * path: represents all deleted paths of length 1
• */* path: represents all deleted paths of length 2
• Deleting a path of length 1: add it to the * path
• SD: set of deleted paths of length 2
• Deleting a path of length 2: add it to SD and look for paths in SD with the same start tag; these are merged into a suffix-* path
• Before checking SD, check whether the Markov table already contains a matching suffix-* path

Example: SD = {(A/C), (G/H)}; deleting (A/B) ⇒ (A/C) and (A/B) are merged into the suffix-* path (A/*)

Global-*

• * path: represents all deleted paths of length 1
• */* path: represents all deleted paths of length 2
• Deleting a path of length 1: add it to the * path
• Deleting a path of length 2: immediately add it to the */* path

No-*

• Does not use *-paths
• Low-frequency paths are simply discarded

If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero.

Which method should be used?

Path Trees vs. Markov Tables
• Data has a common structure: Markov Tables
• Data has no common structure: Path Trees

"*" vs. "No-*"
• Queried paths exist in the XML data: *-algorithms
• Queried paths do not exist: No-* algorithm


Weaknesses of previous methods

• Off-line: they require a scan of the entire data set

• Limited to simple path expressions

• Oblivious to the workload distribution

• Updates are too expensive

XPathLearner is...

• An on-line self-tuning Markov histogram for XML path selectivity estimation

• on-line: collects statistics from query feedback

• self-tuning: learns the Markov model from feedback, adapts to changing XML data

• workload-aware

• supports simple, single-value and multi-value path expressions

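Workflow

[Figure: the Histogram Learner builds the Histogram from training data (initial training); the Selectivity Estimator uses the histogram to return estimated selectivities; feedback (the real selectivity) and the observed estimation error flow back to the Histogram Learner, which updates the histogram]

The system uses feedback to update the statistics for the queried path. Updates are based on the observed estimation error.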

Basics

• Relies on path trees as intermediate representation
• Uses a Markov histogram of order (m-1) to store the path tree and the statistics
• Henceforth m=2: the table stores tag-tag pairs, tag-value pairs and single tags

Data values

• Problem: the number of distinct data values is very large, so the table may become larger than the available memory
• Solution
  o Only the k most frequent tag-value pairs are stored exactly
  o All other pairs are aggregated into buckets according to some feature
  o The feature should distribute the pairs as uniformly as possible

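Example, k=1

[Figure: data tree with root A (1) and children B (6) and C (3); data value v3 occurs 3 times below B, v2 once below B and v1 once below C]

Tag   Count
A     1
B     6
C     3

Tag   Tag   Count
A     B     6
A     C     3

Tag   Value   Count
B     v3      3

Tag   Feat.   Sum   #pairs
B     b       1     1
C     a       1     1

Data value v1 begins with the letter 'a', v2 with the letter 'b'.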


Selectivity Estimation

$$\sigma(t_1, t_2, \ldots, t_n) \approx N \cdot P[t_n] \cdot \prod_{i=1}^{n-1} P[t_i \mid t_{i+1}]$$

• P[tn]: probability of tag tn occurring in the XML data tree
• N: total number of nodes in the XML data tree
• P[ti|ti+1]: probability of tag ti occurring before tag ti+1
• E (the factor N · P[tn]): expectation for the occurrence of tag tn
• E1 (the factors P[ti|ti+1]): expectation for the occurrence of tag ti before tag ti+1 (if n=2, ti+1 = tn)

• Simple path p = //t1/t2/.../tn:

$$\hat{\sigma}(t_1 t_2 \ldots t_n) = \left( \prod_{i=1}^{n-2} \frac{f(t_i, t_{i+1})}{f(t_{i+1})} \right) \cdot f(t_{n-1}, t_n)$$

• Analogous for a single-value path p = //t1/t2/.../tn-1=vn-1

• Slightly more complicated for multi-value paths

Example

Using the histogram of the k=1 example, with f(A,B) = 6, f(B) = 6 and f(B,v3) = 3:

$$\hat{\sigma}(A/B{=}v3) = \frac{f(A, B)}{f(B)} \cdot f(B, v3) = \frac{6}{6} \cdot 3 = 3$$

Real selectivity = 3
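The single-value estimate can be sketched on top of the k=1 histogram. The dictionary layout and all names below are illustrative assumptions. For tag-value pairs that are not stored exactly, the sketch falls back to the bucket average Sum / #pairs; the slide shows both columns but does not spell out this rule, so treat the fallback as an assumption.

```python
# Histogram of the k=1 example, split into four small tables.
TAGS = {"A": 1, "B": 6, "C": 3}
PAIRS = {("A", "B"): 6, ("A", "C"): 3}
VALUES = {("B", "v3"): 3}                            # stored exactly (k = 1)
BUCKETS = {("B", "b"): (1, 1), ("C", "a"): (1, 1)}   # (Sum, #pairs)
FEATURE = {"v1": "a", "v2": "b"}                     # first letter of the value

def value_count(tag, value):
    """f(tag, value): exact if stored, otherwise the bucket average (assumed)."""
    if (tag, value) in VALUES:
        return float(VALUES[(tag, value)])
    total, pairs = BUCKETS.get((tag, FEATURE.get(value)), (0, 0))
    return total / pairs if pairs else 0.0

def estimate_single_value(tags, value):
    """Estimate //t1/.../t(n-1)=value with the order-1 Markov formula."""
    result = value_count(tags[-1], value)
    for parent, child in zip(tags, tags[1:]):
        child_count = TAGS.get(child, 0)
        if child_count == 0:
            return 0.0
        result *= PAIRS.get((parent, child), 0) / child_count
    return result

print(estimate_single_value(["A", "B"], "v3"))  # 3.0
```

For A/B=v3 the estimate equals the real selectivity of 3, because every factor involved is stored exactly.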

Updates

• Changes in the data require the statistics to be updated

• Done via query feedback tuples (p, σ)
• p denotes the path, σ denotes the accurate selectivity of p

• The feedback is distributed over all sub-paths of p according to an update strategy

Learning process

• Given
  o An initially empty Markov histogram f
  o Query feedback (p, σ)
  o Estimated selectivity σ̂

• Learn any unknown length-2 path
• Update the selectivities of known paths

• Two strategies
  o Heavy-Tail Rule
  o Delta Rule

Algorithm Part 1

• Learn new paths of length up to 2

UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂)
  if |p| ≤ 2 then
    if not exists f(p)
      then add entry f(p) = σ
      else f(p) ← σ

• Example: σ̂(AD) = 1 (not in f), σ(AD) = 2; the entry (A, D, 2) is added

Tag   Tag   Count
A     B     6
A     C     3
A     D     2

The single-tag, tag-value and bucket tables are unchanged.

Algorithm-Part 2

• Learn longer paths (decompose into paths of length 2)

  else
    for each (ti, ti+1) ∈ p
      if not exists f(ti, ti+1)
        then add entry f(ti, ti+1) = 1
      f(ti, ti+1) ← update
    endfor

• The update of f(ti, ti+1) depends on the update strategy

Example: σ̂(ACD) = 1, σ(ACD) = 5

• Decompose into AC and CD
• AC is present: update the frequency, f(AC) = 5
• CD is not present: add f(CD) = 1, then update f(CD), f(CD) = 4

Tag   Tag   Count
A     B     6
A     C     5
C     D     4

The single-tag, tag-value and bucket tables are unchanged.

Algorithm-Part 3

• Learn the frequency of single tags

  for each ti ∈ p, i ≠ 1
    if not exists f(ti)
      then add entry f(ti)
    f(ti) ← max{f(ti), Σx f(x, ti)}
  endfor

• Example: σ̂(AD) = 1 (not in f), σ(AD) = 2; the tag D is added with f(D) = 2

Tag   Count
A     1
B     6
C     3
D     2

Tag   Tag   Count
A     B     6
A     C     3
A     D     2

The tag-value and bucket tables are unchanged.
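The three parts of the algorithm can be put together in one function. This is a sketch, not the original implementation: the histogram is modelled as a plain dictionary keyed by tag tuples, tag-value pairs are left out, and the update strategy is passed in as an optional function `adjust`, which is a hypothetical hook standing in for the Heavy-Tail or Delta rule. Reading the single-tag rule as "at least the sum of all stored pairs ending in the tag" is an assumption.

```python
def update(f, path, sigma, sigma_hat, adjust=None):
    """Apply one feedback tuple (path, sigma) to the histogram f (a dict)."""
    path = tuple(path)
    if len(path) <= 2:
        # Part 1: short paths are stored (or overwritten) with the real value.
        f[path] = sigma
    else:
        # Part 2: decompose into length-2 paths; unknown pairs start at 1.
        for pair in zip(path, path[1:]):
            f.setdefault(pair, 1)
            if adjust is not None:
                adjust(f, pair, path, sigma, sigma_hat)  # strategy-specific
    # Part 3: a tag occurs at least as often as all stored pairs ending in it.
    for tag in path[1:]:
        incoming = sum(c for k, c in f.items() if len(k) == 2 and k[1] == tag)
        f[(tag,)] = max(f.get((tag,), 0), incoming)
    return f

hist = {("A",): 1, ("B",): 6, ("C",): 3, ("A", "B"): 6, ("A", "C"): 3}
update(hist, ["A", "D"], sigma=2, sigma_hat=1)
print(hist[("A", "D")], hist[("D",)])  # 2 2
```

The call reproduces the AD example: the pair (A, D) is learned with selectivity 2, and the previously unknown tag D receives the count 2.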

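Update strategies: Heavy-Tail Rule

• Attributes more of the estimation error to the end of the path
• Every term f(ti, ti+1) of the path is adjusted in the direction sgn(σ(p) − σ̂(p)); the size of the adjustment grows with the error |σ(p) − σ̂(p)|, the learning rate γ and the share wi / W of the term
• wi: weighting factors (increasing with i, e.g. 2^i)
• γ: learning rate
• W: normalizing weight (sum of the wj over the path)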

Update strategies: Delta Rule

• Error-reduction learning technique
• Minimizes an error function

$$E = (\sigma(p) - \hat{\sigma}(p))^2$$

• The update to the term f(ti, ti+1) is proportional to the negative gradient of E with respect to f(ti, ti+1)

$$f_{l+1}(t_i, t_{i+1}) \leftarrow f_l(t_i, t_{i+1}) - \gamma \, \frac{\partial E}{\partial f_l(t_i, t_{i+1})}$$

• γ determines the length of a step
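One gradient step of the delta rule can be sketched as follows. The sketch assumes that the estimate is the order-1 Markov product, that the single-tag counts are held fixed during the step, that no tag pair occurs twice in the path and that all pair counts are non-zero. Under these assumptions the estimate is linear in each pair count, so its partial derivative is the estimate divided by that count. The function names and the value of γ are illustrative.

```python
def estimate(f, path):
    """Order-1 Markov estimate: prod f(ti,ti+1)/f(ti+1) * f(tn-1,tn)."""
    path = tuple(path)
    result = f[path[-2:]]
    for i in range(len(path) - 2):
        result *= f[(path[i], path[i + 1])] / f[(path[i + 1],)]
    return result

def delta_step(f, path, sigma, gamma):
    """Move every pair count along the negative gradient of E = (sigma - est)^2."""
    path = tuple(path)
    sigma_hat = estimate(f, path)
    grads = {}
    for pair in zip(path, path[1:]):
        # d(est)/d f(pair) = est / f(pair), so dE/df = -2 (sigma - est) est / f.
        grads[pair] = -2.0 * (sigma - sigma_hat) * sigma_hat / f[pair]
    for pair, grad in grads.items():
        f[pair] -= gamma * grad
    return f

hist = {("A", "C"): 3.0, ("C",): 3.0, ("C", "D"): 1.0}
print(estimate(hist, ["A", "C", "D"]))   # estimate before the step
delta_step(hist, ["A", "C", "D"], sigma=5, gamma=0.05)
print(estimate(hist, ["A", "C", "D"]))   # moves towards the real selectivity 5
```

All gradients are computed before any count is changed, so the step uses one consistent estimate. A smaller γ gives smaller, more cautious steps; a γ that is too large can overshoot the real selectivity.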

Evaluation

• Good
  o on-line, adapts to changing data
  o workload-aware
  o after the learning phase comparable to off-line methods
  o update overhead nearly constant

• Bad
  o still restricted to XML trees, no support for idrefs


Preliminaries

XML Data Graph

• A: Author
• P: Paper
• B: Book
• PB: Publisher
• T: Title
• N: Name

[Figure: XML data graph with the element nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and the value nodes V4, V8, V10 to V14]

Preliminaries

Path Expressions

• XPath expressions:
  o Simple: A/P/T
  o Complex: A[B]/P/T
• The result is a set of nodes: {T1,T2}

[Figure: the XML data graph with the result nodes highlighted, first T11 and T12, then T11]

Preliminaries

• Motivation: selectivity estimation over XML data graphs

• Outline
  o XSketch Synopsis
  o Estimation Framework
  o XSketch Refinement Operations
  o Experiment

XSketch Synopsis

• XML data graph

• General synopsis graph: each node stands for a set of elements with the same tag (its extent) and stores the size of that set

Synopsis graph, level by level (node with count):

P(1)
A(2)  PB(1)
N(2)  P(2)  B(2)
T(2)  T(2)  E(1)

Count(A) = |Extent(A)| = |{A1, A2}| = 2

Backward-edge Stability

• XML data graph

• Synopsis graph with b labels on the backward-stable edges

Label(u,v) = b if all elements in v have a parent in u.

In the example, Label(A(2), B(2)) and Label(PB(1), B(2)) are empty: only some of the elements in B(2) have a parent in A(2), and only some have a parent in PB(1).

Forward-edge Stability

• XML data graph

• Synopsis graph with f labels on the forward-stable edges

Label(u,v) = f if all elements in u have a child in v.

In the example, B9 is in B(2) but has no child in E(1), so the edge from B(2) to E(1) is not labeled f.

XSketch Synopsis

• XML data graph

• XSketch synopsis graph

An XSketch is a synopsis graph in which every edge (u,v) carries a label Label(u,v) ∈ {b, f, f/b, Ø}.

[Figure: synopsis graph with the nodes P(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1) and the edge labels f/b, f, b and Ø]

Estimation Framework

• Calculate the selectivity of the path expression V = V1/…/Vn:

Count(V) = Count(Vn) * f(V)

Case 1: if Label(Vi, Vi+1) includes b for all i, then f(V) = 1, so

Count(V) = Count(Vn)

• Example: Count(A/P/T) = Count(T) * f(A/P/T) = 2 * 1 = 2

Estimation Framework

Case 2: there exists an i such that Label(Vi, Vi+1) does not include b

• A1. Path Independence Assumption: f(u/v | v/w) ≈ f(u/v)

• A2. Backward-Edge Uniformity Assumption: the incoming edges of v from all parents u with Label(u,v) ≠ b are uniformly distributed over all such parents

• Example: f(P/PB/B/T) = ?

f(P/PB/B/T) = f(B/T) * f(P/PB/B | B/T)
            = f(B/T) * f(PB/B | B/T) * f(P/PB | PB/B/T)
By b-stability: = f(PB/B | B/T)
By A1:          ≈ f(PB/B)
By A2:          = Count(PB) / [Count(PB) + Count(A)]

f(P/PB/B/T) = 1 / (1 + 2) = 1/3
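The last step of the derivation, and the resulting count estimate, can be checked with exact fractions. The counts are taken from the example synopsis (PB: 1, A: 2, and 2 for the T node below B); the function name and the dictionary are illustrative assumptions, and the final count is not stated in the example but follows from Count(V) = Count(Vn) * f(V).

```python
from fractions import Fraction

# Node counts from the example synopsis; "T" is the title node below B.
COUNT = {"PB": 1, "A": 2, "T": 2}

def backward_uniform_fraction(parent, unstable_parents, count=COUNT):
    """A2: share of the child elements attributed to one non-b-stable parent."""
    total = sum(count[u] for u in unstable_parents)
    return Fraction(count[parent], total)

f_path = backward_uniform_fraction("PB", ["PB", "A"])
print(f_path)               # 1/3
print(COUNT["T"] * f_path)  # 2/3
```

The estimate of 2/3 is an expected value: under the uniformity assumption, one third of the B elements, and therefore of their titles, is attributed to the publisher parent.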


• A3. Branch-Independence Assumption: outgoing paths from v are conditionally independent of the existence of other outgoing paths

• A4. Forward-Edge Uniformity Assumption: the outgoing edges from v to all children u of v such that Label(v,u) ≠ f are uniformly distributed across all such children

XSketch Refinement Operations

• Goal: construct an efficient XSketch for a given space budget

• Refinement operations:
  o b-Stabilize(XS(G), u, v): applies when Label(v,u) ≠ b. Refine node u into two element partitions u1, u2 with the same label such that Label(v,u1) = b or Label(v,u2) = b
  o f-Stabilize(XS(G), u, w): applies when Label(u,w) ≠ f. Refine u into two nodes u1, u2 with the same label such that Label(u1,w) = Label(u,w) ∪ {f}
  o Backward Split

[Figure: b-Stabilize splits node U (neighbours V1 … Vn) into U1 and U2; f-Stabilize splits node U (neighbours W1 … Wn) into U1 and U2; Backward Split divides node A into A1 and A2, separating P1 … Pi from Pi+1 … Pn]

Markov Tables vs. XSketch

[Figure: average absolute relative error (%) as a function of the summary size (15 to 50 KB) for XSketches and Markov Tables (MT)]

Error metric over a query workload W:

$$\frac{1}{|W|} \sum_{p \in W} \frac{|count(p) - estim(p)|}{count(p)}$$
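The error metric of the comparison is straightforward to compute. The sketch below assumes a workload given as pairs of true and estimated counts with non-zero true counts; the function name and the sample numbers are hypothetical and are not results from the experiment.

```python
def avg_abs_rel_error(workload):
    """Mean of |count(p) - estim(p)| / count(p) over all queries p."""
    return sum(abs(c - e) / c for c, e in workload) / len(workload)

# Hypothetical (true count, estimated count) pairs for four queries.
sample = [(4, 2), (2, 3), (10, 10), (8, 8)]
print(avg_abs_rel_error(sample))  # 0.25, i.e. 25 %
```

Because each error is divided by the true count, queries with small results weigh as much as queries with large results.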


Summary

• Definition Selectivity

• Summarizing XML Documents (Path Trees / Markov Tables)

• Application using Markov Tables: XPathLearner

• Extension of Selectivity Estimation on Graphs: XSketch

Questions?