1
Something completely different…
Platon would like to host (in Århus) a BIT meeting in the autumn
– SOA or MDM
– Fine by me, but what do you all say?
Platon invites everyone to www.bi2006.dk
– 7-8 June
– Special price for BIT members: 2995 kr.
– Registration via Jørgen Davidsen, [email protected]
Lineage Tracing in Data Warehouses
Torben Bach Pedersen
Based on work by Yingwei Cui and Jennifer Widom
Stanford University Database Group
3
Motivation: Data Warehousing
[Figure: Source 1, Source 2, Source 3 (tables Courses, Enrollments, Students) feed the Data Warehouse]
Lucrative Fields: Databases $8800K, Theory $320K, Networks $800K
Wow?! Databases $8800K
4
Oh, I see...
[Figure: a Lineage Tracer connects the Data Warehouse back to Source 1, Source 2, Source 3]
Lucrative Fields: Databases $8800K, Theory $320K, Networks $800K
Courses: CS145 Databases, CS154 Theory, CS244 Networks, CS245 Databases, …
Enrollments: CS145 Ted, CS154 Joe, CS244 Bob, CS145 Ann, CS245 Jane, …
Students: Bob MS $1K, Jane Web $5K, Ann BS $1K, Joe BS $1K, Ted Web $5K, …
5
The Data Lineage Problem
Data warehouses integrate data from multiple sources for analysis and mining
Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o?
Sometimes called “drill-through” in industry
– “Drill-through” often limited
6
Challenges
Warehouse of relational views over relational sources
– What is a good formal definition for lineage?
– How do we trace data lineage for arbitrary views?
– How do we make it efficient?
Warehouse defined by graph of data transformations
– No fixed, well-defined relational operators
– Large transformation sequences and graphs
7
Outline of Talk
Part 1: Lineage tracing for relational views
Part 2: Lineage tracing for general data transformations
8
Part 1: Lineage Tracing for Relational Views
Declarative definition of data lineage
Lineage tracing algorithms
Using auxiliary views for efficient lineage tracing
Experimental results (small sample)
9
Views We Consider
Relational algebra
Arbitrary use of aggregation
Set semantics
Also in thesis
– Set operators
– Bag semantics
R S T
V
10
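Simple Lineage Example
select Y, sum(Z) from R natural join S where X > Z group by Y
$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$
R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
$R \bowtie S$ (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)
$\sigma_{X>Z}(R \bowtie S)$ (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
V (Y, sum): (a, 2), (b, 6)
(The slide labels the two intermediate results T and U.)
Highlighted: view tuple (b, 6), traced back through (8, b, 0) and (8, b, 6) to (8, b) in R and (b, 0), (b, 6) in S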
11
Lineage for Relational Operators
Unary relational operators: the definition took a long time
[Figure: operator op applied to R; the subset R* of R produces output tuple t]
Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that
(1) $op(R^*) = \{t\}$ - output of R* through op is t
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$ - op used on t* is nonempty
12
Lineage for Relational Operators
Example 1 – the two conditions ensure that only tuples contributing to t are included in the lineage
Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that
(1) $op(R^*) = \{t\}$
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$
op = $\sigma_{X>Z}$
R (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)
$\sigma_{X>Z}(R)$ (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
Highlighted: output tuple (8, b, 6) and its lineage (8, b, 6) in R
13
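Lineage for Relational Operators
Example 2 – the “maximal” requirement ensures that the (8,b,0) tuple is included in the (b,6) lineage
Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that
(1) $op(R^*) = \{t\}$
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$
op = $\alpha_{Y,\,\mathrm{sum}(Z)}$
R (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
$\alpha_{Y,\,\mathrm{sum}(Z)}(R)$ (Y, sum): (a, 2), (b, 6)
Highlighted: output tuple (b, 6) and its lineage (8, b, 0), (8, b, 6) in R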
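The definition can be checked by brute force on the two examples. The following is only a sketch under stated assumptions: the helper names (`lineage`, `select_x_gt_z`, `sum_z_by_y`) are made up for illustration, relations are modeled as Python sets of tuples, and the subset search is exponential, so it is a way to test the definition rather than a tracing algorithm.

```python
from itertools import combinations

def lineage(op, R, t):
    """Largest R* within R with op(R*) == {t} and op({t*}) non-empty for every t* in R*."""
    # Condition (2) is a per-tuple test, so it can be applied up front.
    candidates = [r for r in R if op({r})]
    # Try subsets from largest to smallest; the first match is maximal.
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            if op(set(subset)) == {t}:
                return set(subset)
    return set()

def select_x_gt_z(rows):
    """Selection X > Z over (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    """Aggregation Y, sum(Z) over (X, Y, Z) tuples."""
    totals = {}
    for (_x, y, z) in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

joined = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}
selected = select_x_gt_z(joined)

example1 = lineage(select_x_gt_z, joined, (8, "b", 6))   # Example 1
example2 = lineage(sum_z_by_y, selected, ("b", 6))       # Example 2
```

In Example 1, condition (2) removes (8, b, 9), which the selection filters out. In Example 2, maximality is what brings (8, b, 0) into the result even though it adds nothing to the sum.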
14
N-ary relational operators ( ,,) – lineage unique
Lineage for Relational Operators
Lineage of t according to op is the maximal subsets $R_i^* \subseteq R_i$ for $i = 1..n$ such that
(1) $op(R_1^*, \dots, R_n^*) = \{t\}$
(2) $\forall t_i^* \in R_i^*: op(R_1, \dots, \{t_i^*\}, \dots, R_n) \neq \emptyset$
op
R1*
*R2
R2
R1
15
Lineage for Relational Views
Lineage of a tuple set is union of lineage of each tuple in the set
Lineage for views is defined recursively => naive, but inefficient, algorithm (need to recompute/store all intermediate results)
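[Figure: op1 over R1 and R2 produces intermediate result U, and op2 over U produces view V; tuple t in V is traced back to U*, then to R1* and R2*]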
Lineage of t is R1*, R2*
16
Lineage Tracing
Convert view into segmented normal form (SPJ+agg)
E1 … En Each segment
Generate one tracing query for each segment
Apply tracing queries recursively
– # non-top + 1
Proof: lineage result is unaffected by normalization and segment-level tracing
17
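Tracing Query for One Segment
$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$
$TQ = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$
Split = "unjoin" – project over R+S schemas
R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
V (Y, sum): (a, 2), (b, 6)
Tracing (b, 6): $\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S))$ = (8, b, 0), (8, b, 6)
R*={(8,b)}, S*={(b,0),(b,6)}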
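The tracing query for this segment can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, relations are modeled as Python sets of tuples, and the view definition is hard-coded to the slide's example rather than derived from a general segment.

```python
def natural_join(R, S):
    """Join R(X, Y) with S(Y, Z) on Y, giving (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y) in R for (y2, z) in S if y == y2}

def trace_segment(R, S, t):
    """Tracing query for V = agg_{Y,sum(Z)}(select_{X>Z}(R join S)) and view tuple t."""
    y_out, _sum = t
    # Selection X > Z from the view, plus Y = t.Y to pick t's group.
    contributing = {(x, y, z) for (x, y, z) in natural_join(R, S)
                    if x > z and y == y_out}
    # Split ("unjoin"): project the contributing tuples back onto each source schema.
    r_star = {(x, y) for (x, y, _z) in contributing}
    s_star = {(y, z) for (_x, y, z) in contributing}
    return r_star, s_star

R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}
r_star, s_star = trace_segment(R, S, ("b", 6))
```

The aggregate value itself is not needed for the trace: the group key Y = b already identifies the contributing tuples.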
18
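Recursive Tracing Procedure
$V = \alpha_{W,\,\mathrm{avg}(\mathrm{sum})}(\alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S)) \bowtie T)$
$TQ_1 = \mathrm{Split}_{U,T}(\sigma_{W=q}(U \bowtie T))$
$TQ_2 = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$
R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
T (Y, W): (a, p), (b, p), (b, q)
U (Y, sum): (a, 2), (b, 6)
V (W, avg): (p, 4), (q, 6)
Tracing (q, 6): $TQ_1$ gives (b, 6) in U and (b, q) in T; $TQ_2$ then traces (b, 6) to (8, b) in R and (b, 0), (b, 6) in S
R*={(8,b)}, S*={(b,0),(b,6)}, T*={(b,q)}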
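The two-level trace on this slide can be sketched as follows. This is a minimal sketch, not the algorithm from the thesis: the function names are hypothetical, both tracing queries are hard-coded to this view, and the intermediate result U is recomputed at tracing time, which is the cost that auxiliary views are meant to reduce.

```python
def compute_u(R, S):
    """U = agg_{Y,sum(Z)}(select_{X>Z}(R join S)), as a set of (Y, sum) tuples."""
    totals = {}
    for (x, y) in R:
        for (y2, z) in S:
            if y == y2 and x > z:
                totals[y] = totals.get(y, 0) + z
    return set(totals.items())

def trace_v(U, T, t):
    """TQ1: split the U-join-T tuples with W = t.W back into U* and T*."""
    w_out = t[0]
    hits = [(u, tt) for u in U for tt in T if u[0] == tt[0] and tt[1] == w_out]
    return {u for u, _ in hits}, {tt for _, tt in hits}

def trace_u(R, S, u):
    """TQ2: split the selected R-join-S tuples with Y = u.Y back into R* and S*."""
    y_out = u[0]
    hits = [(r, s) for r in R for s in S
            if r[1] == s[0] == y_out and r[0] > s[1]]
    return {r for r, _ in hits}, {s for _, s in hits}

def trace(R, S, T, t):
    U = compute_u(R, S)  # intermediate result recomputed here
    u_star, t_star = trace_v(U, T, t)
    r_star, s_star = set(), set()
    for u in u_star:  # lineage of a tuple set is the union of per-tuple lineages
        r_part, s_part = trace_u(R, S, u)
        r_star |= r_part
        s_star |= s_part
    return r_star, s_star, t_star

R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}
T = {("a", "p"), ("b", "p"), ("b", "q")}
r_star, s_star, t_star = trace(R, S, T, ("q", 6))
```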
19
Making It Efficient
Source accesses are usually expensive or impossible
Need some intermediate results for lineage tracing
Store auxiliary views at the warehouse
– Reduce or eliminate source accesses
– Reduce recomputation of intermediate results
20
Aux View Example
21
Aux View Example
22
Auxiliary Views
There are many possible auxiliary views
For single-segment views
– Identified 10 possible auxiliary view schemes
– Studied performance tradeoffs
For arbitrary views
– Hard optimization problem
– Exhaustive and heuristic algorithms
– Performance study
[Figure: view over R1 … Rn]
23
Single Segment Schemes
Store nothing (NO)
Store Base Tables (BT)
Store Lineage Views (LV)
Store Split Lineage Tables (SLT)
Store Partial Base Tables (PBT)
Store Base Table Projections (BP)
Store Lineage View Projections (LP)
Self-maintainable variations: LV-S, SLT-S, PBT-S
24
Auxiliary Views: Performance Tradeoffs
+ Always improve lineage tracing
– Must be maintained when sources change
+ Can also help with maintenance of original user views
25
Auxiliary View Schemes for Single-Segment Views
Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4
Measurements:
- tracing time
- maintenance time
26
Auxiliary View Selection Algorithms for Arbitrary Views
27
Part 2: Transformation Graphs
Lineage definition
Tracing algorithms
Combining transformations for lineage tracing
Experimental results (tiny sample)
[Figure: a graph of transformations T1 … T6 between Source 1, Source 2, Source 3 and the Data Warehouse]
28
Transformation Example
[Figure: transformations T1 … T7 lead from Order and Product to SalesJump; operator labels on the slide: selection, “join”, split, pivot, projection, selection, projection]
Order (id, cust, date, prod-list):
1 A 2/8/99 1(10),2(10)
2 C 4/5/99 2(5),3(10)
3 D 6/1/99 1(20),2(10)
4 B 8/6/99 1(10),3(5)
5 D 10/8/99 1(5),3(10)
6 B 12/1/99 2(10),3(10)
Product (id, name, price, valid):
1 imac 1200 10/1/98-
2 vaio 2400 6/1/98-9/1/99
2 vaio 1800 9/2/99-
3 palm 500 2/1/98-7/1/98
3 palm 400 7/2/98-9/1/99
3 palm 300 9/2/99-
SalesJump (name, avg3, Q4):
palm 2K 6K
Highlighted: SalesJump tuple (palm, 2K, 6K); Product tuples (3, palm, 400, 7/2/98-9/1/99) and (3, palm, 300, 9/2/99-); Order tuples with id 2, 4, 5, 6
29
Lineage for General Transformations
A transformation can be an arbitrary program
[Figure: transformation T with unknown internals ("??"); examples of what T may be:]
select … from … where …
main(int argc, char** argv) {…}
sed “s/string1/string2/g” …
– One extreme: relational operators
– Another extreme: we know nothing about T
– Middle ground: based on transformation properties
30
Transformation Properties
Transformation classes
Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
31
Transformation Classes
dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
$T^*(o) = \{\, i \mid o \in T(\{i\}) \,\}$
Produces 0 or more output items per input item
Applying T to the complete input set gives the same result as applying it to each input item separately
32
Dispatcher Example
T1: Order → O1
Order (id, cust, date, prod-list):
1 A 2/8/99 1(10),2(10)
2 C 4/5/99 2(5),3(10)
3 D 6/1/99 1(20),2(10)
4 B 8/6/99 1(10),3(5)
5 D 10/8/99 1(5),3(10)
6 B 12/1/99 2(10),3(10)
O1 (id, cust, date, pid, quant):
1 A 2/8/99 1 10
1 A 2/8/99 2 10
: : :
5 D 10/8/99 1 5
5 D 10/8/99 3 10
6 B 12/1/99 2 10
6 B 12/1/99 3 10
Highlighted: O1 tuple (5, D, 10/8/99, 3, 10) and its lineage, the Order tuple (5, D, 10/8/99, 1(5),3(10))
A non-relational operator, but a typical dispatcher
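Dispatcher tracing follows directly from T*(o) = {i | o ∈ T({i})}: apply T to each input item on its own and keep the items whose output contains o. The sketch below assumes a made-up `split_order` function standing in for T1 (the slides do not show its code) and models tables as Python sets of tuples.

```python
def split_order(orders):
    """Hypothetical stand-in for T1: one output row per entry in prod-list."""
    out = set()
    for (oid, cust, date, prod_list) in orders:
        for entry in prod_list.split(","):
            # An entry looks like "3(10)": product id 3, quantity 10.
            pid, quant = entry.rstrip(")").split("(")
            out.add((oid, cust, date, int(pid), int(quant)))
    return out

def trace_dispatcher(T, I, o):
    """T*(o) = {i in I | o in T({i})}, using one call to T per input item."""
    return {i for i in I if o in T({i})}

orders = {
    (1, "A", "2/8/99", "1(10),2(10)"),
    (2, "C", "4/5/99", "2(5),3(10)"),
    (3, "D", "6/1/99", "1(20),2(10)"),
    (4, "B", "8/6/99", "1(10),3(5)"),
    (5, "D", "10/8/99", "1(5),3(10)"),
    (6, "B", "12/1/99", "2(10),3(10)"),
}
traced = trace_dispatcher(split_order, orders, (5, "D", "10/8/99", 3, 10))
```

The tracer treats T as opaque and only relies on the dispatcher property, which is why it works for a non-relational operator such as this split.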
33
Transformation Classes
dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
$T^*(o) = \{\, i \mid o \in T(\{i\}) \,\}$
aggregator: $\forall I$ with $T(I) = \{o_1, \dots, o_n\}$: there is a unique partition $I_1, \dots, I_n$ of $I$ such that $T(I_k) = \{o_k\}$
$T^*(o_k) = I_k$
34
Aggregator Example
T4: O3 → O4
O3 (oid, name, date, price, quant):
1 imac 2/8/99 1200 10
1 vaio 2/8/99 2400 10
2 vaio 4/5/99 2400 5
2 palm 4/5/99 400 10
3 imac 6/1/99 1200 20
3 vaio 6/1/99 2400 10
4 imac 8/6/99 1200 10
4 palm 8/6/99 400 5
5 imac 10/8/99 1200 5
5 palm 10/8/99 300 10
6 vaio 12/1/99 1800 10
6 palm 12/1/99 300 10
O4 (name, Q1, Q2, Q3, Q4):
imac 12K 24K 12K 6K
vaio 24K 12K 24K 18K
palm 0K 4K 2K 6K
Highlighted: O4 tuple (palm, 0K, 4K, 2K, 6K) and its lineage, the four O3 tuples with name palm (oid 2, 4, 5, 6)
T4 computes quarterly sales per product by "pivoting"
Again, a non-relational operator, but a typical aggregator
35
Transformation Classes
dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
$T^*(o) = \{\, i \mid o \in T(\{i\}) \,\}$
aggregator: $\forall I$ with $T(I) = \{o_1, \dots, o_n\}$: there is a unique partition $I_1, \dots, I_n$ of $I$ such that $T(I_k) = \{o_k\}$
$T^*(o_k) = I_k$
black-box: all others
$T^*(o) = I$
36
Transformation Classes
Most transformations are dispatchers, aggregators, or their compositions
A transformation can be both dispatcher and aggregator
– Proof: Lineage definitions are then equivalent
Transformations can be relational operators
– Lineage definitions same as relational definitions
37
Transformation Properties
Transformation classes
Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure
38
Transformation Subclasses
Permit more efficient lineage tracing
Filter is a special dispatcher
– Each input data item produces itself or nothing
Context-free aggregator
– Whether two input data items are in the same partition is independent of other items
Key-preserving aggregator
– Any subset of an input partition always produces the same output key
39
Tracing Example: Aggregators
Consider T(I) = {o1…on}
Tracing the lineage of o for aggregator
– Partition input I into I1…In such that T(Ik) = {ok}
– Return Ik such that T(Ik) = {o}
Tracing the lineage of o for context-free aggregator
– Partition input I into I1…In such that |T(Ik)| = 1
– Return Ik such that T(Ik) = {o}
– 2^n versus n^2 running time !
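For a context-free aggregator the partition can be built with pairwise tests, since whether two items belong together does not depend on the rest of the input. The sketch below is one way to do this under stated assumptions, not necessarily the procedure from the thesis: `quarterly_sales` is a made-up stand-in for the pivot T4, dates are read as month/day/year, and only the imac and palm rows of O3 are used.

```python
def quarterly_sales(rows):
    """Hypothetical stand-in for T4: (name, Q1, Q2, Q3, Q4) sales per product."""
    totals = {}
    for (_oid, name, date, price, quant) in rows:
        month = int(date.split("/")[0])
        quarters = totals.setdefault(name, [0, 0, 0, 0])
        quarters[(month - 1) // 3] += price * quant
    return {(name, *q) for name, q in totals.items()}

def trace_context_free(T, I, o):
    """Group items by pairwise tests, then return the group that produces o."""
    partitions = []
    for item in I:
        for part in partitions:
            # Context-free: comparing with one representative is enough.
            if len(T({part[0], item})) == 1:
                part.append(item)
                break
        else:
            partitions.append([item])
    for part in partitions:
        if T(set(part)) == {o}:
            return set(part)
    return set()

o3_subset = {
    (1, "imac", "2/8/99", 1200, 10),
    (2, "palm", "4/5/99", 400, 10),
    (3, "imac", "6/1/99", 1200, 20),
    (4, "palm", "8/6/99", 400, 5),
    (5, "palm", "10/8/99", 300, 10),
    (6, "palm", "12/1/99", 300, 10),
}
palm_lineage = trace_context_free(quarterly_sales, o3_subset,
                                  ("palm", 0, 4000, 2000, 6000))
```

Each item is compared with at most one representative per existing partition, so the number of calls to T is at most quadratic in |I|, compared with searching over subsets for a general aggregator.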
40
Schema Information
Input schema A=(A1…An) and key Akey
Output schema B=(B1…Bn) and key Bkey
Schema mappings: f(A) → B and A ← g(B)
Transformations with special schema mappings
– Forward key-map: f(A) → Bkey
– Backward key-map: Akey ← g(B)
– Backward total-map: A ← g(B)
– More efficient tracing for these
41
Tracing Example: Forward Key-Maps
[Figure: T4 from O3 to O4, with the same tables as in the Aggregator Example; the O4 tuple (palm, 0K, 4K, 2K, 6K) and the O3 tuples with oid 2, 4, 5, 6 and name palm are highlighted]
"name" is carried over as key - trace of "palm" is easy: the O3 tuples with name = 'palm'
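With a forward key-map, tracing needs no calls to T at all: one scan of the input selects the items whose mapped key equals the key of the output item. A minimal sketch, assuming tuples as rows and two hypothetical accessor functions (`f` for the forward key-map, `out_key` for the output key), on the imac and palm rows of O3:

```python
def trace_forward_key_map(I, o, f, out_key):
    """Return the input items i with f(i) equal to the key of output item o."""
    target = out_key(o)
    return {i for i in I if f(i) == target}

# O3 rows are (oid, name, date, price, quant); O4 rows are (name, Q1, Q2, Q3, Q4).
o3_subset = {
    (1, "imac", "2/8/99", 1200, 10),
    (2, "palm", "4/5/99", 400, 10),
    (3, "imac", "6/1/99", 1200, 20),
    (4, "palm", "8/6/99", 400, 5),
    (5, "palm", "10/8/99", 300, 10),
    (6, "palm", "12/1/99", 300, 10),
}
palm_rows = trace_forward_key_map(
    o3_subset,
    ("palm", "0K", "4K", "2K", "6K"),
    f=lambda row: row[1],        # "name" in O3 maps to the O4 key
    out_key=lambda row: row[0],  # "name" is the key of O4
)
```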
42
Other Properties
Transformation author provides Tracing Procedure
Provided Transformation Inverse $T^{-1}$
– If T is an aggregator, then o's lineage is $T^{-1}(\{o\})$
– Not always true for dispatchers or black-boxes
43
Tracing Procedures
Property                 Procedure   # T Calls     # Accesses
dispatcher               TraceDS     O(|I|)        O(|I|)
aggregator               TraceAG     O(2^|I|)      O(2^|I|)
black-box                return I;   0             O(|I|)
filter                   return o;   0             0
context-free aggr.       TraceCF     O(|I|^2)      O(|I|^2)
key-preserving aggr.     TraceKP     O(|I|)        O(|I|)
forward key-map          TraceFM     0             O(|I|)
backward key-map         TraceBM     0             O(|I|)
backward total-map       TraceTM     0             0
provided tracing-proc.   provided    ?             ?
44
Property Hierarchy
[Figure: hierarchy of transformation properties with nodes ANY, provided tracing-proc. or inverse, black-box, aggregator, dispatcher, context-free aggr., key-preserving aggr., filter, forward key-map, backward key-map, total-map]
45
Summary of Our Approach for One Transformation
Properties are provided with transformations
– Specified by the transformation author
– Declared in prepackaged transformations
– Derived using recent techniques [Clio01, RB01]
The best property of a transformation is selected based on the hierarchy
The tracing procedure using the best property is called at tracing time
Indexing techniques
46
Transformation Sequences
Naive algorithm traces backwards one transformation at a time
– Need all intermediate results
– Poor performance for long sequences
[Figure: sequence I → T1 → T2 → T3 → … → Tn → O]
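The naive algorithm can be sketched as a forward pass that materializes every intermediate result, followed by a backward pass that traces one transformation at a time. Everything named here is an assumption for illustration: the toy transformations `split_words` and `keep_long`, the sample strings, and the choice to trace both steps with a dispatcher tracer.

```python
def trace_dispatcher(T, I, o):
    """Dispatcher tracing: input items whose individual output contains o."""
    return {i for i in I if o in T({i})}

def trace_sequence(steps, source, o):
    """steps is a list of (T, tracer) pairs applied in order to source."""
    # Forward pass: keep the input of every transformation.
    inputs = [source]
    for T, _tracer in steps:
        inputs.append(T(inputs[-1]))
    # Backward pass: trace the current item set through one step at a time.
    current = {o}
    for (T, tracer), I in zip(reversed(steps), reversed(inputs[:-1])):
        current = set().union(*(tracer(T, I, item) for item in current))
    return current

def split_words(lines):
    """Toy dispatcher: one output item per word."""
    return {w for line in lines for w in line.split()}

def keep_long(words):
    """Toy filter: keep words longer than three characters."""
    return {w for w in words if len(w) > 3}

source = {"data lineage", "view data", "of a"}
steps = [(split_words, trace_dispatcher), (keep_long, trace_dispatcher)]
traced = trace_sequence(steps, source, "data")
```

The list `inputs` holds one materialized result per transformation, which is the storage cost that motivates combining transformations before tracing.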
47
Transformation Sequences
[Figure: the sequence I → T1 → T2 → T3 → … → Tn → O is combined into I → T' → Tn → O]
Combine transformations and trace as one
– Reduces number of intermediate results
– By combining judiciously: reduces tracing cost, doesn't lose accuracy