Practical Lineage Tracing in Data Warehouses Paper by
Practical Lineage Tracing in Data Warehouses Paper by Y. Cui and J. Widom Appeared in ICDE 2000 Presented by Royi Ronen in Seminar in Databases (236826), Winter 2009
Introduction
A visit to a computing center of a restaurant View Menu(item, cost, price) soup 5 7 coffee 3. 9 6 salad 7 10 cake 8 steak 10 12 … … … > 5 What made the price high? How can we solve the problem? Database product(name, cost) labor(id, cost) overheads(name, cost) operations(name, cost)
The Lineage Problem n Given: n n A view V A database instance D A data item d in a tuple in V(D) Find: n All data items that produced d and the process in which d was produced Lineage = שושלת יוחסין , אילן יוחסין
Motivation n In many data analysis and management scenarios, the source of the data is valuable n n n OLAP (online analytical processing) When sources are of different qualities (certainty, reliability, etc. ) Scientific databases Top-down Datalog evaluation On-line monitoring This is the first research to discuss the problem
Example - I DB Schema View
Example - II Promising e. industry computer medicine
Example - III n What is the exact set of data items which produced computer according to view Promising?
View Data Lineage
Tuple Lineage for one Operator n Let Op be an operator from { , , , } n Let T=Op(T 1, …, Tm) , t T t’s lineage in (T 1, …, Tm) according to Op is: Op-1<T 1, …, Tm>= T*1, …, T*m where T*i are the maximal sets s. t. lineage tuples derive exactly t every tuple contributes to t n (a) Op(T*1, …, T*m) = {t} n (b) T*i t* Ti : Op(T*1, …, {t*}, …, T*m)
Discussion n Op(T*1, …, T*m) = {t} Alone, this condition could be met even if many non-relevant tuples are in T*i n n T*i t* Ti : Op(T*1, …, {t*}, …, T*m) Alone, this condition could be met by many tuples not at all related to t Together, the two conditions define the lineage
Example T(X, Y) a 2 a 3 b 6 b 9 c 3 d 4 a 1 d 5 x, sum(Y)(T) t= a 6 b 15 c 3 d 9 a 2 a 3 a 1 Lineage of (a, 6)
Tuple lineage for a view n A view definition has many operators n We assume that views are evaluated as a query tree, n n bottom-up The lineage in D of a tuple t according to v(D), v-1 D(t), is defined by recursively generalizing tuple lineage for an operator n Basis: t contributes to itself in V, when the view is just a table n Step: previous definition of an operator Transitivity: if t 1 contributes to t 2, and t 2 contributes to t 3, then t 1 contributes to t 3
Example V = X, sum(Y) ( Y>0(R S))
Canonical form for ASPJ views n n n Any aggregate-select-project-join (ASPJ) view can be transformed to an equivalent canonical form The canonical form consists of nested ASPJ segments of the form agg-project-select-join Example: The Promising View is canonical, with two levels Segment 1 Segment 2
Canonicalization Algorithm
Lineage Tracing Query n n n Let D be a database instance, Let v be a view definition Let t v(D) Then, TQt, v is a lineage tracing query if. TQt, v(D) = v-1 D(t) And for a set T, TQT, v(D)
Lineage Tracing Query for one -level ASPJ Views n Consider a query in canonical form n The tracing query for a tuple t is n And for a set T split turns the table into multiple tables with projections
Lineage Tracing for multi-level Views
Example - Tracing the lineage of view Promising
Auxiliary Views
Motivation n In a distributed environment, querying data sources is a difficult problem n n Access costs Network costs Not always accessible What Storing auxiliary views in the warehouse should we can help store? ?
Scope n n We deal with one-level SPJ view only Extension to ASPJ views and to multilevel ASPJ view are straightforward and done on [Cui and Widom, DMDW 2000]
Tracing query trees for SPJ views view tracing
(Method 1: Store Nothing (N n A degenerated case where no auxiliary views are stored User view is n Lineage tracing query is n n n Very low storage costs No aux. view storage or aux. view updating costs Tracing query has large costs, particularly network User view has maintenance cost
(Method 2: Base Tables (BT n Auxiliary views are base tables after selection, BTi User view is n Lineage tracing query is n n n High storage costs, tables are large (even after selection) Maintenance of aux. views is fast (unprocessed tables) Tracing query has processing costs but not network costs User view has to be maintained
(Method 3: Lineage View (LV n Auxiliary view: User view is n Lineage tracing query is (query tree (a)) n n n Large storing costs (for the join) Maintenance of lineage view is expensive Very good tracing performance, LV appears as-is in tracing query Maintenance of lineage views helps maintaining user view
Performance
Method 4: Store Split Lineage (Tables (SLT n Auxiliary views (Ti are source tables): n User view is n Lineage tracing query is: n n n Very good when LV joins are large Usually small storage costs (LV is not materialized) Same maintenance cost as in method LV Tracing cost is low, yet higher than LV because more than a simple semi-join is performed
Method 5: Store Partial Base (Tables (PBT n Auxiliary views (V is the user view): n User view is n Lineage tracing query is: n n n Smaller storage comparing to BT Maintenance is costly, user view has to be maintained before aux. views Tracing benefits from operating on small tables
Method 6: Store Base Tables (Projections (BP n n n Auxiliary views (Ai includes key atts. , atts. projected in V and atts. involved in the join): What is the User view is assumption Lineage tracing query is: here? n n n Small storage due to usually small tables Cheap maintenance (tables, not join, are maintained) However, source tables have to be queried in tracing, rendering tracing relatively expensive
Method 7: Store Linear View (Projections (LP n Auxiliary views (A are atts in V, Ki are key atts. in Ti ): n User view is n Lineage tracing query is: n n n Small storage due to small tables Maintenance higher than BP due to join Small tracing cost, but sources have to be queried
Performance
Self maintainability n Previous results show to store more data in order to make views self- maintainable n n n [Quass, Gupta, Mumick and Widom 1996] Done using… auxiliary views Maintenance is done using delta relations Methods 5, 6, 7 have a self maintainable version: S-PBT, S-BP, S-LP
Experiments
Storage Costs
Total time Including user-view maintenance
Cost Model n Maintenance / Tracing cost: Disk cost * num of I/Os + Trans cost * num of transmitted bytes + Msg cost * num of network messages
Impact of table size on storage
Impact of table size on time
Conclusions
Results In Brief n n Definitions and problem formulation Lineage tracing n n For an operator For views in a canonical form Auxiliary views Performance study
- Slides: 42