Provenance Semirings
T. J. Green, G. Karvounarakis, V. Tannen
University of Pennsylvania
Principles of Provenance (PrOPr)
Philadelphia, PA, June 26, 2007
PROPR 2007

Provenance
● First studied in data warehousing
  ▪ Lineage [Cui, Widom, Wiener 2000]
● Scientific applications (to assess quality of data)
  ▪ Why-Provenance [Buneman, Khanna, Tan 2001]
● Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)
  ▪ Trust conditions based on provenance
  ▪ Deletion propagation

Annotated relations
● Provenance: an annotation on tuples
● Our observation: propagating provenance/lineage through views is similar to querying
  ▪ Incomplete Databases (conditional tables)
  ▪ Probabilistic Databases (independent tuple tables)
  ▪ Bag Semantics Databases (tuples with multiplicities)
● Hence we look at queries on relations with annotated tuples

Incomplete databases: boolean C-tables

R (each tuple annotated with a boolean variable)
a b c   p
d b e   r
f g e   s

Semantics: a set of instances
I(R) = { ∅, {abc}, {dbe}, {fge}, {abc, dbe}, {abc, fge}, {dbe, fge}, {abc, dbe, fge} }
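The set-of-instances semantics can be sketched in a few lines of Python. This is only an illustration: the dictionary `R` and the helper `possible_worlds` are names invented here, and the sketch assumes each tuple carries a single boolean variable, as on the slide.

```python
from itertools import product

# Boolean C-table from the slide: each tuple is annotated with one variable.
R = {("a", "b", "c"): "p", ("d", "b", "e"): "r", ("f", "g", "e"): "s"}

def possible_worlds(ctable):
    """Return the set of instances, one per truth assignment of the variables."""
    variables = sorted(set(ctable.values()))
    worlds = set()
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        # A tuple is present in the instance iff its variable is true.
        worlds.add(frozenset(t for t, v in ctable.items() if valuation[v]))
    return worlds

worlds = possible_worlds(R)
print(len(worlds))  # three independent variables give 2**3 instances
```

Three independent variables yield the eight instances listed in I(R), from the empty instance up to the full table.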

Imielinski & Lipski (1984): queries on C-tables

R
a b c   p
d b e   r
f g e   s

Union of conjunctive queries (UCQ):
q(x, z) :- R(x, _, z), R(_, _, z)
q(x, z) :- R(x, y, _), R(_, y, z)

q(R)
a c   (p ∧ p) ∨ (p ∧ p)   =  p
a e   p ∧ r               =  p ∧ r
d c   r ∧ p               =  p ∧ r
d e   (r ∧ r) ∨ (r ∧ s)   =  r
f e   (s ∧ s) ∨ (s ∧ r)   =  s

With p = true, r = false, s = true, the answer instance is {ac, fe}.
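To see how a valuation picks one answer instance out of the annotated result, here is a small sketch. The conditions are copied from the slide as Python predicates; the names `conditions` and `instantiate` are made up for this illustration.

```python
# Conditions of q(R) as shown on the slide, written as predicates over p, r, s.
conditions = {
    ("a", "c"): lambda p, r, s: (p and p) or (p and p),
    ("a", "e"): lambda p, r, s: p and r,
    ("d", "c"): lambda p, r, s: r and p,
    ("d", "e"): lambda p, r, s: (r and r) or (r and s),
    ("f", "e"): lambda p, r, s: (s and s) or (s and r),
}

def instantiate(conditions, **valuation):
    """Keep the output tuples whose condition holds under the valuation."""
    return {t for t, cond in conditions.items() if cond(**valuation)}

answer = instantiate(conditions, p=True, r=False, s=True)
print(sorted(answer))  # [('a', 'c'), ('f', 'e')]
```

This agrees with evaluating the query directly on the instance {abc, fge} that the same valuation selects from R.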

Why-provenance/lineage
Which input tuples contribute to the presence of a tuple in the output?

R (with tuple ids)
a b c   p
d b e   r
f g e   s

q(R) (same query)
a c   {p}
a e   {p, r}
d c   {p, r}
d e   {r, s}
f e   {r, s}

[Cui, Widom, Wiener 2000] [Buneman, Khanna, Tan 2001]

C-tables vs. Why-provenance

       c-table calculations     Why-provenance calculations
a c    (p ∧ p) ∨ (p ∧ p)        ({p} ∪ {p}) ∪ ({p} ∪ {p})
a e    p ∧ r                    {p} ∪ {r}
d c    r ∧ p                    {r} ∪ {p}
d e    (r ∧ r) ∨ (r ∧ s)        ({r} ∪ {r}) ∪ ({r} ∪ {s})
f e    (s ∧ s) ∨ (s ∧ r)        ({s} ∪ {s}) ∪ ({s} ∪ {r})

The structure of the calculations is the same!

Another analogy, with bag semantics

R (with tuple multiplicities)
a b c   2
d b e   5
f g e   1

q(R) (same query)
       c-table calculations     multiplicity calculations     result
a c    (p ∧ p) ∨ (p ∧ p)        2·2 + 2·2                     8
a e    p ∧ r                    2·5                           10
d c    r ∧ p                    5·2                           10
d e    (r ∧ r) ∨ (r ∧ s)        5·5 + 5·5 + 5·1               55
f e    (s ∧ s) ∨ (s ∧ r)        1·1 + 1·1 + 1·5               7

For d e and f e the c-table column lists the repeated disjunct (r ∧ r, resp. s ∧ s) only once, since ∨ is idempotent; bag semantics counts both derivations.

The structure of the calculations is the same!
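The multiplicities can be reproduced with a short sketch that evaluates the two rules of the query directly. The function `q` below is a hypothetical helper written for this illustration, not code from the talk; it takes the "union" and "join" operations as parameters so that only the annotation arithmetic changes between settings.

```python
import operator

def q(R, plus, times):
    """Evaluate the slide's UCQ over a relation whose tuples carry annotations."""
    out = {}

    def emit(key, value):
        # Alternative derivations of the same output tuple are combined with plus.
        out[key] = plus(out[key], value) if key in out else value

    for (x, y, z), a1 in R.items():
        for (_, y2, z2), a2 in R.items():
            if z2 == z:   # q(x, z) :- R(x, _, z), R(_, _, z)
                emit((x, z), times(a1, a2))
            if y2 == y:   # q(x, z) :- R(x, y, _), R(_, y, z)
                emit((x, z2), times(a1, a2))
    return out

R_bag = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1}
counts = q(R_bag, operator.add, operator.mul)
print(counts[("d", "e")])  # 55
```

Passing `operator.or_` and `operator.and_` with boolean annotations instead gives the set-semantics answer from the same function.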

Abstracting the structure of these calculations

         C-tables   Bags   Why-provenance   Abstract
join     ∧          ·      ∪                ·
union    ∨          +      ∪                +

Abstract calculations (every derivation is listed, since + need not be idempotent)
a c   (p · p) + (p · p)
a e   p · r
d c   r · p
d e   (r · r) + (r · r) + (r · s)
f e   (s · s) + (s · s) + (s · r)

These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples.
We shall use these expressions as provenance.

Positive K-relational algebra
● We define an RA+ on K-relations:
  ▪ The · corresponds to join
  ▪ The + corresponds to union and projection
  ▪ 0 and 1 are used for selection predicates
  ▪ Details in the paper (but recall how we evaluated the UCQ q earlier; we will see another example later)
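A minimal sketch of such operators, assuming a K-relation is stored as a Python dict from tuples to non-zero annotations. The `Semiring` record, `row`, and the operator functions are all names invented for this illustration; the precise definitions are the ones in the paper.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]

def row(**fields):
    """A tuple with named attributes, hashable so it can key a dict."""
    return frozenset(fields.items())

def _accumulate(K, pairs):
    """Combine annotations of equal tuples with K.plus; drop zero annotations."""
    out = {}
    for t, a in pairs:
        out[t] = K.plus(out.get(t, K.zero), a)
    return {t: a for t, a in out.items() if a != K.zero}

def union(K, r1, r2):
    return _accumulate(K, list(r1.items()) + list(r2.items()))

def project(K, r, attrs):
    return _accumulate(
        K, ((frozenset(p for p in t if p[0] in attrs), a) for t, a in r.items()))

def select(K, r, pred):
    # The predicate's truth value is read as K.one or K.zero.
    return _accumulate(
        K, ((t, K.times(a, K.one if pred(dict(t)) else K.zero)) for t, a in r.items()))

def join(K, r1, r2):
    def pairs():
        for t1, a1 in r1.items():
            for t2, a2 in r2.items():
                d1, d2 = dict(t1), dict(t2)
                # Natural join: tuples must agree on their shared attributes.
                if all(d1[k] == d2[k] for k in d1.keys() & d2.keys()):
                    yield frozenset({**d1, **d2}.items()), K.times(a1, a2)
    return _accumulate(K, pairs())

NAT = Semiring(0, 1, operator.add, operator.mul)  # bag semantics
R = {row(A="a", B="b", C="c"): 2,
     row(A="d", B="b", C="e"): 5,
     row(A="f", B="g", C="e"): 1}

# One possible RA+ rendering of the two rules of the UCQ q.
rule1 = join(NAT, project(NAT, R, {"A", "C"}), project(NAT, R, {"C"}))
rule2 = project(
    NAT, join(NAT, project(NAT, R, {"A", "B"}), project(NAT, R, {"B", "C"})), {"A", "C"})
result = union(NAT, rule1, rule2)
```

Swapping `NAT` for another `Semiring` instance changes the annotation arithmetic without touching the operators, which is the point of defining RA+ over an arbitrary K.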

RA+ identities imply semiring structure!
● Common RA+ identities
  ▪ Union and join are associative, commutative
  ▪ Join distributes over union
  ▪ etc. (but not idempotence!)
These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring:
  ▪ (K, +, 0) is a commutative monoid
  ▪ (K, ·, 1) is a commutative monoid
  ▪ · distributes over +, etc.
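The axioms can be spot-checked by brute force on a finite sample of elements. The helpers below are hypothetical and only test the sample they are given, so for an infinite carrier such as N a passing check is evidence, not a proof.

```python
import operator
from itertools import product

def is_commutative_semiring(elems, plus, times, zero, one):
    """Check the commutative semiring axioms on every triple from elems."""
    for a, b, c in product(elems, repeat=3):
        ok = (
            plus(a, b) == plus(b, a)
            and times(a, b) == times(b, a)
            and plus(plus(a, b), c) == plus(a, plus(b, c))
            and times(times(a, b), c) == times(a, times(b, c))
            and times(a, plus(b, c)) == plus(times(a, b), times(a, c))
            and plus(a, zero) == a
            and times(a, one) == a
            and times(a, zero) == zero
        )
        if not ok:
            return False
    return True

def is_idempotent(elems, plus):
    """Idempotence of +, which RA+ identities do not require."""
    return all(plus(a, a) == a for a in elems)

booleans_ok = is_commutative_semiring([False, True], operator.or_, operator.and_, False, True)
naturals_ok = is_commutative_semiring(range(5), operator.add, operator.mul, 0, 1)
subtraction_ok = is_commutative_semiring(range(3), operator.sub, operator.mul, 0, 1)
```

On these samples the booleans and the naturals pass, subtraction in place of + fails, and only the boolean + is idempotent, which is why idempotence cannot be among the required identities if bags are to be covered.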

Calculations on annotated tables are particular cases

(B, ∨, ∧, false, true)            usual relational algebra
(N, +, ·, 0, 1)                   bag semantics
(PosBool(B), ∨, ∧, false, true)   boolean C-tables
(P(Ω), ∪, ∩, ∅, Ω)                probabilistic event tables
(P(X), ∪, ∪, ∅, ∅)                lineage/why-provenance

Provenance Semirings
● X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)
● N[X]: multivariate polynomials with coefficients in N and indeterminates in X
● (N[X], +, ·, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings
● N[X]-relations are the relations with provenance!
  ▪ The polynomials capture the propagation of provenance through (positive) relational algebra
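One way to read "abstract calculations in all semirings" is that a polynomial in N[X] can be evaluated in any commutative semiring once the tokens are given values. The sketch below assumes a dict-based encoding of polynomials chosen for this illustration; `evaluate` is a hypothetical helper.

```python
import operator

# 2r^2 + rs encoded as {monomial: coefficient};
# a monomial is a tuple of (variable, exponent) pairs.
poly = {(("r", 2),): 2, (("r", 1), ("s", 1)): 1}

def evaluate(poly, valuation, plus, times, zero, one):
    """Evaluate a polynomial in the semiring given by (plus, times, zero, one)."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for var, exp in monomial:
            for _ in range(exp):
                term = times(term, valuation[var])
        for _ in range(coeff):  # a coefficient n means "add the term n times"
            total = plus(total, term)
    return total

# Bag semantics: tokens become multiplicities.
bag = evaluate(poly, {"r": 5, "s": 1}, operator.add, operator.mul, 0, 1)
# Boolean semantics: tokens become truth values.
present = evaluate(poly, {"r": False, "s": True},
                   operator.or_, operator.and_, False, True)
print(bag, present)  # 55 False
```

The same polynomial yields the multiplicity 55 under bag semantics and "absent" under the valuation r = false, s = true, matching the earlier calculations for the tuple d e.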

A provenance calculation

q(x, z) :- R(x, _, z), R(_, _, z)
q(x, z) :- R(x, y, _), R(_, y, z)

R
a b c   p
d b e   r
f g e   s

q(R)
       provenance polynomial   Why-provenance
a c    2p²                     {p}
a e    pr                      {p, r}
d c    pr                      {p, r}
d e    2r² + rs                {r, s}
f e    2s² + rs                {r, s}

d e and f e: same why-provenance, different polynomials

● Not just why- but also how-provenance (encodes derivations)!
● More informative than why-provenance
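The polynomials in this table can be recomputed with a short sketch. Polynomials are encoded here as `Counter` objects mapping monomials (sorted tuples of tokens) to coefficients; this encoding and the helpers `var`, `p_plus`, `p_times`, `q`, and `why` are assumptions made for the illustration.

```python
from collections import Counter

def var(name):
    """The polynomial consisting of a single token."""
    return Counter({(name,): 1})

def p_plus(f, g):
    return f + g  # Counter addition sums the coefficients

def p_times(f, g):
    out = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def q(R, plus, times):
    """Evaluate the slide's UCQ over an annotated relation."""
    out = {}

    def emit(key, value):
        out[key] = plus(out[key], value) if key in out else value

    for (x, y, z), a1 in R.items():
        for (_, y2, z2), a2 in R.items():
            if z2 == z:   # q(x, z) :- R(x, _, z), R(_, _, z)
                emit((x, z), times(a1, a2))
            if y2 == y:   # q(x, z) :- R(x, y, _), R(_, y, z)
                emit((x, z2), times(a1, a2))
    return out

def why(poly):
    """Why-provenance as the set of tokens occurring in the polynomial."""
    return {token for monomial in poly for token in monomial}

R = {("a", "b", "c"): var("p"), ("d", "b", "e"): var("r"), ("f", "g", "e"): var("s")}
prov = q(R, p_plus, p_times)
print(dict(prov[("d", "e")]))  # {('r', 'r'): 2, ('r', 's'): 1}, i.e. 2r^2 + rs
```

Applying `why` to the polynomials for d e and f e gives {r, s} in both cases, although the polynomials themselves differ.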

Further work
● Application: P2P data sharing in the ORCHESTRA system:
  ▪ Need to express trust conditions based on provenance of tuples
  ▪ Incremental propagation of deletions
  ▪ Semiring provenance itself is incrementally maintainable
● Future extensions:
  ▪ full relational algebra: for difference we need semirings with “proper subtraction”
  ▪ richer data models: nested relations/complex values, XML