The Rise and Fall and Rise of Dependency Theory
Part II: The Rise from the Ashes

Ronald Fagin
IBM Almaden Research Center

Dependencies were Considered Harmful
• Dependencies were undesirable
  - Except for keys and referential integrity constraints
• Database normalization eliminated dependencies
  - BCNF: each FD is a logical consequence of keys
  - 4NF: each MVD is a logical consequence of keys
  - 5NF: each JD is a logical consequence of keys

But then:
• Dependencies took on a new, very positive role!

Data Integration and Data Exchange
• Data integration: describe data in a global schema in terms of data in local schemas
• Data exchange: describe data in a target schema in terms of data in a source schema, and actually produce the target database

Data Integration and Data Exchange
These are old, but recurrent, database problems
• Phil Bernstein, 2003: “Data exchange is the oldest database problem”
• EXPRESS: IBM San Jose Research Lab, 1977
  - for transforming data between hierarchical databases
• The universal relation model is an early case of data integration
We will focus mainly on data exchange

Schema Mappings & Data Exchange
[Diagram: source schema S with instance I, related by Σ to target schema T with instance J]
• Schema mapping M = (S, T, Σ)
  - Source schema S, target schema T
  - High-level, declarative assertions Σ that specify the relationship between S and T
• Data exchange via the schema mapping M = (S, T, Σ): transform a given source instance I to a target instance J, so that <I, J> satisfy the specifications Σ of M

Schema Mapping Specification Language
The relationship between source and target is typically given by source-to-target tuple-generating dependencies (s-t tgds)
  ∀x (φ(x) → ∃y ψ(x, y))
where
  - φ(x) is a conjunction of atoms over the source
  - ψ(x, y) is a conjunction of atoms over the target
Example:
  ∀s ∀c (Student(s) ∧ Enrolls(s, c) → ∃t ∃g (Teaches(t, c) ∧ Grade(s, c, g)))
There may also be target tgds and equality-generating dependencies (egds), for example:
  ∀s ∀c ∀g ∀g’ (Grade(s, c, g) ∧ Grade(s, c, g’) → g = g’)

New Role of Dependencies
• In data exchange, dependencies play a crucial role in describing how to transform data from one format to another

Solutions in Schema Mappings
Definition: Schema mapping M = (S, T, Σ). If I is a source instance, then a solution for I is a target instance J such that <I, J> satisfy Σ.
Fact: In general, for a given source instance I,
  - there may be no solutions at all, or
  - there may be multiple solutions; in fact, there may be infinitely many solutions
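To make the definition of a solution concrete, here is a minimal Python sketch that checks whether a pair <I, J> satisfies a single s-t tgd, using the Student/Enrolls example from the specification-language slide. The encoding is an assumption made for illustration, not something from the talk: a fact is a pair (relation name, tuple of values), every term in a dependency atom is a variable, and the names `matches`, `satisfies_tgd`, and the sample instances are hypothetical.

```python
def matches(atoms, instance, binding=None):
    """Yield each extension of `binding` that maps every atom to a fact of `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in instance:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        # setdefault binds a fresh variable, or returns the value already bound to it.
        if all(extended.setdefault(v, val) == val for v, val in zip(args, fact_args)):
            yield from matches(rest, instance, extended)


def satisfies_tgd(source, target, body, head):
    """True if every match of `body` in `source` extends to a match of `head` in `target`."""
    return all(
        any(True for _ in matches(head, target, b))
        for b in matches(body, source)
    )


# Hypothetical data for the Student/Enrolls example.
I = {("Student", ("ann",)), ("Enrolls", ("ann", "db"))}
body = [("Student", ("s",)), ("Enrolls", ("s", "c"))]
head = [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))]

J1 = {("Teaches", ("bob", "db")), ("Grade", ("ann", "db", "A"))}
J2 = {("Teaches", ("_N1", "db")), ("Grade", ("ann", "db", "_N2"))}
J3 = {("Teaches", ("bob", "db"))}  # no Grade fact for ann

print(satisfies_tgd(I, J1, body, head))  # True
print(satisfies_tgd(I, J2, body, head))  # True
print(satisfies_tgd(I, J3, body, head))  # False
```

J1 and J2 are both solutions for I, which illustrates that solutions need not be unique; J3 is not a solution.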

Universal Solutions in Data Exchange
• [Fagin, Kolaitis, Miller, Popa – ICDT 2003] introduced universal solutions as the “best” solutions in data exchange
  - By definition, a solution is universal if it has homomorphisms to all other solutions
  - Thus, it is a “most general” solution
• Constants: entries in source instances
• Variables (labeled nulls): entries besides constants in target instances
• Homomorphism h: J₁ → J₂ between target instances:
  - h(c) = c, if c is a constant
  - If P(a₁, …, aₘ) is in J₁, then P(h(a₁), …, h(aₘ)) is in J₂
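The homomorphism condition can be tested by brute force on small instances. The sketch below is illustrative only: it assumes facts are (relation name, tuple of values) pairs and that labeled nulls are strings starting with an underscore, both of which are conventions invented here. The function name `find_homomorphism` is hypothetical, and the search is exponential in the number of nulls.

```python
from itertools import product


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def find_homomorphism(j1, j2):
    """Return a map h on the nulls of j1 (constants stay fixed) with h(j1) inside j2, or None."""
    nulls = sorted({v for _, args in j1 for v in args if is_null(v)})
    values = sorted({v for _, args in j2 for v in args})
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        # Constants are not in h, so h.get(v, v) leaves them unchanged.
        if all((rel, tuple(h.get(v, v) for v in args)) in j2 for rel, args in j1):
            return h
    return None


# Hypothetical target instances for the Teaches/Grade example.
J_general = {("Teaches", ("_N1", "db")), ("Grade", ("ann", "db", "_N2"))}
J_specific = {("Teaches", ("bob", "db")), ("Grade", ("ann", "db", "A"))}

print(find_homomorphism(J_general, J_specific) is not None)  # True
print(find_homomorphism(J_specific, J_general) is not None)  # False
```

The instance with nulls maps into the instance with constants but not the other way round, which is the sense in which it is "more general".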

How to Obtain a Universal Solution?
• Answer: Use our old friend the chase!
Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]: If there is a solution, then the chase produces a universal solution.
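As a rough illustration of the chase, the following Python sketch handles the simplest case only: a mapping given by s-t tgds, with no egds and no target tgds, so the chase cannot fail and always terminates. It is not the full procedure from the cited paper. The fact encoding, the `_N<k>` naming of fresh nulls, and the function names are assumptions made for this example.

```python
from itertools import count


def matches(atoms, instance, binding=None):
    """Yield each extension of `binding` that maps every atom to a fact of `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in instance:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(v, val) == val for v, val in zip(args, fact_args)):
            yield from matches(rest, instance, extended)


def chase_st_tgds(source, tgds):
    """Chase `source` with s-t tgds given as (body, head) pairs; return the target instance."""
    target = set()
    fresh = count(1)
    for body, head in tgds:
        for b in list(matches(body, source)):
            if any(True for _ in matches(head, target, b)):
                continue  # the head is already satisfied for this match
            full = dict(b)
            for _, args in head:
                for var in args:
                    if var not in full:
                        # Existential variable: invent a fresh labeled null.
                        full[var] = f"_N{next(fresh)}"
            target |= {(rel, tuple(full[v] for v in args)) for rel, args in head}
    return target


I = {("Student", ("ann",)), ("Enrolls", ("ann", "db"))}
tgd = (
    [("Student", ("s",)), ("Enrolls", ("s", "c"))],
    [("Teaches", ("t", "c")), ("Grade", ("s", "c", "g"))],
)
print(sorted(chase_st_tgds(I, [tgd])))
# [('Grade', ('ann', 'db', '_N2')), ('Teaches', ('_N1', 'db'))]
```

The existential variables t and g become the labeled nulls _N1 and _N2, so the result commits to nothing beyond what the tgd requires.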

Standard schema mappings
• [Fagin, Kolaitis, Miller, Popa – ICDT 2003] define a weakly acyclic set of tgds
  - [Deutsch, Tannen – ICDT 2003] have a slightly more restrictive notion
• Let a standard schema mapping be one specified by s-t tgds, target egds, and a weakly acyclic set of target tgds
Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]: For standard schema mappings, the chase runs in polynomial time (data complexity).
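The slide does not spell out the weak acyclicity test, so the sketch below follows the usual definition from the data exchange literature as recalled here. Nodes are positions (relation, column index). For a universally quantified variable x that also occurs in the head, each body position of x gets a regular edge to every head position of x and a special edge to every head position of an existential variable. The set is weakly acyclic if no cycle passes through a special edge. The tgd encoding as (body, head) pairs, the 0-based column indices, and the function name are assumptions.

```python
def is_weakly_acyclic(tgds):
    """Check weak acyclicity of target tgds given as (body, head) lists of atoms."""
    regular, special = set(), set()
    for body, head in tgds:
        body_vars = {v for _, args in body for v in args}
        head_pos = [((rel, i), v) for rel, args in head for i, v in enumerate(args)]
        exported = body_vars & {v for _, v in head_pos}
        for rel, args in body:
            for i, v in enumerate(args):
                if v not in exported:
                    continue
                for pos, w in head_pos:
                    if w == v:
                        regular.add(((rel, i), pos))
                    elif w not in body_vars:  # w is existentially quantified
                        special.add(((rel, i), pos))
    edges = regular | special

    def reachable(start, goal):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(t for s, t in edges if s == node)
        return False

    # A special edge u -> v lies on a cycle exactly when v can reach u.
    return not any(reachable(v, u) for u, v in special)


# E(x, y) -> exists z. E(y, z): the chase can keep inventing nulls forever.
cyclic = [([("E", ("x", "y"))], [("E", ("y", "z"))])]
# Dept(d) -> exists m. Mgr(d, m);  Mgr(d, m) -> Emp(m)
acyclic = [
    ([("Dept", ("d",))], [("Mgr", ("d", "m"))]),
    ([("Mgr", ("d", "m"))], [("Emp", ("m",))]),
]
print(is_weakly_acyclic(cyclic))   # False
print(is_weakly_acyclic(acyclic))  # True
```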

Query Answering in Data Exchange
[Diagram: schema S with instance I, related by Σ to schema T with instance J; a query q is posed over T]
Question: What is the semantics of target query answering?
Definition: The certain answers of a query q over T on I are
  certain(q, I) = ∩ { q(J) : J is a solution for I }
Note: This is the standard semantics in data integration.

Computing the Certain Answers
Theorem [Fagin, Kolaitis, Miller, Popa – ICDT 2003]: Assume a standard schema mapping. Let q be a union of conjunctive queries over the target.
  - If I is a source instance and J is a universal solution for I, then certain(q, I) = the set of all “null-free” tuples in q(J).
  - Hence, certain(q, I) is computable in polynomial time:
    1. Compute a universal solution J, using the chase, in polynomial time
    2. Evaluate q(J) and remove tuples with nulls
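Step 2 of the recipe above can be sketched in a few lines of Python. The universal solution J is written out by hand here rather than computed. As before, the fact encoding, the convention that nulls are strings starting with an underscore, and all function names are assumptions for illustration; a union of conjunctive queries is encoded as a list of (answer variables, body atoms) pairs.

```python
def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def matches(atoms, instance, binding=None):
    """Yield each extension of `binding` that maps every atom to a fact of `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in instance:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(v, val) == val for v, val in zip(args, fact_args)):
            yield from matches(rest, instance, extended)


def certain_answers(ucq, universal_solution):
    """Evaluate a union of conjunctive queries on J and keep only the null-free tuples."""
    answers = set()
    for answer_vars, body in ucq:
        for b in matches(body, universal_solution):
            answers.add(tuple(b[v] for v in answer_vars))
    return {t for t in answers if not any(is_null(v) for v in t)}


# Hand-written universal solution for the Student/Enrolls example.
J = {("Teaches", ("_N1", "db")), ("Grade", ("ann", "db", "_N2"))}

who_is_graded = [(("s", "c"), [("Grade", ("s", "c", "g"))])]
who_teaches = [(("t",), [("Teaches", ("t", "c"))])]

print(certain_answers(who_is_graded, J))  # {('ann', 'db')}
print(certain_answers(who_teaches, J))    # set()
```

Every solution must grade ann in db, so that pair is certain; no particular teacher is forced, so the second query has no certain answers.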

Composing Schema Mappings
[Diagram: M₁₂ relates schema S₁ to schema S₂, M₂₃ relates schema S₂ to schema S₃, and M₁₃ relates S₁ directly to S₃]
• Given M₁₂ = (S₁, S₂, Σ₁₂) and M₂₃ = (S₂, S₃, Σ₂₃), derive a schema mapping M₁₃ = (S₁, S₃, Σ₁₃) that is “equivalent” to the sequence M₁₂ and M₂₃
• What does it mean for M₁₃ to be “equivalent” to the composition of M₁₂ and M₂₃?

Semantics of Composition
Σ₁₃ has to have the property that:
  <I₁, I₃> ⊨ Σ₁₃ if and only if there exists I₂ such that <I₁, I₂> ⊨ Σ₁₂ and <I₂, I₃> ⊨ Σ₂₃

Result of the composition
• Question: If M₁₂ and M₂₃ are each specified by s-t tgds, what language is needed for specifying the composition of M₁₂ and M₂₃?
• Answer [Fagin, Kolaitis, Popa, Tan – PODS 2004]: second-order tgds

Second-Order Tgds
Definition: Let S be a source schema and T a target schema. A second-order tuple-generating dependency (SO-tgd) is a formula of the form
  ∃f₁ … ∃fₘ ( (∀x₁ (φ₁ → ψ₁)) ∧ … ∧ (∀xₙ (φₙ → ψₙ)) )
where
  - each fᵢ is a function symbol
  - each φᵢ is a conjunction of atoms over S and equalities of terms
  - each ψᵢ is a conjunction of atoms over T
Example:
  ∃f ( ∀e (Emp(e) → Mgr(e, f(e))) ∧ ∀e (Emp(e) ∧ (e = f(e)) → SelfMgr(e)) )
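To see what the existential quantifier over f demands, the sketch below checks the Emp/Mgr/SelfMgr example on small instances by searching for a witness function. It is tailored to this one formula and is not a general SO-tgd evaluator: the first conjunct forces f(e) to be a manager listed for e in Mgr, so it is enough to try the values in Mgr's second column. The relation encodings as Python sets and the function name are assumptions.

```python
from itertools import product


def satisfies_example_so_tgd(emp, mgr, self_mgr):
    """Search for f with: Emp(e) -> Mgr(e, f(e)), and Emp(e) and e = f(e) -> SelfMgr(e)."""
    employees = sorted(emp)
    candidates = sorted({m for _, m in mgr})
    for image in product(candidates, repeat=len(employees)):
        f = dict(zip(employees, image))
        first = all((e, f[e]) in mgr for e in employees)
        second = all(e in self_mgr for e in employees if f[e] == e)
        if first and second:
            return True
    return False


emp = {"alice", "bob"}
mgr = {("alice", "carol"), ("bob", "bob")}

print(satisfies_example_so_tgd(emp, mgr, {"bob"}))  # True
print(satisfies_example_so_tgd(emp, mgr, set()))    # False
# With a second manager for bob, f can avoid the fixed point f(bob) = bob.
print(satisfies_example_so_tgd(emp, mgr | {("bob", "carol")}, set()))  # True
```

In the second call the only possible choice is f(bob) = bob, which triggers the second conjunct, and SelfMgr(bob) is missing.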

Composition and SO-Tgds
Theorem [Fagin, Kolaitis, Popa, Tan – PODS 2004]:
  - The composition of any finite sequence of schema mappings specified by s-t tgds can be specified by an SO-tgd
  - Conversely, every SO-tgd specifies the composition of a finite sequence of mappings that are each specified by s-t tgds
  - Recently, [Arenas, Fagin, Nash – ICDT 2010] showed that the sequence need only be of size 2

Composition with Target Constraints
• [Arenas, Fagin, Nash – ICDT 2010] defined s-t SO dependencies, which generalize SO-tgds by allowing not only target atoms but also equalities in the conclusion
• Theorem [Arenas, Fagin, Nash – ICDT 2010]:
  - The composition of any finite sequence of standard schema mappings can be specified by an s-t SO dependency (along with target egds and target tgds)
  - Conversely, every s-t SO dependency specifies the composition of a finite sequence of standard schema mappings
    In fact, again, the sequence need only be of size 2
• The chase procedure can be extended to schema mappings specified by s-t SO dependencies, so that it produces universal solutions in polynomial time (data complexity)

Conclusions
• Dependencies now play a crucial role in data integration and data exchange
• We even have second-order dependencies, which have in fact been implemented in IBM InfoSphere Data Architect
• Dependency theory is alive and well!

Extra slides

The Smallest Universal Solution
• Fact: Universal solutions need not be unique
• Question: Is there a “best” universal solution?
• Answer: [Fagin, Kolaitis, Popa – PODS 2003] took a “small is beautiful” approach: there is a smallest universal solution (if solutions exist); hence, the most compact one to materialize
  - Definition: The core of an instance J is the smallest subinstance J’ that is homomorphically equivalent to J
  - Fact: Every finite relational structure has a core
  - Fact: The core is unique up to isomorphism
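For a small instance, the core can be found by brute force: try every map of the nulls into the instance's own values, keep the maps whose image stays inside the instance (the endomorphisms), and return an image with the fewest facts. This is an exponential-time sketch for intuition only, not the polynomial-time algorithms cited in this talk. The fact encoding, the underscore convention for nulls, and the function names are assumptions.

```python
from itertools import product


def is_null(value):
    """Assumed convention: labeled nulls are strings that start with '_'."""
    return isinstance(value, str) and value.startswith("_")


def core(instance):
    """Return the image of an endomorphism of `instance` that has the fewest facts."""
    nulls = sorted({v for _, args in instance for v in args if is_null(v)})
    values = sorted({v for _, args in instance for v in args})
    best = set(instance)
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(rel, tuple(h.get(v, v) for v in args)) for rel, args in instance}
        # mapped <= instance means h is a homomorphism from the instance to itself.
        if mapped <= instance and len(mapped) < len(best):
            best = mapped
    return best


# Hypothetical instance: the null teacher _N1 is redundant next to bob.
J = {
    ("Teaches", ("_N1", "db")),
    ("Teaches", ("bob", "db")),
    ("Grade", ("ann", "db", "_N2")),
}
print(sorted(core(J)))
# [('Grade', ('ann', 'db', '_N2')), ('Teaches', ('bob', 'db'))]
```

A minimum-size endomorphic image cannot be mapped into any proper subinstance of itself, since composing the two maps would give a smaller image, which is why it is a core.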

Core: The smallest universal solution
Theorem [Fagin, Kolaitis, Popa – PODS 2003]:
  - All universal solutions have the same core
  - The core of the universal solutions is the smallest universal solution
  - If the target constraints are egds, then the core is polynomial-time computable (data complexity)
Theorem [Gottlob and Nash – PODS 2006]: If the target constraints are egds and a weakly acyclic set of tgds, then the core is polynomial-time computable.

Old Conclusions
• Dependencies now play a crucial role in data integration and data exchange
• We even have second-order dependencies, which have in fact been implemented in practice!
• Lately, even probabilistic dependencies have been studied
  - [Dong, Halevy, Yu – VLDB 2007]
  - [Das Sarma, Dong, Halevy – SIGMOD 2008]
  - [Fagin, Kimelfeld, Kolaitis – ICDT 2010]: probabilistic dependencies on probabilistic databases
• Dependency theory is alive and well!