
Views as Incomplete Databases: Certain & Possible Answers
• Views: an incomplete representation
• Certain and possible answers
• Complexity results for certain answers
2005 certain 1

Views: an incomplete representation
Given: a view definition V and a view extension I
• Sound V: I is contained in V(D)
• Complete V: I contains V(D)
• Precise V: I = V(D)
V may also be mixed: some views are sound, others are complete.
In general, more than one db D may exist s.t. I satisfies the stated condition with respect to V(D).

Example: teams in the World Cup soccer tournament
Global schema: Team(country, group), where group is the assignment for the 1st round
• Source 1: S-C(C), the countries that participate
• Source 2: S-Q(C), the countries that participated in qualifying games
• Source 3: S-T(C), the teams whose games will be on TV
For all three, the logical mapping is v(X) :- Team(X, Y)

Given V (including its specification as sound/complete/precise) and I:
poss(V, I) = {D | D is a db for which I is a possible view extension}
Since we have only the views, this is the set of possible databases.
• For sound views: an infinite set
• For complete views: contains the empty db
• For precise views: may be empty (inconsistent views)
Example:
  v1(X, Y) :- R(X, Y, Z), v1 = {(a, b), (b, c)}
  v2(X, Z) :- R(X, Y, Z), v2 = {(a, d), (c, e)}
As precise views these are inconsistent: v1 requires the X values of R to be {a, b}, while v2 requires them to be {a, c}.
* The above changes when the global db is known to satisfy constraints (e.g. keys)
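A minimal sketch of the precise-view example above, assuming both views are projections of a single ternary relation R. The function name `witness_db` and the second, consistent pair of extensions are illustrative assumptions, not part of the slides. For this particular pair of projections, a database exists exactly when the two extensions agree on their X values.

```python
def witness_db(v1, v2):
    """Return some R with proj_XY(R) == v1 and proj_XZ(R) == v2, or None.

    Assumes precise views v1(X, Y) :- R(X, Y, Z) and v2(X, Z) :- R(X, Y, Z).
    """
    if {x for x, _ in v1} != {x for x, _ in v2}:
        # Some X value is required by one view and forbidden by the other.
        return None
    # Join on X: every view tuple finds a partner, so both projections are exact.
    return {(x, y, z) for x, y in v1 for x2, z in v2 if x == x2}


# Extensions from the example: v1 needs X in {a, b}, v2 needs X in {a, c}.
print(witness_db({("a", "b"), ("b", "c")}, {("a", "d"), ("c", "e")}))

# Hypothetical consistent extensions (not from the slides).
print(witness_db({("a", "b")}, {("a", "d")}))
```

The first call finds no database, matching the "inconsistent views" case; the second returns a one-tuple R.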

Certain and possible answers
Now assume also a query Q.
• cert(Q, V, I): the tuples that are in Q(D) for every D in poss(V, I). Seems easier to compute; always finite.
• poss(Q, V, I): the tuples that are in Q(D) for some D in poss(V, I). May be infinite, and where do we obtain values not in I?
A possible approach: a finite representation of a possibly infinite family of partially unknown databases

We concentrate on certain answers: an absolute notion of answering queries using views.
cert(Q, V, I) depends on the soundness/completeness of the views.
Example:
  global: p(x, y)
  v1(x) :- p(x, y), v2(y) :- p(x, y)
  I = {v1(a), v2(b)}
  Q: q(x, y) :- p(x, y)
• Sound views: cert(Q, V, I) is empty
• Precise views: cert(Q, V, I) is {(a, b)}

An issue in query processing:
For the same example, let Q': s(x) :- p(x, y)
Under sound views, cert(Q', V, I) = {(a)} although cert(Q, V, I) is empty, so cert(Q', V, I) cannot be obtained by projecting cert(Q, V, I).
To allow relational algebra manipulation of certain answers, we need more than a simple relational representation!
We need algorithms for performing operations on representations of partially unknown db's (not in this course)
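The example with v1 and v2 can be checked by brute force. This is a minimal sketch, assuming a small fixed domain {a, b, c} purely to keep the enumeration finite: the real poss(V, I) for sound views is infinite, and in general a bounded domain can overestimate the certain answers. All helper names are illustrative assumptions.

```python
from itertools import combinations, product

DOMAIN = ("a", "b", "c")             # assumed finite domain, for illustration only
PAIRS = list(product(DOMAIN, repeat=2))
I1, I2 = {"a"}, {"b"}                # view extensions: I = {v1(a), v2(b)}


def all_databases():
    """Enumerate every instance of p(x, y) over DOMAIN (2**9 = 512 of them)."""
    for size in range(len(PAIRS) + 1):
        for combo in combinations(PAIRS, size):
            yield frozenset(combo)


def possible(db, mode):
    """Is db in poss(V, I) for v1(x) :- p(x, y) and v2(y) :- p(x, y)?"""
    v1 = {x for x, _ in db}
    v2 = {y for _, y in db}
    if mode == "sound":
        return I1 <= v1 and I2 <= v2
    if mode == "precise":
        return I1 == v1 and I2 == v2
    raise ValueError(mode)


def certain(query, mode):
    """Intersect query(db) over all possible dbs; None if there are none."""
    answers = None
    for db in all_databases():
        if possible(db, mode):
            result = query(db)
            answers = result if answers is None else answers & result
    return answers


def q(db):   # Q: q(x, y) :- p(x, y)
    return set(db)


def s(db):   # Q': s(x) :- p(x, y)
    return {(x,) for x, _ in db}


print(certain(q, "sound"), certain(q, "precise"), certain(s, "sound"))
```

On this bounded domain the results agree with the slides: Q has no certain answer under sound views, (a, b) under precise views, and Q' has the certain answer (a) under sound views.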

From now on: sound views, certain answers
This was investigated for views defined in L1 and queries defined in L2, where L1, L2 are in {CQ, CQ!=, NR-Datalog, Datalog, FO}
Results include:
• Complexity: lower bounds
• Algorithms: upper bounds

Complexity results for certain answers
Thm: for V in L1, Q in L2, the following are equivalent:
(a) computing cert(Q, V, I)
(b) deciding containment: is Q1 (in L1) contained in Q2 (in L2)?
• (a) is decidable iff (b) is
• When decidable:
    combined complexity of (a) = query complexity of (b)
    data complexity of (a) <= query complexity of (b)
[Data complexity: a function of the db size
 Query complexity: a function of the query size
 Combined complexity: a function of both]

Proof (sketch): given t, how hard is it to decide if t is in cert(Q, V, I)?
Let I = {vi(tij)}, and define Q' as follows:
Q' contains the rules that define V, and one more "large" rule whose head is q'(t) and whose body is the conjunction of all the facts vi(tij) in I
(t follows from the facts in I)
Claim: t is in cert(Q, V, I) iff Q' is contained in Q
Hence deciding if t is in cert(Q, V, I) is no harder than this containment
(Note: for L1 = CQ, we need to "massage" Q' into a CQ)
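To make the reduction concrete for L1 = L2 = CQ, the sketch below reuses the earlier example I = {v1(a), v2(b)} with v1(x) :- p(x, y) and v2(y) :- p(x, y). It assumes the claim reads "t is in cert(Q, V, I) iff Q' is contained in Q", unfolds the view definitions by hand (the "massage" into a CQ), and decides containment with the classical homomorphism test for conjunctive queries. That test is a standard technique swapped in here, not something the slides spell out; the term encoding (uppercase strings are variables) and all names are assumptions.

```python
def is_var(term):
    # Assumed encoding: variables start with an uppercase letter, constants do not.
    return term[:1].isupper()


def contained_in(q1, q2):
    """True iff CQ q1 is contained in CQ q2 (a homomorphism maps q2 into q1)."""
    (head1, body1), (head2, body2) = q1, q2

    def match(term2, term1, h):
        # Terms of q1 are frozen; variables of q2 may be mapped onto them.
        if is_var(term2):
            return h.setdefault(term2, term1) == term1
        return term2 == term1

    def search(i, h):
        if i == len(body2):
            return True
        pred, args = body2[i]
        for pred1, args1 in body1:
            if pred1 == pred and len(args1) == len(args):
                h2 = dict(h)
                if all(match(a, b, h2) for a, b in zip(args, args1)) and search(i + 1, h2):
                    return True
        return False

    h0 = {}
    if len(head1) != len(head2):
        return False
    if not all(match(a, b, h0) for a, b in zip(head2, head1)):
        return False
    return search(0, h0)


# Body of the large rule v1(a), v2(b), unfolded with fresh existential variables.
unfolded = [("p", ("a", "Y1")), ("p", ("X2", "b"))]

Q = (("X", "Y"), [("p", ("X", "Y"))])        # q(x, y) :- p(x, y)
Q_PROJ = (("X",), [("p", ("X", "Y"))])       # s(x) :- p(x, y)

print(contained_in((("a", "b"), unfolded), Q))     # is (a, b) certain for Q?
print(contained_in((("a",), unfolded), Q_PROJ))    # is (a) certain for Q'?
```

Under these assumptions the first test fails and the second succeeds, in line with the sound-view answers given for the example.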

Converse: how hard is it to check containment of Q1 in Q2?
Let p be a new predicate.
Define V by: the rules of Q1, and v(c) :- q1(X), p(X); let I = {v(c)}
Define Q by: the rules of Q2, and q(c) :- q2(X), p(X)
Then: (c) is in cert(Q, V, I) iff Q1 is contained in Q2

Consequences: computing certain answers (depends on L1, L2) is:
• undecidable for Datalog, FO
• decidable if one side <= Datalog and the other side <= nr-Datalog
For the decidable cases, the above gives combined complexity.
We are more interested in data complexity; here it is the same.
  Views \ Query   CQ      CQ!=    nr-datalog   Datalog   FO
  CQ              P       co-NP   P            P         undec
  CQ!=            P       co-NP   P            P         undec
  nr-datalog      co-NP where decidable, undec otherwise
  Datalog         co-NP where decidable, undec otherwise
  FO              undec
co-NP data complexity is bad: impractical to compute, no Datalog plan!
We will not prove the co-NP complexity results.

Claim: For Q in Datalog and V in CQ(!=), let V~ be the same view definition with the inequalities omitted.
Then cert(Q, V, I) = cert(Q, V~, I)
(Computing the certain answers from I using V without the inequalities gives the same results.)
Proof:
(a) V(D) is contained in V~(D), so poss(V, I) is contained in poss(V~, I); hence cert(Q, V~, I) is contained in cert(Q, V, I).
(b) Let t be in cert(Q, V, I) and let D be in poss(V~, I); we show that t is in Q(D).
  If D is also in poss(V, I): fine.
  If D is not in poss(V, I): there exists a larger D' in poss(V, I) s.t. t in Q(D') implies t in Q(D).
  Hence t is in cert(Q, V~, I).

Proof of the last claim:
D is not in poss(V, I), so some s is in I but not in V(D), because of some inequality.
Since s is in V(D''), the inequality involves an attribute in the view body.
So we can add some tuples to D to obtain D1 s.t. s is in V(D1).
Adding tuples for all such s gives D' that contains D, s.t. D' is in poss(V, I).
If t is in Q(D'), then, since Q has no inequalities, t is also in Q(D).

For CQ views, Datalog queries:
Query plan: a Datalog program P on V
exp(P): replace views by their definitions (using fresh names for existential variables)
P is maximally contained in Q if:
• exp(P)(D) is contained in Q(D)
• exp(P')(D) is contained in exp(P)(D) for every other plan P' that is contained in Q
Such a plan is best among all plans.
(This is a language-dependent notion: given a more expressive language, P may no longer be best.)
But if a plan delivers cert(Q, V, I), it is absolutely best.

Thm: For sound CQ views and Datalog queries, the inverse rules algorithm computes cert(Q, V, I)
(Thus, for this case, a Datalog query plan can give the absolute best possible answer.)
Corollary: If P is maximally contained in Q then, for all view instances I, P(I) = cert(Q, V, I)
We proceed to prove the theorem.

Def: A tableau is a collection of atoms, with constants and variables.
A tableau T represents a db D if there is a valuation h from T into D.
rep(T) = {D | for some h, D contains h(T)}

Claim: For a Datalog query Q and a tableau T,
cert(Q, rep(T)) = the tuples without variables in Q(T)
Proof:
(a) We can consider only D in rep(T) s.t. D = h(T): every tuple that is in Q(D') but not in Q(D), where D' is larger than D = h(T), is not in cert(Q, rep(T)).
(b) For such D, h(Q(T)) is contained in Q(D), so a ground tuple in Q(T) is in cert(Q, rep(T)).
(c) For a non-ground tuple t in Q(T), we can find D1, D2 in rep(T) that give different values to the variables in t, so no instance of this tuple is in cert(Q, rep(T)).

The inverse rules of V create, from a view extension I, a database whose elements include Skolem terms.
Consider each Skolem term to be a distinct variable. This is a tableau T(V, I).
Claim: T(V, I) represents poss(V, I)
Proof: easy
Corollary: the set of variable-free tuples in Q(T(V, I)) is cert(Q, V, I)
This is precisely what the inverse rules algorithm produces: for each I, the inverse rules produce T(V, I); then apply Q. End of story.
Next: one more (last) algorithm, for CQ queries and views, that is the fastest so far.
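A minimal sketch of the inverse rules idea, using the views v1(x) :- p(x, y) and v2(y) :- p(x, y) from the earlier example. It assumes view bodies contain only variables, encodes a Skolem term as a nested tuple (function name, head values), and evaluates only conjunctive queries, whereas the theorem covers Datalog queries. All function names are illustrative assumptions.

```python
def tableau(views, extension):
    """Apply the inverse rules: one fact per body atom per view tuple."""
    facts = set()
    for name, (head, body) in views.items():
        for tup in extension.get(name, ()):
            binding = dict(zip(head, tup))
            for pred, args in body:
                # Existential variables become Skolem terms over the head values.
                facts.add((pred, tuple(
                    binding[a] if a in binding else ("f_%s_%s" % (name, a), tup)
                    for a in args)))
    return facts


def eval_cq(head, body, facts):
    """Evaluate a conjunctive query by backtracking over its body atoms."""
    results = set()

    def extend(i, binding):
        if i == len(body):
            results.add(tuple(binding[v] for v in head))
            return
        pred, args = body[i]
        for fpred, fargs in facts:
            if fpred != pred or len(fargs) != len(args):
                continue
            new = dict(binding)
            if all(new.setdefault(a, val) == val for a, val in zip(args, fargs)):
                extend(i + 1, new)

    extend(0, {})
    return results


def certain_answers(head, body, facts):
    """Keep only tuples without Skolem terms (constants are plain strings)."""
    return {t for t in eval_cq(head, body, facts)
            if not any(isinstance(x, tuple) for x in t)}


VIEWS = {"v1": (("x",), [("p", ("x", "y"))]),
         "v2": (("y",), [("p", ("x", "y"))])}
T = tableau(VIEWS, {"v1": {("a",)}, "v2": {("b",)}})

print(certain_answers(("x", "y"), [("p", ("x", "y"))], T))   # Q
print(certain_answers(("x",), [("p", ("x", "y"))], T))       # Q'
```

Here T holds the two facts p(a, f(a)) and p(g(b), b); dropping tuples that mention a Skolem term leaves no answer for Q and the single answer (a) for Q'.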