
Queries with Difference on Probabilistic Databases
Sanjeev Khanna, Sudeepa Roy, Val Tannen
University of Pennsylvania

Probabilistic Databases
• To model and query uncertain data (sensor networks, information extraction, …)
• Possible worlds model
  – Each possible world W is a standard database instance and has a probability P[W]
  – Compact representation D, assuming tuples are independent

Example D (each tuple is listed with its probability):

    R:  a1  0.3        S:  a1  b1  0.1        T:  b1  0.7
        a2  0.4            a2  b1  0.5            b2  0.8
        a3  0.6            a3  b2  0.2            b3  0.4
                           a3  b3  0.1

Query Semantics
• Query semantics on probabilistic databases:
  – Apply the query q to each possible world W
  – Add up the probabilities of the worlds that give the same query answer A:
    $P[q(D) = A] = \sum_{W \,:\, q(W) = A} P[W]$
• Goal: evaluate P[q(D) = A] efficiently
  – Data complexity: we want time polynomial in n = |D|
• Can we always compute P[q(D)] efficiently?
  – No: in general it is #P-hard
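
The possible-worlds semantics can be made concrete with a brute-force sketch. It assumes the example instance D from the talk (with R(a1) at probability 0.3, as given on the event-variable slide) and the Boolean query q() :- R(x), S(x, y), T(y). The function and variable names are illustrative only, not the authors' code.

```python
from itertools import product

# Tuple probabilities of the example instance D (tuples are independent).
R = {("a1",): 0.3, ("a2",): 0.4, ("a3",): 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {("b1",): 0.7, ("b2",): 0.8, ("b3",): 0.4}


def q(r, s, t):
    """Boolean query q() :- R(x), S(x, y), T(y) on one ordinary instance."""
    return any((x,) in r and (y,) in t for (x, y) in s)


def prob_by_possible_worlds():
    """Sum P[W] over all worlds W in which q is true."""
    tuples = ([("R", k, p) for k, p in R.items()]
              + [("S", k, p) for k, p in S.items()]
              + [("T", k, p) for k, p in T.items()])
    total = 0.0
    # Each world keeps or drops every tuple independently.
    for present in product([False, True], repeat=len(tuples)):
        weight = 1.0
        world = {"R": set(), "S": set(), "T": set()}
        for (rel, key, p), keep in zip(tuples, present):
            weight *= p if keep else 1.0 - p
            if keep:
                world[rel].add(key)
        if q(world["R"], world["S"], world["T"]):
            total += weight
    return total
```

On this instance `prob_by_possible_worlds()` returns about 0.2547. The loop visits 2^10 worlds, which is why enumeration does not scale with |D|.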

Query Answering in Two Steps
Introduce an event variable for each tuple (P[w1] = 0.3, …).
• Step 1 (easy): compute the Boolean provenance of q(D) [FR ’97, Z ’97]
    f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
• Step 2 (hard): compute P[q(D)] = P[f], given P[w1] = 0.3, P[v1] = 0.1, …

Boolean query q() :- R(x), S(x, y), T(y)

D, with event variables and probabilities:

    R:  a1  w1  0.3     S:  a1  b1  v1  0.1     T:  b1  u1  0.7
        a2  w2  0.4         a2  b1  v2  0.5         b2  u2  0.8
        a3  w3  0.6         a3  b2  v3  0.2         b3  u3  0.4
                            a3  b3  v4  0.1
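
The two steps can be sketched as follows, assuming the event variables and probabilities of the instance D above. Step 1 collects one minterm per derivation of q. Step 2 is done here by brute force over truth assignments, which is only a stand-in for the hard part. All function names are hypothetical.

```python
from itertools import product

# Each relation maps a tuple to its event variable (names follow the slide).
R = {("a1",): "w1", ("a2",): "w2", ("a3",): "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2",
     ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {("b1",): "u1", ("b2",): "u2", ("b3",): "u3"}
PROB = {"w1": 0.3, "w2": 0.4, "w3": 0.6,
        "v1": 0.1, "v2": 0.5, "v3": 0.2, "v4": 0.1,
        "u1": 0.7, "u2": 0.8, "u3": 0.4}


def provenance():
    """Step 1: DNF provenance of q() :- R(x), S(x, y), T(y)."""
    return [(R[(x,)], v, T[(y,)])
            for (x, y), v in S.items()
            if (x,) in R and (y,) in T]


def probability(dnf):
    """Step 2, brute force: total weight of assignments satisfying the DNF."""
    names = sorted({v for term in dnf for v in term})
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        a = dict(zip(names, bits))
        if any(all(a[v] for v in term) for term in dnf):
            weight = 1.0
            for v in names:
                weight *= PROB[v] if a[v] else 1.0 - PROB[v]
            total += weight
    return total
```

`provenance()` yields the four minterms of f in the order shown on the slide, and `probability(provenance())` gives P[f].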

Probability Computation for Positive Queries
• Dichotomy result [DS ’04, ’07; DSS ’10]: given q as input, we can efficiently decide whether q is
  – Safe: safe plans run in poly-time on all instances, or
  – Unsafe: #P-hard, e.g. q() :- R(x), S(x, y), T(y)
• Instance-by-instance approach [SDG ’10, RPT ’11]
  – Both q and D are given as input
  – Poly-time algorithm to compute P[q(D)] in special cases, even if q is unsafe

What about queries with difference?

Boolean Provenances for Difference

    R:  b1  c1  u1      S:  c1  a1  v1      T:  a1  w1
        b2  c2  u2          c1  a2  v2          a2  w2
        b1  c3  u3          c2  a3  v3          a3  w3
                            c3  a2  v4

q1(x) :- R(x, y), S(y, z)
    b1:  u1(v1 + v2) + u3 v4
    b2:  u2 v3
q2(x) :- R(x, y), S(y, z), T(z)
    b1:  u1 v1 w1 + u1 v2 w2 + u3 v4 w2
    b2:  u2 v3 w3
q = q1 – q2 (the provenance of q2 is negated)
    b1:  (u1(v1 + v2) + u3 v4) · ¬(u1 v1 w1 + u1 v2 w2 + u3 v4 w2)
    b2:  (u2 v3) · ¬(u2 v3 w3)
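
A small truth-table check illustrates how the provenance of a difference combines its two inputs. The sketch assumes the standard reading "the q1 provenance holds and the q2 provenance does not", and confirms that the expression for b2 collapses to u2 v3 ¬w3. The helper names are made up for this example.

```python
from itertools import product


def f1_b2(a):
    """Provenance of answer b2 in q1."""
    return a["u2"] and a["v3"]


def f2_b2(a):
    """Provenance of answer b2 in q2."""
    return a["u2"] and a["v3"] and a["w3"]


def diff(f1, f2):
    """Provenance of q1 - q2: f1 is true and f2 is false."""
    return lambda a: f1(a) and not f2(a)


def equivalent(f, g, names):
    """Compare two Boolean functions on every assignment of `names`."""
    for bits in product([False, True], repeat=len(names)):
        a = dict(zip(names, bits))
        if bool(f(a)) != bool(g(a)):
            return False
    return True


def simplified_b2(a):
    """Candidate simplification: u2 v3 (not w3)."""
    return a["u2"] and a["v3"] and not a["w3"]


SAME = equivalent(diff(f1_b2, f2_b2), simplified_b2, ["u2", "v3", "w3"])
```

`SAME` is `True`. The expression for b1 does not simplify as easily, because its negated part shares variables with several positive terms.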

Previous Work on Difference
• [FOR ’11]
  – Framework for exact and approximate probability computation
  – But no guarantee of polynomial running time
• In fact, we show in this paper that with difference, in some cases no approximation exists (unless NP = RP)

How far can we go with difference in poly-time?

A Quick Comparison

Without difference:
• DNF of the Boolean provenance is poly-size (n^|q|)
• P[q(D)] is always approximable (FPRAS)

With difference:
• DNF of the Boolean provenance may be exponential in n
• P[q(D)] may not be approximable (no FPRAS)

FPRAS: Fully Polynomial Randomized Approximation Scheme. In time polynomial in n and 1/ε, compute an estimate $\hat{p}$ such that, with probability ≥ ¾,
$\hat{p} \in [(1 - \varepsilon)\, P[q(D)],\ (1 + \varepsilon)\, P[q(D)]]$
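
One way to see where an exponential DNF can come from is to negate a small DNF and expand the result. The sketch below is an illustration under its own assumptions, not a construction from the talk. It negates k terms over disjoint variable pairs and counts the minimal terms of the expanded DNF.

```python
from itertools import product


def negate_dnf(dnf):
    """Return a minimal DNF of NOT(dnf) for a positive DNF.

    Literals in the result are (variable, polarity) pairs. By De Morgan,
    NOT(t1 + ... + tk) is a product of clauses, one per term; distributing
    the product picks one negated variable from every term.
    """
    expanded = set()
    for choice in product(*dnf):
        expanded.add(frozenset((v, False) for v in choice))
    # Drop terms absorbed by a strictly smaller term.
    return {t for t in expanded if not any(o < t for o in expanded)}


def disjoint_pairs(k):
    """The DNF x1 y1 + x2 y2 + ... + xk yk."""
    return [(f"x{i}", f"y{i}") for i in range(1, k + 1)]


SIZES = [len(negate_dnf(disjoint_pairs(k))) for k in range(1, 6)]
```

`SIZES` is `[2, 4, 8, 16, 32]`: the input has k terms, while the DNF of its negation has 2^k.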

Our Results
• We study queries of the form q1 – q2 and their generalization
  – FPRAS: if q1 is any UCQ and q2 is any safe CQ⁻
  – #P-hardness: even if both q1 and q2 are safe CQ⁻
  – Inapproximability: even if q1 is the trivial TRUE query and q2 is a UCQ
• Our FPRAS result extends to a larger class of queries, of which q1 – q2 is a special case

[CQ⁻: conjunctive queries without self-joins]

Difference Rank
• Define the difference rank δ(q) of a query q recursively:
  – δ(R) = 0
  – δ(q1 – q2) = δ(q1) + δ(q2) + 1
    • R – S : rank 1
  – δ(q1 ⋈ q2) = δ(q1) + δ(q2)
    • (R – S1) ⋈ (R – S2) : rank 2
    • (R – T1) ⋈ T2 : rank 1
  – δ(q1 ∪ q2) = max(δ(q1), δ(q2))
    • ((R – S1) ⋈ (R – S2)) ∪ ((R – T1) ⋈ T2) : rank 2
  – Select, project: rank remains the same
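
The recursive definition translates directly into a function over a query syntax tree. The tree classes below are hypothetical and exist only to make the recursion runnable. Select and project are folded into a single `Unary` node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    name: str


@dataclass(frozen=True)
class Diff:
    left: object
    right: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Unary:  # select or project
    child: object


def rank(q):
    """Difference rank, following the recursive rules on the slide."""
    if isinstance(q, Rel):
        return 0
    if isinstance(q, Diff):
        return rank(q.left) + rank(q.right) + 1
    if isinstance(q, Join):
        return rank(q.left) + rank(q.right)
    if isinstance(q, Union):
        return max(rank(q.left), rank(q.right))
    if isinstance(q, Unary):
        return rank(q.child)
    raise TypeError(f"unknown query node: {q!r}")


R, S1, S2, T1, T2 = (Rel(n) for n in ("R", "S1", "S2", "T1", "T2"))
LEFT = Join(Diff(R, S1), Diff(R, S2))   # rank 2
RIGHT = Join(Diff(R, T1), T2)           # rank 1
```

With these definitions `rank(Union(LEFT, RIGHT))` is 2, matching the union example.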

FPRAS for queries q with δ(q) = 1, given that some conditions hold (inapproximable for δ(q) = 1 in general)

Steps in FPRAS
• Step 1: Compute the Boolean provenance of q(D) for any query q with δ(q) = 1
• Step 2: Write the Boolean provenance in a “Probability Friendly Form” (if possible)
• Step 3: FPRAS inspired by the Karp-Luby framework

Boolean Provenance for Queries q s.t. δ(q) = 1
Lemma: For any q with δ(q) = 1 and any D, the provenance f of q(D) is a sum of terms, each the product of a minterm and a negated DNF; f is poly-size in n = |D| and poly-time computable.

Probability Friendly Form (PFF)
• f is in PFF if the negated DNFs can be written as poly-size d-DNNFs (next slide)
• If f is in PFF, we can approximate P[f] using the Karp-Luby framework

d-DNNF [Darwiche ’01, ’02; DM ’02]
deterministic-Decomposable Negation Normal Form
• Deterministic: under any assignment, at most one child of a +-node is satisfied
• Decomposable: children of a ·-node do not share variables
• Negation normal form: no internal node is a negation
• In general, it can be a DAG
• Probability can be computed in linear time
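
The linear-time claim follows from the two structural properties: a ·-node multiplies the probabilities of its children, and a +-node adds them. A minimal sketch, assuming a tuple encoding of nodes that is made up for this example:

```python
# Hypothetical node encoding, for illustration only:
#   ("lit", var, polarity)   leaf literal
#   ("and", [children])      decomposable: children share no variables
#   ("or",  [children])      deterministic: children are mutually exclusive


def prob(node, p, cache=None):
    """Probability of a d-DNNF under independent variables.

    `p` maps each variable to P[var = True]. Results are cached per node
    object, so every node of a DAG is evaluated once.
    """
    cache = {} if cache is None else cache
    key = id(node)
    if key in cache:
        return cache[key]
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        value = p[var] if positive else 1.0 - p[var]
    elif kind == "and":
        value = 1.0
        for child in node[1]:
            value *= prob(child, p, cache)   # independent children
    else:
        value = sum(prob(child, p, cache) for child in node[1])  # disjoint
    cache[key] = value
    return value


# NOT(x y) written as a d-DNNF: (not x) + x (not y).
NOT_XY = ("or", [("lit", "x", False),
                 ("and", [("lit", "x", True), ("lit", "y", False)])])
```

For example, `prob(NOT_XY, {"x": 0.3, "y": 0.6})` agrees with 1 − 0.3 · 0.6 = 0.82.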

Karp-Luby Framework [KL ’83]
Given Boolean expression DAGs F1, …, Fm, let f = F1 + F2 + … + Fm.
P[f] can be approximated in poly-time (in m, n) if, for each i, in poly-time
(1) P[Fi] can be computed,
(2) it can be checked whether a given assignment satisfies Fi, and
(3) a random satisfying assignment of Fi can be sampled.
Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms, e.g. f = xyz + xyw + wuv
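
For the DNF special case, the three conditions are easy to satisfy, and a coverage-style estimator can be sketched as below. This is a textbook-style rendering under the assumption of independent variables and positive minterms with distinct variables. The sample count is a free parameter here rather than derived from ε, and the function names are illustrative.

```python
import random
from itertools import product


def karp_luby(terms, p, samples, rng):
    """Estimate P[F1 + ... + Fm] for positive minterms F1, ..., Fm."""
    variables = sorted({v for t in terms for v in t})
    weights = []
    for t in terms:                      # condition (1): P[Fi] is a product
        w = 1.0
        for v in t:
            w *= p[v]
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a term with probability proportional to P[Fi].
        i = rng.choices(range(len(terms)), weights=weights)[0]
        # Condition (3): sample an assignment conditioned on Fi being true.
        a = {v: (v in terms[i]) or (rng.random() < p[v]) for v in variables}
        # Condition (2): count the sample only if Fi is the first term it satisfies.
        first = next(j for j, t in enumerate(terms) if all(a[v] for v in t))
        hits += (first == i)
    return total * hits / samples


def exact(terms, p):
    """Brute-force P[f], for checking the estimate on small inputs."""
    variables = sorted({v for t in terms for v in t})
    result = 0.0
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if any(all(a[v] for v in t) for t in terms):
            w = 1.0
            for v in variables:
                w *= p[v] if a[v] else 1.0 - p[v]
            result += w
    return result


TERMS = [("x", "y", "z"), ("x", "y", "w"), ("w", "u", "v")]   # f = xyz + xyw + wuv
P_HALF = {v: 0.5 for v in "xyzwuv"}
```

With every variable at probability 0.5, `exact(TERMS, P_HALF)` is 0.28125, and `karp_luby(TERMS, P_HALF, 20000, random.Random(0))` should land close to it.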

Conditions (1) and (2) hold for PFF
• The product of a minterm and a d-DNNF is another d-DNNF
[Figure: d-DNNF example annotated with w2 = 1, z1 = 1]

Condition (3) also holds
Lemma: Generating a random satisfying assignment of a d-DNNF can be done in poly-time.
1. Process the nodes in reverse topological order
2. Generate a random satisfying assignment bottom-up
[Figure: d-DNNF over v1, v2 with partial assignments at the nodes (e.g. v1 = 1, v2 = 0); at a +-node, the assignment is chosen at random]
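
A sketch of sampling from a d-DNNF follows. Note that it is a variant of the procedure above, not the same one: instead of building assignments bottom-up, it computes node probabilities and then walks top-down, choosing a child of each +-node with probability proportional to its weight. The node encoding and function names are assumptions made for this example, and the formula is assumed to be satisfiable.

```python
import random

# Hypothetical node encoding: ("lit", var, polarity), ("and", [children]),
# ("or", [children]); "or" children are mutually exclusive and "and"
# children share no variables.


def prob(node, p):
    """Probability of a d-DNNF node (no caching; fine for a small sketch)."""
    if node[0] == "lit":
        return p[node[1]] if node[2] else 1.0 - p[node[1]]
    if node[0] == "and":
        out = 1.0
        for child in node[1]:
            out *= prob(child, p)
        return out
    return sum(prob(child, p) for child in node[1])


def holds(node, a):
    """Check whether assignment `a` satisfies the node."""
    if node[0] == "lit":
        return a[node[1]] == node[2]
    if node[0] == "and":
        return all(holds(child, a) for child in node[1])
    return any(holds(child, a) for child in node[1])


def sample(node, p, rng):
    """Draw an assignment from the input distribution conditioned on `node`."""
    forced = {}

    def walk(n):
        if n[0] == "lit":
            forced[n[1]] = n[2]
        elif n[0] == "and":
            for child in n[1]:
                walk(child)
        else:
            weights = [prob(child, p) for child in n[1]]
            walk(rng.choices(n[1], weights=weights)[0])

    walk(node)
    # Variables not fixed by the chosen branch keep their prior distribution.
    return {v: forced[v] if v in forced else rng.random() < p[v] for v in p}


# NOT(v1 v2) as a d-DNNF: (not v1) + v1 (not v2).
NOT_BOTH = ("or", [("lit", "v1", False),
                   ("and", [("lit", "v1", True), ("lit", "v2", False)])])
```

Every assignment returned by `sample(NOT_BOTH, {"v1": 0.5, "v2": 0.5}, rng)` satisfies the formula, and v1 is true in roughly one third of the draws, as the conditional distribution requires.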

Expressibility in PFF
So, if f is in PFF, we can approximate P[q(D)].
But can we decide in poly-time whether some sub-expressions of a Boolean expression have poly-size d-DNNFs?
• Not known
• But there are natural sufficient conditions that can be verified in poly-time
  – If certain sub-queries are safe and hence generate read-once expressions [OH ’08]
  – If sub-queries generate poly-size OBDDs [JS ’11]
  – Extends to the instance-by-instance approach (both q and D given)

#P-hardness for q1 – q2 where both q1 and q2 are safe CQ⁻

#P-hardness: Steps in the proof
“Hard” query q = q1 – q2
  – q1() :- R1(x, y1), R2(x, y2), R3(x, y3), R4(x, y4)
  – q2() :- R1(x1, y), R2(x2, y), R3(x3, y), R4(x4, y)
Chain of reductions, starting from a known #P-hard problem:
  1. Counting independent sets in 3-regular bipartite graphs [XZ ’06]
  2. Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings
  3. Computing P[q(D)] for the “hard” query q

Other Related Work
– Semantics of probabilistic query answering
  • Fuhr-Rölleke ’97, Zimányi ’97
– Dichotomy of CQ⁻, CQ and UCQ queries
  • Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
– Knowledge compilation techniques
  • Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
– Instance-by-instance approach
  • Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

Conclusions and Future Work
• A step towards understanding the complexity of exact and approximate computation for queries with difference operations
• Future work
  – Dichotomy results that syntactically classify difference queries (similar to positive UCQs)?
  – Extending the FPRAS to queries with difference rank > 1?
  – Experimental evaluation of our algorithms

Thank you
Questions?