General Database Statistics Using Maximum Entropy Raghav Kaushik
General Database Statistics Using Maximum Entropy Raghav Kaushik 1, Christopher Ré 2, and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison 3 University of Washington
2 Study Cardinality Estimation 1. Model: Information that optimizer knows Propose a declarative language with statistical assertions “We estimate that distinct # of Employees is 10” 2. Prediction: use the model to estimate cardinality of future queries Contribution: A principled, declarative approach to cardinality estimation based on Entropy Maximization.
3 Motivating Applications 1. Incorporate query feedback records Underutilized: No general purpose mechanism 2. Optimizers for new domains (DB Kit 2. 0) Cloud Computing, Information Extraction 3. Data generation and description
4 Outline • Statistical programs and desiderata • Semantics of Statistical Programs • Two examples • Conclusions
V(x) : - R(x, y), …. #V= 106 5 Statistical Assertions An assertion is a CQ Views + sharp (#) statement: V 1(x) : - R(x, -) #V 1 = 20 “The number of values in the output of V 1 is 20” V 2(y) : - R(-, y), S(y) #V 2 = 50 “The number of values in the output V 2 is 50” A program is a set of assertions
V(x) : - R(x, y), …. #V= 106 6 Model as a Probabilistic Database Intuitively, # is “Expected Value” V 1(x) : - R(x, -) #V 1 = 20 “The number of values in the output of V 1 is 20” A model is a probabilistic database s. t. the expected number of tuples in V 1 is 20. Ok, but which pdb?
7 V(x) : - R(x, y), …. #V= 106 Desiderata for our solution • Two Desiderata for the distribution (D 1): Should agree with provided statistics (D 2): Should assume nothing else Approach: maximize entropy subject to D 1 Challenge: Compute params of Max. Ent Distribution Technical Desideratum: want params analytically
8 Outline • Statistical programs and desiderata • Semantics of Statistical Programs • Two examples • Conclusions
9 Notation for Probabilistic Databases • • Consider a domain D of size n. Fix a schema R=R 1, R 2, … Let Inst(n) = all instances over R on D An element I of Inst(n) is called a world
10 Notation for Probabilistic Databases • • Consider a domain D of size n. Fix a schema R=R 1, R 2, … Let Inst(n) = all instances over R on D An element I of Inst(n) is called a world A probabilistic database is a pair (Inst(n), p) Essentially, any discrete probability distribution on relations
Achieving (D 1): Stats must agree 11 The semantics of # # means “expected value” V 1(x) : - R(x, -) #V 1 = 20 “The number of values in the output of V 1 is 20” NB: In truth, we let n tend to infinity, and settle for asymptotically equal.
12 Achieving (D 1): Stats must agree Multiple Views • Given V 1, V 2, … with #Vi = di for i=1, …, t If p satisfies these equations, we’ve achieved: (D 1): Should agree with provided statistics Many such distributions exist. How do we pick one?
13 Achieving (D 2) : No ad-hoc assumptions Selecting the best one • Maximize Entropy subject to constraints:
14 Achieving (D 2) : No ad-hoc assumptions Selecting the best one • Maximize Entropy subject to constraints: One can show that p has following form: NB: p is only a function of the stats, and so we have achieved (D 2) Z is normalizing constant and ai is positive parameter for i=1, . . , t
15 Benefits of Max. Ent A statistical program • Every (consistent) statistical program induces a well-defined distribution – Every query has a well-defined cardinality estimate • Statistics as a whole, not as individual stats. • Can add new statistics to our heart’s content Technical Challenge: ai analytically
16 Outline • Statistical programs and desiderata • Semantics of Statistical Programs • Two examples • Conclusions
17 Two quick Examples • I: A material random Graph – Even simple EM solutions have interesting theory • II: Intersection Models – Generating function , and – Different, analytic technique
18 Example I: Random Graphs are EM V(x, y) : - R(x, y) #V = d Random Graph: Add edges independently at random
19 Example I: Random Graphs are EM V(x, y) : - R(x, y) #V = d By Linearity, E[V] = xn 2 = d Random Graph: Add edges independently at random
20 Example I: Random Graphs are EM V(x, y) : - R(x, y) #V = d By Linearity, E[V] = xn 2 = d This is Max. Ent…write: Random Graph: Add edges independently at random
21 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
22 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
23 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
24 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
25 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
26 Example II: an intersection model V(x) : - R 1(x), R 2(x) #R 1 = d 1 , #R 2 = d 2 , #V = d 3 Read: Each element is either in R 1, R 2, or all three e. g. , term with x 1 k is an instance where k distinct values in R 1
27 Results in the paper • Normal Form for statistical programs • Syntactic classes that we can solve analytically – “Project-Semijoin” queries (previous slide) • A general technique, conditioning: – Start with tuple independent prior, and condition – Introduces inclusion constraints • Extensions to handle histograms
28 Conclusion • Showed a principled, general model for database statistics based on Max. Ent • Analytically solved syntactic classes of statistics • Applications: Query Feedback and the Cloud
- Slides: 28