Partial Results in Database Systems Willis Lang Rimma

Partial Results in Database Systems Willis Lang Rimma V. Nehme Eric Robinson Jeffrey F. Naughton Microsoft Gray Systems Lab University of Wisconsin SIGMOD’ 14 Proceedings of the 2014 ACM SIGMOD international conference on Management of data, June 22– 27, 2014, Snowbird, UT, USA 2015/04/28 M 1 andy 1

Main Author ØWillis Lang • Research Engineer of Gray Systems Lab • Ph. D. University of Wisconsin - Madison Microsoft üdatabase 2

Introduction • Evaluating relational queries over multiple information sources Øsome of them may return incomplete tuple sets • queries spanning a collection of loosely coupled cloud databases • queries in a traditional parallel RDBMS • queries in a single node system 3

Introduction ØSolutions to the problem: • either replicate the data sources comprising the distributed system or make them more reliable • add replication and failover to the nodes of a parallel DBMS • fix the incomplete tables and views in the single node system 4

Introduction üThese solutions can be either financially costly, performance hindering, or both üIn certain cases, an error external to the database or misconfigurations may be impossible to fix different approach Øletting the query run to completion despite the incomplete inputs 5

New Approach • The system needs to tell the user that the result is computed based upon incomplete data • Can we make any guarantees (about what the user may or may not receive)? • Can we classify the types of anomalies that might result? • Can we develop sound mechanisms for determining when anomalies can and cannot occur? 6 anomaly : 異常

Purpose of this paper ØTo present a broad classification of anomalies when evaluating queries over incomplete data ØTo show to detect the various anomalies that arise • by analyzing how they are created and propagated 7

taxonomy : 分類(法) Partial Results Taxonomy n“partial result” : a tuple set produced from some query execution where some data is unavailable ØA partial result may not be the same as the True result 8

Partial Results Taxonomy • Partial result semantics are based on a combination of two basic properties of a tuple set p cardinality and correctness • used to describe how a particular partial result tuple set differs from the True result 9

Two basic properties pcardinality • Incomplete aspect : missing some tuples • Phantom aspect : some extra tuples • Indeterminate : both Incomplete and Phantom • Complete : neither Incomplete nor Phantom üescalation when cardinality guarantees are lost, the state of the tuple set may be escalated to another state 10

Two basic properties escalation 11

Two basic properties pcorrectness (credibility) • Credible : correct data • Non-credible : cannot be guaranteed to be correct üdata can only lose the Credible guarantee when it is calculated during query processing • e. g. ) calculating a COUNT over a partial result that is Indeterminate 12

Partial Result Analysis Models ØFour models with different analysis “granularities” • many levels of result quality guarantees are possible 13 coarse fine

Query Model • treat a query as a black box that has produced a partial result • we do not know what is “wrong” with the partial result üany guarantees cannot be provided the partial results are classified as Indeterminate and Non-credible 14

Operator Model • look into a query and analyze its logical operator tree • we can distinguish between Incomplete and Complete tables üfor each operator • need to know the input’s partial semantics • need to determine the semantics the output tuple set result of 15

Operator Model • The operator model allows us to distinguish different partial result semantics • But it still produces overly conservative guarantees • It still treats the inputs and outputs as black boxes 16

Column Model • the credibility of different parts of a tuple can be tracked üneed to identify the parameters of the operators • to know which columns are processed 17

Column Model • The Column model provides finer granularity precision for making partial result guarantees • Credible and Non-credible columns are mixed 18

Partition Model • consider the input table to be a collection of partitions • use properties of partitions üneed to know which nodes are available or return incomplete data • to classify the other partitions of the table as Complete and Credible 19

Partition Model • Some tuples that are exactly the same as in the True result can be identified and returned • The partition model provides the most precise guarantees in its partial result semantics 20

Partial Result Analysis Models Øfour models with different analysis “granularities” 21 coarse fine

Propagation of Partial Result Semantics ØSo far • how to make the partial result guarantees • how to classify the result set ØNext • how relational operators propagate partial result semantics using the Partition Model prelational operators • four unary operators • three binary operators 22

Unary Operators pfour unary operators • SELECT, PROJECT, EXTENDED PROJECT and AGGREGATE • projection is differentiated into two categories • PROJECT : simply remove columns • EXTENDED PROJECT : define a new column through an expression can 23

SELECT ØIf SELECT operator has a predicate expression that operates over Non-credible columns • set the cardinality property of the result to Indeterminate ØIf the predicate is defined over all-Credible data • the partial result semantics are simply propagated from input to output SELECT m. name FROM member m WHERE m. grade > 2 Credible or Non-credible 24

PROJECT ØPROJECT operator only affects the partial result cardinality property of a tuple set • only when the tuple set is partitioned ØIf PROJECT operator removes a non-partitioning column • it simply propagates the remaining rows’ partial result semantics • PROJECT operator is not affected by, nor does it affect, the credibility of columns 25

PROJECT • a simple partitioned tuple set 26

PROJECT • a more complicated example 27

EXTENDED PROJECT ØIf an expression computes a value using -credible values as input Non • the output is also Non-credible ØIf the expression parameters are all-Credible • it produces a column guaranteed as Credible • EXTENDED PROJECT operator does not affect the cardinality semantics (Incomplete and Phantom) 28

AGGREGATE • five types of AGGREGATE functions : COUNT, SUM, AVG, MIN and MAX Øonly consider when the functions are applied over one column of the input tuple set 29

AGGREGATE 30

AGGREGATE 31

Binary Operators pthree binary operators • UNION ALL, CARTESIAN PRODUCT and SET DIFFERENCE 32

UNION ALL Øescalate the cardinality property of the output • based on the combination of the two input’s cardinality properties ØIf either of the corresponding input columns are Non-credible • an output column is escalated to Non-credible 33

UNION ALL 34

CARTESIAN PRODUCT ØCARTESIAN PRODUCT is not affected by, nor does it change the credibility of the data values ØBut it may or may not simply propagate the input semantics to the output 35

CARTESIAN PRODUCT 36

SET DIFFERENCE 37

SET DIFFERENCE 38

SET DIFFERENCE 39

Dependence on Query Plan A foundational principle of query evaluation in traditional settings: Øthe same result is computed independent of the query plan • it would be nice if this carried over to partial results analysis • but this is not the case 40 query plan : クエリ実行計画

Dependence on Query Plan • Can we make any guarantees? ØShow that for different orderings (plans) of commutative operators, we will have identical classifications of the partial result outputs • under the persistent failure model npersistent failure model: all failures occur before the query begins executing and persist throughout the entire execution commutative : 交換可能な 41

Partial Results Consistency with Operator Re-ordering ØProblem Statement: For two commutative operators α and β, we wish to show that for the sequences (α –> β) and (β –> α), under the persistent failure model, the partial result guarantees produced by the analysis are identical for any (partial) inputs. 42

Commutative Pairs of Unary Operators ØThe conditions under which two operators are commutative: commutative pairs of unary operators are two operators α and β, such that given an input tuple set R, if we execute either α–> β or β–>α, then the output will always be S üAGGREGATE is not commutative with PROJECT or EXTENDED PROJECT 43

Pairs involving SELECT • only affects partial result cardinality properties if the predicate is on a Non-credible column Øif the predicate defined on a all-Credible column • any commutative pairing with any operator will have consistent partial result semantics Øif with a Non-credible predicate • the output will always be Indeterminate 44

Commutative PROJECT and EXTENDED PROJECT • only concerned with the partial result cardinality property • EXTENDED PROJECT • only concerned with the partial result data correctness property Øalways classify the result the same way regardless of the ordering of this pair 45

Commutative Unary/Binary Pairs üAGGREGATE operator is not generally commutative with the binary operators 46

Pairs involving SELECT Øif the predicate defined on a all-Credible column • any commutative pairing with any operator will have consistent partial result semantics Øif with a Non-credible predicate • the UNION ALL operator is commutative • UNION ALL taints its output’s partial result semantics with the semantics of its inputs • the CARTESIAN PRODUCT or the SET DIFFERENCE operator are not commutative 47

SELECT and UNION ALL Pair 48

Pairs involving PROJECT Øonly commutative with the UNION ALL and CARTESIAN PRODUCT operators 49

PROJECT and CARTESIAN PRODUCT Pair 50

PROJECT and UNION ALL Pair 51

PROJECT and UNION ALL Pair 52

EXTENDED PROJECT Øcommutative with the UNION ALL operator and CARTESIAN PRODUCT operator • EXTENDED PROJECT • only involves data credibility • CARTESIAN PRODUCT • concerned with partial result cardinality 53

Pairs of Binary Operators ØBinary operators are only commutative (and associative) with themselves • only need to show that pairs of homogeneous binary operators will be consistently classified 54

Implementation • incorporating partial results analysis into an existing DBMS • requires minimal changes to the codebase • has almost no impact on the performance of the system • the partial results analyzer • take the detected runtime failures and the query plan • produce the partial results guarantees 55

Implementation • a prototype interface • connected to the partial result-aware database system 56

Conclusion • deal with the problem of partial results • query execution continues and results are produced even when some data may be unavailable • a classification framework equipped with four different models • for analyzing partial results semantics at various granularities 57