Continuous Query Languages (CQL): Blocking Operators and the Expressive Power Problem
Carlo Zaniolo, UCLA CSD, 2017

CQLs for DSMS
• Most DSMS projects use SQL for continuous queries, for good reasons:
  - Many applications span data streams and DB tables
  - A CQL based on SQL will be easier to learn and use
  - Moreover, the fewer the differences, the better!
• But DBMSs were designed for persistent data and transient queries, not for persistent queries on transient data
• Adapting SQL and its enabling technology presents difficult research challenges
• These combine with traditional SQL problems, such as the inability to deal with sequences, data mining (DM) tasks, and other complex query tasks, i.e., a lack of expressive power

Language Problems
• Most DSMSs use SQL, since queries spanning both data streams and DBs will be easier. But …
• Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include:
  - Data mining, and we need to mine data streams
  - Sequence queries, and data streams are unbounded sequences!
• There are major new problems for SQL on data stream applications (after all, it was designed for persistent data on secondary store, not for streaming data):
  - Only nonblocking operators can be used in a DSMS: blocking operators are forbidden
  - The distinction is not clear in DBMSs, which often use blocking implementations for nonblocking operators
  - The distinction needs to be formally characterized
  - And so does the resulting loss of query power for CQLs

Blocking Operators
• A blocking query operator is 'one that is unable to produce the first tuple of the output until it has seen the entire input' [Babcock et al. PODS 02]
• But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in. Blocking operators cannot be used.
• Only non-blocking (nb) queries and operators can be used on data streams (i.e., those that return their results before they have detected the end of the input).
• Current DBMSs make heavy use of blocking computations:
  1. For operators that are intrinsically blocking
  2. And for those that are not, i.e., they are only implemented that way
• To exclude exactly the operators in case 1, we need a characterization of blocking and nonblocking that is independent of implementation.

Partial Ordering
• Let $S = [t_1, \ldots, t_n]$ be a sequence and $0 \le k \le n$.
• Then $S_k = [t_1, \ldots, t_k]$ is said to be the presequence of $S$ of length $k > 0$.
• Also, $S_0 = [\,]$ denotes the empty sequence.
• $L \sqsubseteq S$ denotes that $L$ is a presequence of $S$.
• $\sqsubseteq$ defines a partial order: reflexive, antisymmetric, and transitive.
• The notion of subset is different from this 'pre-order': for sets, order and duplicates are immaterial.
• The empty sequence $[\,]$ is a presequence of every other sequence.
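
The presequence relation can be sketched in a few lines of Python. This is only an illustration: the function names `is_presequence` and `is_subset` are made up for this example and do not come from the slides.

```python
from typing import Sequence


def is_presequence(prefix: Sequence, s: Sequence) -> bool:
    """Return True when prefix equals the first len(prefix) elements of s."""
    return len(prefix) <= len(s) and list(s[: len(prefix)]) == list(prefix)


def is_subset(a: Sequence, b: Sequence) -> bool:
    """Set containment: order and duplicates are ignored."""
    return set(a) <= set(b)


s = ["t1", "t2", "t3"]
print(is_presequence([], s))              # the empty sequence precedes every sequence
print(is_presequence(["t1", "t2"], s))    # S_2 is a presequence of S
print(is_presequence(["t2", "t1"], s))    # reordered, so not a presequence
print(is_subset(["t2", "t1", "t1"], s))   # yet it is still a subset
```

The last two calls show the difference between the two notions: reordering or duplicating tuples breaks the presequence relation but leaves set containment intact.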

Operators on Sequences
$S \rightarrow G \rightarrow G(S)$: the result of applying $G$ to the whole of $S$
Operators viewed as incremental transducers:
• $G_j(S)$ denotes the cumulative output produced up to and including the $j$-th input tuple; $S_j$ denotes the input up to step $j$.
• Let $S$ be a sequence of length $n$. Then $G$ is said to be:
  - Blocking when $G_j(S) = [\,]$ for $j < n$, and $G_n(S) = G(S)$
  - Nonblocking when $G_j(S) = G(S_j)$ for every $j \le n$
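
The two definitions can be checked mechanically on a given finite sequence. The sketch below assumes that an operator is modeled as a pair of Python functions: `g(s)` for the whole-sequence result and `g_j(s, j)` for the cumulative output after `j` tuples. All names are hypothetical, and the check is relative to one test sequence, not a proof about the operator.

```python
from typing import Callable, List, Sequence

Operator = Callable[[Sequence], List]          # G: whole sequence -> output
Transducer = Callable[[Sequence, int], List]   # (S, j) -> G_j(S)


def is_blocking_on(g: Operator, g_j: Transducer, s: Sequence) -> bool:
    """G_j(S) = [] for every j < n, and G_n(S) = G(S)."""
    n = len(s)
    return all(g_j(s, j) == [] for j in range(n)) and g_j(s, n) == g(s)


def is_nonblocking_on(g: Operator, g_j: Transducer, s: Sequence) -> bool:
    """G_j(S) = G(S_j) for every j <= n."""
    return all(g_j(s, j) == g(s[:j]) for j in range(len(s) + 1))


def select_even(s):
    return [t for t in s if t % 2 == 0]


def select_even_j(s, j):
    # Selection can answer from the first j tuples alone.
    return select_even(s[:j])


def sort_all(s):
    return sorted(s)


def sort_all_j(s, j):
    # Sorting outputs nothing until the whole input has been seen.
    return sorted(s) if j == len(s) else []


data = [4, 1, 2, 3]
print(is_nonblocking_on(select_even, select_even_j, data))  # True
print(is_blocking_on(sort_all, sort_all_j, data))           # True
```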

employees(E#, Sal, ...)
• Traditional SQL-2 aggregates: blocking
    SELECT count(E#) FROM employees GROUP BY Sal
• SQL:2003 continuous count: nonblocking
    SELECT Sal, count(E#) OVER (RANGE UNBOUNDED PRECEDING) FROM employees ORDER BY Sal
• The continuous count returns, for each new tuple, the count so far. On a sequence of length $n$, at each step $j \le n$ the count up to $j$ is returned:
    $\mathrm{count}_1(S) = [1]$, $\mathrm{count}_2(S) = [1, 2]$, ..., $\mathrm{count}_j(S) = [1, 2, \ldots, j]$
  independently of whether $j = n$ or $j < n$.
• Traditional count, cumulative return:
  - For each $j < n$: nothing, $\mathrm{count}_j(S) = [\,]$
  - Final: $\mathrm{count}_n(S) = [n]$
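
The contrast between the two counts can be sketched with Python generators that yield the cumulative output after each input tuple. The function names are illustrative only, and the traditional count is given a finite list because it has to know where the input ends.

```python
from typing import Iterable, Iterator, List, Sequence


def continuous_count(stream: Iterable) -> Iterator[List[int]]:
    """Yield count_j(S) = [1, 2, ..., j] after each input tuple."""
    out: List[int] = []
    for j, _tuple in enumerate(stream, start=1):
        out.append(j)
        yield list(out)


def traditional_count(seq: Sequence) -> Iterator[List[int]]:
    """Yield [] at every step before the last, then [n] at step n."""
    n = len(seq)
    for j in range(1, n + 1):
        yield [n] if j == n else []


print(list(continuous_count("abc")))   # [[1], [1, 2], [1, 2, 3]]
print(list(traditional_count("abc")))  # [[], [], [3]]
```

Note that `continuous_count` never inspects the length of its input, so it works unchanged on an unbounded stream, while `traditional_count` cannot even start without it.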

Examples
• Selection is nonblocking.
• Projection is nonblocking even if we eliminate resulting duplicates.
• Traditional SQL-2 aggregates are blocking (for arbitrarily ordered input).
• SQL:2003 OLAP functions are not: e.g., continuous count, sum, max, etc. (i.e., OLAP functions over an unbounded preceding window) are nonblocking.
• Intermediate cases are also possible.

Characterization of NonBlocking (NB)
Theorem: Queries can be expressed via nonblocking computations iff they are monotonic w.r.t. the presequence ordering $\sqsubseteq$.
Proof:
(i) NB $G$ implies monotonic $G$: we need to prove that if $S_j \sqsubseteq S_k$ then $G(S_j) \sqsubseteq G(S_k)$. Since $j \le k$, it is always true that $G_j(S_k) \sqsubseteq G_k(S_k)$, because the cumulative output can only grow. But if $G$ is NB then $G_j(S_k) = G_j(S_j) = G(S_j)$ and $G_k(S_k) = G(S_k)$. QED
(ii) Monotonic $G$ implies NB $G$: … the incremental $G$ transducer, at step $j+1$, adds the difference between $G(S_{j+1})$ and $G(S_j)$.
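
Part (ii) of the proof suggests a generic construction, sketched here under simplifying assumptions: `g` is any Python function from a list to a list, and the wrapper naively recomputes `g` on the whole input seen so far. The names `incremental` and `running_sum` are made up for this illustration; a real DSMS would maintain state instead of recomputing.

```python
from itertools import accumulate
from typing import Callable, Iterable, Iterator, List


def incremental(g: Callable[[list], list], stream: Iterable) -> Iterator[List]:
    """At each step, emit the difference between G(S_{j+1}) and G(S_j).

    Raises ValueError if g turns out not to be monotonic with respect
    to the presequence ordering on this particular input.
    """
    seen: list = []
    previous: list = []
    for t in stream:
        seen.append(t)
        current = g(seen)
        # Monotonicity: the earlier output must be a presequence of the new one.
        if current[: len(previous)] != previous:
            raise ValueError("g is not monotonic on this input")
        yield current[len(previous):]
        previous = current


def running_sum(s: list) -> list:
    """A monotonic operator: the list of partial sums."""
    return list(accumulate(s))


print(list(incremental(running_sum, [3, 1, 4])))  # [[3], [4], [8]]
```

Feeding a nonmonotonic operator such as `sorted` to the wrapper fails as soon as a later tuple would have to be placed before output that was already emitted.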

NonBlocking iff Monotonic
• The theorem generalizes from presequences to sets, i.e., presequences where duplicates are not allowed and order is immaterial.
  - In fact, $S_1$ is a subset of $S_2$ iff $S_1$ is a presequence of $S_2$ after proper reordering and elimination of duplicates
• NB = monotonic: e.g., selection, projection, and OLAP functions
• Blocking = nonmonotonic: e.g., traditional aggregates
• Results hold for operators of more than one argument:
  - Joins are monotonic (i.e., NB) in both arguments
  - $R - S$ is monotonic on $R$ and antimonotonic on $S$: i.e., it will block on $S$ but not on $R$ (but it will unblock on $R$ only after it has seen the whole of $S$!)
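
The behavior of set difference in its two arguments can be seen on a tiny example. The relations below are arbitrary sets of integers chosen for illustration.

```python
def difference(r: set, s: set) -> set:
    """Relational set difference R - S."""
    return r - s


r_small, r_large = {1, 2}, {1, 2, 3}
s_small, s_large = {2}, {2, 3}

# Monotonic in R: growing R can only add tuples to the result.
print(difference(r_small, s_small) <= difference(r_large, s_small))  # True

# Antimonotonic in S: growing S can only remove tuples from the result.
print(difference(r_large, s_large) <= difference(r_large, s_small))  # True

# Tuples that an eager evaluation would have emitted and then had to retract.
retracted = difference(r_large, s_small) - difference(r_large, s_large)
print(retracted)  # {3}
```

The retracted tuple is the reason the operator must wait for all of S before it can safely output anything.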

NB-Completeness
• A query language L can express a given set of functions on its input (DB, sequences, data streams).
• Nonmonotonic functions are intrinsically blocking, so they cannot be used on data streams.
• For continuous queries on data streams, we should disallow blocking (i.e., nonmonotonic) operators and constructs, and only allow nonblocking (i.e., monotonic) operators: nb-operators for short.
• But can ALL the monotonic functions expressible by L be expressed using only its nb-operators?
• Or did we also lose some monotonic queries?
Definition: If L, using only its nb-operators, can express all the monotonic queries expressible in L, then L is said to be NB-complete.

Expressive Power and NB-Completeness
• Consider a (DB) language L. The expressive power of L is the set of functions $F$ that can be computed on the DB using its operators (or constructs).
• On data streams, we are only interested in the monotonic functions $F' \subseteq F$. Also let $O$ be the operators of L, and $O' \subseteq O$ be the subset of those operators that are monotonic.
• L is said to be NB-complete if all functions in $F'$ can be expressed using only the operators in $O'$.
• NB-completeness tests whether $O$ is as suitable for continuous queries on data streams as it is for queries on the database.
• Say that L is not NB-complete: then there exist monotonic functions that L can express on data stored in the DB, but can no longer express on the same data presented as a stream.

Is SQL NB-complete?
• eBay example. Auctions: a stream of positive bids on an item.
    bidStream(Item#, BidValue, Time)
• Items for which the sum of bids is > 100K:
    SELECT Item#
    FROM bidStream
    GROUP BY Item#
    HAVING SUM(BidValue) > 100000;
• This is a monotonic query. Thus it can be expressed in a language containing suitable (nonblocking) query operators. But it cannot be expressed using only the nonblocking operators of SQL-2, so SQL-2 is not nb-complete and is ill-suited for continuous queries on data streams.
• So SQL-2 is not nb-complete because of its blocking aggregates.
• What about RA without aggregates?
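
Why the query is monotonic, and how a nonblocking operator could answer it, can be sketched in Python. This is not SQL and not the author's system: the function name and the tuple layout `(item, bid_value, time)` are assumptions made for the example. Because bids are positive, a running sum that has crossed the threshold can never fall back below it, so an item can be reported the moment it qualifies.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple

Bid = Tuple[Hashable, float, float]  # (item, bid_value, time), a hypothetical layout


def items_over_threshold(bid_stream: Iterable[Bid],
                         threshold: float = 100_000) -> Iterator[Hashable]:
    """Emit each item as soon as the running sum of its bids exceeds threshold."""
    totals = defaultdict(float)
    reported = set()
    for item, bid_value, _time in bid_stream:
        if bid_value <= 0:
            # The monotonicity argument relies on positive bids.
            raise ValueError("bids must be positive")
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item


bids = [("a", 60_000, 1), ("b", 50_000, 2), ("a", 50_000, 3), ("b", 10_000, 4)]
print(list(items_over_threshold(bids)))  # ['a']
```

Item "a" is emitted at the third bid, without waiting for the stream to end; item "b" stays at 60,000 and is never reported.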

Relational Algebra (RA)
• Set difference can produce monotonic queries: are these still expressible without set difference?
• Intersection is monotonic: $R_1 \cap R_2 = R_1 - (R_1 - R_2)$. But intersection can also be expressed as a join (product + select), so it is not lost if we disallow set difference.
• But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA.
• Example: the temporal domain is isomorphic to the nonnegative integers. Intervals are closed on the left and open on the right:
    p(0, 3).   % 0, 1, and 2 are in p but 3 is not
    p(2, 4).   % 3 is not a hole because it is covered by this interval
    p(4, 5).   % 5 is a hole because it is not covered by any other interval
    p(6, 8).
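
The two ways of writing intersection mentioned above can be compared on small Python sets. The function names are illustrative: the first uses two nested (nonmonotonic) differences, the second uses only product and selection.

```python
def intersect_by_difference(r1: set, r2: set) -> set:
    """R1 - (R1 - R2): two nested set differences."""
    return r1 - (r1 - r2)


def intersect_by_join(r1: set, r2: set) -> set:
    """Cartesian product followed by an equality selection and a projection."""
    return {t1 for t1 in r1 for t2 in r2 if t1 == t2}


r1, r2 = {1, 2, 3}, {2, 3, 4}
print(intersect_by_difference(r1, r2) == intersect_by_join(r1, r2))  # True
```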

Coalesce p (cp) & p Until q
p:   p(0, 3). p(2, 4). p(4, 5). p(6, 8).
cp:  cp(0, 3). cp(2, 4). cp(4, 5). cp(0, 4). cp(2, 5). cp(0, 5).
cp contains intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of some interval in between is a hole.
    cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
    hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
    cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
p Until q: given q(5, _), it holds if cp has an interval that starts at 0 and contains 5.
    pUntilq(yes) ← q(0, J).
    pUntilq(yes) ← cp(0, I), q(J, _), I ≥ J.
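
The coalescing rules can be evaluated directly over a small set of intervals. The Python below is a sketch, not the author's implementation, and it makes one labeled assumption: the first rule is evaluated with J1 ≤ J2 rather than J1 < J2, so that every p interval is itself in cp, as with the facts cp(0, 3), cp(2, 4), and cp(4, 5) listed on the slide. Under that assumption the result also contains (6, 8), which the slide does not list. Intervals and q facts are represented as pairs of integers.

```python
from typing import Set, Tuple

Interval = Tuple[int, int]


def coalesce(p: Set[Interval]) -> Set[Interval]:
    """Evaluate the cp, hole, and cep rules over the intervals in p."""
    endpoints = {j for _, j in p}
    # cep(K): endpoint K is covered by some interval [I, J) with I <= K < J.
    cep = {k for k in endpoints if any(i <= k < j for i, j in p)}
    uncovered = endpoints - cep
    # hole(I1, J2): an uncovered endpoint K with J1 <= K < I2 lies in between.
    hole = {
        (i1, j2)
        for i1, j1 in p
        for i2, j2 in p
        for k in uncovered
        if j1 <= k < i2
    }
    # cp(I1, J2), using J1 <= J2 (an assumption; the slide shows J1 < J2).
    return {
        (i1, j2)
        for i1, j1 in p
        for _i2, j2 in p
        if j1 <= j2 and (i1, j2) not in hole
    }


def p_until_q(cp: Set[Interval], q: Set[Interval]) -> bool:
    """pUntilq holds if q starts at 0, or some cp(0, I) reaches a q start J <= I."""
    if any(start == 0 for start, _ in q):
        return True
    return any(i1 == 0 and end >= j for i1, end in cp for j, _ in q)


p = {(0, 3), (2, 4), (4, 5), (6, 8)}
cp = coalesce(p)
print(sorted(cp))
print(p_until_q(cp, {(5, 9)}))  # True: cp(0, 5) reaches 5
print(p_until_q(cp, {(7, 9)}))  # False: the hole at 5 is never bridged
```

Both negated predicates (`hole` and `cep`) depend on having seen all of p, which is where the double negation that blocks nb-RA shows up in code.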

Relational Algebra
• Nonmonotonic (i.e., blocking) RA operators: set difference and division
• We are left with select, project, join, and union. Can these express all FO monotonic queries?
• Some interesting temporal queries: coalesce and until
  - They are expressible in RA (by double negation)
  - They are monotonic
  - But they cannot be expressed in NB-RA
Theorem: RA and SQL are not NB-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.

Real Applications Require REAL Power
• SQL's lack of expressive power is a major problem for database-centric applications.
• These problems are significantly more serious for data streams, since:
  - Only monotonic queries can be used
  - Actually, not even all the monotonic ones, since SQL is not nb-complete
  - These problems cannot be solved by embedding SQL statements in a PL program (next slide!)

Embedding SQL Queries in a PL
• In DB applications, SQL can be embedded in a PL (Java, C++, …) where the PL accesses the tuples returned by SQL using a 'Get Next of Cursor' statement.
• Operations that could not be expressed in SQL can then be expressed in the PL:
  - An effective remedy for SQL's lack of expressive power
• But cursors are a 'pull-based' mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them!
• The DSMS can only deliver its output to the PL as a stream
  - This might be OK for simple situations
  - But if the core of the work has not been done yet, the PL system must do the actual DSMS work!
• Conclusion: to support applications of any complexity we must have a DSMS with real expressive power
  - As opposed to DBMSs, which are useful even with a weak QL

Real Applications Require Real Power
Embedding CQL in PL programs does not work well...
BUT: embedding PL programs in CQL works:
• User-Defined Functions with BLOBs:
  - Good for DBMSs, but DSMSs require incremental computation
• User-Defined Aggregates (UDAs):
  - Incremental computation model
  - Can be defined using a PL or SQL itself
  - With natively defined UDAs, SQL becomes Turing-complete
  - And NB-complete: it can express all monotonic functions
  - Simple syntactic characterization for NB aggregates
  - Effective on a broad range of data-intensive applications, KDD in particular
  - A few extensions are still needed (more later)

Why UDAs Are Important
• We have seen how new aggregates can be defined by the initialize, iterate, terminate scheme, using SQL itself (native UDAs) or an external language (C++, Java, etc.)
Theorem [Law-Wang-Zaniolo 2011]: SQL with natively defined UDAs is Turing-complete.
• With nonblocking UDAs, SQL becomes NB-complete: it can express all monotonic computable functions on a single stream.
• It is also complete on multiple streams if union becomes a sort-merge operator.
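
The initialize, iterate, terminate scheme can be mimicked in Python with an external-language flavor. The classes and the `run_uda` driver below are hypothetical and only illustrate the computation model. In this sketch the nonblocking variant is the one that emits results from initialize and iterate and leaves terminate empty; treating that as the distinguishing feature is an assumption of the example.

```python
from typing import Iterable, List


class BlockingAvg:
    """Traditional average: the result is produced only in terminate."""

    def initialize(self, x: float) -> List[float]:
        self.total, self.count = x, 1
        return []

    def iterate(self, x: float) -> List[float]:
        self.total += x
        self.count += 1
        return []

    def terminate(self) -> List[float]:
        return [self.total / self.count]


class ContinuousAvg:
    """Continuous average: every step emits the average so far."""

    def initialize(self, x: float) -> List[float]:
        self.total, self.count = x, 1
        return [self.total / self.count]

    def iterate(self, x: float) -> List[float]:
        self.total += x
        self.count += 1
        return [self.total / self.count]

    def terminate(self) -> List[float]:
        return []  # nothing is held back until the end of the input


def run_uda(uda, stream: Iterable[float]) -> List[float]:
    """Drive a UDA: initialize on the first tuple, iterate on the rest, then terminate."""
    it = iter(stream)
    try:
        first = next(it)
    except StopIteration:
        return []
    out = list(uda.initialize(first))
    for x in it:
        out.extend(uda.iterate(x))
    out.extend(uda.terminate())
    return out


print(run_uda(BlockingAvg(), [2, 4, 6]))    # [4.0]
print(run_uda(ContinuousAvg(), [2, 4, 6]))  # [2.0, 3.0, 4.0]
```

On an unbounded stream the driver would never reach `terminate`, so only `ContinuousAvg` would ever produce output.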

References
D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.
Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo. A Data Stream Language and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006.
Yan-Nei Law, Haixun Wang, Carlo Zaniolo. Query Languages and Data Models for Database Sequences and Data Streams. VLDB 2004: 492-503.
Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data mining. In Proceedings of the Third SIAM Int. Conference on Data Mining, pages 130-141, 2003.
Yan-Nei Law, Haixun Wang, Carlo Zaniolo. Relational languages and data models for continuous queries on sequences and data streams. ACM Trans. Database Syst. 36(2): 8:1-8:32 (2011).