
Data Stream Management Systems
Presented by Chung-Yan Kwan and Amy Lau
June 5, 2003
CS 240B, Professor Carlo Zaniolo

Outline
• Characteristics of Data Stream Management System (DSMS)
• AURORA: Brandeis University, Brown University, MIT
  • Introduction
  • System Architecture
  • System Model
  • Operators
  • Query Model
  • Optimization
  • QoS Data Structure
  • Future Work

Outline (cont.)
• STREAM: The Stanford Stream Data Manager
  • Introduction
  • System Architecture
  • Query Language
  • Query Plans
  • Approximation Techniques
  • Resource Management
  • Implementation and Interfaces

Characteristics of Data Stream Management System (DSMS)
• Manages traditional stored data (relations)
• Handles multiple continuous, unbounded, possibly rapid and time-varying data streams
• Supports long-running continuous queries and produces answers in a continuous and timely fashion

Introduction of Aurora
• General-purpose DSMS
• Efficiently supports a variety of real-time monitoring applications
• 3 key components
  • Scheduler
  • Storage Manager
  • Load Shedder

Scheduler
• Decides which operators to execute and in which order to execute them
• Pays special attention to reducing operator scheduling and invocation overheads
• Batches (i.e., groups) multiple tuples and operators and executes each batch at once

Storage Manager
• Designed for storing ordered queues of tuples instead of sets of tuples (relations)
• Combines the storage of push-based queues with pull-based access to historical data stored at connection points

Load Shedder
• Responsible for detecting and handling overload situations
• Handling an overload situation
  • Accomplished by shedding tuples: “drop” operators are temporarily added to the Aurora processing network
  • Goal: filter messages in order to rectify the overload situation and provide better overall QoS at the expense of reduced answer quality

Aurora System Model
• The basic job of Aurora is to process incoming streams in the way defined by an application administrator
• Data stream flow
  • Input comes from external streams
  • Data flows through a loop-free, directed graph of processing operations (i.e., boxes)
  • Output streams are presented to applications
  • Historical storage is maintained (to support ad-hoc queries)

Operators
• Eight primitive operators
• Windowed operators
  • Slide
  • Tumble
  • Latch
  • Resample
• Non-windowed operators
  • Filter – Drop
  • Map
  • GroupBy
  • Join

Query Model

Query Model (cont.)
• 3 types
  • Continual queries (real-time processing)
  • Views
  • Ad-hoc queries

Continual Query
• No need to store the data once they are processed
• The QoS specification at the end of the path controls how resources are allocated to the processing elements along the path
• The application is programmed to deal with asynchronous tuples

Views
• A view is a path defined with no connected application
• It is allowed to have a QoS specification as an indication of the importance of the view
• Applications can connect to the end of this path whenever there is a need
• The system can also store partial results at any point along a view path

Ad-hoc Query
• Connection point
  • A connection point is an arc that supports dynamic modification of the network
  • An ad-hoc query can be attached to a connection point at any time
  • Data stored at the connection point is delivered to the ad-hoc query
• Thus, the semantics of an Aurora ad-hoc query are the same as those of a continuous query that starts executing at t_now - T and continues until explicit termination
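
The behavior of a connection point can be sketched in a few lines of Python. This is only an illustration, not Aurora's implementation: the class name `ConnectionPoint`, its methods, and the reading of T as "the amount of history kept at the connection point" are assumptions made for the example.

```python
from collections import deque


class ConnectionPoint:
    """Hypothetical connection point that keeps the last `horizon` time units of tuples."""

    def __init__(self, horizon):
        self.horizon = horizon      # assumed meaning of T: how much history is stored
        self.history = deque()      # (timestamp, value) pairs, oldest first
        self.subscribers = []       # callbacks of attached queries

    def push(self, timestamp, value):
        """Store a new tuple, expire old history, and forward the tuple downstream."""
        self.history.append((timestamp, value))
        # Discard tuples older than t_now - T.
        while self.history[0][0] < timestamp - self.horizon:
            self.history.popleft()
        for callback in self.subscribers:
            callback(timestamp, value)

    def attach(self, callback):
        """Attach an ad-hoc query: replay stored history, then deliver live tuples."""
        for timestamp, value in self.history:
            callback(timestamp, value)
        self.subscribers.append(callback)


cp = ConnectionPoint(horizon=10)
for t in range(0, 25, 5):            # tuples at times 0, 5, 10, 15, 20
    cp.push(t, f"reading@{t}")

seen = []
cp.attach(lambda ts, value: seen.append(ts))   # ad-hoc query attached at t_now = 20
cp.push(25, "reading@25")                      # a live tuple arriving afterwards
```

The attached query first sees the stored tuples from the last 10 time units and then the live tuple, which is the "continuous query started at t_now - T" behavior described on the slide.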

Optimization
• Inserting projections
  • Project out all unneeded attributes
• Combining boxes
  • Where possible, combining at least saves box execution overhead and reduces the total number of boxes
• Reordering boxes
  • (next slide)

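Reordering Boxes
• Cost of b, c(b): expected execution time for b per input tuple
• Selectivity of b, s(b): expected number of output tuples per input tuple
• If bi runs before bj, the expected cost per input tuple is c(bi) + s(bi) × c(bj)
• Running bj first instead costs c(bj) + s(bj) × c(bi), so the two boxes should be switched when (1 - s(bj)) / c(bj) > (1 - s(bi)) / c(bi)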
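
As a small worked example of the reordering test, the sketch below evaluates both orderings for two boxes. The function names and the numeric costs and selectivities are made up for illustration, and the test only makes sense for boxes that can legally be swapped (they commute).

```python
def path_cost(first, second):
    """Expected work per input tuple when `first` feeds `second`.

    Each box is a (cost, selectivity) pair: c(first) + s(first) * c(second).
    """
    c_first, s_first = first
    c_second, _ = second
    return c_first + s_first * c_second


def should_switch(b_i, b_j):
    """True if b_j, currently after b_i, should be moved ahead of b_i."""
    (c_i, s_i), (c_j, s_j) = b_i, b_j
    return (1 - s_j) / c_j > (1 - s_i) / c_i


# Hypothetical boxes: b_i is expensive and passes most tuples,
# b_j is cheap and filters most tuples out.
b_i = (2.0, 0.9)   # (cost, selectivity)
b_j = (1.0, 0.1)

cost_i_then_j = path_cost(b_i, b_j)   # 2.0 + 0.9 * 1.0
cost_j_then_i = path_cost(b_j, b_i)   # 1.0 + 0.1 * 2.0
switch = should_switch(b_i, b_j)
```

Here the cheap, highly selective box belongs first: the inequality holds and the switched order has the lower expected cost.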

QoS Data Structure
• Multidimensional function of several attributes of an Aurora system
  • Response times
  • Tuple drops
  • Values produced

QoS Data Structure (cont.)
• Response times
  • Output tuples should be produced in a timely fashion
• Tuple drops
  • Tuples dropped to shed load will deteriorate QoS
• Values produced
  • Depends on whether important values are being produced or not

Future Work
• Implementing an Aurora prototype system
• Working on a distributed architecture, Aurora*

Introduction of STREAM
• A general-purpose DSMS
• Supports a declarative query language (CQL)
  • Registering continuous queries
  • Flexible query plans
• Designed to cope with high data rates and large numbers of continuous queries
  • Provides approximate answers when resources are limited
  • Careful resource allocation and usage

System Architecture

Query Language (CQL)
• An extended version of SQL
• Includes:
  • Sliding window specification:
    • partitioning clause (grouping)
    • window size (ROWS or RANGE)
      • e.g. “ROWS 50 PRECEDING”
      • e.g. “RANGE 15 MINUTES PRECEDING”
    • filtering predicate (WHERE)
  • Sampling clause
    • specifies that a random sample of the data elements should be used for query processing (e.g. “1% SAMPLE” means each data element in the stream should be retained with probability 0.01 and discarded with probability 0.99)
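
To make the window and sampling clauses concrete, here is a minimal Python sketch of their semantics. It is not how STREAM implements CQL: the function names are invented for the example, and the boundary treatment in the RANGE window (a tuple exactly `span` old is dropped) is an assumption.

```python
import random
from collections import deque


def rows_window(stream, size):
    """After each arrival, yield the most recent `size` tuples (like ROWS size PRECEDING)."""
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)
        yield list(window)


def range_window(stream, span):
    """`stream` yields (timestamp, value) pairs; keep tuples newer than now - span.

    Assumes span > 0 and non-decreasing timestamps.
    """
    window = deque()
    for timestamp, value in stream:
        window.append((timestamp, value))
        while window[0][0] <= timestamp - span:
            window.popleft()
        yield list(window)


def sample(stream, percent, rng):
    """Keep each element independently with probability percent / 100 (like p% SAMPLE)."""
    keep_probability = percent / 100.0
    for item in stream:
        if rng.random() < keep_probability:
            yield item


last_rows = list(rows_window([1, 2, 3, 4], size=2))[-1]
last_range = list(range_window([(0, "a"), (10, "b"), (20, "c")], span=15))[-1]
one_percent = list(sample(range(100_000), percent=1, rng=random.Random(0)))
```

With a 1% sample, `one_percent` holds roughly 1,000 of the 100,000 inputs; the exact count depends on the random seed.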

Query Example
• The example queries reference a stream Requests of requests to a web proxy server, each with four attributes: client_id, domain, URL, and reqTime
• Counts the number of requests for pages from the domain stanford.edu in the last day:

    SELECT COUNT(*)
    FROM   Requests S [RANGE 1 DAY PRECEDING]
    WHERE  S.domain = 'stanford.edu'

• Counts how many page requests were for pages served by Stanford’s CS department web server, considering only each client’s 10 most recent page requests from the domain stanford.edu:

    SELECT COUNT(*)
    FROM   Requests S
           [PARTITION BY S.client_id
            ROWS 10 PRECEDING
            WHERE S.domain = 'stanford.edu']
    WHERE  S.URL LIKE 'http://cs.stanford.edu/%'

Query Example (cont.)
• This example references a stored relation Domains that classifies domains by the primary type of web content they serve
• Counts the number of requests for pages from “commerce” domains out of the last 10,000 requests for pages from domains that have been classified:

    SELECT COUNT(*)
    FROM   (SELECT R.class
            FROM   Requests S 10% SAMPLE, Domains R
            WHERE  S.domain = R.domain) T [ROWS 10000 PRECEDING]
    WHERE  T.class = 'commerce'

• Note: the stream of requests must be joined with the Domains relation (resulting in a stream labeled T) before applying the sliding window

Query Language (cont.)
• Stream ordering and timestamps
  • Assume a global, discrete, ordered time domain
  • Each stream tuple has a timestamp
    • Explicit
      • Use attribute TIMESTAMP (type DATETIME) in the CREATE STREAM statement
    • Arrival-based
      • Value of the system clock at the time the tuple arrives
• Inactive and weighted queries
  • Queries may be assigned weights indicating their relative importance
  • Queries with higher weight are given more precision
  • Inactive queries
    • Queries with negligible weight
    • Influence query plans and resource allocation

Query Plans
• Accounts for plan sharing and approximation techniques
• Compiles declarative queries into individual plans; the system may merge plans
  • Aurora users directly manipulate one large execution plan
• Allows direct input of query plans
  • Similar to Aurora
• Plans are composed of three types of components
  • Query operators (similar to traditional DBMS)
  • Inter-operator queues (similar to some traditional DBMS)
  • Synopses
    • used to maintain state associated with operators
    • summarization techniques (sliding windows) are used to limit their size (produce approximate results)
• Global scheduler for plan execution

Query Plans (cont.)
• Generic methods of the Operator class:
  • create, changeMem, run
• Generic methods of the Synopsis class:
  • create, changeMem, insert and delete, query
• Separate implementation allows us to couple any operator type with any synopsis type, and paves the way for operator and synopsis sharing
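
The separation between operators and synopses can be sketched as two small Python classes. Only the method names (create, changeMem, run, insert, delete, query) come from the slide; the signatures, the "keep the newest tuples" policy, and the `WindowCountOperator` example are assumptions made for illustration.

```python
from collections import deque


class Synopsis:
    """Hypothetical synopsis that keeps at most `mem` of the most recent tuples."""

    def __init__(self, mem):            # "create"
        self.mem = mem
        self.items = []

    def _trim(self):
        excess = len(self.items) - self.mem
        if excess > 0:
            del self.items[:excess]     # drop the oldest tuples

    def change_mem(self, mem):          # "changeMem"
        self.mem = mem
        self._trim()

    def insert(self, item):
        self.items.append(item)
        self._trim()

    def delete(self, item):
        if item in self.items:
            self.items.remove(item)

    def query(self):
        return list(self.items)


class WindowCountOperator:
    """Hypothetical operator: emits how many tuples its synopsis currently holds."""

    def __init__(self, synopsis, in_queue, out_queue):   # "create"
        self.synopsis = synopsis
        self.in_queue = in_queue
        self.out_queue = out_queue

    def change_mem(self, mem):          # "changeMem"
        self.synopsis.change_mem(mem)

    def run(self, max_tuples):
        """Process up to `max_tuples` tuples from the input queue."""
        for _ in range(min(max_tuples, len(self.in_queue))):
            self.synopsis.insert(self.in_queue.popleft())
            self.out_queue.append(len(self.synopsis.query()))


in_q, out_q = deque("abcde"), deque()
op = WindowCountOperator(Synopsis(mem=3), in_q, out_q)
op.run(max_tuples=5)
op.change_mem(2)     # shrink the synopsis at run time
```

Because the operator only talks to the synopsis through insert and query, a different synopsis type (for example a sample instead of a window) could be plugged in without changing the operator.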

Example of Query Plans

Resource Sharing in Query Plans
• Can combine plans that have exactly matching subexpressions
  • Multiple queries accessing the same incoming base data stream S “share” S as a common subexpression
• The implementation of a shared queue:
  • maintains a pointer to the first unread tuple for each operator that reads from the queue, and discards tuples once they have been read by all parent operators
  • A shared subplan should not be used if two queries with a common subexpression produce parent operators with very different consumption rates
  • May need to introduce synopsis sharing
• Automatic resource sharing is less crucial in Aurora
  • Resource sharing is primarily programmed by users when they augment the current mega-plan
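
A shared queue of the kind described above can be sketched as follows. This is an illustrative Python model rather than STREAM's code; the class name, the reader names, and the choice that a newly registered reader starts at the end of the queue are assumptions.

```python
class SharedQueue:
    """Hypothetical queue shared by several reading operators."""

    def __init__(self):
        self.buffer = []      # tuples not yet read by every reader
        self.base = 0         # absolute position of buffer[0]
        self.next_pos = {}    # reader -> absolute position of its first unread tuple

    def register(self, reader):
        self.next_pos[reader] = self.base + len(self.buffer)

    def enqueue(self, item):
        self.buffer.append(item)

    def read(self, reader):
        """Return the reader's next unread tuple, or None if it has caught up."""
        pos = self.next_pos[reader]
        if pos - self.base >= len(self.buffer):
            return None
        item = self.buffer[pos - self.base]
        self.next_pos[reader] = pos + 1
        self._discard_fully_read()
        return item

    def _discard_fully_read(self):
        # Tuples before the slowest reader's pointer have been read by everyone.
        slowest = min(self.next_pos.values())
        del self.buffer[:slowest - self.base]
        self.base = slowest


q = SharedQueue()
q.register("fast")
q.register("slow")
for x in (1, 2, 3):
    q.enqueue(x)

fast_reads = [q.read("fast") for _ in range(3)]
held_after_fast = len(q.buffer)   # nothing can be discarded yet
slow_first = q.read("slow")
held_after_slow = len(q.buffer)   # the first tuple is now discarded
```

The example also shows why very different consumption rates are a problem: the queue cannot shrink until the slowest reader advances, so a fast reader gains nothing in memory from having finished early.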

Approximation Techniques
• Goal is to maximize the precision of query answers based on the available resources
• Static and dynamic approximations
• Static approximation
  • Queries are modified when they are submitted to the system (to use fewer resources)
  • Two techniques:
    • Window reduction (reduces memory and computation)
      • Decrease the window size or introduce a window where none was specified originally (band joins)
      • This can have a ripple effect that propagates up the operator tree
    • Sampling rate reduction (reduces output rate)
      • Reduce the sampling rate of the SAMPLE clause or introduce one where none was specified originally
      • Can take an existing sample operator and push it down the query plan

Advantages of Static Approximation
• The user is guaranteed certain query behavior if the query is executed precisely by the system
  • The user can participate in the process by guiding or approving the system’s query modifications
• Adaptive approximation techniques and continuous monitoring of system activity are not required

Dynamic Approximation
• Queries are unchanged
• System may not always provide precise query answers
• Three techniques:
  • Synopsis compression (analogous to window reduction)
    • Reduce synopsis sizes at one or more operators
      • Incorporating a sliding window into a synopsis or shrinking the existing window
      • Maintaining a sample of the intended synopsis content
  • Sampling (reduces queue size)
    • Introduce one or more sample operators into the query plan, or reduce the sampling rate at existing operators
  • Load shedding (reduces queue size)
    • Simply drop tuples from queues when they grow too large
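
Load shedding in its simplest form can be sketched as a bounded queue. The policy used here (drop the newly arriving tuple once the queue is full) and the class name are assumptions for illustration; the slide does not say which tuples are dropped.

```python
from collections import deque


class SheddingQueue:
    """Hypothetical inter-operator queue that sheds load when it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0       # how many tuples were shed so far

    def enqueue(self, item):
        """Add a tuple; return False if it was dropped because the queue is full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def dequeue(self):
        return self.items.popleft() if self.items else None


q = SheddingQueue(capacity=3)
accepted = [q.enqueue(i) for i in range(5)]   # a burst of five tuples
```

Keeping a drop counter lets the rest of the system see how much precision is being given up at this queue.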

Advantages of Dynamic Approximation
• The level of approximation can vary with fluctuations in data rates and distributions, query workload, and resource availability
• Approximation can occur at the plan operator level, and decisions can be made based on the global set of (possibly shared) query plans running in the system

Resource Management
• Focus primarily on memory consumed by query plan synopses and queues
• Static resource allocation
  • Allocate resources to queries (in a limited environment) so as to maximize query result precision
  • Assume that all plan operators map allocated resources to precision specifications (FP, FN), where FP, FN ∈ [0, 1]
    • FP captures the false positive rate: the probability that an output stream tuple is incorrect
    • FN captures the false negative rate: the probability, for each correct output stream tuple, that there is another correct tuple that was missed
    • (FP, FN) can also denote the precision of an operator
  • For each operator type, compute output stream precision (FP, FN) values from the precision of the input streams and the precision of the operator itself
  • Apply the formulas bottom-up to the query plan, feeding the result to a numerical solver, which produces the optimal resource allocation

Exploiting Constraints Over Data Streams
• To reduce memory overhead in query plan operators
• Specify an “adherence parameter” k to capture how closely a given stream or set of streams adheres to a constraint of that type
  • e.g. a clustered-arrival constraint on a stream attribute S.A
    • If two tuples in stream S have the same value v for A, then at most k tuples with non-v values for A occur on S between them
  • The closer the streams adhere to the specified constraints at run-time, the smaller the required synopses (state)
• Constraints considered:
  • Between two streams: many-one join and referential integrity constraints
  • Individual stream: unique-value, clustered-arrival, and ordered-arrival
• The algorithm accepts select-project-join queries over streams with arbitrary constraints, and it produces a query plan that exploits constraints to reduce synopsis sizes without compromising precision
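
The clustered-arrival definition can be checked directly on a finite prefix of a stream. The sketch below computes the smallest k for which a sequence of A-values satisfies the constraint; the function name is invented, and it only measures adherence, it does not reproduce the plan-generation algorithm mentioned above.

```python
def min_clustered_arrival_k(values):
    """Smallest k such that between any two tuples with the same value v,
    at most k tuples with a different value occur.

    The worst case for each v is between its first and last occurrence, so
    it suffices to count the non-v tuples inside that span.
    """
    first, last, count = {}, {}, {}
    for pos, v in enumerate(values):
        first.setdefault(v, pos)
        last[v] = pos
        count[v] = count.get(v, 0) + 1
    return max((last[v] - first[v] + 1 - count[v] for v in first), default=0)


perfectly_clustered = min_clustered_arrival_k(["a", "a", "b", "b", "c"])
interleaved = min_clustered_arrival_k(["a", "b", "a", "c", "b"])
```

In the interleaved sequence the two "b" tuples are separated by two non-"b" tuples, so that sequence adheres to the constraint only for k of 2 or more, while the clustered one adheres with k = 0 and would need the least operator state.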

Scheduling
• Global scheduler for plan execution (calls run methods) uses a round-robin scheme
• Focus on minimizing intermediate (inter-operator) queue sizes
• Parallelism is not considered
• Greedily schedule the operator that “consumes” the largest number of tuples per time unit and is the most selective (i.e. “produces” the fewest tuples)
• Example:
  • A query plan with two unary operators: O1 operates on input queue q1, writing results to queue q2, which is input to operator O2
  • O1 takes one time unit to operate on a batch of n tuples from q1 and has 20% selectivity (produces n/5 tuples in q2)
  • O2 takes one time unit to operate on n/5 tuples and produces no tuples on its output queue
  • Assume the average arrival rate of tuples on q1 is no more than n tuples per two time units, so all tuples can be processed and queues will not grow without bound

Scheduling (cont.)
• Two possible scheduling strategies for the example
  • Strategy 1: Tuples are processed to completion in the order they arrive on q1. Each batch of n tuples in q1 is processed by O1 and then O2 based on arrival time, consuming two time units overall
  • Strategy 2: If there is a batch of n tuples in q1, then O1 operates on them using one time unit, producing n/5 new tuples in q2. Otherwise, if there are any tuples in q2, then up to n/5 of these tuples are operated on by O2, consuming one time unit
• E.g. 2n tuples arrive on q1 at time 0, no tuples at time 1, and n tuples each at times 2 and 3
• The table shows the total size of queues q1 and q2; each table entry is a multiplier for n
• Both strategies finish at the 8th step; Strategy 2 is clearly preferable in terms of memory overhead
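
The two strategies can be compared with a short simulation of the arrival pattern above. This is a sketch under stated assumptions, not a reproduction of the slide's table: queue sizes are sampled at each time step right after arrivals and before the operator runs, n is set to 5 so that n/5 is a whole tuple, and the function and variable names are invented.

```python
def simulate(strategy, arrivals, n, steps):
    """Return the total size of q1 + q2 at each time step.

    strategy: "fifo"   - each batch goes through O1 and then O2 before the next batch
              "greedy" - run O1 whenever q1 holds a full batch, otherwise run O2
    arrivals: dict mapping time step -> number of tuples arriving on q1
    """
    q1 = q2 = 0
    o2_pending = False          # fifo only: O2 still has to finish the current batch
    sizes = []
    for t in range(steps):
        q1 += arrivals.get(t, 0)
        sizes.append(q1 + q2)   # sampled after arrivals, before processing
        if strategy == "fifo":
            if o2_pending:
                q2 -= min(q2, n // 5)
                o2_pending = False
            elif q1 >= n:
                q1 -= n
                q2 += n // 5
                o2_pending = True
        else:
            if q1 >= n:
                q1 -= n
                q2 += n // 5
            elif q2 > 0:
                q2 -= min(q2, n // 5)
    return sizes


n = 5
arrivals = {0: 2 * n, 2: n, 3: n}     # 2n at time 0, nothing at time 1, n at times 2 and 3
fifo_sizes = simulate("fifo", arrivals, n, steps=9)
greedy_sizes = simulate("greedy", arrivals, n, steps=9)
```

Dividing each entry by n gives the multipliers: under these assumptions the first strategy yields 2.0, 1.2, 2.0, 2.2, 2.0, 1.2, 1.0, 0.2, 0 and the second 2.0, 1.2, 1.4, 1.6, 0.8, 0.6, 0.4, 0.2, 0. Both queues are empty at step 8, and the greedy strategy never holds more tuples than the first one.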

Scheduling (cont.)
• Can achieve queue size minimization, but pay in increased time to initial results
• Two additional considerations:
  • Favor operators with full batches of tuples in their input queues over higher-priority operators with underfull input queues
  • Chains of operators within a plan:
    • do not schedule chains as a unit as in Aurora’s train scheduling algorithm
    • Aurora’s objective is to improve throughput by reducing context-switching between operators, batching the processing of tuples through operators, and reducing I/O overhead (inter-operator queues may be written to disk)
• Aurora:
  • “QoS graphs” capture tradeoffs among precision, response time, resource usage, and usefulness to the application. However, approximation appears solely through drop-boxes that perform load shedding.

Implementation and Interfaces
• Three features of the design:
  • Generic entities
  • Coding of query plans
  • System interface
• Entities and Control Tables
  • Operators, queues, and synopses are subclasses of a generic Entity class
  • Each entity has a table of attribute-value pairs, the Control Table (CT), and each entity exports an interface to query and update its CT
  • Dynamically control the behavior of an entity
    • The amount of memory used by a synopsis S can be controlled by updating the value of attribute Memory in S’s control table
  • Collect statistics about entity behavior for resource management and for user-level system monitoring
    • The number of tuples that have passed through a queue q is stored in attribute Count of q’s control table
  • Offer extensibility (add new attributes to a CT)
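
The control-table idea can be sketched as a small class hierarchy. Only the names Entity, Control Table, Memory, and Count come from the slide; the `get`/`set` interface, the `Owner` attribute, and the rule that a new Memory value takes effect on the next insert are assumptions made for the example.

```python
from collections import deque


class Entity:
    """Hypothetical generic entity with a control table of attribute-value pairs."""

    def __init__(self, **attributes):
        self.control_table = dict(attributes)

    def get(self, name):
        return self.control_table[name]

    def set(self, name, value):
        self.control_table[name] = value


class Queue(Entity):
    def __init__(self):
        super().__init__(Count=0)
        self.items = deque()

    def enqueue(self, item):
        self.items.append(item)
        self.set("Count", self.get("Count") + 1)    # statistic kept in the CT


class Synopsis(Entity):
    def __init__(self, memory):
        super().__init__(Memory=memory)
        self.items = []

    def insert(self, item):
        self.items.append(item)
        # Keep at most Memory tuples, reading the limit from the CT each time.
        excess = max(0, len(self.items) - self.get("Memory"))
        del self.items[:excess]


q = Queue()
for x in "abc":
    q.enqueue(x)

s = Synopsis(memory=4)
for x in range(6):
    s.insert(x)
s.set("Memory", 2)        # control behavior by updating the CT
s.insert(6)
s.set("Owner", "q7")      # extensibility: a new, made-up attribute
```

Because every entity exposes the same two calls, a monitoring tool or resource manager can read statistics and adjust limits without knowing the concrete entity type.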

Implementation and Interfaces (cont.)
• Query plans
  • Implemented as networks of entities, stored in main memory
  • A graphical interface is provided for creating and viewing plans, and for adjusting attributes of operators, queues, and synopses
  • Query plans may be viewed and edited even as queries are running
  • Main-memory plan structures are also kept in XML files (persistent continuous queries)
  • Plans are loaded at system startup; any modifications to plans during system execution are reflected in the corresponding XML
  • Users are free to create and edit XML plans offline

Implementation and Interfaces (cont.)
• Programmatic and human interfaces
  • A web interface through direct HTTP
    • Planning to expose it as a web service through SOAP
  • Remote applications:
    • can be written in any language and on any platform
    • can register queries
    • can request and update CT attribute values
    • can receive the results of a query as a streaming HTTP response in XML
  • Human users:
    • web-based GUI exposing the same functionality

Conclusion
• Both prototypes are still under development
• STREAM needs to design its query processor with a migration to distributed processing in mind
• STREAM may extend the system to handle XML data streams
• Both systems are quite alike
• We think they could join their efforts to come up with an even better DSMS

References
• Aurora website: http://www.cs.brown.edu/research/aurora/
• Carney, D., et al., “Monitoring Streams - A New Class of Data Management Applications,” Proc. of Very Large Databases (VLDB), Hong Kong, China, August 2002. http://www.cs.uml.edu/~kajal/courses/91.580-S03/papers/ccccmonitoring-streams.pdf
• Motwani, R., et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System,” In Proc. of the 2003 CIDR
• Babcock, B., et al., “Models and issues in data stream systems,” In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, p. 1-16, Madison, Wisconsin, May 2002
• Stanford University STREAM website: http://www.db.stanford.edu/stream