Streaming Queries over Streaming Data
Sirish Chandrasekaran (UC Berkeley), Michael J. Franklin (UC Berkeley)
Presented by Andy Williamson

About Me
- 3rd-year ISYE major
- Minor in Computer Science
- From Austin, TX
- Have visited every state but Alaska
- Intern at Deloitte Consulting focusing on SAP implementation

Agenda
- Background/Motivation
- PSoup
  - Introduction
  - System Overview
  - Query Processing Techniques
  - Implementation
  - Performance
  - Aggregation Queries
  - Conclusions
- Critique

Background/Motivation
- Continuous Query (CQ) systems
  - Treat queries as fixed entities and stream data over them
  - Previous systems only allowed streaming of either data or queries
  - Continuously deliver results as they are computed (infeasible/inefficient)
    - Data recharging
    - Monitoring

PSoup: Introduction
- Query processor based on the Telegraph query processing framework
- Allows both data and queries to be streamed
- Partially stores results to support disconnected operation and improve data throughput and response time

PSoup: System Overview
- User initially registers a query specification with the system
- System returns a handle which can be used to invoke the results of the query later
- Example query:
    SELECT * FROM Data_Stream D_s
    WHERE (D_s.a < x ^ D_s.b > y)
    BEGIN(NOW - 10) END(NOW);
- BEGIN-END clause allows:
  - Snapshot (constant beginning and ending time)
  - Landmark (constant beginning and variable ending time)
  - Sliding window (variable beginning and ending time)
- Limited by size of memory
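
The three window kinds allowed by the BEGIN-END clause can be told apart by whether each bound is a constant or moves with NOW. The sketch below is only an illustration of that classification; the `Window` class, its field names, and the `resolve` helper are hypothetical and not taken from PSoup.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    """Hypothetical model of a BEGIN-END clause (not PSoup's own structure)."""
    begin: float            # absolute time, or offset from NOW if begin_relative
    end: float              # absolute time, or offset from NOW if end_relative
    begin_relative: bool
    end_relative: bool

    def kind(self) -> str:
        # Constant bounds -> snapshot; only the end moves -> landmark;
        # both bounds move with NOW -> sliding window.
        if not self.begin_relative and not self.end_relative:
            return "snapshot"
        if not self.begin_relative and self.end_relative:
            return "landmark"
        if self.begin_relative and self.end_relative:
            return "sliding"
        return "unsupported"

    def resolve(self, now: float) -> Tuple[float, float]:
        # Turn NOW-relative offsets into concrete bounds at invocation time.
        lo = now + self.begin if self.begin_relative else self.begin
        hi = now + self.end if self.end_relative else self.end
        return lo, hi


# BEGIN(NOW - 10) END(NOW) from the example query.
example = Window(begin=-10, end=0, begin_relative=True, end_relative=True)
```

Calling `example.resolve(now)` at invocation time yields the concrete time range to apply to the stored results.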

PSoup: System Overview
- PSoup treats execution of query streams as a join of query and data streams
- Maintains State Modules (SteMs) for queries and data
- One query SteM for all queries in the system, and one data SteM for each data stream

PSoup: Query Processing Techniques
- Overview
  - PSoup assigns a unique queryID that it returns to the user
  - Client can disconnect, reconnect, and execute the query to obtain updated results
  - PSoup continuously matches data to query predicates in the background and stores the results in its Results Structure
  - When a query is invoked, PSoup applies the appropriate input window to the Results Structure to return the current results

PSoup: Query Processing Techniques
- Entry of new query specs
  - New queries split into two parts:
    - Standing Query Clause (SQC): consists of the SELECT-FROM-WHERE clauses
    - BEGIN-END clause, stored in a separate WindowsTable structure
  - SQC inserted into Query SteM
  - Used to probe Data SteMs corresponding to tables in FROM clause
  - Resulting tuples stored in Results Structure

PSoup: Query Processing Techniques
- Entry of new data
  - New tuples assigned a globally unique tupleID and a physical timestamp (physicalID) based on the system clock
  - Inserted into the appropriate Data SteM
  - Then used to probe the Query SteM to determine which SQCs it satisfies
  - TupleIDs and physicalIDs stored in the Results Structure
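
The symmetry between the two entry paths (a new query probes stored data, a new tuple probes stored queries) can be sketched as follows. This is a minimal illustration under stated assumptions: a single data stream, SQCs modelled as Python predicates, brute-force probing instead of indexes, and caller-supplied timestamps instead of the system clock. The class name `MiniPSoup` and its methods are hypothetical.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
SQC = Callable[[Row], bool]  # assumption: a standing query clause is a predicate


class MiniPSoup:
    """Hypothetical single-stream sketch of the symmetric query/data join."""

    def __init__(self) -> None:
        self._query_ids = itertools.count(1)
        self._tuple_ids = itertools.count(1)
        self.query_stem: Dict[int, SQC] = {}
        self.data_stem: Dict[int, Tuple[int, Row]] = {}       # tupleID -> (physicalID, row)
        self.results: Dict[int, List[Tuple[int, int]]] = {}   # queryID -> [(physicalID, tupleID)]

    def register_query(self, sqc: SQC) -> int:
        # Insert the SQC into the query SteM, then probe the data SteM with it.
        query_id = next(self._query_ids)
        self.query_stem[query_id] = sqc
        self.results[query_id] = sorted(
            (ts, tid) for tid, (ts, row) in self.data_stem.items() if sqc(row)
        )
        return query_id

    def insert_tuple(self, row: Row, timestamp: int) -> int:
        # Insert the tuple into the data SteM, then probe the query SteM with it.
        # Assumes timestamps arrive in non-decreasing order, so appending
        # keeps each result list ordered by physicalID.
        tuple_id = next(self._tuple_ids)
        self.data_stem[tuple_id] = (timestamp, row)
        for query_id, sqc in self.query_stem.items():
            if sqc(row):
                self.results[query_id].append((timestamp, tuple_id))
        return tuple_id
```

Either arrival order gives the same stored result: a tuple that arrived before the query is found when the query probes the data SteM, and a tuple that arrives afterwards is matched when it probes the query SteM.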

PSoup: Query Processing Techniques
- Selection queries over a single stream

PSoup: Query Processing Techniques
- Join queries over multiple streams

PSoup: Query Processing Techniques
- Query Invocation and Result Construction
  - Results Structure maintains info about which tuples in the Data SteM(s) satisfy which SQCs in the Query SteM
  - For each result tuple of each query, it stores the tupleID and physicalID of all constituent base tuples of the result tuple
  - Results of a query can be accessed by its queryID
  - Ordered by timestamp (physicalID)
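
Because results are kept per queryID and ordered by physicalID, applying a window at invocation time reduces to a range lookup. The sketch below assumes each query's entries are stored as a sorted list of `(physicalID, tupleID)` pairs and uses binary search; the `invoke` function and this layout are illustrative assumptions, not PSoup's actual code.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (physicalID, tupleID)


def invoke(results: Dict[int, List[Entry]], query_id: int,
           begin: int, end: int) -> List[int]:
    """Return tupleIDs whose physicalID lies in [begin, end] for one query.

    Assumes results[query_id] is sorted by physicalID.
    """
    entries = results[query_id]
    # Sentinel second components make the bounds inclusive on both sides.
    lo = bisect_left(entries, (begin, float("-inf")))
    hi = bisect_right(entries, (end, float("inf")))
    return [tuple_id for _, tuple_id in entries[lo:hi]]
```

The cost per invocation is logarithmic in the number of stored matches plus the size of the answer, which is one way to see why materialized, timestamp-ordered results help response time.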

PSoup: Implementation
- Eddy
  - Each tuple has a predicate attribute and an Interest List dictating where it is to be routed
  - Provides Stream Prefix Consistency by storing new and temporary tuples separately in a New Tuple Pool (NTP) and a Temporary Tuple Pool (TTP)
  - Begins by selecting a tuple from the NTP and then processing everything in the TTP before picking another tuple from the NTP
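
The NTP/TTP scheduling rule can be sketched as a two-queue loop. This is a simplified illustration only: routing via Interest Lists is not modelled, and `process` is a hypothetical callback that returns whatever temporary tuples a processing step produces.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def run_eddy(new_tuples: Iterable[str],
             process: Callable[[str], List[str]]) -> List[str]:
    """Return the order in which tuples are processed.

    One tuple is taken from the New Tuple Pool, then the Temporary Tuple
    Pool is drained completely before the next new tuple is admitted.
    """
    ntp: Deque[str] = deque(new_tuples)
    ttp: Deque[str] = deque()
    trace: List[str] = []
    while ntp:
        current = ntp.popleft()
        trace.append(current)
        ttp.extend(process(current))
        while ttp:  # finish all derived work first
            temp = ttp.popleft()
            trace.append(temp)
            ttp.extend(process(temp))
    return trace


def derive(t: str) -> List[str]:
    # Hypothetical derivation rule used only for this example.
    return {"d1": ["d1.x", "d1.y"], "d1.x": ["d1.x.z"]}.get(t, [])
```

With this rule, all work triggered by `d1` completes before `q1` is admitted, which is the ordering property the slide attributes to the two pools.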

PSoup: Implementation
- Data SteM
  - Uses a tree-based index for data to provide efficient access to probing queries
  - One red-black tree for every attribute
  - Maintains a hash-based index over tupleIDs for fast access
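
A rough sketch of this layout: one ordered index per attribute for range probes, plus a hash index from tupleID to the tuple. Python's standard library has no red-black tree, so a sorted list maintained with `bisect` stands in for it here; that substitution, the `DataSteM` class name, and its methods are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right, insort
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, float]


class DataSteM:
    """Hypothetical data SteM: per-attribute ordered index plus tupleID hash index."""

    def __init__(self, attributes: Iterable[str]) -> None:
        self.by_id: Dict[int, Row] = {}                    # hash index over tupleIDs
        self.by_attr: Dict[str, List[Tuple[float, int]]] = {
            attr: [] for attr in attributes                # sorted (value, tupleID) pairs
        }

    def insert(self, tuple_id: int, row: Row) -> None:
        self.by_id[tuple_id] = row
        for attr, index in self.by_attr.items():
            insort(index, (row[attr], tuple_id))

    def range_probe(self, attr: str, low: float, high: float) -> List[int]:
        """tupleIDs with low <= row[attr] <= high, in increasing value order."""
        index = self.by_attr[attr]
        lo = bisect_left(index, (low, float("-inf")))
        hi = bisect_right(index, (high, float("inf")))
        return [tuple_id for _, tuple_id in index[lo:hi]]
```

A balanced tree gives logarithmic inserts, whereas `insort` on a list is linear per insert; the probe side behaves the same way in both.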

PSoup: Implementation
- Query SteM
  - Allows sharing of work between queries that have overlapping FROM clauses
  - Uses red-black trees to index single-attribute, single-relation boolean factors of a query

PSoup: Implementation
- Query SteM
  - For queries involving joins of multiple attributes, the tree structure doesn't work
  - Instead, a linked list called the predicateList is used
  - Query SteM contains an array in which each cell represents a query
  - At the beginning of a probe by a data tuple, each cell is set to the number of boolean factors in the corresponding query
  - Every time the tuple satisfies a boolean factor, the cell value is decremented
  - At the end of the probe, if cell = 0, the data tuple satisfies the given query
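
The counter-array bookkeeping described above can be sketched as follows. For brevity this version scans every boolean factor linearly through a single list, whereas the slides describe tree indexes for single-attribute factors and the predicateList for the rest; the `QuerySteM` class and its method names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Factor = Callable[[Row], bool]  # one boolean factor of a conjunctive WHERE clause


class QuerySteM:
    """Hypothetical sketch of probing queries with a per-query counter array."""

    def __init__(self) -> None:
        self.factor_counts: List[int] = []                  # cell i = number of factors of query i
        self.predicate_list: List[Tuple[int, Factor]] = []  # (query position, factor)

    def add_query(self, factors: Sequence[Factor]) -> int:
        position = len(self.factor_counts)
        self.factor_counts.append(len(factors))
        self.predicate_list.extend((position, f) for f in factors)
        return position

    def probe(self, row: Row) -> List[int]:
        # Start each cell at the query's factor count, decrement once per
        # satisfied factor, and report the queries whose cell reaches zero.
        remaining = list(self.factor_counts)
        for position, factor in self.predicate_list:
            if factor(row):
                remaining[position] -= 1
        return [q for q, left in enumerate(remaining) if left == 0]
```

The array lets factors be evaluated in any order, and shared across queries, without tracking which individual factors of a query have already passed.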

PSoup: Implementation
- Results Structure
  - Stores metadata indicating which tuples satisfy which SQCs
  - Can be implemented either by the previously mentioned bitmap or by associating a linked list of satisfying data tuples with each query
  - Ordering by timestamp is simple for single-table queries
  - For join queries, typically use the oldest timestamp
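
The bitmap variant can be sketched with one bitset per query, where bit positions follow tuple arrival order so that timestamp ordering comes for free. The layout, the `BitmapResults` class, and the `join_timestamp` helper are assumptions made for illustration; the slides do not give the concrete representation.

```python
from typing import Dict, Iterable, List


class BitmapResults:
    """Hypothetical bitmap Results Structure: one Python int per query as a bitset."""

    def __init__(self) -> None:
        self.timestamps: List[int] = []   # position -> physicalID, in arrival order
        self.bits: Dict[int, int] = {}    # queryID -> bitset over tuple positions

    def add_tuple(self, timestamp: int, satisfied_queries: Iterable[int]) -> int:
        position = len(self.timestamps)
        self.timestamps.append(timestamp)
        for query_id in satisfied_queries:
            self.bits[query_id] = self.bits.get(query_id, 0) | (1 << position)
        return position

    def matches(self, query_id: int, begin: int, end: int) -> List[int]:
        """Positions of tuples that satisfy the query and fall inside [begin, end]."""
        bitset = self.bits.get(query_id, 0)
        return [
            pos for pos, ts in enumerate(self.timestamps)
            if begin <= ts <= end and (bitset >> pos) & 1
        ]


def join_timestamp(constituent_timestamps: Iterable[int]) -> int:
    # For a join result, order by the oldest constituent timestamp.
    return min(constituent_timestamps)
```

A per-query list stores only the matching tuples, so it tends to be smaller when few tuples match, while the bitmap costs one bit per tuple per query regardless of selectivity.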

PSoup: Performance
- Implemented in Java with customized versions of Eddy and SteMs
- Examined performance of two versions:
  - PSoup-Partial (PSoup-P): Maintains results corresponding to SQCs in the Results Structure, and applies BEGIN-END clauses to retrieve current results on query invocation
  - PSoup-Complete (PSoup-C): Continuously maintains results corresponding to the current input window for each query in linked lists
- NoMat: Measurements of a system that doesn't materialize results

PSoup: Performance
- Storage Requirements
  - NoMat: Storage cost = space taken to store base data streams within the maximum window over which queries are supported, plus size of structures
  - PSoup-P: Storage cost = storage cost of NoMat + size of Results Structure (either bit array or linked list)
  - PSoup-C: Storage cost >> storage cost of PSoup-P, since C always stores the current results of standing queries at a given time

PSoup: Performance
- Experimental Setup
  - Varied window sizes (2^7 to 2^16) and number(18)/type of boolean factors
  - Measured response time and maximum supportable data arrival rate
  - Examined both P and C with and without predicate indexes
  - Tested scheme to remove redundancies arising from joins
  - Used synthetically generated query (2^7 to 2^12) and data streams

PSoup: Performance
- Response Time vs. Window Size

PSoup: Performance
- Response Time vs. # Interval Predicates

PSoup: Performance
- Data Arrival Rate vs. # SQCs

PSoup: Performance
- Summary of Results
  - Materializing results of queries supports higher query invocation rates
  - Indexing queries and lazily applying windows improves maximum data throughput
  - PSoup-C requires more memory
  - PSoup-C optimizes query invocation rate
  - PSoup-P optimizes data arrival rate

PSoup: Performance
- Removing Redundancy in Join Processing
  - Entry of a query specification or new data
  - Composite tuples in joins

PSoup: Aggregation Queries
- PSoup can support aggregate functions
- Only possible to share data structures across queries with identical SELECT-PROJECT-JOIN clauses

PSoup: Conclusions
- Treats data and query streams analogously
- Can support queries that require access to data that arrived before and after the query
- Materializes results to cut down on response time and to support disconnected operation
  - Enables data recharging and monitoring
- Future work:
  - Write data streams to disk and execute queries over them
  - Transfer queries between disk and memory, allowing query execution to be scheduled
  - Confront resource constraints when dealing with infinite streams
  - Query browser for temporal data

Critique
- Strengths
  - Very well written, easy to follow
  - Clear examples, excellent explanation of performance results
  - Strong method that reduces processing time as the number of interval predicates increases
- Weaknesses
  - Lacking sufficient data on storage costs
  - Experimentation only tested one multiple-relation boolean factor for joins; unrealistic
  - Didn't address whether the same (or a similar) query could be entered twice and accidentally given two IDs