Data Stream Management Systems (DSMS) - Introduction, Concepts and Issues
Morten Lindeberg, University of Oslo
(With slides from Vera Goebel)

Today’s Agenda

• Introduction
  • Research field
  • DBMS vs. DSMS
  • Motivation
• Concepts and Issues
  • Requirements
  • Architecture
  • Data model
  • Queries
  • Data reduction
• Examples
  • TelegraphCQ

Morten Lindeberg: 1. and 2. lecture
Jarle Søberg: 3. lecture

16. sept 2009, INF 5100 - H 2009

The DSMS Research Field

• New and active research field (~10 years) derived from the database community
  • Stream algorithms
  • Application and database perspective (we)
• Two syllabus articles:
  • Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom: "Models and issues in data stream systems"
  • Lukasz Golab, M. Tamer Özsu: "Issues in data stream management"
• Future: Complex Event Processing (CEP)

DBMS vs. DSMS #1

[Figure: side-by-side diagrams of a DBMS and a DSMS. Labels: SQL Query, Continuous Query (CQ), Query Processing, Result, Main Memory, Data Stream(s), Disk]

DBMS vs. DSMS #2

• Traditional DBMS:
  • stored sets of relatively static records with no pre-defined notion of time
  • good for applications that require persistent data storage and complex querying
• DSMS:
  • support on-line analysis of rapidly changing data streams
  • data stream: real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items, too large to store entirely, not ending
  • continuous queries

DBMS vs. DSMS #3

DBMS                                                          | DSMS
Persistent relations (relatively static, stored)              | Transient streams (on-line analysis)
One-time queries                                              | Continuous queries (CQs)
Random access                                                 | Sequential access
“Unbounded” disk store                                        | Bounded main memory
Only current state matters                                    | Historical data is important
No real-time services                                         | Real-time requirements
Relatively low update rate                                    | Possibly multi-GB arrival rate
Data at any granularity                                       | Data at fine granularity
Assume precise data                                           | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics

Adapted from [Motwani: PODS tutorial]

DSMS Applications

(Slide labels: Pull-based, Push-based)

• Sensor Networks
  • E.g. TinyDB. See earlier lecture by Jarle Søberg
• Network Traffic Analysis
  • Real-time analysis of Internet traffic. E.g., traffic statistics and critical condition detection.
• Financial Tickers
  • On-line analysis of stock prices, discover correlations, identify trends.
• Transaction Log Analysis
  • E.g. Web click streams and telephone calls

Data Streams - Terms

• A data stream is a (potentially unbounded) sequence of tuples
• Each tuple consists of a set of attributes, similar to a row in a database table
• Transactional data streams: log interactions between entities
  • Credit card: purchases by consumers from merchants
  • Telecommunications: phone calls by callers to dialed parties
  • Web: accesses by clients of resources at servers
• Measurement data streams: monitor evolution of entity states
  • Sensor networks: physical phenomena, road traffic
  • IP network: traffic at router interfaces
  • Earth climate: temperature, moisture at weather stations

Motivation #1

• Massive data sets:
  • Huge numbers of users, e.g.,
    • AT&T long-distance: ~300 M calls/day
    • AT&T IP backbone: ~10 B IP flows/day
  • Highly detailed measurements, e.g.,
    • NOAA: satellite-based measurements of earth geodetics
  • Huge number of measurement points, e.g.,
    • Sensor networks with huge number of sensors

Motivation #2

• Near real-time analysis
  • ISP: controlling service levels
  • NOAA: tornado detection using weather radar
  • Hospital: patient monitoring
• Traditional data feeds
  • Simple queries (e.g., value lookup) needed in real time
  • Complex queries (e.g., trend analyses) performed off-line

Motivation #3

Stig Støa, Morten Lindeberg and Vera Goebel. Online Analysis of Myocardial Ischemia From Medical Sensor Data Streams with Esper. In Proceedings of the First International Symposium on Applied Sciences in Biomedical and Communication Technologies (ISABEL 2008)

• Queries over sensor traces from surgical procedures on pigs performed at IVS, Rikshospitalet, running an open source Java system called Esper
• Successful identification of occlusion to the heart (heart attack)

    SELECT y, timestamp
    FROM Accelerometer.win:ext_timed(t, 5 s)
    HAVING count(y) BETWEEN 2 AND 200

[Figure label: Heart attack!]

Motivation #4

Performance of disks:

                         1987        2004         Increase
CPU Performance          1 MIPS      2,000 MIPS   2,000 x
Memory Size              16 Kbytes   32 Gbytes    2,000,000 x
Memory Performance       100 usec    2 nsec       50,000 x
Disc Drive Capacity      20 Mbytes   300 Gbytes   15,000 x
Disc Drive Performance   60 msec     5.3 msec     11 x

Source: Seagate Technology Paper: "Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements"

• Memory I/O is much faster than disk I/O!
• 2008: SSD seek time 0.1 msec, but capacity is small, e.g. 120 GB.

Today’s Agenda

• Introduction
  • Research field
  • DBMS vs. DSMS
  • Motivation
• Concepts and Issues
  • Requirements
  • Architecture
  • Data model
  • Queries
  • Data reduction
• Examples
  • TelegraphCQ

Morten Lindeberg: 1. and 2. lecture
Jarle Søberg: 3. lecture

Requirements

• Data model and query semantics: order- and time-based operations
  • Selection
  • Nested aggregation
  • Multiplexing and demultiplexing
  • Frequent item queries
  • Joins
  • Windowed queries
• Query processing:
  • Streaming query plans must use non-blocking operators
  • Only single-pass algorithms over data streams
• Data reduction: approximate summary structures
  • Synopses, digests => no exact answers
• Real-time reactions for monitoring applications => active mechanisms
• Long-running queries: variable system conditions
• Scalability: shared execution of many continuous queries, monitoring multiple streams

Generic DSMS Architecture

[Figure: generic DSMS architecture. Labels: Streaming Inputs, Input Monitor, Working Storage, Summary Storage, Static Storage, Updates to Static Data, Query Repository, User Queries, Query Processor, Output Buffer, Streaming Outputs]

[Golab & Özsu 2003]

Architecture #2

[Figure: DSMS architecture with concepts from Borealis. Labels: input module, buffer, Load Shedder, query tree, Query processor, Query Optimizer, System monitor, static DB, output module, buffer, user query]

3-Level Architecture

• Reduce tuples through several layered operations (several DSMSs)
• Store results in static DB for later analysis
• E.g., distributed DSMSs

VLDB 2003 Tutorial [Koudas & Srivastava 2003]

Data Models

• Real-time data stream: sequence of items that arrive in some order and may only be seen once.
• Stream items: like relational tuples
  • Relation-based: e.g., STREAM, TelegraphCQ and Borealis
  • Object-based: e.g., COUGAR, Tribeca
• Window models
  • Direction of movements of the endpoints: fixed window, sliding window, landmark window
  • Time-based vs. tuple-based
  • Update interval: eager (for each newly arriving tuple), lazy (batch processing), non-overlapping tumbling windows.

More on Windows

• Mechanism for extracting a finite relation from an infinite stream
• Solves blocking operator problem

[Figure: sliding, jumping and overlapping windows drawn over a stream (adapted from Jarle Søberg)]
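
The window idea can be illustrated with a minimal Python sketch. It is not taken from any particular DSMS: the (timestamp, value) tuple layout and the function names sliding_time_window and tumbling_count_window are assumptions made for this example. The first function is a time-based sliding window with eager updates, the second a tuple-based, non-overlapping (tumbling) window.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Item = Tuple[float, float]  # (timestamp, value); an assumed tuple layout


def sliding_time_window(stream: Iterable[Item], width: float) -> Iterator[List[Item]]:
    """Time-based sliding window, re-evaluated eagerly on every arrival."""
    window: Deque[Item] = deque()
    for ts, value in stream:
        window.append((ts, value))
        # Expire items outside the half-open interval (ts - width, ts].
        while window[0][0] <= ts - width:
            window.popleft()
        yield list(window)


def tumbling_count_window(stream: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """Tuple-based tumbling window: non-overlapping batches of `size` items."""
    batch: List[Item] = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []  # an incomplete final batch is never emitted


readings = [(0.0, 10.0), (1.0, 11.0), (2.0, 12.0), (5.0, 13.0), (6.0, 14.0)]
sliding = list(sliding_time_window(readings, width=3.0))
tumbling = list(tumbling_count_window(readings, size=2))
```

Both functions keep only a bounded number of items in memory, which is what turns an unbounded stream into a finite relation that blocking operators can work on.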

Timestamps

• Used for tuple ordering and by the DSMS for defining window sizes (time-based)
• Useful for the user to know when the tuple originated
• Explicit: set by the source of data
• Implicit: set by DSMS, when it has arrived
• Ordering is an issue
• Distributed systems: no exact notion of time

Queries #1

• DBMS: one-time (transient) queries
• DSMS: continuous (persistent) queries
  • Unbounded memory requirements
  • Blocking operators: window techniques
  • Queries referencing past data

Queries #2

• DBMS: (mostly) exact query answer
• DSMS: (mostly) approximate query answer
  • Approximate query answers have been studied:
    • sampling, synopses, sketches, wavelets, histograms, …
  • Data reduction
  • Batch processing

One-pass Query Evaluation

• DBMS:
  • Arbitrary data access
  • One/few pass algorithms have been studied:
    • Limited memory selection/sorting: n-pass quantiles
    • Tertiary memory databases: reordering execution
    • Complex aggregates: bounding number of passes
• DSMS:
  • Per-element processing: single pass to reduce drops
  • Block processing: multiple passes to optimize I/O cost
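
As an illustration of per-element, single-pass processing, the sketch below computes mean and variance with Welford's online algorithm. The technique is a standard one chosen for this example rather than something named on the slides, and the function name running_mean_variance is hypothetical. Each element is seen exactly once and only three numbers are kept in memory.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after one pass over the stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the mean before and after the update
    variance = m2 / count if count else 0.0
    return count, mean, variance


# The iterator can only be consumed once, like a stream.
n, mean, var = running_mean_variance(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```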

Query Plan

• DBMS: fixed query plans optimized at beginning
• DSMS: adaptive query operators
  • Adaptive plans have been studied:
    • Query scrambling: wide-area data access
    • Eddies: volatile, unpredictable environments
    • Borealis: High Availability monitors and query distribution

Query Languages #1

• Stream query language issues (compositionality, windows)
• SQL-like proposals suitably extended for a stream environment:
  • Composable SQL operators
  • Queries reference relations or streams
  • Queries produce relations or streams
• Query operators (selection/projection, join, aggregation)
• Examples:
  • GSQL (Gigascope)
  • CQL (STREAM)
  • EPL (ESPER)

Query Languages #2

3 querying paradigms for streaming data:

1. Relation-based: SQL-like syntax and enhanced support for windows and ordering, e.g., CQL (STREAM), StreaQuel (TelegraphCQ), AQuery, GigaScope
2. Object-based: object-oriented stream modeling, classify stream elements according to type hierarchy, e.g., Tribeca, or model the sources as abstract data types (ADTs), e.g., COUGAR
3. Procedural: users specify the data flow, e.g., Borealis, users construct query plans via a graphical interface

(1) and (2) are declarative query languages; currently, the relation-based paradigm is mostly used.

Sample Stream

    Traffic (
        sourceIP    -- source IP address
        sourcePort  -- port number on source
        destIP      -- destination IP address
        destPort    -- port number on destination
        length      -- length in bytes
        time        -- time stamp
    );

Procedural Query (Borealis)

• Simple DoS (SYN Flooding) identification query

Selections and Projections

• Selections, (duplicate preserving) projections are straightforward
  • Local, per-element operators
  • Duplicate eliminating projection is like grouping
• Projection needs to include ordering attribute
  • No restriction for position ordered streams

    SELECT sourceIP, time
    FROM Traffic
    WHERE length > 512
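
A rough Python equivalent of the selection and projection query shows why these operators are local and per-element: each tuple is tested and projected on its own, with no state kept between tuples. The Traffic class mirrors the sample stream schema; the function name select_project and the packet values are made up for this sketch.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Traffic(NamedTuple):
    sourceIP: str
    sourcePort: int
    destIP: str
    destPort: int
    length: int
    time: float


def select_project(stream: Iterable[Traffic]) -> Iterator[Tuple[str, float]]:
    """SELECT sourceIP, time FROM Traffic WHERE length > 512, one tuple at a time."""
    for t in stream:
        if t.length > 512:  # selection: a stateless predicate on a single tuple
            # Projection keeps the ordering attribute (time) in the output.
            yield (t.sourceIP, t.time)


# Illustrative packets only, not real measurements.
packets = [
    Traffic("10.0.0.1", 1234, "10.0.0.9", 80, 1500, 1.0),
    Traffic("10.0.0.2", 1235, "10.0.0.9", 80, 40, 2.0),
    Traffic("10.0.0.3", 1236, "10.0.0.8", 443, 600, 3.0),
]
result = list(select_project(packets))
```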

Joins

• General case of join operators problematic on streams
  • May need to join arbitrarily far apart stream tuples
  • Equijoin on stream ordering attributes is tractable
• Majority of work focuses on joins between streams with windows specified on each stream

    SELECT A.sourceIP, B.sourceIP
    FROM Traffic1 A [window T1], Traffic2 B [window T2]
    WHERE A.destIP = B.destIP
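
The windowed join can be sketched in Python as follows. This is a simplified illustration, not how any specific DSMS implements it: both streams are assumed to be merged into one sequence in timestamp order, windows are time-based, and each new tuple probes the other stream's window with a plain scan. The name window_join and the tuple layout are assumptions for this example.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# (stream name, sourceIP, destIP, time); assumed layout, ordered by time
Arrival = Tuple[str, str, str, float]
Stored = Tuple[str, str, float]  # (sourceIP, destIP, time)


def window_join(arrivals: Iterable[Arrival], t1: float, t2: float) -> Iterator[Tuple[str, str]]:
    """Join A and B on destIP, keeping A tuples for t1 and B tuples for t2 time units."""
    win_a: Deque[Stored] = deque()
    win_b: Deque[Stored] = deque()
    for name, src, dst, ts in arrivals:
        # Expire tuples that have left their window before probing.
        while win_a and win_a[0][2] <= ts - t1:
            win_a.popleft()
        while win_b and win_b[0][2] <= ts - t2:
            win_b.popleft()
        if name == "A":
            for b_src, b_dst, _ in win_b:  # probe the opposite window
                if b_dst == dst:
                    yield (src, b_src)
            win_a.append((src, dst, ts))
        else:
            for a_src, a_dst, _ in win_a:
                if a_dst == dst:
                    yield (a_src, src)
            win_b.append((src, dst, ts))


arrivals = [
    ("A", "a1", "d1", 0.0),
    ("B", "b1", "d1", 1.0),
    ("B", "b2", "d2", 2.0),
    ("A", "a2", "d2", 3.0),
    ("B", "b3", "d1", 10.0),  # a1 has expired by now, so no match
]
pairs = list(window_join(arrivals, t1=5.0, t2=5.0))
```

Because old tuples are expired, memory stays bounded by the window sizes, at the price of missing matches between tuples that are further apart than the windows allow.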

Aggregations

• General form:
  • select G, F1 from S where P group by G having F2 op ϑ
  • G: grouping attributes, F1, F2: aggregate expressions
• Window techniques are needed!
• Aggregate expressions:
  • distributive: sum, count, min, max
  • algebraic: avg
  • holistic: count-distinct, median
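
The three classes of aggregate expressions differ in how much state must be kept, which the following sketch tries to make concrete. The helper names (partial_of, merge, finalize_avg) and the two example panes are invented for illustration. A distributive aggregate such as sum can be merged from per-pane results, an algebraic one such as avg from a fixed-size partial state (sum, count), while a holistic one such as the exact median needs the values themselves.

```python
from statistics import median
from typing import List, Tuple

Partial = Tuple[float, int]  # fixed-size state for avg: (sum, count)


def partial_of(values: List[float]) -> Partial:
    """Summarize one pane of a window."""
    return (sum(values), len(values))


def merge(p: Partial, q: Partial) -> Partial:
    """Combine two pane summaries without revisiting the raw values."""
    return (p[0] + q[0], p[1] + q[1])


def finalize_avg(p: Partial) -> float:
    return p[0] / p[1]


pane1 = [1.0, 2.0, 3.0]
pane2 = [10.0, 20.0]

merged = merge(partial_of(pane1), partial_of(pane2))
avg = finalize_avg(merged)

# Holistic: the exact median cannot be derived from per-pane medians
# (2.0 and 15.0 here), so all values in the window have to be available.
med = median(pane1 + pane2)
```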

Query Optimization

• DBMS: table based cardinalities used in query optimization => Problematic in a streaming environment
• Cost metrics and statistics: accuracy and reporting delay vs. memory usage, output rate, power usage
• Query optimization: query rewriting to minimize cost metric, adaptive query plans, due to changing processing time of operators, selectivity of predicates, and stream arrival rates
• Query optimization techniques
  • stream rate based
  • resource based
  • QoS based
• Continuously adaptive optimization
• Possibility that objectives cannot be met:
  • resource constraints
  • bursty arrivals under limited processing capability

Data Reduction Techniques

• Aggregation: approximations, e.g., mean or median
• Load Shedding: drop random tuples
• Sampling: only consider samples from the stream (e.g., random selection). Used in sensor networks.
• Sketches: summaries of stream that occupy small amount of memory, e.g., randomized sketching
• Wavelets: hierarchical decomposition
• Histograms: approximate frequency of element values in stream
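
Two of these techniques are short enough to sketch in Python. The first function is reservoir sampling (Algorithm R), one common way to keep a uniform random sample of fixed size from a stream of unknown length; the slides only say "random selection", so the choice of algorithm is an assumption of this example. The second is a minimal random load shedder. The function names and the explicit rng parameter are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of up to k items using O(k) memory."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


def shed_load(stream: Iterable[T], drop_probability: float, rng: random.Random) -> Iterator[T]:
    """Drop each tuple independently with the given probability."""
    for item in stream:
        if rng.random() >= drop_probability:
            yield item


rng = random.Random(42)  # fixed seed only to make the example repeatable
sample = reservoir_sample(range(1000), 10, rng)
kept = list(shed_load(range(1000), 0.5, rng))
```

Downstream aggregates computed over `kept` or `sample` are approximations; a count over shed input, for instance, has to be scaled up by 1 / (1 - drop_probability) to estimate the true value.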

Today’s Agenda

• Introduction
  • Research field
  • DBMS vs. DSMS
  • Motivation
• Concepts and Issues
  • Requirements
  • Architecture
  • Data model
  • Queries
  • Data reduction
• Examples
  • TelegraphCQ

Morten Lindeberg: 1. and 2. lecture
Jarle Søberg: 3. lecture