Data Stream Management Systems (DSMS) - Introduction, Concepts and Issues Morten Lindeberg University of Oslo (With slides from Vera Goebel)
Today’s Agenda
- Introduction
  - Research field
  - DBMS vs. DSMS
  - Motivation
- Concepts and Issues
  - Requirements
  - Architecture
  - Data model
  - Queries
  - Data reduction
- Examples
  - TelegraphCQ
Morten Lindeberg: 1st and 2nd lecture. Jarle Søberg: 3rd lecture.
16 Sept 2009, INF 5100 - H 2009
The DSMS Research Field
- New and active research field (~10 years), derived from the database community
  - Stream algorithms
  - Application and database perspective (we)
- Two syllabus articles:
  - Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom: "Models and issues in data stream systems"
  - Lukasz Golab, M. Tamer Özsu: "Issues in data stream management"
- Future: Complex Event Processing (CEP)
DBMS vs. DSMS #1
[Figure: DBMS and DSMS query processing side by side. Labels: SQL Query, Continuous Query (CQ), Result, Query Processing, Main Memory, Data Stream(s), Disk]
DBMS vs. DSMS #2
- Traditional DBMS:
  - stored sets of relatively static records with no pre-defined notion of time
  - good for applications that require persistent data storage and complex querying
- DSMS:
  - supports on-line analysis of rapidly changing data streams
  - data stream: real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items, too large to store entirely, not ending
  - continuous queries
DBMS vs. DSMS #3
DBMS | DSMS
Persistent relations (relatively static, stored) | Transient streams (on-line analysis)
One-time queries | Continuous queries (CQs)
Random access | Sequential access
"Unbounded" disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Adapted from [Motwani: PODS tutorial]
DSMS Applications
- Sensor networks
  - E.g. TinyDB. See earlier lecture by Jarle Søberg
- Network traffic analysis
  - Real-time analysis of Internet traffic, e.g., traffic statistics and critical condition detection
- Financial tickers
  - On-line analysis of stock prices, discover correlations, identify trends
- Transaction log analysis
  - E.g. Web click streams and telephone calls
[Slide annotations: Pull-based, Push-based]
Data Streams - Terms
- A data stream is a (potentially unbounded) sequence of tuples
- Each tuple consists of a set of attributes, similar to a row in a database table
- Transactional data streams: log interactions between entities
  - Credit card: purchases by consumers from merchants
  - Telecommunications: phone calls by callers to dialed parties
  - Web: accesses by clients of resources at servers
- Measurement data streams: monitor evolution of entity states
  - Sensor networks: physical phenomena, road traffic
  - IP network: traffic at router interfaces
  - Earth climate: temperature, moisture at weather stations
Motivation #1
- Massive data sets:
  - Huge numbers of users, e.g.,
    - AT&T long-distance: ~300 M calls/day
    - AT&T IP backbone: ~10 B IP flows/day
  - Highly detailed measurements, e.g.,
    - NOAA: satellite-based measurements of earth geodetics
  - Huge number of measurement points, e.g.,
    - Sensor networks with huge number of sensors
Motivation #2
- Near real-time analysis
  - ISP: controlling service levels
  - NOAA: tornado detection using weather radar
  - Hospital: patient monitoring
- Traditional data feeds
  - Simple queries (e.g., value lookup) needed in real time
  - Complex queries (e.g., trend analyses) performed off-line
Motivation #3
Stig Støa, Morten Lindeberg and Vera Goebel. Online Analysis of Myocardial Ischemia From Medical Sensor Data Streams with Esper. In Proceedings of the First International Symposium on Applied Sciences in Biomedical and Communication Technologies (ISABEL 2008)
- Queries over sensor traces from surgical procedures on pigs performed at IVS, Rikshospitalet, running an open-source Java system called Esper
- Successful identification of occlusion to the heart (heart attack)
  SELECT y, timestamp FROM Accelerometer.win:ext_timed(t, 5 s) HAVING count(y) BETWEEN 2 AND 200
[Figure callout: Heart attack!]
Motivation #4
Performance of disks:
Metric | 1987 | 2004 | Increase
CPU performance | 1 MIPS | 2,000 MIPS | 2,000x
Memory size | 16 Kbytes | 32 Gbytes | 2,000,000x
Memory performance | 100 usec | 2 nsec | 50,000x
Disc drive capacity | 20 Mbytes | 300 Gbytes | 15,000x
Disc drive performance | 60 msec | 5.3 msec | 11x
Source: Seagate Technology Paper: "Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements"
Note: 2008 SSD seek time is 0.1 msec, but capacity is small, e.g. 120 GB.
Memory I/O is much faster than disk I/O!
Today’s Agenda
- Introduction
  - Research field
  - DBMS vs. DSMS
  - Motivation
- Concepts and Issues
  - Requirements
  - Architecture
  - Data model
  - Queries
  - Data reduction
- Examples
  - TelegraphCQ
Morten Lindeberg: 1st and 2nd lecture. Jarle Søberg: 3rd lecture.
Requirements
- Data model and query semantics: order- and time-based operations
  - Selection
  - Nested aggregation
  - Multiplexing and demultiplexing
  - Frequent item queries
  - Joins
  - Windowed queries
- Query processing:
  - Streaming query plans must use non-blocking operators
  - Only single-pass algorithms over data streams
- Data reduction: approximate summary structures
  - Synopses, digests => no exact answers
- Real-time reactions for monitoring applications => active mechanisms
- Long-running queries: variable system conditions
- Scalability: shared execution of many continuous queries, monitoring multiple streams
Generic DSMS Architecture [Golab & Özsu 2003]
[Figure: components labelled Streaming Inputs, Input Monitor, Working Storage, Summary Storage, Static Storage, Updates to Static Data, Query Repository, Query Processor, User Queries, Output Buffer, Streaming Outputs]
Architecture #2 (concepts from Borealis)
[Figure: components labelled input module, buffer, load shedder, query processor, query tree, query optimizer, system monitor, static DB, output module, buffer, user query]
3-Level Architecture
- Reduce tuples through several layered operations (several DSMSs)
- Store results in static DB for later analysis
- E.g., distributed DSMSs
VLDB 2003 Tutorial [Koudas & Srivastava 2003]
Data Models
- Real-time data stream: sequence of items that arrive in some order and may only be seen once.
- Stream items: like relational tuples
  - Relation-based: e.g., STREAM, TelegraphCQ and Borealis
  - Object-based: e.g., COUGAR, Tribeca
- Window models
  - Direction of movement of the endpoints: fixed window, sliding window, landmark window
  - Time-based vs. tuple-based
  - Update interval: eager (for each newly arriving tuple), lazy (batch processing), non-overlapping tumbling windows
More on Windows
- Mechanism for extracting a finite relation from an infinite stream
- Solves the blocking operator problem
[Figure: sliding, jumping and overlapping windows (adapted from Jarle Søberg)]
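The tuple-based sliding and tumbling windows described on the two slides above can be sketched in a few lines of Python. This is only an illustration: the function names `sliding_windows` and `tumbling_windows` are hypothetical, and the sketch assumes a window is emitted only once it is full and that a trailing partial tumbling window is dropped.

```python
from collections import deque


def sliding_windows(stream, size):
    """Yield the last `size` items after each arrival (eager update)."""
    window = deque(maxlen=size)  # the oldest item falls out automatically
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield list(window)


def tumbling_windows(stream, size):
    """Yield non-overlapping batches of `size` items (lazy update)."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []  # start a fresh window; nothing is shared


if __name__ == "__main__":
    print(list(sliding_windows(range(5), 3)))
    print(list(tumbling_windows(range(6), 3)))
```

Both generators turn an unbounded input into a sequence of finite relations, which is what lets a blocking operator such as an aggregate run on each window in turn.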
Timestamps
- Used for tuple ordering and by the DSMS for defining window sizes (time-based)
- Useful for the user to know when the tuple originated
- Explicit: set by the source of data
- Implicit: set by the DSMS when the tuple arrives
- Ordering is an issue
- Distributed systems: no exact notion of time
Queries #1
- DBMS: one-time (transient) queries
- DSMS: continuous (persistent) queries
  - Unbounded memory requirements
  - Blocking operators: window techniques
  - Queries referencing past data
Queries #2
- DBMS: (mostly) exact query answer
- DSMS: (mostly) approximate query answer
  - Approximate query answers have been studied: sampling, synopses, sketches, wavelets, histograms, …
  - Data reduction
  - Batch processing
One-pass Query Evaluation
- DBMS:
  - Arbitrary data access
  - One/few pass algorithms have been studied:
    - Limited memory selection/sorting: n-pass quantiles
    - Tertiary memory databases: reordering execution
    - Complex aggregates: bounding number of passes
- DSMS:
  - Per-element processing: single pass to reduce drops
  - Block processing: multiple passes to optimize I/O cost
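As a hedged illustration of per-element, single-pass processing, the sketch below keeps a running mean and variance in constant memory using Welford's update rule. The technique is chosen here as an example and is not named on the slide; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's update)."""

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # sum of squared deviations from the mean

    def update(self, x):
        """Consume one stream item; the item is never stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far."""
        return self._m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
    print(stats.n, stats.mean, stats.variance)
```

Each item is touched exactly once and then discarded, so memory use stays fixed no matter how long the stream runs.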
Query Plan
- DBMS: fixed query plans optimized at beginning
- DSMS: adaptive query operators
  - Adaptive plans have been studied:
    - Query scrambling: wide-area data access
    - Eddies: volatile, unpredictable environments
    - Borealis: High Availability monitors and query distribution
Query Languages #1
- Stream query language issues (compositionality, windows)
- SQL-like proposals suitably extended for a stream environment:
  - Composable SQL operators
  - Queries reference relations or streams
  - Queries produce relations or streams
- Query operators (selection/projection, join, aggregation)
- Examples:
  - GSQL (Gigascope)
  - CQL (STREAM)
  - EPL (Esper)
Query Languages #2
3 querying paradigms for streaming data:
1. Relation-based: SQL-like syntax and enhanced support for windows and ordering, e.g., CQL (STREAM), StreaQuel (TelegraphCQ), AQuery, Gigascope
2. Object-based: object-oriented stream modeling, classify stream elements according to type hierarchy, e.g., Tribeca, or model the sources as abstract data types (ADTs), e.g., COUGAR
3. Procedural: users specify the data flow, e.g., Borealis, where users construct query plans via a graphical interface
(1) and (2) are declarative query languages. Currently, the relation-based paradigm is mostly used.
Sample Stream
Traffic (
  sourceIP    -- source IP address
  sourcePort  -- port number on source
  destIP      -- destination IP address
  destPort    -- port number on destination
  length      -- length in bytes
  time        -- time stamp
);
Procedural Query (Borealis)
- Simple DoS (SYN flooding) identification query
Selections and Projections
- Selections, (duplicate preserving) projections are straightforward
  - Local, per-element operators
- Duplicate eliminating projection is like grouping
- Projection needs to include ordering attribute
  - No restriction for position ordered streams
  SELECT sourceIP, time FROM Traffic WHERE length > 512
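Because selection and duplicate-preserving projection are local, per-element operators, the query on this slide can be sketched as a Python generator that needs no state and no window. The field names `sourceIP`, `time` and `length` come from the sample Traffic stream; representing each tuple as a dict, and the function name `select_large_packets`, are assumptions made for illustration.

```python
def select_large_packets(traffic, min_length=512):
    """Per-element selection (length > min_length) and projection.

    `traffic` is any iterable of dict-like tuples with the keys
    sourceIP, time and length (an assumed representation).
    """
    for tup in traffic:
        if tup["length"] > min_length:          # selection
            yield (tup["sourceIP"], tup["time"])  # projection keeps the ordering attribute


if __name__ == "__main__":
    stream = [
        {"sourceIP": "10.0.0.1", "length": 1500, "time": 1},
        {"sourceIP": "10.0.0.2", "length": 64, "time": 2},
        {"sourceIP": "10.0.0.3", "length": 513, "time": 3},
    ]
    for row in select_large_packets(stream):
        print(row)
```

The projection keeps `time`, so the output is itself an ordered stream that further operators can consume.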
Joins
- General case of join operators problematic on streams
  - May need to join arbitrarily far apart stream tuples
  - Equijoin on stream ordering attributes is tractable
- Majority of work focuses on joins between streams with windows specified on each stream
  SELECT A.sourceIP, B.sourceIP FROM Traffic1 A [window T1], Traffic2 B [window T2] WHERE A.destIP = B.destIP
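One way to read the windowed join on this slide is as a symmetric join over two time-based windows: each arriving tuple first expires old tuples, then probes the other stream's window, then enters its own window. The sketch below makes several assumptions that are not in the slides: the two streams arrive merged in non-decreasing `time` order, tuples are dicts, a tuple stays in its window while `now - span < time <= now`, and the function name `window_join` is hypothetical.

```python
from collections import deque


def window_join(events, t1, t2):
    """Join streams A and B on destIP over time windows of length t1 and t2.

    `events` yields (stream_id, tuple) pairs with stream_id "A" or "B",
    ordered by tuple["time"]. Yields (A.sourceIP, B.sourceIP) pairs.
    """
    windows = {"A": deque(), "B": deque()}
    spans = {"A": t1, "B": t2}
    for side, tup in events:
        now = tup["time"]
        # Expire tuples that have slid out of either window.
        for name, win in windows.items():
            while win and win[0]["time"] <= now - spans[name]:
                win.popleft()
        # Probe the opposite window for matching destination addresses.
        other = "B" if side == "A" else "A"
        for old in windows[other]:
            if old["destIP"] == tup["destIP"]:
                a, b = (tup, old) if side == "A" else (old, tup)
                yield (a["sourceIP"], b["sourceIP"])
        windows[side].append(tup)


if __name__ == "__main__":
    merged = [
        ("A", {"sourceIP": "1.1.1.1", "destIP": "9.9.9.9", "time": 0}),
        ("B", {"sourceIP": "2.2.2.2", "destIP": "9.9.9.9", "time": 3}),
        ("B", {"sourceIP": "3.3.3.3", "destIP": "9.9.9.9", "time": 20}),
    ]
    print(list(window_join(merged, t1=10, t2=10)))
```

The windows bound the state each side has to keep, which is what makes the join feasible on unbounded inputs; a linear probe is used here for brevity, where a real system would likely index each window by the join attribute.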
Aggregations
- General form:
  select G, F1 from S where P group by G having F2 op ϑ
  - G: grouping attributes; F1, F2: aggregate expressions
- Window techniques are needed!
- Aggregate expressions:
  - distributive: sum, count, min, max
  - algebraic: avg
  - holistic: count-distinct, median
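The three classes of aggregate expressions differ in how much state must be kept per window. As a rough sketch (function names are hypothetical), an algebraic aggregate such as avg can be computed from a fixed-size partial result, here a (sum, count) pair built from the distributive aggregates sum and count, while a holistic aggregate such as median in general needs all the values.

```python
import statistics


def partial_avg(values):
    """Fixed-size partial state for avg: (sum, count)."""
    return (sum(values), len(values))


def merge_avg(p1, p2):
    """Combine two partial states; sum and count are distributive."""
    return (p1[0] + p2[0], p1[1] + p2[1])


def finalize_avg(partial):
    """Turn the partial state into the final average."""
    total, count = partial
    return total / count


def exact_median(parts):
    """Holistic: the exact median needs every value from every part."""
    all_values = [v for part in parts for v in part]
    return statistics.median(all_values)


if __name__ == "__main__":
    parts = [[1, 2, 9], [3, 4]]
    merged = merge_avg(partial_avg(parts[0]), partial_avg(parts[1]))
    print(finalize_avg(merged))   # average from two small partial states
    print(exact_median(parts))    # requires the full data
    # Combining per-part medians does not give the true median in general:
    print(statistics.median(statistics.median(p) for p in parts))
```

This is one reason holistic aggregates over streams are usually answered approximately or restricted to bounded windows.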
Query Optimization
- DBMS: table-based cardinalities used in query optimization => problematic in a streaming environment
- Cost metrics and statistics: accuracy and reporting delay vs. memory usage, output rate, power usage
- Query optimization: query rewriting to minimize cost metric; adaptive query plans, due to changing processing time of operators, selectivity of predicates, and stream arrival rates
- Query optimization techniques
  - stream rate based
  - resource based
  - QoS based
- Continuously adaptive optimization
- Possibility that objectives cannot be met:
  - resource constraints
  - bursty arrivals under limited processing capability
Data Reduction Techniques
- Aggregation: approximations, e.g., mean or median
- Load shedding: drop random tuples
- Sampling: only consider samples from the stream (e.g., random selection). Used in sensor networks.
- Sketches: summaries of the stream that occupy a small amount of memory, e.g., randomized sketching
- Wavelets: hierarchical decomposition
- Histograms: approximate frequency of element values in the stream
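As one concrete instance of sampling, the sketch below uses reservoir sampling (Algorithm R) to keep a uniform random sample of fixed size `k` from a stream of unknown length in a single pass. The slides do not name this algorithm; it is picked here as a well-known example, and the function name `reservoir_sample` and the optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from `stream`.

    Uses O(k) memory and one pass. `rng` is an optional random.Random
    instance so that results can be made reproducible.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
    print(sample)
```

Queries can then be run over the sample instead of the full stream, trading exact answers for bounded memory.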
Today’s Agenda
- Introduction
  - Research field
  - DBMS vs. DSMS
  - Motivation
- Concepts and Issues
  - Requirements
  - Architecture
  - Data model
  - Queries
  - Data reduction
- Examples
  - TelegraphCQ
Morten Lindeberg: 1st and 2nd lecture. Jarle Søberg: 3rd lecture.