Models and Issues in Data Stream Systems
Rajeev Motwani, Stanford University
(with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
STREAM Project Members: Arvind Arasu, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma
PODS 2002
Data Streams
• Traditional DBMS – data stored in finite, persistent data sets
• New Applications – data input as continuous, ordered data streams
  – Network monitoring and traffic engineering
  – Telecom call records
  – Network security
  – Financial applications
  – Sensor networks
  – Manufacturing processes
  – Web logs and clickstreams
  – Massive data sets
Using Traditional Database
[Figure: User/Application issues a Query and receives a Result; data enters through a Loader]
New Approach for Data Streams
[Figure: User/Application registers a query with the Stream Query Processor and receives results; the Data Stream Management System (DSMS) uses Scratch Space (Memory and/or Disk)]
Sample Applications
• Network security (e.g., iPolicy, NetForensics/Cisco, Niksun)
  – Network packet streams, user session information
  – Queries: URL filtering, detecting intrusions & DOS attacks & viruses
• Financial applications (e.g., Traderbot)
  – Streams of trading data, stock tickers, news feeds
  – Queries: arbitrage opportunities, analytics, patterns
  – SEC requirement on closing trades
Sample Applications
• Network management and traffic engineering (e.g., Sprint)
  – Streams of measurements and packet traces
  – Queries: detect anomalies, adjust routing
• Telecom call data (e.g., AT&T)
  – Streams of call records
  – Queries: fraud, customer call patterns, billing
Sample Applications
• Sensor networks (e.g., Cornell’s Cougar, UCB’s Telegraph)
  – Large number of cheap, wireless sensors
  – Noisy streams of real-world measurements
  – Abstraction: massive distributed database
  – Queries: aggregate, correlate, localize, alert
  – Novelty: control rate for battery power
• Manufacturing processes
  – Sensors in plants and warehouses
Sample Applications
• Web tracking and personalization (e.g., Yahoo, Google, Akamai)
  – Clickstreams, user query streams, log records
  – Queries: monitoring, analysis, personalization
• Truly massive databases (e.g., Astronomy Archives)
  – Stream the data by once (or over and over)
  – Queries: do the best they can
Challenges
• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries may be continuous (not just one-time)
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries may be complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing
  – Beyond relational queries (scientific, data mining)
Executive Summary
• Data Stream Management Systems (DSMS)
  – Highlight issues and motivate research
  – Not a tutorial or comprehensive survey
• Caveats
  – Personal view of emerging field
  – Stanford STREAM Project bias
  – Cannot cover all projects in detail
Meta-Questions
• Killer-apps
  – Application stream rates exceed DBMS capacity?
  – Can DSMS handle high rates anyway?
• Motivation
  – Need for general-purpose DSMS?
  – Not ad-hoc, application-specific systems?
• Non-Trivial
  – DSMS = merely DBMS with enhanced support for triggers, temporal constructs, data rate management?
DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
“Unbounded” disk store | Bounded main memory
Only current state matters | History/arrival-order is critical
Passive repository | Active stores
Relatively low update rate | Possibly multi-GB arrival rate
No real-time services | Real-time requirements
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Making Things Concrete
[Figure: ALICE and BOB connected through a Central Office, which feeds two streams to a DSMS]
• Outgoing (call_ID, caller, time, event)
• Incoming (call_ID, callee, time, event)
• event = start or end
Query 1 (self-join)
• Find all outgoing calls longer than 2 minutes
    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = start
           AND O2.event = end)
• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
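A minimal Python sketch of how Query 1 might be evaluated as a stream, not the STREAM system's actual operator. It assumes tuples arrive in timestamp order, that `time` is measured in minutes, and that events are the strings "start" and "end"; the function name `long_calls` is hypothetical. It illustrates the last bullet: a call is reported as soon as the clock shows it has been open for more than the threshold, without waiting for its end tuple.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (call_ID, caller, time, event); the layout mirrors the Outgoing schema.
# Minutes as the time unit is an assumption.
Record = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Record],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) for every call lasting longer than threshold."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Every arriving tuple advances the clock: report open calls that
        # have already exceeded the threshold and drop their state.
        overdue = [c for c, (_, t0) in open_calls.items()
                   if time - t0 > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # A call still open here ended within the threshold: forget it.
            open_calls.pop(call_id, None)


stream = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 0.5, "start"),
    (2, "bob", 1.0, "end"),
    (3, "carol", 2.5, "start"),
    (1, "alice", 5.0, "end"),
]
reported = list(long_calls(stream))
```

The state kept is one entry per call that is open and not yet reported, so memory depends on concurrency rather than on stream length; the output itself is an append-only stream.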
Query 2 (join)
• Pair up callers and callees
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID
• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
Query 3 (group-by aggregation)
• Total connection time for each caller
    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = start
           AND O2.event = end)
    GROUP BY O1.caller
• Cannot provide result in (append-only) stream
  – Output updates?
  – Provide current value on demand?
  – Memory?
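One of the options raised above, "output updates", can be sketched as follows. This is an illustrative Python generator under stated assumptions (in-order tuples, string events "start" and "end", hypothetical name `connection_time_updates`), not the deck's own algorithm. Each completed call emits the caller's new running total, so a consumer sees a stream of updates rather than an append-only result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(
        outgoing: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Yield (caller, new running total) each time a call completes."""
    starts: Dict[int, Tuple[str, float]] = {}     # call_ID -> (caller, start)
    totals: Dict[str, float] = defaultdict(float)  # caller -> total so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = (caller, time)
        elif event == "end" and call_id in starts:
            start_caller, t0 = starts.pop(call_id)
            totals[start_caller] += time - t0
            # An update tuple supersedes the caller's previous total.
            yield start_caller, totals[start_caller]


records = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (1, "alice", 3.0, "end"),
    (3, "alice", 4.0, "start"),
    (2, "bob", 6.0, "end"),
    (3, "alice", 9.0, "end"),
]
updates = list(connection_time_updates(records))
```

The `totals` table holds one entry per distinct caller, which is the "Memory?" concern on the slide: the state grows with the number of groups even though each group's aggregate is a single number.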
Remainder of Talk
Reconsider all aspects of data management and processing in presence of data streams
Outline of Remaining Talk
• Stream Models and DSMS Architectures
• Query Processing
• Runtime and Systems Issues
• Algorithms
• Conclusion
Data Characteristics
• Database: stored relations + data streams
• Stream characteristics
  – Type of data (schema)
  – Data distribution
  – Flow rate
  – Stability of distribution and flow
  – Ordering and other constraints
  – Timestamps
  – Synchronization of multiple streams
  – Distributed streams
Data Model
• Append-only – Call records
• Updates – Stock tickers
• Deletes – Transactional data
• Meta-Data – Control signals, punctuations
System Internals – probably need all above
Query Model
[Figure: User/Application interacting with the Query Processor inside the DSMS]
Related Database Technology
• DSMS must use ideas, but none is substitute
  – Triggers, Materialized Views in Conventional DBMS
  – Main-Memory Databases
  – Distributed Databases
  – Pub/Sub Systems
  – Active Databases
  – Sequence/Temporal/Timeseries Databases
  – Realtime Databases
  – Adaptive, Online, Partial Results
• Novelty in DSMS
  – Semantics: input ordering, streaming output, …
  – State: cannot store unending streams, yet need history
  – Performance: rate, variability, imprecision, …
Related Database Technology
• Triggers on Conventional Databases
  – handling stream ordering/rate, scaling/generality for triggers
• Main-Memory Databases
  – handling stream ordering/rate, better for read-only/query-intensive
• Publish/Subscribe Systems
  – handling stream ordering, event-filtering only, dissemination focus
• Materialized Views
  – handling stream ordering, no streaming output
• Active Databases
  – event-condition-action rules, similar to triggers
• Sequence/Temporal/Timeseries Databases
  – represents time/ordering in stored relations
• Realtime Databases
  – transactions with deadlines
Stream Projects
• Amazon/Cougar (Cornell) – sensors
• Aurora (Brown/MIT) – sensor monitoring, dataflow
• Hancock (AT&T) – telecom streams
• Niagara (OGI/Wisconsin) – Internet XML databases
• OpenCQ (Georgia) – triggers, incr. view maintenance
• STREAM (Stanford) – general-purpose DSMS
• Tapestry (Xerox) – pub/sub content-based filtering
• Telegraph (Berkeley) – adaptive engine for sensors
• Tribeca (Bellcore) – network monitoring
Aurora Architecture (Brown/MIT)
[Figure: Input Data Streams flow through a network of operators (⋈, σ, π) with Historical Storage to Output Streams]
• Applications process output streams
• UI for designing operator network and querying trigger-base
• Application Administrator specifies processing strategy and QoS parameters
STREAM Architecture (Stanford)
[Figure: Input streams feed Query Plans (Running Op, Ready Op, Waiting Op) backed by Synopses and Historical Storage, producing Output streams]
• Applications register continuous queries
• Users issue continuous and ad-hoc queries
• Administrator monitors query execution and adjusts run-time parameters
Aurora versus STREAM
Aurora | STREAM
Focus on large number of sensor streams | Attempt at a general-purpose DSMS
Exposes query plan | Declarative SQL
QoS/Realtime emphasis | Semantic precision
Load-shedding | Approximations
• Despite different emphasis, two approaches very similar
• Differences in runtime environment (scheduler, memory manager, QoS manager)
Eddies – Continuous Adaptivity [Avnur-Hellerstein]
[Figure: EDDY, a continuously adaptive query plan]
• Tuples flow in different orders
• Visit each operator once before output
• Routing policy adaptively chooses the “optimal” plan
(Slide courtesy: Joe Hellerstein)
Adaptivity (Telegraph)
[Figure: Input Streams R, S, T feed an EDDY that routes tuples among grouped filter (R.A), grouped filter (S.B), and SteMs for the join R × S × T, then to Output Queues]
• Runtime Adaptivity
• Multi-query Optimization
• Framework – implements arbitrary schemes
CACQ versus STREAM
CACQ | STREAM
Continuously adaptive plans | Relatively static query plans
Multi-query optimization via operator sharing and tuple routing | Multi-query optimization via operator sharing and resource allocation
Memory overhead in maintaining tuple lineage | No such extra memory overhead
Memory allocation, scheduling policy hardcoded in system | Flexible allocate/schedule to optimize memory usage and response time
Web-based databases and sensor networks | General-purpose DSMS
Niagara Architecture (Wisconsin)
[Figure: components shown: GUI; Niagara Query Engine (Query Parser, Query Optimizer, Execution Engine); Continuous Query Processor (CQ Manager, Group Optimizer, Event Detector); Niagara Search Engine; Data Manager; Internet Data Sources]
Query-Split Scheme (Niagara)
[Figure: Quotes.XML is joined with a constant table on Symbol = Const.Value (constants IBM → file i, MSFT → file j, …); a split operator writes matches to file i, file j, …, which are scanned by trigger actions trigAct.i and trigAct.j]
• Aggregate subscription for efficiency
• Split – evaluate trigger only when file updated
• Triggers – multi-query optimization
Shared Predicates [Niagara, Telegraph]
• Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Figure: a shared index over the constants, grouped by operator (>: 1, 7, 11; <: 3, 5; =: 6, 8; ≠: 9), probed once per tuple; example tuple with A = 8]
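The idea of probing one shared structure per tuple, instead of evaluating each query's predicate separately, can be sketched as below. This is a simplified Python illustration, not the Niagara or Telegraph implementation; the class name `PredicateIndex` and its methods are hypothetical. Constants of `>` and `<` predicates are kept sorted so a binary search finds all satisfied predicates at once.

```python
import bisect
from typing import List, Set


class PredicateIndex:
    """Index single-attribute predicates (A op const) shared by many queries."""

    def __init__(self) -> None:
        self._gt: List[int] = []   # sorted constants c of predicates A > c
        self._lt: List[int] = []   # sorted constants c of predicates A < c
        self._eq: Set[int] = set()  # constants c of predicates A = c
        self._ne: Set[int] = set()  # constants c of predicates A != c

    def add(self, op: str, const: int) -> None:
        if op == ">":
            bisect.insort(self._gt, const)
        elif op == "<":
            bisect.insort(self._lt, const)
        elif op == "=":
            self._eq.add(const)
        elif op == "!=":
            self._ne.add(const)
        else:
            raise ValueError(f"unsupported operator: {op}")

    def matches(self, a: int) -> List[str]:
        """Return labels of all predicates satisfied by attribute value a."""
        # A > c holds for every constant strictly below a.
        out = [f"A>{c}" for c in self._gt[:bisect.bisect_left(self._gt, a)]]
        # A < c holds for every constant strictly above a.
        out += [f"A<{c}" for c in self._lt[bisect.bisect_right(self._lt, a):]]
        if a in self._eq:
            out.append(f"A={a}")
        # Simplification: inequality predicates are scanned linearly here.
        out += [f"A!={c}" for c in sorted(self._ne) if c != a]
        return out


index = PredicateIndex()
for op, c in [(">", 1), (">", 7), (">", 11), ("<", 3), ("<", 5),
              ("=", 6), ("=", 8), ("!=", 9)]:
    index.add(op, c)
hits_for_8 = index.matches(8)
```

For the slide's example tuple with A = 8, one probe reports `A>1`, `A>7`, `A=8`, and `A!=9`; the cost of the range lookups grows with the logarithm of the number of registered predicates plus the number of matches.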
Niagara versus STREAM
Niagara | STREAM
Files (possibly on disk) store intermediate results | Queues (in main memory) store intermediate results
Multi-query optimization via operator sharing, incremental group optimization | Multi-query optimization via operator sharing and resource management
Explicit timer support for query output | More difficult to express in query language
No support for approximation and load management | Graceful degradation with load via approximations
Designed for online XML data sources | General-purpose DSMS
Outline of Remaining Talk
• Stream Models and DSMS Architectures
• Query Processing
• Runtime and Systems Issues
• Algorithms
• Conclusion
Query Processing Outline
• Query Language
• Blocking Operators, Punctuations, Constraints
• Impact of Limited Memory
• Approximations
  – Sliding Windows and Timestamps
  – Load-shedding
  – Synopses
• Query Evaluation
  – Multiple Queries
  – Adaptive Processing
  – Distributed Processing
Blocking Operators
• Blocking
  – No output until entire input seen
  – Streams – input never ends
• Simple Aggregates – output “update” stream
• Set Output (sort, group-by)
  – Root – could maintain output data structure
  – Intermediate nodes – try non-blocking analogs
  – Example – juggle for sort [Raman, R, Hellerstein]
  – Punctuations and constraints
• Join
  – non-blocking, but intermediate state?
  – sliding-window restrictions
Punctuations [Tucker, Maier, Sheard, Fegaras]
• Assertion about future stream contents
• Unblocks operators, reduces state
[Figure: group-by over the join of R and S, with State/Index partitioned into R.A < 10 and R.A ≥ 10; punctuation P: S.A ≥ 10]
• Future Work
  – Inserted at source or internal (operator signaling)?
  – Does P unblock Q? Exists P? Rewrite Q?
  – Relation between P and memory for Q?
Constraints
  – Schema-level: ordering, referential integrity, many-one joins
  – Instance-level: punctuations
  – Query-level: windowed join (nearby tuples only)
• [Babu-Widom]
  – Input – multi-stream SPJ query, schema-level constraints
  – Output – plan with low intermediate state for joins
• Future Work
  – Query-level constraints? Combining constraints?
  – Relaxed constraints (near-sorted, near-clustered)
  – Exploiting constraints in intra-operator signaling
Relaxed Constraints
• DBMS – typically strict constraints
• Streams are dynamic ⇒ relax constraints
• Example
  – Stream S, attribute A, tuples s, t
  – Strict Ordering: time(t) – time(s) ≥ 0 ⇒ t.A ≥ s.A
  – Relaxed Ordering: time(t) – time(s) ≥ k ⇒ t.A ≥ s.A
  – Allow limited out-of-order arrival
• Open – relaxation-benefit tradeoff
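To make the relaxed-ordering constraint concrete, here is a hedged Python sketch (the function name `sort_near_ordered` and the input layout are assumptions, not from the talk). If every tuple that arrived at least k time units earlier has a smaller or equal A, then the largest A among tuples at least k old is a lower bound on everything still to come, so buffered tuples at or below that bound can be emitted in sorted order.

```python
import heapq
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def sort_near_ordered(stream: Iterable[Tuple[float, int]],
                      k: float) -> Iterator[int]:
    """Emit A-values in nondecreasing order under relaxed ordering with slack k.

    stream yields (arrival_time, A) pairs with nondecreasing arrival_time and
    is assumed to satisfy: time(t) - time(s) >= k implies t.A >= s.A.
    """
    pending: Deque[Tuple[float, int]] = deque()  # arrivals younger than k
    heap: List[int] = []                         # buffered, not yet emitted
    watermark = float("-inf")                    # lower bound on future A
    for now, a in stream:
        # Tuples at least k old bound every tuple arriving from now on.
        while pending and now - pending[0][0] >= k:
            watermark = max(watermark, pending.popleft()[1])
        pending.append((now, a))
        heapq.heappush(heap, a)
        while heap and heap[0] <= watermark:
            yield heapq.heappop(heap)
    # A finite test stream has ended: flush whatever is still buffered.
    while heap:
        yield heapq.heappop(heap)


values = [2, 1, 3, 5, 4, 6, 7]
ordered = list(sort_near_ordered(enumerate(values), k=2))
```

The buffer only has to hold tuples that may still be overtaken, which is where the relaxation-benefit tradeoff shows up: a larger k tolerates more disorder but requires more buffering and delays output.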
Impact of Limited Memory
• Continuous streams grow unboundedly
• Queries may require unbounded memory
• [ABBMW02] – a priori memory bounds for query
  – Conjunctive queries with arithmetic comparisons
  – Queries with join need domain restrictions
  – Impact of duplicate elimination
• Open – general queries
Approximate Query Evaluation
• Why?
  – Handling load – streams coming too fast
  – Avoid unbounded storage and computation
  – Ad hoc queries need approximate history
• How? Sliding windows, synopses, samples, load-shedding
• Major Issues?
  – Metric for set-valued queries
  – Composition of approximate operators
  – How is it understood/controlled by user?
  – Integrate into query language
  – Query planning and interaction with resource allocation
  – Accuracy-efficiency-storage tradeoff and global metric
Sliding Window Approximation
[Figure: bit stream 011000011100000101010]
• Why?
  – Approximation technique for bounded memory
  – Natural in applications (emphasizes recent data)
  – Well-specified and deterministic semantics
• Issues
  – Extend relational algebra, SQL, query optimization
  – Algorithmic work
  – Timestamps?
Timestamps
• Explicit
  – Injected by data source
  – Models real-world event represented by tuple
  – Tuples may be out-of-order, but if near-ordered can reorder with small buffers
• Implicit
  – Introduced as special field by DSMS
  – Arrival time in system
  – Enables order-based querying and sliding windows
• Issues
  – Distributed streams?
  – Composite tuples created by DSMS?
Timestamps in JOIN Output
[Figure: join of streams R and S producing output T]
Approach 1 | Approach 2
User-specified, with defaults | Best-effort, no guarantee
Compute output timestamp | Output timestamp is exit-time
Must output in order of timestamps | Tuples arriving earlier more likely to exit earlier
Better for Explicit Timestamp | Better for Implicit Timestamp
Need more buffering | Maximum flexibility to system
Get precise semantics and user-understanding | Difficult to impose precise semantics
Approximate via Load-Shedding
Handles scan and processing rate mismatch
Input Load-Shedding | Output Load-Shedding
Sample incoming tuples | Buffer input, infrequent output
Use when scan rate is bottleneck | Use when query processing is bottleneck
Positive – online aggregation [Hellerstein, Haas, Wang] | Example – XJoin [Urhan, Franklin]
Negative – join sampling [Chaudhuri, Motwani, Narasayya] | Exploit synopses
Processing Multiple Queries
• Large number of continuous queries
• Long-running
• Shared resources
• Multi-query optimization
  – Operator sharing
  – Adaptivity (Eddies)
  – Shared predicate indexes (CACQ, Niagara)
Adaptive Query Evaluation
• Why adaptivity?
  – Queries are long-running
  – Fluctuating stream arrival & data characteristics
  – Evolving query loads
• Issues in Adaptivity
  – Resource allocation (memory, computation)
  – Dynamic query execution plans (Eddies)
Distributed Query Evaluation
• Logical stream = many physical streams
  – maintain top 100 Yahoo pages
• Correlate streams at distributed servers
  – network monitoring
• Many streams controlled by few servers
  – sensor networks
• Issues
  – Move processing to streams, not streams to processors
  – Approximation-bandwidth tradeoff
Example: Distributed Streams
• Maintain top 100 Yahoo pages
  – Pages served by geographically distributed servers
  – Must aggregate server logs
  – Minimize communication
• Pushing processing to streams
  – Most pages not in top 100
  – Avoid communicating about such pages
  – Send updates about relevant pages only
  – Requires server coordination
Distributed Streams: Example
• Problem – streams are dynamic
  – Popular set can change rapidly (e.g., 9/11)
  – Must detect popularity change
  – Communication for unpopular pages is unavoidable
• Approach – assign server “quotas”
  – Servers report: count of unpopular page exceeds quota
  – Popularity rise detected quickly
  – Low communication for pages remaining unpopular
Stream Query Language?
• SQL extension
• Sliding windows as first-class construct
  – Awkward in SQL, needs reference to timestamps
  – SQL-99 allows aggregations over sliding windows
• Sampling/approximation/load-shedding/QoS support?
• Stream relational algebra and rewrite rules
  – Aurora and STREAM
  – Sequence/Temporal Databases
Outline of Remaining Talk
• Stream Models and DSMS Architectures
• Query Processing
• Runtime and Systems Issues
• Algorithms
• Conclusion
STREAM Implementation Goals
• Comprehensive DSMS query processor
• Broad suite of operators/synopses
• Sophisticated Developers-Workbench interface
  – Submit queries in extended SQL or algebra
  – Submit/edit query plans in XML or GUI
  – Visualizing query execution
  – On-the-fly modification of memory allocation, scheduling policies, queue management, etc.
Aurora Run-time Architecture
[Figure: components shown: Inputs, Router, queues Q1–Q5, Scheduler, Box Processors (π, σ, ⋈), Buffer Manager, Catalogs, Persistent Store, Load Shedder, QoS Monitor, Outputs]
DSMS Internals
• Query plans: operators, synopses, queues
• Memory management
  – Dynamic allocation – queries, operators, queues, synopses
  – Graceful adaptation to reallocation
  – Impact on throughput and precision
• Operator scheduling
  – Variable-rate streams, varying operator/query requirements
  – Response time and QoS
  – Load-shedding
  – Interaction with queue/memory management
Synopses Memory Management
• Current work: optimize synopsis for accuracy, respecting memory constraint
• Global Optimization
  – Memory allocation to synopses
  – Similar to [Jagadish, Jin, Ooi, Tan]
• Synopsis sharing?
  – Across multiple operators and queries
  – Similar to optimal Index Selection
Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani]
• Goal
  – Given – query plan and selectivity estimates
  – Schedule – tuples through operator chains
• Minimize total queue memory
  – Best-slope scheduling is near-optimal
  – Danger of starvation for some tuples
• Minimize tuple response time
  – Schedule tuple completely through operator chain
  – Danger of exceeding memory bound
• Open – graceful combination and adaptivity
Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani]
[Figure: operator chain from Input through σ1, σ2, σ3 to Output, with selectivity labels 0.2, 0.6, and 0.0; plot of Net Selectivity against Time for σ1, σ2, σ3, marking the best slope and the starvation point]
Memory vs. Response Time
[Figure: plot with axes Memory and Response Time]
• Open – Parametrized algorithm for trade-off
Precision-Resource Tradeoff
• Resources – memory, computation, I/O
• Global Optimization Problem
  – Input: queries with alternate plans, importance weights
  – Precision: function of resource allocation to queries/operators
  – Goal: select plans, allocate resources, maximize precision
• Memory Allocation Algorithm [Varma, Widom]
  – Model – single query plan, simple precision model
  – Rules for precision of composed operators
  – Non-linear numerical optimization formulation
• Open – Combinatorial algorithm? General case?
Query Optimization
• STREAM
  – Combine resource allocation and query optimizer
• Aurora
  – Data flow ⇒ less scope for plan optimization
  – Inserts inferred projects
  – Reorders selects
• Rate-based optimization [Viglas, Naughton]
  – Model for output-rates as function of input-rates
  – Enables optimizer to increase throughput
Load Shedding
• Aurora – QoS approach
  – Static: drop-based
  – Runtime: delay-based
  – Semantic: value-based
[Figure: three QoS graphs: Drop-based (QoS against % tuples delivered), Delay-based, and Value-based (QoS against output value)]
Query Processing: Other Issues
• Query Processing
  – Handling blocking operators
  – Ad-hoc and Inactive-registered queries
  – Query importance dial
• Query Optimization
  – Caching
  – Disk spilling for full accuracy-memory-latency tradeoff
  – Rate management
• Systems Issues
  – User Interfaces
  – Gathering and using statistics
  – Crash recovery and transaction management
Outline of Remaining Talk
• Stream Models and DSMS Architectures
• Query Processing
• Runtime and Systems Issues
• Algorithms
• Conclusion
Synopses
• Queries may access or aggregate past data
• Need bounded-memory history-approximation
• Synopsis?
  – Succinct summary of old stream tuples
  – Like indexes/materialized-views, but base data is unavailable
• Examples
  – Sliding Windows
  – Samples
  – Sketches
  – Histograms
  – Wavelet representation
Model of Computation
[Figure: a Data Stream of bits arriving in order of increasing time and feeding Synopses/Data Structures]
• Memory: poly(1/ε, log N)
• Query/Update Time: poly(1/ε, log N)
• N: # tuples so far, or window size
• ε: error parameter
Reservoir Sampling [Vitter]
• Maintain R samples from stream
• Can generalize to weighted samples
• Efficient implementation
• N = number of data items seen
[Figure: reservoir with R = 5 holding items 1–5; for arrivals N = 6, 7, 8 a coin with heads probability 5/6, 5/7, 5/8 is flipped, and on heads the new item replaces one reservoir entry chosen at random]
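The coin-flip procedure in the figure corresponds to the basic form of reservoir sampling, sketched below in Python. This is the simple one-draw-per-item variant, not the skip-based "efficient implementation" the slide alludes to; the function name and the `rng` parameter are illustrative choices.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], r: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to r items from a stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= r:
            # The first r items fill the reservoir.
            reservoir.append(item)
        else:
            # Item n is kept with probability r/n; if kept, it replaces
            # a uniformly chosen reservoir slot.
            j = rng.randrange(n)
            if j < r:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100), 5, random.Random(0))
```

Drawing one index in `[0, n)` and replacing only when it falls below `r` combines the coin flip (heads with probability r/n) and the choice of which entry to evict into a single random draw.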
Sketching Techniques
• [Alon, Matias, Szegedy] frequency moments
• [Feigenbaum et al., Indyk] extended to Lp norm
• [Dobra et al.] complex aggregates over joins
• Key Subproblem – Self-Join Size Estimation
  – Stream of values from D = {1, 2, …, N}
  – Let $f_i$ = frequency of value i
  – Self-join size $S = \sum_i f_i^2$
  – Question – estimating S in small space?
Self-Join Size Estimation
• AMS Technique (randomized sketches)
  – Given $(f_1, f_2, \dots, f_N)$
  – $Z_i$ = random $\{-1, +1\}$
  – $X = \sum_i f_i Z_i$ (X incrementally computable)
• Theorem: $\mathrm{E}[X^2] = \sum_i f_i^2$
  – Cross-terms $f_i Z_i f_j Z_j$ ($i \ne j$) have 0 expectation
  – Square-terms $(f_i Z_i)^2 = f_i^2$
• Space = $\log(N + \sum_i f_i)$
• Independent samples $X_k$ reduce variance
Sample Run of AMS
[Figure: two worked examples, each with a frequency vector V and two random sign vectors Z1, Z2]
• Example 1: $\sum_i v_i^2 = 123$; $X_1 = 5$, $X_1^2 = 25$; $X_2 = 14$, $X_2^2 = 196$; Est = 110.5
• Example 2: $\sum_i v_i^2 = 130$; $X_1 = 6$, $X_1^2 = 36$; $X_2 = 12$, $X_2^2 = 144$; Est = 90
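The sample run averages the squares of independent copies of X, which can be sketched in Python as follows. The class name `AMSSketch` is hypothetical, and the sign function is an assumption: signs are taken from the low bit of a random cubic polynomial over a prime field, a common way to approximate the 4-wise independence the variance analysis relies on. The averaging estimator matches the slide's examples, e.g. (25 + 196) / 2 = 110.5.

```python
import random
from typing import List

_P = (1 << 61) - 1  # Mersenne prime used as the hash field (an assumption)


class AMSSketch:
    """Estimate the self-join size sum_i f_i^2 from a stream of values."""

    def __init__(self, copies: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One random cubic polynomial per copy defines that copy's signs Z_i.
        self._coeffs: List[List[int]] = [
            [rng.randrange(_P) for _ in range(4)] for _ in range(copies)
        ]
        self._x: List[int] = [0] * copies  # X_k = sum_i f_i * Z_i for copy k

    def _sign(self, k: int, value: int) -> int:
        a0, a1, a2, a3 = self._coeffs[k]
        h = (((a3 * value + a2) * value + a1) * value + a0) % _P
        return 1 if h & 1 else -1

    def update(self, value: int, count: int = 1) -> None:
        """Process `count` occurrences of `value` (negative count deletes)."""
        for k in range(len(self._x)):
            self._x[k] += count * self._sign(k, value)

    def estimate(self) -> float:
        """Average of X_k^2 over all independent copies."""
        return sum(x * x for x in self._x) / len(self._x)


sketch = AMSSketch(copies=2000, seed=7)
for v in range(1, 21):
    sketch.update(v, count=v)          # value v occurs v times
truth = sum(v * v for v in range(1, 21))
estimate = sketch.estimate()
```

Each copy stores a single counter plus four hash coefficients, independent of the number of distinct values; more copies trade space for lower variance, as the last bullet of the previous slide notes.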
Tug-of-War
• Even dimensions go right (+1), odd dimensions go left (−1)
• Each item is {x, i}: “x” = value, “i” = dimension; based on “i”, add pull of strength “x” to right or left
[Figure: items {1, 5}, {4, 3}, {1, 1}, {1, 2}, {2, 4}, {1, 6}, {4, 8} split between the odd-dimension and even-dimension teams]
• $\mathrm{E}[(\text{strength(right)} - \text{strength(left)})^2] = F_2$ (squared $\ell_2$ norm)
Quantiles (Greenwald-Khanna)
• Maintain triplets $(v_i, g_i, \Delta_i)$, where the $v_i$ are increasing
• $g_i$ represents number of values merged into this representative tuple (can be viewed as added error)
• $\Delta_i$ represents error when tuple was created
• Merge tuples periodically while maintaining $g_i + \Delta_i \le 2\varepsilon N$
Example with $2\varepsilon N = 10$:
$v_i$ | 2 | 7 | 8 | 10 | 34 | 78 | 85 | 100
$g_i$ | 1 | 2 | 3 | 6 | 2 | 3 | 4 | 5
$\Delta_i$ | 0 | 2 | 5 | 3 | 2 | 3 | 4 | 4
• 5 tuples get added; they can be merged into “34”
Sliding Window Computations [Datar, Gionis, Indyk, Motwani]
• Goal: statistics/queries
• Memory: o(N), preferably poly(1/ε, log N)
• Problem: count/sum/variance, histogram, clustering, …
• Sample Results: (1+ε)-approximation
  – Counting: Space O(1/ε (log N)) bits, Time O(1) amortized
  – Sum over [0, R]: Space O(1/ε log N (log N + log R)) bits, Time O(log R/log N) amortized
  – Lp sketches: maintain with poly(1/ε, log N) space overhead
  – Matching space lower bounds
Sliding Window Histograms
• Key Subproblem – Counting 1’s in bit-stream
• Goal – Space O(log N) for window size N
• Problem – Accounting for expiring bits
• Idea
  – Partition/track buckets of known count
  – Error in oldest bucket only
  – Future 0’s?
[Figure: bit stream partitioned into buckets: 100101 | 111 | 101001 | 11100000…]
Exponential Histograms
• Buckets of exponentially increasing size
• Between K/2 and K/2+1 buckets of each size
• K = 1/ε and ε = relative error
• Invariant: $C_{i-1} + C_{i-2} + \dots + C_2 + C_1 + 1 \ge (K/2)\, C_i$
[Figure: example with K = 2 on the stream 1 1 0 1 0 1 0 1 1 …, marking the element that arrived this step and future elements; bucket-size snapshots shown: 4, 2, 2, 1 / 4, 2, 2, 1, 1, 1 / 4, 2, 2, 2, 1 / 4, 4, 2, 1]
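A minimal Python sketch of the bucket maintenance described above, assuming an even K and the merge rule "when more than K/2 + 1 buckets share a size, merge the two oldest of them". The class name, the `sizes` helper, and the estimator (total minus half of the oldest bucket) are illustrative choices rather than details given on the slide.

```python
from collections import deque
from typing import Deque, List


class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window: int, k: int = 2) -> None:
        self.window = window
        self.max_per_size = k // 2 + 1  # allowed buckets of each size
        self.now = 0                    # position of the latest bit
        # Buckets as [position of most recent 1, size], newest first.
        self.buckets: Deque[List[int]] = deque()

    def add(self, bit: int) -> None:
        self.now += 1
        # Expire the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.now - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft([self.now, 1])
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_per_size:
                break
            # Too many buckets of this size: merge the two oldest ones.
            newer, older = self.buckets[j - 2], self.buckets[j - 1]
            newer[1] += older[1]   # merged bucket keeps the newer position
            del self.buckets[j - 1]
            i = j - 2              # the merge may overflow the next size

    def sizes(self) -> List[int]:
        """Bucket sizes from oldest to newest."""
        return [size for _, size in reversed(self.buckets)]

    def estimate(self) -> float:
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket may straddle the window boundary.
        return total - self.buckets[-1][1] / 2


eh = ExponentialHistogram(window=1000, k=2)
for _ in range(9):
    eh.add(1)
after_nine = eh.sizes()
eh.add(1)
eh.add(1)
after_eleven = eh.sizes()
```

With K = 2 and a stream of 1s, this reproduces the cascade in the figure: nine 1s give sizes 4, 2, 2, 1, and the eleventh 1 triggers two merges, ending at 4, 4, 2, 1. All uncertainty sits in the oldest bucket, so a larger K (more buckets per size) makes that bucket a smaller fraction of the count.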
Many other results …
• Histograms
  – V-Opt Histograms [Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss], [Indyk]
  – End-Biased Histograms (Iceberg Queries) [Manku, Motwani], [Fang, Shiva, Garcia-Molina, Motwani, Ullman]
  – Equi-Width Histograms (Quantiles) [Manku, Rajagopalan, Lindsay], [Khanna, Greenwald]
  – Wavelets: seminal work [Vitter, Wang, Iyer] + many others!
• Data Mining
  – Stream Clustering [Guha, Mishra, Motwani, O’Callaghan], [O’Callaghan, Meyerson, Mishra, Guha, Motwani]
  – Decision Trees [Domingos, Hulten], [Domingos, Hulten, Spencer]
Algorithms – Open Issues
• Global space allocation for synopsis
  – Global error metric
  – Dynamic optimization
• Synopses for sliding windows
• Correlated aggregates
  – [Gehrke, Korn, Srivastava]
  – Provable guarantees?
• Distributed Algorithms (e.g., top-k counting)
Conclusion: Future Work
• Query Processing
  – Stream Algebra and Query Languages
  – Approximations
  – Blocking, Constraints, Punctuations
• Runtime Management
  – Scheduling, Memory Management, Rate Management
  – Query Optimization (Adaptive, Multi-Query, Ad-hoc)
  – Distributed processing
• Synopses and Algorithmic Problems
• Systems
  – UI, statistics, crash recovery and transaction management
  – System development and deployment
Thank You!