StreamBase Systems
Stream Processing Overview
Dr. Stan Zdonik, Co-Founder
March 14, 2006
www.streambase.com © Copyright StreamBase®.

Agenda
§ Problem Space and Landscape
§ Case Scenarios
§ Technical Approaches to CEP
§ What is required of a Stream Processing Engine
  - Emphasis on StreamSQL
§ Future Directions for the Community

StreamBase at a Glance
§ Founded in 2003 by Dr. Mike Stonebraker (Ingres, Illustra)
§ Initial research prototype at MIT, Brown, & Brandeis (2001)
§ Boston-based company, with offices in NY, Washington, DC, & Europe
§ Financial backing by tier-one venture capital firms
§ Solid, growing customer base
§ Do for real-time data what relational database and SQL do for stored data

Use Case: Running VWAP
§ Scenario:
  - Every minute for every stock I am trading:
    § Calculate VWAP (volume-weighted average price) for my trades & all trades
    § Alert whenever my personal trading execution is inferior to market
§ Solution:
  - 5 StreamBase operators, 30 min to build

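The per-minute VWAP comparison in this scenario can be sketched in plain Python. This is only an illustration of the calculation, not the five-operator StreamBase application: the tuple layout `(timestamp_seconds, symbol, price, volume)`, the function names, and the reading of "inferior" as "my average buy price is above the market VWAP" are all assumptions made for the example.

```python
from collections import defaultdict


def vwap_by_minute(trades):
    """Return {(minute, symbol): VWAP} for an iterable of trades.

    Each trade is a (timestamp_seconds, symbol, price, volume) tuple;
    this record layout is an assumption made for the example.
    """
    notional = defaultdict(float)  # running sum of price * volume per bucket
    volume = defaultdict(float)    # running sum of volume per bucket
    for ts, symbol, price, vol in trades:
        key = (int(ts // 60), symbol)  # one bucket per minute per symbol
        notional[key] += price * vol
        volume[key] += vol
    return {k: notional[k] / volume[k] for k in notional if volume[k] > 0}


def inferior_executions(my_trades, market_trades):
    """Return the (minute, symbol) buckets where my VWAP exceeded the market's.

    Assumes buy orders, so paying more than the market VWAP is "inferior".
    """
    mine = vwap_by_minute(my_trades)
    market = vwap_by_minute(market_trades)
    return sorted(k for k in mine if k in market and mine[k] > market[k])
```

A streaming engine would emit each bucket as its minute closes; this batch version only shows the arithmetic behind the alert.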
Use Case: Intrusion Detection
§ Client Scenario
  - Need to identify unusual patterns in IP connections
§ Solution
  - Implement sophisticated filtering & monitoring to drive real-time alerting
  - Immediate termination of suspicious user access
§ Delivery
  - Process, analyze, & act on 50k msgs/sec
[Figure: Example of IP intrusion detection with StreamBase. Operator labels: (Group by IP prefix; Sum), Filter sum > T1; (Group by source IP; Count), Filter count > T1; (Group by source IP; count distinct protocol), Filter count > T2; Join on Source IP. Sample values: 18.31.0.*, 18.31.0.89; http, dns, ssh.]

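Two of the branches named in the figure (count per source IP, and count of distinct protocols per source IP, each followed by a threshold filter and then a join on source IP) can be sketched as follows. The wiring between operators is inferred from the figure labels, and the `(source_ip, protocol)` record layout, the function name, and the threshold parameters are assumptions; the sketch also processes one finite batch rather than a time window.

```python
from collections import defaultdict


def suspicious_sources(connections, max_count, max_protocols):
    """Return source IPs that exceed both thresholds.

    connections: iterable of (source_ip, protocol) pairs (assumed layout).
    """
    counts = defaultdict(int)     # Group by source IP; Count
    protocols = defaultdict(set)  # Group by source IP; count distinct protocol
    for src, proto in connections:
        counts[src] += 1
        protocols[src].add(proto)
    busy = {ip for ip, n in counts.items() if n > max_count}
    varied = {ip for ip, p in protocols.items() if len(p) > max_protocols}
    # Joining the two filtered branches on source IP is a set intersection here.
    return sorted(busy & varied)
```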
Use Case: Battalion Monitoring
§ Client Scenario
  - Government contractor required filtering of data and reports from reconnaissance aircraft of friendly and enemy activity
  - Determine positions of friendly vs enemy troops, tanks, aircraft in real-time
§ Solution
  - Critical alerting established to pinpoint any/every enemy movement
[Figure: Example of combat military monitoring of friendly and enemy forces in real-time with StreamBase. Labels: (unit#, x, y); Lookup (unit#); (unit#, x, y, enemy?); (x, y) across line and enemy; Count; Count > 3; Window = 1 min.]

CEP/Stream Processing Marketplace
The high end
• ~100K messages/second
• ~1 msec latency
Anything will work at the low end
• 1 message/day: Use pencil & paper
• 1 message/hour: Use spreadsheet
• 1 message/minute: Use favorite app server, RDBMS and/or enterprise middleware
[Figure: chart of Processing Complexity (simple events to complex events) against Processing Speed (human speed, seconds to minutes, to machine speed, msec). Regions: Conventional Architectures; Stream Processing Engines (StreamSQL).]

Technical Approaches to CEP
§ Custom code
  - Almost everybody does this today
  - Nobody wants to continue to do this going forward
  - Replacing this with commercial off-the-shelf (COTS) infrastructure will fuel an explosion in exploitation of increasingly ubiquitous real-time data
§ Your favorite rule engine
§ StreamSQL stream processing engine

Required Characteristics for Complex Event Processing Engines
1. Perform data processing without first storing and retrieving the data
2. Leverage StreamSQL query paradigm
3. Store and access current or historical state information using a familiar standard such as SQL
4. Handle stream imperfections (e.g. late or delayed, missing, out-of-sequence data)
5. Process time-series records (tuples) in a consistent, deterministic manner
6. Failover streaming application to a back-up and keep running in the event of primary system failure
7. Split applications over multiple processors or machines for scalability, without writing low-level code
8. Run Rules 1-7 in-process at tens to hundreds of thousands of messages/second with low latency

Rule 1: Keep the Data Moving
To achieve low latency, perform data processing without first storing and retrieving the data
Low latency
§ No waiting
§ Results delivered in-flight
[Figure: In-stream Processing vs. Traditional Data Processing. Labels: Event Data, Alerts, Actions, StreamBase Application, Updates, Queries, Memory, Disk.]

Rule 2: Query Paradigm (StreamSQL)
Use querying mechanism to find output events of interest or compute analytics on real-time and historical data
What is StreamSQL?
§ StreamSQL extends conventional SQL with time windows for key functions (e.g. joining, querying, aggregating data)
  § Streams do not have "end of table"
§ Optimal approach for unifying processing of real-time and stored data
§ SQL is a good paradigm
  § For analytics
  § And filtering
  § "Gold standard" for stored data

StreamSQL Programming Paradigm
§ Time window-based computations, statistics
§ Extensibility
  - User-defined functions and aggregates
  - Custom Java or C++ operators
  - Modules for reusability
§ Stores state
[Figure: stream of tuples with columns Arrival time and Data Value; arrival times run from 3:01.00 to 3:09.50.]

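A time window-based computation can be illustrated with a tumbling-window average in Python. This is a generic sketch of the idea rather than StreamSQL itself; the `(arrival_time_seconds, value)` layout and the function name are assumptions.

```python
def tumbling_window_averages(events, width_seconds):
    """Yield (window_start, average) for each non-empty tumbling window.

    events: iterable of (arrival_time_seconds, value) in arrival order.
    """
    window_start = None
    total = 0.0
    count = 0
    for ts, value in events:
        start = (ts // width_seconds) * width_seconds
        if window_start is not None and start != window_start:
            # A tuple from a later window has arrived, so the old one is closed.
            yield window_start, total / count
            total, count = 0.0, 0
        window_start = start
        total += value
        count += 1
    if count:
        # Only a finite test input has an end; a live stream would not reach this.
        yield window_start, total / count
```

Note that a window is emitted only when a later tuple arrives: because a stream never ends, the window boundary is what triggers output, where a stored table would use the end of the scan.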
Integrating Real Time and Stored State…
Produce the split-adjusted price of every security in a feed over several days (stock can split more than once)
Two feeds:
  Tick (symbol, price, volume, date, time)
  Splits (symbol, date, time, split_factor)

StreamSQL solution for Real-Time and Stored Data
Stored table: Store (symbol, factor)
Feeds: Tick and Split
_____________________
Mixing stream and table:
  UPDATE Store
  (SET factor = factor * S.split_factor)
  FROM Split S
  WHERE symbol = S.symbol
Mixing stream and table:
  SELECT T.symbol, price = T.price * S.factor, T.volume, T.date, T.time
  FROM Tick T, Store S
  WHERE S.symbol = T.symbol

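The same logic can be sketched in Python to show how the stored factor and the live ticks interact. This is an illustration under assumptions, not StreamBase code: the tagged-dict event layout and the function name are hypothetical, and a symbol with no stored factor is treated as having factor 1.0, whereas the join in the query would simply produce no row for it.

```python
def split_adjusted_ticks(events):
    """Yield ticks with price scaled by the cumulative split factor.

    events: iterable of dicts in arrival order, each with a "kind" key
    ("split" or "tick"); this tagged-dict layout is an assumption.
    """
    store = {}  # symbol -> cumulative factor, standing in for the Store table
    for e in events:
        if e["kind"] == "split":
            # UPDATE Store SET factor = factor * S.split_factor
            store[e["symbol"]] = store.get(e["symbol"], 1.0) * e["split_factor"]
        else:
            # SELECT ... price = T.price * S.factor ...
            factor = store.get(e["symbol"], 1.0)
            yield {**e, "price": e["price"] * factor}
```

Because the factor is multiplied in on every split event, a stock that splits more than once accumulates the product of its split factors.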
StreamSQL Solution
… or a four-box application in the StreamBase GUI
[Figure: Tick (symbol, price, volume, date, time) reads Store (Symbol, Factor) and computes T.price * S.factor; Splits (symbol, date, time, split_factor) writes Store with factor * S.split_factor.]
Some programmers prefer textual notation; some prefer GUI. Take your pick.

Characteristics of Example
§ Storage of (perhaps lots of) state
§ Decision making based on a mix of stored state and real-time computation
StreamSQL has a single programming paradigm for both kinds of data. Not necessarily true for other technical approaches.

What About Pattern-Matching?
§ Example: Find IBM ticks over 80 followed by at least two ticks under 80.
  CREATE STREAM TickTriples AS
  SELECT symbol, T1.price AS price1, T2.price AS price2, T3.price AS price3
  FROM Ticks T1 -> Ticks T2 -> Ticks T3
  WHERE T1.symbol = T2.symbol AND T2.symbol = T3.symbol;

  SELECT * FROM TickTriples
  WHERE price1 > 80 AND price2 < 80 AND price3 < 80 AND symbol = "IBM";
Regular expression (pattern matching) is the same in any technology!!!

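For comparison, the same pattern can be sketched in Python with a three-tick sliding buffer per symbol. The sketch assumes that `->` means "immediately followed by, among ticks of the same symbol"; that reading, the `(symbol, price)` layout, and the function name are assumptions rather than documented StreamSQL semantics.

```python
from collections import defaultdict, deque


def tick_triples(ticks, symbol, threshold):
    """Yield (p1, p2, p3) for three consecutive ticks of `symbol` where
    the first is above `threshold` and the next two are below it.

    ticks: iterable of (symbol, price) pairs in arrival order.
    """
    recent = defaultdict(lambda: deque(maxlen=3))  # last three prices per symbol
    for sym, price in ticks:
        window = recent[sym]
        window.append(price)  # the oldest price drops out automatically
        if sym == symbol and len(window) == 3:
            p1, p2, p3 = window
            if p1 > threshold and p2 < threshold and p3 < threshold:
                yield (p1, p2, p3)
```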
Performance – StreamSQL
§ Internal query plan (think of it as our graphical workflow notation)
§ For any event, we know exactly what processing happens next
§ As a result, we can optimize the plan

StreamSQL Advantages
§ Superior performance
§ Easy programmability (and maintainability)
§ One notation for real-time and stored data
§ Includes regular expression evaluation
§ Closer to basis for standardization
  - FROM clause can mix stored tables and streams
  - Add time windows to SQL
  - Add stream disorder to SQL

Rule 3: Handle Delayed, Missing, & Out-of-Order Data
Make provision for handling data which is late or delayed, missing, or out-of-sequence
§ Ability to time-out individual calculations or computations
§ Ability to merge streams and plug gaps from one with valid value in another
§ Bounded sort operation (BSORT)
§ Outer-join

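One common way to realize a bounded sort is a fixed-size min-heap: hold back a small number of tuples and release the earliest one whenever the buffer overflows. The Python sketch below illustrates that general technique; it is not a description of how StreamBase implements BSORT, and the `(timestamp, payload)` layout and `buffer_size` parameter are assumptions.

```python
import heapq


def bounded_sort(events, buffer_size):
    """Yield (timestamp, payload) pairs in timestamp order, assuming no
    event arrives more than `buffer_size` positions late.
    """
    heap = []
    for seq, (ts, payload) in enumerate(events):
        # seq breaks timestamp ties so payloads are never compared
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > buffer_size:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    while heap:  # drain at end of a finite test input
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out
```

The buffer size is the trade-off: a larger buffer tolerates more disorder but adds latency, since each tuple waits until enough later tuples have arrived.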
Rule 4: Generate Predictable Outcomes
Process time-series records (tuples) in a consistent manner
§ Two distinct runs of the system with the same input should yield the same output (deterministic)
§ Ensure calculations performed on one time-series record do not interfere with calculations done on another

Rule 5: Process Streaming or Stored Data
Store and access current or historical state information, preferably using a familiar standard such as SQL
§ Interfaces:
  - Embedded in-process DB for low latency, low overhead
  - Standards such as ODBC, JDBC to external databases
§ Ability to test trading algorithms on historical data, then switch seamlessly to live feed
[Figure labels: Real-time Feeds, Alerts, Actions, Embedded local storage, Remote process, Data store.]

Rule 6: Guarantee Data Safety & Availability
If a failure occurs (hardware, operating system, software), the streaming application must failover to a back-up and keep running
§ Restarting and recovering from a log for real-time processing is not practical
§ Better idea: A tandem-style approach for streaming data
[Figure labels: Market Data, Primary, Secondary, Checkpoint, Alerts, Actions.]

Rule 7: Partition & Scale Automatically
Split an application over multiple processors or machines for scalability, without developer having to write low-level code
§ Easily split application without custom-coding
§ Multi-threading:
  - To utilize multi-CPU (multi-core) hardware
  - Avoid blocking for external events and maintain low latency

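Data partitioning of this kind usually routes each tuple by a key, so that all tuples for one key reach the same worker and per-key state stays local. The Python sketch below shows that routing step only; the choice of CRC32 as the hash, the `(key, payload)` layout, and the function names are assumptions for illustration, not StreamBase behaviour.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index deterministically."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_stream(events, num_partitions):
    """Group (key, payload) events into per-partition lists, preserving order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions
```

A stable hash such as CRC32 is used instead of Python's built-in `hash`, which is randomized per process for strings and would route the same key differently on different machines.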
Rule 8: Process & Respond Instantaneously
Run all 7 rules in-process at tens to hundreds of thousands of messages/second with low latency
§ Ensure high availability, stored/real-time processing, handling stream imperfections all work concurrently with low latency
§ Test rigorously: simulated and live feeds
§ Monitor latency and processing speed in messages/second

Stream Processing Engine Architecture
The StreamBase Server
[Figure: Input Stream and Messaging/Transport System feed the StreamBase Application running on the StreamBase Server (Operating System, Hardware); an Output Stream and Messaging/Transport System deliver results to Client Applications (Operating System, Hardware).]
Functional Capabilities:
- Implements StreamSQL
- Multiple options for managing stored data
- Insertion of custom logic & analytics to the data stream
- Adapters to external data sources & messaging systems
Infrastructure Capabilities:
- Multi-threaded with real-time scheduling
- 10K-500K+ msgs/sec
- High availability
- 64-bit addressing
- Supports clusters & blade configurations via application & data partitioning

Integrated Development Environment
Integrated environment for building, testing, deploying
- Eclipse-based IDE
- Drag-and-connect with workflow orientation
- Built-in load simulation for easy testing
- Stream Record/Playback
- Custom C++ or Java operators
- Debugger & performance monitor

Required Characteristics for Complex Event Processing Engines
1. Perform data processing without first storing and retrieving the data
2. Leverage StreamSQL query paradigm
3. Store and access current or historical state information using a familiar standard such as SQL
4. Handle stream imperfections (e.g. late or delayed, missing, out-of-sequence data)
5. Process time-series records (tuples) in a consistent, deterministic manner
6. Failover streaming application to a back-up and keep running in the event of primary system failure
7. Split applications over multiple processors or machines for scalability, without writing low-level code
8. Run Rules 1-7 in-process at tens to hundreds of thousands of messages/second with low latency

Future Directions for the Community
§ Standard vocabulary and vernacular:
  - E.g. "events," "CEP," "stream processing," "pattern-matching"
§ Education and visibility around category:
  - Analyst reports
  - Broader market education
§ Technical standards:
  - Benchmarks: Performance, scalability
  - Languages: StreamSQL or extended SQL
§ Research:
  - Approximation
  - Distributed processing
  - Self-adaptive
  - Sensor applications
  - Scientific applications

Thank You
Enterprise-class stream processing software designed to transform real-time complex events into actionable intelligence
Corporate Headquarters: 181 Spring Street, Lexington, Massachusetts 02421; +1 866 STRMBAS (+1 866 787 6227); +1 781 761 0800; www.streambase.com
New York City Office: 220 West 42nd Street, 20th Floor, New York, New York 10036; +1 866 STRMBAS (+1 866 787 6227)
Reston, Virginia Office: 11921 Freedom Drive, Suite 550, Reston, VA 20190; +1 703 608 6958
London Office: 107-111 Fleet Street, London EC4A 2AB, United Kingdom; +44 (0)20 7936 9050