CS 487/587 Advanced Topics: Data Streams
Kristin Tufte
Thanks to: David Maier, Jenny Li, Vassilis Papadimos
3/8/2021

History
- Joint project between the University of Wisconsin–Madison and PSU (formerly OGI), funded by the National Science Foundation (NSF)
- Niagara was originally built for querying the Internet
  - Retrieve, store, and index XML files available on the Internet
- At PSU, we have studied stream processing
  - Modified Niagara v1.0 (developed at Wisconsin) to turn it into a data stream system
  - Window aggregate processing in particular
Data Streams: Lecture 4

NiagaraST Architecture
[Figure: architecture diagram. A NiagaraST client sends connection requests and queries to the NiagaraST server. Server components: Connection Manager, Query Parser, Optimizer, Query Engine, Data Manager. Labeled flows: parsed queries, DAG of logical operators, DAG of physical operators, XML (ndom), data stream.]

Niagara v1.0 Queries
- Pipelined system
- Each operator is a thread; operators are connected by queues of tuples
- Operators wait on their input queue; when a tuple is ready, it is processed and the result is inserted into the output queue
- Simple scheduler
  - Creates operators and connects them
- Control messages travel upstream
- Not stream-specific!
[Figure: query plan with sum on top of select on top of xmlscan]
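The thread-per-operator design can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Niagara code: the `END` marker, the function names, and the use of plain numbers as tuples are all hypothetical. It shows why the design is not stream-specific: `total` emits nothing until its input ends.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not a Niagara construct


def xmlscan(values, out_q):
    """Source operator: push each input value downstream, then signal the end."""
    for v in values:
        out_q.put(v)
    out_q.put(END)


def select(in_q, out_q, predicate):
    """Non-blocking operator: each qualifying tuple is forwarded immediately."""
    while (item := in_q.get()) is not END:
        if predicate(item):
            out_q.put(item)
    out_q.put(END)


def total(in_q, out_q):
    """Blocking operator: the sum is emitted only after the input has ended."""
    acc = 0
    while (item := in_q.get()) is not END:
        acc += item
    out_q.put(acc)


def run_plan(values, predicate):
    """Simple scheduler: create the operators, connect them with queues, run them."""
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=xmlscan, args=(values, q1)),
        threading.Thread(target=select, args=(q1, q2, predicate)),
        threading.Thread(target=total, args=(q2, q3)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return q3.get()
```

On a stream that never ends, `xmlscan` would never send `END`, so `total` would never produce a result.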

Why Isn’t This Good Enough?
[Figure: query plan with sum on top of select on top of xmlscan]

Because…
- The stream may never end, so there must be a mechanism for producing results
  - Select is not blocking (OK)
  - Sum is blocking (problem)
- Use windows (and punctuation) to unblock
[Figure: query plan with sum on top of select on top of xmlscan; the callout "Problem is here!" points at sum]

NiagaraST Query Execution
- Modified SQL operators
  - i.e. windowSum
  - Keep modifications minimal
- Stream-specific operators
  - i.e. bucket
  - Isolate window semantics in these operators
- Queues contain tuples and “punctuation”
[Figure: query plan with operators xmlscan, select, bucket, windowSum]

Running Example
- Online auction site produces a stream of bids
- May receive bids from multiple sites
- For now, assume:
  - Bids are received from one site
  - Bids arrive in order of the time attribute
- Schema: Bids(auctionid, bidderid, amt, time)

Window Definition
- Time-based windows: specify Range and Slide
- Find the average bid on each auction in the past 5 minutes; update the result every minute

    SELECT auctionid, AVG(amt)
    FROM Bids [Range 5 Minutes Slide 1 Minute]
    GROUP BY auctionid

Schema: Bids(auctionid, bidderid, amt, time)

Windows Example

    SELECT auctionid, AVG(amt)
    FROM Bids [Range 5 Minutes Slide 1 Minute]
    GROUP BY auctionid

         (auctionid, amt, time (hh:mm:ss))
    t1   (3,  37.00,  12:00:30)
    t2   (28, 54.00,  12:01:45)
    t3   (42, 31.00,  12:02:55)
    t4   (23, 25.00,  12:04:20)
    t5   (4,  103.00, 12:05:15)
    t6   (82, 92.00,  12:05:55)
    t7   (21, 87.00,  12:09:15)

Windows:
    W1: 12:00:00 – 12:05:00
    W2: 12:01:00 – 12:06:00
    W3: 12:02:00 – 12:07:00

Explicit Window Ids
- Idea: identify windows with a window id
- Tag tuples with the ids of the windows they belong to (explicit WID attribute)
- New operator: Bucket. Given Range and Slide, Bucket tags tuples with WIDs
- Add WID to the list of grouping attributes for the aggregation operator

WID Implementation

    SELECT auctionid, AVG(amt)
    FROM Bids [Range 5 Minutes Slide 1 Minute]
    GROUP BY auctionid

The window operation is converted to a group-by. Query plan, top to bottom:

windowAverage (group on window-id, auctionid)
    output: (auctionid, window-id, average)
            (10, 3, $6.50)
bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR time)
    output: (auctionid, amt, time, window-id)
        t1   (10, $5.00, 12:04:35 PM, 1-5)
        t2   (10, $8.00, 12:06:40 PM, 3-7)
xmlscan
    output: (auctionid, amt, time)
        t1   (10, $5.00, 12:04:35 PM)
        t2   (10, $8.00, 12:06:40 PM)

What’s Wrong?
- When do you output? When the window ends.
- How do you know when a window ends?
- One solution: punctuation
  - Looks like a tuple
  - Interspersed with tuples in the inter-operator queues
  - Indicates that a certain subset of the data is complete

Idea: Punctuation
- Bucket will generate a “punctuation” to notify windowAverage that window 1 is complete
- Punctuations look like tuples and are intermixed with tuples in inter-operator queues
- This punctuation says: all data with wid=1 has been seen.

    (aid, amt, time, wid)
    (*,   *,   *,    1)      punctuation

Data Streams: Lecture 5
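One way to read a punctuation is as a pattern over the tuple schema, where `*` matches anything. The sketch below is an illustration only; the `matches` helper and the flat single-wid tuple layout are assumptions, not NiagaraST's representation.

```python
WILDCARD = "*"


def matches(punctuation, tup):
    """True if every non-wildcard field of the punctuation equals the tuple's field."""
    return all(p == WILDCARD or p == v for p, v in zip(punctuation, tup))


# Schema: (aid, amt, time, wid)
p1 = (WILDCARD, WILDCARD, WILDCARD, 1)
```

Under this reading, `p1` promises that no tuple matching it (that is, no tuple with wid 1) will arrive later in the queue.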

WID Implementation

    SELECT auctionid, AVG(amt)
    FROM Bids [Range 5 Minutes Slide 1 Minute]
    GROUP BY auctionid

Query plan, top to bottom:

windowAverage (group on window-id, auctionid)
    output: (auctionid, window-id, average)
            (10, 3, $6.50)
bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR time)
    output: (auctionid, amt, time, window-id)
        t1   (10, $5.00, 12:04:35 PM, 1-5)
        t2   (10, $8.00, 12:06:40 PM, 3-7)
        p1   (*,  *,     *,           3)
xmlscan
    output: (auctionid, amt, time)
        t1   (10, $5.00, 12:04:35 PM)
        t2   (10, $8.00, 12:06:40 PM)

Note: bucket generates the punctuation.

NiagaraST WID Detailed Example
Schema: Bids(auctionid, bidderid, amt, time)
Query: Find the maximum bid price for each auction over the past 3 minutes. Update the result each minute. (Simple window aggregate query.)

    SELECT aid, MAX(amt)
    FROM Bids [Range 3 Minutes Slide 1 Minute]
    GROUP BY aid

(aid = auctionid)

NiagaraST Example – Data & Windows

    SELECT aid, MAX(amt)
    FROM Bids [RANGE 3 Minutes SLIDE 1 Minute]
    GROUP BY aid

          (aid, amt,   time)
    t1    (8,   37.00, 12:00:03)
    t2    (8,   54.00, 12:00:34)
    t3    (8,   61.00, 12:01:10)
    t4    (9,   35.00, 12:02:15)
    t5    (…,   42.00, 12:03:51)
    t6    (…,   45.00, 12:04:35)
    t7    (…,   67.00, 12:05:15)
    t8    (…,   57.00, 12:05:20)
    t9    (…,   65.00, 12:06:30)
    t10   (…,   72.00, 12:07:45)

Windows: 12:00-12:03, 12:01-12:04, 12:02-12:05, 12:03-12:06, 12:04-12:07, 12:05-12:08
- Want output every minute, but don’t have 3 minutes of data at startup time

NiagaraST Example – Data & Windows
(Query and data as on the previous slide.)
- Partial windows at start-up

Windows:
    W1: 12:00:00 – 12:01:00
    W2: 12:00:00 – 12:02:00
    W3: 12:00:00 – 12:03:00
    W4: 12:01:00 – 12:04:00
    W5: 12:02:00 – 12:05:00
    W6: 12:03:00 – 12:06:00
    W7: 12:04:00 – 12:07:00
    W8: 12:05:00 – 12:08:00

Tuples Belong to Multiple Windows
(Query, data, and windows W1-W8 as on the previous slide.)

Window Ids
- Assign explicit window ids (WIDs) to tuples

    SELECT aid, MAX(amt)
    FROM Bids [RANGE 3 Minutes SLIDE 1 Minute]
    GROUP BY aid

          (aid, amt,   time)        WIDS
    t1    (8,   37.00, 12:00:03)    [1-3]
    t2    (8,   54.00, 12:00:34)    [1-3]
    t3    (8,   61.00, 12:01:10)    [2-4]
    t4    (9,   35.00, 12:02:15)    [3-5]
    t5    (…,   42.00, 12:03:51)    [4-6]
    t6    (…,   45.00, 12:04:35)    [5-7]
    t7    (…,   67.00, 12:05:15)    [6-8]
    t8    (…,   57.00, 12:05:20)    [6-8]
    t9    (…,   65.00, 12:06:30)    [7-9]
    t10   (…,   72.00, 12:07:45)    [8-10]

Windows W1-W8 run from 12:00:00 – 12:01:00 (W1) through 12:05:00 – 12:08:00 (W8).

Example
Query plan, top to bottom:

windowMax (groups on: aid, wid)
    state (hash table); max goes from 37.00 to 54.00 as t1 and then t2 arrive:
        aid  wid  max
        8    1    37.00 → 54.00
        8    2    37.00 → 54.00
        8    3    37.00 → 54.00
bucket
    output: (aid, amt, time, wid)
        t1   (8, 37.00, 12:00:03, 1-3)
        t2   (8, 54.00, 12:00:34, 1-3)
xmlscan
    output: (aid, amt, time)
        t1   (8, 37.00, 12:00:03)
        t2   (8, 54.00, 12:00:34)
        t3   (8, 61.00, 12:01:10)

Windows:
    W1: 12:00:00 – 12:01:00
    W2: 12:00:00 – 12:02:00
    W3: 12:00:00 – 12:03:00

Example
Query plan, top to bottom:

windowMax (groups on: aid, wid)
    output: (aid, wid, max)
            (8, 1, 54.00)
    state (hash table) after t3, p1, and t4:
        aid  wid  max
        8    2    54.00 → 61.00
        8    3    54.00 → 61.00
        8    4    61.00
        9    2    35.00
        9    3    35.00
        9    4    35.00
bucket
    output: (aid, amt, time, wid)
        t3   (8, 61.00, 12:01:10, 2-4)
        p1   (*, *,     *,        1)
        t4   (9, 35.00, 12:01:45, 2-4)
xmlscan
    output: (aid, amt, time)
        t3   (8, 61.00, 12:01:10)
        t4   (9, 35.00, 12:01:45)

Windows:
    W1: 12:00:00 – 12:01:00
    W2: 12:00:00 – 12:02:00
    W3: 12:00:00 – 12:03:00

Bucket - WIDS
Time is in seconds from 12:00.

          (aid, amt,   time)    WIDS
    t1    (8,   37.00, 3)       [1-3]
    t2    (8,   54.00, 34)      [1-3]
    t3    (8,   61.00, 70)      [2-4]
    t4    (9,   35.00, 135)     [3-5]
    t5    (…,   42.00, 231)     [4-6]
    t6    (…,   45.00, 275)     [5-7]
    t7    (…,   67.00, 315)     [6-8]
    t8    (…,   57.00, 320)     [6-8]
    t9    (…,   65.00, 390)     [7-9]
    t10   (…,   72.00, 465)     [8-10]

- Query: [Range 180 Seconds Slide 60 Seconds]
- wids(time) = [ (time/SLIDE)+1, (time+RANGE)/SLIDE ], using integer division
- wids(231) = [ (231/60)+1, (231+180)/60 ] = [4, 6]

Windows (seconds):
    W1: 0-60       W2: 0-120
    W3: 0-180      W4: 60-240
    W5: 120-300    W6: 180-360
    W7: 240-420    W8: 300-480
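The wids formula translates directly into integer arithmetic. The sketch below follows the formula on this slide; the helper `window_bounds`, which inverts it to recover a window's half-open extent, is an added assumption for illustration.

```python
def wids(time, range_, slide):
    """Return (first, last) window ids for a tuple timestamp, per the slide formula."""
    first = time // slide + 1
    last = (time + range_) // slide
    return first, last


def window_bounds(wid, range_, slide):
    """Return the half-open interval [start, end) of seconds covered by a window id.

    Assumes time starts at 0, so windows at start-up are clipped to begin at 0.
    """
    end = wid * slide
    return max(0, end - range_), end
```

For example, `wids(231, 180, 60)` yields `(4, 6)`, matching the worked example on the slide.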

Window Query Execution in NiagaraST
- Ideas: explicit window ids and punctuation
- Append a range of window ids (WIDs) to each tuple
- New operator: Bucket assigns WIDs to tuples
- Group by WID
- Punctuation indicates end of windows
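The pieces above can be sketched as two Python generators. This is a minimal illustration, not the NiagaraST implementation: it assumes tuples arrive in timestamp order, that timestamps are integer seconds, and that bucket emits a punctuation for each closed window just before the tuple that reveals the window has ended. The `"tuple"` and `"punct"` tags are hypothetical.

```python
def bucket(tuples, range_, slide):
    """Tag each (aid, amt, time) tuple with its WID range and punctuate closed windows."""
    closed = 0  # highest window id already punctuated
    for aid, amt, time in tuples:
        # A window with id w ends at w * slide; once time reaches that point,
        # an in-order stream can deliver no more tuples for it.
        while closed < time // slide:
            closed += 1
            yield ("punct", closed)
        first, last = time // slide + 1, (time + range_) // slide
        yield ("tuple", aid, amt, time, (first, last))


def window_max(items):
    """Group on (aid, wid); emit (aid, wid, max) when a punctuation closes wid."""
    state = {}  # (aid, wid) -> running max
    for item in items:
        if item[0] == "tuple":
            _, aid, amt, _, (first, last) = item
            for wid in range(first, last + 1):
                key = (aid, wid)
                state[key] = max(state.get(key, amt), amt)
        else:
            _, wid = item
            for key in sorted(k for k in state if k[1] == wid):
                yield (key[0], wid, state.pop(key))
```

Note that `window_max` never looks at timestamps or at Range and Slide; all window semantics live in `bucket`, and only per-group aggregates are kept rather than the tuples themselves.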

WID Implementation

    SELECT auctionid, AVG(amt)
    FROM Bids [Range 5 Minutes Slide 1 Minute]
    GROUP BY auctionid

Query plan, top to bottom:

windowAverage (group on window-id, auctionid)
    - Group by WID
    - Punctuation indicates end of windows
    output: (auctionid, window-id, average)
            (10, 3, $6.50)
bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR time)
    - New operator: Bucket assigns ranges of WIDs to tuples
    - Bucket also generates punctuation
    output: (auctionid, amt, time, window-id)
        t1   (10, $5.00, 12:04:35 PM, 1-5)
        t2   (10, $8.00, 12:06:40 PM, 3-7)
        p1   (*,  *,     *,           3)
xmlscan
    output: (auctionid, amt, time)
        t1   (10, $5.00, 12:04:35 PM)
        t2   (10, $8.00, 12:06:40 PM)

Advantages
- Windows are not buffered (less memory)
- Window specification is isolated in the bucket operator
  - Average doesn’t know about windows
  - No need for specialized window operators; just use punctuation-aware operators
- Flexible
  - Window on system time, external time, or tuple-based
  - Data can arrive and be processed out of order
- Punctuation can guarantee progress
  - Gaps in tuple arrival need not affect result production

NiagaraST Window Join Semantics

    SELECT A.id, A.name, B.bidderid, B.amt
    FROM Auctions A [RANGE 10 MINUTES SLIDE 2 MINUTES],
         Bids B [RANGE 5 MINUTES SLIDE 2 MINUTES]
    WHERE A.id = B.auctionid AND A.sellerid = 0

- Map tuples in A and B to window ids using the method described in the previous slides
- Add A.wid = B.wid to the join condition:
  - A.id = B.auctionid AND A.wid = B.wid
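A small sketch of the extended join condition follows. It assumes both inputs have already been tagged with WID ranges, and it uses a one-shot hash join over lists purely for illustration; the tuple layouts and the function name are hypothetical, and a streaming system would not build one side completely first.

```python
from collections import defaultdict


def window_join(auctions, bids):
    """Join on A.id = B.auctionid AND A.wid = B.wid, keeping only A.sellerid = 0.

    auctions: (id, name, sellerid, (first_wid, last_wid))
    bids:     (auctionid, bidderid, amt, (first_wid, last_wid))
    """
    index = defaultdict(list)  # (auction id, wid) -> auction names
    for a_id, name, sellerid, (lo, hi) in auctions:
        if sellerid == 0:
            for wid in range(lo, hi + 1):
                index[(a_id, wid)].append(name)
    results = []
    for auctionid, bidderid, amt, (lo, hi) in bids:
        for wid in range(lo, hi + 1):
            for name in index.get((auctionid, wid), []):
                results.append((wid, auctionid, name, bidderid, amt))
    return results
```

A pair of tuples joins once per window id they share, so the same match can appear in several windows.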

DSMS Challenges
- Unbounded memory requirements
  - Techniques for limiting the amount of memory required
  - Sliding windows
  - Approximation: sampling, synopsis
- Blocking operators
  - How to unblock them… (windows, punctuation)
- Tuple arrival gaps

Evaluate with Finite (Bounded) Memory?
Schema: S1(A, B, C), S2(D, E)

    SELECT S1.A
    FROM S1
    WHERE S1.A > 10

Your turn…
Schema: S1(A, B, C), S2(D, E)

Query 1:
    SELECT S1.A
    FROM S1, S2
    WHERE S1.A = S2.D

Query 2:
    SELECT S1.A
    FROM S1, S2
    WHERE S1.A = S2.D AND S1.A > 10 AND S2.D < 20

Blocking Operators
- Intuitively: an operator is blocking if it waits until the end of the stream to output results
  - sum, count, min, max, sort are blocking
  - What about difference and outer join?
- Informally: an operator is non-blocking if:
  - an additional input tuple does not cause updates to results for previous tuples
- Blocking operators vs. blocking implementations
  - Join is a non-blocking operator, but many join implementations are blocking (hash, sort-merge)
  - Symmetric hash join is not blocking
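To make the last point concrete, here is a minimal sketch of a symmetric hash join. It keeps one hash table per input and probes the opposite table on every arrival, so results are produced as soon as both matching tuples have been seen. The class and method names are assumptions for illustration.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Equi-join that emits results incrementally as tuples arrive on either input."""

    def __init__(self):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, tup):
        """Store tup under key on its own side, then probe the other side's table."""
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(tup)
        matches = self.tables[other].get(key, [])
        if side == "left":
            return [(tup, m) for m in matches]
        return [(m, tup) for m in matches]
```

The price is memory: both tables grow without bound unless windows or punctuation allow old entries to be discarded.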

Operator Summary
- Finite vs. unbounded memory
  - Finite: select, project
  - Unbounded: join, sort, dupelim
  - Aggregate classification is more complex
- Blocking vs. non-blocking
  - Blocking: aggregates (min, max, count), sort
  - Non-blocking: join, select, project, dupelim

Tuple Arrival Gaps
- What is the problem?
  - Tuple t3 (8, 61, 12:01:10) closes window 12:00 - 12:01; what if the tuple doesn’t arrive?
- If we are using tuple timestamps to figure out when to close windows, we are in trouble
- Can’t tell the difference between no data and a delayed tuple

Example
Query plan, top to bottom:

windowMax (groups on: aid, wid)
    output: (aid, wid, max)
            (8, 1, 54.00)
            (8, 2, 61.00)
            (8, 3, 61.00)
            (8, 4, 61.00)
    state (hash table):
        aid  wid  max
        8    1    54.00
        8    2    54.00 → 61.00
        8    3    54.00 → 61.00
        8    4    61.00
bucket
    output: (aid, amt, time, wid)
        t3   (8, 61.00, 12:01:10, 2-4)
        p1   (*, *,     *,        1)
        p2   (*, *,     *,        2)
        p3   (*, *,     *,        3)
        p4   (*, *,     *,        4)
xmlscan
    output: (aid, amt, time)
        t3   (8,  61.00, 12:01:10)
        t4   (10, 65.00, 12:04:15)

Windows:
    W1: 12:00:00 – 12:01:00
    W2: 12:00:00 – 12:02:00
    W3: 12:00:00 – 12:03:00

Tuple Arrival Gaps
- Timeouts (Aurora)
  - Each windowed operator has a timeout saying how long to wait for the next tuple:
    - (size = 10 tuples, step 2 tuples, timeout 5 sec)
- Punctuation (GS, NiagaraST)
  - Application generates timestamps and punctuation
[Figure: Sites A, B, and C each send a stream of tuples interleaved with punctuation]

Summary
- Unbounded memory
  - Windows, approximation
- Blocking/non-blocking operators and implementations
  - Windows, punctuation
- Timestamps and tuple arrival gaps
  - Punctuation, heartbeats, slack

XML
- Tree of elements; only one root element
- Elements have attributes ((name, value) pairs) and sub-elements
- Only simple XML is used in the exercises

    <?xml version="1.0"?>
    <department name="CS">
      <course name="DataStreams">
        <time> TTh 10-11:50 </time>
        <location> FAB 150 </location>
      </course>
      <course> … </course>
    </department>

Example Query Plan
Schema: Bids(auctionid, bidderid, amt, time)

    <?xml version="1.0"?>
    <!DOCTYPE plan SYSTEM "/stash/datalab/datastreamsstudent/bin/queryplan.dtd">

    <!-- This query selects the bidderid and amt for all bids with amt > $100 -->

    <!-- Specify the top operator in the query plan (last operator to be executed) -->
    <plan top="cons">

    <!-- Scan the input file, extracting the attributes bidder and price -->
    <xmlscan id="scan" attrs="bidderid, amt"
             filename="/stash/datalab/datastreamsstudent/streamdata/bids_small.xml"/>

Example Query Plan II

    <!-- Perform the selection -->
    <select id="selectop" input="scan">
      <pred op="ge">
        <var value="$amt"/>
        <number value="100"/>
      </pred>
    </select>

    <!-- Construct the result -->
    <construct id="cons" input="selectop">
      <![CDATA[
        <output> $bidderid $amt </output>
      ]]>
    </construct>
    </plan>

Input Data Streams
- Data streams are faked using files
- The scan below on this file would produce two tuples (T1 and T2), each with two attributes: $auctionid, $bidderid

    <xmlscan id="scan" attrs="auctionid, bidderid" file="…">

    <niagara:stream xmlns:niagara="http://www.cs.pdx.edu/dot/niagara">
    <bids>
      <bid>
        <auctionid>0</auctionid>
        <time>1</time>
        <bidderid>2</bidderid>
        <amt>57.00</amt>
      </bid>
      <bid>
        <auctionid>1</auctionid>
        <time>2</time>
        <bidderid>1</bidderid>
        <amt>147.00</amt>
      </bid>
    </bids>
    </niagara:stream>
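What the scan produces can be mimicked with Python's standard XML parser. This is only an approximation of xmlscan for illustration: the `scan` function is hypothetical, values stay as strings, and the sample text is a compact copy of the two-bid file above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<niagara:stream xmlns:niagara="http://www.cs.pdx.edu/dot/niagara">
<bids>
  <bid><auctionid>0</auctionid><time>1</time><bidderid>2</bidderid><amt>57.00</amt></bid>
  <bid><auctionid>1</auctionid><time>2</time><bidderid>1</bidderid><amt>147.00</amt></bid>
</bids>
</niagara:stream>"""


def scan(xml_text, attrs):
    """Return one tuple per <bid> element, holding the text of the requested sub-elements."""
    root = ET.fromstring(xml_text)
    return [tuple(bid.findtext(a) for a in attrs) for bid in root.iter("bid")]
```

Calling `scan(SAMPLE, ["auctionid", "bidderid"])` gives one tuple per bid, in document order.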

Window Aggregates in NiagaraST
Kristin Tufte, Jin Li
Thanks to the NiagaraST Group @ PSU
Data Streams: Lecture 10

Outline
- Review of Window Aggregate Evaluation
- WID Window Semantics
- Panes
- Disordered Data and Out-of-order Processing

Window Aggregate – Buffering

    SELECT aid, MAX(amt)
    FROM Bids [Range 5 minutes Slide 1 minute Wattr ts]
    GROUP BY aid

windowMax input: (aid, amt, ts (hh:mm:ss))
    t1   (a5, 40, 01:06:30)
    t2   (a6, 42, 01:07:45)
    t3   (a5, 45, 01:08:15)
    t4   (a5, 47, 01:10)
    t5   (a6, 48, 01:10:40)
    t6   (a6, 46, 01:11:02)
windowMax output: (aid, max)
    (a5, 47)
    (a6, 48)

Windows:
    01:06:00 – 01:11:00
    01:07:00 – 01:12:00
    01:08:00 – 01:13:00
[Figure: windowMax with a buffer holding the tuples of the current window]

Window Aggregate Evaluation in NiagaraST - WID
Query plan, top to bottom:

Max (groups on: aid, wid)
    output: (aid, window-id, max)
            (a6, 70, 48)
Bucket (Range 5 minutes, Slide 1 minute, Wattr ts)
    output: (aid, amt, ts, window-id); tuples t1-t6 tagged with window-id ranges (70-74 and 71-75 are shown), plus punctuation p1 (*, *, *, 70)
    input: (aid, amt, ts)
        t1   (a5, 40, 01:06:30)
        t2   (a6, 42, 01:07:45)
        t3   (a5, 45, 01:08:15)
        t4   (a5, 47, 01:10)
        t5   (a6, 48, 01:10:30)
        t6   (a6, 46, 01:11:02)
        p1   (*,  *,  01:11:00)

Windows:
    01:06:00 – 01:11:00
    01:07:00 – 01:12:00
    01:08:00 – 01:13:00
[Figure: hash-table state of Max keyed on (wid, aid), showing running maxima for a5 and a6 in windows 70 through 75]

What’s the Difference
- Window semantics
  - Assumptions of data arrival order vs. window-id
  - Data arrival (query answer and result production) vs. punctuation
- Query evaluation performance
  - Space
  - Time
  - Latency

Bucket
- Bucket maps each tuple to windows: [Range 5 minutes Slide 1 minute Wattr ts]

Windows:
    A: 01:03:00 – 01:08:00
    B: 01:04:00 – 01:09:00
    C: 01:05:00 – 01:10:00
    D: 01:06:00 – 01:11:00
    E: 01:07:00 – 01:12:00
    F: 01:08:00 – 01:13:00

Tuples, in arrival order: (aid, amt, ts)
    t3   (a5, 45, 01:08:15)
    t2   (a6, 42, 01:07:45)
    t6   (a6, 46, 01:11:02)
    t1   (a5, 40, 01:06:30)
    t5   (a6, 48, 01:10:10)
    t4   (a5, 47, …)

Window Semantics Framework in NiagaraST
- T: the set of all tuples in the input stream
- S: a window specification
- W: a set of window-ids
- In this lecture, we assume time starts at 0
- windows: (T, S) → W
  - Defines the set of window ids to be used, e.g., 0, 1, 2, …
- extent: (T, S, w) → U ⊆ T, where w ∈ W
  - Specifies which tuples belong to a given window
- wids: (T, S, t) → V ⊆ W, where t ∈ T
  - Determines the set of window-ids to which a tuple belongs
  - Is the dual of extent
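The duality between extent and wids can be checked mechanically for a time-based sliding window. The sketch below is an assumption-laden illustration: tuples are represented by integer timestamps only, the window specification S is the pair (range, slide), and window ids start at 1 with window w covering the half-open interval that ends at w * slide.

```python
def wids(t, range_, slide):
    """Set of window ids to which a tuple with timestamp t belongs."""
    return set(range(t // slide + 1, (t + range_) // slide + 1))


def extent(timestamps, w, range_, slide):
    """Set of timestamps that belong to window w (half-open interval [start, end))."""
    start, end = max(0, w * slide - range_), w * slide
    return {t for t in timestamps if start <= t < end}


def is_dual(timestamps, window_ids, range_, slide):
    """True if t is in extent(w) exactly when w is in wids(t), for all t and w."""
    return all(
        (t in extent(timestamps, w, range_, slide)) == (w in wids(t, range_, slide))
        for t in timestamps
        for w in window_ids
    )
```

Because the two functions answer the same membership question from opposite directions, an operator such as bucket only needs wids, while a buffering implementation would work from extent.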

Handling Disorder
- Sort (in-order processing)
  - Slack: BSort in Aurora
  - Heartbeat in STREAM
  - Sort-based merge in Gigascope
  - Output buffering and sorting in a shared-window join
- Space and time cost

Out-of-Order Processing
- Out-of-order processing in join
  - M. A. Hammad, W. G. Aref, and A. K. Elmagarmid. Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005
- NiagaraST: punctuation + window-id
  - Analogue to a CPU’s out-of-order processing of instructions

Disorder Handling - WID

    Q1: SELECT aid, MAX(amt)
        FROM Bids [Range 5 minutes Slide 1 minute Wattr ts]
        GROUP BY aid

Query plan, top to bottom:

Max (group on: wid, aid)
    output: (aid, window-id, max)
            (a6, 70, 48)
bucket
    output: (aid, amt, ts, window-id)
        t6   (a6, 46, 01:11:02, 71-75)
        t7   (a5, 52, 01:10:15, 70-74)
        p1   (*,  *,  *,        70)
    input, in arrival order: (aid, amt, ts)
        t6   (a6, 46, 01:11:02)
        t7   (a5, 52, 01:10:15)
        p1   (*,  *,  01:11:00)
[Figure: hash-table state of Max keyed on (wid, aid); the late tuple t7 raises the a5 maximum from 47 to 52 in windows 70 through 74, while a6 holds 48 in windows 70 through 74 and 46 in window 75]

Sources of Punctuation
- External punctuation
  - Data sources, e.g., Gigascope
- Internal punctuation
  - Mechanisms that can be used to generate punctuations
    - Slack
    - Heartbeat
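As an illustration of internal punctuation, the sketch below derives punctuation from a slack value. The slides do not define slack precisely, so this takes one possible reading as an assumption: slack is an amount of time, and after seeing timestamp ts the generator asserts that nothing at or before `ts - slack` will arrive. The function name and the `"tuple"`/`"punct"` tags are hypothetical.

```python
def punctuate_with_slack(timestamps, slack):
    """Interleave input timestamps with punctuations trailing the high-water mark by slack."""
    high_water = None  # largest timestamp seen so far
    bound = None       # latest punctuated time
    for ts in timestamps:
        yield ("tuple", ts)
        if high_water is None or ts > high_water:
            high_water = ts
            if bound is None or high_water - slack > bound:
                bound = high_water - slack
                yield ("punct", bound)
```

A tuple that arrives more than `slack` behind the high-water mark contradicts an earlier punctuation, which is the source of the error that a larger slack reduces at the cost of latency.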

Latency vs. Accuracy: Band Disorder
- Compare external punctuation and two flavors of slack
- As slack increases, error decreases and latency increases
- External punctuation has better latency and accuracy than slack

Latency vs. Accuracy: Block-Sorted Disorder
[Figure: latency vs. accuracy (percentage of incorrect answers) under block-sorted disorder]

    SELECT COUNT(*)
    FROM Bids [Range 10 minutes Slide 1 minute Wattr ts]