COMP 5331: Data Stream
Prepared and presented by Raymond Wong (raywong@cse)
Data Mining over Static Data
Static data is fed to the mining algorithm, which produces the output (data mining results):
1. Association
2. Clustering
3. Classification
Data Mining over Data Streams
Unbounded data arrives continuously and must be processed in real time to produce the output (data mining results):
1. Association
2. Clustering
3. Classification
Data Streams
Each point in the stream is a transaction: points 1, 2, … arrive in order, from less recent to more recent.
Data Streams

                Traditional Data Mining        Data Stream Mining
  Data Type     Static data of limited size    Dynamic data of unlimited size (arriving at high speed)
  Memory        Limited                        Limited (more challenging)
  Efficiency    Time-consuming                 Must be efficient
  Output        Exact answer                   Approximate (or exact) answer
Entire Data Streams
Each point is a transaction, from less recent to more recent. Goal: obtain the data mining results from all data points read so far.
Data Streams with Sliding Window
Each point is a transaction, from less recent to more recent. Goal: obtain the data mining results over a sliding window covering only the most recent points.
Data Streams
- Entire Data Streams
- Data Streams with Sliding Window
Entire Data Streams
- Association (frequent pattern/item)
- Clustering
- Classification
Frequent Item over Data Streams
- Let N be the current length of the data stream.
- Let s be the support threshold, expressed as a fraction (e.g., 20%).
- Problem: find all items with frequency >= sN (a minimal exact baseline is sketched below).
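To make the target concrete, here is a minimal exact baseline in Python (the function name and example stream are illustrative, not from the slides). It counts every distinct item, which is exactly what an unbounded stream makes infeasible; the stream algorithms that follow approximate this output in bounded memory.

```python
from collections import Counter

def frequent_items_exact(stream, s):
    # Exact baseline: count every item, then report those whose
    # frequency is at least s*N. Memory grows with the number of
    # distinct items, which is infeasible for unbounded streams.
    counts = Counter(stream)
    n = sum(counts.values())
    return {e for e, f in counts.items() if f >= s * n}

# With s = 40% and N = 5, the threshold sN = 2, so only I1 qualifies.
print(frequent_items_exact(["I1", "I2", "I1", "I3", "I1"], 0.4))  # {'I1'}
```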
Data Streams
Static data gives exact output: frequent items {I1}; infrequent items {I2, I3}.
Unbounded data gives approximate output: frequent items {I1, I3}; infrequent items {I2}.
False Positive/Negative
E.g.,
- Expected output: frequent items I1; infrequent items I2, I3.
- Algorithm output: frequent items I1, I3; infrequent items I2.
- A false positive is an item that is classified as a frequent item but is in fact infrequent.
- Which item is one of the false positives? I3. The number of false positives is 1.
- If we say the algorithm has no false positives, we mean all truly infrequent items are classified as infrequent items in the algorithm output.
False Positive/Negative
E.g.,
- Expected output: frequent items I1, I3; infrequent items I2.
- Algorithm output: frequent items I1; infrequent items I2, I3.
- A false negative is an item that is classified as an infrequent item but is in fact frequent.
- Which item is one of the false negatives? I3. The number of false negatives is 1; the number of false positives is 0.
- If we say the algorithm has no false negatives, we mean all truly frequent items are classified as frequent items in the algorithm output.
Data Streams
Since a stream algorithm can only return an approximate answer in limited memory, we need to introduce an input error parameter ε.
N: total number of occurrences of items. Example with N = 20 and ε = 0.2, so εN = 4.
- Static data stores the exact statistics of all items: I1: 10, I2: 8, I3: 12.
- The stream algorithm estimates the statistics of all items: I1: 10, I2: 4, I3: 10.

  Item   True Frequency   Estimated Frequency   Difference Δ   Δ <= εN?
  I1     10               10                    0              Yes
  I2     8                4                     4              Yes
  I3     12               10                    2              Yes
ε-deficient synopsis
- Let N be the current length of the stream (i.e., the total number of occurrences of items).
- Let ε be an input error parameter (a real number between 0 and 1).
- An algorithm maintains an ε-deficient synopsis if its output satisfies the following properties:
  - Condition 1: there are no false negatives; all truly frequent items are classified as frequent items in the algorithm output.
  - Condition 2: the difference between the estimated frequency and the true frequency is at most εN.
  - Condition 3: all items whose true frequencies are less than (s − ε)N are classified as infrequent items in the algorithm output.
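The three conditions translate directly into a checker. A minimal sketch (function and variable names are mine, not from the slides), run against the N = 20 example above with s = 0.5:

```python
def is_eps_deficient(true_freq, est_freq, reported, s, eps, n):
    # true_freq/est_freq: dicts mapping item -> frequency;
    # reported: the set of items the algorithm classified as frequent.
    for e, f in true_freq.items():
        if f >= s * n and e not in reported:          # Condition 1
            return False
        if abs(f - est_freq.get(e, 0)) > eps * n:     # Condition 2
            return False
        if f < (s - eps) * n and e in reported:       # Condition 3
            return False
    return True

# The N = 20 example: with s = 0.5 the output {I1, I3} satisfies all
# three conditions (I2's error of 4 equals eps*N = 4, which is allowed).
true_f = {"I1": 10, "I2": 8, "I3": 12}
est_f = {"I1": 10, "I2": 4, "I3": 10}
print(is_eps_deficient(true_f, est_f, {"I1", "I3"}, 0.5, 0.2, 20))  # True
```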
Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm
Sticky Sampling Algorithm
Inputs: support threshold s, error parameter ε, confidence parameter δ.
Sticky Sampling reads the unbounded stream, stores statistics of items in memory, and outputs which items are frequent and which are infrequent.
Sticky Sampling Algorithm
- The sampling rate r varies over the lifetime of a stream.
- δ: confidence parameter (a small real number).
- Let t = (1/ε) ln(s⁻¹ δ⁻¹).

  Data No.       r (sampling rate)
  1 to 2t        1
  2t+1 to 4t     2
  4t+1 to 8t     4
  …              …
Sticky Sampling Algorithm
E.g., s = 0.02, ε = 0.01, δ = 0.1, so t = (1/0.01) ln(1/(0.02 × 0.1)) ≈ 622.

  Data No.                       r (sampling rate)
  1 to 1244 (1 to 2t)            1
  1245 to 2488 (2t+1 to 4t)      2
  2489 to 4976 (4t+1 to 8t)      4
  …                              …
Sticky Sampling Algorithm
E.g., s = 0.5, ε = 0.35, δ = 0.5, so t = (1/0.35) ln(1/(0.5 × 0.5)) ≈ 4.

  Data No.                   r (sampling rate)
  1 to 8 (1 to 2t)           1
  9 to 16 (2t+1 to 4t)       2
  17 to 32 (4t+1 to 8t)      4
  …                          …
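The schedule is easy to reproduce. A small sketch (function names are mine) computing t and the sampling rate for the n-th arrival; note the slides round t up (621.46 becomes 622, 3.96 becomes 4):

```python
import math

def sticky_t(s, eps, delta):
    # t = (1/eps) * ln(1 / (s * delta))
    return (1.0 / eps) * math.log(1.0 / (s * delta))

def sampling_rate(n, t):
    # r = 1 for the first 2t arrivals; after that, each doubling of r
    # covers a further span of r*t arrivals (2t+1..4t, 4t+1..8t, ...).
    r, upper = 1, 2 * t
    while n > upper:
        r *= 2
        upper += r * t
    return r

t = math.ceil(sticky_t(0.02, 0.01, 0.1))                  # 622, as above
print(t, sampling_rate(1244, t), sampling_rate(1245, t))  # 622 1 2
```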
Sticky Sampling Algorithm
S: empty list; it will contain entries (e, f), where e is an element and f is its estimated frequency.
1. When data e arrives:
   - if e exists in S, increment f in (e, f);
   - if e does not exist in S, add entry (e, 1) with probability 1/r (where r is the current sampling rate).
2. Just after r changes, for each entry (e, f):
   - repeatedly toss a coin with P(head) = 1/r until the outcome of the coin toss is head;
   - each time the outcome is tail, decrement f in (e, f);
   - if f = 0, delete the entry (e, f).
3. [Output] Output the list of items where f + εN >= sN.
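Putting the pieces together, here is a runnable sketch of Sticky Sampling in Python. The class layout is mine; the update, resampling, and output rules follow the steps above.

```python
import math
import random

class StickySampling:
    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = (1.0 / eps) * math.log(1.0 / (s * delta))
        self.counts = {}            # S: element -> estimated frequency f
        self.n = 0                  # stream length so far
        self.r = 1                  # current sampling rate
        self.upper = 2 * self.t     # last arrival covered by rate r

    def add(self, e):
        self.n += 1
        if self.n > self.upper:     # sampling rate doubles here
            self.r *= 2
            self.upper += self.r * self.t
            self._resample()
        if e in self.counts:
            self.counts[e] += 1
        elif random.random() < 1.0 / self.r:
            self.counts[e] = 1      # sampled in with probability 1/r

    def _resample(self):
        # For each entry, toss a coin with P(head) = 1/r; decrement f
        # once per tail, stopping at the first head or when f hits 0.
        for e in list(self.counts):
            while random.random() >= 1.0 / self.r:
                self.counts[e] -= 1
                if self.counts[e] == 0:
                    del self.counts[e]
                    break

    def frequent_items(self):
        # f underestimates the true count, so report e if f + eps*N >= s*N.
        return {e for e, f in self.counts.items()
                if f + self.eps * self.n >= self.s * self.n}
```

Because insertion is probabilistic, the output is only guaranteed to be an ε-deficient synopsis with some probability, quantified on the next slide.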
Analysis
- ε-deficient synopsis: Sticky Sampling computes an ε-deficient synopsis with probability at least 1 − δ.
- Memory consumption: Sticky Sampling occupies at most (2/ε) ln(s⁻¹ δ⁻¹) entries on average.
Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm
Lossy Counting Algorithm
Inputs: support threshold s, error parameter ε (no confidence parameter).
Lossy Counting reads the unbounded stream, stores statistics of items in memory, and outputs which items are frequent and which are infrequent.
Lossy Counting Algorithm
N: current length of the stream; each point is a transaction. The stream is divided into buckets 1, 2, 3, … of width w = ⌈1/ε⌉, and the current bucket id is b_current = ⌈N/w⌉.
Lossy Counting Algorithm
D: empty set; it will contain entries (e, f, Δ), where f is the frequency of element e since this entry was inserted into D, and Δ is the maximum possible error in f.
1. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ);
   - if e does not exist in D, add entry (e, 1, b_current − 1).
2. Remove some entries in D whenever N ≡ 0 (mod w), i.e., whenever the stream reaches a bucket boundary. The rule of deletion is: (e, f, Δ) is deleted if f + Δ <= b_current.
3. [Output] Output the list of items where f + εN >= sN.
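A runnable sketch of these steps in Python (the class layout is mine):

```python
import math

class LossyCounting:
    def __init__(self, s, eps):
        self.s, self.eps = s, eps
        self.w = math.ceil(1.0 / eps)   # bucket width w = ceil(1/eps)
        self.entries = {}               # D: e -> [f, max_error]
        self.n = 0                      # current stream length N

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            self.entries[e] = [1, b_current - 1]
        if self.n % self.w == 0:        # bucket boundary: prune D
            for x in list(self.entries):
                f, d = self.entries[x]
                if f + d <= b_current:
                    del self.entries[x]

    def frequent_items(self):
        # f underestimates the true count, so report e if f + eps*N >= s*N.
        return {e for e, (f, d) in self.entries.items()
                if f + self.eps * self.n >= self.s * self.n}
```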
Lossy Counting Algorithm
- ε-deficient synopsis: Lossy Counting computes an ε-deficient synopsis (with 100% confidence).
- Memory consumption: Lossy Counting occupies at most (1/ε) log(εN) entries.
Comparison
E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000.

  Algorithm         ε-deficient synopsis            Memory Consumption
  Sticky Sampling   with probability at least 1−δ   (2/ε) ln(s⁻¹δ⁻¹) = 1243 entries
  Lossy Counting    100% confidence                 (1/ε) log(εN) = 231 entries
Comparison
E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000.

  Algorithm         ε-deficient synopsis            Memory Consumption
  Sticky Sampling   with probability at least 1−δ   (2/ε) ln(s⁻¹δ⁻¹) = 1243 entries
  Lossy Counting    100% confidence                 (1/ε) log(εN) = 922 entries
Comparison
E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000.

  Algorithm         ε-deficient synopsis            Memory Consumption
  Sticky Sampling   with probability at least 1−δ   (2/ε) ln(s⁻¹δ⁻¹) = 1243 entries
  Lossy Counting    100% confidence                 (1/ε) log(εN) = 1612 entries

Note that Sticky Sampling's bound is independent of N, while Lossy Counting's grows with log(εN); for a long enough stream, Lossy Counting needs more memory than Sticky Sampling.
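These figures can be reproduced directly. A small sketch (function names are mine), taking log as the natural logarithm, which matches the slides' numbers up to rounding:

```python
import math

def sticky_entries(s, eps, delta):
    # (2/eps) * ln(1/(s*delta)); independent of the stream length N
    return (2.0 / eps) * math.log(1.0 / (s * delta))

def lossy_entries(eps, n):
    # (1/eps) * ln(eps*N); grows with the stream length N
    return (1.0 / eps) * math.log(eps * n)

print(math.ceil(sticky_entries(0.02, 0.01, 0.1)))   # 1243
for n in (10**3, 10**6, 10**9):
    print(n, math.ceil(lossy_entries(0.01, n)))     # 231, 922, 1612
```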
Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm
Recap: Sticky Sampling takes a support threshold s, an error parameter ε, and a confidence parameter δ; Lossy Counting takes only s and ε. Both store statistics of items in memory and output frequent vs. infrequent items over the unbounded stream.
Space-Saving Algorithm
Inputs: support threshold s, memory parameter M.
Space-Saving reads the unbounded stream, stores statistics of items in memory, and outputs which items are frequent and which are infrequent.
Space-Saving
- M: the greatest number of entries that can be stored in memory.
Space-Saving Algorithm
D: empty set; it will contain entries (e, f, Δ), where f is the frequency of element e since this entry was inserted into D, and Δ is the maximum possible error in f.
1. Initially, pe = 0.
2. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ);
   - if e does not exist in D:
     - if the size of D = M:
       - pe ← min over entries in D of (f + Δ);
       - remove all entries where f + Δ <= pe;
     - add entry (e, 1, pe).
3. [Output] Output the list of items where f + Δ >= sN.
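A runnable sketch of these steps in Python (the class layout is mine; the output test uses f + Δ, an upper bound on the true count, so no truly frequent item is missed):

```python
class SpaceSaving:
    def __init__(self, s, m):
        self.s, self.m = s, m
        self.entries = {}   # D: e -> [f, max_error]
        self.pe = 0         # error inherited by newly inserted entries
        self.n = 0          # current stream length N

    def add(self, e):
        self.n += 1
        if e in self.entries:
            self.entries[e][0] += 1
            return
        if len(self.entries) == self.m:
            # Memory full: evict the minimum of f + error; per the
            # slide's rule, the newcomer inherits that value as its
            # maximum possible error pe.
            self.pe = min(f + d for f, d in self.entries.values())
            for x in [x for x, (f, d) in self.entries.items()
                      if f + d <= self.pe]:
                del self.entries[x]
        self.entries[e] = [1, self.pe]

    def frequent_items(self):
        # f + error never underestimates the true count, so this test
        # produces no false negatives.
        return {e for e, (f, d) in self.entries.items()
                if f + d >= self.s * self.n}
```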
Space-Saving
- Greatest error: let E be the greatest error in any estimated frequency; then E <= 1/M.
- ε-deficient synopsis: Space-Saving computes an ε-deficient synopsis if E <= ε.
Comparison
E.g., s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000.

  Algorithm         ε-deficient synopsis            Memory Consumption
  Sticky Sampling   with probability at least 1−δ   (2/ε) ln(s⁻¹δ⁻¹) = 1243 entries
  Lossy Counting    100% confidence                 (1/ε) log(εN) = 1612 entries
  Space-Saving      100% confidence when E <= ε     M entries

For Space-Saving, the memory M can be very large (e.g., 4,000); since E <= 1/M, the error is then very small.
Data Streams
- Entire Data Streams
- Data Streams with Sliding Window
Data Streams with Sliding Window
- Association (frequent pattern/itemset)
- Clustering
- Classification
Sliding Window
- Mining frequent itemsets in a sliding window.
- E.g., t1: I1 I2; t2: I1 I3 I4; …
- Goal: find the itemsets that are frequent within the current sliding window.
Sliding Window
The sliding window covers the last 4 batches, B1 to B4, whose statistics are kept in storage.
Sliding Window
When a new batch B5 arrives, the window slides to cover B2 to B5: the whole oldest batch, B1, is removed from storage.
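A minimal sketch of this batch-based window in Python (names are mine); for brevity it counts single items per batch rather than itemsets, but the add-newest/evict-oldest mechanics are the same:

```python
from collections import Counter, deque

class BatchSlidingWindow:
    def __init__(self, k):
        self.k = k               # window size: last k batches
        self.batches = deque()   # one Counter of item counts per batch
        self.total = Counter()   # aggregated counts over the window

    def add_batch(self, batch):
        c = Counter(batch)
        self.batches.append(c)
        self.total += c
        if len(self.batches) > self.k:
            self.total -= self.batches.popleft()  # evict whole oldest batch

    def frequent_items(self, s):
        n = sum(self.total.values())
        return {e for e, f in self.total.items() if f >= s * n}

w = BatchSlidingWindow(k=4)
for batch in (["I1", "I2"], ["I1", "I3", "I4"], ["I1"], ["I2"], ["I3"]):
    w.add_batch(batch)           # the 5th batch pushes out the 1st
print(w.frequent_items(0.3))     # {'I1', 'I3'} over the last 4 batches
```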