Mining Top-K Frequent Items in a Data Stream

  • Slides: 29
Mining Top-K Frequent Items in a Data Stream with Flexible Sliding Windows
Hoang Thanh Lam and Toon Calders

Outline
  • A Survey of Frequency Measures
  • Max-Frequency
  • Problem Statement
  • Memory Issues
  • A Memory-efficient Stream Summary
  • A Stream Summary in Practice
  • Conclusions
/ name of department 1/20/2022

A Survey of Frequency Measures
I am monitoring the data stream
[Figure: data stream a b a c a d b along a time axis]

A Survey of Frequency Measures
Which items are the most important?
[Figure: data stream a b a c a d b along a time axis]

A Survey of Frequency Measures
  • Resource allocation
  • Prediction burst early
[Figure: data stream a b a c a d b along a time axis]

A Survey of Frequency Measures
  • Relative frequency (RF)
  • RF in a fixed-length sliding window
  • Measures based on a decaying factor
  • Max frequency (MaxFreq)
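The first three measures in this list can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are made up, the list order is assumed to be arrival order, and the decayed measure uses one common formulation (weight `decay**age`, normalized by the total weight), which the slides do not spell out.

```python
from fractions import Fraction

def relative_frequency(stream, item):
    """RF over the whole stream seen so far."""
    return Fraction(stream.count(item), len(stream))

def window_frequency(stream, item, width):
    """RF inside the last `width` items (fixed-length sliding window)."""
    window = stream[-width:]
    return Fraction(window.count(item), len(window))

def decayed_frequency(stream, item, decay):
    """Weighted RF: an item that arrived `age` steps ago has weight decay**age.

    Assumed formulation; the slides only name the family of measures.
    """
    n = len(stream)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    hit = sum(w for x, w in zip(stream, weights) if x == item)
    return hit / sum(weights)

stream = list("abacadb")  # the example stream from the slides, oldest item first
print(relative_frequency(stream, "a"))                 # 3/7
print(window_frequency(stream, "a", 3))                # 1/3
print(decayed_frequency(stream, "a", Fraction(1, 2)))  # 21/127
```

Both the window width and the decay factor are parameters the user has to pick, which is the dependence the max-frequency measure is meant to avoid.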

Max Frequency Measure
  • Independent of parameters
  • Recent occurrences are more important
  • Captures the entire history of the stream

A Survey of Frequency Measures
Max-Frequency
MaxFreq(a) = max{1/2, 2/3, 3/6, 4/7} = 2/3
[Figure: data stream with the windows of relative frequency 1/2, 2/3, 3/6 and 4/7 for item a marked]

Max-Frequency Measure
MaxFreq(a) = max{1/2, 2/3, 3/6, 4/7} = 2/3
[Figure: data stream with the windows of relative frequency 1/2, 2/3, 3/6 and 4/7 for item a marked, and the labels "Border Point" and "Max Point"]
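The max-frequency computation on this slide can be reproduced by brute force. The sketch below assumes the measure is the largest relative frequency over all windows ending at the current time; the stream `aaccaab` is a hypothetical one chosen so that the candidate windows give exactly 1/2, 2/3, 3/6 and 4/7, since the slide's own figure is not preserved.

```python
from fractions import Fraction

def max_freq(stream, item):
    """Largest RF of `item` over all windows that end at the current time."""
    best = Fraction(0)
    count = 0
    # Walk backwards from the newest item; `length` is the window size so far.
    for length, x in enumerate(reversed(stream), start=1):
        if x == item:
            count += 1
            # Only windows that start at an occurrence of `item` can be maximal.
            best = max(best, Fraction(count, length))
    return best

stream = list("aaccaab")  # hypothetical stream, oldest item first
print(max_freq(stream, "a"))  # 2/3
```

Checking only the windows that start at an occurrence of the item is safe: starting one position earlier on a different item keeps the count and grows the window, so the frequency cannot go up.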

Properties of Border Points
[Figure: an occurrence of a with two stream segments, A and B]
If there exist A and B such that RF(a, A) > RF(a, B), the given occurrence of a cannot be a border point; otherwise it is.

Stream summary

Minimum Threshold Principle
A border point whose current associated frequency is less than a pre-defined minimum threshold never becomes a max point with a frequency greater than that threshold.

Problem statement
  • Given a data stream S evolving over time
  • At every time point:
      • Maintain a summary
  • Upon request:
      • Give the top-k most frequent items according to MaxFreq
      • Based on the summary
      • Fast
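For comparison, the query can be answered exactly without any summary by rescanning the whole stream, which is the cost the summary is meant to avoid. This is a baseline sketch under the same reading of the max-frequency measure (largest relative frequency over windows ending now); `top_k_max_freq` and the sample stream are illustrative names and data, not taken from the slides.

```python
import heapq
from collections import defaultdict
from fractions import Fraction

def top_k_max_freq(stream, k):
    """Exact top-k by max frequency, computed in one backward pass over the stream."""
    counts = defaultdict(int)
    best = {}
    for length, x in enumerate(reversed(stream), start=1):
        counts[x] += 1
        rf = Fraction(counts[x], length)  # RF of x in the window of the last `length` items
        if rf > best.get(x, 0):
            best[x] = rf
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

print(top_k_max_freq(list("aaccaab"), 2))
```

The pass needs the full stream and one counter per distinct item, so it only serves as a reference point for what an approximate summary should return.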

Memory issues
  • It can be proven that in the worst case the number of border points is of the same order of magnitude as the stream size.
  • Storing the complete set of border points is not feasible in stream applications!

Memory issues
  • To answer even the top-2 query exactly, we need an amount of memory that is linear in the number of distinct items in the stream
  • In many applications this is not feasible:
      • with multiple streams, and
      • when the number of distinct items is huge
  • Therefore:
      • approximate algorithms

A Memory-efficient Stream Summary

A Memory-efficient Stream Summary
Guess this lower bound!

Guessing the lower bound
Stream: a b b c d a d d e a a a f f c c a c
  • Let X_k be the size of the minimum suffix of the stream S in which exactly k distinct items can be found
  • Example: X_5 = 11
  • Observation: 1/X_k is a lower bound on the MaxFreq of the top-k items
  • Choose [expression not preserved] as the guessed lower bound and use this for pruning
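The suffix length X_k can be checked directly on the example stream from this slide. The helper name `min_suffix_with_k_distinct` is made up for this sketch; it simply walks backwards from the newest item until k distinct items have been seen.

```python
def min_suffix_with_k_distinct(stream, k):
    """Length X_k of the shortest suffix of `stream` holding exactly k distinct items."""
    seen = set()
    for length, x in enumerate(reversed(stream), start=1):
        seen.add(x)
        if len(seen) == k:
            return length
    return None  # the stream has fewer than k distinct items so far

stream = "a b b c d a d d e a a a f f c c a c".split()
print(min_suffix_with_k_distinct(stream, 5))  # 11
```

Each of the k items found in that suffix occurs at least once in a window of length at most X_k ending now, so each has a max frequency of at least 1/X_k.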

MeanSummary
  • Prune all the border points at which the associated relative frequency of the given item is less than the pruning threshold

Accuracy analysis
  • There are no false positives
  • According to Markov's inequality, the probability of a false negative in the top-k list is bounded by: [formula not preserved]
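For reference, the general form of Markov's inequality applied to the non-negative random variable X_k is given below. This is only the textbook statement, with c a free constant introduced here for illustration; it is not a reconstruction of the specific bound shown on the slide.

```latex
\[
  \Pr\bigl[\, X_k \ge c \cdot E(X_k) \,\bigr] \;\le\; \frac{1}{c}, \qquad c > 0 .
\]
```

In words: X_k exceeds c times its mean with probability at most 1/c, which is the kind of statement that lets a threshold built from E(X_k) miss a true top-k item only rarely.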

However …
  • Estimating E(X_k) is well known as the Coupon Collector Problem (CCP)
  • No closed formula for E(X_k) is known so far for any distribution except the uniform distribution
  • Moreover, the stream is very dynamic: the item distribution changes over time
  • Therefore:
      • An algorithm that dynamically adjusts the threshold
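For the uniform case mentioned here, the coupon-collector expectation has a simple closed form: with n equally likely items, the expected number of draws to see k distinct ones is the sum of n/(n-i) for i = 0, ..., k-1. The sketch below evaluates it exactly; it assumes independent uniform draws, and the function name `expected_xk_uniform` is illustrative.

```python
from fractions import Fraction

def expected_xk_uniform(n, k):
    """E(X_k) when items are drawn independently and uniformly from n distinct items."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # After i distinct items have been seen, a new one shows up with probability (n - i) / n.
    return sum(Fraction(n, n - i) for i in range(k))

print(expected_xk_uniform(5, 5))         # 137/12
print(float(expected_xk_uniform(5, 5)))  # about 11.42
```

Real streams are neither uniform nor stationary, which is why a fixed threshold derived from such a formula is hard to justify in practice.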

A Stream Summary in Practice

MinSummary
  • Initialize the pruning threshold with the lowest possible value
  • At every time step:
      • Look up the minimum MaxFreq of the top-l stored items
      • Update the threshold if this value is greater than its current value
  • In this way:
      • The threshold is dynamically increasing
      • There are no false positives
      • Frequencies are exact
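The threshold update described on this slide can be sketched on its own, leaving out how the summary stores border points. Everything concrete here is an assumption for illustration: the summary is reduced to a dictionary from item to its current max frequency, the starting value is 0.0, and the three snapshots are made-up numbers.

```python
import heapq

def update_threshold(threshold, stored_max_freq, l):
    """Raise the pruning threshold to the smallest max frequency among the top-l stored items."""
    if len(stored_max_freq) < l:
        return threshold  # not enough items stored yet to define a top-l
    lth_best = heapq.nlargest(l, stored_max_freq.values())[-1]
    return max(threshold, lth_best)  # the threshold never decreases

threshold = 0.0  # assumed lowest possible value
snapshots = (
    {"a": 0.5, "b": 0.2},
    {"a": 0.6, "b": 0.4, "c": 0.1},
    {"a": 0.3, "b": 0.2, "c": 0.1},
)
for stored in snapshots:
    threshold = update_threshold(threshold, stored, l=2)
    print(threshold)  # 0.2, then 0.4, then 0.4
```

Taking the maximum with the previous value is what makes the threshold only move upwards, matching the "dynamically increasing" behaviour listed above.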

Datasets used in the experiments

Accuracy of MinSummary

Memory Usage: Kosarak Data

Memory Usage: Sligro Data

Conclusions
  • We prove that the exact solution for the top-k frequent item problem requires a prohibitive amount of memory
  • However, it is possible to summarize the stream with very small memory usage and still solve the problem with high accuracy

Questions
  • Thank you very much!