Mining TopK Frequent Items in a Data Stream
- Slides: 29
Mining Top-K Frequent Items in a Data Stream with Flexible Sliding Windows Hoang Thanh Lam and Toon Calders
Outline • • A Survey of Frequency Measures Max-Frequency Problem Statement Memory Issues A Memory-efficient Stream Summary A Stream Summary in Practice Conclusions / name of department 1/20/2022 1
A Survey of Frequency Measures I am monitoring the data stream a b a c a d b Time Data Stream / name of department 1/20/2022 2
A Survey of Frequency Measures Which items are the most important? a b a c a d b Time Data Stream / name of department 1/20/2022 3
A Survey of Frequency Measures Resource allocation Prediction burst early a b a c a d b Time Data Stream / name of department 1/20/2022 4
A Survey of Frequency Measures • • Relative frequency (RF) RF in a fixed-length sliding window Measures based on decaying factor Max frequency (Max. Freq) / name of department 1/20/2022 5
Max Frequency Measure • Independent from parameters • Recent occurrence is more important • Captures the entire history of the stream / name of department 1/20/2022 6
A Survey of Frequency Measures Max-Frequency Max. Freq(a)=Max {1/2, 2/3, 3/6, 4/7} = 2/3 4/7 3/6 a b a a c c 2/3 1/2 a d a b Data Stream / name of department 1/20/2022 7
Max-Frequency Measure Max. Frequency Max. Freq(a)=Max {1/2, 2/3, 3/6, 4/7} = 2/3 Border Point 4/7 3/6 a b a Max Point a c c 2/3 1/2 a d a b Data Stream / name of department 1/20/2022 8
Properties of Border Points a A B If there exists A and B such that RF(a, A)>RF(a, B), the given occurrence of a cannot be a border point, otherwise it is. / name of department 1/20/2022 9
Stream summary / name of department 1/20/2022 10
Minimum Threshold Principle A border point with the current associated frequency less than a pre-defined minimum threshold never becomes a max point with frequency greater than that threshold value. / name of department 1/20/2022 11
Problem statement • Given a data stream S evolving over time, • At every time point: • Maintain a summary • Upon request: • Give top-k most frequent items according to Max. Freq • Based on the summary • fast / name of department 1/20/2022 12
Memory issues • It can be proven that in the worst case the number of border points is the same order of magnitude as the stream size. Storing the complete set of borders is not feasible in stream applications !!! / name of department 1/20/2022 13
Memory issues • To answer even the top-2 query exactly we need an amount of memory being linear in the number of distinct items in the stream • In many applications it is not feasible: • with multiple streams and • the number of distinct of items is huge • Therefore: • approximate algorithms / name of department 1/20/2022 14
A Memory-efficient Stream Summary / name of department 1/20/2022 15
A Memory-efficient Stream Summary Guess this lower bound !!! / name of department 1/20/2022 16
Guessing the lower bound a b b c d a d d e a a a f f c c a c • Let Xk be the size of the minimum suffix of the stream S such that exact k distinct items can be found in this suffix X 5= 11 Observation: Choose / name of department is a lower bound on the Max. Freq of the top-k items as the guessed lower bound and use this for pruning 1/20/2022 17
Mean. Summary • Pruning all the border points at which the associated relative frequency of the given item is less than the pruning threshold / name of department 1/20/2022 18
Accuracy analysis • There is no false positive • According to the Markov’s inequality the possible false negative top-k list is bounded by: / name of department 1/20/2022 19
However … • Estimation of E(Xk) is well-known as the Coupon Collector Problem (CCP) • No closed formula of E(Xk) for any distribution is known so far except for the uniform distribution • Moreover, stream is very dynamic, item distribution is changing over time Therefore: • Algorithm that dynamically adjusts threshold / name of department 1/20/2022 20
A Stream Summary in Practice / name of department 1/20/2022 21
Min. Summary • Initialize the pruning threshold with lowest possible value • At every time step: • Look up the minimum Max. Freq of the top-l stored items • Update if this value is greater than its current value • In this way: • threshold is dynamically increasing • There are no false positives • Frequencies are exact / name of department 1/20/2022 22
Datasets used in the experiments / name of department 1/20/2022 23
Accuracy of Min. Summary / name of department 1/20/2022 24
Memory Usage Kosarak Data / name of department 1/20/2022 25
Memory Usage Sligro Data / name of department 1/20/2022 26
Conclusions • We prove that the exact solution for the top-k frequent item problem requires a prohibited amount of memory • However it is possible to summarize the stream with very small memory usage to solve the problem with high accuracy / name of department 1/20/2022 27
Questions • Thank you very much ! / name of department 1/20/2022 28
- Finding frequent items in data streams
- Apriori algorithm
- Mining complex types of data in data mining
- Associations and correlations in data mining
- X videos
- Dbminer
- Mining frequent patterns without candidate generation
- Mining frequent patterns associations and correlations
- Bloom filter for stream data mining
- Mining multimedia databases in data mining
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Lliver
- Caudal region of the body
- Reproductive system
- Mosby items and derived items
- Mosby items and derived items
- Autonomic drugs
- Mosby items and derived items
- Absorptive atelectasis
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items
- Mosby items and derived items