Data Mining for Data Streams
11/2/2020
Data Mining: Concepts and Techniques

Mining Data Streams
• What is stream data? Why stream data systems?
• Stream data management systems: issues and solutions
• Stream data cube and multidimensional OLAP analysis
• Stream frequent pattern analysis
• Stream classification
• Stream cluster analysis
• Sketching

Characteristics of Data Streams
• Data stream model
  • The system cannot store the entire stream, only a small fraction of it
  • How do you make critical calculations about the stream using a limited amount of memory?
• Characteristics
  • Huge volumes of continuous data, possibly infinite
  • Fast changing; requires fast, real-time response
  • Data enters at a high rate
  • Random access is expensive: single-scan algorithms (each element can be looked at only once)

Architecture: Stream Query Processing
[Figure: SDMS (Stream Data Management System) architecture. Components shown: multiple input streams, a stream query processor, scratch space (main memory and/or disk), and a user/application that issues continuous queries and receives results.]

Stream Data Applications
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering and industrial processes: power supply and manufacturing
• Sensor, monitoring, and surveillance: video streams, RFIDs
• Web logs and Web page click streams
• Massive data sets (even if saved, random access is too expensive)

DBMS versus DSMS

DBMS                                        | DSMS
Persistent relations                        | Transient streams
One-time queries                            | Continuous queries
Random access                               | Sequential access
"Unbounded" disk store                      | Bounded main memory
Only current state matters                  | Historical data is important
No real-time services                       | Real-time requirements
Relatively low update rate                  | Possibly multi-GB arrival rate
Data at any granularity                     | Data at fine granularity
Assume precise data                         | Data stale/imprecise
Access plan determined by query processor,  | Unpredictable/variable data arrival
physical DB design                          | and characteristics

Acknowledgment: from Motwani's PODS tutorial slides.

Processing Stream Queries
• Query types
  • One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  • Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  • For real-time response, main-memory algorithms should be used
  • The memory requirement is unbounded if a query must join with future tuples
• Approximate query answering
  • With bounded memory, it is not always possible to produce exact answers
  • High-quality approximate answers are desired
  • Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing
• Major challenges
  • Keep track of a large universe, e.g., pairs of IP addresses, not ages
• Methodology
  • Synopses (trade-off between accuracy and storage)
  • Use synopsis data structures that are much smaller (O(log^k N) space) than their base data set (O(N) space)
  • Compute an approximate answer within a small error range (a factor ε of the actual answer)
• Major methods
  • Random sampling
  • Histograms
  • Sliding windows
  • Multi-resolution models
  • Sketches
  • Randomized algorithms

Stream Data Processing Methods (1)
• Random sampling (without knowing the total stream length in advance)
  • Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.
• Sliding windows
  • Make decisions based only on recent data within a sliding window of size w
  • An element arriving at time t expires at time t + w
• Histograms
  • Approximate the frequency distribution of element values in a stream
  • Partition the data into a set of contiguous buckets
  • Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
• Multi-resolution models
  • Popular models: balanced binary trees, micro-clusters, and wavelets
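
Reservoir sampling can be sketched in a few lines. The following is a minimal illustration, not code from the slides: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example. The N-th element (for N > s) enters the reservoir with probability s/N and evicts a uniformly chosen old element.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items from an iterable
    whose length is not known in advance (single pass, O(s) memory)."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot uniformly in [0, n); it falls inside the
            # reservoir with probability s/n.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = item  # replace a uniformly chosen old item
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Passing a seeded `random.Random` makes a run reproducible; if the stream is shorter than s, the whole stream is returned.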

Stream Data Mining vs. Stream Querying
• Stream mining: a more challenging task in many cases
  • It shares most of the difficulties of stream querying
  • But it often requires less "precision", e.g., no join, grouping, or sorting
  • Patterns are hidden and more general than query answers
  • It may require exploratory analysis
  • Not necessarily continuous queries
• Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream data
  • Clustering data streams
  • Classification of stream data

Challenges for Mining Dynamics in Data Streams
• Most stream data are low-level or multidimensional in nature and need multi-level/multi-dimensional (ML/MD) processing
• Analysis requirements
  • Multi-dimensional trends and unusual patterns
  • Capturing important changes at multiple dimensions/levels
  • Fast, real-time detection and response
  • Comparison with the data cube: similarities and differences
• Stream (data) cube or stream OLAP: is this feasible?
  • Can we implement it efficiently?

Multi-Dimensional Stream Analysis: Examples
• Analysis of Web click streams
  • Raw data at low levels: seconds, Web page addresses, user IP addresses, …
  • Analysts want changes, trends, and unusual patterns at reasonable levels of detail
  • E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
• Analysis of power consumption streams
  • Raw data: power consumption flow for every household, every minute
  • Patterns one may find: average hourly power consumption of manufacturing companies in Chicago in the last 2 hours today is 30% higher than on the same day a week ago

A Stream Cube Architecture
• A tilted time frame
  • Different time granularities: second, minute, quarter, hour, day, week, …
• Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User: watches at the o-layer and occasionally needs to drill down to the m-layer
• Partial materialization of stream cubes
  • Full materialization: too space- and time-consuming
  • No materialization: slow response at query time
  • Partial materialization…

A Tilted Time Model
• Natural tilted time frame
  • Example: the minimal unit is a quarter; then 4 quarters → 1 hour, 24 hours → 1 day, …
• Logarithmic tilted time frame
  • Example: the minimal unit is 1 minute; then 1, 2, 4, 8, 16, 32, …
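
To see why the logarithmic tilted time frame is so compact, the sketch below counts how many slots are needed to cover a given time span. It assumes, following the example on the slide, that slot widths start at the minimal unit and double each step; the function name `log_tilted_slots` is hypothetical.

```python
def log_tilted_slots(span, base_unit=1):
    """Return the slot widths (most recent first) needed to cover `span`
    time units when each slot is twice as wide as the previous one."""
    widths = []
    width, covered = base_unit, 0
    while covered < span:
        widths.append(width)
        covered += width
        width *= 2
    return widths

one_hour = log_tilted_slots(60)               # widths in minutes
one_year = log_tilted_slots(365 * 24 * 60)    # a year at minute granularity
```

Under this assumption, one hour of history needs 6 slots and a full year at minute granularity needs only 20.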

Two Critical Layers in the Stream Cube
• o-layer (observation): (*, theme, quarter)
• m-layer (minimal interest): (user-group, URL-group, minute)
• (Primitive) stream data layer: (individual-user, URL, second)

On-Line Partial Materialization vs. OLAP Processing
• On-line materialization
  • Materialization takes precious space and time
    • Only incremental materialization (with tilted time frame)
  • Only materialize "cuboids" of the critical layers?
    • Online computation may take too much time
  • Preferred solution
    • Popular-path approach: materialize the cuboids along the popular drilling paths
    • H-tree structure: such cuboids can be computed and stored efficiently using the H-tree structure
• Online aggregation vs. query-based computation
  • Online computing while streaming: aggregating stream cubes
  • Query-based computation: using computed cuboids

Mining Approximate Frequent Patterns
• Mining precise frequent patterns in stream data is unrealistic
  • Even if they are stored in a compressed form, such as an FP-tree
• Approximate answers are often sufficient (e.g., trend/pattern analysis)
  • Example: a router is interested in all flows whose frequency is at least 1% (s) of the entire traffic stream seen so far, and feels that an error of 1/10 of s (ε = 0.1%) is comfortable
• How to mine frequent patterns with good approximation?
  • Lossy Counting algorithm (Manku & Motwani, VLDB'02)
  • Based on majority voting…

Majority
• A sequence of N items.
• You have constant memory.
• In one pass, decide whether some item is in the majority (occurs > N/2 times).
• Example: 2 9 9 9 7 6 4 9 9 9 3 9 (N = 12; item 9 is the majority)

Misra-Gries Algorithm ('82)
• Keep a counter and an ID.
  • If the counter is 0, store the new item with count = 1.
  • If the new item is the same as the stored ID, increment the counter.
  • Otherwise, decrement the counter.
• If the counter is > 0 at the end, its item is the only candidate for majority.
• Trace on the example stream:
  Item:  2 9 9 9 7 6 4 9 9 9 3 9
  ID:    2 2 9 9 9 9 4 4 9 9 9 9
  Count: 1 0 1 2 1 0 1 0 1 2 1 2
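
The one-counter scheme above can be sketched as follows. This single-counter form is also widely known as the Boyer-Moore majority vote; the function names `majority_candidate` and `is_majority` are assumptions for this example, and the second function is the extra verification pass that a one-pass run cannot replace.

```python
def majority_candidate(stream):
    """One pass with a single (ID, counter) pair.
    Returns the only possible majority item, or None if the counter ends at 0."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1   # counter is 0: store the new item
        elif item == candidate:
            count += 1                   # same as stored ID: increment
        else:
            count -= 1                   # different item: decrement
    return candidate if count > 0 else None

def is_majority(items, candidate):
    """Second pass: check that the candidate really occurs > N/2 times."""
    return candidate is not None and items.count(candidate) > len(items) / 2

items = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
candidate = majority_candidate(items)
```

The candidate is only a candidate: on a stream such as 1, 2, 3 the procedure reports 3 even though no item is in the majority, which is why the verification pass matters.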

A Generalization: Frequent Items
• Find k items, each occurring at least N/(k+1) times.
• Data structure: k slots, each holding an (ID, count) pair: ID1, ID2, …, IDk
• Algorithm
  • Maintain k items and their counters.
  • If the next item x is one of the k, increment its counter.
  • Else, if some counter is zero, put x there with count = 1.
  • Else (all counters non-zero), decrement all k counters.

Frequent Elements: Analysis
• A frequent item's count is decremented only when all k counters are in use (non-zero); each such step erases k+1 items (one from each counter, plus the new item).
• If x occurs > N/(k+1) times, then it cannot be completely erased.
• Similarly, x must get inserted at some point, because there are not enough items to keep it away.
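
The k-counter generalization can be sketched with a dictionary. This is an illustrative sketch rather than the slides' own code: the name `frequent_candidates` is hypothetical, and a "zero counter" is modeled by removing the entry from the dictionary, which frees the slot.

```python
def frequent_candidates(stream, k):
    """Keep at most k (item -> counter) entries in one pass.
    Every item occurring more than N/(k+1) times survives in the result;
    the surviving counters are lower bounds, not exact frequencies."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1              # x is one of the k: increment
        elif len(counters) < k:
            counters[x] = 1               # a free (zero) slot exists
        else:
            # All k counters are non-zero: decrement every one of them.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]     # slot becomes free again
    return counters

items = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
survivors = frequent_candidates(items, k=1)
```

With k = 1 this reduces to a majority test; on the example stream only item 9 survives, with a counter of 2 against a true frequency of 7.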

Problem of False Positives
• False positives in the Misra-Gries algorithm
  • It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters.
  • How can we tell whether the non-zero counters correspond to true heavy hitters? A second pass is needed to verify.
• False positives are problematic if heavy hitters are used for billing or punishment.
• What guarantees can we achieve in one pass?

Approximation Guarantees
• Find heavy hitters with a guaranteed approximation error [Demaine et al., Manku-Motwani, Estan-Varghese, …]
• Manku-Motwani (Lossy Counting)
  • Suppose you want s-heavy hitters: items with frequency > sN
  • An approximation parameter ε, where ε << s (e.g., s = .01 and ε = .0001; s = 1% and ε = .01%)
  • Identify all items with frequency > sN
  • No reported item has frequency < (s - ε)N
  • The algorithm uses O((1/ε) log(εN)) memory
• G. Manku, R. Motwani. Approximate Frequency Counts over Data Streams, VLDB'02

Lossy Counting
• Step 1: divide the stream into "windows" (Window 1, Window 2, Window 3, …)
• Is the window size a function of the support s? Will fix later…

Lossy Counting in Action...
[Figure: frequency counts (initially empty) + first window]
• At the window boundary, decrement all counters by 1

Lossy Counting continued...
[Figure: frequency counts + next window]
• At the window boundary, decrement all counters by 1

Error Analysis
• How much do we undercount?
• If the current size of the stream is N and the window size is 1/ε, then the frequency error ≤ number of windows = εN
• Rule of thumb: set ε = 10% of the support s
• Example: given support frequency s = 1%, set error frequency ε = 0.1%

Output: elements with counter values exceeding sN - εN
• Approximation guarantees
  • Frequencies are underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least sN - εN
• How many counters do we need?
  • Worst case: (1/ε) log(εN) counters [see paper for proof]
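
The windowed scheme described on these slides can be sketched as below. This is the simplified variant as presented here (decrement every counter by 1 at each window boundary and drop zeros), not the per-entry bookkeeping of the original paper; the function name `lossy_count` is an assumption. The sketch reports items whose counter is at least (s - ε)N, using "at least" rather than "exceeding" so that an item with true frequency exactly sN cannot be missed.

```python
import math

def lossy_count(stream, s, eps):
    """One-pass approximate frequency counts.
    s   : support threshold as a fraction of the stream length
    eps : error bound as a fraction of the stream length (eps << s)"""
    w = math.ceil(1 / eps)          # window size 1/eps
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % w == 0:
            # Window boundary: decrement all counters, drop those reaching 0.
            counts = {x: c - 1 for x, c in counts.items() if c > 1}
    threshold = (s - eps) * n
    return {x: c for x, c in counts.items() if c >= threshold}

# 90 items: "a" appears 30 times, every other item appears once.
stream = ["a" if i % 3 == 0 else i for i in range(90)]
heavy = lossy_count(stream, s=0.2, eps=0.1)
```

In this run the window size is 10, so there are 9 boundaries and the counter for "a" ends at 21, which is within εN = 9 of its true frequency of 30; every singleton is dropped at its window boundary.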

Enhancements...
• Frequency errors
  • For counter (X, c), the true frequency is in [c, c + εN]
  • Trick: remember window IDs. For counter (X, c, w), the true frequency is in [c, c + w - 1]. If w = 1, there is no error!
• Batch processing
  • Apply the decrements after every k windows

Algorithm 2: Sticky Sampling
[Figure: a stream of elements 28 31 41 23 35 19 34 15 30]
• Create counters by sampling the stream
• Maintain exact counts thereafter
• At what rate should we sample?

Sticky Sampling contd...
• For a finite stream of length N
  • Sampling rate = (2/(Nε)) log(1/(sδ)), where δ = probability of failure
• Output: elements with counter values exceeding sN - εN
• Approximation guarantees (probabilistic)
  • Frequencies are underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least sN - εN
• Same error guarantees as Lossy Counting, but probabilistic
• Same rule of thumb: set ε = 10% of the support s
• Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%

Sampling rate?
• Finite stream of length N
  • Sampling rate: (2/(Nε)) log(1/(sδ))
• Infinite stream with unknown N
  • Gradually adjust the sampling rate (see paper for details)
• In either case, the expected number of counters = (2/ε) log(1/(sδ)), independent of N!
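
A quick calculation makes the two space bounds concrete. The helper names below are hypothetical, and the sketch assumes natural logarithms; a different log base changes the numbers by a constant factor only.

```python
import math

def sticky_expected_counters(s, eps, delta):
    """Expected number of counters for Sticky Sampling: (2/eps) * log(1/(s*delta))."""
    return (2 / eps) * math.log(1 / (s * delta))

def lossy_worst_case_counters(eps, n):
    """Worst-case number of counters for Lossy Counting: (1/eps) * log(eps*n)."""
    return (1 / eps) * math.log(eps * n)

# Parameters from the slides: s = 1%, eps = 0.1%, delta = 0.01%.
sticky = sticky_expected_counters(s=0.01, eps=0.001, delta=0.0001)
lossy_at_1e6 = lossy_worst_case_counters(eps=0.001, n=10**6)
lossy_at_1e9 = lossy_worst_case_counters(eps=0.001, n=10**9)
```

Under these assumptions the Sticky Sampling figure is roughly 27,600 counters regardless of N, while the Lossy Counting worst case grows slowly with the stream length (roughly 6,900 at N = 10^6 and 13,800 at N = 10^9).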

[Figure: number of counters vs. log10 of N (stream length), for support s = 1% and error ε = 0.1%. Curves: Sticky Sampling, expected (2/ε) log(1/(sδ)); Lossy Counting, worst case (1/ε) log(εN).]

From elements to sets of elements …

Frequent Itemsets Problem...
• Identify all subsets of items in the stream whose current frequency exceeds s = 0.1%.
• Frequent itemsets => association rules

Three Modules
• TRIE
• SUBSET-GEN
• BUFFER

Module 1: TRIE
• Compact representation of frequent itemsets in lexicographic order.
[Figure: example trie (node labels 50, 40, 30, 31, 29, 32, 45, 42) representing sets with frequency counts.]

Module 2: BUFFER
• Holds a batch of windows in main memory (Window 1 through Window 6 in the figure)
• Compact representation as a sequence of ints
• Transactions sorted by item ID
• Bitmap for transaction boundaries

Module 3: SUBSET-GEN
[Figure: BUFFER feeding SUBSET-GEN]
• Produces frequency counts of subsets in lexicographic order

Overall Algorithm...
[Figure: BUFFER feeds SUBSET-GEN; its output is combined with the TRIE to produce the new TRIE.]
• Problem: the number of subsets is exponential!

SUBSET-GEN Pruning Rules
• A-priori pruning rule
  • If set S is infrequent, every superset of S is infrequent.
• Lossy Counting pruning rule
  • At each "window boundary", decrement TRIE counters by 1.
  • Actually, "batch deletion": at each "main memory buffer" boundary, decrement all TRIE counters by b.
• See paper for details...

Bottlenecks...
[Figure: the same BUFFER, SUBSET-GEN, TRIE, new TRIE pipeline.]
• TRIE: consumes main memory
• SUBSET-GEN: consumes CPU time

Design Decisions for Performance
• TRIE (main memory bottleneck)
  • Compact linear array of (element, counter, level) tuples in preorder traversal; no pointers!
  • Tries are on disk; all of main memory is devoted to BUFFER
  • Pair of tries, old and new (in chunks)
  • mmap() and madvise()
• SUBSET-GEN (CPU bottleneck)
  • Very fast implementation; see paper for details
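
The pointer-free layout can be illustrated with a small sketch. It assumes that itemsets are stored as sorted tuples and that every prefix of a stored itemset is itself stored; both function names are hypothetical and the on-disk chunking and mmap() handling are not modeled.

```python
def trie_to_array(itemset_counts):
    """Flatten a trie of itemsets into (element, counter, level) tuples.
    Sorting the itemset tuples lexicographically yields a preorder
    traversal, because a prefix always sorts before its extensions."""
    return [(itemset[-1], count, len(itemset) - 1)
            for itemset, count in sorted(itemset_counts.items())]

def array_to_itemsets(array):
    """Rebuild the itemsets by replaying the preorder traversal:
    the level says how much of the current path to keep."""
    path, result = [], {}
    for element, count, level in array:
        del path[level:]          # climb back up to the parent at this level
        path.append(element)
        result[tuple(path)] = count
    return result

counts = {("a",): 5, ("a", "b"): 3, ("a", "c"): 2, ("b",): 4}
flat = trie_to_array(counts)
```

The level field replaces the child pointers: a reader walking the array sequentially always knows its depth, which suits sequential scans of a disk-resident structure.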

Classification for Dynamic Data Streams
• Decision tree induction for stream data classification
  • VFDT (Very Fast Decision Tree)/CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)
• Is a decision tree good for modeling fast-changing data, e.g., stock market analysis?
• Other stream classification methods
  • Instead of decision trees, consider other models
    • Naïve Bayesian
    • Ensemble (Wang, Fan, Yu, Han. KDD'03)
    • K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD'04)
  • Tilted time framework, incremental updating, dynamic maintenance, and model construction
  • Comparison of models to find changes

Hoeffding Tree
• With high probability, classifies tuples the same way as a tree built by a conventional batch learner
• Uses only a small sample
  • Based on the Hoeffding bound principle
• Hoeffding bound (additive Chernoff bound)
  • r: random variable
  • R: range of r
  • n: number of independent observations
  • The mean of r is at least r_avg - ε, with probability 1 - δ
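
The expression for ε did not survive in this transcript. For reference, the standard form of the Hoeffding bound, written with the symbols defined above (R the range of r, n the number of independent observations, δ the failure probability), is:

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```

Because ε shrinks as 1/√n, a node needs only enough examples for the gap between the two best attributes to exceed ε, which is why a small sample suffices.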

Hoeffding Tree Algorithm
• Hoeffding tree input
  • S: sequence of examples
  • X: attributes
  • G( ): evaluation function
  • δ: desired accuracy
• Hoeffding tree algorithm
    for each example in S
        retrieve G(Xa) and G(Xb)    // the two highest G(Xi)
        if ( G(Xa) - G(Xb) > ε )
            split on Xa
            recurse to next node
            break
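
The split test in the pseudocode above can be made concrete as follows. This is a sketch of the decision at a single leaf only, not a full tree learner: the function names, the `gains` dictionary, and the example attribute names are assumptions, and ε is taken from the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)).

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(gains, value_range, delta, n):
    """gains: dict mapping attribute name -> G value observed at this leaf
    after n examples (needs at least two attributes). Returns the attribute
    to split on, or None if more examples are needed."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, g_a), (_, g_b) = ranked[0], ranked[1]   # two highest G(Xi)
    eps = hoeffding_bound(value_range, delta, n)
    return best if g_a - g_b > eps else None

gains = {"packets": 0.30, "bytes": 0.15, "protocol": 0.05}
early = should_split(gains, value_range=1.0, delta=1e-7, n=100)
later = should_split(gains, value_range=1.0, delta=1e-7, n=1000)
```

With these illustrative numbers, the gap of 0.15 is not yet significant after 100 examples (ε is about 0.28) but is after 1000 examples (ε is about 0.09), so the leaf splits on the best attribute only in the second case.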

Decision-Tree Induction with Data Streams
[Figure: two snapshots of a decision tree being induced from a data stream. The first tests Packets > 10 (yes/no) and has a leaf Protocol = http; the second also tests Bytes > 60K (yes/no) and has leaves Protocol = ftp and Protocol = http.]
Acknowledgment: from Gehrke's SIGMOD tutorial slides.

Hoeffding Tree: Strengths and Weaknesses
• Strengths
  • Scales better than traditional methods
    • Sublinear with sampling
    • Very small memory utilization
  • Incremental
    • Makes class predictions in parallel
    • New examples are added as they come
• Weaknesses
  • Could spend a lot of time on ties
  • Memory used with tree expansion
  • Number of candidate attributes

VFDT (Very Fast Decision Tree)
• Modifications to the Hoeffding tree
  • Near-ties broken more aggressively
  • G computed every n_min examples
  • Deactivates certain leaves to save memory
  • Poor attributes dropped
  • Initialize with a traditional learner (helps the learning curve)
• Compared to the Hoeffding tree: better time and memory
• Compared to a traditional decision tree
  • Similar accuracy
  • Better runtime with 1.61 million examples
    • 21 minutes for VFDT
    • 24 hours for C4.5

CVFDT (Concept-adapting VFDT)
• Concept drift
  • Time-changing data streams
  • Incorporate new examples and eliminate old ones
• CVFDT
  • Increments counts with each new example
  • Decrements counts for each old example
    • Sliding window
    • Nodes assigned monotonically increasing IDs
  • Grows alternate subtrees
  • When an alternate subtree is more accurate, it replaces the old one
  • O(w) better runtime than VFDT-window

Clustering Data Streams [GMMO01]
• Based on the k-median method
  • Data stream points come from a metric space
  • Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized
• Constant-factor approximation algorithm
  • In small space, a simple two-step algorithm:
    1. For each set of M records, S_i, find O(k) centers in S_1, …, S_l
       • Local clustering: assign each point in S_i to its closest center
    2. Let S' be the centers for S_1, …, S_l, with each center weighted by the number of points assigned to it
       • Cluster S' to find k centers

Hierarchical Clustering Tree
[Figure: a tree with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top.]

Hierarchical Tree and Drawbacks
• Method
  • Maintain at most m level-i medians
  • On seeing m of them, generate O(k) level-(i+1) medians, each with weight equal to the sum of the weights of the intermediate medians assigned to it
• Drawbacks
  • Low quality for evolving data streams (registers only k centers)
  • Limited functionality in discovering and exploring clusters over different portions of the stream over time

Clustering for Mining Stream Dynamics
• Network intrusion detection: one example
  • Detect bursts of activity or abrupt changes in real time by online clustering
• Another approach
  • Tilted time framework: otherwise, dynamic changes cannot be found
  • Micro-clustering: better quality than k-means/k-median
    • Incremental, online processing and maintenance
  • Two stages: micro-clustering and macro-clustering
  • With limited "overhead" to achieve high efficiency, scalability, quality of results, and power of evolution/change detection

CluStream: A Framework for Clustering Evolving Data Streams
• Design goal
  • High quality for clustering evolving data streams with greater functionality
  • While keeping the stream mining requirements in mind
    • One pass over the original stream data
    • Limited space usage and high efficiency
• CluStream: a framework for clustering evolving data streams
  • Divide the clustering process into online and offline components
    • Online component: periodically stores summary statistics about the stream data
    • Offline component: answers various user questions based on the stored summary statistics

The CluStream Framework
• Micro-cluster
  • Statistical information about data locality
  • Temporal extension of the cluster-feature vector
    • Multi-dimensional points with time stamps
    • Each point contains d dimensions, i.e., X = (x^1, …, x^d)
    • A micro-cluster for n points is defined as a (2·d + 3)-tuple
• Pyramidal time frame
  • Decide at what moments the snapshots of the statistical information are stored away on disk
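
A micro-cluster can be sketched as a small additive summary. The layout below follows the usual CluStream description of the (2·d + 3)-tuple: per-dimension sums of squares and sums of the points, the sum of squares and sum of the time stamps, and the point count. The class and method names are assumptions for this example.

```python
class MicroCluster:
    """Additive summary of n d-dimensional points with time stamps."""

    def __init__(self, d):
        self.cf2x = [0.0] * d   # per-dimension sum of squared values
        self.cf1x = [0.0] * d   # per-dimension sum of values
        self.cf2t = 0.0         # sum of squared time stamps
        self.cf1t = 0.0         # sum of time stamps
        self.n = 0              # number of points absorbed

    def insert(self, point, timestamp):
        for i, x in enumerate(point):
            self.cf1x[i] += x
            self.cf2x[i] += x * x
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1

    def merge(self, other):
        """Absorb another micro-cluster; every field simply adds up."""
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2t += other.cf2t
        self.cf1t += other.cf1t
        self.n += other.n

    def centroid(self):
        return [v / self.n for v in self.cf1x]

    def as_tuple(self):
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)

mc = MicroCluster(d=2)
mc.insert((1.0, 2.0), timestamp=1)
mc.insert((3.0, 4.0), timestamp=2)
```

Because every field is a plain sum, two micro-clusters can be merged, or an older snapshot subtracted from a newer one, without revisiting the raw points.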

CluStream: Pyramidal Time Frame
• Pyramidal time frame
  • Snapshots of a set of micro-clusters are stored following the pyramidal pattern
    • They are stored at differing levels of granularity depending on recency
  • Snapshots are classified into different orders varying from 1 to log(T)
    • The i-th order snapshots occur at intervals of α^i, where α ≥ 1
    • Only the last (α + 1) snapshots are stored
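
The retention pattern can be sketched as below. This is an illustration under stated assumptions, not the CluStream implementation: the class name `PyramidalStore` is hypothetical, orders start at 0 (one snapshot per clock tick), α is restricted to integers ≥ 2 so that the loop terminates, and only the snapshot times are recorded rather than the micro-clusters themselves.

```python
from collections import deque

class PyramidalStore:
    """Remember, per order i, the last (alpha + 1) snapshot times t
    for which alpha**i divides t."""

    def __init__(self, alpha=2):
        if alpha < 2:
            raise ValueError("this sketch assumes an integer alpha >= 2")
        self.alpha = alpha
        self.orders = {}   # order i -> deque of retained snapshot times

    def record(self, t):
        """Register clock time t (an integer >= 1)."""
        i = 0
        while t % (self.alpha ** i) == 0:
            slot = self.orders.setdefault(i, deque(maxlen=self.alpha + 1))
            slot.append(t)             # the oldest entry falls out when full
            i += 1

    def retained(self):
        """All distinct snapshot times currently kept, oldest first."""
        return sorted({t for slot in self.orders.values() for t in slot})

store = PyramidalStore(alpha=2)
for t in range(1, 17):
    store.record(t)
kept = store.retained()
```

After 16 ticks with α = 2, only five distinct snapshot times remain (8, 12, 14, 15, 16): recent history is kept at fine granularity and older history at coarser granularity.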

CluStream: Clustering On-line Streams
• Online micro-cluster maintenance
  • Initial creation of q micro-clusters
    • q is usually significantly larger than the number of natural clusters
  • Online incremental update of micro-clusters
    • If a new point is within the max-boundary, insert it into the micro-cluster
    • Otherwise, create a new cluster
    • May delete an obsolete micro-cluster or merge the two closest ones
• Query-based macro-clustering
  • Based on a user-specified time horizon h and the number of macro-clusters K, compute macro-clusters using the k-means algorithm

References on Stream Data Mining (1)
• C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Data Streams. VLDB'03.
• C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On-Demand Classification of Evolving Data Streams. KDD'04.
• C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams. VLDB'04.
• S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, Sept. 2001.
• B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. PODS'02 (conference tutorial).
• Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
• P. Domingos and G. Hulten. Mining High-Speed Data Streams. KDD'00.
• A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. SIGMOD'02.
• J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continuous Data Streams. SIGMOD'01.
• C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In Kargupta et al. (eds.), Next Generation Data Mining, 2004.

References on Stream Data Mining (2)
• S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams. FOCS'00.
• G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. KDD'01.
• S. Madden, M. Shah, J. Hellerstein, and V. Raman. Continuously Adaptive Continuous Queries over Streams. SIGMOD'02.
• G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
• A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. ICDT'05.
• S. Muthukrishnan. Data Streams: Algorithms and Applications. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
• R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge Univ. Press, 1995.
• S. Viglas and J. Naughton. Rate-Based Query Optimization for Streaming Information Sources. SIGMOD'02.
• Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. VLDB'02.
• H. Wang, W. Fan, P. S. Yu, and J. Han. Mining Concept-Drifting Data Streams Using Ensemble Classifiers. KDD'03.