A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03)
2021/2/23
Presenter: 吳建良

Outline
• Cluster analysis: A general overview
• Developed methodology
• Micro-cluster analysis and maintenance
• Macro-cluster analysis
• Evolution analysis
• Empirical results

Cluster analysis: A general overview
• What is cluster analysis?
  - Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low
• New requirements in stream clustering
  - Generate high-quality clusters in one scan
  - High-quality, efficient incremental clustering
  - Analysis should handle multi-dimensional space
  - Provide flexibility to compute clusters over a user-defined time period

Developed methodology: Outline
• Methodology: divide the clustering process into online and offline components
  - Online: periodically stores summary statistics about the stream data
    - Micro-clustering: better quality than k-means
    - Online processing and maintenance
    - Pyramidal time window: registers dynamic changes
  - Offline: answers various user queries based on the stored summary statistics

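Clustering Feature Vector
• Originated from BIRCH
• Clustering Feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: linear sum of the points, $LS = \sum_{i=1}^{N} X_i$
  - SS: square sum of the points (component-wise), $SS = \sum_{i=1}^{N} X_i^2$
• Example: the points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8) give CF = (5, (16, 30), (54, 190))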
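The CF example on this slide can be checked with a few lines of Python. This is a minimal sketch, not BIRCH itself: the names `clustering_feature` and `add_cf` are hypothetical, and points are assumed to be plain tuples of numbers.

```python
from typing import Sequence, Tuple

# (N, LS, SS): count, per-dimension linear sum, per-dimension square sum
CF = Tuple[int, Tuple[float, ...], Tuple[float, ...]]


def clustering_feature(points: Sequence[Sequence[float]]) -> CF:
    """Return (N, LS, SS) for a non-empty list of d-dimensional points."""
    d = len(points[0])
    ls = tuple(sum(p[j] for p in points) for j in range(d))
    ss = tuple(sum(p[j] ** 2 for p in points) for j in range(d))
    return len(points), ls, ss


def add_cf(a: CF, b: CF) -> CF:
    """CF vectors of two disjoint point sets add component-wise."""
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))
```

Because the three components are plain sums, the CF of a union of disjoint point sets is the component-wise sum of their CFs, which is what makes incremental maintenance cheap.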

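Micro-Clusters: Design Methodology
• Data streams
  - Multi-dimensional points $X_1, \ldots, X_k, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$
  - Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$
• A micro-cluster for n points is defined as a (2*d + 3) tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$:
  - $\overline{CF2^x}$: the sum of the squares of the data values (d entries)
  - $\overline{CF1^x}$: the sum of the data values (d entries)
  - $CF2^t$: the sum of the squares of the time stamps
  - $CF1^t$: the sum of the time stamps
  - n: the number of data points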
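As an illustration of the (2*d + 3) tuple, the sketch below keeps the two d-dimensional data sums, the two time-stamp sums, and the point count. The class name `MicroCluster` and its attribute names are assumptions made for this example, not names taken from the paper's implementation.

```python
from typing import List, Sequence


class MicroCluster:
    """Hypothetical container for the (2*d + 3) micro-cluster tuple."""

    def __init__(self, d: int) -> None:
        self.cf2x: List[float] = [0.0] * d  # sum of squares of data values, per dimension
        self.cf1x: List[float] = [0.0] * d  # sum of data values, per dimension
        self.cf2t: float = 0.0              # sum of squares of time stamps
        self.cf1t: float = 0.0              # sum of time stamps
        self.n: int = 0                     # number of data points

    def insert(self, x: Sequence[float], t: float) -> None:
        """Fold one d-dimensional point with time stamp t into the sums."""
        for j, v in enumerate(x):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def as_tuple(self) -> tuple:
        """Flatten to d + d + 1 + 1 + 1 = 2*d + 3 entries."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)


mc = MicroCluster(d=2)
for t, x in enumerate([(3, 4), (2, 6), (4, 5)], start=1):
    mc.insert(x, t)
print(len(mc.as_tuple()))  # 7, i.e. 2*2 + 3
```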

Pyramidal Time Frame
• Snapshots: the micro-clusters are stored at particular moments in the stream
• Snapshots are classified into different frame numbers, which can vary from 0 to log(T), where T is the clock time elapsed since the beginning of the stream
• The frame number of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained

Maintain Snapshot Frame Table
• The rules for insertion of a snapshot t into the frame table:
1. If (t mod α^i) = 0 but (t mod α^(i+1)) ≠ 0, t is inserted into frame number i
2. Each slot has a max_capacity. If the slot has already reached its max_capacity, the oldest snapshot is removed and the new snapshot is inserted
• Example: α = 2, max_capacity = 3
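The two insertion rules can be sketched as follows for the slide's example (α = 2, max_capacity = 3). The names `frame_number` and `FrameTable` are hypothetical, and the table stores only the snapshot times rather than the micro-cluster sets a real snapshot would hold.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


def frame_number(t: int, alpha: int) -> int:
    """Largest i such that alpha**i divides t (rule 1)."""
    if t < 1 or alpha < 2:
        raise ValueError("expects t >= 1 and alpha >= 2")
    i = 0
    while t % alpha ** (i + 1) == 0:
        i += 1
    return i


class FrameTable:
    def __init__(self, alpha: int = 2, max_capacity: int = 3) -> None:
        self.alpha = alpha
        # A deque with maxlen drops its oldest entry when full (rule 2).
        self.frames: Dict[int, Deque[int]] = defaultdict(
            lambda: deque(maxlen=max_capacity)
        )

    def insert(self, t: int) -> None:
        self.frames[frame_number(t, self.alpha)].append(t)

    def snapshot_times(self) -> Dict[int, List[int]]:
        return {i: list(q) for i, q in sorted(self.frames.items())}


table = FrameTable(alpha=2, max_capacity=3)
for t in range(1, 11):
    table.insert(t)
print(table.snapshot_times())
# {0: [5, 7, 9], 1: [2, 6, 10], 2: [4], 3: [8]}
```

After ten clock ticks the table holds recent times at fine granularity (frame 0) and older times at coarser granularity (frames 2 and 3).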

Micro-clusters Maintenance
• The micro-clustering stage is the online statistical data collection; it does not depend on user input
• Initial creation of q micro-clusters M1 … Mq
  - Use the k-means clustering algorithm
  - q is usually significantly larger than the number of natural clusters
  - q is determined by the amount of available memory
• Each micro-cluster is associated with a unique id when it is created

Incremental Update of Micro-clusters
• When a new data point $X_{i_k}$ arrives, it is either added to an existing micro-cluster or a new micro-cluster is created
  - If $X_{i_k}$ falls within the maximum boundary of its closest micro-cluster Mp, it is added to Mp
    - Maximum boundary: the RMS deviation of the data points in Mp from its centroid
    - RMS deviation: $\sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert X_i - \bar{X} \rVert^2}$, where $\bar{X}$ is the centroid of the n points in Mp
  - Otherwise, a new micro-cluster is created for $X_{i_k}$
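The absorb-or-create decision can be sketched as below. This is a simplified illustration under stated assumptions: `MicroCluster` and `process_point` are hypothetical names, time stamps are left out, and the maximum boundary is taken to be exactly the RMS deviation as the slide states. A single-point cluster has zero RMS deviation, so in this sketch it only absorbs identical points.

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Hypothetical micro-cluster keeping only the data sums (no time stamps)."""

    def __init__(self, x: Sequence[float], cluster_id: int) -> None:
        self.id = cluster_id
        self.n = 1
        self.ls = [float(v) for v in x]       # per-dimension linear sum
        self.ss = [float(v) ** 2 for v in x]  # per-dimension square sum

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def rms_deviation(self) -> float:
        # (1/n) * sum ||X_i - centroid||^2 = sum_j (SS_j/n - (LS_j/n)^2)
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))  # clip tiny negative rounding errors

    def absorb(self, x: Sequence[float]) -> None:
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v


def process_point(x: Sequence[float], clusters: List[MicroCluster], next_id: int) -> int:
    """Add x to its closest micro-cluster if inside the boundary, else create one."""
    if clusters:
        closest = min(clusters, key=lambda m: math.dist(x, m.centroid()))
        if math.dist(x, closest.centroid()) <= closest.rms_deviation():
            closest.absorb(x)
            return closest.id
    clusters.append(MicroCluster(x, next_id))
    return next_id


seed = MicroCluster((0, 0), cluster_id=0)
for p in [(2, 0), (0, 2), (2, 2)]:
    seed.absorb(p)
clusters = [seed]

first_id = process_point((1.5, 1.5), clusters, next_id=1)   # inside the boundary
second_id = process_point((10, 10), clusters, next_id=1)    # far away
print(first_id, second_id, len(clusters))  # 0 1 2
```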

Incremental Update of Micro-clusters (Contd.)
• To make room for a new micro-cluster: delete an old cluster or merge the two closest clusters?
  - A micro-cluster is deleted whenever the average time stamp of its last m points is less than a given threshold
  - Otherwise, the two closest micro-clusters are merged by adding their corresponding cluster feature vectors
  - An idlist is created to record the ids of the two merged micro-clusters
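The merge step can be illustrated with a short sketch. The dictionary layout (`ids`, `n`, `ls`, `ss`) and the function `merge_closest` are assumptions for this example; closeness is measured here between centroids, and the temporal sums, omitted for brevity, would be added in the same way.

```python
import math
from itertools import combinations
from typing import Dict, List


def centroid(mc: Dict) -> List[float]:
    return [s / mc["n"] for s in mc["ls"]]


def merge_closest(clusters: List[Dict]) -> Dict:
    """Merge the two micro-clusters with the closest centroids, in place.

    Expects at least two micro-clusters in the list.
    """
    a, b = min(
        combinations(clusters, 2),
        key=lambda pair: math.dist(centroid(pair[0]), centroid(pair[1])),
    )
    merged = {
        "ids": a["ids"] | b["ids"],  # idlist: union of the original ids
        "n": a["n"] + b["n"],
        "ls": [x + y for x, y in zip(a["ls"], b["ls"])],
        "ss": [x + y for x, y in zip(a["ss"], b["ss"])],
    }
    clusters[:] = [m for m in clusters if m is not a and m is not b] + [merged]
    return merged


clusters = [
    {"ids": {1}, "n": 2, "ls": [2.0, 2.0], "ss": [2.0, 2.0]},    # centroid (1, 1)
    {"ids": {2}, "n": 1, "ls": [1.5, 1.0], "ss": [2.25, 1.0]},   # centroid (1.5, 1)
    {"ids": {3}, "n": 1, "ls": [9.0, 9.0], "ss": [81.0, 81.0]},  # centroid (9, 9)
]
merged = merge_closest(clusters)
print(merged["ids"], merged["n"], merged["ls"])  # {1, 2} 3 [3.5, 3.0]
```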

Macro-Cluster Creation
• Macro-clusters are created over a user-specified time horizon h
• Let
  - S(tc): the set of micro-clusters at time tc
  - S(tc−h): the set of micro-clusters at time tc−h
• The new set of micro-clusters N(tc−h) is created by subtracting S(tc−h) from S(tc)
• Subtractive property
  - Let C1 and C2 be two sets of points such that C1 ⊇ C2. Then CF(C1 − C2) = CF(C1) − CF(C2)
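A minimal sketch of the snapshot subtraction follows. It assumes each snapshot is a dictionary from micro-cluster id to an (N, LS, SS) triple and matches micro-clusters by a single id; the function name `subtract_snapshots` is hypothetical, and idlists of merged micro-clusters are not handled here.

```python
from typing import Dict, List, Tuple

CF = Tuple[int, List[float], List[float]]  # (N, LS, SS)


def subtract_snapshots(current: Dict[int, CF], older: Dict[int, CF]) -> Dict[int, CF]:
    """Statistics of the points that arrived between the two snapshots."""
    result: Dict[int, CF] = {}
    for cid, (n, ls, ss) in current.items():
        if cid in older:
            n0, ls0, ss0 = older[cid]
            n = n - n0
            ls = [a - b for a, b in zip(ls, ls0)]
            ss = [a - b for a, b in zip(ss, ss0)]
        if n > 0:  # drop micro-clusters that received no points in the horizon
            result[cid] = (n, ls, ss)
    return result


older = {1: (2, [5, 10], [13, 52])}
current = {1: (5, [16, 30], [54, 190]), 2: (1, [9, 9], [81, 81])}
print(subtract_snapshots(current, older))
# {1: (3, [11, 20], [41, 138]), 2: (1, [9, 9], [81, 81])}
```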

Macro-Cluster Creation (Contd.)
• Each micro-cluster in N(tc−h) is treated as a pseudo-point
• Each pseudo-point has a weight proportional to the number of points inside it
• A k-means clustering approach is applied to this set of pseudo-points in order to create a higher level of macro-clusters
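The weighted k-means step over pseudo-points could look like the sketch below. It is an illustration only: `weighted_kmeans` is a hypothetical name, the initial centers are supplied by the caller (how seeds are chosen is not covered on this slide), and a fixed iteration count stands in for a convergence test.

```python
import math
from typing import List, Sequence, Tuple


def weighted_kmeans(
    points: Sequence[Sequence[float]],
    weights: Sequence[float],
    centers: Sequence[Sequence[float]],
    iterations: int = 10,
) -> Tuple[List[List[float]], List[int]]:
    """k-means where each pseudo-point counts with its weight in the mean."""
    centers = [list(c) for c in centers]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every pseudo-point to its closest center.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Move every center to the weighted mean of its members.
        for k in range(len(centers)):
            members = [i for i, a in enumerate(assignment) if a == k]
            total = sum(weights[i] for i in members)
            if total > 0:
                centers[k] = [
                    sum(weights[i] * points[i][j] for i in members) / total
                    for j in range(len(centers[k]))
                ]
    return centers, assignment


# Pseudo-points are micro-cluster centroids; weights are their point counts.
pseudo_points = [(0, 0), (2, 0), (10, 10), (10, 12)]
counts = [1, 3, 2, 2]
centers, assignment = weighted_kmeans(pseudo_points, counts, [(0, 0), (10, 10)])
print(centers, assignment)  # [[1.5, 0.0], [10.0, 11.0]] [0, 0, 1, 1]
```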

Evolution Analysis of Micro-Clusters
• In many cases, it is desirable to find how the micro-clusters have changed over time
• Given a user-specified time horizon h and two clock times t1 and t2 (where t1 < t2)
  - Analyze how the data arriving in (t2−h, t2) has evolved relative to the data arriving in (t1−h, t1)

Evolution Analysis of Micro-Clusters (Contd.)
• The following questions are addressed:
  - Are there new clusters in the data at time t2 which were not present at time t1?
    - Find micro-clusters in N(t2−h) which are not present in N(t1−h)
  - Have some of the original clusters been lost?
    - Find micro-clusters in N(t1−h) which are not present in N(t2−h)
  - Have some of the original clusters at time t1 shifted in position and nature?
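One way to sketch the comparison is with set operations on idlists. The assumptions here are loud ones: each of the two micro-cluster sets is represented as a dictionary from an arbitrary label to that micro-cluster's set of ids, "present" is interpreted as sharing at least one id, and the function name `compare_horizons` is hypothetical.

```python
from typing import Dict, List, Set


def compare_horizons(
    n_t1: Dict[str, Set[int]], n_t2: Dict[str, Set[int]]
) -> Dict[str, List[str]]:
    """Split micro-clusters into new, lost, and retained by comparing ids."""
    ids_t1 = set().union(*n_t1.values())
    ids_t2 = set().union(*n_t2.values())
    return {
        # In the later set, sharing no id with the earlier set.
        "new": [k for k, ids in n_t2.items() if not ids & ids_t1],
        # In the earlier set, sharing no id with the later set.
        "lost": [k for k, ids in n_t1.items() if not ids & ids_t2],
        # In the later set, sharing some id with the earlier set;
        # candidates for a shift in position and nature.
        "retained": [k for k, ids in n_t2.items() if ids & ids_t1],
    }


earlier = {"a": {1}, "b": {2}, "c": {3}}
later = {"a+b": {1, 2}, "d": {4}}
print(compare_horizons(earlier, later))
# {'new': ['d'], 'lost': ['c'], 'retained': ['a+b']}
```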

Empirical Result
• Data sets
  - Real data sets: Network Intrusion and KDD Cup 98 (Charitable Donation)
  - Synthetic data sets:
    - Gaussian distribution
    - Base size: 100k ~ 1000k points
    - Number of clusters: 4 ~ 64
    - Dimensionality: 10 ~ 100

Cluster Quality (Network Intrusion)
[Figures: Horizon H=1, Stream_speed=2000; Horizon H=256, Stream_speed=200]

Cluster Quality (Charitable Donation)
[Figures: Horizon H=4, Stream_speed=2000; Horizon H=16, Stream_speed=200]

Scalability
[Figure: Stream_speed=2000]

Sum of Square Distance (SSQ)
• Assume there are a total of N points in the past horizon H at current time Tc:
  $SSQ(T_c, H) = \sum_{i=1}^{N} d^2(p_i, C_{p_i})$
  where $C_{p_i}$ is the centroid of the macro-cluster closest to point $p_i$
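The SSQ measure sums, over all points in the horizon, the squared distance from each point to the centroid of its closest macro-cluster. A minimal sketch, assuming Euclidean distance and a hypothetical function name `ssq`:

```python
import math
from typing import Sequence


def ssq(points: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


horizon_points = [(0, 0), (2, 0), (10, 10)]
macro_centroids = [(1, 0), (10, 10)]
print(ssq(horizon_points, macro_centroids))  # 2.0
```

A lower SSQ means the points in the horizon lie closer to their macro-cluster centroids.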

K-means clustering algorithm
[Figure: k-means illustrated with K=2 on points in a 10 × 10 plane]
1. Arbitrarily choose K points as the initial cluster centers
2. Assign each point to the closest center
3. Update the cluster means
4. Reassign the points and update the cluster means again