A Framework for Clustering Evolving Data Streams Charu

Outline l Background of Clustering l Motivation for Clustering over Streaming Data. l Overall

Background of Clustering l Definition of Clustering ¡ For a given set of data

Background of Clustering l Some of the most popular clustering algorithms: ¡ K- Means,

Motivation l Challenge in Streaming Environment: ¡ Clustering is an expensive process. ¡ Resource

Motivation l Requirements of clustering for stream processing: ¡ Statistical summary information storage. ¡

Overall Solution of the Paper l Divide the clustering process to two phases Online

Micro-Clusters l What is a Micro-Cluster A Micro-Cluster is a set of individual data

Micro-Clusters l What to Store in a Micro-Cluster = Key idea: Additivity Property

Pyramidal Time Frame l The micro-clusters are stored at snapshots. … … Snapshot l

Pyramidal Time Frame Snapshots are classified into different orders which can vary from 1

Why Pyramidal Pattern? l For any user-specified time window of h, at least one

Micro Cluster Creation l It is assumed that a total of q microclusters are

Online Micro Cluster Maintenance How to deal with a new coming point? 1. Join

Macro-Cluster Creation l Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation Current Time T, the window size is h. That means the user

Example C_ID: [C 1, C 2, C 3] Time: T C_ID: [C 1] Time:

Macro-Cluster Creation l Run K-means on Micro-Clusters

How do you feel about this paper? l My feeling: Quite Fuzzy Results: Approximation

Counter Example C_ID: [C 1, C 2, C 3] Time: T C_ID: [C 2]

Advertisement l 1. 2. 3. 4. 5. Di and Charu’s project deals with: Deterministic

Slides: 21

Download presentation

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad

Outline l Background of Clustering l Motivation for Clustering over Streaming Data. l Overall Solution l Micro Clusters l Pyramid Time Frame l Macro Cluster l Cluster Maintenance

Background of Clustering l Definition of Clustering ¡ For a given set of data points, partitioning them into one or more groups of similar objects. ¡ “Similarity” is often defined with the use of some distance measure. l Difference clustering. between “group by” queries and

Background of Clustering l Some of the most popular clustering algorithms: ¡ K- Means, BIRCH, CURE, Density Based Clustering. l Clustering has many applications in data bases, information visualization, data mining. l What are Oultiers?

Motivation l Challenge in Streaming Environment: ¡ Clustering is an expensive process. ¡ Resource constraints. ¡ Infinite streams. l Can simply extending one pass algorithms for static databases to stream processing suffice?

Motivation l Requirements of clustering for stream processing: ¡ Statistical summary information storage. ¡ Efficient update process. ¡ Ability to cluster for a specific time horizon,

Overall Solution of the Paper l Divide the clustering process to two phases Online Component: periodically stores detailed summary statistics Offline Component uses only the summary statistics to do clustering

Micro-Clusters l What is a Micro-Cluster A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering. View of Micro-Cluster View of Macro-Cluster

Micro-Clusters l What to Store in a Micro-Cluster = Key idea: Additivity Property

Pyramidal Time Frame l The micro-clusters are stored at snapshots. … … Snapshot l When should we make the snapshot? l The snapshots follow a pyramidal pattern

Pyramidal Time Frame Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32. l For a data stream the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) log α(T). (α + 1 for each order) l

Why Pyramidal Pattern? l For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time. Please Note: Only Approximate Answers!!!

Micro Cluster Creation l It is assumed that a total of q microclusters are maintained at any moment by the algorithm. l This is done using an offline process (kmeans) at the very beginning of the data stream computation process.

Online Micro Cluster Maintenance How to deal with a new coming point? 1. Join one of the old cluster 2. Create a new cluster by its own l How to deal with the old clusters 1. Delete them (based on relevance stamp) 2. Merge them (merge the closest two) l A merged cluster will have all the IDs its components have

Macro-Cluster Creation l Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T). Approach: 1. 2. 3. 1 st step: Find the snapshot for T, get the micro-cluster set S(T). 2 nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h). Use S(T)-S(T-h) Specifically, we have a merged cluster with Id list (C 1, C 2, C 3) in S(T) and a cluster with Id C 1 in S(T-h). Then the we use CFT(C 1, C 2, C 3)-CFT(C 1)=CFT(C 2, C 3), because C 1 are formed before T-h, thus should not contribute to the micro-cluster formed in (T-h, T)

Example C_ID: [C 1, C 2, C 3] Time: T C_ID: [C 1] Time: T-h C_ID: [C 2, C 3] Result: T-h

Macro-Cluster Creation l Run K-means on Micro-Clusters

How do you feel about this paper? l My feeling: Quite Fuzzy Results: Approximation is every where. Nothing New: Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.

Counter Example C_ID: [C 1, C 2, C 3] Time: T C_ID: [C 2] Time: T-h C_ID: [C 1, C 3] Result

Advertisement l 1. 2. 3. 4. 5. Di and Charu’s project deals with: Deterministic Clusters with Arbitrary Shapes Real Expirations Disk Version Outlier Detection by Free