A Framework for Clustering Evolving Data Streams
A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu
Presented by: Di Yang, Charudatta Wad
Outline
- Background of Clustering
- Motivation for Clustering over Streaming Data
- Overall Solution
- Micro-Clusters
- Pyramidal Time Frame
- Macro-Clusters
- Cluster Maintenance
Background of Clustering
- Definition of clustering
  - For a given set of data points, partition them into one or more groups of similar objects.
  - “Similarity” is usually defined using some distance measure.
- Difference between “group by” queries and clustering.
Background of Clustering
- Some of the most popular clustering algorithms:
  - K-Means, BIRCH, CURE, density-based clustering.
- Clustering has many applications in databases, information visualization, and data mining.
- What are outliers?
Motivation
- Challenges in a streaming environment:
  - Clustering is an expensive process.
  - Resource constraints.
  - Infinite streams.
- Does simply extending one-pass algorithms for static databases to stream processing suffice?
Motivation
- Requirements of clustering for stream processing:
  - Statistical summary information storage.
  - Efficient update process.
  - Ability to cluster for a specific time horizon.
Overall Solution of the Paper
- Divide the clustering process into two phases:
  - Online component: periodically stores detailed summary statistics.
  - Offline component: uses only the summary statistics to do the clustering.
Micro-Clusters
- What is a micro-cluster?
  - A micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in the later offline macro-clustering.
[Figure: view of micro-clusters; view of macro-clusters]
Micro-Clusters
- What to store in a micro-cluster
[Equation: contents of a micro-cluster]
- Key idea: Additivity Property
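
The formula on this slide did not survive extraction. In the original paper, a micro-cluster over d-dimensional points with timestamps is summarized by the per-dimension sum and sum of squares of the data values, the sum and sum of squares of the timestamps, and the point count. The following is a minimal sketch of that summary and of the additivity property; the class name `MicroCluster` and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroCluster:
    """Hypothetical container for the summary statistics of one micro-cluster."""
    cf1x: List[float]   # per-dimension sum of the data values
    cf2x: List[float]   # per-dimension sum of the squared data values
    cf1t: float         # sum of the timestamps
    cf2t: float         # sum of the squared timestamps
    n: int              # number of points summarized

    @classmethod
    def from_point(cls, x, t):
        # A single point is a micro-cluster of size one.
        return cls(list(x), [v * v for v in x], float(t), float(t * t), 1)

    def __add__(self, other):
        # Additivity: the summary of a union is the component-wise sum.
        return MicroCluster(
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            self.cf1t + other.cf1t,
            self.cf2t + other.cf2t,
            self.n + other.n,
        )

    def centroid(self):
        return [s / self.n for s in self.cf1x]


a = MicroCluster.from_point([1.0, 2.0], t=1)
b = MicroCluster.from_point([3.0, 4.0], t=2)
merged = a + b
print(merged.n, merged.centroid())
```

Because every field is a plain sum, absorbing a point, merging two clusters, and (later) subtracting an older snapshot all reduce to component-wise arithmetic.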
Pyramidal Time Frame
- The micro-clusters are stored at snapshots.
[Figure: snapshots]
- When should we take a snapshot?
- The snapshots follow a pyramidal pattern.
Pyramidal Time Frame
- Snapshots are classified into different orders, which can vary from 0 to log_α(T).
- For example, if T = 55 and α = 2, we have order 0 with interval 2^0 = 1, order 1 with interval 2^1 = 2, order 2 with interval 2^2 = 4, order 3 with interval 2^3 = 8, order 4 with interval 2^4 = 16, and order 5 with interval 2^5 = 32.
- For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) · log_α(T), that is, α + 1 snapshots for each order.
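
The T = 55, α = 2 example can be worked through in code. This is a sketch under two assumptions taken from the slide and the original paper: a snapshot of order i is taken whenever the clock time is divisible by α^i, and only the most recent α + 1 snapshots of each order are kept. The function name `pyramidal_snapshots` is hypothetical.

```python
import math


def pyramidal_snapshots(T, alpha=2):
    """Return {order: snapshot times kept at clock time T}, newest first."""
    kept = {}
    order, step = 0, 1
    while step <= T:
        # Most recent clock time divisible by alpha**order.
        latest = (T // step) * step
        times = [latest - k * step for k in range(alpha + 1)]
        kept[order] = [t for t in times if t > 0]
        order, step = order + 1, step * alpha
    return kept


snaps = pyramidal_snapshots(55, alpha=2)
for order, times in snaps.items():
    print(f"order {order} (interval {2 ** order}): {times}")

# The same clock time can appear under several orders, so count distinct times.
distinct = {t for times in snaps.values() for t in times}
bound = (2 + 1) * math.log(55, 2)
print(len(distinct), "distinct snapshots; bound is about", round(bound, 1))
```

Recent history is covered densely (orders 0 and 1) while older history is covered at exponentially coarser intervals, which is what keeps the storage logarithmic in T.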
Why Pyramidal Pattern?
- For any user-specified time window h, at least one stored snapshot can be found within 2h units of the current time.
- Please note: only approximate answers!
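
The 2h guarantee can be checked on a small case. The sketch below assumes the same retention rule as the slides (the last α + 1 snapshots of each order, with order i taken at clock times divisible by α^i) and picks the latest stored snapshot at or before T - h. Both function names are hypothetical.

```python
def stored_times(T, alpha=2):
    """Distinct snapshot times kept at clock time T (last alpha + 1 per order)."""
    kept = set()
    step = 1
    while step <= T:
        latest = (T // step) * step
        for k in range(alpha + 1):
            if latest - k * step > 0:
                kept.add(latest - k * step)
        step *= alpha
    return sorted(kept)


def snapshot_for_horizon(T, h, alpha=2):
    """Latest stored snapshot at or before T - h, or None if none is kept."""
    candidates = [t for t in stored_times(T, alpha) if t <= T - h]
    return max(candidates) if candidates else None


T = 55
for h in (4, 8, 20):
    s = snapshot_for_horizon(T, h)
    print(f"h={h}: snapshot at {s}, effective window {T - s}")
```

The effective window T - s is generally larger than the requested h, which is why the answers are approximate; in this example it stays within 2h for each horizon shown.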
Micro-Cluster Creation
- It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm.
- The initial micro-clusters are created using an offline process (k-means) at the very beginning of the data stream computation.
Online Micro-Cluster Maintenance
- How to deal with a newly arriving point?
  1. Join one of the existing clusters.
  2. Create a new cluster of its own.
- How to deal with the old clusters?
  1. Delete them (based on the relevance stamp).
  2. Merge them (merge the closest two).
- A merged cluster keeps all the IDs of its components.
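
The maintenance steps above can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the relevance stamp is approximated by the mean timestamp of a cluster, a singleton cluster gets a fixed fallback radius, and the parameters `factor`, `singleton_radius`, and `stale_before` are hypothetical names. It assumes q is at least 2.

```python
import math


def new_cluster(cid, x, t):
    # Summary statistics for a single point x arriving at time t.
    return {"ids": [cid], "n": 1, "ls": list(x),
            "ss": [v * v for v in x], "ts": float(t)}


def centroid(mc):
    return [s / mc["n"] for s in mc["ls"]]


def rms_radius(mc):
    # Root-mean-square deviation of the points from the centroid.
    n = mc["n"]
    var = sum(ss / n - (ls / n) ** 2 for ls, ss in zip(mc["ls"], mc["ss"]))
    return math.sqrt(max(var, 0.0))


def merge(a, b):
    # Additivity: sums add, and the merged cluster keeps every component ID.
    return {"ids": a["ids"] + b["ids"], "n": a["n"] + b["n"],
            "ls": [p + r for p, r in zip(a["ls"], b["ls"])],
            "ss": [p + r for p, r in zip(a["ss"], b["ss"])],
            "ts": a["ts"] + b["ts"]}


def process_point(clusters, x, t, next_id, q, factor=2.0,
                  singleton_radius=1.0, stale_before=0.0):
    """Update `clusters` in place for point x at time t; return the next free ID."""
    if clusters:
        nearest = min(clusters, key=lambda mc: math.dist(centroid(mc), x))
        radius = rms_radius(nearest) if nearest["n"] > 1 else singleton_radius
        if math.dist(centroid(nearest), x) <= factor * radius:
            # Case 1: the point joins an existing cluster (its ID list is unchanged).
            joined = merge(nearest, new_cluster(None, x, t))
            joined["ids"] = nearest["ids"]
            nearest.update(joined)
            return next_id
    # Case 2: the point starts its own cluster; make room first if needed.
    if len(clusters) >= q:
        oldest = min(clusters, key=lambda mc: mc["ts"] / mc["n"])
        if oldest["ts"] / oldest["n"] < stale_before:
            clusters.remove(oldest)                      # delete a stale cluster
        else:
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: math.dist(
                centroid(clusters[p[0]]), centroid(clusters[p[1]])))
            merged = merge(clusters[i], clusters[j])     # merge the closest two
            del clusters[j]                              # j > i, so index i stays valid
            clusters[i] = merged
    clusters.append(new_cluster(next_id, x, t))
    return next_id + 1


clusters, nid = [], 0
stream = [((0.0, 0.0), 1), ((0.5, 0.0), 2), ((10.0, 10.0), 3), ((20.0, 0.0), 4)]
for x, t in stream:
    nid = process_point(clusters, x, t, nid, q=2)
print([(mc["ids"], mc["n"]) for mc in clusters])
```

In this run the second point joins cluster 0, the third opens cluster 1, and the fourth forces clusters 0 and 1 to merge (ID list `[0, 1]`) so that a new cluster 2 can be created within the budget of q = 2.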
Macro-Cluster Creation
- Based on the Additivity Property of the cluster feature vector
Macro-Cluster Creation
- The current time is T and the window size is h, meaning the user wants to find the clusters formed in (T-h, T).
- Approach:
  1. Find the snapshot for T and get the micro-cluster set S(T).
  2. Find the snapshot for T-h and get the micro-cluster set S(T-h).
  3. Use S(T) - S(T-h).
- Specifically, suppose we have a merged cluster with ID list (C1, C2, C3) in S(T) and a cluster with ID C1 in S(T-h). Then we use CFT(C1, C2, C3) - CFT(C1) = CFT(C2, C3), because C1 was formed before T-h and thus should not contribute to the micro-clusters formed in (T-h, T).
Example
- C_ID: [C1, C2, C3], Time: T
- C_ID: [C1], Time: T-h
- Result for (T-h, T): C_ID: [C2, C3]
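
The subtraction S(T) - S(T-h) relies only on ID lists and additivity, so it can be sketched compactly. In this illustration each snapshot maps a set of cluster IDs to a pair (point count, linear sum); the function name `subtract_snapshots` and all numeric values are made up for the example.

```python
def subtract_snapshots(current, past):
    """Micro-clusters for the window (T-h, T) from snapshots S(T) and S(T-h).

    Each snapshot maps a frozenset of cluster IDs to (n, linear_sum).
    """
    result = {}
    for ids, (n, ls) in current.items():
        for old_ids, (old_n, old_ls) in past.items():
            if old_ids <= ids:
                # The old cluster is a component of this one: remove its share.
                n -= old_n
                ls = [a - b for a, b in zip(ls, old_ls)]
                ids = ids - old_ids
        if n > 0:
            result[ids] = (n, ls)
    return result


s_T = {frozenset({"C1", "C2", "C3"}): (30, [60.0])}      # snapshot at time T
s_T_minus_h = {frozenset({"C1"}): (10, [15.0])}          # snapshot at time T-h

window = subtract_snapshots(s_T, s_T_minus_h)
print(window)
```

The result holds a single entry for IDs {C2, C3} with 20 points and linear sum 45.0, matching CFT(C1, C2, C3) - CFT(C1) = CFT(C2, C3).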
Macro-Cluster Creation
- Run k-means on the micro-clusters
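
One way to run k-means on micro-clusters is to treat each micro-cluster centroid as a pseudo-point weighted by its point count. The sketch below assumes that weighting and uses fixed seeds so the result is reproducible; the function name `weighted_kmeans` and the sample centroids and counts are hypothetical.

```python
import math


def weighted_kmeans(points, weights, seeds, iters=10):
    """Lloyd-style k-means where each pseudo-point counts `weight` times."""
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        sums = [[0.0] * len(c) for c in centers]
        totals = [0.0] * len(centers)
        for p, w in zip(points, weights):
            # Assign the pseudo-point to its nearest current center.
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            totals[k] += w
            for d, v in enumerate(p):
                sums[k][d] += w * v
        # Recompute each center as the weighted mean; keep empty centers as is.
        centers = [[s / tot for s in row] if tot else c
                   for row, tot, c in zip(sums, totals, centers)]
    return centers


# Micro-cluster centroids and the number of points each one summarizes.
centroids = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
counts = [10, 30, 5, 15]

macro = weighted_kmeans(centroids, counts, seeds=[centroids[0], centroids[2]])
print(macro)
```

The weighting pulls each macro-cluster center toward the heavier micro-clusters: here the two centers settle at (1.5, 0.0) and (11.5, 10.0) rather than at the unweighted midpoints.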
How do you feel about this paper?
- My feeling:
  - Quite fuzzy results: approximation is everywhere.
  - Nothing new: micro-clusters, k-means, cluster feature vectors, and the pyramidal time frame are all old ideas.
Counter Example
- C_ID: [C1, C2, C3], Time: T
- C_ID: [C2], Time: T-h
- Result: C_ID: [C1, C3]
Advertisement
- Di and Charu’s project deals with:
  1. Deterministic
  2. Clusters with Arbitrary Shapes
  3. Real Expirations
  4. Disk Version
  5. Outlier Detection for Free