Stream Clustering CSE 902 Big Data Stream analysis


Big Data

Stream analysis
Stream: a continuous flow of data
Challenges
◦ Volume: it is not possible to store all the data
◦ One-time access: the data cannot be processed using multiple passes
◦ Real-time analysis: certain applications need real-time analysis of the data
◦ Temporal locality: data evolves over time, so the model should be adaptive

Stream Clustering
[Figure: topic clusters and article listings]

Stream Clustering
• Online phase
  • Summarize the data into memory-efficient data structures
• Offline phase
  • Use a clustering algorithm to find the data partition
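The two phases above can be sketched as follows. This is a minimal illustration, not any specific published algorithm: the function names, the fixed `radius` rule for opening a new prototype, and the use of weighted Lloyd's k-means in the offline phase are all assumptions made for the example.

```python
import math
import random

def online_summarize(stream, radius):
    """Online phase (assumed rule): fold each point into the nearest
    prototype if it lies within `radius`, otherwise open a new prototype."""
    prototypes = []  # each entry: [centroid, count]
    for x in stream:
        best, best_d = None, float("inf")
        for p in prototypes:
            d = math.dist(p[0], x)
            if d < best_d:
                best, best_d = p, d
        if best is not None and best_d <= radius:
            # Incremental mean update; the raw point is discarded afterwards.
            best[1] += 1
            best[0] = [c + (xi - c) / best[1] for c, xi in zip(best[0], x)]
        else:
            prototypes.append([list(x), 1])
    return prototypes

def offline_kmeans(prototypes, k, iters=20, seed=0):
    """Offline phase: weighted k-means (Lloyd's algorithm) on the summaries."""
    rng = random.Random(seed)
    centers = [list(p[0]) for p in rng.sample(prototypes, k)]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0.0] * k
        for c, n in prototypes:
            j = min(range(k), key=lambda i: math.dist(centers[i], c))
            weights[j] += n
            sums[j] = [s + n * ci for s, ci in zip(sums[j], c)]
        # An empty cluster keeps its previous center.
        centers = [[s / weights[j] for s in sums[j]] if weights[j] else centers[j]
                   for j in range(k)]
    return centers
```

Only the prototypes stay in memory, so the offline step runs on a summary whose size does not grow with the length of the stream.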

Stream Clustering Algorithms
Data Structures       Examples
Prototypes            Stream, Stream LSearch
CF-Trees              Scalable k-means, Single pass k-means
Microcluster Trees    ClusTree, DenStream, HP-Stream
Grids                 D-Stream, ODAC
Coreset Tree          StreamKM++

Prototypes
Stream, LSearch

CF-Trees
Summarize the data in each CF-vector:
• Linear sum of data points
• Squared sum of data points
• Number of points
Examples: Scalable k-means, Single pass k-means
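A CF-vector can be sketched as below. The class and method names are hypothetical, and the squared sum is kept per dimension here (some formulations store a single scalar instead). The point of the example is that the three statistics are additive, so two summaries merge without revisiting the data.

```python
import math

class CFVector:
    """Cluster-feature summary: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum of the points
        self.ss = [0.0] * dim  # per-dimension squared sum of the points

    def add(self, x):
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [a + v * v for a, v in zip(self.ss, x)]

    def merge(self, other):
        # All three statistics are additive across disjoint sets of points.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        # Root-mean-square distance of the summarized points to the centroid.
        var = sum(s / self.n - (l / self.n) ** 2 for s, l in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))
```

For example, summarizing (1, 2) and (3, 4) separately and merging gives the centroid (2, 3) and a radius of sqrt(2), the same as summarizing both points in one vector.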

Microclusters
CF-Trees with a “time” element
CluStream
• Linear sum and squared sum of timestamps
• Delete old microclusters, or merge microclusters whose timestamps are close to each other
Sliding Window Clustering
• Timestamp of the most recent data point added to the vector
• Maintain only the most recent T microclusters
DenStream
• Microclusters are associated with weights based on recency
• Outliers are detected by creating separate microclusters
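A microcluster with the time element can be sketched as a CF-vector extended with timestamp sums. This is a simplified illustration: the names are hypothetical, and staleness is judged by the mean timestamp alone, which is an assumption rather than the exact rule used by CluStream.

```python
import math

class MicroCluster:
    """CF-vector plus linear and squared sums of timestamps."""

    def __init__(self, x, t):
        self.n = 1
        self.ls = list(x)
        self.ss = [v * v for v in x]
        self.lst = t       # linear sum of timestamps
        self.sst = t * t   # squared sum of timestamps

    def add(self, x, t):
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [a + v * v for a, v in zip(self.ss, x)]
        self.lst += t
        self.sst += t * t

    def centroid(self):
        return [a / self.n for a in self.ls]

    def mean_time(self):
        return self.lst / self.n

    def std_time(self):
        var = self.sst / self.n - self.mean_time() ** 2
        return math.sqrt(max(var, 0.0))

def drop_stale(microclusters, now, horizon):
    """Assumed rule: keep microclusters whose mean timestamp is within `horizon` of `now`."""
    return [m for m in microclusters if now - m.mean_time() <= horizon]
```

The timestamp sums give the mean and spread of arrival times without storing individual timestamps, which is what makes an age-based deletion test cheap.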

Grids
D-Stream
• Assign the data to grids
• Grids are weighted by the recency of the points added to them
• Each grid is associated with a label
DGClust
• Distributed clustering of sensor data
• Sensors maintain local copies of the grid and communicate grid updates to a central site
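Recency weighting of grid cells can be sketched with an exponentially decayed density, in the spirit of D-Stream. The class name, the `decay` parameter, and the single density threshold used as a stand-in for the labeling step are assumptions for illustration only.

```python
import math

class DecayedGrid:
    """Grid cells whose density decays exponentially with elapsed time."""

    def __init__(self, cell_size, decay=0.998):
        self.cell_size = cell_size
        self.decay = decay
        self.cells = {}  # cell index -> (density, time of last update)

    def cell_of(self, x):
        return tuple(math.floor(v / self.cell_size) for v in x)

    def insert(self, x, t):
        key = self.cell_of(x)
        density, last = self.cells.get(key, (0.0, t))
        # Decay the stored density to time t, then count the new point.
        self.cells[key] = (density * self.decay ** (t - last) + 1.0, t)

    def density(self, key, t):
        density, last = self.cells.get(key, (0.0, t))
        return density * self.decay ** (t - last)

    def dense_cells(self, t, threshold):
        """Assumed labeling rule: a cell is dense if its decayed density reaches `threshold`."""
        return sorted(k for k in self.cells if self.density(k, t) >= threshold)
```

Decay is applied lazily, only when a cell is touched or queried, so each arriving point costs a constant amount of work.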

StreamKM++ (Coresets)
StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012

Kernel-based Clustering

Kernel-based Stream Clustering
• Use non-linear distance measures to define similarity between data points in the stream
• Challenges
  • Quadratic running time complexity
  • Computationally expensive to compute centers using linear sums and squared sums (the CF-vector approach will not work)

Stream Kernel k-means (sKKM)
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012
• Kernel k-means
• Weighted kernel k-means
• History from only the preceding data chunk is retained
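For reference, plain (batch, unweighted) kernel k-means can be sketched as below. It is the building block only, not the streaming sKKM procedure of the cited paper. The function names and the choice of an RBF kernel are assumptions. The distance from a point to a cluster center is computed entirely from kernel values, which is why the cost is quadratic in the number of points.

```python
import math

def rbf_kernel_matrix(points, gamma=1.0):
    """Full n x n RBF kernel matrix (quadratic in time and memory)."""
    return [[math.exp(-gamma * math.dist(a, b) ** 2) for b in points] for a in points]

def kernel_kmeans(K, labels, k, iters=20):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K.

    Squared distance of point i to the center of cluster C in feature space:
        K[i][i] - 2/|C| * sum_{j in C} K[i][j] + 1/|C|^2 * sum_{j,l in C} K[j][l]
    """
    n = len(K)
    labels = list(labels)
    for _ in range(iters):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Third term of the distance: mean kernel value inside each cluster.
        within = [sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
                  for m in members]
        new_labels = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # skip empty clusters
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best, best_d = c, d
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels
```

Because the center never appears explicitly, there is no linear sum to maintain, which is the reason the CF-vector approach does not carry over to the kernel setting.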

Statistical Leverage Scores
Measure the influence of a point in the low-rank approximation
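One standard way to write this down, given here as a reference definition rather than as the slide's own formula: let K be the n x n kernel matrix, and let V_r hold its top r eigenvectors as orthonormal columns with eigenvalues Λ_r. The symbols K, V_r, Λ_r, r, and ℓ_i are notation assumed for this note.

```latex
K \approx V_r \Lambda_r V_r^{\top}, \qquad
\ell_i = \left\lVert (V_r)_{i,:} \right\rVert_2^2 = \sum_{j=1}^{r} (V_r)_{ij}^2, \qquad
\sum_{i=1}^{n} \ell_i = r .
```

A point with a large ℓ_i contributes heavily to the rank-r subspace, so dropping it would change the low-rank approximation the most.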


Approximate Stream kernel k-means
• Uses statistical leverage scores to determine which data points in the stream are potentially “important”
• Retains the important points and discards the rest
• Uses an approximate version of kernel k-means to obtain the clusters
  – Linear time complexity
• Bounded amount of memory


Importance Sampling
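A minimal sketch of score-driven sampling on a stream, assuming each arriving point already comes with an importance score such as a leverage score. The function name, the `scale` parameter, and the Bernoulli retention rule min(1, scale * score) are assumptions for illustration, not the exact sampling rule of the algorithm discussed in these slides.

```python
import random

def retain_important(scored_stream, scale, seed=0):
    """Keep each point with probability min(1, scale * score); discard the rest."""
    rng = random.Random(seed)
    kept = []
    for point, score in scored_stream:
        p = min(1.0, scale * score)
        # rng.random() lies in [0, 1), so p = 1 always keeps and p = 0 never keeps.
        if rng.random() < p:
            kept.append(point)
    return kept
```

Each point is examined once and then either stored or dropped, which matches the one-time access constraint of the stream setting.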

Clustering Kernel k-means “Approximate” Kernel k-means


Updating eigenvectors
• Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
• Update the eigenvectors and eigenvalues incrementally


Network Traffic Monitoring
• Clustering used to detect intrusions in the network
• Network Intrusion data set
• TCP dump data from seven weeks of LAN traffic
• 10 classes: 9 types of intrusions, 1 class of legitimate traffic
Algorithm                           Running time in milliseconds (per data point)   Cluster accuracy (NMI)
Approximate stream kernel k-means   6.6                                             14.2
StreamKM++                          0.8                                             7.0
sKKM                                42.1                                            13.3
Around 200 points clustered per second

Summary
• Efficient kernel-based stream clustering algorithm with linear running time complexity
• Memory required is bounded
• Real-time clustering is possible
• Limitation: does not account for data evolution