Clustering Categorical Data
Pasi Fränti, 18.2.2016

K-means clustering

Definitions and data
  • Set of N data points: X = {x_1, x_2, …, x_N}
  • Partition of the data: P = {p_1, p_2, …, p_M}
  • Set of M cluster prototypes (centroids): C = {c_1, c_2, …, c_M}

Distance and cost function
  • Euclidean distance of data vectors
  • Mean square error
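The formulas themselves are not part of the slide text. A minimal sketch of the standard forms, d(x, y) = sqrt(sum_k (x_k - y_k)^2) and MSE = (1/N) sum_i ||x_i - c_{p_i}||^2, is given below. It assumes the partition is stored as one cluster index per data point and that the error is normalized by N only; the function names are illustrative.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean_distance(x, y):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(squared_distance(x, y))

def mean_square_error(X, P, C):
    """Average squared distance from each point to its cluster centroid.

    X: list of data vectors
    P: list of cluster indices, P[i] is the cluster of X[i] (assumed encoding)
    C: list of centroids
    """
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P)) / len(X)

if __name__ == "__main__":
    X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    P = [0, 0, 1]
    C = [(1.0, 0.0), (10.0, 0.0)]
    print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
    print(mean_square_error(X, P, C))
```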

Clustering result as partition
  • Partition of data
  • Illustrated by Voronoi diagram
  • Cluster prototypes
  • Illustrated by convex hulls

Duality of partition and centroids
  • Partition of data: partition by nearest prototype mapping
  • Cluster prototypes: centroids as prototypes
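The duality can be sketched as the two alternating steps of k-means: given prototypes, the partition follows from nearest prototype mapping; given a partition, the prototypes follow as centroids. The code below is a minimal illustration, assuming numeric vectors, a partition stored as one cluster index per point, and an empty cluster simply keeping its previous prototype. The function names are illustrative.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_prototype_partition(X, C):
    """Partition step: map every point to the index of its nearest prototype."""
    return [min(range(len(C)), key=lambda j: squared_distance(x, C[j])) for x in X]

def centroids_from_partition(X, P, C_old):
    """Prototype step: replace each prototype by the mean of its cluster."""
    C_new = []
    for j, old in enumerate(C_old):
        members = [x for x, p in zip(X, P) if p == j]
        if not members:
            # Assumption: an empty cluster keeps its previous prototype.
            C_new.append(old)
            continue
        C_new.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return C_new

if __name__ == "__main__":
    X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
    C = [(0.0, 0.0), (10.0, 0.0)]
    P = nearest_prototype_partition(X, C)
    print(P, centroids_from_partition(X, P, C))
```

Repeating the two steps until the partition stops changing gives the usual k-means iteration.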

Categorical data

Categorical clustering
Three attributes: director, actor, genre

                      director    actor     genre
t1 (Godfather II)     Coppola     De Niro   Crime
t2 (Good Fellas)      Scorsese    De Niro   Crime
t3 (Vertigo)          Hitchcock   Stewart   Thriller
t4 (N by NW)          Hitchcock   Grant     Thriller
t5 (Bishop's Wife)    Koster      Grant     Comedy
t6 (Harvey)           Koster      Stewart   Comedy

Categorical clustering
Sample 2-d data: color and shape
Model A, Model B, Model C

Hamming Distance (Binary and categorical data)
  • Number of different attribute values.
  • Distance of (1011101) and (1001001) is 2.
  • Distance of (2143896) and (2233796) is 3.
  • Distance between (toned) and (roses) is 3.
3-bit binary cube:
  • 100 -> 011 has distance 3 (red path)
  • 010 -> 111 has distance 2 (blue path)
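The slide's examples can be checked with a few lines of code. This is a minimal sketch that treats every vector as a sequence of symbols of equal length; the function name is illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    pairs = [
        ("1011101", "1001001"),
        ("2143896", "2233796"),
        ("toned", "roses"),
        ("100", "011"),
        ("010", "111"),
    ]
    for a, b in pairs:
        print(a, b, hamming(a, b))
```

The same function works on tuples of categorical attribute values, for example `hamming(("Koster", "Grant", "Comedy"), ("Koster", "Stewart", "Comedy"))`.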

K-means variants
Methods:
  • k-modes
  • k-medoids
  • k-distributions
  • k-histograms
  • k-populations
  • k-representatives
Histogram-based methods:

Entropy-based cost functions
  • Category utility
  • Entropy of data set
  • Entropies of the clusters relative to the data
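The slide's formulas are not part of the text, so the sketch below only illustrates one common way to compute the last two quantities. It rests on stated assumptions: attributes are treated as independent, so the entropy of a data set is the sum of per-attribute Shannon entropies, and the entropy of a clustering is the size-weighted average of the cluster entropies. Category utility is not covered, and all function names are illustrative.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dataset_entropy(rows):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    return sum(attribute_entropy(col) for col in zip(*rows))

def expected_entropy(clusters):
    """Cluster entropies weighted by relative cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * dataset_entropy(c) for c in clusters)

if __name__ == "__main__":
    rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    clusters = [rows[:2], rows[2:]]
    print(dataset_entropy(rows), expected_entropy(clusters))
```

Under these assumptions a good clustering is one whose expected entropy is clearly lower than the entropy of the whole data set.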

Iterative algorithms

K-modes clustering Distance function

K-modes clustering Prototype of cluster
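In k-modes as described by Huang (1998), the distance is the number of mismatching attribute values and the prototype is the vector of per-attribute modes. A minimal sketch follows, with illustrative function names and hypothetical data. Ties between equally frequent values are broken arbitrarily here, which is an assumption of this sketch.

```python
from collections import Counter

def matching_distance(a, b):
    """Number of attributes on which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_prototype(rows):
    """Prototype of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

if __name__ == "__main__":
    # Hypothetical cluster of three categorical vectors.
    cluster = [("B", "C", "F"), ("B", "D", "F"), ("A", "C", "G")]
    proto = mode_prototype(cluster)
    print(proto, [matching_distance(r, proto) for r in cluster])
```

Note that the mode vector need not be one of the data vectors, unlike a medoid.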

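K-medoids clustering
Prototype of cluster: vector with minimal total distance to every other vector.
Pairwise distances:
  • d(A C E, B C F) = 2
  • d(B C F, B D G) = 2
  • d(A C E, B D G) = 3
Total distances:
  • A C E: 2+3=5
  • B C F: 2+2=4
  • B D G: 2+3=5
Medoid: B C F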
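The medoid calculation on this slide can be reproduced with a short sketch, assuming Hamming distance between the three vectors ACE, BCF and BDG. The function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(row, rows):
    """Sum of distances from one vector to all vectors in the cluster."""
    return sum(hamming(row, other) for other in rows)

def medoid(rows):
    """Vector of the cluster with minimal total distance to the others."""
    return min(rows, key=lambda r: total_distance(r, rows))

if __name__ == "__main__":
    cluster = ["ACE", "BCF", "BDG"]
    for row in cluster:
        print(row, total_distance(row, cluster))
    print("medoid:", medoid(cluster))
```

The exhaustive search compares every pair of vectors, so its cost grows quadratically with cluster size.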

K-medoids Example

K-medoids Calculation

K-histograms D 2/3 F 1/3
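In a histogram-based prototype, each attribute is represented by the relative frequencies of its values in the cluster instead of a single mode. The sketch below uses hypothetical data chosen so that one attribute yields the frequencies D 2/3 and F 1/3. The dissimilarity shown, one minus the frequency of the object's value summed over the attributes, is an assumption of this sketch and is not taken from the slide.

```python
from collections import Counter
from fractions import Fraction

def cluster_histograms(rows):
    """One histogram per attribute: value -> relative frequency in the cluster."""
    n = len(rows)
    return [{v: Fraction(c, n) for v, c in Counter(col).items()}
            for col in zip(*rows)]

def histogram_distance(row, histograms):
    """Assumed dissimilarity: sum over attributes of (1 - frequency of the value)."""
    return sum(1 - h.get(v, Fraction(0)) for v, h in zip(row, histograms))

if __name__ == "__main__":
    # Hypothetical cluster with two attributes.
    cluster = [("A", "D"), ("B", "D"), ("B", "F")]
    hists = cluster_histograms(cluster)
    print(hists[1])
    print(histogram_distance(("B", "D"), hists))
```

A value that never occurs in the cluster contributes the maximum of 1 to the distance, so the measure reduces to the mismatch count when every histogram has a single value.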

K-distributions Cost function with ε addition

Example of cluster allocation Change of entropy

Problem of non-convergence Non-convergence

Results with Census dataset

Literature
  • Modified k-modes + k-histograms: M. Ng, M. J. Li, J. Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
  • ACE: K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
  • ROCK: S. Guha, R. Rastogi and K. Shim, "Rock: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 200x.
  • K-medoids: L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley & Sons, New York, 1990.
  • K-modes: Z. Huang, "Extensions to k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
  • K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A New Algorithm for Clustering Categorical Data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
  • K-histograms: Zengyou He, Xiaofei Xu, Shengchun Deng and Bin Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.