University of Joensuu, Department of Computer Science, Joensuu, Finland
A Heuristic K-means Clustering Algorithm by Kernel PCA
Mantao Xu and Pasi Fränti

Problem Formulation
Given N data samples X = {x_1, x_2, …, x_N}, construct the codebook C = {c_1, c_2, …, c_M} such that the mean-square-error is minimized. The class membership p(i) is the index of the code vector to which sample x_i is assigned.
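The objective can be illustrated with a minimal NumPy sketch. The function names `class_membership` and `mse` are hypothetical, and the normalization (mean of squared L2 distances per sample) is an assumption, since the slide does not state the exact form of the error.

```python
import numpy as np

def class_membership(X, C):
    """Index of the nearest code vector (squared L2 distance) for each sample."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def mse(X, C, p):
    """Mean squared distance between each sample and its assigned code vector.
    Assumption: averaged per sample, not per sample and dimension."""
    return float(((X - C[p]) ** 2).sum(axis=1).mean())

# Toy data: two pairs of points and one code vector per pair.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
C = np.array([[0.0, 1.0], [10.0, 1.0]])
p = class_membership(X, C)
error = mse(X, C, p)
```

Here every sample lies at distance 1 from its code vector, so the error of this codebook is 1.0.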

Traditional K-Means Algorithm
Iterations of two steps:
- assignment of each data vector with a class label
- computation of each cluster centroid by averaging all data vectors that are assigned to it
Characteristics:
- Randomized initial partition or codebook
- Convergence to a local minimum
- Use of L2, L1 and L∞ distance
- Fast and easy implementation
Extensions:
- Kernel K-means algorithm
- EM algorithm
- K-median algorithm
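The two alternating steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses the squared L2 distance, takes the initial codebook `C0` as an explicit argument instead of drawing it at random (an assumption made for reproducibility), and leaves the centroid of an empty cluster unchanged.

```python
import numpy as np

def kmeans(X, C0, max_iter=100):
    """Plain K-means: alternate assignment and centroid update until labels stop changing."""
    C = np.array(C0, dtype=float)
    p = None
    for _ in range(max_iter):
        # Step 1: assign each data vector to its nearest centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_p = d2.argmin(axis=1)
        if p is not None and np.array_equal(new_p, p):
            break  # partition unchanged: a local minimum has been reached
        p = new_p
        # Step 2: recompute each centroid as the mean of its members.
        for j in range(len(C)):
            members = X[p == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                C[j] = members.mean(axis=0)
    return C, p

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
C, p = kmeans(X, C0=X[[0, 2]])
```

Because the result depends on `C0`, a different initial codebook can end in a different local minimum, which is the sensitivity the later slides try to reduce.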

Motivation
Investigation of a clustering algorithm that:
- performs the conventional K-Means algorithm in searching for a solution close to the global optimum
- estimates an initial partition close to the optimal solution
- applies a dissimilarity function based on the current partition instead of the L2 distance

Selecting initial partition
Based on a kernel feature extraction approach and dynamic programming (DP):
1. Construct a 1-D subspace by kernel PCA.
2. Find a suboptimal partition by DP in the 1-D subspace.
3. Output the partition of the DP as the initial solution to the K-means algorithm.

Kernel PCA vs. PCA
Solve principal component analysis (PCA) in the reproducing kernel space F, which implicitly describes the irregular shape of the data with a nonlinear hypercurve.
[Figure: PCA vs. kernel PCA]

Problem formulation of kernel PCA
For kernel PCA on the N data samples X = {x_i}, solve the eigenvalue problem in F, which is assumed to be equivalent to an eigenvalue problem on the kernel matrix, where the eigenvector V is a linear expansion of the mapped samples and K is the kernel matrix with respect to the data X.
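A sketch of this step, following the standard kernel PCA formulation rather than the slide's own equations, which are not reproduced here: V is expanded as a sum of alpha_i Φ(x_i), and the coefficient vector alpha is an eigenvector of the centred kernel matrix. The RBF kernel, the `gamma` value and the function names are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix; the kernel choice is an assumption."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def first_kernel_component(K):
    """Projections of the N samples onto the first kernel principal component."""
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                         # centre the mapped data in F
    eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    mu, a = eigvals[-1], eigvecs[:, -1]    # leading eigenpair of Kc
    alpha = a / np.sqrt(mu)                # scales V to unit norm in F
    return Kc @ alpha                      # <V, Phi(x_i)> for every sample

X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = first_kernel_component(rbf_kernel(X, gamma=0.5))
```

With a linear kernel, `first_kernel_component(X @ X.T)` reduces to ordinary PCA, which gives a quick sanity check. The sign of the component is arbitrary, as with any eigenvector.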

Dynamic programming in kernel component direction
The optimal convex partition Q_k = {(q_{j-1}, q_j] | j = 1, …, n} in the kernel component direction w can be obtained by dynamic programming in terms of either the MSE distortion on the one-dimensional kernel component subspace, criterion (1), or the MSE distortion on the original feature space, criterion (2).
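A minimal sketch of the dynamic programming step for the one-dimensional criterion: the projected values are sorted and split into k contiguous intervals so that the total squared error is minimal. This is a textbook O(k·N²) formulation with prefix sums, not necessarily the authors' exact procedure; the name `dp_partition_1d` is hypothetical, and the distortion on the original feature space is not covered.

```python
import numpy as np

def dp_partition_1d(y, k):
    """Split 1-D values into k contiguous intervals with minimal total squared error.
    Returns (labels in the original order of y, minimal squared error). Needs len(y) >= k."""
    order = np.argsort(y)
    ys = np.asarray(y, dtype=float)[order]
    n = len(ys)
    s1 = np.concatenate(([0.0], np.cumsum(ys)))       # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(ys ** 2)))  # prefix sums of squares

    def sse(a, b):
        # Squared error of ys[a:b] around its own mean.
        s = s1[b] - s1[a]
        return (s2[b] - s2[a]) - s * s / (b - a)

    cost = np.full((k + 1, n + 1), np.inf)  # cost[j, b]: best error for ys[:b] in j intervals
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for j in range(1, k + 1):
        for b in range(j, n + 1):
            for a in range(j - 1, b):
                c = cost[j - 1, a] + sse(a, b)
                if c < cost[j, b]:
                    cost[j, b], back[j, b] = c, a

    # Backtrack the interval boundaries and map labels to the original order.
    labels_sorted = np.empty(n, dtype=int)
    b = n
    for j in range(k, 0, -1):
        a = back[j, b]
        labels_sorted[a:b] = j - 1
        b = a
    labels = np.empty(n, dtype=int)
    labels[order] = labels_sorted
    return labels, float(cost[k, n])

labels, err = dp_partition_1d([5.0, 0.0, 0.2, 5.2, 10.0], k=3)
```

The returned labels can serve directly as the initial partition passed to K-means.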

Application of Delta-MSE Dissimilarity
When a vector x is moved from cluster i to cluster j, the MSE function changes as a result of this move [10].
[Figure: clusters G1 (x1, x2, x3) and G2 (x4 and y1, y2, y3), with Delta-MSE(x4, G2) = AddVariance and Delta-MSE(x4, G1) = RemovalVariance]
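The two quantities named in the figure can be sketched with the well-known closed-form change of a cluster's squared error when a single vector is added or removed. The function names are hypothetical, and the values are sums of squared errors; dividing by N would give the change in MSE. Whether the slide's missing equation uses exactly this normalization is an assumption.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of the points to their own centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())

def removal_variance(x, c_i, n_i):
    """Decrease of cluster i's squared error when member x is removed (needs n_i > 1)."""
    return n_i / (n_i - 1) * float(((x - c_i) ** 2).sum())

def add_variance(x, c_j, n_j):
    """Increase of cluster j's squared error when x is added to it."""
    return n_j / (n_j + 1) * float(((x - c_j) ** 2).sum())

G1 = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
G2 = np.array([[10.0, 0.0], [12.0, 0.0]])
x = G1[2]

removed = removal_variance(x, G1.mean(axis=0), len(G1))
added = add_variance(x, G2.mean(axis=0), len(G2))
delta = added - removed  # the move lowers the total error only if delta < 0
```

Both values agree with recomputing the squared error from scratch, for example `sse(G1) - sse(G1[:2])` equals `removed`, so a move can be evaluated without touching the other cluster members.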

Pseudocode of the heuristic K-Means

Four K-Means algorithms used in the experiments
- K-D tree based K-Means: selects the initial cluster centroids from the k-bucket centers of a kd-tree structure that is built recursively
- PCA-based K-Means: estimates a sub-optimal initial partition by applying dynamic programming in the PCA direction
- KPCA-I: the proposed K-Means algorithm based on the dynamic programming criterion (1)
- LFD-II: the proposed K-Means algorithm based on the dynamic programming criterion (2)

Performance comparison 1 F-ratio validity index values for UCI data sets:

Performance comparison 2 F-ratio validity index values for image data sets:

F-ratio validity index values

Conclusions
- A new approach to the k-center clustering problem that incorporates kernel PCA and dynamic programming.
- The proposed approach is in general superior to the two other algorithms: the PCA-based and the kd-tree based K-Means.
- The gain in classification performance of the proposed approach over the two others increases with the number of clusters.

Further Work
- Solving the k-center clustering problem by iteratively incorporating kernel Fisher discriminant analysis and the dynamic programming technique.
- Solving the k-center clustering problem by boosting a decision function (applying a decision function f over X to obtain a scalar space f(X)).