University of Joensuu, Department of Computer Science, Joensuu, Finland
A Heuristic K-means Clustering Algorithm by Kernel PCA
Mantao Xu and Pasi Fränti

Problem Formulation
Given N data samples $X = \{x_1, x_2, \ldots, x_N\}$, construct the codebook $C = \{c_1, c_2, \ldots, c_M\}$ such that the mean-square-error is minimized. The class membership $p(i)$ is the index of the code vector to which sample $x_i$ is assigned.
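For reference, the objective and membership rule usually meant by this formulation can be written as below. This is the standard K-means (vector quantization) form and is an assumption about what the slide's missing equation showed; in particular, the 1/N normalization is a common convention that the transcript does not confirm.

```latex
\mathrm{MSE}(C, p) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p(i)} \rVert^2,
\qquad
p(i) = \arg\min_{1 \le j \le M} \lVert x_i - c_j \rVert^2
```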

Traditional K-Means Algorithm
Iterations of two steps:
- assignment of a class label to each data vector
- computation of each cluster centroid by averaging all data vectors that are assigned to it
Characteristics:
- Randomized initial partition or codebook
- Convergence to a local minimum
- Use of L2, L1 and L∞ distances
- Fast and easy implementation
Extensions:
- Kernel K-means algorithm
- EM algorithm
- K-median algorithm
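The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration assuming the squared L2 distance and an initial codebook supplied by the caller; the function name `kmeans` and its parameters are hypothetical and do not come from the slides.

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Plain two-step K-means with squared L2 distance (illustrative sketch)."""
    C = np.array(init_centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each data vector to its nearest centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # partition unchanged: a local minimum has been reached
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its members.
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                C[j] = members.mean(axis=0)
    mse = ((X - C[labels]) ** 2).sum(axis=1).mean()
    return C, labels, mse

# Usage with a randomized initial codebook drawn from the data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
init = X[rng.choice(len(X), size=2, replace=False)]
C, labels, mse = kmeans(X, init)
```

Because the result depends on `init`, different random codebooks can end in different local minima, which is the weakness the initialization scheme in this talk targets.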

Motivation
Investigation of a clustering algorithm that:
- performs the conventional K-Means algorithm in searching for a solution close to the global optimum
- estimates an initial partition close to the optimal solution
- applies a dissimilarity function based on the current partition instead of the L2 distance

Selecting initial partition
Based on a kernel feature extraction approach and dynamic programming (DP):
1. Construct a 1-D subspace by kernel PCA.
2. Find a suboptimal partition by DP in the 1-D subspace.
3. Output the partition of the DP as the initial solution to the K-means algorithm.

Kernel PCA vs. PCA
Solve principal component analysis (PCA) in the reproducing kernel space F, thereby implicitly describing the irregular shape of the data with a nonlinear hypercurve.
[Figure: PCA vs. kernel PCA]

Problem formulation of kernel PCA
For kernel PCA on the N data samples $X = \{x_i\}$, solve the eigenvalue problem in F, which is assumed to be equivalent to an eigenvalue problem on the kernel matrix K, where the eigenvector V is a linear expansion of the mapped data samples and K is the kernel matrix with respect to the data X.
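In the standard kernel PCA formulation, the feature-space problem reduces to $N \lambda \alpha = K \alpha$ with $V = \sum_i \alpha_i \Phi(x_i)$; whether the slide used exactly this notation is an assumption. The sketch below projects the samples onto the leading kernel component. The RBF kernel, the `gamma` parameter, and both function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def first_kernel_component(X, gamma=1.0):
    """Project the samples onto the leading kernel principal component."""
    N = len(X)
    K = rbf_kernel_matrix(X, gamma)
    # Center the mapped data in the feature space F.
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Solve the eigenvalue problem on the centered kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    lam, alpha = eigvals[-1], eigvecs[:, -1]
    # Scale alpha so that V = sum_i alpha_i * Phi(x_i) has unit norm in F.
    alpha = alpha / np.sqrt(lam)
    # Projection of sample i onto V is sum_j alpha_j * Kc[i, j].
    return Kc @ alpha

# Usage: a 1-D coordinate per sample, ready for a 1-D partitioning step.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
z = first_kernel_component(X, gamma=1.0)
```

The returned vector `z` is the 1-D subspace coordinate of each sample; the sign of the component is arbitrary, as with any eigenvector.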

Dynamic programming in kernel component direction
The optimal convex partition $Q_k = \{(q_{j-1}, q_j] \mid j = 1, \ldots, n\}$ in the kernel component direction w can be obtained by dynamic programming in terms of either:
(1) the MSE distortion on the one-dimensional kernel component subspace, or
(2) the MSE distortion on the original feature space.
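A dynamic program in the spirit of criterion (1) can be sketched as follows: sort the 1-D projections, then choose interval boundaries that minimize the total squared error inside the intervals. The function name `dp_partition_1d` and the straightforward O(k n^2) recurrence are assumptions made for illustration; the slides do not give the recurrence itself.

```python
import numpy as np

def dp_partition_1d(z, k):
    """Optimal split of 1-D values into k contiguous intervals (minimum SSE)."""
    order = np.argsort(z)
    s = np.asarray(z, dtype=float)[order]
    n = len(s)
    p1 = np.concatenate([[0.0], np.cumsum(s)])       # prefix sums
    p2 = np.concatenate([[0.0], np.cumsum(s ** 2)])  # prefix sums of squares

    def sse(a, b):
        # Squared error of s[a:b] around its own mean, from the prefix sums.
        tot = p1[b] - p1[a]
        return (p2[b] - p2[a]) - tot * tot / (b - a)

    # cost[j, b]: best SSE for the first b sorted values using j intervals.
    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for j in range(1, k + 1):
        for b in range(j, n + 1):
            for a in range(j - 1, b):
                c = cost[j - 1, a] + sse(a, b)
                if c < cost[j, b]:
                    cost[j, b] = c
                    back[j, b] = a
    # Trace the boundaries back and label samples in their original order.
    labels = np.empty(n, dtype=int)
    b = n
    for j in range(k, 0, -1):
        a = back[j, b]
        labels[order[a:b]] = j - 1
        b = a
    return labels, cost[k, n]

labels, total_sse = dp_partition_1d([5.0, 0.0, 10.0, 0.1, 5.1], k=3)
```

The labels can serve directly as an initial partition for K-means. Criterion (2) would keep the same recurrence but evaluate each interval's error on the original feature vectors instead of on the projections.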

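Application of Delta-MSE Dissimilarity
When a vector x is moved from cluster i to cluster j, the MSE function changes [10]. The Delta-MSE dissimilarity measures this change: the variance removed when x leaves its current cluster (RemovalVariance) and the variance added when x joins another cluster (AddVariance).
[Figure: clusters G1 and G2 with points x1 to x4 and y1 to y3; Delta-MSE(x4, G1) = RemovalVariance, Delta-MSE(x4, G2) = AddVariance]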
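The effect of a single move can be computed in closed form from the cluster sizes and centroids, without recomputing the whole error. The sketch below uses the well-known identities for the total squared error (SSE); the function names are hypothetical, and the slide's own formula may be normalized differently (for example, divided by N to give MSE).

```python
import numpy as np

def removal_variance(x, centroid, n):
    """Decrease in SSE when x leaves a cluster of n vectors (n > 1).

    `centroid` is the cluster mean computed with x still included.
    """
    return n / (n - 1) * np.sum((x - centroid) ** 2)

def add_variance(x, centroid, n):
    """Increase in SSE when x joins a cluster that currently has n vectors."""
    return n / (n + 1) * np.sum((x - centroid) ** 2)

def delta_sse(x, c_from, n_from, c_to, n_to):
    """Net SSE change for moving x between clusters (negative = improvement)."""
    return add_variance(x, c_to, n_to) - removal_variance(x, c_from, n_from)

# Usage: is it worth moving the last vector of G1 into G2?
G1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
G2 = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
change = delta_sse(G1[3], G1.mean(axis=0), len(G1), G2.mean(axis=0), len(G2))
```

Unlike the plain L2 distance to a centroid, these quantities weight the distance by the cluster size, so they depend on the current partition.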

Pseudocode of the heuristic K-Means

Four K-Means algorithms used in the experiments
- K-D tree based K-Means: selects initial cluster centroids from the k-bucket centers of a kd-tree structure that is recursively built by PCA
- PCA-based K-Means: estimates a sub-optimal initial partition by applying dynamic programming in the PCA direction
- KPCA-I: the proposed K-Means algorithm based on the dynamic programming criterion (1)
- LFD-II: the proposed K-Means algorithm based on the dynamic programming criterion (2)

Performance comparison 1
[Table: F-ratio validity index values for UCI data sets]

Performance comparison 2
[Table: F-ratio validity index values for image data sets]

F-ratio validity index values

Conclusions
- A new approach to the k-center clustering problem is proposed that incorporates kernel PCA and dynamic programming.
- The proposed approach is in general superior to the two other algorithms: the PCA-based and the kd-tree based K-Means.
- The gain in classification performance of the proposed approach over the two others increases with the number of clusters.

Further Work
- Solving the k-center clustering problem by iteratively incorporating kernel Fisher discriminant analysis and the dynamic programming technique
- Solving the k-center clustering problem by boosting a decision function (applying a decision function f over X to obtain a scalar space f(X))