National Yunlin University of Science and Technology

A supervised clustering and classification algorithm for mining data with mixed variables
Xiangyang Li and Nong Ye
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 36, no. 2, 2006, pp. 396-406.
Presenter: Wei-Shen Tai
Advisor: Professor Chung-Chian Hsu
2006/10/11

N.Y.U.S.T. I.M.
Outline
- Introduction
- Review of CCAS
- ECCAS
- Applications of ECCAS
- Results and discussion
- Conclusion
- Comments

Motivation
- Handling mixed data types in data mining
  - Data with mixed variables, including numerical, ordinal, and nominal variables.

Objective
- An enhanced supervised clustering algorithm
  - It improves robustness to the presentation order of training data points and to noise in the training data.
  - The algorithm supports incremental learning and mixed data types.

Cluster Representation and Distance Measures
- Clustering and Classification Algorithm - Supervised (CCAS)
  - Based on the distances between data points as well as the target class of each data point.
  - A grid-based supervised clustering of data points.
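To make the idea of clustering by both distance and target class concrete, here is a minimal sketch. It assumes a single pass in which a point joins its nearest cluster only when the class labels match and otherwise starts a new cluster. The grid-based search is omitted, and the function name `supervised_cluster` and the dictionary layout are hypothetical, not taken from the paper.

```python
import math

def supervised_cluster(points, labels):
    """One pass over the data: join the nearest cluster if its class matches,
    otherwise open a new cluster (assumed rule, grid search omitted)."""
    clusters = []  # each cluster: {"centroid": [...], "label": class, "n": size}
    for x, y in zip(points, labels):
        nearest = min(clusters,
                      key=lambda c: math.dist(c["centroid"], x),
                      default=None)
        if nearest is not None and nearest["label"] == y:
            n = nearest["n"]
            # Incrementally update the centroid as a running mean.
            nearest["centroid"] = [(c * n + xi) / (n + 1)
                                   for c, xi in zip(nearest["centroid"], x)]
            nearest["n"] = n + 1
        else:
            clusters.append({"centroid": list(x), "label": y, "n": 1})
    return clusters
```

Because each point is handled as it arrives, a sketch like this is incremental, but its result depends on the order in which points are presented.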

Post-processing of the Cluster Structure for More Robustness
- Data Redistribution
  - Reduces the impact of the presentation order of data points.
  - When a seed cluster (an existing cluster) is found to be the nearest cluster to a data point, the seed cluster is replaced by a new cluster with that data point as its centroid and sole member.
- Supervised Grouping of Clusters
  - Any two clusters that are nearest to each other and have the same target class can be grouped into one cluster.
- Removal of Outliers
  - Remove outliers by checking the number of data points in each cluster.
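The grouping and outlier-removal steps can be sketched as follows. This is an illustration under assumptions: "nearest to each other" is read as mutually nearest centroids, merged centroids are size-weighted means, and the threshold `min_points` is a hypothetical parameter. Data redistribution is not covered here.

```python
import math

def group_clusters(clusters):
    """Merge pairs of mutually nearest clusters that share a class label."""
    clusters = [dict(c) for c in clusters]  # work on a copy
    merged = True
    while merged and len(clusters) > 1:
        merged = False

        def nearest(i):
            # Index of the cluster whose centroid is closest to cluster i.
            return min((j for j in range(len(clusters)) if j != i),
                       key=lambda j: math.dist(clusters[i]["centroid"],
                                               clusters[j]["centroid"]))

        for i in range(len(clusters)):
            j = nearest(i)
            a, b = clusters[i], clusters[j]
            if nearest(j) == i and a["label"] == b["label"]:
                n = a["n"] + b["n"]
                centroid = [(p * a["n"] + q * b["n"]) / n
                            for p, q in zip(a["centroid"], b["centroid"])]
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append({"centroid": centroid, "label": a["label"], "n": n})
                merged = True
                break  # indices changed, so restart the scan
    return clusters

def remove_outliers(clusters, min_points):
    """Drop clusters holding fewer than min_points data points."""
    return [c for c in clusters if c["n"] >= min_points]
```

Grouping first and filtering afterwards keeps small same-class fragments from being discarded before they have a chance to merge.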

Classification
- Concept
  - Classify a new data point D using the clusters labeled with a target class.
  - The class value Y of D is computed from its nearest clusters, where L_j is the jth nearest cluster, W_j is the weight of cluster L_j based on its squared distance to D, and Y_Lj and Y are the target class values of that cluster and of D.
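A minimal sketch of this classification step, assuming the weight W_j is the inverse of the squared distance between D and the centroid of L_j, and that Y is the weighted average of the numeric class values Y_Lj of the k nearest clusters. The exact formula is not reproduced in these slides, so the function `classify` and the parameters `k` and `eps` are illustrative assumptions.

```python
import math

def classify(point, clusters, k=3, eps=1e-12):
    """Weighted average of the class values of the k nearest clusters.
    Weights are assumed to be 1 / squared distance (eps avoids division by zero)."""
    nearest = sorted(clusters,
                     key=lambda c: math.dist(c["centroid"], point))[:k]
    weights = [1.0 / (math.dist(c["centroid"], point) ** 2 + eps)
               for c in nearest]
    weighted_sum = sum(w * c["label"] for w, c in zip(weights, nearest))
    return weighted_sum / sum(weights)
```

With class values coded as 0 and 1, the returned score can be thresholded (for example at 0.5) to obtain a class decision.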

ECCAS (extended CCAS)
- Method A: Based on a Combination of Two Distance Measures
  - Count the frequencies of the n_i categories of a nominal attribute i over the data points in cluster j, and represent the cluster by these frequencies.
- Method B: Based on Conversion of Nominal Variables to Binary Variables
  - Each categorical value of a nominal attribute is represented by a binary variable.
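The two representations of a nominal attribute can be illustrated as below. This is a sketch only: Method A is shown with relative frequencies (an assumption, since raw counts would also fit the description), and the function names and the example categories are hypothetical.

```python
from collections import Counter

def category_frequencies(values, categories):
    """Method A (sketch): relative frequency of each category among
    the values of one nominal attribute inside a cluster."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]

def to_binary(value, categories):
    """Method B (sketch): one binary variable per categorical value."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical nominal attribute with three categories.
categories = ["tcp", "udp", "icmp"]
print(category_frequencies(["tcp", "udp", "tcp", "tcp"], categories))
print(to_binary("udp", categories))
```

Method A keeps one frequency vector per cluster, while Method B expands every data point before clustering, so the numeric distance measure can be applied unchanged.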

Results and discussion

Conclusions
- ECCAS
  - Handles data with both numeric and nominal variables.
  - Reduces the impact of the data presentation order on the prediction accuracy.
- Number of grid intervals
  - Affects the prediction accuracy of ECCAS.
- Adaptively and dynamically adjust the parameters
  - Includes the grid-interval configuration and the threshold that controls outlier removal.

Comments
- Advantage
  - Provides a concept for supervised clustering with a target class.
  - An alternative method for handling data of mixed types.
- Drawback
  - Attempts to represent hyperspace via a line concept.
  - If the target class is the determinant of clustering, why is redistribution needed to improve robustness?
  - Experimental results seem disorderly and inconsistent.
- Application
  - A classification solution for mining data of mixed types.