Determining the number of clusters using information entropy for mixed data
Presenter: Hong-Yi Cai
Authors: Jiye Liang, Xingwang Zhao, Deyu Li, Fuyuan Cao, Chuangyin Dang
Pattern Recognition, 2012
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
• Determining the initial parameters of a clustering algorithm, especially the number of clusters, is one of the most difficult problems in cluster analysis.
• Few existing clustering algorithms can effectively cluster mixed numerical and categorical data sets.
Objectives
• To propose a generalized entropy-based mechanism for mixed data sets by integrating Rényi entropy and complement entropy.
• To improve the k-prototypes algorithm using the new generalized mechanism.
Methodology
• The k-prototypes algorithm…
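The slide only names the algorithm, so here is a hedged sketch of standard k-prototypes (Huang's combination of k-means for numeric attributes and k-modes for categorical ones), which the paper builds on. Function names, the `gamma` weight, and the toy data are illustrative choices, not the paper's.

```python
# Sketch of the k-prototypes algorithm: numeric attributes use squared
# Euclidean distance to the cluster mean; categorical attributes use
# mismatch counts to the cluster mode, weighted by gamma.
import random
from collections import Counter

def k_prototypes(num, cat, k, gamma=1.0, iters=20, seed=0):
    """num: list of numeric vectors; cat: list of categorical vectors."""
    rng = random.Random(seed)
    n = len(num)
    init = rng.sample(range(n), k)
    means = [list(num[i]) for i in init]   # numeric prototypes
    modes = [list(cat[i]) for i in init]   # categorical prototypes
    labels = [0] * n
    for _ in range(iters):
        # assignment step: nearest prototype under the mixed dissimilarity
        for i in range(n):
            best, best_d = 0, float("inf")
            for c in range(k):
                d_num = sum((a - b) ** 2 for a, b in zip(num[i], means[c]))
                d_cat = sum(a != b for a, b in zip(cat[i], modes[c]))
                d = d_num + gamma * d_cat
                if d < best_d:
                    best, best_d = c, d
            labels[i] = best
        # update step: mean for the numeric part, mode for the categorical part
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                continue
            means[c] = [sum(num[i][j] for i in members) / len(members)
                        for j in range(len(num[0]))]
            modes[c] = [Counter(cat[i][j] for i in members).most_common(1)[0][0]
                        for j in range(len(cat[0]))]
    return labels

# toy mixed data: two well-separated groups
num = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
cat = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]
labels = k_prototypes(num, cat, k=2, seed=1)
```

On this toy data the two groups are separated in both the numeric and categorical parts, so any initialization converges to the same two-cluster split within a few iterations.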
Methodology
• A generalized mechanism for numerical data
  – Rényi entropy (estimated via the convolution theorem)
  – Parzen window density estimation
  – Within-cluster entropy
  – Between-cluster entropy
  – Improved entropy for numerical data
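The equations on this slide did not survive extraction. As a hedged sketch of the standard constructions these labels usually refer to (notation assumed, not copied from the paper): the Rényi quadratic entropy is

  H_R(p) = -\log \int p(x)^2 \, dx,

and with a Gaussian Parzen window estimate

  \hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x - x_i),

the convolution theorem for Gaussians, \int G_\sigma(x - x_i)\, G_\sigma(x - x_j)\, dx = G_{\sigma\sqrt{2}}(x_i - x_j), gives the closed-form estimator

  \hat{H}_R = -\log \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j).

A within-cluster entropy would restrict the pairs (i, j) to points in the same cluster, and a between-cluster entropy to pairs in different clusters; the paper's exact definitions may differ.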
Methodology
• A generalized mechanism for categorical data
  – Indiscernibility relation
  – Complement entropy
  – Within-cluster entropy
  – Between-cluster entropy
  – Huang dissimilarity for categorical data
  – Improved entropy for categorical data
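As with the numerical slide, the formulas here were lost in extraction. A hedged sketch of the standard definitions these labels point to: for an attribute set A whose indiscernibility relation partitions the universe U into equivalence classes U / IND(A) = \{X_1, \dots, X_m\}, the complement entropy is

  E(A) = \sum_{i=1}^{m} \frac{|X_i|}{|U|} \left( 1 - \frac{|X_i|}{|U|} \right),

and Huang's dissimilarity between two categorical objects x and y with p attributes is the simple mismatch count

  d(x, y) = \sum_{j=1}^{p} \delta(x_j, y_j), \quad \delta(a, b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases}.

The paper's improved entropy presumably combines these, but the exact form is not recoverable from the slide.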
Methodology
• A generalized mechanism for mixed data sets…
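The combination rule itself is not visible on the slide. One plausible reading, stated purely as an assumption, is a weighted sum of the two improved entropies,

  H_{\text{mix}} = H_{\text{num}} + \lambda \, H_{\text{cat}},

where \lambda balances the numerical (Rényi-based) and categorical (complement-entropy-based) contributions, in the same spirit as the gamma weight in k-prototypes.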
Methodology
• Cluster validity index for mixed data
  – For numerical data…
  – For categorical data…
  – For mixed data…
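The index formulas are not recoverable from the slide, but the model-selection loop such an index serves can be sketched. In this sketch `cluster` and `index` are toy stand-ins (contiguous 1-D chunking and a scatter-plus-penalty score), not the paper's improved k-prototypes or its entropy-based index; only the sweep-and-pick structure is the point.

```python
def best_k(data, cluster, index, k_min, k_max):
    """Cluster for each candidate k, score each partition, keep the best k."""
    scores = {k: index(data, cluster(data, k)) for k in range(k_min, k_max + 1)}
    return min(scores, key=scores.get), scores  # lower score = better (assumed)

# Toy stand-ins for demonstration only:
def cluster(data, k):
    # cut the sorted 1-D data into k contiguous chunks
    order = sorted(range(len(data)), key=lambda i: data[i])
    size = max(len(data) // k, 1)
    labels = [0] * len(data)
    for rank, i in enumerate(order):
        labels[i] = min(rank // size, k - 1)
    return labels

def index(data, labels):
    # within-cluster scatter plus a per-cluster penalty (toy index, not the paper's)
    k = max(labels) + 1
    score = 0.0
    for c in range(k):
        pts = [data[i] for i in range(len(data)) if labels[i] == c]
        mean = sum(pts) / len(pts)
        score += sum((x - mean) ** 2 for x in pts)
    return score + 0.5 * k

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two tight 1-D groups, so true k is 2
k_star, scores = best_k(data, cluster, index, k_min=2, k_max=4)
```

With two tight groups, the toy index is minimized at k = 2: merging the groups inflates the scatter term, while over-splitting pays the per-cluster penalty.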
Methodology
Experiments
• Ten Cluster data set
Experiments
• STUDENT data set
Experiments
• Real data sets…
Experiments
• Results on real data sets: Wine, Breast, Voting, Car, DNA, TAE, Heart, Credit, CMC, Adult
Conclusions
• The proposed generalized mechanism and algorithm cluster mixed data sets effectively and determine the optimal number of clusters.
Comments
• Advantages
  – The entropy-based measure applies to mixed data sets.
• Applications
  – Clustering of mixed-type data