New Representations in Genetic Programming for Feature Construction

- Slides: 21
New Representations in Genetic Programming for Feature Construction in k-means Clustering
Andrew Lensen, Dr. Bing Xue, and Prof. Mengjie Zhang
Victoria University of Wellington, New Zealand
SEAL '17

Clustering
• The task of grouping similar data items into a number of clusters.
• Unsupervised: no known labels.
• The most renowned clustering algorithm is k-means.
• k-means iteratively refines the cluster centres and assigns each instance to its nearest centre.
Lensen, Xue, Zhang. Andrew.Lensen@ecs.vuw.ac.nz

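The refine-and-assign loop described above can be sketched in a few lines. This is a minimal illustration only, assuming Euclidean distance and random seeding from the data; the function name `kmeans` and its parameters are hypothetical and not taken from the talk.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # seed the centres with k instances
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: nothing changed cluster
            break
        assignment = new_assignment
        # Update step: move each centre to the mean of its instances.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centres, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; on harder data the result depends on the seed, which is the local-optimum issue raised on the next slide.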
k-means Clustering
Advantages:
• Low computational cost: amortized linear time.
• Straightforward to implement and understand.
• Good performance with small k and low dimensionality.
Disadvantages:
• Can get stuck in local optima, especially with a “bad” seed.
• Cannot accurately identify non-hyper-spherical clusters.
• Achieves poor results as the problem space enlarges.

How can we address some of these limitations?
• Transform the feature space into a simpler one suitable for k-means.
• Use different optimisation criteria which have less cluster shape bias.
• Use Genetic Programming (GP) to automatically construct several high-level features tailored to the dataset.
• GP will learn which features work for k-means.
• Constructed features (CFs) allow for varying cluster shape.

Goals
Explore using GP for feature construction with a wrapper approach to improve the clustering performance of k-means.
• What GP representations can create multiple CFs from one GP individual?
• Can different fitness functions be used to improve cluster quality?
• How does performance change across a variety of datasets?
• Can we interpret evolved GP trees to understand why their CFs are useful?

Overall Design
[Figure: CFs are produced for each instance from the GP tree/s and passed to k-means.]

GP Representation
What GP representations can create multiple CFs from one GP individual? We investigated two approaches:
1) Using multiple trees.
2) Using a single tree, with a vector representation.

GP Representation – Multiple Trees
• Each tree creates a single constructed feature.
• Each individual contains t trees, to give t constructed features.

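A multi-tree individual can be pictured as a list of t expression trees, each evaluated on an instance to give one CF. The sketch below is illustrative only: the nested-tuple tree encoding, the `FUNCTIONS` table, and the helper names are assumptions, not the authors' function set or implementation.

```python
import operator

# Hypothetical function set, chosen only for illustration.
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "max": max,
    "min": min,
}


def evaluate(tree, instance):
    """Evaluate one expression tree on one instance (a sequence of feature values)."""
    kind = tree[0]
    if kind == "feat":   # terminal: an original feature, looked up by index
        return instance[tree[1]]
    if kind == "const":  # terminal: a numeric constant
        return tree[1]
    left, right = (evaluate(child, instance) for child in tree[1:])
    return FUNCTIONS[kind](left, right)


def construct_features(individual, instance):
    """A multi-tree individual holds t trees; each tree yields exactly one CF."""
    return [evaluate(tree, instance) for tree in individual]


# An individual with t = 3 trees produces 3 constructed features.
individual = [
    ("add", ("feat", 0), ("feat", 1)),
    ("mul", ("feat", 2), ("const", 0.5)),
    ("max", ("feat", 0), ("feat", 2)),
]
cfs = construct_features(individual, [1.0, 2.0, 4.0])
```

Here `cfs` has one entry per tree, so the number of CFs is fixed by t rather than learned.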
GP Representation – Vector
• Having to set t is annoying. Can we use a single tree?
• Introduce a new concat operator which can create vectors of CFs.
• Automatically build up a suitable length vector.
• Extend the function set to work on vectors.
• Example output gives 4 features: [min(0.63, F23), F37/0.59, F86, F85]
• However, each tree must be larger.

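The concat idea can be sketched by letting every node return a vector: terminals return length-1 vectors, concat joins its children, and other functions work element-wise. The broadcasting rule (length-1 operands are repeated, other mismatches are truncated) and the protected division are assumptions made for this sketch; the slides do not say how the authors handle these cases.

```python
def protected_div(a, b):
    # Assumed convention: return 1.0 when the denominator is zero.
    return a / b if b != 0 else 1.0


# Hypothetical element-wise function set.
FUNCTIONS = {"min": min, "max": max, "add": lambda a, b: a + b, "div": protected_div}


def evaluate(tree, instance):
    """Return a list of CF values; terminals evaluate to length-1 vectors."""
    kind = tree[0]
    if kind == "feat":
        return [instance[tree[1]]]
    if kind == "const":
        return [tree[1]]
    left = evaluate(tree[1], instance)
    right = evaluate(tree[2], instance)
    if kind == "concat":  # join two vectors of CFs into a longer one
        return left + right
    # Assumed rule: repeat a length-1 operand to match the other side.
    if len(left) == 1:
        left = left * len(right)
    if len(right) == 1:
        right = right * len(left)
    # zip truncates to the shorter vector if the lengths still differ.
    return [FUNCTIONS[kind](a, b) for a, b in zip(left, right)]


# The example from the slide: [min(0.63, F23), F37/0.59, F86, F85]
tree = (
    "concat",
    ("concat",
     ("min", ("const", 0.63), ("feat", 23)),
     ("div", ("feat", 37), ("const", 0.59))),
    ("concat", ("feat", 86), ("feat", 85)),
)
instance = {23: 0.9, 37: 1.18, 86: 0.2, 85: 0.7}  # made-up feature values
cfs = evaluate(tree, instance)
```

A single tree with three concat nodes yields four CFs here, which also shows why vector trees tend to be larger than the individual trees of the multi-tree representation.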
GP Functions & Terminals

Fitness Function (1) – Σ Intra
• The first fitness function we investigated is the one used by vanilla k-means: the total distance from each cluster centre to all of its instances.
• Allows us to evaluate how GP for FC alone can improve results.
• However, it will still lead to hyper-spherical clusters…

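As a rough sketch of this measure, the function below sums the distance from every instance to the centre of its cluster (lower is better). It assumes plain Euclidean distance and centres taken as cluster means; the name `sum_intra` is hypothetical.

```python
import math


def sum_intra(points, labels):
    """Total distance from each cluster centre (mean) to all of its instances."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centre = per-dimension mean of the cluster's instances.
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, centre) for p in members)
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
fitness = sum_intra(points, labels)  # centres are (1, 0) and (10, 2)
```

In a wrapper setting, `points` would be the constructed features of each instance rather than the original features.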
Fitness Function (2) – Connectedness
• Connectedness measures how well an instance is assigned to the same cluster as its (immediate) neighbours.
• Similar instances should be in the same cluster!
• Fitness is the mean connectedness of all instances.
• Formulated carefully to reduce cluster shape bias.

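The slides do not give the exact formula, so the following is only an illustrative stand-in, not the authors' definition: it scores each instance by the fraction of its nearest neighbours that share its cluster and averages over all instances. The function name and the `n_neighbours` parameter are assumptions.

```python
import math


def mean_connectedness(points, labels, n_neighbours=2):
    """Mean fraction of each instance's nearest neighbours in its own cluster.

    Simplified stand-in for a connectedness measure; higher is better.
    """
    scores = []
    for i, p in enumerate(points):
        # Rank all other instances by distance to instance i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        nearest = others[:n_neighbours]
        same = sum(1 for j in nearest if labels[j] == labels[i])
        scores.append(same / len(nearest))
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
good = mean_connectedness(points, [0, 0, 0, 1, 1, 1])  # neighbours kept together
bad = mean_connectedness(points, [0, 1, 0, 1, 0, 1])   # neighbours split apart
```

Because the score depends only on local neighbourhoods, it does not reward any particular cluster shape, which is the property the slide highlights.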
Datasets

Evaluation Metrics
• The paper compares the two fitness function metrics, and two external metrics which compare the clustering to the dataset labels.
• To save time, we focus here on the F-measure, which is adapted from classification and is based on the number of true positives (TPs), false positives (FPs), and false negatives (FNs).

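One common way to adapt the F-measure to clustering is to count pairs of instances: a pair is a TP if it shares both a class and a cluster, an FP if it shares only a cluster, and an FN if it shares only a class. The sketch below assumes this pair-counting variant, which may differ from the exact definition used in the paper.

```python
from itertools import combinations


def pair_f_measure(true_labels, cluster_labels):
    """Pair-counting F-measure between class labels and cluster assignments."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [0, 0, 0, 1, 1, 1]
perfect = pair_f_measure(truth, [1, 1, 1, 0, 0, 0])  # cluster ids may be permuted
partial = pair_f_measure(truth, [0, 0, 1, 1, 1, 1])  # one instance misplaced
```

Counting pairs avoids having to match cluster ids to class labels, so a clustering with permuted ids still scores 1.0.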
Experiment Setup
• We compare each of the two representations and two fitness functions to vanilla k-means (using all features).
• 4 proposed methods: MTConn, MTIntra, VectorConn, VectorIntra.
• Use typical GP parameters: top-10 elitism, 80% crossover, 40% mutation, 1024 population size. Set t to 7.
• k-means is run until convergence or a max of 100 iterations.

Results – Real-World Datasets
F-measure per dataset. Columns, in order: MTConn, MTIntra, VectorConn, VectorIntra, Vanilla k-means. Rows with fewer than five values are incomplete in the source.
Iris: 0.83, 0.81, 0.91, 0.81, 0.75
Wine: 0.94, 0.93, 0.90
Movement Lib.: 0.34, 0.35, 0.34
Breast Cancer: 0.84, 0.94, 0.82, 0.94, 0.93
Image Seg.: 0.59, 0.57, 0.55
Dermatology: 0.93, 0.79, 0.88, 0.76
• Clear improvement over vanilla k-means.
• Vector better on Iris than MT, otherwise similar.
• Conn better on Dermatology, worse on Breast Cancer.

Results – Synthetic Datasets
F-measure per dataset. Columns, in order: MTConn, MTIntra, VectorConn, VectorIntra, Vanilla k-means. Rows with fewer than five values are incomplete in the source.
10d10c: 0.78, 0.80, 0.89, 0.86
10d20c: 0.99, 0.80
10d40c: 0.93, 0.95, 0.94, 0.95, 0.86
50d10c: 0.52, 0.48, 0.50, 0.48, 0.49
50d20c: 0.50, 0.48, 0.44, 0.38
50d40c: 0.44, 0.43, 0.41, 0.38, 0.26
100d10c: 0.53, 0.58, 0.54, 0.58, 0.53
100d20c: 0.47, 0.46, 0.45, 0.43, 0.38
100d40c: 0.46, 0.44, 0.40, 0.27
• Clearest improvements on hardest datasets (high K).
• Multi-tree slightly better. Fitness functions both good…

Evolved Program Analysis – Multi-tree
• But how do the evolved trees improve results?
• Combination of feature selection, feature scaling, and high-level constructed features.

Evolved Program Analysis – Vector
• What about a harder dataset?
• 0.50 F-measure on 100d40c (vs 0.27 baseline).
• 11 features produced.
• Only uses 12/100 original features!

Final Remarks
• GP can be used for feature construction to significantly improve the performance of k-means clustering.
• GP automatically selects a subset of useful features and constructs a feature space tailored to the dataset.
• GP trees are interpretable.
• Future: other fitness functions, different representations…

Thank you!