Genetic Algorithms for the clustering problem
Pasi Fränti
7.4.2016
General structure

Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
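A minimal Python sketch of this loop, assuming the representation, the distortion (fitness) function, and the crossover and mutation operators discussed on the following slides are supplied as functions; the survivor policy (keep the better half) is an illustrative choice, not prescribed by the slide:

```python
import random

def genetic_algorithm(data, S, Z, fitness, crossover, mutate, random_solution):
    # Generate S initial solutions
    population = [random_solution(data) for _ in range(S)]
    for _ in range(Z):                             # REPEAT Z iterations
        population.sort(key=fitness)               # lower distortion = better
        survivors = population[: max(2, S // 2)]   # select best solutions
        children = []
        while len(survivors) + len(children) < S:  # create new solutions by crossover
            a, b = random.sample(survivors, 2)
            children.append(crossover(a, b))
        population = survivors + [mutate(c) for c in children]  # mutate solutions
    return min(population, key=fitness)
```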
Main principle
Components of GA
• Representation of solution
• Selection method
• Crossover method (most critical!)
• Mutation
Representation
Representation of solution
• Partition (P):
  – Optimal centroids can be calculated from P
  – Only local changes can be made
• Codebook (C):
  – Optimal partition can be calculated from C
  – Calculation of P takes O(NM) time: slow
• Combined (C, P):
  – Both data structures are needed anyway
  – Computationally more efficient
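A minimal sketch of the combined (C, P) representation in Python/NumPy; the class and method names are illustrative, not from the papers:

```python
import numpy as np

class Solution:
    """Combined representation: codebook C plus partition P."""
    def __init__(self, data, centroids):
        self.C = np.asarray(centroids, dtype=float)  # codebook: (M, d) array
        self.P = self.optimal_partition(data)        # partition: (N,) cluster indices

    def optimal_partition(self, data):
        # O(NM): assign every data vector to its nearest centroid
        d = ((data[:, None, :] - self.C[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    def optimal_centroids(self, data):
        # optimal centroids computed from the partition P
        for j in range(len(self.C)):
            members = data[self.P == j]
            if len(members):                         # keep old centroid if cluster empty
                self.C[j] = members.mean(axis=0)
```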
Selection method
• Selects which solutions will be used in crossover for generating new solutions.
• Main principle: good solutions should be used rather than weak ones.
• Two main strategies:
  – Roulette wheel selection
  – Elitist selection
• Exact implementation is not so important.
Roulette wheel selection
• Select two candidate solutions for the crossover randomly.
• Probability for a solution to be selected is weighted according to its distortion: the smaller the distortion, the higher the selection probability.
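The weighting formula itself appears as an image on the original slide; below is a minimal sketch assuming the selection probability is inversely proportional to distortion, in line with the stated principle:

```python
import random

def roulette_select_pair(population, distortion):
    # weight each solution by 1 / distortion: the smaller the MSE, the larger
    # the probability of being picked (may occasionally pick the same solution twice)
    weights = [1.0 / distortion(s) for s in population]
    a, b = random.choices(population, weights=weights, k=2)
    return a, b
```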
Elitist selection
• Main principle: select all possible pairs among the best candidates.
• Elitist approach using zigzag scanning among the best solutions.
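A minimal sketch of elitist pair generation; the exact zigzag scanning order is approximated here by sorting pairs by the sum of their ranks (an assumption, not the slide's exact scan):

```python
from itertools import combinations

def elitist_pairs(population, distortion, best=5):
    ranked = sorted(population, key=distortion)[:best]  # keep only the best candidates
    # ordering pairs by the sum of their ranks puts pairs of the very best
    # solutions first, approximating the zigzag scanning order
    order = sorted(combinations(range(len(ranked)), 2), key=sum)
    return [(ranked[i], ranked[j]) for i, j in order]
```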
Crossover
Crossover methods
Different variants for crossover:
• Random crossover
• Centroid distance
• Pairwise crossover
• Largest partitions
• PNN
Local fine-tuning:
• All methods give a new allocation of the centroids.
• Local fine-tuning must be made by k-means.
• Two iterations of k-means are enough.
Random crossover
• Select M/2 centroids randomly from each of the two parents.
[Figure: two parent solutions combined into a new solution]
Random crossover example (M = 4)

How to create a new solution? Pick M/2 randomly chosen cluster centroids from each of the two parents in turn. Some possibilities:

  Parent A    Parent B    Rating
  c2, c4      c1, c4      Optimal
  c1, c2      c3, c4      Good (k-means needed)
  c2, c3                  Bad

How many solutions are there? 36 possible ways to create a new solution (6 × 6 choices of M/2 centroids from each parent).

Probability to select a good one? Not high: a few are good after k-means, but most are bad. Rough statistics: Optimal: 1, Good: 7, Bad: 28.

[Figure: parent solutions A and B; data points and centroids c1–c4]
[Figure: parent solutions A and B with example child solutions from random crossover: optimal, good, and bad centroid allocations]
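A minimal sketch of random crossover on two parent codebooks (NumPy arrays of shape (M, d)); the function name is illustrative:

```python
import numpy as np

def random_crossover(C_a, C_b, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    M = len(C_a)
    pick_a = rng.choice(M, size=M // 2, replace=False)      # M/2 centroids from parent A
    pick_b = rng.choice(M, size=M - M // 2, replace=False)  # the rest from parent B
    return np.vstack([C_a[pick_a], C_b[pick_b]])
```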
Centroid distance crossover
[Pan, McInnes, Jack, 1995: Electronics Letters]
[Scheunders, 1996: Pattern Recognition Letters]
• For each centroid, calculate its distance to the center point of the entire data set.
• Sort the centroids according to this distance.
• Divide them into two sets: central vectors (the M/2 closest) and distant vectors (the M/2 furthest).
• Take the central vectors from one codebook and the distant vectors from the other.
Centroid distance crossover example (M = 4)

1) Distances d(ci, Ced) to the centroid Ced of the entire data set:
   A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
   B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)

2) Sort centroids according to the distance:
   A: c4, c2, c1, c3
   B: c1, c3, c2, c4

3) Divide into two sets:
   A: central vectors c4, c2; distant vectors c1, c3
   B: central vectors c1, c3; distant vectors c2, c4

Variant (a): take central vectors from parent A and distant vectors from parent B, OR
Variant (b): take distant vectors from parent A and central vectors from parent B.

[Figure: parent solutions A and B; data points, centroids c1–c4, and the centroid Ced of the entire data set]
[Figure: child solutions of the two variants: (a) central vectors of A with distant vectors of B; (b) distant vectors of A with central vectors of B]
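A minimal sketch of centroid distance crossover, implementing variant (a) above; variant (b) follows by swapping the roles of the parents:

```python
import numpy as np

def centroid_distance_crossover(C_a, C_b, data):
    center = data.mean(axis=0)                       # center point of the entire data set
    order_a = np.argsort(((C_a - center) ** 2).sum(axis=1))
    order_b = np.argsort(((C_b - center) ** 2).sum(axis=1))
    M = len(C_a)
    central_a = C_a[order_a[: M // 2]]               # M/2 closest centroids of parent A
    distant_b = C_b[order_b[M // 2 :]]               # M/2 furthest centroids of parent B
    return np.vstack([central_a, distant_b])
```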
Pairwise crossover
[Fränti et al, 1997: The Computer Journal]
Greedy approach:
• For each centroid, find its nearest centroid in the other parent solution that has not yet been used.
• From each pair, select one of the two centroids randomly.
Small improvement:
• No reason to consider the parents as separate solutions.
• Take the union of all centroids.
• Make the pairing independent of the parents.
Pairwise crossover example
Initial parent solutions: MSE = 8.79 × 10⁹ and MSE = 11.92 × 10⁹
Pairwise crossover example
Pairing between parent solutions: MSE = 7.34 × 10⁹
Pairwise crossover example
Pairing without restrictions: MSE = 4.76 × 10⁹
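A minimal sketch of the improved pairwise crossover (pairing over the union of all centroids, independent of parent); the greedy pairing order used here, simply the pool order, is one of several reasonable choices:

```python
import numpy as np

def pairwise_crossover(C_a, C_b, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    pool = list(np.vstack([C_a, C_b]))              # union of all 2M centroids
    child = []
    while pool:
        c = pool.pop(0)                             # next unpaired centroid
        d = [((c - other) ** 2).sum() for other in pool]
        mate = pool.pop(int(np.argmin(d)))          # its nearest unused centroid
        child.append(c if rng.random() < 0.5 else mate)  # keep one of the pair at random
    return np.array(child)
```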
Largest partitions
[Fränti et al, 1997: The Computer Journal]
Crossover algorithm:
• Each cluster in solutions A and B is assigned a number, the cluster size S, indicating how many data objects belong to it.
• In each phase, pick the centroid of the largest remaining cluster.
• Assume that cluster i was chosen from A. Its centroid ci is removed from A to avoid reselection.
• For the same reason, the cluster sizes of B are updated by removing the effect of the data objects in B that were assigned to the chosen cluster i in A.
Largest partitions example
[Figure: parent solutions A and B with cluster sizes S = 50, 30, 100, 100, 20; data points and centroids]
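A minimal sketch of the largest partitions crossover; P_a and P_b are the parents' partition arrays, and the bookkeeping is simplified relative to the paper:

```python
import numpy as np

def largest_partitions_crossover(C_a, P_a, C_b, P_b, M):
    # remaining cluster sizes of both parents, keyed by (parent, cluster index)
    size = {('A', j): int((P_a == j).sum()) for j in range(len(C_a))}
    size.update({('B', j): int((P_b == j).sum()) for j in range(len(C_b))})
    C = {'A': C_a, 'B': C_b}
    P = {'A': P_a, 'B': P_b}
    child = []
    while len(child) < M:
        side, j = max(size, key=size.get)       # centroid of the largest cluster
        child.append(C[side][j])
        del size[(side, j)]                     # avoid reselecting this centroid
        other = 'B' if side == 'A' else 'A'
        members = (P[side] == j)                # data objects of the chosen cluster
        # discount those objects from the other parent's cluster sizes
        for k, n in zip(*np.unique(P[other][members], return_counts=True)):
            if (other, int(k)) in size:
                size[(other, int(k))] -= int(n)
    return np.array(child)
```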
PNN crossover for GA
[Fränti et al, 1997: The Computer Journal]
[Figure: Initial 1 and Initial 2 are combined by taking the union of their centroids; PNN then merges the union back down: Initial 1, Initial 2 → Combined (union) → After PNN]
The PNN crossover method (1) [Fränti, 2000: Pattern Recognition Letters]
The PNN crossover method (2)
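The method itself is shown in the figures above; below is a hedged sketch of its core idea: merge the union of the parent codebooks back to M centroids using the standard PNN merge cost nₐ·n_b/(nₐ+n_b)·‖cₐ−c_b‖², with cluster sizes n taken from the parents' partitions. The brute-force pair search is O(M³) per crossover; the cited papers use faster nearest-neighbour bookkeeping:

```python
import numpy as np

def pnn_crossover(C_a, n_a, C_b, n_b, M):
    # union of the parent codebooks together with their cluster sizes
    C = [c.astype(float) for c in np.vstack([C_a, C_b])]
    n = [float(x) for x in np.concatenate([n_a, n_b])]
    while len(C) > M:
        best, pair = np.inf, None
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                # PNN merge cost: increase in distortion if clusters i and j merge
                cost = n[i] * n[j] / (n[i] + n[j]) * ((C[i] - C[j]) ** 2).sum()
                if cost < best:
                    best, pair = cost, (i, j)
        i, j = pair
        C[i] = (n[i] * C[i] + n[j] * C[j]) / (n[i] + n[j])  # size-weighted mean
        n[i] += n[j]
        del C[j], n[j]
    return np.array(C)
```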
Importance of k-means (random crossover)
[Figure: worst and best results on Bridge]
Effect of crossover method (with k-means iterations)
[Figure: results on Bridge]
Effect of crossover method (with k-means iterations)
[Figure: results on binary data, Bridge 2]
Mutations
Mutations
• Purpose is to make small random changes to the solutions.
• Happens with a small probability.
• Sensible approach: change the location of one centroid by a random swap!
• The role of mutations is to simulate local search.
• If mutations are needed, the crossover method is not very good.
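A minimal sketch of a random swap mutation, assuming the swap replaces one randomly chosen centroid by a randomly drawn data vector with a small probability:

```python
import numpy as np

def random_swap_mutation(C, data, prob=0.01, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    C = C.copy()
    if rng.random() < prob:                       # mutate only with small probability
        C[rng.integers(len(C))] = data[rng.integers(len(data))]
    return C
```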
Effect of k-means and mutations
• K-means improves the result but is less vital.
• Mutations alone work better than random crossover!
GAIS – Going extreme
Agglomerative clustering
PNN: pairwise nearest neighbor method
– Merges two clusters
– Preserves the hierarchy of clusters
IS: iterative shrinking method
– Removes one cluster at a time
– Repartitions the data vectors of the removed cluster
Iterative shrinking
Pseudo code
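The original slide shows the IS pseudocode as a figure; below is a hedged Python sketch of the core idea only (the full method also re-optimizes centroids and caches removal costs):

```python
import numpy as np

def nearest(data, C, exclude=None):
    # index of the nearest centroid for each vector; optionally forbid one cluster
    d = ((data[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(axis=2)
    if exclude is not None:
        d[:, exclude] = np.inf
    return d.argmin(axis=1)

def removal_cost(data, C, P, j):
    # increase in distortion if cluster j is removed and its vectors are
    # repartitioned to their secondary clusters
    members = data[P == j]
    if len(members) == 0:
        return 0.0
    Ca = np.asarray(C)
    s = nearest(members, C, exclude=j)            # secondary clusters of the members
    return ((members - Ca[s]) ** 2).sum() - ((members - Ca[j]) ** 2).sum()

def iterative_shrinking(data, C, M):
    C = [c.astype(float) for c in np.asarray(C)]
    while len(C) > M:                             # remove one cluster at a time
        P = nearest(data, C)
        costs = [removal_cost(data, C, P, j) for j in range(len(C))]
        del C[int(np.argmin(costs))]              # remove the cheapest cluster
    return np.array(C)
```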
Local optimization of IS
Finding the secondary cluster of vector xi (its nearest centroid other than its own):
    sᵢ = arg min_{j ≠ pᵢ} ‖xᵢ − cⱼ‖²
Removal cost of a single vector (increase in distortion when xi moves to its secondary cluster):
    dᵢ = ‖xᵢ − c_{sᵢ}‖² − ‖xᵢ − c_{pᵢ}‖²
Example (1)
Example (2)
Pseudo code of GAIS [Virmajoki & Fränti, 2006: Pattern Recognition]
PNN vs. IS crossovers
Further improvement of about 1%
Optimized GAIS variants
GAIS short (optimized for speed):
– Create new generations only as long as the best solution keeps improving (T = *).
– Use a small population size (Z = 10).
– Apply two iterations of k-means (G = 2).
GAIS long (optimized for quality):
– Create a large number of generations (T = 100).
– Use a large population size (Z = 100).
– Iterate k-means relatively long (G = 10).
Comparison with image data
[Table: MSE results of the compared methods; annotations on the slide: "Popular", "Simplest of the good ones", "Previous GA", "BEST!"]
What does it cost? (Bridge)
  Random:        ~0 s
  K-means:        8 s
  SOM:            6 minutes
  GA-PNN:        13 minutes
  GAIS (short): ~1 hour
  GAIS (long):  ~3 days
Comparison of algorithms
Variation of the result
Time vs. quality comparison (Bridge)
Conclusions
• Best clustering obtained by GA
• Crossover method most important
• Mutations not needed
References
1. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39(5), 761-765, May 2006.
2. P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21(1), 61-68, January 2000.
3. P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40(9), 547-554, 1997.
4. J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9(2), 113-129, 2003.
5. J. S. Pan, F. R. McInnes and M. A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
6. P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.