Data Clustering: A brief introduction and some areas of ongoing research

This Presentation
• Part 1 Introduction
  • Why data clustering
  • What is data clustering
  • How to get started in data science
  • Use clustering to solve a problem
• Part 2 Research
  • Clustering Research
  • My Clustering Research
• Questions

Part 1 An Introduction to Data Clustering

Background – AI & Data Science
• Understanding intelligence and building intelligent entities
• Definitions of AI vary somewhat: systems that think humanly (copy the brain), think rationally (copy rational thought processes), act humanly (do tasks that require human intelligence) or act rationally (exhibit intelligent behaviour) [1]
• Cross-disciplinary set of skills
  • Technology, Math & Statistics, Expertise [2]
• Knowledge Discovery in Databases [3]
[1] Artificial Intelligence: A Modern Approach, Russell and Norvig
[2] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
[3] https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf

Data Clustering - Why?
• Unsupervised learning: a machine can learn to recognise different objects without being taught with labels.
• Reduce the amount of labelling required for supervised learning.
• Recognise groups in data too big for humans to work through.
• Recognise clusters in data with more than 3 dimensions, where clusters are hard for humans to spot.
Exploratory Task / Pre-processing Step

Data Clustering - What?
• Data clustering is an unsupervised machine learning / data mining technique which groups data instances by a measure of similarity.
• It reveals groups in data, referred to as clusters, labelling the instances in each group as similar.
• It is not to be confused with classification.
• Subjective
Image: https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf

Data Clustering Visual Example

How to get started with Data Clustering?
• The tools: Python, scikit-learn and Jupyter.
https://bitbucket.org/paulmogs398/glassclusteringexample/

Detective – Let’s solve a crime!
There has been a burglar breaking into a university through the windows and stealing precious clustering research! There are three suspects: A, B and C! Each of the suspects has been found to have glass fragments in their shoes!
[Images: glass fragments from suspects A, B and C]
Image credit: http://www.fosterfreeman.com/35-products/technology-3/420-glass.html

These are not the first glass fragments the police department has analysed. In fact, they have a collection of over 200 fragments. Sadly, the suspects' new glass pieces do not match any of the pieces they have labelled.
[Images: labelled fragments (Window, Bottle, Bulb) and an unlabelled fragment (?)]

• RI: refractive index
• Na: Sodium
• Mg: Magnesium
• Al: Aluminium
• Si: Silicon
• K: Potassium
• Ca: Calcium
• Ba: Barium
• Fe: Iron

   RI       Na     Mg    Al    Si     K     Ca    Ba    Fe
0  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.00  0.00
1  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.00  0.00
2  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.00  0.00
3  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.00  0.00
4  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.00  0.00
5  1.51596  12.79  3.61  1.62  72.97  0.64  8.07  0.00  0.26
6  1.51743  13.30  3.60  1.14  73.09  0.58  8.17  0.00  0.00
7  1.51756  13.15  3.61  1.05  73.24  0.57  8.24  0.00  0.00
8  1.51918  14.04  3.58  1.37  72.08  0.56  8.30  0.00  0.00
9  1.51755  13.00  3.60  1.36  72.99  0.57  8.40  0.00  0.11

The first 10 rows of the glass dataset
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

• Normalise or standardise the dataset:

       RI        Na        Mg        Al        Si        K         Ca        Ba        Fe
count  214       214       214       214       214       214       214       214       214
mean   0.316744  0.402684  0.597891  0.359784  0.507310  0.080041  0.327785  0.055570  0.111783
std    0.133313  0.122798  0.321249  0.155536  0.138312  0.105023  0.132263  0.157847  0.191056
min    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
25%    0.235843  0.327444  0.471047  0.280374  0.441071  0.019726  0.261152  0.000000  0.000000
50%    0.286655  0.386466  0.775056  0.333333  0.532143  0.089372  0.294610  0.000000  0.000000
75%    0.351514  0.465414  0.801782  0.417445  0.585268  0.098229  0.347816  0.000000  0.196078
max    1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
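The summary above has every column running from 0 to 1, which is what min-max normalisation produces. A minimal sketch of that step, using the RI, Na, Mg and Ba values of the first five rows shown earlier; the function name `min_max_scale` is illustrative and not taken from the talk's notebook:

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lowest = X.min(axis=0)
    span = X.max(axis=0) - lowest
    span[span == 0] = 1.0  # constant column: avoid dividing by zero, column maps to 0
    return (X - lowest) / span


# Columns: RI, Na, Mg, Ba (first five rows of the glass dataset)
raw = np.array([
    [1.52101, 13.64, 4.49, 0.00],
    [1.51761, 13.89, 3.60, 0.00],
    [1.51618, 13.53, 3.55, 0.00],
    [1.51766, 13.21, 3.69, 0.00],
    [1.51742, 13.27, 3.62, 0.00],
])
scaled = min_max_scale(raw)
print(scaled.round(3))
```

Scaling matters because distance-based clustering would otherwise be dominated by the columns with the largest raw values, such as Si.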

Reducing the dimensionality:
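The slide does not say which technique was used for this step. As one common option, here is a sketch of principal component analysis (PCA) written with NumPy only; the function name `pca_project` and the random stand-in data are assumptions made for illustration:

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X onto its first n_components principal components."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centred @ vt[:n_components].T, explained[:n_components]


# Stand-in for the normalised glass data: 214 instances, 9 features (random values)
rng = np.random.default_rng(0)
data = rng.random((214, 9))
projected, explained_ratio = pca_project(data, n_components=2)
print(projected.shape, explained_ratio)
```

Two components are convenient because the result can be drawn as a scatter plot, but the number to keep is a judgment call.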

Instance selection, removing outliers…
[Figures: 214 instances before, 171 instances after]

• Clustering the data

• Predicting the type of glass found on our suspects.
  • Suspect A = Cluster 1
  • Suspect B = Cluster 1
  • Suspect C = Cluster 0
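A sketch of how clustering and then prediction might look in scikit-learn. The slides do not name the clustering algorithm, so k-means with two clusters is an assumption here, and both the fragment data and the suspect measurements are synthetic placeholders rather than values from the glass dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the processed fragment collection: two separated groups in 2-D
fragments = np.vstack([
    rng.normal(loc=[0.2, 0.2], scale=0.03, size=(50, 2)),
    rng.normal(loc=[0.8, 0.7], scale=0.03, size=(50, 2)),
])

# Cluster the known fragments
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fragments)

# Hypothetical measurements for the fragments found on each suspect
suspects = {"A": [0.79, 0.72], "B": [0.82, 0.69], "C": [0.21, 0.18]}
clusters = model.predict(np.array(list(suspects.values())))
predicted = dict(zip(suspects, clusters))

for name, cluster in predicted.items():
    print(f"Suspect {name} = Cluster {cluster}")
```

Cluster numbers are arbitrary labels; what matters is which suspects land in the same cluster, and which labelled fragments share that cluster.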

• Suspect A = Cluster 1
• Suspect B = Cluster 1
• Suspect C = Cluster 0
[Figure labels: C; Window, Jar, Tumbler, Window, Bulb, Window, Bottle]

Other Example Applications
• Recognising different types of landscape in satellite imagery
• Identifying the types of people that use a service
• Identifying different objects in self-driving car footage
• Biological applications, grouping cells and proteins etc.

Part 2 Research

Open Challenges in Clustering
• Stability and consistency
• Big data
• High-dimensional data
• Interpreting and evaluating clustering results
• Selecting the right clustering method
• Semi-supervised clustering

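My Research: Instance Weighted Clustering
[Figure: two copies of a four-instance table with columns x, y, z (rows: 1 1 5; 2 1 8; 4 6 1; 2 1 9). The first copy is annotated with the values 1, 0.5, 0.7; the second with the values 0.8, 0.8, 0.1, 0.9.]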
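To illustrate what weighting instances does to a centroid, here is a small sketch using the four x, y, z instances from the slide. Reading 0.8, 0.8, 0.1 and 0.9 as one weight per instance is an assumption about the figure, not something the slide text states:

```python
import numpy as np

# The four instances (x, y, z) shown on the slide
instances = np.array([
    [1, 1, 5],
    [2, 1, 8],
    [4, 6, 1],
    [2, 1, 9],
], dtype=float)

# Assumed per-instance weights; the third instance is heavily down-weighted
weights = np.array([0.8, 0.8, 0.1, 0.9])

plain_centroid = instances.mean(axis=0)
weighted_centroid = np.average(instances, axis=0, weights=weights)

print("plain mean:   ", plain_centroid)
print("weighted mean:", weighted_centroid.round(3))
```

With these weights the third instance, which sits far from the other three, pulls the centroid much less than it does in the plain mean.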

Local Outlier Factor
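Local Outlier Factor (LOF) compares the density around a point with the density around its nearest neighbours: scores near 1 indicate a point about as dense as its neighbours, and scores well above 1 indicate an outlier. Below is a simplified NumPy sketch of the calculation. It assumes no duplicate points and always uses exactly k neighbours even when distances tie, so it can differ slightly from library implementations:

```python
import numpy as np


def local_outlier_factor(X, k=3):
    """Simplified LOF score for every row of X."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    rows = np.arange(n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours
    k_distance = dist[rows, neighbours[:, -1]]  # distance to the k-th neighbour
    # Reachability distance from p to neighbour o: max(k_distance(o), d(p, o))
    reach = np.maximum(k_distance[neighbours], dist[rows[:, None], neighbours])
    density = 1.0 / reach.mean(axis=1)  # local reachability density
    return density[neighbours].mean(axis=1) / density


# A 5 x 5 grid of points plus one point far away from the grid
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
points = np.vstack([grid, [[10.0, 10.0]]])
scores = local_outlier_factor(points, k=3)
print(scores.round(2))
```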

LOFIWKM
1 LOF scores calculated for the whole dataset
2 Weighted random selection to initialise centroids
3 Instances assigned to nearest clusters
4 New centroid positions calculated using weighted mean
5 Loop back to 3 if not converged
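The five steps can be sketched as follows. This is an illustration and not the published implementation: the slide does not say how LOF scores become weights, so weight = 1 / LOF is an assumption, and the function names are hypothetical. The LOF scores come from scikit-learn's `LocalOutlierFactor`:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def lof_weights(X, n_neighbors=10):
    """Step 1: LOF scores for the whole dataset, mapped to weights.

    Assumption: weight = 1 / LOF, so outliers (LOF well above 1) get small weights.
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    scores = -lof.negative_outlier_factor_  # scikit-learn stores the negated LOF
    return 1.0 / scores


def instance_weighted_kmeans(X, weights, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: weighted random choice of initial centroids (low-weight points rarely picked)
    start = rng.choice(len(X), size=k, replace=False, p=weights / weights.sum())
    centroids = X[start].astype(float)
    for _ in range(max_iter):
        # Step 3: assign every instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the weighted mean of its members
        updated = centroids.copy()
        for j in range(k):
            members = labels == j
            if members.any():
                updated[j] = np.average(X[members], axis=0, weights=weights[members])
        # Step 5: stop once the centroids no longer move
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


# Synthetic data: two groups of 30 points plus one obvious outlier
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(30, 2)),
    rng.normal([10.0, 0.0], 0.5, size=(30, 2)),
    [[5.0, 20.0]],
])
weights = lof_weights(X)
labels, centroids = instance_weighted_kmeans(X, weights, k=2)
print(centroids.round(2))
```

The outlier still gets assigned to a cluster, but its small weight means it barely moves that cluster's centroid.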

LOFIWKM

ILOFIWKM
1 LOF scores calculated for the whole dataset
2 Weighted random selection to initialise centroids
3 Instances assigned to nearest clusters
4 LOF scores calculated per cluster
5 New centroid positions calculated using weighted mean
6 Loop back to 3 if not converged
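The step that sets ILOFIWKM apart is step 4, where LOF is recalculated inside each cluster on every iteration. A sketch of just that step, again assuming weight = 1 / LOF and using a hypothetical helper name; the rule for very small clusters is also an assumption:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def per_cluster_lof_weights(X, labels, n_neighbors=10):
    """Recompute LOF-based weights separately within each cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    weights = np.ones(len(X))
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        if len(members) < 3:
            continue  # assumption: too few points for a useful LOF, keep weight 1
        k = min(n_neighbors, len(members) - 1)
        lof = LocalOutlierFactor(n_neighbors=k).fit(X[members])
        weights[members] = 1.0 / (-lof.negative_outlier_factor_)
    return weights


# Synthetic data: a tight cluster, a loose cluster, and a stray point assigned to the tight one
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.1, size=(30, 2)),
    rng.normal([10.0, 0.0], 1.0, size=(30, 2)),
    [[1.5, 0.0]],
])
labels = np.array([0] * 30 + [1] * 30 + [0])
weights = per_cluster_lof_weights(X, labels)
print(weights[60], weights[:30].min())
```

Scoring within a cluster means a point is judged against the other members of its own cluster, so the stray point stands out against the tight cluster it was assigned to.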

ILOFIWKM

Thanks & Further Information
• Thank you for listening! I am happy to answer questions.
• Excellent practical introduction to the tools used for today’s example
• Modern overview of clustering