Data Clustering: A brief introduction and some areas of ongoing research
This Presentation
• Part 1: Introduction
  • Why data clustering
  • What is data clustering
  • How to get started in data science
  • Use clustering to solve a problem
• Part 2: Research
  • Clustering research
  • My clustering research
• Questions
Part 1: An Introduction to Data Clustering
Background – AI & Data Science
• Understanding intelligence and building intelligent entities
• Definitions of AI vary somewhat: systems that think humanly (copy the brain), think rationally (copy rational thought processes), act humanly (do tasks that require human intelligence) or act rationally (exhibit intelligent behaviour) [1]
• Cross-disciplinary set of skills
• Technology, Math & Statistics, Expertise [2]
• Knowledge Discovery in Databases [3]
[1] Artificial Intelligence: A Modern Approach, Russell and Norvig
[2] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
[3] https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
Data Clustering - Why?
• Unsupervised learning: a machine can learn to recognise different objects without being taught with labels.
• Reduce the amount of labelling required for supervised learning.
• Recognise groups in data too big for humans to work through.
• Recognise clusters in data with more than 3 dimensions, where clusters are hard for humans to spot.
Used as: an exploratory task; a pre-processing step
Data Clustering - What?
• Data clustering is an unsupervised machine learning / data mining technique which groups data instances by a measure of similarity.
• It reveals groups in data, referred to as clusters, labelling the instances in each group as similar.
• It is not to be confused with classification.
• It is subjective.
Image: https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
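The "measure of similarity" is commonly a distance between feature vectors. The slide does not name a specific measure, so the sketch below assumes Euclidean distance and uses made-up 2-D points purely for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three illustrative 2-D instances (made-up values, not from the talk).
p, q, r = (1.0, 1.0), (1.0, 2.0), (5.0, 5.0)

# A smaller distance means the instances are more similar.
d_pq = euclidean(p, q)
d_pr = euclidean(p, r)
```

Here p is closer to q than to r, so a distance-based clustering method would tend to group p with q.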
Data Clustering Visual Example
How to get started with Data Clustering? • The tools: Python, scikit-learn and Jupyter. https://bitbucket.org/paulmogs398/glassclusteringexample/
Detective – Let’s solve a crime!
A burglar has been breaking into a university through the windows and stealing precious clustering research! There are three suspects: A, B and C! Each of the suspects has been found to have glass fragments in their shoes!
Image credit: http://www.fosterfreeman.com/35-products/technology-3/420-glass.html
These are not the first glass fragments the police department has found and analysed. In fact, they have a collection of over 200 fragments. Sadly, the suspects' new glass pieces do not match any of the pieces they have labelled.
[Figure labels: Window, Bottle, Bulb, ?]
The top 10 rows of the glass dataset
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

        RI     Na    Mg    Al     Si     K    Ca    Ba    Fe
0  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.00  0.00
1  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.00  0.00
2  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.00  0.00
3  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.00  0.00
4  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.00  0.00
5  1.51596  12.79  3.61  1.62  72.97  0.64  8.07  0.00  0.26
6  1.51743  13.30  3.60  1.14  73.09  0.58  8.17  0.00  0.00
7  1.51756  13.15  3.61  1.05  73.24  0.57  8.24  0.00  0.00
8  1.51918  14.04  3.58  1.37  72.08  0.56  8.30  0.00  0.00
9  1.51755  13.00  3.60  1.36  72.99  0.57  8.40  0.00  0.11

• RI: refractive index • Na: Sodium • Mg: Magnesium • Al: Aluminium • Si: Silicon • K: Potassium • Ca: Calcium • Ba: Barium • Fe: Iron
• Normalise or standardise the dataset:

             RI        Na        Mg        Al        Si         K        Ca        Ba        Fe
count       214       214       214       214       214       214       214       214       214
mean   0.316744  0.402684  0.597891  0.359784  0.507310  0.080041  0.327785  0.055570  0.111783
std    0.133313  0.122798  0.321249  0.155536  0.138312  0.105023  0.132263  0.157847  0.191056
min    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
25%    0.235843  0.327444  0.471047  0.280374  0.441071  0.019726  0.261152  0.000000  0.000000
50%    0.286655  0.386466  0.775056  0.333333  0.532143  0.089372  0.294610  0.000000  0.000000
75%    0.351514  0.465414  0.801782  0.417445  0.585268  0.098229  0.347816  0.000000  0.196078
max    1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
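The summary statistics above have a minimum of 0 and a maximum of 1 in every column, which is consistent with min-max normalisation. A minimal sketch of that rescaling follows, using only RI, Na and Mg from the first three rows of the glass table; the function name is illustrative and not taken from the talk's code.

```python
import numpy as np

def min_max_normalise(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# RI, Na and Mg for the first three rows of the glass dataset.
X = np.array([
    [1.52101, 13.64, 4.49],
    [1.51761, 13.89, 3.60],
    [1.51618, 13.53, 3.55],
])
X_norm = min_max_normalise(X)
```

Without this step, features with large ranges (such as Si) would dominate any distance-based similarity measure.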
Reducing the dimensionality:
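The slide does not say which dimensionality reduction method was used. As one common option, the sketch below assumes principal component analysis (PCA), implemented with a singular value decomposition on random stand-in data with 9 features.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its first n_components principal components."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))   # stand-in for 9 normalised glass features
X_2d = pca_project(X, n_components=2)
```

Reducing to two components also makes it possible to plot the instances and inspect candidate clusters by eye.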
Instance selection, removing outliers: 214 instances reduced to 171 instances
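The talk does not state which rule was used to select instances. Purely as an illustration, the sketch below assumes a simple z-score rule that drops any instance with a feature more than three standard deviations from its column mean; the threshold and function name are hypothetical.

```python
import numpy as np

def drop_outliers(X, z_max=3.0):
    """Keep rows whose features all lie within z_max standard deviations."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0              # avoid division by zero
    z = np.abs((X - X.mean(axis=0)) / std)
    keep = (z <= z_max).all(axis=1)  # boolean mask of retained rows
    return X[keep], keep

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [25.0, 25.0]                  # inject one obvious outlier
X_clean, keep = drop_outliers(X)
```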
• Clustering the data
• Predicting the type of glass found on our suspects. • Suspect A = Cluster 1 • Suspect B = Cluster 1 • Suspect C = Cluster 0
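Clustering the known fragments and then assigning new ones to a cluster can be sketched with scikit-learn. This assumes k-means as the clustering method and uses synthetic 2-D data in place of the processed glass features; the suspect values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for the processed glass data.
group_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
group_b = rng.normal(loc=1.0, scale=0.1, size=(30, 2))
X = np.vstack([group_a, group_b])

# Fit k-means on the known fragments.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, unlabelled fragments (hypothetical values for three suspects).
suspects = np.array([[0.05, -0.02], [0.01, 0.08], [0.97, 1.03]])
suspect_clusters = model.predict(suspects)
```

Cluster numbers are arbitrary labels: what matters is which suspects share a cluster, and which known glass types fall in that same cluster.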
• Suspect A = Cluster 1 • Suspect B = Cluster 1 • Suspect C = Cluster 0
[Figure labels: C; Window, Jar, Tumbler, Window, Bulb, Window, Bottle]
Other Example Applications
• Recognising different types of landscape in satellite imagery
• Identifying the types of people that use a service
• Identifying different objects in self-driving car footage
• Biological applications, grouping cells and proteins etc.
Part 2: Research
Open Challenges in Clustering
• Stability and consistency
• Big data
• High-dimensional data
• Interpreting and evaluating clustering results
• Selecting the right clustering method
• Semi-supervised clustering
My Research: Instance Weighted Clustering
[Figure: the same x, y, z data table with rows (1, 1, 5), (2, 1, 8), (4, 6, 1), (2, 1, 9), shown twice: once with the weights 1, 0.5, 0.7 and once with the weights 0.8, 0.8, 0.1, 0.9]
Local Outlier Factor
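Local Outlier Factor (LOF) compares the local density around a point with the density around its neighbours. Scores near 1 indicate an inlier and larger scores indicate a more outlying point. A minimal sketch with scikit-learn on synthetic data follows; the neighbourhood size of 20 is an assumption, not a value from the talk.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.vstack([X, [[8.0, 8.0]]])   # one isolated point, at index 100

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
# scikit-learn stores the negated LOF; flip the sign to get the usual score.
lof_scores = -lof.negative_outlier_factor_
```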
LOFIWKM
1. LOF scores calculated for the whole dataset
2. Weighted random to initialise centroids
3. Instances assigned to nearest clusters
4. New centroid positions calculated using weighted mean
5. Loop back to 3 if not converged
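The five steps above can be sketched as follows. This is an illustrative reading of the slide, not the author's implementation: in particular, the slide does not say how LOF scores become instance weights, so the inverse (1 / LOF) used here is a hypothetical choice, as are the function names and parameters.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_weights(X, n_neighbors=20):
    """Hypothetical weighting: inverse LOF, so outliers get a low weight."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    return 1.0 / (-lof.negative_outlier_factor_)

def weighted_kmeans(X, weights, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: weighted random initialisation, favouring high-weight instances.
    idx = rng.choice(len(X), size=k, replace=False, p=weights / weights.sum())
    centroids = X[idx].copy()
    for _ in range(n_iter):
        # Step 3: assign each instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the weighted mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = labels == j
            if members.any():
                new_centroids[j] = np.average(
                    X[members], axis=0, weights=weights[members])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(40, 2)),
    rng.normal(3.0, 0.2, size=(40, 2)),
    [[10.0, 10.0]],                    # an outlier
])
w = lof_weights(X)                     # Step 1: LOF for the whole dataset
centroids, labels = weighted_kmeans(X, w, k=2)
```

Because the outlier receives a small weight, it barely shifts the weighted mean of the cluster it is assigned to.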
LOFIWKM
ILOFIWKM
1. LOF scores calculated for the whole dataset
2. Weighted random to initialise centroids
3. Instances assigned to nearest clusters
4. LOF scores calculated per cluster
5. New centroid positions calculated using weighted mean
6. Loop back to 3 if not converged
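The step that distinguishes ILOFIWKM is the per-cluster LOF calculation. The sketch below shows only that step, under assumptions: weights are taken as 1 / LOF (a hypothetical mapping), clusters with two or fewer members keep a weight of 1, and the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def per_cluster_lof_weights(X, labels, n_neighbors=20):
    """Recompute LOF inside each cluster; return hypothetical 1/LOF weights."""
    weights = np.ones(len(X))
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        if len(members) <= 2:
            continue  # too few points for a meaningful LOF; keep weight 1
        k = min(n_neighbors, len(members) - 1)
        lof = LocalOutlierFactor(n_neighbors=k).fit(X[members])
        weights[members] = 1.0 / (-lof.negative_outlier_factor_)
    return weights

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(40, 2)),
    rng.normal(3.0, 0.2, size=(40, 2)),
    [[6.0, 6.0]],                         # outlier, at index 80
])
labels = np.array([0] * 40 + [1] * 41)    # outlier currently in cluster 1
w = per_cluster_lof_weights(X, labels)
```

Scoring within each cluster means an instance is judged against its own cluster's density rather than the density of the whole dataset.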
ILOFIWKM
Thanks & Further Information
• Thank you for listening! I am happy to answer questions.
• Further reading (shown as images): an excellent practical introduction to the tools used for today’s example; a modern overview of clustering