Ethnicity Classification Through Analysis of Facial Features in SAS

By: Remy Welch
Faculty Advisor: Dr. Cuixian Chen

Motivation

Objective: classify facial images using Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), and k-means clustering in SAS.

Applications: visual surveillance, market research, online photo albums, etc.

Dataset

The dataset consisted of 540 faces, which were categorized into 5 ethnicities based on self-report.

Statistical Techniques

Experiment: The facial images were classified into the 5 ethnicity categories using LDA. The faces were also classified using KNN. Two cross-validation (CV) techniques were used to test the classifications: leave-one-out (LOO) and 5-fold CV. In addition, the data were grouped using a clustering procedure.

LDA: a classification tool used to linearly separate a dataset into different groups.

[Figure: Group 1 Data and Group 2 Data separated by the best linear discriminant function.]

KNN: a classification tool in which an observation is classified based on the majority vote of its k nearest neighbors.

5-fold CV: The data are separated into 5 partitions. Four fifths of the data are used to train the classifier, and the remaining fifth is used to test the predictions. Each partition is used as the testing group once.

LOO CV: One observation is used to test the model, and the rest are used to train it. This is repeated until every observation has been used to test the model.

K-means clustering: separates the data into clusters based on each data point's proximity to the clusters' means. It does not factor in the ground truth (what the people's ethnicities actually are).

[Figure: Cluster 1, Cluster 2, Cluster 3.]

Results

Technique             Accuracy (%)
LDA/LOO               96.1 ± 41.7
LDA/5-fold CV         94.4 ± 1.8
KNN/5-fold CV, k=3    97.9 ± 2.0
KNN/LOO, k=3          70.0 ± 28.5

All Ethnicities Included (C=5)

When all ethnicities were included in the k-means clustering, a good separation is seen between the five clusters. However, the distribution among those clusters does not reflect the true ethnicity distribution of the data. When c=5, Black and Asian faces are the only ethnicities for which a majority is placed in a single cluster.

Table of CLUSTER by ethnicity

Cluster  Statistic     Asian    Black Hispanic   Indian    White    Total  Percent
1        Frequency         1       12        1       14       60       88    16.30
         Row Pct        1.14    13.64     1.14    15.91    68.18
         Col Pct       16.67    14.29    25.00    41.18    14.56
2        Frequency         0       60        0       12        5       77    14.26
         Row Pct        0.00    77.92     0.00    15.58     6.49
         Col Pct        0.00    71.43     0.00    35.29     1.21
3        Frequency         0        0        1        1      121      123    22.78
         Row Pct        0.00     0.00     0.81     0.81    98.37
         Col Pct        0.00     0.00    25.00     2.94    29.37
4        Frequency         4        6        1        5      171      187    34.63
         Row Pct        2.14     3.21     0.53     2.67    91.44
         Col Pct       66.67     7.14    25.00    14.71    41.50
5        Frequency         1        6        1        2       55       65    12.04
         Row Pct        1.54     9.23     1.54     3.08    84.62
         Col Pct       16.67     7.14    25.00     5.88    13.35
Total    Frequency         6       84        4       34      412      540    100.0
         Percent        1.11    15.56     0.74     6.30    76.30

Only Black and White (C=5)

When only Black and White faces were used, approximately 71% of the Black faces were reliably grouped into one cluster for c=2 to 5. No more than 61% of White faces were ever grouped into a single cluster. When c=5, the largest percentage of White faces in a single cluster was only 36% of the total number of White faces.

Table of CLUSTER by ethnicity

Cluster  Statistic     Black    White    Total  Percent
1        Frequency         1      151      152    30.65
         Row Pct        0.66    99.34
         Col Pct        1.19    36.65
2        Frequency         0      116      116    23.39
         Row Pct        0.00   100.00
         Col Pct        0.00    28.16
3        Frequency         9       68       77    15.52
         Row Pct       11.69    88.31
         Col Pct       10.71    16.50
4        Frequency        60        6       66    13.31
         Row Pct       90.91     9.09
         Col Pct       71.43     1.46
5        Frequency        14       71       85    17.14
         Row Pct       16.47    83.53
         Col Pct       16.67    17.23
Total    Frequency        84      412      496   100.00
         Percent       16.94    83.06

Conclusions

The statistical technique that produced the most accurate classification of ethnicity was the KNN procedure cross-validated with 5-fold CV. Under LOO CV, LDA produced the most accurate classification; however, 5-fold CV is a more valid assessment of a procedure's accuracy, so it can be concluded that the KNN procedure was better at predicting ethnicity. It should also be noted that for KNN/LOO, White faces were classified with 100% accuracy.

Clustering did not prove to be very effective at classifying the ethnicities. A large part of this may have been due to the small representation of the Asian, Hispanic, and Indian ethnicities. When only Black and White faces were considered, the clustering procedure separated the Black faces fairly well.
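
As a worked check on the clustering claims, the column percentages of the Black-and-White cross-tabulation can be recomputed from the cluster frequencies. The sketch below uses Python rather than SAS, and its function names are illustrative only. The frequencies come from the poster's table, except the Cluster 1 White count (151), which is an assumption inferred from that cluster's total of 152 and its column percentage of 36.65.

```python
# Frequencies per k-means cluster for the Black/White-only run (c=5).
# Assumption: the Cluster 1 White count (151) is inferred from the
# cluster total of 152; the other counts are read from the poster's table.
counts = {
    1: {"Black": 1, "White": 151},
    2: {"Black": 0, "White": 116},
    3: {"Black": 9, "White": 68},
    4: {"Black": 60, "White": 6},
    5: {"Black": 14, "White": 71},
}


def column_totals(table):
    """Total number of faces per ethnicity across all clusters."""
    totals = {}
    for row in table.values():
        for label, n in row.items():
            totals[label] = totals.get(label, 0) + n
    return totals


def row_pct(table, cluster, label):
    """Share of a cluster's faces that carry the given label (Row Pct)."""
    row = table[cluster]
    return 100.0 * row[label] / sum(row.values())


def col_pct(table, cluster, label):
    """Share of a label's faces that fall in the given cluster (Col Pct)."""
    return 100.0 * table[cluster][label] / column_totals(table)[label]


def best_cluster(table, label):
    """Cluster holding the most faces of `label`, plus its Col Pct."""
    cluster = max(table, key=lambda c: table[c][label])
    return cluster, col_pct(table, cluster, label)


if __name__ == "__main__":
    for label in ("Black", "White"):
        cluster, share = best_cluster(counts, label)
        print(f"{label}: cluster {cluster} holds {share:.2f}% of these faces")
```

Under these counts, the largest single-cluster share is about 71% for Black faces (Cluster 4) and about 36% for White faces (Cluster 1), which matches the percentages quoted in the results. Column percentage is the relevant statistic here because it asks how much of one ethnicity ends up together, regardless of how many faces of other ethnicities share that cluster.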