Fast Attribute-based Unsupervised and Supervised Table Clustering using P-Trees
Dr. William Perrizo, North Dakota State University
Contents
• Introduction
• Predicate-Trees
• FAUST_P Algorithm
• Performance
• Conclusion and Future Work
Introduction
• Exponential growth in image data: NASA, for example, has been capturing Earth images at resolutions down to 15 m since the 1970s
• Data is archived long before it can be properly analyzed
• Existing clustering algorithms are too slow for data at this scale
P-Trees
• Predicate-Trees are data-mining-ready, lossless, compressed data structures
• An effective tool for horizontal processing of vertical data
• Each tree is obtained by recursively partitioning a vertical bit strip of the data
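The vertical layout above can be illustrated with a short sketch. This is a minimal, uncompressed form of the idea (flat bit slices rather than the recursive P-tree nodes); the function names, the 4-bit width, and the sample column are illustrative assumptions, not from the slides.

```python
# Vertical bit slices: slice i collects bit i of every value in a
# column; row r of the table lives at bit position r of each slice.

def bit_slices(column, nbits):
    """Build nbits bit vectors (as Python ints) from one attribute column."""
    slices = [0] * nbits
    for r, v in enumerate(column):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= 1 << r
    return slices

def greater_than(slices, c, n):
    """Bit vector of the rows where A > c, computed with only AND/OR/NOT
    on the slices -- the way predicates such as PA>cH are evaluated."""
    all_rows = (1 << n) - 1
    gt, eq = 0, all_rows
    for i in reversed(range(len(slices))):
        if (c >> i) & 1:
            eq &= slices[i]              # must match the 1-bit of c to stay tied
        else:
            gt |= eq & slices[i]         # a 1 where c has 0: strictly greater
            eq &= all_rows ^ slices[i]   # still tied only if this bit is 0
    return gt

def rows(mask, n):
    """Decode a bit-vector mask back into a list of row indices."""
    return [r for r in range(n) if (mask >> r) & 1]

col = [5, 12, 3, 9]                     # one vertical attribute strip
P = bit_slices(col, 4)
print(rows(greater_than(P, 8, 4), 4))   # rows with A > 8 -> [1, 3]
```

The point of the vertical layout is visible here: evaluating A > c touches only the k bit slices of one attribute, never the n horizontal records.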
FAUST_P Algorithm
PREM = Ppure1 (initially every row remains unclassified)
1. For each attribute and class, calculate the class mean and its gaps to the neighboring class means. Record them in a mean table MT(attr, class, mean, gapL, gapH, gapREL), where gapL is the gap below the mean, gapH the gap above, and gapREL = (gapL + gapH) / (2 * mean); sort MT descending on gapREL.
2. Take the MT record with the maximum gapREL. Set cut points cL = mean - gapL/2 and cH = mean + gapH/2, then compute
   PCLASS = PA>cL & P'A>cH & PREM, and update PREM = PREM & P'CLASS.
3. Repeat step 2 until every class has a P-tree.
4. Repeat steps 1-3 until convergence.

Worked example (Iris dataset; attributes sepal length, sepal width, petal length, petal width; classes setosa, versicolor, virginica): sorting MT on gapREL puts (petal width, setosa, mean = 2, gapL = 12, gapH = 12, gapREL = (12+12)/(2*2) = 6) on top, so the setosa class is separated out first with cut points
cL = mean - gapL/2 = 2 - 12/2 = -4 and cH = mean + gapH/2 = 2 + 12/2 = 8.
With the 5-bit petal-width slices P4,4 ... P4,0,
PA>cH = P4,4 | (P4,3 & (P4,2 | P4,1 | P4,0)),
so Psetosa = PA>cL & P'A>cH & PREM = P'A>cH, and PREM = PREM & P'setosa.
[Figure: sample Iris rows and the bit-slice P-trees for sepal length, sepal width, petal length, and petal width]
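One pass of steps 1-2 above can be sketched as follows. The dictionary data layout, the helper names, and the tiny two-class sample are assumptions for illustration; the gap and cut-point formulas follow the slide (gapREL = (gapL + gapH) / (2 * mean), cL = mean - gapL/2, cH = mean + gapH/2, with a lone neighbor gap reused at the extremes, as in the setosa row of MT).

```python
# Hedged sketch of one FAUST_P step: find the (attribute, class) pair
# with the largest relative gap and derive its cut points cL and cH.

def mean(xs):
    return sum(xs) / len(xs)

def best_cut(data):
    """data maps class -> list of attribute tuples.
    Returns (attr, cls, cL, cH) for the max-gapREL record of MT."""
    nattr = len(next(iter(data.values()))[0])
    best, best_rel = None, -1.0
    for a in range(nattr):
        # class means on attribute a, sorted so neighbors define the gaps
        means = sorted((mean([row[a] for row in rows]), c)
                       for c, rows in data.items())
        for i, (m, c) in enumerate(means):
            below = m - means[i - 1][0] if i > 0 else None
            above = means[i + 1][0] - m if i + 1 < len(means) else None
            gapL = below if below is not None else above  # extremes reuse
            gapH = above if above is not None else below  # their one gap
            rel = (gapL + gapH) / (2 * m)
            if rel > best_rel:
                best, best_rel = (a, c, m - gapL / 2, m + gapH / 2), rel
    return best

# Tiny illustrative run: the well-separated class 'lo' is cut out first,
# and rows with cL < A <= cH (i.e. PA>cL & P'A>cH) form its class P-tree.
data = {'lo': [(2,), (3,)], 'hi': [(10,), (12,)]}
a, c, cL, cH = best_cut(data)
print(a, c, cL, cH)   # -> 0 lo -1.75 6.75
```

Repeating this on the remaining rows (PREM after masking out each separated class) gives steps 3-4 of the slide.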
Performance
• O(k), where k is the number of attributes
• k is significantly smaller than the n that horizontal methods must scan, where n is on the order of billions of records
• 95% accuracy achieved in the first epoch
Conclusion and Future Work
• An extremely fast supervised clustering algorithm based on P-Trees
• Future work: use the standard deviation in place of the mean for better accuracy