PARALLEL K-MEANS CLUSTERING BASED ON MAPREDUCE
Weizhong Zhao, Huifang Ma, Qing He
Chinese Academy of Sciences
LNCS 2009
유윤식

Contents
• Introduction
• Parallel K-Means Algorithm Based on MapReduce
  • K-Means Algorithm
  • MapReduce
  • PKMeans Based on MapReduce
• Experimental Result
• Conclusions

Introduction
• Drawbacks of earlier parallel clustering algorithms
  • They assume that all objects can reside in main memory at the same time
  • Their parallel systems provide restricted programming models and use those restrictions to parallelize the computation automatically
• Using MapReduce
  • Useful for processing and generating large datasets
  • Automatically parallelizes the computation across large-scale clusters of machines

K-Means Algorithm

MapReduce
• Map
  • <key, value> → <key', value'>*
• Reduce
  • <key', value'*> → value''*
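The two signatures above can be made concrete with a small single-process stand-in for a MapReduce job. This is only a sketch: `run_mapreduce`, `parity_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and a real framework would distribute the map and reduce calls across machines.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        # Map: <key, value> -> zero or more <key', value'> pairs
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce: <key', [value', ...]> -> zero or more value''
    return {k: list(reducer(k, vs)) for k, vs in groups.items()}

# Illustrative job (not from the slides): sum numbers by parity.
def parity_mapper(key, value):
    yield ("even" if value % 2 == 0 else "odd", value)

def sum_reducer(key, values):
    yield sum(values)

result = run_mapreduce(enumerate([1, 2, 3, 4, 5]), parity_mapper, sum_reducer)
print(result)
```

The grouping step in the middle plays the role of the shuffle that a framework performs between the map and reduce phases.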

PKMeans Based on MapReduce (#1)
• Map
  • Input: (offset, sample value)
  • Output: (center_idx, <x, y>)
  • Calculates the distance from each sample to every center and emits the index of the closest center
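A minimal sketch of the Map step, assuming samples are already parsed into numeric tuples and that the current centers are available to every map task as a plain list. The name `kmeans_map` and the example centers are assumptions made for illustration, not taken from the slides.

```python
import math

def kmeans_map(offset, sample, centers):
    """Return (index of the nearest center, sample) for one input record.

    `offset` is the record's position in the input and is not used
    for the assignment itself.
    """
    distances = [math.dist(sample, c) for c in centers]
    center_idx = distances.index(min(distances))
    return center_idx, sample

# Assumed current centers, for illustration only.
centers = [(0.0, 1.0), (4.0, 0.0)]
samples = [(0, 2), (3, 0), (5, 0)]
pairs = [kmeans_map(off, s, centers) for off, s in enumerate(samples)]
print(pairs)
```

Ties go to the lower index here because `list.index` returns the first minimum; a different tie-breaking rule would work equally well.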

PKMeans Based on MapReduce (#2)
• Combine (in map tasks)
  • Input: (center_idx, [<x1, y1>, <x2, y2>, ..., <xn, yn>])
  • Output: (center_idx, [n, {<x1, y1>, <x2, y2>, ..., <xn, yn>}])
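The Combine step can be sketched as a local grouping over the output of a single map task. The function name `kmeans_combine` and the list-of-pairs representation are assumptions; the output shape follows the slide, where each center index is paired with a count `n` and the points seen locally.

```python
from collections import defaultdict

def kmeans_combine(map_output):
    """Group one map task's (center_idx, point) pairs by center.

    Returns a list of (center_idx, (n, points)) so that the reducer
    knows how many samples each partial result covers.
    """
    groups = defaultdict(list)
    for center_idx, point in map_output:
        groups[center_idx].append(point)
    return [(idx, (len(points), points)) for idx, points in groups.items()]

# Output of one hypothetical map task.
map_output = [(0, (0, 2)), (1, (3, 0)), (1, (5, 0))]
combined = kmeans_combine(map_output)
print(combined)
```

A common variant emits the per-dimension sum of the points instead of the point list, which keeps the count `n` but reduces the data sent to the reducers.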

PKMeans Based on MapReduce (#3)
• Reduce
  • Input: (center_idx, [n, <x1, y1>, <x2, y2>, ..., <xn, yn>])
  • Output: (center_idx, new center)
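A sketch of the Reduce step, assuming each incoming value is a `(n, points)` pair produced by one combiner and that the new center is the arithmetic mean of all points assigned to that center. The name `kmeans_reduce` is hypothetical.

```python
def kmeans_reduce(center_idx, combined_values):
    """Merge the (n, points) partial results for one center into its new position."""
    total = 0
    sums = None
    for n, points in combined_values:
        total += n
        for point in points:
            if sums is None:
                sums = list(point)
            else:
                sums = [s + x for s, x in zip(sums, point)]
    # New center = per-dimension sum divided by the total sample count.
    return center_idx, tuple(s / total for s in sums)

# Partial result for one center, as a combiner might emit it.
new_center = kmeans_reduce(1, [(2, [(3, 0), (5, 0)])])
print(new_center)
```

Because the counts travel with the partial results, the reducer gives the same answer regardless of how the samples were split across map tasks.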

PKMeans Based on MapReduce (#4)
• Input samples: (0, 2) (0, 1) (3, 0) (2, 3) (5, 0)
• Map output: (c1, (0, 2)) (c2, (3, 0)) (c2, (5, 0)) (c1, (2, 3)) (c1, (0, 1))
• Combine output: (c1, [1, (0, 2)]) (c2, [2, (3, 0), (5, 0)]) (c1, [2, (2, 3), (0, 1)])
• Reduce output: (c1, (0, 1)) (c2, (4, 0))
• Centroid update: Index(c1): (0, 1), Index(c2): (4, 0)
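The flow above can be traced in one process with the sketch below. Several things are assumptions rather than facts from the slide: the starting centers (0, 1) and (4, 0), the split of the five samples into two map tasks, and the name `one_iteration`. With these inputs the assignments match the Map output above and c2 comes out as (4, 0), but the arithmetic mean of the three points assigned to c1 is (2/3, 2) rather than the (0, 1) shown, so the c1 value in the diagram may be illustrative rather than computed.

```python
import math
from collections import defaultdict

def one_iteration(splits, centers):
    """Run one Map -> Combine -> Reduce pass and return the updated centers."""
    shuffled = defaultdict(list)  # center_idx -> [(n, points), ...]
    for split in splits:  # each split plays the role of one map task
        local = defaultdict(list)
        for point in split:  # Map: assign the sample to its nearest center
            idx = min(range(len(centers)),
                      key=lambda i: math.dist(point, centers[i]))
            local[idx].append(point)
        for idx, points in local.items():  # Combine: attach the local count
            shuffled[idx].append((len(points), points))
    new_centers = list(centers)
    for idx, parts in shuffled.items():  # Reduce: mean over all partial results
        n = sum(count for count, _ in parts)
        dims = zip(*(p for _, points in parts for p in points))
        new_centers[idx] = tuple(sum(d) / n for d in dims)
    return new_centers

# Assumed split of the slide's samples across two map tasks.
splits = [[(0, 2), (3, 0), (5, 0)], [(2, 3), (0, 1)]]
centers = [(0.0, 1.0), (4.0, 0.0)]  # assumed starting centers
new_centers = one_iteration(splits, centers)
print(new_centers)
```

Calling `one_iteration` repeatedly with its own output, until the centers stop moving, corresponds to the "update" arrow in the diagram.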

Evaluation Results (#1)
• Evaluation metrics
  • Xu, X., Jager, J., Kriegel, H.P.: A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery 3, 263–290 (1999)
• Speedup (speed)
  • Speedup(m) = run-time on one computer / run-time on m computers
• Scaleup (workload)
  • Measures the ability to grow both the system and the database size
  • Scaleup(DB, m) = run-time for clustering DB on 1 computer / run-time for clustering m * DB on m computers
• Sizeup (data size)
  • Measures how much longer it takes on a given system when the dataset is m times larger
  • Sizeup(DB, m) = run-time for clustering m * DB / run-time for clustering DB
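The three ratios above translate directly into code. The function and parameter names below are hypothetical, and the timings in the usage lines are invented solely to show the calls; they are not measurements from the paper.

```python
def speedup(t_one_computer, t_m_computers):
    """Speedup(m): same dataset, 1 computer vs. m computers."""
    return t_one_computer / t_m_computers

def scaleup(t_db_on_one, t_m_db_on_m):
    """Scaleup(DB, m): DB on 1 computer vs. m * DB on m computers."""
    return t_db_on_one / t_m_db_on_m

def sizeup(t_m_db, t_db):
    """Sizeup(DB, m): m * DB vs. DB on the same system."""
    return t_m_db / t_db

# Hypothetical run-times in seconds, for illustration only.
print(speedup(100.0, 25.0))   # ideal value for m computers would be m
print(scaleup(100.0, 125.0))  # ideal value would be 1.0
print(sizeup(400.0, 100.0))   # ratio for a dataset m times larger
```

Speedup is best when it grows close to linearly with m, scaleup when it stays near 1, and sizeup when it grows no faster than m.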

Evaluation Results (#2)
• Speedup

Evaluation Results (#3)
• Scaleup

Evaluation Results (#4)
• Sizeup

Conclusions
• The parallel K-Means algorithm based on MapReduce (PKMeans) can process large datasets effectively on commodity hardware.