K-means Clustering
Group 15: Swathi Gurram, Prajakta Purohit

Goal
• To program K-means on Twister (iterative MapReduce) and on Hadoop (MapReduce), and to see how the change of framework affects the execution time.

Survey
• Twister
  • Configurable, long-running (cacheable) map/reduce tasks
  • Pub/sub messaging-based communication and data transfers
  • Efficient support for iterative MapReduce computations
  • Combine phase to collect all reduce outputs
  • Data access via local disks

Survey
• Hadoop: a software framework that supports data-intensive distributed applications
• Uses the MapReduce programming model
• Has its own filesystem, HDFS (Hadoop Distributed File System), which is based on the Google File System and tailored to large files
• Manages the distribution of processing and of files, breaking files down into more manageable chunks for processing

Survey
• HaLoop: a modified version of the Hadoop MapReduce framework
• Provides caching options for loop-invariant data access
• Lets users reuse major building blocks from their applications' Hadoop implementations
• Has intra-job fault-tolerance mechanisms similar to Hadoop's
• Reduces query runtimes by a factor of 1.85 compared with Hadoop

K-means Clustering
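The diagrams on these slides did not survive, so as a reminder of what is being parallelized, here is a minimal sequential sketch of the standard K-means (Lloyd's) iteration. It is not the group's code: the function names, the random initialization from the input points, and the convergence tolerance are all assumptions chosen for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    """Plain K-means; initial centroids are k distinct input points (an assumption)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        shift = max(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns the final centroids. The assignment step is independent per point, which is what makes the algorithm a natural fit for a map phase.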

Twister K-means
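One way K-means could be expressed in the iterative map/reduce/combine style described in the survey is sketched below. This is a plain-Python simulation, not Twister's API: `map_task`, `reduce_task`, `combine`, and `iterate_kmeans` are hypothetical names, and the in-memory `partitions` list merely stands in for static data that long-running map tasks would keep cached across iterations.

```python
from collections import defaultdict


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_task(partition, centroids):
    """Map: for one data partition, emit (centroid index, (vector sum, count))."""
    partial = {}
    for point in partition:
        idx = min(range(len(centroids)),
                  key=lambda i: sq_dist(point, centroids[i]))
        total, count = partial.get(idx, ((0.0,) * len(point), 0))
        partial[idx] = (tuple(t + x for t, x in zip(total, point)), count + 1)
    return list(partial.items())


def reduce_task(idx, values):
    """Reduce: merge the partial sums for one centroid into its new position."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vec, n in values:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return idx, tuple(t / count for t in total)


def combine(reduce_outputs, old_centroids):
    """Combine: gather all reduce outputs into one centroid list for the driver."""
    new_centroids = list(old_centroids)  # centroids with no points stay put
    for idx, centroid in reduce_outputs:
        new_centroids[idx] = centroid
    return new_centroids


def iterate_kmeans(partitions, centroids, max_iters=50, tol=1e-9):
    """Driver loop: only the small centroid list changes between iterations."""
    for _ in range(max_iters):
        grouped = defaultdict(list)
        for part in partitions:  # the same in-memory partitions are reused
            for idx, value in map_task(part, centroids):
                grouped[idx].append(value)
        outputs = [reduce_task(idx, vals) for idx, vals in grouped.items()]
        new_centroids = combine(outputs, centroids)
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids
```

Emitting per-partition sums and counts, instead of one record per point, keeps the data moved between map and reduce proportional to the number of centroids. That design choice is an assumption of this sketch, not something stated on the slides.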

Hadoop K-means
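For contrast, a non-iterative MapReduce framework typically runs each K-means iteration as a separate job that re-reads its input and receives the current centroids from outside. The sketch below imitates that with text records in the style of a line-oriented mapper and reducer. It is an illustration only: the group's Hadoop code is not shown in the slides, and `mapper`, `reducer`, `run_job`, and `kmeans_driver` are hypothetical names.

```python
from itertools import groupby


def parse_point(text):
    """Parse a comma-separated record such as '1.0,2.5' into a tuple of floats."""
    return tuple(float(x) for x in text.split(","))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mapper(lines, centroids):
    """Emit 'centroid_index<TAB>point' for every input record."""
    for line in lines:
        point = parse_point(line)
        idx = min(range(len(centroids)),
                  key=lambda i: sq_dist(point, centroids[i]))
        yield f"{idx}\t{line.strip()}"


def reducer(sorted_lines):
    """Average the points of each key; input must be sorted so keys are contiguous."""
    keyed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        points = [parse_point(value) for _, value in group]
        yield int(key), tuple(sum(c) / len(points) for c in zip(*points))


def run_job(input_lines, centroids):
    """One job = one iteration: every record is parsed and mapped again."""
    shuffled = sorted(mapper(input_lines, centroids))  # stands in for shuffle/sort
    new_centroids = list(centroids)
    for idx, centroid in reducer(shuffled):
        new_centroids[idx] = centroid
    return new_centroids


def kmeans_driver(input_lines, centroids, max_iters=50, tol=1e-9):
    """Launch one job per iteration until the centroids stop moving."""
    for _ in range(max_iters):
        new_centroids = run_job(input_lines, centroids)
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids
```

The per-iteration cost of re-reading and re-parsing the input, plus job start-up, is the overhead that caching frameworks such as Twister and HaLoop aim to avoid.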

Twister-Hadoop Comparison
[Chart: execution time in seconds (y-axis) per centroid set (x-axis)]

Centroid set   Twister   Hadoop
1              1.1542    603
2              1.1263    630
3              1.1264    886
4              1.1097    642
5              1.1137    646
6              1.1262    942
7              1.0926    483
8              1.1102    690
9              1.1034    671
10             1.1159    713
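The ten measurements in the comparison above can be summarized with a short calculation. The sketch assumes that both series are in seconds, as the chart's axis label suggests, and that the values pair up with centroid sets 1 to 10 in the order given. The variable names are illustrative.

```python
from statistics import mean

# Values copied from the comparison chart, centroid sets 1..10.
# Assumption: both series are execution times in seconds.
twister_s = [1.1542, 1.1263, 1.1264, 1.1097, 1.1137,
             1.1262, 1.0926, 1.1102, 1.1034, 1.1159]
hadoop_s = [603, 630, 886, 642, 646, 942, 483, 690, 671, 713]

mean_twister = mean(twister_s)
mean_hadoop = mean(hadoop_s)
ratio = mean_hadoop / mean_twister  # how many times longer Hadoop took on average

print(f"mean Twister: {mean_twister:.4f} s")
print(f"mean Hadoop:  {mean_hadoop:.1f} s")
print(f"Hadoop / Twister: {ratio:.0f}x")
```

Under those assumptions, the Hadoop runs vary far more from one centroid set to the next (483 to 942) than the Twister runs (1.0926 to 1.1542).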

Implementation Timeline

Week                  Task                                                    Team member
Oct 24th – Oct 31st   Understand K-means algorithm and design                 Prajakta, Swathi
Nov 1st – Nov 7th     Implement K-means                                       Prajakta, Swathi
Nov 8th – Nov 21st    Implement K-means on Twister and performance analysis   Prajakta, Swathi
Nov 21st – Nov 28th   Optimized validation method for K-means algorithm       Prajakta, Swathi
Nov 29th – Dec 3rd    Implement K-means on Hadoop                             Prajakta, Swathi
Dec 4th – Dec 5th     Performance analysis and presentation                   Prajakta, Swathi
Dec 6th – Dec 12th    Final technical report                                  Prajakta, Swathi

Validation methods

Conclusion
• The Twister framework is faster than Hadoop for iterative MapReduce applications.

Demo

Thank you
