Hadoop Distributed File System (HDFS) Implementation in GENI

Wei Kou - University of Connecticut
Madhav - Missouri University of Science and Technology
Sheyda - University of Missouri Kansas City
Min Sang Yoon - Iowa State University

Contents
• Introduction to Hadoop
• Hadoop configuration in GENI (single site)
• Multiple-site configuration
• Simulation results

HDFS (Hadoop Distributed File System)
• Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
• An HDFS cluster consists of a single name node and a set of data nodes.
• HDFS splits each file into blocks and distributes them across the data nodes; Hadoop uses the MapReduce programming model to process the distributed data.
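To make the idea of distributing a single file concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks and spread over data nodes. The 128 MB block size is an assumption (a common HDFS default, not stated in the slides), the worker names are hypothetical, and the round-robin placement is a toy stand-in rather than HDFS's actual placement policy.

```python
BLOCK_SIZE_MB = 128  # assumption: common HDFS default block size, not from the slides


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def assign_round_robin(blocks, data_nodes):
    """Toy placement: block i goes to node i mod n (NOT HDFS's real policy)."""
    placement = {node: [] for node in data_nodes}
    for index in range(len(blocks)):
        placement[data_nodes[index % len(data_nodes)]].append(index)
    return placement


# Hypothetical node names for a cluster with 5 data nodes.
nodes = [f"worker-{i}" for i in range(5)]
blocks = split_into_blocks(1024)  # a 1024 MB file
for node, block_ids in assign_round_robin(blocks, nodes).items():
    print(node, block_ids)
```

In a real cluster the name node decides block placement and also keeps replicas of each block, which this sketch leaves out.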

Hadoop configuration in GENI (single site)
Topology diagram labels: 172.16.1.10, 172.16.1.14, Master node, 172.16.1.12, 172.16.1.13, 172.16.1.12
- One master node
- 5 data nodes
- Cluster configured

Hadoop configuration in GENI (single site)
• We configured HDFS with a total capacity of 128 GB.

Hadoop configuration in GENI (single site)
• Each data node allocates 25.6 GB to HDFS.
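The capacity figures can be checked with a few lines of arithmetic. The sketch below also estimates how much distinct data would fit if every block were stored three times; the replication factor of 3 is an assumption (Hadoop's usual default for dfs.replication), since the slides do not say which value was used.

```python
DATA_NODES = 5        # from the slides
PER_NODE_GB = 25.6    # from the slides
REPLICATION = 3       # assumption: Hadoop's usual default, not stated in the slides

# Raw capacity is the sum of what each data node contributes.
raw_capacity_gb = DATA_NODES * PER_NODE_GB

# With replication, each block occupies REPLICATION times its size.
usable_gb = raw_capacity_gb / REPLICATION

print(f"raw capacity: {raw_capacity_gb:.1f} GB")
print(f"distinct data at replication {REPLICATION}: {usable_gb:.1f} GB")
```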

Hadoop configuration in GENI (single site)
Screenshots: file distribution command; file list on HDFS (Master, Worker-0)
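The exact commands in the screenshots are not recoverable from the slide text. As a hedged sketch, copying a file into HDFS and listing it is commonly done with `hdfs dfs -put` and `hdfs dfs -ls`; the helper names and the paths in the usage note below are hypothetical, and the timing wrapper is only one possible way to measure distribution time.

```python
import subprocess
import time


def put_command(local_path, hdfs_path):
    """Build the command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def ls_command(hdfs_path):
    """Build the command that lists files under an HDFS path."""
    return ["hdfs", "dfs", "-ls", hdfs_path]


def timed_run(command, runner=subprocess.run):
    """Run a command and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = runner(command, check=False)
    return result.returncode, time.perf_counter() - start
```

On the master node, `timed_run(put_command("dataset.bin", "/user/demo/"))` would upload a hypothetical local file and report how long it took, and `timed_run(ls_command("/user/demo/"))` would list the result. Both require a working Hadoop installation.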

Hadoop configuration in GENI (multiple sites)
• Purpose: to observe how physical distance affects network performance.
• We generated 4 slices, each configured with a different remote site.
• The master node is located at the same site (GPO) in all scenarios.
• Two data nodes are placed at the same site as the master node and three data nodes are placed at a different site.
• All nodes are connected to the same subnet.
Case 1: GPO (master) - Texas A&M
Case 2: GPO (master) - UC Davis
Case 3: GPO (master) - Wayne State University
Case 4: GPO (master) - University of Florida
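For a rough sense of how much delay distance alone can add, the sketch below estimates one-way propagation delay for the site distances reported in the results slide. It assumes signals travel at about 200,000 km/s in optical fiber (roughly two thirds of the speed of light) and that the fiber path equals the geographic distance; real routes are longer, so these are lower bounds and not measurements from the experiment.

```python
MILES_TO_KM = 1.609344
FIBER_SPEED_KM_PER_S = 200_000  # assumption: roughly 2/3 of c in optical fiber


def one_way_delay_ms(distance_miles):
    """Lower-bound one-way propagation delay in milliseconds."""
    distance_km = distance_miles * MILES_TO_KM
    return distance_km / FIBER_SPEED_KM_PER_S * 1000


# Distances (miles) from GPO as reported in the results slide.
sites = {
    "Wayne State University": 717,
    "University of Florida": 1220,
    "Texas A&M": 1862,
    "UC Davis": 3027,
}
for site, miles in sites.items():
    print(f"{site}: {one_way_delay_ms(miles):.1f} ms")
```

Even for the farthest site the estimate stays in the tens of milliseconds per one-way trip, so any large difference in transfer time would have to come from how often the protocol waits on round trips or from other factors such as available bandwidth.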

Hadoop configuration in GENI (multiple sites)
Map of remote sites: Wayne State University, UC Davis, Texas A&M, University of Florida

Simulation configuration
• We generated a 1 Gb dataset for each case.
• We measured the data transmission (distribution) time in each case.
• HDFS capacity: 128 GB.
• 25.6 GB from each data node.

Simulation Result
Screenshots: distribution time (case 2 result); distribution time (case 3 result)

Case                            Distance      Distribution time
Single site                     0 miles       19 seconds
GPO - Wayne State University    717 miles     7 min
GPO - University of Florida     1220 miles    7 min 36 seconds
GPO - Texas A&M                 1862 miles    7 min 55 seconds
GPO - UC Davis                  3027 miles    8 min 30 seconds
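The reported times can be converted to seconds and compared directly. The sketch below computes each remote case's slowdown relative to the single-site run and the extra time per mile between the nearest and farthest remote sites. It uses only the figures from the results slide; the two-point slope is a simple illustration, not a fitted model.

```python
# (case, distance in miles, distribution time in seconds) from the results slide
results = [
    ("Single site", 0, 19),
    ("GPO - Wayne State University", 717, 7 * 60),
    ("GPO - University of Florida", 1220, 7 * 60 + 36),
    ("GPO - Texas A&M", 1862, 7 * 60 + 55),
    ("GPO - UC Davis", 3027, 8 * 60 + 30),
]

baseline_seconds = results[0][2]
remote = results[1:]

# Slowdown of each remote case relative to the single-site run.
slowdowns = {name: seconds / baseline_seconds for name, _, seconds in remote}

# Extra seconds per additional mile, nearest vs. farthest remote site.
_, near_miles, near_seconds = remote[0]
_, far_miles, far_seconds = remote[-1]
extra_seconds_per_mile = (far_seconds - near_seconds) / (far_miles - near_miles)

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.1f}x the single-site time")
print(f"extra time per mile among remote sites: {extra_seconds_per_mile:.3f} s")
```

Going from one site to any two-site setup adds about 400 seconds, while the additional 2310 miles between the nearest and farthest remote sites adds only 90 seconds.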

Simulation Result
Bar chart: distribution time (sec), axis from 0 to 600, for Case 1 through Case 5.

Conclusion & future work
• The Hadoop Distributed File System can be implemented successfully in GENI.
• We observed a relationship between physical distance and distribution time: the time grew from 7 min at 717 miles to 8 min 30 seconds at 3027 miles.
• However, the effect of physical distance was less significant than we expected.
• Other factors should be considered more carefully when deciding how to distribute load across the network.