GRID AND CLOUD COMPUTING
Globus Toolkit and Hadoop
http://web.uettaxila.edu.pk/CMS/FALL2017/teGNCCms/
Courtesy: Dr Gnanasekaran Thangavel
UNIT 6: Globus Toolkit and Hadoop – Open source grid middleware packages – Globus Toolkit (GT6) architecture, components – Usage of Globus – Distributed processing using Hadoop – Introduction to Hadoop – MapReduce, input splitting, map and reduce functions, specifying input and output parameters, configuring and running a job – Design of the Hadoop file system, HDFS concepts, command line and Java interface, dataflow of file read and file write. 10/28/2020
Open source grid middleware packages
• The Open Grid Forum and the Object Management Group (OMG) are two well-established organizations behind the standards.
• Middleware is the software layer that connects software components. It lies between the operating system and the applications.
• Grid middleware is a layer specially designed to sit between hardware and software; it enables the sharing of heterogeneous resources and manages the virtual organizations created around the grid.
• Popular grid middleware packages:
1. BOINC - Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing community.
3. Globus (GT4) - A middleware library jointly developed by Argonne National Lab, Univ. of Chicago, and USC Information Sciences Institute, funded by DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library developed by 20 top universities in China as part of the ChinaGrid Project.
Open source grid middleware packages conti…
5. Condor-G - Originally developed at the Univ. of Wisconsin for general distributed computing, and later extended to Condor-G for grid job management.
6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for business grid applications. Applied to private grids and local clusters within enterprises or campuses.
7. gLite - Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centers as part of the EGEE Project, gLite provides a framework for building grid applications that tap into the power of distributed computing and storage resources across the Internet.
The Globus Toolkit Architecture (GT6)
• The Globus Toolkit is an open middleware library for the grid computing communities. These open source software libraries support many operational grids and their applications on an international basis.
• The toolkit addresses common problems and issues related to grid resource discovery, management, communication, security, fault detection, and portability. The software itself provides a variety of components and capabilities.
• The library includes a rich set of service implementations. The implemented software supports grid infrastructure management, provides tools for building new web services in Java, C, and Python, builds a powerful standards-based security infrastructure and client APIs (in different languages), and offers comprehensive command-line programs for accessing various grid services.
• The Globus Toolkit was initially motivated by a desire to remove obstacles that prevent seamless collaboration, and thus sharing of resources and services, in scientific and engineering applications. The shared resources can be computers, storage, data, services, networks, science instruments (e.g., sensors), and so on.
• The Globus library version GT6 is conceptually shown in the figure on the next slide.
GT6 is binary compatible with GT5 and GT5.2
Grid Security Infrastructure (GSI)
• As grid resources and users are distributed and owned by different organizations, only authorized users should be allowed to access them.
• A simple authentication infrastructure is needed.
• Also, both users and owners should be protected from each other.
• Users need to be assured of the security of their:
– Data
– Code
– Messages
• GSI provides all of the above.
• GSI C is the set of C language libraries included in the Globus Toolkit.
• They help in compiling and patching OpenSSH for use with GSI.
MyProxy
• Online repository of encrypted GSI credentials
• Provides authenticated retrieval of proxy credentials over the network
• Improves usability – retrieve proxy credentials when/where needed without managing private key and certificate files
• Improves security – long-term credentials stored encrypted on a well-secured server
% bin/grid-proxy-init
Your identity: /O=Grid/OU=Example/CN=Adeel Akram
Enter GRID pass phrase for this identity:
Creating proxy .... Done
Your proxy is valid until: Tue Oct 26 01:33:42 2010
Credential Accessibility with MyProxy
• A MyProxy server can be deployed for a single user, a virtual organization, or a Certificate Authority (CA)
• Users can delegate proxy credentials to the MyProxy server for storage – can store multiple credentials with different names, lifetimes, and access policies
• Then, they can retrieve stored proxies when needed using MyProxy client tools – and allow trusted services to retrieve proxies
• No need to copy certificate and key files between machines
GridFTP
• GridFTP is an extension of the File Transfer Protocol (FTP) for grid computing.
• The protocol was defined within the GridFTP working group of the Open Grid Forum.
• There are multiple implementations of the protocol; the most widely used is the one provided by the Globus Toolkit.
RLS
• The Replica Location Service (RLS) provides a framework for tracking the physical locations of data that has been replicated.
• At its simplest, RLS maps logical names (which don't include specific pathnames or storage system information) to physical names (which do include storage system addresses and specific pathnames).
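The logical-to-physical mapping can be sketched in a few lines. This is an illustrative model of the idea only, not the real RLS API; all logical names and replica URLs below are invented for the example.

```python
from collections import defaultdict

class ReplicaCatalog:
    def __init__(self):
        # logical file name -> set of physical replica locations
        self._replicas = defaultdict(set)

    def register(self, logical, physical):
        # Record that a physical copy of the logical file exists.
        self._replicas[logical].add(physical)

    def lookup(self, logical):
        # Return every known physical location of a logical file.
        return sorted(self._replicas[logical])

catalog = ReplicaCatalog()
catalog.register("lfn://experiment/run42.dat",
                 "gsiftp://storage1.example.org/data/run42.dat")
catalog.register("lfn://experiment/run42.dat",
                 "gsiftp://storage2.example.org/mirror/run42.dat")
locations = catalog.lookup("lfn://experiment/run42.dat")
```

A client would then pick any of the returned replicas, typically the nearest one.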
GRAM 5
• The Globus Toolkit includes a set of service components collectively referred to as the Globus Resource Allocation Manager (GRAM).
• GRAM simplifies the use of remote systems by providing a single standard interface for requesting and using remote system resources for the execution of "jobs".
• The most common use (and the best supported use) of GRAM is remote job submission and control. This is typically used to support distributed computing applications.
Globus XIO
• Globus XIO is an extensible input/output library written in C for the Globus Toolkit.
• It provides a single API (open/close/read/write) that supports multiple wire (communication) protocols, with protocol implementations encapsulated as drivers.
• The XIO drivers distributed with 6.0 include TCP, UDP, file, HTTP, GSI, GSSAPI_FTP, TELNET, and queuing.
• In addition, Globus XIO provides a driver development interface for use by protocol developers.
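The driver-stack idea behind XIO can be modeled compactly. This is a Python sketch of the pattern only (the real API is C); the driver and transport here are invented stand-ins.

```python
class Driver:
    # Base driver: passes data through unchanged in both directions.
    def on_write(self, data): return data
    def on_read(self, data): return data

class ReverseDriver(Driver):
    # Stands in for a transform/protocol driver in the stack.
    def on_write(self, data): return data[::-1]
    def on_read(self, data): return data[::-1]

class Handle:
    # One open/write/read surface; behavior comes from the driver stack.
    def __init__(self, drivers):
        self.drivers = drivers
        self.wire = b""            # stands in for the transport below

    def write(self, data):
        for d in self.drivers:     # drivers applied top-down on write
            data = d.on_write(data)
        self.wire += data

    def read(self):
        data = self.wire
        for d in reversed(self.drivers):  # and bottom-up on read
            data = d.on_read(data)
        return data

h = Handle([ReverseDriver()])
h.write(b"hello grid")
```

Swapping the driver list changes the protocol behavior without touching the caller, which is the point of the design.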
The Globus Toolkit 4
The Globus Toolkit
• GT4 offers the middle-level core services in grid applications.
• The high-level services and tools, such as MPI, Condor-G, and Nimrod/G, are developed by third parties for general-purpose distributed computing applications.
• The local services, such as LSF, TCP, Linux, and Condor, are at the bottom level and are fundamental tools supplied by other developers.
• As a de facto standard in grid middleware, GT6 is based on industry-standard web service technologies.
High Level Services and Tools
• DRM – Distributed Resource Management ➢ Resource manager on ASCI supercomputers
• Cactus ➢ Grid-aware numerical solver framework
• MPICH-G2 ➢ Grid-enabled MPI
• globusrun ➢ More complicated version of globus-job-run
• PUNCH ➢ Web browser based resource manager from Purdue University
• Nimrod/G ➢ Models computational jobs, from Monash
• Grid Status ➢ Repository of the state of jobs in the grid
• Condor-G ➢ Condor job management layer for Globus
GT Core Services
• GASS – Globus Access to Secondary Storage ➢ File and executable staging and I/O redirection
• GridFTP – Grid File Transfer Protocol ➢ Reliable, high performance FTP
• MDS – Metacomputing Directory Service ➢ Maintains information about available resources
• GSI – Grid Security Infrastructure ➢ Authentication, authorization via proxies, delegation, PKI, SSL
• Replica Catalog ➢ Manages partial copies of the full data set across the grid
• GRAM – Grid Resource Allocation Management ➢ Allocation, reservation, monitoring, and control of programs on remote systems
• I/O ➢ Wraps TCP, UDP, IP multicast, and file I/O
GT Local Services
• Condor ➢ Job and resource manager for compute-intensive jobs
• MPI – Message Passing Interface ➢ Portability across platforms
• LSF – Load Sharing Facility ➢ Management of batch workload
• PBS – Portable Batch System ➢ Scheduling / resource management
• NQE – Network Queueing Environment ➢ Resource manager on Cray systems
Functionalities of GT4
• Global Resource Allocation Manager (GRAM) - Grid resource access and management (HTTP-based)
• Communication (Nexus) - Unicast and multicast communication
• Grid Security Infrastructure (GSI) - Authentication and related security services
• Monitoring and Discovery Service (MDS) - Distributed access to structure and state information
• Health and Status (HBM) - Heartbeat monitoring of system components
• Global Access of Secondary Storage (GASS) - Grid access to data in remote secondary storage
• Grid File Transfer (GridFTP) - Inter-node fast file transfer
Globus Job Workflow
Globus Job Workflow
A typical job execution sequence proceeds as follows:
• The user delegates his credentials to a delegation service.
• The user submits a job request to GRAM with the delegation identifier as a parameter.
• GRAM parses the request, retrieves the user proxy certificate from the delegation service, and then acts on behalf of the user.
• GRAM sends a transfer request to the RFT (Reliable File Transfer) service, which uses GridFTP to bring in the necessary files.
• GRAM invokes a local scheduler via a GRAM adapter, and the SEG (Scheduler Event Generator) initiates a set of user jobs.
• The local scheduler reports the job state to the SEG. Once the job is complete, GRAM uses RFT and GridFTP to stage out the resultant files. The grid monitors the progress of these operations and sends the user a notification.
Client-Globus Interactions
• There are strong interactions between provider programs and user code. GT4 makes heavy use of industry-standard web service protocols and mechanisms for service description, discovery, access, authentication, and authorization.
• GT4 makes extensive use of Java, C, and Python for writing user code. Web service mechanisms define specific interfaces for grid computing.
• Web services provide flexible, extensible, and widely adopted XML-based interfaces.
Data Management Using GT4
For grid applications one needs to provide access to and/or integrate large quantities of data at multiple sites. The GT4 tools can be used individually or in conjunction with other tools to develop interesting solutions for efficient data access. The following list briefly introduces these GT4 tools:
1. GridFTP supports reliable, secure, and fast memory-to-memory and disk-to-disk data movement over high-bandwidth WANs. Based on the popular FTP protocol for Internet file transfer, GridFTP adds features such as parallel data transfer, third-party data transfer, and striped data transfer. In addition, GridFTP benefits from using the strong Globus Security Infrastructure for securing data channels with authentication and reusability. It has been reported that the grid has achieved 27 Gbit/second end-to-end transfer speeds over some WANs.
2. RFT provides reliable management of multiple GridFTP transfers. It has been used to orchestrate the transfer of millions of files among many sites simultaneously.
3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to information about the location of replicated files and data sets.
4. OGSA-DAI (Globus Data Access and Integration) tools were developed by the UK e-Science program and provide access to relational and XML databases.
Introduction to Hadoop
• Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
• It is part of the Apache project sponsored by the Apache Software Foundation.
• Hadoop essentially provides two things:
– A distributed filesystem called HDFS (Hadoop Distributed File System)
– A framework and API for building and running MapReduce jobs
• It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
• Hadoop offers a software platform that was originally developed by a Yahoo! group. The package enables users to write and run applications over vast amounts of distributed data.
• Users can easily scale Hadoop to store and process petabytes of data in the web space.
• Hadoop is economical in that it comes with an open source version of MapReduce that minimizes overhead in task spawning and massive data communication.
• It is efficient, as it processes data with a high degree of parallelism across a large number of commodity nodes, and it is reliable in that it automatically keeps multiple data copies to facilitate redeployment of computing tasks upon unexpected system failures.
Hadoop Distributed File System (HDFS)
• HDFS is structured similarly to a regular Unix filesystem except that data storage is distributed across several machines.
• It is not intended as a replacement for a regular filesystem, but rather as a filesystem-like layer for large distributed systems to use.
• It has built-in mechanisms to handle machine outages, and is optimized for throughput rather than latency.
HDFS Cluster Machines
• There are two and a half types of machine in an HDFS cluster:
– Datanode - where HDFS actually stores the data; there are usually quite a few of these.
– Namenode - the 'master' machine. It controls all the metadata for the cluster, e.g., which blocks make up a file, and which datanodes those blocks are stored on.
– Secondary Namenode - this is NOT a backup namenode, but a separate service that keeps a copy of both the edit logs and the filesystem image, merging them periodically to keep the size reasonable.
HDFS Cluster Machines
• Name Node
• Data Nodes
• Secondary Name Nodes
• Data can be accessed using either the Java API or the Hadoop command line client. Many operations are similar to their Unix counterparts.
HDFS Command Line
• list files in the root directory
hadoop fs -ls /
• list files in my home directory
hadoop fs -ls ./
• print the contents of a (possibly compressed) file
hadoop fs -text ./file.txt.gz
• upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lies
• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store needed blocks as closely as possible
• The central control node runs a NameNode to keep track of HDFS directories and files, and a JobTracker to dispatch compute tasks to TaskTrackers
• Written in Java; also supports Python and Ruby
Hadoop’s Architecture
• Hadoop Distributed Filesystem
• Tailored to the needs of MapReduce
• Targeted towards many reads of filestreams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large blocksize (64 MB)
• Location awareness of DataNodes in the network
Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere
HDFS Features
• HDFS is optimized differently from a regular file system. It is designed for non-realtime applications demanding high throughput instead of online applications demanding low latency.
• For example, files cannot be modified once written, and the latency of reads/writes is really bad by filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so it can handle workloads no single machine would ever be able to.
HDFS Features Cont...
• Failure tolerant - data can be duplicated across multiple datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3 (everything is stored on three machines).
• Scalability - data transfers happen directly with the datanodes, so your read/write capacity scales fairly well with the number of datanodes
• Space - need more disk space? Just add more datanodes and rebalance
• Industry standard - lots of other distributed applications build on top of HDFS (HBase, MapReduce)
• Optimized for MapReduce
Deployment of TaskTrackers with DataNodes
Hadoop’s Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker process on each node
• The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or requests new jobs
• None of these components is necessarily limited to using HDFS
• Many other distributed file-systems with quite different architectures work
• Many other software packages besides Hadoop's MapReduce platform make use of HDFS
MapReduce
• The second fundamental part of Hadoop is the MapReduce layer.
• This is made up of two sub-components:
– An API for writing MapReduce workflows in Java.
– A set of services for managing the execution of these workflows.
Map and Reduce APIs
• The basic premise is this:
– Map tasks perform a transformation.
– Reduce tasks perform an aggregation.
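The transform/aggregate split can be seen in the classic word count. The sketch below mimics the map → shuffle → reduce flow in pure Python; it uses no Hadoop API, and the input lines are invented for the example.

```python
from collections import defaultdict

def map_fn(line):
    # Map: transform one input line into (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: aggregate every count seen for one key.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                   # map phase
        for key, value in map_fn(line):
            groups[key].append(value)    # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = word_count(["the quick brown fox", "the lazy dog"])
```

In Hadoop the map and reduce functions are distributed across TaskTrackers, and the framework performs the shuffle between them.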
MapReduce Services
• Hadoop MapReduce comes with two primary services for scheduling and running MapReduce jobs. They are:
– the Job Tracker (JT) and
– the Task Trackers (TT)
JT and TTs
• The JT is the master and is in charge of allocating tasks to task trackers and scheduling these tasks globally.
• TTs are in charge of running the Map and Reduce tasks themselves.
JT and TTs Optimization
• Many things can go wrong in a big distributed system, so these services have some clever tricks to ensure that your job finishes successfully:
– Automatic retries - if a task fails, it is retried N times (usually 3) on different task trackers.
– Data locality optimizations - if you co-locate a TT with an HDFS Datanode (which you should), it will take advantage of data locality to make reading the data faster
– Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will blacklist it. No tasks will then be scheduled on this task tracker.
– Speculative execution - the JT can schedule the same task to run on several machines at the same time, just in case some machines are slower than others. When one version finishes, the others are killed.
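The automatic-retry trick can be sketched as a loop that re-runs a failed task on a different tracker each attempt. This is a toy model only; the tracker names and failure pattern are invented for the example.

```python
def run_with_retries(task, trackers, max_attempts=3):
    last_error = None
    for attempt in range(max_attempts):
        tracker = trackers[attempt % len(trackers)]  # pick a different TT
        try:
            return task(tracker)
        except RuntimeError as err:
            last_error = err                          # record and retry elsewhere
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error

def flaky_task(tracker):
    # Succeeds only on tracker "tt2", to exercise the retry path.
    if tracker != "tt2":
        raise RuntimeError(f"task crashed on {tracker}")
    return f"done on {tracker}"

outcome = run_with_retries(flaky_task, ["tt1", "tt2", "tt3"])
```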
Hadoop Framework Tools
Hadoop in the Wild
Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix
• Etc…
Some examples of scale:
• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search
• FB’s Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at ½ PB/day (Nov 2012)
Three main applications of Hadoop
• Advertisement (mining user behavior to generate recommendations)
• Searches (group related documents)
• Security (search for uncommon patterns)
Hadoop Highlights
• Distributed File System
• Fault Tolerance
• Open Data Format
• Flexible Schema
• Queryable Database
Why use Hadoop?
• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Nodes fail every day
• Need common infrastructure
• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
• Runs on heterogeneous OSes
DataNode
• A block server
• Stores data in the local file system
• Stores metadata of a block - checksum
• Serves data and metadata to clients
• Block report
• Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
• Forwards data to other specified DataNodes
Block Placement
• Replication strategy
• One replica on the local node
• Second replica on a remote rack
• Third replica on the same remote rack
• Additional replicas are randomly placed
• Clients read from the nearest replica
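The default placement policy above can be sketched as a small function: first replica on the writer's node, second and third on two nodes of one remote rack. The cluster topology here is invented for the example.

```python
def place_replicas(local_node, nodes_by_rack):
    # Find the writer's rack, then pick any other rack for the copies.
    local_rack = next(r for r, ns in nodes_by_rack.items() if local_node in ns)
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    remote_nodes = nodes_by_rack[remote_rack]
    # Replica 1: local node; replicas 2 and 3: same remote rack.
    return [local_node, remote_nodes[0], remote_nodes[1]]

racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"]}
placement = place_replicas("a1", racks)
```

This keeps one copy cheap to write locally while guarding against the loss of an entire rack.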
Data Correctness
• Use checksums to validate data – CRC32
• File creation
• Client computes a checksum per 512 bytes
• DataNode stores the checksum
• File access
• Client retrieves the data and checksum from the DataNode
• If validation fails, client tries other replicas
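The per-chunk checksumming above can be sketched with the standard library's CRC32: checksum every 512-byte chunk on write, then re-verify the data against the stored checksums on read.

```python
import zlib

CHUNK = 512  # checksum granularity, as on the slide

def chunk_checksums(data):
    # One CRC32 per 512-byte chunk of the block.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, checksums):
    # A mismatch means this replica is corrupt; the client would then
    # fall back to another replica.
    return chunk_checksums(data) == checksums

block = b"x" * 1300                  # spans 3 chunks (512 + 512 + 276)
sums = chunk_checksums(block)
corrupt = block[:-1] + b"\x00"       # damage the final byte
```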
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file
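The steps above can be sketched as a chain: the client hands a block to the first DataNode, and each node stores it and forwards it downstream until every replica is written. This is a toy model with invented node names, not the real HDFS write path.

```python
def pipeline_write(block, datanodes, storage):
    if not datanodes:
        return                                   # end of the pipeline
    head, rest = datanodes[0], datanodes[1:]
    storage.setdefault(head, []).append(block)   # store the block locally
    pipeline_write(block, rest, storage)         # forward to the next node

storage = {}                                     # datanode -> stored blocks
for block in ["blk_0001", "blk_0002"]:           # write the file's blocks in order
    pipeline_write(block, ["dn1", "dn2", "dn3"], storage)
```

Note that the client only talks to the first node; the nodes themselves fan the data down the chain.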
MapReduce Usage
• Log processing
• Web search indexing
• Ad-hoc queries
MapReduce Process (org.apache.hadoop.mapred)
• JobClient – submits the job
• JobTracker – manages and schedules the job, splits the job into tasks
• TaskTracker – starts and monitors task execution
• Child – the process that actually executes the task
References
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, “Distributed and Cloud Computing: Clusters, Grids, Clouds and the Future of Internet”, First Edition, Morgan Kaufmann, an imprint of Elsevier, 2012.
2. www.csee.usf.edu/~anda/CIS6930-S11/notes/hadoop.ppt
3. www.ics.uci.edu/~cs237/lectures/cloudvirtualization/Hadoop.pptx
4. http://www.softwaresummit.com/2003/speakers/BrownGridIntro.pdf
5. http://bigdata-madesimple.com/20-essential-hadoop-tools-for-crunching-big-data/
6. https://developer.yahoo.com/hadoop/tutorial/module1.html
7. https://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
8. https://en.wikipedia.org/wiki/Apache_Hadoop
Assignment #5
• Visit https://developer.yahoo.com/hadoop/tutorial/module1.html and summarize how Hadoop compares to other distributed computing platforms. What are the scenarios where an MPI-based cluster performs better than Hadoop?
• Write a note on the Data Access Frameworks and Orchestration Frameworks mentioned on slide 42 (Hadoop Framework Tools)
Thank You
Questions and Comments?
http://web.uettaxila.edu.pk/CMS/FALL2017/teGNCCms/