OpenMOSIX approach to build scalable HPC farms

OpenMOSIX approach to build scalable HPC farms with an easy management infrastructure
Rosario Esposito (1), Paolo Mastroserio (1), Francesco Maria Taurino (1, 2), Gennaro Tortone (1)
(1) INFN - Napoli
(2) INFM - UDR Napoli
CHEP 2003 – La Jolla (San Diego)

Index
- Introduction
- OpenMosix overview
- Farm setup
- Use cases
- Conclusions

What makes clusters hard?
Setup (administrator)
- setting up a 16-node farm by hand is prone to errors
Maintenance (administrator)
- ever tried to update a package on every node in the farm?
Running jobs (users)
- running a parallel program or a set of sequential programs requires the users to figure out which hosts are available and manually assign tasks to the nodes, or to use software tools based on static process allocation (queue managers)

What is OpenMosix?
Description
OpenMosix is an Open Source enhancement to the Linux kernel providing adaptive (on-line) load balancing between x86 Linux machines.
It uses preemptive process migration to assign and reassign the processes among the nodes to take the best advantage of the available resources.
OpenMosix moves processes around the Linux farm to balance the load, using less loaded machines first.
URL
http://www.openmosix.org

OpenMosix introduction
Execution environment
- farm of [diskless] x86-based nodes, both UP and SMP, connected by a standard or high-speed LAN
Implementation level
- Linux kernel (no library to link with sources)
System image model
- virtual machine with a lot of memory and CPU
Granularity
- process
Goal
- improve the overall (cluster-wide) performance and create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications

OpenMosix architecture (1/5)
Network transparency
- the interactive user and the application-level programs are provided with a virtual machine that looks like a single MP machine
Preemptive process migration
- any user’s process can migrate, transparently and at any time, to any available node
- the migrating process is divided into two contexts:
  - system context (deputy), which may not be migrated from the “home” workstation (UHN)
  - user context (remote), which can be migrated to a diskless node

OpenMosix architecture (2/5)
Preemptive process migration
[Figure: process migration between the master node and a diskless node]

OpenMosix architecture (3/5)
Dynamic load balancing
- initiates process migrations in order to balance the load of the farm
- responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds
- makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load
- the policy is symmetrical and decentralized: all of the nodes execute the same algorithm, and the reduction of the load differences is performed independently by any pair of nodes
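
The pairwise, decentralized policy described on this slide can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the OpenMosix kernel algorithm: every node has the same speed, load is simply a process count, and the names `balance_step` and `balance` are hypothetical.

```python
import random


def balance_step(loads, rng):
    """One decentralized step: a random pair of nodes compares loads.

    If the difference is at least 2 processes, one process "migrates"
    from the busier node to the idler one. Returns True if a migration
    happened. (Assumption: equal node speeds, load = process count.)
    """
    a, b = rng.sample(range(len(loads)), 2)
    high, low = (a, b) if loads[a] >= loads[b] else (b, a)
    if loads[high] - loads[low] >= 2:
        loads[high] -= 1
        loads[low] += 1
        return True
    return False


def balance(loads, steps=1000, seed=0):
    """Run many independent pairwise steps and return the final loads."""
    rng = random.Random(seed)
    loads = list(loads)  # work on a copy
    for _ in range(steps):
        balance_step(loads, rng)
    return loads


if __name__ == "__main__":
    # All 12 processes start on the first of four nodes.
    print(balance([12, 0, 0, 0], steps=2000, seed=1))
```

No node acts as a coordinator here: each step only involves the two nodes that compare their loads, which is the property the slide calls symmetrical and decentralized.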

OpenMosix architecture (4/5)
Memory sharing
- places the maximal number of processes in the farm’s main memory, even if this implies an uneven load distribution among the nodes
- delays swapping out of pages for as long as possible
- bases the decision of which process to migrate, and where to migrate it, on the knowledge of the amount of free memory in other nodes
Efficient kernel communication
- specifically developed to reduce the overhead of the internal kernel communications (e.g. between a process and its home site, when it is executing at a remote site)
- fast and reliable protocol with low startup latency and high throughput
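
A memory-driven migration decision of the kind described above might look like the following sketch. The policy details are assumptions made for illustration only (a fixed low-memory threshold, "largest process that fits", "node with the most free memory"); the function name and parameters are hypothetical and do not come from OpenMosix.

```python
def choose_migration(processes, local_free_mb, remote_free_mb,
                     low_watermark_mb=64):
    """Pick (process_name, target_node), or None if nothing should move.

    processes:      dict of process name -> resident memory in MB (this node)
    local_free_mb:  free memory on this node, in MB
    remote_free_mb: dict of node name -> last known free memory in MB
    low_watermark_mb: hypothetical threshold below which we try to migrate
                      a process instead of swapping pages out
    """
    if local_free_mb >= low_watermark_mb:
        return None  # enough local memory, no need to migrate
    if not remote_free_mb:
        return None  # no information about other nodes
    # Assumed policy: prefer the node with the most free memory.
    target, target_free = max(remote_free_mb.items(), key=lambda kv: kv[1])
    # Assumed policy: move the largest process that still fits there.
    fitting = {name: size for name, size in processes.items()
               if size <= target_free}
    if not fitting:
        return None
    return max(fitting, key=fitting.get), target


if __name__ == "__main__":
    procs = {"sim": 300, "fit": 120, "shell": 4}
    print(choose_migration(procs, 20, {"n2": 150, "n3": 90}))
```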

OpenMosix architecture (5/5)
Probabilistic information dissemination algorithms
- provide each node with sufficient knowledge about available resources in other nodes, without polling
- measure the amount of available resources on each node
- receive the resource indices that each node sends at regular intervals to a randomly chosen subset of nodes
- the randomly chosen subset of nodes is used to support dynamic configuration and to overcome partial node failures
Decentralized control and autonomy
- each node makes its own control decisions independently; there is no master-slave relationship between nodes
- each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption
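
The dissemination scheme above (each node periodically pushes its resource index to a random subset of peers, with no polling) can be simulated in a few lines. This is a minimal sketch, assuming a single "load" value per node and a fixed fan-out; the function name, the `fanout` parameter and the data layout are hypothetical.

```python
import random


def simulate_dissemination(loads, fanout=2, rounds=10, seed=0):
    """Simulate push-style dissemination of per-node load values.

    loads:  dict of node name -> load value advertised by that node
    fanout: how many randomly chosen peers each node informs per round
    Returns views: node -> {peer: (load, round_when_received)}.
    """
    rng = random.Random(seed)
    nodes = sorted(loads)
    views = {node: {} for node in nodes}
    for tick in range(rounds):
        for sender in nodes:
            others = [n for n in nodes if n != sender]
            # No polling: the sender pushes its own index to a random subset.
            for receiver in rng.sample(others, min(fanout, len(others))):
                views[receiver][sender] = (loads[sender], tick)
    return views


if __name__ == "__main__":
    farm = {f"node{i}": i * 0.5 for i in range(8)}
    views = simulate_dissemination(farm, fanout=2, rounds=60, seed=42)
    print(sorted(views["node0"]))
```

Because receivers are picked at random each round, no node depends on a fixed partner, so a node that joins or fails only affects the entries that concern it.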

Farm setup: PXE & ClusterNFS
- diskless nodes
  - low cost
  - eliminates install/upgrade of hardware and software on the diskless client side
  - backups are centralized on a single main server
  - zero administration on the diskless client side

Diskless farm setup: traditional method (1/2)
Traditional method
- Server
  - BOOTP server
  - NFS server
  - separate root directory for each client
- Client
  - BOOTP to obtain an IP address
  - TFTP to load a “tagged kernel” image
  - rootNFS to load the root filesystem

Diskless farm setup: traditional method (2/2)
Traditional method – Problems
- separate root directory structure for each node
- hard to set up
  - lots of directories with slightly different contents
- difficult to maintain
  - changes must be propagated to each directory

ClusterNFS
Description
cNFS is a patch to the standard Universal-NFS server code that “parses” file requests to determine an appropriate match on the server
Example
when client machine foo2 asks for the file /etc/hostname, it gets the contents of /etc/hostname$$HOST=foo2$$
URL
https://sourceforge.net/projects/clusternfs

ClusterNFS features
ClusterNFS allows all machines (including the server) to share the root filesystem
- all files are shared by default
- files for all clients are named filename$$CLIENT$$
- files for a specific client are named filename$$IP=xxx.xxx$$ or filename$$HOST=host.domain.com$$
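
The naming convention above can be mimicked with a small lookup function. This is only an illustration of the idea, not the ClusterNFS implementation: the precedence order (HOST match, then IP match, then `$$CLIENT$$`, then the plain shared file) is an assumption, and `resolve` and its parameters are hypothetical names.

```python
def resolve(path, exported_files, client_host=None, client_ip=None):
    """Return the server-side file that would answer a request for `path`.

    exported_files: set of file names present on the server
    client_host / client_ip: identity of the requesting client; leave both
    as None to model a request made on the server itself.
    Assumed precedence: $$HOST=...$$, $$IP=...$$, $$CLIENT$$, plain file.
    """
    candidates = []
    if client_host is not None:
        candidates.append(f"{path}$$HOST={client_host}$$")
    if client_ip is not None:
        candidates.append(f"{path}$$IP={client_ip}$$")
    if client_host is not None or client_ip is not None:
        candidates.append(f"{path}$$CLIENT$$")
    candidates.append(path)  # files are shared by default
    for candidate in candidates:
        if candidate in exported_files:
            return candidate
    return None


if __name__ == "__main__":
    files = {"/etc/hostname", "/etc/hostname$$HOST=foo2$$",
             "/etc/fstab", "/etc/fstab$$CLIENT$$"}
    # Mirrors the slide's example: foo2 gets its own hostname file.
    print(resolve("/etc/hostname", files, "foo2", "192.168.1.2"))
```

The IP address used in the usage example is made up for the illustration.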

Diskless farm setup with ClusterNFS (1/2)
ClusterNFS method
- Server
  - DHCP and TFTP server
  - ClusterNFS server
  - single root directory for server and clients
- Clients
  - DHCP to obtain an IP address
  - TFTP to load the PXE boot loader and then the kernel image
  - rootNFS to load the root filesystem

Diskless farm setup with ClusterNFS (2/2)
ClusterNFS method – Advantages
- easy to set up
  - just copy (or create) the files that need to be different
- easy to maintain
  - changes to shared files are global
- easy to add nodes
A node can be added to a running farm in 1 minute

VIRGO experiment (Jun 2001) (1/4)
VIRGO is a collaboration between Italian and French research teams for the realization of an interferometric gravitational wave detector.
The main goal of the VIRGO project is the first direct detection of gravitational waves emitted by astrophysical sources.
Interferometric gravitational wave detectors produce a large amount of “raw” data that requires significant computing power to be analysed.
To satisfy this requirement we decided to build a Linux cluster running MOSIX (and now OpenMosix).

VIRGO experiment (Jun 2001) (2/4)
Hardware
Farm nodes: SuperMicro 6010H
- Dual Pentium III 1 GHz
- RAM: 512 Mbyte
- HD: 18 Gbyte
- 2 Fast Ethernet interfaces
- 1 Gbit Ethernet interface (only on master node)
Storage: Alpha Server 4100
- HD: 144 GB

VIRGO experiment (Jun 2001) (3/4)
The Linux farm has been thoroughly tested by executing intensive data analysis procedures based on the Matched Filter algorithm, one of the best ways to search for known waveforms within a signal affected by background noise.
Matched Filter analysis has a high computational cost, as the method consists of an exhaustive comparison between the source signal and a set of known waveforms, called “templates”, to find possible matches. Using a larger number of templates improves the identification of known signals, but a great number of floating-point operations has to be performed.
Running Matched Filter test procedures on the OpenMosix cluster has shown a progressive reduction of execution times, due to the high scalability of the computing nodes and an efficient dynamic load distribution.
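
The exhaustive comparison between a signal and a bank of templates can be sketched as a brute-force correlation search. This is a toy time-domain version for illustration, not the VIRGO analysis code: the normalisation, the template names and the sample values are assumptions, and templates are assumed to be non-zero.

```python
import math


def matched_filter(signal, templates):
    """Slide every template over the signal; return the best match.

    signal:    list of samples
    templates: dict of template name -> list of samples (non-zero)
    Returns (score, template_name, offset) for the highest normalised
    correlation found. The cost grows with
    len(templates) * len(signal) * len(template), which is why a large
    template bank is expensive.
    """
    best = None
    for name, template in templates.items():
        norm = math.sqrt(sum(t * t for t in template))
        for offset in range(len(signal) - len(template) + 1):
            window = signal[offset:offset + len(template)]
            score = sum(s * t for s, t in zip(window, template)) / norm
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best


if __name__ == "__main__":
    chirp = [1.0, -2.0, 3.0, -2.0, 1.0]
    # Made-up low-level background with the chirp buried at offset 7.
    signal = [0.1 if i % 3 == 0 else -0.05 for i in range(20)]
    for i, value in enumerate(chirp):
        signal[7 + i] += value
    print(matched_filter(signal, {"chirp": chirp, "step": [1.0] * 5}))
```

Each template can be searched independently of the others, so one process per template (or per group of templates) is a natural way to spread such a job over the nodes of a farm.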

VIRGO experiment (Jun 2001) (4/4)
[Figure: speed-up of repeated Matched Filter executions]
The increase in computing speed with respect to the number of processors does not follow an exactly linear curve; this is mainly due to the growth of communication time spent by the computing nodes to transmit data over the local area network.
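
Why communication time bends the speed-up curve away from the linear ideal can be seen with a very simple model. This is an illustrative assumption, not a fit to the VIRGO measurements: the compute work is taken to divide perfectly among the nodes, communication cost is taken to grow linearly with the number of extra nodes, and all numbers are made up.

```python
def speedup(n_nodes, compute_time=100.0, comm_time_per_extra_node=0.2):
    """Speed-up of a job on n_nodes under a toy cost model.

    Assumed model: T(n) = compute_time / n + comm_time_per_extra_node * (n - 1)
    Both parameters are hypothetical values in arbitrary time units.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    parallel_time = (compute_time / n_nodes
                     + comm_time_per_extra_node * (n_nodes - 1))
    return compute_time / parallel_time


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        s = speedup(n)
        print(f"{n:2d} nodes: speed-up {s:5.2f}, efficiency {s / n:4.2f}")
```

In this model the efficiency (speed-up divided by the number of nodes) falls steadily as nodes are added, which matches the qualitative behaviour described on the slide.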

ARGO experiment (Jan 2002) (1/3)
The aim of the ARGO-YBJ experiment is to study cosmic rays, mainly cosmic gamma radiation, at an energy threshold of ~100 GeV, by means of the detection of small-size air showers.
This goal will be achieved by operating a full-coverage array in the Yangbajing Laboratory (Tibet, P.R. China) at 4300 m a.s.l.
As we have seen for the Virgo experiment, the analysis of data produced by Argo requires a significant amount of computing power. To satisfy this requirement we decided to implement an OpenMOSIX cluster.

ARGO experiment (Jan 2002) (2/3)
- currently Argo researchers are using a small Linux farm, located in Naples, consisting of:
  - 5 machines (dual 1 GHz Pentium III with 1 Gbyte RAM) running Red Hat 7.2 + openmosix 2.4.13
  - 1 file server with 1 Tbyte of disk space

ARGO experiment (Jan 2002) (3/3)
At this time the Argo OpenMOSIX farm is mainly used to run Monte Carlo simulations using “Corsika”, a Fortran application developed to simulate and analyse extensive air showers.
The farm is also used to run other applications such as GEANT to simulate the behaviour of the Argo detector.
The OpenMOSIX farm is responding very well to the researchers’ computing requirements, and we have already decided to upgrade the cluster in the near future, adding more computing nodes and starting the analysis of real data produced by Argo.
Currently ARGO researchers in Naples have produced ~400 Gbytes of simulated data with this OpenMOSIX cluster.

Conclusions (1/2)
- the most noticeable features of OpenMOSIX are its load-balancing and process migration algorithms, which imply that users need not have knowledge of the current state of the nodes
- this is most useful in time-sharing, multi-user environments, where users do not have the means to follow (and usually are not interested in) the status of the farm (e.g. the load of the nodes)
- parallel applications can be executed by forking many processes, just like on an SMP, while OpenMOSIX continuously attempts to optimize the resource allocation
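
The "fork many processes" style of parallelism mentioned above needs no cluster-specific API. The sketch below uses only standard POSIX calls from Python's `os` module and contains nothing OpenMOSIX-specific; the workload and the function names are hypothetical. On a farm like the one described here, the placement of such child processes would be left to the kernel.

```python
import os


def cpu_bound_task(n):
    """Stand-in workload: sum of squares of the integers below n."""
    return sum(i * i for i in range(n))


def run_forked(inputs):
    """Fork one child per input and collect each result through a pipe."""
    children = []
    for value in inputs:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child process: compute, send the result back, and exit
            # without running any of the parent's remaining code.
            os.close(read_fd)
            try:
                with os.fdopen(write_fd, "w") as pipe:
                    pipe.write(str(cpu_bound_task(value)))
            finally:
                os._exit(0)
        # Parent process: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))
    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            results.append(int(pipe.read()))
        os.waitpid(pid, 0)  # reap the child
    return results


if __name__ == "__main__":
    print(run_forked([10, 100, 1000]))
```

The program is written exactly as it would be for a single SMP machine, which is the point the slide makes: the user does not choose hosts or assign tasks to them.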

Conclusions (2/2)
- building up farms with the “OpenMosix+ClusterNFS” approach requires no more than 2 hours
- with this approach, management of a farm = management of a single server
- this solution has proven to be scalable in farms of up to 32 nodes