
High Performance Computing with Linux Clusters
Mark Silberstein <marks@tx.technion.ac.il>
Haifux Linux Club, Technion, 9.12.2002

What to expect

You will learn…
• Basic terms of HPC and parallel/distributed systems
• What a cluster is and where it is used
• Major challenges in building/using/programming clusters, and some of their solutions
• How to use software utilities to build clusters
• How to program/debug/profile clusters

You will NOT learn…
• Technical details of system administration
• Commercial software cluster products
• How to build High Availability clusters

You can construct a cluster yourself!!!!

Agenda
► High performance computing
• Introduction into Parallel World
• Hardware
• Planning, Installation & Management
• Cluster glue – cluster middleware and tools
• Conclusions

HPC: characteristics
• Grand challenge applications (CFD, Earth simulations, weather forecasts...)
• Requires TFLOPS, soon PFLOPS (2^50)
  – Just to feel it: P-IV XEON 2.4 GHz – 540 MFLOPS
• Huge memory (TBytes)
• Large data sets (PBytes)
  – Experimental data analysis (CERN – nuclear research): tens of TBytes daily
• Long runs (days, months)
  – Time ~ precision (usually NOT linear)
  – CFD: 2x precision => 8x time
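To make the time-vs-precision point concrete, a back-of-the-envelope sketch (the exponent is my assumption for a 3-D grid-based solver; it is not stated on the slide):

```latex
W(h) \propto \left(\frac{L}{h}\right)^{3}
\qquad\Rightarrow\qquad
\frac{W(h/2)}{W(h)} = 2^{3} = 8
```

Here h is the grid spacing and L the domain size; doubling the precision means halving h. If the time step must shrink with h as well (a CFL-type condition), the factor grows to 2^4 = 16.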

HPC: Supercomputers
• Not general-purpose machines: MPP (massively parallel processors)
• State of the art (from the TOP500 list):
  – NEC Earth Simulator: 35.86 TFLOPS
    · 640 x 8 CPUs, 10 TB memory, 700 TB disk space, 1.6 PB mass store
    · Area of the computer = 4 tennis courts, 3 floors
  – HP ASCI Q: 7.727 TFLOPS (4096 CPUs)
  – IBM ASCI White: 7.226 TFLOPS (8192 CPUs)
  – Linux NetworX: 5.694 TFLOPS (2304 XEON P4 CPUs)
• Prices:
  – CRAY: $90.000

Everyday HPC
• Examples from everyday life
  – Independent runs with different sets of parameters (Monte Carlo)
  – Physical simulations
  – Multimedia: rendering, MPEG encoding
  – You name it…
• Do we really need a Cray for this???

Clusters: “Poor man's Cray”
• PoPs, COW, CLUMPS, NOW, Beowulf… – different names, the same simple idea:
  – A collection of interconnected whole computers
  – Used as a single, unified computing resource
• Motivation: HIGH performance for LOW price
  – A CFD simulation runs 2 weeks (336 hours) on a single PC; it runs 28 HOURS on a cluster of 20 PCs
  – 10,000 runs of 1 minute each: ~7 days in total on one PC; with a cluster of 100 PCs, ~1.6 hours
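A quick check of the arithmetic behind the two examples on this slide (all figures taken from the slide itself):

```latex
S_{20} = \frac{T_1}{T_{20}} = \frac{336\,\mathrm{h}}{28\,\mathrm{h}} = 12,
\qquad
E_{20} = \frac{S_{20}}{20} = 60\%
```

```latex
10\,000 \times 1\,\mathrm{min} \approx 6.9\ \text{days serially},
\qquad
\frac{10\,000\,\mathrm{min}}{100\ \text{PCs}} = 100\,\mathrm{min}
```

Note that even embarrassingly parallel work rarely gives a perfect speedup: here 20 PCs give a factor of 12, i.e. 60% efficiency.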

Why clusters
• Price/Performance
• Availability
• Incremental growth
• Upgradeability
• Potentially infinite scaling
• Scavenging (cycle stealing)

Why now
• Advances in CPU capacity
• Advances in network technology
• Tools availability
• Standardization
• LINUX

Why NOT clusters
• Installation
• Administration & maintenance
• Difficult programming model
• Cluster ≠ parallel system

Agenda
• High performance computing
► Introduction into Parallel World
• Hardware
• Planning, Installation & Management
• Cluster glue – cluster middleware and tools
• Conclusions

“Serial man” questions
• “I bought a dual-CPU system, but my Minesweeper does not run faster!!! Why?”
• “Clusters..., ha-ha..., they don't help! My two machines have been connected together for years, but my Matlab simulation does not run faster when I turn on the second one.”
• “Great! Such a pity that I bought a $1M SGI Onyx!”

How a program runs on a multiprocessor
[Diagram: an application is made of processes and threads, which the MP operating system schedules onto processors that share a single memory]

Cluster: Multi-Computer
[Diagram: several nodes, each with its own CPUs, physical memory and OS, connected by a network and glued together by middleware]

Software Parallelism: exploiting computing resources
• Data parallelism
  – Single Instruction, Multiple Data (SIMD)
  – Data is distributed between multiple instances of the same process
• Task parallelism
  – Multiple Instructions, Multiple Data (MIMD)
• Cluster terms
  – Single Program, Multiple Data (SPMD)
  – Serial Program, Parallel Systems: running multiple instances of the same program on multiple systems
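To make SPMD concrete, here is a minimal sketch assuming MPI and a C compiler on every node (illustrative code of mine, not from the talk): each node runs the same binary, and each rank works only on its own share of the data.

```c
/* SPMD sketch: build with an MPI wrapper, e.g.  mpicc spmd_sum.c -o spmd_sum
 * and run on the cluster, e.g.                  mpirun -np 20 ./spmd_sum   */
#include <mpi.h>
#include <stdio.h>

#define N 1000000LL            /* total problem size (an assumed value) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which instance am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances?   */

    /* Data parallelism: every rank sums only its own slice of 0..N-1. */
    long long local = 0, total = 0;
    for (long long i = rank; i < N; i += size)
        local += i;

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %lld, computed by %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}
```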

Single System Image (SSI)
• The illusion of a single computing resource, created over a collection of computers
• SSI levels
  – Application & subsystems
  – OS/kernel level
  – Hardware
• SSI boundaries
  – When you are inside, the cluster is a single resource
  – When you are outside, the cluster is a collection of PCs

Parallelism & SSI
[Chart: parallelism granularity (instruction, process, application/serial application, job) plotted against the level of SSI transparency (kernel & OS, explicit parallel programming, programming environments, resource management). Example systems placed on the chart: MOSIX, PVFS, SCore, DSM, cJVM, cluster-wide PID, Split-C, OpenMP, HPF, ScaLAPACK, PVM, MPI, PBS, Condor. The “ideal SSI” corner is marked – clusters are NOT there.]

Agenda
• High performance computing
• Introduction into Parallel World
► Hardware
• Planning, Installation & Management
• Cluster glue – cluster middleware and tools
• Conclusions

Cluster hardware
• Nodes: commodity off-the-shelf PCs
  – Fast CPU, large RAM, fast HDD
  – Dual CPU preferred (SMP)
• Network interconnect
  – Low latency: the time to send a zero-sized packet
  – High throughput: the size of the network pipe
  – Most common case: 1000/100 Mb Ethernet

Cluster interconnect problem: latency
• High latency (~0.1 msec) & high CPU utilization
  – Reasons: multiple copies, interrupts, kernel-mode communication
• Solutions
  – Hardware: accelerator cards
  – Software: VIA (M-VIA for Linux – 23 µsec); lightweight user-level protocols: Active Messages, Fast Messages
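The latency figure above is the kind of number a ping-pong microbenchmark measures; a hedged sketch over MPI (my own illustration, not a tool mentioned in the talk):

```c
/* Ping-pong latency sketch: ranks 0 and 1 bounce an empty message back and
 * forth; half the average round-trip time is the one-way latency.
 * Run with two processes:  mpirun -np 2 ./pingpong                         */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int reps = 1000;      /* repetition count (arbitrary) */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.1f usec\n", (t1 - t0) / (2.0 * reps) * 1e6);

    MPI_Finalize();
    return 0;
}
```

Over plain Ethernet with TCP you should expect numbers on the order of the ~0.1 msec quoted above; the user-level protocols on this slide exist exactly to cut that down.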

Cluster interconnect problem: throughput
• Insufficient throughput
  – Channel bonding
• High-performance network interfaces + new PCI bus: SCI, Myrinet, ServerNet
  – Ultra-low application-to-application latency (1.4 µsec – SCI)
  – Very high throughput (284–350 MB/sec – SCI)
• 10 Gb Ethernet & InfiniBand
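Throughput can be estimated the same way, just with large messages instead of empty ones; another hedged sketch (the 1 MB message size and repetition count are arbitrary choices of mine):

```c
/* Streaming bandwidth sketch: rank 0 sends large messages to rank 1 and
 * reports MB/s.  Run with two processes:  mpirun -np 2 ./bandwidth        */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 1 << 20;     /* 1 MB per message (assumed) */
    const int reps      = 100;
    int rank;
    char *buf = malloc(msg_bytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0)
            MPI_Send(buf, msg_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, msg_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    /* A zero-byte reply ensures all data has arrived before timing stops. */
    if (rank == 1)
        MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    if (rank == 0)
        MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("throughput: %.1f MB/s\n",
               (double)reps * msg_bytes / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```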

Network Topologies
• Switch
  + Same distance between neighbors
  – Bottleneck for large clusters
• Mesh/Torus/Hypercube
  + Application-specific topology
  – Difficult broadcast
• Both combined

Agenda
• High performance computing
• Introduction into Parallel World
• Hardware
► Planning, Installation & Management
• Cluster glue – cluster middleware and tools
• Conclusions

Cluster planning
• Cluster environment
  – Dedicated cluster farm
    · Gateway-based vs. nodes exposed
  – Opportunistic: nodes are also used as workstations
  – Homogeneous vs. heterogeneous (different OS, different HW)
[Diagram: a cluster farm of resources (R), reached by users (U) either directly or through a gateway (G)]

Cluster planning (cont.)
• Cluster workloads – why discuss this?
  – You should know what to expect
  – Scaling: does adding a new PC really help?
• Serial workload – running independent jobs
  – Purpose: high throughput
  – Cost for the application developer: none
  – Scaling: linear
• Parallel workload – running distributed applications
  – Purpose: high performance
  – Cost for the application developer: high, in general
  – Scaling: depends on the problem and is usually not linear
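A standard way to quantify the “usually not linear” scaling of parallel workloads (not on the slide; added as an illustration) is Amdahl's law: if a fraction f of the work is inherently serial, then p nodes give at most

```latex
S(p) = \frac{1}{\,f + \dfrac{1-f}{p}\,}
\;\le\; \frac{1}{f},
\qquad\text{e.g. } f = 0.05,\ p = 20 \;\Rightarrow\; S \approx 10.3
```

so adding nodes helps less and less, while for a serial workload of independent jobs f ≈ 0 and the scaling stays close to linear.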

Cluster Installation Tools
• Installation tool requirements
  – Centralized management of initial configurations
  – Easy and quick to add/remove a cluster node
  – Automation (unattended install)
  – Remote installation
• Common approach (SystemImager, SIS)
  – A server holds several generic cluster-node images
  – Automatic initial image deployment: the first boot from CD/floppy/network invokes the installation scripts
  – Post-boot auto-configuration (DHCP)
  – Next boot: a ready-to-use system

Cluster Installation Challenges (cont.)
• The initial image is usually large (~300 MB)
  – Slow deployment over the network
  – Synchronization between nodes
• Solution: use a shared FS (NFS) – root on NFS for cluster nodes (HUJI – CLIP)
  + Very fast deployment – 25 nodes in 15 minutes
  + All cluster nodes are backed up on one disk
  + Easy configuration update (even when a node is off-line)
  – NFS server: single point of failure

Cluster system management and monitoring
• Requirements
  – Single management console
  – Cluster-wide policy enforcement: cluster partitioning
  – Common configuration: keep all nodes synchronized
  – Clock synchronization
  – Single login and user environment
  – Cluster-wide event log and problem notification
  – Automatic problem determination and self-healing

Cluster system management tools
• Regular system administration tools – handy services that come with Linux:
  – yp – configuration files; autofs – mount management; dhcp – network parameters; ssh/rsh – remote command execution; ntp – clock synchronization; NFS – shared file system
• Cluster-wide tools
  – C3 (OSCAR cluster toolkit): cluster-wide command invocation, cluster-wide file management, node registry

Cluster system management tools (cont.)
• Cluster-wide policy enforcement
  – Problem: nodes are sometimes down; long execution times
  – Solution: a single policy with distributed execution (cfengine)
    · Continuous policy enforcement
    · Run-time monitoring and correction

Cluster system monitoring tools
• Hawkeye
  – Logs important events
  – Triggers for problematic situations (disk space / CPU load / memory / daemons)
  – Performs specified actions when a critical situation occurs (not implemented yet)
• Ganglia
  – Monitoring of vital system resources
  – Multi-cluster environments

All-in-one cluster toolkits
• SCE – http://www.opensce.org
  – Installation
  – Monitoring
  – Kernel modules for cluster-wide process management
• OSCAR – http://oscar.sourceforge.net
• ROCKS – http://www.rocksclusters.org
  – A snapshot of available cluster installation/management/usage tools

Agenda
• High performance computing
• Introduction into Parallel World
• Hardware
• Planning, Installation & Management
► Cluster glue – cluster middleware and tools
• Conclusions

Cluster glue – middleware
• Various levels of Single System Image
• Comprehensive solutions
  – (open)MOSIX
  – Cluster JVM (a Java virtual machine for a cluster)
  – SCore (user-level OS)
  – Linux SSI project (high availability)
• Components of SSI
  – Cluster file systems (PVFS, GFS, xFS, distributed RAID)
  – Cluster-wide PID (Beowulf)
  – Single point of entry (Beowulf)

Cluster middleware (cont.)
• Resource management
  – Batch-queue systems: Condor, OpenPBS
• Software libraries and environments
  – Software DSM – http://discolab.rutgers.edu/projects/dsm
  – MPI, PVM, BSP
  – Omni OpenMP
  – Parallel debuggers and profilers: PARADYN, TotalView (NOT free)
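For contrast with the message-passing sketch earlier, a minimal example in the shared-memory OpenMP style that Omni targets (illustrative code of mine; running it across cluster nodes would require a cluster-enabled OpenMP runtime):

```c
/* OpenMP sketch: the compiler splits the loop iterations among threads.
 * Build e.g. with:  gcc -fopenmp pi.c -o pi                               */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const long n = 10000000;            /* number of rectangles (assumed) */
    const double h = 1.0 / n;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the reduction clause
     * combines them when the parallel loop ends.                          */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.10f using up to %d threads\n",
           sum * h, omp_get_max_threads());
    return 0;
}
```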

Cluster operating system case study – (open)MOSIX
• Automatic load balancing
  – Uses sophisticated algorithms to estimate node load
• Process migration
  – Home node + migrating part
• Memory ushering
  – Avoids thrashing
• Parallel I/O (MOPI)
  – Brings the application to the data: all disk operations are local

Cluster operating system case study – (open)MOSIX (cont.)
Pros:
+ Ease of use
+ Transparency
+ Suitable for multi-user environments
+ Sophisticated scheduling
+ Scalability
+ Automatic parallelization of multi-process applications
Cons:
– Generic load balancing is not always appropriate
– Migration restrictions: intensive I/O, shared memory
– Problems with explicitly parallel/distributed applications (MPI/PVM/OpenMP)
– Homogeneous OS only
– NO QUEUEING

Batch-queuing cluster systems
• Goal: to steal unused cycles – use a resource when it is idle, release it when its owner is back at work
• Assumes an opportunistic environment: resources may fail, workstations may shut down
• Manages heterogeneous environments: MS W2K/XP, Linux, Solaris, Alpha
• Scalable (2K nodes running)
• Powerful policy management
• Flexibility
• Modularity
• Single configuration point
• User/job priorities
• Perl API
• DAG jobs

Condor basics
• A job is submitted with a submission file
  – Job requirements
  – Job preferences
• Uses ClassAds to match resources and jobs
  – Every resource publishes its capabilities
  – Every job publishes its requirements
  – Many virtual resources may be defined
• Starts a single job on a single resource
• Periodic checkpointing (requires library linkage)
  – If the resource fails, the job restarts from the last checkpoint
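To give a feel for the submission file, here is a hedged sketch in classic Condor submit syntax (the executable name, arguments and job count are made up for illustration):

```
# Hypothetical Condor submit description file
universe     = vanilla
executable   = my_simulation
arguments    = --seed $(Process)
requirements = (OpSys == "LINUX") && (Arch == "INTEL")
rank         = Memory                # prefer machines with more RAM
output       = run.$(Process).out
error        = run.$(Process).err
log          = my_simulation.log
queue 100                            # 100 independent instances of the job
```

The requirements and rank lines are ClassAd expressions: they are matched against the ClassAds that every resource publishes, which is exactly the mechanism described above.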

Condor in Israel
• Ben-Gurion University: a 50-CPU pilot installation
• Technion
  – Pilot installation in the DS lab
  – Possible development of Condor high-availability enhancement modules
  – Hopefully further adoption

Conclusions
• Clusters are a very cost-efficient means of computing
• You can speed up your work with little effort and no money
• You do not necessarily have to be a CS professional to construct a cluster
• You can build a cluster with FREE tools
• With a cluster you can use the idle cycles of others

Cluster info sources
• Internet
  – http://hpc.devchannel.org
  – http://sourceforge.net
  – http://www.clustercomputing.org
  – http://www.linuxclustersinstitute.org
  – http://www.cs.mu.oz.au/~raj (!!!!)
  – http://dsonline.computer.org
  – http://www.topclusters.org
• Books
  – Gregory F. Pfister, “In Search of Clusters”
  – Raj Buyya (ed.), “High Performance Cluster Computing”

The end