The University of Liverpool Condor Pool Ian C. Smith


University of Liverpool Condor Pool
§ contains around 300 machines running the University’s Managed Windows (XP) Service
§ most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine
§ software updates via a weekly re-imaging process
§ single combined submit host / central manager running on Sun Solaris V440 SMP server
§ restricted access to submit host for registered Condor users
§ currently running Condor 7.0.2 (moving to 7.4.2 soon hopefully)
§ policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours
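The job-start policy in the last bullet can be sketched as a simple predicate. This is only an illustration in Python, not the pool's actual Condor configuration: the slides give the 5-minute inactivity figure, but the office hours and the load-average cutoff used below are assumed values, and the function names are hypothetical.

```python
from datetime import datetime, time

# Only the 5-minute figure comes from the slides.
IDLE_SECONDS_REQUIRED = 5 * 60
MAX_LOAD_AVERAGE = 0.3        # assumption: "low load average" is not quantified
OFFICE_START = time(9, 0)     # assumption: office hours are not stated
OFFICE_END = time(17, 0)      # assumption


def in_office_hours(now: datetime) -> bool:
    """True on Monday-Friday between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def may_start_job(now: datetime, idle_seconds: float, load_average: float) -> bool:
    """Outside office hours always allow a job; otherwise require an idle, lightly loaded PC."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= IDLE_SECONDS_REQUIRED and load_average <= MAX_LOAD_AVERAGE
```

For example, `may_start_job(datetime(2010, 3, 1, 10, 0), 60, 0.1)` is rejected because the machine has been idle for only one minute on a weekday morning, while the same machine at 22:00 would be accepted.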

Condor service caveats
§ only suitable for DOS-based applications running in batch mode
§ no communication between processes possible (“pleasantly parallel” applications only)
§ statically linked executables work best (although can cope with DLLs)
§ all files needed by application must be present on local disk (cannot access network drives)
§ no built-in check-pointing or standard output/error streaming
§ shorter jobs more likely to run to completion (10-20 min seems to work best)
§ very long running jobs can be accommodated using Condor DAGMan or user level check-pointing
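User-level check-pointing, as mentioned in the last caveat, amounts to saving the application's state to a file at the end of each short run and reloading it at the start of the next. The following is a minimal Python sketch of that pattern under stated assumptions: the toy accumulation, the state layout and the function names are all hypothetical, and it assumes the checkpoint file is returned with the job's output and supplied as input to the next run.

```python
import os
import pickle


def load_checkpoint(path, default):
    """Return the saved state if a checkpoint file exists, else the default."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    return default


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so an eviction mid-write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:
        pickle.dump(state, fh)
    os.replace(tmp, path)


def run(path, total_steps, steps_per_run):
    """Advance a toy accumulation by at most steps_per_run steps, resuming from path."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    stop = min(total_steps, state["step"] + steps_per_run)
    while state["step"] < stop:
        state["total"] += state["step"]   # stand-in for the real computation
        state["step"] += 1
    save_checkpoint(path, state)
    return state
```

Calling `run` repeatedly with the same checkpoint path continues from where the previous call stopped, so a long calculation can be split into runs short enough to fit the 10-20 minute window.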

MATLAB advantages
§ originally created for developing linear algebra algorithms but now contains many built-in functions geared to different disciplines, divided into toolboxes
§ intuitive interactive environment allows rapid code development
§ simple but powerful file I/O: save <filename>, load <filename> (useful for check-pointing)
§ allows users to create their own functions stored as M-files
§ “standalone” applications can be built from M-files:
  § can run on platforms without MATLAB installed
  § do not need a licence to be able to run
  § can include all toolbox functions
§ APIs available for FORTRAN and C codes (“MEX files”)

MATLAB disadvantages
§ even standalone applications can run slower than equivalent C or FORTRAN implementations
§ standalone applications aren’t quite what they may seem:
  § more than just an .exe – “manifest” file needed to locate run-time libraries
  § need access to MATLAB run-time libraries, usually via MATLAB Component Runtime (150 MB self-extracting .exe)
  § luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive)
§ run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
  § need to run under Condor on local PC
  § configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

Condor/MATLAB Research Applications
§ predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science)
§ modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science)
§ modelling of disease propagation in fish farms (Mathematical Sciences / Earth and Ocean Science)
§ testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics)
§ simulation of the infection of a bacterial cell by a virus (Mathematical Sciences)
§ modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy)

Avian influenza results

Power-saving at Liverpool
§ have around 2,000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
§ original power-saving policy was to power off machines after 30 minutes of inactivity; now hibernate them after 10 minutes of inactivity
§ policy has reduced wasteful inactivity time by ~200,000–250,000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125,000 p.a.
§ makes extensive use of PowerMAN system from Data Synergy comprising:
  § service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
  § Management Reporting Platform: central server from which usage stats can be retrieved and viewed via a web browser

Power-saving at Liverpool
§ Have around 2,000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
§ Original power-saving policy was to power off machines after 30 minutes of inactivity; now hibernate them after 15 minutes of inactivity
§ Policy has reduced wasteful inactivity time by ~200,000–250,000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125,000 p.a.
§ Makes extensive use of PowerMAN system from Data Synergy comprising:
  § service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
  § Management Reporting Platform: central server from which usage stats can be retrieved and viewed via a web browser
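The savings figures quoted above can be cross-checked with some simple arithmetic. The short Python sketch below works out what they imply; the per-PC power draw and the electricity price are not stated in the slides, they are only back-calculated here from the quoted hours, energy and cost, assuming 52 weeks of saving per year.

```python
# Figures quoted in the slides
HOURS_SAVED_PER_WEEK = (200_000, 250_000)
ENERGY_SAVED_MWH_PER_WEEK = (20, 25)
ANNUAL_SAVING_GBP = 125_000
WEEKS_PER_YEAR = 52   # assumption: savings accrue every week of the year


def implied_watts(hours_saved, mwh_saved):
    """Average power per idle PC implied by the hours and energy saved."""
    return mwh_saved * 1_000_000 / hours_saved   # MWh -> Wh, then divide by hours


def implied_price_per_kwh(annual_gbp, mwh_per_week):
    """Electricity price (GBP per kWh) implied by the annual saving."""
    kwh_per_year = mwh_per_week * 1_000 * WEEKS_PER_YEAR
    return annual_gbp / kwh_per_year


for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH_PER_WEEK):
    watts = implied_watts(hours, mwh)
    price = implied_price_per_kwh(ANNUAL_SAVING_GBP, mwh)
    print(f"{hours} h/week, {mwh} MWh/week: {watts:.0f} W per PC, {price:.3f} GBP/kWh")
```

Both ends of the quoted range correspond to an idle draw of 100 W per PC, and the annual saving implies an electricity price of roughly 10 to 12 pence per kWh.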

Adapting Condor for use with power-saving PCs
§ Two main problems:
  § how to ensure Condor jobs are not evicted by hibernating PCs
  § how to wake up dormant PCs to run Condor jobs on-demand
§ Originally used Microsoft system service to power down PCs after 30 min inactivity:
  § runs .bat file which checks if a user is logged in and shuts machine down if not
  § doesn’t detect owner of Condor job as a logged-in user
  § need to check for presence of condor_exe.bat
§ PowerMAN service now prevents job eviction:
  § can provide PowerMAN with a list of “protected programs”
  § ensures that system remains active if a protected program is running
  § include condor_starter process as a protected program (only present while a Condor job is running)

Adapting Condor for use with power-saving PCs
§ Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
  § NICs must remain powered up during hibernation/power-off
  § NICs must be capable of waking machines on receipt of a “magic packet”
  § network must be able to route “magic packets”
§ cron job runs on the submit host and examines the state of the queue (condor_q) and pool (condor_status):
  § if there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up
  § find number of powered-up machines in each “teaching centre” (classroom)
  § estimate the number of hibernating machines in each teaching centre from the total number of machines in each
  § sort centres from highest number of available machines to lowest
  § wake up centres in turn until sufficient machines are woken to meet the demand (or all centres are woken up)
§ MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
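The wake-up procedure above can be sketched roughly as follows. This is an illustrative Python outline, not the cron script actually used at Liverpool: the `Centre` class and function names are hypothetical, "available machines" is assumed to mean the estimated number of hibernating machines in a centre, and demand is counted in machines rather than job slots. The magic packet layout (six 0xFF bytes followed by the target MAC address repeated sixteen times) is the standard Wake-on-LAN format.

```python
from dataclasses import dataclass


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


@dataclass
class Centre:
    name: str
    total_machines: int   # known size of the teaching centre
    powered_up: int       # machines currently reported by the pool

    @property
    def hibernating_estimate(self) -> int:
        """Estimate only: machines that are off or broken are counted as hibernating."""
        return max(self.total_machines - self.powered_up, 0)


def centres_to_wake(idle_jobs, unclaimed_machines, centres):
    """Pick centres, largest hibernating estimate first, until the shortfall is covered."""
    shortfall = idle_jobs - unclaimed_machines
    if shortfall <= 0:
        return []
    chosen, woken = [], 0
    for centre in sorted(centres, key=lambda c: c.hibernating_estimate, reverse=True):
        if woken >= shortfall:
            break
        if centre.hibernating_estimate == 0:
            continue
        chosen.append(centre.name)
        woken += centre.hibernating_estimate
    return chosen
```

Because whole centres are woken at a time, the number of machines woken can exceed the shortfall. In practice each packet would typically be sent as a UDP broadcast to every MAC address listed for the chosen centres.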

Automatic wake up issues
§ Assumes that any job can run on any machine:
  § users cannot choose particular teaching centres or machines in their job Requirements
  § ideally, pool needs to be homogeneous
  § errors in Requirements specification cause severe problems (machines repeatedly wake up then hibernate)
  § cron now includes a “sanity check” for this
§ Can only estimate number of hibernating machines in each centre
§ May wake up more machines than needed

Automatic wake up in action – Condor pool machine statistics

Automatic wake up in action – PowerMAN statistics

Recent and Future Developments
§ starting to make use of automatic wake-up features of Condor 7.4.1 (condor_rooster):
  § cron advertises/updates ClassAds for offline machines
  § Condor matches offline machines to jobs and wakes up machines as needed
  § use slow ramp-up of wake-ups to prevent server “overload”
§ users can now specify memory requirements, processor speed, when to run jobs etc.
§ local tools available to assist in the preparation and running of MATLAB jobs: m_file_submit, matlab_build, matlab_submit