
Condor: The CCLRC Experience
UK Condor Week 2004
John Kewley, Grid Technology Group, e-Science Centre


Outline
o The Challenge of Condor on Personal Workstations
o The Pools: configuration and status
o Our Users

11th October 2004, UK Condor Week. John Kewley, Grid Technology Group, e-Science Centre


The Challenge of Condor on Personal Workstations


Under-abundance of machines
o Windows workstations (but centrally administered)
o Linux desktops (but administered by “owners”)
o Commodity clusters (unavailable, many being decommissioned, no access to root)
o Servers for CVS, backup, external web access, Access Grid (production systems, mission critical)
o Training machines (turned off when not in use; only 4 at present)
o HPCx (No comment!)


Security / Paranoia
o 2 zone firewall separates machines
o No root access to server machines
o No root access to personal Linux workstations
o Personal firewalls
“Not on MY machine you’re not”


Site Firewalls + Flocking
[Diagram: Internal Pool and External Pool]


Site Firewall(s)
o 2 levels of firewall
o Every request for a change in the site firewall needs justification and takes up to 2 working days.
o In theory, every submit node needs to be able to talk to some fixed (configurable) ports and some ephemeral ports on every execute node as well as on the central node.
o In addition, both UDP and TCP need to be opened.
o It would be good if we could have a more precise definition of exactly what is necessary.


Firewalls within a Condor Pool
o Some resource owners have firewalls on their personal workstations.
o Since Condor needs each submit node to be able to talk to every potential execute node, this necessitates the opening of every firewall in the pool to every submit node when it is added.
o Between adding the new node and the firewalls being updated, the firewalled nodes will be unavailable for use. Or are they? Maybe someone should tell Condor!
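
The scaling problem described on this slide can be made concrete with a small sketch. The host names are hypothetical, and the model only assumes what the slide states: every submit node must be able to reach every firewalled execute node. It counts (submit, execute) pairs and says nothing about which ports are involved.

```python
from itertools import product


def required_openings(submit_nodes, firewalled_execute_nodes):
    """Return every (submit, execute) pair that needs a firewall opening.

    Assumption: each submit node must reach each firewalled execute node,
    except itself.
    """
    return {
        (s, e)
        for s, e in product(submit_nodes, firewalled_execute_nodes)
        if s != e
    }


def new_openings(existing_submit, firewalled_execute, new_submit_node):
    """Openings that become necessary when one more submit node joins."""
    before = required_openings(existing_submit, firewalled_execute)
    after = required_openings(
        existing_submit | {new_submit_node}, firewalled_execute
    )
    return after - before


# Hypothetical pool: three submit nodes, two of which are also firewalled
# execute nodes, plus one further firewalled execute node.
submit = {"ws1", "ws2", "central"}
firewalled = {"ws1", "ws2", "ws3"}

print(len(required_openings(submit, firewalled)))
print(sorted(new_openings(submit, firewalled, "ws4")))
```

In this toy pool, adding a single submit node requires one new opening on every firewalled machine, which is why each addition means another round of requests to every owner.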


Adding a new machine to the pool
o If we add a new machine to the pool, the existing firewalls may not have anticipated this.
o The firewalls will likely block this new machine.
o A job from the newly added machine may still be matched to the firewalled resource.
o This job will not be able to run.
o Parts of the system can jam as a result:
  – condor_q on submitting node
  – Subsequent parts of the submit script
  – (maybe also parts of the central node)


Private networks
o Similar “jams” occur if part of your pool (or flock of pools) is on a network that is unavailable to some of the other nodes.
o How can we permit jobs from submit nodes that can access the private network to run on these nodes whilst preventing Condor sending jobs from other submit nodes there?


Workaround Solution
o Mirror the firewall settings using ClassAds.
o The firewall settings can be updated at the whim of the machine owner as long as the ClassAd settings mirror them.
o New users can be added at any time without disruption.
For more details, see my talk in the Security WG.
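
The idea of mirroring firewall settings can be illustrated with a toy matchmaking model. This is only a sketch under stated assumptions: it is not Condor syntax, the attribute name firewall_allows and all host names are hypothetical, and the actual ClassAd expressions used at CCLRC are described in the Security WG talk rather than in this transcript.

```python
from dataclasses import dataclass, field


@dataclass
class MachineAd:
    """Toy stand-in for a machine ClassAd (not real Condor syntax)."""
    name: str
    firewalled: bool = False
    # Hypothetical attribute mirroring the hosts the owner's firewall admits.
    firewall_allows: frozenset = field(default_factory=frozenset)

    def accepts(self, submit_host: str) -> bool:
        """A machine is a valid match only if the submit host can reach it."""
        return (not self.firewalled) or submit_host in self.firewall_allows


def candidate_machines(submit_host, machines):
    """Names of machines a job from submit_host may be matched to."""
    return [m.name for m in machines if m.accepts(submit_host)]


pool = [
    MachineAd("open-node"),
    MachineAd(
        "guarded-node",
        firewalled=True,
        firewall_allows=frozenset({"ws1", "central"}),
    ),
]

print(candidate_machines("ws1", pool))
print(candidate_machines("new-ws", pool))
```

A newly added submit host is simply never matched to a firewalled machine until that machine's owner opens the firewall and updates the mirrored attribute, so nothing jams in the meantime.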


Other problems
o Lack of root access
  – I had to go and grovel to each resource owner not only for permission to install Condor, but for them to log me in as root so I could do the installation.
o Many different Linuxes
  – Condor installs neatly with the rpm on Red Hat family Linuxes. I had no trouble on the other ones, but the additional installation steps I had to perform for updating init.d were different in each case. I now use an updated version of the condor.root issued with the release.


The Pools: Configuration and Status


Strategy
o “Community” approach: everyone has the right to run jobs from their machine.
o 2 Condor Pools
  – One for internal use only
  – One for access by external collaborators and testing


Internal Pool
o Comprised of central node, personal workstations and other “spare” machines.
o Inside “thick” part of site firewall, so no submission access from outside DL (although we expect to flock to/from other CCLRC sites)
o Build up trust by gradually growing pool


External Pool
o Comprised of the remains of a “broken-down” cluster
o Originally Dual “head” node plus 8 workers on a private network. Now Dual + 4 standalone nodes.
o Inside a “thin” firewall, so external access can be granted to collaborators (e.g. ETF/OMII Distributed Build and Test project)
o Originally could be flocked to from the Internal Pool


Configuration (1)
o Always run jobs (this may change at some point)
o The majority of machines are set up for both execute and submit (even the central node at present). There is only one node set up for submit only.
o Additional ClassAds
  – OS Flavour and Version
  – To mirror firewall settings (see Firewall “Avoidance” talk in WG 2 tomorrow)
o Dual-boot nodes are configured for Condor in both of their manifestations


Configuration (2)
o All machines set up the same way (in /opt/condor)

condor.sh for installation in /etc/profile.d:

    # Point Condor tools at the shared configuration and add them to PATH
    CONDOR_ROOT=/opt/condor
    export CONDOR_CONFIG=${CONDOR_ROOT}/etc/condor_config
    export PATH=${PATH}:${CONDOR_ROOT}/bin

condor.csh for installation in /etc/profile.d:

    # csh equivalent of condor.sh
    set condor_root = /opt/condor
    setenv CONDOR_CONFIG "${condor_root}/etc/condor_config"
    set path = ( ${path} ${condor_root}/bin )

o Common condor_config.local for inclusion
o Common condor init.d script with several enhancements over packaged one


The Pools: Configuration and Status


Internal Pool Stats
o 11 resource “Owners” at 2 sites
o 11 OS Variants
o 1 submit-only node (head-node of e-HTPX cluster – Red Hat 9)
o 27 Processors on 21 execution Machines (including central node)
  • 6 Windows
    – 3 x Windows XP Professional
    – 2 x Windows 2000 Professional
    – 1 x Windows NT 4.0 Workstation
  • 21 Linux
    – 6 x SuSE Linux 9.0
    – 2 x SuSE Linux 8.0
    – 5 x White Box Enterprise Linux 3.0
    – 1 x Red Hat Enterprise Linux 3.0
    – 3 x Red Hat Linux 9
    – 2 x Red Hat Linux 8.0
    – 1 x Mandrake Linux 10.0
    – 1 x Gentoo Linux 1.4


condor_status

    $ condor_status -f "%-6s" Arch -f "%-7s" OpSys -f " %-12s" OPSYS_FLAVOUR -f "\n" OpSys | sort | uniq -c

[Output table flattened in the transcript. Counts column: 1 1 1 2 3 1 2 6 5 1 3 2. Arch: INTEL. OpSys: LINUX, WINNT40, WINNT51. OPSYS_FLAVOUR: Gentoo, Mandrake10, RH80, RH9, RHEL2, SUSE80, SUSE90, WBL.]
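
The pipeline on this slide is a grouped count over (Arch, OpSys, OPSYS_FLAVOUR) rows. A rough Python equivalent is sketched below; the rows are made-up illustrations, not the pool's actual output, and the function name tally is hypothetical.

```python
from collections import Counter


def tally(rows):
    """Count identical rows and return (count, row) pairs sorted by row,
    mirroring what `sort | uniq -c` does for lines of text."""
    counts = Counter(rows)
    return [(counts[row], row) for row in sorted(counts)]


# Illustrative rows only; these are not the pool's real condor_status output.
rows = [
    ("INTEL", "LINUX", "SUSE90"),
    ("INTEL", "WINNT51", ""),
    ("INTEL", "LINUX", "SUSE90"),
    ("INTEL", "LINUX", "RH9"),
]

for count, row in tally(rows):
    print(f"{count:7d} {' '.join(row)}")
```

Counting on the custom flavour attribute rather than on OpSys alone is what lets the summary distinguish the different Linux distributions.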


External Pool Stats
o 2 resource “owners”
o 2 OS Variants
o Can flock to/from pools at 4 other sites
o In the process of adding GSI Security
o 5 Machines containing 6 Linux Processors:
  – 2 x Red Hat Linux 7.3
  – 4 x White Box Enterprise Linux 3.0 (currently disabled since inaccessible from outside due to firewall restrictions)


Our Users


e-HTPX
The e-HTPX project is developing a Grid-based e-science environment to allow structural biologists remote, integrated access to web and grid technologies associated with protein crystallography.
http://clyde.dl.ac.uk/e-htpx/index.htm


e-HTPX Workflow
Stage 1 – Select protein target
Stage 2 – Crystallization of Protein
Stage 3 – Data Collection (X-ray diffraction images, Scaling and Integration)
Stage 4 – Structure Solution (HPC data processing to derive digital protein model)
Stage 5 – Submit model into public database

A single all-encompassing web interface from which users can initiate, plan, direct and document the experimental workflow either locally or remotely from a desktop computer.

[Diagram labels: Start, Target Selection, Structure Solution, Finish]


e-HTPX Structure Solution
o Given a target sequence for a protein, the Protein Data Bank (PDB) is searched for similar sequences.
o The corresponding structures are downloaded for use in a high-throughput system for determining the structure of the target protein.
o Depending on the protein structure size and matching criteria, up to several hundred structures can be downloaded. The modelling for these is carried out by submitting multiple jobs to the cluster and/or Condor pool.


e-HTPX Structure Solution


CCP1 / GAMESS-UK
CCP1: “The Electronic Structure of Molecules”
http://www.ccp1.ac.uk/
GAMESS-UK is a multi-method ab initio molecular electronic structure program.


CCP1 / GAMESS-UK
o GAMESS-UK is a quantum-mechanical molecular modelling program used by chemists, physicists and biologists to run molecular calculations.
o Given the nuclear coordinates of a molecule, GAMESS-UK calculates a wavefunction that describes its electronic properties.
o From the wavefunction, various molecular properties (e.g. shape, energetics and reactivity) can be calculated.
http://www.cfs.dl.ac.uk/


GAMESS-UK + Condor
The following are being investigated:
o Building GAMESS-UK and running its tests in a variety of environments (OS, compilers, libraries)
o Using the pool to build release packages of a cut-down evaluation version of the software.
o Using Condor as it is intended: submitting many jobs to ascertain.


ETF “Build and Test” Testbed
o The external pool is part of the ETF “Build and Test” testbed.
o Software bundles are distributed to a variety of OS types around the flocked pool for building and testing.
o This type of (flocked) pool relies on heterogeneity; only small numbers of each type are required.
http://polaris.ecs.soton.ac.uk:65000/
http://wiki.nesc.ac.uk/read/sfct?HomePage
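
One way to picture distributing a build to several OS types is to generate one vanilla-universe submit description per flavour. The sketch below is an illustration only, not the ETF testbed's tooling: OPSYS_FLAVOUR is the pool-specific attribute shown on the condor_status slide, while the flavour spellings, the executable name build_and_test.sh and the output file names are assumptions.

```python
def submit_description(flavour, executable="build_and_test.sh"):
    """Return a vanilla-universe submit description pinned to one OS flavour.

    OPSYS_FLAVOUR is the pool-specific attribute from the talk; the
    executable and file names are hypothetical.
    """
    return "\n".join([
        "universe     = vanilla",
        f"executable   = {executable}",
        f'requirements = (OPSYS_FLAVOUR == "{flavour}")',
        f"output       = build.{flavour}.out",
        f"error        = build.{flavour}.err",
        f"log          = build.{flavour}.log",
        "queue",
    ]) + "\n"


# Flavour values as read from the transcript; exact spelling is an assumption.
flavours = ["SUSE90", "RH9", "WBL"]
descriptions = {f: submit_description(f) for f in flavours}

print(descriptions["RH9"])
```

Each generated description could be written to a file and passed to condor_submit. Because every job is pinned to a single flavour, one machine of each type is enough for this kind of use.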


Other non-HTC Uses
o I want to ensure my code compiles without warnings and/or runs its basic tests
  – On as many OSs as possible
  – With as many different compilers as possible
o I want to perform a release build of my product for platform X, but I only have accounts on A, B and C.
o I have several server-licensed products and many potential occasional users. How can these be made available to them more easily (within the bounds of the licence of course!)?


In Conclusion


Summary
o 12 brave souls have offered up their personal workstations so others can run arbitrary vanilla jobs.
o Installations have been made on 12 different operating systems.
o Both pools are now in use. Provision of administrative support is underway (web page, user guide, etc.)
o Distributed build is great!
o Firewalls are not (although I now understand firewalls a lot better)!


Final Thoughts
o Setting up a Condor pool of personal workstations requires considerable coaxing, convincing, coercion and cajoling.
o Flocking through firewalls should be easier. Something needs doing, at least for flocking.
o Distributed build can be very useful, but Condor’s default ClassAds could do with extending (at least to describe the OS more accurately).
o What use can be made of pools which are seriously heterogeneous?