Farming with Condor
Douglas Thain
thain@cs.wisc.edu
INFN Bologna, December 2001

Outline
• Introduction
  – What is Condor? Why Condor on the Farm?
• Components
  – Daemons, pools, flocks, ClassAds
• Short Example
  – Executing 1000 jobs.
• Complications
  – Firewalls, security, etc.

The Condor Project (Est. 1985)
Distributed systems CS research performed by a team that faces:
  – software engineering challenges in a UNIX/Linux/NT environment,
  – active interaction with users and collaborators,
  – daily maintenance and support challenges of a distributed production environment,
  – and educating and training students.
Funding: NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School

A Bird of Opportunity
[Diagram: machines marked Busy or Idle surround a Central Manager. Idle machines announce “I am idle.” and a submitter with queued jobs announces “I have work.”]
Over the course of a week, 80% of a desktop machine’s time is wasted.

The Condor Principle: The owner is absolutely in charge!
The Condor Corollary: The visitor must be prepared for the unexpected!

Tricky Details
• What if the user returns?
  – Checkpoint the job periodically.
  – Restart the job elsewhere from a checkpoint.
• What if the machine does not have your files?
  – Perform I/O via Remote System Calls.
• These two features require that you link with the Condor C library.
• Can’t relink? You may still use Condor, but with some loss of opportunities.
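The checkpoint-and-restart idea can be illustrated with a toy sketch. It is not Condor's checkpoint mechanism (which works through the Condor C library); here a hypothetical job saves its own loop state to a file. The names `run_job`, `ckpt_every` and `stop_after` are assumptions made for illustration.

```python
import json
import os


def run_job(total_steps, ckpt_path, ckpt_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    stop_after simulates the owner returning: the job is evicted after
    that many steps in this run. Returns (finished, total_so_far).
    """
    # Resume from a checkpoint if one exists, otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            # Evicted: work done since the last checkpoint is lost.
            return False, state["total"]
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % ckpt_every == 0:
            # Write to a temporary file, then rename, so a crash never
            # leaves a half-written checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return True, state["total"]
```

Calling `run_job` once with `stop_after` set and then again without it shows the second run picking up from the last saved step rather than from zero.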

Checkpointing
[Diagram: Job → Checkpoint → Restart → Job]

Remote System Calls
Just like home!
[Diagram: the Job’s Remote System Calls go to a Shadow, which accesses the Disk on its behalf.]

The INFN Condor Pool

Top 10 Condor Pools: 226 Condor Pools 5576 Condor Hosts

Back to the Farm
• The cluster is the new engine of scientific computing.
• Inexpensive to:
  – procure
  – expand
  – repair

The Ideal Cluster
• The ideal cluster has every node identical, in every way:
  – CPU
  – Memory
  – File system
  – User accounts
  – Software installation
• Users expect to be able to execute on any node.
• Some models (MPI) require perfectly matched nodes.

The Bad News
• Keeping the entire cluster available for use is very difficult when users expect complete symmetry!
• Software failures:
  – Full disk, wild process, etc.
• Hardware failures:
  – Replace with exact match? (not best buy)
  – Replace with better hardware? (goes unused)
• Much better to query rather than assume the state of the cluster.

High Throughput Computing is a 24-7-365 activity.
FLOPY ≠ (60*60*24*7*52)*FLOPS
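One reading of this slide is that floating-point operations per year (FLOPY) are not simply peak FLOPS multiplied by the seconds in a year, because a machine rarely does useful work every second. The arithmetic can be sketched as follows; `useful_fraction` is a hypothetical parameter introduced here, not a quantity from the talk.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52  # 31,449,600 seconds, the factor on the slide


def peak_flopy(flops):
    """Upper bound: full speed for every second of the year."""
    return SECONDS_PER_YEAR * flops


def delivered_flopy(flops, useful_fraction):
    """Operations delivered if only useful_fraction of the year does useful work."""
    if not 0.0 <= useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be between 0 and 1")
    return peak_flopy(flops) * useful_fraction
```

Under these assumptions, a machine that is productive half the time delivers half of its peak yearly figure, however fast it is per second.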

Why Condor on the Farm?
• Condor is expert at managing very heterogeneous resources for high-throughput computing.
• Large clusters, despite our best efforts, will always be slightly heterogeneous.
  – (It may not be in your financial interests to keep them perfectly homogeneous.)
• Condor assists users in making progress, despite the imperfections of the cluster.
  – Few users *require* the whole identical cluster.
  – The pursuit of cluster perfection is then an issue of small throughput improvements, rather than a choice between zero and maximum throughput.

Basic HTC Mechanisms
• Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).
• Persistence - records are kept in stable storage -- any component may crash and reboot.
• Asynchronous API - enables management of dynamic (opportunistic) resources.
• Checkpointing - enables preemptive-resume scheduling (go ahead and use it as long as it is available!).
• Remote I/O - enables remote (from execution site) access to local (at submission site) data.

City Bird, Country Farm
• The lessons learned and techniques used in stealing cycles from workstations are just as important when trying to maximize the throughput of a homogeneous cluster.

Outline
• Introduction
  – What is Condor? Why Condor on the Farm?
• Components
  – Daemons, pools, flocks, ClassAds
• Short Example
  – Executing 1000 jobs.
• Complications
  – Firewalls, security, etc.

Components
• Condor can be quite complicated:
  – Many daemons, many connections, many logs...
• The complexity is necessary and desirable:
  – Each process represents an independent interest:
    • Machine requirements (startd)
    • User requirements (schedd)
    • System requirements (central manager)
• Explain the structure by working from the bottom up.

A Single Machine
[Diagram: one machine reporting “Machine state and policy.” to the Central Manager.]
• condor_startd monitors the machine:
  – keyboard: User present?
  – cpu: Speed? Load?
  – RAM: Size?
  – disk: Size? Avail?
• condor_master emails the administrator: “Something is wrong!”
• Local policy file: “Only run jobs submitted from Bologna or Milan. Prefer jobs owned by thain. Evict jobs that don’t fit in memory.”

A Single Pool
[Diagram: several machines, each running condor_startd over its cpu, RAM and disk, report “Machine state and policy.” to one Central Manager.]
• Global Policy (Central Manager): “All things being equal, Bologna gets 2x as many machines as Milan.”
• Local Policy (per machine): “I prefer thain”, “I prefer mazzanti”, “I don’t care.”
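A global policy such as “Bologna gets 2x as many machines as Milan” amounts to a weighted split of the pool. The sketch below shows one way to compute such a split; it is an illustration only, not the algorithm Condor's central manager uses, and the function name `share_machines` is made up for this example.

```python
def share_machines(n_machines, weights):
    """Split n_machines among groups in proportion to their weights.

    Uses the largest-remainder method so the shares always sum to n_machines.
    """
    total_weight = sum(weights.values())
    exact = {g: n_machines * w / total_weight for g, w in weights.items()}
    shares = {g: int(x) for g, x in exact.items()}
    leftover = n_machines - sum(shares.values())
    # Give any leftover machines to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - shares[g], reverse=True)
    for g in by_remainder[:leftover]:
        shares[g] += 1
    return shares
```

With weights of 2 for Bologna and 1 for Milan, nine machines split six to three; when the pool size is not divisible by three, the rounding step decides who gets the odd machine.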

A Typical Pool
[Diagram: condor_startd machines (cpu, RAM) share an NFS / AFS Server (cpu, RAM, disk) and report to one Central Manager.]
• Global Policy: “All things being equal, Bologna gets 2x as many machines as Milan.”
• Uniform Local Policy: “All machines except #3 prefer mazzanti”

Schedulers
[Diagram: each condor_schedd holds a queue of jobs and tells the Central Manager “I have work.” Each condor_startd (cpu, RAM) tells the Central Manager “I am idle.” Matched jobs run on the condor_startd machines.]

Multiple Pools
[Diagram: two pools, one under the INFN Central Manager and one under the UWCS Central Manager. Each pool has condor_startd machines (cpu, RAM) running jobs; a condor_schedd holds the queued jobs.]

Matchmaking
• Each Central Manager is an introduction service that matches compatible machines and jobs.
• A simple language (ClassAds) is used to represent everyone’s needs and desires.
• The match is not a binding contract -- each side is responsible for enforcing its own needs.
• If a central manager crashes, jobs will continue to run, but no further introductions are made.

ClassAd Example
Job Ad:
    Type = "Job"
    Cmd = "cmsim.exe"
    Owner = "thain"
    Requirements = (OpSys == "LINUX") && (Memory > 128)
Machine Ad:
    Type = "Machine"
    Name = "vulture"
    OpSys = "LINUX"
    Memory = 256
    Requirements = (Owner == "thain")
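The two-sided nature of a match can be sketched in a few lines. This is not the ClassAd language or its evaluator; the ads are modeled as plain dictionaries and each Requirements expression as a function of the other party's ad, which is an assumption made to keep the example short.

```python
job_ad = {
    "Type": "Job",
    "Cmd": "cmsim.exe",
    "Owner": "thain",
    # The job's requirements are evaluated against the machine ad.
    "Requirements": lambda machine: machine["OpSys"] == "LINUX" and machine["Memory"] > 128,
}

machine_ad = {
    "Type": "Machine",
    "Name": "vulture",
    "OpSys": "LINUX",
    "Memory": 256,
    # The machine's requirements are evaluated against the job ad.
    "Requirements": lambda job: job["Owner"] == "thain",
}


def matches(job, machine):
    """A match needs both sides' Requirements to hold for the other ad."""
    return job["Requirements"](machine) and machine["Requirements"](job)
```

With the ads from the slide, `matches(job_ad, machine_ad)` is true; lowering the machine's Memory to 64 or changing the job's Owner makes it false, because either side can veto the pairing.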

Matchmaking with ClassAds
[Diagram, in sequence:]
• Schedd sends a Job Ad to the Central Manager: “I have work.”
• Startd sends a Machine Ad to the Central Manager: “I am idle.”
• The Central Manager finds a match and sends a match notification.
• The Schedd claims the Startd and executes a job.
• Execute again. …and again!

Placement vs. Scheduling
• A Condor Central Manager suggests the placement of jobs on machines, with the understanding that all jobs are ready to run.
• A Condor scheduler is responsible for executing a list of jobs with various requirements. It may order jobs according to the user’s requests.
• Neither component plans ahead to make a schedule or a reservation for execution -- it is assumed that change is so frequent that schedules are not useful.

Can we Schedule?
• Of course, scheduling is important for users that have strict time constraints.
• Scheduling is more important to High-Performance Computing (HPC) than to High-Throughput Computing (HTC).
• Scheduling requirements may be worked into Condor in one of two ways:
  – 1 - Users may share a single submission point.
  – 2 - The administrator may periodically reconfigure policy according to a schedule established elsewhere.

Scheduling
[Diagram: one condor_schedd holding several jobs, a Central Manager, and condor_startd machines (cpu, RAM) announcing “I am idle.”]
• Method 1: All users share a schedd.
• Method 2: Modify global policy when necessary.
  – 8:00: All nodes prefer thain.
  – 10:00: All nodes prefer mazzanti.
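Method 2 boils down to a lookup from the current time to the policy in force. A minimal sketch follows, using the two entries from the slide. The slide does not say what applies before 8:00, so the wrap-around to the last entry of the previous day is an assumption, as are the names `SCHEDULE` and `preferred_owner`.

```python
from bisect import bisect_right
from datetime import time

# (start time, preferred owner) pairs from the slide, sorted by start time.
SCHEDULE = [(time(8, 0), "thain"), (time(10, 0), "mazzanti")]


def preferred_owner(now, schedule=SCHEDULE):
    """Return the owner all nodes should prefer at wall-clock time `now`."""
    starts = [start for start, _ in schedule]
    i = bisect_right(starts, now) - 1
    # Before the first entry, i is -1, which selects the last entry
    # (assumed to carry over from the previous day).
    return schedule[i][1]
```

An administrator's script could call such a function periodically and rewrite the pool's policy whenever the answer changes.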

Outline
• Introduction
  – What is Condor? Why Condor on the Farm?
• Components
  – Daemons, pools, flocks, ClassAds
• Short Example
  – Executing 1000 jobs.
• Complications
  – Firewalls, security, etc.

How Many Machines?
% condor_status
Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
...
                   Machines  Owner  Claimed  Unclaimed  Matched  Preempting
            Total       194     97       46         50        0           1
[Per-platform rows in the summary: ALPHA/OSF1, INTEL/LINUX-GLIBC, SUN4u/SOLARIS251, SUN4u/SOLARIS26, SUN4u/SOLARIS27, SUN4x/SOLARIS26]

Submit the Job
• Create a submit file:
    % vi sim.submit
    Executable = sim
    Input = sim.in
    Output = sim.out
    Log = sim.log
    Queue
• Submit the job:
    % condor_submit sim.submit

Watch the Progress
% condor_q
-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
 ID   OWNER  SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
 5.0  thain  6/21 12:40  0+00:15   R   0    2.5   sim.exe
• ID: Each job gets a unique number.
• ST: Status: Unexpanded, Running or Idle
• SIZE: Size of program image (MB)

Receive E-mail When Done
    This is an automated email from the Condor system
    on machine "axpbo8.bo.infn.it". Do not reply.

    Your condor job /tmp_mnt/usr/users/ccl/thain/test/sim 40
    exited with status 0.

    Submitted at:    Wed Jun 21 14:24:42 2000
    Completed at:    Wed Jun 21 14:36 2000
    Real Time:       0 00:11:54
    Run Time:        0 00:06:52
    Committed Time:  0 00:01:37
    ...

Running Many Processes
• The real benefit of Condor comes from managing 1000s of jobs.
• First, get organized. Write a script to make 1000 input files.
• Now, simply adjust your submit file:
    Executable = sim.exe
    Input = sim.in.$(PROCESS)
    Output = sim.out.$(PROCESS)
    Log = sim.log
    Queue 1000
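The “script to make 1000 input files” could look like the sketch below. The file names follow the `sim.in.$(PROCESS)` pattern, with process numbers counted from 0 as Condor does; the file contents (a single `seed` line) are a made-up placeholder, since the talk does not say what `sim.exe` reads.

```python
import os


def make_inputs(directory, count=1000, prefix="sim.in"):
    """Write count input files named prefix.0 .. prefix.(count-1)."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for process in range(count):
        path = os.path.join(directory, f"{prefix}.{process}")
        with open(path, "w") as f:
            # Hypothetical input format: one line holding a per-job seed.
            f.write(f"seed {process}\n")
        paths.append(path)
    return paths
```

Running `make_inputs(".")` in the submit directory before `condor_submit` gives each of the 1000 queued processes its own input file.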

What can go wrong?
• If an execution site crashes:
  – Your job will restart elsewhere.
• If the central manager crashes:
  – Jobs will continue to run, but no new matches will be made.
• If the submit machine crashes:
  – Jobs will stop, but will be restarted when it reboots.
• The only way to lose a job is to throw away the disk on the submit machine!

Outline
• Introduction
  – What is Condor? Why Condor on the Farm?
• Components
  – Daemons, pools, flocks, ClassAds
• Short Example
  – Executing 1000 jobs.
• Complications
  – Firewalls, security, etc.

Firewalls
• Why a firewall?
  – Prevent all outside contact.
  – Prevent non-approved contact.
  – Carefully securing every node is too much work.
• What’s the problem?
  – A variety of processes make up Condor.
  – A variety of ports must be used at once.
  – Submit and execute machines must communicate directly, not through the CM.

The Firewall Problem
[Diagram: a Firewall separates the Public Network from the Private Network. The Central Manager, a condor_schedd and the condor_startd machines (cpu, RAM) sit on different sides of it.]

Firewall Solution #1
[Diagram: same layout as The Firewall Problem.]
• Firewall: Allow ports 1000-1010.
• Condor daemons: Use only ports 1000-1010.

Firewall Solution #1
• Pros:
  – Easy to configure Condor.
  – Easy to configure firewall.
  – Machines remain a part of the pool.
• Cons:
  – Number of ports limits number of simultaneous interactions with the node (running jobs + queue ops + negotiations, etc.).
  – More ports = more connections, less security.
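The first drawback can be made concrete with a toy model of a fixed port range: once every port is in use, the next interaction has to wait. This is a bookkeeping sketch only, with no real sockets and nothing taken from Condor's networking code; the class name `PortPool` is hypothetical.

```python
class PortPool:
    """Track which ports in an inclusive range are in use."""

    def __init__(self, low, high):
        self.free = list(range(low, high + 1))
        self.in_use = set()

    def acquire(self):
        """Return a free port, or None when every port is already in use."""
        if not self.free:
            return None
        port = self.free.pop(0)
        self.in_use.add(port)
        return port

    def release(self, port):
        """Return a port to the pool once its connection closes."""
        self.in_use.remove(port)
        self.free.append(port)
```

With the range 1000-1010 from the slide there are eleven ports, so a twelfth simultaneous connection gets `None` until one of the others is released.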

Firewall Solution #2
[Diagram: the Private Network behind the Firewall has its own Central Manager and condor_startd machines (cpu, RAM). The only path through the Firewall from the Public Network is ssh.]

Firewall Solution #2
• Pros:
  – Only port through router is ssh.
• Cons:
  – Pool is partitioned!
  – Users must manually submit to every pool that is behind a firewall. (i.e. they won’t.)
  – No global policy possible.
  – No global management/status possible.

Network Address Translation
• Both solutions only work as long as the firewall simply drops packets it doesn’t like.
• If the firewall is a Network Address Translator (masquerade), then only solution #2 works.
• Research in Progress: A Condor NAT that runs on the firewall and exports the pool to the outside world.

Security
• Current Condor security:
  – Authenticate via DNS.
  – Authorize classes of hosts for certain tasks.
• New Condor (6.3.X?) security:
  – Authenticate with encrypted credentials.
  – Authorize on a per-user basis.
  – Forward credentials to necessary sites.

Condor 6.2 Security
• Authentication: DNS is queried for each incoming connection in order to determine the name.
• Authorization: Each participant permits a class of hosts to perform certain tasks. At UW-CS:
  – HOSTALLOW_READ = *.wisc.edu, *.infn.it
    • Hosts that may query the machine state.
  – HOSTALLOW_WRITE = *.cs.wisc.edu, *.infn.it
    • Hosts that may execute jobs, send updates, etc.
  – HOSTALLOW_OWNER = $(FULL_HOSTNAME)
    • Hosts that may cause this machine to vacate.
  – HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
    • Hosts that may change priorities, turn Condor on/off.
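Host-based authorization of this kind is a wildcard match of the connecting host's name against a per-task list. The sketch below illustrates the idea with Python's `fnmatch`; it is not Condor's matching code, and the dictionary layout and the function `is_allowed` are assumptions for the example. The OWNER level is left out because its value depends on the local host name.

```python
from fnmatch import fnmatch

# Patterns copied from the slide; the dictionary layout is illustrative.
HOSTALLOW = {
    "READ": ["*.wisc.edu", "*.infn.it"],
    "WRITE": ["*.cs.wisc.edu", "*.infn.it"],
    "ADMINISTRATOR": ["condor.cs.wisc.edu"],
}


def is_allowed(level, hostname, rules=HOSTALLOW):
    """Return True if hostname matches any pattern listed for the access level."""
    host = hostname.lower()
    return any(fnmatch(host, pattern) for pattern in rules.get(level, []))
```

Under these rules a host in another wisc.edu department may read machine state but not write, while any infn.it host may do both.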

Condor 6.3.X? Security
• Principle: No single security mechanism is appropriate for all sites. Condor must have many tools.
  – United States Air Force:
    • Kerberos authentication, all connections encrypted
  – Cluster behind a firewall:
    • Host authentication, no encryption
  – Grid Computing:
    • GSI credentials from certain authorities, encryption is up to the user.

Condor 6.3.X Security
[Diagram: Submit side (condor_schedd), Execute side (condor_startd, cpu, RAM), Central Manager, and Data storage (Disk). Parties negotiate an authentication method: “GSI?” “YES!” on some connections; “GSI?” “NO”, “KRB5?” on another. The label FORWARD CERT marks a credential being forwarded for I/O.]

You don’t have to be a super person to do super computing!

Getting Condor
• Condor Home Page
  – http://www.cs.wisc.edu
• Binaries are freely available.
• Versions:
  – 6.2.x - Stable releases, bug fixes only
  – 6.3.x - Development releases

For More Info
• Condor Home Page
  – http://www.cs.wisc.edu/condor
• These slides:
  – http://www.cs.wisc.edu/~thain
• Douglas Thain
  – thain@cs.wisc.edu
• Questions Now?