Condor Administration
Alan De Smet
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Outline
› Condor Daemons
    - Job Startup
› Configuration Files
› Policy Expressions
    - Startd (Machine)
    - Negotiator
    - Job States
› Priorities
› Security
› Administration
› Installation
    - “Full Installation”
› Other Sources

Condor Daemons

Condor Daemons
› condor_master - controls everything else
› condor_startd - executing jobs
    - condor_starter - helper for starting jobs
› condor_schedd - submitting jobs
    - condor_shadow - submit-side helper

Condor Daemons
› condor_collector - collects system information; only on the Central Manager
› condor_negotiator - assigns jobs to machines; only on the Central Manager
› You only have to run the daemons for the services you want to provide

condor_master
› Starts up all other Condor daemons
› If a daemon exits unexpectedly, restarts the daemon and emails the administrator
› If a daemon binary is updated (timestamp changed), restarts the daemon

condor_master
› Provides access to many remote administration commands:
    - condor_reconfig, condor_restart, condor_off, condor_on, etc.
› Default server for many other commands:
    - condor_config_val, etc.

condor_master
› Periodically runs condor_preen to clean up any files Condor might have left on the machine
    - This is backup behavior; the rest of the daemons clean up after themselves as well

condor_startd
› Represents a machine to the Condor pool
› Should be run on any machine you want to run jobs
› Enforces the wishes of the machine owner (the owner’s “policy”)

condor_startd
› Starts, stops, suspends jobs
› Spawns the appropriate condor_starter, depending on the type of job
› Provides other administrative commands (for example, condor_vacate)

condor_starter
› Spawned by the condor_startd to handle all the details of starting and managing the job
    - Transfers the job’s binary to the execute machine
    - Sends back the exit status
    - Etc.

condor_starter
› On multi-processor machines, you get one condor_starter per CPU
    - Actually one per running job
    - Can configure to run more (or fewer) jobs than CPUs
› For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)

condor_schedd
› Represents jobs to the Condor pool
› Maintains a persistent queue of jobs
    - Queue is not strictly FIFO (priority based)
    - Each machine running condor_schedd maintains its own queue

condor_schedd
› Responsible for contacting available machines and spawning waiting jobs
    - When told to by condor_negotiator
› Should be run on any machine you want to submit jobs from
› Services most user commands:
    - condor_submit, condor_rm, condor_q

condor_shadow
› Represents the job on the submit machine
› Services requests from standard universe jobs for remote system calls
    - Including all file I/O
› Makes decisions on behalf of the job
    - For example: where to store the checkpoint file

condor_shadow Impact
› One condor_shadow running on the submit machine for each actively running Condor job
› Minimal load on the submit machine
    - Usually blocked waiting for requests from the job or doing I/O
    - Relatively small memory footprint

Limiting condor_shadow
› Still, you can limit the impact of the shadows on a given submit machine:
    - They can be started by Condor with a “nice-level” that you configure (SHADOW_RENICE_INCREMENT)
    - Can limit the total number of shadows running on a machine (MAX_JOBS_RUNNING)

condor_collector
› Collects information from all other Condor daemons in the pool
› Each daemon sends a periodic update called a ClassAd to the collector
› Services queries for information:
    - Queries from other Condor daemons
    - Queries from users (condor_status)

condor_negotiator
› Performs matchmaking in Condor
    - Pulls the list of available machines and job queues from condor_collector
    - Matches jobs with available machines
    - Both the job and the machine must satisfy each other’s requirements (2-way matching)
› Handles user priorities
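
The 2-way matching idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Condor's ClassAd implementation: the dictionaries stand in for ClassAds, the attribute names (arch, memory_mb, image_size_mb, owner) are hypothetical, and requirements are modeled as plain Python functions.

```python
from typing import Any, Callable, Dict

# Stand-in for a ClassAd: a dictionary of attributes plus a "requirements"
# function that inspects the other party's ad. All names are hypothetical.
Ad = Dict[str, Any]


def matches(job: Ad, machine: Ad) -> bool:
    """Return True only if each ad's requirements accept the other ad."""
    job_accepts_machine = bool(job["requirements"](machine))
    machine_accepts_job = bool(machine["requirements"](job))
    return job_accepts_machine and machine_accepts_job


job = {
    "owner": "alice",
    "image_size_mb": 200,
    "requirements": lambda m: m["arch"] == "INTEL" and m["memory_mb"] >= 256,
}

machine_ok = {
    "arch": "INTEL",
    "memory_mb": 512,
    "requirements": lambda j: j["image_size_mb"] <= 400,
}

# Rejected by the job: not enough memory.
machine_small = {
    "arch": "INTEL",
    "memory_mb": 128,
    "requirements": lambda j: True,
}

# Rejects the job: this machine only accepts jobs from another owner.
machine_picky = {
    "arch": "INTEL",
    "memory_mb": 512,
    "requirements": lambda j: j["owner"] == "bob",
}

for name, machine in [("ok", machine_ok), ("small", machine_small), ("picky", machine_picky)]:
    print(name, matches(job, machine))
```

Either side can veto the match, which is why a job can sit idle even when plenty of machines look free.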

Typical Condor Pool
[Figure: a typical Condor pool, showing the daemons spawned by the master on each kind of machine (Central Manager, Submit-Only, Execute-Only, Regular Node) and the ClassAd communication pathways between them. Legend: process spawned; ClassAd communication pathway.]

Job Startup
[Figure: job startup across three machines. Labels: Central Manager (Negotiator, Collector), Submit Machine (Submit, Schedd, Shadow), Execute Machine (Starter, Job, Condor Syscall Lib).]

Configuration Files

Configuration Files
› Multiple files concatenated
    - Definitions in later files overwrite previous definitions
› Order of files:
    - Global config file
    - Local config files, shared config files
    - Global and Local Root config file
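
The "later files overwrite earlier ones" rule can be pictured as a simple layered merge. The sketch below is a toy model in Python, not Condor's parser; the settings and values shown are illustrative only.

```python
from typing import Dict, List


def merge_configs(layers: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge config layers in processing order; later layers win.

    Names are upper-cased because configuration names are case insensitive.
    """
    merged: Dict[str, str] = {}
    for layer in layers:
        for name, value in layer.items():
            merged[name.upper()] = value
    return merged


# Illustrative values only.
global_cfg = {"START": "True", "MAX_JOBS_RUNNING": "200"}
local_cfg = {"start": "False"}  # a machine-specific override

merged = merge_configs([global_cfg, local_cfg])
print(merged)
```

Settings that appear only in the global file survive untouched, while anything repeated in a later file takes the later value.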

Global Config File
› Found either in the file pointed to by the CONDOR_CONFIG environment variable, /etc/condor_config, or ~condor/condor_config
› Most settings can be in this file
› Only works as a global file if it is on a shared file system

Other Shared Files
› LOCAL_CONFIG_FILE macro
    - Comma separated, processed in order
› You can configure a number of other shared config files:
    - Organize common settings (for example, all policy expressions)
    - Platform-specific config files

Local Config File
› LOCAL_CONFIG_FILE macro (again)
    - Usually uses $(HOSTNAME)
› Machine-specific settings
    - Local policy settings for a given owner
    - Different daemons to run (for example, on the Central Manager!)

Local Config File
› Can be on the local disk of each machine
    /var/adm/condor_config.local
› Can be in a shared directory
    /shared/condor_config.$(HOSTNAME)
    /shared/condor/hosts/$(HOSTNAME)/condor_config.local

Root Config File (optional)
› Always processed last
› Allows root to specify settings which cannot be changed by other users
    - For example, the path to Condor daemons
› Useful if daemons are started as root but someone else has write access to config files

Root Config File (optional)
› /etc/condor_config.root or ~condor/condor_config.root
› Then loads any files specified in ROOT_CONFIG_FILE_LOCAL

Configuration File Syntax
› # is a comment
› \ at the end of a line is a line continuation
    - Both lines are treated as one big entry
› All names are case insensitive
    - Values are case sensitive

Configuration File Syntax
› “Macros” have the form:
    - Attribute_Name = value
› You reference other macros with:
    - A = $(B)
› Can create additional macros for organizational purposes

Configuration File Syntax
› Macros are evaluated when needed
    - Not when parsed
    - In the following configuration file, B will evaluate to 2:
        A = 1
        B = $(A)
        A = 2
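
To see why B ends up as 2, here is a minimal Python sketch of evaluate-when-needed macro expansion. It is a toy model under stated assumptions, not Condor's configuration code: it handles only `NAME = value` lines, whole-line `#` comments, and `$(NAME)` references, and it ignores line continuations. The function names are hypothetical.

```python
import re
from typing import Dict

_REFERENCE = re.compile(r"\$\(([^)]+)\)")


def parse_config(text: str) -> Dict[str, str]:
    """Store each macro's raw text; later definitions replace earlier ones."""
    macros: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Names are case insensitive, so normalize them.
        macros[name.strip().upper()] = value.strip()
    return macros


def expand(macros: Dict[str, str], name: str, depth: int = 0) -> str:
    """Expand $(NAME) references at lookup time, not at parse time."""
    if depth > 32:
        raise ValueError("macro references nested too deeply (possible cycle)")
    value = macros[name.strip().upper()]
    return _REFERENCE.sub(lambda m: expand(macros, m.group(1), depth + 1), value)


config = """
A = 1
B = $(A)
A = 2
"""

macros = parse_config(config)
print(expand(macros, "B"))
```

Because B stores the text `$(A)` rather than a value, the lookup sees the last definition of A.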

Policy Expressions

Policy Expressions
› Allow machine owners to specify job priorities, restrict access, and implement local policies

Machine (Startd) Policy Expressions
› START - When is this machine willing to start a job
    - Typically used to restrict access when the machine is being used directly
› RANK - Job preferences

Machine (Startd) Policy Expressions
› SUSPEND - When to suspend a job
› CONTINUE - When to continue a suspended job
› PREEMPT - When to nicely stop running a job
› KILL - When to immediately kill a preempting job

Policy Expressions
› Specified in condor_config
› Can reference condor_config macros
    - $(MACRONAME)
› Policy evaluates both a machine ClassAd and a job ClassAd together
    - Policy can reference items in either ClassAd (see manual for list)

Minimal Settings
› Always runs jobs
    START = True
    RANK =
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False

Policy Configuration (Boss Fat Cat)
› I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes

New Settings for the Chemistry nodes
› Prefer Chemistry jobs
    START = True
    RANK = Department == "Chemistry"
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False

Submit file with Custom Attribute
› Prefix an entry with “+” to add to job ClassAd
    Executable = charm-run
    Universe = standard
    +Department = "Chemistry"
    queue

What if “Department” not specified?
    START = True
    RANK = Department =!= UNDEFINED && Department == "Chemistry"
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False

Another example
    START = True
    RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False
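
The apparent intent of this RANK expression is an ordering: Chemistry jobs first, Physics jobs second, everything else last. The Python sketch below models only that intent, with `None` standing in for UNDEFINED; it does not reproduce ClassAd evaluation rules, and the job dictionaries are hypothetical.

```python
from typing import Any, Dict, Optional


def rank(job: Dict[str, Any]) -> int:
    """Score a job the way the RANK expression appears intended to."""
    department: Optional[str] = job.get("Department")  # None models UNDEFINED
    if department is None:
        return 0
    # Python booleans behave as 0/1 in arithmetic, mirroring the expression.
    return (department == "Chemistry") * 2 + (department == "Physics")


jobs = [
    {"id": 1},                              # no Department attribute
    {"id": 2, "Department": "Physics"},
    {"id": 3, "Department": "Chemistry"},
    {"id": 4, "Department": "History"},
]

# Higher rank is preferred by the machine.
preferred_order = sorted(jobs, key=rank, reverse=True)
print([job["id"] for job in preferred_order])
```

Guarding against a missing Department first keeps jobs that never set the attribute from producing an undefined rank.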

Policy Configuration (Boss Fat Cat)
› Cluster is okay, but... Condor can only use the desktops when they would otherwise be idle

Desktops should
› START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low

Desktops should
› SUSPEND jobs as soon as activity is detected
› PREEMPT jobs if the activity continues for 5 minutes or more
› KILL jobs if they take more than 5 minutes to preempt

Macros in the Config File
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    HighLoad = 0.5
    BgndLoad = 0.3
    CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
    CPU_Idle = ($(NonCondorLoadAvg) <= $(BgndLoad))
    KeyboardBusy = (KeyboardIdle < 10)
    MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
    ActivityTimer = (CurrentTime - EnteredCurrentActivity)

Desktop Machine Policy
    START = $(CPU_Idle) && KeyboardIdle > 300
    SUSPEND = $(MachineBusy)
    CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
    PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
    KILL = $(ActivityTimer) > 300
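
To check how this desktop policy behaves for a given machine, the macros and expressions can be transcribed into Python. This is a sketch for reasoning about the policy, not how the startd evaluates it: the machine attributes are passed in a plain dictionary, and the sample values below are made up.

```python
from typing import Any, Dict

HIGH_LOAD = 0.5  # HighLoad
BGND_LOAD = 0.3  # BgndLoad


def evaluate_policy(machine: Dict[str, Any]) -> Dict[str, bool]:
    """Evaluate the desktop policy expressions against machine attributes."""
    non_condor_load = machine["LoadAvg"] - machine["CondorLoadAvg"]
    cpu_busy = non_condor_load >= HIGH_LOAD
    cpu_idle = non_condor_load <= BGND_LOAD
    keyboard_busy = machine["KeyboardIdle"] < 10
    machine_busy = cpu_busy or keyboard_busy
    activity_timer = machine["CurrentTime"] - machine["EnteredCurrentActivity"]
    return {
        "START": cpu_idle and machine["KeyboardIdle"] > 300,
        "SUSPEND": machine_busy,
        "CONTINUE": cpu_idle and machine["KeyboardIdle"] > 120,
        "PREEMPT": machine["Activity"] == "Suspended" and activity_timer > 300,
        "KILL": activity_timer > 300,
    }


# Made-up sample: nobody has touched the desktop for ten minutes.
idle_desktop = {
    "LoadAvg": 0.1, "CondorLoadAvg": 0.0, "KeyboardIdle": 600,
    "CurrentTime": 1000, "EnteredCurrentActivity": 900, "Activity": "Idle",
}

# Made-up sample: the owner just came back while a job was running.
owner_returned = {
    "LoadAvg": 1.2, "CondorLoadAvg": 1.0, "KeyboardIdle": 2,
    "CurrentTime": 1000, "EnteredCurrentActivity": 400, "Activity": "Busy",
}

# Made-up sample: the job has been suspended for 400 seconds.
long_suspended = dict(owner_returned, Activity="Suspended", EnteredCurrentActivity=600)

for label, sample in [("idle", idle_desktop), ("returned", owner_returned),
                      ("suspended", long_suspended)]:
    print(label, evaluate_policy(sample))
```

Note that START requires 300 idle seconds while CONTINUE needs only 120, so a suspended job resumes sooner than a new job would start.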

Additional Policy Parameters
› WANT_SUSPEND
    - If false, skips SUSPEND, jumps to PREEMPT
› WANT_VACATE
    - If true, gives the job time to vacate cleanly (until KILL becomes true)
    - If false, the job is immediately killed (KILL is ignored)
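
One way to read how WANT_SUSPEND and WANT_VACATE steer a running job is as a small transition function. The Python sketch below is a simplified reading of these slides, not the startd's actual state machine; the function name and the dictionary of already-evaluated expressions are hypothetical.

```python
from typing import Dict


def _begin_preemption(values: Dict[str, bool]) -> str:
    """WANT_VACATE chooses between a graceful vacate and an immediate kill."""
    return "Vacating" if values["WANT_VACATE"] else "Killing"


def next_activity(activity: str, values: Dict[str, bool]) -> str:
    """Return the next activity given current expression values."""
    if activity == "Busy":
        if values["WANT_SUSPEND"]:
            return "Suspended" if values["SUSPEND"] else "Busy"
        # WANT_SUSPEND is false: skip SUSPEND and go straight to PREEMPT.
        return _begin_preemption(values) if values["PREEMPT"] else "Busy"
    if activity == "Suspended":
        if values["PREEMPT"]:
            return _begin_preemption(values)
        return "Busy" if values["CONTINUE"] else "Suspended"
    if activity == "Vacating":
        return "Killing" if values["KILL"] else "Vacating"
    return activity


base = {"WANT_SUSPEND": True, "WANT_VACATE": True, "SUSPEND": True,
        "CONTINUE": False, "PREEMPT": False, "KILL": False}

print(next_activity("Busy", base))
print(next_activity("Busy", dict(base, WANT_SUSPEND=False, PREEMPT=True)))
print(next_activity("Suspended", dict(base, PREEMPT=True, WANT_VACATE=False)))
```

The sketch makes the two switches visible: WANT_SUSPEND decides whether the Suspended step exists at all, and WANT_VACATE decides whether KILL is ever consulted.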

Policy Review
› Users submitting jobs can specify Requirements and Rank expressions
› Administrators can specify Startd policy expressions individually for each machine
› Custom attributes easily added
› You can enforce almost any policy!

Road Map of the Policy Expressions
[Figure: flowchart of the policy expressions with True/False branches, in order: START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, Vacating, KILL, Killing. Legend distinguishes expressions from activities.]

Negotiator Policy Expressions
› PREEMPTION_REQUIREMENTS and PREEMPTION_RANK
› Evaluated when condor_negotiator considers replacing a lower priority job with a higher priority job
› Completely unrelated to the PREEMPT expression

PREEMPTION_REQUIREMENTS
› If false, will not preempt the machine
    - Typically used to avoid pool thrashing
    PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2
    - Only replace jobs running for at least one hour and 20% lower priority

PREEMPTION_RANK
› Picks which already claimed machine to reclaim
    PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ImageSize
    - Strongly prefers preempting jobs with a large (bad) priority and a small image size
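
The two negotiator expressions work as a filter followed by a sort. The Python sketch below transcribes the example expressions to show that interplay; it is an illustration rather than negotiator code, and the candidate tuples and their values are invented.

```python
from typing import List, Tuple

HOUR = 3600  # seconds


def may_preempt(state_timer: float, remote_user_prio: float, submittor_prio: float) -> bool:
    """PREEMPTION_REQUIREMENTS: running over an hour and at least 20% worse priority."""
    return state_timer > 1 * HOUR and remote_user_prio > submittor_prio * 1.2


def preemption_rank(remote_user_prio: float, image_size: float) -> float:
    """PREEMPTION_RANK: priority dominates; image size breaks ties."""
    return remote_user_prio * 1000000 - image_size


# Invented candidates: (machine name, state timer, running user's priority, image size).
candidates: List[Tuple[str, float, float, float]] = [
    ("node-a", 7200, 30.0, 1000),
    ("node-b", 7200, 30.0, 500),
    ("node-c", 1800, 50.0, 10),   # running for under an hour, so protected
    ("node-d", 7200, 22.0, 10),   # priority not 20% worse than the submitter's
]
submittor_prio = 20.0

eligible = [c for c in candidates if may_preempt(c[1], c[2], submittor_prio)]
best = max(eligible, key=lambda c: preemption_rank(c[2], c[3]))
print([c[0] for c in eligible], best[0])
```

The large multiplier on the priority term means image size only matters between jobs whose users have effectively equal priority.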

Machine States
[Figure: state diagram of the machine states OWNER (entry point), UNCLAIMED, MATCHED, CLAIMED and PREEMPTING.]

Machine Activities
[Figure: the machine states with their activities: OWNER (Idle), UNCLAIMED (Idle, Benchmarking), MATCHED (Idle), CLAIMED (Idle, Busy, Suspended), PREEMPTING (Vacating, Killing).]

Machine Activities
[Figure: the machine states and activities diagram, repeated.]
› See the manual for the gory details (Section 3.6: Configuring the Startd Policy)

Priorities

Job Priority
› Set with condor_prio
› Range from -20 to 20
› Only impacts order between jobs for a single user

User Priority
› Determines allocation of machines to waiting users
› View with condor_userprio
› Inversely related to machines allocated
    - A user with priority of 10 will be able to claim twice as many machines as a user with priority 20

User Priority
› Effective User Priority is determined by multiplying two factors
    - Real Priority
    - Priority Factor
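
The arithmetic behind these two slides can be sketched as follows. The sketch assumes an idealized model in which machines are shared in inverse proportion to effective priority, which matches the "10 versus 20" example but is not a description of the negotiator's actual algorithm; the function names are hypothetical.

```python
from typing import Dict


def effective_priority(real_priority: float, priority_factor: float = 1.0) -> float:
    """Effective User Priority = Real Priority * Priority Factor."""
    return real_priority * priority_factor


def expected_shares(priorities: Dict[str, float], machines: int) -> Dict[str, float]:
    """Split machines in inverse proportion to effective priority (idealized)."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: machines * weight / total for user, weight in weights.items()}


priorities = {
    "alice": effective_priority(10.0),       # better (lower) priority
    "bob": effective_priority(20.0),
}
shares = expected_shares(priorities, machines=30)
print({user: round(share, 2) for user, share in shares.items()})
```

Under this model the user with priority 10 receives twice the share of the user with priority 20, and raising a user's priority factor shrinks that user's share accordingly.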

Real Priority
› Based on actual usage
› Defaults to 0.5
› Approaches actual number of machines used over time
    - Configuration setting PRIORITY_HALFLIFE

Priority Factor
› Assigned by administrator
    - Set with condor_userprio
› Defaults to 1 (DEFAULT_PRIO_FACTOR)
› Nice users default to 1,000 (NICE_USER_PRIO_FACTOR)
    - Used for true bottom feeding jobs
    - Add “nice_user=true” to your submit file

Security

Host/IP Address Security
› The basic security model in Condor
    - Stronger security available (encrypted communications, cryptographic authentication)
› Can configure each machine in your pool to allow or deny certain actions from different groups of machines

Security Levels
› READ access - querying information
    - condor_status, condor_q, etc.
› WRITE access - updating information
    - Does not include READ access!
    - condor_submit, adding nodes to a pool, etc.

Security Levels
› ADMINISTRATOR access
    - condor_on, condor_off, condor_reconfig, condor_restart, etc.
› OWNER access
    - Things a machine owner can do (notably condor_vacate)

Setting Up Host/IP Address Security
› List what hosts are allowed or denied to perform each action
    - If you list allowed hosts, everything else is denied
    - If you list denied hosts, everything else is allowed
    - If you list both, only allow hosts that are listed in “allow” but not in “deny”
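
The three allow/deny rules above translate directly into a small decision function. The Python sketch below is an approximation for reasoning about a configuration, not Condor's implementation: it uses `fnmatch` for wildcard matching, which accepts more pattern syntax than Condor's single `*`, and the behavior when neither list is given is an assumption of this sketch.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def _listed(host: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive wildcard match of a host against a pattern list."""
    return any(fnmatchcase(host.lower(), pattern.lower()) for pattern in patterns)


def host_permitted(host: str,
                   allow: Optional[Iterable[str]] = None,
                   deny: Optional[Iterable[str]] = None) -> bool:
    """Apply the allow/deny rules for one access level."""
    allow_list = list(allow or [])
    deny_list = list(deny or [])
    if allow_list and deny_list:
        return _listed(host, allow_list) and not _listed(host, deny_list)
    if allow_list:
        return _listed(host, allow_list)   # everything else is denied
    if deny_list:
        return not _listed(host, deny_list)  # everything else is allowed
    return True  # assumption of this sketch: no lists means no restriction


print(host_permitted("infn-corsi12", allow=["infn-corsi1*"], deny=["infn-corsi15"]))
print(host_permitted("infn-corsi15", allow=["infn-corsi1*"], deny=["infn-corsi15"]))
print(host_permitted("example.gov", deny=["*.gov", "*.mil"]))
```

A deny entry always wins over an allow entry for the same host, so a broad wildcard in the allow list can be narrowed with specific deny entries.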

Specifying Hosts
› There are many possibilities for specifying which hosts are allowed or denied:
    - Host names, domain names
    - IP addresses, subnets

Wildcards
› ‘*’ can be used anywhere (once) in a host name
    - For example, “infn-corsi*.corsi.infn.it”
› ‘*’ can be used at the end of any IP address
    - For example, “128.105.101.*” or “128.105.*”

Setting up Host/IP Address Security
› Can define values that affect all daemons:
    - HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc.
› Can define daemon-specific settings:
    - HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc.

Example Security Settings
    HOSTALLOW_WRITE = *.infn.it
    HOSTALLOW_ADMINISTRATOR = infn-corsi1*, $(CONDOR_HOST), axpb07.bo.infn.it, $(FULL_HOSTNAME)
    HOSTDENY_ADMINISTRATOR = infn-corsi15
    HOSTDENY_READ = *.gov, *.mil
    HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *

Advanced Security Features
› AUTHENTICATION_METHODS
    - Kerberos, GSI (X.509 certs), FS, NTSSPI
› Using Kerberos or GSI, you can grant access (READ, WRITE, etc.) to specific users

Advanced Security Features
› Some AUTHENTICATION_METHODS support strong encryption
› For further details
    - Q&A Session on Condor Security Wednesday morning
    - Condor Manual
    - condor-admin@cs.wisc.edu

Administration

Viewing things with condor_status
› condor_status has lots of different options to display various kinds of info
› Supports “-constraint” so you can only view ClassAds that match an expression you specify
› Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts)
› View any kind of daemon ClassAd (-schedd, -master, etc.)

Viewing things with condor_q
› View the job queue
› The “-long” option is useful to see the entire ClassAd for a given job
› Also supports the “-constraint” option
› Can view job queues on remote machines with the “-name” option

Looking at condor_q -analyze
› condor_q will try to figure out why the job isn’t running
› Good at finding errors in job Requirements expressions
› Condor 6.5 will include the advanced condor_analyze with additional information

Looking at condor_q -analyze
› Typical results:
    471216.000:  Run analysis summary.  Of 820 machines,
        458 are rejected by your job's requirements
         25 reject your job because of their own requirements
          0 match, but are serving users with a better priority in the pool
          4 match, but prefer another specific job despite its worse user-priority
          6 match, but will not currently preempt their existing job
        327 are available to run your job

Debugging Jobs
› Examine the job with condor_q
    - Especially -long and -analyze
› Examine the job’s user log
    - Quickly find with: condor_q -format '%s\n' UserLog 17.0
    - Users should always have a user log (set with “log” in the submit file)

Debugging Jobs
› Examine ShadowLog on the submit machine
    - Note any machines the job tried to execute on
› Examine ScheddLog on the submit machine
› Examine StartLog and StarterLog on the execute machine

Debugging Jobs
› If necessary add “D_FULLDEBUG D_COMMAND D_SECONDS” to the DEBUG_DAEMONNAME setting for additional log information
› Increase MAX_DAEMONNAME_LOG if logs are rolling over too quickly
› If all else fails, email us
    - condor-admin@cs.wisc.edu

Installation

Considerations for Installing a Condor Pool
› What machine should be your central manager?
› Does your pool have a shared file system?
› Where to install Condor binaries and configuration files?
› Where should you put each machine’s local directories?
› Start the daemons as root or as some other user?

What machine should be your central manager?
› The central manager is very important for the proper functioning of your pool
› If the central manager crashes, jobs that are currently matched will continue to run, but new jobs will not be matched

Central Manager
› Want assurances of high uptime or prompt reboots
› A good network connection helps

Does your pool have a shared file system?
› It is easier to run vanilla universe jobs if so, but one is not required
› Shared location for configuration files can ease administration of a pool
› AFS can work, but Condor does not yet manage AFS tokens

Where to install binaries and configuration files?
› Shared location for configuration files can ease administration of a pool
› Binaries on a shared file system make upgrading easier, but can be less stable if there are network problems
› condor_master on the local disk is a good compromise

Where should you put each machine’s local directories?
› You need a fair amount of disk space in the spool directory for each condor_schedd (holds the job queue and binaries for each job submitted)
› The execute directory is used by the condor_starter to hold the binary for any Condor job running on a machine

Where should you put each machine’s local directories?
› The log directory is used by all daemons
    - More space means more saved info

Start the daemons as root or some other user?
› If possible, we recommend starting the daemons as root
    - More secure
    - Less confusion for users
    - Condor will try to run as the user “condor” whenever possible

Running Daemons as Non-Root
› Condor will still work, users just have to take some extra steps to submit jobs
› Can have a “personal Condor” installed, where only you can submit jobs

Basic Installation Procedure
1. Decide what version and parts of Condor to install and download them
2. Install the “release directory” - all the Condor binaries and libraries
3. Set up the Central Manager
4. (optional) Set up Condor on any other machines you wish to add to the pool
5. Spawn the Condor daemons

Condor Version Series
› We distribute two versions of Condor
    - Stable Series
    - Development Series

Stable Series
› Heavily tested
› Recommended for general use
› 2nd number of version string is even (6.4.7)

Development Series
› Latest features, not necessarily well-tested
› Not recommended unless you’re willing to work with beta code or need new features
› 2nd number of version string is odd (6.5.1)
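
The even/odd rule is easy to apply in a script, for example when auditing which series each machine in a pool is running. A minimal Python sketch, assuming the version is given as a dotted string such as "6.4.7" (the function name is hypothetical):

```python
def release_series(version: str) -> str:
    """Classify a dotted version string by the parity of its second number."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a dotted version string, got {version!r}")
    second = int(parts[1])
    return "stable" if second % 2 == 0 else "development"


for version in ("6.4.7", "6.5.1"):
    print(version, release_series(version))
```

Only the second number matters; the first and third numbers do not affect the series.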

Condor Versions
› All daemons advertise a CondorVersion attribute in the ClassAd they publish
› You can also view the version string by running ident on any Condor binary

Condor Versions
› All parts of Condor on a single machine should run the same version!
› Machines in a pool can usually run different versions and communicate with each other
› Documentation will specify when a version is incompatible with older versions

Downloading Condor
› Go to http://www.cs.wisc.edu/condor/
› Fill out the form and download the different pieces you need
    - Normally, you want the full stable release
› There are also “contrib” modules for non-standard parts of Condor
    - For example, the View Server

Downloading Condor
› Distributed as compressed “tar” files
› Once you download, unpack them

Install the Release Directory
› In the directory where you unpacked the tar file, you’ll find a release.tar file with all the binaries and libraries
› condor_install will install this as the release directory for you

Install the Release Directory
› In a pool with a shared release directory, you should run condor_install somewhere with write access to the shared directory
› You need a separate release directory for each platform!

Set Up the Central Manager
› Central manager needs specific configuration to start the condor_collector and condor_negotiator
› Easiest way to do this is by using condor_install
› There’s a special option for setting up a central manager

Set Up Additional Machines
› If you have a shared file system, just run condor_init on any other machine you wish to add to your pool
› Without a shared file system, you must run condor_install on each host

Spawn the Condor daemons
› Run condor_master to start Condor
    - Remember to start as root if desired
› Start Condor on the central manager first
› Add Condor to your boot scripts?
    - We provide a “SysV-style” init script (<release>/etc/examples/condor.boot)

Shared Release Directory
› Simplifies administration

Shared Release Directory
› Keep all of your config files in one place
    - Allows you to have a real global config file, with common values across the whole pool
    - Much easier to make changes (even for “local” config files in one shared directory)

Shared Release Directory
› Keep all of your binaries in one place
    - Prevents having different versions accidentally left on different machines
    - Easier to upgrade

“Full Installation” of condor_compile
› condor_compile re-links user jobs with Condor libraries to create “standard” jobs
› By default, only works with certain commands (gcc, g++, g77, cc, CC, f77, f90, ld)
› With a “full-installation”, works with any command (notably, make)

“Full Installation” of condor_compile
› Move the real ld binary, the linker, to ld.real
    - Location of ld varies between systems, typically /bin/ld
› Install Condor’s ld script in its place
› Transparently passes to ld.real by default; during condor_compile, hooks in the Condor libraries

Other Sources
› Condor Manual
› Condor Web Site
› condor-admin@cs.wisc.edu

Publications
    - “Condor - A Distributed Job Scheduler,” Beowulf Cluster Computing with Linux, MIT Press, 2002
    - “Condor and the Grid,” Grid Computing: Making the Global Infrastructure a Reality, John Wiley & Sons, 2003
    - These chapters and other publications are available online at our web site

Thank you!
http://www.cs.wisc.edu/condor
condor-admin@cs.wisc.edu