Working with Condor


Links
• Condor's homepage:
  ◦ http://www.cs.wisc.edu/condor/
• Condor manual (for the version currently used):
  ◦ http://www.cs.wisc.edu/condor/manual/v6.8/

Table of contents
• Condor overview
• Useful Condor commands
• Vanilla universe
• Macros
• Standard universe
• Java universe
• Matlab in Condor
• ClassAds
• DAGMan

Condor overview
• Condor is a system for running lots of jobs on a (preferably large) cluster of computers.
• Condor is a specialized workload management system for compute-intensive jobs.

Condor overview
• Condor's inner structure:
  ◦ Condor is built of several daemons:
    - condor_master: Responsible for keeping all the other Condor daemons running.
    - condor_startd: Represents a given machine to the Condor pool and advertises attributes of the machine it is running on. Must run on machines accepting jobs.
    - condor_schedd: Responsible for submitting jobs to Condor. It manages the job queue (each machine has its own). Must run on machines submitting jobs.
    - condor_collector: Runs only on the Condor server. Responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send updates to the collector.
    - condor_negotiator: Runs only on the Condor server. Responsible for all the matchmaking within the Condor system.
    - condor_ckpt_server: Runs only on the checkpointing server. It services requests to store and retrieve checkpoint files.

Condor overview
• Condor uses user priorities to allocate machines to users in a fair manner.
  ◦ A lower numerical value for user priority means a higher priority.
  ◦ Each user starts out with the best user priority, 0.5.
  ◦ If the number of machines a user currently has is greater than the user's priority value, the priority will worsen (numerically increase) over time.
  ◦ If the number of machines a user currently has is lower than the user's priority value, the priority will improve over time.
• Use condor_userprio {-allusers} to see user priorities.
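The drift described above can be pictured with a small Python sketch. This is a simplified model, not Condor's actual code: the function name `update_priority` is hypothetical, and the one-day half-life is an assumed value chosen for illustration. The priority value moves toward the number of machines in use and never drops below the best value, 0.5.

```python
PRIORITY_FLOOR = 0.5  # best possible user priority, as stated on the slide


def update_priority(priority, machines_in_use, elapsed, half_life=86400.0):
    """Move the priority value toward the number of machines in use.

    elapsed and half_life are in seconds; the one-day half-life is an
    assumption made for this sketch.
    """
    decay = 0.5 ** (elapsed / half_life)
    updated = decay * priority + (1.0 - decay) * machines_in_use
    # The value can never become better (lower) than the floor.
    return max(PRIORITY_FLOOR, updated)


# A new user holding 10 machines for one half-life: the value worsens.
after_use = update_priority(0.5, 10, 86400)
# The same user then holds no machines for one half-life: the value improves.
after_rest = update_priority(after_use, 0, 86400)
print(after_use, after_rest)
```

In this model a user who holds more machines than the current priority value sees the value rise, and an idle user sees it fall back toward 0.5, which is the behaviour the slide describes.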

Useful Condor commands
• condor_status
  ◦ Shows all of the computers connected to Condor (not all are accepting jobs).
  ◦ Useful arguments:
    - -claimed: shows only machines running Condor jobs (and who runs them).
    - -available: shows only machines which are willing to run jobs now.
    - -long: displays entire ClassAds (discussed later on).
    - -constraint <const.>: shows only resources matching the given constraint.

Useful Condor commands
• condor_status
  ◦ Attributes
    - Arch:
        "INTEL"          a 32-bit Linux machine
        "X86_64"         a 64-bit Linux machine
    - Activity:
        "Idle"           There is no job activity
        "Busy"           A job is busy running
        "Suspended"      A job is currently suspended
        "Vacating"       A job is currently checkpointing
        "Killing"        A job is currently being killed
        "Benchmarking"   The startd is running benchmarks

Useful Condor commands
• condor_status
  ◦ More attributes
    - State:
        "Owner"       The machine owner is using the machine, and it is unavailable to Condor.
        "Unclaimed"   The machine is available to run Condor jobs, but a good match is either not available or not yet found.
        "Matched"     The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
        "Claimed"     The machine is claimed by a remote machine and is probably running a job.
        "Preempting"  A Condor job is being preempted (possibly via checkpointing) in order to clear the machine, either for a higher priority job or because the machine owner wants the machine back.

Useful Condor commands
• condor_q
  ◦ Shows the state of jobs submitted from the calling computer (the one running condor_q).
  ◦ Useful arguments:
    - -analyze: Performs schedulability analysis on jobs. Useful to see why a scheduled job isn't running, and whether it is ever going to run.
    - -dag: Sorts DAG jobs under their DAGMan.
    - -constraint <const.>: Shows only jobs matching the given ClassAd constraint.
    - -global (-g): Gets the global queue.
    - -run: Gets information about running jobs.

Useful Condor commands
• condor_rm
  ◦ Removes a scheduled job from the queue (of the scheduling computer).
  ◦ condor_rm cluster.proc
    - Removes the given job.
  ◦ condor_rm cluster
    - Removes the given cluster of jobs.
  ◦ condor_rm user
    - Removes all jobs owned by user.
  ◦ condor_rm -all
    - Removes all jobs.

Vanilla universe jobs
• The Vanilla universe is used for running jobs without special needs and features. In the Vanilla universe Condor runs the job the same way it would run without Condor.
• Start with a simple example.c:

    #include <stdio.h>

    int main() {
        printf("hello condor");
        return 0;
    }

• Compile as usual: gcc example.c -o example

Vanilla universe jobs
• In order to submit the job to Condor we use the condor_submit command.
• Usage: condor_submit <sub_file>
• A simple submit file (sub_example):

    Universe = Vanilla
    Executable = example
    Log = test.log
    Output = test.out
    Error = test.error
    Queue

• Notice that the submission commands are case insensitive.

Vanilla universe jobs
• There are a few other useful commands:
• arguments = arg1 arg2 …
  ◦ Runs the executable with the given arguments.
• Input = <input file>
  ◦ The given file is used as standard input.
• environment = "<var1>=<value1> <var2>=<value2> …"
  ◦ Runs the job with the given environment variables.
  ◦ To include spaces in a value, surround it with single quotes.
  ◦ To insert a literal quote mark, repeat it (write it twice). Example:
      environment = "a=""quote"" b='a ''b'' c'"
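The quoting rules for the environment command are easy to get wrong by hand, so here is a small Python sketch that builds the line from a dictionary. The helper names `quote_env_value` and `build_environment` are hypothetical, and the code only covers the rules stated on the slide (single quotes around values with spaces, quote marks doubled).

```python
def quote_env_value(value):
    """Quote one value following the rules listed on the slide."""
    # A literal double quote is written twice.
    value = value.replace('"', '""')
    # Values with whitespace (or a literal single quote) are wrapped in
    # single quotes, and any single quote inside is written twice.
    if "'" in value or any(ch.isspace() for ch in value):
        value = "'" + value.replace("'", "''") + "'"
    return value


def build_environment(variables):
    """Return a complete 'environment = "..."' submit file line."""
    entries = [name + "=" + quote_env_value(val) for name, val in variables.items()]
    return 'environment = "' + " ".join(entries) + '"'


# Reproduces the example from the slide.
print(build_environment({"a": '"quote"', "b": "a 'b' c"}))
```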

Vanilla universe jobs
• getenv = <True | False>
  ◦ If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job ClassAd. The job will therefore execute with the same set of environment variables that the user had at submit time.
  ◦ Defaults to False.

Vanilla universe jobs
• A more advanced submission:

    Universe = Vanilla
    Executable = example
    Log = test.$(cluster).$(process).log
    Output = test.$(cluster).$(process).out
    Error = test.$(cluster).$(process).error
    Queue 7

• Here we see a use of predefined macros. The $(cluster) macro gives us the value of the ClusterId job ClassAd attribute, and the $(process) macro supplies the value of the ProcId job ClassAd attribute.
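To see what the macros produce, the following Python sketch expands the output file name for every job created by "Queue 7". It is only an illustration of the substitution, not how condor_submit is implemented; the function names and the cluster number 42 are made up, and process numbers are assumed to run from 0 to 6 within one cluster.

```python
def expand_macros(template, cluster, process):
    """Substitute the two predefined macros used in the submit file."""
    return (template.replace("$(cluster)", str(cluster))
                    .replace("$(process)", str(process)))


def expanded_names(template, cluster, queue_count):
    """One expanded name per job; process numbers start at 0."""
    return [expand_macros(template, cluster, proc) for proc in range(queue_count)]


# Hypothetical cluster number 42, seven jobs as in "Queue 7".
for name in expanded_names("test.$(cluster).$(process).out", 42, 7):
    print(name)
```

Because every job gets its own process number, the seven jobs do not overwrite each other's log, output, and error files.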

Macros
• More on macros:
  ◦ A macro is defined as follows: <macro_name> = string
  ◦ It can then be used by writing $(macro_name).
  ◦ $$(attribute) is used to get a ClassAd attribute from the machine running the job.
  ◦ $ENV(variable) gives us the environment variable 'variable'; it is evaluated when condor_submit processes the submit file, so the value comes from the submitter's environment.
  ◦ For more on macros go to Condor's manual, condor_submit section.

Other universes
• Standard universe
• Java universe

Standard universe
• The Standard universe provides checkpointing and remote system calls.
• Remote system calls:
  ◦ All system calls made by the job running in Condor are made on the submitting computer.
• Checkpointing:
  ◦ Saves a snapshot of the current state of the running job, so the job can be restarted from the saved state in case of:
    - Migration to another computer
    - Machine crash or failure

Standard universe
• In order to execute a program in the Standard universe it must be relinked with Condor's library. To do so, use condor_compile with your usual link command. Example:
  ◦ condor_compile gcc example.c
• To manually cause a checkpoint use condor_checkpoint hostname.
• There are some restrictions on jobs running in the Standard universe:

Standard universe restrictions
• Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
• Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
• Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
• Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
• Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

Standard universe restrictions
• Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
• Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
• File locks are allowed, but not retained between checkpoints.
• All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
• Your job must be statically linked (on Digital Unix (OSF/1), HP-UX, and Linux, and therefore at our school).
• Reading from or writing to files larger than 2 GB is not supported.

Java universe
• Used to run Java programs.
• Example submit description file:

    universe = java
    executable = Example.class
    arguments = Example
    output = Example.output
    error = Example.error
    queue

• Notice that the first argument is the main class of the job.
• The JVM must be informed when submitting jar files. This is done in the following way:
  ◦ jar_files = example.jar
• To run on a machine with a specific Java version:
  ◦ Requirements = (JavaVersion == "1.5.0_01")
• Options to the Java VM itself can be set in the submit description file:
  ◦ java_vm_args = -DMyProperty=Value -verbose:gc …
  ◦ These options go after the java command but before the main class (usage: java [options] class [args...]). Do not use this to set the classpath (Condor handles that itself).

Matlab functions
• Matlab functions/scripts are written in .m files.
• Structure:

    function {ret_var = } func_name(arg1, arg2, …)
    …

Running Matlab functions in Condor
• First method: Calling matlab
  ◦ What we want to do is run:
    - matlab -nodisplay -nojvm -nosplash -r 'func(arg1, arg2, …)'
  ◦ Instead of transferring the matlab executable we'll write a script (run.csh):

    #!/bin/csh -f
    matlab -nodisplay -nojvm -nosplash -r "$*"

Running Matlab functions in Condor
• First method: Calling matlab
  ◦ The submission file:

    executable = run.csh
    log = mat.log
    error = mat.error
    output = mat.output
    universe = vanilla
    getenv = True
    arguments = func(arg1, arg2, …)
    queue 1

  ◦ Notice that in order to run matlab we must set getenv = True.

Running Matlab functions in Condor
• Second method: Compiling the function
  ◦ First, we compile our Matlab script, example.m, into an executable: mcc -mv example.m
    The -v option is not mandatory. It is used to show details of the compilation process.
  ◦ The files required for running will be "example" and example.ctf.
  ◦ The compiled function requires Matlab's shared libraries in order to run. So, we'll send Condor a script which defines the necessary environment variables and then runs the executable.

Running Matlab functions in Condor
• Second method: Compiling the function
  ◦ The script:

    #!/bin/tcsh
    setenv LD_LIBRARY_PATH /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/bin/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/os/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/opengl/lib/glnx86:
    setenv XAPPLRESDIR /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/X11/app-defaults
    setenv LD_PRELOAD /libgcc_s.so.1
    ./multi $1 $2

ClassAds
• ClassAds are a flexible mechanism for representing the characteristics and constraints of machines and jobs in the Condor system.
• Condor acts as a matchmaker for ClassAds. ClassAds are analogous to the classified advertising section in a newspaper.
• All machines running Condor advertise their attributes. A machine also advertises under what conditions it is willing to run a job, and what type of job it would prefer.
• When submitting a job, you specify your requirements and preferences. These attributes are bundled up into a job ClassAd.

ClassAds
• ClassAd expressions are formed by composing literals, attribute references and other sub-expressions with operators and functions.
  ◦ Literals may be:
    - Integers (including TRUE, which is 1, and FALSE, which is 0)
    - Reals
    - Strings: a list of characters between two double quote characters. Use a backslash (\) to include the following character in the string, irrespective of what that character is.
    - The UNDEFINED keyword (case insensitive)
    - The ERROR keyword (case insensitive)

ClassAds
• Attributes
  ◦ A pair (name, expression) is called an attribute.
  ◦ The attribute name is case insensitive.
  ◦ An optional scope resolution prefix may be added: "MY." or "TARGET."
  ◦ MY. refers to an attribute defined in the current ClassAd.
  ◦ TARGET. refers to an attribute defined in the ClassAd against which the current ClassAd is evaluated.
  ◦ If no scope prefix is given, first try "MY."; if not found, try "TARGET."; if not found, try the ClassAd environment; if still not found, the value is UNDEFINED.
  ◦ If there is a circular dependency between two ClassAds (e.g. A uses B and B uses A) then the value is ERROR.
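The lookup order for attribute references can be sketched in a few lines of Python. This is a toy model of the rule stated above, with ClassAds represented as plain dictionaries; the function name `resolve` and the sample attributes are hypothetical, and circular references (the ERROR case) are not modelled.

```python
UNDEFINED = "UNDEFINED"


def resolve(reference, my_ad, target_ad, environment=None):
    """Look up an attribute reference with an optional MY./TARGET. prefix."""
    environment = environment or {}
    scope, _, name = reference.rpartition(".")
    if scope.upper() == "MY":
        search = [my_ad]
    elif scope.upper() == "TARGET":
        search = [target_ad]
    else:
        # No scope prefix: try MY, then TARGET, then the environment.
        name = reference
        search = [my_ad, target_ad, environment]
    for ad in search:
        # Attribute names are case insensitive.
        lowered = {key.lower(): value for key, value in ad.items()}
        if name.lower() in lowered:
            return lowered[name.lower()]
    return UNDEFINED


job_ad = {"Owner": "alice", "ImageSize": 2048}
machine_ad = {"Memory": 512, "Arch": "INTEL"}
print(resolve("memory", job_ad, machine_ad))     # found in the target ad
print(resolve("MY.Memory", job_ad, machine_ad))  # not in the job ad
```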

ClassAds
• Operators
  ◦ The operators are similar to those of the C language.
  ◦ All operators are case insensitive for strings, with the following exceptions:
      =?=   "is identical to" operator (similar to ==)
      =!=   "is not identical to" operator (similar to !=)
  ◦ Precedence: (table not preserved in this transcript)

ClassAds
• Predefined functions
  ◦ Examples:
    - Integer strcmp(AnyType Expr1, AnyType Expr2)
    - String strcat(AnyType Expr1 [, AnyType Expr2 ...])
    - Boolean isInteger(AnyType Expr)
  ◦ Function names are case insensitive.
  ◦ For a full list of the functions refer to the user manual, section 4.1.1.4.

ClassAds
• When submitting a job, one can give requirements; only machines that satisfy them may run the job.
• One can also rank the machines available to run the job and choose the highest ranked machine to run the job.
• This can be done using the Requirements and Rank commands in the submission file.

ClassAds submission commands
• Requirements = <ClassAd Boolean Expression>
  ◦ The job will run on a machine only if the requirements expression evaluates to TRUE on that machine.
  ◦ Example:
      requirements = Memory >= 64 && Arch == "intel"
    The running machine must have at least 64 MB of RAM and its architecture must be INTEL.
  ◦ The computers in our school have two possible architecture names: "INTEL" for a 32-bit computer or "X86_64" for a 64-bit computer.

ClassAds submission commands
• By default Condor adds the following requirements to the requirements of a job:
  ◦ Arch and OpSys are the same as on the submitting computer.
  ◦ Disk >= DiskUsage. The DiskUsage attribute is initialized to the size of the executable plus the size of any files specified in a transfer_input_files command.
  ◦ (Memory * 1024) >= ImageSize. This ensures the target machine has enough memory to run your job.
  ◦ If Universe is set to Vanilla, FileSystemDomain is set equal to the submit machine's FileSystemDomain.
• In order to see a submitted job's requirements (along with everything else about the job) use condor_q -l.
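As a rough illustration, the default requirements listed above can be written as one Python predicate over two dictionaries. The function name and the sample values are hypothetical; the attribute names are the ones from the slide, and the FileSystemDomain check is applied only for the Vanilla universe.

```python
def default_requirements_ok(job, machine):
    """Check the requirements that are added to every job by default."""
    ok = (
        machine["Arch"] == job["Arch"]
        and machine["OpSys"] == job["OpSys"]
        and machine["Disk"] >= job["DiskUsage"]
        # Memory is multiplied by 1024 before the comparison, as on the slide.
        and machine["Memory"] * 1024 >= job["ImageSize"]
    )
    if job.get("Universe", "").lower() == "vanilla":
        ok = ok and machine["FileSystemDomain"] == job["FileSystemDomain"]
    return ok


# Made-up sample values for illustration only.
job = {"Arch": "INTEL", "OpSys": "LINUX", "DiskUsage": 500, "ImageSize": 20000,
       "Universe": "Vanilla", "FileSystemDomain": "cs.example"}
small = {"Arch": "INTEL", "OpSys": "LINUX", "Disk": 100000, "Memory": 16,
         "FileSystemDomain": "cs.example"}
large = dict(small, Memory=512)
print(default_requirements_ok(job, small), default_requirements_ok(job, large))
```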

ClassAds submission commands
• rank = <ClassAd Float Expression>
  ◦ Sorts all matching machines by the given expression.
  ◦ Condor will give the job the machine with the highest rank.
  ◦ The expression is a numeric expression (where boolean sub-expressions evaluate to 1.0 or 0.0).
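How Requirements and Rank work together can be shown with a simplified Python sketch: filter the machines by the requirements, then take the one with the highest rank. The machine list and the rank expression (prefer more memory) are invented for illustration, and real matchmaking also takes the machine's own requirements and the user priorities into account, which this sketch ignores.

```python
machines = [
    {"Name": "node1", "Arch": "INTEL", "Memory": 512},
    {"Name": "node2", "Arch": "X86_64", "Memory": 2048},
    {"Name": "node3", "Arch": "INTEL", "Memory": 32},
    {"Name": "node4", "Arch": "INTEL", "Memory": 1024},
]


def requirements(machine):
    # requirements = Memory >= 64 && Arch == "intel"
    # (== on strings is case insensitive in ClassAds)
    return machine["Memory"] >= 64 and machine["Arch"].lower() == "intel"


def rank(machine):
    # Hypothetical rank expression: rank = Memory
    return float(machine["Memory"])


def choose_machine(candidates, requirements, rank):
    """Return the matching machine with the highest rank, or None."""
    matching = [m for m in candidates if requirements(m)]
    if not matching:
        return None
    return max(matching, key=rank)


print(choose_machine(machines, requirements, rank)["Name"])
```

A boolean rank such as Memory > 600 would contribute 1.0 for node4 and 0.0 for node1, which is how boolean sub-expressions enter a numeric rank.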

DAGMan
• Use a directed acyclic graph (DAG) to represent a set of jobs to be run in a certain order.
• A basic DAG submit file:

    JOB name1 submit_file1
    JOB name2 submit_file2
    …

  ◦ If "DONE" is specified at the end of a JOB line then that job is considered complete and is not submitted.

DAGMan
• Additional DAG commands:
  ◦ SCRIPT:
    - Sets processing to be done before/after running the job. These "scripts" run on the submitting machine.
    - SCRIPT PRE job_name executable [arguments]
        Runs the executable before job_name is submitted.
    - SCRIPT POST job_name executable [arguments]
        Runs the executable after job_name has completed its execution under Condor.

DAGMan
• Additional DAG commands:
  ◦ PARENT … CHILD
    - Used to describe the dependencies between the jobs.
    - PARENT p1 p2 … CHILD c1 c2 …
        Makes every pi a parent of every ci (i.e. the ci's will be submitted only after all pi's have completed their execution).
  ◦ RETRY
    - RETRY jobName NumOfRetries [UNLESS-EXIT value]
        If the job fails, it is run again at most NumOfRetries times.
        If UNLESS-EXIT is specified and the value returned equals "value", then no further retries will be attempted.
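The effect of PARENT … CHILD lines on submission order can be sketched in Python. The parser below understands only JOB and PARENT … CHILD lines and is not how DAGMan itself is written; the function names and the sample DAG text are hypothetical. It groups the jobs into waves, where each wave can be submitted once all earlier waves have completed.

```python
from collections import defaultdict

SAMPLE_DAG = """\
JOB A a.submit
JOB B b.submit
JOB C c.submit
PARENT A CHILD B C
"""


def parse_dag(text):
    """Return (job names, mapping child -> set of parents)."""
    jobs, parents = [], defaultdict(set)
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs.append(words[1])
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents[child].update(words[1:split])
    return jobs, parents


def submission_waves(jobs, parents):
    """Group jobs so that every job appears after all of its parents."""
    done, waves, pending = set(), [], list(jobs)
    while pending:
        ready = [j for j in pending if parents.get(j, set()) <= done]
        if not ready:
            raise ValueError("dependency cycle: not a DAG")
        waves.append(ready)
        done.update(ready)
        pending = [j for j in pending if j not in done]
    return waves


print(submission_waves(*parse_dag(SAMPLE_DAG)))
```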

DAGMan
• Additional DAG commands:
  ◦ VARS
    - Defines macros that can be used in the submit description file of a job.
    - VARS jobName macroname="string" [macroname2="string" …]
  ◦ ABORT-DAG-ON
    - Aborts the entire DAG if a specific node returns a specific value.
    - Stops all nodes within the DAG immediately. This includes nodes currently running.
    - ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]
    - By default the returned value of the DAG is the value returned from the aborted node. If RETURN is specified then the return value is DAGReturnValue.

DAGMan
• Example DAG file:

    JOB A a.submit
    JOB B b.submit
    JOB C a.submit
    PARENT A CHILD B C
    RETRY C 3
    ABORT-DAG-ON A 2

• Submission of DAGs is done with: condor_submit_dag file.dag
• In order to specify the maximum number of jobs submitted by DAGMan add the argument:
  ◦ -maxjobs numOfJobs
• If any node in a DAG fails:
  ◦ DAGMan continues to run the remainder of the nodes until no more forward progress can be made.
  ◦ Then it creates a rescue file (input_file.rescue), where for each node that completed its execution the corresponding JOB line ends with DONE.
  ◦ Submitting this file continues DAG execution.
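The rescue file idea can be illustrated with a short Python sketch that marks the JOB lines of completed nodes with DONE and leaves every other line untouched. DAGMan writes the rescue file itself; the function `rescue_dag` is a hypothetical helper that only shows what the result looks like.

```python
def rescue_dag(dag_text, completed):
    """Append DONE to the JOB line of every node listed in completed."""
    lines = []
    for line in dag_text.splitlines():
        words = line.split()
        is_job_line = len(words) >= 3 and words[0].upper() == "JOB"
        if is_job_line and words[1] in completed and words[-1].upper() != "DONE":
            line = line.rstrip() + " DONE"
        lines.append(line)
    return "\n".join(lines)


dag = "JOB A a.submit\nJOB B b.submit\nPARENT A CHILD B"
# Suppose node A completed and node B failed.
print(rescue_dag(dag, {"A"}))
```

When such a file is submitted, node A is treated as complete and only B is run again.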

DAGMan
• It is possible to create a visualization of the DAG:
  ◦ Add a line to the DAG file with:
    - DOT dot_file.dot
  ◦ Submit the DAG.
  ◦ Run: dot -Tps dot_file.dot -o dag.ps
• A DAG inside a DAG:
  ◦ Suppose you want to include inner.dag in outer.dag.
  ◦ Execute:
    - condor_submit_dag -no_submit inner.dag
  ◦ Include the following JOB line in outer.dag:
    - JOB jobName inner.dag.condor.sub
    - inner.dag.condor.sub is the submission file for inner.dag.