Condor by Example
Douglas Thain
Computer Sciences Department, University of Wisconsin-Madison
October 2000
thain@cs.wisc.edu
http://www.cs.wisc.edu/condor

Lecture Format
› In each lecture:
  • Lecture to the whole group.
  • Workshop and examples at the computer.
› Oops!
  • Some items are filled in at the last minute.
  • Please fill the _______ with notes.

Outline
› Overview
› Submitting Jobs, Getting Feedback
› Setting Requirements with ClassAds
› Which Universe?
› Move to Workshop

What is Condor?
› Condor converts a collection of unrelated workstations into a high-throughput computing facility.
› Condor uses matchmaking to make sure that everyone is happy.

What is High-Throughput Computing?
› High-performance: CPU cycles/second under ideal circumstances.
  • “How fast can I run simulation X on this machine?”
› High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
  • “How many times can I run simulation X in the next week using all available machines?”

What is High-Throughput Computing?
› Condor does whatever it takes to run your jobs, even if some machines…
  • Crash!
  • Are disconnected
  • Run out of disk space
  • Are removed from or added to the pool
  • Are put to other uses

What is Matchmaking?
› Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners.
› Users (jobs) have constraints:
  • “I need an Alpha with 256 MB RAM”
› Owners (machines) have constraints:
  • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”

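The two-sided idea can be sketched in a few lines of Python. This is only an illustration of matching under constraints from both parties, not Condor's actual matchmaker; the `Machine` and `Job` classes, the machine names, and the `banned_owners` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    arch: str
    memory_mb: int
    owner_away: bool


@dataclass
class Job:
    owner: str
    arch: str
    min_memory_mb: int


def job_accepts(job, machine):
    """User-side constraint: right architecture and enough RAM."""
    return machine.arch == job.arch and machine.memory_mb >= job.min_memory_mb


def machine_accepts(machine, job, banned_owners=("bob",)):
    """Owner-side constraint: owner is away and the job's owner is not banned."""
    return machine.owner_away and job.owner not in banned_owners


def match(job, machines):
    """Return the first machine on which BOTH sides agree, or None."""
    for machine in machines:
        if job_accepts(job, machine) and machine_accepts(machine, job):
            return machine
    return None


# Hypothetical pool and jobs, for illustration only.
machines = [
    Machine("desk1", "INTEL", 128, owner_away=True),
    Machine("desk2", "ALPHA", 256, owner_away=False),
    Machine("desk3", "ALPHA", 256, owner_away=True),
]
alice = Job("alice", "ALPHA", 256)
bob = Job("bob", "ALPHA", 256)

print(match(alice, machines))
print(match(bob, machines))
```

A match is made only when the job's constraint and the machine's constraint both hold, so neither the user nor the owner can be overruled by the other.
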
Who uses Condor?
› Hundreds of universities and companies around the world!
› University of Wisconsin, USA
  • 682 CPUs in one building
  • Computer architecture simulations
› National Institute of Physics, Italy
  • 200 CPUs in many cities
  • Reconstruction of collider events
› And many others!

What can Condor do for me?
Condor can…
› …increase your throughput.
› …do your housekeeping.
› …improve reliability.
› …give performance feedback.

Cluster Overview
› Server: 512 MB, 800 MHz, 20 GB
› Client: 128 MB, 666 MHz, 10 GB
› Client: 128 MB, 666 MHz, 10 GB
› 100 Mb/s network

How many machines now?
› The map is out of date!
› The system is always changing.
› First example: What machines (and of what kind) are in the pool now?

How Many Machines?
% condor_status
Name OpSys Arch
lxpc1.na.infn LINUX-GLIBC INTEL
axpd21.pd.inf OSF1 ALPHA
vlsi11.pd.inf SOLARIS26 SUN4u
State Activity LoadAv Mem
Unclaimed Owner Claimed Idle Busy 0.000 0.266 0.000 30 96 256
...
Machines Owner Claimed Unclaimed Matched Preempting
ALPHA/OSF1 INTEL/LINUX-GLIBC SUN4u/SOLARIS251 SUN4u/SOLARIS26 SUN4u/SOLARIS27 SUN4x/SOLARIS26
115 53 16 1 2 67 18 7 1 2 1 1 46 0 0 0 1 35 9 0 4 0 1 0 0 0
Total 194 97 46 50 0 1

Machine States
› Most machines will be:
  • Owner: The machine’s owner is busy at the console, so no Condor jobs may run.
  • Claimed: Condor has selected the machine to run jobs for other users.

Machine States
› Only a few should be:
  • Unclaimed: The owner is gone, but Condor has not yet selected the machine.
  • Matched: A transitional state between Unclaimed and Claimed.
  • Preempting: Condor is busy removing a job.

More Things to Try
% condor_status -help
% condor_status -avail
% condor_status -run
% condor_status -total
% condor_status -pool condor.cs.wisc.edu

Submitting Jobs

Steps to Running a Job
› Re-link for Condor.
› Submit the job.
› Watch the progress.
› Receive email when done.

Example Job
Integrate sin(x) from 0 to 10, using 10 million slices. The simple program takes a few seconds.
% ./integrate 10 10000000
2.0445075

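C     Integrate sin(x) from 0 to LIMIT using SLICES steps.
      PROGRAM INTEGRATE
      CHARACTER STR*10
      REAL X, SLICES, LIMIT
C     Read the upper limit and the slice count from the command line.
      CALL GETARG(1, STR)
      READ (STR, *) LIMIT
      CALL GETARG(2, STR)
      READ (STR, *) SLICES
      TOTAL=0
      STEP=LIMIT/SLICES
C     Add up the area of one rectangle per step.
      DO X=0, LIMIT, STEP
        TOTAL = TOTAL + SIN(X)*STEP
      END DO
      PRINT *, TOTAL
      END
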
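For readers who do not have a Fortran compiler at hand, the same rectangle-sum idea can be sketched in Python. This is an illustrative equivalent rather than a translation of the program above: it uses a left Riemann sum with an integer loop counter and double precision, and the function name `integrate` and the reduced slice count are choices made here. In double precision the sum approaches 1 - cos(10), roughly 1.839; a single-precision program that accumulates millions of small steps may well print a different figure.

```python
import math


def integrate(limit, slices):
    """Left Riemann sum of sin(x) on [0, limit] with `slices` equal steps."""
    step = limit / slices
    total = 0.0
    for i in range(slices):
        # Compute x from the integer counter to avoid accumulating step error.
        total += math.sin(i * step) * step
    return total


# A smaller slice count than the slide's 10 million keeps this quick to run.
result = integrate(10.0, 100_000)
print(result)
print(1.0 - math.cos(10.0))  # exact value of the integral, for comparison
```

Each call is independent of every other call, which is what makes this kind of job easy to run many times over on separate machines.
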
Re-link for Condor
› If you normally compile like this:
  • g77 integrate.f -o integrate
› Then compile for Condor like this:
  • condor_compile g77 integrate.f -o integrate

Submit the Job
› Create a submit file:
  • emacs integrate.submit &
Executable = integrate
Arguments = 10 10000000
Output = integrate.out
Log = integrate.log
queue
› Submit the job:
  • condor_submit integrate.submit

Watch the Progress
% condor_q
-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
ID   OWNER  SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
5.0  thain  6/21 12:40  0+00:15   R   0    2.5   fib 40
› ID: Each job gets a unique number.
› ST: Status: Unexpanded, Running or Idle.
› SIZE: Size of program image (MB).

Receive E-mail When Done
This is an automated email from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.
Your condor job /tmp_mnt/usr/users/ccl/thain/test/fib 40 exited with status 0.
Submitted at: Wed Jun 21 14:24:42 2000
Completed at: Wed Jun 21 14:36 2000
Real Time: 0 00:11:54
Run Time: 0 00:06:52
Committed Time: 0 00:01:37
...

Running Many Processes
› 100 processes are almost as easy as 1.
› Each condor_submit makes one cluster of one or more processes.
› Add the number of processes to run to the Queue statement.
› Use the $(PROCESS) variable to give each process slightly different instructions.

Running Many Processes
› Run the same program on 50 different intervals.
› Process numbers start at 0, so output goes in integrate.out.0, integrate.out.1, and so on…
Executable = integrate
Arguments = $(PROCESS) 10000000
Output = integrate.out.$(PROCESS)
Log = integrate.log
Queue 50

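To see what the submit file above asks for, the substitution of $(PROCESS) can be imitated in Python. This sketch only mimics the textual expansion for illustration; it is not how condor_submit is implemented, and the `expand_queue` helper and the cluster number 9 are assumptions made here.

```python
def expand_queue(template, count, cluster=1):
    """Expand $(PROCESS) in every setting, once per queued process."""
    jobs = []
    for process in range(count):  # process numbers start at 0
        job = {
            key: value.replace("$(PROCESS)", str(process))
            for key, value in template.items()
        }
        job["id"] = f"{cluster}.{process}"  # cluster.process, as shown by condor_q
        jobs.append(job)
    return jobs


template = {
    "Executable": "integrate",
    "Arguments": "$(PROCESS) 10000000",
    "Output": "integrate.out.$(PROCESS)",
    "Log": "integrate.log",
}

jobs = expand_queue(template, 50, cluster=9)
print(len(jobs))
print(jobs[0]["id"], jobs[0]["Arguments"], jobs[0]["Output"])
print(jobs[49]["id"], jobs[49]["Arguments"], jobs[49]["Output"])
```

All 50 processes share one log file and one executable; only the argument and the output file name differ.
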
Running Many Processes
% condor_q
-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
ID   OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
9.3  thain  6/23 10:47  0+00:05:40  R   0    2.5   fib 3
9.6                     0+00:05:11  R   0          fib 6
9.7                     0+00:05:09  R   0          fib 7
...
21 jobs; 2 idle, 19 running, 0 held
› ID: The number before the dot is the cluster number; the number after it is the process number.

Where Are They Running?
% condor_q -run
-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
ID    OWNER  SUBMITTED   RUN_TIME    HOST(S)
9.47  thain  6/23 10:47  0+00:07:03  ax4bbt.bo.infn.it
9.48                     0+00:06:51  pewobo1.bo.infn.it
9.49                     0+00:06:30  osde01.pd.infn.it
› HOST(S): Current location.

Help! I’m buried in Email!
› By default, Condor sends one email for each completed process.
› Add one of these to your submit file:
  • notification = error
  • notification = never
› To send it to someone else:
  • notify_user = thain@cs.wisc.edu

Removing Processes
› Remove one process:
  • condor_rm 9.47
› Remove a whole cluster:
  • condor_rm 9
› Remove everything!
  • condor_rm -a

Getting Feedback

What have I done?
› The user log file (fib.log) shows a chronological list of everything important that happened to a job.
001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
004 (007.035.000) 06/21 17:04:58 Job was evicted.
009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.

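Because every event line in the user log has the same shape, it is easy to pull apart with a script. The Python sketch below parses lines of the form shown above; it assumes the parenthesised field reads as cluster.process.subprocess and that anything not matching the pattern can be skipped. The `parse_log` helper and the field names are hypothetical, not part of Condor.

```python
import re

# Matches lines such as:
# 001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
EVENT = re.compile(
    r"^(?P<code>\d{3}) "
    r"\((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<text>.*)$"
)


def parse_log(lines):
    """Return one dict per recognised event line; other lines are ignored."""
    events = []
    for line in lines:
        m = EVENT.match(line.strip())
        if m is None:
            continue
        events.append({
            "code": m.group("code"),
            # Assumption: the first two numbers are cluster and process.
            "job": f"{int(m.group('cluster'))}.{int(m.group('proc'))}",
            "when": f"{m.group('date')} {m.group('time')}",
            "text": m.group("text"),
        })
    return events


sample = [
    "001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>",
    "004 (007.035.000) 06/21 17:04:58 Job was evicted.",
    "009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.",
]

events = parse_log(sample)
for event in events:
    print(event["when"], event["job"], event["text"])
```
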
What have I done?
% condor_history
ID    OWNER  CPU_USAGE   ST  CMD
9.3   thain  0+00:00     C   fib 3
9.40  thain  0+00:24     C   fib 40
9.10  thain  0+00:00     C   fib 10
9.47  thain  0+00:05:45  C   fib 47
9.7   thain  0+00:00     C   fib 7
SUBMITTED  6/23 10:47  6/23 10:47
COMPLETED  6/23 10:58  6/23 10:59  6/23 11:01

Brief I/O Summary
% condor_q -io
-- Schedd: c01.cs.wisc.edu : <128.105.146.101:2016>
ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
756.15  joe    244.9 KB  379.8 KB  71    1.3 KB/s   512.0 KB  32.0 KB
758.24  joe    198.8 KB  219.5 KB  78    45.0 B /s  512.0 KB
758.26  joe    44.7 KB   22.1 KB   2727  13.0 B /s  512.0 KB
3 jobs; 0 idle, 3 running, 0 held

Complete I/O Summary in Email
Your condor job "/usr/joe/records.remote input output" exited with status 0.
Total I/O:
  104.2 KB/s effective throughput
  5 files opened
  104 reads totaling 411.0 KB
  316 writes totaling 1.2 MB
  102 seeks
I/O by File:
  buffered file /usr/joe/input
    opened 2 times
    100 reads totaling 398.6 KB
    311 write totaling 1.2 MB
    101 seeks
(Only since Condor Version 6.1.11)

Complete I/O Summary in Email
› The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.

Complete I/O Summary in Email
› Example:
  • CMSSIM - collider simulation
  • “Why is this job so slow?”
  • Data summary: read 250 MB from 20 MB file.
  • Very high SEEK total -> random access.
  • Solution: Increase buffer to 20 MB.

Who Uses Condor?
% condor_q -global
-- Schedd: to02xd.to.infn.it : <192.84.137.2:1030>
ID     OWNER     SUBMITTED   RUN_TIME     ST  PRI  SIZE  CMD
127.0  garzelli  6/21 18:45  1+14:18:16   R   0    17.2  tosti 2 trisdn
-- Schedd: quark.ts.infn.it : <140.105.6.101:3908>
ID     OWNER     SUBMITTED   RUN_TIME     ST  PRI  SIZE  CMD
600.0  dellaric  4/10 14:57  55+09:20:31  R   0    9.1   john p2.dat
665.0  dellaric  6/2 11:14   20+03:27:30  R   0    9.2   john p1.dat
788.0  pamela    6/20 09:27  3+04:41:43   R   0    15.4  montepamela

Who uses Condor?
% condor_status -submitters
Name                 Machine     Running  IdleJobs  MaxJobsRunning
rebuzzin@pv.infn.it  decux1.pv.  22       34        200
pamela@ts.infn.it    quark.ts.i  6        1         200
giunti@to.infn.it    to05xd.to.  21       49
...
                     RunningJobs  IdleJobs
cattaneo@pv.infn.it  0            1
pamela@ts.infn.it    6            1
rebuzzin@pv.infn.it  22           34
Total                59           86

Who Uses Condor?
% condor_userprio
Last Priority Update: 6/23 16:27
User Name
----------------
meucci@pv.infn.it
longof@ts.infn.it
thain@bo.infn.it
dellaric@ts.infn.it
clueoff@pd.infn.it
pamela@ts.infn.it
rebuzzin@pv.infn.it
giunti@to.infn.it
----------------
Effective Priority
-----
0.50 2.00 3.00 5.81 18.18 19.72
-----
Number of users shown: 8

Who Uses Condor?
› The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter.
› Lighter users receive a lower priority value, and a lower value is better: they will be allocated CPUs before heavy users.
› Users consuming the same amount of CPU will be allocated an equal amount.

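One way to picture the effect of priority values is to divide a pool among submitters in inverse proportion to their values. The Python sketch below does exactly that. It is an assumption made for illustration, not a description of Condor's negotiation algorithm; the `fair_shares` function, the user names, and the pool size are hypothetical.

```python
def fair_shares(priorities, total_cpus):
    """Split total_cpus in inverse proportion to each user's priority value.

    Assumption for illustration: a lower value means a larger share.
    """
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total_weight = sum(weights.values())
    return {
        user: total_cpus * weight / total_weight
        for user, weight in weights.items()
    }


# A light user (10.0) and a heavy user (20.0) competing for 30 CPUs.
shares = fair_shares({"light": 10.0, "heavy": 20.0}, 30)
print(shares)

# Users with the same value receive the same share.
equal = fair_shares({"a": 5.0, "b": 5.0}, 10)
print(equal)
```
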
Measuring Goodput
› Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor.
› This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput

Measuring Goodput
% condor_q -goodput
-- Submitter: coral.cs.wisc.edu : <128.105.175.116:45697> : coral.cs.wisc.edu
ID      OWNER  SUBMITTED   RUN_TIME    GOODPUT  CPU_UTIL  Mb/s
719.74  thain  6/23 07:35  2+20:47:59  100.0%   87.6%     0.00
719.75  thain  6/23 07:35  2+20:38:45  40.5%    99.8%     0.00
719.76  thain  6/23 07:35  2+20:38:16  96.9%    98.7%     0.00
719.77  thain  6/23 07:35  2+21:10:06  100.0%   99.8%     0.00

Setting Requirements
› We believe that Condor must allow both users (jobs) and owners (machines) to set requirements.
› This is an absolute necessity in order to convince people to participate in the community.

ClassAds
› ClassAds are a simple language for describing both the properties and the requirements of jobs and machines.
› Condor stores nearly everything in ClassAds -- use the -l option to condor_q and condor_status to get the full details.

ClassAd for a Machine
% condor_status -l axpbo8
MyType = "Machine"
TargetType = "Job"
Name = "axpbo8.bo.infn.it"
START = TRUE
VirtualMemory = 342696
Disk = 28728536
Memory = 160
Cpus = 1
Arch = "ALPHA"
OpSys = "OSF1"

ClassAd for a Job
% condor_q -l 9.49
MyType = "Job"
TargetType = "Machine"
Owner = "thain"
Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
Out = "fib.out.49"
Args = "49"
ImageSize = 2544
DiskUsage = 2544
Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

Default Requirements
› By default, Condor assumes the requirements for your job are: “I need a machine with…”
  • The same operating system and architecture as my workstation.
  • Enough disk to store the program.
  • Enough virtual memory to run the program.

ClassAd Requirements
› Similar to C/C++/Java expressions:
  • Symbols: Arch, OpSys, Memory, Mips
  • Values: 15, 6.5, "LINUX"
  • Operators:
    • ==, <, >, <=, >=
    • &&, ||
    • ()

Adding Requirements
› In the submit file, add a line beginning with "requirements = ":
Executable = fib
Arguments = 40
Output = fib.out
Log = fib.log
Requirements = (Memory > 64)
queue

Example Requirements
› (Memory > 64)
› (Machine == "axpbo3.bo.infn.it")
› (Mips > 100) || (Kflops > 10000)
› (Subnet != "131.154.10") && (Disk > 20000000)

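To get a feel for how such expressions narrow down a pool, the four examples can be rewritten as Python predicates and applied to a few machine descriptions. This is only a sketch: the real expressions are evaluated by Condor against machine ClassAds, and the three machines below, apart from the host name taken from the slide, are invented values.

```python
# Hypothetical machine descriptions using the attribute names from the slide.
machines = [
    {"Machine": "axpbo3.bo.infn.it", "Memory": 128, "Mips": 90,
     "Kflops": 12000, "Subnet": "131.154.10", "Disk": 25000000},
    {"Machine": "hypo1.example.org", "Memory": 64, "Mips": 150,
     "Kflops": 8000, "Subnet": "192.0.2", "Disk": 30000000},
    {"Machine": "hypo2.example.org", "Memory": 256, "Mips": 80,
     "Kflops": 9000, "Subnet": "192.0.2", "Disk": 10000000},
]

# Each requirement from the slide, written as an equivalent Python predicate.
requirements = {
    '(Memory > 64)':
        lambda m: m["Memory"] > 64,
    '(Machine == "axpbo3.bo.infn.it")':
        lambda m: m["Machine"] == "axpbo3.bo.infn.it",
    '(Mips > 100) || (Kflops > 10000)':
        lambda m: m["Mips"] > 100 or m["Kflops"] > 10000,
    '(Subnet != "131.154.10") && (Disk > 20000000)':
        lambda m: m["Subnet"] != "131.154.10" and m["Disk"] > 20000000,
}


def matching(predicate, pool):
    """Names of the machines in the pool that satisfy the predicate."""
    return [m["Machine"] for m in pool if predicate(m)]


results = {text: matching(pred, machines) for text, pred in requirements.items()}
for text, names in results.items():
    print(text, "->", names)
```
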
Preferences
› Condor assumes that any machines that match your requirements are suitable.
› However, you may prefer some machines over others. (100 Mips is better than 10)
› To indicate a preference, you may provide a ClassAd expression which ranks all matches.

Rank
› The rank expression is evaluated into a number for every potential matching machine.
› A machine with a higher number will be preferred over a machine with a lower number.

Rank Examples
› Prefer machines with more Mips:
  • Rank = Mips
› Prefer machines with a high ratio of memory to CPU performance:
  • Rank = Memory/Mips
› Prefer more memory, but add 100 to the rank if the machine is Solaris 2.7:
  • Rank = Memory + 100*(OpSys=="SOLARIS27")

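The last rank example relies on a true comparison counting as 1 and a false one as 0. The Python sketch below imitates that expression and sorts candidates so that the highest rank comes first. The three candidate machines and the `rank` function are hypothetical and stand in for the evaluation Condor performs on real machine ClassAds.

```python
def rank(machine):
    """Imitates: Rank = Memory + 100*(OpSys=="SOLARIS27").

    In Python, True multiplies as 1 and False as 0, which mirrors the
    behaviour the slide's expression depends on.
    """
    return machine["Memory"] + 100 * (machine["OpSys"] == "SOLARIS27")


# Hypothetical machines that already satisfy the job's requirements.
candidates = [
    {"Name": "m1", "OpSys": "SOLARIS26", "Memory": 256},
    {"Name": "m2", "OpSys": "SOLARIS27", "Memory": 192},
    {"Name": "m3", "OpSys": "SOLARIS27", "Memory": 128},
]

# Higher rank is preferred, so sort in descending order.
ordered = sorted(candidates, key=rank, reverse=True)
for machine in ordered:
    print(machine["Name"], rank(machine))
```

Here the Solaris 2.7 machine with 192 MB outranks the Solaris 2.6 machine with 256 MB, because the bonus of 100 outweighs the 64 MB difference.
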
Standard or Vanilla?

Which Universe?
› Each Condor universe provides different services to different kinds of programs:
  • Standard: Relinked UNIX programs
  • Vanilla: Unmodified UNIX programs
  • PVM
  • Scheduler
  • Globus
(PVM, Scheduler, and Globus are not described here.)

Standard Universe
› Submit a specially-linked UNIX application to the Condor system.
› Advantages:
  • Checkpointing for fault tolerance.
  • Remote I/O services:
    • Friendly environment anywhere in the world.
    • Data buffering and staging.
    • I/O performance feedback.
    • User remapping of data sources.

Standard Universe
› Disadvantages:
  • Must statically link with Condor library.
  • Limited class of applications:
    • Single-process UNIX binaries.
    • Certain system calls prohibited.

System Call Limitations
› Standard universe does not allow:
  • Multiple processes:
    • fork(), exec(), system()
  • Inter-process communication:
    • semaphores, messages, shared memory
  • Complex I/O:
    • mmap(), select(), poll(), non-blocking I/O, …
  • Kernel-level threads
    • (User-level threads are OK.)

System Call Limitations
› Too restrictive?
  • Use the vanilla universe.

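The rule of thumb on these slides can be written down as a small checklist. The Python sketch below is only a decision aid built from the limitations listed above, not a Condor tool; the `choose_universe` function and the feature labels are hypothetical names chosen here.

```python
# Features the slides list as not allowed in the standard universe.
PROHIBITED = {
    "fork", "exec", "system",
    "semaphores", "messages", "shared memory",
    "mmap", "select", "poll", "non-blocking I/O",
    "kernel threads",
}


def choose_universe(relinkable, features_used):
    """Suggest a universe and report which features rule out 'standard'.

    relinkable: True if the program can be re-linked with condor_compile.
    features_used: iterable of feature labels the program depends on.
    """
    blocked = sorted(set(features_used) & PROHIBITED)
    if relinkable and not blocked:
        return "standard", blocked
    return "vanilla", blocked


print(choose_universe(True, ["read", "write"]))
print(choose_universe(True, ["fork", "mmap"]))
print(choose_universe(False, []))
```
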
Vanilla Universe
› Submit any sort of UNIX program to the Condor system.
› Advantages:
  • No relinking required.
  • Any program at all, including:
    • Binaries
    • Shell scripts
    • Interpreted programs (java, perl)
    • Multiple processes

Vanilla Universe
› Disadvantages:
  • No checkpointing.
  • Very limited remote I/O services.
    • Specify input files explicitly.
    • Specify output files explicitly.
  • Condor will refuse to start a vanilla job on a machine that is unfriendly.
    • ClassAds: FilesystemDomain and UIDDomain

Which Universe?
› Standard:
  • Good for mixed Condor pools, flocked pools, and the Grid at large.
› Vanilla:
  • Good for a Condor pool of identical machines.

Conclusion
› Condor expands your reach to many CPUs, even those you cannot log in to.
› Condor makes it easy to run and manage large numbers of jobs.
› Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O.
› Too restrictive? Use the vanilla universe, but expect fewer available machines.

Move to Workshop
Meet again in room ____ at _____. Bring printouts to follow along.