What’s new in Condor?
Condor Week 2006
Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

So Todd… where is v6.8? Well, v6.7 has been a challenge…

Around since the 80’s

Around since the 80’s: Mullet Boy

100 people surveyed! Favorite “ility”? Deployability!

Existing Ports
• Digital UNIX 4.0 Alpha
• AIX 5.2 (clipped) PowerPC
• Tru64 5.1 (clipped) Alpha
• HP UNIX 10.20 PA RISC
• HP UNIX 11.00 (clipped using hpux 10.20 32 bit) PA RISC
• Irix 6.5 (clipped) SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8 Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9 Intel x86
• Enterprise Server 8.1 Intel Itanium
• Solaris 8 Sparc
• Solaris 9 Sparc
• Microsoft Windows 2000 or XP (clipped) Intel x86

New Ports
› Introduced in v6.6.x
  • MacOSX (“clipped”) PowerPC (Sigh…)
  • Debian Linux 3.1 Intel x86
  • Fedora Core 1 Intel x86
  • Red Hat Enterprise Linux 3 Intel x86
  • SuSE Linux Enterprise Server 8.1 Intel Itanium
› Introduced in v6.7.x
  • AIX 5.1 (“clipped”) PowerPC
  • Fedora Core 2 on x86
  • Fedora Core 3 on x86
  • SuSE 8.0 (“clipped”) on AMD64
  • Solaris 10 (“clipped”) on Sparc
  • Scientific Linux (Release 303) on x86
› Still to be introduced in v6.7.x (before v6.8.0)
  • HPUX 11i 64-bit pa-risc
  • RHEL 4 on x86
  • “native” 64 bit AMD Linux
“Psilord”: the Condor porting doctor. Talk to him in person tomorrow.

Porting Table
› See http://www.cs.wisc.edu/condor/porting/port_table.html
› Highlights
  • Almost every 32-bit Linux flavor as “full”
  • Every other Unix, MacOS and Windows available as “clipped”
  • Solaris 10 and HP-UX 11.x now “clipped”
  • FreeBSD 4 contribution from Yahoo!, added 5 and 6
  • X86_64 Linux: “full” running in the lab

Backfill Jobs
› Execute machines will run a locally staged executable when otherwise idle.
› Currently designed for BOINC.

  # Turn on backfill functionality, and use BOINC
  ENABLE_BACKFILL = TRUE
  BACKFILL_SYSTEM = BOINC
  # Spawn a backfill job if we've been Unclaimed for more than 5 minutes
  START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
  # Evict a backfill job if the machine is busy (based on keyboard
  # activity or cpu load)
  EVICT_BACKFILL = $(MachineBusy)
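
The two policy expressions on this slide can be read as a small decision rule. The Python below is only an illustrative sketch of that reading, not Condor's implementation: the function name `next_backfill_action` and its arguments are hypothetical, and it assumes the start expression is only consulted while the machine is Unclaimed.

```python
MINUTE = 60  # seconds, mirroring the $(MINUTE) macro used on the slide


def next_backfill_action(state, state_timer, machine_busy, backfill_running):
    """Return "start", "evict", or "none" for a hypothetical execute machine.

    state            -- machine state name, e.g. "Unclaimed" or "Claimed"
    state_timer      -- seconds spent in the current state
    machine_busy     -- True when keyboard activity or CPU load is detected
    backfill_running -- True when a backfill job is currently running
    """
    if backfill_running:
        # Mirrors EVICT_BACKFILL = $(MachineBusy)
        return "evict" if machine_busy else "none"
    # Mirrors START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
    if state == "Unclaimed" and state_timer > 5 * MINUTE:
        return "start"
    return "none"
```

With these assumptions, a machine that has been Unclaimed for 301 seconds would start a backfill job, and a running backfill job would be evicted as soon as the machine is reported busy.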

Joining Condor’s Einstein@Home Compute Team
› If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation
› Join the “Condor Backfill” team:
  • http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  • http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

More “deployability”
› “Personal” Condor Support on Win32
  • LocalSystem not required
› MSI installer on Win32 (thanks Micron!)
› New tools
  • condor_cold_start and
  • condor_cold_stop
Safe, dynamic Condor service deployment. More info @ Research BOF, 9 am, Rm 219

100 people surveyed! Favorite “ility”? Availability!

Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram labels: Client app, Server app, GCB layer (on each side), TCP/IP, Relay point; calls shown: connect, listen, accept, translate]

Job Progress continues if connection is interrupted
› Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
  • If there is a network outage between the execute and submit machines
  • If the submit machine restarts
  • Grid Universe was tricky…
› To take advantage of this feature, put the following line into your job’s submit description file:
  JobLeaseDuration = <N seconds>
  For example:
  job_lease_duration = 1200
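
One way to picture the lease: the execute side keeps the job running as long as it has heard from the submit side within the lease duration. The sketch below is a hypothetical model of that idea in Python, not Condor code; the `Lease` class and its method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical model of a job lease held on the execute side."""

    duration: float      # seconds, e.g. job_lease_duration = 1200
    last_renewed: float  # timestamp of the last contact with the submit side

    def renew(self, now):
        # Called whenever the submit side reconnects or checks in.
        self.last_renewed = now

    def expired(self, now):
        # The job may keep running until this becomes True.
        return (now - self.last_renewed) > self.duration
```

Under this model, a 1200-second lease renewed at time 1000 tolerates an outage lasting until time 2200 before the job would be given up.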

Job Progress continues if submit machine fails
› Condor can now support a submit machine “hot spare” (schedd failover)
  • If your submit machine A is down for longer than N minutes, a second machine B can take over
  • Requires shared filesystem between machines A and B

Central Manager Failover
› Condor Central Manager has two services
› condor_collector
  • Now a list of collectors is supported
› condor_negotiator (matchmaker)
  • If it fails, an election process runs and another takes over
  • Accounting state is periodically replicated
  • Contributed technology from Technion

Reliability, cont.
› Time shifts
› Quill
› Closing windows of vulnerability

100 people surveyed! Favorite “ility”? Lightweight?

100 people surveyed! Favorite “ility”? Functionality!

Security
› Common Authentication Methods between Condor on Unix and Win32
  • Kerberos 1.4
    • Additional hopeful benefit: Authentication against MS Active Directory!
  • SSL
  • Password (shared secret)
› Starter only runs known executables
› More powerful, unified map file(s)
› GSI credentials delegated

With Condor on Win32, it would be nice if…
› My jobs could access my files just like the condor_shadow can
› I didn’t have to tie my execute machines to a single account
› I didn’t have to run condor_store_cred from every machine where my credential is needed
(thank you Optena)

The Windows CredD
› A centralized repository for user passwords
  C:\>condor_store_cred add
  Account: gquinn@CROW
  Enter password:
  Operation succeeded.
[Diagram labels: “store password” <password>, credd, stored passwords myp4sswd and y0urs]

The Windows CredD
› Submit machines can use the CredD to impersonate the user in the shadow
[Diagram labels: schedd, “fetch password”, <password>, shadow, stored passwords myp4sswd and y0urs]

The Windows CredD
› Execute machines can use the CredD to run jobs as the submitting user!
[Diagram labels: starter, “fetch password”, <password>, condor_exec.exe, stored passwords myp4sswd and y0urs]

Running Jobs as Submitting User
› In submit file:
  Run_job_as_owner = true
› In config file on submit and execute nodes:
  CREDD_HOST = vault.cs.wisc.edu
  STARTER_ALLOW_RUNAS_OWNER = True
  CREDD_CACHE_LOCALLY = True

Some Condor APIs
› Command Line tools
  • condor_submit, condor_q, etc
  • -format, -constraint, -xml
› Condor Perl Module
› Chirp
› Checkpoint Library API
› MW --- improved!
› DRMAA (Works w/ Win32, on SourceForge)
› Condor Grid ASCII Protocol (GAHP)
› Web Service Interface

DRMAA
› Distributed Resource Management Application API (DRMAA)
  • GGF Working Group
  • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
› An API with C and Java bindings
  • not a protocol
› Scope
  • Does: job submission, monitoring, control, final status
  • Does not: file staging, reservations, security, …

Condor GAHP
› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

GAHP, cont
Example:
  R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA CoG Gahpd $
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: E
  S: RESULTS
  R: E
  S: COMMANDS
  R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
  S: VERSION
  R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA CoG Gahpd $
  S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
  R: S
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: S
  S: RESULTS
  R: S 0
  S: RESULTS
  R: S 1
  R: 100 0
  S: QUIT
  R: S
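
Going only by the transcript on this slide, replies start with S (success) or E (error), and a successful RESULTS reply carries a count followed by one line per completed request, keyed by the request ID given earlier (100 in the example). The Python below is a rough sketch of parsing replies of that shape; `parse_reply` and `read_results` are hypothetical helper names, the input lines are assumed to have the "R: " prefix already stripped, and nothing here is taken from the GAHP specification beyond what the example shows.

```python
def parse_reply(line):
    """Split a reply such as 'S 1' or 'E' into (succeeded, arguments)."""
    parts = line.split()
    if not parts or parts[0] not in ("S", "E"):
        raise ValueError(f"not a recognizable reply: {line!r}")
    return parts[0] == "S", parts[1:]


def read_results(lines):
    """Parse the reply lines that follow a RESULTS command.

    lines[0] is the status line ('S <count>' or 'E'); the next <count>
    lines each hold a request ID followed by its result fields.
    Returns {request_id: [fields, ...]}.
    """
    succeeded, args = parse_reply(lines[0])
    if not succeeded:
        return {}
    count = int(args[0])
    results = {}
    for line in lines[1:1 + count]:
        request_id, *fields = line.split()
        results[int(request_id)] = fields
    return results
```

Fed the last RESULTS exchange from the slide (`["S 1", "100 0"]`), this yields the single result for request 100.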

Web Service Interfaces
› SOAP over http or https to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality exposed in current release
  • Submit jobs
  • Retrieve job output
  • Remove/hold/release jobs
  • Query machine status (fetch ads from collector)
  • Query job status (fetch ads from the schedd)

Getting machine status via SOAP (in Java with Axis)
  locator = new CondorCollectorLocator();
  collector = locator.getcondorCollector(new URL("http://machine:port"));
  ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information you don’t have to write any of these functions.

More Functionality changes…
› FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page)
› Schedd can run HawkEye modules, just like the Startd
  • Enables monitoring on the submit machine
› condor_history: now faster than a snail, and cleans up droppings.
› DeferralTime, DeferralWindow
  • Coordinated starts
› BIND_ALL_INTERFACES in config file
› WANT_REMOTE_IO in job ClassAd

ClassAd Functions in Condor!
› Conditionals
  • IfThenElse(condition, then, else)
› String functions
  • Strcat(), strcmp(), toUpper(), etc.
› StringList functions
  • Example of a “string list” (CSV style)
    • Mylist = “Joe, Jon, Jeff, Jim, Jake”
  • StrListContains(), StrListAppend(), StrListRemove(), etc.
› Others
  • Regular expressions, arithmetic, etc…
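
To make the “string list” idea concrete, here is a small Python sketch of what membership, append, and remove could mean for a comma-separated string. The helper names loosely mirror the function names on the slide but are hypothetical Python functions, not the ClassAd implementations, and details such as whitespace handling are assumptions.

```python
def str_list_split(string_list):
    """Break a CSV-style string list into trimmed, non-empty items."""
    return [item.strip() for item in string_list.split(",") if item.strip()]


def str_list_contains(string_list, item):
    """True if item is exactly one of the list's members."""
    return item in str_list_split(string_list)


def str_list_append(string_list, item):
    """Return a new string list with item added at the end."""
    return ", ".join(str_list_split(string_list) + [item])


def str_list_remove(string_list, item):
    """Return a new string list with every occurrence of item dropped."""
    return ", ".join(x for x in str_list_split(string_list) if x != item)
```

Usage with the slide's example value:

```python
mylist = "Joe, Jon, Jeff, Jim, Jake"
str_list_contains(mylist, "Jeff")   # True
str_list_remove(mylist, "Jon")      # "Joe, Jeff, Jim, Jake"
```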

Accounting Groups and Group Quota Support
› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
  • Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
  • Could use Machine Rank…
    • but this ties to specific machines
  • Or could use new group support
    • Each group can be given a quota in config file
    • Job ads can specify group membership
    • Group quotas are satisfied first
    • Accounting by user and by group
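
The “group quotas are satisfied first” rule can be sketched as a two-pass allocation. The Python below is a deliberately simplified illustration using the slide's sample numbers, not the negotiator's algorithm: `allocate` is a hypothetical function, and the second pass hands out leftover slots in plain dictionary order, whereas a real pool would use user priorities.

```python
def allocate(total_slots, quotas, demand):
    """Grant slots to groups, honoring quotas before anything else.

    total_slots -- number of machines in the pool
    quotas      -- {group: slots guaranteed to that group}
    demand      -- {group: slots the group's idle jobs could use}
    Returns {group: slots granted}.
    """
    granted = {}
    remaining = total_slots

    # Pass 1: satisfy group quotas first (capped by actual demand).
    for group, quota in quotas.items():
        give = min(quota, demand.get(group, 0), remaining)
        granted[group] = give
        remaining -= give

    # Pass 2: hand out whatever is left (simplified: dictionary order).
    for group, want in demand.items():
        extra = min(want - granted.get(group, 0), remaining)
        granted[group] = granted.get(group, 0) + extra
        remaining -= extra

    return granted
```

With a 500-node pool, a Chemistry quota of 100, and another (hypothetical) group asking for 600 slots, Chemistry still receives its 100 and the other group is capped at the remaining 400.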

100 people surveyed! Favorite “ility”? Universability!

Grid Universe
› With new Grid Universe, always specify a ‘gridtype’.
› So the old “globus” Universe is now declared as:
  universe = grid
  gridtype = gt2
› Other gridtypes?
  • GT2 (Globus Toolkit 2)
  • GT3 (Globus Toolkit 3.2)
  • GT4 (Globus Toolkit 3.9.5+) ‘Condor-G’
  • UNICORE
  • Nordugrid
  • PBS (OpenPBS, PBSPro – technology from INFN)
  • LSF (Platform LSF – technology from INFN)
  • CONDOR (thanks gLite!) ‘Condor-C’

Other Grid Universe improvements
› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI (both GT2 and GT4)
  http://grid.ncsa.uiuc.edu/myproxy
› GT4: we start a GridFTP server behind the scenes
  • GridFTP server bundled w/ Condor nowadays
› Some functionality present in Condor-G added to Condor-C
  • Forwarding of refreshed credentials (EGEE)
  • GSI authentication support
  • Cleaner ClassAd representation (URL)

Parallel Universe
› Replaces the “MPI” universe
› Allows running arbitrary programs that need to gang-schedule multiple machines
  • MPICH, LAM, …
  • FT-MPICH (Seoul National Univ)
  • Great for testing environments

Hey Jobs! We’re watching you!
› Local Universe
  • Just like Scheduler Universe, but there is a condor_starter
  • All advantages of the starter
[Diagram labels: Submit side: schedd, starter, job (“Hey, job, behave or else!”); Execute side: startd, starter, job]

100 people surveyed! Favorite “ility”? Scalability!

Faster Negotiation
› SIGNIFICANT_ATTRIBUTES determined automatically
  • Job attributes AutoClusterId and AutoClusterAttributes
  • Rounding of Attributes
› Schedd uses non-blocking TCP connects to the startd
› Negotiator caching
› Collector Forks for queries
› More coming…
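
The point of significant attributes is that jobs which look identical for matchmaking purposes can be negotiated as one group, and rounding attribute values makes more jobs look identical. The Python sketch below illustrates that grouping idea only; `auto_cluster` and `round_up` are hypothetical helpers, and the job dictionaries and attribute names are made up for the example rather than taken from Condor.

```python
from collections import defaultdict


def round_up(value, step):
    """Round value up to the next multiple of step."""
    return -(-value // step) * step


def auto_cluster(jobs, significant_attrs):
    """Group job IDs by the values of their significant attributes.

    jobs              -- list of dicts, each with an "id" key
    significant_attrs -- attribute names that affect matchmaking
    Returns {tuple_of_values: [job ids]}.
    """
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(job.get(attr) for attr in significant_attrs)
        clusters[key].append(job["id"])
    return dict(clusters)
```

For example, two jobs requesting 100 MB and 120 MB of memory fall into separate groups as written, but into the same group once both values are rounded up to a multiple of 64.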

Scalability, cont.
› Knobs
  • GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,
  • GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,
  • GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
› One instance of gridmanager handles multiple jobs (all from a given user)
› One instance of condor_dagman can run multiple dags
  • Is the Shadow next?
› Buffered I/O read on schedd restart (thanks Yahoo!)

Quill
› Job ClassAds information mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility
[Diagram labels: Master, Startd, …, Schedd, Job Queue log, Quill, RDBMS, Queue + History Tables]

Version 6.9.x

What’s brewing for after v6.8.0?
› More data, data
  • Stork distributed now in v6.7.x, incl DAGMan support; next it is NeST’s turn.
  • NeST to manage Condor spool files, ckpt servers
    • GridFTP used to move the bits
  • Quill++ and CondorDB goodness
› Virtual Machines (and the future of Standard Universe)
  • Research BOF w/ Jaeyoung Moon, rm 219, 9 am

SOAP API
› First focus will be to finish interfaces used by all command-line tools
  • condor_userprio, condor_cod, …
› Explore message-based security
  • Ian Alderman’s work w/ signed ClassAd attributes

Privilege Separation
› No more root in the Condor daemons!
› Instead, a small component will be responsible for privileged operations
› Initial exploratory work w/ GNU userv (Cambridge)
› Now focusing on integration w/ glexec (gLite / NIKHEF)

“The Year of the Schedd”
› Schedd is juggling too many tasks
  • Break it down into smaller pieces, more modular
› Scalability
  • All non-blocking I/O
  • Hierarchy of schedds
› Schedd-on-the-side
  • “Scheduler booster”
  • Transform & delegate job classads to different grids
  • A “job router” for a grid

Thank you!