What’s new in Condor? What’s coming up?
Condor Week 2008

Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Release Situation
› Stable Series
    • Current: Condor v7.0.1 (Feb 27th 2008)
    • Last Year: Condor v6.8.4 (Feb 5th 2007)
› Development Series
    • Current: Condor v7.1.0 (April 1st 2008)
    • Last Year: Condor v6.9.2 (April 10th 2007)
› v6.9 Series: ~14 months

Special Condor Week Edition

How many cores in one new UW Condor cluster rack?

New Ports
› RHEL 5 x86 & x86_64 with stduniv and glibc 2.5
› Playstation 3
› HPUX 11i Itanium (almost done)
› Cross testing on x86-like platforms
› Debian clipped port
› Out with the old.
    • Red Hat Linux 7.x systems on the x86 processor.
    • Digital Unix systems on the Alpha processor.
    • Yellow Dog Linux 3.0 systems on the PPC processor.
    • MacOS 10.3 systems on the PPC processor.

Big v7.0 Goodies
› Scalability Improvements
› GCB Improvements
› Privilege Separation
› New Quill
› Virtual Machine Universe

Scalability

Condor’s Privilege Separation
› Apply principle of least privilege to Condor
› No more root / superuser privilege required
› Currently completed on execute side
› Use glexec or Condor’s own “sudo”
› Can still run the “old way” if you want

Quill Take Two in v7.x
› Shared databases
› More than just the JobAd, e.g.
    • Startd: Machine ClassAds
    • Negotiator: matches
    • Run: Job User Log information
› More than just PostgreSQL DBMS
› All the details: http://www.cs.wisc.edu/condor/quill_overview_07-18-2007.pdf

[Diagram: StartD, SchedD, Negotiator, QuillD, sql.log, Disk, DBMS]

Virtual Machine Universe
› Submit a “Job” that consists of a virtual machine image
› Condor schedules, manages, and monitors VM job
› Works w/ VMware Server and Xen
› Matchmaking
› Checkpoint/Restart/Migration
› Data Movement
Plug: BoF Session 1:30 pm tomorrow

What else? GCB Improvements!

› Improved Scalability: Only use the broker if required!
    • Local Host Optimizations
        • Bypass GCB if two daemons are talking on the same host
    • Local Network Optimizations
        • Two hosts on the same private net bypass the broker
        • Every network is assigned a unique network name
        • Daemons advertise (a) publicly accessible IP; (b) real IP; (c) network name.
        • Names match ? use real IP : use public IP.
› Improved Robustness
    • Broker dies -> master finds another broker and restarts.
    • When master starts up, it pings a list of brokers and randomly chooses from those that respond.
    • Bug fixes
› Improved Logging: the logs are now helpful and sane.
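
The two GCB rules above (pick the real IP when network names match, and pick a random broker among those that answer) can be sketched in a few lines of Python. This is only an illustration of the logic on the slide, not GCB's code: the `Endpoint` record, the function names, and the caller-supplied `is_alive` probe are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical record of what a daemon advertises under GCB."""
    public_ip: str      # publicly accessible (broker-routed) address
    real_ip: str        # address on the daemon's own, possibly private, network
    network_name: str   # unique name assigned to that network


def address_to_use(me: Endpoint, peer: Endpoint) -> str:
    """Names match ? use real IP : use public IP."""
    if me.network_name == peer.network_name:
        return peer.real_ip   # same network: bypass the broker
    return peer.public_ip     # different networks: go through the broker


def choose_broker(brokers, is_alive, rng=random):
    """Pick a random broker among those that respond to a probe.

    `is_alive` is supplied by the caller (an assumption of this sketch).
    Returns None when no broker responds.
    """
    responders = [b for b in brokers if is_alive(b)]
    return rng.choice(responders) if responders else None
```

Choosing randomly among responders, rather than always taking the first, spreads daemons across brokers after a restart.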

Process Tracking
› Guarantee iron-clad tracking of process groups
    • Even if running as the job submitter
    • Uses supplementary group ids
    • Linux only
    • Also as a standalone daemon for OSG

    USE_GID_PROCESS_TRACKING = True
    MIN_TRACKING_GID = 750
    MAX_TRACKING_GID = 757
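
One way to picture the supplementary-group-id idea is a pool of dedicated GIDs, one handed to each job, so that any process carrying that GID belongs to that job. The Python sketch below is a toy model under that assumption; the class, the function, and the process-table snapshot are hypothetical and do not reflect Condor's actual implementation.

```python
MIN_TRACKING_GID = 750   # values from the configuration on the slide
MAX_TRACKING_GID = 757


class TrackingGidPool:
    """Hypothetical allocator: one dedicated supplementary GID per job."""

    def __init__(self, lo, hi):
        self._free = list(range(lo, hi + 1))
        self._by_job = {}

    def acquire(self, job_id):
        if not self._free:
            raise RuntimeError("no tracking GID available")
        gid = self._free.pop(0)
        self._by_job[job_id] = gid
        return gid

    def release(self, job_id):
        self._free.append(self._by_job.pop(job_id))


def processes_of(job_gid, process_table):
    """Return the pids whose supplementary groups include job_gid.

    process_table maps pid -> set of supplementary GIDs (a made-up snapshot).
    """
    return sorted(pid for pid, gids in process_table.items() if job_gid in gids)
```

In this model, a range of 750 to 757 gives eight GIDs, so at most eight jobs can be tracked at once on a machine.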

Better Collector Authorization
› New authorization levels to allow different rules for submission -vs- execution
    • ADVERTISE_STARTD, ADVERTISE_SCHEDD
› New config setting COLLECTOR_REQUIREMENTS: expression must evaluate to true for Collector to accept the ad.

    # Well-known ports for the trusted daemons
    # Use the below ports if launching the condor_master
    # as root; else, pick 3 ports above 1024.
    MASTER_PORT = 890
    SCHEDD_PORT = 891
    STARTD_PORT = 892
    MASTER_ARGS = -p $(MASTER_PORT)
    SCHEDD_ARGS = -p $(SCHEDD_PORT)
    STARTD_ARGS = -p $(STARTD_PORT)
    COLLECTOR_REQUIREMENTS = \
      ( MyType =?= "Machine" && regexp( "<[0-9.]*:$(STARTD_PORT)>", MyAddress ) ) || \
      ( MyType =?= "Scheduler" && regexp( "<[0-9.]*:$(SCHEDD_PORT)>", MyAddress ) ) || \
      ( MyType =?= "DaemonMaster" && regexp( "<[0-9.]*:$(MASTER_PORT)>", MyAddress ) ) || \
      ( MyType =!= "Machine" && MyType =!= "Scheduler" && MyType =!= "DaemonMaster" )
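
To make the intent of the COLLECTOR_REQUIREMENTS expression easier to follow, here is a rough Python paraphrase of its logic. It treats an ad as a plain dict and assumes the ClassAd `regexp` call behaves like an unanchored regular-expression search; `collector_accepts` is a name invented for this sketch, not a Condor API.

```python
import re

# Ports copied from the configuration on the slide.
MASTER_PORT, SCHEDD_PORT, STARTD_PORT = 890, 891, 892

# Ad type -> well-known port a trusted daemon of that type must advertise from.
TRUSTED_PORTS = {
    "Machine": STARTD_PORT,
    "Scheduler": SCHEDD_PORT,
    "DaemonMaster": MASTER_PORT,
}


def collector_accepts(ad: dict) -> bool:
    """Paraphrase of the COLLECTOR_REQUIREMENTS expression for a dict-shaped ad."""
    port = TRUSTED_PORTS.get(ad.get("MyType"))
    if port is None:
        # Last clause of the expression: other ad types are not restricted.
        return True
    pattern = rf"<[0-9.]*:{port}>"
    return re.search(pattern, ad.get("MyAddress", "")) is not None
```

For example, a startd ad with `MyAddress` of `<192.168.1.5:892>` would be accepted, while the same ad from an arbitrary high port would be rejected. Because ports below 1024 can normally only be bound by root, the rule amounts to trusting only daemons that were started with root privilege.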

Handy New Attributes
› In your machine ad
    • TotalTimeBackfillBusy, TotalTimeBackfillIdle, TotalTimeBackfillKilling
    • TotalTimeClaimedBusy, TotalTimeClaimedIdle
    • TotalTimeClaimedRetiring, TotalTimeClaimedSuspended
    • TotalTimeMatchedIdle, TotalTimeOwnerIdle
    • TotalTimePreemptingKilling, TotalTimePreemptingVacating, TotalTimeUnclaimedBenchmarking, TotalTimeUnclaimedIdle
› In your job ad
    • NumJobStarts
    • NumJobReconnects
    • NumShadowExceptions
    • NumShadowStarts

And last but not least…
› Leases added to COD.
› Simple best-fit algorithm added to dedicated scheduler.
› Can reference resource usage and quota information in preemption policy.
› condor_config_val -dump [-v]
› Chirp improvements
    • Jobs can write messages into the user log
    • Can use proc 0 ClassAd as a “scratch pad”
› Condor shutdown via expressions
    • External Awareness

… and finally …
› File Transfer I/O Throttling
    • MAX_CONCURRENT_DOWNLOADS and MAX_CONCURRENT_UPLOADS
› More types of jobs can survive across a shutdown/crash of submit machine
    • Such as jobs that stream stdout/err.
› User’s job log changes.
    • Can have a centralized job log file.
    • Get values of any job ad attribute in log.
› “Cron” like job scheduling (Crondor?)
› Job Router shipped (Dan’s talk)
› License Change
› Source code publicly released on web
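
The effect of a cap such as MAX_CONCURRENT_DOWNLOADS can be illustrated with a counting semaphore: however many transfers are queued, only a fixed number run at the same time. The Python sketch below shows that general throttling pattern only; it says nothing about how Condor implements the setting, and the limit of 2 is an arbitrary value chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DOWNLOADS = 2  # illustrative value, not a Condor default


def run_throttled(transfers, limit):
    """Run the callables in `transfers`, never more than `limit` at once.

    Returns the peak number of transfers observed running simultaneously.
    """
    gate = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    active = 0
    peak = 0

    def guarded(transfer):
        nonlocal active, peak
        with gate:                      # blocks while `limit` transfers are running
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                transfer()
            finally:
                with lock:
                    active -= 1

    with ThreadPoolExecutor(max_workers=len(transfers) or 1) as pool:
        list(pool.map(guarded, transfers))
    return peak
```

Calling `run_throttled(jobs, MAX_CONCURRENT_DOWNLOADS)` with any number of queued transfers returns a peak that never exceeds the limit.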

… and finally …
… and before shipping the new stable release …
We squashed LOTS of bugs!

Shiny new “bug free” Condor v7.0.x stable series!

Enough already, Todd. Tell me about what is cooking with v7.1.x and beyond.

Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Generalizing the Startd/Starter Architecture
› Making the startd more generic with the underlying system.
› How about: running without a starter, running w/o a schedd+shadow, pulling jobs, running starter-less jobs that it does not fork/exec, …
› Lightweight Jobs
› Examples
    • “Work Fetch” (ref to Derek’s talk)
    • Blue Heron Project (ref to Tom, Amanda, and Greg’s talk)

Some Love for Windows
› Jobs can write to the registry
    • Condor allocates HKEY_CURRENT_USER.
› Problems w/ the Batch Login approach sessions on Windows Server 2003 fixed (by not using them)
› Interoperability with Samba (as a PDC) has been improved
› Arch class-ad attribute now reflects the wide range of architectures available to the Windows world; it no longer simply returns INTEL

Green Computing
› The startd has the ability to place a machine into a low power state. (Standby, Hibernate, Soft-Off, etc.)
    • HIBERNATE, HIBERNATE_CHECK_INTERVAL
    • If all slots return non-zero, then the machine is powered down; otherwise, it continues running.
› Machine ClassAd contains all information required for a client to wake it up
    • Condor can wake it up, also a standalone tool.
    • This was NOT as easy as it should be.
› Machines in “Offline State”
    • Lots of other uses
› Wake-up on Matchmaking Pressure
› Future Work?
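
The "all slots return non-zero" rule can be sketched as follows. The per-slot policy here (unclaimed and idle for two hours) and the value 3 are invented for illustration; in a real pool the HIBERNATE expression in the configuration decides what each slot returns.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    state: str          # e.g. "Unclaimed", "Claimed", "Owner"
    idle_seconds: int


def hibernate_value(slot, min_idle=2 * 3600):
    """Hypothetical stand-in for evaluating HIBERNATE on one slot.

    Returns a non-zero level when the slot is willing to sleep, else 0.
    """
    if slot.state == "Unclaimed" and slot.idle_seconds >= min_idle:
        return 3
    return 0


def machine_should_power_down(slots):
    """Power down only if every slot's evaluation is non-zero."""
    values = [hibernate_value(s) for s in slots]
    return bool(values) and all(v != 0 for v in values)
```

A single busy or recently used slot is enough to keep the whole machine awake, which is why the decision is taken per machine rather than per slot.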

Plugins
› Think “Firefox”…
› Callouts from Condor daemons on appropriate events
› Plugin could re-implement or modify action (different than a client API)
› Will only build “as needed” as refactoring happens to add features
    • Miron: “I don’t want your plugs, I want new features!”
› Examples: Collector, Accountant, File Transfers, Scheduling Algorithms, …

Scheduling in Condor Today
[Diagram: CMs, schedds, and startds]
› Distributed Ownership
› Settings reflect 3 separate viewpoints:
    • Pool manager, Resource Owner, Job Submitter

But some sites want to use Condor like this:
[Diagram: schedd, startd, startd]
› Just one submission point (schedd)
› All resources owned by one entity
› We can do better for these sites.
    • Policy configurations are complicated.
    • Some useful policies not present because they are hard to do in a wide-area distributed system.
    • Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.
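
For readers unfamiliar with the terms, here is a textbook-style sketch of FIFO ordering combined with best-fit placement. It is a generic illustration, not the dedicated scheduler's code: resources are reduced to a single integer capacity, and both function names are made up for the example.

```python
def best_fit(request, capacities):
    """Index of the smallest capacity that still satisfies `request`, or None."""
    best = None
    for i, cap in enumerate(capacities):
        if cap >= request and (best is None or cap < capacities[best]):
            best = i
    return best


def schedule_fifo_best_fit(requests, capacities):
    """Place requests strictly in arrival order, each on its best-fitting resource.

    Stops at the first request that does not fit, since under FIFO later
    jobs may not jump the queue. Returns the chosen resource index per
    placed request.
    """
    free = list(capacities)
    placement = []
    for req in requests:
        idx = best_fit(req, free)
        if idx is None:
            break
        free[idx] -= req
        placement.append(idx)
    return placement
```

The early stop shows the limitation the slide hints at: one large job at the head of the queue can leave resources idle even when smaller jobs behind it would fit.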

So what to do?
[Diagram: schedd, startd, startd]
› Give the schedd more scheduling options.
    • Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?
› Pluggable scheduler routines.

DAGMan Improvements
› Automatic running of rescue DAGs (useful for nested DAGs)
› Significantly improved speed of DAG recovery mode
› Assignment of “node categories” and category throttles
› Added generic node priorities & Depth First Traversal algorithm

DAGMan Depth First Example

Category Example
[Diagram: a Setup node, then “Big job” nodes (run <= 2) and “Small job” nodes (run <= 5), then a Cleanup node]
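
The category throttle in this example (at most 2 big jobs and 5 small jobs running at once) can be sketched as a filter over the ready nodes. The function and its arguments are hypothetical and only model the idea; they are not DAGMan's internal interface.

```python
from collections import Counter


def pick_submittable(ready, category_of, limits, running):
    """Return the ready nodes that can start without exceeding category limits.

    ready:       node names, in the order they would be submitted
    category_of: node name -> category (nodes without one are unthrottled)
    limits:      category -> maximum simultaneously running nodes
    running:     node names currently running
    """
    in_flight = Counter(category_of[n] for n in running if n in category_of)
    chosen = []
    for node in ready:
        cat = category_of.get(node)
        limit = limits.get(cat)
        if limit is not None and in_flight[cat] >= limit:
            continue                 # category is full; leave the node queued
        if cat is not None:
            in_flight[cat] += 1
        chosen.append(node)
    return chosen
```

With limits of `{"big": 2, "small": 5}`, four ready big jobs and eight ready small jobs, and nothing running, the sketch releases two big and five small nodes and leaves the rest queued.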

DAGMan Future Work
› DAG Splicing
› Allowing custom attributes in node ClassAds
› Fixing condor_hold semantics
› Configurable job start rate
› Node iteration

DAGMan Future Work
› Scalability
    • Current potential about 1 million nodes
    • Future up to 10 million nodes
› Submit files which generate more than one cluster

EC2 / VM Universe Next Steps: Impregnate Condor into the Image
› When? On Demand. How?
    • Job Router, GlideIn Factory, …
› File Transfer To/From S3 (Plugin!)
› Options to handle Amazon’s looming threat: NAT only
    • Overlay Network?
        • GCB
        • OpenVPN
    • Communicate by way of S3?

Negotiation Performance
› v6.8 -> automatic “significant attributes”, Match caching
› v7.1.0 -> “resource request” ads
    • Simple explanation: Resource request ad == a count plus all significant attributes.
    • Inserted into a schedd submitter ad.
    • “Give me 400 resources like this, and 200 resources like that, etc”.
› Matchmaking algorithm remains the same, just how it “learns” about jobs changes.
› Disabled by default.
› Possibilities, possibilities…
    • More robust against unresponsive schedds
    • No startd Rank preemption?
    • Others?
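
The "count plus all significant attributes" idea can be illustrated by collapsing a queue of job ads into grouped requests. In this Python sketch, job ads are plain dicts, the attribute names are only examples, and `Count` is a made-up key; the real ad layout is not given on the slide.

```python
from collections import Counter


def resource_requests(jobs, significant_attrs):
    """Collapse job ads into request ads: a count plus the significant attributes.

    Jobs that agree on every significant attribute fall into the same request.
    """
    counts = Counter(
        tuple((name, job.get(name)) for name in significant_attrs)
        for job in jobs
    )
    return [{"Count": n, **dict(key)} for key, n in counts.items()]
```

For a queue of 400 jobs of one shape and 200 of another, this yields two request ads instead of 600 individual job ads, which matches the "give me 400 resources like this, and 200 resources like that" phrasing.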

And…
› The End™ of the NFS Locking issue
› Avoid redundant copies of the same executable in the Condor spool
    • Maybe more?
› The “Stamping of a Passport”
› End-to-End Security (ref Ian’s talk)
› A web site design from this decade.

Thank you for being such an awesome audience and an awesome user community!!!
[Photo: Jason Stowe, enjoying free bacon at a local pub. Only in Wisconsin.]