What’s new in Condor? What’s coming up?
Condor Week 2009
Condor Project
Computer Sciences Department
University of Wisconsin-Madison

Release Situation
› Stable Series
  • Current: Condor v7.2.2 (April 14, 2009)
  • Last year: Condor v7.0.1 (Feb 27, 2008)
› Development Series
  • Current: Condor v7.3.0 (Feb 24, 2009)
      • v7.3.1 “any day”
  • Last year: Condor v7.1.0 (April 1, 2008)
› How long is development taking?
  • v6.9 series: ~18 months
  • v7.1 series: ~12 months
  • v7.3 series: plan says done in July 2009
www.condorproject.org

New Ports in 7.2.0 and Beyond
› Full ports: Debian 5.0 x86 & x86_64
› Also added condor_compile support for gfortran

Big new goodies in v7.0
› Virtual Machine Universe
› Scalability Improvements
› GCB Improvements
› Privilege Separation
› New Quill
› “Crondor”

Big new goodies in v7.2
› Job Router
› Startd and Job Router hooks
› DAGMan tagging and splicing
› Green Computing started
› GLEXEC
› Concurrency Limits

Job Router
› Automated way to let jobs run on a wider array of resources
  • Transform jobs into different forms
  • Reroute jobs to different destinations

What is “job routing”?
› Original (vanilla) job
    Universe = “vanilla”
    Executable = “sim”
    Arguments = “seed=345”
    Output = “stdout.345”
    Error = “stderr.345”
    ShouldTransferFiles = True
    WhenToTransferOutput = “ON_EXIT”
› JobRouter routing table: Site 1 …, Site 2 …
› Routed (grid) job
    Universe = “grid”
    GridType = “gt2”
    GridResource = “cmsgrid01.hep.wisc.edu/jobmanager-condor”
    Executable = “sim”
    Arguments = “seed=345”
    Output = “stdout”
    Error = “stderr”
    ShouldTransferFiles = True
    WhenToTransferOutput = “ON_EXIT”
› Final status flows back from the routed job to the original job
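
The transformation on this slide can be sketched in Python as a function from one job ad to another. This is only an illustration under stated assumptions: job ads are modelled as plain dicts, and `route_job` and `site1` are hypothetical names, not the JobRouter's actual implementation or configuration syntax.

```python
# Illustrative sketch only: job ads are modelled as plain dicts, and
# route_job is a hypothetical helper, not part of Condor.

def route_job(job_ad, route):
    """Return a routed copy of job_ad with the route's attributes applied."""
    routed = dict(job_ad)      # leave the original job ad untouched
    routed.update(route)       # change Universe, insert GridType and GridResource
    return routed


vanilla_job = {
    "Universe": "vanilla",
    "Executable": "sim",
    "Arguments": "seed=345",
    "ShouldTransferFiles": True,
    "WhenToTransferOutput": "ON_EXIT",
}

# One routing table entry, using the values shown on the slide.
site1 = {
    "Universe": "grid",
    "GridType": "gt2",
    "GridResource": "cmsgrid01.hep.wisc.edu/jobmanager-condor",
}

grid_job = route_job(vanilla_job, site1)
print(grid_job["Universe"], grid_job["GridResource"])
```

Copying the ad rather than editing it in place mirrors the slide's picture of an original job and a separate routed job.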

Routing is just site-level matchmaking
› With feedback from job queue
  • number of jobs currently routed to site X
  • number of idle jobs routed to site X
  • rate of recent success/failure at site X
› And with power to modify job ad
  • change attribute values (e.g. Universe)
  • insert new attributes (e.g. GridResource)
  • add a “portal” grid proxy if desired
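
A rough Python sketch of how queue feedback might drive site selection follows. The field names (`routed`, `idle`, `recent_success_rate`), the per-site caps, and the "best success rate wins" rule are all assumptions made for the example; the slide does not say how the JobRouter weighs these signals.

```python
# Illustrative sketch only: the field names and thresholds are assumptions.

def pick_site(sites):
    """Choose a site using feedback from the job queue, or None if all are full."""
    candidates = [
        s for s in sites
        if s["routed"] < s["max_routed"] and s["idle"] < s["max_idle"]
    ]
    if not candidates:
        return None
    # Prefer the site with the best recent success rate.
    return max(candidates, key=lambda s: s["recent_success_rate"])


sites = [
    {"name": "Site1", "routed": 40, "max_routed": 50, "idle": 12, "max_idle": 10,
     "recent_success_rate": 0.99},
    {"name": "Site2", "routed": 10, "max_routed": 50, "idle": 2, "max_idle": 10,
     "recent_success_rate": 0.90},
]

chosen = pick_site(sites)
print(chosen["name"] if chosen else "no site available")
```

Here Site1 is skipped because it already has too many idle routed jobs, even though its success rate is higher.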

Startd Job Hooks
› Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system.
  • Specialized scheduling needs
  • Jobs live in their own database or other storage rather than a Condor job queue

Job Router Hooks
› Truly transform jobs, not just reroute them
  • E.g. stuff a job into a virtual machine (either VM universe or Amazon EC2)
› Hooks invoked like startd ones

Our solution
› Make a system of generic “hooks” that you can plug into:
  • A hook is a point during the life cycle of a job where the Condor daemons will invoke an external program
  • Hook Condor to your existing job management system without modifying the Condor code
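
The idea of a hook as "an external program invoked at a point in the job life cycle" can be sketched in Python. Everything here is an assumption for illustration: the JSON-over-stdin/stdout convention, the `invoke_hook` helper, and the inline stand-in hook program are hypothetical and do not describe Condor's actual hook protocol.

```python
# Illustrative sketch only: the stdin/stdout convention here is an assumption
# for the example, not a description of Condor's hook protocol.
import json
import subprocess
import sys

# Stand-in for a site's external hook program: read a job description on
# stdin, add one attribute, and write the result to stdout.
HOOK_SOURCE = (
    "import json, sys\n"
    "job = json.load(sys.stdin)\n"
    "job['PreparedBy'] = 'site-hook'\n"
    "json.dump(job, sys.stdout)\n"
)


def invoke_hook(hook_argv, job):
    """Run an external hook program at one point in the job life cycle."""
    result = subprocess.run(
        hook_argv,
        input=json.dumps(job),
        capture_output=True,
        text=True,
        check=True,   # raise if the hook exits with a non-zero status
    )
    return json.loads(result.stdout)


prepared = invoke_hook([sys.executable, "-c", HOOK_SOURCE], {"Executable": "sim"})
print(prepared)
```

Because the hook is a separate program, a site can write it in any language and connect it to its own job database without touching the daemon code.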

DAGMan Depth First Example

Category Example
› Setup
› Big jobs: Run <= 2
› Small jobs: Run <= 5
› Cleanup
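
The per-category throttles in this example (at most 2 big jobs and 5 small jobs running at once) can be sketched as a selection step in Python. This is a toy model, not DAGMan's scheduler; `select_runnable` and the category names `big` and `small` are hypothetical.

```python
# Illustrative sketch only: a toy selection step, not DAGMan's scheduler.

def select_runnable(ready, running, limits):
    """Pick ready nodes to start without exceeding any category limit.

    ready   -- list of (node_name, category) pairs, in submission order
    running -- dict mapping category to the number of nodes now running
    limits  -- dict mapping category to its maximum concurrent nodes
    """
    counts = dict(running)
    started = []
    for name, category in ready:
        limit = limits.get(category)          # None means unthrottled
        if limit is None or counts.get(category, 0) < limit:
            counts[category] = counts.get(category, 0) + 1
            started.append(name)
    return started


limits = {"big": 2, "small": 5}
ready = [("big%d" % i, "big") for i in range(3)] + \
        [("small%d" % i, "small") for i in range(6)]

started = select_runnable(ready, {}, limits)
print(started)
```

With three big and six small nodes ready, one of each is held back until a running node in its category finishes.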

DAGMan Splicing
› Creates one “in memory” DAG. No subdags means no extra condor_dagmans.
    # Example Use Case
    JOB A A.sub
    JOB B B.sub
    SPLICE X diamond.dag
    SPLICE Y diamond.dag
    SPLICE Z diamond.dag
    PARENT A CHILD X Y Z
    PARENT X Y Z CHILD B
    # Notice scoping of node!
› Resulting nodes: A; X+A, Y+A, Z+A; X+B, X+C, Y+B, Y+C, Z+B, Z+C; X+D, Y+D, Z+D; B
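
The node scoping shown here (`X+A`, `Y+A`, and so on, alongside a top-level `A`) can be sketched as a flattening step in Python. This is a toy model, not DAGMan's parser: `flatten` is a hypothetical helper, and the diamond shape (A to B and C, then both to D) is assumed from the node names on the slide.

```python
# Illustrative sketch only: a toy flattening of splices into one DAG,
# not DAGMan's parser.

DIAMOND = {
    "nodes": ["A", "B", "C", "D"],
    "edges": [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
}


def flatten(nodes, splices, edges):
    """Expand each splice in place, scoping its node names as 'splice+node'."""
    flat_nodes = list(nodes)
    flat_edges = []
    first, last = {}, {}            # entry and exit nodes of each splice
    for name, dag in splices.items():
        scoped = {n: name + "+" + n for n in dag["nodes"]}
        flat_nodes.extend(scoped.values())
        flat_edges.extend((scoped[p], scoped[c]) for p, c in dag["edges"])
        children = {c for _, c in dag["edges"]}
        parents = {p for p, _ in dag["edges"]}
        first[name] = [scoped[n] for n in dag["nodes"] if n not in children]
        last[name] = [scoped[n] for n in dag["nodes"] if n not in parents]
    for parent, child in edges:
        # An edge touching a splice attaches to its exit or entry nodes.
        for p in last.get(parent, [parent]):
            for c in first.get(child, [child]):
                flat_edges.append((p, c))
    return flat_nodes, flat_edges


nodes, edges = flatten(
    ["A", "B"],
    {"X": DIAMOND, "Y": DIAMOND, "Z": DIAMOND},
    [("A", s) for s in "XYZ"] + [(s, "B") for s in "XYZ"],
)
print(len(nodes), len(edges))
```

The prefix keeps the top-level node `A` distinct from `X+A`, which is the scoping the slide's comment points out.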

Green Computing
› The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.).
  • HIBERNATE, HIBERNATE_CHECK_INTERVAL
  • If all slots return non-zero, then the machine is powered down; otherwise, it continues running.
› Machine ClassAd contains all information required for a client to wake it up
  • Condor can wake it up; there is also a standalone tool.
  • This was NOT as easy as it should be.
› Machines in “Offline State”
  • Stored persistently to disk
  • Lots of other uses
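
The power-down rule stated on this slide (every slot must return non-zero) reduces to a simple check, sketched here in Python. The function name is hypothetical, slot results are passed in as plain integers, and treating a machine with no slots as "stay up" is an assumption of this sketch.

```python
# Illustrative sketch only: per-slot results are passed in as plain integers;
# the real evaluation of the HIBERNATE expression happens inside the startd.

def should_power_down(slot_results):
    """Return True only if every slot's HIBERNATE result is non-zero."""
    return bool(slot_results) and all(r != 0 for r in slot_results)


print(should_power_down([1, 1, 1]))  # every slot agrees
print(should_power_down([1, 0, 1]))  # one slot returned zero
```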

Concurrency Limits
› Limit job execution based on admin-defined consumable resources
  • E.g. licenses
› Can have many different limits
› Jobs say what resources they need
› Negotiator enforces limits pool-wide

Concurrency Example
› Negotiator config file
  • MATLAB_LIMIT = 5
  • NFS_LIMIT = 20
› Job submit file
  • concurrency_limits = matlab, nfs:3
  • This requests 1 Matlab token and 3 NFS tokens
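
The token accounting in this example can be sketched in Python. This is a toy parser and admission check, not the negotiator's implementation; `parse_concurrency_limits` and `can_start` are hypothetical names, and the sketch assumes every requested name has a configured limit.

```python
# Illustrative sketch only: a toy parser and admission check, not the
# negotiator's implementation.

def parse_concurrency_limits(value):
    """Parse 'matlab, nfs:3' into {'matlab': 1, 'nfs': 3}."""
    request = {}
    for item in value.split(","):
        name, _, count = item.strip().partition(":")
        request[name.strip().lower()] = int(count) if count.strip() else 1
    return request


def can_start(request, in_use, limits):
    """True if granting the request keeps every limit from being exceeded."""
    return all(
        in_use.get(name, 0) + count <= limits[name]
        for name, count in request.items()
    )


limits = {"matlab": 5, "nfs": 20}           # MATLAB_LIMIT = 5, NFS_LIMIT = 20
request = parse_concurrency_limits("matlab, nfs:3")
print(request)
print(can_start(request, {"matlab": 4, "nfs": 17}, limits))
print(can_start(request, {"matlab": 5, "nfs": 0}, limits))
```

A name with no explicit count asks for one token, which matches the slide's reading of `matlab, nfs:3` as 1 Matlab token and 3 NFS tokens.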

Other goodies in v7.2
› ALLOW/DENY_CLIENT
› Job queue backup on local disk
› PREEMPTION_REQUIREMENTS and RANK can reference additional attributes in negotiator about group resource usage
› Start on dynamic provisioning in the startd
› $$([])

Dynamic Slot Partitioning
› Divide slots into chunks sized for matched jobs
› Readvertise remaining resources
› Partitionable resources are cpus, memory, and disk
› See Matt Farrellee’s talk
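
A minimal Python sketch of the carve-and-readvertise idea follows. Resources are modelled as plain dicts, `carve_slot` is a hypothetical helper, and the machine and job sizes are made-up example values; this is not the startd's implementation.

```python
# Illustrative sketch only: resources are plain dicts and carve_slot is a
# hypothetical helper, not the startd's implementation.

def carve_slot(partitionable, request):
    """Split off a slot sized for the job; return (new_slot, remainder).

    Returns (None, partitionable) if the request does not fit.
    """
    if any(request[k] > partitionable[k] for k in request):
        return None, partitionable
    new_slot = dict(request)
    remainder = {k: partitionable[k] - request.get(k, 0) for k in partitionable}
    return new_slot, remainder


machine = {"cpus": 8, "memory": 16384, "disk": 500000}
job = {"cpus": 2, "memory": 4096, "disk": 10000}

slot, leftover = carve_slot(machine, job)
print(slot)
print(leftover)   # what would be readvertised for later matches
```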

Dynamic Partitioning Caveats
› Cannot preempt original slot or group of sub-slots
  • Potential starvation of jobs with large resource requirements
› Partitioning happens once per slot each negotiation cycle
  • Scheduling of large slots may be slow

New Variable Substitution
› $$(Foo) in submit file
  • Existing feature
  • Attribute Foo from machine ad substituted
› $$([Memory * 0.9]) in submit file
  • New feature
  • Expression is evaluated and then substituted
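
The two substitution forms can be sketched in Python. This toy expander only understands attribute names and basic arithmetic, far less than the real ClassAd expression language; `expand`, the example machine ad values, and the `--heap` argument are hypothetical.

```python
# Illustrative sketch only: a toy expander that handles attribute names and
# simple arithmetic; the real ClassAd expression language is much richer.
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node, machine_ad):
    """Evaluate a tiny arithmetic expression against the machine ad."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, machine_ad)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return machine_ad[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left, machine_ad),
                                   _evaluate(node.right, machine_ad))
    raise ValueError("unsupported expression")


def expand(text, machine_ad):
    """Replace $$([expr]) and $$(Attr) using values from the matched machine ad."""
    def expr_sub(match):
        tree = ast.parse(match.group(1).strip(), mode="eval")
        return str(_evaluate(tree, machine_ad))

    text = re.sub(r"\$\$\(\[(.*?)\]\)", expr_sub, text)
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(machine_ad[m.group(1)]), text)


machine_ad = {"Memory": 2000, "Arch": "X86_64"}
print(expand("sim.$$(Arch)", machine_ad))
print(expand("--heap=$$([Memory * 0.9])", machine_ad))
```

The expression form is handled first so that the plain `$$(Attr)` pattern never sees the bracketed text.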

More Info For Preemption
› New attributes for these preemption expressions in the negotiator…
  • PREEMPTION_REQUIREMENTS
  • PREEMPTION_RANK
› Used for controlling preemption due to user priorities

Right then. What about v7.3.x and beyond?

Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Some tasty dishes cooking in the Condor kitchen
Special guest, Julia Child!

Already served (leftovers)
› CCB – Condor Connection Broker
  • Dan Bradley’s presentation
› Bring checkpoint/restart to Vanilla Job
  • Pete Keller’s presentation re DMTCP
› Asynch notification of events to fill a hole in Condor’s web service API
  • Jungha Woo’s presentation
› Grid Universe improvements
  • Xin Zhao’s presentation

Data “Drinks”
Wando Fishbowl Anyone?

Condor + Hadoop FS!
› Lots of hard work by Faisal Khan
› Motivation
  • Condor + HDFS = 2 + 2 = 5!!!
  • A synergy exists (next slide)
      • Hadoop as distributed storage system
      • Condor as cluster management system
  • Large number of distributed disks in a compute cluster
  • Managing disk as a resource

Condor + HDFS
› Dhruba Borthakur’s talk
› Synergy
  • Condor knows a lot about its cluster
      • Capability of individual machines in terms of available memory, CPU load, disk space, etc.
      • Availability of JRE (Java Universe)
  • Condor can easily automate housekeeping jobs, e.g.
      • Rebalancing data blocks
      • Implementing user file quotas

Condor + HDFS
› Synergy
  • Failover
      • High availability daemon in Condor
  • ClassAds
      • Let clients know the current IP of name server
      • Heartbeat

condor_hdfs daemon
› Main integration point of HDFS within Condor
› Configures HDFS cluster based on existing condor_config files
› Runs under condor_master and can be controlled by existing Condor utilities
› Publishes interesting parameters to the Collector, e.g. IP address, node type, disk activity
› Currently deployed at UW-Madison

Condor + HDFS: Next Steps
› FileNode Failover
› Block placement policies & management
› Thinking about how Condor can steer jobs to the data
  • Via a ClassAd function used in the RANK expression?
› Integrate with File Transfer Mechanism…

More Job Sandbox Options
› Condor’s File Transfer mechanism
  • Currently moves files between submit and execute hosts (shadow and starter).
  • Next: Files can have URLs
      • HTTP
      • HDFS
  • How about Condor’s SPOOL?
› Need to schedule movement? New Stork
  • Mehmet Balman’s presentation

Virtual Meatchine Dishes

Virtual Machine Sandboxing
› We have the Virtual Machine Universe…
  • Great for provisioning
  • Nitin Narkhede’s presentation
› … and now we are exploring different mechanisms to run a job inside a VM.
› Benefits
  • Isolate the job from execute host.
  • Stage custom execution environments.
  • Sandbox and control the job execution.

One way to do it – via the Condor Job Router
› Hard work by Varghese Mathew
› Ordinary Jobs & VM Universe Jobs.
› Job router – transform a job into a new form.
› Job router hook picks them up, sets them up inside a VM job, and submits the VM job.
› On completion, job router hook extracts output from the VM and returns it to the original job.

Different Flavors
› Script Inside VM
› Starter Inside VM
› Personal Condor Inside VM
› VM joins the pool as an execute node
All different ways to bind a job to a specific virtual machine.

Speaking of VM Universe…
› Adding VM Universe Support for
  • VMware Server 2.x
  • KVM
      • Done via libvirt
      • VM systems added to libvirt later should be easy to support
  • VMware ESX, ESXi
› Thank you community for contributions!

“Lightweight Jobs” Salad

Fast, quick, light jobs
› Options to put a Condor job on a diet
› Diet ideas:
  • Leave the luggage at home! No job file sandbox, everything in the job ad.
  • Don’t pay for strong semantic guarantees if you don’t need ’em. Define expectations on entry, update, completion.
› Want to honor scheduling policy, however.

Some small side dishes
Julia, a spy who really knew her eggs…

› Non-blocking communication via threads
  • Refer to Dan/Igor’s talk
  • Especially all the security session roundtrips
  • The USCMS scalability testbed needed 70 collectors to support ~20k dynamic machines; replaced with 1 collector w/ threading code. 70:1, baby!!!!!
› Configuration knob management
  • Think about:config in Firefox
  • Hard-coded configurations now possible
› Nested groups

Mmmm, tasty
Condor Wiki

Pabst and Jack, a dessert favorite!
Scheduling Dessert

Back to Green Computing
› The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.).
› Machine ClassAd contains all information required for a client to wake it up
› Machines in “Offline State”
  • Stored persistently to disk
› NOW… have the matchmaker publish “match pressure” into these offline ads, enabling policies for auto-wakeup

Scheduling in Condor Today
[Diagram: central managers (CM), schedds, and startds]
› Distributed Ownership
› Settings reflect 3 separate viewpoints:
  • Pool manager, Resource Owner, Job Submitter

But some sites want to use Condor like this:
[Diagram: one schedd, several startds]
› Just one submission point (schedd)
› All resources owned by one entity
› We can do better for these sites.
  • Policy configurations are complicated.
  • Some useful policies are not present because they are hard to do in a wide-area distributed system.
  • Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.

So what to do?
[Diagram: one schedd, several startds]
› Give the schedd more scheduling options.
  • Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?

Thank you to an awesome community!!!