Upgrading Condor: Best Practices
Condor Project, Computer Sciences Department, University of Wisconsin-Madison

The problem
› More frequent releases of Condor
  • Every six to nine months?
› Understand this is a problem for users
› We’re willing to help out
www.cs.wisc.edu/Condor

Overview
› Config file management
› Condor testing strategies
› Standard Universe issues

Config files
› LOCAL_CONFIG_FILE
  • Used for #include-like behaviour:
  • LOCAL_CONFIG_FILE = $(HOSTS), $(GLOBAL), $(POLICY)…

Typical config file
    ## Try to save this much swap space by not starting new shadows.
    ## Specified in megabytes.
    #RESERVED_SWAP = 5
› The commented-out setting lists the default value

Config file editing
› Never edit the base condor_config file
  • Except to specify the local file
› Put all edits in a local file
› One local file per config type
  • E.g. for schedds, CMs, types of execute machines
  • Can mix and match

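The "one local file per config type, mix and match" idea can be sketched as a small helper that builds the LOCAL_CONFIG_FILE line for a machine from the roles it plays. This is only an illustration: the role names and file paths below are hypothetical, and the comma-separated form simply follows the LOCAL_CONFIG_FILE example on the earlier slide.

```python
# Hypothetical mapping from machine role to a local config file.
# Neither the role names nor the paths come from Condor itself.
ROLE_FILES = {
    "global": "/etc/condor/local/global.config",
    "central_manager": "/etc/condor/local/cm.config",
    "schedd": "/etc/condor/local/schedd.config",
    "execute": "/etc/condor/local/execute.config",
}


def local_config_line(roles):
    """Return the LOCAL_CONFIG_FILE line for a machine with the given roles."""
    unknown = [role for role in roles if role not in ROLE_FILES]
    if unknown:
        raise ValueError(f"unknown roles: {unknown}")
    # Preserve the caller's ordering of roles in the generated list.
    files = [ROLE_FILES[role] for role in roles]
    return "LOCAL_CONFIG_FILE = " + ", ".join(files)


if __name__ == "__main__":
    # A machine that both submits and executes jobs mixes two role files.
    print(local_config_line(["global", "schedd", "execute"]))
```

The generated line is the one edit that would go into the base condor_config; everything else stays in the per-role files.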
Dealing with a new config
› Diff base config with your config
› Understand new items
› Documented in manual version-history
› Existing ones rarely change
  • Usually capacity changes
› Almost always, overwriting base file works

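A plain diff of the two base files works, but a setting-level comparison can make the new items easier to spot. The sketch below is one possible way to do that, assuming the base file lists defaults as single-`#` commented settings (as in the RESERVED_SWAP example) and uses `##` for descriptive text. It ignores line continuations and is not a full parser of Condor's config syntax; the function names are made up for this example.

```python
import re

# Matches "NAME = value", optionally commented out with a single "#",
# which is how the base file lists defaults (e.g. "#RESERVED_SWAP = 5").
_SETTING = re.compile(r"^\s*(#)?\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*)$")


def parse_settings(text):
    """Return {name: value} for active and commented-out settings."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("##"):
            continue  # descriptive comment, not a setting
        match = _SETTING.match(line)
        if match:
            settings[match.group(2)] = match.group(3).strip()
    return settings


def compare_configs(old_text, new_text):
    """Report settings added, removed, or changed between two base configs."""
    old, new = parse_settings(old_text), parse_settings(new_text)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

Each name in the "added" list can then be looked up in the manual's version history before the new base file is put in place.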
Managing config files
› Centralized management is key
  • Cfengine, rsync, NFS (!), etc.

Testing new versions

Compatibility guarantees
› No guarantees…
› But we try very hard!
  • Both forward and backward
› Especially within one machine
  • Federation techniques require this

Incremental testing!
› Three basic components of Condor:
  • Central Manager
  • Submit points
  • Execute machines
› Test each independently

Testing Central Manager
› Take advantage of statelessness
› Condor HAD can help out here
› If it breaks, existing jobs keep running

Testing schedds
› Adding a new test schedd is easy
  • Test jobs useful too, not just sleep
› Schedd can be a bottleneck
  • Probably the only place you need to check CPU performance

Testing startds
› Easy to test a few at once
› Be careful when running std uni
› Glide-in can be very helpful
  • But beware of root-specific issues
  • Admin slots helpful

Now that we’ve tested…
› Always be undo-able! (never overwrite files)
› Rely on master restart on stat change

Big bang approach
› What we do at CS
› Just change a symlink to the binaries
  • Master does the rest…
› Can be a big hit on shared filesystems

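The symlink change can be made both atomic and undo-able, which fits the "never overwrite files" advice. The following is a minimal sketch, assuming each release is unpacked into its own directory and a single symlink points at the active one. The function name and the directory layout are hypothetical, not part of Condor.

```python
import os


def switch_release(link_path, new_target):
    """Atomically repoint link_path at new_target; return the previous target.

    The old release directory is left untouched, so passing the returned
    value back into this function undoes the switch.
    """
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp = link_path + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)  # clear a stale temporary link from an earlier run
    os.symlink(new_target, tmp)
    # rename(2) swaps the new link over the old one in a single step,
    # so link_path never points at a missing or half-written target.
    os.replace(tmp, link_path)
    return previous
```

Keeping the value returned by `switch_release` around gives a one-call rollback: call the function again with that path as the new target.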
Incremental restart
› First, restart CM
  • No jobs lost
› Second, reboot schedd
  • If restart happens within 20 minutes, jobs keep running
› What about the startds?
  • Might be OK for standard uni
  • Work on this coming soon…

Standard Universe
› More sensitive to backward compatibility
› CheckpointPlatform clarifications
  • condor_qedit -constraint 'LastCheckpointPlatform =?= "LINUX INTEL 2.6.x normal"' LastCheckpointPlatform '"LINUX INTEL 2.6.x normal 0xffffe000"'

Draining old Std Uni
› Keep a few old startds around
  • To finish old standard uni jobs
› Set START to “JobUniverse == 1”
› Or maybe RANK…
  • Only on the old platforms

When to upgrade?
› Zeroth law of software engineering
› Development series actually pretty stable
› We’ll let you know about security issues
› Probably don’t need every minor version
› Don’t be more than one major stable version behind

In summary…
› Keep config files under control
› Test each component in isolation
› Be aware of standard universe issues

Any questions?
› Thank you!