Reliability and Troubleshooting with Condor Douglas Thain Condor Project University of Wisconsin PPDG Troubleshooting Workshop 12 December 2002
Condor Reliability
• Condor was designed for idle machines:
  – Reclaim, reboot, crash, out of memory...
  – Sounds much like the grid!
• US-CMS testbed
  – Distributed ownership, control, and resources.
  – (War stories abound.)
• Condor tools add controlled reliability.
  – Not absolute reliability, but:
    • A finite amount of retry.
    • A notification/recovery strategy.
    • Logging and book-keeping.
    • Known state after a failure.
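The four properties listed under "controlled reliability" can be illustrated with a small Python sketch. This is not Condor code; the names `run_with_retry`, `Outcome`, and `notify` are hypothetical and chosen only for illustration. It shows a finite retry budget, a notification hook, logging of every attempt, and a result object that leaves the caller in a known state after failure.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("controlled_retry")


@dataclass
class Outcome:
    """Known state after the run: what happened and how many tries it took."""
    succeeded: bool
    attempts: int
    errors: List[str] = field(default_factory=list)
    result: Any = None


def run_with_retry(action: Callable[[], Any],
                   max_attempts: int = 3,
                   notify: Optional[Callable[[List[str]], None]] = None) -> Outcome:
    """Run `action` at most `max_attempts` times (hypothetical helper)."""
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
        except Exception as exc:
            # Logging and book-keeping: every failure is recorded.
            log.warning("attempt %d failed: %s", attempt, exc)
            errors.append(f"attempt {attempt}: {exc}")
        else:
            log.info("attempt %d succeeded", attempt)
            return Outcome(True, attempt, errors, result)
    # Finite retry is exhausted: notify someone instead of retrying forever.
    if notify is not None:
        notify(errors)
    return Outcome(False, max_attempts, errors)
```

A caller inspects the returned `Outcome` rather than catching an exception, so a failed run still reports exactly how many attempts were made and why each one failed.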
US-CMS Physical Structure
[Diagram. Labels: Private Network, MOP Master, Public Internet, Workers, Private Network, Head Node, Workers]
US-CMS Logical Structure
[Diagram. Labels: Master Site, Worker, Impala, Globus, MOP, Condor, DAGMan, Real Work, Condor-G]
Red items expect a reliable environment. Green items create a reliable environment.
Condor-G
[Diagram. Labels: (transaction interface), Gatekeeper, End-User Tools, Condor-G Submitter, Job Log, System Job Log, Queue, Grid Managers, Run, Idle, GAHP-Server, GRAM, Job Managers, Run, Idle, Head Node, Local Resource Manager]
Directed Acyclic Graph Manager (DAGMan)
• Condor-G deals with system failures; DAGMan deals with application and user failures.
• PRE and POST scripts may be used to validate inputs and outputs.
• A “Rescue DAG” describes what is left unexecuted.
• DAG nodes may themselves be DAGs.
[Diagram: DAG with nodes A, B, C, D and scripts pre.pl and post.pl]
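The idea of a rescue DAG can be sketched in Python. This is a toy model, not DAGMan's implementation or file format: `run_dag` and `rescue_dag` are hypothetical names, and the diamond-shaped graph in the usage note (A before B and C, both before D) is an assumption, since the slide's diagram does not survive in the text. A node runs only after all of its parents have completed, and the rescue DAG is whatever remains once completed nodes are removed.

```python
from typing import Callable, Dict, List, Set


def run_dag(parents: Dict[str, List[str]],
            run_node: Callable[[str], bool]) -> Set[str]:
    """Run each node once all of its parents are done; return completed nodes."""
    done: Set[str] = set()
    failed: Set[str] = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if all(p in done for p in parents[node]):
                # run_node could wrap a PRE check, the job, and a POST check.
                if run_node(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done


def rescue_dag(parents: Dict[str, List[str]],
               done: Set[str]) -> Dict[str, List[str]]:
    """Describe what is left unexecuted: unfinished nodes and unmet parents."""
    return {
        node: [p for p in deps if p not in done]
        for node, deps in parents.items()
        if node not in done
    }
```

For example, with `parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}` and a `run_node` that fails only on `"C"`, the rescue DAG contains `C` with no unmet parents and `D` still waiting on `C`. Resubmitting that smaller graph resumes the work without repeating A and B.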
Fault Tolerant Shell (FTSH)
• Standard shell scripts are very error-prone.
• FTSH adds time limits, retry, logging, and clean termination.
• “Exceptions for scripts:” unexpected errors cannot accidentally be ignored.

    try 10 times
        try for 15 minutes
            globus_url_copy A B
        end
        try for 1 hour
            run-simulation < B > C
            gzip < C > D
        end
        try for 15 minutes
            globus_url_copy D E
        end
    end
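The nesting of count-limited and time-limited `try` blocks can be approximated in Python. This is a loose sketch rather than a port of FTSH: `attempt` and `GiveUp` are hypothetical names, and the sketch checks the time limit only between attempts, so it does not interrupt a body that is already running and does not reproduce the clean termination that FTSH adds.

```python
import time
from typing import Any, Callable, Optional


class GiveUp(Exception):
    """Raised when the retry budget (count or time) is exhausted."""


def attempt(body: Callable[[], Any],
            times: Optional[int] = None,
            seconds: Optional[float] = None,
            delay: float = 0.0,
            sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry `body` until it succeeds, `times` tries pass, or `seconds` elapse."""
    if times is None and seconds is None:
        raise ValueError("give a limit: times, seconds, or both")
    start = time.monotonic()
    count = 0
    while True:
        count += 1
        try:
            return body()
        except Exception as exc:
            # Any error counts as a failed attempt; none is silently ignored.
            last = exc
        if times is not None and count >= times:
            break
        if seconds is not None and time.monotonic() - start >= seconds:
            break
        sleep(delay)
    raise GiveUp(f"gave up after {count} attempts") from last
```

Nesting works because an exhausted inner block raises `GiveUp`, which the enclosing `attempt` treats as one failed try of its own body. The outer `try 10 times` above would correspond to `attempt(pipeline, times=10)`, where the hypothetical `pipeline` function calls `attempt(copy_in, seconds=15 * 60)` and so on for each inner block.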
Hawkeye
[Diagram. Labels: Manager, ClassAd Queries, Policy Manager, Trigger Exprs, ClassAd Data, Probe Modules, Submit Repair Job, Contact Sysadmin, Log Event, (Example Hawkeye Page)]
For More Info...
• Condor-G
  – http://www.cs.wisc.edu/condorg
• DAGMan
  – http://www.cs.wisc.edu/condor/dagman
• Fault Tolerant Shell
  – http://www.cs.wisc.edu/~thain/research/ftsh
• Hawkeye
  – http://www.cs.wisc.edu/condor/hawkeye
• Philosophy of Error Management
  – http://www.cs.wisc.edu/condor/doc/error-scope.pdf
• The Condor Project
  – http://www.cs.wisc.edu/condor