- Slides: 28
Part 8: DAGMan
Part 8: DAGMan • A: Grid Workflow Management • B: DAGMan • C: Laboratory: DAGMan
A: Grid Workflow Management
Job Dependencies • In many applications, some jobs depend on other jobs • E.g., job A must finish before job B starts • Often this is because job B uses output from job A • We call a set of interdependent jobs a workflow • On its own, Condor-G can run jobs in any order • To enforce an order, we need a workflow manager
Two Motivating Examples The Sloan Digital Sky Survey The Montage Project
Sloan Digital Sky Survey • Map one-quarter of the entire sky • Determine the positions and absolute brightness of more than 100 million celestial objects • Measure the distance to a million of the nearest galaxies, and to 100,000 quasars • http://www.sdss.org
Workflow to Find Galaxy Clusters [Figure: workflow graph. Recoverable step labels: fieldPrep, maxBrg, maxBcg, bcgCoal, getCatalog. Recoverable data labels: field, tsObj, brg, core, cluster, catalog]
Workflow to Find Galaxy Clusters [Figure: workflow graph. Recoverable step labels: getCatalog, bcgCoal, maxBcg, maxBrg]
Montage • Create a large mosaic image from many smaller images • Used for astronomy data • Correct optical distortions and intensity differences • http://montage.ipac.caltech.edu
Montage Workflow [Figure legend: data stage-in nodes, Montage compute nodes, data stage-out nodes, inter-pool transfer nodes]
Montage Workflow 1202 nodes
B: DAGMan
DAGMan • Directed Acyclic Graph Manager • Workflow manager for Condor-G • DAGMan lets you specify the dependencies between your Condor jobs, so it can manage them automatically for you. • By default, Condor may run your jobs in any order, or all of them simultaneously, so we need DAGMan to enforce an ordering when necessary • (e.g., “Don’t run job B until job A has completed successfully.”)
What is a DAG? • A DAG is the data structure used by DAGMan to represent these dependencies. • Each job is a “node” in the DAG. • Each node can have any number of “parent” or “children” nodes – as long as there are no loops! [Figure: example DAG with nodes Job A, Job B, Job C, Job D]
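The "no loops" rule can be checked mechanically. The following is a minimal sketch, not DAGMan's own code: it uses Kahn's algorithm to produce one valid run order and to reject a graph that contains a loop. The function name `topological_order` and the four-node diamond (A before B and C, both before D) are assumptions chosen for illustration.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return one valid run order, or raise ValueError if the graph has a loop."""
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}  # number of parents still unfinished
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Nodes caught in a loop never reach indegree 0, so they are never emitted.
    if len(order) != len(nodes):
        raise ValueError("graph contains a loop, so it is not a DAG")
    return order


# Hypothetical diamond-shaped example.
diamond_nodes = ["A", "B", "C", "D"]
diamond_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(topological_order(diamond_nodes, diamond_edges))
```

Adding an edge from D back to A would make the call raise `ValueError`, which is the situation the "no loops" rule forbids.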
Defining a DAG • A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
• Each node will run the Condor job specified by its accompanying Condor submit file [Figure: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D]
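To make the file layout concrete, here is a small sketch of a reader for the two kinds of line used in diamond.dag. It is an illustration only, not DAGMan's parser: `parse_dag` is a hypothetical helper, it treats keywords as case-insensitive, skips comment lines, and ignores any other keyword.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Collect {node: submit_file} and a list of (parent, child) pairs."""
    jobs, edges = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            # Everything before "Child" is a parent, everything after is a child.
            split = [w.upper() for w in words].index("CHILD")
            parents, kids = words[1:split], words[split + 1:]
            edges.extend((p, c) for p in parents for c in kids)
    return jobs, edges


jobs, edges = parse_dag(DIAMOND_DAG)
print(jobs)
print(edges)
```

Note that a single line such as `Parent B C Child D` expands to two edges, B to D and C to D.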
Submitting a DAG • To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs: % condor_submit_dag diamond.dag • condor_submit_dag submits a job with DAGMan as the executable. • This job runs on the submitting machine, not on any other computer. • Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
Running a DAG • DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [Figure: DAGMan reads the .dag file and places job A in the Condor job queue]
Running a DAG (cont’d) • DAGMan holds & submits jobs to the Condor queue at the appropriate times. [Figure: jobs B and C are now in the Condor job queue]
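The "appropriate time" for a job is the moment all of its parents have completed. The sketch below simulates that rule for an assumed diamond-shaped DAG (A before B and C, both before D). It is not how DAGMan is implemented: `ready_nodes` is a hypothetical helper, and every job is simply assumed to succeed.

```python
def ready_nodes(nodes, edges, done):
    """Return the nodes that are not done yet and whose parents are all done."""
    parents = {n: set() for n in nodes}
    for parent, child in edges:
        parents[child].add(parent)
    return [n for n in nodes if n not in done and parents[n] <= done]


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

done, waves = set(), []
while len(done) < len(nodes):
    wave = ready_nodes(nodes, edges, done)
    if not wave:
        break  # nothing can be submitted; the graph would contain a loop
    waves.append(wave)   # these jobs go to the queue together
    done.update(wave)    # assumption: every job in the wave succeeds

print(waves)  # [['A'], ['B', 'C'], ['D']]
```

Jobs B and C land in the same wave because neither depends on the other; D is held back until both have finished.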
Running a DAG (cont’d) • In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [Figure: one job is marked failed (X) and DAGMan writes a rescue file]
Recovering a DAG • Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [Figure: DAGMan reads the rescue file and job C is back in the Condor job queue]
Recovering a DAG (cont’d) • Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Figure: job D is now in the Condor job queue]
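The failure and recovery behaviour can be sketched as two runs over the same graph. The slides do not show the rescue file's format, so this sketch does not imitate it: the "rescue state" here is just an in-memory set of completed node names. `run_dag`, the diamond-shaped DAG and the choice of C as the failing node are assumptions for illustration.

```python
def run_dag(nodes, edges, succeeds, done=()):
    """Run every node whose parents are done until no progress is possible.

    Returns (done, failed) as sets of node names.
    """
    parents = {n: set() for n in nodes}
    for parent, child in edges:
        parents[child].add(parent)

    done, failed = set(done), set()
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node in done or node in failed or not parents[node] <= done:
                continue  # already handled, or still waiting on a parent
            (done if succeeds(node) else failed).add(node)
            progress = True
    return done, failed


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

# First run: job C fails. B still runs, but D can never start.
rescue_state, failed = run_dag(nodes, edges, succeeds=lambda n: n != "C")
print(sorted(rescue_state), sorted(failed))  # ['A', 'B'] ['C']

# Second run: start from the saved state, so A and B are not repeated.
final_done, final_failed = run_dag(nodes, edges, succeeds=lambda n: True,
                                   done=rescue_state)
print(sorted(final_done), sorted(final_failed))  # ['A', 'B', 'C', 'D'] []
```

The first run shows "continues until it can no longer make progress": B is independent of C and completes, while D stays blocked.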
Finishing a DAG • Once the DAG is complete, the DAGMan job itself is finished, and exits. [Figure: the Condor job queue is empty]
Additional DAGMan Features • Provides other handy features for job management… • nodes can have PRE & POST scripts • failed nodes can be automatically re-tried a configurable number of times • job submission can be “throttled”
Another sample DAGMan submit file
    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Script PRE A top_pre.csh
    Script PRE B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3
[Figure: diamond-shaped DAG with nodes Job A, Job B, Job C, Job D]
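How the Script and Retry lines interact for a single node can be sketched as a small loop. This is a simplified model under stated assumptions, not DAGMan's implementation: `run_node` is a hypothetical helper; "Retry C 3" is taken to mean up to three retries after the first attempt; a failed PRE script is assumed to fail the attempt without running the job; and when a POST script exists, its exit status (computed from the job's return code, as `$RETURN` suggests) is assumed to decide whether the node succeeded.

```python
def run_node(pre, job, post, retries):
    """Try PRE -> job -> POST up to 1 + retries times.

    Each callable returns an exit status (0 means success).
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, retries + 2):
        if pre is not None and pre() != 0:
            continue  # assumption: a failed PRE script fails this attempt
        return_code = job()
        # Assumption: with a POST script, its status decides success.
        status = post(return_code) if post is not None else return_code
        if status == 0:
            return attempt
    return None


# Hypothetical node C: the job fails twice, then succeeds.
outcomes = iter([1, 1, 0])
attempts = run_node(pre=lambda: 0,
                    job=lambda: next(outcomes),
                    post=lambda rc: rc,
                    retries=3)
print(attempts)  # 3
```

With `retries=3` the node gets at most four attempts in this model; here it succeeds on the third, so the fourth is never used.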
Lab 8: DAGMan
Lab 8: DAGMan • In this lab, you’ll: • Run a simple DAGMan job • Run a more complex DAGMan job • Recover a failed DAGMan job
Credits • NSF disclaimer • Portions of this presentation were adapted from the following sources: • Jaime Frey, UW-Madison