Part 8: DAGMan

Part 8: DAGMan
• A: Grid Workflow Management
• B: DAGMan
• C: Laboratory: DAGMan

A: Grid Workflow Management

Job Dependencies
• In many applications, some jobs depend on other jobs
• E.g., job A must finish before job B starts
• Often because job B uses output from job A
• We call a set of interdependent jobs a workflow
• Condor-G can run jobs in any order
• We need a workflow manager

Two Motivating Examples
• The Sloan Digital Sky Survey
• The Montage Project

Sloan Digital Sky Survey
• Map one-quarter of the entire sky
• Determine the positions and absolute brightness of more than 100 million celestial objects
• Measure the distance to a million of the nearest galaxies, and to 100,000 quasars
http://www.sdss.org

Workflow to Find Galaxy Clusters
[Figure: workflow diagram; labels: catalog, getCatalog, cluster, bcgCoal, core, brg, field, tsObj, maxBcg, maxBrg, fieldPrep]

Workflow to Find Galaxy Clusters
[Figure: workflow diagram; labels: getCatalog, bcgCoal, maxBcg, maxBrg]

Montage
• Create a large mosaic image from many smaller images
• Used for astronomy data
• Correct optical distortions and intensity differences
http://montage.ipac.caltech.edu

Montage Workflow
[Figure legend: data stage-in nodes, Montage compute nodes, data stage-out nodes, inter-pool transfer nodes]

Montage Workflow 1202 nodes

B: DAGMan

DAGMan
• Directed Acyclic Graph Manager
• Workflow manager for Condor-G
• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
• By default, Condor may run your jobs in any order, or all of them simultaneously, so we need DAGMan to enforce an ordering when necessary.
• (E.g., “Don’t run job B until job A has completed successfully.”)

What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a “node” in the DAG.
• Each node can have any number of “parent” or “children” nodes, as long as there are no loops!
[Figure: DAG with nodes Job A, Job B, Job C, Job D]
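The "no loops" rule can be checked mechanically. The following is a minimal sketch, not DAGMan's own code: it assumes a four-node workflow (A, B, C, D) stored as a plain Python dictionary that maps each node to its parents, and it uses the standard-library `graphlib` module (Python 3.9 or later) to produce one valid run order or to report a loop. The dictionary layout and the helper name `run_order` are illustrative choices.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical workflow: each key is a node, each value is the set of its parents.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

def run_order(parents):
    """Return one valid execution order; raise CycleError if the graph has a loop."""
    return list(TopologicalSorter(parents).static_order())

order = run_order(parents)
print(order)  # A comes first and D last; B and C may appear in either order

# Making D a parent of A creates the loop A -> B -> D -> A, which is not a DAG.
looped = dict(parents, A={"D"})
try:
    run_order(looped)
    has_loop = False
except CycleError:
    has_loop = True
print("loop detected:", has_loop)
```

Any order the sorter returns satisfies every parent-before-child constraint, which is the property a workflow manager has to enforce.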

Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

• Each node will run the Condor job specified by its accompanying Condor submit file
[Figure: DAG with nodes Job A, Job B, Job C, Job D]
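To make the file layout concrete, here is a small sketch of a reader for the two line types shown in diamond.dag. It is an illustration only, not DAGMan's parser: the function name `parse_dag` is hypothetical, it understands only `Job` and `Parent ... Child ...` lines, it treats lines beginning with `#` as comments, and it assumes keywords may be written in any letter case.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines into (jobs, parents) dictionaries."""
    jobs = {}      # node name -> submit file name
    parents = {}   # node name -> set of parent node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after is a child.
            split_at = [w.upper() for w in words].index("CHILD")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return jobs, parents

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, parents = parse_dag(DIAMOND)
print(jobs)
print(parents)
```

The `parents` dictionary is the same shape a scheduler needs: a node may start once every name in its parent set has finished.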

Submitting a DAG
• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

    % condor_submit_dag diamond.dag

• condor_submit_dag submits a job with DAGMan as the executable.
• This job runs on the submitting machine, not on any other computer.
• Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Running a DAG
• DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, the .dag file, the DAG (nodes A, B, C, D), and the Condor job queue holding job A]

Running a DAG (cont’d)
• DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan, the DAG (nodes A, B, C, D), and the Condor job queue holding jobs B and C]
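The hold-and-submit behaviour can be sketched as a loop that releases a node only when all of its parents have finished. This is a simplified simulation under stated assumptions, not how DAGMan is implemented: it works in rounds, pretends every submitted job succeeds, and uses a hypothetical `simulate` function over the diamond-shaped dependencies.

```python
def simulate(parents):
    """Yield the batches of nodes that become ready, round by round."""
    done = set()
    remaining = set(parents)
    while remaining:
        # A node is ready once every one of its parents has completed.
        ready = sorted(n for n in remaining if parents[n] <= done)
        if not ready:
            raise ValueError("no node is ready: the graph has a loop")
        yield ready              # "submit" this batch to the job queue
        done.update(ready)       # assume every submitted job succeeds
        remaining -= set(ready)

diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
batches = list(simulate(diamond))
print(batches)  # [['A'], ['B', 'C'], ['D']]
```

The batches match the slides: A is queued first, then B and C together, and D only after both have finished.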

Running a DAG (cont’d)
• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: the DAG with one node marked X (failed), the Condor job queue, and DAGMan writing a rescue file]

Recovering a DAG
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan reading the rescue file, the DAG (nodes A, B, C, D), and the Condor job queue holding job C]

Recovering a DAG (cont’d)
• Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan, the DAG (nodes A, B, C, D), and the Condor job queue holding job D]
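The failure and recovery sequence can be imitated in a few lines. This is a sketch under assumptions, not DAGMan's implementation: the function names are hypothetical, node C is made to fail on purpose, and the "rescue" text simply repeats each Job line with DONE appended for finished nodes. The layout of a real rescue file depends on the Condor version and is not shown in the slides.

```python
def run_until_stuck(parents, fails, done=()):
    """Run every node whose parents are done; stop when nothing more can run.

    Returns (done, failed, ran): finished nodes, failed nodes, and the
    order in which nodes were attempted during this call.
    """
    done, failed, ran = set(done), set(), []
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:      # all parents finished successfully
                ran.append(node)
                (failed if node in fails else done).add(node)
                progress = True
    return done, failed, ran

def rescue_lines(jobs, done):
    """Hypothetical rescue text: the Job lines, with finished nodes marked DONE."""
    return [f"JOB {name} {sub}" + (" DONE" if name in done else "")
            for name, sub in sorted(jobs.items())]

jobs = {"A": "a.sub", "B": "b.sub", "C": "c.sub", "D": "d.sub"}
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First run: node C fails, so D can never start.
done, failed, ran = run_until_stuck(parents, fails={"C"})
rescue = rescue_lines(jobs, done)
print(ran, sorted(failed))
print("\n".join(rescue))

# Recovery: start again from the saved state, with C now fixed.
done2, failed2, ran2 = run_until_stuck(parents, fails=set(), done=done)
print(ran2)
```

In the second call only C and D are attempted; A and B are not re-run because the saved state already records them as finished.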

Finishing a DAG
• Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan, the completed DAG (nodes A, B, C, D), and the Condor job queue]

Additional DAGMan Features
• Provides other handy features for job management…
• Nodes can have PRE & POST scripts
• Failed nodes can be automatically re-tried a configurable number of times
• Job submission can be “throttled”
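Two of these features, retries and throttling, are easy to picture in code. The sketch below is illustrative only and uses hypothetical helper names: `run_with_retries` assumes a retry count of N means up to N extra attempts after the first failure, and `throttle` assumes throttling means capping how many ready jobs are handed to the queue at once.

```python
def run_with_retries(job, retries):
    """Call job() until it returns 0, allowing `retries` extra attempts."""
    attempts = 0
    while True:
        attempts += 1
        status = job()
        if status == 0 or attempts > retries:
            return status, attempts

def flaky_job_factory(failures_before_success):
    """Build a hypothetical job that fails a fixed number of times, then succeeds."""
    state = {"left": failures_before_success}
    def job():
        if state["left"] > 0:
            state["left"] -= 1
            return 1          # non-zero exit status: failure
        return 0              # zero exit status: success
    return job

def throttle(ready, max_jobs):
    """Split ready nodes into those submitted now and those held back."""
    return ready[:max_jobs], ready[max_jobs:]

status, attempts = run_with_retries(flaky_job_factory(2), retries=3)
print(status, attempts)  # 0 3

submit_now, held = throttle(["B", "C", "E", "F"], max_jobs=2)
print(submit_now, held)  # ['B', 'C'] ['E', 'F']
```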

Another sample DAGMan submit file

    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Script PRE A top_pre.csh
    Script PRE B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3

[Figure: DAG with nodes Job A, Job B, Job C, Job D]
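The `$JOB` and `$RETURN` arguments on the Script lines are placeholders that are filled in before a script runs. The slides do not define them; the sketch below assumes `$JOB` stands for the node name and `$RETURN` for the exit value of the node's job, and the `expand` helper is a hypothetical illustration rather than anything DAGMan provides.

```python
def expand(args, job, ret=None):
    """Replace the $JOB and $RETURN placeholders in a script's argument list."""
    values = {"$JOB": job}
    if ret is not None:
        values["$RETURN"] = str(ret)   # only known once the job has finished
    return [values.get(arg, arg) for arg in args]

# PRE script for node B: runs before the job, so no return value exists yet.
pre = expand(["mid_pre.perl", "$JOB"], job="B")

# POST script for node B: runs after the job and receives its exit value.
post = expand(["mid_post.perl", "$JOB", "$RETURN"], job="B", ret=0)

print(pre)   # ['mid_pre.perl', 'B']
print(post)  # ['mid_post.perl', 'B', '0']
```

Passing the node name lets one script, such as mid_pre.perl, be shared by nodes B and C while still knowing which node it is serving.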

Lab 8: DAGMan

Lab 8: DAGMan
• In this lab, you’ll:
• Run a simple DAGMan job
• Run a more complex DAGMan job
• Recover a failed DAGMan job

Credits
• NSF disclaimer
• Portions of this presentation were adapted from the following sources:
• Jaime Frey, UW-Madison