Updates on "job checkpointing and partitioning"
Massimo Sgaravatto, INFN Padova
Changes in the doc (w.r.t. previous release)
- Removed files from job state
  - Job state is now defined just by <var, value> pairs
- Not possible to move files from sub-jobs to the job aggregator with job partitioning
  - They must be saved to an SE, and their identifiers specified as <var, value> pairs in the sub-jobs' final job states
- LB server used to persistently save the job states
  - Removed chkpt-server
- Possibility to specify a pre-job (besides the job aggregator) in job partitioning
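
Since files are no longer part of the job state, the aggregator has to recover the output file identifiers from the <var, value> pairs of the sub-jobs' final states. A minimal Python sketch of that lookup, under stated assumptions: the function `collect_output_ids`, the variable name `output_file_id` and the identifier strings are hypothetical and not part of the document.

```python
from typing import Dict, List


def collect_output_ids(
    final_states: Dict[str, Dict[str, str]],
    var_name: str = "output_file_id",  # hypothetical variable name
) -> List[str]:
    """Gather the SE file identifiers published by the sub-jobs.

    final_states maps a sub-job identifier to the <var, value> pairs of
    its final state. Sub-jobs that did not publish var_name are skipped.
    """
    ids: List[str] = []
    for sub_job_id in sorted(final_states):  # deterministic order
        pairs = final_states[sub_job_id]
        if var_name in pairs:
            ids.append(pairs[var_name])
    return ids
```

The aggregator would then fetch each identifier from the SE itself; that transfer is outside this sketch.
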
Changes in the doc (w.r.t. previous release)
- Two new functions added to the API
  - set_final_state
    - To specify that the state is the last one
  - is_final_state
    - Is this state the last one (i.e. was it "marked" using the set_final_state method)?
- Checking whether all the sub-jobs have saved their final states is done by the job aggregator
  - The job aggregator is responsible for deciding the policy (e.g. all sub-jobs had to save their final states, at least one sub-job had to save its final state, at least x % of sub-jobs had to save their final states, ...)
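
The example policies (all sub-jobs, at least one, at least x %) can all be written as one threshold check. A minimal Python sketch, assuming the aggregator already knows, for each sub-job, whether its final state was saved (for instance from is_final_state); the function name `final_state_policy_met` and its parameters are hypothetical.

```python
from typing import Mapping


def final_state_policy_met(final_flags: Mapping[str, bool], min_percent: int) -> bool:
    """Return True if enough sub-jobs have saved their final state.

    final_flags maps a sub-job identifier to True when that sub-job has
    saved a final state. min_percent is the required share of sub-jobs:
    100 means "all", any small value such as 1 means "at least one".
    """
    if not 1 <= min_percent <= 100:
        raise ValueError("min_percent must be between 1 and 100")
    total = len(final_flags)
    if total == 0:
        return False  # no sub-jobs, nothing to aggregate
    # Integer ceiling of min_percent * total / 100, never below one sub-job.
    required = max(1, (min_percent * total + 99) // 100)
    saved = sum(1 for done in final_flags.values() if done)
    return saved >= required
```

Integer arithmetic is used for the threshold so that rounding errors in floating-point percentages cannot raise the required count by one.
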
APIs
Object State: {
    // Data members
    Label_t state_id = "label";
    VarValueSet var_value_pairs[] = {"var1"="value1", "var2"="value2", ...};
    StepsSet main_stepper = {"element1", "element2", "element3", ...};
    Label_t current_step;

    // Methods
    int save_value(Pair);
    int save_state();
    string get_string_value(string);
    int get_int_value(string);
    double get_double_value(string);
    State load_state(Label_t);
    Label_t get_next_step();
    int set_final_state();
    bool is_final_state(Label_t);
}
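
To make the intended use of this interface concrete, here is a minimal Python sketch of a State object with the same method names. It is an illustration under assumptions, not the project's implementation: the in-memory dictionary `_STORE` stands in for the persistent store (the LB server in these slides), return code 0 is assumed to mean success, and the constructor signature is hypothetical.

```python
import copy
from typing import Dict, List, Optional, Union

Value = Union[str, int, float]

# Hypothetical in-memory stand-in for the persistent store
# (the slides assign this role to the LB server).
_STORE: Dict[str, dict] = {}


class State:
    """Illustrative checkpoint state: <var, value> pairs plus a step iterator."""

    def __init__(self, state_id: str, steps: List[str]) -> None:
        self.state_id = state_id
        self.var_value_pairs: Dict[str, Value] = {}
        self.main_stepper = list(steps)
        self.current_step: Optional[str] = None
        self._next_index = 0  # index of the step get_next_step() returns next
        self._final = False

    def save_value(self, name: str, value: Value) -> int:
        self.var_value_pairs[name] = value
        return 0  # assumed convention: 0 means success

    def get_string_value(self, name: str) -> str:
        return str(self.var_value_pairs[name])

    def get_int_value(self, name: str) -> int:
        return int(self.var_value_pairs[name])

    def get_double_value(self, name: str) -> float:
        return float(self.var_value_pairs[name])

    def get_next_step(self) -> Optional[str]:
        """Advance the stepper; return None when all steps are consumed."""
        if self._next_index >= len(self.main_stepper):
            return None
        self.current_step = self.main_stepper[self._next_index]
        self._next_index += 1
        return self.current_step

    def set_final_state(self) -> int:
        self._final = True
        return 0

    def save_state(self) -> int:
        # Store a snapshot so later changes do not alter the checkpoint.
        _STORE[self.state_id] = copy.deepcopy(vars(self))
        return 0

    @staticmethod
    def load_state(state_id: str) -> "State":
        record = _STORE[state_id]  # raises KeyError if nothing was saved
        state = State(state_id, record["main_stepper"])
        vars(state).update(copy.deepcopy(record))
        return state

    @staticmethod
    def is_final_state(state_id: str) -> bool:
        record = _STORE.get(state_id)
        return record is not None and record["_final"]
```

In this sketch a job calls get_next_step, processes the step, records its results with save_value and then calls save_state; after a restart, load_state followed by get_next_step resumes from the first step that was not yet checkpointed.
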
Issues
- Specification of JobSteps for the job aggregator
  - Should be the identifiers of the final states of the sub-jobs
  - Possible approach: sub-jobs' state ids represented by the sub-jobs' dg-job-ids
  - Necessary to know the dg-job-ids of the sub-jobs given the dg-job-id of the original "partitionable" job (the dg-job-id associated with the DAG)
    - Needed also to allow dg-get-job-chkpt for a partitionable job (dg-job-id of the partitionable job given as argument)
    - Should return the states of its various sub-jobs
- Avoid all sub-jobs being submitted to the same CE
  - Same problem also when a bunch of jobs with the same Requirements and Ranks are submitted together (EstimatedTraversalTime not promptly updated)
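
The first issue amounts to a lookup from the dg-job-id of the partitionable job to the dg-job-ids of its sub-jobs, which then double as state ids. A minimal Python sketch of what a checkpoint query for a partitionable job could return, assuming such a mapping is available; the function name `get_job_chkpt`, both dictionaries and the identifier strings are hypothetical and do not describe the actual dg-get-job-chkpt command.

```python
from typing import Dict, List

Pairs = Dict[str, str]


def get_job_chkpt(
    job_id: str,
    sub_jobs_of: Dict[str, List[str]],
    saved_states: Dict[str, Pairs],
) -> Dict[str, Pairs]:
    """Return the saved states reachable from job_id.

    sub_jobs_of maps the id of a partitionable job (the DAG) to the ids
    of its sub-jobs. saved_states maps a job id, used directly as the
    state id, to the <var, value> pairs of its last saved state.
    """
    sub_ids = sub_jobs_of.get(job_id)
    if sub_ids is None:
        # Ordinary job: at most one state, stored under its own id.
        return {job_id: saved_states[job_id]} if job_id in saved_states else {}
    # Partitionable job: one entry per sub-job that has saved a state.
    return {sid: saved_states[sid] for sid in sub_ids if sid in saved_states}
```

With this representation the JobSteps of the aggregator would simply be the list `sub_jobs_of[job_id]`.
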
Next steps
- Some time (10 days?) for other WP1 internal comments, and then submit to the WP8 TWG?
- Definition of the architecture in much more detail
  - Coordination with other teams, in particular CESNET (LB) and CNAF (DAGMAN)