The STAR Unified Meta-Scheduler (SUMS)
A front end around evolving technologies for user analysis and data production

Jérôme Lauret, Gabriele Carcassi, Levente Hajdu, Efstratios Efstathiadis, Lidia Didenko, Valeri Fine, Iwona Sakrejda, Doug Olson

Outline
• Project overview
  - STAR Experiment
  - Problematic
  - Solution
• Design and architecture
  - Basic principles
  - Building blocks
  - Add-on (usage tracking)
• Usage
  - Grid experience
• Schedulers
  - Key features
  - MonALISA policy
• Contributions
  - GUI, dispatchers
• Future work & Conclusion

Sept 27th - Oct 1st 2004, Jérôme LAURET, RHIC-STAR/BNL

Project overview

The STAR Experiment
• The Solenoidal Tracker At RHIC (http://www.star.bnl.gov/)
  - An experiment located at BNL (USA)
  - A collaboration of 546 people, spanning over 12 countries
  - A PByte-scale experiment overall (raw, reconstructed events, simulation) with a large number of files (several million)
• Run 4 alone (2003-2004) has produced 200 TB of raw data
• Rich set of data analysis and simulation problems
• Expecting 200 TB of reconstructed data
• 40 TB of MuDST (1 pass)
• Files copied to Tier 1 using SRM tools (see Track 4, 344)

Problematic
• Ongoing analysis
  - Past and new sets of data are constantly analyzed
• Data spread over many locations
  - Sites and storage types, some on distributed disk local to each machine, not easily accessible
• Evolving technologies
  - Distributed computing (re)shapes itself as we make progress: Condor-G, portals, Meta-Schedulers, Web Services, Grid Services, …
  - Batch technologies themselves evolve
• Users have to adapt within a productive environment and an ever-growing scientific program
• May be fine for a new experiment, not for running ones

Solution
• Allow users to pursue their scientific endeavor without disruption
  - Make use of current/available resources
  - Ensure the same productivity (subjective without metrics)
• Develop a front end shielding the user from technology details and changes: job concept abstraction
• Attract users to migrate to the new framework & Grid => data management, file relocation => Catalog
• Design a tool/framework allowing for evolution
  - Changing the underlying technology should NOT mean a change in the user's daily routine
  - The framework should allow for testing ideas, plugging in new components (Dispatcher for Local Resource Managers = LRMS), and moving users to distributed computing with no extraneous knowledge

And so SUMS was born …
• Project started in 2002
  - Light developer team (average ~1.0 FTE)
  - Surrounding activities have enriched the project and spawned activities and collaborations (Monitoring, U-JDL, Resource Brokering studies, …)
• Historically
  - STAR project, design and prototype responsibility taken by WSU
  - Project enhanced and brought to the user community (Gabriele Carcassi)
  - Current development & design (Levente Hajdu)
• Entirely written in Java
  - Portable, modular class-based design
  - Project management, auto-documentation, …

Design / Architecture - Open

Basic principles
• Users do NOT write
  - shell scripts
  - or submit series of tag=value
• Instead, they write an XML description (U-JDL)
  - Describing their "intent" to work on files, a DataSet, collections, etc.
  - They do not have to know where those files are located (LFN or collections may convert to PFN)
  - They do not have to handle the gory details of resource management (bsub -R …)
  - They do not need to think about where their job will best fit; their input to SUMS is rates or range indications
• Following a prescribed schema and …
    % star-submit MyJob.xml
    % star-submit-template -template MyTemplateJob.xml -entities jobname=test,year=2004

What it does …

Flow shown on the slide: user input (job description) → query/wildcard resolution → policy → dispatcher.

Job description, test.xml:

    <?xml version="1.0" encoding="utf-8" ?>
    <job maxFilesPerProcess="500">
      <command>root4star -q -b rootMacros/numberOfEventsList.C("$FILELIST")</command>
      <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
      <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all"/>
      <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
    </job>

Query/wildcard resolution turns the input into physical files (/star/data09/reco/productionCentral/FullFie...), which are then split into per-process file lists and scripts:

    sched1043250413862_0.list / .csh
    sched1043250413862_1.list / .csh
    sched1043250413862_2.list / .csh
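The splitting step above can be sketched in Java, the language SUMS is written in. This is a minimal illustration, not the SUMS source: the class `JobSplitter` and its methods are hypothetical, and it assumes that `maxFilesPerProcess` simply caps the number of files per list. The real policy may also group files by location or storage type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not the SUMS source: splits resolved files into per-process lists. */
final class JobSplitter {

    /** Returns consecutive chunks of at most maxFilesPerProcess files each. */
    static List<List<String>> split(List<String> files, int maxFilesPerProcess) {
        if (maxFilesPerProcess <= 0) {
            throw new IllegalArgumentException("maxFilesPerProcess must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < files.size(); start += maxFilesPerProcess) {
            int end = Math.min(start + maxFilesPerProcess, files.size());
            chunks.add(new ArrayList<>(files.subList(start, end)));
        }
        return chunks;
    }

    /** Builds a list-file name following the sched<ID>_<index>.list pattern seen on the slide. */
    static String listName(String jobId, int index) {
        return "sched" + jobId + "_" + index + ".list";
    }

    public static void main(String[] args) {
        // Placeholder file names standing in for the resolved physical files.
        List<String> files = Arrays.asList("a.MuDst.root", "b.MuDst.root", "c.MuDst.root");
        List<List<String>> chunks = split(files, 2);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println(listName("1043250413862", i) + " -> " + chunks.get(i));
        }
    }
}
```

Each chunk would then be written to its own list file and paired with a generated script that exposes the list through `$FILELIST`.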

Architecture / building blocks
• Main boxes are Java classes
• The framework chooses the blocks to use depending on user options (% … -policy XXX)
• Interfaces between blocks are identical
• Implementations of the Policy class = the heart of SUMS (decision making, planning, resource brokering, …)
• Extendable, adaptable
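The idea of interchangeable blocks selected by a user option can be sketched as follows. All names here (`Policy`, `SingleQueuePolicy`, `PolicyRegistry`, the `plan` method, the queue name `lsf_short`) are hypothetical and chosen for illustration; the slides do not show the actual SUMS class signatures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical registry: looks up a policy implementation by the name given as an option. */
final class PolicyRegistry {
    private final Map<String, Supplier<Policy>> factories = new HashMap<>();

    void register(String name, Supplier<Policy> factory) {
        factories.put(name, factory);
    }

    Policy create(String name) {
        Supplier<Policy> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown policy: " + name);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        PolicyRegistry registry = new PolicyRegistry();
        registry.register("SingleLsfQueue", () -> new SingleQueuePolicy("lsf_short"));
        Policy policy = registry.create("SingleLsfQueue");
        System.out.println(policy.plan(Arrays.asList("job_0", "job_1")));
    }
}

/** Hypothetical common interface: every policy block exposes the same method. */
interface Policy {
    /** Maps each job name to the name of the queue it should go to. */
    Map<String, String> plan(List<String> jobs);
}

/** Illustrative policy that sends every job to one fixed queue. */
final class SingleQueuePolicy implements Policy {
    private final String queueName;

    SingleQueuePolicy(String queueName) {
        this.queueName = queueName;
    }

    @Override
    public Map<String, String> plan(List<String> jobs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String job : jobs) {
            result.put(job, queueName);
        }
        return result;
    }
}
```

Because callers only see the `Policy` interface, a more elaborate implementation can be swapped in without touching the rest of the chain.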

Job Initializer
XML is validated, request objects created …

Queues
• Queue concept is "open"
  - A queue can be an LRMS queue (PBS, LSF, SGE, …)
  - A queue can be a pool or a DRMS (Condor, Condor-G, …)
  - A Web or Grid Service … anything for which a dispatcher can be written
• The object container is defined or defines
  - Defined by a name (may be logical)
  - Associated to a dispatcher (has a pointer to a dispatcher object); LSFDispatcher uses logical name = queue name
  - Has resource requirements
    - CPU time limits, memory limits, the type of storage it can access, storage limits
    - Base rule: they can be undefined, -1 (to be expected from the Policy standpoint)
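A queue object along these lines might look like the sketch below. The names `SumsQueue` and `Dispatcher`, the fields, and the units are hypothetical; the only detail taken from the slide is the -1 convention for an undefined limit. Treating an undefined limit as "no constraint" is an assumption made for this example.

```java
/** Hypothetical queue description; field names are illustrative, not the SUMS API. */
final class SumsQueue {
    /** Marker for "limit not defined", following the -1 convention on the slide. */
    static final int UNDEFINED = -1;

    final String name;              // may be a logical name
    final Dispatcher dispatcher;    // the block that knows how to submit to this queue
    final int cpuTimeLimitMinutes;  // UNDEFINED if not set
    final int memoryLimitMB;        // UNDEFINED if not set

    SumsQueue(String name, Dispatcher dispatcher, int cpuTimeLimitMinutes, int memoryLimitMB) {
        this.name = name;
        this.dispatcher = dispatcher;
        this.cpuTimeLimitMinutes = cpuTimeLimitMinutes;
        this.memoryLimitMB = memoryLimitMB;
    }

    /** Assumption: an undefined limit places no constraint on the job. */
    boolean accepts(int cpuMinutesNeeded, int memoryMBNeeded) {
        return fits(cpuTimeLimitMinutes, cpuMinutesNeeded) && fits(memoryLimitMB, memoryMBNeeded);
    }

    private static boolean fits(int limit, int needed) {
        return limit == UNDEFINED || needed <= limit;
    }

    public static void main(String[] args) {
        // A stand-in dispatcher that only reports what it would do.
        Dispatcher echo = (script, queue) -> "would submit " + script + " to " + queue;
        SumsQueue shortQueue = new SumsQueue("short", echo, 90, UNDEFINED);
        if (shortQueue.accepts(60, 2048)) {
            System.out.println(shortQueue.dispatcher.submit("sched_0.csh", shortQueue.name));
        }
    }
}

/** Hypothetical dispatcher interface: one implementation per LRMS, DRMS or service. */
interface Dispatcher {
    String submit(String scriptPath, String queueName);
}
```

A policy can then filter candidate queues with `accepts` before deciding where each job goes.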

Policies
• Policies integrate pre-defined queues
  - Serialized XML as local configuration
  - A policy can make use of as many queues as necessary
• Queues may have
  - a type (LSF, PBS, Condor, …)
  - a scope (Local, Distributed, …)
  - This allows SUMS to decide which one to take depending on the RB (resource broker) decision
• Queues can be given an initial weight (for example, used for ordering if weight = priority)
• Queues have a weight increment
• Complex policies may order queues as necessary (your choice); the default order is by weight (priority)
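One possible reading of the weight and weight-increment mechanism is sketched below. The slides do not spell out the semantics, so this example assumes that the queue with the lowest current weight is chosen and that its weight then grows by its increment, which spreads jobs across queues over time. `WeightOrderPolicy` and `WeightedQueue` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical default policy: order queues by weight, bump the weight after each use. */
final class WeightOrderPolicy {

    /** Returns the queue name chosen for each of jobCount jobs, in submission order. */
    static List<String> assign(List<WeightedQueue> queues, int jobCount) {
        if (queues.isEmpty()) {
            throw new IllegalArgumentException("no queues configured");
        }
        Comparator<WeightedQueue> byWeight = Comparator.comparingInt((WeightedQueue q) -> q.weight);
        List<String> plan = new ArrayList<>();
        for (int job = 0; job < jobCount; job++) {
            // Assumption: lowest weight means highest priority.
            WeightedQueue best = Collections.min(queues, byWeight);
            plan.add(best.name);
            best.weight += best.increment;
        }
        return plan;
    }

    public static void main(String[] args) {
        List<WeightedQueue> queues = Arrays.asList(
                new WeightedQueue("local_lsf", 1, 3),
                new WeightedQueue("condor_pool", 2, 1));
        System.out.println(assign(queues, 3));
    }
}

/** Hypothetical queue record carrying an initial weight and a per-use increment. */
final class WeightedQueue {
    final String name;
    int weight;
    final int increment;

    WeightedQueue(String name, int initialWeight, int increment) {
        this.name = name;
        this.weight = initialWeight;
        this.increment = increment;
    }
}
```

A more complex policy would replace the comparator with its own ordering, as the slide allows.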

Policy note: job splitting
• The <input> element can take several forms
  - Transition formats: PFN, PFN (wildcard)
      <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/322/st_physics_2322006_raw_0016.MuDst.root" />
      <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/*/*.MuDst.root" />
  - Locally distributed PFN support
      <input URL="file://rcas6078.rcf.bnl.gov/home/starreco/productionCentral/FullField/P02gd/2001/279/st_physics_2279005_raw_0285.MuDst.root" />
  - List support
      <input URL="filelist:/star/u/username/filelists/mylist" />
  - Dataset, MetaData support
      <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst,storage=local" nFiles="2000" />
  - … LFN support on the way …
• Preferred STAR usage: map MetaData/Collections or LFN to PFN, dispatch jobs. BUT THERE ARE TWO WAYS
  - PFN converted (URL syntax does not end up in the final lists, APPS work as usual)
  - Lists are formatted and passed to APPS as URLs, APPS need to sort the URLs. Example: rootd-like URL syntax passed as-is
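The input forms listed above differ only in their URL prefix, so a front end can tell them apart with a few string checks. The sketch below is illustrative: `InputUrl`, its `Kind` values and the parsing rules are assumptions derived from the example URLs on the slide, not the SUMS parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical classifier for the URL attribute of an <input> element. */
final class InputUrl {
    enum Kind { FILE, FILE_WILDCARD, HOST_LOCAL_FILE, FILE_LIST, CATALOG_QUERY }

    static Kind classify(String url) {
        if (url.startsWith("catalog:")) return Kind.CATALOG_QUERY;
        if (url.startsWith("filelist:")) return Kind.FILE_LIST;
        // A host-qualified file URL points at a disk local to one machine.
        if (url.startsWith("file://")) return Kind.HOST_LOCAL_FILE;
        if (url.startsWith("file:")) return url.contains("*") ? Kind.FILE_WILDCARD : Kind.FILE;
        throw new IllegalArgumentException("Unsupported input URL: " + url);
    }

    /** Parses the comma-separated key=value conditions that follow '?' in a catalog URL. */
    static Map<String, String> catalogConditions(String url) {
        if (classify(url) != Kind.CATALOG_QUERY) {
            throw new IllegalArgumentException("Not a catalog URL: " + url);
        }
        Map<String, String> conditions = new LinkedHashMap<>();
        int question = url.indexOf('?');
        if (question < 0) {
            return conditions;
        }
        for (String pair : url.substring(question + 1).split(",")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                conditions.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return conditions;
    }

    public static void main(String[] args) {
        String url = "catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst,storage=local";
        System.out.println(classify(url) + " " + catalogConditions(url));
    }
}
```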

Dispatchers
The high-level dispatcher does a redirect to
  - PBS
  - LSF
  - SGE
  - Condor-G
  - BOSS
  - …

Add-On: usage monitoring
• Needed usage feedback: monitoring users' usage to
  - Allow for a better targeted tool
  - Focus on the most used/preferred features
  - Trim down CS fantasy: eliminates divergence and re-focuses; practicality first, sci-fi later …
• Serves the user community better
  - Ensures equity of usage
  - Helps re-focus tutorials & documentation
• JSP based (Tomcat) with MySQL back-end
  - All options and usage are recorded

Example of useful information …
• Implemented two ways of accessing locally distributed files. Is it used?
• Which storage type is most used … may very well be a $$ / accessibility question
• Added SGE dispatcher a few weeks ago …

Example II-a
Plot panels: PDSF, BNL. 4500 jobs/day, peaks at 20 k.

Example II-b
The pessimistic graph is an integral count over time. It shows that after first usage, users keep using SUMS …
NB: the drop from the beginning of the summer indicates
• Vacation time
• Conference time
• Lack of new data
(This is not the best period for a SUMS commercial, but it is informative nonetheless.)
See more statistics at http://www.star.bnl.gov/STAR/comp/Grid/scheduler/

Physicist usage
• As far as we know, 85% of active users are using SUMS
• Selection of publications confirmed as 100% SUMS-based analysis
  - J. Gonzales, Nuclear Experiment, abstract nucl-ex/0408016, Pseudorapidity Asymmetry and Centrality Dependence of Charged Hadron Spectra in d+Au Collisions at sqrt(SNN)=200 GeV (submitted to PRC)
  - L. S. Barnby, QM Proceedings, 2004 J. Phys. G: Nucl. Part. Phys. 30 S1121-S1124
  - T. Henry, Full jet reconstruction in d+Au and p+p collisions at RHIC, Journal of Physics G: Nuclear Physics (volume 30, issue 8) S1287
  - J. S. Lange, Proceedings 19th Winter Workshop on Nuclear Dynamics (2003), nucl-ex/0306005, Review of search for heavy flavor (c, b quarks) production in leptonic decay channels in Au+Au collisions at sqrt(sNN)=200 GeV at the STAR Experiment at RHIC
  - A. Tang, Anisotropy at RHIC: the first and the fourth harmonic
  - …
  - http://www.star.bnl.gov/central/publications/ (7 papers / analyses submitted in the past 3 months)

Grid experience
• Use of SUMS for Grid job submissions is possible
  - Modulo RSL extensions
  - <input> and <output> tags MUST specify paths as relative paths ("bla.root", "blop/test.dat", …)
  - <output> attributes fromScratch / toURL are designed to bring the files back (globus-url-copy)
• Grid experience has been a challenge
  - Cryptic messages: we had a problem with a Globus error 74 and no clue what it was for months; no Grid help desk, no knowledge base index. It turned out to be a firewall issue, with bursts of massive job death
• Nonetheless
  - ¼ of the Run 4 simulation production was made on the Grid
    - 100,000 events generated, analysis ongoing
  - Success rate
    - 85% when all goes well
    - 60% when lots of jobs are submitted (above issue)
  - Planning to run on larger-scale platforms, Grid3+ and/or OSG-0, with (hopefully) better ways to track errors/problems

Schedulers

Schedulers
• Can a user front end to other LRMS/DRMS be called a "scheduler"?
• Is using the local resource within the same paradigm as using globally distributed resources?

             | Traditional - LRMS                                             | Distributed - DRMS
  -----------+----------------------------------------------------------------+-------------------------------------------------------------------------
  Job        | Mostly serialized                                              | Possibly following a work-flow
  Data       | File based                                                     | Data sets, collections, …
  Scheduling | One LRMS used                                                  | Many; issues are consistencies, QoS, unified information (from/to)
  AAA        | Handled by LRMS                                                | VO based, ownership is itself an issue
  Resources  | Dedicated or local policy managed (priority, usage throttle, …) | Common, no global policies but agreements or statements of understanding

Schedulers
• Key features for a scheduler
  - Keep global accounting
  - Scheduling decisions may be based on
    - Resource availability, respect of local policies, fairshare (cluster autonomy)
    - Advance reservation, best use of resources
    - Network and data cache, data availability …
  - Job migration, moving jobs to/from a trusted cluster
  - Spanning and workflow
  - Human-readable messages …
• Scheduling algorithm can be complex
  - Attempts to predict (Weather Services) have proven difficult
  - Dedicated global accounting and standard messages possible
  - Mix of LRMS and DRMS capabilities (user autonomy) not common
  - A complex algorithm takes into account so many parameters …
• Empirical approach
  - Inspect queue behavior, send jobs, see how the queue reacts … readjust
  - Self-sustained system
  - Adapts to network/resource/load changes ??

Empirical approach (?)
Diagram labels: Monitoring, Policy, LSF
• Information is fed by agents to ML (MonALISA)
• Information is recovered by a SUMS module
  - Scheduling decisions made based on load and "queue" or "pool" response time
  - Self-sustained system (no need for percentage-based submission branching)
  - Hopefully no need for a complex algorithm
  - Responds as resources, priorities and bandwidth adjust
• Results / details in Efstratios Efstathiadis's presentation, Track 4 - 393
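A feedback loop of this kind can be illustrated with a small tracker that keeps a running average of observed response times and prefers the queue that currently answers fastest. This is only a sketch of the general idea: the class `ResponseTimeTracker`, the exponential moving average, and the smoothing factor `alpha` are assumptions made here, and the slides do not describe the actual MonALISA-based policy in this detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical tracker: prefers the queue or pool with the lowest smoothed response time. */
final class ResponseTimeTracker {
    private final double alpha;  // weight of the newest observation, in (0, 1]
    private final Map<String, Double> averageSeconds = new LinkedHashMap<>();

    ResponseTimeTracker(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Records how long a queue took to start a job, as reported by monitoring. */
    void record(String queue, double seconds) {
        Double previous = averageSeconds.get(queue);
        double updated = (previous == null) ? seconds : alpha * seconds + (1.0 - alpha) * previous;
        averageSeconds.put(queue, updated);
    }

    /** Returns the queue with the lowest smoothed response time, or null if nothing was recorded. */
    String fastest() {
        String best = null;
        double bestTime = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> entry : averageSeconds.entrySet()) {
            if (entry.getValue() < bestTime) {
                bestTime = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ResponseTimeTracker tracker = new ResponseTimeTracker(0.5);
        tracker.record("lsf", 100.0);
        tracker.record("condor", 300.0);
        tracker.record("lsf", 700.0);  // the LSF queue slows down, so the preference shifts
        System.out.println("next submission goes to: " + tracker.fastest());
    }
}
```

Because each new measurement shifts the averages, the choice readjusts as load changes, with no fixed percentage split between queues.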

Contributions
• The RHIC/Phenix collaboration has tested and is using SUMS
  - Contributions included the addition of dispatchers (PBS, BOSS): Andrey Shevel
  - Development includes the creation of a GUI front end for end-users: Mike Reuter
• Job tracking and monitoring
  - SUMS allows for dispatching to ANY queue
  - BOSS (from CMS) is a possible solution as "a" dispatcher
  - Implemented / contributed by Andrey Shevel (Phenix/SUNYSB), Track 5, 86
  - BODE tracking

Future work
• High-level user JDL work
  - Started with a document on RDL (PPDG-39)
  - Motivation: the current U-JDL is simple enough but has its limitations
    - Extension to new resource requirements is possible but inelegant
    - U-JDL considers most (but not all) data sets
    - Lacks the concept of tasks and sandboxes
    - For workflow diagrams, only AND (sequential) is implemented (need OR, conditional branching, etc.)
• SBIR with Tech-X (David Alexander)
  - Deliverables
    - Enhanced and complete U-JDL (AJHDL)
    - A WSDL for creating a Grid Service
  - Reviewed most available high-level JDLs
    - Job Submission Description Language (JSDL) (GGF)
    - Analysis Job Description Language (AJDL) (Atlas)
    - User Request Description Language (URDL) (PPDG-39 / JLab/STAR)
    - Job Description Language (JDL) (DataGrid)
    - Job Description Language (JDL) (JLab)
    - …

Future work
• We promised our users the U-JDL will not change
  - As far as they know, it won't (XSLT, schema transformation)
  - But those using AJHDL will have access to more features
• We are working on job tracking
• We are working on the concept of Meta-Log (application-level monitoring)
  - Seems to be forgotten
  - Valeri Fine, Poster, 480

Conclusions
• SUMS is NOT
  - A batch system
  - A toy (real needs, real use, real physics)
• SUMS is
  - A front end to local and distributed RMS, acting like a client to multiple, heterogeneous RMS
  - A flexible, open architecture: an object-oriented framework with plug-and-play features
  - A good environment for further developing
    - Standards (such as high-level JDL)
    - Scalability of other components (ML work, immediate use)
  - Used in STAR for real physics (usage and publication list)
  - Used for distributed / Grid simulation job submission
  - Used successfully by other experiments
  - A means to make active users transition to distributed computing and recover under-used resources …