Workflow Management Software For Tomorrow
Fernando Barreiro
With: Kaushik De, Alexei Klimentov, Tadashi Maeno, Danila Oleynik, Pavlo Svirin, Matteo Turilli
[Architecture diagram: production requests from AMI and physics groups enter ProdSys2 (DEFT DB/tasks -> JEDI tasks -> jobs); the PanDA server brokers jobs and analysis tasks to pilots on EGEE/EGI, OSG and NDGF worker nodes via condor-g, the ARC interface and the pilot scheduler, and to HPCs; Rucio provides distributed data management and AMI the meta-data handling]
HPC cross-experiment discussion, CERN, May 10, 2019
Outline
• WFM SW evolution (HEP)
• WFM SW on HPC from non-LHC
Workflow Management: PanDA
Production and Distributed Analysis System
PanDA brief story: https://twiki.cern.ch/twiki/bin/view/PanDA
Global ATLAS operations: up to ~800k concurrent job slots; 25-30M jobs/month at >250 sites; ~1400 ATLAS users
BigPanDA Monitor: http://bigpanda.cern.ch/
First exascale workload manager in HENP: 1.3+ exabytes processed every year in 2014-2018; exascale scientific data processing today
• 2005: Initiated for US ATLAS (BNL and UTA)
• 2006: Support for analysis
• 2008: Adopted ATLAS-wide
• 2009: First use beyond ATLAS
• 2011: Dynamic data caching based on usage and demand
• 2012: ASCR/HEP BigPanDA project
• 2014: Network-aware brokerage
• 2014: Job Execution and Definition I/F (JEDI) adds complex task management and fine-grained dynamic job management
• 2014: JEDI-based Event Service
• 2015: New ATLAS Production System, based on PanDA/JEDI
• 2015: Manage heterogeneous computing resources: HPCs and clouds
• 2016: DOE ASCR BigPanDA@Titan project
• 2016: PanDA for bioinformatics
• 2017: COMPASS adopted PanDA; NICA (JINR)
PanDA beyond HEP: BlueBrain, IceCube, LQCD
[Plot: concurrent cores run by PanDA — big HPCs, Grid, clouds]
Future Challenges for WorkFlow(Load) Management Software
• New physics workflows and technologies: machine learning training, parallelization, vectorization…
  – also new ways in which Monte Carlo campaigns are organized
• Address computing model evolution and new strategies: “provisioning for peak”
• Incorporating new architectures (TPU, GPU, RISC, FPGA, ARM…)
• Leveraging new technologies (containerization, no-SQL analysis models, high data reduction frameworks, tracking…)
• Integration with networks (via DDM, via IS, and directly)
• Data popularity -> event popularity
• Address future complexities in workflow handling
  – Machine learning and task time-to-complete prediction
  – Monitoring, analytics, accounting and visualization
  – Granularity and data streaming
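The task time-to-complete (TTC) prediction mentioned above can be illustrated with a minimal sketch. This is a hypothetical baseline using only the running mean and spread of past job durations; a production system would replace it with a trained ML model, and the function and parameter names here are illustrative, not from PanDA.

```python
# Hypothetical TTC baseline: estimate remaining wall time for a task from
# the durations of its already-finished jobs, with a safety margin.
from statistics import mean, stdev

def predict_ttc(history_seconds, remaining_jobs, safety_sigma=1.0):
    """Estimate seconds needed to finish `remaining_jobs` similar jobs."""
    mu = mean(history_seconds)
    sigma = stdev(history_seconds) if len(history_seconds) > 1 else 0.0
    # Pessimistic per-job estimate: mean duration plus a safety margin.
    per_job = mu + safety_sigma * sigma
    return per_job * remaining_jobs

# Example: 4 completed jobs of ~100 s each, 10 jobs still queued.
estimate = predict_ttc([100, 110, 90, 100], remaining_jobs=10)
```

Even such a naive estimator shows why TTC prediction matters for brokerage: it lets the system decide whether a task still fits a site's remaining capacity before dispatching more jobs there.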
Future Development: Harvester Highlights (T. Maeno)
Primary objectives:
• To have a common machinery for diverse computing resources
• To provide a common layer bringing coherence to different HPC implementations
• To optimize workflow executions for diverse site capabilities
• To address the wide spectrum of computing resources/facilities available to ATLAS and to experiments in general
• New model: PanDA server – Harvester – pilot
• The project was launched in Dec 2016 (PI: T. Maeno)
Harvester Status
➢ What is Harvester
  – A bridge service between workload/data management systems and resources, allowing (quasi-)real-time communication between them
  – Flexible deployment model to work with various operational restrictions, constraints, and policies in those resources
    • E.g. local deployment on an edge node for HPCs behind multi-factor authentication; central deployment + SSH + RPC for HPCs without outbound network connections; stage-in/out plugins for various data transfer services/tools; messaging via shared file system; …
  – Experiments can use Harvester by implementing their own plug-ins; Harvester is not tightly coupled with PanDA
➢ Current status
  – Architecture design, coding and implementation completed
  – Commissioning ~done
  – Deployed on a wide range of resources
    • Theta/ALCF, Cori/NERSC, Titan/OLCF in production
    • Summit/OLCF, MareNostrum4/BSC under testing
    • Also at Google Cloud, Grid (~all ATLAS sites), HLT@CERN
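The plug-in idea above can be sketched as follows. Resource-specific behaviour (how to submit workers, monitor them, stage data) lives in plugins so the core stays generic. The class and method names below are illustrative only, not the real Harvester API.

```python
# Hedged sketch of a plugin architecture in the spirit of Harvester:
# a common base interface plus a site-specific implementation.
class SubmitterPlugin:
    """Base interface a resource-specific submitter would implement."""
    def submit_workers(self, workspec_list):
        raise NotImplementedError

class SshSubmitter(SubmitterPlugin):
    """Hypothetical plugin for an HPC reachable only via SSH."""
    def __init__(self, host):
        self.host = host

    def submit_workers(self, workspec_list):
        # In reality this would run something like `ssh <host> sbatch ...`
        # and return the batch system's job identifiers.
        return [f"{self.host}:job-{i}" for i, _ in enumerate(workspec_list)]

submitter = SshSubmitter("hpc-login.example.org")
ids = submitter.submit_workers(["worker-a", "worker-b"])
```

Because only the plugin knows about SSH, batch commands or edge-node quirks, the same core loop can drive Grid sites, clouds and HPCs behind multi-factor authentication alike.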
[Diagram: PanDA/Harvester deployment for ATLAS @OLCF, spanning CERN, BNL and OLCF]
Harvester for Tomorrow (HPC only)
– Full-chain technical validation with Yoda
  • Yoda: ATLAS Event Service with MPI functionality running on HPC
– Yoda + Jumbo jobs in production
  • Jumbo jobs: relax input file boundaries, pick up any event from a dataset
– Two-hop data stage-out with Globus Online + Rucio
– Containers integration
– Implementation of a capability to use CPU and GPU simultaneously within one node for MC and ML payloads
– Implementation of a capability to dynamically shape payloads based on real-time resource information
– Centralization of Harvester instances using CE, SSH, MOM, …
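The simultaneous CPU/GPU use and dynamic payload shaping listed above can be illustrated with a small sketch. This is not Harvester code; the splitting policy and the `ml_cores_per_gpu` parameter are assumptions chosen for the example.

```python
# Illustrative payload shaping: given a real-time snapshot of a node's free
# resources, fill GPUs with ML payloads first (each needing a few host
# cores), then give the leftover cores to CPU-only MC payloads.
def shape_payloads(free_cores, free_gpus, ml_cores_per_gpu=4):
    """Return (ml_slots, mc_cores) for one node."""
    # Each ML slot pins one GPU plus `ml_cores_per_gpu` host cores.
    ml_slots = min(free_gpus, free_cores // ml_cores_per_gpu)
    # Everything left over runs CPU-only Monte Carlo.
    mc_cores = free_cores - ml_slots * ml_cores_per_gpu
    return ml_slots, mc_cores

# A Summit-like node with 42 usable cores and 6 GPUs:
# 6 ML slots (24 cores) plus 18 cores of MC.
split = shape_payloads(42, 6)
```

Re-running the function whenever the resource snapshot changes is what makes the shaping "dynamic": the mix of MC and ML work adapts to what is actually free on the node.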
Simulation Science
From “ENIAC” (~10^3 FLOPS) to “Summit” (~10^17 FLOPS): HEP simulation, neuroscience
• Computing has seen an unparalleled exponential development
• In the last decades supercomputer performance grew 1000x every ~10 years
• Almost all scientific disciplines have long embraced this capability
Original slide from F. Schurmann (EPFL)
Pegasus WFMS / PanDA
• Collaboration started in October 2018
• December 2018: first working prototype, standard Pegasus example (split document / word count), tested on Titan
• Future plans / possible applications / open questions:
  – Test the same workflow in a heterogeneous environment: Grid + HPC (Summit/NERSC/…) with data transfer via Rucio or other tools
  – Possible application: data transfer for LQCD jobs from/to OLCF storage
  – Currently the Pegasus/PanDA integration works at the job level; how to integrate Pegasus with other PanDA components such as JEDI is still TBD
Next Generation Executor Project (Rutgers U)
• Schedules and runs multiple tasks concurrently and consecutively in one or more batch jobs:
  – Tasks are individual programs
  – Tasks are executed within the walltime of each batch job
• Late binding:
  – Tasks are scheduled and then placed within a batch job at runtime
• Task and resource heterogeneity:
  – Scheduling, placing and running CPU/GPU/OpenMM/MPI tasks in the same batch job
  – Use single/multiple CPUs/GPUs for the same task and across multiple tasks
• Supports multiple HPC machines; requires limited development to support a new HPC machine
• Use cases: molecular dynamics and HEP payloads on Summit
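The late-binding idea above can be sketched in a few lines. This is not the NGE implementation; the greedy policy and function name are assumptions made for illustration.

```python
# Illustrative late binding inside a batch allocation: tasks are bound to
# the running batch job only at runtime, and only while their expected
# duration still fits in the remaining walltime.
def schedule_late_binding(task_durations, walltime):
    """Greedily place queued tasks that fit the remaining walltime.

    Returns (indices of placed tasks, walltime left over)."""
    placed, remaining = [], walltime
    for i, dur in enumerate(task_durations):
        if dur <= remaining:
            placed.append(i)
            remaining -= dur
    return placed, remaining

# Four queued tasks, a 1200 s allocation: the 1200 s task no longer fits
# after the first two are placed, but the short final task still does.
placed, left = schedule_late_binding([300, 600, 1200, 200], walltime=1200)
```

The key property is that placement decisions use the allocation's live remaining walltime, not a static plan made at submission time, which is what lets heterogeneous tasks backfill one batch job efficiently.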
Status and Near-Term Plans
• Test runs with bulk task submission with Harvester and NGE
• Address bulk and performant submission (currently ~320 units in 4000 sec)
• Run at scale on Summit once issues with jsrun are addressed by IBM
• Conduct submission with relevant workloads from MD and HEP
- Slides: 12