Pilot Job Frameworks Review
• Introduction
• Summary
• GDB presentation
Maarten Litmaath (CERN), OSG meeting, 2008/06/16

Introduction
• Classic jobs: user mapping on Computing Element head node
  – Before submission to the batch system
• Early binding
  – Jobs may get stuck in a queue
  – Jobs typically cannot be overtaken by higher-priority jobs
    • For the same user or other users in the same VO
• Pilot jobs allow for late binding
  – Run the current highest-priority compatible task given by a central task queue
• A multi-user pilot job may pick a task of any member of the VO
  – Under what local identity will the task run?
• Glexec
  – Setuid root utility for letting the task run under the local identity to which the CE would have mapped the user proxy
  – Can be executed only by proxies with special VOMS roles (e.g. Role=pilot), configured per VO

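The late-binding idea can be sketched in a few lines of Python. This is a minimal illustration, not the code of any experiment framework: the class names `Task` and `CentralTaskQueue`, the resource tags, and the convention that a larger number means higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    owner: str
    priority: int                              # assumption: larger means more urgent
    needs: set = field(default_factory=set)    # hypothetical resource tags


class CentralTaskQueue:
    """Toy central task queue: tasks wait here until a pilot asks for work."""

    def __init__(self):
        self._tasks = []

    def submit(self, task):
        self._tasks.append(task)

    def match(self, offered):
        """Remove and return the highest-priority task the pilot can run, or None."""
        compatible = [t for t in self._tasks if t.needs <= offered]
        if not compatible:
            return None
        best = max(compatible, key=lambda t: t.priority)
        self._tasks.remove(best)
        return best


queue = CentralTaskQueue()
queue.submit(Task("alice", priority=1, needs={"x86_64"}))
queue.submit(Task("bob", priority=5, needs={"x86_64", "bigmem"}))
queue.submit(Task("carol", priority=3, needs={"x86_64"}))

# A pilot on a node without big memory gets carol's task, not bob's:
# the binding happens only now, when the pilot is already running.
picked = queue.match({"x86_64"})
print(picked.owner)
```

Because the task is chosen only when the pilot is already running on a worker node, a task submitted later with a higher priority can overtake older ones, which is not possible once a classic job sits in a batch queue.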
Summary
• What are the security implications of this setuid root utility?
  – Code reviews
    • Additional reviews desired
  – Extensive certification
    • Ongoing
  – How will/should VOs use it exactly?
• GDB Multi-User Pilot Job Frameworks Review WG
  – Review the frameworks of the 4 big LHC experiments
    • Description of architecture and workflows
    • Answers to security questionnaire defined by the WG
• LHCb: looks OK
  – WLCG Management Board awaiting proposal for ratification
• CMS: looks OK, with some details to be provided later
• ALICE, ATLAS: documentation not yet ready

Pilot Job Frameworks Review
• GDB working group mandated by WLCG MB on Jan. 22, 2008
• Mission
  – Review security issues in the pilot job framework of each experiment
    • Pilot jobs are taken as multi-user in this context
  – Define a minimum set of security requirements
  – Advise on improvements
    • Per framework or common to all
  – Report to GDB and MB
• Time frame is a few months
• Members
  – ALICE: Predrag Buncic
  – ATLAS: Torre Wenaus
  – CMS: Igor Sfiligoi
  – LHCb: Andrei Tsaregorodtsev
  – WLCG: Maarten Litmaath (chair)
  – EGEE: David Groep
  – FNAL: Eileen Berman
  – GridPP: Mingchao Ma
  – OSG: Mine Altunay
Maarten Litmaath, GDB, 2008/06/11

Status
• 6 phone conferences held
• Discussion on mailing list
• Each experiment is to provide a document about their system
  – LHCb were the first
  – CMS were next
  – ALICE and ATLAS not ready yet
    • ATLAS have a lot of documentation on PanDA on their TWiki pages
      – https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA
• A security questionnaire has been discussed
  – Agreement on the relevance/scope of a question was not always evident
  – Each document should provide the answers for an experiment
    • Adequacy judged per experiment, no formal criteria
    • LHCb answers deemed satisfactory, CMS to provide more details later

Questionnaire v1.0 (1/2)
• Describe in a schematic way all components of the system.
  – If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get.
• Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage.
• What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent?
• How can running processes be accounted to the correct user?
• How is a task spawned on the WN and how is it destroyed?
• How can a site be blocked?

Questionnaire v1.0 (2/2)
• What site security processes are applied to the machine(s) running the WMS? [Here WMS means the VO WMS, not the gLite WMS.]
  – Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access?
  – How are authorized individuals authenticated on the machine(s)?
  – What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches?
  – Do you have an identified security contact?
  – Describe the incident response plan to deal with security incidents and reports of unauthorized use.
  – What services (in general) run on the machine(s) that offer the WMS service?
  – What processes exist to maintain audit logs (e.g. for use during an incident)?
  – What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use?
• Can you limit the users that can submit jobs to the VO WMS? How?

LHCb DIRAC WMS

DIRAC WMS workflow
• Workload preparation on UI (input files, JDL)
• Workload submission to DIRAC WMS
• Requirements and priority analysis, insertion into Central Task Queue
• Grid submission of pilot job with original requirements
  – Using special credential (VOMS role)
• Start of pilot on WN
  – Install Job Agent, check WN capacity and environment
• Request highest-priority matching workload from Task Queue
  – Download associated limited user proxy
  – Execute job wrapper through glexec
    • Preparation of input data, installation of missing LHCb software as needed
    • Parallel execution of user workload and watchdog process
    • Upload output data
  – Cleanup of user workload and proxy
  – Report resource consumption to DIRAC accounting system
• Get another workload if enough remaining CPU time

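The pilot's main loop in the workflow above can be sketched as follows. This is a toy simulation under stated assumptions, not DIRAC code: `run_pilot`, `fetch`, the CPU budget figures and the workload tuples are hypothetical, and the glexec step is represented only by a comment.

```python
def run_pilot(fetch_workload, cpu_budget, min_cpu_needed):
    """Pull workloads while enough CPU time remains; return accounting records."""
    records = []
    remaining = cpu_budget
    while remaining >= min_cpu_needed:
        workload = fetch_workload(remaining)
        if workload is None:
            break  # nothing suitable in the task queue
        name, owner, cpu_cost = workload
        # A real pilot would download the limited user proxy here and run the
        # job wrapper through glexec under the owner's local identity.
        remaining -= cpu_cost
        # Report resource consumption per user, as in the accounting step.
        records.append({"job": name, "user": owner, "cpu": cpu_cost})
    return records


# Hypothetical task queue content: (job name, owner, CPU cost).
pending = [("reco-1", "user_a", 40), ("ana-7", "user_b", 30), ("sim-3", "user_a", 50)]


def fetch(remaining_cpu):
    """Hand out the first pending workload that fits in the remaining CPU time."""
    for i, workload in enumerate(pending):
        if workload[2] <= remaining_cpu:
            return pending.pop(i)
    return None


records = run_pilot(fetch, cpu_budget=100, min_cpu_needed=20)
for record in records:
    print(record)
```

With these numbers the pilot runs two workloads and stops when the only remaining one no longer fits in its CPU allowance.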
LHCb document
• “Workload Management with Pilot Agents in DIRAC”
  – http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
1. Overview
2. Workflow in the DIRAC WMS
3. User authentication and authorization
4. Brief DISET overview [DISET = DIRAC SEcure Transport]
5. Job submission to the DIRAC WMS
6. Proxy handling in the DIRAC WMS
7. Pilot Job
8. Running user job
9. Job monitoring, user interaction with a job
10. User job accounting
11. Pilot Job Questionnaire
12. References

DIRAC WMS remarks
• DISET provides secure communication channels
• DIRAC WMS restricts users and can block them
• It can block sites
  – A mask is applied to pilot job submission and to workload matching
• It is trusted by myproxy.cern.ch
• It is hosted by the CERN IT department according to IT regulations
  – Access restrictions, security updates, incident response procedures
• A pilot job cannot renew its own proxy
  – It can only download limited proxies for user workloads

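A short sketch may help show why the site mask is applied at two points. The names `banned_sites`, `sites_for_pilot_submission` and `may_match_workload` are hypothetical and do not come from DIRAC; the only thing taken from the slide is that the same mask is checked at pilot submission and at workload matching.

```python
# Hypothetical mask content; in a real system this would be managed centrally.
banned_sites = {"SITE_B"}


def sites_for_pilot_submission(candidate_sites):
    """First check: do not send new pilots to a masked site."""
    return [site for site in candidate_sites if site not in banned_sites]


def may_match_workload(pilot_site):
    """Second check: a pilot already running at a masked site gets no work."""
    return pilot_site not in banned_sites


print(sites_for_pilot_submission(["SITE_A", "SITE_B", "SITE_C"]))
print(may_match_workload("SITE_B"))
```

The second check covers pilots that were submitted before the site was added to the mask and are still queued or running there.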
CMS glideinWMS schematics

CMS glideinWMS workflow
• System relies on standard Condor components and communication
• User jobs are submitted to a set of condor_schedd on submit machines
  – Waiting jobs are advertised in a Central Manager
• Glidein factories submit pilot jobs a.k.a. glideins
• VO frontends regulate the number of glideins to be submitted
  – Based on the number of jobs waiting in the condor_schedd queues
• A condor_collector is used as a dashboard for message exchange
• Generic Connection Brokers can provide connectivity across firewalls and in NAT environments
• A glidein contacts the Central Manager for a job matching the WN
• The user job is spawned through glexec and cleaned up afterwards
  – The glidein may then wait a while for another job or exit

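The frontend's regulation step can be illustrated with a small function. The formula below is an assumption chosen for illustration, not the actual glideinWMS policy; the function name and all parameters are hypothetical. It only captures the idea that the request follows the number of waiting jobs and is capped.

```python
def glideins_to_request(idle_jobs, running_glideins, pending_glideins, max_glideins):
    """Return how many new glideins to ask the factory for (never negative)."""
    # Waiting jobs not yet covered by glideins that are already queued.
    uncovered_jobs = idle_jobs - pending_glideins
    # Room left under the overall cap on glideins.
    headroom = max_glideins - running_glideins - pending_glideins
    return max(0, min(uncovered_jobs, headroom))


print(glideins_to_request(idle_jobs=10, running_glideins=3, pending_glideins=2, max_glideins=20))
print(glideins_to_request(idle_jobs=100, running_glideins=15, pending_glideins=3, max_glideins=20))
```

In the first call the demand is the limit; in the second the cap is, so only the remaining headroom is requested.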
Envisioned CMS production system

Envisioned CMS analysis instance

CMS document
• “Workload management with glideinWMS”
  – http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
1. Introduction
2. Structural overview
   1. The Condor pool
   2. Glidein handling
   3. Working in a firewalled world
   4. Credentials handling
3. Deployment scenarios by USCMS
   1. Current prototype production installation
   2. Planned worldwide production installation
   3. Planned analysis installation
4. Conclusions
• Appendix
  – Pilot Job Frameworks questionnaire

CMS glideinWMS remarks
• Condor provides secure communication channels
• Both production and analysis instances can restrict/block users
• Sites can be blocked
  – Using IP tables or Condor functionality
• A submit machine is trusted by an associated MyProxy server
• Users will not log in to the submit machine, but interact with the CRAB server
  – CRAB = CMS Remote Analysis Builder
• Submit machines etc. hosted by T0/T1/T2 sites
  – Need access restrictions, security updates, incident response procedures
• Condor does not yet support limited delegation
  – Accepted as feature request

ATLAS PanDA architecture

PanDA workflow
• PanDA server receives user job definitions with data requirements
• Jobs are inserted into global work queue
• Brokerage module prioritizes jobs and assigns them to sites
• Required input data is dispatched to the chosen site
  – Interaction with ATLAS Distributed Data Management system
  – Jobs are released to dispatcher when input data has been made available
• Delivery of pilot jobs to WNs managed by independent subsystem
• Pilot jobs contact job dispatcher for work
  – If no appropriate work is available, the pilot may pause or exit

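The release-on-data-arrival step can be sketched as a small state machine. This is an illustrative toy, not PanDA code: the `Dispatcher` class, its method names, and the job, site and dataset identifiers are all hypothetical.

```python
from collections import deque


class Dispatcher:
    """Toy dispatcher: a job becomes available only once its input data is at the site."""

    def __init__(self):
        self.waiting = {}   # job id -> (site, set of datasets still missing)
        self.ready = {}     # site -> deque of released job ids

    def assign(self, job_id, site, datasets):
        """Brokerage result: the job is bound to a site and waits for its data."""
        self.waiting[job_id] = (site, set(datasets))
        self._release_if_ready(job_id)

    def data_arrived(self, site, dataset):
        """Called when the data management system has placed a dataset at a site."""
        for job_id, (job_site, missing) in list(self.waiting.items()):
            if job_site == site:
                missing.discard(dataset)
                self._release_if_ready(job_id)

    def _release_if_ready(self, job_id):
        site, missing = self.waiting[job_id]
        if not missing:
            del self.waiting[job_id]
            self.ready.setdefault(site, deque()).append(job_id)

    def get_job(self, site):
        """Called by a pilot; None means the pilot may pause or exit."""
        released = self.ready.get(site)
        return released.popleft() if released else None


dispatcher = Dispatcher()
dispatcher.assign("job-1", "SITE_A", ["dataset-x", "dataset-y"])
before = dispatcher.get_job("SITE_A")       # input data not there yet
dispatcher.data_arrived("SITE_A", "dataset-x")
dispatcher.data_arrived("SITE_A", "dataset-y")
after = dispatcher.get_job("SITE_A")        # job has been released
print(before, after)
```

A pilot that contacts the dispatcher before the data has arrived gets nothing, which corresponds to the "pause or exit" case on the slide.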
ALICE job submission in LCG
[Diagram not reproduced. Labelled elements: User (submits job); ALICE central services with ALICE Job Catalogue, ALICE File Catalogue (lfn, guid, {se's}), Optimizer (jobs split into sub-jobs by input LFNs) and Task Queue (matchmaking on close SEs and software, updates TQ); RB/WMS (submits job agent, sends job agent to site); CE and WN at the site; VO-Box with Computing Agent and Package Manager. Job agent flow on the WN: asks for workload; "Env OK?"; if no, die with grace; if yes, receives workload, retrieves workload, sends job result, registers output.]

ALICE services
[Diagram not reproduced; labels grouped by component]
• User (client)
  – X509 certificate, VOMS registration in ALICE VO
  – User is authenticated to the catalogue/job submission service through GSI authentication
  – Receives a temporary session token
• Central services
  – API Server (SOAP, SSL): certificate authentication (GSI handshake), authorization, temporary session token
  – Catalogue DB (DBI): user files, unix-like ACL
  – Job Manager: job splitting, job priorities
  – Process DB (DBI): user process ID, job states
  – Authorization LDAP: user DN, AliEn username and directory definition, groups/roles (VOMS)