CHEP 2004
www.eu-egee.org
Practical approaches to Grid workload management in the EGEE project
Massimo Sgaravatto, INFN Padova
On behalf of the EGEE JRA1 IT-CZ cluster
EGEE is a project funded by the European Union under contract INFSO-RI-508833

EGEE project
• EGEE project
  § Aim: build a consistent, robust and secure Grid infrastructure
  § Focus first on two pilot application areas (HENP, biomedical applications)
    • But the goal is to take in other researchers in academia and industry
• Middleware activity (JRA1)
  § Re-engineer Grid software to provide production-quality middleware
  § Evolution towards emerging standards, based on Service Oriented Architectures
  § Taking into account application requirements and production/deployment/management needs
• See talk #247 (E. Laure)

Workload management
• Grid workload and resource management is one of the key Grid middleware functionalities
  § How to efficiently schedule a large number of different data-intensive jobs, submitted by a distributed community of users, to a Grid encompassing many heterogeneous resources
• Progress was made in various projects with different integrated software solutions:
  § DataGrid Workload Management System
  § Condor
  § EuroGrid-Unicore resource broker
  § …
• Still a lot to do
  § Scalability, reliability
  § Identification and handling of failures originating from different software layers, and possibly from 'foreign' Grid systems and resources
  § Distributed (hierarchical?) super-scheduling
  § Proper semantics of resource information collection and distribution (push, pull, index, cache, refresh)
  § …

Workload Management System
• Provision of Grid Workload Management System services assigned to the “EGEE JRA1 Italian-Czech cluster”
  § CESNET
  § Datamat S.p.A.
  § INFN
• Architecture of the EGEE WMS designed and being implemented
  § Taking into account feedback and requirements from reference applications and deployment/production/management activities
  § Taking into account previous experience from other Grid projects (in particular the DataGrid WMS)
  § Set of Grid services
    • Workload Manager (WM)
    • Computing Element (CE): resource access
    • Logging & Bookkeeping (L&B)
    • Job Provenance (JP)
    • Grid Accounting service
  § Interoperating among them and with other EGEE Grid services

Workload Manager
• Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL)
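
As a rough illustration of what a submission request might look like, the sketch below renders a job description as ClassAd-style `attribute = value;` text. The slide does not show any JDL, so the attribute names and values here are assumptions chosen for illustration, and `to_jdl` is a hypothetical helper rather than part of any EGEE tool.

```python
def to_jdl(attrs):
    """Render a dict as a ClassAd-style job description (illustrative only)."""
    def fmt(value):
        if isinstance(value, str):
            return '"%s"' % value
        if isinstance(value, (list, tuple)):
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return str(value)

    lines = ["  %s = %s;" % (name, fmt(value)) for name, value in attrs.items()]
    return "[\n" + "\n".join(lines) + "\n]"


# Hypothetical job: attribute names and values are assumptions.
job = {
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "OutputSandbox": ["std.out"],
    "RetryCount": 3,
}
jdl_text = to_jdl(job)
print(jdl_text)
```

A cancellation request would carry a job identifier instead of a description; that part is not sketched here.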

Workload Manager
• Keeps submission requests
• Requests are kept for a while if no matching resources are available
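
The "kept for a while" behaviour can be sketched as a queue with an expiry time. This is a minimal sketch under stated assumptions: the class name, the `max_age_seconds` parameter and the retry interface are hypothetical, not the actual WM design.

```python
import time


class TaskQueue:
    """Holds submission requests that could not be matched yet (hypothetical)."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.pending = []  # list of (enqueue_time, request)

    def hold(self, request, now=None):
        self.pending.append((time.time() if now is None else now, request))

    def retry(self, try_match, now=None):
        """Retry every held request; return (matched, expired) lists."""
        now = time.time() if now is None else now
        matched, expired, still_pending = [], [], []
        for enqueued, request in self.pending:
            if try_match(request):
                matched.append(request)
            elif now - enqueued > self.max_age:
                expired.append(request)  # waited too long, give up
            else:
                still_pending.append((enqueued, request))
        self.pending = still_pending
        return matched, expired


queue = TaskQueue(max_age_seconds=60)
queue.hold("job-a", now=0)
queue.hold("job-b", now=50)
# No resource matches at t=100: job-a has expired, job-b keeps waiting.
matched, expired = queue.retry(lambda request: False, now=100)
```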

Workload Manager
• Repository of resource information available to the matchmaker
• Updated via notifications and/or active polling on sources
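
The two update paths (notifications and active polling) can be sketched as two ways of writing into the same cache. The class, method and attribute names below are assumptions made for illustration only.

```python
class ResourceRepository:
    """Cache of resource information read by the matchmaker (hypothetical)."""

    def __init__(self):
        self.info = {}  # CE identifier -> (timestamp, attributes)

    def notify(self, ce_id, attributes, timestamp):
        """Push path: a source sends a notification about a resource."""
        self._store(ce_id, attributes, timestamp)

    def poll(self, sources, timestamp):
        """Pull path: actively query each source (a callable returning a dict)."""
        for ce_id, query in sources.items():
            self._store(ce_id, query(), timestamp)

    def _store(self, ce_id, attributes, timestamp):
        current = self.info.get(ce_id)
        # Keep only the most recent information for each resource.
        if current is None or timestamp >= current[0]:
            self.info[ce_id] = (timestamp, dict(attributes))

    def snapshot(self):
        return {ce_id: attrs for ce_id, (_, attrs) in self.info.items()}


repo = ResourceRepository()
repo.notify("ce1", {"free_cpus": 4}, timestamp=10)
repo.poll({"ce2": lambda: {"free_cpus": 0}}, timestamp=12)
repo.notify("ce1", {"free_cpus": 9}, timestamp=5)  # older than cached: ignored
print(repo.snapshot())
```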

Workload Manager
• Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources
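
Matchmaking of this kind is commonly sketched as a filter followed by a ranking: job requirements and resource policies discard unsuitable CEs, and the job's preference orders the rest. The code below is such a sketch, not the EGEE matchmaker; the attribute names (`os`, `free_cpus`, `vo_allowed`) are made up for the example, with `vo_allowed` standing in for a utilization policy.

```python
def match(job, resources):
    """Return the identifier of the best CE for a job, or None if none matches."""
    candidates = [
        (ce_id, attrs)
        for ce_id, attrs in resources.items()
        if job["requirements"](attrs)
    ]
    if not candidates:
        return None
    # The highest rank wins; rank expresses the job's preference.
    return max(candidates, key=lambda item: job["rank"](item[1]))[0]


# Hypothetical Grid status, as the matchmaker might see it.
resources = {
    "ce1": {"os": "linux", "free_cpus": 2, "vo_allowed": {"atlas"}},
    "ce2": {"os": "linux", "free_cpus": 8, "vo_allowed": {"atlas", "biomed"}},
    "ce3": {"os": "linux", "free_cpus": 20, "vo_allowed": {"cms"}},
}
job = {
    "requirements": lambda ce: ce["os"] == "linux" and "biomed" in ce["vo_allowed"],
    "rank": lambda ce: ce["free_cpus"],
}
best = match(job, resources)
```

Here `ce3` has the most free CPUs but is excluded by its policy, so the job goes to `ce2`.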

Scheduling policies
• Different possible policies
  § Eager scheduling: a job is bound to a resource as soon as possible
    • Job is then forwarded to that CE, where it will very likely end up in a queue
  § Lazy scheduling: job held by the WM until a resource becomes available
    • Job is then forwarded to that CE for immediate execution
• WM architecture able to accommodate both models (and intermediate solutions)
  § Eager scheduling: matching a job against multiple resources
  § Lazy scheduling: matching a resource against multiple jobs
• Strengths and weaknesses of the different policies in different scenarios need further investigation
  § Evaluation of relevant metrics, covering both resource utilization and user satisfaction
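
The symmetry between the two policies (one job against many resources, one resource against many jobs) can be made concrete with a small sketch. The selection criteria used here, shortest queue for eager and oldest job for lazy, are assumptions for illustration; the slides do not prescribe them.

```python
def eager_schedule(job, resources, suits):
    """Eager: match one job against many resources as soon as it arrives."""
    candidates = [r for r in resources if suits(job, r)]
    # The chosen CE may be busy, in which case the job waits in its queue.
    return min(candidates, key=lambda r: r["queued"], default=None)


def lazy_schedule(resource, held_jobs, suits):
    """Lazy: when a resource becomes free, match it against the held jobs."""
    candidates = [j for j in held_jobs if suits(j, resource)]
    # Oldest suitable job first (an assumed policy).
    return min(candidates, key=lambda j: j["submitted"], default=None)


def suits(job, resource):
    return job["cpus"] <= resource["cpus"]


resources = [
    {"name": "ce1", "cpus": 4, "queued": 10},
    {"name": "ce2", "cpus": 16, "queued": 3},
]
held_jobs = [
    {"name": "j1", "cpus": 8, "submitted": 1},
    {"name": "j2", "cpus": 2, "submitted": 2},
]
eager_choice = eager_schedule(held_jobs[0], resources, suits)  # j1 needs 8 CPUs
lazy_choice = lazy_schedule(resources[0], held_jobs, suits)    # ce1 offers 4 CPUs
```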

Computing Element
• Service representing a computing resource
• Main functionality: job management
  § Run jobs
  § Cancel jobs
  § Suspend and resume jobs
  § Provide info on “quality of service”
    • How many resources match the job requirements?
    • What is the estimated time until the job starts executing?
    • …
  § …
• Used by the WM or by any other client (e.g. end-user)
• CE architecture designed to support both push and pull models
  § Push model: the job is pushed to the CE by the WM
  § Pull model: the CE asks the WM for jobs
• These two models are somewhat mirrored in the resource information flow
  § In order to 'pull' a job, a resource must choose where to 'push' information about itself
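
The push and pull models, and the mirrored information flow, can be sketched with two small classes. All names and the CPU-count check are hypothetical; the point is only that in pull mode the CE has to send a description of itself to the WM it asks for work.

```python
from collections import deque


class WorkloadManager:
    def __init__(self):
        self.held = deque()  # jobs waiting for a resource

    def push_to(self, ce, job):
        """Push model: the WM decides and sends the job to the CE."""
        ce.accept(job)

    def request_job(self, ce_description):
        """Pull model: a CE asks for work; return a suitable held job or None."""
        for job in list(self.held):
            if job["cpus"] <= ce_description["free_cpus"]:
                self.held.remove(job)
                return job
        return None


class ComputingElement:
    def __init__(self, name, free_cpus):
        self.name = name
        self.free_cpus = free_cpus
        self.jobs = []

    def accept(self, job):
        self.jobs.append(job)

    def pull_from(self, wm):
        # To pull a job, the CE first pushes information about itself.
        job = wm.request_job({"name": self.name, "free_cpus": self.free_cpus})
        if job is not None:
            self.accept(job)
        return job


wm = WorkloadManager()
ce = ComputingElement("ce1", free_cpus=4)
wm.push_to(ce, {"id": "j1", "cpus": 1})
wm.held.extend([{"id": "j2", "cpus": 8}, {"id": "j3", "cpus": 2}])
pulled = ce.pull_from(wm)  # j2 is too large for ce1, so j3 is handed over
```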

CE Architecture [diagram]
• CE: Web service accepting job management requests from the client (JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus)
• Mon: asynchronous notifications about job/CE events
• Job requests (for a CE working in pull mode)
• Other diagram elements: Client, LSF, PBS, ?, Worker Nodes

Logging & Bookkeeping
• Collects and manages job-related events (e.g. submission, suitable CE found, start of execution, …) from the WMS components
• Processes these events to give a higher-level view of job states
• Both job states and raw data are available to users
  § Also via a Web Service interface
• Possible to subscribe to receive notifications on particular job state changes
• The L&B event trail can be analyzed to identify problems with resources ("black holes", unusual failure rates, etc.)
• See poster #419 for more details
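
Deriving job states from raw events, and scanning the event trail for suspicious resources, can be sketched as two folds over an event list. The event types, state names and record layout below are assumptions; the real L&B event and state sets are not given on the slide.

```python
# Event type -> job state it leads to (names are illustrative assumptions).
STATE_AFTER = {
    "submitted": "SUBMITTED",
    "ce_found": "READY",
    "running": "RUNNING",
    "done": "DONE",
    "failed": "FAILED",
}


def job_states(events):
    """Fold raw events, ordered by timestamp, into the latest state per job."""
    states = {}
    for event in sorted(events, key=lambda e: e["time"]):
        states[event["job"]] = STATE_AFTER[event["type"]]
    return states


def failure_rates(events):
    """Fraction of finished jobs that failed, per CE (to spot 'black holes')."""
    finished, failed = {}, {}
    for event in events:
        if event["type"] in ("done", "failed"):
            ce = event["ce"]
            finished[ce] = finished.get(ce, 0) + 1
            if event["type"] == "failed":
                failed[ce] = failed.get(ce, 0) + 1
    return {ce: failed.get(ce, 0) / finished[ce] for ce in finished}


events = [
    {"job": "j1", "type": "submitted", "time": 1},
    {"job": "j1", "type": "ce_found", "time": 2, "ce": "ce1"},
    {"job": "j1", "type": "running", "time": 3, "ce": "ce1"},
    {"job": "j1", "type": "done", "time": 9, "ce": "ce1"},
    {"job": "j2", "type": "submitted", "time": 4},
    {"job": "j2", "type": "failed", "time": 5, "ce": "ce2"},
    {"job": "j3", "type": "failed", "time": 6, "ce": "ce2"},
]
states = job_states(events)
rates = failure_rates(events)
```

In this made-up trail every job that finished on `ce2` failed, which is the kind of pattern the slide calls a "black hole".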

Job Provenance
• Keeps track of the definition of submitted jobs, execution conditions and job life cycle for a long time
  § Job life logs (JDL, timestamps, job IDs, …)
  § Executable and input/output files
  § Execution environment (OS, installed software version, …)
  § Custom data provided by the user
• Used for
  § Debugging
  § Post-mortem analysis
  § Comparison of job executions in an evolving environment
• Service components
  § Primary Storage Server
    • Keeps data in the most compact and economical form
  § Index Servers
    • Configured to support a set of queryable attributes
• See poster #419 for more details
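
The split between a primary store holding complete records and index servers that answer queries only on configured attributes can be sketched as follows. Both classes are in-memory stand-ins with hypothetical names and record fields; nothing here reflects the actual JP interfaces.

```python
class PrimaryStorage:
    """Keeps complete job records (hypothetical in-memory stand-in)."""

    def __init__(self):
        self.records = {}

    def store(self, job_id, record):
        self.records[job_id] = record


class IndexServer:
    """Indexes only the attributes it was configured to make queryable."""

    def __init__(self, attributes):
        self.index = {name: {} for name in attributes}

    def feed(self, job_id, record):
        for name, by_value in self.index.items():
            if name in record:
                by_value.setdefault(record[name], set()).add(job_id)

    def query(self, name, value):
        if name not in self.index:
            raise KeyError("attribute %r is not queryable on this index" % name)
        return sorted(self.index[name].get(value, set()))


storage = PrimaryStorage()
index = IndexServer(["owner", "os"])
for job_id, record in [
    ("j1", {"owner": "alice", "os": "SL3", "exit_code": 0}),
    ("j2", {"owner": "bob", "os": "SL3", "exit_code": 1}),
]:
    storage.store(job_id, record)
    index.feed(job_id, record)

same_os = index.query("os", "SL3")
```

A query on `exit_code` raises `KeyError` here, because this index was not configured for that attribute; the full record is still available from the primary store.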

Grid Accounting
• Accumulates information about the usage of Grid resources by users / groups (e.g. VOs)
• To be used
  § To track resource usage
  § To discover abuses (and help avoid them)
• Also possible to charge users for the resources they have used
• Allows implementation of submission policies based on resource usage
  § Exchange market among Grid users and Grid resource owners, which should result in market equilibrium
    • Load balancing on the Grid

Accounting architecture [diagram]
• Resource metering: getting info about resource usage
• Reports about resource usage per user / VO / resource
• Resource pricing
• Cost computation
• Diagram elements: Accounting, Computing Element, Storage Element, Resource owner
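
The four steps (metering, reporting, pricing, cost computation) can be tied together in a few lines. This is a sketch under stated assumptions: the record fields, the CPU-hour unit and the price values are all made up, and only Computing Element usage is modelled.

```python
from collections import defaultdict


def usage_report(records, key):
    """Sum metered CPU hours grouped by 'user', 'vo' or 'resource'."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cpu_hours"]
    return dict(totals)


def cost(records, prices):
    """Cost per user: metered usage times the price set by the resource owner."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += record["cpu_hours"] * prices[record["resource"]]
    return dict(totals)


# Metering records as a CE might report them (hypothetical values).
records = [
    {"user": "alice", "vo": "atlas", "resource": "ce1", "cpu_hours": 10.0},
    {"user": "alice", "vo": "atlas", "resource": "ce2", "cpu_hours": 2.0},
    {"user": "bob", "vo": "biomed", "resource": "ce1", "cpu_hours": 5.0},
]
prices = {"ce1": 1.0, "ce2": 3.0}  # credits per CPU hour, made-up values

per_vo = usage_report(records, "vo")
per_user_cost = cost(records, prices)
```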

Status
• Workload Manager, Logging & Bookkeeping and Grid Accounting software inherited from the DataGrid WMS software
  § Being revised and complemented according to the new architecture
    • E.g. Information Supermarket and TaskQueue are new developments
    • Web services interfaces
  § First implementation already deployed in the EGEE gLite prototype testbed
• Computing Element
  § New developments
  § CEMon prototype already implemented
• Job Provenance
  § New component being implemented

Links
• EGEE JRA1 IT-CZ cluster homepage
  § http://egee-jra1-wm.mi.infn.it/egee-jra1-wm
• EGEE JRA1 (middleware activity) homepage
  § http://egee-jra1.web.cern.ch/egee-jra1
• EGEE project homepage
  § http://www.eu-egee.org