Grid Operations Centre LCG SLAs and Site Audits

Grid Operations Centre LCG SLAs and Site Audits
Trevor Daniels, John Gordon
GDB, 8 Mar 2004
Trevor.Daniels@rl.ac.uk

Policy Annexes
• The Security and Availability Policy was approved by GDB at its October 2003 meeting. It referenced a number of annexed documents containing the technical detail that underpins the Policy.
• Three of these are presented to GDB today for consideration:
  • Procedure for Site Self-Audit
  • Service Level Agreement Guide
  • Resource Administrators’ Guide
• They are not complete in all detail, but should be adequate to indicate the main features and to provide working documents for this stage of LCG (e.g. to guide the implementation of GOC procedures and systems).
• Any significant change will be brought back to GDB for approval.

Procedure for Site Self-Audit

Procedure for Site Self-Audit
• Purpose:
  • to provide guidance to resource and service administrators for meeting the requirements of the Security and Availability Policy with respect to conducting a site self-audit for assessing compliance with that Policy
• Site Self-Audit
  • a simple checklist of the main requirements for compliance with the Policy
  • will be contained in the site’s section within the GOC web pages
  • accessible for editing by registered personnel at that site
  • access controlled by certificate
  • latest version displayed on the GOC website as the Statement of Compliance section of that site’s SLA

Site Self-Audit Checklist (1)
• Details of Site
  • Name
  • Services offered
• Quality of Services
  • What actions have been taken to deliver professionally managed and reliable LCG services?
• Consequent Risks
  • What actions have been taken to minimise the increased risk of host compromise?
• Site Policy
  • List any reservations or exceptions from compliance with both the LCG Policy and your site’s Policy
• Notifying Site Personnel
  • Who has been notified of the security implications of participating in LCG?

Site Self-Audit Checklist (2)
• Resource Administration
  • List the Services offered and the administrators responsible for each
• Service Level Agreement
  • Provide a statement that an SLA in the prescribed format has been published for each of the Services offered
• Physical Security
  • Provide a statement that the risks resulting from intruders, fire, flood, power failure, equipment failure and environmental hazards are consistent with the service quality specified in the SLA for each Service

Site Self-Audit Checklist (3)
• Network Security
  • Provide a statement that the risks resulting from intrusions and failures of networking hardware and software are consistent with the service quality specified in the SLA for each Service
  • Outline the measures taken to provide firewall protection to LCG Resources
  • Outline the measures taken to ensure all critical security-related patches and updates can be applied promptly to all LCG Resources
  • List the main steps in your incident response procedures
• Access Control
  • What procedures have been implemented to ensure certificates and revocation lists associated with LCG Services and Users are renewed before they expire?
• Compliance with Legislation
  • List any areas in the LCG Policy which are in conflict with local legislation

Service Level Agreement

SLA Guide
• Purpose:
  • to provide guidance to resource and service administrators for meeting the requirements of the Security and Availability Policy with respect to formulating and observing Service Level Agreements covering Grid Services
• Grid Service
  • defined as one which provides a production service to a remote and general community of LCG users over the Internet (as opposed to a private or local service)
  • Currently these are defined to be:
    • CE Computing Element
    • SI Security Infrastructure
    • IS Information Services
    • LB Logging and Book-keeping
    • RB Resource Broker
    • RC Replica Catalogue
    • SE Storage Element
    • UI User Interface

SLA Guide
• Service Level Parameters
  • Scheduled Service Downtime: a simple list of the dates and times when a Service is intended to be withdrawn from use. To be used by GOC in calculating availability and reliability figures.
  • Service Availability: the ratio of the time the Service has been available to the time it was scheduled to be available.
  • Service Reliability (MTTF): mean time to failure in days. The algorithm defining this is given in the document.
  • Service-Specific Performance-Related Parameters: the particular performance metrics to be monitored will evolve with operational experience. Some examples are shown in the document, e.g. the number of WN available to each CE.
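
As a rough illustration of how the first three parameters relate, the sketch below computes availability and an MTTF figure from a list of scheduled downtimes and a list of unscheduled outages. The function name `service_levels`, the input layout, and the MTTF rule (available time divided by number of failures) are assumptions made for this example; the algorithm actually defined in the SLA Guide is not reproduced on the slide and may differ. Intervals are assumed not to overlap.

```python
from datetime import datetime, timedelta


def _total(intervals):
    """Sum the lengths of (start, end) datetime pairs."""
    return sum((end - start for start, end in intervals), timedelta())


def service_levels(period_start, period_end, scheduled_downtime, outages):
    """Return (availability, mttf_days) for one reporting period.

    Hypothetical helper: assumes intervals lie inside the period and do not
    overlap, and uses a textbook MTTF (available time / number of failures).
    """
    # Time the Service was scheduled to be available.
    scheduled = (period_end - period_start) - _total(scheduled_downtime)
    # Time it was actually available.
    available = scheduled - _total(outages)
    availability = available / scheduled
    if outages:
        mttf_days = (available / timedelta(days=1)) / len(outages)
    else:
        mttf_days = None  # no failures observed in this period
    return availability, mttf_days


availability, mttf_days = service_levels(
    datetime(2004, 3, 1), datetime(2004, 3, 31),
    scheduled_downtime=[(datetime(2004, 3, 10), datetime(2004, 3, 11))],
    outages=[
        (datetime(2004, 3, 5, 0), datetime(2004, 3, 5, 12)),
        (datetime(2004, 3, 20, 0), datetime(2004, 3, 20, 12)),
    ],
)
print(f"availability={availability:.3f}, MTTF={mttf_days} days")
```

In this example the Service was scheduled for 29 of 30 days and was down unscheduled for one further day, so availability is 28/29 and the sketch reports an MTTF of 14 days.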

SLA Guide
• Publishing the SLAs
  • Each site will publish its SLA in a prescribed format on the GOC website
  • Each SLA has four parts:
    • Statement of Compliance (with the Security and Availability Policy): comprises the most recent Site Self-Audit together with commitments to observe the Policy, to apply updates promptly and to make provision to respond to security incidents as required
    • Schedule of Downtime: as described above
    • Service Level Parameters: as described above
    • Exceptions to the Security and Availability Policy: provision to note any country-specific exceptions or extensions to the Policy dictated by local legislation
• Monitoring SLAs
  • GOC will monitor all defined Service Level Parameters and publish its findings on the GOC website
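
The four-part structure can be pictured as a simple record. The sketch below is illustrative only: the class name `SiteSLA`, its field names and the completeness rule in `missing_parts` are assumptions for this example, not the prescribed GOC format, which the slide does not show.

```python
from dataclasses import dataclass, field


@dataclass
class SiteSLA:
    """One site's SLA, mirroring the four parts listed on the slide."""
    site: str
    statement_of_compliance: str = ""  # latest Site Self-Audit plus commitments
    schedule_of_downtime: list = field(default_factory=list)
    service_level_parameters: dict = field(default_factory=dict)
    policy_exceptions: list = field(default_factory=list)

    def missing_parts(self):
        """Names of parts still to be filled in (assumed rule: downtime and
        exceptions may legitimately be empty, the other two may not)."""
        missing = []
        if not self.statement_of_compliance.strip():
            missing.append("Statement of Compliance")
        if not self.service_level_parameters:
            missing.append("Service Level Parameters")
        return missing


sla = SiteSLA(site="ExampleSite")  # hypothetical site name
print(sla.missing_parts())
```

Keeping downtime and exceptions as optional lists reflects that a site may have nothing to declare under either heading.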

Resource Administrator’s Guide

Resource Admin Guide
• Purpose
  • Provides guidance to Resource and Service Administrators on operating LCG resources in compliance with the Security and Availability Policy
• Topics Covered
  • Operational Cover
    • Reporting arrangements for monitoring and alerting
    • Reporting staffing levels
    • Reporting hours and schedule of cover provided
  • Risks
    • Guidance on estimating Availability and Reliability based on a risk assessment, in order to minimise the risks and their effects
  • Physical Security
    • Applies to Certificate Authorities only
  • Change Control
    • LCG software is controlled by CERN deployment team
    • Upgrades must be applied within 3 working days, critical ones within 24 hours
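
The Change Control deadlines can be worked out mechanically. The sketch below is a minimal illustration under stated assumptions: working days are taken to be Monday to Friday with no public holidays, the clock starts at the release time, and "24 hours" means elapsed hours. The function name `upgrade_deadline` is hypothetical; the guide itself may define these terms differently.

```python
from datetime import datetime, timedelta


def upgrade_deadline(released, critical=False):
    """Latest time by which an upgrade released at `released` should be applied.

    Assumptions (not from the guide): Mon-Fri working week, no holidays.
    """
    if critical:
        return released + timedelta(hours=24)
    deadline = released
    remaining = 3  # working days allowed for ordinary upgrades
    while remaining > 0:
        deadline += timedelta(days=1)
        if deadline.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return deadline


# An upgrade released on Thursday 4 Mar 2004 at 10:00:
print(upgrade_deadline(datetime(2004, 3, 4, 10)))                 # ordinary
print(upgrade_deadline(datetime(2004, 3, 4, 10), critical=True))  # critical
```

Under these assumptions the ordinary upgrade skips the weekend and falls due on Tuesday 9 March, while the critical one falls due on Friday 5 March.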

Resource Admin Guide
• Accounting
  • Each site must preserve accounting information relevant to LCG until it has been transmitted to the GOC
  • The GOC will collate all accounting data and prepare regular and interactive reports on LCG use
• Fault Investigation
  • Explains the roles of Experiment support, GGUS, GOC and sites
• Fault Tracking
  • Explains the use of Savannah at CERN and Remedy at GGUS
• Fault Rectification
  • Installation and operational faults are the responsibility of Resource Administrators
  • Faults arising from defective software will be passed to the appropriate developers by the CERN Deployment Team
• Fault Escalation
  • By GOC to senior personnel at sites for operational faults
  • Users may request escalation via GGUS