Operations Management Piotr Nyczyk CERN ITGD LCG Operations

Operations Management
Piotr Nyczyk, CERN IT/GD
LCG Operations and Fabric workshop, CERN, 2-4 November 2004
Operations Management – CERN 2-4 November - Piotr.Nyczyk@cern.ch

Outline
● Site testing
  – Test script
  – Web reports (links to other tools)
● Detecting problems
● Problem tracking and solving
  – Savannah project
  – Escalation procedure
● Conclusions
● Future issues

Site testing – test script (1)
● A set of about 20 tests that runs on a WN and checks the essential functionality of a site:
  – general WN configuration: RPM versions, environment, CSH, BrokerInfo...
  – grid tools and services: Replica Manager (local and remote SE involved), lcg-utils, R-GMA
  – new tests are added as soon as new types of problems are detected or reported
● Output: detailed HTML report + plain file in <test-name>=OK/FAILED format
● Test submission tools (tztests script):
  – automatic test job submission (cronjob)
  – automatic output retrieval (cronjob)
  – outputs stored on hard disk, full history is kept
  – results published to R-GMA for future use
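The plain results file lends itself to simple machine processing. A minimal sketch in Python, assuming one <test-name>=OK or <test-name>=FAILED entry per line; the function names and the sample test names are hypothetical and are not taken from the tztests script:

```python
def parse_results(text):
    """Parse lines of the form <test-name>=OK or <test-name>=FAILED.

    Returns a dict mapping test name to True (OK) or False (FAILED).
    Blank lines and lines without a recognised status are skipped.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the last "=" so that the status is always the final field.
        name, _, status = line.rpartition("=")
        status = status.strip().upper()
        if status in ("OK", "FAILED"):
            results[name.strip()] = (status == "OK")
    return results


def failed_tests(results):
    """Return the names of the failed tests, sorted alphabetically."""
    return sorted(name for name, ok in results.items() if not ok)


# Hypothetical sample; the real test names may differ.
sample = "rpm-version=OK\ncsh=OK\nreplica-manager=FAILED\n"
print(failed_tests(parse_results(sample)))
```

A summary like this could feed the per-site overview, while the detailed HTML report remains the place to look for the actual reason of a failure.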

Site testing – test script (2)
● The test job is submitted every day at ~7 am from the UI at CERN (adc0005)
● On-demand submission to selected sites by the CERN Deployment Team
● Web report available for site admins: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
● The script is available in the LCG-2 CVS repository: http://lcgdeploy.cvs.cern.ch/cgi-bin/lcgdeploy.cgi/lcg2/tztests/
● Can be easily downloaded and used by site admins and ROC managers
● Simple configuration files (templates available):
  – ~/.tztestsrc – for the job submission script
  – cgi-bin/testzone-tests.cfg – for the web reporting tool
● To submit a test job:
    tztests start [<CE-filter>]
● To check status, get outputs, and generate the report:
    tztests status
    tztests get_outputs
    tztests gen

Web reports (1)
● Generated automatically (CGI) from tztests script output files
● List of all reports (most recent on top)
● Sorted by day
● “Last report” always shows the most recent results for all sites (bookmark)

Web reports (2)
GOC DB entry for a site

Web reports (3)
GIIS monitoring page

Web reports (4)
Detailed report

Web reports (5)

Detecting problems (1)
Easy case:
● Problem found in the report
● Click to find the reason
● Start the follow-up process

Detecting problems (2)
● Sometimes finding the reason is not so easy. But we want to understand the problem and give the site admin a hint.
● Looking for reasons in various places:
  – GIIS monitoring page – graphs are very useful for detecting dangerous patterns
  – manual LDAP queries help to find problems related to the information system
  – manual Replica Manager tests involving the remote site
  – manual job resubmission
  – configuration files on remote sites (gsiftp downloads)
  – A LOT OF INTERACTION WITH SITE ADMINS

Detecting problems (3)
● Extensive use of knowledge acquired during operations.
● Transfer of knowledge is a critical issue in CIC rotation.
● Wiki page with categorized knowledge: http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
  – currently being collected by the Deployment Team at CERN
● Mailing list to exchange observations and comments between CIC members: project-egee-sa1-followup@cern.ch

Problem tracking - Savannah
● Savannah project to keep track of problem solving: LCG2 Sites, http://savannah.cern.ch/projects/lcg2sites/
● A problem detected at a single site is represented as a single task.
● Categorization:
  – task category – site name
  – item group – problem type
● Several additional (custom) fields: “action taken”, “person contacted”, “response”
● Currently no automated synchronization with test results (automatic task creation) or with the sites DB (categories)
● Usage pattern follows the “Escalation Procedure”, which defines the rules for problem follow-up
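The task layout described above can be pictured as a small record type. This is only an illustrative sketch in Python: the class, the field names and the example values are hypothetical, and nothing here talks to Savannah itself.

```python
from dataclasses import dataclass


@dataclass
class SiteTask:
    """One task: one problem detected at one site."""
    site_name: str              # stored as the task category
    problem_type: str           # stored as the item group
    action_taken: str = ""      # custom field
    person_contacted: str = ""  # custom field
    response: str = ""          # custom field


def tasks_by_site(tasks):
    """Group tasks by site name, keeping the order of first appearance."""
    grouped = {}
    for task in tasks:
        grouped.setdefault(task.site_name, []).append(task)
    return grouped


# Hypothetical example values.
tasks = [
    SiteTask("site-a.example.org", "replica manager failure"),
    SiteTask("site-b.example.org", "information system"),
    SiteTask("site-a.example.org", "wrong RPM version",
             action_taken="mail to site admin"),
]
print({site: len(items) for site, items in tasks_by_site(tasks).items()})
```

Grouping by site mirrors the use of the task category as the site name, so all open problems of one site can be reviewed together.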

Problem tracking - Example

Escalation procedure (1)
● A procedure to fix problems at sites through interaction with site admins
● No remote control of sites – just exchange of emails and phone calls
● Gradual increase in action intensity over time – a way to “push” the site administrator to fix a problem
● Clearly and formally defined chain of actions and deadlines:
  – mail to site admin
  – second mail to site admin
  – phone call to site admin
  – report to GDB
● Deadlines are usually 3 days (1 day for big sites) – but open to negotiation...
● A complete record of interactions with site admins (Savannah): contacted person, response, etc.
● Currently performed by the Deployment Team at CERN, but soon to be distributed to other CICs (CIC rotation)
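The chain of actions and deadlines can be sketched in a few lines of Python. The step names and the 3-day/1-day deadlines come from the slide; the function names, the use of calendar days, and the choice to stay at “report to GDB” once it is reached are assumptions made for illustration only:

```python
from datetime import date, timedelta

# Escalation sequence as listed on the slide.
ESCALATION_STEPS = [
    "mail to site admin",
    "second mail to site admin",
    "phone call to site admin",
    "report to GDB",
]


def next_step(current_step=None):
    """Return the step that follows current_step (None: nothing done yet).

    Assumption: once "report to GDB" is reached, it is simply repeated.
    """
    if current_step is None:
        return ESCALATION_STEPS[0]
    index = ESCALATION_STEPS.index(current_step)
    return ESCALATION_STEPS[min(index + 1, len(ESCALATION_STEPS) - 1)]


def new_deadline(today, big_site=False):
    """Usual deadline: 3 days, or 1 day for big sites (calendar days assumed)."""
    return today + timedelta(days=1 if big_site else 3)


step = next_step("mail to site admin")
print(step, new_deadline(date(2004, 11, 2)))
```

Since deadlines are negotiable in practice, a real tool would need a way to override the computed date per task.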

Escalation procedure (2)
● Phase I – go through all problems detected in the test report.
● Phase II – go through all tasks in Savannah that are not outdated (deadline not yet passed) and check whether the state has changed. On each state change, decide whether the old problem was solved and a new one was detected. If so, extend the deadline, return to the first action and notify the site admin.
● Quarantine means the state is currently OK (no problem reported) but the task is kept open for a few days to check stability (new deadline).

Escalation procedure (3)
● Phase III – go through all tasks in Savannah that are outdated. A task is closed only if the site has already been in quarantine, the quarantine has expired, and the site is still OK. Otherwise, the next action is taken from the following escalation sequence:
  – mail to site admin
  – second mail to site admin
  – phone call to site admin
  – report to GDB
● On update, the deadline is postponed.
● Phase IV – update the task (description + postponed deadline) whenever a response from a site admin is received
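The Phase III decision for an outdated task can be written as a small function. This is a sketch under stated assumptions, not the documented procedure: the function name and return values are hypothetical, and two branches are inferred rather than stated on the slides (a site that is OK but has not yet been in quarantine enters quarantine; a site that fails again after quarantine goes back to escalation).

```python
def phase_three_action(site_ok, was_in_quarantine):
    """Decide what to do with an outdated task (its deadline has passed).

    site_ok: True if the latest test report shows no problem for the site.
    was_in_quarantine: True if the task was already in quarantine, so the
        deadline that just expired was the quarantine deadline.
    """
    if site_ok and was_in_quarantine:
        # Stated on the slide: close only after an expired quarantine
        # with the site still OK.
        return "close"
    if site_ok:
        # Assumption: keep the task open a few more days to check stability.
        return "quarantine"
    # Site still (or again) failing: take the next escalation step
    # and postpone the deadline.
    return "escalate"


for ok, quarantined in [(True, True), (True, False), (False, False)]:
    print(ok, quarantined, phase_three_action(ok, quarantined))
```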

Conclusions
● Currently over 80 sites and over 30 open tasks in Savannah, handled every day by only 2 people
● Most of the steps are currently completely manual
● Information is spread over different places and is sometimes inconsistent or missing
● No administrative power over sites and no remote control of them to enforce problem solving
BUT
● The current “escalation procedure” shows a significant improvement in site status
● We are about to distribute the procedure to other CICs
● Integration with monitoring/reporting tools is in progress – ideas on how to simplify/automate the procedure

Future issues
● Migrate to other CICs (CIC rotation):
  – full documentation of the “escalation procedure” to be written
  – knowledge database (Wiki) to be completed
● Better integration with monitoring tools:
  – R-GMA based monitoring framework with links to the problem tracking tool (work in progress, but test results are already there)
  – semi-automatic task creation and “escalation steps”
● Delegate work to ROCs:
  – CIC for urgent and quick actions
  – ROCs will take care of long-term site “treatment”

References
● Tests web reports page: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
● Recertification test script: http://lcgdeploy.cvs.cern.ch/cgi-bin/lcgdeploy.cgi/lcg2/tztests/
● Wiki-based knowledge database: http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
● LCG2 Sites Savannah project: http://savannah.cern.ch/projects/lcg2sites/
● Mailing list for problem follow-up and “escalation procedure” discussions: project-egee-sa1-followup@cern.ch