EGI-InSPIRE AMOD report 6 Feb – 12 Feb 2012
Fernando H. Barreiro Megino, CERN IT-ES-VOS
2/14/12 EGI-InSPIRE RI-261323 www.egi.eu
Overview: Analysis
Overview: Production
Claire Gwenlan: “[…] we are now on the tail end of MC11c […] the load is not going to be like what you've seen for the past few weeks/months […] Until… MC12… coming soon…”
Overview: DDM
ATLAS membership of ddmadmin certificate expired on 11 Feb 2012 and transfer jobs were rejected or failed
CERN and ADC
• Sun 5th: CERN-PROD_DATADISK (GGUS:78923)
  • lcg-cr failures
  • Caused by latest EMI release on "preprod" WNs (10%)
  • Rolled back to LCG WN on Wed morning
• Mon 6th: Schedconfig failed to update
  • Set IT and TW clouds offline in Panda over the morning
  • Recovery from dump - only expert procedures available
  • Dedicated postmortem
• Tue 7th: ADCR & ATLR intervention
  • Oracle security updates
  • Almost transparent. Unavailability of Panda & DDM for a few minutes at 9:00
CERN and ADC: PandaMon issues
• 2 out of 6 servers out of production for a week to prevent session count overload errors
• Wed 8th-Thu 9th: curl control commands failing intermittently
• Machines using large amounts of swap space:
  • Alarm about voatlas180 using 50 GB during Thu night
  • voatlas140 & 141 out of production
[Figures: utilization of swap space, 9th Feb and 10th Feb]
ddmadmin certificate renewal (1)
• ddmadmin is the robot certificate used to authenticate DDM and other ADCops agents
• Yearly ddmadmin proxy expired 9th Feb
• On 23rd Jan (>2 weeks before) a campaign was started to renew the proxy on all DDM and ADCops machines
• Some machines were forgotten
  • ddmusr01@voatlas125: Victor
  • ddmusr03@voatlas161: Functional Test subscription
  • ddmusr01@voatlas244: ADC monitoring collector
  • Maybe more
Need to compile a clear list of places where the ddmadmin proxy is installed
ddmadmin certificate renewal (2)
• The ATLAS membership of ddmadmin expired on Sat 11th Feb… and caught everybody by surprise
• All FTS job submissions were rejected
• A few hours after the problem was reported, the membership was renewed
• Proxies are cached via proxy delegation, and it took several hours until the change was propagated to all services (FTS, SEs, …)
  • glite-delegation-destroy & init did not seem to have any effect
  • e.g. Hiro deleted all proxies from /tmp on all FTS agent hosts to speed up the recovery in the US cloud
• RAL had to roll out the grid-mapfiles manually after the incident (GGUS:79137)
ddmadmin certificate renewal (3)
Need recovery procedures, a tested backup proxy and notifications about the proxy sent out to the AMOD mailing list
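A minimal sketch of how the proxy inventory and expiry notifications asked for above could fit together. Everything here is an assumption for illustration: the `INVENTORY` structure, the function names `expiring` and `format_notice`, and the expiry timestamps (taken from the dates on the slides, with no time of day). The three machine entries are only the ones named as forgotten; a real list would be longer. Sending the notice to the AMOD mailing list is left out.

```python
from datetime import datetime, timedelta

# Hypothetical inventory of places where a ddmadmin credential matters:
# (location, what depends on it, expiry). Dates are illustrative only.
INVENTORY = [
    ("ddmusr01@voatlas125", "Victor", datetime(2012, 2, 9)),
    ("ddmusr03@voatlas161", "Functional Test subscription", datetime(2012, 2, 9)),
    ("ddmusr01@voatlas244", "ADC monitoring collector", datetime(2012, 2, 9)),
    ("VOMS", "ATLAS membership of ddmadmin", datetime(2012, 2, 11)),
]


def expiring(inventory, now, warn_days=14):
    """Return (location, service, days_left) for every entry that expires
    within warn_days of now (or has already expired), soonest first."""
    horizon = now + timedelta(days=warn_days)
    hits = [
        (location, service, (expiry - now).days)
        for location, service, expiry in inventory
        if expiry <= horizon
    ]
    return sorted(hits, key=lambda hit: hit[2])


def format_notice(hits):
    """Build the plain-text body of a warning message, one line per entry."""
    lines = []
    for location, service, days_left in hits:
        status = "EXPIRED" if days_left < 0 else f"expires in {days_left} days"
        lines.append(f"{location} ({service}): {status}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Run as if on the day the renewal campaign started.
    print(format_notice(expiring(INVENTORY, datetime(2012, 1, 23), warn_days=30)))
```

Keeping the VO membership in the same list as the per-machine proxies is a deliberate choice in this sketch: the incident involved two separate expiry dates, and only one of them was being tracked.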
Tier 1s
• IN2P3-CC downtime Tue 7th
  • Maintenance and upgrade of the various services and servers. Affecting LFC, dCache, FTS, batch system, worker nodes, etc.
  • Complete cloud offline in Panda and DDM
  • Downtime for CE and SE extended until Wed 8th
• SARA downtime Tue 7th
  • Replacement of 6620 SAN storage hardware and firmware updates
  • Affecting services such as SRM, dCache and UI
• RAL downtime Wed 8th
  • Intervention on core network
  • Affecting all services (LFC, FTS, SE, CE…)
  • UK cloud set offline
• Failing jobs at SARA on Thu 9th (GGUS:79089)
  • Not site issue
  • Panda brokerage did not recognize NIKHEF-ELPROD_PHYS-TOP as NIKHEF location
  • Tadashi fixed immediately
• FZK transfer and staging failures on Sun 12th (GGUS:79145)
  • High load and full disks
• INFN-MILANO-ATLASC SRM problems (GGUS:78998)
  • Recurring problem over many days: “failed to contact on remote SRM [httpg://t2cmcondor.mi.infn.it:8444/srm/managerv2]”
  • /etc/grid-security/vomsdir/atlas/vo.racf.bnl.gov.lsc missing on StoRM servers and therefore rejecting all proxies with VOMS extensions provided by BNL VOMS server
  • Later problem with the fetch-crl cronjob
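The missing .lsc file in the INFN-MILANO-ATLASC case is the kind of problem a simple file check could catch. The sketch below is an illustration only: the function name `missing_lsc` is hypothetical, and the list of VOMS servers a site should trust is an assumption to be supplied by the site (only vo.racf.bnl.gov is named on the slide).

```python
from pathlib import Path


def missing_lsc(vomsdir, vo, voms_hosts):
    """Return the VOMS server hostnames that have no <host>.lsc file
    under <vomsdir>/<vo>/, in the order they were given."""
    vo_dir = Path(vomsdir) / vo
    return [host for host in voms_hosts if not (vo_dir / f"{host}.lsc").is_file()]


if __name__ == "__main__":
    # The expected host list is an assumption; extend it with the VOMS
    # servers the site is actually meant to accept.
    expected = ["vo.racf.bnl.gov"]
    for host in missing_lsc("/etc/grid-security/vomsdir", "atlas", expected):
        print(f"missing: /etc/grid-security/vomsdir/atlas/{host}.lsc")
```

The check only confirms that the file exists; it does not validate its contents.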
• Thanks to ADC experts and ADCoS shifters for their support
• BEWARE: No AMODs in the next weeks