WP2 Infrastructure and Service Management Status Report











- Slides: 11
WP2: Infrastructure and Service Management Status Report
ETICS All-Hands – 23 May 2007
CERN: Marian Zurek, Carlos Aguado Sanchez
INFN: Matteo Selmi
UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo, Todd Miller
www.eu-etics.org
INFSOM-RI-026753

Staffing and Resources
• Changes @ UW-Madison
  – Todd Miller joined us this spring
  – Andy Pavlo heading back to grad school this summer…
  – Becky and Peter still here :)
• Carlos joined WP2 @ CERN
  – Much needed sysadmin help for Marian!
  – He is with OMII-Europe: no more than 20% of his time since the end of January

Deliverables Status
• D2.3 - Status of certification, integration and validation testbed setup (prototype) [PM12]
  – Delivered on time.

Major Tasks Performed
• Many Q4 Issues Addressed
  – Greatly improved WS/WA/Metronome deployment process
    – automation, documentation, portability & site generalization
  – Documentation Expanded & Updated
    – ETICS-generic WS installation & configuration docs
    – Metronome Admin & Developer (API) docs
  – Better Facility Utilization
    – 2/3 now in regular use, 3rd (UW-Madison) on the way
    – Upcoming cross-site job migration capabilities will further address this
  – Communication Improving
    – weekly WP2 calls now routine (formerly semi-weekly, often missed)
    – each call consists of both status and technical debugging time
    – in-person technical visits have been invaluable; more needed
  – Systems Admin Workload @ CERN
    – High load, many issues (firewall policy)
    – Many new WNs
    – Migration to the new HW (redundant power supply, NIC, RAID)

Major Tasks Performed
• Development and deployment of new build & test infrastructure features
  – Automatic cross-site job migration prototype completed
  – Co-scheduled resources / parallel testing feature improvements
  – Many additional small features & bug fixes
• Expansion and continued hardening of the ETICS infrastructure at all three sites (CERN, INFN, and UW-Madison)
  – expansion: the effort to deploy production ETICS services at UW-Madison is nearly complete; the delay has been productive
  – identification of a number of previously unknown WS/WA deployment issues specific to the UW-Madison environment has led to portability & documentation improvements
  – contributed to the generalization of the deployment process beyond the CERN and INFN environments
  – installation and configuration of additional hardware
  – testing and debugging of the infrastructure
• Continued support of the development, testing, and education & outreach efforts of other work packages

Major Tasks Performed
• D2.3 Status of certification, integration and validation testbed setup (prototype) Document Completed
  – Thanks to all of WP2 for content & reviewers for helpful feedback
• And last but not least: Boring system administration … every day
  – OS updates/upgrades, reboots, backups, disk space mgmt., disappearing WNs, crashes, power outages, filesystem failures, etc.
  – As CERN is the facility with the most usage, most of this falls onto Marian
  – The etics.cern.ch service is highly available. No significant downtime was caused by the WP2 infrastructure
  – Installation at OMII-Europe, UK
  – Coordination with IPv6 activities (GARR, IN2P3)
  – Support for different users, cooperating projects

Issues
• Parallel Testing Feature Integration
  – Originally an end-of-Q4 goal; the Metronome functionality was delivered in Q3 in response to gLite demands, but is still not integrated in ETICS today
  – We must agree on an integration plan this week!
• Capacity Planning / Scalability
  – Raised in Q4, still unresolved.
  – Major new users/projects may require new resources.
  – We need to better understand how easy/quick it is to add resources to an existing facility, and how many can be added in the same manner before new scalability issues arise.
  – NMI has been demonstrated to scale to 100s of nodes, and Condor to 1000s… but ETICS + NMI + Condor? It also depends on the specific workload…
• Uneven Facility Utilization
  – Improving: 3/3 sites deployed, 2/3 in regular use, 3rd on the way
  – CERN & INFN facilities set up, already in use, production-ready
  – UW facility almost ready, but not yet in regular use by ETICS
  – Upcoming cross-site job migration capabilities should solve this issue

Workplan
• Q6 Top Priorities
  – Continued support of the existing ETICS infrastructure as a production service
  – Responding to Hardware/OS/Service issues
  – Automation of currently manual tasks
  – Deployment of new systems & services
  – Scalability work
  – Delivery and integration of a prototype cross-facility job migration capability, using the job-routing capabilities of Condor/Metronome, to harness idle remote resources and allow easier access to exotic remote platforms.
  – Completing deployment of production ETICS services at UW-Madison.
  – The continued integration of the NMI Build & Test Software with the ETICS web services.

Workplan
• Q6/7/8 Unprioritized (next steps and/or resources unclear):
  – Hardware Virtualization
    – WoD (WindowsOnDemand) service, VMware and/or Xen
  – Service Monitoring (Service Level Status) Improvements
    – see already http://sls.cern.ch/sls/service.php?id=ETICS
  – Security issues
    – Passwords present in the CVS
  – Public / private resource allocation
    – A project wants to use ETICS, brings in its own private nodes, and wants their full power to remain private
    – Steering the project's jobs to these nodes and preventing other jobs from landing there
    – Already supported by NMI/Condor, needs to be documented/customized for ETICS
  – Steering jobs to/identifying nodes with specific resources
    – Already supported by NMI/Condor, needs to be documented/customized for ETICS
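
The public/private allocation and job-steering items can be illustrated with a small model. This is only a sketch in Python of the intended policy, not Condor or NMI configuration; the `Node` and `Job` classes, their fields, and the node, platform, and project names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    platform: str
    owner_project: Optional[str] = None  # None means a shared (public) node


@dataclass(frozen=True)
class Job:
    project: str
    platform: str


def node_accepts(node: Node, job: Job) -> bool:
    """Node-side policy: the platform must match, and a private node
    only accepts jobs from the project that owns it."""
    if node.platform != job.platform:
        return False
    return node.owner_project is None or node.owner_project == job.project


def eligible_nodes(job: Job, pool: List[Node]) -> List[Node]:
    """Return the nodes that accept the job, the project's own nodes first."""
    matches = [n for n in pool if node_accepts(n, job)]
    # sorted() is stable and False sorts before True, so nodes owned by
    # the job's project are listed ahead of shared nodes.
    return sorted(matches, key=lambda n: n.owner_project != job.project)


# Hypothetical pool: two shared nodes and one private node owned by "projA".
pool = [
    Node("shared-01", "linux-x86_64"),
    Node("private-01", "linux-x86_64", owner_project="projA"),
    Node("shared-02", "windows-x86"),
]

own = eligible_nodes(Job("projA", "linux-x86_64"), pool)
other = eligible_nodes(Job("projB", "linux-x86_64"), pool)
print([n.name for n in own])    # projA may use its private node and shared nodes
print([n.name for n in other])  # projB never lands on projA's private node
```

The platform check stands in for "nodes with specific resources"; in a real deployment both rules would be expressed through whatever mechanism NMI/Condor provides, which is the part the slide says still needs to be documented and customized for ETICS.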

Metrics
• Bugs, jobs, tasks
  – 11 open Metronome bugs/issues
  – 27 closed/addressed Metronome bugs/issues
  – 6 open Condor bugs/issues
  – 4 closed/addressed Condor bugs/issues
  – Details available at: https://savannah.cern.ch/bugs/?group=etics (select category=NMI or category=Condor)
  – 7 open tasks, 5 closed
  – Details available at: https://savannah.cern.ch/task/?group=etics (select category=NMI)

Conclusion
• Discussion/Questions/Etc.