LHC Computing Grid Project – LCG
Ian Bird – LCG Deployment Manager
IT Department, CERN, Geneva, Switzerland
BNL, March 2005
ian.bird@cern.ch
Overview
§ LCG Project Overview
§ Overview of main project areas
§ Deployment and Operations
  - Current LCG-2 status
  - Operations and issues
  - Plans for migration to gLite + …
§ Service Challenges
§ Interoperability
§ Outlook & Summary
LHC Computing Grid Project
Aim of the project: to prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors
§ Applications – development environment, common tools and frameworks
§ Build and operate the LHC computing service
The Grid is just a tool towards achieving this goal
LCG Project Areas & Management
Project Leader: Les Robertson
Resource Manager: Chris Eck
Planning Officer: Jürgen Knobloch
Administration: Fabienne Baud-Lavigne

§ Applications Area – Torre Wenaus (Pere Mato from 1 March 05): development environment, joint projects, data management, distributed analysis
§ CERN Fabric Area – Bernd Panzer: large cluster management, data recording, cluster technology, networking, computing service at CERN
§ Distributed Analysis – ARDA – Massimo Lamanna: prototyping of distributed end-user analysis using grid technology
§ Middleware Area – Frédéric Hemmer: provision of a base set of grid middleware (acquisition, development, integration), testing, maintenance, support
§ Grid Deployment Area – Ian Bird: establishing and managing the Grid Service – middleware certification, security, operations, registration, authorisation, accounting; joint with EGEE
Relation with EGEE
§ Goal: create a European-wide production-quality multi-science grid infrastructure on top of national & regional grid programmes
§ Scale: 70 partners in 27 countries; initial funding (€32M) for 2 years
§ Activities:
  - Grid operations and support (joint LCG/EGEE operations team)
  - Middleware re-engineering (close attention to LHC data analysis requirements)
  - Training, support for applications groups (incl. contribution to the ARDA team)
§ Builds on LCG grid deployment and the experience gained in HEP; LHC experiments are the pilot applications
Applications Area
§ All Applications Area projects have software deployed in production by the experiments
  - POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah
§ 400 TB of POOL data produced in 2004
§ Geant4 successfully used in the ATLAS, CMS and LHCb Data Challenges with excellent reliability
§ GENSER MC generator library in production
§ Pre-release of the Conditions Database (COOL)
§ 3D project (Distributed Deployment of Databases) will help POOL and COOL in terms of scalability
§ Progress on integrating ROOT with other Applications Area components
  - Improved I/O package used by POOL; common dictionary and maths library with SEAL
§ Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager
§ Plan for the next phase of the Applications Area being developed for internal review at the end of March
The ARDA project
§ ARDA is an LCG project
  - its main activity is to enable LHC analysis on the grid
§ ARDA is contributing to EGEE NA4
  - uses the entire CERN NA4-HEP resource
§ Interface with the new EGEE middleware (gLite)
  - By construction, use the new middleware
  - Use the grid software as it matures
  - Verify the components in an analysis environment (users!)
  - Provide early and continuous feedback
§ Support the experiments in the evolution of their analysis systems
§ Forum for activity within LCG/EGEE and with other projects/initiatives
ARDA activity with the experiments
§ The complexity of the field requires great care in the phase of middleware evolution and delivery:
  - Complex (evolving) requirements
  - New use cases to be explored (for HEP: large-scale analysis)
  - Different communities in the loop – LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc…)
  - The complexity of the experiment-specific part is comparable to (often larger than) the "general" one
  - The experiments require seamless access to a set of sites (computing resources), but the real usage – and therefore the benefit for the LHC scientific programme – will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure
§ How to progress?
  - Build end-to-end prototype systems for the experiments, allowing end users to perform analysis tasks
LHC prototype overview
For each LHC experiment: main focus, basic prototype component / experiment analysis application framework, down to the middleware
§ LHCb – GUI to Grid: GANGA / DaVinci
§ ALICE – interactive analysis: PROOF, ROOT / AliROOT
§ ATLAS – high-level services: DIAL / Athena
§ CMS – aligned with the APROM strategy, exploit native gLite functionality: ORCA
LHC experiment prototypes (ARDA)
§ All prototypes have been "demoed" within the corresponding user communities
CERN Fabric
CERN Fabric
§ Fabric automation has seen very good progress
  - The new systems for managing large farms are in production at CERN since January: the Extremely Large Fabric management system (configuration, installation and management of nodes), lemon (LHC Era Monitoring – system & service monitoring) and the LHC Era Automated Fabric tools (hardware / state management)
  - Includes technology developed by the European DataGrid
§ New CASTOR Mass Storage System
  - Being deployed first on the high-throughput cluster for the ongoing ALICE data recording computing challenge
§ Agreement on collaboration with Fermilab on the Linux distribution
  - Scientific Linux, based on Red Hat Enterprise 3
  - Improves uniformity between the HEP sites serving LHC and Run 2 experiments
§ CERN computer centre preparations
  - Power upgrade to 2.5 MW
  - Computer centre refurbishment well under way
  - Acquisition process started
Preparing for 7,000 boxes in 2008
High Throughput Prototype (openlab + LCG prototype)
§ High-performance cluster used for evaluations and for data challenges with experiments
§ Experience with likely ingredients in LCG:
  - 64-bit programming
  - next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
§ Flexible configuration – components moved in and out of the production environment
§ Co-funded by industry and CERN
Hardware (diagram): 4 × GE connections to the backbone; 4 × Enterasys N7 10 GE switches and 2 × Enterasys X-Series; 10 GE WAN connection; 36 disk servers (dual P4, IDE disks, ~1 TB each); 24 disk servers (P4, SATA disks, ~2 TB each); 2 × 50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory, 10 GE per node); 80 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory); 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory); 80 IA32 CPU servers (dual 2.8 GHz P4, 2 GB memory); 2 × 12 tape servers with STK 9940B; 28 TB IBM StorageTank
ALICE Data Recording Challenge
§ Target – one week sustained at 450 MB/sec
§ Used the new version of the CASTOR mass storage system
§ Note smooth degradation and recovery after equipment failure
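For a rough sense of scale (my own back-of-envelope arithmetic, not a figure from the slide), one week sustained at 450 MB/s corresponds to roughly 270 TB written:

```python
# Back-of-envelope volume for the ALICE data recording challenge target.
# Assumes 450 MB/s sustained for exactly 7 days; decimal units (1 TB = 1e12 bytes).
rate_mb_per_s = 450
seconds_per_week = 7 * 24 * 3600           # 604800 s
total_tb = rate_mb_per_s * seconds_per_week / 1e6
print(f"~{total_tb:.0f} TB in one week")   # ~272 TB
```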
Deployment and Operations
LHC Computing Model (simplified!!)
§ Tier-0 – the accelerator centre
  - Filter raw data; reconstruction → event summary data (ESD)
  - Record the master copy of raw data and ESD
§ Tier-1
  - Managed mass storage – permanent storage of raw data, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
  - Data-heavy (ESD-based) analysis; re-processing of raw data
  - National, regional support
  - "online" to the data acquisition process → high availability, long-term commitment
§ Tier-2
  - Well-managed, grid-enabled disk storage
  - End-user analysis – batch and interactive
  - Simulation
(diagram: Tier-1 centres such as RAL, IN2P3, FNAL, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, Nordic, Taipei, surrounded by Tier-2 and smaller centres – IC, IFCA, UB, Budapest, Prague, LIP, ICEPP, Legnaro, Cambridge, CSCS, IFIC, Rome, CIEMAT, MSU, Krakow, USC, … – down to desktops and portables)
Computing Resources: March 2005
§ In LCG-2:
  - 121 sites, 32 countries
  - >12,000 CPUs
  - ~5 PB storage
§ Includes non-EGEE sites: 9 countries, 18 sites
(map: countries providing resources and countries anticipating joining)
Infrastructure metrics
Countries, sites, and CPU available in the LCG-2 production service (EGEE partner regions and other collaborating sites)

Region               countries  sites  cpu M6 (TA)  cpu M15 (TA)  cpu actual
CERN                     0         1       900         1800           942
UK/Ireland               2        19       100         2200          2398
France                   1         8       400          895           886
Italy                    1        20       553          679          1777
South East               5         7       146          322           133
South West               2        12       250            -           498
Central Europe           5         8       385          730           373
Northern Europe          2         4         -         2000           427
Germany/Switzerland      2        10       100          400          1207
Russia                   1         6        50          152           238
EGEE total              21        95      3084         9428          8879
USA                      1         3         -            -           458
Canada                   1         6         -            -           316
Asia-Pacific             6         8         -            -           394
Hewlett-Packard          1         1         -            -           100
Total other              9        18         -            -          1268
Grand total             30       113         -            -         10147
Service Usage – VOs and users on the production service
§ Active HEP experiments:
  - 4 LHC experiments, D0, CDF, Zeus, BaBar
§ Active other VOs:
  - Biomed, ESR (Earth Sciences), CompChem, Magic (Astronomy), EGEODE (Geo-Physics)
§ 6 disciplines
§ Registered users in these VOs: 500
§ In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
§ Scale of work performed – LHC Data Challenges 2004:
  - >1 M SI2K-years of CPU time (~1000 CPU-years; see the sketch below)
  - 400 TB of data generated, moved and stored
  - 1 VO achieved ~4000 simultaneous jobs (~4 times the CERN grid capacity)
(chart: number of jobs processed per month)
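The ~1000 CPU-years figure follows from the slide's own numbers if one assumes an average worker node of roughly 1 kSI2K (my assumption, not stated on the slide); a minimal sketch:

```python
# Rough conversion of the 2004 Data Challenge CPU usage quoted on the slide.
# Assumption (not from the slide): an average worker node delivers ~1 kSI2K.
total_si2k_years = 1_000_000          # >1 M SI2K-years of CPU time
avg_node_si2k = 1_000                 # assumed ~1 kSI2K per CPU
cpu_years = total_si2k_years / avg_node_si2k
print(f"~{cpu_years:.0f} CPU-years")  # ~1000 CPU-years, matching the slide
```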
Current production software (LCG-2)
§ Evolution through 2003/2004
§ Focus has been on making these components reliable and robust
  - rather than on additional functionality
§ Respond to the needs of users, admins, operators
§ The software stack is the following:
  - Virtual Data Toolkit: Globus (2.4.x), Condor, etc.
  - Higher-level components developed by the EU DataGrid project: workload management (RB, L&B, etc.); Replica Location Service (single central catalogue) and replica management tools; R-GMA as accounting and monitoring framework; VOMS being deployed now
  - Components re-worked by the operations team: information system MDS GRIS/GIIS → LCG-BDII; edg-rm tools replaced and augmented as lcg-utils; developments on disk pool managers (dCache, DPM) – not addressed by JRA1
  - Other tools as required, e.g. GridIce (EU DataTAG project)
§ Maintenance agreements with:
  - VDT team (incl. Globus support)
  - DESY/FNAL – dCache
  - EGEE/LCG teams: WLM, VOMS, R-GMA, Data Management
Software – 2
§ Platform support
  - Was an issue – limited to RedHat 7.3
  - Now ported to: Scientific Linux (RHEL), Fedora, IA64, AIX, SGI
§ Another problem was the heaviness of installation
  - Now much improved and simpler – simple installation tools allow integration with existing fabric management tools
  - Much lighter installation on worker nodes – user level
Overall status
§ The production grid service is quite stable
  - The services are quite reliable
  - Remaining instabilities in the IS are being addressed
  - Sensitivity to site management
  - Problems in underlying services must be addressed (work on stop-gap solutions, e.g. RB maintains state, Globus gridftp → reliable file transfer service)
§ The biggest problem is the stability of sites
  - Configuration problems due to the complexity of the middleware
  - Fabric management at less experienced sites
§ Job efficiency is not high, unless operations/applications select stable sites (the BDII allows an application-specific view)
  - In large tests, selecting stable sites, achieve >>90% efficiency (see the sketch below)
§ Operations workshop last November to address this
  - Fabric management working group – write a fabric management cookbook
  - Tighten operations control of the grid – escalation procedures, removing bad sites
§ Complexity is in the number of sites – not the number of CPUs
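To make the "select stable sites" idea concrete, here is an illustrative sketch (not LCG code): build an application-specific whitelist of sites from recent functional-test results, the way an experiment might restrict the BDII view to sites above an efficiency threshold. The site names and pass/fail history are made up.

```python
# Illustrative only: filter sites by recent test success rate.
recent_tests = {
    "site-a.example.org": [True, True, True, True, False],
    "site-b.example.org": [True, False, False, True, False],
    "site-c.example.org": [True, True, True, True, True],
}

def stable_sites(results, min_success_rate=0.9):
    """Return sites whose recent test success rate meets the threshold."""
    return sorted(
        site for site, runs in results.items()
        if runs and sum(runs) / len(runs) >= min_success_rate
    )

print(stable_sites(recent_tests))   # ['site-c.example.org']
```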
Operations Structure
§ Operations Management Centre (OMC):
  - At CERN – coordination etc.
§ Core Infrastructure Centres (CIC):
  - Manage daily grid operations – oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, Fr, It, CERN, + Russia (M12); Taipei also runs a CIC
§ Regional Operations Centres (ROC):
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region – many distributed
§ User Support Centre (GGUS):
  - At FZK – manages the PTS (problem tracking system) – provides a single point of contact (service desk)
  - Not foreseen as such in the TA, but the need is clear
Grid Operations
§ The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
  - CICs act as a single Operations Centre
  - Operational oversight (grid operator) responsibility rotates weekly between the CICs
  - Problems are reported to the ROC/RC
  - The ROC is responsible for ensuring the problem is resolved
  - The ROC oversees the regional RCs
§ ROCs are responsible for organising the operations in a region
  - Coordinate deployment of middleware, etc.
§ CERN coordinates sites not associated with a ROC
(diagram: OMC, CICs, ROCs and RCs; RC = Resource Centre)
SLAs and 24x7
§ Start with service level definitions
  - What a site supports (apps, software, MPI, compilers, etc.)
  - Levels of support (# admins, hrs/day, on-call, operators…)
  - Response time to problems
§ Define metrics to measure compliance
  - Publish metrics & performance of sites relative to their commitments (see the sketch below)
§ Remote monitoring/management of services
  - Can be considered for small sites
§ Clarify what 24x7 means
  - The service should be available 24x7
  - Does not mean all sites must be available 24x7
  - Specific crucial services that justify the cost
  - Classify services according to the level of support required
§ Operations tools need to become more and more automated
§ Having an operating production infrastructure should not mean having staff on shift everywhere
  - "best-effort" support
§ Middleware/services should cope with bad sites
  - The infrastructure (and applications) must adapt to failures
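To make the "publish metrics relative to commitments" point concrete, here is a minimal sketch (my own illustration, with made-up numbers and a hypothetical 95% target) of computing a monthly availability figure and comparing it to a committed service level:

```python
# Illustrative availability metric (not an official LCG/EGEE definition).
# The downtime figure and the 95% target are made up for the example.
from datetime import timedelta

month = timedelta(days=30)
downtime = timedelta(hours=20)          # measured unscheduled downtime
committed_availability = 0.95           # hypothetical SLA commitment

availability = 1 - downtime / month
print(f"availability = {availability:.3%}")
print("meets commitment" if availability >= committed_availability else "below commitment")
```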
Operational Security
§ Operational Security team in place
  - EGEE security officer, ROC security contacts
§ Concentrate on 3 activities:
  - Incident response
  - Best practice advice for grid admins – creating a dedicated web
  - Security service monitoring evaluation
§ Incident response process
  - JSPG agreement on incident response, in collaboration with OSG: "To guide the development of common capability for handling and response to cyber security incidents on Grids"
  - Basic framework for incident definition and handling
§ Site registration process in draft
  - Part of the basic SLA
§ CA operations
  - EUGridPMA – best practice, minimum standards, etc.
  - More and more CAs appearing
§ Background:
  - Security group and work was started in LCG – it was a cross-grid activity from the start
  - Much was already in place at the start of EGEE: usage policy, registration process and infrastructure, etc.; existing policy being updated
  - We regard it as crucial that this activity remains broader than just EGEE
Policy – Joint Security Group
§ Incident Response
§ Certification Authorities
§ Usage Rules
§ Security & Availability Policy
§ User Registration
§ Audit Requirements
§ Best practice guides
§ Application Development & Network Admin Guide
http://cern.ch/proj-lcg-security/documents.html
User Support
We have found that user support has 2 distinct aspects:
§ User support
  - Call centre / helpdesk
  - Coordinated through GGUS (single point of contact, coordination of user support)
  - ROCs as front line
  - Task force in place to improve the service
§ VO support
  - Was an oversight in the project and is not really provisioned
  - In LCG there is a team (5 FTE): help apps integrate with the m/w; direct 1:1 support; understanding of needs; act as advocate for the app
  - This is really missing for the other apps – adaptation to the grid environment takes expertise
(diagram: Operations Centres (CIC/ROC) handle deployment support, middleware and operations problems; Resource Centres (RC) handle hardware problems; application-specific user support handles VO-specific problems for LHC experiments, non-LHC experiments and other communities, e.g. Biomed)
Certification process
§ The process was decisive in improving the middleware
§ The process is time consuming (5 releases in 2004)
  - Many sequential steps
  - Many different site layouts have to be tested
  - The formats of internal and external releases differ
  - Multiple packaging formats (tool based, generic)
  - All components are treated equally: the same level of testing for non-vital and core components; new tools and tools in use by other projects are tested to the same level
  - The process to include new components is not transparent
§ The timing of releases is difficult
  - Users want them now; sites want them scheduled
  - Upgrades need a long time to cover all sites
  - Some sites had problems becoming functional again after an upgrade
Additional input
§ Data Challenges
  - client libs need fast and frequent updates
  - core services need fast patches (functional/fixes)
  - applications need a transparent release preparation
  - many problems only become visible during full-scale production
  - configuration is a major problem at smaller sites
§ Operations Workshop
  - smaller sites can handle major upgrades only every 3 months
  - sites need to give input to the selection of new packages (resolve conflicts with local policies)
Changes I
§ Simple installation/configuration scripts
  - YAIM (Yet Another Install Method): semi-automatic, simple configuration management; based on scripts (easy to integrate into other frameworks); all configuration for a site is kept in one file (see the sketch below)
§ APT (Advanced Package Tool) based installation of the middleware RPMs
  - simple dependency management
  - updates (automatic / on demand)
  - no OS installation
§ Client libs additionally packaged as a user-space tar-ball
  - can be installed like application software
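A minimal sketch of the "one file per site" idea behind YAIM-style configuration. The shell-style KEY=VALUE format matches the spirit of YAIM, but the key names used here are hypothetical examples, not the official YAIM variable set:

```python
# Illustrative only: parse a single per-site configuration file and use it to
# drive node configuration. Key names below are hypothetical, not YAIM's own.
SITE_INFO = """
SITE_NAME=example-site
CE_HOST=ce.example.org
SE_HOST=se.example.org
SUPPORTED_VOS="atlas cms dteam"
"""

def parse_site_info(text):
    """Parse KEY=VALUE lines into a dict, ignoring blanks and comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key] = value.strip('"')
    return cfg

cfg = parse_site_info(SITE_INFO)
print(cfg["CE_HOST"], cfg["SUPPORTED_VOS"].split())
```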
Changes II
§ Different frequencies for the separate release types
  - client libs (UI, WN)
  - services (CE, SE)
  - core services (RB, BDII, …)
  - major releases (configuration changes, RPMs, new services)
  - updates (bug fixes) – added at any time to specific releases
  - non-critical components will be made available with reduced testing
§ Fixed release dates for major releases (allows planning)
  - every 3 months; sites have to upgrade within 3 weeks (see the sketch below)
§ Minor releases every month
  - based on ranked components available at a specific date in the month
  - not mandatory for smaller RCs to follow (client libs will be installed as application-level software)
§ Early access to pre-releases of new software for the applications
  - client libs will be made available on selected sites
  - services with functional changes are installed on the EIS applications testbed
  - early feedback from the applications
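A small illustration of the release policy described above – quarterly major releases, with sites expected to upgrade within three weeks; the release date used here is made up:

```python
# Hypothetical dates; only the "3 weeks to upgrade, ~3 months between majors"
# policy comes from the slide.
from datetime import date, timedelta

major_release = date(2005, 4, 1)                    # made-up release date
upgrade_deadline = major_release + timedelta(weeks=3)
next_major = major_release + timedelta(days=91)     # roughly every 3 months

print(f"sites must upgrade by {upgrade_deadline}")
print(f"next major release around {next_major}")
```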
Certification process (flow)
1. Bugs/patches/tasks recorded in Savannah; RC, EIS and GDB assign and update the cost
2. Prioritisation & selection by the Head of Deployment → list for the next release (can be empty)
3. Integration & first tests (C&T, developers)
4. Internal releases (C&T, CICs); components ready at cutoff; client release passed to EIS and the applications
5. EIS: user-level install of the client tools; applications exercise the client release
6. Full deployment on test clusters; functional/stress tests, ~1 week (C&T, developers, applications)
7. Output: Service Release, Client Release, Updates Release, Core Service Release
Deployment process
§ Certification is run daily
§ Every month: client releases (deployed in user space) and service releases (optional for sites)
§ Every 3 months, on fixed dates: major releases – mandatory; ROCs deploy and sites re-certify
§ With each release: GIS updates release notes and installation guides, EIS updates user guides, YAIM release(s) accompany the middleware
§ CICs and RCs deploy at their own pace within the allowed window
Operations Procedures
§ Driven by experience during the 2004 Data Challenges
§ Reflecting the outcome of the November Operations Workshop
§ The operations procedures cover:
  - roles of CICs – ROCs – RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
  - daily tasks of the operations shift: monitoring (tools, frequency); problem reporting; the problem tracking system; communication with ROCs & RCs; escalation of unresolved problems; handing over the service to the next CIC
Implementation
§ Evolutionary development
§ Procedures
  - documented (constantly adapted)
  - available at the CIC portal http://cic.in2p3.fr/
  - in use by the shift crews
§ Portal http://cic.in2p3.fr
  - access to tools and process documentation
  - repository for logs and FAQs
  - provides means of efficient communication
  - provides condensed monitoring information
§ Problem tracking system
  - currently based on Savannah at CERN
  - moving to GGUS at FZK
  - exports/imports tickets to the local systems used by the ROCs
§ Weekly phone conferences and quarterly meetings
Grid operator dashboard
§ CIC-on-duty dashboard – "all in one" view
§ https://cic.in2p3.fr/pages/cic/framedashboard.html
Operator procedure (CIC-on-duty workflow)
(1) Monitoring tools (GridIce, GOC) flag a problem
(2) In-depth testing (GIIS monitor)
(3) Diagnosis, with help from the Wiki pages
(4) Report the problem in Savannah
(5) Follow up via the CIC mailing tool – with the ROC (5.1) and the RC (5.2)
(6) Incident closure
Escalation procedure (by severity): 1st mail, 2nd mail, phone via the CIC, then the OMC – ultimately blacklisting of the site by the OMC
Selection of monitoring tools
§ GIIS Monitor and GIIS Monitor graphs
§ Site Functional Tests and history
§ GOC Data Base
§ Scheduled downtimes
§ GridIce – VO view and fabric view
§ Live job monitor
§ Certificate lifetime monitor
Middleware
Architecture & Design
§ A design team including representatives from the middleware providers (AliEn, Condor, EDG, Globus, …), including US partners, produced the middleware architecture and design
§ Takes into account input and experiences from applications, operations, and related projects
§ DJRA1.1 – EGEE Middleware Architecture (June 2004)
  - https://edms.cern.ch/document/476451/
§ DJRA1.2 – EGEE Middleware Design (August 2004)
  - https://edms.cern.ch/document/487871/
§ Much feedback from within the project (operations & applications) and from related projects
  - Being used and actively discussed by OSG, GridLab, etc.
  - Input to various GGF groups
gLite services and responsible clusters
§ Access services: Grid Access Service, API
§ Security services (JRA3): Authentication, Authorization, Auditing
§ Information & Monitoring services (UK): Information & Monitoring, Application Monitoring
§ Data services (CERN): Metadata Catalog, File & Replica Catalog, Storage Element, Data Management
§ Job Management services (IT/CZ): Accounting, Job Provenance, Package Manager, Computing Element, Workload Management
§ Site Proxy
gLite services for Release 1
(same service decomposition as the previous slide, indicating the subset of services targeted for the first gLite release; see the next slide for the software stack and its origin)
gLite services for Release 1 – software stack and origin (simplified)
§ Computing Element
  - Gatekeeper (Globus)
  - Condor-C (Condor)
  - CE Monitor (EGEE)
  - Local batch system (PBS, LSF, Condor)
§ Workload Management
  - WMS (EDG)
  - Logging and Bookkeeping (EDG)
  - Condor-C (Condor)
§ Storage Element
  - File Transfer/Placement (EGEE)
  - glite-I/O (AliEn)
  - GridFTP (Globus)
  - SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
§ Catalog
  - File and Replica Catalog (EGEE)
  - Metadata Catalog (EGEE)
§ Information and Monitoring
  - R-GMA (EDG)
§ Security
  - VOMS (DataTAG, EDG)
  - GSI (Globus)
  - Authentication for C and Java based (web) services (EDG)
Summary of gLite component status
§ WMS
  - Task queue, pull mode, data management interface
  - Available in the prototype; used in the testing testbed; now working on the certification testbed
  - Submission to LCG-2 demonstrated
§ Catalog
  - MySQL and Oracle versions
  - Available in the prototype; used in the testing testbed
  - Delivered to SA1, but not tested yet
§ gLite I/O
  - Available in the prototype; used in the testing testbed
  - Basic functionality and stress tests available
  - Delivered to SA1, but not tested yet
§ FTS
  - Being evolved with LCG (milestone on March 15, 2005; stress tests in the service challenges)
§ UI
  - Available in the prototype; includes data management
  - Not yet formally tested
§ R-GMA
  - Available in the prototype
  - Testing has shown deployment problems
§ VOMS
  - Available in the prototype
  - No tests available
Schedule
§ All of the services are available now on the development testbed
  - User documentation currently being added
§ Most of the services are being deployed on the LCG Pre-production Service
  - Initially at CERN, on a limited-scale testbed; more sites once tested/validated
  - Scheduled in April–May
§ Schedule for deployment at the major sites by the end of May
  - In time to be included in the LCG service challenge that must demonstrate full capability in July, prior to operating as a stable service in 2H 2005
Migration Strategy
§ Certify gLite components on the existing LCG-2 service
§ Deploy components in parallel – replacing with the new service once stability and functionality are demonstrated
§ WN tools and libs must co-exist on the same cluster nodes
§ As far as possible there must be a smooth transition
(timeline: LCG-2 (=EGEE-0) in 2004 → prototyping → product in 2005 → LCG-3 (=EGEE-x?))
Service Challenges
Problem Statement
§ A 'Robust File Transfer Service' is often seen as the 'goal' of the LCG Service Challenges
§ Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect
  - Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, the 24x365 service aspect – including during conferences, vacation, illness, etc.)
  - Need to understand the networking requirements and plan early
§ But transferring 'dummy files' is not enough…
  - Still have to show that the basic infrastructure works reliably and efficiently
§ Need to test the experiments' use cases
  - Check for bottlenecks and limits in s/w, disk and other caches, etc.
  - We can presumably write some test scripts to 'mock up' the experiments' computing models
  - But the real test will be to run your s/w… which requires strong involvement from the production teams
LCG Service Challenges – Overview
§ LHC will enter production (physics) in April 2007
  - Will generate an enormous volume of data
  - Will require a huge amount of processing power
§ The LCG 'solution' is a world-wide Grid
  - Many components understood, deployed, tested…
§ But…
  - Unprecedented scale
  - Humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together
  - Issues include h/w acquisition, personnel hiring and training, vendor rollout schedules, etc.
§ LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now
  - Should not limit the ability of physicists to exploit the performance of the detectors nor the LHC's physics potential
  - Whilst being stable, reliable and easy to use
Key Principles
§ The service challenges result in a series of services that exist in parallel with the baseline production service
§ Rapidly and successively approach the production needs of LHC
§ Initial focus: core (data management) services
§ Swiftly expand out to cover the full spectrum of the production and analysis chain
§ Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods, with recovery from glitches and longer-term outages
§ The necessary resources and commitment are pre-requisites to success – and should not be under-estimated!
Initial Schedule (evolving)
§ Q1/Q2: up to 5 T1s, writing to disk at 100 MB/s per T1 (no experiments)
§ Q3/Q4: include two experiments, tape, and a few selected T2s
§ 2006: progressively add more T2s, more experiments, ramp up to twice the nominal data rate
§ 2006: production usage by all experiments at reduced rates (cosmics); validation of the computing models
§ 2007: delivery and contingency
§ N.B. there is more detail in the Dec / Jan / Feb GDB presentations
Key dates for Service Preparation
§ Jun 05 – Technical Design Report
§ Sep 05 – SC3 Service Phase
§ May 06 – SC4 Service Phase
§ Sep 06 – Initial LHC Service in stable operation
§ Apr 07 – LHC Service commissioned

§ SC2 – reliable data transfer (disk-network-disk) – 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
§ SC3 – reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
§ SC4 – all Tier-1s, major Tier-2s – capable of supporting the full experiment software chain incl. analysis – sustain the nominal final grid data throughput
§ LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput
(timeline 2005–2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run)
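For a rough sense of the volumes involved (my own arithmetic, not a number from the slides): 500 MB/s sustained corresponds to about 43 TB per day, i.e. 100 MB/s (~8.6 TB/day) per Tier-1 in the 5-site SC2 configuration:

```python
# Back-of-envelope volumes implied by the service-challenge targets above
# (decimal units: 1 TB = 1e12 bytes).
aggregate_mb_per_s = 500
tier1_sites = 5

tb_per_day_total = aggregate_mb_per_s * 86400 / 1e6
tb_per_day_per_t1 = tb_per_day_total / tier1_sites

print(f"total:  ~{tb_per_day_total:.0f} TB/day")    # ~43 TB/day
print(f"per T1: ~{tb_per_day_per_t1:.1f} TB/day")   # ~8.6 TB/day
```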
FermiLab – Dec 04 / Jan 05
§ FermiLab demonstrated 500 MB/s for 3 days in November
FTS stability
(throughput plot)
Interoperability
Introduction: grid flavours
§ LCG-2 vs Grid3
  - Both use the same VDT version (Globus 2.4.x); LCG-2 adds components for WLM, IS, R-GMA, etc.
  - Both use ~the same information schema (GLUE): the Grid3 schema is not all GLUE, with some small extensions by each; both use "MDS" (BDII)
§ LCG-2 vs NorduGrid
  - NorduGrid uses a modified version of Globus 2.x; it does not use the gatekeeper – different interface
  - Very different information schema, but does use MDS
§ Catalogues
  - LCG-2: EDG-derived catalogue (for POOL)
  - Grid3 and NorduGrid: Globus RLS
§ Work done:
  - With Grid3/OSG – strong contacts, many points of collaboration, etc.
  - With NorduGrid – discussions have started
  - Canada: gateway into GridCanada and WestGrid (Globus based) in production
Common areas (with Grid3/OSG)
§ Interoperation
  - Align the information systems
  - Run jobs between LCG-2 and Grid3/NorduGrid
§ Storage interfaces – SRM
§ Reliable file transfer
  - Service challenges
§ Infrastructure
§ Security – both are explicitly common activities across all sites
  - Security policy – JSPG
  - Operational security
§ Monitoring
  - Job monitoring
  - Grid monitoring
  - Accounting
§ Grid operations
  - Common operations policies
  - Problem tracking
Interoperation
§ LCG-2 jobs on Grid3
  - A Grid3 site runs the LCG-developed generic information provider – fills its site GIIS with the missing information (GLUE schema); see the sketch below
  - From the LCG-2 BDII one can then see the Grid3 sites
  - Running a job on a Grid3 site needed: Grid3 to install the full set of LCG CAs; users added into VOMS; the WN installation (very lightweight now) installs on the fly
§ Grid3 jobs on LCG-2
  - Added the Grid3 VO to our configuration
  - They point directly to the site (do not use the IS for job submission)
§ Job submission LCG-2 ↔ Grid3 has been demonstrated
§ NorduGrid – can run the generic information provider at a site
  - But work is still required to use the NG clusters
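As an illustration only: the kind of GLUE-style LDIF entry a generic information provider might feed into a site GIIS/BDII. The attribute names follow the GLUE 1.x style, but the DN layout and attribute set here are simplified, and the host and queue values are made up:

```python
# Illustrative sketch, not the actual LCG generic information provider.
def glue_ce_ldif(ce_host, queue, total_cpus, free_cpus):
    """Emit a simplified GLUE-style LDIF entry for one computing element."""
    ce_id = f"{ce_host}:2119/jobmanager-pbs-{queue}"
    lines = [
        f"dn: GlueCEUniqueID={ce_id},mds-vo-name=local,o=grid",
        "objectClass: GlueCE",
        f"GlueCEUniqueID: {ce_id}",
        f"GlueCEInfoTotalCPUs: {total_cpus}",
        f"GlueCEStateFreeCPUs: {free_cpus}",
    ]
    return "\n".join(lines)

print(glue_ce_ldif("ce.grid3-site.example.org", "short", 128, 40))
```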
Storage and file transfer
§ Storage interfaces
  - LCG-2, gLite and the Open Science Grid all agree on SRM as the basic interface to storage
  - SRM collaboration for >2 years, group in GGF
  - SRM interoperability has been demonstrated
  - LHCb use SRM in their stripping phase
§ Reliable file transfer
  - Work ongoing with the Tier-1s (incl. FNAL, BNL, TRIUMF) in the service challenges
  - Agree that the interface is SRM, with srmcopy or gridftp as the transfer protocol
  - Reliable transfer software will run at all sites – already in place as part of the service challenges (see the sketch below)
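To illustrate what "reliable" adds on top of a plain gridftp copy, here is a minimal retry wrapper. It is a sketch of the idea, not the service-challenge transfer software itself; it assumes globus-url-copy is on the PATH, a valid grid proxy exists, and the source/destination URLs are placeholders:

```python
# Minimal sketch of a retrying gridftp transfer (not FTS).
import subprocess
import time

def reliable_copy(src, dst, max_attempts=3, backoff_s=30):
    """Attempt a gridftp copy, retrying with a fixed back-off on failure."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}), retrying...")
        time.sleep(backoff_s)
    return False

ok = reliable_copy("gsiftp://se.example.org/data/file1",
                   "gsiftp://tier1.example.org/data/file1")
print("transfer", "succeeded" if ok else "failed")
```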
Operations
§ Several points where collaboration will happen – started from the LCG and OSG operations workshops
§ Operational security / incident response
§ Common site charter / service definitions possible?
§ Collaboration on operations centres (CIC-on-duty)?
§ Operations monitoring
  - Common schema for problem descriptions/views – allow tools to understand both?
  - Common metrics for performance and reliability
  - Common site and application validation suites (for the LHC apps)
§ Accounting
  - Grid3 and LCG-2 use the GGF schema
  - Agree to publish into a common tool (NG could too)
§ Job monitoring
  - LCG-2 Logging and Bookkeeping – well-defined set of states
  - Agreeing a common set will allow common tools to view job states in any grid (see the sketch below)
  - Need good high-level (web) tools for display – users could then track jobs easily across grids
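As an illustration of the "common set of job states" idea: the L&B state names listed for LCG-2 are the real ones, but the "other grid" states and the mapping itself are hypothetical examples of what a common tool might do:

```python
# Illustrative mapping of grid-specific job states onto a common set.
COMMON_STATES = {"PENDING", "RUNNING", "DONE", "FAILED"}

LCG2_TO_COMMON = {
    "Submitted": "PENDING", "Waiting": "PENDING", "Ready": "PENDING",
    "Scheduled": "PENDING", "Running": "RUNNING",
    "Done": "DONE", "Cleared": "DONE",
    "Aborted": "FAILED", "Cancelled": "FAILED",
}

OTHER_GRID_TO_COMMON = {          # hypothetical scheduler states
    "queued": "PENDING", "executing": "RUNNING",
    "finished": "DONE", "error": "FAILED",
}

def common_state(grid, state):
    """Translate a grid-specific job state into the common vocabulary."""
    table = {"lcg2": LCG2_TO_COMMON, "other": OTHER_GRID_TO_COMMON}[grid]
    return table[state]

print(common_state("lcg2", "Scheduled"), common_state("other", "executing"))
```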
Outlook & Summary
§ LHC startup is very close – services have to be in place 6 months earlier
§ The Service Challenge programme is the ramp-up process
  - All aspects are really challenging!
§ Now that the experiment computing models have been published, we are trying to clarify what services LCG must provide – and what their interfaces need to be
  - Baseline services working group
§ An enormous amount of work has been done by the various grid projects
  - Already at the full complexity and scale foreseen for startup
  - But still significant problems to address – functionality, and stability of operations
§ Time to bring these efforts together to build the solution for LHC

Thank you for your attention!