Status of LCG-2 Deployment
Ian Bird
LHCC Referees Meeting, 22 March 2004
General Status of LCG-2
• LCG-2 has been deployed to the core sites and a few others
  – SEs are based on gridFTP interfaces for now
  – Several problems in the RM/RLS have been addressed to make them compatible with POOL (some arose because of schema changes)
• ALICE and CMS data challenges have begun
  – For ALICE, several issues with data migration to CASTOR have been addressed (ADC, FIO)
• Started weekly GDA coordination meetings
  – Open to everyone – experiments, sites
• Weekly coordination meetings with the experiments for the data challenges
  – With ALICE since January; joining the CMS meetings; starting with ATLAS this week
• Weekly core-site phone conference for operational issues
LCG-2 Core sites (22 March – 11:00)
• Core sites have joined with basic commitments
  – Ramped up since the GDB
• At several sites LCG is fully integrated with large clusters
  – E.g. at CERN, LXBatch is part of LCG-2; the available fraction will increase
  – The UI is installed as a standard user tool on LXplus

  Site      LCG-2 CPU   January commitment
  CERN        324            400
  FZK         144            100
  PIC         160            100
  FNAL          4             10
  CNAF        715            400
  Nikhef      250            124
  Taipei       98             60
  RAL         146            140
Issues during start-up
• Job distribution between sites
  – There were several problems (Nikhef, FZK, CNAF) – now resolved
  – Ongoing issue with CNAF – the advertised resources are not correct; the effect is that the site takes jobs but does not run them
• Nikhef:
  – Local jobs had priority; resources were advertised but external jobs were queued
  – Tweaked the parameters used by the RB ranking to allow more external jobs in
• CNAF:
  – Farm is partitioned by VO (4 partitions + overflow)
  – The advertised number of free CPUs is not meaningful for a given VO/RB (see the ranking sketch after this slide)
  – Probably need a CE per VO at large sites with complex schedulers
    » The site needs to write an appropriate information provider to give reasonable availability estimates
    » The GLUE schema needs to be more flexible
• Resource exhaustion:
  – Home directories filled up and caused crashes (several reasons – used as scratch space, data files written there and not cleaned up)
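A minimal illustration of the CNAF ranking problem, assuming the broker simply prefers the CE advertising the most free CPUs (the real RB ranks on GLUE attributes such as GlueCEStateFreeCPUs and estimated response times; the names and numbers below are invented for illustration only):

    # Toy model of resource-broker ranking based on advertised free CPUs.
    # All site names and figures are illustrative, not real published values.

    def rank_ces(ces):
        """Return CEs ordered from most to least attractive."""
        return sorted(ces, key=lambda ce: ce["free_cpus"], reverse=True)

    # A farm partitioned into per-VO queues may still publish the farm-wide
    # free-CPU count, so a VO whose own partition is full looks wrongly attractive.
    ces = [
        {"name": "ce.site-a.example", "free_cpus": 40},    # genuinely free for this VO
        {"name": "ce.site-b.example", "free_cpus": 300},   # farm total; this VO's partition is full
    ]

    print([ce["name"] for ce in rank_ces(ces)])
    # -> ['ce.site-b.example', 'ce.site-a.example']  (jobs go to the wrong site and queue)

With a CE and information provider per VO, the published free-CPU count would reflect only that VO's partition and the ranking would behave as intended.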
LCG-1 → LCG-2
• Several other sites are ready to join LCG-2
  – Tier 2s in Italy, Spain, and the UK, and some sites that were not in LCG-1 (Canada, HP)
• Will now start to move LCG-1 sites into LCG-2
  – Existing LCG-1 SEs will be upgraded to LCG-2
  – RLS (per VO) upgraded (schema change)
• Data in LCG-1
  – All needed data on disk SEs has been migrated to tape
  – The LCG-1 RLS entries will not be migrated (checked with the experiments – confirmed twice)
• Create 2 parallel information systems (a sketch of this two-view setup follows this slide)
  – 1 contains all sites
  – 1 contains only certified core sites
  – A site migrates to the core as it demonstrates stability; users can choose the stable core or help validate other sites
  – Currently this needs a separate RB for the non-core sites
• Single system to support
  – Security issues – only need to fix one system
  – Simplifies the needs for the certification testbed – an LCG-1 version is no longer needed
• Operational and coordination load is now significant
  – Must bring in the GOC to help this effort now
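A toy sketch of the two parallel information-system views, assuming a simple "certified and stable for N days" criterion for promotion to the core view (the site names, fields, and threshold are invented; the real views are the deployed information indexes, not Python dictionaries):

    # Toy model of the two information-system views: one publishes every
    # LCG-2 site, the other only certified, stable core sites.

    all_sites = {
        "core-a.example":  {"certified": True,  "stable_days": 30},
        "core-b.example":  {"certified": True,  "stable_days": 25},
        "new-t2.example":  {"certified": False, "stable_days": 3},
    }

    def core_view(sites, min_stable_days=14):
        """Sites visible in the 'certified core' information system."""
        return {name for name, s in sites.items()
                if s["certified"] and s["stable_days"] >= min_stable_days}

    def full_view(sites):
        """Sites visible in the 'all sites' information system."""
        return set(sites)

    print(sorted(full_view(all_sites)))   # every site, used to validate newcomers
    print(sorted(core_view(all_sites)))   # stable subset most users work against

In practice a user's jobs go through an RB bound to one view or the other, which is why a separate RB is currently needed for the non-core sites.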
Data Management
• Classic gridFTP SEs set up at all sites for disk
• Load-balancing gridFTP into CASTOR at CERN is working
  – However, remote DNS configurations mean that it might not be used
• CASTOR at CNAF and PIC has been tested
• RAL: the WP5 SE has gridFTP usable via the Replica Manager
• FZK: depends on dCache integration with TSM
• FNAL: accessible via dCache/gridFTP
• dCache (as disk SE):
  – Now have all requested changes – being tested
• Significant work has gone into the RLS and RM tools (a sketch of the copy-and-register pattern behind these tools follows this slide)
  – Performance, additional tools
  – Work is ongoing – many issues still to be addressed and understood
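A minimal sketch of the copy-and-register pattern that the Replica Manager and RLS provide: a logical file name (LFN) is an alias for a GUID, and the GUID maps to one or more site URLs (SURLs) on storage elements. The split into a metadata catalogue and a replica catalogue mirrors the RLS design, but the function names and in-memory dictionaries below are stand-ins, not the real EDG/LCG APIs:

    # Sketch only: catalogues are plain dictionaries, no transfer is performed.
    import uuid

    rmc = {}   # replica metadata catalogue: LFN alias -> GUID
    lrc = {}   # local replica catalogue:    GUID -> list of SURLs

    def copy_and_register(source_url, dest_se, lfn):
        """'Copy' a file to a storage element and register it in the catalogues."""
        surl = f"srm://{dest_se}/data/{lfn}"           # where the copy would live
        guid = rmc.setdefault(lfn, str(uuid.uuid4()))  # new GUID on first registration
        lrc.setdefault(guid, []).append(surl)          # record the new replica
        # a real tool would perform the gridFTP transfer here
        return guid, surl

    def list_replicas(lfn):
        return lrc.get(rmc.get(lfn), [])

    copy_and_register("file:///tmp/run42.root", "se-one.example", "run42.root")
    copy_and_register("file:///tmp/run42.root", "se-two.example", "run42.root")
    print(list_replicas("run42.root"))   # both SURLs for the same logical file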
Mass Storage
• SRM-based SE
  – Was delayed because of interoperability problems (CASTOR, dCache, gridFTP, Replica Manager, RLS, etc.); also missing functionality in dCache needed for its use as a disk manager
• Anticipate the ability to migrate SEs to SRM-based storage at the end of March
  – Invisible to the RM, RLS, and POOL tools (see the sketch after this slide)
  – Gain unified MSS interfaces
  – Gain managed storage on disk-only SEs (with dCache)
• Sites can also use other SRM-compliant solutions
  – DRM from LBNL for disk; HRM for HPSS interfaces
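A sketch of why the move to SRM-based SEs is invisible to the higher-level tools: they talk to one storage-manager interface and do not care whether CASTOR, dCache, or a plain disk pool sits behind it. The class and method names below are simplified stand-ins, not the actual SRM operations:

    # Sketch of a unified storage interface hiding different back ends.
    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        @abstractmethod
        def stage_in(self, surl: str) -> str:
            """Make the file available and return a transfer URL."""

    class CastorBackend(StorageBackend):
        def stage_in(self, surl):
            # would trigger a tape recall in CASTOR before returning a TURL
            return surl.replace("srm://", "gsiftp://")

    class DCacheBackend(StorageBackend):
        def stage_in(self, surl):
            # dCache manages its disk pools itself; just hand back a TURL
            return surl.replace("srm://", "gsiftp://")

    def fetch(backend: StorageBackend, surl: str) -> str:
        # replica tools only ever see this one entry point
        return backend.stage_in(surl)

    print(fetch(CastorBackend(), "srm://castor-se.example/castor/f1"))
    print(fetch(DCacheBackend(), "srm://dcache-se.example/pnfs/f1"))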
Next middleware release
• Planned for the end of March
• Will have the SRM/dCache SE
• VDT 1.1.13
  – Compatibility with US sites; no repackaging
• Many bug fixes
• Target – release this week and put it on the EIS testbed for verification by the experiments
• The decision on when to deploy will be made with the experiments
  – Needs coordination
Release process
• Monthly coordinated releases
  – Which may or may not be deployed at that moment
• Cannot stay in a mode where everything is a showstopper – problems must be prioritised
• Gradually reaching a more stable situation now
• Priorities for future releases – agreed in the GDA meetings based on:
  – Experiment experience, problems, and needs
  – Operational experience, stability, robustness, performance
Portability
• Goals:
  – Clean up system (OS) dependencies and external dependencies
  – Port (at least the WN) code to other OSes
  – Enable a clean install of services over existing OS installations
• LCG now owns and manages the code repository
• Intend to port to RHEL and IA64
  – Build machines are ready – porting has started
• If there is a need for other platforms, then resources (people, testbeds, build machines) are needed
• This is a project where we could use help from experienced people