LHCb status and plans
Ph. Charpentier, CERN
WLCG Workshop, 1-2 September 2007, Victoria, BC

Status of DC06
- Reminder:
  - Two-fold goal: produce and reconstruct useful data; exercise the LHCb Computing Model, DIRAC and ganga
- To be tested:
  - Software distribution
  - Job submission and data upload (simulation: no input data)
  - Data export from CERN (FTS) using MC raw data (DC06-SC4)
  - Job submission with input data (reconstruction and re-reconstruction)
    - For staged and non-staged files
  - Data distribution (DSTs to Tier 1s, T0D1 storage)
  - Batch analysis on the Grid (data analysis and standalone SW)
  - Dataset deletion
- LHCb Grid community solution
  - DIRAC (WMS, DMS, production system)
  - ganga (for analysis jobs)

DC06 phases
- Summer 2006
  - Data production on all sites
    - Background events (~100 Mevts b-inclusive and 300 Mevts minimum bias), all MC raw files uploaded to CERN
- Autumn 2006
  - MC raw file transfers to Tier 1s, registration in the DIRAC processing database
    - As part of SC4, using FTS
    - Ran smoothly (when SEs were up and running, never 7 at once)
  - Fake reconstruction for some files (software not finally tuned)
- December 2006 onwards
  - Simulation, digitisation and reconstruction
    - Signal events (200 Mevts)
    - DSTs uploaded to Tier 1 SEs
      - Originally to all 7 Tier 1s, then to CERN+2

DC06 phases (cont'd)
- February 2007 onwards
  - Background events reconstruction at Tier 1s
    - Uses 20 MC raw files as input
      - Files were no longer in cache, hence had to be recalled from tape
    - Output rDST uploaded locally to the Tier 1
- June 2007 onwards
  - Background events stripping at Tier 1s
    - Uses 2 rDSTs as input
    - Accesses the 40 corresponding MC raw files for full reconstruction of selected events
    - DSTs distributed to Tier 1s
      - Originally 7 Tier 1s, then CERN+2
      - Need to clean up datasets from sites to free space

Software distribution
- Performed by LHCb SAM jobs
  - See Joël Closier's poster at CHEP
- Problems encountered
  - Reliability of shared area: scalability of NFS? (a sketch of the kind of check involved follows this slide)
  - Access permissions (lhcbsgm)
  - Move to pool accounts...
  - Important: beware of access permissions when changing account mappings at sites!
    - Moving to pool accounts was a nightmare

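For illustration only, here is a minimal sketch of the kind of shared-area check such a SAM-style job might run before installing software. The VO_LHCB_SW_DIR variable follows the usual LCG convention for the VO software area; the function name and the OK/WARNING/ERROR report format are assumptions, not the actual LHCb SAM test.

```python
#!/usr/bin/env python
# Illustrative sketch only: a shared-area check a SAM-style job might run.
# VO_LHCB_SW_DIR is the usual LCG convention for the VO software area; the
# names and report format are assumptions, not the real LHCb test.
import os
import sys
import tempfile


def check_shared_area():
    sw_dir = os.environ.get("VO_LHCB_SW_DIR")
    if not sw_dir:
        return "ERROR", "VO_LHCB_SW_DIR is not defined on this worker node"
    if not os.path.isdir(sw_dir):
        return "ERROR", "shared area %s does not exist" % sw_dir
    # The software-manager account (lhcbsgm or a pool account) must be able
    # to write here; plain user accounts only need read access.
    try:
        fd, path = tempfile.mkstemp(dir=sw_dir)
        os.close(fd)
        os.remove(path)
        return "OK", "shared area %s is writable" % sw_dir
    except OSError as exc:
        return "WARNING", "shared area %s is not writable: %s" % (sw_dir, exc)


if __name__ == "__main__":
    status, message = check_shared_area()
    print(status, message)
    sys.exit(0 if status == "OK" else 1)
```
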
Simulation jobs
- Up to 10,000 jobs running simultaneously
- Continuous requests from physics teams
- Problems encountered
  - SE unavailability for output data upload
    - Implemented a fail-over mechanism in the DIRAC DMS (sketched after this slide)
    - Final data transfer request filed in one of the VOBOXes
      - Had to develop a multithreaded transfer agent (too large a backlog of transfers)
    - Had to develop an lcg-cp able to transfer to a SURL
      - Request to support SURL in lcg-cp
      - Took 10 months to be in production (2 weeks to implement)
  - Handling of full disk SEs
    - Handled by VOBOXes
    - Cleaning SEs: painful as there is no SRM tool (mail to the SE admin)

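The fail-over idea can be pictured with a short sketch: try each candidate SE in turn and, if no direct upload succeeds, file a request for the VOBOX transfer agent to retry later. The function names, the SURL construction and the queue format below are assumptions, not the actual DIRAC DMS code.

```python
# Illustrative sketch of the fail-over upload; not the actual DIRAC DMS code.
import subprocess


def upload_to_se(local_file, surl):
    """Hypothetical wrapper around a copy command; a real job would call
    lcg-cp (or similar) with the local file and the destination SURL."""
    rc = subprocess.call(["echo", "lcg-cp", "file:%s" % local_file, surl])
    if rc != 0:
        raise RuntimeError("upload to %s failed" % surl)


def upload_with_failover(local_file, lfn, storage_elements, vobox_queue):
    for se in storage_elements:
        surl = "srm://%s%s" % (se, lfn)  # illustrative SURL construction
        try:
            upload_to_se(local_file, surl)
            return se                    # uploaded successfully, stop here
        except RuntimeError:
            continue                     # SE unavailable, try the next one
    # No SE accepted the file: park a request so the VOBOX agent retries it
    # and, once uploaded, moves the replica to its intended destination.
    vobox_queue.append({"lfn": lfn, "file": local_file,
                        "targets": list(storage_elements)})
    return None
```
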
Reconstruction jobs
- Needs files to be staged
  - Easy for the first prompt processing, painful for reprocessing
  - Developed a DIRAC stager agent (the idea is sketched after this slide)
    - Jobs are put in the central queue only when their files are staged
- File access problems
  - Inconsistencies between SRM tURLs and ROOT access
  - Unreliability of rfio, problems with rootd protocol authentication on the Grid (now fixed by ROOT)
  - Impossible to copy input data locally (not enough disk guaranteed)
  - lcg-gt returning a tURL on dCache but not staging the files
    - Workaround with dccp, then fixed by dCache

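A minimal sketch of the stager-agent logic, as described above: jobs are held back until every input replica reports ONLINE, and prestage requests are issued for files still on tape. The srm_status, srm_prestage and release_job helpers are hypothetical stand-ins for SRM client calls and the DIRAC queue, not the agent itself.

```python
# Illustrative sketch of a stager agent; helper callables are hypothetical.
import time


def stager_cycle(waiting_jobs, release_job, srm_status, srm_prestage):
    for job in list(waiting_jobs):
        missing = [f for f in job["inputs"] if srm_status(f) != "ONLINE"]
        if not missing:
            waiting_jobs.remove(job)
            release_job(job)        # only now put the job in the central queue
        else:
            for f in missing:
                srm_prestage(f)     # ask the SE to recall the file from tape


def run_agent(waiting_jobs, release_job, srm_status, srm_prestage, period=300):
    while waiting_jobs:
        stager_cycle(waiting_jobs, release_job, srm_status, srm_prestage)
        time.sleep(period)          # poll again after a few minutes
```
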
What is still missing?
- gLite WMS
  - Many attempts at using it, not very successful
  - Still not used in production (not released as such...)
- Full VOMS support
  - Many mapping problems when using VOMS
    - Was working, but had to move back to plain proxies due to dCache problems
    - No proper authentication in CASTOR (i.e. no security for files)
- SRM v2.2
  - See plans later, ongoing tests
- Agreement and support for generic pilot jobs
  - Essential for good optimisation at Tier 1s
    - Prioritisation of activities (simulation, reconstruction, analysis)

Plans and outlook
- Re-processing of background
  - Just restarted (software fault found): 6,000 jobs
    - 20 files as input per job
  - Stripping will follow: 3,000 jobs
    - 42 files as input per job
- SRM v2.2 tests
  - Ongoing, many issues found and fixed
    - Very collaborative work with GD
    - Difficult to get space tokens and corresponding pools properly configured
- Analysis
  - Rapidly growing (batch data analysis, ROOT scripts for fits, toy MC)

Plans (cont'd)
- Conditions DB test
  - Deployed and 3D streaming working at all Tier 1s
  - Stress tests starting (Bologna)
  - Usage in production during Autumn
- LFC replication
  - Requested at all Tier 1s
    - Oracle backend, 3D streaming
  - In production for over 6 months at CNAF
- Dress rehearsals
  - Assuming it means producing data at Tier 0, shipping to Tier 1s and processing there...
  - Pit to Tier 0: ongoing
  - Autumn: include Tier 1 distribution and reconstruction
  - LHCb welcomes a concurrent DR in Spring 08

Storage Resources
- Main problem encountered is with Disk1TapeX storage
  - 3 out of 7 sites didn't provide what had been requested
    - Continuously changing distribution plans
    - Need to clean up datasets to get space (painful with SRM v1)
  - Not efficient to add servers one by one
    - When all servers are full, this puts a very large load on the new server
  - Not easy to monitor the storage usage
- Too many instabilities in SEs
  - Full-time job checking availability
  - Enabling/disabling SEs in the DMS (see the sketch after this slide)
  - VOBOX helps but needs guidance to avoid DoS
- Several plans for SE migration
  - RAL, PIC, CNAF, SARA (to NIKHEF): to be clarified

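The enabling/disabling of SEs as destinations can be sketched as a simple filter: skip SEs that fail a probe or whose reported fill level is above a threshold. With SRM v1 there was no standard space query, so the fill level typically had to come from site reports; probe_se and get_used_fraction below are hypothetical helpers, not DIRAC functions.

```python
# Illustrative sketch of keeping the destination SE list healthy.
def usable_destinations(storage_elements, probe_se, get_used_fraction,
                        max_fill=0.95):
    usable = []
    for se in storage_elements:
        if not probe_se(se):                   # unreachable: ban as destination
            continue
        if get_used_fraction(se) >= max_fill:  # effectively full: skip it
            continue
        usable.append(se)
    return usable
```
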
Generic pilots
- LHCb is happy with the proposed agreement from the JSPG (EDMS 855383)
  - Eager to see it endorsed by all Tier 1s
    - Essential as LHCb runs concurrent activities at the Tier 1s
  - DIRAC is prepared for running its payload through a glexec-compatible mechanism (sketched after this slide)
    - Waiting for sites to deploy the one they prefer

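A minimal sketch of what a glexec-compatible payload launch could look like, assuming glexec is deployed on the worker node and honours the standard GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY variables. The glexec path, the fallback behaviour and the payload command are illustrative; this is not the actual DIRAC pilot code.

```python
# Illustrative sketch of a glexec-compatible payload launch; not DIRAC code.
import os
import subprocess


def run_payload(payload_cmd, user_proxy, glexec="/opt/glite/sbin/glexec"):
    env = os.environ.copy()
    env["GLEXEC_CLIENT_CERT"] = user_proxy   # identifies the payload owner
    env["GLEXEC_SOURCE_PROXY"] = user_proxy  # proxy handed to the target account
    if os.path.exists(glexec):
        # Switch identity to the payload owner before executing the job.
        return subprocess.call([glexec] + payload_cmd, env=env)
    # Site has not deployed glexec: run the payload under the pilot account.
    return subprocess.call(payload_cmd, env=env)
```
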
Middleware deployment cycle
- Problem of knowing "what runs where"
  - Reporting problems that were fixed long ago, but either not released or not deployed
- Attempt at getting the client MW from LCG-AA
  - Very promising solution
  - Very collaborative attitude from GD
  - Versions for all available platforms installed as soon as ready
  - Allows testing on LXPLUS and on production WNs
    - Tarball shipped with DIRAC and environment set using CMT (a simplified sketch follows this slide)
    - Not yet in full production mode, but very promising
  - Allows full control of versions
    - Possible to report precisely to developers
    - No way to know which version runs by default on a WN

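For illustration, a sketch of putting a shipped LCG-AA client tarball in front of whatever middleware the worker node provides, so the job controls (and can report) the exact client versions it uses. The directory layout and the platform tag are assumptions; LHCb actually drives this through CMT rather than a hand-rolled script.

```python
# Illustrative sketch of pointing the environment at a shipped client tarball.
import os


def use_shipped_clients(tarball_dir, platform="slc4_ia32_gcc34"):
    root = os.path.join(tarball_dir, platform)

    def prepend(var, path):
        os.environ[var] = path + os.pathsep + os.environ.get(var, "")

    prepend("PATH", os.path.join(root, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(root, "lib"))
    prepend("PYTHONPATH", os.path.join(root, "python"))
    return root
```
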
LHCb and PPS
- Very impractical to test client MW on PPS
  - Completely different setup for DIRAC
  - Hard to verify all use cases (e.g. file access)
- Was used for testing some services
  - e.g. gLite WMS
  - But easier to get an LHCb instance of the service
    - Known to the production BDII
    - Possibility to use it or not depending on reliability
    - Sees all production resources
      - Caveat: should not break e.g. production CEs
        - But expected to be beyond that level of testing...
- PPS uses a lot of resources in GD
  - Worth discussing with experiments how to test MW

Monitoring & availability
- Essential to test sites permanently
  - See J. Closier's poster at CHEP
  - Use the SAM framework
    - Check availability of CEs open to LHCb
    - Install LHCb and LCG-AA software
      - Platform dependent
    - Reports to the SAM database
    - LHCb would like to report the availability as they see it
      - No point claiming a site is available just for the ops VO
  - Faulty sites are "banned" from DIRAC submission
  - Faulty SEs or full disk-SEs can also be "banned" from the DMS, as source and/or destination (a sketch of the masking idea follows this slide)

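The banning idea can be sketched as building masks from the latest LHCb-specific SAM results: sites allowed for submission on one side, SEs allowed as DMS sources or destinations on the other. The result format ('ok', 'error', 'full') and the helper name are hypothetical, not the actual DIRAC bookkeeping.

```python
# Illustrative sketch of turning SAM results into site and SE masks.
def build_masks(sam_results, all_sites, all_ses):
    good_sites = [s for s in all_sites if sam_results.get(s) == "ok"]
    banned_sites = sorted(set(all_sites) - set(good_sites))
    # A full SE can stay usable as a source (replicas remain readable) while
    # being banned as a destination.
    dest_ses = [se for se in all_ses if sam_results.get(se) == "ok"]
    source_ses = [se for se in all_ses if sam_results.get(se) in ("ok", "full")]
    return {"sites": good_sites, "banned_sites": banned_sites,
            "se_destinations": dest_ses, "se_sources": source_ses}
```
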
Conclusions
- LHCb is using the WLCG/EGEE infrastructure successfully
  - Eagerly waiting for the generic pilots general scheme
- Still many issues to iron out (mainly DM)
  - SE reliability, scalability and availability
  - Data access
  - SRM v2.2
  - SE migration at many sites
- Trying to improve certification and usage of middleware
  - LCG-AA deployment, production preview instances
- Plans are mainly to continue the regular activities
  - Move from "challenge mode" to "steady mode"