LHCb Distributed Computing and the Grid
V. Vagnoni (INFN Bologna)
• Bologna: D. Galli, U. Marconi, V. Vagnoni
• Bristol: N. Brook
• Cambridge: K. Harrison
• CERN: E. Van Herwijnen, J. Closier, P. Mato
• Edinburgh: A. Khan
• Marseille: A. Tsaregorodtsev
• Nikhef: H. Bulten, S. Klous
• Oxford: F. Harris, I. McArthur, A. Soroko
• RAL: G. N. Patrick, G. Kuznetsov
18 June 2002 V. Vagnoni BEAUTY 2002

Overview of presentation
• Current organisation of LHCb distributed computing
• The Bologna Beowulf cluster and its performance in a distributed environment
• Current use of Globus and EDG middleware
• Planning for the data challenge and the use of the Grid
• Current LHCb Grid/applications R&D
• Conclusions

History of distributed MC production
• The distributed system has been running for 3+ years and has processed many millions of events for the LHCb design.
• Main production sites:
  – CERN, Bologna, Liverpool, Lyon, NIKHEF and RAL
• Globus is already used for job submission to RAL and Lyon.
• The system has been interfaced to the GRID and demonstrated at the EUDG Review and the NeSC/UK Opening.
• For the 2002 Data Challenges, new institutes are being added:
  – Bristol, Cambridge, Oxford, ScotGrid
• In 2003, add:
  – Barcelona, Moscow, Germany, Switzerland and Poland

LOGICAL FLOW
(flow diagram; stages as labelled on the slide)
• Submit jobs remotely via Web
• Execute on farm
• Data quality check
• Update bookkeeping database
• Transfer data to mass store
• Analysis
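
The production flow on this slide can be pictured as a fixed sequence of stages applied to each job. The sketch below is only an illustration: the stage names paraphrase the slide labels, while the ordering, the `Job` class, and the quality criterion (at least one event produced) are assumptions and are not taken from the LHCb production system.

```python
from dataclasses import dataclass, field
from typing import List

# Stage names paraphrase the slide; their order here is an assumption.
STAGES = ["submit", "execute", "quality_check", "bookkeeping", "transfer"]


@dataclass
class Job:
    """Hypothetical record of one MC production job."""
    job_id: int
    events_produced: int
    completed: List[str] = field(default_factory=list)


def passes_quality_check(job: Job) -> bool:
    # Hypothetical criterion: a job that produced no events fails the check.
    return job.events_produced > 0


def run_flow(job: Job) -> bool:
    """Walk the job through the stages; stop if the quality check fails."""
    for stage in STAGES:
        job.completed.append(stage)
        if stage == "quality_check" and not passes_quality_check(job):
            # Bookkeeping update and transfer are skipped for bad data.
            return False
    return True
```

In this sketch a failed quality check prevents the bookkeeping update and the transfer to mass storage, which is one plausible reason for placing the check before them.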

Monitoring and Control of MC jobs
• LHCb has adopted PVSS II as the prototype control and monitoring system for MC production.
  – PVSS is a commercial SCADA (Supervisory Control And Data Acquisition) product developed by ETM.
  – Adopted as the control framework for the LHC Joint Controls Project (JCOP).
  – Available for Linux and Windows platforms.

Example of LHCb computing facility: the Bologna Beowulf cluster
• Set up at INFN-CNAF
  – ~100 CPUs hosted in dual-processor machines (ranging from 866 MHz to 1.2 GHz PIII), 512 MB RAM
  – 2 Network Attached Storage systems
    • 1 TB in RAID 5, with 14 IDE disks + hot spare
    • 1 TB in RAID 5, with 7 SCSI disks + hot spare
• Linux disk-less processing nodes with the OS centralized on a file server (root file-system mounted over NFS)
• Usage of private network IP addresses and Ethernet VLAN
  – High level of network isolation
  – Access to external services (AFS, mccontrol, bookkeeping db, Java servlets of various kinds, …) provided by means of a NAT mechanism on a gateway (GW) node

Farm Configuration
(diagram; components as labelled on the slide)
• Gateway: Red Hat 7.2 (kernel 2.4.18), mirrored disks (RAID 1); DNS, NAT (IP masquerading); uplink to the public network
• Control Node: disk-less node, CERN Red Hat 6.1, kernel 2.2.18; PBS Master, MC control server, farm monitoring
• Master Server: Red Hat 7.2, mirrored disks (RAID 1); OS file-systems, home directories, various services (PXE remote boot, DHCP, NIS)
• Processing Nodes 1 … n: disk-less nodes, CERN Red Hat 6.1, kernel 2.2.18; PBS Slave
• NAS: 1 TB, RAID 5
• Fast Ethernet switch with private and public VLANs
• Power distributor controlled over an Ethernet link (power control)

(Photograph of the farm; labelled items: Fast Ethernet switch, rack of 1U dual-processor motherboards, 1 TB NAS, Ethernet-controlled power distributor)

Farm performance
• The farm can simulate and reconstruct about (700 LHCb events/day per CPU) × (100 CPUs) = 70,000 LHCb events/day
• Data transfer over the WAN to the CASTOR tape library at CERN is done with bbftp
  – very good throughput (up to 70 Mbit/s out of the currently available 100 Mbit/s)
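
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses the per-CPU rate, CPU count, and link speeds quoted above; the helper names and the 1000 GB example size (one NAS volume, using decimal units) are illustrative assumptions, since the slide gives no event or dataset size.

```python
def farm_events_per_day(events_per_cpu_per_day: int, n_cpus: int) -> int:
    """Aggregate production rate, assuming every CPU runs at the same rate."""
    return events_per_cpu_per_day * n_cpus


def link_utilisation(achieved_mbit_s: float, capacity_mbit_s: float) -> float:
    """Fraction of the available WAN bandwidth actually used."""
    return achieved_mbit_s / capacity_mbit_s


def transfer_hours(size_gb: float, rate_mbit_s: float) -> float:
    """Hours needed to ship size_gb gigabytes (decimal units) at rate_mbit_s."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8000 Mbit in decimal units
    return megabits / rate_mbit_s / 3600


print(farm_events_per_day(700, 100))   # events/day for the whole farm
print(link_utilisation(70, 100))       # share of the 100 Mbit/s link in use
print(round(transfer_hours(1000, 70), 1))  # hours to move 1000 GB at 70 Mbit/s
```

At the peak rate quoted on the slide, moving a full 1 TB volume would take a little under 32 hours, assuming the rate is sustained for the whole transfer.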

Current Use of Grid Middleware in development system
• Authentication
  – grid-proxy-init
• Job submission to DataGrid
  – dg-job-submit
• Monitoring and control
  – dg-job-status
  – dg-job-cancel
  – dg-job-get-output
• Data publication and replication
  – globus-url-copy, GDMP

Example 1: Job Submission

    dg-job-submit /home/evh/sicb/bbincl1600061.jdl -o /home/evh/logsub/

bbincl1600061.jdl:

    #
    Executable = "script_prod";
    Arguments = "1600061,v235r4dst,v233r2";
    StdOutput = "file1600061.output";
    StdError = "file1600061.err";
    InputSandbox = {"/home/evhtbed/scripts/x509up_u149",
                    "/home/evhtbed/sicb/mcsend",
                    "/home/evhtbed/sicb/fsize",
                    "/home/evhtbed/sicb/cdispose.class",
                    "/home/evhtbed/v235r4dst.tar.gz",
                    "/home/evhtbed/sicb/bbincl1600061.sh",
                    "/home/evhtbed/script_prod",
                    "/home/evhtbed/sicb1600061.dat",
                    "/home/evhtbed/sicb1600062.dat",
                    "/home/evhtbed/sicb1600063.dat",
                    "/home/evhtbed/v233r2.tar.gz"};
    OutputSandbox = {"job1600061.txt", "D1600063",
                     "file1600061.output", "file1600061.err",
                     "job1600062.txt", "job1600063.txt"};
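
A job description like the one on this slide is just a list of `Name = value;` attributes, so it is easy to generate from a script. The sketch below is a hedged illustration: `render_jdl` is a hypothetical helper and not part of the EDG tools, it handles only quoted strings and brace-delimited string lists as they appear on the slide, and the attribute values are copied from the example (with a shortened output sandbox).

```python
from typing import Dict, List, Union

Value = Union[str, List[str]]


def render_jdl(attrs: Dict[str, Value]) -> str:
    """Render attributes in the 'Name = value;' style shown on the slide.

    Strings become quoted values; lists become {"a", "b"} sandbox lists.
    No escaping of embedded quotes is attempted in this sketch.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, list):
            rendered = "{" + ", ".join(f'"{item}"' for item in value) + "}"
        else:
            rendered = f'"{value}"'
        lines.append(f"{name} = {rendered};")
    return "\n".join(lines)


# Values taken from the slide; the output sandbox is shortened for brevity.
job = {
    "Executable": "script_prod",
    "Arguments": "1600061,v235r4dst,v233r2",
    "StdOutput": "file1600061.output",
    "StdError": "file1600061.err",
    "OutputSandbox": ["file1600061.output", "file1600061.err"],
}
jdl_text = render_jdl(job)
print(jdl_text)
```

The resulting text could be written to a `.jdl` file and passed to `dg-job-submit` as in the command above.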

Example 2: Data Publishing & Replication
(diagram; elements as labelled on the slide)
• CERN TESTBED: a Job on the Compute Element writes Data to local disk; globus-url-copy moves the Data to the Storage Element (backed by MSS)
• register-local-file and publish record the Data in the Replica Catalogue (NIKHEF - Amsterdam)
• REST-OF-GRID: a Job uses replica-get to bring the Data to its Storage Element

LHCb Data Challenge 1 (July - September 2002)
• Physics Data Challenge (PDC) for detector, physics and trigger evaluations
  – based on the existing MC production system
  – small amount of Grid technology to start with
  – generate ~3×10^7 events (signal + specific background + generic b and c + min bias)
• Computing Data Challenge (CDC) for testing software under development
  – will make more extensive use of Grid middleware
• Components will be incorporated into the PDC once proven in the CDC
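
To get a feel for the scale of the PDC target, the sketch below estimates the CPU capacity it implies. It assumes, purely for illustration, that the Bologna figure of 700 events/day per CPU applies to every site and every event type, and that production runs continuously for the whole July to September window; neither assumption comes from the slides.

```python
import math

TARGET_EVENTS = 3 * 10**7          # PDC target quoted above
EVENTS_PER_CPU_PER_DAY = 700       # Bologna farm figure (assumed universal)
CHALLENGE_DAYS = 31 + 31 + 30      # July, August, September


def days_needed(target: int, n_cpus: int, rate: int) -> float:
    """Days for n_cpus identical CPUs to produce the target sample."""
    return target / (n_cpus * rate)


def cpus_needed(target: int, days: int, rate: int) -> int:
    """Smallest whole number of CPUs that reaches the target in `days`."""
    return math.ceil(target / (days * rate))


print(round(days_needed(TARGET_EVENTS, 100, EVENTS_PER_CPU_PER_DAY)))
print(cpus_needed(TARGET_EVENTS, CHALLENGE_DAYS, EVENTS_PER_CPU_PER_DAY))
```

Under these assumptions a single 100-CPU farm would need over a year, which illustrates why production is spread across several sites.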

GANGA: Gaudi ANd Grid Alliance
Joint Atlas (C. Tull) and LHCb (P. Mato) project, formally supported by GridPP/UK with 2 joint Atlas/LHCb research posts at Cambridge and Oxford
• An application that makes it easier for end-user physicists and production managers to use Grid services for running Gaudi/Athena jobs
• A GUI-based application that should help throughout the complete job lifetime:
  – job preparation and configuration
  – resource booking
  – job submission
  – job monitoring and control
(Diagram labels: GUI; GANGA; Collective & Resource Grid Services; GAUDI Program; JobOptions, Algorithms; Histograms, Monitoring, Results)

Required functionality
• Before the Gaudi/Athena program starts:
  – Security (obtaining certificates and credentials)
  – Job configuration (algorithm configuration, input data selection, ...)
  – Resource booking and policy checking (CPU, storage, network)
  – Installation of required software components
  – Job preparation and submission
• While the Gaudi/Athena program is running:
  – Job monitoring (generic and specific)
  – Job control (suspend, abort, ...)
• After the program has finished:
  – Data management (registration)
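
The before/while/after phases listed above suggest a small job state machine. The sketch below is one possible reading and not GANGA's actual design: the state names and the table of allowed transitions are assumptions chosen to mirror the bullets (submission, suspend, abort, registration of output data).

```python
from enum import Enum


class JobState(Enum):
    """Hypothetical job states mirroring the three phases on the slide."""
    CONFIGURED = "configured"   # before: security, configuration, booking done
    SUBMITTED = "submitted"     # before: job prepared and submitted
    RUNNING = "running"         # while: monitored and controllable
    SUSPENDED = "suspended"     # while: job control (suspend)
    ABORTED = "aborted"         # while: job control (abort)
    FINISHED = "finished"       # after: program has ended
    REGISTERED = "registered"   # after: output data registered


# Assumed transition table; terminal states map to an empty set.
ALLOWED = {
    JobState.CONFIGURED: {JobState.SUBMITTED},
    JobState.SUBMITTED: {JobState.RUNNING, JobState.ABORTED},
    JobState.RUNNING: {JobState.SUSPENDED, JobState.ABORTED, JobState.FINISHED},
    JobState.SUSPENDED: {JobState.RUNNING, JobState.ABORTED},
    JobState.FINISHED: {JobState.REGISTERED},
    JobState.ABORTED: set(),
    JobState.REGISTERED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Keeping the transitions in a single table makes it straightforward for a GUI to enable only the control actions that are valid for a job's current state.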

Conclusions
• LHCb already has distributed MC production using GRID facilities for job submission
• We are embarking on large-scale data challenges commencing July 2002, and we are developing our analysis model
• Grid middleware will be progressively integrated into our production environment as it matures (starting with EDG, and looking forward to GLUE)
• R&D projects are in place
  – for interfacing users (production + analysis) and the Gaudi/Athena software framework to Grid services
  – for putting the production system into an integrated Grid environment with monitoring and control
• All work is being conducted in close participation with the EDG and LCG projects
  – Ongoing evaluations of EDG middleware with physics jobs
  – Participation in LCG working groups, e.g. the report on ‘Common use cases for a HEP Common Application layer’ http://cern.ch/fca/HEPCAL.doc