LHCb on-line / off-line computing
Domenico Galli, INFN Bologna
CSN1, Lecce, 24.9.2003
Off-line computing
- We plan LHCb-Italy off-line computing resources to be as centralized as possible:
  - put as much computing power as possible in the CNAF Tier-1;
  - to minimize system-administration manpower;
  - to optimize resource exploitation.
- "Distributed" for us means distributed among CNAF and the other European Regional Centres.
- Potential drawback: strong dependence on CNAF resource sharing.
- The improvement following the setup of Tier-3s in major INFN sites for parallel ntuple analysis will be evaluated later.
2003 Activities
- In 2003 LHCb-Italy contributed to DC03 (production of MC samples for the TDR): 47 Mevt in 60 days:
  - 32 Mevt minimum bias;
  - 10 Mevt inclusive b;
  - 50 signal samples of 50 to 100 kevt each.
- 18 computing centres involved.
- Italian contribution: 11.5% (target was 15%).
2003 Activities (II)
- The Italian contribution to DC03 was obtained using limited resources (40 kSI2000, i.e. 100 1-GHz PIII CPUs).
- Larger contributions (Karlsruhe, D; Imperial College, UK) came from the huge, dynamically allocated resources of those centres.
- DIRAC, the LHCb distributed MC production system, was used to run 36600 jobs; 85% of them ran outside CERN, with 92% mean efficiency.
2003 Activities (III)
- DC03 has also been used to validate the LHCb distributed analysis model:
  - distribution to the Tier-1 centres of the signal and background MC samples stored at CERN during production;
  - samples pre-reduced according to kinematic or trigger criteria;
  - selection algorithms for specific decay channels (~30) executed;
  - events classified by means of tagging algorithms.
- LHCb-Italy contributed to the implementation of the selection algorithms for B decays into 2 charged pions/kaons.
2003 Activities (IV)
- To analyze high-statistics data samples, the PVFS distributed file system has been used.
- 110 MB/s aggregate I/O using 100Base-T Ethernet connections (to be compared with the 50 MB/s of a typical 1000Base-T NAS).
2003 Activities (V)
- The analysis work by LHCb-Italy has been included in the "Reoptimized Detector Design and Performance" TDR (2-hadron channels + tagging).
- 3 LHCb internal notes have been written:
  - CERN-LHCb/2003-123: Bologna group, "Selection of B/Bs → h+h- decays at LHCb";
  - CERN-LHCb/2003-124: Bologna group, "CP sensitivity with B/Bs → h+h- decays at LHCb";
  - CERN-LHCb/2003-115: Milano group, "LHCb flavour tagging performance".
Software Roadmap
[Figure: LHCb software roadmap.]
DC04 (April-June 2004) – Physics Goals
- Demonstrate the performance of the HLTs (needed for the computing TDR):
  - large minimum-bias sample + signal.
- Improve the B/S estimates of the optimisation TDR:
  - large bb sample + signal.
- Physics improvements to the generators.
DC04 – Computing Goals
- Main goal: gather information to be used for writing the LHCb computing TDR.
- Robustness test of the LHCb software and production system:
  - using software as realistic as possible in terms of performance.
- Test of the LHCb distributed computing model:
  - including distributed analyses.
- Incorporation of the LCG application-area software into the LHCb production environment.
- Use of LCG resources as a substantial fraction of the production capacity.
DC04 – Production Scenario
- Generate (Gauss, "SIM" output):
  - 150 million events minimum bias;
  - 50 million events inclusive b decays;
  - 20 million exclusive b decays in the channels of interest.
- Digitize (Boole, "DIGI" output):
  - all events; apply the L0+L1 trigger decision.
- Reconstruct (Brunel, "DST" output):
  - minimum bias and inclusive b decays passing the L0 and L1 triggers;
  - the entire exclusive b-decay sample.
- Store:
  - SIM+DIGI+DST of all reconstructed events.
Goal: Robustness Test of the LHCb Software and Production System
- First use of the simulation program Gauss, based on Geant4.
- Introduction of the new digitisation program, Boole:
  - with HLTEvent as output.
- Robustness of the reconstruction program, Brunel:
  - including any new tuning or other available improvements;
  - not including mis-alignment/calibration.
- Pre-selection of events based on physics criteria (DaVinci):
  - AKA "stripping";
  - performed by the production system after the reconstruction;
  - producing multiple DST output streams.
- Further development of the production tools (DIRAC etc.):
  - e.g. integration of stripping;
  - e.g. book-keeping improvements;
  - e.g. monitoring improvements.
Goal: Test of the LHCb Computing Model
- Distributed data production:
  - including LCG1;
  - as in 2003, will be run on all available production sites;
  - controlled by the production manager at CERN;
  - in close collaboration with the LHCb production site managers.
- Distributed data sets:
  - CERN: complete DST (copied from the production centres); master copies of the pre-selections (stripped DST).
  - Tier-1: complete replica of the pre-selections; master copy of the DST produced at the associated sites; master (unique!) copy of the SIM+DIGI produced at the associated sites.
- Distributed analysis.
Goal: Incorporation of the LCG Software
- Gaudi will be updated to:
  - use the POOL (hybrid persistency implementation) mechanism;
  - use certain SEAL (general framework services) services, e.g. the plug-in manager.
- All the applications will use the new Gaudi:
  - should be ~transparent, but must be commissioned.
- N.B.: POOL provides the existing functionality of ROOT I/O:
  - and more: e.g. location-independent event collections;
  - but it is incompatible with the existing TDR data;
  - we may need to convert it if we want just one data format.
Needed Resources for DC04
- The CPU requirement is 10 times what was needed for DC03.
- Current resource estimates indicate DC04 will last 3 months:
  - assumes that Gauss is twice as slow as SICBMC;
  - currently planned for April-June.
- GOAL: use of LCG resources as a substantial fraction of the production capacity:
  - we can hope for up to 50%.
- Storage requirement:
  - 6 TB at CERN for the complete DST;
  - 19 TB distributed among the Tier-1s for locally produced SIM+DIGI+DST;
  - up to 1 TB per Tier-1 for pre-selected DSTs.
Resource Request to the Bologna Tier-1 for DC04
- CPU power: 200 kSI2000 (500 1-GHz PIII CPUs).
- Disk: 5 TB.
- Tape: 5 TB.
Tier-1 Growth in the Next Years

                 2004   2005   2006   2007
CPU [kSI2000]     200    400    800      –
Disk [TB]           5     20    100    200
Tape [TB]           5     20    200    600
Online Computing
- LHCb-Italy has been involved in the online group to design the L1/HLT trigger farm.
- Sezione di Bologna:
  - G. Avoni, A. Carbone, D. Galli, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni.
- Sezione di Milano:
  - T. Bellunato, L. Carbone, P. Dini.
- Sezione di Ferrara:
  - A. Gianoli.
Online Computing (II)
- Lots of changes since the Online TDR:
  - abandoned Network Processors;
  - included the Level-1 DAQ;
  - have now Ethernet from the readout boards;
  - destination assignment by the TFC (Timing and Fast Control).
- The main ideas are the same:
  - a large gigabit Ethernet Local Area Network to connect detector sources to CPU destinations;
  - a simple (push) protocol, no event-manager;
  - commodity components wherever possible;
  - everything controlled, configured and monitored by the ECS (Experiment Control System).
DAQ Architecture
[Diagram: DAQ architecture. Level-1 traffic: front-end electronics (FE, TRM) → 126-224 links, 44 kHz, 5.5-11.0 GB/s → 62-87 switches. HLT traffic: 323 links, 4 kHz, 1.6 GB/s → multiplexing layer → 64-137 links, 88 kHz. Readout network (Gb Ethernet, 29 switches, mixed traffic): 94-175 links, 7.1-12.6 GB/s → 94-175 SFCs → CPU farm (~1800 CPUs). Also shown: TFC system, L1-decision sorter (32 links), storage system.]
Following the Data-Flow
[Diagram: the same architecture followed by an example event: FE1/FE2 send fragments through the readout network to the SFCs (94 links, 7.1 GB/s, 94 SFCs); the TFC system distributes the L0 and L1 decisions ("Yes") via the L1-decision sorter; a farm node among the ~1800 CPUs runs the L1 trigger and the HLT (e.g. selecting B → ΦKs, "Yes"), and the accepted event is sent to the storage system.]
Design Studies
- Items under study:
  - physical farm implementation (choice of cases, cooling, etc.);
  - farm management (bootstrap procedure, monitoring);
  - Subfarm Controllers (event-builders, load-balancing queues);
  - Ethernet switches;
  - integration with TFC and ECS;
  - system simulation.
- LHCb-Italy is involved in farm management, Subfarm Controllers and their communication with the subfarm nodes.
Tests in Bologna
- To begin the activity in Bologna, we started (August 2003) from scratch by transferring data through 1000Base-T (gigabit Ethernet on copper cables) from PC to PC and measuring the performance.
- We plan to use an unreliable protocol (raw Ethernet, raw IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged datagrams) introduce unpredictable latency; so, together with throughput and latency, we also need to benchmark data loss.
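A minimal sketch of such a loss benchmark (hypothetical code, not the actual Bologna test programs): since UDP gives no delivery feedback, the sender stamps each datagram with a sequence number and the receiver counts the gaps.

```c
/* udp_loss.c - minimal sketch of a UDP datagram-loss benchmark
 * (hypothetical example, not the actual test code used in Bologna).
 * Assumes in-order delivery, reasonable across a single switch hop.
 * Build:  gcc -O2 -o udp_loss udp_loss.c
 * Usage:  ./udp_loss recv <port>
 *         ./udp_loss send <receiver-ip> <port> <n-datagrams>
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

#define PAYLOAD 4096                        /* 4-kB datagrams, as in the tests */

int main(int argc, char **argv)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET };
    static char buf[PAYLOAD];

    if (argc == 5 && strcmp(argv[1], "send") == 0) {
        long n = atol(argv[4]);
        a.sin_addr.s_addr = inet_addr(argv[2]);
        a.sin_port = htons(atoi(argv[3]));
        for (long seq = 0; seq < n; seq++) {
            memcpy(buf, &seq, sizeof seq);  /* sequence number in the payload */
            sendto(s, buf, PAYLOAD, 0, (struct sockaddr *)&a, sizeof a);
        }
    } else if (argc == 3 && strcmp(argv[1], "recv") == 0) {
        long expected = 0, lost = 0, seq;
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(atoi(argv[2]));
        bind(s, (struct sockaddr *)&a, sizeof a);
        for (;;) {
            recv(s, buf, PAYLOAD, 0);
            memcpy(&seq, buf, sizeof seq);
            lost += seq - expected;         /* skipped sequence numbers = lost */
            expected = seq + 1;
            if (expected % 1000000 == 0)
                fprintf(stderr, "lost %ld / %ld\n", lost, expected);
        }
    }
    return 0;
}
```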
Tests in Bologna (II) – Previous Results
- In the IEEE 802.3 standard specifications, for 100-m-long Cat-5e cables, the BER (Bit Error Rate) is said to be < 10^-10.
- Previous measurements, performed by A. Barczyc, B. Jost and N. Neufeld using Network Processors (not real PCs) and 100-m-long Cat-5e cables, showed a BER < 10^-14.
- Recent measurements (presented by A. Barczyc at Zürich, 18.09.2003), performed using PCs, gave a frame-drop rate of O(10^-6).
- Much data (too much for L1!) gets lost inside the kernel network-stack implementation in the PCs.
Tests in Bologna (III)
- Transferring data on 1000Base-T Ethernet is not as trivial as it was on 100Base-TX Ethernet:
  - A new bus (PCI-X) and new chipsets (e.g. Intel E7501, 875P) have been designed to sustain the gigabit NIC data flow (the PCI bus and old chipsets do not have enough bandwidth to support a gigabit NIC at gigabit rate).
  - The Linux kernel implementation of the network stack has been rewritten twice since kernel 2.4 to support gigabit data flow (the networking code is 20% of the kernel source). The last modification implies a change of the kernel-to-driver interface (network drivers must be rewritten).
  - A standard Linux RedHat 9A setup uses the back-compatibility layer and loses packets.
- Not many people are interested in achieving very low packet loss (except for video streaming). A DataTAG group is also working on packet loss (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J. P. Martin-Flatin, "A map of the networking code in Linux Kernel 2.4.20", draft 8, 18 August 2003).
Tests in Bologna – Results Summary
- Throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern.
- We have understood, first (at least) within the LHCb collaboration, how to send IP datagrams at gigabit/second rate from Linux to Linux over 1000Base-T Ethernet without datagram loss (4 datagrams lost per 2.0 x 10^10 datagrams sent).
- This required:
  - use of the appropriate software:
    - NAPI kernel (>= 2.4.20);
    - NAPI-enabled drivers (for the Intel e1000 driver, recompilation with a special flag set was needed);
  - kernel parameter tuning (buffer & queue lengths);
  - 1000Base-T flow control enabled on the NICs.
Test-bed 0
- 2 PCs with 3 1000Base-T interfaces each.
- Motherboard: SuperMicro X5DPL-iGM:
  - dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM;
  - chipset Intel E7501, 400/533 MHz FSB (front-side bus);
  - bus controller hub Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz);
  - Ethernet controller Intel 82545EM: 1 x 1000Base-T interface (supports jumbo frames).
- Plugged-in PCI-X Ethernet card: Intel PRO/1000 MT Dual Port Server Adapter:
  - Ethernet controller Intel 82546EB: 2 x 1000Base-T interfaces (supports jumbo frames).
- 1000Base-T 8-port switch: HP ProCurve 6108:
  - 16 Gbps backplane: non-blocking architecture;
  - latency: < 12.5 µs (LIFO, 64-byte packets);
  - throughput: 11.9 million pps (64-byte packets);
  - switching capacity: 16 Gbps.
- Cat-6e cables:
  - max 500 MHz (cf. the 125 MHz needed by 1000Base-T).
Test-bed 0 (II)
[Diagram: the two PCs, lhcbcn1 (10.1.*.7, 10.0.*.7, 131.154.10.7) and lhcbcn2 (10.1.*.2, 10.0.*.2, 131.154.10.2), connected through the 1000Base-T switch; the switch also carries the uplink.]
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
  - to use only one interface to receive packets belonging to a given network (131.154.10, 10.0 and 10.1).
Test-bed 0 (III)
[Figure.]
SuperMicro X5DPL-iGM Motherboard (Chipset Intel E7501)
- The chipset internal bandwidth is guaranteed: 6.4 Gb/s minimum.
Benchmark Software
- We used 2 benchmark programs:
  - netperf 2.2p14 UDP_STREAM;
  - self-made basic sender & receiver programs using UDP & raw IP.
- We discovered a bug in netperf on the Linux platform:
  - on Linux, setsockopt(SO_SNDBUF) & setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while getsockopt(SO_SNDBUF) & getsockopt(SO_RCVBUF) return the actual buffer size; so, when netperf iterates to achieve the requested precision in the results, it doubles the buffer size at each iteration, because it uses the same variable for both system calls.
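A minimal sketch reproducing the Linux behaviour behind the bug (the kernel stores twice the requested buffer size, to account for its own bookkeeping overhead, and getsockopt returns the doubled value):

```c
/* sndbuf_double.c - shows why reusing one variable for setsockopt and
 * getsockopt doubles the buffer at every iteration, as netperf did.
 * On Linux, setsockopt(SO_SNDBUF) stores 2x the requested value
 * (capped by net.core.wmem_max) and getsockopt returns that value.
 */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int size = 65536;                 /* requested buffer size */
    socklen_t len = sizeof size;

    for (int i = 0; i < 4; i++) {
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, sizeof size);
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &size, &len);
        printf("iteration %d: SO_SNDBUF = %d\n", i, size); /* 2x each time */
    }
    return 0;
}
```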
Benchmark Environment
- Kernel 2.4.20-18.9smp.
- Gigabit Ethernet driver: e1000:
  - version 5.0.43-k1 (RedHat 9A);
  - version 5.2.16, recompiled with the NAPI flag enabled.
- System disconnected from the public network.
- Runlevel 3 (X11 stopped).
- Daemons stopped (crond, atd, sendmail, etc.).
- Flow control on (on both NICs and switch).
- Number of descriptors allocated in the driver rings: 256, 4096.
- IP send buffer size: 524288 (x2) bytes.
- IP receive buffer size: 524288 (x2), 1048576 (x2) bytes.
- Tx queue length: 100, 1600.
First Results: Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning
- The first benchmark results on datagram loss showed big fluctuations which, in principle, can be due to packet-queue resets, other processes on the CPU, interrupts, soft_irqs, broadcast network traffic, etc.
- The resulting distribution is multi-modal.
- Mean loss: 1 datagram lost every 20000 datagrams sent.
- Too much for the LHCb L1!!!
First Results: Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning (II)
- We think the peak behaviour is due to kernel queue resets (all queued packets are silently dropped when the queue is full).
Changes in the Linux Network Stack Implementation
- 2.1 → 2.2: netlink, bottom halves, HFC (hardware flow control):
  - as little computation as possible while in interrupt context (interrupts disabled);
  - part of the processing deferred from the interrupt handler to bottom halves, to be executed at a later time (with interrupts enabled);
  - HFC (to prevent interrupt livelock): when the backlog queue is totally filled, interrupts are disabled until the backlog queue is emptied;
  - bottom-half execution strictly serialized among CPUs; only one packet at a time can enter the system.
- 2.3.43 → 2.4: softnet, softirq:
  - softirqs are software threads that replace bottom halves;
  - possible parallelism on SMP machines.
- 2.5.53 → 2.4.20 (N.B.: a back-port): NAPI (new application program interface):
  - interrupt-mitigation technology (a mixture of interrupt and polling mechanisms).
Interrupt Livelock
- Given the rate of incoming interrupts, the IP processing thread never gets a chance to remove any packets from the system.
- So many interrupts come into the system that no useful work is done.
- Packets go all the way to the backlog queue, but are dropped because the queue is full.
- System resources are used extensively, but no useful work is accomplished.
NAPI (New API)
- NAPI is an interrupt-mitigation mechanism built from a mixture of interrupt and polling mechanisms.
- Polling:
  - useful under heavy load;
  - introduces more latency under light load;
  - abuses the CPU by polling devices that have no packets to offer.
- Interrupts:
  - improve latency under light load;
  - make the system vulnerable to livelock when the interrupt load exceeds the MLFFR (Maximum Loss-Free Forwarding Rate).
Packet Reception in Linux Kernel 2.4.19 (softnet) and 2.4.20 (NAPI)
[Diagram: side-by-side packet-reception paths — softnet (kernel 2.4.19) vs NAPI (kernel 2.4.20).]
NAPI (II)
- Under low load, before the MLFFR is reached, the system converges toward an interrupt-driven system: the packets/interrupt ratio is lower and latency is reduced.
- Under heavy load, the system takes its time to poll the registered devices. Interrupts are admitted only as fast as the system can process them: the packets/interrupt ratio is higher and latency is increased.
NAPI (III)
- NAPI changes the driver-to-kernel interface:
  - all network drivers should be rewritten.
- In order to accommodate devices that are not NAPI-aware, the old interface (the backlog queue) is still available to old drivers (back-compatibility).
- Backlog queues, when used in back-compatibility mode, are polled just like other devices.
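As an illustration of the new interface, a schematic receive path for a kernel-2.4.20-era NAPI driver (the netif_* entry points are the real 2.4.20 ones; the my_dev_* device functions are hypothetical placeholders, so this is an interface sketch, not a compilable driver):

```c
/* napi_sketch.c - schematic NAPI receive path for a 2.4.20-era driver.
 * netif_* calls are the actual kernel entry points; my_dev_* functions
 * stand in for device-specific code. Not a complete driver. */
#include <linux/netdevice.h>
#include <linux/interrupt.h>

/* Interrupt handler: instead of pushing packets to the backlog queue,
 * mask RX interrupts and put the device on the kernel's poll list. */
static void my_dev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = dev_id;

    my_dev_disable_rx_irq(dev);   /* hypothetical: mask RX interrupts */
    netif_rx_schedule(dev);       /* schedule the device for polling  */
}

/* Poll method, called from the net softirq: drain up to 'budget' packets. */
static int my_dev_poll(struct net_device *dev, int *budget)
{
    int done = 0;
    int quota = *budget < dev->quota ? *budget : dev->quota;

    while (done < quota && my_dev_rx_pending(dev)) {   /* hypothetical */
        struct sk_buff *skb = my_dev_next_skb(dev);    /* hypothetical */
        netif_receive_skb(skb);   /* hand the packet to the IP stack  */
        done++;
    }
    *budget -= done;
    dev->quota -= done;

    if (!my_dev_rx_pending(dev)) {  /* ring empty: back to interrupt mode */
        netif_rx_complete(dev);
        my_dev_enable_rx_irq(dev);  /* hypothetical: unmask RX interrupts */
        return 0;                   /* done polling */
    }
    return 1;                       /* more work: stay on the poll list */
}

/* At initialization the driver registers the poll method:
 *     dev->poll   = my_dev_poll;
 *     dev->weight = 64;
 */
```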
True NAPI vs Back-Compatibility Mode
[Diagram: two receive paths compared — a NAPI kernel with a NAPI driver vs a NAPI kernel with an old (not NAPI-aware) driver.]
The Intel e1000 Driver
- Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the use of the driver also with kernels <= 2.4.19).
- To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option:
  make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI
Best Results
- Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s.
- Mean datagram-loss fraction (@ 957 Mb/s): 2.0 x 10^-10 (4 datagrams lost per 2.0 x 10^10 4-kB datagrams sent):
  - corresponding to a BER of 6.2 x 10^-15 (using 1-m Cat-6e cables), if the data loss is totally due to hardware CRC errors.
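A back-of-the-envelope check of the quoted BER, counting only the UDP payload bits (the exact per-frame overhead assumed in the original is not stated):

$$ \mathrm{BER} \approx \frac{N_{\mathrm{lost}}}{N_{\mathrm{sent}} \cdot L \cdot 8} = \frac{4}{(2.0\times 10^{10}) \cdot 4096 \cdot 8} \approx 6.1\times 10^{-15}, $$

in agreement with the quoted 6.2 x 10^-15.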
To be Tested to Improve Further
- Kernel 2.5:
  - fully preemptive (real-time);
  - sysenter & sysexit (instead of int 0x80) for the context switch following system calls (3-4 times faster).
- Asynchronous datagram receiving.
- Jumbo frames:
  - Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500;
  - less fragmentation of IP datagrams into packets (see the estimate below).
- Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml/):
  - KML is a technology that enables the execution of ordinary user-space programs inside kernel space;
  - protection by software (as in Java bytecode) instead of protection by hardware;
  - system calls become function calls (132 times faster than int 0x80, 36 times faster than sysenter/sysexit).
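For the 4096-byte datagrams used in the tests, the fragmentation saving from jumbo frames is easy to quantify (assuming 20-byte IP headers and the 8-byte UDP header; fragment payloads are multiples of 8 bytes):

$$ N_{\mathrm{frag}} = \left\lceil \frac{4096 + 8}{1500 - 20} \right\rceil = 3 \qquad\text{vs}\qquad N_{\mathrm{frag}} = \left\lceil \frac{4096 + 8}{9000 - 20} \right\rceil = 1, $$

so each datagram costs one frame instead of three, i.e. fewer interrupts and less per-packet kernel work per datagram.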
Milestones
- 8.2004 – Streaming benchmarks:
  - maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet with a loopback cable;
  - test of switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames);
  - maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet for 2 or 3 simultaneous connections on the same PC;
  - test of event building (receive 2 message streams and send 1 joined message stream).
- 12.2004 – SFC (Sub-Farm Controller) to nodes communication:
  - definition of the SFC-to-nodes communication protocol;
  - definition of the SFC queueing and scheduling mechanism;
  - first implementation of the queueing/scheduling procedures (possibly zero-copy).
Milestones (II)
- OS tests (if performance needs to be improved):
  - kernel Linux 2.5.53;
  - KML (Kernel Mode Linux).
- Design and test of the bootstrap procedures:
  - measurement of the failure rate of the simultaneous boot of a cluster of PCs, using PXE/DHCP and TFTP;
  - test of node switch-on/off and power cycle using ASF;
  - design of the bootstrap system (ratio of nodes/proxy servers/servers, software alignment among servers).
- Definition of the requirements for the trigger software:
  - error trapping;
  - timeouts.