First Level Event Selection Package of the CBM Experiment

First Level Event Selection Package of the CBM Experiment
S. Gorbunov, I. Kisel, I. Kulakov, I. Rostovtseva, I. Vassiliev (for the CBM Collaboration)
Ivan Kisel, GSI
CHEP'09, Prague, March 26, 2009

Tracking Challenge in CBM (FAIR/GSI, Germany)
• Fixed-target heavy-ion experiment
• 10^7 Au+Au collisions/sec
• ~1000 charged particles/collision
• Non-homogeneous magnetic field
• Double-sided strip detectors (85% fake space points)
Track reconstruction in STS/MVD and displaced vertex search are required at the first trigger level.

Open Charm Event Selection
• D± (cτ = 312 μm): D+ → K- π+ π+ (9.5%)
• D0 (cτ = 123 μm): D0 → K- π+ (3.8%); D0 → K- π+ π+ π- (7.5%)
• Ds± (cτ = 150 μm): Ds+ → K+ K- π+ (5.3%)
• Λc+ (cτ = 60 μm): Λc+ → p K- π+ (5.0%)
No simple trigger primitive, such as high pt, is available to tag events of interest. The only selective signature is the detection of the decay vertex.
First level event selection is done in a processor farm fed with data from the event building network.

Many-core HPC
• On-line event selection
• Mathematical and computational optimization
• Optimization of the detector
Architectures under consideration (✓ and ? as marked on the slide):
✓ CPU, Intel: XX-cores
? GP CPU, Intel: Larrabee
✓ Gaming, STI: Cell
✓ GP GPU, Nvidia: Tesla
? OpenCL
? CPU/GPU, AMD: Fusion
? FPGA, Xilinx: Virtex
• Heterogeneous systems of many cores
• Uniform approach to all CPU/GPU families
• Similar programming languages (CUDA, Ct, OpenCL)
• Parallelization of the algorithm (vectors, multi-threads, many-cores)
N_speed-up = N_cores * (N_threads/2) * W_SIMD
(N_cores: cores, N_threads: HW threads, W_SIMD: SIMD width)

Standalone Package for Event Selection

Kalman Filter for Track Fit
Figure annotations, in slide order:
• large errors
• arbitrary non-homogeneous magnetic field as large map
• multiple scattering in material
• >>> 256 KB of Local Store
• weight for update
• small errors
• not enough accuracy in single precision
• no correction from measurements

Code (Part of the Kalman Filter)
  inline void AddMaterial( TrackV &track, Station &st, Fvec_t &qp0 )
  {
    cnst mass2 = 0.1396*0.1396;  // pion mass squared

    // track slopes and derived quantities
    Fvec_t tx = track.T[2];
    Fvec_t ty = track.T[3];
    Fvec_t txtx = tx*tx;
    Fvec_t tyty = ty*ty;
    Fvec_t txtx1 = txtx + ONE;
    Fvec_t h = txtx + tyty;
    Fvec_t t = sqrt(txtx1 + tyty);
    Fvec_t h2 = h*h;
    Fvec_t qp0t = qp0*t;

    cnst c1=0.0136, c2=c1*0.038, c3=c2*0.5, c4=-c3/2.0, c5=c3/3.0, c6=-c3/4.0;

    Fvec_t s0 = (c1+c2*st.logRadThick + c3*h + h2*(c4 + c5*h +c6*h2) )*qp0t;
    Fvec_t a = (ONE+mass2*qp0t)*st.RadThick*s0;

    // add the multiple scattering term to the slope covariances
    CovV &C = track.C;
    C.C22 += txtx1*a;
    C.C32 += tx*ty*a;
    C.C33 += (ONE+tyty)*a;
  }
Use headers to overload the +, -, *, / operators: the source code is unchanged!

Header (Intel's SSE)
The overloaded operators map onto SIMD instructions:
  typedef F32vec4 Fvec_t;

  /* Arithmetic Operators */
  friend F32vec4 operator +(const F32vec4 &a, const F32vec4 &b) { return _mm_add_ps(a, b); }
  friend F32vec4 operator -(const F32vec4 &a, const F32vec4 &b) { return _mm_sub_ps(a, b); }
  friend F32vec4 operator *(const F32vec4 &a, const F32vec4 &b) { return _mm_mul_ps(a, b); }
  friend F32vec4 operator /(const F32vec4 &a, const F32vec4 &b) { return _mm_div_ps(a, b); }

  /* Functions */
  friend F32vec4 min( const F32vec4 &a, const F32vec4 &b ){ return _mm_min_ps(a, b); }
  friend F32vec4 max( const F32vec4 &a, const F32vec4 &b ){ return _mm_max_ps(a, b); }

  /* Square Root */
  friend F32vec4 sqrt ( const F32vec4 &a ){ return _mm_sqrt_ps (a); }

  /* Absolute value */
  friend F32vec4 fabs( const F32vec4 &a){ return _mm_and_ps(a, _f32vec4_abs_mask); }

  /* Logical */
  friend F32vec4 operator&( const F32vec4 &a, const F32vec4 &b ){ // mask returned
    return _mm_and_ps(a, b);
  }
  friend F32vec4 operator|( const F32vec4 &a, const F32vec4 &b ){ // mask returned
    return _mm_or_ps(a, b);
  }
  friend F32vec4 operator^( const F32vec4 &a, const F32vec4 &b ){ // mask returned
    return _mm_xor_ps(a, b);
  }
  friend F32vec4 operator!( const F32vec4 &a ){ // mask returned
    return _mm_xor_ps(a, _f32vec4_true);
  }
  friend F32vec4 operator||( const F32vec4 &a, const F32vec4 &b ){ // mask returned
    return _mm_or_ps(a, b);
  }

  /* Comparison */
  friend F32vec4 operator<( const F32vec4 &a, const F32vec4 &b ){ // mask returned
    return _mm_cmplt_ps(a, b);
  }

Kalman Filter Track Fit on Intel Xeon, AMD Opteron and Cell
Figure labels: Intel P4; 10000 times faster!
Motivated by, but not restricted to, Cell!
• lxg1411@GSI: 2 Intel Xeon processors with Hyper-Threading enabled and 512 kB cache at 2.66 GHz
• eh102@KIP: 2 dual-core AMD Opteron 265 processors with 1024 kB cache at 1.8 GHz
• blade11bc4@IBM: 2 Cell Broadband Engines with 256 kB local store at 2.4 GHz
Comp. Phys. Comm. 178 (2008) 374-383

Performance of the KF Track Fit on CPU/GPU Systems
• Speed-up of 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
• Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz: speed-up of 30 with 16 threads
• Real-time performance on different Intel CPU platforms
• Real-time performance on NVIDIA for a single track
Figure labels: Cores, HW Threads, SIMD width
CBM Progress Report, 2008

Cellular Automaton Track Finder
Figure labels: Pentium 4; 1000 times faster!

Summary
• Standalone package for online event selection is ready for investigation
• Cellular Automaton track finder takes 5 ms per minimum bias event
• Kalman Filter track fit is a benchmark for modern CPU/GPU architectures
• SIMDized multi-threaded KF track fit takes 0.1 μs/track on Intel Core i7
• Throughput of 2.2·10^7 tracks/sec is reached on NVIDIA GTX 280