Experience with multi-threaded C++ applications in the ATLAS DataFlow
Performance problems found and solved:
• STL containers
• thread scheduling
• other
Szymon Gadomski
University of Bern, Switzerland
INP Cracow, Poland
on behalf of the ATLAS Trigger/DAQ DataFlow, CHEP 2003 conference

ATLAS DataFlow software
• Flow of data in the ATLAS DAQ system
  – Data to LVL2 (part of event), to EF (whole event), to mass storage.
  – See talks by Giovanna Lehman (overview of DataFlow) and by Stefan Stancu (networking).
• PCs, standard Linux, applications written in C++ (so far using only gcc to compile), standard network technology (Gb Ethernet).
• "Soft" real-time system, no guaranteed response time. The average response time is what matters.
• Common tasks (exchanging messages, state machine, access to the configuration db, reporting errors, …) use a framework (well, actually two…).
S. Gadomski, "Experience with multi-threaded C++ in ATLAS DataFlow", CHEP, March 03

ATLAS DataFlow software (2)
• State of the project:
  – development done mostly in 2001-2002,
  – measurements for the Technical Design Report: performance,
  – preparation for beam test support: stability, robustness and deployment.
• 7 kinds of applications (+3 kinds of controllers).
• Always several threads (independent processes within one application without their own resources).
• Roles, challenges and use of threads are very different.
• In this short talk only a few examples: use of threads, problems, solutions.

Testbed at CERN
• 4U PCs, >= 2 GHz
• 1U PCs, >= 2 GHz
• FPGA traffic generators

LVL2 processing unit (L2PU) - role
The L2PU:
• gets LVL1 decision
• asks for data
• gets it
• makes LVL2 decision
• sends it
• sends detailed result
Diagram labels: ROB (1600 x, detector data), ROS (140 x), L2SV (10 x), L2PU (up to 500 x), pROS (1 x), mass storage; DataFlow application; interface with control software; "Open choice".
Messages shown: L1 + RoI data, data request (RoI only), data, LVL2 decision, detailed LVL2 result.
Multiplicities are indicative only.

L2PU design
Input Thread:
• receives LVL1 Result (from L2SV) and adds it to the Event Queue
• receives RoI Data (from the ROS's) and assembles it
• if complete, restarts the Worker
Worker Thread:
• gets next event from the queue
• runs LVL2 selection code
• requests data (RoI Data Requests) and waits
• continues selection code
• if accept, sends LVL2 Result (pROS)
• sends LVL2 Decision
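The hand-off between an input thread and a worker thread through an event queue can be sketched as below. This is a minimal illustration using standard C++17 threads, not the L2PU code: the `Event` struct, the `EventQueue` class and `run_pipeline` are hypothetical names, and the "selection code" is reduced to recording the event id.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical event record; a real LVL1 result carries much more.
struct Event { int id; };

// Blocking queue shared by the input thread (producer) and the worker (consumer).
class EventQueue {
public:
    void push(Event e) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(e); }
        not_empty_.notify_one();
    }
    // Signals that no further events will arrive.
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        not_empty_.notify_all();
    }
    // Blocks until an event is available; returns nullopt once closed and drained.
    std::optional<Event> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        Event e = queue_.front();
        queue_.pop();
        return e;
    }
private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<Event> queue_;
    bool closed_ = false;
};

// Runs one input thread and one worker thread; returns ids in processing order.
inline std::vector<int> run_pipeline(int n_events) {
    EventQueue queue;
    std::vector<int> processed;  // written only by the worker, read after join
    std::thread worker([&] {
        while (auto e = queue.pop()) processed.push_back(e->id);  // stand-in for selection code
    });
    std::thread input([&] {
        for (int i = 0; i < n_events; ++i) queue.push(Event{i});
        queue.close();
    });
    input.join();
    worker.join();
    return processed;
}
```

With a single worker the events come out in arrival order; with several workers popping from the same queue, ordering is no longer guaranteed.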

Sub-farm Interface (SFI) - role
The SFI:
• gets event id (L2 accept)
• asks for all event data
• gets it
• builds complete event
• buffers it
• sends it to Event Filter
Diagram labels: ROS (140 x), DFM (1 x), SFI (50 x), EF, mass storage; DataFlow application; interface with control.
Messages shown: request, data, clear, LVL2 accepts and rejects, assign, EoE, complete event.
Multiplicities are indicative only.

SFI design
• EB rate per SFI: ~50 Hz
• Different threads for requesting and receiving data
• Threads for assembly and for sending to Event Handler
Diagram labels: DFM (Assigns, End of Event), ROS (Data Requests, Event Data), Request Thread, Input Thread, Assembly Thread (ROS fragments, re-ask fragment IDs), Event Handler (events), EF (full event).

Lesson with L2PU and SFI – STL containers
• With no apparent dependence between threads in the code, it was observed that threads were not running independently. No effect from more threads.
• VisualThreads, using an instrumented pthread library:
  – STL containers use a memory pool, by default one per executable. There is a lock, so threads may block each other.
[Figure: number of threads vs. time, with intervals marked "blocked!"]

Lesson with L2PU and SFI – STL containers (2)
• The solution is to use the pthread allocator: independent memory pools for each thread, no lock, no blocking.
• Use it for all containers used at event rate.
• Be careful with creating objects in one thread and deleting them in another.
[Figure: number of threads vs. time; threads are blocked less often]
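The idea of one lock-free memory pool per thread can be illustrated in current C++. The sketch below does not use the pthread allocator of the gcc STL of that time; it swaps in the C++17 `std::pmr::unsynchronized_pool_resource` held in a `thread_local` variable, which gives the same property (no lock on allocation). It assumes a standard library that ships `<memory_resource>`; `thread_pool`, `build_fragment_list` and `run_workers` are hypothetical names.

```cpp
#include <cstddef>
#include <memory_resource>
#include <thread>
#include <vector>

// One unsynchronized pool per thread: allocate/deallocate take no lock.
inline std::pmr::memory_resource* thread_pool() {
    thread_local std::pmr::unsynchronized_pool_resource pool;
    return &pool;
}

// Hypothetical per-event work: the container is created, filled and destroyed
// in the same thread, so its memory never crosses into another thread's pool.
inline std::size_t build_fragment_list(std::size_t n_fragments) {
    std::pmr::vector<int> fragments(thread_pool());
    for (std::size_t i = 0; i < n_fragments; ++i)
        fragments.push_back(static_cast<int>(i));
    return fragments.size();
}

// Starts n_threads workers, each using only its own pool; returns the total count.
inline std::size_t run_workers(unsigned n_threads, std::size_t n_fragments) {
    std::vector<std::size_t> counts(n_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&counts, t, n_fragments] {
            counts[t] = build_fragment_list(n_fragments);
        });
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

The warning about creating objects in one thread and deleting them in another applies here as well: an unsynchronized pool must only be used by the thread that owns it, so a container handed to another thread would need a synchronized resource instead.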

SFI History

Date        Change                                               EB         EB + Output to EF
30 Oct '02  First integration on testbed                         0.5 MB/s   -
13 Nov      Sending data requests at a regular pace              8.0 MB/s   -
14 Nov      Reduce the number of threads                         15 MB/s    -
20 Nov      Switch off hyper-threading (threads)                 17 MB/s    -
21 Nov      Introduce credit based traffic shaping (threads)     28 MB/s    -
13 Dec      First try on throughput                              -          14 MB/s
17 Jan      Chose pthread allocator for STL object (threads)     53 MB/s    18 MB/s
29 Jan      DC Buffer recycling when sending                     56 MB/s    19 MB/s
05 Feb      IOVec storage type in the EFormat library            58 MB/s    46 MB/s
21 Feb      Buffer pool per thread                               64 MB/s    48 MB/s
21 Feb      Grouping interthread communication (threads)         73 MB/s    51 MB/s
26 Feb      Avoiding one system call per message                 80 MB/s    55 MB/s

Most improvements (and most problems) are related to threads.

Lessons from SFI
• Traffic shaping (limiting the number of outstanding requests for data) eliminates packet loss.
• Grouping interthread communication decreases the frequency of thread activation.
• Some improvements in more predictable areas:
  – avoiding copies and system calls,
  – avoiding creations by recycling buffers,
  – avoiding contention: each thread has its own buffers.
• Optimizations driven by measurements with full functionality.
• Effective development: the developer works on a good testbed, tests and optimizes, short cycle.
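Credit-based traffic shaping, as described in the first bullet, amounts to a counter of outstanding requests with a fixed ceiling. The sketch below is one possible form under that assumption, not the SFI implementation; the class name `CreditGate` and its methods are hypothetical. A request thread takes a credit before sending each data request, and the receiving side returns the credit when the reply arrives.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Limits the number of outstanding data requests to a fixed number of credits.
class CreditGate {
public:
    explicit CreditGate(std::size_t credits) : credits_(credits) {}

    // Non-blocking: takes a credit if one is available.
    bool try_acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (credits_ == 0) return false;
        --credits_;
        return true;
    }

    // Blocking: waits until a credit is returned by release().
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return credits_ > 0; });
        --credits_;
    }

    // Called when the reply to a request has been received.
    void release() {
        { std::lock_guard<std::mutex> lock(mutex_); ++credits_; }
        available_.notify_one();
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return credits_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable available_;
    std::size_t credits_;
};
```

The number of credits bounds how much data can be in flight towards one receiver at any time, which is what keeps switch and receive buffers from overflowing.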

Performance of the SFI
[Plot: EB-only throughput vs. #ROLs/ROS on a 2.4 GHz CPU; I/O limited at 95 MB/s, CPU limited below that]
• Reaching the I/O limit at 95 MB/s, otherwise CPU limited
• 35% performance gain with at least 8 ROLs/ROS
• Will approach the I/O limit for 1 ROL/ROS with a faster CPU

Readout System (ROS) - role
RoI collection and partial event building.
Diagram labels: ROBin (~12 buffers for data), I/O Manager, ROS controller, LVL2 or EB (data request, data).
Not exactly like SFI:

                ROS                       SFI
Request rate    24 kHz L2, 3 kHz EB       50 Hz
Data per req.   2 kB LVL2, 8 kB EB        1.5 MB
Data rate       72 MB/s                   75 MB/s

All numbers approximate.
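The data rates in the comparison follow from the request rates and the data volume per request. As a check, assuming 1 MB = 1000 kB and treating all inputs as the approximate values quoted on the slide:

```latex
\begin{align*}
R_{\mathrm{ROS}} &= 24\,\mathrm{kHz} \times 2\,\mathrm{kB} + 3\,\mathrm{kHz} \times 8\,\mathrm{kB}
                  = 48\,\mathrm{MB/s} + 24\,\mathrm{MB/s} = 72\,\mathrm{MB/s} \\
R_{\mathrm{SFI}} &= 50\,\mathrm{Hz} \times 1.5\,\mathrm{MB} = 75\,\mathrm{MB/s}
\end{align*}
```

The two applications move a similar volume of data, but the ROS does so at a request rate several hundred times higher, which is why per-request overheads dominate there.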

IOManager in ROS
Diagram labels: Trigger Requests (L2, EB, Delete), Request Queue, Request Handlers, RobIns, Control, error.
Legend: Process, Thread, Linux Scheduler.
The number of request handlers is configurable.

Thread scheduling problem
• System without interrupts: poll and yield.
• The standard Linux scheduler puts the thread away until the next time slice, up to 10 ms.
The solution is to change scheduling in the kernel:
• For 2.4.9 kernels there exists an unofficial patch (tested on CERN RH 7.2).
• For CERN RH 7.3 there is a CERN-certified patch, linux_2.4.18_18_sched.yield.patch.
20 ms latency for getting data.
This is an evolving field; we need to continue evaluating thread-related changes in Linux kernels.
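The "poll and yield" pattern can be sketched in portable C++ as follows. This is only an illustration of the pattern, not the ROS code: `poll_and_yield` and `receive_one` are hypothetical names, and an atomic flag stands in for the hardware status that a real request handler would poll. How long a thread stays off the CPU after `std::this_thread::yield()` is decided by the kernel scheduler, which is exactly the behaviour the patches above change.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Polls `ready` until it becomes true, yielding the CPU after every empty poll.
// Returns the number of polls that found no data.
inline std::uint64_t poll_and_yield(const std::atomic<bool>& ready) {
    std::uint64_t empty_polls = 0;
    while (!ready.load(std::memory_order_acquire)) {
        ++empty_polls;
        std::this_thread::yield();  // give other threads a chance to run
    }
    return empty_polls;
}

// Hypothetical demonstration: a producer thread fills a buffer and raises the
// flag; the calling thread polls for it and then reads the buffer.
inline int receive_one(int payload) {
    std::atomic<bool> ready{false};
    int buffer = 0;
    std::thread producer([&] {
        buffer = payload;                              // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    poll_and_yield(ready);
    producer.join();
    return buffer;
}
```

The release/acquire pair on the flag guarantees that the polling thread sees the buffer contents written before the flag was raised.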

Conclusions
• The DataFlow of ATLAS DAQ has a set of applications managing the flow of data.
• All prototypes exist, have been optimized, are used for performance measurements and are prepared for the beam test.
• Standard technology (Gb Ethernet, PCs, standard Linux, C++ with gcc, multi-threaded) meets ATLAS requirements.
• A few lessons were learned.

Backup slides

Data Flow Manager (DFM) - role
Diagram labels: ROS, L2SV, DFM (1 x), SFI, SFO, EF, disk files, mass storage; DataFlow application; I/F with OnlineSW. Other multiplicities shown: 200 x, 16 x, 30 x, 100 x.
Messages shown: data, clear, request, LVL2 accepts and rejects, assign, EoE.
Multiplicities are indicative only.

DFM design
• I/O rate: ~4 kHz
• Bulk of work done in the I/O thread
• Cleanup thread identifies timed-out events
• Fully embedded in the DC framework
Diagram labels: L2SV (L2 Decisions), ROS (Clears), SFI (SFI Assigns, EndOfEvent); I/O Thread (EventAssigns, Clears, L2 Decisions, EndOfEvent, load balancing, bookkeeping); Cleanup Thread (timeouts).
Threads allow for independent and parallel processing within an application.

STL containers (3)

SFI performance
• Input up to 95 MB/s (~3/4 of the 1 Gb line)
• Input and output at 55 MB/s (~1/2 line speed)
With all the logic of EventBuilding and all the objects involved, the performance is already close to the network limit (on a 2.4 GHz PC).
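Reading the throughput figures as megabytes per second, consistent with the 95 MB/s and 55 MB/s quoted on the main performance slides, the fractions of the 1 Gb/s line can be checked as follows (protocol overhead ignored):

```latex
\begin{align*}
95\,\mathrm{MB/s} \times 8\,\mathrm{bit/B} &= 760\,\mathrm{Mb/s} \approx \tfrac{3}{4} \times 1\,\mathrm{Gb/s} \\
55\,\mathrm{MB/s} \times 8\,\mathrm{bit/B} &= 440\,\mathrm{Mb/s} \approx \tfrac{1}{2} \times 1\,\mathrm{Gb/s}
\end{align*}
```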

Performance of Event Building
• N SFIs
• 1 DFM
• hardware emulators of ROS
Max EB rate with 8 SFIs: ~350 Hz (17% of ATLAS EB rate)

After the patch
[Plot: L2 request rate (kHz, 0 to 200) vs. number of request handlers (0 to 40), one curve per simulated I/O latency: 2, 5, 10, 20, 50 and 1000 usecs]
Conditions: Xeon/2 GHz, Linux 2.4.18 + CERN scheduling patch, 100% L2 requests, 1 ROL per L2 request, release grouping = 100.

Flow of messages
Participants: RoIB, L2SV, L2PU, ROS/ROB, pROS, DFM, SFI, EF.
Messages:
• 1a: L2SV_LVL1Result
• 2a: L2PU_DataRequest
• 2b: ROS/ROB_Fragment
• 3a: L2PU_LVL2Result
• 3b: pROS_Ack
• 1b: L2PU_LVL2Decision
• 4a: L2SV_LVL2Decision
• 4b: DFM_Ack
• 5a: DFM_Decision
• 5a': DFM_SFIAssign
• 6a: SFI_DataRequest
• 6b: ROS/ROB_EventFragment
• 5b: SFI_EoE
• 7: DFM_Clear
• DFM_FlowControl, SFI_FlowControl
Annotations: 1..i sequential processing or time out; wait LVL2 decision or time out; reassign time-out event; 1..n receive or timeout; wait EoE or time out; build event.
Note: 6a: SFI_DataRequest associated with 5a: DFM_Decision is used for error recovery.

Deployment view
Diagram labels: RODs, RO{B/S}, RoIB, SV Switch, LVL2 Supervisors, LVL2 Switch, SubFarm Switch, LVL2 Processors, DFM Switch, DFMs, EB Switch, SFIs, EF Switch, Local EF Farms, To Remote EF Farm.