Basic message passing benchmarks, methodology and pitfalls
Rolf Hempel
C&C Research Laboratories, NEC Europe Ltd.
Rathausallee 10, 53757 Sankt Augustin, Germany
email: hempel@ccrl-nece.technopark.gmd.de
SPEC Workshop, Wuppertal, Sept. 13, 1999
Further info.: http://www.ccrl-nece.technopark.gmd.de
NEC R&D Group Overseas Facilities (headquarters: Japan)
· C&C Research Laboratories, NEC Europe Ltd. (Bonn, Germany)
· NEC Research Institute, Inc. (Princeton, U.S.A.)
· C&C Research Laboratories, NEC U.S.A., Inc. (Princeton, San Jose, U.S.A.)
Rolf Hempel, NEC Europe Ltd.
MPI Implementations at NEC CCRLE
· MPI design for the Earth Simulator (image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
· MPI/SX product development (currently for SX-4/5)
· MPI for Cenju-4, derived from MPI/SX
· MPI for the PC cluster LAMP
Most important MPI implementation: MPI/SX
SX-5 (in the past SX-4): commercial product of NEC, parallel vector supercomputer
Since 1997: MPI/SX product development at CCRLE
· Standard-compliant, fully tested interface
· Optimized MPI-1, MPI-2 almost finished
· Maintenance & customer support
MPI design for the Earth Simulator
Massively parallel computer: thousands of processors
MPI is the basis for all large-scale applications
Design and implementation at CCRLE
At the moment: design in collaboration with the OS group at NEC Fuchu
(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
The purpose of message-passing benchmarks
Goal: measure the time needed for elementary operations (such as send/receive), so that the behavior of application programs can be modeled
Problem: it is difficult to measure the time for a single operation:
· no global clock
· clock resolution too low
· difficult to differentiate between receive time and synchronization delay
Standard solution: measure loops over communication operations
Simple example: the ping-pong benchmark:
Process 0: Send to 1; Recv from 1
Process 1: Recv from 0; Send to 0
Iterate 1000 times; measure time T for the entire loop
Result: time for a single message = T / 2000
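The loop structure above can be sketched as follows. This is not real MPI: Python threads and queues stand in for the two processes and for MPI_Send/MPI_Recv, purely to illustrate the timing methodology (total loop time divided by 2N messages); all names and the payload size are illustrative choices.

```python
import threading
import queue
import time

def ping_pong_benchmark(iterations=1000, payload=b"x" * 1024):
    """Simulate the ping-pong loop with two threads and two queues.

    NOT real MPI: the queues stand in for MPI_Send / MPI_Recv so that
    the timing methodology (loop time / 2N) can be demonstrated.
    """
    to_p1 = queue.Queue()   # messages from "process 0" to "process 1"
    to_p0 = queue.Queue()   # messages from "process 1" to "process 0"

    def process1():
        for _ in range(iterations):
            msg = to_p1.get()      # "Recv from 0"
            to_p0.put(msg)         # "Send to 0"

    t = threading.Thread(target=process1)
    t.start()

    t0 = time.perf_counter()
    for _ in range(iterations):
        to_p1.put(payload)         # "Send to 1"
        to_p0.get()                # "Recv from 1"
    elapsed = time.perf_counter() - t0
    t.join()

    # 2 * iterations messages were exchanged in total
    return elapsed / (2 * iterations)

if __name__ == "__main__":
    per_message = ping_pong_benchmark()
    print(f"time per message: {per_message:.2e} s")
```

Note that this toy version already exhibits the first pitfall the talk describes: the receiving thread is always ready, so the measured time is a best case.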
Implicit assumption: the time for a message in an application code will be similar to the benchmark result
Why is this usually not the case?
1. The receiver in ping-pong is always ready to receive
⇒ receive in ''solicited message'' mode
⇒ delays or intermediate copies can be avoided
2. Only two processes are active
⇒ no contention on the interconnect system; non-scaling effects are not visible (e.g. locks on global data structures)
3. Untypical response to different progress concepts:
Single-threaded MPI implementation: the application process enters the MPI library only via an explicit call (e.g. MPI_Recv); only then does the library check for and service outstanding requests.
Multi-threaded MPI implementation: within the MPI process, a dedicated communication thread constantly checks for and services communication requests, independent of the application thread.
Single-threaded MPI: progress at the receiver only when an MPI routine is called
⇒ bad sender/receiver synchronization ⇒ bad progress
Multi-threaded MPI: progress at the receiver is independent of the application thread
⇒ communication progress is independent of sender/receiver synchronization
This advantage is not seen in the ping-pong benchmark!
⇒ single-threaded always looks better
(Comparison of ping-pong latency on a Myrinet PC cluster)
4. Data may be cached between loop iterations:
The MPI process active in the ping-pong keeps the message data in cache between MPI_Recv and MPI_Send
⇒ much higher bandwidth than in real applications
(Comparison of ping-pong bandwidth on a Myrinet PC cluster, cached versus non-cached data)
This can be even worse on CC-NUMA architectures (e.g. SGI Origin 2000):
Operation to be timed:
    call shmem_get8(b(1), a(1), n, myrank-1)
Only the first transfer actually moves a(1)...a(n) from the memory of processor myrank-1; all later transfers are satisfied from the cache of processor myrank.
Consequence: all but the first transfer are VERY fast!
The problem of parametrization
Goal: instead of displaying a performance graph, define a few parameters that characterize the communication behavior
Most popular parameters:
· Latency: time for a 0-byte message
· Bandwidth: asymptotic throughput for long messages
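The two parameters correspond to a linear model T(n) = latency + n / bandwidth, usually obtained by a least-squares fit through measured (size, time) points. A minimal sketch of such a fit, with made-up numbers (10 µs latency, 100 MB/s bandwidth) rather than any machine's real data:

```python
def fit_latency_bandwidth(sizes, times):
    """Least-squares fit of the linear model  T(n) = latency + n / bandwidth.

    sizes: message sizes in bytes; times: measured times in seconds.
    Returns (latency_seconds, bandwidth_bytes_per_second).
    """
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    slope = cov / var                  # seconds per byte
    latency = mean_t - slope * mean_s  # intercept: time for a 0-byte message
    return latency, 1.0 / slope

# Synthetic, perfectly linear data: 10 us latency, 100 MB/s bandwidth
sizes = [0, 1024, 4096, 16384, 65536]
times = [10e-6 + s / 100e6 for s in sizes]
lat, bw = fit_latency_bandwidth(sizes, times)
print(f"latency = {lat * 1e6:.1f} us, bandwidth = {bw / 1e6:.0f} MB/s")
```

The fit recovers the parameters exactly here only because the synthetic data obey the model; the following slides show why real MPI timings do not.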
Problem: this model assumes a very simple communication protocol. It is not consistent with most MPI implementations.
Example: MPI on the NEC Cenju-4 (MPP system, R10000 processors)
(Graph with two linear fittings: blue fitting gives latency = 8.5 msec; red fitting gives latency = 24 msec)
Something seems to change at message size = 1024 bytes
It can be even worse: discontinuities in the timing curve
Example: NEC SX-4, send/receive within a 32-processor node
In most modern MPI implementations: three different protocols, depending on message length.
Short messages: message envelope + data are written into pre-defined slots at the receiver.
''Eager'' messages: the message envelope goes into a pre-defined slot; a local copy of the message data is kept, and the data are fetched into user memory when the receiver is ready.
''Long'' messages: the message envelope goes into a pre-defined slot; sender and receiver synchronize, then the data are transferred directly from the send buffer to the receive buffer.
Good reasons for protocol changes:
Short messages:
· Copying does not cost much
· Important not to block the sender for too long
· Use pre-allocated slots for the intermediate copy ⇒ avoid the time to allocate memory
''Eager'' messages:
· Copying does not cost much
· Important not to block the sender for too long
· Allocate the intermediate buffer on the fly
Long messages:
· Copying is expensive, avoid intermediate copies
· Rendezvous between sender and receiver
· Synchronous data transfer ⇒ optimal throughput
In the SX-5 implementation: protocol changes at 1024 and 100,000 bytes
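The protocol choice described above boils down to a length threshold test. A simplified sketch of that decision logic (the thresholds are the ones quoted for the SX-5 implementation; the function itself is illustrative, not NEC's actual code):

```python
# Thresholds as quoted for the SX-5 implementation; the selection
# logic itself is a simplified sketch, not the actual MPI/SX source.
SHORT_LIMIT = 1024      # bytes: below this, "short" protocol
EAGER_LIMIT = 100_000   # bytes: below this, "eager" protocol

def choose_protocol(message_length):
    """Pick a point-to-point protocol from the message length in bytes."""
    if message_length < SHORT_LIMIT:
        return "short"    # envelope + data into pre-allocated slots
    if message_length < EAGER_LIMIT:
        return "eager"    # envelope into a slot, data buffered on the fly
    return "long"         # rendezvous, then direct synchronous transfer

for n in (0, 512, 1024, 99_999, 100_000, 10_000_000):
    print(n, choose_protocol(n))
```

Each protocol has its own cost curve, so the overall timing curve is piecewise linear at best, which is exactly what defeats a single straight-line fit.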
Protocol changes are dangerous for curve fittings!
Famous example: the COMMS1 - COMMS3 benchmarks
Procurement at NEC: a customer requested a certain message latency, to be measured with COMMS1
Problem: simplistic curve fitting in COMMS1 ⇒ SX-4 latency was too high
Solution: we made our MPI SLOWER for long messages ⇒ tilt of the fitting line ⇒ much better latency ⇒ a happy customer!
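The effect is easy to reproduce numerically: if the fitted "latency" is the intercept of a straight line through the timing curve, slowing down only the longest messages tilts the line and pushes the intercept down. A sketch with hypothetical numbers (20 µs latency, 100 MB/s, a 5 ms penalty on the largest message; none of these are the real SX-4 figures):

```python
def fitted_latency(sizes, times):
    """Intercept of a least-squares line through (size, time) points."""
    n = len(sizes)
    ms, mt = sum(sizes) / n, sum(times) / n
    slope = (sum((s - ms) * (t - mt) for s, t in zip(sizes, times))
             / sum((s - ms) ** 2 for s in sizes))
    return mt - slope * ms

sizes = [0, 10_000, 100_000, 1_000_000]
# Hypothetical timings: 20 us latency, 100 MB/s bandwidth
times = [20e-6 + s / 100e6 for s in sizes]
# "Slower" MPI: add a 5 ms penalty to the largest message only
slow_times = times[:-1] + [times[-1] + 5e-3]

print(fitted_latency(sizes, times))       # recovers the true 20 us
print(fitted_latency(sizes, slow_times))  # intercept drops (here: below zero)
```

So a benchmark whose reported "latency" improves when the library is made slower is fitting an inadequate model, which is the point of this slide.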
The problem of averaging
Problem: it is difficult to measure the time for a single MPI operation:
· no global clock
· clock resolution too low
· difficult to differentiate between receive time and synchronization delay
Solution: averaging over many iterations
Implicit assumption: all iterations take the same time ⇒ averaging increases accuracy
This is not always true!
Example: the remote get operation shmem_get8 on the Cray T3E
Experiment with a message size of 10 Kbytes:
· 100 measurement runs (tests)
· In each test, ITER calls to shmem_get8 in a loop
· For each test, the time is divided by ITER ⇒ time for a single operation
The graph on the next slide shows the results for the 100 tests.
The reported results are courtesy of Prof. Glenn Luecke, Iowa State University, Ames, Iowa, U.S.A.
There are 7 huge spikes in the graph. Probable reason: operating system interruptions.
Should those tests be included in the average time?
A more subtle example: MPI_Reduce
Test setup:
· Root process plus six child processes
· Reduction on an 8-byte real
· Time is measured for 1000 iterations

      real*8 a, b, t1, t2
      ...
      t1 = MPI_WTIME()
      do i = 1, 1000
         call MPI_REDUCE(a, b, 1, MPI_DOUBLE_PRECISION,
     +                   MPI_MAX, 0, MPI_COMM_WORLD, IERROR)
      end do
      t2 = MPI_WTIME()

Let's compare the behavior of two MPI implementations:
First implementation:
· Asynchronous algorithm: messages flow from the senders up a tree toward the root (the receiver)
· Senders are not blocked, independent of the receiver's status
· Low synchronization delay
⇒ relatively good algorithm
Second implementation:
· Synchronous algorithm: step 1, synchronize all processes; step 2, perform the reduction as in the first algorithm
· Senders are blocked until the last process is ready
· High synchronization delay
⇒ relatively bad algorithm
What happens at runtime?
For the first MPI implementation:
Leaf processes are fastest, and are not blocked in any single operation
⇒ they quickly go through the loop
⇒ messages pile up at intermediate processes
⇒ the growing message queues slow those processes down
⇒ at some point flow control takes over, but progress is slow
This does not happen with the second (bad) MPI implementation!
The unnecessary synchronization keeps messages from piling up.
Important to remember:
· We want to measure the time for ONE reduction operation.
· We only use loop averaging to increase the accuracy.
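The pile-up can be captured in a toy arithmetic model, not a measurement: a leaf posts one message per loop iteration, and the root needs longer than that to drain each one, so the backlog grows with the iteration count. All timing values are invented for illustration.

```python
def simulate_backlog(iterations, leaf_time, root_time):
    """Toy model of the asynchronous reduce benchmark loop.

    A leaf finishes one iteration every `leaf_time` units and immediately
    posts its message; the root needs `root_time` units to drain each one.
    Returns the root's message backlog after all iterations are posted.
    """
    posted = iterations                              # messages fired off
    drained = (iterations * leaf_time) / root_time   # what the root manages
    return max(0.0, posted - drained)

# Asynchronous (good) implementation: the leaf is never blocked,
# so it outruns the root and the queue grows with the loop length.
print(simulate_backlog(1000, leaf_time=1.0, root_time=1.5))

# Synchronized (bad) implementation: the barrier forces the leaf
# down to the root's pace, so no backlog builds up.
print(simulate_backlog(1000, leaf_time=1.5, root_time=1.5))
```

The model shows why the benchmark loop itself creates the congestion: with a single reduction (iterations = 1) there is essentially no backlog in either case.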
Compound Benchmarks
Benchmark driver:
· user interface
· calls the individual components
· reports timings / statistics
· computes effective benchmark parameters
Components: MPI_Send / MPI_Recv, MPI_Bsend / MPI_Recv, MPI_Isend / MPI_Recv, MPI_Reduce, MPI_Bcast, . . .
· Benchmark suite executed in a single run
· Elegant and flexible to use
Can there be any problems? Guess what!
In a parallel program it is a good idea not to print results in every process (or the ordering is up to chance).
Alternative: send messages to process 0 and do all the printing there.
What if the active process sets change between benchmark components (MPI_Send / MPI_Recv, MPI_Bsend / MPI_Recv, MPI_Isend / MPI_Recv, MPI_Reduce, MPI_Bcast)?
We have seen a public benchmark suite in which:
· In every phase, every process sends a message to process 0: either the timing results for this phase, or the message ''I am not active in this phase''
· At the same time, process 0 is active in the ping-pong with process 1
Result: process 0 gets bombarded with ''unsolicited messages'' during the ping-pong.
Good MPI implementation: handles incoming comm. requests as early as possible ⇒ good progress
Bad MPI implementation: handles ''unsolicited messages'' only if nothing else is to be done ⇒ bad progress
In our case: the bad MPI implementation wins the ping-pong benchmark!
Summary: important to remember when writing an MPI benchmark:
Point-to-point performance is very sensitive to
· relative timing of send / receive
· the protocol (dependent on message length)
· contention / locks
· cache effects
⇒ ping-pong results have very limited value
Trade-off between progress and message latency (invisible in the ping-pong benchmark, can be important in real applications)
Parameter fittings usually lead to ''bogus'' results (they do not take protocol changes into account)
Summary (continued): important to remember when writing an MPI benchmark:
Loop averaging is highly problematic
· variance in single-operation performance is invisible
· messages from different iterations may lead to congestion
Keep single benchmarks separate
· different stages in a suite may interfere with each other
· the ''unsolicited message'' problem
It should not happen that making MPI slower ⇒ better benchmark results