Datacenter Fabric Workshop
Open MPI Overview and Current Status
Tim Woodall - LANL
Galen Shipman - LANL/UNM
August 22, 2005

Overview
• Point-to-Point Architecture
• OpenIB
  – Implementation
  – Results
• Future Work

Point-to-Point Architecture
• Component Architecture:
  – “Plug-ins” for different capabilities (e.g. different networks)
  – Tunable run-time parameters
• Three component frameworks:
  – Point-to-point messaging layer (PML) implements MPI semantics
  – Byte Transfer Layer (BTL) abstracts network interfaces
  – Memory Pool (mpool) provides for memory management/registration

PML Framework
• Single PML manages multiple BTL modules
  – Maintains set of BTLs on a per-peer basis
  – Message fragmentation and scheduling
• Implements MPI semantics
  – Synchronous / buffered / ready / normal sends
  – Persistent requests / Request completion
• Eager/Rendezvous protocol
  – Eager send of short messages
  – Configurable threshold (short vs. long)
  – Multiple long protocols
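
As a rough illustration of the eager/rendezvous split on this slide, the sketch below picks a protocol from the message size. The threshold value, the function name, and the protocol labels are hypothetical; they are not Open MPI identifiers or defaults.

```python
# Hypothetical sketch: choose a send protocol from the message size.
# EAGER_LIMIT and the returned labels are illustrative assumptions,
# not Open MPI identifiers or default values.

EAGER_LIMIT = 12 * 1024  # assumed short/long threshold, in bytes


def choose_protocol(nbytes, eager_limit=EAGER_LIMIT):
    """Return 'eager' for short messages and 'rendezvous' for long ones."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    # Short messages are sent right away, without waiting for the receiver.
    if nbytes <= eager_limit:
        return "eager"
    # Long messages announce themselves first; the bulk moves after the match.
    return "rendezvous"
```

Passing a different `eager_limit` stands in for the configurable threshold; which of the "multiple long protocols" follows the rendezvous is not modelled here.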

PML Protocols
• Send / receive pipeline to / from pre-registered buffers (non-contiguous data)
• MPI_Alloc_mem support
  – Red/black tree of memory registrations
  – BTL associated with registration is used by scheduler
  – Xfer of contiguous data with a single RDMA (after match)
• “Leave pinned” run-time parameter
  – Registration on first use
  – MRU cache (configurable size) of registrations
  – Bandwidth equivalent to pre-registered buffers (MPI_Alloc_mem)
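
The "leave pinned" idea (register on first use, then keep a bounded cache of recent registrations) can be sketched as below. This is a toy model only: the class name, the placeholder handle, and the choice to evict the least recently used entry are assumptions, and no real registration call is made.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy cache of memory registrations keyed by (address, length).

    Hypothetical sketch: the handle is a placeholder for whatever the
    network library would return, and evicting the least recently used
    entry is an assumed policy.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry first
        self.registrations = 0        # number of (simulated) registrations
        self.deregistrations = 0      # number of (simulated) deregistrations

    def lookup(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            # Hit: mark as most recently used and reuse the registration.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Miss: register on first use.
        self.registrations += 1
        handle = ("reg", addr, length)
        self.entries[key] = handle
        if len(self.entries) > self.capacity:
            # Cache is full: drop the least recently used registration.
            self.entries.popitem(last=False)
            self.deregistrations += 1
        return handle
```

Repeated sends from the same buffer hit the cache and skip registration, which is why bandwidth can approach that of pre-registered buffers.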

PML Protocols (Continued)
• Dynamic memory registration/deregistration
  – Fragment message and build pipeline of RDMA requests
  – Overlap [de-]registration with RDMA
  – Bandwidth reaches 97% of that of pre-registered memory at large message sizes (8 Mbytes)
  – Performance depends on bus type/bandwidth
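
One way to picture the overlap of [de-]registration with RDMA is a three-stage pipeline over message fragments. The sketch below only builds a schedule of which operations could run in the same step; the function names, the fixed fragment size, and the three-stage shape are assumptions for illustration, not the actual scheduler.

```python
def fragments(total_bytes, frag_bytes):
    """Split a message into (offset, length) fragments."""
    if total_bytes < 0 or frag_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    return [(off, min(frag_bytes, total_bytes - off))
            for off in range(0, total_bytes, frag_bytes)]


def pipeline_schedule(total_bytes, frag_bytes):
    """Return a list of steps; each step lists operations that may overlap.

    Assumed pipeline shape: in step t, fragment t is registered while
    fragment t-1 is transferred and fragment t-2 is deregistered.
    """
    count = len(fragments(total_bytes, frag_bytes))
    if count == 0:
        return []
    steps = []
    for t in range(count + 2):
        ops = []
        if t < count:
            ops.append(("register", t))
        if 0 <= t - 1 < count:
            ops.append(("rdma", t - 1))
        if 0 <= t - 2 < count:
            ops.append(("deregister", t - 2))
        steps.append(ops)
    return steps
```

With n fragments the schedule has n + 2 steps, so for large messages the registration cost is mostly hidden behind the transfers.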

BTL Framework
• MPI agnostic
• Provides simple API to upper layers
  – Tagged send/receive primitives
  – One-sided put/get operations
• Access to data type engine for zero copy data transfer
• BTL modules natively support commodity networks:
  – Current (self, shared memory, Myrinet GM/MX, InfiniBand mvapi/OpenIB, Portals, TCP)
  – Planned (LAPI, Quadrics Elan 4)

OpenIB BTL
• BTL module initialization
• Resource allocation
• Connection management
• Small message Xfer
• Large message Xfer
• OpenIB Issues
• Future Work

BTL module initialization
• A separate BTL module is initialized for each port on each HCA
• The PML schedules across these BTL modules just as it does for any other interconnect
• When multiple BTL modules exist, peers establish QP connections by matching subnets
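
Subnet matching between peers can be sketched as a simple join on a subnet identifier. The port names, the tuple layout, and the choice to pair every local port with every remote port on the same subnet are assumptions made for illustration.

```python
def match_by_subnet(local_ports, remote_ports):
    """Pair local and remote ports that share a subnet.

    Each port is a (name, subnet_id) tuple. Returns a list of
    (local_name, remote_name) pairs. Hypothetical sketch: pairing
    every local port with every remote port on the same subnet is an
    assumed policy.
    """
    remote_by_subnet = {}
    for name, subnet in remote_ports:
        remote_by_subnet.setdefault(subnet, []).append(name)

    pairs = []
    for name, subnet in local_ports:
        # Only ports on the same subnet are candidates for a QP connection.
        for remote_name in remote_by_subnet.get(subnet, []):
            pairs.append((name, remote_name))
    return pairs
```

Ports with no counterpart on the same subnet are simply left unpaired.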

Resource Allocation

SRQ Scalability
Nodes   Frag size (Kbytes)   # posted   RQ per QP (Mbytes)   K*   SRQ (Mbytes)
128     8                    64         64                   2    1
256     8                    64         128                  4    2
512     8                    64         256                  8    4
1024    8                    64         512                  10   5
* K: multiplier based on number of nodes
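
The figures in the table are consistent with the arithmetic below. This is inferred from the numbers rather than stated on the slide: per-QP receive queues appear to cost nodes × fragment size × posted receives, while the shared receive queue costs K × fragment size × posted receives. The function names are hypothetical.

```python
KBYTES_PER_MBYTE = 1024


def per_qp_rq_mbytes(nodes, frag_kbytes, posted):
    """Receive buffer memory with one receive queue per peer QP.

    Inferred formula: nodes * fragment size * posted receives.
    """
    return nodes * frag_kbytes * posted / KBYTES_PER_MBYTE


def srq_mbytes(k, frag_kbytes, posted):
    """Receive buffer memory with a shared receive queue.

    Inferred formula: K * fragment size * posted receives, where K is
    the multiplier from the table.
    """
    return k * frag_kbytes * posted / KBYTES_PER_MBYTE
```

For the 128-node row, `per_qp_rq_mbytes(128, 8, 64)` gives 64 Mbytes and `srq_mbytes(2, 8, 64)` gives 1 Mbyte, matching the table.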

Connection management
• Addressing information is exchanged dynamically via an OOB channel
  – This greatly improves scalability but at the cost of increased first message latency
  – Connections are established with peers in the same subnet (local subnet routing only)
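
The trade-off on this slide (no up-front connections, but a slower first message) can be illustrated with a connect-on-first-send wrapper. The class and the `exchange` callable are hypothetical stand-ins for the out-of-band address exchange; nothing here touches a real network.

```python
class LazyConnections:
    """Connect to a peer only when the first message is sent to it.

    Hypothetical sketch: `exchange` is any callable that takes a peer id
    and returns its addressing information, standing in for the OOB
    channel.
    """

    def __init__(self, exchange):
        self.exchange = exchange
        self.connected = {}   # peer id -> addressing information
        self.exchanges = 0    # number of OOB exchanges performed

    def send(self, peer, payload):
        if peer not in self.connected:
            # First message to this peer pays for the address exchange.
            self.connected[peer] = self.exchange(peer)
            self.exchanges += 1
        # Later messages reuse the established connection.
        return (self.connected[peer], payload)
```

Only peers that actually communicate consume connection resources, which is where the scalability benefit comes from.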

Small Message Xfer
– Maintain list of pre-registered fragments for send and recv
– List grows dynamically in chunks (more efficient to register)
– Small messages are copied to/from preregistered buffers
– Recv descriptors are posted as needed based on min/max thresholds
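
The two mechanisms on this slide, a fragment list that grows in chunks and receive descriptors reposted between thresholds, can be sketched as follows. All names are hypothetical, fragments are represented by plain integers, and the rule "refill to the maximum once the count drops to the minimum" is an assumed reading of "min/max thresholds".

```python
class FragmentFreeList:
    """Toy free list of pre-registered fragments that grows in chunks.

    Hypothetical sketch: fragments are integer ids, and one _grow()
    call stands in for registering a whole chunk of buffers at once.
    """

    def __init__(self, chunk):
        if chunk < 1:
            raise ValueError("chunk must be at least 1")
        self.chunk = chunk
        self.free = []
        self.total = 0       # fragments allocated so far
        self.grow_calls = 0  # number of chunk registrations

    def _grow(self):
        start = self.total
        self.free.extend(range(start, start + self.chunk))
        self.total += self.chunk
        self.grow_calls += 1

    def get(self):
        if not self.free:
            self._grow()
        return self.free.pop()

    def put(self, frag):
        self.free.append(frag)


def recvs_to_post(posted, low_water, high_water):
    """Number of receive descriptors to post (assumed refill policy)."""
    if low_water > high_water:
        raise ValueError("low_water must not exceed high_water")
    if posted <= low_water:
        return high_water - posted
    return 0
```

Growing by a chunk amortises the registration cost over many fragments, which is the efficiency argument made on the slide.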

Small Message Performance
Configuration                        Average latency
Open MPI - OpenIB - *optimized       5.13 usec
Open MPI - OpenIB - *defaults        5.43 usec
Open MPI - Mvapi - *optimized        5.64 usec
Open MPI - Mvapi - *defaults         5.94 usec
Mvapich - Mvapi (rdma/mem poll)      4.19 usec
Mvapich - Mvapi (send/recv)          6.51 usec
* Send/Recv based protocol

Large Message Xfer
• RDMA Write and RDMA Read are both supported
• RDMA Read provides better performance than RDMA Write because fewer control messages are needed
• RDMA pipeline protocol performance is highly dependent on I/O bus performance

Results: Open MPI/OpenIB - All [figure]

Results: Open MPI/OpenIB - All - Log [figure]

Results: Open MPI/OpenIB - Eager limit [figure]

Results: Combined Results [figure]

Results: Combined Results - Log [figure]

OpenIB Opportunities
– User level notification of VM activity
  • Caching of memory registrations can be dangerous
  • Need the ability to detect VM changes that affect memory registrations (such as sbrk and munmap)
– Reliable Multicast for collectives
– SRQ performance, 2/10 usec penalty, but who’s counting?

Future Work
• Small message RDMA (using working set of peers) - optional
• Dynamic connection management using Unreliable Datagrams
• Dynamic connection teardown - optional

Source Code Access
• Subversion repository
• Download client from:
  – http://subversion.tigris.org/
  – v1.2.1 or later
• Check out with:
  – svn co http://svn.open-mpi.org/svn/ompi/trunk ompi
  – Anonymous, read-only access

Questions?
Tim Woodall
Email: twoodall@lanl.gov
Phone: 505-665-5224
Galen Shipman
Email: gshipman@lanl.gov

Hardware Specs
• Dual Intel Xeon 3.2 GHz
  – 1024 KB Cache
• 2 Gbytes memory
• Bus: Intel Corp. E7525/E7520/E7320 PCI Express
• Mellanox Technologies MT25208 InfiniHost III Ex
• 288 Port Voltaire switch