
ChaMPIon/Pro™: A Multidevice MPI 1.2 Implementation for ASCI Blue and White Using LAPI and Shared Memory
Anthony Skjellum, Rossen Dimitrov, Bronis de Supinski, Andrew Watkins
October 11, 2001
ScicomP 4

Background
• MPI Software Technology is improving and expanding its commercial middleware, MPI/Pro, to create ChaMPIon/Pro for Ultrascale systems, including IBM-based ASCI systems
• MPI/Pro is deployed across a variety of parallel systems and clusters
• ChaMPIon/Pro is the next-generation commercial, scalable middleware product from MPI Software Technology, to include MPI-2 features

MPI/Pro Features
• Low CPU overhead (non-polling)
• Thread safety
• Asynchronous completion notification
• Independent message progress (complies fully with the MPI Progress Rule)
• Overlapping of computation and communication
• Efficient multi-device architecture

MPI/Pro Features
• Optimized persistent mode of communication
• Optimized derived datatypes
• Efficient mechanisms for message de-multiplexing
• Handles long-running complex programs reliably
• Broad conformance and performance testing
• OpenMP friendly

Miscellaneous MPI/Pro Features
• Supports Scyld ‘bproc’ technology
• Supports efficient Ethernet channel bonding
• Supports Etnus TotalView™
• Supports ROMIO now, native MPI I/O nearly complete

ChaMPIon/Pro’s Performance and Scalability Objectives
• Scaling to 10,000 processors or more
• Multi-device support
• Topology awareness
• Thread safety
• Optimized collective operations
• Optimized derived datatypes
• Efficient memory (and NIC resource) usage

ChaMPIon/Pro’s Functionality and Usability Objectives
• Integration with schedulers and resource managers
• Integration with tools (TotalView, Vampir, etc.)
• Functionality controlled by tunable parameters
• Tunable parameters not intended to fix problems
• User feedback

ChaMPIon/Pro 1.2 Library Architecture
• Top layer: ChaMPIon/Pro API
• Communication: Collectives, Point-to-point
• Devices: Portals, LAPI, SMP, Myrinet, Quadrics
• Other components: Groups, Attributes, Virtual topologies, Communicators, Datatypes, Error handlers

ChaMPIon/Pro vs. IBM MPI
• ChaMPIon/Pro is layered on top of IBM’s system software (LAPI); IBM has access to lower layers (MPL, kernel mechanisms)
• IBM’s MPI sits at the same or a lower level than LAPI
• Peak bandwidth of IBM’s MPI is higher than LAPI’s
• ChaMPIon/Pro, for Blue, has higher throughput for messages in the range of 4 K to 400 K
• ChaMPIon/Pro has lower latency for short messages (up to 1 K) using the SMP device

ChaMPIon/Pro Protocols: LAPI Device
• Short protocol: LAPI_Amsend, LAPI header handler, and LAPI counters; one copy at receiver
• Long protocol: LAPI_Amsend, LAPI header handler, LAPI counters, and LAPI_Get; no copies

ChaMPIon/Pro Protocols: SMP Device
• Short protocol: System V shared memory, fast locks; one intermediate copy in shared memory (two memcopies)
• Long protocol: uses the LAPI device in loopback mode; the intermediate copy in shared memory limits bandwidth, and LAPI loopback is faster than two memcopies
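The protocol choices on the LAPI and SMP device slides can be summarized as a small decision function. What follows is a minimal sketch, not ChaMPIon/Pro code: the `SHORT_LIMIT` cutoff, the `Path` type, and the `choose_path` function are assumptions made for illustration, since the slides do not say where the short/long boundary lies. The copy counts are the ones stated on the slides.

```python
from typing import NamedTuple, Optional

# Hypothetical cutoff between the short and long protocols, in bytes.
# The slides do not state the real value.
SHORT_LIMIT = 1024


class Path(NamedTuple):
    device: str            # transport that carries the data
    protocol: str          # "short" or "long"
    copies: Optional[int]  # data copies as stated on the slides; None if not stated


def choose_path(nbytes: int, same_node: bool) -> Path:
    """Pick a device and protocol for one message (illustrative only)."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    short = nbytes <= SHORT_LIMIT
    if same_node:
        if short:
            # System V shared memory: copy into the segment, then copy out.
            return Path("SMP", "short", 2)
        # Long on-node messages are handed to the LAPI device in loopback mode.
        return Path("LAPI (loopback)", "long", None)
    if short:
        # LAPI_Amsend plus header handler: one copy at the receiver.
        return Path("LAPI", "short", 1)
    # LAPI_Amsend announces the message, LAPI_Get pulls the data: no copies.
    return Path("LAPI", "long", 0)
```

For example, `choose_path(64, same_node=True)` selects the shared-memory short protocol, while a 1 MiB message between two processes on the same node is routed through LAPI loopback.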

SMP Device Features
• ChaMPIon/Pro employs a fast locking mechanism, using PowerPC 604 assembler, for lower SMP overhead
• ChaMPIon/Pro has lower latency than IBM’s MPI for short messages (< 1 K) in SMP mode on Blue
• Scalable shared memory device: buffer usage does not expand quadratically or even linearly with NP per node
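To see why the scaling claim matters, the sketch below compares shared-memory buffer footprints under three allocation schemes. The schemes, the function names, the 16 KiB buffer size, and the pool size are all illustrative assumptions; the slides do not describe how ChaMPIon/Pro lays out its shared segment, only that usage grows neither quadratically nor linearly with the number of processes (NP) per node. A fixed pool is just one layout that would have that property.

```python
BUF_BYTES = 16 * 1024  # hypothetical size of one message buffer


def pairwise_bytes(np_per_node: int) -> int:
    """One buffer per ordered (sender, receiver) pair: quadratic growth."""
    return np_per_node * (np_per_node - 1) * BUF_BYTES


def per_process_bytes(np_per_node: int) -> int:
    """One inbound buffer per process: linear growth."""
    return np_per_node * BUF_BYTES


def fixed_pool_bytes(np_per_node: int, pool_slots: int = 8) -> int:
    """A pool of fixed size shared by all processes: independent of NP."""
    return pool_slots * BUF_BYTES


if __name__ == "__main__":
    for np_per_node in (4, 16):
        print(
            np_per_node,
            pairwise_bytes(np_per_node),
            per_process_bytes(np_per_node),
            fixed_pool_bytes(np_per_node),
        )
```

Going from 4 to 16 processes per node multiplies the pairwise footprint by 20 and the per-process footprint by 4, while the fixed pool stays the same size.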

Collective Operations
Multi-hierarchy Operations
[Diagram: operation classes (Class 1, Class 2, Class 3) across hierarchy levels (Level 2, Level 1, Level 0); operations shown: Bcast, Reduce, Gather, Scatter, Bcast]
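The layout of this slide's diagram did not survive extraction, so the following is only a generic illustration of a two-level broadcast on a cluster of SMP nodes, not the ChaMPIon/Pro algorithm. It assumes ranks are placed on nodes in contiguous blocks and that the lowest rank on each node acts as its leader; `two_level_bcast`, `node_of`, and their parameters are hypothetical names.

```python
from typing import List, Tuple

Message = Tuple[int, int]  # (source rank, destination rank)


def node_of(rank: int, ranks_per_node: int) -> int:
    """Node index of a rank under contiguous block placement (assumed)."""
    return rank // ranks_per_node


def two_level_bcast(
    nranks: int, ranks_per_node: int, root: int = 0
) -> Tuple[List[Message], List[Message]]:
    """Return (cross-node messages, on-node messages) for a broadcast.

    Stage 1: the root sends to the leader of every other node.
    Stage 2: on each node, the rank holding the data sends to its local peers.
    Both stages use a plain linear fan-out to keep the sketch short.
    """
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")
    if not 0 <= root < nranks:
        raise ValueError("root must be a valid rank")
    root_node = node_of(root, ranks_per_node)
    cross: List[Message] = []
    local: List[Message] = []
    for first in range(0, nranks, ranks_per_node):
        node = node_of(first, ranks_per_node)
        # On the root's node the root already holds the data;
        # elsewhere the node leader (lowest rank) receives it.
        holder = root if node == root_node else first
        if node != root_node:
            cross.append((root, holder))
        last = min(first + ranks_per_node, nranks)
        local.extend((holder, r) for r in range(first, last) if r != holder)
    return cross, local
```

With 8 ranks on two 4-way nodes and root 0, only one of the seven messages crosses between nodes; a flat linear broadcast from rank 0 would send four messages off-node.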

Cross-Box Latency (Snow)

Cross-Box Bandwidth (Snow)

On-Box (SMP) Latency (Blue)

Scatter Performance (Long Messages, Blue)

On-Going Work
• Completion of MPI I/O capability (in final debugging)
• 1-sided support
• Performance-revealing enhancements
• Rest of MPI-2

Wish List for Enhancing ChaMPIon/Pro for IBM Systems
• Access to one-copy protocol between unrelated processes in shared memory
• Access to KLAPI or other optimal-level transports, as IBM’s MPI is able to exploit
• Ability to fine-tune SMP scheduling for collective optimization

Conclusions
• ChaMPIon/Pro is a competitive solution for IBM SP systems, even though it works at a disadvantage today compared to IBM’s MPI
• ChaMPIon/Pro is a performance-portable MPI 1.2 system that will soon provide MPI-2 capabilities
• Many value-added improvements will follow in coming years, including further optimizations
• Work over the next 18 months will support both large-scale and medium-scale SP systems, and other DOE ASCI systems

Questions?
Work performed in part under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under Contract W-7405-Eng-48, UCRL-VG-145222.

Contacts
• Dr. Anthony Skjellum, President, 662-320-4300 x15, tony@mpi-softtech.com
• Dr. Bronis de Supinski, LLNL/CASC, 925-422-1062, bronis@llnl.gov
• Dr. Rossen Dimitrov, Principal Software Engineer, 603-891-4766, rossen@mpi-softtech.com
• Andrew Watkins, Senior Software Engineer, 662-320-4300 x23, andrew@mpi-softtech.com