Portable MPI and Related Parallel Development Tools
Rusty Lusk
Mathematics and Computer Science Division, Argonne National Laboratory
(The rest of our group: Bill Gropp, Rob Ross, David Ashton, Brian Toonen, Anthony Chan)
Outline
• MPI
  – What is it? Where did it come from?
  – One implementation
  – Why has it “succeeded”?
• Case study: an MPI application
  – Portability
  – Libraries
  – Tools
• Future developments in parallel programming
  – MPI development
  – Languages
  – Speculative approaches
What is MPI?
• A message-passing library specification
  – extended message-passing model
  – not a language or compiler specification
  – not a specific implementation or product
• For parallel computers, clusters, and heterogeneous networks
• Full-featured
• Designed to provide access to advanced parallel hardware for
  – end users
  – library writers
  – tool developers
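To make the “library, not language” point concrete, here is a minimal sketch of an MPI program in C; it uses only the standard header and four MPI calls, and nothing in it is specific to any particular implementation.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut MPI down */
    return 0;
}
```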
Where Did MPI Come From?
• Early vendor systems (NX, EUI, CMMD) were not portable.
• Early portable systems (PVM, p4, TCGMSG, Chameleon) were mainly research efforts.
  – Did not address the full spectrum of message-passing issues
  – Lacked vendor support
  – Were not implemented at the most efficient level
• The MPI Forum organized in 1992 with broad participation by vendors, library writers, and end users.
• MPI Standard (1.0) released June 1994; many implementation efforts followed.
• MPI-2 Standard (1.2 and 2.0) released July 1997.
Informal Status Assessment
• All MPP vendors now have MPI-1 (1.0, 1.1, or 1.2).
• Public implementations (MPICH, LAM, CHIMP) support heterogeneous workstation networks.
• MPI-2 implementations are being undertaken now by all vendors.
• MPI-2 is harder to implement than MPI-1 was.
• MPI-2 implementations will appear piecemeal, with I/O first.
MPI Sources
• The Standard itself:
  – At http://www.mpi-forum.org
  – All MPI official releases, in both PostScript and HTML
• Books on MPI and MPI-2:
  – Using MPI: Portable Parallel Programming with the Message-Passing Interface (2nd edition), by Gropp, Lusk, and Skjellum, MIT Press, 1999.
  – Using MPI-2: Extending the Message-Passing Interface, by Gropp, Lusk, and Thakur, MIT Press, 1999.
  – MPI: The Complete Reference, volumes 1 and 2, MIT Press, 1999.
• Other information on the Web:
  – At http://www.mcs.anl.gov/mpi
  – Pointers to lots of material, including other talks and tutorials, a FAQ, and other MPI pages
The MPI Standard Documentation
Tutorial Material on MPI, MPI-2
The MPICH Implementation of MPI
• As a research project: exploring tradeoffs between performance and portability; conducting research in implementation issues.
• As a software project: providing a free MPI implementation on most machines; enabling vendors and others to build complete MPI implementations on top of their own communication services.
• MPICH 1.2.2 just released, with complete MPI-1, parts of MPI-2 (I/O and C++), and a port to Windows 2000.
• Available at http://www.mcs.anl.gov/mpich
Lessons From MPI: Why Has It Succeeded?
• The MPI Process
• Portability
• Performance
• Simplicity
• Modularity
• Composability
• Completeness
The MPI Process
• Started with an open invitation to all those interested in standardizing the message-passing model
• Participation from
  – Parallel computing vendors
  – Computer scientists
  – Application scientists
• Open process
  – All invited, but hard work required
  – All deliberations available at all times
• Reference implementation developed during the design process
  – Helped debug the design
  – Immediately available when the design was completed
Portability
• The most important property of a programming model for high-performance computing
  – Application lifetimes are 5 to 20 years
  – Hardware lifetimes are much shorter
  – (not to mention corporate lifetimes!)
• Need not lead to a lowest-common-denominator approach
• Example: MPI semantics allow direct copy of data from a user-space send buffer to a user-space receive buffer (see the sketch below)
  – Might be implemented by a hardware data mover
  – Might be implemented by network hardware
  – Might be implemented by sockets
• The hard part: portability with performance
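A sketch of that buffer-to-buffer point: the standard send/receive code below is unchanged whether the implementation moves the data with a hardware data mover, network hardware, or sockets. The buffer size and message tag are arbitrary illustration values.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank;
    double buf[1000];                       /* user-space buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < 1000; i++) buf[i] = i;   /* dummy data */
        /* The implementation may copy this directly into the receiver's
           buffer (shared memory, RDMA) or go through sockets; the source
           code does not change. */
        MPI_Send(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```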
Performance
• MPI can help manage the crucial memory hierarchy
  – Local vs. remote memory is explicit
  – A received message is likely to be in cache
• MPI provides collective operations, for both communication and computation, that hide the complexity or non-portability of scalable algorithms from the programmer (example below)
• Can interoperate with optimizing compilers
• Promotes use of high-performance libraries
• Doesn’t provide performance portability
  – This problem is still too hard, even for the best compilers
  – E.g., the BLAS
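For example, a single collective call hides the choice and tuning of the underlying reduction algorithm (tree, recursive doubling, and so on). A minimal sketch, with made-up local data: each process computes a partial sum and MPI_Allreduce combines them.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int i, rank;
    double x[100], local = 0.0, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++) x[i] = rank + 1.0;     /* dummy local data */
    for (i = 0; i < 100; i++) local += x[i] * x[i];  /* local partial sum */

    /* The implementation chooses the scalable reduction algorithm. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```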
Simplicity
• Simplicity is in the eye of the beholder
  – MPI-1 has about 125 functions
    • Too big!
    • Too small!
  – MPI-2 has about 150 more
  – Even this is not very many by comparison
• Few applications use all of MPI
  – But few MPI functions go unused
• One can write serious MPI programs with as few as six functions
  – Other programs use a different six…
• Economy of concepts
  – Communicators encapsulate both process groups and contexts
  – Datatypes both enable heterogeneous communication and allow non-contiguous message buffers (see the sketch below)
• Symmetry helps make MPI easy to understand.
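A sketch of the datatype idea referenced above: a derived datatype describes one column of a row-major matrix, so the non-contiguous data can be sent and received with single calls and no hand packing. The matrix size and the choice of column are arbitrary.

```c
#include <mpi.h>

#define N 8   /* matrix dimension, chosen only for illustration */

int main(int argc, char **argv)
{
    int i, j, rank;
    double a[N][N];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = rank * 100 + i * N + j;   /* dummy data */

    /* N blocks of 1 double, separated by a stride of N doubles:
       exactly one column of a row-major N x N matrix. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);  /* column 2 */
    else if (rank == 1)
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```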
Modularity
• Modern applications often combine multiple parallel components.
• MPI supports component-oriented software through its use of communicators (see the sketch below).
• Support for libraries means applications may contain no MPI calls at all.
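A sketch of the usual communicator pattern behind this (the library name and handle type here are hypothetical): at initialization a library duplicates the caller's communicator, so the library's internal messages can never be confused with the application's, even if tags collide.

```c
#include <mpi.h>

/* Hypothetical parallel library fragment. */
typedef struct {
    MPI_Comm comm;   /* private communication context for the library */
} mylib_handle;

int mylib_init(MPI_Comm user_comm, mylib_handle *h)
{
    /* A duplicated communicator has its own context: messages sent on
       h->comm can only be received on h->comm. */
    return MPI_Comm_dup(user_comm, &h->comm);
}

int mylib_finalize(mylib_handle *h)
{
    return MPI_Comm_free(&h->comm);
}
```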
Composability
• MPI works with other tools
  – Compilers
    • Since it is a library
  – Debuggers
    • Debugging interface used by MPICH, TotalView, and others
  – Profiling tools
    • The MPI “profiling interface” is part of the standard (see the sketch below)
• MPI-2 provides precise interaction with multithreaded programs
  – MPI_THREAD_SINGLE
  – MPI_THREAD_FUNNELED (OpenMP loops)
  – MPI_THREAD_SERIALIZED (OpenMP single)
  – MPI_THREAD_MULTIPLE
• The interface provides for both portability and performance
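A sketch of how tools use the profiling interface: every MPI function also has a PMPI_-prefixed entry point, so a profiling library can intercept calls and then forward them, without modifying the application or the MPI implementation. (The const-qualified binding shown is the modern one; the counter is only for illustration — a real tool would report or log it.)

```c
#include <mpi.h>

static int send_count = 0;   /* events recorded by this toy profiler */

/* Our definition of MPI_Send shadows the library's; the real work is
   still done by the name-shifted PMPI_Send entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    send_count++;   /* record the event (a real tool would timestamp it) */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```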
Completeness
• MPI provides a complete programming model.
• Any parallel algorithm can be expressed.
• Collective operations operate on subsets of processes.
• Easy things are not always easy, but hard things are possible.
The Center for the Study of Astrophysical Thermonuclear Flashes
• Goal: to simulate matter accumulation on the surface of compact stars, nuclear ignition of the accreted (and possibly underlying stellar) material, and the subsequent evolution of the star’s interior, surface, and exterior
• X-ray bursts (on neutron star surfaces)
• Novae (on white dwarf surfaces)
• Type Ia supernovae (in white dwarf interiors)
FLASH Scientific Results
• Characteristics of the problem:
  – Wide range of compressibility
  – Wide range of length and time scales
  – Many interacting physical processes
  – Only indirect validation possible
  – Rapidly evolving computing environment
  – Many people in collaboration
• Results so far include:
  – Flame-vortex interactions
  – Compressible turbulence
  – Laser-driven shock instabilities
  – Nova outbursts on white dwarfs
  – Richtmyer-Meshkov instability
  – Cellular detonations
  – Helium burning on neutron stars
  – Rayleigh-Taylor instability
• Gordon Bell prize at SC 2000
The FLASH Code: MPI in Action
• Solves complex systems of equations for hydrodynamics and nuclear burning
• Written primarily in Fortran 90
• Uses the Paramesh library for adaptive mesh refinement; Paramesh is implemented with MPI
• I/O (for checkpointing, visualization, and other purposes) done with the HDF5 library, which is implemented with MPI-IO (see the sketch below)
• Debugged with TotalView, using the standard debugger interface
• Tuned with Jumpshot and Vampir, using the MPI profiling interface
• Gordon Bell prize winner in 2000
• Portable to all parallel computing environments (since it uses MPI)
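Underneath a parallel HDF5 checkpoint, the writes ultimately become MPI-IO calls roughly like the hedged sketch below, in which each process writes its block of a shared file with a collective call. The file name, block size, and data layout are invented for illustration.

```c
#include <mpi.h>

#define NDOUBLES 1024   /* per-process block size (made up) */

int main(int argc, char **argv)
{
    int i, rank;
    double block[NDOUBLES];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < NDOUBLES; i++) block[i] = rank;   /* dummy data */

    /* Each process writes its block at a rank-dependent offset of one
       shared file; the collective call lets the implementation merge and
       optimize the requests. */
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    offset = (MPI_Offset)rank * NDOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, NDOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```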
FLASH Scaling Runs
X-Ray Burst on the Surface of a Neutron Star
Showing the AMR Grid
MPI Performance Visualization with Jumpshot
• For detailed analysis of parallel program behavior, timestamped events are collected into a log file during the run.
• A separate display program (Jumpshot) aids the user in conducting a post-mortem analysis of program behavior.
• Log files can become large, making it impossible to inspect the entire program at once.
• The FLASH project motivated an indexed file format (SLOG) that uses a preview to select a time of interest and quickly display an interval.
(Diagram: processes write a logfile, which the Jumpshot display reads.)
Removing Barriers From Paramesh
Using Jumpshot
• MPI functions and messages automatically logged
• User-defined states
• Nested states
• Zooming and scrolling
• Spotting opportunities for optimization
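A hedged sketch of how user-defined states are typically logged, assuming the MPE logging library distributed with MPICH (exact prototypes should be checked against the MPE documentation): the program obtains a pair of event numbers, describes them as a named state, and brackets the timed work with log calls; Jumpshot then draws that state as colored bars on the timeline.

```c
#include <mpi.h>
#include "mpe.h"   /* MPE logging library distributed with MPICH */

int main(int argc, char **argv)
{
    int ev_start, ev_end, i;

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Define a user state; Jumpshot will draw it as a colored bar. */
    ev_start = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "compute", "red");

    for (i = 0; i < 10; i++) {
        MPE_Log_event(ev_start, i, "begin compute");
        /* ... application work being timed ... */
        MPE_Log_event(ev_end, i, "end compute");
    }

    MPE_Finish_log("mylog");   /* log file base name is made up */
    MPI_Finalize();
    return 0;
}
```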
Future Developments in Parallel Programming: MPI and Beyond
• MPI is not perfect.
• Any widely used replacement will have to share the properties that made MPI a success.
• Some directions (in increasing order of speculativeness):
  – Improvements to MPI implementations
  – Improvements to the MPI definition
  – Continued evolution of libraries
  – Research and development for parallel languages
  – Further out: radically different programming models for radically different architectures
MPI Implementations
• Implementations beget implementation research
  – Datatypes, I/O, memory-motion elimination
• Better collective operations on most platforms
  – Most MPI implementations build collectives on point-to-point, which is too high-level
  – Need stream-oriented methods that understand MPI datatypes
• Optimize for new hardware
  – In progress for VIA, InfiniBand
  – Need more emphasis on collective operations
  – Off-loading message processing onto the NIC
• Scaling beyond 10,000 processes
• Parallel I/O
  – Clusters
  – Remote I/O
• Fault tolerance
  – Intercommunicators provide an approach
• Working with multithreading approaches
Improvements to MPI Itself
• Better remote-memory-access interface
  – Simpler for some simple operations
  – Atomic fetch-and-increment
• Some minor fixup already in progress
  – MPI 2.1
• Building on experience with MPI-2
• Interactions with compilers
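For context on why a simpler interface is wanted, here is a hedged sketch of a shared-counter update with the existing MPI-2 RMA interface; note the window and fence machinery needed for one atomic increment, and that a true fetch-and-increment (returning the old value) takes considerably more code still, which is the gap the slide points at.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, counter = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes an int; only the one on rank 0 is used. */
    MPI_Win_create(&counter, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* Atomic (per-element) remote update of rank 0's counter. */
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("counter = %d (should equal the number of processes)\n",
               counter);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```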
Libraries and Languages
• General libraries
  – Global Arrays
  – PETSc
  – ScaLAPACK
• Application-specific libraries
• Most are built on MPI, at least for the portable version.
More Speculative Approaches
• HTMT for Petaflops
• Blue Gene
• PIMs
• MTA
• All will need a programming model that explicitly manages a deep memory hierarchy.
• Exotic + small benefit = dead
Summary
• MPI is a successful example of a community defining, implementing, and adopting a standard programming methodology.
• It happened because of the open MPI process, the MPI design itself, and early implementation.
• MPI research continues to refine implementations on modern platforms, and this is the “main road” ahead.
• Tools that work with MPI programs are thus a good investment.
• MPI provides portability and performance for complex applications on a variety of architectures.