MPJ Express: An Implementation of Message Passing Interface (MPI) in Java
Aamir Shafi
http://mpj-express.org
http://acet.rdg.ac.uk/projects/mpj
11/29/2020
Writing Parallel Software
• Parallel software is software that can be executed on parallel hardware to exploit computational and memory resources
• There are two main approaches to writing parallel software:
• The first approach is to use messaging libraries (packages) written in existing languages like C, Fortran, and Java:
  • Message Passing Interface (MPI)
  • Parallel Virtual Machine (PVM)
• The second and more radical approach is to provide new languages:
  • HPC has a history of novel parallel languages
  • High Performance Fortran (HPF)
  • Unified Parallel C (UPC)
• This talk covers an implementation of MPI in Java called MPJ Express
Introduction to Java for HPC
• Java was released by Sun in 1996:
  • A mainstream language in the software industry,
• Attractive features include:
  • Portability,
  • Automatic garbage collection,
  • Type-safety at compile time and runtime,
  • Built-in support for multi-threading:
    • A possible option to provide nested parallelism on multi-core systems,
• Performance:
  • The Java compiler converts source code to byte code,
  • Modern JVMs use Just-In-Time compilers to compile byte code to native machine code on the fly,
  • But Java has safety features that may limit performance.
Introduction to Java for HPC
• Three existing approaches to Java messaging:
  • Pure Java (sockets based),
  • Java Native Interface (JNI), and
  • Remote Method Invocation (RMI),
• mpiJava has been perhaps the most popular Java messaging system:
  • mpiJava (http://www.hpjava.org/mpiJava.html)
  • MPJ/Ibis (http://www.cs.vu.nl/ibis/mpj.html)
• Motivation for a new Java messaging system:
  • Maintain compatibility with Java threads by providing thread-safety,
  • Reconcile the conflicting goals of high performance and portability.
Distributed Memory Cluster
[Figure: processes Proc 0 to Proc 7, each with its own CPU and memory, exchanging messages over a LAN (Ethernet, Myrinet, Infiniband, etc.)]
Write machines files
Bootstrap MPJ Express runtime
Write Parallel Program
Compile and Execute
Introduction to MPJ Express
• MPJ Express is an implementation of a Java messaging system, based on Java bindings:
  • Will eventually supersede mpiJava.
• Aamir Shafi, Bryan Carpenter, and Mark Baker
• Thread-safe communication devices using Java NIO and Myrinet:
  • Maintain compatibility with Java threads,
• The buffering layer provides explicit memory management instead of relying on the garbage collector,
• Runtime system for portable bootstrapping
James Gosling Says…
Who is using MPJ Express?
• First released in September 2005 under LGPL (an open-source licence):
  • Approximately 1000 users all around the world
• Some projects using this software:
  • Cartablanca is a simulation package that uses Jacobian-Free Newton-Krylov (JFNK) methods to solve non-linear problems:
    • The project is carried out at Los Alamos National Lab (LANL) in the US
  • Researchers at the University of Leeds, UK have used this software in the Modelling and Simulation in e-Social Science (MoSeS) project
• Teaching purposes:
  • Parallel Programming using Java (PPJ):
    • http://www.sc.rwth-aachen.de/Teaching/Labs/PPJ05/
  • Parallel Processing SS 2006:
    • http://tramberend.inform.fh-hannover.de/
MPJ Express Design
Presentation Outline
• Implementation Details:
  • Point-to-point communication
  • Communicators, groups, and contexts
  • Process topologies
  • Derived datatypes
  • Collective communications
• MPJ Express Buffering Layer
• Runtime System
• Performance Evaluation
Java NIO Device
• Uses non-blocking I/O functionality,
• Implements two communication protocols:
  • Eager-send protocol for small messages,
  • Rendezvous protocol for large messages,
• Locks around communication methods result in deadlocks:
  • In Java, the keyword synchronized ensures that only one thread at a time can execute a synchronized method on a given object,
  • Example: a process sending a message to itself using synchronous send,
• Locks for thread-safety:
  • Writing messages:
    • A lock for send-communication-sets,
    • Locks for destination channels:
      • One for every destination process,
      • Obtained one after the other,
  • Reading messages:
    • A lock for receive-communication-sets.
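The per-destination write locks listed above can be pictured with a small, self-contained sketch. This is not MPJ Express code: the class `DestinationLocks`, its methods, and the in-memory `StringBuilder` channels are hypothetical stand-ins, assuming one `ReentrantLock` per destination process so that only writers to the same destination are serialized.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of per-destination locking; not part of MPJ Express.
class DestinationLocks {
    private final ReentrantLock[] channelLocks; // one lock per destination process
    private final StringBuilder[] channels;     // in-memory stand-in for socket channels

    DestinationLocks(int nprocs) {
        channelLocks = new ReentrantLock[nprocs];
        channels = new StringBuilder[nprocs];
        for (int i = 0; i < nprocs; i++) {
            channelLocks[i] = new ReentrantLock();
            channels[i] = new StringBuilder();
        }
    }

    // Writers to different destinations never contend for the same lock.
    void send(int dest, String message) {
        ReentrantLock lock = channelLocks[dest];
        lock.lock();
        try {
            channels[dest].append(message).append('\n');
        } finally {
            lock.unlock();
        }
    }

    // Counts the complete (newline-terminated) messages written to a destination.
    int messageCount(int dest) {
        ReentrantLock lock = channelLocks[dest];
        lock.lock();
        try {
            int count = 0;
            for (int i = 0; i < channels[dest].length(); i++) {
                if (channels[dest].charAt(i) == '\n') {
                    count++;
                }
            }
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```

Because each lock guards a single destination, a thread blocked while writing to one process does not stop other threads from writing elsewhere, which is the property a single method-wide `synchronized` would not give.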
Standard mode with eager send protocol (small messages)
Standard mode with rendezvous protocol (large messages)
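As a rough illustration of how the two protocols differ, the sketch below lists the message exchanges each one involves in the usual textbook form. The class name, the step descriptions, and in particular the `EAGER_LIMIT_BYTES` threshold are assumptions made for this example; the actual switch-over point and wire format are implementation-specific.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of protocol selection by message size.
class ProtocolSketch {
    // Assumed threshold for illustration only.
    static final int EAGER_LIMIT_BYTES = 128 * 1024;

    static List<String> steps(int messageBytes) {
        List<String> steps = new ArrayList<>();
        if (messageBytes <= EAGER_LIMIT_BYTES) {
            // Eager send: data travels immediately, the receiver buffers if needed.
            steps.add("sender: send header and data immediately");
            steps.add("receiver: match against a posted receive or buffer the data");
        } else {
            // Rendezvous: data travels only after the receiver is ready.
            steps.add("sender: send control message (ready to send)");
            steps.add("receiver: reply once a matching receive is posted");
            steps.add("sender: send the data");
        }
        return steps;
    }
}
```

The trade-off is that eager send avoids a round trip for small messages, while rendezvous avoids buffering large messages at the receiver.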
MPJ Express Buffering Layer
• MPJ Express requires a buffering layer:
  • To use Java NIO:
    • SocketChannels use byte buffers for data transfer,
  • To use proprietary networks like Myrinet efficiently,
  • To implement derived datatypes,
• Various implementations are possible based on the actual storage medium:
  • Direct or indirect ByteBuffers,
• An mpjbuf buffer object consists of:
  • A static buffer to store primitive datatypes,
  • A dynamic buffer to store serialized Java objects,
• Creating ByteBuffers on the fly is costly:
  • Memory management is based on Knuth’s buddy algorithm,
  • Two implementations of memory management.
MPJ Express Buffering Layer
• Frequent creation and destruction of communication buffers hurts performance.
• To tackle this, MPJ Express requires a buffering layer:
  • Provides two implementations of Knuth’s buddy algorithm,
  • To use Java NIO and proprietary networks:
    • Direct ByteBuffers,
  • To implement derived datatypes
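To make the buddy algorithm concrete, here is a minimal sketch of a buddy allocator that hands out offsets within one pre-allocated region. It is not the mpjbuf implementation: the class `BuddyAllocatorSketch`, the order-based free lists, and the sizes used are assumptions for illustration, and a real buffering layer would slice a `ByteBuffer` at the returned offset instead of returning bare integers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical buddy allocator over a region of 2^maxOrder bytes.
class BuddyAllocatorSketch {
    private final int minOrder; // smallest block is 2^minOrder bytes
    private final int maxOrder; // whole region is 2^maxOrder bytes
    private final List<Deque<Integer>> free = new ArrayList<>();     // free offsets per order
    private final Map<Integer, Integer> allocated = new HashMap<>(); // offset -> order

    BuddyAllocatorSketch(int minOrder, int maxOrder) {
        this.minOrder = minOrder;
        this.maxOrder = maxOrder;
        for (int o = 0; o <= maxOrder; o++) {
            free.add(new ArrayDeque<>());
        }
        free.get(maxOrder).push(0); // initially one free block spanning the region
    }

    // Returns the offset of a block of at least `size` bytes, or -1 if none fits.
    int allocate(int size) {
        int order = minOrder;
        while (order <= maxOrder && (1 << order) < size) {
            order++;
        }
        if (order > maxOrder) {
            return -1;
        }
        int o = order;
        while (o <= maxOrder && free.get(o).isEmpty()) {
            o++; // look for the smallest larger block that is free
        }
        if (o > maxOrder) {
            return -1;
        }
        int offset = free.get(o).pop();
        while (o > order) {
            o--;
            free.get(o).push(offset + (1 << o)); // split: upper half becomes a free buddy
        }
        allocated.put(offset, order);
        return offset;
    }

    // Frees a block and merges it with its buddy for as long as the buddy is free.
    void release(int offset) {
        Integer order = allocated.remove(offset);
        if (order == null) {
            throw new IllegalArgumentException("not allocated: " + offset);
        }
        int o = order;
        while (o < maxOrder) {
            int buddy = offset ^ (1 << o); // buddy address differs in exactly one bit
            if (!free.get(o).remove(Integer.valueOf(buddy))) {
                break;
            }
            offset = Math.min(offset, buddy);
            o++;
        }
        free.get(o).push(offset);
    }

    int freeBlocksOfOrder(int order) {
        return free.get(order).size();
    }
}
```

Reusing blocks from such a pool avoids allocating a fresh buffer for every message, which is the cost the slide attributes to creating ByteBuffers on the fly.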
Communicators, groups, and contexts
• MPI provides a higher level abstraction to create parallel libraries:
  • Safe communication space
  • Group scope for collective operations
  • Process naming
• Communicators + Groups provide:
  • Process naming (instead of IP address + ports)
  • Group scope for collective operations
• Contexts:
  • Safe communication
What is a group?
• A data-structure that contains processes
• Main functionality:
  • Keep track of ranks of processes
• Explanation of figure:
  • Group A contains eight processes
  • Groups B and C are created from Group A
  • All group operations are local (no communication with remote processes)
Example of a group operation (Union)
• Explanation of union operation:
  • Two processes, a and d, are in both groups:
    • Thus, six processes are executing this operation
  • Each group has its own view of this group operation:
    • Apply theory of relativity
• Re-assigning ranks in new groups:
  • Process 0 in group A is re-assigned rank 0 in Group C
  • Process 0 in group B is re-assigned rank 4 in Group C
• If any existing process does not make it into the new group, it returns MPI.GROUP_EMPTY
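The rank re-assignment in a union can be sketched with ordinary Java collections. This follows the MPI rule that the union contains all members of the first group in order, followed by the members of the second group that are not in the first. The class `GroupUnionSketch` and the use of process names as strings are assumptions for illustration, not the MPJ Express `Group` class.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch: a group is an ordered list of process names,
// and a process's rank is its index in that list.
class GroupUnionSketch {
    // First group's members keep their order; new members of the second group follow.
    static List<String> union(List<String> first, List<String> second) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(first);
        merged.addAll(second);
        return new ArrayList<>(merged);
    }

    // Rank of a process in a group, or -1 if it is not a member.
    static int rankOf(List<String> group, String process) {
        return group.indexOf(process);
    }
}
```

For example, with a hypothetical group A of (a, b, c, d) and group B of (e, a, f, d), the union has six members, and the process with rank 0 in B (here e) receives rank 4 in the new group, matching the slide.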
What are communicators?
• A data-structure that contains groups (and thus processes)
• Why is it useful:
  • Process naming: ranks are names for application programmers
  • Easier than IP address + ports
  • Group communications as well as point-to-point communication
• There are two types of communicators:
  • Intracommunicators:
    • Communication within a group
  • Intercommunicators:
    • Communication between two groups (must be disjoint)
What are contexts?
• A unique integer:
  • An additional tag on the messages
• Each communicator has a distinct context that provides a safe communication universe:
  • A context is agreed upon by all processes when a communicator is built
• Intracommunicators have two contexts:
  • One for point-to-point communications
  • One for collective communications,
• Intercommunicators have two contexts:
  • Explained in the coming slides
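One way to picture a context as "an additional tag" is as an extra field in the message envelope that must match before a message is delivered. The sketch below is an assumption-laden illustration: `EnvelopeSketch` and its fields are hypothetical, and wildcard sources and tags are left out.

```java
// Hypothetical message envelope; matching requires the same context as well
// as the same source and tag.
class EnvelopeSketch {
    final int context;
    final int source;
    final int tag;

    EnvelopeSketch(int context, int source, int tag) {
        this.context = context;
        this.source = source;
        this.tag = tag;
    }

    // True if an incoming message satisfies a posted receive.
    boolean matches(EnvelopeSketch posted) {
        return context == posted.context
                && source == posted.source
                && tag == posted.tag;
    }
}
```

With separate contexts for point-to-point and collective traffic, a user-level receive can never intercept a message that belongs to a collective operation on the same communicator, even if the source and tag happen to coincide.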
Process topologies
• Used to arrange processes in a geometric shape
• Virtual topologies have no necessary connection with the physical layout of machines:
  • It is possible to make use of the underlying machine architecture
• These virtual topologies can be assigned to processes in an Intracommunicator
• MPI provides:
  • Cartesian topology
  • Graph topology
Cartesian topology: Mapping four processes onto 2 x 2 topology
• Each process is assigned a coordinate:
  • Rank 0: (0, 0)
  • Rank 1: (1, 0)
  • Rank 2: (0, 1)
  • Rank 3: (1, 1)
• Uses:
  • Calculate rank by knowing grid (not globus one!) position
  • Calculate grid positions from ranks
  • Easier to locate rank of neighbours
  • Applications may have communication patterns:
    • Lots of messaging with immediate neighbours
Periods in cartesian topology
• Axis 1 (y-axis) is periodic:
  • Processes in top and bottom rows have valid neighbours towards top and bottom respectively
• Axis 0 (x-axis) is non-periodic:
  • Processes in right and left columns have undefined neighbours towards right and left respectively
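The effect of periodic and non-periodic axes on neighbour lookup can be sketched in plain Java. The class `CartesianSketch`, the `NO_NEIGHBOUR` constant, and the rank mapping (x varies fastest, as in the coordinates listed for the 2 x 2 example) are assumptions for illustration. MPI's own Cartesian routines number processes in row-major order, so coordinates returned by a real implementation may differ from this mapping.

```java
// Hypothetical 2-D Cartesian topology with per-axis periodicity.
class CartesianSketch {
    static final int NO_NEIGHBOUR = -1; // stand-in for an undefined neighbour

    private final int[] dims;           // dims[0] = extent along x, dims[1] = extent along y
    private final boolean[] periodic;   // periodic[axis] = true if that axis wraps around

    CartesianSketch(int[] dims, boolean[] periodic) {
        this.dims = dims.clone();
        this.periodic = periodic.clone();
    }

    // Mapping assumed from the 2 x 2 example: rank 1 is (1, 0), rank 2 is (0, 1).
    int rank(int x, int y) {
        return y * dims[0] + x;
    }

    int[] coords(int rank) {
        return new int[] { rank % dims[0], rank / dims[0] };
    }

    // Rank of the process `displacement` steps away along `axis`.
    int neighbour(int rank, int axis, int displacement) {
        int[] c = coords(rank);
        int moved = c[axis] + displacement;
        if (periodic[axis]) {
            moved = Math.floorMod(moved, dims[axis]); // wrap around the edge
        } else if (moved < 0 || moved >= dims[axis]) {
            return NO_NEIGHBOUR;                      // fell off a non-periodic edge
        }
        c[axis] = moved;
        return rank(c[0], c[1]);
    }
}
```

With axis 0 non-periodic and axis 1 periodic on a 2 x 2 grid, rank 0 has no neighbour to its left, while a step past the bottom or top edge along axis 1 wraps to rank 2.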
Derived datatypes
• Besides basic datatypes, it is possible to communicate heterogeneous, non-contiguous data using derived datatypes:
  • Contiguous
  • Indexed
  • Vector
  • Struct
Indexed datatype
• The elements that form this datatype should be:
  • Of the same type
  • At non-contiguous locations
• Add flexibility by specifying displacements

    int SIZE = 4;
    int[] blklen = new int[SIZE], displ = new int[SIZE];
    for (int i = 0; i < SIZE; i++) {
        blklen[i] = SIZE - i;        // block i holds SIZE - i elements
        displ[i] = (i * SIZE) + i;   // block i starts at the diagonal element of row i
    }
    double[] params = new double[SIZE * SIZE];
    double[] rparams = new double[SIZE * SIZE];
    // Arguments: array_of_block_lengths, array_of_displacements, base datatype
    Datatype indexed = Datatype.Indexed(blklen, displ, MPI.DOUBLE);
    indexed.Commit();
    MPI.COMM_WORLD.Send(params, 0, 1, indexed, dst, tag);   // 0 is offset, 1 is count
    MPI.COMM_WORLD.Recv(rparams, 0, 1, indexed, src, tag);
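To see which array elements the block lengths and displacements on this slide select, the sketch below expands them into a list of indices without using any messaging library. The class `IndexedSelectionSketch` and its method names are hypothetical; only the block-length and displacement formulas come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that expands an indexed layout into element indices.
class IndexedSelectionSketch {
    // Block b covers blockLengths[b] consecutive elements starting at displacements[b].
    static List<Integer> selectedIndices(int[] blockLengths, int[] displacements) {
        List<Integer> indices = new ArrayList<>();
        for (int b = 0; b < blockLengths.length; b++) {
            for (int k = 0; k < blockLengths[b]; k++) {
                indices.add(displacements[b] + k);
            }
        }
        return indices;
    }

    // Same formulas as the slide: block i has size - i elements, starting at i * size + i.
    static List<Integer> upperTriangle(int size) {
        int[] blklen = new int[size];
        int[] displ = new int[size];
        for (int i = 0; i < size; i++) {
            blklen[i] = size - i;
            displ[i] = (i * size) + i;
        }
        return selectedIndices(blklen, displ);
    }
}
```

For a size of 4 the selected indices are 0, 1, 2, 3, 5, 6, 7, 10, 11, 15, which is the upper triangle (including the diagonal) of a 4 x 4 matrix stored row by row. A single send with count 1 of such a datatype therefore transfers ten non-contiguous elements.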
Presentation Outline
• Implementation Details:
  • Point-to-point communication
  • Communicators, groups, and contexts
  • Process topologies
  • Derived datatypes
  • Collective communications
• Runtime System
• Thread-safety in MPJ Express
• Performance Evaluation
Collective communications
• Provided as a convenience for application developers:
  • Save significant development time
  • Efficient algorithms may be used
  • Stable (tested)
• Built on top of point-to-point communications,
• These operations include:
  • Broadcast, Barrier, Reduce, Allreduce, Alltoall, Scatter, Scan, Allgather
  • Versions that allow displacements between the data
Broadcast, scatter, gather, alltoall
[Figure: image from the MPI standard document]
Reduce collective operations
• MPI.PROD, MPI.SUM, MPI.MIN, MPI.MAX, MPI.LAND, MPI.BAND, MPI.LOR, MPI.BOR, MPI.LXOR, MPI.BXOR, MPI.MINLOC, MPI.MAXLOC
Barrier with Tree Algorithm
Execution of barrier with eight processes
• Eight processes, which form only one group
• Each process exchanges an integer 4 times
• Overlaps communications well
Intracomm.Bcast( … )
• Sends data from a process to all the other processes
• Code from adlib:
  • A communication library for HPJava
• The current implementation is based on an n-ary tree:
  • Limitation: broadcasts only from rank=0
  • Generated dynamically
  • Cost: O(log2(N))
• MPICH 1.2.5 uses a linear algorithm:
  • Cost: O(N)
• MPICH2 has much improved algorithms
• LAM/MPI uses n-ary trees:
  • Limitation: broadcasts only from rank=0
Broadcasting algorithm, total processes=8, root=0
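A tree-based broadcast from root 0 can be sketched as a schedule of rounds. The sketch below uses a binomial tree, which is an assumption chosen because it is a common way to reach all N processes in about log2(N) rounds; the tree that MPJ Express generates dynamically may have a different shape. The class `TreeBroadcastSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical binomial-tree broadcast schedule rooted at rank 0.
class TreeBroadcastSketch {
    // Each inner list holds the {sender, receiver} pairs of one round.
    static List<List<int[]>> schedule(int nprocs) {
        List<List<int[]>> rounds = new ArrayList<>();
        for (int mask = 1; mask < nprocs; mask <<= 1) {
            List<int[]> round = new ArrayList<>();
            // Every rank that already holds the data forwards it to rank + mask.
            for (int src = 0; src < mask; src++) {
                int dst = src + mask;
                if (dst < nprocs) {
                    round.add(new int[] { src, dst });
                }
            }
            rounds.add(round);
        }
        return rounds;
    }
}
```

For eight processes this gives three rounds: 0 sends to 1; then 0 and 1 send to 2 and 3; then 0 to 3 send to 4 to 7. A linear algorithm would instead need seven sequential sends from the root.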
The Runtime System
Thread-safety in MPI
• The MPI 2.0 specification introduced the notion of a thread-compliant MPI implementation,
• Four levels of thread-safety:
  • MPI_THREAD_SINGLE,
  • MPI_THREAD_FUNNELED,
  • MPI_THREAD_SERIALIZED,
  • MPI_THREAD_MULTIPLE,
• A blocked thread should not halt the execution of other threads,
• “Issues in Developing a Thread-Safe MPI Implementation” by Gropp et al.
Latency on Fast Ethernet
Throughput on Fast Ethernet
Latency on Gigabit Ethernet
Throughput on GigE
Choking experience 1
Latency on Myrinet
Throughput on Myrinet
Questions?