
Parallel Programming with MPI (MPI is the Message Passing Interface)
CS 475, Dr. Ziad A. Al-Sharif
Based on the tutorial from the Argonne National Laboratory: https://www.mcs.anl.gov/~raffenet/permalinks/argonne19_mpi.php

Blocking Collective Operations

Introduction to Collective Operations in MPI
§ Collective operations are called by all processes in a communicator.
§ MPI_BCAST distributes data from one process (the root) to all others in a communicator.
§ MPI_REDUCE combines data from all processes in the communicator and returns the result to one process.
§ In many numerical algorithms, SEND/RECV can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
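As a minimal illustration (not part of the original slides; the value of n and each rank's contribution are made up), a broadcast followed by a reduction:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;
        double local, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            n = 100;                   /* the root chooses a value */

        /* Every rank calls MPI_Bcast; afterwards all ranks hold the root's n */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        local = (double)rank * n;      /* each rank's contribution */

        /* Combine all contributions with MPI_SUM; the result lands on rank 0 */
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", sum);

        MPI_Finalize();
        return 0;
    }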

MPI Collective Communication
§ Communication and computation are coordinated among a group of processes in a communicator
§ Tags are not used; different communicators deliver similar functionality
§ Non-blocking collective operations were added in MPI-3
– Not covered in this tutorial (but conceptually simple)
§ Three classes of operations: synchronization, data movement, collective computation

Synchronization
§ MPI_BARRIER(comm)
– Blocks until all processes in the group of the communicator comm call it
– A process cannot leave the barrier until all other processes have reached it
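A common use (an illustrative sketch, not from the slides) is aligning ranks before timing a phase:

    double t0;
    MPI_Barrier(MPI_COMM_WORLD);   /* wait until every rank arrives */
    t0 = MPI_Wtime();              /* all ranks start the clock together */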

Collective Data Movement
[Diagram: Broadcast copies A from P0 to P0–P3. Scatter splits A, B, C, D on P0 so that Pi receives the i-th piece; Gather is the inverse, collecting the pieces back to P0.]
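A sketch of scatter followed by gather (illustrative only; it assumes the communicator has exactly 4 ranks and rank 0 owns the full arrays):

    int rank;
    double sendbuf[4] = {1.0, 2.0, 3.0, 4.0};  /* significant only at root */
    double piece, recvbuf[4];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root hands one double to each rank */
    MPI_Scatter(sendbuf, 1, MPI_DOUBLE, &piece, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    piece *= 2.0;                              /* each rank works on its piece */

    /* Root collects one double from each rank */
    MPI_Gather(&piece, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);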

More Collective Data Movement
[Diagram: Allgather: each Pi contributes one item and every rank ends up with the full set A, B, C, D. Alltoall: P0 starts with A0–A3, P1 with B0–B3, P2 with C0–C3, P3 with D0–D3; afterwards Pi holds Ai, Bi, Ci, Di, i.e., a distributed transpose.]

Collective Computation
[Diagram: Reduce combines A, B, C, D from P0–P3 into f(ABCD) on one rank. Scan (prefix reduction) leaves A on P0, f(AB) on P1, f(ABC) on P2, and f(ABCD) on P3.]

MPI Collective Routines
§ Many routines: MPI_ALLGATHER, MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLTOALLV, MPI_BCAST, MPI_GATHERV, MPI_REDUCE_SCATTER, MPI_SCAN, MPI_SCATTERV
§ "All" versions deliver results to all participating processes
§ "V" versions (V stands for vector) allow the chunks to have different sizes
§ MPI_ALLREDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combiner functions
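For example (an illustrative sketch), MPI_Allreduce gives every rank the combined result, avoiding a reduce followed by a broadcast:

    int rank;
    double local, global;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;   /* each rank's contribution (made up) */

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* every rank now holds the global sum in `global` */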

MPI Built-in Collective Computation Operations

    MPI_MAX      Maximum
    MPI_MIN      Minimum
    MPI_PROD     Product
    MPI_SUM      Sum
    MPI_LAND     Logical and
    MPI_LOR      Logical or
    MPI_LXOR     Logical exclusive or
    MPI_BAND     Bitwise and
    MPI_BOR      Bitwise or
    MPI_BXOR     Bitwise exclusive or
    MPI_MAXLOC   Maximum and location
    MPI_MINLOC   Minimum and location
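MPI_MAXLOC and MPI_MINLOC reduce value/location pairs; a sketch using the predefined MPI_DOUBLE_INT pair type (the local value here is made up for illustration):

    struct { double value; int rank; } in, out;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    in.value = 1.5 * rank;   /* hypothetical local value */
    in.rank  = rank;

    /* On rank 0, out.value is the global maximum and out.rank its owner */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);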

Exercise: Stencil using Alltoallv
§ In the basic version of the stencil code
– Used nonblocking send/receive for each direction
§ Let's try to use a single alltoallv collective call
§ Start from nonblocking_p2p/stencil.c
§ Solution available in blocking_coll/stencil_alltoallv.c
[Diagram: per-neighbor send counts: count = by for the left/right edges, count = bx for the top/bottom edges, count = 0 for all other ranks.]

Section Summary
§ Collectives are a very powerful feature in MPI
§ Optimized heavily in most MPI implementations
– Algorithmic optimizations (e.g., tree-based communication)
– Hardware optimizations (e.g., network or switch-based collectives)
§ Matches the communication pattern of many applications

MPI Derived Datatypes

Necessary Data Transfers
§ Provide access to remote data through a halo exchange (5-point stencil)

The Local Data Structure
§ Communicating with contiguous data is straightforward
§ Edge elements may be stored non-contiguously
§ Manually packing and unpacking them is one option
[Diagram: local bx × by block of the grid; rows are contiguous, columns are strided.]

Introduction to Datatypes in MPI
§ Datatypes allow MPI to (de)serialize arbitrary data layouts into a message stream
– Networks provide serial channels
– Same for block devices and I/O
§ Several constructors allow arbitrary layouts
– Recursive specification possible
– Declarative specification of data layout
• "what" and not "how"; leaves optimization to the implementation (many unexplored possibilities!)
– Choosing the right constructors is not always simple

Simple/Predefined Datatypes
§ Equivalents exist for all C, C++ and Fortran native datatypes
– C int: MPI_INT
– C float: MPI_FLOAT
– C double: MPI_DOUBLE
– C uint32_t: MPI_UINT32_T
– Fortran integer: MPI_INTEGER
§ For more complex or user-created datatypes, MPI provides routines to represent them as well
– Contiguous
– Vector/Hvector
– Indexed/Indexed_block/Hindexed_block
– Struct
– Some convenience types (e.g., subarray)

Derived Datatype Example
[Diagram: a 24-element buffer (indices 0–23) described by layering constructors: small contig blocks combined by indexed, vector, and struct types.]

MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Contiguous array of oldtype
§ Should not be used as the last type (it can be replaced by a count)
[Diagram: a contig of structs, and a contig of basic elements.]
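A sketch (illustrative; the buffer and destination rank are made up):

    double buf[4];        /* hypothetical send buffer */
    int dest = 1;         /* hypothetical destination rank */
    MPI_Datatype row_t;

    MPI_Type_contiguous(4, MPI_DOUBLE, &row_t);        /* 4 consecutive doubles */
    MPI_Type_commit(&row_t);
    MPI_Send(buf, 1, row_t, dest, 0, MPI_COMM_WORLD);  /* one "row" at a time */
    MPI_Type_free(&row_t);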

MPI_Type_vector(int count, int blocklen, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Specify strided blocks of data of oldtype
§ Very useful for Cartesian arrays
[Diagram: a vector of basic elements, and a vector whose oldtype is a struct.]
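A sketch describing one column of a row-major 2-D array (the array, its dimensions, and the destination rank are made up for the example):

    double a[4][6];
    int dest = 1;         /* hypothetical destination rank */
    MPI_Datatype col_t;

    /* 4 blocks of 1 double, each a full row (6 elements) apart */
    MPI_Type_vector(4, 1, 6, MPI_DOUBLE, &col_t);
    MPI_Type_commit(&col_t);
    MPI_Send(&a[0][2], 1, col_t, dest, 0, MPI_COMM_WORLD);  /* sends the third column */
    MPI_Type_free(&col_t);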

Commit, Free, and Dup
§ Types must be committed before use
– Only the ones that are actually used!
– MPI_Type_commit may perform heavy optimizations (and hopefully will)
§ MPI_Type_free
– Frees the MPI resources of a datatype
– Does not affect types built from it
§ MPI_Type_dup
– Duplicates a type
– Useful for library abstraction (composability)
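For instance (a sketch; user_type is a hypothetical caller-provided datatype), a library can duplicate a type so the caller remains free to release its own handle:

    MPI_Datatype mine;
    MPI_Type_dup(user_type, &mine);  /* library keeps its own independent handle */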

Use Datatypes in the Halo Exchange
§ Top/bottom rows: contig(count=bx, MPI_DOUBLE, …), or simply a count of bx with MPI_DOUBLE
§ Left/right columns: vector(count=by, blocklen=1, stride=bx+2, MPI_DOUBLE, …)
[Diagram: bx × by interior block with a one-element halo ring.]
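Put together (a sketch assuming a row-major (bx+2) × (by+2) local array of doubles with a one-element halo; bx and by are made-up interior sizes):

    int bx = 8, by = 8;   /* hypothetical interior dimensions */
    MPI_Datatype row_t, col_t;

    /* one interior row: bx contiguous doubles */
    MPI_Type_contiguous(bx, MPI_DOUBLE, &row_t);

    /* one interior column: by doubles, each one full row (bx+2) apart */
    MPI_Type_vector(by, 1, bx + 2, MPI_DOUBLE, &col_t);

    MPI_Type_commit(&row_t);
    MPI_Type_commit(&col_t);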

Exercise 1: Stencil with Derived Datatypes (1/2)
§ In the basic version of the stencil code
– Used nonblocking communication
– Used manual packing/unpacking of data
§ Let's try to use derived datatypes
– Specify the locations of the data instead of manually packing/unpacking
[Diagram: bx × by local block; which datatype describes the rows, and which the columns?]

Exercise 1: Stencil with Derived Datatypes (2/2)
§ Nonblocking sends and receives
§ Data location specified by MPI datatypes
§ Manual packing of data no longer required
§ Start from nonblocking_p2p/stencil.c
§ Solution in derived_datatype/stencil.c

MPI_Type_create_hvector(int count, int blocklen, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Create byte-strided vectors
§ Useful for composition, e.g., a vector of structs
[Diagram: a vector with stride = 3 oldtypes vs. an hvector over an array of structs with stride = 11 bytes.]
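A sketch pulling the leading double out of every element of an array of structs (the struct layout is made up; the byte stride comes from sizeof):

    struct cell { double v; int flag; };
    struct cell cells[10];
    MPI_Datatype vals_t;

    /* 10 blocks of 1 double, strided by the size of the struct in bytes */
    MPI_Type_create_hvector(10, 1, (MPI_Aint)sizeof(struct cell),
                            MPI_DOUBLE, &vals_t);
    MPI_Type_commit(&vals_t);
    /* &cells[0].v as the start address now describes all ten v fields */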

MPI_Type_create_indexed_block(int count, int blocklen, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Pulls irregular subsets of data from a single array
– e.g., dynamic codes with index lists (expensive, though!)
– blocklen = 2, displs = {0, 5, 8, 13, 18}
[Diagram: a 21-element buffer with the five two-element blocks highlighted.]
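A sketch using the values from the slide:

    MPI_Datatype idxb_t;
    int displs[5] = {0, 5, 8, 13, 18};

    /* five blocks of 2 doubles at the given element displacements */
    MPI_Type_create_indexed_block(5, 2, displs, MPI_DOUBLE, &idxb_t);
    MPI_Type_commit(&idxb_t);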

MPI_Type_indexed(int count, int *array_of_blocklens, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Like indexed_block, but the blocks can have different lengths
– blens = {1, 1, 2, 1, 2, 1}
– displs = {0, 3, 5, 9, 13, 17}
[Diagram: a 21-element buffer with the six blocks highlighted.]
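A sketch using those arrays:

    MPI_Datatype idx_t;
    int blens[6]  = {1, 1, 2, 1, 2, 1};
    int displs[6] = {0, 3, 5, 9, 13, 17};

    /* six blocks of varying length at irregular displacements */
    MPI_Type_indexed(6, blens, displs, MPI_DOUBLE, &idx_t);
    MPI_Type_commit(&idx_t);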

MPI_Type_create_struct(int count, int *array_of_blocklens, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
§ Most general constructor: allows different types and arbitrary arrays (also the most costly)
[Diagram: a five-element struct layout mixing different basic types.]
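A sketch building a type that matches a C struct (the struct itself is made up; offsetof from <stddef.h> supplies portable displacements):

    #include <stddef.h>

    struct pair { int a; double b; };
    MPI_Datatype pair_t;
    int          blens[2]  = {1, 1};
    MPI_Aint     displs[2] = {offsetof(struct pair, a), offsetof(struct pair, b)};
    MPI_Datatype types[2]  = {MPI_INT, MPI_DOUBLE};

    MPI_Type_create_struct(2, blens, displs, types, &pair_t);
    MPI_Type_commit(&pair_t);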

MPI_Type_create_subarray(int ndims, int *array_of_sizes, int *array_of_subsizes, int *array_of_starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype)
§ Convenience function for creating datatypes for array segments
§ Specify a subarray of an n-dimensional array (sizes) by its start (starts) and size (subsizes)
[Diagram: a 4 × 4 array with entries (0,0) through (3,3).]
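A sketch selecting the interior 2 × 2 block of a 4 × 4 C-order array (the dimensions are made up for the example):

    MPI_Datatype sub_t;
    int sizes[2]    = {4, 4};   /* full array */
    int subsizes[2] = {2, 2};   /* size of the selected block */
    int starts[2]   = {1, 1};   /* origin of the block */

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &sub_t);
    MPI_Type_commit(&sub_t);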

MPI_BOTTOM and MPI_Get_address
§ MPI_BOTTOM is the absolute zero address
– Portability (e.g., may be non-zero in globally shared memory)
§ MPI_Get_address
– Returns an address relative to MPI_BOTTOM
– Portability (do not use the "&" operator in C!)
§ Very important for
– building struct datatypes
– data that spans multiple arrays

    int a = 4;
    float b = 9.6f;
    MPI_Aint disps[2];
    MPI_Datatype newtype;
    MPI_Get_address(&a, &disps[0]);
    MPI_Get_address(&b, &disps[1]);
    MPI_Type_create_struct(count, blocklens, disps, oldtypes, &newtype);

Other DDT Functions
§ Pack/Unpack
– Mainly for compatibility with legacy libraries
– You should not be doing this yourself
§ Get_envelope/contents
– Only for expert library developers
– Libraries like MPITypes (http://www.mcs.anl.gov/mpitypes/) make this easier
§ MPI_Type_create_resized
– Change extent and size (dangerous but useful)

Datatype Selection Order
§ Simple and effective performance model:
– More parameters == slower
§ predefined < contig < vector < index_block < index < struct
§ Some (most) MPIs are inconsistent
– But this rule is portable
§ Advice to users:
– Try datatype "compression" bottom-up

W. Gropp et al.: Performance Expectations and Guidelines for MPI Derived Datatypes

Section Summary
§ Derived datatypes are a sophisticated mechanism to describe ANY layout in memory
– Hierarchical construction allows datatypes to be just as complex as the data layout is
– More complex layouts require more complex datatype constructions
§ The current state of MPI implementations might lag a bit in performance, but it is improving
– Increasing amounts of hardware support to process derived datatypes on the network hardware
– If performance lags when you try it out, complain to the MPI implementer; don't just stop using it!

References