Open MPI China MCP 1

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

What is MPI?
• Message Passing Interface
  – “De facto” standard
  – Not an “official” standard (IEEE, IETF)
• Written and ratified by the MPI Forum
  – A body of academic, research, and industry representatives
• MPI spec
  – MPI-1 published in 1994
  – MPI-2 published in 1997
  – MPI-3 published in 2012
  – Specified interfaces in C, C++, and Fortran 77/90

MPI High-Level View (diagram, top to bottom): User Application, MPI API, Operating System

MPI Goal
• High-level network API
  – Abstracts away the underlying transport
  – Easy for customers to use
• API designed to be “friendly” to high-performance networks
  – Ultra-low latency (nanoseconds matter)
  – Rapid ascent to wire-rate bandwidth
• Typically used in High Performance Computing (HPC) environments
  – Has a bias toward large compute jobs
• The “HPC” definition is evolving
  – MPI is starting to be used outside of HPC
  – MPI is a good network IPC API (see the sketch below)
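
To make the “good network IPC API” point concrete, here is a minimal sketch (not from the original deck) of two ranks exchanging a message with the standard MPI point-to-point calls; whichever transport the MPI library selects underneath (shared memory, TCP, or a vendor interconnect) is invisible to the application code:

#include <mpi.h>
#include <stdio.h>

/* Run with at least 2 ranks, e.g.: mpirun -np 2 ./ping */
int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);                  /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which rank am I? */

    if (rank == 0) {
        value = 42;
        /* send one int to rank 1; the transport is chosen by the library at run time */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}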

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

Open MPI Overview
• Open MPI is an open source, high-performance implementation of MPI
  – Open MPI represents the union of four research/academic, open source MPI implementations: LAM/MPI (Local Area Multicomputer), LA-MPI (Los Alamos MPI), FT-MPI (Fault-Tolerant MPI), and PACX-MPI (PArallel Computer eXtension MPI)
• Open MPI has three main abstraction project layers
  – Open Portable Access Layer (OPAL): Open MPI's core portability across different operating systems, plus basic utilities
  – Open MPI Run-Time Environment (ORTE): launches and monitors individual processes, and groups individual processes into “jobs”
  – Open MPI (OMPI): the public MPI API and the only layer exposed to applications

Open MPI High-Level View (diagram, top to bottom): MPI Application, Open MPI (OMPI) Project, Open MPI Run-Time Environment (ORTE) Project, Open Portable Access Layer (OPAL) Project, Operating System, Hardware

Project Separation (diagram, top to bottom): MPI Application, libompi, libopen-rte, libopen-pal, Operating System, Hardware

Library Dependencies (diagram, top to bottom): MPI Application, libompi, libopen-rte, libopen-pal, Operating System, Hardware — each layer depends on the one below it

Plugin Architecture
• Open MPI architecture design
  – Portable, high-performance implementation of the MPI standard
  – Shares common base code to meet widely different requirements
  – Run-time loadable components were a natural choice: the same interface behavior can be implemented in multiple different ways, and users can then choose, at run time, which plugin(s) to use (see the example below)
• Plugin architecture
  – Each project is structured similarly:
    • Main / core code
    • Components (plugins)
    • Frameworks
  – Governed by the Modular Component Architecture (MCA)
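
For example, the point-to-point byte transfer (BTL) behavior can come from the shared-memory, TCP, SRIO, or Hyperlink plugin, and the choice is made on the mpirun command line rather than at build time; a typical invocation (using the same syntax as the usage example later in this deck) looks like:

/opt/ti-openmpi/bin/mpirun --mca btl self,sm,tcp -np 2 -host k2node1,k2node2 ./testmpi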

MCA Architecture Overview (diagram): the User Application calls the MPI API, which sits on the Modular Component Architecture (MCA); the MCA hosts multiple frameworks, and each framework contains one or more components.

MCA Layout
• MCA
  – Top-level architecture for component services
  – Finds, loads, and unloads components
• Frameworks
  – Targeted set of functionality
  – Defined interfaces
  – Essentially: a group of one type of plugin
  – E.g., MPI point-to-point, high-resolution timers
• Components
  – Code that exports a specific interface
  – Loaded/unloaded at run time
  – “Plugins”
• Modules
  – A component paired with resources
  – E.g., the TCP component is loaded, finds 2 IP interfaces (eth0, eth1), and makes 2 TCP modules

OMPI Architecture Overview (diagram): the OMPI layer is built from frameworks such as the MPI Byte Transfer Layer (btl, e.g. sm and tcp components), the MPI one-sided communication interface (osc, e.g. rdma and pt2pt), the memory pool framework (mpool, e.g. rgpusm and grdma), and MPI collective operations (coll, e.g. tuned and sm), each framework having a base plus its components.

ORTE Architecture Overview (diagram): the ORTE layer is built from frameworks such as Process Lifecycle Management (plm, e.g. slurm and tm components), the routing table for the RML (routed, e.g. direct and radix), OpenRTE group communication (grpcomm, e.g. bad and pmi), and the I/O forwarding service (iof, e.g. tool and hnp), each framework having a base plus its components.

OPAL Architecture Overview (diagram): the OPAL layer is built from frameworks such as the IP interface (if, e.g. linux_ipv6 and posix_ipv4 components), hardware locality (hwloc, e.g. hwloc151 and external), the compression framework (compress, e.g. gzip and bzip), and the high-resolution timer (timer, e.g. darwin and linux), each framework having a base plus its components.

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

Open MPI TI Implementation
• Open MPI on the K2H platform
  – All components in 1.7.1 are supported
  – Launching and initial interfacing are done over SSH
  – BTLs are added for the SRIO and Hyperlink transports
(Diagram: two K2H nodes, Node 0 and Node 1. On each node the MPI application runs under SMP Linux on the A15, with OpenCL and the OpenMP run-time driving the C66x subsystem over IPC/shared memory/Navigator; MPI traffic between the nodes travels over Ethernet, SRIO, or Hyperlink.)

OMPI TI Added Components (diagram): within the OMPI layer, TI adds srio and hlink components to the MPI Byte Transfer Layer (btl); the other frameworks shown (osc with rdma/pt2pt, mpool with rgpusm/grdma, coll with tuned/sm) keep their stock components.

OpenMPI Hyperlink BTL
• Hyperlink is a TI-proprietary high-speed, point-to-point interface with 4 lanes at up to 12.5 Gbps (maximum transfer of 5.5-6 Gbytes/s).
• A new BTL module has been added to ti-openmpi (based on openmpi 1.7.1) to support transport over Hyperlink. MPI Hyperlink communication is driven by the A15 only.
• The K2H device has 2 Hyperlink ports (0 and 1), allowing one SoC to connect directly to two neighboring SoCs.
  – Daisy chaining is not supported.
  – Additional connectivity can be obtained by mapping a common memory region in an intermediate node.
  – Data transfers are operated by EDMA.
• Hyperlink BTL support is seamlessly integrated into the OpenMPI run-time:
  – Example: run mpptest on 2 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example: run nbody on 4 nodes over Hyperlink:
    /opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 4 -host c1n1,c1n2,c1n3,c1n4 ./nbody 1000
(Diagrams: 3-node and 4-node Hyperlink topologies, each K2H using its HL0 and HL1 ports to connect to its neighbors.)

OpenMPI Hyperlink BTL – connection types (diagram): for adjacent connections, the sender writes directly into the neighbor's mapped memory (e.g. Node 2 writes to Node 3, Node 3 writes to Node 2) and the destination reads the data locally; for diagonal connections, the same memory block on an intermediate node is mapped via both of its Hyperlink ports (to different nodes) and used only for a unidirectional connection, so sending a fragment from Node 1 to Node 3 means Node 1 writes into Node 2's block and Node 3 reads it from Node 2.

OpenMPI SRIO BTL
• Serial RapidIO (SRIO) connections are high-speed, low-latency connections that can be switched via an external switching fabric (SRIO switches) or by the K2H on-chip packet forwarding tables (when an SRIO switch is not available).
• The K2H device has 4 SRIO lanes that can be configured as 4 x 1-lane links or 1 x 4-lane link. Wire speed can be up to 5 Gbps, with a data link speed of 4 Gbps (due to 8b/10b encoding).
• Texas Instruments ti-openmpi (based on openmpi 1.7.1) includes an SRIO BTL based on SRIO DIO transport, using the Linux rio_mport device driver. MPI SRIO communication is driven by the A15 only.
• SRIO nodes are statically enumerated (current support), and the packet forwarding tables are programmed inside the MPI run-time based on the list of participating nodes. The HW topology is specified by a JSON file.
• Programming of the packet forwarding tables is static and allows HW-assisted routing of packets without any SW intervention in the transferring nodes.
  – The packet forwarding table has 8 entries (some limitations can be encountered depending on topology and traffic patterns).
  – Each entry specifies min-SRIO-ID, max-SRIO-ID, and the outgoing port.
  – An external SRIO fabric typically provides non-blocking switching capabilities and might be favorable for certain applications and HW designs.
• Based on the destination hostname, the SRIO BTL determines the outgoing port and destination ID. The previously programmed packet forwarding tables in all nodes ensure deterministic routability to the destination node.
• SRIO BTL support is seamlessly integrated into the OpenMPI run-time:
  – Example: run mpptest on 2 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 2 -host c1n1,c1n2 ./mpptest -sync logscale
  – Example: run nbody on 12 nodes over SRIO:
    /opt/ti-openmpi/bin/mpirun --mca btl self,srio -np 12 -host c1n1,c1n2,c1n3,c1n4,c4n1,c4n2,c4n3,c4n4,c7n1,c7n2,c7n3,c7n4 ./nbody 1000

OpenMPI SRIO BTL – possible topologies (diagram): a star topology of K2H nodes through an SRIO switch; a 2-D torus (16 nodes), where the packet forwarding capability allows creation of HW virtual links (no SW operation!); connections with 4 lanes per link; and full connectivity of 4 nodes with 1 lane per link.

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

Open MPI Run-time Parameters
• MCA parameters are the basic unit of run-time tuning for Open MPI.
  – The system is a flexible mechanism that allows users to change internal Open MPI parameter values at run time.
  – If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make choosing among them an MCA parameter.
• This is a service provided by the MCA base.
  – It is not restricted to the MCA components of frameworks.
  – The OPAL, ORTE, and OMPI projects all have “base” parameters.
  – It allows users to be proactive and tweak Open MPI's behavior for their environment, and to experiment with the parameter space to find the best configuration for their specific system.

MCA parameter lookup order
1. mpirun command line: mpirun --mca <name> <value>
2. Environment variable: export OMPI_MCA_<name>=<value>
3. File (these locations are themselves tunable):
   – $HOME/.openmpi/mca-params.conf
   – $prefix/etc/openmpi-mca-params.conf
4. Default value
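
As a concrete illustration, the same setting can be supplied at any of the first three levels, and the level listed first above wins; btl_tcp_if_include (a standard Open MPI parameter that restricts the TCP BTL to the named interface) is used here purely as an example:
– Command line: /opt/ti-openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np 2 -host k2node1,k2node2 ./testmpi
– Environment variable: export OMPI_MCA_btl_tcp_if_include=eth0
– Parameter file: echo "btl_tcp_if_include = eth0" >> $HOME/.openmpi/mca-params.conf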

MCA run-time parameter usage
• Getting MCA information
  – The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters.
    /opt/ti-openmpi/bin/ompi_info --param all
      Show all the MCA parameters for all components that ompi_info finds
    /opt/ti-openmpi/bin/ompi_info --param btl all
      Show all the MCA parameters for all BTL components
    /opt/ti-openmpi/bin/ompi_info --param btl tcp
      Show all the MCA parameters for the TCP BTL component
• MCA usage
  – The mpirun command executes serial and parallel jobs in Open MPI.
    /opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2node1,k2node2 /home/mpiuser/nbody 1000
      Enable verbose BTL output (btl_base_verbose) and use the TCP BTL for transport

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

Open MPI API Usage
• The Open MPI API is the standard MPI API; refer to the following link for more information: http://www.open-mpi.org/doc/
• This example project is located at <mcsdk-hpc_install_path>/demos/testmpi

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* Startup: starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* Who am I? Get the current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* How many peers do I have? Get the number of processes */

    {
        /* Get the name of the processor */
        char processor_name[320];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               processor_name, rank, size);
        gethostname(processor_name, 320);
        printf("locally obtained hostname %s\n", processor_name);
    }

    MPI_Finalize();                          /* Finish the MPI application and release resources */
    return 0;
}
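
The example can be compiled with the MPI compiler wrapper shipped alongside mpirun (the wrapper path and the source file name testmpi.c are assumptions for illustration; the deck only gives the demo directory above):

/opt/ti-openmpi/bin/mpicc testmpi.c -o testmpi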

Run the Open MPI example
• Use mpirun with MCA parameters to run the example:
  /opt/ti-openmpi/bin/mpirun --mca btl self,sm,tcp -np 8 -host k2node1,k2node2 ./testmpi
• Output messages:
  >>>
  Hello world from processor k2hnode1, rank 3 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 0 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode2, rank 5 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 4 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 7 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode2, rank 6 out of 8 processors
  locally obtained hostname k2hnode2
  Hello world from processor k2hnode1, rank 1 out of 8 processors
  locally obtained hostname k2hnode1
  Hello world from processor k2hnode1, rank 2 out of 8 processors
  locally obtained hostname k2hnode1
  <<<

Agenda • MPI Overview • Open MPI Architecture • Open MPI TI Implementation • Open MPI Run-time Parameters • Open MPI Usage Example • Getting Started

Getting Started
• Download: http://software-dl.ti.com/sdoemb_public_sw/mcsdk_hpc/latest/index_FDS.html
• Getting Started Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_Getting_Started_Guide
• TI OpenMPI User Guide: http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_OpenMPI
• Open Source High Performance Computing, Message Passing Interface: http://www.open-mpi.org/
• Open MPI Training Documents: http://www.open-mpi.org/video/
• Support: http://e2e.ti.com/support/applications/high-performance-computing/f/952.aspx