Fully measure the performance of your cluster systems
- Slides: 18

Fully measure the performance of your cluster systems with Intel® MPI Benchmarks
Performance, Scalability & Fabric Flexibility
Zhuowei (James) Si
12/Nov/2019
Part of Intel® Parallel Studio XE Cluster Edition and Available Individually

Boost Distributed Application Performance with Intel® MPI Library
Standards Based Optimized MPI Library for Distributed Computing
Intel® MPI Library
§ Built on the open source MPICH implementation
§ Tuned for low latency, high bandwidth & scalability
§ Multi-fabric support for flexibility in deployment
§ Includes MPITUNE and Intel MPI Benchmarks subcomponents
What’s New in 2019 version
§ Customized Libfabric 1.6.1 sources are included.
§ Binaries of customized Libfabric 1.6.1 with the sockets, psm2, and verbs providers are included.
§ PSM2 Multiple Endpoints (Multi-EP) support for Linux* OS.
Learn More: software.intel.com/intel-mpi-library
Optimization Notice
Copyright © 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. **Certain technical specifications and select processors/skus apply. See product site for more details.

Intel® MPI Benchmarks Overview
A set of MPI performance measurements
§ For point-to-point and global communication operations
§ A wide range of message sizes supported
§ Developed using ANSI C and standard MPI
The benchmark results fully characterize
§ Performance of a cluster system, including node performance, network latency, and throughput
§ Efficiency of the MPI implementation used

Where is Intel MPI Benchmarks
Inside Intel® Parallel Studio XE:
§ Composer Edition (BUILD: Compilers & Libraries)
 – C / C++, Fortran Compilers
 – Intel® Math Kernel Library
 – Intel® Data Analytics Acceleration Library
 – Intel Threading Building Blocks (C++ Threading)
 – Intel® Integrated Performance Primitives (Image, Signal & Data Processing)
 – Intel® Distribution for Python* (High Performance Python)
§ Professional Edition (ANALYZE: Analysis Tools)
 – Intel® VTune™ Amplifier (Performance Profiler)
 – Intel® Inspector (Memory & Thread Debugger)
 – Intel® Advisor (Vectorization Optimization, Thread Prototyping & Flow Graph Analysis)
§ Cluster Edition (SCALE: Cluster Tools)
 – Intel® MPI Library (Message Passing Interface Library)
 – Intel MPI Benchmarks
 – Intel® Trace Analyzer & Collector
 – Intel® Cluster Checker (Cluster Diagnostic Expert System)
Also available at GitHub: https://github.com/intel/mpi-benchmarks
Operating System: Windows*, Linux*, MacOS*¹
Intel® Architecture Platforms
¹ Available only in the Composer Edition.

Components of Intel MPI Benchmarks
§ IMB-MPI1 – benchmarks for MPI-1 functions.
§ IMB-MT – benchmarks for MPI-1 functions running within multiple threads per rank.
Two components for MPI-2 functionality:
§ IMB-EXT – one-sided communications benchmarks.
§ IMB-IO – input/output (I/O) benchmarks.
Two components for MPI-3 functionality:
§ IMB-NBC – benchmarks for non-blocking collective (NBC) operations.
§ IMB-RMA – one-sided communications benchmarks. These benchmarks measure the Remote Memory Access (RMA) functionality introduced in the MPI-3 standard.

Build & Run Intel MPI Benchmarks
§ For Windows*, use the enclosed Microsoft* Visual Studio* solution files located in version-specific subdirectories under the imb/WINDOWS directory.
§ For Linux*, set up the environment for the compiler and Intel® MPI Library.
 – For the Intel® compilers, run: source <compiler_dir>/bin/compilervars.sh intel64
 – For the Intel® MPI Library, run: source <intel_mpi_dir>/intel64/bin/mpivars.sh
 – Set the CC variable to point to the appropriate compiler wrapper, mpiicc or mpicc.
 – Run one or more Makefile commands under the IMB directory.
§ To run the Intel® MPI Benchmarks, use the following command-line syntax:
 $ mpirun -n <P> IMB-<component> [arguments]
 – <P> is the number of processes.
 – <component> is the component-specific suffix that can take MPI1, EXT, IO, NBC, and RMA values.
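The command-line syntax above can be sketched as a small Python helper that assembles the argument list. This is only an illustration: the function name imb_command and the COMPONENTS tuple are hypothetical, and the set of suffixes is copied from the slide rather than read from an actual installation.

```python
import shlex

# Component suffixes as listed on the slide (assumption: all were built).
COMPONENTS = ("MPI1", "EXT", "IO", "NBC", "RMA")


def imb_command(num_procs, component, arguments=()):
    """Return the argv list for: mpirun -n <P> IMB-<component> [arguments]."""
    if num_procs < 1:
        raise ValueError("number of processes must be at least 1")
    if component not in COMPONENTS:
        raise ValueError(f"unknown component suffix: {component!r}")
    # Everything after the binary name is passed through unchanged.
    return ["mpirun", "-n", str(num_procs), f"IMB-{component}",
            *[str(a) for a in arguments]]


def imb_command_line(num_procs, component, arguments=()):
    """Return the same command as a single shell-quoted string."""
    return shlex.join(imb_command(num_procs, component, arguments))
```

For example, imb_command(4, "MPI1", ["PingPong"]) yields a list that can be handed to subprocess.run on a machine where mpirun and the IMB binaries are on the PATH.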

Command-Line Control
§ By default, all benchmarks run on Q active processes defined as follows: Q=[1,] 2, 4, 8, ..., largest 2^x < P, P
 – For example, if P=11, the benchmarks run on Q=[1,] 2, 4, 8, 11 active processes. Single transfer IMB-IO benchmarks run with Q=1. Single transfer IMB-EXT and IMB-RMA benchmarks run with Q=2.
§ You can run all benchmarks in the following modes:
 – Standard (or non-multiple, default) – the benchmarks run in a single process group.
 – Multiple – the benchmarks run in several process groups.
§ Example: run a very large configuration with the following parameters:
 – Maximum iterations: 20
 – Maximum run time per message: 1.5 seconds
 – Maximum message buffer size: 2 GBytes
 mpirun -np 512 IMB-MPI1 -npmin 512 alltoallv -iter 20 -time 1.5 -mem 2
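The default sequence of active process counts can be reproduced with a few lines of Python. This is a sketch of the rule as stated on the slide, not code taken from the benchmarks; the function name and the include_one flag (which stands in for the optional leading 1) are hypothetical.

```python
def active_process_counts(p, include_one=False):
    """Return Q = [1,] 2, 4, 8, ..., largest 2^x < P, P for P started processes."""
    if p < 1:
        raise ValueError("P must be at least 1")
    counts = [1] if include_one else []
    q = 2
    while q < p:          # powers of two strictly below P
        counts.append(q)
        q *= 2
    if not counts or counts[-1] != p:
        counts.append(p)  # the full process count always closes the list
    return counts
```

With P=11 this gives 2, 4, 8, 11 (or 1, 2, 4, 8, 11 when the leading 1 is included), matching the example on the slide.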

Intel MPI Benchmarks Outputs
§ The benchmark output includes the following information:
 – General information: machine, system, release, and version
 – The calling sequence (command-line flags), repeated in the output chart
 – Results for the non-multiple mode

MPI-1 Benchmarks
§ IMB introduces three classes of benchmarks:
 – Single transfer benchmarks involve two active processes in communication
 – Parallel transfer benchmarks involve more than two active processes in communication
 – Collective benchmarks measure MPI collective operations

 Single Transfer           Parallel Transfer    Collective
 PingPong                  Sendrecv             Bcast, Multi-Bcast
 PingPongSpecificSource    Exchange             Allgather, Multi-Allgather
 ……                        ……                   ……

§ For runs in Standard Mode and Multiple Mode, IMB-MPI1 contains the following benchmarks:

 Standard Mode    Multiple Mode
 PingPong         Multi-PingPong
 ……               ……

PingPong, PingPongSpecificSource, PingPongAnySource
§ Use PingPong, PingPongSpecificSource, and PingPongAnySource for measuring startup and throughput of a single message sent between two processes.
§ PingPongAnySource uses the MPI_ANY_SOURCE value for the source rank of the receive, while PingPong and PingPongSpecificSource use an explicit value.

 Property               Description
 Measured pattern       As symbolized in the figure on the right. This benchmark runs on two active processes (Q=2).
 MPI routines           MPI_Send, MPI_Recv
 MPI data type          MPI_BYTE
 Reported timings       time=Δt/2 (in μsec) as indicated in the figure below.
 Reported throughput    X/time
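The reported quantities in the table can be worked through numerically. The sketch below assumes that X is the message size in bytes and that Δt is the measured round-trip time in μsec; the function name is hypothetical, and the unit actually printed by a given IMB version may differ from the bytes-per-μsec value computed here.

```python
def pingpong_metrics(message_bytes, round_trip_usec):
    """Return (time, throughput) with time = Δt/2 and throughput = X/time."""
    if round_trip_usec <= 0:
        raise ValueError("round-trip time must be positive")
    time_usec = round_trip_usec / 2.0        # one-way time in μsec
    throughput = message_bytes / time_usec   # bytes per μsec = 10^6 bytes per second
    return time_usec, throughput
```

For a 1024-byte message with a 4 μsec round trip, the one-way time is 2 μsec and the throughput is 512 bytes per μsec.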

Multithreaded MPI-1 Benchmarks (New in IMB 2019)
§ The IMB-MT component provides benchmarks for some of the MPI-1 functions, running in multiple threads.
§ The benchmarks use the MPI_THREAD_MULTIPLE mode: several threads execute per rank, and each thread performs communication.
§ To make the communication patterns meaningful, the benchmarks have to meet the following requirements:
 – Data must be distributed between threads.
 – The communication pattern must ensure the deterministic order of data sends and receives.
§ Thread control inside a rank is performed using the OpenMP* API.
§ IMB-MT benchmarks always run as if the multiple mode is enabled.

What is MPITUNE
Intel MPI Benchmarks Usage Case
§ A tool that optimizes Intel® MPI Library settings, using Intel MPI Benchmarks for performance testing.
§ Walks through the [application-specific] search space of settings, tests performance, and writes out a configuration file with the best settings found.
§ Works in two different modes: cluster-specific and application-specific.

MPITUNE Example
[Figure: Allgather on 8 HSW-EP nodes, N 256/PPN 32. Average latency in usec (logarithmic axis) versus message size in bytes (up to 4m). Series: Allgather 1, Allgather 2, Allgather 3, Allgather 4, Allgather 5, Tuned.]

Use an Extensive Diagnostic Toolset for High Performance Compute Clusters: Intel® Cluster Checker (for Linux*)
Intel MPI Benchmarks Usage Case
Ensure Cluster Systems Health
§ Expert system approach providing cluster systems expertise: verifies system health, finds issues, offers suggested actions
§ Provides an extensible framework and API for integrated support
§ Checks 100+ characteristics that may affect operation & performance – improve uptime & productivity
Comprehensive pre-packaged cluster systems expertise out-of-the-box
§ Suitable for HPC experts & those new to HPC
§ Tests can be executed in selected groups on any subset of nodes
For application developers, cluster architects & users, & system administrators

Customer Support
Online Service Support
§ Please ask at the Intel Clusters and HPC Technology Forum: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
§ Or file a ticket at the Intel Online Service Center: https://supporttickets.intel.com/?lang=en-US
Customer Support
§ MPI performance analysis using Intel MPI Benchmarks
§ MPI performance tuning & profiling

Online Resources
Intel® MPI Library
§ www.intel.com/go/mpi
Intel® Trace Analyzer and Collector product page
§ www.intel.com/go/traceanalyzer
Intel MPI Benchmarks User Guide:
§ https://software.intel.com/en-us/imb-user-guide

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804