Efficient Asynchronous Message Passing via SCI with Zero-Copying

Efficient Asynchronous Message Passing via SCI with Zero-Copying
SCI Europe 2001 – Trinity College Dublin
Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*
* Lehrstuhl für Betriebssysteme, RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur, TU Chemnitz

Agenda
• What is Zero-Copying? What is it good for? Zero-Copying with SCI
• Support through the SMI Library (Shared Memory Interface)
• Zero-Copy Protocols in SCI-MPICH: Memory Allocation, Setups, Performance Optimizations
• Performance Evaluation: Point-to-Point, Application Kernel, Asynchronous Communication

Zero-Copying
• Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies: N-way copying
⇒ No intermediate copy: zero-copying
• Effective bandwidth and efficiency (see the sketch below)
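The formula for effective bandwidth and efficiency did not survive this transcription; the following is a plausible reconstruction, assuming the N intermediate copies (bandwidth b_copy each) are serialized with the raw link transfer (bandwidth b_link), so that zero-copying (N = 0) yields E = 1:

```latex
% Plausible reconstruction, not the slide's original formula.
\[
  \frac{1}{b_{\mathrm{eff}}} = \frac{1}{b_{\mathrm{link}}} + \frac{N}{b_{\mathrm{copy}}},
  \qquad
  E = \frac{b_{\mathrm{eff}}}{b_{\mathrm{link}}}
\]
```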

Efficiency Comparison
(Chart: Gigabit Ethernet vs. SCI DMA vs. Fast Ethernet; data not reproduced in this transcription)

Zero-Copying with SCI
SCI does zero-copy by nature. But SCI via the I/O bus is limited:
• No SMP-style shared memory
• Specially allocated memory regions were required
⇒ No general zero-copy possible
New possibility:
• Using user-allocated buffers for SCI communication
⇒ Allows general zero-copy! Connection setup is always required.

SMI Library (Shared Memory Interface)
High-level SCI support library for parallel applications or libraries:
• Application startup
• Synchronization & basic communication
• Shared-memory setup: collective regions, point-to-point regions, individual regions
• Dynamic memory management
• Data transfer

Data Moving (I)
Shared memory paradigm (a rough sketch follows below):
• Import remote memory into the local address space
• Perform memcpy() or maybe DMA
• SMI support:
  - Region type REMOTE
  - Synchronous (PIO): SMI_Memcpy()
  - Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
Problems:
• High mapping overhead
• Resource usage (ATT entries on the PCI-SCI adapter)
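As a rough illustration of the shared-memory paradigm (not SCI-MPICH's actual code): once the remote segment is visible in the local address space, the synchronous PIO path is an ordinary CPU copy. The mapping step is abstracted behind a hypothetical helper here; in SMI it would be an imported region of type REMOTE.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for importing a remote SCI segment into the local address space
 * (in SMI: a region of type REMOTE). It just allocates local memory so the
 * sketch runs; the real mapping is the expensive step the slide refers to. */
static void *map_remote_region(int dest_rank, size_t size)
{
    (void)dest_rank;
    return malloc(size);
}

/* Synchronous (PIO) transfer: once remote memory is mapped locally, the
 * transfer is a plain memcpy() issued by the CPU (roughly what SMI_Memcpy()
 * wraps); each mapping consumes ATT entries on the PCI-SCI adapter. */
static void pio_send(const void *src, size_t size, int dest_rank)
{
    void *remote = map_remote_region(dest_rank, size); /* mapping overhead */
    memcpy(remote, src, size);                         /* CPU-driven stores across SCI */
    free(remote);
}

int main(void)
{
    char msg[64] = "payload";
    pio_send(msg, sizeof msg, 1);
    puts("copied");
    return 0;
}
```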

Mapping Overhead
⇒ Not suitable for dynamic memory setups!

Data Moving (II)
Connection paradigm:
• Connect to the remote memory location
• No representation in the local address space ⇒ only DMA possible
• SMI support (sketched below):
  - Region type RDMA
  - Synchronous / asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait
Problems:
• Alignment restrictions
• Source needs to be pinned down
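For contrast, a fragment sketching the connection paradigm. The slide names SMI_Iput() and SMI_Memwait() but not their signatures, so the prototypes below are assumptions made purely for illustration:

```c
#include <stddef.h>

/* Assumed prototypes -- illustration only; the real SMI signatures may differ. */
int SMI_Iput(int region_id, size_t offset, const void *src, size_t size);
int SMI_Memwait(int region_id);

/* Asynchronous DMA in the connection paradigm: the remote segment has no local
 * mapping, so the adapter's DMA engine moves the data while the CPU is free
 * until completion is awaited. The source buffer must be pinned and aligned. */
void dma_send(int rdma_region, const void *src, size_t size)
{
    SMI_Iput(rdma_region, 0, src, size); /* post the DMA, returns immediately */
    /* ... overlap computation here ... */
    SMI_Memwait(rdma_region);            /* block until the DMA has completed */
}
```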

Setup Acceleration
Memory buffer setup costs time!
⇒ Reduce the number of setup operations to increase performance; desirable: only one operation per buffer
• Problem: limited resources
• Solution: caching of SCI segment states by lazy release (see the sketch below)
  - Leave buffers registered, remote segments connected or mapped
  - Release unneeded resources only if setup of a new resource fails
  - Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
  - Attention: remote segment deallocation! ⇒ Callback on connection event to release the local connection
• MPI persistent communication operations: pre-register the user buffer & give it a higher "hold" priority
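A minimal sketch of the lazy-release idea with LRU replacement. All names (seg_conn, try_connect, release) are hypothetical; this is not SCI-MPICH's cache, only an illustration of "keep connections alive, evict only when a new setup needs the slot":

```c
#include <stdbool.h>

#define CACHE_SLOTS 8

/* Hypothetical cached connection state. */
struct seg_conn {
    int  remote_segment_id;  /* which remote segment this entry connects to */
    bool in_use;
    unsigned long last_used; /* LRU timestamp */
};

static struct seg_conn cache[CACHE_SLOTS];
static unsigned long clock_ticks;

/* Placeholders for the actual (expensive) SCI connect/release operations. */
static bool try_connect(struct seg_conn *c, int seg_id) { c->remote_segment_id = seg_id; return true; }
static void release(struct seg_conn *c) { c->in_use = false; }

/* Return a connection to seg_id, reusing a cached one if possible.
 * Connections are never torn down eagerly (lazy release); the least recently
 * used entry is evicted only when a new connection must be established. */
struct seg_conn *get_connection(int seg_id)
{
    struct seg_conn *victim = &cache[0];
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].in_use && cache[i].remote_segment_id == seg_id) {
            cache[i].last_used = ++clock_ticks;  /* cache hit: no setup cost */
            return &cache[i];
        }
        if (!cache[i].in_use)
            victim = &cache[i];                  /* prefer a free slot */
        else if (cache[i].last_used < victim->last_used)
            victim = &cache[i];                  /* track the LRU entry */
    }
    if (victim->in_use)
        release(victim);                         /* evict least recently used */
    if (!try_connect(victim, seg_id))
        return NULL;                             /* caller falls back to a copy protocol */
    victim->in_use    = true;
    victim->last_used = ++clock_ticks;
    return victim;
}
```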

Memory Allocation
Allocate "good" memory:
• MPI_Alloc_mem() / MPI_Free_mem() (see the sketch below)
• Part of MPI-2 (mostly intended for one-sided operations)
• SCI-MPICH defines attributes:
  - type: shared, private or default ⇒ shared memory performs best
  - alignment: none, specified or default ⇒ non-shared memory should be page-aligned
• "Good" memory should only be enforced for communication buffers!
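A short sketch of allocating "good" memory through the standard MPI-2 call. MPI_Alloc_mem() and the MPI_Info routines are standard MPI; the info key strings "type" and "alignment" merely mirror the attribute names on the slide and are an assumption about how SCI-MPICH spells them:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Request "good" (SCI-capable, shared) memory for a communication buffer.
     * The key names "type"/"alignment" follow the slide; their exact spelling
     * in SCI-MPICH is an assumption here. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "shared");
    MPI_Info_set(info, "alignment", "default");

    void *buf;
    MPI_Alloc_mem(1 << 20, info, &buf);   /* 1 MiB communication buffer */

    /* ... use buf as a send/receive buffer (zero-copy eligible) ... */

    MPI_Free_mem(buf);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```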

Zero-Copy Protocols
• Applicable for the handshake-based rendezvous protocol
• Requirements:
  - registered user-allocated buffers, or
  - regular SCI segments ⇒ "good" memory via MPI_Alloc_mem()
• State of the memory range must be known ⇒ SMI provides query functionality
• Registering / connection / mapping may fail
• Several different setups possible ⇒ fallback mechanism required (see the sketch below)
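A schematic of the fallback logic implied above. The helper names are hypothetical (with trivial placeholder bodies so the sketch compiles); it is not SCI-MPICH's protocol code, only the decision structure: try the zero-copy path, fall back to a copying rendezvous when any setup step fails.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers standing in for the setup steps named on the slide;
 * trivial bodies are placeholders only. Each step can fail in practice
 * (resource limits, alignment, unsupported memory). */
static bool buffer_is_registered_or_segment(const void *b, size_t n) { (void)b; (void)n; return true; }
static bool connect_and_map_remote(int peer, const void *b, size_t n){ (void)peer; (void)b; (void)n; return true; }
static void zero_copy_transfer(int peer, const void *b, size_t n)    { (void)peer; (void)b; (void)n; }
static void copying_rendezvous(int peer, const void *b, size_t n)    { (void)peer; (void)b; (void)n; }

/* Rendezvous send: take the zero-copy path only if every setup step succeeds,
 * otherwise fall back to the classic copying rendezvous protocol. */
void rendezvous_send(int peer, const void *buf, size_t len)
{
    if (buffer_is_registered_or_segment(buf, len) &&
        connect_and_map_remote(peer, buf, len)) {
        zero_copy_transfer(peer, buf, len);   /* no intermediate copy */
    } else {
        copying_rendezvous(peer, buf, len);   /* fallback path */
    }
}
```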

Asynchronous Rendezvous
(Sequence diagram: sender and receiver each run an application thread and a device thread. After Isend/Irecv, the device threads exchange the control messages "ask to send", "OK to send" and "done" and perform the data transfer, while the application threads continue and only block in Wait; message types sketched below.)
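For illustration, the handshake can be thought of as three control-message types exchanged by the device threads; the names are hypothetical, as the actual SCI-MPICH encoding is not shown on the slide:

```c
/* Hypothetical control-message types for the asynchronous rendezvous handshake. */
enum rndv_ctrl_msg {
    RNDV_ASK_TO_SEND,  /* sender device thread -> receiver: announce message     */
    RNDV_OK_TO_SEND,   /* receiver device thread -> sender: destination is ready */
    RNDV_DONE          /* sender device thread -> receiver: data transfer done    */
};
```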

Test Setup
Systems used for performance evaluation:
• Pentium-III @ 800 MHz
• 512 MB RAM @ 133 MHz
• 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
• Dolphin D330 (single ring topology)
• Linux 2.4.4-bigphysarea
• Modified SCI driver (user memory for SCI)

Bandwidth Comparison
(Chart not reproduced in this transcription)

Application Kernel: NPB IS
• Parallel bucket sort
• Keys are integer numbers
• Dominant communication: MPI_Alltoallv for the distributed key array:

Class | Array size [MiB] | Procs | Msg size [KiB] | Alltoallv [ms] | % of execution time
A     | 1                | 4     | 256            | 16.363         | 34.6
W     | 8                | 4     | 2048           | 123.921        | 36.2

MPI_Alltoallv Performance
• MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall
• Improved performance with asynchronous DMA operations
• Application speedup deduced:

Class | Procs | regular [ms] | speedup | user [ms] | speedup
A     | 4     | 7.578        | 1.22    | 9.617     | 1.16
W     | 4     | 52.415       | 1.26    | 63.957    | 1.21

Asynchronous Communication
Goal: overlap computation & communication
• How to quantify the efficiency of this?
⇒ Typical overlapping effect:
(Bar diagram comparing total time and computation time for the synchronous and asynchronous cases)

Saturation and Efficiency (I)
Two parameters are required:
1. Saturation s
   • Duration of the computation period required to make the total time (communication & computation) increase
2. Efficiency e
   • Relation of overhead to message latency

Saturation and Efficiency (II)
(Timing diagram: synchronous case with message time t_msg_s followed by computation t_busy, vs. asynchronous case with message time t_msg_a overlapped with t_busy; marked quantities: total time t_total, remaining overhead t_total − t_busy, saturation s)
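The slides do not spell the formula out; a plausible formalization, using the quantities in the diagram (overhead = t_total − t_busy) and consistent with the later results (e ≈ 0 for synchronous PIO, e close to 1 for well-overlapped DMA), is:

```latex
% Plausible reconstruction, not quoted from the slides.
\[
  e = 1 - \frac{t_{\mathrm{total}} - t_{\mathrm{busy}}}{t_{\mathrm{msg}}}
\]
% e = 1: communication fully hidden behind computation;
% e = 0: no overlap (t_total = t_busy + t_msg); e < 0: additional overhead.
```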

Experimental Setup: Overlap
Micro-benchmark to quantify overlapping:

latency = MPI_Wtime()
if (sender)
    MPI_Isend(msg, msgsize)
    while (elapsed_time < spinning_duration)
        spin (with multiple threads)
    MPI_Wait()
else
    MPI_Recv()
latency = MPI_Wtime() - latency
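A minimal C/MPI rendition of the pseudocode above (single-threaded spinning for brevity; message size and spinning duration are illustrative values, not the ones used in the talk):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (64 * 1024)   /* illustrative, e.g. the 64 KiB case */
#define SPIN_SECONDS 0.001     /* illustrative spinning duration     */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *msg = malloc(MSG_SIZE);
    double latency = MPI_Wtime();

    if (rank == 0) {                              /* sender */
        MPI_Request req;
        MPI_Isend(msg, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
        double t0 = MPI_Wtime();
        volatile double x = 0.0;
        while (MPI_Wtime() - t0 < SPIN_SECONDS)   /* keep the CPU busy (FIXED-style spin) */
            x += 1.0;
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                       /* receiver */
        MPI_Recv(msg, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    latency = MPI_Wtime() - latency;
    printf("rank %d: latency %.6f s\n", rank, latency);

    free(msg);
    MPI_Finalize();
    return 0;
}
```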

Experimental Setup: Spinning
Different ways of keeping the CPU busy:
• FIXED: spin on a single variable for a given amount of CPU time
  ⇒ no memory stress
• DAXPY: perform a given number of DAXPY operations on vectors (vector sizes of x, y equivalent to the message size)
  ⇒ stress the memory system (a DAXPY spin loop is sketched below)
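For reference, the DAXPY kernel used for spinning is simply y = a·x + y over message-sized vectors; the sketch below is illustrative, not the benchmark's exact code:

```c
#include <stddef.h>

/* One DAXPY pass: y = a*x + y. Streaming through x and y stresses the memory
 * system, unlike the FIXED spin on a single cached variable. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Spin by repeating DAXPY a given number of times on message-sized vectors. */
void daxpy_spin(size_t msg_bytes, int repetitions, const double *x, double *y)
{
    size_t n = msg_bytes / sizeof(double);
    for (int r = 0; r < repetitions; r++)
        daxpy(n, 1.000001, x, y);
}
```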

DAXPY – 64 KiB Message
(Chart not reproduced in this transcription)

DAXPY – 256 KiB Message
(Chart not reproduced in this transcription)

FIXED – 64 KiB Message
(Chart not reproduced in this transcription)

Asynchronous Performance
Saturation and efficiency derived from the experiments:

Experiment     | Protocol  | tmsg [ms] | s [ms] | e
64 KiB DAXPY   | a-DMA-0-R | 0.490     | 0.285  | 0.581
               | a-DMA-0-U | 0.735     | 0.473  | 0.643
               | s-PIO-1   | 0.572     | 0.056  | 0.043
256 KiB DAXPY  | a-DMA-0-R | 1.300     | 1.099  | 0.845
               | a-DMA-0-U | 1.506     | 1.148  | 0.762
               | s-PIO-1   | 1.895     | -0.030 | -0.015
64 KiB FIXED   | a-DMA-0-R | 0.493     | 0.446  | 0.904
               | a-DMA-0-U | 0.738     | 0.691  | 0.936
               | s-PIO-1   | 0.567     | 0.016  | 0.028

Summary & Outlook
• Efficient utilization of new SCI driver functionality for MPI communication:
  ⇒ Max. bandwidth of 230 MiB/s (regular), 190 MiB/s (user)
• Connection overhead hidden by segment caching
  ⇒ Asynchronous communication pays off much earlier than before
• New(?) quantification scheme for the efficiency of asynchronous communication
• Flexible MPI memory allocation supports the MPI application writer
• Connection-oriented DMA transfers reduce resource utilization
• DMA alignment problems
• Segment callback required for improved connection caching