Proprietary or Commodity? Interconnect Performance in Large-scale Supercomputers

Proprietary or Commodity? Interconnect Performance in Large-scale Supercomputers
Author: Olli-Pekka Lehto
Supervisor: Prof. Jorma Virtamo
Instructor: D.Sc. (Tech.) Jussi Heikonen

Contents
• Introduction
• Background
• Objectives
• Test platforms
• Testing methodology
• Results
• Conclusions

Introduction
A modern High Performance Computing (HPC) system consists of:
• Login nodes, service nodes, I/O nodes and compute nodes
  • Usually multicore/SMP
• Interconnect network(s) which link all nodes together
Applications are run in parallel
• The individual tasks of a parallel application exchange data via the interconnect
• The interconnect plays a critical role in the overall performance of the parallel application
Commodity system
• Uses off-the-shelf components
• Leverages the economies of scale of the PC industry
• “Industry standard” architectures which allow the system to be extended in a vendor-independent fashion
Proprietary system
• A highly integrated system designed specifically for HPC
• May contain some off-the-shelf components but cannot be extended in a vendor-independent fashion

Background: The Cluster Revolution
In the last decade, clusters have rapidly become the system architecture of choice in HPC. The high end of the market is still dominated by proprietary MPP systems.
Source: http://www.top500.org

Background: The Architecture Evolution
The architectures of proprietary and commodity HPC systems have been converging; nowadays it is difficult to differentiate between the two.
• Increasing R&D costs drive the move towards commodity components
• Competition between AMD and Intel
• Competition in specialized cluster interconnects
The interconnect network is a key differentiating factor between commodity and proprietary architectures.
Disclaimer: IBM Blue Gene is a notable exception

Problem Statement
Does a supercomputer architecture with a proprietary interconnection network offer a significant performance advantage over a similarly sized cluster with a commodity network?

Test Platforms
In 2007, CSC - Scientific Computing Ltd. conducted a 10 M€ procurement to acquire new HPC systems. The procurement was split into two parts:
• Lot 1: Capability computing
  • Massively parallel “Grand Challenge” computation
• Lot 2: Capacity computing
  • Sequential and small to medium-sized parallel problems
  • Problems requiring large amounts of memory

Test Platform 1: Louhi
Winner of Lot 1: “capability computing”
Cray XT4
• Phase 1 (2007): 2024 2.6 GHz AMD Opteron cores (dual-core): 10.5 TFlop/s
• Phase 2 (2008): 6736 2.6 GHz AMD Opteron cores (quad-core*): 70.1 TFlop/s
Unicos/lc operating system
• Linux on the login and service nodes
• Catamount microkernel on the compute nodes
Proprietary “SeaStar2” interconnection network
• 3-dimensional torus topology
• Each node connected to 6 neighbors with 7.6 GByte/s links
• Each node has a router integrated into the NIC
• NIC connected directly to the AMD HyperTransport bus
• NIC has an onboard CPU for protocol processing (“protocol offloading”)
• Remote Direct Memory Access (RDMA)
*) 4 Flops/cycle

Test Platform 1: Louhi
[Figure omitted. Source: Cray Inc.]

Test Platform 2: Murska
Winner of Lot 2: “capacity computing”
HP CP4000 BL XC blade cluster
• 2048 2.6 GHz AMD Opteron cores (dual-core): 10.6 TFlop/s
• Dual-socket HP BL465c server blades
HPC Linux operating system
• HP's turnkey cluster OS based on RHEL
InfiniBand interconnect network
• A multipurpose high-speed network
• Fat tree topology (blocking)
• 24-port blade enclosure switches
  • 16 × 16 Gbit/s DDR*) downlinks
  • 8 × 16 Gbit/s DDR uplinks, running at 8 Gbit/s
• 288-port master switch with SDR**) ports (8 Gbit/s)
  • Recently upgraded to DDR
• Host Channel Adapters (HCA) connected to 8x PCIe buses
• Remote Direct Memory Access (RDMA)
*) Double Data Rate  **) Single Data Rate

Test Methodology
Testing of individual parameters with microbenchmarks
• End-to-end communication latency and bandwidth
• Communication processing overhead
• Consistency of performance across the system
Testing of real-world behavior with a scientific application
• Gromacs – a popular open-source molecular dynamics application
All measurements use the Message Passing Interface (MPI)
• MPI is by far the most popular parallel programming application programming interface (API) for HPC
• Murska uses HP's HP-MPI implementation
• Louhi uses a Cray-modified version of the MPICH2 implementation

End-to-end Latency
The Intel MPI Benchmarks (IMB) PingPong test was used
• Measures the latency to send and receive a single point-to-point message as a function of the message size
• Arguably the most popular metric of interconnect performance (especially for short messages)
Murska's HP-MPI has two modes of operation (RDMA and SRQ)
• RDMA requires 256 kbytes of memory per MPI task, while SRQ (Shared Receive Queue) has a constant memory requirement
Findings:
• SRQ causes a notable latency overhead
• Murska using RDMA has slightly lower latency with short messages
• As the message size grows, Louhi outperforms Murska
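
For concreteness, the sketch below shows the basic structure of such a ping-pong measurement in C with MPI. It is a minimal illustration, not the actual IMB code: the repetition count and default message size are arbitrary assumptions, and it must be run with exactly two MPI tasks.

/* Minimal ping-pong latency sketch (illustrative only, not the IMB PingPong code).
 * Rank 0 sends a message to rank 1 and waits for the echo; half the average
 * round-trip time is reported as the one-way latency for that message size.
 * Run with exactly 2 MPI tasks, e.g. mpirun -np 2 ./pingpong 8 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;                                /* assumed repetition count */
    const int msg_size = argc > 1 ? atoi(argv[1]) : 8;    /* message size in bytes */
    char *buf = calloc(msg_size, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", msg_size, (t1 - t0) / (2.0 * reps) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}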

End-to-end Bandwidth
IMB PingPong test with large message sizes
• Bandwidth is derived from the same measurement: message size divided by the one-way transfer time
• The test was done between nearest-neighbor nodes
Louhi has still not reached its peak bandwidth at the largest message size (4 megabytes)
The gap between RDMA and SRQ narrows as the link becomes saturated: SRQ does not affect performance with large message sizes
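
As a worked example with hypothetical numbers (not measured values from the thesis): if a 4 MB PingPong message takes 2 ms one way, the reported bandwidth is 4 MB / 2 ms = 2 GB/s. Because the fixed per-message overhead is amortized over more bytes as the message grows, the largest messages give the figure closest to the link's peak rate.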

Communication Processing Overhead
A measure of how much communication stresses the CPU
• The C in HPC stands for Computing, not Communication ;)
MPI has asynchronous communication routines which overlap communication and computation
• This requires autonomous communication mechanisms
• Murska has RDMA; Louhi has protocol offloading and RDMA
Sandia National Laboratories' SMB benchmark was used
• Measures how much work one process gets done while another process communicates with it constantly
• The result is given as an application availability percentage
  • 100%: communication is performed completely in the background
  • 0%: communication is performed completely in the foreground; no work gets done
• Separate results for getting work done at the sender side and at the receiver side
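
The sketch below illustrates the idea behind such an availability measurement. It is a simplified stand-in, not the Sandia SMB code: the message size, the work loop and the way availability is computed are assumptions made for illustration, and it needs exactly two MPI tasks.

/* Simplified availability sketch (illustration only, not the Sandia SMB benchmark).
 * Rank 1 posts a non-blocking receive, does a fixed amount of compute work while
 * the 1 MiB message arrives, then completes the transfer with MPI_Wait. Comparing
 * the work time with and without the concurrent transfer gives an availability
 * percentage: near 100% means the transfer cost essentially no CPU time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)          /* 1 MiB message, an arbitrary choice */

static double do_work(long iters)   /* CPU-bound busy loop standing in for "useful work" */
{
    volatile double x = 1.0;
    for (long i = 0; i < iters; i++)
        x = x * 1.000001 + 0.000001;
    return x;
}

int main(int argc, char **argv)
{
    const long iters = 50000000L;
    char *buf = calloc(MSG_SIZE, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Baseline: how long does the work take with no communication going on? */
    double t0 = MPI_Wtime();
    do_work(iters);
    double t_base = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        t0 = MPI_Wtime();
        MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        do_work(iters);                     /* compute while the message is in flight */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double t_overlap = MPI_Wtime() - t0;
        printf("receiver-side availability: %.1f %%\n", 100.0 * t_base / t_overlap);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}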

Receiver Side Availability
100% availability: communication does not interfere with processing at all
0% availability: processing the communication hogs the CPU completely
• Louhi's availability improves with 8 kB-128 kB messages
• Murska's availability drops significantly between 16 kB and 32 kB

Sender Side Availability
Louhi's availability improves dramatically with large messages: the offload engine can process packets autonomously.

Gromacs
A popular and mature molecular dynamics simulation package
• Open source, downloadable from http://www.gromacs.org
Programmed with MPI
• Designed to exploit overlapping of computation and communication, if available
Parallel speedup was measured using a fixed-size molecular system
• Run times for task counts from 16 to 128 were measured
• MPI calls were profiled
  • How much time was spent in communication subroutines?
  • Which subroutines were the most time-consuming?
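
MPI call profiling of this kind is commonly done through the MPI standard's PMPI profiling interface, where each MPI call is intercepted by a wrapper that times it and then forwards to the real implementation. The sketch below is a hypothetical illustration of that mechanism; the thesis slides do not state which profiling tool was actually used, and only MPI_Wait is wrapped here as an example.

/* PMPI interposition sketch (hypothetical illustration; not the tool used in the thesis).
 * Linking this object ahead of the MPI library makes the application call these
 * wrappers, which accumulate the time spent in MPI_Wait and report it at the end. */
#include <mpi.h>
#include <stdio.h>

static double t_wait = 0.0;   /* total seconds spent inside MPI_Wait on this rank */
static long   n_wait = 0;     /* number of MPI_Wait calls on this rank */

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Wait(request, status);    /* forward to the real implementation */
    t_wait += MPI_Wtime() - t0;
    n_wait++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %.3f s in %ld MPI_Wait calls\n", rank, t_wait, n_wait);
    return PMPI_Finalize();
}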

Gromacs Run Time
• Time spent in MPI communication routines starts increasing at 64 tasks
• Murska's scaling stops at 32 tasks
• Louhi's scaling stops at 64 tasks

MPI Call Profile
Fraction of the total MPI time spent in a specific MPI call:
• With small message sizes, MPI_Alltoall (all processes send a message to each other) dominates the time spent in MPI
• MPI_Wait (wait for an asynchronous message transfer to complete) starts dominating time usage on Murska as the task count grows
• MPI_Alltoall starts dominating MPI time usage on Louhi again as the task count grows
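
For readers less familiar with these two calls, the sketch below shows them in minimal form. It is a hypothetical illustration with made-up buffer sizes, not Gromacs code: MPI_Alltoall exchanges one block between every pair of tasks (so its cost grows with the task count), while MPI_Wait blocks until a previously started non-blocking transfer has completed.

/* Minimal illustration of MPI_Alltoall and MPI_Wait (hypothetical, not Gromacs code). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int ntasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One block of 1024 doubles sent to and received from every task;
     * the buffer contents are irrelevant for the illustration. */
    double *sendbuf = calloc((size_t)ntasks * 1024, sizeof(double));
    double *recvbuf = calloc((size_t)ntasks * 1024, sizeof(double));
    MPI_Alltoall(sendbuf, 1024, MPI_DOUBLE,
                 recvbuf, 1024, MPI_DOUBLE, MPI_COMM_WORLD);

    /* A non-blocking send to the next rank in a ring, completed with MPI_Wait;
     * time spent here is what a profile reports under MPI_Wait. */
    MPI_Request req;
    MPI_Isend(sendbuf, 1024, MPI_DOUBLE, (rank + 1) % ntasks, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(recvbuf, 1024, MPI_DOUBLE, (rank + ntasks - 1) % ntasks, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}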

Conclusions
On Murska, a tradeoff has to be made with large parallel problems
• SRQ: sacrifice latency in favor of memory capacity
• RDMA: sacrifice memory capacity in favor of latency
Murska is able to outperform Louhi in some benchmarks
• Especially in short-message performance in RDMA mode
Louhi was more consistent in providing low processing overhead
• Being able to overlap long messages tends to matter more than overlapping short ones, as long messages take more time to complete
Gromacs scaled significantly better on Louhi
• Most likely largely due to lower communication processing overhead
A proprietary system still has its place
• The interconnect is designed from the ground up to handle MPI communication and HPC workloads
• The streamlined compute-node microkernel also helps
Focusing only on “hero numbers” (e.g. short-message latency) can be misleading

Questions? Thank you!