NERSC-5: Analyzing HPC Communication Requirements
Shoaib Kamil, Lenny Oliker, John Shalf, David Skinner (jshalf@lbl.gov)
NERSC and Computational Research Division, Lawrence Berkeley National Laboratory
Presented at Brocade Networks, October 10, 2007
Overview
• The CPU clock-scaling bonanza has ended
  – Heat density
  – New physics below 90 nm (departure from bulk material properties)
• Yet by the end of the decade, mission-critical applications are expected to have 100x the computational demands of current levels (PITAC Report, Feb 1999)
• The path forward for high-end computing is increasingly reliant on massive parallelism
  – Petascale platforms will likely have hundreds of thousands of processors
  – System costs and performance may soon be dominated by the interconnect
• What kind of interconnect is required for a >100k-processor system?
  – What topological requirements? (fully connected, mesh)
  – Bandwidth/latency characteristics?
  – Specialized support for collective communications?
Questions (how do we determine appropriate interconnect requirements?)
• Topology: will the apps inform us what kind of topology to use?
  – Crossbars: not scalable
  – Fat-trees: cost scales superlinearly with the number of processors
  – Lower-degree interconnects (n-dim mesh, torus, hypercube, Cayley):
    • Costs scale linearly with the number of processors
    • Problems with application mapping/scheduling, fault tolerance
• Bandwidth/latency/overhead
  – Which is most important? (trick question: they are intimately connected)
  – Requirements for a “balanced” machine? (i.e., performance is not dominated by communication costs)
• Collectives
  – How important, and what type?
  – Do they deserve a dedicated interconnect?
  – Should we put floating-point hardware into the NIC?
Approach
• Identify a candidate set of “Ultrascale applications” that span scientific disciplines
  – Applications demanding enough to require Ultrascale computing resources
  – Applications capable of scaling up to hundreds of thousands of processors
  – Not every app is “Ultrascale”!
• Find a communication-profiling methodology that is
  – Scalable: needs to run for a long time on many processors (traces are too large)
  – Non-invasive: some of these codes are large and can be difficult to instrument, even using automated tools
  – Low-impact on performance: full-scale apps, not proxies!
IPM (the “hammer”): Integrated Performance Monitoring
• Portable, lightweight, scalable profiling
• Fast hash method
• Profiles MPI topology
• Profiles code regions
• Open source

    MPI_Pcontrol(1, "W");
    ...code...
    MPI_Pcontrol(-1, "W");

Sample output:

    ##############################################
    # IPMv0.7 :: csnode041 256 tasks ES/ESOS
    # madbench.x (completed) 10/27/04/14:45:56
    #
    #              <mpi>     <user>     <wall> (sec)
    #             171.67     352.16     393.80
    # ...
    ##############################################
    # W
    #              <mpi>     <user>     <wall> (sec)
    #              36.40     198.00     198.36
    #
    # call          [time]      %mpi   %wall
    # MPI_Reduce    2.395e+01   65.8   6.1
    # MPI_Recv      9.625e+00   26.4   2.4
    # MPI_Send      2.708e+00   7.4    0.7
    # MPI_Testall   7.310e-02          0.0
    # MPI_Isend     2.597e-02   0.1    0.0
    ##############################################

Developed by David Skinner, NERSC
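To make the region markers above concrete, here is a minimal compilable sketch of instrumenting a code region for IPM. The MPI_Pcontrol calls and the region label "W" come straight from the slide; the Allreduce stand-in workload is an assumption for illustration.

    /* Minimal sketch: marking a code region for IPM profiling.
       IPM intercepts MPI_Pcontrol (as shown on the slide): a positive
       first argument opens a named region, a negative one closes it.
       Build with mpicc and link/preload IPM per its documentation. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Pcontrol(1, "W");              /* enter region "W" */
        double local = 1.0, global = 0.0;  /* stand-in for real work */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        MPI_Pcontrol(-1, "W");             /* exit region "W" */

        MPI_Finalize();   /* IPM banner/profile is emitted at exit */
        return 0;
    }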
Application Overview (the “nails”)

Name     Discipline        Problem/Method      Structure
-------  ----------------  ------------------  ----------------
MADCAP   Cosmology         CMB Analysis        Dense Matrix
FVCAM    Climate Modeling  AGCM                3D Grid
CACTUS   Astrophysics      General Relativity  3D Grid
LBMHD    Plasma Physics    MHD                 2D/3D Lattice
GTC      Magnetic Fusion   Vlasov-Poisson      Particle in Cell
PARATEC  Material Science  DFT                 Fourier/Grid
SuperLU  Multi-Discipline  LU Factorization    Sparse Matrix
PMEMD    Life Sciences     Molecular Dynamics  Particle
Latency Bound vs. Bandwidth Bound?
• How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
  – N1/2 from the early days of vector computing
  – Bandwidth-delay product in TCP

System           Technology        MPI Latency  Peak Bandwidth  Bandwidth-Delay Product
SGI Altix        NUMAlink-4        1.1 us       1.9 GB/s        2 KB
Cray X1          Cray Custom       7.3 us       6.3 GB/s        46 KB
NEC ES           NEC Custom        5.6 us       1.5 GB/s        8.4 KB
Myrinet Cluster  Myrinet 2000      5.7 us       500 MB/s        2.8 KB
Cray XD1         RapidArray/IB 4x  1.7 us       2 GB/s          3.4 KB

• Bandwidth bound if message size > bandwidth × delay
• Latency bound if message size < bandwidth × delay
  – Except if pipelined (unlikely with MPI due to overhead)
  – Cannot pipeline MPI collectives (but can in Titanium)
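A minimal sketch of the classification rule above, using the Cray XD1 row of the table (1.7 us latency, 2 GB/s peak bandwidth); the sample message sizes are illustrative assumptions.

    /* Minimal sketch: classify a message as latency- or bandwidth-bound
       by comparing its size to the bandwidth-delay product. */
    #include <stdio.h>

    int main(void) {
        double latency_s   = 1.7e-6;  /* MPI latency (s), Cray XD1 row */
        double bandwidth_B = 2.0e9;   /* peak bandwidth (bytes/s)      */
        double bdp = latency_s * bandwidth_B;   /* ~3.4 KB */

        size_t msg_sizes[] = { 64, 2048, 65536 };  /* illustrative */
        for (int i = 0; i < 3; i++)
            printf("%6zu bytes: %s bound\n", msg_sizes[i],
                   (double)msg_sizes[i] > bdp ? "bandwidth" : "latency");
        return 0;
    }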
Call Counts
(figures: per-application pie charts of MPI call counts; dominant calls include Isend/Irecv/Wait/Waitall/Waitany, Send/Recv, Sendrecv, Reduce, Allreduce, and Gather)
Diagram of Message Size Distribution Function (figure)
Message Size Distributions (figures)
P2P Buffer Sizes (figures)
Collective Buffer Sizes (figures)
Collective Buffer Sizes: 95% Latency Bound!!!
P2P Topology Overview (figures: per-application total message volume maps, scaled from 0 to max)
Low-Degree Regular Mesh Communication Patterns
Cactus Communication: PDE Solvers on Block-Structured Grids (figure)
LBMHD Communication (figure)
GTC Communication Call Counts (figure)
FVCAM Communication (figure)
SuperLU Communication (figure)
PMEMD Communication Call Counts (figure)
PARATEC Communication: 3D FFT (figure)
Latency/Balance Diagram (figure): quadrants spanning communication-bound vs. computation-bound and bandwidth-bound vs. latency-bound communication, indicating whether a system needs more interconnect bandwidth, lower interconnect latency, or faster processors.
Summary of Communication Patterns (256 procs)

Code       %P2P : %Coll  Avg. Coll Bufsize  Avg. P2P Bufsize  TDC@2k (max, avg)  %FCN Utilization
GTC        40% : 60%     100                128k              10, 4              2%
Cactus     99% : 1%      8                  300k              6, 5               2%
LBMHD      99% : 1%      8                  3D=848k, 2D=12k   12, 11.8           5% (3D), 2% (2D)
SuperLU    93% : 7%      24                 48                30, 30             25%
PMEMD      98% : 2%      768                6k or 72          255, 55            22%
PARATEC    99% : 1%      4                  64                255, 255           100% (<10%)
MADCAP-MG  78% : 22%     163k               1.2M              44, 40             23%
FVCAM      99% : 1%      8                  96k               20, 15             16%

(TDC = topological degree of communication; FCN = fully connected network)
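The TDC column counts each process's distinct communication partners. A minimal sketch of computing max and average TDC from an IPM-style traffic matrix; the matrix values and the byte threshold (filtering out low-bandwidth partners, as done later for AMR) are illustrative assumptions.

    /* Minimal sketch: max/avg TDC (number of distinct partners per
       process) from a traffic matrix, ignoring light partners. */
    #include <stdio.h>

    #define NPROC 4

    int main(void) {
        /* traffic[i][j]: bytes process i sent to j (illustrative) */
        long traffic[NPROC][NPROC] = {
            { 0, 4096, 64, 0 },
            { 4096, 0, 8192, 32 },
            { 64, 8192, 0, 2048 },
            { 0, 32, 2048, 0 },
        };
        long threshold = 1024;  /* assumed high-bandwidth cutoff */

        int max_tdc = 0, total = 0;
        for (int i = 0; i < NPROC; i++) {
            int tdc = 0;
            for (int j = 0; j < NPROC; j++)
                if (j != i && traffic[i][j] >= threshold)
                    tdc++;
            if (tdc > max_tdc) max_tdc = tdc;
            total += tdc;
        }
        printf("TDC max=%d avg=%.1f\n", max_tdc, (double)total / NPROC);
        return 0;
    }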
Requirements for Interconnect Topology (figure): applications plotted by communication intensity (number of neighbors) vs. regularity of the communication topology. PARATEC, PMEMD, SuperLU, and AMR (coming soon!) sit toward fully connected; 3D LBMHD, 2D LBMHD, Cactus, MADCAP, and FVCAM are regular with moderate degree; CAM/GTC and Monte Carlo sit toward embarrassingly parallel.
Coverage by Interconnect Topologies (figure): the same plot with network classes overlaid. A fully connected network (fat-tree/crossbar) covers PARATEC, PMEMD, and SuperLU; a 3D mesh covers 3D LBMHD and MADCAP; a 2D mesh covers 2D LBMHD, Cactus, FVCAM, and CAM/GTC; AMR falls on the irregular side.
Coverage by Interconnect Topologies, continued (figure): as above, but with question marks over AMR, 3D LBMHD, and MADCAP, whose coverage by the candidate topologies is uncertain.
Revisiting Original Questions
• Topology
  – Most codes require far less than full connectivity
    • PARATEC is the only code requiring full connectivity
    • Many require low degree (<12 neighbors)
  – Low-TDC codes are not necessarily isomorphic to a mesh!
    • Non-isotropic communication patterns
    • Non-uniform requirements
• Bandwidth/delay/overhead requirements
  – Scalable codes send primarily bandwidth-bound messages
  – Average message sizes are several kilobytes
• Collectives
  – Most payloads are less than 1k (8-100 bytes!)
    • Well below the bandwidth-delay product
    • Primarily latency-bound (requires a different kind of interconnect)
  – Math operations limited primarily to reductions involving sum, max, and min
  – Deserves a dedicated network (significantly different requirements)
Mitigation Strategies
• What does the data tell us to do?
  – P2P: focus on messages that are bandwidth-bound (i.e., larger than the bandwidth-delay product)
    • Switch latency = 50 ns
    • Propagation delay = 5 ns/meter
    • End-to-end latency = 1000-1500 ns for the very best interconnects!
  – Shunt collectives to their own tree network (BG/L)
  – Route latency-bound messages along non-dedicated links (multiple hops) or an alternate network (just like collectives)
  – Try to assign a direct/dedicated link to each of the distinct destinations a process communicates with
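A minimal sketch composing an end-to-end latency budget from the per-hop figures on this slide, and deriving the message-size cutoff above which a message deserves a dedicated link; the hop count, cable length, and link bandwidth are illustrative assumptions, not measured values.

    /* Minimal sketch: end-to-end latency from switch hops plus cable
       propagation, and the resulting bandwidth-bound size cutoff. */
    #include <stdio.h>

    int main(void) {
        double switch_latency_ns = 50.0;  /* per-switch (slide)     */
        double prop_ns_per_m     = 5.0;   /* propagation (slide)    */
        int    hops              = 5;     /* assumed switch hops    */
        double cable_m           = 100.0; /* assumed cable length   */
        double bandwidth_GBps    = 2.0;   /* assumed link bandwidth */

        double latency_ns = hops * switch_latency_ns
                          + cable_m * prop_ns_per_m;          /* 750 ns */
        double cutoff_B = latency_ns * bandwidth_GBps;        /* ~1.5 KB */

        printf("end-to-end latency: %.0f ns\n", latency_ns);
        printf("bandwidth-bound above ~%.0f bytes\n", cutoff_B);
        return 0;
    }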
Operating Systems for CMP
• Even cell phones will need an OS (and our idea of an OS is tooooo BIG!)
  – Mediating resources for many cores, protection from viruses, and managing increasing code complexity
  – But it has to be very small and modular! (see also embedded Linux)
• Old OS assumptions are bogus for hundreds of cores!
  – Assumes a limited number of CPUs that must be shared
    • Old OS: time-multiplexing (context switching and cache pollution!)
    • New OS: spatial partitioning
  – Greedy allocation of finite I/O device interfaces (e.g. 100 cores go after the network interface simultaneously)
    • Old OS: first process to acquire the lock gets the device (resource/lock contention! nondeterministic delay!)
    • New OS: QoS management for symmetric device access
  – Background task handling via threads and signals
    • Old OS: interrupts and threads (time-multiplexing) (inefficient!)
    • New OS: side-cores dedicated to DMA and async I/O
  – Fault isolation
    • Old OS: CPU failure --> kernel panic (will happen with increasing frequency in future silicon!)
    • New OS: CPU failure --> partition restart (partitioned device drivers)
  – Old OS invoked for any interprocessor communication or scheduling, vs. direct HW access
• What will the new OS look like?
  – Whatever it is, it will probably look like Linux (or ISVs will make life painful)
  – Linux is too big, but a microkernel is not sufficiently robust
  – Modular kernels are commonly used in embedded Linux applications! (e.g. VxWorks running under a hypervisor, XEN, K42, D.K. Panda's side cores)
I/O for Massive Concurrency
• Scalable I/O for massively concurrent systems!
  – Many issues with coordinating access to disk within a node (on-chip or CMP)
  – OS will need to devote more attention to QoS for cores competing for a finite resource (mutex locks and greedy resource-allocation policies will not do!) (it is rugby, where device == the ball)

nTasks  I/O Rate (16 tasks/node)  I/O Rate (8 tasks/node)
8       -                         131 Mbytes/sec
16      7 Mbytes/sec              139 Mbytes/sec
32      11 Mbytes/sec             217 Mbytes/sec
64      11 Mbytes/sec             318 Mbytes/sec
128     25 Mbytes/sec             471 Mbytes/sec
Other Topics for Discussion
• RDMA
• Low-overhead messaging
• Support for one-sided messages
  – Page pinning issues
  – TLB peers
  – Side cores
Conundrum
• Can’t afford to continue with fat-trees or other fully connected networks (FCNs)
• Can’t map many Ultrascale applications onto lower-degree networks like meshes, hypercubes, or tori
• How can we wire up a custom interconnect topology for each application?
Switch Technology
• Packet switch:
  – Reads each packet header and decides where it should go, fast!
  – Requires expensive ASICs for line-rate switching decisions
  – Optical transceivers
  (pictured: Force10 E1200, 1260x1 GigE / 56x10 GigE)
• Circuit switch:
  – Establishes a direct circuit from point to point (telephone switchboard)
  – Commodity MEMS optical circuit switch (pictured: Movaz iWSS, 400x400 lambda, 1-40 GigE)
  – Common in the telecom industry
  – Scalable to large crossbars
  – Slow switching (~100 microseconds)
  – Blind to message boundaries
A Hybrid Approach to Interconnects: HFAST
• Hybrid Flexibly Assignable Switch Topology (HFAST)
  – Use optical circuit switches to create a custom interconnect topology for each application as it runs (adaptive topology)
  – Why? Because circuit switches are
    • Cheaper: much simpler, passive components
    • Scalable: already available in large crossbar configurations
    • Able to assign switching resources non-uniformly
  – GMPLS manages changes to packet routing tables in tandem with circuit-switch reconfigurations
HFAST
• HFAST solves some sticky issues with other low-degree networks
  – Fault tolerance: 100k processors... 800k links between them using a 3D mesh (probability of failures?)
  – Job scheduling: finding a right-sized slot
  – Job packing: n-dimensional Tetris...
  – Handles apps with low communication degree but not isomorphic to a mesh, or with non-uniform requirements
• How/when to assign topology? (see the sketch after this list)
  – Job submit time: put topology hints in the batch script (BG/L, RS)
  – Runtime: provision a mesh topology and monitor with IPM, then use the data to reconfigure the circuit switch during a barrier
  – Runtime: pay attention to MPI topology directives (if used)
  – Compile time: code analysis and/or instrumentation using UPC, CAF, or Titanium
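A minimal sketch of one HFAST-style assignment step under stated assumptions: given an IPM-measured traffic matrix, grant each process dedicated circuit-switch links to its heaviest partners up to a per-port budget, leaving light (latency-bound) traffic to multi-hop or alternate-network routing, per the mitigation slide. The matrix values and link budget are illustrative; this is not the actual HFAST algorithm, just the greedy idea.

    /* Minimal sketch: greedy assignment of dedicated circuits to each
       process's heaviest communication partners. */
    #include <stdio.h>

    #define NPROC 4
    #define LINKS_PER_PROC 2   /* assumed circuit ports per process */

    int main(void) {
        /* traffic[i][j]: bytes process i sent to j (illustrative) */
        long traffic[NPROC][NPROC] = {
            { 0, 900, 10, 500 },
            { 900, 0, 700, 5 },
            { 10, 700, 0, 800 },
            { 500, 5, 800, 0 },
        };

        for (int i = 0; i < NPROC; i++) {
            printf("proc %d circuits:", i);
            int used[NPROC] = {0};
            for (int l = 0; l < LINKS_PER_PROC; l++) {
                int best = -1;
                for (int j = 0; j < NPROC; j++)  /* heaviest unserved */
                    if (j != i && !used[j] &&
                        (best < 0 || traffic[i][j] > traffic[i][best]))
                        best = j;
                if (best >= 0 && traffic[i][best] > 0) {
                    used[best] = 1;
                    printf(" ->%d (%ld B)", best, traffic[i][best]);
                }
            }
            printf("  (rest routed multi-hop)\n");
        }
        return 0;
    }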
HFAST Recent Work
• Clique mapping to improve switch-port utilization efficiency (Ali Pinar)
  – The general solution is NP-complete
  – Bounding the clique size yields a subproblem easier than the general NP-complete case, but still potentially very large
  – Examining good heuristics and solutions to restricted cases, for a mapping that completes within our lifetime
• AMR and adaptive applications (Oliker, Lijewski)
  – Examined the evolution of AMR communication topology
  – Degree of communication is very low if filtered for high-bandwidth messages
  – Reconfiguration costs can be hidden behind computation
• Hot-spot monitoring (Shoaib Kamil)
  – Use circuit switches to provision an overlay network gradually as the application runs
  – Gradually adjust the topology to remove hot-spots
Conclusions/Future Work
• Expansion of IPM studies
  – More DOE codes (e.g. AMR: Cactus/SAMRAI, Chombo, Enzo)
  – Temporal changes in communication patterns (AMR examples)
  – More architectures (comparative study like the Vector Evaluation project)
  – Put results in context of real DOE workload analysis
• HFAST
  – Performance prediction using discrete event simulation
  – Cost analysis (price out the parts for a mock-up and compare to an equivalent fat-tree or torus)
  – Time-domain switching studies (e.g. how do we deal with PARATEC?)
• Probes
  – Use results to create proxy applications/probes
  – Apply to HPCC benchmarks (generates more realistic communication patterns than the “randomly ordered rings” without the complexity of the full application code)