
Algorithmic Challenges of Exascale Computing
Kathy Yelick, Associate Laboratory Director for Computing Sciences and Acting NERSC Center Director, Lawrence Berkeley National Laboratory; EECS Professor, UC Berkeley

Obama for Communication-Avoiding Algorithms
"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."
FY 2012 Congressional Budget Request, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.

Energy Cost Challenge for Computing Facilities: at ~$1M per MW, energy costs are substantial
• 1 petaflop in 2010 uses 3 MW
• 1 exaflop in 2018 possible in 200 MW with "usual" scaling
• 1 exaflop in 2018 at 20 MW is the DOE target
[Chart: projected system power, "usual scaling" vs. the 20 MW goal, 2005-2020]

Measuring Efficiency
• NERSC in 2010 ran at 450 publications per MW-year, but that number drops with each new machine
• Next best: application performance per Watt
– Newest, largest machine is best
– Lower energy and cost per core
• Race-to-halt generally minimizes energy use
• For scientific computing centers, the metric should be science output per Watt…
[Chart: cost per core-hour ($0.00-$0.10), split into Center, Sys. Admin, and Power & Cooling, for Old-HPC, Cluster, and New-HPC systems]

New Processor Designs are Needed to Save Energy
• Cell phone processor: 0.1 Watt, 4 Gflop/s; server processor: 100 Watts, 50 Gflop/s
• Server processors have been designed for performance, not energy
– Graphics processors are 10-100x more efficient
– Embedded processors are 100-1000x more efficient
– Need manycore chips with thousands of cores

The Amdahl Case for Heterogeneity
• F is the fraction of time in parallel; 1-F is serial
• Consider a chip with area for 256 "thin" cores, where one "fat" core can be built from some of that thin-core area
• Assumes the fat/thin speedup equals the square root of the area advantage
[Chart: asymmetric speedup vs. size of the fat core in thin-core units, from 1 unit (256 cores) to 256 units (1 core), for F = 0.5, 0.9, 0.975, 0.99, 0.999]
Heterogeneity analysis by Mark Hill, U. Wisconsin
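As a worked example of the model behind this chart, the sketch below evaluates Hill and Marty's asymmetric-speedup formula under the slide's assumption that a fat core built from r thin-core units runs sqrt(r) times faster; the 256-unit chip and the F values come from the slide, everything else is illustrative.

```python
import numpy as np

def asymmetric_speedup(f, chip_area=256, fat_area=1):
    """Hill/Marty asymmetric Amdahl model: one fat core of `fat_area`
    thin-core units (perf = sqrt(area)) plus thin cores on the rest."""
    perf_fat = np.sqrt(fat_area)          # serial fraction runs on the fat core
    thin_cores = chip_area - fat_area     # remaining area holds thin cores
    return 1.0 / ((1 - f) / perf_fat + f / (perf_fat + thin_cores))

for f in (0.5, 0.9, 0.975, 0.99, 0.999):
    best_r = max(range(1, 257), key=lambda r: asymmetric_speedup(f, 256, r))
    print(f"F={f}: best fat-core size = {best_r} units, "
          f"speedup = {asymmetric_speedup(f, 256, best_r):.0f}x")
```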

New Processors Mean New Software
[Chart: projected exascale system power by component (processors, memory, interconnect): ~130 MW with server processors vs. ~75 MW with manycore]
• Exascale will have chips with thousands of tiny processor cores, and a few large ones
• Architecture is an open question:
– a sea of embedded cores with heavyweight "service" nodes, or
– lightweight cores as accelerators to CPUs

Memory Capacity is Not Keeping Pace
Technology trends work against constant or increasing memory per core
• Memory density is doubling every three years; processor logic is doubling every two
• Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
[Chart: cost of computation vs. memory over time; source: David Turek, IBM]
Question: can you double concurrency without doubling memory?

Why avoid communication?
• Running time of an algorithm is the sum of 3 terms:
– # flops * time_per_flop
– # words moved / bandwidth   (communication)
– # messages * latency        (communication)
• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]

Annual improvements:  time_per_flop 59%
                      Bandwidth   Latency
  Network             26%         15%
  DRAM                23%         5%

• And these are hard to change: "Latency is physics, bandwidth is money"
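A minimal sketch of this three-term cost model; the machine parameters below are illustrative placeholders, not measurements of any particular system.

```python
def runtime(n_flops, n_words, n_msgs,
            time_per_flop=1e-10,    # ~10 Gflop/s per core (assumed)
            time_per_word=1e-9,     # inverse bandwidth, assumed
            latency=1e-6):          # ~1 microsecond per message (assumed)
    """T = #flops * time_per_flop + #words / bandwidth + #messages * latency."""
    return n_flops * time_per_flop + n_words * time_per_word + n_msgs * latency

# Example: 10^9 flops, 10^7 words moved, 10^4 messages
print(f"flop term:    {1e9 * 1e-10:.4f} s")
print(f"word term:    {1e7 * 1e-9:.4f} s")
print(f"message term: {1e4 * 1e-6:.4f} s")
print(f"total:        {runtime(1e9, 1e7, 1e4):.4f} s")
```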

Bandwidth (to Memory and Remote Nodes) is an Energy Hog
[Chart: energy per operation in picojoules (1 to 10,000), now vs. 2018, for DP FLOP, register access, 1 mm and 5 mm on-chip communication, off-chip/DRAM access, local interconnect, and cross-system communication; grouped as on-chip/CMP, intranode/SMP, and intranode/MPI communication]

Value of Local Store Memory
• Unit stride is as important as cache hits on hardware with prefetch
– Don't cut unit stride when tiling
• Software-controlled memory ("scratchpad") gives more control
– May also matter more for a new level of memory between DRAM and disk
Joint work with Shoaib Kamil, Lenny Oliker, John Shalf, Kaushik Datta

New Processors Mean New Software
[Chart: projected exascale system power by component: ~130 MW with server processors, ~75 MW with manycore, ~25 MW with low-power memory and interconnect]
• Exascale will have chips with thousands of tiny processor cores, and a few large ones
• Architecture is an open question:
– a sea of embedded cores with heavyweight "service" nodes, or
– lightweight cores as accelerators to CPUs
• Low-power memory and storage technology are key

Why Avoid Synchronization?
• Processors do not run at the same speed
– They never did, due to caches
– Power / temperature management makes this worse
– Power swings of 50% on real systems
– At $3M/MW of capital costs, you don't want 50% headroom
[Chart annotation: ~60% swing; HPC can't turn this off]

Errors Can Turn into Performance Problems
• Fault resilience introduces inhomogeneity in execution rates (error correction is not instantaneous)
Slide source: John Shalf

Challenges to Exascale Performance Growth
1) System power is the primary constraint
2) Concurrency (1000x today)
3) Memory bandwidth and capacity are not keeping pace
4) Processor architecture is open, but likely heterogeneous
5) Programming model: heroic compilers will not hide this
6) Algorithms need to minimize data movement, not flops
7) I/O bandwidth unlikely to keep pace with machine speed
8) Resiliency critical at large scale (in time or processors)
9) Bisection bandwidth limited by cost and energy
Unlike the last 20 years, most of these (1-7) are equally important across scales, e.g., 1000 1-PF machines

Algorithms to Optimize for Communication

Avoiding Communication in Iterative Solvers
• Consider sparse iterative methods for Ax=b
– Krylov subspace methods: GMRES, CG, …
– Can we lower the communication costs?
• Latency of communication, i.e., reduce # messages by computing multiple reductions at once
• Bandwidth to the memory hierarchy, i.e., compute Ax, A²x, …, Aᵏx with one read of A
• Solve time is dominated by:
– Sparse matrix-vector multiply (SpMV), which even on one processor is dominated by "communication" time to read the matrix
– Global collectives (reductions), which are limited by global latency
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin

Autotuning Gets Kernel Performance Near Optimal
• Roofline model captures bandwidth and computation limits
• Autotuning gets kernels near the roof
[Roofline charts for Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi): attainable Gflop/s vs. arithmetic intensity, with single- and double-precision peaks, DGEMM, RTM/wave eqn., 7- and 27-point stencils, SpMV, and GTC kernels plotted against the bandwidth and compute ceilings]
Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier, …
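A minimal sketch of the roofline bound used in these charts: attainable performance is the minimum of the compute peak and arithmetic intensity times memory bandwidth. The peak and bandwidth numbers below are placeholders, not the measured Nehalem or Fermi values, and the kernel intensities are rough assumed figures.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops=85.0, stream_bw_gbytes=35.0):
    """Attainable Gflop/s = min(compute peak, bandwidth * flops-per-byte)."""
    return min(peak_gflops, stream_bw_gbytes * arithmetic_intensity)

# Rough, assumed intensities: SpMV is bandwidth-bound, DGEMM is compute-bound.
for name, ai in [("SpMV", 0.25), ("7-pt stencil", 0.5),
                 ("27-pt stencil", 1.0), ("DGEMM", 8.0)]:
    print(f"{name:13s} {ai:5.2f} flop/byte -> bound = {roofline_gflops(ai):6.1f} Gflop/s")
```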

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Idea: pick up the part of A and x that fits in fast memory, and compute each of the k products
• Example: A tridiagonal, n=32, k=3
[Diagram: dependencies of x, A·x, A²·x, A³·x across unknowns 1…32]
• Works for any "well-partitioned" A

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Sequential algorithm: sweep over blocks of the unknowns (steps 1-4), computing all k products for one block before moving to the next
• Example: A tridiagonal, n=32, k=3
• Saves bandwidth (one read of A and x for k steps)
• Saves latency (number of independent read events)

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel algorithm: processors 1-4 each own a contiguous block of unknowns
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with its neighbors

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel algorithm: each processor works on an (overlapping) trapezoid of the dependency graph
• Example: A tridiagonal, n=32, k=3
• Saves latency (# of messages), not bandwidth, but adds redundant computation
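A hedged serial sketch of the matrix powers idea for the tridiagonal example: each block reads its rows of A and a k-deep "ghost" region of x once, then computes all k products locally, redundantly recomputing the trapezoid edges. The block size and random tridiagonal matrix are illustrative; a real implementation partitions a general sparse A.

```python
import numpy as np

def matrix_powers_tridiag(lo, d, up, x, k, block=8):
    """Return [A@x, A^2@x, ..., A^k@x] for tridiagonal A (sub-, main-,
    super-diagonals lo, d, up), reading each block of A and x once plus a
    k-deep ghost region; trapezoid edges are recomputed redundantly."""
    n = len(x)
    out = np.zeros((k, n))
    for start in range(0, n, block):
        end = min(start + block, n)
        gl, gr = max(0, start - k), min(n, end + k)   # ghost region of depth k
        xv = x[gl:gr].copy()
        for step in range(k):
            yv = d[gl:gr] * xv
            yv[1:] += lo[gl:gr - 1] * xv[:-1]
            yv[:-1] += up[gl:gr - 1] * xv[1:]
            out[step, start:end] = yv[start - gl:end - gl]  # interior stays valid
            xv = yv
    return out

# Check against k explicit SpMVs.
rng = np.random.default_rng(0)
n, k = 32, 3
lo, d, up = rng.random(n - 1), rng.random(n), rng.random(n - 1)
A = np.diag(d) + np.diag(lo, -1) + np.diag(up, 1)
x = rng.random(n)
ref, v = [], x
for _ in range(k):
    v = A @ v
    ref.append(v)
assert np.allclose(matrix_powers_tridiag(lo, d, up, x, k), np.array(ref))
print("matrix powers kernel matches", k, "explicit SpMVs")
```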

Matrix Powers Kernel on a General Matrix
• For implicit memory management (caches), uses a TSP algorithm for the layout
• Saves communication for "well partitioned" matrices
– Serial: O(1) moves of data vs. O(k)
– Parallel: O(log p) messages vs. O(k log p)
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin

Bigger Kernel (Aᵏx) Runs at Faster Speed than Simpler (Ax)
[Chart: speedups on Intel Clovertown (8 cores)]
Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick

Minimizing Communication of GMRES to Solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax-b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]      … matrix powers kernel
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• Oops: W is built by the power method, so precision is lost!
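A schematic sketch of how the Krylov basis is built in each variant (not the production CA-GMRES): the standard loop synchronizes inside MGS after every SpMV, while the communication-avoiding variant runs the matrix powers kernel and a single QR. Here np.linalg.qr stands in for TSQR, and the monomial basis is used exactly as the "oops" bullet warns; real CA-GMRES switches to a better-conditioned (e.g., Newton) basis.

```python
import numpy as np

def gmres_basis_standard(A, v, k):
    """k SpMVs, each followed by modified Gram-Schmidt:
    one reduction per basis vector per step."""
    V = [v / np.linalg.norm(v)]
    for _ in range(k):
        w = A @ V[-1]                        # SpMV
        for q in V:                          # MGS: many small synchronizations
            w -= (q @ w) * q
        V.append(w / np.linalg.norm(w))
    return np.column_stack(V)

def gmres_basis_ca(A, v, k):
    """Matrix powers kernel + one tall-skinny QR: far fewer synchronizations.
    np.linalg.qr stands in for TSQR; the monomial basis loses precision."""
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])                  # [v, Av, ..., A^k v]
    Q, _ = np.linalg.qr(np.column_stack(W))
    return Q

rng = np.random.default_rng(1)
A = rng.random((50, 50))
v = rng.random(50)
print(gmres_basis_standard(A, v, 5).shape, gmres_basis_ca(A, v, 5).shape)
```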

TSQR: An Architecture-Dependent Algorithm
[Diagrams of reduction trees: Parallel: W is split into W0…W3; local QRs give R00…R30, combined pairwise into R01, R11 and then R02. Sequential: W0…W3 are folded in one at a time, R00 → R01 → R02 → R03. Dual core: a hybrid of the two.]
• Multicore / multisocket / multirack / multisite / out-of-core: ?
• Can choose the reduction tree dynamically
Work by Laura Grigori, Jim Demmel, Mark Hoemmen, Julien Langou
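A minimal serial emulation of the parallel (binary-tree) TSQR reduction: local QR on each block of rows, then pairwise QR of stacked R factors up the tree. Block count and matrix sizes are arbitrary, and the simple pairing assumes a power-of-two number of blocks.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """Emulate the parallel TSQR tree: local QR on each block of rows,
    then pairwise QR of stacked R factors until one R remains.
    (This simple pairing assumes nblocks is a power of two.)"""
    Rs = [np.linalg.qr(B)[1] for B in np.array_split(W, nblocks)]   # leaf QRs
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]                  # combine up the tree
    return Rs[0]

W = np.random.default_rng(2).random((10_000, 10))   # tall-skinny matrix
R_tree, R_ref = tsqr_R(W), np.linalg.qr(W)[1]
# R factors of a full-rank matrix agree up to the sign of each row.
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
print("tree R matches direct QR up to row signs")
```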

TSQR Performance Results
• Parallel
– Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
– Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Grid: 4x on 4 cities (Dongarra et al)
– Cloud: early result, up and running using Mesos
• Sequential
– Out-of-core on a PowerPC laptop: as little as 2x slowdown vs. (predicted) infinite DRAM; LAPACK with virtual memory never finished
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Matrix Powers Kernel (and TSQR) in GMRES

Communication-Avoiding Krylov Method (GMRES) Performance on 8 core Clovertown

CA-Krylov Methods: Summary and Future Work
• Communication avoidance works
– Provably optimal
– Faster in practice
• Ongoing work for "hard" matrices
– Matrices that partition poorly (high surface-to-volume ratio)
– Idea: separate out dense rows (HSS matrices)
– [Erin Carson, Nick Knight, Jim Demmel]

Beyond Domain Decomposition: 2.5D Matrix Multiply
• Conventional "2D algorithms" use a P^1/2 x P^1/2 mesh and minimal memory
• New "2.5D algorithms" use a (P/c)^1/2 x (P/c)^1/2 x c mesh and c-fold memory
• Matmul sends c^1/2 times fewer words (matches the lower bound)
• Matmul sends c^3/2 times fewer messages (matches the lower bound)
• Perfect strong scaling
Work by Edgar Solomonik and Jim Demmel
Lesson: never waste fast memory
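A small cost-model sketch comparing per-processor communication for 2D (c = 1) and 2.5D matrix multiply, using the standard asymptotic bounds with constants omitted; the problem size, processor count, and replication factors below are made up for illustration (and c must stay at most P^1/3 for the formulas to apply).

```python
from math import sqrt

def matmul_comm_per_proc(n, P, c=1):
    """Asymptotic per-processor communication for dense n x n matmul on P
    processors with c-fold data replication (c = 1 is the 2D case).
    Constants omitted; requires 1 <= c <= P**(1/3)."""
    words = n * n / sqrt(c * P)       # bandwidth cost: O(n^2 / sqrt(cP))
    messages = sqrt(P / c**3)         # latency cost:   O(sqrt(P / c^3))
    return words, messages

n, P = 32768, 4096
for c in (1, 4, 16):                  # 16 == 4096**(1/3), the largest useful c here
    w, m = matmul_comm_per_proc(n, P, c)
    print(f"c = {c:2d}: ~{w:.2e} words, ~{m:4.0f} messages per processor")
```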

Communication Avoidance in Multigrid: Method of Local Corrections (MLC) for Poisson's Equation
[Equation image: Poisson's equation; real analytic, with rapidly convergent Taylor expansion]
• MLC uses domain decomposition, plus a non-iterative form of multigrid, to represent the solution in a way that minimizes global communication and increases computational intensity
• Local domains fit in cache; local solves are computed using an FFT and a simplified version of the FMM. For 16³-32³ domains, this yields ~5 flops/byte
• The (global) coarse problem is small; its communication is comparable to a single relaxation step
• Weak scaling: 95% efficiency up to 1024 processors
Work by Phil Colella et al.

Avoiding Synchronization

Avoid Synchronization from Applications: Computations as DAGs
• View parallel executions as the directed acyclic graph of the computation
[Diagrams: task DAGs for 4x4 Cholesky and 4x4 QR]
Slide source: Jack Dongarra
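A toy sketch of DAG-driven execution: each task starts as soon as its direct predecessors finish, instead of waiting at bulk-synchronous barriers. The task names and dependencies below are invented for illustration, loosely echoing a tiled Cholesky.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented task DAG: task -> list of predecessor tasks.
deps = {"potrf0": [], "trsm1": ["potrf0"], "trsm2": ["potrf0"],
        "syrk1": ["trsm1"], "gemm12": ["trsm1", "trsm2"], "potrf1": ["syrk1"]}

def run_dag(deps, work, workers=4):
    """Run each task as soon as its direct predecessors complete
    (no global barriers between 'phases')."""
    futures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        def submit(name):
            if name not in futures:
                parents = [submit(p) for p in deps[name]]   # predecessors first
                def task(parents=parents, name=name):
                    for p in parents:
                        p.result()                          # wait only on direct predecessors
                    return work(name)
                futures[name] = pool.submit(task)
            return futures[name]
        for name in deps:
            submit(name)
        return {name: f.result() for name, f in futures.items()}

print(run_dag(deps, lambda name: f"{name} done"))
```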

Avoiding Synchronization in Communication
[Diagram: a two-sided message carries a message id and data payload and must be matched by the receiving host CPU; a one-sided put carries an address and data payload and is deposited directly into remote memory by the network interface]
• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address to put the data
– Wildly popular in HPC, but cumbersome in some applications
– Couples data transfer with synchronization
• Using a global address space decouples synchronization
– Pay for what you need!
– Note: global addressing ≠ cache-coherent shared memory
Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and the rest of the UPC group
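The slide's context is UPC and one-sided GASNet communication; as a stand-in in the same spirit, here is a minimal mpi4py sketch (assuming mpi4py and an MPI launcher are available) contrasting a matched send/receive with a one-sided put into a remote window that the target never has to touch.

```python
# Run with, e.g.:  mpiexec -n 2 python onesided_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Two-sided: sender and receiver must both participate to match the message.
recv_buf = np.zeros(4)
if rank == 0:
    comm.Send(np.arange(4.0), dest=1, tag=0)
elif rank == 1:
    comm.Recv(recv_buf, source=0, tag=0)      # transfer is coupled to this call

# One-sided: rank 0 puts directly into memory that rank 1 has exposed.
window_mem = np.zeros(4)
win = MPI.Win.Create(window_mem, comm=comm)   # every rank exposes window_mem
if rank == 0:
    win.Lock(1, MPI.LOCK_EXCLUSIVE)           # passive target: rank 1 makes no call
    win.Put(np.arange(4.0), target_rank=1)
    win.Unlock(1)
comm.Barrier()                                # only so the print happens after the Put
if rank == 1:
    print("two-sided:", recv_buf, "one-sided:", window_mem)
win.Free()
```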

Event Driven LU in UPC
• Assignment of work is static; the schedule is dynamic
• Ordering needs to be imposed on the schedule
– Critical path operation: panel factorization
• General issue: dynamic scheduling in partitioned memory
– Can deadlock in memory allocation
– "Memory constrained" lookahead
[Diagram: task dependencies in the factorization; some edges omitted]

DAG Scheduling Outperforms Bulk-Synchronous Style
[Charts: PLASMA on shared memory; UPC on partitioned memory]
• The UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
• New problem in partitioned memory: allocator deadlock
– Can run out of memory locally due to an unlucky execution order
PLASMA by Dongarra et al; UPC LU joint work with Parry Husbands

Irregular vs. Regular Parallelism
• Computations with known task graphs can be mapped to resources in an offline manner (before the computation starts)
– Regular graphs: by a compiler (static) or runtime (semi-static)
– Irregular graphs: by a DAG scheduler
– No need for online scheduling
• If graphs are not known ahead of time (structure, task costs, communication costs), then dynamic scheduling is needed
– Task stealing / task sharing
– Demonstrated on shared memory
• Conclusion: if your task graph is dynamic, the runtime needs to be; but what if it is static?

Load Balancing with Locality
• Locality is important:
– When memory hierarchies are deep
– When computational intensity is low
• Most (all?) successful examples of locality-important applications/machines use static scheduling
– Unless they have an irregular/dynamic task graph
• Two extremes are well studied:
– Dynamic parallelism without locality
– Static parallelism (with threads = processors) with locality
• Dynamic scheduling and locality control don't mix
– Locality control can cause a non-optimal task schedule, which can blow up memory use (breadth- vs. depth-first traversal)
– Can run out of memory locally when you don't globally

Use Two Programming Models to Match the Machine
[Charts: PARATEC (768 cores on Jaguar) and fvCAM (240 cores on Jaguar): run time (sec) and memory per node (GB) vs. cores per MPI process (1, 2, 3, 6, 12)]
Hybrid programming is key to saving memory (2011) and sometimes improves performance

Getting the Best of Each: PGAS for Locality and Convenience
• Global address space: directly read/write remote data
• Partitioned: data is designated as local or global
– Affinity control
– Scalability
– Never say "receive"
[Diagram: threads p0…pn each with private variables (x, y, l) plus a globally addressable portion (g)]
• The debate is around the control model:
– Dynamic thread creation vs. SPMD
– Hierarchical SPMD as a compromise
– "Think parallel" and group as needed

Hierarchical SPMD in Titanium
• Thread teams may execute distinct tasks:
  partition(T) {
    { model_fluid(); }
    { model_muscles(); }
    { model_electrical(); }
  }
• Hierarchy for machine / tasks
– Nearby: access shared data
– Far away: copy data
• Advantages:
– Provable pointer properties
– Mixed data / task style
– Lexical scope prevents some deadlocks
[Diagram: team hierarchy (A with subteams B, C, D) and pointer spans: span 1 (core local), span 2 (processor local), span 3 (node local), span 4 (global)]

Hierarchical Programming Model: Phalanx
[Diagram: a hierarchy of processors and memories]
• Invoke functions on a set of cores and a set of memories
• Hierarchy of memories
– Can query to get (some) aspects of the hierarchical structure
• Functionally homogeneous cores (on Echelon)
– Can query to get (performance) properties of cores
• Hierarchy of thread blocks
– May be aligned with hardware based on queries
Echelon programming system team: Michael Garland, Alex Aiken, Brad Chamberlain, Mary Hall, Greg Titus, Kathy Yelick

Stepping Back
• Communication avoidance is as old as tiling
• Communication optimality is as old as Hong/Kung
• What's new?
– Raising the level of abstraction at which we optimize: BLAS2 → BLAS3 → LU, or SpMV/DOT → Krylov
– Changing numerics in non-trivial ways
– Rethinking methods down to the models
• Communication and synchronization avoidance
• Software engineering: breaking abstraction
• Compilers: inter-procedural optimizations

Exascale Views
• "Exascale" is about continuing growth in computing performance for science
– Energy efficiency is key
– Job size is irrelevant
• Success means:
– Influencing the market: HPC, technical computing, clouds, general purpose
– Getting more science from data and computing
• Failure means:
– A few big machines for a few big applications
• Not all computing problems are exascale, but they should all be exascale-technology aware

Thank You!