Description and Purposes of the Roofline Performance Model

Tuomas Koskela, Mathieu Lobet, Jack Deslippe, Zakhar Matveev
November 1, 2020

Why Do We Need the Roofline Model?
• Need a sense of absolute performance when optimizing applications
  – How do I know if my performance is good?
  – Why am I not getting peak performance of the platform?
• Many potential optimization directions
  – How do I know which one to apply?
  – What is the limiting factor in my app’s performance?
  – How do I know when to stop?

Roofline is a Visual Performance Model
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte), with a Memory Bandwidth Limit roof and a Compute Limit roof]
• Reflects a performance bound (Gflop/s) as a function of Arithmetic Intensity (AI).
• Is a feature of the architecture.
• Allows comparison of app performance against theoretical peak on an absolute scale.
“Roofline is a visually intuitive performance model used to bound the performance of various numerical methods and operations running on multicore, manycore, or accelerator processor architectures.”

Arithmetic Intensity is a Ratio of Flops to Bytes
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte), with a Memory Bandwidth Limit roof and a Compute Limit roof]
Arithmetic Intensity = Flops computed / Bytes transferred
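As a rough illustration of the ratio above, the sketch below estimates the AI of a STREAM-triad-style loop, a[i] = b[i] + s * c[i]. The kernel choice, the array length, and the byte accounting (8-byte doubles, two loads and one store per element, no write-allocate traffic) are assumptions made for this example, not figures from the slides.

```python
def arithmetic_intensity(flops: float, bytes_transferred: float) -> float:
    """Return flops per byte; both arguments are totals for the same kernel."""
    if bytes_transferred <= 0:
        raise ValueError("bytes_transferred must be positive")
    return flops / bytes_transferred


# Hypothetical triad kernel: a[i] = b[i] + s * c[i] over n doubles.
n = 1_000_000
triad_flops = 2 * n        # one multiply and one add per element
triad_bytes = 3 * 8 * n    # load b[i], load c[i], store a[i], 8 bytes each

triad_ai = arithmetic_intensity(triad_flops, triad_bytes)
print(f"Triad AI = {triad_ai:.3f} flops/byte")
```

Counting write-allocate traffic would raise the byte count and lower the AI, which is one reason measured values can differ from hand counts like this one.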

Peak Performance is Bound by the Roof
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte), with a Memory Bandwidth Limit roof and a Compute Limit roof]
The attainable system performance is the maximal performance that can be reached by an application:
Attainable performance [Gflop/s] = min( Peak performance [Gflop/s], Peak Memory Bandwidth [Gbyte/s] x Arithmetic Intensity [Gflop/Gbyte] )
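The min formula on this slide can be sketched in a few lines. The peak and bandwidth values below are placeholder numbers chosen for illustration, not measurements of any particular machine.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbytes: float, ai: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof at this AI."""
    return min(peak_gflops, peak_bw_gbytes * ai)


def ridge_point(peak_gflops: float, peak_bw_gbytes: float) -> float:
    """AI (flops/byte) at which the bandwidth roof meets the compute roof."""
    return peak_gflops / peak_bw_gbytes


# Hypothetical platform: 3000 Gflop/s peak, 400 Gbyte/s memory bandwidth.
PEAK = 3000.0
BANDWIDTH = 400.0

low_ai_bound = attainable_gflops(PEAK, BANDWIDTH, 0.5)    # limited by bandwidth
high_ai_bound = attainable_gflops(PEAK, BANDWIDTH, 10.0)  # limited by compute
ridge = ridge_point(PEAK, BANDWIDTH)

print(low_ai_bound, high_ai_bound, ridge)
```

Kernels with AI below the ridge point sit under the sloped bandwidth roof; kernels above it sit under the flat compute roof.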

There Are Different Ceilings
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte). Compute ceilings: CPU Limit FMA+SIMD, CPU Limit FMA, CPU Limit Scalar. Bandwidth ceilings: L1 Bandwidth Limit, L2 Bandwidth Limit, MCDRAM Bandwidth Limit, DRAM Bandwidth Limit]
KNL theoretical peak performance = AVX frequency (1.4 GHz) x 8 (vector width) x 2 (dual VPUs) x 2 (FMA inst.) x number of cores ≈ 3.0 TFlop/s
In reality, mileage may vary. It is usually best to measure peak bandwidths and peak performance with micro-benchmarks.
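The peak-performance product on this slide can be checked with a short calculation. The slide does not state the core count; 68 cores is an assumption made here for illustration, and with it the product comes to about 3046 Gflop/s, i.e. roughly 3.0 TFlop/s.

```python
def theoretical_peak_gflops(freq_ghz: float, vector_width: int, vpus_per_core: int,
                            flops_per_fma: int, cores: int) -> float:
    """Multiply out the factors from the slide; a GHz clock gives a result in Gflop/s."""
    return freq_ghz * vector_width * vpus_per_core * flops_per_fma * cores


ASSUMED_CORES = 68  # assumption: not given on the slide

per_core = theoretical_peak_gflops(1.4, 8, 2, 2, 1)
knl_peak = theoretical_peak_gflops(1.4, 8, 2, 2, ASSUMED_CORES)

print(f"Per core: {per_core:.1f} Gflop/s")
print(f"Whole chip ({ASSUMED_CORES} cores): {knl_peak:.1f} Gflop/s")
```

This is a theoretical ceiling only; as the slide notes, measured micro-benchmark peaks are the better reference.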

Different Flavors of the Roofline Model: Classical and Cache-Aware
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte). Compute ceilings: CPU Limit FMA+SIMD, CPU Limit FMA, CPU Limit Scalar. Bandwidth ceilings: L1 Bandwidth Limit, L2 Bandwidth Limit, MCDRAM Bandwidth Limit, DRAM Bandwidth Limit]
You can define AI from any level of cache or memory (L1, L2, L3, MCDRAM, DRAM)
• Classical Roofline defines AI using bytes from DRAM
• Cache-Aware Roofline defines AI using total bytes from ALL memory hierarchy levels
• Both are useful!
Cache-Aware Roofline automation is implemented in Intel Vector Advisor 2017. Classical Roofline automation in Advisor will be available in future releases.

Classical Roofline vs Cache-Aware Roofline

Classical Roofline Model
AI = # FLOPS / # BYTES (DRAM)
• Bytes out of a level in memory hierarchy are measured in AI
• AI depends on problem size
• AI is platform dependent
• AI depends on cache reuse

Cache-Aware Roofline Model
AI = # FLOPS / # BYTES (CPU)
• Bytes into the CPU from all levels in memory hierarchy are measured in AI
• AI is independent of problem size
• AI is independent of platform
• AI is constant for an algorithm
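To make the two definitions concrete, the sketch below computes both AIs for the same hypothetical kernel at two problem sizes. All flop and byte counts are invented for illustration; in practice they would come from a profiler or hardware counters.

```python
def classical_ai(flops: float, dram_bytes: float) -> float:
    """Classical Roofline: flops divided by bytes moved out of DRAM."""
    return flops / dram_bytes


def cache_aware_ai(flops: float, cpu_bytes: float) -> float:
    """Cache-Aware Roofline: flops divided by all bytes moved into the CPU."""
    return flops / cpu_bytes


# Hypothetical kernel: the same flops and CPU-side traffic in both runs.
FLOPS = 1.0e9
CPU_BYTES = 8.0e9

# Small problem: the working set fits in cache, so little DRAM traffic.
small_classical = classical_ai(FLOPS, dram_bytes=1.0e8)
# Large problem: no cache reuse, so every byte comes from DRAM.
large_classical = classical_ai(FLOPS, dram_bytes=8.0e9)
# The cache-aware AI is the same for both runs.
ca = cache_aware_ai(FLOPS, CPU_BYTES)

print(small_classical, large_classical, ca)
```

With no cache reuse the two definitions coincide; with reuse, only the classical AI moves.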

Is My Application Bound by Memory Bandwidth or by Compute Peak?
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte), divided into three areas: 1. Memory Bound, 2. Memory/Compute Bound, 3. Compute Bound]
Often it’s a combination of the two
• Applications in area 1 are purely memory bandwidth bound
• Applications in area 3 are purely compute bound
• In area 2 we need more information
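One possible way to draw the three areas, sketched below, is to compare the extreme ceilings: an AI is treated as purely memory bound if even the fastest bandwidth roof stays below the lowest compute ceiling, and as purely compute bound if even the slowest bandwidth roof exceeds the highest compute ceiling. These boundary definitions and all numeric ceilings are assumptions made for this example; the slides do not specify where the areas begin and end.

```python
def classify_region(ai: float, lowest_compute: float, highest_compute: float,
                    lowest_bw: float, highest_bw: float) -> str:
    """Classify an AI into one of three areas using the extreme ceilings (assumed rule)."""
    if highest_bw * ai <= lowest_compute:
        return "memory bound"          # area 1
    if lowest_bw * ai >= highest_compute:
        return "compute bound"         # area 3
    return "memory/compute bound"      # area 2: need more information


# Hypothetical ceilings (Gflop/s and Gbyte/s), not measured values.
CEILINGS = dict(lowest_compute=190.0, highest_compute=3000.0,
                lowest_bw=90.0, highest_bw=6000.0)

for ai in (0.01, 1.0, 50.0):
    print(ai, classify_region(ai, **CEILINGS))
```

Under this rule area 2 is wide, which is consistent with the slide's remark that most cases are a combination of the two.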

Example 1: Memory Bound Application
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte). Legend: Cache-Aware, Classical, Both Equal]
1. Low AI, “Stream-like” application. Assume it’s well vectorized
• No cache reuse, so DRAM bandwidth bound
• DRAM AI = L1 AI
2. Implement L2 cache optimization
• L2 cache is fully reused, GFLOPS increase
• The Cache-Aware (C-A) point rises up to the L2 bandwidth limit
• The Classical (Cl) point moves to the right because we are doing fewer loads from DRAM
3. Implement L1 cache optimization
• See 2. By chance, the Cl point seems bound by the scalar add peak
[1] S. Williams et al. CACM (2009), crd.lbl.gov/departments/computer-science/PAR/research/roofline

Example 2: Compute Bound Application
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte). Legend: Cache-Aware, Classical, Both Equal]
1. High AI, “particle-like” application
• No cache reuse again
• Compute bound but not using vectorization/FMA/both VPUs
2. Implement vectorization
• Since we are not touching memory, the AI in both C-A and Cl roofline does not change
• We are fully utilizing VPUs, so FLOPS increase
3. Implement FMA use
[1] S. Williams et al. CACM (2009), crd.lbl.gov/departments/computer-science/PAR/research/roofline

Classical Roofline vs Cache-Aware Roofline

Classical Roofline Model
AI = # FLOPS / # BYTES (DRAM)
• Bytes out of DRAM are measured in AI
• Intuitive feedback on optimizations
  – Memory optimizations (e.g. cache blocking) improve AI
  – Compute optimizations (e.g. vectorization, FMA) improve FLOPS
• May mislead on what is the bounding roof

Cache-Aware Roofline Model
AI = # FLOPS / # BYTES (CPU)
• Total bytes into the CPU from all levels in memory hierarchy are measured in AI
• Any* optimizations will only improve FLOPS, NOT AI
• Explicitly shows the bounding bandwidth in memory hierarchy
  – Limiting roof is (usually) the next one above you on the roofline

Why Not Use All Points? Streaming out of MCDRAM example…

Why Not Use All Points? App with good reuse in L2, for example.

What to Do With This Info?
• Compute bound applications
  – Make sure you have good OpenMP scalability. Look at thread activity for major OpenMP regions.
  – Make sure your code is vectorizing. Look at Cycles per Instruction (CPI) and VPU utilization.
• Memory bandwidth bound applications
  – Try to improve memory locality and cache reuse
  – Identify the key arrays leading to high memory bandwidth usage and make sure they are (or will be) allocated in HBM on KNL. Profit by getting ~5x more bandwidth (GB/s).

Summary
• Roofline analysis helps to determine the gap between application performance and peak performance of a computing platform on an absolute scale
• An application is placed on the roofline by measuring its Arithmetic Intensity and Gflop/s
  – See the next talk to learn how to measure them with Advisor
• Arithmetic intensity can be computed from bytes transferred from any or all levels of the memory hierarchy
  – Cache-Aware Roofline: measure total bytes moved into the CPU
  – Classical Roofline: measure bytes read from DRAM
• Once you know where your application sits on the roofline(s), start thinking about how to move it upwards. There is no universal recipe!
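Placing an application on the roofline, as described in the summary, can be sketched as follows. The flop count, byte count, runtime, and platform ceilings are all made-up inputs; real values would come from a tool such as Advisor or from hardware counters.

```python
def roofline_point(flops: float, bytes_moved: float, seconds: float) -> tuple:
    """Return (AI in flops/byte, achieved Gflop/s) for one measured kernel."""
    ai = flops / bytes_moved
    gflops = flops / seconds / 1.0e9
    return ai, gflops


def fraction_of_roof(ai: float, gflops: float,
                     peak_gflops: float, peak_bw_gbytes: float) -> float:
    """Achieved performance divided by the roofline bound at this AI."""
    roof = min(peak_gflops, peak_bw_gbytes * ai)
    return gflops / roof


# Hypothetical measurement: 2e11 flops, 8e11 bytes moved, 4 seconds of runtime.
app_ai, app_gflops = roofline_point(2.0e11, 8.0e11, 4.0)

# Hypothetical platform: 3000 Gflop/s peak, 400 Gbyte/s bandwidth.
app_fraction = fraction_of_roof(app_ai, app_gflops, 3000.0, 400.0)

print(f"AI = {app_ai} flops/byte, {app_gflops} Gflop/s, "
      f"{app_fraction:.0%} of the roof")
```

The fraction gives the gap to the bounding roof on an absolute scale; which byte count to use depends on whether the Classical or the Cache-Aware definition is being applied.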

Ask Yourself “Why am I Here?” and “Where am I Going?”
[Figure: roofline plot of Attainable Performance (Gflop/s) vs. Arithmetic Intensity (flops/byte), divided into three areas: 1. Memory Bound, 2. Memory/Compute Bound, 3. Compute Bound, with application points marked “???”]
Usually, it is more complicated… You won’t be on any ceiling. Or if you are, it is coincidence.
BUT asking the questions “why am I not on a higher ceiling?” and “what should I do to reach it?” is always productive.

Further Reading and Viewing
• S. Williams et al. CACM (2009) http://crd.lbl.gov/departments/computer-science/PAR/research/roofline
• A. Ilic et al., IEEE Computer Architecture Letters (2014)
• http://www.intel.com/content/www/us/en/events/hpcdevcon/parallel-programmingtrack.html#utilizing
• https://www.youtube.com/watch?v=h2QEM1HpFgg
• https://software.intel.com/en-us/articles/getting-started-with-intel-advisor-rooflinefeature
• http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/
• https://www.codeproject.com/Articles/1169323/Intel-Advisor-Roofline-Analysis
• https://software.intel.com/sites/default/files/managed/1e/19/roofing-a-house.pdf
• https://github.com/tkoskela/pyAdvisor

National Energy Research Scientific Computing Center