Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

Samuel Williams 1,2, Jonathan Carter 2, Leonid Oliker 1,2, John Shalf 2, Katherine Yelick 1,2
1 University of California, Berkeley    2 Lawrence Berkeley National Laboratory
samw@eecs.berkeley.edu
Motivation

v Multicore is the de facto solution for improving peak performance for the next decade
v How do we ensure this applies to sustained performance as well?
v Processor architectures are extremely diverse and compilers can rarely fully exploit them
v Require a HW/SW solution that guarantees performance without completely sacrificing productivity
Overview

v Examined the Lattice-Boltzmann Magneto-hydrodynamic (LBMHD) application
v Present and analyze two threaded & auto-tuned implementations
v Benchmarked performance across 5 diverse multicore microarchitectures
  § Intel Xeon (Clovertown)
  § AMD Opteron (rev. F)
  § Sun Niagara 2 (Huron)
  § IBM QS20 Cell Blade (PPEs)
  § IBM QS20 Cell Blade (SPEs)
v We show
  § Auto-tuning can significantly improve application performance
  § Cell consistently delivers good performance and efficiency
  § Niagara 2 delivers good performance and productivity
Multicore SMPs used
Multicore SMP Systems

[Block diagrams of the four platforms: Intel Xeon (Clovertown), AMD Opteron (rev. F), Sun Niagara 2 (Huron), and the IBM QS20 Cell Blade, showing cores, caches/local stores, on-chip interconnect, memory controllers, and DRAM bandwidths.]
Multicore SMP Systems (memory hierarchy)

[Same four block diagrams; the Xeon, Opteron, and Niagara 2 are highlighted as conventional, cache-based memory hierarchies.]
Multicore SMP Systems (memory hierarchy)

[Same four block diagrams; the Cell Blade SPEs are highlighted as a local-store-based, disjoint memory hierarchy, in contrast to the conventional cache-based hierarchies of the other machines.]
Multicore SMP Systems (memory hierarchy)

[Same four block diagrams; the cache-based machines run the cache + pthreads implementation, while the Cell SPEs run the local store + libspe implementation.]
Multicore SMP Systems (peak flops)

§ Intel Xeon (Clovertown): 75 Gflop/s
§ AMD Opteron (rev. F): 17 Gflop/s
§ Sun Niagara 2 (Huron): 11 Gflop/s
§ IBM QS20 Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
Multicore SMP Systems (peak DRAM bandwidth)

§ Intel Xeon (Clovertown): 21 GB/s (read), 10 GB/s (write)
§ AMD Opteron (rev. F): 21 GB/s
§ Sun Niagara 2 (Huron): 42 GB/s (read), 21 GB/s (write)
§ IBM QS20 Cell Blade: 51 GB/s
Auto-tuning
Auto-tuning

v Hand-optimizing each architecture/dataset combination is not feasible
v Our auto-tuning approach finds a good performance solution by a combination of heuristics and exhaustive search (a sketch of the search loop appears below)
  § A Perl script generates many possible kernels (including SIMD-optimized kernels)
  § An auto-tuning benchmark examines the kernels and reports back with the best one for the current architecture/dataset/compiler/…
  § Performance depends on the optimizations generated
  § Heuristics are often desirable when the search space isn't tractable
v Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
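A rough illustration of the search half of this approach: the sketch below times a set of kernel variants and keeps the fastest. The variant function, parameter ranges, and timing granularity are illustrative assumptions, not the actual tuner (which generates its variants offline with a Perl script).

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for one generated kernel variant,
       parameterized by vector length (VL) and unroll factor. */
    static void lbmhd_collision_variant(int vl, int unroll)
    {
        volatile double sink = 0.0;
        for (int i = 0; i < 1000000; i++)
            sink += (double)(vl + unroll);      /* placeholder work */
    }

    static double wall_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void)
    {
        double best = 1e30;
        int best_vl = 0, best_unroll = 0;
        /* Heuristically pruned exhaustive search: power-of-two VL and unrolling. */
        for (int vl = 8; vl <= 1024; vl *= 2) {
            for (int unroll = 1; unroll <= 8; unroll *= 2) {
                double t0 = wall_seconds();
                lbmhd_collision_variant(vl, unroll);
                double t = wall_seconds() - t0;
                if (t < best) { best = t; best_vl = vl; best_unroll = unroll; }
            }
        }
        printf("best variant: VL=%d unroll=%d (%.6f s)\n", best_vl, best_unroll, best);
        return 0;
    }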
Introduction to LBMHD
Introduction to Lattice Methods

v Structured grid code, with a series of time steps
v Popular in CFD (allows for complex boundary conditions)
v Overlay a higher-dimensional phase space
  § Simplified kinetic model that maintains the macroscopic quantities
  § Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct macroscopic quantities
  § Significant memory capacity requirements

[Figure: 27-velocity lattice with the numbered velocity directions along the +X, +Y, and +Z axes.]
LBMHD (general characteristics)

v Plasma turbulence simulation
v Couples CFD with Maxwell's equations
v Two distributions:
  § momentum distribution (27 scalar velocities)
  § magnetic distribution (15 vector velocities)
v Three macroscopic quantities:
  § Density
  § Momentum (vector)
  § Magnetic field (vector)

[Figures: the macroscopic variables, the 27-velocity momentum distribution, and the 15-velocity magnetic distribution lattices.]
LBMHD (flops and bytes)

v Must read 73 doubles and update 79 doubles per point in space (minimum 1200 bytes)
v Requires about 1300 floating-point operations per point in space
v Flop:byte ratio (worked arithmetic below)
  § 0.71 (write-allocate architectures)
  § 1.07 (ideal)
v Rule of thumb for LBMHD:
  § Architectures with more flops than bandwidth are likely memory bound (e.g. Clovertown)
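The two ratios can be reconstructed from the counts above (this is back-of-envelope arithmetic, not taken from the slide): 73 + 79 = 152 doubles of compulsory traffic per point (1216 bytes, i.e. the "minimum 1200 bytes"), plus another 79 doubles read in by write-allocate caches before the updated values are written back.

    \[
    \frac{1300\ \text{flops}}{(73+79)\times 8\ \text{bytes}} \approx 1.07
    \qquad
    \frac{1300\ \text{flops}}{(73+2\times 79)\times 8\ \text{bytes}} \approx 0.70
    \]

The first matches the quoted ideal ratio of 1.07 and the second lands close to the quoted 0.71 for write-allocate machines.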
LBMHD (implementation details)

v Data structure choices (contrast sketched below):
  § Array of Structures: no spatial locality, strided access
  § Structure of Arrays: huge number of memory streams per thread, but guarantees spatial locality, unit stride, and vectorizes well
v Parallelization
  § The Fortran version used MPI to communicate between tasks (a bad match for multicore)
  § The version in this work uses pthreads for multicore, and MPI for inter-node
  § MPI is not used when auto-tuning
v Two problem sizes:
  § 64³ (~330 MB)
  § 128³ (~2.5 GB)
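A minimal sketch of the two layout choices for the 27 scalar momentum velocities; the type and field names are illustrative, not the identifiers used in LBMHD.

    #include <stdlib.h>

    #define NVEL 27                    /* scalar momentum velocities per point */

    /* Array of Structures: one point's velocities are contiguous, so a sweep
       over a single velocity component strides through memory. */
    typedef struct { double f[NVEL]; } point_aos;    /* aos[cell].f[v]   */

    /* Structure of Arrays: one long unit-stride array per velocity component;
       vectorizes well, but every thread now drives ~NVEL memory streams. */
    typedef struct { double *f[NVEL]; } lattice_soa; /* soa.f[v][cell]   */

    int main(void)
    {
        size_t ncells = 64UL * 64 * 64;

        point_aos *aos = malloc(ncells * sizeof *aos);

        lattice_soa soa;
        for (int v = 0; v < NVEL; v++)
            soa.f[v] = malloc(ncells * sizeof(double));

        /* ... collision()/stream() would sweep the cells in either layout ... */

        for (int v = 0; v < NVEL; v++) free(soa.f[v]);
        free(aos);
        return 0;
    }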
Stencil for Lattice Methods

v Very different from the canonical heat-equation stencil
  § There are multiple read and write arrays
  § There is no reuse

[Figure: read_lattice[ ][ ] and write_lattice[ ][ ] access patterns.]
Side Note on Performance Graphs

v Threads are mapped first to cores, then sockets (i.e. multithreading, then multicore, then multisocket)
v Niagara 2 always used 8 threads/core
v We show two problem sizes
v We'll step through performance as optimizations/features are enabled within the auto-tuner
v More colors implies more optimizations were necessary
v This allows us to compare architecture performance while keeping programmer effort (productivity) constant
Performance and Analysis of Pthreads Implementation
Pthread Implementation

[Performance graphs for Intel Xeon (Clovertown), AMD Opteron (rev. F), Sun Niagara 2 (Huron), and IBM Cell Blade (PPEs).]

v Not naïve
  § fully unrolled loops
  § NUMA-aware
  § 1D parallelization
v Always used 8 threads per core on Niagara 2
v 1P Niagara 2 is faster than the 2P x86 machines
Pthread Implementation

[Same graphs, annotated with sustained fractions of peak:]
  § Intel Xeon (Clovertown): 4.8% of peak flops, 17% of bandwidth
  § AMD Opteron (rev. F): 14% of peak flops, 17% of bandwidth
  § Sun Niagara 2 (Huron): 54% of peak flops, 14% of bandwidth
  § IBM Cell Blade (PPEs): 1% of peak flops, 0.3% of bandwidth
Initial Pthread Implementation

[Same graphs; two of the platforms show performance degradation at higher concurrency despite the improved surface-to-volume ratio.]

v Not naïve
  § fully unrolled loops
  § NUMA-aware
  § 1D parallelization
v Always used 8 threads per core on Niagara 2
v 1P Niagara 2 is faster than the 2P x86 machines
Cache Effects

v Want to maintain a working set of velocities in the L1 cache
v 150 arrays, each trying to keep at least 1 cache line
v Impossible with Niagara 2's 1 KB/thread of L1 working set, so capacity misses result
v On the other architectures, the combination of:
  § low-associativity L1 caches (2-way on the Opteron)
  § large numbers of arrays
  § near-power-of-2 problem sizes
  can result in large numbers of conflict misses
v Solution: apply a lattice-(offset-)aware padding heuristic to the velocity arrays to avoid/minimize conflict misses (a sketch follows below)
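A minimal sketch of the padding idea, assuming a 64-byte cache line; the stagger formula is a simplified stand-in for the lattice-aware heuristic that the auto-tuner actually searches over.

    #include <stdlib.h>

    #define CACHE_LINE 64        /* bytes; assumed line size */

    /* Allocate one velocity array with a per-array stagger so the ~150 arrays
       of a near-power-of-two problem don't all alias to the same cache sets.
       Returns the staggered pointer; a real code would also keep the base
       pointer around so it can be free()d later. */
    static double *alloc_padded(size_t ncells, int array_index)
    {
        size_t pad = (size_t)(array_index % 16) * (CACHE_LINE / sizeof(double));
        double *base = malloc((ncells + pad) * sizeof(double));
        return base ? base + pad : NULL;
    }

    int main(void)
    {
        double *f[150];
        for (int v = 0; v < 150; v++)
            f[v] = alloc_padded(64UL * 64 * 64, v);   /* one stream per array */
        (void)f;
        return 0;
    }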
Auto-tuned Performance (+Stencil-aware Padding)

[Performance graphs for the Xeon, Opteron, Niagara 2, and Cell PPEs; legend: Naïve+NUMA, +Padding.]

v This lattice method is essentially 79 simultaneous 72-point stencils
v These can cause conflict misses even in highly associative L1 caches (not to mention the Opteron's 2-way L1)
v Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache
Blocking for the TLB

v Touching 150 different arrays will thrash TLBs with fewer than 128 entries
v Try to maximize TLB page locality
v Solution: borrow a technique from compilers for vector machines (sketched after this list):
  § Fuse the spatial loops
  § Strip-mine them into vectors of size VL (the vector length)
  § Interchange the spatial and velocity loops
v Can be generalized by varying:
  § the number of velocities simultaneously accessed
  § the number of macroscopics / velocities simultaneously updated
v Has the side benefit of expressing more ILP and DLP (SIMDization) and a cleaner loop structure, at the cost of increased L1 cache traffic
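A minimal sketch of the transform, with illustrative array and loop names: the fused spatial loop is strip-mined into chunks of VL points, and the velocity loop is moved outside the strip so each inner pass streams through one velocity array (and thus one set of pages) at a time.

    #define NVEL 27

    /* dst[v] and src[v] are the (padded) velocity arrays; ncells is the fused
       spatial extent; vl is the strip length chosen by the auto-tuner. */
    void sweep(double *dst[NVEL], double *src[NVEL], long ncells, long vl)
    {
        for (long s = 0; s < ncells; s += vl) {                 /* strip-mined  */
            long e = (s + vl < ncells) ? s + vl : ncells;
            for (int v = 0; v < NVEL; v++)                      /* interchanged */
                for (long i = s; i < e; i++)                    /* unit stride  */
                    dst[v][i] = src[v][i];    /* stand-in for the real update  */
        }
    }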
Multicore SMP Systems (TLB organization)

§ Intel Xeon (Clovertown): 16 entries, 4 KB pages
§ AMD Opteron (rev. F): 32 entries, 4 KB pages
§ Sun Niagara 2 (Huron): 128 entries, 4 MB pages
§ IBM QS20 Cell Blade: PPEs 1024 entries, SPEs 256 entries, 4 KB pages
Cache / TLB Tug-of-War

v For cache locality we want a small VL
v For TLB page locality we want a large VL
v Each architecture has a different balance between these two forces: cache locality pushes VL below roughly L1 size / 1200 bytes per point, amortizing TLB misses pushes VL toward roughly page size / 8 bytes per double, and the TLB miss penalty determines how these trade off
v Solution: auto-tune to find the optimal VL
Auto-tuned Performance (+Vectorization)

[Performance graphs for the Xeon, Opteron, Niagara 2, and Cell PPEs; legend: Naïve+NUMA, +Padding, +Vectorization.]

v Each update requires touching ~150 components, each likely to be on a different page
v TLB misses can significantly impact performance
v Solution: vectorization
  § Fuse the spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimensional loops
v Auto-tune: search for the optimal vector length
v Significant benefit on some architectures
v Becomes irrelevant when bandwidth dominates performance
Auto-tuned Performance (+Explicit Unrolling/Reordering)

[Performance graphs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling.]

v Give the compilers a helping hand for the complex loops
v Code generator: a Perl script generates all power-of-2 possibilities (a sketch of one generated form appears below)
v Auto-tune: search for the best unrolling and expression of data-level parallelism
v Is essential when using SIMD intrinsics
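A minimal sketch of what one generated variant might look like (unroll by 4, loads grouped ahead of the stores to expose ILP); the real script emits every power-of-two unrolling and reordering and leaves the choice to the search.

    /* Assumes n is a multiple of 4; a real generated kernel would also
       emit a remainder loop. */
    void copy_unroll4(double *dst, const double *src, long n)
    {
        for (long i = 0; i < n; i += 4) {
            double a0 = src[i + 0];          /* grouped loads ...        */
            double a1 = src[i + 1];
            double a2 = src[i + 2];
            double a3 = src[i + 3];
            dst[i + 0] = a0;                 /* ... then grouped stores  */
            dst[i + 1] = a1;
            dst[i + 2] = a2;
            dst[i + 3] = a3;
        }
    }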
Auto-tuned Performance (+Software Prefetching)

[Performance graphs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching.]

v Expanded the code generator to insert software prefetches in case the compiler doesn't (a sketch follows below)
v Auto-tune over:
  § no prefetch
  § prefetch 1 line ahead
  § prefetch 1 vector ahead
v Relatively little benefit for relatively little work
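A minimal sketch of the generated prefetching, using GCC's __builtin_prefetch; the streaming loop and the two distances ("one cache line ahead" vs "one vector ahead") are illustrative.

    /* dist is the prefetch distance in elements, e.g. 8 doubles (one 64-byte
       line) or vl doubles (one vector) ahead of the current index. */
    void stream_with_prefetch(double *dst, const double *src, long n, long dist)
    {
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&src[i + dist], 0, 0);   /* read stream  */
            __builtin_prefetch(&dst[i + dist], 1, 0);   /* write stream */
            dst[i] = src[i];              /* stand-in for the real update */
        }
    }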
Auto-tuned Performance (+Software Prefetching)

[Same graphs, annotated with sustained fractions of peak:]
  § Intel Xeon (Clovertown): 6% of peak flops, 22% of bandwidth
  § AMD Opteron (rev. F): 32% of peak flops, 40% of bandwidth
  § Sun Niagara 2 (Huron): 59% of peak flops, 15% of bandwidth
  § IBM Cell Blade (PPEs): 10% of peak flops, 3.7% of bandwidth
Auto-tuned Performance (+SIMDization, including non-temporal stores)

[Performance graphs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization.]

v Compilers (gcc & icc) failed at exploiting SIMD
v Expanded the code generator to use SIMD intrinsics (a sketch follows below)
v Explicit unrolling/reordering was extremely valuable here
v Exploited movntpd to minimize memory traffic (the only hope if memory bound)
v Significant benefit for significant work
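A minimal sketch of the x86 side of this step, assuming 16-byte-aligned arrays and an even length: SSE2 intrinsics operate on two doubles at a time, and _mm_stream_pd emits the movntpd non-temporal store so the written lattice bypasses the cache and write-allocate traffic is avoided. The real generated kernels are far more involved.

    #include <emmintrin.h>

    void scale_streamed(double *dst, const double *src, long n, double a)
    {
        __m128d va = _mm_set1_pd(a);
        for (long i = 0; i < n; i += 2) {
            __m128d v = _mm_mul_pd(_mm_load_pd(&src[i]), va);
            _mm_stream_pd(&dst[i], v);       /* movntpd: bypasses the cache */
        }
        _mm_sfence();                        /* order the streamed stores   */
    }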
Auto-tuned Performance (+SIMDization, including non-temporal stores)

[Same graphs, annotated with sustained fractions of peak:]
  § Intel Xeon (Clovertown): 7.5% of peak flops, 18% of bandwidth
  § AMD Opteron (rev. F): 42% of peak flops, 35% of bandwidth
  § Sun Niagara 2 (Huron): 59% of peak flops, 15% of bandwidth
  § IBM Cell Blade (PPEs): 10% of peak flops, 3.7% of bandwidth
Auto-tuned Performance (+SIMDization, including non-temporal stores)

[Same graphs, annotated with the speedup from auto-tuning:]
  § Intel Xeon (Clovertown): 1.6x
  § AMD Opteron (rev. F): 4.3x
  § Sun Niagara 2 (Huron): 1.5x
  § IBM Cell Blade (PPEs): 10x
Performance and Analysis of Cell Implementation
Cell Implementation

v Double-precision implementation
  § DP will severely hamper performance
v Vectorized, double buffered, but not auto-tuned
  § No NUMA optimizations
  § No unrolling
  § VL is fixed
  § Straight to SIMD intrinsics
  § Prefetching replaced by DMA list commands
v Only collision() was implemented (a double-buffering sketch follows below)
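A minimal sketch of SPE-side double buffering, assuming the Cell SDK's MFC interface (spu_mfcio.h); the chunk size, effective-address layout, and the compute() placeholder are illustrative assumptions rather than the structure of the real collision() kernel.

    #include <spu_mfcio.h>

    #define CHUNK 1024                       /* doubles per DMA transfer (8 KB) */

    static volatile double buf[2][CHUNK] __attribute__((aligned(128)));

    void process_chunks(uint64_t ea, unsigned nchunks)
    {
        unsigned cur = 0;
        mfc_get(buf[cur], ea, CHUNK * sizeof(double), cur, 0, 0);   /* prime  */
        for (unsigned c = 0; c < nchunks; c++) {
            unsigned nxt = cur ^ 1;
            if (c + 1 < nchunks)             /* start the next DMA early       */
                mfc_get(buf[nxt], ea + (uint64_t)(c + 1) * CHUNK * sizeof(double),
                        CHUNK * sizeof(double), nxt, 0, 0);
            mfc_write_tag_mask(1u << cur);   /* wait only for the current tag  */
            mfc_read_tag_status_all();
            /* compute(buf[cur]);  overlaps with the DMA issued above          */
            cur = nxt;
        }
    }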
Auto-tuned Performance (Local Store Implementation)

[Performance graphs for the Xeon, Opteron, Niagara 2, and Cell SPEs (*collision() only); legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization.]

v First attempt at a Cell implementation
v VL, unrolling, and reordering are fixed
v No NUMA
v Exploits DMA and double buffering to load vectors
v Straight to SIMD intrinsics
v Despite the relative performance, Cell's DP implementation severely impairs performance
Auto-tuned Performance (Local Store Implementation)

[Same graphs, annotated with sustained fractions of peak:]
  § Intel Xeon (Clovertown): 7.5% of peak flops, 18% of bandwidth
  § AMD Opteron (rev. F): 42% of peak flops, 35% of bandwidth
  § Sun Niagara 2 (Huron): 59% of peak flops, 15% of bandwidth
  § IBM Cell Blade (SPEs, collision() only): 57% of peak flops, 33% of bandwidth
Speedup from Heterogeneity

[Same graphs; the auto-tuning speedups for the cache-based machines remain 1.6x (Clovertown), 4.3x (Opteron), and 1.5x (Niagara 2), while the Cell SPE implementation (collision() only) is 13x faster than the auto-tuned PPE version.]
Speedup over Naïve

[Same graphs, annotated with the speedup over the naïve code:]
  § Intel Xeon (Clovertown): 1.6x
  § AMD Opteron (rev. F): 4.3x
  § Sun Niagara 2 (Huron): 1.5x
  § IBM Cell Blade (SPEs, collision() only): 130x
Summary
Aggregate Performance (fully optimized)

v Cell SPEs deliver the best full-system performance
  § Although Niagara 2 delivers nearly comparable per-socket performance
v The dual-core Opteron delivers far better performance (bandwidth) than Clovertown
v Clovertown has far too little effective FSB bandwidth
Parallel Efficiency (average performance per thread, fully optimized)

v Aggregate Mflop/s / #cores
v Niagara 2 & Cell showed very good multicore scaling
v Clovertown showed very poor multicore scaling (FSB limited)
Power Efficiency (fully optimized)

v Used a digital power meter to measure sustained power
v Calculate power efficiency as: sustained performance / sustained power
v All cache-based machines delivered similar power efficiency
v FBDIMMs (~12 W each) drive up sustained power
  § 8 DIMMs on Clovertown (total of ~330 W)
  § 16 DIMMs on the Niagara 2 machine (total of ~450 W)
Productivity

v Niagara 2 required significantly less work to deliver good performance (just vectorization for large problems)
v Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance
v The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
Summary

v Niagara 2 delivered both very good performance and productivity
v Cell delivered very good performance and efficiency (processor and power)
v On the memory-bound Clovertown, parallelism wins out over optimization and auto-tuning
v Our multicore auto-tuned LBMHD implementation significantly outperformed the already optimized serial implementation
v Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio)
v Architectural transparency is invaluable in optimizing code
Multi-core arms race
New Multicores

[Performance graphs for the 2.2 GHz Opteron (rev. F) and the 1.40 GHz Niagara 2, for reference.]
New Multicores

[Performance graphs for the 2.2 GHz Opteron (rev. F), 2.3 GHz Opteron (Barcelona), 1.40 GHz Niagara 2, and 1.16 GHz Victoria Falls; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization, +Smaller pages.]

v Barcelona is a quad-core Opteron
v Victoria Falls is a dual-socket (128-thread) Niagara 2
v Both have the same total bandwidth
Speedup from Multicore/Socket

[Same graphs, annotated with speedups:]
  § Barcelona over Opteron (rev. F): 1.9x (1.8x frequency-normalized)
  § Victoria Falls over Niagara 2: 1.6x (1.9x frequency-normalized)
Speedup from Auto-tuning

[Same graphs, annotated with the speedup from auto-tuning:]
  § 2.2 GHz Opteron (rev. F): 4.3x
  § 2.3 GHz Opteron (Barcelona): 3.9x
  § 1.40 GHz Niagara 2: 1.5x
  § 1.16 GHz Victoria Falls: 16x
Questions?
Acknowledgements

v UC Berkeley
  § RADLab cluster (Opterons)
  § PSI cluster (Clovertowns)
v Sun Microsystems
  § Niagara 2 donations
v Forschungszentrum Jülich
  § Cell blade cluster access
v George Vahala, et al.
  § Original version of LBMHD
v ASCR Office in the DOE Office of Science
  § Contract DE-AC02-05CH11231