BIPS Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS) Katherine Yelick Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept. November 2004
BIPS Outline • Motivation for Automatic Performance Tuning • Recent results for sparse matrix kernels • Application to T3P, Omega3P • OSKI = Optimized Sparse Kernel Interface • Future Work
BIPS Prizes • Best Paper, Intern. Conf. Parallel Processing, 2004 – “Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” • Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 – Best Student Presentation too, to Richard Vuduc – “Automatic performance tuning and analysis of sparse triangular solve” • Finalist, Best Student Paper, Supercomputing 2002 – To Richard Vuduc – “Performance Optimization and Bounds for Sparse Matrix-Vector Multiply” • Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 – To Richard Vuduc – “Statistical Modeling of Feedback Data in an Automatic Tuning System”
BIPS Motivation for Automatic Performance Tuning • Historical trends – Sparse matrix-vector multiply (SpMV): 10% of peak or less – 2x faster than CSR with “hand-tuning” – Tuning becoming more difficult over time • Performance depends on machine, kernel, matrix – Matrix known at run-time – Best data structure + implementation can be surprising • Our approach: empirical modeling and search – Up to 4x speedups and 31% of peak for SpMV – Many optimization techniques for SpMV – Several other kernels: triangular solve, ATA*x, Ak*x – Proof-of-concept: integrate with Omega3P – Release OSKI Library, integrate into PETSc
BIPS Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure
BIPS Speedups on Itanium 2: The Need for Search (figure: Mflop/s for each register block size; best is 4x2, reference marked)
BIPS SpMV Performance (Matrix #2): Generation 2 (figure, fraction of peak with best and reference Mflop/s per platform): Ultra 2i 9% (best 63, reference 35); Ultra 3 6% (best 109, reference 53); Pentium III 19% (best 96, reference 42); Pentium III-M 15% (best 120, reference 58)
BIPS SpMV Performance (Matrix #2): Generation 1 (figure, fraction of peak with best and reference Mflop/s per platform): Power 3 13% (best 195, reference 100); Power 4 14% (best 703, reference 469); Itanium 1 7% (best 225, reference 103); Itanium 2 31% (best 1.1 Gflop/s, reference 276 Mflop/s)
BIPS Opteron Performance Profile
BIPS Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking – Logical grid of 3x3 cells – Fill in explicit zeros – Unroll 3x3 block multiplies – “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup!
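As a concrete, hedged illustration of the blocking idea rather than the actual SPARSITY/OSKI code, the sketch below multiplies by a matrix stored in 3x3 Block CSR: explicit zeros are stored so every tile is a full 3x3 block, and the block multiply is fully unrolled so the destination values stay in registers. The function and array names (spmv_bcsr_3x3, brow_ptr, bind, bval) are illustrative.

/* y += A*x for A in 3x3 Block CSR (BCSR): brow_ptr/bind index 3x3 blocks,
 * bval stores each block's 9 entries row-major, explicit zeros included.
 * Assumes the number of rows is a multiple of 3 (pad otherwise).
 * A sketch, not the tuned library code. */
void spmv_bcsr_3x3(int n_brows, const int *brow_ptr, const int *bind,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_brows; I++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;   /* one block row stays in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
            const double *b = &bval[9 * k];
            const double *xp = &x[3 * bind[k]];
            /* unrolled 3x3 block multiply */
            y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
            y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
            y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
        }
        y[3*I + 0] += y0;
        y[3*I + 1] += y1;
        y[3*I + 2] += y2;
    }
}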
Summary of Performance Optimizations BIPS • Optimizations for SpMV – Register blocking (RB): up to 4x over CSR – Variable block splitting: 2.1x over CSR, 1.8x over RB – Diagonals: 2x over CSR – Reordering to create dense structure + splitting: 2x over CSR – Symmetry: 2.8x over CSR, 2.6x over RB – Cache blocking: 2.2x over CSR – Multiple vectors (SpMM): 7x over CSR – And combinations… • Sparse triangular solve – Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels – AAT*x, ATA*x: 4x over CSR, 1.8x over RB – A2*x: 2x over CSR, 1.5x over RB
BIPS Potential Impact on Applications: T3P • Source: SLAC [Ko] • 80% of time spent in SpMV • Relevant optimization techniques – Symmetric storage – Register blocking • On single-processor Itanium 2 – 1.68x speedup • 532 Mflop/s, or 15% of 3.6 Gflop/s peak – 4.4x speedup with 8 multiple vectors • 1380 Mflop/s, or 38% of peak
BIPS Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques – Symmetric storage – Register blocking – Reordering • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks – Nodes = columns of A – Weights(u, v) = no. of nonzeros u, v have in common – Tour = ordering of columns – Choose maximum-weight tour – See [Pinar & Heath ’97] • 2x speedup on Itanium 2, but SpMV not dominant
Source: Accelerator Cavity Design Problem (Ko via Husbands) BIPS
100 x 100 Submatrix Along Diagonal BIPS
Post-RCM Reordering BIPS
“Microscopic” Effect of RCM Reordering BIPS Before: Green + Red After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering BIPS Before: Green + Red After: Green + Blue
BIPS Optimized Sparse Kernel Interface OSKI • Provides sparse kernels automatically tuned for user’s matrix & machine – BLAS-style functionality: SpMV, TrSV, … – Hides complexity of run-time tuning – Includes new, faster locality-aware kernels: ATA*x, … • Faster than standard implementations – Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x • For “advanced” users & solver library writers – Available as stand-alone library (Oct ’04) – Available as PETSc extension (Dec ’04) • Lines of code: ?? written by us, ?? generated
BIPS How the OSKI Tunes (Overview) (diagram) • Install-time (offline): 1. Build for target arch.; 2. Benchmark; produces generated code variants and benchmark data • Run-time: library combines the matrix, a workload from program monitoring, and history, then 1. Evaluate (heuristic) models; 2. Select data structure & code • To user: matrix handle for kernel calls • Extensibility: advanced users may write & dynamically add “code variants” and “heuristic models” to the system
BIPS How the OSKI Tunes (Overview) • At library build/install-time – Pre-generate and compile code variants into dynamic libraries – Collect benchmark data • Measures and records speed of possible sparse data structure and code variants on target architecture – Installation process uses standard, portable GNU Autotools • At run-time – Library “tunes” using heuristic models • Models analyze user’s matrix & benchmark data to choose optimized data structure and code – Non-trivial tuning cost: up to ~40 mat-vecs • Library limits the time it spends tuning based on estimated workload – provided by user or inferred by library • User may reduce cost by saving tuning results for the application on future runs with the same or similar matrix
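A hedged sketch of the calling sequence this workflow implies for a solver writer is shown below. The function and constant names follow the released OSKI C interface as documented in its user's guide (oski_CreateMatCSR, oski_SetHintMatMult, oski_TuneMat, oski_MatMult, etc.); treat the exact signatures and flags as assumptions rather than a verbatim excerpt, since the library was still being released at the time of this talk.

#include <oski/oski.h>

/* Sketch of run-time tuning with OSKI; names and flags assumed from the
 * released library's documentation, not guaranteed verbatim. */
void solve_with_oski(int m, int n, int *ptr, int *ind, double *val,
                     double *x, double *y, int expected_num_spmvs)
{
    oski_Init();

    /* Wrap the user's CSR arrays without copying them. */
    oski_matrix_t A = oski_CreateMatCSR(ptr, ind, val, m, n,
                                        SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

    /* Tell the library the expected workload so it can bound tuning cost. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, expected_num_spmvs);
    oski_TuneMat(A);   /* may choose register blocking, symmetry, etc. */

    for (int it = 0; it < expected_num_spmvs; it++)
        oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);   /* y = A*x */

    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_DestroyMat(A);
    oski_Close();
}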
BIPS Optimizations in the Initial OSKI Release • Fully automatic heuristics for – Sparse matrix-vector multiply • Register-level blocking + symmetry + multiple vectors • Cache-level blocking – Sparse triangular solve with register-level blocking and “switch-to-dense” optimization – Sparse ATA*x with register-level blocking • User may select other optimizations manually • “Plug-in” extensibility – Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (Ak*x) – All available in dynamic libraries – Accessible via high-level embedded script language – Very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system
BIPS Extra Slides
BIPS Example: Combining Optimizations • Register blocking, symmetry, multiple (k) vectors – Three low-level tuning parameters: r, c, v (figure: Y += A*X with r x c register blocks, processing v of the k vectors at a time)
BIPS Example: Combining Optimizations • Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB] – Symmetric, blocked, 1 vector • Up to 2.6x over nonsymmetric, blocked, 1 vector – Symmetric, blocked, k vectors • Up to 2.1x over nonsymmetric, blocked, k vectors • Up to 7.3x over nonsymmetric, nonblocked, 1 vector – Symmetric storage: 64.7% savings
BIPS Current Work • Public software release • Impact on library designs: Sparse BLAS, Trilinos, PETSc, … • Integration in large-scale applications – DOE: Accelerator design; plasma physics – Geophysical simulation based on Block Lanczos (ATA*X; LBL) • Systematic heuristics for data structure selection? • Evaluation of emerging architectures – Revisiting vector micros • Other sparse kernels – Matrix triple products, Ak*x • Parallelism • Sparse benchmarks (with UTK) [Gahvari & Hoemmen] • Automatic tuning of MPI collective ops [Nishtala, et al. ]
BIPS Review of Tuning by Illustration (Extra Slides)
Splitting for Variable Blocks and Diagonals BIPS • Decompose A = A1 + A2 + … + At – Detect “canonical” structures (sampling) – Split – Tune each Ai – Improve performance and save storage • New data structures – Unaligned block CSR • Relax alignment in rows & columns – Row-segmented diagonals
BIPS Example: Variable Block Row (Matrix #12) 2.1x over CSR, 1.8x over RB
BIPS Example: Row-Segmented Diagonals 2x over CSR
BIPS Mixed Diagonal and Block Structure
BIPS Example: Sparse Triangular Factor • raefsky4 (structural problem) + SuperLU + colmmd • N=19779, nnz=12.6M • Dense trailing triangle: dim=2268, 20% of total non-zeros (can be as high as 90+%!) • 1.8x over CSR
BIPS Cache Optimizations for AAT*x • Cache-level: interleave multiplication by A and AT (“axpy” + dot product) • Register-level: treat aiT as an r x c block row, or a diagonal row • Algorithmic-level transformations for A2*x, A3*x, …
BIPS Summary • Automated block size selection – Empirical modeling and search – Register blocking for SpMV, triangular solve, ATA*x • Not fully automated – Given a matrix, select splittings and transformations • Lots of combinatorial problems – TSP reordering to create dense blocks (Pinar ’97; Moon, et al. ’04)
BIPS Extra Slides
BIPS A Sparse Matrix You Encounter Every Day Who am I? I am a Big Repository Of useful And useless Facts alike. Who am I? (Hint: Not your e-mail inbox.)
BIPS Problem Context • Sparse kernels abound – Models of buildings, cars, bridges, economies, … – Google PageRank algorithm • Historical trends – Sparse matrix-vector multiply (SpMV): 10% of peak – 2x faster with “hand-tuning” – Tuning becoming more difficult over time – Promise of automatic tuning: PHiPAC/ATLAS, FFTW, … • Challenges to high performance – Not dense linear algebra! • Complex data structures: indirect, irregular memory access • Performance depends strongly on run-time inputs
BIPS Key Questions, Ideas, Conclusions • How to tune basic sparse kernels automatically? – Empirical modeling and search • Up to 4x speedups for SpMV • 1.8x for triangular solve • 4x for ATA*x; 2x for A2*x • 7x for multiple vectors • What are the fundamental limits on performance? – Kernel-, machine-, and matrix-specific upper bounds • Achieve 75% or more for SpMV, limiting low-level tuning • Consequences for architecture? • General techniques for empirical search-based tuning? – Statistical models of performance
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance
BIPS Compressed Sparse Row (CSR) Storage • Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j) – for each row i: for k = ptr[i] to ptr[i+1]-1 do y[i] = y[i] + val[k]*x[ind[k]]
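A minimal, self-contained C version of the kernel above, for reference; the array names (ptr, ind, val) follow the slide, while the wrapper function and types are illustrative rather than taken from any particular library.

#include <stddef.h>

/* y = y + A*x, with A stored in Compressed Sparse Row (CSR) format.
 * ptr[i]..ptr[i+1]-1 index the non-zeros of row i; ind[k] is the column
 * of the k-th stored value val[k]. */
void spmv_csr(size_t n, const size_t *ptr, const size_t *ind,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double yi = y[i];
        for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];   /* y(i) += A(i,j) * x(j), j = ind[k] */
        y[i] = yi;
    }
}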
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance
Historical Trends in SpMV Performance BIPS • The Data – Uniprocessor SpMV performance since 1987 – “Untuned” and “Tuned” implementations – Cache-based superscalar micros; some vectors – LINPACK
BIPS SpMV Historical Trends: Mflop/s
BIPS SpMV Historical Trends: Fraction of Peak
BIPS Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5M • kernel: SpMV • Source: NASA structural analysis problem
BIPS Still More Surprises • More complicated non-zero structure in general
BIPS Still More Surprises • More complicated non-zero structure in general • Example: 3x3 blocking – Logical grid of 3x3 cells
Historical Trends: Mixed News BIPS • Observations – Good news: Moore’s law-like behavior – Bad news: “Untuned” is 10% of peak or less, worsening – Good news: “Tuned” roughly 2x better today, and improving – Bad news: Tuning is complex – (Not really news: SpMV is not LINPACK) • Questions – Application: Automatic tuning? – Architect: What machines are good for SpMV?
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques – SpMV [SC’02; IJHPCA ’04b] – Sparse triangular solve (SpTS) [ICS/POHLL ’02] – ATA*x [ICCS/WoPLA ’03] • Upper bounds on performance • Statistical models of performance
BIPS SPARSITY: Framework for Tuning SpMV • SPARSITY: Automatic tuning for SpMV [Im & Yelick ’99] – General approach • Identify and generate implementation space • Search space using empirical models & experiments – Prototype library and heuristic for choosing register block size • Also: cache-level blocking, multiple vectors • What’s new? – New block size selection heuristic • Within 10% of optimal — replaces previous version – Expanded implementation space • Variable block splitting, diagonals, combinations – New kernels: sparse triangular solve, ATA*x, Ar*x
BIPS Automatic Register Block Size Selection • Selecting the r x c block size – Off-line benchmark: characterize the machine • Precompute Mflops(r, c) using a dense matrix for each r x c • Once per machine/architecture – Run-time “search”: characterize the matrix • Sample A to estimate Fill(r, c) for each r x c – Run-time heuristic model • Choose r, c to maximize Mflops(r, c) / Fill(r, c) • Run-time costs – Up to ~40 SpMVs (empirical worst case)
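The selection rule itself is tiny once its two inputs exist: the off-line table Mflops(r, c), measured on a dense matrix stored in blocked sparse format, and a run-time fill estimator. Below is a sketch of the heuristic (not the SPARSITY/OSKI source), assuming CSR inputs and an estimate_fill() helper whose possible implementation appears with the fill-estimation slide later in the deck; the 2% sample fraction is an arbitrary placeholder.

#define RMAX 8
#define CMAX 8

/* Run-time fill estimator for an r x c blocking of the CSR matrix
 * (ptr, ind, n rows, ncols columns); one possible version is sketched later. */
double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                     int r, int c, double s);

/* Pick the register block size (r, c) maximizing Mflops(r,c) / Fill(r,c).
 * mflops[r-1][c-1] comes from the one-time off-line dense benchmark. */
void choose_block_size(int n, int ncols, const int *ptr, const int *ind,
                       const double mflops[RMAX][CMAX],
                       int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = 1; *c_best = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            double fill = estimate_fill(n, ncols, ptr, ind, r, c, 0.02);
            double score = mflops[r - 1][c - 1] / fill;   /* fill >= 1.0 */
            if (score > best) { best = score; *r_best = r; *c_best = c; }
        }
    }
}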
BIPS Accuracy of the Tuning Heuristics (1/4) DGEMV NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
BIPS Accuracy of the Tuning Heuristics (2/4) DGEMV
BIPS Accuracy of the Tuning Heuristics (3/4) DGEMV
BIPS Accuracy of the Tuning Heuristics (4/4) DGEMV
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance – SC’02 • Statistical models of performance
BIPS Motivation for Upper Bounds Model • Questions – Speedups are good, but what is the speed limit? • Independent of instruction scheduling, selection – What machines are “good” for SpMV?
BIPS Upper Bounds on Performance: Blocked SpMV • P = (flops) / (time) – Flops = 2 * nnz(A) • Lower bound on time: two main assumptions – 1. Count memory ops only (streaming) – 2. Count only compulsory and capacity misses: ignore conflicts • Account for line sizes • Account for matrix size and nnz • Charge the minimum access “latency” αi at cache level Li and αmem at memory – e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
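A minimal sketch of the resulting bound, assuming a simple multi-level latency model: the caller supplies per-level load counts from the compulsory/capacity miss model and the minimum access latencies, the routine charges each load its latency, and the bound is flops divided by that lower-bound time. This is a simplification of the model in the papers, for illustration only.

/* Upper bound on blocked SpMV Mflop/s from a simple latency model:
 *   time >= sum_i (loads satisfied at cache level i) * alpha_i
 *           + (loads from memory) * alpha_mem,
 * with flops = 2*nnz.  Latencies are in seconds per double. */
double spmv_mflops_upper_bound(double nnz,
                               const double *loads_at_level,  /* per cache level */
                               const double *alpha,           /* latency per level */
                               int num_levels,
                               double loads_from_memory,
                               double alpha_mem)
{
    double flops = 2.0 * nnz;
    double time_lb = loads_from_memory * alpha_mem;           /* seconds */
    for (int i = 0; i < num_levels; i++)
        time_lb += loads_at_level[i] * alpha[i];
    return flops / time_lb * 1e-6;                            /* Mflop/s */
}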
BIPS Example: Bounds on Itanium 2
BIPS Example: Bounds on Itanium 2
BIPS Example: Bounds on Itanium 2
BIPS Fraction of Upper Bound Across Platforms
BIPS Achieved Performance and Machine Balance • Machine balance [Callahan ’88; McCalpin ’95] – Balance = Peak Flop Rate / Bandwidth (flops per double) • Ideal balance for mat-vec: 2 flops per double – For SpMV, even less • SpMV ~ streaming – 1 / (avg load time to stream 1 array) ~ (bandwidth) – “Sustained” balance = peak flops / model bandwidth
BIPS Where Does the Time Go? • Most time assigned to memory • Caches “disappear” when line sizes are equal – Strictly increasing line sizes
BIPS Execution Time Breakdown: Matrix 40
BIPS Speedups with Increasing Line Size
BIPS Summary: Performance Upper Bounds • What is the best we can do for SpMV? – Limits to low-level tuning of blocked implementations – Refinements? • What machines are good for SpMV? – Partial answer: balance characterization • Architectural consequences? – Example: strictly increasing line sizes
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Tuning other sparse kernels • Statistical models of performance – FDO ’00; IJHPCA ’04a
BIPS Statistical Models for Automatic Tuning • Idea 1: Statistical criterion for stopping a search – A general search model • Generate implementation • Measure performance • Repeat – Stop when the probability of being within ε of optimal falls below a threshold • Can estimate distribution on-line • Idea 2: Statistical performance models – Problem: choose 1 among m implementations at run-time – Sample performance off-line, build statistical model
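One deliberately simplified way to instantiate Idea 1 is sketched below: under an i.i.d.-sampling model of the implementation space, the probability that all t samples so far fall below the (1-eps)-quantile of that space is (1-eps)^t, so the search can stop once that probability drops below a confidence threshold delta. The dissertation's criterion also exploits the measured values themselves; the two helper functions here are hypothetical placeholders.

/* Hypothetical hooks into the search: generate a candidate implementation
 * and measure its performance in Mflop/s. */
int    generate_random_candidate(void);
double benchmark_candidate(int candidate);

/* Search until we are (1 - delta)-confident that the best implementation
 * seen so far lies above the (1 - eps)-quantile of the implementation
 * space: P(all t samples below that quantile) = (1 - eps)^t. */
double search_with_statistical_stop(double eps, double delta, int *best_impl)
{
    double best = 0.0;
    double p_all_below_quantile = 1.0;
    while (p_all_below_quantile >= delta) {
        int c = generate_random_candidate();
        double mflops = benchmark_candidate(c);
        if (mflops > best) { best = mflops; *best_impl = c; }
        p_all_below_quantile *= (1.0 - eps);
    }
    return best;
}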
BIPS Example: Select a Matmul Implementation
BIPS Example: Support Vector Classification
BIPS Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Tuning other sparse kernels • Statistical models of performance • Summary and Future Work
BIPS Summary of High-Level Themes • “Kernel-centric” optimization – Vs. basic block, trace, path optimization, for instance – Aggressive use of domain-specific knowledge • Performance bounds modeling – Evaluating software quality – Architectural characterizations and consequences • Empirical search – Hybrid on-line/run-time models • Statistical performance models – Exploit information from sampling, measuring
BIPS Related Work • My bibliography: 337 entries so far • Sample area 1: Code generation – Generative & generic programming – Sparse compilers – Domain-specific generators • Sample area 2: Empirical search-based tuning – Kernel-centric • linear algebra, signal processing, sorting, MPI, … – Compiler-centric • profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization
BIPS Future Directions (A Bag of Flaky Ideas) • Composable code generators and search spaces • New application domains – PageRank: multilevel block algorithms for topic-sensitive search? • New kernels: cryptokernels – rich mathematical structure germane to performance; lots of hardware • New tuning environments – Parallel, Grid, “whole systems” • Statistical models of application performance – Statistical learning of concise parametric models from traces for architectural evaluation – Compiler/automatic derivation of parametric models
BIPS Acknowledgements • Super-advisors: Jim and Kathy • Undergraduate R. A. s: Attila, Ben, Jin, Michael, Rajesh, Shoaib, Sriram, Tuyet-Linh • See pages xvi—xvii of dissertation.
BIPS TSP-based Reordering: Before (Pinar ’97; Moon, et al. ’04)
BIPS TSP-based Reordering: After (Pinar ’97; Moon, et al. ’04) Up to 2x speedups over CSR
BIPS Example: L2 Misses on Itanium 2 (misses measured using PAPI [Browne ’00])
BIPS Example: Distribution of Blocked Non-Zeros
BIPS Register Profile: Itanium 2 1190 Mflop/s
BIPS Register Profiles: Sun and Intel x86 (figure, fraction of peak with best and reference Mflop/s per platform): Ultra 2i 11% (best 72, reference 35); Ultra 3 5% (best 90, reference 50); Pentium III 21% (best 108, reference 42); Pentium III-M 15% (best 122, reference 58)
BIPS Register Profiles: IBM and Intel IA-64 (figure, fraction of peak with best and reference Mflop/s per platform): Power 3 17% (best 252, reference 122); Power 4 16% (best 820, reference 459); Itanium 1 8% (best 247, reference 107); Itanium 2 33% (best 1.2 Gflop/s, reference 190 Mflop/s)
BIPS Accurate and Efficient Adaptive Fill Estimation • Idea: sample the matrix – Fraction of matrix to sample: s ∈ [0, 1] – Cost ~ O(s * nnz) – Control cost by controlling s • Search at run-time: the constant matters! • Control s automatically by computing statistical confidence intervals – Idea: monitor variance • Cost of tuning – Lower bound: convert matrix in 5 to 40 unblocked SpMVs – Heuristic: 1 to 11 SpMVs
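A sketch of the sampling idea for a single (r, c), assuming CSR inputs: visit a random fraction s of the block rows, count how the sampled non-zeros group into r x c tiles, and return stored-entries-after-blocking divided by true non-zeros. The real estimator additionally adapts s using the confidence intervals mentioned above; this version uses a fixed s.

#include <stdlib.h>

/* Estimate the r x c register-blocking fill ratio by sampling a fraction s
 * of the block rows.  ptr/ind are CSR row pointers and column indices; n is
 * the number of rows, ncols the number of columns.  Sketch only. */
double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                     int r, int c, double s)
{
    int n_brows = (n + r - 1) / r;
    int n_bcols = (ncols + c - 1) / c;
    char *touched = calloc(n_bcols, 1);       /* marks column blocks hit */
    double blocked_nnz = 0.0, true_nnz = 0.0;

    for (int I = 0; I < n_brows; I++) {
        if ((double)rand() / RAND_MAX > s) continue;   /* sample block rows */
        int nblocks = 0;
        int row_end = (I + 1) * r < n ? (I + 1) * r : n;
        for (int i = I * r; i < row_end; i++) {
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                int J = ind[k] / c;
                if (!touched[J]) { touched[J] = 1; nblocks++; }
                true_nnz += 1.0;
            }
        }
        blocked_nnz += (double)nblocks * r * c;
        /* reset marks for the next sampled block row */
        for (int i = I * r; i < row_end; i++)
            for (int k = ptr[i]; k < ptr[i + 1]; k++)
                touched[ind[k] / c] = 0;
    }
    free(touched);
    return true_nnz > 0.0 ? blocked_nnz / true_nnz : 1.0;
}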
BIPS Sparse/Dense Partitioning for SpTS • Partition L into sparse (L1, L2) and dense LD (see figure) • Perform SpTS in three steps (see figure) • SPARSITY optimizations for (1)—(2); DTRSV for (3) • Tuning parameters: block size, size of dense triangle
BIPS SpTS Performance: Power 3
BIPS Summary of SpTS and AAT*x Results • SpTS — similar to SpMV – 1.8x speedups; limited benefit from low-level tuning • AAT*x, ATA*x – Cache interleaving only: up to 1.6x speedups – Reg + cache: up to 4x speedups • 1.8x speedup over register only – Similar heuristic; same accuracy (~10% of optimal) – Further from upper bounds: 60—80% • Opportunity for better low-level tuning a la PHiPAC/ATLAS • Matrix triple products? Ak*x? – Preliminary work
BIPS Register Blocking: Speedup
BIPS Register Blocking: Performance
BIPS Register Blocking: Fraction of Peak
BIPS Example: Confidence Interval Estimation
BIPS Costs of Tuning
BIPS Splitting + UBCSR: Pentium III
BIPS Splitting + UBCSR: Power 4
BIPS Splitting+UBCSR Storage: Power 4
BIPS Example: Variable Block Row (Matrix #13)
BIPS Dense Tuning is Hard, Too • Even dense matrix multiply can be notoriously difficult to tune
BIPS Dense matrix multiply: surprising performance as register tile size varies.
BIPS Preliminary Results (Matrix Set 2): Itanium 2 Web/IR Dense FEM (var) Bio Econ Stat LP
BIPS Multiple Vector Performance
BIPS What about the Google Matrix? • Google approach – Approx. once a month: rank all pages using connectivity structure • Find dominant eigenvector of a matrix – At query time: return list of pages ordered by rank • Matrix: A = αG + (1-α)(1/n)uuT – Markov model: surfer follows a link with probability α, jumps to a random page with probability 1-α – G is an n x n connectivity matrix [n ≈ 3 billion] • gij is non-zero if page i links to page j • Normalized so each column sums to 1 • Very sparse: about 7—8 non-zeros per row (power-law distribution) – u is a vector of all 1 values – Steady-state probability xi of landing on page i is the solution to x = Ax • Approximate x by the power method: x ≈ Akx0 – In practice, k ≈ 25
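Because u is the all-ones vector, A never has to be formed explicitly: A*x = alpha*G*x + (1-alpha)/n * (sum of x). A sketch of one power-method step under that observation, with G kept in CSR (any sparse format works) and no dangling-node correction; names are illustrative.

#include <stddef.h>

/* One power-method step for A = alpha*G + (1-alpha)*(1/n)*u*u^T with u the
 * all-ones vector:  y = alpha*(G*x) + (1-alpha)/n * sum(x).
 * G is stored in CSR (ptr, ind, val) and is never densified. */
void pagerank_step(size_t n, const size_t *ptr, const size_t *ind,
                   const double *val, double alpha,
                   const double *x, double *y)
{
    double sum_x = 0.0;
    for (size_t j = 0; j < n; j++)
        sum_x += x[j];
    double teleport = (1.0 - alpha) * sum_x / (double)n;

    for (size_t i = 0; i < n; i++) {
        double gi = 0.0;
        for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
            gi += val[k] * x[ind[k]];      /* (G*x)(i) */
        y[i] = alpha * gi + teleport;
    }
}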
BIPS MAPS Benchmark Example: Power 4
BIPS MAPS Benchmark Example: Itanium 2
BIPS Saavedra-Barrera Example: Ultra 2 i
BIPS Summary of Results: Pentium III
BIPS Summary of Results: Pentium III (3/3)
BIPS Execution Time Breakdown (PAPI): Matrix 40
BIPS Preliminary Results (Matrix Set 1): Itanium 2 Dense FEM (var) Assorted LP
BIPS Tuning Sparse Triangular Solve (SpTS) • Compute x = L^-1*b where L is sparse lower triangular, x & b dense • L from sparse LU has rich dense substructure – Dense trailing triangle can account for 20—90% of matrix non-zeros • SpTS optimizations – Split into sparse trapezoid and dense trailing triangle – Use tuned dense BLAS (DTRSV) on the dense triangle – Use SPARSITY register blocking on the sparse part • Tuning parameters – Size of dense trailing triangle – Register block size
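A sketch of the split solve under the assumptions on this slide: a sparse trapezoid (L1 on top, L2 below it) handled in CSR, and the dense trailing triangle LD handled by the BLAS routine cblas_dtrsv. The data layout, the names, and the convention that each L1 row stores its diagonal entry last are illustrative, not the library's interface.

#include <cblas.h>

/* Solve L*x = b where L = [ L1  0 ; L2  LD ]:
 *   (1) sparse forward solve with L1 (CSR, diagonal stored last in each row),
 *   (2) update b2 := b2 - L2*x1 (sparse matvec),
 *   (3) dense triangular solve with trailing triangle LD via DTRSV.
 * x holds b on entry and the solution on exit (length n1+n2). */
void spts_switch_to_dense(int n1, int n2,
                          const int *L1ptr, const int *L1ind, const double *L1val,
                          const int *L2ptr, const int *L2ind, const double *L2val,
                          const double *LD,   /* n2 x n2, row-major, lower */
                          double *x)
{
    for (int i = 0; i < n1; i++) {            /* (1) forward solve */
        double s = x[i];
        int k;
        for (k = L1ptr[i]; k < L1ptr[i + 1] - 1; k++)
            s -= L1val[k] * x[L1ind[k]];
        x[i] = s / L1val[k];                  /* diagonal entry */
    }
    for (int i = 0; i < n2; i++) {            /* (2) x2 := b2 - L2*x1 */
        double s = x[n1 + i];
        for (int k = L2ptr[i]; k < L2ptr[i + 1]; k++)
            s -= L2val[k] * x[L2ind[k]];
        x[n1 + i] = s;
    }
    /* (3) dense trailing-triangle solve */
    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n2, LD, n2, &x[n1], 1);
}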
Sparse Kernels and Optimizations BIPS • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AAT*x, y=ATA*x – Powers (y=Ak*x), sparse triple product (R*A*RT), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: benchmark implementations – Run-time: estimate matrix properties, evaluate performance models based on benchmark data
BIPS Cache Blocked SpMV on LSI Matrix: Ultra 2i. A is 10k x 255k with 3.7M non-zeros; baseline: 16 Mflop/s; best block size & performance: 16k x 64k at 28 Mflop/s
BIPS Cache Blocking on LSI Matrix: Pentium 4. A is 10k x 255k with 3.7M non-zeros; baseline: 44 Mflop/s; best block size & performance: 16k x 16k at 210 Mflop/s
BIPS Cache Blocked SpMV on LSI Matrix: Itanium. A is 10k x 255k with 3.7M non-zeros; baseline: 25 Mflop/s; best block size & performance: 16k x 32k at 72 Mflop/s
BIPS Cache Blocked SpMV on LSI Matrix: Itanium 2. A is 10k x 255k with 3.7M non-zeros; baseline: 170 Mflop/s; best block size & performance: 16k x 65k at 275 Mflop/s
BIPS Inter-Iteration Sparse Tiling (1/3) (figure: dependence graph with nodes x1..x6, t1..t6, y1..y6) • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A2*x – t=Ax, y=At • Nodes: vector elements • Edges: matrix elements aij
BIPS Inter-Iteration Sparse Tiling (2/3) (same figure) • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A2*x – t=Ax, y=At • Nodes: vector elements • Edges: matrix elements aij • Orange = everything needed to compute y1 – Reuse a11, a12
BIPS Inter-Iteration Sparse Tiling (3/3) (same figure) • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A2*x – t=Ax, y=At • Nodes: vector elements • Edges: matrix elements aij • Orange = everything needed to compute y1 – Reuse a11, a12 • Grey = y2, y3 – Reuse a23, a33, a43
BIPS Inter-Iteration Sparse Tiling: Issues • Tile sizes (colored regions) grow with no. of iterations and increasing out-degree – G likely to have a few nodes with high out-degree (e.g., Yahoo) • Mathematical tricks to limit tile size? – Judicious dropping of edges [Ng ’01]
BIPS Summary and Questions • Need to understand matrix structure and machine – BeBOP: suite of techniques to deal with different sparse structures and architectures • Google matrix problem – Established techniques within an iteration – Ideas for inter-iteration optimizations – Mathematical structure of problem may help • Questions – Structure of G? – What are the computational bottlenecks? – Enabling future computations? • E.g., topic-sensitive PageRank, multiple-vector version [Haveliwala ’02] – See www.cs.berkeley.edu/~richie/bebop/intel/google for more info, including more complete Itanium 2 results.
BIPS Exploiting Matrix Structure • Symmetry (numerical or structural) – Reuse matrix entries – Can combine with register blocking, multiple vectors, … • Matrix splitting – Split the matrix, e.g., into r x c and 1 x 1 – No fill overhead • Large matrices with random structure – E.g., Latent Semantic Indexing (LSI) matrices – Technique: cache blocking • Store matrix as 2^i x 2^j sparse submatrices • Effective when the x vector is large • Currently, search to find the fastest size
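The symmetry bullet can be made concrete with a sketch that stores only the upper triangle (diagonal included) in CSR and uses each off-diagonal entry twice per load: once for y[i] and once for the mirrored y[j]. Names are illustrative.

#include <stddef.h>

/* y = A*x for symmetric A with only the upper triangle (diagonal included)
 * stored in CSR.  Each off-diagonal a_ij is loaded once and contributes to
 * both y[i] += a_ij*x[j] and y[j] += a_ij*x[i].  Caller zeroes y. */
void spmv_sym_upper_csr(size_t n, const size_t *ptr, const size_t *ind,
                        const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double yi = 0.0;
        double xi = x[i];
        for (size_t k = ptr[i]; k < ptr[i + 1]; k++) {
            size_t j = ind[k];
            double a = val[k];
            yi += a * x[j];
            if (j != i)
                y[j] += a * xi;     /* mirror contribution from a_ji = a_ij */
        }
        y[i] += yi;
    }
}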
BIPS Symmetric SpMV Performance: Pentium 4
BIPS SpMV with Split Matrices: Ultra 2i
BIPS Cache Blocking on Random Matrices: Itanium Speedup on four banded random matrices.
Sparse Kernels and Optimizations BIPS • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AAT*x, y=ATA*x – Powers (y=Ak*x), sparse triple product (R*A*RT), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: benchmark implementations – Run-time: estimate matrix properties, evaluate performance models based on benchmark data
BIPS Register Blocked SpMV: Pentium III
BIPS Register Blocked SpMV: Ultra 2i
BIPS Register Blocked SpMV: Power 3
BIPS Register Blocked SpMV: Itanium
BIPS Possible Optimization Techniques • Within an iteration, i.e., computing (G+uuT)*x once – Cache block G*x • On linear programming matrices and matrices with random structure (e.g., LSI), 1.5-4x speedups • Best block size is matrix- and machine-dependent – Reordering and/or splitting of G to separate dense structure (rows, columns, blocks) • Between iterations, e.g., (G+uuT)^2*x – (G+uuT)^2*x = G^2*x + (Gu)(uTx) + u(uTG)x + u(uTu)(uTx) • Compute Gu, uTG, uTu once for all iterations • G^2*x: inter-iteration tiling to read G only once
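To make the inter-iteration algebra concrete: with w = G*u, z = G^T*u, and sigma = u^T*u precomputed once, each application of (G+uuT)^2 costs two sparse products with G plus a few dense vector operations. A sketch, reusing the CSR SpMV helper shown with the CSR storage slide; t is a scratch vector of length n and the names are illustrative.

#include <stddef.h>
#include <string.h>

/* y += G*x with G in CSR; defined with the CSR storage slide. */
void spmv_csr(size_t n, const size_t *ptr, const size_t *ind,
              const double *val, const double *x, double *y);

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* y = (G + u*u^T)^2 * x via the expansion
 *   G^2 x + (G u)(u^T x) + u (u^T G x) + u (u^T u)(u^T x),
 * where w = G*u, z = G^T*u and sigma = u^T*u are precomputed once. */
void apply_G_plus_rank1_squared(size_t n, const size_t *ptr, const size_t *ind,
                                const double *val, const double *u,
                                const double *w, const double *z, double sigma,
                                const double *x, double *t, double *y)
{
    double a = dot(n, u, x);              /* u^T x */
    double c = dot(n, z, x);              /* u^T G x = (G^T u)^T x */
    memset(t, 0, n * sizeof(double));
    spmv_csr(n, ptr, ind, val, x, t);     /* t = G x   */
    memset(y, 0, n * sizeof(double));
    spmv_csr(n, ptr, ind, val, t, y);     /* y = G^2 x */
    for (size_t i = 0; i < n; i++)
        y[i] += w[i] * a + u[i] * (c + sigma * a);
}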
BIPS Multiple Vector Performance: Itanium
Sparse Kernels and Optimizations BIPS • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AAT*x, y=ATA*x – Powers (y=Ak*x), sparse triple product (R*A*RT), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: benchmark implementations – Run-time: estimate matrix properties, evaluate performance models based on benchmark data
BIPS SpTS Performance: Itanium (see the POHLL ’02 workshop paper at ICS ’02)
Sparse Kernels and Optimizations BIPS • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AAT*x, y=ATA*x – Powers (y=Ak*x), sparse triple product (R*A*RT), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: benchmark implementations – Run-time: estimate matrix properties, evaluate performance models based on benchmark data
BIPS Optimizing AAT*x • Kernel: y=AAT*x, where A is sparse, x & y dense – Arises in linear programming, computation of the SVD – Conventional implementation: compute z=AT*x, then y=A*z • Elements of A can be reused • When the ak represent blocks of columns, register blocking can be applied
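In column terms, AAT = sum over columns c_k of c_k*c_k^T, so y = AAT*x can be formed by reading each column once: a dot product c_k^T*x followed by an axpy y += (c_k^T*x)*c_k, which is the cache-level interleaving described earlier. A CSC sketch (the register-level blocking of column blocks is omitted); ATA*x is the analogous loop over CSR rows. Names are illustrative.

#include <stddef.h>
#include <string.h>

/* y = A*A^T*x with A (m x n) stored in CSC: A*A^T = sum_k c_k c_k^T over the
 * columns c_k of A, so each column is read once per product, used for one
 * dot product with x and one axpy into y.  x and y both have length m. */
void aat_x_interleaved(size_t n_cols, const size_t *colptr, const size_t *rowind,
                       const double *val, const double *x, double *y, size_t m)
{
    memset(y, 0, m * sizeof(double));
    for (size_t k = 0; k < n_cols; k++) {
        double t = 0.0;
        for (size_t p = colptr[k]; p < colptr[k + 1]; p++)
            t += val[p] * x[rowind[p]];           /* t = c_k^T x */
        for (size_t p = colptr[k]; p < colptr[k + 1]; p++)
            y[rowind[p]] += t * val[p];           /* y += t * c_k */
    }
}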
BIPS Optimized AAT*x Performance: Pentium III
BIPS Current Directions • Applying new optimizations – Other split data structures (variable block, diagonal, …) – Matrix reordering to create block structure – Structural symmetry • New kernels (triple product RART, powers Ak, …) • Tuning parameter selection • Building an automatically tuned sparse matrix library – Extending the Sparse BLAS – Leverage existing sparse compilers as code generation infrastructure – More thoughts on this topic tomorrow
BIPS Related Work • Automatic performance tuning systems – PHiPAC [Bilmes, et al., ’97], ATLAS [Whaley & Dongarra ’98] – FFTW [Frigo & Johnson ’98], SPIRAL [Pueschel, et al., ’00], UHFFT [Mirkovic and Johnsson ’00] – MPI collective operations [Vadhiyar & Dongarra ’01] • Code generation – FLAME [Gunnels & van de Geijn, ’01] – Sparse compilers: [Bik ’99], Bernoulli [Pingali, et al., ’97] – Generic programming: Blitz++ [Veldhuizen ’98], MTL [Siek & Lumsdaine ’98], GMCL [Czarnecki, et al. ’98], … • Sparse performance modeling – [Temam & Jalby ’92], [White & Saddayappan ’97], [Navarro, et al., ’96], [Heras, et al., ’99], [Fraguela, et al., ’99], …
More Related Work BIPS • Compiler analysis, models – CROPS [Carter, Ferrante, et al.]; serial sparse tiling [Strout ’01] – TUNE [Chatterjee, et al.] – Iterative compilation [O’Boyle, et al., ’98] – Broadway compiler [Guyer & Lin, ’99] – [Brewer ’95], ADAPT [Voss ’00] • Sparse BLAS interfaces – BLAST Forum (Chapter 3) – NIST Sparse BLAS [Remington & Pozo ’94]; SparseLib++ – SPARSKIT [Saad ’94] – Parallel Sparse BLAS [Filippone, et al. ’96]
BIPS Context: Creating High-Performance Libraries • Application performance dominated by a few computational kernels • Today: kernels hand-tuned by vendor or user • Performance tuning challenges – Performance is a complicated function of kernel, architecture, compiler, and workload – Tedious and time-consuming • Successful automated approaches – Dense linear algebra: ATLAS/PHiPAC – Signal processing: FFTW/SPIRAL/UHFFT
BIPS Cache Blocked SpMV on LSI Matrix: Itanium
BIPS Sustainable Memory Bandwidth
BIPS Multiple Vector Performance: Pentium 4
BIPS Multiple Vector Performance: Itanium
BIPS Multiple Vector Performance: Pentium 4
BIPS Optimized AAT*x Performance: Ultra 2 i
BIPS Optimized AAT*x Performance: Pentium 4
BIPS Tuning Pays Off—PHiPAC
BIPS Tuning pays off – ATLAS: extends applicability of PHiPAC; incorporated in Matlab (with rest of LAPACK)
BIPS Register Tile Sizes (Dense Matrix Multiply) • 333 MHz Sun Ultra 2i • 2-D slice of 3-D space; implementations color-coded by performance in Mflop/s • 16 registers, but 2-by-3 tile size fastest
BIPS High Precision GEMV (XBLAS)
BIPS High Precision Algorithms (XBLAS) • Double-double (high-precision word represented as a pair of doubles) – Many variations on these algorithms; we currently use Bailey’s • Exploiting extra-wide registers – Suppose s(1), …, s(n) have f-bit fractions, and SUM has an F>f bit fraction – Consider the following algorithm for S = Σi=1..n s(i) • Sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)| • SUM = 0; for i = 1 to n: SUM = SUM + s(i); end for; sum = SUM – Theorem (Demmel, Hida): suppose F < 2f (less than double precision) • If n ≤ 2^(F-f) + 1, then the error is ≤ 1.5 ulps • If n = 2^(F-f) + 2, then the error is ≤ 2^(2f-F) ulps (can be ≥ 1) • If n ≥ 2^(F-f) + 3, then the error can be arbitrary (S ≠ 0 but sum = 0) – Examples • s(i) double (f=53), SUM double-extended (F=64): accurate if n ≤ 2^11 + 1 = 2049 • Dot product of single precision x(i) and y(i): s(i) = x(i)*y(i) (f=2*24=48), SUM double-extended (F=64): accurate if n ≤ 2^16 + 1 = 65537
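A sketch of the sorted-summation variant described above, using the x87 double-extended type via long double as the wide accumulator (f = 53, F = 64), so by the theorem the result is accurate to about 1.5 ulps for n up to 2049. The qsort comparator is just one way to establish the decreasing-magnitude order; on platforms where long double is no wider than double, the guarantee does not apply.

#include <math.h>
#include <stdlib.h>

/* Compare by decreasing magnitude, for qsort. */
static int cmp_desc_mag(const void *pa, const void *pb)
{
    double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
    return (a < b) - (a > b);
}

/* Sum s[0..n-1] in an extra-wide accumulator after sorting by decreasing
 * magnitude.  With f = 53-bit doubles and an F = 64-bit long double fraction
 * (x87 double-extended), the theorem above bounds the error at about
 * 1.5 ulps for n <= 2^(64-53) + 1 = 2049. */
double wide_sorted_sum(double *s, size_t n)
{
    qsort(s, n, sizeof(double), cmp_desc_mag);
    long double sum = 0.0L;
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return (double)sum;
}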