Tools and Primitives for High Performance Graph Computation

  • Slides: 36

Tools and Primitives for High Performance Graph Computation
John R. Gilbert (University of California, Santa Barbara)
Aydin Buluç (LBNL), Adam Lugowski (UCSB)
SIAM Minisymposium on Analyzing Massive Real-World Graphs, July 12, 2010
Support: NSF, DARPA, DOE, Intel


Large graphs are everywhere…
• Internet structure
• Social interactions
• Scientific datasets: biological, chemical, cosmological, ecological, …
(WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong)

Types of graph computations



An analogy? Continuous physical modeling
(diagram: Continuous physical modeling → Linear algebra → Computers)
As the “middleware” of scientific computing, linear algebra has supplied or enabled:
• Mathematical tools
• “Impedance match” to computer operations
• High-level primitives
• High-quality software libraries
• Ways to extract performance from computer architecture
• Interactive environments

An analogy?
(diagram: Continuous physical modeling → Linear algebra → Computers, alongside Discrete structure analysis → Graph theory → Computers)



An analogy? Well, we’re not there yet….
(diagram: Discrete structure analysis → Graph theory → Computers, with each benefit still an open question)
? Mathematical tools
? “Impedance match” to computer operations
? High-level primitives
? High-quality software libs
? Ways to extract performance from computer architecture
? Interactive environments


The Primitives Challenge
• By analogy to numerical scientific computing… the Basic Linear Algebra Subroutines (BLAS): C = A*B, y = A*x, μ = xᵀy
(figure: speed in MFlops vs. matrix size n for the three BLAS levels)
• What should the combinatorial BLAS look like?


Primitives should…
• Supply a common notation to express computations
• Have broad scope but fit into a concise framework
• Allow programming at the appropriate level of abstraction and granularity
• Scale seamlessly from desktop to supercomputer
• Hide architecture-specific details from users


The Case for Sparse Matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations                         | Graphs in the language of linear algebra
Data driven, unpredictable communication               | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks exploit memory hierarchy
Fine grained data accesses, dominated by latency       | Coarse grained parallelism, bandwidth limited


Sparse array-based primitives
Identification of primitives:
• Sparse matrix-matrix multiplication (SpGEMM)
• Element-wise operations (.*)
• Sparse matrix-dense vector multiplication
• Sparse matrix indexing
• Matrices on various semirings: (×, +), (and, or), (+, min), …
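As a concrete illustration of the central primitive, here is a minimal pure-Python sketch of SpGEMM over the ordinary (+, ×) semiring. The dict-of-dicts representation and the function name are hypothetical, for exposition only; they are not the Combinatorial BLAS data structures or API.

```python
# Minimal SpGEMM sketch: a sparse matrix is a dict mapping row -> {col: value}.
# Illustration only -- not the Combinatorial BLAS representation.

def spgemm(A, B):
    """C = A * B for dict-of-dicts sparse matrices over (+, *)."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():               # nonzeros in row i of A
            for j, b_kj in B.get(k, {}).items():  # nonzeros in row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {1: 2.0}, 1: {2: 3.0}}
B = {1: {0: 5.0}, 2: {0: 7.0}}
print(spgemm(A, B))   # {0: {0: 10.0}, 1: {0: 21.0}}
```

Note that only the nonzeros are ever touched, which is the property the real library's "hypersparse" kernels preserve at scale.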

Multiple-source breadth-first search
(figure: 7-vertex example graph with its transposed adjacency matrix AT and a sparse frontier matrix X, one column per search)





Multiple-source breadth-first search
(figure: the product AT X yields the next frontier of every search at once)
• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Three possible levels of parallelism: searches, vertices, edges
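The idea above can be sketched in a few lines of Python: repeated sparse products over the boolean (or, and) semiring, with X carrying one frontier "column" per search. The data layout and names here are hypothetical, chosen for readability rather than performance.

```python
# Multiple-source BFS as repeated sparse products over the (or, and) semiring.
# AT maps each vertex i to the set of vertices k with an edge k -> i;
# X maps each frontier vertex to the set of search indices that reached it.

def spmm_bool(AT, X):
    """Next frontiers: (AT X) over the boolean semiring."""
    Y = {}
    for i, preds in AT.items():       # row i of AT: edges k -> i
        out = set()
        for k in preds:
            out |= X.get(k, set())    # searches whose frontier contains k
        if out:
            Y[i] = out
    return Y

def multi_bfs(AT, sources):
    level = {s: {j} for j, s in enumerate(sources)}     # initial frontiers
    visited = {v: set(c) for v, c in level.items()}
    while level:
        nxt = spmm_bool(AT, level)
        # mask out vertices each search has already visited
        level = {v: c - visited.get(v, set()) for v, c in nxt.items()}
        level = {v: c for v, c in level.items() if c}
        for v, c in level.items():
            visited.setdefault(v, set()).update(c)
    return visited    # vertex -> set of searches that reached it
```

Each `spmm_bool` call plays the role of one AT·X step on the slide; the three parallelism levels (searches, frontier vertices, edges) correspond to the three nested iterations.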

A Few Examples



Combinatorial BLAS [Buluc, G]
A parallel graph library based on distributed-memory sparse arrays and algebraic graph primitives
(figure: typical software stack)
Betweenness Centrality (BC): what fraction of shortest paths pass through this node? Computed via Brandes’ algorithm.


BC performance in distributed memory
(figure: TEPS in millions vs. number of cores — perfect squares from 4 up to 484 — for an RMAT power-law graph with 2^Scale vertices and average degree 8; curves for Scale 17, 18, 19, 20)
• TEPS = Traversed Edges Per Second
• One page of code using C-BLAS


KDT: A toolbox for graph analysis and pattern discovery [G, Reinhardt, Shah]
Layer 1: Graph Theoretic Tools
• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities


Star-P architecture
(diagram: a MATLAB® client with ordinary Matlab variables talks through client/server managers to processors #0 … #n-1 holding distributed matrices; server-side components include the matrix manager, package manager, dense/sparse routines, sort, ScaLAPACK, FFTW, an FPGA interface, and MPI/UPC user code)


Landscape connectivity modeling
• Habitat quality, gene flow, corridor identification, conservation planning
• Pumas in southern California: 12 million nodes, < 1 hour
• Targeting larger problems: Yellowstone-to-Yukon corridor
Figures courtesy of Brad McRae


Circuitscape [McRae, Shah]
• Predicting gene flow with resistive networks
• Matlab, Python, and Star-P (parallel) implementations
• Combinatorics:
  – Initial discrete grid: ideally 100 m resolution (for pumas)
  – Partition landscape into connected components
  – Graph contraction: habitats become nodes in resistive network
• Numerics:
  – Resistance computations for pairs of habitats in the landscape
  – Iterative linear solvers invoked via Star-P: Hypre (PCG+AMG)

A Few Nuts & Bolts



SpGEMM: sparse matrix × sparse matrix
Why focus on SpGEMM?
• Graph clustering (Markov, peer pressure)
• Subgraph / submatrix indexing
• Shortest path calculations
• Betweenness centrality
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing…


Two Versions of Sparse GEMM
• 1-D block-column distribution: A = [A1 … A8], B = [B1 … B8], C = [C1 … C8]; each Ci = Ci + A Bi
• 2-D block distribution: Cij += Aik Bkj
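The 2-D variant can be simulated serially to show its structure: the matrix lives on a p-by-p grid of sparse blocks, and each C[i][j] accumulates A[i][k]·B[k][j] over k, as the blocks would on a processor grid. This is a hypothetical toy (reusing a dict-of-dicts block format), not the library's kernel.

```python
# Serial simulation of the 2-D sparse GEMM: Cij += Aik * Bkj over a
# p-by-p grid of dict-of-dicts sparse blocks.  Illustration only.

def block_spgemm(A, B, p):
    """A, B: p*p grids of sparse blocks (dict row -> {col: val}); returns C."""
    C = [[{} for _ in range(p)] for _ in range(p)]
    for i in range(p):
        for j in range(p):
            for k in range(p):            # the k-loop is what gets pipelined
                acc = C[i][j]             # Cij += Aik * Bkj
                for r, row in A[i][k].items():
                    for c, a in row.items():
                        for c2, b in B[k][j].get(c, {}).items():
                            d = acc.setdefault(r, {})
                            d[c2] = d.get(c2, 0) + a * b
    return C

# 2x2 grid of 1x1 blocks behaves like a scalar matrix multiply:
I = {0: {0: 1}}
A = [[{0: {0: 1}}, {0: {0: 2}}],
     [{0: {0: 3}}, {0: {0: 4}}]]
B = [[I, {}], [{}, I]]                    # block identity
print(block_spgemm(A, B, 2)[1][0])        # {0: {0: 3}}
```

On a real machine each (i, j) pair is one processor, and the k-loop becomes a sequence of block broadcasts, which is why the 2-D layout communicates O(sqrt(p)) blocks per processor rather than all of A or B.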


Parallelism in multiple-source BFS
Three levels of parallelism from 2-D data decomposition:
• columns of X: over multiple simultaneous searches
• rows of X & columns of AT: over multiple frontier nodes
• rows of AT: over edges incident on high-degree frontier nodes


Modeled limits on speedup, sparse 1-D & 2-D algorithms
(figure: modeled speedup as a function of problem size N and processor count P for the two algorithms)
• 1-D algorithms do not scale beyond 40x
• Break-even point is around 50 processors


Submatrices are hypersparse (nnz << n)
• Average nonzeros per column = c
• With a √p × √p grid of blocks, average nonzeros per column within a block = c/√p → 0 as p grows
• Total memory using compressed sparse columns = O(n√p + nnz)
Any algorithm whose complexity depends on matrix dimension n is asymptotically too wasteful.
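A quick back-of-the-envelope calculation makes the waste concrete: cutting an n-by-n matrix into a g-by-g grid of CSC blocks needs a column-pointer array of length n/g + 1 in every block, so the g² blocks together spend about n·g words on pointers while the nonzero storage stays near 2·nnz. The word counts and sizes below are illustrative assumptions, not measurements from the paper.

```python
# Hypersparse storage argument: per-block CSC pointer arrays dominate
# as the blocking gets finer, even though nnz is fixed.

def csc_words(n_cols, nnz):
    """Words for one CSC block: column pointers + (row index, value) pairs."""
    return (n_cols + 1) + 2 * nnz

def blocked_memory(n, nnz, g):
    """Total words for a g-by-g grid of CSC blocks, nonzeros spread evenly."""
    return g * g * csc_words(n // g, nnz // (g * g))

n, nnz = 1_000_000, 8_000_000      # e.g. average c = 8 nonzeros per column
for g in (1, 8, 64):
    print(g, blocked_memory(n, nnz, g))
```

The printed totals grow roughly like n·g, which is exactly why kernels whose cost depends on the block dimension n (rather than on nnz) stop being viable at scale.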


Distributed-memory sparse matrix-matrix multiplication
§ 2-D block layout
§ Outer product formulation
§ Sequential “hypersparse” kernel: Cij += Aik * Bkj
• Scales well to hundreds of processors
• Betweenness centrality benchmark: over 200 MTEPS
• Experiments: TACC Lonestar cluster
(figure: time vs. number of cores for a 1M-vertex RMAT graph)

CSB: Compressed sparse block storage [Buluc, Fineman, Frigo, G, Leiserson]



CSB for parallel Ax and ATx [Buluc, Fineman, Frigo, G, Leiserson]
• Efficient multiplication of a sparse matrix and its transpose by a vector
• Compressed sparse block storage
• Critical path never more than ~ sqrt(n)·log(n)
• Multicore / multisocket architectures
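The key property, that one blocked structure serves both y = Ax and z = Aᵀx, can be shown with a toy version. This sketch only groups nonzeros into square blocks; the real CSB adds Morton-ordered indices within blocks and parallel scheduling over block rows, which are omitted here, and all names are hypothetical.

```python
# Toy blocked storage in the spirit of CSB: nonzeros grouped into
# beta-by-beta blocks, so Ax and A^T x traverse the same structure.

def to_blocks(triples, beta):
    """Group (i, j, v) nonzeros into beta-by-beta blocks."""
    blocks = {}
    for i, j, v in triples:
        blocks.setdefault((i // beta, j // beta), []).append((i, j, v))
    return blocks

def ax_and_atx(blocks, x, n):
    y = [0.0] * n       # y = A x
    z = [0.0] * n       # z = A^T x, computed in the same pass
    for nz in blocks.values():
        for i, j, v in nz:
            y[i] += v * x[j]
            z[j] += v * x[i]
    return y, z
```

Because each block touches only a beta-wide slice of both input and output vectors, block-level parallelism works symmetrically for A and Aᵀ, which is the point of the data structure.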

From Semirings to Computational Patterns



Matrices over semirings
• Matrix multiplication C = AB (or matrix/vector): Ci,j = Ai,1 B1,j + Ai,2 B2,j + · · · + Ai,n Bn,j
• Replace the scalar operations * and + by ⊗ and ⊕, where:
  – ⊗ is associative and distributes over ⊕, with identity 1
  – ⊕ is associative and commutative, with identity 0, which annihilates under ⊗
• Then Ci,j = (Ai,1 ⊗ B1,j) ⊕ (Ai,2 ⊗ B2,j) ⊕ · · · ⊕ (Ai,n ⊗ Bn,j)
• Examples: (×, +); (and, or); (+, min); …
• No change to data reference pattern or control flow
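The "no change to data reference pattern" claim is easy to demonstrate: a multiply parameterized by (⊕, ⊗, 0) runs the same loops for every semiring. Below is a minimal sketch over dense lists (dense for clarity only), using the (min, +) tropical semiring so that D² gives shortest paths of at most two edges.

```python
# Matrix multiply parameterized by a semiring (add, mul, zero): only the
# scalar operations change, never the loop structure.

def semiring_matmul(A, B, add, mul, zero):
    n, m, q = len(A), len(B), len(B[0])
    C = [[zero] * q for _ in range(n)]
    for i in range(n):
        for j in range(q):
            acc = zero                       # identity of "add"
            for k in range(m):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C

INF = float("inf")
D = [[0, 3, INF],
     [INF, 0, 1],
     [INF, INF, 0]]
# (min, +) semiring: D2[i][j] = shortest path from i to j using <= 2 edges
D2 = semiring_matmul(D, D, add=min, mul=lambda a, b: a + b, zero=INF)
print(D2[0][2])   # 4  (path 0 -> 1 -> 2 of weight 3 + 1)
```

Swapping in `add=lambda a, b: a or b`, `mul=lambda a, b: a and b`, `zero=False` turns the same routine into boolean reachability, with an identical memory access pattern.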


From semirings to computational patterns
Sparse matrix times vector as a semiring operation:
– Given vertex data xi and edge data ai,j
– For each vertex j of interest, compute yj = (ai1,j ⊗ xi1) ⊕ (ai2,j ⊗ xi2) ⊕ · · · ⊕ (aik,j ⊗ xik)
– User specifies: definition of operations ⊗ and ⊕


From semirings to computational patterns
Sparse matrix times vector as a computational pattern:
– Given vertex data and edge data
– For each vertex of interest, combine data from neighboring vertices and edges
– User specifies: desired computation on data from neighbors


SpGEMM as a computational pattern
• Explore length-two paths that use specified vertices
• Possibly do some filtering, accumulation, or other computation with vertex and edge attributes
• E.g. “friends of friends” (per Lars Backstrom)
• May or may not want to form the product graph explicitly
• Formulation as semiring matrix multiplication is often possible but sometimes clumsy
• Same data flow and communication patterns as in SpGEMM
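The "friends of friends" example above can be sketched directly as the SpGEMM pattern with filtering: square the friendship relation (boolean SpGEMM, one row at a time) and then drop length-two paths that land on yourself or on an existing friend. The data shapes and names are hypothetical, for illustration only.

```python
# "Friends of friends" as SpGEMM-with-filtering, without forming the
# full product graph: each row of the squared relation is filtered
# immediately and kept only if something survives.

def friends_of_friends(friends):
    """friends: dict person -> set of friends (assumed symmetric)."""
    suggestions = {}
    for p, fs in friends.items():
        # length-two paths: one row of the boolean SpGEMM
        reach = set()
        for f in fs:
            reach |= friends.get(f, set())
        # the filtering step of the computational pattern
        reach -= fs          # already friends
        reach.discard(p)     # yourself
        if reach:
            suggestions[p] = reach
    return suggestions

net = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"}}
print(friends_of_friends(net))   # {'ann': {'cat'}, 'cat': {'ann'}}
```

Expressing the same filter inside a semiring multiply is possible but awkward, which is exactly the slide's argument for exposing the pattern rather than only the algebra.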


Graph BLAS: A pattern-based library
• User-specified operations and attributes give the performance benefits of algebraic primitives with a more intuitive and flexible interface.
• Common framework integrates algebraic (edge-based), visitor (traversal-based), and map-reduce patterns.
• 2-D compressed sparse block structure supports user-defined edge/vertex/attribute types and operations.
• “Hypersparse” kernels tuned to reduce data movement.
• Initial target: manycore and multisocket shared memory.


Challenge: Complete the analogy…
(diagram: Discrete structure analysis → Graph theory → Computers, with each benefit still an open question)
? Mathematical tools
? “Impedance match” to computer operations
? High-level primitives
? High-quality software libs
? Ways to extract performance from computer architecture
? Interactive environments