Parallel Graph Algorithms. Kamesh Madduri (KMadduri@lbl.gov), Lawrence Berkeley National Laboratory. CS 267/EngC 233, Spring 2010. April 15, 2010

Lecture Outline • Applications • Review of key results • Case studies: graph traversal-based problems, parallel algorithms – Breadth-First Search – Single-source Shortest Paths – Betweenness Centrality – Community Identification

Routing in transportation networks • Road networks, point-to-point shortest paths: query time reduced from 15 seconds (naïve) to 10 microseconds. H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.

Internet and the WWW • The world-wide web can be represented as a directed graph – Web search and crawl: traversal – Link analysis, ranking: PageRank and HITS – Document classification and clustering • Internet topologies (router networks) are naturally modeled as graphs

Scientific Computing • Reorderings for sparse solvers – Fill-reducing orderings § Partitioning, eigenvectors – Heavy diagonal to reduce pivoting (matching) • Data structures for efficient exploitation of sparsity • Derivative computations for optimization – Matroids, graph colorings, spanning trees • Preconditioning – Incomplete factorizations – Partitioning for domain decomposition – Graph techniques in algebraic multigrid § Independent sets, matchings, etc. – Support theory § Spanning trees & graph embedding techniques. Source: B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf. Image sources: Yifan Hu, “A gallery of large graphs”; Tim Davis, UF Sparse Matrix Collection.

Large-scale data analysis • Graph abstractions are very useful to analyze complex data sets. • Sources of data: petascale simulations, experimental devices, the Internet, sensor networks • Challenges: data size, heterogeneity, uncertainty, data quality • Astrophysics: massive datasets, temporal variations • Bioinformatics: data quality, heterogeneity • Social informatics: new analytics challenges, data uncertainty. Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualcomplexity.com

Data Analysis and Graph Algorithms in Systems Biology • Study of the interactions between various components in a biological system • Graph-theoretic formulations are pervasive: – Predicting new interactions: modeling – Functional annotation of novel proteins: matching, clustering – Identifying metabolic pathways: paths, clustering – Identifying new protein complexes: clustering, centrality. Image source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.

Graph-theoretic problems in social networks – Community identification: clustering – Targeted advertising: centrality – Information spreading: modeling. Image source: Nexus (Facebook application)

Network Analysis for Intelligence and Surveillance • [Krebs ’04] Post-9/11 terrorist network analysis from public domain information • Plot masterminds correctly identified from interaction patterns: centrality • A global view of entities is often more insightful • Detect anomalous activities by exact/approximate subgraph isomorphism. Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, “Graph-based technologies for intelligence analysis”, CACM 47(3), March 2004, pp. 45-47

Research in Parallel Graph Algorithms • Application areas: social network analysis (find central entities, community detection, network dynamics); WWW (marketing, social search); computational biology (gene regulation, metabolic pathways, genomics); scientific computing (graph partitioning, matching, coloring); engineering (VLSI CAD, route planning) • Methods/problems: graph algorithms (traversal, shortest paths, connectivity, max flow, …); axes: data size, problem complexity • Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds

Characterizing Graph-theoretic computations (2/2) • Input data • Problem: find paths, clusters, partitions, matchings, patterns, orderings • Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, … • Factors that influence choice of algorithm: – graph sparsity (m/n ratio) – static/dynamic nature – weighted/unweighted, weight distribution – vertex degree distribution – directed/undirected – simple/multi/hyper graph – problem size – granularity of computation at nodes/edges – domain-specific characteristics • Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations

History • 1735: “Seven Bridges of Königsberg” problem, resolved by Euler, first result in graph theory. • … • 1966: Flynn’s Taxonomy. • 1968: Batcher’s “sorting networks” • 1969: Harary’s “Graph Theory” • … • 1972: Tarjan’s “Depth-first search and linear graph algorithms” • 1975: Reghbati and Corneil, Parallel Connected Components • 1982: Misra and Chandy, distributed graph algorithms. • 1984: Quinn and Deo’s survey paper on “parallel graph algorithms” • …

The PRAM model • Idealized parallel shared memory system model • Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead • Variants include EREW (exclusive read, exclusive write) and CREW (concurrent read, exclusive write) • Measuring performance: space and time complexity; total number of operations (work)

The Helman-JáJá model • Extension to the PRAM model for shared memory algorithm design and analysis. • T(n, p) is measured by the triplet (TM(n, p), TC(n, p), B(n, p)) – TM(n, p): maximum number of non-contiguous main memory accesses required by any processor – TC(n, p): upper bound on the maximum local computational complexity of any of the processors – B(n, p): number of barrier synchronizations.

PRAM Pros and Cons • Pros – Simple and clean semantics. – The majority of theoretical parallel algorithms are specified with the PRAM model. – Independent of the communication network topology. • Cons – Not realistic, too powerful communication model. – Algorithm designer is misled to use IPC without hesitation. – Synchronized processors. – No local memory. – Big-O notation is often misleading.

Building blocks of classical PRAM graph algorithms • Prefix sums • List ranking – Euler tours, Pointer jumping, Symmetry breaking • Vertex collapse • Tree contraction

Prefix Sums • Input: A, an array of n elements; associative binary operation • Output: the prefix sums of A under the operation • O(n) work, O(log n) time, n processors [Figure: balanced binary tree over n = 8 elements, with node values B(h, j) computed in the upward pass and C(h, j) computed in the downward pass, for levels h = 0 to 3]
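The balanced-tree idea behind the O(n) work, O(log n) time bound can be sketched in Python. This is a sequential simulation under stated assumptions, not the slide's exact B/C formulation: the function name `tree_prefix_sums` and the `identity` parameter are illustrative, the input is padded to a power of two, and each inner `for` loop stands for one parallel round whose iterations are independent.

```python
from operator import add


def tree_prefix_sums(a, op=add, identity=0):
    """Inclusive prefix sums via an up-sweep and a down-sweep over an implicit tree."""
    n = len(a)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    t = list(a) + [identity] * (size - n)  # pad to a power of two (assumption)

    # Up-sweep: each round combines pairs of subtree totals (log n rounds).
    step = 1
    while step < size:
        for i in range(2 * step - 1, size, 2 * step):  # independent iterations
            t[i] = op(t[i - step], t[i])
        step *= 2

    # Down-sweep: push "total of everything to my left" down the tree.
    t[size - 1] = identity
    step = size // 2
    while step >= 1:
        for i in range(2 * step - 1, size, 2 * step):  # independent iterations
            left_total = t[i - step]
            t[i - step] = t[i]
            t[i] = op(t[i], left_total)
        step //= 2

    # t now holds exclusive prefixes; fold in each element for inclusive ones.
    return [op(t[i], a[i]) for i in range(n)]
```

The operand order in `op` is kept left-to-right, so the sketch also works for associative operations that are not commutative, such as string concatenation.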

Parallel Prefix • X: array of n elements stored in arbitrary order. • For each element i, let X(i).value be its value and X(i).next be the index of its successor. • For binary associative operator Θ, compute X(i).prefix such that – X(head).prefix = X(head).value, and – X(i).prefix = X(i).value Θ X(predecessor).prefix, where head is the first element, i is not equal to head, and predecessor is the node preceding i. • List ranking: special case of parallel prefix, values initially set to 1, and addition is the associative operator.

List ranking illustration • Ordered list (X.next values): 2 3 4 5 6 7 8 9 • Random list (X.next values): 4 6 5 7 8 3 2 9

List Ranking key idea 1. Chop X randomly into s pieces. 2. Traverse each piece using a serial algorithm. 3. Compute the global rank of each element using the result computed from the second step. • In the Helman-JáJá model, TM(n, p) = O(n/p).
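The three steps above can be sketched sequentially in Python. The names `list_rank`, `nxt`, `s` and `seed` are illustrative assumptions, as is the convention that the tail's successor is -1 and indices are 0-based. In a parallel implementation the sublist traversals of step 2 would run concurrently, one or more per processor.

```python
import random


def list_rank(nxt, head, s=4, seed=0):
    """Return rank[v] = 1-based position of node v, counted from the head."""
    n = len(nxt)
    rng = random.Random(seed)

    # Step 1: chop the list into pieces by picking random sublist heads.
    others = [v for v in range(n) if v != head]
    splitters = [head] + rng.sample(others, min(s - 1, len(others)))
    is_splitter = [False] * n
    for v in splitters:
        is_splitter[v] = True

    # Step 2: traverse each piece serially (pieces are independent).
    local = [0] * n   # rank within the piece
    owner = [0] * n   # head of the piece containing each node
    length, succ = {}, {}
    for h in splitters:
        v, r = h, 0
        while True:
            r += 1
            local[v], owner[v] = r, h
            v = nxt[v]
            if v == -1 or is_splitter[v]:
                break
        length[h], succ[h] = r, v

    # Step 3: prefix sums over piece lengths give each piece's global offset.
    offset, h, acc = {}, head, 0
    while h != -1:
        offset[h] = acc
        acc += length[h]
        h = succ[h]
    return [offset[owner[v]] + local[v] for v in range(n)]
```

For the 8-element random list with successor values 4 6 5 7 8 3 2 9, rewritten 0-based as `[3, 5, 4, 6, 7, 2, 1, -1]` with head 0, the ranks come out as `[1, 4, 6, 2, 7, 5, 3, 8]` regardless of which splitters are drawn.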

Connected Components • Building block for many graph algorithms – Minimum spanning tree, biconnected components, planarity testing, etc. • Representative of the “graft-and-shortcut” approach • CRCW PRAM algorithms – [Shiloach & Vishkin ’82]: O(log n) time, O((m+n) log n) work – [Gazit ’91]: randomized, optimal, O(log n) time. • CREW algorithms – [Han & Wagner ’90]: O(log^2 n) time, O((m + n log n) log n) work.

Shiloach-Vishkin algorithm • Input: n isolated vertices and m PRAM processors. • Each processor Pi grafts a tree rooted at vertex vi to the tree that contains one of its neighbors u, under the constraint u < vi. • Grafting creates k ≥ 1 connected subgraphs, and each subgraph is then shortcut so that the depth of the trees reduces at least by half. • Repeat graft and shortcut until no more grafting is possible. • Runs on arbitrary CRCW PRAM in O(log n) time with O(m) processors. • Helman-JáJá model: TM = (3m/p + 2) log n, TB = 4 log n. [Figure: graft and shortcut steps on a four-vertex example (vertices 1, 2, 3, 4)]
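A simplified sequential simulation of the graft-and-shortcut loop is sketched below. It is not the exact PRAM algorithm: the function name `sv_components` is illustrative, grafts are applied one edge at a time rather than concurrently, and each round shortcuts all the way to stars rather than halving tree depth. The invariant it relies on is the slide's constraint that a root is only ever grafted onto a smaller label.

```python
def sv_components(n, edges):
    """Label each vertex with the smallest vertex id in its connected component."""
    parent = list(range(n))  # n isolated vertices: every vertex is its own root
    changed = True
    while changed:
        changed = False
        # Graft: hook a root onto the tree of a neighbor with a smaller label.
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                pa, pb = parent[a], parent[b]
                if pb < pa and parent[pa] == pa:
                    parent[pa] = pb
                    changed = True
        # Shortcut: pointer jumping until every vertex points at its root.
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent
```

Because parent pointers only ever move to smaller ids, each tree's root is its minimum vertex, and the loop stops once no edge connects two different trees.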

An example higher-level algorithm • Typically composed of several low-level efficient algorithms. • Tarjan-Vishkin’s biconnected components algorithm: O(log n) time, O(m+n) processors. 1. Compute spanning tree T for the input graph G. 2. Compute Eulerian circuit for T. 3. Root the tree at an arbitrary vertex. 4. Preorder numbering of all the vertices. 5. Label edges using vertex numbering. 6. Connected components using the Shiloach-Vishkin algorithm.

Data structures: graph representation • Static case – Dense graphs (m = O(n^2)): adjacency matrix commonly used. – Sparse graphs: adjacency lists • Dynamic case – Representation depends on common-case query – Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates? – Graph update rate – Queries: connectivity, paths, flow, etc. • Optimizing for locality is a key design consideration.
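For the static sparse case, adjacency lists are commonly packed into two arrays (a compressed sparse row layout), which keeps each vertex's neighbors contiguous in memory. The sketch below assumes a directed edge list of (u, v) pairs; the names `build_csr`, `offsets`, `targets` and `neighbors` are illustrative.

```python
def build_csr(n, edges):
    """Pack a directed edge list into (offsets, targets) adjacency arrays."""
    offsets = [0] * (n + 1)
    for u, _ in edges:                 # count out-degree of each vertex
        offsets[u + 1] += 1
    for i in range(n):                 # prefix sums turn counts into offsets
        offsets[i + 1] += offsets[i]
    targets = [0] * len(edges)
    cursor = offsets[:n]               # next free slot per vertex (slice copies)
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets


def neighbors(offsets, targets, u):
    """Adjacencies of u occupy one contiguous block of `targets`."""
    return targets[offsets[u]:offsets[u + 1]]
```

The two arrays take roughly m + n words in total, and a traversal reads each adjacency list as one contiguous run.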

Data structures in (parallel) graph algorithms • A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree • Implementations are typically array-based for performance considerations. • Key data structure considerations in parallel graph algorithm design – Practical parallel priority queues – Space-efficiency – Parallel set/multiset operations, e.g., union, intersection, etc.

Graph Algorithms on today’s systems • Concurrency – Simulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronization • Locality • Work-efficiency – Try to improve cache locality, but avoid “too much” superfluous computation

The locality challenge “Large memory footprint, low spatial and temporal locality impede performance” • Serial performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache) • Input: synthetic R-MAT graphs (# of edges m = 8n) [Plot annotations: no last-level cache (LLC) misses; O(m) LLC misses; ~5X drop in performance]

The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems” • Graph topology assumptions in classical algorithms do not match real-world datasets • Parallelization strategies at loggerheads with techniques for enhancing memory locality • Classical “work-efficient” graph algorithms may not fully exploit new architectural features – Increasing complexity of memory hierarchy (x86), DMA support (Cell), wide SIMD, floating point-centric cores (GPUs). • Tuning implementation to minimize parallel overhead is nontrivial – Shared memory: minimizing overhead of locks, barriers. – Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication w/ computation.

Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributions
Problem spec.                                     Assumptions
No. of vertices/edges                             10^6 ~ 10^9
Edge/vertex ratio                                 1 ~ 100
Static/dynamic?                                   Static
Diameter                                          O(1) ~ O(log n)
Weighted/unweighted?                              -
Vertex degree distribution                        Unbalanced (“power law”)
Directed/undirected?                              Both
Simple/multi/hypergraph?                          Multigraph
Granularity of computation at vertices/edges?     Minimal
Exploiting domain-specific characteristics?       Partially
Test data: synthetic R-MAT networks; real-world networks (data: Mislove et al., IMC 2007).

Graph traversal (BFS) problem definition • Input: graph and source vertex • Output: distance from source vertex, for every vertex [Figure: example graph on vertices 0-9, each labeled with its distance (0 to 4) from the source vertex] • Memory requirements (# of machine words): – Sparse graph representation: m+n – Stack of visited vertices: n – Distance array: n

Parallel BFS Strategies 1. Expand current frontier (level-synchronous approach, suited for low-diameter graphs) • O(D) parallel steps • Adjacencies of all vertices in current frontier are visited in parallel 2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs) • Path-limited searches from “super vertices” • APSP between “super vertices” [Figures: the example graph on vertices 0-9 with its source vertex, once per strategy]
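The level-synchronous strategy can be sketched as a sequential Python loop over frontiers. The name `bfs_levels` and the list-of-lists adjacency format are illustrative assumptions. The loop over `frontier` is the part that runs in parallel in a real implementation, where the visited check and update would need an atomic operation or a per-thread output queue.

```python
def bfs_levels(adj, source):
    """Return dist[v] = number of edges from source to v, or -1 if unreachable."""
    dist = [-1] * len(adj)
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:                     # one iteration per BFS level: O(D) steps
        next_frontier = []
        for u in frontier:              # parallel over frontier vertices
            for v in adj[u]:            # parallel over adjacencies
                if dist[v] < 0:         # not yet visited
                    dist[v] = level + 1
                    next_frontier.append(v)
        frontier = next_frontier        # implicit barrier between levels
        level += 1
    return dist
```

The number of outer iterations equals the number of BFS levels, which is why this approach suits low-diameter graphs.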

A deeper dive into the “level synchronous” strategy • Locality (where are the random accesses originating from?) 1. Ordering of vertices in the “current frontier” array, i.e., accesses to adjacency indexing array, cumulative accesses O(n). 2. Ordering of adjacency list of each vertex, cumulative O(m). 3. Sifting through adjacencies to check whether visited or not, cumulative accesses O(m). • Example access patterns: (1) idx array: 53, 31, 74, 26; (2, 3) d array: 0, 84, 93, 44, 63, 0, 0, 11 [Figure: example frontier and adjacency values 53, 84, 93, 0, 31, 44, 74, 26, 63, 11]

Performance Observations [Plots for the Youtube and Flickr social networks; labels: graph expansion, edge filtering]

Improving locality: Vertex relabeling [Figure: sparse matrix nonzero patterns] • Well-studied problem, slight differences in problem formulations – Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks. – Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression • NP-hard problem, several known heuristics – We require fast, linear-work approaches – Existing ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit overlap in adjacency lists, dimensionality reduction
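As one concrete instance of the BFS-based heuristics mentioned above, the sketch below renumbers vertices in BFS visit order and sorts each relabeled adjacency list. It is a plain BFS ordering, not Cuthill-McKee (which also orders neighbors by degree). The name `bfs_relabel` is illustrative, and a non-empty graph is assumed.

```python
from collections import deque


def bfs_relabel(adj, start=0):
    """Renumber vertices in BFS visit order; return (new_id, relabeled adjacency)."""
    n = len(adj)
    new_id = [-1] * n
    count = 0
    # Visit the start vertex first, then any vertices in other components.
    for root in [start] + [v for v in range(n) if v != start]:
        if new_id[root] != -1:
            continue
        new_id[root] = count
        count += 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = count
                    count += 1
                    queue.append(v)
    # Rebuild adjacency lists under the new labels, sorted for ordered accesses.
    new_adj = [[] for _ in range(n)]
    for u in range(n):
        new_adj[new_id[u]] = sorted(new_id[v] for v in adj[u])
    return new_id, new_adj
```

Neighbors tend to receive nearby labels under this ordering, so per-vertex arrays such as the distance array are accessed with better spatial locality during later traversals.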

Improving locality: Optimizations • Recall: potential O(m) non-contiguous memory references in edge traversal (to check if vertex is visited). – e.g., access order: 53, 31, 26, 74, 84, 0, … • Objective: reduce TLB misses and private cache misses, exploit shared cache. • Optimizations: 1. Sort the adjacency lists of each vertex: helps order memory accesses, reduce TLB misses. 2. Permute vertex labels: enhance spatial locality. 3. Cache-blocked edge visits: exploit temporal locality. [Figure: example adjacency values 53, 84, 93, 0, 31, 44, 74, 26, 63, 11]

Improving locality: Cache blocking • New: cache-blocked approach. Instead of processing adjacencies of each vertex serially, exploit sorted adjacency list structure w/ blocked accesses. • Requires multiple passes through the frontier array, tuning for optimal block size. • Note: frontier array size may be O(n) [Figure: linear processing of the frontier's adjacencies (d) versus the cache-blocked approach; labels: metadata denoting blocking pattern; process high-degree vertices separately; tune to L2 cache size]

Vertex relabeling heuristic. Similar to older heuristics, but tuned for small-world networks: 1. High percentage of vertices with (out) degrees 0, 1, and 2 in social and information networks => store adjacencies explicitly (in indexing data structure). § Augment the adjacency indexing data structure (with two additional words) and frontier array (with one bit) 2. Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking. 3. Form dense blocks around high-degree vertices § Reverse Cuthill-McKee, removing degree 1 and degree 2 vertices

Architecture-specific Optimizations 1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight) – Speculative loads of index array and adjacencies of frontier vertices will reduce compulsory cache misses. 2. Aligning adjacency lists to optimize memory accesses – 16-byte aligned loads and stores are faster. – Alignment helps reduce cache misses due to fragmentation – 16-byte aligned non-temporal stores (during creation of new frontier) are fast. 3. SIMD SSE integer intrinsics to process “high-degree vertex” adjacencies. 4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsics have very low overhead) 5. Hugepage support (significant TLB miss reduction) 6. NUMA-aware memory allocation exploiting first-touch policy

Experimental Setup
Network       n           m        Max. outdegree   % of vertices w/ outdegree 0, 1, 2
Orkut         3.07 M      223 M    32 K             5
LiveJournal   5.28 M      77.4 M   9 K              40
Flickr        1.86 M      22.6 M   26 K             73
Youtube       1.15 M      4.94 M   28 K             76
R-MAT         8 M-64 M    8n       n^0.6            -
Intel Xeon 5560 (Core i7, “Nehalem”) • 2 sockets x 4 cores x 2-way SMT • 12 GB DRAM, 8 MB shared L3 • 51.2 GBytes/sec peak bandwidth • 2.66 GHz proc.
Performance averaged over 10 different source vertices, 3 runs each.

Impact of optimization strategies
Optimization                                             Generality   Impact*   Tuning required?
(Preproc.) Sort adjacency lists                          High         --        No
(Preproc.) Permute vertex labels                         Medium       --        Yes
Preproc. + binning frontier vertices + cache blocking    Medium       2.5x      Yes
Lock-free parallelization                                Medium       2.0x      No
Low-degree vertex filtering                              Low          1.3x      No
Software prefetching                                     Medium       1.10x     Yes
Aligning adjacencies, streaming stores                   Medium       1.15x     No
Fast atomic intrinsics                                   High         2.2x      No
* Optimization speedup (performance on 4 cores) w.r.t. baseline parallel approach, on a synthetic R-MAT graph (n = 2^23, m = 2^26)

Cache locality improvement Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words) Theoretical count of the number of noncontiguous memory accesses: m+3 n

Parallel performance (Orkut graph) Execution time: 0. 28 seconds (8 threads) Parallel speedup: 4. 9 Speedup over baseline: 2. 9 Graph: 3. 07 million vertices, 220 million edges Single socket of Intel Xeon 5560 (Core i 7)

Parallel Single-source Shortest Paths (SSSP) algorithms • Edge weights: concurrency primary challenge! • No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work • Parallel priority queues: relaxed heaps [DGST88], [BTZ98] • Ullman-Yannakakis randomized approach [UY90] • Meyer and Sanders, ∆-stepping algorithm [MS03] • Distributed memory implementations based on graph partitioning • Heuristics for load balancing and termination detection. K. Madduri, D. A. Bader, J. W. Berry, and J. R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

∆-stepping algorithm [MS03] • Label-correcting algorithm: can relax edges from unsettled vertices also • ∆-stepping: “approximate bucket implementation of Dijkstra’s algorithm” • ∆: bucket width • Vertices are ordered using buckets representing priority range of size ∆ • Each bucket may be processed in parallel

∆-stepping algorithm: illustration (∆ = 0.1, say)
[Figure: directed graph on vertices 0-6 with edge weights 0.05, 0.07, 0.01, 0.02, 0.13, 0.15, 0.56, 0.23, 0.18, shown alongside the d array, the buckets, the request set R, and the set S of deleted vertices.]
One parallel phase: while (bucket is non-empty) i) Inspect light edges ii) Construct a set of “requests” (R) iii) Clear the current bucket iv) Remember deleted vertices (S) v) Relax request pairs in R. Then relax heavy request pairs (from S) and go on to the next bucket.
State after each step of the illustration:
1. Start: d = [∞, ∞, ∞, ∞, ∞, ∞, ∞]; all buckets empty.
2. Initialization: insert s into bucket, d(s) = 0. d[0] = 0; bucket 0 = {0}.
3. Inspect light edges of vertex 0: R = {(2, .01)}.
4. Clear the current bucket: bucket 0 = {}; S = {0}.
5. Relax request pairs in R: d[2] = .01; bucket 0 = {2}.
6. Inspect light edges of vertex 2: R = {(1, .03), (3, .06)}.
7. Clear the current bucket: bucket 0 = {}; S = {0, 2}.
8. Relax request pairs in R: d = [0, .03, .01, .06, ∞, ∞, ∞]; bucket 0 = {1, 3}.
9. Bucket 0 empties after vertices 1 and 3 are processed (S = {0, 2, 1, 3}); heavy request pairs from S are relaxed: d = [0, .03, .01, .06, .16, .29, .62]; bucket 1 = {4}, bucket 2 = {5}, bucket 6 = {6}.

Classify edges as “heavy” and “light”

Relax light edges (phase). Repeat until B[i] is empty.

Relax heavy edges. No reinsertions in this step.
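The bucket loop, light-edge phases and heavy-edge relaxation described above can be put together in a sequential Python sketch. It is a simulation under stated assumptions, not the tuned parallel implementation: the name `delta_stepping` and the (u, v, w) edge-list format are illustrative, weights are assumed non-negative, and the requests built in each phase are the part that would be generated and relaxed in parallel.

```python
import math


def delta_stepping(n, edges, source, delta):
    """Single-source shortest path distances via bucketed label correcting."""
    # Classify edges as light (w <= delta) or heavy (w > delta).
    light = [[] for _ in range(n)]
    heavy = [[] for _ in range(n)]
    for u, v, w in edges:
        (light if w <= delta else heavy)[u].append((v, w))

    dist = [math.inf] * n
    buckets = {}  # bucket index -> set of vertices

    def relax(v, x):
        # Move v to the bucket matching its improved tentative distance.
        if x < dist[v]:
            if dist[v] < math.inf:
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = x
            buckets.setdefault(int(x // delta), set()).add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)                 # smallest remaining bucket index
        settled = set()                  # S: vertices deleted from bucket i
        while buckets.get(i):            # one phase per iteration
            current = buckets.pop(i)     # clear the current bucket
            requests = [(v, dist[u] + w) for u in current for v, w in light[u]]
            settled |= current
            for v, x in requests:        # may reinsert vertices into bucket i
                relax(v, x)
        buckets.pop(i, None)
        # Heavy edges are relaxed once per bucket, from every vertex in S.
        for v, x in [(v, dist[u] + w) for u in settled for v, w in heavy[u]]:
            relax(v, x)
    return dist
```

A smaller ∆ makes the algorithm behave more like Dijkstra's (fewer re-relaxations, more buckets), while a larger ∆ behaves more like Bellman-Ford (more work per bucket, more parallelism).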

No. of phases (machine-independent performance count) [Plot groups: high diameter, low diameter]

Average shortest path weight for various graph families: ~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0, 1]

Last non-empty bucket (machine-independent performance count) Fewer buckets, more parallelism

Number of bucket insertions (machine-independent performance count)

Betweenness Centrality • Centrality: quantitative measure to capture the importance of a vertex/edge in a graph – degree, closeness, eigenvalue, betweenness • Betweenness centrality: BC(v) = \sum_{s \neq v \neq t} \sigma_{st}(v) / \sigma_{st} (\sigma_{st}: no. of shortest paths between s and t; \sigma_{st}(v): no. of those paths that pass through v) • Applied to several real-world networks – Social interactions – WWW – Epidemiology – Systems biology

Algorithms for Computing Betweenness • All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n^3) time), sum up the fractional dependency values (O(n^2) space). • Brandes’ algorithm (2003): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.

Our New Parallel Algorithms • Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality – low-diameter sparse graphs (diameter D = O(log n), m = O(n log n)) – Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time. • Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references – In practice, 2-3X faster than previous algorithm – Lock-free => better scalability on large parallel systems

Parallel BC Algorithm • Consider an undirected, unweighted graph • High-level idea: Level-synchronous parallel Breadth-First Search augmented to compute centrality scores • Two steps – traversal and path counting – dependency accumulation

Parallel BC Algorithm Illustration
[Figure: example graph on vertices 0-9 with a source vertex, shown with the stack S, distance array D, and predecessor multisets P as the traversal proceeds.]
1. Traversal step: visit adjacent vertices, update distance and path counts.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
At the end of the traversal step, we have all reachable vertices, their corresponding predecessor multisets, and D values.

Graph traversal step analysis • Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n) • Upper bound on size of each predecessor multiset: In-degree • Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts • New algorithm: Based on observation that we don’t need to store “predecessor” vertices. Instead, we store successor edges along shortest paths. – simplifies the accumulation step – reduces an atomic operation in traversal step – cache-friendly!

Graph Traversal Step locality analysis
for all vertices u at level d in parallel do          // all the vertices are in a contiguous block (stack)
  for all adjacencies v of u in parallel do           // all the adjacencies of a vertex are stored compactly (graph rep.)
    dv = D[v];                                        // non-contiguous memory access
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);            // non-contiguous memory access
      if (vis == 0)
        D[v] = d+1;
        pS[count++] = v;
      fetch_and_add(&sigma[v], sigma[u]);             // non-contiguous memory access
      fetch_and_add(&Scount[u], 1);                   // store to S[u]
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);
Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously.

Parallel BC Algorithm Illustration 2. Accumulation step: pop vertices from the stack, update dependence scores. Can also be done in a level-synchronous manner. [Figure: the example graph on vertices 0-9 with the stack S, the Delta array, and predecessor multisets P.]

Accumulation step locality analysis
for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do        // all the vertices are in a contiguous block (stack)
    for all w in S[v] in parallel do reduction(delta) // each S[v] is a contiguous block
      delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];   // only floating point operation in code
    BC[v] = delta_sum_v;
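The two steps (traversal with path counting, then dependency accumulation over successor lists) can be combined into a sequential Python sketch for unweighted graphs. The name `betweenness_unweighted` and the list-of-lists adjacency format are illustrative assumptions. Plain additions stand in for the `fetch_and_add` atomics, and scores are summed over ordered (s, t) pairs, so for an undirected graph each unordered pair is counted twice.

```python
def betweenness_unweighted(adj):
    """Betweenness scores via level-by-level BFS and successor-based accumulation."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        dist = [-1] * n
        sigma = [0] * n                   # number of shortest s-v paths
        succ = [[] for _ in range(n)]     # successors along shortest paths
        dist[s], sigma[s] = 0, 1
        levels = [[s]]

        # Traversal and path counting, one BFS level at a time.
        while levels[-1]:
            d = len(levels) - 1
            next_level = []
            for u in levels[-1]:
                for v in adj[u]:
                    if dist[v] < 0:
                        dist[v] = d + 1
                        next_level.append(v)
                    if dist[v] == d + 1:
                        sigma[v] += sigma[u]
                        succ[u].append(v)
            levels.append(next_level)

        # Dependency accumulation, from the deepest level back to the source.
        delta = [0.0] * n
        for level in reversed(levels[:-1]):
            for v in level:
                for w in succ[v]:
                    delta[v] += (1 + delta[w]) * sigma[v] / sigma[w]
                if v != s:
                    bc[v] += delta[v]
    return bc
```

On the path 0-1-2, only the middle vertex lies between another pair, and it scores 2.0 because both (0, 2) and (2, 0) are counted. Halving the result gives the usual undirected convention.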

Centrality Analysis applied to Protein Interaction Networks [Figure annotation: 43 interactions; protein Ensembl ID ENSG00000145332.2, Kelch-like protein 8]

Community Identification • Implicit communities in large-scale networks are of interest in many cases. – WWW – Social networks – Biological networks • Formulated as a graph clustering problem. – Informally, identify/extract “dense” sub-graphs. • Several different objective functions exist. – Metrics based on intra-cluster vs. inter-cluster edges, community sizes, number of communities, overlap … • Highly studied research problem – 100s of papers yearly in CS, Social Sciences, Physics, Comp. Biology, Applied Math journals and conferences.

Agglomerative Clustering, Parallelization • Bottom-up approach: Start with n singleton communities, iteratively merge pairs to form larger communities. – What measure to minimize/maximize? A measure known as modularity – How do we order merges? Use a priority queue • Parallelization: perform multiple “independent” merges simultaneously.
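The bottom-up scheme can be sketched in Python as a greedy loop that repeatedly applies the merge giving the largest modularity gain. This is a small illustration, not the lecture's implementation: the names `modularity` and `greedy_agglomerate` are assumptions, the graph is an undirected edge list, and every candidate merge is rescored from scratch instead of being ordered with a priority queue.

```python
def modularity(edges, community):
    """Modularity Q of a partition: sum over c of (intra_c / m - (deg_c / 2m)^2)."""
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def greedy_agglomerate(n, edges):
    """Start from n singletons; merge the best pair until no merge raises Q."""
    community = list(range(n))
    best_q = modularity(edges, community)
    while True:
        # Candidate merges: pairs of communities joined by at least one edge.
        pairs = {(min(community[u], community[v]), max(community[u], community[v]))
                 for u, v in edges if community[u] != community[v]}
        best = None
        for a, b in sorted(pairs):
            trial = [a if c == b else c for c in community]
            q = modularity(edges, trial)
            if q > best_q + 1e-12:        # keep the largest improvement so far
                best_q, best = q, trial
        if best is None:
            return community, best_q
        community = best
```

On two triangles joined by a single edge, the loop stops with one community per triangle and Q = 5/14. Merges whose communities share no vertices or neighbors are the kind that could be applied simultaneously in a parallel version.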

Graph Analysis with cache-based multicore systems • SNAP: Small-world Network Analysis and Partitioning. • 10-100x faster than competing graph analysis software. – Parallelism, heuristics exploiting graph topology. • Can process graphs with billions of vertices and edges. • Open-source: http://snap-graph.sf.net • Optimizations for multicore systems: improving cache locality, minimizing synchronization overhead with atomics. D. A. Bader and K. Madduri, IPDPS 2008, Parallel Computing 2008.

Designing fast parallel graph algorithms • System requirements: high (on-chip memory, DRAM, network, I/O) bandwidth. • Solution: efficiently utilize available memory bandwidth. • Algorithmic innovation to avoid corner cases. [Figure: problem complexity (# of passes over data: constant, log n, ~n) and locality (“Stream”-like to “RandomAccess”-like) plotted against data size (n: # of vertices/edges; 10^4, 10^6, 10^8, 10^12, Peta+); arrows labeled “Improve locality”, “Data reduction/compression”, “Faster methods”.]

Review of lecture • Applications: Internet and WWW, Scientific computing, Data analysis, Surveillance • Earlier work on parallel graph algorithms – PRAM algorithms, list ranking, connected components – graph representations • Parallel algorithm case studies – BFS: locality, level-synchronous approach, multicore tuning – Shortest paths: exploiting concurrency, parallel priority queue – Betweenness centrality: importance of locality, atomics

Future Research Challenges • New methods/analytics: modeling network dynamics; persistent monitoring of dynamically changing properties. • Software: portable, high-performance, extensible routines. • Emerging systems: how do we best utilize GPUs, flash disks? [Diagram labels: BIG DATA; Analytics, Summarization, Visualization; Modeling & Simulation]

Thank you! • Questions?