OPAL: Open Source Parallel Algorithm Library
Designing High-Performance Algorithms for SMP Clusters
David A. Bader
Electrical & Computer Engineering Department
Albuquerque High Performance Computing Center
University of New Mexico
dbader@eece.unm.edu
http://hpc.eece.unm.edu/
High Performance Algorithms for SMP Clusters, Prof. David A. Bader 15 August 2000

High-Performance Applications using SMP Clusters
• Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)
• Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles
• Computational Bioinformatics: Large Scale Phylogeny Reconstruction

Research Collaborators
• Joseph JáJá, University of Maryland
• Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
• Bruce Milne, Biology (Landscape Ecology), University of New Mexico
• Tandy Warnow, CS, University of Texas-Austin
• IBM ACTC Group (David Klepacki, John Levesque, and others)
• Current Graduate Students:
  • Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
• Laboratory Alumni:
  • Kavita Balakavi (Intel), Ajith Illendula (Intel)

Acknowledgment of Support
• NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668
• NSF BIO Division of Environmental Biology DEB 9910123
• Department of Energy Sandia-University New Assistant Professorship Program (SUNAPP) Award AX-3006
• IBM SUR Grant (UNM Vista-Azul Project)
• NPACI/SDSC and NCSA/Alliance
• NSF 00-* Algorithms for Irregular Discrete Computations on SMPs

Outline
• Motivation
• SMP Cluster Programming (SIMPLE)
• Complexity model
  • Message-Passing
  • Shared-Memory
• OPAL Facets (parallel libraries)
• OPAL Setting (programming framework)
• Example SMP Algorithms

Motivation
• High performance computing has been leveraging COTS workstation technologies
  • Commodity microprocessors
  • High-performance networks
  • Operating system and compiler technology
• Symmetric multiprocessor (SMP)
  • Hardware support for hierarchical memory management
  • Multithreaded operating system kernels
  • Optimizing compilers and runtime systems

SMP Cluster Architectures
• IBM SP (NPACI Blue Horizon 144 x 8)
• Linux Clusters
• Compaq AlphaServers (PSC/NSF Terascale 682 x 4)
• Sun Ultra HPC (4 x 64)
• UNM/Alliance Roadrunner Linux SuperCluster (64 x 2)
• LLNL ASCI White IBM SP (512 x 16)
• UNM/Alliance LosLobos IBM Netfinity (256 x 2)

Message-Passing Performance

Shared-Memory Performance
• One Sun HPC E10K processor
• Contiguous array; each element read exactly once
• C, X = cyclic read (stride X) of contiguous array
• R = random access of array

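The access patterns named on this slide can be sketched as index orderings. This is a minimal illustration, not the benchmark behind the slide: the function names are hypothetical, and the reading of "cyclic read (stride X)" as "visit 0, X, 2X, ..., then 1, 1+X, ..." is an assumption. In every ordering each element is read exactly once, so only the order of the reads differs.

```python
import random

def contiguous_order(n):
    """Visit indices 0, 1, ..., n-1 in storage order."""
    return list(range(n))

def cyclic_order(n, stride):
    """Visit 0, stride, 2*stride, ..., then 1, 1+stride, ...
    Every index in 0..n-1 appears exactly once."""
    return [i for start in range(stride) for i in range(start, n, stride)]

def random_order(n, seed=0):
    """Visit every index exactly once, in a pseudo-random order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def read_all(array, order):
    """Read each element once, in the given order, and return the sum."""
    total = 0
    for i in order:
        total += array[i]
    return total
```

For example, `cyclic_order(8, 4)` yields the order 0, 4, 1, 5, 2, 6, 3, 7. Timing `read_all` under the three orderings on a large array is one way to see cache effects, although an interpreted language will blur them considerably.
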
High Performance Algorithms for SMP Clusters
• “SIMPLE” Model
• Use a hybrid, natural combination of message-passing and shared-memory
• Message passing interface between nodes
• Shared-memory programming (OpenMP, POSIX Threads) on each SMP node
• Methodology for adapting message-passing algorithms for SMP Clusters
• Freely-available open source implementation of parallel algorithms, libraries, and programming environment, for C/C++/Fortran under the GNU General Public License (GPL)

Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)
• Similar Single-Program Multiple-Data (SPMD) paradigm
• Replace multiple MPI tasks per node with a single task and multiple shared-memory threads
• Parallelize sequential work into equivalent shared-memory algorithms
• Replace MPI communication primitives with corresponding “SIMPLE” primitives

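The two-level structure described above (threads inside a node, message passing between nodes) can be sketched with a sum reduction. This is a toy simulation under stated assumptions: the names `node_reduce` and `cluster_reduce` are hypothetical and are not SIMPLE primitives, each "node" is just a Python list, and the inter-node step is an ordinary loop standing in for a message-passing reduction.

```python
from concurrent.futures import ThreadPoolExecutor

def node_reduce(local_data, num_threads):
    """Shared-memory step: the threads of one node sum disjoint slices
    of the node's data, then the partial sums are combined."""
    chunk = (len(local_data) + num_threads - 1) // num_threads
    slices = [local_data[i * chunk:(i + 1) * chunk] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)

def cluster_reduce(data_per_node, threads_per_node):
    """Inter-node step: exactly one value per node is combined.
    In a real program this is where a message-passing reduction
    between nodes would take place (simulated here by a loop)."""
    node_totals = [node_reduce(d, threads_per_node) for d in data_per_node]
    return sum(node_totals)
```

The point of the sketch is the shape of the computation: one task per node contributes a single value to the inter-node step, instead of one value per processor.
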
Portability: Access from User Space

Parallel Complexity Models

SIMPLE Complexity Model
Message Passing Primitives

Comparison of PRAM to SMP
• PRAM (theory)
  • O(n) processors
  • Global clock
  • Synchronous shared-memory
  • Unit cost for computation or memory access
  • Ideal Read/Write models (EREW, CRCW)
• SMP (practice)
  • “P” processors (2 to 64)
  • Asynchronous lock-step operation
  • Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
  • Cache-coherency at external caches
  • Contention for shared memory

OPAL Complexity Model
• SMP Complexity model motivated by Helman and JáJá, Ramachandran
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.

OPAL Facets
• Common Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce
• Techniques
  • Pointer-jumping
  • Balanced Trees (Prefix-Sums)
  • Symmetry Breaking (3-Coloring)
  • Parallel Prefix (List Ranking)
• Graph Algorithms
  • Spanning Tree
  • Euler Tour
  • Tree Functions
  • Ear Decomposition
• Combinatorics
  • Sorting
  • Selection
• Bioinformatics
  • (Minimum Evolution) Phylogeny Trees
  • Computational Genomics: Breakpoints, Inversions, Translocations

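As an illustration of the pointer-jumping technique listed above, the following sketch ranks the nodes of a linked list (distance of each node to the tail). It is a sequential simulation of the synchronous parallel rounds, not code from OPAL; the function name `list_rank` and the convention that the tail points to itself are assumptions made for the example.

```python
def list_rank(successor):
    """successor[i] is the index of the node after i; the tail points to itself.
    Returns rank[i] = number of links from node i to the tail."""
    n = len(successor)
    nxt = list(successor)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # After k rounds each pointer spans up to 2**k links, so about
    # log2(n) rounds are enough for every pointer to reach the tail.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # In a parallel setting every i is updated simultaneously;
        # building fresh lists simulates that synchronous step.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank
```

With `successor = [3, 1, 0, 1]` (the list 2, 0, 3, 1 with node 1 as tail), the ranks come out as `[2, 0, 3, 1]` after two rounds.
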
SMP Complexity Model
• SMP Node Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce
  • Etc.
• SMP Complexity model motivated by Helman and JáJá
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.

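To make the node-level primitives concrete, here is a minimal sketch of an allreduce among the threads of one node, built from a shared array and a barrier. It uses Python's standard `threading` module purely for illustration; the function name `run_allreduce` is hypothetical and this is not the library's own interface.

```python
import threading

def run_allreduce(values):
    """Start one thread per input value. Each thread writes its value to a
    shared slot, waits at a barrier, then reads all slots and sums them,
    so every thread ends up holding the same total."""
    p = len(values)
    barrier = threading.Barrier(p)
    shared = [0] * p          # one slot per thread (shared memory)
    results = [None] * p      # what each thread computed

    def worker(tid):
        shared[tid] = values[tid]     # Write
        barrier.wait()                # Barrier: all writes are done
        results[tid] = sum(shared)    # Read + Reduce, done by every thread

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_allreduce([1, 2, 3, 4])` returns `[10, 10, 10, 10]`. Without the barrier a thread could read slots that other threads have not yet written.
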
OPAL Setting: Programming Environment

Local Context Parameters for Each Thread

Control Primitives

Memory Management Primitives

Example Application: Radixsort
• Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
• Decompose b-bit keys into fixed-width digits
• Perform b / (digit width) passes of counting sort on the digits, from least significant (LSD) to most significant (MSD)
• Counting Sort
  • Compute histogram of local keys
  • Communicate: Alltoall primitive of histograms
  • Locally compute prefix-sums of histograms
  • Communicate: (Inverse) Alltoall of prefix-sums
  • Rank each local element
  • Perform a personalized communication (1-relation) rearranging elements into sorted order

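The counting-sort steps above can be sketched as a single-process simulation in which each "node" is a list of keys. This is an illustration under stated assumptions, not the implementation measured in the talk: keys are non-negative integers, the names `counting_sort_pass`, `radix_sort`, `key_bits` and `digit_bits` are hypothetical, and the Alltoall exchanges are replaced by direct access to every node's histogram.

```python
def counting_sort_pass(blocks, shift, digit_bits):
    """One stable counting-sort pass on the digit starting at bit `shift`.
    blocks[i] holds the keys currently stored on simulated node i."""
    p = len(blocks)
    radix = 1 << digit_bits
    mask = radix - 1

    # Step 1: each node computes a histogram of its local digits.
    hist = [[0] * radix for _ in range(p)]
    for node, keys in enumerate(blocks):
        for k in keys:
            hist[node][(k >> shift) & mask] += 1

    # Steps 2-4 (Alltoall, prefix sums, inverse Alltoall), simulated:
    # offset[node][d] = keys with a smaller digit anywhere, plus keys
    # with digit d on lower-numbered nodes. This keeps the sort stable.
    offset = [[0] * radix for _ in range(p)]
    running = 0
    for d in range(radix):
        for node in range(p):
            offset[node][d] = running
            running += hist[node][d]

    # Steps 5-6: rank each local element and route it to its global slot.
    out = [None] * running
    for node, keys in enumerate(blocks):
        for k in keys:
            d = (k >> shift) & mask
            out[offset[node][d]] = k
            offset[node][d] += 1

    # Cut the global order back into blocks of the original sizes.
    result, pos = [], 0
    for keys in blocks:
        result.append(out[pos:pos + len(keys)])
        pos += len(keys)
    return result

def radix_sort(blocks, key_bits, digit_bits):
    """Sort keys of at most key_bits bits, least significant digit first."""
    for shift in range(0, key_bits, digit_bits):
        blocks = counting_sort_pass(blocks, shift, digit_bits)
    return blocks
```

For instance, `radix_sort([[170, 45], [75, 90], [2, 802], [24, 66]], 10, 2)` makes five passes over 2-bit digits and leaves the keys in globally sorted order, two per node.
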
Execution Time of Radix Sort on an SMP Cluster

SMP Example: Ear Decomposition
• Ear decomposition
  • Partitions the edges of a graph, useful in parallel processing
  • “Like peeling the layers of an onion”
• Applied to scientific computing problems
  • Computational mechanics (structural rigidity)
  • Computational biology (molecular structure, atoms in DNA chains)
  • Computational fluid dynamics
• Similar to other parallel algorithms for combinatorial problems
  • Trivial and fast sequential algorithm
  • Efficient PRAM algorithm
  • But no known practical, parallel algorithm

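To show what a simple sequential algorithm for this problem can look like, the sketch below uses a depth-first-search chain decomposition (the approach usually attributed to J. M. Schmidt). It is a swapped-in textbook technique, not necessarily the sequential algorithm the talk refers to. It assumes a connected, simple, undirected graph given as an adjacency dictionary; for a 2-edge-connected graph the chains form an ear decomposition whose first chain is a cycle. The name `chain_decomposition` is hypothetical.

```python
def chain_decomposition(adj, root=0):
    """adj maps each vertex to a list of neighbours (connected simple graph).
    Returns a list of chains, each given as a list of vertices."""
    # Iterative DFS recording the parent and the discovery order.
    parent = {root: None}
    order = [root]
    stack = [(root, iter(adj[root]))]
    while stack:
        v, neighbours = stack[-1]
        for w in neighbours:
            if w not in parent:
                parent[w] = v
                order.append(w)
                stack.append((w, iter(adj[w])))
                break
        else:
            stack.pop()
    index = {v: i for i, v in enumerate(order)}

    visited = set()
    chains = []
    for v in order:
        for w in adj[v]:
            # Non-tree edge whose ancestor endpoint is v.
            if index[w] > index[v] and parent[w] != v:
                visited.add(v)
                chain = [v, w]
                # Climb tree edges until reaching a visited vertex.
                while w not in visited:
                    visited.add(w)
                    w = parent[w]
                    chain.append(w)
                chains.append(chain)
    return chains
```

On a square 0-1-2-3-0 with the chord 0-2, given as `{0: [1, 2, 3], 1: [0, 2], 2: [1, 0, 3], 3: [2, 0]}`, the result is the cycle `[0, 2, 1, 0]` followed by the ear `[0, 3, 2]`, which together cover all five edges exactly once.
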
Ear Decomposition Example
[Figure panels: Input; Spanning Tree; Output Ears]
n = number of vertices, m = number of edges

Ear Decomposition Complexities
• Sequential Complexity:
• Message Passing:
• Shared Memory:
  • Spanning Tree
  • Ear Decomposition

Comparison of Ear Decomposition Algorithms

Performance of SMP Ear Decomposition on a Variety of Input Graphs
n = 8192

SMP Ear Decomposition Algorithms

Conclusions
• New hybrid model for SMP Clusters
• Open Source Parallel Algorithm Library (OPAL)
• High-Performance methodology
• Fastest known algorithms on SMPs and SMP clusters
• Preliminary experimental results

Future Work
• Algorithms for SMP Clusters
  • Validate complexity model
  • Identify classes of efficient algorithms
  • Library of SMP algorithms
  • Methodology for algorithm-engineering
• Clusters of Heterogeneous SMP Nodes
  • Varying node sizes
  • Nodes from different vendors & architectures
  • Hierarchical clusters of SMPs
• Scientific Applications
  • Bioinformatics and Genomics
  • Landscape Ecology and Remote Sensing
  • Computational Fluid Dynamics

- Slides: 34