Emerging Challenges & Opportunities in Parallel Computing: The Cretaceous Redux?

































Emerging Challenges & Opportunities in Parallel Computing: The Cretaceous Redux? Bruce Hendrickson, Senior Manager for Math & Computer Science, Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Computer Science Dept. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

The Relationship Between Theory & Practice in Parallel Computing: Plus a Silly Metaphor. Bruce Hendrickson, Senior Manager for Math & Computer Science, Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Computer Science Dept.

Outline
• Theory and practice in parallel computing are estranged
• Emerging applications will challenge the status quo
• Architectural changes will add further disruption
• These forces will create rich opportunities for the theory community
Parallel Computing Theory is Robust
• Theoretical foundations
  – P-Completeness [Cook '73]
  – Boolean Circuits [Borodin '77]
  – PRAMs [Fortune & Wyllie '78]
  – NC and P-Completeness [Pippenger/Cook '79]
• Technology-informed theoretical models
  – Fixed interconnection machines, e.g. hypercubes [many]
  – LogP [Culler et al. '93]
  – Bulk Synchronous Parallel [Gerbessiotis & Valiant '92]
• "Practical" ideas with strong theoretical underpinnings
  – PGAS languages [several]
  – Cilk [Leiserson's group '95]
• 21 years of SPAA, 28 years of PODC, etc.
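To make the PRAM work/depth style of analysis concrete, here is a minimal sequential simulation of a logarithmic-depth inclusive prefix sum. This sketch is illustrative and not from the talk; the function name and structure are my own.

```python
# Sequential simulation of a PRAM-style inclusive prefix sum.
# On an EREW PRAM with n processors this takes O(log n) parallel
# steps; here each pass of the while-loop stands in for one step.
def pram_prefix_sum(values):
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # One "parallel" step: every position i >= step reads the
        # value written step positions back in the previous round,
        # so all reads precede all writes (no conflicts).
        x = [x[i] + x[i - step] if i >= step else x[i]
             for i in range(n)]
        step *= 2
    return x
```

The total work is O(n log n), more than the sequential O(n); work-efficient variants exist, but this version shows the depth argument most directly.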

Sandia is a Leader in Parallel Computing
[Timeline graphic, 1987–2009, designed by Rolf Riesen (July 2005): machines include the CM-2, nCUBE-2, iPSC-860, Paragon, ASCI Red, Cplant, and Red Storm; milestones include multiple Gordon Bell Prizes and R&D 100 awards, patents in parallel software, meshing, signal processing, and data mining, world-record runs (143 and 281 GFlops), the SC96 Gold Medal in networking, the Mannheim SuParCup, the Karp Challenge, the Fernbach Award, and software such as Xyce, Trilinos, Aztec, Salvo, and Catamount.]

Theory at Sandia
• Sandia designs, procures, programs, runs & treasures big parallel computers
• Sandia has at least 200 PhDs working on parallel computing
  – Mostly physics & engineering degrees
  – But many computer scientists as well
• Very few of these practitioners could define a PRAM
  – Let alone explain NC!
• None use Cilk or UPC
• What's wrong with this picture!?

Elements of Parallel Computing Practice
• Clusters
  – "Killer micros" enable commodity-based parallel computing
  – Attractive price and price/performance
  – Stable model for algorithms & software
• MPI
  – Portable and stable programming model and language
  – Allowed for huge investment in software
• Bulk-Synchronous Parallel programming (BSP)
  – Basic approach to almost all successful MPI programs
  – Compute locally; communicate; repeat
  – Excellent match for clusters + MPI
  – Good fit for many scientific applications
• Algorithms
  – Stability of the above allows for sustained algorithmic research
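The compute-locally/communicate/repeat pattern described above has a simple cost model in Valiant's BSP formulation: one superstep costs w + h·g + l, where w is the maximum local work, h the maximum words communicated, g the per-word communication cost, and l the barrier latency. A small sketch with illustrative numbers (the function names are mine, not from the talk):

```python
def bsp_superstep_cost(w, h, g, l):
    """Valiant's BSP cost for one superstep:
    w = max local work, h = max words sent or received by any
    processor, g = per-word communication cost, l = barrier cost."""
    return w + h * g + l

def bsp_total_cost(supersteps, g, l):
    """Total cost of a BSP program given (w, h) per superstep."""
    return sum(bsp_superstep_cost(w, h, g, l) for w, h in supersteps)
```

The model makes the slide's point quantitative: as long as g and l stay modest relative to w, the cluster + MPI + BSP combination rewards exactly the coarse-grained, compute-heavy codes that dominate scientific computing.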

A Virtuous Circle… or a Vicious Noose?
[Cycle diagram: commodity clusters (architectures) → explicit message passing with MPI (programming models) → bulk synchronous parallel (software) → algorithms → back to architectures.]

[Diagram: applications built on MPI — LAMMPS, Linpack, PETSc.]

[Companion diagram: Cilk, PRAM, LogP, UPC — without a comparable base of applications.]

Existing Applications Are Evolving
• Leading-edge scientific applications increasingly include:
  – Adaptive, unstructured data structures
  – Complex, multiphysics simulations
  – Multiscale computations in space and time
  – Complex synchronizations (e.g. discrete events)
• These raise significant parallelization challenges
  – Limited by memory, not processor performance
  – Unsolved micro-load-balancing problems
  – Finite degree of coarse-grained parallelism
  – Bulk synchronous parallel not always appropriate
• These changes will stress existing approaches to parallelism

New Applications Are Emerging: e.g., Network Science
• Graphs are ideal for representing entities and relationships
• Rapidly growing use in biological, social, environmental, and other sciences
[Figures: "The way it was…" — Zachary's karate club (|V| = 34); "The way it is now…" — Twitter social network (|V| ≈ 200 M).]

Emerging New Scientific Questions
• New algorithms
  – Community detection, centrality, graph generation, etc.
  – Right set of questions and concepts still unknown
  – Statistics, machine learning, anomaly detection, etc.
• New issues
  – Noisy, error-filled data: what can we conclude robustly?
  – Temporal evolution of networks
• New science
  – Social dynamics and ties to technology & media
  – Large economic, social, political consequences
• Parallel computing needed for big data and/or fast response
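As a toy illustration of the centrality computations mentioned above, here is a stdlib-only degree-centrality sketch on a small undirected graph. The graph and function are illustrative inventions, not Zachary's actual karate-club data; real network-science codes face the scale and locality problems discussed on the next slide.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality of an undirected graph given as an edge
    list: each node's degree normalized by n - 1 possible neighbors."""
    deg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        nodes.update((u, v))
    n = len(nodes)
    return {v: deg[v] / (n - 1) for v in nodes}
```

On a hub-and-spoke toy graph, `degree_centrality([(0, 1), (0, 2), (0, 3), (1, 2)])` ranks node 0 highest; betweenness or eigenvector centrality would require substantially more computation per vertex, which is where parallelism becomes necessary at Twitter scale.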

Computational Challenges for Network Science
• Minimal computation to hide access time
• Runtime is dominated by latency
  – Random accesses to global address space
  – Parallelism is very fine grained and dynamic
• Access pattern is data dependent
  – Prefetching unlikely to help
  – Usually only want small part of cache line
• Potentially abysmal locality at all levels of memory hierarchy
• Many algorithms are not bulk synchronous
• Approaches based on the virtuous circle don't work!

Locality Challenges
[Chart comparing locality of "what we traditionally care about," emerging codes, and "what industry cares about." From: Murphy and Kogge, "On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Trans. on Computers, July 2007.]

A Renaissance in Architecture Research
• Good news
  – Moore's Law marches on
  – Real estate on a chip is essentially free
  – Major paradigm change — huge opportunity for innovation
• Bad news
  – Power considerations limit the improvement in clock speed
  – Parallelism is the only viable route to improve performance
• Current response: multicore processors
  – Computation/communication ratio will get worse
  – Makes life harder for applications
  – Long-term consequences unclear

Example: AMD Opteron
[Annotated die photo, built up over several slides (thanks to Thomas Sterling): latency-avoidance structures (L1 D-cache, L1 I-cache, L2 cache); latency-tolerance structures (load/store units, out-of-order execution, memory/coherency logic); memory and I/O interfaces (memory controller, DDR, HyperTransport, bus, instruction fetch/scan/align); and finally the parts that actually compute (FPU and integer execution units) — a small fraction of the die.]

Architectural Wish List for Graphs
• Low latency / high bandwidth
  – For small messages!
• Latency tolerant
• Light-weight synchronization mechanisms for fine-grained parallelism
• Global address space
  – No graph partitioning required
  – Avoid memory-consuming profusion of ghost nodes
  – No local/global numbering conversions
• One machine with these properties is the Tera MTA-2
  – And its successor, the Cray XMT

How Does the MTA/XMT Work?
• Latency tolerance via massive multi-threading
  – Context switch every tick
  – Global address space, hashed to reduce hot-spots
  – No cache or local memory
  – Multiple outstanding loads
• Remote memory request doesn't stall processor
  – Other streams work while your request gets fulfilled
• Light-weight, word-level synchronization
  – Minimizes conflicts, enables parallelism
• Flexible dynamic load balancing
  – Thread virtualization
  – Futures
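The arithmetic behind this latency-tolerance strategy is essentially Little's law: to keep a processor busy, the number of concurrent hardware streams must be at least the memory latency divided by the work issued between memory references. A back-of-envelope sketch with illustrative cycle counts (the numbers and function name are assumptions, not MTA specifications):

```python
import math

def streams_needed(memory_latency_cycles, cycles_between_refs):
    """Little's law for latency hiding: concurrency required =
    latency * reference rate. With one memory reference every
    `cycles_between_refs` cycles and `memory_latency_cycles` of
    latency per reference, this many streams keep the pipe full."""
    return math.ceil(memory_latency_cycles / cycles_between_refs)
```

With, say, 150 cycles of memory latency and a reference every 2 cycles, roughly 75 ready streams per processor are needed — which is why the MTA provides over a hundred hardware contexts rather than caches.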

Case Study: Single-Source Shortest Path (SSSP)
• Parallel Boost Graph Library (PBGL)
  – Lumsdaine et al., on Opteron cluster
  – Some graph algorithms can scale on some inputs
• PBGL–MTA comparison on SSSP
  – Erdős–Rényi random graph (|V| = 2^28)
  – PBGL SSSP can scale on non-power-law graphs
  – Order of magnitude speed difference
  – 2 orders of magnitude efficiency difference
• Big difference in power consumption
• [Lumsdaine, Gregor, Hendrickson, Berry, 2007]
[Chart: PBGL SSSP vs. MTA SSSP, time (s) against # processors.]
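For reference, the sequential baseline for SSSP on non-negative weights is Dijkstra's algorithm; a minimal stdlib sketch on a toy graph follows. The parallel PBGL and MTA implementations compared above use parallel variants (e.g. delta-stepping-style relaxations), which this sequential sketch does not attempt to show.

```python
import heapq

def sssp(graph, source):
    """Dijkstra's single-source shortest paths.
    graph: {u: [(v, weight), ...]} with non-negative weights.
    Returns {vertex: distance} for vertices reachable from source."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The priority queue makes the control flow inherently sequential and the memory accesses data-dependent — precisely the fine-grained, low-locality behavior that frustrates cluster-style parallelization and favors the MTA's design.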


[Diagram: New Apps, Disruptive Architectures, Multicore — forces pressing on the virtuous circle.]



What Happens Next?
• Virtuous circle will not survive the coming disruptions
• New programming models, languages, algorithms and abstractions will be needed
• But MPI cannot die
  – Billions of dollars of investment in software
  – "I don't know what the parallel programming language of the future will look like, but I know it will be called MPI"
• Luckily, theory is forever…

Rebuilding the Foundations
• Applied parallel computing will need new ideas to continue moving forward
• Ideas and tools from the theory community can:
  – Provide abstractions to manage hardware complexity
  – Underlie robust algorithm development and analysis
  – Suggest new programming models and abstractions
  – Point towards new architectural features
  – Support efficient utilization of resources
  – Provide underpinnings for the future of applied parallel computing


Conclusions
• Applied parallel computing is facing unprecedented challenges
  – Multicore processors
  – Disruptive architectural innovations
  – Demands of emerging applications
• Theory can provide reliable light in the coming darkness
  – Theoretical insights are resilient to technology changes
• Theory community will have new opportunities
  – Provide robust foundation for future progress
  – Become central to applied parallel computing
• This is a great time to be doing parallel computing!

Thanks • Cevdet Aykanat, Michael Bender, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalyürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Vitus Leung, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.