
Parallel Hardware, Parallel Applications, Parallel Software:
Programming Models for Manycore Systems
Kathy Yelick, U.C. Berkeley

Par Lab Research Overview
Goal: make it easy to write correct programs that run efficiently on manycore.
(figure: the Par Lab stack, layered from applications down to architecture)
• Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
• Motifs/Dwarfs
• Productivity layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Sketching, Static Verification, Type Systems
• Efficiency layer: Efficiency Languages, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers, Legacy Code, Directed Testing, Dynamic Checking, Debugging with Replay
• OS: OS Libraries & Services, Legacy OS, Hypervisor
• Architecture: Multicore/GPGPU, RAMP Manycore
• Correctness efforts cut across all layers

Applications: What Are the Problems?
"Who needs 100 cores to run M/S Word?"
– We need compelling apps that use 100s of cores.
How did we pick applications?
1. An enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
3. Requires significant speed-up, or a smaller, more efficient platform, to work as intended
4. As a whole, the applications cover the most important
– Platforms (handheld, laptop, games)
– Markets (consumer, business, health)

Compelling Laptop/Handheld Apps (David Wessel)
• Musicians have an insatiable appetite for computation
– More channels, instruments, more processing, more interaction!
– Latency must be low (5 ms)
– Must be reliable (no clicks)
• Music Enhancer
– Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
– Laptop/Handheld recreates 3D sound over earbuds
• Hearing Augmenter
– Handheld as an accelerator for a hearing aid
• Novel Instrument User Interface
– New composition and performance systems beyond keyboards
– Input device for Laptop/Handheld
The Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.

Content-Based Image Retrieval (Kurt Keutzer)
Pipeline: query by example → similarity metric over an image database (1000s of images) → candidate results → relevance feedback → final result.
• Built around key characteristics of personal databases:
– Very large number of pictures (>5K)
– Non-labeled images
– Many pictures of few people
– Complex pictures including people, events, places, and objects

Coronary Artery Disease (Tony Keaveny)
• 450K deaths/year, 16M with symptoms, 72M with high blood pressure
• Modeling to help patient compliance? (figure: before/after simulation)
• Massively parallel, real-time variations
– CFD: FE solid (non-linear), fluid (Newtonian), pulsatile
– Varying blood pressure, activity, habitus, cholesterol

Meeting Diarist and Teleconference Aid (Nelson Morgan)
• Meeting Diarist
– Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
• Teleconference speaker identifier, speech helper
– Laptops/handhelds used for a teleconference; identifies who is speaking and gives a "closed caption" hint of what is being said

Parallel Browser
• Goal: desktop-quality browsing on handhelds
– Enabled by 4G networks and better output devices
• Bottlenecks to parallelize
– Parsing, rendering, scripting
• "SkipJax"
– A parallel replacement for JavaScript/AJAX
– Based on Brown's FlapJax

Broader Coverage of Applications through "Motifs"
How do we invent the parallel systems of the future while tied to the old code, programming models, and CPUs of the past? Look for common computational patterns across domains:
1. Embedded computing (42 EEMBC benchmarks)
2. Desktop/server computing (28 SPEC 2006 benchmarks)
3. Database / text mining software
4. Games/graphics/vision
5. Machine learning
6. High performance computing (the original "7 Dwarfs")
Result: 13 "Dwarfs" (should we say "motifs" instead, now that we have gone from 7 to 13?)

"Motif/Dwarf" Popularity (red = hot, blue = cool)
• How do the compelling apps relate to the 13 motifs/dwarfs? (figure: color-coded chart of motif usage across the applications)

Roles of Motifs/Dwarfs
1. "Anti-benchmarks": motifs are not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. A universal, understandable vocabulary, at least at a high level, for talking across disciplinary boundaries
3. Bootstrapping: parallelize parallel research, allowing analysis of HW & SW designs without waiting years for full apps
4. Targets for libraries

Par Lab Research Overview (revisited)
Easy to write correct programs that run efficiently on manycore. (Same layered-stack figure as before: applications, motifs, productivity layer, efficiency layer, OS, and architecture, with correctness cross-cutting.)

Developing Parallel Software
• Two types of programmers, hence two layers
• Efficiency layer (10% of today's programmers)
– Expert programmers build frameworks & libraries, hypervisors, …
– "Bare metal" efficiency is possible at this layer
• Productivity layer (90% of today's programmers)
– Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
– Frameworks & libraries are composed to form app frameworks
• Effective composition techniques let a few efficiency programmers be highly leveraged
• Create a language for Composition and Coordination (C&C)

Composition to Build Applications
Serial code and parallel code compose in two ways:
– Libraries: serial code invokes libraries with internal parallelism, e.g., a matrix library
– Frameworks: parallel patterns with serial plug-ins, e.g., a stencil framework (see the sketch below)
• Composition is hierarchical
• Interfaces are (mostly) serial
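To make the "framework with serial plug-ins" idea concrete, here is a minimal Python sketch (ours, not Par Lab code): the framework owns the parallel loop over rows, and the user supplies a purely serial per-point kernel. The names stencil_map and five_point are illustrative.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def stencil_map(grid, kernel, workers=4):
        """Framework side: parallel loop over interior rows; rows are disjoint."""
        out = grid.copy()

        def do_row(i):
            for j in range(1, grid.shape[1] - 1):
                out[i, j] = kernel(grid, i, j)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(do_row, range(1, grid.shape[0] - 1)))
        return out

    # Plug-in side: a serial 5-point average, written with no parallelism in it.
    def five_point(g, i, j):
        return 0.25 * (g[i-1, j] + g[i+1, j] + g[i, j-1] + g[i, j+1])

    result = stencil_map(np.random.rand(64, 64), five_point)

Python's GIL means the threads only illustrate the structure; the point is that the serial plug-in cannot break the framework's parallelism.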

Composition Is Key to Software Reuse
• Solutions exist for libraries with hidden parallelism:
– Partitions in the OS help runtime composition
• Instantiating parallel frameworks is harder:
– The framework specifies the required independence
• E.g., operations in map or divide-and-conquer must not interfere
• Independence is guaranteed through types
– Type system extensions (side effects, ownership)
• E.g., an annotation such as: Image READ Array[double]
– Data decomposition may be implicit or explicit (see the sketch below):
• Partition: Array[T] → List[Array[T]]
• Partition: Graph[T] → List[Graph[T]] (well understood)
– Efficiency-layer code carries these specifications at its interfaces, where they are verified, tested, or asserted
• Independence is proven by checking side effects and overlap at instantiation
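A rough sketch of the Partition idea in Python terms (illustrative names, not C&CL syntax): the partition yields non-overlapping views, so per-piece tasks provably cannot interfere, which is exactly the independence checked at instantiation.

    import numpy as np

    def partition(arr, n):
        """Partition: Array[T] -> List[Array[T]]; disjoint views, no copies."""
        return np.array_split(arr, n)

    data = np.arange(16)
    pieces = partition(data, 4)

    # "Checked at instantiation": the pieces tile the array with no overlap.
    assert sum(len(p) for p in pieces) == len(data)

    for k, piece in enumerate(pieces):
        piece += k        # each task mutates only its own piece
    print(data)           # all writes landed in disjoint regions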

Coordination Is Used to Create Parallelism
• Support the parallelism patterns of applications
– Data parallelism (degree = data size, not core count), e.g., DCT
• May be nested: forall images, forall blocks in an image
– Divide-and-conquer (parallelism from recursion), e.g., Delaunay
– Event-driven: nondeterminism at the algorithm level, e.g., the Branch & Bound dwarf
• Serial semantics with limited nondeterminism
• Choose the style that fits your domain (see the divide-and-conquer sketch below):
– Data parallelism comes from array/aggregate operations and loops without side effects
– Divide-and-conquer parallelism comes from recursive functions with non-overlapping side effects
– Event-driven programs are written as guarded atomic commands, which may be implemented as transactions
• Discovered parallelism is mapped to available resources, e.g., in sparse LU
– Techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
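As a hedged illustration of "parallelism from recursion", here is a divide-and-conquer sum in Python whose recursive calls touch disjoint halves and so may run as independent tasks; the cutoff keeps the task tree shallow, since a thread pool can deadlock on deeply blocking recursion.

    from concurrent.futures import ThreadPoolExecutor

    def dc_sum(xs, lo, hi, pool, cutoff):
        if hi - lo <= cutoff:
            return sum(xs[lo:hi])                  # serial base case
        mid = (lo + hi) // 2
        # The two halves do not overlap, so they are independent tasks.
        left = pool.submit(dc_sum, xs, lo, mid, pool, cutoff)
        right = dc_sum(xs, mid, hi, pool, cutoff)  # compute one half inline
        return left.result() + right

    data = list(range(1_000_000))
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(dc_sum(data, 0, len(data), pool, cutoff=len(data) // 4))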

C&C Language Strategy
• Application-driven: domain-specific languages
– Ensure usefulness for at least one application
• Music language
• Image framework
• Browser language
• Health application language
• Bottom-up implementation strategy
– Ensure the language is efficiently implementable
– "Grow" a language from one that is efficient but not productive by adding abstraction levels
• Identify common features across DSLs
– Cross-language meetings/discussions

Coordination & Composition in the CBIR Application
• Parallelism in CBIR is hierarchical (see the sketch below)
• Mostly independent tasks/data, with reductions
– Stream parallel over the input stream of images
– Task parallel over the extraction algorithms (Face Recognition, DCT, DWT, …), producing an output stream of feature vectors
– Inside the DCT extractor: a data-parallel map of the DCT over tiles, then a reduction on the histograms from each tile, yielding one histogram (feature vector)
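A toy Python rendering of that hierarchy (extractor names and tile counts are stand-ins, not the CBIR code): stream parallel over images, task parallel over extractors, and, inside one extractor, a data-parallel map over tiles followed by a histogram reduction.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def dct_like_feature(image, tiles=8):
        chunks = np.array_split(image, tiles)                                # map over tiles
        hists = [np.histogram(c, bins=16, range=(0, 1))[0] for c in chunks]
        return np.sum(hists, axis=0)                                         # reduce histograms

    def dwt_like_feature(image):
        return np.histogram(np.abs(np.diff(image)), bins=16)[0]

    EXTRACTORS = [dct_like_feature, dwt_like_feature]   # task parallel over algorithms

    def feature_vector(image, pool):
        futures = [pool.submit(e, image) for e in EXTRACTORS]
        return [f.result() for f in futures]

    images = [np.random.rand(256, 256) for _ in range(4)]   # the image stream
    with ThreadPoolExecutor() as pool:
        vectors = [feature_vector(img, pool) for img in images]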

Using Map Reduce for Image Retrieval
• "Map Reduce" can mean various things
• To us, it means
– A map stage, where threads compute independently
– A reduce stage, where the results of the map stage are summarized
• It is a pattern of computation and communication
– Not an implementation involving key/value pairs, parallel I/O, …
• We consider Map Reduce computations where (see the sketch below):
– A map function produces a set of outputs
– Each of a set of reduce functions, gated by per-element predicates, produces a set of outputs
Work by B. Catanzaro, N. Sundaram, and K. Keutzer
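A minimal sketch of exactly that formulation (our illustration, not the Catanzaro/Sundaram/Keutzer framework): one map function produces outputs, then each reduce function sees only the elements its predicate admits.

    def map_reduce(items, map_fn, reduces):
        mapped = [y for x in items for y in map_fn(x)]     # map stage
        results = []
        for predicate, reduce_fn, init in reduces:         # gated reduce stage
            acc = init
            for y in mapped:
                if predicate(y):
                    acc = reduce_fn(acc, y)
            results.append(acc)
        return results

    # Square each input; sum the even squares, count the odd ones.
    print(map_reduce(
        range(10),
        lambda x: [x * x],
        [(lambda y: y % 2 == 0, lambda a, y: a + y, 0),
         (lambda y: y % 2 == 1, lambda a, y: a + 1, 0)]))   # -> [120, 5]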

SVM Classification Results
• Average ~100x speedup (180x max)
• The Map Reduce framework reduced kernel LOC by 64%
Work by B. Catanzaro, N. Sundaram, and K. Keutzer

C&C Language for Health (and the applications to go with it)
• The personalized medicine application has large amounts of data parallelism
– Irregular data structures / access: sparse matrices, particles
– But most of the code can be expressed in a data-parallel way, meaning serial semantics
– Note that parallelism over data is essential at O(100) cores
• Composition across languages is still key
– Calls to optimized (not data-parallel) libraries
– Supported by static analysis for phase-based computations

Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global
(figure: per-process private stacks x, y and pointers l, g into a shared heap across processes p0 … pn)
By default:
• Object heaps are shared
• Program stacks are private
• 3 current languages: UPC, CAF, and Titanium
– All three use an SPMD execution model
– Designed for large-scale machines (clusters) and scientific computing
• 3 emerging languages: X10, Fortress, and Chapel
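As a loose analogy only (Python multiprocessing, not UPC/CAF/Titanium semantics): a shared "global" array that any process may read or write, next to per-process private variables, in SPMD style where every process runs the same worker with a different rank.

    from multiprocessing import Process, shared_memory
    import numpy as np

    def worker(shm_name, rank):
        shm = shared_memory.SharedMemory(name=shm_name)        # attach to the shared heap
        g = np.ndarray((4,), dtype=np.int64, buffer=shm.buf)   # "global" data
        x = rank * 10                                          # "private" stack data
        g[rank] = x        # any process may write data another process allocated
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=4 * 8)
        procs = [Process(target=worker, args=(shm.name, r)) for r in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()
        print(np.ndarray((4,), dtype=np.int64, buffer=shm.buf))  # [ 0 10 20 30]
        shm.close(); shm.unlink()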

Arrays in a Global Address Space
• Key features of Titanium arrays
– Generality: indices may start/end at any point
– A domain calculus allows slicing, subarrays, transpose, and other operations without data copies
• Use the domain calculus to identify ghost cells and iterate over the interior:
foreach (p in gridA.shrink(1).domain()) ...
• Array copies automatically work on the intersection of the two domains:
gridB.copy(gridA.shrink(1));
(figure: gridA and gridB with ghost cells; the copied area is the intersection of gridB with gridA's "restricted" non-ghost cells)
Joint work with the Titanium group
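The same shrink-and-copy idiom in NumPy terms, as a hedged sketch of what the two Titanium lines above do: copy gridA's interior into gridB on the intersection of their domains, leaving gridB's ghost cells untouched.

    import numpy as np

    gridA = np.arange(36, dtype=float).reshape(6, 6)
    gridB = np.zeros((6, 6))

    interior = (slice(1, -1), slice(1, -1))   # shrink(1): drop one ghost layer
    gridB[interior] = gridA[interior]         # copy only on the intersection
    # gridB's border (its ghost cells) is left for a later exchange step.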

Language Support Helps Productivity
C++/Fortran/MPI AMR (the Chombo package from LBNL)
• Bulk-synchronous communication:
– Pack boundary data between processors
– All optimizations done by the programmer
Titanium AMR
• Entirely in Titanium
• Finer-grained communication
– No explicit pack/unpack code
– Automated in the runtime system
• General approach
– The language allows programmer optimizations
– The compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

Particle/Mesh Method: Heart Simulation
• Elastic structures in an incompressible fluid
– Blood flow, clotting, inner ear, embryo growth, …
(figure: 2D Dirac delta function coupling particles and mesh)
• Complicated parallelization
– A particle/mesh method, but the "particles" are connected into materials (1D or 2D structures)
– Communication patterns are irregular between particles (structures) and mesh (fluid)
Code size in lines: Fortran 8000, Titanium 4000 (note: the Fortran code is not parallel)
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, and Dave McQueen

Titanium Experience: Composition
• Data parallelism could have been used
– Parallel over n (the data), rather than p (the processors)
– The compiler can generate SPMD code
– Most code could be written as pure data parallelism (serial semantics) and translated [Su]
• Can we "mix" data parallelism with other parallelism?
– Compiler analysis makes this possible
– Barriers are restricted: all threads must reach the same barrier (proven by "single" analysis [Gay and Aiken])
• Single analysis identifies global execution points
– Allows global optimizations (across threads)
– Creates natural points to switch in and out of data-parallel or serial code
– These may also be points for heterogeneous processor switches in code
Joint work with the Titanium group

Efficiency Layer
Remember why we are here…

Efficiency Layer: Selective Virtualization
• The efficiency layer is an abstract machine model plus selective virtualization
• Libraries provide add-ons
– Schedulers (see the toy sketch below)
• A runtime with dynamic scheduling for dynamic task trees
• General task graphs with weights & structure
• Divide-and-conquer with task stealing
– Memory movement / sharing primitives
– Synchronization primitives
• E.g., fast barriers, atomic operations
– More on this in Krste Asanovic's talk
• The division into layers lets us explore the execution model separately from the programming model
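A toy dynamic scheduler for a task tree, in the spirit of the scheduler add-ons listed above. A real efficiency-layer runtime would use per-worker deques and task stealing; this only shows tasks spawning tasks at runtime, with an idle-timeout shutdown as a simplification.

    import queue, threading

    tasks, results, lock = queue.Queue(), [], threading.Lock()

    def count_range(lo, hi):
        if hi - lo > 1000:                     # split: the task tree grows dynamically
            mid = (lo + hi) // 2
            tasks.put((count_range, (lo, mid)))
            tasks.put((count_range, (mid, hi)))
        else:                                  # leaf: do the actual work
            with lock:
                results.append(sum(range(lo, hi)))

    def worker():
        while True:
            try:
                fn, args = tasks.get(timeout=0.1)   # idle for 0.1 s -> exit
            except queue.Empty:
                return
            fn(*args)

    tasks.put((count_range, (0, 100_000)))
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(sum(results) == sum(range(100_000)))      # True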

Synthesis
• Extensive tuning knobs at the efficiency level
– Performance feedback from hardware and OS
• Sketching: correct by construction
– Spec: a simple implementation (3-loop 3D stencil); Sketch: an optimized skeleton (5 loops, missing some indices/bounds); result: optimized code (tiled, prefetched, time-skewed)
– More on this in Ras Bodik's talk
• Autotuning: efficient by search
– Examples: spectral (FFTW, SPIRAL), dense (PHiPAC, ATLAS), sparse (OSKI), structured grids (stencils)
– Can select among algorithm/data structure changes not producible by compiler transformations

Autotuning: 21st-Century Code Generation
• Problem: generating optimal code is like searching for a needle in a haystack, and manycore makes the haystack even more diverse
• New approach: "autotuners" (see the sketch below)
– First generate program variations over combinations of optimizations (blocking, prefetching, …) and data structures
– Then compile and run them, heuristically searching for the best code for that computer
• Examples: PHiPAC (BLAS), ATLAS (BLAS), SPIRAL (DSP), FFTW (FFT), OSKI (sparse matrices)
(figure: search space over block sizes for a dense matrix; the axes are block dimensions, the color is speed; one sparse variant storing 50% more zeros ran 50% faster)
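A minimal autotuning loop in Python (our sketch; real autotuners such as PHiPAC or ATLAS generate and compile C variants): generate variants, here tile sizes for a blocked matrix multiply, time each one on this machine, and keep the best.

    import time
    import numpy as np

    def blocked_matmul(A, B, bs):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, bs):               # the "blocking" optimization knob
            for j in range(0, n, bs):
                for k in range(0, n, bs):
                    C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
        return C

    def best_time(bs, A, B, reps=3):
        times = []
        for _ in range(reps):                   # best of a few runs: less timing noise
            t0 = time.perf_counter()
            blocked_matmul(A, B, bs)
            times.append(time.perf_counter() - t0)
        return min(times)

    n = 128
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    timings = {bs: best_time(bs, A, B) for bs in (16, 32, 64, 128)}
    print("best block size:", min(timings, key=timings.get))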

LBMHD: Structured-Grid Application
• Plasma turbulence simulation
• Two distributions:
– momentum distribution (27 components)
– magnetic distribution (15 vector components)
• Three macroscopic quantities:
– density
– momentum (vector)
– magnetic field (vector)
• Must read 73 doubles and update (write) 79 doubles per point in space
• Requires about 1300 floating-point operations per point in space
• Just over 1.0 flops/byte (ideal; see the check below)
• No temporal locality between points in space within one time step
Joint work with Sam Williams, Lenny Oliker, John Shalf, and Jonathan Carter
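A quick consistency check on those figures (our arithmetic, not from the slide): 73 + 79 = 152 doubles moved per point, or 152 × 8 = 1216 bytes, so 1300 flops / 1216 bytes ≈ 1.07 flops/byte, matching the "just over 1.0" ideal above.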

Autotuned Performance (Cell/SPE version)
(figure: performance bars for Intel Clovertown, AMD Opteron, Sun Niagara 2 (Huron), and IBM Cell Blade*, with stacked optimizations: naïve+NUMA, +padding, +vectorization, +unrolling, +SW prefetching, +SIMDization; *collision() only)
• First attempt at a Cell implementation: VL, unrolling, and reordering were fixed
• Exploits DMA and double buffering to load vectors
• Went straight to SIMD intrinsics
• Despite the relative performance, Cell's double-precision implementation severely impairs performance

Productivity
• Niagara 2 required significantly less work to deliver good performance
• For LBMHD, Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance
• Virtually every optimization was required, sooner or later, on Opteron and Cell
• Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)

PGAS Languages + Autotuning for Multicore with DMA
• PGAS languages are a good fit to shared-memory machines, including multicore
– The global address space is implemented as reads/writes
– It can also be exploited on processors with explicit local stores rather than caches, e.g., Cell, GPUs, …
• Open question in architecture:
– cache-coherent shared memory, vs.
– software-controlled local memory (or a hybrid)
(figure: memory hierarchy with private on-chip memory, shared partitioned on-chip memory, and shared off-chip DRAM)

Correctness

Ensuring Correctness
• Productivity layer
– Enforce independence of tasks using decomposition (partitioning) and copying operators
– Goal: remove the chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
• Efficiency layer: check for subtle concurrency bugs (races, deadlocks, and so on)
– A mixture of verification and automated directed testing
– Error detection on frameworks, with sequential code as the specification

Software Correctness
• At the productivity layer, many concurrency errors are not permitted
• At the efficiency layer, we need more tools for correctness
– Both concurrency errors and (eventually) numerical errors
• Traditional approach to correct software: testing
– Low probability of finding an error; lots of manual effort
• Symbolic model checking
– Many recent successes in security, control systems, etc.
– Ideas from theorem proving applied to specific classes of errors
– Cannot handle libraries, complex data types, …
• Concolic testing combines concrete execution with symbolic analysis
– Uses state-of-the-art theorem proving to find inputs that reach all program paths
– Ideas applied successfully to concurrent programs
(figure: program path tree of conditional statements with T/F branches)

Why Languages at All?
• Most of the work is in the runtime and libraries
• Do we need a language? And a compiler?
– If higher-level syntax is needed for productivity, we need a language
– If static analysis is needed to help with correctness, we need a compiler (front-end)
– If static optimizations are needed to get performance, we need a compiler (back-end)
• All of these decisions will be driven by application need