Parallel Hardware — Parallel Applications — Parallel Software
The Parallel Computing Laboratory: A Research Agenda Based on the Berkeley View
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
April 28, 2008

Outline
- Overview of Par Lab
  - Motivation & Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
- Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
    - Autotuning
  - Required Functionality
  - Ease of use
- Summary of meeting goals, other talks
- Identify opportunities for collaboration


A Parallel Revolution, Ready or Not
- Old Moore’s Law is over
  - No more doubling the speed of sequential code every 18 months
- New Moore’s Law is here
  - 2× processors (“cores”) per chip every technology generation, but the same clock rate
- Sea change for the HW & SW industries, since it changes the model of programming and debugging

“Motif” Popularity (Red = Hot, Blue = Cool)
- How do compelling apps relate to the 13 motifs?
[Heat-map chart: the 13 motifs vs. application areas, colored red (hot) to blue (cool)]


Choosing Driving Applications
- “Who needs 100 cores to run M/S Word?”
  - Need compelling apps that use 100s of cores
- How did we pick applications?
  1. Enthusiastic expert application partner, leader in the field, promise to help design, use, and evaluate our technology
  2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  3. Requires significant speedup, or a smaller, more efficient platform, to work as intended
  4. As a whole, the applications cover the most important:
     - Platforms (handheld, laptop, games)
     - Markets (consumer, business, health)

Compelling Client Applications
[Diagram: image query by example against an image database of 1000’s of images; Music/Hearing; Robust Speech Input; Parallel Browser; Personal Health]

“Motif” Popularity (heat-map slide repeated): how do compelling apps relate to the 13 motifs?

Par Lab Research Overview
[Layer diagram — motto: “Easy to write correct programs that run efficiently on manycore”
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs
- Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers, Legacy Code
- Correctness (crosscutting): Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay
- OS/Arch: Legacy OS, OS Libraries & Services, Hypervisor, Multicore/GPGPU, RAMP Manycore
- Crosscutting: Diagnosing Power/Performance]


Developing Parallel Software
- 2 types of programmers → 2 layers
- Efficiency Layer (10% of today’s programmers)
  - Expert programmers build Frameworks & Libraries, Hypervisors, …
  - “Bare metal” efficiency possible at the Efficiency Layer
- Productivity Layer (90% of today’s programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged
- Create a language for Composition and Coordination (C&C)
- Talk by Kathy Yelick

Par Lab Research Overview (layer diagram repeated): easy to write correct programs that run efficiently on manycore

Outline (repeated, as a transition to Parallel Libraries)

Success Metric — Impact
- LAPACK and ScaLAPACK are widely used
  - Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, …
  - >86M web hits @ Netlib (incl. CLAPACK, LAPACK95); 35K hits/day
[Images: Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) — ScaLAPACK; Xiaoye Li: Sparse LU]

High Performance (Speed and Accuracy)
- Matching Algorithms to Architectures (8 talks)
  - Autotuning: generate fast algorithms automatically, depending on architecture and problem
  - Communication-Avoiding Linear Algebra: avoiding latency and bandwidth costs
- Faster Algorithms (2 talks)
  - Symmetric eigenproblem (O(n²) instead of O(n³))
  - Sparse LU factorization
- More Accurate Algorithms (2 talks)
  - Either at “usual” speed, or at any cost
- Structure-exploiting algorithms
  - Roots(p) (O(n²) instead of O(n³))


Automatic Performance Tuning
- Writing high-performance software is hard
- Ideal: get a high fraction of peak performance from one algorithm
- Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, …
  - The best choice can depend on knowing a lot of applied mathematics and computer science
  - It changes with each new hardware or compiler release
- Goal: Automation (a minimal sketch of the idea follows this slide)
  - Generate and search a space of algorithms
  - Past successes: PHiPAC, ATLAS, FFTW, Spiral
  - Many conferences, DOE projects, …
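
To make “generate and search” concrete, here is a minimal, self-contained sketch (an illustration, not Par Lab code): it times each candidate in a small search space — cache-blocking factors for C += A*B — and keeps the fastest on the machine at hand, the same idea PHiPAC and ATLAS apply to far larger spaces of variants.

```c
#include <stdio.h>
#include <time.h>

#define N 256   /* problem size; divisible by every candidate block size */

static double A[N][N], B[N][N], C[N][N];

/* One point in the search space: C += A*B with blocking factor b. */
static void blocked_mm(int b) {
    for (int ii = 0; ii < N; ii += b)
        for (int kk = 0; kk < N; kk += b)
            for (int jj = 0; jj < N; jj += b)
                for (int i = ii; i < ii + b; i++)
                    for (int k = kk; k < kk + b; k++)
                        for (int j = jj; j < jj + b; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    int candidates[] = {8, 16, 32, 64, 128};          /* the search space */
    int ncand = sizeof candidates / sizeof candidates[0];
    int best = candidates[0];
    double best_time = 1e30;
    for (int t = 0; t < ncand; t++) {
        clock_t start = clock();
        blocked_mm(candidates[t]);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("block = %3d: %.3f s\n", candidates[t], secs);
        if (secs < best_time) { best_time = secs; best = candidates[t]; }
    }
    printf("best block size on this machine: %d\n", best);
    return 0;
}
```

A real autotuner also varies the code itself (unrolling, prefetching, data structures), usually by generating source offline; the search loop above is just the essential shape.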

The Difficulty of Tuning SpMV (Sparse Matrix-Vector Multiply)

The kernel, as runnable C rather than the slide’s flattened pseudocode (the pseudocode assigned y[i] = t; the += below matches its own comment, y <- y + A*x):

```c
/* y <- y + A*x, i.e., for all nonzero A(i,j): y(i) += A(i,j) * x(j),
   with A stored in compressed sparse row (CSR) format. */
void spmv_csr(int n, const int *row, const int *J,
              const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {              /* for each row i */
        double t = 0.0;
        for (int k = row[i]; k < row[i+1]; k++)
            t += A[k] * x[J[k]];               /* nonzero A(i,j), j = J[k] */
        y[i] += t;
    }
}
```

- One optimization: exploit 8×8 dense blocks (a 2×2 sketch follows this slide)
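
And a sketch of the blocked variant that last bullet points to, shown with 2×2 register blocks for brevity where the slide uses 8×8 (an illustration of block CSR, assuming blocks are stored row-major and the row count is even — not code from the talk):

```c
/* y <- y + A*x with A in block CSR (BCSR), 2x2 blocks.
   nb = number of block rows; brow/bJ index block rows and block
   columns; each stored block holds 4 contiguous values.  Keeping the
   x and y entries in registers across a block is the payoff. */
void spmv_bcsr_2x2(int nb, const int *brow, const int *bJ,
                   const double *A, const double *x, double *y) {
    for (int I = 0; I < nb; I++) {           /* block row I -> rows 2I, 2I+1 */
        double y0 = 0.0, y1 = 0.0;
        for (int k = brow[I]; k < brow[I+1]; k++) {
            const double *b = &A[4 * k];     /* 2x2 block, row-major */
            double x0 = x[2 * bJ[k]], x1 = x[2 * bJ[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```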

Speedups on Itanium 2: The Need for Search
[Plot across register block sizes: the reference (unblocked) code runs at 7.6% of peak Mflop/s; the best block size found by search, 4×2, reaches 31.1% of peak]

SpMV Performance — raefsky3
[Performance plots for the raefsky3 test matrix]

More Surprises Tuning SpMV
- A more complex example: 3×3 blocking
  - Logical grid of 3×3 cells

Extra Work Can Improve Efficiency
- A more complex example: 3×3 blocking
  - Logical grid of 3×3 cells
  - Pad with zeros
  - “Fill ratio” = 1.5, i.e., 1.5× as many flops
- On Pentium III: 1.5× speedup! (2/3 the time), hence a 1.5² = 2.25× flop rate
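
Spelling out the arithmetic on that last line: with 1.5× the flops finished in 2/3 of the time,

$$\frac{\text{new flop rate}}{\text{old flop rate}} \;=\; \frac{1.5\,F \,/\, \tfrac{2}{3}T}{F/T} \;=\; \frac{1.5}{2/3} \;=\; 1.5^2 \;=\; 2.25.$$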

Autotuned Performance of SpMV
[Stacked bar charts for Intel Clovertown, AMD Opteron, and Sun Niagara2 (Huron): naïve serial and naïve Pthreads baselines, then cumulative autotuning gains — +NUMA/affinity, +SW prefetching, +compression, +cache/TLB blocking, +more DIMMs (Opteron), FW fix, array padding (N2), etc.]

Autotuning SpMV
- Large search space of possible optimizations
  - Large speedups possible
  - Parallelism adds more!
- Later talks:
  - Sam Williams on tuning SpMV for a variety of multicore and other platforms
  - Ankit Jain on an easy-to-use system for incorporating autotuning into applications
  - Kaushik Datta on tuning the special case of stencils
  - Rajesh Nishtala on tuning collective communications
- But don’t you still have to write difficult code to generate the search space?

Program Synthesis
- The best implementation/data structure is hard to write and to identify
  - Don’t do this by hand
- Sketching: code generation using 2QBF
  - Spec: simple implementation (3-loop 3D stencil)
  - Sketch: optimized skeleton (5 loops, missing some indices/bounds)
  - Result: optimized code (tiled, prefetched, time-skewed)
- Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils

Communication-Avoiding Linear Algebra (CALU)
- Exponentially growing gaps:
  - Floating-point time << 1/Network BW << Network Latency
    - Improving 59%/year vs. 26%/year vs. 15%/year
  - Floating-point time << 1/Memory BW << Memory Latency
    - Improving 59%/year vs. 23%/year vs. 5.5%/year
- Goal: reorganize linear algebra to avoid communication
  - Not just hiding communication (speedup ≤ 2×)
  - Arbitrary speedups possible

CALU Summary (1/4)
- QR or LU decomposition of an m×n matrix, m >> n
- Parallel implementation
  - Conventional: O(n log p) messages
  - New: O(log p) messages — optimal
- Serial implementation with fast memory of size W
  - Conventional: O(mn/W) moves of data from slow to fast memory
    - mn/W = how many times larger the matrix is than fast memory
  - New: O(1) moves of data
- Performance:
  - QR 5× faster on a cluster; LU 7× faster on a cluster
  - Out-of-core QR only 2× slower than having DRAM
  - Expect gains with multicore as well
- Price:
  - Some redundant computation (but flops are cheap!)
  - Different representation of the answer for QR (tree-structured)
  - LU stable in practice so far, but not GEPP

CALU Summary (2/4)
- QR or LU decomposition of an n×n matrix
  - Communication lower by a factor of b = block size
  - Lots of speedup possible (modeled and measured)
- Modeled speedups of new QR over ScaLAPACK:
  - IBM Power5 (512 procs): up to 9.7×
  - Petascale (8K procs): up to 22.9×
  - Grid (128 procs): up to 11×
- Measured and modeled speedups of new LU over ScaLAPACK:
  - IBM Power5 (Bassi): up to 2.3× speedup (measured)
  - Cray XT4 (Franklin): up to 1.8× speedup (measured)
  - Petascale (8K procs): up to 80× (modeled)
  - Speed limit: Cholesky? Matmul?
- Extends to sparse LU
  - Communication more dominant, so the payoff may be higher
  - Speed limit: sparse Cholesky?
  - Talk by Xiaoye Li on an alternative

CALU Summary (3/4)
- Take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi)
- Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio
- Parallel implementation
  - Conventional: O(k log p) messages
  - New: O(log p) messages — optimal
- Serial implementation
  - Conventional: O(k) moves of data from slow to fast memory
  - New: O(1) moves of data — optimal
- Can incorporate some preconditioners
  - Need to be able to “compress” interactions between distant i, j
  - Hierarchical, semiseparable matrices, …
- Price: some redundant computation
- Lots of speedup possible (modeled and measured)
- Talks by Marghoob Mohiyuddin, Mark Hoemmen (the kernel is written out below)
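
Written out (a restatement of the slide, not new material): k steps of such a method need the Krylov basis

$$\big[\,x,\;Ax,\;A^2x,\;\dots,\;A^k x\,\big],$$

which a conventional code computes as k dependent SpMVs, paying O(k log p) messages. For well-partitioned matrices, a “matrix powers kernel” fetches k-deep ghost zones once and then computes the whole basis locally, for O(log p) messages, at the price of the redundant flops noted above.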

CALU Summary (4/4)
- Lots of related work
  - Some going back to the 1960’s
  - The reports discuss this comprehensively; we will not
- Our contributions
  - Several new algorithms, improvements on old ones
  - Unifying parallel and sequential approaches to avoiding communication
  - Systematic examination of as much of linear algebra as we can
  - Why just linear algebra?
- The time for these algorithms has come, because of growing communication costs

Linear Algebra on GPUs
- Important part of the architectural space to explore
- Talk by Vasily Volkov
  - NVIDIA has licensed our BLAS (SGEMM)
  - Fastest implementations of dense LU, Cholesky, QR
    - 80–90% of “peak”
    - Require various optimizations special to the GPU
      - Use CPU for BLAS1 and BLAS2, GPU for BLAS3
      - In LU, replace TRSM by TRTRI + GEMM (about as stable as GEPP)

High Performance (Speed and Accuracy) — roadmap slide repeated

Faster Algorithms (Highlights)
- MRRR algorithm for the symmetric eigenproblem
  - Talk by Osni Marques / B. Parlett / I. Dhillon / C. Voemel
  - 2006 SIAM Linear Algebra Prize for Parlett, Dhillon
- Parallel sparse LU
  - Talk by Xiaoye Li
- Up to 10× faster HQR
  - R. Byers / R. Mathias / K. Braman
  - 2003 SIAM Linear Algebra Prize
- Extensions to QZ:
  - B. Kågström / D. Kressner / T. Adlerborn
- Faster Hessenberg, tridiagonal, bidiagonal reductions:
  - R. van de Geijn / E. Quintana-Orti
  - C. Bischof / B. Lang
  - G. Howell / C. Fulton

Collaborators
- UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads, …
- U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
- Other academic institutions:
  - UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara
  - TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research institutions: INRIA, LBL
- Industrial partners (predating Par Lab): Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA

High Performance (Speed and Accuracy) — roadmap slide repeated

More Accurate Algorithms
- Motivation: user requests, debugging
- Iterative refinement for Ax=b, least squares (a sketch follows this slide)
  - “Promise” the right answer for O(n²) additional cost
  - Talk by Jason Riedy
- Arbitrary-precision versions of everything
  - Using your favorite multiple-precision package
  - Talk by Yozo Hida
- Jacobi-based SVD
  - Faster than QR, can be arbitrarily more accurate
  - Drmac / Veselic
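
A minimal sketch of the iterative-refinement idea (an illustration, not the code the talk describes): solve in working precision, compute the residual in higher precision, solve for a correction, repeat. The lu_solve argument is a hypothetical caller-supplied helper that applies A⁻¹ via a precomputed factorization (e.g., LAPACK’s dgetrs).

```c
#include <stdlib.h>

/* Refine an approximate solution x of A*x = b.  The residual
   r = b - A*x is accumulated in long double (higher precision);
   lu_solve(n, d, r) must solve A*d = r in working precision. */
void iterative_refinement(int n, const double *A, const double *b,
                          double *x, int steps,
                          void (*lu_solve)(int, double *, const double *)) {
    double *r = malloc(n * sizeof *r);
    double *d = malloc(n * sizeof *d);
    for (int s = 0; s < steps; s++) {
        for (int i = 0; i < n; i++) {      /* residual, in extra precision */
            long double t = b[i];
            for (int j = 0; j < n; j++)
                t -= (long double)A[i * n + j] * x[j];
            r[i] = (double)t;
        }
        lu_solve(n, d, r);                 /* correction: A*d = r */
        for (int i = 0; i < n; i++)
            x[i] += d[i];
    }
    free(r);
    free(d);
}
```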

What could go into linear algebra libraries?
For all linear algebra problems
  For all matrix/problem structures
    For all data types
      For all architectures and networks
        For all programming interfaces
          Produce the best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc.)
Need to prioritize, automate, and enlist help!

What do users want? (1/2)
- Performance, ease of use, functionality, portability
- Composability
  - On multicore, expect to implement dense codes via DAG scheduling (Dongarra’s PLASMA)
  - Talk by Krste Asanovic / Heidi Pan on threads
- Reproducibility (tiny example after this slide)
  - Made challenging by the nonassociativity of floating point
- Ongoing collaborations on driving apps
  - Jointly analyzing needs
  - Talk by T. Keaveny on the medical application
  - Other apps so far: mostly dense and sparse linear algebra, FFTs
    - Some interesting structured needs emerging
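
The reproducibility point in one tiny example: floating-point addition is not associative, so a different reduction order (say, a different number of threads) can change the answer’s bits.

```c
#include <stdio.h>

int main(void) {
    double a = 1e17, b = -1e17, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0: b+c rounds to b */
    return 0;
}
```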

What do users want? (2/2)
- DOE/NSF user survey
  - Small but interesting sample at www.netlib.org/lapack-dev
  - What matrix sizes do you care about?
    - 1,000s: 34%
    - 10,000s: 26%
    - 100,000s or 1Ms: 26%
  - How many processors, on distributed memory?
    - >10: 34%; >100: 31%; >1000: 19%
  - Do you use more than double precision?
    - Sometimes or frequently: 16%
- New graduate program in CSE with 106 faculty from 18 departments
  - New needs may emerge

Highlights of New Dense Functionality
- Updating/downdating of factorizations: Stewart, Langou
- More generalized SVDs: Bai, Wang
- More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat
- Structured eigenproblems (selected)
  - Matrix polynomials: Mehrmann

Organizing Linear Algebra
- www.netlib.org/lapack
- www.netlib.org/scalapack
- gams.nist.gov
- www.netlib.org/templates
- www.cs.utk.edu/~dongarra/etemplates

Improved Ease of Use
- Which do you prefer?

  A\B

  CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

  CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )

Ease of Use: One Approach
- Easy interfaces vs. access to details
  - Some users want access to all details, because:
    - Peak performance matters
    - Control over memory allocation
  - Other users want a “simpler” interface
    - Automatic allocation of workspace
    - No universal agreement across systems on the “easiest interface”
    - Leave the decision to higher-level packages
- Keep expert driver / simple driver / computational routines
- Add wrappers for other languages (a C-wrapper sketch follows this slide)
  - Fortran 95, Java, Matlab, Python, even C
  - Automatic allocation of workspace
- Add wrappers to convert to the “best” parallel layout
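
As a flavor of the wrapper direction, a sketch using LAPACKE, the standard C interface to LAPACK (mentioned here as an assumption about available wrappers, not something from the talk): one simple-driver call, with workspace handled for the caller.

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve A*X = B (A is n x n, B is n x nrhs, both row-major);
   B is overwritten with the solution X.  Returns 0 on success. */
int solve(lapack_int n, lapack_int nrhs, double *A, double *B) {
    lapack_int *ipiv = malloc(n * sizeof *ipiv);   /* pivot indices */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs,
                                    A, n, ipiv, B, nrhs);
    free(ipiv);
    return info;
}
```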

Outline
- Overview of Par Lab
  - Motivation & Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
- Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
- Summary of meeting goals, other talks

Some Goals for the Meeting
- Introduce Par Lab
- Describe numerical library efforts in detail
- Exchange information
  - User needs, tools, goals
- Identify opportunities for collaboration

Summary of Other Talks (1)
Monday, April 28 (531 Cory):
- 12:00–12:45 Jim Demmel — Overview of Par Lab / Numerical Libraries
- 12:45–1:00 Avneesh Sud (Microsoft) — Introduction to the library effort at Microsoft
- 1:00–1:45 Sam Williams / Ankit Jain — Tuning sparse matrix-vector multiply / Parallel OSKI
- 1:45–1:50 Break
- 1:50–2:20 Marghoob Mohiyuddin — Avoiding communication in SpMV-like kernels
- 2:20–2:50 Mark Hoemmen — Avoiding communication in Krylov subspace methods
- 2:50–3:00 Break
- 3:00–3:30 Rajesh Nishtala — Tuning collective communication
- 3:30–4:00 Yozo Hida — High-accuracy linear algebra
- 4:00–4:25 Jason Riedy — Iterative refinement in linear algebra
- 4:25–4:30 Break
- 4:30–5:00 Tony Keaveny — Medical image analysis in Par Lab
- 5:00–5:30 Ras Bodik / Armando Solar-Lezama — Program synthesis by sketching
- 5:30–6:00 Vasily Volkov — Linear algebra on GPUs

Summary of Other Talks (2)
Tuesday, April 29 (Wozniak Lounge):
- 9:00–10:00 Kathy Yelick — Programming systems for Par Lab
- 10:00–10:30 Kaushik Datta — Tuning stencils
- 10:30–11:00 Xiaoye Li — Parallel sparse LU factorization
- 11:00–11:30 Osni Marques — Parallel symmetric eigensolvers
- 11:30–12:00 Krste Asanovic / Heidi Pan — Thread system

EXTRA SLIDES

P.S. Parallel Revolution May Fail
- John Hennessy, President, Stanford University, 1/07: “…when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced. … I would be panicked if I were in industry.” — “A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 4:10, 1/07
- 100% failure rate of parallel computer companies
  - Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, …
- What if IT goes from a growth industry to a replacement industry?
  - If SW can’t effectively use 32, 64, … cores per chip → SW no faster on a new computer → only buy when the computer wears out

5 Themes of Par Lab
1. Applications — compelling apps drive a top-down research agenda
2. Identify common computational patterns — breaking through disciplinary boundaries
3. Developing parallel software with productivity, efficiency, and correctness — 2 layers + Coordination & Composition Language + autotuning
4. OS and architecture — composable primitives, not packaged solutions; deconstruction, fast barrier synchronization, partitions
5. Diagnosing power/performance bottlenecks

Par Lab Research Overview (layer diagram repeated): easy to write correct programs that run efficiently on manycore

Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation
  - More channels, instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
  - Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters
- Music Enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/Handheld recreates 3D sound over earbuds
- Hearing Augmenter
  - Laptop/Handheld as an accelerator for a hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for Laptop/Handheld

Content-Based Image Retrieval (Kurt Keutzer)
[Diagram: query by example → similarity metric → candidate results → relevance feedback → final result, over an image database of 1000’s of images]
- Built around key characteristics of personal databases:
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects

Coronary Artery Disease (Tony Keaveny)
[Images: artery before and after]
- Modeling to help patient compliance?
  - 450k deaths/year, 16M w. symptoms, 72M w. BP
- Massively parallel, real-time variations
  - CFD + FE: solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol

Compelling Laptop/Handheld Apps (Nelson Morgan)
- Meeting Diarist
  - Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
- Teleconference speaker identifier, speech helper
  - L/Hs used for teleconference; identifies who is speaking; “closed caption” hint of what is being said

Compelling Laptop/Handheld Apps
- Health Coach
  - Since the laptop/handheld is always with you: record images of all meals, weigh the plate before and after, analyze calories consumed so far
    - “What if I order a pizza for my next meal? A salad?”
  - Since the laptop/handheld is always with you: record the amount of exercise so far, show how the body would look if this exercise and diet pattern were maintained for the next 3 months
    - “What would I look like if I regularly ran less? Further?”
- Face Recognizer / Name Whisperer
  - Laptop/handheld scans faces, matches against the image database, whispers the name in your ear (relies on content-based image retrieval)

Parallel Browser (Ras Bodik)
- Goal: desktop-quality browsing on handhelds
  - Enabled by 4G networks and better output devices
- Bottlenecks to parallelize
  - Parsing, rendering, scripting
- “SkipJax”
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown’s FlapJax

Theme 2. Use design patterns instead of benchmarks? (Kurt Keutzer)
- How to invent the parallel systems of the future when tied to old code, programming models, and CPUs of the past?
- Look for common design patterns (see A Pattern Language, Christopher Alexander, 1975)
- Sources mined:
  1. Embedded computing (42 EEMBC benchmarks)
  2. Desktop/server computing (28 SPEC 2006)
  3. Database / text mining software
  4. Games/graphics/vision
  5. Machine learning
  6. High-performance computing (the original “7 Dwarfs”)
- Result: 13 “Motifs” (we say “motif” instead of “dwarf” now that we’ve gone from 7 to 13)

“Motif” Popularity (heat-map slide repeated): how do compelling apps relate to the 13 motifs?

4 Valuable Roles of Motifs
1. “Anti-benchmarks”: motifs not tied to code or language artifacts encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary, at least at a high level: to talk across disciplinary boundaries
3. Bootstrapping: parallelize parallel research; allow analysis of HW & SW design without waiting years for full apps
4. Targets for libraries (see later)

Themes 1 and 2 Summary
- Application-driven research (top down) vs. CS solution-driven research (bottom up)
- Drill down on (initially) 5 app areas to guide the research agenda
- Motifs guide the design of apps, implemented via a programming framework per motif
- Motifs help break through traditional interfaces
  - Benchmarking, multidisciplinary conversations, parallelizing parallel research, and targets for libraries

Par Lab Research Overview (layer diagram repeated): easy to write correct programs that run efficiently on manycore

Theme 3: Developing Parallel SW (Kurt Keutzer and Kathy Yelick)
- Observation: use motifs as design patterns
- Design patterns are implemented as 2 kinds of frameworks (a minimal sketch of the second kind follows this slide):
  1. Traditional numerical parallel library
  2. Library that applies a user-supplied function to data in parallel
- Numerical libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
- Function application libraries: MapReduce, dynamic programming, backtracking/B&B, N-body, (un)structured grid, graph traversal, graphical models
- Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce library, mapping 1D FFTs and then transposing (generalized reduce)
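
A minimal sketch of the second kind of framework (illustrative pthreads code, names invented here — not Par Lab’s): the library owns the parallelism; the user supplies only the per-element function.

```c
#include <pthread.h>

typedef void (*elem_fn)(void *elem);

struct span { void **data; int lo, hi; elem_fn f; };

static void *worker(void *arg) {
    struct span *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        s->f(s->data[i]);                 /* apply user function to each item */
    return NULL;
}

/* Apply f to data[0..n-1] using nthreads threads (nthreads <= 64). */
void parallel_map(void **data, int n, elem_fn f, int nthreads) {
    pthread_t tid[64];
    struct span spans[64];
    for (int t = 0; t < nthreads; t++) {
        spans[t] = (struct span){ data, n * t / nthreads,
                                  n * (t + 1) / nthreads, f };
        pthread_create(&tid[t], NULL, worker, &spans[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```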

Developing Parallel Software (slide repeated: 2 types of programmers → 2 layers — the Efficiency Layer for the ~10% of today’s programmers who are experts, the Productivity Layer for the other ~90%)

C&C Language Requirements (Kathy Yelick)
Applications specify C&C language requirements:
- Constructs for creating application frameworks
- Primitive parallelism constructs:
  - Data parallelism
  - Divide-and-conquer parallelism
  - Event-driven execution
- Constructs for composing programming frameworks:
  - Frameworks require independence
  - Independence is proven at instantiation with a variety of techniques
- Needs low runtime overhead and the ability to measure and control performance

Ensuring Correctness (Koushik Sen)
- Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove the chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks, with sequential code as the specification
  - Automatic detection of races, deadlocks

21st Century Code Generation (Demmel, Yelick)
- Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore makes the haystack even more diverse
- New approach: “autotuners”
  - First generate program variations — combinations of optimizations (blocking, prefetching, …) and data structures
  - Then compile and run to heuristically search for the best code for that computer
- Examples: PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW (FFT)
- Example: sparse matrix-vector multiply (SpMV) for 4 multicores
  - Fastest SpMV; optimizations: BCOO vs. BCSR data structures, NUMA, 16b vs. 32b indices, …
[Plot: search space of block sizes for a dense matrix; axes are block dimensions, temperature is speed]

Par Lab Research Overview (layer diagram repeated): easy to write correct programs that run efficiently on manycore

Theme 4: OS and Architecture (Krste Asanovic, John Kubiatowicz)
- Traditional OSes are brittle, insecure memory hogs
  - A traditional monolithic OS image uses lots of precious memory, ×100s–1000s (e.g., AIX uses GBs of DRAM per CPU)
- How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - “Efficiency” instead of “performance,” to capture energy as well as performance
  - Other challenges: power limit, design and verification costs, low yield, higher error rates
- How to prototype ideas fast enough to run real SW?

Deconstructing Operating Systems
- Resurgence of interest in virtual machines
  - Hypervisor: thin SW layer between the guest OS and HW
- Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
  - Opportunity for OS innovation
- Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within its partition

HW Solution: Small is Beautiful
- Want software-composable primitives, not hardware-packaged solutions
  - “You’re not going fast if you’re headed in the wrong direction”
  - Transactional Memory is usually a packaged solution
- Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
  - Small cores are not much slower than large cores
- Parallel is the energy-efficient path to performance: CV²F
  - Lower threshold and supply voltages lower the energy per op
- Configurable memory hierarchy (Cell vs. Clovertown)
  - Can configure on-chip memory as cache or local RAM
  - Programmable DMA to move data without occupying the CPU
  - Cache coherence: mostly HW, but SW handlers for complex cases
  - Hardware logging of memory writes to allow rollback

Build Academic Manycore from FPGAs
- As ~10 CPUs fit in a Field-Programmable Gate Array (FPGA), build a 1000-CPU system from ~100 FPGAs?
  - 8 32-bit simple “soft core” RISCs at 100 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs: 2× CPUs, 1.2× clock rate
- HW research community does the logic design (“gate shareware”) to create an out-of-the-box manycore
  - E.g., a 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 150 MHz/CPU in 2007
  - Ideal for heterogeneous chip architectures
  - RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- “Research Accelerator for Multiple Processors”: a vehicle to lure more researchers to the parallel challenge and decrease the time to a parallel solution

1008-Core “RAMP Blue” (Wawrzynek, Krasnov, … at Berkeley)
- 1008 = 12 32-bit RISC cores per FPGA × 4 FPGAs/board × 21 boards
  - Simple MicroBlaze soft cores @ 90 MHz
  - Full star connection between modules
- NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S)
  - UPC versions (C plus a shared-memory abstraction): CG, EP, IS, MG
- RAMPants creating HW & SW for the manycore community using next-gen FPGAs
  - Chuck Thacker & Microsoft designing the next boards
  - 3rd party to manufacture and sell boards: 1H08
  - Gateware and software BSD open source
- RAMP Gold for Par Lab: new CPU

Par Lab Research Overview (layer diagram repeated): easy to write correct programs that run efficiently on manycore

Theme 5: Diagnosing Power/Performance Bottlenecks (Jim Demmel)
- Collect data on power/performance bottlenecks
- Aid the autotuner, scheduler, and OS in adapting the system
- Turn data into useful information that can help the efficiency-level programmer improve the system
  - E.g., % peak power, % peak memory BW, % CPU, % network
  - E.g., sample traces of critical paths
- Turn data into useful information that can help the productivity-level programmer improve the app
  - Where am I spending my time in my program?
  - If I change it like this, what is the impact on power/performance?
  - Rely on machine learning to find correlations in the data and predict power/performance?

Physical Par Lab — 5th Floor Soda [floor plan]

Impact of Automatic Performance Tuning
- Widely used in performance tuning of kernels
  - ATLAS (PHiPAC) — www.netlib.org/atlas
    - Dense BLAS; now in Matlab and many other releases
  - FFTW — www.fftw.org
    - Fast Fourier Transform and similar transforms; Wilkinson Software Prize
  - Spiral — www.spiral.net
    - Digital signal processing
  - Communication collectives (UCB, UTK)
- Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), UHFFT (Houston), POET (UTSA), …
- More projects (PERI, TOPS2, CScADS), conferences, government reports, …