Multicores, Multiprocessors, and Clusters

Computer Architecture
• Applications suggest how to improve technology, provide revenue to fund development
• Improved technologies make new applications possible
• Compatibility: the cost of software development makes compatibility a major force in the market

Crossroads: First Microprocessor
Intel 4004, 1971
• 4-bit accumulator architecture
• 8 µm pMOS
• 2,300 transistors
• 3 x 4 mm²
• 750 kHz clock
• 8-16 cycles/inst.

Hardware
• Team from IBM building PC prototypes in 1979
• Motorola 68000 chosen initially, but 68000 was late
• 8088 is 8-bit bus version of 8086 => allows cheaper system
• Estimated sales of 250,000
• 100,000s sold [Personal Computing Ad, 11/81]

Crossroads DYSEAC, first mobile computer! • 900 vacuum tubes • memory of 512 words of 45 bits each • Carried in two tractor trailers, 12 tons + 8 tons • Built for US Army Signal Corps

End of Uniprocessors
• Intel cancelled its high-performance uniprocessor project and joined IBM and Sun in moving to multiple processors

Trends
• Shrinking of transistor sizes: 250 nm (1997), 130 nm (2002), 65 nm (2007), 32 nm (2010), 28 nm (2011, AMD GPU, Xilinx FPGA), 22 nm (2011, Intel Ivy Bridge, die shrink of the Sandy Bridge architecture)
• Transistor density increases by 35% per year and die size increases by 10-20% per year… more cores!

Trends (2004 to 2010)
• Transistors: 1.43x / year
• Cores: 1.2-1.4x
• Performance: 1.15x
• Frequency: 1.05x
• Power: 1.04x
Source: Micron University Symp.

Crossroads: textbook editions
• 1996 (when I took this class!): focus on instruction-level parallelism
• 2002: introduce thread-level parallelism, reduced emphasis on ILP
• 2009: shift to multicore, ILP reduced to one chapter!
• 2011: request-, data-, thread-, and instruction-level parallelism; introduce GPUs, cloud computing, smart phones, tablets!

Introduction • Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency • Job-level (process-level) parallelism – High throughput for independent jobs • Parallel processing program – Single program run on multiple processors • Multicore microprocessors – Chips with multiple processors (cores)

Parallel Programming • Parallel software is the problem • Need to get significant performance improvement – Otherwise, just use a faster uniprocessor, since it’s easier! • Difficulties – Partitioning – Coordination – Communications overhead

Parallel Programming
• MPI, OpenMP, and Stream Processing are methods of distributing workloads on computers
• Key: overlapping the program architecture with the target hardware architecture

Shared Memory
• SMP: shared memory multiprocessor
 – Small number of cores
 – Share single memory with uniform memory latency (symmetric)
• SGI Altix UV 1000 (ARDC, December 2011)
 – 58 nodes, 928 cores
 – scales up to 2,560 cores with architectural support for up to 327,680
 – support for up to 16 TB of global shared memory
 – Programming: parallel OpenMP or threaded

Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA
 – Each processor has ID: 0 ≤ Pn ≤ 99
 – Partition: 1000 numbers per processor
 – Initial summation on each processor:
   sum[Pn] = 0;
   for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
     sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
 – Reduction: divide and conquer
 – Half the processors add pairs, then quarter, …
 – Need to synchronize between reduction steps

Example: Sum Reduction
  half = 100;
  repeat
    synch();
    if (half%2 != 0 && Pn == 0)
      sum[0] = sum[0] + sum[half-1];
      /* Conditional sum needed when half is odd; Processor 0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
  until (half == 1);

Distributed Memory • Distributed shared memory (DSM) – Memory distributed among processors – Non-uniform memory access/latency (NUMA) – Processors connected via direct (switched) and non-direct (multihop) interconnection networks – Hardware sends/receives messages between processors

Distributed Memory
• URDC SGI ICE 8400
 – 179 nodes, 2112 cores
 – up to 768 cores in a single rack, scalable from 32 to tens of thousands of nodes
 – lower cost than SMP!
• Distributed memory system (cluster), typically using MPI programming

Sum Reduction (Again)
• Sum 100,000 on 100 processors
• First distribute 1000 numbers to each
 – Then do partial sums:
   sum = 0;
   for (i = 0; i < 1000; i = i + 1)
     sum = sum + AN[i];
• Reduction
 – Half the processors send, the other half receive and add
 – Then a quarter send, a quarter receive and add, …

Sum Reduction (Again)
• Given send() and receive() operations
   limit = 100; half = 100; /* 100 processors */
   repeat
     half = (half+1)/2;   /* send vs. receive dividing line */
     if (Pn >= half && Pn < limit) send(Pn - half, sum);
     if (Pn < (limit/2)) sum = sum + receive();
     limit = half;        /* upper limit of senders */
   until (half == 1);     /* exit with final sum */
 – Send/receive also provide synchronization
 – Assumes send/receive take similar time to addition
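On a real cluster this same pattern is usually written with an MPI collective, which performs the log-steps of the send/receive reduction inside the library. A minimal sketch (my illustration, not from the slides), assuming 100 ranks that each hold 1000 numbers:

  /* sketch: sum 100,000 numbers across 100 MPI ranks */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int Pn, nprocs, i;
      double AN[1000], sum = 0.0, total = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* processor ID */
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* expected to be 100 */

      for (i = 0; i < 1000; i++) AN[i] = 1.0;  /* placeholder local data */
      for (i = 0; i < 1000; i++) sum = sum + AN[i];   /* local partial sum */

      /* tree reduction (send/receive/add) handled by the library */
      MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (Pn == 0) printf("total = %f\n", total);
      MPI_Finalize();
      return 0;
  }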

Matrix Multiplication
(Figure: C = A × B, shown as 3×3 blocks C0–C8, A0–A8, and B0–B8.)

Message Passing Interface
• Language-independent communications protocol
 – provides a means to enable communication between different CPUs
• Point-to-point and collective communication
• Is a specification, not an implementation
• Standard for communication among processes on a distributed memory system
 – this does not mean that its usage is restricted to such systems
• Processes do not have anything in common, and each has its own memory space

Message Passing Interface
• A set of subroutines used explicitly to communicate between processes
• MPI programs are truly "multi-processing"
• Parallelization cannot be done automatically or semi-automatically as in "multi-threading" programs
 – function and subroutine calls have to be inserted into the code
 – they alter the algorithm of the code with respect to the serial version
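To make the "calls inserted into the code" point concrete, here is a minimal point-to-point sketch (my illustration, not from the slides): rank 1 sends one value and rank 0 receives it.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank;
      double value = 3.14, received = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 1) {
          MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);      /* to rank 0, tag 0 */
      } else if (rank == 0) {
          MPI_Recv(&received, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                                /* from rank 1 */
          printf("rank 0 received %f\n", received);
      }

      MPI_Finalize();
      return 0;
  }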

Is it a curse? • The need to include the parallelism explicitly in the program • is a curse – more work and requires more planning than multithreading, • and a blessing – often leads to more reliable and scalable code – the behavior is in the hands of the programmer. – Well-written MPI codes can be made to scale for thousands of CPUs.

OpenMP
• A system of so-called "compiler directives" that are used to express parallelism on a shared-memory machine
• An industry standard
 – most parallel-enabled compilers that are used on SMP machines are capable of processing OpenMP directives
• OpenMP is not a "language"
• Instead, OpenMP specifies a set of subroutines in an existing language (FORTRAN, C) for parallel programming on a shared memory machine
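A minimal illustration (my sketch, not from the slides) of what such a directive looks like: one pragma turns the block into a team of threads, and the same file still compiles serially if the pragma is ignored.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      #pragma omp parallel                      /* fork a team of threads */
      {
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }                                         /* implicit barrier, then join */
      return 0;
  }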

Systems using OpenMP
• SMP (Symmetric Multi-Processor)
 – designed for shared-memory machines
• Advantage of not requiring communication between processors
• Allows multi-threading
 – a dynamic form of parallelism in which sub-processes are created and destroyed during program execution
• OpenMP will not work on distributed-memory clusters

OpenMP compiler directives are inserted by the programmer, which allows stepwise parallelization of pre-existing serial programs

Multiplication
  for (ii = 0; ii < nrows; ii++){
    for (jj = 0; jj < ncols; jj++){
      for (kk = 0; kk < nrows; kk++){
        array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
      }
    }
  }

Multiplication
Unified code: OpenMP constructs are treated as comments when sequential compilers are used.
  #pragma omp parallel for shared(array, ncols, nrows) private(ii, jj, kk)
  for (ii = 0; ii < nrows; ii++){
    for (jj = 0; jj < ncols; jj++){
      for (kk = 0; kk < nrows; kk++){
        array[ii][jj] = array[ii][kk] * array[kk][jj] + array[ii][jj];
      }
    }
  }

Why is OpenMP popular?
• The simplicity and ease of use
 – No message passing
 – Data layout and decomposition is handled automatically by directives
• OpenMP directives may be incorporated incrementally
 – The program can be parallelized one portion after another, so no dramatic change to the code is needed
 – Original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs and helps maintenance as well
• The code is in effect a serial code and more readable
• Code size increase is generally smaller

OpenMP Tradeoffs
• Cons
 – currently only runs efficiently on shared-memory multiprocessor platforms
 – requires a compiler that supports OpenMP
 – scalability is limited by memory architecture
 – reliable error handling is missing
 – synchronization between subsets of threads is not allowed
 – mostly used for loop parallelization

MPI Tradeoffs • Pros of MPI – does not require shared memory architectures which are more expensive than distributed memory architectures – can be used on a wider range of problems since it exploits both task parallelism and data parallelism – can run on both shared memory and distributed memory architectures – highly portable with specific optimization for the implementation on most hardware • Cons of MPI – requires more programming changes to go from serial to parallel version – can be harder to debug

Another Example
• Consider the following code fragment that finds the sum of f(x) for 0 <= x < n.
  for (ii = 0; ii < n; ii++){
    sum = sum + some_complex_long_function(a[ii]);
  }

Solution
  for (ii = 0; ii < n; ii++){
    sum = sum + some_complex_long_function(a[ii]);
  }

  #pragma omp parallel for shared(sum, a, n) private(ii, value)
  for (ii = 0; ii < n; ii++) {
    value = some_complex_long_function(a[ii]);
  #pragma omp critical
    sum = sum + value;
  }

or better, you can use the reduction clause to get

  #pragma omp parallel for reduction(+:sum)
  for (ii = 0; ii < n; ii++){
    sum = sum + some_complex_long_function(a[ii]);
  }

Measuring Performance • Two primary metrics: wall clock time (response time for a program) and throughput (jobs performed in unit time) – If we upgrade a machine with a new processor what do we increase? – If we add a new machine to the lab what do we increase? • Performance is measured with benchmark suites: a collection of programs that are likely relevant to the user – SPEC CPU 2006: cpu-oriented (desktops) – SPECweb, TPC: throughput-oriented (servers) – EEMBC: for embedded processors/workloads

Measuring Performance
• Elapsed Time
 – counts everything (disk and memory accesses, I/O, etc.)
 – a useful number, but often not good for comparison purposes
• CPU time
 – doesn't count I/O or time spent running other programs
 – can be broken up into system time and user time

Benchmark Games
• An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola…came in a test known as SPECint92…Intel acknowledged that it had “optimized” its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code…
Saturday, January 6, 1996 New York Times

SPEC CPU 2000

Problems of Benchmarking • Hard to evaluate real benchmarks: – Machine not built yet, simulators too slow – Benchmarks not ported – Compilers not ready • Benchmark performance is composition of hardware and software (program, input, compiler, OS) performance, which must all be specified

Compiler and Performance

Amdahl's Law • The performance enhancement of an improvement is limited by how much the improved feature is used. In other words: Don’t expect an enhancement proportional to how much you enhanced something. • Example: "Suppose a program runs in 100 seconds on a machine, with multiply operations responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? "

Amdahl's Law
1. Speedup = 4
2. Old execution time = 100
3. New execution time = 100/4 = 25
4. If 80 seconds is used by the affected part =>
5. Unaffected part = 100 - 80 = 20 sec
6. Execution time new = Execution time unaffected + Execution time affected / Improvement
7. 25 = 20 + 80/Improvement
8. Improvement = 16
How about 5X speedup?
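Working the same equation for a 5X target (my check, not on the slide) shows why the question is a trick:

  \text{New time} = \frac{100}{5} = 20 = 20 + \frac{80}{\text{Improvement}}
  \;\Rightarrow\; \frac{80}{\text{Improvement}} = 0

No finite improvement satisfies this: even an infinitely fast multiplier leaves the 20-second unaffected part, so the speedup can never reach 5X.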

Amdahl's Law
• An application is "almost all" parallel: 90%. Speedup using
 – 10 processors => 5.3x
 – 100 processors => 9.1x
 – 1000 processors => 9.9x
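These figures follow from Amdahl's Law with parallel fraction f = 0.9 (my worked check, not on the slide):

  \text{Speedup}(n) = \frac{1}{(1-f) + f/n}
  \text{Speedup}(10) = \frac{1}{0.1 + 0.09} \approx 5.3, \quad
  \text{Speedup}(100) = \frac{1}{0.1 + 0.009} \approx 9.1, \quad
  \text{Speedup}(1000) = \frac{1}{0.1 + 0.0009} \approx 9.9

The limit as n grows is 1/(1-f) = 10, so even infinitely many processors give at most a 10x speedup.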

Stream Processing
• Streaming data in and out of an execution core without utilizing inter-thread communication, scattered (i.e., random) writes or even reads, or local memory
 – hardware is drastically simplified
 – specialized chips (graphics processing unit)

Parallelism • ILP exploits implicit parallel operations within a loop or straight-line code segment • TLP explicitly represented by the use of multiple threads of execution that are inherently parallel

Multithreaded Categories
(Figure: issue slots over time, one column per approach — superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading — with threads 1-5 in different shades and idle slots left empty.)

Graphics Processing Units • Few hundred $ = hundreds of parallel FPUs – High performance computing more accessible – Blossomed with easy programming environment • GPUs and CPUs do not go back in computer architecture genealogy to a common ancestor – Primary ancestors of GPUs: Graphics accelerators

History of GPUs
• Early video cards
 – Frame buffer memory with address generation for video output
• 3D graphics processing
 – Originally high-end computers (e.g., SGI)
 – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
 – Processors oriented to 3D graphics tasks
 – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphics in the System

Graphics Processing Units • Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications? • Basic idea: – Heterogeneous execution model • CPU is the host, GPU is the device – Develop a C-like programming language for GPU – Unify all forms of GPU parallelism as CUDA thread – Programming model: “Single Instruction Multiple Thread”

Programming the GPU
• Compute Unified Device Architecture (CUDA)
 – Elegant solution to the problem of expressing parallelism
 – Not all algorithms, but enough to matter
• Challenge: coordinating host and device
 – Scheduling of computation
 – Data transfer
• GPU offers every type of parallelism that can be captured by the programming environment

Programming Model • CUDA’s design goals – extend a standard sequential programming language, specifically C/C++, • focus on the important issues of parallelism—how to craft efficient parallel algorithms—rather than grappling with the mechanics of an unfamiliar and complicated language. – minimalist set of abstractions for expressing parallelism • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.

GTX 570 GPU Up to 1536 Threads/SM

Programming the GPU • CUDA Programming Model – Single Instruction Multiple Thread (SIMT) • A thread is associated with each data element • Threads are organized into blocks • Blocks are organized into a grid • GPU hardware handles thread management, not applications or OS
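A small sketch of that thread/block/grid hierarchy (my illustration with hypothetical names, not from the slides): a 2-D grid of 2-D blocks assigns one thread to each element of a matrix.

  /* one thread per matrix element, blocks of 16x16 threads */
  __global__ void addOne(float *M, int width, int height) {
      int col = blockIdx.x * blockDim.x + threadIdx.x;   /* position inside the grid */
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < height && col < width)                   /* guard the edges */
          M[row * width + col] += 1.0f;
  }

  void launch(float *d_M, int width, int height) {
      dim3 block(16, 16);                                 /* 256 threads per block */
      dim3 grid((width + 15) / 16, (height + 15) / 16);   /* enough blocks to cover M */
      addOne<<<grid, block>>>(d_M, width, height);        /* hardware schedules the blocks */
  }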

GPU Threads in SM (GTX 570)
• 32 threads within a block (a warp) work collectively
 – Memory access optimization, latency hiding

GPU Threads in SM (GTX 570) • Up to 1024 Threads/Block and 8 Active Blocks per SM

Programming the GPU

Matrix Multiplication

Matrix Multiplication
• For a 4096 x 4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX 570), which means we can have this many matrix cells calculated in parallel.
• On a general purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
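A bare-bones kernel for the one-thread-per-element idea (my sketch with hypothetical names; the shared-memory sub-matrix buffering mentioned above is omitted for brevity):

  /* each thread computes one element of C = A * B for N x N row-major matrices */
  __global__ void matMul(const float *A, const float *B, float *C, int N) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < N && col < N) {
          float sum = 0.0f;
          for (int k = 0; k < N; k++)
              sum += A[row * N + k] * B[k * N + col];   /* dot product of row and column */
          C[row * N + col] = sum;
      }
  }
  /* launch: dim3 block(16, 16); dim3 grid((N+15)/16, (N+15)/16);
             matMul<<<grid, block>>>(dA, dB, dC, N); */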

Solving Systems of Equations

Thread Organization • If we expand to 4096 equations, we can process each row completely in parallel with 4096 threads • We will require 4096 kernel launches. One for each equation

Results
CPU Configuration: Intel Xeon @2.33 GHz with 2 GB RAM
GPU Configuration: NVIDIA Tesla C1060 @1.3 GHz
*For single precision, speedup improves by at least a factor of 2X
Execution time includes data transfer from host to device

Programming the GPU
• Distinguishing execution place of functions:
 § __device__ or __global__ => GPU device
 § Variables declared are allocated to the GPU memory
 § __host__ => system processor (host)
• Function call
 § Name<<<dimGrid, dimBlock>>>(..parameter list..)
 § blockIdx: block identifier
 § threadIdx: thread identifier within a block
 § blockDim: threads per block

Programming the GPU
  //Invoke DAXPY
  daxpy(n, 2.0, x, y);
  //DAXPY in C
  void daxpy(int n, double a, double* x, double* y)
  {
    for (int i = 0; i < n; i++)
      y[i] = a*x[i] + y[i];
  }

Programming the GPU
  //Invoke DAXPY with 256 threads per Thread Block
  __host__
  int nblocks = (n+255)/256;
  daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
  //DAXPY in CUDA (a kernel launched with <<<>>> must be __global__)
  __global__
  void daxpy(int n, double a, double* x, double* y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
  }

Programming the GPU
• CUDA
 – Hardware handles thread management, invisible to the programmer (productivity)
 – For performance, programmers need to know the operation principles of the threads!
• Productivity vs. performance
 – How much power to give to the programmer; CUDA is still evolving!

Efficiency Considerations
• Avoid execution divergence
 – threads within a warp follow different execution paths
 – divergence between warps is ok
• Allow loading a block of data into the SM
 – process it there, and then write the final result back out to external memory
• Coalesce memory accesses
 – Access consecutive words instead of gather-scatter
• Create enough parallel work
 – 5K to 10K threads
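To illustrate the coalescing point (my sketch, not from the slides): in the first kernel the threads of a warp read consecutive words and the loads coalesce into few memory transactions; in the second, a stride scatters the same warp across memory.

  __global__ void copy_coalesced(float *out, const float *in, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];                 /* thread k touches word k: coalesced */
  }

  __global__ void copy_strided(float *out, const float *in, int n, int stride) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[(i * stride) % n];  /* neighbors in a warp hit far-apart words */
  }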

Efficiency Considerations • GPU Architecture – Each SM executes multiple warps in a time-sharing fashion while one or more are waiting for memory values • Hiding the execution cost of warps that are executed concurrently. – How many memory requests can be serviced and how many warps can be executed together while one warp is waiting for memory values.

OpenMP vs CUDA
  #pragma omp parallel for shared(A) private(i, j)
  for (i = 0; i < 32; i++){
    for (j = 0; j < 32; j++)
      value = some_function(A[i][j]);
  }
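A possible CUDA counterpart of the loop above (my sketch with hypothetical names): the two nested 32-iteration loops become one thread per (i, j) element.

  __device__ float some_function(float x) { return 2.0f * x; }   /* placeholder body */

  __global__ void process(const float *A, float *out) {          /* A stored as 32x32, row-major */
      int i = blockIdx.x;                  /* one block per row i     */
      int j = threadIdx.x;                 /* one thread per column j */
      out[i * 32 + j] = some_function(A[i * 32 + j]);
  }
  /* launch as a grid of 32 blocks of 32 threads: process<<<32, 32>>>(d_A, d_out); */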

GPU Architectures
• Processing is highly data-parallel
 – GPUs are highly multithreaded
 – Use thread switching to hide memory latency
• Less reliance on multi-level caches
 – Graphics memory is wide and high-bandwidth
• Trend toward general purpose GPUs
 – Heterogeneous CPU/GPU systems
 – CPU for sequential code, GPU for parallel code
• Programming languages/APIs
 – DirectX, OpenGL
 – C for Graphics (Cg), High Level Shader Language (HLSL)
 – Compute Unified Device Architecture (CUDA)

Easy to Learn Takes time to master

Example Systems
2 × quad-core Intel Xeon e5345 (Clovertown)
2 × quad-core AMD Opteron X4 2356 (Barcelona)

Example Systems
2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
2 × oct-core IBM Cell QS20

IBM Cell Broadband Engine
(Figure annotations: 128 128-bit registers and a 128-bit vector per cycle in each SPE; 32 KB I and D caches and a 512 KB unified cache on the PPE; main-memory accesses cost 300-600 cycles.)
Abbreviations
 – PPE: PowerPC Engine
 – SPE: Synergistic Processing Element
 – MFC: Memory Flow Controller
 – LS: Local Store
 – SIMD: Single Instruction Multiple Data

CELL BE Programming Model No direct access to DRAM from LS of SPE, Buffer size: 16 KB

• CASE STUDIES

White Spaces After Digital TV
• In telecommunications, white spaces refer to vacant frequency bands between licensed broadcast channels or services like wireless microphones.
• After the transition to digital TV in the U.S. in June 2009, the amount of white space exceeded the amount of occupied spectrum even in major cities.
• Utilization of white spaces for digital communications requires propagation loss models to detect occupied frequencies in near real-time, for operation without causing harmful interference to a DTV signal or other wireless systems operating on a previously vacant channel.

Challenge • Irregular Terrain Model (ITM), also known as the Longley-Rice model, is used to make predictions of radio field strength based on the elevation profile of terrains between the transmitter and the receiver. – Due to constant changes in terrain topography and variations in radio propagation, there is a pressing need for computational resources capable of running hundreds of thousands of transmission loss calculations per second.

ITM
• Given the path length (d) for a radio transmitter T, a circle is drawn around T with radius d.
• Along the contour line, 64 hypothetical receivers (Ri) are placed with equal distance from each other.
• Vector lines from T to each Ri are further partitioned into 0.5 km sectors (Sj).
• Atmospheric and geographic conditions along each sector form the profile of that terrain (256K profiles used).
• For each profile, ITM involves independent computations based on atmospheric and geographic conditions followed by transmission loss calculations.

GPU Strategies for ITM
• ITM requires 45 registers per thread; an optimized version reduces the register count to 37
• Each profile is 1 KB (radio frequency, path length, antenna heights, surface transfer impedance, plus 157 elevation points)
(Figure annotations: 1.5 GB per GPU; 8 KB and 16 KB per multiprocessor.)
• How many threads / MP?

GPU Strategies for ITM
(Figure: thread configurations of 16*16, 128*16, and 192*16 threads.)

GPU Strategies for ITM

GPU Strategies for ITM

GPU Strategies for ITM

IBM CELL BE
• Workload: 256K profiles
• Strategies:
 – Message Queue (MQ)
 – DMA and double buffering with various buffer sizes (DDB-n)
 – SIMD with buffer size of 16 KB (DDB-16+SIMD-), FG: fine grained, CG: coarse grained
 – Profile-level SIMDization (CG) improves performance by 7.5x over MQ

Productivity

Comparison
• Productivity from code development perspective
• Based on personal experience of a Ph.D. student
 – with C/C++ knowledge and the serial version of the ITM code in hand,
 – without prior background on the Cell BE and GPU programming environments
• Data logged for the "learning curve" and "design and debugging" times individually

Instruction and Data Streams
• An alternate classification, by single vs. multiple instruction and data streams:
 – SISD (single instruction, single data): Intel Pentium 4
 – SIMD (single instruction, multiple data): SSE instructions of x86
 – MISD (multiple instruction, single data): no examples today
 – MIMD (multiple instruction, multiple data): Intel Xeon e5345
• SPMD: Single Program Multiple Data
 – A parallel program on a MIMD computer

SIMD
• Operate elementwise on vectors of data
 – E.g., MMX and SSE instructions in x86
 – Multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
 – Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
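For concreteness, a small SSE sketch in C (my illustration, not from the slides): one intrinsic adds four packed single-precision floats held in a 128-bit register.

  #include <xmmintrin.h>

  void add4(const float *a, const float *b, float *c) {
      __m128 va = _mm_loadu_ps(a);        /* load 4 floats into a 128-bit register */
      __m128 vb = _mm_loadu_ps(b);
      __m128 vc = _mm_add_ps(va, vb);     /* one instruction, 4 elementwise adds   */
      _mm_storeu_ps(c, vc);               /* store the 4 results                   */
  }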

Vector Processors
• Highly pipelined function units
• Stream data from/to vector registers to units
 – Data collected from memory into registers
 – Results stored from registers to memory
• Example: Vector extension
 – 32 × 64-element registers (64-bit elements)
 – Vector instructions
  • lv, sv: load/store vector
  • addv.d: add vectors of double
  • addvs.d: add scalar to each element of vector of double
• Significantly reduces instruction-fetch bandwidth

Vector Processors

Example: (Y = a × X + Y)
• Conventional MIPS code
        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done
• Vector MIPS code
        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result

Matrix Multiplication
• CPU Configuration: Intel Xeon @2.33 GHz with 2 GB RAM
• GPU Configuration: NVIDIA Tesla C1060 @1.3 GHz
• +For multiplication, matrix size larger than 4096 x 4096 stresses the host device's RAM
• *For single precision, speedup improves by at least a factor of 2X

  Matrix Size   | CPU Time (sec) | GPU Time (sec) | Speedup*
  256 x 256     | 0.159          | 0.002          | 71
  512 x 512     | 1.518          | 0.009          | 169
  1024 x 1024   | 25.773         | 0.037          | 682
  2048 x 2048   | 547.882        | 0.208          | 2623
  4096 x 4096+  | 4556.700       | 1.362          | 3345