Chapter 7 — Multicores, Multiprocessors, and Clusters

Introduction (§7.1 Introduction)

• Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
• Job-level (process-level) parallelism
  - High throughput for independent jobs
• Parallel processing program
  - Single program run on multiple processors
• Multicore microprocessors
  - Chips with multiple processors (cores)

Hardware and Software

• Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
• Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
  - Challenge: making effective use of parallel hardware

What We’ve Already Covered

• §2.11: Parallelism and Instructions
  - Synchronization
• §3.6: Parallelism and Computer Arithmetic
  - Associativity
• §4.10: Parallelism and Advanced Instruction-Level Parallelism
• §5.8: Parallelism and Memory Hierarchies
  - Cache coherence
• §6.9: Parallelism and I/O
  - Redundant Arrays of Inexpensive Disks

Parallel Programming (§7.2 The Difficulty of Creating Parallel Processing Programs)

• Parallel software is the problem
• Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl’s Law

• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  - Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
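
A minimal sketch (not from the slides) that reproduces the calculation above: solve Speedup = 1 / ((1 − F) + F/P) for F, with P = 100 and a target speedup of 90.

    /* Amdahl's Law example: required parallel fraction */
    #include <stdio.h>

    int main(void) {
        double p = 100.0;   /* processors */
        double s = 90.0;    /* desired speedup */
        /* rearranging Speedup = 1 / ((1 - F) + F/p) gives F = p(s-1) / (s(p-1)) */
        double f = (p * (s - 1.0)) / (s * (p - 1.0));
        printf("F_parallelizable = %.3f\n", f);   /* prints 0.999 */
        return 0;
    }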

Scaling Example

• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors

Scaling Example (cont)

• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced
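
A small sketch (not from the slides) that reproduces both scaling examples above, with tadd normalized to 1 time unit; only the matrix sum is parallelized, the 10 scalar adds stay sequential.

    #include <stdio.h>

    /* speedup for summing 10 scalars plus an n x n matrix on p processors */
    static double speedup(int n, int p) {
        double serial   = 10.0 + (double)n * n;
        double parallel = 10.0 + (double)n * n / p;   /* scalars remain sequential */
        return serial / parallel;
    }

    int main(void) {
        printf("10x10,    10 procs: %.1f\n", speedup(10, 10));    /* 5.5  */
        printf("10x10,   100 procs: %.1f\n", speedup(10, 100));   /* 10.0 */
        printf("100x100,  10 procs: %.1f\n", speedup(100, 10));   /* 9.9  */
        printf("100x100, 100 procs: %.1f\n", speedup(100, 100));  /* 91.0 */
        return 0;
    }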

Strong vs Weak Scaling

• Strong scaling: problem size fixed
  - As in the example above
• Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix: Time = 20 × tadd
  - 100 processors, 32 × 32 matrix: Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  - Constant performance in this example

Shared Memory (§7.3 Shared Memory Multiprocessors)

• SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction

• Sum 100,000 numbers on a 100-processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor:

      sum[Pn] = 0;
      for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

• Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

Example: Sum Reduction (cont)

    half = 100;
    repeat
      synch();
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets missing element */
      half = half/2;   /* dividing line on who sums */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
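
The pseudocode above assumes an idealized UMA machine with an explicit synch() barrier. As a hedged, minimal sketch (assuming OpenMP, which the slides mention only later in connection with PARSEC), the same partition-then-reduce pattern can be written as:

    #include <stdio.h>
    #include <omp.h>

    #define N 100000

    int main(void) {
        static double A[N];
        for (int i = 0; i < N; i++) A[i] = 1.0;      /* example data */

        double sum = 0.0;
        /* each thread sums its own slice; OpenMP combines the partial sums */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
        return 0;
    }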

Message Passing (§7.4 Clusters and Other Message-Passing Multiprocessors)

• Each processor has a private physical address space
• Hardware sends/receives messages between processors

Loosely Coupled Clusters

• Network of independent computers
  - Each has private memory and OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
  - High availability, scalable, affordable
• Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)

Sum Reduction (Again)

• Sum 100,000 numbers on 100 processors
• First distribute 1000 numbers to each
  - Then do the partial sums:

      sum = 0;
      for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

• Reduction
  - Half the processors send, the other half receive and add
  - Then a quarter send, a quarter receive and add, …

Sum Reduction (Again, cont)

• Given send() and receive() operations:

    limit = 100; half = 100;  /* 100 processors */
    repeat
      half = (half+1)/2;      /* send vs. receive dividing line */
      if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
      if (Pn < (limit/2))
        sum = sum + receive();
      limit = half;           /* upper limit of senders */
    until (half == 1);        /* exit with final sum */

• Send/receive also provide synchronization
• Assumes send/receive take similar time to addition
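
The send()/receive() primitives are left abstract on the slide. A hedged sketch of the same reduction using MPI (an assumption — MPI is not named in the slides), where MPI_Reduce performs the tree-structured combine internally:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* each process sums its local 1000 numbers (placeholder data) */
        double local = 0.0;
        for (int i = 0; i < 1000; i++) local += 1.0;

        /* combine the partial sums on process 0 */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("total = %f\n", total);
        MPI_Finalize();
        return 0;
    }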

Grid Computing

• Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
• Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

Multithreading (§7.5 Hardware Multithreading)

• Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)

Simultaneous Multithreading

• In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
• Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example (figure)

Future of Multithreading

• Will it survive? In what form?
• Power considerations ⇒ simplified microarchitectures
  - Simpler forms of multithreading
• Tolerating cache-miss latency
  - Thread switch may be most effective
• Multiple simple cores might share resources more effectively

Instruction and Data Streams (§7.6 SISD, MIMD, SIMD, SPMD, and Vector)

• An alternate classification, by instruction and data streams:
  - SISD (single instruction, single data): e.g., Intel Pentium 4
  - SIMD (single instruction, multiple data): e.g., SSE instructions of x86
  - MISD (multiple instruction, single data): no examples today
  - MIMD (multiple instruction, multiple data): e.g., Intel Xeon e5345
• SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

SIMD

• Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
  - Multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
  - Each with a different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
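
A minimal sketch (not from the slides) of the SIMD idea using SSE intrinsics: one instruction performs four single-precision adds on the elements of 128-bit registers.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);   /* one instruction, four element-wise adds */

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }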

Vector Processors

• Highly pipelined function units
• Stream data from/to vector registers to the units
  - Data collected from memory into registers
  - Results stored from registers to memory
• Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions:
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of a vector of double
• Significantly reduces instruction-fetch bandwidth

Example: DAXPY (Y = a × X + Y)

• Conventional MIPS code:

          l.d    $f0,a($sp)       ; load scalar a
          addiu  r4,$s0,#512      ; upper bound of what to load
    loop: l.d    $f2,0($s0)       ; load x(i)
          mul.d  $f2,$f2,$f0      ; a × x(i)
          l.d    $f4,0($s1)       ; load y(i)
          add.d  $f4,$f4,$f2      ; a × x(i) + y(i)
          s.d    $f4,0($s1)       ; store into y(i)
          addiu  $s0,$s0,#8       ; increment index to x
          addiu  $s1,$s1,#8       ; increment index to y
          subu   $t0,r4,$s0       ; compute bound
          bne    $t0,$zero,loop   ; check if done

• Vector MIPS code:

          l.d     $f0,a($sp)      ; load scalar a
          lv      $v1,0($s0)      ; load vector x
          mulvs.d $v2,$v1,$f0     ; vector-scalar multiply
          lv      $v3,0($s1)      ; load vector y
          addv.d  $v4,$v2,$v3     ; add y to product
          sv      $v4,0($s1)      ; store the result
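
For reference, the same DAXPY loop in C (a sketch, not from the slides); the 512-byte bound in the scalar code corresponds to 64 double-precision elements.

    /* y[i] = a * x[i] + y[i] for 64 elements */
    void daxpy(double a, double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }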

Vector vs. Scalar

• Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology

History of GPUs (§7.7 Introduction to Graphics Processing Units)

• Early video cards
  - Frame buffer memory with address generation for video output
• 3D graphics processing
  - Originally high-end computers (e.g., SGI)
  - Moore’s Law ⇒ lower cost, higher density
  - 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization

Graphics in the System (figure)

GPU Architectures

• Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
• Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
• Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla (figure: streaming multiprocessor with 8 × streaming processors)

Example: NVIDIA Tesla

• Streaming Processors (SPs)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
• Warp: group of 32 threads
  - Executed in parallel, SIMD style: 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, …

Classifying GPUs

• Don’t fit nicely into the SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care
• Where they fall by parallelism type:
  - Instruction-level parallelism, static (discovered at compile time): VLIW
  - Instruction-level parallelism, dynamic (discovered at runtime): Superscalar
  - Data-level parallelism, static: SIMD or Vector
  - Data-level parallelism, dynamic: Tesla Multiprocessor

Interconnection Networks (§7.8 Introduction to Multiprocessor Network Topologies)

• Network topologies
  - Arrangements of processors, switches, and links
  - Examples (figure): bus, ring, 2-D mesh, N-cube (N = 3), fully connected

Multistage Networks (figure)

Network Characteristics

• Performance
  - Latency per message (unloaded network)
  - Throughput
    - Link bandwidth
    - Total network bandwidth
    - Bisection bandwidth
  - Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon
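
As a small illustration (the formulas are the standard ones for these topologies, not given on the slide), total and bisection bandwidth for a ring versus a fully connected network of P nodes, with per-link bandwidth normalized to 1:

    #include <stdio.h>

    int main(void) {
        int p = 64;   /* number of nodes (illustrative) */

        /* Ring: one link per node around the loop; cutting it in half cuts 2 links */
        double ring_total = p, ring_bisection = 2;

        /* Fully connected: a link between every pair of nodes */
        double full_total = p * (p - 1) / 2.0;
        double full_bisection = (p / 2.0) * (p / 2.0);

        printf("ring:            total=%.0f  bisection=%.0f\n", ring_total, ring_bisection);
        printf("fully connected: total=%.0f  bisection=%.0f\n", full_total, full_bisection);
        return 0;
    }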

Parallel Benchmarks (§7.9 Multiprocessor Benchmarks)

• Linpack: matrix linear algebra
• SPECrate: parallel run of SPEC CPU programs
  - Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
  - Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
  - Computational fluid dynamics kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  - Multithreaded applications using Pthreads and OpenMP

Code or Applications?

• Traditional benchmarks
  - Fixed code and data sets
• Parallel programming is evolving
  - Should algorithms, programming languages, and tools be part of the system?
  - Compare systems, provided they implement a given application
    - E.g., Linpack, Berkeley Design Patterns
  - Would foster innovation in approaches to parallelism

Modeling Performance (§7.10 Roofline: A Simple Performance Model)

• Assume the performance metric of interest is achievable GFLOPs/sec
  - Measured using computational kernels from Berkeley Design Patterns
• Arithmetic intensity of a kernel
  - FLOPs per byte of memory accessed
• For a given computer, determine
  - Peak GFLOPS (from data sheet)
  - Peak memory bytes/sec (using the Stream benchmark)

Roofline Diagram (figure)

  Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
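
A minimal sketch of the roofline bound above; the peak numbers are illustrative placeholders, not measurements from the slides.

    #include <stdio.h>

    /* attainable GFLOPs/s = min(memory-bound ceiling, compute ceiling) */
    static double roofline(double peak_gflops, double peak_bw_gbs, double ai) {
        double memory_bound = peak_bw_gbs * ai;   /* GB/s × FLOPs/byte = GFLOPs/s */
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }

    int main(void) {
        /* e.g., a machine with 75 GFLOPs/s peak and 10 GB/s sustained bandwidth */
        for (double ai = 0.125; ai <= 16.0; ai *= 2)
            printf("AI = %6.3f  attainable = %6.2f GFLOPs/s\n",
                   ai, roofline(75.0, 10.0, ai));
        return 0;
    }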

Comparing Systems

• Example: Opteron X2 vs. Opteron X4
  - 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
  - Same memory system
• To get higher performance on X4 than X2
  - Need high arithmetic intensity
  - Or the working set must fit in X4’s 2 MB L3 cache

Optimizing Performance

• Optimize FP performance
  - Balance adds and multiplies
  - Improve superscalar ILP and use of SIMD instructions
• Optimize memory usage
  - Software prefetch
    - Avoid load stalls
  - Memory affinity
    - Avoid non-local data accesses

Optimizing Performance (cont)

• Choice of optimization depends on the arithmetic intensity of the code
• Arithmetic intensity is not always fixed
  - May scale with problem size
  - Caching reduces memory accesses, which increases arithmetic intensity

Four Example Systems (§7.11 Real Stuff: Benchmarking Four Multicores)

• 2 × quad-core Intel Xeon e5345 (Clovertown)
• 2 × quad-core AMD Opteron X4 2356 (Barcelona)

Four Example Systems (cont)

• 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
• 2 × oct-core IBM Cell QS20

And Their Rooflines (figure)

• Kernels: SpMV (left) and LBMHD (right)
• Some optimizations change arithmetic intensity
• x86 systems have higher peak GFLOPs
  - But harder to achieve, given memory bandwidth

Performance on SpMV

• Sparse matrix/vector multiply
  - Irregular memory accesses, memory bound
• Arithmetic intensity
  - 0.166 before memory optimization, 0.25 after
• Xeon vs. Opteron
  - Similar peak FLOPS
  - Xeon limited by shared FSBs and chipset
• UltraSPARC/Cell vs. x86
  - 20–30 vs. 75 peak GFLOPs
  - More cores and memory bandwidth

Performance on LBMHD

• Fluid dynamics: structured grid over time steps
  - Each point: 75 FP reads/writes, 1300 FP ops
• Arithmetic intensity
  - 0.70 before optimization, 1.07 after
• Opteron vs. UltraSPARC
  - More powerful cores, not limited by memory bandwidth
• Xeon vs. others
  - Still suffers from memory bottlenecks

Achieving Performance

• Compare naïve vs. optimized code
  - If naïve code performs well, it’s easier to write high-performance code for the system

  System              Kernel   Naïve GFLOPs/sec      Optimized GFLOPs/sec   Naïve as % of optimized
  Intel Xeon          SpMV     1.0                   1.5                    64%
                      LBMHD    4.6                   5.6                    82%
  AMD Opteron X4      SpMV     1.4                   3.6                    38%
                      LBMHD    7.1                   14.1                   50%
  Sun UltraSPARC T2   SpMV     3.5                   4.1                    86%
                      LBMHD    9.7                   10.5                   93%
  IBM Cell QS20       SpMV     (naïve not feasible)  6.4                    0%
                      LBMHD    (naïve not feasible)  16.7                   0%

Fallacies (§7.12 Fallacies and Pitfalls)

• Amdahl’s Law doesn’t apply to parallel computers
  - Since we can achieve linear speedup
  - But only on applications with weak scaling
• Peak performance tracks observed performance
  - Marketers like this approach!
  - But compare the Xeon with the others in the example
  - Need to be aware of bottlenecks

Pitfalls

• Not developing the software to take account of a multiprocessor architecture
  - Example: using a single lock for a shared composite resource
    - Serializes accesses, even if they could be done in parallel
    - Use finer-granularity locking
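
A hedged sketch of the finer-granularity alternative (assuming Pthreads; names such as NBUCKETS and increment are illustrative, not from the slides): one lock per hash bucket instead of one lock for the whole table, so updates to different buckets can proceed in parallel.

    #include <pthread.h>

    #define NBUCKETS 64            /* illustrative table size */

    struct bucket {
        pthread_mutex_t lock;      /* one lock per bucket, not one for the whole table */
        long count;
    };

    static struct bucket table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; i++) {
            pthread_mutex_init(&table[i].lock, NULL);
            table[i].count = 0;
        }
    }

    void increment(int key) {      /* key assumed non-negative */
        struct bucket *b = &table[key % NBUCKETS];
        pthread_mutex_lock(&b->lock);     /* serializes only updates to this bucket */
        b->count++;
        pthread_mutex_unlock(&b->lock);
    }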

Concluding Remarks (§7.13)

• Goal: higher performance by using multiple processors
• Difficulties
  - Developing parallel software
  - Devising appropriate architectures
• Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower latency, higher bandwidth interconnect
• An ongoing challenge for computer architects!