Lecture 29: Parallel Programming Overview (Fall 2011)


Parallel Programming Paradigms: Various Methods

• There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.
  - Message Passing - the user makes calls to libraries to explicitly share information between processors.
  - Data Parallel - data partitioning determines parallelism.
  - Shared Memory - multiple processes sharing a common memory space.
  - Remote Memory Operation - a set of processes in which a process can access the memory of another process without its participation.
  - Threads - a single process having multiple (concurrent) execution paths.
  - Combined Models - composed of two or more of the above.
• Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given appropriate operating system support. An effective implementation is one which closely matches its target hardware and provides the user ease in programming.

Parallel Programming Paradigms: Message Passing

• The message passing model is defined as:
  - a set of processes using only local memory
  - processes communicate by sending and receiving messages
  - data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive)
• Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.

Parallel Programming Paradigms: Data Parallel

• The data parallel model is defined as:
  - Each process works on a different part of the same data structure
  - Commonly a Single Program Multiple Data (SPMD) approach
  - Data is distributed across processors
  - All message passing is done invisibly to the programmer
  - Commonly built "on top of" one of the common message passing libraries
• Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
• The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.
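
To make the last point concrete, here is a rough, hedged sketch (not from the original slides) of the kind of SPMD code a data parallel compiler might emit for an element-wise statement such as A = B + C over a block-distributed array; the array size, distribution, and names are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 100000   /* global array length (illustrative) */

/* Every process executes this same function; my_rank and nprocs would come
 * from the underlying message passing runtime (e.g., MPI). */
void local_add(double *a, double *b, double *c, int my_rank, int nprocs)
{
    int chunk = (N + nprocs - 1) / nprocs;       /* block size per process    */
    int lo = my_rank * chunk;                    /* first index owned locally */
    int hi = (lo + chunk < N) ? lo + chunk : N;

    /* A purely element-wise statement needs no communication at all:
     * each process just loops over the block it owns. */
    for (int i = lo; i < hi; i++)
        a[i] = b[i] + c[i];
}

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c);
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    local_add(a, b, c, 0, 4);                    /* pretend to be rank 0 of 4 */
    printf("a[10] = %f\n", a[10]);
    free(a); free(b); free(c);
    return 0;
}
```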

Implementation of Message Passing: MPI

• Message Passing Interface, often called MPI.
• A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, software writers, and application scientists.
• Available to both Fortran and C programs.
• Available on a wide variety of parallel machines.
• Target platform is a distributed memory system.
• All inter-task communication is by message passing.
• All parallelism is explicit: the programmer is responsible for parallelizing the program and implementing the MPI constructs.
• Programming model is SPMD (Single Program Multiple Data).
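
The slides describe MPI but do not show code; as a minimal sketch of the explicit, SPMD style it implies, the program below has every process run the same code, with rank 0 sending one integer to rank 1 (the value and message tag are arbitrary). Build with an MPI compiler wrapper such as mpicc and launch with mpirun/mpiexec.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

    if (rank == 0 && size > 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* explicit send    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* matching receive */
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```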

Implementations: F90 / High Performance Fortran (HPF)

• Fortran 90 (F90) - ISO/ANSI standard extensions to Fortran 77.
• High Performance Fortran (HPF) - extensions to F90 to support data parallel programming.
• Compiler directives allow programmer specification of data distribution and alignment.
• New compiler constructs and intrinsics allow the programmer to do computations and manipulations on data with different distributions.

Steps for Creating a Parallel Program

1. If you are starting with an existing serial program, debug the serial code completely.
2. Identify the parts of the program that can be executed concurrently:
   - Requires a thorough understanding of the algorithm
   - Exploit any inherent parallelism which may exist
   - May require restructuring of the program and/or algorithm
   - May require an entirely new algorithm
3. Decompose the program:
   - Functional Parallelism
   - Data Parallelism
   - Combination of both
4. Code development:
   - Code may be influenced/determined by machine architecture
   - Choose a programming paradigm
   - Determine communication
   - Add code to accomplish task control and communications
5. Compile, Test, Debug
6. Optimization:
   - Measure Performance
   - Locate Problem Areas
   - Improve them

Amdahl’s Law

• Speedup due to enhancement E is:

    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:

    Exec time w/ E = Exec time w/o E * ((1 - F) + F/S)

    Speedup w/ E = 1 / ((1 - F) + F/S)

Examples: Amdahl’s Law

• Amdahl’s Law tells us that to achieve linear speedup with 100 processors (e.g., a speedup of 100), none of the original computation can be scalar!
• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.
• What speedup could we achieve from 100 processors if 30% of the original program is scalar?

    Speedup w/ E = 1 / ((1 - F) + F/S) = 1 / (0.3 + 0.7/100) ≈ 3.3

• A serial program/algorithm might need to be restructured to allow for efficient parallelization.
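
The formula is easy to check mechanically; this small helper (not part of the original slides) just evaluates the Amdahl speedup for the fractions used in the examples above.

```c
#include <stdio.h>

/* Amdahl’s Law: a fraction f of the work is sped up by a factor s,
 * the remaining (1 - f) is unaffected. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* 30% scalar -> 70% of the work runs on 100 processors */
    printf("30%% scalar,   100 procs: %.2f\n", amdahl_speedup(0.70,   100.0));
    /* 0.01% scalar -> speedup is close to 99 with 100 processors */
    printf("0.01%% scalar, 100 procs: %.2f\n", amdahl_speedup(0.9999, 100.0));
    return 0;
}
```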

Decomposing the Program

• There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both.
• Functional Decomposition (Functional Parallelism)
  - Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution
  - Good to use when there is no static structure or fixed determination of the number of calculations to be performed
• Domain Decomposition (Data Parallelism)
  - Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
  - Good to use for problems where:
    - data is static (factoring and solving a large matrix or finite difference calculations)
    - a dynamic data structure is tied to a single entity where the entity can be subsetted (large multibody problems)
    - the domain is fixed but computation within various regions of the domain is dynamic (fluid vortices models)
  - There are many ways to decompose data into partitions to be distributed (see the sketch below):
    - One-dimensional data distribution: Block Distribution, Cyclic Distribution
    - Two-dimensional data distribution: Block Distribution, Block Cyclic Distribution, Cyclic Block Distribution
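
As a rough illustration of the one-dimensional distributions listed above, the hedged sketch below maps each global array index to its owning processor under a block and a cyclic distribution; the array length, processor count, and function names are illustrative assumptions.

```c
#include <stdio.h>

/* Block distribution: each processor owns one contiguous chunk of indices. */
static int block_owner(int i, int n, int nprocs)
{
    int chunk = (n + nprocs - 1) / nprocs;   /* ceiling(n / nprocs) */
    return i / chunk;
}

/* Cyclic distribution: indices are dealt out round-robin. */
static int cyclic_owner(int i, int nprocs)
{
    return i % nprocs;
}

int main(void)
{
    int n = 16, nprocs = 4;
    for (int i = 0; i < n; i++)
        printf("index %2d: block -> P%d, cyclic -> P%d\n",
               i, block_owner(i, n, nprocs), cyclic_owner(i, nprocs));
    return 0;
}
```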

Functional Decomposition of a Program

• Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution
• Good to use when there is no static structure or fixed determination of the number of calculations to be performed


Domain Decomposition (Data Parallelism)

• Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
• There are many ways to decompose data into partitions to be distributed

Summing 100,000 Numbers on 100 Processors

• Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + Al[i];        /* sum the local array subset */

• The processors then coordinate in adding together the sub sums (Pn is the number of the processor, send(x, y) sends value y to processor x, and receive() receives a value):

    half = 100;
    limit = 100;
    repeat
        half = (half + 1) / 2;    /* dividing line */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < (limit / 2)) sum = sum + receive();
        limit = half;
    until (half == 1);            /* final sum is in P0’s sum */
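
In a real message passing library the coordination loop above is usually provided as a collective operation; purely as a hedged sketch, an MPI version of the same computation could compute the local sums and combine them with MPI_Reduce, leaving the total in process 0 (the local data setup here is an illustrative stand-in).

```c
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1000   /* 100 processes * 1000 elements = 100,000 numbers */

int main(int argc, char **argv)
{
    double a[LOCAL_N], local_sum = 0.0, total = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_N; i++)   /* stand-in for the distributed subset of A */
        a[i] = 1.0;

    for (int i = 0; i < LOCAL_N; i++)   /* sum the local subset */
        local_sum += a[i];

    /* MPI performs the tree-style coordination that the send/receive
     * pseudocode above spells out by hand. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```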

An Example with 10 Processors

(Figure: the per-processor partial sums in P0 through P9 are combined pairwise, with half starting at 10; since half = (half + 1)/2, it takes the values 10, 5, 3, 2, 1, and the final sum ends up in P0.)


Cannon's Matrix Multiplication (figure slides)
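
The deck illustrates Cannon's algorithm only with figures; the hedged serial simulation below sketches the idea by treating each matrix element as the block held by one process on an N x N grid: skew A's rows and B's columns, then repeatedly multiply the co-resident blocks and shift A left and B up. Sizes and test values are illustrative.

```c
#include <stdio.h>

#define N 4   /* N x N grid of "processes", one matrix element (block) each */

/* Rotate row i of m left by one (simulates passing an A block to the left neighbor). */
static void shift_row_left(double m[N][N], int i)
{
    double first = m[i][0];
    for (int j = 0; j < N - 1; j++) m[i][j] = m[i][j + 1];
    m[i][N - 1] = first;
}

/* Rotate column j of m up by one (simulates passing a B block to the upper neighbor). */
static void shift_col_up(double m[N][N], int j)
{
    double first = m[0][j];
    for (int i = 0; i < N - 1; i++) m[i][j] = m[i + 1][j];
    m[N - 1][j] = first;
}

int main(void)
{
    double a[N][N], b[N][N], c[N][N] = {{0}};

    for (int i = 0; i < N; i++)                       /* arbitrary test matrices */
        for (int j = 0; j < N; j++) { a[i][j] = i + 1; b[i][j] = j + 1; }

    /* Initial skew: row i of A moves left by i, column j of B moves up by j. */
    for (int i = 0; i < N; i++)
        for (int s = 0; s < i; s++) shift_row_left(a, i);
    for (int j = 0; j < N; j++)
        for (int s = 0; s < j; s++) shift_col_up(b, j);

    /* N rounds: each "process" multiplies the blocks it currently holds, then shifts. */
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][j] * b[i][j];
        for (int i = 0; i < N; i++) shift_row_left(a, i);
        for (int j = 0; j < N; j++) shift_col_up(b, j);
    }

    for (int i = 0; i < N; i++) {                     /* print C = A * B */
        for (int j = 0; j < N; j++) printf("%6.1f ", c[i][j]);
        printf("\n");
    }
    return 0;
}
```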

Review: Multiprocessor Basics

• Q1 - How do they share data?
• Q2 - How do they coordinate?
• Q3 - How scalable is the architecture? How many processors?

                                            # of Proc
  Communication    Message passing          8 to 2048
  model            Shared        NUMA       8 to 256
                   address       UMA        2 to 64
  Physical         Network                  8 to 256
  connection       Bus                      2 to 36

Review: Bus Connected SMPs (UMAs)

(Figure: processors, each with a cache, connected by a single bus to memory and I/O.)

• Caches are used to reduce latency and to lower bus traffic.
• Must provide hardware for cache coherence and process synchronization.
• Bus traffic and bandwidth limit scalability (< ~36 processors).

Network Connected Multiprocessors

(Figure: processor/cache/memory nodes attached to an interconnection network (IN).)

• Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message passing communication via sends and receives.
  - The interconnection network supports interprocessor communication.

Communication in Network Connected Multi’s

• Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is demanded rather than to send it in case it might be used (such a machine has distributed shared memory (DSM))
• Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
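
Purely as a hedged illustration of the contrast (not code from the slides), the two routines below access a value held in another node's memory: on a shared address space machine it is just a load, while with message passing the owner must post a matching send for the explicit receive to complete.

```c
#include <mpi.h>

/* Shared address space (e.g., NUMA/DSM): communication is implicit.
 * Reading a location that lives in another node's memory is an ordinary load;
 * coherent caches and the interconnect move the data on demand. */
double read_shared(volatile double *shared_array, int i)
{
    return shared_array[i];
}

/* Message passing: communication is explicit.
 * This only completes if the owning process executes a matching MPI_Send. */
double read_remote(int owner_rank)
{
    double x;
    MPI_Recv(&x, 1, MPI_DOUBLE, owner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return x;
}
```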

Cache Coherency in NUMAs

• For performance reasons we want to allow shared data to be stored in caches.
• Once again we have multiple copies of the same data, with the same address, in different processors.
  - Bus snooping won’t work, since there is no single bus on which all memory references are broadcast.
• Directory-based protocols:
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
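
As a rough sketch of what such a directory tracks per memory block (an illustrative data layout, not the protocol of any particular machine), an entry can hold the block's state plus a bit vector of sharers that the controller consults before sending commands over the IN.

```c
#include <stdint.h>

#define MAX_NODES 64

/* Coherence state of one memory block, as seen by the directory. */
enum block_state {
    UNCACHED,   /* no cache holds a copy                           */
    SHARED,     /* one or more caches hold clean, read-only copies */
    MODIFIED    /* exactly one cache holds a dirty copy            */
};

/* One directory entry per block. On a real machine the entries are
 * distributed with memory, so each block's status lives at its home node. */
struct dir_entry {
    enum block_state state;
    uint64_t sharers;       /* bit i set => node i has a copy (MAX_NODES <= 64) */
};

/* The kind of query the directory controller answers before issuing
 * invalidate or fetch commands to individual processors. */
static int node_has_copy(const struct dir_entry *e, int node)
{
    return (e->sharers >> node) & 1u;
}

int main(void)
{
    struct dir_entry e = { SHARED, 0 };
    e.sharers |= 1ull << 3;                 /* node 3 caches the block  */
    return node_has_copy(&e, 3) ? 0 : 1;    /* exits 0: node 3 has a copy */
}
```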

IN Performance Metrics

• Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
• Network bandwidth (NB) - represents the best case
  - bandwidth of each link * number of links
• Bisection bandwidth (BB) - represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
• Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput - maximum # of messages transmitted per unit time
  - # routing hops worst case, congestion control and delay

Bus IN

(Figure: processor nodes connected to one bidirectional network switch, the bus.)

• N processors, 1 switch, 1 link (the bus)
• Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1

Ring IN

• N processors, N switches, 2 links/switch, N links
• N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
• If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case.

Fully Connected IN

• N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
• N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2

Crossbar (Xbar) Connected IN

• N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
• N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

Hypercube (Binary N-cube) Connected IN

(Figure: 2-cube and 3-cube examples.)

• N processors, N switches, log N links/switch, (N log N)/2 links
• N simultaneous transfers
  - NB = link bandwidth * (N log N)/2
  - BB = link bandwidth * N/2

2D and 3D Mesh/Torus Connected IN

• N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 or 6N/2 links
• N simultaneous transfers
  - NB = link bandwidth * 4N   or   link bandwidth * 6N
  - BB = link bandwidth * 2N^(1/2)   or   link bandwidth * 2N^(2/3)

Fat Tree

• Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.

(Figure: a binary tree network connecting leaf nodes A, B, C, and D.)

• Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times.
• The solution is to 'thicken' the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth.
  - Rather than design a bunch of N-port switches, use pairs.

Fat Tree

• N processors, log(N-1)*log N switches, 2 up + 4 down = 6 links/switch, N*log N links
• N simultaneous transfers
  - NB = link bandwidth * N log N
  - BB = link bandwidth * 4
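
The NB and BB formulas on the preceding slides are straightforward to tabulate; the helper below (not part of the slides) simply evaluates them, in units of the per-link bandwidth, for a chosen N such as the 64-processor system compared later.

```c
#include <stdio.h>
#include <math.h>

/* Print network bandwidth (NB) and bisection bandwidth (BB) in units of
 * the per-link bandwidth, using the formulas from the slides. */
static void show(const char *name, double nb, double bb)
{
    printf("%-16s NB = %8.0f links   BB = %8.1f links\n", name, nb, bb);
}

int main(void)
{
    double n = 64.0;            /* number of processors */
    double lg = log2(n);        /* log N, here 6        */

    show("bus",             1.0,                  1.0);
    show("ring",            n,                    2.0);
    show("2D torus",        4.0 * n,              2.0 * sqrt(n));
    show("6-cube",          n * lg / 2.0,         n / 2.0);
    show("fully connected", n * (n - 1.0) / 2.0,  (n / 2.0) * (n / 2.0));
    return 0;
}
```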

SGI NUMAlink Fat Tree (figure) - www.embedded-computing.com/articles/woodacre

IN Comparison

• For a 64 processor system:

                          Bus   Ring   Torus   6-cube   Fully connected
  Network bandwidth        1
  Bisection bandwidth      1
  Total # of switches      1
  Links per switch
  Total # of links         1

Network Connected Multiprocessors

                    Proc              Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
  SGI Origin        R16000                         128        fat tree                      800
  Cray T3E          Alpha 21164       300 MHz      2,048      3D torus                      600
  Intel ASCI Red    Intel             333 MHz      9,632      mesh                          800
  IBM ASCI White    Power3            375 MHz      8,192      multistage Omega              500
  NEC ES            SX-5              500 MHz      640*8      640-xbar                      16000
  NASA Columbia     Intel Itanium 2   1.5 GHz      512*20     fat tree, Infiniband
  IBM BG/L          PowerPC 440       0.7 GHz      65,536*2   3D torus, fat tree, barrier

IBM BlueGene

                   512-node proto            BlueGene/L
  Peak Perf        1.0 / 2.0 TFlops/s        180 / 360 TFlops/s
  Memory Size      128 GByte                 16 / 32 TByte
  Foot Print       9 sq feet                 2,500 sq feet
  Total Power      9 KW                      1.5 MW
  # Processors     512 dual proc             65,536 dual proc
  Networks         3D Torus, Tree, Barrier   3D Torus, Tree, Barrier
  Torus BW         3 B/cycle                 3 B/cycle

A BlueGene/L Chip

(Figure: chip block diagram. Two 700 MHz PowerPC 440 CPUs, each with 32K/32K L1 caches and a double FPU, connect through 2KB L2 caches and a 16KB multiport SRAM buffer to a shared 4MB ECC eDRAM L3 (128B lines, 8-way associative) and a 144b DDR controller with 256MB of memory; labeled datapath bandwidths include 11 GB/s and 5.5 GB/s. Integrated network interfaces: 3D torus (6 in, 6 out, 1.4 Gb/s links), fat tree (3 in, 3 out, 2.8 Gb/s links), barrier network (4 global barriers), and Gbit Ethernet.)

Networks of Workstations (NOWs): Clusters

• Clusters of off-the-shelf, whole computers with multiple private address spaces.
• Clusters are connected using the I/O bus of the computers:
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
• Clusters of N processors have N copies of the OS, limiting the memory available for applications.
• Improved system availability and expandability:
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
• Economy-of-scale advantages with respect to costs.

Commercial (NOW) Clusters

                   Proc              Proc Speed   # Proc    Network
  Dell PowerEdge   P4 Xeon           3.06 GHz     2,500     Myrinet
  eServer IBM SP   Power4            1.7 GHz      2,944
  VPI BigMac       Apple G5          2.3 GHz      2,200     Mellanox Infiniband
  HP ASCI Q        Alpha 21264       1.25 GHz     8,192     Quadrics
  LLNL Thunder     Intel Itanium 2   1.4 GHz      1,024*4   Quadrics
  Barcelona        PowerPC 970       2.2 GHz      4,536     Myrinet

Summary

• Flynn’s classification of processors - SISD, SIMD, MIMD
  - Q1 - How do processors share data?
  - Q2 - How do processors coordinate their activity?
  - Q3 - How scalable is the architecture (what is the maximum number of processors)?
• Shared address multis - UMAs and NUMAs
  - Scalability of bus connected UMAs limited (< ~36 processors)
  - Network connected NUMAs more scalable
  - Interconnection Networks (INs): fully connected, xbar, ring, mesh, n-cube, fat tree
• Message passing multis
• Cluster connected (NOWs) multis