CSE 431 Computer Architecture Fall 2005 Lecture 25

CSE 431 Computer Architecture, Fall 2005, Lecture 25: Intro to Multiprocessors
Mary Jane Irwin (www.cse.psu.edu/~mji), www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

The Big Picture: Where are We Now?
[Figure: two processors, each with control and datapath, connected to memory and input/output]
• Multiprocessor – multiple processors with a single shared address space
• Cluster – multiple computers (each with its own address space) connected over a local area network (LAN) functioning as a single system

Applications Needing “Supercomputing”
• Energy (plasma physics (simulating fusion reactions), geophysical (petroleum) exploration)
• DoE stockpile stewardship (to ensure the safety and reliability of the nation’s stockpile of nuclear weapons)
• Earth and climate (climate and weather prediction, earthquake and tsunami prediction and mitigation of risks)
• Transportation (improving vehicles’ airflow dynamics, fuel consumption, crashworthiness, noise reduction)
• Bioinformatics and computational biology (genomics, protein folding, designer drugs)
• Societal health and safety (pollution reduction, disaster planning, terrorist action detection)
http://www.nap.edu/books/0309095026/html/

Encountering Amdahl’s Law
• Speedup due to an enhancement E is
    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
    Exec time w/ E = Exec time w/o E × ((1 - F) + F/S)
    Speedup w/ E = 1 / ((1 - F) + F/S)

Examples: Amdahl’s Law
    Speedup w/ E = 1 / ((1 - F) + F/S)
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
    Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
• What if it’s usable only 15% of the time?
    Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
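
As a quick check of the arithmetic above, here is a minimal sketch in C that evaluates the same speedup formula; the function name amdahl_speedup and the printed cases are illustrative additions, not part of the lecture.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - F) + F/S), where F is the fraction
     * of the task the enhancement applies to and S is its speedup factor. */
    static double amdahl_speedup(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        /* The two worked examples from the slide. */
        printf("F = 0.25, S = 20:    speedup = %.2f\n", amdahl_speedup(0.25, 20.0));   /* ~1.31 */
        printf("F = 0.15, S = 20:    speedup = %.2f\n", amdahl_speedup(0.15, 20.0));   /* ~1.17 */

        /* Speedup-of-99 claim: only 0.01% of the work may remain scalar. */
        printf("F = 0.9999, S = 100: speedup = %.2f\n", amdahl_speedup(0.9999, 100.0)); /* ~99 */
        return 0;
    }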

Supercomputer Style Migration (Top 500)
http://www.top500.org/lists/2005/11/ (November 2005 data)
• Cluster – whole computers interconnected using their I/O bus
• Constellation – a cluster that uses an SMP multiprocessor as the building block
• In the last 8 years uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%

Multiprocessor/Clusters Key Questions
• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors can be supported?

Flynn’s Classification Scheme
• SISD – single instruction, single data stream
  - aka uniprocessor – what we have been talking about all semester
• SIMD – single instruction, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
  - now obsolete except for . . .
• MISD – multiple instruction, single data
  - no such machine (although some people put vector machines in this category)
• MIMD – multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

SIMD Processors
[Figure: a single control unit driving an array of PEs]
• Single control unit
• Multiple datapaths (processing elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
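
To make the lockstep model concrete, here is a small illustrative sketch in C (not from the lecture): each “PE” is modeled as one slot of an array, and a single loop applies the identical broadcast operation to every PE’s local data; on real SIMD hardware the iterations would execute in parallel, one per PE.

    #include <stdio.h>

    #define NUM_PES 8   /* hypothetical number of processing elements */

    int main(void)
    {
        /* Each PE holds one element of "local" data. */
        int local[NUM_PES] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* The control unit broadcasts one operation (multiply by 2, add 1);
         * every PE applies it to its own local data in lockstep. */
        for (int pe = 0; pe < NUM_PES; pe++)
            local[pe] = local[pe] * 2 + 1;

        for (int pe = 0; pe < NUM_PES; pe++)
            printf("PE %d: %d\n", pe, local[pe]);
        return 0;
    }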

Example SIMD Machines

System    | Maker             | Year | # PEs  | # b/PE | Max memory (MB) | PE clock (MHz) | System BW (MB/s)
Illiac IV | UIUC              | 1972 | 64     | 64     | 1               | 13             | 2,560
DAP       | ICL               | 1980 | 4,096  | 1      | 2               | 5              | 2,560
MPP       | Goodyear          | 1982 | 16,384 | 1      | 2               | 10             | 20,480
CM-2      | Thinking Machines | 1987 | 65,536 | 1      | 512             | 7              | 16,384
MP-1216   | MasPar            | 1989 | 16,384 | 4      | 1024            | 25             | 23,000

Multiprocessor Basic Organizations
• Processors connected by a single bus
• Processors connected by a network

Characteristic       | Option               | # of Proc
Communication model  | Message passing      | 8 to 2048
Communication model  | Shared address, NUMA | 8 to 256
Communication model  | Shared address, UMA  | 2 to 64
Physical connection  | Network              | 8 to 256
Physical connection  | Bus                  | 2 to 36

Shared Address (Shared Memory) Multi’s
• Q1 – Single address space shared by all the processors
• Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - use of shared data must be coordinated via synchronization primitives (locks)
• UMAs (uniform memory access) – aka SMPs (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
• NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others depending on the processor making the request and which location is requested
  - can scale to larger sizes than UMAs, so are potentially higher performance
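
As a minimal sketch (not from the lecture) of what “coordinate through shared variables plus a lock” looks like in practice, the following C code uses POSIX threads; the variable names, thread count, and iteration count are arbitrary choices for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static long shared_count = 0;                              /* shared variable in memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* synchronization primitive */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);      /* coordinate access to the shared data */
            shared_count++;                 /* plain load/store to shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        printf("shared_count = %ld\n", shared_count);   /* 400000 with the lock held */
        return 0;
    }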

N/UMA Remote Memory Access Times (RMAT)

System                | Year | Type | Max Proc | Interconnection Network    | RMAT (ns)
Sun Starfire          | 1996 | SMP  | 64       | Address buses, data switch | 500
Cray T3E              | 1996 | NUMA | 2048     | 2-way 3D torus             | 300
HP V                  | 1998 | SMP  | 32       | 8x8 crossbar               | 1000
SGI Origin 3000       | 1999 | NUMA | 512      | Fat tree                   | 500
Compaq AlphaServer GS | 1999 | SMP  | 32       | Switched bus               | 400
Sun V880              | 2002 | SMP  | 8        | Switched bus               | 240
HP Superdome 9000     | 2003 | SMP  | 64       | Switched bus               | 275
NASA Columbia         | 2004 | NUMA | 10240    | Fat tree                   | ???

Single Bus (Shared Address UMA) Multi’s
[Figure: several processors, each with its own cache, attached to a single bus along with memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware to ensure that caches and memory are consistent (cache coherency) – covered in Lecture 26
• Must provide a hardware mechanism to support process synchronization – covered in Lecture 26

Summing 100,000 Numbers on 100 Processors
• Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

• The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors)

    repeat
        synch();                               /* synchronize first */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];     /* if half is odd, P0 picks up the extra element */
        half = half / 2;
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);                         /* final sum is in sum[0] */
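
Below is a minimal runnable sketch (not from the lecture) of the same partial-sum-plus-reduction idea in C with POSIX threads, scaled down to 10 “processors” as in the next slide; a pthread barrier stands in for synch(), and the names and data are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define P 10                         /* number of "processors" (threads) */
    #define N 100000                     /* number of values to sum */

    static int A[N];
    static long sum[P];                  /* shared partial sums */
    static pthread_barrier_t barrier;    /* plays the role of synch() */

    static void *work(void *arg)
    {
        int Pn = (int)(long)arg;         /* this thread's processor number */
        int chunk = N / P;

        /* Phase 1: each processor sums its own subset of A. */
        sum[Pn] = 0;
        for (int i = chunk * Pn; i < chunk * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction of the partial sums, as in the slide. */
        int half = P;
        do {
            pthread_barrier_wait(&barrier);      /* synchronize first */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];         /* odd count: P0 grabs the extra element */
            half = half / 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (int i = 0; i < N; i++) A[i] = 1;    /* expected total: 100000 */
        pthread_barrier_init(&barrier, NULL, P);

        for (long i = 0; i < P; i++)
            pthread_create(&t[i], NULL, work, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(t[i], NULL);

        printf("total = %ld\n", sum[0]);         /* final sum is in sum[0] */
        pthread_barrier_destroy(&barrier);
        return 0;
    }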

An Example with 10 Processors
[Figure: reduction tree over the ten partial sums sum[P0]..sum[P9] — half = 10: P0..P9 each hold a partial sum; half = 5: P0..P4 combine pairs; half = 2: P0 and P1 combine; half = 1: P0 holds the final result in sum[0]]

Message Passing Multiprocessors
• Each processor has its own private address space
• Q1 – Processors share data by explicitly sending and receiving information (messages)
• Q2 – Coordination is built into message passing primitives (send and receive)
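
For contrast with the shared-memory sketches above, here is a minimal illustrative example (not from the lecture) of explicit send/receive coordination in C using MPI, a standard message passing library; the tag value and data are arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* each process has its own rank and address space */

        if (rank == 0) {
            value = 42;
            /* Q1: data is shared only by explicitly sending it. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Q2: the receive both delivers the data and coordinates --
             * process 1 blocks here until process 0's message arrives. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 2 ./a.out; note that each process keeps its own private copy of value.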

Summary
• Flynn’s classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
• Shared address multis – UMAs and NUMAs
• Bus connected (shared address UMA) multis
  - cache coherency hardware to ensure data consistency
  - synchronization primitives for process synchronization
  - bus traffic limits scalability of the architecture (< ~36 processors)
• Message passing multis

Next Lecture and Reminders
• Next lecture - Reading assignment – PH 9.3
• Reminders
  - HW 5 (and last) due Dec 1st (Part 2), Dec 6th (Part 1)
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (no conflicts scheduled) - Tuesday, December 13th, 2:30-4:20, 22 Deike