ECE 259 / CPS 221 Advanced Computer Architecture II

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture)

Introduction

Copyright 2004 Daniel J. Sorin, Duke University. Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

General Course Information
• Professor: Daniel J. Sorin
  – sorin@ee.duke.edu
  – http://www.ee.duke.edu/~sorin
• Course info
  – http://www.ee.duke.edu/~sorin/ece259/
• Office hours
  – 1111 Hudson Hall
  – Times??

Course Objectives
• Learn about parallel computer architecture
• Learn how to read/evaluate research papers
• Learn how to perform research
• Learn how to present research

Course Guidelines
• Students are responsible for:
  – Leading discussions of research papers - 15% of grade
  – Participating in class discussions - 15% of grade
  – Final exam - 20% of grade
  – Individual or group project - 50% of grade
• Leading paper discussions
  – Summarize the paper
  – Handle questions from class & ask questions of class
  – What was good about the paper
  – What was bad or lacking or confusing

Project
• The project is a semester-long assignment; the goal is to end up no more than "a stone's throw" away from a research paper.
  – Written proposal (no more than 3 pages), due Feb 23
  – Oral presentation of proposal (in class), March 1
  – Written progress report (<= 3 pages), March 29
  – Final document in conference/journal format (<= 12 pages), Apr 14
  – Final presentation (in class), Apr 14/16
• Groups of 2 or 3 are OK
• Get started early! Talk to me about project ideas.

Academic Misconduct
• I will not tolerate academically dishonest work. This includes cheating on the final exam and plagiarism on the project.
• Be careful on the project to cite prior work and to give proper credit to others' research.
• Ask me if you have any questions. Not knowing the rules does not make misconduct OK.

Course Topics
• Parallel programming
• Machine organizations
• Scalable, non-coherent machines
• Cache-coherent shared memory machines
• Memory consistency models
• Interconnection networks
• Evaluation tools and methodology
• High availability systems
• Interactions with processors and I/O
• Novel architectures
• New technology

Outline for Intro to Multiprocessing
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
  – Shared Memory, Message Passing, Data Parallel
• Issues in Programming Models
  – Function: naming, operations, & ordering
  – Performance: latency, bandwidth, etc.
• Examples of Commercial Machines

Motivation
• ECE 252 / CPS 220: computers with 1 processor
• This course: use N processors in a computer to get
  – Higher Throughput via many jobs in parallel
  – Improved Cost-Effectiveness (e.g., adding 3 processors may yield 4X throughput for 2X system cost)
  – Lower Latency from shrink-wrapped software (e.g., databases and web servers)
  – Lower Latency through parallelization of your application (but this is hard)
• Need more performance than today's microprocessor?
  – Wait for tomorrow's microprocessor
  – Use many microprocessors in parallel

Applications: Science and Engineering
• Examples
  – Weather prediction
  – Evolution of galaxies
  – Oil reservoir simulation
  – Automobile crash tests
  – Drug development
  – VLSI CAD
  – Nuclear bomb simulation
• Typically model physical systems or phenomena
• Problems are 2D or 3D
• Usually require "number crunching"
• Involve "true" parallelism

Applications: Commercial
• Examples
  – On-line transaction processing (OLTP)
  – Decision support systems (DSS)
  – "Application servers" (e.g., WebSphere)
• Involve data movement, not much number crunching
  – OLTP has many small queries
  – DSS has fewer, larger queries
• Involve throughput parallelism
  – Inter-query parallelism for OLTP
  – Intra-query parallelism for DSS

Applications: Multi-media/home
• Examples
  – Speech recognition
  – Data compression/decompression
  – 3D graphics
• Becoming ubiquitous
• Involve everything (crunching, data movement, true parallelism, and throughput parallelism)

So Why Is MP Research Disappearing?
• Where is MP research presented?
  – ISCA = International Symposium on Computer Architecture
  – ASPLOS = Arch. Support for Programming Languages and OS
  – HPCA = High Performance Computer Architecture
  – SPAA = Symposium on Parallel Algorithms and Architecture
  – ICS = International Conference on Supercomputing
  – PACT = Parallel Architectures and Compilation Techniques
  – Etc.
• Why is the number of MP papers at ISCA falling?
• Read "The Rise and Fall of MP Research"
  – Mark D. Hill and Ravi Rajwar (Univ. of Wisconsin)
  – On reading list (linked off course website)

Outline
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
• Issues in Programming Models
• Examples of Commercial Machines

In Theory
• Sequential
  – Time to sum n numbers? O(n)
  – Time to sort n numbers? O(n log n)
  – What model? RAM
• Parallel
  – Time to sum? Tree reduction gives O(log n) (sketch below)
  – Time to sort? Non-trivially, O(log n)
  – What model?
    » PRAM [Fortune, Wyllie STOC '78]
    » P processors in lock-step
    » One memory (e.g., CREW for concurrent read, exclusive write)
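To make the tree-sum bound concrete, here is a minimal C sketch of the pairwise reduction; the function name and the power-of-two restriction are assumptions for illustration. Each pass of the outer loop is one PRAM step: all additions within a pass touch disjoint elements, so n processors could run them simultaneously, giving log2(n) passes.

    #include <stdio.h>

    /* Tree-based sum: in each round, element i absorbs element
     * i + stride.  All additions within a round are independent,
     * so a PRAM runs each round in one step: O(log n) total. */
    int tree_sum(int *a, int n) {                     /* n: power of two */
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 0; i < n; i += 2 * stride)   /* parallel on a PRAM */
                a[i] += a[i + stride];
        return a[0];
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", tree_sum(a, 8));               /* prints 36 */
    }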

But in Practice, How Do You
• Name a datum (e.g., all say a[2])?
• Communicate values?
• Coordinate and synchronize?
• Select processing node size (few-bit ALU to a PC)?
• Select number of nodes in system?

Historical View
[Figure: three organizations of processors (P), memories (M), and I/O, differing in where the nodes join]
• Join at I/O (Network) → program with Message Passing
• Join at Memory → program with Shared Memory
• Join at Processor → program with Data Parallel / Single-Instruction Multiple-Data (SIMD), (Dataflow/Systolic)

Historical View, cont.
• Machine → Programming Model
  – Join at network, so program with message passing model
  – Join at memory, so program with shared memory model
  – Join at processor, so program with SIMD or data parallel
• Programming Model → Machine
  – Message-passing programs on message-passing machine
  – Shared-memory programs on shared-memory machine
  – SIMD/data-parallel programs on SIMD/data-parallel machine
• But
  – Isn't hardware basically the same? Processors, memory, & I/O?
  – Convergence! Why not have a generic parallel machine and program with the model that fits the problem?

A Generic Parallel Machine
[Figure: Nodes 0-3 on an interconnect; each node has processor(s) (P), cache ($), memory (Mem), and a communication assist (CA)]
• Separation of programming models from architectures
• All models require communication
• Node with processor(s), memory, communication assist

Outline
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
  – Shared Memory
  – Message Passing
  – Data Parallel
• Issues in Programming Models
• Examples of Commercial Machines

Programming Model
• How does the programmer view the system?
• Shared memory: processors execute instructions and communicate by reading/writing a globally shared memory
• Message passing: processors execute instructions and communicate by explicitly sending messages
• Data parallel: processors do the same instructions at the same time, but on different data

Today's Parallel Computer Architecture
• Extension of traditional computer architecture to support communication and cooperation
  – Communications architecture
[Figure: layered communications architecture: user-level programming models (multiprogramming, shared memory, message passing, data parallel) sit above the communication abstraction; libraries and compilers plus operating system support implement it at system level; below the hardware/software boundary sit the communication hardware and the physical communication medium]
Simple Problem

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

• How do I make this parallel?
Simple Problem

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

• Split the loops
  » Independent iterations

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
for i = 1 to N
  sum = sum + A[i]

• Data flow graph?
Data Flow Graph
[Figure: data flow graph for the split loops: each A[i], B[i] pair feeds a + node, each result is multiplied by C[i], and the products feed a chain of + nodes accumulating sum]
• 2 + N-1 cycles to execute on N processors
• But with what assumptions?
Partitioning of Data Flow Graph
[Figure: the same data flow graph partitioned across processors, with a global synch separating the per-processor work from the final accumulation of sum]

Programming Model
• Historically, Programming Model == Architecture
• Thread(s) of control that operate on data
• Provides a communication abstraction that is a contract between hardware and software (a la ISA)
• Current Programming Models
  1) Shared Memory (Shared Address Space)
  2) Message Passing
  3) Data Parallel
  4) (Data Flow)

Global Shared Physical Address Space
[Figure: machine physical address space with a shared portion (common physical addresses that any processor P0...Pn can load/store) and a private portion per processor]
• Communication, sharing, and synchronization with loads/stores on shared variables
• Must map virtual pages to physical page frames
• Consider OS support for good mapping

Page Mapping in Shared Memory MP
[Figure: four nodes on an interconnect, each with P, $, Mem, and NI; Node 0 is home for addresses 0 to N-1, Node 1 for N to 2N-1, Node 2 for 2N to 3N-1, Node 3 for 3N to 4N-1]
• Keep private data and frequently used shared data on the same node as the computation
• A load by Processor 0 to address N+3 goes to Node 1 (see the lookup sketch below)
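The block mapping in the figure is simple enough to write down; the following one-liner is a hypothetical illustration, and the function and parameter names are not from the slide:

    /* Block-interleaved home lookup: node k is home for physical
     * addresses [k*N, (k+1)*N - 1]. */
    int home_node(long addr, long n_per_node) {
        return (int)(addr / n_per_node);   /* home_node(N+3, N) == 1 */
    }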
Return of The Simple Problem

private int i, my_start, my_end, mynode;
shared float A[N], B[N], C[N], sum;

for i = my_start to my_end
  A[i] = (A[i] + B[i]) * C[i]
GLOBAL_SYNCH;
if (mynode == 0)
  for i = 1 to N
    sum = sum + A[i]

• Can run this on any shared memory machine
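As one concrete (and hypothetical) rendering of this pseudocode, a POSIX threads sketch might look as follows; the barrier stands in for GLOBAL_SYNCH, and the values of N and P and the array initialization are assumptions:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1024
    #define P 4

    float A[N], B[N], C[N], sum;   /* shared: visible to all threads */
    pthread_barrier_t barrier;

    void *worker(void *arg) {
        int mynode = (int)(long)arg;                 /* private */
        int my_start = mynode * (N / P);
        int my_end   = my_start + N / P;

        for (int i = my_start; i < my_end; i++)
            A[i] = (A[i] + B[i]) * C[i];

        pthread_barrier_wait(&barrier);              /* GLOBAL_SYNCH */

        if (mynode == 0)                             /* serial sum on node 0 */
            for (int i = 0; i < N; i++)
                sum = sum + A[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        pthread_barrier_init(&barrier, NULL, P);
        for (long p = 0; p < P; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (int p = 0; p < P; p++)
            pthread_join(t[p], NULL);
        printf("sum = %f\n", sum);
    }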

Message Passing Architectures
[Figure: four nodes on an interconnect, each with P, $, Mem, and CA; every node has its own private address space 0 to N-1]
• Cannot directly access memory on another node
• IBM SP-2, Intel Paragon
• Cluster of workstations

Message Passing Programming Model
[Figure: Process P executes Send x, Q, t while Process Q executes Recv y, P, t; the message from local address x in P's address space is matched by tag and delivered to local address y in Q's address space]
• User-level send/receive abstraction
  – local buffer (x, y), process (P, Q), and tag (t)
  – naming and synchronization
The Simple Problem Again

int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;

for i = 1 to N/P
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
if (mynode != 0)
  send(sum, 0);
if (mynode == 0)
  for i = 1 to P-1
    recv(tmp, i)
    sum = sum + tmp

• Send/Recv communicates and synchronizes
• P processors
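In MPI, the dominant message-passing library today (though not what the slide assumes), the same pattern might read as below; the slice size NP and the data initialization are assumptions, with send/recv mapped onto MPI_Send/MPI_Recv:

    #include <mpi.h>
    #include <stdio.h>

    #define NP 256        /* N/P elements per node (assumption) */

    int main(int argc, char **argv) {
        float A[NP], B[NP], C[NP], sum = 0.0f, tmp;
        int mynode, P;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        /* ... each node fills its private A, B, C slices ... */

        for (int i = 0; i < NP; i++) {       /* purely local work */
            A[i] = (A[i] + B[i]) * C[i];
            sum = sum + A[i];
        }

        if (mynode != 0) {                   /* ship partial sum to node 0 */
            MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        } else {
            for (int i = 1; i < P; i++) {    /* node 0 collects P-1 partials */
                MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum = sum + tmp;
            }
            printf("sum = %f\n", sum);
        }
        MPI_Finalize();
    }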

Separation of Architecture from Model
• At the lowest level, shared memory sends messages
  – HW is specialized to expedite read/write messages
• What programming model / abstraction is supported at user level?
• Can I have shared-memory abstraction on message passing HW?
• Can I have message passing abstraction on shared memory HW?
• Some 1990s research machines integrated both
  – MIT Alewife, Wisconsin Tempest/Typhoon, Stanford FLASH

Data Parallel
• Programming Model
  – Operations are performed on each element of a large (regular) data structure in a single step
  – Arithmetic, global data transfer
• A processor is logically associated with each data element
  – SIMD architecture: Single Instruction, Multiple Data
• Early architectures mirrored the programming model
  – Many bit-serial processors
• Today we have FP units and caches on microprocessors
• Can support data parallel model on shared memory or message passing architecture

The Simple Problem Strikes Back

Assuming we have N processors:

A = (A + B) * C
sum = global_sum(A)

• Language supports array assignment
• Special HW support for global operations
• CM-2: bit-serial processors
• CM-5: 32-bit SPARC processors
  – Message Passing and Data Parallel models
  – Special control network
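C has no array assignment, but an OpenMP sketch approximates the data-parallel style: one logical operation applied across all elements, with a reduction standing in for global_sum. This is an analogy running on a shared-memory model, not CM-2/CM-5 code, and the function name is invented:

    /* Data-parallel style approximated in C + OpenMP: the same
     * operation on every element, plus a global reduction. */
    float simple_data_parallel(float *A, float *B, float *C, int n) {
        float sum = 0.0f;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            A[i] = (A[i] + B[i]) * C[i];   /* A = (A + B) * C */
            sum += A[i];                   /* global_sum(A)   */
        }
        return sum;
    }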

Aside - Single Program, Multiple Data
• SPMD Programming Model
  – Each processor executes the same program on different data
  – Many Splash-2 benchmarks are written in the SPMD model

for each molecule at this processor {
  simulate interactions with myproc+1 and myproc-1;
}

• Not connected to SIMD architectures
  – Not lockstep instructions
  – Could execute different instructions on different processors
    » Data-dependent branches cause divergent paths
Data Flow Architectures
[Figure: the Simple Problem's data flow graph again]
• Execute Data Flow Graph
• No control sequencing

Data Flow Architectures
• Explicitly represent data dependencies (Data Flow Graph)
• No artificial constraints, like sequencing instructions!
  – Early machines had no registers or cache
• Instructions can "fire" when operands are ready (toy sketch below)
  – Remember Tomasulo's algorithm
• How do we know when operands are ready?
  – Matching store: large associative search!
• Later machines moved to coarser grain (threads)
  – Allowed registers and cache for local computation
  – Introduced messages (with operations and operands)
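The firing rule can be made concrete with a toy C sketch; this is purely illustrative (the types and names are invented here), since a real dataflow machine performs the operand match in the hardware matching store:

    /* An instruction fires when both operand slots are full. */
    typedef struct {
        float operand[2];
        int   present[2];             /* which operands have arrived */
        float (*op)(float, float);    /* e.g., add or multiply */
    } Instr;

    /* Deliver one operand token; returns 1 and writes *result if the
     * instruction fires, 0 if it is still waiting for its partner. */
    int deliver(Instr *in, int slot, float value, float *result) {
        in->operand[slot] = value;
        in->present[slot] = 1;
        if (in->present[0] && in->present[1]) {
            *result = in->op(in->operand[0], in->operand[1]);
            in->present[0] = in->present[1] = 0;  /* ready for reuse */
            return 1;
        }
        return 0;
    }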

Review: Separation of Model and Architecture
• Shared Memory
  – Single shared address space
  – Communicate, synchronize using load / store
  – Can support message passing
• Message Passing
  – Send / Receive
  – Communication + synchronization
  – Can support shared memory
• Data Parallel
  – Lock-step execution on regular data structures
  – Often requires global operations (sum, max, min, ...)
  – Can support on either shared memory or message passing
• Dataflow

Review: A Generic Parallel Machine
[Figure: Nodes 0-3 on an interconnect; each node has processor(s) (P), cache ($), memory (Mem), and a communication assist (CA)]
• Separation of programming models from architectures
• All models require communication
• Node with processor(s), memory, communication assist

Outline
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
• Issues in Programming Models
  – Function: naming, operations, & ordering
  – Performance: latency, bandwidth, etc.
• Examples of Commercial Machines

Programming Model Design Issues
• Naming: How is communicated data and/or the partner node referenced?
• Operations: What operations are allowed on named data?
• Ordering: How can producers and consumers of data coordinate their activities?
• Performance
  – Latency: How long does it take to communicate in a protected fashion?
  – Bandwidth: How much data can be communicated per second? How many operations per second?

Issue: Naming
• Single Global Linear-Address-Space (shared memory)
• Multiple Local Address/Name Spaces (message passing)
• Naming strategy affects
  – Programmer / Software
  – Performance
  – Design complexity

Issue: Operations
• Uniprocessor RISC
  – Ld/St and atomic operations on memory
  – Arithmetic on registers
• Shared Memory Multiprocessor
  – Ld/St and atomic operations on local/global memory
  – Arithmetic on registers
• Message Passing Multiprocessor
  – Send/receive on local memory
  – Broadcast
• Data Parallel
  – Ld/St
  – Global operations (add, max, etc.)

Issue: Ordering
• Uniprocessor
  – Programmer sees order as program order
  – Out-of-order execution (Tomasulo's algorithm) actually changes order
  – Write buffers
  – Important to maintain dependencies
• Multiprocessor
  – What is the order among several threads accessing shared data?
  – What effect does this have on performance?
  – What if the implicit order is insufficient?
  – The memory consistency model specifies the rules for ordering (flag example below)
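The classic flag idiom shows why the ordering rules matter; this fragment is a textbook example rather than code from the slides, and as plain (non-atomic) C it is deliberately a data race:

    /* Producer/consumer handoff through shared memory. */
    int data, flag;

    void producer(void) {
        data = 42;    /* store 1 */
        flag = 1;     /* store 2: a weak model (or the compiler) may
                         make this visible before store 1 */
    }

    void consumer(void) {
        while (flag == 0)
            ;         /* spin until the flag is set */
        /* use data: safe only if the consistency model (or explicit
           fences) orders store 1 before store 2 */
    }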

Issue: Order/Synchronization
• Coordination mainly takes three forms:
  – Mutual exclusion (e.g., spin-locks)
  – Event notification
    » Point-to-point (e.g., producer-consumer)
    » Global (e.g., end-of-phase indication, all or a subset of processes)
  – Global operations (e.g., sum)
• Issues:
  – Synchronization name space (entire address space or a portion)
  – Granularity (per byte, per word, ... => overhead)
  – Low latency, low serialization (hot spots)
  – Variety of approaches
    » Test&set (sketch below), compare&swap, LoadLocked/StoreConditional
    » Full/Empty bits and traps
    » Queue-based locks, fetch&op with combining
    » Scans
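As one concrete instance from the list above, a test&set spin-lock in C11 atomics might look like this minimal sketch; production locks add backoff or queueing because every failed spin generates coherence traffic (the hot-spot problem noted above):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void) {
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin until the flag was previously clear */
    }

    void release(void) {
        atomic_flag_clear(&lock);
    }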

Performance Issue: Latency
• Must deal with latency when using fast processors
• Options:
  – Reduce frequency of long-latency events
    » Algorithmic changes, computation and data distribution
  – Reduce latency
    » Cache shared data, network interface design, network design
  – Tolerate latency
    » Message passing overlaps computation with communication (program controlled)
    » Shared memory overlaps access completion and computation using the consistency model and prefetching

Performance Issue: Bandwidth
• Private and global bandwidth requirements
• Private bandwidth requirements can be supported by:
  – Distributing main memory among PEs
  – Application changes, local caches, memory system design
• Global bandwidth requirements can be supported by:
  – Scalable interconnection network technology
  – Distributed main memory and caches
  – Efficient network interfaces
  – Avoiding contention (hot spots) through application changes

Cost of Communication

Cost = Frequency x (Overhead + Latency + Transfer size / Bandwidth - Overlap)

• Frequency = number of communications per unit work
  – Algorithm, placement, replication, bulk data transfer
• Overhead = processor cycles spent initiating or handling
  – Protection checks, status, buffer management, copies, events
• Latency = time to move bits from source to destination
  – Communication assist, topology, routing, congestion
• Transfer time = time through the bottleneck
  – Communication assist, links, congestion
• Overlap = portion overlapped with useful work
  – Communication assist, communication operations, processor design
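Plugging numbers into the model makes the terms concrete; every parameter below is an illustrative assumption, not data from the slide:

    #include <stdio.h>

    int main(void) {
        double frequency = 1000;          /* messages per unit work      */
        double overhead  = 5e-6;          /* 5 us to initiate/handle     */
        double latency   = 2e-6;          /* 2 us through the network    */
        double xfer      = 1024 / 100e6;  /* 1 KB at 100 MB/s bottleneck */
        double overlap   = 3e-6;          /* hidden behind useful work   */

        double cost = frequency * (overhead + latency + xfer - overlap);
        printf("communication cost = %g s\n", cost);  /* ~0.0142 s */
    }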

Outline
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
• Issues in Programming Models
• Examples of Commercial Machines

Example: Intel Pentium Pro Quad
– All coherence and multiprocessing glue in the processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth

Example: SUN Enterprise (pre-E10K)
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over the bus, so symmetric
– Higher bandwidth, higher latency bus

Example: Cray T3E
– Scales up to 1024 processors, 480 MB/s links
– Memory controller generates communication request for non-local references
– No hardware mechanism for coherence (SGI Origin etc. provide this)

Example: Intel Paragon
[Figure: Intel Paragon node: i860 processor with L1 cache on a 64-bit, 50 MHz memory bus with memory controller, DMA, driver, NI, and 4-way interleaved DRAM; network links are 8 bits, 175 MHz, bidirectional; 2D grid network with a processing node attached to every switch; photo of Sandia's Intel Paragon XP/S-based supercomputer]

Example: IBM SP-2
– Made out of essentially complete RS/6000 workstations
– Network interface integrated in I/O bus (BW limited by I/O bus)

Summary
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• Programming Models
• Issues in Programming Models
• Examples of Commercial Machines