18-742 Spring 2011 Parallel Computer Architecture
Lecture 3: Programming Models and Architectures
Prof. Onur Mutlu, Carnegie Mellon University
© 2010 Onur Mutlu, from Adve, Chiou, Falsafi, Hill, Lebeck, Patt, Reinhardt, Smith, Singh

Announcements
• Project topics
  - Some ideas will be out
  - Talk with Chris Craik and me
  - However, you should do the literature search and come up with a concrete research project to advance the state of the art
    - You have already read many papers in 740/741, 447
• Project proposal due: Jan 31

Last Lecture
• How to do the reviews
• Research project and proposal
• Flynn’s Taxonomy
• Why Parallel Computers
• Amdahl’s Law
• Superlinear Speedup
• Some Multiprocessor Metrics
• Parallel Computing Bottlenecks
  - Serial bottleneck
  - Bottlenecks in the parallel portion
    - Synchronization
    - Load imbalance
    - Resource contention

Readings for Next Lecture
• Required:
  - Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008.
  - Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005.
  - Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.
  - Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
• Recommended:
  - Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
  - Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000.
  - Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005.
  - Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.

Reviews
• Due Friday (Jan 21)
• Seitz, “The Cosmic Cube,” CACM 1985.
• Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.

What Will We Cover in This Lecture?
• Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560, in Readings in Computer Architecture.
• Culler, Singh, Gupta, Chapter 1 (Introduction) in “Parallel Computer Architecture: A Hardware/Software Approach.”

MIMD Processing Overview

MIMD Processing
• Loosely coupled multiprocessors
  - No shared global memory address space
  - Multicomputer network
    - Network-based multiprocessors
  - Usually programmed via message passing
    - Explicit calls (send, receive) for communication
• Tightly coupled multiprocessors
  - Shared global memory address space
  - Traditional multiprocessing: symmetric multiprocessing (SMP)
    - Existing multi-core processors, multithreaded processors
  - Programming model similar to uniprocessors (multitasking uniprocessor) except
    - Operations on shared data require synchronization

Main Issues in Tightly-Coupled MP
• Shared memory synchronization
  - Locks, atomic operations, transactional memory
• Cache consistency
  - More commonly called cache coherence
• Ordering of memory operations
  - From different processors: called memory consistency model
• Resource sharing and partitioning
• Communication: Interconnection networks
• Load imbalance

Multithreading
• Coarse grained
  - Quantum based
  - Event based
• Fine grained
  - Cycle by cycle
  - Thornton, “CDC 6600: Design of a Computer,” 1970.
  - Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
• Simultaneous
  - Can dispatch instructions from multiple threads at the same time
  - Good for improving execution unit utilization

Programming Models vs. Architectures

Programming Models vs. Architectures
• Five major models
  - Shared memory
  - Message passing
  - Data parallel (SIMD)
  - Dataflow
  - Systolic
• Hybrid models?

Shared Memory vs. Message Passing
• Are these programming models or execution models supported by the hardware architecture?
• Does a multiprocessor that is programmed with the “shared memory programming model” have to provide a shared address space across processors?
• Does a multiprocessor that is programmed with the “message passing programming model” have to have no shared address space between processors?

Programming Models: Message Passing vs. Shared Memory
• Difference: how communication is achieved between tasks
• Message passing programming model
  - Explicit communication via messages
  - Analogy: telephone call or letter, no shared location accessible to all
• Shared memory programming model
  - Implicit communication via memory operations (load/store)
  - Analogy: bulletin board, post information at a shared space
• Suitability of the programming model depends on the problem to be solved. Issues affected by the model include:
  - Overhead, scalability, ease of programming, bugs, match to underlying hardware, …

Message Passing vs. Shared Memory Hardware
• Difference: how task communication is supported in hardware
• Shared memory hardware (or machine model)
  - All processors see a global shared address space
    - Ability to access all memory from each processor
  - A write to a location is visible to the reads of other processors
• Message passing hardware (machine model)
  - No shared memory
  - Send and receive variants are the only method of communication between processors (much like networks of workstations today, i.e. clusters)
• Suitability of the hardware depends on the problem to be solved as well as the programming model.

Message Passing vs. Shared Memory Hardware
[Figure: processors (P), memories (M), and I/O devices (IO) joined at the network, at memory, or at the processor]

Join at          Program with
I/O (network)    Message passing
Memory           Shared memory
Processor        (Dataflow/systolic), single-instruction multiple-data (SIMD) ==> data parallel

Programming Model vs. Hardware
• For most of parallel computing history, there was no separation between programming model and hardware
  - Message passing: Caltech Cosmic Cube, Intel Hypercube, Intel Paragon
  - Shared memory: CMU C.mmp, Sequent Balance, SGI Origin
  - SIMD: ILLIAC IV, CM-1
• However, any hardware can really support any programming model
• Why?
  - Application -> compiler/library -> OS services -> hardware

Layers of Abstraction
• Compiler/library/OS map the communication abstraction at the programming model layer to the communication primitives available at the hardware layer

Programming Model vs. Architecture
• Machine -> Programming Model
  - Join at network, so program with message passing model
  - Join at memory, so program with shared memory model
  - Join at processor, so program with SIMD or data parallel
• Programming Model -> Machine
  - Message-passing programs on message-passing machine
  - Shared-memory programs on shared-memory machine
  - SIMD/data-parallel programs on SIMD/data-parallel machine
• Isn’t hardware basically the same?
  - Processors, memory, interconnect (I/O)
  - Why not have generic parallel machine and program with model that fits the problem?

A Generic Parallel Machine
[Figure: four nodes (Node 0 to Node 3), each with a processor (P), cache ($), memory (Mem), and communication assist (CA), connected by an interconnect]
• Separation of programming models from architectures
• All models require communication
• Node with processor(s), memory, communication assist

Simple Problem
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
• How do I make this parallel?

Simple Problem
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
• Split the loops
  - Independent iterations
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
    for i = 1 to N
        sum = sum + A[i]
• Data flow graph?

Data Flow Graph
[Figure: data flow graph for four elements. Each A[i] and B[i] feed a + node, its result and C[i] feed a * node, and the products feed the + nodes that accumulate the sum.]
• 2 + N-1 cycles to execute on N processors
  - what assumptions?
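One way to read the "2 + N-1" count is sketched below. The assumptions are not stated on the slide: every operation takes one cycle, communication between processors is free, and all inputs are available at the start. The tree-shaped variant is an addition for comparison; it assumes the additions may be reassociated, which can change rounding for floating-point data.

```latex
% Assumed: unit-latency operations, zero communication cost, N processors.
% Serial accumulation of the sum (the count given on the slide):
T_{\text{chain}}(N) = \underbrace{1}_{\text{all } A[i]+B[i]}
                    + \underbrace{1}_{\text{all } \times\, C[i]}
                    + \underbrace{(N-1)}_{\text{chain of additions}}
                    = N + 1
% Balanced-tree accumulation (requires reassociating the additions):
T_{\text{tree}}(N) = 2 + \lceil \log_2 N \rceil
% Example with N = 4: T_chain = 5 cycles, T_tree = 4 cycles.
```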

Partitioning of Data Flow Graph
[Figure: the same data flow graph partitioned across processors, with a global synch between the per-element computation and the summation.]

Shared (Physical) Memory Machine
[Figure: processors P0 ... Pn issue loads and stores to common physical addresses in the physical address space. Each address space has a shared portion and a private portion (P0 private, P1 private, P2 private, ..., Pn private).]
• Communication, sharing, and synchronization with store / load on shared variables
• Must map virtual pages to physical page frames
• Consider OS support for good mapping

Shared (Physical) Memory on Generic MP
[Figure: four nodes, each with P, $, Mem, and CA, connected by an interconnect. Addresses are split across nodes: Node 0 holds 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, Node 3 holds 3N to 4N-1.]
• Keep private data and frequently used shared data on same node as computation

Return of The Simple Problem
    private int i, my_start, my_end, mynode;
    shared float A[N], B[N], C[N], sum;
    for i = my_start to my_end
        A[i] = (A[i] + B[i]) * C[i]
    GLOBAL_SYNCH;
    if (mynode == 0)
        for i = 1 to N
            sum = sum + A[i]
• Can run this on any shared memory machine
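The shared-memory pseudocode can be sketched with Python threads. This is only an illustration under stated assumptions: `threading.Barrier` stands in for GLOBAL_SYNCH, the function name `shared_memory_sum` and the block partitioning of iterations are hypothetical choices, and Python threads show the structure of the program rather than real speedup.

```python
import threading

def shared_memory_sum(A, B, C, num_threads=4):
    """Update A in place and return its sum, following the slide's structure."""
    n = len(A)
    barrier = threading.Barrier(num_threads)   # assumed stand-in for GLOBAL_SYNCH
    result = {"sum": 0.0}                      # the shared variable 'sum'

    def worker(mynode):
        # Private variables: each thread owns a contiguous block of iterations.
        my_start = mynode * n // num_threads
        my_end = (mynode + 1) * n // num_threads
        for i in range(my_start, my_end):
            A[i] = (A[i] + B[i]) * C[i]        # communicates only through shared A
        barrier.wait()                         # every update to A is complete past here
        if mynode == 0:
            total = 0.0
            for i in range(n):
                total += A[i]
            result["sum"] = total

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["sum"]
```

Without the barrier, node 0 could start summing before other threads have written their part of A, which is exactly the hazard GLOBAL_SYNCH prevents.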

Message Passing Architectures
[Figure: four nodes, each with P, $, Mem, and CA, connected by an interconnect. Every node has its own private address range 0 to N-1.]
• Cannot directly access memory on another node
• IBM SP-2, Intel Paragon
• Cluster of workstations

Message Passing Programming Model
[Figure: Process P executes "Send x, Q, t" from address x in its local process address space; Process Q executes "Recv y, P, t" into address y in its own local process address space; the send and receive are matched.]
• User level send/receive abstraction
  - local buffer (x, y), process (Q, P) and tag (t)
  - naming and synchronization

The Simple Problem Again
    int i, my_start, my_end, mynode;
    float A[N/P], B[N/P], C[N/P], sum, tmp;
    for i = 1 to N/P
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]
    if (mynode != 0)
        send (sum, 0);
    if (mynode == 0)
        for i = 1 to P-1
            recv(tmp, i)
            sum = sum + tmp
• Send/Recv communicates and synchronizes
• P processors
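A minimal sketch of the message-passing version follows. Several things here are assumptions for illustration: threads stand in for nodes, a `queue.Queue` stands in for the interconnect, and the name `message_passing_sum` is hypothetical. Each node receives private copies of its N/P elements and communicates only through messages. Unlike `recv(tmp, i)` in the pseudocode, node 0 accepts partial sums in arrival order rather than from a named sender.

```python
import queue
import threading

def message_passing_sum(A, B, C, P=4):
    """Return sum((A[i] + B[i]) * C[i]) using P nodes that share no arrays."""
    n = len(A)
    inbox0 = queue.Queue()                     # channel into node 0
    out = queue.Queue()                        # node 0 reports the final sum here

    def node(mynode, a, b, c):
        # a, b, c are private copies holding this node's share of the data.
        local = 0.0
        for i in range(len(a)):
            a[i] = (a[i] + b[i]) * c[i]
            local += a[i]
        if mynode != 0:
            inbox0.put(local)                  # send(sum, 0)
        else:
            for _ in range(P - 1):
                local += inbox0.get()          # recv(tmp, ...): blocks until a message arrives
            out.put(local)

    threads = []
    for p in range(P):
        lo, hi = p * n // P, (p + 1) * n // P
        # Slicing copies the data, so no node can touch another node's elements.
        t = threading.Thread(target=node, args=(p, A[lo:hi], B[lo:hi], C[lo:hi]))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return out.get()
```

No barrier is needed: the blocking receive both delivers the data and guarantees the sender has finished its local loop. Because the threads actually run in one address space, the sketch is also a small instance of a message-passing program on shared-memory hardware.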

Separation of Architecture from Model
• At the lowest level shared memory model is all about sending and receiving messages
  - HW is specialized to expedite read/write messages using load and store instructions
• What programming model/abstraction is supported at user level?
• Can I have shared-memory abstraction on message passing HW? How efficient?
• Can I have message passing abstraction on shared memory HW? How efficient?

Challenges in Mixing and Matching
• Assume prog. model same as ABI (compiler/library -> OS -> hardware)
• Shared memory prog model on shared memory HW
  - How do you design a scalable runtime system/OS?
• Message passing prog model on message passing HW
  - How do you get good messaging performance?
• Shared memory prog model on message passing HW
  - How do you reduce the cost of messaging when there are frequent operations on shared data?
  - Li and Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM TOCS 1989.
• Message passing prog model on shared memory HW
  - Convert send/receives to load/stores on shared buffers
  - How do you design scalable HW?
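The idea of converting send/receives to load/stores on shared buffers can be illustrated with a one-slot mailbox. This is a sketch, not how any particular runtime does it: the `Mailbox` class and `demo` function are hypothetical, and a `threading.Condition` supplies the synchronization that the shared buffer alone does not.

```python
import threading

class Mailbox:
    """One-slot channel: send/recv built from stores/loads on a shared buffer."""

    def __init__(self):
        self._slot = None
        self._full = False
        self._cond = threading.Condition()

    def send(self, value):
        with self._cond:
            while self._full:                  # wait until the receiver drains the slot
                self._cond.wait()
            self._slot = value                 # store into the shared buffer
            self._full = True
            self._cond.notify_all()

    def recv(self):
        with self._cond:
            while not self._full:              # wait until a sender fills the slot
                self._cond.wait()
            value = self._slot                 # load from the shared buffer
            self._full = False
            self._cond.notify_all()
            return value

def demo():
    box = Mailbox()
    received = []
    consumer = threading.Thread(
        target=lambda: received.extend(box.recv() for _ in range(3)))
    consumer.start()
    for x in (10, 20, 30):
        box.send(x)
    consumer.join()
    return received
```

The single slot forces sender and receiver to alternate, so the mailbox synchronizes as well as communicates. A real implementation would likely use a larger buffer to let the sender run ahead.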

Data Parallel Programming Model
• Programming Model
  - Operations are performed on each element of a large (regular) data structure (array, vector, matrix)
  - Program is logically a single thread of control, carrying out a sequence of either sequential or parallel steps
• The Simple Problem Strikes Back
    A = (A + B) * C
    sum = global_sum (A)
• Language supports array assignment
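Plain Python has no array assignment, but the whole-array form can be approximated with elementwise `map`. This is a rough sketch: the function name `data_parallel` is hypothetical, the built-in `sum` stands in for `global_sum`, and the code runs sequentially here. What it preserves is the single thread of control and the absence of any ordering between element operations.

```python
from operator import add, mul

def data_parallel(A, B, C):
    """Whole-array form of the example: A = (A + B) * C; sum = global_sum(A)."""
    # One logical step per array operation; no element depends on another,
    # so a data-parallel machine could apply each step to all elements at once.
    A = list(map(mul, map(add, A, B), C))
    return A, sum(A)                           # sum() stands in for global_sum
```

The program contains no explicit threads, sends, or synchronization; how the elements are distributed is left to the language implementation and the hardware.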

Data Parallel Hardware Architectures (I)
• Early architectures directly mirrored programming model
• Single control processor (broadcast each instruction to an array/grid of processing elements)
  - Consolidates control
• Many processing elements controlled by the master
• Examples: Connection Machine, MPP
  - Batcher, “Architecture of a massively parallel processor,” ISCA 1980.
    - 16K bit serial processing elements
  - Tucker and Robertson, “Architecture and Applications of the Connection Machine,” IEEE Computer 1988.
    - 64K bit serial processing elements

Connection Machine

Data Parallel Hardware Architectures (II)
• Later data parallel architectures
  - Higher integration: SIMD units on chip along with caches
  - More generic: multiple cooperating multiprocessors with vector units
  - Specialized hardware support for global synchronization
    - E.g. barrier synchronization
• Example: Connection Machine 5
  - Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993.
  - Consists of 32-bit SPARC processors
  - Supports Message Passing and Data Parallel models
  - Special control network for global synchronization

Review: Separation of Model and Architecture
• Shared Memory
  - Single shared address space
  - Communicate, synchronize using load / store
  - Can support message passing
• Message Passing
  - Send / Receive
  - Communication + synchronization
  - Can support shared memory
• Data Parallel
  - Lock-step execution on regular data structures
  - Often requires global operations (sum, max, min...)
  - Can be supported on either SM or MP

Review: A Generic Parallel Machine
[Figure: four nodes (Node 0 to Node 3), each with a processor (P), cache ($), memory (Mem), and communication assist (CA), connected by an interconnect]
• Separation of programming models from architectures
• All models require communication
• Node with processor(s), memory, communication assist

Data Flow Programming Models and Architectures
• Program consists of data flow nodes
• A data flow node fires (fetched and executed) when all its inputs are ready
  - i.e. when all inputs have tokens
• No artificial constraints, like sequencing instructions
• How do we know when operands are ready?
  - Matching store for operands (remember OoO execution?)
  - large associative search!
• Later machines moved to coarser grained dataflow (threads + dataflow within a thread)
  - allowed registers and cache for local computation
  - introduced messages (with operations and operands)
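The firing rule can be illustrated with a toy interpreter. This is a sketch under assumptions: the names `run_dataflow` and `demo` and the dictionary-based graph encoding are hypothetical, and the linear scan for ready nodes is a software stand-in for the associative matching store, not a model of real dataflow hardware.

```python
import operator

def run_dataflow(nodes, initial_tokens):
    """nodes: name -> (op, [input names]); initial_tokens: name -> value.
    Returns (all token values, order in which nodes fired)."""
    tokens = dict(initial_tokens)
    pending = dict(nodes)
    fired = []
    while pending:
        # Firing rule: a node is ready when every one of its inputs has a token.
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in tokens for i in inputs)]
        if not ready:
            raise ValueError("deadlock: some inputs never receive tokens")
        for name in ready:                     # ready nodes are independent of each other
            op, inputs = pending.pop(name)
            tokens[name] = op(*(tokens[i] for i in inputs))
            fired.append(name)
    return tokens, fired

def demo():
    # (a + b) * c for one element, as in the lecture's running example.
    nodes = {
        "prod": (operator.mul, ["s", "c"]),    # listed first, but fires second
        "s": (operator.add, ["a", "b"]),
    }
    return run_dataflow(nodes, {"a": 1, "b": 2, "c": 3})
```

The order in which nodes are written does not matter: "prod" is listed before "s", yet it fires only after the token for "s" exists. Execution order is determined by data availability alone.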