von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation"

von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation"
CS 258
Lecture by: Dan Bonachea
Slide 1

Motivation for AM (review)
How do we make parallel programs fast?
• Minimize communication overhead
• Overlap communication & computation (shoot for 100% utilization of all resources)
• Consider the entire program
  – Communication
  – Computation
  – Interactions between the two
Slide 2

Message-Driven Architectures
• Research systems
  – J-Machine/MDP, Monsoon, etc.
  – Defining quality: all significant computation happens within the context of a handler
  – Computational model is basically dataflow programming
    » Supports languages with dynamic parallelism, e.g. MultiLISP
  – Interesting note: about 1/3 of all handlers in the J-Machine end up blocking and get swapped out by software
• Pros:
  – Low-overhead communication: a reaction to the lousy performance of the send/recv model traditionally used in message-passing systems
  – Tight integration with the network: messages are directly "executed"
• Cons:
  – Typically need hardware support in the NIC to achieve good performance, because more sophisticated buffering & scheduling are required
  – Poor locality of computation => small register sets and degraded raw computational performance (bad cache locality)
  – Poor cost/performance ratio, hard to program(?)
  – Number of handlers waiting to run at a given time is determined by excess parallelism in the application, not by the arrival rate of messages
Slide 3

Message-Passing Architectures
• Commercial systems
  – nCUBE, CM-5
  – Defining feature: all significant computation happens in a devoted computational thread => good locality, performance
• Traditional programming model is blocking, matched send/recv (implemented as 3-phase rendezvous)
  – Inherently a poor programming model for the lowest level:
  – Doesn't match the semantics of the NIC, and performance gets lost in the translation
  – Doesn't allow for overlap without expensive buffering
• There's no compelling reason to keep this model as our lowest-level network interface, even for this architecture
  – Sometimes easier to program, but we want the lowest-overhead interface possible as the NIC-level interface
  – Can easily provide a send/recv abstraction on top of a more efficient interface
  – No way to recapture lost performance if the lowest-level interface is slow
Slide 4
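The claim that send/recv can be layered on a more efficient interface can be sketched as follows. This is a single-process simulation, not code from the paper: `Node`, `deposit`, `send`, and `recv` are hypothetical names, and the `inbox` deque stands in for the network. The handler only files the payload into a per-tag mailbox, so buffering is paid for by the send/recv layer that needs it rather than by the lowest-level interface.

```python
from collections import defaultdict, deque


class Node:
    """Simulated processor with a handler-based low-level interface."""

    def __init__(self):
        self.inbox = deque()                 # arrived messages (stand-in for the NIC)
        self.mailboxes = defaultdict(deque)  # tag -> buffered payloads

    def poll(self):
        # Run the handler named by each arrived message, to completion.
        while self.inbox:
            handler, args = self.inbox.popleft()
            handler(self, *args)


def deposit(node, tag, payload):
    # Handler: file the payload under its tag and return immediately.
    node.mailboxes[tag].append(payload)


def send(dest, tag, payload):
    # "send" is just a message whose handler is deposit.
    dest.inbox.append((deposit, (tag, payload)))


def recv(node, tag):
    # Matched receive: poll until a message carrying this tag is available.
    while not node.mailboxes[tag]:
        if not node.inbox:
            # A real recv would keep spinning; the simulation has nothing else to wait on.
            raise RuntimeError("no matching message in flight")
        node.poll()
    return node.mailboxes[tag].popleft()


receiver = Node()
send(receiver, "b", 2)
send(receiver, "a", 1)
first = recv(receiver, "a")   # matched by tag, not by arrival order
second = recv(receiver, "b")
```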

Active Messages - a new "mechanism"
• Main idea: take the best features of the message-driven model and unify them with the capabilities of message-passing hardware
  – Get the same or better performance as message-driven systems with little or no special-purpose hardware
  – Fix the mismatch between the low-level software interface and the hardware capabilities that cripples performance
    » Eliminate all buffering not required by transport
    » Expose out-of-order, asynchronous delivery
  – Need to restrict the allowable behavior of handlers somewhat to make this possible
Slide 5

Active Messages - Handlers
• User-provided handlers that "execute" messages
  – Handlers run immediately upon message arrival
  – Handlers run quickly and to completion (no blocking)
  – Handlers run atomically with respect to each other
  – These restrictions make it possible to implement handlers with no buffering on simple message-passing hardware
• The purpose of AM handlers:
  – Quickly extract a message from the network and "integrate" the data into the running computation in an application-specific way, with a small amount of work
  – Handlers do NOT perform significant computation themselves
    » only the minimum functionality required to communicate
    » this is the crucial difference between AM and the message-driven model
Slide 6
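A minimal sketch of the handler model above, assuming a single-process simulation in which a deque stands in for the network. The names `Node`, `am_send`, and `accumulate` are hypothetical, not part of any AM implementation. Each message names its handler by index; the handler folds the payload into the receiver's state and returns without blocking.

```python
from collections import deque


class Node:
    """One processor: local state plus a table of user-provided handlers."""

    def __init__(self):
        self.handlers = []      # handler index -> function
        self.network = deque()  # stand-in for the NIC receive queue
        self.total = 0          # state owned by the main computation
        self.pending = 0        # messages the computation still expects

    def register(self, fn):
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def poll(self):
        # Handlers run one at a time and to completion, so they are
        # atomic with respect to each other.
        while self.network:
            handler_index, args = self.network.popleft()
            self.handlers[handler_index](self, *args)


def am_send(dest, handler_index, *args):
    # The message carries the index of the handler that will consume it.
    dest.network.append((handler_index, args))


def accumulate(node, value):
    # Short and non-blocking: integrate the data, then return.
    node.total += value
    node.pending -= 1


receiver = Node()
ACCUMULATE = receiver.register(accumulate)
receiver.pending = 3
for value in (10, 20, 12):
    am_send(receiver, ACCUMULATE, value)
receiver.poll()
```

The handler does no significant computation of its own; the main computation would consume `receiver.total` once `receiver.pending` reaches zero.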

Active Messages - Handlers (cont.)
• Miscellaneous restriction:
  – Communication is strictly request-reply (ensures acyclic protocol dependencies)
  – Prevents deadlock with strictly bounded buffer space (assuming 2 virtual networks are available)
• Still powerful enough to implement most if not all communication paradigms
  – Shared memory, message-passing, message-driven, etc.
• AM is especially useful as a compilation target for higher-level languages (Split-C, Titanium, etc.)
  – Acceptable to trade off programmability and possibly some protection to maximize performance
  – Code is often generated by a compiler anyhow, so guarding against naïve users is less critical
Slide 7
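The request-reply restriction can be illustrated with a small simulation of a remote read. Everything here is a hypothetical sketch: two deques per node stand in for the two virtual networks, and `am_request`, `am_reply`, `read_request`, and `read_reply` are invented names. A request handler may only reply to its requester, and a reply handler sends nothing, so the dependency chain always ends after two hops.

```python
from collections import deque


class Node:
    def __init__(self, memory=None):
        self.memory = dict(memory or {})
        self.requests = deque()  # virtual network 1: requests
        self.replies = deque()   # virtual network 2: replies
        self.results = {}

    def poll(self):
        # Replies are drained first: their handlers never send, so they
        # can always make progress and free buffer space.
        while self.replies:
            handler, args = self.replies.popleft()
            handler(self, *args)
        while self.requests:
            handler, src, args = self.requests.popleft()
            handler(self, src, *args)


def am_request(dest, src, handler, *args):
    dest.requests.append((handler, src, args))


def am_reply(dest, handler, *args):
    dest.replies.append((handler, args))


def read_request(node, src, key):
    # Request handler: sends exactly one reply, back to the requester.
    am_reply(src, read_reply, key, node.memory[key])


def read_reply(node, key, value):
    # Reply handler: integrates the data and sends nothing further.
    node.results[key] = value


a = Node()
b = Node(memory={"x": 7})
am_request(b, a, read_request, "x")
b.poll()  # b services the request and replies
a.poll()  # a consumes the reply
```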

Proof of Concept: Split-C
• Split-C: an explicitly parallel, SPMD version of C
  – Global address space abstraction, with a visible local/remote distinction
  – Split-phase, one-sided (asynchronous) remote memory operations
  – Sender executes put or get, then a sync on a local counter for completion of 1 or more ops
• User/compiler explicitly specifies prefetching to get overlap
• Write in shared-memory style, but remote operations are explicit
  – local/global distinction is important for high performance, so expose it to the user
  – can also implement arbitrarily generalized data transfers (scatter-gather, strided)
• Important points:
  – AM can efficiently provide a global memory space on existing message-passing systems in software, using the right model
  – evolutionary change rather than revolutionary (keep the architecture)
  – works very well for coarse-grained SPMD apps
Slide 8
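The split-phase pattern (issue gets, overlap local work, then sync on a counter) can be sketched in Python. This is not Split-C syntax: `Proc`, `get`, and `sync` are hypothetical names, and `sync` polls the remote processor only because the simulation runs in a single thread.

```python
class Proc:
    def __init__(self, memory):
        self.memory = list(memory)  # this processor's slice of the global space
        self.inbox = []             # arrived (handler, args) messages
        self.outstanding = 0        # local counter of incomplete gets
        self.local = {}             # destinations for fetched values

    def poll(self):
        while self.inbox:
            handler, args = self.inbox.pop(0)
            handler(self, *args)


def get(me, owner, index, dest_name):
    # Initiate a remote read and return immediately (split-phase).
    me.outstanding += 1
    owner.inbox.append((get_request, (me, index, dest_name)))


def get_request(owner, requester, index, dest_name):
    # Request handler on the owner: reply with the requested word.
    requester.inbox.append((get_reply, (dest_name, owner.memory[index])))


def get_reply(me, dest_name, value):
    # Reply handler: store the value and decrement the completion counter.
    me.local[dest_name] = value
    me.outstanding -= 1


def sync(me, others):
    # Wait until every outstanding get has completed.
    while me.outstanding:
        for proc in others:
            proc.poll()  # simulation only: lets the remote side run
        me.poll()


p0 = Proc([0, 0, 0, 0])
p1 = Proc([5, 6, 7, 8])
get(p0, p1, 1, "a")
get(p0, p1, 3, "b")
local_work = sum(range(10))  # computation overlapped with the transfers
sync(p0, [p1])
result = p0.local["a"] + p0.local["b"]
```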

Results
• Dramatic reduction in latency on commercial message-passing machines with NO additional hardware
  – nCUBE/2:
    » AM send/handle: 11 µs/15 µs overhead
    » Blocking message send/recv: 160 µs overhead
  – CM-5:
    » AM: <2 µs overhead
    » Blocking message send/recv: 86 µs overhead
• About an order of magnitude improvement with no hardware investment
Slide 9
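A quick check of the "order of magnitude" claim using the figures above. Two assumptions are made here and are not stated on the slide: the nCUBE/2 AM cost of one message is taken as send plus handle overhead, and the CM-5 figure uses the 2 µs upper bound, so the resulting CM-5 ratio is a lower bound.

```python
# Overheads in microseconds, taken from the slide.
ncube2_am = 11 + 15       # assumption: send overhead + handle overhead
ncube2_blocking = 160
cm5_am_upper_bound = 2    # slide says "<2", so this is a ceiling
cm5_blocking = 86

ncube2_ratio = ncube2_blocking / ncube2_am          # 160 / 26
cm5_ratio_lower_bound = cm5_blocking / cm5_am_upper_bound  # 86 / 2

print(f"nCUBE/2 improvement: about {ncube2_ratio:.1f}x")
print(f"CM-5 improvement: more than {cm5_ratio_lower_bound:.0f}x")
```

Under these assumptions the nCUBE/2 gain is a bit over 6x and the CM-5 gain exceeds 43x, which brackets the slide's rough "order of magnitude".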

Optional Hardware/Kernel Support for AM
• DMA transfer support => large messages
• Registers on NIC for composing messages
  – General registers, not FIFOs: allow message reuse
  – Ability to compose a request & reply simultaneously
• Fast user-level interrupts
  – Allow fully user-level interrupts (trap directly to handler)
  – PC injection is one way to do this
  – Any protection mechanisms required for kernel to allow user-level NIC interrupts
• Support for efficient polling
Slide 10

Problems with AM-1 paper
• Handler atomicity with respect to the main computation
  – Addressed in von Eicken's thesis
  – Solutions:
    » Atomic instructions
    » Mechanism to temporarily disable NIC interrupts using a memory flag or reserved register
• Described as an abstract mechanism, not a solid portable spec
• Little support for recv protection, multithreading, CLUMPs, abstract naming, etc.
• AM-2 fixes the above problems
Slide 11
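The memory-flag solution can be sketched as a simulation in which message arrival is an ordinary method call. The names `Node`, `on_message`, `disable_interrupts`, and `enable_interrupts` are hypothetical. While the flag is set, arriving handlers are deferred, so a read-modify-write in the main computation cannot be interleaved with a handler touching the same state.

```python
class Node:
    def __init__(self):
        self.counter = 0
        self.interrupts_disabled = False  # the "memory flag"
        self.deferred = []                # messages that arrived while disabled

    def on_message(self, handler, *args):
        # Simulated NIC interrupt: run the handler now unless the flag is set.
        if self.interrupts_disabled:
            self.deferred.append((handler, args))
        else:
            handler(self, *args)

    def disable_interrupts(self):
        self.interrupts_disabled = True

    def enable_interrupts(self):
        # Clear the flag, then run everything that was held back.
        self.interrupts_disabled = False
        while self.deferred:
            handler, args = self.deferred.pop(0)
            handler(self, *args)


def increment(node, amount):
    node.counter += amount


node = Node()
node.disable_interrupts()
snapshot = node.counter        # main computation reads shared state
node.on_message(increment, 5)  # message arrives mid critical section: deferred
seen_during = node.counter     # handler has not run yet
node.counter = snapshot + 1    # main computation writes back
node.enable_interrupts()       # deferred handler runs here
```

Without the flag, the handler's update would land between the read and the write and be overwritten.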

GAM & Active Messages-2
• Done at Berkeley by Mainwaring, Culler, et al.
• Standardized API & generalized somewhat
• Adds support missing in AM-1 for:
  – multiple logical endpoints per application (modularity, multi-threading, multi-NIC)
  – non-SPMD configurations
  – recv-side protection mechanisms to catch non-malicious bugs (tags)
  – multi-threaded applications
  – level of indirection on handlers for non-aligned memory spaces (heterogeneous system)
  – fault-tolerance support for congestion, node failure, etc. (return to sender)
  – opaque endpoint naming (client code portability, transparent multiprotocol implementations)
  – polling may happen implicitly on all calls, so explicit polls are rarely required
  – enforces strict request/reply, which eases implementation on some systems (HPAM)
Slide 12
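Three of the additions listed above (receive-side tags, handler indirection, and return to sender) can be illustrated together. This is a hypothetical sketch and not the AM-2 API: `Endpoint`, `set_handler`, and `send` are invented names. Handlers are named by an index into the receiving endpoint's table rather than by address, and a message with the wrong tag or an unknown index is handed back to the sender instead of being delivered.

```python
class Endpoint:
    def __init__(self, name, tag):
        self.name = name
        self.tag = tag           # messages must carry this tag to be accepted
        self.handler_table = {}  # index -> handler (the level of indirection)
        self.received = []
        self.returned = []       # undeliverable messages handed back to us

    def set_handler(self, index, fn):
        self.handler_table[index] = fn


def send(src, dest, dest_tag, handler_index, payload):
    handler = dest.handler_table.get(handler_index)
    if dest_tag != dest.tag or handler is None:
        # Return to sender rather than dropping or misdelivering.
        src.returned.append((dest.name, handler_index, payload))
        return False
    handler(dest, payload)
    return True


def store(endpoint, payload):
    endpoint.received.append(payload)


a = Endpoint("a", tag=0xA)
b = Endpoint("b", tag=0xB)
b.set_handler(1, store)

ok = send(a, b, 0xB, 1, "hello")     # correct tag and index: delivered
bad = send(a, b, 0xDEAD, 1, "stray") # wrong tag: returned to sender
```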

Influence of Active Messages
• Many implementations of AM in some form
  – natively on NICs: Myrinet (NOW project), VIA (Buonadonna & Begel), HP Medusa (Richard Martin), Intel Paragon (Liu), Meiko CS-2 (Schauser)
  – on other transports: TCP (Liu and Mainwaring), UDP (me), MPI (me), LAPI (Yau & Welcome)
  – other interesting: multi-protocol AM (shared memory & network for CLUMPs) (Lumetta)
• Used as compilation target for many parallel languages/systems:
  – Split-C, Id90/TAM, Titanium, PVM, UPC, MPI…
• Influenced the design of important systems
  – E.g.: IBM SP supercomputer: LAPI, a low-level messaging layer that is basically AM
Slide 13