
The Monsoon Project
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
March 4, 2008
http://csg.csail.mit.edu/arvind/

Dataflow Machines
• Static - mostly for signal processing
  - NEC - NEDIP and IPP
  - Hughes, Hitachi, AT&T, Loral, TI, Sanyo
  - M.I.T. Engineering model
• Dynamic
  - Manchester ('81)
  - M.I.T. - TTDA, Monsoon ('88)
  - M.I.T./Motorola - Monsoon ('91) (8 PEs, 8 IS)
  - ETL - SIGMA-1 ('88) (128 PEs, 128 IS)
  - ETL - EM4 ('90) (80 PEs), EM-X ('96) (80 PEs)
  - Sandia - EPS88, EPS-2
  - IBM - Empire
  - ...
• Related machines: Burton Smith's Denelcor HEP, Horizon, Tera
[Photos: EM4: single-chip dataflow micro, 80 PE multiprocessor, ETL, Japan; Sigma-1: the largest dataflow machine, ETL, Japan; Monsoon. Captions also read "Shown at Supercomputing 91" and "Shown at Supercomputing 96".]
[Pictured: K. Hiraki, S. Sakai, T. Shimada, Y. Kodama, John Gurd, Greg Papadopoulos, Chris Joerg, Andy Boughton, Jack Costanza]

Outline
• Static Dataflow Machines
  - Not general-purpose enough
• Dynamic Dataflow Machines
  - As easy to build as a simple pipelined processor
• The software view
  - The memory model: I-structures
• Monsoon and its performance

Dataflow Graphs
  {x = a + b;
   y = b * 7
   in (x-y) * (x+y)}
• Values in dataflow graphs are represented as tokens <ip, p, v>: instruction pointer, port, data (for example ip = 3, p = L)
• An operator executes when all its input tokens are present; copies of the result token are distributed to the destination operators
• No separate control flow
[Figure: graph with inputs a and b; node 1 (+) produces x, node 2 (*7) produces y, node 3 (-) and node 4 (+) both consume x and y, node 5 (*) multiplies their results]
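
The firing rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Monsoon code: the GRAPH and ARITY tables, the run function, and the choice to treat node 2 as a unary "times 7" operator are assumptions made for the example. Tokens are (ip, port, value) triples, and a node fires once all of its input ports hold a value.

```python
from collections import defaultdict, deque

# Hypothetical encoding of the slide's graph: ip -> (operation, destinations).
# A destination is (ip, port); None marks the output of the whole graph.
GRAPH = {
    1: (lambda l, r: l + r, [(3, "L"), (4, "L")]),  # x = a + b
    2: (lambda l: l * 7,    [(3, "R"), (4, "R")]),  # y = b * 7 (constant folded in)
    3: (lambda l, r: l - r, [(5, "L")]),            # x - y
    4: (lambda l, r: l + r, [(5, "R")]),            # x + y
    5: (lambda l, r: l * r, [None]),                # (x - y) * (x + y)
}
ARITY = {1: 2, 2: 1, 3: 2, 4: 2, 5: 2}


def run(a, b):
    """Evaluate the graph by moving tokens until none are left."""
    tokens = deque([(1, "L", a), (1, "R", b), (2, "L", b)])
    waiting = defaultdict(dict)  # ip -> {port: value} for operands that arrived
    result = None
    while tokens:
        ip, port, value = tokens.popleft()
        waiting[ip][port] = value
        if len(waiting[ip]) < ARITY[ip]:
            continue  # firing rule not satisfied yet: keep waiting
        operands = waiting.pop(ip)
        op, dests = GRAPH[ip]
        args = [operands["L"]] + ([operands["R"]] if ARITY[ip] == 2 else [])
        out = op(*args)
        for dest in dests:  # one copy of the result token per destination
            if dest is None:
                result = out
            else:
                tokens.append((dest[0], dest[1], out))
    return result


print(run(2, 3))  # x = 5, y = 21, so (5 - 21) * (5 + 21) = -416
```

There is no program counter anywhere in the sketch; the order of execution falls out of token arrival alone.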

Static Dataflow Machine: Instruction Templates
  Instr  Opcode  Destination 1  Destination 2  Operand 1  Operand 2
  1      +       3L             4L
  2      *       3R             4R
  3      -       5L
  4      +       5R
  5      *       out
• Each arc in the graph has an operand slot in the program
[Figure: the dataflow graph for a, b, x, y next to the template table; the operand slots are labeled "Presence bits"]

Static Dataflow Machine
Jack Dennis, 1973
• Many such processors can be connected together
• Programs can be statically divided among the processors
[Figure: a Receive unit delivers tokens <s1, p1, v1>, <s2, p2, v2> into the Instruction Templates (rows 1, 2, ...; fields Op, dest1, dest2, p1, src1, p2, src2); enabled templates go to a set of FUs, whose results leave through a Send unit]

Static Dataflow: Problems/Limitations
• Mismatch between the model and the implementation
  - The model requires unbounded FIFO token queues per arc but the architecture provides storage for one token per arc
  - The architecture does not ensure FIFO order in the reuse of an operand slot
  - The merge operator has a unique firing rule
• The static model does not support
  - Function calls
  - Data structures
• No easy solution in the static framework
• Dynamic dataflow provided a framework for solutions

Dynamic Dataflow Architectures
• Allocate instruction templates, i.e., a frame, dynamically to support each loop iteration and procedure call
  - Termination detection needed to deallocate frames
• The code can be shared if we separate the code and the operand storage
• A token: <fp, ip, port, data>, where fp is the frame pointer and ip is the instruction pointer

A Frame in Dynamic Dataflow
  Program
  1  +  3L, 4L
  2  *  3R, 4R
  3  -  5L
  4  +  5R
  5  *  out
• Token: <fp, ip, p, v>
• Need to provide storage for only one operand/operator
[Figure: the graph for a, b, x, y beside the shared program and a five-slot frame; one frame slot holds the value 7]
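
A minimal Python sketch of the frame idea, under stated assumptions: PROGRAM mirrors the program listing on the slide, while new_frame, run, the EMPTY marker, and the decision to preload the constant 7 into slot 2 of each frame are illustrative choices, not the Monsoon design. Each token is (fp, ip, port, value). The first operand to reach an instruction waits in frame slot ip; the second one finds it there, frees the slot, and fires the instruction. Two activations run through the same code with separate frames.

```python
# Shared code, as on the slide: ip -> (opcode, destinations as (ip, port)).
PROGRAM = {
    1: ("+", [(3, "L"), (4, "L")]),
    2: ("*", [(3, "R"), (4, "R")]),
    3: ("-", [(5, "L")]),
    4: ("+", [(5, "R")]),
    5: ("*", []),  # no destinations: the value leaves the graph ("out")
}
OPS = {"+": lambda l, r: l + r, "-": lambda l, r: l - r, "*": lambda l, r: l * r}
EMPTY = object()  # stands in for a presence bit that says "slot is empty"


def new_frame():
    frame = {ip: EMPTY for ip in PROGRAM}
    frame[2] = ("R", 7)  # assumption: the 7 in the figure is instruction 2's constant
    return frame


def run(activations):
    """activations maps a frame pointer to its inputs (a, b); returns {fp: result}."""
    frames = {fp: new_frame() for fp in activations}
    tokens = []
    for fp, (a, b) in activations.items():
        tokens += [(fp, 1, "L", a), (fp, 1, "R", b), (fp, 2, "L", b)]
    results = {}
    while tokens:
        fp, ip, port, value = tokens.pop()  # LIFO on purpose: order must not matter
        slot = frames[fp][ip]
        if slot is EMPTY:
            frames[fp][ip] = (port, value)  # first operand waits in the frame
            continue
        frames[fp][ip] = EMPTY  # partner found: free the slot and fire
        left, right = (slot[1], value) if slot[0] == "L" else (value, slot[1])
        opcode, dests = PROGRAM[ip]
        out = OPS[opcode](left, right)
        if not dests:
            results[fp] = out
        for dest_ip, dest_port in dests:
            tokens.append((fp, dest_ip, dest_port, out))
    return results


print(run({0: (2, 3), 1: (10, 1)}))
```

One slot per instruction is enough because, within one activation, at most one operand of a two-input instruction is ever waiting.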

Monsoon Processor
Greg Papadopoulos
• Instruction format: op r d1, d2
[Figure: pipeline Instruction Fetch (reads Code at ip) → Operand Fetch (reads Frames at fp+r) → ALU → Form Token, with a Token Queue and connections to the Network]

Temporary Registers & Threads
Robert Iannucci
• Instruction format: op r S1, S2
• n sets of registers (n = pipeline depth)
• Registers evaporate when an instruction thread is broken
• Registers are also used for exceptions & interrupts
[Figure: pipeline Instruction Fetch (Code) → Operand Fetch (Frames, Registers) → ALU → Form Token, with a Token Queue and connections to the Network]

Actual Monsoon Pipeline: Eight Stages
[Figure: pipeline from Instruction Fetch (Instruction Memory) through Effective Address, Presence Bit Operation (Presence bits), Frame Operation (Frame Memory) and ALU (Registers, 2R, 2W) to Form Token; Form Token feeds the Network, the User Queue and the System Queue. Data-path labels in the figure: 32, 3, 72, 144.]

Instructions directly control the pipeline
• The opcode specifies an operation for each pipeline stage:
    instruction: opcode | r | dest1 | [dest2]
    opcode:      EA | WM | RegOp | ALU | FormToken
• Easy to implement; no hazard detection
• EA - effective address
  - FP + r: frame relative
  - r: absolute
  - IP + r: code relative (not supported)
• WM - waiting matching
  - Unary; Normal; Sticky; Exchange; Imperative
  - PBs × port → PBs × Frame op × ALU inhibit
• Register ops
• ALU: VL × VR → V'L × V'R, CC
• Form token: VL × VR × Tag1 × Tag2 × CC → Token1 × Token2

Procedure Linkage Operators
• Like standard call/return but caller & callee can be active simultaneously
[Figure: the caller applies f to arguments a1 ... an. A "get frame" operator and an "extract tag" operator feed "change Tag 0", "change Tag 1", ..., "change Tag n", which turn each token in frame 0 into a token in frame 1. A Fork delivers them to entry points 1: ... n: of the graph for f, whose results return through "change Tag 0" and "change Tag 1".]

Data Structures in Dataflow
• Data structures reside in a structure store ⇒ tokens carry pointers
• I-structures: Write-once, Read multiple times, or
  - allocate, write, read, ..., read, deallocate
  ⇒ No problem if a reader arrives before the writer at the memory location
[Figure: processors P ... P connected to Memory; operations "I-fetch a" and "I-store a v"]

I-Structure Storage: Split-phase operations & Presence bits
• Need to deal with multiple deferred reads
• Other operations: fetch/store, take/put, clear
[Figure: an I-Fetch instruction s with successor t is split-phase. The token <s, fp, a> carries the address to be read; the request <a, Read, (t, fp)> carries the forwarding address. I-structure Memory holds locations a1 ... a4, some with values v1, v2 and one holding a deferred reader fp.ip.]
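
The behaviour described on the two I-structure slides can be sketched as follows. This is a simplified software model, not the Monsoon hardware: the class name IStructure, its method names, the three state names and the send callback (which stands in for forming a reply token and putting it on the network) are all assumptions for illustration. Each location carries a presence state; a fetch that arrives before the store leaves its forwarding tag behind, and the store later answers every deferred reader.

```python
class IStructure:
    """Write-once memory with presence bits and deferred reads (illustrative model)."""

    EMPTY, DEFERRED, PRESENT = "empty", "deferred", "present"

    def __init__(self, size):
        self.state = [self.EMPTY] * size
        # Holds the value when PRESENT, or the list of waiting tags when DEFERRED.
        self.cell = [None] * size

    def i_fetch(self, a, tag, send):
        """Split-phase read: reply goes to `tag`, now or when the value arrives."""
        if self.state[a] == self.PRESENT:
            send(tag, self.cell[a])
            return
        if self.state[a] == self.EMPTY:
            self.state[a] = self.DEFERRED
            self.cell[a] = []
        self.cell[a].append(tag)  # several readers may be deferred on one location

    def i_store(self, a, value, send):
        """Write once, then satisfy every reader that arrived early."""
        if self.state[a] == self.PRESENT:
            raise RuntimeError(f"location {a} written twice")
        waiting = self.cell[a] if self.state[a] == self.DEFERRED else []
        self.state[a], self.cell[a] = self.PRESENT, value
        for tag in waiting:
            send(tag, value)


replies = []
send = lambda tag, value: replies.append((tag, value))

mem = IStructure(4)
mem.i_fetch(1, ("t", "fp0"), send)  # reader arrives before the writer
mem.i_fetch(1, ("t", "fp1"), send)  # a second deferred read on the same location
mem.i_store(1, 42, send)            # both deferred readers are answered here
mem.i_fetch(1, ("t", "fp2"), send)  # value present: answered immediately
print(replies)
```

The requesting side never blocks in this model: it issues the fetch and continues, and the reply token later enables the successor instruction.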

Parallel Language Model
• Tree of Activation Frames (f:, g:, h:, loop)
• Global Heap of Shared Objects
• Active threads
• Asynchronous and parallel at all levels

Id World: implicit parallelism
[Figure: Id is compiled to Dataflow Graphs + I-Structures + ..., which run on TTDA, Monsoon, *T and *T-Voyager]

Id World people
• Rishiyur Nikhil, Keshav Pingali, Vinod Kathail, David Culler, Ken Traub
• Steve Heller, Richard Soley, Dinarte Morais
• Jamey Hicks, Alex Caro, Andy Shaw, Boon Ang
• Shail Aditya, R. Paul Johnson, Paul Barth, Jan Maessen, Christine Flood, Jonathan Young, Derek Chiou, Arun Iyengar, Zena Ariola, Mike Beckerle
• K. Ekanadham (IBM), Wim Bohm (Colorado), Joe Stoy (Oxford), ...
[Pictured: R. S. Nikhil, Keshav Pingali, David Culler, Ken Traub, Steve Heller, Boon S. Ang, Jamey Hicks, Derek Chiou]

Id Applications on Monsoon @ MIT
• Numerical
  - Hydrodynamics - SIMPLE
  - Global Circulation Model - GCM
  - Photon-Neutron Transport code - GAMTEB
  - N-body problem
• Symbolic
  - Combinatorics - free tree matching, Paraffins
  - Id-in-Id compiler
• System
  - I/O Library
  - Heap Storage Allocator on Monsoon
• Fun and Games
  - Breakout
  - Life
  - Spreadsheet

Id Run Time System (RTS) on Monsoon
• Frame Manager: Allocates frame memory on processors for procedure and loop activations (Derek Chiou)
• Heap Manager: Allocates storage in I-Structure memory or in Processor memory for heap objects (Arun Iyengar)

The Monsoon Project
Motorola Cambridge Research Center + MIT
MIT-Motorola collaboration 1988-91
• Research Prototypes
  - Monsoon Processor: 64-bit, 10M tokens/sec
  - I-structure: 4M 64-bit words
  - 16-node Fat Tree: 100 MB/sec
  - Unix Box
• 16 2-node systems (MIT, LANL, Motorola, Colorado, Oregon, McGill, USC, ...)
• 2 16-node systems (MIT, LANL)
• Id World Software
[Pictured: Tony Dahbura]

Single Processor Monsoon Performance Evolution
One 64-bit processor (10 MHz) + 4M 64-bit I-structure
Times in hours:minutes:seconds
  Benchmark                  Feb. 91  Aug. 91  Mar. 92  Sep. 92
  Matrix Multiply 500 x 500  4:04     3:58     3:55     1:46
Other rows of the table: Wavefront (500 x 500, 144 iters.); Paraffins (n = 19, n = 22); GAMTEB-9C (40K particles, 1M particles); SIMPLE-100 (1 iteration, 1K iterations)
Their entries, in extraction order: 5:00, :50, 5:00, :31, 17:20, 7:13:20, 10:42, 4:17:14, :19, 4:48:00, :15, 3:48, 5:36, 2:36:00, :10, :02.4, :32, 5:36, 2:22:00, :06, 1:19:49
Overlaid on the table: "Need a real machine to do this"

Monsoon Speed Up Results
Boon Ang, Derek Chiou, Jamey Hicks
September, 1992
                               speed up                   critical path (millions of cycles)
                               1pe   2pe   4pe   8pe      1pe    2pe    4pe    8pe
  Matrix Multiply 500 x 500    1.00  1.99  3.90  7.74     1057   531    271    137
  Paraffins n=22               1.00  1.99  3.92  7.25     322    162    82     44
  GAMTEB-2C 40K particles      1.00  1.95  3.81  7.35     590    303    155    80
  SIMPLE-100 iters             1.00  1.86  3.45  6.27     4681   2518   1355   747
Could not have asked for more
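
As a worked check of the speed-up slide, the short Python sketch below recomputes speed up (1-pe cycles divided by n-pe cycles) and parallel efficiency (speed up divided by the number of pe) from the critical-path cycle counts listed there. The names CYCLES, speedups and efficiencies are made up for this example. Because the slide gives cycle counts rounded to millions, a few recomputed values differ slightly from the printed speed-up figures (for example about 7.72 rather than 7.74 for Matrix Multiply on 8 pe), presumably a rounding effect.

```python
# Critical path in millions of cycles on 1, 2, 4 and 8 pe, as listed on the slide.
CYCLES = {
    "Matrix Multiply 500 x 500": [1057, 531, 271, 137],
    "Paraffins n=22": [322, 162, 82, 44],
    "GAMTEB-2C 40K particles": [590, 303, 155, 80],
    "SIMPLE-100 iters": [4681, 2518, 1355, 747],
}
PES = [1, 2, 4, 8]


def speedups(cycles):
    """Speed up relative to the single-processor run."""
    return [cycles[0] / c for c in cycles]


def efficiencies(cycles):
    """Fraction of ideal linear speed up achieved at each machine size."""
    return [s / p for s, p in zip(speedups(cycles), PES)]


for name, cycles in CYCLES.items():
    s = ", ".join(f"{x:.2f}" for x in speedups(cycles))
    e = ", ".join(f"{x:.0%}" for x in efficiencies(cycles))
    print(f"{name}: speed up {s}; efficiency {e}")
```

On these numbers, every benchmark stays above roughly 78% efficiency at 8 pe.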

Base Performance? Id on Monsoon vs. C / F77 on R3000
                               MIPS (R3000)       Monsoon (1pe)
                               (x 10^6 cycles)    (x 10^6 cycles)
  Matrix Multiply 500 x 500    954 +              1058
  Paraffins n=22               102 +              322
  GAMTEB-9C 40K particles      265 *              590
  SIMPLE-100 iters             1787 *             4682
  * Fortran 77, fully optimized    + MIPS C, O = 3
• R3000 cycles collected via Pixie
• 64-bit floating point used in Matrix-Multiply, GAMTEB and SIMPLE
• MIPS codes won't run on a parallel machine without recompilation/recoding
• 8-way superscalar? Unlikely to give 7 fold speedup
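
The cycle counts on this slide can be turned into a Monsoon-to-R3000 ratio per benchmark with a few lines of Python. This is plain arithmetic on the numbers above; the names BASE and cycle_ratio are invented for the example.

```python
# (R3000 cycles, Monsoon 1-pe cycles), both in millions, as listed on the slide.
BASE = {
    "Matrix Multiply 500 x 500": (954, 1058),
    "Paraffins n=22": (102, 322),
    "GAMTEB-9C 40K particles": (265, 590),
    "SIMPLE-100 iters": (1787, 4682),
}


def cycle_ratio(mips_cycles, monsoon_cycles):
    """How many Monsoon cycles were spent per R3000 cycle."""
    return monsoon_cycles / mips_cycles


for name, (mips, monsoon) in BASE.items():
    print(f"{name}: {cycle_ratio(mips, monsoon):.2f}x")
```

The ratios come out at about 1.11, 3.16, 2.23 and 2.62, so three of the four benchmarks fall in the range of two to three times as many cycles, with Matrix Multiply close to parity.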

The Monsoon Experience
• Performance of implicitly parallel Id programs scaled effortlessly.
• Id programs on a single-processor Monsoon took 2 to 3 times as many cycles as Fortran/C on a modern workstation.
  - Can certainly be improved
• Effort to develop the invisible software (loaders, simulators, I/O libraries, ...) dominated the effort to develop the visible software (compilers, ...)