A hardware-inspired model for parallel programming

Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
November 2, 2006
http://csg.csail.mit.edu/6.827/
Lecture 15

What we said in the first lecture

This subject is about:
  • The foundations of functional languages: the λ-calculus, types, monads, confluence, operational semantics, TRS...
  • General-purpose implicit parallel programming in Haskell & pH
  • Parallel programming based on atomic actions or transactions in Bluespec
  • Dataflow model of computation and understanding connections...

Bluespec and pH borrow heavily from functional languages, but their execution models differ completely from each other.

pH: Implicit Parallel Programming

Dataflow and multithreaded compilation model:

pH: parallel Haskell (Types, Higher-order functions, I-structures, M-structures)
    | front-end compilation
    v
Multithreaded Intermediate Language
    | code generation
    v
Multithreaded C
    v
SMPs, Clusters

We didn't discuss compilation much!

  • R. S. Nikhil, Arvind & many brilliant students @ MIT, mid 80's to 90's

Fully Parallel, Multithreaded Model

[diagram: Tree of Activation Frames (f:, g:, h:, loop) alongside a Global Heap of Shared Objects]

  • active threads
  • asynchronous at all levels
  • Synchronization?

Efficient mappings onto architectures have proved difficult.

Instead of focusing on compilation, we will study a hardware-inspired methodology for "synthesizing" parallel programs

  • Rule-based specification of behavior (Guarded Atomic Actions)
      - Lets you think one rule at a time
  • Composition of modules with guarded interfaces

Bluespec. Example: 802.11a transmitter
Unity: late 80s, Chandy & Misra

Warning: The ideas are untested in the software domain; you are the trailblazers.

Bluespec: State and Rules organized into modules

[diagram: modules, each with an interface]

All state (e.g., Registers, FIFOs, RAMs, ...) is explicit.
Behavior is expressed in terms of atomic actions on the state:
    Rule: condition → action
Rules can manipulate state in other modules only via their interfaces.
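The "condition → action" reading of a rule can be modelled in ordinary software. The following is a minimal sketch in Python, not Bluespec semantics in full: the names run_rules and RULES, the dictionary-based state, and the random choice among enabled rules are all assumptions made for illustration. A real Bluespec compiler schedules rules statically rather than picking one at random.

```python
import random

def run_rules(state, rules, max_steps=1000, seed=0):
    """Fire one enabled rule at a time until no guard holds.

    rules is a list of (name, guard, action) triples. guard(state) returns
    a bool; action(state) returns a new state dict. Each firing is atomic:
    the action sees the old state and its result replaces it in one step.
    """
    rng = random.Random(seed)
    fired = []
    for _step in range(max_steps):
        enabled = [(name, action) for name, guard, action in rules if guard(state)]
        if not enabled:
            break  # no rule can fire: the system is quiescent
        name, action = rng.choice(enabled)  # hypothetical scheduler: random pick
        state = action(state)
        fired.append(name)
    return state, fired

# Hypothetical example: tokens flow a -> b -> c, one token per rule firing.
RULES = [
    ("move_ab", lambda s: s["a"] > 0,
     lambda s: {**s, "a": s["a"] - 1, "b": s["b"] + 1}),
    ("move_bc", lambda s: s["b"] > 0,
     lambda s: {**s, "b": s["b"] - 1, "c": s["c"] + 1}),
]
```

Calling run_rules({"a": 3, "b": 0, "c": 0}, RULES) ends with all three tokens in c whatever order the scheduler picks, which is the sense in which each rule can be reasoned about on its own.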

Programming with rules: Example Euclid's GCD

Terms: GCD(x, y), integers

Rewrite rules:
    GCD(x, y) ⇒ GCD(y, x)      if x > y, y ≠ 0    (R1)
    GCD(x, y) ⇒ GCD(x, y-x)    if x ≤ y, y ≠ 0    (R2)

Initial term: GCD(initX, initY)

Execution:
    GCD(6, 15) ⇒(R2) GCD(6, 9) ⇒(R2) GCD(6, 3) ⇒(R1) GCD(3, 6) ⇒(R2) GCD(3, 3) ⇒(R2) GCD(3, 0)
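The rewriting sequence can be reproduced mechanically. The sketch below is a Python illustration (the function name gcd_trace is an assumption, not part of the lecture); it applies R1 and R2 until neither guard holds and records every intermediate term.

```python
def gcd_trace(x, y):
    """Apply rewrite rules R1/R2 to GCD(x, y); return the list of terms visited.

    Assumes x > 0 and y >= 0. With x == 0 and y > 0, rule R2 would rewrite
    GCD(0, y) to itself forever, so that input is rejected here.
    """
    if x <= 0 or y < 0:
        raise ValueError("expected x > 0 and y >= 0")
    trace = [(x, y)]
    while y != 0:
        if x > y:
            x, y = y, x      # R1: swap
        else:
            y = y - x        # R2: subtract
        trace.append((x, y))
    return trace

# gcd_trace(6, 15)
# -> [(6, 15), (6, 9), (6, 3), (3, 6), (3, 3), (3, 0)]
```

The first component of the final term is the answer, here 3.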

GCD in Bluespec

    module mkGCD (I_GCD);
      // State
      Reg#(int) x <- mkRegU;
      Reg#(int) y <- mkReg(0);

      // Internal behavior
      rule swap when ((x > y) && (y != 0)) ==>
        x <= y; y <= x;
      endrule
      rule subtract when ((x <= y) && (y != 0)) ==>
        y <= y - x;
      endrule

      // External interface
      method Action start(int a, int b) when (y == 0) ==>
        x <= a; y <= b;
      endmethod
      method int result() when (y == 0);
        return x;
      endmethod
    endmodule

Notes:
  • int stands for Int#(32) (typedef).
  • Assumes x /= 0 and y /= 0.
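To see how the guards on start and result interact with the two rules, here is a small software model of the module. It is a sketch under stated assumptions: the class GCD, the exception GuardError and the step method are invented for this illustration, and raising an exception stands in for a method whose ready signal is low (in hardware the call simply cannot happen).

```python
class GuardError(Exception):
    """Raised when a method is called while its guard is false."""

class GCD:
    def __init__(self):
        self.x = 0  # mkRegU leaves x uninitialized; 0 is an arbitrary stand-in
        self.y = 0  # mkReg(0)

    # External interface: both methods are guarded by y == 0.
    def start(self, a, b):
        if self.y != 0:
            raise GuardError("start: module is busy")
        self.x, self.y = a, b

    def result(self):
        if self.y != 0:
            raise GuardError("result: not ready")
        return self.x

    # Internal behavior: fire at most one enabled rule per call.
    def step(self):
        if self.x > self.y and self.y != 0:      # rule swap
            self.x, self.y = self.y, self.x
            return True
        if self.x <= self.y and self.y != 0:     # rule subtract
            self.y = self.y - self.x
            return True
        return False                             # no rule enabled
```

A caller does g.start(6, 15), keeps calling g.step() until it returns False, and then reads g.result(). Calling result earlier raises GuardError, which mirrors the implicit condition y == 0.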

GCD Hardware Module

[diagram: GCD module with two methods. start has an enab input, a rdy output and two int arguments; result has a rdy output and an int value. The implicit condition y == 0 drives rdy for both.]

    interface I_GCD;
      method Action start (int a, int b);
      method int result();
    endinterface

The module can easily be made polymorphic: parameterize the interface with #(type t) and replace int by t. In a GCD call, t could be Int#(32), UInt#(16), Int#(13), ...

Many different implementations can provide the same interface: module mkGCD (I_GCD)

Bluespec: Two-Level Compilation

Bluespec (Objects, Types, Higher-order functions)
    | Level 1 compilation (Lennart Augustsson @ Sandburst, 2000-2002)
    |   • Type checking
    |   • Massive partial evaluation and static elaboration
    v
Rules and Actions (Term Rewriting System)
    | Level 2 synthesis (James Hoe & Arvind @ MIT, 1997-2000)
    |   • Rule conflict analysis
    |   • Rule scheduling
    v
Object code (Verilog/C)

Static Elaboration

  • Inline function calls and datatypes
  • Instantiate modules with specific parameters
  • Resolve polymorphism/overloading

Software toolflow: source → compile → .exe → run w/ params (run 1, run 2, run 3, ...)
Hardware toolflow: source → elaborate w/ params → design 1, design 2, design 3 → run (run 1.1, ...; run 2.1, ...; run 3.1, ...)

Expressing designs for 802.11a transmitter in Bluespec (BSV)

802.11a Transmitter Overview

headers, data → Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend

  • 24 uncoded bits enter the pipeline from the Controller
  • Must produce one OFDM symbol every 4 μsec
  • Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol
  • One OFDM symbol = 64 complex numbers
  • IFFT: transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; accounts for > 95% area

Preliminary results

Design Block     Lines of Code (BSV)   Relative Area
Controller        49                    0%
Scrambler         40                    0%
Conv. Encoder    113                    0%
Interleaver       76                    1%
Mapper           112                   11%
IFFT              95                   85%
Cyc. Extender     23                    3%

Complex arithmetic libraries constitute another 200 lines of code.

Combinational IFFT

[diagram: in0 ... in63 → 16 Radix-4 nodes → Permute_1 → 16 Radix-4 nodes → Permute_2 → 16 Radix-4 nodes → Permute_3 → out0 ... out63. Inset: one Radix-4 node with inputs t0 ... t3, built from *, +, - and *j operators.]

All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

Design Alternative

Reuse a block over multiple cycles.

[diagram: two blocks f and g in sequence, versus a single block f reused]

We expect:
  • Throughput to reduce (less parallelism)
  • Energy/unit work to increase (due to extra HW)
  • Area to decrease (reusing a block)

Combinational IFFT: Opportunity for reuse

[diagram: in0 ... in63 → three identical groups of 16 Radix-4 nodes, each followed by Permute_1, Permute_2, Permute_3 → out0 ... out63]

Reuse the same circuit three times.

Circular pipeline: Reusing the Pipeline Stage

[diagram: in0 ... in63 → 16 Radix-4 nodes → Permute_1 / Permute_2 / Permute_3 → 64, 4-way Muxes → out0 ... out63, with a Stage Counter selecting the permutation]

The 16 Radix-4s can be shared but not the three permutations. Hence the need for muxes.

Superfolded circular pipeline: Just one Radix-4 node!

[diagram: in0 ... in63 → 4, 16-way Muxes → one Radix-4 node → 4, 16-way DeMuxes → Permute_1 / Permute_2 / Permute_3 → 64, 4-way Muxes → out0 ... out63. Index Counter: 0 to 15. Stage Counter: 0 to 2.]

Which design consumes the least energy to transmit a symbol?

Can we quickly code up all the alternatives?
  • single source with parameters?

Not practical in traditional hardware description languages like Verilog/VHDL.

Bluespec code: Radix-4 Node

    function Vector#(4,Complex) radix4(Vector#(4,Complex) t, Vector#(4,Complex) k);
      Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector();

      m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
      m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

      y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
      y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);

      z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
      z[2] = y[0] - y[2]; z[3] = y[1] - y[3];

      return(z);
    endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined.
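The same dataflow can be checked quickly in software. This Python transcription is a sketch for experimentation only: it uses Python's built-in complex numbers in place of the 16-bit fixed-point Complex type, and 1j plays the role of the constant i on the slide.

```python
def radix4(t, k):
    """Radix-4 node: multiply element-wise, then two layers of butterflies.

    t and k are sequences of four numbers. The result z satisfies
    z[n] = sum over j of m[j] * (1j ** (j * n)), where m[j] = k[j] * t[j],
    i.e. an unnormalized 4-point inverse DFT of the products.
    """
    m = [k[j] * t[j] for j in range(4)]

    y = [m[0] + m[2],
         m[0] - m[2],
         m[1] + m[3],
         1j * (m[1] - m[3])]

    z = [y[0] + y[2],
         y[1] + y[3],
         y[0] - y[2],
         y[1] - y[3]]
    return z
```

For instance, radix4([1, 1, 1, 1], [1, 1, 1, 1]) concentrates everything in z[0], and a single unit input in position 1 produces the four powers of 1j.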

Combinational IFFT: Can be used as a reference

[diagram: in0 ... in63 → three identical groups of 16 Radix-4 nodes, each followed by Permute_1, Permute_2, Permute_3 → out0 ... out63]

Each group of 16 Radix-4 nodes plus its permutation is one stage_f function; repeat it three times.

Bluespec Code for Combinational IFFT

    function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
      //Declare vectors
      SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);
      stage_data[0] = in_data;
      for (Integer stage = 0; stage < 3; stage = stage + 1)
        stage_data[stage+1] = stage_f(stage, stage_data[stage]);
      return(stage_data[3]);

The loop is unfolded during static elaboration to generate a combinational circuit.

Bluespec Code for stage_f (the stage function)

    function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in);
    begin
      for (Integer i = 0; i < 16; i = i + 1)
      begin
        Integer idx = i * 4;
        let twid = getTwiddle(stage, fromInteger(i));
        let y = radix4(twid, stage_in[idx:idx+3]);
        stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
        stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
      end
      //Permutation
      for (Integer i = 0; i < 64; i = i + 1)
        stage_out[i] = stage_temp[permute[i]];
    end
    return(stage_out);
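The structure of stage_f and of the three-stage composition can be modelled in Python as below. This is a structural sketch only, not a working IFFT: the slides do not give the twiddle factors or the permutation tables, so get_twiddle (all ones) and PERMUTE (a stride-16 shuffle) are placeholders invented for this example, and ifft_structure is a hypothetical name.

```python
def radix4(t, k):
    """Radix-4 node, transcribed from the Bluespec function of the same name."""
    m = [k[j] * t[j] for j in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]

def get_twiddle(stage, i):
    """Placeholder twiddles (assumption): the real tables are not in the slides."""
    return [1, 1, 1, 1]

# Placeholder permutation (assumption): a bijection on 0..63.
PERMUTE = [(i % 16) * 4 + i // 16 for i in range(64)]

def stage_f(stage, stage_in):
    """One stage: 16 radix-4 nodes on consecutive groups of 4, then a permutation."""
    stage_temp = [0] * 64
    for i in range(16):
        idx = i * 4
        twid = get_twiddle(stage, i)
        stage_temp[idx:idx + 4] = radix4(twid, stage_in[idx:idx + 4])
    return [stage_temp[PERMUTE[i]] for i in range(64)]

def ifft_structure(in_data):
    """Three applications of stage_f, as in the loop that elaboration unfolds."""
    data = list(in_data)
    for stage in range(3):
        data = stage_f(stage, data)
    return data
```

In hardware terms the for loop in ifft_structure disappears at elaboration time, leaving stage_f(2, stage_f(1, stage_f(0, x))) as one block of combinational logic. With these placeholder tables an all-ones input yields 64 in position 0 and zeros elsewhere.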

Synchronous pipeline

[diagram: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ]

    rule sync-pipeline (True);
      inQ.deq();
      sReg1 <= f1(inQ.first());
      sReg2 <= f2(sReg1);
      outQ.enq(f3(sReg2));
    endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code.

What about pipeline bubbles?

    rule sync-pipeline (True);
      Maybe#(data_T) sx, ox;
      for (Integer i = 0; i < n; i = i + 1)
      begin
        //Get stage input
        if (i == 0)
          if (inQ.notEmpty)
            begin sx = inQ.first(); inQ.deq(); end
          else sx = Invalid;
        else sx = sRegs[i-1];

        //Calculate value
        case (sx) matches
          tagged Valid .x: ox = f(fromInteger(i), x);
          tagged Invalid:  ox = Invalid;
        endcase

        //Write Outputs
        if (i == n-1) outQ.enq(ox);
        else sRegs[i] <= ox;
      end
    endrule

The Maybe type used for sx and ox:

    typedef union tagged {
      void Invalid;
      data_T Valid;
    } Maybe#(type data_T);
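A software model makes the bubble handling easy to trace. In this Python sketch (the class SyncPipeline and its attribute names are assumptions for illustration), None plays the role of Invalid, every stage reads the register values from before the step, and all registers are updated together, which mimics the atomic rule. Unlike the rule above, the model drops bubbles at the output instead of enqueuing an Invalid value; that is a simplification made here.

```python
from collections import deque

class SyncPipeline:
    """n-stage synchronous pipeline; None marks a bubble (Invalid)."""

    def __init__(self, stages):
        self.stages = list(stages)               # functions f_0 .. f_{n-1}
        self.regs = [None] * (len(stages) - 1)   # sRegs between stages
        self.in_q = deque()
        self.out_q = deque()

    def step(self):
        """One firing of the rule: all stages advance together."""
        old = list(self.regs)                    # values read by this firing
        last = len(self.stages) - 1
        for i, f in enumerate(self.stages):
            # Get stage input
            if i == 0:
                sx = self.in_q.popleft() if self.in_q else None
            else:
                sx = old[i - 1]
            # Calculate value (bubbles pass through untouched)
            ox = None if sx is None else f(sx)
            # Write outputs
            if i == last:
                if ox is not None:
                    self.out_q.append(ox)
            else:
                self.regs[i] = ox
```

With three stages, an input enqueued before step k appears in out_q after step k+2, and a step with an empty in_q simply injects a bubble that travels down the registers.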

Folded pipeline

[diagram: x → inQ → f (selecting f1, f2 or f3 by stage) → sReg, looping back into f → outQ]

    rule folded-pipeline (True);
      if (stage == 1)
        begin inQ.deq(); sxIn = inQ.first(); end
      else sxIn = sReg;
      sxOut = f(stage, sxIn);
      if (stage == 3) outQ.enq(sxOut);
      else sReg <= sxOut;
      stage <= (stage == 3) ? 1 : stage + 1;
    endrule

    function f (stage, sx);
      case (stage)
        1: return f1(sx);
        2: return f2(sx);
        3: return f3(sx);
      endcase
    endfunction

This is real IFFT code too...
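The folded version can be modelled the same way. In this Python sketch (FoldedPipeline and its field names are assumptions), one call to step corresponds to one firing of the rule, and returning False stands for the rule being blocked by the implicit condition of inQ.deq() on an empty queue.

```python
from collections import deque

class FoldedPipeline:
    """One datum in flight; a single stage register is reused three times."""

    def __init__(self, f1, f2, f3):
        self.fs = {1: f1, 2: f2, 3: f3}
        self.in_q = deque()
        self.out_q = deque()
        self.stage = 1
        self.s_reg = None

    def step(self):
        """Fire the rule once; return False if it cannot fire."""
        if self.stage == 1:
            if not self.in_q:
                return False                 # implicit guard of inQ.deq()
            sx_in = self.in_q.popleft()
        else:
            sx_in = self.s_reg
        sx_out = self.fs[self.stage](sx_in)  # f(stage, sxIn)
        if self.stage == 3:
            self.out_q.append(sx_out)
        else:
            self.s_reg = sx_out
        self.stage = 1 if self.stage == 3 else self.stage + 1
        return True
```

Each input now needs three steps before the next one can enter, which is the throughput cost of sharing one block of logic across stages.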

Expressing these designs in Bluespec is easy

All these designs were done in less than one day!
  • Combinational
  • Pipelined
  • Folded (16 Radices)
  • Super-Folded (8 Radices)
  • Super-Folded (4 Radices)
  • Super-Folded (2 Radices)
  • Super-Folded (1 Radix)

Area and power estimates?

802.11a Transmitter Synthesis results

IFFT Design                Area (mm²)  Symbol Latency (CLKs)  Throughput Latency (CLKs/sym)  Min. Freq Required  Average Power (mW)
Pipelined                  5.25        12                     04                             1.0 MHz             4.92
Combinational              4.91        10                     04                             1.0 MHz             3.99
Folded (16 Radices)        3.97        12                     04                             1.0 MHz             7.27
Super-Folded (8 Radices)   3.69        15                     06                             1.5 MHz             10.9
SF (4 Radices)             2.45        21                     12                             3.0 MHz             14.4
SF (2 Radices)             1.84        33                     24                             6.0 MHz             21.1
SF (1 Radix)               1.52        57                     48                             12 MHz              34.6
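The "Min. Freq Required" column follows from the throughput column and the requirement of one OFDM symbol every 4 μs: frequency in MHz equals clocks per symbol divided by the symbol period in microseconds. The Python sketch below reproduces that column and, under the assumption that the reported average power is drawn for the whole 4 μs symbol period, estimates energy per symbol (mW × μs = nJ). The names DESIGNS, min_freq_mhz and energy_per_symbol_nj are chosen for this example.

```python
SYMBOL_PERIOD_US = 4  # one OFDM symbol every 4 microseconds

# design name -> (throughput latency in CLKs/sym, average power in mW),
# copied from the synthesis results table
DESIGNS = {
    "Pipelined":                (4, 4.92),
    "Combinational":            (4, 3.99),
    "Folded (16 Radices)":      (4, 7.27),
    "Super-Folded (8 Radices)": (6, 10.9),
    "SF (4 Radices)":           (12, 14.4),
    "SF (2 Radices)":           (24, 21.1),
    "SF (1 Radix)":             (48, 34.6),
}

def min_freq_mhz(clks_per_symbol):
    """Slowest clock (MHz) that still finishes a symbol within the period."""
    return clks_per_symbol / SYMBOL_PERIOD_US

def energy_per_symbol_nj(power_mw):
    """Energy per symbol in nJ, assuming power is averaged over one period."""
    return power_mw * SYMBOL_PERIOD_US

def lowest_energy_design(designs=DESIGNS):
    """Name of the design with the smallest estimated energy per symbol."""
    return min(designs, key=lambda name: energy_per_symbol_nj(designs[name][1]))
```

Under that assumption the symbol period is the same for every design, so the ranking by energy per symbol is simply the ranking by average power.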

Why are the areas so similar?

802.11a Observation

  • Dataflow network, aka Kahn networks
  • How should this level of concurrency be expressed in a reference code (say in C or SystemC)?
  • Can we write specs which work for both hardware and software?

Bluespec Tool flow

Bluespec SystemVerilog source → Bluespec Compiler, which produces:
  • C → Bluesim (cycle accurate)
  • Verilog 95 RTL → Verilog sim; RTL synthesis → gates; FPGA

Simulation and analysis:
  • VCD output → Debussy Visualization
  • Power estimation tool: Sequence Design PowerTheater