A hardware-inspired model for parallel programming (Arvind)
A hardware-inspired model for parallel programming. Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology. November 2, 2006. http://csg.csail.mit.edu/6.827/ (Lecture L15)
What we said in the first lecture
This subject is about:
- The foundations of functional languages: the λ-calculus, types, monads, confluence, operational semantics, TRS, ...
- General-purpose implicit parallel programming in Haskell and pH
- Parallel programming based on atomic actions or transactions in Bluespec
- The dataflow model of computation and understanding connections ...
Bluespec and pH borrow heavily from functional languages, but their execution models differ completely from each other.

pH: Implicit Parallel Programming
pH is parallel Haskell (types, higher-order functions, I-structures, M-structures), built on a dataflow and multithreaded compilation model. (R. S. Nikhil, Arvind and many brilliant students at MIT, mid-80s to 90s.)
Compilation flow: pH → front-end compilation → Multithreaded Intermediate Language → code generation → Multithreaded C → SMPs, clusters.
We didn't discuss compilation much!

Fully Parallel, Multithreaded Model
[Figure: a tree of activation frames (f, g, h, loop) with active threads, next to a global heap of shared objects; how the two synchronize is the open question.]
Execution is asynchronous at all levels. Efficient mapping onto architectures has proved difficult.

Instead of focusing on compilation, we will study a hardware-inspired methodology for "synthesizing" parallel programs:
- Rule-based specification of behavior (Guarded Atomic Actions; cf. Unity, Chandy & Misra, late 80s)
  - Lets you think one rule at a time
- Composition of modules with guarded interfaces (Bluespec)
Example: 802.11a transmitter
Warning: The ideas are untested in the software domain; you are the trailblazers.

Bluespec: State and Rules organized into modules
[Figure: modules, each with an interface]
- All state (e.g., registers, FIFOs, RAMs, ...) is explicit.
- Behavior is expressed in terms of atomic actions on the state. Rule: condition → action
- Rules can manipulate state in other modules only via their interfaces.

Programming with rules: Example Euclid's GCD
Terms: GCD(x, y), where x and y are integers
Rewrite rules:
    (R1) GCD(x, y) ⇒ GCD(y, x)      if x > y, y ≠ 0
    (R2) GCD(x, y) ⇒ GCD(x, y-x)    if x ≤ y, y ≠ 0
Initial term: GCD(initX, initY)
Execution:
    GCD(6, 15) ⇒(R2) GCD(6, 9) ⇒(R2) GCD(6, 3) ⇒(R1) GCD(3, 6) ⇒(R2) GCD(3, 3) ⇒(R2) GCD(3, 0)

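The two rewrite rules can be replayed in software. The following is a minimal Python sketch, not part of the original slides; the function name `gcd_trace` is made up for illustration. It applies R1 or R2 until neither guard holds and records every intermediate term.

```python
def gcd_trace(x, y):
    """Apply rewrite rules R1/R2 to GCD(x, y) until no rule is enabled.

    Returns the list of terms visited, as (x, y) pairs.
    Assumes x != 0, as the slides do; otherwise R2 would never terminate.
    """
    trace = [(x, y)]
    while y != 0:
        if x > y:
            x, y = y, x      # R1: GCD(x, y) => GCD(y, x)     if x > y, y != 0
        else:
            y = y - x        # R2: GCD(x, y) => GCD(x, y - x) if x <= y, y != 0
        trace.append((x, y))
    return trace


if __name__ == "__main__":
    for term in gcd_trace(6, 15):
        print("GCD%s" % (term,))
```

Because the two guards are mutually exclusive, at most one rule applies to any term, so the rewriting here is deterministic.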
GCD in Bluespec
    module mkGCD (I_GCD);
      // State (int is Int#(32))
      Reg#(int) x <- mkRegU;
      Reg#(int) y <- mkReg(0);

      // Internal behavior
      rule swap when ((x > y) && (y != 0)) ==>
        x <= y; y <= x;
      endrule
      rule subtract when ((x <= y) && (y != 0)) ==>
        y <= y - x;
      endrule

      // External interface
      method Action start(int a, int b) when (y == 0) ==>
        x <= a; y <= b;
      endmethod
      method int result() when (y == 0);
        return x;
      endmethod
    endmodule
Assumes x /= 0 and y /= 0.

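To make the "state plus guarded atomic actions" reading concrete, here is a small Python model of the module. It is only a sketch under stated assumptions: the class `GCDModule`, the `*_ready` helpers and the `run` scheduler are invented for illustration and do not correspond to any Bluespec API. Each rule is a (guard, action) pair, and the scheduler repeatedly fires one enabled rule atomically.

```python
import random


class GCDModule:
    """Toy model of mkGCD: explicit state, guarded rules, guarded methods."""

    def __init__(self):
        self.x = 0
        self.y = 0   # y == 0 means "idle / result available"

    # Rules: a guard plus an atomic action on the state.
    def swap_ready(self):
        return self.x > self.y and self.y != 0

    def swap(self):
        self.x, self.y = self.y, self.x      # both registers update together

    def subtract_ready(self):
        return self.x <= self.y and self.y != 0

    def subtract(self):
        self.y = self.y - self.x

    # Interface methods, each with the implicit condition y == 0.
    def start_ready(self):
        return self.y == 0

    def start(self, a, b):
        assert self.start_ready(), "start called while busy"
        self.x, self.y = a, b

    def result_ready(self):
        return self.y == 0

    def result(self):
        assert self.result_ready(), "result not ready"
        return self.x


def run(module, rng=random):
    """Hypothetical scheduler: fire any one enabled rule until none is enabled."""
    rules = [(module.swap_ready, module.swap),
             (module.subtract_ready, module.subtract)]
    while True:
        enabled = [action for guard, action in rules if guard()]
        if not enabled:
            return
        rng.choice(enabled)()


if __name__ == "__main__":
    g = GCDModule()
    g.start(6, 15)
    run(g)
    print(g.result())
```

The guards of `swap` and `subtract` never hold at the same time, so the random choice never matters here; in a module with overlapping guards the scheduler's choice would be the source of nondeterminism.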
GCD Hardware Module
[Figure: the GCD module with a start port (two int inputs, enab and rdy signals) and a result port (int output, rdy signal). The implicit condition y == 0 drives both rdy signals.]
    interface I_GCD;
      method Action start (int a, int b);
      method int result();
    endinterface
The module can easily be made polymorphic: parameterize the interface with #(type t) and replace int by t. In a GCD call, t could be Int#(32), UInt#(16), Int#(13), ...
Many different implementations can provide the same interface: module mkGCD (I_GCD)

Bluespec: Two-Level Compilation
- Bluespec source (objects, types, higher-order functions)
- Level 1 compilation (Lennart Augustsson, Sandburst, 2000-2002): type checking; massive partial evaluation and static elaboration
- Rules and Actions (Term Rewriting System)
- Level 2 synthesis (James Hoe & Arvind, MIT, 1997-2000): rule conflict analysis; rule scheduling
- Object code (Verilog/C)

Static Elaboration
- Inline function calls and datatypes
- Instantiate modules with specific parameters
- Resolve polymorphism/overloading
Software toolflow: source → compile → .exe → run w/ params (run 1, run 2, run 3, ...)
Hardware toolflow: source → elaborate w/ params → design 1, design 2, design 3 → runs of each design (run 1.1, ...; run 2.1, ...; run 3.1, ...)

Expressing designs for the 802.11a transmitter in Bluespec (BSV)
802.11a Transmitter Overview
Pipeline: headers and data → Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend
Labels in the figure: 24 uncoded bits; one OFDM symbol (64 complex numbers).
- Must produce one OFDM symbol every 4 μs.
- Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol.
- IFFT: transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; accounts for > 95% area.

Preliminary results
    Design Block     Lines of Code (BSV)   Relative Area
    Controller        49                    0%
    Scrambler         40                    0%
    Conv. Encoder    113                    0%
    Interleaver       76                    1%
    Mapper           112                   11%
    IFFT              95                   85%
    Cyc. Extender     23                    3%
Complex arithmetic libraries constitute another 200 lines of code.

Combinational IFFT
[Figure: in0 ... in63 pass through three stages; each stage is 16 Radix-4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 ... out63. Inset: a Radix-4 node with inputs t0 ... t3, built from multipliers, adders, subtractors and one multiply-by-j.]
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

Design Alternative
Reuse a block over multiple cycles. [Figure: block diagram with blocks f and g.]
We expect:
- Throughput to reduce (less parallelism)
- Energy/unit work to increase (due to extra HW)
- Area to decrease (reusing a block)

Combinational IFFT: Opportunity for reuse
[Figure: the same three-stage network of 16 Radix-4 nodes and Permute_1, Permute_2, Permute_3, from in0 ... in63 to out0 ... out63.]
Reuse the same circuit three times.

Circular pipeline: Reusing the Pipeline Stage
[Figure: in0 ... in63 enter through 64 4-way muxes, pass through one stage of 16 Radix-4 nodes, and then through Permute_1, Permute_2 or Permute_3 as selected by a stage counter, looping back until out0 ... out63 are produced.]
The 16 Radix-4 nodes can be shared, but not the three permutations; hence the need for muxes.

Superfolded circular pipeline: Just one Radix-4 node!
[Figure: a single Radix-4 node fed by 4 16-way muxes and followed by 4 16-way demuxes, stepped by an index counter (0 to 15); 64 4-way muxes select among Permute_1, Permute_2 and Permute_3 under a stage counter (0 to 2); inputs in0 ... in63, outputs out0 ... out63.]

Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives?
- A single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL.

Bluespec code: Radix-4 Node
    function Vector#(4, Complex) radix4(Vector#(4, Complex) t,
                                        Vector#(4, Complex) k);
      Vector#(4, Complex) m = newVector(),
                          y = newVector(),
                          z = newVector();

      m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
      m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

      y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
      y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);

      z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
      z[2] = y[0] - y[2]; z[3] = y[1] - y[3];

      return(z);
    endfunction
Polymorphic code: works on any type of numbers for which *, + and - have been defined.

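The same butterfly can be written in Python as a software reference. This is a sketch rather than the course's code: it uses Python's built-in complex numbers in place of the fixed-point `Complex` type, and `1j` stands in for the constant `i` on the slide.

```python
def radix4(t, k):
    """Radix-4 node: element-wise multiply t by k, then two add/subtract layers.

    t and k are sequences of four numbers (complex or real).
    """
    m = [k[n] * t[n] for n in range(4)]

    y = [m[0] + m[2],
         m[0] - m[2],
         m[1] + m[3],
         1j * (m[1] - m[3])]

    z = [y[0] + y[2],
         y[1] + y[3],
         y[0] - y[2],
         y[1] - y[3]]
    return z


if __name__ == "__main__":
    # With all multipliers equal to 1 the node computes an unnormalized
    # 4-point inverse DFT of its data inputs.
    print(radix4([1, 2, 3, 4], [1, 1, 1, 1]))
```

Like the Bluespec version, the function is polymorphic in practice: it works for any values that support `*`, `+` and `-` and can be multiplied by `1j`.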
Combinational IFFT: Can be used as a reference
[Figure: the three-stage network of 16 Radix-4 nodes and Permute_1, Permute_2, Permute_3, from in0 ... in63 to out0 ... out63; one stage is labelled as the stage_f function.]
Write the stage_f function once and repeat it three times.

Bluespec Code for Combinational IFFT
    function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
      //Declare vectors
      SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);
      stage_data[0] = in_data;
      for (Integer stage = 0; stage < 3; stage = stage + 1)
        stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);
      return(stage_data[3]);
    endfunction
The code is unfolded to generate a combinational circuit.

Bluespec Code for stage_f (the stage function)
    function SVector#(64, Complex) stage_f (Bit#(2) stage,
                                            SVector#(64, Complex) stage_in);
      // (declarations of stage_temp and stage_out are not shown on the slide)
      begin
        for (Integer i = 0; i < 16; i = i + 1)
        begin
          Integer idx = i * 4;
          let twid = getTwiddle(stage, fromInteger(i));
          let y = radix4(twid, stage_in[idx:idx+3]);
          stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
          stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
        end
        //Permutation
        for (Integer i = 0; i < 64; i = i + 1)
          stage_out[i] = stage_temp[permute[i]];
      end
      return(stage_out);
    endfunction

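The structure of `stage_f` and of the three-fold unrolled loop can be mirrored in Python. This sketch shows only the shape of the computation: `unit_twiddle` and `STRIDE_PERMUTE` are placeholder assumptions chosen so the example runs, and they are not the twiddle factors or permutations of the real 802.11a IFFT, which the slides do not give.

```python
def radix4(t, k):
    """Radix-4 node, as on the earlier slide (1j plays the role of i)."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def stage_f(stage, stage_in, get_twiddle, permute):
    """One stage: 16 radix-4 nodes over groups of 4, then a 64-way permutation."""
    stage_temp = []
    for i in range(16):
        idx = i * 4
        twid = get_twiddle(stage, i)
        stage_temp.extend(radix4(twid, stage_in[idx:idx + 4]))
    # Permutation: output i takes the value at position permute[i].
    return [stage_temp[permute[i]] for i in range(64)]


def ifft(in_data, get_twiddle, permute):
    """Apply stage_f three times; in hardware this loop is unrolled."""
    stage_data = list(in_data)
    for stage in range(3):
        stage_data = stage_f(stage, stage_data, get_twiddle, permute)
    return stage_data


# --- Placeholders (assumptions for illustration only) ---
def unit_twiddle(stage, i):
    return [1, 1, 1, 1]


# A stride permutation: index 16*a + b reads from position 4*b + a.
STRIDE_PERMUTE = [4 * (i % 16) + i // 16 for i in range(64)]


if __name__ == "__main__":
    out = ifft([1] * 64, unit_twiddle, STRIDE_PERMUTE)
    print(out[:4])
```

With these placeholders, a constant input of 64 ones collapses into a single non-zero output of magnitude 64 at index 0, which is a convenient smoke test for the wiring of the three stages.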
Synchronous pipeline
[Figure: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ]
    rule sync-pipeline (True);
      inQ.deq();
      sReg1 <= f1(inQ.first());
      sReg2 <= f2(sReg1);
      outQ.enq(f3(sReg2));
    endrule
This is real IFFT code; just replace f1, f2 and f3 with stage_f code.

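A rule fires atomically: all right-hand sides read the old register values, and all updates land together. The Python sketch below models one firing of the rule under some assumptions that are not in the slides: the function name `sync_pipeline_step`, the registers starting at 0, and the three sample stage functions.

```python
from collections import deque


def sync_pipeline_step(in_q, regs, out_q, f1, f2, f3):
    """One firing of the sync-pipeline rule.

    regs is a two-element list [sReg1, sReg2]. Returns False when the rule
    cannot fire because inQ is empty (the implicit guard of deq/first).
    """
    if not in_q:
        return False
    x = in_q.popleft()
    s1_old, s2_old = regs          # read old state first
    out_q.append(f3(s2_old))
    regs[0] = f1(x)                # then update all state at once
    regs[1] = f2(s1_old)
    return True


if __name__ == "__main__":
    f1 = lambda v: v + 1
    f2 = lambda v: v * 2
    f3 = lambda v: v - 3

    in_q = deque([10, 20, 30, 40])
    regs = [0, 0]                  # assumed initial register contents
    out_q = []
    while sync_pipeline_step(in_q, regs, out_q, f1, f2, f3):
        pass
    print(out_q)
```

In this model the first two values enqueued come from the initial register contents rather than from real inputs, and the last two inputs remain in the registers once inQ runs dry.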
What about pipeline bubbles?
    typedef union tagged {
      void   Invalid;
      data_T Valid;
    } Maybe#(type data_T);

    rule sync-pipeline (True);
      Maybe#(data_T) sx, ox;
      for (Integer i = 0; i < n; i = i + 1)
      begin
        //Get stage input
        if (i == 0)
          if (inQ.notEmpty)
            begin sx = inQ.first(); inQ.deq(); end
          else sx = Invalid;
        else sx = sRegs[i-1];
        //Calculate value
        case (sx) matches
          tagged Valid .x: ox = f(fromInteger(i), x);
          tagged Invalid:  ox = Invalid;
        endcase
        //Write outputs
        if (i == n-1) outQ.enq(ox);
        else sRegs[i] <= ox;
      end
    endrule

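The bubble-tolerant rule can be modelled in Python with `None` playing the role of `Invalid`. This is a sketch with assumptions: the name `bubble_step` is invented, and, unlike the slide, it enqueues only valid results and silently drops bubbles at the last stage.

```python
from collections import deque


def bubble_step(in_q, s_regs, out_q, f, n):
    """One firing of an n-stage synchronous pipeline with valid bits.

    s_regs holds n-1 stage registers; None marks a bubble (Invalid).
    f(i, x) is the function of stage i.
    """
    new_regs = list(s_regs)                # writes take effect after the rule
    for i in range(n):
        # Get stage input
        if i == 0:
            sx = in_q.popleft() if in_q else None
        else:
            sx = s_regs[i - 1]             # always read the old register value
        # Calculate value
        ox = None if sx is None else f(i, sx)
        # Write outputs
        if i == n - 1:
            if ox is not None:             # assumption: bubbles are not enqueued
                out_q.append(ox)
        else:
            new_regs[i] = ox
    s_regs[:] = new_regs


if __name__ == "__main__":
    stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    f = lambda i, v: stages[i](v)

    in_q, s_regs, out_q = deque([10, 20]), [None, None], []
    for _ in range(4):                     # two fills plus two drain cycles
        bubble_step(in_q, s_regs, out_q, f, 3)
    print(out_q)
```

Because the rule keeps firing when inQ is empty, the pipeline drains on its own and no start-up garbage reaches outQ.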
Folded pipeline
[Figure: x → inQ → f (one block that acts as f1, f2 or f3, selected by the stage register) → sReg, looping back into f → outQ]
    rule folded-pipeline (True);
      if (stage==1)
        begin inQ.deq(); sxIn = inQ.first(); end
      else sxIn = sReg;
      sxOut = f(stage, sxIn);
      if (stage==3) outQ.enq(sxOut);
      else sReg <= sxOut;
      stage <= (stage==3)? 1 : stage+1;
    endrule

    function f (stage, sx);
      case (stage)
        1: return f1(sx);
        2: return f2(sx);
        3: return f3(sx);
      endcase
    endfunction
This is real IFFT code too...

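A Python model of the folded rule makes the trade-off visible: one shared block, three firings per input. The sketch below is illustrative only; `folded_step`, the dictionary holding the state, and the sample stage functions are assumptions, not part of the slides.

```python
from collections import deque


def folded_step(state, in_q, out_q, f):
    """One firing of the folded-pipeline rule.

    state has keys "stage" (1..3) and "s_reg". Returns False when the rule
    cannot fire (stage 1 needs an input and inQ is empty).
    """
    stage = state["stage"]
    if stage == 1:
        if not in_q:
            return False               # implicit guard of inQ.deq()/first()
        sx_in = in_q.popleft()
    else:
        sx_in = state["s_reg"]

    sx_out = f(stage, sx_in)

    if stage == 3:
        out_q.append(sx_out)
    else:
        state["s_reg"] = sx_out
    state["stage"] = 1 if stage == 3 else stage + 1
    return True


def f(stage, sx):
    """The shared block: behaves as f1, f2 or f3 depending on stage."""
    if stage == 1:
        return sx + 1
    if stage == 2:
        return sx * 2
    return sx - 3


if __name__ == "__main__":
    state = {"stage": 1, "s_reg": None}
    in_q, out_q = deque([10, 20]), []
    firings = 0
    while folded_step(state, in_q, out_q, f):
        firings += 1
    print(out_q, firings)
```

Each input takes three firings to emerge, so throughput drops to a third of the synchronous pipeline while only one copy of the stage logic is needed.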
Expressing these designs in Bluespec is easy
All these designs were done in less than one day! Area and power estimates?
- Combinational
- Pipelined
- Folded (16 Radices)
- Super-Folded (8 Radices)
- Super-Folded (4 Radices)
- Super-Folded (2 Radices)
- Super-Folded (1 Radix)

802.11a Transmitter Synthesis results
    IFFT Design                Area (mm²)   Symbol Latency (CLKs)   Throughput Latency (CLKs/sym)   Min. Freq Required   Average Power (mW)
    Pipelined                  5.25         12                       4                              1.0 MHz              4.92
    Combinational              4.91         10                       4                              1.0 MHz              3.99
    Folded (16 Radices)        3.97         12                       4                              1.0 MHz              7.27
    Super-Folded (8 Radices)   3.69         15                       6                              1.5 MHz              10.9
    SF (4 Radices)             2.45         21                      12                              3.0 MHz              14.4
    SF (2 Radices)             1.84         33                      24                              6.0 MHz              21.1
    SF (1 Radix)               1.52         57                      48                              12 MHz               34.6

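The "Min. Freq Required" column follows from the throughput column and the symbol deadline. Assuming one OFDM symbol must be produced every 4 μs (the 802.11a symbol period), a design that needs N clocks per symbol must run at N / 4 μs or faster. The short Python check below recomputes the column; the names `SYMBOL_PERIOD_US` and `min_freq_mhz` are chosen for illustration.

```python
SYMBOL_PERIOD_US = 4.0   # assumed deadline: one OFDM symbol every 4 microseconds


def min_freq_mhz(clks_per_symbol):
    """Minimum clock frequency (MHz) to sustain one symbol per period.

    clocks per microsecond is numerically equal to MHz.
    """
    return clks_per_symbol / SYMBOL_PERIOD_US


if __name__ == "__main__":
    throughput = {
        "Pipelined": 4,
        "Combinational": 4,
        "Folded (16 Radices)": 4,
        "Super-Folded (8 Radices)": 6,
        "SF (4 Radices)": 12,
        "SF (2 Radices)": 24,
        "SF (1 Radix)": 48,
    }
    for design, clks in throughput.items():
        print("%-26s %5.1f MHz" % (design, min_freq_mhz(clks)))
```

The smaller designs save area but must be clocked faster to meet the same deadline, which is consistent with the rising average power in the table.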
Why are the areas so similar?
802.11a Observation
- Dataflow network (aka Kahn networks)
- How should this level of concurrency be expressed in a reference code (say in C or SystemC)?
- Can we write specs which work for both hardware and software?

Bluespec Tool flow
[Figure: Bluespec SystemVerilog source → Bluespec Compiler, which emits either C for Bluesim (cycle accurate) or Verilog 95 RTL. The RTL goes to Verilog simulation (VCD output, Debussy visualization) and to RTL synthesis (gates, FPGA); a power estimation tool (Sequence Design PowerTheater) is also part of the flow.]