Multi-cycle Implementations
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
January 13, 2012
http://csg.csail.mit.edu/SNU

Harvard-Style Datapath for MIPS
[Figure: single-cycle datapath with separate instruction and data memories: PC, +4 adder (0x4), Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory. Control signals: PCSrc (br, rind, jabs, pc+4), RegWrite, MemWrite, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?, WBSrc.]

What problem arises if instructions and data reside in the same memory?
At least the instruction fetch and a Load (or Store) cannot be executed in the same cycle: a structural hazard.

Princeton Microarchitecture: Datapath & Control
[Figure: datapath with a single memory and an instruction register (IR). Control signals: PCSrc, RegWrite, MemWrite, PCen, IRen, AddrSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?, WBSrc. Shown in the fetch phase: AddrSrc = PC, IRen = on, PCen = off, MemWrite = off.]

Two-State Controller: Princeton Architecture
fetch phase: AddrSrc=PC, IRen=on, PCen=off, Wen=off
execute phase: AddrSrc=ALU, IRen=off, PCen=on, Wen=on
A flip-flop can be used to remember the phase.

Hardwired Controller: Princeton Architecture
[Figure: controller built from the old combinational logic (Harvard), new combinational logic, and a 1-bit toggle flip-flop S (I-fetch / Execute). Inputs: op code (from IR), zero?. Signals shown: ExtSel, BSrc, OpSel, WBSrc, RegDest, PCsrc1, PCsrc2, MemWrite, RegWrite, Wen, PCen, IRen, AddrSrc.]

Two-Cycle SMIPS
[Figure: block diagram with PC, +4, Inst Memory, ir, stage, Decode, Register File, Execute, Data Memory.]

Two-Cycle SMIPS

    module mkProc(Proc);
      Reg#(Addr)        pc    <- mkRegU;
      RFile             rf    <- mkRFile;
      Memory            mem   <- mkMemory;
      PipeReg#(FBundle) ir    <- mkPipeReg;
      Reg#(Bit#(1))     stage <- mkReg(0);

      // fields of the bundle at the head of ir
      // (named irPc rather than pc so that the pc register is not shadowed)
      let pcir = ir.first();
      let irPc = pcir.pc;
      let inst = pcir.inst;

      rule doProc;
        if(stage==0 && ir.notFull) begin
          // fetch
          let instResp <- mem(MemReq{op:Ld, addr:pc, data:?});
          ir.enq(FBundle{pc:pc, inst:instResp});
          stage <= 1;
        end
        if(stage==1 && ir.notEmpty) begin
          // decode
          let decInst = decode(inst);
          Data rVal1 = rf.rd1(decInst.rSrc1);
          Data rVal2 = rf.rd2(decInst.rSrc2);
          // execute
          let execInst = exec(decInst, irPc, rVal1, rVal2);
          if(execInst.instType==Ld || execInst.instType==St)
            execInst.data <- mem(MemReq{op:execInst.instType, addr:execInst.addr, data:execInst.data});
          pc <= execInst.brTaken ? execInst.addr : irPc+4;
          // writeback
          if(execInst.instType==Alu || execInst.instType==Ld)
            rf.wr(execInst.rDst, execInst.data);
          ir.deq;
          stage <= 0;
        end
      endrule
    endmodule

Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

- Instructions per program depends on source code, compiler technology and ISA
- Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture           CPI   cycle time
Microcoded                  >1    short
Single-cycle unpipelined    1     long
Pipelined                   1     short
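As a quick numerical illustration of the performance equation, the sketch below multiplies the three factors for the three microarchitecture styles in the table. All numbers (instruction count, CPI of 4 for the microcoded design, cycle times of 1 ns and 5 ns) are hypothetical and chosen only to match the qualitative entries ">1 / short", "1 / long" and "1 / short".

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns

INSTRUCTIONS = 1_000_000  # hypothetical program length

# (CPI, cycle time in ns): made-up values consistent with the table's qualitative entries
designs = {
    "microcoded":               (4.0, 1.0),
    "single-cycle unpipelined": (1.0, 5.0),
    "pipelined":                (1.0, 1.0),
}

times = {name: execution_time_ns(INSTRUCTIONS, cpi, t_cycle)
         for name, (cpi, t_cycle) in designs.items()}

for name, t in times.items():
    print(f"{name}: {t / 1e6} ms")
```

With these assumed values the pipelined design wins because it keeps both the CPI and the cycle time low; the other two each pay for one of the factors.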

Single-Cycle Hardwired Control: Harvard architecture
We will assume the clock period is sufficiently long for all of the following steps to be “completed”:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

$$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$$

At the rising edge of the following clock, the PC, the register file and the memory are updated.

Clock Period

$$t_{C\text{-Princeton}} > \max\{t_M,\ t_{RF} + t_{ALU} + t_M + t_{WB}\}$$

that is,

$$t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M + t_{WB}$$

while in the hardwired Harvard architecture

$$t_{C\text{-Harvard}} > t_M + t_{RF} + t_{ALU} + t_M + t_{WB}$$

Which will execute instructions faster?

Clock Rate vs CPI
Suppose $t_M \gg t_{RF} + t_{ALU} + t_{WB}$. Then

$$t_{C\text{-Princeton}} \approx 0.5 \times t_{C\text{-Harvard}}$$

$$CPI_{Princeton} = 2, \qquad CPI_{Harvard} = 1$$

No difference in performance!
Is it possible to design a controller for the Princeton architecture with CPI < 2?
(CPI = Clock cycles Per Instruction)
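The claim can be checked with numbers. The sketch below treats the clock-period bounds above as equalities (an assumption; the slides only give lower bounds) and plugs in hypothetical latencies where the memory time dominates. The function names and the latency values are illustrative only.

```python
def t_c_harvard(t_m, t_rf, t_alu, t_wb):
    # One cycle covers instruction fetch and data access: two memory accesses.
    return t_m + t_rf + t_alu + t_m + t_wb

def t_c_princeton(t_m, t_rf, t_alu, t_wb):
    # The clock must fit the longer of the fetch phase and the execute phase.
    return max(t_m, t_rf + t_alu + t_m + t_wb)

# Hypothetical latencies (ns) with t_M much larger than the rest.
t_m, t_rf, t_alu, t_wb = 100.0, 1.0, 2.0, 1.0

# Time per instruction = CPI * clock period.
harvard_time = 1 * t_c_harvard(t_m, t_rf, t_alu, t_wb)
princeton_time = 2 * t_c_princeton(t_m, t_rf, t_alu, t_wb)

print(harvard_time)    # 204.0
print(princeton_time)  # 208.0
```

With these assumed latencies the two designs spend almost the same time per instruction: halving the clock period is cancelled by doubling the CPI.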

Princeton microarchitecture (redrawn)
[Figure: the datapath redrawn as a fetch phase (PC, +4 adder, Memory, IR) and an execute phase (GPRs, Imm Ext, ALU, Memory). The two Memory blocks are the same memory (mux not shown).]
Only one of the phases is active in any cycle, so a lot of the datapath is not in use at any given time.

Princeton Microarchitecture: Overlapped execution
[Figure: the redrawn datapath with the fetch phase and the execute phase side by side.]
Can we overlap instruction fetch and execute? Yes, unless IR contains a Load or Store.
Which action should be prioritized? Execute.
What do we do with Fetch? Stall it.
How?

Stalling the instruction fetch: Princeton Microarchitecture
[Figure: the redrawn datapath with a stall? signal and a nop input to the IR.]
When the stall condition is indicated:
- don’t fetch a new instruction and don’t change the PC
- insert a nop in the IR
- set the Memory Address mux to ALU (not shown)
What if IR contains a jump or branch instruction?

Need to stall on branches: Princeton Microarchitecture
[Figure: the redrawn datapath with a Jump? signal and a nop input to the IR.]
When IR contains a jump or branch-taken:
- there is no “structural conflict” for the memory
- but we do not have the correct PC value in the PC
- memory cannot be used, so the Address Mux setting is irrelevant
- insert a nop in the IR
- insert the nextPC (branch-target) address in the PC
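The two stall cases (Load/Store in the IR, and jump or taken branch in the IR) can be summarised as one decision function. This is a hedged sketch only: the function name `fetch_control`, the instruction-kind strings, and the returned labels are hypothetical, and the real controller is combinational logic driven by the opcode and zero? signals.

```python
def fetch_control(ir_kind, branch_taken=False):
    """Decide what the fetch side does this cycle, given the instruction in IR.

    ir_kind is one of 'alu', 'load', 'store', 'branch', 'jump', 'nop'
    (an illustrative classification, not an actual opcode decoding).
    """
    if ir_kind in ("load", "store"):
        # Structural conflict: execute gets the memory, fetch is stalled.
        return {"mem_addr": "ALU", "next_ir": "nop", "next_pc": "hold"}
    if ir_kind == "jump" or (ir_kind == "branch" and branch_taken):
        # No memory conflict, but the PC does not hold the right address yet.
        return {"mem_addr": "irrelevant", "next_ir": "nop", "next_pc": "target"}
    # Otherwise fetch overlaps with execute as usual.
    return {"mem_addr": "PC", "next_ir": "fetched", "next_pc": "pc+4"}

print(fetch_control("load"))
print(fetch_control("branch", branch_taken=True))
print(fetch_control("alu"))
```

In both stall cases a nop enters the IR, so the stalling instruction effectively costs two cycles instead of one.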

Pipelined Princeton Microarchitecture
[Figure: pipelined datapath with a single memory, an IR with a nop input, and a stall signal. Control signals: PCSrc2, RegWrite, MemWrite, PCen, IRSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?, MAddrSrc, stall?, WBSrc.]

Pipelined Princeton Architecture
Clock: $t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M$
CPI: $(1 - f) + 2f$ cycles per instruction, where $f$ is the fraction of instructions that cause a stall
What is a likely value of f?
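To get a feel for f, one can count cycles for an instruction mix. The sketch below assumes, as in the preceding slides, that loads, stores, jumps and taken branches each cause one stall cycle. The mix itself (20 loads, 10 stores, 10 jumps or taken branches per 100 instructions) is a made-up example, not a measured workload.

```python
def pipelined_cpi(total_instructions, stalling_instructions):
    """CPI = (1 - f) + 2f: non-stalling instructions take 1 cycle, stalling ones take 2."""
    non_stalling = total_instructions - stalling_instructions
    cycles = non_stalling * 1 + stalling_instructions * 2
    return cycles / total_instructions

# Hypothetical instruction mix per 100 instructions.
mix = {"load": 20, "store": 10, "taken_branch_or_jump": 10, "other": 60}

total = sum(mix.values())
stalling = total - mix["other"]

f = stalling / total
print(f)                               # 0.4
print(pipelined_cpi(total, stalling))  # 1.4
```

Under this assumed mix the CPI is 1 + f = 1.4, noticeably better than the CPI of 2 of the two-state controller.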