CS 35101 - Computer Architecture
Week 12: Multi Cycle Implementation
Paul Durand (www.cs.kent.edu/~durand)
[Adapted from M Irwin (www.cse.psu.edu/~mji)]
[Adapted from COD, Patterson & Hennessy, © 2005, UCB]

Heads Up
- This week’s material
  - Microprogramming
  - Introduction to pipelining
    - Reading assignment – PH 6.1
- Reminders
  - HW#6
  - I/O, exceptions, and interrupts
    - Reading assignment – PH 5.6, 8.5, and A.7 through A.8

Review: Multicycle Data and Control Path
[Figure: multicycle datapath with PC, Memory (instruction or data), IR, MDR, Register File, A and B registers, sign extend, shift left 2, ALU, ALU control, and ALUout, driven by a control FSM that decodes Instr[31-26]. Control signals shown: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst.]

Review: RTL Summary
- Instr fetch (all instructions)
    IR = Memory[PC];
    PC = PC + 4;
- Decode (all instructions)
    A = Reg[IR[25-21]];
    B = Reg[IR[20-16]];
    ALUOut = PC + (sign-extend(IR[15-0]) << 2);
- Execute
    R-type:  ALUOut = A op B;
    Mem Ref: ALUOut = A + sign-extend(IR[15-0]);
    Branch:  if (A==B) PC = ALUOut;
    Jump:    PC = PC[31-28] || (IR[25-0] << 2);
- Memory access
    R-type:  Reg[IR[15-11]] = ALUOut;
    Mem Ref: MDR = Memory[ALUOut]; or Memory[ALUOut] = B;
- Writeback
    Mem Ref: Reg[IR[20-16]] = MDR;
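
The RTL summary implies a different cycle count for each instruction class. A minimal sketch of the resulting average CPI follows. The step counts are read off the table; the instruction mix is a made-up assumption for illustration and does not come from the slides.

```python
# Clock cycles per instruction class, read off the RTL summary:
# R-type finishes in step 4, lw in step 5, sw in step 4, branch and jump in step 3.
CYCLES = {"R-type": 4, "lw": 5, "sw": 4, "beq": 3, "j": 3}


def average_cpi(mix_percent):
    """Weighted average cycles per instruction for a {class: percent} mix."""
    total = sum(mix_percent.values())
    weighted = sum(CYCLES[cls] * pct for cls, pct in mix_percent.items())
    return weighted / total


# Hypothetical instruction mix in percent (an assumption, not from the slides).
example_mix = {"R-type": 50, "lw": 20, "sw": 10, "beq": 15, "j": 5}
print(average_cpi(example_mix))  # 4.0
```

With this assumed mix the multicycle machine averages 4 cycles per instruction, each cycle being much shorter than a single-cycle machine's clock period.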

Review: Multicycle Datapath FSM
[Figure: finite state machine diagram; the states and the signals set in each are listed below.]
Unless otherwise assigned: PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
- State 0, Instr Fetch (Start): IorD=0, MemRead, IRWrite, ALUSrcA=0, ALUSrcB=01, PCSource=00, ALUOp=00, PCWrite
- State 1, Decode: ALUSrcA=0, ALUSrcB=11, ALUOp=00, PCWriteCond=0
- State 2, Execute (Op = lw or sw): ALUSrcA=1, ALUSrcB=10, ALUOp=00, PCWriteCond=0
- State 6, Execute (Op = R-type): ALUSrcA=1, ALUSrcB=00, ALUOp=10, PCWriteCond=0
- State 8, Execute (Op = beq): ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCSource=01, PCWriteCond
- State 9, Execute (Op = j): PCSource=10, PCWrite
- State 3, Memory Access (Op = lw): MemRead, IorD=1, PCWriteCond=0
- State 5, Memory Access (Op = sw): MemWrite, IorD=1, PCWriteCond=0
- State 4, Write Back (lw): RegDst=0, RegWrite, MemtoReg=1, PCWriteCond=0
- State 7, Write Back (R-type): RegDst=1, RegWrite, MemtoReg=0, PCWriteCond=0
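
The state diagram can be read as an explicit next-state function. The sketch below assumes the usual ten-state numbering for this FSM (0 fetch, 1 decode, 2 address computation, 3 memory read, 4 load write back, 5 memory write, 6 R-type execute, 7 R-type write back, 8 branch, 9 jump). The function names `next_state` and `state_trace` are illustrative only.

```python
def next_state(state, op):
    """Explicit next-state function for the ten-state multicycle FSM."""
    if state == 0:                      # Instr Fetch always goes to Decode
        return 1
    if state == 1:                      # Decode branches on the opcode
        return {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}[op]
    if state == 2:                      # address computed: read or write memory
        return {"lw": 3, "sw": 5}[op]
    if state == 3:                      # lw: memory read, then write back
        return 4
    if state == 6:                      # R-type: execute, then write back
        return 7
    return 0                            # states 4, 5, 7, 8, 9 return to fetch


def state_trace(op):
    """States visited while executing one instruction of class `op`."""
    states = [0]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == 0:
            return states
        states.append(nxt)


for op in ("lw", "sw", "R-type", "beq", "j"):
    print(op, state_trace(op))
```

The length of each trace is the number of clock cycles that instruction class takes.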

Simplifying Control Design
- For an implementation of the full MIPS ISA
  - instructions can take from 1 clock cycle to 20 clock cycles
  - resulting in finite state machines with hundreds to thousands of states with even more arcs (state sequences)
- Graphical representations become impossibly complex
- Instead, represent the set of control signals that are asserted during a state as a low-level control “instruction” to be executed by the datapath: a microinstruction
- “Executing” the microinstruction is equivalent to asserting the control signals specified by the microinstruction

Microprogramming
- Microinstruction has to specify
  - what control signals should be asserted
  - what microinstruction should be executed next
- Each microinstruction corresponds to one state in the FSM and is assigned a state number (or “address”)
  1. Sequential behavior – increment the state (address) of the current microinstruction to get to the state (address) of the next
  2. Jump to the microinstruction that begins execution of the next MIPS instruction (state 0)
  3. Branch to a microinstruction based on control unit input using dispatch tables
     - need one for microinstructions following state 1
     - need another for microinstructions following state 2
- The set of microinstructions that define a MIPS assembly language instruction (macroinstruction) is its microroutine

Defining a Microinstruction Format
- Format – the fields of the microinstruction and the control signals that are affected by each field
  - control signals specified by a field usually have functions that are related
  - format is chosen to simplify the representation and to make it difficult to write inconsistent microinstructions
    - i.e., ones that allow a given control signal to be set to two different values
- Make each field of the microinstruction responsible for specifying a nonoverlapping set of control signals
  - signals that are never asserted simultaneously may share the same field
  - seven fields for our simple machine
    - ALU control; SRC1; SRC2; Register control; Memory; PCWrite control; Sequencing

The Entire Control Microprogram

Addr  ALU control  SRC1  SRC2      Reg control  Memory     PCWrite control  Seq’ing
0     Add          PC    4                      Read PC    ALU              Seq
1     Add          PC    Ext shft  Read                                     Disp 1
2     Add          A     Extend                                             Disp 2
3                                               Read ALU                    Seq
4                                  Write MDR                                Fetch
5                                               Write ALU                   Fetch
6     Func code    A     B                                                  Seq
7                                  Write ALU                                Fetch
8     Subt         A     B                                 ALUOut-cond      Fetch
9                                                          Jump address     Fetch
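
The microprogram can be exercised with a small sequencer. The following is a sketch only: the `MicroInstr` class, the dispatch-table contents, and the function names are illustrative assumptions. The dispatch targets follow the state numbering used in these slides (after address 1: R-type to 6, j to 9, beq to 8, lw and sw to 2; after address 2: lw to 3, sw to 5).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroInstr:
    """One microinstruction with the seven fields of the format."""
    alu: str = ""
    src1: str = ""
    src2: str = ""
    reg: str = ""
    mem: str = ""
    pcwrite: str = ""
    seq: str = "Seq"


MICROPROGRAM = [
    MicroInstr("Add", "PC", "4", mem="Read PC", pcwrite="ALU", seq="Seq"),  # 0
    MicroInstr("Add", "PC", "Ext shft", reg="Read", seq="Disp 1"),          # 1
    MicroInstr("Add", "A", "Extend", seq="Disp 2"),                         # 2
    MicroInstr(mem="Read ALU", seq="Seq"),                                  # 3
    MicroInstr(reg="Write MDR", seq="Fetch"),                               # 4
    MicroInstr(mem="Write ALU", seq="Fetch"),                               # 5
    MicroInstr("Func code", "A", "B", seq="Seq"),                           # 6
    MicroInstr(reg="Write ALU", seq="Fetch"),                               # 7
    MicroInstr("Subt", "A", "B", pcwrite="ALUOut-cond", seq="Fetch"),       # 8
    MicroInstr(pcwrite="Jump address", seq="Fetch"),                        # 9
]

# Dispatch tables (assumed contents, matching the FSM state numbers).
DISPATCH_1 = {"R-type": 6, "j": 9, "beq": 8, "lw": 2, "sw": 2}
DISPATCH_2 = {"lw": 3, "sw": 5}


def next_address(addr, op):
    """Apply the Sequencing field of the microinstruction at `addr`."""
    seq = MICROPROGRAM[addr].seq
    if seq == "Seq":        # sequential: increment the microprogram counter
        return addr + 1
    if seq == "Fetch":      # jump back to the first microinstruction
        return 0
    if seq == "Disp 1":     # branch through dispatch table 1
        return DISPATCH_1[op]
    if seq == "Disp 2":     # branch through dispatch table 2
        return DISPATCH_2[op]
    raise ValueError(f"unknown sequencing field: {seq}")


def microroutine(op):
    """Addresses of the microinstructions executed for one instruction."""
    addrs = [0]
    while (nxt := next_address(addrs[-1], op)) != 0:
        addrs.append(nxt)
    return addrs


for op in ("lw", "sw", "R-type", "beq", "j"):
    print(op, microroutine(op))
```

Only the Sequencing field decides the next address; the other six fields would drive the datapath control signals and are carried here as plain strings.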

Microcode Implementation
[Figure: components are the control PLA, the microprogram counter, an adder (+1), address select logic with inputs Op5-Op0 from Inst[31-26], and the system clock. The PLA outputs include PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and AddrCtl (sequencing control).]

Review: FSM Implementation
[Figure: combinational control logic with inputs Op5-Op0 (Inst[31-26]) and the current state from the state register (State Reg, driven by the system clock). The outputs include PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state.]

Control Path Design Alternatives

Initial representation:    Finite state diagram          | Microprogram
Sequencing control:        Explicit next state function  | Microprogram counter + dispatch PLAs
Logic representation:      Logic equations               | Microcode
Implementation technique:  Programmable Logic Array (PLA)

- Microprogram representation advantages
  - Easier to design, write, and debug

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction
  [Figure: Single Cycle Implementation timing, lw in Cycle 1 and sw in Cycle 2, with the unused part of the sw cycle marked "Waste"]
- Is wasteful of area since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand

Multicycle Advantages & Disadvantages
- Uses the clock cycle efficiently – the clock cycle is timed to accommodate the slowest instruction step
  - balance the amount of work to be done in each step
  - restrict each step to use only one major functional unit
- Multicycle implementations allow
  - functional units to be used more than once per instruction as long as they are used on different clock cycles
  - faster clock rates
  - different instructions to take a different number of clock cycles
but
- Requires additional internal state registers, muxes, and more complicated (FSM) control

The Five Stages of Load Instruction
[Figure: lw occupies Cycle 1 (IFetch), Cycle 2 (Dec), Cycle 3 (Exec), Cycle 4 (Mem), Cycle 5 (WB)]
- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the data back to the register file

Single Cycle vs. Multiple Cycle Timing
[Figure: timing diagrams.
 Single Cycle Implementation: lw in Cycle 1, sw in Cycle 2, with the unused part of the cycle marked "Waste".
 Multiple Cycle Implementation: lw uses Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw uses Cycles 6-9 (IFetch, Dec, Exec, Mem), and an R-type instruction begins IFetch in Cycle 10.]
- The multicycle clock period is longer than 1/5 of the single cycle clock period because of stage flip-flop overhead

Pipelined MIPS Processor
- Start the next instruction while still working on the current one
  - improves throughput - total amount of work done in a given time
  - instruction latency (execution time, delay time, response time) is not reduced - time from the start of an instruction to its completion
[Figure: lw, sw, and an R-type instruction overlapped across Cycles 1-8, each starting one cycle after the previous one and moving through IFetch, Dec, Exec, Mem, WB]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing diagrams.
 Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2, with the unused part of the cycle marked "Waste".
 Multiple Cycle Implementation: lw uses Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw uses Cycles 6-9 (IFetch, Dec, Exec, Mem), and an R-type instruction begins IFetch in Cycle 10.
 Pipeline Implementation: lw, sw, and R-type overlap, each starting one cycle after the previous one and passing through IFetch, Dec, Exec, Mem, WB; a stage that an instruction passes through without using is labeled "wasted cycle".]
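
The three timing diagrams can be compared by counting cycles for the same lw, sw, R-type sequence. The sketch below is idealized and rests on stated assumptions: the multicycle and pipelined clock period is taken as exactly 1/5 of the single cycle period (the slides note that register overhead makes it somewhat longer in practice), and the pipeline has no hazards. The function names are illustrative.

```python
# Multicycle steps per instruction class (lw 5, sw 4, R-type 4).
STEPS = {"lw": 5, "sw": 4, "R-type": 4}
PIPE_STAGES = 5


def single_cycle_cycles(program):
    """One (long) clock cycle per instruction."""
    return len(program)


def multicycle_cycles(program):
    """Each instruction takes only the steps it needs, one after another."""
    return sum(STEPS[instr] for instr in program)


def pipelined_cycles(program):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return PIPE_STAGES + len(program) - 1 if program else 0


program = ["lw", "sw", "R-type"]

# Assumed clock periods in arbitrary time units (idealized, no overhead).
SINGLE_PERIOD = 5
SHORT_PERIOD = 1

print(single_cycle_cycles(program) * SINGLE_PERIOD)  # 15
print(multicycle_cycles(program) * SHORT_PERIOD)     # 13
print(pipelined_cycles(program) * SHORT_PERIOD)      # 7
```

Under these assumptions the pipelined machine finishes the three instructions in 7 short cycles, compared with 13 for the multicycle machine and 3 long cycles (15 time units) for the single cycle machine.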

Pipelining the MIPS ISA
- What makes it easy
  - all instructions are the same length (32 bits)
  - few instruction formats (three) with symmetry across formats
  - memory operations can occur only in loads and stores
  - operands must be aligned in memory so a single data transfer requires only one memory access
- What makes it hard
  - structural hazards: what if we had only one memory?
  - control hazards: what about branches?
  - data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between pipeline stages to isolate them
[Figure: pipelined datapath with stages IFetch, Dec, Exec, Mem, WB separated by the state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB. Components: PC, Instruction Memory, adders, Register File, sign extend (16 to 32), shift left 2, ALU, Data Memory, system clock.]

MIPS Pipeline Control Path Modifications
- All control signals are determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the pipelined datapath (stages IFetch, Dec, Exec, Mem, WB with state registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB) with a Control unit added in the Dec stage]

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the resource sequence IM, Reg, ALU, DM, Reg]
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput!
[Figure: pipeline diagram, time (clock cycles) across and instruction order down, showing Inst 0 through Inst 4 each using IM, Reg, ALU, DM, Reg one cycle after the previous instruction. Annotations: "Time to fill the pipeline" and "Once the pipeline is full, one instruction is completed every cycle".]

Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use item before it is ready
    - instruction depends on result of prior instruction still in the pipeline
  - control hazards: attempt to make a decision before condition is evaluated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

A Unified Memory Would Be a Structural Hazard
[Figure: pipeline diagram, time (clock cycles) across and instruction order down, for lw followed by Inst 1 through Inst 4, each using Mem, Reg, ALU, Mem, Reg. In one cycle the lw is "Reading data from memory" while a later instruction is "Reading instruction from memory".]

How About Register File Access?
[Figure: pipeline diagram, time (clock cycles) across and instruction order down, for the sequence add, Inst 1, Inst 2, add, Inst 4. Annotations mark the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers.]
- Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Register Usage Can Cause Data Hazards
- Dependencies backward in time cause hazards
    add r1, r2, r3
    sub r4, r1, r5
    and r6, r1, r7
    or  r8, r1, r9
    xor r4, r1, r5
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]
- Read before write data hazard
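
The dependencies in this sequence can be found mechanically. The sketch below assumes a five-stage pipeline in which the register file is written in the first half of a cycle and read in the second half, so only consumers one or two instructions after the producer read a stale value. The tuple encoding and the `raw_hazards` helper are illustrative assumptions.

```python
# Each instruction is (opcode, destination, source1, source2).
program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r5"),
    ("and", "r6", "r1", "r7"),
    ("or",  "r8", "r1", "r9"),
    ("xor", "r4", "r1", "r5"),
]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index) pairs that read too early.

    `window` is how many following instructions would read the register
    before the producer has written it back (2 under the assumptions above).
    """
    hazards = []
    for i, instr in enumerate(program):
        dest = instr[1]
        last = min(i + window, len(program) - 1)
        for j in range(i + 1, last + 1):
            if dest in program[j][2:]:   # consumer uses dest as a source
                hazards.append((i, j))
    return hazards


print(raw_hazards(program))  # [(0, 1), (0, 2)]
```

Here sub and and read r1 before add writes it back, while or reads it in the same cycle as the write and xor reads it later, so neither of those is a hazard under the split-cycle register file assumption.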

Loads Can Cause Data Hazards
- Dependencies backward in time cause hazards
    lw  r1, 100(r2)
    sub r4, r1, r5
    and r6, r1, r7
    or  r8, r1, r9
    xor r4, r1, r5
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]
- Load-use data hazard

Stores Can Cause Data Hazards
- Dependencies backward in time cause hazards
    add r1, r2, r3
    sw  r1, 100(r5)
    and r6, r1, r7
    or  r8, r1, r9
    xor r4, r1, r5
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Data Hazard
- Can fix data hazard by waiting – stall – but impacts CPI
    add $1,
    stall
    sub $4, $1, $5
    and $6, $1, $7
  [Figure: pipeline diagram of the sequence with the stall inserted]

Another Way to “Fix” a Data Hazard
- Fix data hazards by forwarding results as soon as they are available to where they are needed
    add $1,
    sub $4, $1, $5
    and $6, $1, $7
    or  $8, $1, $9
    xor $4, $1, $5
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]

Forwarding with Load-use Data Hazards
    lw  $1, 4($2)
    sub $4, $1, $5
    and $6, $1, $7
    or  $8, $1, $9
    xor $4, $1, $5
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]
- Will still need one stall cycle even with forwarding
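
The stall counts for the two fixes can be written down as a small rule. This is a sketch under stated assumptions: a five-stage pipeline, a register file written in the first half of a cycle and read in the second half, and each producer/consumer pair considered in isolation (stalls already inserted for an earlier consumer are ignored). The function name `stalls_needed` is illustrative.

```python
def stalls_needed(producer_op, distance, forwarding):
    """Stall cycles for a consumer `distance` instructions after its producer.

    distance == 1 means the consumer immediately follows the producer.
    """
    if forwarding:
        # An ALU result can be forwarded straight to the next instruction.
        # Load data only exists after the Mem stage, so the instruction
        # right behind a load still waits one cycle.
        return 1 if (producer_op == "lw" and distance == 1) else 0
    # Without forwarding the consumer must read the register file no
    # earlier than the cycle in which the producer writes it back.
    return max(0, 3 - distance)


print(stalls_needed("add", 1, forwarding=False))  # 2
print(stalls_needed("add", 1, forwarding=True))   # 0
print(stalls_needed("lw", 1, forwarding=True))    # 1
```

The last line is the load-use case from the slide: forwarding removes the stalls after an ALU instruction but leaves one stall after a load whose result is used immediately.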

Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
    add
    beq
    lw
    Inst 3
    Inst 4
  [Figure: pipeline diagram of the five instructions, each using IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Control Hazard
- Can fix branch hazard by waiting – stall – but affects throughput
    add
    beq
    stall
    lw
    Inst 3
  [Figure: pipeline diagram of the sequence with the stall inserted after beq]
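
The throughput cost of stalling on branches can be estimated with a one-line formula: each branch adds its stall cycles on top of the ideal one cycle per instruction. The branch fraction and stall count below are assumptions chosen for illustration; the slides give neither number.

```python
def effective_cpi(branch_fraction, stall_cycles_per_branch, base_cpi=1.0):
    """Average CPI when every branch stalls the pipeline for a fixed time."""
    return base_cpi + branch_fraction * stall_cycles_per_branch


# Hypothetical workload: 25% branches, each costing one stall cycle.
print(effective_cpi(0.25, 1))  # 1.25
```

With these assumed numbers the pipeline completes an instruction every 1.25 cycles on average instead of every cycle.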

Corrected Datapath to Save RegWrite Addr
- Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined datapath with state registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Components: PC, Instruction Memory, adders, Register File, sign extend (16 to 32), shift left 2, ALU, Data Memory.]

MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the pipelined datapath (state registers IF/ID, ID/EX, EX/MEM, MEM/WB) with a Control unit added in the decode stage]

Other Pipeline Structures Are Possible
- What about (slow) multiply operation?
  - let it take two cycles
  [Figure: pipeline IM, Reg, ALU, DM, Reg with a MUL unit alongside the ALU]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: pipeline IM, Reg, ALU, DM1, DM2, Reg]

Sample Pipeline Alternatives
- ARM7: IM, Reg, EX
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
- StrongARM-1: IM, Reg, ALU, DM, Reg
- XScale: IM1, IM2, Reg, SHFT, ALU, DM1, DM2
  - IM1: PC update, BTB access, start IM access
  - IM2: IM access
  - Reg: decode, reg 1 access
  - SHFT: shift/rotate, reg 2 access
  - ALU: ALU op
  - DM1: start DM access, exception
  - DM2: DM write, reg write

Summary
- All modern day processors use pipelining
- Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
  - Multiple tasks operating simultaneously using different resources
- Potential speedup = Number of pipe stages
- Pipeline rate limited by slowest pipeline stage
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to “fill” the pipeline and time to “drain” it reduce speedup
- Must detect and resolve hazards
  - Stalling negatively affects throughput
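
The claim that potential speedup equals the number of pipe stages, reduced by fill and drain time, can be checked numerically. The sketch assumes perfectly balanced stages, no hazards, and an unpipelined machine that spends one stage-time per stage on every instruction. The function name `pipeline_speedup` is illustrative.

```python
def pipeline_speedup(n_instructions, n_stages):
    """Ideal speedup of a pipeline over running each instruction alone."""
    unpipelined_cycles = n_instructions * n_stages
    pipelined_cycles = n_stages + n_instructions - 1   # fill, then 1 per cycle
    return unpipelined_cycles / pipelined_cycles


for n in (1, 10, 1000):
    print(n, round(pipeline_speedup(n, 5), 2))
```

For a single instruction there is no speedup at all, and as the instruction count grows the value approaches, but never reaches, the five stages assumed here.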