Slides: 63

CPU performance equation: T = I x CPI x C
• Both effective CPI and clock cycle C are heavily influenced by CPU design.
• For a single-cycle CPU:
  – CPI = 1 (good)
  – Long cycle time (bad)
• On the other hand, for a multi-cycle CPU:
  – CPI increased to 3-5 (bad)
  – Shorter cycle (good)
• How to lower effective CPI without increasing C?
  – Solution: CPU pipelining, i.e. instruction processing overlap.
EECC 550 - Shaaban, Lec # 7, Winter 2012, 1-10-2013

Generic CPU Instruction Processing Steps (Implied by The Von Neumann Computer Model)
• Instruction Fetch: Obtain instruction from program storage (memory). The Program Counter (PC) points to the next instruction to be processed.
• Instruction Decode: Determine required actions and instruction size.
• Operand Fetch: Locate and obtain operand data.
• Execute: Compute result value or status.
• Result Store: Deposit results in storage for later use.
• Next Instruction: Determine successor or next instruction (i.e. update PC to fetch the next instruction to be processed).
Major CPU performance limitation: The Von Neumann computing model implies sequential execution, one instruction at a time. (T = I x CPI x C)

MIPS CPU Design: What do we have so far?
Multi-Cycle Datapath (Textbook Version)
• One ALU, one memory.
• CPI: R-Type = 4, Load = 5, Store = 4, Jump/Branch = 3.
• Only one instruction is being processed in the datapath: processing an instruction starts when the previous instruction is completed.
• How to lower CPI further without increasing CPU clock cycle time, C? (T = I x CPI x C)

Operations (Dependent RTN) for Each Cycle
Instruction types: R-Type, Load, Store, Branch, Jump
• IF (Instruction Fetch), all instructions: IR ← Mem[PC]; PC ← PC + 4
• ID (Instruction Decode), all instructions: A ← R[rs]; B ← R[rt]; ALUout ← PC + (SignExt(imm16) x 4)
• EX (Execution):
  – R-Type: ALUout ← A funct B
  – Load: ALUout ← A + SignExt(imm16)
  – Store: ALUout ← A + SignExt(imm16)
  – Branch: Zero ← A - B; Zero: PC ← ALUout
  – Jump: PC ← Jump Address
• MEM (Memory):
  – Load: MDR ← Mem[ALUout]
  – Store: Mem[ALUout] ← B
• WB (Write Back):
  – R-Type: R[rd] ← ALUout
  – Load: R[rt] ← MDR
CPI = 3 to 5, C = 2 ns (T = I x CPI x C)
Instruction Fetch (IF) & Instruction Decode (ID) cycles are common for all instructions.
Reducing the CPI by combining cycles increases the CPU clock cycle.

FSM State Transition Diagram (From Book)
[Figure: multi-cycle control FSM; 3rd Edition Figure 5.37, page 338 (see handout).]
• IF: IR ← Mem[PC]; PC ← PC + 4
• ID: A ← R[rs]; B ← R[rt]; ALUout ← PC + (SignExt(imm16) x 4)
• EX: ALUout ← A + SignExt(imm16); ALUout ← A func B; Zero ← A - B, Zero: PC ← ALUout; PC ← Jump Address
• MEM: MDR ← Mem[ALUout]; Mem[ALUout] ← B
• WB: R[rd] ← ALUout; R[rt] ← MDR
CPI = 3 to 5 (T = I x CPI x C)
Reducing the CPI by combining cycles increases the CPU clock cycle.

Multi-cycle Datapath (Our Version)
[Figure: multi-cycle datapath with Next PC, PC, Instruction Fetch, IR, Operand Fetch, Reg File, Ext, ALU, Mem Access, Data Mem, and Result Store blocks; control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg; status signal Equal.]
Three ALUs, two memories.
Registers added:
• IR: Instruction register.
• A, B: Two registers to hold operands read from the register file.
• R: or ALUOut, holds the output of the ALU.
• M: or Memory data register (MDR), holds data read from data memory.

Operations (Dependent RTN) for Each Cycle
Instruction types: R-Type, Logic Immediate, Load, Store, Branch
• IF (Instruction Fetch): IR ← Mem[PC]; PC ← PC + 4
• ID (Instruction Decode): A ← R[rs]; B ← R[rt]; for branches also Zero ← A - B
• EX (Execution):
  – R-Type: R ← A funct B
  – Logic Immediate: R ← A OR ZeroExt[imm16]
  – Load/Store: R ← A + SignExt(imm16)
  – Branch: If Zero = 1: PC ← PC + (SignExt(imm16) x 4)
• MEM (Memory):
  – Load: M ← Mem[R]
  – Store: Mem[R] ← B
• WB (Write Back):
  – R-Type: R[rd] ← R
  – Load: R[rt] ← M
Instruction Fetch (IF) & Instruction Decode (ID) cycles are common for all instructions.
CPI = 3 to 5, C = 2 ns

Multi-cycle Datapath Instruction CPI
• R-Type/Immediate: Require four cycles, CPI = 4
  – IF, ID, EX, WB
• Loads: Require five cycles, CPI = 5
  – IF, ID, EX, MEM, WB
• Stores: Require four cycles, CPI = 4
  – IF, ID, EX, MEM
• Branches: Require three cycles, CPI = 3
  – IF, ID, EX
• Average program: 3 ≤ CPI ≤ 5, depending on program profile (instruction mix).
Non-overlapping instruction processing: processing an instruction starts when the previous instruction is completed.

MIPS Multi-cycle Datapath Performance Evaluation
• What is the average CPI?
  – State diagram gives CPI for each instruction type.
  – Workload (program) below gives frequency of each type.

Type          CPIi for type   Frequency   CPIi x freqIi
Arith/Logic   4               40%         1.6
Load          5               30%         1.5
Store         4               10%         0.4
Branch        3               20%         0.6
Average CPI:                              4.1

Better than CPI = 5 if all instructions took the same number of clock cycles (5).
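The average CPI in the table above is a frequency-weighted sum. A minimal Python sketch of that arithmetic follows; the names `INSTRUCTION_MIX` and `average_cpi` are illustrative and not part of the lecture.

```python
# Instruction mix from the slide: type -> (CPI for the type, frequency).
INSTRUCTION_MIX = {
    "Arith/Logic": (4, 0.40),
    "Load": (5, 0.30),
    "Store": (4, 0.10),
    "Branch": (3, 0.20),
}


def average_cpi(mix):
    """Weighted average CPI: sum over types of CPI_i x frequency_i."""
    total_freq = sum(freq for _, freq in mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(cpi * freq for cpi, freq in mix.values())


if __name__ == "__main__":
    print(f"Average CPI = {average_cpi(INSTRUCTION_MIX):.2f}")  # prints 4.10
```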

Instruction (CPU) Pipelining
• Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped.
  – For example: the next instruction is fetched in the next cycle without waiting for the current instruction to complete.
• An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment.
• The stages or steps are connected in a linear fashion, one stage to the next, to form the pipeline (or pipelined CPU datapath): instructions enter at one end, progress through the stages, and exit at the other end. (Illustrated with a 5-stage pipeline, stages 1 to 5.)
• The time to move an instruction one step down the pipeline is equal to the machine (CPU) cycle and is determined by the stage with the longest processing delay.
• Pipelining increases the CPU instruction throughput: the number of instructions completed per cycle.
  – Instruction pipeline throughput: the instruction completion rate of the pipeline, determined by how often an instruction exits the pipeline.
  – Under ideal conditions (no stall cycles), instruction throughput is one instruction per machine cycle, or ideal effective CPI = 1 (ideal IPC = 1).
• Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency). Pipelining may actually increase individual instruction latency.
  – Minimum instruction latency = n cycles, where n is the number of pipeline stages.
T = I x CPI x C
4th Edition Chapter 4.5 - 4.8; 3rd Edition Chapter 6.1 - 6.6

Single Cycle Vs. Pipelining
[Figure: execution timelines (time in ns against program execution order) for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg.]
• Single cycle: C = 8 ns, so each instruction takes 8 ns.
  – Time for 1000 instructions = 8 x 1000 = 8000 ns
• 5-stage instruction pipeline: C = 2 ns, with 4 pipeline fill cycles.
  – Time for 1000 instructions = time to fill pipeline + cycle time x 1000 = 8 + 2 x 1000 = 2008 ns
• Pipelining speedup = 8000/2008 = 3.98
Assuming the following datapath/control hardware component delays: Memory units: 2 ns; ALU and adders: 2 ns; Register file: 1 ns; Control unit < 1 ns.

CPU Pipelining: Design Goals
• The length of the machine clock cycle is determined by the time required for the slowest pipeline stage (similar to a non-pipelined multi-cycle CPU).
• An important pipeline design consideration is to balance the length of each pipeline stage.
• If all stages are perfectly balanced, then the effective time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
  Time per instruction on unpipelined machine / Number of pipeline stages
• Under these ideal conditions:
  – Speedup from pipelining = the number of pipeline stages = n
  – Goal: One instruction is completed every cycle (CPI = 1) while keeping clock cycle C short. (T = I x CPI x C)

From MIPS Multi-Cycle Datapath: Five Stages of Load
5 steps (cycles) become 5 pipeline stages, n = 5:
Cycle 1: IF, Cycle 2: ID, Cycle 3: EX, Cycle 4: MEM, Cycle 5: WB
1 - Instruction Fetch (IF): Instruction fetch and PC update.
  • Fetch the instruction from the Instruction Memory; PC ← PC + 4.
2 - Instruction Decode (ID): Registers fetch and instruction decode.
3 - Execute (EX): Calculate the memory address.
4 - Memory (MEM): Read the data from the Data Memory.
5 - Write Back (WB): Write the data back to the register file.
n = number of pipeline stages (5 in this case). The number of pipeline stages is determined by the instruction that needs the largest number of cycles.

Ideal Pipelined Instruction Processing (i.e. no stall cycles): Timing Representation
n = 5 stage pipeline, CPI = 1 (ideal). Fill cycles = number of stages - 1 = n - 1.

Clock cycle number (time in clock cycles →), instructions in program order:
Instruction       1    2    3    4    5    6    7    8    9
Instruction I     IF   ID   EX   MEM  WB
Instruction I+1        IF   ID   EX   MEM  WB
Instruction I+2             IF   ID   EX   MEM  WB
Instruction I+3                  IF   ID   EX   MEM  WB
Instruction I+4                       IF   ID   EX   MEM  WB

Pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back.
• Pipeline fill cycles: no instructions completed yet. Number of fill cycles = number of pipeline stages - 1; here 5 - 1 = 4 fill cycles. The first instruction, I, completes in cycle 5; instruction I+4 completes in cycle 9.
• Any individual instruction goes through all five pipeline stages, taking 5 cycles to complete. Thus instruction latency = 5 cycles.
• Ideal pipeline operation (without any stall cycles): after the fill cycles, one instruction is completed per cycle, giving the ideal pipeline CPI = 1 (ignoring fill cycles), or Instructions Per Cycle = IPC = 1/CPI = 1.
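The ideal timing chart above follows one rule: with no stalls, the i-th instruction (counting from 0) occupies stage s (counting from 0) in clock cycle i + s + 1. A short Python sketch that regenerates the chart from that rule; the function name `ideal_pipeline_chart` is illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_pipeline_chart(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage held in cycle c + 1."""
    # n stages and k instructions need n + k - 1 cycles when there are no stalls.
    total_cycles = len(stages) + num_instructions - 1
    chart = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i is in stage s during cycle i + s + 1
        chart.append(row)
    return chart


if __name__ == "__main__":
    for i, row in enumerate(ideal_pipeline_chart(5)):
        print(f"I+{i}: " + " ".join(f"{cell:<3}" for cell in row))
```

For five instructions the chart spans 9 cycles, of which the first 4 are fill cycles.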

Ideal Pipelined Instruction Processing (i.e. no stall cycles): 5 Stage Pipeline Representation
[Figure: instructions I1 to I6 in program flow order against time in cycles 1 to 10, each passing through IF, ID, EX, MEM, WB one cycle behind the previous instruction.]
• Here n = 5 pipeline stages or steps.
• Number of pipeline fill cycles = number of stages - 1; here 5 - 1 = 4.
• Any individual instruction goes through all five pipeline stages, taking 5 cycles to complete. Thus instruction latency = 5 cycles.
• After the fill cycles: one instruction is completed every cycle (effective CPI = 1, ideally).
Ideal pipeline operation without any stall cycles.

Single Cycle, Multi-Cycle, Vs. Pipelined CPU
[Figure: clock-by-clock comparison of the three implementations running Load, Store, and R-type instructions.]
• Single cycle implementation: 8 ns cycle; Load in cycle 1, Store in cycle 2, with part of the Store cycle marked "Waste" (2 ns label in the figure).
• Multiple cycle implementation (cycles 1 to 10): Load: IF ID EX MEM WB; Store: IF ID EX MEM; R-type: starts with IF after the Store completes.
• Pipeline implementation: Load, Store, and R-type each pass through IF ID EX MEM WB, starting one cycle apart (4 pipeline fill cycles).
Assuming the following datapath/control hardware component delays: Memory units: 2 ns; ALU and adders: 2 ns; Register file: 1 ns; Control unit < 1 ns.

Single Cycle, Multi-Cycle, Pipeline: Performance Comparison Example
For 1000 instructions, execution time (T = I x CPI x C):
• Single cycle machine (CPI = 1, C = 8 ns):
  – 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns
• Multi-cycle machine (3 ≤ CPI ≤ 5, C = 2 ns):
  – 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns (depends on program instruction mix)
• Ideal pipelined machine, 5 stages (effective CPI = 1, C = 2 ns):
  – 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns
• Speedup = 8000/2008 = 3.98 times faster than single cycle CPU
• Speedup = 9200/2008 = 4.58 times faster than multi-cycle CPU
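The three execution times above all come from T = I x CPI x C, with the pipeline fill cycles added for the pipelined case. A small Python sketch of the same arithmetic; the helper `execution_time_ns` and its parameter names are assumptions made for illustration.

```python
def execution_time_ns(instructions, cpi, cycle_ns, fill_cycles=0):
    """T = (I x CPI + fill cycles) x C, in nanoseconds."""
    return (instructions * cpi + fill_cycles) * cycle_ns


def compare(instructions=1000):
    """Execution times for the three implementations, using the slide's numbers."""
    return {
        "single": execution_time_ns(instructions, cpi=1, cycle_ns=8),
        "multi": execution_time_ns(instructions, cpi=4.6, cycle_ns=2),
        "pipelined": execution_time_ns(instructions, cpi=1, cycle_ns=2, fill_cycles=4),
    }


if __name__ == "__main__":
    t = compare()
    for name, ns in t.items():
        print(f"{name:>9}: {ns:.0f} ns")
    print(f"speedup vs single cycle: {t['single'] / t['pipelined']:.2f}")
    print(f"speedup vs multi-cycle:  {t['multi'] / t['pipelined']:.2f}")
```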

Basic Pipelined CPU Design Steps
1. Analyze instruction set operations using independent RTN => datapath requirements.
2. Select required datapath components and connections.
3. Assemble an initial datapath meeting the ISA requirements.
4. Identify pipeline stages based on operation, balancing stage delays, and ensuring no hardware conflicts exist when common hardware is used by two or more stages simultaneously in the same cycle.
5. Divide the datapath into the stages identified above by adding buffers (i.e. registers) between the stages of sufficient width to hold:
  • Instruction fields.
  • Remaining control lines needed for remaining pipeline stages.
  • All results produced by a stage and any unused results of previous stages.
6. Analyze implementation of each instruction to determine setting of control points that effects the register transfer, taking pipeline hazard conditions into account. (More on this a bit later)
7. Assemble the control logic.

MIPS Pipeline Stage Identification
5-stage pipeline:
1. IF: Instruction fetch
2. ID: Instruction decode/register file read
3. EX: Execute/address calculation
4. MEM: Memory access
5. WB: Write back
[Figure: initial datapath (PC, instruction memory, register file, sign extend, shift left 2, adders, ALU, data memory, multiplexers) divided into IF (Stage 1), ID (Stage 2), EX (Stage 3), MEM (Stage 4), WB (Stage 5).]
What is needed to divide the datapath into pipeline stages? Start with an initial datapath with 3 ALUs and 2 memories.

MIPS: An Initial Pipelined Datapath
Buffers (registers) between pipeline stages are added: IF/ID, ID/EX, EX/MEM, MEM/WB.
• Everything an instruction needs for the remaining processing stages must be saved in buffers so that it travels with the instruction from one CPU pipeline stage to the next.
[Figure: pipelined datapath with n = 5 pipeline stages: IF Instruction Fetch (Stage 1), ID Instruction Decode (Stage 2), EX Execution (Stage 3), MEM Memory (Stage 4), WB Write Back (Stage 5).]
This design has a problem:
• Can you find a problem even if there are no dependencies?
• What instructions can we execute to manifest the problem?
• Hint: Any values an instruction requires must travel with it as it goes through the pipeline stages, including instruction fields still needed in later stages.

A Corrected Pipelined Datapath
Classic Five Stage Integer MIPS Pipeline (4th Edition Figure 4.41, page 355; 3rd Edition Figure 6.17, page 395)
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB registers, with the rt/rd destination register field passed along through the pipeline registers.]
n = 5 pipeline stages: IF Instruction Fetch (Stage 1), ID Instruction Decode (Stage 2), EX Execution (Stage 3), MEM Memory (Stage 4), WB Write Back (Stage 5).

Read/Write Access To Register Bank
• Two instructions need to access the register bank in the same cycle:
  – One instruction to read operands in its Instruction Decode (ID) cycle.
  – The other instruction to write to a destination register in its Write Back (WB) cycle.
• This represents a potential hardware conflict over access to the register bank.
• Solution: Coordinate register reads and writes in the same cycle as follows:
  – Register writes in the Write Back (WB) cycle occur in the first half of the cycle (indicated in the figure by the dark shading of the first half of the WB cycle).
  – Operand register reads in the Instruction Decode (ID) cycle occur in the second half of the cycle (indicated in the figure by the dark shading of the second half of the cycle).
[Figure: pipeline timing of overlapping instructions (IF ID EX MEM WB) with the shaded half-cycles.]

Operation of ideal integer in-order 5-stage pipeline
[Figure: instructions 1 to 5, each passing through IF ID EX MEM WB, one cycle apart.]
• Write destination register in first half of WB cycle.
• Read operand registers in second half of ID cycle.

Adding Pipeline Control Points
MIPS Pipeline Version #1: No forwarding, branch resolved in MEM stage (Classic Five Stage Integer MIPS Pipeline; 4th Ed. Fig. 4.46, page 359; 3rd Ed. Fig. 6.22, page 400)
[Figure: pipelined datapath (IF Stage 1, ID Stage 2, EX Stage 3, MEM Stage 4, WB Stage 5; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with control points PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead, MemtoReg. Branches are resolved in MEM (Stage 4).]

Pipeline Control
• Pass needed control signals along from one stage to the next as the instruction travels through the pipeline, just like the needed data.
• All control line values for the remaining stages are generated in ID.
[Figure: 5-stage pipeline (IF, ID, EX, MEM, WB). The Control unit takes the instruction opcode in ID and produces the EX, M, and WB control groups: ID/EX holds WB, M, and EX; EX/MEM holds WB and M; MEM/WB holds WB.]

Pipeline Control Signal (Generation/Latching/Propagation)
• The Main Control generates the control signals during ID (Stage 2).
  – Control signals for EX (ALUSrc, ALUOp, ...) are used 1 cycle later.
  – Control signals for MEM (MemWr/Rd, Branch) are used 2 cycles later.
  – Control signals for WB (MemtoReg, RegWr) are used 3 cycles later.
[Figure: Main Control in ID feeding the ID/Ex register with RegDst, ALUSrc, ALUOp, MemRd, MemWr, Branch, MemtoReg, RegWr; the Ex/Mem register carries MemRd, MemWr, Branch, MemtoReg, RegWr into MEM (Stage 4); the Mem/WB register carries MemtoReg, RegWr into WB (Stage 5).]

Pipelined Datapath with Control Added
MIPS Pipeline Version #1: No forwarding, branch resolved in MEM stage (Classic Five Stage Integer MIPS Pipeline; 4th Ed. Fig. 4.51, page 362; 3rd Ed. Fig. 6.27, page 404)
[Figure: pipelined datapath with the Control unit in ID and the WB, M, and EX control groups carried through the ID/EX, EX/MEM, and MEM/WB registers; control points PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead, MemtoReg.]
Target address of branch is determined in EX, but PC is updated in the MEM stage (i.e. branch is resolved in MEM, stage 4).

Basic Performance Issues In Pipelining
• Pipelining increases the CPU instruction throughput: the number of instructions completed per unit time. Under ideal conditions (i.e. no stall cycles):
  – Pipelined CPU instruction throughput is one instruction completed per machine cycle, or CPI = 1 (ideally, ignoring pipeline fill cycles), or Instructions Per Cycle = IPC = 1.
• Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency or time).
  – It usually slightly increases the execution time of individual instructions over unpipelined CPU implementations due to:
    • The increased control overhead of the pipeline and pipeline stage register delays.
    • Every instruction goes through every stage in the pipeline even if the stage is not needed (i.e. the MEM pipeline stage in the case of R-Type instructions).
T = I x CPI x C. Here n = 5 stages.

Pipelining Performance Example
• Example: For an unpipelined multicycle CPU:
  – Clock cycle = 10 ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20% and 40%, respectively.
  – If pipelining adds 1 ns to the CPU clock cycle (i.e. C = 11 ns), then the speedup in instruction execution from pipelining is:
Non-pipelined average execution time/instruction = Clock cycle x Average CPI
  = 10 ns x ((40% + 20%) x 4 + 40% x 5) = 10 ns x 4.4 = 44 ns (CPI = 4.4)
In the pipelined CPU implementation, ideal CPI = 1:
Pipelined execution time/instruction = Clock cycle x CPI = (10 ns + 1 ns) x 1 = 11 ns
Speedup from pipelining = Time per instruction unpipelined / Time per instruction pipelined
  = 44 ns / 11 ns = 4 times faster
T = I x CPI x C; here I did not change.
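The speedup in this example is the ratio of the two per-instruction times, since the instruction count I is unchanged. A brief Python sketch of that calculation; the function and parameter names are illustrative assumptions.

```python
def pipelining_speedup(cycle_ns, mix, pipeline_overhead_ns, pipelined_cpi=1.0):
    """Speedup = unpipelined time per instruction / pipelined time per instruction.

    mix is a list of (cycles, frequency) pairs for the unpipelined multicycle CPU.
    """
    unpipelined_cpi = sum(cycles * freq for cycles, freq in mix)
    unpipelined_time = cycle_ns * unpipelined_cpi
    pipelined_time = (cycle_ns + pipeline_overhead_ns) * pipelined_cpi
    return unpipelined_time / pipelined_time


# Slide values: ALU ops (4 cycles, 40%), branches (4 cycles, 20%),
# memory operations (5 cycles, 40%).
EXAMPLE_MIX = [(4, 0.40), (4, 0.20), (5, 0.40)]

if __name__ == "__main__":
    speedup = pipelining_speedup(10, EXAMPLE_MIX, pipeline_overhead_ns=1)
    print(f"Speedup from pipelining = {speedup:.2f}")  # prints 4.00
```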

Pipeline Hazards
CPI = 1 + Average Stalls Per Instruction
• Hazards are situations in pipelined CPUs which prevent the next instruction in the instruction stream from executing during the designated clock cycle, possibly resulting in one or more stall (or wait) cycles.
  – i.e. a resource the instruction requires for correct execution is not available in the cycle needed.
• Hazards reduce the ideal speedup gained from pipelining (increase CPI > 1) and are classified into three classes, according to the resource that is not available:
  – Structural hazards (hardware component): Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions. Hardware structure (component) conflict.
  – Data hazards (correct operand (data) value): Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Operand not ready yet when needed in EX.
  – Control hazards (correct PC): Arise from the pipelining of conditional branches and other instructions that change the PC. Correct PC not available when needed in IF.

Performance of Pipelines with Stalls
• Hazard conditions in pipelines may make it necessary to stall the pipeline by a number of cycles, degrading performance from the ideal pipelined CPU CPI of 1.
  Average CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
                        = 1 + Pipeline stall clock cycles per instruction
• If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then speedup from pipelining is given by:
  Speedup = CPI unpipelined / CPI pipelined
          = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
• When all instructions in the multicycle CPU take the same number of cycles, equal to the number of pipeline stages, then:
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
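The stall formulas above can be checked numerically. The Python sketch below encodes them directly; the function names are illustrative, and the stall rate of 0.25 cycles per instruction in the example is a made-up value, not one from the lecture.

```python
def pipelined_cpi(stalls_per_instruction, ideal_cpi=1.0):
    """Average pipelined CPI = ideal CPI + stall clock cycles per instruction."""
    return ideal_cpi + stalls_per_instruction


def speedup_with_stalls(unpipelined_cpi, stalls_per_instruction):
    """Speedup = CPI unpipelined / (1 + stall cycles per instruction).

    Assumes pipelining overhead is ignored and the stages are perfectly
    balanced, so both machines use the same clock cycle.
    """
    return unpipelined_cpi / pipelined_cpi(stalls_per_instruction)


if __name__ == "__main__":
    # Hypothetical case: every unpipelined instruction takes 5 cycles
    # (the pipeline depth) and the pipeline averages 0.25 stalls per instruction.
    print(speedup_with_stalls(unpipelined_cpi=5, stalls_per_instruction=0.25))  # prints 4.0
```

With zero stalls the same call returns the pipeline depth, which is the ideal speedup stated earlier.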

Structural (or Hardware) Hazards
• In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline (to prevent hardware structure conflicts).
• If a resource conflict arises due to a hardware resource being required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred, e.g.:
  – When a pipelined machine has a shared single memory for both data and instructions → stall the pipeline for one cycle for memory data access.
i.e. a hardware component the instruction requires for correct execution is not available in the cycle needed.

MIPS with Memory Unit Structural Hazards
[Figure: pipeline timing in program order with one shared memory for instructions and data: a load (or store) followed by instructions 1 to 4, each passing through IF ID EX MEM WB.]
Instructions 1-4 above are assumed to be instructions other than loads/stores.

Resolving A Structural Hazard with Stalling
CPI = 1 + stall clock cycles per instruction = 1 + fraction of loads and stores x 1
[Figure: pipeline timing in program order with one shared memory for instructions and data: a load (or store) followed by instructions 1 to 3, with one stall (wait) cycle inserted.]
Instructions 1-3 above are assumed to be instructions other than loads/stores.

A Structural Hazard Example
• Given that data references (i.e. loads/stores) are 40% for a specific instruction mix or program, and that the ideal pipelined CPI ignoring hazards is equal to 1.
• A machine with a data memory access structural hazard requires a single stall cycle for data references and has a clock rate 1.05 times higher than the ideal machine. Ignoring other performance losses for this machine:
  Average instruction time = CPI x Clock cycle time
  Average instruction time = (1 + 0.4 x 1) x (Clock cycle time ideal / 1.05)
                           = (1.4 / 1.05) x Clock cycle time ideal
                           ≈ 1.3 x Clock cycle time ideal
  i.e. the CPU without the structural hazard is about 1.3 times faster.
CPI = 1 + Average Stalls Per Instruction
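The structural hazard example reduces to one ratio: the hazard machine's CPI divided by its clock rate advantage. A minimal Python sketch, with illustrative function and parameter names:

```python
def relative_instruction_time(data_ref_fraction, stall_cycles, clock_rate_ratio):
    """Average instruction time of the hazard machine, in units of the ideal
    machine's clock cycle time.

    CPI = 1 + fraction of data references x stall cycles per data reference;
    a clock rate that is `clock_rate_ratio` times higher shortens the cycle
    by the same factor.
    """
    cpi = 1 + data_ref_fraction * stall_cycles
    return cpi / clock_rate_ratio


if __name__ == "__main__":
    ratio = relative_instruction_time(0.40, stall_cycles=1, clock_rate_ratio=1.05)
    print(f"Hazard machine is {ratio:.2f}x slower per instruction")  # prints 1.33x
```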

Data Hazards (i.e. operands)
• Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands in such a way that the resulting access order differs from the original sequential instruction operand access order of the unpipelined CPU, resulting in incorrect execution.
• Data hazards may require one or more instructions to be stalled in the pipeline to ensure correct execution. (CPI = 1 + stall clock cycles per instruction)
• Example (in the figure, arrows represent data dependencies between instructions, from the producer of the result to the consumers of the result):
  1  sub $2, $1, $3
  2  and $12, $2, $5
  3  or  $13, $6, $2
  4  add $14, $2, $2
  5  sw  $15, 100($2)
  – All the instructions after the sub instruction use its result data in register $2.
  – As part of pipelining, these instructions are started before sub is completed:
    • Due to this data hazard, instructions need to be stalled for correct execution (as shown next), i.e. correct operand data is not ready yet when needed in the EX cycle.
• Instructions that have no dependencies among them are said to be parallel or independent. A high degree of Instruction-Level Parallelism (ILP) is present in a given code sequence if it has a large number of parallel instructions.

Data Hazards Example
• Problem with starting the next instruction before the first is finished:
  – Data dependencies here that "go backward in time" create data hazards.
    1  sub $2, $1, $3
    2  and $12, $2, $5
    3  or  $13, $6, $2
    4  add $14, $2, $2
    5  sw  $15, 100($2)
[Figure: pipeline timing diagram (IM, Reg, ALU, DM, Reg per instruction), program execution order versus time in clock cycles CC 1 to CC 9. The value of register $2 is 10 until sub writes it back in CC 5 (shown as 10/-20) and -20 afterwards.]

Data Hazard Resolution: Stall Cycles
• Stall the pipeline by a number of cycles. The control unit must detect the need to insert stall cycles. In this case two stall cycles are needed.
• CPI = 1 + stall clock cycles per instruction
    1  sub $2, $1, $3
    2  and $12, $2, $5
    3  or  $13, $6, $2
    4  add $14, $2, $2
    5  sw  $15, 100($2)
[Figure: pipeline timing diagram, CC 1 to CC 11. Two stall cycles are inserted after sub to resolve the data hazard and ensure correct execution; the value of register $2 changes from 10 to -20 in the cycle where sub writes back (shown as 10/-20). Timing is for MIPS Pipeline Version #1.]
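As a rough check of the stall count, the sketch below models an in-order pipeline without forwarding. It assumes, as in pipeline version #1, that a register written in WB can be read by an instruction that is in ID during the same cycle. The function name and the tuple encoding of instructions are assumptions made for this illustration.

```python
def stalls_without_forwarding(program):
    """program: list of (dest, sources); dest is None if nothing is written.
    Returns (stall_cycles, total_cycles) for a 5-stage in-order pipeline
    with no forwarding (IF ID EX MEM WB, write in WB readable in the same cycle)."""
    wb_cycle = {}   # register -> cycle in which its newest value is written back
    id_cycle = 1    # so that the first instruction is in ID during cycle 2
    stalls = 0
    for dest, sources in program:
        earliest = id_cycle + 1                      # ID slot if no stall is needed
        ready = max([wb_cycle.get(r, 0) for r in sources], default=0)
        actual = max(earliest, ready)                # wait in ID until operands are written
        stalls += actual - earliest
        id_cycle = actual
        if dest is not None:
            wb_cycle[dest] = actual + 3              # ID -> EX -> MEM -> WB
    return stalls, id_cycle + 3                      # last instruction finishes WB

program = [
    ("$2",  ["$1", "$3"]),    # sub $2, $1, $3
    ("$12", ["$2", "$5"]),    # and $12, $2, $5
    ("$13", ["$6", "$2"]),    # or  $13, $6, $2
    ("$14", ["$2", "$2"]),    # add $14, $2, $2
    (None,  ["$15", "$2"]),   # sw  $15, 100($2)
]
stalls, total = stalls_without_forwarding(program)
print(stalls, total)
```

Under these assumptions the model reports two stall cycles and completion in cycle 11, which matches the CC 1 to CC 11 span of the diagram.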

Data Hazard Resolution/Stall Reduction: Data Forwarding
• Observation: Why not use temporary results produced by memory/ALU instead of waiting for them to be written back to the register bank?
• Data Forwarding is a hardware-based technique (also called register bypassing or register short-circuiting) that uses this observation to eliminate or minimize data hazard stalls.
• Using forwarding hardware, the result of an instruction (i.e. data) is copied directly (i.e. forwarded) from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.).

Forwarding In MIPS Pipeline
• The ALU result from the EX/MEM register may be forwarded or fed back to the ALU input latches as needed instead of the register operand value read in the ID stage.
• Similarly, the Data Memory Unit result from the MEM/WB register may be fed back to the ALU input latches as needed.
• If the forwarding hardware detects that a previous ALU operation is to write the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

[Figure: pipelined datapath (ID, EX, MEM, WB stages) with forwarding paths added, labeled 1, 2 and 3. This diagram shows better forwarding paths than in the textbook.]

Data Hazard Resolution: Forwarding
• The forwarding unit compares the operand registers of the instruction in the EX stage with the destination registers of the previous two instructions, which are in MEM and WB.
• If there is a match, one or both operands will be obtained from the forwarding paths, bypassing the registers.
[Figure: forwarding unit with forwarding paths 1, 2 and 3 across the ID, EX, MEM and WB stages. Inputs: operand register numbers of the instruction in EX; destination register numbers of the instructions in MEM and WB. 4th Ed. Fig. 4.54 page 368; 3rd Ed. Fig. 6.30 page 409.]

Pipelined Datapath With Forwarding
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) with main control driven by the opcode and forwarding paths 1, 2 and 3. 4th Ed. Fig. 4.56 page 370; 3rd Ed. Fig. 6.32 page 411.]
• The forwarding unit compares the operand registers of the instruction in the EX stage with the destination registers of the previous two instructions, which are in MEM and WB.
• If there is a match, one or both operands will be obtained from the forwarding paths, bypassing the registers.
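The comparison performed by the forwarding unit can be sketched as a small selection function. The checks that the earlier instruction actually writes a register (RegWrite) and that its destination is not register 0 follow the textbook's forwarding conditions and are not spelled out on the slide; the function name and the returned labels are assumptions for this sketch.

```python
def forward_select(src, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose where one ALU operand of the instruction in EX comes from.
    src: source register number (rs or rt) of the instruction in EX.
    The newer result (EX/MEM) takes priority over the older one (MEM/WB)."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return "EX/MEM"      # result of the instruction now in MEM
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return "MEM/WB"      # result of the instruction now in WB
    return "REGFILE"         # no match: use the value read in ID

# sub $2, $1, $3 followed by and $12, $2, $5 and or $13, $6, $2:
# "and" in EX while "sub" is in MEM -> operand $2 comes from EX/MEM.
print(forward_select(2, ex_mem_rd=2, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
# "or" in EX while "and" is in MEM and "sub" is in WB -> $2 comes from MEM/WB.
print(forward_select(2, ex_mem_rd=12, ex_mem_regwrite=True,
                     mem_wb_rd=2, mem_wb_regwrite=True))
```

The function would be evaluated once per ALU operand, so an instruction can take one operand from a forwarding path and the other from the register file.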

Data Hazard Example With Forwarding
    1  sub $2, $1, $3
    2  and $12, $2, $5
    3  or  $13, $6, $2
    4  add $14, $2, $2
    5  sw  $15, 100($2)
[Figure: pipeline timing diagram, program execution order versus time in clock cycles CC 1 to CC 9, with the sub result forwarded to the following instructions. Value of register $2: 10 until CC 5 (10/-20), then -20. Value of EX/MEM: -20 in CC 4, X otherwise. Value of MEM/WB: -20 in CC 5, X otherwise.]
• What register numbers are being compared by the forwarding unit during cycle 5? What about in cycle 6?

A Data Hazard Requiring A Stall
• A load followed immediately by an R-type instruction that uses the loaded value (or any other type of instruction that needs the loaded value in the EX stage).
• Even with forwarding in place a stall cycle is needed (shown next).
• This condition must be detected by hardware.
[Figure: pipeline timing diagram of a load followed by dependent instructions.]

A Data Hazard Requiring A Stall
• A load followed immediately by an R-type instruction that uses the loaded value results in a single stall cycle even with forwarding, as shown: first stall one cycle, then forward the data of the "lw" instruction to the "and" instruction.
• CPI = 1 + stall clock cycles per instruction
• We can stall the pipeline by keeping all instructions following the "lw" instruction in the same pipeline stage for one cycle.
• What is the hazard detection unit (shown next slide) doing during cycle 3?
[Figure: pipeline timing diagram with one stall cycle followed by forwarding.]

Datapath With Hazard Detection Unit
• A load followed by an instruction that uses the loaded value is detected by the hazard detection unit and a stall cycle is inserted.
• The hazard detection unit checks if the instruction in the EX stage is a load by checking its MemRead control line value (ID/EX.MemRead).
• If that instruction is a load, it also checks if any of the operand registers of the instruction in the decode stage (ID) match the destination register of the load. In case of a match it inserts a stall cycle (delays decode and fetch by one cycle).
• A stall, if needed, is created by disabling instruction write (keep last instruction) in IF/ID and by inserting a set of control values with zero values in ID/EX.
[Figure: MIPS Pipeline Version #2 datapath (with forwarding, branch still resolved in MEM stage). The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt and drives PCWrite, IF/IDWrite and the control mux. The forwarding unit takes the Rs and Rt numbers of the instruction in EX plus EX/MEM.RegisterRd and MEM/WB.RegisterRd. 4th Edition Figure 4.60 page 375; 3rd Edition Figure 6.36 page 416.]
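The load-use check described above can be sketched as follows. The function name, the dictionary keys, and the register numbers in the usage lines are illustrative assumptions; the signal names PCWrite and IF/IDWrite are taken from the datapath figure.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard check.
    id_ex_mem_read: True if the instruction in EX is a load (ID/EX.MemRead).
    id_ex_rt: destination register of that load (ID/EX.RegisterRt).
    if_id_rs, if_id_rt: operand registers of the instruction in ID."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,          # hold the PC: fetch is delayed one cycle
        "IF/IDWrite": not stall,       # keep the last instruction in IF/ID
        "zero_ID/EX_control": stall,   # insert zeroed control values into ID/EX
    }

# Hypothetical case: a load into $2 is in EX, and the instruction in ID reads $2 and $5.
print(hazard_detection(True, 2, 2, 5))
# Same registers, but the instruction in EX is not a load: no stall.
print(hazard_detection(False, 2, 2, 5))
```

After the single stall cycle the load is in MEM/WB, so the forwarding unit can supply the loaded value to the dependent instruction.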

Stall + Forward Hazard Detection Unit Operation

Compiler Instruction Scheduling (Re-ordering) Example
• Reorder the instructions to avoid as many pipeline stalls as possible:
  Original code:
    lw  $15, 0($2)
    lw  $16, 4($2)
    (stall)
    add $14, $5, $16
    sw  $16, 4($2)
• The data hazard occurs on register $16 between the second lw and the add instruction, resulting in a stall cycle even with forwarding.
• With forwarding (i.e. pipeline version #2) we (or the compiler) need to find only one independent instruction to place between them; swapping the lw instructions works:
  Scheduled code with forwarding (no stalls):
    lw  $16, 4($2)
    lw  $15, 0($2)
    add $14, $5, $16
    sw  $16, 4($2)
• Without forwarding (i.e. pipeline version #1) we need two independent instructions to place between them, so in addition a nop is added (or the hardware will insert a stall cycle):
  Scheduled code, no forwarding:
    lw  $16, 4($2)
    lw  $15, 0($2)
    nop
    add $14, $5, $16
    sw  $16, 4($2)

Control Hazards
• When a conditional branch is executed it may change the PC (when taken) and, without any special measures, leads to stalling the pipeline for a number of cycles until the branch condition is known and the PC is updated (branch is resolved). For pipeline versions 1 or 2 this happens at the end of stage 4 (MEM).
  – Otherwise the PC may not be correct when needed in IF.
• In the current MIPS pipeline (i.e. version 2), the conditional branch is resolved in stage 4 (MEM stage), resulting in three stall cycles as shown below:
[Figure: timing diagram. The branch instruction passes through IF ID EX MEM WB; the correct PC is available at the end of its MEM cycle, so the branch successor is fetched only after 3 stall cycles (the branch penalty). Branch successor + 1 through branch successor + 5 follow one cycle apart.]
• Assuming we stall or flush the pipeline on a branch instruction: three clock cycles are wasted for every branch for the current MIPS pipeline, i.e. the correct PC is not available when needed in IF.
• Branch Penalty = stage number where branch is resolved - 1; here Branch Penalty = 4 - 1 = 3 cycles

Basic Branch Handling in Pipelines
1 One scheme discussed earlier is to always stall (flush or freeze) the pipeline whenever a conditional branch is decoded by holding or deleting any instructions in the pipeline until the branch destination is known (zero pipeline registers, control lines).
  Pipeline stall cycles from branches = frequency of branches x branch penalty
  • Ex: Branch frequency = 20%, branch penalty = 3 cycles
    CPI = 1 + 0.2 x 3 = 1.6
2 Another method is to assume or predict that the branch is not taken, where the state of the machine is not changed until the branch outcome is definitely known. Execution here continues with the next (PC+4) instruction; a stall occurs here when the branch is taken.
  Pipeline stall cycles from branches = frequency of taken branches x branch penalty
  • Ex: Branch frequency = 20% of which 45% are taken, branch penalty = 3 cycles
    CPI = 1 + 0.2 x 0.45 x 3 = 1.27
CPI = 1 + Average Stalls Per Instruction
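The two branch-handling schemes differ only in which branches pay the penalty, which the following sketch makes explicit. The function and parameter names are assumptions for illustration; the numbers are those of the two examples.

```python
def cpi_with_branch_stalls(branch_freq, branch_penalty, stalled_fraction=1.0):
    """CPI = 1 + average branch stall cycles per instruction.
    stalled_fraction: share of branches that actually pay the penalty
    (1.0 for always-stall, the taken fraction for predict-not-taken)."""
    return 1 + branch_freq * stalled_fraction * branch_penalty

# Scheme 1: always stall on every branch.
cpi_always_stall = cpi_with_branch_stalls(0.20, 3)
# Scheme 2: predict not taken; only the 45% taken branches stall.
cpi_not_taken = cpi_with_branch_stalls(0.20, 3, stalled_fraction=0.45)

print(f"always stall:      CPI = {cpi_always_stall:.2f}")
print(f"predict not taken: CPI = {cpi_not_taken:.2f}")
```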

Control Hazards: Example
• Three other instructions are in the pipeline before the branch target decision is made when BEQ is in the MEM stage.
• Branch resolved in stage 4 (MEM), thus Taken Branch Penalty = 4 - 1 = 3 stall cycles.
[Figure: pipeline diagram showing the not-taken direction (the three sequential instructions after BEQ) and the target instruction reached if the branch is taken.]
• In the above diagram, we are assuming "branch not taken".
  – Need to add hardware for flushing the three following instructions if we are wrong, losing three cycles when the branch is taken (i.e. the taken branch penalty, when the branch was resolved as taken in the MEM stage).

Hardware Reduction of Branch Stall Cycles (i.e. pipeline redesign): MIPS Pipeline Version #3
Pipeline hardware measures to reduce taken branch stall cycles:
1 - Find out whether a branch is taken earlier in the pipeline.
2 - Compute the taken PC earlier in the pipeline.
i.e. resolve the branch in an early stage in the pipeline.
In MIPS:
  – The MIPS branch instructions BEQ and BNE compare two registers for equality.
  – This can be completed in the ID cycle by moving the equality test into that cycle (ID).
  – Both PCs (taken and not taken) must be computed early.
  – Requires an additional adder in ID because the current ALU is not usable until the EX cycle.
  – This results in just a single cycle stall on taken branches.
• Branch Penalty when taken = stage resolved - 1 = 2 - 1 = 1, as opposed to a branch penalty of 3 cycles before (pipeline versions 1 and 2).
MIPS Pipeline Version 3: With forwarding, branch resolved in ID stage

Reducing Delay (Penalty) of Taken Branches
• So far: the next PC of a branch is known or resolved in the MEM stage, which costs three lost cycles if the branch is taken.
• If the next PC of a branch is known or resolved in the EX stage, one cycle is saved.
• Branch address calculation can be moved to the ID stage (stage 2) using a register comparator, costing only one cycle if the branch is taken, as shown below. Branch Penalty = stage 2 - 1 = 1 cycle
[Figure: MIPS Pipeline Version #3 datapath (with forwarding, branch resolved in ID stage), showing IF.Flush, the hazard detection unit, the register comparator (=) and branch target adder (sign extend, shift left 2) in ID, and the forwarding unit. Here the branch is resolved in the ID stage (stage 2), thus branch penalty if taken = 2 - 1 = 1 cycle. 4th Edition Figure 4.65 page 384; 3rd Edition Figure 6.41 page 427.]

Pipeline Performance Example
• Assume the following MIPS instruction mix:
    Type          Frequency
    Arith/Logic   40%
    Load          30%   of which 25% are followed immediately by an instruction using the loaded value (1 stall)
    Store         10%
    Branch        20%   of which 45% are taken (1 stall)
• What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in ID stage (i.e. Version 3) when using the branch not-taken scheme? Branch Penalty = 1 cycle
• CPI = Ideal CPI + Pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + 0.3 x 0.25 x 1 + 0.2 x 0.45 x 1
      = 1 + 0.075 + 0.09
      = 1.165
CPI = 1 + Average Stalls Per Instruction
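The same calculation can be parameterized so the effect of the branch penalty is easy to see. This is only a sketch: the function name and keyword names are assumptions, and the second call (branch penalty of 3 cycles, as in pipeline version 2) is an extra comparison that is not part of the worked example.

```python
def pipeline_cpi(load_freq, load_use_fraction, branch_freq, taken_fraction,
                 load_stall=1, branch_penalty=1, ideal_cpi=1.0):
    """CPI = ideal CPI + stalls by loads + stalls by branches
    (branch not-taken scheme: only taken branches pay the penalty)."""
    load_stalls = load_freq * load_use_fraction * load_stall
    branch_stalls = branch_freq * taken_fraction * branch_penalty
    return ideal_cpi + load_stalls + branch_stalls

# Version 3: forwarding, branch resolved in ID (penalty = 1 cycle).
cpi_v3 = pipeline_cpi(0.30, 0.25, 0.20, 0.45, branch_penalty=1)
# For comparison only: same mix with the branch resolved in MEM (penalty = 3 cycles).
cpi_v2 = pipeline_cpi(0.30, 0.25, 0.20, 0.45, branch_penalty=3)

print(f"Version 3 CPI = {cpi_v3:.3f}")
print(f"Version 2 CPI = {cpi_v2:.3f}")
```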

ISA Reduction of Branch Penalties: Delayed Branch (Action) (i.e. ISA support needed)
• When delayed branch is used in an ISA, the branch action is delayed by n cycles (or instructions), following this execution pattern (program order):
    conditional branch instruction
    sequential successor 1
    sequential successor 2      } n branch delay slots
    ...
    sequential successor n
    branch target if taken
  The instructions in the branch delay slots are always executed regardless of branch direction.
• The sequential successor instructions are said to be in the branch delay slots. These instructions are executed whether or not the branch is taken.
• In practice, all ISAs that utilize delayed branching, including MIPS, utilize a single instruction branch delay slot. (All RISC ISAs)
  – The job of the compiler is to make the successor instruction in the delay slot a valid and useful instruction.

Delayed Branch Example (single branch delay slot, instruction or cycle, used here) (All RISC ISAs)
[Figure: pipeline timing with a single branch delay slot; panel shown: Not Taken Branch (no stall).]
• The instruction in the branch delay slot is executed whether the branch is taken or not.
• Here, assuming the MIPS pipeline (version 3) with reduced branch penalty = 1.

Delayed Branch-delay Slot Scheduling Strategies
The branch-delay slot instruction can be chosen from three cases:
A An independent instruction from before the branch: Always improves performance when used. The branch must not depend on the rescheduled instruction.
B An instruction from the target of the branch: Improves performance if the branch is taken and may require instruction duplication. This instruction must be safe to execute if the branch is not taken.
C An instruction from the fall through instruction stream: Improves performance when the branch is not taken. The instruction must be safe to execute when the branch is taken.
Slide annotations beside the cases: "Most Common", "e.g. From Body of a loop", "Hard to Find".

Scheduling The Branch Delay Slot
Example: From the body of a loop (most common choice)

Compiler Instruction Scheduling Example With Branch Delay Slot (to reduce or eliminate stalls)
• Schedule the following MIPS code for the pipelined MIPS CPU with forwarding and reduced branch delay (i.e. MIPS Pipeline Version 3) using a single branch delay slot to minimize stall cycles:
    loop: lw   $1, 0($2)       # $1 array element
          add  $1, $1, $3      # add constant in $3
          sw   $1, 0($2)       # store result array element
          addi $2, $2, -4      # decrement address by 4
          bne  $2, $4, loop    # branch if $2 != $4
• Assuming the initial value of $2 = $4 + 40 (i.e. it loops 10 times):
  – What is the CPI and total number of cycles needed to run the code with and without scheduling, for MIPS Pipeline Version 3?

Compiler Instruction Scheduling Example (With Branch Delay Slot)
Target CPU: MIPS Pipeline Version 3 (with forwarding, branch resolved in ID stage)
• Without compiler scheduling (three stalls per iteration):
    loop: lw   $1, 0($2)
          (stall)
          add  $1, $1, $3
          sw   $1, 0($2)
          addi $2, $2, -4
          (stall: needed because the new value of $2 is not produced yet)
          bne  $2, $4, loop
          (stall or NOP in the branch delay slot)
  Ignoring the initial 4 cycles to fill the pipeline: each iteration takes 8 cycles
  CPI = 8/5 = 1.6
  Total cycles = 8 x 10 = 80 cycles
• With compiler scheduling (no stalls):
    loop: lw   $1, 0($2)
          addi $2, $2, -4      # moved between lw and add
          add  $1, $1, $3
          bne  $2, $4, loop
          sw   $1, 4($2)       # moved to branch delay slot, address offset adjusted
  Ignoring the initial 4 cycles to fill the pipeline: each iteration takes 5 cycles
  CPI = 5/5 = 1
  Total cycles = 5 x 10 = 50 cycles
• Speedup = 80/50 = 1.6
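The cycle counts above can be reproduced with a simplified timing model of pipeline version 3. The model is a sketch under these assumptions: ALU results can be forwarded to the next instruction's EX stage, loaded values arrive one cycle later, a branch needs its operands one stage earlier (in ID), stores are treated as needing their operands in EX, and dependencies across loop iterations are ignored. The function name and the tuple encoding of instructions are hypothetical.

```python
def cycles_per_iteration(body, delay_slot_filled):
    """body: list of (kind, dest, sources), kind in {"load", "alu", "store", "branch"}.
    Returns the issue cycles one loop iteration occupies, including stalls."""
    ready = {}   # register -> earliest ID cycle at which an EX-stage consumer can use it
    t = 0        # ID cycle of the current instruction
    for kind, dest, sources in body:
        t += 1
        extra = 1 if kind == "branch" else 0       # branch compares registers in ID
        need = max([ready.get(r, 0) + extra for r in sources], default=0)
        t = max(t, need)                           # stall until operands can be forwarded
        if dest is not None:
            ready[dest] = t + (2 if kind == "load" else 1)
    return t + (0 if delay_slot_filled else 1)     # unfilled delay slot costs a NOP

unscheduled = [
    ("load",   "$1", ["$2"]),         # lw   $1, 0($2)
    ("alu",    "$1", ["$1", "$3"]),   # add  $1, $1, $3
    ("store",  None, ["$1", "$2"]),   # sw   $1, 0($2)
    ("alu",    "$2", ["$2"]),         # addi $2, $2, -4
    ("branch", None, ["$2", "$4"]),   # bne  $2, $4, loop
]
scheduled = [
    ("load",   "$1", ["$2"]),         # lw   $1, 0($2)
    ("alu",    "$2", ["$2"]),         # addi $2, $2, -4
    ("alu",    "$1", ["$1", "$3"]),   # add  $1, $1, $3
    ("branch", None, ["$2", "$4"]),   # bne  $2, $4, loop
    ("store",  None, ["$1", "$2"]),   # sw   $1, 4($2)  (in the delay slot)
]
before = cycles_per_iteration(unscheduled, delay_slot_filled=False)
after = cycles_per_iteration(scheduled, delay_slot_filled=True)
print(before, after, (before * 10) / (after * 10))
```

With these assumptions the model gives 8 cycles per iteration before scheduling and 5 after, in line with the figures in the example.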

The MIPS R4000 Integer Pipeline
• Implements MIPS64 but uses an 8-stage pipeline instead of the classic 5-stage pipeline to achieve a higher clock speed.
• Pipeline Stages:
  1 IF: First half of instruction fetch. Start instruction cache access.
  2 IS: Second half of instruction fetch. Complete instruction cache access.
  3 RF: Instruction decode and register fetch, hazard checking.
  4 EX: Execution including branch-target and condition evaluation.
  5 DF: Data fetch, first half of data cache access. Data available if a hit.
  6 DS: Second half of data fetch access. Complete data cache access. Data available if a cache hit.
  7 TC: Tag check, determine data cache access hit.
  8 WB: Write back for loads and register-register operations.
  – Branch resolved in stage 4, thus Branch Penalty = 4 - 1 = 3 cycles if taken (2 with branch delay slot).

Deeper Pipelines = More Stall Cycles and Higher CPI: MIPS R4000 Example
[Figure: pipeline timing in program order showing where the LW data becomes available and the forwarding of LW data.]
• Even with forwarding, the deeper pipeline leads to a 2-cycle load delay (2 stall cycles), as opposed to 1 cycle in the classic 5-stage pipeline.
• Thus: Deeper Pipelines = More Stall Cycles
T = I x CPI x C