Pipelined Processor Design
ICS 233 – Computer Architecture & Assembly Language
Prof. Muhamed Mudawar
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals

Presentation Outline
• Pipelining versus Serial Execution
• Pipelined Datapath and Control
• Pipeline Hazards
• Data Hazards and Forwarding
• Load Delay, Hazard Detection, and Stall
• Control Hazards
• Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design, ICS 233 – Computer Architecture & Assembly Language, © Muhamed Mudawar

Pipelining Example
• Laundry Example: Three Stages
  1. Wash dirty load of clothes
  2. Dry wet clothes
  3. Fold and put clothes into drawers
• Each stage takes 30 minutes to complete
• Four loads of clothes (A, B, C, D) to wash, dry, and fold

Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps; loads A, B, C, and D are washed, dried, and folded one after another]
• Sequential laundry takes 6 hours for 4 loads
• Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP
[Figure: timeline from 6 PM to 9 PM in 30-minute steps; loads A, B, C, and D overlap, each starting 30 minutes after the previous one]
• Pipelined laundry takes 3 hours for 4 loads
• Speedup factor is 2 for 4 loads
• Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining
• Consider a task that can be divided into k subtasks
  ◦ The k subtasks are executed on k different stages
  ◦ Each subtask requires one time unit
  ◦ The total execution time of the task is k time units
• Pipelining overlaps the execution of successive tasks
  ◦ The k stages work in parallel on k different tasks
  ◦ Tasks enter/leave the pipeline at the rate of one task per time unit
[Figure: stages 1, 2, …, k drawn for successive tasks, without and with pipelining]
• Without pipelining: one completion every k time units
• With pipelining: one completion every 1 time unit

Synchronous Pipeline
• Uses clocked registers between stages
• Upon arrival of a clock edge …
  ◦ All registers hold the results of previous stages simultaneously
• The pipeline stages are combinational logic circuits
• It is desirable to have balanced stages
  ◦ Approximately equal delay in all stages
• Clock period is determined by the maximum stage delay
[Figure: stages S1, S2, …, Sk separated by clocked registers between Input and Output, all registers driven by a common Clock]

Pipeline Performance
• Let $t_i$ = time delay in stage $S_i$
• Clock cycle $t = \max(t_i)$ is the maximum stage delay
• Clock frequency $f = 1/t = 1/\max(t_i)$
• A pipeline can process n tasks in k + n – 1 cycles
  ◦ k cycles are needed to complete the first task
  ◦ n – 1 cycles are needed to complete the remaining n – 1 tasks
• Ideal speedup of a k-stage pipeline over serial execution
  $$S_k = \frac{\text{serial execution in cycles}}{\text{pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \text{ for large } n$$
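
The cycle counts and the ideal speedup above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the course material. It assumes every stage takes exactly one cycle, as the slide does.

```python
def serial_cycles(k: int, n: int) -> int:
    """Cycles to run n tasks one after another when each task needs k cycles."""
    return n * k


def pipelined_cycles(k: int, n: int) -> int:
    """Cycles for a k-stage pipeline: k for the first task, then one per extra task."""
    return k + n - 1


def ideal_speedup(k: int, n: int) -> float:
    """S_k = nk / (k + n - 1); approaches k as n grows."""
    return serial_cycles(k, n) / pipelined_cycles(k, n)


if __name__ == "__main__":
    # Laundry example: 3 stages (wash, dry, fold) and 4 loads.
    print("laundry speedup:", ideal_speedup(3, 4))
    # A 5-stage pipeline gets closer to a speedup of 5 as n grows.
    for n in (10, 1_000, 100_000):
        print(n, round(ideal_speedup(5, n), 4))
```

With k = 3 and n = 4 this reproduces the laundry result: 12 time units serially versus 6 pipelined.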

MIPS Processor Pipeline
• Five stages, one cycle per stage
  1. IF: Instruction Fetch from instruction memory
  2. ID: Instruction Decode, register read, and jump/branch (J/Br) address
  3. EX: Execute operation or calculate load/store address
  4. MEM: Memory access for load and store
  5. WB: Write Back result to register

Single-Cycle vs Pipelined Performance
• Consider a 5-stage instruction execution in which …
  ◦ Instruction fetch = ALU operation = Data memory access = 200 ps
  ◦ Register read = register write = 150 ps
• What is the clock cycle of the single-cycle processor?
• What is the clock cycle of the pipelined processor?
• What is the speedup factor of pipelined execution?
• Solution
  Single-Cycle Clock = 200 + 150 + 200 + 200 + 150 = 900 ps
[Figure: two instructions executed back to back (IF, Reg, ALU, MEM, Reg), each taking one 900 ps cycle]

Single-Cycle versus Pipelined – cont'd
• Pipelined clock cycle = max(200, 150) = 200 ps
[Figure: three overlapped instructions (IF, Reg, ALU, MEM, Reg), each stage occupying one 200 ps cycle]
• CPI for pipelined execution = 1
  ◦ One instruction completes each cycle (ignoring pipeline fill)
• Speedup of pipelined execution = 900 ps / 200 ps = 4.5
  ◦ Instruction count and CPI are equal in both cases
• Speedup factor is less than 5 (number of pipeline stages)
  ◦ Because the pipeline stages are not balanced
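
The worked example can be reproduced with a few lines of Python. This is only a sketch: the stage delays come from the example, while the dictionary layout, the function names, and the 180 ps "balanced" variant are assumptions added for illustration.

```python
# Stage delays in picoseconds, taken from the example above.
STAGE_DELAY_PS = {
    "IF": 200,   # instruction fetch
    "ID": 150,   # register read
    "EX": 200,   # ALU operation
    "MEM": 200,  # data memory access
    "WB": 150,   # register write
}


def single_cycle_clock_ps(delays: dict) -> int:
    """One long cycle must cover every stage in sequence."""
    return sum(delays.values())


def pipelined_clock_ps(delays: dict) -> int:
    """The slowest stage sets the pipelined clock period."""
    return max(delays.values())


def clock_speedup(delays: dict) -> float:
    """Speedup when instruction count and CPI are the same in both designs."""
    return single_cycle_clock_ps(delays) / pipelined_clock_ps(delays)


if __name__ == "__main__":
    print(single_cycle_clock_ps(STAGE_DELAY_PS), "ps single-cycle clock")
    print(pipelined_clock_ps(STAGE_DELAY_PS), "ps pipelined clock")
    print("speedup:", clock_speedup(STAGE_DELAY_PS))
    # Hypothetical balanced design with the same 900 ps total delay.
    balanced = {stage: 180 for stage in STAGE_DELAY_PS}
    print("balanced speedup:", clock_speedup(balanced))
```

The hypothetical balanced case shows why the speedup reaches the stage count only when all stage delays are equal.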

Pipeline Performance Summary
• Pipelining doesn't improve latency of a single instruction
• However, it improves throughput of entire workload
  ◦ Instructions are initiated and completed at a higher rate
• In a k-stage pipeline, k instructions operate in parallel
  ◦ Overlapped execution using multiple hardware resources
  ◦ Potential speedup = number of pipeline stages k
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipeline stages reduce speedup
• Also, the time to fill and drain the pipeline reduces speedup

Single-Cycle Datapath
• Shown below is the single-cycle datapath
• How to pipeline this single-cycle datapath?
  Answer: Introduce a pipeline register at the end of each stage
[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Components: PC, Instruction Memory, Registers, Next PC logic (jump or branch target address), ALU, Data Memory. Control signals: RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, PCSrc, MemRead, MemWrite, MemtoReg]

Pipelined Datapath
• Pipeline registers are shown in green, including the PC
• The same clock edge updates all pipeline registers, the register file, and the data memory (for store instructions)
[Figure: pipelined datapath with stages IF, ID, EX, MEM, and WB; register labels in the figure include PC, Instruction, NPC, NPC2, A, B, Imm, ALUout, D, and WB Data]

Problem with Register Destination
• Is there a problem with the register destination address?
  ◦ Instruction in the ID stage is different from the one in the WB stage
  ◦ Instruction in the WB stage is not writing to its destination register but to the destination of a different instruction in the ID stage
[Figure: pipelined datapath from the previous slide, highlighting the destination register (RW) input of the register file]

Pipelining the Destination Register
• Destination register number should be pipelined
  ◦ Destination register number is passed from ID to WB stage
  ◦ The WB stage writes back data knowing the destination register
[Figure: pipelined datapath in which the destination register number is carried through the pipeline registers Rd2, Rd3, and Rd4 to the RW input of the register file]

Graphically Representing Pipelines
• Multiple instruction execution over multiple clock cycles
  ◦ Instructions are listed in execution order from top to bottom
  ◦ Clock cycles move from left to right
  ◦ Figure shows the use of resources at each stage and each cycle
[Figure: program execution order against time (CC1 to CC8); each instruction uses IM, Reg, ALU, DM, Reg in successive cycles and starts one cycle after the previous instruction]
    lw  $t6, 8($s5)
    add $s1, $s2, $s3
    ori $s4, $t3, 7
    sub $t5, $s2, $t3
    sw  $s2, 10($t3)

Instruction-Time Diagram
• Instruction-Time Diagram shows:
  ◦ Which instruction occupies which stage at each clock cycle
• Instruction flow is pipelined over the 5 stages
• Up to five instructions can be in the pipeline during the same cycle
  ◦ Instruction Level Parallelism (ILP)
• ALU instructions skip the MEM stage. Store instructions skip the WB stage

    Instruction          CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
    lw  $t7, 8($s3)      IF   ID   EX   MEM  WB
    lw  $t6, 8($s5)           IF   ID   EX   MEM  WB
    ori $t4, $s3, 7                IF   ID   EX   –    WB
    sub $s5, $s2, $t3                   IF   ID   EX   –    WB
    sw  $s2, 10($s3)                         IF   ID   EX   MEM  –
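
An instruction-time diagram of this kind can be generated mechanically. The sketch below is a hypothetical helper, not part of the course material: it assumes no hazards (one instruction enters IF every cycle), treats `lw` and `sw` as the only memory instructions, and prints `-` for a stage that an instruction passes through without using.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_labels(mnemonic: str) -> list[str]:
    """Stage labels for one instruction; '-' marks a stage it does not use."""
    labels = list(STAGES)
    if mnemonic == "sw":
        labels[4] = "-"   # stores write no register
    elif mnemonic != "lw":
        labels[3] = "-"   # ALU instructions do not access data memory
    return labels


def timing_rows(program: list[str]) -> list[list[str]]:
    """One row per instruction; instruction i enters IF in cycle i + 1."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i, text in enumerate(program):
        row = [""] * total_cycles
        row[i:i + len(STAGES)] = stage_labels(text.split()[0])
        rows.append(row)
    return rows


def render(program: list[str]) -> str:
    """Format the rows as a fixed-width text table."""
    rows = timing_rows(program)
    width = max(len(text) for text in program)
    header = " " * width + " " + " ".join(
        f"CC{c:<3}" for c in range(1, len(rows[0]) + 1))
    body = [f"{text:<{width}} " + " ".join(f"{cell:<5}" for cell in row)
            for text, row in zip(program, rows)]
    return "\n".join([header] + body)


if __name__ == "__main__":
    print(render([
        "lw $t7, 8($s3)",
        "lw $t6, 8($s5)",
        "ori $t4, $s3, 7",
        "sub $s5, $s2, $t3",
        "sw $s2, 10($s3)",
    ]))
```

For five instructions the table spans 5 + 5 – 1 = 9 cycles, and in cycle 5 all five stages are occupied.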

Control Signals
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) annotated with the control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, PCSrc, MemRead, MemWrite, and MemtoReg]
• Same control signals used in the single-cycle datapath

Pipelined Control
[Figure: pipelined datapath with a Main & ALU Control unit in the ID stage, driven by the Op and func fields; the generated control signals are carried through the EX, MEM, and WB pipeline registers]
• Pass control signals along pipeline just like the data

Pipelined Control – Cont'd
• ID stage generates all the control signals
• Pipeline the control signals as the instruction moves
  ◦ Extend the pipeline registers to include the control signals
• Each stage uses some of the control signals
  ◦ Instruction Decode and Register Read
    ▪ Control signals are generated
    ▪ RegDst is used in this stage
  ◦ Execution Stage => ExtOp, ALUSrc, and ALUCtrl
    ▪ Next PC uses J, Beq, Bne, and zero signals for branch control
  ◦ Memory Stage => MemRead, MemWrite, and MemtoReg
  ◦ Write Back Stage => RegWrite is used in this stage

Control Signals Summary

            Decode   Execute Stage                               Memory Stage               Write Back
    Op      RegDst   ALUSrc  ExtOp   J  Beq  Bne  ALUCtrl        MemRd  MemWr  MemtoReg     RegWrite
    R-Type  1=Rd     0=Reg   x       0  0    0    func           0      0      0            1
    addi    0=Rt     1=Imm   1=sign  0  0    0    ADD            0      0      0            1
    slti    0=Rt     1=Imm   1=sign  0  0    0    SLT            0      0      0            1
    andi    0=Rt     1=Imm   0=zero  0  0    0    AND            0      0      0            1
    ori     0=Rt     1=Imm   0=zero  0  0    0    OR             0      0      0            1
    lw      0=Rt     1=Imm   1=sign  0  0    0    ADD            1      0      1            1
    sw      x        1=Imm   1=sign  0  0    0    ADD            0      1      x            0
    beq     x        0=Reg   x       0  1    0    SUB            0      0      x            0
    bne     x        0=Reg   x       0  0    1    SUB            0      0      x            0
    j       x        x       x       1  0    0    x              0      0      x            0

Pipeline Hazards
• Hazards: situations that would cause incorrect execution
  ◦ If the next instruction were launched during its designated clock cycle
1. Structural hazards
  ◦ Caused by resource contention
  ◦ Two instructions use the same resource during the same cycle
2. Data hazards
  ◦ An instruction may compute a result needed by the next instruction
  ◦ Hardware can detect dependencies between instructions
3. Control hazards
  ◦ Caused by instructions that change control flow (branches/jumps)
  ◦ Delays in changing the flow of control
• Hazards complicate pipeline control and limit performance

Structural Hazards
• Problem
  ◦ Attempt to use the same hardware resource by two different instructions during the same cycle
• Example
  ◦ Writing back ALU result in stage 4
  ◦ Conflict with writing load data in stage 5
• Structural hazard: two instructions are attempting to write the register file during the same cycle
[Figure: instruction-time diagram, CC1 to CC9; the load writes back in its fifth stage while the following ALU instruction writes back in its fourth stage, so both write the register file in the same cycle]
    lw  $t6, 8($s5)
    ori $t4, $s3, 7
    sub $t5, $s2, $s3
    sw  $s2, 10($s3)

Resolving Structural Hazards
• Serious Hazard:
  ◦ Hazard cannot be ignored
• Solution 1: Delay Access to Resource
  ◦ Must have mechanism to delay instruction access to resource
  ◦ Delay all write backs to the register file to stage 5
    ▪ ALU instructions bypass stage 4 (memory) without doing anything
• Solution 2: Add more hardware resources (more costly)
  ◦ Add more hardware to eliminate the structural hazard
  ◦ Redesign the register file to have two write ports
    ▪ First write port can be used to write back ALU results in stage 4
    ▪ Second write port can be used to write back load data in stage 5

Data Hazards
• Dependency between instructions causes a data hazard
• The dependent instructions are close to each other
  ◦ Pipelined execution might change the order of operand access
• Read After Write – RAW Hazard
  ◦ Given two instructions I and J, where I comes before J
  ◦ Instruction J should read an operand after it is written by I
  ◦ Called a data dependence in compiler terminology
      I: add $s1, $s2, $s3    # $s1 is written
      J: sub $s4, $s1, $s3    # $s1 is read
  ◦ Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard
[Figure: program execution order against time (CC1 to CC8), with each instruction using IM, Reg, ALU, DM, Reg; the value of $s2 changes from 10 to 20 when sub writes it back]
    sub $s2, $t1, $t3
    add $s4, $s2, $t5
    or  $s6, $t3, $s2
    and $s7, $t4, $s2
    sw  $t8, 10($s2)
• Result of sub is needed by the add, or, and, & sw instructions
• Instructions add & or will read the old value of $s2 from the register file
• During CC5, $s2 is written at the end of the cycle, so the and instruction also reads the old value

Solution 1: Stalling the Pipeline
[Figure: instruction order against time (CC1 to CC9); the add instruction is held in the Reg stage with stall cycles until the new value of $s2 (20 instead of 10) is available]
    sub $s2, $t1, $t3
    add $s4, $s2, $t5
    or  $s6, $t3, $s2
• Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
  ◦ Stall cycles delay execution of add & fetching of the or instruction
• The add instruction cannot read $s2 until the beginning of CC6
  ◦ The add instruction remains in the Instruction register until CC6
  ◦ The PC register is not modified until the beginning of CC6

Solution 2: Forwarding ALU Result
• The ALU result is forwarded (fed back) to the ALU input
  ◦ No bubbles are inserted into the pipeline and no cycles are wasted
• ALU result is forwarded from ALU, MEM, and WB stages
[Figure: program execution order against time (CC1 to CC8), with the result of sub forwarded to the instructions that follow; the value of $s2 changes from 10 to 20]
    sub $s2, $t1, $t3
    add $s4, $s2, $t5
    or  $s6, $t3, $s2
    and $s7, $s6, $s2
    sw  $t8, 10($s2)

Implementing Forwarding
• Two multiplexers added at the inputs of A & B registers
  ◦ Data from ALU stage, MEM stage, and WB stage is fed back
• Two signals: ForwardA and ForwardB control forwarding
[Figure: pipelined datapath with two 4-input multiplexers (inputs 0 to 3) in front of the A and B registers, selected by ForwardA and ForwardB; inputs come from the register file (BusA, BusB), the ALU result, the MEM stage result, and WData from the WB stage]

Forwarding Control Signals

    Signal         Explanation
    ForwardA = 0   First ALU operand comes from register file = Value of (Rs)
    ForwardA = 1   Forward result of previous instruction to A (from ALU stage)
    ForwardA = 2   Forward result of 2nd previous instruction to A (from MEM stage)
    ForwardA = 3   Forward result of 3rd previous instruction to A (from WB stage)
    ForwardB = 0   Second ALU operand comes from register file = Value of (Rt)
    ForwardB = 1   Forward result of previous instruction to B (from ALU stage)
    ForwardB = 2   Forward result of 2nd previous instruction to B (from MEM stage)
    ForwardB = 3   Forward result of 3rd previous instruction to B (from WB stage)

Forwarding Example
Instruction sequence:
    lw  $t4, 4($t0)
    ori $t7, $t1, 2
    sub $t3, $t4, $t7
When the sub instruction is in the Decode stage:
• ori will be in the ALU stage
• lw will be in the MEM stage
• ForwardA = 2 (from MEM stage)
• ForwardB = 1 (from ALU stage)
[Figure: forwarding datapath with sub $t3, $t4, $t7 in the Decode stage, ori $t7, $t1, 2 in the ALU stage, and lw $t4, 4($t0) in the MEM stage]

RAW Hazard Detection
• Current instruction being decoded is in the Decode stage
  ◦ Previous instruction is in the Execute stage
  ◦ Second previous instruction is in the Memory stage
  ◦ Third previous instruction is in the Write Back stage

    If      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))  ForwardA ← 1
    Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA ← 2
    Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))  ForwardA ← 3
    Else                                                   ForwardA ← 0

    If      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))  ForwardB ← 1
    Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB ← 2
    Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))  ForwardB ← 3
    Else                                                   ForwardB ← 0
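
The priority logic above can be expressed as one function that is used for both ForwardA (with Rs) and ForwardB (with Rt). This is a minimal Python sketch; the `StageInfo` class and the `forward_select` name are hypothetical and only stand in for the Rd2/Rd3/Rd4 fields and RegWrite bits held in the pipeline registers.

```python
from dataclasses import dataclass


@dataclass
class StageInfo:
    rd: int          # destination register number carried in this stage
    reg_write: bool  # RegWrite control bit carried in this stage


def forward_select(src: int, ex: StageInfo, mem: StageInfo, wb: StageInfo) -> int:
    """Return the mux select (0 to 3) for a source register number.

    The most recent producer wins: EX before MEM before WB.
    """
    if src == 0:
        return 0                      # register $0 is never forwarded
    if ex.reg_write and src == ex.rd:
        return 1                      # result of the previous instruction
    if mem.reg_write and src == mem.rd:
        return 2                      # result of the 2nd previous instruction
    if wb.reg_write and src == wb.rd:
        return 3                      # result of the 3rd previous instruction
    return 0                          # read the register file


if __name__ == "__main__":
    T4, T7 = 12, 15                   # MIPS register numbers of $t4 and $t7
    # sub $t3, $t4, $t7 is in Decode; ori ($t7) is in EX; lw ($t4) is in MEM.
    ex, mem, wb = StageInfo(T7, True), StageInfo(T4, True), StageInfo(0, False)
    print("ForwardA =", forward_select(T4, ex, mem, wb))
    print("ForwardB =", forward_select(T7, ex, mem, wb))
```

The usage example mirrors the lw/ori/sub sequence from the Forwarding Example slide, where ForwardA = 2 and ForwardB = 1.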

Hazard Detect and Forward Logic
[Figure: forwarding datapath with a Hazard Detect and Forward unit; its inputs are Rs, Rt, the destination registers Rd2, Rd3, and Rd4, and the RegWrite signals of the EX, MEM, and WB stages; its outputs are ForwardA and ForwardB. Main & ALU Control is driven by Op and func]

Load Delay
• Unfortunately, not all data hazards can be forwarded
  ◦ Load has a delay that cannot be eliminated by forwarding
• In the example shown below …
  ◦ The LW instruction does not read data until the end of CC4
  ◦ Cannot forward data to ADD at the end of CC3 - NOT possible
• However, load can forward data to the 2nd next and later instructions
[Figure: program order against time (CC1 to CC8), with each instruction using IF, Reg, ALU, DM, Reg]
    lw  $s2, 20($t1)
    add $s4, $s2, $t5
    or  $t6, $t3, $s2
    and $t7, $s2, $t4

Detecting RAW Hazard after Load
• Detecting a RAW hazard after a Load instruction:
  ◦ The load instruction will be in the EX stage
  ◦ Instruction that depends on the load data is in the decode stage
• Condition for stalling the pipeline

    if ((EX.MemRead == 1)                             // Detect Load in EX stage
        and (ForwardA == 1 or ForwardB == 1)) Stall   // RAW Hazard

• Insert a bubble into the EX stage after a load instruction
  ◦ Bubble is a no-op that wastes one clock cycle
  ◦ Delays the dependent instruction after load by one cycle
    ▪ Because of RAW hazard

Stall the Pipeline for one Cycle
• ADD instruction depends on LW → stall at CC3
  ◦ Allow Load instruction in ALU stage to proceed
  ◦ Freeze PC and Instruction registers (NO instruction is fetched)
  ◦ Introduce a bubble into the ALU stage (bubble is a NO-OP)
• Load can forward data to next instruction after delaying it
[Figure: program order against time (CC1 to CC8); add is stalled in CC3 and a bubble enters the ALU stage, after which the loaded value is forwarded]
    lw  $s2, 20($s1)
    add $s4, $s2, $t5
    or  $t6, $s3, $s2
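
The load-use stall condition and its effect on the pipeline can be sketched in a few lines of Python. The function and key names (`must_stall`, `pc_write`, `ir_write`, `ex_bubble`) are hypothetical labels for the behavior described above, not signal names taken from the course datapath.

```python
def must_stall(ex_mem_read: bool, forward_a: int, forward_b: int) -> bool:
    """Stall when a load is in EX and the decoded instruction needs its result.

    forward_a / forward_b are the ForwardA / ForwardB values computed for the
    instruction in Decode; a value of 1 means "produced by the instruction in EX".
    """
    return ex_mem_read and (forward_a == 1 or forward_b == 1)


def stall_actions(stall: bool) -> dict:
    """What the control logic does in a cycle, depending on the stall decision."""
    return {
        "pc_write": not stall,   # freeze the PC during a stall
        "ir_write": not stall,   # freeze the Instruction register during a stall
        "ex_bubble": stall,      # send a no-op into the EX stage
    }


if __name__ == "__main__":
    # CC3: lw $s2 is in EX (MemRead = 1), add $s4, $s2, $t5 is in Decode.
    print(stall_actions(must_stall(True, forward_a=1, forward_b=0)))
    # CC4: the bubble is in EX (MemRead = 0), lw is in MEM, so ForwardA = 2.
    print(stall_actions(must_stall(False, forward_a=2, forward_b=0)))
```

After the single stall cycle the load is in the MEM stage, so the dependent instruction picks up the loaded value through the normal forwarding path.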

Showing Stall Cycles
• Stall cycles can be shown on the instruction-time diagram
• Hazard is detected in the Decode stage
• Stall indicates that the instruction is delayed
• Instruction fetching is also delayed after a stall
• Example: Data forwarding is shown using green arrows
[Figure: instruction-time diagram, CC1 to CC10; a Stall cycle appears between the IF and ID stages of an instruction that depends on the preceding load]
    lw  $s1, ($t5)
    lw  $s2, 8($s1)
    add $v0, $s2, $t3
    sub $v1, $s2, $v0

Hazard Detect, Forward, and Stall
[Figure: forwarding datapath with a Hazard Detect, Forward, & Stall unit; in addition to ForwardA and ForwardB it takes MemRead from the EX stage and produces a Stall signal that disables the PC and selects a bubble (all control signals = 0) through a multiplexer in front of the EX control register]

Code Scheduling to Avoid Stalls
• Compilers reorder code in a way to avoid load stalls
• Consider the translation of the following statements:
    A = B + C; D = E – F;    // A thru F are in Memory
• Slow code:
    lw  $t0, 4($s0)      # &B = 4($s0)
    lw  $t1, 8($s0)      # &C = 8($s0)
    add $t2, $t0, $t1    # stall cycle
    sw  $t2, 0($s0)      # &A = 0($s0)
    lw  $t3, 16($s0)     # &E = 16($s0)
    lw  $t4, 20($s0)     # &F = 20($s0)
    sub $t5, $t3, $t4    # stall cycle
    sw  $t5, 12($s0)     # &D = 12($s0)
• Fast code: No Stalls
    lw  $t0, 4($s0)
    lw  $t1, 8($s0)
    lw  $t3, 16($s0)
    lw  $t4, 20($s0)
    add $t2, $t0, $t1
    sw  $t2, 0($s0)
    sub $t5, $t3, $t4
    sw  $t5, 12($s0)
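
The effect of scheduling can be checked by counting load-use stalls in both versions. The sketch below is illustrative only: the `Instr` tuple and the helper names are assumptions, and the model charges one stall cycle whenever an instruction reads the register loaded by the instruction immediately before it, which matches the 5-stage pipeline with forwarding described earlier.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]   # register written, or None for a store
    sources: tuple        # registers read


def count_load_stalls(program: list) -> int:
    """One stall per instruction that uses the result of the preceding lw."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.sources:
            stalls += 1
    return stalls


def total_cycles(program: list, stages: int = 5) -> int:
    """Pipeline fill plus one cycle per instruction plus load stalls."""
    return stages + len(program) - 1 + count_load_stalls(program)


SLOW = [
    Instr("lw", "$t0", ("$s0",)), Instr("lw", "$t1", ("$s0",)),
    Instr("add", "$t2", ("$t0", "$t1")), Instr("sw", None, ("$t2", "$s0")),
    Instr("lw", "$t3", ("$s0",)), Instr("lw", "$t4", ("$s0",)),
    Instr("sub", "$t5", ("$t3", "$t4")), Instr("sw", None, ("$t5", "$s0")),
]

FAST = [
    Instr("lw", "$t0", ("$s0",)), Instr("lw", "$t1", ("$s0",)),
    Instr("lw", "$t3", ("$s0",)), Instr("lw", "$t4", ("$s0",)),
    Instr("add", "$t2", ("$t0", "$t1")), Instr("sw", None, ("$t2", "$s0")),
    Instr("sub", "$t5", ("$t3", "$t4")), Instr("sw", None, ("$t5", "$s0")),
]

if __name__ == "__main__":
    print("slow:", count_load_stalls(SLOW), "stalls,", total_cycles(SLOW), "cycles")
    print("fast:", count_load_stalls(FAST), "stalls,", total_cycles(FAST), "cycles")
```

Under this model the slow version needs two extra cycles, one for each arithmetic instruction that immediately follows the load of one of its operands.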

Name Dependence: Write After Read
• Instruction J should write its result after it is read by I
• Called anti-dependence by compiler writers
    I: sub $t4, $t1, $t3    # $t1 is read
    J: add $t1, $t2, $t3    # $t1 is written
• Results from reuse of the name $t1
• NOT a data hazard in the 5-stage pipeline because:
  ◦ Reads are always in stage 2
  ◦ Writes are always in stage 5, and
  ◦ Instructions are processed in order
• Anti-dependence can be eliminated by renaming
  ◦ Use a different destination register for add (e.g., $t5)

Name Dependence: Write After Write
• Same destination register is written by two instructions
• Called output-dependence in compiler terminology
    I: sub $t1, $t4, $t3    # $t1 is written
    J: add $t1, $t2, $t3    # $t1 is written again
• Not a data hazard in the 5-stage pipeline because:
  ◦ All writes are ordered and always take place in stage 5
• However, can be a hazard in more complex pipelines
  ◦ If instructions are allowed to complete out of order, and
  ◦ Instruction J completes and writes $t1 before instruction I
• Output dependence can be eliminated by renaming $t1
• Read After Read is NOT a name dependence

Control Hazards
• Jump and Branch can cause great performance loss
• Jump instruction needs only the jump target address
• Branch instruction needs two things:
  ◦ Branch Result: Taken or Not Taken
  ◦ Branch Target Address
    ▪ PC + 4 if branch is NOT taken
    ▪ PC + 4 + 4 × immediate if branch is taken
• Jump and Branch targets are computed in the ID stage
  ◦ At which point a new instruction is already being fetched
  ◦ Jump Instruction: 1-cycle delay
  ◦ Branch: 2-cycle delay for branch result (taken or not taken)

2-Cycle Branch Delay
• Control logic detects a Branch instruction in the 2nd Stage
• ALU computes the Branch outcome in the 3rd Stage
• Next1 and Next2 instructions will be fetched anyway
• Convert Next1 and Next2 into bubbles if branch is taken
[Figure: instruction-time diagram, cc1 to cc7; Next1 and Next2 are fetched after the branch and turned into bubbles, then the target instruction at L1 is fetched once the branch target address is known]
    Beq $t1, $t2, L1
    Next1
    Next2
    L1: target instruction

Implementing Jump and Branch
[Figure: pipelined datapath with Next PC logic in the EX stage; the jump or branch target feeds the PC through a multiplexer controlled by PCSrc, and a multiplexer selects a bubble (all control signals = 0) for the instructions that follow a jump or taken branch]
• Branch target & outcome are computed in the ALU stage (using J, Beq, Bne, and zero)
• Branch Delay = 2 cycles

Predict Branch NOT Taken
• Branches can be predicted to be NOT taken
• If branch outcome is NOT taken then
  ◦ Next1 and Next2 instructions can be executed
  ◦ Do not convert Next1 & Next2 into bubbles
  ◦ No wasted cycles
[Figure: instruction-time diagram, cc1 to cc7; the branch is NOT taken, so Next1 and Next2 continue through IF, Reg, ALU, DM, Reg without bubbles]
    Beq $t1, $t2, L1
    Next1
    Next2

Reducing the Delay of Branches
• Branch delay can be reduced from 2 cycles to just 1 cycle
• Branches can be determined earlier in the Decode stage
  ◦ A comparator is used in the decode stage to determine the branch decision, whether the branch is taken or not
  ◦ Because forwarded data must be compared, the delay of the second stage increases, and this also lengthens the clock cycle
• Only one instruction that follows the branch is fetched
• If the branch is taken then only one instruction is flushed
• We should insert a bubble after a jump or taken branch
  ◦ This will convert the next instruction into a NOP
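
The benefit of a shorter branch delay can be estimated with a simple CPI model. The formula and the workload numbers below are assumptions added for illustration, not figures from the course: the model assumes predict-not-taken, charges the penalty only on taken branches, and ignores jumps and load stalls.

```python
def cpi_with_branch_penalty(branch_fraction: float,
                            taken_fraction: float,
                            taken_penalty_cycles: int) -> float:
    """Average CPI = 1 + (fraction of taken branches) x (cycles lost per taken branch)."""
    return 1.0 + branch_fraction * taken_fraction * taken_penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% of instructions are branches, 60% of them taken.
    for penalty in (2, 1):
        cpi = cpi_with_branch_penalty(0.20, 0.60, penalty)
        print(f"{penalty}-cycle branch delay: CPI = {cpi:.2f}")
```

For this hypothetical workload, cutting the delay from 2 cycles to 1 lowers the CPI from about 1.24 to about 1.12, which has to be weighed against the longer clock cycle mentioned above.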

Reducing Branch Delay to 1 Cycle
[Figure: pipelined datapath with a comparator (=) and the Next PC logic moved into the Decode stage; the register values are forwarded and then compared to produce the Zero signal, and a Reset signal is applied to the Instruction register]
• Data is forwarded and then compared, which results in a longer cycle
• Reset signal converts the next instruction after a jump or taken branch into a bubble

Branch Hazard Alternatives
• Predict Branch Not Taken (previously discussed)
  ◦ Successor instruction is already fetched
  ◦ Do NOT flush instruction after branch if branch is NOT taken
  ◦ Flush only instructions appearing after Jump or taken branch
• Delayed Branch
  ◦ Define branch to take place AFTER the next instruction
  ◦ Compiler/assembler fills the branch delay slot (for 1 delay cycle)
• Dynamic Branch Prediction
  ◦ Loop branches are taken most of the time
  ◦ Must reduce branch delay to 0, but how?
  ◦ How to predict branch behavior at runtime?

Delayed Branch
• Define branch to take place after the next instruction
• For a 1-cycle branch delay, we have one delay slot
    branch instruction
    branch delay slot (next instruction)
    . . .
    branch target (if branch taken)
• Compiler fills the branch delay slot
  ◦ By selecting an independent instruction
  ◦ From before the branch
• If no independent instruction is found
  ◦ Compiler fills delay slot with a NO-OP
• Before filling the delay slot:
    label: . . .
           add $t2, $t3, $t4
           beq $s1, $s0, label
           (Delay Slot)
• After filling the delay slot:
    label: . . .
           beq $s1, $s0, label
           add $t2, $t3, $t4

Drawback of Delayed Branching
• New meaning for branch instruction
  ◦ Branching takes place after next instruction (Not immediately!)
• Impacts software and compiler
  ◦ Compiler is responsible for filling the branch delay slot
  ◦ For a 1-cycle branch delay → one branch delay slot
• However, modern processors are deeply pipelined
  ◦ Branch penalty is multiple cycles in deeper pipelines
  ◦ Multiple delay slots are difficult to fill with useful instructions
• MIPS used delayed branching in earlier pipelines
  ◦ However, delayed branching is not useful in recent processors

Zero-Delayed Branching
- How to achieve zero delay for a jump or a taken branch?
  - Jump or branch target address is computed in the ID stage
  - Next instruction has already been fetched in the IF stage
Solution
- Introduce a Branch Target Buffer (BTB) in the IF stage
  - Store the target address of recent branch and jump instructions
- Use the lower bits of the PC to index the BTB
  - Each BTB entry stores Branch/Jump address & Target Address
  - Check the PC to see if the instruction being fetched is a branch
  - Update the PC using the target address stored in the BTB

Branch Target Buffer
- The branch target buffer is implemented as a small cache
  - Stores the target address of recent branches and jumps
- We must also have prediction bits
  - To predict whether branches are taken or not taken
  - The prediction bits are dynamically determined by the hardware

[Figure: Branch Target & Prediction Buffer. The low-order bits of the PC are used as the index. Each entry holds the address of a recent branch, its target address, and prediction bits. The stored branch address is compared (=) with the PC to produce predict_taken, and a mux selects between the incremented PC (Inc) and the target address.]
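A minimal Python sketch of such a buffer follows. It assumes a direct-mapped table of 16 entries, 4-byte word-aligned instructions, and a single prediction bit per entry; the class and method names are hypothetical and chosen only for this illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch with one prediction bit per entry."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        # Drop the 2-bit byte offset (assumes 4-byte instructions),
        # then keep the low-order bits as the table index.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def next_pc(self, pc):
        """Return the predicted next PC for the instruction fetched at `pc`."""
        entry = self.entries[self._index(pc)]
        hit = entry is not None and entry["branch_pc"] == pc
        if hit and entry["predict_taken"]:
            return entry["target"]
        return pc + 4  # no matching entry, or predicted not taken

    def update(self, pc, target, taken):
        """Record the branch at `pc`, its target, and its last outcome."""
        self.entries[self._index(pc)] = {
            "branch_pc": pc,
            "target": target,
            "predict_taken": taken,
        }


btb = BranchTargetBuffer()
btb.update(0x400010, 0x400000, taken=True)
print(hex(btb.next_pc(0x400010)))
```

Storing the full branch address in each entry is what lets the buffer tell a real hit apart from a different instruction that happens to share the same low-order PC bits.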

Dynamic Branch Prediction
- Prediction of branches at runtime using prediction bits
- Prediction bits are associated with each entry in the BTB
  - Prediction bits reflect the recent history of a branch instruction
- Typically, few prediction bits (1 or 2) are used per entry
- We don't know whether a prediction is correct until the branch is resolved
- If the prediction is correct
  - Continue normal execution (no wasted cycles)
- If the prediction is incorrect (misprediction)
  - Flush the instructions that were incorrectly fetched (wasted cycles)
  - Update prediction bits and target address for future use

Dynamic Branch Prediction (Cont'd)

[Flowchart spanning the IF, ID, and EX stages]
- IF: Use the PC to address the Instruction Memory and the Branch Target Buffer
  - Found BTB entry with predict taken?
    - Yes: PC = target address
    - No: Increment PC
- If no BTB entry with predict taken was found: jump or taken branch?
  - No: Normal execution
  - Yes: Mispredicted jump/branch
    - Enter jump/branch address, target address, and set prediction in BTB entry
    - Flush fetched instructions
    - Restart PC at target address
- If a BTB entry with predict taken was found: jump or taken branch?
  - Yes: Correct prediction, no stall cycles
  - No: Mispredicted branch (branch not taken)
    - Update prediction bits
    - Flush fetched instructions
    - Restart PC after branch
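The four outcomes of the flowchart can be written as a small decision function. This Python sketch is only an illustration: the function name, the returned tuple, and the use of `pc + 4` for "after branch" (4-byte instructions) are assumptions.

```python
def resolve_prediction(pc, target, predicted_taken, actually_taken):
    """Return (flush, restart_pc, update_btb) once the outcome is known.

    predicted_taken: a BTB entry with predict taken was found in IF.
    actually_taken:  the instruction is a jump or a taken branch.
    """
    if predicted_taken == actually_taken:
        # Correct prediction, or normal execution of a non-taken instruction:
        # nothing is flushed and no cycles are wasted.
        return (False, None, False)
    if actually_taken:
        # Mispredicted jump/taken branch: enter or update the BTB entry,
        # flush fetched instructions, restart at the target address.
        return (True, target, True)
    # Predicted taken but the branch was not taken: update prediction bits,
    # flush fetched instructions, restart after the branch.
    return (True, pc + 4, True)


print(resolve_prediction(0x400010, 0x400000, predicted_taken=False, actually_taken=True))
```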

1-bit Prediction Scheme
- Prediction is just a hint that is assumed to be correct
- If incorrect, then the fetched instructions are flushed
- 1-bit prediction scheme is the simplest to implement
  - 1 bit per branch instruction (associated with BTB entry)
  - Record the last outcome of a branch instruction (Taken/Not taken)
  - Use the last outcome to predict the future behavior of a branch

[State diagram: two states, Predict Not Taken and Predict Taken. A Taken outcome moves to (or stays in) Predict Taken; a Not Taken outcome moves to (or stays in) Predict Not Taken.]

1-Bit Predictor: Shortcoming
- Inner loop branch mispredicted twice!
  - Mispredict as taken on the last iteration of the inner loop
  - Then mispredict as not taken on the first iteration of the inner loop next time around

      outer: …
             …
      inner: …
             …
             bne …, …, inner
             …
             bne …, …, outer
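The double misprediction can be checked with a short Python simulation of the inner-loop branch. The helper names and the initial prediction (not taken) are assumptions for this sketch; the trace models an inner loop whose branch is taken on every iteration except the last.

```python
def inner_branch_outcomes(inner_iterations, outer_iterations):
    """Outcome trace of the inner-loop branch: taken until the loop exits."""
    one_pass = [True] * (inner_iterations - 1) + [False]
    return one_pass * outer_iterations


def count_mispredictions_1bit(outcomes, predict_taken=False):
    """1-bit scheme: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken  # record the last outcome
    return misses


trace = inner_branch_outcomes(inner_iterations=4, outer_iterations=3)
print(count_mispredictions_1bit(trace))  # prints 6: two per pass of the inner loop
```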

2-bit Prediction Scheme
- 1-bit prediction scheme has a performance shortcoming
- 2-bit prediction scheme works better and is often used
  - 4 states: strong and weak predict taken / predict not taken
- Implemented as a saturating counter
  - Counter is incremented (saturating at max = 3) when the branch outcome is taken
  - Counter is decremented (saturating at min = 0) when the branch is not taken

[State diagram: four states in order, Strong Predict Not Taken, Weak Predict Not Taken, Weak Predict Taken, Strong Predict Taken. A Taken outcome moves one state toward Strong Predict Taken; a Not Taken outcome moves one state toward Strong Predict Not Taken.]
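A saturating counter of this kind can be sketched in a few lines of Python. The function name, the state encoding (0 = strong not taken through 3 = strong taken), the rule "predict taken when the counter is 2 or 3", and the initial counter value are assumptions for this illustration.

```python
def count_mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        if taken:
            counter = min(counter + 1, 3)  # saturate at max = 3
        else:
            counter = max(counter - 1, 0)  # saturate at min = 0
    return misses


# Inner-loop branch: taken 3 times, then not taken; repeated for 3 outer iterations.
trace = ([True] * 3 + [False]) * 3
print(count_mispredictions_2bit(trace))  # prints 3: one per pass of the inner loop
```

On this trace, the single not-taken outcome at loop exit only moves the counter from strong to weak predict taken, so the first iteration of the next pass is still predicted correctly.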

Fallacies and Pitfalls
- Pipelining is easy!
  - The basic idea is easy
  - The devil is in the details
    - Detecting data hazards and stalling the pipeline
- Poor ISA design can make pipelining harder
  - Complex instruction sets (Intel IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - Complex addressing modes
    - Register update side effects, memory indirection

Pipeline Hazards Summary
- Three types of pipeline hazards
  - Structural hazards: conflicts using a resource during the same cycle
  - Data hazards: due to data dependencies between instructions
  - Control hazards: due to branch and jump instructions
- Hazards limit the performance and complicate the design
  - Structural hazards: eliminated by careful design or more hardware
  - Data hazards are eliminated by forwarding
  - However, load delay cannot be eliminated and stalls the pipeline
  - Delayed branching can be a solution when branch delay = 1 cycle
  - BTB with branch prediction can reduce branch delay to zero
  - Branch misprediction should flush the wrongly fetched instructions