Reduction of Data Hazards Stalls with Dynamic Scheduling

Reduction of Data Hazards Stalls with Dynamic Scheduling

(Current pipeline, i.e. in-order, single issue with FP support)

• So far we have dealt with data hazards in instruction pipelines by:
  – Result forwarding (register bypassing) to reduce or eliminate stalls needed to prevent RAW hazards as a result of true data dependence.
  – Hazard detection hardware to stall the pipeline starting with the instruction that uses the result (i.e. forward + stall, if needed).
  – Compiler-based static pipeline scheduling to separate the dependent instructions, minimizing actual hazard-prevention stalls in scheduled code.
    • Loop unrolling to increase basic block size: more ILP exposed.
• Dynamic scheduling (out-of-order execution, i.e. start of instruction execution is not in program order):
  – Uses a hardware-based mechanism to reorder or rearrange instruction execution order to reduce stalls dynamically at runtime. Why?
    • Better dynamic exploitation of instruction-level parallelism (ILP).
  – Enables handling some cases where instruction dependencies are unknown at compile time (ambiguous dependencies).
  – Similar to the other pipeline optimizations above, a dynamically scheduled processor cannot remove true data dependencies, but tries to avoid or reduce stalling.

Fourth Edition: Appendix A.7, Chapter 2.4, 2.5 (Third Edition: Appendix A.8, Chapter 3.2, 3.3)
EECC 551 - Shaaban

Dynamic Pipeline Scheduling: The Concept (Out-of-order execution)

(i.e. start of instruction execution is not in program order)

• Dynamic pipeline scheduling overcomes the limitations of in-order pipelined execution by allowing out-of-order instruction execution.
• Instructions are allowed to start executing out-of-order as soon as their operands are available.
• Better dynamic exploitation of instruction-level parallelism (ILP).

Example (program order, with dependency graph):

  1  DIV.D F0, F2, F4
  2  ADD.D F10, F0, F8      (true data dependency on DIV.D through F0)
  3  SUB.D F12, F8, F14     (does not depend on DIV.D or ADD.D)

  – In the case of in-order pipelined execution, SUB.D must wait for DIV.D to complete (which stalled ADD.D) before starting execution.
  – In out-of-order execution, SUB.D can start as soon as the values of its operands F8, F14 are available.

• This implies allowing out-of-order instruction commit (completion).
• May lead to imprecise exceptions if an instruction issued earlier raises an exception.
  – This is similar to pipelines with multi-cycle floating point units.

Order = program instruction order.
In Fourth Edition: Appendix A.7, Chapter 2.4 (In Third Edition: Appendix A.8, Chapter 3.2)
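
The effect on the three-instruction example can be estimated with a toy model. The sketch below is only illustrative: the latencies (40 cycles for DIV.D, 2 for ADD.D and SUB.D), the one-instruction-per-cycle issue rate, and the function name start_cycles are assumptions made here, and structural hazards are ignored.

```python
# Toy model (assumptions: one instruction issued per cycle, no structural
# hazards, latencies chosen for illustration only).
LATENCY = {"DIV.D": 40, "ADD.D": 2, "SUB.D": 2}

PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F10", ("F0", "F8")),
    ("SUB.D", "F12", ("F8", "F14")),
]


def start_cycles(program, in_order):
    """Return the cycle in which each instruction starts executing."""
    ready = {}   # register -> first cycle in which its new value can be read
    starts = []
    for position, (op, dest, srcs) in enumerate(program):
        issue_cycle = position + 1
        # An instruction cannot start before all of its operands are ready.
        start = max([issue_cycle] + [ready.get(reg, 0) for reg in srcs])
        if in_order and starts:
            # In-order start: also wait until the previous instruction started.
            start = max(start, starts[-1] + 1)
        starts.append(start)
        ready[dest] = start + LATENCY[op]
    return starts


print(start_cycles(PROGRAM, in_order=True))   # in-order start
print(start_cycles(PROGRAM, in_order=False))  # out-of-order start
```

Under these assumptions SUB.D starts in cycle 3 when out-of-order start is allowed, instead of cycle 42 behind the stalled ADD.D.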

Dynamic Pipeline Scheduling

• Dynamic instruction scheduling is accomplished by:
  – Dividing the Instruction Decode (ID) stage into two stages:
    • Issue (always done in program order): Decode instructions, check for structural hazards.
      – A record of data dependencies is constructed as instructions are issued.
      – This creates a dynamically-constructed dependency graph for the window of instructions in-flight (being processed) in the CPU.
    • Read operands (can be done out of program order): Wait until data hazard conditions, if any, are resolved, then read operands when available (then start execution).
    (All instructions pass through the issue stage in order but can be stalled or pass each other in the read operands stage.)
  – In the instruction fetch stage (IF), fetch an additional instruction every cycle into a latch or several instructions into an instruction queue.
  – Increase the number of functional units to meet the demands of the additional instructions in their EX stage.
• Two approaches to dynamic scheduling:
  1 – Dynamic scheduling with the Scoreboard, used first in the CDC 6600 (Control Data Corp., 1963).
  2 – The Tomasulo approach, pioneered by the IBM 360/91 (1966).

The CDC 6600 is the world's first "Supercomputer". Cost: $7 million in 1963.
Fourth Edition: Appendix A.7, Chapter 2.4 (Third Edition: Appendix A.8, Chapter 3.2)

Dynamic Scheduling With A Scoreboard

• The scoreboard is a centralized hardware mechanism that maintains an execution rate of one instruction per cycle by executing an instruction as soon as its operands are available in registers and no hazard conditions prevent it.
  – e.g. forming a single-issue out-of-order pipeline.
• It replaces ID, EX, WB with four stages: ID1, ID2, EX, WB (EX includes MEM). Instruction Fetch (IF) is not changed.
• Every instruction goes through the scoreboard, where a record of data dependencies is constructed (corresponds to instruction issue, in ID1).
  – In effect, the hardware dynamically constructs the dependency graph for a window of instructions as they are issued one at a time in program order.
• A system with a scoreboard is assumed to have several functional units with their status information reported to the scoreboard.
• If the scoreboard determines that an instruction cannot execute immediately, it executes another waiting instruction and keeps monitoring hardware unit status to decide when the instruction can proceed to execute.
• The scoreboard also decides when an instruction can write its results to registers (hazard detection and resolution is centralized in the scoreboard).

Order = program instruction order.
Introduced in the CDC 6600 (1963).
In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)

The basic structure of a MIPS processor with a scoreboard

[Figure: the basic structure of a MIPS processor with a scoreboard. Labels: FP register write port/bus, FP register read ports/buses, integer register write port/bus.]

• No forwarding supported in scoreboard.
• FP units are not pipelined, similar to CDC 6600.

In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)

Instruction Execution Stages with A Scoreboard

Stage 0, Instruction Fetch (IF): No changes, in-order.

1. Issue (ID1), always done in program order: An instruction is issued if:
   • A functional unit for the instruction is available (no structural hazard).
   • The instruction result destination register is not marked for writing by an earlier active instruction (no WAW hazard, i.e. no output dependence).
   If the above conditions are satisfied, the scoreboard issues the instruction to a functional unit and updates its internal data structures. As indicated by the instruction issue requirements, structural and WAW hazards are resolved here by stalling the instruction issue. (This stage replaces part of the ID stage in the conventional MIPS pipeline.)

The following stages can be done out of program order:

2. Read operands (ID2): The scoreboard monitors the availability of the source operands. A source operand is available when no earlier active instruction will write it. When all source operands are available, the scoreboard tells the functional unit to read all operands from the registers at once (no forwarding supported) and start execution (RAW hazards resolved here dynamically). This completes ID.
3. Execution (EX): The functional unit starts execution upon receiving operands from registers (no forwarding). When the results are ready it notifies the scoreboard (replaces EX, MEM in MIPS).
4. Write result (WB): Once the scoreboard senses that a functional unit completed execution, it checks for WAR hazards and stalls the completing instruction if needed; otherwise the write back is completed. The functional unit assigned to the instruction is marked as available (not busy) after WB is completed.

In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)

Three Parts of the Scoreboard

1 Instruction status: Which of 4 steps the instruction is in.
2 Functional unit status: Indicates the state of the functional unit (FU). Nine fields for each functional unit:
  – Busy      Indicates whether the unit is busy or not
  – Op        Operation to perform in the unit (e.g., + or –)
  – Fi        Destination register
  – Fj, Fk    Source-register numbers
  – Qj, Qk    Functional units producing source (operand) registers Fj, Fk
  – Rj, Rk    Flags indicating when Fj, Fk are ready (set to Yes after operand is available to read). Yes or = 1 means ready. Both operands are read at once from registers, i.e. when both Rj, Rk are set to Yes (both operands are ready).
3 Register result status: Indicates which functional unit will write to each register, if one exists. Blank when no pending instructions will write that register.

In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)
EECC 551 - Shaaban, lec # 4, Winter 2007, 12-17-2007
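
The functional unit status and register result status tables can be pictured as two small data structures. The sketch below is a minimal illustration under stated assumptions: the class name FUStatus, the function try_issue, and the use of None for a blank field are choices made here, not part of the slides. It performs only the issue-time bookkeeping (structural and WAW checks, then filling the nine fields).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table (the nine fields)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj (None: no pending producer)
    qk: Optional[str] = None   # unit producing Fk
    rj: bool = False           # Fj ready to be read?
    rk: bool = False           # Fk ready to be read?


def try_issue(unit, op, dest, src1, src2, units, reg_result):
    """Issue to `unit` unless there is a structural or WAW hazard."""
    if units[unit].busy or dest in reg_result:
        return False           # stall issue
    qj = reg_result.get(src1)  # pending producer of each source, if any
    qk = reg_result.get(src2)
    units[unit] = FUStatus(True, op, dest, src1, src2, qj, qk,
                           src1 is not None and qj is None,
                           src2 is not None and qk is None)
    reg_result[dest] = unit    # register result status
    return True


units = {name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
reg_result = {}
try_issue("Integer", "Load", "F2", None, "R3", units, reg_result)
try_issue("Mult1", "Mult", "F0", "F2", "F4", units, reg_result)
print(units["Mult1"])
print(reg_result)
```

After these two calls the Mult1 row records that F2 will come from the Integer unit (Qj = Integer, Rj not ready) while F4 is ready, which is the kind of record the scoreboard builds at issue.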

The Scoreboard: Detailed Pipeline Control

[Table: detailed scoreboard pipeline control; contents not recoverable from the extraction.]

In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)

A Scoreboard Example

The following code is run on the MIPS with a scoreboard given earlier with:

  Functional Unit (FU)       # of FUs   EX cycles
  Integer                    1          1
  Floating Point Multiply    2          10
  Floating Point add         1          2
  Floating point Divide      1          40

None of the functional units are pipelined (similar to CDC 6600).

  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Dependence types marked in the code: Real Data Dependence (RAW), Anti-dependence (WAR), Output Dependence (WAW).

In Fourth Edition: Appendix A.7 (In Third Edition: Appendix A.8)

Dependency Graph For Example Code

  1  L.D   F6, 34(R2)
  2  L.D   F2, 45(R3)
  3  MUL.D F0, F2, F4
  4  SUB.D F8, F6, F2
  5  DIV.D F10, F0, F6
  6  ADD.D F6, F8, F2

Data Dependence: (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output Dependence: (1, 6)
Anti-dependence: (5, 6)

[Figure: dependency graph over instructions 1 to 6, with edges marked as Real Data Dependence (RAW), Anti-dependence (WAR), or Output Dependence (WAW).]
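
The dependence lists above can be derived mechanically from the destination and source registers of each instruction. The following sketch is one possible way to do it; the tuple encoding of the program and the function name find_dependences are assumptions made here. Note that it also finds a WAR relation on F6 between instructions 4 and 6; the final print drops it on the assumption that the slide omits it because the RAW dependence (4, 6) already orders that pair.

```python
# Each entry: (opcode, destination register, source registers).
PROGRAM = [
    ("L.D",   "F6",  ("R2",)),
    ("L.D",   "F2",  ("R3",)),
    ("MUL.D", "F0",  ("F2", "F4")),
    ("SUB.D", "F8",  ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6",  ("F8", "F2")),
]


def find_dependences(program):
    """Return (raw, war, waw) as sorted lists of 1-based (earlier, later) pairs."""
    raw, war, waw = [], [], []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            # Destinations written strictly between instructions i and j.
            between = {program[k][1] for k in range(i + 1, j)}
            pair = (i + 1, j + 1)
            if dest_i in srcs_j and dest_i not in between:
                raw.append(pair)   # j reads what i wrote
            if dest_j in srcs_i and dest_j not in between:
                war.append(pair)   # j overwrites what i reads
            if dest_i == dest_j and dest_i not in between:
                waw.append(pair)   # j overwrites what i wrote
    return sorted(raw), sorted(war), sorted(waw)


raw, war, waw = find_dependences(PROGRAM)
print("Data dependence (RAW):", raw)
print("Output dependence (WAW):", waw)
print("Anti-dependence (WAR) not already ordered by a RAW pair:",
      [pair for pair in war if pair not in raw])
```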

Scoreboard Example: Cycle 1

FP EX Cycles: Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 1, which means at end of cycle 1): F6 = Integer; all other registers blank.

Scoreboard Example: Cycle 2

FP EX Cycles: Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 2): F6 = Integer; all other registers blank.

• Issue second L.D? No, stall on structural hazard. Single integer functional unit is busy.

Scoreboard Example: Cycle 3

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 3): F6 = Integer; all other registers blank.

• EX, MEM for L.D in one cycle.
• Issue MUL.D? No, cannot issue out of order (second L.D not issued yet).

Scoreboard Example: Cycle 4

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status:
  Time  Name     Busy
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 4): all registers blank.

• The Integer unit is actually free at the end of this cycle 4 (available for instruction issue next cycle).

Scoreboard Example: Cycle 5

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 5): F2 = Integer; all other registers blank.

• Issue second L.D

Scoreboard Example: Cycle 6

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6
  MUL.D F0, F2, F4      6
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 6): F0 = Mult1, F2 = Integer; all other registers blank.

• Issue MUL.D

Scoreboard Example: Cycle 7

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7
  MUL.D F0, F2, F4      6
  SUB.D F8, F6, F2      7
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
  -     Divide   No

Register result status (Clock 7): F0 = Mult1, F2 = Integer, F8 = Add; all other registers blank.

• Issue SUB.D
• Read multiply operands?

Scoreboard Example: Cycle 8a (First half of cycle 8)

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7
  MUL.D F0, F2, F4      6
  SUB.D F8, F6, F2      7
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 8): F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide; all other registers blank.

• Issue DIV.D

Scoreboard Example: Cycle 8b (Second half of cycle 8)

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6
  SUB.D F8, F6, F2      7
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 8): F0 = Mult1, F8 = Add, F10 = Divide; all other registers blank.

• Second L.D writes result to F2

Scoreboard Example: Cycle 9

FP EX Cycles: Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9
  SUB.D F8, F6, F2      7      9
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2

Functional unit status (Time = execution cycles remaining, execution actually starts next cycle; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  10    Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  2     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 9): F0 = Mult1, F8 = Add, F10 = Divide; all other registers blank.

• Read operands for MUL.D & SUB.D
• Issue ADD.D?

Scoreboard Example: Cycle 11

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9
  SUB.D F8, F6, F2      7      9              11
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  8     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  0     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes    (done executing)
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 11): F0 = Mult1, F8 = Add, F10 = Divide; all other registers blank.

• Issue ADD.D?

Scoreboard Example: Cycle 12

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  7     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  -     Add      No                                                       (actually free at end of this cycle 12)
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 12): F0 = Mult1, F10 = Divide; all other registers blank.

• Read operands for DIV.D?
• Issue ADD.D?

Scoreboard Example: Cycle 13

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2      13

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  6     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 13): F0 = Mult1, F6 = Add, F10 = Divide; all other registers blank.

• Issue ADD.D, Add FP unit available at end of cycle 12 (start of 13)

Scoreboard Example: Cycle 17

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2      13     14             16

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  2     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (Clock 17): F0 = Mult1, F6 = Add, F10 = Divide; all other registers blank.

• Write result of ADD.D? No, WAR hazard (DIV.D did not read any operands including F6)

Scoreboard Example: Cycle 20

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9              19                  20
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8
  ADD.D F6, F8, F2      13     14             16

Functional unit status ("-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status (Clock 20): F6 = Add, F10 = Divide; all other registers blank.

• Read operands for DIV.D?

Scoreboard Example: Cycle 21

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9              19                  20
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8      21
  ADD.D F6, F8, F2      13     14             16

Functional unit status (Time = execution cycles remaining, execution actually starts next cycle; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  40    Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status (Clock 21): F6 = Add, F10 = Divide; all other registers blank.

• DIV.D reads operands, starts execution next cycle
• Write result of ADD.D?

Scoreboard Example: Cycle 22

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9              19                  20
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8      21
  ADD.D F6, F8, F2      13     14             16                  22

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  39    Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status (Clock 22): F10 = Divide; all other registers blank.

• First cycle of DIV.D execution (39 more EX cycles)
• ADD.D writes result in F6 (no WAR, DIV.D read operands in cycle 21)

Scoreboard Example: Cycle 61

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9              19                  20
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8      21             61
  ADD.D F6, F8, F2      13     14             16                  22

Functional unit status (Time = execution cycles remaining; "-" marks a blank field):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  0     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes    (done executing)

Register result status (Clock 61): F10 = Divide; all other registers blank.

Scoreboard Example: Cycle 62

Instruction status:
  Instruction           Issue  Read operands  Execution complete  Write result
  L.D   F6, 34(R2)      1      2              3                   4
  L.D   F2, 45(R3)      5      6              7                   8
  MUL.D F0, F2, F4      6      9              19                  20
  SUB.D F8, F6, F2      7      9              11                  12
  DIV.D F10, F0, F6     8      21             61                  62
  ADD.D F6, F8, F2      13     14             16                  22

Functional unit status:
  Time  Name     Busy
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No

Register result status (Clock 62): all registers blank.

Instruction block done.

• We have:
  • In-order issue,
  • Out-of-order execution, completion
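
The whole walk-through can be reproduced with a short simulation. The sketch below is a simplified model, not the scoreboard hardware: the names Instr and simulate_scoreboard are made up here, and the timing convention (an issue, a register write, or a freed functional unit becomes visible to other instructions only in the following cycle) is an assumption inferred from the cycle tables of the example.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "text unit dest srcs")

PROGRAM = [
    Instr("L.D   F6, 34(R2)",  "Integer", "F6",  ("R2",)),
    Instr("L.D   F2, 45(R3)",  "Integer", "F2",  ("R3",)),
    Instr("MUL.D F0, F2, F4",  "Mult",    "F0",  ("F2", "F4")),
    Instr("SUB.D F8, F6, F2",  "Add",     "F8",  ("F6", "F2")),
    Instr("DIV.D F10, F0, F6", "Divide",  "F10", ("F0", "F6")),
    Instr("ADD.D F6, F8, F2",  "Add",     "F6",  ("F8", "F2")),
]
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}       # number of FUs
EX_CYCLES = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}


def simulate_scoreboard(program, units, ex_cycles):
    """Return (issue, read, exec complete, write) cycles per instruction."""
    n = len(program)
    issue, read, done, write = ([None] * n for _ in range(4))
    slot = [None] * n            # (unit kind, copy) held by each instruction

    def before(t, cycle):
        """True if the event stamped t happened in an earlier cycle."""
        return t is not None and t < cycle

    # For every source operand: the most recent earlier instruction writing it.
    producers = [
        [max((k for k in range(i) if program[k].dest == s), default=None)
         for s in ins.srcs]
        for i, ins in enumerate(program)
    ]

    next_issue, cycle = 0, 0
    while any(w is None for w in write):
        cycle += 1
        for i, ins in enumerate(program[:next_issue]):
            if read[i] is None:
                # Read operands: all producers wrote back in an earlier cycle.
                if before(issue[i], cycle) and all(
                        p is None or before(write[p], cycle)
                        for p in producers[i]):
                    read[i] = cycle
                    done[i] = cycle + ex_cycles[ins.unit]
            elif write[i] is None and before(done[i], cycle):
                # Write result: stall on WAR while an earlier instruction
                # still has to read the old value of our destination.
                war = any(ins.dest in program[j].srcs
                          and not before(read[j], cycle) for j in range(i))
                if not war:
                    write[i] = cycle
        if next_issue < n:
            # Issue in program order: need a free unit and no WAW hazard.
            ins = program[next_issue]
            active = [k for k in range(next_issue)
                      if not before(write[k], cycle)]
            busy = {slot[k] for k in active}
            free = [(ins.unit, c) for c in range(units[ins.unit])
                    if (ins.unit, c) not in busy]
            waw = any(program[k].dest == ins.dest for k in active)
            if free and not waw:
                slot[next_issue] = free[0]
                issue[next_issue] = cycle
                next_issue += 1
    return list(zip(issue, read, done, write))


for ins, times in zip(PROGRAM, simulate_scoreboard(PROGRAM, UNITS, EX_CYCLES)):
    print(f"{ins.text:<20}", *times)
```

With the functional unit counts and EX cycles of the example, this model yields the same four columns as the final instruction status table above, including the ADD.D write-back delayed to cycle 22 by the WAR hazard on F6.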

Dynamic Scheduling: The Tomasulo Algorithm

• Developed at IBM and first implemented in IBM's 360/91 mainframe in 1966, about 3 years after the debut of the scoreboard in the CDC 6600.
• Dynamically schedule the pipeline in hardware to reduce stalls.
• Differences between IBM 360 & CDC 6600 ISA:
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600.
  – IBM has 4 FP registers vs. 8 in CDC 6600 (part of ISA).
• Current CPU architectures that can be considered descendants of the IBM 360/91, and which implement and utilize a variation of the Tomasulo Algorithm, include:
  – RISC CPUs: Alpha 21264, HP 8600, MIPS R12000, PowerPC G4, ...
  – RISC-core x86 CPUs: AMD Athlon, Intel Pentium III, 4, Xeon, ...

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

Tomasulo Algorithm Vs. Scoreboard

• Control & buffers distributed with Functional Units (FUs) vs. centralized in Scoreboard:
  – FU buffers are called "reservation stations", which have pending instructions and operands and other instruction status info (including data dependencies).
  – Reservation stations are sometimes referred to as "physical registers" or "renaming registers", as opposed to architecture registers specified by the ISA.
• Register renaming: Registers in instructions are replaced by either values (if available) or pointers (renamed) to reservation stations (RS) that will supply the value later:
  – This process is called register renaming. (Done in issue stage, in-order.)
  – Register renaming eliminates WAR, WAW hazards (name dependence).
  – Allows for a hardware-based version of loop unrolling.
  – More reservation stations than ISA registers are possible, leading to optimizations that compilers can't achieve and preventing the number of ISA registers from becoming a bottleneck.
• Forwarding: Instruction results go (forwarded) from RSs to RSs, not through registers, over the Common Data Bus (CDB) that broadcasts results to all waiting RSs (dependent instructions).
• Loads and Stores are treated as FUs with RSs as well.

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

IBM 360/91 Vs. CDC 6600

                      IBM 360/91                             CDC 6600 (Control Data Corp.)
                      Tomasulo-based (1966)                  Scoreboard-based (1963)
  Functional units    Pipelined functional units             Multiple functional units (not pipelined)
                      (6 load, 3 store, 3 +, 2 x/÷)          (1 load/store, 1 +, 2 x, 1 ÷)
  Window size         ≤ 14 instructions                      ≤ 5 instructions
  Structural hazard   No issue on structural hazard          Same
  WAW                 Renaming avoids it                     Stall issue (ID1)
  WAR                 Renaming avoids it                     Stall completion (WB)
  Results             Broadcast results from FU over CDB     Write/read registers
  Control             Distributed                            Centralized in the scoreboard

WAW and WAR hazards are eliminated by register renaming in the Tomasulo-based design.

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

Dynamic Scheduling: The Tomasulo Approach

[Figure: the basic structure of a MIPS floating-point unit using Tomasulo's algorithm. Labels: instruction fetch; instruction queue (IQ) holding instructions to issue (in program order).]

• Pipelined FP units are used here.

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

Reservation Station (RS) Fields

• Op: Operation to perform in the unit (e.g., + or –).
• Vj, Vk: Value of source operands S1 and S2, when available.
  – Store buffers have a single V field indicating result to be stored.
• Qj, Qk: Reservation stations (RSs) producing source registers (i.e. operand values needed by instruction).
  – No ready flags as in Scoreboard; Qj, Qk = 0 => ready.
  – Store buffers only have Qi for RS producing result (value to be written).
• A: Address information for loads or stores. Initially immediate field of instruction, then effective address when calculated.
• Busy: Indicates reservation station is busy.
• Register result status: Qi indicates which reservation station will write each register, if one exists.
  – Blank (or 0) when no pending instruction (i.e. RS) exists that will write to that register.
  – The register bank behaves like a reservation station.

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)
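
The fields above map naturally onto a small record, and filling that record at issue time is where register renaming happens. The sketch below is a minimal illustration, assuming two-source FP operations only; the class name ReservationStation, the function issue, the use of None for "0 => ready", and the register values are all hypothetical choices made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand, when available
    vk: Optional[float] = None   # value of second source operand, when available
    qj: Optional[str] = None     # RS producing Vj (None plays the role of 0 => ready)
    qk: Optional[str] = None     # RS producing Vk
    a: Optional[int] = None      # address information (loads and stores only)


def issue(rs, op, dest, src1, src2, regs, reg_status):
    """Fill a free RS; rename each pending source to the RS that will produce it."""
    if rs.busy:
        return False                           # structural hazard: no free RS
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src1)               # pending producer, if any
    rs.vj = None if rs.qj else regs[src1]      # otherwise copy the value now
    rs.qk = reg_status.get(src2)
    rs.vk = None if rs.qk else regs[src2]
    reg_status[dest] = rs.name                 # Qi: this RS will write dest
    return True


# Hypothetical register contents; F2 is still being loaded by buffer Load2.
regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 6.0, "F8": 8.0}
reg_status = {"F2": "Load2"}

mult1 = ReservationStation("Mult1")
add1 = ReservationStation("Add1")
issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)
issue(add1, "ADD.D", "F6", "F8", "F2", regs, reg_status)
print(mult1)
print(add1)
print(reg_status)
```

Because available operand values are copied into the station at issue, a later instruction that overwrites the same architectural register cannot disturb them, which is the sense in which renaming removes WAR hazards.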

Three Stages of Tomasulo Algorithm

Stage 0, Instruction Fetch (IF): No changes, in-order.

1 Issue (always done in program order): Get instruction from pending Instruction Queue (IQ).
  – Instruction issued to a free reservation station (RS) (no structural hazard).
  – Selected RS is marked busy.
  – Control sends available instruction operand values (from ISA registers) to assigned RS.
  – Operands not available yet are renamed to RSs that will produce the operand (register renaming). (Dynamic construction of data dependency graph.)
2 Execution (EX) (can be done out of program order): Operate on operands. Also includes waiting for operands + MEM.
  – When both operands are ready, then start executing on assigned FU.
  – If not all operands are ready, watch Common Data Bus (CDB) for needed result (forwarding done via CDB), i.e. wait on any remaining operands, no RAW. (Data dependencies observed.)
3 Write result (WB) (can be done out of program order): Finish execution.
  – Write result on Common Data Bus (CDB) to all awaiting units (RSs), including destination register, i.e. broadcast result on CDB (forwarding).
  – Mark reservation station as available.
  Note: No WB for stores.

• Normal data bus: data + destination ("go to" bus).
• Common Data Bus (CDB): data + source ("come from" bus):
  – 64 bits for data + 4 bits for Functional Unit source address.
  – Write data to waiting RS if source matches expected RS (that produces result).
  – Does the result forwarding via broadcast to waiting RSs.

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)
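
The "come from" behaviour of the CDB in the write result stage can be sketched as a tag-matching broadcast. This is a minimal illustration only: reservation stations are plain dictionaries, and the function name broadcast, the station names, and the values are hypothetical.

```python
def broadcast(tag, value, stations, regs, reg_status):
    """Put (value, source tag) on the CDB; every waiting consumer captures it."""
    for rs in stations:
        if rs["qj"] == tag:                # RS was waiting on this producer
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    for reg, qi in list(reg_status.items()):
        if qi == tag:                      # register still expects this producer
            regs[reg] = value
            del reg_status[reg]


def ready(rs):
    """An RS can start executing once neither operand is pending."""
    return rs["qj"] is None and rs["qk"] is None


# Hypothetical state: Mult1 and Add1 both wait for the load in buffer Load2.
stations = [
    {"name": "Mult1", "vj": None, "vk": 4.0, "qj": "Load2", "qk": None},
    {"name": "Add1", "vj": 8.0, "vk": None, "qj": None, "qk": "Load2"},
]
regs = {"F2": 0.0}
reg_status = {"F2": "Load2", "F0": "Mult1"}

broadcast("Load2", 2.5, stations, regs, reg_status)
print([(rs["name"], ready(rs)) for rs in stations])
print(regs, reg_status)
```

A register is updated only while its Qi entry still names the broadcasting station, so a result whose destination has since been renamed to a newer producer does not overwrite the register.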

Steps in The Tomasulo Approach and The Requirements of Each Step

[Table: steps in the Tomasulo approach and the requirements of each step; contents not recoverable from the extraction.]

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

Drawbacks of The Tomasulo Approach

• Implementation complexity:
  – Example: The implementation of the Tomasulo algorithm may have caused delays in the introduction of 360/91, MIPS 10000, IBM 620 among other CPUs.
• Many high-speed associative result stores using (CDB) are required.
• Performance limited by one Common Data Bus.
  – Possible solution: Multiple CDBs → more Functional Unit and RS logic needed for parallel associative stores. (Even more complexity.)

In Fourth Edition: Chapter 2.4 (In Third Edition: Chapter 3.2)

Tomasulo Approach Example
Using the same code used in the scoreboard example to be run on the Tomasulo configuration given earlier:

Functional unit type              # of RSs   EX Cycles
Integer                           1          1
Floating Point Multiply/divide    2          10/40
Floating Point add                3          2
(Pipelined Functional Units)

L.D   F6, 34(R2)
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Dependence types marked on the code: Real Data Dependence (RAW), Anti-dependence (WAR), Output Dependence (WAW).
L.D processing takes two cycles: EX, MEM (only one cycle in scoreboard example)
In Fourth Edition: Chapter 2.5 (In Third Edition: Chapter 3.3)

Dependency Graph For Example Code
The same code used in the scoreboard example:
1 L.D   F6, 34(R2)
2 L.D   F2, 45(R3)
3 MUL.D F0, F2, F4
4 SUB.D F8, F6, F2
5 DIV.D F10, F0, F6
6 ADD.D F6, F8, F2
Data Dependence (RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output Dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)
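The dependence pairs listed for the example code can be checked mechanically. The sketch below is one possible way to do it, under two stated assumptions: DIV.D reads F0 and F6 (consistent with the (3, 5) true dependence listed), and name dependences between two instructions that are already ordered by a true dependence are dropped, which is why SUB.D/ADD.D (4, 6) is not reported as an anti-dependence on F6. The function name `find_dependencies` is made up for this illustration.

```python
# Each entry: (opcode, destination register, source registers).
CODE = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),   # assumption: F0 is the first source
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def find_dependencies(code):
    """Return sorted (RAW, WAR, WAW) lists of 1-based instruction pairs."""
    raw, war, waw = set(), set(), set()
    last_writer = {}           # register -> number of its latest writer
    readers_since_write = {}   # register -> readers since that latest write
    for j, (_, dest, srcs) in enumerate(code, start=1):
        for r in srcs:
            if r in last_writer:                   # true dependence
                raw.add((last_writer[r], j))
            readers_since_write.setdefault(r, []).append(j)
        for i in readers_since_write.get(dest, []):
            if i != j:                             # anti-dependence
                war.add((i, j))
        if dest in last_writer:                    # output dependence
            waw.add((last_writer[dest], j))
        last_writer[dest] = j
        readers_since_write[dest] = []
    # Assumption: omit name dependences already covered by a RAW ordering.
    war -= raw
    waw -= raw
    return sorted(raw), sorted(war), sorted(waw)


raw, war, waw = find_dependencies(CODE)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```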

Tomasulo Example: Cycle 0 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k Issue L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status F 0 Clock 0 Execution complete Write Result Load 1 Load 2 Load 3 S 1 Vj S 2 Vk RS for j Qj RS for k Qk F 2 F 4 F 6 F 8 (i. e at end of cycle 0) Busy No No No Address F 10 F 12. . . F 30 FU EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example Cycle 1 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Issue Instruction status Instruction j k Issue L. D F 6 34+ R 2 1 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 1 F 0 FU F 2 Execution complete Write Result Load 1 Load 2 Load 3 S 1 Vj S 2 Vk F 4 RS for j Qj F 6 Busy Yes No No No Address 34+R 2 RS for k Qk F 8 F 10 F 12. . . F 30 Load 1 EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 2 Issue Instruction status Instruction j k Issue L. D F 6 34+ R 2 1 L. D F 2 45+ R 3 2 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status F 0 Clock 2 FU Execution complete Write Result Load 1 Load 2 Load 3 S 1 Vj F 2 Load 2 S 2 Vk F 4 RS for j Qj F 6 Busy Yes No Address 34+R 2 45+R 3 RS for k Qk F 8 F 10 F 12. . . F 30 Load 1 Note: Unlike 6600, can have multiple loads outstanding (CDC 6600 only has one integer FU) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 3 Issue Instruction status Execution Instruction j k Issue complete L. D F 6 34+ R 2 1 3 L. D F 2 45+ R 3 2 MUL. D F 0 F 2 F 4 3 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations S 1 Time Name Busy Op Vj 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Write Result Load 1 Load 2 Load 3 Busy Yes No Address 34+R 2 45+R 3 Load processing takes 2 cycles (EX, Mem) S 2 Vk RS for j Qj R(F 4) Load 2 RS for k Qk Clock 3 FU F 0 F 2 Mult 1 Load 2 F 4 F 6 F 8 F 10 F 12. . . F 30 Load 1 EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 4 Issue Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 Yes 0 Add 2 No Add 3 No 0 Mult 1 Yes 0 Mult 2 No Register result status Clock 4 FU Issue 1 2 3 4 Execution complete 3 4 Write Result 4 Load 1 Load 2 Load 3 S 1 Op Vj SUBD M(34+R 2) S 2 Vk RS for j Qj MULTD R(F 4) Load 2 F 4 F 6 F 8 M(34+R 2) Add 1 F 0 F 2 Mult 1 Load 2 Busy No Yes No Address F 10 F 12. . . 45+R 3 RS for k Qk Load 2 F 30 i. e. register F 6 has the loaded value from memory • Load 2 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 5 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 Issue DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 2 Add 1 Yes Execution cycles remaining 0 Add 2 No (execution actually Add 3 No starts next cycle) 10 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 5 FU Issue 1 2 3 4 5 Execution complete 3 4 Write Result 4 5 Load 1 Load 2 Load 3 RS for j Qj Busy No No No Address F 12. . . S 1 Op Vj SUBD M(34+R 2) S 2 Vk M(45+R 3) RS for k Qk MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 F 10 Mult 1 M(45+R 3) M(34+R 2) Add 1 Mult 2 F 30 Load 2 result forwarded via CDB to Add 1, Mult 1 (SUB. D, MUL. D execution will start execution next cycle 6) Issue DIV. D EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example Cycle 6 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. DF 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 Issue ADD. DF 6 F 8 F 2 Reservation Stations Time Name Busy 1 Add 1 Yes Execution cycles remaining 0 Add 2 Yes Add 3 No 9 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 6 • FU Issue 1 2 3 4 5 6 Execution complete 3 4 Write Result 4 5 Load 1 Load 2 Load 3 RS for j Qj Busy No No No Address F 12. . . S 1 Op Vj SUBD M(34+R 2) ADDD S 2 Vk M(45+R 3) RS for k Qk MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 F 10 Mult 1 M(45+R 3) Add 2 Add 1 Mult 2 Add 1 F 30 ADD. D is issued here vs. scoreboard (in cycle 16) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 7 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 Yes Done executing 0 Add 2 Yes Add 3 No 8 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 7 FU Issue 1 2 3 4 5 6 Execution complete 3 4 Write Result 4 5 Load 1 Load 2 Load 3 Busy No No No Address F 12. . . 7 S 1 Op Vj SUBD M(34+R 2) ADDD S 2 Vk M(45+R 3) RS for j Qj RS for k Qk Add 1 MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 F 10 Mult 1 M(45+R 3) Add 2 Add 1 Mult 2 F 30 • RS Add 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 10 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 No 0 Add 2 Yes Done executing 0 Add 3 No 5 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 10 FU Issue 1 2 3 4 5 6 Op Execution complete 3 4 Write Result 4 5 7 Load 1 Load 2 Load 3 Busy No No No Address F 10 F 12. . . 8 10 S 1 Vj S 2 Vk RS for j Qj RS for k Qk ADDD M()–M() M(45+R 3) MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 Mult 1 M(45+R 3) Add 2 M()–M() Mult 2 F 30 • RS Add 2 completed execution EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 11 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 F 10 F 6 DIV. D F 8 F 2 ADD. D F 6 Reservation Stations Time Name Busy 0 Add 1 No 0 Add 2 No 0 Add 3 No 4 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 11 FU • Issue 1 2 3 4 5 6 Execution complete 3 4 Write Result 4 5 7 8 10 11 S 2 Vk RS for j Qj MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 Mult 1 M(45+R 3) (M-M)+M() M()–M() Mult 2 Op S 1 Vj Load 1 Load 2 Load 3 Busy No No No Address RS for k Qk F 10 F 12. . . F 30 Write back result of ADD. D in this cycle (What about anti-dependence over F 6 with DIV. D ? ) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 15 Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 Yes Done executing 0 Mult 2 Yes Register result status Clock 15 FU Issue 1 2 3 4 5 6 Execution complete 3 4 15 7 Write Result 4 5 Address F 10 F 12. . . 8 10 11 S 2 Vk RS for j Qj MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 Mult 1 M(45+R 3) (M–M)+M() M()–M() Mult 2 Op S 1 Vj Load 1 Load 2 Load 3 Busy No No No RS for k Qk F 30 • Mult 1 completed execution; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 16 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k Issue L. D F 6 34+ R 2 1 L. D F 2 45+ R 3 2 MUL. D F 0 F 2 F 4 3 SUB. D F 8 F 6 F 2 4 DIV. D F 10 F 6 5 ADD. D F 6 F 8 F 2 6 Reservation Stations Time Name Busy Op 0 Add 1 No Execution cycles 0 Add 2 No remaining (execution actually Add 3 No starts next cycle) 0 Mult 1 No 40 Mult 2 Yes DIVD Register result status Clock 16 FU Execution complete 3 4 15 7 Write Result 4 5 16 8 10 Load 1 Load 2 Load 3 Busy No No No Address F 10 F 12. . . 11 S 1 Vj S 2 Vk M*F 4 M(34+R 2) F 0 F 2 F 4 M*F 4 M(45+R 3) RS for j Qj RS for k Qk F 6 F 8 (M–M)+M() M()–M() Mult 2 Only Divide instruction remains DIV. D execution will start next cycle (17) F 30 EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Example: Cycle 57 (vs 62 cycles for scoreboard)
Instruction status:
Instruction           Issue   Execution complete   Write Result
L.D   F6, 34(R2)      1       3                    4
L.D   F2, 45(R3)      2       4                    5
MUL.D F0, F2, F4      3       15                   16
SUB.D F8, F6, F2      4       7                    8
DIV.D F10, F0, F6     5       56                   57
ADD.D F6, F8, F2      6       10                   11
Reservation stations (Add 1, Add 2, Add 3, Mult 1, Mult 2) and load buffers (Load 1, Load 2, Load 3): all not busy.
Register result status (Clock 57): F0 = M*F4, F2 = M(45+R3), F6 = (M–M)+M(), F8 = M()–M(), F10 = M*F4/M
Instruction block done.
• Again we have:
– In-order issue,
– Out-of-order execution, completion
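The issue, execution-complete and write-result cycles of this example follow from a simple timing rule, sketched below. The model is a simplification and rests on assumptions that happen to hold for this six-instruction block: one instruction issues per cycle and always finds a free RS, no two results compete for the CDB in the same cycle, execution starts the cycle after the last operand is broadcast, and the result is written the cycle after execution completes. The function name `schedule` and the tuple layout are made up for this illustration.

```python
# EX latencies from the example; L.D counts EX + MEM as two cycles.
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

# (opcode, destination register, source registers)
CODE = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),   # assumption: F0 is the first source
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def schedule(code, latency):
    """Return (opcode, issue, exec_complete, write_result) per instruction."""
    write_cycle = {}   # register -> cycle in which its renamed producer writes
    rows = []
    for issue, (op, dest, srcs) in enumerate(code, start=1):
        # Operands are captured at issue (renaming), so only producers that
        # issued earlier matter; registers never written are ready at cycle 0.
        ready = max([write_cycle.get(r, 0) for r in srcs] + [issue])
        start = ready + 1
        complete = start + latency[op] - 1
        write = complete + 1
        write_cycle[dest] = write
        rows.append((op, issue, complete, write))
    return rows


rows = schedule(CODE, LATENCY)
for row in rows:
    print(row)
```

Note that ADD.D writes F6 in cycle 11, long before DIV.D finishes; this is safe because DIV.D captured the old F6 value when it issued, so the anti-dependence on F6 causes no stall.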

Tomasulo Loop Example (Hardware-Based Version of Loop-Unrolling)
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop ; branch if R1 ≠ R2
Note independent loop iterations.
• Assume FP Multiply takes 4 execution clock cycles.
• Assume first load takes 8 cycles (possibly due to a cache miss); second, third, ... loads take 4 cycles (cache hit).
• Assume R1 = 80 initially.
• Assume DADDUI only takes one cycle (issue).
• Assume branch resolved in issue stage (no EX or CDB write).
• Assume branch is predicted taken and no branch misprediction, i.e. perfect branch prediction. (How? Target? What if prediction is wrong?)
• No branch delay slot is used in this example.
• Stores take 4 cycles (ex, mem) and do not write on CDB.
We’ll go over the execution to complete first two loop iterations.
Expanded from loop example in Chapter 2.5 (Third Edition Chapter 3.3)

Tomasulo Loop Example Dependency Graph (First three iterations shown)
Example Code:
First Iteration
1 L.D   F0, 0(R1)
2 MUL.D F4, F0, F2
3 S.D   F4, 0(R1)
Second Iteration
4 L.D   F0, 0(R1)
5 MUL.D F4, F0, F2
6 S.D   F4, 0(R1)
Third Iteration
7 L.D   F0, 0(R1)
8 MUL.D F4, F0, F2
9 S.D   F4, 0(R1)
Loop maintenance (DADDUI) and branches (BNE) not shown.
Name dependencies between iteration 3 instructions and iteration 1 instructions are not shown in graph.

Loop Example Cycle 0 (i. e at end of cycle 0) Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 0 R 1 80 F 0 Issue S 1 Vj F 2 Execution Write complete Result S 2 Vk F 4 Busy Address No No No Qi No No No Load 1 Load 2 Load 3 Store 1 Store 2 Store 3 RS for j RS for k Qj Qk Code: L. D MUL. D S. D DADDUI BNE F 6 F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Qi EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 1 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 1 R 1 80 F 0 Qi Issue 1 S 1 Vj Cycles remaining Execution Write complete Result S 2 Vk Busy Address Load 1 8 Yes 80 Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D MUL. D S. D DADDUI BNE F 2 F 4 F 6 F 8 F 0, 0(R 1) Issue F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Load 1 First L. D issues, takes 8 cycles to complete execution EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 2 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 2 R 1 80 F 0 Qi Load 1 Issue 1 2 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 7 Yes 80 Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 Issue S. D F 4, 0(R 1) DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First MUL. D issues, wait on first L. D (Load 1) to write on CDB EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 3 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 3 R 1 80 F 0 Qi Load 1 Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 6 Yes 80 Load 2 No Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Issue DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First S. D issues, wait on first MUL. D (Mult 1) to write on CDB EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 4 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 4 R 1 72 F 0 Qi Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 Load 1 F 4 Busy Address Load 1 5 Yes 80 Load 2 No Qi Load 3 No Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 Issue BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First DADDUI issues (not shown) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 5 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 5 R 1 72 F 0 Qi Load 1 Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 4 Yes 80 Load 2 No Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 BNE R 1, R 2, loop Issue F 6 F 8 F 10 F 12. . . F 30 Mult 1 First BNE issues (not shown), assumed predicted taken EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 6 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 6 R 1 72 F 0 Qi • Second L. D. issues Load 2 Issue 1 2 3 6 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 3 Yes 80 Load 2 4 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) Issue MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 (will take four ex cycles) Note: F 0 never sees Load 1 result • WAW between first and second L. D on F 0 eliminated by register renaming EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 7 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 7 • Second R 1 72 F 0 Qi Load 2 Issue 1 2 3 6 7 S 1 Vj F 2 Execution Write complete Result R(F 2) Busy Address Load 1 2 Yes 80 Load 2 3 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 Issue S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 Load 2 BNE R 1, R 2, loop F 4 F 6 S 2 Vk F 8 F 10 F 12. . . F 30 Mult 2 MUL. D issues (to RS Mult 2) Note: F 4 never sees Mult 1 result • WAW between first and second MUL. D on F 4 eliminated by register renaming EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 8 Issue Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 8 R 1 72 • Second F 0 Qi Load 2 Issue 1 2 3 6 7 8 S 1 Vj F 2 Execution Write complete Result R(F 2) Busy Address Load 1 1 Yes 80 2 Load 2 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 Yes 72 Mult 2 Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Issue DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop Load 2 F 4 F 6 S 2 Vk F 8 F 10 F 12. . . F 30 Mult 2 S. D issues (to RS Store 2) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007
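The way renaming removes the WAW hazards on F0 and F4 can be traced with a small sketch of the issue stage for the first two iterations. It is an illustration under assumptions: the unit tags (Load 1, Mult 1, Store 1, ...) are taken from the slides, no result has been written back before the second S.D issues (true here, since the first CDB write happens in cycle 10), and the names `issue_two_iterations`, `reg_status` and `waits_on` are made up for the example.

```python
def issue_two_iterations():
    """Trace register renaming while issuing two loop iterations in order."""
    reg_status = {}   # register -> tag of the unit that will write it
    waits_on = {}     # unit tag -> tags of the units it waits on
    program = [
        # (assigned unit, opcode, destination, FP source registers)
        ("Load1",  "L.D",   "F0", []),
        ("Mult1",  "MUL.D", "F4", ["F0", "F2"]),
        ("Store1", "S.D",   None, ["F4"]),
        ("Load2",  "L.D",   "F0", []),
        ("Mult2",  "MUL.D", "F4", ["F0", "F2"]),
        ("Store2", "S.D",   None, ["F4"]),
    ]
    for tag, _op, dest, srcs in program:
        # A pending source is renamed to the tag of its latest producer.
        waits_on[tag] = [reg_status[r] for r in srcs if r in reg_status]
        if dest is not None:
            # Overwrites any older tag: the older result never reaches
            # the register, which is how the WAW hazard disappears.
            reg_status[dest] = tag
    return reg_status, waits_on


reg_status, waits_on = issue_two_iterations()
print(reg_status)   # pending writer of each register after cycle 8
print(waits_on)     # which unit each instruction waits on
```

Each MUL.D waits on the load of its own iteration and each store on its own multiply, so the two iterations proceed independently even though they name the same registers.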

Loop Example Cycle 9 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 9 R 1 64 F 0 Qi Load 2 Issue 1 2 3 6 7 8 S 1 Vj F 2 First Load EX Done Execution Write complete Result Busy Address 9 Load 1 0 Yes 80 Load 2 1 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 Yes 72 Mult 2 Store 3 No S 2 RS for j RS for k Vk Qj Qk Code: R(F 2) Load 1 Load 2 F 4 F 6 L. D MUL. D S. D DADDUI BNE F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 Issue R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Issue second DADDUI (not shown) • Load 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 10 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No Execution cycles remaining 0 Add 2 No (execution actually starts next cycle) 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 10 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Second Load EX Done Execution Write complete Result Busy Address 9 10 Load 1 No Load 2 0 Yes 72 Load 3 No Qi 10 Store 1 Yes 80 Mult 1 Store 2 Yes 72 Mult 2 Store 3 No S 2 RS for j RS for k Vk Qj Qk Code: M(80) R(F 2) Load 2 F 6 Load 2 • Load 1 result forwarded via CDB • Issue second BNE (not shown) F 4 L. D MUL. D S. D DADDUI BNE F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop Issue F 10 F 12. . . F 30 Mult 2 to Mult 1, execution will start next cycle 11 • Load 2 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 11 Instruction status Instruction j k iteration F 0 0 R 1 1 L. D F 0 F 2 1 MUL. D F 4 0 R 1 1 S. D F 0 0 R 1 2 L. D F 0 F 2 2 MUL. D F 4 0 R 1 2 S. D Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Execution cycles remaining 0 Add 3 No (execution actually starts next cycle) 3 Mult 1 Yes MULTD 4 Mult 2 Yes MULTD Register result status Clock 11 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 4 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) Issue F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Load 2 result forwarded via CDB to Mult 2, execution will start next cycle 12 Third iteration L. D. issues (to RS Load 3) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 12 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 2 Mult 1 Yes MULTD 3 Mult 2 Yes MULTD Register result status Clock 12 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 3 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Issue third iteration MUL. D ? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 13 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 1 Mult 1 Yes MULTD 2 Mult 2 Yes MULTD Register result status Clock 13 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 2 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 Load 3 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Issue third iteration MUL. D, S. D ? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 14 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No EX Done 0 Mult 1 Yes MULTD 1 Mult 2 Yes MULTD Register result status Clock 14 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 Load 2 Load 3 1 10 11 Store 1 4 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Mult 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 15 Third Load EX Done Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Available by end 0 Add 3 No of this cycle 0 Mult 1 No EX Done 0 Mult 2 Yes MULTD Register result status Clock 15 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 0 10 11 Store 1 4 15 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 M(80)*R(F 2) Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Mult 2 completing; what is waiting for it? • Third iteration L. D done execution EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 16 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No Execution cycles remaining 0 Add 2 No (execution actually starts next cycle) 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 16 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 3 15 16 Store 2 4 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 Issue F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Issue third iteration MUL. D (to RS Mult 1) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 17 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 17 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 2 15 16 Store 2 3 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) Yes 64 Mult 1 Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) Issue R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Third iteration L. D writes on CDB (delayed one cycle due to CDB conflict) Issue third iteration S. D (to RS Store 3) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007
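The delayed write of the third L.D is a consequence of having a single CDB. A minimal sketch of the arbitration is given below; it assumes that the oldest instruction in program order wins when several results are ready, which matches the cycle numbers in this example but is not stated as the actual arbitration policy. The function name `arbitrate_cdb` is made up for the illustration.

```python
def arbitrate_cdb(completions):
    """Assign one CDB write per cycle.

    completions: list of (tag, exec_complete_cycle), oldest instruction first.
    Returns a dict mapping each tag to the cycle in which it writes the CDB.
    """
    taken = set()
    writes = {}
    for tag, done in completions:
        cycle = done + 1          # earliest possible write cycle
        while cycle in taken:     # bus already used: wait another cycle
            cycle += 1
        taken.add(cycle)
        writes[tag] = cycle
    return writes


# Execution-complete cycles from the loop example, in program order.
writes = arbitrate_cdb([("Mult1", 14), ("Mult2", 15), ("Load3", 15)])
print(writes)
```

With these inputs Load 3 writes in cycle 17 instead of cycle 16, the one-cycle delay noted for the third iteration L.D.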

Loop Example Cycle 18 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 3 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 18 R 1 56 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 1 15 16 Store 2 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) Yes 64 Mult 1 Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 Issue R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Issue third iteration DADDUI EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

(First Loop Iteration Done) Loop Example Cycle 19 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 2 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 19 R 1 56 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj S 2 Vk M(64) R(F 2) F 2 First Store Done Execution Write complete Result 9 10 14 15 19 10 11 15 16 F 4 Busy Address Load 1 No Load 2 No Load 3 No Qi Store 1 0 No Store 2 1 Yes 72 M(72)*R(72) Store 3 Yes 64 Mult 1 RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop Issue F 6 F 8 F 10 F 12. . . F 30 Mult 1 First S. D done (No write on CDB for stores) First loop iteration done Issue third iteration BNE EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

(First Two Loop Iterations Done) Loop Example Cycle 20 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 1 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 20 R 1 56 F 0 Qi Load 1 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 14 15 19 10 11 15 16 20 S 2 RS for j Vk Qj M(64) R(F 2) F 2 F 4 F 6 Busy Address 54 Load 1 4 Yes Load 2 No Load 3 No Qi Store 1 No Store 2 0 No Store 3 Yes 64 Mult 1 RS for k Qk Code: L. D F 0, 0(R 1) Issue MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop F 8 F 10 F 12. . . F 30 Mult 1 Second S. D done (No write on CDB for stores) Second loop iteration done Issue fourth iteration L. D (to RS Load 1) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Loop Example Cycle 21 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No EX Done 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 21 R 1 56 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 14 15 19 10 11 15 16 20 S 2 RS for j Vk Qj M(64) R(F 2) Busy Address 54 Load 1 3 Yes Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 Yes 64 Mult 1 RS for k Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 Issue S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop Load 1 F 2 F 6 F 4 F 8 F 10 F 12. . . F 30 Mult 1 (third iteration MUL. D) completing; what is waiting for it? Issue fourth iteration MUL. D (to RS Mult 2) EECC 551 - Shaaban # lec # 4 Winter 2007 12 -17 -2007

Tomasulo Loop Example Timing Diagram (cycles 1 to 21)
I = Issue, E = Execute, W = Write Result on CDB

Iteration   Instruction   I    E        W
1           L.D           1    2–9      10
1           MUL.D         2    11–14    15
1           S.D           3    16–19    (no CDB write)
1           DADDUI        4
1           BNE           5
2           L.D           6    7–10     11
2           MUL.D         7    12–15    16
2           S.D           8    17–20    (no CDB write)
2           DADDUI        9
2           BNE           10
3           L.D           11   12–15    17
3           MUL.D         16   18–21
3           S.D           17
3           DADDUI        18
3           BNE           19
4           L.D           20   21–
4           MUL.D         21

3rd L.D write delayed one cycle.
3rd MUL.D issue delayed until mul RS is available.