Reduction of Data Hazards Stalls with Dynamic Scheduling

• So far we have dealt with data hazards in instruction pipelines by:
  – Result forwarding (register bypassing) to reduce or eliminate stalls needed to prevent RAW hazards as a result of true data dependence.
  – Hazard detection hardware to stall the pipeline starting with the instruction that uses the result.
  – Compiler-based static pipeline scheduling to separate the dependent instructions, minimizing actual hazard-prevention stalls in scheduled code.
    • Loop unrolling to increase basic block size: more ILP exposed.
• Dynamic scheduling (out-of-order execution, i.e. the start of instruction execution is not in program order):
  – Uses a hardware-based mechanism to reorder or rearrange instruction execution order to reduce stalls dynamically at runtime.
    • Better dynamic exploitation of instruction-level parallelism (ILP).
  – Enables handling some cases where instruction dependencies are unknown at compile time (ambiguous dependencies).
  – Similar to the other pipeline optimizations above, a dynamically scheduled processor cannot remove true data dependencies, but tries to avoid or reduce stalling.
(In Appendix A.8, Chapter 3.2, 3.3)
EECC 551 - Shaaban

Dynamic Pipeline Scheduling: The Concept
(Out-of-order execution, i.e. the start of instruction execution is not in program order)
• Dynamic pipeline scheduling overcomes the limitations of in-order pipelined execution by allowing out-of-order instruction execution.
• Instructions are allowed to start executing out-of-order as soon as their operands are available.
  • Better dynamic exploitation of instruction-level parallelism (ILP).
Example:
  1  DIV.D F0, F2, F4
  2  ADD.D F10, F0, F8
  3  SUB.D F12, F8, F14
  Dependency graph: true data dependency from 1 to 2; instruction 3 does not depend on DIV.D or ADD.D.
  – In the case of in-order pipelined execution, SUB.D must wait for DIV.D to complete (which stalled ADD.D) before starting execution.
  – In out-of-order execution, SUB.D can start as soon as the values of its operands F8, F14 are available.
• This implies allowing out-of-order instruction commit (completion).
• May lead to imprecise exceptions if an instruction issued earlier raises an exception.
  – This is similar to pipelines with multi-cycle floating point units.
(In Appendix A.8, Chapter 3.2) Order = Program Instruction Order

Dynamic Pipeline Scheduling
• Dynamic instruction scheduling is accomplished by:
  – Dividing the Instruction Decode (ID) stage into two stages:
    • Issue (always done in program order): Decode instructions, check for structural hazards.
      – A record of data dependencies is constructed as instructions are issued.
      – This creates a dynamically-constructed dependency graph for the window of instructions in-flight (being processed) in the CPU.
    • Read operands (can be done out of program order): Wait until data hazard conditions, if any, are resolved, then read operands when available (then start execution).
    (All instructions pass through the issue stage in order but can be stalled or pass each other in the read operands stage.)
  – In the instruction fetch stage (IF), fetch an additional instruction every cycle into a latch or several instructions into an instruction queue.
  – Increase the number of functional units to meet the demands of the additional instructions in their EX stage.
• Two approaches to dynamic scheduling:
  – Dynamic scheduling with the Scoreboard, used first in the CDC 6600 (Control Data Corp., 1963). The CDC 6600 is the world's first "supercomputer"; cost: $7 million in 1963.
  – The Tomasulo approach, pioneered by the IBM 360/91 (1966).
(In Appendix A.8, Chapter 3.2)

Dynamic Scheduling With A Scoreboard
• The scoreboard is a centralized hardware mechanism that aims to maintain an execution rate of one instruction per cycle by executing an instruction as soon as its operands are available in registers and no hazard conditions prevent it.
  – e.g. forming a single-issue out-of-order pipeline.
• It replaces ID, EX, WB with four stages: ID1, ID2, EX, WB. Instruction Fetch (IF) is not changed.
• Every instruction goes through the scoreboard, where a record of data dependencies is constructed (corresponds to instruction issue).
  – In effect, the hardware dynamically constructs the dependency graph for a window of instructions as they are issued one at a time in program order.
• A system with a scoreboard is assumed to have several functional units with their status information reported to the scoreboard.
• If the scoreboard determines that an instruction cannot execute immediately, it executes another waiting instruction, keeps monitoring the status of the hardware units, and decides when the instruction can proceed to execute.
• The scoreboard also decides when an instruction can write its results to registers (hazard detection and resolution is centralized in the scoreboard).
(In Appendix A.8) Order = Program Instruction Order. Introduced in CDC 6600 (1963).

[Figure: The basic structure of a MIPS processor with a scoreboard. Labels: FP register write port/bus, FP register read ports/buses, integer register write port/bus.]
FP units are not pipelined, similar to CDC 6600. (In Appendix A.8)

Instruction Execution Stages with A Scoreboard
(Stage 0: Fetch, no changes)
1 Issue (ID1), always done in program order: An instruction is issued if:
  • A functional unit for the instruction is available (no structural hazard).
  • The instruction result destination register is not marked for writing by an earlier active instruction (no WAW hazard, i.e. no output dependence).
  • If the above conditions are satisfied, the scoreboard issues the instruction to a functional unit and updates its internal data structures.
  As indicated by the instruction issue requirements, structural and WAW hazards are resolved here by stalling the instruction issue (this stage replaces part of the ID stage in the conventional MIPS pipeline).
2 Read operands (ID2), can be done out of program order: The scoreboard monitors the availability of the source operands. A source operand is available when no earlier active instruction will write it. When all source operands are available, the scoreboard tells the functional unit to read all operands from the registers at once (no forwarding supported) and start execution (RAW hazards resolved here dynamically). This completes ID.
3 Execution (EX): The functional unit starts execution upon receiving operands. When the results are ready it notifies the scoreboard (replaces EX, MEM in MIPS).
4 Write result (WB): Once the scoreboard senses that a functional unit has completed execution, it checks for WAR hazards and stalls the completing instruction if needed; otherwise the write back is completed. The functional unit issued to the instruction is marked as available (not busy) after WB is completed.
(In Appendix A.8)

Three Parts of the Scoreboard
1 Instruction status: Which of 4 steps the instruction is in.
2 Functional unit status: Indicates the state of the functional unit (FU). Nine fields for each functional unit:
  – Busy      Indicates whether the unit is busy or not
  – Op        Operation to perform in the unit (e.g., + or –)
  – Fi        Destination register
  – Fj, Fk    Source-register numbers
  – Qj, Qk    Functional units producing source (operand) registers Fj, Fk
  – Rj, Rk    Flags indicating when Fj, Fk are ready (set to Yes after the operand is available to read). Both operands are read at once from registers, i.e. when both Rj, Rk are set to Yes.
3 Register result status: Indicates which functional unit will write to each register, if one exists. Blank when no pending instructions will write that register.
(In Appendix A.8) EECC 551 - Shaaban, lec # 4, Winter 2005, 12-7-2005
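The functional unit status and register result status tables can be sketched as plain data structures. The following is a minimal illustration, not the CDC 6600 hardware: the names `FUStatus` and `issue` are hypothetical, and the bookkeeping done at issue (Qj/Qk copied from the register result status, Rj/Rk set when no producer is pending) is assumed from the usual textbook description of a scoreboard.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table (the nine fields)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing Fj (None: no pending producer)
    qk: Optional[str] = None   # unit producing Fk
    rj: bool = False           # Fj ready to be read?
    rk: bool = False           # Fk ready to be read?


def issue(units: Dict[str, FUStatus], result: Dict[str, str],
          name: str, op: str, dst: str, src1: str, src2: str) -> bool:
    """Hypothetical issue step: returns False when the instruction must stall."""
    if units[name].busy:       # structural hazard: unit already in use
        return False
    if dst in result:          # WAW hazard: an active instruction will write dst
        return False
    qj, qk = result.get(src1), result.get(src2)
    units[name] = FUStatus(True, op, dst, src1, src2, qj, qk,
                           rj=qj is None, rk=qk is None)
    result[dst] = name         # register result status: this unit will write dst
    return True


# State while the Integer unit is still loading F2.
units = {n: FUStatus() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
units["Integer"] = FUStatus(True, "Load", "F2", None, "R3", None, None, False, True)
result = {"F2": "Integer"}

issue(units, result, "Mult1", "Mult", "F0", "F2", "F4")
print(units["Mult1"])
```

Here the MUL.D row records `qj="Integer"` and `rj=False`, because F2 is still owed by the load, while F4 is ready (`rk=True`).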

The Scoreboard: Detailed Pipeline Control
[Figure: table of scoreboard pipeline control conditions; contents not recoverable from this transcript.] (In Appendix A.8)

A Scoreboard Example
The following code is run on the MIPS with a scoreboard given earlier with:
Functional Unit (FU)       # of FUs   EX cycles
Integer                    1          1
Floating Point Multiply    2          10
Floating Point Add         1          2
Floating Point Divide      1          40
All functional units are not pipelined (similar to CDC 6600).
  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2
(In Appendix A.8)

Dependency Graph For Example Code
  1  L.D   F6, 34(R2)
  2  L.D   F2, 45(R3)
  3  MUL.D F0, F2, F4
  4  SUB.D F8, F6, F2
  5  DIV.D F10, F0, F6
  6  ADD.D F6, F8, F2
Data Dependence (real data dependence, RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output Dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)
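The dependence pairs for the example code can be recomputed with a single scan over the instructions. This is a sketch only: the `PROGRAM` encoding and the function name `find_dependences` are illustrative choices, and DIV.D is assumed to read F0 and F6, as the pair (3, 5) requires.

```python
# Each entry: (opcode, destination register, source registers).
PROGRAM = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def find_dependences(program):
    """Return sorted (RAW, WAR, WAW) lists of 1-based instruction pairs."""
    raw, war, waw = set(), set(), set()
    last_writer = {}   # register -> number of the most recent writer
    readers = {}       # register -> readers since that most recent write
    for n, (_, dst, srcs) in enumerate(program, start=1):
        for reg in srcs:
            if reg in last_writer:
                raw.add((last_writer[reg], n))      # true data dependence
            readers.setdefault(reg, []).append(n)
        for m in readers.get(dst, []):
            if m != n:
                war.add((m, n))                     # anti-dependence
        if dst in last_writer:
            waw.add((last_writer[dst], n))          # output dependence
        last_writer[dst] = n
        readers[dst] = []
    return sorted(raw), sorted(war), sorted(waw)


raw, war, waw = find_dependences(PROGRAM)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```

The RAW and WAW lists match the pairs listed on the slide. The plain scan also reports a WAR pair (4, 6) on F6 in addition to (5, 6); the slide does not list it, presumably because instructions 4 and 6 are already ordered by the true dependence on F8.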

Scoreboard Example: Cycle 1
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 1, i.e. at the end of cycle 1): F6 = Integer

Scoreboard Example: Cycle 2
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 2): F6 = Integer
• Issue second L.D? No, stall on structural hazard: the single integer functional unit is busy.

Scoreboard Example: Cycle 3
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 3): F6 = Integer
• EX, MEM for L.D in one cycle.
• Issue MUL.D? No, cannot issue out of order (second L.D not issued yet).

Scoreboard Example: Cycle 4
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 4): F6 = Integer
• The Integer unit is actually free at the end of this cycle 4 (available for instruction issue next cycle).

Scoreboard Example: Cycle 5
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 5): F2 = Integer
• Issue second L.D.

Scoreboard Example: Cycle 6
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6
MUL.D F0, F2, F4    6
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No
Register result status (clock 6): F0 = Mult1, F2 = Integer
• Issue MUL.D.

Scoreboard Example: Cycle 7
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7
MUL.D F0, F2, F4    6
SUB.D F8, F6, F2    7
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
      Divide   No
Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add
• Issue SUB.D.
• Read multiply operands?

Scoreboard Example: Cycle 8a (First half of cycle 8)
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7
MUL.D F0, F2, F4    6
SUB.D F8, F6, F2    7
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 8): F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide
• Issue DIV.D.

Scoreboard Example: Cycle 8b (Second half of cycle 8)
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6
SUB.D F8, F6, F2    7
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 8): F0 = Mult1, F8 = Add, F10 = Divide
• Second L.D writes result to F2.

Scoreboard Example: Cycle 9
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9
SUB.D F8, F6, F2    7      9
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2
Functional unit status (Time = execution cycles remaining; execution actually starts next cycle):
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 9): F0 = Mult1, F8 = Add, F10 = Divide
• Read operands for MUL.D & SUB.D.
• Issue ADD.D?

Scoreboard Example: Cycle 11
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9
SUB.D F8, F6, F2    7      9              11
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2
Functional unit status (Time = execution cycles remaining):
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide
• Issue ADD.D?

Scoreboard Example: Cycle 12
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 12): F0 = Mult1, F10 = Divide
• Read operands for DIV.D?
• Issue ADD.D?

Scoreboard Example: Cycle 13
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2    13
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
6     Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide
• Issue ADD.D: the Add FP unit became available at the end of cycle 12 (start of 13).

Scoreboard Example: Cycle 17
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2    13     14             16
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes
Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide
• Write result of ADD.D? No, WAR hazard (DIV.D has not read any operands yet, including F6).

Scoreboard Example: Cycle 20
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9              19                  20
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8
ADD.D F6, F8, F2    13     14             16
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6                    Yes  Yes
Register result status (clock 20): F6 = Add, F10 = Divide
• Read operands for DIV.D?

Scoreboard Example: Cycle 21
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9              19                  20
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8      21
ADD.D F6, F8, F2    13     14             16
Functional unit status (Time = execution cycles remaining; execution actually starts next cycle):
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    Yes  Yes
40    Divide   Yes   Div   F10  F0  F6                    Yes  Yes
Register result status (clock 21): F6 = Add, F10 = Divide
• DIV.D reads operands, starts execution next cycle.
• Write result of ADD.D?

Scoreboard Example: Cycle 22
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9              19                  20
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8      21
ADD.D F6, F8, F2    13     14             16                  22
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
39    Divide   Yes   Div   F10  F0  F6                    Yes  Yes
Register result status (clock 22): F10 = Divide
• First cycle of DIV.D execution (39 more EX cycles).
• ADD.D writes result in F6 (no WAR: DIV.D read its operands in cycle 21).

Scoreboard Example: Cycle 61
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9              19                  20
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8      21             61
ADD.D F6, F8, F2    13     14             16                  22
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0  F6                    Yes  Yes
Register result status (clock 61): F10 = Divide

Scoreboard Example: Cycle 62
Instruction status:
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    1      2              3                   4
L.D   F2, 45(R3)    5      6              7                   8
MUL.D F0, F2, F4    6      9              19                  20
SUB.D F8, F6, F2    7      9              11                  12
DIV.D F10, F0, F6   8      21             61                  62
ADD.D F6, F8, F2    13     14             16                  22
Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No
Register result status (clock 62): all entries blank
• Instruction block done.
• We have:
  – In-order issue,
  – Out-of-order execution, completion.
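The timings in the instruction status table can be reproduced with a small cycle-by-cycle model. This is a simplified sketch under stated assumptions, not a model of the CDC 6600: every decision in a cycle looks only at events from earlier cycles, an instruction's result completes `LATENCY` cycles after it reads operands, and the names `Instr`, `simulate` and `example_program` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}      # units per type
LATENCY = {"L.D": 1, "MUL.D": 10, "SUB.D": 2, "ADD.D": 2, "DIV.D": 40}


@dataclass
class Instr:
    op: str
    unit: str
    dst: str
    srcs: List[str]
    issue: Optional[int] = None
    read: Optional[int] = None
    done: Optional[int] = None
    write: Optional[int] = None


def before(t, cycle):
    """True if the event at time t happened in an earlier cycle."""
    return t is not None and t < cycle


def active(ins, cycle):
    """Issued in an earlier cycle and not yet written back before this cycle."""
    return before(ins.issue, cycle) and not before(ins.write, cycle)


def simulate(program):
    cycle, next_issue = 0, 0
    while any(ins.write is None for ins in program):
        cycle += 1
        for i, ins in enumerate(program):
            earlier = program[:i]
            if before(ins.done, cycle) and ins.write is None:
                # Write result: stall (WAR) while an earlier instruction
                # still has to read this destination register.
                if not any(active(e, cycle) and ins.dst in e.srcs
                           and not before(e.read, cycle) for e in earlier):
                    ins.write = cycle
            elif before(ins.issue, cycle) and ins.read is None:
                # Read operands: wait (RAW) while an earlier active
                # instruction will still write one of the sources.
                if not any(active(e, cycle) and e.dst in ins.srcs
                           for e in earlier):
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.op]
        if next_issue < len(program):
            # Issue: in order, one per cycle; stall on structural or WAW hazard.
            ins = program[next_issue]
            busy = sum(active(e, cycle) and e.unit == ins.unit for e in program)
            waw = any(active(e, cycle) and e.dst == ins.dst for e in program)
            if busy < UNITS[ins.unit] and not waw:
                ins.issue = cycle
                next_issue += 1
    return [(i.issue, i.read, i.done, i.write) for i in program]


def example_program():
    return [
        Instr("L.D", "Integer", "F6", ["R2"]),
        Instr("L.D", "Integer", "F2", ["R3"]),
        Instr("MUL.D", "Mult", "F0", ["F2", "F4"]),
        Instr("SUB.D", "Add", "F8", ["F6", "F2"]),
        Instr("DIV.D", "Divide", "F10", ["F0", "F6"]),
        Instr("ADD.D", "Add", "F6", ["F8", "F2"]),
    ]


for row in simulate(example_program()):
    print(row)
```

Each printed tuple is (issue, read operands, execution complete, write result) for one instruction, in program order.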

Dynamic Scheduling: The Tomasulo Algorithm
• Developed at IBM and first implemented in IBM's 360/91 mainframe in 1966, about 3 years after the debut of the scoreboard in the CDC 6600.
• Dynamically schedules the pipeline in hardware to reduce stalls.
• Differences between IBM 360 & CDC 6600 ISA:
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600.
  – IBM has 4 FP registers vs. 8 in CDC 6600 (part of ISA).
• Current CPU architectures that can be considered descendants of the IBM 360/91, which implement and utilize a variation of the Tomasulo Algorithm, include:
  – RISC CPUs: Alpha 21264, HP 8600, MIPS R12000, PowerPC G4, ...
  – RISC-core x86 CPUs: AMD Athlon, Intel Pentium III, 4, Xeon, ...
(In Chapter 3.2)

Tomasulo Algorithm Vs. Scoreboard
• Control & buffers distributed with Functional Units (FUs) vs. centralized in Scoreboard:
  – FU buffers are called "reservation stations", which hold pending instructions, operands and other instruction status info (including data dependencies).
  – Reservation stations are sometimes referred to as "physical registers" or "renaming registers", as opposed to architecture registers specified by the ISA.
• Registers in instructions are replaced by either values (if available) or pointers (renamed) to reservation stations (RS) that will supply the value later:
  – This process is called register renaming.
    • Register renaming eliminates WAR, WAW hazards (name dependence).
  – Allows for a hardware-based version of loop unrolling.
  – More reservation stations than ISA registers are possible, leading to optimizations that compilers can't achieve and preventing the number of ISA registers from becoming a bottleneck.
• Instruction results go (are forwarded) from RSs to RSs, not through registers, over the Common Data Bus (CDB) that broadcasts results to all waiting RSs (dependent instructions).
• Loads and Stores are treated as FUs with RSs as well.
(In Chapter 3.2)
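The effect of register renaming on name dependences can be illustrated on the six-instruction example used in this lecture. This is a simplified sketch: the tags `T1`, `T2`, ... are hypothetical stand-ins for reservation station names, every result gets a fresh tag, and the function name `rename` is illustrative.

```python
from itertools import count

# Each entry: (opcode, destination register, source registers).
PROGRAM = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def rename(program):
    """Give every result a fresh tag; sources use the newest tag of their register."""
    tags = count(1)
    newest = {}        # architectural register -> tag of its latest producer
    renamed = []
    for op, dst, srcs in program:
        # A source with no pending producer keeps its architectural name,
        # i.e. its value can be read from the register file directly.
        new_srcs = [newest.get(reg, reg) for reg in srcs]
        tag = f"T{next(tags)}"
        newest[dst] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


for op, dst, srcs in rename(PROGRAM):
    print(f"{op:6s} {dst:3s} <- {', '.join(srcs)}")
```

After renaming, DIV.D reads `T1` (the first load's result) while ADD.D writes `T6`, so the WAR and WAW hazards on F6 disappear. The true dependences remain, now expressed through tags.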

IBM 360/91 Vs. CDC 6600
                    IBM 360/91, Tomasulo-based (1966)          CDC 6600 (Control Data Corp.), Scoreboard-based (1963)
Functional units    Pipelined (6 load, 3 store, 3 +, 2 x/÷)    Multiple, not pipelined (1 load/store, 1 +, 2 x, 1 ÷)
Window size         ≤ 14 instructions                          ≤ 5 instructions
Structural hazard   No issue on structural hazard              Same
WAW                 Renaming avoids it                         Stall issue (ID1)
WAR                 Renaming avoids it                         Stall completion (WB)
Results             Broadcast results from FU                  Write/read registers
WAW and WAR hazards are eliminated by register renaming in the Tomasulo approach.
(In Chapter 3.2)

Dynamic Scheduling: The Tomasulo Approach
[Figure: The basic structure of a MIPS floating-point unit using Tomasulo's algorithm. Labels: Instruction Fetch, Instructions to Issue (IQ).]
Pipelined FP units are used here. (In Chapter 3.2)

Reservation Station Fields
• Op: Operation to perform in the unit (e.g., + or –).
• Vj, Vk: Values of source operands S1 and S2.
  – Store buffers have a single V field indicating the result to be stored.
• Qj, Qk: Reservation stations producing the source registers (i.e. the operand values needed by the instruction).
  – No ready flags as in Scoreboard; Qj, Qk = 0 => ready.
  – Store buffers only have Qi for the RS producing the result (value to be written).
• A: Address information for loads or stores. Initially the immediate field of the instruction, then the effective address when calculated.
• Busy: Indicates the reservation station is busy.
• Register result status: Qi indicates which reservation station will write each register, if one exists.
  – Blank (or 0) when no pending instruction (i.e. RS) exists that will write to that register.
  – The register bank behaves like a reservation station.
(In Chapter 3.2)

Three Stages of Tomasulo Algorithm
1 Issue (always done in program order): Get instruction from pending Instruction Queue (IQ).
  – Instruction issued to a free reservation station (RS) (no structural hazard).
  – Selected RS is marked busy.
  – Control sends available instruction operand values (from ISA registers) to the assigned RS.
  – Operands not available yet are renamed to the RSs that will produce the operand (register renaming). (Dynamic construction of data dependency graph)
2 Execution (EX) (can be done out of program order): Operate on operands.
  – When both operands are ready, start executing on the assigned FU.
  – If not all operands are ready, watch the Common Data Bus (CDB) for the needed result (forwarding done via CDB), i.e. wait on any remaining operands, no RAW. (Data dependencies observed)
3 Write result (WB): Finish execution.
  – Write result on the Common Data Bus (CDB) to all awaiting units (RSs), including the destination register.
  – Mark reservation station as available.
• Normal data bus: data + destination ("go to" bus).
• Common Data Bus (CDB): data + source ("come from" bus):
  – 64 bits for data + 4 bits for Functional Unit source address.
  – Write data to a waiting RS if the source matches the expected RS (that produces the result).
  – Does the result forwarding via broadcast to waiting RSs.
(In Chapter 3.2)
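The issue and write-result steps above can be sketched with a reservation station record and a broadcast function. This is a minimal illustration under stated assumptions: `None` stands in for the "0" that marks a ready operand, the names `ReservationStation`, `issue` and `broadcast` are hypothetical, and freeing the producing station after the broadcast is left to the caller.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of source operand 1, once known
    vk: Optional[float] = None   # value of source operand 2, once known
    qj: Optional[str] = None     # RS that will produce operand 1 (None: ready)
    qk: Optional[str] = None     # RS that will produce operand 2 (None: ready)

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def issue(rs: ReservationStation, op: str, dst: str, src1: str, src2: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Issue: copy available operand values, rename the rest to producer tags."""
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src1)
    rs.vj = regs[src1] if rs.qj is None else None
    rs.qk = reg_status.get(src2)
    rs.vk = regs[src2] if rs.qk is None else None
    reg_status[dst] = rs.name          # this RS will write the destination


def broadcast(tag: str, value: float, stations: List[ReservationStation],
              regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Write result: the CDB carries (source tag, value) to every waiting RS."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:            # register file also listens on the CDB
            regs[reg] = value
            del reg_status[reg]


# MUL.D F0, F2, F4 issued while Load2 still owes F2 (values are made up).
regs = {"F2": 0.0, "F4": 2.5, "F6": 0.0}
reg_status = {"F2": "Load2", "F6": "Load1"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)
print(mult1)
broadcast("Load2", 4.0, [mult1], regs, reg_status)
print(mult1.ready())
```

Matching on the source tag rather than on a destination is what makes the CDB a "come from" bus: one broadcast updates every station waiting on that producer.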

Steps in The Tomasulo Approach and The Requirements of Each Step
[Figure: table of steps and requirements; contents not recoverable from this transcript.] (In Chapter 3.2)

Drawbacks of The Tomasulo Approach
• Implementation complexity:
  – Example: The implementation of the Tomasulo algorithm may have caused delays in the introduction of the 360/91, MIPS R10000, IBM 620, among other CPUs.
• Many high-speed associative result stores using the CDB are required.
• Performance limited by one Common Data Bus.
  – Possible solution: multiple CDBs → more Functional Unit and RS logic needed for parallel associative stores (even more complexity).
(In Chapter 3.2)

Tomasulo Approach Example
Using the same code used in the scoreboard example, to be run on the Tomasulo configuration given earlier:
                                 # of RSs   EX cycles
Integer                          1          1
Floating Point Multiply/divide   2          10/40
Floating Point add               3          2
Pipelined functional units.
  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2
L.D processing takes two cycles: EX, MEM (only one cycle in the scoreboard example).
(In Chapter 3.3)

Dependency Graph For Example Code
(The same code used in the scoreboard example)
  1  L.D   F6, 34(R2)
  2  L.D   F2, 45(R3)
  3  MUL.D F0, F2, F4
  4  SUB.D F8, F6, F2
  5  DIV.D F10, F0, F6
  6  ADD.D F6, F8, F2
Data Dependence (real data dependence, RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output Dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)

Tomasulo Example: Cycle 0
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status:
Instruction         Issue  Execution complete  Write result
L.D   F6, 34(R2)
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No
Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Register result status (clock 0, i.e. at the end of cycle 0): all entries blank

Tomasulo Example Cycle 1 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k Issue L. D F 6 34+ R 2 1 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 1 F 0 FU F 2 Execution complete Write Result Load 1 Load 2 Load 3 S 1 Vj S 2 Vk F 4 RS for j Qj F 6 Busy Yes No No No Address 34+R 2 RS for k Qk F 8 F 10 F 12. . . F 30 Load 1 EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 2 Instruction status Instruction j k Issue L. D F 6 34+ R 2 1 L. D F 2 45+ R 3 2 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status F 0 Clock 2 FU Execution complete Write Result Load 1 Load 2 Load 3 S 1 Vj F 2 Load 2 S 2 Vk F 4 RS for j Qj F 6 Busy Yes No Address 34+R 2 45+R 3 RS for k Qk F 8 F 10 F 12. . . F 30 Load 1 Note: Unlike 6600, can have multiple loads outstanding (CDC 6600 only has one integer FU) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 3 Instruction status Execution Instruction j k Issue complete L. D F 6 34+ R 2 1 3 L. D F 2 45+ R 3 2 MUL. D F 0 F 2 F 4 3 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations S 1 Time Name Busy Op Vj 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Write Result Load 1 Load 2 Load 3 Busy Yes No Address 34+R 2 45+R 3 Load processing takes 2 cycles (EX, Mem) S 2 Vk RS for j Qj R(F 4) Load 2 RS for k Qk Clock 3 FU F 0 F 2 Mult 1 Load 2 F 4 F 6 F 8 F 10 F 12. . . F 30 Load 1 EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005
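The issue step shown in this cycle for MUL.D can be sketched in Python. This is a minimal illustration under stated assumptions, not the slides' hardware: the names `RS`, `issue`, `reg_status`, and `regs` are hypothetical, and operand values are kept as symbolic strings. On issue, each source operand is either copied as a value (Vj/Vk) or recorded as the tag of the station that will produce it (Qj/Qk), and the destination register is renamed to the issuing station.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    """One reservation station entry (fields follow the slide columns)."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # operand values (symbolic strings here)
    vk: Optional[str] = None
    qj: Optional[str] = None   # tags of the stations producing the operands
    qk: Optional[str] = None

def issue(rs, op, dest, src_j, src_k, reg_status, regs):
    """Fill a free station and rename dest to the station's tag."""
    assert not rs.busy, "structural hazard: station is occupied"
    rs.busy, rs.op = True, op
    # A pending producer tag goes into Q; otherwise copy the value into V.
    rs.qj = reg_status.get(src_j)
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk else regs[src_k]
    reg_status[dest] = rs.name

# State when MUL.D issues in cycle 3: both loads are still outstanding.
reg_status = {"F6": "Load1", "F2": "Load2"}
regs = {"F4": "R(F4)"}
mult1 = RS("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", reg_status, regs)
print(mult1)
print(reg_status)
```

After the call, Mult1 holds R(F4) in Vk, waits on Load2 through Qj, and F0 is marked as produced by Mult1, which is the state this slide shows.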

Tomasulo Example: Cycle 4 Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 Yes 0 Add 2 No Add 3 No 0 Mult 1 Yes 0 Mult 2 No Register result status Clock 4 FU Issue 1 2 3 4 Execution complete 3 4 Write Result 4 Load 1 Load 2 Load 3 S 1 Op Vj SUBD M(34+R 2) S 2 Vk RS for j Qj MULTD R(F 4) Load 2 F 4 F 6 F 8 M(34+R 2) Add 1 F 0 F 2 Mult 1 Load 2 Busy No Yes No Address F 10 F 12. . . 45+R 3 RS for k Qk Load 2 F 30 i. e. register F 6 has the loaded value from memory • Load 2 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 5
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status:
Instruction    j     k     Issue   Execution complete   Write Result
L.D    F6      34+   R2    1       3                    4
L.D    F2      45+   R3    2       4                    5
MUL.D  F0      F2    F4    3       -                    -
SUB.D  F8      F6    F2    4       -                    -
DIV.D  F10     F0    F6    5       -                    -
ADD.D  F6      F8    F2    -       -                    -

Load buffers: Load1, Load2, Load3 are all not busy.

Reservation stations (Time = execution cycles remaining; execution actually starts next cycle):
Time   Name    Busy   Op      Vj         Vk         Qj      Qk
2      Add1    Yes    SUBD    M(34+R2)   M(45+R3)   -       -
0      Add2    No     -       -          -          -       -
-      Add3    No     -       -          -          -       -
10     Mult1   Yes    MULTD   M(45+R3)   R(F4)      -       -
0      Mult2   Yes    DIVD    -          M(34+R2)   Mult1   -

Register result status (Clock 5):
F0      F2         F4   F6         F8     F10     F12 ... F30
Mult1   M(45+R3)   -    M(34+R2)   Add1   Mult2   -

• Load 2 result is forwarded via the CDB to Add1 and Mult1 (SUB.D and MUL.D will start execution next cycle, 6).
• DIV.D issues.
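The CDB forwarding step in this cycle can be sketched as a broadcast that every waiting station and register compares against its tag. This is a hedged Python illustration: `broadcast` and the dictionary layout are hypothetical, and the starting state is the cycle 4 state from the previous slide.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Deliver (tag, value) on the CDB to every waiting consumer."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    # Registers still renamed to this tag receive the value as well.
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regs[reg] = value
        del reg_status[reg]

# Cycle 4 state: SUB.D and MUL.D both wait for Load2.
stations = [
    {"name": "Add1", "op": "SUBD",
     "Vj": "M(34+R2)", "Vk": None, "Qj": None, "Qk": "Load2"},
    {"name": "Mult1", "op": "MULTD",
     "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
]
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
regs = {"F6": "M(34+R2)"}

broadcast("Load2", "M(45+R3)", stations, reg_status, regs)
ready = [rs["name"] for rs in stations
         if rs["Qj"] is None and rs["Qk"] is None]
print(ready)
```

A single broadcast wakes up both consumers at once, which is why SUB.D and MUL.D can both begin execution in cycle 6.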

Tomasulo Example Cycle 6 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. DF 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. DF 6 F 8 F 2 Reservation Stations Time Name Busy 1 Add 1 Yes 0 Add 2 Yes Add 3 No 9 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 6 • FU Issue 1 2 3 4 5 6 Execution complete 3 4 Write Result 4 5 Load 1 Load 2 Load 3 RS for j Qj Busy No No No Address F 12. . . S 1 Op Vj SUBD M(34+R 2) ADDD S 2 Vk M(45+R 3) RS for k Qk MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 F 10 Mult 1 M(45+R 3) Add 2 Add 1 Mult 2 Add 1 F 30 ADD. D is issued here vs. scoreboard (in cycle 16) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 7 FP EX Cycles : Add = 2 cycles, Multiply = 10, Divide = 40 Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 Yes 0 Add 2 Yes Add 3 No 8 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 7 FU Issue 1 2 3 4 5 6 Execution complete 3 4 Write Result 4 5 Load 1 Load 2 Load 3 Busy No No No Address F 12. . . 7 S 1 Op Vj SUBD M(34+R 2) ADDD S 2 Vk M(45+R 3) RS for j Qj RS for k Qk Add 1 MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 F 10 Mult 1 M(45+R 3) Add 2 Add 1 Mult 2 F 30 • RS Add 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 10 Instruction status Instruction j k L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 No 0 Add 2 Yes 0 Add 3 No 5 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 10 FU Issue 1 2 3 4 5 6 Op Execution complete 3 4 Write Result 4 5 7 Load 1 Load 2 Load 3 Busy No No No Address F 10 F 12. . . 8 10 S 1 Vj S 2 Vk RS for j Qj RS for k Qk ADDD M()–M() M(45+R 3) MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 Mult 1 M(45+R 3) Add 2 M()–M() Mult 2 F 30 • RS Add 2 completed execution EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 11

Instruction status:
Instruction    j     k     Issue   Execution complete   Write Result
L.D    F6      34+   R2    1       3                    4
L.D    F2      45+   R3    2       4                    5
MUL.D  F0      F2    F4    3       -                    -
SUB.D  F8      F6    F2    4       7                    8
DIV.D  F10     F0    F6    5       -                    -
ADD.D  F6      F8    F2    6       10                   11

Load buffers: Load1, Load2, Load3 are all not busy.

Reservation stations:
Time   Name    Busy   Op      Vj         Vk         Qj      Qk
0      Add1    No     -       -          -          -       -
0      Add2    No     -       -          -          -       -
0      Add3    No     -       -          -          -       -
4      Mult1   Yes    MULTD   M(45+R3)   R(F4)      -       -
0      Mult2   Yes    DIVD    -          M(34+R2)   Mult1   -

Register result status (Clock 11):
F0      F2         F4   F6          F8        F10     F12 ... F30
Mult1   M(45+R3)   -    (M–M)+M()   M()–M()   Mult2   -

• The result of ADD.D is written back in this cycle. (What about the anti-dependence with DIV.D?)
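One way to read the question about the anti-dependence: DIV.D copied the value of F6 into its reservation station when it issued in cycle 5, so the later write to F6 by ADD.D cannot disturb it. A minimal Python sketch of that idea follows; the dictionary names are hypothetical and values are symbolic strings.

```python
regs = {"F6": "M(34+R2)"}        # value loaded by the first L.D
reg_status = {"F0": "Mult1"}     # F0 is still being computed by MUL.D

# Cycle 5: DIV.D F10, F0, F6 issues to Mult2.
mult2 = {
    "op": "DIVD",
    "Qj": reg_status.get("F0"),  # F0 not ready: remember the producer tag
    "Vk": regs["F6"],            # F6 is ready: copy the value now
}

# Cycle 11: ADD.D writes F6, long before DIV.D starts executing.
regs["F6"] = "(M-M)+M()"

print(mult2["Vk"])               # DIV.D still holds the old F6 value
```

Because the operand lives in the station rather than in the register file, the WAR hazard on F6 needs no stall.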

Tomasulo Example: Cycle 15 Instruction status j k Instruction L. D F 6 34+ R 2 L. D F 2 45+ R 3 MUL. D F 0 F 2 F 4 SUB. D F 8 F 6 F 2 DIV. D F 10 F 6 ADD. D F 6 F 8 F 2 Reservation Stations Time Name Busy 0 Add 1 No 0 Add 2 No Add 3 No 0 Mult 1 Yes 0 Mult 2 Yes Register result status Clock 15 FU Issue 1 2 3 4 5 6 Execution complete 3 4 15 7 Write Result 4 5 Address F 10 F 12. . . 8 10 11 S 2 Vk RS for j Qj MULTD M(45+R 3) DIVD R(F 4) M(34+R 2) Mult 1 F 0 F 2 F 4 F 6 F 8 Mult 1 M(45+R 3) (M–M)+M() M()–M() Mult 2 Op S 1 Vj Load 1 Load 2 Load 3 Busy No No No RS for k Qk F 30 • Mult 1 completed execution; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Example: Cycle 16
FP EX cycles: Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status:
Instruction    j     k     Issue   Execution complete   Write Result
L.D    F6      34+   R2    1       3                    4
L.D    F2      45+   R3    2       4                    5
MUL.D  F0      F2    F4    3       15                   16
SUB.D  F8      F6    F2    4       7                    8
DIV.D  F10     F0    F6    5       -                    -
ADD.D  F6      F8    F2    6       10                   11

Load buffers: Load1, Load2, Load3 are all not busy.

Reservation stations (Time = execution cycles remaining; execution actually starts next cycle):
Time   Name    Busy   Op     Vj     Vk         Qj   Qk
0      Add1    No     -      -      -          -    -
0      Add2    No     -      -      -          -    -
-      Add3    No     -      -      -          -    -
0      Mult1   No     -      -      -          -    -
40     Mult2   Yes    DIVD   M*F4   M(34+R2)   -    -

Register result status (Clock 16):
F0     F2         F4   F6          F8        F10     F12 ... F30
M*F4   M(45+R3)   -    (M–M)+M()   M()–M()   Mult2   -

• Only the divide instruction remains.
• DIV.D execution will start next cycle (17).

Tomasulo Example: Cycle 57 (vs 62 cycles for scoreboard)

Instruction status:
Instruction    j     k     Issue   Execution complete   Write Result
L.D    F6      34+   R2    1       3                    4
L.D    F2      45+   R3    2       4                    5
MUL.D  F0      F2    F4    3       15                   16
SUB.D  F8      F6    F2    4       7                    8
DIV.D  F10     F0    F6    5       56                   57
ADD.D  F6      F8    F2    6       10                   11

Load buffers: Load1, Load2, Load3 are all not busy.

Reservation stations: Add1, Add2, Add3, Mult1, and Mult2 are all not busy.

Register result status (Clock 57):
F0     F2         F4   F6          F8        F10      F12 ... F30
M*F4   M(45+R3)   -    (M–M)+M()   M()–M()   M*F4/M   -

Instruction block done.
• Again we have:
  – In-order issue,
  – Out-of-order execution and completion.
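The issue, execution-complete, and write-result cycles in this example follow a simple pattern, which the Python sketch below reproduces. It is a simplified timing model, not a full Tomasulo simulator, and rests on several assumptions: one instruction issues per cycle, a reservation station is always free, no two results compete for the CDB, load address registers are always ready, and DIV.D reads F0 and F6. The names `LATENCY`, `program`, and `schedule` are hypothetical.

```python
# Execution latencies from the slides (a load takes 2 cycles: EX, MEM).
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

# (opcode, destination, FP source registers); load sources are omitted
# because the integer address registers are assumed to be ready.
program = [
    ("L.D",   "F6",  []),
    ("L.D",   "F2",  []),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]

def schedule(prog):
    written = {}   # register -> cycle its latest producer writes the CDB
    rows = []
    for n, (op, dest, srcs) in enumerate(prog, start=1):
        issue = n                                   # in-order, one per cycle
        ready = max([written.get(s, 0) for s in srcs], default=0)
        start = max(issue, ready) + 1               # execute the cycle after
        complete = start + LATENCY[op] - 1
        write = complete + 1
        written[dest] = write                       # later readers see this
        rows.append((op, issue, complete, write))
    return rows

for row in schedule(program):
    print(row)
```

Processing in program order means DIV.D picks up the write cycle of the first L.D for F6, not that of the later ADD.D, which mirrors the operand copy performed at issue.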

Tomasulo Loop Example (Hardware-Based Version of Loop-Unrolling)

Loop:  L.D     F0, 0(R1)
       MUL.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, #-8
       BNE     R1, R2, Loop   ; branch if R1 ≠ R2

Note: independent loop iterations.

• Assume FP multiply takes 4 execution clock cycles.
• Assume the first load takes 8 cycles (possibly due to a cache miss) and the second load takes 4 cycles (cache hit).
• Assume R1 = 80 initially.
• Assume DADDUI only takes one cycle (issue).
• Assume the branch is resolved in the issue stage (no EX or CDB write).
• Assume the branch is predicted taken with no branch misprediction, i.e. perfect branch prediction. How?
• No branch delay slot is used in this example.
• Stores take 4 cycles (EX, MEM) and do not write on the CDB.
• We'll go over the execution to complete the first two loop iterations.
(Expanded from loop example in Chapter 3.3)

Tomasulo Loop Example Dependency Graph (First three iterations shown)

Example code:
First iteration:    1  L.D    F0, 0(R1)
                    2  MUL.D  F4, F0, F2
                    3  S.D    F4, 0(R1)
Second iteration:   4  L.D    F0, 0(R1)
                    5  MUL.D  F4, F0, F2
                    6  S.D    F4, 0(R1)
Third iteration:    7  L.D    F0, 0(R1)
                    8  MUL.D  F4, F0, F2
                    9  S.D    F4, 0(R1)

• Loop maintenance (DADDUI) and branches (BNE) are not shown.
• Name dependencies between iteration 3 instructions and iteration 1 instructions are not shown in the graph.

Loop Example Cycle 0 (i. e at end of cycle 0) Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 0 R 1 80 F 0 Issue S 1 Vj F 2 Execution Write complete Result S 2 Vk F 4 Busy Address No No No Qi No No No Load 1 Load 2 Load 3 Store 1 Store 2 Store 3 RS for j RS for k Qj Qk Code: L. D MUL. D S. D DADDUI BNE F 6 F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Qi EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 1 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 No Register result status Clock 1 R 1 80 F 0 Qi Issue 1 S 1 Vj Cycles remaining Execution Write complete Result S 2 Vk Busy Address Load 1 8 Yes 80 Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D MUL. D S. D DADDUI BNE F 2 F 4 F 6 F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Load 1 First L. D issues, takes 8 cycles to complete execution EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 2 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 2 R 1 80 F 0 Qi Load 1 Issue 1 2 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 7 Yes 80 Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First MUL. D issues, wait on first L. D (Load 1) to write on CDB EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 3 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 3 R 1 80 F 0 Qi Load 1 Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 6 Yes 80 Load 2 No Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First S. D issues, wait on first MUL. D (Mult 1) to write on CDB EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 4 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 4 R 1 72 F 0 Qi Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 Load 1 F 4 Busy Address Load 1 5 Yes 80 Load 2 No Qi Load 3 No Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First DADDUI issues (not shown) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 5 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 5 R 1 72 F 0 Qi Load 1 Issue 1 2 3 S 1 Vj Execution Write complete Result S 2 Vk R(F 2) F 2 F 4 Busy Address Load 1 4 Yes 80 Load 2 No Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First BNE issues (not shown), assumed predicted taken EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 6

Instruction status:
Instruction    j     k     Iteration   Issue   Execution complete   Write Result
L.D    F0      0     R1    1           1       -                    -
MUL.D  F4      F0    F2    1           2       -                    -
S.D    F4      0     R1    1           3       -                    -
L.D    F0      0     R1    2           6       -                    -
MUL.D  F4      F0    F2    2           -       -                    -
S.D    F4      0     R1    2           -       -                    -

Load and store buffers (Time = cycles remaining):
Name     Time   Busy   Address   Qi
Load1    3      Yes    80        -
Load2    4      Yes    72        -
Load3    -      No     -         -
Store1   -      Yes    80        Mult1
Store2   -      No     -         -
Store3   -      No     -         -

Reservation stations:
Time   Name    Busy   Op      Vj   Vk      Qj      Qk
0      Add1    No     -       -    -       -       -
0      Add2    No     -       -    -       -       -
0      Add3    No     -       -    -       -       -
0      Mult1   Yes    MULTD   -    R(F2)   Load1   -
0      Mult2   No     -       -    -       -       -

Register result status (Clock 6, R1 = 72):
F0      F2   F4      F6   F8   F10   F12 ... F30
Load2   -    Mult1   -    -    -     -

• Second L.D issues (will take four cycles).
• Note: F0 never sees the Load 1 result.
• WAW between the first and second L.D on F0 is eliminated by register renaming.
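The remark that F0 never sees the Load 1 result can be illustrated with a small Python sketch of the register result status table. The helper names `issue_load` and `write_back` are hypothetical, and values are symbolic strings; the cycle numbers in the comments come from the slides.

```python
reg_status = {}                 # register -> tag of its latest producer
regs = {"F0": "old F0"}

def issue_load(tag, dest):
    reg_status[dest] = tag      # the most recent writer owns the register

def write_back(tag, value):
    # Only registers still waiting on this exact tag are updated.
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regs[reg] = value
        del reg_status[reg]

issue_load("Load1", "F0")       # cycle 1: first-iteration L.D
issue_load("Load2", "F0")       # cycle 6: second-iteration L.D renames F0
write_back("Load1", "M(80)")    # cycle 10: F0 ignores this result
write_back("Load2", "M(72)")    # cycle 11: F0 takes this one
print(regs["F0"])
```

The first load's value still reaches Mult1 through the CDB, so the first MUL.D computes with M(80) while the architectural F0 only ever receives the second load's value.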

Loop Example Cycle 7 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 7 • Second R 1 72 F 0 Qi Load 2 Issue 1 2 3 6 7 S 1 Vj F 2 Execution Write complete Result R(F 2) Busy Address Load 1 2 Yes 80 Load 2 3 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 No Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) Load 1 DADDUI R 1, #-8 Load 2 BNE R 1, R 2, loop F 4 F 6 S 2 Vk F 8 F 10 F 12. . . F 30 Mult 2 MUL. D issues (to RS Mult 2) Note: F 4 never sees Mult 1 result • WAW between first and second MUL. D on F 4 eliminated by register renaming EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 8 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 8 R 1 72 • Second F 0 Qi Load 2 Issue 1 2 3 6 7 8 S 1 Vj F 2 Execution Write complete Result R(F 2) Busy Address Load 1 1 Yes 80 2 Load 2 Yes 72 Load 3 No Qi Store 1 Yes 80 Mult 1 Store 2 Yes 72 Mult 2 Store 3 No RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 Load 1 BNE R 1, R 2, loop Load 2 F 4 F 6 S 2 Vk F 8 F 10 F 12. . . F 30 Mult 2 S. D issues (to RS Store 2) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 9 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 9 R 1 64 F 0 Qi Load 2 Issue 1 2 3 6 7 8 S 1 Vj F 2 Execution Write complete Result 9 Load 1 0 Load 2 1 Load 3 Store 1 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk R(F 2) Load 1 Load 2 F 4 F 6 Busy Address Yes 80 Yes 72 No Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Issue second DADDUI (not shown) • Load 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 10 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 10 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 0 Load 3 10 Store 1 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) Load 2 F 6 Load 2 • Load 1 result forwarded via CDB • Issue second BNE (not shown) F 4 Busy Address No Yes 72 No Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 8 F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 to Mult 1, execution will start next cycle 11 • Load 2 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 11 Instruction status Instruction j k iteration F 0 0 R 1 1 L. D F 0 F 2 1 MUL. D F 4 0 R 1 1 S. D F 0 0 R 1 2 L. D F 0 F 2 2 MUL. D F 4 0 R 1 2 S. D Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 3 Mult 1 Yes MULTD 4 Mult 2 Yes MULTD Register result status Clock 11 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 4 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Load 2 result forwarded via CDB to Mult 2, execution will start next cycle 12 Third iteration L. D. issues (to RS Load 3) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 12 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 2 Mult 1 Yes MULTD 3 Mult 2 Yes MULTD Register result status Clock 12 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 3 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Issue third iteration MUL. D ? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 13 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 1 Mult 1 Yes MULTD 2 Mult 2 Yes MULTD Register result status Clock 13 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 Load 2 Load 3 2 10 11 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 Load 3 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 Issue third iteration MUL. D, S. D ? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 14 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 1 Mult 2 Yes MULTD Register result status Clock 14 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 Load 2 Load 3 1 10 11 Store 1 4 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(80) R(F 2) M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 Mult 1 Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Mult 1 completing; what is waiting for it? EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 15 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 No 0 Mult 2 Yes MULTD Register result status Clock 15 R 1 64 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 0 10 11 Store 1 4 15 Store 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(72) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No Yes 64 Qi Yes 80 M(80)*R(F 2) Yes 72 Mult 2 No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 2 • Mult 2 completing; what is waiting for it? • Third iteration L. D done execution EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 16 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 16 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 3 15 16 Store 2 4 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) No Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Issue third iteration MUL. D (to RS Mult 1) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 17 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 4 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 17 R 1 64 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 2 15 16 Store 2 3 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) Yes 64 Mult 1 Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Third iteration L. D writes on CDB (delayed one cycle due to CDB conflict) Issue third iteration S. D (to RS Store 3) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005
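The one-cycle delay of the third-iteration L.D comes from having a single CDB: the second MUL.D and the third L.D both finish execution in cycle 15 and would both write in cycle 16. A minimal Python sketch of such an arbiter follows. The `arbitrate` function and the tie-break rule (the older instruction wins) are assumptions chosen to match the outcome on the slides, and the order numbers are only illustrative.

```python
def arbitrate(requests):
    """requests: list of (program_order, name, ready_cycle).
    Returns {name: write_cycle} with at most one CDB write per cycle."""
    taken = set()
    writes = {}
    # Assumption: on a tie, the instruction earlier in program order wins.
    for _, name, ready in sorted(requests, key=lambda r: (r[2], r[0])):
        cycle = ready
        while cycle in taken:      # bus already used: retry next cycle
            cycle += 1
        taken.add(cycle)
        writes[name] = cycle
    return writes

requests = [
    (1, "MUL.D iteration 2", 16),  # finished executing in cycle 15
    (2, "L.D iteration 3", 16),    # also finished executing in cycle 15
]
print(arbitrate(requests))
```

With multiple CDBs the second request would not have to wait, at the cost of the extra associative-store logic noted under the drawbacks of the approach.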

Loop Example Cycle 18 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 3 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 18 R 1 56 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 Load 1 14 15 Load 2 Load 3 10 11 Store 1 1 15 16 Store 2 2 Store 3 S 2 RS for j RS for k Vk Qj Qk M(64) R(F 2) F 2 F 4 F 6 F 8 Busy Address No No No Qi Yes 80 M(80)*R(F 2) Yes 72 M(72)*R(72) Yes 64 Mult 1 Code: L. D MUL. D S. D DADDUI BNE F 0, 0(R 1) F 4, F 0, F 2 F 4, 0(R 1) R 1, #-8 R 1, R 2, loop F 10 F 12. . . F 30 Mult 1 Issue third iteration DADDUI EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

(First Loop Iteration Done) Loop Example Cycle 19 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 2 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 19 R 1 56 F 0 Qi Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 14 15 19 10 11 15 16 S 2 Vk M(64) R(F 2) F 2 F 4 Busy Address Load 1 No Load 2 No Load 3 No Qi Store 1 0 No Store 2 1 Yes 72 M(72)*R(72) Store 3 Yes 64 Mult 1 RS for j RS for k Qj Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop F 6 F 8 F 10 F 12. . . F 30 Mult 1 First S. D done (No write on CDB for stores) First loop iteration done Issue third iteration BNE EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

(First Two Loop Iterations Done) Loop Example Cycle 20 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 1 Mult 1 Yes MULTD 0 Mult 2 No Register result status Clock 20 R 1 56 F 0 Qi Load 1 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 14 15 19 10 11 15 16 20 S 2 RS for j Vk Qj M(64) R(F 2) F 2 F 4 F 6 Busy Address 54 Load 1 4 Yes Load 2 No Load 3 No Qi Store 1 No Store 2 0 No Store 3 Yes 64 Mult 1 RS for k Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop F 8 F 10 F 12. . . F 30 Mult 1 Second S. D done (No write on CDB for stores) Second loop iteration done Issue fourth iteration L. D (to RS Load 1) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Loop Example Cycle 21 Instruction status Instruction j k iteration L. D F 0 0 R 1 1 MUL. D F 4 F 0 F 2 1 S. D F 4 0 R 1 1 L. D F 0 0 R 1 2 MUL. D F 4 F 0 F 2 2 S. D F 4 0 R 1 2 Reservation Stations Time Name Busy Op 0 Add 1 No 0 Add 2 No 0 Add 3 No 0 Mult 1 Yes MULTD 0 Mult 2 Yes MULTD Register result status Clock 21 R 1 56 F 0 Qi Load 3 Issue 1 2 3 6 7 8 S 1 Vj Execution Write complete Result 9 10 14 15 19 10 11 15 16 20 S 2 RS for j Vk Qj M(64) R(F 2) Busy Address 54 Load 1 3 Yes Load 2 No Load 3 No Qi Store 1 No Store 2 No Store 3 Yes 64 Mult 1 RS for k Qk Code: L. D F 0, 0(R 1) MUL. D F 4, F 0, F 2 S. D F 4, 0(R 1) DADDUI R 1, #-8 BNE R 1, R 2, loop Load 1 F 2 F 6 F 4 F 8 F 10 F 12. . . F 30 Mult 1 (third iteration MUL. D) completing; what is waiting for it? Issue fourth iteration MUL. D (to RS Mult 2) EECC 551 - Shaaban # lec # 4 Winter 2005 12 -7 -2005

Tomasulo Loop Example Timing Diagram

[Timing diagram: rows are the instructions of loop iterations 1 to 4 (L.D, MUL.D, S.D, DADDUI, BNE); columns are cycles 1 to 21.]
I = Issue, E = Execute, W = Write Result on CDB
• 3rd L.D write delayed one cycle.
• 3rd MUL.D issue delayed until a multiply RS is available.