Superscalar Processors J. Nelson Amaral


Scalar to Superscalar
• Scalar processor: one instruction passes through each pipeline stage in each cycle
• Superscalar processor: multiple instructions at each pipeline stage in each cycle
  – Wider pipeline
• Superpipelined processor: decompose stages into smaller stages → more stages
  – Deeper pipeline
Baer p. 75

Superscalar
• Front end (IF and ID)
  – Must fetch and decode multiple instructions per cycle
  – m-way superscalar: brings (ideally) m instructions per cycle into the pipeline
• Back end (EX, Mem and WB)
  – Must execute and write back several instructions per cycle
Baer p. 75

Superscalar
• In-order (or static)
  – Instructions leave the front end in program order
• Out-of-order (or dynamic)
  – Instructions leave the front end, and execute, in an order different from program order
  – WB is called the commit stage
    • must ensure that the program semantics are followed
    • more complex design
Baer p. 76

Limits to Superscalar Performance
• Superscalars rely on exploiting Instruction-Level Parallelism (ILP)
  – They remove WAR and WAW dependences
  – But the amount of ILP is limited by RAW (true) dependences
Example:
  S0: R1 ← R2 + R3
  S1: R4 ← R1 + R5
  S2: R1 ← R6 + R7
  S3: R4 ← R1 + R9
Data dependence graph:
  S0 → S1: RAW (R1)
  S0 → S2: WAW (R1)
  S1 → S2: WAR (R1)
  S1 → S3: WAW (R4)
  S2 → S3: RAW (R1)
Baer p. 76

Limits to Superscalar Performance (same example after renaming)
  S0: R1 ← R2 + R3
  S1: R4 ← R1 + R5
  S2: RA ← R6 + R7
  S3: RB ← RA + R9
Renaming the destinations of S2 and S3 to RA and RB removes the WAW and WAR edges. Only the RAW dependences S0 → S1 and S2 → S3 remain.
Baer p. 76
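
The dependence graph for S0 to S3 can be derived mechanically. The following is a minimal sketch, not taken from the slides: the tuple encoding of statements and the names `PROGRAM`, `RENAMED` and `dependences` are illustrative assumptions. It reports only the nearest dependence on each register, which matches the edges drawn on the slide.

```python
from itertools import combinations

# Each statement: (name, destination register, source registers).
# This encoding is an assumption made for the sketch.
PROGRAM = [
    ("S0", "R1", ("R2", "R3")),
    ("S1", "R4", ("R1", "R5")),
    ("S2", "R1", ("R6", "R7")),
    ("S3", "R4", ("R1", "R9")),
]

# Same code after renaming the destinations of S2 and S3 to RA and RB.
RENAMED = [
    ("S0", "R1", ("R2", "R3")),
    ("S1", "R4", ("R1", "R5")),
    ("S2", "RA", ("R6", "R7")),
    ("S3", "RB", ("RA", "R9")),
]


def dependences(program):
    """Return (earlier, later, kind, register) tuples, pairs in program order."""
    found = []
    for i, j in combinations(range(len(program)), 2):
        name_i, dst_i, src_i = program[i]
        name_j, dst_j, src_j = program[j]
        # Registers overwritten by a statement strictly between i and j;
        # a dependence through such a register is carried by that statement.
        killed = {program[k][1] for k in range(i + 1, j)}
        if dst_i in src_j and dst_i not in killed:
            found.append((name_i, name_j, "RAW", dst_i))
        if dst_j in src_i and dst_j not in killed:
            found.append((name_i, name_j, "WAR", dst_j))
        if dst_i == dst_j and dst_i not in killed:
            found.append((name_i, name_j, "WAW", dst_i))
    return found


if __name__ == "__main__":
    for edge in dependences(PROGRAM):
        print(edge)
    print("after renaming:", dependences(RENAMED))
```

Running `dependences` on both versions shows the point of the slide: renaming leaves only true (RAW) dependences to limit the available ILP.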

Limits to Superscalar Performance
• Complexity of the logic needed to remove dependencies
  – Designers predicted 8-way and 16-way superscalars
  – We have 6-way superscalars, and m is not likely to grow
Baer p. 76

Limits to Superscalar Performance
Number of forwarding paths:
• 1-way: [figure: forwarding paths in a 1-way pipeline]
• 2-way: [figure: forwarding paths in a 2-way pipeline]
• m-way requires m² paths
• Paths may become too long for signal propagation within a single clock cycle
Baer p. 76

Limits to Clock Cycle Reduction
• Power dissipation increases with frequency
• Pipeline registers are read and written in every cycle
  – The time to access a pipeline register imposes a lower bound on the duration of a pipeline stage
Baer p. 76

Limits on Pipeline Length
• Speculative actions (e.g., branch prediction) are resolved later in a longer pipeline
  – Recovery from misspeculation is delayed
• 14-stage pipeline: branch misprediction penalty of 10 cycles
• 31-stage pipeline: branch misprediction penalty of 20 cycles
Baer p. 76

Why the Multicore Revolution?
• Power dissipation: linear growth with clock frequency
  – Cannot make single cores faster
• Moore’s Law: the number of transistors on a chip continues its exponential growth
  – What to do with the extra logic?
• Design complexity: extracting more performance from a single core requires extreme design complexity
  – What to do with the extra logic?
Baer p. 77

Speed Demons vs. Brainiacs
• Speed demon: DEC Alpha, in-order superscalar (1994)
• Brainiac: Pentium III, out-of-order superscalar (1999)
  – register renaming
  – reorder buffer
  – reservation stations
Baer p. 77

Out-of-Order and Memory Hierarchy
• Question: does out-of-order execution help hide memory latencies?
• Short answer: no.
  – Latencies of 100 cycles or more are too long: they fill up all internal queues and stall the pipeline
  – Latencies around 100 cycles are too short to justify context switching
• Solution: hardware for several contexts to enable fast context switching → multithreading
Baer p. 78

DEC Alpha 21164
• 4-way in-order RISC
• Miss Address File: merges outstanding misses to the same L2 line
[Figure: block diagram of the 21164; surviving labels: "virtually indexed", "32", "64-bit", "Instruction Buffer", "32"]
Baer p. 79

21164 Instruction Pipeline
Front-end stages, in order (S0 to S3 in the slides that follow):
• Brings 4 instructions from the I-cache (accesses the I-cache and I-TLB in parallel)
• Performs branch prediction and calculates the branch target
• Slotting stage: steers instructions to units; resolves static conflicts
• Resolves dynamic conflicts; schedules forwarding and stalls
Resources:
• Integer pipe 1: shifter and multiplier
• Integer pipe 2: branches
• 48-entry I-TLB
• 64-entry D-TLB
Baer p. 79-80

Example
  i1: R1 ← R2 + R3    # Use integer pipeline 1
  i2: R4 ← R1 - R5    # Use integer pipeline 2
  i3: R7 ← R8 - R9    # Requires an integer pipeline
  i4: F0 ← F2 + F4    # Floating point add
  i5 to i12: assume no structural or data hazard for these instructions
Baer p. 81

Front-end Occupancy
  i1: R1 ← R2 + R3
  i2: R4 ← R1 - R5
  i3: R7 ← R8 - R9
  i4: F0 ← F2 + F4
[Figure sequence: occupancy of front-end stages S0 to S3 and of the backend at times t0+1 through t0+5]
• t0+1: i1 to i4 are in S1; i5 to i8 are in S0
• t0+2: i1 to i4 are in S2; i5 to i8 are in S1; i9 to i12 are in S0
• t0+3: i1 and i2 move to S3
  – i3 cannot move to S3 because of a resource conflict (there are only two integer pipelines)
  – i4 does not move to S3, to preserve program order (it is blocked by i3)
• t0+4: i2 cannot move to the backend because of its RAW dependency on i1
• t0+5: the front end advances and i13 to i16 are fetched
Baer p. 82

Backend
• Begins L1 D-cache and D-TLB accesses
• Decides hit/miss in the L1 D-cache and D-TLB
  – Hit: forward data (if needed); write to an integer or FP register
  – Miss: start access to L2
• Data available if hit in L2
Baer p. 82

Scoreboard Speculation
Example: a load L and a dependent use U reach S3 at cycle t
• If the load hits in the L1 cache, then schedule L at t+1 and U at t+3
• The scoreboard assumes that the load is a hit
• Whether it is a hit or a miss is known only later, at the backend stage marked on the slide's pipeline figure
• If it is a miss, abort any dependent instruction already issued
Baer p. 82

Can the Compiler Help Performance? (Example)
  i1: R1 ← Mem[R2]
  i2: R4 ← R1 + R3
  i3: R5 ← R1 + R6
  i4: R7 ← R4 + R5
Assume that all four instructions are in the issuing slot (stage S2) at time t.

Compiler Effect (before optimization)
  i1: R1 ← Mem[R2]
  i2: R4 ← R1 + R3
  i3: R5 ← R1 + R6
  i4: R7 ← R4 + R5
[Figure sequence: occupancy of stages S0 to S3 and of the backend at times t+1 through t+6]
• t+1: i3 cannot advance to S3 because of a structural hazard: the load in i1 uses an integer pipe to compute its address
• t+2 to t+3: i2 cannot advance because of its RAW dependency on i1; at t+3 the load continues execution in the backend (2-cycle latency)
• t+5: i4 cannot advance because of its RAW dependency on i3
• t+6: i4 advances to execution, and it is the only integer instruction executing in that cycle
Baer p. 82

After Compiler Optimization
  i1:  R1 ← Mem[R2]
  i1’: integer nop
  i2:  R4 ← R1 + R3
  i3:  R5 ← R1 + R6
  i4:  R7 ← R4 + R5
[Figure sequence: occupancy of stages S0 to S3 and of the backend at times t+1 through t+6]
• t+1: two integer instructions advance to S3
• t+2 to t+4: the load in i1 still needs two cycles to execute
• t+5: i2 and i3 can advance to the backend together; there is no dependency between them
• t+6: i4 still advances to the backend at t+6, but now i5 could advance along with i4
* The textbook says that i4 would advance to the backend at t+5.
Baer p. 82

Scoreboarding
“Scoreboarding allows instructions to execute out of order when there are sufficient resources and no data dependences.”
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, p. A-69.

Another scoreboarding


Scoreboarding
• Thornton algorithm (scoreboarding): CDC 6600 (1964)
  – A single unit (the scoreboard) monitors the progress of instruction execution and the status of all registers
• Tomasulo’s algorithm: IBM 360/91 (1967)
  – Reservation stations buffer operands and results
  – A Common Data Bus (CDB) distributes results directly to functional units
Some of this material is from Prof. Vojin G. Oklobdzija’s tutorial at ISSCC ’97.
Baer p. 81

CDC 6600
[Figure: CDC 6600 organization, with functional units labelled Group I, Group III and Group IV]
• Not shown: branch unit that modifies the PC
Baer p. 86

CDC 6600 Scoreboard Operation
• Issue
  – Free functional unit? If no, stall.
  – WAW hazard? If yes, stall; if no, issue.
• Dispatch
  – Mark the execution unit busy
  – Operands ready? If no, stall; if yes, read the operands.
• Execute
  – Execution complete? If no, stall; if yes, notify the scoreboard that the unit is ready to write its result.
• Write result
  – WAR hazard? If yes, stall; if no, write.
  – WAR example:
      i0: DIV.D F0, F2, F4
      i1: ADD.D F10, F8
      i2: SUB.D F8, F14
    The write of i2 has to stall until i1 has read F8.
Baer p. 86-87
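
The stage checks above can be expressed as three predicates over a small amount of shared state. The sketch below is an illustration under stated assumptions, not the CDC 6600 logic: the classes `Instr` and `Scoreboard` and all field names are hypothetical, there is no notion of cycles or latencies, and a unit is reserved as soon as an instruction issues to it (the slides' example sets the busy flag one step later, at dispatch).

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unit: str                # functional unit the instruction needs
    dest: str                # result register
    srcs: tuple              # source registers
    waiting_on: tuple = ()   # per source: producing unit, or None if ready
    read_done: bool = False  # have the operands been read?


class Scoreboard:
    def __init__(self):
        self.busy = {}       # unit -> True while reserved (assumption: set at issue)
        self.producer = {}   # register -> unit that will write it
        self.in_flight = []  # issued instructions, in program order

    def can_issue(self, ins):
        # Issue: free functional unit and no WAW hazard on the result register.
        return not self.busy.get(ins.unit, False) and ins.dest not in self.producer

    def issue(self, ins):
        assert self.can_issue(ins)
        # Record pending producers of the sources before reserving the result.
        ins.waiting_on = tuple(self.producer.get(s) for s in ins.srcs)
        self.busy[ins.unit] = True
        self.producer[ins.dest] = ins.unit
        self.in_flight.append(ins)

    def operands_ready(self, ins):
        # Dispatch: no source is still being produced (no RAW hazard).
        return all(unit is None for unit in ins.waiting_on)

    def read_operands(self, ins):
        assert self.operands_ready(ins)
        ins.read_done = True

    def can_write(self, ins):
        # Write result: stall while an earlier instruction has yet to read
        # the register this instruction overwrites (WAR hazard).
        earlier = self.in_flight[: self.in_flight.index(ins)]
        return not any(ins.dest in e.srcs and not e.read_done for e in earlier)

    def write(self, ins):
        assert self.can_write(ins)
        self.in_flight.remove(ins)
        self.busy[ins.unit] = False
        del self.producer[ins.dest]
        # Wake up instructions that were waiting on this unit's result.
        for other in self.in_flight:
            other.waiting_on = tuple(
                None if unit == ins.unit else unit for unit in other.waiting_on
            )
```

With instructions such as `Instr("i2", "Mult2", "R6", ("R4", "R8"))` and `Instr("i3", "Adder", "R8", ("R2", "R12"))` issued in that order, `can_write` returns `False` for i3 until i2 has read R8, which is the WAR stall described in the write-result step.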

Scoreboarding Example
  i1: R4 ← R0 * R2      # Uses multiplier 1
  i2: R6 ← R4 * R8      # Uses multiplier 2
  i3: R8 ← R2 + R12     # Uses Adder
  i4: R4 ← R14 + R16    # Uses Adder
Baer p. 88

Scoreboarding Example (cycle by cycle)
  i1: R4 ← R0 * R2      # Uses multiplier 1
  i2: R6 ← R4 * R8      # Uses multiplier 2
  i3: R8 ← R2 + R12     # Uses Adder
  i4: R4 ← R14 + R16    # Uses Adder

Instructions in flight, as entered when each instruction issues (Fi: result register; Fj, Fk: source registers; Qj, Qk: units producing the sources; Rj, Rk: source-ready flags):

  Instr  Fi  Fj   Fk   Qj     Qk  Rj  Rk
  i1     R4  R0   R2              1   1
  i2     R6  R4   R8   Mult1      0   1
  i3     R8  R2   R12             1   1
  i4     R4  R14  R16             1   1

Instruction status per cycle (the source has no slide for cycle 7):

  Cycle  i1          i2          i3          i4
  1      issued
  2      dispatched  issued
  3      execute     issued      issued
  4      execute     issued      dispatched
  5      execute     issued      execute
  6      execute     issued      execute
  8      write       issued      execute
  9                  dispatched  write       issue

Notes:
• Cycle 1: R4 is reserved for Mult1
• Cycle 2: Mult1 becomes busy; R6 is reserved for Mult2
• Cycle 3: i2 cannot be dispatched because R4 is not available; R8 is reserved for the Adder. The slide flags the corresponding values in Table 3.2 (p. 88) of the textbook as wrong.
• Cycle 4: the Adder becomes busy; i4 cannot issue because (i) the Adder is busy and (ii) it has a WAW dependency on i1
• Cycle 5: no change in the unit and register tables
• Cycle 6: i3 asks for permission to write; permission is denied (WAR with i2)
• Cycle 8: i1 asks for permission to write; permission is granted
• Cycle 9: Mult1 is free; Mult2 and the Adder are busy; R4 is now reserved for the Adder
Baer p. 88

Register Renaming, Reorder Buffer, and Reservation Stations
• Difference between in-order and out-of-order execution:
  – When do instructions leave the front end?
    • In-order: WAR and WAW dependencies prevent dispatch
    • Out-of-order: register renaming avoids WAR and WAW dependencies
  – How are instructions processed in the backend?
    • Instructions can wait in reservation stations because of RAW dependencies or structural hazards
    • A reorder buffer imposes commitment in program order
Baer p. 89

Register Renaming (example)
  i1: R1 ← R2 / R3    # Takes a long time
  i2: R4 ← R1 + R5
  i3: R5 ← R6 + R7
  i4: R1 ← R8 + R9
• The registers that appear in the program are logical or architectural registers
• At the last stage of the front end, all registers are mapped to physical registers
• In-order: only i1 issues; the others are blocked by the RAW dependency
• Out-of-order: i3 and i4 can issue and finish execution while i1 executes
Baer p. 89

Renaming Process
Renaming stage: Ri ← Rj op Rk becomes Ra ← Rb op Rc, where
  Rb = Rename(Rj);
  Rc = Rename(Rk);
  Ra = freelist(first);
  Rename(Ri) = freelist(first);
  first ← next(first)
Baer p. 90
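
The renaming stage translates directly into a few lines of code. This is a minimal sketch under assumptions that are not in the slides: registers are represented by their integer indices, the rename map starts as the identity, there are 32 logical and 64 physical registers, and an empty freelist is not handled. The function name `rename` is illustrative.

```python
from collections import deque


def rename(program, num_logical=32, num_physical=64):
    """Rename a list of (dest, src1, src2) register indices.

    Returns the renamed instructions and the final rename map.
    """
    rename_map = {r: r for r in range(num_logical)}     # Rename(Ri), identity at start
    freelist = deque(range(num_logical, num_physical))  # R32, R33, ...
    renamed = []
    for dest, src1, src2 in program:
        rb = rename_map[src1]    # Rb = Rename(Rj)
        rc = rename_map[src2]    # Rc = Rename(Rk)
        ra = freelist.popleft()  # Ra = freelist(first); first <- next(first)
        rename_map[dest] = ra    # Rename(Ri) = Ra
        renamed.append((ra, rb, rc))
    return renamed, rename_map


# The four-instruction example used in these slides:
# i1: R1 <- R2/R3, i2: R4 <- R1+R5, i3: R5 <- R6+R7, i4: R1 <- R8+R9
PROGRAM = [(1, 2, 3), (4, 1, 5), (5, 6, 7), (1, 8, 9)]
RENAMED, RENAME_MAP = rename(PROGRAM)

if __name__ == "__main__":
    for dest, src1, src2 in RENAMED:
        print(f"R{dest} <- R{src1} op R{src2}")
```

The sources are looked up before the destination is remapped, so an instruction that reads and writes the same logical register still reads the old physical register.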

Register Renaming (example)
  i1: R32 ← R2 / R3     (was R1 ← R2 / R3)
  i2: R33 ← R32 + R5    (was R4 ← R1 + R5)
  i3: R34 ← R6 + R7     (was R5 ← R6 + R7)
  i4: R35 ← R8 + R9     (was R1 ← R8 + R9)
Freelist = {R32, R33, R34, R35, R36, …}
Rename table:
  Ri          R1              R2  R3  R4        R5        R6  R7  R8  R9
  Rename(Ri)  R1 → R32 → R35  R2  R3  R4 → R33  R5 → R34  R6  R7  R8  R9
• How about i3: can it write into R5 before i1 and i2 complete? If i1 generates an exception, what will be the value of R5 in the exception state?
• i4 will finish execution before i1. Can we allow it to write its result to R1 before i1?
Baer p. 90

Reorder Buffer
• Even though we allow out-of-order execution, we require in-order completion.
• A reorder buffer (ROB) ensures that the results produced by instructions are committed to the logical registers in program order.
Baer p. 91

Reorder Buffer (cont.)
• Each entry in the ROB has the following fields:
  – flag: has the instruction completed?
  – value: value computed by the instruction
  – result register name: logical register
  – instruction type: arithmetic/load/store/branch/…
• Each instruction that has its destination register renamed is entered in the ROB
Baer p. 91
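
The four ROB fields and the in-order commit rule can be sketched as follows. This is an illustrative sketch, not the design of any particular processor: the class names `RobEntry` and `ReorderBuffer`, the `instr` label, and the dictionary used as the committed register file are all assumptions, and exceptions and branch recovery are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    instr: str                   # label, for illustration only
    reg: str                     # result register name (logical register)
    kind: str = "arith"          # instruction type: arith/load/store/branch/...
    ready: bool = False          # flag: has the instruction completed?
    value: Optional[int] = None  # value computed by the instruction


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # head at the left, tail at the right
        self.regfile = {}        # committed logical register state

    def allocate(self, instr, reg, kind="arith"):
        entry = RobEntry(instr, reg, kind)
        self.entries.append(entry)   # entered at the tail, in program order
        return entry

    def complete(self, entry, value):
        entry.value = value          # execution may finish out of order
        entry.ready = True

    def commit(self):
        """Retire completed entries from the head only; return their labels."""
        retired = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            self.regfile[entry.reg] = entry.value
            retired.append(entry.instr)
        return retired
```

If entries are allocated for i1 to i4 and only i3 and i4 have completed, `commit()` retires nothing, because the head entry (i1) is not ready. Once i1 and i2 complete, all four retire in program order, so the last write to a register that two instructions share comes from the younger one.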

Reorder buffer contents for the renamed example (the Head and Tail pointers delimit entries i1 to i4):

  Instruction  Flag       Value  Reg. Name  Type
  i1           Not Ready  None   R1         Arit
  i2           Not Ready  None   R4         Arit
  i3           Not Ready  None   R5         Arit
  i4           Not Ready  None   R1         Arit

The Value field is None until an instruction completes; the slide's animation then shows Some for completed entries.
Baer p. 92

But…
• Where do instructions wait before being executed?
• How does an instruction know that it is ready to be executed?
Baer p. 93

Reservation Stations
• After register renaming, the front end dispatches the instruction to a reservation station.
• Reservation stations can:
  – be grouped into a centralized queue called an instruction window
  – be associated with functional units according to the opcode
Baer p. 93

Reservation Stations (cont.)
• Each entry in a reservation station must contain:
  – the operation to be performed
  – the source operands (either the value or the physical name of the register); a flag indicates which one
  – the physical name of the result register
  – the ROB entry where the result will be stored
Baer p. 93

Scheduling
• Scheduling: selecting which instruction should execute next in a given execution unit. Possible policies:
  – oldest instruction first
  – critical instruction first
Baer p. 93

Ready Bit
• A ready bit is associated with each physical register.
• When an instruction that uses a physical register Ri is dispatched:
  – if Ri is ready, pass the value of Ri to the reservation station and set the flag to true (ready)
  – if Ri is not ready, pass the name of Ri to the reservation station and set the flag to false (not ready)
• When both flags are true, the instruction is ready to be issued.
Baer p. 93

Ready Bit (cont.)
• Upon completion, an instruction broadcasts the name and content of its result physical register to all reservation stations (RS).
  – Each RS that needs it grabs the content and updates its flags.
Baer p. 93
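
The dispatch rule and the completion broadcast can be sketched together. The code below is a hedged illustration, not a model of a real machine: `RSEntry`, `Machine` and their fields are hypothetical names, the ROB entry field is omitted, and the register values in the usage note are made up.

```python
from dataclasses import dataclass, field


@dataclass
class RSEntry:
    op: str                                       # operation to be performed
    dest: str                                     # physical name of the result register
    operands: list = field(default_factory=list)  # a value, or a physical register name
    flags: list = field(default_factory=list)     # True: the operand holds a value

    def ready(self):
        # Ready to be issued when every source flag is true.
        return all(self.flags)


class Machine:
    def __init__(self):
        self.values = {}      # physical register -> value
        self.ready_bit = {}   # physical register -> ready bit
        self.stations = []    # all reservation station entries

    def dispatch(self, op, dest, sources):
        entry = RSEntry(op, dest)
        for reg in sources:
            if self.ready_bit.get(reg, False):
                entry.operands.append(self.values[reg])  # pass the value
                entry.flags.append(True)
            else:
                entry.operands.append(reg)               # pass the name
                entry.flags.append(False)
        self.ready_bit[dest] = False   # the result has not been produced yet
        self.stations.append(entry)
        return entry

    def broadcast(self, reg, value):
        """On completion: send (name, content) to every reservation station."""
        self.values[reg] = value
        self.ready_bit[reg] = True
        for entry in self.stations:
            for k, (operand, flag) in enumerate(zip(entry.operands, entry.flags)):
                if not flag and operand == reg:
                    entry.operands[k] = value   # grab the content
                    entry.flags[k] = True       # update the flag
```

For example, after marking R2, R3 and R5 ready with some values, dispatching `("div", "R32", ["R2", "R3"])` yields an entry that is ready at once, while a following `("add", "R33", ["R32", "R5"])` holds the name R32 with a false flag until `broadcast("R32", value)` is called.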