CPE 631: ILP, Dynamic Exploitation
Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenković, milenka@ece.uah.edu, http://www.ece.uah.edu/~milenka

Outline
• Instruction Level Parallelism (ILP)
• Recap: Data Dependencies
• Extended MIPS Pipeline and Hazards
• Dynamic scheduling with a scoreboard

ILP: Concepts and Challenges
• ILP (Instruction Level Parallelism): overlap execution of unrelated instructions
• Techniques that increase the amount of parallelism exploited among instructions:
    • reduce impact of data and control hazards
    • increase processor ability to exploit parallelism
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
    • Reducing any term on the right-hand side lowers the CPI and thus increases instruction throughput
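
The CPI decomposition above can be sketched as a small calculation. This is only an illustration: the function name `pipeline_cpi` and all stall rates are hypothetical values chosen for the example, not measurements from the course.

```python
def pipeline_cpi(ideal_cpi, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per
    instruction contributed by each hazard class."""
    return ideal_cpi + structural + raw + war + waw + control


# Hypothetical stall rates, in stall cycles per instruction.
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.30, control=0.15)

# Suppose a better branch scheme removes the control stalls entirely.
improved = pipeline_cpi(1.0, structural=0.05, raw=0.30)

# With the same clock and instruction count, speedup is the CPI ratio.
speedup = baseline / improved
```

Because every term is additive, removing stalls from any one hazard class lowers the CPI by exactly that term, which is why each technique in the deck targets a specific term.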

Two approaches to exploit parallelism
• Dynamic techniques: largely depend on hardware to locate the parallelism
• Static techniques: rely on software

Techniques to exploit parallelism
Technique (section in the textbook) | Reduces
Forwarding and bypassing (A.2) | Data hazard (DH) stalls
Delayed branches (A.2) | Control hazard (CH) stalls
Basic dynamic scheduling (A.8) | DH stalls (RAW)
Dynamic scheduling with register renaming (3.2) | WAR and WAW stalls
Dynamic branch prediction (3.4) | CH stalls
Issuing multiple instructions per cycle (3.6) | Ideal CPI
Speculation (3.7) | Data and control stalls
Dynamic memory disambiguation (3.2, 3.7) | RAW stalls with memory
Loop unrolling (4.1) | CH stalls
Basic compiler pipeline scheduling (A.2, 4.1) | DH stalls
Compiler dependence analysis (4.4) | Ideal CPI, DH stalls
Software pipelining and trace scheduling (4.3) | Ideal CPI and DH stalls
Compiler speculation (4.4) | Ideal CPI, and D/CH stalls

Where to look for ILP?
• Amount of parallelism available within a basic block (BB)
    • BB: straight-line code sequence with no branches in except to the entry, and no branches out except at the exit
    • Example: gcc (GNU C Compiler): 17% control transfer, i.e. 5 or 6 instructions + 1 branch
    • Dependencies => amount of parallelism in a basic block is likely to be much less than 5 => look beyond a single block to get more instruction level parallelism
• Simplest and most common way to increase the amount of parallelism among instructions is to exploit parallelism among iterations of a loop => Loop Level Parallelism
      for (i = 1; i <= 1000; i++) x[i] = x[i] + s;
• Vector Processing: see Appendix G

Definition: Data Dependencies
• Data dependence: instruction j is data dependent on instruction i if either of the following holds:
    • Instruction i produces a result used by instruction j, or
    • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• If dependent, cannot execute in parallel
• Try to schedule to avoid hazards
• Easy to determine for registers (fixed names)
• Hard for memory ("memory disambiguation"):
    • Does 100(R4) = 20(R6)?
    • From different loop iterations, does 20(R6) = 20(R6)?
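
The two-part definition above (direct use of a result, or a chain of such uses) can be sketched for register operands as follows. This is a minimal sketch under stated assumptions: the `Instr` tuple and the function names are hypothetical, and memory dependences are ignored because, as the slide notes, they need address disambiguation.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register (or None) and source registers.
Instr = namedtuple("Instr", "op dest srcs")


def direct_raw(program):
    """Pairs (i, j) where instruction j reads a register last written by instruction i."""
    deps = set()
    last_writer = {}
    for j, ins in enumerate(program):
        for src in ins.srcs:
            if src in last_writer:
                deps.add((last_writer[src], j))
        if ins.dest is not None:
            last_writer[ins.dest] = j
    return deps


def data_dependent(program):
    """Transitive closure: j depends on i directly or through a chain i -> k -> j."""
    deps = set(direct_raw(program))
    changed = True
    while changed:
        changed = False
        for (i, k) in list(deps):
            for (k2, j) in list(deps):
                if k == k2 and (i, j) not in deps:
                    deps.add((i, j))
                    changed = True
    return deps


loop_body = [
    Instr("L.D",    "F0", ("R1",)),       # 0: load array element
    Instr("ADD.D",  "F4", ("F0", "F2")),  # 1: add scalar in F2
    Instr("S.D",    None, ("F4", "R1")),  # 2: store result
    Instr("DADDUI", "R1", ("R1",)),       # 3: decrement pointer
    Instr("BNE",    None, ("R1", "R2")),  # 4: loop branch
]

direct = direct_raw(loop_body)
closure = data_dependent(loop_body)
```

Here the store depends on the load only through the add, which is the second clause of the definition.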

Examples of Data Dependencies
    Loop: L.D    F0, 0(R1)     ; F0 = array element
          ADD.D  F4, F0, F2    ; add scalar in F2
          S.D    0(R1), F4     ; store result
          DADDUI R1, R1, #-8   ; decrement pointer
          BNE    R1, R2, Loop  ; branch if R1 != R2

Definition: Name Dependencies
• Two instructions use the same name (register or memory location) but don't exchange data
    • Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
    • Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved
• If dependent, can't execute in parallel
• Renaming removes name dependencies (no data flows between the instructions, so only the name has to change)
• Again, name dependencies are hard for memory accesses:
    • Does 100(R4) = 20(R6)?
    • From different loop iterations, does 20(R6) = 20(R6)?
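
The three dependence kinds can be told apart by comparing destination and source registers. The sketch below assumes a hypothetical `(dest, sources)` encoding and a hypothetical function name `hazards`; it covers registers only. The example operands mirror a loop body that reuses F0 and F4 in consecutive copies.

```python
def hazards(first, second):
    """Classify how `second` depends on the earlier instruction `first`.
    Each instruction is (dest, sources); dest may be None."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")   # true data dependence: second reads what first writes
    if d2 is not None and d2 in s1:
        found.add("WAR")   # antidependence: second overwrites what first reads
    if d1 is not None and d1 == d2:
        found.add("WAW")   # output dependence: both write the same name
    return found


first_load = ("F0", ("R1",))        # L.D   F0, 0(R1)
add_d = ("F4", ("F0", "F2"))        # ADD.D F4, F0, F2
second_load = ("F0", ("R1",))       # L.D   F0, -8(R1)

raw = hazards(first_load, add_d)
war = hazards(add_d, second_load)
waw = hazards(first_load, second_load)
```

Only the RAW case carries a value between the two instructions; the WAR and WAW cases exist solely because the name F0 is reused.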

Where are the name dependencies?
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4      ; drop DSUBUI & BNEZ
          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    -8(R1), F4
          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    -16(R1), F4
          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    -24(R1), F4
          DSUBUI R1, R1, #32    ; alter to 4*8
          BNEZ   R1, Loop
          NOP
How can we remove them?

Where are the name dependencies?
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4      ; drop DSUBUI & BNEZ
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    -8(R1), F8
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    -16(R1), F12
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    -24(R1), F16
          DSUBUI R1, R1, #32    ; alter to 4*8
          BNEZ   R1, Loop
          NOP
The original "register renaming"
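
The renaming shown on this slide is done by hand (or by a compiler) with architectural registers. The same idea can be sketched mechanically: give every write a fresh register and redirect later reads to it. The function name `rename` and the register pool P0..P3 are hypothetical, and this is only an illustration, not the mechanism of any particular processor.

```python
def rename(program, free_regs):
    """Give every register write a fresh name and redirect later reads to it.
    program: list of (op, dest, srcs); free_regs: iterable of unused register names."""
    free = iter(free_regs)
    current = {}            # architectural name -> most recent renamed register
    out = []
    for op, dest, srcs in program:
        # Sources are looked up before the destination is renamed,
        # so an instruction such as ADD R1, R1, R2 still reads the old R1.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = dest
        if dest is not None:
            new_dest = next(free)
            current[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


two_iterations = [
    ("L.D",   "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D",   None, ("F4", "R1")),
    ("L.D",   "F0", ("R1",)),        # reuses F0: WAR and WAW with the first copy
    ("ADD.D", "F4", ("F0", "F2")),   # reuses F4
    ("S.D",   None, ("F4", "R1")),
]

renamed = rename(two_iterations, ["P0", "P1", "P2", "P3"])
dests = [d for _, d, _ in renamed if d is not None]
```

After renaming, every destination is unique, so only the true (RAW) dependences within each copy remain.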

Definition: Control Dependencies
• Example: if p1 {S1;}; if p2 {S2;};
  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
• Two constraints on control dependences:
    • An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
    • An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
    L: DADDU R5, R6, R7
       ADD   R1, R2, R3
       BEQZ  R4, L
       SUB   R1, R5, R6
       OR    R7, R1, R8

Dynamically Scheduled Pipelines

Overcoming Data Hazards with Dynamic Scheduling
• Why in HW at run time?
    • Works when real dependences cannot be known at compile time
    • Simpler compiler
    • Code for one machine runs well on another
• Example:
      DIV.D F0, F2, F4
      ADD.D F10, F0, F8
      SUB.D F12, F8, F12
  SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything!
• Key idea: allow instructions behind a stall to proceed

Overcoming Data Hazards with Dynamic Scheduling (cont’d)
• Enables out-of-order execution => out-of-order completion
• Out-of-order execution divides the ID stage:
    1. Issue: decode instructions, check for structural hazards
    2. Read operands: wait until no data hazards, then read operands
• Scoreboarding: technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies (CDC 6600, 1963)

Scoreboarding Implications
• Out-of-order completion => WAR, WAW hazards?
      WAR on F8:                  WAW on F10:
      DIV.D F0, F2, F4            DIV.D F0, F2, F4
      ADD.D F10, F0, F8           ADD.D F10, F0, F8
      SUB.D F8, F8, F12           SUB.D F10, F8, F12
• Solutions for WAR:
    • Queue both the operation and copies of its operands
    • Read registers only during the Read Operands stage
• For WAW, must detect the hazard: stall until the other instruction completes
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control
• ID1: Issue: decode instructions & check for structural hazards
• ID2: Read operands: wait until no data hazards, then read operands
• EX: Execute: operate on operands; when the result is ready, the unit notifies the scoreboard that it has completed execution
• WB: Write results: finish execution; the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction
      DIV.D F0, F2, F4
      ADD.D F10, F0, F8
      SUB.D F8, F8, F12
  Scoreboarding stalls the SUB.D in its write result stage until ADD.D reads its operands

Four Stages of Scoreboard Control
1. Issue: decode instructions & check for structural hazards (ID1)
    • If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
    • If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2)
    • A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
    • When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
    • The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control
3. Execution: operate on operands (EX)
    • The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
    • Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.
    • Example:
          DIV.D F0, F2, F4
          ADD.D F10, F0, F8
          SUB.D F8, F8, F14
      The CDC 6600 scoreboard would stall SUB.D until ADD.D reads its operands

Three Parts of the Scoreboard
1. Instruction status: which of 4 steps the instruction is in (capacity = window size)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
    • Busy: indicates whether the unit is busy or not
    • Op: operation to perform in the unit (e.g., + or -)
    • Fi: destination register
    • Fj, Fk: source-register numbers
    • Qj, Qk: functional units producing source registers Fj, Fk
    • Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register

MIPS with a Scoreboard
[Figure: Registers; functional units FP Mult, FP Div, Add1, Add2, Add3; Scoreboard with control/status lines]

Detailed Scoreboard Pipeline Control
Instruction status | Wait until | Bookkeeping
Issue | Not Busy(FU) and not Result(D) | Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands | Rj and Rk | Rj ← No; Rk ← No
Execution complete | Functional unit done | (none)
Write result | ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No)) | ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
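
The wait conditions and bookkeeping in the table above can be sketched as a small class. This is a sketch under stated assumptions, not the CDC 6600 design: the class and method names are hypothetical, there is no timing model, only instructions with a destination register are handled, and clearing Qj/Qk when operands are read is an added assumption. The example replays the DIV.D/ADD.D/SUB.D sequence, with unit names chosen only for illustration.

```python
class Scoreboard:
    """Bookkeeping only: functional unit status plus register result status."""

    def __init__(self, units):
        self.fu = {u: dict(busy=False, op=None, Fi=None, Fj=None, Fk=None,
                           Qj=None, Qk=None, Rj=False, Rk=False) for u in units}
        self.result = {}    # register -> unit that will write it

    def can_issue(self, unit, dest):
        # Unit free (no structural hazard) and no pending write to dest (no WAW).
        return not self.fu[unit]["busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        assert self.can_issue(unit, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit].update(busy=True, op=op, Fi=dest, Fj=s1, Fk=s2,
                             Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        self.result[dest] = unit

    def can_read(self, unit):
        # Both operands ready: this is where RAW hazards are resolved.
        return self.fu[unit]["Rj"] and self.fu[unit]["Rk"]

    def read_operands(self, unit):
        assert self.can_read(unit)
        # Assumption: Qj/Qk are cleared here so stale producer tags cannot match later.
        self.fu[unit].update(Rj=False, Rk=False, Qj=None, Qk=None)

    def can_write(self, unit):
        # WAR check: no busy unit still has to read the old value of Fi.
        fi = self.fu[unit]["Fi"]
        return all((f["Fj"] != fi or not f["Rj"]) and (f["Fk"] != fi or not f["Rk"])
                   for f in self.fu.values() if f["busy"])

    def write_result(self, unit):
        assert self.can_write(unit)
        for f in self.fu.values():
            if f["Qj"] == unit:
                f["Rj"] = True
            if f["Qk"] == unit:
                f["Rk"] = True
        del self.result[self.fu[unit]["Fi"]]
        self.fu[unit].update(busy=False, op=None, Fi=None, Fj=None, Fk=None,
                             Qj=None, Qk=None, Rj=False, Rk=False)


sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "DIV.D", "F0", "F2", "F4")
sb.issue("Add1", "ADD.D", "F10", "F0", "F8")
sb.issue("Add2", "SUB.D", "F8", "F8", "F14")
sb.read_operands("Div")
sb.read_operands("Add2")
war_stall = not sb.can_write("Add2")   # ADD.D has not read F8 yet
sb.write_result("Div")                 # F0 becomes available to ADD.D
sb.read_operands("Add1")
war_cleared = sb.can_write("Add2")     # ADD.D has read F8, so SUB.D may write
```

The sequence reproduces the behavior described on the slides: SUB.D finishes early but is held in write result until ADD.D has read F8.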

Scoreboard Example
[Figures: scoreboard state tables, one per clock cycle; slide annotations follow]
• Cycle 1: Issue 1st L.D!
• Cycle 2: Issue 2nd L.D? Structural hazard! No further instructions will issue!
• Cycle 3: Issue MUL.D?
• Cycle 4: Check for WAR hazards! If none, write result!
• Cycle 5: Issue 2nd L.D!
• Cycle 6: Issue MUL.D!
• Cycle 7: Issue SUB.D!
• Cycle 8: Issue DIV.D!
• Cycle 9: Read operands for MUL.D and SUB.D! Assume we can feed the Mult1 and Add units in the same clock cycle. Issue ADD.D? Structural hazard (unit is busy)!
• Cycle 11: Last cycle of SUB.D execution.
• Cycle 12: Check WAR on F8. Write F8.
• Cycle 13: Issue ADD.D!
• Cycle 14: Read operands for ADD.D!
• Cycles 15, 16: [figure only]
• Cycle 17: Why cannot write F6?
• Cycles 19, 20, 21: [figure only]
• Cycle 22: Write F6?
• Cycles 61, 62: [figure only]

Scoreboard Results
• For the CDC 6600:
    • 70% improvement for Fortran
    • 150% improvement for hand-coded assembly language
    • cost was similar to one of the functional units
        • surprisingly low
        • bulk of cost was in the extra busses
• Still, this was in ancient times:
    • no caches & no main semiconductor memory
    • no software pipelining
    • compilers?
• So, why is it coming back?
    • performance via ILP

Scoreboard Limitations
• Amount of parallelism among instructions
    • can we find independent instructions to execute?
• Number of scoreboard entries
    • how far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)
• Number and types of functional units
    • avoid structural hazards
• Presence of antidependences and output dependences
    • WAR and WAW stalls become more important

Things to Remember
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
• Data dependencies
• Dynamic scheduling to minimise stalls
• Dynamic scheduling with a scoreboard


Tomasulo’s Algorithm
• Used in IBM 360/91 FPU (before caches)
• Goal: high FP performance without special compilers
• Conditions:
    • Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
    • Long memory accesses and long FP delays
    • This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
• Why study a 1966 computer?
• The descendants of this have flourished!
    • Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

Tomasulo’s Algorithm (cont’d)
• Control & buffers distributed with Function Units (FU)
    • FU buffers called "reservation stations" => buffer the operands of instructions waiting to issue
• Registers in instructions replaced by values or pointers to reservation stations (RS) => register renaming
    • avoids WAR, WAW hazards
    • More reservation stations than registers, so can do optimizations compilers can't
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Tomasulo-based FPU for MIPS
[Figure: FP Op Queue (from instruction unit); FP Registers; Load Buffers Load1 to Load6 (from memory); Store Buffers Store1 to Store3 (to memory); Reservation Stations Add1 to Add3 feeding the FP adders and Mult1, Mult2 feeding the FP multipliers; Common Data Bus (CDB)]

Reservation Station Components
• Op: Operation to perform in the unit (e.g., + or -)
• Vj, Vk: Value of source operands
    • Store buffers have a V field: the result to be stored
• Qj, Qk: Reservation stations producing source registers (value to be written)
    • Note: Qj/Qk = 0 => source operand is already available in Vj/Vk
    • Store buffers only have Qi for the RS producing the result
• Busy: Indicates reservation station or FU is busy
• Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
    • If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
2. Execute: operate on operands (EX)
    • When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
    • Write it on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
    • 64 bits of data + 4 bits of Functional Unit source address
    • Write if matches expected Functional Unit (produces result)
    • Does the broadcast
• Example speed: 2 clocks for Fl. pt. +, -; 10 for *; 40 clks for /
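
The three stages above can be sketched with reservation stations that capture either a value (V) or a producer tag (Q), plus a broadcast that plays the role of the "come from" Common Data Bus. All class, method, and station names here are hypothetical, the register values are made up, and timing, load/store buffers, and the FP Op Queue are left out.

```python
class Station:
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.Vj = self.Vk = self.Qj = self.Qk = None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs = {n: Station(n) for n in station_names}
        self.regs = dict(regs)   # register -> value
        self.status = {}         # register -> station that will write it

    def _operand(self, reg):
        # Returns (value, tag); exactly one of the two is not None.
        if reg in self.status:
            return None, self.status[reg]
        return self.regs[reg], None

    def issue(self, name, op, dest, s1, s2):
        st = self.rs[name]
        assert not st.busy, "structural hazard: station in use"
        st.busy, st.op = True, op
        st.Vj, st.Qj = self._operand(s1)
        st.Vk, st.Qk = self._operand(s2)
        self.status[dest] = name   # renaming: dest now names this station

    def ready(self, name):
        st = self.rs[name]
        return st.busy and st.Qj is None and st.Qk is None

    def broadcast(self, name, value):
        """Write result: every waiting station matches on the source tag."""
        for st in self.rs.values():
            if st.Qj == name:
                st.Vj, st.Qj = value, None
            if st.Qk == name:
                st.Vk, st.Qk = value, None
        for reg, tag in list(self.status.items()):
            if tag == name:          # only the latest writer updates the register
                self.regs[reg] = value
                del self.status[reg]
        self.rs[name].busy = False


t = Tomasulo(["Mult1", "Add1", "Add2"],
             {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0})
t.issue("Mult1", "DIV.D", "F0", "F2", "F4")   # F0 = F2 / F4
t.issue("Add1", "ADD.D", "F6", "F0", "F8")    # waits on Mult1; copies F8 = 8.0 now
t.issue("Add2", "SUB.D", "F8", "F2", "F4")    # overwrites F8 later: no WAR stall
sub_ready = t.ready("Add2")
t.broadcast("Add2", 2.0 - 4.0)                # SUB.D completes first
t.broadcast("Mult1", 2.0 / 4.0)               # DIV.D result wakes up ADD.D
add_result = t.rs["Add1"].Vj + t.rs["Add1"].Vk
```

Because ADD.D copied the old value of F8 into its station at issue, SUB.D can write F8 early without corrupting it, which is the WAR case the scoreboard had to stall on.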

Tomasulo Example
[Figures: reservation station and register status tables, one per clock cycle. Callouts on the first figure: instruction stream; 3 load buffers; 3 FP adder reservation stations; 2 FP multiplier reservation stations; FU countdown; clock cycle counter. Slide annotations follow]
• Cycle 1: [figure only]
• Cycle 2: Note: can have multiple loads outstanding
• Cycle 3: Note: register names are removed ("renamed") in reservation stations; MULT issued. Load1 completing; what is waiting for Load1?
• Cycle 4: Load2 completing; what is waiting for Load2?
• Cycle 5: Timer starts counting down for Add1, Mult1
• Cycle 6: Issue ADDD here despite name dependency on F6?
• Cycle 7: Add1 (SUBD) completing; what is waiting for it?
• Cycles 8, 9: [figure only]
• Cycle 10: Add2 (ADDD) completing; what is waiting for it?
• Cycle 11: Write result of ADDD here? All quick instructions complete in this cycle!
• Cycles 12, 13, 14: [figure only]
• Cycle 15: Mult1 (MULTD) completing; what is waiting for it?
• Cycle 16: Just waiting for Mult2 (DIVD) to complete
• Cycle 55: [figure only]
• Cycle 56: Mult2 (DIVD) is completing; what is waiting for it?
• Cycle 57: Once again: in-order issue, out-of-order execution and out-of-order completion.

Tomasulo Drawbacks
• Complexity
    • delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 (in CA:AQA 2/e, but not in silicon!)
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
    • Each CDB must go to multiple functional units => high capacitance, high wiring density
    • Number of functional units that can complete per cycle limited to one!
        • Multiple CDBs => more FU logic for parallel associative stores
• Non-precise interrupts!
    • We will address this later

Tomasulo Loop Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
• This time assume Multiply takes 4 clocks
• Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
• To be clear, will show clocks for SUBI, BNEZ
    • Reality: integer instructions ahead of Fl. Pt. instructions
• Show 2 iterations

Loop Example
[Figures: reservation station, store buffer, and register status tables for cycles 1 to 20, one per clock cycle. Callouts on the first figure: iteration count; added store buffers; instruction loop; value of register used for address, iteration control. Slide annotations follow]
• Cycle 3: Implicit renaming sets up data flow graph
• Cycle 20: Once again: in-order issue, out-of-order execution and out-of-order completion.

Why can Tomasulo overlap iterations of loops?
• Register renaming
    • Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
    • Permit instruction issue to advance past integer control flow operations
    • Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
• Other perspective: Tomasulo builds the data flow dependency graph on the fly

Tomasulo’s scheme offers 2 major advantages
• (1) the distribution of the hazard detection logic
    • distributed reservation stations and the CDB
    • If multiple instructions are waiting on a single result, & each instruction has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
    • If a centralized register file were used, the units would have to read their results from the registers when register buses are available
• (2) the elimination of stalls for WAW and WAR hazards

Multiple Issue
• Allow multiple instructions to issue in a single clock cycle (CPI < 1)
• Two flavors:
    • Superscalar
        • Issue a varying number of instructions per clock
        • Can be statically (compiler tech.) or dynamically (Tomasulo) scheduled
    • VLIW (Very Long Instruction Word)
        • Issue a fixed number of instructions formatted as a single long instruction or as a fixed instruction packet

Multiple Issue with Dynamic Scheduling
[Figure: same Tomasulo-based FPU organization: FP Op Queue (from instruction unit); FP Registers; Load Buffers Load1 to Load6 (from memory); Store Buffers Store1 to Store3 (to memory); Reservation Stations Add1 to Add3 (FP adders) and Mult1, Mult2 (FP multipliers)]
• Issue: 2 instructions per clock cycle

Multiple Issue with Dynamic Scheduling: An Example
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDIU R1, R1, #-8
          BNE    R1, R2, Loop
• Assumptions: 2-issue processor: can issue any pair of instructions if reservation stations are available
• Resources: ALU (int + effective address), a separate pipelined FP unit for each operation type, branch prediction hardware, 1 CDB
• 2 cc for loads, 3 cc for FP Add
• Branches single issue, branch prediction is perfect

Execution in Dual-issue Tomasulo Pipeline
Iter. | Instruction | Issue | Exe. (begins) | Mem. access | Write at CDB | Comment
1 | L.D F0, 0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4, F0, F2 | 1 | 5 | - | 8 | Wait for L.D
1 | S.D 0(R1), F4 | 2 | 3 | 9 | - | Wait for ADD.D
1 | DADDIU R1, R1, #-8 | 2 | 4 | - | 5 | Wait for ALU
1 | BNE R1, R2, Loop | 3 | 6 | - | - | Wait for DADDIU
2 | L.D F0, 0(R1) | 4 | 7 | 8 | 9 | Wait for BNE
2 | ADD.D F4, F0, F2 | 4 | 10 | - | 13 | Wait for L.D
2 | S.D 0(R1), F4 | 5 | 8 | 14 | - | Wait for ADD.D
2 | DADDIU R1, R1, #-8 | 5 | 9 | - | 10 | Wait for ALU
2 | BNE R1, R2, Loop | 6 | 11 | - | - | Wait for DADDIU
3 | L.D F0, 0(R1) | 7 | 12 | 13 | 14 | Wait for BNE
3 | ADD.D F4, F0, F2 | 7 | 15 | - | 18 | Wait for L.D
3 | S.D 0(R1), F4 | 8 | 13 | 19 | - | Wait for ADD.D
3 | DADDIU R1, R1, #-8 | 8 | 14 | - | 15 | Wait for ALU
3 | BNE R1, R2, Loop | 9 | 16 | - | - | Wait for DADDIU

Multiple Issue with Dynamic Scheduling: Resource Usage
Clock | Int ALU | FP ALU | Data Cache | CDB
2 | 1/L.D | - | - | -
3 | 1/S.D | - | 1/L.D | -
4 | 1/DADDIU | - | - | 1/L.D
5 | - | 1/ADD.D | - | 1/DADDIU
6 | - | - | - | -
7 | 2/L.D | - | - | -
8 | 2/S.D | - | 2/L.D | 1/ADD.D
9 | 2/DADDIU | - | 1/S.D | 2/L.D
10 | - | 2/ADD.D | - | 2/DADDIU
11 | - | - | - | -
12 | 3/L.D | - | - | -
13 | 3/S.D | - | 3/L.D | 2/ADD.D
14 | 3/DADDIU | - | 2/S.D | 3/L.D
15 | - | 3/ADD.D | - | 3/DADDIU
16 | - | - | - | -
17 | - | - | - | -
18 | - | - | - | 3/ADD.D
19 | - | - | 3/S.D | -

Multiple Issue with Dynamic Scheduling
• DADDIU waits for the ALU used by S.D
    • Add one ALU dedicated to effective address calculation
    • Use 2 CDBs
• Draw table for the dual-issue version of Tomasulo’s pipeline

Multiple Issue with Dynamic Scheduling
Iter. | Instruction | Issue | Exe. (begins) | Mem. access | Write at CDB | Comment
1 | L.D F0, 0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4, F0, F2 | 1 | 5 | - | 8 | Wait for L.D
1 | S.D 0(R1), F4 | 2 | 3 | 9 | - | Wait for ADD.D
1 | DADDIU R1, R1, #-8 | 2 | 3 | - | 4 | Executes earlier
1 | BNE R1, R2, Loop | 3 | 5 | - | - | Wait for DADDIU
2 | L.D F0, 0(R1) | 4 | 6 | 7 | 8 | Wait for BNE
2 | ADD.D F4, F0, F2 | 4 | 9 | - | 12 | Wait for L.D
2 | S.D 0(R1), F4 | 5 | 7 | 13 | - | Wait for ADD.D
2 | DADDIU R1, R1, #-8 | 5 | 6 | - | 7 | Executes earlier
2 | BNE R1, R2, Loop | 6 | 8 | - | - | Wait for DADDIU
3 | L.D F0, 0(R1) | 7 | 9 | 10 | 11 | Wait for BNE
3 | ADD.D F4, F0, F2 | 7 | 12 | - | 15 | Wait for L.D
3 | S.D 0(R1), F4 | 8 | 10 | 16 | - | Wait for ADD.D
3 | DADDIU R1, R1, #-8 | 8 | 9 | - | 10 | Executes earlier
3 | BNE R1, R2, Loop | 9 | 11 | - | - | Wait for DADDIU

Multiple Issue with Dynamic Scheduling: Resource Usage
Clock | Int ALU | Adr. Adder | FP ALU | Data Cache | CDB#1 | CDB#2
2 | - | 1/L.D | - | - | - | -
3 | 1/DADDIU | 1/S.D | - | 1/L.D | - | -
4 | - | - | - | - | 1/L.D | 1/DADDIU
5 | - | - | 1/ADD.D | - | - | -
6 | 2/DADDIU | 2/L.D | - | - | - | -
7 | - | 2/S.D | - | 2/L.D | 2/DADDIU | -
8 | - | - | - | - | 1/ADD.D | 2/L.D
9 | 3/DADDIU | 3/L.D | 2/ADD.D | 1/S.D | - | -
10 | - | 3/S.D | - | 3/L.D | 3/DADDIU | -
11 | - | - | - | - | 3/L.D | -
12 | - | - | 3/ADD.D | - | 2/ADD.D | -
13 | - | - | - | 2/S.D | - | -
14 | - | - | - | - | - | -
15 | - | - | - | - | 3/ADD.D | -
16 | - | - | - | 3/S.D | - | -

What about Precise Interrupts?
• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
• Need to "fix" the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream

Hardware-based Speculation
- With wide-issue processors, control dependences become a burden, even with sophisticated branch predictors
- Speculation: speculate on the outcome of branches and execute the program as if our guesses were correct => need a mechanism to handle situations when the speculation was incorrect

Relationship between precise interrupts and speculation
- Speculation is a form of guessing
- Important for branch prediction:
  - Need to “take our best shot” at predicting branch direction
- If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly:
  - This is exactly the same as precise exceptions!
- Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

HW support for precise interrupts
- Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
  - 4 fields: instr. type, destination, value, ready
  - Use reorder buffer number instead of reservation station number when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be an operand source => more registers, like RS)
  - Instructions commit in order
  - Once an instruction commits, its result is put into the register
  - As a result, it is easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, reservation stations, FP adders]
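One way to picture the four ROB fields and the "operand source" role is the small Python sketch below. It is an illustration under stated assumptions, not the slides' design: the names `ROBEntry`, `read_operand`, `reg_file`, and `reg_status` are hypothetical, and the ROB is modelled as a dictionary keyed by reorder buffer number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    """The four fields named on the slide."""
    instr_type: str                  # e.g. "alu", "load", "store", "branch"
    dest: Optional[str]              # destination register name, if any
    value: Optional[float] = None    # result, valid only when ready is True
    ready: bool = False              # True once execution has completed


def read_operand(reg, reg_file, reg_status, rob):
    """Return ("value", v) if the operand is available, else ("tag", rob_no).

    reg_status maps a register to the ROB number of its pending writer
    (hypothetical structure); registers without a pending writer are read
    from the register file.
    """
    tag = reg_status.get(reg)
    if tag is None:
        return ("value", reg_file[reg])
    entry = rob[tag]
    if entry.ready:
        # Completed but not yet committed: the ROB supplies the operand.
        return ("value", entry.value)
    return ("tag", tag)  # wait for this ROB number on the CDB


reg_file = {"F0": 1.0, "F2": 2.0, "F4": 0.0}
rob = {0: ROBEntry("load", "F0", value=5.0, ready=True),
       1: ROBEntry("alu", "F4")}
reg_status = {"F0": 0, "F4": 1}

f0 = read_operand("F0", reg_file, reg_status, rob)
f2 = read_operand("F2", reg_file, reg_status, rob)
f4 = read_operand("F4", reg_file, reg_status, rob)
print(f0, f2, f4)
```

Here F0 comes from the ROB, F2 from the register file, and F4 is still pending, so the consumer receives a ROB number to wait on.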

Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
  - If a reservation station and a reorder buffer slot are free, issue instr & send operands & reorder buffer no. for destination (this stage is sometimes called “dispatch”)
2. Execution: operate on operands (EX)
  - When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
  - Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available
4. Commit: update register with reorder result
  - When the instr. is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instr. from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called “graduation”)
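The interaction between out-of-order write result and in-order commit can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the slides' hardware: the class `SpecCore` and its method names are hypothetical, the execution step is reduced to calling `write_result` with a precomputed value, and reservation stations and stores to memory are omitted.

```python
from collections import deque


class SpecCore:
    """Toy model of issue / write result / commit with a reorder buffer."""

    def __init__(self, regs):
        self.regs = dict(regs)   # architectural (committed) register state
        self.rob = deque()       # oldest instruction at the left (head)
        self.next_tag = 0

    def issue(self, kind, dest=None):
        """Step 1: allocate a ROB slot in program order; return its number."""
        tag = self.next_tag
        self.next_tag += 1
        self.rob.append({"tag": tag, "kind": kind, "dest": dest,
                         "value": None, "ready": False,
                         "mispredicted": False})
        return tag

    def write_result(self, tag, value=None, mispredicted=False):
        """Step 3: record a finished result; may happen in any order."""
        for entry in self.rob:
            if entry["tag"] == tag:
                entry.update(value=value, ready=True,
                             mispredicted=mispredicted)
                return
        # The entry was flushed earlier; the late result is dropped.

    def commit(self):
        """Step 4: retire ready entries from the head, in program order."""
        committed = []
        while self.rob and self.rob[0]["ready"]:
            entry = self.rob.popleft()
            committed.append(entry["tag"])
            if entry["kind"] == "branch" and entry["mispredicted"]:
                self.rob.clear()  # discard everything issued after the branch
                break
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
        return committed


core = SpecCore({"R1": 16, "F4": 0.0})
t_add = core.issue("alu", "F4")
t_br = core.issue("branch")
t_spec = core.issue("alu", "R1")       # issued past the unresolved branch

core.write_result(t_spec, 8)           # completes first, out of order
first = core.commit()                  # nothing retires: head is not ready

core.write_result(t_add, 3.5)
core.write_result(t_br, mispredicted=True)
second = core.commit()                 # ADD and BNE retire, then flush
print(first, second, core.regs)
```

In this run the speculative write to R1 never reaches the register state, because the mispredicted branch reaches the head of the buffer first and flushes the younger entry.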

What are the hardware complexities with reorder buffer (ROB)?
- How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
- Need as many ports on ROB as register file
[Figure: reorder table with fields Dest Reg, Result, Exceptions?, Valid, Program Counter; comparison network; FP Op Queue, Reorder Buffer, FP Regs, reservation stations, FP adders]
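The two lookup options can be contrasted with a short Python sketch. It is only an analogy under stated assumptions: the function names `latest_by_search` and `build_status` are hypothetical, the ROB is a plain list ordered oldest to newest, and the sequential loop stands in for what would be a parallel comparison network in hardware.

```python
def latest_by_search(rob, reg):
    """Associative search: newest matching entry wins (scan newest to oldest)."""
    for index in range(len(rob) - 1, -1, -1):
        if rob[index]["dest"] == reg:
            return index
    return None  # no pending writer: the register file holds the latest value


def build_status(rob):
    """Register result status table: register -> ROB number of latest writer.

    In hardware this table would be updated at issue time; here it is
    rebuilt from the ROB contents for illustration.
    """
    status = {}
    for index, entry in enumerate(rob):  # oldest to newest
        if entry["dest"] is not None:
            status[entry["dest"]] = index  # later writers overwrite earlier
    return status


# Two loop iterations in flight: F0 and F4 each have two pending writers.
rob = [
    {"op": "L.D", "dest": "F0"},
    {"op": "ADD.D", "dest": "F4"},
    {"op": "S.D", "dest": None},
    {"op": "L.D", "dest": "F0"},
    {"op": "ADD.D", "dest": "F4"},
]

status = build_status(rob)
print(latest_by_search(rob, "F0"), status.get("F0"))
print(latest_by_search(rob, "F2"), status.get("F2"))
```

Both approaches name the same ROB entry; the status table replaces the comparison across all entries with a single indexed read, at the cost of keeping the table up to date.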

Summary
- Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer unit gets ahead, beyond branches)
- Today, helps with cache misses as well
  - Don’t stall for L1 data cache miss (insufficient ILP for L2 miss?)
- Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264