CPE 631 ILP Dynamic Exploitation Electrical and Computer
- Slides: 114
CPE 631: ILP, Dynamic Exploitation
Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenković, milenka@ece.uah.edu, http://www.ece.uah.edu/~milenka
Outline
- Instruction Level Parallelism (ILP)
- Recap: Data Dependencies
- Extended MIPS Pipeline and Hazards
- Dynamic scheduling with a scoreboard
ILP: Concepts and Challenges
- ILP (Instruction Level Parallelism): overlap execution of unrelated instructions
- Techniques that increase the amount of parallelism exploited among instructions
  - reduce the impact of data and control hazards
  - increase the processor's ability to exploit parallelism
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
  - Reducing any term on the right-hand side lowers CPI and thus increases instruction throughput
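As a rough illustration of the CPI equation on this slide, the Python sketch below sums the terms for two stall profiles. The function name `pipeline_cpi` and every stall value are hypothetical numbers chosen for the example, not measurements from the course.

```python
def pipeline_cpi(ideal_cpi, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural + raw + war + waw + control


# Hypothetical stall contributions (cycles per instruction), for illustration only.
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.30, war=0.05, waw=0.05, control=0.20)

# Assume better scheduling cuts RAW stalls and renaming removes WAR/WAW stalls.
improved = pipeline_cpi(1.0, structural=0.05, raw=0.10, war=0.0, waw=0.0, control=0.20)

speedup = baseline / improved
print(f"baseline CPI = {baseline:.2f}, improved CPI = {improved:.2f}, speedup = {speedup:.2f}")
```

Under these assumed numbers the clock rate and instruction count are unchanged, so the ratio of the two CPI values is the throughput gain.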
Two approaches to exploit parallelism
- Dynamic techniques: largely depend on hardware to locate the parallelism
- Static techniques: rely on software
Techniques to exploit parallelism
Technique (section in the textbook): what it reduces
- Forwarding and bypassing (Section A.2): data hazard (DH) stalls
- Delayed branches (A.2): control hazard (CH) stalls
- Basic dynamic scheduling (A.8): DH stalls (RAW)
- Dynamic scheduling with register renaming (3.2): WAR and WAW stalls
- Dynamic branch prediction (3.4): CH stalls
- Issuing multiple instructions per cycle (3.6): ideal CPI
- Speculation (3.7): data and control stalls
- Dynamic memory disambiguation (3.2, 3.7): RAW stalls with memory
- Loop unrolling (4.1): CH stalls
- Basic compiler pipeline scheduling (A.2, 4.1): DH stalls
- Compiler dependence analysis (4.4): ideal CPI, DH stalls
- Software pipelining and trace scheduling (4.3): ideal CPI and DH stalls
- Compiler speculation (4.4): ideal CPI, and D/CH stalls
Where to look for ILP?
- Amount of parallelism available within a basic block (BB)
  - BB: straight-line code sequence with no branches in except at the entry and no branches out except at the exit
  - Example: gcc (GNU C Compiler): 17% control transfer, so 5 or 6 instructions + 1 branch
  - Dependencies => amount of parallelism in a basic block is likely to be much less than 5 => look beyond a single block to get more instruction-level parallelism
- Simplest and most common way to increase the amount of parallelism among instructions is to exploit parallelism among iterations of a loop => Loop-Level Parallelism
    for (i = 1; i <= 1000; i++) x[i] = x[i] + s;
- Vector processing: see Appendix G
Definition: Data Dependencies
- Data dependence: instruction j is data dependent on instruction i if either of the following holds
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
- If dependent, cannot execute in parallel
- Try to schedule to avoid hazards
- Easy to determine for registers (fixed names)
- Hard for memory ("memory disambiguation"):
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
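The memory disambiguation question on this slide comes down to comparing effective addresses, which are known only once the base registers hold their run-time values. A minimal Python sketch follows, assuming a hypothetical helper `same_location` and made-up register contents:

```python
def effective_address(offset, base, regs):
    """Effective address of offset(base), e.g. 100(R4)."""
    return offset + regs[base]


def same_location(a, b, regs):
    """a and b are (offset, base_register) pairs; True if they name the same address."""
    return effective_address(a[0], a[1], regs) == effective_address(b[0], b[1], regs)


# Hypothetical register contents, chosen only for illustration.
aliasing = same_location((100, "R4"), (20, "R6"), {"R4": 1000, "R6": 1080})
distinct = same_location((100, "R4"), (20, "R6"), {"R4": 1000, "R6": 1088})

# Same textual operand 20(R6) in two loop iterations, with R6 decremented by 8 in between.
across_iterations = effective_address(20, "R6", {"R6": 1080}) == effective_address(20, "R6", {"R6": 1072})
print(aliasing, distinct, across_iterations)
```

The point of the sketch is that the answer depends on register values, which is why a compiler often cannot decide it statically while hardware can check it at run time.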
Examples of Data Dependencies
    Loop: L.D    F0, 0(R1)      ; F0 = array element
          ADD.D  F4, F0, F2     ; add scalar in F2
          S.D    0(R1), F4      ; store result
          DADDUI R1, R1, #-8    ; decrement pointer
          BNE    R1, R2, Loop   ; branch if R1 != R2
Definition: Name Dependencies
- Two instructions use the same name (register or memory location) but don't exchange data
  - Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  - Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
- If dependent, can't execute in parallel
- Renaming removes name dependencies
- Again, name dependencies are hard for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
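The register cases of the three dependence types can be checked mechanically. The sketch below is one way to do it in Python; the `Instr` tuple and the `dependences` function are names invented for this illustration, and memory operands are deliberately ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # destination register, None for stores and branches
    srcs: Tuple[str, ...]    # source registers


def dependences(i: Instr, j: Instr) -> set:
    """Register dependences of a later instruction j on an earlier instruction i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")     # true data dependence: j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")     # antidependence: j overwrites what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")     # output dependence: both write the same register
    return found


# Instructions taken from the unrolled loop that reuses F0 and F4.
ld1 = Instr("L.D", "F0", ("R1",))
add1 = Instr("ADD.D", "F4", ("F0", "F2"))
ld2 = Instr("L.D", "F0", ("R1",))

print(dependences(ld1, add1))   # data dependence through F0
print(dependences(add1, ld2))   # antidependence on F0
print(dependences(ld1, ld2))    # output dependence on F0
```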
Where are the name dependencies?
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4       ; drop DSUBUI & BNEZ
          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    -8(R1), F4
          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    -16(R1), F4
          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    -24(R1), F4
          DSUBUI R1, R1, #32     ; alter to 4*8
          BNEZ   R1, LOOP
          NOP
How can we remove them?
Where are the name dependencies?
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4       ; drop DSUBUI & BNEZ
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    -8(R1), F8
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    -16(R1), F12
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    -24(R1), F16
          DSUBUI R1, R1, #32     ; alter to 4*8
          BNEZ   R1, LOOP
          NOP
The original "register renaming"
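The renaming shown on this slide can be mimicked by giving every register write a fresh name and redirecting later reads to it. The following Python sketch assumes a hypothetical `rename` helper and a free list ordered so that it reproduces the register names used above; it handles registers only, not memory.

```python
def rename(instrs, free_regs):
    """Give each destination a fresh register; later reads follow the newest mapping.

    instrs: list of (op, dest, srcs) tuples, dest is None when nothing is written.
    free_regs: iterable of unused register names (an assumption of this sketch).
    """
    mapping = {}                       # architectural name -> current renamed register
    free = iter(free_regs)
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)   # read before redefining dest
        new_dest = None
        if dest is not None:
            new_dest = next(free)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two iterations of the unrolled loop body, both reusing F0 and F4.
body = [
    ("L.D", "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("F4", "R1")),
]
renamed = rename(body + body, ["F0", "F4", "F6", "F8"])
for instr in renamed:
    print(instr)
```

After renaming, the second iteration writes F6 and F8, so only the true (RAW) dependences inside each iteration remain.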
Definition: Control Dependencies
- Example: if p1 {S1;}; if p2 {S2;}; S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
- Two constraints on control dependences:
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
- Example code:
    L: DADDU R5, R6, R7
       ADD   R1, R2, R3
       BEQZ  R4, L
       SUB   R1, R5, R6
       OR    R7, R1, R8
Dynamically Scheduled Pipelines UAH-CPE 631
Overcoming Data Hazards with Dynamic Scheduling
- Why in HW at run time?
  - Works when the real dependence can't be known at compile time
  - Simpler compiler
  - Code for one machine runs well on another
- Example:
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F12, F8, F12
  SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything!
- Key idea: allow instructions behind a stall to proceed
Overcoming Data Hazards with Dynamic Scheduling (cont'd)
- Enables out-of-order execution => out-of-order completion
- Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
- Scoreboarding: technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies (CDC 6600, 1963)
Scoreboarding Implications
- Out-of-order completion => WAR, WAW hazards?
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F12
- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- For WAW, must detect the hazard: stall until the other instruction completes
- Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
- Scoreboard keeps track of dependencies and the state of operations
- Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
- ID1: Issue: decode instructions and check for structural hazards
- ID2: Read operands: wait until no data hazards, then read operands
- EX: Execute: operate on operands; when the result is ready, the unit notifies the scoreboard that it has completed execution
- WB: Write results: finish execution; the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F12
  Scoreboarding stalls the SUB.D in its write result stage until ADD.D reads its operands
Four Stages of Scoreboard Control
1. Issue: decode instructions and check for structural hazards (ID1)
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
  - If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2)
  - A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
  - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
  - The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
3. Execution: operate on operands (EX)
  - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
  - Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.
- Example:
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F14
  The CDC 6600 scoreboard would stall SUB.D until ADD.D reads its operands
Three Parts of the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in (capacity = window size)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
  - Busy: indicates whether the unit is busy or not
  - Op: operation to perform in the unit (e.g., + or -)
  - Fi: destination register
  - Fj, Fk: source-register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
MIPS with a Scoreboard
[Figure: Registers; functional units FP Mult, FP Div, Add 1, Add 2, Add 3; Scoreboard with Control/Status lines]
Detailed Scoreboard Pipeline Control
- Issue
  - Wait until: not Busy(FU) and not Result(D)
  - Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
- Read operands
  - Wait until: Rj and Rk
  - Bookkeeping: Rj ← No; Rk ← No
- Execution complete
  - Wait until: functional unit done
- Write result
  - Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  - Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
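The wait conditions and bookkeeping above map fairly directly onto code. The Python sketch below is a simplified model under stated assumptions, not the CDC 6600 logic: class and method names are invented, timing is ignored, every instruction has two register sources, and Qj/Qk are also cleared when operands are read.

```python
class FunctionalUnit:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = self.fi = self.fj = self.fk = None
        self.qj = self.qk = None     # producing units, None when there is no producer
        self.rj = self.rk = False    # operand is ready and has not been read yet


class Scoreboard:
    def __init__(self, unit_names):
        self.units = {n: FunctionalUnit(n) for n in unit_names}
        self.result = {}             # register result status: register -> writing unit

    def can_issue(self, fu, dest):
        # Stall on a structural hazard (unit busy) or a WAW hazard (pending write to dest).
        return not self.units[fu].busy and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        u = self.units[fu]
        u.busy, u.op, u.fi, u.fj, u.fk = True, op, dest, s1, s2
        u.qj, u.qk = self.result.get(s1), self.result.get(s2)
        u.rj, u.rk = u.qj is None, u.qk is None
        self.result[dest] = fu

    def can_read_operands(self, fu):
        u = self.units[fu]
        return u.rj and u.rk         # RAW hazards are resolved here

    def read_operands(self, fu):
        u = self.units[fu]
        u.rj = u.rk = False
        u.qj = u.qk = None

    def can_write_result(self, fu):
        # WAR check: no active unit may still need to read the old value of Fi(FU).
        d = self.units[fu].fi
        return all((f.fj != d or not f.rj) and (f.fk != d or not f.rk)
                   for f in self.units.values() if f.busy)

    def write_result(self, fu):
        u = self.units[fu]
        for f in self.units.values():
            if f.qj == fu:
                f.rj = True
            if f.qk == fu:
                f.rk = True
        del self.result[u.fi]
        u.busy = False


# DIV.D F0,F2,F4 / ADD.D F10,F0,F8 / SUB.D F8,F8,F14 on hypothetical units.
sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "DIV.D", "F0", "F2", "F4")
sb.issue("Add1", "ADD.D", "F10", "F0", "F8")
sb.issue("Add2", "SUB.D", "F8", "F8", "F14")
sb.read_operands("Div")
sb.read_operands("Add2")
war_stall = not sb.can_write_result("Add2")   # ADD.D has not read F8 yet
sb.write_result("Div")                        # F0 becomes available to ADD.D
sb.read_operands("Add1")
war_cleared = sb.can_write_result("Add2")
print(war_stall, war_cleared)
```

In this run SUB.D finishes early but is held in write result until ADD.D has read F8, which is the WAR stall described on the preceding slides.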
Scoreboard Example
Scoreboard Example: Cycle 1. Issue 1st L.D!
Scoreboard Example: Cycle 2. Issue 2nd L.D? Structural hazard! No further instructions will issue!
Scoreboard Example: Cycle 3. Issue MUL.D?
Scoreboard Example: Cycle 4. Check for WAR hazards! If none, write result!
Scoreboard Example: Cycle 5. Issue 2nd L.D!
Scoreboard Example: Cycle 6. Issue MUL.D!
Scoreboard Example: Cycle 7. Issue SUB.D!
Scoreboard Example: Cycle 8. Issue DIV.D!
Scoreboard Example: Cycle 9. Read operands for MUL.D and SUB.D! Assume we can feed Mult1 and Add units in the same clock cycle. Issue ADD.D? Structural hazard (unit is busy)!
Scoreboard Example: Cycle 11. Last cycle of SUB.D execution.
Scoreboard Example: Cycle 12. Check WAR on F8. Write F8.
Scoreboard Example: Cycle 13. Issue ADD.D!
Scoreboard Example: Cycle 14. Read operands for ADD.D!
Scoreboard Example: Cycle 15
Scoreboard Example: Cycle 16
Scoreboard Example: Cycle 17. Why cannot write F6?
Scoreboard Example: Cycle 19
Scoreboard Example: Cycle 20
Scoreboard Example: Cycle 21
Scoreboard Example: Cycle 22. Write F6?
Scoreboard Example: Cycle 61
Scoreboard Example: Cycle 62
Scoreboard Results
- For the CDC 6600
  - 70% improvement for Fortran
  - 150% improvement for hand-coded assembly language
  - cost was similar to one of the functional units: surprisingly low; bulk of cost was in the extra busses
- Still, this was in ancient time
  - no caches & no main semiconductor memory
  - no software pipelining compilers?
- So, why is it coming back?
  - performance via ILP
Scoreboard Limitations
- Amount of parallelism among instructions
  - can we find independent instructions to execute
- Number of scoreboard entries
  - how far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)
- Number and types of functional units
  - avoid structural hazards
- Presence of antidependences and output dependences
  - WAR and WAW stalls become more important
Things to Remember
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
- Data dependencies
- Dynamic scheduling to minimise stalls
- Dynamic scheduling with a scoreboard
Tomasulo's Algorithm
- Used in IBM 360/91 FPU (before caches)
- Goal: high FP performance without special compilers
- Conditions:
  - Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - Long memory accesses and long FP delays
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer? The descendants of this have flourished!
  - Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
Tomasulo's Algorithm (cont'd)
- Control & buffers distributed with Function Units (FU)
  - FU buffers called "reservation stations" => buffer the operands of instructions waiting to issue
- Registers in instructions replaced by values or pointers to reservation stations (RS) => register renaming
  - avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over the Common Data Bus that broadcasts results to all FUs
- Loads and stores treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
Tomasulo-based FPU for MIPS
[Figure: FP Op Queue (from Instruction Unit); FP Registers; Load Buffers Load1 to Load6 (from Mem); Store Buffers Store1 to Store3 (to Mem); Reservation Stations Add1 to Add3 feeding FP adders and Mult1, Mult2 feeding FP multipliers; Common Data Bus (CDB)]
Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have a V field, the result to be stored
- Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: Qj/Qk = 0 => source operand is already available in Vj/Vk
  - Store buffers only have Qi for the RS producing the result
- Busy: indicates reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
  - If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
2. Execute: operate on operands (EX)
  - When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
  - Write it on Common Data Bus to all awaiting units; mark reservation station available
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
- Example speed: 2 clocks for Fl. pt. +, -; 10 for *; 40 clks for /
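A compact way to see how issue, operand capture, and the CDB broadcast fit together is the Python sketch below. It is a simplified model under assumptions: the class names, the register values, and the station names are illustrative, execution latency is not modelled, and result values are computed by hand in the driver code.

```python
class ReservationStation:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None     # operand values, once known
        self.qj = self.qk = None     # tags of producing stations, None when value is ready


class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs = {n: ReservationStation(n) for n in station_names}
        self.regs = dict(regs)
        self.qi = {}                 # register result status: register -> producing tag

    def issue(self, tag, op, dest, s1, s2):
        r = self.rs[tag]
        if r.busy:
            return False             # structural hazard: no free station
        r.busy, r.op = True, op
        r.qj = self.qi.get(s1)
        r.vj = None if r.qj else self.regs[s1]   # copy the value now, or remember the tag
        r.qk = self.qi.get(s2)
        r.vk = None if r.qk else self.regs[s2]
        self.qi[dest] = tag          # rename: dest now refers to this station's result
        return True

    def ready(self, tag):
        r = self.rs[tag]
        return r.busy and r.qj is None and r.qk is None

    def broadcast(self, tag, value):
        """CDB write: data plus source tag, picked up by everyone waiting on that tag."""
        for r in self.rs.values():
            if r.qj == tag:
                r.vj, r.qj = value, None
            if r.qk == tag:
                r.vk, r.qk = value, None
        for reg, producer in list(self.qi.items()):
            if producer == tag:      # only the latest writer of a register updates it
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[tag].busy = False


# DIV.D F0,F2,F4 / ADD.D F10,F0,F8 / SUB.D F8,F8,F14 with hypothetical register values.
t = Tomasulo(["Mult1", "Add1", "Add2"],
             {"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 5.0, "F10": 0.0, "F14": 1.0})
t.issue("Mult1", "DIV.D", "F0", "F2", "F4")
t.issue("Add1", "ADD.D", "F10", "F0", "F8")   # waits on Mult1, but captures F8 = 5.0 now
t.issue("Add2", "SUB.D", "F8", "F8", "F14")
t.broadcast("Add2", 5.0 - 1.0)                # SUB.D completes first and overwrites F8
old_f8_kept = t.rs["Add1"].vk                 # ADD.D still holds the old F8
t.broadcast("Mult1", 6.0 / 2.0)
t.broadcast("Add1", t.rs["Add1"].vj + t.rs["Add1"].vk)
print(old_f8_kept, t.regs["F8"], t.regs["F10"])
```

Because ADD.D copied F8 into its reservation station at issue, SUB.D can write F8 early without a WAR stall, in contrast to the scoreboard.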
Tomasulo Example
[Figure: initial state, annotated with: instruction stream; 3 load buffers; FU count down; 3 FP adder reservation stations; 2 FP mult reservation stations; clock cycle counter]
Tomasulo Example Cycle 1
Tomasulo Example Cycle 2
- Note: can have multiple loads outstanding
Tomasulo Example Cycle 3
- Note: register names are removed ("renamed") in reservation stations; MULT issued
- Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4
- Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5
- Timer starts down for Add1, Mult1
Tomasulo Example Cycle 6
- Issue ADDD here despite name dependency on F6?
Tomasulo Example Cycle 7
- Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8
Tomasulo Example Cycle 9
Tomasulo Example Cycle 10
- Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11
- Write result of ADDD here?
- All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Tomasulo Example Cycle 13
Tomasulo Example Cycle 14
Tomasulo Example Cycle 15
- Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16
- Just waiting for Mult2 (DIVD) to complete
Tomasulo Example Cycle 55
Tomasulo Example Cycle 56
- Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57
- Once again: in-order issue, out-of-order execution and out-of-order completion.
Tomasulo Drawbacks
- Complexity
  - delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
- Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units => high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs => more FU logic for parallel assoc stores
- Non-precise interrupts!
  - We will address this later
Tomasulo Loop Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
- This time assume Multiply takes 4 clocks
- Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
- To be clear, will show clocks for SUBI, BNEZ
  - Reality: integer instructions ahead of Fl. Pt. instructions
- Show 2 iterations
Loop Example
[Figure: initial state, annotated with: iteration count; added store buffers; instruction loop; value of register used for address and iteration control]
Loop Example Cycle 1
Loop Example Cycle 2
Loop Example Cycle 3
- Implicit renaming sets up data flow graph
Loop Example Cycle 4
Loop Example Cycle 5
Loop Example Cycle 6
Loop Example Cycle 7
Loop Example Cycle 8
Loop Example Cycle 9
Loop Example Cycle 10
Loop Example Cycle 11
Loop Example Cycle 12
Loop Example Cycle 13
Loop Example Cycle 14
Loop Example Cycle 15
Loop Example Cycle 16
Loop Example Cycle 17
Loop Example Cycle 18
Loop Example Cycle 19
Loop Example Cycle 20
- Once again: in-order issue, out-of-order execution and out-of-order completion.
Why can Tomasulo overlap iterations of loops?
- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
- Other perspective: Tomasulo builds the data flow dependency graph on the fly
Tomasulo's scheme offers 2 major advantages
- (1) the distribution of the hazard detection logic
  - distributed reservation stations and the CDB
  - If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
- (2) the elimination of stalls for WAW and WAR hazards
Multiple Issue
- Allow multiple instructions to issue in a single clock cycle (CPI < 1)
- Two flavors
  - Superscalar
    - Issue a varying number of instructions per clock
    - Can be statically (compiler tech.) or dynamically (Tomasulo) scheduled
  - VLIW (Very Long Instruction Word)
    - Issue a fixed number of instructions formatted as a single long instruction or as a fixed instruction packet
Multiple Issue with Dynamic Scheduling
[Figure: Tomasulo-based FPU (FP Op Queue, FP Registers, Load Buffers Load1 to Load6, Store Buffers Store1 to Store3, Reservation Stations Add1 to Add3 and Mult1, Mult2, FP adders, FP multipliers). Issue: 2 instructions per clock cycle]
Multiple Issue with Dynamic Scheduling: An Example
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDIU R1, R1, #-8
          BNE    R1, R2, Loop
Assumptions:
- 2-issue processor: can issue any pair of instructions if reservation stations are available
- Resources: ALU (integer + effective address), a separate pipelined FP unit for each operation type, branch prediction hardware, 1 CDB
- 2 cc for loads, 3 cc for FP add
- Branches single issue, branch prediction is perfect
Execution in Dual-issue Tomasulo Pipeline
Iter.  Inst.                  Issue  Exe. (begins)  Mem. access  Write at CDB  Comment
1      L.D    F0, 0(R1)       1      2              3            4             first issue
1      ADD.D  F4, F0, F2      1      5                           8             Wait for L.D
1      S.D    0(R1), F4       2      3              9                          Wait for ADD.D
1      DADDIU R1, R1, #-8     2      4                           5             Wait for ALU
1      BNE    R1, R2, Loop    3      6                                         Wait for DADDIU
2      L.D    F0, 0(R1)       4      7              8            9             Wait for BNE
2      ADD.D  F4, F0, F2      4      10                          13            Wait for L.D
2      S.D    0(R1), F4       5      8              14                         Wait for ADD.D
2      DADDIU R1, R1, #-8     5      9                           10            Wait for ALU
2      BNE    R1, R2, Loop    6      11                                        Wait for DADDIU
3      L.D    F0, 0(R1)       7      12             13           14            Wait for BNE
3      ADD.D  F4, F0, F2      7      15                          18            Wait for L.D
3      S.D    0(R1), F4       8      13             19                         Wait for ADD.D
3      DADDIU R1, R1, #-8     8      14                          15            Wait for ALU
3      BNE    R1, R2, Loop    9      16                                        Wait for DADDIU
Multiple Issue with Dynamic Scheduling: Resource Usage
Clock  Int ALU    FP ALU    Data Cache  CDB
2      1/L.D
3      1/S.D                1/L.D
4      1/DADDIU                         1/L.D
5                 1/ADD.D               1/DADDIU
6
7      2/L.D
8      2/S.D                2/L.D       1/ADD.D
9      2/DADDIU             1/S.D       2/L.D
10                2/ADD.D               2/DADDIU
11
12     3/L.D
13     3/S.D                3/L.D       2/ADD.D
14     3/DADDIU             2/S.D       3/L.D
15                3/ADD.D               3/DADDIU
16
17
18                                      3/ADD.D
19                          3/S.D
Multiple Issue with Dynamic Scheduling
- DADDIU waits for the ALU used by S.D
  - Add one ALU dedicated to effective address calculation
  - Use 2 CDBs
- Draw the table for the dual-issue version of Tomasulo's pipeline
Multiple Issue with Dynamic Scheduling
Iter.  Inst.                  Issue  Exe. (begins)  Mem. access  Write at CDB  Comment
1      L.D    F0, 0(R1)       1      2              3            4             first issue
1      ADD.D  F4, F0, F2      1      5                           8             Wait for L.D
1      S.D    0(R1), F4       2      3              9                          Wait for ADD.D
1      DADDIU R1, R1, #-8     2      3                           4             Executes earlier
1      BNE    R1, R2, Loop    3      5                                         Wait for DADDIU
2      L.D    F0, 0(R1)       4      6              7            8             Wait for BNE
2      ADD.D  F4, F0, F2      4      9                           12            Wait for L.D
2      S.D    0(R1), F4       5      7              13                         Wait for ADD.D
2      DADDIU R1, R1, #-8     5      6                           7             Executes earlier
2      BNE    R1, R2, Loop    6      8
3      L.D    F0, 0(R1)       7      9              10           11            Wait for BNE
3      ADD.D  F4, F0, F2      7      12                          15
3      S.D    0(R1), F4       8      10             16
3      DADDIU R1, R1, #-8     8      9                           10
3      BNE    R1, R2, Loop    9      11
Multiple Issue with Dynamic Scheduling: Resource Usage
Clock  Int ALU    Adr. Adder  FP ALU    Data Cache  CDB#1, CDB#2
2                 1/L.D
3      1/DADDIU   1/S.D                 1/L.D
4                                                   1/L.D, 1/DADDIU
5                             1/ADD.D
6      2/DADDIU   2/L.D
7                 2/S.D                 2/L.D       2/DADDIU
8                                                   1/ADD.D, 2/L.D
9      3/DADDIU   3/L.D       2/ADD.D   1/S.D
10                3/S.D                 3/L.D       3/DADDIU
11                                                  3/L.D
12                            3/ADD.D               2/ADD.D
13                                      2/S.D
14
15                                                  3/ADD.D
16                                      3/S.D
What about Precise Interrupts?
- Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
- Need to "fix" the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream
Hardware-based Speculation
- With wide-issue processors, control dependences become a burden, even with sophisticated branch predictors
- Speculation: speculate on the outcome of branches and execute the program as if our guesses were correct => need a mechanism to handle situations when the speculations were incorrect
Relationship between precise interrupts and speculation
- Speculation is a form of guessing
- Important for branch prediction:
  - Need to "take our best shot" at predicting branch direction
- If we speculate and are wrong, need to back up and restart execution at the point at which we predicted incorrectly:
  - This is exactly the same as precise exceptions!
- Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
HW support for precise interrupts
- Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
  - 4 fields: instr. type, destination, value, ready
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be operand source => more registers like RS)
  - Instructions commit
  - Once instruction commits, result is put into register
  - As a result, easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: FP Op Queue; Reorder Buffer; FP Regs; Res Stations; FP Adder]
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
  - If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")
2. Execution: operate on operands (EX)
  - When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
  - Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
  - When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")
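The commit step is what restores program order. The Python sketch below models only that part, under assumptions: the `ROB` class, its method names, and the dictionary-based entries are invented for illustration, and issue, execution, and the CDB are left out.

```python
from collections import deque


class ROB:
    """Minimal reorder buffer: results arrive in any order, registers update in order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()       # oldest instruction at the left (the head)
        self.next_id = 0

    def allocate(self, kind, dest):
        if len(self.entries) >= self.size:
            return None              # no free slot: issue must stall
        entry = {"id": self.next_id, "type": kind, "dest": dest,
                 "value": None, "ready": False}
        self.next_id += 1
        self.entries.append(entry)
        return entry["id"]

    def write_result(self, rob_id, value):
        for entry in self.entries:
            if entry["id"] == rob_id:
                entry["value"], entry["ready"] = value, True

    def commit(self, regs):
        """Retire from the head only while the head has its result."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                regs[entry["dest"]] = entry["value"]
            committed.append(entry["id"])
        return committed

    def flush_after(self, rob_id):
        """Discard every entry younger than rob_id, e.g. after a mispredicted branch."""
        while self.entries and self.entries[-1]["id"] > rob_id:
            self.entries.pop()


regs = {"F0": 0.0, "F8": 5.0, "F10": 0.0}
rob = ROB(4)
div = rob.allocate("fp", "F0")       # DIV.D F0, F2, F4
add = rob.allocate("fp", "F10")      # ADD.D F10, F0, F8
sub = rob.allocate("fp", "F8")       # SUB.D F8, F8, F14

rob.write_result(sub, 4.0)           # SUB.D finishes first
first = rob.commit(regs)             # nothing retires: DIV.D at the head is not done
f8_after_first = regs["F8"]
rob.write_result(div, 3.0)
second = rob.commit(regs)            # DIV.D retires, ADD.D now blocks the head
rob.write_result(add, 8.0)
third = rob.commit(regs)             # ADD.D and SUB.D retire in program order
print(first, second, third, regs)
```

Since registers change only at commit, everything still in the buffer can be dropped after a misprediction or exception without touching architectural state.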
What are the hardware complexities with reorder buffer (ROB)?
- How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
- Need as many ports on ROB as register file
[Figure: Reorder Table with fields Dest Reg, Result, Exceptions?, Valid, Program Counter; compare network; FP Op Queue; Reorder Buffer; FP Regs; Res Stations; FP Adder]
Summary
- Reservation stations: implicit register renaming to a larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards of scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer unit gets ahead, beyond branches)
- Today, helps cache misses as well
  - Don't stall for L1 data cache miss (insufficient ILP for L2 miss?)
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264