Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation

Overview
• Instruction-level parallelism
• Dynamic Scheduling Techniques
  – Scoreboarding (Appendix A.8)
  – Tomasulo’s Algorithm
• Reducing Branch Cost with Dynamic Hardware Prediction
  – Basic Branch Prediction and Branch-Prediction Buffers
  – Branch Target Buffers

CPI Equation
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Technique                                    Reduces
Loop unrolling                               Control stalls
Basic pipeline scheduling                    RAW stalls
Dynamic scheduling with scoreboarding        RAW stalls
Dynamic scheduling with register renaming    WAR and WAW stalls
Dynamic branch prediction                    Control stalls
Issuing multiple instructions per cycle      Ideal CPI
Compiler dependence analysis                 Ideal CPI and data stalls
Software pipelining and trace scheduling     Ideal CPI and data stalls
Speculation                                  All data and control stalls
Dynamic memory disambiguation                RAW stalls involving memory
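
A minimal Python sketch of the CPI equation. The function name and the stall rates are made-up illustrative values, not measurements; each stall term is taken to be average stall cycles per instruction.

```python
def pipeline_cpi(ideal_cpi, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural + raw + war + waw + control


# Hypothetical stall rates (cycles per instruction), chosen only for illustration.
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.20, war=0.02, waw=0.01, control=0.15)

# Register renaming targets the WAR and WAW terms; the others are left unchanged.
renamed = pipeline_cpi(1.0, structural=0.05, raw=0.20, control=0.15)

print(f"baseline CPI = {baseline:.2f}, with renaming = {renamed:.2f}")
```

Each technique in the table corresponds to shrinking one or more arguments of this sum.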

Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
  – Blocks are small (6-7 instructions)
  – Instructions are dependent
• Exploit ILP across multiple basic blocks
  – Iterations of a loop
      for (i = 1000; i > 0; i = i - 1)
          x[i] = x[i] + s;
  – Alternative to vector instructions

Basic Pipeline Scheduling
• Find sequences of unrelated instructions
• Compiler’s ability to schedule depends on
  – Amount of ILP available in the program
  – Latencies of the functional units
• Latency assumptions for the examples
  – Standard MIPS integer pipeline
  – No structural hazards (fully pipelined or duplicated units)
  – Latencies of FP operations:
      Instruction producing result    Instruction using result    Latency
      FP ALU op                       FP ALU op                   3
      FP ALU op                       SD                          2
      LD                              FP ALU op                   1
      LD                              SD                          0

Sample Pipeline
• Stages: IF ID (EX or FP1 FP2 FP3 FP4) DM WB
• [Pipeline timing diagrams: FP ALU op followed by FP ALU op; FP ALU op followed by SD]

Basic Scheduling
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
Sequential MIPS Assembly Code
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Pipelined execution:                  Cycle
Loop: LD    F0, 0(R1)                 1
      stall                           2
      ADDD  F4, F0, F2                3
      stall                           4
      stall                           5
      SD    0(R1), F4                 6
      SUBI  R1, R1, #8                7
      stall                           8
      BNEZ  R1, Loop                  9
      stall                           10
Scheduled pipelined execution:        Cycle
Loop: LD    F0, 0(R1)                 1
      SUBI  R1, R1, #8                2
      ADDD  F4, F0, F2                3
      stall                           4
      BNEZ  R1, Loop                  5
      SD    8(R1), F4                 6
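
A rough Python sketch that counts cycles per iteration for the two orderings above. The FP entries of `LATENCY` come from the latency table on the earlier slide. The integer-ALU-to-branch latency of 1 and the single branch delay slot are assumptions chosen to match the stalls shown here. All names (`LATENCY`, `loop_cycles`, the instruction tuples) are hypothetical. The model reports only the total cycle count; it does not reproduce where each stall is drawn.

```python
# Latency = stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,  # assumption, not from the FP latency table
}


def loop_cycles(instrs):
    """Cycles per iteration for in-order, single-issue execution.

    instrs is a list of (kind, destination register or None, source registers).
    """
    producer = {}  # register -> (issue cycle, kind) of its most recent writer
    cycle = 0
    for kind, dest, srcs in instrs:
        issue = cycle + 1  # earliest slot after the previous instruction
        for reg in srcs:
            if reg in producer:
                p_cycle, p_kind = producer[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        if dest is not None:
            producer[dest] = (issue, kind)
        cycle = issue
    # Assumption: one branch delay slot; if nothing fills it, one cycle is lost.
    if instrs and instrs[-1][0] == "branch":
        cycle += 1
    return cycle


unscheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),   # SD   0(R1), F4
    ("int_alu", "R1", ["R1"]),       # SUBI R1, R1, #8
    ("branch", None, ["R1"]),        # BNEZ R1, Loop
]
scheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("int_alu", "R1", ["R1"]),       # SUBI R1, R1, #8
    ("fp_alu", "F4", ["F0", "F2"]),  # ADDD F4, F0, F2
    ("branch", None, ["R1"]),        # BNEZ R1, Loop
    ("store", None, ["F4", "R1"]),   # SD   8(R1), F4 (fills the delay slot)
]

print(loop_cycles(unscheduled), loop_cycles(scheduled))  # 10 6
```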

Dynamic Scheduling
• Scheduling separates dependent instructions
  – Static – performed by the compiler
  – Dynamic – performed by the hardware
• Advantages of dynamic scheduling
  – Handles dependences unknown at compile time
  – Simplifies the compiler
  – Optimization is done at run time
• Disadvantages
  – Cannot eliminate true data dependences

Out-of-order execution (1/2)
• Central idea of dynamic scheduling
  – In-order execution:
      DIVD F0, F2, F4       IF ID DIV ...
      ADDD F10, F0, F8         IF ID stall ...
      SUBD F12, F8, F14           IF stall ...
  – Out-of-order execution:
      DIVD F0, F2, F4       IF ID DIV ...
      SUBD F12, F8, F14        IF ID A1 A2 A3 A4 ...
      ADDD F10, F0, F8            IF ID stall ...

Out-of-Order Execution (2/2)
• Separate issue process in ID:
  – Issue
    • Decode instruction
    • Check structural hazards
    • In-order issue
  – Read operands
    • Wait until no data hazards
    • Read operands
• Out-of-order execution/completion
  – Exception handling problems
  – WAR hazards

Dynamic Scheduling (DS) with a Scoreboard
• Details in Appendix A.8
• Allows out-of-order execution when there are
  – Sufficient resources
  – No data dependencies
• Responsible for issue, execution and hazards
• Functional units with long delays
  – Duplicated
  – Fully pipelined
• CDC 6600 – 16 functional units

[Figure: MIPS with Scoreboard]

Scoreboard Operation
• Scoreboard centralizes hazard management
  – Every instruction goes through the scoreboard
  – Scoreboard determines when the instruction can read its operands and begin execution
  – Monitors changes in hardware and decides when a stalled instruction can execute
  – Controls when instructions can write results
• New pipeline
  – ID (Issue, Read Regs), EX (Execution), WB (Write)

Execution Process
• Issue
  – Functional unit is free (structural)
  – Active instructions do not have same Rd (WAW)
• Read Operands
  – Checks availability of source operands
  – Resolves RAW hazards dynamically (out-of-order execution)
• Execution
  – Functional unit begins execution when operands arrive
  – Notifies the scoreboard when it has completed execution
• Write result
  – Scoreboard checks WAR hazards
  – Stalls the completing instruction if necessary

Scoreboard Data Structure
• Instruction status – indicates pipeline stage
• Functional unit status
    Busy – functional unit is busy or not
    Op – operation to perform in the unit (+, -, etc.)
    Fi – destination register
    Fj, Fk – source register numbers
    Qj, Qk – functional units producing Fj, Fk
    Rj, Rk – flags indicating when Fj, Fk are ready
• Register result status – FU that will write each register

Scoreboard Data Structure (1/3)
Instruction status
Instruction           Issue   Read operands   Execution completed   Write
LD    F6, 34(R2)      Y       Y               Y                     Y
LD    F2, 45(R3)      Y       Y               Y
MULTD F0, F2, F4      Y
SUBD  F8, F6, F2      Y
DIVD  F10, F0, F6     Y
ADDD  F6, F8, F2
Functional unit status
Name      Busy   Op     Fi    Fj    Fk    Qj        Qk        Rj   Rk
Integer   Y      Load   F2    R3                              N
Mult1     Y      Mult   F0    F2    F4    Integer             N    Y
Mult2     N
Add       Y      Sub    F8    F6    F2              Integer   Y    N
Divide    Y      Div    F10   F0    F6    Mult1               N    Y
Register result status
                  F0      F2    F4    F6    F8    F10   F12   ...   F30
Functional Unit   Mult1   Int               Add   Div
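
A minimal Python sketch of the scoreboard bookkeeping, using the field names from the data-structure slide (Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk). The class and method names are hypothetical, and execution timing is not modelled. DIVD is taken as DIVD F10, F0, F6, matching the Divide unit's Fj/Fk entries in the snapshot above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnit:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # units that will produce fj / fk
    qk: Optional[str] = None
    rj: bool = False          # operand ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.units = {name: FunctionalUnit() for name in unit_names}
        self.result = {}  # register result status: register -> unit name

    def can_issue(self, unit, dest):
        # Structural hazard: unit busy. WAW hazard: dest has a pending writer.
        return not self.units[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src1, src2=None):
        if not self.can_issue(unit, dest):
            return False
        qj = self.result.get(src1)
        qk = self.result.get(src2) if src2 is not None else None
        self.units[unit] = FunctionalUnit(
            True, op, dest, src1, src2, qj, qk,
            rj=qj is None, rk=src2 is not None and qk is None)
        self.result[dest] = unit
        return True

    def can_read_operands(self, unit):
        # RAW hazards are resolved here: wait until every source is ready.
        fu = self.units[unit]
        return fu.rj and (fu.fk is None or fu.rk)

    def read_operands(self, unit):
        if not self.can_read_operands(unit):
            return False
        self.units[unit].rj = self.units[unit].rk = False
        return True

    def can_write_result(self, unit):
        # WAR hazard: another unit still has to read the old value of fi.
        fi = self.units[unit].fi
        return not any((f.fj == fi and f.rj) or (f.fk == fi and f.rk)
                       for name, f in self.units.items()
                       if name != unit and f.busy)

    def write_result(self, unit):
        if not self.can_write_result(unit):
            return False
        for f in self.units.values():  # wake up waiting units
            if f.qj == unit:
                f.qj, f.rj = None, True
            if f.qk == unit:
                f.qk, f.rk = None, True
        self.result.pop(self.units[unit].fi, None)
        self.units[unit] = FunctionalUnit()
        return True


# Rebuild the snapshot: the first LD has already written F6.
sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", "R3")
sb.read_operands("Integer")
sb.issue("Mult1", "Mult", "F0", "F2", "F4")
sb.issue("Add", "Sub", "F8", "F6", "F2")
sb.issue("Divide", "Div", "F10", "F0", "F6")
addd_issued = sb.issue("Add", "Add", "F6", "F8", "F2")  # Add unit is busy
```

ADDD cannot issue because the Add unit is still held by SUBD, which is the structural check listed under Issue.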

[Figure: Scoreboard Data Structure (2/3)]

[Figure: Scoreboard Data Structure (3/3)]

[Figure: Scoreboard Algorithm]

Scoreboard Limitations
• Amount of available ILP
• Number of scoreboard entries
  – Limited to a basic block
  – Extended beyond a branch
• Number and types of functional units
  – Structural hazards can increase with DS
• Presence of anti- and output dependences
  – Lead to WAR and WAW stalls

Tomasulo Approach
• Another approach to eliminate stalls
  – Combines scoreboarding with register renaming (to avoid WAR and WAW hazards)
• Designed for the IBM 360/91
  – High FP performance for the whole 360 family
  – Four double-precision FP registers
  – Long memory access and long FP delays
• Can support overlapped execution of multiple iterations of a loop

[Figure: Tomasulo Approach]

Stages
• Issue
  – Wait for an empty reservation station or buffer
  – Send available operands to the reservation station
  – Use the name of the producing reservation station for operands that are not yet available
• Execute
  – Execute operation if operands are available
  – Otherwise monitor the common data bus (CDB) for availability of operands
• Write result
  – When result is available, write it to the CDB
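
A minimal Python sketch of the three stages. The names (`Tomasulo`, `issue`, `complete`, the `vj`/`vk`/`qj`/`qk` fields, the `qi` register tags) and the register values are illustrative assumptions, and timing and load/store buffers are left out. It shows how tagging operands with station names removes a WAR hazard: SUBD overwrites F8 before ADDD executes, yet ADDD still uses the old F8 it captured at issue.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


class Tomasulo:
    def __init__(self, station_names, registers):
        self.stations = {name: None for name in station_names}  # None = empty
        self.regs = dict(registers)  # register -> current value
        self.qi = {}                 # register -> station that will write it

    def issue(self, op, dest, src1, src2):
        free = next((n for n, s in self.stations.items() if s is None), None)
        if free is None:
            return None  # no empty reservation station: stall
        entry = {"op": op}
        for v, q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
            if src in self.qi:   # value still being produced: record the tag
                entry[v], entry[q] = None, self.qi[src]
            else:                # value available: copy it now
                entry[v], entry[q] = self.regs[src], None
        self.stations[free] = entry
        self.qi[dest] = free     # rename dest to this station
        return free

    def ready(self, name):
        s = self.stations[name]
        return s is not None and s["qj"] is None and s["qk"] is None

    def complete(self, name):
        """Execute a ready station and broadcast its result on the CDB."""
        if not self.ready(name):
            return None
        s = self.stations[name]
        value = OPS[s["op"]](s["vj"], s["vk"])
        self.stations[name] = None
        for other in self.stations.values():  # stations monitoring the CDB
            if other is None:
                continue
            if other["qj"] == name:
                other["vj"], other["qj"] = value, None
            if other["qk"] == name:
                other["vk"], other["qk"] = value, None
        for reg in [r for r, q in self.qi.items() if q == name]:
            self.regs[reg] = value  # only the latest writer updates the register
            del self.qi[reg]
        return value


cpu = Tomasulo(["RS1", "RS2", "RS3"],
               {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
                "F8": 1.0, "F10": 5.0, "F14": 3.0})
div = cpu.issue("DIVD", "F0", "F2", "F4")
add = cpu.issue("ADDD", "F6", "F0", "F8")    # waits for F0, captures F8 = 1.0
sub = cpu.issue("SUBD", "F8", "F10", "F14")  # writes F8: WAR with ADDD
cpu.complete(sub)  # finishes first; F8 becomes 2.0
cpu.complete(div)  # F0 becomes 4.0 and is forwarded to ADDD
cpu.complete(add)  # F6 = 4.0 + 1.0
```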

[Figure: Example (1/2)]

[Figure: Example (2/2)]

Tomasulo’s Algorithm
An enhanced and detailed design is given in Fig. 3.5 of the textbook

Loop Iterations
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Dynamic Hardware Prediction
• Importance of control dependences
  – Branches and jumps are frequent
  – Limiting factor as ILP increases (Amdahl’s law)
• Schemes to attack control dependences
  – Static
    • Basic (stall the pipeline)
    • Predict-not-taken and predict-taken
    • Delayed branch and canceling branch
  – Dynamic predictors
• Effectiveness of dynamic prediction schemes
  – Accuracy
  – Cost

Basic Branch Prediction Buffers
• a.k.a. Branch History Table (BHT)
• Small direct-mapped cache of T/NT bits
• [Figure: the BHT, indexed by the PC, selects between the branch target (T, predict taken) and PC + 4 (NT, predict not taken)]

N-bit Branch Prediction Buffers
• Use an n-bit saturating counter
• Only the loop exit causes a misprediction
• A 2-bit predictor is almost as good as any general n-bit predictor
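
A minimal Python sketch of an n-bit saturating counter, compared against a 1-bit predictor on a loop branch. The class name, the "upper half of the range predicts taken" rule, the starting states, and the 9-taken-then-exit pattern are illustrative assumptions.

```python
class SaturatingCounter:
    def __init__(self, bits=2, value=0):
        self.max = (1 << bits) - 1
        self.value = value

    def predict(self):
        # Upper half of the counter range predicts taken.
        return self.value > self.max // 2

    def update(self, taken):
        if taken:
            self.value = min(self.max, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


def mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken 9 times, not taken at loop exit; the loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3

# Both predictors start in their "taken" state.
one_bit = mispredictions(SaturatingCounter(bits=1, value=1), outcomes)
two_bit = mispredictions(SaturatingCounter(bits=2, value=3), outcomes)
print(one_bit, two_bit)  # 5 3
```

The 1-bit predictor also misses the first iteration after each exit; the 2-bit counter misses only the three exits.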

[Figure: Prediction Accuracy of a 4K-entry 2-bit Prediction Buffer]

Correlating Branch Predictors
• a.k.a. Two-level Predictors
  – Use recent behavior of other (previous) branches
• [Figure: the BHT is indexed by the PC together with a 1-bit global branch history (NT/T, stores behavior of previous branch); T (predict taken) selects the branch target, NT (predict not taken) selects PC + 4]

Example
        BNEZ   R1, L1          ; branch b1 (d != 0)
        DADDIU R1, R0, #1
L1:     DADDIU R3, R1, #-1
        BNEZ   R3, L2          ; branch b2
        ...
L2:
Basic one-bit predictor
d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
One-bit predictor with one-bit correlation
d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT/NT     T           T/NT          NT/NT     T           NT/T
0     T/NT      NT          T/NT          NT/T      NT          NT/T
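
A small Python simulation of the example above, assuming d alternates between 2 and 0, all predictors start at NT, and the initial global history is NT. The function names are hypothetical. In the X/Y notation, X is the prediction used when the last branch was not taken and Y when it was taken.

```python
def branch_outcomes(d):
    """Return (b1 taken, b2 taken) for one pass through the code fragment."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1               # DADDIU R1, R0, #1 runs only when b1 falls through
    b2_taken = d != 1       # R3 = d - 1; b2 is taken when R3 != 0
    return b1_taken, b2_taken


def one_bit_misses(d_values):
    pred = {"b1": False, "b2": False}  # False = NT, True = T
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            if pred[name] != taken:
                misses += 1
            pred[name] = taken
    return misses


def correlating_misses(d_values):
    # pred[name][h] is the prediction used when the last branch outcome was h.
    pred = {"b1": {False: False, True: False},
            "b2": {False: False, True: False}}
    last = False  # assumed initial global history: not taken
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            if pred[name][last] != taken:
                misses += 1
            pred[name][last] = taken
            last = taken
    return misses


sequence = [2, 0, 2, 0]
print(one_bit_misses(sequence), correlating_misses(sequence))  # 8 2
```

Under these assumptions the one-bit predictor mispredicts every branch, while the correlating predictor mispredicts only during the first pass.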

[Figure: (2,2) Branch Prediction Buffer]

(m,n) Predictors
• Use behavior of the last m branches
• 2^m n-bit predictors for each branch
• Simple implementation
  – Use an m-bit shift register to record the behavior of the last m branches
• [Figure: the PC and the m-bit global branch history (GBH) together select an n-bit predictor in the (m,n) buffer]
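
A minimal Python sketch of an (m,n) predictor built around an m-bit shift register. The class name, the word-aligned PC indexing, and the all-zero initial state are assumptions made for illustration.

```python
class CorrelatingPredictor:
    def __init__(self, m, n, entries):
        self.m = m
        self.entries = entries
        self.max = (1 << n) - 1
        self.history = 0  # m-bit shift register, most recent outcome in bit 0
        # One row per table entry; each row holds 2^m n-bit counters.
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        # Assumption: 4-byte instructions, low-order PC bits index the table.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._row(pc)[self.history] > self.max // 2

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        row[self.history] = min(self.max, count + 1) if taken else max(0, count - 1)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


# A (2,2) predictor with 1K entries.
predictor = CorrelatingPredictor(m=2, n=2, entries=1024)
for outcome in (True, True, True):
    predictor.update(0x400, outcome)
```

With m = 0 the history register stays at zero and the structure reduces to a plain n-bit prediction buffer.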

Size of the Buffers
• Number of bits in an (m,n) predictor
  – 2^m x n x Number of entries in the table
• Example – assume 8K bits in the BHT
  – (0,1): 8K entries
  – (0,2): 4K entries
  – (2,2): 1K entries
  – (12,2): 1 entry!
    • Does not use the branch address
    • Relies only on the global branch history
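
The arithmetic behind the example, as a short Python check. The function name is hypothetical and 8K is taken as 8192 bits.

```python
def table_entries(total_bits, m, n):
    """Entries available when each entry holds 2^m predictors of n bits."""
    return total_bits // ((2 ** m) * n)


for m, n in [(0, 1), (0, 2), (2, 2), (12, 2)]:
    print((m, n), table_entries(8 * 1024, m, n))
```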

[Figure: Performance Comparison of 2-bit Predictors]

Branch-Target Buffers
• Further reduce control stalls (hopefully to 0)
• Store the predicted address in the buffer
• Access the buffer during IF
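
A minimal Python sketch of a branch-target buffer lookup during IF. It assumes 4-byte instructions, an unbounded dictionary in place of a fixed-size tagged cache, and a policy of keeping entries only for branches that were last taken. All names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.targets = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        """Looked up during IF: predicted target on a hit, else fall through."""
        return self.targets.get(pc, pc + 4)

    def update(self, pc, taken, target):
        """Called once the branch outcome and target are known."""
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)


btb = BranchTargetBuffer()
first_fetch = btb.next_pc(0x100)   # miss: fetch continues at PC + 4
btb.update(0x100, True, 0x40)      # branch resolved as taken to 0x40
second_fetch = btb.next_pc(0x100)  # hit: fetch redirects to 0x40 with no stall
```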

[Figure: Prediction with BTB]

Target Instruction Buffers
• Store target instructions instead of addresses
• Advantages
  – BTB access can take longer than the time between successive instruction fetches, so the BTB can be larger
  – Branch folding
    • Zero-cycle unconditional branches
      – Replace branch with target instruction

Performance Issues
• Limitations of branch prediction schemes
  – Prediction accuracy (80% - 95%)
    • Type of program
    • Size of buffer
  – Penalty of misprediction
    • Fetch from both directions to reduce penalty
  – Memory system should:
    • Be dual-ported
    • Have an interleaved cache
    • Fetch from one path and then from the other