CS 152 Computer Architecture and Engineering Lecture 16








































- Slides: 40

CS 152 Computer Architecture and Engineering, Lecture 16
Compiler Optimizations (Cont)
Dynamic Scheduling with Scoreboards
CS 152 Lec 16.1

The Big Picture: Where are We Now?
° The Five Classic Components of a Computer: Processor (Control, Datapath), Memory, Input, Output
° Today’s Topics:
• Recap last lecture
• Hardware loop unrolling with Tomasulo algorithm
• Administrivia
• Speculation, branch prediction
• Reorder buffers

Scoreboard: a bookkeeping technique
° Out-of-order execution divides ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
° Scoreboards date to CDC 6600 in 1963
° Instructions execute whenever not dependent on previous instructions and no hazards.
° CDC 6600: In-order issue, out-of-order execution, out-of-order commit (or completion)
• No forwarding!
• Imprecise interrupt/exception model for now

Scoreboard Architecture (CDC 6600)
[Figure: block diagram with Registers, Functional Units (FP Mult, FP Divide, FP Add, Integer), SCOREBOARD, and Memory]

Scoreboard Implications
° Out-of-order completion => WAR, WAW hazards?
° Solutions for WAR:
• Stall writeback until registers have been read
• Read registers only during Read Operands stage
° Solution for WAW:
• Detect hazard and stall issue of new instruction until other instruction completes
° No register renaming!
° Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
° Scoreboard keeps track of dependencies between instructions that have already issued.
° Scoreboard replaces ID, EX, WB with 4 stages
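To make the three hazard types concrete, here is a minimal Python sketch that classifies the hazards a later instruction has against an earlier one. The `Instr` record, the `hazards` helper, and the register choices are hypothetical illustrations, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, some sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


# Illustrative instructions (assumed operands, chosen to show one hazard each).
multd = Instr("MULTD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F6", ("F0", "F8"))
subd = Instr("SUBD", "F8", ("F10", "F14"))
multd2 = Instr("MULTD", "F6", ("F10", "F8"))

raw = hazards(multd, addd)    # ADDD reads F0, which MULTD writes
war = hazards(addd, subd)     # SUBD writes F8, which ADDD reads
waw = hazards(addd, multd2)   # both write F6
```

In scoreboard terms, the RAW case is resolved in Read operands, the WAR case stalls Write result, and the WAW case stalls Issue.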

Four Stages of Scoreboard Control
° Issue: decode instructions & check for structural hazards (ID1)
• Instructions issued in program order (for hazard checking)
• Don’t issue if structural hazard
• Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
° Read operands: wait until no data hazards, then read operands (ID2)
• All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data.
• No forwarding of data in this model!

Four Stages of Scoreboard Control
° Execution: operate on operands (EX)
• The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
° Write result: finish execution (WB)
• Stall until no WAR hazards with previous instructions. Example:
    DIVD  F0, F2, F4
    ADDD  F10, F0, F8
    SUBD  F8, F8, F14
  CDC 6600 scoreboard would stall SUBD until ADDD reads operands
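The write-result stall rule can be sketched in a few lines of Python. This is an illustration only: `must_stall_write` and the tuple layout are hypothetical, and the operand lists for DIVD, ADDD, and SUBD are assumed to be F0,F2,F4 / F10,F0,F8 / F8,F8,F14.

```python
def must_stall_write(dest, earlier):
    """Return True if writing `dest` now would cause a WAR hazard.

    `earlier` is a list of (source_registers, has_read_operands) pairs,
    one per previously issued instruction that has not yet completed.
    """
    return any(dest in srcs and not has_read for srcs, has_read in earlier)


# SUBD wants to write F8. DIVD has read its operands and is executing;
# ADDD is still waiting for F0 from DIVD, so it has not read F8 yet.
earlier = [
    (("F2", "F4"), True),    # DIVD F0, F2, F4 (assumed operands)
    (("F0", "F8"), False),   # ADDD F10, F0, F8 (assumed operands)
]
stall_before = must_stall_write("F8", earlier)

# Once ADDD has read its operands, SUBD may write F8.
earlier[1] = (("F0", "F8"), True)
stall_after = must_stall_write("F8", earlier)
```

Under these assumptions `stall_before` is true and `stall_after` is false, which matches the slide's statement that SUBD stalls until ADDD reads operands.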

Three Parts of the Scoreboard
° Instruction status: Which of 4 steps the instruction is in
° Functional unit status: Indicates the state of the functional unit (FU). 9 fields for each functional unit:
• Busy: Indicates whether the unit is busy or not
• Op: Operation to perform in the unit (e.g., + or –)
• Fi: Destination register
• Fj, Fk: Source-register numbers
• Qj, Qk: Functional units producing source registers Fj, Fk
• Rj, Rk: Flags indicating when Fj, Fk are ready
° Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
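One possible way to picture the three parts as data structures is the Python sketch below. The class names, the unit names, and the sample load instruction are assumptions made for illustration; only the nine field names come from the slide.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class FunctionalUnitStatus:
    """The 9 per-unit fields listed on the slide."""
    busy: bool = False            # Busy: is the unit in use?
    op: Optional[str] = None      # Op: operation to perform
    fi: Optional[str] = None      # Fi: destination register
    fj: Optional[str] = None      # Fj: first source register
    fk: Optional[str] = None      # Fk: second source register
    qj: Optional[str] = None      # Qj: unit producing Fj (None if none)
    qk: Optional[str] = None      # Qk: unit producing Fk (None if none)
    rj: bool = False              # Rj: Fj ready and not yet read
    rk: bool = False              # Rk: Fk ready and not yet read


@dataclass
class Scoreboard:
    # Instruction status: instruction text -> current step
    instruction_status: Dict[str, str] = field(default_factory=dict)
    # Functional unit status: unit name -> its 9 fields
    fu_status: Dict[str, FunctionalUnitStatus] = field(default_factory=dict)
    # Register result status: register -> unit that will write it
    register_result: Dict[str, str] = field(default_factory=dict)


# Hypothetical unit names, following the labels in the architecture figure.
units = ("Integer", "FP Mult", "FP Add", "FP Divide")
sb = Scoreboard(fu_status={name: FunctionalUnitStatus() for name in units})

# Hypothetical example: a load "LD F6, 34(R2)" has just issued to Integer.
sb.instruction_status["LD F6, 34(R2)"] = "Issue"
sb.fu_status["Integer"] = FunctionalUnitStatus(
    busy=True, op="LD", fi="F6", fk="R2", rk=True)
sb.register_result["F6"] = "Integer"

field_count = len(fields(FunctionalUnitStatus))
```

A register missing from `register_result` plays the role of a blank entry: no pending instruction will write it.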

Scoreboard Example

Detailed Scoreboard Pipeline Control
° Issue
• Wait until: Not Busy(FU) and not Result('D')
• Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2'; Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU
° Read operands
• Wait until: Rj and Rk
• Bookkeeping: Rj ← No; Rk ← No
° Execution complete
• Wait until: Functional unit done
° Write result
• Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
• Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
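The wait conditions and bookkeeping above can be sketched as a small Python class. This is a simplified illustration, not the CDC 6600 logic itself: it has no timing model, every instruction is assumed to have one destination and two register sources, and the unit names (`Divide`, `Add1`, `Add2`) and method names are hypothetical.

```python
def new_fu():
    """An idle functional unit: all nine status fields cleared."""
    return {"busy": False, "op": None, "fi": None, "fj": None, "fk": None,
            "qj": None, "qk": None, "rj": False, "rk": False}


class Scoreboard:
    def __init__(self, unit_names):
        self.fu = {name: new_fu() for name in unit_names}
        self.result = {}  # register -> unit that will write it

    def can_issue(self, unit, dest):
        # Wait until: not Busy(FU) and not Result(D)
        return not self.fu[unit]["busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit] = {"busy": True, "op": op, "fi": dest,
                         "fj": s1, "fk": s2, "qj": qj, "qk": qk,
                         "rj": qj is None, "rk": qk is None}
        self.result[dest] = unit

    def can_read_operands(self, unit):
        # Wait until: Rj and Rk
        return self.fu[unit]["rj"] and self.fu[unit]["rk"]

    def read_operands(self, unit):
        self.fu[unit]["rj"] = False
        self.fu[unit]["rk"] = False

    def can_write_result(self, unit):
        # Wait until no unit f still has to read Fi(FU) as a source.
        fi = self.fu[unit]["fi"]
        return all((f["fj"] != fi or not f["rj"]) and
                   (f["fk"] != fi or not f["rk"])
                   for f in self.fu.values())

    def write_result(self, unit):
        # Wake up units waiting on this result, then free the unit.
        for f in self.fu.values():
            if f["qj"] == unit:
                f["rj"] = True
            if f["qk"] == unit:
                f["rk"] = True
        del self.result[self.fu[unit]["fi"]]
        self.fu[unit] = new_fu()


# DIVD / ADDD / SUBD sequence with assumed operands, one unit each.
sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")

addd_ready_early = sb.can_read_operands("Add1")   # RAW on F0
sb.read_operands("Divide")
sb.read_operands("Add2")
subd_write_early = sb.can_write_result("Add2")    # WAR on F8 with ADDD

sb.write_result("Divide")                         # F0 now available
addd_ready_late = sb.can_read_operands("Add1")
sb.read_operands("Add1")
subd_write_late = sb.can_write_result("Add2")     # WAR gone
waw_blocked = not sb.can_issue("Divide", "F10")   # ADDD still owns F10
```

In this run ADDD cannot read operands until DIVD writes F0, and SUBD cannot write F8 until ADDD has read it, which is the ordering the table enforces.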

Scoreboard Example: Cycle 1

Scoreboard Example: Cycle 2
• Issue 2nd LD?

Scoreboard Example: Cycle 3
• Issue MULT?

Scoreboard Example: Cycle 4

Scoreboard Example: Cycle 5

Scoreboard Example: Cycle 6

Scoreboard Example: Cycle 7
• Read multiply operands?

Scoreboard Example: Cycle 8a (First half of clock cycle)

Scoreboard Example: Cycle 8b (Second half of clock cycle)

Scoreboard Example: Cycle 9
Note Remaining
• Read operands for MULT & SUB? Issue ADDD?

Scoreboard Example: Cycle 10

Scoreboard Example: Cycle 11

Scoreboard Example: Cycle 12
• Read operands for DIVD?

Scoreboard Example: Cycle 13

Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 16

Scoreboard Example: Cycle 17
WAR Hazard!
• Why not write result of ADD???

Scoreboard Example: Cycle 18

Scoreboard Example: Cycle 19

Scoreboard Example: Cycle 20

Scoreboard Example: Cycle 21
• WAR Hazard is now gone...

Scoreboard Example: Cycle 22

Faster than light computation (skip a couple of cycles)

Scoreboard Example: Cycle 61

Scoreboard Example: Cycle 62

Review: Scoreboard Example: Cycle 62
• In-order issue; out-of-order execute & commit

CDC 6600 Scoreboard
° Speedup 1.7 from compiler; 2.5 by hand, BUT slow memory (no cache) limits benefit
° Limitations of 6600 scoreboard:
• No forwarding hardware
• Limited to instructions in basic block (small window)
• Small number of functional units (structural hazards), especially integer/load store units
• Do not issue on structural hazards
• Wait for WAR hazards
• Prevent WAW hazards

Summary #1/2: Compiler techniques for parallelism
° Loop unrolling: Multiple iterations of loop in software
• Amortizes loop overhead over several iterations
• Gives more opportunity for scheduling around stalls
° Software Pipelining: Take one instruction from each of several iterations of the loop
• Software overlapping of loop iterations
• Today will show hardware overlapping of loop iterations
° Very Long Instruction Word machines (VLIW): Multiple operations coded in single, long instruction
• Requires sophisticated compiler to decide which operations can be done in parallel
• Trace scheduling: find common path and schedule code as if branches didn’t exist (+ add “fixup code”)
° All of these require additional registers
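As a reminder of what loop unrolling looks like as a source transformation, here is a small Python sketch. The function names and the unroll factor of 4 are arbitrary choices for illustration. Python will not show the performance benefit; the point is only the shape of the code a compiler would produce.

```python
def add_scalar(x, s):
    """Rolled loop: one element and one loop test per iteration."""
    for i in range(len(x)):
        x[i] += s
    return x


def add_scalar_unrolled(x, s):
    """Same loop unrolled by 4: one loop test per four elements."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        # Four independent updates that a scheduler could interleave.
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    while i < n:
        # Cleanup loop for the leftover elements when n is not a multiple of 4.
        x[i] += s
        i += 1
    return x


rolled = add_scalar(list(range(10)), 5)
unrolled = add_scalar_unrolled(list(range(10)), 5)
```

On a real machine each of the four unrolled updates would typically use its own temporary register, which is why the slide notes that these techniques require additional registers.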

Summary #2/2
° HW exploiting ILP
• Works when can’t know dependence at compile time.
• Code for one machine runs well on another
° Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
• Enables out-of-order execution => out-of-order completion
• ID stage checked both for structural & data dependencies
• Original version didn’t handle forwarding.
• No automatic register renaming