CS 152 Computer Architecture and Engineering
Lecture 15: Dynamic Scheduling
October 20, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
Lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
10/20/99 ©UCB Fall 1999 CS 152 / Kubiatowicz Lec 15.1

Review: Pipelining
• Hazards limit performance
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation & PC, delayed branch, prediction
• Data hazards must be handled carefully:
  – RAW data hazards handled by forwarding
  – WAW and WAR hazards don’t exist in 5 stage pipeline
• MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
• Exceptions in 5 stage pipeline recorded when they occur, but acted on only at WB (end of MEM) stage
  – Must flush all later instructions (at earlier pipe stages)
• More performance from deeper pipelines, parallelism?

Review: Resolve RAW by “forwarding” (or bypassing)
[Figure: pipeline datapath (IAU/PC, instruction memory, register file, ALU, data memory) with forward muxes feeding the A and B operand latches]
• Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe
• Increase muxes to add paths from pipeline registers
• Data Forwarding = Data Bypassing

Review: The exception problem in simple pipeline
[Figure: overlapped instructions (IFetch, Dcd, Exec, Mem, WB) plotted against time and program flow, with labels Bad Inst, Inst TLB fault, Overflow, and Data TLB]
• Use pipeline to sort this out!
  – Pass exception status along with instruction.
  – Keep track of PCs for every instruction in pipeline.
  – Don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting noop” in IF stage
• When instruction reaches WB stage:
  – Save PC → EPC, Interrupt vector addr → PC
  – Turn all instructions in earlier stages into noops!

Today’s schedule
• Software techniques to improve performance
  – Brief discussion of VLIW
• Advanced techniques: dynamic scheduling (assume single issue!)
  – Scoreboard
  – Tomasulo

Case Study: MIPS R4000 (200 MHz)
• 8 Stage Pipeline:
  – IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
  – IS: second half of access to instruction cache.
  – RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
  – EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  – DF: data fetch, first half of access to data cache.
  – DS: second half of access to data cache.
  – TC: tag check, determine whether the data cache access hit.
  – WB: write back for loads and register operations.
• 8 Stages: What is impact on Load delay? Branch delay? Why?

Case Study: MIPS R4000
[Figure: two staggered pipeline diagrams of successive instructions passing through IF, IS, RF, EX, DF, DS, TC, WB]
• TWO cycle load latency
• THREE cycle branch latency (conditions evaluated during EX phase)
  – Delay slot plus two stalls
  – Branch likely cancels delay slot if not taken

MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:
    Stage  Functional unit  Description
    A      FP adder         Mantissa ADD stage
    D      FP divider       Divide pipeline stage
    E      FP multiplier    Exception test stage
    M      FP multiplier    First stage of multiplier
    N      FP multiplier    Second stage of multiplier
    R      FP adder         Rounding stage
    S      FP adder         Operand shift stage
    U                       Unpack FP numbers

MIPS FP Pipe Stages
    FP Instr        1  2    3      4     5  6    7    8 …
    Add, Subtract   U  S+A  A+R    R+S
    Multiply        U  E+M  M      M     M  N    N+A  R
    Divide          U  A    R      D^28  …  D+A  D+R, D+R, D+A, D+R, A, R
    Square root     U  E    (A+R)^108    …  A    R
    Negate          U  S
    Absolute value  U  S
    FP compare      U  A    R
Stages:
    M  First stage of multiplier      A  Mantissa ADD stage
    N  Second stage of multiplier     D  Divide pipeline stage
    R  Rounding stage                 E  Exception test stage
    S  Operand shift stage
    U  Unpack FP numbers

R4000 Performance
• Not ideal CPI of 1:
  – FP structural stalls: Not enough FP hardware (parallelism)
  – FP result stalls: RAW data hazard (latency)
  – Branch stalls (2 cycles + unfilled slots)
  – Load stalls (1 or 2 clock cycles)

Can we somehow make CPI closer to 1?
• Let’s assume full pipelining:
  – If we have a 4 cycle instruction, then we need 3 instructions between a producing instruction and its use:
        multf $F0,$F2,$F4
        delay-1
        delay-2
        delay-3
        addf  $F6,$F10,$F0
[Figure: pipeline stages Fetch, Decode, Ex1, Ex2, Ex3, Ex4, WB with multf ahead of delay-1, delay-2, delay-3 and addf; arrows mark the earliest forwarding for 4-cycle instructions and the earliest forwarding for 1-cycle instructions]

FP Loop: Where are the Hazards?
    Loop: LD    F0,0(R1)   ;F0=vector element
          ADDD  F4,F0,F2   ;add scalar from F2
          SD    0(R1),F4   ;store result
          SUBI  R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ  R1,Loop    ;branch R1!=zero
          NOP              ;delayed branch slot
    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0
• Where are the stalls?

FP Loop Showing Stalls
    1 Loop: LD    F0,0(R1)   ;F0=vector element
    2       stall
    3       ADDD  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       SD    0(R1),F4   ;store result
    7       SUBI  R1,R1,8    ;decrement pointer 8B (DW)
    8       BNEZ  R1,Loop    ;branch R1!=zero
    9       stall            ;delayed branch slot
    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
• 9 clocks: Rewrite code to minimize stalls?
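
The 9-clock count can be reproduced with a small in-order issue model. This is only a sketch under stated assumptions: the instruction encoding, the `schedule` helper, treating BNEZ and NOP as integer ops, and giving unlisted producer/consumer pairs a latency of 0 are illustrative choices, not part of the lecture.

```python
# Stall cycles required between a producer and a dependent consumer,
# taken from the latency table on the slide.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def schedule(program):
    """Return the issue clock of each instruction (in-order, single issue).

    program is a list of (kind, dest, sources) tuples; this encoding is a
    hypothetical one chosen for the sketch.
    """
    last_writer = {}  # register -> (issue clock, kind of producer)
    clocks = []
    clock = 0
    for kind, dest, sources in program:
        earliest = clock + 1  # at most one instruction issues per clock
        for reg in sources:
            if reg in last_writer:
                issued, producer = last_writer[reg]
                # Assumption: pairs missing from the table need no stall.
                gap = LATENCY.get((producer, kind), 0)
                earliest = max(earliest, issued + 1 + gap)
        clock = earliest
        clocks.append(clock)
        if dest is not None:
            last_writer[dest] = (clock, kind)
    return clocks


LOOP = [
    ("load", "F0", ["R1"]),          # LD   F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("store", None, ["R1", "F4"]),   # SD   0(R1),F4
    ("int", "R1", ["R1"]),           # SUBI R1,R1,8
    ("int", None, ["R1"]),           # BNEZ R1,Loop (modeled as an integer op)
    ("int", None, []),               # delay slot
]

print(schedule(LOOP))
```

The last entry of the returned list is the length of one loop iteration in clocks; gaps between consecutive entries are the stall cycles.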

Revised FP Loop Minimizing Stalls
    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop    ;delayed branch
    6       SD    8(R1),F4   ;altered when move past SUBI
• Swap BNEZ and SD by changing address of SD
    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
• 6 clocks: Unroll the loop 4 times to make the code faster?

Unroll Loop Four Times (straightforward way)
    1  Loop: LD    F0,0(R1)        ;1 cycle stall
    2        ADDD  F4,F0,F2        ;2 cycles stall
    3        SD    0(R1),F4        ;drop SUBI & BNEZ
    4        LD    F6,-8(R1)
    5        ADDD  F8,F6,F2
    6        SD    -8(R1),F8       ;drop SUBI & BNEZ
    7        LD    F10,-16(R1)
    8        ADDD  F12,F10,F2
    9        SD    -16(R1),F12     ;drop SUBI & BNEZ
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2
    12       SD    -24(R1),F16
    13       SUBI  R1,R1,#32       ;alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP
• 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
• Assumes R1 is multiple of 4
• Rewrite loop to minimize stalls?
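
The unrolling transformation itself can be sketched at source level. The two functions below are hypothetical Python stand-ins for the assembly loop and its four-way unrolled form (names and the list-based memory model are assumptions); they show that the unrolled body does the same work with one quarter of the loop-overhead steps, provided the element count is a multiple of 4.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, walking from the last element down."""
    x = list(x)
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s   # LD / ADDD / SD
        i -= 1            # SUBI, then BNEZ on every element
    return x


def add_scalar_unrolled4(x, s):
    """Unrolled four times: four LD/ADDD/SD groups per pointer update and branch."""
    x = list(x)
    if len(x) % 4 != 0:
        # Same restriction as the slide: no cleanup loop is provided.
        raise ValueError("element count must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4            # single SUBI/BNEZ pair per four elements
    return x


data = [float(v) for v in range(8)]
print(add_scalar_unrolled4(data, 2.5) == add_scalar(data, 2.5))
```

A production compiler would normally emit a short cleanup loop for the leftover elements instead of rejecting the input; that part is omitted here.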

Unrolled Loop That Minimizes Stalls
    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16       ; 8-32 = -24
• 14 clock cycles, or 3.5 per iteration
• What assumptions made when moved code?
  – OK to move store past SUBI even though changes register
  – OK to move loads before stores: get right data?
  – When is it safe for compiler to do such changes?

Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Two main variations: Superscalar and VLIW
• Superscalar: varying no. instructions/cycle (1 to 6)
  – Parallelism and dependencies determined/resolved by HW
  – IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100
• Very Long Instruction Words (VLIW): fixed number of instructions (16), parallelism determined by compiler
  – Pipeline is exposed; compiler must schedule delays to get right result
• Explicit Parallel Instruction Computer (EPIC)/Intel
  – 128 bit packets containing 3 instructions (can execute sequentially)
  – Can link 128 bit packets together to allow more parallelism
  – Compiler determines parallelism, HW checks dependencies and forwards/stalls

Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
  – Fetch 64 bits/clock cycle; Int on left, FP on right
  – Can only issue 2nd instruction if 1st instruction issues
  – More ports for FP registers to do FP load & FP op in a pair
    Type              Pipe stages
    Int. instruction  IF  ID  EX  MEM  WB
    FP instruction    IF  ID  EX  MEM  WB
    Int. instruction      IF  ID  EX   MEM  WB
    FP instruction        IF  ID  EX   MEM  WB
    Int. instruction          IF  ID   EX   MEM  WB
    FP instruction            IF  ID   EX   MEM  WB
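
The pairing rule can be illustrated with a toy issue counter. This is a sketch that assumes no data or structural hazards and models only the "integer on the left, FP on the right" constraint; the function name and the "int"/"fp" labels are made up for the example.

```python
def issue_cycles(kinds):
    """Count issue cycles for an in-order stream of "int" / "fp" instructions.

    Two instructions issue together only when an integer-type instruction is
    immediately followed by an FP instruction; hazards are ignored.
    """
    cycles = 0
    i = 0
    while i < len(kinds):
        if kinds[i] == "int" and i + 1 < len(kinds) and kinds[i + 1] == "fp":
            i += 2  # dual issue: int slot plus FP slot
        else:
            i += 1  # single issue
        cycles += 1
    return cycles


perfect_mix = ["int", "fp"] * 4
clustered = ["int", "int", "fp", "fp"]
print(issue_cycles(perfect_mix) / len(perfect_mix))  # CPI for an alternating stream
print(issue_cycles(clustered) / len(clustered))      # CPI when the mix is clustered
```

Only a perfectly alternating stream reaches the two-per-clock rate; any other ordering leaves issue slots empty.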

Loop Unrolling in Superscalar
    Integer instruction        FP instruction     Clock cycle
    Loop: LD  F0,0(R1)                            1
          LD  F6,-8(R1)                           2
          LD  F10,-16(R1)      ADDD F4,F0,F2      3
          LD  F14,-24(R1)      ADDD F8,F6,F2      4
          LD  F18,-32(R1)      ADDD F12,F10,F2    5
          SD  0(R1),F4         ADDD F16,F14,F2    6
          SD  -8(R1),F8        ADDD F20,F18,F2    7
          SD  -16(R1),F12                         8
          SD  -24(R1),F16                         9
          SUBI R1,R1,#40                          10
          BNEZ R1,LOOP                            11
          SD  -32(R1),F20                         12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration

Limits of Superscalar
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• If more instructions issue at same time, greater difficulty of decode and issue
  – Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word can execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
    » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  – Need compiling technique that schedules across several branches

Loop Unrolling in VLIW
    Memory reference 1   Memory reference 2   FP operation 1     FP op. 2           Int. op/branch   Clock
    LD F0,0(R1)          LD F6,-8(R1)                                                                1
    LD F10,-16(R1)       LD F14,-24(R1)                                                              2
    LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                       3
    LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                     4
                                              ADDD F20,F18,F2    ADDD F24,F22,F2                     5
    SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                        6
    SD -16(R1),F12       SD -24(R1),F16                                                              7
    SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#48   8
    SD 0(R1),F28                                                                    BNEZ R1,LOOP     9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers in VLIW (EPIC => 128 int + 128 FP)

Software Pipelining
• Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

Software Pipelining Example
Before: Unrolled 3 times
    1   LD    F0,0(R1)
    2   ADDD  F4,F0,F2
    3   SD    0(R1),F4
    4   LD    F6,-8(R1)
    5   ADDD  F8,F6,F2
    6   SD    -8(R1),F8
    7   LD    F10,-16(R1)
    8   ADDD  F12,F10,F2
    9   SD    -16(R1),F12
    10  SUBI  R1,R1,#24
    11  BNEZ  R1,LOOP
After: Software Pipelined
    1   SD    0(R1),F4      ; Stores M[i]
    2   ADDD  F4,F0,F2      ; Adds to M[i-1]
    3   LD    F0,-16(R1)    ; Loads M[i-2]
    4   SUBI  R1,R1,#8
    5   BNEZ  R1,LOOP
[Figure: overlapped ops plotted against time, SW pipeline vs. loop unrolled]
• Symbolic Loop Unrolling
  – Maximize result-use distance
  – Less code space than unrolling
  – Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling
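
A source-level sketch may help show where the fill (prologue) and drain (epilogue) code comes from. The function below is a hypothetical Python rendering of the software-pipelined kernel; for readability it walks the array with increasing indices rather than the decreasing addresses used in the assembly, and the variable names `f0` and `f4` simply mirror the registers.

```python
def add_scalar_sw_pipelined(m, s):
    """Add s to every element using a software-pipelined loop body.

    Each kernel iteration stores element i, adds for element i+1,
    and loads element i+2, like the SD / ADDD / LD kernel on the slide.
    """
    n = len(m)
    out = list(m)
    if n < 3:
        # Too short to fill the pipeline; fall back to the plain loop.
        return [v + s for v in m]

    # Prologue (fill): start the first two iterations.
    f0 = m[0]          # load for iteration 0
    f4 = f0 + s        # add for iteration 0
    f0 = m[1]          # load for iteration 1

    # Kernel: one store, one add, one load, each from a different iteration.
    for i in range(n - 2):
        out[i] = f4        # SD:   finishes iteration i
        f4 = f0 + s        # ADDD: middle of iteration i+1
        f0 = m[i + 2]      # LD:   starts iteration i+2

    # Epilogue (drain): finish the last two iterations.
    out[n - 2] = f4
    f4 = f0 + s
    out[n - 1] = f4
    return out


print(add_scalar_sw_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

The store never has to wait on the add in the same kernel iteration, because the value it writes was computed one iteration earlier.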

Software Pipelining with Loop Unrolling in VLIW
    Memory reference 1   Memory reference 2   FP operation 1     FP op. 2   Int. op/branch   Clock
    LD F0,-48(R1)        ST 0(R1),F4          ADDD F4,F0,F2                                  1
    LD F6,-56(R1)        ST -8(R1),F8         ADDD F8,F6,F2                 SUBI R1,R1,#24   2
    LD F10,-40(R1)       ST 8(R1),F12         ADDD F12,F10,F2               BNEZ R1,LOOP     3
• Software pipelined across 9 iterations of original loop
  – In each iteration of above loop, we:
    » Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
    » Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
    » Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)
• 9 results in 9 cycles, or 1 clock per iteration
• Average: 3.3 ops per clock, 66% efficiency
• Note: Need fewer registers for software pipelining (only using 7 registers here, was using 15)

Administrivia
• Dynamic scheduling techniques discussed in the other Hennessy & Patterson book:
  – “Computer Architecture: A Quantitative Approach”
  – Chapter 4
• Get moving on Lab 5!
  – This lab is even harder than Lab 4.
  – Trickier to debug…!

Can we use HW to get CPI closer to 1?
• Why in HW at run time?
  – Works when can’t know real dependence at compile time
  – Compiler simpler
  – Code for one machine runs well on another
• Key idea: Allow instructions behind stall to proceed:
        DIVD  F0,F2,F4
        ADDD  F10,F0,F8
        SUBD  F12,F8,F14
• Out of order execution => out of order completion.
• Disadvantages?
  – Complexity
  – Precise interrupts harder! (Talk about this next time)

Problems?
• How do we prevent WAR and WAW hazards?
• How do we deal with variable latency?
  – Forwarding for RAW hazards harder.
[Figure: instruction sequence with RAW and WAR dependences marked]

Scoreboard: a bookkeeping technique
• Out of order execution divides ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
• Scoreboards date to CDC 6600 in 1963
• Instructions execute whenever not dependent on previous instructions and no hazards.
• CDC 6600: In order issue, out of order execution, out of order commit (or completion)
  – No forwarding!
  – Imprecise interrupt/exception model for now

Scoreboard Architecture (CDC 6600)
[Figure: Registers connected to the functional units (FP Mult, FP Divide, FP Add, Integer) and Memory, all controlled by the SCOREBOARD]

Scoreboard Implications
• Out of order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Stall writeback until registers have been read
  – Read registers only during Read Operands stage
• Solution for WAW:
  – Detect hazard and stall issue of new instruction until other instruction completes
• No register renaming!
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control
• Issue: decode instructions & check for structural hazards (ID1)
  – Instructions issued in program order (for hazard checking)
  – Don’t issue if structural hazard
  – Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
• Read operands: wait until no data hazards, then read operands (ID2)
  – All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data.
  – No forwarding of data in this model!

Four Stages of Scoreboard Control
• Execution: operate on operands (EX)
  – The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
• Write result: finish execution (WB)
  – Stall until no WAR hazards with previous instructions:
    Example:
        DIVD  F0,F2,F4
        ADDD  F10,F0,F8
        SUBD  F8,F8,F14
    CDC 6600 scoreboard would stall SUBD until ADDD reads operands
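
The three hazard classes can be stated as simple register comparisons between an earlier and a later instruction. The sketch below applies them to the DIVD/ADDD/SUBD example; the `Instr` tuple and the `hazards` function are illustrative names, not something defined in the lecture.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Classify register hazards between two instructions in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F8", ("F8", "F14"))

print(hazards(divd, addd))  # ADDD needs F0 from the slow DIVD
print(hazards(addd, subd))  # SUBD must not overwrite F8 before ADDD reads it
```

ADDD is stuck behind DIVD through the RAW hazard on F0, so SUBD can finish first; the WAR hazard on F8 is what forces the scoreboard to hold back SUBD's write.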

Three Parts of the Scoreboard
• Instruction status: which of 4 steps the instruction is in
• Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
    Busy:    Indicates whether the unit is busy or not
    Op:      Operation to perform in the unit (e.g., + or –)
    Fi:      Destination register
    Fj, Fk:  Source register numbers
    Qj, Qk:  Functional units producing source registers Fj, Fk
    Rj, Rk:  Flags indicating when Fj, Fk are ready
• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

Scoreboard Example

Detailed Scoreboard Pipeline Control
Instruction status: Issue
  Wait until: Not busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← ‘D’; Fj(FU) ← ‘S1’; Fk(FU) ← ‘S2’; Qj ← Result(‘S1’); Qk ← Result(‘S2’); Rj ← not Qj; Rk ← not Qk; Result(‘D’) ← FU
Instruction status: Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No
Instruction status: Execution complete
  Wait until: Functional unit done
Instruction status: Write result
  Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
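
The wait conditions and bookkeeping can be sketched as a small Python class. This is a minimal model under stated assumptions: it tracks only the functional unit status and register result status, does not model execution time or the instruction status table, clears Qj/Qk when a result is written (a tidy-up the table does not spell out), and uses hypothetical unit names ("Div", "Add1", "Add2", "Mult") so that the WAR example can issue without a structural stall.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units that will produce fj / fk
    qk: Optional[str] = None
    rj: bool = False           # operand ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names):
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in fu_names}
        self.result: Dict[str, str] = {}  # register -> unit that will write it

    def can_issue(self, fu, dest):
        # Structural hazard check and WAW check.
        return not self.fu[fu].busy and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        if not self.can_issue(fu, dest):
            return False
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(True, op, dest, s1, s2, qj, qk,
                               qj is None, qk is None)
        self.result[dest] = fu
        return True

    def read_operands(self, fu):
        st = self.fu[fu]
        if not (st.busy and st.rj and st.rk):
            return False           # RAW hazard: wait for producers
        st.rj = st.rk = False      # operands have been read
        return True

    def can_write_result(self, fu):
        # WAR check: no unit may still need to read the old value of Fi(FU).
        fi = self.fu[fu].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, fu):
        if not self.can_write_result(fu):
            return False
        for f in self.fu.values():
            if f.qj == fu:
                f.qj, f.rj = None, True
            if f.qk == fu:
                f.qk, f.rk = None, True
        del self.result[self.fu[fu].fi]
        self.fu[fu] = FUStatus()
        return True


sb = Scoreboard(["Div", "Add1", "Add2", "Mult"])
sb.issue("Div", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Add2")
print(sb.write_result("Add2"))  # blocked while ADDD still has to read F8
```

The caller is expected to invoke the steps in order for each unit (issue, read operands, then write result once execution would have finished); the class only enforces the hazard conditions from the table.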

Scoreboard Example: Cycle 1

Scoreboard Example: Cycle 2
• Issue 2nd LD?

Scoreboard Example: Cycle 3
• Issue MULT?

Scoreboard Example: Cycle 4

Scoreboard Example: Cycle 5

Scoreboard Example: Cycle 6

Scoreboard Example: Cycle 7
• Read multiply operands?

Scoreboard Example: Cycle 8a (First half of clock cycle)

Scoreboard Example: Cycle 8b (Second half of clock cycle)

Scoreboard Example: Cycle 9
• Note Remaining
• Read operands for MULT & SUB? Issue ADDD?

Scoreboard Example: Cycle 10

Scoreboard Example: Cycle 11

Scoreboard Example: Cycle 12
• Read operands for DIVD?

Scoreboard Example: Cycle 13

Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 17
• WAR Hazard!
• Why not write result of ADD???

Review: Scoreboard Example: Cycle 62
• In-order issue; out-of-order execute & commit

CDC 6600 Scoreboard
• Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
  – No forwarding hardware
  – Limited to instructions in basic block (small window)
  – Small number of functional units (structural hazards), especially integer/load store units
  – Do not issue on structural hazards
  – Wait for WAR hazards
  – Prevent WAW hazards

Another Dynamic Algorithm: Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  – IBM has 4 FP registers vs. 8 in CDC 6600
  – IBM has memory-register ops
• Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard;
  – FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming;
  – avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Load and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Tomasulo Organization
[Figure: the FP Op Queue and FP Registers feed the Reservation Stations (Add 1 to Add 3 for the FP adders, Mult 1 and Mult 2 for the FP multipliers); Load Buffers (Load 1 to Load 6) take data from memory and Store Buffers send data to memory; results are broadcast on the Common Data Bus (CDB)]

Reservation Station Components
    Op:      Operation to perform in the unit (e.g., + or –)
    Vj, Vk:  Value of source operands
             – Store buffers have a V field, the result to be stored
    Qj, Qk:  Reservation stations producing source registers (value to be written)
             – Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
             – Store buffers only have Qi for RS producing result
    Busy:    Indicates reservation station or FU is busy
• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast
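
The three stages can be sketched with reservation stations that hold either a value (V) or the tag of the producing station (Q). This is a simplified model under stated assumptions: one result is broadcast at a time, execution latency and load/store buffers are not modeled, and the class, method, and station names are illustrative only.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Optional

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


@dataclass
class Station:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of stations still producing operands
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs: Dict[str, Station] = {n: Station() for n in station_names}
        self.regs: Dict[str, float] = dict(regs)
        self.reg_status: Dict[str, str] = {}  # register -> producing station

    def _operand(self, reg):
        tag = self.reg_status.get(reg)
        if tag is None:
            return self.regs[reg], None   # value available: copy it now
        return None, tag                  # renamed: wait for this tag

    def issue(self, station, op, dest, s1, s2):
        if self.rs[station].busy:
            return False                  # structural hazard
        vj, qj = self._operand(s1)
        vk, qk = self._operand(s2)
        self.rs[station] = Station(True, op, vj, vk, qj, qk)
        self.reg_status[dest] = station   # later readers wait on this tag
        return True

    def ready(self, station):
        st = self.rs[station]
        return st.busy and st.qj is None and st.qk is None

    def write_result(self, station):
        """Execute a ready station and broadcast (tag, value) on the CDB."""
        if not self.ready(station):
            raise RuntimeError("operands not ready")
        st = self.rs[station]
        value = OPS[st.op](st.vj, st.vk)
        for other in self.rs.values():    # waiting stations capture the value
            if other.qj == station:
                other.vj, other.qj = value, None
            if other.qk == station:
                other.vk, other.qk = value, None
        for reg, tag in list(self.reg_status.items()):
            if tag == station:            # only the latest writer updates the register
                self.regs[reg] = value
                del self.reg_status[reg]
        self.rs[station] = Station()      # mark reservation station available
        return value


t = Tomasulo(["Add1", "Add2", "Mult1"],
             {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 3.0, "F10": 0.0, "F14": 1.0})
t.issue("Mult1", "DIVD", "F0", "F2", "F4")
t.issue("Add1", "ADDD", "F10", "F0", "F8")   # waits on Mult1, copies F8 now
t.issue("Add2", "SUBD", "F8", "F8", "F14")
print(t.write_result("Add2"))                # may complete before ADDD runs
```

Because ADDD copied the old value of F8 into its reservation station at issue, SUBD can write F8 immediately; the WAR stall seen with the scoreboard does not arise.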

Tomasulo Example Cycle 2
• Note: Unlike 6600, can have multiple loads outstanding

Tomasulo Example Cycle 3
• Note: register names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard
• Load 1 completing; what is waiting for Load 1?

Tomasulo Example Cycle 4
• Load 2 completing; what is waiting for Load 1?

Tomasulo Example Cycle 7
• Add 1 completing; what is waiting for it?

Tomasulo Example Cycle 10
• Add 2 completing; what is waiting for it?

Tomasulo Example Cycle 11
• Write result of ADDD here vs. scoreboard?
• All quick instructions complete in this cycle!

Tomasulo Example Cycle 56
• Mult 2 is completing; what is waiting for it?

Tomasulo Example Cycle 57
• Once again: In-order issue, out-of-order execution and completion.

Compare to Scoreboard Cycle 62
• Why take longer on scoreboard/6600?
  – Structural hazards
  – Lack of forwarding

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
    IBM 360/91 (Tomasulo)             CDC 6600 (Scoreboard)
    Pipelined Functional Units        Multiple Functional Units
    (6 load, 3 store, 3 +, 2 x/÷)     (1 load/store, 1 +, 2 x, 1 ÷)
    window size: ≤ 14 instructions    ≤ 5 instructions
    No issue on structural hazard     same
    WAR: renaming avoids              stall completion
    WAW: renaming avoids              stall issue
    Broadcast results from FU         Write/read registers
    Control: reservation stations     central scoreboard

Tomasulo Drawbacks
• Complexity
  – delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
  – Multiple CDBs => more FU logic for parallel assoc stores

Summary #1
• HW exploiting ILP
  – Works when can’t know dependence at compile time.
  – Code for one machine runs well on another
• Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
  – Enables out of order execution => out of order completion
  – ID stage checked both for structural & data dependencies
  – Original version didn’t handle forwarding.
  – No automatic register renaming

Summary #2
• Reservation stations: renaming to larger set of registers + buffering source operands
  – Prevents registers as bottleneck
  – Avoids WAR, WAW hazards of Scoreboard
  – Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Helps cache misses as well
• Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP PA 8000; Alpha 21264