Graduate Computer Architecture
Lec 6 – Dynamically Scheduled Instruction Processing
Shih-Hao Hung
Computer Science & Information Engineering, National Taiwan University
Fall 2005
Adapted from Prof. D. Patterson & Culler’s CS 252 Spring 2005 class notes (2/8/05)
Copyright 2005 UCB
What stops instruction issue?
• Creation of a new binding
[Figure: Instr. Fetch, then Issue & Resolve, then op fetch and ex in each FU, with a Scoreboard alongside]
    Add r1 := r2 + r3
    Add r2 := r2 + 4
    Lod r5 := mem[r1+16]
    Lod r6 := mem[r1+32]
    Mul r7 := r5 * r6
    Bnz r1, foo
    Sub r7 := r0 - r0
    …   := r7
Review: Software Pipelining Example
Before: Unrolled 3 times
    LD   F0, 0(R1)
    ADDD F4, F0, F2
    SD   0(R1), F4
    LD   F6, -8(R1)
    ADDD F8, F6, F2
    SD   -8(R1), F8
    LD   F10, -16(R1)
    ADDD F12, F10, F2
    SD   -16(R1), F12
    SUBI R1, R1, #24
    BNEZ R1, LOOP
After: Software Pipelined
    SD   0(R1), F4     ; Stores M[i]
    ADDD F4, F0, F2    ; Adds to M[i-1]
    LD   F0, -16(R1)   ; Loads M[i-2]
    SUBI R1, R1, #8
    BNEZ R1, LOOP
• Symbolic Loop Unrolling
  – Maximize result-use distance
  – Less code space than unrolling
  – Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling
• 5 cycles per iteration
[Figure: overlapped ops vs. time, SW Pipeline compared with Loop Unrolled]
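The store/add/load overlap in the software-pipelined loop can be sketched in Python. This is only an illustration, not the slide's DLX code: the function names are hypothetical, the array is walked upward rather than downward through R1, and the prologue and epilogue that the slide omits are written out explicitly.

```python
def add_scalar_plain(m, c):
    """Reference loop: load, add, store for each element in turn."""
    out = list(m)
    for i in range(len(out)):
        out[i] = out[i] + c
    return out


def add_scalar_pipelined(m, c):
    """Software-pipelined form: each steady-state pass stores element i,
    adds for element i+1, and loads element i+2."""
    n = len(m)
    if n < 2:
        return add_scalar_plain(m, c)   # too short to pipeline
    out = list(m)
    # Prologue: fill the pipe once.
    loaded = out[0]            # LD   for element 0
    summed = loaded + c        # ADDD for element 0
    loaded = out[1]            # LD   for element 1
    # Steady state: three different iterations overlap in one pass.
    for i in range(n - 2):
        out[i] = summed        # SD   (element i)
        summed = loaded + c    # ADDD (element i+1)
        loaded = out[i + 2]    # LD   (element i+2)
    # Epilogue: drain the pipe once.
    out[n - 2] = summed
    out[n - 1] = loaded + c
    return out


if __name__ == "__main__":
    print(add_scalar_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

Within one steady-state pass no statement consumes a value produced in that same pass, which is the "maximize result-use distance" point made on the slide.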
Can we use HW to get CPI closer to 1?
• Why in HW at run time?
  – Works when can’t know real dependence at compile time
  – Compiler simpler
  – Code for one machine runs well on another
• Key idea: Allow instructions behind stall to proceed
    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14
• Out-of-order execution => out-of-order completion.
Problems?
• How do we prevent WAR and WAW hazards?
• How do we deal with variable latency?
  – Forwarding for RAW hazards harder.
[Figure labels: RAW, WAR]
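As a reminder of what the three hazard names mean, here is a minimal sketch that classifies the hazards between an earlier and a later instruction. The `Instr` tuple and the `hazards` function are hypothetical names introduced for illustration, and the register operands in the usage lines are example values only.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of sources.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def hazards(first, second):
    """Return the hazards that `second` has with respect to the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first still has to read
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


if __name__ == "__main__":
    div = Instr("DIVD", "F0", ("F2", "F4"))
    add = Instr("ADDD", "F10", ("F0", "F8"))
    print(hazards(div, add))
```

Only RAW is a true data dependence. WAR and WAW come from reusing a register name, which is why renaming can remove them.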
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Stall writeback until registers have been read
  – Read registers only during Read Operands stage
• Solution for WAW:
  – Detect hazard and stall issue of new instruction until other instruction completes
• No register renaming!
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages
Missing the boat on loops
    1 Loop: LD   F0, 0(R1)
    2       stall
    3       ADDD F4, F0, F2
    4       SUBI R1, R1, 8
    5       BNEZ R1, Loop    ; delayed branch
    6       SD   8(R1), F4   ; altered when moved past SUBI
• Even if all loop iterations are independent:
  – Recursion on the iteration variable
  – Output dependence and anti-dependence with each dest register
• All iterations use the same register names!
What do registers offer?
• Short, absolute name for a recently computed (or frequently used) value
• Fast, high bandwidth storage in the datapath
• Means of broadcasting a computed value to set of instructions that use the value
  – Later in time or spread out in space…
Another Dynamic Algorithm: Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  – IBM has 4 FP registers vs. 8 in CDC 6600
  – IBM has memory-register ops
• Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
Register Renaming (Conceptual)
• Imagine if each write to register Ri created a new instance of that register
  – kth instance Ri.k
• Later references to source register treated as Ri.k
• Next use as a destination creates Ri.k+1
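The Ri.k idea can be tried out with a few lines of Python. This is a conceptual sketch only: the `rename` function and the tuple encoding of instructions (destination first, then sources) are assumptions made for illustration, and instance 0 stands for the value a register held before the sequence started.

```python
def rename(program):
    """Rewrite (dest, src, ...) tuples so every register name carries an
    instance number: a write to Ri creates Ri.k+1, reads see the current Ri.k."""
    version = {}   # register name -> current instance number k

    def current(reg):
        return f"{reg}.{version.get(reg, 0)}"

    renamed = []
    for dest, *srcs in program:
        new_srcs = [current(s) for s in srcs]       # read sources first
        version[dest] = version.get(dest, 0) + 1    # then create a new instance
        renamed.append((current(dest), *new_srcs))
    return renamed


if __name__ == "__main__":
    # Two iterations of a load/add pair that reuse the names F0 and F4.
    body = [("F0", "R1"), ("F4", "F0", "F2")]
    for instr in rename(body + body):
        print(instr)
```

After renaming, the second iteration writes F0.2 and F4.2 rather than F0.1 and F4.1, so the output and anti-dependences between iterations disappear and only the true dependences remain.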
Register Renaming (less Conceptual)
• Separate the functions of the register
• Reg identifier in instruction is mapped to “physical register” id for current instance of the register
  – Physical reg set may be larger than allocated
• What are the rules for allocating / deallocating physical registers?
[Figure: ifetch (op, rs, rt, rd), then renam (architected reg’s mapped to physical data regs: op, R[rs], R[rt]), then opfetch (op, Vs, Vt)]
Reg renaming
• Source reg s:
  – physical reg P = R[s]
• Destination reg d:
  – Old physical register R[d] “terminates”
  – R[d] := get_free
• Free physical register when
  – No longer referenced by any architected register (terminated)
  – No incomplete instructions waiting to read it
    » Easy with in-order
    » Out of order?
[Figure: ifetch (op, rs, rt, rd), then renam (op, R[rs], R[rt]), then opfetch (op, Vs, Vt)]
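The allocate/free rules above can be sketched as a small rename table with a free list. Everything here is an assumption made for illustration: the `RenameTable` class, its method names, and the reader counters are hypothetical, and the sketch ignores the case where the instruction that produces a terminated register is still in flight.

```python
class RenameTable:
    """Maps architected register names to physical register numbers."""

    def __init__(self, arch_regs, num_physical):
        assert num_physical >= len(arch_regs)
        self.map = {r: i for i, r in enumerate(arch_regs)}       # R[...]
        self.free = list(range(len(arch_regs), num_physical))    # free list
        self.terminated = set()   # no architected register names these any more
        self.readers = {}         # physical reg -> instructions still to read it

    def rename(self, dest, srcs):
        """Rename one instruction; returns (physical dest, physical sources)."""
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        phys_srcs = [self.map[s] for s in srcs]      # P = R[s]
        for p in phys_srcs:
            self.readers[p] = self.readers.get(p, 0) + 1
        old = self.map[dest]
        self.terminated.add(old)                     # old R[d] terminates
        self.map[dest] = self.free.pop(0)            # R[d] := get_free
        self._try_free(old)
        return self.map[dest], phys_srcs

    def operands_read(self, phys_srcs):
        """Call once an instruction has read its source operands."""
        for p in phys_srcs:
            self.readers[p] -= 1
            self._try_free(p)

    def _try_free(self, p):
        # Free only when terminated AND no instruction is waiting to read it.
        if p in self.terminated and self.readers.get(p, 0) == 0:
            self.terminated.discard(p)
            self.free.append(p)
```

With in-order operand reads the counters drain in program order, which is the "easy" case on the slide. With out-of-order execution the counters (or an equivalent mechanism) are what keep a register from being recycled while an older instruction still needs it.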
Temporary renaming
• Value “currently” bound to register is not present in the register file, instead…
• To be produced by particular instruction in the datapath
  – Designated by function unit that will produce value, or
  – Nearest matching instruction ahead in the datapath (in-order), or
  – With an associated “tag”
Broadcasting result value
• Series of instructions issued and waiting for value to be produced by logically preceding instruction.
• CDC 6600 has each come back and read the value once it is placed in register file
• Alternative: broadcast value and reg # to all the waiting instructions
  – Those that match grab the value
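The broadcast alternative amounts to an associative match on a tag. A minimal sketch, assuming each waiting operand slot is a small dictionary with a `tag` and a `value` field (a hypothetical encoding, not the hardware layout):

```python
def broadcast(waiting, tag, value):
    """Send (tag, value) to every waiting operand slot; slots whose tag
    matches capture the value. Returns how many slots captured it."""
    captured = 0
    for slot in waiting:
        if slot["tag"] == tag:
            slot["value"] = value
            slot["tag"] = None    # None means the value is now present
            captured += 1
    return captured


if __name__ == "__main__":
    waiting = [
        {"tag": "Add1", "value": None},
        {"tag": "Mult1", "value": None},
        {"tag": "Add1", "value": None},
    ]
    print(broadcast(waiting, "Add1", 3.5), waiting)
```

In hardware every slot compares against the broadcast tag in parallel. The loop here only models the outcome of that comparison.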
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  – FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  – Avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Tomasulo Organization
[Figure: FP Op Queue and FP Registers feed the Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers). Load Buffers (Load1 to Load6) bring data From Mem; Store Buffers send data To Mem. The Common Data Bus (CDB) connects the results back to the stations, registers, and store buffers.]
Reservation Station Components
• Op: Operation to perform in the unit (e.g., + or –)
• Vj, Vk: Values of source operands
  – Store buffers have a V field: the result to be stored
• Qj, Qk: Reservation stations producing source registers (value to be written)
  – Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
  – Store buffers only have Qi for RS producing result
• Busy: Indicates reservation station or FU is busy
• Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
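The fields listed above map naturally onto a small record. The sketch below is an assumption-laden illustration: the `ReservationStation` class and the `issue` helper are hypothetical names, `None` stands in for "Qj, Qk = 0", and the register result status is modeled as a dictionary from register name to station name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source, if already available
    vk: Optional[float] = None   # value of second source, if already available
    qj: Optional[str] = None     # station producing first source (None = ready)
    qk: Optional[str] = None     # station producing second source (None = ready)

    def ready(self):
        """No separate ready flags: both Q fields empty means ready."""
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Fill a free station with values or producer tags; returns False on a
    structural hazard (station already busy)."""
    if rs.busy:
        return False
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j)
    rs.vj = regs[src_j] if rs.qj is None else None
    rs.qk = reg_status.get(src_k)
    rs.vk = regs[src_k] if rs.qk is None else None
    reg_status[dest] = rs.name   # this station will write dest
    return True
```

For example, issuing an add that writes F0 and then a multiply that reads F0 leaves the multiply's `qj` holding the add station's name instead of a value, which is the renaming step performed at issue.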
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast
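The three stages can be put together in a very small cycle-by-cycle model. This is a rough sketch under stated assumptions, not the 360/91's timing: the `simulate` function and its argument layout are hypothetical, loads and stores are left out, one instruction issues per cycle, a single CDB carries one result per cycle, latencies are at least 1, and a station that captures an operand may start executing in the same cycle.

```python
def simulate(program, regs, latency, stations):
    """program: list of (op, dest, src_j, src_k), issued in order.
    stations: op -> list of station names usable by that op.
    Returns (final register values, number of cycles)."""
    regs = dict(regs)
    status = {}    # register result status: register -> station tag
    rs = {}        # busy reservation stations: tag -> fields
    queue = list(program)
    ops = {"ADDD": lambda a, b: a + b, "SUBD": lambda a, b: a - b,
           "MULTD": lambda a, b: a * b, "DIVD": lambda a, b: a / b}
    cycle = 0
    while queue or rs:
        cycle += 1
        # 3. Write result: one finished station broadcasts on the CDB.
        done = [tag for tag, s in rs.items() if s["left"] == 0]
        if done:
            tag = done[0]
            s = rs.pop(tag)                       # station becomes available
            value = ops[s["op"]](s["vj"], s["vk"])
            for other in rs.values():             # waiting stations snoop the bus
                for q, v in (("qj", "vj"), ("qk", "vk")):
                    if other[q] == tag:
                        other[q], other[v] = None, value
            for r in [r for r, t in status.items() if t == tag]:
                regs[r] = value                   # only the latest writer updates
                del status[r]
        # 2. Execute: stations with both operands present count down.
        for s in rs.values():
            if s["qj"] is None and s["qk"] is None and s["left"] > 0:
                s["left"] -= 1
        # 1. Issue: in order, only if a suitable station is free.
        if queue:
            op, dest, sj, sk = queue[0]
            free = [n for n in stations[op] if n not in rs]
            if free:
                queue.pop(0)
                qj, qk = status.get(sj), status.get(sk)
                rs[free[0]] = {"op": op, "left": latency[op],
                               "qj": qj, "vj": None if qj else regs[sj],
                               "qk": qk, "vk": None if qk else regs[sk]}
                status[dest] = free[0]            # rename dest to this station
    return regs, cycle


if __name__ == "__main__":
    adders, mults = ["Add1", "Add2", "Add3"], ["Mult1", "Mult2"]
    final, cycles = simulate(
        [("DIVD", "F0", "F2", "F4"),
         ("ADDD", "F10", "F0", "F8"),
         ("SUBD", "F8", "F8", "F4")],
        {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 1.0, "F10": 0.0},
        {"ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40},
        {"ADDD": adders, "SUBD": adders, "MULTD": mults, "DIVD": mults})
    print(final, cycles)
```

In the usage example the SUBD overwrites F8 long before the ADDD executes, yet the ADDD still uses the old F8 because it copied that value into its station at issue. That is the WAR case that the scoreboard handles by stalling.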
Tomasulo Example
Tomasulo Example Cycle 1
Tomasulo Example Cycle 2
• Note: Unlike 6600, can have multiple loads outstanding
Tomasulo Example Cycle 3
• Note: register names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard
• Load 1 completing; what is waiting for Load 1?
Tomasulo Example Cycle 4
• Load 2 completing; what is waiting for Load 2?
Tomasulo Example Cycle 5
Tomasulo Example Cycle 6
• Issue ADDD here vs. scoreboard?
Tomasulo Example Cycle 7
• Add 1 completing; what is waiting for it?
Tomasulo Example Cycle 8
Tomasulo Example Cycle 9
Tomasulo Example Cycle 10
• Add 2 completing; what is waiting for it?
Tomasulo Example Cycle 11
• Write result of ADDD here vs. scoreboard?
• All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Tomasulo Example Cycle 13
Tomasulo Example Cycle 14
Tomasulo Example Cycle 15
Tomasulo Example Cycle 16
Faster than light computation (skip a couple of cycles)
Tomasulo Example Cycle 55
Tomasulo Example Cycle 56
• Mult 2 is completing; what is waiting for it?
Tomasulo Example Cycle 57
• Once again: In-order issue, out-of-order execution and completion.
Compare to Scoreboard Cycle 62
• Why take longer on scoreboard/6600?
  – Structural Hazards
  – Lack of forwarding
Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
    IBM 360/91 (Tomasulo)             | CDC 6600 (Scoreboard)
    Pipelined Functional Units        | Multiple Functional Units
    (6 load, 3 store, 3 +, 2 x/÷)     | (1 load/store, 1 +, 2 x, 1 ÷)
    window size: ≤ 14 instructions    | ≤ 5 instructions
    No issue on structural hazard     | same
    WAR: renaming avoids              | stall completion
    WAW: renaming avoids              | stall issue
    Broadcast results from FU         | Write/read registers
    Control: reservation stations     | central scoreboard
Tomasulo Drawbacks
• Complexity
  – Delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
  – Multiple CDBs => more FU logic for parallel assoc stores
Discussion: Generalize Tomasulo Alg
• Many function units
  – Tag size
• Pipelined function units
  – Track tag through pipeline (like MIPS)
• Multiple instruction issue
  – Serialize the renaming step
  – Linear recurrence (like ripple carry)
  – Generalize to parallel prefix calculation
Discussion: Load/Store ordering
• In 360/91 loads allowed to bypass stores or loads with different addresses
• Stores must wait for “logically preceding” loads and stores to same address
  – Record original program order?
  – Serialize through effective address calculation?
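The ordering rule can be written as a simple address check against earlier, still-pending memory operations. This is a sketch under the assumption that all effective addresses are already known and that `pending` is kept in program order; the function name and the tuple encoding are hypothetical. It also makes a load wait for an earlier store to the same address, which the slide implies but does not state.

```python
def can_access_memory(op, addr, pending):
    """op: "load" or "store". pending: earlier, uncompleted memory operations
    as (op, addr) tuples in program order. Returns True if `op` may go ahead."""
    for earlier_op, earlier_addr in pending:
        if earlier_addr != addr:
            continue               # different address: free to bypass
        if op == "store" or earlier_op == "store":
            return False           # same address and at least one is a store
    return True                    # load after load to same address is harmless


if __name__ == "__main__":
    pending = [("store", 100), ("load", 200)]
    print(can_access_memory("load", 300, pending))
    print(can_access_memory("store", 200, pending))
```

The check depends on knowing which operations are logically earlier, which is why the slide asks whether to record program order or to serialize through effective address calculation.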
Discussion: interaction with caches?
Reorder Buffer
Summary #1
• HW exploiting ILP
  – Works when can’t know dependence at compile time.
  – Code for one machine runs well on another
• Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
  – Enables out-of-order execution => out-of-order completion
  – ID stage checked both for structural & data dependencies
  – Original version didn’t handle forwarding.
  – No automatic register renaming
Summary #2
• Reservation stations: renaming to larger set of registers + buffering source operands
  – Prevents registers as bottleneck
  – Avoids WAR, WAW hazards of Scoreboard
  – Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Helps cache misses as well
• Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264