COMP 206 Computer Architecture and Implementation Montek Singh

COMP 206: Computer Architecture and Implementation
Montek Singh
Mon., Oct. 14, 2002
Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

Hardware Support for More ILP
• Speculation: allow an instruction that depends on a branch predicted to be taken to issue, with no consequences (including exceptions) if the branch is not actually taken
  – Hardware needs to provide an “undo” operation (squash)
• Often combined with dynamic scheduling
• Tomasulo: separate speculative bypassing of results from real bypassing of results
  – When an instruction is no longer speculative, write its results (instruction commit)
  – Execute out of order but commit in order
  – Examples: PowerPC 620, MIPS R10000, Intel P6, AMD K5, …

Hardware Support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
  – Reorder buffer can be an operand source
  – Once an instruction commits, its result is found in the register
  – 3 fields: instr. type, destination, value
  – Use reorder buffer number instead of reservation station number to tag results
  – Instructions commit in order
  – As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure 3.29, page 228: Instr Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, FP Mult]
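
The "ROB as operand source" point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three entry fields come from the slide, while the `rename` map (register name to ROB number of its latest uncommitted writer) and the function name `read_operand` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    instr_type: str                 # e.g. "alu", "load", "store", "branch"
    dest: str                       # destination register name
    value: Optional[float] = None   # result; None until it has been written


def read_operand(reg, regfile, rename, rob):
    """Return ("value", v) if the operand is available, else ("tag", rob_number).

    `rename` is an assumed register-status table: it maps a register to the
    ROB number of its most recent uncommitted writer, if there is one.
    """
    tag = rename.get(reg)
    if tag is None:
        return ("value", regfile[reg])    # committed value lives in the register
    if rob[tag].value is not None:
        return ("value", rob[tag].value)  # uncommitted result supplied by the ROB
    return ("tag", tag)                   # wait for this ROB number on the CDB


regfile = {"F0": 1.0, "F2": 2.0, "F4": 4.0}
rob = [ROBEntry("alu", "F0", 8.0), ROBEntry("alu", "F2")]
rename = {"F0": 0, "F2": 1}

from_rob = read_operand("F0", regfile, rename, rob)   # result waiting in the ROB
pending = read_operand("F2", regfile, rename, rob)    # producer not finished
from_reg = read_operand("F4", regfile, rename, rob)   # no in-flight writer
```

Note that `F0` is read from the ROB even though the register still holds the old value 1.0; the register is only updated at commit.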

Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
  – If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination; each RS now also has a field for ROB#
2. Execution: operate on operands (EX)
  – When both operands are ready, execute; if not ready, watch the CDB for the result
3. Write result: finish execution (WB)
  – Write on the Common Data Bus to all awaiting RSs and the ROB; mark the RS available
4. Commit: update register with reorder result
  – When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
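
The four steps above can be sketched as a toy Python model. It is a minimal sketch, not the actual hardware algorithm: it assumes two-source FP arithmetic only (no loads, stores, branches, or latencies), ever-increasing ROB numbers instead of slot indices, and the class and method names are hypothetical.

```python
import operator
from collections import deque

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


class SpeculativeTomasulo:
    def __init__(self, regs, rob_size=4):
        self.regs = dict(regs)    # committed register values
        self.rob = deque()        # in-flight ROB entries, head on the left
        self.rob_size = rob_size
        self.next_tag = 0         # ever-increasing ROB numbers (simplification)
        self.status = {}          # register -> ROB# of its latest in-flight writer
        self.stations = []        # occupied reservation stations

    def _entry(self, tag):
        return next(e for e in self.rob if e["tag"] == tag)

    def issue(self, op, dest, src1, src2):
        """Step 1: needs a free ROB slot; send operand values or ROB tags to the RS."""
        if len(self.rob) == self.rob_size:
            return None           # structural hazard: ROB is full
        tag, self.next_tag = self.next_tag, self.next_tag + 1
        st = {"op": op, "rob": tag, "vals": [None, None], "tags": [None, None]}
        for i, src in enumerate((src1, src2)):
            producer = self.status.get(src)
            if producer is None:
                st["vals"][i] = self.regs[src]            # read the register
            elif self._entry(producer)["done"]:
                st["vals"][i] = self._entry(producer)["value"]  # read the ROB
            else:
                st["tags"][i] = producer                  # wait on the CDB
        self.rob.append({"tag": tag, "dest": dest, "value": None, "done": False})
        self.stations.append(st)
        self.status[dest] = tag
        return tag

    def execute(self, tag):
        """Step 2: operate on the operands once both are present."""
        st = next(s for s in self.stations if s["rob"] == tag)
        assert st["tags"] == [None, None], "operands not ready"
        return OPS[st["op"]](*st["vals"])

    def write_result(self, tag, value):
        """Step 3: broadcast (ROB#, value) to waiting RSs and the ROB; free the RS."""
        self.stations = [s for s in self.stations if s["rob"] != tag]
        for s in self.stations:
            for i in range(2):
                if s["tags"][i] == tag:
                    s["vals"][i], s["tags"][i] = value, None
        entry = self._entry(tag)
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Step 4: only the head entry may update the register, and only if done."""
        if not self.rob or not self.rob[0]["done"]:
            return None
        e = self.rob.popleft()
        self.regs[e["dest"]] = e["value"]
        if self.status.get(e["dest"]) == e["tag"]:
            del self.status[e["dest"]]
        return e["tag"]


m = SpeculativeTomasulo({"F0": 0.0, "F2": 2.0, "F4": 4.0,
                         "F6": 6.0, "F8": 0.0, "F10": 0.0})
m.issue("MULTD", "F0", "F2", "F4")    # ROB#0
m.issue("SUBD", "F8", "F6", "F2")     # ROB#1
m.issue("DIVD", "F10", "F0", "F6")    # ROB#2, waits for ROB#0
m.write_result(1, m.execute(1))       # SUBD finishes first (out of order)
early = m.commit()                    # head (MULTD) has no result yet
m.write_result(0, m.execute(0))       # MULTD result wakes up DIVD via the CDB
order = [m.commit(), m.commit()]
m.write_result(2, m.execute(2))
order.append(m.commit())
```

In this run SUBD produces its result before MULTD, yet the registers are updated strictly in program order, which is the "execute out of order, commit in order" behavior described above. The operand values are made up for the example.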

Result Shift Register and Reorder Buffer
• General solution to three problems
  – Precise exceptions
  – Speculative execution
  – Register renaming
• Solution in three steps
  – In-order initiation, out-of-order termination (using RSRa)
  – In-order initiation, in-order termination (using RSRb)
  – In-order initiation, in-order termination, with renaming (using ROB)
• Architectural model
  – Essentially MIPS FP pipeline
  – Add takes 2 clock cycles, multiplication 5, division 10
  – Memory accesses take 1 clock cycle
  – Integer instructions take 1 clock cycle
  – 1 branch delay slot, delayed branches
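
A small Python sketch shows why the latencies in the architectural model lead to out-of-order termination. It rests on strong simplifying assumptions that are not from the slides: one instruction starts per cycle, data dependences and structural hazards are ignored, SUBD is given the add latency, and in-order termination is modeled as "at least one cycle after the previous instruction terminates". It does not model the RSRa or RSRb hardware itself.

```python
# Latencies from the architectural model; SUBD = add latency is an assumption.
LATENCY = {"LD": 1, "ADDI": 1, "ADDD": 2, "SUBD": 2, "MULTD": 5, "DIVD": 10}


def finish_cycles(ops):
    """Cycle in which each op produces its result if op i starts in cycle i."""
    return [start + LATENCY[op] for start, op in enumerate(ops)]


def in_order_termination(finish):
    """Delay each termination until after the previous instruction's."""
    out, prev = [], -1
    for f in finish:
        prev = max(f, prev + 1)
        out.append(prev)
    return out


ops = ["MULTD", "ADDI", "SUBD"]
raw = finish_cycles(ops)               # ADDI and SUBD finish before MULTD
ordered = in_order_termination(raw)    # later instructions are held back
```

With these numbers the short ADDI and SUBD would terminate before the earlier MULTD; forcing in-order termination makes them wait behind it.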

I-O Initiation, O-O Termination (RSRa)
LOOP: LD, LD, MULTD, ADDI, SUBD, DIVD, ADDD, BLEZ, ADDI
Operands: F6, 32(R2); F2, 48(R3); F0, F2, F4; R2, 8; R3, 8; F8, F6, F2; F10, F0; F6, F8, F6; R4, LOOP; R4, 1

I-O Initiation, I-O Termination (RSRb)

Idea Behind ROB
• Combine benefits of early issue and in-order update of state
• Obtained from RSRa by adding a renaming mechanism to it
• Add a FIFO to RSRa (implement as circular buffer)
• When RSRa allows issuing of new instruction
  – Enter instruction at tail of circular buffer
  – Buffer entry has multiple fields
      – [Result; Valid Bit; Destination Register Name; PC value; Exceptions]
• Termination happens when result is produced, broadcast on CDB, written into circular buffer (replace M with T)
  – Written ROB entry can serve as source of operands from now on
• Commit happens when value is moved from circular buffer to register (replace W with C)
  – Happens when instruction reaches head of circular buffer and has completed execution with no exceptions
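
The circular-buffer behavior described above can be sketched in Python as follows. The five entry fields are the ones listed on the slide; the class and method names are hypothetical, and squashing every in-flight entry when the head entry carries an exception is an assumption made for this sketch (the slide only says that commit requires no exceptions).

```python
class CircularROB:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0     # oldest in-flight instruction
        self.tail = 0     # next instruction goes here
        self.count = 0    # number of occupied slots

    def enter(self, dest, pc):
        """Enter an instruction at the tail; return its slot, or None if full."""
        if self.count == len(self.slots):
            return None   # limited capacity: issue must stall
        idx = self.tail
        self.slots[idx] = {"result": None, "valid": False,
                           "dest": dest, "pc": pc, "exception": None}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def terminate(self, idx, result, exception=None):
        """Termination: the result (or an exception) is written into the entry."""
        e = self.slots[idx]
        e["result"], e["valid"], e["exception"] = result, True, exception

    def commit(self, regs):
        """Commit the head entry if it has terminated; otherwise do nothing."""
        if self.count == 0 or not self.slots[self.head]["valid"]:
            return None
        e = self.slots[self.head]
        if e["exception"] is not None:
            # Assumption: squash everything in flight and report the faulting PC.
            self.slots = [None] * len(self.slots)
            self.head = self.tail = self.count = 0
            return ("exception", e["pc"])
        regs[e["dest"]] = e["result"]     # value moves from buffer to register
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return ("committed", e["dest"])


regs = {"F0": 0.0, "F8": 0.0, "F10": 0.0}
rob = CircularROB(2)
a = rob.enter("F0", pc=0)
b = rob.enter("F8", pc=4)
full = rob.enter("F10", pc=8)     # buffer is full
rob.terminate(b, 4.0)             # younger instruction terminates first
first = rob.commit(regs)          # head has not terminated, nothing commits
rob.terminate(a, 8.0)
second = rob.commit(regs)
third = rob.commit(regs)
c = rob.enter("F10", pc=8)        # tail has wrapped around to slot 0
```

The younger instruction's result sits in the buffer (and could serve as an operand source) until the older one at the head commits.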

I-O Initiation, I-O Termination (ROB)
LOOP: LD, LD, MULTD, ADDI, SUBD, DIVD, ADDD, BLEZ, ADDI
Operands: F6, 32(R2); F2, 48(R3); F0, F2, F4; R2, 8; R3, 8; F8, F6, F2; F10, F0; F6, F8, F6; R4, LOOP; R4, 1

States of Circular Buffer
• Entry in yellow is at head of buffer
• Entry in green is tail of buffer, i.e., next instruction goes here
• Greyed instructions have committed
LOOP: LD, LD, MULTD, ADDI, SUBD, DIVD, ADDD, BLEZ, ADDI
Operands: F6, 32(R2); F2, 48(R3); F0, F2, F4; R2, 8; R3, 8; F8, F6, F2; F10, F0; F6, F8, F6; R4, LOOP; R4, 1

Complexity of ROB
• Assume dual-issue superscalar
  – Load/Store machine with three-operand instructions
  – 64 registers
  – 16-entry circular buffer
• Hardware support needed for ROB
  – For each buffer entry
      – One write port
      – Four read ports (two source operands of two instructions)
      – Four 6-bit comparators for associative lookup
  – For each read port
      – 16-way “priority” encoder with wrap-around (to get latest value)
• Limited capacity of ROB is a structural hazard
• Repeated writes to same register actually happen
  – This is not the case in “classical” Tomasulo
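
The hardware counts above follow from simple arithmetic, and the wrap-around "priority" lookup can be mimicked in software. The Python below is a sketch under the slide's assumptions (64 registers, 16 entries, dual issue, two sources per instruction); the function name `latest_match` and the sequential scan are illustrative only, since real hardware does this comparison in parallel.

```python
REGISTERS, ENTRIES, ISSUE_WIDTH, SOURCES_PER_INSTR = 64, 16, 2, 2

tag_bits = (REGISTERS - 1).bit_length()          # bits needed to name a register
read_ports = ISSUE_WIDTH * SOURCES_PER_INSTR     # source operands read per cycle
comparators_per_entry = read_ports               # one comparator per read port
total_comparators = ENTRIES * comparators_per_entry


def latest_match(dests, head, count, reg):
    """Return the slot of the youngest in-flight entry that writes `reg`.

    Scans from the youngest entry back to the head, wrapping around the end
    of the buffer, which is what "priority with wrap-around" has to achieve.
    """
    n = len(dests)
    for age in range(count - 1, -1, -1):   # youngest entry first
        idx = (head + age) % n
        if dests[idx] == reg:
            return idx
    return None


# Occupied slots are 14, 15, 0, 1 (head = 14); F6 is written twice.
dests = [None] * ENTRIES
dests[14], dests[15], dests[0], dests[1] = "F6", "F2", "F6", "F8"
newest_f6 = latest_match(dests, head=14, count=4, reg="F6")
newest_f2 = latest_match(dests, head=14, count=4, reg="F2")
no_writer = latest_match(dests, head=14, count=4, reg="F4")
```

The lookup for F6 must pick slot 0 rather than slot 14: slot 0 is the younger writer even though its index is lower, which is why a plain fixed-priority encoder is not enough once the buffer wraps.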

Example: System Interactions
Memory access time is 200 ns
Which design is fastest?