Lecture 5: Overview of Superscalar Techniques, Cpr E 581

- Slides: 23
Lecture 5: Overview of Superscalar Techniques
Cpr E 581 Computer Systems Architecture, Fall 2009
Zhao Zhang
Reading:
- Textbook, Ch. 2.1
- “Complexity-Effective Superscalar Processors”, Ph.D. thesis by Subbarao Palacharla, Ch. 1

Sequential Execution Model
- Any program execution is “correct” if the final architectural state (register and memory contents) is the same as that produced by sequential execution
- A single-cycle implementation is intuitively correct
- If instructions are not executed sequentially, what is “correct” execution?

Out-of-order Execution
Compared with a sequential execution, an out-of-order execution may
1. Fetch and execute instructions that should not be executed
2. Execute instructions in a different order

Sequential Execution Model
A program execution is correct if
1. The same set of instructions writes to user-visible registers and memory;
2. Each instruction receives the same operands as in the sequential execution; and
3. Any register or memory word receives the value of the last write as in the sequential execution

Dependences and Correctness
Three types of dependences between instructions
- Control dependence
- Data dependence
- Name dependence
Why do we care about dependences?
- Processor hardware can observe those dependences
- By correctly handling the dependences, the three correctness statements will hold

Data Dependence
    LD    F2, 0(R3)
    MULTI F0, F2, F4
    LD    F6, 0(R2)
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADD   F12, F8, F2
[Figure: data dependence graph with nodes LD1, LD2, MULTI, SUBD, DIVD, ADD]
Note: no branch in this code

Data Dependences
Instruction J is dependent on I if
- I’s output is used by J, or
- J is dependent on K, and K is dependent on I
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, LOOP
[Figure: Data Dependence Graph with nodes L.D, ADD.D, S.D, DADDUI, BNE]

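The direct register dependences in the loop body can be derived mechanically by remembering the last writer of each register. The following is a minimal sketch, not the lecture's method: the `(name, dests, sources)` encoding and the function name `raw_edges` are assumptions made for illustration, only one iteration is considered, and memory dependences are ignored.

```python
# Hypothetical encoding: (name, destination registers, source registers).
LOOP = [
    ("L.D",    ["F0"], ["R1"]),
    ("ADD.D",  ["F4"], ["F0", "F2"]),
    ("S.D",    [],     ["F4", "R1"]),
    ("DADDUI", ["R1"], ["R1"]),
    ("BNE",    [],     ["R1", "R2"]),
]

def raw_edges(code):
    """Return (producer, consumer) pairs for direct true dependences through registers."""
    last_writer = {}  # register -> name of the most recent instruction that wrote it
    edges = []
    for name, dests, srcs in code:
        # Read sources first: an instruction does not depend on its own write.
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], name))
        for reg in dests:
            last_writer[reg] = name
    return edges

print(raw_edges(LOOP))
```

Transitive dependences (the second rule on the slide) follow by taking the transitive closure of these edges.
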
Data Dependences
Dependence through registers
[Figure: Load, ALU -> regfile -> Load, ALU, Br, Store]
    ADD r8, r9, r10
    BEQ r8, r11, loop
Dependence through memory
[Figure: SW -> Memory -> LW]
    SW r8, 100(r9)
    LW r10, 100(r9)

Name Dependences
Antidependence (WAR): one instruction overwrites a register or memory location that a prior instruction reads
    LW  R1, 100(R2)
    ADD R2, R3, R4
Output dependence (WAW): two instructions write the same register or memory location
    LW  R1, 100(R2)
    ADD R2, R1, R2
    ADD R1, R3, R4
These dependences can be removed

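The register cases above can be told apart by comparing the destination and source registers of an instruction pair. This is a small sketch under assumptions: each instruction is a hypothetical `(dest, sources)` pair, the function name `classify` is invented here, and intervening writes between the two instructions are not considered.

```python
def classify(first, second):
    """Classify register dependences of `second` on the earlier instruction `first`.

    Each instruction is a hypothetical (dest, sources) pair; dest may be None.
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("RAW")  # true data dependence: second reads what first writes
    if d2 is not None and d2 in s1:
        kinds.add("WAR")  # antidependence: second overwrites what first reads
    if d1 is not None and d1 == d2:
        kinds.add("WAW")  # output dependence: both write the same register
    return kinds

lw = ("R1", ["R2"])            # LW  R1, 100(R2)
add_r2 = ("R2", ["R3", "R4"])  # ADD R2, R3, R4
add_r1 = ("R1", ["R3", "R4"])  # ADD R1, R3, R4
print(classify(lw, add_r2), classify(lw, add_r1))
```

Only the `RAW` case carries a value from one instruction to the other; `WAR` and `WAW` arise from reusing a register name.
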
Dependences vs Hazards
- Dependences are properties of programs
- Hazards are properties of pipelines
- Dependences indicate the potential for hazards
- Pipeline implementations determine the actual hazards and the length of any stall
What hazards are exposed by the MIPS 5-stage pipeline?

Dynamic Scheduling
General idea: when an instruction stalls, look for independent instructions following it
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F12, F8, F14
[Figure: dependence graph with nodes DIV.D, SUB.D, ADD.D]
- Instruction window: how far to look ahead
- Out-of-order execution
- Respect data dependences
What hazards would be exposed?

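The idea of looking past a stalled instruction can be sketched as a scan over a bounded window. This is an illustrative sketch only: the `(name, dest, sources)` encoding, the function name `issuable`, and the `busy_regs` set (registers whose producers have not finished) are assumptions, and only true data dependences are checked, so name dependences are assumed to be handled elsewhere.

```python
def issuable(window, busy_regs, window_size):
    """Return names of instructions within the window whose sources are all available."""
    busy = set(busy_regs)
    ready = []
    for name, dest, srcs in window[:window_size]:
        if not any(reg in busy for reg in srcs):
            ready.append(name)
        # Whether it issues now or waits, its result is not yet available
        # to the younger instructions scanned after it.
        busy.add(dest)
    return ready

# DIV.D F0, F2, F4 is assumed to be executing, so F0 is not yet available.
window = [
    ("ADD.D", "F10", ["F0", "F8"]),   # needs F0: must wait
    ("SUB.D", "F12", ["F8", "F14"]),  # independent of DIV.D
]
print(issuable(window, {"F0"}, window_size=2))
```

With `window_size=1` the scheduler sees only the stalled `ADD.D` and finds nothing to issue, which is why the window size bounds how much parallelism can be found.
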
Data Dependence between Operations
ALU to ALU
    SUBD F8, F6, F2
    ADD  F6, F8, F2
    SUBD  IF ID EX WB
    ADD   IF ID - EX WB
Load and other insts
    LD    F2, 0(R3)
    MULTI F0, F2, F4
    LD    IF ID EX MEM WB
    MULTI IF ID -- EX WB

Dependences between Operations
Store to load
    // R3+100 == R4?
    S.D F6, 100(R3)
    L.D F2, 0(R4)
    S.D  IF ID EX MEM WB
    L.D  IF ID EX -?  MEM
- Register dependences can be detected by matching register indices
- Detecting memory dependences is more difficult

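The difficulty is that a store and a load may name different base registers and still touch the same word, so the check needs the computed addresses rather than the register indices. A minimal sketch, assuming each memory operation is a hypothetical `(offset, base_register)` pair, that both accesses have the same size and alignment, and that `regs` holds the current register values:

```python
def store_load_conflict(store, load, regs):
    """True if the load reads the address the earlier store writes."""
    s_off, s_base = store
    l_off, l_base = load
    # Effective addresses are only known once the base registers are available.
    return regs[s_base] + s_off == regs[l_base] + l_off

store = (100, "R3")  # S.D F6, 100(R3)
load = (0, "R4")     # L.D F2, 0(R4)
print(store_load_conflict(store, load, {"R3": 1000, "R4": 1100}))
print(store_load_conflict(store, load, {"R3": 1000, "R4": 2000}))
```

The same instruction pair conflicts for one set of register values and not for another, which is why the dependence cannot be decided by matching register indices at decode time.
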
Dynamic Scheduling
    L.D   F2, 0(R3)
    MULTI F0, F2, F4
    L.D   F6, 0(R2)
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F12, F8, F2
[Figure: data dependence graph with nodes LD1, LD2, MULTI, SUBD, DIVD, ADD]
How to schedule pipeline operations?

Is This Working?
Inst    IF  ID  Schd  EXE    MEM  WB
L.D     1   2   3     4      5    6
MULT    1   2   3-5   6-11   -    12
L.D     2   3   4     5      6    7
SUB.D   2   3   4-6   7-8    9    10
DIV.D   3   4   5-11  12-31  32   33
ADD.D   3   4   5-8   9-10   11   12
Assume (1) two-way issue; (2) FU delay as implied

Dynamic Scheduling Implementation
[Figure: Wakeup logic over entries I1, I2, I3, …, I_k, feeding SELECT, then to FUs]
Adapted from UCB CS 252 S98, Copyright 1998 UCB
Scoreboarding
- 1966: scoreboarding in CDC 6600
Tomasulo
- Three years later in IBM 360/91
- Introducing register renaming
- Use tag-based instruction wakeup

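Tag-based wakeup followed by select can be sketched as follows. This is an illustration under assumptions, not a description of any particular processor: the `Entry` class, the tag names, the oldest-first select policy, and the single-cycle result latency are all choices made here for brevity.

```python
class Entry:
    """One issue-queue entry (hypothetical layout)."""
    def __init__(self, name, dest_tag, src_tags, ready_tags):
        self.name = name
        self.dest_tag = dest_tag
        # Source tags whose results have not been broadcast yet.
        self.waiting = {t for t in src_tags if t not in ready_tags}

def wakeup(queue, tag):
    """Broadcast a result tag; every entry waiting on it clears that operand."""
    for entry in queue:
        entry.waiting.discard(tag)

def select(queue, width):
    """Pick up to `width` ready entries, oldest first, and remove them from the queue."""
    chosen = [e for e in queue if not e.waiting][:width]
    for e in chosen:
        queue.remove(e)
    return chosen

def run(queue, width):
    """Issue until the queue drains, assuming every result is ready one cycle after issue."""
    schedule = []
    while queue:
        issued = select(queue, width)
        if not issued:
            break  # nothing can ever wake up: stop instead of looping forever
        schedule.append([e.name for e in issued])
        for e in issued:
            wakeup(queue, e.dest_tag)
    return schedule

ready = {"A", "B", "C", "D", "E", "F"}
queue = [
    Entry("I1", "T1", ["A", "B"], ready),
    Entry("I2", "T2", ["C", "T1"], ready),
    Entry("I3", "T3", ["D", "E"], ready),
    Entry("I4", "T4", ["F", "T2"], ready),
]
print(run(queue, width=2))
```

Each inner list is one cycle's issue group; `I3` issues ahead of the older `I2` because its operands are ready first.
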
Name Dependences and Register Renaming
Original code:
    ADD R3, R1, R2
    SUB R4, R4, R3
    ADD R3, R6, R7
    SUB R3, R5, R4
What prevents parallelism?
Renamed code: the destinations R3, R4, R3, R3 are renamed to P6, P7, P8, P9 sequentially
    ADD P6, R1, R2
    SUB P7, R4, P6
    ADD P8, R6, R7
    SUB P9, R5, P7
Finally R3 <= P9, R4 <= P7

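The renamed code can be reproduced with a simple mapping table that is consulted for sources and updated for destinations. This is a minimal sketch: the tuple encoding and the function name `rename` are assumptions, the original operands are taken as those implied by the renamed code on the slide, and registers that were never renamed keep their architectural names, as on the slide.

```python
def rename(code, first_free=6):
    """Rename each destination to a fresh physical register P<first_free>, P<first_free+1>, ..."""
    mapping = {}  # architectural register -> current physical register
    next_free = first_free
    renamed = []
    for op, dest, src1, src2 in code:
        # Look up sources before allocating the destination, so an instruction
        # that reads and writes the same register sees the old value.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        phys = f"P{next_free}"
        next_free += 1
        mapping[dest] = phys
        renamed.append((op, phys, s1, s2))
    return renamed, mapping

code = [
    ("ADD", "R3", "R1", "R2"),
    ("SUB", "R4", "R4", "R3"),
    ("ADD", "R3", "R6", "R7"),
    ("SUB", "R3", "R5", "R4"),
]
renamed, final_map = rename(code)
for inst in renamed:
    print(inst)
print(final_map)
```

After renaming, the second `ADD` no longer conflicts with the first `ADD` or the first `SUB` over `R3`, so only the true dependences through `P6` and `P7` remain.
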
Register Renaming and Correctness
1. The same set of instructions writes to user-visible registers and memory;
2. Each instruction receives the same operands as in the sequential execution; and
3. Any register or memory word receives the value of the last write as in the sequential execution

Renaming Implementation
First proposed in Tomasulo (1969)
- Use register status table
- Renamed to reservation station
In other processors (e.g. Alpha 21264, Intel P4)
- Use register mapping table
- No separate architectural/physical registers; no copy-back
In P-III
- Use register alias table
- Renamed arch. register to physical register
- Data copied back to arch. register
[Figure: Rd, Rs, Rt -> Renaming -> Pd, Ps, Pt]

Branch Prediction and Speculative Execution
Modern processors must speculate!
- Branch prediction: SPEC2k INT has one branch per seven instructions!
- Precise interrupts
- Memory disambiguation
- More performance-oriented speculations
Two distinct but connected issues:
1. How to make the best prediction
2. What to do when the speculation is wrong

Branch Prediction and Speculative Execution
Review the three conditions of correctness:
1. The processor commits the same set of instructions as executed in a sequential processor
2. Any committed instruction receives the same operands (from its parents) as in the sequential execution
3. Any register or memory word receives the value of the last write from the committed instructions, as in the sequential execution

Branch Prediction and Speculative Execution
Branch prediction: control speculation
- Must predict on branches
What to predict
- Branch direction
- Branch target address
What info can be used
- PC value
- Previous branch outcomes
- Branch patterns, in complex branch predictors
What building blocks are needed
- Branch prediction table (BHT), branch target buffer (BTB), pattern registers, and some logic

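How a PC-indexed table and a target buffer fit together can be sketched with 2-bit saturating counters. The slide does not name a specific scheme; the 2-bit counter, the table size, the indexing function, and the class name `Predictor` are assumptions chosen as one common textbook arrangement.

```python
class Predictor:
    """Direction from a PC-indexed table of 2-bit counters, target from a BTB (sketch)."""
    def __init__(self, entries=16):
        self.entries = entries
        self.bht = [1] * entries  # counters 0..3; start at weakly not-taken
        self.btb = {}             # branch PC -> last observed taken target

    def _index(self, pc):
        # Drop the low two bits, assuming 4-byte aligned instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return (predicted_taken, predicted_target or None)."""
        taken = self.bht[self._index(pc)] >= 2
        return taken, (self.btb.get(pc) if taken else None)

    def update(self, pc, taken, target=None):
        """Train with the resolved outcome of the branch at `pc`."""
        i = self._index(pc)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
            self.btb[pc] = target
        else:
            self.bht[i] = max(0, self.bht[i] - 1)

p = Predictor()
print(p.predict(0x40))                # before any training
p.update(0x40, taken=True, target=0x20)
print(p.predict(0x40))                # after one taken outcome
```

This covers only the first of the two issues in the lecture, making the prediction; recovering from a wrong prediction is a separate mechanism.
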
Generic Superscalar Processor Models
[Figure: issue-queue-based model with Fetch, Rename, Wakeup/select, Regfile, bypass, FUs, D-cache; phases labeled schedule, execute, commit]
[Figure: reservation-based model with Fetch, Rename, Reg/ROB, Wakeup/select, bypass, FUs, D-cache; phases labeled schedule, execute, commit]
Revised from Palacharla Ph.D. thesis 1998