
MAMAS – Computer Architecture 234367
Lecture 7 – Out Of Order (OOO)
Avi Mendelson
Some of the slides were taken from: (1) Lihu Rapoport (2) Randi Katz and (3) Petterson
© Avi Mendelson, 4/2005

Go beyond IPC=1
· Superscalar
· VLIW
· OOO

Can we improve performance?
· In theory, “data flow machines” have the best performance
  – View the program as a set of parallel operations waiting to be executed (demonstrated on the next slide)
  – Execute each instruction as soon as its inputs are ready
· So why are computers von Neumann based and not data-flow based?
  – Hard to debug
  – Hard to write data-flow programs (an efficient program needs a special programming language)

Data flow execution – a different approach for high performance computers
· Data flow execution is an alternative to von Neumann execution. Here, instructions are executed in the order of their input dependencies and not in the order they appear in the program
· Example: assume that we have as many execution units as we need:
  (1) r1 <- r4 / r7
  (2) r8 <- r1 + r2
  (3) r5 <- r5 + 1
  (4) r6 <- r6 - r3
  (5) r4 <- r5 + r6
  (6) r7 <- r8 * r4
· Data flow graph: (1) -> (2); (3), (4) -> (5); (2), (5) -> (6)
· We could execute it in 3 cycles (one per level of the graph): first (1), (3), (4), then (2), (5), then (6)
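The 3-cycle claim can be checked with a small sketch. It assumes the six instructions as read from the slide, one cycle per operation, unlimited execution units, and RAW dependencies only. The function name `dataflow_levels` is hypothetical.

```python
# Each instruction is (destination, source registers); immediates are omitted.
# Only the registers matter for scheduling, not the operator.
program = [
    ("r1", ["r4", "r7"]),   # (1)
    ("r8", ["r1", "r2"]),   # (2)
    ("r5", ["r5"]),         # (3)
    ("r6", ["r6", "r3"]),   # (4)
    ("r4", ["r5", "r6"]),   # (5)
    ("r7", ["r8", "r4"]),   # (6)
]

def dataflow_levels(program):
    """Earliest cycle (1-based) of each instruction, following RAW dependencies only."""
    last_writer = {}   # register -> index of the latest earlier instruction writing it
    cycle = []
    for dest, srcs in program:
        producers = [last_writer[s] for s in srcs if s in last_writer]
        # One cycle after the slowest producer; cycle 1 if there is no producer.
        cycle.append(1 + max((cycle[p] for p in producers), default=0))
        last_writer[dest] = len(cycle) - 1
    return cycle

levels = dataflow_levels(program)
print(levels)        # [1, 2, 1, 1, 2, 3]
print(max(levels))   # 3
```

WAR and WAW dependencies (for example (5) overwriting r4, which (1) reads) are deliberately ignored here, as in a pure data-flow view.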

Data flow execution - cont
· Can we build a machine that will execute the “data flow graph”?
· In the early 1970s several machines were built to work according to the data-flow graph. They were called “data flow machines”. They vanished for the reasons mentioned before.
· Solution: let the user think he/she is using a von Neumann machine, and let the system work in “data-flow mode”

OOOE - General Scheme
· Most modern computers use OOO execution. Most of them fetch and retire IN ORDER, but execute OUT OF ORDER
· Fetch & Decode (in-order) -> Instruction pool -> Execute (out-of-order) -> Retire (commit) (in-order)

Out Of Order Execution
· Basic idea:
  – Fetch is done in program order (in-order), fast enough to fill a window of instructions.
  – Out of the instruction window, the system forms a data flow graph and looks for instructions that are ready to be executed:
    · All the data the instruction depends on is ready
    · Resources are available.
  – As soon as an instruction is executed, it signals all the instructions that depend on it that it has generated a new input.
  – Instructions are committed in program order to preserve the “user view”
· Advantages:
  – Helps exploit Instruction Level Parallelism (ILP)
  – Helps cover latencies (e.g., cache miss, divide)
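The basic idea can be sketched as a toy cycle-by-cycle simulation: in-order fetch into a window, issue whenever inputs are ready, commit in program order. Everything here is an illustrative assumption rather than a model of a real machine: the function `simulate`, the four-instruction program, the latencies, unlimited execution units, and WAR/WAW hazards taken as already removed by renaming.

```python
def simulate(program, window=4):
    """program: list of (name, dest, sources, latency) in program order.
    Returns (issue_order, commit_order) as lists of names."""
    # RAW producers only; WAR/WAW are assumed to be removed by renaming.
    last_writer, producers = {}, []
    for i, (_, dest, srcs, _) in enumerate(program):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    never = float("inf")
    done_at = {}                 # index -> cycle at which its result is available
    issue_order, commit_order = [], []
    head, cycle = 0, 0           # head = oldest uncommitted instruction
    while head < len(program):
        cycle += 1
        # Commit strictly in program order, and only finished instructions.
        while head < len(program) and done_at.get(head, never) <= cycle:
            commit_order.append(program[head][0])
            head += 1
        # Issue every not-yet-issued instruction in the window whose inputs are ready.
        for i in range(head, min(head + window, len(program))):
            name, _, _, latency = program[i]
            if i not in done_at and all(done_at.get(p, never) <= cycle for p in producers[i]):
                done_at[i] = cycle + latency
                issue_order.append(name)
    return issue_order, commit_order

program = [
    ("A", "r1", ["r9"], 3),        # long-latency operation, e.g. a load that misses
    ("B", "r2", ["r1", "r0"], 1),  # depends on A
    ("C", "r3", ["r4", "r5"], 1),  # independent of A
    ("D", "r6", ["r3"], 1),        # depends on C
]
issue_order, commit_order = simulate(program)
print(issue_order)    # ['A', 'C', 'D', 'B']
print(commit_order)   # ['A', 'B', 'C', 'D']
```

C and D run under the shadow of A's latency, yet the commit order is still the program order.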

How to convert “In-order” instruction flow into “data flow”
· The problems:
  1. Data flow has only RAW dependencies, while OOOE also has WAR and WAW dependencies (as we showed in Lecture 5)
  2. How to guarantee in-order completion.
· The solutions:
  1. Register renaming (based on the “Tomasulo algorithm”) solves the WAR and WAW dependencies
  2. We need to “enumerate” the instructions at decode time (in order) so we know in what order to retire them

Register Renaming
· Hold a pool of physical registers.
· Architectural registers are mapped into physical registers
  – When an instruction writes to an architectural register
    · A free physical register is allocated from the pool
    · The physical register points to the architectural register
    · The instruction writes the value to the physical register
  – When an instruction reads from an architectural register
    · It reads the data from the latest instruction that writes to the same architectural register and precedes the current instruction.
    · If no such instruction exists, it reads directly from the architectural register.
  – When an instruction commits
    · It moves the value from the physical register to the architectural register it points to.

OOOE with Register Renaming: Example
       Before renaming      After renaming
  (1)  r1 <- mem1           t1 <- mem1
  (2)  r2 <- r2 + r1        t2 <- r2 + t1
  (3)  r1 <- mem2           t3 <- mem2
  (4)  r3 <- r3 + r1        t4 <- r3 + t3
  (5)  r1 <- mem3           t5 <- mem3
  (6)  r4 <- r5 + r1        t6 <- r5 + t5
  (7)  r5 <- 2              t7 <- 2
  (8)  r6 <- r5 + 2         t8 <- t7 + 2
· WAW before renaming: (1), (3) and (5) all write r1
· WAR before renaming: e.g. (2) reads r1 and (3) overwrites it
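The renaming rules can be replayed on the eight-instruction example with a short sketch. It assumes an unlimited pool of physical registers named t1, t2, ... that are never freed, which a real pool is not. The function `rename` is a hypothetical helper.

```python
def rename(program):
    """program: list of (dest, [sources]). Returns the renamed instruction list."""
    mapping = {}      # architectural register -> physical register of its latest writer
    renamed = []
    for n, (dest, srcs) in enumerate(program, start=1):
        # Read: take the latest preceding writer's physical register if there is one,
        # otherwise read the architectural register (or memory/immediate) directly.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # Write: allocate a fresh physical register and point the mapping at it.
        phys = f"t{n}"
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed

before = [
    ("r1", ["mem1"]),
    ("r2", ["r2", "r1"]),
    ("r1", ["mem2"]),
    ("r3", ["r3", "r1"]),
    ("r1", ["mem3"]),
    ("r4", ["r5", "r1"]),
    ("r5", ["2"]),
    ("r6", ["r5", "2"]),
]
after = rename(before)
for dest, srcs in after:
    print(dest, "<-", " + ".join(srcs))
```

Because every write gets its own register, only RAW dependencies (such as t2 on t1) remain in the renamed code.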

The magic of the modern X86 architectures (Intel, AMD, etc.)
· The user view of the X86 machine is that of a CISC architecture.
· The machine supports this view by keeping the in-order parts as close as possible to the X86 view.
· While moving from the in-order part (front-end) to the OOO part (execution), the hardware translates each X86 instruction into a set of uops, which are the internal machine operations. These operations are RISC-like (load-store based).
· During this translation, the hardware performs the register renaming. So, at execution time it uses internal registers and not the X86 ones. The number of these registers can change from one generation to another.
· While moving back from the OOO part (execution) to the in-order part (commit), the hardware translates the registers back to X86 ones, in order to keep a coherent picture for the user.

OOOE Architecture: based on Pentium-II
· Diagram blocks: Bus Interface Unit, IFU (Instr. Fetch), Instruction cache, ID (Instr. Decode and rename), RS, MOB, Data cache, ROB, Write back bus, Retire (commit) Logic
· In-order front-end
  1. Fetch from the instruction cache, based on branch prediction.
  2. Decode and rename:
     · Translate to uops
     · Use the RAT table for renaming
     · Put ALL instructions in the ROB
     · Put all “arithmetic instructions” in the RS queue
     · Put all Load/Store instructions in the MOB
· Out-of-order
  3. Do in parallel:
     · Load and store operations are executed based on MOB information
     · Arithmetic operations are executed based on RS information.
  4. All results are written back to the ROB and RAT, while the RS and MOB “steal” the values they need
· In order
  5. The retire logic (commit logic) moves instructions out of the ROB and updates the architectural registers

Re-order Buffer (ROB)
· Mechanism for keeping the in-order view of the user.
· Basic ROB functions
  – Provides a large physical register space for register renaming
  – Keeps intermediate results, some of which may never be committed if the branch prediction was wrong (we will discuss this mechanism later on)
  – Keeps the information on which “Real Register” the commit needs to update
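A minimal sketch of the ROB's in-order retirement follows. It assumes a simple FIFO of entries and a dictionary standing in for the real register file; the class `ROB` and its method names are hypothetical. Branch misprediction recovery is left out.

```python
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()   # oldest entry on the left
        self.rrf = {}            # stand-in for the real (architectural) register file

    def allocate(self, arch_dest):
        """Allocate an entry in program order; remember which real register to update."""
        entry = {"dest": arch_dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_back(self, entry, value):
        """Execution finished (possibly out of order): keep the intermediate result."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Commit finished entries from the head only, so commits stay in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.rrf[e["dest"]] = e["value"]
            retired.append(e["dest"])
        return retired

rob = ROB()
first = rob.allocate("R1")
second = rob.allocate("R2")
rob.write_back(second, 7)      # the younger uop finishes first
early = rob.retire()           # nothing retires: the head is not done yet
rob.write_back(first, 5)
late = rob.retire()            # now both retire, oldest first
print(early, late, rob.rrf)
```

The younger result sits in the ROB until the older uop completes, which is what preserves the user's in-order view.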

The renaming Algorithm
· Each uop allocates a new entry in the ROB.
  – The entries are allocated in program order
  – The RAT (register alias table) indicates, for every architectural register, which uop (ROB entry) would generate its value if the program were executed in order.
· Every uop that generates value(s) (to a register and/or flag) updates the RAT.
· For every input of the uop, we look up who is responsible for generating the value. If a translation exists in the RAT, we indicate that the value will be retrieved from the uop in that ROB entry. If no translation exists, we retrieve the value from the “architectural register” - RRF (Real Register File)
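The lookup-then-update order of the RAT can be sketched as follows. The assumptions are a ROB that only grows, entries named RB0, RB1, ..., and no retirement or RAT clean-up. The class `Renamer` is a hypothetical illustration, not the actual hardware structure.

```python
class Renamer:
    def __init__(self):
        self.rat = {}   # architectural register -> index of the ROB entry that will produce it
        self.rob = []   # ROB entries in program order; each stores its architectural destination

    def rename(self, dest, srcs):
        # Inputs first: a RAT translation means "take the value from that ROB entry";
        # no translation means "read the architectural register from the RRF".
        new_srcs = [f"RB{self.rat[s]}" if s in self.rat else s for s in srcs]
        # Then allocate the next ROB entry (program order) and update the RAT.
        entry = len(self.rob)
        self.rob.append(dest)
        self.rat[dest] = entry
        return f"RB{entry}", new_srcs

r = Renamer()
u0 = r.rename("R1", ["X"])          # LD R1, X
u1 = r.rename("R2", ["R3"])         # R2 <- R3
u2 = r.rename("R1", ["R1", "R0"])   # R1 <- R1+R0
print(u0, u1, u2)
print(r.rat)                        # {'R1': 2, 'R2': 1}
```

Looking up the sources before updating the RAT matters for a uop like R1 <- R1+R0: it must read the older R1 (RB0) and only then claim R1 for its own entry (RB2).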

Reservation station (RS)
· Pool of all “not yet executed” uops
  – Holds both the uop attributes and the values of the input data
· For each operand, it keeps an indication of whether it is ready
  – An operand that needs to be retrieved from the RRF is always ready
  – An operand that waits for another uop to generate its value will “listen” to the WB bus. When the value appears on the bus (the value is always associated with the ROB number it needs to update), all RS entries that need to consume this value “steal” it from the bus and mark the input as ready (this is done in parallel with the ROB update).
  – Uops whose operands are all ready can be dispatched for execution
  – The dispatcher chooses which of the ready uops to execute next. It can also do “forwarding”, i.e., schedule the instruction in the same cycle the information is written to the RS entry.
· As soon as a uop completes its execution, it is deleted from the RS.
· If the RS is full, it stalls the decoder
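The “listen and steal” behaviour of an RS entry can be sketched with a small class. It assumes each operand is either a ready value or a ROB tag to wait for. The class `RSEntry`, its method names, and the sample values are hypothetical; dispatch, forwarding, and deletion after execution are not modelled.

```python
class RSEntry:
    def __init__(self, name, operands):
        """operands: list of ("value", v) for ready inputs (e.g. read from the RRF)
        or ("tag", rob_index) for inputs still to be produced by another uop."""
        self.name = name
        self.values = {}    # operand position -> value
        self.waiting = {}   # operand position -> ROB tag being waited for
        for pos, (kind, x) in enumerate(operands):
            if kind == "value":
                self.values[pos] = x
            else:
                self.waiting[pos] = x

    def ready(self):
        return not self.waiting

    def snoop(self, tag, value):
        """Called for every (ROB tag, value) pair that appears on the write-back bus."""
        for pos, t in list(self.waiting.items()):
            if t == tag:
                self.values[pos] = value   # "steal" the value from the bus
                del self.waiting[pos]      # mark this input as ready

# RB2 <- RB0+R0: the first operand waits for ROB entry 0, R0 is read from the RRF.
add = RSEntry("RB2 <- RB0+R0", [("tag", 0), ("value", 10)])
before = add.ready()
add.snoop(1, 99)            # a result for another ROB entry: ignored
still_waiting = add.ready()
add.snoop(0, 5)             # the awaited value arrives on the bus
print(before, still_waiting, add.ready(), add.values[0] + add.values[1])
```

Matching on the ROB tag rather than on the architectural register name is what lets several in-flight versions of the same register coexist.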

Memory Order Buffer (MOB)
· Goal
  – Manages the load and store operations. When possible, it allows memory operations to execute out of order
· Structure similar in concept to the ROB
· Every memory uop allocates a new entry, in order.
· The address is filled in once it is known
· Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
  – store r1, a; load r2, b: the load can advance before the store
  – store r1, [r3]; load r2, b: the load should wait until r3 is known
· In most modern processors, loads may pass loads/stores, but stores must be executed in order (among stores).
· For simplicity, this course assumes that all MOB operations are executed in order.
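The disambiguation problem can be illustrated with a conservative check. It assumes addresses are plain values and that `None` marks a store whose address is not yet known. The function `load_may_execute` is hypothetical, and store-to-load forwarding is not modelled.

```python
def load_may_execute(load_addr, older_store_addrs):
    """True if the load may run before all older, still pending stores."""
    for addr in older_store_addrs:
        if addr is None:
            return False      # store address unknown (e.g. [r3] not computed): must wait
        if addr == load_addr:
            return False      # same address: the load really depends on the store
    return True

# store r1, a ; load r2, b  -> different known addresses, the load can advance
case_known = load_may_execute("b", ["a"])
# store r1, [r3] ; load r2, b -> r3 is not known yet, the load should wait
case_unknown = load_may_execute("b", [None])
# store r1, b ; load r2, b  -> true dependence through memory
case_same = load_may_execute("b", ["b"])
print(case_known, case_unknown, case_same)   # True False False
```

This only shows why the decision cannot be made at decode time; under the course's simplifying assumption, all MOB operations execute in order and no such check is needed.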

An example of OOO Execution
(Animation frames, slides 22 to 28. The diagrams are not recoverable; only their labels and annotations are kept.)
· Structures shown in every frame: RAT (R0 to R3), Instruction Q, ROB, RS, MOB, Execute, Retire
· Instruction queue, in program order: LD R1, X; R2 <- R3; R1 <- R1+R0; I4; I5
· Renamed uops as they appear in the frames: LD RB0, X (MOB entry M0, “takes 3 cycles”); RB1 <- R3 (RS entry RS0); RB2 <- RB0+R0 (RS entry RS1)
· Frame annotations: for LD RB0, X, “Got the value now”; for the waiting uop, “Cannot execute since the data is not ready yet”
· Later frames show I4, I5 and I6 entering the same structures (RS entries RS2, RS3, RS0)