15-740/18-740 Computer Architecture
Lecture 12: Issues in OoO Execution
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 10/7/2011

Reviews
- Due next Monday
  - Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
  - Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Top Picks 2006.
- Due next Wednesday
  - Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.

Announcements
- Milestone I
  - Deadline postponed to next Friday (Oct 14)
  - Format: 2 pages
  - Include results from your initial evaluations. We need to see good progress.
  - Of course, the results need to make sense (i.e., you should be able to explain them)
- Midterm I
  - October 21

Last Lecture
- Three pieces of dynamic information that make static scheduling difficult
- Example of out-of-order execution
  - Timeline
  - How things work in a design that uses a consolidated physical register file

Today
- Issues in OoO Execution

Review: Summary of OOO Execution Concepts
- Renaming eliminates false dependencies
- Tag broadcast enables value communication between instructions → dataflow
- An out-of-order engine dynamically builds the dataflow graph of a piece of the program
  - Which piece?
    - Limited to the instruction window
  - Can we do it for the whole program? Why would we like to?
    - How can we have a large instruction window efficiently?
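The tag-broadcast idea in the review above can be sketched in a few lines. This is a minimal illustration, not the mechanism of any particular processor: the `RSEntry` class, the `broadcast` function, and the tag names (`P3`, `P5`) are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class RSEntry:
    """One waiting instruction (hypothetical layout, for illustration only)."""
    op: str
    dest_tag: str
    # Each source is a list [tag, value, ready]; value is None until captured.
    srcs: list = field(default_factory=list)

    def ready(self):
        # An instruction may be scheduled once every source has its value.
        return all(src[2] for src in self.srcs)


def broadcast(entries, tag, value):
    """Deliver (tag, value) to every waiting source that names this tag."""
    for entry in entries:
        for src in entry.srcs:
            if not src[2] and src[0] == tag:
                src[1], src[2] = value, True


# An ADD waiting on tag "P3" (produced by an earlier MUL); its other
# source value (4) was already available when it entered the window.
add = RSEntry("ADD", "P5", [["P3", None, False], [None, 4, True]])
window = [add]
ready_before = add.ready()
broadcast(window, "P3", 6)   # the MUL finishes and broadcasts its tag and value
ready_after = add.ready()
```

The producer never needs to know who its consumers are: each waiting entry compares the broadcast tag against its own source tags, which is how the dataflow graph gets built dynamically within the window.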

Review: An Exercise

  MUL R3 ← R1, R2
  ADD R5 ← R3, R4
  ADD R7 ← R2, R6
  ADD R10 ← R8, R9
  MUL R11 ← R7, R10
  ADD R5 ← R5, R11

  Pipeline stages: F D E R W

- Assume ADD (4 cycle execute), MUL (6 cycle execute)
- Assume one adder and one multiplier
- How many cycles
  - in a non-pipelined machine
  - in an in-order-dispatch pipelined machine with future file and reorder buffer
  - in an out-of-order dispatch pipelined machine with future file and reorder buffer
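As a starting point for the exercise, the sketch below computes two simple quantities. It rests on stated assumptions: the program is taken to be the six-instruction sequence MUL R3←R1,R2; ADD R5←R3,R4; ADD R7←R2,R6; ADD R10←R8,R9; MUL R11←R7,R10; ADD R5←R5,R11, and each of F, D, R, W is assumed to take one cycle. It does not answer the two pipelined questions, which depend on dispatch and structural-hazard details.

```python
# Hypothetical encoding of the exercise: (opcode, destination, sources).
PROGRAM = [
    ("MUL", "R3", ("R1", "R2")),
    ("ADD", "R5", ("R3", "R4")),
    ("ADD", "R7", ("R2", "R6")),
    ("ADD", "R10", ("R8", "R9")),
    ("MUL", "R11", ("R7", "R10")),
    ("ADD", "R5", ("R5", "R11")),
]
EXEC_LATENCY = {"ADD": 4, "MUL": 6}
OTHER_STAGES = 4  # F, D, R, W at one cycle each (assumption)


def non_pipelined_cycles(program):
    """Each instruction runs start to finish before the next one begins."""
    return sum(EXEC_LATENCY[op] + OTHER_STAGES for op, _, _ in program)


def dataflow_execute_bound(program):
    """Longest chain of execute latencies through true (RAW) dependences.

    Ignores the one-adder/one-multiplier limit and all non-execute stages,
    so it is only a lower bound on execute time.
    """
    finish = {}  # register -> execute-finish time of its latest producer
    longest = 0
    for op, dest, srcs in program:
        start = max(finish.get(s, 0) for s in srcs)
        finish[dest] = start + EXEC_LATENCY[op]
        longest = max(longest, finish[dest])
    return longest


total = non_pipelined_cycles(PROGRAM)
bound = dataflow_execute_bound(PROGRAM)
```

Under these assumptions the non-pipelined count is 2 × (6 + 4) + 4 × (4 + 4) cycles, and the dataflow bound follows the chain MUL → ADD → ADD (or ADD → MUL → ADD).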

Some Implementation Issues
- Back-to-back dispatch of dependent operations
  - When should the tag be broadcast?
- Is it a good idea to have the reservation stations also be the reorder buffer?
- Is it a good idea to have values in the register alias table?
- Is it a good idea to store values in reservation stations?
- Is it a good idea to have a single, centralized reservation station for all FUs?
- Idea: Decoupling different buffers
  - Many modern processors decouple these buffers
  - Many have distributed reservation stations across FUs

Intel Pentium Pro
[Figure courtesy of Prof. Wen-mei Hwu, University of Illinois]

Buffer Decoupling in Intel Pentium 4

Pentium Pro vs. Pentium 4
[Figure label: Pointers to architectural register values]

Required/Recommended Reading
- Hinton et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.
- Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996.
- Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

Buffer-Decoupled OoO Designs
- Register Alias Table contains only tags and valid bits
  - Tag is a pointer to the physical register containing the value
- During renaming, instruction allocates entries in different buffers (as needed):
  - Reorder buffer
  - Physical register file
  - Reservation stations (centralized vs. distributed)
  - Store/load buffer
- Tradeoffs?
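A minimal sketch of renaming in such a decoupled design follows. The `Renamer` class, its field names, and the choice to interpret the valid bit as "value is ready in the physical register" are assumptions made for illustration. Reservation-station and store/load buffer allocation are left out to keep it short.

```python
class Renamer:
    """Toy rename stage: the RAT holds only a tag and a valid bit per register."""

    def __init__(self, arch_regs, num_phys):
        # Initially architectural register i lives in physical register i.
        self.rat = {r: {"tag": i, "valid": True} for i, r in enumerate(arch_regs)}
        self.free_list = list(range(len(arch_regs), num_phys))
        self.rob = []  # program order: (dest arch reg, new tag, previous tag)

    def rename(self, dest, srcs):
        # Sources read the current mapping: a pointer, never a data value.
        src_tags = [(self.rat[s]["tag"], self.rat[s]["valid"]) for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        new_tag = self.free_list.pop(0)          # allocate a physical register
        old_tag = self.rat[dest]["tag"]
        self.rat[dest] = {"tag": new_tag, "valid": False}
        self.rob.append((dest, new_tag, old_tag))  # allocate a ROB entry
        return new_tag, src_tags


r = Renamer(["R1", "R2", "R3"], num_phys=6)
first = r.rename("R3", ["R1", "R2"])   # R3 <- R1 op R2
second = r.rename("R3", ["R3", "R1"])  # R3 <- R3 op R1, reads the new mapping
```

Because the second instruction receives the tag allocated by the first, the two writes to R3 go to different physical registers, and only the true dependence remains.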

Physical Register Files
- Register values stored only in physical register files (architectural register state is part of this file), e.g., Pentium 4
+ Smaller physical register file than ROB: not all instructions write to registers → more area efficient
+ No duplication of data values in reservation stations, arch register file, reorder buffer, rename table (space efficient)
+ Only one write and read path for register data values (simpler control for register reads and writes)
-- Accessing register state always requires indirection (higher latency)
   - Instructions with ready operands need to access register file
   - Copying architectural register state takes time
-- Register file size increases. Centralization limits scalability.

Physical Register Files in Pentium 4
- Hinton et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

Physical Register Files in Alpha 21264
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.

Alpha 21264 Pipelines
Source: Microprocessor Report, 10/28/96

Centralized Reservation Stations (Pentium Pro)
[Figure: pipeline diagram with IF, ID/Reg, a single block of Reservation Stations, functional units (Add, Br, FPU, FPU, Mul, LSU), and WB]

Distributed Reservation Stations
- Many processors: PowerPC 604, Pentium 4
[Figure: pipeline diagram with IF, ID/Reg, functional units (Add, Br, FPU, FPU, Mul, LSU), and WB]

Centralized vs. Distributed Reservation Stations
- Centralized (monolithic):
+ Reservation stations not statically partitioned (can adapt to changes in instruction mix, e.g., 100% adds)
-- All entries need to have all fields even though some fields might not be needed for some instructions (e.g., branches, stores, etc.)
-- Number of ports = issue width
-- More complex control and routing logic
- Distributed:
+ Each RS can be specialized to the functional unit it serves (more area efficient)
+ Number of ports can be 1 (RS per functional unit)
+ Simpler control and routing logic
-- Other RS’s cannot be used if one functional unit’s RS is full (static partitioning)
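The static-partitioning tradeoff can be illustrated with a toy capacity model. Everything here is an assumption for illustration: 8 total entries, two instruction classes, in-order dispatch that stops at the first full station, and no instruction leaving a station while the stream is dispatched.

```python
def accepted(stream, capacity_of):
    """Count instructions buffered before the first one that finds its RS full.

    capacity_of maps an instruction class to (pool name, pool size).
    """
    used = {}
    count = 0
    for cls in stream:
        pool, size = capacity_of(cls)
        if used.get(pool, 0) >= size:
            break  # in-order dispatch stalls here
        used[pool] = used.get(pool, 0) + 1
        count += 1
    return count


TOTAL_ENTRIES = 8
CLASSES = ["add", "mul"]


def centralized(cls):
    # One shared pool: any entry can hold any instruction class.
    return ("shared", TOTAL_ENTRIES)


def distributed(cls):
    # Same total storage, statically split per functional unit.
    return (cls, TOTAL_ENTRIES // len(CLASSES))


all_adds = ["add"] * 8
mixed = ["add", "mul"] * 4

adds_central = accepted(all_adds, centralized)
adds_distrib = accepted(all_adds, distributed)
mixed_central = accepted(mixed, centralized)
mixed_distrib = accepted(mixed, distributed)
```

With a balanced mix both organizations buffer the same number of instructions; with 100% adds the distributed design fills its add station while the multiplier's entries sit unused.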

Issues in Scheduling Logic (I)
- What if multiple instructions become ready in the same cycle?
  - Why does this happen?
    - An instruction with multiple dependents
    - Multiple instructions can complete in the same cycle
- Which one to schedule?
  - Oldest
  - Random
  - Most dependents first
  - Most critical first (longest-latency dependency chain)
- Does not matter for performance unless the active window is very large
  - Butler and Patt, “An Investigation of the Performance of Various Dynamic Scheduling Techniques,” MICRO 1992.
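The four selection policies listed above can be sketched as a single select function. The per-instruction fields (`age`, `dependents`, `chain`) and the `width` parameter are hypothetical; real select logic is a hardware priority circuit, and this only shows what each policy would pick.

```python
import random


def select(ready, policy, width=1, rng=None):
    """Pick up to `width` instructions from the ready list under a policy.

    Each ready instruction is a dict with:
      age        -- smaller means older in program order
      dependents -- number of instructions waiting on its result
      chain      -- latency of the longest dependency chain it heads
    """
    if policy == "random":
        rng = rng or random.Random()
        return rng.sample(ready, min(width, len(ready)))
    keys = {
        "oldest": lambda i: i["age"],
        "most_dependents": lambda i: -i["dependents"],
        "most_critical": lambda i: -i["chain"],
    }
    return sorted(ready, key=keys[policy])[:width]


ready = [
    {"name": "A", "age": 3, "dependents": 1, "chain": 2},
    {"name": "B", "age": 1, "dependents": 0, "chain": 9},
    {"name": "C", "age": 2, "dependents": 4, "chain": 5},
]
oldest = select(ready, "oldest")[0]["name"]
most_deps = select(ready, "most_dependents")[0]["name"]
critical = select(ready, "most_critical")[0]["name"]
```

The policies differ mainly in what information they need: age is available for free from allocation order, whereas dependent counts and chain lengths would have to be tracked or predicted.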

Issues in Scheduling Logic (II)
- When to schedule the dependents of a multi-cycle execute instruction?
  - One cycle before the completion of the instruction
  - Example: IMUL, Pentium 4, 3-cycle ADD
- When to schedule the dependents of a variable-latency execute instruction?
  - A load can hit or miss in the data cache
  - Option 1: Schedule dependents assuming load will hit
  - Option 2: Schedule dependents assuming load will miss
- When do we take out an instruction from a reservation station?
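The choice between the two load-scheduling options can be explored with a toy expected-latency model. This is a sketch under stated assumptions, not a description of any real scheduler: the function name, the `replay_penalty` charged when a hit-speculated dependent must be rescheduled, and the `late_wakeup_penalty` charged when a conservatively scheduled dependent wakes late on a hit are all hypothetical parameters.

```python
def expected_load_to_use(hit_rate, hit_lat, miss_lat, policy,
                         replay_penalty=2, late_wakeup_penalty=1):
    """Expected cycles from load issue until its dependent can execute."""
    if policy == "assume_hit":
        # Option 1: dependents are woken early; a miss forces a replay.
        on_hit, on_miss = hit_lat, miss_lat + replay_penalty
    elif policy == "assume_miss":
        # Option 2: dependents wait until the data actually returns;
        # a hit is then discovered too late for back-to-back execution.
        on_hit, on_miss = hit_lat + late_wakeup_penalty, miss_lat
    else:
        raise ValueError(policy)
    return hit_rate * on_hit + (1 - hit_rate) * on_miss


# Hypothetical latencies: 2-cycle hit, 10-cycle miss.
mostly_hits_opt1 = expected_load_to_use(0.75, 2, 10, "assume_hit")
mostly_hits_opt2 = expected_load_to_use(0.75, 2, 10, "assume_miss")
half_hits_opt1 = expected_load_to_use(0.5, 2, 10, "assume_hit")
half_hits_opt2 = expected_load_to_use(0.5, 2, 10, "assume_miss")
```

In this model, speculating on a hit wins when hits dominate and loses as the miss rate and replay cost grow, which is the tradeoff the two options on the slide point at.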