
3.2. Dynamic Scheduling
• Seen scoreboard approach
• Allows hardware to reorder instructions
  – Allows optimisation at run-time
  – Allows code generated for one pipeline to run on another
• Complex!

Tomasulo’s Algorithm
• Contains elements of scoreboarding
• Introduces register renaming
  – Eliminates WAR and WAW hazards

Register Renaming

    fdiv  %f2, %f4, %f0
    fadd  %f0, %f8, %f6
    sd    %f6, [%l1]
    fsub  %f10, %f14, %f8
    fmul  %f10, %f8, %f6

• Antidependence: fadd reads %f8, which fsub later writes (WAR)
• Output dependence: fadd and fmul both write %f6 (WAW)

With Renaming
• Introduce “temporary registers” (S and T)

    fdiv  %f2, %f4, %f0
    fadd  %f0, %f8, S
    sd    S, [%l1]
    fsub  %f10, %f14, T
    fmul  %f10, T, %f6
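The effect of renaming on the example above can be checked with a small sketch. Everything here is illustrative: the tuple encoding `(op, sources, destination)`, the helper names `hazards` and `rename`, and the temporaries `t0`, `t1`, ... are assumptions, not part of the slides. Unlike the slide, which renames only the two offending registers, this sketch gives every destination a fresh name; hardware would draw these names from a finite pool (the reservation stations).

```python
def hazards(program):
    """List (kind, earlier_index, later_index, register) name dependences."""
    found = []
    for i, (_, srcs_i, dst_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dst_j = program[j][2]
            if dst_j is None:
                continue
            if dst_j in srcs_i:      # later write to a register read earlier
                found.append(("WAR", i, j, dst_j))
            if dst_j == dst_i:       # two writes to the same register
                found.append(("WAW", i, j, dst_j))
    return found


def rename(program):
    """Give every destination a fresh temporary; sources follow the latest name."""
    mapping, renamed = {}, []
    for op, srcs, dst in program:
        new_srcs = tuple(mapping.get(r, r) for r in srcs)
        new_dst = None
        if dst is not None:
            new_dst = f"t{len(renamed)}"   # hypothetical temporary name
            mapping[dst] = new_dst
        renamed.append((op, new_srcs, new_dst))
    return renamed


# The five-instruction example from the slide (sd has no register destination).
original = [
    ("fdiv", ("f2", "f4"), "f0"),
    ("fadd", ("f0", "f8"), "f6"),
    ("sd",   ("f6", "l1"), None),
    ("fsub", ("f10", "f14"), "f8"),
    ("fmul", ("f10", "f8"), "f6"),
]

print(hazards(original))           # WAR on f8, WAW on f6, WAR on f6 (sd vs fmul)
print(hazards(rename(original)))   # no name dependences remain
```

The true (RAW) dependences survive renaming because each source is rewritten to the most recent name of the register it reads.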

Tomasulo’s Algorithm
• Operands are buffered in reservation stations
  – Issued instructions use reservation stations, not registers
  – Each functional unit is controlled by its own reservation stations
• Effectively, increases number of registers available

Structure of FP Unit

Pipeline Stages
• Issue
  – Move instruction from queue to reservation station if possible, renaming registers
• Execute
  – Wait for operands, then execute
• Write results
  – Written to the data bus, then to registers and reservation stations

Reservation Station Fields
• Operation
• Operand sources (reservation stations)
• Operand values
• Memory address information (for load/store operations)
• Busy indicator
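The fields above, together with the issue and write-result stages, can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not a full model of the algorithm: the field names (`qj`, `qk` for operand sources, `vj`, `vk` for operand values), the `reg_status` table mapping a register to the station that will produce it, and the station names `div1` and `add1` are all hypothetical, and execution latency is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    qj: Optional[str] = None    # station producing operand 1 (None: value is in vj)
    qk: Optional[str] = None    # station producing operand 2 (None: value is in vk)
    vj: Optional[float] = None
    vk: Optional[float] = None
    addr: Optional[int] = None  # memory address information for loads/stores

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, src1, src2, dst, regs, reg_status):
    """Issue: copy ready operands, record producers for the rest, rename dst."""
    if rs.busy:
        return False            # no free station: the instruction must wait
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src1)
    rs.vj = regs[src1] if rs.qj is None else None
    rs.qk = reg_status.get(src2)
    rs.vk = regs[src2] if rs.qk is None else None
    reg_status[dst] = rs.name   # dst is now named by this station
    return True


def write_result(tag, value, stations, regs, reg_status):
    """Write results: broadcast to waiting stations and registers, free the producer."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        regs[reg] = value
        del reg_status[reg]


regs = {"f0": 0.0, "f2": 6.0, "f4": 3.0, "f6": 0.0, "f8": 1.0}
reg_status = {}
div1, add1 = ReservationStation("div1"), ReservationStation("add1")
stations = [div1, add1]

issue(div1, "fdiv", "f2", "f4", "f0", regs, reg_status)
issue(add1, "fadd", "f0", "f8", "f6", regs, reg_status)
waiting_before = add1.ready()   # fadd still waits on div1 for f0
write_result("div1", div1.vj / div1.vk, stations, regs, reg_status)
ready_after = add1.ready()      # operand arrived over the broadcast
```

Note that `fadd` never reads `%f0` from the register file: it captures the value from the broadcast, which is how the reservation stations act as extra registers.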

3.3. Examples
• H&P work through detailed examples of the application of Tomasulo’s algorithm

3.4. Reducing Branch Penalties
• As exploitation of ILP increases, control dependences become the limiting factor
• Dynamic branch prediction
  – Static schemes were considered in Appendix A
• Resolve branches early to prevent stalls

Basic Dynamic Branch Prediction
• Branch-prediction buffer
  – One bit: taken/not taken
  – Indexed by address of branch instruction
  – Toggle when prediction is incorrect
  – Mispredicts loops twice as often as necessary:
    • At end of loop (inevitable)
    • At start of loop
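The double misprediction can be seen in a short simulation of a single buffer entry. This is a sketch under assumptions: the function name, the initial prediction of "taken", and the loop shape (a branch taken nine times and then not taken, executed three times over) are all chosen for illustration.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Count mispredictions of one 1-bit entry over a sequence of branch outcomes."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken   # toggle the bit on a wrong prediction
    return misses


loop_run = [True] * 9 + [False]      # loop branch: taken 9 times, then falls through
outcomes = loop_run * 3              # the whole loop is executed three times
misses = one_bit_mispredictions(outcomes)
print(misses)                        # 5: one at each exit, plus one at each re-entry
```

The exit misprediction flips the bit to "not taken", so the first iteration of the next execution of the loop is mispredicted as well.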

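Branch Prediction
• Two-bit scheme
  – Mispredicts a loop only once per execution (at the loop exit)
[Figure: state diagram of the two-bit scheme, with “Predict taken” and “Predict not taken” states and transitions labelled “Taken” and “Not taken”]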
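A two-bit scheme can be sketched as a saturating counter. This is one common realisation and an assumption here, since the exact transitions of the slide's state diagram are not recoverable from the text; the function name, the initial state and the loop shape are likewise illustrative.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter (0..3, predict taken if >= 2)."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


loop_run = [True] * 9 + [False]      # loop branch: taken 9 times, then falls through
outcomes = loop_run * 3              # the whole loop is executed three times
misses = two_bit_mispredictions(outcomes)
print(misses)                        # 3: only the exit of each execution is missed
```

A single not-taken outcome moves the counter from "strongly taken" to "weakly taken", so the prediction on re-entering the loop is still "taken".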

Branch Prediction Performance
• SPEC89 (IBM PowerPC)
  – 4096-entry branch-prediction buffer
  – Accuracy: 82%–99% (better for FP than int)
• Doing better
  – Correlating predictors (two-level predictors)
  – Use other branches to help prediction

Tournament Predictors
• Multilevel branch predictor
• Predict which of several predictors to use!
• E.g. Alpha 21264
  – 2-bit tournament predictor
    • Global predictor (2-bit)
    • Local predictor: two levels:
      – 10-bit history table
      – Standard 3-bit predictor
  – Very good (misprediction rate: 1.15% for int, 0.1% for FP)
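The idea of predicting which predictor to use can be sketched as follows. This is a deliberately simplified illustration and not the Alpha 21264 design: the class names, the 4-bit global history, the single-level local predictor, the unbounded dictionary tables and the 2-bit counters throughout are all assumptions.

```python
class Counter2:
    """2-bit saturating counter; values 2 and 3 mean 'yes' (taken / trust global)."""

    def __init__(self, value=2):
        self.value = value

    def predict(self):
        return self.value >= 2

    def update(self, up):
        self.value = min(3, self.value + 1) if up else max(0, self.value - 1)


class TournamentPredictor:
    def __init__(self, history_bits=4):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.global_table = {}   # global branch history -> Counter2
        self.local_table = {}    # branch address -> Counter2
        self.chooser = {}        # branch address -> Counter2 (True: use global)

    def predict_and_update(self, pc, taken):
        g = self.global_table.setdefault(self.history, Counter2())
        l = self.local_table.setdefault(pc, Counter2())
        c = self.chooser.setdefault(pc, Counter2())
        g_pred, l_pred = g.predict(), l.predict()
        prediction = g_pred if c.predict() else l_pred
        if g_pred != l_pred:             # train the chooser only when they disagree
            c.update(g_pred == taken)
        g.update(taken)
        l.update(taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction


# A branch that alternates taken / not taken defeats a plain 2-bit counter,
# but the history-indexed predictor learns the pattern.
predictor = TournamentPredictor()
outcomes = [i % 2 == 0 for i in range(100)]
correct = [predictor.predict_and_update(0x400, t) == t for t in outcomes]
```

In this run the local counter keeps predicting "taken", while the chooser quickly settles on the global predictor for this branch.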

3.5. High-Performance Fetch
• Branch-target buffers
  – Hold target addresses for predicted-taken branches
  – Indexed by current address
• IF: if the instruction currently being decoded is a branch, fetch from its target address
  – One cycle earlier than previously
  – Effectively, predicts next PC
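A branch-target buffer's role as a next-PC predictor can be sketched as a lookup table. The class and method names, the fixed 4-byte instruction size, the unbounded dictionary (a real buffer has a fixed number of entries and tags) and the policy of dropping an entry when its branch is not taken are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    def __init__(self, instr_size=4):
        self.entries = {}            # branch address -> predicted target address
        self.instr_size = instr_size

    def next_pc(self, pc):
        """Predict the next fetch address from the current address alone."""
        return self.entries.get(pc, pc + self.instr_size)

    def update(self, pc, taken, target):
        """Once the branch resolves, keep entries only for taken branches."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.next_pc(0x100)           # miss: fall through to 0x104
btb.update(0x100, True, 0x40)        # branch at 0x100 resolved as taken to 0x40
second = btb.next_pc(0x100)          # hit: fetch from 0x40 next
btb.update(0x100, False, 0x40)       # later resolved as not taken
third = btb.next_pc(0x100)           # entry removed: fall through again
```

Because the lookup needs only the fetch address, the prediction is available before the instruction has been decoded.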