3.2. Dynamic Scheduling
• Seen scoreboard approach
• Allows hardware to reorder instructions
  – Allows optimisation at run-time
  – Allows code generated for one pipeline to run on another
• Complex!
Tomasulo’s Algorithm
• Contains elements of scoreboarding
• Introduces register renaming
  – Eliminates WAR and WAW hazards
Register Renaming
    fdiv  %f2, %f4, %f0
    fadd  %f0, %f8, %f6
    sd    %f6, [%l1]
    fsub  %f10, %f14, %f8
    fmul  %f10, %f8, %f6
• Antidependence: fadd reads %f8, which fsub then writes
• Output dependence: fadd and fmul both write %f6
With Renaming
• Introduce “temporary registers” (S and T)
    fdiv  %f2, %f4, %f0
    fadd  %f0, %f8, S
    sd    S, [%l1]
    fsub  %f10, %f14, T
    fmul  %f10, T, %f6
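
The renaming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rename`, the tuple layout `(op, sources, dest)` and the temporary names `t0, t1, ...` are hypothetical, and unlike the slide (which renames only the two conflicting destinations to S and T) the sketch gives every destination a fresh name.

```python
from itertools import count

def rename(instructions):
    """Give every destination a fresh temporary name (hypothetical helper).

    Each instruction is a tuple (op, sources, dest); dest is None for a store.
    """
    fresh = (f"t{i}" for i in count())   # assumed naming scheme for temporaries
    current = {}                         # architectural register -> latest temporary
    renamed = []
    for op, sources, dest in instructions:
        # Sources read the most recent name of each register, if it was renamed.
        new_sources = [current.get(s, s) for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_sources, new_dest))
    return renamed

# The five-instruction sequence from the slide, destination last.
program = [
    ("fdiv", ["%f2", "%f4"], "%f0"),
    ("fadd", ["%f0", "%f8"], "%f6"),
    ("sd",   ["%f6", "%l1"], None),
    ("fsub", ["%f10", "%f14"], "%f8"),
    ("fmul", ["%f10", "%f8"], "%f6"),
]
renamed_program = rename(program)
```

After renaming, fsub no longer writes a name that fadd reads, and fadd and fmul write different names, so only the true (RAW) dependences remain.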
Tomasulo’s Algorithm
• Operands are buffered in reservation stations
  – Issued instructions use reservation stations, not registers
  – Each functional unit is controlled by its own reservation stations
• Effectively, increases number of registers available
Structure of FP Unit
Pipeline Stages
• Issue
  – Move instruction from queue to reservation station if possible, renaming registers
• Execute
  – Wait for operands, then execute
• Write results
  – Written to the data bus, then to registers and reservation stations
Reservation Station Fields
• Operation
• Operand sources (reservation stations)
• Operand values
• Memory address information (for load/store operations)
• Busy indicator
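
The fields listed above might be modelled roughly as follows. This is a sketch, not the hardware organisation: the field names `vj`, `vk`, `qj`, `qk` follow the naming used in H&P, while the class name, the `broadcast` helper and the station names "Add1" and "Mult1" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False                 # busy indicator
    op: Optional[str] = None           # operation
    vj: Optional[float] = None         # operand values
    vk: Optional[float] = None
    qj: Optional[str] = None           # operand sources: producing station names
    qk: Optional[str] = None
    address: Optional[int] = None      # memory address information (load/store)

    def ready(self):
        """An instruction may execute once no operand is still pending."""
        return self.busy and self.qj is None and self.qk is None

def broadcast(stations, producer, value):
    """Write-results step: deliver a value to every station waiting on producer."""
    for rs in stations:
        if rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.qk == producer:
            rs.vk, rs.qk = value, None

# An add waiting for the result of a multiply held in station "Mult1".
add1 = ReservationStation("Add1", busy=True, op="fadd", qj="Mult1", vk=2.0)
stations = [add1]
was_ready = add1.ready()
broadcast(stations, "Mult1", 6.0)
```

Because a waiting operand is identified by the producing station rather than by a register name, the station tags play the role of the temporary registers introduced by renaming.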
3.3. Examples
• H&P work through detailed examples of the application of Tomasulo’s algorithm
3.4. Reducing Branch Penalties
• As exploitation of ILP increases, control dependences become limiting factor
• Dynamic branch prediction
  – Considered static schemes in Appendix A
• Resolve branch early, prevent stalls
Basic Dynamic Branch Prediction
• Branch-prediction buffer
  – One bit: taken/not taken
  – Indexed by address of branch instruction
  – Toggle when prediction is incorrect
  – Mispredicts loops twice as often as necessary:
    • At end of loop (inevitable)
    • At start of loop
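
A small simulation makes the "twice as often" claim concrete. It is a sketch for a single branch only: the function name, the initial prediction of not taken, and the loop shape (nine taken iterations followed by one exit, run three times) are assumptions chosen for the example.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of a one-bit predictor for a single branch."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken   # toggle the bit when the prediction was wrong
    return misses

# Loop branch: taken 9 times, then not taken on exit; the loop is run 3 times.
loop = [True] * 9 + [False]
misses = one_bit_mispredictions(loop * 3)
```

Each run of the loop costs two mispredictions: one at the exit, which flips the bit to "not taken", and one on the first iteration of the next run.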
Branch Prediction
• Two-bit scheme
  – Mispredicts loops once only
[Figure: state diagram for the two-bit scheme, with “Predict taken” and “Predict not taken” states and “Taken” / “Not taken” transitions]
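
One common way to realise a two-bit scheme is a saturating counter, sketched below. The slide's state diagram may use a slightly different transition pattern, so treat this as one variant; the function name, the starting state (strongly taken) and the loop shape are assumptions for the example.

```python
def two_bit_mispredictions(outcomes, state=3):
    """Count mispredictions of a two-bit saturating counter for one branch.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move one step towards the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# Loop branch: taken 9 times, then not taken on exit; the loop is run 3 times.
loop = [True] * 9 + [False]
misses = two_bit_mispredictions(loop * 3)
```

The loop exit moves the counter only from strongly to weakly taken, so the first iteration of the next run is still predicted taken and only the exit itself is mispredicted.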
Branch Prediction Performance
• SPEC89 (IBM PowerPC)
  – 4096 entry branch-prediction buffer
  – Accuracy: 82% – 99% (better for FP than int)
• Doing better
  – Correlating predictors (two-level predictors)
  – Use other branches to help prediction
Tournament Predictors
• Multilevel branch predictor
• Predict which of several predictors to use!
• E.g. Alpha 21264
  – 2-bit tournament predictor
    • Global predictor (2-bit)
    • Local predictor: two levels:
      – 10-bit history table
      – Standard 3-bit predictor
  – Very good (1.15% misprediction rate for int, 0.1% for FP)
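
The selection idea can be sketched with a heavily simplified model. This is not the Alpha 21264 organisation: the class names, the 2-bit global history, the single-level local predictor, the dictionary-based tables and the initial counter state are all assumptions made to keep the example short.

```python
class Counter2:
    """Two-bit saturating counter; states 2 and 3 mean 'yes'."""
    def __init__(self, state=1):
        self.state = state

    def high(self):
        return self.state >= 2

    def update(self, up):
        self.state = min(self.state + 1, 3) if up else max(self.state - 1, 0)

class Tournament:
    HISTORY_BITS = 2   # assumed, far smaller than a real global history

    def __init__(self):
        self.history = 0
        self.global_table = {}   # global history -> counter (taken?)
        self.local_table = {}    # branch address -> counter (taken?)
        self.chooser = {}        # branch address -> counter (use global?)

    def _components(self, pc):
        g = self.global_table.setdefault(self.history, Counter2())
        l = self.local_table.setdefault(pc, Counter2())
        c = self.chooser.setdefault(pc, Counter2())
        return g, l, c

    def predict(self, pc):
        g, l, c = self._components(pc)
        return g.high() if c.high() else l.high()

    def update(self, pc, taken):
        g, l, c = self._components(pc)
        g_ok, l_ok = g.high() == taken, l.high() == taken
        if g_ok != l_ok:
            c.update(g_ok)       # move the chooser towards whichever was right
        g.update(taken)
        l.update(taken)
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A branch that alternates taken / not taken defeats a plain two-bit counter.
predictor = Tournament()
outcomes = [True, False] * 10
correct = []
for taken in outcomes:
    correct.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)
```

On the alternating pattern the local counter keeps guessing wrong while the history-indexed predictor learns the pattern, so the chooser settles on the global component after a few branches.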
3.5. High-Performance Fetch
• Branch-target buffers
  – Holds target address for predicted taken branches
  – Indexed by current address
• IF: if currently decoding instruction is branch, fetch target address
  – One cycle earlier than previously
  – Effectively, predicts next PC
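
A branch-target buffer can be pictured as a small table from fetch address to predicted next PC. The sketch below uses an unbounded dictionary and a fixed 4-byte instruction size, both of which are assumptions; a real buffer has a limited number of entries and tag checks. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, instruction_size=4):
        self.instruction_size = instruction_size   # assumed fixed-size instructions
        self.targets = {}                          # fetch address -> predicted target

    def next_pc(self, pc):
        """Predict the next PC during fetch: a hit means 'taken branch, go to target'."""
        return self.targets.get(pc, pc + self.instruction_size)

    def update(self, pc, taken, target):
        """After the branch resolves, keep entries only for taken branches."""
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)

btb = BranchTargetBuffer()
first = btb.next_pc(0x100)                  # miss: fall through to the next instruction
btb.update(0x100, taken=True, target=0x80)
second = btb.next_pc(0x100)                 # hit: fetch from the stored target
btb.update(0x100, taken=False, target=0x80)
third = btb.next_pc(0x100)                  # entry removed: fall through again
```

Because the lookup uses only the fetch address, the prediction is available before the instruction has been decoded.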