MS 108 Computer System I Lecture 7 Tomasulo’s Algorithm
- Slides: 35
MS 108 Computer System I
Lecture 7: Tomasulo’s Algorithm
Prof. Xiaoyao Liang
2015/4/10
Tomasulo’s Algorithm
• From IBM 360/91
• Goal: high performance using a limited number of registers, without a special compiler
  – 4 double-precision FP registers on the 360
  – Uses register renaming
• Why study a 1966 computer?
  – Its descendants include: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
Tomasulo Algorithm
• Control and buffers are distributed with the functional units (FUs)
  – FU buffers are called “reservation stations” (RS)
  – They contain information about instructions, including operands
  – More reservation stations than registers, so the hardware can do optimizations compilers can’t
• Registers in instructions are replaced by values or by pointers to reservation stations
  – A form of register renaming
  – Avoids WAR and WAW hazards
• Results go to FUs from RSs, not through registers (equivalent to forwarding). A Common Data Bus (CDB) broadcasts results to all FUs (their RSs)
• Loads and stores are treated as FUs with RSs as well
Tomasulo Organization [block diagram]: FP Op Queue; FP Registers; Load Buffers (Load 1 to Load 6, from memory); Store Buffers (to memory); Reservation Stations Add 1 to Add 3 (FP adders) and Mult 1, Mult 2 (FP multipliers); Common Data Bus (CDB).
Tomasulo Organization: Reservation Station Components
• Busy: indicates that the reservation station or FU is busy
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
• Qj, Qk: reservation stations producing the source registers (value to be written)
  – Note: Qj, Qk = 0 => ready
• A: effective address
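The fields listed above can be pictured as a small record. The following is only a minimal sketch: the class name, field spellings, and the `ready()` helper are illustrative assumptions, and `None` stands in for the slide's "Q = 0" (no pending producer).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """One reservation station entry (hypothetical layout, not from the slides)."""
    name: str                    # station label, e.g. "Add1" or "Mult2"
    busy: bool = False           # Busy: station (or its FU) is in use
    op: Optional[str] = None     # Op: operation to perform, e.g. "MULTD"
    vj: Optional[float] = None   # Vj: value of the first source operand
    vk: Optional[float] = None   # Vk: value of the second source operand
    qj: Optional[str] = None     # Qj: station producing Vj; None plays the role of 0
    qk: Optional[str] = None     # Qk: station producing Vk; None plays the role of 0
    a: Optional[int] = None      # A: effective address (loads and stores)

    def ready(self) -> bool:
        """An occupied station may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None


# A multiply whose first operand is still being produced by Load1.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.0, qj="Load1")
print(mult1.ready())  # False

# Once the value arrives, Vj is filled in and Qj is cleared.
mult1.vj, mult1.qj = 3.5, None
print(mult1.ready())  # True
```

Keeping either a value (V) or a producer tag (Q) per operand is what lets the station hold an instruction whose inputs do not exist yet.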
Tomasulo Organization
• Register result status (Qi)
  – Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
• Common data bus
  – Normal data bus: data + destination (“go to” bus)
  – CDB: data + source (“come from” bus)
  – 64 bits of data + 4 bits of functional unit source address
  – A receiver writes if the source matches the functional unit it expects (the one producing its result)
  – Does the broadcast
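To make the "come from" idea concrete, here is a small sketch of how a broadcast could update the register file. The function name `broadcast` and the dictionary representation are assumptions for illustration; only the behavior (write when the source tag matches Qi) comes from the slide.

```python
from typing import Dict, List, Optional


def broadcast(tag: str, value: float,
              regs: Dict[str, float],
              reg_status: Dict[str, Optional[str]]) -> List[str]:
    """Deliver (value, source tag) on the CDB to the register file.

    A register is written only if its Qi entry names the broadcasting unit.
    Returns the registers that accepted the value.
    """
    written = []
    for reg, qi in reg_status.items():
        if qi == tag:
            regs[reg] = value
            reg_status[reg] = None  # blank: no pending writer any more
            written.append(reg)
    return written


regs = {"F0": 0.0, "F4": 0.0}
reg_status = {"F0": "Load2", "F4": "Mult2"}  # Qi: who will write each register

# A result tagged Load1 arrives, but F0 has since been renamed to Load2,
# so the register file ignores it.
print(broadcast("Load1", 10.0, regs, reg_status))  # []

# The result tagged Load2 matches Qi for F0 and is written.
print(broadcast("Load2", 20.0, regs, reg_status))  # ['F0']
```

Because the bus carries the source rather than a destination, the producer does not need to know how many consumers are listening.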
Three Stages of Tomasulo Algorithm
• 1. Issue: get instruction from FP Op Queue
  – If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
• 2. Execute: operate on operands (EX)
  – When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
• 3. Write result: finish execution (WB)
  – Write on the Common Data Bus to all awaiting units; mark the reservation station available
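The three stages can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the 360/91 hardware: there is a single shared pool of stations (named RS1, RS2 here, hypothetical), no timing, no loads or stores, and the names `issue`, `execute`, and `write_result` are made up for the example.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RS:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None


OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


def read_operand(src: str, regs: Dict[str, float],
                 qi: Dict[str, Optional[str]]) -> Tuple[Optional[float], Optional[str]]:
    """Return (value, None) if the register is current, else (None, producer tag)."""
    tag = qi[src]
    return (regs[src], None) if tag is None else (None, tag)


def issue(op, dest, src1, src2, stations: List[RS], regs, qi) -> Optional[RS]:
    """Stage 1: take a free station, capture operands or tags, rename dest."""
    for rs in stations:
        if not rs.busy:
            rs.busy, rs.op = True, op
            rs.vj, rs.qj = read_operand(src1, regs, qi)
            rs.vk, rs.qk = read_operand(src2, regs, qi)
            qi[dest] = rs.name  # later readers of dest wait on this station
            return rs
    return None  # structural hazard: no free station, so issue stalls


def execute(rs: RS) -> Optional[float]:
    """Stage 2: compute only when both operands have arrived."""
    if rs.qj is not None or rs.qk is not None:
        return None
    return OPS[rs.op](rs.vj, rs.vk)


def write_result(rs: RS, result: float, stations: List[RS], regs, qi) -> None:
    """Stage 3: broadcast (result, rs.name) to waiting stations and registers."""
    for other in stations:
        if other.qj == rs.name:
            other.vj, other.qj = result, None
        if other.qk == rs.name:
            other.vk, other.qk = result, None
    for reg, tag in qi.items():
        if tag == rs.name:
            regs[reg], qi[reg] = result, None
    rs.busy = False  # station becomes available again


regs = {"F0": 3.0, "F2": 2.0, "F4": 0.0, "F6": 0.0}
qi = {r: None for r in regs}
stations = [RS("RS1"), RS("RS2")]

m = issue("MULTD", "F4", "F0", "F2", stations, regs, qi)
a = issue("ADDD", "F6", "F4", "F2", stations, regs, qi)  # F4 is pending: Qj = RS1
print(execute(a))            # None, still waiting on RS1
write_result(m, execute(m), stations, regs, qi)
write_result(a, execute(a), stations, regs, qi)
print(regs["F4"], regs["F6"])  # 6.0 8.0
```

Note that `issue` reads the sources before renaming the destination, so an instruction such as ADDD F0, F0, F2 still sees the old F0.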
Tomasulo Loop Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
• This time assume multiply takes 4 clock cycles in the execution stage
• Assume the 1st load takes 8 clock cycles (L1 cache miss) in the execution stage, and the 2nd load takes 1 extra cycle (hit)
• Assume store takes 3 cycles in the execution stage
• To be clear, clocks for SUBI and BNEZ will not be shown
• Show about 2 iterations
Loop Example (initial state), using a simplified presentation for load/store components
[State tables; recoverable layout only]
• Instruction status: iterations 1 and 2 of LD F0, 0(R1); MULTD F4, F0, F2; SD F4, 0(R1); columns ITER, Instruction, j, k, Issue, Exec Comp., Write Result
• Load and store buffers: Load 1 to Load 3 and Store 1 to Store 3 (store buffers added); columns Busy, Addr; load buffers not busy
• Reservation stations: Add 1 to Add 3, Mult 1, Mult 2; columns Time, Name, Busy, Op, Vj, Vk, Qj, Qk; none busy
• Register result status: Clock 0, R1 = 80, Qi for F0, F2, F4, ..., F30
• Annotations: iteration count; instruction loop (code: LD, MULTD, SD, SUBI, BNEZ); value of the register used for address and iteration control
Loop Example Cycle 1
Loop Example Cycle 2
Loop Example Cycle 3
Loop Example Cycle 4
• Dispatching SUBI instruction (not in FP queue)
Loop Example Cycle 5
• And BNEZ instruction (not in FP queue)
Loop Example Cycle 6
• Notice that F0 never sees the load from location 80
Loop Example Cycle 7
• Register file completely detached from computation
• First and second iterations completely overlapped
Loop Example Cycle 8
Loop Example Cycle 9
• Load 1 completing: who is waiting?
• Note: dispatching SUBI
Loop Example Cycle 10

Instruction status:
ITER  Instruction  j   k    Issue  Exec Comp.  Write Result
1     LD     F0    0   R1   1      9           10
1     MULTD  F4    F0  F2   2
1     SD     F4    0   R1   3
2     LD     F0    0   R1   6      10
2     MULTD  F4    F0  F2   7
2     SD     F4    0   R1   8

Load and store buffers:
Name     Busy  Addr  Waiting on
Load 1   No
Load 2   Yes   72
Load 3   No
Store 1  Yes   80    Mult 1
Store 2  Yes   72    Mult 2
Store 3  No

Reservation stations:
Time  Name    Busy  Op     Vj     Vk     Qj      Qk
      Add 1   No
      Add 2   No
      Add 3   No
4     Mult 1  Yes   Multd  M[80]  R(F2)
      Mult 2  Yes   Multd         R(F2)  Load 2

Register result status:
Clock  R1  F0      F2  F4      F6  F8  F10  F12  ...  F30
10     64  Load 2      Mult 2

• Load 2 completing: who is waiting?
• Note: dispatching BNEZ
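The question "who is waiting?" can be answered mechanically by searching for the completing unit's tag. The sketch below encodes only the busy entries of the cycle 10 state as plain dictionaries; the helper name `waiters` and the dictionary keys are illustrative assumptions.

```python
# Cycle 10 state, busy entries only. Values such as "M[80]" are kept
# symbolic, as on the slide.
stations = {
    "Mult1": {"op": "MULTD", "Vj": "M[80]", "Vk": "R(F2)", "Qj": None, "Qk": None},
    "Mult2": {"op": "MULTD", "Vj": None, "Vk": "R(F2)", "Qj": "Load2", "Qk": None},
}
store_buffers = {
    "Store1": {"addr": 80, "Q": "Mult1"},
    "Store2": {"addr": 72, "Q": "Mult2"},
}
reg_status = {"F0": "Load2", "F4": "Mult2"}  # Qi entries that are not blank


def waiters(tag):
    """List every station, store buffer, and register waiting on `tag`."""
    names = [n for n, rs in stations.items() if tag in (rs["Qj"], rs["Qk"])]
    names += [n for n, sb in store_buffers.items() if sb["Q"] == tag]
    names += [r for r, q in reg_status.items() if q == tag]
    return names


print(waiters("Load2"))  # ['Mult2', 'F0']
print(waiters("Mult1"))  # ['Store1']
print(waiters("Mult2"))  # ['Store2', 'F4']
```

In this snapshot, the first multiply's result is wanted only by Store 1: register F4 has already been renamed to Mult 2, so the register file will not take the Mult 1 value.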
Loop Example Cycle 11
• Next load in sequence
Loop Example Cycle 12
• Why not issue third multiply?
Loop Example Cycle 13
• Why not issue third store?
Loop Example Cycle 14
• Mult 1 completing. Who is waiting?
Loop Example Cycle 15
• Mult 2 completing. Who is waiting?
Loop Example Cycle 16
Loop Example Cycle 17
Loop Example Cycle 18
Loop Example Cycle 19
Loop Example Cycle 20
• Once again: in-order issue, out-of-order execution, and out-of-order completion.
Why can Tomasulo overlap iterations of loops?
• Register renaming
  – Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
• Reservation stations
  – Buffer old values of registers, avoiding the WAR stall that we saw in the scoreboard.
• Other perspective: Tomasulo builds the data flow dependency graph on the fly.
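The "dependency graph on the fly" view can be illustrated by renaming two iterations of the loop body. This is a rough sketch under simplifying assumptions: tags T0, T1, ... are hypothetical stand-ins for reservation station names, SUBI is tracked in the same table for brevity (the slides handle it outside the FP queue), and BNEZ is omitted.

```python
def rename(program):
    """For each instruction, replace every source register by the tag of the
    most recent earlier instruction that writes it (or keep the register
    name if no write is pending). Returns (tag, op, sources) in issue order."""
    producer = {}  # architectural register -> tag of its latest writer
    graph = []
    for i, (op, dest, srcs) in enumerate(program):
        tag = f"T{i}"
        graph.append((tag, op, [producer.get(s, s) for s in srcs]))
        if dest is not None:
            producer[dest] = tag  # later readers now depend on this tag
    return graph


# (op, destination, sources); SD has no register destination.
body = [
    ("LD", "F0", ["R1"]),
    ("MULTD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
]

for tag, op, deps in rename(body * 2):
    print(tag, op, deps)
# T0 LD ['R1']
# T1 MULTD ['T0', 'F2']
# T2 SD ['T1', 'R1']
# T3 SUBI ['R1']
# T4 LD ['T3']
# T5 MULTD ['T4', 'F2']
# T6 SD ['T5', 'T3']
# T7 SUBI ['T3']
```

Both iterations name F0 and F4, yet the second multiply depends on T4 rather than T0. The two iterations share no tags for their FP values, which is the sense in which renaming unrolls the loop dynamically.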
Tomasulo’s scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
  – Distributed reservation stations and the CDB
  – If multiple instructions are waiting on a single result, they can be released simultaneously by the broadcast on the CDB
  – If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) The elimination of stalls for WAW and WAR hazards