Super-scalar

Loose Ends
• Up to now:
  – Techniques for handling register-related dependencies
    • Register renaming for WAR, WAW
    • Tomasulo’s algorithm for scheduling RAW
• Still need to address:
  – Control dependencies
  – Memory dependencies
  – Faults

Big Picture
• Modern CPUs are/use:
  – Superscalar
  – Out-of-Order
  – Speculative execution
Pipeline stages:
  Fetch: superscalar fetch, branch prediction, target prediction (may be speculative)
  Decode: superscalar decode (hard for CISC)
  Rename
  Issue (AKA “Alloc”): still in-order, superscalar; send to RS, ROB, but also LB, SB
  Schedule: OOO scheduling; readiness, arbitration; not in program order!
  Exec: actual execution; LB or SB
  Writeback: update ROB/PRF
  Commit: write results to ARF/Memory

Non-Spec Exec
• Fetch instructions, even fetch past a branch, but don’t exec right away (assume fetching/retiring 3 inst. per cycle)
Pipeline stages: fetch (F), decode (D), issue (I), exec (E), write (W), commit (C)
1. DIV R1 = R2 / R3
2. ADD R4 = R5 + R6
3. BEQZ R1, foo
4. SUB R5 = R4 - R3
5. MUL R7 = R5 * R2
6. MUL R8 = R7 * R3
Assume execution of:
  Add/Sub/Beq: 1 cycle
  Mult: 4 cycles
  Divide: 10 cycles

Branch Prediction/Speculative Execution
• When we hit a branch, guess if it’s T or NT
  – We’ll discuss how next lecture
[Figure: control-flow graph of instruction blocks A, B, C, D, Q, R, with T and NT edges leaving each branch]
Guess T: keep scheduling and executing instructions as if the branch didn’t even exist
Sometime later, if we messed up… just throw it all out and fetch the correct instructions

Again, with Speculative Execution
• Assume fetched direction is correct
Pipeline stages: fetch (F), decode (D), issue (I), exec (E), write (W), commit (C)
1. DIV R1 = R2 / R3
2. ADD R4 = R5 + R6
3. BEQZ R1, foo
4. SUB R5 = R4 - R3
5. MUL R7 = R5 * R2
6. MUL R8 = R7 * R3

Branch Misprediction Recovery
[Figure: ARF, RAT, and the instruction stream around a mispredicted branch (br)]
• ARF state corresponds to state prior to oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state

Solution 1: Stall and Drain
[Figure: ARF, RAT, the mispredicted branch (br), and the correct-path target (foo)]
• Correct-path instructions from fetch can’t rename because RAT is wrong
• Allow all instructions to execute and commit; ARF corresponds to last committed instruction
• ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
Pros: Very simple to implement
Cons: Performance loss due to stalls

Solution 2: Checkpointing
[Figure: ARF, one RAT copy per branch (br, br), the correct-path target (foo), and the Checkpoint Free Pool]
• At each branch, make a copy of the RAT (register mapping at the time of the branch)
• On a misprediction:
  1. flush wrong-path instructions
  2. deallocate RAT checkpoints
  3. recover RAT from checkpoint
  4. resume renaming
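The checkpoint-and-recover steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular processor does it: the `Renamer` class, its method names, the integer branch ids (assumed to increase with program order), and the `P0, P1, ...` tag scheme are all hypothetical, and flushing the wrong-path instructions and freeing their physical registers are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy renamer: the RAT maps architected register -> physical tag."""
    rat: dict = field(default_factory=dict)
    checkpoints: dict = field(default_factory=dict)  # branch id -> RAT copy
    next_tag: int = 0

    def rename_dest(self, arch_reg):
        """Allocate a new physical tag for a destination register."""
        tag = f"P{self.next_tag}"
        self.next_tag += 1
        self.rat[arch_reg] = tag
        return tag

    def rename_branch(self, branch_id):
        """At each branch, copy the RAT as it is at the time of the branch."""
        self.checkpoints[branch_id] = dict(self.rat)

    def recover(self, branch_id):
        """On a misprediction of branch_id, restore its RAT copy.

        Checkpoints of this branch and of all younger (larger id) branches
        are deallocated; older checkpoints stay live.
        """
        restored = self.checkpoints[branch_id]
        for b in [b for b in self.checkpoints if b >= branch_id]:
            del self.checkpoints[b]
        self.rat = restored


renamer = Renamer()
renamer.rename_dest("R1")   # correct path
renamer.rename_branch(1)    # checkpoint taken at the branch
renamer.rename_dest("R1")   # wrong-path rename overwrites the mapping
renamer.recover(1)          # misprediction: mapping for R1 is restored
print(renamer.rat)
```

After `recover`, renaming can resume immediately with the restored mapping, which is the advantage over stalling until the ARF catches up.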

Speculating Past Multiple Branches
• Branch every 4-6 instructions
• Pipeline depth typically 10-20 stages
• With peak 3-4 instructions per cycle
• 20 stages * 3 inst / stage * 1 branch / 5 inst
  – Approximately 12 branches in the pipeline when pipe is full
• Need 12 checkpoints (on average, more for burst)
• What’s the probability of still being on right path?
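The arithmetic on this slide, and one way to approach its closing question, can be worked through in Python. The slide does not give a prediction accuracy, so the accuracies below are assumed values chosen only for illustration, and the model further assumes that each branch is predicted independently of the others.

```python
def branches_in_flight(stages, insts_per_stage, insts_per_branch):
    """Branches in the pipeline when the pipe is full."""
    return stages * insts_per_stage / insts_per_branch


def prob_on_right_path(accuracy, branches):
    """Chance that every in-flight branch was predicted correctly,
    assuming independent predictions with the same accuracy."""
    return accuracy ** branches


n = branches_in_flight(stages=20, insts_per_stage=3, insts_per_branch=5)
print(f"branches in flight: {n:.0f}")

# Assumed per-branch accuracies, not taken from the slides.
for accuracy in (0.90, 0.95, 0.99):
    p = prob_on_right_path(accuracy, n)
    print(f"accuracy {accuracy:.2f}: P(still on right path) = {p:.3f}")
```

Under these assumptions the probability falls off quickly with the number of unresolved branches, which is why even small gains in prediction accuracy matter for deep, wide pipelines.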

Speculative Execution is OK
• ROB maintains program order
• ROB temporarily stores results
  – If we screw something up, only the ROB knows, but no architected state is affected
• Register rename recovery makes sure we resume with correct register mapping
If we screw up, we:
  1. Can undo the effects
  2. Know how to resume

Non-Spec Memory Instruction
• Fetch instructions, even fetch past a branch, but don’t exec right away (assume fetching/retiring 3 inst. per cycle)
Pipeline stages: fetch (F), decode (D), issue (I), exec (E), write (W), commit (C)
1. LOAD R3 = 0(R6)
2. ADD R7 = R3 + R9
3. STORE R4, 0(R7)
4. SUB R1 = R1 - R2
5. LOAD R8 = 0(R1)
6. LOAD R9 = 0(R2)
Assume execution of:
  Add/Sub: 1 cycle
  Load/Store hit: 2 cycles
  Load/Store miss: 100 cycles

Executing Memory Instructions
Load R3 = 0[R6]      (issue; cache miss!)
Add R7 = R3 + R9
Store R4, 0[R7]
Sub R1 = R1 - R2
Load R8 = 0[R1]      (issue; cache hit!)
Miss serviced… but there was a later load…
• If R1 != R7, then Load R8 gets correct value from cache
• If R1 == R7, then Load R8 should have gotten value from the Store, but it didn’t!

Out-of-Order Load Execution
• So don’t let loads execute out-of-order!
[Figure: dependence graph of instructions A through H with loads and stores marked, scheduled two ways]
IPC = 8/3 = 2.67 when loads may execute out of order
IPC = 8/7 = 1.14 when they may not
Ok, maybe not a good idea. No support for OOO load execution can crush your IPC

What else could happen, if we execute load before store completes?
[Figure: several orderings of memory instructions A, B, C to addresses foo and bar against the data cache (D$). Captions, in the order they appear: “No problem.”; “Some sort of data forwarding mechanism”; “Value from cache is stale”; “No problem.”; “Uh oh. Should have used B’s store value”; “Uh oh.”; “Uh oh. Luckily, this usually can’t even happen”]

Memory Disambiguation Problem
• Why can’t this happen with non-memory insts?
  – Operand specifiers in non-memory insts are absolute
    • “R1” refers to one specific location
  – Operand specifiers in memory insts are ambiguous
    • “R1” refers to a memory location specified by the value of R1. As pointers change, so does this location.

Two Problems
• Memory disambiguation
  – Are there any earlier unexecuted stores to the same address as myself? (I’m a load)
  – Binary question: answer is yes or no
• Store-to-load forwarding problem
  – Which earlier store do I get my value from? (I’m a load)
  – Which later load(s) do I forward my value to? (I’m a store)
  – Non-binary question: answer is one or more instruction identifiers

Tomasulo’s Algorithm: The Picture

Load Store Queue (LSQ)
Entries are listed from oldest (top) to youngest (bottom).

L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
     0xF85C  41779  0x3290  0
     0xF870  41780  0x3410  25
     0xF628  41781  0x3290  0
     0xF63C  41782  0x3300  1

Data Cache
Addr    Value
0x3290  -17 42
0x3300  1
0x3410  38 25
0x3418  1234

Disambiguation: loads cannot execute until all earlier store addresses computed
Forwarding: broadcast/search entire LSQ for match
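The two rules on this slide (wait for all earlier store addresses, then search for a matching store) can be sketched as follows. This is a simplified model under stated assumptions: the `Entry` class and `execute_load` function are hypothetical names, a store's data is assumed to be available whenever its address is, accesses are assumed to be the same size, and the search is a plain Python loop rather than an associative lookup. The sample entries reuse addresses and values from the first rows of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    is_store: bool
    seq: int
    addr: Optional[int] = None    # None until the address is computed
    value: Optional[int] = None   # store data (used for stores only)


def execute_load(lsq, index, cache):
    """Try to execute the load at lsq[index]; lsq is ordered oldest first.

    Returns the loaded value, or None if the load must wait.
    """
    load = lsq[index]
    older_stores = [e for e in lsq[:index] if e.is_store]
    # Disambiguation: wait until all earlier store addresses are computed.
    if any(s.addr is None for s in older_stores):
        return None
    # Forwarding: the youngest earlier store to the same address wins.
    for store in reversed(older_stores):
        if store.addr == load.addr:
            return store.value
    # No matching store in flight, so the cache holds the right value.
    return cache[load.addr]


cache = {0x3290: 42, 0x3300: 1, 0x3410: 38, 0x3418: 1234}
lsq = [
    Entry(False, 41773, 0x3290),
    Entry(True, 41774, 0x3410, 25),
    Entry(True, 41775, 0x3290, -17),
    Entry(False, 41776, 0x3418),
    Entry(False, 41777, 0x3290),
]
for i, entry in enumerate(lsq):
    if not entry.is_store:
        print(entry.seq, execute_load(lsq, i, cache))
```

Searching from the youngest earlier store backwards is what makes the answer "non-binary": the load needs the identity of one specific store, not just a yes or no.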

Memory Disambiguation
• Can we “undo” stores?
• Stores cannot be committed to memory until they are marked ready to retire
• Completed stores are queued and waiting in a store queue or store buffer
• Disambiguate (and resolve) memory dependency dynamically

Memory Ordering
Source: Alpha 21264 HRM
• A load to X bypassing an earlier load to X violates certain memory consistency models (e.g., sequential consistency)
• Load-load order trap replays

Load Store Queue (LSQ)
[Figure: ALLOC feeding the RS, the ROB, and a split LSQ made of an age-ordered Store Queue and Load Queue]
• Memory instructions are allocated into LSQ in program order
• LSQ manages memory reference ordering
• Unified LSQ vs. Split LSQ
• Sandy Bridge: 64 Load buffers, 36 Store buffers

Issuing a Load for Execution
[Figure: Store Queue (columns: issued?, age, address, data) and Load Queue (columns: issued?, age, address); one load is issued to memory for execution]
• Each load checks against older stores
  – Associative search
  – Scalability is a performance issue

Issuing a Load for Execution
[Figure: Store Queue (columns: issued?, age, address, data) and Load Queue (columns: issued?, age, address), showing store-to-load forwarding]
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)
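The "larger store precedes a smaller load" rule can be illustrated with a byte-range check. This sketch generalizes the slide's word/half example to "the load's bytes lie entirely inside the store's bytes", which is an assumption on my part, as are the function names, byte addressing, and little-endian byte order.

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """True if every byte the load reads was written by the store."""
    return (store_addr <= load_addr
            and load_addr + load_size <= store_addr + store_size)


def forward_value(store_addr, store_data, load_addr, load_size):
    """Extract the load's bytes from the store data (little-endian).

    Only meaningful when can_forward() holds for the same access.
    """
    shift = 8 * (load_addr - store_addr)
    mask = (1 << (8 * load_size)) - 1
    return (store_data >> shift) & mask


# A 4-byte store followed by a 2-byte load of its upper half.
if can_forward(0x100, 4, 0x102, 2):
    print(hex(forward_value(0x100, 0x12340000, 0x102, 2)))

# A 4-byte load that straddles the end of the store cannot be forwarded
# by this simple rule and would have to wait for the store to complete.
print(can_forward(0x100, 4, 0x102, 4))
```

A design that only handles the contained case avoids merging bytes from several stores and the cache, which is the expensive part of comprehensive size matching.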

Issuing a Load for Execution
[Figure: Store Queue (columns: issued?, age, address, data) with some addresses still unknown, and Load Queue (columns: issued?, age, address); a load is speculatively issued for execution]
• Can speculatively issue loads for shortening latency (Alpha 21264, Pentium 4 (Prescott))
  – Naively
  – Use Memory Dependency Predictor
• Store, when address ready, checks newer loads in the Load Queue
• “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)

Store Checks Premature Loads
[Figure: Store Queue (columns: issued?, age, address, data) and Load Queue (columns: issued?, age, address); a store’s address matches a newer load that already issued. Conflict detected! Replay the load]
• Store, when address ready, checks newer loads in the Load Queue
  – Associative Search
• “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)
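The store-side check can be sketched as a filter over the Load Queue. The `LoadEntry` class, the `loads_to_replay` function, and the sample ages and addresses are hypothetical; real hardware does this with an associative (parallel) search rather than a loop, and the replay mechanism itself is not modelled.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int       # larger age means younger in program order
    addr: str
    issued: bool   # True if the load already executed (perhaps speculatively)


def loads_to_replay(store_age, store_addr, load_queue):
    """Loads that issued before an older store to the same address resolved.

    Called when the store's address becomes ready.
    """
    return [
        load for load in load_queue
        if load.issued and load.age > store_age and load.addr == store_addr
    ]


load_queue = [
    LoadEntry(age=1, addr="A", issued=True),   # older than the store: safe
    LoadEntry(age=3, addr="K", issued=True),   # issued too early: conflict
    LoadEntry(age=4, addr="M", issued=True),   # different address: safe
    LoadEntry(age=5, addr="K", issued=False),  # not issued yet: no replay
]
for load in loads_to_replay(store_age=2, store_addr="K", load_queue=load_queue):
    print("replay load with age", load.age)
```

Only loads that are both younger than the store and already issued need replaying; a load that has not issued yet will find the store through the normal forwarding search.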

Issuing a Store for Execution
[Figure: Store Queue (columns: issued?, age, address, data) and Load Queue (columns: issued?, age, address); a store is issued to memory]
• The figure shows the basic concept
• Implementation dependent
  – Not allow store bypassing load, since it has little impact on performance
  – Perform associative search

Issuing a Store for Execution
[Figure: Store Queue (columns: issued?, age, address, data) and Load Queue (columns: issued?, age, address); a store that cannot issue for execution]

Load-Load Ordering
• Needed for
  – Multiprocessor support
  – Maintaining memory consistency model
• Load-load trap invoked
  – Trap on the later, conflicted instructions
  – Replay
[Figure: Load Queue (columns: issued?, age, address) with a load-load trap]

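Commit, Exceptions
• What happens if a speculatively executed instruction faults?
[Figure: two instruction streams. In the first, A and B have committed and E divides by zero before C and D commit. In the second, a branch is mispredicted and wrong-path instruction E divides by zero]
First case: Outside world sees: A, B, fault! Should have been: A, B, C, D, fault!
Second case: Fault should never be seen!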

Treat Fault Like a Result
• Regular register results written to ROB until
  – Instruction is oldest for in-order state update
  – Instruction is known to be non-speculative
• Do the same with faults!
  – At exec, make note of any faults, but don’t expose to outside world
  – If the instruction gets to commit, then expose the fault

Example

   Instruction         ROB entry         E  W  F
A  LOAD R1 = 0[R2]     LOAD P1 imm R2    X  X
B  ADD R3 = R1 + R4    ADD P2 P1 R4      X  X
C  SUB R1 = R5 - R6    SUB P3 R5 R6      X  X
D  DIV R4 = R7 / R1    DIV P4 R7 P3      X  X  X
E  LOAD R6 = 0[R7]     LOAD P5 imm R7    X  X  X

• A misses; once the miss is resolved, A, B, and C commit (3 commits)
• D: Divide by zero. Fault deferred until architecturally correct point. Now raise fault
• Flush rest of ROB, start fetching fault handler
• E: Fault! Other fault “never happened”…

Superscalar Commit is like Sampling Scalar Commit
[Figure: scalar commit processor states, advancing one instruction (A through H) at a time, beside superscalar commit processor states, advancing several instructions at a time]
Each “state” in the superscalar machine always corresponds to one state of the scalar machine (but not necessarily the other way around), and the ordering of states is preserved

Commit “Algorithm”
• For i={0..commit_width-1}
  – Has ROB[oldest+i] finished execution?
    • If not, break
  – Does ROB[oldest+i] have a fault?
    • If so, raise it now (flush pipe, NPC = fault handler)
    • If not, write to architected state (ARF/Memory)
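The commit loop above translates almost directly into code. The sketch below is illustrative only: `RobEntry`, `commit_cycle`, the register values, and the fault strings are hypothetical, the ROB is a plain list with the oldest entry first, stores to memory are not modelled, and "flush pipe, NPC = fault handler" is reduced to clearing the ROB and returning the fault.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    name: str
    dest: str                     # architected destination register
    value: Optional[int] = None
    done: bool = False            # finished execution?
    fault: Optional[str] = None   # fault noted at exec, not yet exposed


def commit_cycle(rob, arf, commit_width):
    """Commit up to commit_width entries from the head of the ROB.

    Returns the fault raised this cycle, or None.
    """
    for _ in range(commit_width):
        if not rob or not rob[0].done:
            break                     # oldest entry has not finished
        head = rob[0]
        if head.fault is not None:
            rob.clear()               # flush everything younger as well
            return head.fault         # the fault is exposed only now
        arf[head.dest] = head.value   # update architected state
        rob.pop(0)
    return None


# Entries loosely follow the five-instruction example; values are made up.
rob = [
    RobEntry("A", "R1", 7, done=True),
    RobEntry("B", "R3", 9, done=True),
    RobEntry("C", "R1", 0, done=True),
    RobEntry("D", "R4", done=True, fault="divide by zero"),
    RobEntry("E", "R6", done=True, fault="other fault"),
]
arf = {}
print(commit_cycle(rob, arf, commit_width=3), arf)
print(commit_cycle(rob, arf, commit_width=3), arf)
```

Because a fault is checked only when its instruction reaches the head of the ROB, the fault on E is discarded by the flush and never becomes visible.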

SimpleScalar Model
• Fetch: IFQ, I$, bpred. Fakes front-end pipeline with delay
• Dispatch (Issue): RS, ROB (idep/odep). Decode, map dependencies, allocate
• Issue (Exec): readyq, LSQ. Arbitrate and schedule WB events
• Writeback: odep list. Notify dependents of result
• Commit: ROB, LSQ commit, store writeback
Simulator uses an oracle/checker approach. It keeps an “official” version of the state, and then makes sure that the simulated version correctly generates the same results (at commit).

Superscalar Commit
• To sustain > 1 IPC execution, must commit at > 1 IPC as well
• So long as no one wants to “look”, we can make multiple updates to state each cycle
  – ARF needs multiple write ports
  – Potentially multiple writes to memory

Questions?