CSE 490/590 Computer Architecture
ILP II
Steve Ko
Computer Sciences and Engineering, University at Buffalo
CSE 490/590, Spring 2011

Last time…
• Register renaming
  – Overcoming the restriction caused by the limited number of registers
  – Reorder buffer & renaming table
• Precise interrupts
  – It must appear as if an interrupt has occurred in between two instructions

Precise Interrupts
It must appear as if an interrupt is taken between two instructions (say I_i and I_(i+1))
• the effect of all instructions up to and including I_i is totally complete
• no effect of any instruction after I_i has taken place
The interrupt handler either aborts the program or restarts it at I_(i+1).

Phases of Instruction Execution
[Diagram: PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State]
• Fetch: Instruction bits retrieved from cache.
• Decode: Instructions placed in appropriate issue (aka “dispatch”) stage buffer.
• Execute: Instructions and operands sent to execution units. When execution completes, all results and exception flags are available.
• Commit: Instruction irrevocably updates architectural state (aka “graduation” or “completion”).

In-Order Commit for Precise Exceptions
[Diagram: Fetch and Decode (in-order) → Reorder Buffer → Execute (out-of-order) → Commit (in-order); on "Exception?" at commit, kill signals go to the earlier stages and the handler PC is injected at fetch]
• Instructions fetched and decoded into instruction reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile & memory) is in-order
Temporary storage needed to hold results before commit (shadow registers and store buffers)

Extensions for Precise Exceptions
[Diagram: reorder buffer with entry fields Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr2 = next to commit, ptr1 = next available]
• add <pd, dest, data, cause> fields in the instruction template
• commit instructions to reg file and memory in program order ⇒ buffers can be maintained circularly
• on exception, clear reorder buffer by resetting ptr1 = ptr2 (stores must wait for commit before updating memory)
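The circular reorder buffer described above can be sketched roughly as follows. This is a minimal illustration, not the slides' design: the class and method names, the `done` flag, and the `count` occupancy field are assumptions introduced here; only `ptr1` (next available), `ptr2` (next to commit), and the `dest`/`data`/`cause` fields come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    dest: str                    # destination register (slide field "dest")
    data: Optional[int] = None   # result value (slide field "data")
    cause: Optional[str] = None  # exception cause, if any (slide field "cause")
    done: bool = False           # hypothetical flag: execution has finished


class ReorderBuffer:
    def __init__(self, size: int):
        self.slots = [None] * size
        self.ptr1 = 0   # next available slot
        self.ptr2 = 0   # next to commit
        self.count = 0  # occupancy (assumed helper to tell full from empty)

    def allocate(self, dest: str) -> int:
        """In-order allocation at decode; returns the entry's tag."""
        if self.count == len(self.slots):
            raise RuntimeError("reorder buffer full")
        tag = self.ptr1
        self.slots[tag] = Entry(dest)
        self.ptr1 = (self.ptr1 + 1) % len(self.slots)
        self.count += 1
        return tag

    def complete(self, tag: int, data=None, cause=None) -> None:
        """Out-of-order completion: record the result or the exception."""
        entry = self.slots[tag]
        entry.data, entry.cause, entry.done = data, cause, True

    def commit(self, regfile: dict) -> Optional[str]:
        """Commit finished entries in program order; flush on an exception."""
        while self.count and self.slots[self.ptr2].done:
            entry = self.slots[self.ptr2]
            if entry.cause is not None:
                self.ptr1 = self.ptr2  # clear the buffer: ptr1 = ptr2
                self.count = 0
                return entry.cause
            regfile[entry.dest] = entry.data
            self.ptr2 = (self.ptr2 + 1) % len(self.slots)
            self.count -= 1
        return None
```

In this sketch a younger instruction that finishes early stays in the buffer until all older instructions have committed, so its result never reaches the register file if an older instruction raises an exception.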

Rollback and Renaming
[Diagram: Register File (now holds only committed state); reorder buffer entries t1, t2, ..., tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Load Unit, FUs, and Store Unit return < t, result >; Commit writes to the register file]
Register file does not contain renaming tags any more.
How does the decode stage find the tag of a source register?
Search the “dest” field in the reorder buffer

Renaming Table
[Diagram: Rename Table (entries r1, r2, ..., each with a tag t and a valid bit v) next to the Register File; reorder buffer entries t1, t2, ..., tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Load Unit, FUs, and Store Unit return < t, result >; Commit]
Renaming table is a cache to speed up register name look up. It needs to be cleared after each exception taken.
When else are valid bits cleared? Control transfers
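A rough sketch of how such a rename table might behave is shown below. The class and method names are hypothetical, and the policy of clearing an entry at commit only when the committing tag is still the latest writer is an assumption of this sketch, not something the slide states. The presence of a key in the dictionary stands in for the valid bit.

```python
from typing import Dict, Optional


class RenameTable:
    def __init__(self) -> None:
        # register name -> reorder buffer tag; key present == valid bit set
        self.table: Dict[str, int] = {}

    def lookup(self, reg: str) -> Optional[int]:
        """Tag of the in-flight producer of reg, or None (read the register file)."""
        return self.table.get(reg)

    def rename_dest(self, reg: str, tag: int) -> None:
        """At decode: the newest writer of reg is the entry with this tag."""
        self.table[reg] = tag

    def on_commit(self, reg: str, tag: int) -> None:
        """Assumed policy: clear the valid bit only if tag is still the latest writer."""
        if self.table.get(reg) == tag:
            del self.table[reg]

    def clear(self) -> None:
        """Clear every valid bit, e.g. after an exception is taken."""
        self.table.clear()
```

Compared with searching the “dest” field of every reorder buffer entry, the lookup here is a single table access per source register.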

Control Flow Penalty
[Diagram: PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State, with stages Fetch, Decode, Execute, Commit; "Next fetch started" marked at the PC and "Branch executed" marked at the functional units]
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!

MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?

Instruction    Taken known?          Target known?
J              After Inst. Decode    After Inst. Decode
JR             After Inst. Decode    After Reg. Fetch
BEQZ/BNEZ      After Reg. Fetch*     After Inst. Decode

*Assuming zero detect on register read

Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750 MHz, 2000)

Stage   Function
A       PC Generation/Mux
P       Instruction Fetch Stage 1
F       Instruction Fetch Stage 2
B       Branch Address Calc/Begin Decode
I       Complete Decode
J       Steer Instructions to Functional units
R       Register File Read
E       Integer Execute
        Remainder of execute pipeline (+ another 6 stages)

Branch Target Address Known: after stage B
Branch Direction & Jump Register Target Known: after stage E

Reducing Control Flow Penalty
Software solutions
• Eliminate branches - loop unrolling
  Increases the run length
• Reduce resolution time - instruction scheduling
  Compute the branch condition as early as possible (of limited value)
Hardware solutions
• Find something else to do - delay slots
  Replaces pipeline bubbles with useful work (requires software cooperation)
• Speculate - branch prediction
  Speculative execution of instructions beyond the branch

CSE 490/590 Administrivia
• Project 1 & midterm grading mostly done
  – Will distribute on Wed
  – Regrading -> Jangyoung
• Project 2
  – Start early!

Branch Prediction
Motivation:
  Branch penalties limit performance of deeply pipelined processors
  Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
  Prediction structures:
  • Branch history tables, branch target buffers, etc.
  Mispredict recovery mechanisms:
  • Keep result computation separate from commit
  • Kill instructions following branch in pipeline
  • Restore state to state following branch

Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
[Diagram: backward branch (JZ) taken ~90% of the time; forward branch (JZ) taken ~50% of the time]
• ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
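The backward/forward numbers above suggest a simple direction-based rule: predict backward branches taken and forward branches not taken. The sketch below illustrates that heuristic; it is not stated on the slide as a specific scheme, and the function names and the trace format `(pc, target, taken)` are assumptions made for illustration.

```python
from typing import List, Tuple


def predict_static(pc: int, target: int) -> bool:
    """Predict taken for backward branches (target below the branch PC)."""
    return target < pc


def accuracy(trace: List[Tuple[int, int, bool]]) -> float:
    """Fraction of (pc, target, taken) records predicted correctly."""
    correct = sum(predict_static(pc, target) == taken for pc, target, taken in trace)
    return correct / len(trace)


# A loop-closing branch at 0x40 jumping back to 0x20: taken 9 times, then falls through.
loop_trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
loop_accuracy = accuracy(loop_trace)
```

On this hypothetical trace the only misprediction is the final loop exit.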

Dynamic Branch Prediction: learning based on past behavior
Temporal correlation
  The way a branch resolves may be a good predictor of the way it will resolve at the next execution
Spatial correlation
  Several branches may resolve in a highly correlated manner (a preferred path of execution)

Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[State diagram: four states (take/right, take/wrong, ¬take/wrong, ¬take/right) with transitions on taken and ¬taken]
BP state: (predict take/¬take) x (last prediction right/wrong)
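One common way to realize "change the prediction after two consecutive mistakes" is a 2-bit saturating counter. The sketch below assumes that encoding (3 = take/right, 2 = take/wrong, 1 = ¬take/wrong, 0 = ¬take/right); the exact transition arcs of the slide's diagram are not recoverable from the text, so treat the mapping and the class name as assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredicts(outcomes, predictor: TwoBitPredictor) -> int:
    """Run the predictor over a sequence of branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch taken 9 times and then not taken, with the loop run 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
loop_mispredicts = count_mispredicts(loop_outcomes, TwoBitPredictor())
```

With this encoding a single loop exit moves the counter from 3 to 2 but does not flip the prediction, so the first iteration of the next loop run is still predicted taken.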

Branch History Table
[Diagram: the Fetch PC (low-order bits 00) accesses the I-Cache while k PC bits form the BHT index into a 2^k-entry BHT, 2 bits/entry; the instruction's opcode gives "Branch?", PC + offset gives the Target PC, and the BHT entry gives "Taken/¬Taken?"]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
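The indexing scheme can be sketched as follows: drop the two low-order PC bits, then use the next k bits to select a 2-bit counter. The class name, the initial counter value, and the use of saturating counters are assumptions of this sketch; k = 12 gives the 4K-entry table mentioned above.

```python
class BranchHistoryTable:
    def __init__(self, k: int = 12):
        self.k = k
        # 2**k two-bit counters; starting at 2 ("weakly taken") is an assumption.
        self.counters = [2] * (1 << k)

    def index(self, pc: int) -> int:
        """Drop the 2 low-order bits (word-aligned PC), keep the next k bits."""
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc: int) -> bool:
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the table is untagged, two branches whose PCs share the same k index bits share one counter, which is one source of mispredictions in this scheme.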

Acknowledgements
• These slides heavily contain material developed and copyright by
  – Krste Asanovic (MIT/UCB)
  – David Patterson (UCB)
• And also by:
  – Arvind (MIT)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252