CSE 490/590 Computer Architecture ILP II Steve Ko

CSE 490/590 Computer Architecture
ILP II
Steve Ko
Computer Sciences and Engineering, University at Buffalo
CSE 490/590, Spring 2011

Last time…
• Register renaming
  – Overcoming the restriction caused by the number of registers
  – Reorder buffer & renaming table
• Precise interrupts
  – It must appear as if an interrupt has occurred in between two instructions

Precise Interrupts
It must appear as if an interrupt is taken between two instructions (say I_i and I_{i+1}):
• the effect of all instructions up to and including I_i is totally complete
• no effect of any instruction after I_i has taken place
The interrupt handler either aborts the program or restarts it at I_{i+1}.

Phases of Instruction Execution
[Figure: PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State]
• Fetch: Instruction bits retrieved from cache.
• Decode: Instructions placed in appropriate issue (aka “dispatch”) stage buffer.
• Execute: Instructions and operands sent to execution units. When execution completes, all results and exception flags are available.
• Commit: Instruction irrevocably updates architectural state (aka “graduation” or “completion”).

In-Order Commit for Precise Exceptions
[Figure: Fetch and Decode (in-order) → Reorder Buffer → Execute (out-of-order) → Commit (in-order); when commit detects an exception, kill signals go to the earlier stages and the handler PC is injected into fetch]
• Instructions are fetched and decoded into the instruction reorder buffer in-order
• Execution is out-of-order (out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile & memory) is in-order
Temporary storage is needed to hold results before commit (shadow registers and store buffers)

Extensions for Precise Exceptions
[Figure: reorder buffer, one entry per instruction, with fields Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr2 = next to commit, ptr1 = next available]
• add <pd, dest, data, cause> fields in the instruction template
• commit instructions to reg file and memory in program order, so buffers can be maintained circularly
• on exception, clear reorder buffer by resetting ptr1=ptr2 (stores must wait for commit before updating memory)
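
The circular reorder buffer described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the class name, the method names (dispatch, complete, commit), the dictionary-based entries, and the "done" flag are all assumptions made for the example, and only register results are modeled (no stores, no operand fields).

```python
class ReorderBuffer:
    """Toy circular reorder buffer (hypothetical names, registers only)."""

    def __init__(self, size=8):
        self.size = size
        self.entries = [None] * size
        self.ptr1 = 0      # next available slot
        self.ptr2 = 0      # next instruction to commit
        self.count = 0     # number of in-flight instructions
        self.regfile = {}  # committed architectural state only

    def dispatch(self, dest):
        """Allocate an entry in program order; the slot index is the tag."""
        if self.count == self.size:
            raise RuntimeError("reorder buffer full")
        tag = self.ptr1
        self.entries[tag] = {"dest": dest, "data": None,
                             "cause": None, "done": False}
        self.ptr1 = (self.ptr1 + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag, data=None, cause=None):
        """Record a result or an exception cause; may happen out of order."""
        entry = self.entries[tag]
        entry["data"], entry["cause"], entry["done"] = data, cause, True

    def commit(self):
        """Commit finished instructions in order.

        Returns the exception cause if the oldest instruction faulted,
        otherwise None.
        """
        while self.count and self.entries[self.ptr2]["done"]:
            entry = self.entries[self.ptr2]
            if entry["cause"] is not None:
                # Exception: discard everything still in flight by
                # resetting ptr1 = ptr2, as on the slide.
                self.entries = [None] * self.size
                self.ptr1 = self.ptr2
                self.count = 0
                return entry["cause"]
            self.regfile[entry["dest"]] = entry["data"]
            self.entries[self.ptr2] = None
            self.ptr2 = (self.ptr2 + 1) % self.size
            self.count -= 1
        return None
```

In this sketch a younger instruction that finishes early stays in the buffer until every older instruction has committed, and a faulting instruction prevents any younger result from reaching the register file, which is the precise-exception property.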

Rollback and Renaming
[Figure: Register File (now holds only committed state); reorder buffer with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data and entries t1, t2, …, tn; Load Unit, FUs, and Store Unit return <t, result>; Commit updates the register file]
Register file does not contain renaming tags any more.
How does the decode stage find the tag of a source register?
Search the “dest” field in the reorder buffer

Renaming Table
[Figure: Rename Table holding a tag t and a valid bit v per register (r1, r2, …), alongside the Register File, the reorder buffer (entries t1, t2, …, tn), Load Unit, FUs, Store Unit, and the Commit path carrying <t, result>]
The renaming table is a cache that speeds up register name lookup. It needs to be cleared after each exception taken.
When else are valid bits cleared? Control transfers
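
A rough sketch of how such a rename table might behave is given below. The names (RenameTable, lookup, rename_dest, on_commit, clear) are hypothetical, and the rule that a commit clears the valid bit only when the committing tag is still the latest mapping is an assumption of this example rather than something stated on the slide. Presence of a key in the dictionary stands in for the valid bit.

```python
class RenameTable:
    """Toy rename table: architectural register -> reorder buffer tag."""

    def __init__(self):
        # A register is "valid" in the table if it has a key here.
        self.table = {}

    def lookup(self, reg):
        """Decode-stage source lookup.

        Returns ("tag", t) if an in-flight instruction will produce the
        value, otherwise ("reg", reg), meaning read the register file.
        """
        if reg in self.table:
            return ("tag", self.table[reg])
        return ("reg", reg)

    def rename_dest(self, reg, tag):
        """Record that the instruction with this tag is the newest writer."""
        self.table[reg] = tag

    def on_commit(self, reg, tag):
        """Assumed rule: clear the valid bit only if the committing
        instruction is still the newest writer of the register."""
        if self.table.get(reg) == tag:
            del self.table[reg]

    def clear(self):
        """Clear all valid bits, e.g. after an exception is taken."""
        self.table.clear()
```

The table replaces the associative search of the “dest” field with a direct lookup by register name, at the cost of having to invalidate it whenever the in-flight instructions it points to are discarded.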

Control Flow Penalty
[Figure: pipeline PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State, covering the Fetch, Decode, Execute, and Commit phases; the next fetch is started at the PC stage, while the branch is executed at the functional units]
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!

MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?

Instruction   Taken known?          Target known?
J             After Inst. Decode    After Inst. Decode
JR            After Inst. Decode    After Reg. Fetch
BEQZ/BNEZ     After Reg. Fetch*     After Inst. Decode

*Assuming zero detect on register read

Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750 MHz, 2000)

A   PC Generation/Mux
P   Instruction Fetch Stage 1
F   Instruction Fetch Stage 2
B   Branch Address Calc/Begin Decode
I   Complete Decode
J   Steer Instructions to Functional units
R   Register File Read
E   Integer Execute
    Remainder of execute pipeline (+ another 6 stages)

[Figure annotations: Branch Target Address Known; Branch Direction & Jump Register Target Known]

Reducing Control Flow Penalty
Software solutions
• Eliminate branches - loop unrolling
  Increases the run length
• Reduce resolution time - instruction scheduling
  Compute the branch condition as early as possible (of limited value)
Hardware solutions
• Find something else to do - delay slots
  Replaces pipeline bubbles with useful work (requires software cooperation)
• Speculate - branch prediction
  Speculative execution of instructions beyond the branch

CSE 490/590 Administrivia
• Project 1 & midterm grading mostly done
  – Will distribute on Wed
  – Regrading -> Jangyoung
• Project 2
  – Start early!

Branch Prediction
Motivation:
Branch penalties limit performance of deeply pipelined processors
Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep result computation separate from commit
• Kill instructions following branch in pipeline
• Restore state to state following branch

Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but:
• backward branches: ~90% taken
• forward branches: ~50% taken
[Figure: a backward JZ and a forward JZ]
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
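
One simple static rule suggested by the backward/forward statistics is "backward taken, forward not taken". The slide does not name this rule; it is shown here only as an illustrative sketch, and the function name and its arguments are hypothetical.

```python
def predict_taken_static(branch_pc, target_pc):
    """Static heuristic (assumed for illustration): predict taken for
    backward branches, which are usually loop-closing, and not taken
    for forward branches."""
    return target_pc <= branch_pc
```

For example, predict_taken_static(0x100, 0x80) predicts taken because the target lies before the branch, while predict_taken_static(0x100, 0x120) predicts not taken.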

Dynamic Branch Prediction
Learning based on past behavior
• Temporal correlation
  The way a branch resolves may be a good predictor of the way it will resolve at the next execution
• Spatial correlation
  Several branches may resolve in a highly correlated manner (a preferred path of execution)

Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[Figure: four-state diagram with states “take right”, “take wrong”, “¬take wrong”, “¬take right” and transitions labeled taken / ¬taken]
BP state: (predict take/¬take) x (last prediction right/wrong)
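
The two-bit scheme can be sketched as a small state machine whose state is the pair (predicted direction, was the last prediction right). The class and function names below are hypothetical. One detail is an assumption of this sketch, since the state diagram did not survive in the text: after the prediction flips, the predictor starts in the "right" state of the new direction, so it again takes two consecutive mistakes to flip back.

```python
class TwoBitPredictor:
    """State = (predict taken / not taken) x (last prediction right / wrong)."""

    def __init__(self, predict_taken=True):
        self.predict_taken = predict_taken
        self.last_right = True

    def predict(self):
        return self.predict_taken

    def update(self, taken):
        """Update the state with the actual branch outcome."""
        if taken == self.predict_taken:
            self.last_right = True       # correct prediction
        elif self.last_right:
            self.last_right = False      # first mistake: keep the prediction
        else:
            self.predict_taken = taken   # second consecutive mistake: flip
            self.last_right = True       # assumed post-flip state


def count_mispredictions(predictor, outcomes):
    """Run a sequence of actual outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses
```

For a loop-closing branch that is taken nine times and then falls through, repeated three times, this predictor (starting in "take right") mispredicts only the three loop exits: the single not-taken outcome is not enough to change the prediction for the next run of the loop.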

Branch History Table
[Figure: k bits of the fetch PC (above the low-order 00) index a 2^k-entry BHT with 2 bits/entry, in parallel with the I-cache; the fetched instruction's opcode answers Branch?, PC + offset gives the Target PC, and the BHT entry answers Taken/¬Taken?]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
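
The indexing scheme can be sketched as follows, assuming 4-byte aligned instructions so that the two low-order PC bits are always 00 and are skipped. The class name, the initial state of every entry, and the per-entry update rule (flip after two consecutive mistakes, then start in the "right" state) are assumptions made for this example.

```python
class BranchHistoryTable:
    """Toy 2^k-entry BHT with a 2-bit state per entry, no tags."""

    def __init__(self, k=12):
        self.mask = (1 << k) - 1
        # Each entry: (predict_taken, last_prediction_right).
        # Starting every entry at "not taken, right" is an assumption.
        self.entries = [(False, True)] * (1 << k)

    def index(self, pc):
        """Drop the two low-order 00 bits, keep the next k bits."""
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.entries[self.index(pc)][0]

    def update(self, pc, taken):
        """Update the entry with the resolved branch outcome."""
        i = self.index(pc)
        predict_taken, last_right = self.entries[i]
        if taken == predict_taken:
            self.entries[i] = (predict_taken, True)
        elif last_right:
            self.entries[i] = (predict_taken, False)
        else:
            self.entries[i] = (taken, True)
```

Because the table is untagged, two branches whose PCs differ only above bit k+1 share one entry. With k=12, the branches at 0x1000 and 0x5000 both map to index 0x400, so training one of them changes the prediction for the other.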

Acknowledgements
• These slides heavily contain material developed and copyright by
  – Krste Asanovic (MIT/UCB)
  – David Patterson (UCB)
• And also by:
  – Arvind (MIT)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252