CS 704 Advanced Computer Architecture
Lecture 16: Instruction Level Parallelism (Dynamic Branch Prediction, Cont'd)
Prof. Dr. M. Ashraf Chughtai

Today's Topics
- Recap
- Correlating Branch Predictors
- Tournament Predictor
- High Performance Instruction Delivery: Branch Target Buffer
- Summary
MAC/VU-Advanced Computer Architecture Lecture 16 – Instruction Level Parallelism -Dynamic (5)

Recap: Dynamic Scheduling and Branch Prediction
- Static: rely on the software (compiler)
- Dynamic: hardware-intensive approaches

Important questions: Branch-Prediction Buffer
Q1: What is the impact of increasing the size of the branch-prediction buffer on two branches in a program?
A single predictor predicting a single branch is generally more accurate than the same predictor serving more than one branch instruction.

Q1: Branch-Prediction Buffer
It is less likely that two branches in a program share a single predictor.
Therefore, increasing the size of the prediction buffer does not have a significant effect on two branches in a program.

Question 2: Branch-Prediction Buffer
How does sharing a predictor affect the misprediction rate?
This is explained with the help of the following example: consider two sequences of branch taken and not-taken actions sharing a 1-bit predictor, and identify the sequence that
a) reduces the misprediction rate
b) increases the misprediction rate

Example: Sequence 1
B1 is always TAKEN; B2 is always NOT-TAKEN. P is the state of the shared 1-bit predictor before each branch, starting at NT.

Step   P (prediction)   Branch   Actual   Correct?
1      NT               B1       T        No
2      T                B2       NT       No
3      NT               B1       T        No
4      T                B2       NT       No

The same pattern repeats for the remaining executions of B1 and B2: every prediction is incorrect.

Example: Sequence 2
[Table: predictor state P and the outcomes of B1 and B2, with rows "Prediction" and "Correct?". The "Correct?" row contains a mix of correct (yes) and incorrect (no) predictions.]

Example: Conclusion
Why does sharing a predictor increase the misprediction rate?
It is clear from the above example that if a predictor is shared by a set of branch instructions, the members of that set may change over the course of execution of a long program.
Hence the branch-action history changes, and the predictor is likely to mispredict more often.
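
The effect of sharing can be checked with a short simulation. The following is a minimal sketch, not part of the lecture: it assumes the alternating trace of Sequence 1 (B1 always taken, B2 always not-taken), 1-bit predictors that start as not-taken, and a hypothetical helper name `count_mispredictions`.

```python
def count_mispredictions(trace, shared):
    """Count mispredictions for a list of (branch_name, taken) pairs.

    shared=True:  every branch uses one common 1-bit predictor.
    shared=False: each branch has its own 1-bit predictor.
    All predictors start as not-taken (False); this is an assumption.
    """
    state = {}  # predictor key -> last outcome seen (True = taken)
    misses = 0
    for name, taken in trace:
        key = "shared" if shared else name
        prediction = state.get(key, False)
        if prediction != taken:
            misses += 1
        state[key] = taken  # a 1-bit predictor remembers only the last outcome
    return misses


# B1 is always taken, B2 is always not-taken, executed alternately.
trace = [("B1", True), ("B2", False)] * 4

shared_misses = count_mispredictions(trace, shared=True)
private_misses = count_mispredictions(trace, shared=False)
print(shared_misses, private_misses)
```

With a shared predictor each branch overwrites the bit the other branch needs, so all eight predictions miss; with private predictors only the first execution of B1 misses.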

Correlating Branch Predictors Re-visited
We have observed that in the program segment
    if (d==0) d=1;    (branch b1, taken for d!=0)
    if (d==1) d=2;    (branch b2, taken for d!=1)
the behavior of b2 is correlated with that of b1: if b1 is not taken, d is set to 1, so b2 is not taken either. A predictor that uses only the history of b2 cannot capture this.
MAC/VU-Advanced Computer Architecture Lecture 15 – Instruction Level Parallelism -Dynamic (4)

Correlating Branch Predictors Re-visited
This problem may be resolved in a correlating branch predictor by recording whether each of the m most recently executed branches was taken or not taken (in 2^m branch-history tables of 1-, 2-, … or n-bit predictors), and using this branch pattern to select the proper branch-history table for the current branch.

Correlating Branch Predictors
In general, an (m, n) predictor records the outcomes of the last m branches to select among 2^m history tables, each with n-bit counters (that is, 2^m n-bit predictors).
A 2-bit BHT is regarded as a (0, 2) correlating predictor.

Correlating Branch Predictor: Example
A 1-bit predictor with 1-bit correlation is referred to as a (1, 1) predictor.
Here, we have two (2^1) separate prediction bits (i.e., two 1-bit BHTs):
- one prediction bit is used if the last branch executed was not-taken
- the other prediction bit is used if the last branch executed was taken
These two bits are denoted as: new prediction when last NT / new prediction when last T.
E.g., T/NT stands for: the new prediction is TAKEN if the previous branch was NOT-TAKEN, and NOT-TAKEN if the previous branch was TAKEN.
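
To see what the second prediction bit buys, the (1, 1) predictor can be compared with a plain 1-bit predictor on the `d` code segment. This is a rough sketch under stated assumptions: the input sequence d = 2, 0, 2, 0 is chosen for illustration, all prediction bits and the initial "last branch" outcome start as not-taken, and the function name `run` is hypothetical.

```python
def run(d_values, correlated):
    """Count mispredictions of branches b1 and b2 over a sequence of d values.

    correlated=False: one 1-bit predictor per branch.
    correlated=True:  a (1, 1) predictor, i.e. two 1-bit predictors per branch,
                      selected by the outcome of the last executed branch.
    Initial bits and initial history are not-taken (assumption).
    """
    # Per branch: index 0 is used when the last branch was NT, index 1 when it was T.
    bits = {"b1": [False, False], "b2": [False, False]}
    last_taken = False
    misses = 0
    for d in d_values:
        b1_taken = d != 0          # b1 is taken when d != 0 (skips d = 1)
        if not b1_taken:
            d = 1
        b2_taken = d != 1          # b2 is taken when d != 1 (skips d = 2)
        if not b2_taken:
            d = 2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken) if correlated else 0
            if bits[name][slot] != taken:
                misses += 1
            bits[name][slot] = taken   # 1-bit update: remember the last outcome
            last_taken = taken         # 1-bit global history
    return misses


plain_misses = run([2, 0, 2, 0], correlated=False)
correlated_misses = run([2, 0, 2, 0], correlated=True)
print(plain_misses, correlated_misses)
```

Under these assumptions the plain 1-bit predictors miss on every branch, while the (1, 1) predictor misses only during the first iteration, before its bits have been trained.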

Correlating Branch Predictor: Example
In an (m, n) predictor, the global history of the most recent m branches is recorded in an m-bit shift register, where each bit records whether a branch was taken or not taken.
The branch-prediction buffer is indexed using the concatenation of the low-order bits of the branch address with the m-bit global history.
A typical (2, 2) correlating predictor is shown next.

(2, 2) Correlating Branches Predictor
[Figure: the 4-bit branch address forms the lower part of a 6-bit address, and the 2-bit global branch history forms the upper 2 bits. The address selects a prediction from 2^2 = 4 two-bits-per-branch predictors with 16 entries each (entries 0 to 63).]

(2, 2) Correlating Branches Predictor
Here the buffer is drawn as a 2-dimensional object in which each entry is 2 bits wide; in reality the entries are arranged linearly.
The (2, 2) branch-prediction buffer uses the 2-bit global history to choose among 4 predictors for each 4-bit branch address (that is, among the 16 entries in each of the 4 predictors).

(2, 2) Correlating Branches Predictor
The behavior of recent branches selects among, say, four predictions of the next branch, and only that prediction is updated.
Indexing is done by concatenating the 4 low-order address bits of the branch (word address) and the 2 global history bits to form a 6-bit address, which selects a 2-bit prediction from 64 entries in 4 buffers of 16 entries each.
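
The indexing step can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the counter encoding (values 0 to 3, with 2 and 3 predicting taken) is an assumption rather than something fixed by the slides.

```python
HISTORY_BITS = 2   # m: length of the global history register
ADDRESS_BITS = 4   # low-order branch-address bits used for indexing


def predictor_index(branch_word_address, global_history):
    """Concatenate history (upper 2 bits) and address (lower 4 bits) into a 6-bit index."""
    low = branch_word_address & ((1 << ADDRESS_BITS) - 1)
    hist = global_history & ((1 << HISTORY_BITS) - 1)
    return (hist << ADDRESS_BITS) | low


def update_history(global_history, taken):
    """Shift the newest branch outcome into the 2-bit global history register."""
    return ((global_history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)


# 64 two-bit saturating counters stored linearly (4 buffers of 16 entries).
counters = [0] * 64


def predict_and_update(branch_word_address, global_history, taken):
    """Return (prediction, new_history) and train only the selected counter."""
    i = predictor_index(branch_word_address, global_history)
    prediction = counters[i] >= 2          # assumed encoding: 2 or 3 means taken
    if taken:
        counters[i] = min(3, counters[i] + 1)
    else:
        counters[i] = max(0, counters[i] - 1)
    return prediction, update_history(global_history, taken)
```

For example, a branch whose word address ends in 0101 with global history 10 maps to index 100101 (entry 37), which lies in the third block of 16 entries.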

Comparison of (0, 2) and (2, 2) predictors
[Figure: bar chart of the frequency of mispredictions (axis from 0% to 18%) for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2, 2) BHT.]
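
One reason to compare a 4096-entry 2-bit BHT with a 1024-entry (2, 2) predictor is that they can use the same amount of storage. The sketch below computes the state bits of an (m, n) predictor as 2^m × n × entries; it assumes that "1024 entries" counts the entries selected by the branch address in each of the four tables, and the function name is hypothetical.

```python
def predictor_bits(m, n, entries_per_table):
    """Total state bits of an (m, n) predictor: 2^m tables x entries x n bits."""
    return (2 ** m) * n * entries_per_table


bht_bits = predictor_bits(0, 2, 4096)          # 4096-entry 2-bit BHT, i.e. (0, 2)
correlating_bits = predictor_bits(2, 2, 1024)  # 1024-entry (2, 2) predictor
print(bht_bits, correlating_bits)
```

Under that assumption both configurations hold 8192 bits, so any difference in misprediction rate comes from how the bits are used, not from how many there are.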

Multilevel Branch Predictors (Tournament Predictors)
Multilevel branch prediction (nested branches) uses information at both the local and the global level to predict correctly.

Multilevel Branch Predictors (Tournament Predictors)
- Several levels of branch-prediction tables, and
- an algorithm to choose among the different predictors

State Transition Diagram of Tournament Predictor
[Figure: four-state diagram, to be modified as Fig. 3.16, p. 204. States 00 and 01: use predictor 1. States 10 and 11: use predictor 2. Transitions between states are labeled T and NT.]

Multilevel Branch Predictors (Tournament Predictors)
Each transition is specified by the outcome of the predictors: correct = 1, incorrect = 0.

Multilevel Branch Predictors (Tournament Predictors)
The state transition diagram shows that, from a saturating state:
- the counter is incremented whenever the selected predictor is correct and the other is incorrect (i.e., for 1/0), and
- the counter is decremented in the reverse case (i.e., for 0/1)
For a non-saturating present state, the counter does not change for all other predictions (0/0 and 1/1).

Multilevel Branch Predictors (Tournament Predictors)
For the saturating state 00 (use predictor 1):
- it increments to state 01 (use predictor 1) for 1/0
- and decrements to state 11 (use predictor 2) for 0/1
For the saturating state 11 (use predictor 2):
- it increments to state 00 (use predictor 1) for 1/0
- and decrements to state 10 (use predictor 2) for 0/1
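
As a rough illustration of the selection algorithm, the sketch below uses the common 2-bit saturating-counter formulation of a tournament selector. The numeric state encoding (0 and 1 select predictor 1, 2 and 3 select predictor 2) and the direction of counting are assumptions made for this sketch and may not match the state labels used on the slides; the function names are hypothetical.

```python
def next_selector(state, p1_correct, p2_correct):
    """Update a 2-bit saturating selector after a branch resolves.

    state 0, 1 -> use predictor 1; state 2, 3 -> use predictor 2 (assumed encoding).
    The pair (p1_correct, p2_correct) corresponds to labels such as 1/0 or 0/1.
    """
    if p1_correct and not p2_correct:   # 1/0: evidence in favor of predictor 1
        return max(0, state - 1)
    if p2_correct and not p1_correct:   # 0/1: evidence in favor of predictor 2
        return min(3, state + 1)
    return state                        # 0/0 or 1/1: no change


def selected_predictor(state):
    """Map the selector state to the predictor whose prediction is used."""
    return 1 if state <= 1 else 2


# Starting from the strongest "use predictor 1" state, two 0/1 outcomes in a row
# are needed before the selector switches to predictor 2.
state = 0
state = next_selector(state, p1_correct=False, p2_correct=True)
state = next_selector(state, p1_correct=False, p2_correct=True)
print(state, selected_predictor(state))
```

The two-step hysteresis is the point of using a 2-bit counter: a single disagreement between the predictors does not flip the choice.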

High Performance Instruction Delivery
In the MIPS 5-stage pipeline, we need to know the address of the next instruction to fetch by the end of the current IF cycle.
That is, for ZERO branch penalty, we need to know whether the as-yet undecoded instruction is a branch and, if so, what the next PC is.

Branch Target Buffer
This is accomplished by introducing a cache that contains the address of the next instruction, whether the branch is taken or not taken.
This cache is known as the Branch-Target Cache or Branch-Target Buffer (BTB).

Branch-Prediction Buffer vs. Branch-Target Buffer
Recall from our discussion last time that the branch-prediction buffer is accessed during the ID stage, after instruction decode; i.e., we know the branch-target address at the end of the ID stage and can then fetch the next predicted instruction.

Branch Target Buffer
[Figure: the PC of the instruction to fetch is compared (=) against the Lookup field of the branch-target buffer. Each entry also holds a Predicted PC and a Prediction State (branch predicted taken or not taken); the height of the table is the number of entries in the branch-target buffer.
No: the instruction is not predicted to be a branch; proceed normally.
Yes: the instruction is a branch, and the predicted PC should be used as the next PC.]

Branch-Target Buffer
The branch-target buffer has three fields:
- Lookup: addresses of the known branch instructions (predicted as taken)
- Predicted PC: the PC of the instruction to fetch after a predicted-taken branch
- Prediction State: optional extra prediction-state bits

Branch-Target Buffer Complications?
- A complication arises in using a 2-bit predictor, because it uses information for both taken and not-taken branches
- This complication is resolved in PowerPC processors by using both a target buffer and a prediction buffer

Branch-Target Buffer
Steps involved in using the branch-target buffer at the IF, ID and EXE pipeline stages:
- IF
- ID
- EXE
(insert flow chart of Fig. 3.20)

Branch-Target Buffer: Flow Chart Explanation
- IF Stage
The PC of an instruction is compared with the contents of the buffer. If it is found, the instruction must be a branch instruction predicted taken; else it may be a branch predicted not-taken or a normal instruction.

Branch-Target Buffer
- ID Stage
i) Decode the instruction. If an entry was found in the target buffer as a predicted branch in the IF stage, begin fetching immediately from the predicted PC.
ii) Check whether the decoded instruction is a taken branch.

Branch-Target Buffer
- EX Stage
The EX stage performs one of four possible functions.
(i) Where the entry was not found in the target buffer in the IF stage, and in the ID stage the instruction is found to be a taken branch:
(i-a) then enter the branch-instruction address and the next PC into the branch-target buffer
(i-b) else proceed as in normal instruction execution

Branch-Target Buffer
- EX Stage
(ii) Where the entry was found in the target buffer in the IF stage, and in the ID stage the instruction is found to be a taken branch:
(ii-a) then it was correctly predicted, so execute normally without a stall
(ii-b) else it was mispredicted, so kill the fetched instruction, restart fetching at the other address, and delete the entry from the target buffer
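
The four cases can be sketched as a toy model. The following is not the lecture's implementation: it assumes 4-byte instructions, a buffer that holds only predicted-taken branches, a 2-cycle penalty whenever the fetch went down the wrong path, and hypothetical names (`BranchTargetBuffer`, `fetch`, `resolve`).

```python
class BranchTargetBuffer:
    """Toy BTB holding only branches predicted taken: branch PC -> predicted target PC."""

    def __init__(self):
        self.entries = {}

    def fetch(self, pc):
        """IF stage: return (hit, next_pc). On a miss, fall through to pc + 4."""
        if pc in self.entries:
            return True, self.entries[pc]
        return False, pc + 4

    def resolve(self, pc, hit, is_branch, taken, target):
        """Once the branch outcome is known: update the buffer, return penalty cycles."""
        if not hit:
            if is_branch and taken:
                self.entries[pc] = target   # (i-a) enter branch address and next PC
                return 2                    # wrong (fall-through) instruction was fetched
            return 0                        # (i-b) normal instruction or untaken branch
        if taken and self.entries[pc] == target:
            return 0                        # (ii-a) correctly predicted, no stall
        del self.entries[pc]                # (ii-b) mispredicted: delete the entry
        return 2


# The same taken branch at PC 0x40 with target 0x80, executed three times.
btb = BranchTargetBuffer()
penalties = []
for _ in range(3):
    hit, next_pc = btb.fetch(0x40)
    penalties.append(btb.resolve(0x40, hit, is_branch=True, taken=True, target=0x80))
print(penalties)
```

The first execution misses the buffer and pays the penalty; once the entry is installed, later executions of the taken branch proceed without a stall.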

Branch-Target Buffer: Conclusion
If the branch entry is found in the buffer and is correctly predicted, there is no branch penalty.
Else, it suffers a delay of at least 2 clock cycles as the misprediction penalty:
- one clock cycle for fetching the wrong instruction, and
- one clock cycle to restart the fetch

Branch-Target Buffer - Examples

Inst. in Buffer   Prediction   Actual Branch   Penalty Cycles
Yes               Taken        Taken           0
Yes               Taken        Not-Taken       2
No                -            Taken           2
No                -            Not Taken       0

Branch-Target Buffer - Solution
We can compute the penalty by looking at the probability of two events:
i) The branch is predicted taken but ends up not taken = % buffer hit rate x % incorrect prediction = 0.95 x 0.1 = 0.095
ii) The branch is taken but is not found in the buffer = % incorrect prediction = 0.1
The penalty in both cases is 2 cycles, therefore
Branch Penalty = (0.095 + 0.1) x 2 = 0.195 x 2 = 0.39

Example: Branch-Target Buffer
Problem: Consider a branch-target buffer implemented for conditional branches only, for a pipelined processor. Assume that:
- Misprediction penalty = 4 cycles
- Buffer miss penalty = 3 cycles
- Hit rate and accuracy each = 90%
- Branch frequency = 15%

Example: Branch-Target Buffer Solution
The speedup with a branch-target buffer versus no BTB is expressed as:
$\text{Speedup} = \dfrac{CPI_{\text{no BTB}}}{CPI_{\text{BTB}}} = \dfrac{CPI_{\text{base}} + \text{Stalls}_{\text{no BTB}}}{CPI_{\text{base}} + \text{Stalls}_{\text{BTB}}}$
The stalls are determined as:
$\text{Stalls} = \sum_{s \in \text{Stall}} \text{Frequency}_s \times \text{Penalty}_s$
that is, the sum over all stall cases of the product of the frequency of the stall case and its stall penalty.

Example: Branch-Target Buffer
i) $\text{Stalls}_{\text{no BTB}}$ = 0.15 x 2 = 0.30
ii) To find $\text{Stalls}_{\text{BTB}}$ we have to consider each outcome from the BTB. There are three possibilities:
a) The branch misses the BTB: frequency = 15% x 0.1 = 1.5% = 0.015; Penalty = 3; Stalls = 0.045
b) The branch hits and is correctly predicted: frequency = 15% x 0.9 (hit) x 0.9 (prediction) = 12.15% = 0.1215; Penalty = 0; Stalls = 0

Example: Branch-Target Buffer
c) The branch hits but is incorrectly predicted: frequency = 15% x 0.9 (hit) x 0.1 (misprediction) = 1.35% = 0.0135; Penalty = 4; Stalls = 0.054
$\text{Stalls}_{\text{BTB}}$ = 0.045 + 0.054 = 0.099
$\text{Speedup} = \dfrac{CPI_{\text{base}} + \text{Stalls}_{\text{no BTB}}}{CPI_{\text{base}} + \text{Stalls}_{\text{BTB}}} = \dfrac{1.0 + 0.3}{1.0 + 0.099} \approx 1.2$
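
The whole calculation can be reproduced with a short function. This is a sketch with hypothetical names; the 2-cycle penalty without a BTB and the base CPI of 1.0 are taken from the worked numbers above, and the frequencies are kept unrounded (for example 0.15 x 0.9 x 0.1 = 0.0135), so intermediate values can differ in the third decimal place from hand-rounded figures.

```python
def btb_speedup(branch_freq, hit_rate, accuracy,
                miss_penalty, mispredict_penalty,
                no_btb_penalty, cpi_base=1.0):
    """Return (stalls_no_btb, stalls_btb, speedup) for the three BTB outcomes."""
    stalls_no_btb = branch_freq * no_btb_penalty
    miss = branch_freq * (1 - hit_rate)                  # a) branch misses the BTB
    hit_wrong = branch_freq * hit_rate * (1 - accuracy)  # c) hit but mispredicted
    # b) hit and correctly predicted contributes no stall cycles
    stalls_btb = miss * miss_penalty + hit_wrong * mispredict_penalty
    speedup = (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)
    return stalls_no_btb, stalls_btb, speedup


stalls_no_btb, stalls_btb, speedup = btb_speedup(
    branch_freq=0.15, hit_rate=0.90, accuracy=0.90,
    miss_penalty=3, mispredict_penalty=4, no_btb_penalty=2)
print(round(stalls_no_btb, 3), round(stalls_btb, 3), round(speedup, 2))
```

Writing the stall cases as separate terms makes it easy to vary one parameter, such as the hit rate, and see how much of the speedup depends on it.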

Improvement in BTB
To achieve higher instruction delivery, one possible variation of the branch-target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.

Improvement in BTB
Advantages:
- It possibly allows a larger BTB, as it permits the BTB access to take longer than the time between successive instruction fetches
- Buffering the actual target instructions allows branch folding, i.e., ZERO-cycle unconditional branching, or sometimes ZERO-cycle conditional branching

Conclusion: