CS 704 Advanced Computer Architecture Lecture 16 Instruction
- Slides: 52
CS 704 Advanced Computer Architecture Lecture 16 Instruction Level Parallelism (Dynamic Branch Prediction …. Cont’d) Prof. Dr. M. Ashraf Chughtai
Today's Topics
- Recap
- Correlating Branch Predictors
- Tournament Predictor
- High Performance Instruction Delivery – Branch Target Buffer
- Summary
MAC/VU-Advanced Computer Architecture Lecture 16 – Instruction Level Parallelism -Dynamic (5)
Recap: Dynamic Scheduling and Branch Prediction
- Static: relies on the software (compiler)
- Dynamic: hardware-intensive approaches
Important questions: Branch-Prediction Buffer
Q 1: What is the impact of increasing the size of the branch-prediction buffer on two branches in a program?
- A single predictor predicting a single branch is generally more accurate than the same predictor serving more than one branch instruction.
- It is unlikely that two branches in a program share a single predictor.
- Therefore, increasing the size of the predictor buffer does not have a significant effect on two branches in a program.
Q 2: Branch-Prediction Buffer
How does sharing a predictor affect the misprediction rate? This is explained with the help of the following example: consider two sequences of taken and not-taken branches sharing a 1-bit predictor, and identify the sequence that
a) reduces the misprediction rate
b) increases the misprediction rate
Example: Sequence 1
[Table: the shared predictor state P before and after each execution of B1 and B2, the prediction made, and whether it was correct. The columns B1 and B2 show the branches B1 and B2, which execute alternately. B1 is always TAKEN; B2 is always NOT-TAKEN. Every prediction recorded in the table is marked "No" (incorrect).]
Example: Sequence 2
[Table: the shared predictor state P before and after each execution of B1 and B2, the prediction made, and whether it was correct. In this sequence the "Correct?" row contains both "yes" and "no" entries.]
Example: Conclusion
Why does sharing a predictor increase the misprediction rate? The example above shows that if a predictor is shared by a set of branch instructions, the members of that set may change over the course of execution of a long program. Hence the recorded branch history changes, and the predictor is likely to mispredict more often.
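The effect of sharing can be checked with a small simulation. The following is a minimal sketch, not part of the original slides: the function name `mispredictions_1bit`, the initial not-taken state, and the choice of four alternations are assumptions made for illustration. It replays Sequence 1 (B1 always taken, B2 always not taken, executing alternately) once with a single shared 1-bit predictor and once with one predictor per branch.

```python
def mispredictions_1bit(outcomes, shared):
    """Count mispredictions of 1-bit predictors over (branch, taken) pairs.

    shared=True  -> every branch uses one common prediction bit.
    shared=False -> each branch gets its own prediction bit.
    All prediction bits start as not-taken (an assumption of this sketch).
    """
    bits = {}
    misses = 0
    for branch, taken in outcomes:
        key = "shared" if shared else branch
        prediction = bits.get(key, False)
        if prediction != taken:
            misses += 1
        bits[key] = taken  # a 1-bit predictor simply remembers the last outcome
    return misses


# B1 is always taken, B2 is always not taken; they execute alternately.
sequence_1 = [("B1", True), ("B2", False)] * 4

print(mispredictions_1bit(sequence_1, shared=True))   # 8: every prediction is wrong
print(mispredictions_1bit(sequence_1, shared=False))  # 1: only B1's first execution
```

With the shared bit, each branch overwrites the history the other branch needs, so all eight predictions fail; with private bits only the cold-start prediction for B1 is wrong.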
Correlating Branch Predictors Revisited
We have observed that in the program segment
IF (d==0) d=1;    Branch b1 (taken when d!=0)
IF (d==1) d=2;    Branch b2 (taken when d!=1)
the behavior of b2 is correlated with the behavior of b1: if b1 is not taken, d is set to 1, so b2 is not taken either. A predictor that uses only the history of a single branch cannot capture this.
Correlating Branch Predictors Revisited
This problem may be resolved by a correlating branch predictor, which records whether each of the m most recently executed branches was taken or not taken and uses this branch pattern to select, for the current branch, the proper one of 2^m branch-history tables (each holding 1-bit, 2-bit, ... or n-bit predictors).
Correlating Branch Predictors
In general, an (m, n) predictor records the outcomes of the last m branches to select among 2^m history tables, each with n-bit counters (that is, 2^m n-bit predictors for each branch). A 2-bit BHT is regarded as a (0, 2) correlating predictor.
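The storage cost of an (m, n) predictor follows directly from this definition: 2^m tables, n bits per counter, and one counter per entry selected by the branch address. A minimal sketch is given below; the function name `predictor_bits` and the parameter name `entries` are illustrative choices, not notation from the slides.

```python
def predictor_bits(m, n, entries):
    """Total state bits of an (m, n) predictor.

    m       -- number of global history bits (selects one of 2**m tables)
    n       -- width in bits of each counter
    entries -- number of entries per table, selected by the branch address
    """
    return (2 ** m) * n * entries


print(predictor_bits(0, 2, 4096))  # 8192 bits: a 2-bit BHT with 4096 entries
print(predictor_bits(2, 2, 1024))  # 8192 bits: a (2, 2) predictor with 1024 entries
print(predictor_bits(2, 2, 16))    # 128 bits: a (2, 2) predictor with 16 entries per table
```

The first two configurations use the same number of bits, which is why a 4096-entry (0, 2) predictor and a 1024-entry (2, 2) predictor make a fair comparison.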
Correlating Branch Predictor: Example
A 1-bit predictor with 1 bit of correlation is written as a (1, 1) predictor. Here we have two (2^1) separate prediction bits (i.e., two 1-bit BHTs):
- one prediction bit is used if the last branch executed was not taken
- the other prediction bit is used if the last branch executed was taken
These two bits are denoted as: (new prediction when last branch NT) / (new prediction when last branch T). E.g., T/NT stands for: the new prediction is TAKEN if the previous branch was NOT-TAKEN, and NOT-TAKEN if the previous branch was TAKEN.
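To see what the extra correlation bit buys, the sketch below runs the code fragment IF (d==0) d=1; IF (d==1) d=2 with a plain 1-bit predictor and with a (1, 1) predictor. It is an illustration under stated assumptions: the input pattern in which d alternates between 2 and 0, the initial not-taken state of every prediction bit and of the global history, and the names `run`, `d_values` and `correlating` are all choices made here, not taken from the slides.

```python
def run(d_values, correlating):
    """Return the number of mispredictions for branches b1 and b2.

    b1 is taken when d != 0; b2 is taken when d != 1 (after b1's body ran).
    correlating=False -> one prediction bit per branch, i.e. a (0, 1) predictor.
    correlating=True  -> two bits per branch, chosen by the outcome of the
                         last executed branch, i.e. a (1, 1) predictor.
    """
    last_taken = False  # 1-bit global history, assumed to start as not-taken
    table = {}          # (branch, history) -> predicted taken?; default not-taken
    misses = 0
    for d in d_values:
        b1_taken = d != 0
        if d == 0:
            d = 1
        b2_taken = d != 1
        for branch, taken in (("b1", b1_taken), ("b2", b2_taken)):
            history = last_taken if correlating else None
            key = (branch, history)
            if table.get(key, False) != taken:
                misses += 1
            table[key] = taken   # 1-bit update: remember the last outcome
            last_taken = taken   # shift the outcome into the global history
    return misses


pattern = [2, 0, 2, 0]                  # assumed input: d alternates between 2 and 0
print(run(pattern, correlating=False))  # 8: the plain 1-bit predictor misses every branch
print(run(pattern, correlating=True))   # 2: only the two cold-start predictions miss
```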
Correlating Branch Predictor: Example
In an (m, n) predictor, the global history of the most recent m branches is recorded in an m-bit shift register, in which each bit records whether a branch was taken or not taken. The branch-prediction buffer is indexed using the concatenation of the low-order bits of the branch address with the m-bit global history. A typical (2, 2) correlating predictor is shown next.
(2, 2) Correlating Branch Predictor
[Figure: 4 bits of the branch address form the lower part of a 6-bit address; the 2-bit global branch history forms the upper 2 bits of the 6-bit address. The address selects a prediction from 2^2 = 4 predictors with 2 bits per branch and 16 entries each (entries 0-15, 16-31, 32-47 and 48-63).]
(2, 2) Correlating Branch Predictor
Here the buffer is drawn as a 2-dimensional object in which each entry is 2 bits wide; in reality the entries are arranged linearly. The (2, 2) branch-prediction buffer uses the 2-bit global history to choose among 4 predictors for each 4-bit branch address (that is, among the 16 entries in each of the 4 predictors).
(2, 2) Correlating Branch Predictor
The behavior of recent branches selects among, say, four predictions of the next branch, and only that prediction is updated. Indexing is done by concatenating the 4 low-order bits of the branch (word) address with the 2 global history bits to form a 6-bit address, which selects a 2-bit prediction from the 64 entries (4 buffers of 16 entries each).
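The indexing scheme can be sketched in a few lines. The class below is a hypothetical illustration, not a description of any real processor: the class name `CorrelatingPredictor22`, the initial counter value (strongly not-taken) and the convention that counter values 2 and 3 predict taken are assumptions. The global history forms the upper 2 bits of the 6-bit index and the low-order 4 bits of the word address form the lower part.

```python
class CorrelatingPredictor22:
    """Sketch of a (2, 2) predictor: 64 two-bit counters, 6-bit index."""

    def __init__(self):
        self.counters = [0] * 64  # 2-bit saturating counters, 0..3 (assumed start: 0)
        self.history = 0          # 2-bit global history shift register

    def index(self, word_address):
        # Upper 2 bits: global history. Lower 4 bits: low-order address bits.
        return (self.history << 4) | (word_address & 0xF)

    def predict(self, word_address):
        # Counter values 2 and 3 predict taken (assumed encoding).
        return self.counters[self.index(word_address)] >= 2

    def update(self, word_address, taken):
        i = self.index(word_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the 2-bit global history.
        self.history = ((self.history << 1) | int(taken)) & 0b11


p = CorrelatingPredictor22()
p.history = 0b10
print(p.index(0b110111))  # 39: history 10 concatenated with address bits 0111
```

Only the counter selected by the current history is updated, so the same branch address can hold up to four independent predictions.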
Comparison of (0, 2) and (2, 2) predictors
[Figure: frequency of mispredictions (axis from 0% to 18%) for three predictors: a 2-bit BHT with 4096 entries, a 2-bit BHT with unlimited entries, and a (2, 2) BHT with 1024 entries.]
Multilevel Branch Predictors (Tournament Predictors)
Multilevel branch prediction (for example, for nested branches) uses information at both the local and the global level to predict correctly. It involves:
- several levels of branch-prediction tables, and
- an algorithm to choose among the different predictors
State Transition Diagram of Tournament Predictor (cf. Fig. 3.16, p. 204)
[Figure: four-state transition diagram with the states 00 (use predictor 1), 01 (use predictor 1), 10 (use predictor 2) and 11 (use predictor 2); the arcs in the extracted slide are labelled T and NT.]
Multilevel Branch Predictors (Tournament Predictors)
The transitions are specified by the outcome of the predicted predictor and of the other predictor, with correct = 1 and incorrect = 0.
The state transition diagram shows that:
- the counter is incremented whenever the predicted predictor is correct and the other is incorrect (i.e., for 1/0), and
- the counter is decremented in the reverse situation (i.e., for 0/1)
For all other outcomes the counter does not change.
Multilevel Branch Predictors (Tournament Predictors)
For the saturating state 00 (use predictor 1):
- it increments to state 01 (use predictor 1) for 1/0
- it decrements to state 11 (use predictor 2) for 0/1
For the saturating state 11 (use predictor 2):
- it increments to state 00 (use predictor 1) for 1/0
- it decrements to state 10 (use predictor 2) for 0/1
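The selection mechanism can be sketched as an ordinary 2-bit saturating counter. Note that this is a simplified illustration under its own assumptions and does not reproduce the state numbering used on the slides: here the counter runs from 0 to 3, values 2 and 3 select predictor 1, values 0 and 1 select predictor 2, and the outcome pair is given explicitly as (predictor 1 correct, predictor 2 correct). The class name `Selector` is hypothetical.

```python
class Selector:
    """2-bit saturating counter that chooses between two predictors."""

    def __init__(self, state=3):
        self.state = state  # assumed encoding: 3, 2 -> predictor 1; 1, 0 -> predictor 2

    def choice(self):
        return 1 if self.state >= 2 else 2

    def update(self, p1_correct, p2_correct):
        if p1_correct and not p2_correct:    # outcome 1/0: move toward predictor 1
            self.state = min(3, self.state + 1)
        elif p2_correct and not p1_correct:  # outcome 0/1: move toward predictor 2
            self.state = max(0, self.state - 1)
        # outcomes 0/0 and 1/1 give no reason to prefer either predictor


s = Selector()
s.update(False, True)
print(s.choice())  # 1: one disagreement is not enough to switch
s.update(False, True)
print(s.choice())  # 2: the second consecutive 0/1 outcome switches the choice
```

As with a 2-bit branch predictor, two consecutive outcomes in favor of the other predictor are needed before the selection changes.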
High Performance Instruction Delivery
In the MIPS 5-stage pipeline, we need to know the address of the next instruction to fetch by the end of the current IF cycle. That is, for ZERO branch penalty, we need to know whether the as-yet undecoded instruction is a branch and, if so, what the next PC is.
Branch Target Buffer
This is accomplished by introducing a cache that contains the address of the next instruction both when the branch is taken and when it is not taken. This cache is known as the Branch-Target Cache or Branch-Target Buffer (BTB).
Branch-Prediction Buffer vs. Branch-Target Buffer
Recall from our discussion last time that the branch-prediction buffer is accessed during the ID stage, after the instruction is decoded; i.e., we know the branch-target address only at the end of the ID stage, when we fetch the next predicted instruction.
Branch Target Buffer
[Figure: the PC of the instruction to fetch is looked up in the branch-target buffer, whose entries hold the fields Lookup, Predicted PC and Prediction State (branch predicted taken or not taken). Label: "Number of entries in Branch target Buffer =". No: the instruction is not predicted to be a branch; proceed normally. Yes: the instruction is a branch, and the predicted PC should be used as the next PC.]
Branch-Target Buffer
The branch-target buffer has three fields:
- Lookup: addresses of the known branch instructions (predicted as taken)
- Predicted PC: the predicted next PC for a fetched instruction that is a predicted-taken branch
- Prediction State: optional extra prediction-state bits
Branch-Target Buffer: Complications?
- A complication arises in using a 2-bit predictor, because it uses information for both taken and not-taken branches
- This complication is resolved in PowerPC processors by using both a target buffer and a prediction buffer
Branch-Target Buffer
Steps involved in using the branch-target buffer at the IF, ID and EX pipeline stages (cf. flow chart of Fig. 3.20):
- IF
- ID
- EX
Branch-Target Buffer – Flow Chart Explanation
- IF Stage: the PC of the instruction is compared with the contents of the buffer. If it is found, the instruction must be a branch instruction predicted taken; otherwise it may be a branch predicted not taken or a normal instruction.
Branch-Target Buffer - ID Stage
i) Decode the instruction; if an entry was found in the target buffer in the IF stage (predicted branch), begin fetching immediately from the predicted PC
ii) Check whether the decoded instruction is a taken branch
Branch-Target Buffer - EX Stage
The EX stage performs one of four possible functions:
i) If no entry was found in the target buffer in the IF stage, and in the ID stage the instruction
(i-a) is found to be a taken branch, then enter the branch-instruction address and next PC into the branch-target buffer
(i-b) otherwise, proceed with normal instruction execution
ii) If an entry was found in the target buffer in the IF stage, and in the ID stage the instruction
(ii-a) is found to be a taken branch, then it was correctly predicted, so execute normally without a stall
(ii-b) otherwise it was mispredicted, so kill the fetched instruction, restart fetching at the other address, and delete the entry from the target buffer
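The IF and EX steps above can be sketched as a small table keyed by the branch PC. This is an illustrative model only: the class name `BranchTargetBuffer`, the use of an unbounded dictionary instead of a fixed-size cache, the 2-cycle penalty constant, and the handling of a taken branch whose stored target is stale are all assumptions of the sketch.

```python
class BranchTargetBuffer:
    """Sketch of a BTB that stores only predicted-taken branches."""

    PENALTY = 2  # assumed penalty in cycles for a misprediction or a taken miss

    def __init__(self):
        self.entries = {}  # branch PC -> predicted next PC

    def lookup(self, pc):
        """IF stage: return the predicted next PC, or None if pc is not in the buffer."""
        return self.entries.get(pc)

    def resolve(self, pc, taken, target=None):
        """EX stage: update the buffer and return the penalty in cycles."""
        predicted = self.entries.get(pc)
        if predicted is None:
            if taken:                       # case (i-a): taken branch not in the buffer
                self.entries[pc] = target
                return self.PENALTY
            return 0                        # case (i-b): proceed normally
        if taken and predicted == target:   # case (ii-a): correctly predicted
            return 0
        if taken:                           # stale target (assumed handling): refresh it
            self.entries[pc] = target
        else:                               # case (ii-b): predicted taken, not taken
            del self.entries[pc]
        return self.PENALTY


btb = BranchTargetBuffer()
print(btb.resolve(0x40, True, 0x80))  # 2: first taken execution misses the buffer
print(btb.lookup(0x40) == 0x80)       # True: the entry was inserted
print(btb.resolve(0x40, True, 0x80))  # 0: hit and correctly predicted
print(btb.resolve(0x40, False))       # 2: mispredicted, entry deleted
```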
Branch-Target Buffer: Conclusion
If the branch entry is found in the buffer and the prediction is correct, there is no branch penalty. Otherwise the branch suffers a misprediction penalty of at least 2 clock cycles:
- one clock cycle for fetching the wrong instruction, and
- one clock cycle to restart the fetch
Branch-Target Buffer - Examples
Inst. in Buffer | Prediction | Actual Branch | Penalty Cycles
Yes             | Taken      | Taken         | 0
Yes             | Taken      | Not Taken     | 2
No              | -          | Taken         | 2
No              | -          | Not Taken     | 0
Branch-Target Buffer - Solution
We can compute the penalty by looking at the probability of two events:
i) Branch is predicted taken but ends up not taken = buffer hit rate × incorrect-prediction rate = 0.95 × 0.1 = 0.095
ii) Branch is taken but is not found in the buffer = incorrect-prediction rate = 0.1
The penalty in both cases is 2 cycles; therefore
Branch penalty = (0.095 + 0.1) × 2 = 0.195 × 2 = 0.39 cycles
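The same arithmetic can be written as a short function, which makes it easy to try other rates. This is only a restatement of the calculation above; the function and parameter names are illustrative.

```python
def branch_penalty(hit_rate, mispredict_rate, taken_not_in_buffer, penalty_cycles):
    """Expected branch penalty in cycles per branch.

    hit_rate * mispredict_rate -- branch found in the buffer but actually not taken
    taken_not_in_buffer        -- branch taken but not found in the buffer
    """
    return (hit_rate * mispredict_rate + taken_not_in_buffer) * penalty_cycles


print(round(branch_penalty(0.95, 0.1, 0.1, 2), 3))  # 0.39
```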
Example: Branch-Target Buffer
Problem: Consider a branch-target buffer implemented for conditional branches only, in a pipelined processor. Assume that:
- Misprediction penalty = 4 cycles
- Buffer miss penalty = 3 cycles
- Hit rate and accuracy = 90% each
- Branch frequency = 15%
Example: Branch-Target Buffer Solution
The speedup with a branch-target buffer versus no BTB is expressed as:
Speedup = CPI_noBTB / CPI_BTB = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
The stalls are determined as:
Stalls = Σ_{s ∈ Stall} Frequency_s × Penalty_s
i.e., the sum over all stall cases of the product of the frequency of the stall case and its stall penalty.
Example: Branch-Target Buffer
i) Stalls_noBTB = 0.15 × 2 = 0.30
ii) To find Stalls_BTB we have to consider each outcome from the BTB. There are three possibilities:
a) Branch misses the BTB: frequency = 15% × 0.1 = 1.5% = 0.015; penalty = 3; stalls = 0.045
b) Branch hits and is correctly predicted: frequency = 15% × 0.9 (hit) × 0.9 (prediction) ≈ 12.1% = 0.121; penalty = 0; stalls = 0
Example: Branch-Target Buffer
c) Branch hits but is incorrectly predicted: frequency = 15% × 0.9 (hit) × 0.1 (misprediction) ≈ 1.3% = 0.013; penalty = 4; stalls = 0.052
Stalls_BTB = 0.045 + 0.052 = 0.097
Speedup = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB) = (1.0 + 0.3) / (1.0 + 0.097) ≈ 1.2
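The worked example can be reproduced with the short sketch below; the function names are illustrative. The code carries the frequencies at full precision (0.15 × 0.9 × 0.1 = 0.0135 rather than the truncated 0.013), so its intermediate stall figure is 0.099 instead of 0.097, but the speedup still rounds to 1.2.

```python
def stalls_with_btb(branch_freq, hit_rate, accuracy, miss_penalty, mispredict_penalty):
    """Stall cycles per instruction with a BTB (sum of frequency x penalty)."""
    miss = branch_freq * (1 - hit_rate) * miss_penalty                     # case a
    wrong = branch_freq * hit_rate * (1 - accuracy) * mispredict_penalty  # case c
    return miss + wrong                                                    # case b adds 0


def speedup(cpi_base, stalls_no_btb, stalls_btb):
    return (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)


stalls_no_btb = 0.15 * 2  # branch frequency x 2-cycle penalty without a BTB
stalls_btb = stalls_with_btb(0.15, 0.9, 0.9, miss_penalty=3, mispredict_penalty=4)
print(round(stalls_btb, 3))                              # 0.099
print(round(speedup(1.0, stalls_no_btb, stalls_btb), 1)) # 1.2
```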
Improvement in BTB
To improve instruction delivery further, one possible variation of the branch-target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.
Improvement in BTB: Advantages
- It possibly allows a larger BTB, since the buffer access is permitted to take longer than the time between successive instruction fetches
- Buffering the actual target instructions allows branch folding, i.e., ZERO-cycle unconditional branches or sometimes ZERO-cycle conditional branches
Conclusion: