
CS 704 Advanced Computer Architecture
Lecture 15: Instruction Level Parallelism (Dynamic Branch Prediction)
Prof. Dr. M. Ashraf Chughtai

Today's Topics
• Recap: Lecture 14
• Dynamic Branch Prediction Buffer
• Examples of Branch Predictors
• Summary
MAC/VU-Advanced Computer Architecture Lecture 15 – Instruction Level Parallelism -Dynamic (4)

Recap: Lecture 14
• Tomasulo's approach was developed for the IBM 360/91 to achieve high performance without special compilers
• The control and buffers are distributed with the functional units (FUs)
• Registers in instructions are replaced by values or by pointers to reservation stations (RS); i.e., the registers are renamed
• Unlike the scoreboard, Tomasulo's approach can have multiple loads outstanding

Recap: Lecture 14
• These two properties allow an instruction with a name dependence to be issued; e.g., MULT is issued even though it has a name dependence on register F2

Recap: Lecture 14
• Tomasulo's approach eliminates the WAR hazard: in this example, ADD.D writes its result in cycle 11 even though DIV.D does not start execution until cycle 16

Recap: Lecture 14
• Tomasulo's approach issues in order and may execute out of order

Recap: Lecture 14
• Here, the integer instructions SUBI and BNEZ are executed out of order to evaluate the condition
• The Branch-Taken prediction is implemented by repeating the loop instructions as shown

Recap: Lecture 14
• The Branch-Taken prediction is implemented by two iterations of the code
• R1 has been initialized to 80

Recap: Lecture 14
• L.D is issued in the 6th clock cycle, prior to the condition evaluation (Predict Branch Taken)
• R1 is updated in clock cycle 6, by executing SUBI in clock cycle 5
• SUBI and BNEZ are issued in clock cycles 4 and 5 respectively
• F0 never sees the result

Recap: Lecture 14
• MUL 1, issued in clock cycle 2, does not start execution until the write to F0 by L.D is complete, to avoid a RAW hazard

Recap: Lecture 14
• L.D 1 issues in cycle 1 and completes execution in cycle 9 (8 clock cycles the first time); it writes to F0 in cycle 10
• L.D 2, issued in cycle 6, completes execution (4 clock cycles the second time)
• So MUL 1 will start in cycle 11, avoiding a RAW hazard
• S.D 1 will start execution on the completion of MUL 1, to avoid a RAW hazard
• SUBI and BNEZ are issued in clock cycles 9 and 10 respectively
• SUBI completes execution in cycle 10 and updates R1 for the next iteration

Recap: Lecture 14
• MUL 1, whose execution started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• S.D 1, issued in cycle 3, will start execution in cycle 16, avoiding a RAW hazard

Recap: Lecture 14
• MUL 1, whose execution started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• S.D 1, issued in cycle 3, will start execution in cycle 16 and complete in cycle 18
• SUBI, issued in cycle 16, updates R1 for the next iteration in cycle 18

Recap: Lecture 14
• MUL 2, whose execution started in cycle 12, completes in cycle 15 and writes its result to F4 in cycle 16
• S.D 2, issued in cycle 8, starts execution in cycle 17, after MUL 2 writes its result in cycle 16, to avoid a RAW hazard

Introduction to Dynamic Branch Prediction
In the last lecture, we considered a loop-based example to discuss how Tomasulo's approach overcomes the WAW and WAR hazards. There we observed that a dynamically scheduled pipeline can yield high performance provided branches are predicted accurately.

Branch History Table
If the prediction is wrong, then invert the prediction bit

1-bit Dynamic Branch Prediction
Problem:
- In a loop, a 1-bit BHT causes two mispredictions in a row: one on the last iteration, when the loop exits, and one on the first iteration of the next execution of the loop, when it predicts exit instead of looping
- A 1-bit predictor therefore mispredicts at twice the rate at which the branch is not taken
- Let us consider an example of a loop branch (For i=1 to 10); i.e., the branch is taken 9 times and not taken once

1-bit Dynamic Branch Prediction: Conclusion
Performance = f(accuracy, cost of mispredictions)
The accuracy of the predictor is expected to match the taken-branch frequency, which in the previous example is 9 out of 10 (90%). The 1-bit predictor, however, achieves only 8 out of 10 (80%).
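The 80% figure can be checked with a short simulation. The sketch below is illustrative only: the function name `one_bit_mispredictions`, the initial prediction (taken), and the number of loop executions are assumptions made for the example, not part of the lecture.

```python
def one_bit_mispredictions(iterations=10, executions=5):
    """Count mispredictions of a 1-bit predictor for each execution of a loop."""
    predict_taken = True          # assumption: the single bit starts at "taken"
    per_execution = []
    for _ in range(executions):
        misses = 0
        for i in range(iterations):
            taken = i < iterations - 1   # taken 9 times, not taken on loop exit
            if predict_taken != taken:
                misses += 1
            predict_taken = taken        # 1-bit scheme: remember the last outcome
        per_execution.append(misses)
    return per_execution


misses = one_bit_mispredictions()
print(misses)                      # [1, 2, 2, 2, 2]
print((10 - misses[-1]) / 10)      # 0.8
```

After the first execution of the loop, every later execution mispredicts twice (loop exit and loop re-entry), which gives the 8 out of 10 accuracy quoted above.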

2-bit Dynamic Branch Prediction
2 bits are used to encode the 4 states of the system (counter). Say:
- States 00 and 01 for Predict Not-Taken
- States 10 and 11 for Predict Taken

2-bit Dynamic Branch Prediction
[State diagram: four states, 11 and 10 (Predict Taken) and 01 and 00 (Predict Not Taken), with transitions labeled T (taken) and NT (not taken)]

2-bit Dynamic Branch Prediction
In a saturating counter implementation, the 2-bit counter saturates at:
- 00 (Predict Not Taken) or
- 11 (Predict Taken)
The counter is incremented when a branch is taken and decremented when it is not taken; e.g.,
- 00 to 01 for Taken when predicted not taken
- 10 to 11 for Taken when predicted taken

2-bit Dynamic Branch Prediction
Here, when the counter is greater than or equal to half of its maximum value (>= 10; i.e., states 10 and 11), the branch is predicted as taken; otherwise (< 10; i.e., states 01 and 00), the branch is predicted as untaken.
Let us try the example of the loop For i=1, 10

2-bit Dynamic Branch Prediction
Let us try the example of the loop For i=1, 10

Iteration   P.S.     Branch      N.S.   Prediction
1           0 -not   Taken       11     Taken
2           11       Taken       11     Taken
:
9           11       Taken       11     Taken
10          11       Not taken   10     Taken

(P.S. = present state, N.S. = next state)
The prediction fails only once
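The same loop can be run through a 2-bit saturating counter to confirm the single misprediction. This is a minimal sketch under stated assumptions: the function name `two_bit_mispredictions` is hypothetical, and the counter is assumed to start in state 11, matching the present state shown for the later iterations.

```python
def two_bit_mispredictions(iterations=10, executions=5, counter=3):
    """Count mispredictions of a 2-bit saturating counter per loop execution."""
    per_execution = []
    for _ in range(executions):
        misses = 0
        for i in range(iterations):
            taken = i < iterations - 1        # taken 9 times, not taken on exit
            predict_taken = counter >= 2      # states 10 and 11 predict taken
            if predict_taken != taken:
                misses += 1
            if taken:
                counter = min(3, counter + 1) # saturate at 11
            else:
                counter = max(0, counter - 1) # saturate at 00
        per_execution.append(misses)
    return per_execution


print(two_bit_mispredictions())   # [1, 1, 1, 1, 1]
```

The loop exit moves the counter from 11 to 10, which still predicts taken, so re-entering the loop is predicted correctly and only the exit is mispredicted.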

Branch Prediction Buffer (BPB) or BHT Implementation

Branch Prediction Buffer (BPB) or BHT Implementation
If the prediction is wrong, then the prediction bits are changed:
- Predicted taken: the state changes from 11 to 10
- Predicted not taken: the state changes from 00 to 01
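A branch prediction buffer of this kind can be sketched as a small table of 2-bit counters. The sketch below is an assumption-laden illustration, not the lecture's implementation: the class name `BranchHistoryTable`, indexing by the low-order bits of the branch address, 4-byte instructions, and the initial counter value of 01 are all choices made for the example.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, entries=4096, initial=1):
        self.entries = entries
        self.counters = [initial] * entries   # assumption: start at state 01

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # 10 and 11 predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(entries=4)   # tiny table to make aliasing visible
bht.update(0x1000, True)
bht.update(0x1000, True)
print(bht.predict(0x1000))   # True
print(bht.predict(0x1010))   # True: a different branch that maps to the same entry
print(bht.predict(0x1004))   # False: a different entry, still at its initial state
```

Because the table has no tags in this sketch, two branches whose addresses share the same index also share a counter, which is one reason table size affects accuracy.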

Branch History Table Accuracy
[Fig. 3.8, p. 200]
Here, for the SPEC89 benchmarks, a branch prediction buffer with 4096 entries results in:
- Prediction accuracy ranging from 99% down to 82%, or
- A misprediction rate of 1% to 18%

Branch History Table Accuracy with Respect to Size
[Fig. 3.9, p. 201]

Impact of Size on Accuracy of BHT
As we try to exploit more ILP, the accuracy of the branch predictor becomes critical. Here, the accuracy of the predictor is shown as the size of the buffer is increased:
- 4096-entry 2-bit BHT
- Unlimited-entry 2-bit BHT
Simply increasing the number of bits per predictor without changing the predictor structure has little impact, so we have to look at other methods to increase the accuracy of the predictors.

Correlating Branches
The 2-bit predictor scheme uses only the recent behavior of a single branch to predict the future behavior of that branch. In practice, the behavior of other branches, not only the branch we are trying to predict, may also influence the prediction accuracy.
Let us consider the worst case of the SPEC92 benchmark for the 2-bit predictor.

Correlating Branches
SPEC92 benchmark example for the 2-bit predictor. Assume aa is assigned to register R1 and bb to register R2.

C code              MIPS code
if (aa==2)              DSUBUI R3, R1, #2
                        BNEZ   R3, L1       ; branch b1 (aa!=2)
    aa=0;               DADD   R1, R0, R0   ; aa=0 (branch not taken)
if (bb==2)          L1: DSUBUI R3, R2, #2
                        BNEZ   R3, L2       ; branch b2 (bb!=2)
    bb=0;               DADD   R2, R0, R0   ; bb=0 (branch not taken)
if (aa!=bb) {       L2: DSUBU  R3, R1, R2
                        BEQZ   R3, L3       ; branch b3 (aa==bb)

Here, the behavior of b3 (at L2) is correlated with the behavior of b1 and b2.

Correlating Branches
Here, if b1 and b2 are both not taken (aa=0; bb=0), then b3 is taken. A predictor that uses the behavior of a single branch to predict the behavior of that branch cannot capture this behavior, so we need a correlating branch predictor.
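The correlation can be confirmed by modeling the three branches directly. The following sketch is an illustration only: the function name `branch_outcomes` and the range of test values are assumptions, and `True` stands for a taken branch in the sense of the MIPS code (b1 and b2 are taken when the comparison with 2 fails, b3 is taken when aa equals bb).

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3), where True means the branch is taken."""
    b1 = aa != 2            # BNEZ R3, L1 is taken when aa != 2
    if not b1:
        aa = 0              # fall-through path: aa = 0
    b2 = bb != 2            # BNEZ R3, L2 is taken when bb != 2
    if not b2:
        bb = 0              # fall-through path: bb = 0
    b3 = aa == bb           # BEQZ R3, L3 is taken when aa == bb
    return b1, b2, b3


for aa in range(4):
    for bb in range(4):
        b1, b2, b3 = branch_outcomes(aa, bb)
        if not b1 and not b2:
            print(aa, bb, b3)   # prints: 2 2 True
```

Whenever both b1 and b2 fall through, aa and bb have both been set to 0, so b3 is always taken; a per-branch predictor sees only the history of b3 and has no access to that information.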

Correlating Branch Predictors
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch

Correlating Branch Predictors
In general, an (m, n) predictor records the outcomes of the last m branches to select between 2^m history tables, each with n-bit counters
- The old 2-bit BHT is then a (0, 2) predictor
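The storage cost of an (m, n) predictor follows from this definition: each entry selected by the branch address holds 2^m counters of n bits. The helper below is a sketch; the name `predictor_bits` and the parameter `entries` (the number of entries selected by the branch address) are assumptions for the example. The sizes used are the two configurations the lecture compares later.

```python
def predictor_bits(m, n, entries):
    """Total state bits: 2**m counters of n bits for each of `entries` rows."""
    return (2 ** m) * n * entries


print(predictor_bits(0, 2, 4096))   # 8192 bits: 4096-entry 2-bit BHT
print(predictor_bits(2, 2, 1024))   # 8192 bits: 1024-entry (2, 2) predictor
```

Under this count, a 4096-entry 2-bit BHT and a 1024-entry (2, 2) predictor use the same number of bits, which makes their accuracy comparison a like-for-like one.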

Correlating Branch Predictor: Example
Let us consider an illustrative code fragment (d is assigned to R1):

C code              MIPS code
if (d==0)               BNEZ   R1, L1       ; branch b1 (d!=0)
    d=1;                DADDIU R1, R0, #1   ; branch not taken, d=1
if (d==1)           L1: DADDIU R3, R1, #-1
                        BNEZ   R3, L2       ; branch b2 (d!=1)

The working of the correlating predictor is as follows:

Initial d   d==0?   b1   d before b2   d==1?   b2
0           yes     NT   1             yes     NT
1           no      T    1             yes     NT
2           no      T    2             no      T

Here, if b1 is not taken, b2 will not be taken.

Correlating Branch Predictor: Example
We write the pair of prediction bits as: prediction if the last branch in the program is not taken / prediction if the last branch is taken. Therefore, the 4 possible combinations are:

Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             NT                                    NT
NT/T              NT                                    T
T/NT              T                                     NT
T/T               T                                     T

Correlating Branch Predictor: Example
The action of the 1-bit predictor with 1 bit of correlation, written as (1, 1), for the above example is shown here (Fig. 3.13, p. 203).
In this case the only misprediction is on the first iteration, when d=2, as this is not correlated with the previous prediction.
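The behavior of the (1, 1) predictor can be reproduced with a short simulation and compared against a plain 1-bit predictor. This is a sketch under stated assumptions: d is assumed to alternate between 2 and 0, all prediction bits start as not taken, the last-branch outcome starts as not taken, and the function names are hypothetical.

```python
def outcomes(d):
    """Return (b1, b2) for one pass through the fragment; True means taken."""
    b1 = d != 0             # BNEZ R1, L1
    if not b1:
        d = 1               # fall-through path: d = 1
    b2 = d != 1             # BNEZ R3, L2
    return b1, b2


def one_bit_misses(d_values):
    pred = {"b1": False, "b2": False}        # assumption: start at not taken
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            misses += pred[name] != taken
            pred[name] = taken               # 1-bit: remember the last outcome
    return misses


def one_one_misses(d_values):
    # pred[name][0] is used if the last branch was not taken, pred[name][1] if taken.
    pred = {"b1": [False, False], "b2": [False, False]}
    last = 0                                 # assumption: last branch not taken
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            misses += pred[name][last] != taken
            pred[name][last] = taken         # update only the selected bit
            last = int(taken)
    return misses


sequence = [2, 0] * 4                        # 8 passes, 16 branch executions
print(one_bit_misses(sequence))              # 16
print(one_one_misses(sequence))              # 2
```

With these assumptions the plain 1-bit predictor mispredicts every branch, while the (1, 1) predictor mispredicts only b1 and b2 on the first pass, when d=2.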

Correlating Branches
• A (2, 2) branch prediction buffer uses a 2-bit global history to choose from among 4 predictors for each branch address
• The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction
[Diagram: the branch address selects a row of 2-bit per-branch predictors; the 2-bit global branch history selects the prediction within that row]
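A (2, 2) buffer of this shape can be sketched as a table with one row per branch-address index and one counter per global-history value. The code below is an illustration under assumptions, not the lecture's design: the class name `CorrelatingPredictor`, the index function, the branch addresses, and the alternating inputs (aa, bb) = (2, 2) and (2, 1) are all hypothetical. It reuses the aa/bb fragment from the earlier slide to compare a (0, 2) and a (2, 2) predictor on branch b3.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m = m
        self.entries = entries
        self.max_count = (1 << n) - 1
        self.history = 0                      # last m outcomes, 1 = taken
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        # Predict taken when the selected counter is in its upper half.
        return self._row(pc)[self.history] > self.max_count // 2

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def outcomes(aa, bb):
    """Return (b1, b2, b3) for the aa/bb fragment; True means taken."""
    b1 = aa != 2
    if not b1:
        aa = 0
    b2 = bb != 2
    if not b2:
        bb = 0
    return b1, b2, aa == bb


def b3_misses(predictor, rounds=20, warmup=10):
    """Count mispredictions of b3 after a warm-up period."""
    pcs = (0x100, 0x110, 0x120)               # hypothetical addresses of b1, b2, b3
    misses = 0
    for r in range(rounds):
        inputs = (2, 2) if r % 2 == 0 else (2, 1)
        for pc, taken in zip(pcs, outcomes(*inputs)):
            if pc == pcs[2] and r >= warmup and predictor.predict(pc) != taken:
                misses += 1
            predictor.update(pc, taken)
    return misses


print(b3_misses(CorrelatingPredictor(m=0, n=2)))   # 5
print(b3_misses(CorrelatingPredictor(m=2, n=2)))   # 0
```

In this setup b3 alternates between taken and not taken, which a single 2-bit counter cannot follow, while the global history of b1 and b2 distinguishes the two cases and lets the (2, 2) predictor learn each one separately.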

Accuracy of Different Schemes
[Chart: frequency of mispredictions (axis from 0% to 18%) for three schemes: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2, 2) BHT]

Branch History Table or Branch Target Buffer
[Diagram: the PC of the instruction to fetch is looked up in the branch target buffer; each entry holds a predicted PC and whether the branch is predicted taken or not taken]
• No: the instruction is not predicted to be a branch; proceed normally
• Yes: the instruction is a branch, and the predicted PC should be used as the next PC
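The lookup described above can be sketched as a map from branch PC to predicted next PC. This is a simplified illustration with several assumptions: the class name `BranchTargetBuffer` is hypothetical, the buffer is modeled as an unbounded dictionary rather than a fixed number of entries, instructions are 4 bytes, and an entry is removed when its branch turns out not to be taken (one possible policy among several).

```python
class BranchTargetBuffer:
    """Maps the PC of a branch predicted taken to its predicted target PC."""

    def __init__(self):
        self.entries = {}                    # branch PC -> predicted target PC

    def next_pc(self, pc):
        # Hit: use the predicted PC. Miss: proceed normally to the next instruction.
        return self.entries.get(pc, pc + 4)  # assumption: 4-byte instructions

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target        # remember the taken branch
        else:
            self.entries.pop(pc, None)       # assumption: drop not-taken branches


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x40)))    # 0x44: no entry, fetch falls through
btb.update(0x40, taken=True, target=0x20)
print(hex(btb.next_pc(0x40)))    # 0x20: entry found, predicted PC is used
btb.update(0x40, taken=False, target=0x20)
print(hex(btb.next_pc(0x40)))    # 0x44: entry removed after a not-taken outcome
```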

Dynamic Branch Prediction Summary
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Branch Target Buffer: includes the branch address and prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches

Asslam-u-aLacum and ALLAH Hafiz