
Microarchitecture of Superscalars (3): Branch Prediction
Dezső Sima, Fall 2007 (Ver. 2.0)

Branch prediction
• 1. Introduction
• 2. Basic branch prediction mechanisms
• 3. Auxiliary branch prediction mechanisms
• 4. Accessing the branch target path

1.1 The branch processing problem of pipelining (1)
Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline (stages F, D, E, W). Steps shown: branch fetching, branch detection, BTA calculation, BTI fetching. Cost: 2 bubbles.

1.1 The branch processing problem of pipelining (2)
Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution (stages F, D, E, W). Steps shown: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching. Cost: 3 bubbles.

1.1 The branch processing problem of pipelining (3)
Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution

1.1 The branch processing problem of pipelining (4)
Figure 1.4: Number of pipeline stages in Intel's and AMD's processors (vertical axis: number of pipeline stages, 0 to 40; horizontal axis: year, 1990 to 2005)
Data points: Pentium (5), Pentium Pro (~12), K6 (6), Pentium 4 (~20), Athlon (6), P4 Prescott (~30), Athlon-64 (12), Core Duo Conroe (14)

1.2 Branch statistics (1)
Figure 1.5: Dynamic ratio of branches

1.2 Branch statistics (2)
Figure 1.6: Ratio of the main instruction types
Source: Stephens et al., "Instruction level profiling and evaluation of the IBM RS/6000", Proc. 18th ISCA, pp. 137-146

1.2 Branch statistics (3)
Branches
  Unconditional branches (~1/3): simple unconditional branch, branch to subroutine, return from subroutine
  Conditional branches
    Loop-closing conditional branch (~1/3): taken for the first (n-1) iterations
    Other conditional branches (~1/3): taken ~1/6, not taken ~1/6
  Overall: taken ~5/6, not taken ~1/6
Figure 1.7: Grohoski's estimate of branch statistics
Source: Grohoski, G. F., IBM J. Res. Develop., 34, Jan., pp. 37-58

1.2 Branch statistics (3)
Reference; frequency of taken branches; frequency of not taken branches
  Lee, Smith 1984; 57-99 %; 1-43 %
  Edenfield & al. 1990; 75 %; 25 %
  Grohoski 1990; ~5/6; ~1/6
Figure 1.8: Frequency of taken and not taken branches
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303

1.3 The principle of branch prediction (1)
Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline

1.3 The principle of branch prediction (2)
Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline. Steps shown: bc fetching; bc detection, branch prediction (branch!) and BTA calculation; speculative BTI fetching and decoding; repeated condition checking, ending with "no branch!"; stop of the speculative path; fetching of the sequential instruction i(i+1). Cost: a large number of bubbles.

1.3 The principle of branch prediction (3)
Figure 1.11: Branch misprediction penalty on a long pipeline

1.4 Branch prediction accuracy/penalty (1)
Processor; guessing method (relevant for prediction accuracy); implementation; prediction accuracy; reference
  Am 29000 (1987); implicit dynamic; 32-entry two-way set associative BTIC; 60 % for repetitive branches; Weiss 1987
  MC 88110 (1991); implicit dynamic, overridden by opcode-based static; 32-entry fully associative BTIC; 70 % on SPEC; Diefendorff, Allen 1992
  MC 68060 (1993); 2-bit dynamic; 256-entry BTAC; > 90 %; Circello, Goodrich 1993
  MIPS R10000 (1996); 2-bit dynamic; 512-entry BHT; 90 %; Halfhill 1994
  PowerPC 620 (1995); implicit dynamic, augmented with 2-bit dynamic; 256-entry fully associative BTAC, 2K-entry BHT; 90 %; Thomson, Ryan 1994
  PA-8000 (1995); implicit dynamic, overridden by 3-bit dynamic or compiler-based static; 32-entry fully associative BTAC, 256-entry BHT; 80 % on SPECint92; Gwennap 1994
  UltraSparc (1995); 2-bit dynamic; 2K entries in the IC, each shared among two instructions; 88 % on SPECint92, 94 % on SPECfp92; Wayner 1994
BHT: Branch history table; BTAC: Branch target address cache; BTIC: Branch target instruction cache; IC: Instruction cache
Figure 1.12: Branch prediction accuracy
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 340

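1.4 Prediction accuracy/penalty (2)
Effective penalty of branch processing (simplified):
$P = f_c \cdot P_c + f_m \cdot P_m$
$f_c$: probability (frequency) of correctly predicted branches
$f_m$: probability (frequency) of mispredicted branches
$P_c$: penalty of correctly predicted branches
$P_m$: penalty of mispredicted branches
Examples (taking $P_c \approx 0$, so that $P \approx f_m \cdot P_m$):
  PPro: $f_m = 0.1$, $P_m = 10$ cycles, $P \approx 1$ cycle
  P4 Willamette: $f_m = 0.05$, $P_m = 20$ cycles, $P \approx 1$ cycle
  P4 Prescott: $P_m = 30$ cycles, $P \approx 1.5$ cycles (which implies $f_m = 0.05$)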
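The simplified penalty model can be tried out numerically. The sketch below assumes the effective penalty is the weighted sum of the two penalties, P = fc * Pc + fm * Pm with fc = 1 - fm, and that correctly predicted branches cost nothing (Pc = 0) unless stated otherwise. The function name `effective_branch_penalty` is hypothetical, and the misprediction rate used for P4 Prescott (0.05) is inferred from the 1.5-cycle result rather than given explicitly.

```python
def effective_branch_penalty(f_m, p_m, p_c=0.0):
    """Return P = f_c * P_c + f_m * P_m in cycles per branch (f_c = 1 - f_m)."""
    if not 0.0 <= f_m <= 1.0:
        raise ValueError("f_m must be a probability between 0 and 1")
    f_c = 1.0 - f_m
    return f_c * p_c + f_m * p_m


# (misprediction rate, misprediction penalty in cycles); Prescott's rate is assumed.
examples = {
    "PPro": (0.10, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

for name, (f_m, p_m) in examples.items():
    print(f"{name}: {effective_branch_penalty(f_m, p_m):.2f} cycles per branch")
```

The point of the exercise is that a pipeline three times as long needs roughly a third of the misprediction rate just to keep the effective penalty unchanged.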

2. Basic branch prediction mechanisms
2.1 Introduction (1)
Branch processing
  Branch detection
  Branch prediction
  Accessing the branch target path

2.1 Introduction (2)
Branch prediction mechanisms
  Basic branch prediction mechanism
  Auxiliary branch prediction mechanism

2.1 Introduction (2)
Basic branch prediction mechanism
  Processor based: Local
  Compiler hints

Prediction depends only on the behaviour of the branch considered.
Figure 2.1: Local prediction

2.1 Introduction (2)
Basic branch prediction mechanism
  Processor based: Local, Global (2-level)
  Compiler hints

Prediction depends on the actual execution path, that is, on all branches executed (the figure contrasts two paths, Path 1 and Path 2, that reach the same branch with different outcome histories).
Figure 2.2: Global prediction

2.1 Introduction (2)
Basic branch prediction mechanism
  Processor based: Local, Global (2-level), Combined (Choice prediction)
  Compiler hints

2.2. Local prediction (1)
Local prediction
  1-level
  2-level

2.2. Local prediction (2)
1-level (local) prediction
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): displacement-based, opcode-based
  Dynamic prediction (based on the execution history): 1-bit prediction
Processors shown: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), POWER1 (1990), POWER2 (1993), R8000 (1994), PPC 601 (1993)
PPC: PowerPC

2.2. Local prediction (3)
The IFA selects an entry of the BHT (Branch History Table); each entry holds one bit x:
  0: sequential continuation
  1: branch
Figure 2.3: Principle of the 1-bit dynamic prediction

2.2. Local prediction (4)
States: "Not taken" and "Taken". A T outcome moves to (or stays in) "Taken"; an NT outcome moves to (or stays in) "Not taken".
T: Branch has been taken
NT: Branch has not been taken
Figure 2.4: State transition diagram of the 1-bit dynamic prediction

2.2. Local prediction (6)
1-level (local) prediction
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): displacement-based, opcode-based
  Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction
Processors shown: 80486 (1989), Pentium (1993), MC 68040 (1990), MC 68060 (1993), SuperSparc (1992), UltraSparc (1995), R4000 (1992), POWER1 (1990), POWER2 (1993), R8000 (1994), PPC 601 (1993), R10000 (1996), PPC 604 (1995), PPC 620 (1996)
PPC: PowerPC

2.2. Local prediction (7)
The IFA selects an entry of the BHT; each entry holds two bits xx:
  00, 01: sequential continuation
  10, 11: branch
BHT: Branch History Table
Figure 2.6: Principle of the 2-bit dynamic prediction

2.2. Local prediction (8)
States and predictions:
  11: Strongly taken, prediction "Taken"
  10: Weakly taken, prediction "Taken"
  01: Weakly not taken, prediction "Not taken"
  00: Strongly not taken, prediction "Not taken"
Transitions: each AT moves one state towards 11 (and stays in 11), each ANT moves one state towards 00 (and stays in 00).
Initialised when a branch is first taken.
Branch has been: AT: actually taken, ANT: actually not taken
Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm)
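The difference between the 1-bit and the 2-bit scheme shows up most clearly on a loop-closing branch. The following is a minimal sketch of an indexed BHT of saturating counters, under the assumptions that all counters start in the lowest state, that instructions are 4 bytes long, and that the table has 512 entries. The class name `SaturatingCounterPredictor` and the helper `count_mispredictions` are hypothetical.

```python
class SaturatingCounterPredictor:
    """Indexed BHT of n-bit saturating counters (bits=1 or bits=2 for the schemes above)."""

    def __init__(self, bits=2, entries=512):
        self.max_state = (1 << bits) - 1        # e.g. 0b11 for 2 bits
        self.taken_threshold = 1 << (bits - 1)  # upper half of the states predicts "taken"
        self.table = [0] * entries              # assumption: start in the lowest state

    def _index(self, ifa):
        # Assumption: 4-byte instructions, low-order address bits select the entry.
        return (ifa >> 2) % len(self.table)

    def predict(self, ifa):
        return self.table[self._index(ifa)] >= self.taken_threshold

    def update(self, ifa, taken):
        i = self._index(ifa)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_state)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


def count_mispredictions(predictor, ifa, outcomes):
    """Predict, compare with the actual outcome, then update, for each outcome in turn."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken 9 times, then not taken; the loop is executed 10 times.
loop_branch = ([True] * 9 + [False]) * 10
one_bit = count_mispredictions(SaturatingCounterPredictor(bits=1), 0x400, loop_branch)
two_bit = count_mispredictions(SaturatingCounterPredictor(bits=2), 0x400, loop_branch)
print(one_bit, two_bit)
```

In this setup the 1-bit scheme mispredicts twice per loop execution (at loop exit and again at re-entry), whereas the 2-bit scheme, once warmed up, mispredicts only at loop exit.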

2.2. Local prediction (5)
Accessing BHTs/BTACs
  Indexed access: index bits of the IFA select a counter (C) in the BHT directly. For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interference and thus in degraded prediction accuracy.
    Examples: 16K-entry local BHT (Power4), 16K-entry global BHT (Power4), 16K-entry selector table (Power4)
  Cache-like access (direct / set associative): index bits of the IFA select a set, tag bits are compared with the stored tags (e.g. two-way set associative). Reduces interference but increases cost.
    Examples: 128*4-way BHT/BTAC (Pentium Pro), 1K*4-way BHT/BTAC (Pentium II, III, 4), 128*2-way BTAC (Power3)
  Associative access: the IFA is compared with all stored IFAs. Avoids interference but strongly increases cost.
    Example: 64-entry BTAC (PPC 604)
Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers

2.2. Local prediction (9)
1-level (local) prediction
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): displacement-based, opcode-based
  Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction, 3-bit prediction
Processors shown: 80486 (1989), Pentium (1993), MC 68040 (1990), MC 68060 (1993), SuperSparc (1992), UltraSparc (1995), R4000 (1992), POWER1 (1990), POWER2 (1993), R8000 (1994), PPC 601 (1993), R10000 (1996), PPC 604 (1995), PPC 620 (1996)
PPC: PowerPC
Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1st and 2nd generation superscalars

2.2. Local prediction (10)
Local prediction
  1-level
    Fixed prediction: always the same prediction
    Static prediction: based on the object code
    Dynamic prediction: based on the execution history
  2-level

2.2. Local prediction (11)
2-level local branch prediction
2-level local prediction (1st level: branch patterns, 2nd level: history bits)
  Shared counters: a shared counter table for all patterns (Alpha 21264)
    The IFA selects an entry of the local BHT (e.g. 1K × 10 bit), which holds the pattern of the branch (e.g. 1100101001); the pattern selects a counter in a second table (e.g. 1K × 3 bit, e.g. 101). The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.
  Individual counters: individual counter tables for the different patterns of each branch (Pentium Pro)
    The IFA selects an entry of the local BHT (e.g. 128 × 4 bit, 4 ways each), which holds the pattern of the branch (e.g. 0110); the pattern selects one counter of the entry's own counter table (e.g. 16 × 2 bit, e.g. 10).
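The shared-counter variant can be sketched in a few lines. The code below is an illustration only, using the table sizes quoted for the Alpha 21264 (1K × 10-bit local histories, 1K × 3-bit counters whose most significant bit gives the prediction). The class name `TwoLevelLocalPredictor`, the indexing by low-order IFA bits, the assumption of 4-byte instructions and the all-zero initial state are assumptions, not details taken from the processor.

```python
class TwoLevelLocalPredictor:
    """Two-level local prediction with one counter table shared by all branches."""

    def __init__(self, history_entries=1024, history_bits=10, counter_bits=3):
        self.history_mask = (1 << history_bits) - 1
        self.histories = [0] * history_entries       # 1st level: per-branch outcome pattern
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)
        self.counters = [0] * (1 << history_bits)    # 2nd level: one counter per pattern

    def _slot(self, ifa):
        # Assumption: 4-byte instructions, low-order address bits select the history entry.
        return (ifa >> 2) % len(self.histories)

    def predict(self, ifa):
        pattern = self.histories[self._slot(ifa)]
        return bool(self.counters[pattern] & self.msb)   # MSB of the counter = prediction

    def update(self, ifa, taken):
        slot = self._slot(ifa)
        pattern = self.histories[slot]
        if taken:
            self.counters[pattern] = min(self.counters[pattern] + 1, self.counter_max)
        else:
            self.counters[pattern] = max(self.counters[pattern] - 1, 0)
        # Shift the actual outcome into the branch's local history.
        self.histories[slot] = ((pattern << 1) | int(taken)) & self.history_mask


# A branch that strictly alternates taken / not taken defeats a plain 2-bit counter,
# but its pattern is learned here once the history has filled.
predictor = TwoLevelLocalPredictor()
for i in range(200):
    predictor.update(0x400, i % 2 == 0)

hits = 0
for i in range(200, 220):
    taken = i % 2 == 0
    hits += predictor.predict(0x400) == taken
    predictor.update(0x400, taken)
print(hits, "of 20 predicted correctly")
```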

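2.2. Local prediction (12)
Figure 2.9: The principle of Pentium Pro's 128 x 4 way set associative BHT. The linear BTA is split into a tag and a 7-bit index (bits 6 to 0) that selects one of the sets 0 to 127. Each of the four ways (Way 0 to Way 3) holds a tag, a 4-bit history and a table of 16 two-bit counters (entries 0 to 15) selected by the history; counter values xx: 00/01 not taken, 10/11 taken.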

2.2. Local prediction (13)
Figure 2.10: The actual layout of Pentium Pro's 128 x 4 way set associative BHT (sets 0 to 127; each entry consists of a tag, a history field H and a counter field C)

2.3. Global prediction (1)
Basic branch prediction mechanism
  Processor based: Local, Global (2-level), Combined (Choice prediction)
  Compiler hints

2.3. Global prediction (1)
Global prediction
  Simple global

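2.3. Global prediction (1)
The global history, a shift register holding the outcomes of the most recently executed branches (e.g. 0 1 1 0 0 1 1), selects the BHT entry x that holds the branch history used for the prediction.
Figure 2.11: Simple global prediction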

2.3. Global prediction (1)
Global prediction
  Simple global
  Gshare

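2.3. Global prediction (1)
The global history (e.g. 0 1 1 0 0 1 1) is XORed with bits of the IFA (e.g. ...1 0 0 1 1 0 0); the result selects the BHT entry x that holds the branch history used for the prediction.
Figure 2.12: Principle of the Gshare prediction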
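A minimal sketch of the Gshare principle of Figure 2.12 follows. It assumes a 12-bit global history, a table of 2-bit saturating counters initialised to "weakly not taken", and 4-byte instructions; the class name `GsharePredictor` and all sizes are illustrative assumptions rather than the parameters of any particular processor.

```python
class GsharePredictor:
    """Global history XORed with IFA bits selects a 2-bit saturating counter."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global history shift register
        self.counters = [1] * (1 << index_bits)   # assumption: start "weakly not taken"

    def _index(self, ifa):
        # Assumption: 4-byte instructions, so the two lowest address bits are dropped.
        return ((ifa >> 2) ^ self.history) & self.mask

    def predict(self, ifa):
        return self.counters[self._index(ifa)] >= 2

    def update(self, ifa, taken):
        i = self._index(ifa)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


# A branch with the repeating outcome pattern taken, taken, not taken.
predictor = GsharePredictor()
for i in range(60):
    predictor.update(0x1000, i % 3 != 2)

hits = 0
for i in range(60, 90):
    taken = i % 3 != 2
    hits += predictor.predict(0x1000) == taken
    predictor.update(0x1000, taken)
print(hits, "of 30 predicted correctly")
```

XORing the history with the address lets a single table be shared by all branches while reducing the chance that two branches with the same recent history collide on one counter.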

2.3. Global prediction (1)
Global prediction
  Simple global
  Gshare
  Gselect

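2.3. Global prediction (1)
The global history (e.g. 0 1 1 0 0 1 1) and bits of the IFA (e.g. ...1 0 1 1 0) are concatenated; the result selects the BHT entry x that holds the branch history used for the prediction.
Figure 2.13: Principle of the Gselect prediction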
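For comparison with Gshare, the Gselect index of Figure 2.13 can be sketched as a plain concatenation of history bits and address bits. The function name `gselect_index`, the bit widths, the choice to place the history in the upper part of the index, and the assumption of 4-byte instructions are all illustrative assumptions.

```python
def gselect_index(ifa, history, addr_bits=6, hist_bits=6):
    """Concatenate hist_bits of global history with addr_bits of the IFA."""
    addr_part = (ifa >> 2) & ((1 << addr_bits) - 1)   # assumption: 4-byte instructions
    hist_part = history & ((1 << hist_bits) - 1)
    return (hist_part << addr_bits) | addr_part       # history forms the upper index bits


# With 3 history bits and 4 address bits the index is 7 bits wide.
index = gselect_index(0x1234, 0b101, addr_bits=4, hist_bits=3)
print(bin(index))
```

For a fixed table size, every bit given to the history is a bit taken away from the address, which is the trade-off that Gshare's XOR hashing avoids.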

2.4. Combined prediction (1)
Basic branch prediction mechanism
  Processor based: Local, Global (2-level), Combined (Choice prediction)
  Compiler hints

2.4. Combined prediction (2)
The IFA accesses the local BHT, which delivers the local prediction; the global history accesses the global BHT, which delivers the global prediction. A third table, the best choice BHT, selects which of the two becomes the resulting prediction. The actual outcome of the branch is fed back for updating the tables.
Figure 2.14: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER4)

2.4. Combined prediction (3)
Combined prediction: Alpha 21264
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
  2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits)
  Choice: global history referenced choice table (12-bit global history / 4K * 2 bits)
Figure 2.15: Implementation alternatives of the combined prediction

2.4. Combined prediction (4)
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
Figure 2.16: The combined predictor of the Alpha 21264
Source: Microprocessor Report, 10/28/96

2.4. Combined prediction (5)
Combined prediction: Alpha 21264
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
  2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits)
  Choice: global history referenced choice table (12-bit global history / 4K * 2 bits)
Combined prediction: POWER4
  1. prediction: 1-level local dynamic prediction (16K * 1 bit)
  2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
  Choice: accessed in the same way as the global counter table (16K * 1 bit)
Figure 2.17: Implementation alternatives of the combined prediction

2.4. Combined prediction (6)
Figure 2.18: The principle of the combined predictor of the POWER4. The 11-bit global history (1 bit per group) is XORed with bits of the IFA. Three tables of 16K * 1 bit each are accessed with 14-bit indices: the local history table (BHT), which gives the local prediction, the global history table, which gives the global prediction, and the selector table, which selects the better of the two. All three are updated with the outcome.

2.5. Overview of the basic branch prediction mechanisms
Figure 2.20: Trends of branch prediction schemes used in 2nd and 3rd generation superscalars

3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
  Backup use of static prediction
Processors shown: Pentium¹, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER3, POWER4, POWER5, Alpha 21164¹, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III
1: 1st generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack
Figure 3.1: Overview of auxiliary branch prediction mechanisms in 2nd and 3rd generation superscalars

Figure 3.2: Static branch prediction algorithm of the Pentium Pro
Source: Shanley T., "Pentium Processor System Architecture", Addison-Wesley Developers Press, 1996

3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
  Backup use of static prediction
  Preemptive use of compiler hints
  Dedicated prediction: RAS
Processors shown: Pentium¹, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER3, POWER4², POWER5², Alpha 21164¹, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III
RAS sizes given: K6 (16 entries), K7 (12 entries), K8 (12 entries), Alpha 21164 (12 entries), Alpha 21264 (32 entries), UltraSPARC-III (8 entries)
1: 1st generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack
Figure 3.1: Overview of auxiliary branch prediction mechanisms in 2nd and 3rd generation superscalars

Return Address Stack (RAS)
RAS: PUSH the return address on a CALL, POP the return address on a RET. The RAS is used to continue execution speculatively from the popped return address.
Architectural stack (with preserved sequential consistency): PUSH the return address on a CALL, POP the return address on a RET.
The problem of RASs:
A procedure, such as printf(), might be called from many different locations, so there are many different return addresses. During speculative out-of-order execution, however, the logical sequence of the related PUSH and POP operations (CALL and RET instructions) may be disturbed, so the predicted return address may be wrong.
To check the prediction the RET instruction is executed, and on a misprediction a repair mechanism is activated (to cancel wrongly executed instructions and to repair the corrupted RAS).
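The push and pop behaviour of a RAS can be sketched with a small bounded stack. The class name `ReturnAddressStack`, the default depth of 12 entries, the policy of dropping the oldest entry on overflow and the use of `None` for "no prediction" are assumptions made for illustration; the repair mechanism mentioned above is not modelled.

```python
class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, entries=12):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        """PUSH the return address when a CALL is fetched."""
        if len(self.stack) == self.entries:
            self.stack.pop(0)            # assumption: overflow discards the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        """POP the predicted return address when a RET is fetched (None if empty)."""
        return self.stack.pop() if self.stack else None


# printf() called from two different places: each RET gets its own return address.
ras = ReturnAddressStack()
ras.on_call(0x1004)            # first call site
first = ras.on_return()
ras.on_call(0x2008)            # second call site
second = ras.on_return()
print(hex(first), hex(second))
```

A BTAC entry for the RET would hold only one target, so it would mispredict whenever the call site changes; the stack avoids this as long as CALLs and RETs are seen in matching order.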

3. Auxiliary branch prediction mechanisms
Auxiliary branch prediction mechanisms
  Backup use of static prediction
  Preemptive use of compiler hints
  Dedicated prediction: RAS, loop detector, indirect branch prediction
Processors shown: Pentium¹, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER3, POWER4², POWER5², Alpha 21164¹, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III
RAS sizes given: K6 (16 entries), K7 (12 entries), K8 (12 entries), Alpha 21164 (12 entries), Alpha 21264 (32 entries), UltraSPARC-III (8 entries)
1: 1st generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack
Figure 3.1: Overview of auxiliary branch prediction mechanisms in 2nd and 3rd generation superscalars

4. Accessing the branch target path (1)
4.1. Overview
BTA
  Calculated on the fly
Figure 4.1: Alternatives to generate the BTA

Figure 4.2: Principle of calculating the BTA on the fly. The IFAR supplies the instruction fetch address (IFA) to the I-cache. The next IFA is either the incremented IFA (IIFA, the sequential address) or the BTA computed from the fetched branch, so fetching continues with I, I+1, I+2, I+3 or with BTI, BTI+1, BTI+2, BTI+3.
This scheme is employed in earlier scalar (pipelined) processors as well as in a number of superscalar processors, such as:
  Z80000 (1984), i486 (1989), MC 68040 (1990)
  Sparc CY7C601 (1988), SuperSparc (1992p)
  PowerPC 601 (1993), 603 (1993), Power1 (1990), Power2 (1993), POWER4 (2001), POWER5 (2005)
  21064 (1992), 21064A (1994), 21164 (1995)
  R4000 (1992), R10000 (1996)
  UltraSPARC III (2003)
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303

4. Accessing the branch target path (1)
4.1. Overview
BTA
  Calculated on the fly
  Accessed from the BTAC
Figure 4.1: Alternatives to generate the BTA

Figure 4.3: Principle of the BTAC scheme to access the branch target path. The IFAR supplies the instruction fetch address (IFA) to the I-cache and to the BTAC in parallel. The next IFA is either the incremented IFA (IIFA, the sequential address) or the branch target address (BTA) read from the BTAC, so fetching continues with I, I+1, I+2, I+3 or with BTI, BTI+1, BTI+2, BTI+3.
The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read from the BTAC when the instruction immediately preceding a branch is fetched. (The addresses of these instructions are designated as BA-1.)

Figure 4.4: The principle of branch prediction using both a BHT and a BTAC (C: counter) (designated as BTB (Branch Target Buffer) by Intel). The IFA accesses the I$, the BHT (counters C) and the BTAC (tags, BTAs) in parallel; fetched instructions go to the IB for further processing.
Next IFA: the BTA if the BTAC hits, the IIFA if the BTAC misses, a corrected address on a misprediction.
Updates: the BHT is updated with the branch result; the BTAC is updated with the BTA (a BTAC entry is created or deleted) if the BHT initiates it.
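The next-address selection of Figure 4.4 can be sketched with a single table that merges the roles of BHT and BTAC. This is a deliberate simplification and not Intel's BTB: the table is modelled as fully associative (a dictionary keyed by the IFA), each entry holds a BTA and a 2-bit counter, entries are created only when a branch is first taken, and instructions are assumed to be 4 bytes long. The class name `BhtBtac` is hypothetical.

```python
class BhtBtac:
    """Fully associative table mapping a branch's IFA to its BTA and a 2-bit counter."""

    def __init__(self):
        self.entries = {}   # IFA -> {"bta": address, "counter": 0..3}

    def next_ifa(self, ifa, step=4):
        """Return the BTA on a hit that is predicted taken, else the sequential address."""
        entry = self.entries.get(ifa)
        if entry is not None and entry["counter"] >= 2:
            return entry["bta"]
        return ifa + step   # BTAC miss or predicted not taken: incremented IFA

    def update(self, ifa, taken, bta):
        """Record the resolved branch; assumption: entries are created on first taken."""
        entry = self.entries.get(ifa)
        if entry is None:
            if taken:
                self.entries[ifa] = {"bta": bta, "counter": 2}
            return
        if taken:
            entry["bta"] = bta
            entry["counter"] = min(entry["counter"] + 1, 3)
        else:
            entry["counter"] = max(entry["counter"] - 1, 0)


btb = BhtBtac()
miss = btb.next_ifa(0x100)           # no entry yet: sequential fetch
btb.update(0x100, True, 0x400)       # branch resolved as taken to 0x400
hit = btb.next_ifa(0x100)            # now redirected without waiting for decode
print(hex(miss), hex(hit))
```

The benefit of the scheme is that the redirect happens in the fetch stage, before the branch has even been decoded.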

Processor; number of BTAC entries; implementation of the BTAC
  ES/9000 520-based procs (1992p); 4K; 2-way associative
  Pentium (1994); 256; fully associative
  Pentium Pro; 512; 4-way associative
  Pentium 4; 4K; 4-way associative
  MC 68060 (1993); 256; 4-way associative
  R8000 (1994)¹; 1K
  PA 8000 (1995); 32; fully associative
  PowerPC 604 (1994); 64; fully associative
  PowerPC 620 (1995); 256; fully associative
1: Each entry is shared among 4 instructions
Figure 4.5: Examples of processors using the BTAC scheme

Figure 4.6: The physical implementation of branch prediction in Intel's P4 Northwood and Prescott cores
Source: de Vries H., "Looking at Intel's Prescott die, part II.", http://www.chip-architect.com, April 2003

4. Accessing the branch target path (1)
4.1. Overview
BTA
  Calculated on the fly
  Accessed
    From BTAC
    From the I$
Figure 4.1: Alternatives to generate the BTA

Figure 4.7: Principle of the BTIC scheme to access the branch target path. The IFAR supplies the instruction fetch address (IFA) to the I-cache and to the BTIC in parallel; the selected instruction is passed on to decoding.
The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and selected for decoding instead of the instruction from the I-cache. The address of the subsequent instruction along the taken path is also read from the BTIC and becomes the next IFA.
Examples: Gmicro/200 (1988), AM 29000 (1988), MC 88110 (1993).

4. Accessing the branch target path (1)
4.1. Overview
BTA
  Calculated on the fly. Examples: UltraSPARC III, K6, Power4, 5
  Accessed
    From BTAC. Examples: PPro/PIII/P4, K7/K8, Power3
    From the I$. Example: 21264
Figure 4.8: Trends to generate the BTA

4.2. Case example 1: K7 (1)
To each 16-byte fetch block a 16-bit selector block is allocated (bytes 15 to 0 of the fetch block correspond to bits 15 to 0 of the selector block).
The selector block identifies the branches included in the associated fetch block. Two bits of the selector block correspond to two bytes of the fetch block.
RETs are a single byte long; all other branches are at least two bytes long. Assuming at most a single RET in the fetch block, there may be at most one branch ending in any pair of bytes.
In a fetch block, there may be up to a single RET and two non-RET branches. More branches in a fetch block lead to conflicts in the prediction logic.

4.2. Case example 1: K7 (2)
Each two-bit entry indicates whether or not there is a branch ending in the corresponding two bytes of the fetch block; if there is, it also identifies the type of the branch.
A branch instruction that crosses the 16-byte boundary is attributed to the second 16-byte window.
Coding of the two bits (assumed):
  00: no branch
  01: RET
  10: there is a conditional branch whose target address is in the BTA0 field of the BTAC
  11: there is a conditional branch whose target address is in the BTA1 field of the BTAC

4.2. Case example 1: K7 (3)
Characteristic examples of selector settings (selector pattern as shown; content; next fetch address):
  xx 00 00; no branch; IFA+16
  xx 00 01 00 00 00; a RET instruction; return address of the RET
  xx 00 00 00 10 00 00 00; a conditional branch (its BTA is in the BTA0 field of the BTAC); BTA0 if taken, else IFA+16
  xx 00 00 11 00 00; two conditional branches BC1 and BC2 (their BTAs are in the BTA0 and BTA1 fields of the BTAC); BTA0 if BC1 is taken, else BTA1 if BC2 is taken, else IFA+16
During predecoding, instruction boundaries as well as branch instructions are detected and the appropriate selector entries are marked accordingly.
Predecoding is performed at no more than 4 bytes/cycle.
If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated selector blocks are invalidated.
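How a selector block might be interpreted can be sketched as follows. The sketch relies on the two-bit coding that the slides themselves mark as assumed (00 no branch, 01 RET, 10 BTA0, 11 BTA1), and on a further assumption that entry k occupies selector bits 2k+1 and 2k and covers bytes 2k and 2k+1 of the fetch block. The names `KINDS`, `decode_selector` and `first_branch` are hypothetical.

```python
# Assumed coding of one two-bit selector entry.
KINDS = {0b00: "none", 0b01: "ret", 0b10: "bta0", 0b11: "bta1"}


def decode_selector(selector):
    """Split a 16-bit selector block into 8 entries, one per byte pair of the fetch block."""
    return [KINDS[(selector >> (2 * k)) & 0b11] for k in range(8)]


def first_branch(selector, start_byte=0):
    """Return (byte pair, kind) of the first branch at or after start_byte, or None."""
    entries = decode_selector(selector)
    for k in range(start_byte // 2, 8):
        if entries[k] != "none":
            return k, entries[k]
    return None   # no branch left in the block: fetch continues at IFA+16


# A block with a conditional branch (BTA0) in byte pair 1 and a RET in byte pair 5.
selector = (0b10 << 2) | (0b01 << 10)
print(decode_selector(selector))
print(first_branch(selector))                  # entering at byte 0
print(first_branch(selector, start_byte=4))    # entering behind the conditional branch
```

The entry point matters because a fetch block may be entered in the middle, at IFA[3:0], after an earlier branch has jumped into it.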

4.2. Case example 1: K7 (4)
The selector table is shared between the upper and lower part of the I$, and an extra address bit (A) identifies whether the entry belongs to the upper or the lower part of the I$.
Source: Kaiser, A., "K7 Branch Prediction", Dec. 1999, http://www.s.netic.de

4.2. Case example 1: K7 (5)
Figure 4.9: Assumed simplified scheme of accessing the branch target path in the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way)
Components shown: the IFAR; the 2-way set associative I$ (Way 0 and Way 1, 1K * 16 B fetch blocks each, with tags, delivering a 16 B fetch block plus predecode bits); the BTAC (1K x 2 addresses: BTA0 and BTA1 per entry, written by the fetch unit during predecoding); the selector table (16-bit selector blocks, shared for the upper and lower parts of the I$); the 12-entry stack supplying the RET address; and the +16 incrementer.
Next IFA selection: sequential (no branch); BTA0 or BTA1, taken or not according to the global prediction (conditional branch) or always taken (unconditional branch); RET address (RET). Instructions are decoded and issued beginning with the given address.

4.2. Case example 2: K8 (1)
The K8 doubled the size of the selector table, so each fetch block has its own selector entry.
The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per fetch block; the coding of the selector entries is modified accordingly.
When instruction cache lines are evicted to the L2 cache, branch selectors and predecode information are also stored in the L2 cache.
The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least significant bits to identify the next address.
Each BTA entry identifies the least significant 15 bits of the IFA as well as 3 bits of additional information: the old IFA (bits 16, 15) and the W bit (way identifier).

4.2. Case example 2: K8 (2)
Figure 4.10: Assumed simplified scheme of accessing the branch target path in the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address)
Components shown: the IFAR; the 2-way set associative I$ (Way 0 and Way 1, 1K * 16 B fetch blocks each, with tags, delivering a 16 B fetch block plus predecode bits); the BTAC (512 x 4 addresses: BTA0, BTA1, BTA2 and one further field per entry, written by the BTA calculator during predecoding); the selector table (16-bit selector blocks); the 12-entry stack supplying the RET address; and the +16 incrementer.
Next IFA selection: sequential (no branch); BTA0/RET, BTA1/RET or BTA2/RET, taken or not according to the global prediction (conditional branch) or always taken (unconditional branch); RET address. Instructions are decoded and issued beginning with the given address.

4.2. Case example 2: K8 (3)
Figure 4.11: Logical view of Opteron's (K8's) instruction fetch and decode stages
Source: de Vries H., "Understanding the detailed Architecture of AMD's 64 bit Core", http://www.chip-architect.com, Sept. 2003

4.2. Case example 2: K8 (4)
Figure 4.12: Physical implementation of Opteron's (K8's) instruction cache and decoding
Source: de Vries H., "Understanding the detailed Architecture of AMD's 64 bit Core", http://www.chip-architect.com, Sept. 2003