COMPUTER ARCHITECTURE CS 6354: Branch Prediction Samira Khan University of Virginia Mar 31, 2016 The content and concept of this course are adapted from CMU ECE 740
AGENDA • Logistics • Review from last lecture • Unifying memory and storage • Branch prediction 2
LOGISTICS • Grades available at Collab • Get graded papers from Elaheh – Monday April 4th, 6:30-7:30 pm 3
SOLUTIONS TO THE DRAM SCALING PROBLEM • Two potential solutions – Tolerate DRAM (by taking a fresh look at it) – Enable emerging memory technologies to eliminate/minimize DRAM • Do both – Hybrid memory systems 4
Executive Summary • Observations – DRAM timing parameters are dictated by the worst-case cell (smallest cell across all products at highest temperature) – DRAM operates at lower temperature than the worst case • Idea: Adaptive-Latency DRAM – Optimizes DRAM timing parameters for the common case (typical DIMM operating at low temperatures) • Analysis: Characterization of 115 DIMMs – Great potential to lower DRAM timing parameters (17-54%) without any errors • Real System Performance Evaluation – Significant performance improvement (14% for memory-intensive workloads) without errors (33 days) 5
SOLUTION 2: EMERGING MEMORY TECHNOLOGIES • Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile) • Example: Phase Change Memory – Expected to scale to 9 nm (2022 [ITRS]) – Expected to be denser than DRAM: can store multiple bits/cell • But, emerging technologies have shortcomings as well – Can they be enabled to replace/augment/surpass DRAM? 7
HYBRID MEMORY SYSTEMS [Figure: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy] Hardware/software manage data allocation and movement to achieve the best of multiple technologies 8
OTHER OPPORTUNITIES WITH EMERGING TECHNOLOGIES • Merging of memory and storage – e.g., a single interface to manage all data • New applications – e.g., ultra-fast checkpoint and restore • More robust system design – e.g., reducing data loss • Processing tightly-coupled with memory – e.g., enabling efficient search and filtering 9
COORDINATED MEMORY AND STORAGE WITH NVM (I) • The traditional two-level storage model is a bottleneck with NVM – Volatile data in memory: accessed via a load/store interface – Persistent data in storage: accessed via a file system interface (fopen, fread, fwrite, …) – Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data become performance and energy bottlenecks with fast NVM stores [Figure: two-level store – the processor and caches reach main memory through virtual memory and address translation, and reach persistent (e.g., phase-change) storage (SSD/HDD) through the OS and file system] 10
COORDINATED MEMORY AND STORAGE WITH NVM (II) • Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data – Improves both energy and performance – Simplifies the programming model as well [Figure: unified memory/storage – the processor and caches issue loads/stores to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback] Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013. 11
THE PERSISTENT MEMORY MANAGER (PMM) • Exposes a load/store interface to access persistent data – Applications can directly access persistent memory: no conversion, translation, or location overhead for persistent data • Manages data placement, location, persistence, security – To get the best of multiple forms of storage • Manages metadata storage and retrieval – This can lead to overheads that need to be managed • Exposes hooks and interfaces for system software – To enable better data placement and management decisions • Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013. 12
THE PERSISTENT MEMORY MANAGER (PMM) [Figure: persistent objects] PMM uses access and hint information to allocate, migrate, and access data in the heterogeneous array of devices 13
ENERGY BENEFITS OF A SINGLE-LEVEL STORE [Figure: ~24X and ~5X energy reductions] Results for PostMark 14
ENERGY BENEFITS OF A SINGLE-LEVEL STORE [Figure: ~16X and ~5X energy reductions] Results for PostMark 15
ENABLING AND EXPLOITING NVM: ISSUES • Many issues and ideas from the technology layer to the algorithms layer • Enabling NVM and hybrid memory – How to tolerate errors? – How to enable secure operation? – How to tolerate performance and power shortcomings? – How to minimize cost? • Exploiting emerging technologies – How to exploit non-volatility? – How to minimize energy consumption? – How to exploit NVM on chip? [Figure: the computing stack – Problems, Algorithms, Programs, User, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Devices] 16
THREE PRINCIPLES FOR (MEMORY) SCALING • Better cooperation between devices and the system – Expose more information about devices to upper layers – More flexible interfaces • Better-than-worst-case design – Do not optimize for the worst case – Worst case should not determine the common case • Heterogeneity in design (specialization, asymmetry) – Enables a more efficient design (No one size fits all) • These principles are related and sometimes coupled 17
SUMMARY OF EMERGING MEMORY TECHNOLOGIES • Key trends affecting main memory – End of DRAM scaling (cost, capacity, efficiency) – Need for high capacity – Need for energy efficiency • Emerging NVM technologies can help – PCM or STT-MRAM more scalable than DRAM and non-volatile – But, they have shortcomings: latency, active energy, endurance • We need to enable promising NVM technologies by overcoming their shortcomings • Many exciting opportunities to reinvent main memory at all layers of computing stack 18
CONTROL DEPENDENCE: BRANCH PREDICTION 19
REVIEW: BRANCH TYPES

Type          | Direction at fetch time | Number of possible next fetch addresses | When is next fetch address resolved?
Conditional   | Unknown                 | 2    | Execution (register dependent)
Unconditional | Always taken            | 1    | Decode (PC + offset)
Call          | Always taken            | 1    | Decode (PC + offset)
Return        | Always taken            | Many | Execution (register dependent)
Indirect      | Always taken            | Many | Execution (register dependent)

Different branch types can be handled differently 20
REVIEW: HOW TO HANDLE CONTROL DEPENDENCES • Critical to keep the pipeline full with the correct sequence of dynamic instructions • Potential solutions if the instruction is a control-flow instruction: – Stall the pipeline until we know the next fetch address – Guess the next fetch address (branch prediction) – Employ delayed branching (branch delay slot) – Do something else (fine-grained multithreading) – Eliminate control-flow instructions (predicated execution) – Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution) 21
REVIEW: BRANCH PREDICTION • Idea: Predict the next fetch address (to be used in the next cycle) • Requires three things to be predicted at fetch stage: – Whether the fetched instruction is a branch – (Conditional) branch direction – Branch target address (if taken) • Observation: Target address remains the same for a conditional direct branch across dynamic instances – Idea: Store the target address from previous instance and access it with the PC – Called Branch Target Buffer (BTB) or Branch Target Address Cache 23
REVIEW: FETCH STAGE WITH BTB [Figure: the address of the current instruction (program counter) indexes both a direction predictor (2-bit counters) and a cache of target addresses (BTB: Branch Target Buffer); on a BTB hit with a taken prediction, the next fetch address is the target address, otherwise it is PC + inst size] Always-taken CPI = [1 + (0.20*0.30) * 2] = 1.12 (70% of branches taken) 24
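Every CPI figure quoted on these slides follows the same simple model: a base CPI of 1, plus the misprediction penalty weighted by how often it is paid. A minimal Python sketch of that arithmetic (the numbers are the slide's own: 20% branch frequency and a 2-cycle flush penalty; the function name is illustrative):

```python
def cpi_with_branches(branch_frac, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """CPI model used on these slides: each mispredicted branch adds a
    fixed pipeline-flush penalty on top of the base CPI of 1."""
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles

# Always-taken predictor: 20% branches, 30% mispredicted, 2-cycle penalty
print(cpi_with_branches(0.20, 0.30, 2))  # ~1.12
# Last-time predictor (85% accurate) and 2-bit counter (90% accurate)
print(cpi_with_branches(0.20, 0.15, 2))  # ~1.06
print(cpi_with_branches(0.20, 0.10, 2))  # ~1.04
```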
SIMPLE BRANCH DIRECTION PREDICTION SCHEMES • Compile time (static) – Always not taken – Always taken – BTFN (Backward taken, forward not taken) – Profile based (likely direction) • Run time (dynamic) – Last time prediction (single-bit) 25
MORE SOPHISTICATED DIRECTION PREDICTION • Compile time (static) – Always not taken – Always taken – BTFN (Backward taken, forward not taken) – Profile based (likely direction) – Program analysis based (likely direction) • Run time (dynamic) – Last time prediction (single-bit) – Two-bit counter based prediction – Two-level prediction (global vs. local) – Hybrid 26
STATIC BRANCH PREDICTION (I) • Always not-taken – Simple to implement: no need for BTB, no direction prediction – Low accuracy: ~30-40% – Compiler can lay out code such that the likely path is the "not-taken" path • Always taken – No direction prediction – Better accuracy: ~60-70% • Backward branches (i.e., loop branches) are usually taken • Backward branch: target address lower than branch PC • Backward taken, forward not taken (BTFN) – Predict backward (loop) branches as taken, others not-taken 27
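BTFN needs only the sign of the branch displacement, which is available at decode time. A hypothetical sketch in Python (the function name and the example PC values are illustrative, not from the slides):

```python
def btfn_predict(branch_pc, target_pc):
    """Backward Taken, Forward Not-taken: a backward branch (target at a
    lower address than the branch itself) is usually a loop back-edge,
    so predict it taken; predict forward branches not-taken."""
    return target_pc < branch_pc  # True means predict taken

# A loop back-edge at 0x400 jumping back to 0x3F0 is predicted taken;
# a forward skip to 0x450 is predicted not-taken.
print(btfn_predict(0x400, 0x3F0))  # True
print(btfn_predict(0x400, 0x450))  # False
```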
STATIC BRANCH PREDICTION (II) • Profile-based – Idea: Compiler determines the likely direction for each branch using a profile run. Encodes that direction as a hint bit in the branch instruction format. + Per-branch prediction (more accurate than the schemes on the previous slide); accurate if the profile is representative! -- Requires hint bits in the branch instruction format -- Accuracy depends on dynamic branch behavior: TTTTTNNNNN 50% accuracy, TNTNTNTNTN 50% accuracy -- Accuracy depends on the representativeness of the profile input set 28
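The two 50% figures can be checked by simulation: a profile hint fixes one direction per branch for the whole run, so the best it can do is the branch's majority direction. A small Python sketch (not from the slides; the hint is chosen as the majority outcome, which is the best case for a representative profile):

```python
def static_hint_accuracy(outcomes):
    """Best-case accuracy of a profile-based static hint: one fixed
    direction per branch, chosen as the majority direction observed."""
    hint = max(set(outcomes), key=outcomes.count)  # majority direction
    return sum(o == hint for o in outcomes) / len(outcomes)

# Both patterns on the slide are 50% taken, so a fixed hint gets 50%:
print(static_hint_accuracy("TTTTTNNNNN"))  # 0.5
print(static_hint_accuracy("TNTNTNTNTN"))  # 0.5
# A heavily biased branch is where static hints shine:
print(static_hint_accuracy("TTTTTTTTTN"))  # 0.9
```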
STATIC BRANCH PREDICTION (III) • Program-based (or, program-analysis-based) – Idea: Use heuristics based on program analysis to determine the statically-predicted direction – Opcode heuristic: Predict BLEZ as NT (negative integers are used as error values in many programs) – Loop heuristic: Predict a branch guarding a loop execution as taken (i.e., execute the loop) – Pointer and FP comparisons: Predict not equal + Does not require profiling -- Heuristics might not be representative or good -- Requires compiler analysis and ISA support • Ball and Larus, "Branch prediction for free," PLDI 1993. – 20% misprediction rate 29
STATIC BRANCH PREDICTION (IV) • Programmer-based – Idea: Programmer provides the statically-predicted direction – Via pragmas in the programming language that qualify a branch as likely-taken versus likely-not-taken + Does not require profiling or program analysis + Programmer may know some branches and their program better than other analysis techniques -- Requires programming language, compiler, and ISA support -- Burdens the programmer? 30
STATIC BRANCH PREDICTION • All previous techniques can be combined – Profile based – Programmer based • How would you do that? • What are common disadvantages of all three techniques? – Cannot adapt to dynamic changes in branch behavior • This can be mitigated by a dynamic compiler, but not at a fine granularity (and a dynamic compiler has its overheads…) 31
DYNAMIC BRANCH PREDICTION • Idea: Predict branches based on dynamic information (collected at run-time) • Advantages + Prediction based on history of the execution of branches + It can adapt to dynamic changes in branch behavior + No need for static profiling: input set representativeness problem goes away • Disadvantages -- More complex (requires additional hardware) 32
LAST TIME PREDICTOR • Last time predictor – Single bit per branch (stored in BTB) – Indicates which direction the branch went the last time it executed TTTTTNNNNN 90% accuracy • Always mispredicts the last iteration and the first iteration of a loop branch – Accuracy for a loop with N iterations = (N-2)/N + Works well for loop branches of loops with a large number of iterations -- Works poorly for loop branches of loops with a small number of iterations TNTNTNTNTN 0% accuracy Last-time predictor CPI = [1 + (0.20*0.15) * 2] = 1.06 (Assuming 85% accuracy) 33
IMPLEMENTING THE LAST-TIME PREDICTOR [Figure: the PC's index bits select a BTB entry; an N-bit tag table is compared against the PC's tag bits for a hit, and one taken? bit per BTB entry selects between the BTB target and PC+4 as the next PC] The 1-bit BHT (Branch History Table) entry is updated with the correct outcome after each execution of a branch 34
STATE MACHINE FOR LAST-TIME PREDICTION [Figure: two states, "predict taken" and "predict not-taken"; "actually taken" moves to or stays in predict taken, "actually not taken" moves to or stays in predict not-taken] 35
IMPROVING THE LAST TIME PREDICTOR • Problem: A last-time predictor changes its prediction from T→NT or NT→T too quickly – even though the branch may be mostly taken or mostly not taken • Solution Idea: Add hysteresis to the predictor so that the prediction does not change on a single different outcome – Use two bits to track the history of predictions for a branch instead of a single bit – Can have 2 states for T or NT instead of 1 state for each • Smith, "A Study of Branch Prediction Strategies," ISCA 1981. 36
TWO-BIT COUNTER BASED PREDICTION • Each branch associated with a two-bit counter • One more bit provides hysteresis • A strong prediction does not change with one single different outcome • Accuracy for a loop with N iterations = (N-1)/N TNTNTNTNTN 50% accuracy (assuming init to weakly taken) + Better prediction accuracy -- More hardware cost (but the counter can be part of a BTB entry) 2BC predictor CPI = [1 + (0.20*0.10) * 2] = 1.04 (90% accuracy) 37
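The two-bit counter's behavior can likewise be simulated. A sketch assuming the usual encoding (0-1 predict not-taken, 2-3 predict taken; counter=2 is "weakly taken" as on the slide):

```python
def two_bit_accuracy(outcomes, counter=2):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3
    predict taken. The counter moves one step per outcome and saturates
    at 0 and 3, so a strong prediction survives one contrary outcome."""
    correct = 0
    for actual in outcomes:
        pred = 'T' if counter >= 2 else 'N'
        correct += (pred == actual)
        counter = min(counter + 1, 3) if actual == 'T' else max(counter - 1, 0)
    return correct / len(outcomes)

# The alternating pattern from the slide, starting weakly taken:
print(two_bit_accuracy("TNTNTNTNTN", counter=2))  # 0.5
# A 10-iteration loop branch run twice: only the exit is mispredicted,
# matching the slide's (N-1)/N accuracy.
print(two_bit_accuracy("TTTTTTTTTN" * 2, counter=3))  # 0.9
```

Note how hysteresis fixes the last-time predictor's weakness on loops: the single not-taken exit knocks the counter only to "weakly taken," so the first iteration of the next loop run is still predicted correctly.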
STATE MACHINE FOR 2-BIT SATURATING COUNTER • Counter using saturating arithmetic – There is a symbol for maximum and minimum values [Figure: four states – 11 (pred taken), 10 (pred taken), 01 (pred !taken), 00 (pred !taken); "actually taken" increments toward 11 and "actually !taken" decrements toward 00, saturating at both ends] 38
HYSTERESIS USING A 2-BIT COUNTER [Figure: states "strongly taken", "weakly taken", "weakly !taken", "strongly !taken"; each outcome moves the state one step toward the matching direction] Change prediction after 2 consecutive mistakes 39
IS THIS ENOUGH? • ~85-90% accuracy for many programs with 2-bit counter based prediction (also called bimodal prediction) • Is this good enough? • How big is the branch problem? 40