Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research

Outline
o Predicated Execution
o Diverge-Merge Processor (DMP)
o Implementation of DMP
o Experimental Evaluation
o Conclusion
10/31/2021

Predicated Execution

Source code:
    if (cond) {
        b = 0;
    } else {
        b = 1;
    }

CFG: A branches to C (taken, T) or B (not taken, N); B and C merge at D.

(normal branch code)
    A:           p1 = (cond)
                 branch p1, TARGET
    B:           mov b, 1
                 jmp JOIN
    C: TARGET:   mov b, 0

(predicated code)
    A:           p1 = (cond)
    B:           (!p1) mov b, 1
    C:           (p1) mov b, 0

Convert control flow dependence to data dependence
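The if-conversion on this slide can be mimicked in a few lines of Python. This is only a toy sketch: the function names and the `(guard, destination, value)` instruction format are assumptions made for illustration, not anything defined in the slides.

```python
def run_branch(cond):
    """Normal branch code: only one of the two moves is executed."""
    if cond:
        b = 0          # TARGET: mov b, 0
    else:
        b = 1          # mov b, 1
    return b


def run_predicated(cond):
    """Predicated code: both moves are fetched; the predicate guards the write."""
    regs = {"b": None}
    p1 = bool(cond)                                  # p1 = (cond)
    # Hypothetical encoding: (guard, destination, value).
    # An instruction whose guard is false behaves like a NOP.
    program = [(not p1, "b", 1), (p1, "b", 0)]       # (!p1) mov b, 1 ; (p1) mov b, 0
    for guard, dest, value in program:
        if guard:
            regs[dest] = value
    return regs["b"]
```

Both functions return the same value for any `cond`; the predicated version simply has no control-flow decision left for a branch predictor to get wrong.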

Benefit of Predicated Execution
o Predicated Execution can be high performance and energy-efficient.
[Figure: blocks A-F flowing through the pipeline stages Fetch, Decode, Rename, Schedule, Register Read, Execute, once under predicated execution and once under branch prediction, where a misprediction causes a pipeline flush]

Limitations/Problems of Predication
o ISA: Predicate registers and predicated instructions
  - Dynamic-Hammock Predication [Klauser’98] can solve this problem but it is only applicable to simple hammocks.
o Adaptivity: Static predication is not adaptive to run-time branch behavior.
  - Branch behavior changes based on input set, phase, control-flow path.
  - Wish Branches [Kim’05]
o Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  - Function calls, loops, many instructions inside a region, and complex CFGs
  - Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.

Diverge-Merge Processor (DMP)
o DMP can dynamically predicate complex branches (in addition to simple hammocks).
o The compiler identifies
  - Diverge branches
  - Control-flow merge (CFM) points
o The microarchitecture decides when and what to predicate dynamically.

Dynamic Predication

CFG: A (low-confidence branch) goes to C (taken, T) or B (not taken, N); both paths merge at H.

    A:           p1 = (cond)
                 branch p1, TARGET
    B:           mov R1, 1            (renamed: PR10 = 1)
                 jmp JOIN
    C: TARGET:   mov R1, 0            (renamed: PR11 = 0)
    H: JOIN:     add R5, R1, 1

select-µops (φ-nodes in SSA): PR12 = (cond) ? PR11 : PR10

Klauser et al. [PACT’98]: Dynamic-hammock predication
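A minimal Python sketch of the select-µop on this slide, assuming a toy register file held in a dictionary (the function name and the dictionary are illustrative only). Both moves execute into separate physical registers, and the select-µop picks the surviving value once the condition is known.

```python
def dynamic_hammock(cond):
    """Toy model of the renamed hammock: both paths execute, a select-uop merges them."""
    pr = {}
    pr[10] = 1                             # block B: mov R1, 1  (renamed to PR10)
    pr[11] = 0                             # block C: mov R1, 0  (renamed to PR11)
    pr[12] = pr[11] if cond else pr[10]    # select-uop: PR12 = (cond) ? PR11 : PR10
    r5 = pr[12] + 1                        # block H: add R5, R1, 1 reads the merged PR12
    return r5
```

The instruction at the join point depends only on PR12, so it can be renamed before the branch resolves.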

Diverge-Merge Processor
[Figure: control-flow graph with blocks A-H. Labels: Diverge Branch, CFM point (H), Insert select-µops. Legend: frequently executed path, not frequently executed path]

Diverge-Merge Processor
[Figure: control-flow graph with blocks A-H. Legend: diverge-branch, CFM point, executed block, frequently executed path, not frequently executed path]

Control-Flow Graphs
CFG types (columns): simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms (rows): DMP, Dynamic Hammock, SW pred, Wish br., Dual-path
[Table entries not recovered]

Dual-path Execution vs. DMP
[Figure: for a low-confidence branch at A in a CFG with blocks A-F, the blocks fetched on path 1 and path 2 under dual-path execution and under DMP, where the two paths merge at the CFM point]

Control-Flow Graphs
CFG types (columns): simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms (rows): DMP, Dynamic-hammock, SW pred, Wish br., Dual-path
Recovered entries: SW pred: sometimes; Wish br.: sometimes
[Remaining table entries not recovered]

Distribution of Mispredicted Branches
o 66% of mispredicted branches can be dynamically predicated in DMP.

Fetch Mechanism
[Figure: control-flow graph with blocks A-H. Labels: Low Confidence, Diverge Branch, Round-robin fetch, CFM point (H). Legend: predicted path]
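Round-robin fetch from the two paths of a diverge branch can be sketched as follows. This is a simplified illustration: alternating one basic block per turn, the function name, and the example block lists are all assumptions, since the slide does not give the fetch granularity.

```python
def round_robin_fetch(path_a, path_b, cfm):
    """Alternate fetch between two paths, one block per turn, until both reach the CFM point."""
    paths = [iter(path_a), iter(path_b)]
    reached = [False, False]
    fetched = []
    turn = 0
    while not all(reached):
        if not reached[turn]:
            # Running out of blocks is treated the same as arriving at the CFM point.
            block = next(paths[turn], cfm)
            if block == cfm:
                reached[turn] = True
            else:
                fetched.append(block)
        turn ^= 1                  # switch to the other path
    fetched.append(cfm)            # after the merge, fetch continues on a single stream
    return fetched


# Hypothetical paths: one side goes through B and D, the other through C, both merge at H.
order = round_robin_fetch(["B", "D"], ["C"], "H")
```

With these made-up paths the fetch order is B, C, D, H: the CFM block is fetched once, not once per path.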

Dynamic Predication

         Original code              Renamed code
    A:   branch r0, C               branch pr10, C
                                    p1 = pr10
    B:   add r1 <- r3, #1           add pr21 <- pr13, #1     (p1)
    C:   add r1 <- r2, #-1          add pr31 <- pr12, #-1    (!p1)
                                    select-µop pr41 = p1 ? pr21 : pr31
    H:   add r4 <- r1, r3           add pr24 <- pr41, pr13

    RAT 1                               RAT 2
    Arch.  Phy.                  M      Arch.  Phy.           M
    R1     PR11 -> PR21 -> PR41  1      R1     PR11 -> PR31   1
    R2     PR12                         R2     PR12
    R3     PR13                         R3     PR13

Forks RAT, RAS, and GHR
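The RAT fork and select-µop generation on this slide can be sketched in Python. The helper names and the idea of passing in pre-chosen physical register names are assumptions for illustration; the register names themselves (PR11, PR21, PR31, PR41) follow the slide.

```python
def fork_and_rename(rat, writes):
    """Rename one predicated path on its own copy of the RAT.

    `writes` maps architectural destination -> newly allocated physical register.
    Returns the path's RAT and the set of registers whose M (modified) bit is set.
    """
    path_rat = dict(rat)           # forking the RAT is modeled as copying the table
    path_rat.update(writes)
    return path_rat, set(writes)


def select_uops(rat1, m1, rat2, m2, new_regs, pred="p1"):
    """At the CFM point, emit one select-uop per register modified on either path."""
    merged, uops = dict(rat1), []
    for arch in sorted(m1 | m2):
        dest = new_regs[arch]
        uops.append(f"{dest} = {pred} ? {rat1[arch]} : {rat2[arch]}")
        merged[arch] = dest
    return merged, uops


rat = {"R1": "PR11", "R2": "PR12", "R3": "PR13"}
rat1, m1 = fork_and_rename(rat, {"R1": "PR21"})   # block B: add r1 <- r3, #1   (p1)
rat2, m2 = fork_and_rename(rat, {"R1": "PR31"})   # block C: add r1 <- r2, #-1  (!p1)
merged, uops = select_uops(rat1, m1, rat2, m2, {"R1": "PR41"})
```

Only R1 has its M bit set, so a single select-µop is produced and block H reads R1 through PR41, while R2 and R3 keep their original mappings.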

DMP Support
o ISA Support
  - Mark diverge branches/CFM points.
o Compiler Support [CGO’07]
  - The compiler identifies diverge branches and the corresponding CFM points.
o Hardware Support
  - Confidence estimator
  - Fetch mechanisms
  - Load/store processing
  - Instruction retirement
  - Dynamic predication

Hardware Complexity Analysis
Mechanisms compared (columns): DMP, Dyn. ham., Dual path, Multipath, SW pred., Wish br.
Hardware components (rows): Front-End, Confidence Estimator, Rename Support, Predicate Registers, Select-Uop Gen., ST-LD Forwarding Check, Flush/no Flush
[Table entries not recovered]

Simulation Methodology
o 12 SPEC 2000 INT, 5 SPEC 95 INT
  - Different input sets for profiling and evaluation
o Alpha ISA execution-driven simulator
o Baseline processor configuration
  - 64KB perceptron predictor/O-GEHL (paper)
  - Minimum 30-cycle branch misprediction penalty
  - 8-wide, 512-entry instruction window
  - 2KB 12-bit history enhanced JRS confidence estimator
o Less aggressive processor (paper)
o Power model using Wattch
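For readers unfamiliar with confidence estimation, here is a rough Python sketch of a JRS-style estimator built from resetting counters. Everything concrete in it is an assumption: the XOR index function, the 4-bit counter range, and the threshold are illustrative choices and are not taken from the slides, which give only the 2KB size and the 12-bit history.

```python
class JRSConfidence:
    """Table of resetting counters indexed by a hash of the branch PC and history."""

    def __init__(self, index_bits=12, counter_max=15, threshold=15):
        # index_bits, counter_max and threshold are assumed values.
        self.mask = (1 << index_bits) - 1
        self.counter_max = counter_max
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, history):
        return (pc ^ history) & self.mask

    def is_high_confidence(self, pc, history):
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            # A misprediction resets the counter, so confidence must be re-earned.
            self.table[i] = 0
```

In DMP terms, a diverge branch whose entry is below the threshold would be treated as low confidence and become a candidate for dynamic predication.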

Different CFG types

Performance Improvement

Energy Consumption

Conclusion
o DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs.
o DMP can overcome three major limitations of software predication: ISA support, adaptivity, complex CFG.
o DMP reduces branch mispredictions energy efficiently
  - 19% performance improvement, 9% less energy
o DMP divides the work between the compiler and the microarchitecture:
  - The compiler analyzes the control-flow graphs.
  - The microarchitecture decides when and what to predicate dynamically.

Thank You!!

Questions?

Handling Mispredictions
[Figure: control-flow graph with blocks A-H. Labels: Diverge Br., CFM point, Misprediction!, Flush. Legend: predicted path]

    A:   branch pr10, C
         p1 = pr10
    B:   add pr21 <- pr13, #1     (p1)    (0)
    C:   add pr31 <- pr12, #-1    (!p1)   (1)
    D:   add pr34 <- pr31, pr13
         add pr44 <- pr34, #-1    (!p1)   (1)
         select-µop pr41 = p1 ? pr21 : pr31
    H:   add pr24 <- pr41, pr13

Loop Branches
o Exit Condition
  - The loop branch is predicted to exit the loop.
o Benefit
  - Reduced pipeline flushes: when the predicated loop is iterated more times than it should be.
    - Instructions in the extra iterations of the loop become NOPs. Instructions after loop-exit can still be executed.
o Negative Effects
  - Increased execution delay of loop-carried dependencies
  - The overhead of select-µops

Loop Branches
o Predicate each loop iteration separately

Original code:
    A:   add r1 <- r1, #1
         r0 = (cond1)
         branch A, r0
    B:   add r7 <- r1, #10

Dynamically predicated code:
    A:   add r1 <- r1, #1
         r0 = (cond1)
         branch A, r0                          (renamed: branch A, pr10)
         p1 = pr10
    A:   add pr21 <- pr11, #1                  (p1)
         pr20 = (cond1)
         branch A, pr20
         p2 = pr20
         select-uop pr22 = p1 ? pr21 : pr11
         select-uop pr23 = p1 ? pr20 : pr10
    A:   add pr31 <- pr22, #1                  (p2)
         pr30 = (cond1)
         branch A, pr30                        Loop br. is predicted to exit the loop
         select-uop pr32 = p2 ? pr31 : pr22
         select-uop pr33 = p2 ? pr30 : pr23
    B:   add pr7 <- pr32, #10
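The effect of predicating each loop iteration separately can be checked with a small Python model. It is a simplification under stated assumptions: the loop condition is taken to be `r1 < limit` (the slide only says `cond1`), and each iteration's guard is the merged loop-branch outcome carried forward by the select-uops.

```python
def reference_loop(r1, limit):
    """Ordinary execution: do { r1 += 1 } while (r1 < limit); then r7 = r1 + 10."""
    while True:
        r1 += 1
        if not (r1 < limit):
            break
    return r1 + 10


def predicated_loop(r1, limit, fetched_iterations):
    """Fetch a fixed number of iterations; extra ones are squashed by select-uops."""
    r1 = r1 + 1                    # first iteration is not predicated
    cont = r1 < limit              # loop branch outcome (hypothetical cond1)
    for _ in range(fetched_iterations - 1):
        new_r1 = r1 + 1            # predicated add
        new_cont = new_r1 < limit  # predicated condition computation
        r1 = new_r1 if cont else r1          # select-uop for r1
        cont = new_cont if cont else cont    # select-uop for the loop condition
    return r1 + 10                 # block B: add r7 <- r1, #10
```

As long as at least as many iterations are fetched as the loop really executes, the extra iterations act as NOPs and the result matches ordinary execution. Fetching too few iterations is not handled here; in the real design that case corresponds to a loop-exit misprediction.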

Enhanced Mechanisms
o Multiple CFM points
  - The hardware chooses one CFM point for each instance of dynamic predication.
o Exit Optimizations
  - Counter Policy: What if one path does not reach the CFM point?
    - Number of fetched instructions > Threshold
  - Yield Policy: What if another low confidence diverge branch is encountered in dynamic predication mode?
    - Later low confidence branch is more likely mispredicted.
[Figure: control-flow graph with blocks A-H]
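The two exit optimizations can be written as one small decision function. This is a hedged sketch: the function name, the return labels, and the choice to check the counter policy before the yield policy are assumptions, and the slide does not give a threshold value.

```python
def exit_policy(fetched_instructions, threshold, low_conf_diverge_seen):
    """Decide what to do on each fetch step while in dynamic predication mode."""
    if fetched_instructions > threshold:
        # Counter policy: one path is probably not going to reach the CFM point.
        return "exit"
    if low_conf_diverge_seen:
        # Yield policy: give up the current region and re-enter dynamic
        # predication from the younger low-confidence diverge branch.
        return "yield"
    return "continue"
```

For example, with an assumed threshold of 50, `exit_policy(60, 50, False)` gives `"exit"`, while `exit_policy(10, 50, True)` gives `"yield"`.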

Detailed DMP Support
o 32 Predicate register ids
o Fetch mechanism
  - High performance I-Cache
  - Fetch two cache lines
  - Predict 3 branches
  - Fetch stops at the first taken branch

Diverge and Merge?

Useful Dynamic Predication Mode

Perfect Branch Prediction

Maximum Power

Branch Predictor Effects

Confidence Estimator Effects

Results in Less Aggressive Processors

DMP vs. Perfect Conditional BP

Enhanced DMP Mechanisms

DMP vs. Other Mechanisms

Comparisons with Predication/Wish Branches
[Figure label: non-predicated]

Reduction in Pipeline Flushes
o Average overhead:
  - Dynamic-hammock: 4 instructions/entry
  - Dual-path: 150 instructions/entry
  - Multipath: 200 instructions/entry
  - DMP: 20 instructions/entry

Handling Nested Diverge Branches
o Basic DMP
  - Ignore other low confidence div. branches
o Enhanced DMP
  - Exit dynamic predication mode and re-enter from the younger low confidence branch on predicted path (Yield policy)
[Figure: control-flow graph with blocks A-H. Labels: Diverge Br., CFM point]

Compiler Support [CGO’07]
o Compiler analyzes the control flow and the profile data
  - Step 1: Identify diverge branch candidates and CFM points.
  - Step 2: Select diverge branches based on
    (1) the number of instructions between a branch and the CFM point
    (2) the probability of merging at the CFM point
    - Heuristics or a cost-benefit model
  - Step 3: Mark the selected branches/CFM points.
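Step 2 can be illustrated with a simple threshold heuristic in Python. The candidate record layout and both threshold values are hypothetical; the slides name the two criteria but give no numbers, and a cost-benefit model could replace the fixed cutoffs.

```python
def select_diverge_branches(candidates, max_insts=50, min_merge_prob=0.9):
    """Keep candidates that are close to their CFM point and usually merge there.

    Each candidate is a dict with keys 'branch_pc', 'cfm_pc', 'insts_to_cfm'
    and 'merge_prob' (all hypothetical field names; values come from profiling).
    """
    selected = []
    for c in candidates:
        close_enough = c["insts_to_cfm"] <= max_insts        # criterion (1)
        merges_often = c["merge_prob"] >= min_merge_prob     # criterion (2)
        if close_enough and merges_often:
            selected.append((c["branch_pc"], c["cfm_pc"]))   # Step 3 would mark these
    return selected


profile = [
    {"branch_pc": 0x100, "cfm_pc": 0x140, "insts_to_cfm": 12, "merge_prob": 0.97},
    {"branch_pc": 0x200, "cfm_pc": 0x400, "insts_to_cfm": 300, "merge_prob": 0.99},
    {"branch_pc": 0x300, "cfm_pc": 0x320, "insts_to_cfm": 8, "merge_prob": 0.40},
]
marked = select_diverge_branches(profile)
```

With this made-up profile only the first branch is selected: the second is too far from its CFM point and the third rarely merges.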

Future Research
o Hardware Support
  - Better confidence estimators
  - Efficient hardware mechanism to detect diverge branches and CFM points
    - Increase hardware complexity but eliminate the need for ISA/compiler support
o Compiler Support
  - Better compiler algorithms [CGO’07]

Power Measurement Configurations
o 100nm Technology
o Baseline processor
  - 4 GHz
o Less aggressive processor
  - 1.5 GHz
o CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power
o DMP: one more RAT/RAS/GHR, select-uop generation module, additional fields in BTB, predicate registers, CFM registers, load-store forwarding, instruction retirement

Fetched wrong-path instructions per entry into dynamic-predication/dual-path mode

Fetched/Executed Instructions

ISA Support
o Example of Diverge Br and CFM markers
    Branch instruction fields: OPCODE, TARGET, CFM rel address
    Branch type bits:
        00 : normal branch
        10 : diverge forward branch
        11 : diverge loop branch
    CFM = CFM rel address + PC
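Decoding these markers could look like the following Python sketch. The function name and argument layout are assumptions; the slide gives the three type encodings and the address formula but not the field widths or bit positions.

```python
BRANCH_KINDS = {
    0b00: "normal",
    0b10: "diverge-forward",
    0b11: "diverge-loop",
}


def decode_marker(type_bits, pc, cfm_rel_address):
    """Return (branch kind, absolute CFM address or None for a normal branch)."""
    kind = BRANCH_KINDS[type_bits]      # 0b01 is not listed on the slide, so it raises KeyError
    if kind == "normal":
        return kind, None
    return kind, pc + cfm_rel_address   # CFM = CFM rel address + PC
```

For instance, `decode_marker(0b10, 0x1000, 0x40)` yields `("diverge-forward", 0x1040)`.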

Entering Dynamic Predication Mode
o Entry condition
  - When a diverge branch has low confidence.
o The Front-end
  - Stores the address of the CFM point to the CFM register.
  - Forks the RAS, GHR, and RAT.
  - Allocates a predicate register.
o Fetch Mechanisms
  - Round-robin fetch from two paths
  - The processor follows the branch predictor until it reaches the corresponding CFM point.

Exiting Dynamic Predication Mode
o Exit condition
  - Both paths of a diverge branch have reached the corresponding CFM point.
  - A diverge branch is resolved.
o Select-µop mechanism
  - Similar to φ-node in SSA
  - Merges register values from two paths.
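The entry and exit conditions for dynamic predication mode fit a small state machine. The Python class below is a sketch only: the class and method names are invented for illustration, and RAT/RAS/GHR forking and predicate register allocation are left out.

```python
class DynamicPredicationMode:
    """Tracks whether the front-end is in dynamic predication mode."""

    def __init__(self):
        self.active = False
        self.cfm_register = None
        self.paths_at_cfm = set()

    def on_diverge_branch(self, low_confidence, cfm_address):
        """Entry condition: a diverge branch with low confidence."""
        if low_confidence and not self.active:
            self.active = True
            self.cfm_register = cfm_address   # store the CFM point address
            self.paths_at_cfm = set()
        return self.active

    def on_fetch(self, path_id, pc):
        """Exit condition: both paths have reached the CFM point."""
        if self.active and pc == self.cfm_register:
            self.paths_at_cfm.add(path_id)
            if len(self.paths_at_cfm) == 2:
                self._exit()

    def on_diverge_branch_resolved(self):
        """Exit condition: the diverge branch is resolved."""
        if self.active:
            self._exit()

    def _exit(self):
        self.active = False
        self.cfm_register = None
        self.paths_at_cfm = set()
```

A nested low-confidence diverge branch seen while `active` is set is ignored here, which corresponds to the basic DMP behavior rather than the yield policy.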

Multipath Execution
[Figure: control-flow graph with blocks A-I and two low-confidence branches, forked into path 1 through path 4]
o Instructions after the control-flow merge point are fetched multiple times.
o Waste of resources and energy.

Modeling Software Predication
o Mark using a binary instrumentation tool
o All simple and nested hammocks can be predicated.
  - All instructions between a branch and the control-flow merge point are fetched.
  - All nested branches are predicated.