CORF: Coalescing Operand Register File for GPUs


CORF: Coalescing Operand Register File for GPUs
Hodjat Asghari Esfeden, Hyeran Jeon, Daniel Wong, Farzad Khorasani, Nael Abu-Ghazaleh
The 24th International Conference on Architectural Support for Programming Languages and Operating Systems – ASPLOS 2019


GPU Register File
• Frequent accesses to the RF consume a substantial amount of the dynamic energy.
[Figure: register file size vs. L2 cache size (MB) across GPU generations: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2018)]
• Port contention, caused by the limited number of ports in the operand collection stage, hurts performance because register operations are serialized.


Our Proposal: CORF
Idea: Combining multiple register reads into a single physical read
Results:
• IPC improved by 9%
• RF dynamic energy reduced by 17%
• RF static energy reduced by 52%


Outline
• Background and Motivation
• CORF: Coalescing Operand Register File
  • Compile Time Operation
  • Run Time Operation
  • Limitation
• CORF++: Re-architected RF
  • Compile Time Operation
  • Run Time Operation
• Evaluation
• Summary


Baseline Register File
• 128 KB RF per SM
• Split across 4 banks
• A bank is made up of 8 sub-banks, each 128 bits wide
• A full warp register can be striped across one entry of one bank
[Figure: GPU register file design. Bank 0 holds physical register entries up to P255; each entry spans Sub-bank 0 to Sub-bank 7 (128 bits each) and stores the 32-bit values of threads t0 to t31]
Do all values need the full 32-bit width to be represented?
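The baseline numbers above fit together arithmetically. The short sketch below recomputes them; the warp size of 32 threads is taken from the t0 to t31 labels in the figure, and the constant names are illustrative only.

```python
# Baseline RF parameters as stated on the slide.
RF_BYTES_PER_SM = 128 * 1024   # 128 KB register file per SM
BANKS = 4
SUB_BANKS_PER_BANK = 8
SUB_BANK_BITS = 128
WARP_THREADS = 32              # threads t0..t31 in the figure
VALUE_BITS = 32                # one 32-bit value per thread

# One bank entry spans all 8 sub-banks.
entry_bits = SUB_BANKS_PER_BANK * SUB_BANK_BITS
# One warp register holds a 32-bit value for each of the 32 threads.
warp_register_bits = WARP_THREADS * VALUE_BITS
# Number of entries in each bank.
entries_per_bank = RF_BYTES_PER_SM // BANKS // (entry_bits // 8)
# Each 128-bit sub-bank holds the values of this many threads.
threads_per_sub_bank = SUB_BANK_BITS // VALUE_BITS

print(entry_bits, warp_register_bits, entries_per_bank, threads_per_sub_bank)
```

A warp register (1024 bits) therefore fills exactly one bank entry, and each bank holds 256 entries, which agrees with the P255 label in the figure.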


Prominence of Narrow-width Values
• Register operand characteristic
[Figure: percentage breakdown of register operands by width (1 byte, 2 bytes, 3 bytes, 4 bytes) for each benchmark, with group averages]
Can we pack multiple narrow-width values into a single physical register?
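The slides do not say how operand width is measured. As a minimal sketch, assume values are signed 32-bit integers and that a value is narrow when its upper bytes are pure sign extension; a warp register is then as wide as its widest lane. The function names are hypothetical.

```python
def bytes_needed(value: int) -> int:
    """Smallest byte count (1-4) that holds a signed 32-bit value.

    Assumption: width is judged on two's-complement sign extension.
    """
    for nbytes in (1, 2, 3):
        lo = -(1 << (8 * nbytes - 1))
        hi = (1 << (8 * nbytes - 1)) - 1
        if lo <= value <= hi:
            return nbytes
    return 4


def warp_register_width(lane_values):
    """Width of a warp register: the widest value across its lanes."""
    return max(bytes_needed(v) for v in lane_values)


# Example: 32 lanes holding small loop indices need only one byte each.
print(warp_register_width(range(32)))
```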


Register Packing
• Co-locating multiple narrow-width registers in the same physical register [Ergin’04][Wang’17]
[Figure: Baseline RF vs. Packed RF. Registers r1 to r5 each occupy their own physical register (P1 to P5) in the baseline; packing them into fewer physical registers gives a 40% saving]
• The goal of register packing is to reduce the effective size of the register file
• A first-fit policy is used at allocation time
• Mapping is done using Renaming Table logic
• However, each register read still requires a separate physical register read
Can we combine multiple register reads required by an instruction?
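First-fit packing with a renaming table can be sketched as follows. This is an illustration, not the authors' hardware: the per-thread widths in `WIDTHS` are made up so that five registers fit into three physical registers (a 40% saving, as in the figure), and the table maps each architectural register to a hypothetical (physical register, byte offset) pair.

```python
PHYS_BYTES = 4  # per-thread width of one physical register (32 bits)


def first_fit_pack(widths):
    """Place each register in the first physical register with enough room.

    widths: dict mapping register name -> width in bytes (1..4).
    Returns a renaming table: register -> (physical index, byte offset).
    """
    free = []    # free[i] = unused bytes left in physical register i
    table = {}
    for reg, width in widths.items():
        for phys, avail in enumerate(free):
            if avail >= width:
                break
        else:
            # No existing physical register has room: allocate a new one.
            free.append(PHYS_BYTES)
            phys = len(free) - 1
        table[reg] = (phys, PHYS_BYTES - free[phys])
        free[phys] -= width
    return table


# Hypothetical widths, chosen only for illustration.
WIDTHS = {"r1": 2, "r2": 3, "r3": 2, "r4": 1, "r5": 4}
TABLE = first_fit_pack(WIDTHS)
print(TABLE)
```

Note that first fit places r1 with r3 and r2 with r4 purely by arrival order and size, which is why it does little to co-locate registers that are actually read together.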


Coalescing Opportunities
• Let’s try to coalesce reads using simple register packing (with first-fit policy)
[Figure: percentage of instructions with coalesce-able register reads for each benchmark, First-fit vs. Upper bound; labeled averages: 4% (first-fit) and 69% (upper bound)]
• Upper bound: fraction of all dynamic instructions which:
  • Contain two register source operands that are both narrow-width
  • Have operands that fit together in a single register entry
• The first-fit policy is weak in promoting coalescing; how can we pack the right registers together?
Let’s incorporate a compiler-guided register allocation to identify pairs of registers commonly read together
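The upper-bound definition can be expressed as a small counting routine. This is a sketch under assumptions: a dynamic instruction is represented only by the byte widths of its register source operands, and an instruction counts if any two of them fit in one 4-byte entry (which implies both are narrower than 4 bytes). The trace below is invented for illustration.

```python
from itertools import combinations


def coalescable_fraction(trace, phys_bytes=4):
    """Fraction of dynamic instructions with a coalesce-able source pair.

    trace: list of tuples, each holding the byte widths (1..4) of one
    instruction's register source operands.
    """
    if not trace:
        return 0.0
    hits = sum(
        any(a + b <= phys_bytes for a, b in combinations(src_widths, 2))
        for src_widths in trace
    )
    return hits / len(trace)


# Hypothetical trace: only (1, 2) and (2, 2) fit in a single entry.
TRACE = [(1, 2), (4, 1), (2, 2), (3,), (2, 3)]
print(coalescable_fraction(TRACE))
```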


CORF: Overview
• Register pairs are identified at compile time through static analysis.
  • (r1, r3) as well as (r2, r4).
[Figure: compile time: Kernel Binary → Profile Register Pairings → Common Pairs Identification, using a weighted register affinity graph over r1 to r4 and producing the pairs (r1, r3) and (r2, r4). Execution time: CORF RF in which r2 and r4 share one entry while r1 and r3 each occupy their own]
• At run time, common register pairs are dynamically packed together, if possible.
  • (r2, r4) in this example.


CORF: Compile Time Operation
• Identifying exclusive common pairs
  • Profile the frequency of register pairs in order to build a Register Affinity Graph
  • Remove edges of registers that have more than one edge, so that only exclusive common pairs remain
[Figure: compile-time flow: Profile Register Pairings → Common Pairs Identification; weighted register affinity graph over r1 to r4, yielding the pairs (r1, r3) and (r2, r4)]
• Passing compiler-generated hints to the hardware
  • The set of exclusive register pairs identified by the compiler is annotated in the preamble of the kernel’s executable
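The two compile-time steps can be sketched as below. The slide does not state which edges are dropped when a register has several, so this sketch assumes a greedy rule: keep the heaviest edge first and discard any later edge that touches an already paired register. The edge weights are hypothetical, loosely modeled on the figure; how the figure's weights map to edges is an assumption.

```python
from collections import Counter
from itertools import combinations


def affinity_graph(instructions):
    """Count how often each pair of source registers is read together.

    instructions: iterable of tuples of source register names.
    """
    graph = Counter()
    for srcs in instructions:
        for a, b in combinations(sorted(set(srcs)), 2):
            graph[(a, b)] += 1
    return graph


def exclusive_pairs(graph):
    """Greedy selection (assumed policy): heaviest edges win, and each
    register may appear in at most one pair."""
    paired, pairs = set(), []
    for (a, b), _ in sorted(graph.items(), key=lambda kv: (-kv[1], kv[0])):
        if a not in paired and b not in paired:
            pairs.append((a, b))
            paired.update((a, b))
    return pairs


# Hypothetical affinity weights for illustration.
GRAPH = Counter({("r2", "r4"): 10, ("r1", "r3"): 8,
                 ("r1", "r4"): 7, ("r1", "r2"): 2})
PAIRS = exclusive_pairs(GRAPH)
print(PAIRS)
```

With these weights the result is the two pairs named on the slide, (r2, r4) and (r1, r3).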


CORF: Run Time Operation
• During run time, CORF packs the identified register pairs into the same physical register entry
  • (r1, r3) do not fit in a single physical register
[Figure: CORF RF at execution time: r2 and r4 share one entry; r1 and r3 each occupy their own entry]
• Coalescing opportunities are identified using the Renaming Table
  • If the two source operands reside in the same physical register, their accesses are coalesced
• CORF coalescing opportunities are limited to registers stored within the same physical register entry.
What if a register is commonly accessed with two or more other registers?
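The run-time check amounts to a renaming-table lookup. In this minimal sketch the table maps each architectural register to a physical entry index; the indices are assumptions that mirror the figure (r2 and r4 share an entry, r1 and r3 do not).

```python
def physical_reads(src_regs, renaming_table):
    """Number of physical register reads an instruction needs: sources that
    map to the same physical entry are served by one coalesced read."""
    return len({renaming_table[reg] for reg in src_regs})


# Hypothetical mapping mirroring the slide's example.
RENAMING_TABLE = {"r1": 0, "r2": 1, "r4": 1, "r3": 2}

print(physical_reads(["r2", "r4"], RENAMING_TABLE))  # coalesced
print(physical_reads(["r1", "r3"], RENAMING_TABLE))  # two separate reads
```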


CORF++: Overview
• Instead of identifying exclusive register pairs, the compiler solves a variant of the graph coloring problem, which reduces allocation to a left-aligned or right-aligned assignment
[Figure: compile time: Kernel Binary → Register Affinity Graph → Alignment Identification over registers r1 to r6. Execution time: Coalescing-Aware RF with Left and Right slices]
• During run time, any left-aligned register is coalesce-able with any right-aligned register, provided they don't overlap.
How to allocate registers to left/right RF slices for maximizing coalescing opportunities?
How to architect the RF to allow coalescing across different phys. registers?
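The non-overlap condition can be made concrete. A minimal sketch, assuming a left-aligned register of width w occupies bytes [0, w) of a 4-byte slot and a right-aligned one occupies bytes [4 - w, 4): two registers are then coalesce-able when they have opposite alignments and their widths sum to at most 4 bytes. The tuple representation is hypothetical.

```python
def can_coalesce(reg_a, reg_b, phys_bytes=4):
    """reg_a, reg_b: (alignment, width_in_bytes) with alignment 'L' or 'R'.

    Opposite alignments are required, and the left and right byte ranges
    must not overlap, i.e. the widths must fit side by side.
    """
    (align_a, width_a), (align_b, width_b) = reg_a, reg_b
    return align_a != align_b and width_a + width_b <= phys_bytes


print(can_coalesce(("L", 1), ("R", 3)))  # fits side by side
print(can_coalesce(("L", 3), ("R", 2)))  # byte ranges overlap
print(can_coalesce(("L", 1), ("L", 1)))  # same alignment
```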


CORF++: Compiler Support
• Finding the optimal solution, which removes the minimum number of edges from a graph to make it two-colorable, is NP-hard
• Any graph with no odd cycles (cycles made up of an odd number of edges) is 2-colorable
• We developed the following heuristic to remove all odd cycles:
  1. Assign each edge a weight corresponding to its original weight, divided by the number of odd cycles that removing it would break
  2. Remove the edge with the minimum weight and update the weights
  3. Repeat until all odd cycles are eliminated
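The three steps above can be sketched for a small affinity graph. This is an illustrative reading of the heuristic, not the authors' compiler pass: odd cycles are found by brute-force enumeration (exponential, so suitable only for tiny graphs), edges on no odd cycle are never removed, ties are broken by edge name, and the final left/right assignment is an ordinary 2-coloring. All names and weights are assumptions.

```python
def edge(u, v):
    """Canonical (sorted) key for an undirected edge."""
    return tuple(sorted((u, v)))


def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def odd_cycles(edges):
    """All simple cycles with an odd number of edges, as sets of edges."""
    adj, found = adjacency(edges), set()

    def dfs(start, node, path):
        for nxt in adj[node]:
            if nxt == start and len(path) >= 3:
                if len(path) % 2 == 1:  # cycle length equals len(path)
                    ring = path + [start]
                    found.add(frozenset(edge(u, v)
                                        for u, v in zip(ring, ring[1:])))
            elif nxt > start and nxt not in path:
                dfs(start, nxt, path + [nxt])

    for s in adj:
        dfs(s, s, [s])
    return found


def make_two_colorable(weights):
    """Repeatedly drop the edge with the lowest weight per odd cycle broken."""
    weights, removed = dict(weights), []
    while True:
        cycles = odd_cycles(weights)
        if not cycles:
            return weights, removed
        broken = {e: sum(e in c for c in cycles) for e in weights}
        candidates = [e for e in weights if broken[e] > 0]
        victim = min(candidates, key=lambda e: (weights[e] / broken[e], e))
        removed.append(victim)
        del weights[victim]


def two_color(edges):
    """Assign 'L'/'R' so that every remaining edge joins opposite sides."""
    adj, side = adjacency(edges), {}
    for s in sorted(adj):
        if s in side:
            continue
        side[s], stack = "L", [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in side:
                    side[v] = "R" if side[u] == "L" else "L"
                    stack.append(v)
    return side


# Hypothetical graph: two triangles sharing the edge (r2, r3).
AFFINITY = {("r1", "r2"): 4, ("r1", "r3"): 4, ("r2", "r3"): 6,
            ("r2", "r4"): 4, ("r3", "r4"): 4}
KEPT, REMOVED = make_two_colorable(AFFINITY)
SIDES = two_color(KEPT)
print(REMOVED, SIDES)
```

In this example the shared edge (r2, r3) has the highest raw weight (6), but removing it breaks both odd cycles, so its adjusted weight is 3 and it is removed first. One removal suffices, whereas removing any other edge would leave one triangle intact.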


CORF++: Architecture Support
• To support coalescing across different physical register entries, we need dual-addressable banks
• Moreover, we need to change the register-to-bank mapping policy.
• Details in the paper
[Figure: Coalescing-Aware RF at execution time. Sub-banks 0 to 7 are addressed through MUXes that select between Address 1 (P2) and Address 2 (P3), so parts of two different physical entries can be read in one access]


Fraction of Coalesced Instructions
[Figure: percentage of coalesced instructions for each benchmark (INT-intensive and FP-intensive groups), comparing First-fit, CORF, CORF++ and Upper Bound; labeled values: 4%, 23%, 48% and 69%]
CORF and CORF++ significantly increase the amount of coalescing opportunity


Performance Improvement
[Figure: IPC improvement (%) for each benchmark (INT-intensive and FP-intensive groups), CORF vs. CORF++; labeled values: 4% and 9%]
CORF and CORF++ improve IPC by reducing the pressure on RF ports


RF Access Reduction: Dynamic Energy
[Figure: register file access reduction (%) for each benchmark, CORF vs. CORF++; labeled value: 23%]
CORF and CORF++ reduce RF accesses, which translates to 8.5% and 17% dynamic energy reduction, respectively


Effective Size of RF: Static Energy
[Figure: register file size reduction (%) for each benchmark, comparing CORF++, RF-Virtualization and Combined; labeled values: 54%, 35% and 34%]
The reduction in effective RF size translates to a 53% static energy reduction


Summary
• The register file is a critical structure in GPUs
• Many values do not require a full-width register to be represented
• We proposed CORF, which combines multiple register reads into a single physical register read
• CORF++ further re-architects the RF to take more advantage of register coalescing opportunities
• Our technique improves IPC by 9% and reduces RF dynamic and static energy by 17% and 52%, respectively.



RF Dynamic Energy
[Figure: two panels showing RF dynamic energy as a percentage (0% to 100%) for each benchmark, split into Overhead and Dynamic Energy]


RF Static Energy
[Figure: two panels showing RF static energy as a percentage (0% to 100%) for each benchmark (INT-intensive and FP-intensive groups), split into Overhead and Static Energy]


Methodology
• Simulator:
  • GPGPU-Sim modeling NVIDIA Fermi architecture
• Workloads:
  • Rodinia
  • Parboil
  • CUDA-SDK
  • Tango


CORF++: Illustrative Example
A. SASS code:
   GLD r1, [0x80];
   ISUB r2, r1, 0x7;
   SHR r4, r1, 0x8;
   LLD r3, [r4];
   IADD r5, r2, r3;
   IMUL r1, r4, r5;
   ISUB r2, r3, r4;
B. <L: r3, r5 | R: r2, r4> ► GLD r1, [0x80]; ISUB r2, r1, 0x7;
   Physical register file (P0 to P2) holds: r1, r2
C. <L: r3, r5 | R: r2, r4> ► SHR r4, r1, 0x8;
   Physical register file (P0 to P2) holds: r1, r2, r4
D. <L: r3, r5 | R: r2, r4> ► LLD r3, [r4];
   Physical register file (P0 to P2) holds: r3, r1, r2, r4
E. <L: r3, r5 | R: r2, r4> ► IADD r5, r2, r3;
   Physical register file (P0 to P2) holds: r3, r1, r2, r5, r4
F. <L: r3, r5 | R: r2, r4> ► IMUL r1, r4, r5; ISUB r2, r3, r4;
   Physical register file (P0 to P2) holds: r3, r1, r2, r5, r4
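The annotation <L: r3, r5 | R: r2, r4> can be checked against the listing. The sketch below encodes the seven instructions by hand (immediate operands are left out of the source lists) and reports which instructions read one left-aligned and one right-aligned register. It assumes the two values do not overlap in width, which the slide text does not state.

```python
# Alignment annotation from the example: left- and right-aligned registers.
ALIGN = {"r3": "L", "r5": "L", "r2": "R", "r4": "R"}

# (opcode, destination, register sources) transcribed from the SASS listing.
PROGRAM = [
    ("GLD",  "r1", []),
    ("ISUB", "r2", ["r1"]),
    ("SHR",  "r4", ["r1"]),
    ("LLD",  "r3", ["r4"]),
    ("IADD", "r5", ["r2", "r3"]),
    ("IMUL", "r1", ["r4", "r5"]),
    ("ISUB", "r2", ["r3", "r4"]),
]


def coalescing_candidates(program, align):
    """Instructions whose two register sources have opposite alignments."""
    out = []
    for op, dst, srcs in program:
        sides = {align.get(s) for s in srcs}
        if len(srcs) == 2 and sides == {"L", "R"}:
            out.append((op, dst, tuple(srcs)))
    return out


CANDIDATES = coalescing_candidates(PROGRAM, ALIGN)
for op, dst, srcs in CANDIDATES:
    print(op, dst, srcs)
```

All three two-source instructions (IADD, IMUL and the final ISUB) pair a left-aligned register with a right-aligned one, even though r3 and r4 are each read together with two different partners. This is the case that exclusive pairing in CORF cannot cover.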