Cache Coherence for GPU Architectures
Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O'Connor³, Tor Aamodt¹
¹ University of British Columbia, ² Simon Fraser University, ³ AMD Research
Image source: www.forces.gc.ca

What is a GPU?
[Figure: timeline in which the CPU spawns work on the GPU and resumes when the GPU is done, then spawns again. GPU work is organized into workgroups and wavefronts. Each GPU core has an L1 data cache (L1D); the cores connect through an interconnect to the L2 banks.]

Evolution of GPUs
• Graphics pipeline (OpenGL/DirectX): vertex shader, pixel shader
• Compute (OpenCL, CUDA)
• e.g. matrix multiplication

Evolution of GPUs
• Future: coherent memory space
• Efficient critical sections
• Load balancing
[Figure labels: stencil computation; workgroups; lock shared structure … computation … unlock]

GPU Coherence Challenges
• Challenge 1: Coherence traffic
[Figure: cores C1 to C4, whose L1D caches hold blocks A and B, issue streams of loads (Load C through Load R). A gets C request reaches the L2/directory, which sends recalls (rcl A) to the L1 caches and receives acks.]
[Chart: interconnect traffic for MESI, GPU-VI and no coherence on applications that do not require coherence; values shown include 2.2, 1.5 and 1.3 on an axis marked 0.5 and 1.0.]

GPU Coherence Challenges
• Challenge 2: Tracking in-flight requests
• Significant % of L2
[Figure: an MSHR at the L2/directory tracks a block in the transient state S_M while it moves from S (Shared) to M (Modified).]

GPU Coherence Challenges
• Challenge 3: Complexity
[Figure: state-transition tables (states by events) for the non-coherent L1, non-coherent L2, MESI L1 and MESI L2 controllers.]

GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU:
1. Traffic: transferring
2. Storage: tracking
3. Complexity: managing
GPU cache coherence without coherence messages?
• YES, using global time

Temporal Coherence (TC)
• Global time
• L1: Local Timestamp > Global Time → VALID
• L2: Global Timestamp < Global Time → NO L1 COPIES
[Figure: Core 1 and Core 2, each with an L1D, connected through the interconnect to the L2 banks; block A=0 appears with timestamp 0 in both an L1D and L2 Bank 0.]

Temporal Coherence (TC)
• No coherence messages
[Figure: timeline at T=0, T=11 and T=15 with Core 1, Core 2, their L1Ds, the interconnect and the L2 bank. A Load A brings A=0 into an L1D with timestamp 10, and the L2 bank records timestamp 10 for A. A later Store A=1 changes the L2 copy from A=0 to A=1.]
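The two timestamp rules on the TC slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `L2Bank` and `L1Cache`, the default lifetime of 10, the initial memory value of 0, and the choice to send stores straight to the L2 are all assumptions made for the example.

```python
class L2Bank:
    """Shared L2 bank that remembers the latest lease granted per block."""

    def __init__(self):
        self.data = {}       # addr -> value (hypothetical: unset blocks read as 0)
        self.global_ts = {}  # addr -> latest expiry time granted to any L1

    def load(self, addr, now, lifetime):
        """Serve an L1 miss and grant a lease that expires at now + lifetime."""
        expiry = now + lifetime
        self.global_ts[addr] = max(self.global_ts.get(addr, expiry), expiry)
        return self.data.get(addr, 0), expiry

    def no_l1_copies(self, addr, now):
        """Slide rule: global timestamp < global time means no L1 copies."""
        return addr not in self.global_ts or self.global_ts[addr] < now

    def store(self, addr, value, now):
        """Write only once every lease has expired; False means 'would stall'."""
        if not self.no_l1_copies(addr, now):
            return False
        self.data[addr] = value
        return True


class L1Cache:
    """Private L1 that self-invalidates blocks by comparing timestamps."""

    def __init__(self, l2):
        self.l2 = l2
        self.blocks = {}  # addr -> (value, local timestamp)

    def load(self, addr, now, lifetime=10):
        entry = self.blocks.get(addr)
        # Slide rule: local timestamp > global time means the copy is valid.
        if entry is not None and entry[1] > now:
            return entry[0]
        # Missing or expired: refetch from L2 with a fresh lease.
        value, expiry = self.l2.load(addr, now, lifetime)
        self.blocks[addr] = (value, expiry)
        return value
```

With these definitions, a load of `A` at time 0 leases the block until time 10. A store at time 5 is refused, a store at time 11 succeeds without any message to the L1, and a load at time 15 finds the local copy expired and refetches the new value.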

Temporal Coherence (TC)
What lifetime values should be requested on loads?
• Use a predictor to predict lifetime values
What about stores to unexpired blocks?
• Stall them at the L2?

TC Stalling Issues
Stall?
• Problem #1: Sensitive to mispredictions
• Problem #2: Impedes other accesses
• Problem #3: Hurts existing GPU applications
Solution: TC-Weak

TC-Weak
• Stores return Global Write Completion Time (GWCT)
• No stalling at L2
[Figure: a GPU core executes (1) data=NEW, (2) FENCE, (3) flag=SET, with times T=0, T=1 and T=31 marked. Each core has an L1D and a GWCT table with entries W0 and W1. Another L1D holds data=OLD with timestamp 30. The L2 bank holds data with timestamp 30 (OLD, then NEW) and flag with timestamp 47 (NULL, then SET).]
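The TC-Weak idea on this slide can be sketched as follows. This is a hedged illustration under stated assumptions: the names `WeakL2Bank` and `Core`, the reading of W0/W1 as per-wavefront table entries, the use of the block's global timestamp as the returned GWCT, and the strict `<` comparison at the fence are choices made for this example rather than details confirmed by the slides.

```python
class WeakL2Bank:
    """L2 bank that never stalls stores; it reports when they become visible."""

    def __init__(self):
        self.data = {}       # addr -> value
        self.global_ts = {}  # addr -> latest lease expiry granted to any L1

    def load(self, addr, now, lifetime):
        """Grant a lease that expires at now + lifetime."""
        expiry = now + lifetime
        self.global_ts[addr] = max(self.global_ts.get(addr, expiry), expiry)
        return self.data.get(addr), expiry

    def store(self, addr, value, now):
        """Write immediately and return the GWCT for this store.

        Assumption: the GWCT is the block's global timestamp, i.e. the time
        by which every stale L1 copy has expired (-1 if never leased).
        """
        self.data[addr] = value
        return self.global_ts.get(addr, -1)


class Core:
    """GPU core with a hypothetical per-wavefront GWCT table."""

    def __init__(self, l2):
        self.l2 = l2
        self.gwct = {}  # wavefront id -> latest GWCT among its stores

    def store(self, wavefront, addr, value, now):
        t = self.l2.store(addr, value, now)
        self.gwct[wavefront] = max(self.gwct.get(wavefront, -1), t)

    def fence_ready(self, wavefront, now):
        """A fence completes once global time has passed the wavefront's GWCT."""
        return self.gwct.get(wavefront, -1) < now
```

In the slide's scenario another core has leased `data` until time 30. The store of `data=NEW` completes at the L2 right away, the fence holds the issuing wavefront until time 31, and only then is `flag=SET` issued, so no core can observe the new flag alongside the old data.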

TC-Weak
[Table: Stalling vs. TC-Weak, compared on misprediction sensitivity, whether it impedes other accesses, and whether it is good for existing GPU applications.]

Methodology
• GPGPU-Sim v3.1.2 for GPU core model
• GEMS Ruby v2.1.1 for memory system
• All protocols written in SLICC
• Model a generic NVIDIA Fermi-based GPU (see paper for details)
Applications:
• 6 do not require coherence
• 6 require coherence
  • Barnes Hut
  • Cloth Physics
  • Versatile Place and Route
  • Max-Flow Min-Cut
  • 3D Wave Equation Solver
  • Octree Partitioning
[Annotations on the coherent applications: Locks, Stencil communication, Load balancing]

Interconnect Traffic
• TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
• Lower traffic than a 16x-sized, 32-way directory
[Chart: interconnect traffic (axis 0.00 to 1.50, one bar labeled 2.3) for NO-COH, MESI, GPU-VI and TC-Weak on applications that do not require coherence.]

Performance
• TC-Weak with simple predictor performs 85% better than disabling L1 caches
• Performs 28% better than TC with stalling
• Larger directory sizes do not improve performance
[Chart: speedup (axis 0.0 to 2.0) for NO-L1, MESI, GPU-VI and TC-Weak on applications that require coherence.]

Complexity
[Figure: state-transition tables for the non-coherent L1 and L2, MESI L1 and L2, and TC-Weak L1 and L2 controllers.]

Summary
• First work to characterize GPU coherence challenges
• Save traffic and energy by using global time
• Reduce protocol complexity
• 85% performance improvement over no coherence
Questions?

Backup Slides

Lifetime Predictor
• One prediction value per L2 bank
• Events local to L2 bank update prediction value
  1. Expired load: ↑ (prediction++)
  2. Unexpired store: ↓ (prediction--)
  3. Unexpired eviction: ↓ (prediction--)
[Figure: Load A and Store A events arrive at the L2 bank at T=0 and T=20 and adjust the bank's prediction value; block A is shown with timestamps 10 and 30.]
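The three update rules above map onto a small counter. The sketch below is illustrative only: the class name `LifetimePredictor`, the initial value, the step size of 1 and the clamping bounds are assumptions, since the slide gives only the direction of each update.

```python
class LifetimePredictor:
    """One lifetime prediction per L2 bank, nudged by events at that bank."""

    def __init__(self, initial=10, step=1, lowest=0, highest=1024):
        # All four parameters are hypothetical; the slide specifies none of them.
        self.value = initial
        self.step = step
        self.lowest = lowest
        self.highest = highest

    def _adjust(self, delta):
        # Keep the prediction inside [lowest, highest].
        self.value = max(self.lowest, min(self.highest, self.value + delta))

    def on_expired_load(self):
        """A load found its block already expired: the lifetime was too short."""
        self._adjust(+self.step)

    def on_unexpired_store(self):
        """A store hit a block still leased to an L1: the lifetime was too long."""
        self._adjust(-self.step)

    def on_unexpired_eviction(self):
        """An unexpired block was evicted from L2: the lifetime was too long."""
        self._adjust(-self.step)

    def predict(self):
        """Lifetime to attach to the next load served by this bank."""
        return self.value
```

Because each event moves the value by a fixed step, the prediction drifts toward a lifetime at which expired loads and unexpired stores or evictions balance out for the traffic seen at that bank.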

TC-Strong vs TC-Weak
[Charts: speedup over all applications in two panels, "Fixed lifetime for all applications" and "Best lifetime for each application". Legend entries: TCSUO, TCSOO, TCS, TCW, TCW w/ predictor. Axis ticks run from 0.6 to 1.4 in one panel and 0.6 to 1.0 in the other.]

Interconnect Power and Energy