Cache Coherence for GPU Architectures


Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O’Connor³, Tor Aamodt¹
¹ University of British Columbia, ² Simon Fraser University, ³ AMD Research
Image source: www.forces.gc.ca

What is a GPU?
[Figure: a timeline in which the CPU spawns work on the GPU and resumes when the GPU is done, then spawns again. GPU work is organized into workgroups made up of wavefronts. Each GPU core has an L1 data cache (L1D); the cores reach the L2 banks through an interconnect.]

Evolution of GPUs
• Graphics pipeline (OpenGL/DirectX): Vertex Shader, Pixel Shader
• Compute (OpenCL, CUDA)
  • e.g. Matrix Multiplication

Evolution of GPUs
• Future: coherent memory space
  • Efficient critical sections
  • Load balancing
  • Stencil computation
[Figure: workgroups each executing "lock shared structure … computation … unlock".]

GPU Coherence Challenges
• Challenge 1: Coherence traffic
[Figure: cores C1 to C4 each issue a stream of loads to distinct blocks (Load C, Load D, … Load R). L1D caches and the L2/Directory hold blocks A and B. A "gets C" request at the L2/Directory leads to recalls of A ("rcl A") sent to the L1 caches, each answered with an ack.]
[Chart: interconnect traffic for applications that do not require coherence, comparing No coherence, MESI and GPU-VI. Axis ticks 0.5 and 1.0; labeled values 1.3, 1.5 and 2.2; recalls are marked as a source of the extra traffic.]

GPU Coherence Challenges
• Challenge 2: Tracking in-flight requests
  • Significant % of L2
[Figure: state transition from S (Shared) through the transient state S_M to M (Modified), tracked in the L2/Directory MSHR.]

GPU Coherence Challenges
• Challenge 3: Complexity
[Figure: protocol tables (states by events) for the non-coherent L1, the non-coherent L2, the MESI L1 and the MESI L2.]

GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU:
1. Traffic: transferring
2. Storage: tracking
3. Complexity: managing
GPU cache coherence without coherence messages?
• YES: using global time

Temporal Coherence (TC)
• Global time
• At the L1: Local Timestamp > Global Time means the copy is VALID
• At the L2: Global Timestamp < Global Time means there are NO L1 COPIES
[Figure: Core 1 and Core 2, each with an L1D, connected through the interconnect to an L2 bank that holds A=0 with timestamp 0.]
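The two timestamp checks on this slide can be written down directly. The following is a minimal Python sketch, not the authors' implementation: the class and function names (`L1Line`, `L2Line`, `l1_copy_is_valid`, `no_l1_copies`) are hypothetical, and only the two comparisons come from the slide.

```python
from dataclasses import dataclass


@dataclass
class L1Line:
    value: int
    local_timestamp: int   # time until which this L1 copy may be used


@dataclass
class L2Line:
    value: int
    global_timestamp: int  # latest expiry granted to any L1 copy


def l1_copy_is_valid(line: L1Line, global_time: int) -> bool:
    # Slide rule: Local Timestamp > Global Time means VALID.
    return line.local_timestamp > global_time


def no_l1_copies(line: L2Line, global_time: int) -> bool:
    # Slide rule: Global Timestamp < Global Time means NO L1 COPIES.
    return line.global_timestamp < global_time
```

With these strict comparisons, at the instant the global time equals the timestamp the L1 already treats its copy as expired while the L2 does not yet assume the copies are gone, so the two sides never disagree in the unsafe direction.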

Temporal Coherence (TC)
[Animated figure with steps at T=0, T=11 and T=15: Core 1 issues Load A and receives A=0 with timestamp T=10, which the L2 bank also records. Core 2 issues Store A=1, and the L2 bank's value changes from A=0 to A=1. No coherence messages cross the interconnect.]

Temporal Coherence (TC)
What lifetime values should be requested on loads?
• Use a predictor to predict lifetime values
What about stores to unexpired blocks?
• Stall them at the L2?
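The stalling option raised above can be sketched as a small model of one L2 bank. This is an illustrative Python sketch under stated assumptions, not the protocol as specified by the authors: `StallingL2Bank`, its method names, the caller-supplied `lifetime` and the `-1` "never loaded" sentinel are all hypothetical. A store to a block whose global timestamp has not yet passed is reported as completing only once every L1 copy has expired.

```python
class StallingL2Bank:
    """Toy model of an L2 bank that stalls stores to unexpired blocks."""

    NEVER_LOADED = -1  # assumed sentinel: no L1 copy was ever granted

    def __init__(self):
        self.values = {}
        self.global_timestamp = {}

    def load(self, addr, now, lifetime):
        """Return (value, expiry) and remember the latest expiry granted."""
        expiry = now + lifetime
        previous = self.global_timestamp.get(addr, self.NEVER_LOADED)
        self.global_timestamp[addr] = max(previous, expiry)
        return self.values.get(addr, 0), expiry

    def store(self, addr, value, now):
        """Return the time at which the store is performed.

        The block has no L1 copies once global_timestamp < global time,
        so the earliest safe time is global_timestamp + 1. A real bank
        would hold the request until then; this model only reports the
        completion time and updates the value right away.
        """
        ts = self.global_timestamp.get(addr, self.NEVER_LOADED)
        done = max(now, ts + 1)
        self.values[addr] = value
        return done
```

For instance, after `bank.load("A", now=0, lifetime=10)`, a call to `bank.store("A", 1, now=5)` reports completion at time 11, the first instant at which the timestamp 10 is strictly in the past.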

TC Stalling Issues
Stall?
• Problem #1: Sensitive to mispredictions
• Problem #2: Impedes other accesses
• Problem #3: Hurts existing GPU applications
Solution: TC-Weak

TC-Weak
• Stores return Global Write Completion Time (GWCT)
• No stalling at L2
[Animated figure with steps at T=0, T=1 and T=31: one GPU core executes (1) data=NEW, (2) FENCE, (3) flag=SET. Each core has an L1D and a GWCT table with entries W0 and W1. Another core's L1D holds data=OLD with timestamp 30. At the L2 bank, data changes from OLD to NEW and flag changes from NULL to SET; the timestamps shown are 30 and 47.]
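The GWCT mechanism on this slide can be sketched as follows. This is a hedged Python illustration, not the authors' design: `TCWeakL2Bank`, `Wavefront`, `fence_release_time` and the `-1` sentinel are hypothetical names, and treating each GWCT table entry as a single running maximum per wavefront is an assumption. The store is applied at the L2 immediately, and only the fence waits for the stale copies to expire.

```python
class TCWeakL2Bank:
    """Toy L2 bank: stores never stall, they return a GWCT instead."""

    NEVER_LOADED = -1  # assumed sentinel: no L1 copy was ever granted

    def __init__(self, values):
        self.values = dict(values)
        self.global_timestamp = {}

    def load(self, addr, now, lifetime):
        """Return (value, expiry) and remember the latest expiry granted."""
        expiry = now + lifetime
        previous = self.global_timestamp.get(addr, self.NEVER_LOADED)
        self.global_timestamp[addr] = max(previous, expiry)
        return self.values[addr], expiry

    def store(self, addr, value, now):
        """Apply the store right away and return its GWCT.

        The GWCT here is the block's global timestamp: once global time
        has passed it, no L1 can still hold the old value.
        """
        self.values[addr] = value
        return self.global_timestamp.get(addr, self.NEVER_LOADED)


class Wavefront:
    """Keeps one GWCT table entry: the latest GWCT of its own stores."""

    def __init__(self):
        self.gwct = TCWeakL2Bank.NEVER_LOADED

    def store(self, bank, addr, value, now):
        self.gwct = max(self.gwct, bank.store(addr, value, now))

    def fence_release_time(self, now):
        # The fence completes once all prior stores are globally visible,
        # i.e. when global time is strictly greater than the GWCT.
        return max(now, self.gwct + 1)
```

Using the slide's numbers as an example: a reader loads `data` at T=0 with a lifetime of 30, the writer stores `data=NEW` at T=1 without stalling, and `fence_release_time(now=1)` yields 31, after which `flag=SET` can be stored.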

TC-Weak
[Table comparing Stalling and TC-Weak on three criteria:]
• Misprediction sensitivity
• Doesn’t impede other accesses
• Good for existing GPU applications

Methodology
• GPGPU-Sim v3.1.2 for GPU core model
• GEMS Ruby v2.1.1 for memory system
• All protocols written in SLICC
• Model a generic NVIDIA Fermi-based GPU (see paper for details)
Applications:
• 6 do not require coherence
• 6 require coherence
  • Barnes Hut
  • Cloth Physics
  • Versatile Place and Route
  • Max-Flow Min-Cut
  • 3D Wave Equation Solver
  • Octree Partitioning
[The slide groups the applications that require coherence by how they use it: locks, stencil communication, load balancing.]

Interconnect Traffic
• Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
• Lower traffic than 16x-sized, 32-way directory
[Chart: interconnect traffic (axis 0.00 to 1.50, with one clipped bar labeled 2.3) for NO-COH, MESI, GPU-VI and TC-Weak on applications that do not require coherence.]

Performance
• TC-Weak with simple predictor performs 85% better than disabling L1 caches
• Performs 28% better than TC with stalling
• Larger directory sizes do not improve performance
[Chart: speedup (axis 0.0 to 2.0) for NO-L1, MESI, GPU-VI and TC-Weak on applications that require coherence.]

Complexity
[Figure: protocol state tables for the non-coherent L1 and L2, the MESI L1 and L2, and the TC-Weak L1 and L2.]

Summary
• First work to characterize GPU coherence challenges
• Save traffic and energy by using global time
• Reduce protocol complexity
• 85% performance improvement over no coherence
Questions?

Backup Slides

Lifetime Predictor
• One prediction value per L2 bank
• Events local to L2 bank update prediction value
Events and their effect on the prediction:
1. Expired load: ↑ (prediction++)
2. Unexpired store: ↓ (prediction--)
3. Unexpired eviction: ↓ (prediction--)
[Animated figure at T=0 and T=20: Load A and Store A arrive at an L2 bank whose block A carries timestamps 10 and 30, and the bank's prediction value is adjusted.]
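The update rules above amount to a single counter per L2 bank. Here is a minimal Python sketch, assuming a step of one cycle (mirroring the slide's "prediction++" and "prediction--" labels) and a floor of zero; the class name, method names, initial value and floor are hypothetical and not taken from the slides.

```python
class LifetimePredictor:
    """One lifetime prediction per L2 bank, nudged by local events."""

    def __init__(self, initial=10, step=1, minimum=0):
        self.prediction = initial  # assumed starting lifetime
        self.step = step           # assumed adjustment per event
        self.minimum = minimum     # assumed floor

    def on_expired_load(self):
        # A load found its block already expired: lifetimes were too short.
        self.prediction += self.step

    def on_unexpired_store(self):
        # A store hit a block with live L1 copies: lifetimes were too long.
        self._decrease()

    def on_unexpired_eviction(self):
        # An unexpired block was evicted from the L2: also too long.
        self._decrease()

    def _decrease(self):
        self.prediction = max(self.minimum, self.prediction - self.step)
```

The bank would pass `predictor.prediction` as the lifetime of each load it serves; how the events are detected is left outside this sketch.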

TC-Strong vs TC-Weak
[Chart, two panels, speedup over all applications (axis ticks 0.6 to 1.4). Panel 1: fixed lifetime for all applications. Panel 2: best lifetime for each application. Legend entries: TCSUO, TCSOO, TCS, TCW, TCW w/ predictor.]

Interconnect Power and Energy