HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems

HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems
Xiaowei Ren (1, 2), Daniel Lustig (2), Evgeny Bolotin (2), Aamer Jaleel (2), Oreste Villa (2), David Nellans (2)
(1) The University of British Columbia, (2) NVIDIA

Coming Up
• NUMA behavior bottlenecks performance in multi-GPU systems
• Mitigating the NUMA impact requires caching and cache coherence, but existing cache coherence protocols do not scale
• Our work:
  • Insight: GPUs offer a weaker memory model, less data sharing, and a latency-tolerant architecture
  • Result: 97% of the performance of an idealized caching system

HMG: Hierarchical Multi-GPUs
• [Figure: single-GPU designs (a monolithic GPU or an MCM-GPU built from GPMs) scale out into multi-GPU systems connected by NVSwitch; links between GPMs inside a GPU run at ~2 TB/s, links between GPUs at ~200 GB/s]
• GPM = GPU Chip Module
• NUMA behavior bottlenecks performance scaling and requires caching and cache coherence protocols

Existing Coherence Protocols Don't Scale
• Evaluation setup: a 4-GPU system with 4 GPMs (GPU Chip Modules) per GPU
• SW cache coherence: the entire cache is invalidated at synchronization points
• [Chart: normalized speedup of No-Caching, SW, and Ideal; SW coherence is 29% worse than idealized caching]
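
The sketch below (my illustration, not from the talk) models what bulk-invalidation software coherence implies: the cache keeps no sharer information, so the only safe action at a synchronization point is to drop every line, and subsequent loads all miss. The `L2Cache` type and `fetch_remote` callback are hypothetical.

```cuda
#include <cstdint>
#include <unordered_map>

// Minimal model of an L2 cache under software ("SW") coherence.
struct L2Cache {
    std::unordered_map<uint64_t, uint32_t> lines;    // line address -> cached value

    // Plain load: hit if present, otherwise fetch across the slow remote link.
    uint32_t load(uint64_t addr, uint32_t (*fetch_remote)(uint64_t)) {
        auto it = lines.find(addr);
        if (it != lines.end()) return it->second;    // local hit
        uint32_t v = fetch_remote(addr);             // costly inter-GPM/inter-GPU fetch
        lines[addr] = v;
        return v;
    }

    // Synchronization point under SW coherence: with no directory to identify
    // stale lines, the entire cache is invalidated and all reuse is lost.
    void synchronize() { lines.clear(); }
};
```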

Existing Coherence Protocols Don't Scale (cont.)
• HW-VI cache coherence: fine-grained cache-line invalidations, slightly better than software coherence
• Designed without considering the NUMA effect, where inter-GPU link bandwidth is the critical bottleneck
• Assumes a stronger memory model than GPUs need
• [Chart: normalized speedup of No-Caching, SW, HW-VI, and Ideal; HW-VI is 21% worse than idealized caching]

Leveraging Scoped Memory Model
• Synchronization is scoped: coherence only needs to be enforced in the subset of caches covered by the scope in question
• Store results can become visible to some threads earlier than to others
• Store requests therefore do not need to stall until all other sharers are invalidated
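
As a concrete illustration of scoped synchronization (my example, written against CUDA's libcu++ rather than taken from the talk), the scope of an atomic flag determines how far a release must be made visible: `cuda::thread_scope_device` corresponds to the .gpu scope, while .sys would use `cuda::thread_scope_system`. A release at .gpu scope only has to make the payload visible within one GPU's caches, which is exactly the freedom HMG exploits.

```cuda
// Scoped release/acquire with CUDA's libcu++ atomics. The atomic's thread scope
// determines which set of caches must be made coherent by the release.
#include <cuda/atomic>

using gpu_flag = cuda::atomic<int, cuda::thread_scope_device>;  // ".gpu" scope

__global__ void producer(int* payload, gpu_flag* flag) {
    *payload = 42;
    // Behaves like st.release.gpu: only caches within this GPU's scope are
    // required to observe the payload write before the flag value is seen.
    flag->store(1, cuda::std::memory_order_release);
}

__global__ void consumer(int* payload, gpu_flag* flag, int* out) {
    // ld.acquire at .gpu scope: synchronizes with the release above, but asks
    // nothing of caches on other GPUs (that would need thread_scope_system).
    while (flag->load(cuda::std::memory_order_acquire) == 0) { }
    *out = *payload;
}
```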

HMG Overview
• [Figure: each GPM contains SMs with L1 caches, an L2 cache, and a coherence directory (state, tag, sharers); GPMs form MCM-GPUs, and GPUs are connected by NVSwitch]
• Directory-based cache coherence: the directory tracks all sharers of each line
• Synchronization scopes map to cache levels: .cta → L1$, .gpu/.sys → L2$
• L1$ coherence is software-maintained; HMG focuses on the L2$
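
A minimal sketch of the per-line directory state the slide names (state, tag, sharers); the field widths and the bit-vector sharer encoding are my assumptions, not details from the talk.

```cuda
#include <cstdint>

// One coherence-directory entry per tracked L2 line: state, tag, and sharers.
// In a GPU-level directory the sharer bits denote GPMs; in a system-level
// directory (introduced later in the deck) they denote GPUs.
struct DirectoryEntry {
    enum class State : uint8_t { Invalid, Valid };   // no transient states needed
    State    state   = State::Invalid;
    uint64_t tag     = 0;          // which line this entry tracks
    uint16_t sharers = 0;          // bit i set => cache i may hold the line
};
```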

Extending to Scoped GPU Memory Model
• [Figure: GPM 0 is the home of address A; a store from GPM 1 is written through to the home, whose directory (sharers [1, 2]) sends inv A to the other sharer, GPM 2, while a non-atomic load at GPM 2 may still hit its local copy]
• Assign a home cache to each address
• Non-atomic loads can hit in all caches
• Because some GPMs may see the latest value earlier than others:
  • No invalidation acks are needed for store requests
  • No transient states are needed, reducing stalls
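
The following sketch (my structuring, not the presenters' hardware) shows how a home directory can apply these rules: a plain store writes through, triggers invalidations to the other sharers, and completes without waiting for acks or entering a transient state.

```cuda
#include <cstdint>
#include <unordered_map>

// Stand-in for an on-chip network message; a no-op in this sketch.
static void send_invalidation(int gpm, uint64_t line) { (void)gpm; (void)line; }

struct HomeDirectory {
    static constexpr int kNumGPMs = 4;
    std::unordered_map<uint64_t, uint16_t> sharers;  // line -> bit-vector of GPMs

    // Plain (non-release) store written through to the home cache: invalidate
    // the other sharers, but do NOT collect acks and enter no transient state;
    // the scoped model tolerates some GPMs seeing the new value earlier.
    void handle_store(uint64_t line, int src_gpm) {
        uint16_t& s = sharers[line];
        for (int gpm = 0; gpm < kNumGPMs; ++gpm)
            if (gpm != src_gpm && (s & (1u << gpm)))
                send_invalidation(gpm, line);
        s = uint16_t(1u << src_gpm);      // only the writer remains a sharer
    }

    // Non-atomic load: record the requester as a sharer and reply with the data.
    void handle_load(uint64_t line, int src_gpm) {
        sharers[line] |= uint16_t(1u << src_gpm);
    }
};
```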

Extending to Scoped GPU Memory Model (cont.)
• [Figure: GPM 1 issues st.release.gpu A, which is written through to the home (GPM 0); the home forwards it to the other GPMs, which invalidate A and ack; once all acks return, the home acks GPM 1]
• An ld.acquire at greater than .cta scope invalidates the L1 cache, but not the L2
• A st.release is forwarded to all GPMs to clear all in-flight invalidations
• The st.release retires only after it has been acked
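
A sketch of how the home might track a st.release under these rules; `kNumGPMs` and the simple ack counter are illustrative choices of mine, not the presenters' design.

```cuda
#include <cstdint>

// The release is forwarded to every GPM so that invalidations ordered before it
// are drained; it retires only after all GPMs have acknowledged.
struct ReleaseTracker {
    static constexpr int kNumGPMs = 4;
    int pending_acks = 0;

    // Home receives st.release.gpu from src_gpm for the given line.
    void start_release(uint64_t line, int src_gpm) {
        (void)line; (void)src_gpm;
        pending_acks = kNumGPMs;   // forward to all GPMs, expect one ack from each
    }

    // A GPM acks once its in-flight invalidations are cleared. Returns true when
    // the release may retire and be acknowledged back to the issuing GPM.
    bool on_ack() { return --pending_acks == 0; }
};
```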

Problem of Extending to Multi-GPUs
• [Figure: with a single flat home (GPM 0 of GPU 0), a load of A from GPM 1 of GPU 1 and its reply must cross the ~200 GB/s inter-GPU link, even though the intra-GPU inter-GPM network runs at ~2 TB/s]
• The critical bottleneck is the NUMA effect caused by this bandwidth difference
• 67% of inter-GPU loads are redundant
• Recording data sharing hierarchically can avoid these redundant inter-GPU loads

Hierarchical Multi-GPU Cache Coherence
• [Figure: GPM 1 of GPU 1 loads A; the request goes to A's GPU home (GPM 0 of GPU 1), which records sharer [GPM 1] and forwards one load across the inter-GPU link to A's system home in GPU 0, which records sharer [GPU 1]; the reply returns along the same path]
• Assign both a system home cache and a GPU home cache to each address
• Loads and store invalidations are propagated hierarchically
• Unlike CPUs, no extra structures or coherence states are needed to reduce latency; GPUs are latency-tolerant
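
A sketch of the two-level home assignment and the hierarchical load path; the interleaved mapping functions and bit-vector directories are my own illustrative choices under the rules stated on the slide.

```cuda
#include <cstdint>
#include <unordered_map>

constexpr int kGPMsPerGPU = 4;
constexpr int kNumGPUs    = 4;

// Every address gets a GPU home (a GPM inside each GPU) and a system home
// (one GPU in the system); plain interleaving is assumed here for illustration.
int gpu_home_gpm(uint64_t line) { return int(line % kGPMsPerGPU); }
int sys_home_gpu(uint64_t line) { return int(line % kNumGPUs);    }

struct GpuHomeDir { std::unordered_map<uint64_t, uint16_t> gpm_sharers; };  // per GPU
struct SysHomeDir { std::unordered_map<uint64_t, uint16_t> gpu_sharers; };  // per system

// Hierarchical load: a miss in the requesting GPM goes to its GPU home first.
// Only the first sharer within a GPU crosses the slow inter-GPU link to the
// system home; later loads from the same GPU are served locally.
void load(uint64_t line, int gpu, int gpm, GpuHomeDir& gpu_home, SysHomeDir& sys_home) {
    uint16_t& gpms = gpu_home.gpm_sharers[line];
    if (gpms == 0)                                          // first sharer in this GPU
        sys_home.gpu_sharers[line] |= uint16_t(1u << gpu);  // one inter-GPU request
    gpms |= uint16_t(1u << gpm);                            // record the requesting GPM
}
```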

Hierarchical Multi-GPU Cache Coherence (cont.)
• [Figure: a store to A reaches A's system home, whose directory lists [GPU 1] as a sharer; it sends inv A across the inter-GPU link to A's GPU home in GPU 1, whose directory lists [GPM 1], and that home forwards inv A to GPM 1]
• Store invalidations are propagated hierarchically, mirroring the load path
• ld.acquire and st.release behave as in the single-GPU scenario
• A st.release retires after it has been acked
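
For symmetry with the load sketch, here is the hierarchical invalidation fan-out on a store (again my structuring, not the presenters' hardware): the system home sends one invalidation per sharing GPU, and each GPU home forwards it to its own sharing GPMs over the fast intra-GPU links.

```cuda
#include <cstdint>
#include <unordered_map>

// Two-level directories: the system home tracks sharing GPUs; each GPU home
// tracks sharing GPMs within that GPU.
struct GpuDir { std::unordered_map<uint64_t, uint16_t> gpm_sharers; };
struct SysDir { std::unordered_map<uint64_t, uint16_t> gpu_sharers; };

// The GPU home fans the invalidation out to its own GPMs (fast intra-GPU links).
void invalidate_within_gpu(GpuDir& dir, uint64_t line) {
    dir.gpm_sharers[line] = 0;     // one message per sharing GPM, not modeled here
}

// The system home handles a store: one invalidation per sharing GPU crosses the
// slow inter-GPU link; acks are still only collected for st.release.
void handle_store(uint64_t line, int writer_gpu, SysDir& sys, GpuDir (&gpu_homes)[4]) {
    uint16_t& sharers = sys.gpu_sharers[line];
    for (int gpu = 0; gpu < 4; ++gpu)
        if (gpu != writer_gpu && (sharers & (1u << gpu)))
            invalidate_within_gpu(gpu_homes[gpu], line);
    sharers = uint16_t(1u << writer_gpu);
}
```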

Overall Performance
• [Chart: normalized speedup of No-Caching, SW, HW, HMG, and Ideal; HMG is only 3% slower than the idealized caching system]
• Why?
  • Each store request invalidates 1.5 valid cache lines
  • Each coherence directory eviction invalidates 1 valid cache line
  • The bandwidth cost of invalidation messages is 3.58 GB/s

Hardware Cost and Scalability
• Coherence directory storage is 2.7% of each GPM's L2 cache data capacity
• HMG-50%: cutting the coherence directory size by 50% makes performance only slightly worse
• [Chart: normalized speedup of No-Caching, HMG-50%, HMG, and Ideal]
• HMG is scalable to future, bigger multi-GPU systems

Summary
• Hierarchical cache coherence is necessary to mitigate the NUMA effect in multi-GPUs
• Unlike in CPUs, extending coherence to multi-GPUs does not need extra hardware structures or transient coherence states
• Leveraging the latest scoped memory model can significantly simplify the design of GPU coherence protocols
Thank you for listening!