HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
JASON POWER*, ARKAPRAVA BASU*, JUNLI GU†, SOORAJ PUTHOOR†, BRADFORD M BECKMANN†, MARK D HILL*†, STEVEN K REINHARDT†, DAVID A WOOD*†
*University of Wisconsin-Madison
†Advanced Micro Devices, Inc.

ABSTRACT
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
  ‒ High bandwidth difficult to support at directory
  ‒ Extreme resource requirements
We propose Heterogeneous System Coherence
  ‒ Leverages spatial locality and region coherence
  ‒ Reduces bandwidth by 94%
  ‒ Reduces resource requirements by 95%
3 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

PHYSICAL INTEGRATION
[Figure labels: Stacked High-bandwidth DRAM; GPU Cores. Credit: IBM]

LOGICAL INTEGRATION
General-purpose GPU computing
  ‒ OpenCL
  ‒ CUDA
Heterogeneous Uniform Memory Access (hUMA)
  ‒ Shared virtual address space
  ‒ Cache coherence
Allows new heterogeneous apps

OUTLINE
Motivation
Background
  ‒ System overview
  ‒ Cache architecture reminder
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
Conclusions

SYSTEM OVERVIEW
SYSTEM LEVEL
High-bandwidth interconnect

SYSTEM OVERVIEW
APU
GPU compute accesses must stay coherent
Direct-access bus (used for graphics)
Invalidation traffic
Arrow thickness → bandwidth

SYSTEM OVERVIEW
GPU
Very high bandwidth: L2 has high miss rate

SYSTEM OVERVIEW
Low bandwidth: Low L2 miss rate

CACHE ARCHITECTURE REMINDER
CPU/GPU L2 CACHE
Demand requests from L1
Searches cache tags for a tag match
Allocates an MSHR entry
On a hit, return data to the L1
On a miss, send request to directory
On a directory probe, check MSHRs and tags
Tag hit on probe: send data to other core

DIRECTORY ARCHITECTURE REMINDER
DIRECTORY
Demand requests from L2
Searches cache tags for a tag match
Allocates an MSHR entry
On a miss, the data comes from DRAM
Allocate and send probes to L2 caches

BACKGROUND SUMMARY
System under investigation
  ‒ Heterogeneous CPU-GPU on chip
  ‒ High-bandwidth DRAM
Directory pipeline complex
  ‒ MSHR array is associative
  ‒ Difficult to pipeline with more than 1 request per cycle
  ‒ Important resources: MSHR entries
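The directory behaviour summarized above can be sketched as a toy model. This is a minimal illustration, not the simulated design: the class name `Directory`, the return strings, and the one-request-per-block rule are all assumptions made for the example. It only shows why a bounded MSHR array turns into back-pressure once every entry is held by an in-flight request.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy block-based directory with a bounded MSHR array (hypothetical model)."""
    num_mshrs: int
    tags: dict = field(default_factory=dict)   # block address -> set of sharer ids
    mshrs: dict = field(default_factory=dict)  # block address -> requester id

    def request(self, block_addr: int, requester: str) -> str:
        """Handle a demand request from an L2 and return the action taken."""
        if block_addr in self.mshrs:
            return "stall"   # a request for this block is already in flight
        if len(self.mshrs) >= self.num_mshrs:
            return "stall"   # MSHR array full: back-pressure on the L2
        self.mshrs[block_addr] = requester  # entry is held until complete()
        if block_addr in self.tags:
            return "probe"   # tag match: send probes to the sharing L2 caches
        return "dram"        # tag miss: the data comes from DRAM

    def complete(self, block_addr: int) -> None:
        """Finish a request: record the sharer and free the MSHR entry."""
        requester = self.mshrs.pop(block_addr)
        self.tags.setdefault(block_addr, set()).add(requester)
```

With `Directory(num_mshrs=2)`, a third concurrent request stalls until one of the first two calls `complete()`, which is the effect the later slides attribute to limited MSHRs.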

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
  ‒ Simulation overview
  ‒ Directory bandwidth
  ‒ MSHRs
  ‒ Performance is significantly affected
Heterogeneous System Coherence Details
Results
Conclusions

SIMULATION DETAILS
gem5 simulator
  ‒ Simple CPU
  ‒ GPU simulator based on AMD GCN
  ‒ All memory requests through gem5
Workloads
  ‒ Modified to use hUMA
  ‒ Rodinia & AMD APP SDK

CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
  ‒ bp: trains the connection weights on a neural network
  ‒ bfs: breadth-first search
  ‒ hs: performs a transient 2D thermal simulation (5-point stencil)
  ‒ lud: matrix decomposition
  ‒ nw: performs a global optimization for DNA sequence alignment
  ‒ km: does k-means clustering
  ‒ sd: speckle-reducing anisotropic diffusion
AMD SDK
  ‒ bn: bitonic sort
  ‒ dct: discrete cosine transform
  ‒ hg: histogram
  ‒ mm: matrix multiplication

SYSTEM BOTTLENECKS
Difficult to scale directory bandwidth
  ‒ Difficult to multi-port
  ‒ Complicated pipeline
High resource usage
  ‒ Must allocate MSHR for entire duration of request
  ‒ MSHR array difficult to scale
[Figure callouts: Designed to support CPU bandwidth; High bandwidth]

DIRECTORY TRAFFIC
[Figure: directory accesses per GPU cycle for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Difficult to support >1 request per cycle

RESOURCE USAGE
[Figure: maximum MSHRs for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Very difficult to scale MSHR array
Steady state at 700 GB/s
Causes significant back-pressure on L2s

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
[Figure: slow-down for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
Directory bandwidth
  ‒ Must support up to 4 requests per cycle
  ‒ Difficult to construct pipeline
Resource usage
  ‒ MSHRs are a constraining resource
  ‒ Need more than 10,000
  ‒ Without resource constraints, up to 4x better performance
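A rough back-of-the-envelope check of why bandwidth and MSHR count grow together can be written with Little's law (requests in flight = request rate × latency). The 700 GB/s peak bandwidth, 1 GHz GPU clock and 64-B block size come from the slides; the function names and the 100 ns latency are placeholders chosen for illustration, not figures from the talk.

```python
PEAK_BANDWIDTH = 700e9   # bytes per second (from the simulation details)
BLOCK_BYTES = 64         # cache block size (from the region description)
GPU_CLOCK_HZ = 1e9       # GPU clock (from the simulation details)
ASSUMED_LATENCY_S = 100e-9  # ASSUMPTION: hypothetical request latency, not from the talk


def peak_blocks_per_cycle(bandwidth: float, block_bytes: int, clock_hz: float) -> float:
    """Block requests per GPU cycle needed to sustain the given bandwidth."""
    return bandwidth / block_bytes / clock_hz


def mshrs_in_flight(bandwidth: float, block_bytes: int, latency_s: float) -> float:
    """Little's law: outstanding requests = request rate * latency."""
    return bandwidth / block_bytes * latency_s


per_cycle = peak_blocks_per_cycle(PEAK_BANDWIDTH, BLOCK_BYTES, GPU_CLOCK_HZ)
in_flight = mshrs_in_flight(PEAK_BANDWIDTH, BLOCK_BYTES, ASSUMED_LATENCY_S)
print(f"{per_cycle:.2f} blocks per GPU cycle at peak bandwidth")
print(f"{in_flight:.0f} requests in flight at the assumed latency")
```

Because every in-flight request holds an MSHR for its whole duration, the required MSHR count scales linearly with whatever latency is plugged in here.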

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
  ‒ Overall system design
  ‒ Region buffer design
  ‒ Region directory design
  ‒ Example
  ‒ Hardware complexity
Results
Conclusions

BASELINE DIRECTORY COHERENCE
[Figure labels: Initialization; Kernel Launch; Read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure labels: Initialization; Kernel Launch]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
Region buffers coordinate with region directory
Direct-access bus

HSC: EXAMPLE MEMORY REQUEST
[Figure labels: GPU L2 Cache; GPU Region Buffer; Region Directory]

HSC: L2 CACHE & REGION BUFFER
Region tags and permissions
Only region-level permission traffic
Interface for direct-access bus

HSC: REGION DIRECTORY
Region tags, sharers, and permissions

HSC: HARDWARE COMPLEXITY
Region protocols reduce directory size
  ‒ Region directory: 8x fewer entries
Region buffers
  ‒ At each L2 cache
  ‒ 1-KB region (16 64-B blocks)
  ‒ 16-K region entries
  ‒ Overprovisioned for low-locality workloads
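The sizes on this slide translate into simple address arithmetic. The sketch below assumes power-of-two, naturally aligned regions and uses hypothetical helper names; only the constants (64-B blocks, 16 blocks per region, 16-K region entries, a 256k-entry baseline directory, 8x fewer region-directory entries) come from the slides.

```python
BLOCK_BYTES = 64                                 # cache block size
BLOCKS_PER_REGION = 16                           # blocks tracked per region
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION   # 1 KB region
REGION_BUFFER_ENTRIES = 16 * 1024                # 16-K entries per region buffer
BASELINE_DIRECTORY_ENTRIES = 256 * 1024          # baseline block-based directory


def region_of(addr: int) -> int:
    """Region number (the region tag) that a byte address falls in."""
    return addr // REGION_BYTES


def block_in_region(addr: int) -> int:
    """Index (0..15) of the 64-B block within its region."""
    return (addr % REGION_BYTES) // BLOCK_BYTES


def buffer_coverage_bytes() -> int:
    """Memory one fully populated region buffer can hold permissions for."""
    return REGION_BUFFER_ENTRIES * REGION_BYTES


def region_directory_entries() -> int:
    """Entry count implied by '8x fewer entries' than the baseline directory."""
    return BASELINE_DIRECTORY_ENTRIES // 8
```

Under these assumptions one region buffer covers 16 MB, several times the 4 MB GPU L2 it sits beside, which is consistent with the slide's note that the buffers are overprovisioned.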

HSC SUMMARY
Key insight
  ‒ GPU-CPU applications exhibit high spatial locality
  ‒ Use direct-access bus present in systems
  ‒ Offload bandwidth onto direct-access bus
Use coherence network only for permission
Add region buffer to track region information
  ‒ At each L2 cache
  ‒ Bypass coherence network and directory
Replace directory with region directory
  ‒ Significantly reduces total size needed
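The effect of tracking permission per region can be illustrated by counting directory requests for a stream of L2 misses. This is a deliberately simplified sketch with a single requester: it ignores sharing with the CPU, invalidations, and region-buffer evictions, and the function names are hypothetical. In the baseline model every L2 miss goes to the directory; in the region model only the first miss to a region asks the region directory for permission, and later misses to that region go straight to memory.

```python
BLOCK_BYTES = 64
REGION_BYTES = 16 * BLOCK_BYTES  # 1-KB region, 16 blocks


def baseline_directory_requests(miss_addrs) -> int:
    """Block-based directory: every L2 miss is a directory request."""
    return sum(1 for _ in miss_addrs)


def region_directory_requests(miss_addrs) -> int:
    """Region-based model: one permission request per region not yet held."""
    region_buffer = set()   # regions this L2 already holds permission for
    requests = 0
    for addr in miss_addrs:
        region = addr // REGION_BYTES
        if region not in region_buffer:
            requests += 1   # ask the region directory for permission
            region_buffer.add(region)
        # With permission held, the data itself bypasses the directory.
    return requests


# Streaming over 64 KB touches every block of each region (high spatial locality).
streaming = range(0, 64 * 1024, BLOCK_BYTES)
# A 1-KB stride touches one block per region (no spatial locality).
strided = range(0, 1024 * 1024, REGION_BYTES)
```

For the streaming pattern the region model issues one request per 16 misses, the theoretical reduction from 16-block regions; for the strided pattern it issues as many requests as the baseline, which is why the approach depends on spatial locality.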

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
  ‒ Speed-up
  ‒ Latency of loads
  ‒ Bandwidth
  ‒ MSHR usage
Conclusions

THREE CACHE-COHERENCE PROTOCOLS
Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size

HSC PERFORMANCE
[Figure: normalized speed-up of Broadcast, Baseline and HSC for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Largest slow-downs from constrained resources

DIRECTORY TRAFFIC REDUCTION
[Figure: normalized directory bandwidth of broadcast, baseline and HSC for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Average bandwidth significantly reduced
Theoretical reduction from 16 block regions

HSC RESOURCE USAGE
[Figure: normalized directory MSHRs required for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Maximum MSHRs significantly reduced

RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
  ‒ Reduces the average load latency
  ‒ Decreases bandwidth requirement of directory
HSC reduces the required MSHRs at the directory

RELATED WORK
Coarse-grained coherence
  ‒ Region coherence
  ‒ Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
  ‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
  ‒ Spatiotemporal coherence [Alisafaee, MICRO 2012]
  ‒ Dual-grain directory coherence [Basu, UW-TR 2013]
  ‒ Primarily focused on directory size
GPU coherence [Singh et al. HPCA 2013]
  ‒ Intra-GPU coherence

CONCLUSIONS
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
  ‒ High bandwidth difficult to support at directory
  ‒ Extreme resource requirements
We propose Heterogeneous System Coherence
  ‒ Leverages spatial locality and region coherence
  ‒ Reduces bandwidth by 94%
  ‒ Reduces resource requirements by 95%

Questions?
Contact: powerjg@cs.wisc.edu

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

LOAD LATENCY
[Figure: normalized load latency of broadcast, baseline and HSC for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]
Average load time significantly reduced

EXECUTION TIME BREAKDOWN
[Figure: execution time (%) split between GPU and CPU for each benchmark (bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm)]