HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
Jason Power*, Arkaprava Basu*, Junli Gu†, Sooraj Puthoor†, Bradford M. Beckmann†, Mark D. Hill*†, Steven K. Reinhardt†, David A. Wood*†
*University of Wisconsin-Madison
†Advanced Micro Devices, Inc.

ABSTRACT
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
‒ High bandwidth difficult to support at directory
‒ Extreme resource requirements
We propose Heterogeneous System Coherence
‒ Leverages spatial locality and region coherence
‒ Reduces bandwidth by 94%
‒ Reduces resource requirements by 95%
HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

PHYSICAL INTEGRATION
[Figure labels: Stacked high-bandwidth DRAM; GPU cores. Credit: IBM]

LOGICAL INTEGRATION
General-purpose GPU computing
‒ OpenCL
‒ CUDA
Heterogeneous Uniform Memory Access (hUMA)
‒ Shared virtual address space
‒ Cache coherence
Allows new heterogeneous apps

OUTLINE
Motivation
Background
‒ System overview
‒ Cache architecture reminder
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL
High-bandwidth interconnect

SYSTEM OVERVIEW: APU
GPU compute accesses must stay coherent
Direct-access bus (used for graphics)
Invalidation traffic
[Figure legend: arrow thickness indicates bandwidth]

SYSTEM OVERVIEW: GPU
Very high bandwidth: L2 has high miss rate

SYSTEM OVERVIEW
Low bandwidth: low L2 miss rate

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE
Demand requests from L1
Searches cache tags for a tag match
Allocates an MSHR entry
On a hit, return data to the L1
On a miss, send request to directory
On a directory probe, check MSHRs and tags
Tag hit on probe: send data to other core

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY
Demand requests from L2
Searches cache tags for a tag match
Allocates an MSHR entry
Allocate and send probes to L2 caches
On a miss, the data comes from DRAM

BACKGROUND SUMMARY
System under investigation
‒ Heterogeneous CPU-GPU on chip
‒ High-bandwidth DRAM
Directory pipeline complex
‒ MSHR array is associative
‒ Difficult to pipeline with more than 1 request per cycle
‒ Important resources: MSHR entries

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
‒ Simulation overview
‒ Directory bandwidth
‒ MSHRs
‒ Performance is significantly affected
Heterogeneous System Coherence Details
Results
Conclusions

SIMULATION DETAILS
gem5 simulator
‒ Simple CPU
‒ GPU simulator based on AMD GCN
‒ All memory requests through gem5
Workloads
‒ Modified to use hUMA
‒ Rodinia & AMD APP SDK
System parameters
  CPU Clock           2 GHz
  CPU Cores           2
  CPU Shared L2       2 MB (16-way banked)
  GPU Clock           1 GHz
  Compute Units       32
  GPU Shared L2       4 MB (64-way banked)
  L3 (Memory-side)    16 MB (16-way banked)
  DRAM                DDR3, 16 channels
  Peak Bandwidth      700 GB/s
  Baseline Directory  256 k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
‒ bp: trains the connection weights on a neural network
‒ bfs: breadth-first search
‒ hs: performs a transient 2D thermal simulation (5-point stencil)
‒ lud: matrix decomposition
‒ nw: performs a global optimization for DNA sequence alignment
‒ km: does k-means clustering
‒ sd: speckle-reducing anisotropic diffusion
AMD SDK
‒ bn: bitonic sort
‒ dct: discrete cosine transform
‒ hg: histogram
‒ mm: matrix multiplication

SYSTEM BOTTLENECKS
Difficult to scale directory bandwidth
‒ Difficult to multi-port
‒ Complicated pipeline
High resource usage
‒ Must allocate MSHR for entire duration of request
‒ MSHR array difficult to scale
[Figure labels: Designed to support CPU bandwidth; High bandwidth]

DIRECTORY TRAFFIC
[Figure: directory accesses per GPU cycle (y-axis, 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Difficult to support >1 request per cycle

RESOURCE USAGE
[Figure: maximum MSHRs (y-axis ticks 100, 1000, 100000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Very difficult to scale MSHR array
Steady state at 700 GB/s
Causes significant back-pressure on L2s

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
[Figure: slow-down (y-axis, 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
Directory bandwidth
‒ Must support up to 4 requests per cycle
‒ Difficult to construct pipeline
Resource usage
‒ MSHRs are a constraining resource
‒ Need more than 10,000
‒ Without resource constraints, up to 4x better performance

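The MSHR pressure summarized above can be related to bandwidth through Little's law: entries in flight equal the request rate times the time each entry stays allocated. The sketch below is only a back-of-the-envelope estimate. The 700 GB/s figure is the peak bandwidth from the simulation details and the 64-B block size comes from the region description in the hardware complexity slide; the 1000 ns MSHR occupancy and the function name are hypothetical, chosen purely for illustration and not taken from the slides.

```python
def mshrs_in_flight(bandwidth_gb_per_s: int, block_bytes: int, occupancy_ns: int) -> int:
    """Little's law estimate of MSHR entries needed to sustain a bandwidth.

    1 GB/s equals 1 byte/ns, so bandwidth * occupancy is the number of bytes
    in flight; dividing by the block size (rounding up) gives entries.
    """
    bytes_in_flight = bandwidth_gb_per_s * occupancy_ns
    return -(-bytes_in_flight // block_bytes)  # ceiling division


# 700 GB/s peak bandwidth and 64-B blocks appear in the slides.
# The 1000 ns occupancy is an ASSUMPTION for illustration only.
estimate = mshrs_in_flight(bandwidth_gb_per_s=700, block_bytes=64, occupancy_ns=1000)
print(estimate)
```

The point of the sketch is the scaling, not the exact number: because an MSHR is held for the entire duration of a request, the required array size grows linearly with both bandwidth and occupancy.
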
OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
‒ Overall system design
‒ Region buffer design
‒ Region directory design
‒ Example
‒ Hardware complexity
Results
Conclusions

BASELINE DIRECTORY COHERENCE
[Figure labels: Initialization; Kernel Launch; Read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure labels: Initialization; Kernel Launch]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
Region buffers coordinate with region directory
Direct-access bus

HSC: EXAMPLE MEMORY REQUEST
[Figure labels: GPU L2 Cache; GPU Region Buffer; Region Directory]

HSC: L2 CACHE & REGION BUFFER
Region tags and permissions
Only region-level permission traffic
Interface for direct-access bus

HSC: REGION DIRECTORY
Region tags, sharers, and permissions

HSC: HARDWARE COMPLEXITY
Region protocols reduce directory size
‒ Region directory: 8x fewer entries
Region buffers
‒ At each L2 cache
‒ 1-KB region (16 64-B blocks)
‒ 16-K region entries
‒ Overprovisioned for low-locality workloads

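To make the region geometry concrete, here is a minimal sketch of how an address could be split into a region tag, a block index within the region, and a byte offset, using the sizes on this slide (64-B blocks, 16 blocks per 1-KB region, 16-K region entries). The function name and the flat, power-of-two address layout are assumptions for illustration; the slides do not specify the actual indexing.

```python
BLOCK_BYTES = 64                                 # cache block size, from the slide
BLOCKS_PER_REGION = 16                           # from the slide
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION   # 1 KB
REGION_ENTRIES = 16 * 1024                       # 16-K entries per region buffer


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (region_tag, block_index_in_region, byte_offset_in_block)."""
    byte_offset = addr % BLOCK_BYTES
    block_index = (addr // BLOCK_BYTES) % BLOCKS_PER_REGION
    region_tag = addr // REGION_BYTES
    return region_tag, block_index, byte_offset


# Memory reachable through one fully occupied region buffer.
coverage_bytes = REGION_ENTRIES * REGION_BYTES

print(split_address(0x12345))                    # (72, 13, 5)
print(coverage_bytes // (1024 * 1024), "MiB")    # 16 MiB
```

Under these assumptions a full region buffer covers 16 MiB, which is larger than the 4 MB GPU L2 listed in the simulation details; that is one way to read the "overprovisioned" bullet.
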
HSC SUMMARY
Key insight
‒ GPU-CPU applications exhibit high spatial locality
‒ Use direct-access bus present in systems
‒ Offload bandwidth onto direct-access bus
Use coherence network only for permission
Add region buffer to track region information
‒ At each L2 cache
‒ Bypass coherence network and directory
Replace directory with region directory
‒ Significantly reduces total size needed

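The summary above can be illustrated with a toy model of a region buffer: the first L2 miss to a region asks the region directory for permission, and later misses to the same region bypass the coherence network. This is a simplified sketch, not the protocol from the paper. It ignores read versus write permissions, region evictions, and probes from the other processor, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegionBufferSketch:
    """Toy model: tracks which 1-KB regions this L2 already holds permission for."""
    region_bytes: int = 1024
    held_regions: set = field(default_factory=set)
    directory_requests: int = 0   # permission requests over the coherence network
    direct_accesses: int = 0      # data fetched over the direct-access bus

    def on_l2_miss(self, addr: int) -> str:
        region = addr // self.region_bytes
        if region not in self.held_regions:
            # First touch: ask the region directory for region-level permission.
            self.directory_requests += 1
            self.held_regions.add(region)
            route = "region directory, then direct-access bus"
        else:
            route = "direct-access bus"
        # Assumption: the data itself always moves over the direct-access bus.
        self.direct_accesses += 1
        return route


buf = RegionBufferSketch()
for block in range(32):           # stream through two regions, one 64-B block at a time
    buf.on_l2_miss(block * 64)
print(buf.directory_requests, buf.direct_accesses)  # prints: 2 32
```

In this streaming case only 2 of 32 misses reach the directory, which is the effect the region buffer is meant to exploit when spatial locality is high.
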
OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
‒ Speed-up
‒ Latency of loads
‒ Bandwidth
‒ MSHR usage
Conclusions

THREE CACHE-COHERENCE PROTOCOLS
Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size

HSC PERFORMANCE
[Figure: normalized speed-up (y-axis, 0 to 5) of Broadcast, Baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Largest slow-downs from constrained resources

DIRECTORY TRAFFIC REDUCTION
[Figure: normalized directory bandwidth (y-axis, 0 to 1.2) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Average bandwidth significantly reduced
Theoretical reduction from 16 block regions

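The "theoretical reduction" annotation can be read as simple arithmetic. Assuming (this is an interpretation, not stated on the slide) a stream that misses once on each of the 16 blocks of a region, the block-based directory sees 16 accesses where the region directory sees one:

```latex
\frac{\text{region-directory accesses}}{\text{block-directory accesses}} = \frac{1}{16} = 0.0625,
\qquad
\text{reduction} = 1 - \frac{1}{16} = 93.75\%
```

Workloads with less spatial locality would touch fewer blocks per region and fall short of this figure.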
HSC RESOURCE USAGE
[Figure: normalized directory MSHRs required (y-axis, 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Maximum MSHRs significantly reduced

RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
‒ Reduces the average load latency
‒ Decreases bandwidth requirement of directory
HSC reduces the required MSHRs at the directory

RELATED WORK
Coarse-grained coherence
‒ Region coherence
‒ Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
‒ Spatiotemporal coherence [Alisafaee, MICRO 2012]
‒ Dual-grain directory coherence [Basu, UW-TR 2013]
‒ Primarily focused on directory size
GPU coherence [Singh et al. HPCA 2013]
‒ Intra-GPU coherence

CONCLUSIONS
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
‒ High bandwidth difficult to support at directory
‒ Extreme resource requirements
We propose Heterogeneous System Coherence
‒ Leverages spatial locality and region coherence
‒ Reduces bandwidth by 94%
‒ Reduces resource requirements by 95%

Questions?
Contact: powerjg@cs.wisc.edu

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides
LOAD LATENCY
[Figure: normalized load latency (y-axis, 0 to 4.5) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Average load time significantly reduced

EXECUTION TIME BREAKDOWN
[Figure: execution time (%) split between GPU and CPU for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]