SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION
MARC S. ORR†§, SHUAI CHE§, AYSE YILMAZER§, BRADFORD M. BECKMANN§, MARK D. HILL†§, DAVID A. WOOD†§
†UW-MADISON, §AMD RESEARCH
ASPLOS, MARCH 16, 2015

EXECUTIVE SUMMARY
Heterogeneous chips, like GPUs, have hierarchical memories
‒ All Global Synchronization
‒ Scoped Synchronization (7% Speedup)
‒ Work Stealing (18% Speedup)
Best of Both?
‒ NEW: Remote-Scope Promotion (25% Speedup)
[Figure: work-items wi 0 to wi 3 shown under each synchronization scheme]

OUTLINE
‒ Background: Synchronization + Scopes
‒ Synchronization using Remote-Scope Promotion
‒ Results/Conclusion

BACKGROUND: SYNCHRONIZATION + SCOPES
Parallel synchronization semantics
‒ acquire: pull latest data (to me)
‒ release: push latest data (to others)
Scopes bound synchronization:
‒ Smaller scope means less synchronization overhead

scope        abbrev.   description
work-item    wi        Like a CPU thread
wavefront    wv        work-items executing in lockstep on SIMD
work-group   wg        wavefronts executing on the same CU
component    cmp       work-groups executing on the same GPU
system       sys       All work-items/threads in the process
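
The scope table can be read as a containment hierarchy: two work-items can synchronize at a given scope only if that scope contains both of them. The sketch below illustrates this reading. The `WorkItem` coordinates and both helper functions are hypothetical names introduced for this example and do not come from the slides.

```python
from dataclasses import dataclass

# Scope abbreviations from the table, ordered from narrowest to widest.
SCOPES = ["wi", "wv", "wg", "cmp", "sys"]


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical coordinates that locate a work-item in the hierarchy."""
    device: int      # which GPU (component)
    workgroup: int   # work-group ID within the device
    wavefront: int   # wavefront ID within the work-group
    lane: int        # work-item ID within the wavefront


def narrowest_common_scope(a: WorkItem, b: WorkItem) -> str:
    """Return the smallest scope that contains both work-items."""
    if a == b:
        return "wi"
    if a.device != b.device:
        return "sys"
    if a.workgroup != b.workgroup:
        return "cmp"
    if a.wavefront != b.wavefront:
        return "wg"
    return "wv"


def scope_covers(scope: str, a: WorkItem, b: WorkItem) -> bool:
    """True if synchronizing at `scope` is wide enough to include a and b."""
    return SCOPES.index(scope) >= SCOPES.index(narrowest_common_scope(a, b))
```

For two work-items in different work-groups on the same GPU, `scope_covers("wg", ...)` is false and `scope_covers("cmp", ...)` is true, which is why a work-group-scope lock cannot safely protect data shared across CUs.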

ACQUIRE/RELEASE ANIMATION
    void incX_component() {
        while (!CAS_acq_cmp(&L, 0, 1));  // spin until lock L is acquired at component scope
        X = X + 1;
        st_rel_cmp(&L, 0);               // release lock L at component scope
    }

    void incX_workgroup() {
        while (!CAS_acq_wg(&L, 0, 1));   // spin until lock L is acquired at work-group scope
        X = X + 1;
        st_rel_wg(&L, 0);                // release lock L at work-group scope
    }
[Animation: CU 0 (wg 0, wg scope 0) and CU 1 (wg 1, wg scope 1) each have an L1 cache and share an L2 cache at component scope; the cached values of X and L change as the two functions execute]
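
The effect of the two scopes can be sketched with a toy cache model. This is a simplified illustration under stated assumptions, not the hardware from the talk: it assumes work-group-scope operations are satisfied entirely by a CU's private L1, that component-scope acquire drops clean L1 lines, and that component-scope release writes dirty lines back to the shared L2. The lock itself is omitted because the two increments run one after another. All class and function names are hypothetical.

```python
class L2Cache:
    """Shared by every CU on the GPU (component scope)."""

    def __init__(self):
        self.data = {}


class ComputeUnit:
    """One CU with a private L1 cache in front of the shared L2."""

    def __init__(self, l2):
        self.l2 = l2
        self.l1 = {}        # address -> cached value
        self.dirty = set()  # addresses written but not yet pushed to L2

    def load(self, addr):
        if addr not in self.l1:  # L1 miss: fetch from L2
            self.l1[addr] = self.l2.data.get(addr, 0)
        return self.l1[addr]

    def store(self, addr, value):
        self.l1[addr] = value
        self.dirty.add(addr)

    def acquire(self, scope):
        # Assumption: at wg scope the L1 is already the sharing point.
        if scope == "cmp":
            # Drop clean lines so later loads refetch from L2.
            self.l1 = {a: v for a, v in self.l1.items() if a in self.dirty}

    def release(self, scope):
        if scope == "cmp":
            # Push dirty lines out to L2 so other CUs can see them.
            for addr in self.dirty:
                self.l2.data[addr] = self.l1[addr]
            self.dirty.clear()


def inc_x(cu, scope):
    """Models incX_component / incX_workgroup without the lock."""
    cu.acquire(scope)
    cu.store("X", cu.load("X") + 1)
    cu.release(scope)


def run(scope):
    """Increment X once on each of two CUs; return what the second CU sees."""
    l2 = L2Cache()
    cu0, cu1 = ComputeUnit(l2), ComputeUnit(l2)
    inc_x(cu0, scope)
    inc_x(cu1, scope)
    return cu1.load("X")
```

In this model `run("cmp")` returns 2, while `run("wg")` returns 1 because the first CU's update never leaves its L1. Work-group scope is cheaper precisely because it skips the flush and invalidate, and that is also why it is only safe when both work-items share a CU.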

SCOPED SYNCHRONIZATION’S STRENGTHS
Static local sharing
[Figure: wg 0 shares data 0 within wg scope 0 and wg 1 shares data 1 within wg scope 1; both work-groups sit inside the component scope]
Dynamic global sharing
[Figure: wg 0 and wg 1, each with its own wg scope, share a global data store]
On current hardware, wg scope can yield >20% speedup over cmp scope

SCOPED SYNCHRONIZATION’S LIMITATIONS
Dynamic local sharing: some threads access shared data less frequently than others in an ad-hoc manner
Example: work stealing
[Figure: wg 0 enqueues (enq) to queue 0 in wg scope 0; wg 1 owns queue 1 in wg scope 1 and dequeues (deq) from queue 0 through the component scope, where queue 0 is stale]
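
The work-stealing problem can be sketched as follows. This is a toy model with hypothetical names (`Memory`, `enqueue`, `dequeue`, `steal_after_enqueue`): each work-group has a private view standing in for its L1, and only component-scope writes reach the shared level standing in for the L2. It is meant to show why a thief sees a stale queue, not how a real task queue is implemented.

```python
class Memory:
    """Toy model: one private L1 view per work-group over a shared L2."""

    def __init__(self):
        self.l2 = {}  # addr -> value visible at component scope
        self.l1 = {}  # (wg, addr) -> value visible only inside that wg

    def read(self, wg, addr):
        # A work-group sees its own L1 copy first, otherwise the L2 copy.
        return self.l1.get((wg, addr), self.l2.get(addr, []))

    def write(self, wg, addr, value, scope):
        if scope == "wg":
            self.l1[(wg, addr)] = value  # stays in this work-group's L1
        else:  # "cmp": pushed out to the shared L2
            self.l1.pop((wg, addr), None)
            self.l2[addr] = value


def enqueue(mem, wg, queue, task, scope):
    items = list(mem.read(wg, queue))
    items.append(task)
    mem.write(wg, queue, items, scope)


def dequeue(mem, wg, queue, scope):
    items = list(mem.read(wg, queue))
    if not items:
        return None  # queue looks empty from this work-group's view
    task = items.pop()
    mem.write(wg, queue, items, scope)
    return task


def steal_after_enqueue(owner_scope):
    """wg 0 enqueues to its own queue, then wg 1 tries to steal from it."""
    mem = Memory()
    enqueue(mem, 0, "queue0", "task", owner_scope)
    return dequeue(mem, 1, "queue0", "cmp")
```

With `owner_scope="wg"` the steal returns `None` even though a task is queued, because the owner's update is only visible inside wg 0. With `owner_scope="cmp"` the steal succeeds, but the owner then pays component-scope cost on every operation, including the common case where nobody steals.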


REMOTE-SCOPE PROMOTION
Insight: wg 1 needs to trigger the promotion of scope 0
Contribution: hardware support for scope promotion & ISA instructions that utilize it
[Figure: wg 1 issues a promoting deq on queue 0, which lives in wg scope 0; the promotion triggers a flush of the stale queue 0 to the component scope]

PROMOTION SEMANTICS
Prior memory models: HRF-direct, HRF-indirect
‒ Invariant: acquire/release pair must occur at the same scope
[Animation: work-item 0 (in wg 0) executes st(V, 2) and then a release on L, shown as st_rel_cmp(L, 0) or st_rel_wg(L, 0); work-item 1 (in wg 1) executes an acquire on L, shown as cas_acq_wg(&L, 0, 1) or cas_rm_acq_cmp(&L, 0, 1), and then ld(R1, V). The pairings are labeled "OK" (synchronizes-with relationship), "RACE!", or "promotion"]
Three new memory orders:
‒ remoteAcquire: promote the scope of the last release to the scope of this acquire, then perform the acquire
‒ remoteRelease: promote the scope of the next acquire to the scope of this release, then perform the release
‒ remoteAcquire+Release: combine remote acquire & remote release
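
The same-scope invariant and the effect of the remote orders can be captured in a small predicate. This is an informal sketch, not the paper's formalization: it considers only wg and cmp scopes, a single release/acquire pair on one lock, and no transitivity. `SyncOp`, `SCOPE_RANK`, and `synchronizes` are hypothetical names.

```python
from dataclasses import dataclass

SCOPE_RANK = {"wg": 0, "cmp": 1}  # narrower scope has the lower rank


@dataclass(frozen=True)
class SyncOp:
    wg: int               # work-group of the issuing work-item
    scope: str            # "wg" or "cmp"
    remote: bool = False  # True for a remote acquire / remote release


def synchronizes(release: SyncOp, acquire: SyncOp) -> bool:
    """True if the release/acquire pair forms a synchronizes-with edge."""
    rel_scope, acq_scope = release.scope, acquire.scope

    # Remote acquire: promote the last release up to this acquire's scope.
    if acquire.remote and SCOPE_RANK[acquire.scope] > SCOPE_RANK[rel_scope]:
        rel_scope = acquire.scope
    # Remote release: promote the next acquire up to this release's scope.
    if release.remote and SCOPE_RANK[release.scope] > SCOPE_RANK[acq_scope]:
        acq_scope = release.scope

    # Invariant from prior models: both sides must use the same scope ...
    if rel_scope != acq_scope:
        return False
    # ... and that scope must contain both work-items.
    return rel_scope == "cmp" or release.wg == acquire.wg
```

Under these assumptions, a wg-scope release in wg 0 paired with a plain cmp-scope acquire in wg 1 returns `False` (a race), while pairing the same release with a remote cmp-scope acquire returns `True`, because the release is treated as if it had been issued at component scope.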

IMPLEMENTATION
remote_acq_cmp(L)
1. Promote the scope of the last release on L
2. Perform an acquire operation on L
remote_rel_cmp(L)
1. Perform a release operation on L
2. Promote the scope of the next acquire on L
[Animation: CU 0, CU 1, and CU 2 each have an L1 cache and share an L2 cache; "promote" and "FLUSH" labels mark the L1 caches while the cached values of V and L update]
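
The two step lists can be sketched on a toy cache model. The sketch makes strong simplifying assumptions that are not confirmed by the slides: promoting the last release is modeled as flushing every other CU's L1 to L2, promoting the next acquire is modeled as invalidating every other CU's clean L1 lines, and locking and message passing between CUs are ignored. All names are hypothetical.

```python
class ComputeUnit:
    """One CU with a private L1 in front of a shared L2 (a plain dict)."""

    def __init__(self, l2):
        self.l2 = l2
        self.l1 = {}
        self.dirty = set()

    def load(self, addr):
        if addr not in self.l1:  # L1 miss: fetch from L2
            self.l1[addr] = self.l2.get(addr, 0)
        return self.l1[addr]

    def store(self, addr, value):
        self.l1[addr] = value
        self.dirty.add(addr)

    def flush(self):
        """Release half: write dirty L1 lines back to L2."""
        for addr in self.dirty:
            self.l2[addr] = self.l1[addr]
        self.dirty.clear()

    def invalidate(self):
        """Acquire half: drop clean L1 lines so loads refetch from L2."""
        self.l1 = {a: v for a, v in self.l1.items() if a in self.dirty}


def remote_acq_cmp(cu, all_cus):
    for other in all_cus:  # step 1: promote the last release (assumed flush)
        if other is not cu:
            other.flush()
    cu.invalidate()        # step 2: perform the acquire at cmp scope


def remote_rel_cmp(cu, all_cus):
    cu.flush()             # step 1: perform the release at cmp scope
    for other in all_cus:  # step 2: promote the next acquire (assumed invalidate)
        if other is not cu:
            other.invalidate()


def steal_view(use_remote):
    """Owner writes V with a wg-scope release; return what a thief loads."""
    l2 = {"V": 0}
    cus = [ComputeUnit(l2), ComputeUnit(l2)]
    owner, thief = cus
    owner.store("V", 2)    # wg-scope release: nothing leaves the owner's L1
    if use_remote:
        remote_acq_cmp(thief, cus)
    else:
        thief.invalidate() # ordinary cmp-scope acquire
    return thief.load("V")


def publish_view(use_remote):
    """Writer updates V; return what a reader using wg scope loads."""
    l2 = {"V": 0}
    cus = [ComputeUnit(l2), ComputeUnit(l2)]
    writer, reader = cus
    reader.load("V")       # reader caches the old value
    writer.store("V", 3)
    if use_remote:
        remote_rel_cmp(writer, cus)
    return reader.load("V")  # reader's wg-scope acquire is a no-op here
```

In this model `steal_view(False)` returns the stale 0 and `steal_view(True)` returns 2. Likewise `publish_view(False)` returns 0 and `publish_view(True)` returns 3. The infrequent remote operation pays for the extra flush or invalidate so that the frequent local operations can stay at work-group scope.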

IMPLEMENTATION DETAILS
Hardware support
‒ Sending/receiving sub-operations between CUs
‒ Cache line locking to resolve races
  ‒ Guarantee “coherence order” for read-modify-writes
‒ Hardware support to stall new synchronization operations at target scope
Paper formalizes scope promotion
‒ Shows that scope promotion is compatible with coherence order


METHODOLOGY
Prototyped remote scoped synchronization in gem5
‒ Extended with internal GPU model
Refactored 3 Pannotia workloads to retrieve graph nodes from task queues
‒ SSSP, Color, PageRank (each run with 3-4 inputs)

RESULTS
[Chart: speedup over baseline (y-axis, 0 to 1.8) for scope-only, steal-only, and rem-sync on SSSP-1 to SSSP-3, color-1 to color-4, PR-1 to PR-3, and their geomean]
Geomean speedup: scope-only 1.07x, steal-only 1.18x, rem-sync 1.25x

scenario      Scope of sync.?   Work stealing?
baseline      global            no
scope-only    local             no
steal-only    global            yes
rem-sync      local             yes
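
The chart summarizes each scenario with a geometric mean across inputs. A minimal sketch of that calculation follows. The function name is hypothetical and the per-input values in the usage note are made up for illustration; they are not read from the chart.

```python
import math


def geomean(speedups):
    """Geometric mean of a list of positive speedup ratios."""
    if not speedups:
        raise ValueError("need at least one speedup")
    # Average in log space, then map back; equivalent to the n-th root
    # of the product but less prone to overflow on long lists.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

For example, `geomean([1.0, 2.0, 4.0])` is approximately 2.0. The geometric mean is the usual choice for ratios such as speedups because it gives a 2x gain and a 2x loss equal and opposite weight.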

CONCLUSION
‒ All Global Synchronization
‒ Scoped Synchronization (7% Speedup)
‒ Work Stealing (18% Speedup)
Best of Both!
‒ NEW: Remote-Scope Promotion (25% Speedup)
[Figure: work-items wi 0 to wi 3 shown under each synchronization scheme]

Questions?

Backup

µBENCHMARK RESULTS
wg-scope vs. cmp-scope on AMD A10-7850K
[Chart: speedup (y-axis, 0 to 1.6) vs. # of memory operations between acquire and release (x-axis, 4 to 1024), with one series each for All LD, 75% LD, 50% LD, 25% LD, and All ST]
Small tasks benefit from scopes
Scopes matter!

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.