Fence Scoping Changhui Lin Vijay Nagarajan Rajiv Gupta


Fence Scoping
Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta†
† University of California, Riverside
* University of Edinburgh

Reordering in Uniprocessors
  • Memory operations are reordered to improve performance
    • Hardware (e.g., store buffer, reorder buffer)
    • Compiler (e.g., code motion, caching a value in a register)
  • Example: a1: St x and a2: Ld y may be reordered
  • No harm as long as dependences are respected

Reordering in Multiprocessors: counter-intuitive program behavior
  • Initially x = y = 0
        P1                  P2
        a1: x = 1;          b1: Ry = y;
        a2: y = 1;          b2: Rx = x;
  • Intuitively, y = 1 implies x = 1, so Ry = 1 implies Rx = 1
  • Outcomes shown: (Rx=0, Ry=0), (Rx=1, Ry=1), (Rx=0, Ry=1)
  • (Rx=0, Ry=1) contradicts the intuition; it can arise only if a1 and a2, or b1 and b2, are reordered
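The intuition on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides, that enumerates every interleaving of P1 (x = 1; y = 1) and P2 (Ry = y; Rx = x) while keeping program order inside each thread, which is what sequential consistency allows. The names `Outcome`, `explore`, and `sc_outcomes` are illustrative.

```cpp
#include <cassert>
#include <set>
#include <utility>

// One final result of the litmus test, stored as (Rx, Ry).
using Outcome = std::pair<int, int>;

// Recursively try every interleaving. p1_done / p2_done count how many
// operations each thread has executed; program order is never violated.
void explore(int p1_done, int p2_done, int x, int y, int rx, int ry,
             std::set<Outcome>& out) {
    if (p1_done == 2 && p2_done == 2) {
        out.insert({rx, ry});
        return;
    }
    if (p1_done == 0) explore(1, p2_done, 1, y, rx, ry, out);  // a1: x = 1
    if (p1_done == 1) explore(2, p2_done, x, 1, rx, ry, out);  // a2: y = 1
    if (p2_done == 0) explore(p1_done, 1, x, y, rx, y, out);   // b1: Ry = y
    if (p2_done == 1) explore(p1_done, 2, x, y, x, ry, out);   // b2: Rx = x
}

// All outcomes reachable without reordering, starting from x = y = 0.
std::set<Outcome> sc_outcomes() {
    std::set<Outcome> out;
    explore(0, 0, 0, 0, 0, 0, out);
    return out;
}
```

Calling `sc_outcomes()` never yields (Rx=0, Ry=1), so observing that result on real hardware implies that some pair of operations was reordered.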

Reordering in Multiprocessors: counter-intuitive program behavior
  • Initially p = NULL, flag = false
        P1                    P2
        p = new A(…)          if (flag)
        flag = true;              a = p->var;
  • flag is supposed to be set after p is allocated

Fence Instructions
  • Memory consistency models
    • Specify what reordering is allowed
    • e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC)
  • Fence instructions (fences / memory barriers)
    • Selectively override the default relaxed memory order
    • Order memory operations before and after the fence
        P1
        p = new A(…)
        FENCE
        flag = true;
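In portable C++ the FENCE in the P1 snippet can be approximated with `std::atomic_thread_fence`. The sketch below is an illustration under assumptions, not the authors' code: the names `shared_p`, `ready`, `producer`, and `consumer` are hypothetical, and the acquire fence on the reader side is an addition that the C++ memory model requires even though the slide shows a fence only in P1.

```cpp
#include <atomic>
#include <cassert>

struct A { int var; };

std::atomic<A*> shared_p{nullptr};   // "p" on the slide
std::atomic<bool> ready{false};      // "flag" on the slide

// P1: allocate the object, fence, then set the flag.
void producer(int value) {
    shared_p.store(new A{value}, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // FENCE
    ready.store(true, std::memory_order_relaxed);
}

// P2: dereference p only after the flag is observed.
// Returns false if nothing has been published yet.
bool consumer(int& a) {
    if (!ready.load(std::memory_order_relaxed)) return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    a = shared_p.load(std::memory_order_relaxed)->var;
    return true;
}
```

`producer` and `consumer` are meant to run on different threads. The object is never freed here, to keep the sketch short.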

Fence Instructions (continued)
  • Inevitable: needed when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL '11]
  • Expensive: Cilk-5's THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI '98]

Motivation
  [Figure labels: Control Data Access, Process Data, Concurrent algorithm]
  • Not all memory orderings enforced by fences are necessary
  • Fences are usually used to order a few specific memory operations
  • Programmers know best how a fence is used, and this knowledge can be conveyed to the hardware

Scoped Fence (S-Fence)
  • An S-Fence orders only the memory operations in its scope
    • Scope definitions: class scope, set scope
  • Bridges the gap between the programmer's intention and hardware execution
    • Programmers specify the scope
    • Scope information is conveyed to the hardware, which imposes fewer ordering constraints
  • Lightweight hardware and compiler support

Scoped Fence (S-Fence): programming support
    S-FENCE                          global scope
    S-FENCE[class]                   class scope
    S-FENCE[set, {var1, var2, …}]    set scope

Work-Stealing Queue Algorithm
  Chase-Lev lock-free concurrent work-stealing queue

     1  void put(TASK task) {
     2    tail = TAIL;
     3    wsq[tail] = task;
     4    FENCE              // store-store
     5    TAIL = tail + 1;
     6  }
     7  TASK take() {
     8    tail = TAIL - 1;
     9    TAIL = tail;
    10    FENCE              // store-load
    11    head = HEAD;
    12    if (tail < head) {
    13      TAIL = head;
    14      return EMPTY;
    15    }
        ……
    24    return task;
    25  }
    26  TASK steal() {
    27    head = HEAD;
    28    tail = TAIL;
        ……
    35    return task;
    36  }

Parallel Spanning Tree
  (a)
     1  task = wsq.take();                    ①
     2  for (each neighbor task’ of task)
     3    if (task’ is not processed) {
     4      process(task’);                   ②
     5      wsq.put(task’);                   ③
     6    }
  (b)
     8  tail = TAIL - 1;
     9  TAIL = tail;
    10  FENCE
    11  head = HEAD;
        ……
        color[task’] = label;
        parent[task’] = task;
     2  tail = TAIL;
     3  wsq[tail] = task’;
     4  FENCE
     5  TAIL = tail + 1;

Class Scope
    S-FENCE[class]    class scope
  • Uses classes in object-oriented languages to illustrate the concept
  • Constrains a fence to the class in which it is used (encapsulation)
  • Intuition: member functions operate on the data members of the class

Class Scope
    S-FENCE[class]    class scope

    class A {
      B b;
      int m1, m2;
      void funcA() {
        m1 = val1;
        b.funcB();
        S-FENCE1[class]
        m2 = val2;
      }
    }

    class B {
      int n1, n2;
      void funcB() {
        n1 = val3;
        S-FENCE2[class]
        n2 = val4;
      }
    }

  • S-FENCE1: m1, m2, n1, n2
  • S-FENCE2: n1, n2

Class Scope Semantics
  • More details in the paper

Parallel Spanning Tree (FENCE replaced by S-FENCE[class])
  (a)
     1  task = wsq.take();                    ①
     2  for (each neighbor task’ of task)
     3    if (task’ is not processed) {
     4      process(task’);                   ②
     5      wsq.put(task’);                   ③
     6    }
  (b)
     8  tail = TAIL - 1;
     9  TAIL = tail;
    10  S-FENCE[class]
    11  head = HEAD;
        ……
        color[task’] = label;
        parent[task’] = task;
     2  tail = TAIL;
     3  wsq[tail] = task’;
     4  S-FENCE[class]
     5  TAIL = tail + 1;

Compiler Support
  • ISA extension
    • class-fence
    • fs_start: start of a fence scope
    • fs_end: end of a fence scope
  • fs_start and fs_end bracket the functions that contain fences
  • This tells the hardware which memory operations to mark

Hardware Support
  • Fence Scope Bits (FSB)
    • Each entry of the ROB and the store buffer is associated with FSB
    • They flag whether a memory operation is in the scope of some fence
  • Decoding: memory operations in the scope are marked via FSB
  • Fence issue: check the entry for the current scope
  [Figure: store buffer and reorder buffer entries, each extended with fence scope bits]


Hardware Support: Setting Fence Bits
  • FSS: a stack that records scopes
  [Figure: instruction stream with an outer and an inner scope, next to FSB columns 0 to 3]
        fs_start a          (outer)
          I0
          I1
          fs_start b        (inner)
            I2
            I3
            I4
          fs_end b
          I5
          I6
        fs_end a
        I7


Hardware Support: Setting Fence Bits (continued)
  • FSS: a stack that records scopes
  • A fence is issued by checking the FSB for the current scope
  [Figure: the instruction stream fs_start a, I0, I1, fs_start b, I2, I3, I4, fs_end b, I5, I6, fs_end a, I7, next to FSB columns 0 to 3]
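The interplay of fs_start, fs_end, the FSS, and the FSB can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware design from the paper: it assumes one FSB bit per scope identifier, that a memory operation gets the bit of every scope currently on the FSS, and that a fence waits only for earlier operations carrying the bit of the innermost scope. `ScopeTracker` and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One ROB / store buffer entry in the model.
struct Entry {
    std::string name;   // instruction label, e.g. "I2"
    std::uint8_t fsb;   // fence scope bits (assumed: one bit per scope id)
};

class ScopeTracker {
public:
    // fs_start: look up (or allocate) the bit of this scope and push it on the FSS.
    void fs_start(char scope_id) {
        auto it = bit_of_.find(scope_id);
        if (it == bit_of_.end()) {
            assert(bit_of_.size() < 8);  // the model has 8 bits available
            auto bit = static_cast<std::uint8_t>(1u << bit_of_.size());
            it = bit_of_.emplace(scope_id, bit).first;
        }
        fss_.push_back(it->second);
    }

    // fs_end: leave the innermost scope.
    void fs_end() {
        assert(!fss_.empty());
        fss_.pop_back();
    }

    // Decode a memory operation: mark it with the bit of every enclosing scope.
    void mem_op(const std::string& name) {
        std::uint8_t bits = 0;
        for (std::uint8_t b : fss_) bits |= b;
        window_.push_back({name, bits});
    }

    // Earlier operations a fence issued now would have to wait for.
    // Outside any scope the model falls back to ordering everything.
    std::vector<std::string> fence_waits_for() const {
        std::vector<std::string> result;
        for (const Entry& e : window_)
            if (fss_.empty() || (e.fsb & fss_.back())) result.push_back(e.name);
        return result;
    }

private:
    std::map<char, std::uint8_t> bit_of_;  // scope id -> FSB bit
    std::vector<std::uint8_t> fss_;        // fence scope stack, innermost last
    std::vector<Entry> window_;            // in-flight memory operations
};
```

Replaying the stream from the slide, a fence issued inside scope b after I4 waits only for I2, I3, and I4, while a fence issued in scope a after I6 waits for I0 through I6, because operations of the nested scope also carry the outer scope's bit.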


Why Does S-Fence Perform Better?
  • Instruction sequence: St A; St X; FENCE; Ld Y; St B
  • St A is a cache miss
  [Figure: timelines of the store buffer (SB) and reorder buffer (ROB) for a traditional fence and a scoped fence. Labels: stall, store buffer drained, fence issued]
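A back-of-the-envelope model makes the difference concrete. The sketch below assumes that St A lies outside the fence's scope and St X inside it, and it uses made-up latencies; neither assumption is stated on the slide. `PendingStore` and `fence_stall` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A store still sitting in the store buffer when the fence is reached.
struct PendingStore {
    const char* name;
    int cycles_left;   // assumed remaining latency until the store completes
    bool in_scope;     // assumed: does the store belong to the fence's scope?
};

// Cycles the fence must stall. A traditional fence waits for every pending
// store; a scoped fence waits only for the stores in its scope.
int fence_stall(const std::vector<PendingStore>& store_buffer, bool scoped) {
    int stall = 0;
    for (const PendingStore& s : store_buffer)
        if (!scoped || s.in_scope) stall = std::max(stall, s.cycles_left);
    return stall;
}
```

With St A taking 200 more cycles (a miss) and St X taking 2, the traditional fence stalls for 200 cycles in this model and the scoped fence for 2.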

Set Scope
  • Dekker algorithm
  • Initially flag1 = flag2 = 0
        P1                      P2
        m1 = …                  m2 = …
        flag1 = 1;              flag2 = 1;
        FENCE                   FENCE
        if (flag2 == 0)         if (flag1 == 0)
          critical section        critical section

Set Scope
  • Dekker algorithm with set-scope fences
  • Initially flag1 = flag2 = 0
        P1                                 P2
        m1 = …                             m2 = …
        flag1 = 1;                         flag2 = 1;
        S-FENCE[set, {flag1, flag2}]       S-FENCE …
        if (flag2 == 0)                    if (flag1 == 0)
          critical section                   critical section
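Standard C++ has no set-scoped fence, so the closest portable rendering of this pattern uses a full fence. The following sketch is illustrative only: `dekker_flag1`, `dekker_flag2`, `p1_try_enter`, and `p2_try_enter` are hypothetical names, and the sequentially consistent fence orders all earlier accesses, including the write to `m1` or `m2` that a set-scoped fence over {flag1, flag2} would not need to order.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> dekker_flag1{0};
std::atomic<int> dekker_flag2{0};
int m1 = 0;   // data outside the set {flag1, flag2}
int m2 = 0;

// P1: announce intent, fence, then check the other flag.
// Returns true if P1 may enter the critical section.
bool p1_try_enter() {
    m1 = 1;                                               // m1 = …
    dekker_flag1.store(1, std::memory_order_relaxed);     // flag1 = 1
    std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE (store-load)
    return dekker_flag2.load(std::memory_order_relaxed) == 0;
}

// P2: the mirror image of P1.
bool p2_try_enter() {
    m2 = 1;                                               // m2 = …
    dekker_flag2.store(1, std::memory_order_relaxed);     // flag2 = 1
    std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE (store-load)
    return dekker_flag1.load(std::memory_order_relaxed) == 0;
}
```

The two functions are intended to run on different threads; the fences ensure that at most one of them returns true when both start from flag1 = flag2 = 0.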

Set Scope
    S-FENCE[set, {var1, var2, …}]    set scope
  • Orders only memory accesses to {var1, var2, …}
  • Compiler and hardware support
    • Flag memory accesses to the specified variables
    • Set fence scope bits in hardware for flagged memory accesses
  • For simplicity, memory accesses to different sets are not differentiated

Experimental Evaluation
  • Cycle-accurate simulation (SESC)
    • Integrated scoped fence logic
    • RMO memory model
  • Benchmarks
    • pst: parallel spanning tree (work-stealing queue, class scope)
    • ptc: parallel transitive closure (work-stealing queue, class scope)
    • barnes: from SPLASH-2 (fences inserted for SC, set scope)
    • radiosity: from SPLASH-2 (fences inserted for SC, set scope)

Experimental Evaluation: Traditional fence (T) vs. Scoped fence (S)
  [Figure: T vs. S. Labels in slide order: class scope; fence stall reduced ~13%, ~50%; set scope ~40-50%]

Conclusion
  • Introduce the concept of fence scope
  • Propose class scope and set scope
  • OpenCL 2.0 (sub-group, work-group, device, system)
  • Lightweight compiler and hardware support
  • No change in inter-processor communication
  • Fence scope should be implemented in some form!
