Revisiting Loop Transformations with X10 Clocks
Tomofumi Yuki
Inria / LIP / ENS Lyon
X10 Workshop 2015

The Problem
• The Parallelism Challenge
  • cannot escape going parallel
  • parallel programming is hard
  • automatic parallelization is limited
• There won't be any Silver Bullet
• X10 as a partial answer
  • high-level language with parallelism in mind
  • features to control parallelism/locality

Programming with X10
• Small set of parallel constructs
  • finish/async
  • clocks
  • at (places), atomic, when
• Can be composed freely
• Interesting for both programmers and compilers
  • also challenging
• But it seems to be under-utilized

This Paper
• Exploring how to use X10 clocks
(Figure labels: expressivity, performance, alternative, "usual" way)

Context: Loop Transformations
• Key to expose parallelism
  • sometimes it's easy:
        for i
          for j
            X[i] += ...
    becomes
        for i
          forall j
            X[i] += ...
  • but not always:
        for i = 0..N
          for j = 1..M
            X[j] = foo(X[j-1], X[j+1]);
    becomes
        for i = 1..2N+M
          forall j = /*complex bounds*/
            X[j] = foo(X[2*j-i-1], X[2*j-i+1]);

Automatic Parallelization
• Very sensitive to inputs
  Input:
    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
        x[i][j] = x[i-1][j] + x[i][j-1];
    for (i=1; i<N-1; i++)
      for (j=1; j<M-1; j++)
        y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];
  Generated code:
    for (t1=2; t1<=3; t1++) {
      lbp=1;
      ubp=t1-1;
      #pragma omp parallel for private(lbv, ubv, t3)
      for (t2=lbp; t2<=ubp; t2++) {
        S1((t1-t2), t2);
      }
    }
    for (t1=4; t1<=min(M, N); t1++) {
      S1((t1-1), 1);
      lbp=2;
      ubp=t1-2;
      #pragma omp parallel for private(lbv, ubv, t3)
      for (t2=lbp; t2<=ubp; t2++) {
        S1((t1-t2), t2);
        S2((t1-t2-1), (t2-1));
      }
      S1(1, (t1-1));
    }
    for (t1=M+1; t1<=N; t1++) {
      S1((t1-1), 1);
      lbp=2;
      ubp=M-1;
      #pragma omp parallel for private(lbv, ubv, t3)
      for (t2=lbp; t2<=ubp; t2++) {
        S1((t1-t2), t2);
        S2((t1-t2-1), (t2-1));
      }
    }
    for (t1=N+1; t1<=M; t1++) {
      lbp=t1-N+1;
      ubp=t1-2;
      #pragma omp parallel for private(lbv, ubv, t3)
      for (t2=lbp; t2<=ubp; t2++) {
        S1((t1-t2), t2);
        S2((t1-t2-1), (t2-1));
      }
      S1(1, (t1-1));
    }
    for (t1=max(M+1, N+1); t1<=N+M-2; t1++) {
      lbp=t1-N+1;
      ubp=M-1;
      #pragma omp parallel for private(lbv, ubv, t3)
      for (t2=lbp; t2<=ubp; t2++) {
        S1((t1-t2), t2);
        S2((t1-t2-1), (t2-1));
      }
    }
• very difficult to understand
• trust it, or do not use it

Expressing with Clocks
• Goal: retain the original structure
    async
    for (i=1; i<N; i++)
      advance;
      async
      for (j=1; j<M; j++)
        x[i][j] = x[i-1][j] + x[i][j-1];
        advance;
    async
    for (i=1; i<N-1; i++)
      advance;
      async
      for (j=1; j<M-1; j++)
        y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];
        advance;
  1. make many iterations parallel
  2. order them by synchronizations
• compound effect: parallelism similar to that obtained with loop transformations

Outline
• Introduction
• X10 Clocks
• Examples
• Discussion

Clocks vs barriers
• Barriers can easily deadlock
    //P1          //P2
    barrier;      barrier;
    S0;           S1;
    barrier;
  Result: deadlock
• Clocks are more dynamic
    //P1          //P2
    advance;      advance;
    S0;           S1;
    advance;
  Result: OK
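The barrier half of this comparison can be reproduced with ordinary threads. The sketch below is a Python illustration, not X10: it assumes `threading.Barrier` as a stand-in for a static barrier, and the function name `run_mismatched_barriers` and the timeout are hypothetical choices made only so that the hang can be observed instead of blocking forever.

```python
import threading


def run_mismatched_barriers(timeout=0.5):
    """P1 waits at the barrier twice, P2 only once (as on the slide)."""
    barrier = threading.Barrier(2)      # static barrier: always expects 2 parties
    outcome = {}

    def p1():
        barrier.wait()                  # both activities meet here
        # S0 would run here
        try:
            barrier.wait(timeout=timeout)   # P2 never arrives a second time
            outcome["P1"] = "passed"
        except threading.BrokenBarrierError:
            outcome["P1"] = "stuck"     # without the timeout this is a deadlock

    def p2():
        barrier.wait()
        # S1 would run here
        outcome["P2"] = "done"          # P2 leaves without a second barrier

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

A clock avoids this situation because an activity that terminates is un-registered, so the remaining activities no longer wait for it.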

Dynamicity of Clocks
• Implicit Syntax
    clocked finish            // creation of a clock
      for (i=1:N)
        clocked async {       // each process is registered
          for (j=i:N)
            advance;          // sync registered processes
          S0;
        }                     // each process is un-registered
• The process creating a clock is also registered

Dynamicity of Clocks
• Implicit Syntax
    clocked finish
      for (i=1:N)
        clocked async {
          for (j=i:N)
            advance;
          S0;
        }
• Each process waits until all processes start
• The primary process has to terminate first
(Figure: activities 1 to 6)

Dynamicity of Clocks
• Implicit Syntax
    clocked finish
      for (i=1:N) {
        clocked async {
          for (j=i:N)
            advance;
          S0;
        }
        advance;              // advance by primary activity
      }
• The primary process calls advance each time
• Different synchronization pattern
(Figure: activities 1 to 6)
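The difference between the two patterns can be made concrete with a small sequential simulation. This is a Python sketch under stated assumptions, not X10 and not the author's implementation: each activity is modelled as a generator that yields `None` for `advance` and yields a new generator to spawn a `clocked async`; the names `Clock`, `run_clocked_finish` and `s0_phases` are hypothetical.

```python
class Clock:
    """Minimal stand-in for one X10 clock: only tracks the current phase."""

    def __init__(self):
        self.phase = 0


def run_clocked_finish(clock, primary):
    """Run generator-based activities registered on `clock` until all finish.

    An activity yields None to call `advance`, or yields another generator to
    spawn a `clocked async` child, which is registered in the current phase.
    """
    registered = [primary]
    while registered:
        queue, waiting = list(registered), []
        while queue:
            activity = queue.pop(0)
            for request in activity:       # resume until advance or the end
                if request is None:        # advance: block until next phase
                    waiting.append(activity)
                    break
                queue.append(request)      # spawn: child joins this phase
            # an exhausted generator is simply dropped (un-registered)
        registered = waiting
        clock.phase += 1                   # everyone advanced or terminated


def s0_phases(n, primary_advances):
    """Return {i: phase in which activity i executes S0}."""
    clock, phases = Clock(), {}

    def child(i):
        for _ in range(i, n + 1):          # for (j=i:N) advance;
            yield None
        phases[i] = clock.phase            # S0

    def primary():
        for i in range(1, n + 1):
            yield child(i)                 # clocked async { ... }
            if primary_advances:
                yield None                 # advance by the primary activity

    run_clocked_finish(clock, primary())
    return phases
```

In this model, without the primary `advance` the activity for index i reaches S0 in phase N-i+1, so the activities finish in reverse order. With the primary `advance`, each activity starts one phase later than the previous one and every S0 lands in phase N.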


Example: Skewing
• Skewing the loops is not easy
    for (i=1:N)
      for (j=1:N)
        h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1])
  after skewing:
    for (i=2:2N)
      forall (j=max(1,i-N):min(N,i-1))
        h[i-j][j] = foo(h[(i-j)-1][j], h[(i-j)-1][j-1], h[(i-j)][j-1])
• changes to loop bounds and indexing
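The effect of skewing on bounds and indexing can be checked by running both versions and comparing the results. The following Python sketch assumes a hypothetical `foo` and boundary values of 1 in row 0 and column 0. The skewed version iterates over wavefronts t = i + j from 2 to 2N and writes `h[t-j][j]`, visiting each wavefront in reverse order to show that the order inside a wavefront does not matter.

```python
def foo(a, b, c):
    # Hypothetical stand-in for the slide's foo; any pure function works here.
    return (a + 2 * b + 3 * c + 1) % 1000003


def original(n):
    h = [[1] * (n + 1) for _ in range(n + 1)]    # row 0 / column 0: boundary
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            h[i][j] = foo(h[i - 1][j], h[i - 1][j - 1], h[i][j - 1])
    return h


def skewed(n):
    h = [[1] * (n + 1) for _ in range(n + 1)]
    for t in range(2, 2 * n + 1):                # wavefront index t = i + j
        lo, hi = max(1, t - n), min(n, t - 1)    # bounds of the forall loop
        for j in reversed(range(lo, hi + 1)):    # iterations are independent
            i = t - j                            # recover the original i
            h[i][j] = foo(h[i - 1][j], h[i - 1][j - 1], h[i][j - 1])
    return h
```

Every point on wavefront t reads only values from wavefronts t-1 and t-2, which is why the inner loop can become a `forall`.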

Example: Skewing
• Equivalent parallelism without changing loops
    clocked finish
      for (i=1:N) {
        clocked async
          for (j=1:N) {
            h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);
            advance;
          }
        advance;              // advance by primary activity
      }
• each clocked async is locally sequential
• the launch of the entire block is deferred
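To see why the clocked version gives wavefront parallelism, one can record the clock phase in which each `h[i][j]` is written. The sketch below is a sequential Python model, not X10. It assumes that the primary activity calls `advance` once after spawning each `clocked async`, and it models activities as generators; the name `clocked_wavefront` is hypothetical.

```python
def clocked_wavefront(n):
    """Return {(i, j): clock phase in which h[i][j] would be written}."""
    phase = [0]                        # current clock phase, shared by all
    when = {}

    def row(i):                        # body of one `clocked async`
        for j in range(1, n + 1):
            when[(i, j)] = phase[0]    # h[i][j] = foo(...)
            yield "advance"

    def primary():
        for i in range(1, n + 1):
            yield row(i)               # clocked async: spawn and register
            yield "advance"            # advance by the primary activity

    registered = [primary()]
    while registered:
        queue, waiting = list(registered), []
        while queue:
            activity = queue.pop(0)
            for request in activity:   # resume until advance or termination
                if request == "advance":
                    waiting.append(activity)
                    break
                queue.append(request)  # spawned child joins the current phase
        registered = waiting           # terminated activities are dropped
        phase[0] += 1
    return when
```

Under these assumptions `h[i][j]` is written in phase i + j - 2, so each dependence source is written in a strictly earlier phase, and the points that share a phase are exactly one wavefront of the skewed loop nest.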

Example: Skewing
• You can have the same skewing
    clocked finish
      for (j=1:N) {
        clocked async
          for (i=1:N) {
            h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);
            advance;
          }
        advance;              // advance by primary activity
      }
• note: interchange; outer parallel loop with clocks

Example: Loop Fission
• Common use of barriers
    forall (i=1:N)
      S1;
      S2;
  after fission:
    forall (i=1:N) S1;
    forall (i=1:N) S2;
  in X10:
    for (i=1:N)
      async { S1; S2; }
  becomes
    for (i=1:N)
      async { S1; advance; S2; }
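The fissioned form can be imitated with threads and a barrier. This is a Python sketch, not X10: `threading.Barrier` plays the role of `advance`, and `fissioned_forall` is a hypothetical name. Each thread runs S1, waits for all the others, then runs S2, which gives the same ordering as two consecutive `forall` loops.

```python
import threading


def fissioned_forall(n):
    """Run S1(i); barrier; S2(i) in n threads and return the execution log."""
    log = []
    barrier = threading.Barrier(n)         # all n iterations take part

    def body(i):
        log.append(("S1", i))              # S1
        barrier.wait()                     # corresponds to `advance`
        log.append(("S2", i))              # S2

    threads = [threading.Thread(target=body, args=(i,))
               for i in range(1, n + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The order among the S1 instances (and among the S2 instances) is left to the scheduler, but no S2 can start before every S1 has finished.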

Example: Loop Fusion
• Removes all the parallelism
    for (i=1:N) S1;
    for (i=1:N) S2;
  after fusion:
    for (i=1:N)
      S1;
      S2;
  in X10:
    async
    for (i=1:N)
      S1;
      advance;
    async
    for (i=1:N)
      S2;
      advance;

Example: Loop Fusion
• Sometimes fusion is not so simple
    for (i=1:N-1) S1(i);
    for (i=2:N) S2(i);
  after fusion:
    S1(1);
    for (i=2:N-1)
      S1(i);
      S2(i);
    S2(N);
  in X10:
    async
    for (i=1:N-1)
      S1;
      advance;
    async
    for (i=2:N)
      S2;
      advance;
• code structure stays
• control through the number of advances

What can be expressed?
• Limiting factor: parallelism
  • difficult to use for sequential loop nests
  • works for wave-front parallelism
• Intuition
  • clocks defer execution
  • deferring parent activity has cumulative effect

Discussion
• Learning curve
  • the behavior of clocks takes time to understand
• How much can you express?
  • 1D affine schedules for sure
  • loop permutation is not possible
  • what if we use multiple clocks?

Potential Applications
• It might be easier for some people
  • have multiple ways to write code
• Detect X10 fragments with such properties
  • convert to forall for performance
