Loop Tiling for Iterative Stencil Computations Marta Jiménez













- Slides: 27
Loop Tiling for Iterative Stencil Computations Marta Jiménez

What is an Iterative Stencil Computation?

    DO K = 1, NITER          /* time-step loop */
      do J = ...
        do I = ...
          {A(I, J), A(I+1, J), …}
        enddo
      enddo
      {wrapped-around computations}
    ENDDO

[Figure: matrix A]
• ISCs are often performed for PDE, GM, IP
  – swim, tomcatv, mgrid (from the SPEC 95 benchmark)
  – Jacobi
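
As a concrete instance of the loop shape above, here is a minimal Jacobi sketch. It is written in Python rather than the slides' Fortran so that it can be run as is; the four-point averaging stencil, the fixed boundary, and the names `jacobi_2d` and `grid` are assumptions for illustration, and no wrapped-around computations are modelled.

```python
def jacobi_2d(a, niter):
    """Run niter Jacobi sweeps over the interior of a 2-D grid (list of lists)."""
    rows, cols = len(a), len(a[0])
    for _ in range(niter):                  # time-step loop (DO K = 1, NITER)
        new = [row[:] for row in a]         # boundary values are carried over unchanged
        for j in range(1, rows - 1):        # do J = ...
            for i in range(1, cols - 1):    # do I = ...
                # stencil: average of the four neighbours from the previous step
                new[j][i] = 0.25 * (a[j - 1][i] + a[j + 1][i]
                                    + a[j][i - 1] + a[j][i + 1])
        a = new
    return a


grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [1.0] * 4                         # top edge held at 1.0
result = jacobi_2d(grid, 1)
```

Every time step sweeps the whole array again, so once the array is larger than the cache each sweep reloads it from memory. That reuse across time steps is what tiling tries to capture.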

Loop Tiling
• Loop Tiling
  – divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
  – can be applied hierarchically (Multilevel Tiling)
• Current algorithms for Loop Tiling are limited to loops that:
  – are “perfectly” nested
  – are fully permutable
  – define a rectangular IS
• However, in iterative stencil computations, loops are:
  – NOT perfectly nested
  – NOT fully permutable
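
For the easy case listed above (a perfectly nested, fully permutable, rectangular loop nest), tiling can be sketched as follows. This is a Python illustration, not code from the talk; the function names and the tile sizes `tj` and `ti` are assumptions.

```python
def untiled(n, m):
    """Iteration order of the original J/I loop nest."""
    return [(j, i) for j in range(n) for i in range(m)]


def tiled(n, m, tj, ti):
    """Same iterations, visited tile by tile (tile size tj x ti)."""
    order = []
    for jj in range(0, n, tj):                        # loops over tiles
        for ii in range(0, m, ti):
            for j in range(jj, min(jj + tj, n)):      # loops inside one tile
                for i in range(ii, min(ii + ti, m)):
                    order.append((j, i))
    return order


first_tile = tiled(4, 4, 2, 2)[:4]
```

The reordering is only legal here because the nest carries no dependencies between iterations. The rest of the talk is about what to do when that does not hold.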

Today’s talk
• Show that Loop Tiling can be applied to iterative stencil computations
  – based on Song & Li’s paper [PLDI 99]
• Define a Program Model
• 1 level of 1D-Tiling (cache)
  – program example: SWIM
• 2 levels of Tiling
  – 2D-Tiling at the cache level
  – 1D-Tiling at the register level (based on Jiménez et al. [ICS 98][HPCA 98])
• Performance Results
  – Loop Tiling on EV5 & EV6

Steps
1 - Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2 - Perform 2D-Tiling for the Cache Level
3 - Perform 1D-Tiling for the Register Level

1st Step: achieve desired program model
• Program Model:

    DO K = 1, NITER          /* time-step loop */
      do J1 = LJ1, UJ1
        do I1 = LI1, UI1
          {A(I, J), A(I+1, J), …}
        enddo
      enddo
      ...
      do Jm = LJm, UJm
        do Im = LIm, UIm
          {A(I, J), A(I+1, J), …}
        enddo
      enddo
    ENDDO

• Usually, programs are NOT directly written in this form
  – We must apply a set of transformations to achieve this program model

SWIM original code

          initializations
    90    NCYCLE = NCYCLE + 1
          CALL CALC2
          IF (NCYCLE >= ITMAX) STOP
          IF (NCYCLE <= 1) THEN
            CALL CALC3Z
          ELSE
            CALL CALC3
          ENDIF
          GO TO 90

          SUBROUTINE CALCX
          do J = 1, N
            do I = 1, M
              ...
            enddo
          enddo
    c     wrapped-around computations
          do J = 1, N
            ...
          enddo
          do I = 1, M
            ...
          enddo
          ...

• Transformations
  – Inline subroutines
  – Convert GO TO into DO-loop
  – Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
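
The effect of the GO TO conversion and the peeling can be sketched by comparing call traces. This is a Python stand-in, not the SWIM source: the routines are reduced to trace labels, `CALC1` is an assumption taken from the labels on the following slides (the listing here shows only `CALL CALC2`), and `ITMAX >= 2` is assumed.

```python
def original_trace(itmax):
    """Control flow of the GO TO version, recorded as a list of routine names."""
    trace = []
    ncycle = 0
    while True:                       # label 90 ... GO TO 90
        ncycle += 1
        trace += ["CALC1", "CALC2"]   # CALC1 is assumed, see lead-in
        if ncycle >= itmax:
            break                     # STOP
        if ncycle <= 1:
            trace.append("CALC3Z")
        else:
            trace.append("CALC3")
    return trace


def transformed_trace(itmax):
    """DO-loop version with the first and last iterations peeled (itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]       # peeled iteration NCYCLE = 1
    for _k in range(2, itmax):                 # DO K = 2, ITMAX-1, no IF inside
        trace += ["CALC1", "CALC2", "CALC3"]
    trace += ["CALC1", "CALC2"]                # peeled iteration NCYCLE = ITMAX
    return trace
```

After peeling, the body of the remaining K loop is the same straight-line sequence on every iteration, which is what the program model requires.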

Wrapped-around Computations

    DO K = 2, ITMAX-1
    c     CALC1
          do J = 1, N
            do I = 1, M
              ...
            enddo
          enddo
    c     wrapped-around comp
          do J = 1, N
            ...
          enddo
          do I = 1, M
            ...
          enddo
          ...
    c     CALC2
          ...
    c     CALC3
          do J = 1, N
            do I = 1, M
              ...
            enddo
          enddo
          ...
    ENDDO

[Figure: two views of the array with J and I axes]

Wrapped-around Computations
• Projection along direction I

    DO K = 2, ITMAX-1
          do J = 1, N
            ...
          enddo
    c     wrapped-around comp
          do J = 1, N
            ...
          enddo
          ...
    ENDDO

[Figure: J axis]
• Another way of dealing with the wrapped-around computations is to perform code sinking

1st Step: achieved program model
• Flow dependencies & iteration space for SWIM (projection along direction I)

    DO K = 2, ITMAX-1
    c     CALC1
          do J = 1, N
            ...
          enddo
          wrapped-around
    c     CALC2
          do J = 1, N
            ...
          enddo
          wrapped-around
    c     CALC3
          do J = 1, N
            ...
          enddo
          wrapped-around
    ENDDO

[Figure: iteration space, J axis (1..N) against the K-loop (time) axis, shown for K=2 and K=3]

Steps
1 - Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2 - Perform 2D-Tiling for the Cache Level
3 - Perform 1D-Tiling for the Register Level

1D-Tiling
[Figure: iteration space, J axis (1..N) against K=2, K=3, K=4, with tile boundaries annotated SLOPE and OFFSET-i]
• Dependencies are violated
• Tiling parameters: SLOPE, OFFSET-i
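
One reading of the SLOPE parameter is that each tile shifts along J as the time step advances, so that every value a tile reads has already been produced. The Python sketch below illustrates that idea on a 1-D three-point stencil. It is an illustration under assumptions, not Song & Li's algorithm: there is a single loop nest per time step (so no OFFSET-i), every time step is kept in its own row of `hist`, and the names `naive`, `skewed_tiled`, `ts` and `slope` are hypothetical.

```python
def naive(u0, steps):
    """Reference: sweep the whole array once per time step."""
    n = len(u0)
    u = list(u0)
    for _ in range(steps):
        new = u[:]                                   # boundary points stay fixed
        for j in range(1, n - 1):
            new[j] = (u[j - 1] + u[j] + u[j + 1]) / 3.0
        u = new
    return u


def skewed_tiled(u0, steps, ts, slope=1):
    """Same computation, tile by tile; each tile moves left by `slope` per time step."""
    n = len(u0)
    hist = [list(u0)] + [[None] * n for _ in range(steps)]
    for k in range(1, steps + 1):                    # fixed boundary values
        hist[k][0], hist[k][n - 1] = u0[0], u0[n - 1]
    for jj in range(1, n - 1 + slope * steps, ts):   # loop over tiles
        for k in range(1, steps + 1):                # time steps inside one tile
            shift = slope * (k - 1)
            lo = max(1, jj - shift)
            hi = min(n - 2, jj + ts - 1 - shift)
            prev = hist[k - 1]
            for j in range(lo, hi + 1):
                # a read of an uncomputed (None) value would raise TypeError
                hist[k][j] = (prev[j - 1] + prev[j] + prev[j + 1]) / 3.0
    return hist[steps]


u0 = [float(x * x % 7) for x in range(10)]
```

With `slope=0` the tiles are rectangular and the sketch fails on the first read of a value that a later tile was supposed to produce, which is the dependence violation the slide refers to.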

2D-Tiling
[Figure: iteration space with J axis (1..N) and I axis (1..M) against K (time-step loop)]
• Tiling parameters: SLOPE, OFFSET-i for each tiled dimension (J and I)
• Computed using the JI-loop distance subgraph

JI-loop Distance Subgraph
[Figure: graph with nodes JI1-loop, JI2-loop and JI3-loop; edges labeled with distance vectors [0, 0, 0], [1, 0, 0], [1, -1, 0], [1, 0, -1] and [1, -1]; legend: flow dependencies, anti-dependencies, output dependencies]
• Each node represents a JI-loop nest
• Each edge represents a dependence (distance vector)

Wrapped-around Computations
• SWIM: projection along direction I

    DO K = 2, ITMAX-1
          do J = 1, N
            ...
          enddo
          wrapped-around
          do J = 1, N
            ...
          enddo
          wrapped-around
    ENDDO

[Figure: J axis (1..N) against the K-loop (time) axis, shown for K=2 and K=3]
• Backward dependencies with large distances make Tiling unprofitable
  – apply Circular Loop Skewing to shorten backward dependencies

Circular Loop Skewing
• Shortens backward dependencies by changing the iteration order
[Figure: J axis (1..N) for K=2 and K=3, annotated with BETA-i and DELTA]
• CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)

Circular Loop Skewing

    DO K = 2, ITMAX-1
          do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
            J = MOD(JX-1, N) + 1
            ...
          enddo
          wrapped-around
          do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
            J = MOD(JX-1, N) + 1
            ...
          enddo
          wrapped-around
          do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
            J = MOD(JX-1, N) + 1
            ...
          enddo
          wrapped-around
    ENDDO

[Figure: J axis (1..N) for K=2 and K=3, annotated with BETA-i and DELTA]
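
The index arithmetic in the skewed loops can be checked in isolation. The Python sketch below reproduces only the `JX` bounds and the `J = MOD(JX-1, N) + 1` mapping for one loop nest; the function name `cls_order` is hypothetical, and `DELTA` is treated as a scalar multiplied by `K-2`.

```python
def cls_order(n, k, beta, delta):
    """J values visited, in order, by one circularly skewed loop at time step k."""
    start = 1 + beta + delta * (k - 2)          # lower bound of the JX loop
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]


at_k2 = cls_order(5, 2, 0, 2)   # [1, 2, 3, 4, 5]
at_k3 = cls_order(5, 3, 0, 2)   # [3, 4, 5, 1, 2]
```

Each time step still visits every J from 1 to N exactly once; only the starting point rotates, by `DELTA` per time step and by `BETA-i` per loop nest.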

2nd Step: 2D-Tiling for cache level
• SWIM: projection along direction I
• CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
• Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0

    DO JJ = ...
      DO II = ...
        DO K = ...
          if (first tile) then
            do JX = ...
              offsets iter.
            enddo
          endif
          do JX = ...
            iter. inside tile
          enddo
          do JX = ...
            iter. inside tile
          enddo
        ENDDO
      ENDDO
    ENDDO

[Figure: J axis (1..N) for K=2, K=3, K=4, with the iterations of loop nests 1, 2 and 3 grouped into tiles]

Steps
1 - Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2 - Perform 2D-Tiling for the Cache Level
3 - Perform 1D-Tiling for the Register Level

3rd Step: 1D-Tiling for register level

    DO JJ = ...
      DO II = ...
        DO K = ...
          do JX = LJ, UJ
            J = MOD(JX-1, N) + 1
            do IX = LI, UI
              I = MOD(IX-1, M) + 1
              [loop body: {I, J}]
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

[Figure: J axis (…, N-2, N-1, N, 1, 2, …) and I axis (…, M-1, M, 1, 2, …), with the unrolled iterations marked]
• The MOD operation introduced by CLS prevents us from fully unrolling the loop
  – Apply Index Set Splitting to loop J first

Index Set Splitting
• ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space

    DO JJ = ...
      DO II = ...
        DO K = ...
          do JX = LJ, min(N, UJ)
            J = JX
            do IX = ...
            enddo
          enddo
    c     ISS 2
          do JX = max(N+1, LJ), UJ
            J = JX-N
            do IX = ...
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

[Figure: J axis (…, N-2, N-1, N, 1, …) and I axis (…, M-2, M-1, M, 1, 2, …)]
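
The two split loops can be checked against the MOD version directly. This Python sketch compares the J values produced by each form; the function names are hypothetical and the bounds are assumed to satisfy `1 <= LJ <= UJ <= 2*N`, so that `JX` wraps around at most once.

```python
def with_mod(lj, uj, n):
    """J values of the circularly skewed loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def split(lj, uj, n):
    """Same J values from two MOD-free loops produced by index set splitting."""
    js = []
    for jx in range(lj, min(n, uj) + 1):        # part that has not wrapped: J = JX
        js.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):    # part that has wrapped: J = JX - N
        js.append(jx - n)
    return js


example = split(4, 8, 6)   # [4, 5, 6, 1, 2]
```

Inside each of the two loops J is an affine function of `JX`, which is what makes unrolling possible in the next step.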

3rd Step: 1D-Tiling for register level

    DO JJ = ...
      DO II = ...
        DO K = ...
          do JX = LJ, min(N, UJ)-3+1, 3
            J = JX
            do IX = ...
              [loop body: {J}]
              [loop body: {J+1}]
              [loop body: {J+2}]
            enddo
          enddo
          do JX = JX, min(N, UJ)
            J = JX
            do IX = ...
              [loop body: {J}]
            enddo
          enddo
          ...          (ISS 2)
        ENDDO
      ENDDO
    ENDDO

[Figure: J axis (…, N-2, N-1, N, 1, …) and I axis (…, M-2, M-1, M, 1, 2, …)]
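
The unrolled loop and its clean-up loop can be sketched by recording the order in which (J, I) iterations execute. This is a Python illustration with hypothetical names; the inner `for u in range(ts)` stands in for the `ts` textual copies of the loop body that the real transformation would emit, and `ts=3` matches the unroll factor on the slide.

```python
def original_order(lj, uj, li, ui):
    """(J, I) iterations in the order of the untransformed loop nest."""
    return [(j, i) for j in range(lj, uj + 1) for i in range(li, ui + 1)]


def register_tiled_order(lj, uj, li, ui, ts=3):
    """(J, I) iterations after unrolling J by ts and fusing the copies into the I loop."""
    order = []
    jx = lj
    while jx <= uj - ts + 1:                 # do JX = LJ, UJ-3+1, 3
        for ix in range(li, ui + 1):
            for u in range(ts):              # body copies for J, J+1, J+2
                order.append((jx + u, ix))
        jx += ts
    while jx <= uj:                          # clean-up loop: do JX = JX, UJ
        for ix in range(li, ui + 1):
            order.append((jx, ix))
        jx += 1
    return order


small = register_tiled_order(1, 4, 1, 2)
```

Within one pass of the I loop the same I is used for three consecutive J values, so operands shared between those iterations can stay in registers.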

Code Transformations Summary
1 - Apply a set of transformations to the original program to achieve the program model defined by Song & Li
  – Inline subroutines
  – Convert GO TO into DO-loop
  – Peel iterations of the time-step loop to eliminate IF-statements
2 - Perform 2D-Tiling for the Cache Level
  – Construct JI-loop distance subgraph
  – Compute DELTA and BETAs and apply CLS to shorten backward dependencies
  – Update JI-loop distance subgraph
  – Compute OFFSETs and SLOPE and tile the IS
3 - Perform 1D-Tiling for the Register Level
  – Index Set Splitting
  – Tiling in a straightforward manner

Performance Results (SWIM)
• Architecture: EV56 (500 MHz, L1: 8 KB, L2: 96 KB), EV6 (500 MHz, L1: 64 KB, L2: 4 MB)
• Compiler invocation:
  – f77 -O5 -arch ev56 (EV5)
  – kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
• Programs:
  – 1D-Tiling for the cache level: loop J, TS = 4 (EV5), TS = 8 (EV6)
  – 2D-Tiling for the cache level: TS IxJ = 32x16 (EV5), TS IxJ = 40x12 (EV6)
  – 1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6)
[Chart: speedup and execution time on EV5 and EV6 for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT. Execution times as extracted, assignment to bars not recoverable: 1519 s, 1533 s, 1023 s, 999 s, 1009 s; EV6: 439 s, 658 s, 294 s, 371 s, 578 s; also 677 s, 296 s]

Performance Results EV5 (SWIM)
• Architecture: EV56 (500 MHz, L1: 8 KB, L2: 96 KB)
• Compiler invocations:
  – base: kf77 -O5 -arch ev56
  – no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …
[Chart: speedup over ORI (base) for ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Performance Results EV6 (SWIM)
• Architecture: EV6 (500 MHz, L1: 64 KB, L2: 4 MB)
• Compiler invocations:
  – base: f77 -O5 -arch ev6
  – no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …
[Chart: speedup over ORI (base) for ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Code for Result Verification

    DO K = 2, ITMAX-1
          ...
          do J = 1, N
            ...
          enddo
    c     result verification
          IF (MOD(K, MPRINT) .eq. 0) THEN
            do I = ...
              do J = ...
                UCHECK + {UNEW(I, J)}
              enddo
              UNEW(I, I) = ...          (NEW in SPEC 2000!!)
            enddo
            PRINTS
          ENDIF
          do J = 1, N
            ...
          enddo
    ENDDO

• Apply strip-mining to loop K (only useful if MPRINT is large)
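
Strip-mining the K loop so that each strip ends where verification is due can be sketched as follows. This is a simplified Python model with hypothetical names: a time step is reduced to a single trace entry and verification is assumed to run at the end of a time step, whereas on the slide it sits between two loop nests.

```python
def original(itmax, mprint):
    """K loop with the verification IF inside the loop body."""
    trace = []
    for k in range(2, itmax):                       # DO K = 2, ITMAX-1
        trace.append(("step", k))
        if k % mprint == 0:
            trace.append(("verify", k))
    return trace


def strip_mined(itmax, mprint):
    """Strips of K iterations; verification runs between strips."""
    trace = []
    kk = 2
    while kk <= itmax - 1:
        # each strip ends at the next multiple of mprint or at the last iteration
        k_end = min(((kk + mprint - 1) // mprint) * mprint, itmax - 1)
        for k in range(kk, k_end + 1):              # inner K loop without the IF
            trace.append(("step", k))
        if k_end % mprint == 0:
            trace.append(("verify", k_end))
        kk = k_end + 1
    return trace
```

The inner K loop is free of the verification code and can be tiled, but each strip holds at most `MPRINT` time steps, which is why the approach only pays off when `MPRINT` is large.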