Optimizing Loop Performance for Clustered VLIW Architectures

by Yi Qian (Texas Instruments)
Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)

Clustered VLIW Architecture

Motivation
  - Clustered VLIW architectures have been adopted to improve ILP while keeping the port requirements of the register files low.
  - The compiler must:
      - expose maximal parallelism,
      - maintain minimal communication overhead.
  - High-level optimizations can improve loop performance on clustered VLIW machines.

Background
  - Software pipelining (modulo scheduling): achieve ILP by overlapping the execution of different loop iterations.
  - Initiation interval (II)
      - ResII: constraints from the machine resources.
      - RecII: constraints from the dependence recurrences.
      - MinII = max(ResII, RecII)
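The bound MinII = max(ResII, RecII) can be sketched in C as follows. This is a minimal illustration using the standard modulo-scheduling definitions, not the authors' implementation. The function names and the array-based description of resources and recurrence cycles are assumptions made for this example.

```c
#include <assert.h>

/* Ceiling division for a >= 0, b > 0. */
int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* ResII: the most heavily used resource class bounds the II from below.
   ops[k]   = operations per iteration that need resource class k
   units[k] = number of functional units of class k (hypothetical inputs) */
int res_ii(const int *ops, const int *units, int nclasses) {
    int ii = 1;
    for (int k = 0; k < nclasses; ++k) {
        int bound = ceil_div(ops[k], units[k]);
        if (bound > ii) ii = bound;
    }
    return ii;
}

/* RecII: every dependence cycle needs ceil(latency / distance) cycles per
   iteration, where latency[c] is the total latency around cycle c and
   distance[c] is the total iteration distance around cycle c. */
int rec_ii(const int *latency, const int *distance, int ncycles) {
    int ii = 1;
    for (int c = 0; c < ncycles; ++c) {
        int bound = ceil_div(latency[c], distance[c]);
        if (bound > ii) ii = bound;
    }
    return ii;
}

/* MinII = max(ResII, RecII). */
int min_ii(int res, int rec) {
    return res > rec ? res : rec;
}
```

For example, a loop with 2 computational and 4 memory operations on one unit of each kind has ResII = 4. If its only recurrence has latency 3 and distance 1, RecII = 3 and MinII = 4.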

Loop Transformations: Scalar Replacement
  - Replace array references with scalar variables.
  - Improve register usage.

Original loop:

    for (i=0; i<n; ++i)
      for (j=0; j<n; ++j)
        a[i] = a[i] + b[j] * x[i][j];

After scalar replacement:

    for (i=0; i<n; ++i) {
      t = a[i];
      for (j=0; j<n; ++j)
        t = t + b[j] * x[i][j];
      a[i] = t;
    }
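The two loop forms can be compared directly. The sketch below wraps each in a function so the results can be checked against each other. The function names, the use of `double`, and the flattened row-major layout of `x` are assumptions made for this example. The transformation is only valid when `a` does not alias `b` or `x`.

```c
#include <assert.h>

/* Original form: a[i] is read and written on every inner iteration.
   x is an n-by-n matrix stored row-major in a flat array (assumption). */
void update_original(int n, double *a, const double *b, const double *x) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[i] = a[i] + b[j] * x[i * n + j];
}

/* Scalar-replaced form: a[i] is loaded once into t, accumulated in a
   register candidate, and stored once after the inner loop. */
void update_scalar_replaced(int n, double *a, const double *b, const double *x) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        double t = a[i];
        for (int j = 0; j < n; ++j)
            t = t + b[j] * x[i * n + j];
        a[i] = t;
    }
}
```

Both functions perform the same additions in the same order. The second form removes one load and one store of `a[i]` from each inner iteration.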

Loop Transformations: Unrolling and Unroll-and-Jam
  - Unrolling
      - reduces inter-iteration overhead
      - enlarges the loop body
  - Unroll-and-jam
      - balances the computation and memory-access requirements
      - improves uMinII (MinII / unroll amount)

Example (1 computational unit, 1 memory unit)

Original loop (uMinII = 4: four memory operations per iteration on one memory unit):

    for (i=1; i<=2*n; ++i)
      for (j=1; j<=n; ++j)
        a[i][j] = a[i][j] + b[j] * c[j];

Unroll-and-jammed loop (uMinII = 3: six memory operations for two original iterations):

    for (i=1; i<=2*n; i+=2)
      for (j=1; j<=n; ++j) {
        a[i][j] = a[i][j] + b[j] * c[j];
        a[i+1][j] = a[i+1][j] + b[j] * c[j];
      }
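The uMinII values in this example follow from counting operations per resource. The sketch below is a simplified model, not the authors' formula: it considers only resource constraints, assumes single-cycle unit occupancy, and assumes that `b[j]`, `c[j]`, and their product are shared between the two jammed statements. The function name `u_min_ii` is hypothetical.

```c
#include <assert.h>

/* Resource-only estimate of uMinII = MinII / unroll amount.
   mem_ops, comp_ops: operations in the (possibly unrolled) loop body
   mem_units, comp_units: functional units of each kind
   unroll: number of original iterations covered by one body execution */
double u_min_ii(int mem_ops, int comp_ops, int mem_units, int comp_units,
                int unroll) {
    assert(mem_units > 0 && comp_units > 0 && unroll > 0);
    int mem_bound = (mem_ops + mem_units - 1) / mem_units;
    int comp_bound = (comp_ops + comp_units - 1) / comp_units;
    int res = mem_bound > comp_bound ? mem_bound : comp_bound;
    return (double)res / unroll;
}
```

For the original loop (4 memory operations, 2 computations, unroll amount 1) this gives 4. For the unroll-and-jammed loop (6 memory operations, 3 computations, unroll amount 2) it gives 3.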

Loop Transformations: Unroll-and-jam/unrolling generate intercluster parallelism

Loop without a loop-carried dependence:

    for (i=0; i<2*n; ++i)
      a[i] = a[i] + 1;

    for (i=0; i<2*n; i+=2) {
      /* cluster 0 */
      a[i] = a[i] + 1;
      /* cluster 1 */
      a[i+1] = a[i+1] + 1;
    }

Loop with a loop-carried dependence:

    for (i=0; i<2*n; ++i)
      a[i] = a[i-1] + 1;

    for (i=0; i<2*n; i+=2) {
      /* cluster 0 */
      a[i] = a[i-1] + 1;
      /* cluster 1 */
      a[i+1] = a[i] + 1;
    }

Loop Transformations: Loop Alignment
  - Removes loop-carried dependences.

Original loop:

    for (i=1; i<n; ++i) {
      a[i] = b[i] + c[i];
      x[i] = a[i-1] * q;
    }

Aligned loop:

    x[1] = a[0] * q;
    for (i=1; i<n-1; ++i) {
      a[i] = b[i] + c[i];
      x[i+1] = a[i] * q;
    }
    a[n-1] = b[n-1] + c[n-1];

  - Alignment conflicts

<1>

    for (i=1; i<n; ++i) {
      a[i] = b[i] + q;
      c[i] = a[i-1] + a[i-2];
    }

<2>

    for (i=1; i<n; ++i)
      a[i] = a[i-1] + b[i];

  - Used to determine intercluster communication cost.
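The equivalence of the original and aligned loops can be checked with a small sketch. The function names and the use of `double` are assumptions made for this example, and the aligned version assumes n >= 2.

```c
#include <assert.h>

/* Original loop: x[i] uses a[i-1], a value produced one iteration earlier
   (a loop-carried dependence of distance 1). */
void loop_original(int n, double *a, const double *b, const double *c,
                   double *x, double q) {
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;
    }
}

/* Aligned loop: the use is shifted by one iteration so that each iteration
   consumes the a[i] it has just produced. The first use and the last
   definition are peeled off. Assumes n >= 2. */
void loop_aligned(int n, double *a, const double *b, const double *c,
                  double *x, double q) {
    assert(n >= 2);
    x[1] = a[0] * q;
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;
    }
    a[n - 1] = b[n - 1] + c[n - 1];
}
```

After alignment, the dependence from the definition of `a[i]` to its use stays inside one iteration, so the iterations no longer communicate through `a`.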

Related Work: Partitioning Problem
  - Ellis: BUG
  - Capitanio et al.: LC-VLIW
  - Nystrom et al.: cluster assignment and software pipelining
  - Ozer et al.: UAS
  - Sanchez et al.: unified method
  - Hiser et al.: RCG
  - Aleta et al.: pseudo-scheduler

Related Work: Loop Transformations
  - Scalar replacement
      - Callahan et al.: pipelined architectures
      - Carr, Kennedy: general algorithm
      - Duesterwald: data-flow framework
  - Loop alignment
      - Allen et al.: shared-memory machines
  - Unrolling/unroll-and-jam
      - Callahan et al.: pipelined architectures
      - Carr, Kennedy: ILP
      - Carr, Guan: linear algebra
      - Carr: cache, software pipelining
      - Sarkar: ILP, IC
      - Sanchez et al.: clustered machines
      - Huang et al.: clustered machines
      - Shin et al.: superword register files

Optimization Strategy

    Source Code
      -> Unroll-and-jam/Unrolling
      -> Scalar Replacement
      -> Intermediate Code Generator
      -> Data-flow Optimization
      -> Value Cloning
      -> Register Partitioning
      -> Software Pipelining
      -> Assembly Code Generator
      -> Target Code

Our Method
  - Picking loops to unroll
  - Computing uMinII
  - Computing register pressure (see paper)
  - Determining unroll amounts

Picking Loops to Unroll
  - [loop label not preserved]: carries the most dependences that are amenable to scalar replacement.
  - [loop label not preserved]: contains the fewest alignment conflicts.

Computing uMinII
  - uRecII does not increase.
  - uResII = [formula not preserved], where [definitions not preserved]

Computing Communication Cost for Unrolled Loops
  - Intercluster copies
      - Single loop
          - variant dependences
          - invariant dependences
      - Multiple loops (see paper)
          - invariant dependences
          - innermost loop is unrolled
          - variant dependences
          - innermost loop is not unrolled

Unrolling a Single Loop: Variant Dependences
  - Before unrolling / after unrolling: [dependence diagram across clusters not preserved]
  - Sinks of the new dependences: [formula not preserved]
  - Copies per cluster: = # of e where [condition not preserved]
  - Total costs: [formula not preserved]

Unrolling a Single Loop: Variant Dependences, Special Cases

if [condition not preserved] then [result not preserved], 4 clusters:

    for (i=0; i<4*n; i+=4) {
      a[i] = a[i-4];
      a[i+1] = a[i-3];
      a[i+2] = a[i-2];
      a[i+3] = a[i-1];
    }

if [condition not preserved] then [result not preserved], 2 clusters:

    for (i=0; i<6*n; i+=6) {
      a[i] = a[i-2];
      a[i+1] = a[i-1];
      a[i+2] = a[i];
      a[i+3] = a[i+1];
      a[i+4] = a[i+2];
      a[i+5] = a[i+3];
    }
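One way to see why such cases are special is to count the dependences whose source and sink land on different clusters. The sketch below is an illustration only and is not the cost formula from the paper. It assumes a single statement with one loop-carried dependence of distance d, an unroll amount u, and a hypothetical cyclic assignment in which unrolled copy k runs on cluster k mod clusters.

```c
#include <assert.h>

/* Count the unrolled copies whose input value is produced on a different
   cluster. Copy k writes a[i+k] and reads a[i+k-d], which is written by
   copy (k - d) mod u of the same or an earlier unrolled iteration.
   The cyclic copy-to-cluster mapping is an assumption of this sketch. */
int cross_cluster_deps(int d, int u, int clusters) {
    assert(d > 0 && u > 0 && clusters > 0);
    int count = 0;
    for (int k = 0; k < u; ++k) {
        int src = ((k - d) % u + u) % u;   /* non-negative modulo */
        if (src % clusters != k % clusters)
            ++count;
    }
    return count;
}
```

Under this assumed mapping, both loops on this slide (distance 4, unrolled by 4 on 4 clusters; distance 2, unrolled by 6 on 2 clusters) have no cross-cluster dependences, while `a[i] = a[i-1] + 1` unrolled by 2 on 2 clusters has two.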

Unrolling a Single Loop: Invariant Dependences
  - Redundant references can be eliminated by scalar replacement.
  - Clusters that use the replaced value but do not hold it need a copy operation.

Original loop:

    for (j=1; j<=4*n; ++j)
      for (i=1; i<=m; ++i)
        a[j][i] = a[j][i-1] + b[i];

After unroll-and-jam and scalar replacement:

    for (j=1; j<=4*n; j+=4)
      for (i=1; i<=m; ++i) {
        t = b[i];
        a[j][i] = a[j][i-1] + t;
        a[j+1][i] = a[j+1][i-1] + t;
        a[j+2][i] = a[j+2][i-1] + t;
        a[j+3][i] = a[j+3][i-1] + t;
      }

Determining Unroll Amounts
  - Integer optimization problem
  - Exhaustive search
  - Heuristic method
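The exhaustive-search option can be pictured with a deliberately simplified sketch. It handles one loop only, and its cost model is invented for illustration: the `LoopSummary` fields, the linear growth of memory operations, copies, and registers, and the function names are all assumptions, not the model from the paper. It picks the unroll amount that minimizes the resource-bound II per original iteration while staying within a register budget.

```c
#include <assert.h>

/* Hypothetical loop summary; every field is an illustrative assumption. */
typedef struct {
    int private_mem_ops;   /* memory ops replicated by each unrolled copy */
    int shared_mem_ops;    /* memory ops shared by all copies */
    int copies_per_extra;  /* intercluster copies added per extra copy */
    int base_regs;         /* registers needed regardless of unrolling */
    int regs_per_copy;     /* registers added by each unrolled copy */
} LoopSummary;

/* Resource bound on the II of the loop body unrolled u times. */
int unrolled_res_ii(const LoopSummary *l, int u, int mem_units, int copy_units) {
    assert(u > 0 && mem_units > 0 && copy_units > 0);
    int mem = l->private_mem_ops * u + l->shared_mem_ops;
    int cp = l->copies_per_extra * (u - 1);
    int mem_bound = (mem + mem_units - 1) / mem_units;
    int cp_bound = (cp + copy_units - 1) / copy_units;
    int ii = mem_bound > cp_bound ? mem_bound : cp_bound;
    return ii > 0 ? ii : 1;
}

/* Try every unroll amount in 1..max_unroll and keep the one with the
   smallest II per original iteration that fits in max_regs registers. */
int best_unroll(const LoopSummary *l, int max_unroll, int max_regs,
                int mem_units, int copy_units) {
    int best_u = 1;
    int best_ii = unrolled_res_ii(l, 1, mem_units, copy_units);
    for (int u = 2; u <= max_unroll; ++u) {
        if (l->base_regs + l->regs_per_copy * u > max_regs)
            continue;                      /* register pressure too high */
        int ii = unrolled_res_ii(l, u, mem_units, copy_units);
        /* ii/u < best_ii/best_u, compared without floating point */
        if (ii * best_u < best_ii * u) {
            best_u = u;
            best_ii = ii;
        }
    }
    return best_u;
}
```

In this toy model, a loop with no communication cost is unrolled as far as the register budget allows, while a loop whose copies dominate is left unrolled less or not at all.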

Experimental Results
  - Benchmarks
      - 119 DSP loops from TI's benchmark suite
      - DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
  - Architectures
      - URM, a simulated architecture
          - 8 functional units: 2 clusters, 4 clusters (1 copy unit)
          - 16 functional units: 2 clusters, 4 clusters (2 copy units)
      - TMS320C64x

Unroll-and-jam/unrolling is applicable to 71 loops.

URM Speedups: Transformed vs. Original

    Width  Clusters  Harmonic speedup  Median speedup  Improved
        8         2              1.39            1.52        50
        8         4              1.68            1.78        69
       16         2              1.4             1.6         50
       16         4              1.43            1.6         51
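The harmonic and median columns are summary statistics over per-loop speedups. A minimal sketch of how such summaries are computed is shown below. The function names are hypothetical, and the per-loop measurements themselves are not part of these slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Harmonic mean of n positive speedups: n / sum(1 / s[i]). */
double harmonic_mean(const double *s, int n) {
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        assert(s[i] > 0.0);
        sum += 1.0 / s[i];
    }
    return n / sum;
}

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *p, const void *q) {
    double x = *(const double *)p;
    double y = *(const double *)q;
    return (x > y) - (x < y);
}

/* Median of n speedups. Sorts the array in place. */
double median_speedup(double *s, int n) {
    assert(n > 0);
    qsort(s, (size_t)n, sizeof s[0], cmp_double);
    if (n % 2 == 1)
        return s[n / 2];
    return 0.5 * (s[n / 2 - 1] + s[n / 2]);
}
```

The harmonic mean weights slowdowns more heavily than the arithmetic mean does, which makes it a conservative summary for speedup ratios.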

Our Algorithm vs. Fixed Unroll Amounts

    Width  Clusters  Speedup  Harmonic (fixed)  # of loops
        8         2     1                 0.88           9
        8         4     1.07              0.95          21
       16         2     1                 0.88           9
       16         4     0.91              0.84           4

Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

C64x Results

TMS320C64x Speedups: Unrolled vs. Original

    Harmonic speedup  Median speedup  Improved
                 1.7               2        55

Accuracy of Communication Cost Model
  - Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops.
      - 2-cluster: 66 exact prediction
      - 4-cluster: 64 exact prediction

Conclusion
  - Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
  - 70%-90% of the 71 loops improve, with speedups of 1.4-1.7.
  - High-level transformations should be an integral part of compilation for clustered VLIW machines.