Register Pressure Guided Unroll-and-Jam
Authors: Yin Ma, Steven Carr

Motivation
  • In a processor, registers sit at the fastest level of the memory hierarchy, but the number of physical registers is very limited.
  • Unroll-and-jam in the loop model of Open64 not only increases register pressure by itself but also creates opportunities for other loop optimizations to increase register pressure indirectly.
  • If a transformed loop demands too many registers, overall performance may degrade.
  • Given a loop nest, a better register pressure prediction and a suitably chosen unroll factor can eliminate this degradation and achieve better overall performance.

Research Topic
  • A register pressure prediction algorithm for unroll-and-jam
  • A register pressure guided loop model for unroll-and-jam

Background: Data Dependence Analysis
  • The data dependence graph (DDG) is a directed graph that represents the data dependence relationships among instructions.
  • A true dependence exists when L1 stores into a memory location that is later read by L2.
  • An anti-dependence exists when L1 reads from a memory location that is later written by L2.
  • An output dependence exists when L1 and L2 store into the same memory location.
  • An input dependence exists when L1 and L2 both read the same memory location.

  Dependence   S1          S2
  True         L1 = ...    ... = L2
  Anti         ... = L1    L2 = ...
  Output       L1 = ...    L2 = ...
  Input        ... = L1    ... = L2
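The four dependence kinds can be illustrated with a small sketch. It assumes straight-line code over named scalar locations, with no array subscript analysis; the function names `classify_dependence` and `build_ddg` and the access-tuple layout are hypothetical and are not taken from the slides or from Open64.

```python
def classify_dependence(first_is_write, second_is_write):
    """Classify the dependence from an earlier access to a later access
    of the same memory location."""
    if first_is_write and not second_is_write:
        return "true"    # write, then read
    if not first_is_write and second_is_write:
        return "anti"    # read, then write
    if first_is_write and second_is_write:
        return "output"  # write, then write
    return "input"       # read, then read


def build_ddg(accesses):
    """accesses: list of (statement, location, is_write) in program order.
    Returns DDG edges as (source, sink, kind) tuples."""
    edges = []
    for i, (s1, loc1, w1) in enumerate(accesses):
        for s2, loc2, w2 in accesses[i + 1:]:
            if loc1 == loc2:
                edges.append((s1, s2, classify_dependence(w1, w2)))
    return edges


# Hypothetical three-statement example on one location x.
accesses = [
    ("S1", "x", True),   # S1: x = ...
    ("S2", "x", False),  # S2: ... = x
    ("S3", "x", True),   # S3: x = ...
]
edges = build_ddg(accesses)
```

A real compiler would compare array subscripts and loop bounds to decide whether two references can touch the same location; the sketch sidesteps that by comparing location names.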

Background: Scalar Replacement
  • Uses scalars, which are later allocated to registers, to replace array references and so decrease the number of memory references in loops.
  • This directly increases register pressure.

  Original:
      for ( i = 2; i < n; i++ )
          a[i] = a[i-1] + b[i];

  Scalar replaced:
      T = a[1];
      for ( i = 2; i < n; i++ ) {
          T = T + b[i];
          a[i] = T;
      }
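As a rough check of the trade-off, the sketch below mirrors the two loops in Python and counts array loads. The load counters and the sample data are assumptions made for illustration; they model memory reads only and say nothing about actual register allocation.

```python
def original(a, b, n):
    """a[i] = a[i-1] + b[i]; every iteration loads a[i-1] and b[i]."""
    a = list(a)
    loads = 0
    for i in range(2, n):
        a[i] = a[i - 1] + b[i]
        loads += 2
    return a, loads


def scalar_replaced(a, b, n):
    """Same loop with a[i-1] carried in the scalar t across iterations."""
    a = list(a)
    t = a[1]
    loads = 1            # the single load of a[1] before the loop
    for i in range(2, n):
        t = t + b[i]
        loads += 1       # only b[i] is loaded inside the loop
        a[i] = t
    return a, loads


# Hypothetical input data.
n = 8
a = [0, 1, 0, 0, 0, 0, 0, 0]
b = [10, 20, 30, 40, 50, 60, 70, 80]
result_orig, loads_orig = original(a, b, n)
result_sr, loads_sr = scalar_replaced(a, b, n)
```

Both versions produce the same array, but the scalar-replaced loop issues roughly half the loads, at the price of keeping `t` live across the whole loop.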

Background: Unroll-and-Jam
  • Creates larger loop bodies by flattening multiple outer-loop iterations into the inner loop body.
  • Larger loop bodies make other optimizations create more register pressure.

  Original:
      for ( I = 1; I < 10; I++ ) {
          for ( J = 1; J < 5; J++ ) {
              A[I][J] = B[J] + C[J];
              D[I][J] = E[I][J] + F[I][J];
          }
      }

  Unroll-and-jammed and later scalar replaced:
      for ( I = 1; I + 1 < 10; I = I+2 ) {
          for ( J = 1; J < 5; J++ ) {
              b = B[J]; c = C[J];
              A[I][J] = b + c;
              D[I][J] = E[I][J] + F[I][J];
              A[I+1][J] = b + c;
              D[I+1][J] = E[I+1][J] + F[I+1][J];
          }
      }
      /* the trip count of I (9) is odd, so the leftover iteration I = 9
         runs in a separate remainder loop (not shown) */
      /* register pressure increased because b and c hold two registers
         that could originally be reused for E and F */
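A minimal sketch of the same transformation in Python, assuming an unroll factor of 2 on the outer loop and an explicit remainder loop for an odd trip count. The function names and the sample data are hypothetical; the point is only that the jammed loop computes the same arrays as the original.

```python
def original(B, C, E, F, rows, cols):
    A = [[0] * cols for _ in range(rows)]
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            A[i][j] = B[j] + C[j]
            D[i][j] = E[i][j] + F[i][j]
    return A, D


def unroll_and_jam_by_2(B, C, E, F, rows, cols):
    A = [[0] * cols for _ in range(rows)]
    D = [[0] * cols for _ in range(rows)]
    i = 1
    while i + 1 < rows:              # two outer iterations per pass
        for j in range(1, cols):
            b = B[j]                 # scalar replaced: shared by both copies
            c = C[j]
            A[i][j] = b + c
            D[i][j] = E[i][j] + F[i][j]
            A[i + 1][j] = b + c
            D[i + 1][j] = E[i + 1][j] + F[i + 1][j]
        i += 2
    while i < rows:                  # remainder loop for an odd trip count
        for j in range(1, cols):
            A[i][j] = B[j] + C[j]
            D[i][j] = E[i][j] + F[i][j]
        i += 1
    return A, D


# Hypothetical data with the same bounds as the slide (I < 10, J < 5).
rows, cols = 10, 5
B = [j for j in range(cols)]
C = [2 * j for j in range(cols)]
E = [[10 * i + j for j in range(cols)] for i in range(rows)]
F = [[i - j for j in range(cols)] for i in range(rows)]
expected = original(B, C, E, F, rows, cols)
actual = unroll_and_jam_by_2(B, C, E, F, rows, cols)
```

In the jammed body, `b` and `c` stay live across both copies of the statement pair, which is where the additional register pressure comes from.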

Background: Software Pipelining
  • Software pipelining is an advanced scheduling technique. Usually, more heavily overlapped instructions demand additional registers.
  • The initiation interval (II) of a loop is the number of cycles between the starts of successive iterations.
  • The resource II (ResII) gives the minimum number of cycles needed to execute the loop based upon machine resources such as the number of functional units.
  • The recurrence II (RecII) gives the minimum number of cycles needed for a single iteration based upon the length of the cycles in the data dependence graph.

  Example: a loop executed N times, software pipelined according to the dependences among its operations A, B, C and D:
      [Prelude]    D_1  B_1  D_2
      [Loop body]  do N-2 times (with index i):  A_i  C_i  B_(i+1)  D_(i+2)
      [Postlude]   A_(N-1)  C_(N-1)  B_N  A_N  C_N
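The two bounds can be sketched with the usual textbook formulas from modulo scheduling: ResII as the largest ratio of operations to functional units per resource class, and RecII as the largest ratio of latency to iteration distance over the dependence cycles. This is an assumption about the standard definitions, not the authors' heuristic; the resource names and numbers are hypothetical.

```python
from math import ceil


def res_ii(op_counts, unit_counts):
    """op_counts[r]: operations per iteration needing resource r.
    unit_counts[r]: number of functional units of kind r."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_ii(cycles):
    """cycles: (total_latency, total_iteration_distance) for each
    dependence cycle in the DDG. No cycles means no recurrence bound."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=0)


def min_ii(op_counts, unit_counts, cycles):
    """A schedule must satisfy both bounds, so take the larger one."""
    return max(res_ii(op_counts, unit_counts), rec_ii(cycles))


# Hypothetical machine and loop: 3 memory ops on 2 memory units,
# 2 floating-point ops on 1 FP unit, and two recurrences.
ops = {"mem": 3, "fp": 2}
units = {"mem": 2, "fp": 1}
cycles = [(4, 1), (6, 2)]
```

With these numbers the loop is recurrence bound: the resource bound is 2 cycles, but the cycle with latency 4 and distance 1 forces an II of at least 4.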

Typical Approaches to Preventing Degradation from Register Pressure
  • Predictive approach (our approach)
      Predict the effects before applying optimizations and decide the best set of parameters for the optimizations.
      Fastest, and fits all situations.
  • Iterative approach (for example, feedback based)
      Apply optimizations with one set of parameters, then redo them with adjusted parameters for better performance.
  • Genetic approach
      Prepare many sets of parameters and apply optimizations with each set. Use genetic programming to pick the best.

Problem in Previous Work
  • All existing register pressure prediction methods are designed for software pipelining.
  • They do not support source-code-level loop optimizations at all.
  • There is no systematic research on how to predict register pressure for loop optimizations.
  • There is no register pressure guided loop model.

Key Design Detail
  • The prediction algorithms work at the source-code level.
  • The prediction algorithms handle the effects on register pressure from:
      unroll-and-jam
      scalar replacement
      software pipelining
      general scalar optimizations
  • The register pressure guided loop model uses the predicted register information to pick an unroll vector for the best performance.

Register Prediction for Unroll-and-Jam (Overview)
  • Compute the RecII with our heuristic method.
  • Create the list of arrays that will be replaced by scalars by checking the original DDG.
  • Construct the new DDG, D1, from the list above, for the original loop only.
  • All copies reuse D1 as their base DDG.
  • Adjust each DDG copy to reflect the future changes.
  • Re-compute the ResII to get the MinII.
  • Do a pseudo schedule to get the register pressure.

Construct the Base DDG
  • Traverse the innermost loop and construct the base DDG.

      DO J = 1, N
        DO I = 1, N
          U(I, J) = V(I) + P(J, I)
        ENDDO
      ENDDO

Prepare the DDG after Unroll-and-Jam
  • Duplicate the base DDG according to the input unroll factors.

      DO J = 1, N, 2
        DO I = 1, N
          U(I, J) = V(I) + P(J, I)
          U(I, J+1) = V(I) + P(J+1, I)
        ENDDO
      ENDDO

  Unroll vector is 2
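One way to picture the duplication step is sketched below. The DDG encoding (a dictionary of nodes holding an operation, an array name and symbolic subscripts, plus a list of edges), the node names, and the helper functions are all hypothetical; the slides do not show how Open64 represents the graph.

```python
def base_ddg():
    """Hand-built DDG for the statement U(I, J) = V(I) + P(J, I)."""
    nodes = {
        "ldV": ("load", "V", ("I",)),
        "ldP": ("load", "P", ("J", "I")),
        "add": ("add", None, ()),
        "stU": ("store", "U", ("I", "J")),
    }
    edges = [("ldV", "add"), ("ldP", "add"), ("add", "stU")]
    return nodes, edges


def shift(subscript, var, k):
    """Rewrite subscript `var` as `var+k` for the k-th unrolled copy."""
    if subscript == var and k:
        return f"{var}+{k}"
    return subscript


def duplicate(nodes, edges, var, factor):
    """Make `factor` copies of the base DDG, one per unrolled iteration
    of loop variable `var`. Copy k is tagged with the suffix #k."""
    new_nodes, new_edges = {}, []
    for k in range(factor):
        for name, (op, arr, subs) in nodes.items():
            shifted = tuple(shift(s, var, k) for s in subs)
            new_nodes[f"{name}#{k}"] = (op, arr, shifted)
        new_edges += [(f"{a}#{k}", f"{b}#{k}") for a, b in edges]
    return new_nodes, new_edges


nodes, edges = base_ddg()
dup_nodes, dup_edges = duplicate(nodes, edges, "J", 2)
```

After duplication, the second copy stores to `U(I, J+1)` and loads `P(J+1, I)`, while its load of `V(I)` is textually identical to the one in the first copy.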

Finalize the DDG
  • Remove unnecessary nodes and edges, and add new edges based on the updated dependences.
  • Reflect the effect of further optimizations.
  • Consider array indexing reuse by analyzing array subscripts.
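As one illustration of removing unnecessary nodes, the sketch below merges load nodes that read the same array element, which is the effect scalar replacement would later have on them. It is a simplification under stated assumptions: subscripts are compared as plain text, only loads are merged, and the node encoding and the name `merge_redundant_loads` are hypothetical.

```python
def merge_redundant_loads(nodes, edges):
    """Keep one load per distinct (array, subscripts) pair and redirect
    the edges of the removed duplicates to the kept node."""
    representative = {}  # (array, subscripts) -> first load node seen
    rename = {}
    for name, (op, arr, subs) in nodes.items():
        if op == "load":
            rename[name] = representative.setdefault((arr, subs), name)
        else:
            rename[name] = name
    kept = {n: v for n, v in nodes.items() if rename[n] == n}
    new_edges = sorted({(rename[a], rename[b]) for a, b in edges})
    return kept, new_edges


# Hypothetical DDG for the loop body unrolled by 2 on J:
#   U(I, J)   = V(I) + P(J, I)
#   U(I, J+1) = V(I) + P(J+1, I)
nodes = {
    "ldV#0": ("load", "V", ("I",)),
    "ldP#0": ("load", "P", ("J", "I")),
    "add#0": ("add", None, ()),
    "stU#0": ("store", "U", ("I", "J")),
    "ldV#1": ("load", "V", ("I",)),
    "ldP#1": ("load", "P", ("J+1", "I")),
    "add#1": ("add", None, ()),
    "stU#1": ("store", "U", ("I", "J+1")),
}
edges = [
    ("ldV#0", "add#0"), ("ldP#0", "add#0"), ("add#0", "stU#0"),
    ("ldV#1", "add#1"), ("ldP#1", "add#1"), ("add#1", "stU#1"),
]
final_nodes, final_edges = merge_redundant_loads(nodes, edges)
```

The second load of `V(I)` disappears and its consumer now reads from the first copy's load, so one value feeds both additions and stays live longer.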

Register Prediction
  • Schedule the final DDG with a depth-first scan starting from the first node of the first iteration copy.
  • The RecII is the RecII of the original innermost loop.
  • The ResII is computed on the final DDG with the target architecture information.
  • MinII = MAX(RecII, ResII)
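Once every value has a definition cycle and a last-use cycle in a schedule, register pressure for a given II can be estimated by counting how many values are live in each slot modulo II. The sketch below uses this common MaxLive estimate; it is not the authors' pseudo schedule, and the lifetimes are hypothetical.

```python
def max_live(lifetimes, ii):
    """lifetimes: (def_cycle, last_use_cycle) per value in the schedule of
    one iteration; a value is live in the half-open range [def, last_use).
    Iterations start every `ii` cycles, so cycle c shares slot c % ii
    with the same value from other in-flight iterations."""
    live = [0] * ii
    for start, end in lifetimes:
        for cycle in range(start, end):
            live[cycle % ii] += 1
    return max(live)


# Hypothetical lifetimes for three values.
lifetimes = [(0, 3), (1, 2), (2, 5)]
pressure_tight = max_live(lifetimes, 2)   # heavily overlapped iterations
pressure_loose = max_live(lifetimes, 5)   # no overlap between iterations
```

For the same lifetimes, the estimate is 4 registers at II = 2 and 2 registers at II = 5, which matches the earlier observation that more overlap demands more registers.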

Register Pressure Guided Unroll-and-Jam
  • Use unitII as the performance indicator of an unroll-and-jammed loop, where:
      R is the number of registers predicted
      P is the number of registers available
      D is the total outgoing degree in the final DDG
      E is the total number of cross-iteration edges
      A is the average memory access penalty
      N is the number of nodes in the final DDG

Open64 Implementation & Experimental Results
  • For register prediction, a retargetable compiler with an infinite number of available physical registers is used.
  • Loop nests are extracted from SPEC 2000.
  • For register pressure guided unroll-and-jam, our model directly replaces the unroll-and-jam analysis used by the Open64 backend.
  • A minor value computed with the information from Open64's cache model is added to unitII.
  • Register prediction for unroll-and-jam predicts the floating-point register pressure of a loop within 3 registers and the integer register pressure within 4 registers.
  • Our register pressure guided unroll-and-jam improves overall performance by about 2% over the model in the Open64 backend on both x86 and x86-64 architectures on the Polyhedron benchmark.

The End
Any questions?
