
Computer Aided Hand Tuning Antoine Monsifrot François Bodin CAPS Team June 2001

Overview
• Why CBR driven code tuning?
• Approach
• System overview
• Tuning cases
• Examples
• Conclusion

Introduction
• Execution speed depends
  – on the code structure
  – on the processor architecture
• Compiler optimizations frequently fail because compilers
  – are unable to analyze the programs (aliasing, ...)
  – must preserve program semantics
  – have little knowledge of the application or the target architecture
  – ignore most of the existing libraries

CBR Driven Code Tuning?
• Case-based reasoning
  – no knowledge formalization needed
  – 4 main operations: identification, retrieval, reuse, retention
• Defining a tuning case
  – abstracting loop performance properties
• User interaction

System Overview

A Tuning Case
• A goal and a target machine
• A program transformation
• A set of indices
  – data about the code that indicates the optimisation opportunity
  – abstraction of code properties
• High probability of recognising a code structure we know how to optimise
  – compilers need to be conservative
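A tuning case as described on this slide could be represented along the following lines. This is only a sketch: the type and field names (`LoopIndices`, `TuningCase`, `suggest`), the chosen indices, and the sample goal and target strings are assumptions made for illustration, not the tool's actual data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abstraction of one loop nest: a handful of static indices. */
typedef struct {
    int  depth;              /* loop nest depth */
    bool perfect_nest;       /* true if the nest is perfectly nested */
    bool affine_accesses;    /* true if all array subscripts are affine */
    bool has_column_access;  /* some array is walked along a non-contiguous dimension */
} LoopIndices;

/* Hypothetical tuning case: goal, target machine, transformation,
   and a predicate over the indices that signals the opportunity. */
typedef struct {
    const char *goal;
    const char *target;
    const char *transformation;
    bool (*matches)(const LoopIndices *);
} TuningCase;

/* Assumed matching rule for a tiling case (illustrative only). */
static bool tiling_indices(const LoopIndices *ix) {
    return ix->depth >= 2 && ix->perfect_nest &&
           ix->affine_accesses && ix->has_column_access;
}

static const TuningCase TILING_CASE = {
    "reduce TLB misses", "generic", "tiling", tiling_indices
};

/* Return the transformation name if the case applies to the loop, else NULL. */
const char *suggest(const TuningCase *c, const LoopIndices *ix) {
    return c->matches(ix) ? c->transformation : NULL;
}
```

Keeping the predicate separate from the transformation name mirrors the slide's split between "a set of indices" and "a program transformation"; the user still decides whether to apply the suggestion.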

Abstract performance indices
• Based on execution time code properties
  – data locality
  – parallelism
  – floating point operations
  – libraries
• Abstractions
  – data accesses
  – data dependencies
  – arithmetic expressions
  – code patterns

Static Indices
• Loop nest structure
  – depth, gotos, function calls
• Array accesses
  – access strides
• Expression patterns
  – div/div, power, sparse accesses, ...
• Loop patterns
  – Blas, LU, Jacobi, SOR
• Parallelism
  – data dependencies

Example:

    do k = 1, npts
      do j = 2, npts
        a(j, k) = a(j-1, k) + a(j, k)**2
        if (a(j, k) .eq. 0) then
          goto 4
        endif
        a(j, k) = a(j, k) + 1
 4      a(j, k) = a(j-1, k) / a(j, k)
      enddo
    enddo

Dynamic Indices
• Execution time and frequency
  – etime, tcov
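As an illustration of the access-stride index listed under the static indices, the sketch below computes the stride, in elements, of an access in a Fortran-style column-major array. The function name `column_major_stride` and its interface are assumptions for illustration; the slides do not say how the tool computes strides.

```c
#include <stddef.h>

/* Stride, in elements, between consecutive iterations of a loop whose index
   appears with coefficient 1 in subscript position `dim` (0-based) of a
   column-major array with extents extent[0], extent[1], ...
   Position 0 is the contiguous dimension, so its stride is 1. */
size_t column_major_stride(const size_t extent[], size_t dim) {
    size_t stride = 1;
    for (size_t d = 0; d < dim; ++d) {
        stride *= extent[d];
    }
    return stride;
}
```

For an array declared `a(100, 50)`, a loop over the first subscript of `a(j, k)` gives stride 1, while a loop over the second subscript gives stride 100, which is the kind of non-unit stride such an index would flag.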

Computing Cases
• For each loop, all cases are checked

    char *ComputeCase1(Indices[]) { … }
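One way to read the checker signature on this slide is as one function per case, each returning a case name or NULL, with every checker run on every loop. The sketch below assumes that reading. The index layout (`IDX_*`), the two checkers, their thresholds, the case names they return, and `check_all_cases` are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical index layout: one integer slot per abstract index of a loop. */
enum { IDX_DEPTH, IDX_HAS_GOTO, IDX_MAX_STRIDE, IDX_COUNT };

typedef const char *(*CaseChecker)(const int indices[]);

/* Hypothetical case: a nest of depth >= 2 with a non-unit stride. */
static const char *compute_case_tiling(const int indices[]) {
    return (indices[IDX_DEPTH] >= 2 && indices[IDX_MAX_STRIDE] > 1)
               ? "tiling" : NULL;
}

/* Hypothetical case: the loop body contains gotos. */
static const char *compute_case_goto(const int indices[]) {
    return indices[IDX_HAS_GOTO] ? "goto removal" : NULL;
}

static const CaseChecker CHECKERS[] = { compute_case_tiling, compute_case_goto };
#define NUM_CHECKERS (sizeof CHECKERS / sizeof CHECKERS[0])

/* Run every checker on one loop. Matched case names are stored in out[]
   (at most out_cap of them); the number stored is returned. */
size_t check_all_cases(const int indices[], const char *out[], size_t out_cap) {
    size_t n = 0;
    for (size_t c = 0; c < NUM_CHECKERS && n < out_cap; ++c) {
        const char *name = CHECKERS[c](indices);
        if (name != NULL) {
            out[n++] = name;
        }
    }
    return n;
}
```

Calling `check_all_cases` once per loop nest yields the list of candidate cases to present to the user; adding a case means adding one function to the `CHECKERS` table.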

Cases Example: Tiling for TLB
• Indices
  – no perfect loop nest
  – large body
  – affine loop
  – line array accesses
  – column array accesses
  – no negative component in dependence vectors
  – uniform dependencies
• Cases (the slide's diagram groups the indices above with braces)
  – distribution
  – tiling
  – distribution + tiling
  – skewing
  – skewing + tiling
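Tiling itself can be illustrated on a matrix transpose, where one array is read along rows and the other is written along columns. This is a generic sketch in C (row-major storage), not code produced by the tool; `TILE`, `transpose_naive`, and `transpose_tiled` are assumed names, and the tile size is an arbitrary placeholder that would have to be chosen for the target machine's TLB and caches.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune for the target machine */

/* Reference version: b is written with stride n in the inner loop. */
void transpose_naive(size_t n, const double *a, double *b) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

/* Tiled version: the iteration space is cut into TILE x TILE blocks so that
   each block touches a bounded set of rows of a and rows of b. */
void transpose_tiled(size_t n, const double *a, double *b) {
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp block bounds so n need not be a multiple of TILE. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    b[j * n + i] = a[i * n + j];
        }
    }
}
```

Both functions compute the same result; only the order in which elements are visited differs, which is why the transformation is safe here without any dependence analysis.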

Loop Benchmark
64 loop nests (http://www.netlib.org/benchmark/parallel)
• 44 are compiler friendly
• 40 are improved by KAP
• 13 do not exhibit a case
• 12 exhibit a case
  – 5 parallel loops not parallelized by KAP
  – 1 sorted else if
  – 1 condition on loop index
  – 3 loop nests with loops to merge
  – 2 matrix multiply

Example loop nest (the slide reports 3.3 Mflop and 54.1 Mflop):

          DO 3200 I = 1, NSIZE2
          DO 3170 J = 1, NSIZE1
          IF (B2(J, I) .EQ. 0.0) GO TO 3130
          A2(J, I) = C2(J, I)*B2(J, I)
          GO TO 3170
     3130 CONTINUE
          B2(J, I) = C2(J, I)*A2(J, I)
     3170 CONTINUE
     3200 CONTINUE

An Application Example: DeFT
• A real application
• Gaussian Density Functional Program
• 75863 lines of Fortran code (comments included)
• Two main routines:
  – gridwork: 47.5%, 1015 lines
  – x_annihilate: 29.7%, 269 lines
http://www.ccl.net/cca/software/SOURCES/FORTRAN/DeFT/index.shtml

DeFT Examples
• Examples of cases found:

Matrix Multiplication (Blas)

          do 1029 k = 1, n
          ...
          do 1029 j = istart(myid+1), iend(myid+1)
          do 1029 i = 1, n
     1029 overlap(i, j) = overlap(i, j) + coeff(i, k)*coeff(j, k)

Fusion

          do 1011 k=istart(myid+1), iend(myid+1)
     1011 veci(k)=coeff(k, i)
          do 1012 k=istart(myid+1), iend(myid+1)
     1012 vecj(k)=coeff(k, j)
          do 1013 k=istart(myid+1), iend(myid+1)
     1013 coeff(k, i)=coeff(k, i)+s(i)*(vecj(k)-tau(i)*veci(k))
          do 1014 k=istart(myid+1), iend(myid+1)
     1014 coeff(k, j)=coeff(k, j)-s(i)*(veci(k)+tau(i)*vecj(k))
          do 1015 k=istart(myid+1), iend(myid+1)
     1015 veci(k)=smat(k, i)
          do 1016 k=istart(myid+1), iend(myid+1)
     1016 vecj(k)=smat(k, j)

Parallel loop

          do 1012 i=1, ihits
            ii=iwkvec(i)
            …...
            do 1012 j=1, ihits
              jj=iwkvec(j)
              …...
              do 1015 k=1, npts
     1015     wf(k, ii)=wf(k, ii)+factor*fv(k, jj)
              if((nfunctional.gt.0).and.(ipart.eq.0)) then
                do 1016 k=1, npts
                  wfx(k, ii)=wfx(k, ii)+factor*fvx(k, jj)
                  wfy(k, ii)=wfy(k, ii)+factor*fvy(k, jj)
     1016         wfz(k, ii)=wfz(k, ii)+factor*fvz(k, jj)
              endif
     1012 continue

Results on a 4-processor SGI Onyx
• Sequential: 121 s
• KAP: 140 s
• CAHT: 85 s
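The fusion case merges adjacent loops that run over the same range. The sketch below transcribes the first four loops of the fusion example (labels 1011 to 1014) into C to show the idea. It is an illustration, not the tool's output: the function names are assumed, `ci` and `cj` stand for columns i and j of `coeff`, `s` and `tau` stand for `s(i)` and `tau(i)`, and the Fortran inclusive range becomes a half-open range `[lo, hi)`.

```c
#include <stddef.h>

/* Four separate loops over the same range, as in labels 1011 to 1014. */
void update_unfused(size_t lo, size_t hi, double *ci, double *cj,
                    double *veci, double *vecj, double s, double tau) {
    for (size_t k = lo; k < hi; ++k) veci[k] = ci[k];
    for (size_t k = lo; k < hi; ++k) vecj[k] = cj[k];
    for (size_t k = lo; k < hi; ++k) ci[k] = ci[k] + s * (vecj[k] - tau * veci[k]);
    for (size_t k = lo; k < hi; ++k) cj[k] = cj[k] - s * (veci[k] + tau * vecj[k]);
}

/* Fused version: one pass over k. Every statement reads and writes only
   element k of each array, so keeping the statements in their original
   order inside a single loop body preserves all dependences. */
void update_fused(size_t lo, size_t hi, double *ci, double *cj,
                  double *veci, double *vecj, double s, double tau) {
    for (size_t k = lo; k < hi; ++k) {
        veci[k] = ci[k];
        vecj[k] = cj[k];
        ci[k] = ci[k] + s * (vecj[k] - tau * veci[k]);
        cj[k] = cj[k] - s * (veci[k] + tau * vecj[k]);
    }
}
```

The fused loop pays the loop overhead once and reuses `veci[k]` and `vecj[k]` while they are still in registers or cache, instead of streaming each vector through memory several times.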

Conclusion
• Case-based reasoning provides a promising framework for code tuning
• Tuning the cases may be difficult
  – taking the compiler into account (e.g., unrolling)
  – integration of dynamic data and assembly code properties
  – learning techniques for case tuning