Carnegie Mellon 14-513 / 18-613
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
Code Optimization
15-213/18-213/14-513/15-513/18-613: Introduction to Computer Systems
13th Lecture, October 8, 2019
Rear Admiral Grace Hopper
- Invented first compiler in 1951 (technically it was a linker)
- Coined “compiler” (and “bug”)
- Compiled for Harvard Mark I
- Eventually led to COBOL (which ran the world for years)
- “I decided data processors ought to be able to write their programs in English, and the computers would translate them into machine code”
John Backus
- Led the team at IBM that invented the first commercially available compiler in 1957
- Compiled FORTRAN code for the IBM 704 computer
- FORTRAN is still in use today for high-performance code
- “Much of my work has come from being lazy. I didn't like writing programs, and so, when I was working on the IBM 701, I started work on a programming system to make it easier to write programs”
Fran Allen
- Pioneer of many optimizing compilation techniques
- Wrote a paper simply called “Program Optimization” in 1966
  - “This paper introduced the use of graph-theoretic structures to encode program content in order to automatically and efficiently derive relationships and identify opportunities for optimization”
- First woman to win the ACM Turing Award (the “Nobel Prize of Computer Science”)
Today
- Overview
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Example: Bubblesort
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals
Performance Realities
- There’s more to performance than asymptotic complexity
- Constant factors matter too!
  - Easily see 10:1 performance range depending on how code is written
  - Must optimize at multiple levels: algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How modern processors + memory systems operate
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality
Optimizing Compilers
- Provide efficient mapping of program to machine
  - register allocation
  - code selection and ordering (scheduling)
  - dead code elimination
  - eliminating minor inefficiencies
- Don’t (usually) improve asymptotic efficiency
  - up to programmer to select best overall algorithm
  - big-O savings are (often) more important than constant factors, but constant factors also matter
- Have difficulty overcoming “optimization blockers”
  - potential memory aliasing
  - potential procedure side-effects
Generally Useful Optimizations
- Optimizations that you or the compiler should do regardless of processor / compiler
- Code Motion
  - Reduce frequency with which computation is performed
    - If it will always produce the same result
    - Especially moving code out of a loop

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

  With n*i moved out of the loop (ni is declared long to match n and i):

        long j;
        long ni = n*i;
        for (j = 0; j < n; j++)
            a[ni+j] = b[j];
Compiler-Generated Code Motion (-O1)

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

  is compiled as if it were written:

        long j;
        long ni = n*i;
        double *rowp = a+ni;
        for (j = 0; j < n; j++)
            *rowp++ = b[j];

    set_row:
            testq   %rcx, %rcx              # Test n
            jle     .L1                     # If <= 0, goto done
            imulq   %rcx, %rdx              # ni = n*i
            leaq    (%rdi,%rdx,8), %rdx     # rowp = A + ni*8
            movl    $0, %eax                # j = 0
    .L3:                                    # loop:
            movsd   (%rsi,%rax,8), %xmm0    # t = b[j]
            movsd   %xmm0, (%rdx,%rax,8)    # M[A+ni*8 + j*8] = t
            addq    $1, %rax                # j++
            cmpq    %rcx, %rax              # j:n
            jne     .L3                     # if !=, goto loop
    .L1:                                    # done:
            rep ; ret
Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide

      16*x  -->  x << 4

  - Utility is machine dependent
  - Depends on cost of multiply or divide instruction
    - Intel Nehalem: integer multiply takes 3 CPU cycles, add is 1 cycle [1]
- Recognize sequence of products

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

  becomes

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }

[1] https://www.agner.org/optimize/instruction_tables.pdf
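The shift-for-multiply rewrite above can be checked directly. The following is a minimal sketch, not taken from the slides; the function names (times16_mul, times16_shift, div16_div, div16_shift) are hypothetical. It restricts itself to unsigned operands, where multiply and divide by 16 match the shifts exactly, wraparound included.

```c
#include <stdint.h>

/* Multiply by 16: on unsigned values the product and the shift agree,
   including when the result wraps modulo 2^32. */
uint32_t times16_mul(uint32_t x)   { return x * 16u; }
uint32_t times16_shift(uint32_t x) { return x << 4; }

/* Divide by 16: on unsigned values the quotient and the shift agree. */
uint32_t div16_div(uint32_t x)   { return x / 16u; }
uint32_t div16_shift(uint32_t x) { return x >> 4; }
```

Signed division is the case to watch. In C, -1 / 16 is 0 because division truncates toward zero, while right-shifting a negative value is implementation-defined and commonly yields -1. A compiler that strength-reduces signed division therefore has to emit extra correction code.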
Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with -O1

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

  3 multiplications: i*n, (i-1)*n, (i+1)*n

    leaq    1(%rsi), %rax       # i+1
    leaq    -1(%rsi), %r8       # i-1
    imulq   %rcx, %rsi          # i*n
    imulq   %rcx, %rax          # (i+1)*n
    imulq   %rcx, %r8           # (i-1)*n
    addq    %rdx, %rsi          # i*n+j
    addq    %rdx, %rax          # (i+1)*n+j
    addq    %rdx, %r8           # (i-1)*n+j
    ...

  With the shared subexpression computed once:

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

  1 multiplication: i*n

    imulq   %rcx, %rsi          # i*n
    addq    %rdx, %rsi          # i*n+j
    movq    %rsi, %rax          # i*n+j
    subq    %rcx, %rax          # i*n+j-n
    leaq    (%rsi,%rcx), %rcx   # i*n+j+n
    ...
Optimization Example: Bubblesort
- Bubblesort program that sorts an array A that is allocated in static storage:
  - an element of A requires four bytes of a byte-addressed machine
  - elements of A are numbered 1 through n (n is a variable)
  - A[j] is in location &A+4*(j-1)

    for (i = n-1; i >= 1; i--) {
        for (j = 1; j <= i; j++)
            if (A[j] > A[j+1]) {
                temp = A[j];
                A[j] = A[j+1];
                A[j+1] = temp;
            }
    }
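The loop above numbers elements 1 through n, which is not how C arrays are indexed. A minimal sketch of the same algorithm, assuming a 0-based int array and a hypothetical function name bubble_sort, shifts the inner loop bounds down by one and leaves everything else unchanged:

```c
/* Bubblesort on a 0-based array of n ints.
   The slide's inner loop runs j = 1..i over A[1..n];
   here it runs j = 0..i-1 over A[0..n-1]. */
void bubble_sort(int *A, long n)
{
    for (long i = n - 1; i >= 1; i--) {
        for (long j = 0; j < i; j++) {
            if (A[j] > A[j + 1]) {
                int temp = A[j];
                A[j] = A[j + 1];
                A[j + 1] = temp;
            }
        }
    }
}
```

For n of 0 or 1 the outer loop does not execute, so the function is safe on empty and single-element arrays.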
Translated (Pseudo) Code

    for (i = n-1; i >= 1; i--) {
        for (j = 1; j <= i; j++)
            if (A[j] > A[j+1]) {
                temp = A[j];
                A[j] = A[j+1];
                A[j+1] = temp;
            }
    }

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t4 := j+1
        t5 := t4-1
        t6 := 4*t5
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        t8 := j-1
        t9 := 4*t8
        temp := A[t9]        // temp:=A[j]
        t10 := j+1
        t11 := t10-1
        t12 := 4*t11
        t13 := A[t12]        // A[j+1]
        t14 := j-1
        t15 := 4*t14
        A[t15] := t13        // A[j]:=A[j+1]
        t16 := j+1
        t17 := t16-1
        t18 := 4*t17
        A[t18] := temp       // A[j+1]:=temp
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:

Instructions: 29 in outer loop, 25 in inner loop
Redundancy in Address Calculation

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t4 := j+1
        t5 := t4-1
        t6 := 4*t5
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        t8 := j-1
        t9 := 4*t8
        temp := A[t9]        // temp:=A[j]
        t10 := j+1
        t11 := t10-1
        t12 := 4*t11
        t13 := A[t12]        // A[j+1]
        t14 := j-1
        t15 := 4*t14
        A[t15] := t13        // A[j]:=A[j+1]
        t16 := j+1
        t17 := t16-1
        t18 := 4*t17
        A[t18] := temp       // A[j+1]:=temp
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:
Redundancy Removed

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t6 := 4*j
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        t8 := j-1
        t9 := 4*t8
        temp := A[t9]        // temp:=A[j]
        t12 := 4*j
        t13 := A[t12]        // A[j+1]
        A[t9] := t13         // A[j]:=A[j+1]
        A[t12] := temp       // A[j+1]:=temp
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:

Instructions: 20 in outer loop, 16 in inner loop
More Redundancy

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t6 := 4*j
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        t8 := j-1
        t9 := 4*t8
        temp := A[t9]        // temp:=A[j]
        t12 := 4*j
        t13 := A[t12]        // A[j+1]
        A[t9] := t13         // A[j]:=A[j+1]
        A[t12] := temp       // A[j+1]:=temp
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:
Redundancy Removed

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // old_A[j]
        t6 := 4*j
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        A[t2] := t7          // A[j]:=A[j+1]
        A[t6] := t3          // A[j+1]:=old_A[j]
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:

Instructions: 15 in outer loop, 11 in inner loop
Redundancy in Loops

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t6 := 4*j
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        A[t2] := t7
        A[t6] := t3
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:
Redundancy Eliminated

  Before:

        i := n-1
    L5: if i<1 goto L1
        j := 1
    L4: if j>i goto L2
        t1 := j-1
        t2 := 4*t1
        t3 := A[t2]          // A[j]
        t6 := 4*j
        t7 := A[t6]          // A[j+1]
        if t3<=t7 goto L3
        A[t2] := t7
        A[t6] := t3
    L3: j := j+1
        goto L4
    L2: i := i-1
        goto L5
    L1:

  After:

        i := n-1
    L5: if i<1 goto L1
        t2 := 0
        t6 := 4
        t19 := 4*i
    L4: if t6>t19 goto L2
        t3 := A[t2]
        t7 := A[t6]
        if t3<=t7 goto L3
        A[t2] := t7
        A[t6] := t3
    L3: t2 := t2+4
        t6 := t6+4
        goto L4
    L2: i := i-1
        goto L5
    L1:
Final Pseudo Code

        i := n-1
    L5: if i<1 goto L1
        t2 := 0
        t6 := 4
        t19 := i << 2
    L4: if t6>t19 goto L2
        t3 := A[t2]
        t7 := A[t6]
        if t3<=t7 goto L3
        A[t2] := t7
        A[t6] := t3
    L3: t2 := t2+4
        t6 := t6+4
        goto L4
    L2: i := i-1
        goto L5
    L1:

Instruction count before optimizations: 29 in outer loop, 25 in inner loop
Instruction count after optimizations: 15 in outer loop, 9 in inner loop

- These were Machine-Independent Optimizations.
- Will be followed by Machine-Dependent Optimizations, including allocating temporaries to registers, converting to assembly code
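The final pseudo code maps naturally onto pointer-stepping C. The sketch below is one possible rendering, not the lecture's own code: the function name bubble_sort_ptr is hypothetical, the array is assumed to be 0-based, and the pointers p, q, and last stand in for the byte offsets t2, t6, and t19.

```c
/* Bubblesort in the shape of the optimized pseudo code:
   no index arithmetic inside the inner loop, each element loaded once. */
void bubble_sort_ptr(int *A, long n)
{
    for (long i = n - 1; i >= 1; i--) {
        int *p = A;          /* role of t2: address of the left element   */
        int *q = A + 1;      /* role of t6: address of the right element  */
        int *last = A + i;   /* role of t19: final position reached by q  */
        while (q <= last) {
            int t3 = *p;     /* left element, loaded once  */
            int t7 = *q;     /* right element, loaded once */
            if (t3 > t7) {   /* swap using the values already loaded */
                *p = t7;
                *q = t3;
            }
            p++;             /* t2 := t2+4 */
            q++;             /* t6 := t6+4 */
        }
    }
}
```

Stepping a pointer by one int corresponds to adding 4 to a byte offset on the slide's assumed machine, so the multiply by 4 disappears from the inner loop just as it does in the pseudo code.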
Limitations of Optimizing Compilers
- Operate under fundamental constraint
  - Must not cause any change in program behavior
  - Often prevents optimizations that affect only “edge case” behavior
- Behavior obvious to the programmer is not obvious to compiler
  - e.g., data range may be more limited than types suggest (short vs. int)
- Most analysis is only within a procedure
  - Whole-program analysis is usually too expensive
  - Sometimes compiler does interprocedural analysis within a file (new GCC)
- Most analysis is based only on static information
  - Compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative
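One commonly cited illustration of an “edge case” that blocks an otherwise attractive rewrite is floating-point reassociation. It is not taken from this slide, and the function names sum_left and sum_right are hypothetical. The two functions below are algebraically identical, yet they return different values for some inputs, so a compiler that must preserve behavior cannot swap one for the other by default.

```c
/* Two groupings of the same three-term sum. Floating-point addition is
   not associative, so these can differ when magnitudes are far apart. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e20, b = -1e20, c = 3.14, the left grouping cancels the large terms first and returns 3.14. The right grouping absorbs 3.14 into -1e20 before the cancellation and returns 0.0.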
Optimization Blocker #1: Procedure Calls
- Procedure to Convert String to Lower Case

    void lower(char *s)
    {
        size_t i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

  - Extracted from 213 lab submissions, Fall, 1998
Lower Case Conversion Performance
- Time quadruples when the string length doubles
- Quadratic performance

[Figure: CPU seconds (0 to 250) versus string length for lower1]
Convert Loop To Goto Form

    void lower(char *s)
    {
        size_t i = 0;
        if (i >= strlen(s))
            goto done;
     loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))
            goto loop;
     done:
        return;
    }

- strlen executed every iteration
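Since the loop test calls strlen on every iteration, one way to avoid the repeated work is to apply code motion by hand. This is a minimal sketch under a hypothetical name, lower_hoisted. It relies on the observation that the loop body never writes or removes a terminating '\0', so the length cannot change while the loop runs.

```c
#include <stddef.h>
#include <string.h>

/* Lower-case conversion with the length computed once, before the loop. */
void lower_hoisted(char *s)
{
    size_t len = strlen(s);          /* single call instead of one per iteration */
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');     /* 'A' - 'a' is negative, so this adds 32 in ASCII */
}
```

The programmer can make this change safely because they know what strlen does. A compiler generally cannot, since it has to assume the call might have side effects or depend on the characters the loop is modifying.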
Carnegie Mellon Calling Strlen /* My version of strlen */ size_t strlen(const char *s) { size_t length = 0; while (*s != '