Machine Independent Optimizations
Topics
- Code motion
- Reduction in strength
- Common subexpression sharing
Great Reality
There’s more to performance than asymptotic complexity
Constant factors matter too!
- Easily see 10:1 performance range depending on how code is written
- Must optimize at multiple levels:
  - algorithm, data representations, procedures, and loops
Must understand system to optimize performance
- How programs are compiled and executed
- How to measure program performance and identify bottlenecks
- How to improve performance without destroying code modularity and generality
Machine-Independent Optimizations
- Optimizations you should do regardless of processor / compiler
Code Motion
- Reduce frequency with which computation is performed
  - If it will always produce the same result
  - Especially moving code out of a loop

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
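One way the loop on this slide might look after code motion is sketched here. The function names `copy_rows_naive` and `copy_rows_hoisted` and the temporary `ni` are illustrative assumptions, not names from the slides; the only change is that `n*i` is computed once per outer iteration.

```c
/* Original form: n*i is recomputed on every inner iteration. */
void copy_rows_naive(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* After code motion: n*i does not depend on j, so it is
   computed once per outer iteration and reused. */
void copy_rows_hoisted(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        int ni = n * i;   /* hypothetical temporary holding the invariant */
        for (int j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
}
```

Both versions fill every row of the n-by-n array `a` with a copy of `b`; only the number of multiplications differs.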
Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide: 16*x --> x << 4
  - Utility is machine dependent
  - Depends on cost of multiply or divide instruction
  - On Pentium II or III, integer multiply only requires 4 CPU cycles
- Recognize sequence of products

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
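The "sequence of products" n*0, n*1, n*2, ... can be produced by repeated addition. The sketch below assumes a running variable `ni` and the function names `copy_rows_reduced` and `times16`, none of which appear in the slides.

```c
/* The products n*0, n*1, n*2, ... are generated by adding n
   after each row instead of multiplying on each row. */
void copy_rows_reduced(int *a, const int *b, int n)
{
    int ni = 0;               /* equals n*i at the top of each outer iteration */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;              /* addition replaces multiplication */
    }
}

/* Multiplication by a power of two written as a shift.
   Unsigned is used so the shift is well defined for all inputs. */
unsigned times16(unsigned x)
{
    return x << 4;
}
```

Whether the shift is actually faster depends on the machine, as the slide notes; many compilers make this substitution on their own.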
Make Use of Registers
- Reading and writing registers is much faster than reading/writing memory
Limitation
- Compiler is not always able to determine whether a variable can be held in a register
- Possibility of aliasing
Machine-Independent Opts. (Cont.)
Share Common Subexpressions
- Reuse portions of expressions
- Compilers often not very sophisticated in exploiting arithmetic properties

    /* Sum neighbors of i, j */
    up    = val[(i-1)*n + j];
    down  = val[(i+1)*n + j];
    left  = val[i*n + j-1];
    right = val[i*n + j+1];
    sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n
Sharing the common subexpression i*n + j reduces this to 1 multiplication.
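A minimal sketch of the one-multiplication form follows. The function wrapper `sum_neighbors` and the temporary `inj` are assumptions made for illustration; the slide only shows the statements themselves.

```c
/* Sum the four neighbors of element (i, j) in an n-by-n array
   stored in row-major order. The caller must keep (i, j) away
   from the border so all four neighbors exist. */
int sum_neighbors(const int *val, int n, int i, int j)
{
    int inj   = i*n + j;        /* the single multiplication */
    int up    = val[inj - n];   /* same element as (i-1)*n + j */
    int down  = val[inj + n];   /* same element as (i+1)*n + j */
    int left  = val[inj - 1];
    int right = val[inj + 1];
    return up + down + left + right;
}
```

The rewrite relies on the identity (i-1)*n + j = (i*n + j) - n, which is the kind of arithmetic property the slide says compilers often fail to exploit.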
Vector ADT
Assume a vector ADT with two fields, length and data, where data holds elements 0, 1, 2, ..., length-1.
Procedures
- vec_ptr new_vec(int len)
  - Create vector of specified length
- int get_vec_element(vec_ptr v, int index, int *dest)
  - Retrieve vector element, store at *dest
  - Return 0 if out of bounds, 1 if successful
- int *get_vec_start(vec_ptr v)
  - Return pointer to start of vector data
- Similar to array implementations in Java
  - E.g., always do bounds checking
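The slides give only the interface, so the following is one possible implementation, not the course's actual code. The struct layout, the type name `vec_rec`, the field names `len` and `data`, and the zero-initialized storage are all assumptions; `vec_length` is included because the later combine examples call it.

```c
#include <stdlib.h>

/* Hypothetical layout: a length plus a pointer to the element storage. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Create a vector of the given length with all elements zero.
   Returns NULL on a negative length or allocation failure. */
vec_ptr new_vec(int len)
{
    if (len < 0)
        return NULL;
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(int));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Return the number of elements. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Bounds-checked read: returns 0 if index is out of bounds, 1 on success. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return a pointer to the start of the element storage. */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}
```

The bounds check in `get_vec_element` runs on every access, which is the cost the later slides remove by reading through `get_vec_start` directly.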
Optimization Example

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

Procedure
- Compute sum of all elements of vector
- Store result at destination location
Reminder: Cycles Per Element
- Convenient way to express performance of a program that operates on vectors or lists
- Length = n
- T = CPE*n + Overhead
[Figure: T as a function of n; vsum1 slope = 4.0, vsum2 slope = 3.5]
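Because T = CPE*n + Overhead is a straight line, CPE can be estimated as the slope between two measurements. The sketch below assumes two hypothetical helper functions, `estimate_cpe` and `estimate_overhead`; the slides do not say how the measurements were actually fitted.

```c
/* Estimate CPE as the slope between two (length, cycles) measurements.
   The two lengths n1 and n2 must differ. */
double estimate_cpe(double n1, double t1, double n2, double t2)
{
    return (t2 - t1) / (n2 - n1);
}

/* Overhead is what remains of one measurement after removing
   the per-element cost. */
double estimate_overhead(double n, double t, double cpe)
{
    return t - cpe * n;
}
```

For example, if a routine took 480 cycles for 100 elements and 880 cycles for 200 elements (made-up numbers), the slope is 400/100 = 4.0 cycles per element and the overhead is 480 - 4.0*100 = 80 cycles. With real timings, a least-squares fit over many lengths would be less sensitive to noise than two points.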
Optimization Example (Cont.)
Procedure
- Compute sum of all elements of integer vector
- Store result at destination location
- Vector data structure and operations defined via abstract data type
Pentium II/III Performance: Clock Cycles / Element
- 42.06 (Compiled -g)
- 31.25 (Compiled -O2)
Loop Invariant Code Motion

    void combine2(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = 0;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

Optimization
- Move call to vec_length out of the loop, since its value does not change from one iteration to the next
- CPE: 20.66 (Compiled -O2)
  - vec_length requires only constant time, but significant overhead
Code Motion Example #2
Procedure to Convert String to Lower Case

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
Lower Case Conversion Performance
- Time quadruples when string length doubles
- Quadratic performance
Improving Performance
- Move call to strlen outside of loop
- Since result does not change from one iteration to another
- Form of code motion

    void lower(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
Lower Case Conversion Performance
- Time doubles when string length doubles
- Linear performance
Optimization Blocker: Procedure Calls
Why couldn’t the compiler move vec_length or strlen out of the inner loop?
- Procedure may have side effects
  - Alters global state each time called
- Function may not return same value for given arguments
  - Depends on other parts of global state
Why doesn’t the compiler look at the code for vec_length or strlen?
- Linker may overload with different version
  - Unless declared static
- Interprocedural optimization is not used extensively due to cost
Warning:
- Compiler treats procedure call as a black box
- Weak optimizations in and around them
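To make the side-effect concern concrete, here is a sketch of a length function that the compiler could not safely hoist. Everything in it is hypothetical: `counting_strlen`, the global `call_count`, and the two driver functions are invented for illustration and are not part of the standard library or the slides.

```c
#include <stddef.h>

/* Hypothetical global state that a length function might modify. */
static long call_count = 0;

/* Behaves like strlen, but also records each call. */
size_t counting_strlen(const char *s)
{
    size_t len = 0;
    call_count++;                 /* side effect visible outside the function */
    while (s[len] != '\0')
        len++;
    return len;
}

/* Evaluates the length in the loop test, once per test. */
long calls_with_length_in_loop(const char *s)
{
    call_count = 0;
    for (size_t i = 0; i < counting_strlen(s); i++)
        ;                         /* empty body: only the calls matter here */
    return call_count;
}

/* Evaluates the length once, as a hoisted version would. */
long calls_with_length_hoisted(const char *s)
{
    call_count = 0;
    size_t len = counting_strlen(s);
    for (size_t i = 0; i < len; i++)
        ;
    return call_count;
}
```

For a three-character string the first driver makes four calls (one per loop test) and the second makes one, so the two programs leave `call_count` in different states. A compiler that treats the call as a black box has to assume any function might behave this way.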
Replace Function Calls with Direct Access

    void combine3(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = 0;
        for (i = 0; i < length; i++)
            *dest += data[i];
    }

Optimization
- CPE: 6.00 (Compiled -O2)
  - Procedure calls are expensive!
  - Bounds checking is expensive
Eliminate Unneeded Memory Refs

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = 0;
        for (i = 0; i < length; i++)
            sum += data[i];
        *dest = sum;
    }

Optimization
- Don’t need to store in destination until end
- Local variable sum, called the “accumulator variable”, held in register
- Avoids 1 memory read, 1 memory write per iteration
- CPE: 2.00 (Compiled -O2)
  - Memory references are expensive!
Detecting Unneeded Memory Refs.

Combine3

    .L18:
        movl (%ecx,%edx,4),%eax
        addl %eax,(%edi)
        incl %edx
        cmpl %esi,%edx
        jl .L18

Combine4

    .L24:
        addl (%eax,%edx,4),%ecx
        incl %edx
        cmpl %esi,%edx
        jl .L24
Optimization Blocker: Memory Aliasing
- Two different memory references specify a single location
Observations
- Easy to have happen in C
  - Since allowed to do address arithmetic
  - Direct access to storage structures
- Get in habit of introducing local variables
  - Accumulating within loops
  - Your way of telling compiler not to check for aliasing
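A small sketch of why the compiler cannot make this transformation on its own: when the destination points into the data being summed, accumulating in memory and accumulating in a local variable give different answers. The function names `sum_to_memory` and `sum_to_local` are assumptions; they mirror the combine3 and combine4 loops but work on a plain array so the example stands alone.

```c
/* Accumulates directly in *dest, as combine3 does. */
void sum_to_memory(int *data, int length, int *dest)
{
    *dest = 0;
    for (int i = 0; i < length; i++)
        *dest += data[i];   /* each store may change data[] if dest aliases it */
}

/* Accumulates in a local variable and stores once, as combine4 does. */
void sum_to_local(int *data, int length, int *dest)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

With the array {3, 2, 17} and `dest` pointing at the last element, `sum_to_memory` first zeroes that element and then leaves 10 in it (0+3, then +2, then the element added to itself), while `sum_to_local` leaves 22. When `dest` does not alias the data, both produce 22. Because the results can differ, the compiler must keep the memory accesses unless the programmer introduces the local variable.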
Machine-Independent Opt. Summary
Code Motion
- Compilers are good at this for simple loop/array structures
- Don’t do well in presence of procedure calls and memory aliasing
Reduction in Strength
- Shift, add instead of multiply or divide
  - Compilers are (generally) good at this
  - Exact trade-offs machine-dependent
- Keep data in registers rather than memory
  - Compilers are not good at this, since concerned with aliasing
Share Common Subexpressions
- Compilers have limited algebraic reasoning capabilities