Systems I Code Optimization II Machine Independent Optimizations

Vector ADT length data 0 1 2 length– 1 Procedures vec_ptr new_vec(int len) l

$Optimization Example void combine 1(vec_ptr v, int *dest) { int i; *dest = 0;$

$Reduction in Strength void combine 2(vec_ptr v, int *dest) { int i; int length$

$Eliminate Unneeded Memory Refs void combine 3(vec_ptr v, int *dest) { int i; int$

Detecting Unneeded Memory Refs. Combine 2. L 18: Combine 3. L 24: movl (%ecx,

Optimization Blocker: Memory Aliasing n Two different memory references specify single location Example v:

$Previous Best Combining Code void combine 4(vec_ptr v, int *dest) { int i; int$

$General Forms of Combining void abstract_combine 4(vec_ptr v, data_t *dest) { int i; int$

Machine Independent Opt. Results Optimizations n Reduce function calls and memory references within loop

$Pointer Code void combine 4 p(vec_ptr v, int *dest) { int length = vec_length(v);$

Pointer vs. Array Code Inner Loops Array Code. L 24: addl (%eax, %edx, 4),

Machine-Independent Opt. Summary Code Motion n Compilers are good at this for simple loop/array

Important Tools Measurement n Accurately compute time taken by code l Most modern machines

Code Profiling Example Task n Count word frequencies in text document n Produce sorted

Code Profiling Augment Executable Program with Timing Functions n n Computes (approximate) amount of

Profiling Results % cumulative time seconds 86. 60 8. 21 5. 80 8. 76

Code Optimizations n n First step: Use more efficient sorting function Library function qsort

Further Optimizations n Iter first: Use iterative function to insert elements into linked list

Profiling Observations Benefits n Helps identify performance bottlenecks n Especially useful when have complex

Role of Programmer How should I write my programs, given that I have a

Summary Today n Optimization blocker: procedure calls n Optimization blocker: memory aliasing Tools (profiling)

Slides: 22

Download presentation

Systems I Code Optimization II: Machine Independent Optimizations Topics n Machine-Independent Optimizations l Code motion l Reduction in strength l Common subexpression sharing n Tuning l Identifying performance bottlenecks

Vector ADT length data 0 1 2 length– 1 Procedures vec_ptr new_vec(int len) l Create vector of specified length int get_vec_element(vec_ptr v, int index, int *dest) l Retrieve vector element, store at *dest l Return 0 if out of bounds, 1 if successful int *get_vec_start(vec_ptr v) l Return pointer to start of vector data n Similar to array implementations in Pascal, ML, Java l E. g. , always do bounds checking 2

$Optimization Example void combine 1(vec_ptr v, int *dest) { int i; *dest = 0;$

Optimization Example void combine 1(vec_ptr v, int *dest) { int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } Procedure n n n Compute sum of all elements of integer vector Store result at destination location Vector data structure and operations defined via abstract data type Pentium II/III Performance: Clock Cycles / Element n 42. 06 (Compiled -g) 31. 25 (Compiled -O 2) 3

$Reduction in Strength void combine 2(vec_ptr v, int *dest) { int i; int length$

Reduction in Strength void combine 2(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } Optimization n Avoid procedure call to retrieve each vector element l Get pointer to start of array before loop l Within loop just do pointer reference l Not as clean in terms of data abstraction n CPE: 6. 00 (Compiled -O 2) l Procedure calls are expensive! l Bounds checking is expensive 4

$Eliminate Unneeded Memory Refs void combine 3(vec_ptr v, int *dest) { int i; int$

Eliminate Unneeded Memory Refs void combine 3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } Optimization Don’t need to store in destination until end n Local variable sum held in register n Avoids 1 memory read, 1 memory write per cycle n CPE: 2. 00 (Compiled -O 2) n l Memory references are expensive! 5

Detecting Unneeded Memory Refs. Combine 2. L 18: Combine 3. L 24: movl (%ecx, %edx, 4), %eax addl %eax, (%edi) incl %edx cmpl %esi, %edx jl. L 18 addl (%eax, %edx, 4), %ecx incl %edx cmpl %esi, %edx jl. L 24 Performance n Combine 2 l 5 instructions in 6 clock cycles l addl must read and write memory n Combine 3 l 4 instructions in 2 clock cycles 6

Optimization Blocker: Memory Aliasing n Two different memory references specify single location Example v: [3, 2, 17] n combine 2(v, get_vec_start(v)+2) --> n combine 3(v, get_vec_start(v)+2) --> n ? ? Observations n Easy to have happen in C l Since allowed to do address arithmetic l Direct access to storage structures n Get in habit of introducing local variables l Accumulating within loops l Your way of telling compiler not to check for aliasing 7

$Previous Best Combining Code void combine 4(vec_ptr v, int *dest) { int i; int$

Previous Best Combining Code void combine 4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } Task n Compute sum of all elements in vector Vector represented by C-style abstract data type n Achieved CPE of 2. 00 n l Cycles per element 8

$General Forms of Combining void abstract_combine 4(vec_ptr v, data_t *dest) { int i; int$

General Forms of Combining void abstract_combine 4(vec_ptr v, data_t *dest) { int i; int length = vec_length(v); data_t *data = get_vec_start(v); data_t t = IDENT; for (i = 0; i < length; i++) t = t OP data[i]; *dest = t; } Data Types n n Use different declarations for data_t int float double Operations n n n Use different definitions of OP and IDENT + / 0 * / 1 9

Machine Independent Opt. Results Optimizations n Reduce function calls and memory references within loop Performance Anomaly n n n Computing FP product of all elements exceptionally slow. Very large speedup when accumulate in temporary Caused by quirk of IA 32 floating point l Memory uses 64 -bit format, register use 80 l Benchmark data caused overflow of 64 bits, but not 80 10

$Pointer Code void combine 4 p(vec_ptr v, int *dest) { int length = vec_length(v);$

Pointer Code void combine 4 p(vec_ptr v, int *dest) { int length = vec_length(v); int *data = get_vec_start(v); int *dend = data+length; int sum = 0; while (data < dend) { sum += *data; data++; } *dest = sum; } Optimization n n Use pointers rather than array references CPE: 3. 00 (Compiled -O 2) l Oops! We’re not making progress here! Warning: Some compilers do better job optimizing array code 11

Pointer vs. Array Code Inner Loops Array Code. L 24: addl (%eax, %edx, 4), %ecx incl %edx cmpl %esi, %edx jl. L 24 # # # Loop: sum += data[i] i++ i: length if < goto Loop Pointer Code. L 30: addl (%eax), %ecx addl $4, %eax cmpl %edx, %eax jb. L 30 # Loop: # sum += *data # data ++ # data: dend # if < goto Loop Performance n n Array Code: 4 instructions in 2 clock cycles Pointer Code: Almost same 4 instructions in 3 clock cycles 12

Machine-Independent Opt. Summary Code Motion n Compilers are good at this for simple loop/array structures n Don’t do well in presence of procedure calls and memory aliasing Reduction in Strength n Shift, add instead of multiply or divide l compilers are (generally) good at this l Exact trade-offs machine-dependent n Keep data in registers rather than memory l compilers are not good at this, since concerned with aliasing Share Common Subexpressions n compilers have limited algebraic reasoning capabilities 13

Important Tools Measurement n Accurately compute time taken by code l Most modern machines have built in cycle counters l Using them to get reliable measurements is tricky n Profile procedure calling frequencies l Unix tool gprof Observation n Generating assembly code l Lets you see what optimizations compiler can make l Understand capabilities/limitations of particular compiler 14

Code Profiling Example Task n Count word frequencies in text document n Produce sorted list of words from most frequent to least Steps n n Convert strings to lowercase Apply hash function Read words and insert into hash table 29, 801 the 27, 529 and l Mostly list operations 21, 029 I l Maintain counter for each unique word 20, 957 to 18, 514 of 15, 370 a 14010 you 12, 936 my 11, 722 in 11, 519 that Sort results Data Set n n n Shakespeare’s most frequent words Collected works of Shakespeare 946, 596 total words, 26, 596 unique Initial implementation: 9. 2 seconds 15

Code Profiling Augment Executable Program with Timing Functions n n Computes (approximate) amount of time spent in each function Time computation method l Periodically (~ every 10 ms) interrupt program l Determine what function is currently executing l Increment its timer by interval (e. g. , 10 ms) n Also maintains counter for each function indicating number of times called Using gcc –O 2 –pg prog. c –o prog. /prog l Executes in normal fashion, but also generates file gmon. out gprof prog l Generates profile information based on gmon. out 16

Profiling Results % cumulative time seconds 86. 60 8. 21 5. 80 8. 76 4. 75 9. 21 1. 27 9. 33 self seconds 8. 21 0. 55 0. 45 0. 12 calls 1 946596 self ms/call 8210. 00 total ms/call 8210. 00 name sort_words lower 1 find_ele_rec h_add Call Statistics n Number of calls and cumulative time for each function Performance Limiter n n Using inefficient sorting algorithm Single call uses 87% of CPU time 17

Code Optimizations n n First step: Use more efficient sorting function Library function qsort 18

Further Optimizations n Iter first: Use iterative function to insert elements into linked list l Causes code to slow down n Iter last: Iterative function, places new entry at end of list l Tend to place most common words at front of list n n n Big table: Increase number of hash buckets Better hash: Use more sophisticated hash function Linear lower: Move strlen out of loop 19

Profiling Observations Benefits n Helps identify performance bottlenecks n Especially useful when have complex system with many components Limitations n n Only shows performance for data tested E. g. , linear lower did not show big gain, since words are short l Quadratic inefficiency could remain lurking in code n Timing mechanism fairly crude l Only works for programs that run for > 3 seconds 20

Role of Programmer How should I write my programs, given that I have a good, optimizing compiler? Don’t: Smash Code into Oblivion n Hard to read, maintain, & assure correctness n Select best algorithm Write code that’s readable & maintainable Do: n l Procedures, recursion, without built-in constant limits l Even though these factors can slow down code n Eliminate optimization blockers l Allows compiler to do its job Focus on Inner Loops n n Do detailed optimizations where code will be executed repeatedly Will get most performance gain here 21

Summary Today n Optimization blocker: procedure calls n Optimization blocker: memory aliasing Tools (profiling) for understanding performance n Next time n Memory system optimization 22