Code Optimization 1
Outline
• Optimization Blockers
  – Memory aliasing
  – Side effects in function calls
• Understanding Modern Processors
  – Superscalar
  – Out-of-order execution
• More Code Optimization Techniques
• Performance Tuning
• Suggested reading: 5.1, 5.7 ~ 5.16
5.1 Capabilities and Limitations of Optimizing Compilers
Review: 5.3 Program Example, 5.4 Eliminating Loop Inefficiencies, 5.5 Reducing Procedure Calls, 5.6 Eliminating Unneeded Memory References
Example P387

void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
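The vec_ptr abstraction above comes from the book's vector package and is not defined on these slides. A minimal runnable sketch, assuming a simple length-plus-array representation (the struct layout and new_vec are assumptions; vec_length, get_vec_start, and get_vec_element follow the slides), with combine1 specialized to integer sum (IDENT = 0, OPER = +):

```c
#include <stdlib.h>

/* Hypothetical minimal version of the book's vector abstraction */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    v->len = len;
    v->data = calloc(len, sizeof(int));
    return v;
}

int vec_length(vec_ptr v) { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Returns 0 for an out-of-bounds index, 1 on success */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* combine1 from the slide, specialized to sum (IDENT = 0, OPER = +) */
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest + val;
    }
}
```

Note the call to vec_length inside the loop test and the write to *dest on every iteration; the later combine versions remove exactly these costs.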
Example P388

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
Example P392

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}
Example P394

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
Machine-Independent Opt. Results
• Optimizations
  – Reduce function calls and memory references within the loop
Machine-Independent Opt. Results
Combine1 (P385), Combine2 (P388), Combine3 (P392), Combine4 (P394)
• Performance Anomaly
  – Computing the FP product of all elements is exceptionally slow
  – Very large speedup when accumulating in a temporary
  – Memory uses a 64-bit format, registers use 80 bits
  – Benchmark data caused overflow of 64 bits, but not 80
Optimization Blockers P394

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Optimization Blocker: Memory Aliasing P394
• Aliasing
  – Two different memory references specify a single location
• Example
  – v: [3, 2, 17]
  – combine3(v, get_vec_start(v)+2) --> ?
  – combine4(v, get_vec_start(v)+2) --> ?
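The two "?" results differ precisely because of aliasing. A sketch with plain arrays makes this concrete (sum version, IDENT = 0; combine3_style and combine4_style are hypothetical stand-ins for the loop bodies of the slide's combine3 and combine4):

```c
/* combine3-style: writes *dest on every iteration, so when dest aliases
   an element of data, later iterations read the partially updated value */
void combine3_style(int *data, int n, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest = *dest + data[i];
}

/* combine4-style: accumulates in a local, writes *dest once at the end */
void combine4_style(int *data, int n, int *dest)
{
    int i;
    int x = 0;
    for (i = 0; i < n; i++)
        x = x + data[i];
    *dest = x;
}
```

With data = [3, 2, 17] and dest aliased to the last element, the combine3-style loop zeroes data[2] first and then reads its own partial sums, yielding 10, while the combine4-style loop yields the expected 22. This is why the compiler cannot transform one into the other on its own.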
Optimization Blocker: Memory Aliasing
• Observations
  – Easy to have happen in C
    • Address arithmetic is allowed
    • Direct access to storage structures
  – Get in the habit of introducing local variables
    • Accumulate within loops
    • Your way of telling the compiler it need not check for aliasing
Optimizing Compilers
• Provide efficient mapping of program to machine
  – register allocation
  – code selection and ordering
  – eliminating minor inefficiencies
Optimizing Compilers
• Don't (usually) improve asymptotic efficiency
  – up to the programmer to select the best overall algorithm
  – big-O savings are (often) more important than constant factors
    • but constant factors also matter
• Have difficulty overcoming "optimization blockers"
  – potential memory aliasing
  – potential procedure side effects
Limitations of Optimizing Compilers
• Operate under a fundamental constraint
  – Must not cause any change in program behavior under any possible condition
  – This often prevents optimizations that would only affect behavior under pathological conditions
Limitations of Optimizing Compilers
• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  – e.g., data ranges may be more limited than variable types suggest
    • e.g., using an "int" in C for what could be an enumerated type
Limitations of Optimizing Compilers
• Most analysis is performed only within procedures
  – whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  – compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative
Optimization Blockers P380
• Memory aliasing

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2* *yp;
}
Optimization Blockers P381
• Function call and side effect

int f(int);

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    return 4*f(x);
}
Optimization Blockers P381
• Function call and side effect

int counter = 0;

int f(int x)
{
    return counter++;
}
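Putting the pieces together as a runnable sketch: with this f, func1 sums four successive counter values while func2 multiplies a single one by 4, so the two functions return different results and the compiler must not substitute one for the other.

```c
/* f has a side effect on a global counter, so calls to it
   cannot be merged or reordered away by the compiler */
int counter = 0;

int f(int x)
{
    return counter++;
}

int func1(int x)
{
    /* four calls: returns 0 + 1 + 2 + 3 = 6 on a fresh counter */
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    /* one call: returns 4 * 0 = 0 on a fresh counter */
    return 4 * f(x);
}
```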
5.7 Understanding Modern Processors
5.7.1 Overall Operation
Modern CPU Design — Figure 5.11 P396
[Block diagram: the Instruction Control Unit (instruction cache, fetch control, instruction decode, retirement unit, register file, branch prediction) sends operations to the Execution Unit's functional units (integer/branch, general integer, FP add, FP mult/div, load, store), which access the data cache; operation results and register updates flow back.]
[Annotated version of Figure 5.11, numbering the components discussed on the following slides: 1) instruction cache, 2) fetch control, 3) instruction decode, 4) retirement unit, 5) register file; functional units: integer/branch, general integer, FP add, FP mult/div, (6) load/store, (7) data cache.]
Modern Processor P396
• Superscalar
  – Performs multiple operations on every clock cycle
• Out-of-order execution
  – The order in which instructions execute need not correspond to their ordering in the assembly program
Modern Processor P396
• Two main parts
  – Instruction Control Unit
    • Responsible for reading a sequence of instructions from memory
    • Generates from those instructions a set of primitive operations to perform on program data
  – Execution Unit
1) Instruction Control Unit
• Instruction Cache
  – A special, high-speed memory containing the most recently accessed instructions
1) Instruction Control Unit
• Instruction Decoding Logic P397
  – Takes actual program instructions
  – Converts them into a set of primitive operations
  – Each primitive operation performs some simple task
    • Simple arithmetic, Load, Store
  – addl %eax,4(%edx) expands into three operations:
      load 4(%edx)   → t.1
      addl %eax,t.1  → t.2
      store t.2,4(%edx)
  – Register renaming P398
2) Fetch Control
• Fetch Ahead P396
  – Fetches well ahead of the currently executing instructions
  – Gives the ICU enough time to decode them
  – Gives the ICU enough time to send decoded operations down to the EU
Fetch Control
• Branch Prediction P397
  – Branch taken or fall through?
  – Guess whether the branch is taken or not
• Speculative Execution P397
  – Fetch, decode, and execute according to the branch prediction
  – Before the branch outcome has been determined
5.7 Understanding Modern Processors
5.7.2 Functional Unit Performance
Multi-functional Units
• Multiple instructions can execute in parallel
  – 1 load
  – 1 store
  – 2 integer (one may be branch)
  – 1 FP addition
  – 1 FP multiplication or division
Multi-functional Units Figure 5.12 P400
• Some instructions take > 1 cycle, but can be pipelined:

  Instruction                Latency   Cycles/Issue
  Load / Store                  3           1
  Integer Multiply              4           1
  Integer Divide               36          36
  Double/Single FP Multiply     5           2
  Double/Single FP Add          3           1
  Double/Single FP Divide      38          38
5.7 Understanding Modern Processors
5.7.1 Overall Operation
Execution Unit
• Receives operations from the ICU
• Each cycle it may receive more than one operation
• Operations are queued in a buffer
Execution Unit
• An operation is dispatched to one of the multi-functional units whenever
  – All operands of the operation are ready
  – A suitable functional unit is available
• Execution results are passed among functional units
• (7) Data Cache P398
  – A high-speed memory containing the most recently accessed data values
4) Retirement Unit P398
• Instructions need to commit in serial order
  – Misprediction
  – Exceptions
• Updates architectural state
  – Memory and register values
5.7.3 A Closer Look at Processor Operation
Translating Instructions into Operations
Translation Example P401

.L24:                              # Loop:
  imull (%eax,%edx,4),%ecx         #   t *= data[i]
  incl %edx                        #   i++
  cmpl %esi,%edx                   #   i:length
  jl .L24                          #   if <, goto Loop

  load (%eax,%edx.0,4)  → t.1
  imull t.1,%ecx.0      → %ecx.1
  incl %edx.0           → %edx.1
  cmpl %esi,%edx.1      → cc.1
  jl-taken cc.1
Understanding Translation Example P401

  imull (%eax,%edx,4),%ecx
    load (%eax,%edx.0,4) → t.1
    imull t.1,%ecx.0     → %ecx.1

• Split into two operations
  – Load reads from memory to generate temporary result t.1
  – Multiply operation just operates on registers
Understanding Translation Example P401

  imull (%eax,%edx,4),%ecx
    load (%eax,%edx.0,4) → t.1
    imull t.1,%ecx.0     → %ecx.1

• Operands
  – Register %eax does not change in the loop; its value is retrieved from the register file during decoding
Understanding Translation Example P401

  imull (%eax,%edx,4),%ecx
    load (%eax,%edx.0,4) → t.1
    imull t.1,%ecx.0     → %ecx.1

• Operands
  – Register %ecx changes on every iteration
  – Uniquely identify the different versions as
    • %ecx.0, %ecx.1, %ecx.2, …
  – Register renaming
    • Values passed directly from producer to consumers
Understanding Translation Example P402

  incl %edx.0 → %edx.1

• Register %edx changes on each iteration
• Renamed as %edx.0, %edx.1, %edx.2, …
Understanding Translation Example P402

  cmpl %esi,%edx.1 → cc.1

• Condition codes are treated similarly to registers
• A tag defines the connection between producer and consumer
Understanding Translation Example P402

  jl .L24
    jl-taken cc.1

• Instruction control unit determines the destination of the jump
• Predicts whether the branch will be taken
• Starts fetching instructions at the predicted destination
Understanding Translation Example P401

  jl .L24
    jl-taken cc.1

• Execution unit simply checks whether or not the prediction was OK
• If not, it signals instruction control
  – Instruction control then "invalidates" any operations generated from misfetched instructions
  – Begins fetching and decoding instructions at the correct target
Visualizing Operations Figure 5.13 P403
[Dataflow graph for one iteration: incl, cmpl, jl, load, and imull; the imull chain dominates the critical path.]

  load (%eax,%edx.0,4) → t.1
  imull t.1,%ecx.0     → %ecx.1
  incl %edx.0          → %edx.1
  cmpl %esi,%edx.1     → cc.1
  jl-taken cc.1

• Operations
  – Vertical position denotes the time at which it executes
    • Cannot begin an operation until its operands are available
  – Height denotes latency
• Operands
  – Arcs shown only for operands that are passed within the execution unit
Visualizing Operations Figure 5.14 P403
[Same dataflow graph for the sum version, where the add has latency 1.]

  load (%eax,%edx.0,4) → t.1
  addl t.1,%ecx.0      → %ecx.1
  incl %edx.0          → %edx.1
  cmpl %esi,%edx.1     → cc.1
  jl-taken cc.1

• Operations
  – Same as before, except that add has a latency of 1
3 Iterations of Combining Product Figure 5.15 P404
• Unlimited Resource Analysis
  – Assume an operation can start as soon as its operands are available
  – Operations for multiple iterations overlap in time
• Performance
  – Limiting factor becomes the latency of the integer multiplier
  – Gives CPE of 4.0
4 Iterations of Combining Sum Figure 5.16 P405
• Unlimited Resource Analysis
• Performance
  – Can begin a new iteration on each clock cycle
  – Should give CPE of 1.0
  – Would require executing 4 integer operations in parallel
Combining Product: Resource Constraints Figure 5.17 P406
Combining Sum: Resource Constraints Figure 5.18 P408
Combining Sum: Resource Constraints
• Only have two integer functional units
• Some operations are delayed even though their operands are available
• Priority is set based on program order
• Performance
  – Sustains CPE of 2.0
5.8 Reducing Loop Overhead
Loop Unrolling P409

void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    /* combine 3 elements at a time */
    for (i = 0; i < length-2; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];
    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
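A runnable specialization of combine5 (integer sum, IDENT = 0, OPER = +) over a plain array — a sketch to check that the unrolled loop and the cleanup loop together cover every element exactly once:

```c
/* 3-way unrolled sum: the first loop stops at length-2 so that
   data[i+2] stays in bounds; the second loop handles 0-2 leftovers */
int sum_unrolled3(int *data, int length)
{
    int i;
    int x = 0;
    /* combine 3 elements at a time */
    for (i = 0; i < length - 2; i += 3)
        x = x + data[i] + data[i+1] + data[i+2];
    /* finish any remaining elements */
    for (; i < length; i++)
        x = x + data[i];
    return x;
}
```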
Visualizing Unrolled Loop P410
– Loads can pipeline, since they have no dependencies
– Only one set of loop control operations

  load (%eax,%edx.0,4)  → t.1a
  addl t.1a,%ecx.0c     → %ecx.1a
  load 4(%eax,%edx.0,4) → t.1b
  addl t.1b,%ecx.1a     → %ecx.1b
  load 8(%eax,%edx.0,4) → t.1c
  addl t.1c,%ecx.1b     → %ecx.1c
  addl $3,%edx.0        → %edx.1
  cmpl %esi,%edx.1      → cc.1
  jl-taken cc.1
Visualizing Unrolled Loop Figure 5.20 P410
[Dataflow graph for one unrolled iteration: three loads feed three chained adds on %ecx, while the loop overhead (addl $3, cmpl, jl) runs in parallel.]
Measured CPE = 1.33
Executing with Loop Unrolling Figure 5.21 P411
Executing with Loop Unrolling
• Predicted Performance
  – Can complete an iteration in 3 cycles
  – Should give CPE of 1.0
• Measured Performance
  – CPE of 1.33
  – One iteration every 4 cycles
Effect of Unrolling P411

  Unrolling Degree    1     2     3     4     8     16
  Integer Sum        2.00  1.50  1.33  1.50  1.25  1.06
  Integer Product    4.00
  FP Sum             3.00
  FP Product         5.00
Effect of Unrolling • Only helps integer sum for our examples – Other cases constrained by functional unit latencies • Effect is nonlinear with degree of unrolling – Many subtle effects determine exact scheduling of operations 60
5.9 Converting to Pointer Code
Example P413

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int x = IDENT;
    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}
Pointer Code vs. Array Code P414

  Function    Integer +  Integer *   FP +   FP *
  Combine4      2.00       4.00      3.00   5.00
  Combine4p     3.00       4.00      3.00   5.00

• Some compilers and processors do a better job optimizing array code
Pointer vs. Array Code Inner Loops P414

.L24:                            # Loop:
  addl (%eax,%edx,4),%ecx        #   x += data[i]
  incl %edx                      #   i++
  cmpl %esi,%edx                 #   i:length
  jl .L24                        #   if <, goto Loop

.L30:                            # Loop:
  addl (%eax),%ecx               #   x += *data
  addl $4,%eax                   #   data++
  cmpl %edx,%eax                 #   data:dend
  jb .L30                        #   if <, goto Loop
Pointer vs. Array Code Inner Loops
• Performance
  – Array code: 4 instructions in 2 clock cycles
  – Pointer code: almost the same 4 instructions, but in 3 clock cycles
5.10 Enhancing Parallelism
Loop Splitting P416

void combine6(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;
    /* combine 2 elements at a time */
    for (i = 0; i < length-1; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}
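A runnable specialization of combine6 to integer product (IDENT = 1, OPER = *) over a plain array: the two accumulators gather even- and odd-indexed elements independently, with no data dependency between them inside the loop, and are combined once at the end.

```c
/* 2-way split product: x0 and x1 can update in parallel because
   neither depends on the other's previous value */
int prod_split2(int *data, int length)
{
    int i;
    int x0 = 1, x1 = 1;
    /* combine 2 elements at a time */
    for (i = 0; i < length - 1; i += 2) {
        x0 = x0 * data[i];
        x1 = x1 * data[i+1];
    }
    /* finish any remaining element */
    for (; i < length; i++)
        x0 = x0 * data[i];
    return x0 * x1;
}
```

The bound length-1 keeps data[i+1] in bounds when length is odd; the cleanup loop picks up the last element.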
Loop Splitting
• Optimization
  – Accumulate in two different sums
    • Can be performed simultaneously
  – Combine at the end
  – Exploits the property that integer addition and multiplication are associative and commutative
  – FP addition and multiplication are not associative, but the transformation is usually acceptable
Visualizing Parallel Loop P417
• The two multiplies within the loop no longer have a data dependency
• Allows them to pipeline

  load (%eax,%edx.0,4)  → t.1a
  imull t.1a,%ecx.0     → %ecx.1
  load 4(%eax,%edx.0,4) → t.1b
  imull t.1b,%ebx.0     → %ebx.1
  addl $2,%edx.0        → %edx.1
  cmpl %esi,%edx.1      → cc.1
  jl-taken cc.1
Visualizing Parallel Loop Figure 5.25 P417
[Dataflow graph: the two loads and the two imull chains (into %ecx and %ebx) proceed in parallel with the loop overhead.]
Executing with Parallel Loop Figure 5.26 P418
Optimization Results for Combining P419
Optimization Results for Combining
• Register spilling
  – Only 6 registers available
  – Uses memory as storage
  – Example:
      movl -12(%ebp),%edi
      imull 24(%eax),%edi
      movl %edi,-12(%ebp)
5.11 Putting It Together: Summary of Results for Optimizing Combining Code
5.11.1 Floating-Point Performance Anomaly
5.11.2 Changing Platforms
5.12 Branch Prediction and Misprediction Penalties
What About Branches?
• Challenge
  – Instruction Control Unit must work well ahead of the Execution Unit
  – To generate enough operations to keep the EU busy
What About Branches?

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25
  80489fe: movl  %esi,%esi
  8048a00: imull (%eax,%edx,4),%ecx

  (earlier instructions are executing while later ones are still being fetched and decoded)
What About Branches?
• Challenge
  – When the processor encounters a conditional branch, it cannot reliably determine where to continue fetching

Branch Outcomes
• When the processor encounters a conditional branch, it cannot determine where to continue fetching
  – Branch taken: transfer control to the branch target
  – Branch not-taken: continue with the next instruction in sequence
• Cannot resolve until the outcome is determined by the branch/integer unit
Branch Outcomes

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25
  80489fe: movl  %esi,%esi          ← Branch Not-Taken continues here
  8048a00: imull (%eax,%edx,4),%ecx

  Branch Taken target:
  8048a25: cmpl  %edi,%edx
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)
Branch Prediction • Idea – Guess which way branch will go – Begin executing instructions at predicted position • But don’t actually modify register or memory data 81
Branch Prediction

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25        ← Predict Taken
  . . .
  8048a25: cmpl  %edi,%edx
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)
  (speculatively executed at the predicted target)
Branch Prediction Through Loop
Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1

  i = 98:  jl predicted taken (OK)
  i = 99:  jl predicted taken (Oops)
  i = 100: executed anyway — reads an invalid location
  i = 101: fetched
Branch Misprediction Invalidation
Assume vector length = 100

  (same loop body as above)
  i = 98:  jl predicted taken (OK)
  i = 99:  jl predicted taken (Oops)
  i = 100, 101: speculatively issued operations are invalidated
Branch Misprediction Recovery
Assume vector length = 100

  i = 98: jl predicted taken (OK)
  i = 99: branch definitely not taken; fetching resumes at the fall-through:
  80488bb: leal 0xffffffe8(%ebp),%esp
  80488be: popl %ebx
  80488bf: popl %esi
  80488c0: popl %edi
Branch Misprediction Recovery P 427 • Performance Cost – Misprediction on Pentium III wastes ~14 clock cycles – That’s a lot of time on a high performance processor 86
Conditional Jump Figure 5.29 P427
• Misprediction penalty is about 14 cycles on a PIII machine
• A conditional move is used to avoid the misprediction penalty when the branch outcome is not predictable
• For example:

  int absval(int val)
  {
      return (val < 0) ? -val : val;
  }
Conditional Jump P428

  movl  8(%ebp),%eax    # Get val
  movl  %eax,%edx       # Copy to %edx
  negl  %edx            # Negate %edx
  testl %eax,%eax       # Test val
  cmovl %edx,%eax       # If < 0, copy %edx to result
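The same pattern applies to any small selection between two values. For instance, a max written in the slide's ternary style (maxval is an illustrative name, not from the slides) gives the compiler the option of a cmov instead of a conditional jump:

```c
/* Ternary select in the same style as absval: with optimization on,
   gcc for IA32 can compile this to a cmov rather than a branch */
int maxval(int a, int b)
{
    return (a < b) ? b : a;
}
```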
5.13 Understanding Memory Performance
Load Latency P429, P430

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls)
{
    int len = 0;
    for (; ls; ls = ls->next)
        len++;
    return len;
}

Assembly instructions (Figure 5.30 P430):
.L27:
  incl %eax
  movl (%edx),%edx
  testl %edx,%edx
  jne .L27

Execution unit operations:
  incl %eax.0          → %eax.1
  load (%edx.0)        → %edx.1
  testl %edx.1,%edx.1  → cc.1
  jne-taken cc.1
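A self-contained, runnable version of list_len over a small hand-built list — each iteration's load of ls->next produces the address that the next iteration's load consumes, which is exactly the serial dependency the following figure visualizes:

```c
#include <stddef.h>

/* Runnable sketch of the slide's linked-list example */
typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls)
{
    int len = 0;
    for (; ls; ls = ls->next)  /* each load feeds the next iteration */
        len++;
    return len;
}
```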
Figure 5.31 P430
[Dataflow graph for three iterations (i = 1, 2, 3): each load must wait for the previous load's result, so the chain of 3-cycle loads limits performance.]
Store Latency Figure 5.32 P431

void array_clear(int *dest, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dest[i] = 0;
}

CPE: 2.0
Store Latency Figure 5.32 P431

void array_clear(int *dest, int n)
{
    int i;
    int len = n-7;
    for (i = 0; i < len; i += 8) {
        dest[i]   = dest[i+1] = dest[i+2] = dest[i+3] = 0;
        dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0;
    }
    for (; i < n; i++)
        dest[i] = 0;
}

CPE: 1.25
Store Latency Figure 5.33 P432

void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src)+1;
    }
}
Store Latency Figure 5.33 P432

write_read(&a[0], &a[1], 3):
            initial   iter.1    iter.2    iter.3
  cnt          3         2         1         0
  a        (-10,17)  (-10,0)   (-10,-9)  (-10,-9)
  val          0        -9        -9        -9

write_read(&a[0], &a[0], 3):
            initial   iter.1    iter.2    iter.3
  cnt          3         2         1         0
  a        (-10,17)  (0,17)    (1,17)    (2,17)
  val          0         1         2         3
Store Latency P434

.L32:
  movl %edx,(%ecx)    # *dest = val
  movl (%ebx),%edx    # val = *src
  incl %edx           # val+1
  decl %eax           # cnt--
  jnc .L32            # if >= 0, goto loop

  storeaddr (%ecx)
  storedata %edx.0
  load (%ebx)   → %edx.1a
  incl %edx.1a  → %edx.1b
  decl %eax.0   → %eax.1, cc.1
  jnc-taken cc.1
Store Latency Figure 5.35 P434
[Dataflow graph of the store-address, store-data, load, incl, and decl operations for two iterations of write_read.]
Figure 5.36 P435
[Same graph with the store-data → load dependency highlighted: when dest aliases src, each load must wait for the preceding store's data to be forwarded.]
5.14 Life in the Real World: Performance Improvement Techniques
Basic Strategies for Optimizing Program Performance
• High-level design
• Basic coding principles
  – Eliminate excessive function calls
  – Eliminate unnecessary memory references
• Low-level optimizations
  – Try various forms of pointer versus array code
  – Reduce loop overhead by unrolling loops
  – Find ways to make use of the pipelined functional units, e.g. by iteration splitting
5.15 Identifying and Eliminating Performance Bottlenecks
Performance Tuning
• Identify
  – The hottest part of the program
  – Using a very useful method: profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the run
    • Analyze the results
• gprof example
  $ gcc -O2 -pg prog.c -o prog
  $ ./prog file.txt     (generates gmon.out)
  $ gprof prog          (reads gmon.out)
Example
• Task
  – Count word frequencies in a text document
  – Sort the words in descending order of occurrence
• Steps
  – Convert strings to lower case
  – Apply hash function
  – Read words and insert into hash table
    • Mostly list operations
    • Maintain a counter for each unique word
  – Sort results
Examples

unix> gcc -O2 -pg prog.c -o prog
unix> ./prog file.txt
unix> gprof prog

  %     cumulative   self                self      total
  time    seconds   seconds    calls   ms/call   ms/call   name
  86.60     8.21      8.21         1   8210.00   8210.00   sort_words
   5.80     8.76      0.55    946596                       lower1
   4.75     9.21      0.45    946596                       find_ele_rec
   1.27     9.33      0.12    946596                       h_add
Example P439

                           4872758           find_ele_rec [5]
           0.60   0.01   946596/946596       insert_string [4]
[5]  6.7   0.60   0.01   946596+4872758      find_ele_rec [5]
           0.00   0.00    26946/26946        save_string [9]
           0.00   0.00    26946/26946        new_ele [11]
                           4872758           find_ele_rec [5]
Principle
• Interval counting
  – Maintain a counter for each function
    • Records the time spent executing this function
  – Interrupt at regular intervals (1 ms)
    • Check which function is executing when the interrupt occurs
    • Increment that function's counter
Data Set P439
• Collected works of Shakespeare
• 946,596 total words, 26,596 unique
• Initial implementation: 9.2 seconds
Code Optimizations Figure 5.37 P441
• First step: use a more efficient sorting function
  – Library function qsort
Further Optimizations
Example
• 3) Iter first: Use iterative function to insert elements in linked list
  – Causes code to slow down
• 4) Iter last: Iterative function, places new entry at end of list
  – Tends to place most common words at front of list
• 5) Big table: Increase number of hash buckets
• 6) Better hash: Use more sophisticated hash function
• 7) Linear lower: Move strlen out of loop
Code Motion Example #2

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Equivalent goto form, showing that strlen is called on every iteration:

void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
done:
    ;
}
Lower Case Conversion Performance
• Time quadruples when string length doubles
• Quadratic performance
Improving Performance

void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

• Move the call to strlen outside of the loop
  – Its result does not change from one iteration to another
  – A form of code motion
Lower Case Conversion Performance
• Time doubles when string length doubles
• Linear performance
Performance Tuning
• Benefits
  – Helps identify performance bottlenecks
  – Especially useful for a complex system with many components
• Limitations
  – Only shows performance for the data tested
  – E.g., linear lower did not show a big gain, since the words are short
    • Quadratic inefficiency could remain lurking in the code
  – Timing mechanism is fairly crude
    • Only works for programs that run for > 3 seconds
Amdahl's Law P443

  Tnew = (1 - α)Told + (αTold)/k = Told[(1 - α) + α/k]

  S = Told / Tnew = 1 / [(1 - α) + α/k]

  S∞ = 1 / (1 - α)
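As a quick numeric check of the formulas (α = 0.6 and k = 3 are illustrative values, not from the slides):

```c
/* Amdahl's Law: overall speedup when a fraction alpha of the
   original running time is sped up by a factor of k */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

With α = 0.6 and k = 3 the overall speedup is 1/(0.4 + 0.2) ≈ 1.67, and as k grows without bound the speedup approaches 1/(1 - α) = 2.5.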