Program design and analysis: optimizing for execution time

























Program design and analysis

- Optimizing for execution time.
- Optimizing for energy/power.
- Optimizing for program size.

© 2000 Morgan Kaufman Overheads for Computers as Components

Motivation

- Embedded systems must often meet deadlines.
  - Faster may not be fast enough.
- Need to be able to analyze execution time.
  - Worst-case, not typical.
- Need techniques for reliably improving execution time.

Run times will vary

- Program execution times depend on several factors:
  - Input data values.
  - State of the instruction and data caches.
  - Pipelining effects.

Measuring program speed

- CPU simulator.
  - I/O may be hard.
  - May not be totally accurate.
- Hardware profiler/timer.
  - Requires board, instrumented program.
- Logic analyzer.
  - Limited logic analyzer memory depth.

Program performance metrics

- Average-case:
  - For typical data values, whatever they are.
- Worst-case:
  - Longest execution time over any possible input set.
- Best-case:
  - Shortest execution time over any possible input set.
- Too-fast programs may cause critical races at system level.

Performance analysis

- Elements of program performance (Shaw):
  - execution time = program path + instruction timing
- Path depends on data values. Choose which case you are interested in.
- Instruction timing depends on pipelining, cache behavior.
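One way to read the relation above is as a sum over the basic blocks on the chosen path: how often each block runs, times what one execution costs. The sketch below assumes exactly that; `struct block_cost`, `estimate_cycles`, and all cycle numbers are hypothetical names and inputs for illustration, and the model ignores the cache and pipeline interactions mentioned above.

```c
#include <stddef.h>

/* One basic block on the chosen path. Both fields are assumed inputs:
 * the count comes from path analysis, the cycles from instruction timing. */
struct block_cost {
    unsigned long count;   /* executions along the chosen path */
    unsigned long cycles;  /* assumed cycles per execution */
};

/* Estimated cycles = sum over blocks of count * cycles.
 * Interactions between blocks (cache state, pipeline) are not modeled. */
unsigned long estimate_cycles(const struct block_cost *blocks, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += blocks[i].count * blocks[i].cycles;
    return total;
}
```

For example, a path with an initiation block run once, a test run 11 times, and a body run 10 times would be passed as three `struct block_cost` entries.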

Programs and performance analysis

- Best results come from analyzing optimized instructions, not high-level language code:
  - non-obvious translations of HLL statements into instructions;
  - code may move;
  - cache effects are hard to predict.

Program paths

- Consider for loop: `for (i=0, f=0; i<N; i++) f = f + c[i]*x[i];`
- Loop initiation block executed once.
- Loop test executed N+1 times.
- Loop body and variable update executed N times.

[Flowchart: i=0; f=0; then test i<N; on Y, f = f + c[i]*x[i]; i = i+1; and back to the test; on N, exit.]
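The execution counts stated for the loop can be checked with an instrumented version. This is a minimal sketch: `struct loop_counts` and `dot_counted` are hypothetical names, and the loop is written out as the explicit initiation, test, body, and update steps so that each can be counted.

```c
#include <stddef.h>

/* Hypothetical counters, one per part of the loop. */
struct loop_counts {
    unsigned long init;  /* loop initiation block */
    unsigned long test;  /* loop test i<N */
    unsigned long body;  /* loop body plus variable update */
};

/* Same computation as the loop above, with each part counted. */
int dot_counted(const int *c, const int *x, size_t n, struct loop_counts *k)
{
    size_t i;
    int f;

    k->init = k->test = k->body = 0;

    i = 0; f = 0;              /* initiation block */
    k->init++;
    for (;;) {
        k->test++;             /* the test runs even on the final, failing pass */
        if (!(i < n))
            break;
        f = f + c[i] * x[i];   /* body */
        i = i + 1;             /* variable update */
        k->body++;
    }
    return f;
}
```

With N = 3 the counters end at 1, 4, and 3, matching "once", "N+1 times", and "N times".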

Instruction timing

- Not all instructions take the same amount of time.
- Instruction execution times are not independent.
- Execution time may depend on operand values.

Trace-driven performance analysis

- Trace: a record of the execution path of a program.
- Trace gives execution path for performance analysis.
- A useful trace:
  - requires proper input values;
  - is large (gigabytes).

Trace generation

- Hardware capture:
  - logic analyzer;
  - hardware assist in CPU.
- Software:
  - PC sampling.
  - Instrumentation instructions.
  - Simulation.

Loop optimizations

- Loops are good targets for optimization.
- Basic loop optimizations:
  - code motion;
  - induction-variable elimination;
  - strength reduction (x*2 -> x<<1).
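Strength reduction can be sketched in C as follows. The function names are hypothetical, and unsigned arithmetic is used so that the multiply and the running addition wrap identically. An optimizing compiler will often perform this transformation itself, so the hand-written form is mainly illustrative.

```c
#include <stddef.h>

/* Before: one multiply per iteration. */
void fill_mul(unsigned *out, size_t n, unsigned stride)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned)i * stride;
}

/* After: the multiply is replaced by a cheaper running addition. */
void fill_add(unsigned *out, size_t n, unsigned stride)
{
    unsigned v = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = v;
        v += stride;
    }
}

/* The slide's own example: x*2 computed with a shift. */
unsigned twice_shift(unsigned x)
{
    return x << 1;
}
```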
![Code motion](https://slidetodoc.com/presentation_image_h2/566fd7e6998019639c4bd0b2d86ef022/image-13.jpg)

Code motion

- Original loop: `for (i=0; i<N*M; i++) z[i] = a[i] + b[i];`

[Flowchart: the original loop tests i<N*M on every iteration; after code motion, i=0; X = N*M is computed once before the loop and the test becomes i<X; on Y, z[i] = a[i] + b[i]; i = i+1; on N, exit.]
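The two forms of the loop can be written side by side. This is a sketch with hypothetical function names; `x` plays the role of X in the flowchart. Many compilers hoist the loop-invariant `n * m` on their own, so the second version shows what the optimization produces rather than something that must be done by hand.

```c
#include <stddef.h>

/* Before: n*m appears in the loop test, so it is nominally
 * recomputed on every iteration. */
void add_naive(int *z, const int *a, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n * m; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the invariant product is computed once. */
void add_hoisted(int *z, const int *a, const int *b, size_t n, size_t m)
{
    size_t x = n * m;
    for (size_t i = 0; i < x; i++)
        z[i] = a[i] + b[i];
}
```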

Induction variable elimination

- Induction variable: loop index.
- Consider loop: `for (i=0; i<N; i++) for (j=0; j<M; j++) z[i][j] = b[i][j];`
- Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body.
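To make the i*M+j computation visible, the sketch below stores both arrays as flat row-major buffers, which is an assumption of this example rather than something the slide specifies. The function names and the shared index `zb` are hypothetical.

```c
#include <stddef.h>

/* Before: the offset i*m + j is recomputed for each array access. */
void copy_indexed(int *z, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            z[i * m + j] = b[i * m + j];
}

/* After: one induction variable is shared by both arrays and
 * incremented at the end of the loop body. */
void copy_shared(int *z, const int *b, size_t n, size_t m)
{
    size_t zb = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            z[zb] = b[zb];
            zb++;
        }
}
```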

Cache analysis

- Loop nest: set of loops, one inside other.
- Perfect loop nest: no conditionals in nest.
- Because loops use large quantities of data, cache conflicts are common.
![Array conflicts in cache](https://slidetodoc.com/presentation_image_h2/566fd7e6998019639c4bd0b2d86ef022/image-16.jpg)

Array conflicts in cache

[Figure: main memory holds a[0,0] at address 1024 and b[0,0] at address 4099; the cache is shown holding the entry for 4099.]

Array conflicts, cont’d.

- Array elements conflict because they are in the same line, even if not mapped to same location.
- Solutions:
  - move one array;
  - pad array.
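Whether two addresses conflict depends on the cache geometry, which the slides do not give. The sketch below assumes a direct-mapped cache with 16-byte lines and 64 lines (1 KiB); `LINE_BYTES`, `NUM_LINES`, and the function names are hypothetical. Under those assumed parameters, the figure's addresses 1024 and 4099 land on the same cache line, and padding the second array by one line moves it to a different one.

```c
/* Assumed cache geometry, for illustration only. */
#define LINE_BYTES 16ul
#define NUM_LINES  64ul

/* Index of the cache line a byte address maps to (direct-mapped). */
unsigned long cache_index(unsigned long addr)
{
    return (addr / LINE_BYTES) % NUM_LINES;
}

/* Two addresses conflict when they come from different memory lines
 * that compete for the same cache line. */
int lines_conflict(unsigned long a, unsigned long b)
{
    return cache_index(a) == cache_index(b)
        && (a / LINE_BYTES) != (b / LINE_BYTES);
}
```

Calling `lines_conflict(1024, 4099)` reports a conflict, while `lines_conflict(1024, 4099 + 16)` does not, which is the effect of the "pad array" solution in this model.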

Performance optimization hints

- Use registers efficiently.
- Use page mode memory accesses.
- Analyze cache behavior:
  - instruction conflicts can be handled by rewriting code, rescheduling;
  - conflicting scalar data can easily be moved;
  - conflicting array data can be moved, padded.

Energy/power optimization

- Energy: ability to do work.
  - Most important in battery-powered systems.
- Power: energy per unit time.
  - Important even in wall-plug systems: power becomes heat.

Measuring energy consumption

- Execute a small loop, measure current I: `while (TRUE) a();`
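Once the current has been measured, it can be turned into an energy figure with E = V × I × t, assuming the supply voltage and current are roughly constant over the measurement. The sketch below is an illustration under that assumption; the function names are hypothetical, and the iteration count would have to come from a separate timing measurement of the loop.

```c
/* Energy in joules for a constant supply voltage (V) and
 * measured current (A) over a time interval (s). */
double energy_joules(double volts, double amps, double seconds)
{
    return volts * amps * seconds;
}

/* Energy attributed to one pass of the loop body.
 * iterations must be nonzero. */
double energy_per_iteration(double volts, double amps, double seconds,
                            unsigned long iterations)
{
    return energy_joules(volts, amps, seconds) / (double)iterations;
}
```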

Sources of energy consumption

- Relative energy per operation (Catthoor et al.):
  - memory transfer: 33
  - external I/O: 10
  - SRAM write: 9
  - SRAM read: 4.4
  - multiply: 3.6
  - add: 1
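The relative costs above can be combined with operation counts to compare two versions of a routine. This is a rough sketch: `struct op_counts` and `relative_energy` are hypothetical names, the weights are the relative values listed above, and the result is a unitless comparison figure rather than an energy in joules.

```c
/* How many times a routine performs each kind of operation. */
struct op_counts {
    unsigned long mem_transfer;
    unsigned long ext_io;
    unsigned long sram_write;
    unsigned long sram_read;
    unsigned long multiply;
    unsigned long add;
};

/* Weighted sum using the relative energy per operation listed above. */
double relative_energy(const struct op_counts *n)
{
    return 33.0 * (double)n->mem_transfer
         + 10.0 * (double)n->ext_io
         +  9.0 * (double)n->sram_write
         +  4.4 * (double)n->sram_read
         +  3.6 * (double)n->multiply
         +  1.0 * (double)n->add;
}
```

In this model a single memory transfer costs as much as 33 adds, which is why reducing memory traffic tends to dominate the estimate.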

Cache behavior is important

- Energy consumption has a sweet spot as cache size changes:
  - cache too small: program thrashes, burning energy on external memory accesses;
  - cache too large: cache itself burns too much power.

Optimizing for energy

- First-order optimization:
  - high performance = low energy.
- Not many instructions trade speed for energy.

Optimizing for energy, cont’d.

- Use registers efficiently.
- Identify and eliminate cache conflicts.
- Moderate loop unrolling eliminates some loop overhead instructions.
- Eliminate pipeline stalls.
- Inlining procedures may help: reduces linkage, but may increase cache thrashing.
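Moderate loop unrolling can be sketched as below. The unroll factor of 4 and the function names are choices made for this example. The unrolled version executes roughly a quarter as many loop tests and index updates, at the cost of a larger loop body, which is why the hint says "moderate".

```c
#include <stddef.h>

/* Rolled: one test and one index update per element. */
long sum_rolled(const int *x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 4: one test and one index update per four elements,
 * followed by a cleanup loop for the remaining 0 to 3 elements. */
long sum_unrolled4(const int *x, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += (long)x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    for (; i < n; i++)
        s += x[i];
    return s;
}
```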

Optimizing for program size

- Goal:
  - reduce hardware cost of memory;
  - reduce power consumption of memory units.
- Two opportunities:
  - data;
  - instructions.

Data size minimization

- Reuse constants, variables, data buffers in different parts of code.
  - Requires careful verification of correctness.
- Generate data using instructions.
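"Generate data using instructions" can be illustrated with a small table of bit masks, chosen here only as an example. The stored form spends data memory on eight constants; the computed form produces the same values with a shift. Whether this saves space overall depends on the table size and on how many instructions the computation needs.

```c
/* Stored form: eight constants occupy data memory. */
static const unsigned char bit_mask_table[8] = {
    1, 2, 4, 8, 16, 32, 64, 128
};

unsigned char bit_mask_stored(unsigned i)
{
    return bit_mask_table[i & 7u];
}

/* Generated form: the same value is computed by an instruction,
 * so no table is needed. */
unsigned char bit_mask_computed(unsigned i)
{
    return (unsigned char)(1u << (i & 7u));
}
```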

Reducing code size

- Avoid function inlining.
- Choose CPU with compact instructions.
- Use specialized instructions where possible.

Code compression

- Use statistical compression to reduce code size, decompress on-the-fly.

[Figure: compressed code (0101101) in main memory passes through a table-driven decompressor, which delivers instructions such as LDR r0, [r4] to the cache and CPU.]
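The table-driven decompressor can be sketched in software with a deliberately simplified scheme: fixed-width 2-bit codes that index a dictionary of instruction words. This is not the statistical (variable-length) coding the slide refers to; it only shows the table lookup step. The dictionary contents are placeholder values, and `dictionary` and `decompress` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dictionary of frequently used 32-bit instruction words.
 * The values are placeholders, not real encodings. */
static const uint32_t dictionary[4] = {
    0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u
};

/* Expand n 2-bit codes, packed four per byte with the first code in
 * the low-order bits, into full instruction words. */
void decompress(const uint8_t *packed, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned code = (packed[i / 4] >> (2 * (i % 4))) & 3u;
        out[i] = dictionary[code];
    }
}
```

In this toy scheme each 32-bit instruction is stored as 2 bits plus a share of the dictionary, so the saving depends on how often the dictionary entries repeat in the program.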