Embedded Systems Programming Writing Optimised C code for ARM

Why write optimised C code?
• For embedded systems, size and/or speed are of key importance
• The compiler’s optimisation phase can only do so much
• To write optimal C code you need to know details of the underlying hardware and of the compiler

What compilers can’t do

    void memclr(char *data, int N)
    {
        for (; N > 0; N--)
        {
            *data = 0;
            data++;
        }
    }

• Is N == 0 on the first loop?
  – 0 - 1 is dangerous!
• Is the data array 4-byte aligned?
  – Can store using int
• Is N a multiple of 4?
  – Could do 4-word blocks at a time
• Compilers have to be conservative!
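When the programmer does know the answers to these questions, they can be written into the interface so the compiler no longer has to guess. The following is only a sketch under stated assumptions: memclr_blocks is a hypothetical name, the buffer is assumed to be 4-byte aligned (expressed through the uint32_t pointer type) and its length is assumed to be a whole number of 4-word blocks.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the original slides.
 * Assumptions the caller guarantees:
 *   - data is 4-byte aligned (implied by the uint32_t * type)
 *   - the region is exactly nblocks blocks of four 32-bit words
 * A count of 0 is safe because the loop test comes first.
 */
void memclr_blocks(uint32_t *data, unsigned int nblocks)
{
    while (nblocks != 0)
    {
        /* Four word stores per iteration instead of sixteen byte stores. */
        data[0] = 0;
        data[1] = 0;
        data[2] = 0;
        data[3] = 0;
        data += 4;
        nblocks--;
    }
}
```

A caller with a 32-byte aligned buffer would call memclr_blocks(buf, 2). Buffers that do not meet the assumptions still need the byte-wise memclr.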

An example program

    /* program showing inefficient
     * variable and loop
     * usage craig Nov 04 */
    int checksum_1(int *data)
    {
        char i;
        int sum = 0;
        for (i = 0; i < 64; i++)
            sum += data[i];
        return sum;
    }

• The program might seem fine, even resource friendly
• Using a char saves space
• for loops make good assembler
• Let’s look at the assembler code

        .text
        .align  2
        .global checksum_1
        .type   checksum_1, function
    checksum_1:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, current_function_anonymous_args = 0
        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        sub     fp, ip, #4
        mov     r1, r0
        mov     r0, #0                  @ sum = 0
        mov     r2, r0                  @ i = 0
    .L6:
        ldr     r3, [r1, r2, asl #2]    @ data[i]
        add     r0, r3                  @ sum += data[i]
        add     r3, r2, #1              @ i++
        and     r2, r3, #255
        cmp     r2, #63                 @ i < 64
        bls     .L6
        ldmea   fp, {fp, sp, pc}
    .Lfe1:
        .size   checksum_1, .Lfe1-checksum_1

What is wrong?
• Using a char means the compiler has to mask the counter down to 8 bits after every increment, using and r2, r3, #255
• The loop variable requires a register and initialisation
• If the loop is called often, the test and branch are quite an overhead

Variable sizes
• In general the compiler will use 32-bit registers for local variables, but will have to cast them when they are used as 8- or 16-bit values
• If you can, use unsigned ints; if you can’t, cast explicitly
• Using signed shorts can be quite a problem for compilers
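As a small illustration of the first two points, the checksum loop can keep its for loop and simply change the counter type. This is a sketch only: checksum_uint is a hypothetical name, and the claim that the 8-bit mask disappears is an expectation for a 32-bit ARM target, not a measured result.

```c
/* Hypothetical variant of checksum_1 with a register-sized counter.
 * Assumption: unsigned int is 32 bits wide, as on ARM, so the counter
 * fits a register exactly and needs no masking after the increment.
 */
int checksum_uint(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
        sum += data[i];

    return sum;
}
```

The result is the same as checksum_1; only the counter type differs.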

Watch your shorts!

    short add(short a, short b)
    {
        return a + (b >> 1);
    }

Becomes ...

        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        sub     fp, ip, #4
        mov     r1, asl #16
        mov     r0, asr #16
        add     r0, r1, asr #17
        mov     r0, asl #16
        mov     r0, asr #16
        ldmea   fp, {fp, sp, pc}

• The above C code turns into this rather nasty assembler
• The GNU C compiler is very cautious when confronted with short variables
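One way to avoid the extra shifts is to do the arithmetic in int and narrow only where a 16-bit value is really needed. The version below is a sketch, assuming the callers can accept an int result; add_int is a hypothetical name. Note that right-shifting a negative signed value is implementation-defined in C (GCC performs an arithmetic shift).

```c
/* Hypothetical int version of add(): parameters and result are
 * register-sized, so no sign-extension shifts should be required.
 * Assumption: callers do not rely on wrap-around at 16 bits.
 */
int add_int(int a, int b)
{
    return a + (b >> 1);
}
```

For example, add_int(10, 6) gives 10 + 3 = 13.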

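Loops #1
• As well as using a char for the loop counter, the loop counter itself could be redundant
• Terminate loops by counting down to 0: this reduces register usage, and the decrement sets the flags itself, so no separate compare against the limit is needed
• Use do..while instead of for loops; this skips the initial test, so it assumes the loop runs at least once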

Efficient loop C

    /* Program to show efficient use of
     * variables and loops */
    int checksum_2(int *data)
    {
        int sum = 0, i = 64;
        do
        {
            sum += *(data++);
        } while (--i != 0);
        return sum;
    }

Efficient loop assembler

    checksum_2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, current_function_anonymous_args = 0
        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        sub     fp, ip, #4
        mov     r1, r0
        mov     r0, #0          @ sum = 0
        mov     r2, #64         @ i = 64
    .L6:
        ldr     r3, [r1], #4    @ *(data++)
        add     r0, r3          @ sum += *(data++)
        subs    r2, #1          @ --i
        bne     .L6
        ldmea   fp, {fp, sp, pc}

Loop unrolling
• If a loop is going to be repeated often, the test and branch can be quite an overhead
• If the iteration count is a multiple of 4 and the loop is executed a lot, the loop can be unrolled
• This increases code size but is more speed efficient
• Counts that are not multiples of 4 can be handled but are less efficient

An unrolled loop

    /* Program to show efficient use of
     * variables and loops & loop unrolling */
    int checksum_2(int *data)
    {
        int sum = 0, i = 64;
        do
        {
            /* four additions per pass, matching i -= 4 */
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            i -= 4;
        } while (i != 0);
        return sum;
    }

    checksum_2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, current_function_anonymous_args = 0
        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        sub     fp, ip, #4
        mov     r2, r0
        mov     r0, #0
        mov     r1, #64
    .L6:
        ldr     r3, [r2], #4
        add     r0, r0, r3
        ldr     r3, [r2], #4
        add     r0, r0, r3
        ldr     r3, [r2], #4
        add     r0, r0, r3
        ldr     r3, [r2], #4
        add     r0, r0, r3
        subs    r1, #4
        bne     .L6
        ldmea   fp, {fp, sp, pc}

Loop unrolling != 4

    /* Program to show use of
     * loop unrolling */
    int checksum_2(int *data, unsigned int N)
    {
        int sum = 0;
        unsigned int i;
        /* whole blocks of four elements */
        for (i = N/4; i != 0; i--)
        {
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
        }
        /* the remaining 0 to 3 elements */
        for (i = N&3; i != 0; i--)
            sum += *(data++);
        return sum;
    }
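Unrolled code with a remainder loop is easy to get subtly wrong, so it is worth checking it against the plain loop for counts that are and are not multiples of 4. The sketch below does that; checksum_ref and checksum_unrolled are hypothetical names introduced for the comparison and are not part of the original slides.

```c
/* Hypothetical reference version: one element per iteration. */
int checksum_ref(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    for (i = 0; i < N; i++)
        sum += data[i];

    return sum;
}

/* Hypothetical unrolled version: N/4 blocks of four elements,
 * followed by the N & 3 left-over elements. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    for (i = N / 4; i != 0; i--)
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N & 3; i != 0; i--)
        sum += *(data++);

    return sum;
}
```

Both functions should return the same value for every N, including 0 and values such as 7 that leave a remainder.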