Averaging Filter: Comparing performance of C++ and ‘our’ ASM
Example of program development on SHARC using C++ and assembly
Planned for Thursday 3rd October, afternoon
Practical examples handled in Lab 1

Code Design (RECAP)
• Add new value to FIFO buffer – includes discarding oldest FIFO value
• Perform averaging
• Note: N will be set to a small number (e.g. N = 16) during some parts of testing, and set large when timing is done
• Use no magic numbers in code
– No loops of the form for (j = 0; j < 1024; j++)
– Use for (j = 0; j < N; j++), where N is declared in Assign1.h so that a single N is used across the project
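
The design above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the course code: the function name AverageFilterSketch, the buffer name fifo, and the local definition of N (standing in for the one in Assign1.h) are all hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-in for the single N that the slides declare in Assign1.h.
constexpr std::size_t N = 16;

// FIFO holding the last N input values; fifo[N - 1] is the newest.
static float fifo[N] = {0.0f};

// Illustrative name only; not the function used in the course code.
float AverageFilterSketch(float newValue) {
    // Update the FIFO: shift every value down one slot, discarding the oldest.
    for (std::size_t j = 0; j < N - 1; j++) {
        fifo[j] = fifo[j + 1];
    }
    fifo[N - 1] = newValue;

    // Average the N stored values; the loop bound is N, never a literal.
    float sum = 0.0f;
    for (std::size_t j = 0; j < N; j++) {
        sum += fifo[j];
    }
    return sum / static_cast<float>(N);
}
```

Because both loops are bounded by N, changing N in one place resizes the buffer, the FIFO update and the sum together.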

Develop a personal software process to minimize mistakes and wasted time
• Do a code review for syntax errors
– Make a list of errors so that you can identify your most common mistakes AND STOP making them
• The number of syntax errors the compiler finds after your code review is related to the number of logical defects (unfound) in your code.
• The number of syntax errors the compiler finds after your code review is related to the amount of time you waste debugging your code looking for those hidden defects.
– Plan to spend 20% of your programming time doing code review

SHARC assembly code WAIL – handle timing issues now
• Can we call SHARC .asm code and return to C++ without crashing the system?
– Equivalent of RTS on Blackfin
– Equivalent of ??? on MIPS
• Can we access memory?
– Equivalent of R0 = [P0]; and [P1] = R1; on Blackfin
– Equivalent of ??? on MIPS
• Can we access memory without crashing the system?

Life cycle (RECAP)
• Design – wish-list of ‘stories’
• Write tests to show code would work
• Write C code to satisfy tests
• Generate resource chart based on processor architecture to calculate best ‘theoretical’ speed
– If the ‘real time theoretical speed’ works for you, then it is okay to try to optimize.
– If the ‘theoretical speed’ does not work for you, then ‘find a different algorithm’; optimization is not going to help.
• Modify already written tests to prove that the assembly code works as well as being fast

Did a project clean and then build
Do code review to find final error
This was a refactoring error when I changed file names
Change CONTROL and NOTIFY macros to include CPP (see next slide)

Mock Device Registers “satisfy linker”
CCES says “inconsistent” definition
• Poor mock – we move values in Audio Device registers by hand
• Can we “MOCK” Receive_ADC_Samples?
– Typical industrial testing approach, needed when hardware is “NOT-YET-DEVELOPED”

Better Simulation
• What if the algorithm is “by mistake” still doing Left_Out = Left_In (copy)? Then we would get the same answer
• Currently “LeftChannel_In1” is a fixed constant, making it difficult for us to check whether our algorithm would work for more complex signals
• So we could start testing the algorithm’s validity (not its speed) by changing LeftChannel_In1, by mocking the “ReceiveA2D( )” and “TransmitD2A( )” audio devices

Using ‘MockDevice.c’ loads (RECAP)
What do we do about ‘Receive_ADC_Samples( )’?
• These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it then (WAIL).

Mocked device inside Assign 1 Library
Can be used during Labs 1 to 4
MADE PRIVATE (FIXED): GOOD OR BAD IDEA?
VARIETY OF ALGORITHMS TESTED

Use GUI to add new test group for Averaging code – 3 styles of tests (RECAP)

Time test – measure in us (microseconds)
Must be less than 20 us per point (1 audio channel)
Un-automated, but we need to collect details and don’t have to do much analysis in lab

Interesting: CCES code ran much slower than VDSP code
CCES apparently has different C++ device buffer characteristics for printf( )

Refactored Project
Arrows indicate some changes

Some issues to take up with Analog Devices Engineering Zone

Theoretical Analysis
• We expect our theoretical analysis to be as fast as, or faster than, what the optimized C++ code takes
• We are not using any C++ DSP extensions, so we expect efficient rather than optimized code
• Is 816 cycles per sample processed by the Average Filter the speed we would expect, based on our understanding of the processor architecture?

Expectations
• First instruction after a jump takes 3 cycles to finish executing
• After that, each instruction, all things being equal, takes 1 cycle
• 1 cycle for a read, write, add, multiply
• D? cycles for a division

Averaging Filter Theoretical Analysis
• Fetch N values from memory -- N cycles
• Perform N add operations -- N cycles
• Go round the sum for-loop -- N * FLC cycles
– Where FLC is # instructions to handle For-Loop-Control
– Includes all overheads of jumping during the for-loop
• Exit for-loop (done once) -- EFL cycles
• Do division -- D cycles
• Return a value from function -- RV cycles
• Enter and exit Average routine -- EER cycles
AVERAGE_FILTER_TIME = N(2 + FLC) + EFL + D + RV + EER cycles
VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS

Averaging Filter Theoretical Analysis (continued)
• AVERAGE_FILTER_TIME = N(2 + FLC) + EFL + D + RV + EER cycles
• TIME / SAMPLE PROCESSED = { N(2 + FLC) + EFL + D + RV + EER } / N, which becomes 2 + FLC + (EFL + D + RV + EER) / N
• For N large – 2 + FLC cycles / sample processed
• Estimate FLC (loop control): 1 + 3 (once per loop) + 1 + 3 = 7
– START OF LOOP: compare, check compare, jump out of loop if needed
– END OF LOOP: increment counter, jump to start of loop
• For N large – expect 9 cycles / sample processed – mostly FLC!
• Timing test says 816 cycles per sample processed
• Either C++ is incredibly inefficient or we missed things
– For example, this for-loop overhead – but that would only add 9 cycles / sample processed
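
The per-sample estimate can be written as a small C++ helper to see how quickly the one-off costs fade as N grows. This is only a sketch of the model on the slide: the names OneOffCosts and cyclesPerSample are hypothetical, and any values passed for EFL, D, RV and EER are placeholders rather than measured figures.

```cpp
// One-off costs named on the slide, in cycles (values are supplied by the caller).
struct OneOffCosts {
    double efl;  // exit for-loop (done once)
    double d;    // division
    double rv;   // return a value from the function
    double eer;  // enter and exit the Average routine
};

// Cycles per sample processed: 2 (fetch + add) + FLC + one-off costs spread over N.
double cyclesPerSample(double n, double flc, const OneOffCosts& costs) {
    return 2.0 + flc + (costs.efl + costs.d + costs.rv + costs.eer) / n;
}
```

With FLC = 7 the result tends towards 9 cycles per sample as N becomes large, whatever the one-off costs are.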

Big problem
• Measuring the wrong thing
• We calculated the time to perform an average of N points in an N-point array
– Allows us to understand how to calculate the time for averaging P points
• 816 cycles is the time to average 1 input value
– Useless information, as we make no mention of how many points were averaged
– Need to rewrite tests

Optimizing C++ compiler
• The optimizing compiler knows more than we do about the processor architecture
• Use the C++ compiler as a tutor
– Look at the code generated
• Go into the Disassembly window and search for the optimized code as SHARC instructions

Unexpected behaviour explained
Also use control-shift-G

Did the C compiler remember an overhead for doing the FIFO update?
So our analysis could be even slower
Hardware loop counter is 0xFF = 255 – NOT FIFO_SIZE! Why?
VERY DIFFERENT REASONS: see next slides

For N large, only the loop code really counts
• FIFO update
• What does the syntax of the hardware loop mean?

• What’s SISD compared to SIMD?
• Another indication of special architectural features to handle DSP that we must understand AND USE

For N large, only the loop code really counts
• FIFO update
Do (pc, 0x06) -- means what?
Does it mean the loop starts at the current instruction location 12b8db and finishes at (or includes) 12b8db + 6 = 12b8e1?
Or does the loop start at the first instruction of the loop, 12b8de, and finish BEFORE 12b8de + 6 = 12b8e4 (2-instruction loop)?
Is the nop in the loop or not? It makes a 50% difference in DSP speed if it is inside the loop.
– We must understand the processor architecture

Timing -- for N large, only the loop code time really counts
• FIFO update
• Lcntr = 255, since we move (FIFO_SIZE – 1) values
• 0x?db is only executed once (loop set-up); then the loop switches to automatic (zero-overhead hardware loop)
• Loop is of size 2: the 0x?de and 0x?e1 instructions are each 48 bits in size (hence 0x?db says 0x6 is the loop size)
• 0x?e4 (nop) is only executed once
– It is a special safety feature, necessary when these “special loops” are executed, to avoid data races
• FIR filter hardware loops will be 1 cycle (dm[], pm[], + and * in 1 instruction) in the loop – need “TWO” safety nops, otherwise “possible race condition”

Averaging loop itself
• Doing a memory read and an add each cycle
• Averaging can be done at 1 cycle per point that way
• We do 256 moves and adds, yet the loop count is only 255! -- concept of loop unrolling
• Note the special multiply-by-2^N instruction: f2 = scalb f2 by r1, with r1 = 0xffff8 or -8, meaning multiply by 2^-8, which is (1 / 256) – 1-cycle division if power of 2
• One fetch before the loop, 255 adds and fetches inside the loop, leaving 1 add to finish outside the loop
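
The ‘one fetch before, one add after’ structure described above can be mimicked in C++ to show why the loop count is one less than the number of points. This is an illustrative sketch only; the function name sumWithPrimedFetch is hypothetical, and C++ cannot express that the add and the fetch issue in the same cycle.

```cpp
#include <cstddef>

// Sum n values using the structure the compiler generated:
// 1 fetch before the loop, (n - 1) add-and-fetch steps, 1 add after the loop.
float sumWithPrimedFetch(const float* data, std::size_t n) {
    if (n == 0) {
        return 0.0f;
    }
    float sum = 0.0f;
    float fetched = data[0];              // the fetch done before the loop
    for (std::size_t i = 1; i < n; i++) { // n - 1 iterations (255 when n = 256)
        sum += fetched;                   // add the value fetched last time...
        fetched = data[i];                // ...while fetching the next one
    }
    sum += fetched;                       // the final add, outside the loop
    return sum;
}
```

On the SHARC the two statements in the loop body share one instruction, which is what makes a 1-cycle loop body possible.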

Divide by power of two
• We need to understand why division of a floating point number by a power of two can be achieved in 1 cycle
• Related to, but very different from, doing integer division by “shifting” >> 8
• Need to “review” number representations – how are integer and floating point numbers stored and manipulated by software and hardware
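
A small C++ comparison may help here. It uses the standard library function std::ldexp, which scales a floating point value by a power of two by adjusting its exponent, as a portable analogue of the SHARC scaling instruction; the helper names scaleByPowerOfTwo and shiftDivide are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Floating point: multiply x by 2^exponent by changing only the exponent field.
// exponent = -8 gives x / 256 with no division instruction involved.
float scaleByPowerOfTwo(float x, int exponent) {
    return std::ldexp(x, exponent);
}

// Integer: divide a non-negative value by 2^bits by shifting the bit pattern right.
std::int32_t shiftDivide(std::int32_t x, int bits) {
    return x >> bits;
}
```

The integer shift moves every bit of the value, whereas the floating point scaling leaves the mantissa alone and only changes the exponent, which is why the two operations are related but not the same.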

Things we need to tell the compiler
• In principle you can vectorize the add
• Instead of doing
Loop 256
R1 = dm(I4, ?); // Fetch floating point number
F2 = F2 + F1;
• We can do
Switch to SIMD mode
Loop 128
R1 = dm(I4, ?), S1 = dm(I4 + 1, ?); // Fetch 2 values
F2 = F2 + F1, SF2 = SF2 + SF1; // Add 2 values
• Later we can cause the partial sums in F2 and SF2 to be added
If we can switch to SIMD in the right way – 50% speed improvement!
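
The two-partial-sums idea can be sketched in portable C++. This is only an analogue: lane0 and lane1 stand in for F2 and SF2, the function name sumTwoLanes is hypothetical, and nothing here actually switches a processor into SIMD mode.

```cpp
#include <cstddef>

// Sum n values using two independent partial sums, then combine them at the end.
float sumTwoLanes(const float* data, std::size_t n) {
    float lane0 = 0.0f;  // plays the role of F2
    float lane1 = 0.0f;  // plays the role of SF2
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {  // half as many iterations as a simple loop
        lane0 += data[i];
        lane1 += data[i + 1];
    }
    if (i < n) {
        lane0 += data[i];        // leftover value when n is odd
    }
    return lane0 + lane1;        // add the partial sums once, after the loop
}
```

Note that splitting a floating point sum into two partial sums changes the order of the additions, so the result may differ from the single-accumulator version in the last bits.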

Rough timing calculation can be performed based on “C” code
EXPECT TO DO THIS DURING EXAMS, QUIZZES AND LAB REPORTS

Rough timing calculation – cycles (expect to do this in lab reports, exams etc.)
Get in and out of routine: 20
First loop – update FIFO: (N – 1) * (read + write + loop control)
Insertion of new value: 2
Do sum: N * (read + write + loop control + sum) (forgot the ‘do adds’)
Divisions and store result: 3 * division + 2 writes
Total: (2N – 1) reads + (2N + 2) writes + 3 divisions + N sums + (2N – 1) loop control
Assume loop control is compare (1), check (1), increment (1), jump back (3 because of pipeline)
Assume read, write and add = 1 cycle each
Division = 10 (not many of these, so it SHOULD not matter if N is large, > 30)
Total – we can see that loop control is dominating – need to fix
(2N – 1) + (2N + 2) + 3 * 10 + N + 6 * (2N – 1) = around 17N
In our case N = 64, so 17N = 1088
Whereas we have 2698 or 1326 experimentally
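
The total above is easy to evaluate for any N with a few lines of C++. This sketch simply encodes the slide’s own expression and its assumed costs (1 cycle for read, write and add, 10 for a division, 6 for loop control); the names CycleCosts and roughCycles are hypothetical.

```cpp
// Per-operation costs assumed on the slide, in cycles.
struct CycleCosts {
    int read = 1;
    int write = 1;
    int add = 1;
    int division = 10;
    int loopControl = 6;  // compare (1) + check (1) + increment (1) + jump back (3)
};

// (2N - 1) reads + (2N + 2) writes + 3 divisions + N sums + (2N - 1) loop control.
int roughCycles(int n, const CycleCosts& c = CycleCosts{}) {
    return (2 * n - 1) * c.read
         + (2 * n + 2) * c.write
         + 3 * c.division
         + n * c.add
         + (2 * n - 1) * c.loopControl;
}
```

Expanding the expression gives 17N + 25, so for N = 64 it evaluates to 1113, consistent with the slide’s ‘around 17N’ figure of 1088.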

Things I have yet to learn how to do in CCES

Let’s switch to the SHARC simulator (VDSP screenshots here)
No new information here (except that the simulator runs slowly)
Need to run the CYCLE-ACCURATE PIPELINE VIEWER

Oh Bother And Damnation!
• Minor advantage – don’t have to run the slow simulator?

What other tools do we have?
More accurate way of timing than using TESTs with their strange overhead
• Done using SHARC cycle counters, which we can display
• Breakpoints we can set (don’t set on a for-loop)

Cheat – add breakpoint at dummy instruction

Calculation
• From line 84 to 106 – done once: 162151 – 15936 = 2625 cycles
• Add overhead of getting in and out of routine (20 cycles) = 2645 cycles
• Test timing results – using fast board, not slow simulator
• Reverse test list timing – via micro-sec calculation 2639 cycles, via ‘clock( )’ 5398
• Direct test list timing – via micro-sec calculation 2639 cycles, via ‘clock( )’ 2899
WARNING: 2645 should not be considered close to 2639 (possible coincidence) until we know whether software loops are generated by the C compiler in the way we assumed

(TMI) Took a shower to break my train of thought.
Look for code defect (now obvious)
Defect – should be cycles used; otherwise we are using ‘time since start of program’
Code cycles consistent via two different approaches

Modify tests so they can handle both CPP and ASM versions (cut-and-paste)
• It’s not the timing that’s the problem at this moment
• It’s ‘does the ASM and CPP code work’ at all!

• Probably stop here

Check what function needs developing
• Fix compiler error with prototype in ‘Assign1.h’
• Linker error message says ‘wrong prototype’ (NM)

Check to see if we can run the tests that call ASM code without crashing
C++ prototype: extern "C" void Function(void);
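
A minimal C++ sketch of the prototype idea follows. It assumes a hypothetical routine name, AverageFilterASM_Sketch, and supplies a C++ stand-in body so the example links and runs on a PC; in the real project the body would live in an assembly file, and the label the assembler must export is toolchain-specific.

```cpp
// Declaration as it might appear in a shared header.
// extern "C" turns off C++ name mangling, so the linker looks for a plain C symbol
// that an assembly file can provide.
extern "C" float AverageFilterASM_Sketch(float newValue);

// Stand-in body (assumption): replaces the assembly routine for this sketch only.
// It just returns its argument, which is enough to show call and return work.
extern "C" float AverageFilterASM_Sketch(float newValue) {
    return newValue;
}
```

Without extern "C" the C++ compiler would emit a mangled name, and the linker would not match it to the assembly label, which is one way to get a ‘wrong prototype’ style of link error.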

Now add assembly code FIFO stack
• Temp fixes
– Remove the { } syntax for the init-to-zero code – WAIL
– Need to declare N using ‘Assign1.h’, which is C++
• Do a quick local declaration of N = 64 to see if the coding problem is fixed before we start worrying about ‘Assign1.h’

How to declare arrays in assembly
• This does not work
• Need to look in the Assembly language manual
– Copy available from the Analog web site or the ENCM 515 website
Find better way later

I made the code ‘more general’
#define .byte4 .var // Home-made defect remover
• Now move ‘all’ defines into ‘Assign1.h’ so that the same N gets used by CPP and ASM code and by TEST
• Does not work – C++ syntax confuses the assembler

Best ‘temp’ fix I could find
• Use this type of syntax in ‘Assign1.h’
– Conditional code generation
• And in assembly code files
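
The slide’s screenshot is not available, so the following is only a guess at the general shape of such a header, not the author’s actual fix. It assumes the standard __cplusplus macro is used as the guard; the macro a particular assembler preprocessor defines is toolchain-specific, and the names kFifoLength and AverageFilterASM_Sketch are hypothetical.

```cpp
// Sketch of a header shared by C++ files and assembly files.

// Plain preprocessor constant: no C++ syntax, so an assembler's preprocessor
// can also read it. One N for the CPP code, the ASM code and the tests.
#define N 64

#ifdef __cplusplus
// Everything in this section is hidden from the assembler,
// because only a C++ compiler defines __cplusplus.
extern "C" float AverageFilterASM_Sketch(float newValue);
constexpr int kFifoLength = N;
#endif
```

Keeping the shared constants as bare #define lines and fencing off the prototypes is what stops C++ syntax from reaching the assembler.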

Initial testing done with small N
N = 4 (as we can work out the expected result)
• Write the test – C++ code expected to pass
– 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4

Look for ‘out-by-one’ errors in loops
Common DSP mistake
• Remember to fix the error in the ASM ‘pseudo code’

Initial testing done with small N
N = 4 (as we can work out the expected result)
• Write the test – C++ code expected to pass
– ASM code MUST fail the test, otherwise the test is poor
– It must fail, as there is no ASM code to allow a pass to occur. This is the TEST of the TEST
Now have 4 tests passing rather than 3, including the ASM test
INDICATES BAD TEST – WHY?

Improved test
Don’t allow the ‘old correct value’ from the C++ test to remain in the output
Defect might have been identified by reversing the test order
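
The failure mode can be reproduced in a few lines of C++. This is a sketch under assumptions: a shared output variable filterOutput, a C++ filter that writes 3.3 into it, and an empty stand-in for the not-yet-written ASM routine; all names are hypothetical and no real test framework is used.

```cpp
// Shared output location that both filter versions are supposed to write.
static float filterOutput = 0.0f;

// Stand-in for the working C++ filter: produces the expected 3.3.
void AverageCPP_Sketch() { filterOutput = 3.3f; }

// Stand-in for the ASM filter before any code is written: does nothing.
void AverageASM_StubSketch() {}

// Poor test: trusts whatever is already sitting in the output.
bool poorTest(void (*filter)()) {
    filter();
    return filterOutput == 3.3f;
}

// Improved test: overwrite the output with a known-wrong value first.
bool improvedTest(void (*filter)()) {
    filterOutput = -1.0f;
    filter();
    return filterOutput == 3.3f;
}
```

If the poor test runs the C++ version first, the empty ASM stub then ‘passes’ because the old correct value is still in the output; the improved test fails the stub regardless of test order.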

Modify Embedded Unit main( ) to allow this to happen

Might not get any further
Go over again in next class
Do ‘software loop control’ in tutorial
Need to understand the if – then – else construct

What registers can we use in assembly?
• Don’t use these without performing a save immediately and a recover operation later.
• Otherwise C and C++ will crash
• These are okay to use in assembly

Set up the FIFO adjust loop
Need to set up ‘loopMax = N – 1’
THIS CONSTANT OKAY
THIS CONSTANT BAD!

What’s the error here?
RELIABLE METHOD: PLACE CONSTANT IN REGISTER BEFORE USE

Here’s the full software loop structure
Each time around the loop – 9 cycles for control, not the 5 we thought

dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register (volatile)
• dm(2, I4) is a pre-modify memory operation
– The 2 is before the I4, hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
• ADD IS NOT performed in parallel with other operations?
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) is a post-modify memory operation
– The 2 is after the I4, hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2
• DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)

Other bits of code needed

Add assembly language ‘externs’ to ‘Assign1.h’
• Still have not coded the division – fake it by hard-coding * 1/4
• There must be an easier way to code memory access
– Yes – use post-increment operations using pointers, rather than array indexing

Code fails -- the most likely place to look for defects is in loop operations
Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops

Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run time’
Code should now work for N = 64, so we can compare timing with the C code
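
The C++ counterpart of this trick is a compile-time constant. The sketch below is an analogue rather than the assembler syntax used in the course: N is redefined locally as a stand-in for the shared definition, and the names kOneOverN and scaleSum are hypothetical.

```cpp
// Hypothetical local stand-in for the shared N.
constexpr int N = 64;

// Evaluated by the compiler, so no division is executed at run time.
constexpr float kOneOverN = 1.0f / N;

// Turn a sum of N values into an average with a single multiply.
float scaleSum(float sum) {
    return sum * kOneOverN;
}
```

For N = 64 the reciprocal is an exact power of two, so multiplying by it gives the same result as dividing; for other N the two can differ in the last bit.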

If we believe the tests, then calculation accuracy is lower (5E-06 for larger N)
Despite lousy ASM code, we are already beating the compiler in ‘debug’ mode (by around 2N)

Before optimizing, we need to add a few more tests to check that the code is valid
Uses the sum of the first N integers: N (N + 1) / 2
Accuracy now set to 1E-5
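
Such a check might look like the following C++ sketch. It assumes the 1E-5 accuracy is an absolute tolerance on the average, which the slide does not state; the helper names averageOfRamp and withinTolerance are hypothetical and no particular test framework is implied.

```cpp
#include <cmath>

// Average of the ramp 1, 2, ..., n, accumulated in single precision
// the way a filter would accumulate it.
float averageOfRamp(int n) {
    float sum = 0.0f;
    for (int k = 1; k <= n; k++) {
        sum += static_cast<float>(k);
    }
    return sum / static_cast<float>(n);
}

// True when actual is within an absolute tolerance of expected.
bool withinTolerance(float actual, float expected, float tolerance) {
    return std::fabs(actual - expected) <= tolerance;
}
```

Since the sum of the first N integers is N (N + 1) / 2, the expected average is (N + 1) / 2, which gives the test a value that can be worked out by hand for any N.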

Use post-modify address mode: sum = sum + *pt++; (N = 64)
2-cycle stall till M4 is ready to use?
• ASM was 2400 cycles (N = 64), is now 2208
– Expect improvement of N = 64 cycles (2 instead of 3 instructions)
– Get (2400 – 2208) = 192, which is exactly 3 * N = 192 cycles faster

dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register
• dm(2, I4) is a pre-modify memory operation
– The 2 is before the I4, hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) is a post-modify memory operation
– The 2 is after the I4, hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)
• POST-MODIFY OFFERS THE OPPORTUNITY FOR THE PROCESSOR ARCHITECTURE TO DO THE ADD IN PARALLEL WITH OTHER PIPELINE STAGES
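
The two addressing modes map onto familiar C++ pointer idioms. In this sketch the pointer i4 stands in for the index register I4, and the function names preModifyRead and postModifyRead are hypothetical; it illustrates the addressing behaviour only, not the timing.

```cpp
// Pre-modify, like dm(2, I4): read from (i4 + 2) and leave i4 unchanged.
// This is ordinary array indexing.
float preModifyRead(const float* i4) {
    return i4[2];
}

// Post-modify, like dm(I4, 2): read from i4, then advance i4 by 2.
// The pointer is passed by reference so the update is visible to the caller.
float postModifyRead(const float*& i4) {
    float value = *i4;
    i4 += 2;
    return value;
}
```

With a modifier of 1, the post-modify form is the *pt++ idiom used for walking through an array one element at a time.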

Using pre-modify and post-modify addressing – replace 6 instructions by 2
Expect 4 * N faster (256)
Was 2208, is 1704 = 504 cycles faster
Close to N * 8 (512) faster!

Need to force “C++” to optimize
CONCLUSION
We have a lot more to learn about using the processor architecture correctly in order to get HIGH-SPEED DSP CODE
NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS
• Our ASM code: 1704 cycles
• Optimized “C”: 205 cycles – 1500 cycles faster, or roughly N * 23.5 cycles faster
WE KNOW MORE, so should be able to write faster code (if we need to)
• FIFO loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) = 256
• Loop control = 2 * 64 * 9 + into / out of subroutine 20 + other 10 = 1182
– Our ASM = 1468 + 236 unaccounted for (N * 3.7, or nearly N * 4)