
Lab. 3 information
M. R. Smith, smithmr@ucalgary.ca
12/16/2021

Recap of Lab. 2
- Task 1: Check Lab. 1 code still works -- PRELAB for Lab. 2
- Task 2: Check that Lab. 1 Tests are sufficient -- PRELAB for Lab. 2
- Task 3: Timing of existing optimized and un-optimized C++ subroutines -- PRELAB for Lab. 2
  - Possible solution to “clicking error” heard during Lab. 1
  - Possible refactoring of subroutines
- Task 4: Demonstration of C++ functionality of Lab. 1 (start of Lab. 2)
- Task 5: Generation of Tests for (un-optimized) assembly code
- Task 6: Create and validate assembly language version of “IndexAdjust( )”
- Task 7: Create and validate assembly language version of “FIR( )”
- Task 8: Demonstration of functionality
- Task 9: Timing of un-optimized assembly code

12/16/2021 ENCM 515 -- Lab. 3, 2007 Copyright smithmr@ucalgary.ca

Overview of Lab. 3
- Task 1: Check Lab. 2 code still works -- PRELAB for Lab. 3
- Task 2: Modify C++ FIR code to allow dm and pm operations -- PRELAB for Lab. 3
- Task 3: Modify (and optimize) ASM FIR code to allow parallel dm and pm operations -- PRELAB for Lab. 3
- Task 4: Demonstration of C++ functionality of Lab. 2 (start of Lab. 3)
- Task 5: Add hardware circular buffer for FIR
- Task 6: Adjust the (internal) FIR code so that the loop executes N / 2 times. Optimize code to demonstrate the parallel operations possible inside the new “longer” loop. Create and validate assembly language version of “IndexAdjust( )”
- Task 7: Adjust the (internal) FIR code by “unrolling the loop” and obtain maximal SISD operations
- Task 8: In a new directory Lab. 3 -- adjust your C++ code to allow the left and right channels to operate at the same time using maximal SIMD operations.

Not part of Lab. 3
- Demonstrating the issues surrounding dual and quad memory fetches (DAB usage)
- Demonstrating dual processor operation.
  - This could be done by taking the Lab. 3 Task 7 code and running one version in DSP-A (left channel code) and another in DSP-B (right channel code).
  - Alternatively we may look at an Analog Devices example code using FFT on both processors to implement filters

Task 1: Check Lab. 2 code still works
- This is used to check the functionality of the laboratory station

Task 2A. Modify C++ code for dm and pm
- This is a major change in your code, and you want to make sure you can still run all code and tests from Lab. 1 and Lab. 2. I suggest the following solution.
- Make a new directory Lab3
  - Copy all files from the Lab1 directory into the Lab3 directory.
  - In the Lab3 directory, rename Lab1.dpj to Lab3.dpj. Remove the files Lab1.pcf and Lab1.mak.
  - Enter the directory Lab3/debug and delete all files in that directory.
  - Activate Lab3.dpj.
  - Inside ALL files in the Lab3 directory, change all references from #include "../Lab1/XXX" to #include "../Lab3/XXX"
  - Using the project options, change the program name to Lab3
  - Rebuild and test.
- Make a new directory Lab3Tests
  - Copy all files from the Lab1Tests directory into the Lab3Tests directory.
  - In the Lab3Tests directory, rename Lab1Tests.dpj to Lab3Tests.dpj. Remove the files Lab1Tests.pcf and Lab1Tests.mak.
  - Enter the directory Lab3Tests/debug and delete all files in that directory.
  - Activate Lab3Tests.dpj.
  - Delete the references to all files from the Lab1 directory and replace them with references to the files from the Lab3 directory
  - Inside ALL files in the Lab3Tests directory, change all references from #include "../Lab1/XXX" to #include "../Lab3/XXX"
  - Using the project options, change the program name to Lab3Tests
  - Rebuild and test.

Task 2B. Modify C++ code for dm and pm
- Suggest that you store the filter coefficients in pm space to minimize the amount of code changes you have to make.
- Fix the FIR( ) code, “CommonCB_CPPCode.h” and other include files to allow dm and pm operations
  - Change the function prototypes from X(float *p, float *q) to X(dm float *p, pm float *q)
  - Change the array declarations from float p[N], q[N]; to dm float p[N]; pm float q[N];
  - In tests, you will have to move the filter coefficients “off the stack” inside a TEST to become a global.
- In the timing test, call the code M times, where M = 1, 100 and 1000
  - Report “time / each execution” rather than “time for all executions”.
  - You would expect the time per execution to be independent of M
  - There may be instruction cache issues and data cache issues as you are rerunning the code
- This optimization technique, accessing two memory locations at the same time, should make a big difference in C++ code speed.
  - I would be surprised if you did not get at least a factor of 2 to 4 speed improvement.
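The prototype change and the “time / each execution” report can be sketched as follows. This is only a host-side illustration under stated assumptions: dm and pm are defined away as empty macros so the code compiles with a standard C++ compiler, std::chrono stands in for whatever cycle counter your Lab. 2 timing code used, and the names FIR_DMPM, SecondsPerCall, g_data and g_coeffs are hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Assumption: host build. The dm / pm qualifiers come from the DSP compiler,
// so they are defined away here; delete these two lines on the lab station.
#define dm
#define pm

const int N = 4;

// Globals, not stack variables inside a TEST, so that the coefficients
// can be placed in pm space.
dm float g_data[N]   = {1.0f, 2.0f, 3.0f, 4.0f};
pm float g_coeffs[N] = {0.25f, 0.25f, 0.25f, 0.25f};

// New-style prototype: data pointer in dm, coefficient pointer in pm.
float FIR_DMPM(dm float *data, pm float *coeffs, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += data[i] * coeffs[i];
    }
    return sum;
}

// Calls the filter M times and reports the time for ONE execution,
// not the time for all M executions.
double SecondsPerCall(int M) {
    assert(M > 0);
    volatile float sink = 0.0f;  // stops the compiler discarding the calls
    auto start = std::chrono::steady_clock::now();
    for (int m = 0; m < M; m++) {
        sink = FIR_DMPM(g_data, g_coeffs, N);
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / M;
}
```

Calling SecondsPerCall(1), SecondsPerCall(100) and SecondsPerCall(1000) and comparing the three values is one way to see whether cache effects show up between the first run and the reruns.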

Task 3. Modify ASM code for dm and pm
- Step 1 -- You should find that only the name of the ASM function needs changing to make the code work
  - Dm and pm pointers still come in as parameters in J registers
  - Both dm and pm memory locations can be accessed via J registers, so the code should work with only a name change. No code speed change expected
- Step 2 -- Change the J register accessing the filter coefficients to be a K register.
  - Before: filterValue = [J5 += 1];;
  - After: K5 = J5;; at start of routine, then filterValue = [K5 += 1];;
  - No code speed change expected
- Step 3 -- Pair up the data and filter coefficient fetches in a single instruction line
  - Before: dataValue = [J4 += 1];; filterValue = [K5 += 1];;
  - Now: dataValue = [J4 += 1]; filterValue = [K5 += 1];; (parallel code)
  - Expected speed improvement: N cycles faster, as you have removed one instruction line from each of the N passes around the loop
  - Actual speed improvement: no idea. It depends on how you wrote your code, which STALLs you have removed and which ones you have added.
- Time both optimized and un-optimized versions of the C++ code
  - Later in Lab. 3 (Task 8) we will be doing SIMD operations where the left-filter operation is performed in the X-Compute block, while the right filter is done in the Y-Compute block
  - Plan ahead for these SIMD operations so that your timing tests calculate the time for processing both the left and the right channels. The answer may not be the same as twice the time of processing just the left channel.

Task 4: Demonstrate functionality
- Get checked off by the T.A. or myself on the following
  1) Make a Lab. 2 marking sheet showing Lab. team names and indicate “all your own work”.
- Add the following lines to the marking sheet
  2) Sound moves from side to side -- C++
  3) Demonstrated low-pass FIR filter operation -- dm and pm C++
  4) Demonstrated high-pass FIR filter operation -- dm and pm C++
  5) Sound moves from side to side -- ASM
  6) Demonstrated low-pass FIR filter operation -- dm and pm ASM
  7) Demonstrated high-pass FIR filter operation -- dm and pm ASM
- Note: DSP-A can directly read DSP-B flags when DSP-B is running. The hand-shaking is automatically handled by the bus interface. In principle you can use the DSP-B buttons to choose between FIR operations.

Task 5: Hardware circular buffer operation
- Currently, as you access the data you are doing something like this in software
    value = [pointer++];
    if (pointer >= &data_array[N]) pointer -= N;
  - The software CB operation costs extra cycle(s) every time around a short loop
- We can improve the FIR efficiency by using hardware CB
- Note that this task should basically require
  - Remove the pointer IF operation
  - Change value = [J1 += 1] to value = CB [J1 += 1]
  - Hardware CB operations only work on post-modify operations
  - Midterm 2 question: why do hardware CB operations only work on post-modify operations?
- Make a new copy of FIR_ASM as FIR_ASM_CB
- Add new test files
- Fix the assembly code to use CB operations and check timing for the expected speed improvement
  - You might not get any improvement because of changes in the COMPUTE and MEMORY stalls that occur after you remove the pointer operations.
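The software circular buffer described above can be written out in C++ as a reference model for the new FIR_ASM_CB tests. This is a sketch under assumptions: the function name FIR_SoftwareCB and the start parameter (index of the first sample to read) are hypothetical, and your own FIR( ) may walk the buffer in a different direction.

```cpp
#include <cassert>

// FIR over a circular data buffer with the wrap-around done in software.
// data   : circular buffer holding n samples
// coeffs : n filter coefficients
// start  : index in data[] of the first sample to use (hypothetical parameter)
float FIR_SoftwareCB(const float *data, const float *coeffs, int n, int start) {
    assert(n > 0 && start >= 0 && start < n);
    const float *pointer = data + start;
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float value = *pointer++;   // value = [pointer++]
        if (pointer >= data + n) {  // this IF is what the hardware CB removes
            pointer -= n;
        }
        sum += value * coeffs[i];
    }
    return sum;
}
```

The hardware CB version should return the same values as this model for every start index, which gives a direct functional test before you look at the timing.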

Task 6 -- Doubling the loop
- Task 6: Adjust the (internal) FIR code so that the loop executes N / 2 times. Optimize code to demonstrate the parallel operations possible inside the new “longer” loop. Create and validate.
- Your code is currently
    For (i = 0; i < N; i++) {
        Fetch H[i], D[i];
        Do “I” mult, do “I” sum;
    }
- Make your code look like this, with j = 2 * i. Note the two “sum” variables
    For (i = 0; i < N / 2; i++) {
        Fetch H[j], D[j];
        Do “J” mult, do “J” sum1;
        Fetch H[j + 1], D[j + 1];
        Do “J + 1” mult, do “J + 1” sum2;
    }
    Sum = sum1 + sum2
- Each fetch has the form value = *h++
- To get maximum speed and avoid stalls, you will need to change register names etc. and re-order instructions.
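A C++ reference version of the doubled loop is handy for validating the assembly code. The sketch below assumes N is even and uses the hypothetical name FIR_TwoSums; H and D follow the pseudocode above.

```cpp
#include <cassert>

// FIR with the loop body doubled: the loop runs n / 2 times and keeps two
// independent partial sums, so the two multiply-adds do not depend on
// each other.
float FIR_TwoSums(const float *H, const float *D, int n) {
    assert(n >= 0 && n % 2 == 0);  // assumption: even number of taps
    float sum1 = 0.0f;
    float sum2 = 0.0f;
    for (int i = 0; i < n / 2; i++) {
        int j = 2 * i;
        sum1 += H[j] * D[j];          // "J" mult and sum1
        sum2 += H[j + 1] * D[j + 1];  // "J + 1" mult and sum2
    }
    return sum1 + sum2;
}
```

Because the additions happen in a different order than in the single-sum loop, the floating-point result can differ in the last bits, so tests that compare the two versions may need a small tolerance.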

Task 7: Unrolled FIR loop
- Your original asm code was
    For (i = 0; i < N; i++) {
        Fetch H[i], D[i];
        Do “I” mult, do “I” sum;
    }
- Change it to have more of the form
    Fetch H[0], D[0];
    Do H[0] * D[0], Fetch H[1], D[1];
    For (i = 2; i < N; i++) {
        Do sum += H[i - 2] * D[i - 2], Do H[i - 1] * D[i - 1], Fetch H[i], D[i];
    }
    Do sum += H[N - 2] * D[N - 2], Do H[N - 1] * D[N - 1]
    Do sum += H[N - 1] * D[N - 1]
- However, this code will have lots of stalls, so you will need to “unwrap the loop” more (as in Task 6) so that you are doing 2 adds and 2 multiplications in each loop
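The staggered fetch / multiply / add pattern above can be checked in C++ before it is written in assembly. This is a sketch assuming N >= 2, with the hypothetical name FIR_Pipelined; each statement in the loop stands for one of the three operations that share an instruction line in the ASM version.

```cpp
#include <cassert>

// FIR with the fetch, multiply and add stages staggered: in each pass the
// add uses the product from the previous pass, and the multiply uses the
// values fetched in the previous pass.
float FIR_Pipelined(const float *H, const float *D, int n) {
    assert(n >= 2);  // assumption: at least two taps
    float sum = 0.0f;
    float h = H[0], d = D[0];  // Fetch H[0], D[0]
    float product = h * d;     // Do H[0] * D[0]
    h = H[1]; d = D[1];        // Fetch H[1], D[1]
    for (int i = 2; i < n; i++) {
        sum += product;        // sum += H[i - 2] * D[i - 2]
        product = h * d;       // H[i - 1] * D[i - 1]
        h = H[i]; d = D[i];    // Fetch H[i], D[i]
    }
    sum += product;            // sum += H[n - 2] * D[n - 2]
    product = h * d;           // H[n - 1] * D[n - 1]
    sum += product;            // sum += H[n - 1] * D[n - 1]
    return sum;
}
```

The additions occur in the same order as in the original loop, so this version should give bit-identical results to the simple FIR for the same inputs.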

Task 8 -- SIMD operation
- Task 8A: You have
    FIR(LeftArray, leftCoeffs)
    FIR(RightArray, RightCoeffs)
- Change to use a new function
    FIRBoth(LeftArray, leftCoeffs, RightArray, RightCoeffs) {
        FIR(LeftArray, leftCoeffs)
        FIR(RightArray, RightCoeffs)
    }
  -- That way the “C” code needs only minor modification
- Change to use an ASM function
    FIRBothASM(LeftArray, leftCoeffs, RightArray, RightCoeffs) {
        Modify your current FIRASM code so that the Left channel is processed in XFR and the Right channel is processed in YFR
    }
- Note: maximum optimization would be obtained if you had one audio array arranged L[0], R[0], L[1], R[1] … so you could bring data values into XR and YR registers at the same time using an L[ ] operation. We will leave this as “theoretical knowledge” rather than totally rearranging all of Lab. 1, Lab. 2 and Lab. 3 to make it happen
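A C++ sketch of the FIRBoth wrapper, plus the “theoretical” interleaved layout, is given below for use as a test reference. Several things here are assumptions and not part of the lab code: the explicit tap count n, the StereoSample return type, and the name FIRBothInterleaved.

```cpp
#include <cassert>

// Hypothetical return type holding one output sample per channel.
struct StereoSample {
    float left;
    float right;
};

// Plain single-channel FIR, standing in for your existing FIR( ).
float FIR(const float *data, const float *coeffs, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += data[i] * coeffs[i];
    }
    return sum;
}

// Task 8A wrapper: one call that processes both channels.
StereoSample FIRBoth(const float *leftArray, const float *leftCoeffs,
                     const float *rightArray, const float *rightCoeffs, int n) {
    StereoSample out;
    out.left = FIR(leftArray, leftCoeffs, n);
    out.right = FIR(rightArray, rightCoeffs, n);
    return out;
}

// "Theoretical" layout: one array arranged L[0], R[0], L[1], R[1], ...
// so that each pass touches an adjacent left / right pair.
StereoSample FIRBothInterleaved(const float *lr, const float *leftCoeffs,
                                const float *rightCoeffs, int n) {
    assert(n >= 0);
    StereoSample out = {0.0f, 0.0f};
    for (int i = 0; i < n; i++) {
        out.left += lr[2 * i] * leftCoeffs[i];
        out.right += lr[2 * i + 1] * rightCoeffs[i];
    }
    return out;
}
```

Timing FIRBoth against two separate FIR( ) calls gives the baseline that the SIMD FIRBothASM version is compared with.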

Not part of Lab. 3
- Demonstrating the issues surrounding dual and quad memory fetches (DAB usage)
- Demonstrating dual processor operation.
  - This could be done by taking the Lab. 3 Task 7 code and running one version in DSP-A (left channel code) and another in DSP-B (right channel code).
  - Alternatively we may look at an Analog Devices example code using FFT code running on both processors to implement left (DSP-A) and right (DSP-B) filters
