DMA example: Video image manipulation
Video, Copyright M. Smith, ECE, University of Calgary, Canada 12/22/2021

Problem to solve
• Build video images in SDRAM
• Scale all the images (increase grey scale by a fixed scaling factor)
• Determine whether it is more efficient to
    1. Work using the images in SDRAM
    2. Bring images from SDRAM (using DMA), scale them, then put back
    3. Use a multi-threaded version of task 2
1. Multiplication and division issues
2. Some possible Q9 areas for the final

Video image
    Blanking information
    Frame 1 - luminance + colour information
    Blanking information
    Frame 2 - luminance + colour information
    Blanking information
Need the ability to manipulate frame information without touching blanking information

Frame information
    CB1 G1 CR1 G2 CB3 G3 CR3 G4 CB5 G5 CR5 G6
• Pixel 1 uses G1 + CB1 + CR1
• Pixel 2 uses G2 + CB1 + CR1
• Pixel 3 uses G3 + CB3 + CR3
• Pixel 4 uses G4 + CB3 + CR3
Image brightness decreasing
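The sample sharing listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of a flat Python list for the frame, and the 8-bit clamp (`max_value=255`) are assumptions, not something the slides specify.

```python
def unpack_pixels(stream):
    """Expand a flat [CB, G, CR, G, CB, G, CR, G, ...] stream into (G, CB, CR) per pixel."""
    if len(stream) % 4 != 0:
        raise ValueError("stream must hold whole CB G CR G groups")
    pixels = []
    for i in range(0, len(stream), 4):
        cb, g_first, cr, g_second = stream[i:i + 4]
        # Two neighbouring pixels share one CB and one CR sample.
        pixels.append((g_first, cb, cr))
        pixels.append((g_second, cb, cr))
    return pixels


def scale_luminance(stream, factor, max_value=255):
    """Scale only the G samples (odd positions), leaving CB and CR untouched."""
    scaled = list(stream)
    for i in range(1, len(scaled), 2):
        # Clamp so a bright sample does not wrap around (assumed 8-bit samples).
        scaled[i] = min(max_value, scaled[i] * factor)
    return scaled
```

Calling `unpack_pixels` on one `CB G CR G` group returns two pixels that carry the same colour pair, which is why scaling brightness only needs to touch every second sample.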

Set up
    TEST
    Tasks done one after another
    Tasks done with DMA occurring at the same time as other tasks

3 threads – sequential
Scaling intensity by 19

Task being performed
Note: the instructions appear out of order relative to the associated C++ code
Loop involves 1 read / 1 write + 2 operations not involving r/w memory, which gives the DMA operation some bus bandwidth to work with

Three threads in parallel
Not the best solution?
    Start first DMA transfer – wait
    Start second DMA transfer, start doing math operation (done in parallel)
    Wait till second DMA done
    Transfer math results back – wait
    Start third DMA transfer, start doing math operation (done in parallel)
    Wait till third DMA done
    Transfer math results back – wait
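The schedule above can be mimicked on a desktop with a single worker thread standing in for the DMA engine. This is a rough sketch only: `fake_dma_copy` and `scale_blocks_overlapped` are hypothetical names, a list copy replaces the real SDRAM transfer, and Python threads do not model bus contention.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_dma_copy(block):
    """Stand-in for a DMA transfer: returns a copy of the block."""
    return list(block)


def scale_blocks_overlapped(blocks, factor):
    """Scale each block while the next block's transfer is already in flight."""
    if not blocks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start first transfer and wait for it.
        current = dma.submit(fake_dma_copy, blocks[0]).result()
        for next_block in blocks[1:]:
            pending = dma.submit(fake_dma_copy, next_block)   # start next transfer
            scaled = [value * factor for value in current]    # math runs meanwhile
            current = pending.result()                        # wait till transfer done
            # Transfer math results back and wait.
            results.append(dma.submit(fake_dma_copy, scaled).result())
        # The last block has no following transfer to overlap with.
        scaled = [value * factor for value in current]
        results.append(dma.submit(fake_dma_copy, scaled).result())
    return results
```

The result-write-back step still blocks, which matches the "wait" steps in the list and hints at why the slide asks whether this is the best solution.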

Results of the tests
• Need to use “profiling of the code” to determine where the “waste of time” now is

Multiplication code – 16 bit
Note: the instructions appear out of order relative to the associated C++ code
IS -- integer signed multiplication
FS – fractional signed (form of block floating point) – on many processors

Multiplication details

Multiplication possibilities
• R1.L = R2.L * R3.L;    // Using multiplier 0
• R1.H = R2.H * R3.H;    // Using multiplier 1
• R1.L = R2.L * R3.L, R1.H = R2.H * R3.H;    // Using both multipliers in parallel
• R2 = [P0++]; R3 = [P1++];
  R1.L = R2.L * R3.L, R1.H = R2.H * R3.H;
  [P2++] = R1;
• R1.L = R2.L * R3.L, R1.H = R2.H * R3.H || R4 = [P0++] || R5 = [I1++];

Multiply and add test

Multiply and add result and code -- in SDRAM
3 cycle loop
Note the special MAC instruction A0 += R0.L * R1.L (IS), which involves both an ADD and a multiplication
MAC – multiply and accumulate
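What the MAC instruction computes can be modelled in Python as below. This is a sketch under assumptions: the helper names are invented for illustration, registers are passed as plain integers, and the accumulator is assumed to saturate at 40 bits in (IS) mode rather than wrap.

```python
ACC_BITS = 40
ACC_MAX = (1 << (ACC_BITS - 1)) - 1
ACC_MIN = -(1 << (ACC_BITS - 1))


def to_int16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac_is(acc, x, y):
    """Model of A0 += x * y (IS): signed integer multiply, saturating 40-bit add."""
    total = acc + to_int16(x) * to_int16(y)
    return max(ACC_MIN, min(ACC_MAX, total))


def dot_product(values, coeffs):
    """Accumulate values[i] * coeffs[i], one MAC per pair."""
    acc = 0
    for x, y in zip(values, coeffs):
        acc = mac_is(acc, x, y)
    return acc
```

Each call to `mac_is` stands for one multiply plus one add, which is the pairing the slide points out.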

MAC syntax details

Hints at possible advantage
    A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = [P0++] || R5 = [I1++];
Involves 2 multiplies
Involves 4 adds -- A0 +=, A1 +=, P0++ and I1++
Involves 2 memory reads
    MNOP || R2 = W[P0++] (X) || R3 = W[I1++] (X);    // MNOP multiplier NOP
    P1 = 100 - 2;
    LSET (START, FINISH) LC1 = P1 >> 1;              // Go round the loop 49 times
    START:  A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = W[P0++] (X) || R5 = W[I1++] (X);
    FINISH: A0 += R4.L * R5.L, A1 -= R4.H * R5.H || R2 = W[P0++] (X) || R3 = W[I1++] (X);
Using R2, R3 and then R4, R5 in an attempt to avoid pipeline issues
May not be required – would have to examine pipeline viewer to see what happens
FINAL EXAM REVIEW -- What is the syntax error?

Multiply and accumulate operation
• Filter operation on 16-bit values
    sum = 0;
    for count = 0 to N – 1
        sum = sum + value[count] * coeff[count];
    sum = sum / N;
• Does not take much to overflow a signed sixteen-bit register
    value1 = 32000; value2 = 32000;
    value1 + value2 is 64000, which wraps to -1536 as a signed 16-bit value
• value1 = value2 = 32000; coeff1 = coeff2 = 32000;
    value1 * coeff1 + value2 * coeff2 has overflowed as a 32-bit value (in the default fractional mode each product is doubled; as plain integers the sum 2,048,000,000 only just fits and a third product overflows)
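The overflow cases can be checked with a short Python sketch. Python integers never overflow, so the wrap-around is emulated by a hypothetical `wrap_signed` helper; the doubling in the last line assumes the fractional multiply shifts each product left by one bit.

```python
def wrap_signed(value, bits):
    """Wrap value into a signed two's complement integer of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


sum16 = wrap_signed(32000 + 32000, 16)            # 64000 does not fit in 16 bits
product = 32000 * 32000                           # 1,024,000,000
int_sum32 = wrap_signed(2 * product, 32)          # two integer products
int_sum32_three = wrap_signed(3 * product, 32)    # three integer products
frac_sum32 = wrap_signed(2 * (product << 1), 32)  # two doubled (fractional) products
```

Here `sum16` comes out negative, `int_sum32` is still the true sum, and both `int_sum32_three` and `frac_sum32` wrap to negative numbers, so the margin in 32 bits is at most one extra term.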

Multiply and accumulate operation – solving the problem
• Does not take much to overflow a signed sixteen-bit register
    value1 = 32000; value2 = 32000;
    value1 + value2 is 64000, which wraps to -1536 as a signed 16-bit value
• value1 = value2 = 32000; coeff1 = coeff2 = 32000;
    value1 * coeff1 + value2 * coeff2 has overflowed as a 32-bit value (in the default fractional mode each product is doubled; as plain integers the sum 2,048,000,000 only just fits and a third product overflows)
1. Take all input values and divide by N. This guarantees that the sum of N values will not overflow the number representation, but does not give an accurate answer: what if the inputs are 32000, 16000 today but 1, 3, 5, 7 tomorrow?
2. Use a special 40-bit register for storing the sum. Makes it less likely to cause an overflow. Do a theoretical calculation to determine how many bits are needed to store an accurate answer
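Both options can be put into numbers with a small Python sketch. The function names are illustrative, the "theoretical calculation" is reduced to a worst-case bound (every product at its largest magnitude), and option 1 is assumed to use truncating integer division.

```python
def max_terms(acc_bits, product_magnitude):
    """Largest number of worst-case products a signed accumulator can hold."""
    return ((1 << (acc_bits - 1)) - 1) // product_magnitude


def average_predivided(values):
    """Option 1: divide every input by N first (integer division), then sum."""
    n = len(values)
    return sum(v // n for v in values)


def average_wide(values):
    """Option 2: sum in a wide accumulator, divide once at the end."""
    return sum(values) // len(values)
```

With a worst-case integer product of `2**30` (from -32768 * -32768), `max_terms(32, 2**30)` allows a single term while `max_terms(40, 2**30)` allows 511. For the inputs 1, 3, 5, 7 the pre-divided average loses half its value, while both options agree for 32000, 16000.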

Multiplier: 16 x 16 to give 32 bits
Adder is 40 bits
Accumulator is 40 bits

Example – filter 100 values in only 50 instructions
    .section data
    .byte2 array[100], coeffs[100];
    P0.H = hi(array);  P0.L = lo(array);
    I1.H = hi(coeffs); I1.L = lo(coeffs);
    MNOP || R2 = W[P0++] (X) || R3 = W[I1++] (X);    // MNOP multiplier NOP
    P1 = 100 - 2;
    LSET (START, FINISH) LC1 = P1 >> 1;              // Go round 49 times
    START:  A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = [P0++] (X) || R5 = [I1++] (X);
    FINISH: A0 += R4.L * R5.L, A1 -= R4.H * R5.H || R2 = [P0++] (X) || R3 = [I1++] (X);
    R0.L = (A0 += R2.L * R3.L), R0.H = (A1 -= R2.H * R3.H);
    R0.L = R0.L + R0.H (NS);

Convert the following code using parallel instructions and ensuring maximum accuracy
    #define N 1024
    .section data
    .byte2 array[N];    // short array[N];
    // short CalculateAverage( ) {
    //     Determine sum;
    //     return average

Option for doing multiplication
• R0 = R1 * R2;    (32 bit)
    - Mimics C++ multiplication
    - User must make sure that multiplication does not overflow 32 bits – no flags on error
• R0.L = R1.L * R2.H (mode);    (16 bit)
    - Default – signed fraction
    - IS -- integer signed
    - IU -- integer unsigned
    - Uses A0 and A1 multipliers

Warning
For more details see the article "When 1 + 1 = 2; but 2 * 2 != 4;", published in Circuit Cellar magazine
• Link available from December 415 web-page
Sounds like a good Q9 to me for the final if you add some more details
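The effect named in the article title can be reproduced with a small Python model of the two 16-bit multiply modes. This is a sketch, not the processor's definition: the function names are invented, and the fractional path assumes the product is doubled, rounded by adding 0x8000, and truncated to its upper 16 bits.

```python
def to_int16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def saturate16(value):
    """Clamp to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, value))


def mult16_integer(x, y):
    """Model of Rd.L = Rx.L * Ry.L (IS): integer product saturated to 16 bits."""
    return saturate16(to_int16(x) * to_int16(y))


def mult16_fraction(x, y):
    """Model of the default signed-fraction multiply: 1.15 * 1.15 -> 1.15.

    The product is doubled to drop the redundant sign bit, rounded, and the
    upper 16 bits are kept.
    """
    product = (to_int16(x) * to_int16(y)) << 1
    rounded = (product + 0x8000) >> 16
    return saturate16(rounded)
```

In this model `mult16_integer(2, 2)` gives 4, but `mult16_fraction(2, 2)` gives 0 because the bit patterns stand for tiny fractions whose product is too small to represent. Addition does not depend on the mode, which is why 1 + 1 is still 2.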

Addition and multiplication on Blackfin
• If R0 = 0x12345678, then what is the result of R0.L = 0xFFFF, and why?
• Math question: what is the result of 0.1 * 10^-2 + 0.2 * 10^-2? Express the answer in the format 0.XYZ * 10^-2
• Math question: what is the result of 0.1 * 10^-2 * 0.2 * 10^-2? Express the answer in the format 0.XYZ * 10^-2
• R0.L = 0x6; R1.L = 0x7; What is the result of R2.L = R0.L + R1.L (NS); and why?
    - Treated as a 2's complement number
    - Treated as a signed fractional number (format R0.L = 6 * 2^-31)
• What is the result of R2.H = R0.L * R1.L; and why?
• What is the result of R2.H = R0.L * R1.L (IS); and why?

Other “multiplication” types
• Multiply by 2 or 4
    - R0 = (R1 + R2) << 1; (or << 2) (or Pn)
    - P0 = P1 + (P2 << 1); (or << 2) P only. Useful when using P2 as the index in a loop
• Multiply by 1/2, 1/4, 1/8, 1/2^N
    - R0 >>= 3; divide by 8 (R0 unsigned number): 0x80000000 / 8 = 0x10000000 (unsigned (+ve) number)
    - R0 >>>= 3; divide by 8 (R0 signed number): 0x80000000 / 8 = 0xF0000000 (negative number)
    - R0 = ASHIFT R1 BY -3; (negative shift divides, +ve shift multiplies)
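The difference between the two right shifts can be checked with a short Python model of a 32-bit register. The function names are illustrative; Python's own `>>` is arithmetic on negative integers, so the register value is converted to a signed number first.

```python
MASK32 = 0xFFFFFFFF


def logical_shift_right(value, count):
    """Model of R0 >>= count on a 32-bit register: zeros shift in at the top."""
    return (value & MASK32) >> count


def arithmetic_shift_right(value, count):
    """Model of R0 >>>= count on a 32-bit register: the sign bit is replicated."""
    value &= MASK32
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> count) & MASK32
```

For the bit pattern 0x80000000 shifted by 3, the logical shift yields 0x10000000 and the arithmetic shift yields 0xF0000000, so the choice of shift decides whether the value is treated as unsigned or signed.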

Division
• Fast divide by 2, 4, 8, 2^N using shift
    - R0 >>= 3; divide by 8 (R0 unsigned number): 0x80000000 / 8 = 0x10000000 (unsigned (+ve) number)
    - R0 >>>= 3; divide by 8 (R0 signed number): 0x80000000 / 8 = 0xF0000000 (negative number)
    - R0 = ASHIFT R1 BY -3; (negative shift divides, +ve shift multiplies)
• More flexible using DIVS and DIVQ
    - Slow – must be performed in a loop
    - Example code 70 / 5

Code example -- P10-25
    .global _DivideASM;
    _DivideASM:
        R0 = 70;                    // Divide(70, 5);
        R1 = 5;
        P0 = 15;                    // Evaluate quotient to 15 bits (loop info)
        R0 <<= 1;                   // Book says "needed for integer division"
        DIVS(R0, R1);               // Determines MSB of quotient
        LOOP .div_prim LC0 = P0;    // DIFFERENT LOOP SYNTAX
        LOOP_BEGIN .div_prim;
            DIVQ(R0, R1);
        LOOP_END .div_prim;
        R0 = R0.L (X);
        RTS;
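Why division needs a loop can be seen from a plain shift-and-subtract divider, which produces one quotient bit per pass. The Python sketch below is a textbook restoring division for non-negative inputs, with a hypothetical function name. It is not a model of the exact algorithm inside DIVS and DIVQ.

```python
def divide_shift_subtract(dividend, divisor, quotient_bits=16):
    """Restoring division for non-negative integers, one quotient bit per pass."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("sketch handles a non-negative dividend and positive divisor only")
    quotient = 0
    remainder = 0
    for bit in range(quotient_bits - 1, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

`divide_shift_subtract(70, 5)` returns a quotient of 14 with remainder 0, the same 70 / 5 case used in the assembly example. Dividend bits above `quotient_bits` are ignored by this sketch.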

Problem to solve
• Build video images in SDRAM
• Scale all the images (increase grey scale by a fixed scaling factor)
• Determine whether it is more efficient to
    1. Work using the images in SDRAM
    2. Bring images from SDRAM (using DMA), scale them, then put back
    3. Use a multi-threaded version of task 2
1. Multiplication and division issues

• Information taken from Analog Devices On-line Manuals with permission: http://www.analog.com/processors/resources/technicalLibrary/manuals/
• Information furnished by Analog Devices is believed to be accurate and reliable. However, Analog Devices assumes no responsibility for its use or for any infringement of any patent or other rights of any third party which may result from its use. No license is granted by implication or otherwise under any patent or patent right of Analog Devices. Copyright Analog Devices, Inc. All rights reserved.