Tutorial: Systematic development of programs with parallel instructions
SHARC ADSP-2106X processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
smithmr@ucalgary.ca

Standard “C” code -- Last Lecture

    void Convert(float *temperature, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }

    void Convert(DM float *temperature, int N) {
        int count;
        // Accessing array via DM bus causes bus conflicts on read and write ops
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }

12/24/2021 ENCM 515 -- Tutorial exercise on parallel code generation. Copyright M. Smith -- smithmr@ucalgary.ca

Standard “C” code

    void Convert(DM float *temperature, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }

    void Convert(DM float *temp_in, PM float *temp_out, int N) {
        int count;
        // Values in on DM bus, out on PM bus
        // NO bus conflicts on read and write ops?
        for (count = 0; count < N; count++) {
            *temp_out = (*temp_in) * 9 / 5 + 32;
            temp_in++;
            temp_out++;
        }
    }

Reminder -- 21061 Parallelism opportunities
[Figure: ADSP-21061 block diagram showing the cache memory (32 x 48), DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), program sequencer, timer, flags, JTAG test & emulation, the PMA (24), DMA (32), PMD (48) and DMD (40) buses, bus connect, the 16 x 40 register file, the floating & fixed-point multiplier with fixed-point accumulator, the 32-bit barrel shifter and the floating-point & fixed-point ALU]
- Memory pointer operations
  - Post modify 2 index registers
  - Automatic circular buffer operations
  - Automatic bit reverse addressing
- Zero overhead loops
- Instruction pipeline issues
- Ability for parallel memory operation
  - One each on pm, dm and instruction cache busses
- Key issue -- Only 48 bits available in OPCODE to describe
  - 16 data registers in 3 destinations and 6 sources = 135 bits
  - 2 * (8 index + 8 modify + 16 data) = 64 bits
  - Condition code selection, 32 bit constants etc.
- Many parallel operations and register to register bus transfers
  - Rn = Rx + Ry or Rn = Rx * Ry
  - Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr

Parallel Instruction Code Development
- Write the 21k assembly code for the function void Convert(float *temperature, int N), which etc...
- Determine the instruction flow through the architecture using a resource usage diagram
- Theoretically optimize the code
- Compare and contrast the amount of time to perform the subroutine before and after customization

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop

21061-style load/store “C” code

    void Convert(register dm float *temp_in,     // STILL INPAR1
                 register pm float *temp_out,    // STILL INPAR2
                 register int N) {
        // Exercise for you
    }
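One possible answer to the exercise, as a sketch in portable C rather than the author's solution. The `dm` and `pm` memory-space qualifiers are compiler extensions, so they are stubbed out here as empty macros (an assumption made only so the code builds on a host compiler). The names `convert_load_store` and `scratch` are hypothetical, and 9 / 5 is folded into the single constant 1.8.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: dm/pm are SHARC compiler extensions. Stub them out so the
   sketch compiles with any host C compiler. */
#define dm
#define pm

/* Load/store form: each statement either moves one value between memory
   and a register-like temporary, or operates on temporaries only. */
void convert_load_store(register dm const float *temp_in,
                        register pm float *temp_out,
                        register int N)
{
    register float scratch;
    register int count;

    assert(temp_in != NULL && temp_out != NULL);

    for (count = 0; count < N; count++) {
        scratch = *temp_in;          /* load from the input array      */
        temp_in++;
        scratch = scratch * 1.8f;    /* 9 / 5 folded into one constant */
        scratch = scratch + 32.0f;
        *temp_out = scratch;         /* store to the output array      */
        temp_out++;
    }
}
```

Writing the body this way makes the four separate operations (fetch, multiply, add, store) visible, which is what the later resource usage chart needs.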

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop

Assembly code
- PROLOGUE
  - Appropriate defines to make easy reading of code
  - Saving of non-volatile registers
- BODY
  - Try to plan ahead for parallel operations
  - Know which 21k “multi-function” instructions are valid
- EPILOGUE
  - Recover non-volatile registers

Straight conversion -- PROLOGUE

    // void Convert(register dm float *temp_in,     INPAR1_R4
    //              register pm float *temp_out,    INPAR2_R8
    //              register int N) {
    .segment/pm seg_pmco;
    .global _Convert;
    _Convert:

- Leave prologue for the moment
- Expect to have to save registers to stack
- Do rest of code first -- then make parallel

Minor modification of the old code

    // for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        // scratch = *pt;
        scratchF2 = ???;                    // Not ++ as pt re-used
        // scratch = scratch * (9 / 5);
        // INPAR1 (R4) is dead -- can reuse as F4
    #define constantF4 F4                   // Must be float
        constantF4 = 1.8;                   // No division, use register constant
        scratchF2 = scratchF2 * constantF4;
        // scratch = scratch + 32;
    #define F0_32 F0                        // Must be float
        F0_32 = 32.0;                       // NOT F0 = 32, which gives F0 = 1 * 10^-45
        scratchF2 = scratchF2 + F0_32;
        // *pt = scratch; pt++;
        ??? = scratchF2;
    LOOP_END:                               // 5 magic lines of code
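The "must be float" warning can be checked on a host machine. The sketch below assumes a 32-bit IEEE-754 `float` and uses a hypothetical helper, `bits_as_float`, to reinterpret an integer bit pattern as a float, which is what loading the integer constant 32 into a floating-point register amounts to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit integer bit pattern as a single-precision float.
   memcpy is the portable way to do this without aliasing problems. */
float bits_as_float(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);  /* assumes a 32-bit float */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

On such a host, `bits_as_float(32u)` is a tiny denormal value that is nowhere near 32.0, while the bit pattern `0x42000000` is the one that actually encodes 32.0.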

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques (Attempt to)
- Compare and contrast time -- setup and loop

Speed rules for memory access

    scratch = dm(0, pt);
    scratch = dm(pt, 0);            // Not ++ as to be re-used
    dm(pt, 1) = scratch;

- Can’t use PREMODIFY, PERIOD
- Can’t use POST MODIFY OPERATIONS with CONSTANTS
  - Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
  - Must use Modify registers to store these constants
  - Several useful constants placed in modify registers (DAG 1 and DAG 2) during “C-code” initialization (if linked in)

    scratch = dm(pt, zeroDM);       // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;
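The post-modify access pattern can be modelled in C to see why the load uses a zero modifier and the store a +1 modifier. This is a simplified host-side model, not SHARC code: the `Dag` struct and the functions `dm_load_post`, `dm_store_post` and `convert_in_place` are hypothetical names, and the local constants `zeroDM` and `plus1DM` stand in for the modify registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one index register: a position in an array that
   stands in for dm memory. */
typedef struct {
    float *mem;     /* base of the modelled memory */
    ptrdiff_t idx;  /* index register I            */
} Dag;

/* value = dm(I, M): use I as the address, then add the modify value M. */
float dm_load_post(Dag *dag, ptrdiff_t modify)
{
    assert(dag != NULL);
    float value = dag->mem[dag->idx];
    dag->idx += modify;
    return value;
}

/* dm(I, M) = value: store at address I, then add the modify value M. */
void dm_store_post(Dag *dag, ptrdiff_t modify, float value)
{
    assert(dag != NULL);
    dag->mem[dag->idx] = value;
    dag->idx += modify;
}

/* In-place conversion: zero modify on the load (the pointer is re-used by
   the store), +1 modify on the store to step to the next element. */
void convert_in_place(float *temperature, int n)
{
    const ptrdiff_t zeroDM = 0, plus1DM = 1;  /* stand-ins for modify registers */
    Dag pt = { temperature, 0 };

    for (int count = 0; count < n; count++) {
        float scratch = dm_load_post(&pt, zeroDM);
        scratch = scratch * 1.8f + 32.0f;
        dm_store_post(&pt, plus1DM, scratch);
    }
}
```

Keeping the modifier in a variable rather than a literal mirrors the rule above: the step size lives in a register, so the instruction itself needs no room for a constant.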

Speed rules
IF you want adds and multiplies to occur on the same line
- F1 = F2 * F3, F4 = F5 + F6;
  - Want to do as a single instruction
  - Not enough bits in the opcode
  - Register description 4 + 4 + 4 + 4 (bits)
  - Plus bits for describing math operations, conditions and memory ops?
- Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
- Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
- Must rearrange register usage with program code for this to be possible
  - Register description 4 + 2 + 4 + 2 (bits) -- other bits “understood”
  - Inconvenient rather than limiting
  - e.g. F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
  - Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
  - Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
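The source-register rule above is easy to encode as a check. The sketch below is only an illustration of the ranges quoted on the slide. The function names are hypothetical, registers F0 to F15 are represented by the integers 0 to 15, and the ordering rule (multiply written before the ALU operations) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* True when reg lies in the inclusive range lo..hi. */
static bool in_range(int reg, int lo, int hi)
{
    assert(lo <= hi);
    return reg >= lo && reg <= hi;
}

/* Check the source registers of  Fn = Fx * Fy, Fm = Fa + Fb  against the
   ranges on the slide: F0-3 * F4-7 and F8-11 + F12-15. */
bool mul_add_sources_ok(int mul_x, int mul_y, int alu_x, int alu_y)
{
    return in_range(mul_x, 0, 3)  && in_range(mul_y, 4, 7) &&
           in_range(alu_x, 8, 11) && in_range(alu_y, 12, 15);
}
```

For example, `mul_add_sources_ok(0, 4, 8, 12)` corresponds to the accepted form `F6 = F0 * F4, F7 = F8 + F12`, while swapping the multiplier operands to `(4, 0, 8, 12)` corresponds to the rejected `F6 = F4 * F0`.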

Check on required register use

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = ???;
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4                      // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                        // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        ??? = scratchF2;

Register re-assignment -- Step 1

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = ???;
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4                      // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                        // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        ??? = scratchF2;

Register re-assignment -- Step 2

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = ???;
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4                      // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                        // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        ??? = scratchF2;

Fix poor coding practices -- Step 3

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = ???;
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4                      // Must be float
        constantF4 = 1.8;                   // Move constants out of loop
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                        // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        ??? = scratchF2;

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques (Attempt to)
- Compare and contrast time -- setup and loop

Resource Management -- Chart 1 -- Basic code
Write in resource usage

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize parallelism using techniques
  - Attempt to -- watch out for special situations where code will fail
- Compare and contrast time -- setup and loop

Un-roll the loop
- Check for resource conflicts when you start moving instructions
- Watch that you don’t disturb dependency of instructions
  - Can’t use the result of a memory fetch in the instruction describing the memory fetch
  - Ditto for the ALU/FPU operation
  - Remember DAG 1 I and M registers for dm ops and DAG 2 I and M registers for pm ops
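As a rough picture of what unrolling looks like before any instructions are moved, here is a sketch in C that unrolls the conversion four times. It is an illustration under assumptions, not the tutorial's assembly: the function name `convert_unrolled4` and the temporaries `a` to `d` are hypothetical, and the remainder loop is added so that any `n` works.

```c
#include <assert.h>
#include <stddef.h>

/* Four copies of the loop body per pass, then a remainder loop. */
void convert_unrolled4(const float *temp_in, float *temp_out, int n)
{
    assert(n >= 0);
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Four independent fetch -> multiply -> add -> store chains.
           Each value is used only in statements after the one that
           produced it, so the dependencies are kept. */
        float a = temp_in[i];
        float b = temp_in[i + 1];
        float c = temp_in[i + 2];
        float d = temp_in[i + 3];
        a = a * 1.8f;   b = b * 1.8f;   c = c * 1.8f;   d = d * 1.8f;
        a = a + 32.0f;  b = b + 32.0f;  c = c + 32.0f;  d = d + 32.0f;
        temp_out[i]     = a;
        temp_out[i + 1] = b;
        temp_out[i + 2] = c;
        temp_out[i + 3] = d;
    }

    for (; i < n; i++) {  /* leftover elements when n is not a multiple of 4 */
        temp_out[i] = temp_in[i] * 1.8f + 32.0f;
    }
}
```

Because the four chains do not depend on each other, their fetches, multiplies, adds and stores are the operations that can later be interleaved on the resource usage chart.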

Unroll the loop and then
Move the 1st R = dm operation up, then move its dependent *, + and pm = R operations

Followed by moving the 2nd, 3rd and 4th R = dm operations up, then move their dependent *, + and pm = R operations

Need 1 resource to be maxed out
Otherwise algorithm is inefficient
- This code has the capability of having all resources maxed out
- All in one parallel instruction -- note the , and ;

    Fa = Fb * F4, Fc = Fa + F12, Fb = dm(DAG1_Index, DAG1_Modify), pm(DAG2_Index, DAG2_Modify) = Fc;

  - Storing result of instruction P along the pm bus
  - Calculating the addition for instruction P + 1
  - Calculating the multiplication for instruction P + 2
  - Fetch the initial array value for instruction P + 3 on dm bus
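One way to reason about such an instruction is to model it on a host in C. The sketch below is a simplified model, assuming that every source is read before any destination is written and taking the second operation to be the addition that the bullets describe. The `Regs` struct and `parallel_step` function are hypothetical names, and ordinary C pointers stand in for the DAG index registers with a modify value of +1.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float Fa;   /* product waiting for the add          */
    float Fb;   /* value most recently fetched on dm    */
    float Fc;   /* finished result waiting for the pm store */
    float F4;   /* multiplier constant, e.g. 1.8        */
    float F12;  /* adder constant, e.g. 32.0            */
} Regs;

/* Model of one parallel instruction:
       Fa = Fb * F4, Fc = Fa + F12, Fb = dm(...), pm(...) = Fc;
   All four operations use the register values from before the step. */
void parallel_step(Regs *r, const float **dm_ptr, float **pm_ptr)
{
    assert(r != NULL && dm_ptr != NULL && pm_ptr != NULL);

    const Regs old = *r;        /* snapshot: sources are read first */

    r->Fa = old.Fb * old.F4;    /* multiply for element P + 2 */
    r->Fc = old.Fa + old.F12;   /* add for element P + 1      */
    r->Fb = **dm_ptr;           /* fetch element P + 3        */
    (*dm_ptr)++;
    **pm_ptr = old.Fc;          /* store result of element P  */
    (*pm_ptr)++;
}
```

The snapshot is what makes the model behave like a single instruction: the store sends out the old `Fc`, not the one computed in the same step.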

Now to “reroll the loop”
- The loop is currently just straight line coded
  - Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations
- Look for
  - Repeated patterns of instructions -- main loop body
  - Instructions outside the loop used to fill the ALU/FPU pipeline (typically 1 stage from loop)
  - Instructions outside the loop used to empty the ALU/FPU pipeline (typically 1 stage from loop)

Recap on resource management
Identify the loop components
[Chart: straight-line code marked up into three regions: FILL ALU/FPU PIPE, LOOP BODY, EMPTY ALU/FPU PIPELINE]

Recap on resource management
Final code version -- re-rolled loop
[Code chart; surviving labels only: FILL, USE and EMPTY the ALU/FPU PIPE, with the loop written as "... - 1 UNTIL LCE" and closed by the LOOPEND: label]

Re-roll the loop -- Identify fill, loop body, empty
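The fill, loop body and empty structure can be sketched in C as a software-pipelined loop. This is a host-side illustration under assumptions, not the final assembly: the function name `convert_pipelined` is hypothetical, each statement group stands for one parallel instruction, and the short-array fallback is an added guard for inputs too small to fill the pipeline.

```c
#include <assert.h>
#include <stddef.h>

void convert_pipelined(const float *temp_in, float *temp_out, int n)
{
    const float F4 = 1.8f, F12 = 32.0f;
    float Fa, Fb, Fc;
    int i;

    assert(n >= 0);

    if (n < 3) {                      /* too short to fill the pipeline */
        for (i = 0; i < n; i++)
            temp_out[i] = temp_in[i] * F4 + F12;
        return;
    }

    /* FILL: start elements 0, 1 and 2 without storing anything. */
    Fb = temp_in[0];
    Fa = Fb * F4;   Fb = temp_in[1];
    Fc = Fa + F12;  Fa = Fb * F4;  Fb = temp_in[2];

    /* LOOP BODY: store, add, multiply and fetch for four different
       elements. The statement order uses each old value before it is
       overwritten, which mimics one parallel instruction. */
    for (i = 3; i < n; i++) {
        temp_out[i - 3] = Fc;
        Fc = Fa + F12;
        Fa = Fb * F4;
        Fb = temp_in[i];
    }

    /* EMPTY: drain the three elements still in flight. */
    temp_out[n - 3] = Fc;  Fc = Fa + F12;  Fa = Fb * F4;
    temp_out[n - 2] = Fc;  Fc = Fa + F12;
    temp_out[n - 1] = Fc;
}
```

The loop body runs `n - 3` times, so the fill and empty sections only make sense when at least three elements exist. That is one concrete reason to ask whether the rerolled code works for all values of N.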

Speed improvements
(cycle counts are sums over START, LOOP, EXIT and ENTRY)
- STANDARD
      START 4 + LOOP N * 4 + EXIT 5 + ENTRY 5 = 14 + 4 * N
- With 2-fold loop unfolding -- 1 data bus
      4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N
- With 3-fold loop unfolding -- 1 data bus
      4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N
- Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
- NOW with parallel dm and pm data bus operations (START, LOOP, EXIT, ENTRY)
- Don’t forget the instruction cache effects -- why not a problem before?
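The cycle-count formulas above can be compared for a given N with a few lines of C. The function names are hypothetical and the functions simply evaluate the expressions as written on the slide. They say nothing about the parallel dm/pm version, whose counts are not given.

```c
#include <assert.h>

/* Standard loop: 4 + 4 * N + 5 + 5 = 14 + 4 * N cycles. */
double cycles_standard(int n)
{
    assert(n >= 0);
    return 4 + 4.0 * n + 5 + 5;
}

/* 2-fold unfolding, 1 data bus: 24 + 2.5 * N cycles. */
double cycles_unroll2(int n)
{
    assert(n >= 2);
    return 4 + 7 + (n - 2) * 5.0 / 2.0 + 5 + 8 + 5;
}

/* 3-fold unfolding, 1 data bus: 16 + 2 * N cycles. */
double cycles_unroll3(int n)
{
    assert(n >= 2);
    return 4 + 5 + (n - 2) * 6.0 / 3.0 + 5 + 1 + 5;
}

/* Ratio of cycle counts: how many times faster "after" is than "before". */
double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

For N = 100 the three formulas give 414, 274 and 216 cycles. The fixed overhead means the speedup for a finite N is somewhat below the asymptotic factors of 4 / 2.5 and 4 / 2.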

Questions to Ask
- We now know the final code
- Will the code work for all values of N?
- What are the implications on the other components of the DSP algorithm since we are using both DM and PM busses?

Tackled today
- What’s the problem?
- Standard Code Development of “C”-code
- Process for “Code with parallel instruction”
  - Rewrite with specialized resources
  - Move to “resource chart”
  - Unroll the loop
  - Adjust code
  - Reroll the loop
  - Check if worth the effort