Efficient Loop Handling for DSP algorithms on CISC, RISC and DSP processors
Examples.doc file on web
M. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada
smithmr@ucalgary.ca

Key elements in DSP algorithms
  • Instruction fetches -- must be efficient
  • Data fetches / stores -- often multiple -- must be efficient
  • Multiplication -- must be efficient and accurate and remain precise
  • Addition -- must be efficient and accurate and remain precise
  • Decision logic to control above -- must be efficient

9/25/2021 ENCM 515 -- High Speed Loops -- Hardware and Software. Copyright smithmr@ucalgary.ca

To be tackled today
  • Performing operations on an array
  • Loop overhead -- depends on implementation
      • Loop overhead can steal many cycles
      • Standard loop with test at the start -- while ( )
      • Initial test with additional test at end -- do-while ( )
      • Down-counting loops
  • Special efficiencies
      • CISC -- hardware
      • RISC -- intelligent compilers
      • DSP -- hardware

Example -- Memory Move FIFO

    void MemoryMove_Delay_CPP(int *channel1_in, int *channel2_in, ADISoundSource *sounddemo) {
        int count;

        // Insert new value into the back of the FIFO delay line
        left_delayline[0 + LEFT_DELAY_VALUE] = *channel1_in;

        // Grab delayed value from the front of the FIFO delay line
        *channel1_in = left_delayline[0];

        // Update the FIFO delay line using inefficient
        // memory-to-memory moves
        for (count = 0; count < LEFT_DELAY_VALUE; count++)
            left_delayline[count] = left_delayline[count + 1];
    }
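The memory-move delay line above can be tried out in isolation. The following is a minimal sketch only: the names `DELAY_LEN`, `delayline` and `memory_move_delay` are hypothetical stand-ins for the lab code, and the single-channel interface is an assumption made to keep the example short.

```c
#include <stddef.h>

#define DELAY_LEN 4   /* hypothetical delay, in samples */

/* One extra slot so the newest sample can sit at index DELAY_LEN. */
static int delayline[DELAY_LEN + 1];

/* Push one sample in; return the sample that entered DELAY_LEN calls ago. */
int memory_move_delay(int sample_in)
{
    int delayed;
    size_t i;

    delayline[DELAY_LEN] = sample_in;   /* insert at the back of the FIFO */
    delayed = delayline[0];             /* grab the value at the front */

    /* Shift every element one place forward: DELAY_LEN reads and
       DELAY_LEN writes per sample, which is the cost being criticised. */
    for (i = 0; i < DELAY_LEN; i++)
        delayline[i] = delayline[i + 1];

    return delayed;
}
```

With the delay line initially zero, the first `DELAY_LEN` calls return 0 and the next call returns the first sample that was pushed in.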

Example -- Pointer FIFO

    void PointerFIFO_CPP(int *channel1_in, int *channel2_in, ADISoundSource *sounddemo) {
        // Insert new value into the back of the FIFO delay line
        *pt_in++ = *channel1_in;

        // Grab delayed value from the front of the FIFO delay line
        *channel1_in = *pt_out++;       // May not be ++, could be +???

        if (pt_in > &left_delay[0 + LEFT_DELAY_VALUE])
            pt_in = pt_in - (LEFT_DELAY_VALUE);
        if (pt_out > &left_delay[0 + LEFT_DELAY_VALUE])
            pt_out = pt_out - (LEFT_DELAY_VALUE);
    }

  • Requires additional reads and stores of the "static" memory locations where the pointers are stored
  • Requires compares and jumps -- pipeline issues on jumps
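A pointer-based FIFO along the lines of this slide might look like the sketch below. It is an illustration under assumptions, not the lab code: `FIFO_DELAY_LEN`, `ring`, `pt_in_demo`, `pt_out_demo` and `pointer_fifo_delay` are hypothetical names, the buffer is given one extra slot, and the wrap resets each pointer to the start of the buffer.

```c
#define FIFO_DELAY_LEN 4   /* hypothetical delay, in samples */

static int ring[FIFO_DELAY_LEN + 1];
static int *pt_in_demo  = &ring[FIFO_DELAY_LEN];  /* where the next sample is written */
static int *pt_out_demo = &ring[0];               /* where the delayed sample is read */

/* Push one sample in; return the sample that entered FIFO_DELAY_LEN calls ago. */
int pointer_fifo_delay(int sample_in)
{
    int delayed;

    *pt_in_demo++ = sample_in;          /* insert at the back, no data is moved */
    delayed = *pt_out_demo++;           /* grab from the front */

    /* Wrap each pointer when it runs off the end of the buffer.
       These compares and jumps are the overhead the slide points out. */
    if (pt_in_demo == ring + FIFO_DELAY_LEN + 1)
        pt_in_demo = ring;
    if (pt_out_demo == ring + FIFO_DELAY_LEN + 1)
        pt_out_demo = ring;

    return delayed;
}
```

The per-sample cost no longer grows with the delay length, at the price of two pointer updates and two wrap tests.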

Real-time FIR Filter

    void ProcessSound(int channel_one, int channel_two, int *left_channel, int *right_channel)
        if (sound_source & FIRFilter) FIRFilter(&channel_one, &channel_two);

    float fircoeffs_30[], fircoeffs_330[];

    void FIRFilter(int *channel_one, int *channel_two) {
        // Insert new value into FIFO delay line
        left_delayline[0 + N] = (float) *channel_one;
        right_delayline[0 + N] = (float) *channel_two;
        channel_one_30 = channel_one_330 = 0;

        // Need equivalent of following loop for EACH sound source
        for (count = 0; count < FIRlength - 1; count++) {
            channel_one_30 = channel_one_30 + arrayleft[count] * fir_30[count];
            channel_one_330 = channel_one_330 + arrayright[count] * fir_330[count];
        }
        *channel_one = (int) channel_one_30 + scale_factor * channel_one_330;
        // ditto for channel 2
        // Update left channel delay line; update right channel delay line
    }
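The inner loop of the filter is a multiply-accumulate over the delay line. A stripped-down sketch is shown below; `fir_dot` and its parameters are hypothetical names, and the delay-line update is deliberately left out so that only the loop under discussion remains.

```c
#include <stddef.h>

/* Sum of delayline[k] * coeffs[k] for k = 0 .. taps-1.
   Each pass costs one multiply, one add and two data fetches,
   plus whatever the loop construct itself costs. */
float fir_dot(const float *delayline, const float *coeffs, size_t taps)
{
    float acc = 0.0f;
    size_t k;

    for (k = 0; k < taps; k++)
        acc += delayline[k] * coeffs[k];

    return acc;
}
```

The loop body is a single line, so on most processors the loop overhead is comparable to, or larger than, the useful work.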

Real-time FIR -- Hard-coded loop

    channel_one_30 = channel_one_30 + arrayleft[0] * fir_30[0];
    channel_one_30 = channel_one_30 + arrayleft[1] * fir_30[1];
    channel_one_30 = channel_one_30 + arrayleft[2] * fir_30[2];
    channel_one_30 = channel_one_30 + arrayleft[3] * fir_30[3];
    channel_one_30 = channel_one_30 + arrayleft[4] * fir_30[4];
    channel_one_30 = channel_one_30 + arrayleft[5] * fir_30[5];
    channel_one_30 = channel_one_30 + arrayleft[6] * fir_30[6];
    channel_one_30 = channel_one_30 + arrayleft[7] * fir_30[7];

  • No loop overhead
  • Heavy memory penalty -- FIR filters in Lab. 4 -- 300 taps * 4 filters
  • Use pt++ type operations and not direct memory access with offset on SOME processors!!
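Between a fully hard-coded filter and a one-line loop there is a middle ground: unroll the body a few times so the loop overhead is paid once per group of taps. The sketch below unrolls by four; the factor of four and the name `fir_dot_unrolled4` are illustrative choices, not taken from the lab.

```c
#include <stddef.h>

/* Same result as a plain multiply-accumulate loop, but the test and
   loop-back are executed once per four taps instead of once per tap. */
float fir_dot_unrolled4(const float *x, const float *h, size_t taps)
{
    float acc = 0.0f;
    size_t k = 0;

    /* Main unrolled section: four taps per pass. */
    for (; k + 4 <= taps; k += 4) {
        acc += x[k]     * h[k];
        acc += x[k + 1] * h[k + 1];
        acc += x[k + 2] * h[k + 2];
        acc += x[k + 3] * h[k + 3];
    }

    /* Clean-up loop for the zero to three taps left over. */
    for (; k < taps; k++)
        acc += x[k] * h[k];

    return acc;
}
```

Code size grows with the unroll factor, which is the memory penalty noted on the slide.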

Timing required to handle DSP loops

    for k = 0 to (N-1)      -- Could require many lines
        Body of Code        -- BofC cycles -- Could be 1 line
    Endfor                  -- Could require many lines

  • Important feature -- how much overhead time is used in handling the loop construct itself?
  • Three components
      • Set up time
      • Body of code time -- BofC cycles
      • Handling the loop itself

Basic Loop body
  • Set up loop -- loop overhead -- done once
  • Check conditions -- loop overhead -- done many times
  • Do code body -- done many times -- useful
  • Loop back + counter increment -- loop overhead -- done many times

Define

    Loop Efficiency = (N * Tcodebody) / (Tsetup + N * (Tcodebody + Tconditions + Tloopback))

  • Different efficiencies depending on size of the loop
  • Need to learn good approximation techniques and recognize the two extremes
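The efficiency definition translates directly into a small helper, which makes it easy to explore the two extremes (body much larger than the overhead, and body much smaller). This is only a sketch; `loop_efficiency` and its parameter names are hypothetical, and the times may be in any consistent unit such as cycles.

```c
/* Fraction of the total loop time spent doing useful work.
   n          : number of times round the loop
   t_body     : time for one pass through the code body
   t_setup    : one-off set-up time
   t_cond     : time for the condition check, paid every pass
   t_loopback : time for counter adjust plus loop back, paid every pass */
double loop_efficiency(double n, double t_body, double t_setup,
                       double t_cond, double t_loopback)
{
    return (n * t_body) / (t_setup + n * (t_body + t_cond + t_loopback));
}
```

When `t_body` dominates, the result approaches 1. When `t_body` is small, the result approaches `t_body / (t_body + t_cond + t_loopback)` for large `n`, and the set-up term only matters when `n` is small.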

3 basic loop constructs
  • While loop
      • Main compare test at top of loop
  • Modified do-while loop with initial test
      • Initial compare test at top
      • Main compare test at the bottom of the loop
  • Down-counting do-while loop with initial test
      • No compare operations in test
      • Relies on the setting of condition code flags during adjustment of the loop counter
      • Can increase overhead in some algorithms
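The three constructs can be written out in C as follows. This is a sketch with hypothetical function names; whether a compiler actually emits a flag-based down-count for the third form depends on the compiler and the target.

```c
/* 1. While loop: main compare test at the top of the loop. */
int sum_while(const int *a, int n)
{
    int sum = 0;
    int i = 0;
    while (i < n) {
        sum += a[i];
        i++;
    }
    return sum;
}

/* 2. Modified do-while: one initial test, main test at the bottom. */
int sum_do_while(const int *a, int n)
{
    int sum = 0;
    int i = 0;
    if (n > 0) {
        do {
            sum += a[i];
            i++;
        } while (i < n);
    }
    return sum;
}

/* 3. Down-counting do-while: the counter runs towards zero, so the
      decrement itself can set the flags the branch needs. The array
      is walked with a separate pointer. */
int sum_down_count(const int *a, int n)
{
    int sum = 0;
    int count = n;
    if (count > 0) {
        do {
            sum += *a++;
        } while (--count > 0);
    }
    return sum;
}
```

All three return the same value; they differ only in where the test sits and in what the counter is compared against.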

Clements -- Microprocessor Systems Design, PWS Publishing, ISBN 0-534-94822-7
Data from the memory appears near the end of the read cycle

Review -- CISC processor instruction phases
  • Fetch -- Obtain op-code
      • PC-value out on Address Bus
      • Instruction op-code at Memory[PC] on Data Bus and then into Instruction Register
  • Decode -- Bringing required values (internal or external) to the ALU input
      • Immediate -- Additional memory access for value -- Memory[PC]
      • Absolute -- Additional memory access for address value and then further access for value -- Memory[Memory[PC]]
      • Indirect -- Additional memory access to obtain value at Memory[AddressReg]
  • Execute -- ALU operation
  • Writeback -- ALU value to internal/external storage
      • May require additional memory accesses to obtain address used during storage
      • May require additional memory operations to perform storage
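The difference between the three operand modes is the number of extra memory accesses each one needs during decode. The toy model below counts them; the `Machine` type and the function names are hypothetical, memory is a small word array, and the op-code fetch itself is not counted.

```c
#include <stddef.h>

#define MEM_WORDS 16

typedef struct {
    int memory[MEM_WORDS];
    size_t accesses;        /* operand-related memory reads so far */
} Machine;

/* Every read goes through here so that it is counted. */
static int mem_read(Machine *m, size_t address)
{
    m->accesses++;
    return m->memory[address];
}

/* Immediate: the value follows the op-code -- Memory[PC]. */
int operand_immediate(Machine *m, size_t pc)
{
    return mem_read(m, pc);
}

/* Absolute: the address follows the op-code -- Memory[Memory[PC]]. */
int operand_absolute(Machine *m, size_t pc)
{
    size_t address = (size_t)mem_read(m, pc);
    return mem_read(m, address);
}

/* Indirect: the address is already in a register -- Memory[AddressReg]. */
int operand_indirect(Machine *m, size_t address_reg)
{
    return mem_read(m, address_reg);
}
```

In this model immediate and indirect cost one access and absolute costs two, which is one reason register-indirect addressing is attractive inside a loop body.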

Basic 68K CISC loop -- Test at start

            MOVE.L #0, count        -- Set up -- count in register
                                    -- Fetch instr. (FI 4) + Fetch 32-bit constant (FC 2 * 4) + operation (OP 0)
    LOOP:   CMP.L #N, count         -- (FI 4, FC 8, OP 4 -- 32-bit subtract)
            BGE somewhere           -- Actually ADD.L #(somewhere - 4), PC
                                    -- (add of 16-bit displacement to PC -- FI 4, FC 4, OP (0 or 4))
            Body Cycles             -- doing FIR perhaps
            ADD.L #1, count
            JMP LOOP                -- (FI 4, FC 8, OP 4)

    Loop Efficiency = (N * BodyCycles) / (12 + N * (28 + BodyCycles + 32))

Since 60 >> 12 (5 times), ignore the startup cycles even if N is small.

Check at end -- 68K CISC loop

              MOVE.L #0, count      -- (FI 4, FC 8, OP 0)
              JMP LOOPTEST          -- (FI 4, FC 8, OP 4)
    LOOP:     Body Cycles           -- doing FIR perhaps
              ADD.L #1, count       -- (FI 4, FC 8, OP 4)
    LOOPTEST: CMP.L #N, count
              BLT LOOP              -- (FI 4, FC 4, OP 4)

    Efficiency = (N * BodyCycles) / (26 + N * BodyCycles + 44 * (N + 1))

Since 44 > 26 (1.8 times), the startup cycles can't be ignored if N is small and Body Cycles is small -- a small loop means an inefficient loop.

Down Count -- 68K CISC loop

              MOVEQ.L #0, array_index   -- (FI 4, FC 0, OP 0)
              MOVE.L #N, count
              JMP LOOPTEST              -- (FI 4, FC 8, OP 4)
    LOOP:     BodyCycles using instructions of form OPERATION (Addreg, Index)
              ADDQ.L #1, array_index
              SUBQ.L #1, count          -- (FI 4, FC 0, OP 0?)
    LOOPTEST: BGT LOOP                  -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))

Since 20 < 24, the startup cycles can't be ignored if N is small and Body Cycles is small.

Down Count -- Possible sometimes

              MOVEA.L #array_start, Addreg  -- (FI 4, FC 0, OP 0)
              MOVE.L #N, count              -- (FI 4, FC 0, OP 0)
              JMP LOOPTEST                  -- (FI 4, FC 8, OP 4)
    LOOP:     BodyCycles using autoincrement mode OPCODE (Addreg)+
              SUBQ.L #1, count              -- (FI 4, FC 0, OP 0?)
    LOOPTEST: BGT LOOP                      -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))

Since 16 < 24, the startup cycles can't be ignored if N is small and Body Cycles is small.

NOTE -- Number of cycles needed in body of the loop decreases in this case

Loop Efficiency on CISC processor
  • Efficiency depends on how loop constructed
      • Standard while-loop
      • Check at end -- modified do-while
      • Down counting -- with/without auto-incrementing addressing modes
  • Compiler versus hardcode efficiency
      • See Embedded System Design magazine Sept./Oct. 2000
      • Local copy available from the web-page
  • What happens with different processor architectures?
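To compare the 68K loop forms side by side, the efficiency expressions from the preceding slides can be coded up with the cycle counts quoted there. This is a sketch: the function names are hypothetical, the constants are simply the slide values taken at face value, and the same body cycle count is used for every form even though the auto-increment version would normally have a shorter body.

```c
/* Test at start (while loop): 12 set-up, 28 + 32 overhead per pass. */
double eff_68k_test_at_start(double n, double body)
{
    return (n * body) / (12.0 + n * (28.0 + body + 32.0));
}

/* Check at end (modified do-while): 26 set-up, 44 per test. */
double eff_68k_check_at_end(double n, double body)
{
    return (n * body) / (26.0 + n * body + 44.0 * (n + 1.0));
}

/* Down count with a separate array index: 24 set-up, 20 per test. */
double eff_68k_down_count(double n, double body)
{
    return (n * body) / (24.0 + n * body + 20.0 * (n + 1.0));
}

/* Down count with auto-increment addressing: 24 set-up, 16 per test. */
double eff_68k_down_count_autoinc(double n, double body)
{
    return (n * body) / (24.0 + n * body + 16.0 * (n + 1.0));
}
```

For example, with N = 100 and a 20-cycle body the four forms come out at roughly 25%, 31%, 49% and 55% respectively, so the choice of loop construct alone can double the useful fraction of the time.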

Check at end -- 29K RISC loop

              CONST count, 0            -- (1 cycle)
              JUMP LOOPTEST
              NOP                       -- (1 cycle -- delay slot)
    LOOP:     Bodycycles                -- autoincrementing mode -- NOT AN OPTION ON 29K
              ADDU count, 1             -- (1 cycle)
    LOOPTEST: CPLE TruthReg, count, N   -- (1 cycle should be 2 -- register forwarding)
                                        -- (Boolean truth flag in TruthReg -- which could be any register)
              JMPT TruthReg, LOOP       -- (1 cycle)
              NOP                       -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))

  • Since 4 is about the same as 3, the startup cycles can't be ignored if N is small and Body Cycles is small
  • Since dealing with single cycle operations -- body cycle count smaller

Down Count -- 29K RISC loop

              CONST index, 0            -- 1 cycle
              JUMP LOOPTEST             -- 1 cycle
              CONST count, N            -- in delay slot
    LOOP:     BodyCycles
              SUBU count, 1
    LOOPTEST: CPGT TruthReg, count, 0
              JMPT TruthReg, LOOP       -- 1 cycle
              ADDS index, 1             -- in delay slot

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))

Efficiency on RISC processors
  • Not much difference between
      • Test at end
      • Down count loop format
  • Processor is highly pipelined -- loop jumps cause the pipeline to stall
      • HOWEVER body-cycle count has decreased
      • Need to take advantage of delay slots
      • Efficiency depends on DSP algorithm being implemented?
  • What about DSP processors?
      • Architecture is designed for efficiency in this area

Check at end -- ADSP-21K loop

              count = 0;                -- (1 cycle)
              number = N;               -- (1 cycle)
              JUMP LOOPTEST (DB);
              NOP;                      -- (1 cycle -- delay slot)
    LOOP:     BODYCYCLES
              count = count + 1;
    LOOPTEST: Comp(count, number);
              IF LT JUMP LOOP (DB);
              NOP;

    Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 5 * (N + 1))

Speed improve -- Possible?

              count = 1;                -- (1 cycle)
              number = N;               -- (1 cycle)
              JUMP LOOPTEST (DB);
              count = count - 1;        -- (1 cycle -- delay slot)
              number = number - 1;
    LOOP:     BODYCYCLES
              count = count + 1;
    LOOPTEST: Comp(count, number);
              IF LT JUMP LOOP (DB);
              count = count + 1;
              NOP;

    Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 4 * (N + 1))

Down Count -- ADSP-21K loop

              number = 0;
              JUMP (PC, LOOPTEST) (DB);     -- (1 cycle)
              index = 0;                    -- (1 cycle -- in delay slot)
              count = N;
    LOOP:     Bodycycles
              count = count - 1;            -- (1 cycle)
    LOOPTEST: Comp(count, number);
              IF GT JUMP (PC, LOOP) (DB);   -- (1 cycle)
              index = index + 1;            -- (1 cycle -- delay slot)
              NOP;

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 5 * (N + 1))

Improved Down Count -- ADSP-21K loop

Is code valid -- or 1 off in times around loop?

              number = -1;                  -- Bias the loop counter
              JUMP (PC, LOOPTEST) (DB);     -- (1 cycle)
              index = 0;                    -- (1 cycle -- in delay slot)
              count = (N-1);
    LOOP:     Body cycles
    LOOPTEST: Comp(count, number);          -- (1 cycle)
              IF GT JUMP (PC, LOOP);
              index = index + 1;            -- (1 cycle -- delay slot)
              count = count - 1;

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))
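One way to probe the "is the code valid" question is to step through the control flow in a small C model. The sketch below rests on assumptions that should be checked against the processor manual: the conditional jump is treated as a delayed branch, so the two instructions after it (`index = index + 1` and `count = count - 1`) run on every pass through LOOPTEST whether or not the jump is taken, and `IF GT` is taken when `count > number`. The names `LoopTrace` and `simulate_biased_down_count` are hypothetical.

```c
typedef struct {
    int body_runs;     /* how many times the loop body executed */
    int first_index;   /* value of index seen by the first pass, -1 if none */
    int last_index;    /* value of index seen by the last pass, -1 if none */
} LoopTrace;

LoopTrace simulate_biased_down_count(int n)
{
    LoopTrace trace = {0, -1, -1};
    int number = -1;        /* biased comparison value */
    int index = 0;          /* set in the delay slot of the first jump */
    int count = n - 1;      /* set in the delay slot of the first jump */

    for (;;) {
        /* LOOPTEST: Comp(count, number); IF GT JUMP (PC, LOOP) */
        int taken = (count > number);

        /* Modelled delay slots: always executed. */
        index = index + 1;
        count = count - 1;

        if (!taken)
            break;

        /* LOOP: body */
        if (trace.body_runs == 0)
            trace.first_index = index;
        trace.last_index = index;
        trace.body_runs++;
    }
    return trace;
}
```

Under these assumptions the body runs exactly N times, so the loop count is not off by one, but the body sees `index` running from 1 to N rather than 0 to N-1 because the increment in the delay slot happens before the first pass.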

Faster Loops
  • Need to go to special features
  • CISC -- special Test, Conditional Jump and Decrement in 1 instruction
  • RISC -- Change algorithm format
  • DSP -- Special hardware for loops
      • Maximum of six-nested loops
      • Can be a hidden trap when writing "C"

Recap -- 68K CISC loop down count

              MOVEQ.L #0, index         -- (FI 4, FC 0, OP 0)
              MOVE.L #N, count
              JMP LOOPTEST              -- (FI 4, FC 8, OP 4)
    LOOP:     BodyCycles
              ADDQ.L #1, index          -- (FI 4, FC 0, OP 0?)
              SUBQ.L #1, count          -- (FI 4, FC 0, OP 0?)
    LOOPTEST: BGT LOOP                  -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))

Since 24 is about the same as 20, the startup cycles can't be ignored if N is small and Body Cycles is small.

Hardware 68K CISC loop

              MOVEQ.L #0, index         -- (FI 4, FC 0, OP 0)
              MOVE.L #(N-1), count
              JMP LOOPTEST              -- (FI 4, FC 8, OP 4)
    LOOP:     BodyCycles
              ADDQ.L #1, index          -- (FI 4, FC 0, OP 0?)
    LOOPTEST: DBCC count, LOOP          -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))

Possibility that efficiency is almost 100% if the body instructions are small enough to fit into cache.

Custom loop hardware on RISC
  • For long loops -- loop overhead small -- no need to be concerned about the loop overhead (unless loop in loop)
  • For small loops -- unroll the loop so that it is hardcoded
  • For medium loops -- advantage over CISC normally is that instructions are more efficient -- 1 cycle compared to 4 to 8 cycles
  • For medium loops -- advantage over DSP normally is that instructions are more efficient
      • 20 instructions rather than 1 instruction looped 20 times
      • 1 RISC cycle compared to 2 DSP cycles -- (not 21K since 1 to 1)
  • For more information
      • See the Micro 1992 articles
      • See the CCI articles

21K processor architecture (figure)

Recap -- Improved Down Count -- 21K DSP loop

              number = -1
              JUMP (PC, LOOPTEST) (DB)      -- (1 cycle)
              index = 0                     -- (1 cycle -- in delay slot)
              count = (N-1)
    LOOP:     Body cycles
    LOOPTEST: Comp(count, number)           -- (1 cycle)
              IF GT JUMP (PC, LOOP)
              index = index + 1             -- (1 cycle -- delay slot)
              count = count - 1

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))

Hardware Loop -- 21K DSP loop

              count = N                         -- (1 cycle)
              count = pass count                -- (1 cycle)
              IF LE JUMP (PC, PASTLOOP) (DB)
              index = 0                         -- (1 cycle -- in delay slot)
              nop
    HARDWARE_LOOP:
              LCNTR = N, do (PC, PASTLOOP-1) until LCE   -- 1 cycle -- parallel instruction
              Body-cycles                       -- Last cycle of loop is at location PASTLOOP-1
    PASTLOOP:
              Rest of the program code

    Loop Efficiency = (N * BodyCycles) / (6 + N * BodyCycles)
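The gain from the hardware loop is easiest to see by putting its efficiency expression next to the best software version from the earlier slide. The sketch below uses the slide cycle counts as given; the function names are hypothetical.

```c
/* Improved software down count on the 21K: 4 set-up, 4 per test. */
double eff_21k_software_loop(double n, double body)
{
    return (n * body) / (4.0 + n * body + 4.0 * (n + 1.0));
}

/* Hardware loop on the 21K: 6 set-up cycles, no per-pass overhead. */
double eff_21k_hardware_loop(double n, double body)
{
    return (n * body) / (6.0 + n * body);
}
```

For a one-cycle body repeated 100 times, the software loop is just under 20% efficient while the hardware loop is about 94% efficient, since the only remaining overhead is the one-off set-up.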

DSP Hardware loop
  • Efficiency from a number of areas
  • Hardware counter
      • No overhead for decrement
      • No overhead for compare
  • Pipelining efficient
      • Processor knows to fetch instructions from start of loop, not past the loop
      • Has some problems if loop size is too small -- loop timing is longer than expected as processor needs to flush the pipeline and restart it

Tackled today
  • Performing access to memory in a loop
  • Loop overhead -- depends on implementation
      • Loop overhead can steal many cycles
      • Standard loop with test at the start -- while ( )
      • Initial test with additional test at end -- do-while ( )
      • Down-counting loops
  • Special efficiencies
      • CISC -- hardware
      • RISC -- intelligent compilers
      • DSP -- hardware