Efficient Loop Handling for DSP algorithms on CISC

































Efficient Loop Handling for DSP algorithms on CISC, RISC and DSP processors
Examples.doc file on web
M. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada
smithmr@ucalgary.ca

Key elements in DSP algorithms
- Instruction fetches – must be efficient
- Data fetches / stores – often multiple – must be efficient
- Multiplication – must be efficient and accurate and remain precise
- Addition – must be efficient and accurate and remain precise
- Decision logic to control above – must be efficient
9/25/2021 ENCM 515 -- High Speed Loops -- Hardware and Software. Copyright smithmr@ucalgary.ca

To be tackled today
- Performing operations on an array
  - Loop overhead can steal many cycles
- Loop overhead -- depends on implementation
  - Standard loop with test at the start -- while ( )
  - Initial test with additional test at end -- do-while ( )
  - Down-counting loops
- Special efficiencies
  - CISC -- hardware
  - RISC -- intelligent compilers
  - DSP -- hardware

Example – Memory Move FIFO
    void MemoryMove_Delay_CPP(int *channel1_in, int *channel2_in, ADISoundSource *sounddemo) {
        int count;

        // Insert new value into the back of the FIFO delay line
        left_delayline[0 + LEFT_DELAY_VALUE] = *channel1_in;

        // Grab delayed value from the front of the FIFO delay line
        *channel1_in = left_delayline[0];

        // Update the FIFO delay line using inefficient
        // memory-to-memory moves
        for (count = 0; count < LEFT_DELAY_VALUE; count++)
            left_delayline[count] = left_delayline[count + 1];
    }

Example – Pointer FIFO
    void PointerFIFO_CPP(int *channel1_in, int *channel2_in, ADISoundSource *sounddemo) {
        // Insert new value into the back of the FIFO delay line
        *pt_in++ = *channel1_in;

        // Grab delayed value from the front of the FIFO delay line
        *channel1_in = *pt_out++;      // May not be ++, could be + ???

        // Wrap each pointer once it runs past the end of the delay line
        if (pt_in > &left_delay[0 + LEFT_DELAY_VALUE])
            pt_in = pt_in - (LEFT_DELAY_VALUE);
        if (pt_out > &left_delay[0 + LEFT_DELAY_VALUE])
            pt_out = pt_out - (LEFT_DELAY_VALUE);
    }
- Requires additional reads and stores of the "static" memory locations where the pointers are stored
- Requires compares and jumps -- pipeline issues on jumps

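The pointer-FIFO idea can be sketched in portable C as a circular buffer. This is a minimal sketch, not the course code: `DELAY_LEN`, `delay_line_t` and `delay_line_step` are hypothetical names, and it assumes a fixed delay so that a single index can serve as both the read and the write position (the slide keeps two separate pointers).

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_LEN 4  /* hypothetical delay length in samples */

/* Hypothetical circular delay line: one slot per delayed sample. */
typedef struct {
    int buf[DELAY_LEN];
    size_t pos;          /* slot that currently holds the oldest sample */
} delay_line_t;

/* Returns the sample written DELAY_LEN calls ago and stores the new one.
   Only one slot is touched per call; nothing is shifted in memory. */
static int delay_line_step(delay_line_t *d, int in)
{
    assert(d != NULL);
    int out = d->buf[d->pos];   /* oldest sample leaves the FIFO */
    d->buf[d->pos] = in;        /* newest sample takes its slot  */
    d->pos++;
    if (d->pos == DELAY_LEN)    /* the compare-and-wrap cost noted on the slide */
        d->pos = 0;
    return out;
}
```

Each call costs one read, one write, one increment and one compare regardless of the delay length, whereas the memory-move version costs one move per delayed sample.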
Real-time FIR Filter
    void ProcessSound(int channel_one, int channel_two, int *left_channel, int *right_channel)
        if (sound_source & FIRFilter) FIRFilter(&channel_one, &channel_two);

    float fircoeffs_30[], fircoeffs[330];

    void FIRFilter(int *channel_one, int *channel_two) {
        // Insert new value into FIFO delay line
        left_delayline[0 + N] = (float) *channel_one;
        right_delayline[0 + N] = (float) *channel_two;
        channel_one_30 = channel_one_330 = 0;

        // Need equivalent of following loop for EACH sound source
        for (count = 0; count < FIRlength - 1; count++) {
            channel_one_30 = channel_one_30 + arrayleft[count] * fir_30[count];
            channel_one_330 = channel_one_330 + arrayright[count] * fir_330[count];
        }
        *channel_one = (int) channel_one_30 + scale_factor * channel_one_330;
        // ditto for channel 2

        // Update Left Channel delay line; Update Right Channel delay line
    }

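The multiply-accumulate loop at the heart of the filter can be isolated as below. This is a minimal sketch under stated assumptions: `fir_sample` and `fir_push` are hypothetical helper names, the delay line is assumed to store the oldest sample at index 0, and the delay-line update uses the simple memory-move approach.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR: y = sum over k of delay[k] * coeffs[k].
   Assumed layout: delay[0] is the oldest sample, delay[taps - 1] the newest. */
static float fir_sample(const float *delay, const float *coeffs, size_t taps)
{
    assert(delay != NULL && coeffs != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < taps; k++)
        acc += delay[k] * coeffs[k];   /* one multiply-accumulate per tap */
    return acc;
}

/* Shift the delay line by one sample (memory-move update) and append `in`. */
static void fir_push(float *delay, size_t taps, float in)
{
    assert(delay != NULL && taps > 0);
    for (size_t k = 0; k + 1 < taps; k++)
        delay[k] = delay[k + 1];
    delay[taps - 1] = in;
}
```

Both loops run once per incoming sample, so any per-iteration overhead is paid `taps` times per sample and per filter.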
Real-time FIR – Hard-coded loop
    channel_one_30 = channel_one_30 + arrayleft[0] * fir_30[0];
    channel_one_30 = channel_one_30 + arrayleft[1] * fir_30[1];
    channel_one_30 = channel_one_30 + arrayleft[2] * fir_30[2];
    channel_one_30 = channel_one_30 + arrayleft[3] * fir_30[3];
    channel_one_30 = channel_one_30 + arrayleft[4] * fir_30[4];
    channel_one_30 = channel_one_30 + arrayleft[5] * fir_30[5];
    channel_one_30 = channel_one_30 + arrayleft[6] * fir_30[6];
    channel_one_30 = channel_one_30 + arrayleft[7] * fir_30[7];
- No loop overhead
- Heavy memory penalty -- FIR filters in Lab. 4 – 300 taps * 4 filters
- Use pt++ type operations and not direct memory access with offset on SOME processors!!

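The trade-off between the looped and hard-coded forms can be seen side by side in C. This is an illustrative sketch only: `TAPS`, `fir8_looped` and `fir8_unrolled` are hypothetical names, and an optimizing compiler may unroll the first version on its own.

```c
#include <assert.h>

#define TAPS 8  /* hypothetical filter length */

/* Looped form: pays an increment, a compare and a branch on every tap. */
static float fir8_looped(const float *x, const float *h)
{
    assert(x && h);
    float acc = 0.0f;
    for (int k = 0; k < TAPS; k++)
        acc += x[k] * h[k];
    return acc;
}

/* Hard-coded form: same arithmetic in the same order, no loop control,
   but one copy of the multiply-accumulate in program memory per tap. */
static float fir8_unrolled(const float *x, const float *h)
{
    assert(x && h);
    float acc = 0.0f;
    acc += x[0] * h[0];
    acc += x[1] * h[1];
    acc += x[2] * h[2];
    acc += x[3] * h[3];
    acc += x[4] * h[4];
    acc += x[5] * h[5];
    acc += x[6] * h[6];
    acc += x[7] * h[7];
    return acc;
}
```

At 8 taps the code growth is minor; at 300 taps and 4 filters the hard-coded form needs roughly 1200 copies of the statement, which is the memory penalty the slide warns about.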
Timing required to handle DSP loops
    for k = 0 to (N-1)        -- Could require many lines
        Body of Code          -- BofC cycles -- Could be 1 line
    Endfor                    -- Could require many lines
- Important feature -- how much overhead time is used in handling the loop construct itself?
- Three components
  - Set up time
  - Body of code time -- BofC cycles
  - Handling the loop itself

Basic Loop body
- Set up loop -- loop overhead -- done once
- Check conditions -- loop overhead -- done many times
- Do Code Body -- done many times -- useful
- Loop Back + counter increment -- loop overhead -- done many times
Define
    Loop Efficiency = (N * Tcodebody) / (Tsetup + N * (Tcodebody + Tconditions + Tloopback))
- Different efficiencies depending on size of the loop
- Need to learn good approximation techniques and recognize the two extremes

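The efficiency definition translates directly into a small helper for trying out numbers. This is a sketch for experimentation; `loop_efficiency` and its parameter names are hypothetical, and all times are assumed to be in cycles.

```c
#include <assert.h>

/* Loop Efficiency =
   N*Tcodebody / (Tsetup + N*(Tcodebody + Tconditions + Tloopback)) */
static double loop_efficiency(unsigned n, double t_setup, double t_body,
                              double t_conditions, double t_loopback)
{
    double useful = (double)n * t_body;
    double total = t_setup + (double)n * (t_body + t_conditions + t_loopback);
    assert(total > 0.0);
    return useful / total;
}
```

The two extremes show up quickly. With N = 100, 12 setup cycles and 60 overhead cycles per pass, a 1000-cycle body gives an efficiency of about 0.94, while a 4-cycle body gives about 0.06.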
3 basic loop constructs
- While loop
  - Main compare test at top of loop
- Modified do-while loop with initial test
  - Initial compare test at top
  - Main compare test at the bottom of the loop
- Down-counting do-while loop with initial test
  - No compare operations in test
  - Relies on the setting of condition code flags during adjustment of the loop counter
  - Can increase overhead in some algorithms

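In C the three constructs look like this when summing an array. This is a sketch for comparison only; the function names are hypothetical, and whether the down-counting form actually avoids a compare instruction depends on the compiler and the target processor.

```c
#include <assert.h>

/* 1. While loop: compare at the top on every pass. */
static long sum_while(const int *a, int n)
{
    assert(n >= 0);
    long sum = 0;
    int i = 0;
    while (i < n) {
        sum += a[i];
        i++;
    }
    return sum;
}

/* 2. Modified do-while: one initial test, main compare at the bottom. */
static long sum_do_while(const int *a, int n)
{
    assert(n >= 0);
    long sum = 0;
    int i = 0;
    if (i < n) {
        do {
            sum += a[i];
            i++;
        } while (i < n);
    }
    return sum;
}

/* 3. Down-counting do-while: the counter runs toward zero, so the
   decrement itself can set the flags that the branch tests. */
static long sum_down_count(const int *a, int n)
{
    assert(n >= 0);
    long sum = 0;
    int count = n;
    if (count > 0) {
        do {
            sum += *a++;          /* pointer walk replaces the array index */
        } while (--count > 0);
    }
    return sum;
}
```

The third form no longer has an up-counting index available, which is why it can increase overhead in algorithms that need the index for something else.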
Figure from Clements -- Microprocessor Systems Design, PWS Publishing, ISBN 0-534-94822-7
- Data from the memory appears near the end of the read cycle

Review -- CISC processor instruction phases
- Fetch -- Obtain op-code
  - PC-value out on Address Bus
  - Instruction op-code at Memory[PC] on Data Bus and then into Instruction Register
- Decode -- Bringing required values (internal or external) to the ALU input
  - Immediate -- Additional memory access for value -- Memory[PC]
  - Absolute -- Additional memory access for address value and then further access for value -- Memory[Memory[PC]]
  - Indirect -- Additional memory access to obtain value at Memory[AddressReg]
- Execute -- ALU operation
- Writeback -- ALU value to internal/external storage
  - May require additional memory accesses to obtain address used during storage
  - May require additional memory operations to perform storage

Basic 68K CISC loop -- Test at start
        MOVE.L  #0, count       -- Set up -- count in register
                                -- Fetch instr. (FI 4) + Fetch 32-bit constant (FC 2 * 4) + operation (OP 0)
    LOOP:
        CMP.L   #N, count       -- (FI 4, FC 8, OP 4 -- 32-bit subtract)
        BGE     somewhere       -- Actually ADD.L #(somewhere - 4), PC
                                -- (add of 16-bit displacement to PC -- FI 4, FC 4, OP (0 or 4))
        Body Cycles             -- doing FIR perhaps
        ADD.L   #1, count
        JMP     LOOP            -- (FI 4, FC 8, OP 4)

    Loop Efficiency = (N * BodyCycles) / (12 + N * (28 + BodyCycles + 32))
- Since 60 >> 12 (5 times), ignore startup cycles even if N is small

Check at end -- 68K CISC loop
        MOVE.L  #0, count       -- (FI 4, FC 8, OP 0)
        JMP     LOOPTEST        -- (FI 4, FC 8, OP 4)
    LOOP:
        Body Cycles             -- doing FIR perhaps
        ADD.L   #1, count       -- (FI 4, FC 8, OP 4)
    LOOPTEST:
        CMP.L   #N, count
        BLT     LOOP            -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (26 + N * BodyCycles + 44 * (N + 1))
- Since 44 > 26 (1.8 times), can't ignore startup cycles if N is small and Body Cycles is small
- Small loop means inefficient

Down Count -- 68K CISC loop
        MOVEQ.L #0, array_index     -- (FI 4, FC 0, OP 0)
        MOVE.L  #N, count
        JMP     LOOPTEST            -- (FI 4, FC 8, OP 4)
    LOOP:
        BodyCycles using instructions of form OPERATION (Addreg, Index)
        ADDQ.L  #1, array_index
        SUBQ.L  #1, count           -- (FI 4, FC 0, OP 0?)
    LOOPTEST:
        BGT     LOOP                -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))
- Since 20 < 24, can't ignore startup if N is small and Body Cycles is small

Down Count -- Possible sometimes
        MOVEA.L #array_start, Addreg    -- (FI 4, FC 0, OP 0)
        MOVE.L  #N, count               -- (FI 4, FC 0, OP 0)
        JMP     LOOPTEST                -- (FI 4, FC 8, OP 4)
    LOOP:
        BodyCycles using autoincrement mode OPCODE (Addreg)+
        SUBQ.L  #1, count               -- (FI 4, FC 0, OP 0?)
    LOOPTEST:
        BGT     LOOP                    -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))
- Since 16 < 24, can't ignore startup if N is small and Body Cycles is small
- NOTE -- Number of cycles needed in body of the loop decreases in this case

Loop Efficiency on CISC processor
- Efficiency depends on how loop constructed
  - Standard while-loop
  - Check at end -- modified do-while
  - Down counting -- with/without auto-incrementing addressing modes
- Compiler versus hardcode efficiency
  - See Embedded System Design magazine Sept./Oct. 2000
  - Local copy available from the web-page
- What happens with different processor architectures?

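The four 68K loop formats can be compared numerically by plugging the cycle counts quoted on the preceding slides into one model. This is a sketch: `loop_model_t`, `m68k_models` and `model_efficiency` are hypothetical names, the constants are copied from the slides' efficiency formulas, and the body cycle count is held equal across formats even though the autoincrement version would normally shorten the body as well.

```c
#include <assert.h>

/* Cycle model: total = setup + N*body + overhead*(N + extra_passes). */
typedef struct {
    const char *name;
    double setup;        /* cycles spent once before the loop            */
    double overhead;     /* loop-control cycles per pass through the test */
    int extra_passes;    /* 1 where the slides count N+1 test passes      */
} loop_model_t;

static const loop_model_t m68k_models[] = {
    { "test at start",            12.0, 60.0, 0 },  /* 28 + 32 per pass */
    { "check at end",             26.0, 44.0, 1 },
    { "down count",               24.0, 20.0, 1 },
    { "down count, autoincrement", 24.0, 16.0, 1 },
};

static double model_efficiency(const loop_model_t *m, int n, double body)
{
    assert(m != 0 && n > 0 && body > 0.0);
    double useful = n * body;
    double total = m->setup + useful + m->overhead * (n + m->extra_passes);
    return useful / total;
}
```

For N = 10 and a 40-cycle body the model gives roughly 0.40, 0.44, 0.62 and 0.67 for the four formats in the order listed, so the loop construct alone moves efficiency by more than 25 percentage points.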
Check at end -- 29K RISC loop
        CONST   count, 0
        JUMP    LOOPTEST                -- (1 cycle)
        NOP                             -- (1 cycle -- delay slot)
    LOOP:
        Bodycycles                      -- autoincrementing mode -- NOT AN OPTION ON 29K
        ADDU    count, 1                -- (1 cycle)
    LOOPTEST:
        CPLE    TruthReg, count, N      -- (1 cycle should be 2 -- register forwarding)
                                        -- (Boolean Truth Flag in TruthReg -- which could be any register)
        JMPT    TruthReg, LOOP          -- (1 cycle)
        NOP                             -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))
- Since 4 is about the same as 3, can't ignore startup if N is small and Body Cycles is small
- Since dealing with single cycle operations -- body cycle count smaller

Down Count -- 29K RISC loop
        CONST   index, 0                -- 1 cycle
        JUMP    LOOPTEST                -- 1 cycle
        CONST   count, N                -- in delay slot
    LOOP:
        BodyCycles
        SUBU    count, 1
    LOOPTEST:
        CPGT    TruthReg, count, 0
        JMPT    TruthReg, LOOP          -- 1 cycle
        ADDS    index, 1                -- in delay slot

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))

Efficiency on RISC processors
- Not much difference between
  - Test at end
  - Down count loop format
- Processor is highly pipelined -- Loop jumps cause the pipeline to stall
  - HOWEVER body-cycle count has decreased
  - Need to take advantage of delay slots
- Efficiency depends on DSP algorithm being implemented?
- What about DSP processors?
  - Architecture is designed for efficiency in this area

Check at end -- ADSP-21K loop
        count = 0;
        number = N;
        JUMP LOOPTEST (DB);
        NOP;
    LOOP:
        BODYCYCLES
        count = count + 1;
    LOOPTEST:
        Comp(count, number);            -- (1 cycle)
        IF LT JUMP LOOP (DB);           -- (1 cycle)
        NOP;                            -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 5 * (N + 1))

Speed improve -- Possible?
        count = 1;
        number = N;
        JUMP LOOPTEST (DB);
        count = count - 1;
        number = number - 1;
    LOOP:
        BODYCYCLES
        count = count + 1;
    LOOPTEST:
        Comp(count, number);            -- (1 cycle)
        IF LT JUMP LOOP (DB);           -- (1 cycle)
        count = count + 1;              -- (1 cycle -- delay slot)
        NOP;

    Loop Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 4 * (N + 1))

Down Count -- ADSP-21K loop
        number = 0;
        JUMP (PC, LOOPTEST) (DB);       -- (1 cycle)
        index = 0;                      -- (1 cycle -- in delay slot)
        count = N;
    LOOP:
        Bodycycles
        count = count - 1;              -- (1 cycle)
    LOOPTEST:
        Comp(count, number);
        IF GT JUMP (PC, LOOP) (DB);     -- (1 cycle)
        index = index + 1;              -- (1 cycle -- delay slot)
        NOP;

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 5 * (N + 1))

Improved Down Count -- ADSP-21K loop
- Is code valid -- or 1 off in times around loop?
        number = -1;                    -- Bias the loop counter
        JUMP (PC, LOOPTEST) (DB);       -- (1 cycle)
        index = 0;                      -- (1 cycle -- in delay slot)
        count = (N-1);
    LOOP:
        Body cycles
    LOOPTEST:
        Comp(count, number);
        IF GT JUMP (PC, LOOP);          -- (1 cycle)
        index = index + 1;              -- (1 cycle -- delay slot)
        count = count - 1;

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))

Faster Loops
- Need to go to special features
  - CISC -- special Test, Conditional Jump and Decrement in 1 instruction
  - RISC -- Change algorithm format
  - DSP -- Special hardware for loops
    - Maximum of six-nested loops
    - Can be a hidden trap when writing "C"

Recap -- 68K CISC loop down count
        MOVEQ.L #0, index           -- (FI 4, FC 0, OP 0)
        MOVE.L  #N, count
        JMP     LOOPTEST            -- (FI 4, FC 8, OP 4)
    LOOP:
        BodyCycles
        ADDQ.L  #1, index           -- (FI 4, FC 0, OP 0?)
        SUBQ.L  #1, count           -- (FI 4, FC 0, OP 0?)
    LOOPTEST:
        BGT     LOOP                -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))
- Since 24 is about the same as 20, can't ignore startup if N is small and Body Cycles is small

Hardware 68K CISC loop
        MOVEQ.L #0, index           -- (FI 4, FC 0, OP 0)
        MOVE.L  #(N-1), count
        JMP     LOOPTEST            -- (FI 4, FC 8, OP 4)
    LOOP:
        BodyCycles
        ADDQ.L  #1, index           -- (FI 4, FC 0, OP 0?)
    LOOPTEST:
        DBCC    count, LOOP         -- (FI 4, FC 4, OP 4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))
- Possibility that efficiency is almost 100% if the body instructions are small enough to fit into cache

Custom loop hardware on RISC
- For long loops -- loop overhead small -- no need to be concerned about the loop overhead (unless loop in loop)
- For small loops -- unroll the loop so that it is hardcoded
  - 20 instructions rather than 1 instruction looped 20 times
- For medium loops -- advantage over CISC normally is that instructions are more efficient -- 1 cycle compared to 4 -- 8 cycles
- For medium loops -- advantage over DSP normally is that instructions are more efficient
  - 1 RISC cycle compared to 2 DSP cycles -- (not 21K since 1 to 1)
- For more information
  - See the Micro 1992 articles
  - See the CCI articles

21K Processor architecture (figure)

Recap -- Improved Down Count -- 21K DSP loop
        number = -1
        JUMP (PC, LOOPTEST) (DB)        -- (1 cycle)
        index = 0                       -- (1 cycle -- in delay slot)
        count = (N-1)
    LOOP:
        Body cycles
    LOOPTEST:
        Comp(count, number)
        IF GT JUMP (PC, LOOP)           -- (1 cycle)
        index = index + 1               -- (1 cycle -- delay slot)
        count = count - 1

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))

Hardware Loop -- 21K DSP loop
        count = N
        count = pass count                          -- (1 cycle)
        IF LE JUMP (PC, PASTLOOP) (DB)              -- (1 cycle)
        index = 0                                   -- (1 cycle -- in delay slot)
        nop
    HARDWARE_LOOP:
        LCNTR N; do (PC, PASTLOOP-1) until LCE      -- 1 cycle -- parallel instruction
        Body-cycles
    PASTLOOP:                                       -- Last cycle of loop is at location PASTLOOP-1
        Rest of the program code

    Loop Efficiency = (N * BodyCycles) / (6 + N * BodyCycles)

DSP Hardware loop
- Efficiency from a number of areas
- Hardware counter
  - No overhead for decrement
  - No overhead for compare
- Pipelining efficient
  - Processor knows to fetch instructions from start of loop, not past the loop
  - Has some problems if loop size is too small -- loop timing is longer than expected as processor needs to flush the pipeline and restart it

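The gain from the hardware loop can be estimated from the 21K cycle counts quoted on the preceding slides: 4 + N*BodyCycles + 4*(N+1) for the improved down-count software loop and 6 + N*BodyCycles for the hardware loop. The sketch below simply evaluates those two expressions; the function names are hypothetical, and the model deliberately ignores the extra pipeline cost for very short loops mentioned above.

```c
#include <assert.h>

/* Improved down-count software loop on the 21K:
   4 setup cycles, 4 loop-control cycles per test pass, N+1 test passes. */
static long cycles_software_loop(long n, long body)
{
    assert(n >= 0 && body >= 0);
    return 4 + n * body + 4 * (n + 1);
}

/* Hardware loop on the 21K: 6 setup cycles, no per-pass loop control. */
static long cycles_hardware_loop(long n, long body)
{
    assert(n >= 0 && body >= 0);
    return 6 + n * body;
}
```

For N = 100 and a single-cycle body the model gives 508 cycles for the software loop against 106 for the hardware loop, which corresponds to efficiencies of about 20% and 94%.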
Tackled today
- Performing access to memory in a loop
  - Loop overhead can steal many cycles
- Loop overhead -- depends on implementation
  - Standard loop with test at the start -- while ( )
  - Initial test with additional test at end -- do-while ( )
  - Down-counting loops
- Special efficiencies
  - CISC -- hardware
  - RISC -- intelligent compilers
  - DSP -- hardware