Implementing a FIRfilter algorithm using MMX instructions by

Implementing a FIR-filter algorithm using MMX instructions by Lars Persson <larper-8@sm. luth. se>

Merging the history buffer and the input buffer When computing the first num_taps-1 samples, we need to access both the input and the history buffer. Depending on the implementation, this might require extra branch instructions in the inner or outer loop. Improved history buffer: … X-3 X-2 X-1 X 0 X 1 X 2 … 0000 num_taps-1 samples num_taps-1 new samples from last call zero-padded

Preparing the taps array The filter tap array is prepared according to the Intel example. That is, it is reversed and 3 shifted copies are made. Also, the number of taps is rounded to a multiple of 4. 0 t 3 t 2 t 1 0 0 0 t 1 t 2 t 3 0 0 0 t 3 t 2 t 1 0

The convolution sum LOOP: // Load 4 samples movq mm 0, [esi] movq mm 1, mm 0 // multiply with taps shifted one // step movq mm 0, mm 1 pmaddwd mm 0, [ebx+ecx] paddd mm 5, mm 0 // preload taps that are shifted 2 // and 3 steps lea edi, [ebx+2*ecx] movq mm 4, [edi] movq mm 7, [edi+ecx] // multiply with taps shifted 2 // steps pmaddwd mm 4, mm 1 paddd mm 3, mm 4 // multiply with taps pmaddwd mm 0, [ebx] paddd mm 6, mm 0 // multiply with taps shifted 3 // steps pmaddwd mm 7, mm 1 paddd mm 2, mm 7 // update pointes for next loop // iter. add esi, 8 add ebx, 8 sub eax, 1 jnz LOOP

Parallel summation // low samples mm 6 mm 5 movq mm 4, mm 6 punpckhdq mm 4, mm 5 punpckldq mm 6, mm 5 paddd mm 6, mm 4 // [ out(n+1) out(n) ] in mm 6 // high samples mm 3 mm 2 movq mm 4, mm 3 punpckhdq mm 4, mm 2 punpckldq mm 3, mm 2 paddd mm 3, mm 4 // [ out(n+3) out(n+2) ] in mm 3

Loop optimization Inner loop keeps as much data as possible in the registers. Only taps and samples are loaded from memory. The parallel summation is done with 8 instructions as compared to 12 instructions in my SSE version. Memory copying is done with the rep instruction prefix. This avoids a branch instruction.

So far about 36 million cycles including float to short conversion. .

Optimizing float to short conversion The C language standard requires that float to integer conversion is done with truncation, i. e. 3. 6 is converted to 3 as opposed to 4 when using rounding. On the X 86 architecture this requires changing the FPU control word which is a very expensive instruction. Solution is to directly call the fistp instruction.

Compiler calls this function once for every conversion. __ftol: 00402 B 24 00402 B 25 00402 B 27 00402 B 2 A 00402 B 2 B 00402 B 2 E 00402 B 2 F 00402 B 33 00402 B 36 push mov add wait fnstcw wait mov or mov 00402 B 3 A 00402 B 3 D fldcw 00402 B 40 00402 B 43 00402 B 46 00402 B 49 00402 B 4 A fldcw fistp mov leave ret ebp, esp, 0 FFFFFFF 4 h word ptr [ebp-2] ax, word ptr [ebp-2] ah, 0 Ch word ptr [ebp-4], ax word ptr [ebp-4] qword ptr [ebp-0 Ch] word ptr [ebp-2] eax, dword ptr [ebp-0 Ch] edx, dword ptr [ebp-8] Optimized conversion routine. mov mov sub ecx, esi, edi, esi, ecx, num_samples input [esi] 1 LOOP 1: fld dword ptr [esi+ecx*4] fistp word ptr [edi+ecx*2] sub ecx, 1 jge LOOP 1