Implementation of DWT using SSE Instruction Set
Mehta, Ami; Muller, Gilles
Lifting-based 2D DWT
- Lifting
  - 1D horizontal lifting
  - 1D vertical lifting
- Fixed point
  - (9,7)-tap biorthogonal filter
  - Lossy compression
  - High compression levels
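
The 1D lifting pass that both the horizontal and vertical stages rely on can be sketched as follows. This is a minimal illustration, not the authors' code: the Q12 fixed-point format, the symmetric border extension, and the names `fxmul`, `lift_step` and `dwt97_lift_1d` are assumptions made for the example. The coefficients are the commonly published (9,7) lifting constants, and the final scaling step is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed Q12 fixed-point format for the lifting coefficients. */
#define FRAC  12
#define ALPHA (-6497) /* round(-1.586134342 * 4096) */
#define BETA  (-217)  /* round(-0.052980119 * 4096) */
#define GAMMA 3616    /* round( 0.882911076 * 4096) */
#define DELTA 1817    /* round( 0.443506852 * 4096) */

/* Multiply a sample by a Q12 coefficient, with rounding.
   Assumes >> on a negative value is an arithmetic shift (true for GCC and Clang). */
static int32_t fxmul(int32_t coef, int32_t v) {
    return (int32_t)(((int64_t)coef * v + (1 << (FRAC - 1))) >> FRAC);
}

/* One lifting step: every sample of parity `start` (0 or 1) is updated from its
   two neighbours; at the borders the missing neighbour is mirrored. */
static void lift_step(int32_t *x, size_t n, size_t start, int32_t coef) {
    for (size_t i = start; i < n; i += 2) {
        int32_t left  = x[i > 0 ? i - 1 : i + 1];
        int32_t right = x[i + 1 < n ? i + 1 : i - 1];
        x[i] += fxmul(coef, left + right);
    }
}

/* In-place forward (9,7) lifting on n samples (n even, n >= 2).
   Afterwards even indices hold low-pass values and odd indices high-pass values.
   The final scaling by K is omitted in this sketch. */
void dwt97_lift_1d(int32_t *x, size_t n) {
    assert(n >= 2 && n % 2 == 0);
    lift_step(x, n, 1, ALPHA); /* predict 1 */
    lift_step(x, n, 0, BETA);  /* update 1  */
    lift_step(x, n, 1, GAMMA); /* predict 2 */
    lift_step(x, n, 0, DELTA); /* update 2  */
}
```

Applying `dwt97_lift_1d` to each row gives the horizontal pass; applying the same steps down each column gives the vertical pass. On a constant signal the high-pass outputs come out at or near zero, which is a quick sanity check for the coefficients.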
2D DWT matrix layout
- Mallat strategy
  - Uses an auxiliary matrix to store the results of the horizontal filtering.
  - No memory scattering: the horizontal high- and low-frequency components are not interleaved in memory.
  - Allows better exploitation of SIMD parallelism.
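
To make the layout concrete, the sketch below shows how one horizontally filtered row could be written into the auxiliary matrix so that the low and high bands end up in separate, contiguous halves. The function name `mallat_store_row` and the assumption that the filtered row arrives with low-pass values at even indices and high-pass values at odd indices are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies one horizontally filtered row (low-pass at even indices, high-pass at
   odd indices) into a row of the auxiliary matrix: low-pass values fill the
   left half, high-pass values fill the right half. */
void mallat_store_row(const int32_t *filtered, int32_t *aux_row, size_t width) {
    assert(width % 2 == 0);
    size_t half = width / 2;
    for (size_t i = 0; i < half; ++i) {
        aux_row[i]        = filtered[2 * i];     /* low band  */
        aux_row[half + i] = filtered[2 * i + 1]; /* high band */
    }
}
```

Because each band is contiguous after this step, the vertical pass can load several adjacent columns of the same band with a single vector load, which is the property the slide points to.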
Optimizations
- Cache
  - The two matrices are aligned on the cache row size (128 bits = 16 B) so that data can be fetched in one cycle.
    [Figure: cache layout and accesses without alignment vs. with alignment]
  - The input and output matrices are juxtaposed in memory to prevent associativity conflicts in a direct-mapped cache.
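
One way to obtain such a layout is to allocate both matrices in a single aligned block, with each row padded to a multiple of 16 bytes. The sketch below assumes C11 `aligned_alloc` and 32-bit samples; the struct `dwt_buffers` and its functions are hypothetical names for illustration. Whether juxtaposition actually avoids conflicts depends on the matrix size relative to the cache geometry, which the slides do not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN 16 /* bytes: the 128-bit size quoted above */

typedef struct {
    int32_t *in;     /* input matrix                              */
    int32_t *out;    /* output matrix, placed directly after `in` */
    size_t   stride; /* elements per row, including padding       */
} dwt_buffers;

/* Allocates both matrices in one 16-byte-aligned block. Returns 0 on success. */
int dwt_buffers_alloc(dwt_buffers *b, size_t rows, size_t cols) {
    assert(rows > 0 && cols > 0);
    size_t per_vec = ALIGN / sizeof(int32_t);               /* 4 samples per 16 B */
    b->stride = (cols + per_vec - 1) / per_vec * per_vec;   /* pad each row       */
    size_t bytes = 2 * rows * b->stride * sizeof(int32_t);  /* multiple of ALIGN  */
    b->in = aligned_alloc(ALIGN, bytes);                    /* C11                */
    if (b->in == NULL) return -1;
    b->out = b->in + rows * b->stride;                      /* juxtaposed         */
    return 0;
}

void dwt_buffers_free(dwt_buffers *b) {
    free(b->in);
    b->in = b->out = NULL;
}
```

With the padded stride, every row of both matrices starts on a 16-byte boundary, so aligned 128-bit loads and stores can be used on any row.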
Optimizations (continued)
- SIMD code
  - Uses SSE2.
  - Computes 4 pixels in parallel using fixed-point arithmetic.
- Profiling the C code showed that the column transform and cache accesses were the main bottlenecks.
- Intermediate values in the DWT are reused: intermediate computations are kept instead of being recalculated.
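
As an illustration of processing 4 pixels at once, the sketch below applies one vertical lifting step to a whole row using SSE2 intrinsics, so four adjacent columns are updated per iteration. It is a sketch under assumptions, not the authors' kernel: 32-bit samples, a Q12 coefficient format, and the names `mullo32_sse2` and `lift_rows_sse2` are all chosen for the example. Unaligned loads are used so the code works with any pointer; with 16-byte-aligned rows, `_mm_load_si128` and `_mm_store_si128` could be used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

#define FRAC 12 /* assumed Q12 coefficient format */

/* SSE2 has no 32-bit low multiply (it arrived with SSE4.1), so one is built
   from two 32x32->64 multiplies on the even and odd lanes. */
static __m128i mullo32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

/* One vertical lifting step over a row:
   row[c] += (coef * (above[c] + below[c]) + rounding) >> FRAC,
   four columns per iteration. cols must be a multiple of 4. */
void lift_rows_sse2(int32_t *row, const int32_t *above, const int32_t *below,
                    size_t cols, int32_t coef) {
    assert(cols % 4 == 0);
    const __m128i vcoef  = _mm_set1_epi32(coef);
    const __m128i vround = _mm_set1_epi32(1 << (FRAC - 1));
    for (size_t c = 0; c < cols; c += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(above + c));
        __m128i b = _mm_loadu_si128((const __m128i *)(below + c));
        __m128i r = _mm_loadu_si128((const __m128i *)(row + c));
        __m128i p = mullo32_sse2(vcoef, _mm_add_epi32(a, b));
        p = _mm_srai_epi32(_mm_add_epi32(p, vround), FRAC); /* arithmetic shift */
        _mm_storeu_si128((__m128i *)(row + c), _mm_add_epi32(r, p));
    }
}
```

Walking along rows while updating four columns at a time keeps memory accesses sequential, which is one plausible way to address the column-transform and cache bottleneck mentioned above.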