Implementation of DWT using SSE Instruction Set
Mehta, Ami; Muller, Gilles

Lifting-based 2D DWT
- Lifting
  - 1D horizontal lifting
  - 1D vertical lifting
- Fixed point
  - (9,7)-tap biorthogonal filter
  - Lossy compression
  - High compression levels
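A minimal C sketch of one fixed-point 1D lifting pass, under assumptions not stated on the slide: the samples are already split into even (`s`) and odd (`d`) arrays, coefficients are stored in Q16 format, borders use symmetric extension, and the final scaling step is omitted. The coefficient values are the commonly published lifting factorization of the (9,7) filter; the names `Q16`, `mul_q16`, and `lift97_forward_1d` are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a real coefficient to Q16 fixed point, rounded to nearest. */
#define Q16(x) ((int32_t)((x) * 65536.0 + ((x) >= 0 ? 0.5 : -0.5)))

/* Commonly published lifting coefficients for the (9,7) biorthogonal filter. */
static const int32_t ALPHA = Q16(-1.586134342);
static const int32_t BETA  = Q16(-0.05298011854);
static const int32_t GAMMA = Q16(0.8829110762);
static const int32_t DELTA = Q16(0.4435068522);

/* Multiply a Q16 coefficient by an integer sample, rounding to nearest. */
static int32_t mul_q16(int32_t coef, int32_t v)
{
    int64_t p = (int64_t)coef * v;
    return (int32_t)((p + (p >= 0 ? 32768 : -32768)) / 65536);
}

/*
 * Forward (9,7) lifting on one line, in place.
 * s[0..n-1] holds the even samples, d[0..n-1] the odd samples.
 * Assumption: symmetric border extension; final scaling is left out.
 */
void lift97_forward_1d(int32_t *s, int32_t *d, size_t n)
{
    assert(n > 0);

    for (size_t i = 0; i < n; i++)      /* step 1: odd samples from even neighbours */
        d[i] += mul_q16(ALPHA, s[i] + s[i + 1 < n ? i + 1 : n - 1]);

    for (size_t i = 0; i < n; i++)      /* step 2: even samples from odd neighbours */
        s[i] += mul_q16(BETA, d[i > 0 ? i - 1 : 0] + d[i]);

    for (size_t i = 0; i < n; i++)      /* step 3: same shape as step 1 */
        d[i] += mul_q16(GAMMA, s[i] + s[i + 1 < n ? i + 1 : n - 1]);

    for (size_t i = 0; i < n; i++)      /* step 4: same shape as step 2 */
        s[i] += mul_q16(DELTA, d[i > 0 ? i - 1 : 0] + d[i]);
}
```

For a constant input line, the high-frequency outputs in `d` should come out at zero or within rounding of it, which is a quick sanity check on the coefficients.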

2D DWT matrix layout
- Mallat strategy
  - Uses an auxiliary matrix to store the results of the horizontal filtering.
  - No memory scattering: horizontal high- and low-frequency components are not interleaved in memory.
  - Allows better exploitation of SIMD parallelism.
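The non-interleaved placement can be sketched in C as follows. This only illustrates where the data lands in the auxiliary matrix: even-indexed samples (which become the low-frequency half) go to the left half of each row and odd-indexed samples (the high-frequency half) to the right half. The filter arithmetic is left out, and the function names and the row-major `int32_t` layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy one row into the auxiliary row so that the two sub-bands are
 * contiguous: even samples in aux[0..width/2), odd samples in
 * aux[width/2..width).
 */
void split_row_mallat(const int32_t *src, int32_t *aux, size_t width)
{
    assert(width % 2 == 0);
    size_t half = width / 2;

    for (size_t i = 0; i < half; i++) {
        aux[i]        = src[2 * i];      /* future low-frequency half  */
        aux[half + i] = src[2 * i + 1];  /* future high-frequency half */
    }
}

/* Apply the split to every row of a row-major rows x width matrix. */
void split_rows_mallat(const int32_t *in, int32_t *aux,
                       size_t rows, size_t width)
{
    for (size_t r = 0; r < rows; r++)
        split_row_mallat(in + r * width, aux + r * width, width);
}
```

With each sub-band stored contiguously, a later vertical pass can read neighbouring columns of the same sub-band with a single vector load, which is presumably what the slide means by better SIMD exploitation.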

Optimizations
- Cache
  - The two matrices are aligned on the cache row size (128 bits = 16 B) so that data can be fetched in one cycle.
    [Figure: cache layout and access without alignment vs. with alignment]
  - The input and output matrices are juxtaposed in memory to prevent associativity conflicts in a direct-mapped cache.
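One way to obtain this layout in C is a single 16-byte-aligned allocation that holds both matrices back to back. The sketch below assumes C11 `aligned_alloc`, `int32_t` samples, and a row stride padded to a multiple of 16 bytes so that every row also starts aligned; the `dwt_buffers` type and its functions are hypothetical names for illustration, not the authors' code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES 16  /* 128 bits, the alignment quoted on the slide */

typedef struct {
    int32_t *in;     /* input matrix                             */
    int32_t *out;    /* output matrix, placed directly after it  */
    size_t   stride; /* elements per row, padded for alignment   */
} dwt_buffers;

/* Allocate both matrices in one aligned block. Returns 0 on success. */
int dwt_buffers_alloc(dwt_buffers *b, size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);

    size_t per_vec = ALIGN_BYTES / sizeof(int32_t);            /* 4 samples */
    size_t stride  = (cols + per_vec - 1) / per_vec * per_vec; /* round up  */
    size_t bytes   = 2 * rows * stride * sizeof(int32_t);      /* multiple of 16 */

    int32_t *base = aligned_alloc(ALIGN_BYTES, bytes);
    if (base == NULL)
        return -1;

    b->in     = base;
    b->out    = base + rows * stride;  /* juxtaposed with the input */
    b->stride = stride;
    return 0;
}

/* Release the single block that backs both matrices. */
void dwt_buffers_free(dwt_buffers *b)
{
    free(b->in);
    b->in = b->out = NULL;
}
```

Because the stride is a multiple of four samples, `in + r * stride` and `out + r * stride` are 16-byte aligned for every row `r`.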

Optimizations …
- SIMD code
  - Uses SSE2.
  - Computes 4 pixels in parallel using fixed-point arithmetic.
- Profiling the C code showed that the column transform and cache accesses were the main bottlenecks.
- In the DWT, intermediate values are reused: the intermediate computations are kept instead of being recalculated.
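A hedged sketch of what one column (vertical) lifting step over four adjacent pixels might look like with SSE2 intrinsics. Assumptions: 32-bit samples, a Q12 fixed-point coefficient, products that fit in 32 bits, and rounding by adding half before an arithmetic shift. SSE2 has no packed 32-bit low multiply, so the example emulates one with `_mm_mul_epu32`. Unaligned loads are used so the sketch works with any buffer; with 16-byte-aligned matrices, `_mm_load_si128` and `_mm_store_si128` could be used instead. The function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Low 32 bits of each 32x32 product, built from SSE2 operations only. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                     /* lanes 0 and 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));     /* lanes 1 and 3 */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

/*
 * dst[j] += (coef_q12 * (above[j] + below[j]) + 2048) >> 12
 * for a whole row, four columns per iteration.
 */
void lift_rows_sse2(int32_t *dst, const int32_t *above, const int32_t *below,
                    size_t width, int32_t coef_q12)
{
    assert(width % 4 == 0);

    const __m128i c    = _mm_set1_epi32(coef_q12);
    const __m128i half = _mm_set1_epi32(1 << 11);  /* rounding term */

    for (size_t j = 0; j < width; j += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(above + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(below + j));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + j));

        __m128i sum  = _mm_add_epi32(a, b);             /* neighbour rows  */
        __m128i prod = mullo_epi32_sse2(sum, c);        /* Q12 product     */
        __m128i term = _mm_srai_epi32(_mm_add_epi32(prod, half), 12);

        _mm_storeu_si128((__m128i *)(dst + j), _mm_add_epi32(d, term));
    }
}
```

Working row against row means the four lanes are four neighbouring columns, so the vertical pass needs no shuffling of pixel data, only the shuffle inside the multiply emulation.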

Results
- Image size: 1024 x 1024
- Profiling results obtained with VTune Analyzer: cycles per uop improve from 3.38 to 2.28, an improvement of 32.5%.

Results …

Thank you