Performance Enhancement of Video Compression Algorithms using SIMD
Valia, Shamik; Jamkar, Saket

Motivation
- Understand the SSE architecture.
- Understand the video compression algorithm and identify its bottlenecks.
- Improve the performance of the video compression algorithm using the SSE platform.

Components of Video Compression Algorithm
- Motion Estimation
- Motion Compensation and Image Subtraction
- Discrete Cosine Transform
- Quantization
- Run Length Encoding
- Huffman Coding

Bottlenecks
- Motion Estimation
  - The process of calculating motion vectors by searching for image blocks from a reference image in a new target image.
- DCT
  - Technique to change image data from the spatial domain to the spatial frequency domain.
  - Highest energy compaction after the Karhunen-Loève transform (KLT).

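The block search described for motion estimation can be sketched in scalar C as follows. This is a minimal illustration, not the authors' code: it assumes 8-bit grayscale frames stored row by row, uses the sum of absolute differences (SAD) as the matching cost, and the names `MotionVector`, `block_sad`, and `full_search` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: displacement of the best match and its SAD cost. */
typedef struct {
    int dx, dy;
    unsigned cost;
} MotionVector;

/* Sum of absolute differences between two n x n blocks sharing one stride. */
static unsigned block_sad(const uint8_t *a, const uint8_t *b, int stride, int n)
{
    unsigned sad = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sad += (unsigned)abs(a[y * stride + x] - b[y * stride + x]);
    return sad;
}

/* Full search: take the n x n block at (bx, by) in the reference frame and
 * try every displacement within +/-range in the target frame, keeping the
 * displacement with the lowest SAD. Candidates outside the frame are skipped. */
MotionVector full_search(const uint8_t *ref, const uint8_t *tgt,
                         int width, int height,
                         int bx, int by, int n, int range)
{
    assert(bx >= 0 && by >= 0 && bx + n <= width && by + n <= height);

    MotionVector best = {0, 0, UINT_MAX};
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int tx = bx + dx, ty = by + dy;
            if (tx < 0 || ty < 0 || tx + n > width || ty + n > height)
                continue;
            unsigned cost = block_sad(ref + by * width + bx,
                                      tgt + ty * width + tx, width, n);
            if (cost < best.cost) {
                best.dx = dx;
                best.dy = dy;
                best.cost = cost;
            }
        }
    }
    return best;
}
```

The inner SAD loop runs once per candidate displacement, which is why it dominates the run time and is the natural target for SIMD.
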
SSE2 Specifics
- Intel C/C++ Compiler 8
  - 3 coding styles
    - Intrinsics
    - Assembly
    - Vector Ops
- Use of intrinsics
  - _mm_sad_epu8 for the __m128i datatype
  - _m_psadbw for the __m64 datatype

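As an illustration of how `_mm_sad_epu8` maps onto the SAD kernel, the sketch below computes the SAD of a 16 x 16 block one row per instruction. It is an assumption about how the intrinsic might be applied, not the authors' implementation; the function names and the `HAVE_SSE2` macro are hypothetical, and a scalar fallback is used when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1 /* hypothetical feature macro for this sketch */
#endif

/* Scalar reference: SAD over a 16 x 16 block of 8-bit pixels. */
unsigned sad16x16_scalar(const uint8_t *a, const uint8_t *b, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(a[y * stride + x] - b[y * stride + x]);
    return sad;
}

/* SSE2 version: each _mm_sad_epu8 handles one 16-pixel row and leaves two
 * partial sums, one in the low 16 bits of each 64-bit half of the register. */
unsigned sad16x16_sse2(const uint8_t *a, const uint8_t *b, int stride)
{
#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        /* Unaligned loads, since block positions are arbitrary. */
        __m128i ra = _mm_loadu_si128((const __m128i *)(a + y * stride));
        __m128i rb = _mm_loadu_si128((const __m128i *)(b + y * stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(ra, rb));
    }
    /* Add the low and high 64-bit halves to get the block total. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad16x16_scalar(a, b, stride);
#endif
}
```

Unaligned loads are chosen here because candidate blocks in a motion search start at arbitrary pixel offsets; an aligned variant would need extra constraints on the frame layout.
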
SSE2 Platform for Motion Estimation
- Algorithms compared: Full Search 16 x 16, Full Search 8 x 8, Three Step 16 x 16, Three Step 8 x 8
- Conditions compared: without SSE, with SSE
- Reported run times, in the order listed: 3 secs, 1 sec, 23 secs, 6 secs, 4 secs, 12 secs, 3 secs

Original Frame from Video
Part of Frames 4 and 5
Motion-Compensated Frames: 16 x 16 and 8 x 8

Discrete Cosine Transform
- The 2-D DCT is extensively used in the JPEG compression algorithm.
- Highly computationally intensive.
- Focus
  - Explore DCT implementation on SSE2.
  - Identify the DCT algorithm that is scalable with the SIMD architecture.

DCT Hardware Accelerator
- Distributed Arithmetic (DA)
  - Choice of DA implementation of DCT
    - Scalable with the SSE platform.
- 2-D 8 x 8 DCT operations can be performed as
  - Preprocessing
  - 1-D DCT (using DA)
  - Transpose
  - 1-D DCT (using DA)
  - Post-processing

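The 1-D DCT, transpose, 1-D DCT sequence can be sketched in C as below. This is only an illustration of the row-column decomposition: it swaps the DA-based 1-D DCT for a direct double-precision orthonormal DCT-II, omits the preprocessing and post-processing steps because the slides do not define them, and adds a final transpose to restore the usual orientation. All function names are hypothetical.

```c
#include <assert.h>

#define N 8

/* cos(m * pi / 16) for m = 0..8, tabulated so no math library is needed. */
static const double COS16[9] = {
    1.0,
    0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* cos(m * pi / 16) for any m >= 0, using periodicity and symmetry. */
static double cos16(int m)
{
    m %= 32;                 /* period is 2*pi, i.e. 32 table steps */
    if (m > 16)
        m = 32 - m;          /* cos(2*pi - t) = cos(t) */
    return (m > 8) ? -COS16[16 - m] : COS16[m]; /* cos(pi - t) = -cos(t) */
}

/* Orthonormal 8-point DCT-II of one row (direct form, not DA). */
static void dct1d(const double in[N], double out[N])
{
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += in[n] * cos16((2 * n + 1) * k);
        /* Scale by sqrt(1/8) for k = 0 and sqrt(2/8) = 0.5 otherwise. */
        out[k] = sum * (k == 0 ? 0.3535533905932738 : 0.5);
    }
}

/* In-place transpose of an 8 x 8 block. */
static void transpose(double b[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = r + 1; c < N; c++) {
            double t = b[r][c];
            b[r][c] = b[c][r];
            b[c][r] = t;
        }
}

/* 2-D DCT: 1-D DCT on rows, transpose, 1-D DCT on rows again, transpose. */
void dct2d(double block[N][N])
{
    double tmp[N];
    for (int pass = 0; pass < 2; pass++) {
        for (int r = 0; r < N; r++) {
            dct1d(block[r], tmp);
            for (int c = 0; c < N; c++)
                block[r][c] = tmp[c];
        }
        transpose(block);
    }
}
```

Because both passes reuse the same row routine, only one 1-D kernel needs to be vectorized or moved to hardware; the transpose is the price paid for that reuse. With this scaling, a constant block of value 16 produces a DC coefficient of approximately 128 and approximately zero elsewhere.
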
1-D DCT on SSE2 using DA
- Diagram inputs: x0+x7, x1+x6, x2+x5, x3+x4, x0-x7, x1-x6, x2-x5, x3-x4
- Diagram blocks and labels: ROM, DAP, R, +, 4, 16, 0.5, 0.25
- Diagram outputs: X0, X1, X2, X3, X4, X5, X6, X7
- Total of 8 DAP structures.
- Each DAP completes its operations in 8 cycles.
- Scalable on various datapaths: 16, 32, 64, 128.
- DAP subword dest, source

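The idea behind a distributed arithmetic unit fed by the sums and differences x0+x7 ... x3-x4 can be sketched in software as follows. This is a generic DA inner product under stated assumptions, not the authors' DAP design: it uses four 16-bit two's-complement inputs, a 16-entry ROM of coefficient sums, and processes one input bit per iteration (16 iterations), whereas the slide quotes 8 cycles per DAP. The coefficient values in the usage example are arbitrary, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 4   /* inputs per DA unit, giving a 2^4-entry ROM */
#define BITS 16  /* input word length (two's complement) */

/* Butterfly preprocessing: form x[i] + x[7-i] and x[i] - x[7-i]. */
void butterfly8(const int16_t x[8], int16_t sum[4], int16_t diff[4])
{
    for (int i = 0; i < 4; i++) {
        sum[i]  = (int16_t)(x[i] + x[7 - i]);
        diff[i] = (int16_t)(x[i] - x[7 - i]);
    }
}

/* ROM entry `addr` holds the sum of the coefficients whose index bit is
 * set in `addr`, so one lookup replaces TAPS multiplications by 0 or 1. */
void da_build_rom(const int32_t coeff[TAPS], int32_t rom[1 << TAPS])
{
    for (int addr = 0; addr < (1 << TAPS); addr++) {
        int32_t sum = 0;
        for (int i = 0; i < TAPS; i++)
            if (addr & (1 << i))
                sum += coeff[i];
        rom[addr] = sum;
    }
}

/* Bit-serial inner product sum(coeff[i] * x[i]) using only lookups,
 * shifts, and adds. Bit b of every input forms the ROM address; the
 * sign bit (b = BITS - 1) carries negative weight in two's complement. */
int64_t da_dot(const int32_t rom[1 << TAPS], const int16_t x[TAPS])
{
    int64_t acc = 0;
    for (int b = 0; b < BITS; b++) {
        int addr = 0;
        for (int i = 0; i < TAPS; i++)
            addr |= (((uint16_t)x[i] >> b) & 1) << i;
        int64_t term = (int64_t)rom[addr] * ((int64_t)1 << b);
        if (b == BITS - 1)
            acc -= term;
        else
            acc += term;
    }
    return acc;
}
```

A typical use would build one ROM per output coefficient with `da_build_rom`, run `butterfly8` on the eight input samples, and call `da_dot` on the four sums for the even outputs and on the four differences for the odd outputs. The attraction for hardware is that no multiplier is needed, only a small ROM and an accumulator.
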
Work Done
- Accomplished
  - Motion Estimation coding and analysis
  - DCT hardware accelerator in Verilog
  - ISA extension for DCT implementation
- To be done
  - Synthesis to get delay and area estimates
  - Assembly code with SSE-DCT enhancements and its performance analysis

Questions