Efficient Placement of Compressed Code for Parallel Decompression
Efficient Placement of Compressed Code for Parallel Decompression
Xiaoke Qin and Prabhat Mishra
Embedded Systems Lab
Computer and Information Science and Engineering
University of Florida, USA
Outline
- Introduction
- Code Compression Techniques
- Efficient Placement of Compressed Binaries
  - Compression Algorithm
  - Code Placement Algorithm
  - Decompression Mechanism
- Experiments
- Conclusion
Why Code Compression?
- Embedded systems are ubiquitous
  - Automobiles, digital cameras, PDAs, cellular phones, medical and military equipment, ...
- Memory imposes cost, area, and energy constraints during embedded systems design
  - Increasing complexity of applications
- Code compression techniques address this by reducing the size of application programs
Code Compression Methodology
- Static encoding (offline): the application program (binary) is run through the compression algorithm and stored as compressed code in memory
- Dynamic decoding (online): the decompression engine restores the code for the processor to fetch and execute
Decompression Engine (DCE)
- Pre-cache design: between memory and cache
- Post-cache design: between cache and processor
  - Decompression has to be very fast (at speed)
  - (+) Cache holds compressed data
  - (+) Reduced bus bandwidth and higher cache hits
  - (+) Improved performance and energy reduction
[Figure: pre-cache DCE sits between main memory and the I-cache; post-cache DCE sits between the I-cache and the processor]
Outline
- Introduction
- Code Compression Techniques
- Efficient Placement of Compressed Binaries
  - Compression Algorithm
  - Code Placement Algorithm
  - Decompression Mechanism
- Experiments
- Conclusion
Code Compression Techniques
- Efficient code compression
  - Huffman coding: Wolfe and Chanin, MICRO 1992
  - LZW: Lin, Xie and Wolf, DATE 2004
  - SAMC/arithmetic coding: Lekatsas and Wolf, TCAD 1999
- Dictionary-based code compression
  - Liao, Devadas and Keutzer, TCAD 1998
  - Prakash et al., DCC 2003
  - Ros and Sutton, CASES 2004
  - Seong and Mishra, ICCAD'06, DATE'07, TCAD 2008
- Divide an instruction into different parts
  - Nam et al., FECCS 1999
  - Lekatsas and Wolf, DAC 1998
  - CodePack: Lefurgy 2000
Dictionary-Based Code Compression
- Format for uncompressed code: decision bit (1 bit) + uncompressed data (32 bits)
- Format for compressed code: decision bit (1 bit) + dictionary index
- Decision bit: 0 = compressed, 1 = not compressed
[Figure: an original program and its compressed form; frequently occurring words are replaced by an index into a small dictionary (entries at index 0 and 1), with a decision bit marking each word as compressed or uncompressed]
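For concreteness, here is a minimal Python sketch of this dictionary-based scheme. The 8-bit word width, the dictionary contents, and the helper name dict_compress are illustrative assumptions, not the exact encoding on this slide.

```python
import math

def dict_compress(words, dictionary):
    """Dictionary-based compression with a 1-bit decision flag.

    words:      fixed-width binary strings (the program)
    dictionary: the most frequently occurring words
    Decision bit: '0' = compressed (dictionary index follows),
                  '1' = not compressed (full word follows).
    """
    index_bits = max(1, math.ceil(math.log2(len(dictionary))))
    lookup = {w: i for i, w in enumerate(dictionary)}
    out = []
    for w in words:
        if w in lookup:  # frequent word: emit its short index
            out.append('0' + format(lookup[w], f'0{index_bits}b'))
        else:            # rare word: emit it verbatim
            out.append('1' + w)
    return ''.join(out)

# Toy example with 8-bit words and a two-entry dictionary.
dictionary = ['00000000', '01000010']
program = ['00000000', '01000010', '10000010', '00000000']
bits = dict_compress(program, dictionary)
print(bits, f'CR = {len(bits) / (len(program) * 8):.1%}')
```

With a two-entry dictionary each index is a single bit, so every dictionary hit shrinks an 8-bit word to 2 bits, which is where the simple scheme's compression comes from.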
Code Compression Techniques
- Efficient compression
  - Huffman coding, arithmetic coding, ...
  - Excellent compression due to complex encoding
  - Slow decompression
    - Not suitable for post-cache decompression
- Fast decompression
  - Dictionary-based, bitmask-based, ...
  - Fast decompression due to simple/fixed encoding
  - Compression efficiency is compromised
- We combine the advantages by employing a novel placement of compressed binaries
How to Accelerate Decompression?
- Divide code into several streams; compress and store each stream separately
- Parallel decompression using multiple decoders
- Problem: unequal compression of the streams
  - Wastage of space
  - Difficult to handle branch targets
Another Alternative
- Always perform fixed encoding
  - variable-to-fixed
  - fixed-to-fixed
  - Sacrifices compression efficiency
Outline
- Introduction
- Code Compression Techniques
- Efficient Placement of Compressed Binaries
  - Compression Algorithm
  - Code Placement Algorithm
  - Decompression Mechanism
- Experiments
- Conclusion
Overview of Our Approach
- Divide code into multiple streams
- Compress each of them separately
- Merge them using our placement algorithm
  - Reduce space wastage
  - Ensure that none of the decoders is idle
Compression Algorithm
Compression using Huffman Coding
- Huffman coding with instruction division and selective compression
- Compression ratio (CR) = compressed size / original size = 60/72 = 83.3%
[Figure: a 72-bit example program divided into two streams, compressed to CRs of 77.8% and 88.9% respectively, for 60 bits overall]
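The compression-ratio arithmetic can be reproduced with a short sketch. The following Python code builds a Huffman code over a toy symbol stream and applies selective compression (a symbol stays uncompressed when its codeword would be no shorter than the raw field). The 4-bit symbols and helper names are assumptions for illustration, not the slide's exact example.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return {symbol: codeword} for a frequency map (>= 2 symbols)."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ''}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def selective_compress(symbols, width=4):
    """Per symbol: '0' + codeword if it is shorter than the raw field,
    else '1' + raw symbol (both cases pay a 1-bit flag)."""
    code = huffman_code(Counter(symbols))
    bits = ''
    for s in symbols:
        cw = code[s]
        bits += '0' + cw if len(cw) < width else '1' + s
    return bits

stream = ['0000'] * 5 + ['1000', '1110', '0100', '0000']
compressed = selective_compress(stream)
print(f'CR = {len(compressed)}/{len(stream) * 4} '
      f'= {len(compressed) / (len(stream) * 4):.1%}')
```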
Example using Two Decoders
- Branch block: the instructions between two consecutive branch targets
- Storage structure: Slot 1 (4 bits) and Slot 2 (4 bits)
- Sufficient decode length: 1 + length of the uncompressed field = 1 + 4 = 5 bits
[Figure: input compressed streams placed into the two-slot storage structure]
Example
Decode-Aware Code Placement

Algorithm: Placement of Two Bitstreams
Input: Storage block
Output: Placed bitstreams
Begin
    if !Ready1 and !Ready2 then
        Assign Stream 1 to Slot 1 and Stream 2 to Slot 2
    else if !Ready1 and Ready2 then
        Assign Stream 1 to Slot 1 and Slot 2
    else if Ready1 and !Ready2 then
        Assign Stream 2 to Slot 1 and Slot 2
    else
        Assign Stream 1 to Slot 1 and Stream 2 to Slot 2
End
(Ready_i: the i-th decoder's buffer already holds sufficient bits)
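A possible software rendering of the placement rule above, under assumed helpers: each stream is a queue of bits, slots are 4 bits wide as in the earlier example, and Ready_i is modeled by counting buffered bits against the sufficient decode length. The per-cycle buffer drain is a simplification; real consumption depends on the decoded codeword.

```python
SLOT_BITS = 4        # slot width from the two-decoder example
SUFFICIENT_LEN = 5   # 1 decision bit + 4-bit uncompressed field

def take(stream, n):
    """Remove and return up to n bits from the front of a bit list."""
    chunk = ''.join(stream[:n])
    del stream[:n]
    return chunk

def place_two_streams(bits1, bits2):
    """Decode-aware placement of two compressed bitstreams.

    buf1/buf2 count the not-yet-decoded bits buffered at each decoder;
    a decoder is 'ready' once it holds the sufficient decode length.
    Each cycle a decoder is assumed to consume at most SUFFICIENT_LEN
    buffered bits (a simplification of the real decoder).
    """
    stream1, stream2 = list(bits1), list(bits2)
    buf1 = buf2 = 0
    blocks = []
    while stream1 or stream2:
        ready1 = buf1 >= SUFFICIENT_LEN
        ready2 = buf2 >= SUFFICIENT_LEN
        if not ready1 and ready2:
            # Decoder 2 already has enough bits: give both slots to stream 1.
            a, b = take(stream1, SLOT_BITS), take(stream1, SLOT_BITS)
            buf1 += len(a) + len(b)
        elif ready1 and not ready2:
            a, b = take(stream2, SLOT_BITS), take(stream2, SLOT_BITS)
            buf2 += len(a) + len(b)
        else:
            # Neither (or both) ready: one slot per stream.
            a, b = take(stream1, SLOT_BITS), take(stream2, SLOT_BITS)
            buf1 += len(a)
            buf2 += len(b)
        blocks.append((a, b))
        buf1 = max(0, buf1 - SUFFICIENT_LEN)  # decoders drain their buffers
        buf2 = max(0, buf2 - SUFFICIENT_LEN)
    return blocks

for slot1, slot2 in place_two_streams('0100111001010010', '110000001000'):
    print(slot1.ljust(SLOT_BITS, '-'), slot2.ljust(SLOT_BITS, '-'))
```

The key design choice is the double-slot assignment: whenever one decoder is ahead, the entire storage block refills the lagging decoder, which is how the placement keeps both decoders busy without padding.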
Decompression Mechanism
Outline
- Introduction
- Code Compression Techniques
- Efficient Placement of Compressed Binaries
  - Compression Algorithm
  - Code Placement Algorithm
  - Decompression Mechanism
- Experiments
- Conclusion
Experimental Setup
- MediaBench and MiBench benchmarks
  - adpcm_enc, adpcm_dec, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit
- Compiled for four target architectures
  - TI TMS320C6x, PowerPC, SPARC, and MIPS
- Compared our approach with CodePack
  - BPA1: bitstream placement for two streams (two decoders work in parallel)
  - BPA2: bitstream placement for four streams (four decoders work in parallel)
Decode Bandwidth
- 2-4x improvement in decode performance
[Chart: decode bandwidth of CodePack, BPA1, and BPA2 across the benchmarks]
Compression Penalty
- Less than 1% penalty in compression performance
Hardware Overhead
- BPA1 and CodePack use similar area/power
- BPA2 requires double the area/power
  - Four 16-bit decoders
- This overhead is negligible: 100-1000x smaller than the typical reduction in overall area and energy achieved by code compression

             Area (um^2)   Power (mW)   Critical path (ns)
  CodePack   122263        7.5          6.91
  BPA1       137529        9.8          5.76
  BPA2       253586        14.6         5.94

Synthesized using Synopsys Design Compiler and the TSMC 0.18 cell library
More than 4 Decoders?
- BPA1 (two decoders): may need 1 startup stall cycle for each branch block
- BPA2 (four decoders): may need 2 startup stall cycles for each branch block
- Proved that BPA1 and BPA2 use exactly 1 and 2 cycles (respectively) more than the optimal placement
- Too many parallel decoders is not profitable
  - The gain in overall output bandwidth is eroded by the additional startup stalls
  - Startup stalls may no longer be negligible compared with the execution time of the branch block itself
Conclusion
- Memory is a major constraint
  - Existing compression methods provide either efficient compression or fast decompression
- Our approach combines the benefits
  - Efficient placement for parallel decompression
  - Up to 4x improvement in decode bandwidth
  - Less than 1% impact on compression efficiency
- Future work
  - Apply it to data compression: data values, FPGA bitstreams, manufacturing test, ...
Thank you!