IMPACT OF ORC COMPRESSION BUFFER SIZE
Prasanth Jayachandran, Member of Technical Staff – Apache Hive
ORC Layout
• An ORC writer contains 1 or more child tree writers
• 1 tree writer per primitive column
• Each tree writer has 1 or more streams (ByteBuffers), depending on the column type
  • Integers
    • Row index stream
    • Present stream (absent if there are no nulls in the column)
    • Data stream
  • Strings
    • Row index stream
    • Present stream (absent if there are no nulls in the column)
    • Data stream
    • Length stream
    • Dictionary data stream
• Each stream has the following buffers
  • Uncompressed buffer
  • Compressed buffer (created only if compression is enabled)
  • Overflow buffer (created only if the compression buffer overflows)
• Runtime memory requirement = compression buffer size * number of columns * number of streams per column * number of partitions (in case of dynamic partitioning) * number of buckets * 2 (if compression is enabled)
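The stream layout above can be modeled in a few lines of Python. This is only an illustrative sketch: the names `STREAMS_BY_TYPE`, `streams_for`, and `buffers_per_stream` are hypothetical and do not come from the ORC code base. It assumes the present stream is dropped whenever a column has no nulls, as the slide states, and it does not count the overflow buffer, which exists only when the compression buffer overflows.

```python
# Illustrative model of the per-column streams listed on the slide.
# All names here are hypothetical; this is not the ORC writer API.
STREAMS_BY_TYPE = {
    "int": ["row_index", "present", "data"],
    "string": ["row_index", "present", "data", "length", "dictionary_data"],
}


def streams_for(column_type, has_nulls):
    """Return the streams a tree writer would hold for one column."""
    streams = list(STREAMS_BY_TYPE[column_type])
    if not has_nulls:
        # The present stream is absent when the column contains no nulls.
        streams.remove("present")
    return streams


def buffers_per_stream(compressed):
    """Uncompressed buffer, plus a compressed buffer if compression is on."""
    return 2 if compressed else 1
```

For example, `streams_for("string", has_nulls=False)` yields four streams, which is the per-column stream count the memory estimate later in the deck relies on.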
Test Setup
• Test data
  • 10 million rows
  • 14 string columns
• Test environment
  • Single node
  • 16 GB RAM
  • Default JVM heap sizes used for Hive and Hadoop
    • Default for Hive: 256 MB
    • Default for Hadoop: 1000 MB (child JVMs inherit this)
Impact on file size
Explanation
• Each compressed block is preceded by a 3-byte header that contains the length of the compressed block
• The smaller the compression buffer size, the more compressed blocks are written, and hence the larger the file (additional bytes for headers)
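The header overhead can be estimated with simple arithmetic. The sketch below assumes that every buffer-full of uncompressed stream data becomes exactly one compressed block with one 3-byte header. That is a simplification for illustration, and the function name `header_overhead` is hypothetical.

```python
HEADER_BYTES = 3  # per compressed block, as stated on the slide


def header_overhead(uncompressed_stream_bytes, buffer_size):
    """Estimate the header bytes added to one stream.

    Assumes one compressed block (and one header) per buffer-full of data.
    """
    # Integer ceiling division: a partially filled last buffer still
    # produces a block.
    blocks = -(-uncompressed_stream_bytes // buffer_size)
    return blocks * HEADER_BYTES
```

Under these assumptions, a 1 GiB stream written with an 8 KB buffer carries 393,216 header bytes, while the same stream with a 256 KB buffer carries 12,288.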
Impact on load time
Explanation
• ZLIB uses the DEFLATE compression method with a default window size of 32 KB [1]
  • DEFLATE [2] = LZ77 + Huffman coding
  • When the ORC compression buffer size is >32 KB, multiple windows need to be processed, hence the increased compression and load time
  • From the graph, there is a ~10 s increase for buffer sizes >32 KB
• SNAPPY is only LZ77 [3]
  • Compresses the complete buffer (no window requirement)
  • Compression time/load time is almost the same for all buffer sizes
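A rough way to explore the ZLIB behavior outside of ORC is to compress the same data in independent blocks of different sizes using Python's standard `zlib` module. This sketch is not the ORC writer and does not reproduce its block framing. The helper names are hypothetical, and timings will depend on the data and the machine. Snappy is omitted because it is not part of the Python standard library.

```python
import time
import zlib


def compress_in_blocks(data, buffer_size):
    """Compress data as independent zlib blocks of at most buffer_size bytes."""
    return [
        zlib.compress(data[i:i + buffer_size])
        for i in range(0, len(data), buffer_size)
    ]


def time_compression(data, buffer_size):
    """Return (elapsed seconds, total compressed bytes) for one buffer size."""
    start = time.perf_counter()
    blocks = compress_in_blocks(data, buffer_size)
    elapsed = time.perf_counter() - start
    return elapsed, sum(len(block) for block in blocks)
```

Calling `time_compression(data, size)` for sizes such as 8 KB, 32 KB, and 256 KB on a representative sample gives a rough comparison of compression cost and compressed size per buffer size.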
Impact on query execution time
Explanation
• ZLIB decompression (INFLATE) is fast
  • http://bashitout.com/2009/08/30/Linux-Compression-Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html
• Query used
  • insert overwrite directory '/tmp/foo' select c10, c11, c12, c13, c1, c2, c3, c4, c5, c6, c7, c8, c9 from test_8k_zlib where c14 > '0';
• Compression buffer size does not have a significant impact on query execution time
Impact on runtime memory
Explanation
• Max JVM heap memory = 1000 MB
• 14 string columns
• 4 streams per column (no null values, so the present stream is suppressed)
• 100 partitions
• 8 KB compression buffer size
  • Memory requirement = 8 * 1024 * 14 * 4 * 100 * 2 ~= 92 MB
• 16 KB compression buffer size
  • Memory requirement = 16 * 1024 * 14 * 4 * 100 * 2 ~= 184 MB
• With a 256 KB compression buffer size, the memory requirement is >1000 MB, and hence the job failed with an OOM exception
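The arithmetic above can be checked with a small helper that applies the runtime memory formula from the ORC Layout slide. The function name and parameter names are hypothetical. The sketch assumes 4 streams per string column (present stream suppressed) and a single bucket, matching the test setup.

```python
def estimated_writer_memory(buffer_size, columns, streams_per_column,
                            partitions, buckets=1, compressed=True):
    """Estimate ORC writer buffer memory in bytes using the slide's formula."""
    # Compression doubles the requirement: one uncompressed and one
    # compressed buffer per stream.
    factor = 2 if compressed else 1
    return (buffer_size * columns * streams_per_column
            * partitions * buckets * factor)


# Test setup from the slides: 14 string columns, 4 streams, 100 partitions.
for kb in (8, 16, 256):
    size = estimated_writer_memory(kb * 1024, 14, 4, 100)
    print(f"{kb} KB buffer: {size / 1e6:.0f} MB")
```

With these inputs the estimate is 91,750,400 bytes for 8 KB, 183,500,800 bytes for 16 KB, and 2,936,012,800 bytes for 256 KB, which exceeds the 1000 MB heap.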
References
1. http://tools.ietf.org/html/rfc1950
2. http://tools.ietf.org/html/rfc1951
3. https://code.google.com/p/snappy/source/browse/trunk/format_description.txt
4. http://bashitout.com/2009/08/30/Linux-Compression-Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html