Performance Study of Parquet Codecs openlab Summer Students
Javier García Rubio
15/08/2019
WHAT IS THE MOTIVATION
- Large production data systems at CERN, such as NXCALS (Next CERN Accelerator Logging Service), use Parquet
- We want to understand how the different codecs (compression) affect:
  - Data storage
  - CPU load
  - Data access performance
- Oracle is developing a new codec for Parquet, and this work sets up the basis for later comparisons
HOW I DID IT AND WHAT I USED
- Using real data from NXCALS:
  - Convert Parquet files to JSON and apply each of the existing codecs:
    - GZIP
    - SNAPPY
    - PARQUET (NO CODEC)
    - LZ4
  - Measure the compression rate
  - Measure the CPU time
  - Evaluate the impact
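The measurement steps above can be sketched as a small harness. This is only an illustration, not the code used in the study: it assumes the compression rate is the percentage size reduction relative to the uncompressed JSON payload (the slides do not state the formula), it uses hypothetical generated records instead of NXCALS data, and it uses Python's standard-library zlib (DEFLATE, the algorithm behind GZIP) as a stand-in for the Parquet codecs. The names `measure`, `records` and `codecs` are made up for this example.

```python
import json
import time
import zlib


def measure(compress, payload, repeats=5):
    """Return (compression_pct, avg_cpu_ms) for one codec callable.

    Assumption: compression % = (1 - compressed_size / original_size) * 100.
    """
    cpu_ms = []
    compressed = b""
    for _ in range(repeats):
        start = time.process_time()  # CPU time of this process, not wall time
        compressed = compress(payload)
        cpu_ms.append((time.process_time() - start) * 1000.0)
    compression_pct = (1.0 - len(compressed) / len(payload)) * 100.0
    return compression_pct, sum(cpu_ms) / len(cpu_ms)


# Hypothetical stand-in for logging records (not real NXCALS data).
records = [
    {"device": "dev-%d" % (i % 10), "timestamp": i, "value": i * 0.5}
    for i in range(5000)
]
payload = json.dumps(records).encode("utf-8")

# Each codec is just a bytes -> bytes callable, so other codecs can be plugged in.
codecs = {
    "NO CODEC": lambda data: data,
    "DEFLATE (zlib)": lambda data: zlib.compress(data, 6),
}

results = {name: measure(fn, payload) for name, fn in codecs.items()}
for name, (pct, ms) in results.items():
    print(f"{name:15s} {pct:6.1f}% {ms:8.2f} ms")
```

Averaging over several repeats mirrors the "TIME AVG" column in the results. To compare real Parquet codecs, the callables would be replaced by calls into a Parquet writer with the chosen compression setting.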
WHAT ARE THE RESULTS
Performance Analysis (compression rate)

CODEC      % COMPRESSION
GZIP       96.8%
LZ4        96%
SNAPPY     95.9%
PARQUET    95.9%

[Bar chart: % COMPRESSION per codec (GZIP, LZ4, SNAPPY, PARQUET)]
Performance Analysis (CPU time)

CODEC      TIME AVG (ms)
GZIP       7784.5
PARQUET    7157.5
SNAPPY     6921
LZ4        6867

[Bar chart: TIME AVG per codec (GZIP, PARQUET, SNAPPY, LZ4)]
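To put the two result tables side by side, each codec can be expressed relative to the run without a codec. The sketch below is one possible reading, not necessarily the comparison made in the slides: it assumes PARQUET (NO CODEC) is the baseline, reports the compression gain in percentage points and the change in average CPU time in percent, and uses only the values reported above. The names `relative_to_baseline` and `summary` are made up for this example.

```python
# Values copied from the two result tables in the slides.
compression_pct = {"GZIP": 96.8, "LZ4": 96.0, "SNAPPY": 95.9, "PARQUET": 95.9}
avg_time_ms = {"GZIP": 7784.5, "PARQUET": 7157.5, "SNAPPY": 6921.0, "LZ4": 6867.0}

BASELINE = "PARQUET"  # assumption: the no-codec run is the reference point


def relative_to_baseline(codec):
    """Return (extra compression in points, CPU time change in %) vs the baseline."""
    extra_points = compression_pct[codec] - compression_pct[BASELINE]
    time_change_pct = (
        (avg_time_ms[codec] - avg_time_ms[BASELINE]) / avg_time_ms[BASELINE] * 100.0
    )
    return round(extra_points, 1), round(time_change_pct, 2)


summary = {codec: relative_to_baseline(codec) for codec in compression_pct}
for codec, (points, time_pct) in summary.items():
    print(f"{codec:8s} {points:+.1f} points  {time_pct:+.2f}% CPU time")
```

With these numbers, GZIP gains 0.9 points of compression for about 8.8% more CPU time, while LZ4 and SNAPPY compress about as well as no codec and take about 4.1% and 3.3% less CPU time, respectively.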
WHAT ABOUT CONCLUSIONS
CONCLUSION: COMPARE

[Bar chart: % COMPRESSION and % TIME for GZIP, LZ4, SNAPPY and PARQUET; axis from 0 to 15]
WHAT NEXT
- Compare with the new Parse Aware Compression (PAC) from Oracle
- Waiting for them to make it open source
THANKS
javigr6623@gmail.com