Evaluation of the Presto Query Engine for integrating










- Slides: 10
Evaluation of the Presto Query Engine for integrating relational databases with big data platforms at scale
Lightning Talk
Andrew Waldman
14/08/2018
WHAT IS PRESTO?
A distributed SQL query engine
- Typically run on top of a Hadoop cluster
- Open source
- Already well established in the cloud sector
SQL-on-Anything
- HDFS (Parquet, Avro, etc.), S3
- Relational databases (Oracle, MySQL, PostgreSQL, SQL Server…)
- NoSQL (Cassandra, Kudu)
- Apache Kafka and more (data sources are pluggable)
- Local file system
You can query different data sources from one query!
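To illustrate the point about querying several sources at once: Presto addresses a table as catalog.schema.table, so one statement can join tables that live in different systems. The sketch below only builds such a statement as a string. The catalog, schema, table and column names are hypothetical, and so are the helper functions; nothing here comes from the talk's actual setup.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_cross_source_query(events_table: str, elements_table: str) -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    return (
        "SELECT e.element_name, count(*) AS n_events\n"
        f"FROM {events_table} AS h\n"
        f"JOIN {elements_table} AS e ON h.element_id = e.element_id\n"
        "GROUP BY e.element_name"
    )


# Hypothetical names: an archive table reached through a Hive catalog
# and a reference table reached through a MySQL catalog.
events = qualified_name("hive", "archive", "events")
elements = qualified_name("mysql", "controls", "elements")
query = build_cross_source_query(events, elements)
print(query)
```

The resulting text could be submitted through any Presto client; the join itself is executed by Presto, not by either source system.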
Objectives of the Project
Is it worth adding to the current Hadoop service portfolio?
- Get Presto running (installation and configuration)
- Performance comparison on different data sets (with current frameworks)
- Evaluation of native and open source connectors
- Usability of PaaS (Presto as a Service)
Control System Dataset Tests (WinCC*)
- Begin evaluating Presto
- 4 configurations
- Using the Hive connector to connect to the Hive metastore
- Run 5 different types of queries varying in complexity
- Using numerous sets of different resource configurations
Resource configurations listed on the slide:
- Single core, CPU and thread
- 4 machines, single CPU and single thread
- 8 machines, all CPUs and multithreading
Showing framework scalability with Query 4
    with daily as (
      select day, stddev(value_number) dev, element_id
      from psen.eventhistory_00000008
      group by day, element_id
    )
    select element_id, stddev(dev)
    from daily
    group by element_id
    having stddev(dev) > 100000
    order by 2;
[Figure: bar chart of time (minutes, axis 0 to 18) by cluster formation (Single Core, CPU and Thread; 4 Node Cluster, Single CPU and Thread; 8 Node Cluster, Single CPU and Thread; 8 Node Cluster, All CPUs and Threads) for Presto, Impala and Spark. Bar labels as extracted: 17,5; 10,11; 5,58; 3,3; 3,55; 1,1; 1,23; 0,07; 1,06; 0,08; 0,01; 0,11]
*CERN production data set
*The digits after the decimal comma are hundredths of a minute, not seconds
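The chart footnote says the digits after the decimal mark are hundredths of a minute rather than seconds. A minimal sketch of that conversion follows, assuming each bar label is a decimal number of minutes written with a comma as the decimal mark. The function name is hypothetical, and the labels used are copied from the chart without assigning them to a framework.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple


def decimal_minutes_to_min_sec(label: str) -> Tuple[int, int]:
    """Convert a label such as '17,5' (decimal minutes) to (minutes, seconds)."""
    # Drop stray spaces left by extraction and switch to a decimal point.
    value = Decimal(label.replace(" ", "").replace(",", "."))
    # Round the total to whole seconds first so 59.9 s never shows as 60 s.
    total_seconds = int((value * 60).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return divmod(total_seconds, 60)


for label in ["17,5", "1,23", "0,07"]:
    minutes, seconds = decimal_minutes_to_min_sec(label)
    print(f"{label} min = {minutes} min {seconds} s")
```

Under this reading, a label of 17,5 means 17 minutes 30 seconds, not 17 minutes 5 seconds.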
TPC-DS Benchmarking
Test Set 2: 100 GB of TPC-DS test data (Presto, Impala, Spark)
- Same set of queries as for the previous test set
- Automated result collection
- Ran on the Hadalytic cluster
[Figure: bar chart of time (minutes, axis 0 to 4) per query (q11, q12, q17, q1, q25, q26, q30, q31, q3, q41, q42, q46) for Spark, Impala and Presto]
*The digits after the decimal comma are hundredths of a minute, not seconds
Connectors and Catalogs
- A connector is software that makes the connection between Presto and an endpoint
- A catalog is an instance of a connector (a connector configured to access a particular endpoint)
[Diagram labels: Connector, Catalog]
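The connector/catalog distinction can be made concrete with Presto's catalog files: each catalog is a properties file under etc/catalog, the file name gives the catalog its name, and the connector.name property selects the connector. The sketch below renders such a file for the MySQL connector. The host, user, password and file name are placeholders, and the two helper functions are hypothetical.

```python
from pathlib import Path
from typing import Dict


def render_catalog_properties(connector_name: str, settings: Dict[str, str]) -> str:
    """Render the text of a catalog properties file for one connector."""
    lines = [f"connector.name={connector_name}"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


def catalog_name_from_path(path: str) -> str:
    """The catalog name is the properties file name without its extension."""
    return Path(path).stem


# Placeholder endpoint and credentials: one connector (mysql), one catalog.
mysql_catalog = render_catalog_properties(
    "mysql",
    {
        "connection-url": "jdbc:mysql://db.example.net:3306",
        "connection-user": "presto_reader",
        "connection-password": "change-me",
    },
)
print(catalog_name_from_path("etc/catalog/sales_mysql.properties"))
print(mysql_catalog, end="")
```

A second file with the same connector.name but a different connection-url would give a second catalog backed by the same connector, which is the sense in which a catalog is an instance of a connector.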
Summary of Connector Testing
- Native connectors were very easy to configure and get working
- The native connectors could perform aggregations and predicate queries with no problems
- The open source connectors worked less consistently than the native ones; the Oracle open source connectors did not work, which prevented a full evaluation
Conclusion
- When running against TPC-DS benchmark data and WinCC data, the Presto query engine's performance was slightly behind the current framework, Impala
- However, when it comes to compatibility and the ability to connect to multiple data sources, and even query them at once, Presto is more future-proof. It also eases building hybrid systems (OLTP + archive)
- This leads to the conclusion that Presto would be worth considering as an additional framework in production alongside Spark and Impala
QUESTIONS?
andrew.waldman@cern.ch
CONTACTS
- ANDREW WALDMAN, CERN openlab Student, andrew.waldman@cern.ch
- ALBERTO DI MEGLIO, CERN openlab Head, alberto.di.meglio@cern.ch
- ZBIGNIEW BARANOWSKI, CERN openlab Supervisor, zbigniew.baranowski@cern.ch
- ANDREW PURCELL, CERN openlab Communications Officer, andrew.purcell@cern.ch
- EMIL KLESZCZ, CERN IT-DB, emil.kleszcz@cern.ch
- KRISTINA GUNNE, CERN openlab Administration/Finance Officer, kristina.gunne@cern.ch
www.cern.ch/openlab