Evaluation of the Presto Query Engine for integrating relational databases with big data platforms at scale

Lightning Talk
Andrew Waldman
14/08/2018

WHAT IS PRESTO?

A distributed SQL query engine
  - Typically run on top of a Hadoop cluster
  - Open source
  - Already well established in the cloud sector

SQL-on-Anything
  - HDFS (Parquet, Avro, etc.), S3
  - Relational DBs (Oracle, MySQL, PostgreSQL, SQL Server…)
  - NoSQL (Cassandra, Kudu)
  - Apache Kafka and more (data sources are pluggable)
  - Local file system

You can query different data sources from one query!
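As a rough illustration of querying different data sources from one query, the sketch below joins a table in a Hive catalog with a table in a MySQL catalog. It relies on Presto's fully qualified catalog.schema.table naming. The catalog names (hive, mysql), the schemas, the tables, and every column are assumptions made up for this example and do not come from the talk.

```sql
-- Hypothetical catalogs: "hive" (data on HDFS) and "mysql" (a relational database).
-- Schema, table, and column names are illustrative only.
SELECT d.device_name,
       count(*)            AS n_events,
       avg(e.value_number) AS avg_value
FROM hive.archive.events AS e
JOIN mysql.inventory.devices AS d
  ON e.element_id = d.element_id
WHERE e.day >= DATE '2018-01-01'
GROUP BY d.device_name
ORDER BY n_events DESC
LIMIT 10;
```

Each table reference names its catalog explicitly, so a single statement can span both sources without copying data between them first.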

Objectives of the Project

Is it worth adding to the current Hadoop service portfolio?
  - Get Presto running (installation and configuration)
  - Performance comparison on different data sets (with current frameworks)
  - Evaluation of native and open source connectors
  - Usability of PaaS (Presto as a Service)

Control System Dataset Tests (WinCC*)

  • 4 configurations
  • Begin evaluating Presto
      • Using the Hive connector to connect to the Hive metastore
      • Run 5 different types of queries varying in complexity
      • Using numerous sets of different resource configurations

Showing framework scalability with Query 4:

    with daily as (
        select day, stddev(value_number) dev, element_id
        from psen.eventhistory_00000008
        group by day, element_id
    )
    select element_id, stddev(dev)
    from daily
    group by element_id
    having stddev(dev) > 100000
    order by 2;

[Chart: Query 4 execution time in minutes (y-axis, 0 to 18) by cluster formation (x-axis) for Presto, Impala, and Spark. Cluster formations: single core, CPU and thread; 4-node cluster, single CPU and thread; 8-node cluster, single CPU and thread; 8-node cluster, all CPUs and threads. Data labels in the order extracted, series assignment not recoverable: 17,5; 10,11; 5,58; 3,3; 3,55; 1,1; 1,23; 0,07; 1,06; 0,08; 0,01; 0,11.]

*Time is x/100 after decimal, not seconds
*CERN production data set
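The chart footnote says the digits after the decimal are hundredths of a minute, not seconds. As a worked example, assuming a data label such as 5,58 is written with a decimal comma, the conversion to minutes and seconds would be:

```latex
\[
5{,}58\ \text{min} = 5\ \text{min} + 0.58 \times 60\ \text{s} = 5\ \text{min}\ 34.8\ \text{s}
\]
```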

TPC-DS Test Set 2: 100 GB (Presto, Impala, Spark)

TPC-DS Benchmarking, 100 GB Test Data
  - Same set of queries as for the previous test set
  - Automated result collection
  - Ran on the Hadalytic cluster

[Chart: execution time in minutes (y-axis, 0 to 4) per query (x-axis: q11, q12, q17, q1, q25, q26, q30, q31, q3, q41, q42, q46) for Spark, Impala, and Presto.]

*Time is x/100 after decimal, not seconds

Connectors and Catalogs
  - A connector is software that makes a connection between Presto and an endpoint
  - A catalog is an instance of a connector (a connector configured to access a particular endpoint)
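To make the connector/catalog distinction concrete: in a standard Presto installation, each catalog is defined by a properties file in the etc/catalog directory. The file name gives the catalog name, and the connector.name property selects the connector. The sketch below assumes a hypothetical file etc/catalog/mysql.properties containing connector.name=mysql; the catalog name and the schema name are illustrative and do not come from the talk.

```sql
-- List every configured catalog (one per properties file in etc/catalog).
SHOW CATALOGS;

-- Assumes a hypothetical catalog named "mysql".
SHOW SCHEMAS FROM mysql;

-- "inventory" is an illustrative schema name.
SHOW TABLES FROM mysql.inventory;
```

Two catalogs can use the same connector with different connection settings, which is why the catalog, not the connector, appears in table names.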

Summary of Connector Testing
  - Native connectors were very easy to configure and get working
  - The native connectors could perform aggregations and predicate queries with no problems
  - The open source connectors worked less consistently than the native ones; the Oracle open source connectors did not work, which prevented a full evaluation

Conclusion
  - When run against TPC-DS benchmark data and WinCC data, the Presto query engine performed slightly behind the current framework, Impala
  - However, in terms of compatibility and the ability to connect to multiple data sources, and even query them at once, Presto is more future-proof. It also eases the building of hybrid systems (OLTP + archive)
  - This leads to the conclusion that Presto is worth considering as an additional framework in production alongside Spark and Impala

QUESTIONS?
andrew.waldman@cern.ch

CONTACTS

ANDREW WALDMAN, CERN openlab Student, andrew.waldman@cern.ch
ALBERTO DI MEGLIO, CERN openlab Head, alberto.di.meglio@cern.ch
ZBIGNIEW BARANOWSKI, CERN openlab Supervisor, zbigniew.baranowski@cern.ch
ANDREW PURCELL, CERN openlab Communications Officer, andrew.purcell@cern.ch
EMIL KLESZCZ, CERN IT-DB, emil.kleszcz@cern.ch
KRISTINA GUNNE, CERN openlab Administration/Finance Officer, kristina.gunne@cern.ch

www.cern.ch/openlab