Yerevan, Armenia, 16 September 2020
Performance Optimization System for Hadoop and Spark Frameworks
National Polytechnic University of Armenia
Institute for Informatics and Automation Problems of NAS RA
Toulouse Research Institute in Computer Science, France
Agreement number: 2018-3234/001-001
Project reference number: 598719-EPP-1-2018-1-MK-EPPKA2-CBHE-JP
Content
1. Motivation
2. Research Field
3. Partner Institutions
4. Participation
5. Structure of the Project
6. Outcomes
1. Motivation
• Big Data processing is a resource-intensive operation that relies on specific hardware and software. Because the processing is dominated by Input/Output (I/O), the hardware architecture differs from that of traditional high-performance computing (HPC) clusters or supercomputers; in particular, every data node requires local disks. The data processing application stack also differs significantly from traditional approaches: the data volume is substantially larger, the data sets are poorly structured, and many data types are involved.
• Traditional relational database management systems, which are queried with SQL, are poorly suited to processing semi-structured or unstructured Big Data. The MapReduce model is therefore used as a key technology for processing and generating large data sets. Its implementations, such as Apache Hadoop or Spark, split large data sets into distributed blocks, execute map tasks in parallel on these blocks, and finally run reduce tasks to aggregate the results.
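The split, map, and reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop or Spark code: the function names (`split_into_blocks`, `map_task`, `shuffle`, `reduce_task`) and the sample lines are hypothetical, and the map tasks run sequentially here, whereas a real cluster would run them in parallel on different data nodes.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks (stand-in for distributed blocks)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_task(block):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Group the emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Aggregate all values for one key."""
    return key, sum(values)


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    mapped = [map_task(block) for block in blocks]  # sequential here, parallel on a cluster
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Hypothetical sample input.
lines = ["big data big clusters", "data nodes use local disks", "big disks"]
counts = word_count(lines)
print(counts)
```

The intermediate `mapped` lists correspond to the Map output that frameworks write to local disk and send over the network, which is one of the places where compression can be applied.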
2. Research Field
• Data compression techniques are used to overcome data storage and network bandwidth limitations when processing massive volumes of data. In Big Data infrastructures, compression reduces the size of data chunks, which minimizes the delay imposed by I/O operations and saves space on local disks.
• Finding the optimal tradeoff is a challenge: a high compression factor may underload I/O but overload the CPU, while a weak compression factor may underload the CPU but overload I/O. The ideal configuration is one in which both I/O and CPU are fully utilized, so that neither sits idle waiting for the other.
Our aim is to develop a system that selects the compression tool and tunes the compression factor to reach the best performance in Hadoop and Spark infrastructures, based on simulation analysis.
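The CPU versus I/O tradeoff can be made concrete with a small sketch. It uses Python's standard `zlib` module as a stand-in codec (not one of the codecs evaluated in the project) and a toy cost model that is purely an assumption: CPU and I/O overlap perfectly, so a stage takes as long as the slower of the two. The disk bandwidth constant and the sample log line are hypothetical.

```python
import time
import zlib

# Hypothetical sustained disk bandwidth, used only by the toy model below.
DISK_BYTES_PER_SECOND = 100 * 1024 * 1024


def measure(data, level):
    """Compress `data` at the given zlib level and estimate the stage cost."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    cpu_seconds = time.perf_counter() - start
    assert zlib.decompress(compressed) == data  # round-trip sanity check

    # Toy model (assumption): writing the compressed bytes costs size / bandwidth,
    # and CPU and I/O overlap, so the slower resource determines the stage time.
    io_seconds = len(compressed) / DISK_BYTES_PER_SECOND
    return {
        "level": level,
        "ratio": len(data) / len(compressed),
        "cpu_seconds": cpu_seconds,
        "io_seconds": io_seconds,
        "stage_seconds": max(cpu_seconds, io_seconds),
        "bottleneck": "cpu" if cpu_seconds > io_seconds else "io",
    }


# Hypothetical, highly repetitive sample data (real data sets compress less well).
sample = b"2020-09-16 INFO datanode block written to local disk\n" * 20000

# Level 0 stores the data uncompressed; 1 is fastest, 9 compresses hardest.
results = [measure(sample, level) for level in (0, 1, 6, 9)]
for r in results:
    print(f"level={r['level']} ratio={r['ratio']:.1f} "
          f"cpu={r['cpu_seconds']:.4f}s io={r['io_seconds']:.4f}s "
          f"bottleneck={r['bottleneck']}")
```

Under this model, the best setting is the one with the smallest `stage_seconds`; the measured timings depend on the machine, so the printed bottleneck will vary from run to run.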
3. Partner Institutions
• INSTITUTE FOR INFORMATICS AND AUTOMATION PROBLEMS OF THE NATIONAL ACADEMY OF SCIENCES OF THE REPUBLIC OF ARMENIA: The leading research and technology development institute of the National Academy of Sciences of the Republic of Armenia (NAS RA) in applied mathematics and informatics, and in the application of computing technologies to various fields of science and technology.
• NATIONAL POLYTECHNIC UNIVERSITY OF ARMENIA: One of the largest universities in Armenia, with a central campus in Yerevan and branch campuses in the cities of Gyumri, Vanadzor, and Kapan. It currently has more than 8,000 students and 750 faculty members.
• TOULOUSE RESEARCH INSTITUTE IN COMPUTER SCIENCE: One of the largest UMRs (joint research units) in France and one of the pillars of research in Occitanie, with 700 permanent and non-permanent members. Because it is jointly supervised by several bodies (CNRS, Toulouse universities), the laboratory is one of the structuring forces of the IT landscape and its applications in the digital world, at both the regional and national levels.
4. Participation
• INSTITUTE FOR INFORMATICS AND AUTOMATION PROBLEMS OF THE NATIONAL ACADEMY OF SCIENCES OF THE REPUBLIC OF ARMENIA
  • Hrachya Astsatryan, Head of the Center for Scientific Computing
  • Aram Kocharyan, Researcher at the Center for Scientific Computing
• NATIONAL POLYTECHNIC UNIVERSITY OF ARMENIA
  • Gevorg Margarov, Head of the Information Security and Software Development (ISSD) Department
  • Marine Usepyan, Assistant Professor at the ISSD Department
  • Arthur Lalayan, Master Student
  • 1-2 Bachelor students will be involved by the end of the year
• TOULOUSE RESEARCH INSTITUTE IN COMPUTER SCIENCE
  • Daniel Hagimont, Professor, head of team
5. Structure of the Project
1. To study splittable and non-splittable data compression algorithms in Hadoop and Spark, applied at three levels: input data, intermediate Map output data, and Reduce output data.
2. To implement a simulation scenario: three types of input data, seven compression algorithms, and five workloads (TestDFSIO, TeraSort, WordCount, LogAnalyzer, K-means).
3. To run more than 2,000 simulations on Big Data infrastructures: 5 (workloads) × 7 (compression algorithms) × 3 (input data sizes) × 2 (Hadoop, Spark) × 10 (simulations per case) = 2,100.
4. To analyze the results and develop the system.
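The size of the simulation campaign follows from the full cross product of the experimental factors. A minimal sketch of how such a run plan could be enumerated is given below; the workload and framework names come from the slide, while the codec names are placeholders (the seven codecs are not listed here) and the three input sizes are assumed to be the 4 GB, 8 GB, and 16 GB cases mentioned in the outcomes.

```python
from itertools import product

workloads = ["TestDFSIO", "TeraSort", "WordCount", "LogAnalyzer", "K-means"]
# Placeholder names: the slide states seven codecs but does not list them.
codecs = [f"codec_{i}" for i in range(1, 8)]
# Assumption: the three input sizes are those named on the Outcomes slide.
input_sizes_gb = [4, 8, 16]
frameworks = ["Hadoop", "Spark"]
repetitions = 10  # simulations per case

# One dictionary per individual simulation run.
runs = [
    {
        "workload": workload,
        "codec": codec,
        "input_gb": size,
        "framework": framework,
        "repetition": repetition,
    }
    for workload, codec, size, framework, repetition in product(
        workloads, codecs, input_sizes_gb, frameworks, range(1, repetitions + 1)
    )
]

print(len(runs))  # 5 * 7 * 3 * 2 * 10
```

Keeping the plan as plain records makes it straightforward to attach a measured execution time to each run later and to average over the ten repetitions of a case.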
6. Outcomes
The analysis of compressed data processing shows that the lz4 codec gives Hadoop its best performance regardless of the input data size. Spark, by contrast, achieves its best performance with lz4 only for 4 GB of input data, and with the zstandard codec for the 8 GB and 16 GB cases.
Hrachya Astsatryan, Arthur Lalayan, Aram Kocharyan, Daniel Hagimont, Performance Optimization System for Hadoop and Spark Frameworks, Journal Cybernetics and Information Technologies (accepted).
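The reported outcome can be written down as a simple lookup, shown here only as an illustration of what a rule-based codec recommendation might look like. This is not the project's system: the table encodes nothing beyond the results stated above, and `recommend_codec` is a hypothetical helper that refuses to guess for untested cases.

```python
# Best codec per (framework, input size in GB), as reported on this slide.
# Hadoop is listed for the same three sizes, since lz4 was best regardless of size.
REPORTED_BEST_CODEC = {
    ("hadoop", 4): "lz4",
    ("hadoop", 8): "lz4",
    ("hadoop", 16): "lz4",
    ("spark", 4): "lz4",
    ("spark", 8): "zstandard",
    ("spark", 16): "zstandard",
}


def recommend_codec(framework, input_gb):
    """Return the reported best codec, or raise if the case was not reported."""
    key = (framework.lower(), input_gb)
    if key not in REPORTED_BEST_CODEC:
        raise KeyError(f"no reported result for {framework} with {input_gb} GB input")
    return REPORTED_BEST_CODEC[key]


print(recommend_codec("Hadoop", 16))
print(recommend_codec("Spark", 4))
print(recommend_codec("Spark", 8))
```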
Thank you!