
Overview of big data tools
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat

Topics
• Hadoop
• Analysis Tools in Hadoop
• Spark
• NoSQL

Hadoop
• Open source platform for distributed processing of large data sets
• Functions:
  • Distribution of data and processing across machines
  • Management of the cluster
• Simplified programming model
  • Easy to write distributed algorithms

Hadoop scalability
• Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
• Huge clusters can be built from (cheap) commodity hardware
  • A 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines
• A cluster can easily scale up with little or no modification to the programs

Hadoop Components
• HDFS: Hadoop Distributed File System
  • Abstraction of a file system over a cluster
  • Stores large amounts of data by transparently spreading it across different machines
• MapReduce
  • Simple programming model that enables parallel execution of data processing programs
  • Executes the work near the data
• In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

Hadoop Principle
• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes
[Figure: a file labelled "I'm one big data set" stored on Hadoop HDFS]
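
To make the block-splitting idea concrete, here is a minimal Python sketch. It is not how HDFS is implemented: the function names, the 64-byte block size and the round-robin placement are assumptions chosen for illustration, and real HDFS uses much larger blocks and also replicates them.

```python
from itertools import cycle

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign block ids to nodes round-robin (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for block_id, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(block_id)
    return placement

# Hypothetical 200-byte "file" and a three-node cluster.
data = b"I'm one big data set" * 10
blocks = split_into_blocks(data, block_size=64)
placement = place_blocks(blocks, ["node1", "node2", "node3"])
```

Passing a longer node list to `place_blocks` spreads the same blocks over more machines without changing the calling code, which is the property the slide refers to.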

The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase execution
  • Map: data elements are classified into categories
  • Reduce: an algorithm is applied to all the elements of the same category
[Figure: Map and Reduce phases, with example category counts x 4, x 5, x 3]
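
The two phases can be sketched in plain Python. This is a single-process illustration of the paradigm, not Hadoop's API: `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and the grouping step in between is something a real framework does on the programmer's behalf.

```python
from collections import defaultdict

def map_phase(records, classify):
    """Map: tag every data element with its category."""
    for record in records:
        yield classify(record), record

def shuffle(pairs):
    """Group elements by category (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, algorithm):
    """Reduce: apply one algorithm to all elements of the same category."""
    return {key: algorithm(values) for key, values in groups.items()}

# Toy input: classify shapes by name and count each category.
shapes = ["circle", "square", "circle", "triangle", "square", "circle"]
pairs = map_phase(shapes, classify=lambda shape: shape)
counts = reduce_phase(shuffle(pairs), algorithm=len)
```

The programmer supplies only `classify` and `algorithm`; nothing in those two functions refers to parallelism.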

MapReduce and Hadoop
• MapReduce is logically placed on top of HDFS
[Figure: MapReduce layer on top of HDFS]

MapReduce and Hadoop
• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
• Output is written to HDFS
• Scalability principle: perform the computation where the data is
[Figure: Hadoop nodes, each running MR on top of HDFS]
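
The "compute where the data is" principle can be illustrated with a small simulation. The cluster layout, node names and word-count job below are assumptions made up for the example; the nodes are processed one after another here, whereas a real cluster runs them in parallel.

```python
from collections import Counter

# Hypothetical cluster: each node holds some blocks of a text file.
cluster = {
    "node1": ["a b a", "c"],
    "node2": ["b b"],
    "node3": ["a c c"],
}

def run_on_node(local_blocks):
    """Each node counts words only in the blocks it stores locally."""
    partial = Counter()
    for block in local_blocks:
        partial.update(block.split())
    return partial

def run_job(cluster):
    """Collect the small per-node results and merge them into one output."""
    partials = [run_on_node(blocks) for blocks in cluster.values()]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

word_counts = run_job(cluster)
```

Only the compact partial counts travel between nodes; the blocks themselves never leave the machine that stores them.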

Hadoop pros & cons
Good for
• Repetitive tasks on big data
Not good for
• Replacing an RDBMS
• Complex processing requiring various phases and/or iterations
• Processing small to medium size data

Tools for Data Analysis with Hadoop
[Figure: stack diagram showing Pig, Hive, Statistical Software, MapReduce and HDFS]

Apache Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
• Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
• Allows users to write data manipulation scripts in a high-level language called Pig Latin
• Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations

Pig Example
• Real example of a Pig script used at Twitter
• The Java equivalent…

Hive
• SQL interface to Hadoop
• Text files in tabular format stored in HDFS can be wrapped as relational tables by Hive and then queried through standard SQL
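
The idea of wrapping a tabular text file as a relational table can be sketched in Python. This example does not use Hive: the standard-library `sqlite3` module stands in for it, and the file content, table name and column names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated content; with Hive this would be a file in HDFS.
TEXT_FILE = "reporter\tpartner\tvalue\nIT\tDE\t10\nIT\tFR\t5\nDE\tFR\t7\n"

def load_as_table(conn, text):
    """Expose the rows of a tab-separated text file as a SQL table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    data = [(reporter, partner, int(value)) for reporter, partner, value in rows[1:]]
    conn.execute("CREATE TABLE trade (reporter TEXT, partner TEXT, value INTEGER)")
    conn.executemany("INSERT INTO trade VALUES (?, ?, ?)", data)

conn = sqlite3.connect(":memory:")
load_as_table(conn, TEXT_FILE)

# Once wrapped, the text data can be queried with ordinary SQL.
totals = conn.execute(
    "SELECT reporter, SUM(value) FROM trade GROUP BY reporter ORDER BY reporter"
).fetchall()
```

One difference worth keeping in mind: this sketch copies the rows into the database, while Hive leaves the files in HDFS and only records how to read them as a table.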

Demo: Pig and Hive in the Sandbox
• Example of queries and analysis of Comtrade data

RHadoop
• Set of packages that allows integration of R with HDFS and MapReduce
• Hadoop provides the storage while R brings the analysis
• Just a library
• Not a special run-time, not a different language, not a special-purpose language
• Incrementally port your code and use all packages
• Requires R installed and configured on all nodes in the cluster

Demo: RHadoop in the Sandbox
• Example of quality analysis of Comtrade data in MapReduce, written in R

Spark
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach, each iteration's output is written to disk, which makes it slow
• Spark is not tied to the two-stage MapReduce paradigm
  1. Extract a working set
  2. Cache it
  3. Query it repeatedly
[Figure: Hadoop execution flow compared with Spark execution flow]
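
The extract, cache and query pattern can be illustrated without Spark itself. The sketch below is a plain-Python simulation under stated assumptions: `DiskDataset` is a hypothetical stand-in for data on disk that counts full scans, and the "query" is just a sum.

```python
class DiskDataset:
    """Stands in for data on disk and counts how often it is fully read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self._records)

def is_even(record):
    return record % 2 == 0

def iterate_from_disk(dataset, predicate, iterations):
    """Disk-based flow: every iteration goes back to the stored data."""
    results = []
    for _ in range(iterations):
        working_set = [r for r in dataset.scan() if predicate(r)]
        results.append(sum(working_set))
    return results

def iterate_with_cache(dataset, predicate, iterations):
    """Cached flow: extract once, keep in memory, query repeatedly."""
    working_set = [r for r in dataset.scan() if predicate(r)]  # steps 1 and 2
    return [sum(working_set) for _ in range(iterations)]       # step 3

disk = DiskDataset(range(10))
slow = iterate_from_disk(disk, is_even, iterations=5)
cached_source = DiskDataset(range(10))
fast = iterate_with_cache(cached_source, is_even, iterations=5)
```

Both flows return the same results, but the first scans the stored data once per iteration and the second scans it once in total.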

About Apache Spark
• Initially started at UC Berkeley in 2009
• Fast and general-purpose cluster computing system
• Up to 10x (on disk) and 100x (in memory) faster than MapReduce
• Most popular for running iterative machine learning algorithms
• Provides high-level APIs in 3 different programming languages
  • Scala, Java, Python
  • Support for R
• Integrates with Hadoop and its ecosystem and can read existing data on HDFS

Spark Stack
• Spark SQL
  • SQL and structured data processing
• MLlib
  • Machine learning algorithms
• GraphX
  • Graph processing
• Spark Streaming
  • Processing of live data streams

Spark in the Sandbox
• Spark has been used in the Sandbox to compute network indicators starting from the Comtrade data
• Networks of countries have been extracted from the Comtrade database for each category of products; network indicators have been computed with the GraphX library
• All the extraction and processing have been done in Spark
• An alternative approach using R required a preliminary extraction and copy of the data into Hive
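
As a rough illustration of what "network indicators" can mean, the sketch below builds one country network per product category and computes node degrees. It is not the Sandbox code and does not use Spark or GraphX: the records are made-up toy data rather than real Comtrade figures, and degree is only one possible indicator, chosen here as an assumption.

```python
from collections import defaultdict

# Hypothetical (exporter, importer, product category) records; not real Comtrade data.
flows = [
    ("IT", "DE", "cars"),
    ("DE", "FR", "cars"),
    ("IT", "FR", "cars"),
    ("FR", "IT", "wine"),
]

def networks_by_category(flows):
    """Build one directed network (a set of edges) per product category."""
    networks = defaultdict(set)
    for exporter, importer, category in flows:
        networks[category].add((exporter, importer))
    return networks

def degrees(edges):
    """Out-degree: number of export partners. In-degree: number of import partners."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return dict(out_deg), dict(in_deg)

nets = networks_by_category(flows)
out_deg, in_deg = degrees(nets["cars"])
```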

NoSQL: Definition
• NoSQL is an approach to data management that is useful for very large sets of distributed data
• The name NoSQL should not be misleading: the approach does not prohibit Structured Query Language (SQL)
• Indeed, NoSQL is commonly read as “Not Only SQL”

NoSQL: Main Features
• Non-relational/schema-free: little or no predefined schema, in contrast to Relational Database Management Systems
• Distributed
• Horizontally scalable: able to manage large volumes of data while also providing availability guarantees
• Transparent failover and recovery using mirror copies
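
Distribution with mirror copies and failover can be sketched as a toy key-value store. This is an illustration under simplifying assumptions, not the design of any particular NoSQL product: `ShardedStore` is a hypothetical class, keys are placed by hashing modulo the number of nodes, and each key is mirrored on the next node in the list.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys over nodes and mirrors each key."""

    def __init__(self, node_names, replicas=2):
        self.order = list(node_names)
        self.nodes = {name: {} for name in self.order}
        self.replicas = replicas
        self.down = set()  # names of nodes currently unavailable

    def _owners(self, key):
        """Pick the primary node by hash, then the following nodes as mirrors."""
        digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        start = digest % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.replicas)]

    def put(self, key, value):
        for name in self._owners(key):
            self.nodes[name][key] = value

    def get(self, key):
        """Read from the first owner that is up; the caller never sees the failover."""
        for name in self._owners(key):
            if name not in self.down:
                return self.nodes[name][key]
        raise RuntimeError("all replicas unavailable")

store = ShardedStore(["n1", "n2", "n3"])
store.put("country:IT", {"name": "Italy"})
primary = store._owners("country:IT")[0]
store.down.add(primary)            # simulate a node failure
value = store.get("country:IT")    # still answered, from the mirror copy
```

Plain modulo hashing would move most keys whenever a node is added, so real systems typically use more careful placement schemes; the sketch keeps it simple on purpose.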

Example of NoSQL Databases: Document Storage
• Platforms for storing and indexing semi-structured data in JSON format
• Not tied to a specific schema: different types of documents can be stored together
• Products
  • MongoDB
  • Elasticsearch
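
A minimal sketch of the document-store idea follows. It does not use MongoDB or Elasticsearch: `DocumentStore` is a hypothetical in-memory class, and the indexing scheme (exact match on top-level scalar fields) is an assumption that is far simpler than what those products do.

```python
import json
from collections import defaultdict

class DocumentStore:
    """Toy store for JSON documents with no fixed schema."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # (field, value) -> ids of matching documents

    def insert(self, doc_id, json_text):
        doc = json.loads(json_text)
        self.docs[doc_id] = doc
        # Index every top-level scalar field, whatever fields the document has.
        for field, value in doc.items():
            if isinstance(value, (str, int, float, bool)):
                self.index[(field, value)].add(doc_id)

    def find(self, field, value):
        ids = sorted(self.index.get((field, value), ()))
        return [self.docs[i] for i in ids]

store = DocumentStore()
# Documents of different shapes live side by side.
store.insert(1, '{"type": "tweet", "user": "a", "text": "hi"}')
store.insert(2, '{"type": "trade", "reporter": "IT", "value": 10}')
store.insert(3, '{"type": "tweet", "user": "b"}')
tweets = store.find("type", "tweet")
```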

Elasticsearch in the Sandbox
• Demo of visualization of Twitter data stored in Elasticsearch with Kibana