Hadoop DB Inneke Ponet Introduction Technologies for data
Hadoop. DB Inneke Ponet
§ Introduction § Technologies for data analysis § Hadoop. DB § Desired properties § Layers of Hadoop. DB § Hadoop. DB Components
Introduction § More and more data needs to be stored and processed. § People want to do more and more complex calculations on their collected data. § Analytical databases on high-end machines are moving towards cheaper lower-end machines. § The analytical database market is 27% of the database software market and is growing at a rate of 10, 3% annually.
Technologies for data analysis Parallel databases: § good performance, § good efficiency. Map. Reduce-based systems: § superior scalability, § good fault tolerance, § good flexibility to handle unstructered data.
Parallel databases § Support for standard relational tables and SQL. § Implements techniques for a better performance: § Indexing, compression, materialized views, result caching, I/O sharing. § Data is partitioned (shared-nothing architecture) transparent to the end-user.
Shared-nothing architecture The DBMS of the most analytical databases are deployed on a shared-noting architecture: § A collection of machines that § § are independent, are possible virtual, have their own local disk and local main memory, are connected by a high-speed network. Scalability of machines. Analysis tasks are easy to parallellize.
Map. Reduce A technology from Google: § processes (un)structured data that is distributed on many nodes in a shared-nothing cluster; § works at enormous scale. Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles.
Map. Reduce: advantages No detailed query execution plan in advance at runtime: adjust to node failures and slow nodes (re)assigning tasks to faster nodes. Checkpoints the output to local disk minimizing of the work in case of a failure.
Hadoop. DB Hybrid database: a combination of: § traditional DBMS, § Map. Reduce-technology. Developed by Yale University students: Azza Abouzeid and Kamil Baj. Da-Pawlikowski It is free and open source.
Desired properties A. Performance B. Fault tolerance C. Heterogeneous environment D. Flexible query interface E. Scalability
A. Performance § Primary characteristic to distinguish. § Map. Reduce: first modeling and loading data before processing slower performance than parallel databases. § Cost saving: faster software product cheaper than a hardware upgrade or buying additional hardware. Parallel databases Map. Reduce
B. Fault tolerance § Succesfully commit transactions. § Make progress on a workload. § Heterogeneity and scalibility more faults BUT Map. Reduce good fault tolerance: § reassigning tasks; § sub-tasks minimize the effect of faults. § Parallel databases: assumption failures are rare more testing => slower performance. Parallel databases Map. Reduce
C. Heterogeneous environment § Nodes don’t always run on § identical hardware, § an identical virtual machine. Different performance. § Parallel databases: not tested on more than 100 nodes. Parallel databases Map. Reduce
D. Flexible query interface § Easy to make queries: SQL and non-SQL interface languages, Use of tools. § Robust mechanisme for writing UDFs. § Parallel databases: SQL, ODBC and UDFs. § Map. Reduce-based systems: it is possible (Hive), but not always (Hadoop). Parallel databases / Map. Reduce
E. Scalability Traditional DBMS: § only scalable to 100 nodes. Reasons Assumption Failures are rare. Hetrogeneity Homogeneous array of machines. Not tested There are no applications with more than a few dozen nodes. Map. Reduce-based systems: § designed to scale to thousands of nodes in a sharednothing architecture. Parallel databases Map. Reduce
Desired properties Parallel databases Map. Reduce Performance Fault tolerance Heterogeneous environment Flexible query interface / Scalability
Layers of Hadoop. DB § Communication: Hadoop § Database: Postgre. SQL § Translation: Hive
Hadoop § Communication layer of Hadoop. DB. § Hadoop framework two layers: § Hadoop Distributed File System (HDFS), § Map. Reduce framework. Cost: free/open source Map. Reduce.
Postgre. SQL § Relational DBMS. § (Possible) database layer of Hadoop. DB. Cost: free/open source.
Hive § Translation layer. § Processing of a SQL query: § Query Abstract Syntax Tree. § Meta. Store: schema of the table(s). § Logical query plan: DAG of relational operators. § Optimized plan. § Physical executable plan: Map. Reduce job(s). § XML plan: DAG serialized. § Hive Driver executes a Hadoop job.
Hadoop. DB components § Database Connector: § Interface between independent database systems; § Extends the Input. Format class (of Hadoop); § Connect to any JDBC-compliant database. § Catalog: § Meta-information about the databases: § connection parameters, § metadata. § XML file in HDFS accessed by: § Master node, § Worker/Slave nodes.
Hadoop. DB Components (2) § Data loader: § Global hasher: § Custom Map. Reduce job files in HDFS; § Repartioning data upon loading. § Local hasher: § Copies partition from HDFS to local file system; § Partitions the file in smaller sized chunks.
Hadoop. DB Components (3) § SQL to Map. Reduce: § Parallel database front-end to process SQL queries. Hive. QL ↓ Transform Map. Reduce jobs: § Connect to tables stored in HDFS; § Consists of DAGs of relational operators that operate as iterators. § Assumption no collection of tables: § Operations on multiple tables Reduce function. NOT in Hadoop. DB: a join operation can be pushed to the databse layer.
Hadoop. DB Components (4) § SQL/SMS planner: § Modifies Hive: § Updates the Meta. Store § Two passes over the physical plan: 1. 2. Determine the partition keys for the Reduce Sink Operators are: § converted in SQL querie(s); § pushed into the database layer. § Only filter, select and aggregation operators.
Hadoop. DB Components (5)
Questions?
- Slides: 26