Hive: A Petabyte Scale Data Warehouse Using Hadoop

Lecturer: Prof. Kyungbaek Kim • Presenter: Alvin Prayuda Juniarta Dwiyantoro

Contents
• Background
• Description
• System Architecture
• Data Types
• Operations
• Data Models
• SerDe
• Installation Guide
• Practical Example

Background
• Hadoop
  • Pros
    • Superior scalability, availability, and manageability
    • Efficiency scales with more hardware
  • Cons
    • MapReduce is hard to program (users know SQL)
    • Data needs to be published in a well-known structure
• Hive is the solution

Description
• Hive is data warehouse software that facilitates querying and managing large datasets in distributed storage
• Provides access to files stored in HDFS and executes queries via MapReduce
• Hive uses a simple SQL-like query language so that users familiar with SQL can query the data
• Hive is not designed for OLTP (online transaction processing) and does not offer real-time queries
• Hive's strengths are scalability, extensibility, fault tolerance, and loose coupling with its input formats

System Architecture

Data Types
• Data types supported by current Hive (v0.13.1)
  • Numeric types
    • tinyint, smallint, int, bigint, float, double, decimal (user-defined precision and scale)
  • Date/time
    • timestamp, date
  • String
    • string, varchar, char
  • Misc
    • boolean, binary
  • Complex
    • arrays, maps, structs, union

Operations
• Three kinds of operations in Hive:
  • Data Definition Language (DDL) operations
  • Data Manipulation Language (DML) operations
  • Structured Query Language (SQL) operations

Operations
• DDL Operations
  • Create/drop/alter database
  • Use database
  • Create/drop/truncate table
  • Alter table/partition/column
  • Create/drop/alter view
  • Create/drop/alter index
  • Create/drop function
  • Create/drop/grant/revoke roles and privileges
  • Show
  • Describe
  • Export/Import

Operations • Example DDL • Create table • Alter table • Drop table

Operations
• DML Operations
  • Loading files into tables
  • Inserting data into Hive tables from queries
  • Writing data into the filesystem from queries
  • Inserting values into tables from SQL (v0.14)
  • Updating values in tables from SQL (v0.14)
  • Deleting values in tables from SQL (v0.14)

Operations • Example DML • Load data into table • Write data into files

Operations
• SQL Operations
  • Select and filters
  • Group by
  • Insert overwrite and insert into
  • Join
  • Multitable insert
• Extensibility
  • Pluggable MapReduce scripts using TRANSFORM
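As a rough illustration of the pluggable-script idea: Hive's TRANSFORM clause streams each row to an external script as a tab-separated line on standard input and reads tab-separated rows back from standard output. The sketch below is a minimal script of that shape. The column layout (userid, unixtime) and the function names are assumptions made for illustration, not part of the original slides.

```python
import io
from datetime import datetime, timezone


def transform_line(line):
    """Turn one 'userid<TAB>unixtime' row into 'userid<TAB>weekday' (1 = Monday)."""
    userid, unixtime = line.rstrip("\n").split("\t")
    weekday = datetime.fromtimestamp(int(unixtime), tz=timezone.utc).isoweekday()
    return f"{userid}\t{weekday}"


def run(source, sink):
    """Stream rows from source to sink, one output row per non-empty input row."""
    for line in source:
        if line.strip():
            sink.write(transform_line(line) + "\n")


# Local dry run with in-memory streams; under Hive the streams would be
# sys.stdin and sys.stdout.
out = io.StringIO()
run(io.StringIO("42\t0\n7\t86400\n"), out)
```

On the Hive side, a script like this is typically registered with ADD FILE and invoked with a query of the form SELECT TRANSFORM (userid, unixtime) USING 'python weekday_mapper.py' AS (userid, weekday) FROM u_data; where the file, table, and column names are hypothetical.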

Operations • Example SQL • Show data • Insert data with select statement • Group by
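Since Hive runs queries as MapReduce jobs, it can help to see how a GROUP BY with a count lines up with the map, shuffle, and reduce phases. The following is a conceptual sketch only, not Hive's actual execution plan; the rows and column names are made up for illustration.

```python
from collections import defaultdict


def map_phase(rows, key_col):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row[key_col], 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group's values, here by summing them (COUNT)."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, roughly: SELECT page, COUNT(*) FROM page_views GROUP BY page
rows = [
    {"user": "a", "page": "home"},
    {"user": "b", "page": "home"},
    {"user": "a", "page": "cart"},
]
counts = reduce_phase(shuffle(map_phase(rows, "page")))
```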

Operations • Join • Multitable Insert
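One common way to express an equi-join in MapReduce is a reduce-side join: rows from both tables are emitted under the join key, and the reducer pairs them up. The sketch below illustrates that idea under the assumption of an inner join on a single column; the function name and sample tables are hypothetical, and it is not meant to mirror Hive's planner.

```python
from collections import defaultdict


def reduce_side_join(left, right, key):
    """Inner equi-join of two lists of dict rows on the column named by key."""
    # Map + shuffle: tag each row with its source table, grouped by join key.
    tagged = defaultdict(lambda: {"L": [], "R": []})
    for row in left:
        tagged[row[key]]["L"].append(row)
    for row in right:
        tagged[row[key]]["R"].append(row)

    # Reduce: for each key, emit every left row combined with every right row.
    joined = []
    for sides in tagged.values():
        for left_row in sides["L"]:
            for right_row in sides["R"]:
                joined.append({**left_row, **right_row})
    return joined


users = [{"userid": 1, "name": "ann"}, {"userid": 2, "name": "bob"}]
views = [
    {"userid": 1, "page": "home"},
    {"userid": 1, "page": "cart"},
    {"userid": 3, "page": "home"},
]
result = reduce_side_join(users, views, "userid")
```

Keys that appear in only one table (userid 2 and 3 here) produce no output, which is what an inner join requires.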

Data Models
• Data in Hive is organized into:
  • Table: modeled like a table in a relational database; stored in HDFS
  • Partition: a slice of a table stored in a sub-directory within the table's directory, which lets the system prune the data to be inspected based on query predicates
    • Example: a query interested in rows from T that satisfy the predicate T.ds = '2008-09-01' only has to look at files in the <table location>/ds=2008-09-01/ directory in HDFS
  • Bucket: data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory, which lets the system efficiently evaluate queries that depend on a sample of the data
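The partition and bucket layout described above can be sketched in a few lines. This is a simplified model, assuming a single partition column named ds as in the example; the helper names and the warehouse path are hypothetical, and zlib.crc32 is only a stand-in for whatever hash function Hive applies to the bucketing column.

```python
import zlib


def partition_dir(table_location, ds):
    """Directory holding the rows of one partition, e.g. <table location>/ds=2008-09-01."""
    return f"{table_location}/ds={ds}"


def dirs_to_scan(table_location, partitions, wanted_ds):
    """Partition pruning: keep only the directories that can satisfy ds = wanted_ds."""
    return [partition_dir(table_location, ds) for ds in partitions if ds == wanted_ds]


def bucket_for(value, num_buckets):
    """Assign a row to a bucket from the hash of its bucketing column value.

    crc32 is an assumption for illustration, not Hive's real hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


partitions = ["2008-08-31", "2008-09-01", "2008-09-02"]
scanned = dirs_to_scan("/warehouse/T", partitions, "2008-09-01")
bucket = bucket_for("user_17", 32)
```

A sampling query can then read a single bucket file per partition instead of the whole partition, since every row with the same column value always lands in the same bucket.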

SerDe
• SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.
• A SerDe allows Hive to read in data from a table and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
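To make the serialize/deserialize split concrete, here is a small sketch of the idea for a delimiter-separated format. Real SerDes are written in Java against Hive's SerDe interface; the class below, its method names, and the pipe-delimited format are all assumptions chosen for illustration.

```python
class DelimitedSerDe:
    """Toy SerDe-like class for records such as '1|ann|home'."""

    def __init__(self, columns, delimiter="|"):
        self.columns = list(columns)
        self.delimiter = delimiter

    def deserialize(self, record):
        """Turn one stored record into named fields that a query can process."""
        values = record.rstrip("\n").split(self.delimiter)
        if len(values) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} fields, got {len(values)}"
            )
        return dict(zip(self.columns, values))

    def serialize(self, row):
        """Turn named fields back into the stored record format."""
        return self.delimiter.join(str(row[column]) for column in self.columns)


serde = DelimitedSerDe(["userid", "name", "page"])
row = serde.deserialize("1|ann|home\n")
line = serde.serialize(row)
```

Because only this layer knows the on-disk format, swapping in a different SerDe changes how records are read and written without changing the queries themselves.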

Projects Related to Hive
• Shark
  • Fork of Apache Hive that uses Spark instead of MapReduce
• Hivemall
  • Machine-learning library for Hive
• Apache Sentry
  • Role-based authorization system for Hive