Hadoop File Formats and Data Ingestion
Prasanth Kothuri, CERN
File Formats – not just CSV

File format choice is a key factor in Big Data processing and query performance:

- Schema evolution
- Compression and splittability
- Data processing
  - Write performance
  - Partial read
  - Full read
Available File Formats

- Text / CSV
- JSON
- SequenceFile: binary key/value pair format
- Avro
- Parquet
- ORC: Optimized Row Columnar format
AVRO

- Language-neutral data serialization system
- Avro data is described using a language-independent schema
- Avro schemas are usually written in JSON; the data is encoded in a binary format
- Supports schema evolution
  - write a file in Python and read it in C
  - producers and consumers can run with different versions of the schema
- Avro files support compression and are splittable
Avro – File structure and example

Sample Avro schema in JSON format (the original slide also shows the Avro file structure as a figure):

```json
{
  "type" : "record",
  "name" : "tweets",
  "fields" : [
    { "name" : "username",  "type" : "string" },
    { "name" : "tweet",     "type" : "string" },
    { "name" : "timestamp", "type" : "long" }
  ],
  "doc" : "schema for storing tweets"
}
```
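As a hedged illustration (not from the original slides), the schema above can be exercised with the fastavro Python library; the file name, records, and the evolved reader schema are assumptions made for the example:

```python
# Sketch using fastavro (assumes: pip install fastavro).
# Writes a record with the tweets schema above, then reads it back
# with an evolved reader schema -- the added "retweets" field is
# filled from its default, illustrating Avro schema evolution.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "tweets",
    "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

records = [
    {"username": "alice", "tweet": "hello hadoop", "timestamp": 1417000000},
]

with open("tweets.avro", "wb") as out:             # binary-encoded file
    writer(out, schema, records, codec="deflate")  # compression supported

# Reader schema adds a field with a default: old files remain readable.
reader_schema = parse_schema({
    "type": "record",
    "name": "tweets",
    "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "retweets",  "type": "int", "default": 0},
    ],
})

with open("tweets.avro", "rb") as fo:
    for rec in reader(fo, reader_schema):
        print(rec)  # {'username': 'alice', ..., 'retweets': 0}
```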
Parquet

- Columnar storage format
- Key strength is the ability to store nested data in a truly columnar format, using definition and repetition levels (1)

The slide's figure contrasts the two layouts: the row format stores whole records contiguously (x1 y1 z1, x2 y2 z2, ...), while the columnar format stores each column's values together as encoded chunks (x1 x2 x3 x4 x5, then y1 y2 y3 y4 y5, then z1 z2 z3 z4 z5).

(1) Dremel made simple with Parquet – https://blog.twitter.com/2013/dremel-made-simple-with-parquet
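A short pyarrow sketch (an illustration of mine, not part of the slides; assumes pip install pyarrow) makes the nested-data point concrete:

```python
# Sketch: writing nested records to Parquet with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

# Nested schema: each record has a user struct and a list of tags.
records = [
    {"user": {"name": "alice", "age": 34}, "tags": ["hadoop", "parquet"]},
    {"user": {"name": "bob",   "age": 27}, "tags": ["avro"]},
]
table = pa.Table.from_pylist(records)

# Parquet shreds the nested fields into flat leaf columns
# (user.name, user.age, tags.list.element) using definition and
# repetition levels, so every leaf is stored columnar.
pq.write_table(table, "nested.parquet")
print(pq.read_table("nested.parquet").schema)
```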
Optimizations – CPU and I/O

- Statistics for filtering and query optimization
- Projection push down and predicate push down: read only the data you need
- The columnar layout minimizes CPU cache misses (a cache miss costs CPU cycles)
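A hedged pyarrow sketch of both push-downs (the file and column names are hypothetical, chosen to match the x/y/z example above):

```python
# Sketch: projection and predicate push down with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(
    pa.table({"x": [1, 2, 3, 4, 5],
              "y": ["y1", "y2", "y3", "y4", "y5"],
              "z": [0.1, 0.2, 0.3, 0.4, 0.5]}),
    "xyz.parquet",
)

result = pq.read_table(
    "xyz.parquet",
    columns=["y"],            # projection push down: only column y is read
    filters=[("x", ">", 3)],  # predicate push down: row groups whose x
                              # min/max statistics exclude x > 3 are skipped
)
print(result.to_pylist())     # [{'y': 'y4'}, {'y': 'y5'}]
```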
Encoding

- Delta encoding: e.g. a timestamp column can be encoded by storing the first value and the deltas between subsequent values, which tend to be small because the values are close in time
- Prefix encoding: delta encoding for strings
- Dictionary encoding: for small sets of distinct values, e.g. post codes, IP addresses
- Run-length encoding: for repeating data
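A toy Python sketch of the delta-encoding idea (illustrative only; Parquet's actual implementation differs):

```python
# Delta-encode a sorted timestamp column: store the first value plus
# small deltas, which compress far better than the raw values.
timestamps = [1700000000, 1700000003, 1700000004, 1700000010]

deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
print(deltas)  # [1700000000, 3, 1, 6] -- small ints after the base value

# Decoding reverses the process with a running sum.
decoded, total = [], 0
for d in deltas:
    total += d
    decoded.append(total)
assert decoded == timestamps
```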
Parquet file structure & configuration

Configurable Parquet parameters:

| Property name | Default value | Description |
| --- | --- | --- |
| parquet.block.size | 128 MB | The size in bytes of a block (row group). |
| parquet.page.size | 1 MB | The size in bytes of a page. |
| parquet.dictionary.page.size | 1 MB | The maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page. |
| parquet.enable.dictionary | true | Whether to use dictionary encoding. |
| parquet.compression | UNCOMPRESSED | The type of compression: UNCOMPRESSED, SNAPPY, GZIP or LZO. |

In summary, Parquet is a state-of-the-art, open-source columnar format that is supported by most Hadoop processing frameworks and is optimized for high compression and high scan efficiency.
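For comparison, roughly equivalent knobs can be set through pyarrow's writer (a sketch of mine; pyarrow's parameter names differ from the Java parquet.* properties above, and the values are just examples):

```python
# Sketch: Parquet tuning parameters via pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})

pq.write_table(
    table,
    "tuned.parquet",
    row_group_size=1_000_000,               # pyarrow counts rows, not bytes
    data_page_size=1024 * 1024,             # ~parquet.page.size
    dictionary_pagesize_limit=1024 * 1024,  # ~parquet.dictionary.page.size
    use_dictionary=True,                    # ~parquet.enable.dictionary
    compression="snappy",                   # ~parquet.compression
)
```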
Data Ingestion
Flume

Flume is for high-volume ingestion of event-based data into Hadoop, e.g. collecting log files from a bank of web servers and moving the log events in those files to HDFS (clickstream).
Flume Example (diagram slide)
Flume configuration

```
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /var/spooldir

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hostname:8020/user/flume/logs
agent1.sinks.sink1.hdfs.fileType = DataStream

agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100
```

Start the agent (assuming the configuration above is saved as $FLUME_HOME/conf/flume.conf):

```
flume-ng agent --name agent1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume.conf
```

Built-in components by category:

| Category | Components |
| --- | --- |
| Source | Avro, Exec, HTTP, JMS, Netcat, Sequence generator, Spooling directory, Syslog, Thrift, Twitter |
| Sink | Avro, Elasticsearch, File roll, HBase, HDFS, IRC, Logger, Morphline (Solr), Null, Thrift |
| Channel | File, JDBC, Memory |
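A minimal Python sketch of feeding the spooling-directory source configured above (the staging directory is an assumption of this example; the spooldir source requires files to appear atomically and never be modified afterwards):

```python
# Sketch: drop a complete log file into Flume's spool directory.
import os

SPOOL_DIR = "/var/spooldir"
STAGING_DIR = "/var/spool_staging"  # assumed to exist on the same filesystem

def spool_events(lines, name="events.log"):
    staged = os.path.join(STAGING_DIR, name)
    with open(staged, "w") as f:
        f.write("\n".join(lines) + "\n")
    # Atomic rename: Flume only ever sees complete, immutable files.
    os.rename(staged, os.path.join(SPOOL_DIR, name))

spool_events(["GET /index.html 200", "GET /logo.png 304"])
```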
Sqoop – SQL to Hadoop

Open-source tool to extract data from a structured data store into Hadoop. (The original slide also shows Sqoop's architecture as a figure.)
Sqoop – contd.

- Sqoop schedules MapReduce jobs to effect imports and exports
- Sqoop always requires the connector and the JDBC driver
- Sqoop requires the JDBC driver for the specific database server; it should be copied to /usr/lib/sqoop/lib
- The command line has the following structure:

```
sqoop TOOL PROPERTY_ARGS SQOOP_ARGS
```

- TOOL indicates the operation you want to perform, e.g. import, export
- PROPERTY_ARGS are a set of parameters entered as Java properties in the format -Dname=value
- SQOOP_ARGS are all the other sqoop parameters
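As a hedged illustration (not from the slides; all values are hypothetical), the three parts can be assembled in the required order and run from Python:

```python
# Sketch: composing a Sqoop invocation from TOOL, PROPERTY_ARGS
# (-D Java properties, which must precede the other arguments),
# and SQOOP_ARGS, then shelling out to sqoop.
import subprocess

tool = ["import"]
property_args = ["-Dmapreduce.job.queuename=ingest"]  # example Java property
sqoop_args = [
    "--connect", "jdbc:oracle:thin:@dbhost:1521/service",  # hypothetical
    "--username", "hadoop_tutorial", "-P",
    "--table", "VISITCOUNT.RFIDLOG",
    "--target-dir", "visitcount_rfidlog",
    "--num-mappers", "1",
]

subprocess.run(["sqoop", *tool, *property_args, *sqoop_args], check=True)
```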
Sqoop – How to run sqoop

Example:

```
sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 1 \
  --target-dir visitcount_rfidlog \
  --table VISITCOUNT.RFIDLOG
```
Sqoop – how to parallelize

Specify the input with either:

- --table table_name
- --query 'select * from table_name where $CONDITIONS'

and control the parallelism with:

- --split-by primary_key
- --num-mappers n
- optionally --boundary-query 'select range from dual' to supply the split range directly
Hands On – 1

Use Kite SDK to demonstrate copying of various file formats to Hadoop.

Step 1) Download the MovieLens dataset

```
curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o movies.zip
unzip movies.zip
cd ml-latest-small/
```

Step 2) Load the dataset into Hadoop in Avro format

```
# infer the schema
kite-dataset csv-schema ratings.csv --record-name ratings -o ratings.avsc
cat ratings.avsc

# create the dataset
kite-dataset create ratings --schema ratings.avsc

# load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratings
```
Hands On – 1 contd.

Step 3) Load the dataset into Hadoop in Parquet format

```
# infer the schema
kite-dataset csv-schema ratings.csv --record-name ratingsp -o ratingsp.avsc
cat ratingsp.avsc

# create the dataset
kite-dataset create ratingsp --schema ratingsp.avsc --format parquet

# load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratingsp
```

Step 4) Run a sample query to compare the elapsed time between Avro & Parquet

```
hive
select avg(rating) from ratings;
select avg(rating) from ratingsp;
```
Hands On – 2

Use Sqoop to copy an Oracle table to Hadoop.

Step 1) Get the Oracle JDBC driver

```
sudo su
cd /var/lib/sqoop
curl -L https://pkothuri.web.cern.ch/pkothuri/ojdbc6.jar -o ojdbc.jar
exit
```

Step 2) Run the sqoop job

```
sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 1 \
  --target-dir visitcount_rfidlog \
  --table VISITCOUNT.RFIDLOG
```
Hands On – 3

Use Sqoop to copy an Oracle table to Hadoop with multiple mappers:

```
sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 2 \
  --split-by alarm_id \
  --target-dir lemontest_alarms \
  --table LEMONTEST.ALARMS \
  --as-parquetfile
```

Check the size and number of files:

```
hdfs dfs -ls lemontest_alarms/
```
Hands On – 4

Use Sqoop to make an incremental copy of an Oracle table to Hadoop.

Step 1) Create a sqoop job

```
sqoop job \
  --create alarms \
  -- \
  import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 1 \
  --target-dir lemontest_alarms_i \
  --table LEMONTEST.ALARMS \
  --incremental append \
  --check-column alarm_id \
  --last-value 0
```
Hands On – 4 contd.

Step 2) Run the sqoop job

```
sqoop job --exec alarms
```

Step 3) Run sqoop in incremental mode

```
sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 1 \
  --table LEMONTEST.ALARMS \
  --target-dir lemontest_alarms_i \
  --incremental append \
  --check-column alarm_id \
  --last-value 47354

hdfs dfs -ls lemontest_alarms_i/
```
Q&A

E-mail: Prasanth.Kothuri@cern.ch
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/