Hadoop tutorials

Today's agenda
• Hadoop Introduction and Architecture
• Hadoop Distributed File System
• MapReduce
• Spark

Cloudera Image for hands-on
• Installation instructions
  • https://cern.ch/test-zbaranow/CVM.txt

Hadoop Introduction

What is Hadoop? (1)
• A framework for large-scale data processing
  • Volume
  • Variety
  • Velocity

What is Hadoop? (2)
• Solution for big data processing
• Sequential data access: a brute-force approach
  • Simplified data structures (no relational model)
  • Ideal for ad-hoc data analytics
• Instead of clever data lookups with indexing etc., which mean that:
  • Data analytics cases have to be known beforehand
  • Data design is complex
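The contrast on this slide can be illustrated with a small sketch. It is not Hadoop code: the records, field names and helper functions below are hypothetical and exist only to show why an index serves a question that was planned in advance, while a sequential scan can serve any ad-hoc question.

```python
# Hypothetical in-memory "dataset"; the field names are assumptions for illustration.
records = [
    {"user": "alice", "country": "CH", "bytes": 120},
    {"user": "bob", "country": "FR", "bytes": 300},
    {"user": "alice", "country": "FR", "bytes": 80},
]


def build_index(rows, key):
    """Index approach: the lookup key must be chosen before any query runs."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[key], []).append(pos)
    return index


def indexed_lookup(rows, index, value):
    """Fast, but only answers questions about the indexed key."""
    return [rows[pos] for pos in index.get(value, [])]


def full_scan(rows, predicate):
    """Brute-force approach: read every row once, apply any predicate."""
    return [row for row in rows if predicate(row)]


# Planned question: all records of one user (index built beforehand).
user_index = build_index(records, "user")
by_user = indexed_lookup(records, user_index, "alice")

# Ad-hoc question nobody planned for: large transfers from FR.
big_fr = full_scan(records, lambda r: r["country"] == "FR" and r["bytes"] > 100)

print(len(by_user), len(big_fr))
```

The scan costs a full pass over the data for every question, which is the trade-off the slide calls brute force.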

What is Hadoop? (3)
• Data locality (shared nothing), scales out
[Diagram: Node 1 to Node X, each with its own CPU, memory and disks, connected by an interconnect network]
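A minimal sketch of the shared-nothing idea, under loud assumptions: threads stand in for cluster nodes, a Python list stands in for each node's local disk, and the function names are hypothetical. Each "node" processes only its own chunk, and only the small partial results travel over the "network" to be combined.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Spread records round-robin over n_nodes local chunks (simulated disks)."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def local_sum(chunk):
    """Work done on one node; it touches only that node's own chunk."""
    return sum(chunk)


def cluster_sum(records, n_nodes):
    """Run local_sum next to each chunk, then combine the partial results."""
    chunks = partition(records, n_nodes)
    # Threads only simulate independent nodes here; this is not a real cluster.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, chunks))
    return sum(partials)


print(cluster_sum(list(range(1000)), n_nodes=5))
```

Adding nodes in this model adds CPU, memory and disk together, which is the sense in which the design scales out.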

What is Hadoop? (4)
• Optimized storage access (for HDD)
  • Big data blocks >= 128 MB
  • Sequential IO instead of random IO

HDD drive 7200 rpm speed:
- Sequential IO: ~120 MB/s
- Random IO: 0.5 - 50 MB/s
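The throughput figures above can be turned into a rough back-of-the-envelope calculation. The sketch below uses only the numbers from the slide; the 1 TB dataset size (taken as 1024 * 1024 MB) is an assumption chosen for illustration, and real drives will vary.

```python
SEQ_MB_S = 120.0             # sequential throughput from the slide
RANDOM_MB_S = (0.5, 50.0)    # random IO range from the slide
BLOCK_MB = 128               # minimum block size from the slide


def scan_hours(size_mb, throughput_mb_s):
    """Hours needed to read size_mb from one disk at the given throughput."""
    return size_mb / throughput_mb_s / 3600.0


def block_count(size_mb, block_mb=BLOCK_MB):
    """Number of blocks needed to hold size_mb (ceiling division)."""
    return -(-size_mb // block_mb)


dataset_mb = 1024 * 1024  # assumed dataset size: 1 TB expressed in MB

print(f"blocks of {BLOCK_MB} MB: {block_count(dataset_mb)}")
print(f"sequential scan: {scan_hours(dataset_mb, SEQ_MB_S):.1f} h")
for rate in RANDOM_MB_S:
    print(f"random IO at {rate} MB/s: {scan_hours(dataset_mb, rate):.1f} h")
```

Large blocks keep the disk in its sequential regime for most of each read, which is why the block size and the IO pattern appear together on the slide.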

Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: cluster resource manager
• MapReduce
• Hive: SQL
• Impala: SQL
• Pig: scripting
• Sqoop: data exchange with RDBMS
• Oozie: workflow manager
• HBase: NoSQL columnar store
• Mahout: machine learning
• Spark: large-scale data processing
• Flume: log data collector
• ZooKeeper: coordination

Hadoop cluster architecture
• Master and slaves approach
[Diagram: Node 1 to Node X connected by an interconnect network. The master node hosts the HDFS NameNode, YARN ResourceManager, Hive metastore and various component agents and masters. The other nodes each host an HDFS DataNode, a YARN NodeManager and various component agents and daemons.]

What not to use Hadoop for?
• Online Transaction Processing (OLTP) systems
  • No transactions
  • No locks
  • No data updates (only appends and overwrites)
  • Response time in seconds rather than milliseconds
• Not good for systems with relational data
• Interactive applications
• Accounting systems
• Etc.

What to use Hadoop for?
• For Big Data!
  • Storing
  • Analysis
• Write once, read many
• Scale-out system (CPU, IO, RAM)
  • Transparent to the users (data placement, data analysis)
• Good for data exploration:
  • in a batch fashion
  • statistics, aggregations, correlation
• Data warehouses
• Logs

Hadoop @CERN
• 4 main clusters (provided by IT)
  • 16-20 machines each
  • 24 GB - 256 GB of RAM
• Main users
  • ATLAS (EventIndex, PandaMon, Rucio)
  • CASTOR logs
  • WLCG Dashboards
  • IT Monitoring
  • Computer Security
  • …
• Available services
  • HDFS, YARN (MR), HBase, Hive, Pig, Spark, Impala (upcoming)
• Contact
  • SNOW: https://cern.service-now.com/service-portal/report-ticket.do?name=request&se=Hadoop-Service

Summary
• Hadoop is a solution for massive data processing
  • Designed to scale out
  • On commodity hardware
  • Optimized for sequential reads
• Hadoop architecture
  • HDFS is the core
  • Many components with multiple functionalities distributed across cluster nodes

Questions?