Hadoop tutorials
Today's agenda
• Hadoop Introduction and Architecture
• Hadoop Distributed File System
• MapReduce
• Spark

Cloudera Image for hands-on
• Installation instructions: https://cern.ch/test-zbaranow/CVM.txt
Hadoop Introduction
What is Hadoop? (1)
• A framework for large scale data processing
  • Volume
  • Variety
  • Velocity
What is Hadoop? (2)
• Solution for big data processing
• Sequential data access: a brute force approach
• Simplified data structures (no relational model)
• Ideal for ad-hoc data analytics
• Instead of clever data lookups with indexing etc., where:
  • Data analytic cases have to be known beforehand
  • Complex data design is needed
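The brute force idea can be illustrated with a minimal sketch. It assumes a handful of made-up records (the field names `user` and `bytes` are hypothetical) and answers two unrelated ad-hoc questions with a plain sequential pass, without any index or schema prepared in advance. It is plain Python, not Hadoop code.

```python
# Hypothetical records with a simple structure: no relational model, no index.
records = [
    {"user": "alice", "bytes": 120},
    {"user": "bob", "bytes": 40},
    {"user": "alice", "bytes": 300},
    {"user": "carol", "bytes": 75},
]


def scan(data, predicate):
    """Brute-force pass: read every record once and keep the matches."""
    return [record for record in data if predicate(record)]


# Two ad-hoc questions; neither had to be known when the data was written.
big_transfers = scan(records, lambda r: r["bytes"] > 100)
alice_records = scan(records, lambda r: r["user"] == "alice")

print(len(big_transfers), len(alice_records))
```

Every question costs a full pass over the data, which is the trade-off the slide describes: no up-front design, at the price of reading everything.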
What is Hadoop? (3)
• Data locality (shared nothing): scales out
[Diagram: Node 1 to Node X, each with its own CPU, memory and disks, connected by an interconnect network]
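A minimal sketch of the shared-nothing idea, assuming a hypothetical three-node cluster modelled as a Python dictionary: each node counts values in its own partition, and only the small partial results are merged. The node names and data are invented for illustration, and nothing here is Hadoop API.

```python
from collections import Counter

# Hypothetical cluster: each node owns one partition on its local disks.
node_data = {
    "node1": ["error", "ok", "ok"],
    "node2": ["ok", "error", "error"],
    "node3": ["ok"],
}


def local_count(partition):
    """Work done on a single node, using only that node's own data."""
    return Counter(partition)


def merge(partials):
    """Combine the per-node results; only these small summaries are moved."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [local_count(partition) for partition in node_data.values()]
total = merge(partials)
print(dict(total))
```

Adding a node adds both storage and compute for its own partition, which is why this layout scales out.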
What is Hadoop? (4)
• Optimized storage access (for HDD)
  • Big data blocks: >= 128 MB
  • Sequential IO instead of random IO
HDD drive, 7200 rpm:
  • Sequential IO: ~120 MB/s
  • Random IO: 0.5 - 50 MB/s
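The throughput figures above translate into read times for a single 128 MB block as follows. This is back-of-the-envelope arithmetic using only the numbers on the slide; it ignores caching, seek patterns and concurrent load.

```python
BLOCK_MB = 128             # block size from the slide
SEQUENTIAL_MB_S = 120.0    # ~120 MB/s sequential IO on a 7200 rpm HDD
RANDOM_SLOW_MB_S = 0.5     # lower bound of the random IO range
RANDOM_FAST_MB_S = 50.0    # upper bound of the random IO range


def read_seconds(size_mb, throughput_mb_s):
    """Time to read size_mb at a constant throughput."""
    return size_mb / throughput_mb_s


sequential_s = read_seconds(BLOCK_MB, SEQUENTIAL_MB_S)
random_worst_s = read_seconds(BLOCK_MB, RANDOM_SLOW_MB_S)
random_best_s = read_seconds(BLOCK_MB, RANDOM_FAST_MB_S)

print(f"sequential:     {sequential_s:.2f} s")
print(f"random (best):  {random_best_s:.2f} s")
print(f"random (worst): {random_worst_s:.2f} s")
```

With these figures one block takes about 1.07 s sequentially, against 2.56 s to 256 s with random IO.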
Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: cluster resource manager
• MapReduce
• Spark: large scale data processing
• Hive: SQL
• Impala: SQL
• Pig: scripting
• Sqoop: data exchange with RDBMS
• Oozie: workflow manager
• HBase: NoSQL columnar store
• Mahout: machine learning
• Flume: log data collector
• Zookeeper: coordination
Hadoop cluster architecture
• Master and slaves approach
[Diagram: Node 1 to Node X connected by an interconnect network]
• Master node: HDFS NameNode, YARN ResourceManager, Hive metastore, various component agents and masters
• Every other node: HDFS DataNode, YARN NodeManager, various component agents and daemons
What not to use Hadoop for?
• Online Transaction Processing systems
  • No transactions
  • No locks
  • No data updates (only appends and overwrites)
  • Response time in seconds rather than milliseconds
• Not good for systems with relational data
• Interactive applications
• Accounting systems
• Etc.
What to use Hadoop for?
• For Big Data!
  • Storing
  • Analysis
• Write once, read many
• Scale-out system (CPU, IO, RAM)
  • Transparent to the users (data placement, data analysis)
• Good for data exploration:
  • in a batch fashion
  • statistics, aggregations, correlations
• Data warehouses
• Logs
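As a small illustration of batch-style statistics over logs, the sketch below aggregates response times per service in one pass. The log format (`<service> <response_ms>`) and the values are hypothetical, and the code runs in plain Python rather than on a cluster.

```python
import statistics
from collections import defaultdict

# Hypothetical log lines, written once and then only read.
log_lines = [
    "hdfs 120",
    "yarn 80",
    "hdfs 100",
    "yarn 120",
    "hive 300",
]


def aggregate(lines):
    """One batch pass: group response times by service, then summarise."""
    by_service = defaultdict(list)
    for line in lines:
        service, response_ms = line.split()
        by_service[service].append(int(response_ms))
    return {
        service: {
            "count": len(values),
            "mean_ms": statistics.mean(values),
            "max_ms": max(values),
        }
        for service, values in by_service.items()
    }


stats = aggregate(log_lines)
for service, summary in sorted(stats.items()):
    print(service, summary)
```

The job reads the whole input and produces a summary at the end, which fits the write-once, read-many pattern better than an interactive lookup.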
Hadoop @CERN
• 4 main clusters (provided by IT)
  • 16-20 machines each
  • 24 GB to 256 GB of RAM
• Main users
  • ATLAS (EventIndex, PandaMon, Rucio)
  • CASTOR logs
  • WLCG Dashboards
  • IT Monitoring
  • Computer Security
  • …
• Available services
  • HDFS, YARN (MR), HBase, Hive, Pig, Spark, Impala (upcoming)
• Contact
  • SNOW: https://cern.service-now.com/service-portal/report-ticket.do?name=request&se=Hadoop-Service
Summary
• Hadoop is a solution for massive data processing
  • Designed to scale out
  • On commodity hardware
  • Optimized for sequential reads
• Hadoop architecture
  • HDFS is the core
  • Many components with multiple functionalities, distributed across cluster nodes

Questions?