How YARN Enables Multiple Data Processing Engines in Hadoop
Eric Mizell, Director, Solution Engineering
Hortonworks. We Do Hadoop.
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Agenda
• YARN 101 – Yet Another Resource Negotiator
• Enabling a Modern Data Architecture
• YARN in action
– Demo of streaming application
• SQL in Hadoop
– Hive
– Phoenix over HBase
– Spark
YARN Concepts
• Application
– An application is a job submitted to the framework
– Example: a MapReduce job
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
– container_0 = 2 GB, 1 CPU
– container_1 = 1 GB, 6 CPU
– Replaces the fixed map/reduce slots
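
To make the container idea concrete, here is a minimal sketch of fitting resource requests onto one node. It assumes only two resource dimensions (memory and CPU), and the names `Resource`, `allocate`, the 8 GB / 8 CPU node, and the `too_big` request are all hypothetical illustrations, not the YARN API. The two container sizes are the ones from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Resource:
    """A two-dimensional resource vector (illustrative; not a YARN class)."""
    memory_gb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if every dimension fits.
        return self.memory_gb <= other.memory_gb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_gb - other.memory_gb, self.vcores - other.vcores)


def allocate(capacity: Resource, requests: List[Resource]) -> Tuple[List[Resource], Resource]:
    """Grant requests in order while they fit; return (granted, remaining)."""
    granted: List[Resource] = []
    remaining = capacity
    for request in requests:
        if request.fits_in(remaining):
            granted.append(request)
            remaining = remaining.minus(request)
    return granted, remaining


node = Resource(memory_gb=8, vcores=8)          # hypothetical node size
container_0 = Resource(memory_gb=2, vcores=1)   # from the slide
container_1 = Resource(memory_gb=1, vcores=6)   # from the slide
too_big = Resource(memory_gb=4, vcores=4)       # hypothetical; needs more CPU than is left

granted, remaining = allocate(node, [container_0, container_1, too_big])
print(granted)
print(remaining)
```

Unlike fixed map/reduce slots, the two granted containers have different shapes, and the third request is refused on CPU even though enough memory is still free.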
YARN Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
– Application management
• Node Manager
– Per-machine agent
– Manages the life cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
YARN – Running Apps
[Figure: Hadoop Client 1 and Hadoop Client 2 each create and submit an application (app 1, app 2) to the ResourceManager, which contains the Scheduler (with queues) and the ASM. NodeManagers in Rack 1, Rack 2, … Rack N host the Application Masters (AM 1, AM 2) and their containers (C 1.1 to C 1.4, C 2.1 to C 2.3) and send status reports. Legend: ASM negotiates containers; NM reports to ASM; Scheduler partitions resources.]
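
The flow between client, Resource Manager, Node Managers and Application Master can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: memory is the only resource, the scheduling policy (pick the node with the most free memory) is invented for the example, and the class and function names are hypothetical rather than the Hadoop API. In real YARN the Application Master asks the Resource Manager for containers and then starts them through the Node Managers; the sketch folds those two steps into one call.

```python
from typing import Dict, List, Optional


class NodeManager:
    """Per-machine agent that tracks free memory and the containers it runs."""

    def __init__(self, name: str, memory_gb: int) -> None:
        self.name = name
        self.free_gb = memory_gb
        self.containers: List[str] = []

    def launch(self, container_id: str, memory_gb: int) -> None:
        self.free_gb -= memory_gb
        self.containers.append(container_id)


class ResourceManager:
    """Global scheduler that picks a node for every container request."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = nodes

    def allocate(self, container_id: str, memory_gb: int) -> Optional[str]:
        # Toy policy (an assumption): use the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_gb)
        if node.free_gb < memory_gb:
            return None  # a real scheduler would keep the request pending
        node.launch(container_id, memory_gb)
        return node.name


def run_application(rm: ResourceManager, app_id: int, tasks: int) -> Dict[str, Optional[str]]:
    """Submit an app: the first container hosts the AM, which then requests task containers."""
    placement: Dict[str, Optional[str]] = {f"AM{app_id}": rm.allocate(f"AM{app_id}", 1)}
    for task in range(1, tasks + 1):
        container_id = f"C{app_id}.{task}"
        placement[container_id] = rm.allocate(container_id, 2)
    return placement


nodes = [NodeManager(f"node{i}", memory_gb=4) for i in range(1, 4)]
rm = ResourceManager(nodes)
app1 = run_application(rm, app_id=1, tasks=2)
app2 = run_application(rm, app_id=2, tasks=2)
print(app1)
print(app2)
```

Two applications end up sharing the same nodes, each with its own Application Master, which is the property that lets several engines run side by side on one cluster.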
Hadoop 2.x Stack – Enabled by YARN
YARN: Data Operating System is the architectural center of HDP
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
[Figure: HDP stack.
Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS.
Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Search (Solr), Others (ISV Engines), with Tez and Slider, all on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).
Security (Authentication, Authorization, Accounting, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger.
Operations (Provision, Manage & Monitor): Ambari, Zookeeper; Scheduling: Oozie.
Deployment choice: Linux, Windows, On-Premises, Cloud.]
Hadoop 2 Stack – Versions
[Figure: component versions shipped in HDP 2.0 (October 2013, Hadoop & YARN 2.2.0), HDP 2.1 (April 2014, Hadoop & YARN 2.4.0) and HDP 2.2 (December 2014, Hadoop & YARN 2.6.0). Components are grouped as Data Management (Hadoop & YARN), Data Access (Pig, Hive & HCatalog, HBase, Phoenix, Accumulo, Storm, Spark, Solr, Tez, Slider), Governance & Integration (Falcon, Sqoop, Flume, Kafka), Operations (Ambari, Zookeeper, Oozie) and Security (Knox, Ranger).]
Enabling a Modern Data Architecture with Apache Hadoop
Traditional systems under pressure
• Silos of Data
• Costly to Scale
• Constrained Schemas
…and difficult to manage new data
[Figure: Applications (Custom Applications, Business Analytics, Packaged Applications) sit on the data system (RDBMS, EDW, MPP), which is fed by existing sources (CRM, ERP, …) and by new data types: clickstream, geolocation, sentiment and web data, sensor and machine data, server logs, unstructured docs and emails.]
HDP 2 and YARN enable the Modern Data Architecture
Hortonworks architected and led development of YARN
Common data set, multiple applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing & segmentation of data
YARN: Architectural center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications certified by Hortonworks to run natively in Hadoop
[Figure: Applications (Custom Applications, Business Analytics, Packaged Applications) sit on the data system, where RDBMS, EDW and MPP run alongside batch, interactive and real-time engines on YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1 … N. Sources: existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured.]
YARN in Action
Trucking Company’s YARN-enabled Architecture
[Figure: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Interactive Query (Hive on Tez); ActiveMQ; Alerts & Events; Real-Time User Interface; Microsoft Excel. Many Workloads: YARN. Distributed Storage: HDFS.]
Components of the Topology
• 9-node HDP 2.2 cluster with Storm and HBase on YARN
• 4-node Kafka 0.8 cluster
• 1-node ActiveMQ with the STOMP protocol enabled
• Spring 4.0 WebMVC web application using SockJS and ActiveMQ over STOMP
Topology Architecture
Demo
SQL in Hadoop
Hive
• Executes on MapReduce or Tez (Spark in future) on YARN
– Queries taking hours now take minutes on Tez
• CLI and ODBC/JDBC connections
• Performance Components
– Vectorization – set hive.vectorized.execution.enabled=true;
– Tez – set hive.execution.engine=tez;
– CBO – set hive.compute.query.using.stats=true;
– set hive.stats.fetch.column.stats=true;
– set hive.stats.fetch.partition.stats=true;
– set hive.cbo.enable=true;
• Enhanced security available in Hive 0.13
– Grant semantics, column level
• Insert/Update/Delete available in Hive 0.14 (GA)
• Sub-second response times next year
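
One way to apply the performance settings consistently is to keep them in one place and render them as a preamble that runs at the start of each Hive session. The sketch below assumes that approach; the helper name `render_session_preamble` is hypothetical, and the value `true` for vectorization is an assumption, since the slide lists that property without a value.

```python
from typing import Dict

# Property names are the ones listed on the slide.
HIVE_PERFORMANCE_SETTINGS: Dict[str, str] = {
    "hive.execution.engine": "tez",
    "hive.vectorized.execution.enabled": "true",  # value assumed
    "hive.compute.query.using.stats": "true",
    "hive.stats.fetch.column.stats": "true",
    "hive.stats.fetch.partition.stats": "true",
    "hive.cbo.enable": "true",
}


def render_session_preamble(settings: Dict[str, str]) -> str:
    """Render one `set key=value;` statement per line."""
    return "\n".join(f"set {key}={value};" for key, value in settings.items())


preamble = render_session_preamble(HIVE_PERFORMANCE_SETTINGS)
print(preamble)
```

The rendered text can be pasted into the Hive CLI or sent over a JDBC/ODBC connection before the actual query.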
Phoenix over HBase
• HBase is a NoSQL “data store” in Hadoop on YARN
– Column families
– Strong consistency for heavy reads/writes
– Linear scale by adding Region Servers
– Multiple access points – CLI, Java API, Thrift/REST API
– Hundreds of millions or billions of rows
– Gets/Puts/Scans
• Phoenix
– Relational database layer over HBase
– CLI and JDBC connection
– Low seconds response times
– Salting to prevent HBase region hot spots
– Can map to an existing HBase table
– Dynamic columns at query time
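
The salting idea can be illustrated in a few lines. Row keys that grow monotonically (timestamps, sequence numbers) all land in the same region; prefixing each key with a byte derived from a hash of the key spreads consecutive keys over a fixed number of buckets. The sketch below is an assumption-laden illustration: it uses MD5 as a stand-in hash and hypothetical helper names, and it does not reproduce the hash function Phoenix uses internally.

```python
import hashlib
from collections import Counter


def salt_byte(row_key: bytes, buckets: int) -> int:
    """Derive a bucket number from the key (MD5 is a stand-in hash)."""
    return hashlib.md5(row_key).digest()[0] % buckets


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prefix the row key with its one-byte salt."""
    return bytes([salt_byte(row_key, buckets)]) + row_key


BUCKETS = 4  # hypothetical bucket count

# Sequential keys, as produced by a timestamp or counter.
keys = [f"event-{i:08d}".encode() for i in range(1000)]
distribution = Counter(salted_key(key, BUCKETS)[0] for key in keys)
print(sorted(distribution.items()))
```

Because the salt is computed from the key itself, a point lookup can recompute the prefix, while a range scan has to visit every bucket.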
Spark
• Created at UC Berkeley's AMPLab
• Advanced DAG execution engine that supports in-memory computing and cyclic data flow
• Spark SQL, MLlib, GraphX, Spark Streaming
• Runs on Hadoop on YARN, Mesos, standalone
• CLI and ODBC/JDBC connectivity
• In-memory in RDDs – Resilient Distributed Datasets
– Immutable
– Perfect for iterative processing
– Becoming a great way to serve up smaller data sets (low TB) at high speed
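
Why immutable, cached datasets suit iterative processing can be shown with a toy model. The class below is a pure-Python illustration, not Spark and not the PySpark API; `ToyRDD` and its methods are hypothetical names that only mimic the shape of the idea: transformations return a new dataset and record lineage, and a cached dataset is computed once and then reused by every iteration.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


class ToyRDD:
    """Toy immutable dataset with lazy transformations and lineage."""

    def __init__(self, compute: Callable[[], Iterator], lineage: Tuple[str, ...]) -> None:
        self._compute = compute
        self.lineage = lineage
        self._keep = False
        self._cached: Optional[tuple] = None

    @classmethod
    def parallelize(cls, data: Iterable) -> "ToyRDD":
        items = tuple(data)
        return cls(lambda: iter(items), ("parallelize",))

    def _iterate(self) -> Iterator:
        if self._cached is not None:
            return iter(self._cached)
        if self._keep:
            # Materialize once; later passes read from memory.
            self._cached = tuple(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()), self.lineage + ("map",))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)), self.lineage + ("filter",))

    def cache(self) -> "ToyRDD":
        self._keep = True
        return self

    def collect(self) -> List:
        return list(self._iterate())


base = ToyRDD.parallelize(range(1, 6))
squares = base.map(lambda x: x * x).cache()

# Iterative use: each pass reuses the cached squares instead of recomputing them.
total = 0
for threshold in (0, 5, 10):
    total += sum(squares.filter(lambda x, t=threshold: x > t).collect())

print(squares.lineage, total)
```

The base dataset is never modified, so any derived dataset can be rebuilt from its lineage if a cached copy is lost.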
Hadoop Summit 2015
Thank You!
Eric Mizell – Director, Solutions Engineering
emizell@hortonworks.com
@ericmizell