RUNNING SPARK CLUSTERS IN CONTAINERS WITH DOCKER
Silicon Valley Big Data Association Meetup, February 16, 2016
Tom Phelan (tap@bluedata.com), Kartik Mathur (kartik@bluedata.com)

Outline
• Vocabulary
• Big Data New Realities
• Apache Spark
• Anatomy of a Spark Cluster
• Deployment Options: Public Cloud, On-Premises
• Demo
• Trade-Offs and Choices

Vocabulary
• Bare-Metal
• Virtual Machine (VM)
• Container
• Docker
• Microservice
• Monolithic (service)

Apache Spark™ is a fast and general engine for large-scale data processing.
Source: www.spark.apache.org

Big Data Deployment Options Source: Enterprise Strategy Group (ESG) Survey, 2015

Spark On-Premises
• Individual developers or data scientists who build their own infrastructure on laptops, on VMs, or on bare-metal machines
• IT takes a bottom-up approach in which everyone gets the same infrastructure/platform irrespective of their skill or use case

Why Change this Approach?
As the number of Spark users grows …
• IT needs to scale the deployment for additional use cases
• Application lifecycle requires dev/test/QA/prod environments
• Complexity overwhelms the organization, restricting adoption

Spark Adoption On-Premises
• Prototyping
  ① Get started with Spark for initial use cases and users
  ② Evaluation, testing, development, and QA
  ③ Prototype multiple data pipelines quickly
• Departmental
  ① Spin up dev/test clusters with replica image of production
  ② QA/UAT using production data without duplication
  ③ Offload specific users and workloads from production
• Spark-as-a-Service
  ① LOB multi-tenancy with strict resource allocations
  ② Bare-metal performance for business critical workloads
  ③ Self-service, shared infrastructure with strict access controls
Slide labels: Spark in a Secure Production Environment; Multi-Tenant Spark Deployment On-Premises; Dev/Test and Pre-Production

Big Data New Realities
Big Data Traditional Assumptions | Big Data New Realities | New Benefits and Value
Bare-metal | Containers and VMs | Big-Data-as-a-Service
Data locality | Compute and storage separation | Agility and cost savings
Data on local disks | In-place access on remote data stores | Faster time-to-insights

New Realities, New Requirements
• Software flexibility
  - Multiple distros, Hadoop and Spark, multiple configurations
  - Support new versions and apps as soon as they are available
• Multi-tenant support
  - Data access and network security
  - Differential Quality of Service (QoS)
• Stability, Scalability, Cost, Performance, and Security are always important

Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
  - Amazon Web Services EC2 and EMR
  - Microsoft Azure HDInsight
  - Google Cloud Dataproc
  - IBM Bluemix
  - ... and others
• Spark-as-a-Service
  - All of the above
  - Databricks

Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
  - VMware Big Data Extensions
  - OpenStack Sahara
• Containers
  - Mesos
  - BlueData

APACHE SPARK - ANATOMY OF A SPARK CLUSTER

Running Spark in Cluster Mode
Source: http://spark.apache.org/docs/1.3.0/cluster-overview.html

Common Deployment Patterns
Most Common Spark Deployment Environments (Cluster Managers)
• 48% Standalone mode
• 40% YARN
• 11% Mesos
Source: Spark Survey Report, 2015 (Databricks)
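
Each of the three cluster managers in the survey is selected through the master URL passed to spark-submit. The Python sketch below only assembles the command line and launches nothing. The host names and application file are placeholders, and the ports are the usual defaults (7077 for a standalone master, 5050 for a Mesos master). Spark 1.x documentation also lists yarn-client and yarn-cluster as master values for YARN.

```python
# Illustrative sketch: builds (but does not run) a spark-submit command.
# Host names and the application file are placeholders, not from the slides.

def master_url(manager, host=None, port=None):
    """Return the --master value for a given cluster manager."""
    if manager == "standalone":
        # Spark's own master process; 7077 is its default port.
        return "spark://{}:{}".format(host, port or 7077)
    if manager == "mesos":
        # 5050 is the default Mesos master port.
        return "mesos://{}:{}".format(host, port or 5050)
    if manager == "yarn":
        # The ResourceManager address comes from the Hadoop configuration
        # (HADOOP_CONF_DIR), so no host appears in the URL.
        return "yarn"
    raise ValueError("unknown cluster manager: {}".format(manager))


def spark_submit(manager, app, host=None, port=None):
    """Assemble the argument list for spark-submit."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


if __name__ == "__main__":
    print(" ".join(spark_submit("standalone", "app.py", host="master.example")))
    print(" ".join(spark_submit("yarn", "app.py")))
    print(" ".join(spark_submit("mesos", "app.py", host="mesos.example")))
```

The application code is the same in all three cases; only the master URL and the surrounding cluster configuration change.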

Avoid Solution Mismatch

Spark Cluster – Standalone Mode
[Diagram components: Spark Client, Spark Master (bare metal or virtual machine), Spark Slave (bare metal or virtual machine), tasks]

Spark Cluster – Hadoop YARN
[Diagram components: Spark Client, Resource Manager, Node Manager, Spark Master, Spark Executor, tasks]

Spark MultiCluster + YARN
[Diagram components: Controller, Workers]

Spark Cluster – Mesos
[Diagram components: Spark Client, Spark Scheduler, Mesos Master, Mesos Slave, Spark Executor, tasks]

Spark Cluster – Mesos
[Diagram components: Spark Client, Spark Scheduler (Mesos framework for Spark), Mesos Master, Mesos Slave, Spark Executor, tasks]

APACHE SPARK - DEPLOYMENT OPTIONS

PUBLIC CLOUD – SPARK-AS-A-SERVICE (E.G. AWS)

Spark Cluster – Hadoop YARN
[Diagram components: Spark Client/Zeppelin, Resource Manager, Spark Master, Node Manager, Spark Executor, tasks, Virtual Machines]

AWS Spark-as-a-Service: Benefits
• Amazon EC2 Elastic Container Service (ECS)
  - Launch containers on EC2
  - Amazon Elastic Container Registry (ECR): Docker Images
• Amazon Elastic MapReduce (EMR)
  - Easy to use
  - Low startup costs: Hardware and human
  - Expandable

AWS Spark-as-a-Service: Challenges
• Data access
  - Already exists in S3
  - Ingest time
• Data security
• Software versions
  - Spark 1.6.0, Hadoop 2.7.1; MapR
• Cost
  - Short running vs. long running clusters

ON-PREMISES – SPARK + CONTAINERS + DCOS
Microservices deployment
Spark with Docker and Kubernetes/Swarm/Mesos

Spark Cluster – Mesos
[Diagram components: Spark Client, Spark Scheduler, Mesos Master, Mesos Slave, Spark Executor, tasks, Containers]
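
One way to get the containerized executors shown in this diagram is Spark's spark.mesos.executor.docker.image property, which asks Mesos to start each executor inside the named Docker image. The Python sketch below assembles such a spark-submit command without running it. The Mesos master address, image name, and application file are hypothetical placeholders, and this is not necessarily the setup used in the talk.

```python
# Illustrative sketch: spark-submit arguments for Spark on Mesos with
# executors launched in Docker containers. Nothing is executed here.

def mesos_docker_submit(mesos_master, image, app, extra_conf=None):
    """Build a spark-submit argument list.

    mesos_master: "host:port" of the Mesos master (placeholder value).
    image: Docker image that contains a Spark installation (placeholder).
    extra_conf: optional dict of additional Spark properties.
    """
    conf = {"spark.mesos.executor.docker.image": image}
    conf.update(extra_conf or {})

    cmd = ["spark-submit", "--master", "mesos://" + mesos_master]
    # Sort the keys so the generated command is deterministic.
    for key in sorted(conf):
        cmd += ["--conf", "{}={}".format(key, conf[key])]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    print(" ".join(mesos_docker_submit(
        "mesos-master.example:5050",   # hypothetical address
        "example/spark:1.6.0",         # hypothetical image
        "app.py",
        {"spark.cores.max": "8"},
    )))
```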

Spark + Docker + DCOS: Benefits
• Easy to set up a dev/demonstration environment
  - Mesos framework for Spark available
  - Container isolation
  - Most of the pieces are available
• Complete control
  - Customization

Spark + Docker + DCOS: Challenges
• Can be difficult to set up a production environment
  - Multi-tenancy, QoS
  - Software interoperability
  - Container cluster network connectivity and security

Spark + Docker + Mesos: Challenges
[Diagram components: Mesos Master, Mesos Schedulers, Marathon Scheduler, Mesos Slave #1, Mesos Slave #2, Mesos Executors, containerized tasks, Name Node container, Data Node container]

Spark + Docker + Mesos
[Diagram components: Mesos Master, Mesos Schedulers, Myriad Scheduler, Marathon Scheduler, Mesos Slave #1, Mesos Slave #2, Mesos Executors, containerized tasks, Name Node container, Data Node containers]

Spark + Docker + Mesos
[Diagram components: Job, Task, Mesos Scheduler (microservice), Myriad Scheduler, Mesos Master, Mesos Scheduler, Marathon Scheduler, Mesos Slave #1, Mesos Slave #2, Mesos Executors, containerized tasks, Name Node container, Data Node container]

ON-PREMISES – SPARK + CONTAINERS + BLUEDATA
Monolithic deployment
Spark-as-a-Service in an On-Premises Deployment

Spark – Standalone with Containers
[Diagram components: Spark Client; Spark Master in a container on bare metal or a virtual machine; Spark Slave in a container on bare metal or a virtual machine; tasks]
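
As a rough illustration of the layout in this diagram, the Python sketch below assembles docker run commands that start a standalone Spark master and workers in separate containers. It is a generic sketch, not BlueData's implementation. The image name, Docker network name, and the Spark install path inside the image are assumptions; the Master and Worker class names and the worker's --cores and --memory options come from Spark's standalone mode.

```python
# Illustrative sketch: docker run commands for a containerized standalone
# Spark cluster. Commands are only built, never executed.

IMAGE = "example/spark:latest"              # hypothetical image name
NETWORK = "spark-net"                       # hypothetical user-defined Docker network
SPARK_CLASS = "/opt/spark/bin/spark-class"  # assumed install path inside the image


def master_cmd(name="spark-master"):
    """docker run arguments for the standalone master container."""
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", NETWORK,
        "-p", "7077:7077",   # cluster port workers and clients connect to
        "-p", "8080:8080",   # master web UI
        IMAGE, SPARK_CLASS, "org.apache.spark.deploy.master.Master",
    ]


def worker_cmd(index, master="spark-master", cores=2, memory_gb=4):
    """docker run arguments for one standalone worker container."""
    name = "spark-worker-{}".format(index)
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", NETWORK,
        # Container memory limit: worker memory plus 1 GB of headroom
        # (the headroom figure is an arbitrary choice for this sketch).
        "--memory", "{}g".format(memory_gb + 1),
        IMAGE, SPARK_CLASS, "org.apache.spark.deploy.worker.Worker",
        "--cores", str(cores),
        "--memory", "{}G".format(memory_gb),
        "spark://{}:7077".format(master),
    ]


if __name__ == "__main__":
    print(" ".join(master_cmd()))
    for i in range(1, 3):
        print(" ".join(worker_cmd(i)))
```

Because the containers share a user-defined Docker network, the workers can reach the master by its container name; persistent or externally visible IP addresses, as listed on the later benefits slides, would need additional network configuration that this sketch does not cover.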

Spark + Docker + BlueData: Benefits
• Enterprise quality
  - Deployment flexibility (on physical servers or VMs)
  - Network connectivity
  - Persistent IP addresses
  - Externally visible IP addresses
  - No NATing required
• Cloud-like experience: Spark-as-a-Service
  - Self-service access to instant clusters, simple Web UI

Spark + Docker + BlueData: Benefits
• Docker packaging of images
  - Distribution agnostic
  - Spark, Kafka, Cassandra, Zeppelin, and more
  - With or without YARN
  - Bring your own BI/analytics tool
• Currently on-premises
  - Future: on-premises, public cloud, or hybrid

Spark + Docker + BlueData: Benefits
• Multi-tenancy
  - Per tenant QoS, not per service
  - Private VLAN per Tenant
  - Limit Data Access
• HA, software upgrades, data access, …
  - BlueData’s DataTap isolates data from compute
  - Upgrade compute independent of data
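
The idea of per-tenant rather than per-service QoS can be illustrated with a small quota model in which every cluster a tenant launches draws from one shared allowance. The Python sketch below is purely hypothetical: the Quota and Tenant names, the fields, and the admission rule are invented for illustration and do not describe how BlueData EPIC enforces limits.

```python
# Hypothetical model of per-tenant resource quotas. All names and the
# admission rule are assumptions made for this illustration.
from collections import namedtuple

Quota = namedtuple("Quota", ["cores", "memory_gb"])


class Tenant:
    """Tracks resources used by all clusters that belong to one tenant."""

    def __init__(self, name, quota):
        self.name = name
        self.quota = quota
        self.used = Quota(0, 0)

    def can_admit(self, cores, memory_gb):
        """True if a new cluster of this size fits in the remaining quota."""
        return (self.used.cores + cores <= self.quota.cores
                and self.used.memory_gb + memory_gb <= self.quota.memory_gb)

    def admit(self, cores, memory_gb):
        """Reserve resources for a new cluster; return False if over quota."""
        if not self.can_admit(cores, memory_gb):
            return False
        self.used = Quota(self.used.cores + cores,
                          self.used.memory_gb + memory_gb)
        return True


if __name__ == "__main__":
    marketing = Tenant("marketing", Quota(cores=16, memory_gb=64))
    print(marketing.admit(8, 32))   # Spark cluster
    print(marketing.admit(8, 32))   # Kafka cluster, same tenant quota
    print(marketing.admit(1, 1))    # rejected: tenant quota exhausted
```

The limit applies to the tenant as a whole, so a Spark cluster and a Kafka cluster owned by the same tenant compete for the same allowance instead of each service having its own.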

BlueData EPIC Software: Demo

TRADE-OFFS AND CHOICES

Trade-Offs (Not Unique to Spark)
• Open Source: less stable, less cost
• Proprietary: more stable, more cost
• On-Premises: more now, less later
• Public Cloud: less now, more later

Use Cases – Choice of Deployment
• Just Spark, Just Works, no Customizations – Public Cloud
• Lots of Customizations, Willing to Tinker, Limited QoS – Open source, microservice, Mesos
• Configurable, Flexible, Enterprise Multi-Tenancy – Monolithic (for the moment) container deployment

THANK YOU
www.bluedata.com
Try BlueData EPIC for Free: bluedata.com/free
