RUNNING SPARK CLUSTERS IN CONTAINERS WITH DOCKER
Silicon Valley Big Data Association Meetup, February 16, 2016
Tom Phelan (tap@bluedata.com), Kartik Mathur (kartik@bluedata.com)

Outline
• Vocabulary
• Big Data New Realities
• Apache Spark
• Anatomy of a Spark Cluster
• Deployment Options: Public Cloud, On-Premises
• Demo
• Trade-Offs and Choices

Vocabulary
• Bare-Metal
• Virtual Machine (VM)
• Container
• Docker
• Microservice
• Monolithic (service)

Apache Spark™ is a fast and general engine for large-scale data processing.
Source: www.spark.apache.org

Big Data Deployment Options Source: Enterprise Strategy Group (ESG) Survey, 2015

Spark On-Premises
• Individual developers or data scientists build their own infrastructure on laptops, on VMs, or on bare-metal machines
• IT takes a bottom-up approach where everyone gets the same infrastructure/platform irrespective of their skill or use case

Why Change this Approach?
As the number of Spark users grows …
• IT needs to scale the deployment for additional use cases
• Application lifecycle requires dev/test/QA/prod environments
• Complexity overwhelms the organization, restricting adoption

Spark Adoption On-Premises
• Prototyping
  ① Get started with Spark for initial use cases and users
  ② Evaluation, testing, development, and QA
  ③ Prototype multiple data pipelines quickly
• Departmental
  ① Spin up dev/test clusters with replica image of production
  ② QA/UAT using production data without duplication
  ③ Offload specific users and workloads from production
• Spark-as-a-Service
  ① LOB multi-tenancy with strict resource allocations
  ② Bare-metal performance for business critical workloads
  ③ Self-service, shared infrastructure with strict access controls
Captions: Spark in a Secure Production Environment; Multi-Tenant Spark Deployment On-Premises; Dev/Test and Pre-Production

Big Data New Realities
Big Data Traditional Assumptions | Big Data New Realities | New Benefits and Value
Bare-metal | Containers and VMs | Big-Data-as-a-Service
Data locality | Compute and storage separation | Agility and cost savings
Data on local disks | In-place access on remote data stores | Faster time-to-insights

New Realities, New Requirements
• Software flexibility
  - Multiple distros, Hadoop and Spark, multiple configurations
  - Support new versions and apps as soon as they are available
• Multi-tenant support
  - Data access and network security
  - Differential Quality of Service (QoS)
• Stability, Scalability, Cost, Performance, and Security are always important

Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
  - Amazon Web Services EC2 and EMR
  - Microsoft Azure HDInsight
  - Google Cloud Dataproc
  - IBM Bluemix
  - ... and others
• Spark-as-a-Service
  - All of the above
  - Databricks

Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
  - VMware Big Data Extensions
  - OpenStack Sahara
• Containers
  - Mesos
  - BlueData

APACHE SPARK - ANATOMY OF A SPARK CLUSTER

Running Spark in Cluster Mode
Source: http://spark.apache.org/docs/1.3.0/cluster-overview.html

Common Deployment Patterns
Most Common Spark Deployment Environments (Cluster Managers)
• 48% Standalone mode
• 40% YARN
• 11% Mesos
Source: Spark Survey Report, 2015 (Databricks)
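From the client's side, the three cluster managers in the survey differ mainly in the master URL handed to `spark-submit`. The following is a minimal Python sketch, not taken from the talk: the helper names `master_url` and `submit_command` and the host names are hypothetical, and the ports are only the usual defaults (7077 for standalone, 5050 for Mesos). Spark 1.x releases also accept `yarn-client` and `yarn-cluster` as master values.

```python
# Hypothetical helpers: map a cluster manager name to the --master value
# that spark-submit expects. Ports are common defaults, not requirements.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(manager, host=None):
    """Return the spark-submit --master value for a cluster manager."""
    if manager == "standalone":
        return "spark://%s:%d" % (host, DEFAULT_PORTS["standalone"])
    if manager == "mesos":
        return "mesos://%s:%d" % (host, DEFAULT_PORTS["mesos"])
    if manager == "yarn":
        # YARN finds the ResourceManager through the Hadoop configuration
        # (HADOOP_CONF_DIR), so no host appears in the master value.
        return "yarn"
    raise ValueError("unknown cluster manager: %r" % manager)


def submit_command(manager, app, host=None, deploy_mode="client"):
    """Build the spark-submit argument list (it is not executed here)."""
    return ["spark-submit", "--master", master_url(manager, host),
            "--deploy-mode", deploy_mode, app]
```

For example, `submit_command("standalone", "app.py", host="master1")` yields a command line that targets `spark://master1:7077`.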

Avoid Solution Mismatch

Spark Cluster – Standalone Mode
[Diagram: Spark Client; Spark Master (Bare Metal / Virtual Machine); Spark Slave (Bare Metal / Virtual Machine) running tasks]

Spark Cluster – Hadoop YARN
[Diagram: Spark Client; Resource Manager; Spark Master; Node Manager; Spark Executor running tasks]

Spark MultiCluster + YARN
[Diagram: Controller; Worker; Worker]

Spark Cluster – Mesos
[Diagram: Spark Client; Spark Scheduler; Mesos Master; Mesos Slave; Spark Executor running tasks]

Spark Cluster – Mesos
[Diagram: Spark Client; Spark Mesos Framework; Master for Mesos; Spark Scheduler; Mesos Slave; Spark Executor running tasks]

APACHE SPARK - DEPLOYMENT OPTIONS

PUBLIC CLOUD – SPARK-AS-A-SERVICE (E.G. AWS)

Spark Cluster – Hadoop YARN
[Diagram: Spark Client/Zeppelin, Resource Manager, Spark Master, Node Manager, and Spark Executor with tasks, hosted on Virtual Machines]

AWS Spark-as-a-Service: Benefits
• Amazon EC2 Elastic Container Service (ECS)
  - Launch containers on EC2
  - Amazon Elastic Container Registry (ECR): Docker Images
• Amazon Elastic MapReduce (EMR)
  - Easy to use
  - Low startup costs: Hardware and human
  - Expandable

AWS Spark-as-a-Service: Challenges
• Data access
  - Already exists in S3
  - Ingest time
• Data security
• Software versions
  - Spark 1.6.0, Hadoop 2.7.1; MapR
• Cost
  - Short running vs. long running clusters
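The cost and ingest-time bullets can be made concrete with a rough comparison of transient (per-job) clusters against one long-running cluster. This is a deliberately simplified Python sketch, not a pricing model from the talk: the function names are hypothetical, it assumes a flat per-node hourly rate, and it ignores storage, data transfer, and billing granularity.

```python
def transient_cost(jobs, job_hours, ingest_hours, nodes, rate_per_node_hour):
    """Cost of starting a fresh cluster per job.

    Each job pays for its own runtime plus the time spent ingesting
    data into the new cluster before work can start.
    """
    return jobs * (job_hours + ingest_hours) * nodes * rate_per_node_hour


def long_running_cost(period_hours, nodes, rate_per_node_hour):
    """Cost of keeping one cluster up for the whole period."""
    return period_hours * nodes * rate_per_node_hour


def cheaper_option(jobs, job_hours, ingest_hours, period_hours,
                   nodes, rate_per_node_hour):
    """Return 'transient' or 'long-running', whichever costs less."""
    t = transient_cost(jobs, job_hours, ingest_hours, nodes, rate_per_node_hour)
    l = long_running_cost(period_hours, nodes, rate_per_node_hour)
    return "transient" if t < l else "long-running"
```

Under these assumptions, a few short jobs per month favor transient clusters, while frequent jobs with a long ingest step push the balance toward a long-running cluster.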

ON-PREMISES – SPARK + CONTAINERS + DCOS
Microservices deployment
Spark with Docker and Kubernetes/Swarm/Mesos

Spark Cluster – Mesos
[Diagram: Spark Client; Spark Scheduler; Mesos Master; Mesos Slave; Spark Executor running tasks in Containers]

Spark + Docker + DCOS: Benefits
• Easy to set up a dev/demonstration environment
  - Mesos framework for Spark available
  - Container isolation
  - Most of the pieces are available
• Complete control
  - Customization
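As an illustration of the "Mesos framework for Spark" bullet: Spark's Mesos support can launch executors inside a Docker image through the `spark.mesos.executor.docker.image` property. The Python sketch below only assembles the `spark-submit` arguments; the helper name `mesos_docker_submit`, the image name, and the master address are hypothetical placeholders, not values from the talk.

```python
def mesos_docker_submit(mesos_master, image, app, extra_conf=None):
    """Build spark-submit arguments for Spark on Mesos with Docker executors.

    mesos_master: "host:port" of the Mesos master (placeholder value).
    image: Docker image that contains a Spark installation (assumption).
    """
    conf = {"spark.mesos.executor.docker.image": image}
    conf.update(extra_conf or {})
    args = ["spark-submit", "--master", "mesos://" + mesos_master]
    # Sort keys so the generated command line is deterministic.
    for key in sorted(conf):
        args += ["--conf", "%s=%s" % (key, conf[key])]
    args.append(app)
    return args
```

Calling `mesos_docker_submit("mesos-master:5050", "example/spark:1.6.0", "app.py")` returns the argument list for a dev or demo run; production concerns such as networking and multi-tenancy are not addressed by this sketch.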

Spark + Docker + DCOS: Challenges
• Can be difficult to set up a production environment
  - Multi-tenancy, QoS
  - Software interoperability
  - Container cluster network connectivity and security

Spark + Docker + Mesos: Challenges
[Diagram: Mesos Master with Mesos Scheduler and Marathon Scheduler; Mesos Slave #1 and Mesos Slave #2, each with Mesos Exec processes running Container Tasks; Name Node and Data Node containers]

Spark + Docker + Mesos
[Diagram: Scheduler + Myriad Scheduler; Mesos Master with Mesos Scheduler and Marathon Scheduler; Mesos Slave #1 and Mesos Slave #2, each with several Mesos Exec processes running Container Tasks; Name Node and Data Node containers]

Spark + Docker + Mesos
[Diagram: Job and Task; Mesos Scheduler (microservice); Myriad Scheduler; Mesos Master with Mesos Scheduler and Marathon Scheduler; Mesos Slave #1 and Mesos Slave #2, each with Mesos Exec processes running Container Tasks; Name Node and Data Node containers]

ON-PREMISES – SPARK + CONTAINERS + BLUEDATA
Monolithic deployment
Spark-as-a-Service in an On-Premises Deployment

Spark – Standalone with Containers
[Diagram: Spark Client; Spark Master in a Container (on Bare Metal / Virtual Machine); Spark Slave in a Container (on Bare Metal / Virtual Machine) running tasks]
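The standalone-in-containers layout can be sketched as a set of `docker run` commands: one container for the Spark master and one per worker. This Python sketch is illustrative only and is not BlueData's implementation. The image name, container names, and network name are hypothetical, and it assumes the image's working directory is the Spark home so that `bin/spark-class` resolves.

```python
MASTER_CLASS = "org.apache.spark.deploy.master.Master"
WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def docker_run(name, image, command, network="spark-net", ports=()):
    """Build a 'docker run' argument list for one detached container."""
    args = ["docker", "run", "-d", "--name", name, "--hostname", name,
            "--net", network]
    for port in ports:
        # Publish the port on the host under the same number.
        args += ["-p", "%d:%d" % (port, port)]
    return args + [image] + list(command)


def standalone_cluster(image, workers=2):
    """Return docker commands for one master and N worker containers."""
    commands = [docker_run("spark-master", image,
                           ["bin/spark-class", MASTER_CLASS],
                           ports=(7077, 8080))]
    for i in range(1, workers + 1):
        # Workers reach the master by container hostname on the shared network.
        commands.append(docker_run("spark-worker-%d" % i, image,
                                   ["bin/spark-class", WORKER_CLASS,
                                    "spark://spark-master:7077"]))
    return commands
```

Printing `" ".join(cmd)` for each entry of `standalone_cluster("example/spark:1.6.0")` shows the commands without running them; the user-defined network (`spark-net` here) would need to exist first.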

Spark + Docker + BlueData: Benefits
• Enterprise quality
  - Deployment flexibility (on physical servers or VMs)
  - Network connectivity
  - Persistent IP addresses
  - Externally visible IP addresses
  - No NATing required
• Cloud-like experience: Spark-as-a-Service
  - Self-service access to instant clusters, simple Web UI

Spark + Docker + BlueData: Benefits
• Docker packaging of images
  - Distribution agnostic
  - Spark, Kafka, Cassandra, Zeppelin, and more
  - With or without YARN
  - Bring your own BI/analytics tool
• Currently on-premises
  - Future: on-premises, public cloud, or hybrid

Spark + Docker + BlueData: Benefits
• Multi-tenancy
  - Per tenant QoS, not per service
  - Private VLAN per Tenant
  - Limit Data Access
• HA, software upgrades, data access, …
  - BlueData’s DataTap isolates data from compute
  - Upgrade compute independent of data
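"Per tenant QoS, not per service" can be read as: a resource limit applies to everything a tenant runs, summed across its clusters, rather than to each service separately. The Python sketch below illustrates that idea with a CPU-core quota. The `Tenant` class and its methods are hypothetical and do not describe BlueData's actual mechanism.

```python
class Tenant:
    """Hypothetical tenant with a single core quota shared by all clusters."""

    def __init__(self, name, core_quota):
        self.name = name
        self.core_quota = core_quota
        self.clusters = {}  # cluster name -> cores currently allocated

    def used_cores(self):
        """Total cores in use across every cluster the tenant owns."""
        return sum(self.clusters.values())

    def launch(self, cluster, cores):
        """Admit the request only if the tenant-wide quota still holds."""
        if self.used_cores() + cores > self.core_quota:
            return False
        self.clusters[cluster] = self.clusters.get(cluster, 0) + cores
        return True
```

With a 16-core tenant, a Spark cluster using 8 cores and a Kafka cluster using 6 are both admitted, while a further 4-core request is refused no matter which service asks for it.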

BlueData EPIC Software: Demo

TRADE-OFFS AND CHOICES

Trade-Offs (Not Unique to Spark)
[Figure: trade-off spectrums. Open Source (less stable, less cost) vs. Proprietary (more stable, more cost); On-Premises (more now, less later) vs. Public Cloud (less now, more later)]

Use Cases: Choice of Deployment
• Just Spark, Just Works, no Customizations – Public Cloud
• Lots of Customizations, Willing to Tinker, Limited QoS – Open source, microservice, Mesos
• Configurable, Flexible, Enterprise Multi-Tenancy – Monolithic (for the moment) container deployment

THANK YOU
www.bluedata.com
Try BlueData EPIC for Free: bluedata.com/free