RUNNING SPARK CLUSTERS IN CONTAINERS WITH DOCKER
Silicon Valley Big Data Association Meetup, February 16, 2016
Tom Phelan, tap@bluedata.com
Kartik Mathur, kartik@bluedata.com
Outline
• Vocabulary
• Big Data New Realities
• Apache Spark
• Anatomy of a Spark Cluster
• Deployment Options: Public Cloud, On-Premises
• Demo
• Trade-Offs and Choices
Vocabulary
• Bare-Metal
• Virtual Machine (VM)
• Container
• Docker
• Microservice
• Monolithic (service)
Apache Spark™ is a fast and general engine for large-scale data processing.
Source: www.spark.apache.org
Big Data Deployment Options Source: Enterprise Strategy Group (ESG) Survey, 2015
Spark On-Premises
• Individual developers or data scientists build their own infrastructure on laptops, VMs, or bare-metal machines
• IT takes a bottom-up approach in which everyone gets the same infrastructure/platform irrespective of their skill or use case
Why Change this Approach?
As the number of Spark users grows …
• IT needs to scale the deployment for additional use cases
• Application lifecycle requires dev/test/QA/prod environments
• Complexity overwhelms the organization, restricting adoption
Spark Adoption On-Premises
Prototyping
① Get started with Spark for initial use cases and users
② Evaluation, testing, development, and QA
③ Prototype multiple data pipelines quickly
Departmental
① Spin up dev/test clusters with replica image of production
② QA/UAT using production data without duplication
③ Offload specific users and workloads from production
Spark-as-a-Service
① LOB multi-tenancy with strict resource allocations
② Bare-metal performance for business critical workloads
③ Self-service, shared infrastructure with strict access controls
Captions: Spark in a Secure Production Environment; Multi-Tenant Spark Deployment On-Premises; Dev/Test and Pre-Production
Big Data New Realities
Big Data Traditional Assumptions | Big Data New Realities | New Benefits and Value
Bare-metal | Containers and VMs | Big-Data-as-a-Service
Data locality | Compute and storage separation | Agility and cost savings
Data on local disks | In-place access on remote data stores | Faster time-to-insights
New Realities, New Requirements
• Software flexibility
- Multiple distros, Hadoop and Spark, multiple configurations
- Support new versions and apps as soon as they are available
• Multi-tenant support
- Data access and network security
- Differential Quality of Service (QoS)
• Stability, Scalability, Cost, Performance, and Security are always important
Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
- Amazon Web Services EC2 and EMR
- Microsoft Azure HDInsight
- Google Cloud Dataproc
- IBM Bluemix
- ... and others
• Spark-as-a-Service
- All of the above
- Databricks
Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
- VMware Big Data Extensions
- OpenStack Sahara
• Containers
- Mesos
- BlueData
APACHE SPARK - ANATOMY OF A SPARK CLUSTER
Running Spark in Cluster Mode
Source: http://spark.apache.org/docs/1.3.0/cluster-overview.html
Common Deployment Patterns
Most Common Spark Deployment Environments (Cluster Managers)
• 48% Standalone mode
• 40% YARN
• 11% Mesos
Source: Spark Survey Report, 2015 (Databricks)
Avoid Solution Mismatch
Spark Cluster – Standalone Mode
[Diagram: Spark Client; Spark Master on bare metal or a virtual machine; Spark Slave on bare metal or a virtual machine, running tasks]
Spark Cluster – Hadoop YARN
[Diagram: Spark Client; Resource Manager; Spark Master running a task; Node Manager; Spark Executor running tasks]
Spark MultiCluster + YARN
[Diagram: Worker, Controller, Worker]
Spark Cluster – Mesos
[Diagram: Spark Client; Spark Scheduler (the Spark framework for Mesos); Mesos Master; Mesos Slave; Spark Executor running tasks]
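From the client's point of view, the three cluster managers shown in these diagrams differ mainly in the value passed to `spark-submit --master`. The following is a minimal sketch, not taken from the talk, of how that value might be built. The helper names and the example hostnames are hypothetical; the ports are the usual defaults for the standalone master (7077) and the Mesos master (5050).

```python
from typing import List, Optional

# Usual default ports for the standalone master and the Mesos master.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}
URL_SCHEMES = {"standalone": "spark", "mesos": "mesos"}


def master_url(manager: str, host: Optional[str] = None) -> str:
    """Return the --master value for a cluster manager (hypothetical helper)."""
    if manager == "yarn":
        # YARN is located through the Hadoop configuration on the client,
        # so the master value carries no host or port.
        return "yarn"
    if manager not in URL_SCHEMES:
        raise ValueError(f"unknown cluster manager: {manager}")
    if not host:
        raise ValueError(f"{manager} mode needs the master's host name")
    return f"{URL_SCHEMES[manager]}://{host}:{DEFAULT_PORTS[manager]}"


def spark_submit_args(manager: str, app: str, host: Optional[str] = None) -> List[str]:
    """Build a spark-submit argument list for the given cluster manager."""
    return ["spark-submit", "--master", master_url(manager, host), app]


if __name__ == "__main__":
    # The hostnames below are placeholders, not real endpoints.
    print(" ".join(spark_submit_args("standalone", "app.py", "spark-master.example")))
    print(" ".join(spark_submit_args("yarn", "app.py")))
    print(" ".join(spark_submit_args("mesos", "app.py", "mesos-master.example")))
```

The application itself stays the same across the three modes; only the master value and the cluster-side services change.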
APACHE SPARK - DEPLOYMENT OPTIONS
PUBLIC CLOUD – SPARK-AS-A-SERVICE (E.G., AWS)
Spark Cluster – Hadoop YARN
[Diagram: Spark Client/Zeppelin; Resource Manager; Spark Master running a task; Node Manager; Spark Executor running tasks; components hosted in virtual machines]
AWS Spark-as-a-Service: Benefits
• Amazon EC2 Elastic Container Service (ECS)
- Launch containers on EC2
- Amazon Elastic Container Registry (ECR): Docker Images
• Amazon Elastic MapReduce (EMR)
- Easy to use
- Low startup costs: Hardware and human
- Expandable
AWS Spark-as-a-Service: Challenges
• Data access
- Already exists in S3
- Ingest time
• Data security
• Software versions
- Spark 1.6.0, Hadoop 2.7.1; MapR
• Cost
- Short running vs. long running clusters
ON-PREMISES – SPARK + CONTAINERS + DCOS Microservices deployment Spark with Docker and Kubernetes/Swarm/Mesos
Spark Cluster – Mesos
[Diagram: Spark Client; Spark Scheduler; Mesos Master; Mesos Slave; Spark Executor running tasks in containers]
Spark + Docker + DCOS: Benefits
• Easy to set up a dev/demonstration environment
- Mesos framework for Spark available
- Container isolation
- Most of the pieces are available
• Complete control
- Customization
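As a rough illustration of the dev/demonstration setup, Spark's Mesos support can be asked to launch executors inside a Docker image through the `spark.mesos.executor.docker.image` property. The sketch below only assembles the corresponding `--conf` arguments. The helper names, the image name, and the resource values are hypothetical and not from the talk.

```python
from typing import Dict, List


def mesos_docker_conf(image: str, max_cores: int, executor_memory: str) -> Dict[str, str]:
    """Spark properties for running executors in Docker on Mesos (hypothetical helper)."""
    return {
        # Docker image that Mesos starts for each Spark executor.
        "spark.mesos.executor.docker.image": image,
        # Upper bound on the total cores the application may claim.
        "spark.cores.max": str(max_cores),
        # Memory per executor, e.g. "2g".
        "spark.executor.memory": executor_memory,
    }


def to_conf_args(conf: Dict[str, str]) -> List[str]:
    """Turn a property map into repeated --conf key=value arguments."""
    args: List[str] = []
    for key in sorted(conf):  # sorted for a stable, reviewable command line
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # "example/spark-executor:1.6.0" is a placeholder image name.
    conf = mesos_docker_conf("example/spark-executor:1.6.0", max_cores=8, executor_memory="2g")
    print(" ".join(["spark-submit", "--master", "mesos://mesos-master.example:5050"]
                   + to_conf_args(conf) + ["app.py"]))
```

The image has to contain a Spark installation compatible with the driver; building and distributing that image is one of the pieces left to the operator in this approach.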
Spark + Docker + DCOS: Challenges
• Can be difficult to set up a production environment
- Multi-tenancy, QoS
- Software interoperability
- Container cluster network connectivity and security
Spark + Docker + Mesos: Challenges
[Diagram: Mesos Master with Mesos Scheduler and Marathon Scheduler; Mesos Slave #1 and Mesos Slave #2, each with Mesos Executors running tasks in containers; Name Node and Data Node in containers]
Spark + Docker + Mesos
[Diagram: as above, with a Myriad Scheduler added and more Mesos Executors running tasks and Data Nodes in containers on both slaves]
Spark + Docker + Mesos
[Diagram: Job and Task flow through the Mesos Scheduler (microservice); Myriad Scheduler; Mesos Master with Mesos Scheduler and Marathon Scheduler; Mesos Slave #1 and Mesos Slave #2 with Mesos Executors running tasks in containers; Name Node and Data Node in containers]
ON-PREMISES – SPARK + CONTAINERS + BLUEDATA Monolithic deployment Spark-as-a-Service in an On-Premises Deployment
Spark – Standalone with Containers
[Diagram: Spark Client; Spark Master in a container on bare metal or a virtual machine; Spark Slave in a container on bare metal or a virtual machine, running tasks]
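The layout in this diagram can be approximated with plain Docker by starting one container for the standalone master and one per worker (the "Spark Slave" above), all attached to the same Docker network. This is a generic sketch under stated assumptions, not BlueData's implementation: the image name, the network name, the container names, and the `/opt/spark` install path are hypothetical. The two Spark classes are the standalone master and worker entry points.

```python
from typing import List

# Assumed location of Spark inside the (hypothetical) image.
SPARK_CLASS = "/opt/spark/bin/spark-class"
MASTER_PORT = 7077  # usual default port of the standalone master


def master_container(image: str, network: str, name: str = "spark-master") -> List[str]:
    """docker run command that starts a standalone master in a container."""
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", network,
        image,
        SPARK_CLASS, "org.apache.spark.deploy.master.Master",
        "--host", name, "--port", str(MASTER_PORT),
    ]


def worker_container(image: str, network: str, index: int,
                     master: str = "spark-master") -> List[str]:
    """docker run command that starts one standalone worker pointing at the master."""
    name = f"spark-worker-{index}"
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", network,
        image,
        SPARK_CLASS, "org.apache.spark.deploy.worker.Worker",
        f"spark://{master}:{MASTER_PORT}",
    ]


if __name__ == "__main__":
    # Placeholder image and network names; print the commands instead of running them.
    print(" ".join(master_container("example/spark:1.6.0", "spark-net")))
    for i in range(2):
        print(" ".join(worker_container("example/spark:1.6.0", "spark-net", i)))
```

Workers find the master by container name here, which relies on name resolution within a user-defined Docker network on a single host. Multi-host networking, persistent addresses, and external visibility are separate concerns.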
Spark + Docker + BlueData: Benefits
• Enterprise quality
- Deployment flexibility (on physical servers or VMs)
- Network connectivity
- Persistent IP addresses
- Externally visible IP addresses
- No NATing required
• Cloud-like experience: Spark-as-a-Service
- Self-service access to instant clusters, simple Web UI
Spark + Docker + BlueData: Benefits
• Docker packaging of images
- Distribution agnostic
- Spark, Kafka, Cassandra, Zeppelin, and more
- With or without YARN
- Bring your own BI/analytics tool
• Currently on-premises
- Future: on-premises, public cloud, or hybrid
Spark + Docker + BlueData: Benefits
• Multi-tenancy
- Per tenant QoS, not per service
- Private VLAN per Tenant
- Limit Data Access
• HA, software upgrades, data access, …
- BlueData’s DataTap isolates data from compute
- Upgrade compute independent of data
BlueData EPIC Software: Demo
TRADE-OFFS AND CHOICES
Trade-Offs (Not Unique to Spark)
• Open Source: less stable, less cost
• Proprietary: more stable, more cost
• On-Premises: more now, less later
• Public Cloud: less now, more later
Use Cases: Choice of Deployment
• Just Spark, Just Works, no Customizations – Public Cloud
• Lots of Customizations, Willing to Tinker, Limited QoS – Open source, microservice, Mesos
• Configurable, Flexible, Enterprise Multi-Tenancy – Monolithic (for the moment) container deployment
THANK YOU
www.bluedata.com
Try BlueData EPIC for Free: bluedata.com/free