Databricks: the new kid on the block
Antonio Abalos Castillo
antonioa@avanade.com
http://www.sqlsaturday.com/746/Sessions/Details.aspx?sid=78633
A big thanks to all of our sponsors!
… the new kid on the block
[informal] Someone who is new in a place or organization and has many things to learn about it.
Well, actually we are the ones who really need to learn about it!
https://dictionary.cambridge.org/es/diccionario/ingles/new-kid-on-the-block?q=the-new-kid-on-the-block
https://www.phrases.org.uk/meanings/255875.html
OK, this is about BI and data science… and the failure rate for analytics, BI, and big data projects is 85%!
https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/
http://www.digitaljournal.com/tech-and-science/technology/big-data-strategies-disappoint-with-85-percent-failure-rate/article/508325
https://twitter.com/nheudecker/status/928720268662530048
Who is already “on the block”?
• Azure Data Factory
• Azure SQL DB
• Azure Cosmos DB
• Azure Import/Export Service
• Azure Storage Blobs
• Azure IoT Hub
• Azure Data Lake Store
• Azure SQL Data Warehouse
• Azure Data Lake Analytics
• Azure HDInsight
• Azure Databricks
• Azure Analysis Services
• Azure ML
• Power BI
• ML Server
• Azure Event Hubs
• Azure Search
• Kafka on Azure HDInsight
• Azure ExpressRoute
• Azure Active Directory
• Azure Data Catalog
• Azure Network Security Groups
• Azure Stream Analytics
• Azure Key Management Service
• Operations Management Suite
• Bot Service
• Cognitive Services
• Azure Functions
• Visual Studio
More precisely, for big data: HDInsight
• Includes Jupyter and Zeppelin notebooks
• Remote API for job management
• Integrated with Blob storage, Event Hubs for streaming and Power BI for analytics
• Quick to deploy and scale
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
HDInsight, other considerations
• Provisioning (template: 101-hdinsight-spark-linux): https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql
• Clusters have to be created (20 minutes) and deleted after use
• Admins have to decide what to do with the disks and files
• Data Factory can be used to automate the process (on-demand clusters)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query
Azure Machine Learning Studio
• Serverless
• Web based
• Active Directory integrated
• Notebooks
• Limited regions (West Europe)
Azure Machine Learning Studio
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
Azure Machine Learning Workbench (aka Machine Learning Services)
(In preview as of August 2018)
• Desktop application
• Python based and Git compatible
• Built-in Jupyter notebooks
• Integrated with Azure AD
• Deploys and runs models via Docker containers (Azure Machine Learning Experimentation service)
https://docs.microsoft.com/en-us/azure/machine-learning/service/
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/experimentation-service-configuration
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
Azure Machine Learning Workbench and Jupyter notebooks
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks
Microsoft Machine Learning Server
• Previously known as “R Server”
• Extends R with parallel tools for big data processing
• Available in HDInsight
• Runs models via Hadoop or Spark
• Can publish models via web service
• Can run Python too
http://blog.revolutionanalytics.com/2016/01/microsoft-r-open.html
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/machine-learning-server/operationalize/quickstart-publish-r-web-service#b-publish-model-as-a-web-service
What is the point of notebooks?
https://www.svds.com/why-notebooks-are-super-charging-data-science/
Isn’t everything about Jupyter?
• Databricks
• Azure Machine Learning Studio
• Azure Machine Learning Workbench
• Jupyter
• Data Science VM
• HDInsight
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks
https://notebooks.azure.com/
What does the technology framework look like?
Some tools in the Azure technology framework for data science
Stages covered: data preparation and model execution
• Azure Notebooks
• Spark on HDInsight
• Azure Machine Learning Workbench
• Docker
• Azure Machine Learning Studio
• Machine Learning Server
• Other tools (R Studio, Visual Studio Code, …)
• SQL Server (yes!)
• Data Factory/Data Lake Analytics
• Azure Machine Learning web service
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
Big data architectures
https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
Big data and advanced analytics scenarios
• Modern Data Warehousing: “We want to integrate all our data including ‘big data’ with our data warehouse”
• Advanced Analytics: “We are trying to predict when our customers churn”
• Real-time Analytics: “We are trying to get insights from our devices in real-time”
Databricks: a fast, easy, and collaborative Apache Spark-based analytics platform
OK, but what is Databricks?
• Best of Databricks
• Best of Microsoft
The leading Apache Spark analytics platform
“It is not so often in the software industry that the most widely used tool is also the best available platform to choose from” (Dr. Veljko Krunic)
Databricks foundations
What is Apache Spark?
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets on top of an existing Hadoop Distributed File System (HDFS) infrastructure.
What is Hadoop?
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
What do we get with Spark?
• Allows programmers to develop complex, multi-step data pipelines
• In-memory data sharing across different jobs (unlike Hadoop, which is HDFS file-based)
• More than just Map and Reduce functions
• Optimizes arbitrary operator graphs
• Lazy evaluation of big data queries
• Provides concise and consistent APIs in Scala, Java and Python
• Interactive shell for Scala and Python
• Support for SQL and R
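Lazy evaluation is the least intuitive item on this list, so here is a minimal sketch of the idea in plain Python. It is a toy model, not Spark's API: the `LazyDataset` class and its methods are hypothetical names chosen to mirror the familiar pattern where transformations only record a plan and an action triggers execution.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 ops: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._ops = ops  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: extend the plan, touch no data.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: extend the plan, touch no data.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: only now is the recorded plan run over the data.
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def traced_square(x):
    calls.append(x)  # record every real invocation
    return x * x


plan = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)  # still empty: nothing has run yet
squares = plan.collect()           # the action executes the whole plan
```

Because the whole chain is known before anything runs, an engine built this way is free to reorder or fuse steps, which is what "optimizes arbitrary operator graphs" refers to.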
OK wait, I like Spark but… I don’t want Databricks
Azure still has HDInsight with Spark on top, but:
- Cluster management is up to you
- Notebook integration has to be configured (Jupyter or Zeppelin)
- Lacks memory and performance enhancements
Some good things still remain:
- Anaconda comes preloaded by default
- Azure integration with other services (Data Lake, Machine Learning, Power BI)
- REST APIs for service deployment and job management (Livy)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
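To make the Livy point concrete, here is a rough sketch of what a batch job submission to an HDInsight Spark cluster could look like. It only builds the HTTP request and does not send it. The cluster name, credentials and file path are made-up placeholders, and the `build_livy_batch_request` helper is a hypothetical name; the `/livy/batches` endpoint, the `file` field and the `X-Requested-By` header follow the usual Livy-on-HDInsight pattern, but check them against your cluster before relying on them.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, app_file):
    """Build (but do not send) a Livy batch submission for an HDInsight cluster."""
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"
    body = json.dumps({"file": app_file}).encode("utf-8")
    # HDInsight gateways use HTTP basic authentication with the cluster login.
    credentials = base64.b64encode(
        f"{user}:{password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
            "X-Requested-By": user,
        },
        method="POST",
    )


# All values below are placeholders for illustration only.
livy_request = build_livy_batch_request(
    "mycluster", "admin", "not-a-real-password", "wasbs:///example/app.py"
)
# Sending it would be: urllib.request.urlopen(livy_request)
```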
Why Databricks then?
• Unified platform for data science and data engineering
• Easy to promote experiments to “products”
• Unified security model, encryption and auditing
• Optimized version of Spark, running 10 to 40x faster
Azure Databricks
• Collaborative Workspace: data engineer, data scientist, business analyst
• Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notification & logs
• Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, serverless, REST APIs
• Connected data sources and outputs: IoT / streaming data, cloud storage, data warehouses, Hadoop storage, machine learning models, BI tools, data exports
• Enhance productivity. Build on a secure & trusted cloud. Scale without limits.
Databricks in Azure
• Control plane managed by Databricks
• Data plane controlled by Azure
• Deployed as IaaS using as many nodes as required
Control plane
• Notebooks, jobs, clusters, users and ACLs are managed from the control plane
• These services store data in dedicated Databricks databases (not accessible to external users)
• The control plane is accessible from the Databricks UX and the Databricks API
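As a small illustration of reaching the control plane through the Databricks API, the sketch below builds a request that lists the clusters of a workspace. It is not sent. The workspace URL and token are placeholders, and `build_list_clusters_request` is a hypothetical helper name; the `/api/2.0/clusters/list` path and bearer-token header are how the Clusters API is normally called, assuming a personal access token has been generated in the workspace.

```python
import urllib.request


def build_list_clusters_request(workspace_url, token):
    """Build (but do not send) a 'list clusters' call to the Databricks REST API."""
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder workspace URL and token, for illustration only.
list_request = build_list_clusters_request(
    "https://westeurope.azuredatabricks.net/", "dapi-example-token"
)
# Sending it would be: urllib.request.urlopen(list_request)
```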
Data plane
• The Spark clusters are deployed to the customer’s Azure subscription
• Each workspace and associated clusters are created in dedicated VNETs
• Access to VNETs is restricted by network security groups (NSG)
Databricks setup: how to provision Databricks from Azure
Databricks setup – Creating workspace
https://azure.microsoft.com/en-us/pricing/details/databricks/
Databricks setup – Creating workspace: control plane provisioned
Databricks setup – Creating workspace: control plane. So far, nothing to worry about.
Databricks setup – Creating clusters
Databricks setup – Creating clusters
Provisioning time is approx. 8 minutes
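The same cluster can also be created without the UI. Below is a hedged sketch of the request body for the Clusters API `create` call. Nothing is sent, and every value is a placeholder: the runtime string and node type must be ones your workspace actually offers, and `build_create_cluster_request` is a hypothetical helper name. Setting `autotermination_minutes` is one way to avoid paying for an idle cluster.

```python
import json
import urllib.request


def build_create_cluster_request(workspace_url, token, cluster_name,
                                 spark_version, node_type_id, num_workers,
                                 autotermination_minutes=60):
    """Build (but do not send) a 'create cluster' call to the Databricks REST API."""
    payload = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,    # must match a runtime offered by the workspace
        "node_type_id": node_type_id,      # Azure VM size used for the nodes
        "num_workers": num_workers,
        "autotermination_minutes": autotermination_minutes,
    }
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values, for illustration only.
create_request = build_create_cluster_request(
    "https://westeurope.azuredatabricks.net",
    "dapi-example-token",
    cluster_name="demo-cluster",
    spark_version="4.2.x-scala2.11",
    node_type_id="Standard_DS3_v2",
    num_workers=2,
)
create_payload = json.loads(create_request.data)
```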
Databricks setup – Testing setup
Impressive results!! ;)
Databricks setup – Behind the scenes
Here are the cost drivers:
• Separate resource group, managed from the control plane
• Network, VMs, storage, disks
Databricks setup – Behind the scenes
Cluster terminated:
• Virtual machines and networks removed
• Storage account remains
Other resources
• https://azure.microsoft.com/en-us/services/databricks/
• https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/
• https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/
• https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/
• https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK3377
• https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK4016
• https://databricks.com/product/azure
• https://docs.microsoft.com/en-us/azure/sql-data-warehouse-bestpractices
Help deciding which Machine Learning tool to use
• https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
• https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
• https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-more-machine-learning
Thank you!!