Azure Databricks for the Traditional DBA
Rodney Landrum
Boston SQL Saturday BI Edition 2019

About Me
• MVP Data Platform (2008 – 2016)
• Senior Database/BI Consultant with Ntirety
• Author of SQL Server Tacklebox et al.
• Not a Data Scientist
• Does not understand DevOps
• Has a Redgate tattoo
• WFA
• rodneylandrum@hotmail.com

Session Goals
• Relay the challenges and successes of deploying two real-world Azure Databricks projects
• Provide consolidated information on what the Traditional DBA needs to know to successfully deploy and manage Azure Databricks

The Traditional DBA
• Responsible for the company's data assets
• Comfortable managing multiple SQL Server instances
• Does not shy away from long hours
• Works well with other teams
• Gets assigned technologies with the word "Data" in them, like Data Factory and Databricks

What the DBA Needs To Know For a Databricks Project
• How to use Google or Bing
• Very basic R and Python
• How to move about in the Azure Portal
• How to interpret and resolve a crazy new world of syntax errors
• Who in your network you should contact when you hit a data brick wall
• What you can pawn off on the data scientists and what you have to do yourself
• How to conceal your excitement when some piece of code you wrote in a language you do not know actually works

Specific Challenges and Requirements
• Migrate 2000+ lines of R code from RStudio to an Azure Databricks Notebook
• Assign role-based user access to Databricks notebooks
• Use Azure Key Vault to avoid tokens and passwords in code
• Load R libraries
• Connect to SQL and Blob storage from a Notebook
• Pass parameters to a Notebook
• Orchestrate Notebook runs and data load tasks via Data Factory

What is Databricks

Who Uses Databricks

Real World Considerations For Data Science Projects
• Many data scientists want to use R as they do in RStudio
• R often requires many libraries to be loaded
• DBAs do not typically use RStudio
• DBAs do not need to load libraries for T-SQL
• Data scientists produce and consume a lot of data: SQL, CSVs, PNGs, JSON
• Data scientists look to the DBA to guide them through the transition to Databricks

Creating a Databricks Workspace

Databricks Home

Creating A New Notebook

Creating A New Notebook

Assigning Access to the Databricks Workspace
https://docs.azuredatabricks.net/administration-guide/admin-settings/index.html#admin-console

Assigning Access to the Databricks Workspace

Assigning Access to the Databricks Cluster

How to hide secrets
Add #secrets/createScope to the Databricks URL (capital S):
https://southcentralus.azuredatabricks.net/?o=3723599594135049#secrets/createScope
Supply the Azure Key Vault DNS name and Resource ID (scope name: BS_TK)
https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html

How to hide secrets
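Once the Key Vault-backed scope exists, a notebook cell can pull a secret without ever showing the value. A minimal Python sketch, assuming the scope name BS_TK from the previous slide and a hypothetical secret named sql-password:

# Retrieve a secret from the Azure Key Vault-backed scope created above.
# The scope "BS_TK" comes from the slide; the secret name is a hypothetical placeholder.
sql_password = dbutils.secrets.get(scope="BS_TK", key="sql-password")

# Databricks redacts the value in notebook output, so the password never appears in clear text.
print(sql_password)  # prints [REDACTED]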

Mounting Azure Blob Storage (Ctrl + Enter runs the cell)
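The mount itself is a one-time Python (or Scala) call. A sketch of the standard dbutils.fs.mount pattern, where the container, storage account, and secret names are hypothetical placeholders and the account key is read from the secret scope rather than pasted into the notebook:

# Mount an Azure Blob Storage container to DBFS (run once per workspace).
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/blobdata",
    extra_configs={
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope="BS_TK", key="storage-account-key")
    }
)

# Verify the mount by listing its contents.
display(dbutils.fs.ls("/mnt/blobdata"))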

Accessing Azure Blob Storage - Python
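With the container mounted, the Python side is ordinary Spark reads and writes against the mount point. A short sketch, assuming a hypothetical CSV file name:

# Read a CSV file from the mounted container into a Spark DataFrame.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/blobdata/super_heroes.csv"))

display(df)

# Write results back to the same container.
df.write.mode("overwrite").csv("/mnt/blobdata/output/super_heroes_out")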

Thanks to this video: https://www.youtube.com/watch?v=Duk0scdO9P0&t=307s

Mounting Azure Blob Storage - R

Accessing Azure Blob Storage - R

Loading Libraries
Databricks supports three library modes: Workspace, cluster-installed, and notebook-scoped.
A Workspace library exists in the Workspace. A Workspace library has the same attributes as a cluster-installed library, plus its path in the Workspace. A Workspace library is effectively a template from which you create a cluster-installed library. To allow a library to be shared by all users in a Workspace, create the library in the Shared folder. To make it available only to a single user, create the library in the user folder.
A cluster-installed library exists only in the context of the cluster it will be installed on. It has all of the attributes the cluster needs to install the library: DBFS path to the JAR, Maven coordinate, PyPI package, and so on.
A notebook-scoped library exists only in the context of the notebook in which it is installed. See Library utilities.
https://docs.databricks.com/user-guide/libraries.html

Notebook Level Libraries - Recommended
• Install libraries when and where they're needed, from within a notebook. This eliminates the need to globally install libraries on a cluster before you can attach a notebook that requires those libraries.
• Notebooks are completely portable between clusters (as long as they're running Databricks 5.1 or later, of course).
• Library environments are scoped to individual sessions. Multiple notebooks using different versions of a particular library can be attached to a cluster without interference.
• Different users on the same cluster can add and remove dependencies without affecting other users. You don't need to restart your cluster to reinstall libraries.
• When a notebook is detached, the session is garbage collected. Resources on the cluster for libraries installed during that session are released.
https://databricks.com/blog/2019/01/08/introducing-databricks-library-utilities-for-notebooks.html
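For Python notebooks, the library utilities described in that blog post look like the sketch below; the package name is a hypothetical example, and nothing gets installed cluster-wide:

# Install a library scoped to this notebook session only (Databricks Runtime 5.1+).
dbutils.library.installPyPI("fuzzywuzzy")

# Show what is currently installed for this notebook session.
dbutils.library.list()

# In a following cell the package can then be imported as usual:
# import fuzzywuzzy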

Load and Test R Libraries
Default session info after reboot with no libraries
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

Load and Test R Libraries

Session After Loading Libraries to Notebook

Getting More Info On Loaded Libraries

A bit about data frames
The DataFrame concept is not unique to Spark. R and Python both have similar concepts. However, Python/R DataFrames (with some exceptions) exist on one machine rather than multiple machines. This limits what you can do with a given DataFrame in Python and R to the resources that exist on that specific machine. However, since Spark has language interfaces for both Python and R, it's quite easy to convert Pandas (Python) DataFrames to Spark DataFrames and R DataFrames to Spark DataFrames (in R).
https://docs.azuredatabricks.net/spark/latest/dataframes-datasets/index.html#dataframes
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/1792412399382575/3601578643761083/latest.html
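A minimal Python sketch of that conversion, using hypothetical data (SparkR's as.DataFrame() and collect() follow the same idea on the R side):

import pandas as pd

# A small single-machine pandas DataFrame...
pdf = pd.DataFrame({"hero": ["Batman", "Wonder Woman"], "city": ["Gotham", "Themyscira"]})

# ...becomes a distributed Spark DataFrame...
sdf = spark.createDataFrame(pdf)
sdf.show()

# ...and can be pulled back to the driver once the result is small enough.
pdf_back = sdf.toPandas()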

How to Connect to Azure SQL

How to Connect to Azure SQL

How to Connect to Azure SQL
SparkR::createOrReplaceTempView(super_hero_df, "SH_View")

How to Connect to Azure SQL
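The Python equivalent of the connection shown in these slides is a JDBC read; a sketch under assumed names (server, database, table, and user are hypothetical placeholders, and the password comes from the secret scope):

# Read a table from Azure SQL Database over JDBC.
jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=SuperHeroDB;encrypt=true;trustServerCertificate=false;")

super_hero_df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.SuperHeroes")
    .option("user", "dbadmin")
    .option("password", dbutils.secrets.get(scope="BS_TK", key="sql-password"))
    .load())

# Expose the result to %sql cells, mirroring the SparkR createOrReplaceTempView call above.
super_hero_df.createOrReplaceTempView("SH_View")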

How to Parameterize the Notebook
https://docs.databricks.com/user-guide/notebooks/widgets.html

How to Parameterize the Notebook

How to Parameterize the Notebook
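Parameters come in through widgets. A minimal Python sketch, where the widget name and default value are hypothetical:

# Create a text widget with a default value; it appears at the top of the notebook.
dbutils.widgets.text("run_date", "2019-01-01", "Run Date")

# Read the current value (either the default or what the caller passed in).
run_date = dbutils.widgets.get("run_date")
print("Processing data for " + run_date)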

In Sum
• Assigned permissions
• Loaded libraries for R
• Accessed SQL and Blob with secrets
• Returned datasets for R and SparkR data frames
• Created parameters for notebook

Orchestrating with Data Factory

Orchestrating with Data Factory

Orchestrating with Data Factory
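On the notebook side of the Data Factory hand-off, values supplied in the Notebook activity's baseParameters arrive as widget values, and dbutils.notebook.exit can return a short string that the pipeline reads from the activity's run output. A sketch, reusing the hypothetical run_date parameter and SH_View temp view from earlier:

# Parameter passed from the Data Factory Notebook activity's baseParameters.
run_date = dbutils.widgets.get("run_date")

# Example workload using the temp view created earlier in the notebook.
row_count = spark.table("SH_View").count()

# Return a result string to Data Factory; it appears in the activity's run output.
dbutils.notebook.exit("Loaded " + str(row_count) + " rows for " + run_date)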

Databricks and Data Factory Demo

Fun with Syntax Errors Don’t name a variable “Day”

Questions