Using R in Azure Machine Learning
Take Azure ML to the next level with R
Alnis Bajars. alnis@bajars.com @alnisb

Agenda: Using R in Azure Machine Learning
• R – Ecosystem Fundamentals
• R – Selected Language Elements
• Data Science Principles (including some lingo)
• Azure ML Quick Overview
• Azure ML + R

Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow-up.
• No mathematical proofs. Hopefully.
• You will know what you don’t know about Data Science.
• Set expectations and realities about Data Science.
These slides are meant to be used!

References
Coursera – R Programming (Johns Hopkins)
• Quality of explanations variable.
• But the practical assignments are good, and deadline driven.
Safari Books Online
• Video: Introduction to Data Science with R. Garrett Grolemund (RStudio).
• Most published books on R.
Hadley Wickham (@hadleywickham) – Modern Godfather of R
• Chief Data Scientist at RStudio
• Author of influential R packages
Azure Machine Learning
• Intro video at Microsoft Virtual Academy

References continued
edX – Azure ML with R/Python
• A number of demos driven by this content.

R – Ecosystem Fundamentals

R vs Python
R Pros
• More mature data science support (20+ years), purpose built
• More established ML support
• T-SQL integration in SQL Server 2016 (out of scope)
Python Pros
• Best all-round scripting language. Data science support improving.
• Better 64-bit support and scalability?
R performance and scalability: don’t forget Revolution Analytics.

R Ecosystem – Essential Bits
Download the latest R from CRAN (Comprehensive R Archive Network). https://cran.r-project.org/ (Revolution Analytics R not covered)
Get RStudio. https://www.rstudio.com/
• Essential IDE. But much more: packages, RPubs etc.
• Download R, then RStudio. RStudio is a better environment for test and debug than Azure ML!
Get a GitHub account (https://github.com/) and the GitHub shell.
• Distributed source code control system.
• Essential part of the R social network.

GitHub Lifecycle Cheat Sheet
From github.com
• Create a repository (or fork someone else’s).
From the local GitHub shell
    git clone <URL_of_repository>
    cd <repository>
    git add <files>
    git commit -a -m "some_message"
    git push

R Package Management
You’ll be doing this a lot!
To install a package at the command line:
    install.packages("ggplot2")
Or use RStudio (multiple dependency options).

R Package Install/Reference
To use an installed package at the command line:
    library("ggplot2")
RStudio offers a code hint.
You can install libraries from GitHub (user/repository):
    library("devtools")
    install_github("ramnathv/rCharts")
Older versions of install_github take user and repository as separate arguments.

R Visualisation Packages
plot
• Standard package. Easy to use but presentation ordinary.
lattice
• Enhanced package. Not very widely adopted.
ggplot2 (by Hadley Wickham) – Grammar of Graphics
• Best quality presentations yet easy to use
• Layers approach: ggplot
• Quickie version: qplot

qplot simple example

ggplot2 example inc Linear Model
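The plot from this slide did not survive in the transcript. The following is a minimal sketch of what a ggplot2 scatter plot with a fitted linear model could look like. It assumes the ggplot2 package is installed; the built-in mtcars data set and the chosen columns are assumptions for illustration, not the presenter's example.

```r
library(ggplot2)

# Scatter plot of car weight against fuel economy, built up in layers.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                   # layer 1: raw observations
  geom_smooth(method = "lm", formula = y ~ x)      # layer 2: fitted line with confidence band

# The same linear model fitted directly, to inspect its coefficients.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Printing the plot object draws it on the active graphics device.
print(p)
```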

ggplot2 … if you really want to get funky…

ggplot2 and the Boxplot
Concise way to show median, 1st/3rd quartiles, 1.5 * IQR and outliers.
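To make the numbers behind a boxplot concrete, here is a small base R sketch. It uses boxplot.stats rather than ggplot2, and the data vector is made up for illustration. Base R uses hinges, which are close to but not always identical to the 1st and 3rd quartiles.

```r
# Ten made-up observations with one obvious outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# boxplot.stats() returns what a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
bs <- boxplot.stats(x, coef = 1.5)
print(bs$stats)  # whisker ends, hinges and median
print(bs$out)    # points further than 1.5 * IQR beyond the hinges

# Draw it with base graphics; ggplot2's geom_boxplot shows a similar summary.
boxplot(x, horizontal = TRUE)
```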

Scatter plot matrix and R pairs function
Concise way to show relationships between all features.
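A short sketch of the pairs function, using the built-in iris data set (an assumed example, since the slide image is not in the transcript):

```r
# Scatter plot matrix of the four numeric iris features, coloured by species.
features <- iris[, 1:4]
pairs(features, col = as.integer(iris$Species), main = "Iris features")

# Pairwise correlations give a numeric companion to the picture.
cors <- cor(features)
print(round(cors, 2))
```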

R Data Wrangling Packages
dplyr
• Extensive function set for select/sort/filter/derived columns/group by/top n.
• Note the %>% operator to chain dplyr functions, pipeline-like.
tidyr (Hadley Wickham)
• Statisticians call cleansed data tidy data.
• Normalise/denormalise.
sqldf
• Surprisingly good SQL syntax fidelity
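As a rough illustration of a dplyr pipeline, the sketch below chains filter, derived column, group by and sort steps with %>%. It assumes the dplyr package is installed; the mtcars data set and the kpl column name are choices made for this example only.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%               # keep 4 and 6 cylinder cars
  mutate(kpl = mpg * 0.425144) %>%           # derived column: km per litre
  group_by(cyl) %>%                          # group by cylinder count
  summarise(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))                    # sort, best economy first

print(by_cyl)
```

Each step receives the result of the previous one as its first argument, which is what makes the chain read top to bottom.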

R Dynamic Report Packages
knitr
• R Markdown + embedded R code => reports. HTML/PDF/LaTeX.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.
shiny
• Interactive publishing of R-driven web pages. Client and server bits.
slidify
• Generation of slide decks from R Markdown/YAML/R.

R – Selected Language Basics

R Fundamental Data Structures
Script language (Perl/Python/Ruby) data structures
• Scalar
• Array
• Hash (key/value)
Contrast with R data structures
• Vector (a “scalar” is really a 1-element vector)
• Matrix (caveat: data of same type)
The data frame is an operational tabular structure, integral to data manipulation.
R is case sensitive everywhere! (Variables, functions etc.)
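The structures above can be sketched in a few lines of base R. All variable names here are made up for the example.

```r
# A "scalar" is just a vector of length 1.
x <- 42
length(x)

# Vectors hold one type; element names give hash-like lookup by key.
ages <- c(alice = 31, bob = 27)
ages[["bob"]]

# A matrix is a vector with dimensions, so every cell shares one type.
m <- matrix(1:6, nrow = 2)
dim(m)

# A data frame is tabular, and its columns may differ in type.
people <- data.frame(name = c("alice", "bob"), age = c(31, 27))
str(people)

# Case matters: People and people are different names.
exists("People")
```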

R Data Types
Atomic data types
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The typeof function is handy.
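A quick sketch of typeof on one value of each atomic type. Note that typeof reports real numbers as "double", while class reports them as "numeric".

```r
# One sample value per atomic type.
samples <- list("a", 1.5, 2L, 1 + 2i, TRUE)

# typeof() reports the internal storage type of each value.
types <- sapply(samples, typeof)
print(types)

# class() gives the more familiar name for real numbers.
class(1.5)
```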

R Assignment and c function
Two assignment forms (<- and =), generally equivalent.
• The <- form is most popular.
c (for combine) builds free-form vectors.
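A minimal sketch of both assignment forms and of c, with made-up variable names:

```r
a <- 5        # the usual form
b = 5         # also valid at the top level
identical(a, b)

v <- c(1, 2, 3)             # numeric vector
w <- c(v, 10, 20)           # c() flattens its arguments into one vector
mixed <- c(1, "two", TRUE)  # mixed types are coerced to one common type

print(w)
print(mixed)
```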

Reading Data and Missing Values
A number of functions to read data files (usually read.table).
• Generally into data frames.
How are values not entered handled?
• R default is NA
• This can be overridden
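The sketch below reads a small inline table and shows how the na.strings argument of read.table overrides which values count as missing. The data and the -999 sentinel are invented for the example.

```r
# Inline CSV text standing in for a data file (made-up values).
raw <- "id,temp,city
1,21.5,Sydney
2,-999,Perth
3,,Hobart"

# Treat both "NA" and the sentinel -999 as missing.
obs <- read.table(text = raw, header = TRUE, sep = ",",
                  na.strings = c("NA", "-999"))

print(obs)
is.na(obs$temp)                # blank numeric fields also become NA
mean(obs$temp, na.rm = TRUE)   # most summaries need na.rm = TRUE
```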

Looking at the data
A number of handy functions. (Factor – discrete values)
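The slide's function list is not in the transcript. The base R functions below are commonly used for a first look at a data frame, shown here on the built-in iris data set as an assumed example.

```r
str(iris)             # structure: column types and first values
head(iris, 3)         # first rows
summary(iris)         # per-column summaries
dim(iris)             # number of rows and columns

# Species is a factor, ie a column of discrete values.
levels(iris$Species)
species_counts <- table(iris$Species)
print(species_counts)
```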

R as a Functional Programming Language
In R, functions are first-class objects. This is widely used.
Eg the apply family of functions: apply, sapply, lapply
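A small sketch of functions being passed as arguments to the apply family. The names are invented for the example.

```r
square <- function(x) x^2           # a function is an ordinary object
nums <- list(a = 1:3, b = 4:6)

sums_list <- lapply(nums, sum)      # always returns a list
sums_vec  <- sapply(nums, sum)      # simplifies to a named vector when it can
squares   <- sapply(1:4, square)    # our own function passed as an argument

m <- matrix(1:6, nrow = 2)
row_totals <- apply(m, 1, sum)      # margin 1 = rows, 2 = columns

print(sums_vec)
print(squares)
print(row_totals)
```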

View command – RStudio Console
Needs no further introduction!

Data Science Principles

Some General Notes
Algorithms vs Data
• Lots of data tends to be more influential than choice of algorithm
• Data collection methodology is critical
Correlation implies Causation?
• No!
Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.
Variable Types (affects Algorithm choice)
• Continuous, eg apartment price
• Discrete, eg species of Iris. Don’t forget the R argument stringsAsFactors.
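The stringsAsFactors argument controls whether text columns become factors (discrete variables) when a data frame is built or read. Since R 4.0.0 its default is FALSE, so the sketch below sets it explicitly; the data is made up.

```r
species <- c("setosa", "virginica", "setosa")

# Same data, with and without conversion of strings to factors.
as_text   <- data.frame(species = species, stringsAsFactors = FALSE)
as_factor <- data.frame(species = species, stringsAsFactors = TRUE)

class(as_text$species)      # plain character column
class(as_factor$species)    # factor column
levels(as_factor$species)   # the discrete values
```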

Data Analysis Flowchart

Codebook and Interpretation
Codebook is what Statisticians call the document that is
• Field spec of the data
• Details about the data collection
Reference to data set
• US NOAA storm database http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Read and interpret the Codebook carefully
• Eg time-based issues: all weather events only recorded since 1/1/96
• Careful combining features, eg # fatalities + # injuries does not make sense

Machine Learning – Predictive Types
Supervised Learning
• Train model based on past results, validate with test data
• Independent variables or features as predictors
• Label or dependent variable to predict.
• Eg predict house price based on size, # rooms etc
Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News

Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model to the training set, aiming for “just right” accuracy
• Validate model against test set
Beware of
• Underfitting. Not a convincing predictor.
• Overfitting. Too much fitting of errors/outliers. Great fit of training data, rubbish for other data sets.
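The split, fit and validate steps above can be sketched in base R as follows. The data set (mtcars), the predictors and the rmse helper are assumptions chosen for illustration.

```r
set.seed(42)  # make the random split reproducible

# 70% of rows for training, the rest for testing.
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training data only.
fit <- lm(mpg ~ wt + hp, data = train)

# Validate on data the model has not seen.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$mpg, predict(fit, train))
test_rmse  <- rmse(test$mpg,  predict(fit, test))
print(c(train = train_rmse, test = test_rmse))
```

A test error far above the training error is a hint of overfitting; both being high suggests underfitting.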

Experiment Types
At a very high level.
• Regression. Fit a mathematical (often linear) model to predict continuous values.
• Classification. Predict discrete values.
• Clustering. Group data items based on similarity.
• Recommender.
• Anomaly Detection. Detect exception cases.

Feature Selection
Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Punt collinear features, those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
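A minimal sketch of stepwise elimination with the step() function, which drops features while the AIC improves. The mtcars data and the backward direction are choices made for this example.

```r
# Start with every feature as a predictor of mpg.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the feature whose removal lowers AIC most.
reduced <- step(full, direction = "backward", trace = 0)

# See which features survived.
kept <- attr(terms(reduced), "term.labels")
print(formula(reduced))
print(kept)
```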

Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations. Outlier sensitive.
• Median. Middle value
• Mode. Most common value, best for factors (categorical)
Spread of data.
• Variance is the sum of (Value – Mean) squared / # observations. Square to (a) remove the sign (b) get a better vibe of the data.
• Take the square root of the variance to get the Standard Deviation, which brings the value back to the same scale as the observations, thus commonly used.
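The definitions above can be worked through on a small made-up vector. One detail worth knowing: R's var() and sd() divide by n - 1 (sample statistics), not by n as in the definition on the slide.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean_x   <- mean(x)
median_x <- median(x)
# R has no built-in statistical mode; take the most frequent value.
mode_x <- as.numeric(names(which.max(table(x))))

# Population variance divides by n, as on the slide.
pop_var <- mean((x - mean_x)^2)
pop_sd  <- sqrt(pop_var)

# R's built-ins divide by n - 1 instead.
sample_var <- var(x)
sample_sd  <- sd(x)

print(c(mean = mean_x, median = median_x, mode = mode_x,
        pop_sd = pop_sd, sample_sd = sample_sd))
```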

Normalize Data/R scale function
Features you want to compare naturally have different scales.
• The bigger numbers will swamp small numbers in importance.
Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and standard deviation = 1.
Note Azure ML has a Normalize Data module. R has a scale function.
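A short sketch of the scale function on three mtcars columns with very different ranges (the choice of columns is an assumption for illustration):

```r
features <- mtcars[, c("mpg", "hp", "wt")]   # very different scales
scaled <- scale(features)                     # centre, then divide by sd, per column

col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The original column means are kept as an attribute, useful for un-scaling.
attr(scaled, "scaled:center")
```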

Hypothesis Testing and Confidence Intervals
The protocol for hypothesis testing.
• Hypothesis 0 (H0) is the status quo.
• Hypothesis 1 (H1) is the alternative (eg new drug).
• Aim is to reject H0 in favour of H1 (or not)
The result is generally framed within a confidence level (reported via a p value).
• Commonly use 95% (p < 0.05), a throwback to pre-computer days.
• Controversy. The Earth is Round (p < 0.05)
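As one possible illustration of the protocol, the sketch below runs a two-sample t-test on R's built-in sleep data set. The two drug groups are treated as independent samples for simplicity, which is an assumption of this example.

```r
# sleep: extra hours of sleep recorded under each of two drugs.
# H0: the two group means are equal. H1: they differ.
result <- t.test(extra ~ group, data = sleep)   # Welch two-sample t-test

p_value <- result$p.value
ci <- result$conf.int           # 95% confidence interval for the difference in means
reject_h0 <- p_value < 0.05     # the conventional threshold

print(p_value)
print(ci)
print(reject_h0)
```

Here the interval straddles zero and the p value sits above 0.05, so H0 is not rejected at the 95% level.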

Tidy Data
Described by Hadley Wickham in
• Paper - http://vita.had.co.nz/papers/tidy-data.pdf
• Video - https://vimeo.com/33727555
Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Azure ML – Quick Overview

Azure ML – Get Started
What you need. Site is https://studio.azureml.net
• Azure account. Does not have to be a trial; Machine Learning has a free tier.
• Storage account.*
* Trap. Storage account must be in the same location as ML. Australia might not be available.

Azure ML – Flowchart

Azure ML Example – Compare Two Models

Azure ML Example – Prefab Data Wrangling

Azure ML – Re-use and Monetisation
Re-use via web services.
• REST APIs
• Code snippets in C#, R and Python.
Publish said web services to Azure Marketplace.
• Fairly involved diligence process including approvals.
Sadly, both topics out of scope.

Apply SQL Transformation Module
Use SQL syntax for data wrangling, based on SQLite.
I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results
Within Azure ML, an easier alternative to the R package sqldf.

Extend ML with R

Execute R Script and I/O
Execute R Script
• Its own environment (avoid namespace collisions)
• Need to load packages
• Install new packages via zip
3 input ports
• Dataset1, Dataset2; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages
2 output ports
• Results; R data frame -> Azure table
• R Device; stdout, stderr, graphics

Template code for Execute R Script
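The slide's code image is not in the transcript. The sketch below follows the general shape of the Execute R Script template in Azure ML Studio, which maps ports with maml.mapInputPort and maml.mapOutputPort. The stub definitions at the top are hypothetical local stand-ins, added here only so the script can also be tried in RStudio; they are not part of Azure ML.

```r
# Hypothetical local stand-ins so the script also runs outside Azure ML.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort <- function(port) {
    if (port == 1) head(iris, 5) else tail(iris, 5)
  }
  maml.mapOutputPort <- function(name) invisible(get(name, envir = globalenv()))
}

# Each dataset input port arrives as an R data frame.
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)

# Files from the Script Bundle zip port are unpacked under ./src/
# source("src/my_functions.R")   # hypothetical file name

# Sample operation: stack the two inputs.
data.set <- rbind(dataset1, dataset2)

# Graphics and printed output go to the R Device port.
plot(data.set)

# Name the data frame to send to the Results output port.
maml.mapOutputPort("data.set")
```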

Execute R Script – a “real” example

Debugging R Code
What if code runs OK in RStudio but not in ML? There is no debugger as such in ML, so
• Induce an error in R code, eg refer to an uninitialised object
• Right click R script module, select View Error Log
• Right click R script module, select View Output Log
The latter has more detail.

Sample Output Log

Create Your Own R Library
Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import file.
• Visible in My Datasets in ML.

Own R Library Example
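The slide's example is not in the transcript. As a rough local sketch: a library is just a .R file of functions that gets loaded with source. In Azure ML the zip contents are unpacked under ./src/; here a temporary folder stands in for it, and the file name my_utils.R and the function to_celsius are hypothetical.

```r
# Temporary folder standing in for the unpacked zip (./src/ in Azure ML).
src_dir <- file.path(tempdir(), "src")
dir.create(src_dir, showWarnings = FALSE)

# Write a one-function library file (hypothetical name and content).
writeLines(
  "to_celsius <- function(f) (f - 32) * 5 / 9",
  file.path(src_dir, "my_utils.R")
)

# Inside Execute R Script this would be: source("src/my_utils.R")
source(file.path(src_dir, "my_utils.R"))

print(to_celsius(212))
```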

Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre-loaded R packages
• Only one output, no graphics
I/O
• Input. Training data frame
• Output. Model object.
Scripts
• Trainer script
• Scorer: uses R predict function

Sample R Model Module Code
Note most set and get functions are local to the R Model Module.
Sample training script.
Sample scoring script.

Loading R Packages into Azure ML
There are “only” 350 R Packages in Azure ML, so you’ll eventually want to use other packages.
To load an R Package into Azure ML.
• Find the package and download as zip locally
• In ML Studio, select the big “+ NEW” option bottom LHS
• Select DATASET -> FROM LOCAL FILE
• Follow the bouncing ball

Using Loaded R Packages in Azure ML
Effectively you need to install the package on each use in Execute R Script.

Demos – CA Dairy Data
Really simple example of R, plus custom library in action.

Energy Efficiency Visualisation
Use R for more fine-grained visualisation of the Energy Efficiency Regression dataset.
• Label to predict is the Heating Load column
Steps we take.
• Make Overall Height and Orientation categorical (what R calls Factors).
• Make all column headers CamelCase (remove spaces) to play nicer with R.
• Add R code to use dplyr to create derived columns for squares and cubes.
• Normalize Data for all numeric columns, transformation method MinMax. (Note: mean 0 and standard deviation 1 is what the ZScore method gives; MinMax rescales to the range 0 to 1.)
• Add R code to visualise data.

Energy Efficiency Visualisation continued
Now let’s do some data science!
• Project Columns module to punt a few columns.
• Use the Linear Regression module, solution method Ordinary Least Squares.
• Split Data module: 60% training, 40% test
• Train Model module: Linear Regression plus Training data
• Permutation Feature Importance to score model against Test data

Energy Efficiency Visualisation – the score
The relative feature importance.

Summary
Please take this presentation as a call to action.
Alnis Bajars. Email: alnis@bajars.com Twitter: @alnisb