- Slides: 63
Using R in Azure Machine Learning
Take Azure ML to the next level with R
Alnis Bajars. alnis@bajars.com @alnisb
Agenda
Using R in Azure Machine Learning
• R – Ecosystem Fundamentals
• R – Selected Language Elements
• Data Science Principles (including some lingo)
• Azure ML Quick Overview
• Azure ML + R
Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow-up.
• No mathematical proofs. Hopefully.
• You will know what you don’t know about Data Science.
• Set expectations and realities about Data Science.
These slides are meant to be used!
References
Coursera – R Programming (Johns Hopkins)
• Quality of explanations is variable.
• But the practical assignments are good, and deadline driven.
Safari Books Online
• Video: Introduction to Data Science with R. Garrett Grolemund (RStudio).
• Most published books on R
Hadley Wickham (@hadleywickham) – Modern Godfather of R
• Chief Data Scientist at RStudio
• Author of influential R packages
Azure Machine Learning
• Intro video at Microsoft Virtual Academy
References continued
edX – Azure ML with R/Python
• A number of demos are driven by this content.
R – Ecosystem Fundamentals
R vs Python
R Pros
• More mature data science support (20+ years), purpose built
• More established ML support
• T-SQL integration in SQL Server 2016 (out of scope)
Python Pros
• Best all-round scripting language. Data science support improving.
• Better 64-bit support and scalability?
R performance and scalability: don’t forget Revolution Analytics
R Ecosystem – Essential Bits
Download the latest R from CRAN, the Comprehensive R Archive Network. https://cran.r-project.org/ (Revolution Analytics R not covered)
Get RStudio. https://www.rstudio.com/
• Essential IDE. But much more: packages, RPubs etc.
• Download R first, then RStudio.
RStudio is a better environment for test and debug than Azure ML!
Get a GitHub account at https://github.com/ and the GitHub shell.
• Distributed source code control system.
• Essential part of the R social network.
GitHub Lifecycle Cheat Sheet
From github.com
• Create a repository (or fork someone else’s).
From the local GitHub shell.
git clone <URL_of_repository>
cd <repository>
git add <files>
git commit -a -m "some_message"
git push
R Package Management
You’ll be doing this a lot!
To install a package at the command line.
install.packages("ggplot2")
Or use RStudio (multiple dependency options).
R Package Install/Reference
To use an installed package, at the command line.
library("ggplot2")
RStudio gives a code hint.
You can install packages from GitHub (user/repository).
library("devtools")
install_github("ramnathv/rCharts")
Older versions of install_github take user and repository as separate arguments.
R Visualisation Packages
plot
• Standard (base graphics). Easy to use but presentation ordinary.
lattice
• Enhanced package. Not very widely adopted.
ggplot2 (by Hadley Wickham) – Grammar of Graphics
• Best quality presentations yet easy to use
• Layers approach: ggplot
• Quickie version: qplot
qplot simple example
ggplot2 example including a Linear Model
ggplot2 … if you really want to get funky…
ggplot2 and the Boxplot
Concise way to show the median, 1st/3rd quartiles, 1.5 * IQR and outliers.
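The quantities a boxplot draws can be checked numerically. A minimal sketch using base R's boxplot.stats on a made-up vector (the data is illustrative only and not from the talk):

```r
# Made-up sample with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# coef = 1.5 is the usual whisker multiplier (1.5 * IQR)
bs <- boxplot.stats(x, coef = 1.5)

# stats holds: lower whisker, lower hinge (about the 1st quartile), median,
# upper hinge (about the 3rd quartile), upper whisker
five <- bs$stats
outliers <- bs$out   # points further than 1.5 * IQR beyond the hinges

print(five)          # 1.0 3.0 5.5 8.0 9.0
print(outliers)      # 50
```

The value 50 lies beyond the upper hinge plus 1.5 * IQR (8 + 7.5), so it is reported as an outlier and the upper whisker stops at 9.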
Scatter plot matrix and R pairs function
Concise way to show relationships between all features.
R Data Wrangling Packages
dplyr
• Extensive function set for select/sort/filter/derived columns/group by/top n.
• Note the %>% operator to chain dplyr functions, pipeline style.
tidyr (Hadley Wickham)
• Statisticians call cleansed data tidy data.
• Normalise/denormalise.
sqldf
• Surprisingly good SQL syntax fidelity
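To make the dplyr pipeline idea concrete, here is a small sketch on the built-in mtcars data. It assumes the dplyr package is installed; the derived column kpl and the conversion factor are our own illustration.

```r
library(dplyr)  # assumes the dplyr package is installed

# Filter rows, add a derived column, group, summarise and sort, chained with %>%
result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  mutate(kpl = mpg * 0.425) %>%                    # approx km per litre
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), cars = n()) %>%
  arrange(desc(mean_kpl))

print(result)
```

Each step takes the data frame produced by the previous one, which is why the chain reads top to bottom like a SQL query plan.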
R Dynamic Report Packages
knitr
• R Markdown + embedded R code => reports. HTML/PDF/LaTeX.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.
shiny
• Interactive publishing of R driven web pages. Client and server bits.
slidify
• Generation of slide decks from R Markdown/YAML/R.
R – Selected Language Basics
R Fundamental Data Structures
Script language (Perl/Python/Ruby) data structures.
• Scalar
• Array
• Hash (key/value)
Contrast with R data structures.
• Vector (a “scalar” is really a 1 element vector)
• Matrix (caveat: data of same type)
The data frame is an operational tabular structure, integral to data manipulation.
R is case sensitive everywhere! (Variables, functions etc.)
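The structures above can be sketched in a few lines of base R. The named list is our own stand-in for a hash and is not mentioned on the slide.

```r
v  <- c(10, 20, 30)            # vector
s  <- 5                        # a "scalar" is a vector of length 1
m  <- matrix(1:6, nrow = 2)    # matrix: every element has the same type
df <- data.frame(name = c("a", "b"), score = c(1.5, 2.5))  # columns may differ in type
h  <- list(key1 = "value1", key2 = 42)  # named list as a rough key/value store

length(s)          # 1
dim(m)             # 2 3
sapply(df, class)  # class of each data frame column
h[["key2"]]        # 42

# Case sensitivity: score and Score would be two different variables
score <- 1
```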
R Data Types
Atomic data types.
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The typeof function is handy.
R Assignment and c function
Two different forms (<- and =), generally equivalent.
• The <- form is the most popular.
c (for combine) builds free form vectors.
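A short sketch of the atomic types, both assignment forms and c, using only base R. Note that typeof reports numeric values as "double".

```r
x <- 3.14          # the popular arrow form
y = 3.14           # the equals form; same effect at top level

typeof("hello")    # "character"
typeof(x)          # "double" (the numeric type)
typeof(2L)         # "integer"
typeof(1 + 2i)     # "complex"
typeof(TRUE)       # "logical"

v <- c(1, 2, 3)            # c combines values into a vector
mixed <- c(1, "a", TRUE)   # mixing types coerces everything to character
typeof(mixed)              # "character"
```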
Reading Data and Missing Values
A number of functions read data files (usually read.table).
• Generally into data frames.
How are values not entered handled?
• R default is NA
• This can be overridden (eg the na.strings argument of read.table)
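A minimal sketch of missing value handling on read. Inline text stands in for a file, and the -999 sentinel is a made-up convention for illustration.

```r
# Inline CSV text stands in for a file; -999 is a made-up missing-value code
csv_text <- "id,height
1,170
2,
3,-999"

default <- read.csv(text = csv_text)
custom  <- read.csv(text = csv_text, na.strings = c("NA", "", "-999"))

is.na(default$height)              # FALSE TRUE FALSE: the blank field becomes NA
is.na(custom$height)               # FALSE TRUE TRUE: -999 is now treated as NA too
mean(custom$height, na.rm = TRUE)  # 170
```

read.csv is a thin wrapper around read.table with comma-separated defaults, so the same na.strings argument applies to both.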
Looking at the data A number of handy functions. (Factor – discrete values)
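The slide's screenshot is not available here, so the following is our own pick of commonly used inspection functions, shown on the built-in iris data.

```r
data(iris)

dim(iris)        # rows and columns
head(iris, 3)    # first few rows
str(iris)        # compact structure: type of each column
summary(iris)    # per-column summary statistics

# Species is a factor: a column of discrete values
class(iris$Species)
levels(iris$Species)
table(iris$Species)
```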
R as a Functional Programming Language
In R, functions are 1st class objects. This is widely used.
Eg the apply family of functions: apply, sapply, lapply
View command – RStudio Console
Needs no further introduction!
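A small sketch of functions being passed as arguments, using the apply family from base R. The helper apply_twice is hypothetical and only there to show a function taking another function.

```r
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)              # over rows: 9 12
col_max  <- apply(m, 2, max)              # over columns: 2 4 6
squares  <- sapply(1:3, function(x) x^2)  # simplified to a vector: 1 4 9
lens     <- lapply(list(a = 1:3, b = letters), length)  # always returns a list

# Functions are ordinary objects: pass one in, call it inside
apply_twice <- function(f, x) f(f(x))
apply_twice(sqrt, 16)                     # 2
```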
Data Science Principles
Some General Notes
Algorithms vs Data
• Lots of data tends to be more influential than choice of algorithm
• Data collection methodology is critical
Correlation implies Causation?
• No!
Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.
Variable Types (affects Algorithm choice)
• Continuous, eg apartment price
• Discrete, eg species of Iris. Don’t forget the R argument stringsAsFactors
Data Analysis Flowchart
Codebook and Interpretation
Codebook is what statisticians call the document that holds
• Field spec of the data
• Details about the data collection
Reference to data set
• US NOAA storm database http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Read and interpret the Codebook carefully
• Eg time based issues, all weather events only recorded since 1/1/96
• Be careful combining features, eg # fatalities + # injuries does not make sense
Machine Learning – Predictive Types
Supervised Learning
• Train model based on past results, validate with test data
• Independent variables or features as predictors
• Label or dependent variable to predict.
• Eg predict house price based on size, # rooms etc
Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News
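As a sketch of the unsupervised case, base R's kmeans can cluster the iris measurements without looking at the species label. Choosing 3 clusters is our assumption; the cross-tabulation at the end uses the label only to inspect the result.

```r
set.seed(42)                      # kmeans starts from random centres

features <- iris[, 1:4]           # the label column (Species) is left out
fit <- kmeans(features, centers = 3, nstart = 10)

# Compare discovered clusters with the known species, for inspection only
table(cluster = fit$cluster, species = iris$Species)
```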
Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model against the training set, aiming for close to just right accuracy
• Validate model against test set
Beware of
• Underfitting. Not a convincing predictor.
• Overfitting. Too much fitting of errors/outliers. Great fit of training data, rubbish for other data sets.
Experiment Types
At a very high level.
• Regression. Fit a mathematical (often linear) model to predict continuous values.
• Classification. Predict discrete values.
• Clustering. Group data items based on similarity.
• Recommender.
• Anomaly Detection. Detect exception cases.
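The split, fit and validate cycle can be sketched in base R. The mtcars data, the predictors wt and hp, and the RMSE metric are our choices for illustration.

```r
set.seed(123)                               # make the random split repeatable

n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))   # 70% of rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

model <- lm(mpg ~ wt + hp, data = train)    # fit on training data only

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$mpg, predict(model, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(model, newdata = test))

c(train = train_rmse, test = test_rmse)
```

A test error far above the training error is the usual symptom of overfitting; both errors being high points to underfitting.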
Feature Selection
Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Punt collinear features and those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
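A minimal sketch of stepwise elimination with step(), starting from a linear model that uses every mtcars column as a predictor. The dataset and the backward direction are our choices.

```r
full <- lm(mpg ~ ., data = mtcars)     # start with all features

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                       # the features that survived
c(full = length(coef(full)), reduced = length(coef(reduced)))
```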
Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations. Outlier sensitive.
• Median. Middle value
• Mode. Most common value, best for factors (categorical)
Spread of data.
• Variance is the sum of (Value – Mean) squared / # observations. Square to (a) remove the sign (b) give a better vibe of the data.
• Take the square root of the variance to get the Standard Deviation, which brings the value onto the same scale as the observations, thus commonly used.
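These definitions can be checked on a small made-up vector. Base R has no built-in statistical mode, so stat_mode below is a hypothetical helper. Note also that R's var and sd divide by n - 1 (sample variance) rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up observations

mean(x)     # 5
median(x)   # 4.5

# Hypothetical helper: most common value via a frequency table
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)   # 4

pop_var <- mean((x - mean(x))^2)   # divide by n, as on the slide: 4
pop_sd  <- sqrt(pop_var)           # 2

var(x)   # R divides by n - 1, so this is 32 / 7
sd(x)
```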
Normalize Data/R scale function
Features you want to compare naturally have different scales.
• The bigger numbers will swamp small numbers in importance.
Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and standard deviation = 1.
Note Azure ML has a Normalize Data module. R has a scale function.
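A short sketch of scale on three mtcars columns with very different ranges (the column choice is ours), plus a manual check of what it computes.

```r
features <- mtcars[, c("mpg", "hp", "wt")]   # very different scales

scaled <- scale(features)        # centre to mean 0, scale to sd 1, per column

round(colMeans(scaled), 10)      # all 0
apply(scaled, 2, sd)             # all 1

# The same thing by hand for one column
manual_hp <- (features$hp - mean(features$hp)) / sd(features$hp)
all.equal(as.numeric(scaled[, "hp"]), manual_hp)   # TRUE
```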
Hypothesis Testing and Confidence Intervals
The protocol for hypothesis testing.
• Hypothesis 0 (H0) is the status quo.
• Hypothesis 1 (H1) is the alternative (eg new drug).
• Aim is to reject H0 in favour of H1 (or not)
The result is generally framed within a confidence level, judged by the p value.
• Commonly use 95% (reject H0 when p < 0.05), a throwback to pre-computer days.
• Controversy. The Earth is Round (p < .05)
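The protocol can be sketched with base R's t.test on the built-in sleep data (extra hours of sleep under two drugs). The dataset and the choice of a Welch two-sample test are ours, purely for illustration.

```r
# H0: both groups have the same mean extra sleep; H1: the means differ
# (the sleep data is paired by design; it is treated as two independent
# samples here only to keep the sketch short)
res <- t.test(extra ~ group, data = sleep)

alpha <- 0.05                     # 95% confidence level
res$p.value                       # probability of data this extreme under H0
res$conf.int                      # 95% confidence interval for the difference

reject_h0 <- res$p.value < alpha
reject_h0
```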
Tidy Data
Described by Hadley Wickham in
• Paper - http://vita.had.co.nz/papers/tidy-data.pdf
• Video - https://vimeo.com/33727555
Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Azure ML – Quick Overview
Azure ML – Get Started
What you need. Site is https://studio.azureml.net
• Azure account. It does not have to be a trial, Machine Learning has a free tier.
• Storage account.
* Trap. Storage account must be in the same location as ML. Australia might not be available.
Azure ML – Flowchart
Azure ML Example – Compare Two Models
Azure ML Example – Prefab Data Wrangling
Azure ML – Re-use and Monetisation
Re-use via web services.
• REST APIs
• Code snippets in C#, R and Python.
Publish said web services to Azure Marketplace.
• Fairly involved diligence process including approvals.
Sadly, both topics out of scope.
Apply SQL Transformation Module
Use SQL syntax for data wrangling, based on SQLite.
I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results
Within Azure ML, an easier alternative to the R package sqldf.
Extend ML with R
Execute R Script and I/O
Execute R Script
• Its own environment (avoid namespace collisions)
• Need to load packages
• Install new packages via zip
3 input ports
• Dataset1 and Dataset2; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages
2 output ports
• Results; R data frame -> Azure table
• R Device; stdout, stderr, graphics
Template code for Execute R Script
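The slide's screenshot is not reproduced here. The sketch below follows the general shape of an Execute R Script body: maml.mapInputPort and maml.mapOutputPort are the port-mapping functions supplied by the module, while the local stand-ins in the first block are hypothetical helpers, added only so the sketch can be tried in RStudio.

```r
# Hypothetical local stand-ins so the sketch runs outside Azure ML;
# inside Execute R Script the module supplies the real maml functions
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort <- function(port) {
    if (port == 1) head(mtcars, 5) else tail(mtcars, 5)
  }
  maml.mapOutputPort <- function(name) {
    invisible(get(name, envir = parent.frame()))
  }
}

dataset1 <- maml.mapInputPort(1)   # input port 1 arrives as a data frame
dataset2 <- maml.mapInputPort(2)   # input port 2 arrives as a data frame

# Contents of the optional zip port are unpacked under src/
# source("src/yourfile.R")

data.set <- rbind(dataset1, dataset2)   # sample operation: stack the inputs

print(summary(data.set$mpg))       # stdout ends up on the R Device port
# plot(data.set)                   # so would graphics

maml.mapOutputPort("data.set")     # pass the NAME of the result data frame
```

Note that maml.mapOutputPort takes the variable name as a string, not the data frame itself.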
Execute R Script – a “real” example
Debugging R Code
What if code runs ok in RStudio but not in ML? There is no debugger as such in ML, so
• Induce an error in R code, eg refer to an uninitialised object
• Right click R script module, select View Error Log
• Right click R script module, select View Output Log
The latter has more detail.
Sample Output Log
Create Your Own R Library
Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import the file.
• Visible in My Datasets in ML.
Own R Library Example
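The original example is not reproduced here. As a rough sketch of the idea, a library is just a .R file of functions that gets sourced. The file name mylib.R and the function to_celsius are hypothetical, and a temp directory stands in for the src/ folder where Azure ML unpacks the zip.

```r
# A temp directory plays the role of the unzipped bundle folder (src/ in ML)
src_dir <- file.path(tempdir(), "src")
dir.create(src_dir, showWarnings = FALSE)

# Hypothetical library file holding one function
lib_file <- file.path(src_dir, "mylib.R")
writeLines("to_celsius <- function(f) (f - 32) * 5 / 9", lib_file)

# In Execute R Script this line would be source("src/mylib.R")
source(lib_file)

to_celsius(212)   # 100
```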
Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre-loaded R packages
• Only one output, no graphics
I/O
• Input. Training data frame
• Output. Model object.
Scripts
• Trainer script
• Scorer: uses R predict function
Sample R Model Module Code
Note most set and get functions are local to the R Model Module.
Sample training script.
Sample scoring script.
Loading R Packages into Azure ML
There are “only” 350 R packages in Azure ML. You’ll eventually want to use other packages.
To load an R package into Azure ML.
• Find the package and download it as a zip locally
• In ML Studio, select the big “+ NEW” option bottom LHS
• Select DATASET -> FROM LOCAL FILE
• Follow the bouncing ball
Using Loaded R Packages in Azure ML
Effectively need to install on each use in Execute R Script.
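The slide's scripts are not reproduced here. The sketch below assumes the module's convention of passing data in a variable named dataset, expecting the trainer to leave its result in model and the scorer to return a data frame named scores; treat those names, the lm model and the ScoredLabels column as assumptions to verify against the module documentation.

```r
# Stand-in for the training data frame the module would pass in
dataset <- mtcars

# Trainer script: fit and leave the result in a variable named model
model <- lm(mpg ~ wt + hp, data = dataset)

# Scorer script: use predict and return a data frame named scores
# (the column name ScoredLabels is hypothetical)
scores <- data.frame(ScoredLabels = predict(model, newdata = dataset))

head(scores)
```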
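A hedged sketch of what installing on each use can look like inside Execute R Script, assuming the uploaded zip is connected to the script bundle port and so appears under src/. The package name and zip path are hypothetical; the guard lets the snippet run harmlessly when no bundle is present.

```r
pkg_zip  <- "src/mypackage.zip"   # hypothetical zip inside the script bundle
pkg_name <- "mypackage"           # hypothetical package name

if (file.exists(pkg_zip)) {
  # repos = NULL installs from the local file instead of CRAN
  install.packages(pkg_zip, lib = ".", repos = NULL)
  library(pkg_name, lib.loc = ".", character.only = TRUE)
} else {
  message("No bundle found at ", pkg_zip, "; nothing installed")
}
```

Because each Execute R Script module runs in its own environment, this install step has to be repeated in every module that needs the package.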
Demos – CA Dairy Data
Really simple example of R, plus custom library in action.
Energy Efficiency Visualisation
Use R for more fine grained visualisation of the Energy Efficiency Regression dataset.
• Label to predict is the Heating Load column
Steps we take.
• Make Overall Height and Orientation categorical (what R calls Factors).
• Make all column headers CamelCase (remove spaces) to play nicer with R.
• Add R code to use dplyr to create derived columns for squares and cubes.
• Normalize Data for all numeric columns, transformation method MinMax. Note MinMax rescales each column to the range 0 to 1; mean 0 and standard deviation 1 is what the ZScore method gives.
• Add R code to visualise data.
Energy Efficiency Visualisation continued
Now let’s do some data science!
• Project Columns module to punt a few columns.
• Use Linear Regression, solution method Ordinary Least Squares.
• Split Data module: 60% training, 40% test
• Train Model module: Linear Regression plus Training data
• Permutation Feature Importance to score model against Test data
Energy Efficiency Visualisation – the score
The relative feature importance.
Summary
Please take this presentation as a call to action.
Alnis Bajars. Email: alnis@bajars.com Twitter: @alnisb