Session objectives Key takeaway 1 Key takeaway 2
Yin-Yang picture by DonkeyHotey is licensed under CC BY
Data Sources: Apps, Sensors and devices. Information Management: Data Factory, Data Catalog, Event Hubs. Big Data Stores: Data Lake Store, SQL Data Warehouse. Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (R Server and Spark), Stream Analytics. Intelligence: Cognitive Services, Bot Framework, Cortana. Dashboards & Visualizations: Power BI. People: Web, Mobile, Bots. Apps. Automated Systems. Data, Intelligence, Action.
logistic regression, linear models, basic statistics, hypothesis testing, k-means, decision trees; PageRank, collaborative filtering, graph processing, SVD, PCA, Bayesian models, …; deep learning over various types of networks
Retail, Financial services, Healthcare, Manufacturing: loyalty programs; customer acquisition; pricing strategy; supply chain management; customer churn; fraud detection; risk & compliance; cross-sell & upsell; personalization; bill collection; operational efficiency; patient demographics; pay for performance; demand forecasting; pricing strategy; supply chain optimization; predictive maintenance; remote monitoring; product recommendations; intelligent search; routing; robotics; ad placement; predictive maintenance; image, video recognition; sentiment analysis; text comprehension; natural language processing; robotics; bots; augmented reality; predictive maintenance
Server
What is R?
Language Platform: • The most popular statistical programming language • A data visualization tool • Open source
Community: • 2.5+M users • Taught in most universities • New and recent grads use it • Thriving user groups worldwide
Ecosystem: • 8000+ contributed packages • Rich application & platform integration
Data Flows Overwhelm Open Source R Not enterprise ready
100% compatible with open source R Wide range of scalable and distributed R functions Ability to parallelize any R function Enterprise-grade offering
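One common way to make a statistic scale is to compute a small summary per chunk of data and then merge the summaries. The Python sketch below illustrates that general idea only. It is not R Server's API: the function names, the chunk size, the worker count, and the sample values are all made up for illustration, and Python is used just to keep the sketch self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_stats(chunk):
    """Per-chunk summary: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Merge two partial summaries by adding them component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def mean_and_variance(values, chunk_size=4, workers=2):
    """Population mean and variance computed from per-chunk summaries."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunked(values, chunk_size)))
    n, s, ss = reduce(combine, partials)
    mean = s / n
    return mean, ss / n - mean * mean


# Hypothetical sample values, not taken from the slides.
data = [float(x) for x in range(1, 11)]
mean, var = mean_and_variance(data)
```

Because `combine` is associative, chunks can be summarized in any order or on different workers and merged afterwards. The sum-of-squares formula is kept for brevity; it can lose precision on large values, so a production implementation would likely use a more stable update.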
"http://www.ats.ucla.edu/stat/data/binary.csv"
“/data/binary.csv”
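Only the data path survives from these slides; the code that used it is not preserved in this text. Assuming the file feeds a logistic regression (the deck later benchmarks rxLogit), the following Python sketch shows what such a fit computes. It is an illustration under that assumption, not the presenter's code: the data is a made-up toy sample rather than the CSV above, and `fit_logit`, `predict`, the learning rate, and the epoch count are hypothetical choices.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted probability of the positive class; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit_logit(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by batch gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - predict(w, x)          # residual on the probability scale
            grad[0] += err                   # intercept gradient
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj      # feature gradients
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


# Made-up, non-separable toy data: one standardized score per applicant.
scores = [[-2.0], [-1.5], [-1.0], [-0.5], [0.0],
          [0.5], [1.0], [1.5], [2.0], [-0.2]]
admit = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

weights = fit_logit(scores, admit)
p_low = predict(weights, [-2.0])
p_high = predict(weights, [2.0])
```

A higher score should yield a higher predicted probability, so `p_high` exceeds `p_low`. The per-row gradient terms are simple sums, which is what makes this kind of model fit amenable to the chunked, distributed evaluation the deck describes.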
[Diagram labels: R; R Server with multiple R instances]
[Architecture diagram labels: R Tools for VS, RStudio, IntelliJ IDEA, Edge node, YARN, R Server, Default Queue, DeployR, Head node, Jupyter notebooks, Livy server, BI Tools, Thrift server, Thrift Queue]
[Chart: rxLogit on a 100 node HDInsight cluster (airline dataset). Y-axis: elapsed time (seconds), 0 to 2000. X-axis: billions of rows, 0 to 25. Annotation: 2.2 TB. Preliminary results.]
[Chart: Times faster than MapReduce. Y-axis: 0 to 40, annotated 36X. X-axis: millions of rows, 0 to 120. Preliminary results.]
[Chart: Times faster than Local CC. Y-axis: 0 to 8, annotated 7X. X-axis: billions of rows, 0.05 to 1.05.]
ETL: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
Predictive Statistics: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, logit, probit), and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Predictions/scoring for models; Residuals for all models
Variable Selection: Stepwise Regression
Machine Learning: Decision Trees; Decision Forests; Gradient Boosted Decision Trees; Naïve Bayes
Clustering: K-Means
Sampling: Subsample (observations & variables); Random Sampling
Simulation: Simulation (e.g. Monte Carlo); Parallel Random Number Generation
Custom Parallelization: rxDataStep; rxExec; PEMA-R API
Data Sources
Basic statistics Collaborative filtering Clustering Classification and regression Dimensionality reduction Frequent pattern mining Simulation
[Chart: Popularity of languages for data science, KDnuggets poll (2014). Y-axis: 0% to 60%. Languages shown: R, SAS, Python, SQL, Java.]
R Server contributes Spark contributes
2012 – AlexNet; 2014 – Image description; 2016 – AlphaGo
Inception-v3
GPU-enabled and distributed
Deep Learning library   Language                    GPU   Distributed mode
Theano                  Python                      Yes   N/A
Torch                   Lua/C++                     Yes   N/A
Caffe                   Python/C++                  Yes   N/A
DeepLearning4J          Java/Scala                  Yes   Spark
TensorFlow              Python/C++                  Yes   Native
CNTK                    C++                         Yes   Native
MXNet                   Python/R/C++ and Julia      Yes   Native
https://github.com/deeplearning4j/dl4j-spark-cdh5-examples.git
R language Python language Scala/Java Server + R Ecosystem + Python Ecosystem + Spark Ecosystem
Session objective(s): Key takeaway 1 R and Spark are better together Key takeaway 2
Sessions
R Server on HDInsight documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-r-server-overview/
Learning deep learning: Andrew Ng's Machine Learning course on Coursera; Geoff Hinton's Neural Networks course on Coursera