Architecture of ML Systems – 11 Model Debugging & Serving
Matthias Boehm, Graz University of Technology, Austria
Computer Science and Biomedical Engineering, Institute of Interactive Systems and Data Science, BMVIT endowed chair for Data Management
Last update: June 28, 2019
Recap: The Data Science Lifecycle
§ Exploratory process (experimentation, refinements, ML pipelines): data integration, data cleaning, data preparation, model selection, training and hyper-parameters, validation & debugging, deployment, scoring & feedback
§ Data-centric view: application perspective, workload perspective, system perspective
§ Roles: data scientist, data/SW engineer, DevOps engineer
Agenda
§ Model Debugging and Validation
§ Model Deployment and Serving
Model Debugging and Validation
Model Debugging and Validation: Overview Model Debugging [Credit: twitter.com/tim_kraska]
§ #1 Understanding via Visualization
  § Plotting of predictions / interactions
  § Sometimes in combination with dimensionality reduction into 2D: autoencoder, PCA (principal component analysis), t-SNE (t-distributed stochastic neighbor embedding)
  § Input, intermediate, and output layers of DNNs
  § [Andrew Crotty et al: Vizdom: Interactive Analytics through Pen and Touch. PVLDB 2015] [Credit: nlml.github.io/in-raw-numpy/in-raw-numpy-t-sne/]
§ #2 Fairness, Explainability, and Validation via Constraints
  § Impose constraints like monotonicity for ensuring fairness
  § Generate succinct representations (e.g., rules) as explanations
  § Establish assertions and thresholds for automatic validation and alerts
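As an illustration of the visualization approach, a minimal sketch (not from the slides; assumes scikit-learn and matplotlib are available) that projects activations or features into 2D via PCA or t-SNE and colors points by predicted class:

```python
# Minimal sketch: project an activation/feature matrix H (n_samples x n_features)
# into 2D and color points by predicted class to spot clusters and mispredictions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_2d_embedding(H, y_pred, method="pca"):
    """Reduce H to 2D via PCA or t-SNE and scatter-plot by predicted label."""
    if method == "pca":
        Z = PCA(n_components=2).fit_transform(H)
    else:
        Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(H)
    plt.scatter(Z[:, 0], Z[:, 1], c=y_pred, s=5, cmap="tab10")
    plt.colorbar(label="predicted class")
    plt.show()

# Example usage with random data standing in for DNN activations:
H = np.random.rand(500, 64)
y_pred = np.random.randint(0, 10, size=500)
plot_2d_embedding(H, y_pred, method="pca")
```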
Model Debugging and Validation: Basic Model-Specific Statistics
§ Regression Statistics
  § Average response and stddev, average residuals and stddev of residuals
  § R2 (coefficient of determination) with and without bias, etc.
§ Classification Statistics
  § Classical: recall, precision, F1-score
  § Visual: confusion matrix (correct vs predicted classes) to understand performance w.r.t. individual classes
  § Example: MNIST confusion matrix of correct vs predicted digit labels [confusion matrix figure omitted]
  § Mispredictions might also be visualized via dimensionality reduction
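A minimal sketch of these classification statistics, assuming scikit-learn and small illustrative label vectors:

```python
# Compute classical classification statistics and the confusion matrix.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # correct labels
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]   # predicted labels

print(confusion_matrix(y_true, y_pred))              # rows: correct, cols: predicted
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
```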
Model Debugging and Validation: Understanding Other Basic Issues
§ Overfitting / Imbalance
  § Compare train and test performance
  § Algorithm-specific techniques: regularization, pruning, loss, etc.
§ Data Leakage
  § Example: time-shifted external time series data (e.g., weather)
  § Compare performance in the train/test vs production setting
§ Covariate Shift
  § Distribution of training/test data differs from production data
  § Reasons: out-of-domain prediction, sample selection bias
  § Examples: NLP, speech recognition, face/age recognition
§ Concept Drift
  § Gradual change of statistical properties (e.g., mean, variance)
  § Requires re-training, parametric approaches for deciding when to retrain (a naive drift check is sketched below)
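A minimal sketch (not from the slides) of a naive concept-drift check that flags re-training when the mean or standard deviation of a monitored feature shifts by more than a relative threshold compared to the training-time statistics:

```python
import numpy as np

def drift_detected(train_stats, recent_batch, rel_threshold=0.2):
    """train_stats: (mean, std) from training data; recent_batch: 1D array of recent values."""
    mu0, sd0 = train_stats
    mu1, sd1 = recent_batch.mean(), recent_batch.std()
    return (abs(mu1 - mu0) > rel_threshold * abs(mu0) or
            abs(sd1 - sd0) > rel_threshold * sd0)

train_stats = (5.0, 2.0)
recent = np.random.normal(6.5, 2.0, size=1000)   # simulated mean shift in production
print(drift_detected(train_stats, recent))        # likely True -> trigger re-training
```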
Model Debugging and Validation: Occlusion-Based Explanations
§ Occlusion Explanations [Matthew D. Zeiler, Rob Fergus: Visualizing and Understanding Convolutional Networks. ECCV 2014]
  § Slide a gray square over the inputs
  § Measure how feature maps and classifier output change
§ Incremental Computation of Occlusion Explanations [Supun Nakandala, Arun Kumar, Yannis Papakonstantinou: Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations, SIGMOD 2019]
  § View the CNN as a white-box operator graph and operators as views
  § Materialize intermediate tensors and apply incremental view maintenance
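A minimal occlusion sketch (not the SIGMOD 2019 incremental algorithm; `predict` is an assumed callable that returns the target-class probability for an image):

```python
import numpy as np

def occlusion_map(img, predict, patch=8, stride=8, gray=0.5):
    """img: HxW(xC) array; predict: callable img -> scalar class probability."""
    H, W = img.shape[:2]
    heat = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    base = predict(img)                               # unoccluded confidence
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = img.copy()
            occluded[y:y+patch, x:x+patch] = gray     # slide the gray square
            heat[i, j] = base - predict(occluded)     # drop in confidence
    return heat                                       # occlusion sensitivity map
```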
Model Debugging and Validation: Saliency Maps
§ Saliency Map [Karen Simonyan, Andrea Vedaldi, Andrew Zisserman: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ICLR Workshop 2014]
  § Given an input image and a specific class
  § Compute the saliency map of class derivatives w.r.t. the input image
  § Approximated with a linear function (first-order Taylor expansion)
§ Application: unsupervised image segmentation
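A minimal saliency-map sketch; the Keras `model`, the float input batch `img` of shape (1, H, W, C), and the target class index `cls` are assumptions, not from the slides:

```python
import tensorflow as tf

def saliency_map(model, img, cls):
    """Gradient of the target-class score w.r.t. the input pixels."""
    img = tf.convert_to_tensor(img)            # img must be a float tensor
    with tf.GradientTape() as tape:
        tape.watch(img)
        score = model(img)[0, cls]             # class score for the target class
    grads = tape.gradient(score, img)          # d score / d input pixels
    return tf.reduce_max(tf.abs(grads), axis=-1)[0]   # max magnitude over channels -> HxW map
```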
Model Debugging and Validation: Debugging Bias and Fairness
§ Validate and ensure fairness with regard to sensitive features (unbiased)
  § Use occlusion and saliency maps to characterize and compare groups
§ Enforcing Fairness [Maya Gupta: How Do We Make AI Fair? SysML 2019]
  § Use constraints to enforce certain properties (e.g., monotonicity, smoothness)
  § Example: late payment vs credit score
Model Debugging and Validation: Explanation Tables [Kareem El Gebaly, Parag Agrawal, Lukasz Golab, Flip Korn, Divesh Srivastava: Interpretable and Informative Explanations of Outcomes. PVLDB 2014]
§ Motivation
  § Generate succinct decision rules from data
  § Problem: decision tree rules do not overlap by definition
  § Example: athlete's exercise log, goal met 7 vs 7
§ Explanation Tables
  § Find the smallest explanation table subject to a maximum KL divergence threshold
  § Greedy and sampling algorithms (a simplified greedy sketch follows below)
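A highly simplified greedy sketch, only loosely inspired by explanation tables: it scores single-attribute patterns by how much their goal rate deviates from the overall rate (weighted by support) instead of solving the paper's KL-divergence formulation; the data frame and column names are illustrative assumptions:

```python
import pandas as pd

def greedy_patterns(df, goal="goal_met", k=3):
    """Pick k single-attribute patterns with the largest support-weighted
    deviation from the overall goal rate, removing covered rows after each pick."""
    overall = df[goal].mean()
    rows = []
    for _ in range(k):
        if df.empty:
            break
        best = None
        for col in df.columns.drop(goal):
            for val, grp in df.groupby(col):
                score = abs(grp[goal].mean() - overall) * len(grp)
                if best is None or score > best[0]:
                    best = (score, col, val, grp[goal].mean(), len(grp))
        _, col, val, rate, cnt = best
        rows.append({"pattern": f"{col} = {val}", "goal_rate": rate, "count": cnt})
        df = df[df[col] != val]          # drop rows covered by the chosen pattern
    return pd.DataFrame(rows)

# Example usage on a tiny exercise log:
log = pd.DataFrame({
    "day":      ["Mon", "Mon", "Tue", "Wed", "Wed", "Fri"],
    "activity": ["run", "gym", "run", "gym", "run", "gym"],
    "goal_met": [1, 1, 0, 1, 0, 0],
})
print(greedy_patterns(log, k=2))
```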
Model Debugging and Validation: Model Assertions [Daniel Kang, Deepti Raghavan, Peter Bailis, Matei Zaharia: Model Assertions for Debugging Machine Learning, NIPS Workshop on ML Systems, 2018]
§ Motivation
  § ML models might fail in complex ways that are not captured in the loss function
  § Inspired by assertions in software development: model assertions via Python rules
  § Example: flickering of object detection (see the sketch below)
§ Assertion Use Cases
  § #1 Runtime monitoring (collect statistics on incorrect behavior)
  § #2 Corrective action (trigger corrections at runtime, but how in retrospect?)
  § #3 Active learning (decide which difficult data points to give to the user)
  § #4 Weak supervision (propose alternative labels and use them for retraining)
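A minimal sketch (not the paper's API) of a model assertion written as a plain Python rule over consecutive object-detection outputs, flagging "flickering" (an object that appears, disappears, and reappears across adjacent frames):

```python
def flickering(detections_per_frame, label="car"):
    """detections_per_frame: list of sets of detected labels, one set per video frame."""
    flagged = []
    for t in range(1, len(detections_per_frame) - 1):
        prev, cur, nxt = detections_per_frame[t-1:t+2]
        if label in prev and label not in cur and label in nxt:
            flagged.append(t)           # frame t violates the assertion
    return flagged

frames = [{"car"}, {"car"}, set(), {"car"}, {"car"}]
print(flickering(frames))               # -> [2]
```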
Model Debugging and Validation: Continuous Integration
§ System architecture of ease.ml/ci [Cedric Renggli, Bojan Karlaš, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, Ce Zhang: Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment, SysML 2019]
Model Deployment and Serving
Model Deployment and Serving: Model Exchange Formats
§ Definition: Deployed Model
  § #1 Trained ML model (weight/parameter matrix)
  § #2 Trained weights AND operator graph / entire ML pipeline, especially for DNNs (many weight/bias tensors, hyper-parameters, etc.)
§ Recap: Data Exchange Formats (model + meta data)
  § General-purpose formats: CSV, JSON, XML, Protobuf
  § Sparse matrix formats: Matrix Market, libsvm
  § Scientific formats: NetCDF, HDF5
  § ML-system-specific binary formats (e.g., SystemML binary block)
§ Problem: ML System Landscape
  § Different languages and frameworks, including versions
  § Lack of standardization; DSLs for ML are a wild west
Model Deployment and Serving: Model Exchange Formats, cont. [Nick Pentreath: Open Standards for Machine Learning Deployment, bbuzz 2019]
§ Why Open Standards?
  § Open source allows inspection but no control
  § Open governance necessary for an open standard
  § Cons: needs adoption, moves slowly
§ #1 Predictive Model Markup Language (PMML)
  § Model exchange format in XML, created by the Data Mining Group in 1997
  § Packages model weights, hyper-parameters, and a limited set of algorithms
§ #2 Portable Format for Analytics (PFA)
  § Attempt to fix limitations of PMML, created by the Data Mining Group
  § JSON and Avro exchange format
  § Minimal functional math language for arbitrary custom models
  § Scoring in JVM, Python, R
Model Deployment and Serving: Model Exchange Formats, cont.
§ #3 Open Neural Network Exchange (ONNX)
  § Model exchange format (data and operator graph) via Protobuf
  § First Facebook and Microsoft, then IBM, Amazon; PyTorch, MXNet
  § Focused on deep learning and tensor operations
  § ONNX-ML: support for traditional ML algorithms
  § Scoring engine: https://github.com/Microsoft/onnxruntime (a scoring sketch follows below)
  § Cons: very low level (e.g., fused ops, cuDNN centric), very young
§ TensorFlow SavedModel
  § TensorFlow-specific exchange format for model and operator graph
  § Freezes input weights and literals, enabling additional optimizations (e.g., constant folding, quantization)
  § Cloud providers may not be interested in open exchange standards
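A minimal scoring sketch with ONNX Runtime; the model file name, input shape, and dtype are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")        # load an exported ONNX model
input_name = sess.get_inputs()[0].name           # name of the model's first input
X = np.random.rand(4, 10).astype(np.float32)     # example batch (shape is model-specific)
outputs = sess.run(None, {input_name: X})        # None -> return all model outputs
print(outputs[0])
```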
Model Deployment and Serving: ML Systems for Serving
§ #1 Embedded ML Serving
  § TensorFlow Lite and new language bindings (small footprint, dedicated HW acceleration, APIs, and models: MobileNet, SqueezeNet)
  § SystemML JMLC (Java ML Connector)
§ #2 ML Serving Services
  § Motivation: complex DNN models run on dedicated HW (example: Google Translate, 140B words/day, 82K GPUs in 2016)
  § RPC/REST interface for applications (a REST request sketch follows below)
  § TensorFlow Serving: configurable serving with batching [Christopher Olston et al: TensorFlow-Serving: Flexible, High-Performance ML Serving. NIPS ML Systems 2017]
  § Clipper: decoupled multi-framework scoring, with batching and result caching [Daniel Crankshaw et al: Clipper: A Low-Latency Online Prediction Serving System. NSDI 2017]
  § Pretzel: batching and multi-model optimizations in ML.NET [Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]
  § Rafiki: optimization for accuracy under latency constraints, plus batching and multi-model optimizations [Wei Wang et al: Rafiki: Machine Learning as an Analytics Service System. PVLDB 2018]
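A minimal sketch of REST scoring against a serving service; it assumes a TensorFlow Serving instance on localhost:8501 hosting a model named "mymodel" and the `requests` library (all assumptions, not from the slides):

```python
import json
import requests

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}   # one example row (shape is model-specific)
resp = requests.post(
    "http://localhost:8501/v1/models/mymodel:predict",   # TF Serving REST predict API
    data=json.dumps(payload))
print(resp.json()["predictions"])
```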
Model Deployment and Serving: Example SystemML JMLC
§ Example Scenario
  § Feature extraction (e.g., doc structure, sentences, tokenization, n-grams) from token and sentence inputs, producing a feature matrix ΔX
  § Sentence classification of ΔX with a trained model M
§ Challenges
  § Scoring as part of a larger end-to-end pipeline
  § External parallelization without materialization
  § Simple synchronous scoring
  § Data size (tiny ΔX, huge model M)
  § Seamless integration & model consistency
§ Requirements: embedded scoring, latency ⇒ throughput, minimize overhead per ΔX, token inputs & outputs
Model Deployment and Serving: Example SystemML JMLC, cont.
§ Background: Frame
  § Abstract data type with schema (boolean, int, double, string)
  § Column-wise block layout
  § Local/distributed operations: e.g., indexing, append, transform
  § Distributed representation: ? x ncol(F) blocks (shuffle-free conversion of csv / datasets)
§ Data Preparation via Transform
  § Training: transformencode on input frames FX, FY produces matrices X, Y and meta data frames MX, MY; training on X, Y produces the model B
  § Scoring: transformapply on ΔFX (using MX) produces ΔX; predictions ΔŶ are decoded via transformdecode (using MY) into the output frame ΔFŶ
Model Deployment and Serving: Example SystemML JMLC, cont.
§ Motivation: embedded scoring, latency ⇒ throughput, minimize overhead per ΔX
§ Typical compiler/runtime overheads
  § Script parsing and config: ~100 ms
  § Validation, compile, IPA: ~10 ms
  § HOP DAG (re-)compile: ~1 ms
  § Instruction execute: <0.1 μs
§ Example (single-node, no evictions, no recompile, no multi-threading):

  Connection conn = new Connection();
  PreparedScript pscript = conn.prepareScript(
      getScriptAsString("glm-predict-extended.dml"),
      new String[]{"FX", "MX", "MY", "B"}, new String[]{"FY"});

  // setup static inputs (for reuse across calls)
  pscript.setFrame("MX", MX, true);
  pscript.setFrame("MY", MY, true);
  pscript.setMatrix("B", B, true);

  for( Document d : documents ) {
    // ... input pipeline
    FrameBlock FX = ...;
    pscript.setFrame("FX", FX);
    // execute precompiled script (many times)
    FrameBlock FY = pscript.executeScript().getFrame("FY");
    // ... remaining pipeline
  }
Model Deployment and Serving: Serving Optimizations – Batching
§ Recap: Model Batching (see 08 Data Access)
  § One-pass evaluation of multiple configurations (k models over the same data X)
  § EL, CV, feature selection, hyper-parameter tuning
  § E.g., TUPAQ [SoCC'16], Columbus [SIGMOD'14]
  § O(m*n) read, O(m*n*k) compute, with m >> n >> k
§ Data Batching [Clipper @ NSDI'17]
  § Batching to utilize the HW more efficiently under SLAs
  § Use case: multiple users use the same model (wait, collect user requests X1, X2, X3, and merge)
  § Adaptive: additive increase, multiplicative decrease (sketched below)
  § Same complexity for matrix-vector scoring, but benefits for more complex models
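A minimal sketch (not Clipper's implementation) of adaptive batch sizing with additive increase / multiplicative decrease (AIMD) under a latency SLA:

```python
def aimd_batch_size(batch_size, observed_latency_ms, sla_ms,
                    add=4, mult=0.5, max_size=512):
    """Grow the batch additively while under the SLA, shrink multiplicatively when violated."""
    if observed_latency_ms <= sla_ms:
        return min(batch_size + add, max_size)
    return max(int(batch_size * mult), 1)

bs = 32
for lat in [40, 45, 60, 80, 120, 70]:   # simulated per-batch latencies, SLA = 100 ms
    bs = aimd_batch_size(bs, lat, sla_ms=100)
    print(bs)                            # 36, 40, 44, 48, 24, 28
```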
Model Deployment and Serving: Serving Optimizations – MQO
§ Result Caching
  § Establish a function cache for X → Y (memoization of deterministic function evaluation; see the sketch below)
§ Multi-Model Optimizations [Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]
  § Same input fed into multiple partially redundant model evaluations
  § Common subexpression elimination between prediction programs
  § Done during compilation or at runtime
  § In PRETZEL, programs are compiled into physical stages and registered with the runtime, plus caching for stages (decided based on hashing the inputs)
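A minimal result-caching sketch (not from PRETZEL): memoizing deterministic prediction calls on hashable inputs with a bounded function cache:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def predict_cached(features: tuple) -> float:
    # stand-in for an expensive, deterministic model evaluation
    return sum(f * w for f, w in zip(features, (0.2, 0.5, 0.3)))

print(predict_cached((1.0, 2.0, 3.0)))   # computed
print(predict_cached((1.0, 2.0, 3.0)))   # served from the cache
print(predict_cached.cache_info())        # hit/miss statistics
```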
Model Deployment and Serving: Serving Optimizations – Quantization
§ Lossy compression via ultra-low precision or fixed-point arithmetic (see 08 Data Access, Lossy and Lossless Compression)
§ Quantization for Model Scoring
  § Usually much smaller data types (e.g., UINT8)
  § Quantization of model weights, and sometimes also activations: reduced memory requirements and better latency / throughput (SIMD)
§ Example (post-training quantization with TensorFlow Lite):

  import tensorflow as tf
  converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
  converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
  tflite_quant_model = converter.convert()

  [Credit: https://www.tensorflow.org/lite/performance/post_training_quantization]
Model Deployment and Serving: Serving Optimizations – Specialization
§ NoScope Architecture [Daniel Kang et al: NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale. PVLDB 2017]
  § Baseline: YOLOv2 on 1 GPU per video camera @ 30 fps
  § Optimizer to find filters
§ #1 Model Specialization
  § Given a query and the baseline model
  § Train a shallow NN (based on AlexNet) on the output of the baseline model
  § Short-circuit if the prediction has high confidence
§ #2 Difference Detection
  § Compute difference to a reference image / earlier frame
  § Short-circuit with the reference label if there is no significant difference (sketched below)
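A minimal sketch (not NoScope's detector) of difference-based short-circuiting: return the previous label if the mean absolute pixel difference to a reference frame stays below a threshold, and only otherwise run the expensive model:

```python
import numpy as np

def classify_frame(frame, ref_frame, ref_label, expensive_model, diff_threshold=5.0):
    """frame, ref_frame: image arrays; expensive_model: callable frame -> label."""
    diff = np.mean(np.abs(frame.astype(np.float32) - ref_frame.astype(np.float32)))
    if diff < diff_threshold:
        return ref_label                  # short-circuit: nothing changed significantly
    return expensive_model(frame)         # run the full CNN only when needed
```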
Summary and Conclusions
§ Summary of 11 Model Debugging & Serving
  § Model Debugging and Validation
  § Model Deployment and Serving
§ Finalize Programming Projects
§ Oral Exam
  § Email to m.boehm@tugraz.at for an appointment
  § If you're in a remote setting, Skype is fine