Apache TVM and ONNX: What can ONNX do for DL Compilers (and vice versa)?

















Apache TVM and ONNX: What can ONNX do for DL Compilers (and vice versa)? Jason Knight, CPO, OctoML. Automate efficient AI/ML ops through a unified software foundation. jknight@octoml.ai
Agenda: Intro to TVM · Cool results (TVM + ONNX) · How does it work? · OctoML's wishlist for ONNX … and in 10 minutes … Let's go!
An exploding ecosystem makes deployment painful: a rapidly evolving ML software ecosystem meets a Cambrian explosion of HW backends.
TVM: Bridging the gap as a DL compiler and runtime. An open-source optimization framework for deep learning: reduce model time-to-market, build your model once and run it anywhere, and cut capital and operational ML costs. ML-based optimizations, with backends for x86, NVIDIA/CUDA, AMD, ARM, MIPS, RISC-V, etc.
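As a concrete illustration (not from the slides), here is a minimal sketch of the standard TVM flow for compiling and running an ONNX model via the Relay ONNX frontend. The model file name, input name, and shape are hypothetical placeholders:

```python
# Minimal sketch: compile an ONNX model with TVM's Relay frontend.
# "resnet50.onnx" and the input name/shape below are assumptions.
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("resnet50.onnx")           # hypothetical model file
shape_dict = {"input": (1, 3, 224, 224)}          # input name/shape assumed

# Import the ONNX graph into Relay, TVM's high-level IR.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for a CPU target ("llvm"); swap in "cuda" etc. for other backends.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module through the graph executor.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```

The same compiled artifact can be exported and loaded from C++, mobile, or other runtimes, which is the "build once, run anywhere" claim above.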
TVM is an emerging industry-standard ML stack. Every "Alexa" wake-up today across all devices uses a model optimized with TVM. Open source, with ~428+ contributors from industry and academia. "[TVM enabled] real-time on mobile CPUs for free... We are excited about the performance TVM achieves." More than 85x speed-up for a speech recognition model. Bing query understanding: 112 ms (TensorFlow) -> 34 ms (TVM). QnA bot: 73 ms -> 28 ms (CPU), 10.1 ms -> 5.5 ms (GPU). "TVM is key to ML Access on Hexagon" - Jeff Gehlhaar, VP Technology
The power of TVM + ONNX (AKA Results)
Performance: TVM on x86. [Chart: PyTorch vs. TensorFlow vs. AutoTVM 2.0; 20-core Intel Platinum 8269CY, fp32; performance data from https://arxiv.org/pdf/2006.06762.pdf]
Performance: TVM on GPU. [Chart: PyTorch vs. TensorFlow vs. TensorRT (TF) vs. AutoTVM 2.0; NVIDIA V100, fp32; performance data from https://arxiv.org/pdf/2006.06762.pdf]
Performance: TVM on ARM. [Charts: TensorFlow Lite vs. AutoTVM 2.0; four-core Cortex-A72 @ 1.5 GHz, fp32 (internal data), and four-core Cortex-A53 @ 1.4 GHz, fp32 (https://arxiv.org/pdf/2006.06762.pdf)]
Case Study: 50% reduction in cloud NLP inference costs; 2x lower cost on an AMD EPYC CPU.
Best of both worlds
Not enough time!
● TensorCore performance (better than cuBLAS)
● Classical ML (better than XGBoost and RAPIDS)
● µTVM for TinyML - ML for microcontrollers
● Int{8, 4, 3, 2, 1} and posit quantization support
● ML in your browser - WebGPU and WASM as TVM backends
● … and more!
How does it work?
AutoTVM Overview: automatically adapt to the hardware type by learning.
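For context beyond the slide, a hedged sketch of the classic AutoTVM loop: extract tunable tasks from a Relay module, search each task's schedule space guided by a learned (XGBoost) cost model, then rebuild with the best measured configs. It assumes `mod`, `params`, and `target` from the compile example above; the trial count and log file name are arbitrary:

```python
# Sketch of the AutoTVM flow. Assumes `mod`, `params`, `target` already
# exist (see the earlier compile example); trial count and log name are
# arbitrary choices, not values from the talk.
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task)  # XGBoost cost model guides the search
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("autotvm.log")],
    )

# Rebuild, picking the best measured schedule for each task from the log.
with autotvm.apply_history_best("autotvm.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```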
Auto-scheduling Overview: widens the search space even further than AutoTVM 1.0... (Ansor: Generating High-Performance Tensor Programs for Deep Learning, Zheng L. et al., 2020, https://arxiv.org/pdf/2006.06762.pdf)
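A corresponding sketch of TVM's auto-scheduler (the Ansor-style search), which generates schedules automatically rather than tuning hand-written AutoTVM templates. Same assumptions as above for `mod`, `params`, and `target`; the trial budget and log name are arbitrary:

```python
# Sketch of the auto-scheduling (Ansor-style) flow. Assumes `mod`,
# `params`, `target` as before; trial budget and log name are arbitrary.
import tvm
from tvm import auto_scheduler, relay

tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # total trials shared across all tasks
    measure_callbacks=[auto_scheduler.RecordToFile("ansor.json")],
)
tuner.tune(tune_option)

# Build using the best schedules found during the search.
with auto_scheduler.ApplyHistoryBest("ansor.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```

The key difference from AutoTVM is that no per-operator schedule template is required, which is what "widens the search space" refers to.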
OctoML's wishlist for ONNX
We wish ONNX had…
● Even broader op coverage (e.g. EmbeddingBag)
● Broader non-ML (but adjacent) support:
  ○ More classical ML
  ○ GCNN/DGL
  ○ Graph workloads (GraphBLAS, Metagraph)
● Quantization-aware standardization
  ○ E.g. canonicalization of models coming out of quantization-aware-training pipelines
And on the "pie-in-the-sky" list:
● Framework integrations
  ○ PyTorch: so we don't have to deal with TorchScript
  ○ MLIR dialect so we can easily plug into TensorFlow (for runtime JIT)
Thanks!