Apache TVM and ONNX: What can ONNX do for DL Compilers (and vice versa)?

Jason Knight, CPO, OctoML. Automate efficient AI/ML ops through a unified software foundation. jknight@octoml.ai

Agenda
● Intro to TVM
● Cool results (TVM + ONNX)
● How does it work?
● OctoML’s wishlist for ONNX
… and in 10 minutes … Let’s go!

An exploding ecosystem makes deployment painful
● Rapidly evolving ML software ecosystem
● Cambrian explosion of HW backends

TVM: Bridging the gap as a DL compiler and runtime
Open-source optimization framework for deep learning. ML-based optimizations, with backends for x86, NVIDIA/CUDA, AMD, ARM, MIPS, RISC-V, etc.
● Reduce model time-to-market
● Build your model once, run anywhere
● Cut capital and operational ML costs
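A minimal sketch of the “build once, run anywhere” flow: import an ONNX model into TVM’s Relay IR, compile it for a chosen target, and run it. The model file name, input name, and input shape below are placeholders, and the API follows recent TVM releases (older releases expose the runtime as graph_runtime rather than graph_executor).

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load the ONNX model and tell Relay the input name/shape (model-dependent).
onnx_model = onnx.load("resnet50.onnx")          # placeholder file name
shape_dict = {"data": (1, 3, 224, 224)}          # placeholder input name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for one backend; swap the target string for CUDA, ARM, etc.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module on the local CPU.
dev = tvm.cpu()
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()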

TVM is an emerging industry-standard ML stack
● Open source: ~428+ contributors from industry and academia
● Every “Alexa” wake-up today, across all devices, uses a model optimized with TVM
● “[TVM enabled] real-time on mobile CPUs for free… We are excited about the performance TVM achieves.” More than 85x speed-up for a speech recognition model
● Bing query understanding: 112 ms (TensorFlow) -> 34 ms (TVM); QnA bot: 73 ms -> 28 ms (CPU), 10.1 ms -> 5.5 ms (GPU)
● “TVM is key to ML access on Hexagon” - Jeff Gehlhaar, VP Technology

The power of TVM + ONNX (AKA results)

Performance: TVM on x86
[Chart: PyTorch vs. TensorFlow vs. AutoTVM 2.0]
20-core Intel Platinum 8269CY, fp32; performance data from https://arxiv.org/pdf/2006.06762.pdf

Performance: TVM on GPU
[Chart: PyTorch vs. TensorFlow vs. TensorRT-TF vs. AutoTVM 2.0]
NVIDIA V100, fp32; performance data from https://arxiv.org/pdf/2006.06762.pdf

Performance: TVM on ARM
[Charts: TensorFlow Lite vs. AutoTVM 2.0]
Four-core Cortex-A72 @ 1.5 GHz, fp32 (internal data); four-core Cortex-A53 @ 1.4 GHz, fp32 (https://arxiv.org/pdf/2006.06762.pdf)

Case Study: 50% reduction in cloud NLP inference costs (2x lower cost on an AMD EPYC CPU)

Best of both worlds

Not enough time!
● TensorCore performance (better than cuBLAS)
● Classical ML (better than XGBoost and RAPIDS)
● µTVM for TinyML: ML for microcontrollers
● Int{8, 4, 3, 2, 1} and posit quantization support
● ML in your browser: WebGPU and WASM as TVM backends
● … and more!

How does it work?

AutoTVM Overview
Automatically adapts to the hardware type by learning
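As a rough illustration of that learning-based tuning loop (not necessarily OctoML’s exact pipeline): AutoTVM extracts tunable tasks from a Relay module, measures candidate schedules on the actual device, and fits an XGBoost cost model to guide the search. The mod, params, and target variables are assumed to come from an import step like the ONNX sketch earlier; the trial budget and log file name are placeholders.

import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

# Extract tunable tasks (conv2d, dense, ...) from the Relay module.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Build candidate kernels locally and time them on the local device.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")        # learned cost model guides the search
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),  # placeholder budget
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("autotvm_tuning.log")],
    )

# Compile using the best schedules found during tuning.
with autotvm.apply_history_best("autotvm_tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)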

Auto-scheduling Overview
Widens the search space even further than AutoTVM 1.0...
Ansor: Generating High-Performance Tensor Programs for Deep Learning. Zheng L. et al., 2020. https://arxiv.org/pdf/2006.06762.pdf
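A comparable sketch with TVM’s auto-scheduler (the Ansor work cited above): instead of tuning within hand-written schedule templates, it generates and searches over whole schedules automatically. As before, mod, params, and target are assumed to come from an earlier import step, and the trial count and log file name are placeholders.

import tvm
from tvm import auto_scheduler, relay

# Extract tasks and their relative weights (how often each subgraph appears).
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# One shared budget of measurement trials is allocated across all tasks.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,                        # placeholder budget
    measure_callbacks=[auto_scheduler.RecordToFile("ansor_tuning.json")],
)
tuner.tune(tune_option)

# Compile with the best programs found by the auto-scheduler.
with auto_scheduler.ApplyHistoryBest("ansor_tuning.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)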

OctoML’s wishlist for ONNX

We wish ONNX had…
● Even broader op coverage (e.g. EmbeddingBag)
● Broader non-ML (but adjacent) support:
  ○ More classical ML
  ○ GCNN/DGL
  ○ Graph workloads (GraphBLAS, Metagraph)
And on the “pie-in-the-sky” list:
● Framework integrations
  ○ PyTorch, so we don’t have to deal with TorchScript
  ○ An MLIR dialect so we can easily plug into TensorFlow (for runtime JIT)
● Quantization-aware standardization
  ○ E.g. canonicalization of models coming out of quantization-aware-training pipelines
Thanks!