CS 4803 / 7643: Deep Learning. Topics: – (Finish) Analytical Gradients – Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD. Dhruv Batra, Georgia Tech
Administrivia • HW 1 Reminder – Due: 09/26, 11:55 pm • Fuller schedule + future reading posted – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/ – Caveat: subject to change; please don't make irreversible decisions based on this. (C) Dhruv Batra 2
Recap from last time (C) Dhruv Batra 3
Strategy: Follow the slope. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Gradient Descent. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Stochastic Gradient Descent (SGD). Full sum expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 common. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
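The update on this slide is easy to write down concretely; a minimal NumPy sketch, where `loss_and_grad` is a hypothetical function returning the minibatch loss and its gradient (not from the slides):

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, lr=1e-3, batch_size=64, steps=1000):
    """Minimal SGD sketch: estimate the full-sum gradient on a minibatch."""
    N = X.shape[0]
    for _ in range(steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, dW = loss_and_grad(W, X[idx], y[idx])           # loss and gradient on the batch
        W = W - lr * dW                                       # step downhill
    return W
```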
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 7
(C) Dhruv Batra. Figure Credit: Baydin et al., https://arxiv.org/abs/1502.05767 8
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 9
(C) Dhruv Batra. By Brnbrnz (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)] 10
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347. W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322. Gradient dW: [-2.5, ?, ?, ?, …], since (1.25322 - 1.25347)/0.0001 = -2.5. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347. W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353. Gradient dW: [-2.5, 0.6, ?, ?, …], since (1.25353 - 1.25347)/0.0001 = 0.6. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
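The per-dimension recipe on these two slides translates directly to code; a sketch of the one-sided finite-difference scheme the slides use (centered differences are more accurate in practice):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-4):
    """One-sided finite differences, one coordinate at a time (as on the slides)."""
    grad = np.zeros_like(W)
    fW = f(W)                        # loss at the current W, e.g. 1.25347
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = W[i]
        W[i] = old + h               # perturb one dimension by h
        grad[i] = (f(W) - fW) / h    # e.g. (1.25322 - 1.25347)/0.0001 = -2.5
        W[i] = old                   # restore before moving on
        it.iternext()
    return grad
```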
Numerical vs Analytic Gradients. Numerical gradient: slow :(, approximate :(, easy to write :). Analytic gradient: fast :), exact :), error-prone :(. In practice: derive the analytic gradient, then check your implementation with the numerical gradient. This is called a gradient check. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
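A gradient check then compares the two; a minimal sketch using max relative error (the tolerance is a common rule of thumb, not from the slides):

```python
import numpy as np

def gradient_check(analytic_grad, numeric_grad, tol=1e-4):
    """Compare analytic and numerical gradients via max relative error."""
    num = np.abs(analytic_grad - numeric_grad)
    den = np.maximum(np.abs(analytic_grad) + np.abs(numeric_grad), 1e-12)
    rel_error = np.max(num / den)    # ~1e-7 is great; > 1e-2 usually means a bug
    return rel_error < tol, rel_error
```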
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 14
Matrix/Vector Derivatives Notation (C) Dhruv Batra 15
Vector/Matrix Derivatives Notation (C) Dhruv Batra 16
Vector Derivative Example (C) Dhruv Batra 17
Vector Derivative Example (C) Dhruv Batra 18
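The worked examples on these slides did not survive extraction; a representative pair in the same notation (my examples, not necessarily the slides'):

```latex
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}
  \quad\Rightarrow\quad \frac{\partial f}{\partial \mathbf{x}} = \mathbf{w},
\qquad
f(\mathbf{x}) = \mathbf{x}^\top A\, \mathbf{x}
  \quad\Rightarrow\quad \frac{\partial f}{\partial \mathbf{x}} = (A + A^\top)\,\mathbf{x}.
```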
Extension to Tensors (C) Dhruv Batra 19
Plan for Today • (Finish) Analytical Gradients • Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD – Patterns in backprop (C) Dhruv Batra 20
Chain Rule: Composite Functions (C) Dhruv Batra 21
Chain Rule: Scalar Case (C) Dhruv Batra 22
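In symbols, the scalar case is the familiar statement (a standard formulation; the slide's own figure did not survive extraction):

```latex
% Scalar chain rule for a composite z = f(g(x)):
\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}, \qquad y = g(x),\ z = f(y).
% Example: z = \sin(x^2) gives dz/dx = \cos(x^2)\cdot 2x.
```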
Chain Rule: Vector Case (C) Dhruv Batra 23
Chain Rule: Jacobian view (C) Dhruv Batra 24
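The Jacobian view states the vector chain rule as a matrix product (standard form; the slide figure did not survive extraction):

```latex
% For y = g(x), z = f(y) with x \in \mathbb{R}^n, y \in \mathbb{R}^m, z \in \mathbb{R}^k:
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\underbrace{J_{f\circ g}}_{k\times n}
  = \underbrace{J_f}_{k\times m}\;\underbrace{J_g}_{m\times n}.
```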
Chain Rule: Graphical view (C) Dhruv Batra 25
(C) Dhruv Batra 26
Logistic Regression Derivatives (C) Dhruv Batra 27
Logistic Regression Derivatives (C) Dhruv Batra 28
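The derivation these two slides walk through is not recoverable from the text; the standard version, assuming a sigmoid output with binary cross-entropy loss:

```latex
% Logistic regression: \hat{y} = \sigma(w^\top x), \ \sigma(z) = 1/(1+e^{-z}),
% L = -\left[\,y\log\hat{y} + (1-y)\log(1-\hat{y})\,\right].
% Using \sigma'(z) = \sigma(z)\,(1-\sigma(z)) and the chain rule:
\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})},
\qquad
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial \hat{y}}\,
    \frac{\partial \hat{y}}{\partial z}\,
    \frac{\partial z}{\partial w}
  = (\hat{y}-y)\,x.
```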
Chain Rule: Cascaded (C) Dhruv Batra 29
Chain Rule: How should we multiply? (C) Dhruv Batra 30
Convolutional network (AlexNet): input image → weights → loss. Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Neural Turing Machine [figure: a large computational graph, from input to loss]. Figure reproduced with permission from a Twitter post by Andrej Karpathy. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 33
Plan for Today • (Finish) Analytical Gradients • Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD (C) Dhruv Batra 34
Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra 35
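In a modern framework this whole pipeline is one differentiable graph; a minimal PyTorch sketch (PyTorch is not named on the slide, it is used here purely as an illustration):

```python
import torch

x = torch.randn(4)                          # data: an input to the graph
W = torch.randn(3, 4, requires_grad=True)   # parameters: leaf nodes of the graph
s = W @ x                                   # scores: the graph is recorded as we compute
loss = s.pow(2).sum()                       # a toy scalar loss: the graph's output
loss.backward()                             # reverse-mode AD over the recorded graph
print(W.grad.shape)                         # dLoss/dW, same shape as W
```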
Computational Graph [figure: x and W feed a * node producing s (scores); s feeds the hinge loss; W feeds the regularizer R; hinge loss and R sum (+) into the total loss L]. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 37
Directed Acyclic Graphs (DAGs) • Exactly what the name suggests – Directed edges – No (directed) cycles – Underlying undirected cycles okay (C) Dhruv Batra 38
Directed Acyclic Graphs (DAGs) • Concept – Topological Ordering (C) Dhruv Batra 39
Directed Acyclic Graphs (DAGs) (C) Dhruv Batra 40
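Topological ordering is what makes scheduling the graph possible: every node runs only after all of its inputs. A minimal DFS-based sketch for a DAG given as an adjacency list (a hypothetical helper, not from the slides):

```python
def topological_order(graph):
    """Return the nodes of a DAG so every edge goes from earlier to later."""
    order, visited = [], set()

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        for child in graph.get(node, []):  # visit all downstream nodes first
            dfs(child)
        order.append(node)                 # emit after all descendants

    for node in graph:
        dfs(node)
    return order[::-1]                     # reverse post-order = topological order

# Usage on the example graph from the next slide:
g = {'x1': ['sin', 'mul'], 'x2': ['mul'], 'sin': ['add'], 'mul': ['add'], 'add': []}
print(topological_order(g))
```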
Computational Graphs • Notation (C) Dhruv Batra 41
Example [figure: computational graph with inputs x1 and x2, and sin(·), *, + nodes] (C) Dhruv Batra 42
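Reading the node labels, the graph combines x1 and x2 through sin(·), a product, and a sum. Assuming it computes f = sin(x1) + x1·x2 (an inference from the visible nodes, not confirmed by the slide), the forward pass is:

```python
import math

def forward(x1, x2):
    """Forward pass of the example graph (assumed: f = sin(x1) + x1*x2)."""
    a = math.sin(x1)   # sin node
    b = x1 * x2        # * node
    f = a + b          # + node
    return f
```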
HW 0 (C) Dhruv Batra 43
HW 0 Submission by Samyak Datta (C) Dhruv Batra 44
Logistic Regression as a Cascade. Given a library of simple functions, compose them into a complicated function. (C) Dhruv Batra. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 45
Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra 46
Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra 47
Forward mode AD [figure: derivative propagation through a module g] 48
Reverse mode AD [figure: derivative propagation through a module g] 49
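The key computations the two modes perform at each module g can be stated compactly (standard AD terminology; the slide figures did not survive extraction):

```latex
% For a module y = g(x) with Jacobian J = \partial y / \partial x:
% Forward mode pushes a tangent forward (Jacobian-vector product):
\dot{y} = J\,\dot{x},
% Reverse mode pulls an adjoint backward (vector-Jacobian product):
\bar{x} = J^\top\,\bar{y}.
```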
Example: Forward mode AD [figure: the same x1/x2 graph with sin(·), *, + nodes; slides 50–54 step through the computation] (C) Dhruv Batra
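Forward mode carries a (value, derivative) pair through every node, dual-number style. A sketch on the same assumed function f = sin(x1) + x1·x2, seeding x1's tangent to get ∂f/∂x1; note it takes one pass per input:

```python
import math

def forward_mode(x1, x2, dx1=1.0, dx2=0.0):
    """Forward-mode AD on f = sin(x1) + x1*x2 (assumed example function)."""
    a, da = math.sin(x1), math.cos(x1) * dx1   # sin node and its tangent
    b, db = x1 * x2, dx1 * x2 + x1 * dx2       # product rule at the * node
    f, df = a + b, da + db                     # sum rule at the + node
    return f, df

print(forward_mode(2.0, 3.0))            # seed (1, 0): df/dx1 = cos(2) + 3
print(forward_mode(2.0, 3.0, 0.0, 1.0))  # seed (0, 1): df/dx2 = x1 = 2
```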
Example: Reverse mode AD [figure: the same x1/x2 graph; slides 55–57 step through the backward sweep] (C) Dhruv Batra
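Reverse mode runs the graph forward once, then sweeps backward accumulating adjoints; a sketch on the same assumed function, where one backward sweep yields the gradient with respect to all inputs at once:

```python
import math

def reverse_mode(x1, x2):
    """Reverse-mode AD on f = sin(x1) + x1*x2 (assumed example function)."""
    # Forward pass: store the intermediates the backward pass needs.
    a = math.sin(x1)
    b = x1 * x2
    f = a + b
    # Backward pass: seed df/df = 1, apply local Jacobians in reverse order.
    df = 1.0
    da, db = df, df              # + node routes the adjoint to both inputs
    dx1 = da * math.cos(x1)      # back through the sin node
    dx1 += db * x2               # * node: x1 also feeds the product
    dx2 = db * x1
    return f, (dx1, dx2)

print(reverse_mode(2.0, 3.0))    # gradient (cos(2) + 3, 2) in a single sweep
```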
Forward Pass vs Forward mode AD vs Reverse Mode AD [figure: the same computational graph shown three times, side by side] (C) Dhruv Batra 58
Forward mode vs Reverse Mode • What are the differences? [figure: the graph shown twice, side by side] (C) Dhruv Batra 59
Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra 60
Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? • Which one is more memory efficient (less storage)? – Forward or backward? (C) Dhruv Batra 61
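The standard answers to the questions these slides pose (textbook AD facts, not taken from the slide figures):

```latex
% For f : \mathbb{R}^n \to \mathbb{R}^m, each AD pass costs a constant factor
% times one evaluation of f.
% Forward mode: one pass per input direction  -> n passes for the full Jacobian.
% Reverse mode: one pass per output direction -> m passes, but it must store the
%   forward intermediates, so it trades memory for speed.
% Deep learning has m = 1 (a scalar loss) and huge n (parameters), hence reverse
% mode, i.e. backprop:
\underbrace{\nabla_\theta L}_{\text{one reverse pass}}
\quad\text{vs.}\quad
\underbrace{n\ \text{forward passes}}_{\text{one per input direction}}
```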