CS 4803 / 7643: Deep Learning
Topics:
– (Finish) Analytical Gradients
– Automatic Differentiation
– Computational Graphs
– Forward mode vs Reverse mode AD
Dhruv Batra, Georgia Tech

Administrivia
• HW1 Reminder
  – Due: 09/26, 11:55pm
• Fuller schedule + future reading posted
  – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/
  – Caveat: subject to change; please don’t make irreversible decisions based on this.
(C) Dhruv Batra 2

Recap from last time (C) Dhruv Batra 3

Strategy: Follow the slope. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Gradient Descent. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Stochastic Gradient Descent (SGD)
The full sum is expensive when N is large! Approximate it using a minibatch of examples; 32 / 64 / 128 are common sizes. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
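A minimal sketch of the update the slide describes (not course-provided code; `loss_fn` is an assumed helper returning the minibatch loss and its gradient):

```python
import numpy as np

def sgd(W, X, y, loss_fn, lr=1e-3, batch_size=64, steps=1000):
    N = X.shape[0]
    for _ in range(steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, dW = loss_fn(W, X[idx], y[idx])                 # loss + gradient on the batch only
        W = W - lr * dW                                       # step opposite the gradient
    return W
```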

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
(C) Dhruv Batra 7

(C) Dhruv Batra. Figure Credit: Baydin et al., https://arxiv.org/abs/1502.05767 8

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
(C) Dhruv Batra 9

(C) Dhruv Batra. By Brnbrnz (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)] 10

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322
gradient dW: [-2.5, ?, ?, ?, ?, …], since (1.25322 - 1.25347)/0.0001 = -2.5
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353
gradient dW: [-2.5, 0.6, ?, ?, ?, …], since (1.25353 - 1.25347)/0.0001 = 0.6
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
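The same dimension-by-dimension procedure in code (a sketch, assuming a `loss_fn(W)` helper; not from the slides):

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-4):
    dW = np.zeros_like(W)
    base = loss_fn(W)                         # loss at the current W
    for i in range(W.size):
        W.flat[i] += h                        # perturb one dimension at a time
        dW.flat[i] = (loss_fn(W) - base) / h  # e.g., (1.25322 - 1.25347)/0.0001 = -2.5
        W.flat[i] -= h                        # undo the perturbation
    return dW
```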

Numerical vs Analytic Gradients
Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive the analytic gradient, check your implementation with the numerical gradient. This is called a gradient check.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
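A gradient check might look like the following sketch, reusing `numerical_gradient` from above (`analytic_grad` and `loss_fn` are assumed helpers; the 1e-7 figure is a common heuristic, not a course requirement):

```python
import numpy as np

def gradient_check(loss_fn, analytic_grad, W, h=1e-4):
    num = numerical_gradient(loss_fn, W, h)   # from the sketch above
    ana = analytic_grad(W)
    # relative error is robust to the scale of the gradients
    rel_err = np.abs(num - ana) / np.maximum(np.abs(num) + np.abs(ana), 1e-8)
    return rel_err.max()  # roughly 1e-7 or smaller usually means the analytic gradient is right
```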

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
(C) Dhruv Batra 14

Matrix/Vector Derivatives Notation (C) Dhruv Batra 15

Vector/Matrix Derivatives Notation (C) Dhruv Batra 16

Vector Derivative Example (C) Dhruv Batra 17

Vector Derivative Example (C) Dhruv Batra 18
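The worked example is in the slide images and not in this transcript; a standard example of the kind these slides use (an assumption, stated in the layout where a scalar-by-vector derivative is a row vector):

\[
y = w^\top x,\; w, x \in \mathbb{R}^d
\;\Longrightarrow\;
\frac{\partial y}{\partial x} = w^\top,
\qquad
\frac{\partial y}{\partial w} = x^\top .
\]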

Extension to Tensors (C) Dhruv Batra 19

Plan for Today
• (Finish) Analytical Gradients
• Automatic Differentiation
  – Computational Graphs
  – Forward mode vs Reverse mode AD
  – Patterns in backprop
(C) Dhruv Batra 20

Chain Rule: Composite Functions (C) Dhruv Batra 21

Chain Rule: Scalar Case (C) Dhruv Batra 22

Chain Rule: Vector Case (C) Dhruv Batra 23

Chain Rule: Jacobian view (C) Dhruv Batra 24
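The statements themselves live in the slide images; for reference, the standard scalar and vector chain rules these titles refer to are:

\[
z = g(y),\; y = f(x):\qquad \frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx},
\]
and in the vector case, with \(x \in \mathbb{R}^n\), \(y = f(x) \in \mathbb{R}^m\), \(z = g(y) \in \mathbb{R}^k\),
\[
J_{g \circ f}(x) = J_g\big(f(x)\big)\, J_f(x),
\]
the product of the two Jacobians.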

Chain Rule: Graphical view (C) Dhruv Batra 25

(C) Dhruv Batra 26

Logistic Regression Derivatives (C) Dhruv Batra 27

Logistic Regression Derivatives (C) Dhruv Batra 28
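The derivation itself is in the slide images; the standard results it builds on: for the sigmoid,
\[
\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \sigma'(z) = \sigma(z)\big(1-\sigma(z)\big),
\]
and for the negative log-likelihood \(L = -\big[\,y\log\sigma(w^\top x) + (1-y)\log\big(1-\sigma(w^\top x)\big)\big]\), the chain rule collapses to
\[
\nabla_w L = \big(\sigma(w^\top x) - y\big)\,x .
\]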

Chain Rule: Cascaded (C) Dhruv Batra 29

Chain Rule: How should we multiply? (C) Dhruv Batra 30
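The question comes down to how the chain of Jacobians is associated. For a cascade \(x \to y_1 \to \cdots \to y_k \to L\),
\[
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial y_{k-1}}\cdots\frac{\partial y_1}{\partial x} .
\]
Multiplying this product right-to-left (starting from the input) is forward mode; multiplying left-to-right (starting from the loss) is reverse mode. The rest of the lecture develops exactly this distinction.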

Convolutional network (AlexNet): input image, weights, loss. Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural Turing Machine: input image, loss. Figure reproduced with permission from a Twitter post by Andrej Karpathy. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
(C) Dhruv Batra 33

Plan for Today
• (Finish) Analytical Gradients
• Automatic Differentiation
  – Computational Graphs
  – Forward mode vs Reverse mode AD
(C) Dhruv Batra 34

Deep Learning = Differentiable Programming
• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing the chain rule on computation graphs
(C) Dhruv Batra 35
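As a concrete, framework-specific illustration (not from the slides): PyTorch builds exactly such a graph as you compute and then runs reverse-mode AD over it:

```python
import torch

# Data and parameters are graph inputs; the scalar loss is the output.
x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (x * x).sum()   # builds the computation graph as a side effect
loss.backward()        # reverse-mode AD over the recorded graph
print(x.grad)          # tensor([2., 4.]), i.e. d(loss)/dx = 2x
```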

Computational Graph: inputs x and W feed a * node producing s (scores); the hinge loss and the regularizer R are combined by a + node into the total loss L. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 37

Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
  – Directed edges
  – No (directed) cycles
  – Underlying undirected cycles okay
(C) Dhruv Batra 38

Directed Acyclic Graphs (DAGs) • Concept – Topological Ordering (C) Dhruv Batra 39
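A topological ordering gives a valid forward-pass schedule. A minimal sketch (Kahn's algorithm; illustrative, not course code), where `graph` maps each node to its successors:

```python
from collections import deque

def topological_order(graph):
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    queue = deque(u for u, d in indegree.items() if d == 0)  # start from source nodes
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:          # "remove" u's outgoing edges
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order                    # a valid execution schedule

# e.g. the graph from a few slides back:
# topological_order({'x': ['s'], 'W': ['s'], 's': ['L'], 'L': []})
# -> ['x', 'W', 's', 'L']
```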

Directed Acyclic Graphs (DAGs) (C) Dhruv Batra 40

Computational Graphs • Notation (C) Dhruv Batra 41

Example: a computational graph with inputs x1, x2 and nodes *, sin( ), and + (C) Dhruv Batra 42
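The exact wiring is in the slide image; assuming the common reading of these nodes, f(x1, x2) = x1 * x2 + sin(x1), a plain forward pass evaluates the graph in topological order:

```python
import math

def f(x1, x2):
    a = x1 * x2        # * node
    b = math.sin(x1)   # sin( ) node
    return a + b       # + node

print(f(2.0, 3.0))     # 6 + sin(2) ≈ 6.9093
```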

HW 0 (C) Dhruv Batra 43

HW 0 Submission by Samyak Datta (C) Dhruv Batra 44

Logistic Regression as a Cascade: Given a library of simple functions, compose them into a complicated function. (C) Dhruv Batra. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 45
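A sketch of that composition for logistic regression (illustrative; the slide's own module names aren't in the transcript):

```python
import numpy as np

# A tiny "library of simple functions", each differentiable...
def dot(w, x):   return w @ x
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def nll(p, y):   return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# ...composed into the complicated function: the logistic regression loss.
w = np.array([0.5, -0.3]); x = np.array([1.0, 2.0]); y = 1.0
loss = nll(sigmoid(dot(w, x)), y)
```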

Deep Learning = Differentiable Programming
• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing the chain rule on computation graphs
(C) Dhruv Batra 46

Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra 47
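A standard way to summarize the two key computations (the figure itself is not in the transcript): forward mode pushes a tangent \(\dot{x}\) through the graph, reverse mode pulls an adjoint \(\bar{y}\) back:

\[
\text{forward (Jacobian-vector product):}\quad \dot{y} = \frac{\partial f}{\partial x}\,\dot{x},
\qquad
\text{reverse (vector-Jacobian product):}\quad \bar{x} = \left(\frac{\partial f}{\partial x}\right)^{\!\top} \bar{y}.
\]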

Forward mode AD 48

Reverse mode AD 49

Example: Forward mode AD on the graph with inputs x1, x2 and nodes *, sin( ), + (worked step-by-step over slides 50–54) (C) Dhruv Batra
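A minimal sketch of forward mode on this example (again assuming f(x1, x2) = x1 * x2 + sin(x1)): carry a (value, derivative) pair, a dual number, through every operation.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. one chosen input."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)  # product rule

def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)  # chain rule for sin

def f(x1, x2):
    return x1 * x2 + dsin(x1)

# One forward pass per input: seed the tangent of the input of interest to 1.
print(f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot)  # df/dx1 = x2 + cos(x1) = 3 + cos(2)
print(f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot)  # df/dx2 = x1 = 2
```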

Example: Reverse mode AD on the same graph (worked step-by-step over slides 55–57) (C) Dhruv Batra
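And a sketch of reverse mode on the same assumed function: one forward pass stores intermediates, then one backward pass accumulates adjoints (the bar variables) in reverse topological order, yielding all partials at once.

```python
import math

def f_and_grad(x1, x2):
    # forward pass, in topological order
    a = x1 * x2
    b = math.sin(x1)
    out = a + b
    # backward pass: bar(v) = d(out)/dv, seeded with bar(out) = 1
    out_bar = 1.0
    a_bar = out_bar                  # out = a + b
    b_bar = out_bar
    x1_bar = a_bar * x2              # a = x1 * x2
    x2_bar = a_bar * x1
    x1_bar += b_bar * math.cos(x1)   # b = sin(x1); adjoints accumulate across uses
    return out, (x1_bar, x2_bar)

print(f_and_grad(2.0, 3.0))  # (6 + sin 2, (3 + cos 2, 2.0)) — both partials in one pass
```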

Forward Pass vs Forward mode AD vs Reverse Mode AD: the same graph (inputs x1, x2; nodes *, sin( ), +) drawn three times, once for each computation. (C) Dhruv Batra 58

Forward mode vs Reverse Mode • What are the differences? (the forward-mode and reverse-mode example graphs shown side by side) (C) Dhruv Batra 59

Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
  – Forward or backward?
(C) Dhruv Batra 60

Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
  – Forward or backward?
• Which one is more memory efficient (less storage)?
  – Forward or backward?
(C) Dhruv Batra 61