CS 4803 / 7643: Deep Learning Topics: – (Finish) Analytical Gradients – Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD Dhruv Batra Georgia Tech
Administrativia • HW 1 Reminder – Due: 09/26, 11:55pm • Fuller schedule + future reading posted – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/ – Caveat: subject to change; please don’t make irreversible decisions based on this. (C) Dhruv Batra 2
Recap from last time (C) Dhruv Batra 3
Strategy: Follow the slope Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
Gradient Descent Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
Stochastic Gradient Descent (SGD) Full sum is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
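The minibatch update on this slide can be sketched in a few lines of NumPy. This is a toy linear-regression setup of my own, not the course's starter code; the function name and hyperparameters are illustrative.

```python
import numpy as np

# Minibatch SGD for a linear model with mean squared error (illustrative sketch).
def sgd(X, y, lr=0.1, batch_size=32, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Approximate the full-sum gradient with a random minibatch.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of mean squared error
        w -= lr * grad
    return w
```

Each step touches only `batch_size` examples, which is why SGD scales to large N.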
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 7
(C) Dhruv Batra Figure Credit: Baydin et al. https://arxiv.org/abs/1502.05767 8
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 9
(C) Dhruv Batra By Brnbrnz (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)] 10
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] loss 1.25347. W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] loss 1.25322. gradient dW: [-2.5, ?, ?, ?, ?, …], since (1.25322 - 1.25347)/0.0001 = -2.5. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] loss 1.25347. W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] loss 1.25353. gradient dW: [-2.5, 0.6, ?, ?, ?, …], since (1.25353 - 1.25347)/0.0001 = 0.6. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
Numerical vs Analytic Gradients Numerical gradient: slow :(, approximate :(, easy to write :) Analytic gradient: fast :), exact :), error-prone :( In practice: derive the analytic gradient, then check your implementation with the numerical gradient. This is called a gradient check. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
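A minimal sketch of the gradient-check recipe above (function and variable names are mine). It uses a central difference, which is more accurate than the one-sided difference on the previous slides, and compares the result to a known analytic gradient.

```python
import numpy as np

# Central-difference numerical gradient of a scalar function f at w (sketch).
def numerical_gradient(f, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h; fp = f(w)
        w.flat[i] = old - h; fm = f(w)
        w.flat[i] = old                      # restore the perturbed entry
        grad.flat[i] = (fp - fm) / (2 * h)   # central difference
    return grad

# Gradient check: analytic vs numerical gradient of f(w) = ||w||^2.
w = np.array([0.34, -1.11, 0.78])
analytic = 2 * w                             # derived by hand: d/dw ||w||^2 = 2w
numeric = numerical_gradient(lambda v: np.sum(v ** 2), w)
rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
```

A tiny relative error (e.g. below 1e-6) gives confidence that the analytic derivation is correct.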
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 14
Matrix/Vector Derivatives Notation (C) Dhruv Batra 15
Vector/Matrix Derivatives Notation (C) Dhruv Batra 16
Vector Derivative Example (C) Dhruv Batra 17
Vector Derivative Example (C) Dhruv Batra 18
Extension to Tensors (C) Dhruv Batra 19
Plan for Today • (Finish) Analytical Gradients • Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD – Patterns in backprop (C) Dhruv Batra 20
Chain Rule: Composite Functions (C) Dhruv Batra 21
Chain Rule: Scalar Case (C) Dhruv Batra 22
Chain Rule: Vector Case (C) Dhruv Batra 23
Chain Rule: Jacobian view (C) Dhruv Batra 24
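A concrete sketch of the Jacobian view, with two made-up vector functions of my own choosing: for a composition f(g(x)), the Jacobian of the whole is the matrix product of the Jacobians of the parts.

```python
import numpy as np

# Chain rule in Jacobian form: J_{f∘g}(x) = J_f(g(x)) @ J_g(x).
# g and f below are illustrative: g(x) = (x0*x1, sin(x0)); f(u) = (u0+u1, u0*u1).
def g(x):  return np.array([x[0] * x[1], np.sin(x[0])])
def Jg(x): return np.array([[x[1], x[0]],
                            [np.cos(x[0]), 0.0]])
def f(u):  return np.array([u[0] + u[1], u[0] * u[1]])
def Jf(u): return np.array([[1.0, 1.0],
                            [u[1], u[0]]])

x = np.array([0.5, -1.2])
J_composed = Jf(g(x)) @ Jg(x)   # chain rule: product of the two Jacobians
```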
Chain Rule: Graphical view (C) Dhruv Batra 25
(C) Dhruv Batra 26
Logistic Regression Derivatives (C) Dhruv Batra 27
Logistic Regression Derivatives (C) Dhruv Batra 28
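Assuming the standard setup (sigmoid output, binary cross-entropy loss), the derivative worked on these slides collapses to dL/dw = (σ(wᵀx) − y)·x. A sketch, with names of my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary cross-entropy loss for one example (x, y) under weights w.
def loss(w, x, y):
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient: the sigmoid/log terms cancel to (p - y) * x.
def grad(w, x, y):
    return (sigmoid(w @ x) - y) * x
```

The compact form of the gradient is exactly what makes this a good warm-up for backprop.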
Chain Rule: Cascaded (C) Dhruv Batra 29
Chain Rule: How should we multiply? (C) Dhruv Batra 30
Convolutional network (AlexNet) input image weights loss Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
Neural Turing Machine input image loss Figure reproduced with permission from a Twitter post by Andrej Karpathy. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
How do we compute gradients? • Analytic or “Manual” Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 33
Plan for Today • (Finish) Analytical Gradients • Automatic Differentiation – Computational Graphs – Forward mode vs Reverse mode AD (C) Dhruv Batra 34
Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra 35
Computational Graph x * s (scores) hinge loss + W R Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 37
Directed Acyclic Graphs (DAGs) • Exactly what the name suggests – Directed edges – No (directed) cycles – Underlying undirected cycles okay (C) Dhruv Batra 38
Directed Acyclic Graphs (DAGs) • Concept – Topological Ordering (C) Dhruv Batra 39
Directed Acyclic Graphs (DAGs) (C) Dhruv Batra 40
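The scheduling step (topological ordering of a DAG) can be sketched with Kahn's algorithm. The code and the example graph below are my own illustration, using node names that mirror the x1/x2/sin/*/+ example graph from the lecture.

```python
from collections import deque

# Kahn's algorithm: topological ordering of a DAG, or an error if a cycle exists.
def topological_order(nodes, edges):
    indeg = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for u, v in edges:            # edge u -> v: u must be scheduled before v
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:     # all of v's inputs are now scheduled
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle")
    return order
```

This is exactly the order in which a framework can execute the forward pass: every module runs only after all of its inputs are available.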
Computational Graphs • Notation (C) Dhruv Batra 41
Example + sin( ) x1 * x2 (C) Dhruv Batra 42
HW 0 (C) Dhruv Batra 43
HW 0 Submission by Samyak Datta (C) Dhruv Batra 44
Logistic Regression as a Cascade Given a library of simple functions Compose into a complicated function (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 45
Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra 46
Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra 47
Forward mode AD g 48
Reverse mode AD g 49
Example: Forward mode AD + sin( ) x1 * x2 (C) Dhruv Batra 50
Example: Forward mode AD + sin( ) x1 * x2 (C) Dhruv Batra 51
Example: Forward mode AD + sin( ) x1 * x2 (C) Dhruv Batra 52
Example: Forward mode AD + sin( ) x1 * x2 (C) Dhruv Batra 53
Example: Forward mode AD + sin( ) x1 * x2 (C) Dhruv Batra 54
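The forward-mode walk on these slides can be reproduced with dual numbers. I am assuming the example function is f(x1, x2) = x1·x2 + sin(x1), which is my reading of the +, ×, sin graph; the class below is an illustrative sketch, not the course's code.

```python
import math

# Forward-mode AD via dual numbers: each value carries its directional derivative.
class Dual:
    def __init__(self, val, dot):
        self.val = val   # primal value
        self.dot = dot   # tangent: derivative w.r.t. the chosen input
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule

def dsin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

def f(x1, x2):
    return x1 * x2 + dsin(x1)

# One forward pass propagates ONE directional derivative: seed x1 with dot = 1.
x1, x2 = Dual(0.5, 1.0), Dual(2.0, 0.0)
out = f(x1, x2)   # out.dot == df/dx1 = x2 + cos(x1)
```

Note that getting df/dx2 as well would require a second forward pass with the seed on x2; this is the key cost asymmetry the later slides ask about.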
Example: Reverse mode AD + sin( ) x1 * x2 (C) Dhruv Batra 55
Example: Reverse mode AD + sin( ) x1 * x2 (C) Dhruv Batra 56
Example: Reverse mode AD + sin( ) x1 * x2 (C) Dhruv Batra 57
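The reverse-mode version of the same walk, as a tiny tape/graph sketch (same assumed function f(x1, x2) = x1·x2 + sin(x1); this is my own illustration, not the course's reference implementation).

```python
import math

# Reverse-mode AD: build the graph forward, then sweep adjoints backward.
class Node:
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # list of (parent_node, local_gradient)
        self.grad = 0.0

def mul(a, b): return Node(a.val * b.val, [(a, b.val), (b, a.val)])
def add(a, b): return Node(a.val + b.val, [(a, 1.0), (b, 1.0)])
def sin(a):    return Node(math.sin(a.val), [(a, math.cos(a.val))])

def backward(out):
    # Visit nodes in reverse topological order so each node's adjoint is
    # complete before it is pushed to its parents.
    topo, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            topo.append(n)
    visit(out)
    out.grad = 1.0
    for node in reversed(topo):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule, accumulated

x1, x2 = Node(0.5), Node(2.0)
out = add(mul(x1, x2), sin(x1))
backward(out)   # ONE backward pass fills in the gradients of ALL inputs
```

The contrast with forward mode is visible here: one reverse sweep yields both x1.grad and x2.grad, which is why backprop is the right mode when there are many inputs (parameters) and one scalar output (the loss).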
Forward Pass vs Forward mode AD vs Reverse Mode AD + sin( ) x1 * x2 + sin( ) x1 * x2 + sin( ) x1 * x2 (C) Dhruv Batra 58
Forward mode vs Reverse Mode • What are the differences? + sin( ) x1 * x2 + sin( ) x1 * x2 (C) Dhruv Batra 59
Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra 60
Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? • Which one is more memory efficient (less storage)? – Forward or backward? (C) Dhruv Batra 61