
Computing Gradient Hung-yi Lee 李宏毅

Introduction
• Backpropagation: an efficient way to compute the gradient
• Prerequisite
  • Backpropagation for feedforward net: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
  • Simple version: https://www.youtube.com/watch?v=ibJpTrp5mcE
  • Backpropagation through time for RNN: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
• Understanding backpropagation by computational graph
  • TensorFlow, Theano, CNTK, etc.

Computational Graph

Computational Graph
• A “language” describing a function
• Node: variable (scalar, vector, tensor, ……)
• Edge: operation (simple function)
• Example graphs: two inputs a, b combined into a node c; a chain x → u → v → y

Computational Graph
• Example: e = (a+b) ∗ (b+1)
• Break it into simple operations: c = a + b, d = b + 1, e = c ∗ d
• With a = 2, b = 1: c = 3, d = 2, e = 6
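A minimal sketch in plain Python (not from the slides) that evaluates this graph node by node; the variable names c, d, e match the slide:

```python
# Evaluate the graph e = (a + b) * (b + 1) one node at a time.
def forward(a, b):
    c = a + b      # node c: addition
    d = b + 1      # node d: add the constant 1
    e = c * d      # node e: multiplication
    return c, d, e

print(forward(2, 1))   # (3, 2, 6), the values shown on the slide
print(forward(3, 2))   # (5, 3, 15), used on the later slides
```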

Review: Chain Rule
• Case 1: y = g(x), z = h(y) ⇒ dz/dx = (dz/dy)(dy/dx)
• Case 2: x = g(s), y = h(s), z = k(x, y) ⇒ dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds)
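As a quick sanity check, both cases can be verified numerically; the functions g, h, k below are arbitrary illustrative choices, not taken from the slides:

```python
# Check both chain-rule cases against central finite differences.
import math

def num_grad(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Case 1: y = g(x), z = h(y)  =>  dz/dx = (dz/dy)(dy/dx)
g = lambda x: x ** 2
h = lambda y: math.sin(y)
x = 0.7
analytic = math.cos(g(x)) * 2 * x                  # (dz/dy)(dy/dx)
print(analytic, num_grad(lambda t: h(g(t)), x))

# Case 2: x = g(s), y = h(s), z = k(x, y)
#         =>  dz/ds = (dz/dx)(dx/ds) + (dz/dy)(dy/ds)
k = lambda x, y: x * y
s = 0.7
analytic = h(s) * 2 * s + g(s) * math.cos(s)       # y(dx/ds) + x(dy/ds)
print(analytic, num_grad(lambda t: k(g(t), h(t)), s))
```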

Computational Graph
• Example: e = (a+b) ∗ (b+1)
• Derivatives on the edges: ∂e/∂c = d = b+1, ∂e/∂d = c = a+b, ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1
• ∂e/∂b = 1 × (b+1) + 1 × (a+b): sum over all paths from b to e (b → c → e and b → d → e)

Computational Graph
• Example: e = (a+b) ∗ (b+1), evaluated at a = 3, b = 2
• Forward pass: c = 5, d = 3, e = 15
• Derivatives on the edges: ∂e/∂c = d = 3, ∂e/∂d = c = 5, ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1

Computational Graph
• Example: e = (a+b) ∗ (b+1), a = 3, b = 2
• Forward mode, computing ∂e/∂b: start with 1 at node b
• ∂c/∂b = 1, ∂d/∂b = 1, so ∂e/∂b = 3 × 1 + 5 × 1 = 8

Computational Graph
• Example: e = (a+b) ∗ (b+1), a = 3, b = 2
• Forward mode, computing ∂e/∂a: start with 1 at node a
• ∂c/∂a = 1, so ∂e/∂a = 3 × 1 = 3

Computational Graph (reverse mode)
• Example: e = (a+b) ∗ (b+1), a = 3, b = 2
• Reverse mode: start with 1 at node e, then compute ∂e/∂a and ∂e/∂b together
• ∂e/∂c = d = 3, ∂e/∂d = c = 5, so ∂e/∂a = 3 and ∂e/∂b = 3 + 5 = 8
• What is the benefit? One backward pass gives the derivative of e with respect to every node, instead of one forward pass per input
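A minimal plain-Python sketch of this reverse-mode pass (node names match the slide; no framework is used):

```python
# Reverse mode on e = (a + b) * (b + 1): start with de/de = 1 at the output
# and push derivatives backward along the edges.
def reverse_mode(a, b):
    # forward pass, storing every node value
    c = a + b
    d = b + 1
    e = c * d
    # backward pass
    de_de = 1.0                          # start with 1 at node e
    de_dc = de_de * d                    # e = c * d  =>  de/dc = d
    de_dd = de_de * c                    # e = c * d  =>  de/dd = c
    de_da = de_dc * 1.0                  # c = a + b  =>  dc/da = 1
    de_db = de_dc * 1.0 + de_dd * 1.0    # sum over both paths from b to e
    return e, de_da, de_db

print(reverse_mode(3, 2))                # (15, 3.0, 8.0), as on the slide
```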

Computational Graph
• Parameter sharing: the same parameter appearing in different nodes
• Example graph (nodes x, u, v, y with exp and ∗): e.g. u = x ∗ x, v = exp(u), y = x ∗ v, so y = x · exp(x²); the derivatives from every appearance of x are summed
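Assuming the graph sketched on this slide is u = x ∗ x, v = exp(u), y = x ∗ v (so y = x · exp(x²)), here is a minimal sketch of how the gradient contributions from every appearance of the shared x are summed:

```python
# Gradient of y = x * exp(x * x): x appears in three nodes, and its gradient
# is the sum of the contribution from each appearance.
import math

def grad_shared(x):
    # forward pass
    u = x * x
    v = math.exp(u)
    y = x * v
    # backward pass, starting with dy/dy = 1
    dy_dv = x                    # from y = x * v
    dy_du = dy_dv * v            # from v = exp(u): dv/du = exp(u) = v
    dy_dx = v                    # contribution of the x in y = x * v
    dy_dx += dy_du * (x + x)     # contributions of the two x's in u = x * x
    return y, dy_dx

x = 0.5
y, g = grad_shared(x)
print(g, math.exp(x * x) * (1 + 2 * x * x))   # both equal exp(x^2)(1 + 2x^2)
```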

Computational Graph for Feedforward Net

Review: Backpropagation
• (Figure: forward pass through the layers, and backward pass propagating the error signal from layer to layer)

Review: Backpropagation
• (Figure: in the backward pass, the error signal of layer L-1 is computed from the error signal of layer L)
• (we do not use softmax here)

Feedforward Network
• y = f(x) = σ(WL … σ(W2 σ(W1 x + b1) + b2) … + bL)
• Node by node: z1 = W1 x + b1, a1 = σ(z1), z2 = W2 a1 + b2, …, y = σ(zL)

Loss Function of Feedforward Network
• (Graph: x → z1 → a1 → z2 → y with parameters W1, b1, W2, b2)
• The loss C compares the network output y with the target

Gradient of Cost Function
• To compute the gradient of the cost C: compute the partial derivative on each edge of the graph x → z1 → a1 → z2 → y (with parameters W1, b1, W2, b2)
• Use reverse mode: the output (the cost) is always a scalar, while the inputs are vectors and matrices
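A minimal numpy sketch of this reverse-mode pass on a two-layer network; the sigmoid activations and the squared-error loss are illustrative assumptions, not necessarily the ones used in the lecture:

```python
# Reverse mode on z1 = W1 x + b1, a1 = sigma(z1), z2 = W2 a1 + b2, y = sigma(z2),
# with C = 0.5 * ||y - t||^2 as an illustrative scalar cost.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, t = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))      # input and target
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# forward pass
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; y = sigma(z2)
C = 0.5 * np.sum((y - t) ** 2)

# backward pass: start from dC/dy (the cost is a scalar) and walk back
dC_dy  = y - t
dC_dz2 = dC_dy * y * (1 - y)         # diagonal Jacobian of the elementwise sigma
dC_dW2 = dC_dz2 @ a1.T
dC_da1 = W2.T @ dC_dz2               # Jacobian of z2 with respect to a1 is W2
dC_dz1 = dC_da1 * a1 * (1 - a1)
dC_dW1 = dC_dz1 @ x.T

# numerical check of one entry of dC/dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Cp = 0.5 * np.sum((sigma(W2 @ sigma(W1p @ x + b1) + b2) - t) ** 2)
print(dC_dW1[0, 0], (Cp - C) / eps)  # the two numbers should agree closely
```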

Jacobian Matrix
• For y = f(x), the Jacobian ∂y/∂x has (size of y) rows and (size of x) columns; its (i, j) entry is ∂y_i/∂x_j
• Example: see the numerical sketch below
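A small numerical sketch (the function f below is an arbitrary example, not from the slides) showing that the Jacobian has (size of y) rows and (size of x) columns:

```python
# Jacobian of y = f(x): entry (i, j) is dy_i/dx_j.
import numpy as np

def f(x):
    # y has size 2, x has size 3
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def numerical_jacobian(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))       # (size of y) x (size of x)
    for j in range(x.size):
        xp = x.copy(); xp[j] += eps
        J[:, j] = (f(xp) - y) / eps
    return J

x = np.array([1.0, 2.0, 0.5])
print(numerical_jacobian(f, x))
print(np.array([[x[1], x[0], 0.0],       # analytic Jacobian for comparison
                [0.0,  1.0, np.cos(x[2])]]))
```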

Gradient of Cost Function
• Last layer: start from ∂C/∂y
• Cross entropy: C = −Σ_i ŷ_i log y_i (ŷ: target), so ∂C/∂y_i = −ŷ_i / y_i

Gradient of Cost Function
• ∂y/∂z2 is a Jacobian matrix (square)
• For an elementwise activation, it is a diagonal matrix
• How about softmax? Then every y_i depends on every z2_j, so the Jacobian is no longer diagonal
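A short numpy sketch contrasting the two cases for an illustrative z2 (sigmoid stands in for a generic elementwise activation):

```python
# dy/dz2 is diagonal for an elementwise activation, but a full square matrix
# for softmax: J_ij = y_i (delta_ij - y_j).
import numpy as np

z2 = np.array([1.0, 2.0, 0.5])

# elementwise sigmoid: diagonal Jacobian
y_sig = 1.0 / (1.0 + np.exp(-z2))
J_sig = np.diag(y_sig * (1 - y_sig))

# softmax: every output depends on every input, so the Jacobian is full
y_soft = np.exp(z2) / np.sum(np.exp(z2))
J_soft = np.diag(y_soft) - np.outer(y_soft, y_soft)

print(J_sig)
print(J_soft)
```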

• ∂z2/∂a1 is a Jacobian matrix; its i-th row, j-th column entry is ∂(z2)_i/∂(a1)_j = (W2)_ij, since z2 = W2 a1 + b2

• ∂z2/∂W2: considering W2 (an m×n matrix) as a vector of length m·n, the Jacobian ∂z2/∂W2 has m rows and m·n columns
• Its entry at row i, column (j−1)×n+k is ∂(z2)_i/∂(W2)_jk = (a1)_k if i = j, and 0 otherwise
• So row i contains a1 in the block of columns belonging to the i-th row of W2 (i = 1, i = 2, …) and zeros elsewhere
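A numpy sketch of this Jacobian with illustrative sizes m = 2, n = 3 (the bias is left out so that z2 = W2 a1), checked against finite differences:

```python
# Jacobian of z2 = W2 a1 with respect to W2 flattened to a length m*n vector:
# row i is non-zero only in the columns belonging to the i-th row of W2,
# where it equals a1.
import numpy as np

m, n = 2, 3
rng = np.random.default_rng(1)
W2, a1 = rng.normal(size=(m, n)), rng.normal(size=n)

# build the m x (m*n) Jacobian explicitly
J = np.zeros((m, m * n))
for i in range(m):
    J[i, i * n:(i + 1) * n] = a1          # block for row i of W2

# finite-difference check: perturb W2[j, k] and look at column j*n + k
eps = 1e-6
z2 = W2 @ a1
J_num = np.zeros_like(J)
for j in range(m):
    for k in range(n):
        Wp = W2.copy(); Wp[j, k] += eps
        J_num[:, j * n + k] = (Wp @ a1 - z2) / eps
print(np.allclose(J, J_num, atol=1e-4))   # True
```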

• Putting the pieces together along the graph x → z1 → a1 → z2 → y: for example, ∂C/∂W1 = (∂C/∂y)(∂y/∂z2)(∂z2/∂a1)(∂a1/∂z1)(∂z1/∂W1), with each factor a Jacobian computed on one edge

Question
• Q: Does the computational graph approach use only a backward pass?
• Q: Do we get the same results from the two different approaches?

Computational Graph for Recurrent Network

Recurrent Network
• The same function f is applied at every time step: h1 = f(h0, x1), h2 = f(h1, x2), h3 = f(h2, x3), ……
• Each step produces an output: y1, y2, y3, ……
• (biases are ignored here)

Recurrent Network
• (Graph of one time step: the input x1 enters through Wi and the initial state a0 through Wh; their results n1 and m1 are summed into z1, which gives h1; the output y1 is obtained from h1 through Wo and the intermediate o1)

Recurrent Network
• (Compact graph of one time step: h1 is computed from x1 through Wi and from h0 through Wh; y1 is computed from h1 through Wo)

Recurrent Network
• Unrolled for three time steps: h1, h2, h3 are computed from x1, x2, x3 through Wi and Wh; the outputs y1, y2, y3 are computed through Wo
• Each output has its own cost C1, C2, C3, and the total cost is C
• Wi, Wh, Wo are shared across all time steps

Recurrent Network
• Reverse mode on the unrolled graph: since C is the sum of C1, C2, C3, the backward pass starts with ∂C/∂C1 = ∂C/∂C2 = ∂C/∂C3 = 1
• From there, the gradient is propagated backward through every time step

• Because Wi, Wh, Wo appear at every time step, their gradients are sums over the time steps: ∂C/∂Wh = ∂C1/∂Wh + ∂C2/∂Wh + ∂C3/∂Wh, and likewise for Wi and Wo
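A minimal numpy sketch of backpropagation through time on this unrolled graph; the sigmoid activations and squared-error per-step costs are illustrative assumptions, and biases are ignored as on the slides:

```python
# h_t = sigma(Wi x_t + Wh h_{t-1}), y_t = sigma(Wo h_t), C = C1 + C2 + C3.
# The gradients of the shared Wi, Wh, Wo are accumulated over all time steps.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim_x, dim_h, dim_y, T = 3, 4, 2, 3
Wi = rng.normal(size=(dim_h, dim_x))
Wh = rng.normal(size=(dim_h, dim_h))
Wo = rng.normal(size=(dim_y, dim_h))
xs = [rng.normal(size=(dim_x, 1)) for _ in range(T)]
ts = [rng.normal(size=(dim_y, 1)) for _ in range(T)]   # per-step targets

# forward pass, keeping every h_t and y_t
hs, ys = [np.zeros((dim_h, 1))], []
for t in range(T):
    hs.append(sigma(Wi @ xs[t] + Wh @ hs[-1]))
    ys.append(sigma(Wo @ hs[-1]))

# backward pass: start with dC/dC_t = 1 at every cost and sum over time
dWi, dWh, dWo = np.zeros_like(Wi), np.zeros_like(Wh), np.zeros_like(Wo)
dC_dh = np.zeros((dim_h, 1))               # gradient arriving from later steps
for t in reversed(range(T)):
    y, h, h_prev = ys[t], hs[t + 1], hs[t]
    dC_dz_out = (y - ts[t]) * y * (1 - y)  # from C_t = 0.5 * ||y_t - target_t||^2
    dWo += dC_dz_out @ h.T
    dC_dh += Wo.T @ dC_dz_out              # cost at step t plus later steps
    dC_dz = dC_dh * h * (1 - h)
    dWi += dC_dz @ xs[t].T
    dWh += dC_dz @ h_prev.T
    dC_dh = Wh.T @ dC_dz                   # pass the gradient back to h_{t-1}

print(dWi.shape, dWh.shape, dWo.shape)     # one summed gradient per shared matrix
```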

Reference
• Textbook: Deep Learning, Chapter 6.5
• Calculus on Computational Graphs: Backpropagation
  • https://colah.github.io/posts/2015-08-Backprop/
• On chain rule, computational graphs, and backpropagation
  • http://outlace.com/Computational-Graph/
