Deep Thoughts on Deep Learning
Mark Stamp
Deep Learning
q Here, deep learning refers to neural networks with “lots” of hidden layers
q Deep learning is often claimed to have almost metaphysical powers
q Artificial neural networks (ANNs) are not new (1940s-era technology)
o So why all the fuss?
o More of everything: more computing power, more data, more (deeper) nets
Agenda
q We focus for now on backpropagation
o The algorithm used to train deep neural nets
q Other topics covered include
o History of ANNs and background topics
o Later, we discuss many network architectures
q Also mention deep connections between deep learning and other ML techniques
History of ANNs
q The artificial neuron was first proposed by McCulloch and Pitts in the 1940s
q Example
o Where Xi ∈ {0, 1}, and Y is 0 if the sum of the inputs is below a threshold, and Y is 1 otherwise
Perceptron
q In the late 1950s, Rosenblatt’s perceptron generalized the McCulloch-Pitts neuron
o Specifically, the inputs Xi can be real-valued
q Then the output is determined by a weighted sum of the inputs
q In 2-d, the weighted sum is the equation of a line
o Implies that we can have ideal results only when the data is linearly separable
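The weighted-sum-and-threshold idea above can be sketched in a few lines. This is a minimal illustration, not Rosenblatt's training procedure: the weights and bias below are chosen by hand to model AND, showing how a single linear boundary suffices for a linearly separable function.

```python
# Minimal perceptron sketch: output 1 when the weighted sum of
# real-valued inputs (plus a bias) is positive, 0 otherwise.
def perceptron(x, w, b):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# Hand-picked weights w = (1, 1) and bias b = -1.5 give AND,
# since x0 + x1 - 1.5 > 0 only when both inputs are 1.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, (1, 1), -1.5))
```

No single choice of weights and bias works for XOR, which is the point of the next slide.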
AND, OR, and XOR
q A perceptron can model AND and OR, but cannot model XOR
q But a multilayer perceptron (MLP) is not restricted to a linear decision boundary!
MLP
q An MLP is a perceptron with 1 or more hidden layer(s)
o We always have input and output layers
q Example of an MLP with 2 hidden layers
o Edges are weights
MLP vs SVM
q A linear SVM and a perceptron both have a linear decision boundary
q A nonlinear SVM and an MLP are not restricted to a linear decision boundary
o Nonlinear SVM uses the “kernel trick”
o MLP uses layers, but no explicit kernel
o MLP “learns” a (rough) equivalent of the SVM kernel function through training
q Advantages/disadvantages?
MLP vs RNN
q An MLP is a feedforward network
o That is, no loops allowed
q A recurrent neural network (RNN) can have loops
o Gives the RNN a concept of “memory”
o But training an RNN is more complex
o So, it might require more data to train
q We discuss RNNs later
AI has a Long History as the “Next Big Thing”
q We discuss some of this history on the next few slides…
1st AI Winter
q In 1969, Minsky and Papert published an influential book
o Emphasized that a perceptron cannot model XOR
q It was thought that successful AI would have to model basic logic functions
q Yes, an MLP can model XOR (homework!)
o But nobody knew how to train MLPs…
q So, ANNs were viewed as fatally flawed
Backpropagation
q In 1986, Rumelhart, Hinton, and Williams
o Introduced backpropagation (BP)
o An efficient way to train an MLP
q Game changer!
o Marks the end of the 1st “AI winter”
q But AI then failed to live up to the hype
q 2nd AI winter, late 1980s into the 1990s
q Will there be a 3rd AI winter?
Deep Learning
q ANNs are the basic building blocks of deep learning
o Analogous to the relationship between a Markov chain and an HMM (kind of…)
q Deep networks can be trained using BP
q But why deep learning?
o Advantages over “classic” models?
o Any disadvantages of deep learning?
Why Deep Learning?
q It is often claimed that deep learning continues to learn as the amount of data increases
q Non-deep models are supposed to “saturate” earlier
q Is this true?
Decisions, Decisions
q Design decisions for a deep network
o Depth: How many hidden layers?
§ The more the merrier?
o Width: How many neurons per layer?
o Nonlinearity: Which activation functions?
§ Hidden layers must include nonlinear functions
o What to use as the objective function?
§ Usually, it measures training error
o Bias nodes in hidden layers?
Activation Functions
q Examples of activation functions
q ReLU is the most popular today
q Many variants of ReLU too
o Leaky ReLU
o ELU, etc.
Training an ANN
q Suppose we want to train an ANN, with the depth, width, activation functions, etc., specified
o Also, we have lots of training data
q What does it mean to “train” an ANN?
o Determine a “good” set of edge weights!
q How to determine the weights?
q How to measure “goodness”?
q How to improve on a set of weights?
Training an ANN
q The ANN error is a function of the training data
o Make an initial guess for the weights
o Measure the error of the ANN on the training data
q OK, but now what?
q Gradient descent with respect to the error function
o That is, use calculus to determine the best way to modify the weights (iteratively)
q More details later, but first…
Partial Derivatives
q In backpropagation, we’ll compute the gradient of the error function
q We need an efficient way to evaluate (lots of) partial derivatives of the error function
o The error function depends on the training data
o So, lots and lots of variables
o Lots of partial derivatives at each iteration, and lots of iterations
q We’ll use automatic differentiation…
Derivatives
q Product rule: d(uv) = u dv + v du
o Or (uv)′ = u′v + uv′
q Quotient rule: If f(x) = g(x)/h(x)
o Then f′(x) = (g′(x)h(x) − g(x)h′(x)) / h(x)²
q Suppose that we want to compute the partial derivatives of a function f(x, y)
o Then we treat y as a constant when computing ∂f/∂x, and x as a constant for ∂f/∂y
Chain Rule
q It all boils down to the chain rule…
q Suppose y = f(x) where x = g(t)
q Then dy/dt = (dy/dx)(dx/dt)
q Suppose z = f(x, y) where x = g(t), y = h(t)
q Then dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
q And note that dx/dx = 1 (always)
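The two-variable chain rule above can be checked numerically. This sketch uses z = x·y with x = t² and y = sin(t) as an illustrative example (not a function from the slides) and compares the chain-rule derivative against a finite-difference estimate.

```python
import math

# z = x*y with x = t**2 and y = sin(t), so z(t) = t**2 * sin(t).
def z_of_t(t):
    return (t ** 2) * math.sin(t)

# Chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt),
# with dz/dx = y and dz/dy = x.
def dz_dt_chain(t):
    x, y = t ** 2, math.sin(t)
    dx_dt, dy_dt = 2 * t, math.cos(t)
    return y * dx_dt + x * dy_dt

t, h = 1.3, 1e-6
numeric = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)  # central difference
print(abs(dz_dt_chain(t) - numeric))  # should be very small
```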
Approximate Derivatives?
q Definition of the derivative: f′(x) = lim h→0 (f(x + h) − f(x)) / h
q So, we can approximate the derivative at x as (f(x + h) − f(x)) / h
o Where h is “small”
q Problems with this approach?
o Inaccurate if h is not small enough
o But bad roundoff error if h is too small
q Computationally, this is not good!
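The trade-off on this slide is easy to see in practice. The sketch below approximates the derivative of sin at x = 1 (exact answer cos(1)) with three step sizes: too large, near-optimal, and far too small.

```python
import math

# Forward-difference approximation of f'(x).
def approx_deriv(f, x, h):
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # derivative of sin at x
for h in (1e-1, 1e-8, 1e-15):
    err = abs(approx_deriv(math.sin, x, h) - exact)
    # h = 1e-1: truncation error dominates; h = 1e-15: roundoff dominates
    print(f"h = {h:.0e}  error = {err:.2e}")
```

Neither failure mode affects automatic differentiation, which is the motivation for the next slides.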
Computing Derivatives
q Symbolic computation can be used to compute the derivative as a function
o Like you would do in a calculus class
o See Maple, Mathematica, etc.
q But we need to evaluate many partial derivatives at a specified point
q We would need to compute a ton of partial derivative functions
o Lots of duplication, very inefficient
Automatic Differentiation
q Key insight: We only care about functions in the form of computer programs!
q Any such function can be broken down into a series of simple operations
o Add, subtract, multiply, divide, log, etc.
q Apply the chain rule to the computer program!!!
o Apply the chain rule over and over…
o Known as automatic differentiation (AD)
AD Example
q Consider a function z = f(x, y)
q This function in pseudo-code
q Here, we have computed z = f(x, y)
AD Example
q The function as pseudo-code
q The derivative of the pseudo-code above
AD Example
q How to use the derivative code to compute partial derivatives?
o Initialize (x, y) to the desired point
o But how to initialize (dx, dy)?
q When computing ∂f/∂x we treat y as a constant, so dx/dx = 1 and dy/dx = 0
o Therefore, initialize (dx, dy) = (1, 0)
q For ∂f/∂y, initialize (dx, dy) = (0, 1)
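The seeding scheme above can be sketched directly in code. The slides' own example function is not reproduced here, so this uses f(x, y) = x·y + sin(x) as a stand-in; every program variable carries a value and a derivative, and the seed (dx, dy) selects which partial derivative pops out.

```python
import math

# Forward-mode AD sketch for the stand-in function f(x, y) = x*y + sin(x).
# Each intermediate vi carries its derivative dvi, propagated by the
# product, chain, and sum rules.
def f_forward(x, y, dx, dy):
    v0, dv0 = x * y, x * dy + y * dx         # product rule
    v1, dv1 = math.sin(x), math.cos(x) * dx  # chain rule
    z, dz = v0 + v1, dv0 + dv1               # sum rule
    return z, dz

x, y = 3.0, 2.0
_, df_dx = f_forward(x, y, 1.0, 0.0)  # seed (1, 0): treat y as constant
_, df_dy = f_forward(x, y, 0.0, 1.0)  # seed (0, 1): treat x as constant
print(df_dx, df_dy)  # analytically, df/dx = y + cos(x) and df/dy = x
```

Note that two runs of the derivative code were needed, one per input variable; this is the cost that reverse mode eliminates.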
AD Example
q Evaluate ∂f/∂x at the point (3, 2)
q Using the derivative program, we obtain the result
q Easy to verify, since we can differentiate f by hand
AD Example
q Evaluate ∂f/∂y at the point (3, 2)
q Also easy to verify that this is correct
Gradient
q For a function of more than 1 variable, the gradient plays the role of the derivative
q Gradient of a function of 2 variables: ∇f(x, y) = (∂f/∂x, ∂f/∂y)
q Gradient of a function of n variables: ∇f = (∂f/∂x0, ∂f/∂x1, …, ∂f/∂xn−1)
Gradient and AD
q In backpropagation, the error function has lots of variables
q We need to evaluate the gradient of the error function lots of times
q Using forward-mode AD code to compute the gradient…
o Initialize the derivative seed to (1, 0, …, 0) and run the derivative code
o Then initialize to (0, 1, 0, …, 0) and run it again
o And so on, for n iterations of the code
o All of that for one gradient calculation!
Reverse Mode AD
q The AD discussed so far is forward mode
q For a function of n variables
o Forward mode requires n evaluations of the derivative code to compute the gradient
q But all iterations are the same, except for the initialization of the dvi terms
q There must be a more efficient way!
o Reverse mode AD
Reverse Mode AD
q In reverse mode AD, we swap the roles of the dependent and independent variables
o The chain rule still applies
q Enables us to compute the entire gradient in one pass through the derivative code
q We work an example, using the same function as in forward mode, above
o Really, not all that difficult…
Reverse Mode AD Example
q Consider again the function z = f(x, y)
q This function as a program
q From line 3 of the program, we obtain one chain-rule relation
q From line 4, we obtain another
Reverse Mode AD Example
q We have the relations from the previous slide
q Apply the chain rule and the results on the previous slide
o Trivially, dz/dz = 1
o Then compute dz/dvi for the intermediate variables, working backward
o Finally, we reach dz/dx and dz/dy
Reverse Mode AD Example
q Let dz denote dz/dz and dvi denote dz/dvi
q Then from the previous slide we have the backward-pass equations
q Note that dv0 is ∂f/∂x and dv1 is ∂f/∂y
o Both evaluated at the specified (x, y)
Gradient via Reverse Mode AD
q Function: z = f(x, y)
q Initialize (x, y) and compute the vi in the forward pass
q Compute the gradient as (dv0, dv1) from the backward pass
Reverse Mode AD Example
q Evaluating the gradient at the point (x, y) = (3, 2)
q Thus we obtain both partial derivatives in a single backward pass
Reverse Mode AD: Last Words
q Computing the vi is the forward pass
q Computing the dvi is the backward pass
q Kind of like training an HMM…
o A meet-in-the-middle (sort of) technique
q Reverse mode AD is the most efficient way to compute a gradient at a point
o We compute lots of gradients at lots of points in backpropagation (BP)
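The forward-pass/backward-pass split can be sketched for the same stand-in function used earlier, f(x, y) = x·y + sin(x) (chosen for illustration, not taken from the slides). One forward sweep records the vi; one backward sweep propagates the adjoints dvi = dz/dvi and yields the entire gradient at once.

```python
import math

# Reverse-mode AD sketch for the stand-in f(x, y) = x*y + sin(x).
def f_reverse(x, y):
    # Forward pass: compute and store the intermediates vi.
    v0 = x * y
    v1 = math.sin(x)
    z = v0 + v1
    # Backward pass: start from dz/dz = 1 and apply the chain rule
    # in reverse through each line of the forward pass.
    dz = 1.0
    dv0 = dz * 1.0                    # z = v0 + v1
    dv1 = dz * 1.0
    dx = dv0 * y + dv1 * math.cos(x)  # v0 = x*y and v1 = sin(x) both use x
    dy = dv0 * x
    return z, (dx, dy)

z, grad = f_reverse(3.0, 2.0)
print(grad)  # analytically, (2 + cos 3, 3)
```

Unlike forward mode, no per-variable seeding is needed: both partials fall out of the single backward sweep.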
Gradient Descent
q Suppose that we want to find the minimum of a function f(x)
q Initialize: x0
q Select α > 0
q 1st step: x1 = x0 − α f′(x0)
q Then iterate: xn+1 = xn − α f′(xn)
q Why −α?
Gradient Descent
q For functions of more than 1 variable, the gradient plays the role of the derivative
o The direction with the maximum rate of change
q In backpropagation, the parameter α on the previous slide is the learning rate
o The learning rate need not be constant
o Intuitively, we might set α large initially, then smaller as we get closer to a local min
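The update rule xn+1 = xn − α f′(xn) is a few lines of code. This sketch minimizes f(x) = (x − 4)², an example chosen for illustration, with a fixed learning rate.

```python
# Gradient descent on f(x) = (x - 4)**2, whose minimum is at x = 4.
def gradient_descent(x0, alpha, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 4)   # f'(x)
        x = x - alpha * grad # step against the gradient
    return x

x_min = gradient_descent(x0=0.0, alpha=0.1, steps=100)
print(x_min)  # converges toward 4
```

The minus sign is the answer to "why −α?": the gradient points uphill, so we step in the opposite direction.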
Stochastic Gradient Descent
q In backpropagation, we typically evaluate the gradient of the error function
o The gradient must be evaluated lots and lots of times (many iterations)
o And the error function depends on lots and lots of variables
q In SGD, the error is based on one sample
q In mini-batch, the error is based on a small number of samples at each iteration
Mini-Batch Gradient Descent
q Mini-batch is commonly used in backpropagation
o More efficient, but no guarantee that each step descends
o On average, we move in the right direction
Last Word on Mini-Batch
q The batch size need not be constant
o Intuitively, use a larger batch when more info is useful (not so easy to quantify…)
q In mini-batch, a bumpy path is a good thing
o It might allow us to escape from one local minimum to a better local minimum
o In this sense, mini-batch is like a form of bagging, as used in random forests
q Mini-batches/SGD are also analogous to the SMO algorithm used to train SVMs
All Together Now
q We’ll define an MLP, and show how to train it using backpropagation
o We train an MLP to model the XOR function
q We derive code for the forward pass
o Use reverse mode AD for the backward pass
q The example we consider here is continued in the homework problems
o Specifically, training and testing
MLP
q Consider this MLP, with inputs X0 and X1
q Let the weights be w0, w1, …, w5
q And write the hidden and output values in terms of the weights
q This MLP is the same as a composition of weighted sums and activation functions
MLP for XOR
q Train the MLP on a (generalized) version of the XOR function F(X0, X1)
o Where 0 ≤ X0, X1 ≤ 1
o Easy to generate training/test data
q Decision boundary for F(X0, X1)
Error Function
q Consider one training sample: (X0, X1) and Z, where Z = F(X0, X1)
q Let w = (w0, w1, …, w5) be the weights
o We’ll measure the error E(w) as 1/2 the squared Euclidean distance between the MLP output and Z
q Next, the forward and backward pass…
Forward Pass
q Pseudo-code for the error E(w)
q Note that the code depends on only one sample, that is, (X0, X1) and Z
Backward Pass
q This comes from reverse mode AD
o Details are in the book
q The gradient is (dv0, dv1, …, dv5)
q Now what?
Backpropagation Algorithm
q Since the error involves only 1 sample, this is stochastic gradient descent (SGD)
1. Select a learning rate: α > 0
2. Select initial weights w = (w0, w1, …, w5)
3. Iterate through the training data (repeatedly) and for each sample…
i. Compute the vi using the forward pass
ii. Compute the dvi using the backward pass
iii. Update the weights: wi = wi − α dvi
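The whole algorithm can be sketched end to end. The slides do not show every detail of the network, so this sketch assumes: two sigmoid hidden units with bias terms, a linear output, weights w0..w5 as on the error-function slide, error E = (z − Z)²/2, and a generalized XOR target Z = 1 exactly when one of the two inputs exceeds 0.5 (all of these are assumptions for illustration).

```python
import math
import random

# Backpropagation / SGD sketch for a small 2-2-1 MLP on generalized XOR.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

random.seed(1)
w = [random.uniform(-1, 1) for _ in range(6)]  # w0..w3 hidden, w4..w5 output
b = [0.0, 0.0]                                 # hidden biases (assumed)
alpha = 0.5                                    # learning rate

def forward(x0, x1):
    h0 = sigmoid(w[0] * x0 + w[1] * x1 + b[0])
    h1 = sigmoid(w[2] * x0 + w[3] * x1 + b[1])
    return h0, h1, w[4] * h0 + w[5] * h1       # hidden values and output z

def target(x0, x1):  # assumed generalized XOR: 1 iff exactly one input > 0.5
    return 1.0 if (x0 > 0.5) != (x1 > 0.5) else 0.0

def avg_error(samples):
    return sum(0.5 * (forward(x0, x1)[2] - target(x0, x1)) ** 2
               for x0, x1 in samples) / len(samples)

test_pts = [(random.random(), random.random()) for _ in range(200)]
err_before = avg_error(test_pts)

for _ in range(20000):  # SGD: one random sample per weight update
    x0, x1 = random.random(), random.random()
    h0, h1, z = forward(x0, x1)
    dz = z - target(x0, x1)            # dE/dz; backward pass done by hand
    g0 = dz * w[4] * h0 * (1 - h0)     # back through sigmoid at h0
    g1 = dz * w[5] * h1 * (1 - h1)
    grads = [g0 * x0, g0 * x1, g1 * x0, g1 * x1, dz * h0, dz * h1]
    for i in range(6):
        w[i] -= alpha * grads[i]       # wi = wi - alpha * dvi
    b[0] -= alpha * g0
    b[1] -= alpha * g1

err_after = avg_error(test_pts)
print(err_before, err_after)  # training should reduce the average error
```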
BP for XOR: Last Word
q In this example, we used a 2-layer MLP to model a (generalized) XOR function
q We defined the error function on 1 sample
o So, in this case, BP is using SGD
o Derived the forward pass and backward pass
q Outlined the backpropagation algorithm
q In practice, we would typically use mini-batch instead of SGD
Conclusion
q Discussed ANNs, with the emphasis on training via backpropagation
q Backpropagation (BP) includes…
o Forward pass
o Backward pass (reverse mode AD)
q And BP relies on gradient descent
q Also discussed history, background, and “deep” connections to deep learning
HMM Training Using Gradient Ascent
HMM Training
q Here, we want to show that HMM training can be done via gradient ascent
o As opposed to Baum-Welch re-estimation
q Recall that an HMM is denoted λ = (A, B, π)
o A = {aij} is N x N, the state transition probs
o B = {bj(k)} is N x M, the observation probs
o π = {πi} is 1 x N, the initial state probs
q These matrices are all row stochastic
HMM Notation
q Slightly different than previously…
Weight Matrices
q Define an N x N weight matrix W
o Used to update the A matrix
q Define an N x M weight matrix V
o Used to update the B matrix
q Use “softmax” to update A and B
q A, B will be row stochastic for any W, V
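The point of the softmax step is that gradient ascent can move W and V freely while A and B stay valid probability matrices. A minimal sketch (with arbitrary made-up weight values):

```python
import math

# Map an unconstrained weight matrix row by row through softmax,
# producing a row-stochastic matrix: entries positive, rows sum to 1.
def softmax_rows(W):
    A = []
    for row in W:
        exps = [math.exp(w) for w in row]
        s = sum(exps)
        A.append([e / s for e in exps])
    return A

W = [[0.2, -1.0], [1.5, 1.5]]   # arbitrary unconstrained weights
A = softmax_rows(W)
print([sum(row) for row in A])  # each row sums to 1
```

Gradient steps on W can therefore never violate the row-stochastic constraint on A, which is exactly what the slide claims.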
Derivatives
q To train an HMM, we want to maximize P(O|λ)
o Let LO(λ) = P(O|λ) be the likelihood function
q We will solve by gradient ascent, based on the weight matrices W and V
q Here, we focus on W, since V is similar
q First, it is not difficult to show that the partial derivatives of the softmax entries aij with respect to the wij have a simple closed form
More Derivatives
q Consider a given state sequence X = (X0, X1, …, XT−1)
q Let P(O, X|λ) be the probability of the observations and this state sequence
q Then LO(λ) is the sum of P(O, X|λ) over all state sequences X
q It can be shown that the derivative with respect to wij involves gij(X)
q Where gij(X) is the number of transitions from state i to state j in X
More Derivatives (continued)
q We have the derivative expressions from the previous slide
q Where gij(X) is the number of transitions from state i to state j in X
q It follows that the derivative of LO(λ) involves the expected value of gij(X); call this equation (#)
q Where this expectation is the expected number of transitions from state i to state j
HMM Gradient Ascent
q Define a quantity as the sum over j of the terms on the previous slide
q By the chain rule and (#), we obtain the derivative of LO(λ) with respect to wij
q But we want log LO(λ), not LO(λ)
HMM Gradient Ascent
q We have the derivative of log LO(λ)
q This gives us a way to update W (and V is similar) via gradient ascent
o Where ρ is the learning rate, and C(O) = P(O|λ)
q Is this practical?
HMM Training
q Yes, it is (almost) practical!
q And the required quantities can be computed efficiently
q However, C(O) will cause overflow
q So, instead we use a rescaled version of the update
HMM Training Example
q Our favorite HMM example…
o English text, M = 27, N = 2, T = 50,000
q With Baum-Welch, the hidden states correspond to what?
o Consonants and vowels
q Let’s train using gradient ascent
q Choose τ = 2.5 and ρ = 12.0
q Results on the next couple of slides…
HMM English Text Example
q Initial and final W and V (transpose)
HMM English Text Example
q Initial and final A and B (transpose)
Gradient Ascent and HMM Training
q An alternative and reasonably effective way to train an HMM
q Is it better? Maybe not in general
q But the gradient ascent version of HMM training has an “online” mode
o We can update the model as new observations become available, without retraining from scratch
q Might be useful in some cases…
Online HMM Training
q Let O1 and O2 be training sequences
q In “online” mode, train a model on O1
o Then, when O2 becomes available, use O2 to update the model obtained from O1
q No need to retrain in “batch” mode on the observation sequence O = (O1, O2)
o Instead, simply update the existing O1 model to incorporate O2
ANN for HMM Training
Training HMM with ANN
q An HMM includes a (hidden) Markov process
q An HMM is of the form λ = (A, B, π)
q Notation: A = {aij}, B = {bj(k)}, π = {πi}
q Here, we assume you know about HMMs
Training HMM with ANN
q Recall the standard method to train an HMM
o Baum-Welch re-estimation
o A discrete hill climb technique
q We want to show that the HMM training problem can be viewed as an ANN
o We’ll use Lagrange multipliers for the HMM training problem
o View it as a Lagrangian neural network
Another HMM Graph
q To train an HMM, maximize P(O | λ)
q Consider an HMM with N = 2 and T = 3
o Computing P(O | λ), viewed graphically
HMM Graph (Again)
q Let the xi denote the HMM parameters
q Assuming O = (1, 0, 2), the graph is as shown
q Not too friendly for the BP algorithm
HMM Forward Algorithm
q The forward algorithm can be derived from a graph like that on the previous slide
o The forward algorithm is also known as the α-pass
q We have the recursive equations for the αt(i)
Forward Algorithm
q For the special case N = 2 and O = (1, 0, 2)
o Using the xi notation, the forward algorithm is:
q This code computes P(O | λ)
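The α-pass is short enough to sketch in full. The model below matches the slide's sizes (N = 2 states, O = (1, 0, 2)), but the probability values in A, B, and π are made up for illustration.

```python
# Forward algorithm (alpha-pass): computes P(O | lambda) for an HMM.
def forward_prob(A, B, pi, O):
    N = len(pi)
    # alpha_0(i) = pi_i * b_i(O_0)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(O_t)
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)  # P(O | lambda) = sum_i alpha_{T-1}(i)

A  = [[0.7, 0.3], [0.4, 0.6]]            # made-up row-stochastic values
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
p = forward_prob(A, B, pi, (1, 0, 2))
print(p)
```

In practice this computation is scaled at each step to avoid underflow, which is related to the overflow issue mentioned in the gradient ascent slides.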
Constraints
q When training, there are constraints
q In terms of the xi, equality constraints (the row sums must equal 1)
q And inequality constraints (all probabilities must be nonnegative)
Lagrangian
q Ignoring the inequality constraints, form the Lagrangian L(x, u)
q Where f(x) comes from the forward pass for P(O | λ) and the gi(x) are on the previous slide
q Solve by finding (x, u) such that the gradient of the Lagrangian vanishes
Backpropagation
q We can solve this Lagrange multipliers problem using backpropagation
o Forward pass (the f(x) and gi(x) code)
o Backward pass (reverse mode AD)
q But the solution occurs at a saddle point
q So, in gradient descent
o Maximize over one set of variables
o Minimize over another set of variables
Lagrange Neural Network
q We can view this Lagrangian as a neural network
o Here, hi(x) = gi(x) − 1
o Not a typical neural network problem…
References
q See the following references, and others mentioned in these slides
q M. Stamp, Deep thoughts on deep learning, https://www.cs.sjsu.edu/~stamp/ML/files/ann.pdf
q M. Stamp, Gradient ascent for online HMM training