LSTM Recurrent Neural Networks
Brandeis CS 114, Spring 2020
Borrows significantly from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ and Ray Mooney @ UT Austin
Multi-Layer Feed-Forward Networks
• Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult.
• A typical multi-layer network consists of an input, hidden, and output layer, each fully connected to the next, with activation feeding forward.
• The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer.
Hill-Climbing in Multi-Layer Nets
• Since "greed is good," perhaps hill-climbing can be used to learn multi-layer networks in practice, although its theoretical limits are clear.
• However, to do gradient descent, we need the output of a unit to be a differentiable function of its input and weights.
• Standard linear threshold function is not differentiable at the threshold.
[Figure: linear threshold unit output oi jumping from 0 to 1 at threshold Tj, as a function of netj]
Differentiable Output Function
• Need a non-linear output function to move beyond linear functions.
  – A multi-layer linear network is still linear.
• Standard solution is to use the non-linear, differentiable sigmoidal "logistic" function.
[Figure: sigmoid output rising smoothly from 0 to 1 as a function of netj, centered at threshold Tj]
• Can also use tanh or a Gaussian output function.
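For reference, the standard forms of the logistic and tanh activations (the formulas are not printed on the slide, but these are the usual definitions):

$$\sigma(net_j) = \frac{1}{1 + e^{-net_j}} \qquad \tanh(net_j) = \frac{e^{net_j} - e^{-net_j}}{e^{net_j} + e^{-net_j}}$$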
Gradient Descent
• Define objective to minimize error:
  E = ½ Σd∈D Σk∈K (tkd – okd)²
  where D is the set of training examples, K is the set of output units, and tkd and okd are, respectively, the teacher and current output for unit k for example d.
• The derivative of a sigmoid unit with respect to net input is:
  ∂oj/∂netj = oj(1 – oj)
• Learning rule to change weights to minimize error is:
  Δwji = –η ∂E/∂wji
Backpropagation Learning Rule
• Each weight changed by:
  Δwji = η δj oi
  where η is a constant called the learning rate, oi is the output of unit i (the input along weight wji), tj is the correct teacher output for unit j, and δj is the error measure for unit j:
  δj = oj(1 – oj)(tj – oj)  for output units
  δj = oj(1 – oj) Σk δk wkj  for hidden units
Error Backpropagation
• First calculate error of output units and use this to change the top layer of weights.
  Current output: oj = 0.2; correct output: tj = 1.0
  Error: δj = oj(1 – oj)(tj – oj) = 0.2(1 – 0.2)(1.0 – 0.2) = 0.128
• Update weights into unit j.
Error Backpropagation
• Next calculate error for hidden units based on the errors of the output units they feed into.
Error Backpropagation
• Finally update the bottom layer of weights based on the errors calculated for the hidden units.
Backpropagation Training Algorithm
Create the 3-layer network with H hidden units with full connectivity between layers. Set weights to small random real values.
Until all training examples produce the correct value (within ε), or mean squared error ceases to decrease, or other termination criteria:
  Begin epoch
  For each training example, d, do:
    Calculate network output for d's input values
    Compute error between current output and correct output for d
    Update weights by backpropagating error and using learning rule
  End epoch
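A minimal NumPy sketch of this loop for a 2–H–1 network; the XOR data, layer sizes, learning rate, and epoch count are illustrative assumptions rather than anything specified on the slide:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs (XOR, assumed)
T = np.array([[0], [1], [1], [0]], dtype=float)               # correct (teacher) outputs
H, eta = 4, 0.5                                               # hidden units, learning rate

W1, b1 = rng.normal(0, 0.5, (2, H)), np.zeros(H)              # input -> hidden weights
W2, b2 = rng.normal(0, 0.5, (H, 1)), np.zeros(1)              # hidden -> output weights

for epoch in range(10000):
    for x, t in zip(X, T):
        h = sigmoid(x @ W1 + b1)                  # calculate network output for d's input
        o = sigmoid(h @ W2 + b2)
        delta_o = o * (1 - o) * (t - o)           # error of output unit
        delta_h = h * (1 - h) * (W2 @ delta_o)    # error backpropagated to hidden units
        W2 += eta * np.outer(h, delta_o); b2 += eta * delta_o   # update top layer
        W1 += eta * np.outer(x, delta_h); b1 += eta * delta_h   # update bottom layer
```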
Comments on Training Algorithm
• Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many large networks on real data.
• Many epochs (thousands) may be required, hours or days of training for large networks.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts).
  – Take results of trial with lowest training set error.
  – Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).
Hidden Unit Representations
• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
• On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc.
• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.
Over-Training Prevention
• Running too many epochs can result in over-fitting.
[Figure: error vs. number of training epochs; error on the training data keeps falling while error on held-out test data eventually rises]
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
• To avoid losing training data for validation:
  – Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
  – Train final network on complete training set for this many epochs.
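A minimal sketch of the early-stopping rule above; the two callables are hypothetical stand-ins for whatever training and validation code is in use, and the small patience window is an added assumption (the slide simply says to stop when validation error increases):

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=3):
    """train_one_epoch and validation_error are hypothetical callables supplied by the caller."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one full pass of backprop over the training set
        error = validation_error()             # error on the held-out validation set
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:   # additional epochs only increased validation error
            break
    return best_epoch                          # number of epochs that generalized best
```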
Determining the Best Number of Hidden Units
• Too few hidden units prevents the network from adequately fitting the data.
• Too many hidden units can result in over-fitting.
[Figure: error vs. number of hidden units; training-set error keeps falling while test-set error eventually rises]
• Use internal cross-validation to empirically determine an optimal number of hidden units.
Recurrent Neural Networks (RNN)
• Add feedback loops where some units' current outputs determine some future network inputs.
• RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.
Simple Recurrent Network (SRN)
• Initially developed by Jeff Elman ("Finding structure in time," 1990).
• Additional input to hidden layer is the state of the hidden layer in the previous time step.
(Figure from http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
Unrolled RNN
• Behavior of an RNN is perhaps best viewed by "unrolling" the network over time.
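A minimal NumPy sketch of what "unrolling" means for a simple (Elman-style) RNN: the same weights are reused at every time step, with the previous hidden state fed back in as an extra input. Sizes and random values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 5, 4

W_xh = rng.normal(0, 0.1, (input_size, hidden_size))    # input -> hidden weights
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))   # hidden -> hidden (recurrent) weights
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))             # an input sequence
h = np.zeros(hidden_size)                               # initial hidden state

hidden_states = []
for t in range(seq_len):                                # the "unrolled" computation over time
    h = np.tanh(xs[t] @ W_xh + h @ W_hh + b_h)          # same weights applied at every step
    hidden_states.append(h)
```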
Training RNNs
• RNNs can be trained using "backpropagation through time."
• Can be viewed as applying normal backprop to the unrolled network.
[Figure: unrolled network with training inputs at each step, training outputs y0, y1, y2, …, yt, and errors backpropagated through the unrolled copies]
Vanishing/Exploding Gradient Problem
• Backpropagated errors multiply at each layer, resulting in exponential decay (if derivative is small) or growth (if derivative is large).
• Makes it very difficult to train deep networks, or simple recurrent networks over many time steps.
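One way to make the multiplication explicit (this derivation is standard, though not spelled out on the slide): the error signal that reaches an early state is a product with one factor per time step,

$$\frac{\partial E}{\partial h_0} \;=\; \frac{\partial E}{\partial h_t} \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}$$

If these factors consistently have norm below 1, the product shrinks exponentially in t (vanishing); if consistently above 1, it grows exponentially (exploding).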
Long Distance Dependencies
• It is very difficult to train SRNs to retain information over many time steps.
• This makes it very difficult to learn SRNs that handle long-distance dependencies, such as subject-verb agreement.
Long Short Term Memory
• LSTM networks add gating units to each memory cell:
  – Forget gate
  – Input gate
  – Output gate
• Prevents vanishing/exploding gradient problem and allows network to retain state information over longer periods of time.
LSTM Network Architecture
Cell State
• Maintains a vector Ct that is the same dimensionality as the hidden state, ht.
• Information can be added or deleted from this state vector via the forget and input gates.
Cell State Example
• Want to remember person & number of a subject noun so that it can be checked to agree with the person & number of the verb when it is eventually encountered.
• Forget gate will remove existing information about a prior subject when a new one is encountered.
• Input gate "adds" in the information for the new subject.
Forget Gate
• Forget gate computes a 0–1 value using a logistic sigmoid output function from the input, xt, and the prior hidden state, ht−1:
  ft = σ(Wf · [ht−1, xt] + bf)
• Multiplicatively combined with cell state, "forgetting" information where the gate outputs something close to 0.
Hyperbolic Tangent Units
• Tanh can be used as an alternative nonlinear function to the sigmoid logistic (0–1) output function.
• Used to produce thresholded output between –1 and 1.
Input Gate
• First, determine which entries in the cell state to update by computing a 0–1 sigmoid output:
  it = σ(Wi · [ht−1, xt] + bi)
• Then determine what amount to add/subtract from these entries by computing a tanh output (valued –1 to 1) function of the input and hidden state:
  C̃t = tanh(WC · [ht−1, xt] + bC)
Updating the Cell State
• Cell state is updated by using componentwise vector multiply to "forget" and vector addition to "input" new information:
  Ct = ft ⊙ Ct−1 + it ⊙ C̃t
Output Gate
• Hidden state is updated based on a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
• Output gate computes a sigmoid function of the input and current hidden state to determine which elements of the cell state to "output":
  ot = σ(Wo · [ht−1, xt] + bo),  ht = ot ⊙ tanh(Ct)
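Putting the three gates together, here is a minimal NumPy sketch of a single LSTM cell step following the gate equations above (the colah-blog formulation the slides borrow from); sizes, random parameters, and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # input gate
    C_tilde = np.tanh(W_C @ z + b_C)             # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde           # forget old info, add new info
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state: filtered cell state
    return h_t, C_t

# Illustrative sizes and random parameters
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_f, W_i, W_C, W_o = (rng.normal(0, 0.1, shape) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(hidden_size) for _ in range(4))

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):     # run the cell over a short input sequence
    h, C = lstm_step(x_t, h, C, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
```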
Overall Network Architecture
• Single or multilayer networks can compute LSTM inputs from problem inputs and problem outputs from LSTM outputs.
  – Problem input It: e.g. a word as a "one hot" vector
  – LSTM input: e.g. a word "embedding" with reduced dimensionality
  – Problem output Ot: e.g. a POS tag as a "one hot" vector
LSTM Training
• Trainable with backprop derivatives such as:
  – Stochastic gradient descent (randomize order of examples in each epoch) with momentum (bias weight changes to continue in same direction as last update).
  – ADAM optimizer (Kingma & Ba, 2015)
• Each cell has many parameters (Wf, Wi, WC, Wo)
  – Generally requires lots of training data.
  – Requires lots of compute time that exploits GPU clusters.
General Problems Solved with LSTMs
• Sequence labeling
  – Train with supervised output at each time step, computed using a single or multilayer network that maps the hidden state (ht) to an output vector (Ot).
• Language modeling
  – Train to predict the next input (Ot = It+1)
• Sequence (e.g. text) classification
  – Train a single or multilayer network that maps the final hidden state (hn) to an output vector (O).
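As a sketch of the difference between the per-step head (sequence labeling) and the final-state head (sequence classification), using PyTorch; all sizes and module names here are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

input_size, hidden_size, num_tags, num_classes = 50, 64, 17, 2
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

x = torch.randn(8, 20, input_size)          # batch of 8 sequences, 20 time steps each
outputs, (h_n, c_n) = lstm(x)               # outputs: hidden state at every step, shape (8, 20, 64)

# Sequence labeling: map the hidden state at *every* time step to an output vector.
tagger_head = nn.Linear(hidden_size, num_tags)
tag_scores = tagger_head(outputs)           # (8, 20, num_tags): one prediction per step

# Sequence classification: map only the *final* hidden state to an output vector.
classifier_head = nn.Linear(hidden_size, num_classes)
class_scores = classifier_head(h_n[-1])     # (8, num_classes): one prediction per sequence
```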
Sequence to Sequence Transduction (Mapping)
• Encoder/Decoder framework: one LSTM maps the input sequence to a "deep vector", then another LSTM maps this vector to an output sequence.
  I1, I2, …, In → Encoder LSTM → hn → Decoder LSTM → O1, O2, …, Om
• Train model "end to end" on I/O pairs of sequences.
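A minimal PyTorch sketch of this encoder/decoder wiring; the sizes, the use of the encoder's final state to initialize the decoder, and the feeding of the target sequence into the decoder are illustrative assumptions (real systems add embeddings, attention, and a training loop):

```python
import torch
import torch.nn as nn

in_size, out_size, hidden_size = 32, 32, 64
encoder = nn.LSTM(in_size, hidden_size, batch_first=True)
decoder = nn.LSTM(out_size, hidden_size, batch_first=True)
project = nn.Linear(hidden_size, out_size)    # maps decoder states to output vectors

src = torch.randn(4, 10, in_size)             # I_1 ... I_n  (batch of 4, n = 10)
tgt = torch.randn(4, 7, out_size)             # O_1 ... O_m  (m = 7), fed to the decoder during training

_, (h_n, c_n) = encoder(src)                  # encode the whole input sequence into the "deep vector" h_n
dec_out, _ = decoder(tgt, (h_n, c_n))         # decode, starting from the encoder's final state
predictions = project(dec_out)                # one output vector per decoder time step
```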
Summary of LSTM Application Architectures
[Figure: example input/output configurations for image captioning, video activity recognition, text classification, video captioning, machine translation, POS tagging, and language modeling]
Successful Applications of LSTMs
• Speech recognition: language and acoustic modeling
• Sequence labeling
  – POS tagging: https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
  – NER
  – Phrase chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to sequence
  – Machine translation (Sutskever, Vinyals, & Le, 2014)
  – Video captioning (input sequence of CNN frame outputs)
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and the hidden layers at each time step are concatenated to form the cell output.
[Figure: forward and backward LSTM layers over inputs xt−1, xt, xt+1 producing concatenated hidden states ht−1, ht, ht+1]
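A small PyTorch illustration of the concatenation (sizes are illustrative assumptions): with bidirectional=True, each time step's output is the forward and backward hidden states stacked side by side.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
x = torch.randn(2, 9, 16)                     # batch of 2 sequences, 9 time steps
outputs, _ = bilstm(x)
print(outputs.shape)                          # torch.Size([2, 9, 64]): forward ⊕ backward at each step
```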
Gated Recurrent Unit (GRU)
• Alternative RNN to LSTM that uses fewer gates (Cho et al., 2014)
  – Combines forget and input gates into an "update" gate.
  – Eliminates the cell state vector.
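For reference, a standard formulation of the GRU (not printed on the slide; bias terms omitted as in the colah-blog presentation), with update gate z_t, reset gate r_t, and no separate cell state:

$$\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\
\tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}$$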
GRU vs. LSTM
• GRU has significantly fewer parameters and trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each has problems on which it works better.
Attention
• For many applications, it helps to add "attention" to RNNs.
• Allows network to learn to attend to different parts of the input at different time steps, shifting its attention to focus on different aspects during its processing.
• Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
• In MT, allows focusing attention on different parts of the source sentence when generating different parts of the translation.
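As a minimal sketch of the idea, here is a simple dot-product attention step (one common scoring choice; the slides do not commit to a particular scoring function): the decoder's current state scores each encoder state, the scores are normalized with a softmax, and the weighted sum becomes the context used at that step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, src_len = 8, 6
encoder_states = rng.normal(size=(src_len, hidden_size))   # one vector per source position
decoder_state = rng.normal(size=hidden_size)               # current decoder hidden state

scores = encoder_states @ decoder_state                    # how relevant is each source position right now?
weights = softmax(scores)                                  # attention distribution over the source
context = weights @ encoder_states                         # weighted sum: what the model "attends to" this step
```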
Attention for Image Captioning (Xu, et al. 2015)
Conclusions
• By adding "gates" to an RNN, we can prevent the vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state information longer and handle long-distance dependencies.
• Recent impressive results on a range of challenging NLP problems.