VISUALIZING AND UNDERSTANDING RECURRENT NEURAL NETWORKS Presented By: Collin Watts Written By: Andrej Karpathy, Justin Johnson, Li Fei-Fei
PLAN OF ATTACK What we’re going to cover: • Overview • Some Definitions • Experimental Analysis • Lots of Results • The Implications of the Results • Case Studies • Meta-Analysis
SO, WHAT WOULD YOU SAY YOU DO HERE... • This paper set out to analyze both the most efficient implementation of an RANN (we’ll get there) and the internal mechanisms that achieve its results. • Chose 3 different variants of RANNs: • Basic RANNs • LSTM RANNs • GRU RANNs • Used character-level language modeling as the test problem, as it is apparently strongly representative of other analyses.
DEFINITIONS • RECURRENT NEURAL NETWORK • Subset of Artificial Neural Networks • Still uses feedforward and backpropagation • Allows nodes to form cycles, creating the potential for storage of information within the network • Used in applications such as handwriting analysis, video analysis, translation, and other interpretation of various human tasks • Difficult to train
DEFINITIONS • RECURRENT NEURAL NETWORK (Cont.) • Uses a 2-dimensional node setup, with time as one axis and depth of the nodes as another • Nodes are referred to as h_t^l (layer l, time step t), with l = 0 being the input nodes and l = L being the output nodes. • Intermediate vectors are calculated as a function of both the previous time step and the previous layer. This results in the following recurrence:
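The recurrence itself did not survive in this transcript. As a reconstruction from the paper's definition of the basic recurrent network (not from the slide), it can be written as below, where h_t^0 is taken to be the input vector x_t and W^l is the weight matrix of layer l, applied to the stacked vector of the layer below and the previous time step:

```latex
h_t^l = \tanh\!\left( W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \right)
```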
MORE DEFINITIONS! • LONG SHORT-TERM MEMORY VARIANT • Variant of the RANN designed to mitigate problems with backpropagation within a RANN. • Adds a memory vector to each node. • Every time step, an LSTM can choose to read, write to, or reset the memory vector, following a series of gating mechanisms. • Has the effect of preserving gradients across memory cells for long periods. • i, f, and o are the gates for whether the memory cell is updated, reset, or read, respectively, while g allows for additive changes to the memory cell.
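The gating described above can be sketched for a single memory cell as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `lstm_cell_update` is hypothetical, and the gate pre-activations are passed in directly rather than computed from a weight matrix.

```python
import math


def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_update(c_prev, i_pre, f_pre, o_pre, g_pre):
    """Update one LSTM memory cell from gate pre-activations (assumed inputs).

    Returns the new memory value c and the exposed hidden value h.
    """
    i = sigmoid(i_pre)    # input gate: how much of g is written
    f = sigmoid(f_pre)    # forget gate: how much old memory is kept
    o = sigmoid(o_pre)    # output gate: how much memory is read out
    g = math.tanh(g_pre)  # candidate value in (-1, 1), added to memory
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return c, h


# Forget gate open, input gate closed: the cell simply carries its value.
c, h = lstm_cell_update(c_prev=0.7, i_pre=-50, f_pre=50, o_pre=50, g_pre=0.3)
print(c, h)
```

Because the memory update is a sum rather than a repeated squashing, a cell whose forget gate stays near 1 passes its value (and its gradient) along almost unchanged, which is the gradient-preserving effect the slide mentions.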
HALF A DEFINITION... • GATED RECURRENT UNIT • Not well elaborated on in the paper... • Given explanation is that “The GRU has the interpretation of computing a candidate hidden vector and then smoothly interpolating towards it, as gated by z.” • My interpretation: rather than having explicit access & control gates, this follows a more analog approach.
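The "smooth interpolation" in that quote can be sketched in a few lines. This is only an illustration: the function name `gru_interpolate` is hypothetical, the candidate vector is taken as given rather than computed, and the convention that z = 1 moves fully to the candidate is an assumption (some write-ups swap the roles of z and 1 - z).

```python
def gru_interpolate(h_prev, h_candidate, z):
    """Blend the previous hidden vector with a candidate, element by element.

    z[k] = 0 keeps the old value, z[k] = 1 takes the candidate,
    and values in between move part of the way.
    """
    return [
        (1.0 - zk) * hp + zk * hc
        for hp, hc, zk in zip(h_prev, h_candidate, z)
    ]


print(gru_interpolate([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.5]))
```

Unlike the LSTM, there is no separate memory vector here: the single gate z decides, per unit, how far the hidden state moves toward the new candidate.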
EXPERIMENTAL ANALYSIS (SCIENCE!) • As previously stated, the researchers used character-level language modeling as a basis of comparison. • Trained each network to predict the following character in a sequence. • Used a Softmax classifier at each time step. • Encoded every character as a 1-of-K (one-hot) vector and fed the sequence to the network to get a sequence of hidden vectors in the last layer of the network. • These were projected to one output per possible character, interpreted as (unnormalized) log probabilities of each character being the next character in the sequence.
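Since the results later in the deck are reported as cross-entropy loss, a small sketch of how that number comes out of a Softmax classifier may help. The helper names `log_softmax` and `cross_entropy` are hypothetical, and the scores are made-up inputs, not values from the paper.

```python
import math


def log_softmax(scores):
    """Turn one score per character into normalized log probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def cross_entropy(score_rows, targets):
    """Mean negative log probability assigned to the true next characters.

    score_rows[t] holds the scores at time step t; targets[t] is the
    index of the character that actually came next.
    """
    total = 0.0
    for scores, target in zip(score_rows, targets):
        total -= log_softmax(scores)[target]
    return total / len(targets)


# A model with no preference among 4 characters scores log(4) per step.
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]), math.log(4))
```

Lower is better: a model that puts all its probability on the correct next character would score 0.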
EXPERIMENTAL ANALYSIS (SCIENCE!) • Rejected the use of two other datasets (the Penn Treebank dataset and the Hutter Prize 100 MB Wikipedia dataset) on the basis of them containing both standard English language and markup. • The stated reason for rejecting them was to use a controlled setting for all types of neural networks, rather than compete for best results on these datasets. • Decided on Leo Tolstoy’s War and Peace, consisting of 3,258,246 characters, and the source code of the Linux Kernel (randomized across files and then concatenated into a single 6,206,996-character file)
EXPERIMENTAL ANALYSIS (SCIENCE!) • War and Peace was split 80/10/10 for training/validation/testing. • The Linux Kernel was split 90/5/5 for training/validation/testing. • Tested the following properties for each of the 3 RANNs: • Number of layers (1, 2, or 3) • Number of parameters (64, 128, 256, 512 cell counts)
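To make the two splits concrete, the sketch below turns the percentages into character counts. The function `split_sizes` is hypothetical, and the exact boundaries the authors used are not given in the slides, so the rounding here (floor, with the remainder going to the test set) is an assumption.

```python
def split_sizes(n_chars, train_pct, val_pct):
    """Return (train, validation, test) character counts for a dataset.

    Percentages are integers; whatever is left after the training and
    validation portions goes to the test portion.
    """
    n_train = n_chars * train_pct // 100
    n_val = n_chars * val_pct // 100
    n_test = n_chars - n_train - n_val
    return n_train, n_val, n_test


print("War and Peace:", split_sizes(3258246, 80, 10))
print("Linux Kernel: ", split_sizes(6206996, 90, 5))
```

Under these assumptions the validation and test portions come out at roughly 326 thousand characters each for War and Peace and roughly 310 thousand each for the Linux Kernel, so the held-out sets are of similar size even though the percentages differ.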
RESULTS (AND THE WINNER IS...) • Test set cross-entropy loss:
IMPLICATIONS OF RESULTS (BUT WHY...) • The researchers paid attention to several characteristics beyond the raw results. One of their stated goals was to explain why these emergent properties exist. • Interpretable, long-range LSTM cells • Have been theorized to exist, but never demonstrated. • The paper demonstrates them experimentally. • Truncated backpropagation (used for performance gains as well as combating overfitting) limits learning of dependencies more than X characters away, where X is the depth of the backpropagation. • These LSTM cells have been able to overcome
VISUALIZATIONS OF RESULTS (BUT WHY...) • Text color is a visualization of tanh(c) where -1 is red and +1 is blue.
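A coloring of this kind can be sketched as below. This is an assumption-heavy illustration rather than the authors' plotting code: the function name `cell_color` is hypothetical, and fading through white at tanh(c) = 0 is a choice made here, since the slide only fixes the two endpoints.

```python
import math


def cell_color(c):
    """Map a memory cell value c to an (r, g, b) tuple via tanh(c).

    tanh(c) = -1 gives pure red, +1 gives pure blue, and 0 gives white
    (the white midpoint is an assumption of this sketch).
    """
    v = math.tanh(c)  # squashes any cell value into [-1, 1]
    if v < 0:
        fade = int(round(255 * (1 + v)))
        return (255, fade, fade)
    fade = int(round(255 * (1 - v)))
    return (fade, fade, 255)


for value in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(value, cell_color(value))
```

Coloring each character of the text by one cell's value is what lets a reader see, for example, a cell that stays in one state inside a quotation and switches when the quotation ends.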
IMPLICATIONS OF RESULTS (BUT WHY...) • Also paid attention to gate activations (remember the gates are what cause interactions with the memory node) in LSTMs. • Defined the ideas of “left saturated” and “right saturated” • Left saturated: the gate’s activation is less than 0.1. • Right saturated: the gate’s activation is more than 0.9. • The visualizations show the fraction of time each gate spends in either state. • Of particular note: • Right saturated forget gate cells (cells remembering values) • No left saturated forget gate cells (no cells being
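The saturation statistics can be computed from a recorded trace of one gate's activations, as in the sketch below. The function name `saturation_fractions` and the sample trace are hypothetical; only the 0.1 and 0.9 thresholds come from the slide.

```python
def saturation_fractions(activations, low=0.1, high=0.9):
    """Fraction of time steps a gate spends left and right saturated.

    activations: one gate's activation at each time step, each in [0, 1].
    Returns (left_fraction, right_fraction).
    """
    n = len(activations)
    left = sum(1 for a in activations if a < low) / n
    right = sum(1 for a in activations if a > high) / n
    return left, right


# Made-up trace: one step nearly closed, two nearly open, one in between.
print(saturation_fractions([0.05, 0.95, 0.5, 0.99]))
```

A forget gate with a right fraction near 1 is almost always open, meaning its cell keeps its value over long spans of text.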
VISUALIZATIONS OF RESULTS (BUT WHY... LSTMS)
VISUALIZATIONS OF RESULTS (BUT WHY... GRUS)
ERROR ANALYSIS OF RESULTS • Compared against two standard n-gram models to analyze the LSTM’s effectiveness. • An error was defined as the model assigning a probability below 0.5 to the character that actually came next. • Found that while the models shared many of the same errors, there were distinct segments that each one failed differently on.
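The error definition and the shared-versus-distinct comparison can be sketched as follows. The function name `find_errors` and the probability lists are made up for illustration; they are not data from the paper.

```python
def find_errors(true_char_probs, threshold=0.5):
    """Positions where the model gave the true next character < threshold.

    true_char_probs[k] is the probability the model assigned to the
    character that actually appeared at position k.
    """
    return {k for k, p in enumerate(true_char_probs) if p < threshold}


# Hypothetical per-character probabilities from two models on the same text.
lstm_probs = [0.9, 0.3, 0.6, 0.1]
ngram_probs = [0.8, 0.2, 0.4, 0.7]

lstm_errors = find_errors(lstm_probs)
ngram_errors = find_errors(ngram_probs)

print("shared errors:", sorted(lstm_errors & ngram_errors))
print("LSTM only:    ", sorted(lstm_errors - ngram_errors))
print("n-gram only:  ", sorted(ngram_errors - lstm_errors))
```

Set intersection gives the errors both models make, and the two set differences give the positions where only one of them fails, which is the kind of breakdown the slide describes.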
ERROR ANALYSIS OF RESULTS • [Figure panels: Linux Kernel, War and Peace]
ERROR ANALYSIS OF RESULTS • Found that the LSTM has significant advantages over standard n-gram models when computing the probability of special characters. In the Linux Kernel model, brackets and whitespace are predicted significantly better than in the n-gram model, because of the LSTM’s ability to keep track of relationships between opening and closing brackets. • Similarly, in War and Peace, the LSTM was able to more correctly predict carriage returns, due to the relationship being outside of the n-gram
CASE STUDY { LOOK, BRACES! } • When it comes specifically to closing brackets (“}”) in the Linux kernel, the researchers were able to analyze the performance of the LSTM versus the n-gram models. • Found that LSTM did better than n-gram for
META-ANALYSIS (THE GOOD) • The researchers were able to very effectively capture and elucidate their point via their visualizations and implications. • They seem to have proven several ideas about how RANNs work in data analysis that until now had only been theorized.
META-ANALYSIS (THE BAD) • I would have appreciated a more in-depth explanation of why they rejected the standard ANN competitive datasets. It would seem to follow that those would be a truer measure of the capabilities, which is why they are chosen in the first place. • There wasn’t a lot of explanation as to why their parameters were chosen for each RANN, or what their parameters for evaluation were. (What is test set cross-entropy loss?) • Data was split differently across each of the texts, so that the total count for validation and tests was
META-ANALYSIS (THE UGLY) • This paper does not ease the reader into understanding the ideas involved. It required reading several additional papers to get the implications of things the authors assumed the reader knew. • Some ideas were not clearly explained even after researching the related works.
FINAL SLIDE • Questions? • Comments? • Concerns? • Corrections?