Deep Learning of Algorithms Towards Universal Artificial Intelligence
Kārlis Freivalds, Renārs Liepiņš
Joint Estonian-Latvian Theory Days 2018
Supported by Latvian Council of Science project lzp-2018/1-0327
DL in Image Recognition
DL in Speech Recognition
AlphaZero achieved a superhuman level of play in Go, chess and shogi by defeating world-champion programs. https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png
Motivation for Algorithm Learning
• Synthesizing better algorithms
  – Faster, more memory efficient (faster matrix multiplication)
  – Dealing with tasks where no good solutions are known (integer factoring)
  – Better heuristics for NP-hard problems, tailored for specific use-cases
• Increased programming productivity
  – Working example: Flash Fill in MS Excel
  – Code refactoring and optimization
  – Programmer's virtual assistant
Possible Setups
• Specifying programming intent
  – Natural language
  – Formal language (specification)
  – Input-output examples
  – Execution examples
• Representing the generated program
  – Neural network (differentiable)
  – Source code in a given programming language (non-differentiable)
  – Hybrid (mixture of both)
Feed-forward Networks
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Recurrent Networks
Gated Recurrent Unit (GRU)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
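To make the GRU concrete, here is a minimal NumPy sketch of a single GRU step; the weight names and the omission of bias terms are our own simplifications, not taken from the slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    # Update gate decides how much of the new candidate to take in.
    z = sigmoid(x @ W_z + h_prev @ U_z)
    # Reset gate decides how much of the previous state feeds the candidate.
    r = sigmoid(x @ W_r + h_prev @ U_r)
    # Candidate state computed from the input and the reset-gated history.
    h_tilde = np.tanh(x @ W_h + (r * h_prev) @ U_h)
    # Interpolate between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_tilde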
Specialized Architectures
• Neural Turing Machine (Graves et al. 2014)
• Reinforcement Learning Neural Turing Machines (Zaremba et al. 2015)
• Stack Recurrent Nets (Joulin et al. 2015)
• Neural GPU (Kaiser et al. 2015)
• Neural Programmer (Neelakantan et al. 2016)
• Neural Programmer-Interpreter (Reed et al. 2016)
• Learning Simple Algorithms from Examples (Zaremba et al. 2016)
• Differentiable Neural Computer (Graves et al. 2016)
• String transformations: Neuro-Symbolic Program Synthesis (Parisotto et al. 2017), RobustFill (Devlin et al. 2017)
• DeepCoder (Balog et al. 2017)
• Leveraging grammar and reinforcement learning for neural program synthesis (Bunel et al. 2018)
Neural program synthesis tasks: copy, addition, sorting, shortest path
Our Objectives
• Program from input-output pairs
• Variable-size input and output
• Training on short samples -> error-free generalization to unbounded samples
• Differentiable, easily trainable by gradient descent
• Efficient implementation on current hardware – convolutions preferred
The network architecture has to scale with the input length to provide enough memory and computing time for the algorithm to be inferred.
Cellular Automata
Rule-30 automaton
http://mathworld.wolfram.com/ElementaryCellularAutomaton.html
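As a small aside, a few lines of NumPy are enough to run the Rule-30 automaton shown on this slide; the periodic boundary conditions and the grid size are our own choices:

import numpy as np

def rule30_step(row):
    # Rule 30: new cell = left XOR (center OR right).
    left = np.roll(row, 1)
    right = np.roll(row, -1)
    return left ^ (row | right)

# Evolve from a single live cell and print the familiar triangular pattern.
row = np.zeros(31, dtype=int)
row[15] = 1
for _ in range(15):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)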
Neural GPU
Input length n; n layers with shared parameters. Each layer is a convolutional GRU.
✶ means convolution, ʘ means elementwise multiplication
Output length n
Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
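A PyTorch sketch of one such convolutional GRU layer; the class and parameter names are ours, and the kernel size and channel count are illustrative rather than the values from the paper:

import torch
import torch.nn as nn

class CGRU(nn.Module):
    # One convolutional GRU layer; the Neural GPU stacks n such layers
    # with shared weights for an input of length n.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, s):
        # s: (batch, channels, length) -- the state "tape" of the Neural GPU
        u = torch.sigmoid(self.update(s))        # update gate
        r = torch.sigmoid(self.reset(s))         # reset gate
        c = torch.tanh(self.candidate(r * s))    # candidate state
        return u * s + (1.0 - u) * c             # gated state update

Applying the same layer n times to a length-n state gives the architecture the scaling property listed on the objectives slide: more input means more layers and therefore more computation.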
Cellular automaton – Neural GPU performing multiplication
Our Improvement – DNGPU
Problem with the Neural GPU: the authors train 729 models to find one that generalizes well.
1. Diagonal gates help to bring together data from both ends of the input. Feature maps of the state are divided into 3 parts; each part is gated from the same, previous or next cell of the previous timestep. The state s_t at time t is computed from the state at time t-1 according to the Diagonal Convolutional Gated Recurrent Unit (DCGRU); a sketch follows below.
✶ means convolution, ʘ means elementwise multiplication
Kārlis Freivalds, Renārs Liepiņš. "Improving the Neural GPU Architecture for Algorithm Learning." ICML workshop on Neural Abstract Machines & Program Induction v2 (NAMPI 2018) [Best Paper Award]
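A hedged PyTorch sketch of the diagonal-gate idea, mirroring the CGRU sketch above; the diagonal_shift helper, the class and all parameter names are our own reading of the description on this slide, and the exact DCGRU formulation is the one in the NAMPI paper:

import torch
import torch.nn as nn

def diagonal_shift(s):
    # Split the feature maps into 3 parts: one stays in place, one is shifted
    # left and one right, so gated values travel diagonally across timesteps.
    c = s.shape[1] // 3
    same = s[:, :c]
    left = torch.roll(s[:, c:2 * c], shifts=-1, dims=2)
    right = torch.roll(s[:, 2 * c:], shifts=1, dims=2)
    return torch.cat([same, left, right], dim=1)

class DCGRU(nn.Module):
    # Diagonal Convolutional GRU: like a convolutional GRU, but the gated
    # skip connection takes the diagonally shifted previous state.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, s):
        # s: (batch, channels, length) state from the previous timestep
        u = torch.sigmoid(self.update(s))        # update gate
        r = torch.sigmoid(self.reset(s))         # reset gate
        c = torch.tanh(self.candidate(r * s))    # candidate state
        return u * diagonal_shift(s) + (1.0 - u) * c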
Our Improvement – DNGPU
2. Hard nonlinearities improve generalization (see the sketch below).
3. A saturation cost helps to avoid unnecessary saturation and is applied to each application of hard_tanh or hard_σ.
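A minimal sketch of what such hard nonlinearities and a saturation penalty could look like; the hard_sigmoid slope and the saturation limit are illustrative choices, not the values used in DNGPU:

import torch

def hard_tanh(x):
    # Piecewise-linear tanh that saturates exactly at -1 and 1.
    return torch.clamp(x, -1.0, 1.0)

def hard_sigmoid(x, slope=0.5):
    # Piecewise-linear sigmoid; the slope here is an illustrative choice.
    return torch.clamp(slope * x + 0.5, 0.0, 1.0)

def saturation_cost(pre_activation, limit=0.9):
    # Penalize pre-activations that drive the hard nonlinearity deep into its
    # flat, zero-gradient region; added to the training loss with a small weight.
    return torch.relu(pre_activation.abs() - limit).mean()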
Training Tricks
• We train on inputs of different lengths simultaneously.
• Dropout is applied to the candidate vector c_t of the DCGRU as suggested in (Semeniuta et al., 2016).
• A large learning rate (0.005) helps to avoid local minima.
• We use the AdaMax optimizer with integrated gradient clipping, which is very robust with a large learning rate. Clipping is performed relative to the previous gradient (one possible reading is sketched below).
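The slides do not spell out the clipping rule, so the following is only one plausible interpretation of "clipping relative to the previous gradient": cap the current global gradient norm at a multiple of the norm observed on the previous step, then take an AdaMax step. The function name and the max_ratio value are hypothetical.

import torch

def clip_relative_to_previous(parameters, prev_norm, max_ratio=1.2):
    # Cap the current global gradient norm at max_ratio * previous norm.
    grads = [p.grad for p in parameters if p.grad is not None]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    limit = max_ratio * prev_norm
    if prev_norm > 0 and norm > limit:
        for g in grads:
            g.mul_(limit / norm)
    return float(norm)  # remembered as prev_norm for the next step

# Usage sketch with AdaMax and the large learning rate from the slide:
# optimizer = torch.optim.Adamax(model.parameters(), lr=0.005)
# prev_norm = 0.0
# loss.backward()
# prev_norm = clip_relative_to_previous(list(model.parameters()), prev_norm)
# optimizer.step()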
Training
We train on examples of length up to 41 and test on examples of length 401. On long binary multiplication, only about 800 training steps (about 10 minutes on a single computer) are needed to reach 99% accuracy (% of correct bits).
Test accuracy vs. training step
Generalization to Longer Inputs (% of correct bits)
Generalization to Longer Inputs (No. of incorrect sequences out of 1024)
Other Tasks
On the sorting and binary addition tasks performance is similar or better. Decimal multiplication can be learned if each decimal number is encoded in binary (or, more simply, by inserting 3 blanks after each symbol); a small encoding sketch follows below.
Sorting task test accuracy / Decimal multiplication test accuracy
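A tiny illustration of the blank-insertion encoding mentioned above; the helper name and the blank symbol are hypothetical:

def widen(number, blanks=3, blank="_"):
    # Insert `blanks` blank symbols after every digit so the network gets
    # extra tape cells per symbol to work with.
    return "".join(d + blank * blanks for d in str(number))

print(widen(314))  # "3___1___4___"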
Ablation Study
All introduced features contribute to performance. Hard nonlinearities are the most important.
DNGPU Summary
• DNGPU can learn moderately complex algorithms (addition, multiplication, sorting)
• Robust – all trained networks generalize well
• Better generalization – generalizes to 100× longer inputs
• Hard nonlinearities with saturation cost are essential for generalization
• Code on GitHub: https://github.com/LUMII-Syslab/DNGPU
Summary
• DL is reaching or surpassing human-level performance in many fields
• Our results show how to automatically synthesize nontrivial algorithms of complexity O(n²); we are working to improve on that
• An emerging direction is learning heuristic algorithms for NP-hard problems
Thank You!