Deep Learning of Algorithms Towards Universal Artificial Intelligence

Deep Learning of Algorithms Towards Universal Artificial Intelligence
Kārlis Freivalds, Renārs Liepiņš
Joint Estonian-Latvian Theory Days 2018
Supported by Latvian Council of Science project lzp-2018/1-0327

DL in Image Recognition

DL in Speech Recognition

Alpha Zero
Achieved a superhuman level of play in Go, chess and shogi by defeating world-champion programs.
https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png

Motivation for Algorithm Learning
• Synthesizing better algorithms
  – Faster, more memory efficient (faster matrix multiplication)
  – Dealing with tasks where no good solutions are known (integer factoring)
  – Better heuristics for NP-hard problems, tailored for specific use cases
• Increased programming productivity
  – Working example: Flash Fill in MS Excel
  – Code refactoring and optimization
  – Programmer's virtual assistant

Possible Setups
• Specifying programming intent
  – Natural language
  – Formal language (specification)
  – Input-output examples
  – Execution examples
• Representing the generated program
  – Neural network (differentiable)
  – Source code in a given programming language (non-differentiable)
  – Hybrid (mixture of both)

Feed-forward Networks
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Recurrent Networks
Gated Recurrent Unit (GRU)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Specialized Architectures
• Neural Turing Machine (Graves et al. 2014)
• Reinforcement Learning Neural Turing Machines (Zaremba et al. 2015)
• Stack Recurrent Nets (Joulin et al. 2015)
• Neural GPU (Kaiser et al. 2015)
• Neural Programmer (Neelakantan et al. 2016)
• Neural Programmer-Interpreter (Reed et al. 2016)
• Learning Simple Algorithms from Examples (Zaremba et al. 2016)
• Differentiable Neural Computer (Graves et al. 2016)
• String transformations: Neuro-Symbolic Program Synthesis (Parisotto et al. 2017), RobustFill (Devlin et al. 2017)
• DeepCoder (Balog et al. 2017)
• Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis (Bunel et al. 2018)
Neural program synthesis tasks: copy, addition, sorting, shortest path

Our Objectives
• Program from input-output pairs
• Variable-size input and output
• Training on short samples -> error-free generalization to unbounded samples
• Differentiable, easily trainable by gradient descent
• Efficient implementation on current hardware – convolutions preferred
The network architecture has to scale depending on the input length to provide enough memory and computing time for the algorithm to be inferred.

Cellular Automata
Rule-30 automaton
http://mathworld.wolfram.com/ElementaryCellularAutomaton.html
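For reference, a Rule-30 update is very cheap to compute: each new cell is left XOR (center OR right). The following minimal NumPy sketch (not from the slides, added here only as an illustration) evolves a single active cell for a few steps and prints the familiar triangle pattern.

import numpy as np

def rule30_step(row):
    # One update of the Rule-30 elementary cellular automaton.
    # New cell = left XOR (center OR right), with wrap-around boundaries.
    left = np.roll(row, 1)
    right = np.roll(row, -1)
    return left ^ (row | right)

# Evolve a single active cell for a few steps.
row = np.zeros(31, dtype=np.uint8)
row[15] = 1
for _ in range(8):
    print("".join(".#"[c] for c in row))
    row = rule30_step(row)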

Neural GPU
Input of length n; n layers with shared parameters; each layer is a convolutional GRU (✶ means convolution, ʘ means elementwise multiplication); output of length n.
Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
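The convolutional GRU (CGRU) update from the cited paper, reproduced here from memory for readability (see the paper for the authoritative form), in the slides' notation where ✶ is convolution and ʘ is elementwise multiplication:

  u = σ(U' ✶ s + B')            (update gate)
  r = σ(U'' ✶ s + B'')          (reset gate)
  CGRU(s) = u ʘ s + (1 − u) ʘ tanh(U ✶ (r ʘ s) + B)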

Cellular automaton vs. Neural GPU performing multiplication

Our Improvement – DNGPU
Problem with the NGPU – the authors train 729 models to find one that generalizes well.
1. Diagonal gates help to bring together data from both ends of the input. The feature maps of the state are divided into 3 parts; each part has a gate from the same, previous or next cell of the previous timestep. The state s_t at time t is computed from the state at time t-1 according to the Diagonal Convolutional Gated Recurrent Unit (DCGRU); ✶ means convolution, ʘ means elementwise multiplication.
Karlis Freivalds, Renars Liepins. "Improving the Neural GPU Architecture for Algorithm Learning." ICML Workshop on Neural Abstract Machines & Program Induction v2 (NAMPI 2018) [Best Paper Award].
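The DCGRU equations themselves appeared as a figure on the slide and are given in the cited paper. To illustrate just the "diagonal" part described above, here is an illustrative NumPy sketch (my reading of the slide, not the authors' code; it assumes the channel count is divisible by 3): the state's feature maps are split into three groups that are aligned with the previous, same, and next cell of the previous timestep, so information can move one position left or right per step.

import numpy as np

def diagonal_shift(state):
    # state: array of shape (length, channels). The three channel groups are
    # taken from the previous, same, and next cell of the previous timestep.
    length, channels = state.shape
    g = channels // 3  # assumes channels is divisible by 3
    prev_cell = np.roll(state[:, :g], 1, axis=0)             # from the previous cell
    same_cell = state[:, g:2 * g]                            # from the same cell
    next_cell = np.roll(state[:, 2 * g:3 * g], -1, axis=0)   # from the next cell
    return np.concatenate([prev_cell, same_cell, next_cell], axis=1)

# In a full DCGRU step, the gates would combine this shifted view of the
# previous state with the convolutional candidate update (see the paper for
# the exact equations).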

Our Improvement – DNGPU
2. Hard nonlinearities improve generalization.
3. Saturation cost helps to avoid unnecessary saturation and is applied to each application of hard_tanh or hard_σ.
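For intuition, a small sketch of these two ingredients (my own illustration; the exact hard_σ definition, saturation-cost formula and margin used in the paper may differ): hard nonlinearities clip their input, and the saturation cost penalizes pre-activations that sit in the clipped region.

import numpy as np

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def hard_sigmoid(x):
    # One common piecewise-linear definition; the paper may use a different slope/offset.
    return np.clip(0.5 * x + 0.5, 0.0, 1.0)

def saturation_cost(pre_activation, limit=0.9):
    # Penalize pre-activation magnitudes beyond `limit`, i.e. values that would be
    # (nearly) clipped by hard_tanh / hard_sigmoid. The margin 0.9 is an
    # illustrative choice, not necessarily the value used in the paper.
    return np.mean(np.maximum(np.abs(pre_activation) - limit, 0.0))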

Training Tricks
• We train on inputs of different lengths simultaneously.
• Dropout is applied to the c_t vector of the DCGRU, as suggested in (Semeniuta et al., 2016); see the sketch below.
• A large learning rate (0.005) helps to avoid local minima.
• We use the AdaMax optimizer with integrated gradient clipping; it is very robust in the case of a large learning rate. Clipping is performed relative to the previous gradient.
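A sketch of the dropout trick mentioned above (my illustration, not the authors' code): recurrent dropout in the style of Semeniuta et al. (2016) drops elements of the candidate vector c_t only, so the carried-over part of the state is never zeroed out.

import numpy as np

def gated_state_update(prev_state, u, c, drop_prob=0.1, training=True):
    # Gated state update with dropout applied only to the candidate vector c:
    # the retained part u * prev_state stays intact, so memory is not erased.
    if training and drop_prob > 0.0:
        mask = (np.random.rand(*c.shape) >= drop_prob) / (1.0 - drop_prob)
        c = c * mask
    return u * prev_state + (1.0 - u) * c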

Training
We train on examples of length up to 41 and test on examples of length 401. On long binary multiplication, it requires only about 800 training steps (about 10 minutes on a single computer) to reach 99% accuracy (% of correct bits).
Test accuracy vs. training step

Generalization to Longer Inputs (% of correct bits)

Generalization to Longer Inputs (No. of incorrect sequences out of 1024)

Other Tasks
On the sorting and binary addition tasks, performance is similar or better. Decimal multiplication can be learned if each decimal number is encoded in binary (or, more simply, by inserting 3 blanks after each symbol).
Sorting task test accuracy; decimal multiplication test accuracy
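A tiny sketch of the simpler encoding mentioned above (illustrative only; the blank symbol and exact alphabet are my assumptions): inserting 3 blanks after each symbol gives every decimal digit 4 positions, comparable to its binary encoding length.

def pad_decimal(s, blank="_"):
    # Insert 3 blanks after each symbol, e.g. '12*3' -> '1___2___*___3___'.
    return "".join(ch + blank * 3 for ch in s)

print(pad_decimal("12*34"))  # 1___2___*___3___4___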

Ablation Study
All introduced features contribute to performance. Hard nonlinearities are the most important.

DNGPU Summary
• DNGPU can learn moderately complex algorithms (addition, multiplication, sorting)
• Robust – all trained networks generalize well
• Better generalization – generalizes to 100× longer inputs
• Hard nonlinearities with saturation cost are essential for generalization
• Code on GitHub: https://github.com/LUMII-Syslab/DNGPU

Summary
• DL is reaching and surpassing human-level performance in many fields
• Our results show how to automatically synthesize nontrivial algorithms of complexity O(n²); we are working to improve on that
• An emerging direction is learning heuristic algorithms for NP-hard problems

Thank You!
