Dialogue Systems: End-to-End Review
Vladislav Belyaev, DeepHack Lab
20 December 2015
Agenda
1. Dialogue systems: objectives and problems
2. Architectures
   2.1. Char-rnn
   2.2. Sequence-to-sequence (seq2seq)
   2.3. Hierarchical recurrent encoder-decoder (HRED)
   2.4. Attention with Intention (AWI)
   2.5. Memory Networks
3. Summary
Dialogue System
A. M. Turing. Computing machinery and intelligence. Mind, pages 433-460, 1950
Dialogue System
Goal-driven and non-goal-driven
Deterministic (rule-based, IR) and generative
Large dialogue corpora (0.1 to 1 billion words)
Natural and unnatural (+ can depend on external unobserved events)
Preprocessing
- Remove anomalies (acronyms, slang, misspellings and phonemicization)
- Tokenization, stemming and lemmatization
Segmentation
- Speaker segmentation (diarisation)
- Conversation segmentation
Evaluation metrics / Problems
- How to measure sense?
- Generalization to previously unseen situations
- Highly generic responses
- Diversity of corpora
Serban I. V. et al. (2015), A Survey of Available Corpora For Building Data-Driven Dialogue Systems
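The preprocessing step above (anomaly removal plus tokenization) can be sketched in a few lines of Python. This is a minimal illustration, not any specific paper's pipeline; the `NORMALIZE` table and the regex tokenizer are hypothetical stand-ins for a real slang/acronym dictionary and tokenizer.

```python
import re

# Hypothetical slang/acronym normalization table (illustrative only)
NORMALIZE = {"u": "you", "r": "are", "lol": "laughing out loud"}

def preprocess(utterance):
    """Lowercase, strip non-word noise, expand a few slang tokens."""
    utterance = utterance.lower()
    tokens = re.findall(r"[a-z']+", utterance)  # crude word tokenization
    return [NORMALIZE.get(t, t) for t in tokens]

print(preprocess("LOL u R late!"))
# ['laughing out loud', 'you', 'are', 'late']
```

A production pipeline would add stemming or lemmatization and speaker/conversation segmentation on top of this.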
Dialogue System
Evaluation metrics
Goal-driven: goal-related performance criteria (including a user simulator)
Non-goal-driven:
- "Naturalness": human evaluation (+)
- BLEU/METEOR score (-)
- Next-utterance classification (+)
- Word perplexity (+)
- Response diversity (in combination) (+)
A. M. Turing. Computing machinery and intelligence. Mind, pages 433-460, 1950
Serban I. V. et al. (2015), A Survey of Available Corpora For Building Data-Driven Dialogue Systems
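Next-utterance classification, one of the metrics listed above, is easy to state concretely: the model ranks N candidate responses (one of which is the ground truth) and gets credit when the true one lands in the top k (Recall@k). A minimal sketch, with hypothetical toy scores:

```python
def recall_at_k(scored_candidates, true_index, k):
    """Next-utterance classification: the model assigns a score to each
    of N candidate responses (one is the ground truth); count a hit if
    the true response ranks among the top-k scores."""
    ranked = sorted(range(len(scored_candidates)),
                    key=lambda i: scored_candidates[i], reverse=True)
    return 1.0 if true_index in ranked[:k] else 0.0

# Toy scores for 5 candidate responses; candidate 2 is the ground truth.
scores = [0.1, 0.3, 0.9, 0.2, 0.05]
print(recall_at_k(scores, true_index=2, k=1))  # 1.0
```

Averaging this over a test set gives the Recall@k figure reported in next-utterance-classification evaluations.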
Dialogue System - Business
Without neural nets!
http://www.nextit.com/case-studies/amtrak/
http://www.nextit.com/case-studies/alaska-airlines/
Architectures: opportunities, data size, papers, code

Char-rnn
- Opportunities: suits morphologically rich languages well; memory limited by RAM; hard to remember facts
- Data size: around several tens of millions of words or more
- Papers: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ ; http://arxiv.org/abs/1508.06615
- Code: https://github.com/karpathy/char-rnn ; https://github.com/yoonkim/lstm-char-cnn

Seq2seq
- Opportunities: does not maintain dialogue states; no really long-term dependencies (tricky to handle)
- Data size (for NCM): Helpdesk, 30M tokens; OpenSubtitles, 923M tokens
- Papers: http://arxiv.org/abs/1506.05869
- Code: https://www.tensorflow.org/versions/master/tutorials/seq2seq/index.html#sequence-to-sequence-models ; https://github.com/macournoyer/neuralconvo

HRED
- Opportunities: has a dialogue state representation; hard to remember facts
- Data size: >1B tokens recommended
- Papers: http://arxiv.org/abs/1507.04808 ; http://arxiv.org/abs/1507.02221
- Code: https://github.com/julianser/hed-dlg-truncated

AWI
- Opportunities: amazing results (a dialogue state and attention), but no code and an in-house dataset
- Data size: 4.5M tokens (1000 dialogues)
- Papers: http://arxiv.org/abs/1510.08565
- Code: none

MemNN
- Opportunities: can perform QA, recommendation and chitchat
- Data size: http://fb.ai/babi , 3.5M training examples (4 datasets for 4 tasks)
- Papers: http://arxiv.org/abs/1511.06931
- Code: https://github.com/carpedm20/MemN2N-tensorflow ; https://github.com/facebook/SCRNNs
Char-rnn
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Y. Kim et al. (2015), Character-Aware Neural Language Models
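The core of a char-rnn is a recurrence over one-hot character vectors. The sketch below shows a single vanilla RNN step in pure Python; the vocabulary, the tiny dimensions and the constant weights are all illustrative (a real model like Karpathy's char-rnn is trained with backprop and uses LSTM cells).

```python
import math

def rnn_step(x_onehot, h_prev, Wxh, Whh, bh):
    """One vanilla RNN step: h_t = tanh(Wxh x + Whh h_prev + bh).
    Matrices are plain lists of lists; sizes are tiny for illustration."""
    H = len(h_prev)
    h = []
    for i in range(H):
        s = bh[i]
        s += sum(Wxh[i][j] * x_onehot[j] for j in range(len(x_onehot)))
        s += sum(Whh[i][j] * h_prev[j] for j in range(H))
        h.append(math.tanh(s))
    return h

# Character vocabulary and one-hot encoding
vocab = sorted(set("hello"))              # ['e', 'h', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}
def one_hot(c):
    v = [0.0] * len(vocab)
    v[char_to_ix[c]] = 1.0
    return v

# Tiny constant-weight model (a real model learns these by training)
V, H = len(vocab), 3
Wxh = [[0.1] * V for _ in range(H)]
Whh = [[0.0] * H for _ in range(H)]
bh = [0.0] * H
h = [0.0] * H
for c in "hello":                          # run the characters through the RNN
    h = rnn_step(one_hot(c), h, Wxh, Whh, bh)
print(len(h))  # 3
```

For generation, a softmax output layer over `vocab` would be sampled at each step and fed back in as the next input character.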
Sequence-to-sequence
4 LSTM layers with 1000 memory cells each (each extra layer improved perplexity by ~10%)
Input vocabulary: 160k words
Output vocabulary: 80k words
1000-dimensional word embeddings (for 300M words)
Grouping sentences by length (perplexity 5.8)
Reversing source sentences (perplexity 4.7)
Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014), Sequence to Sequence Learning with Neural Networks
Sequence-to-sequence
Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014), Sequence to Sequence Learning with Neural Networks
https://www.tensorflow.org/versions/master/tutorials/seq2seq/index.html#sequence-to-sequence-models
http://googleresearch.blogspot.ca/2015/11/computer-respond-to-this-email.html
Sequence-to-sequence
Helpdesk dataset:
- A single-layer LSTM with 1024 memory cells
- Vocabulary of the 20K most common words
- 30M tokens
- Model perplexity 8; an n-gram model achieved 18
OpenSubtitles dataset:
- A 2-layer LSTM with 4096 memory cells + 2048 linear units
- Vocabulary of the 100K most frequent words
- 923M tokens
- Model perplexity 17; a 5-gram model achieved 28
Oriol Vinyals, Quoc V. Le (2015), A Neural Conversational Model
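The perplexity numbers above have a direct reading: perplexity is exp of the average negative log-likelihood per token, i.e. the effective branching factor of the model. A minimal computation, using a uniform toy model for illustration:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(mean negative log-likelihood), where `log_probs`
    holds the model's natural-log probability of each reference token."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A uniform model over a 20,000-word vocabulary has perplexity 20,000.
# A perplexity of 8 therefore means the model is, on average, as
# uncertain as choosing uniformly among just 8 words.
uniform = [math.log(1 / 20000.0)] * 10
print(round(perplexity(uniform)))  # 20000
```

This is why the drop from 18 (n-gram) to 8 (LSTM) on the Helpdesk data is a substantial gain.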
Hierarchical recurrent encoder-decoder
Iulian V. Serban et al. (2015), Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Hierarchical recurrent encoder-decoder
For the baseline RNN, we tested hidden state spaces dh = 200, 300 and 400. For HRED we experimented with encoder and decoder hidden state spaces of size 200, 300 and 400. Based on preliminary results and due to GPU memory limitations, we limited ourselves to size 300 when not bootstrapping or bootstrapping from Word2Vec, and to size 400 when bootstrapping from SubTle. Preliminary experiments showed that the context RNN state space at and above 300 performed similarly, so we fixed it at 300 when not bootstrapping or bootstrapping from Word2Vec, and at 1200 when bootstrapping from SubTle. For all models, we used word embeddings of size 400 when bootstrapping from SubTle and of size 300 otherwise.
Iulian V. Serban et al. (2015), Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
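The hyperparameters above map onto HRED's two-level structure: a token-level encoder produces one vector per utterance, and a context RNN folds those vectors into a dialogue state that conditions the decoder. The sketch below shows only that data flow; the fold functions are hypothetical toy recurrences standing in for the GRU/LSTM cells of the real model.

```python
def encode_tokens(token_vectors, dim):
    """Utterance-level encoder stand-in: fold token vectors into one
    fixed-size utterance vector (the real HRED uses an RNN here)."""
    h = [0.0] * dim
    for v in token_vectors:
        h = [0.5 * hi + 0.5 * vi for hi, vi in zip(h, v)]  # toy recurrence
    return h

def hred_context(dialogue, dim):
    """Context-RNN stand-in: fold utterance vectors, in order, into a
    single dialogue-state vector that would condition the decoder."""
    state = [0.0] * dim
    for utterance in dialogue:
        u = encode_tokens(utterance, dim)
        state = [0.5 * si + 0.5 * ui for si, ui in zip(state, u)]
    return state

# A dialogue of 2 utterances, each a list of 2-dim "token embeddings"
dialogue = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
print(hred_context(dialogue, dim=2))
```

The key design point, visible even in this toy form, is that the context state persists across utterance boundaries, which is what gives HRED its dialogue state representation.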
Hierarchical recurrent encoder-decoder

Configuration (30k vocab, 77M words, perplexity 39.0644):
# Gradients will be truncated after 80 steps. This seems like a fair start.
state['max_grad_steps'] = 80
state['qdim_encoder'] = 3000
state['qdim_decoder'] = 3000
# Dimensionality of dialogue hidden layer
state['sdim'] = 1000
# Dimensionality of low-rank approximation
state['rankdim'] = 1000

Beam-search log (Russian samples; English glosses for the final hypotheses below):
2015-12-20 11:37:04,223: search: DEBUG: adding sentence [16, 306, 9294, 17, 15421, 10, 1] from beam 0
2015-12-20 11:37:04,223: search: DEBUG: partial -> боже , нет , мама
2015-12-20 11:37:04,223: search: DEBUG: partial -> - вот гений ! - рад это
2015-12-20 11:37:04,223: search: DEBUG: partial -> конечно , он твой приятель один .
2015-12-20 11:37:04,223: search: INFO: RandomSampler: sampling step 7, beams alive 3
2015-12-20 11:37:04,330: search: DEBUG: partial -> боже , нет , мама .
2015-12-20 11:37:04,330: search: DEBUG: partial -> - вот гений ! - рад это слышать
2015-12-20 11:37:04,330: search: DEBUG: partial -> конечно , он твой приятель один . он
2015-12-20 11:37:04,330: search: INFO: RandomSampler: sampling step 8, beams alive 3
2015-12-20 11:37:04,437: search: DEBUG: adding sentence [212, 11, 35, 10, 35, 11, 211, 10, 1] from beam 0
2015-12-20 11:37:04,437: search: DEBUG: partial -> - вот гений ! - рад это слышать .
2015-12-20 11:37:04,437: search: DEBUG: partial -> конечно , он твой приятель один . он тоже
2015-12-20 11:37:04,437: search: INFO: RandomSampler: sampling step 9, beams alive 2
2015-12-20 11:37:04,534: search: DEBUG: adding sentence [18, 77, 3610, 22, 18, 458, 20, 1217, 10, 1] from beam 0
2015-12-20 11:37:04,535: search: DEBUG: partial -> конечно , он твой приятель один . он тоже там
2015-12-20 11:37:04,535: search: INFO: RandomSampler: sampling step 10, beams alive 1
2015-12-20 11:37:04,624: search: DEBUG: partial -> конечно , он твой приятель один . он тоже там был
2015-12-20 11:37:04,624: search: INFO: RandomSampler: sampling step 11, beams alive 1
2015-12-20 11:37:04,712: search: DEBUG: partial -> конечно , он твой приятель один . он тоже там был .
2015-12-20 11:37:04,713: search: INFO: RandomSampler: sampling step 12, beams alive 1
2015-12-20 11:37:04,801: search: DEBUG: adding sentence [139, 11, 25, 162, 846, 151, 10, 25, 119, 98, 62, 10, 1] from beam 0

Final hypotheses (score: text [English gloss]):
2.68409920732: боже , нет , мама . ["God, no, Mom."]
3.10118466318: - вот гений ! - рад это слышать . ["- What a genius! - Glad to hear it."]
3.28937990442: и убил тебя . ["and killed you."]
3.72474860085: конечно , он твой приятель один . он тоже там был . ["of course, he is a pal of yours. he was there too."]
5.60221603406: ты дом налоги и болтаешь . ["you house taxes and chatter." (ungrammatical sample)]
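The sampler in the log is running a beam search: it keeps the best few partial hypotheses at each step and extends them. A minimal sketch in Python, assuming a fixed per-step token distribution for simplicity (a real decoder conditions each step's probabilities on the prefix and the HRED dialogue state):

```python
import math

def beam_search(step_probs, beam_width):
    """Toy beam search. `step_probs[t]` maps token -> log-prob at step t.
    Returns hypotheses sorted by total negative log-likelihood, i.e.
    the same kind of score the search log reports (lower is better)."""
    beams = [([], 0.0)]                       # (prefix, total NLL)
    for probs in step_probs:
        expanded = [(prefix + [tok], nll - lp)
                    for prefix, nll in beams
                    for tok, lp in probs.items()]
        expanded.sort(key=lambda b: b[1])     # lower NLL = better
        beams = expanded[:beam_width]         # prune to the beam width
    return beams

steps = [{"sure": math.log(0.6), "well": math.log(0.4)},
         {",": math.log(0.9), ".": math.log(0.1)}]
best = beam_search(steps, beam_width=2)[0]
print(best[0])  # ['sure', ',']
```

The "beams alive" counts in the log shrink as hypotheses emit the end-of-sentence token and are moved to the finished list with their final scores.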
Attention with Intention
4.5M tokens (1000 dialogues)
Kaisheng Yao et al. (2015), Attention with Intention for a Neural Network Conversation Model
AWI Kaisheng Yao et al. (2015), Attention with Intention for a Neural Network Conversation Model
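AWI combines an attention mechanism over the encoder's states with an "intention" RNN that tracks the dialogue. The attention half can be sketched in a few lines; this is generic dot-product attention as an illustration, with the intention conditioning omitted, and all vectors here are made-up toy values.

```python
import math

def attention(query, encoder_states):
    """Dot-product attention: softmax over query-state scores, then a
    weighted sum of encoder states (AWI additionally conditions the
    attention on an intention RNN; that part is omitted here)."""
    scores = [sum(q * s for q, s in zip(query, state))
              for state in encoder_states]
    m = max(scores)                                # stabilize the softmax
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * st[d] for w, st in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention([2.0, 0.0], states)
print(round(weights[0], 3))  # 0.881
```

The decoder then consumes `context` at each output step, so different source positions dominate at different times.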
Memory Networks
With additional modifications to construct both long-term and short-term context memories
Sukhbaatar et al. (2015), End-To-End Memory Networks
J. Dodge et al. (2015), Evaluating prerequisite qualities for learning end-to-end dialog systems
Memory Networks J. Dodge et al. (2015), Evaluating prerequisite qualities for learning end-to-end dialog systems
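The memory read at the heart of an end-to-end memory network is a soft attention over memory slots: score each slot against the query, softmax the scores, and return the weighted sum of value vectors. A single-hop sketch in miniature, with made-up toy vectors (the real model learns embeddings for queries, keys and values, and stacks several hops):

```python
import math

def memory_hop(query, memory_keys, memory_values):
    """One end-to-end memory network hop: attend over memory slots with
    softmax(q . m_i), then return the weighted sum of value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in memory_keys]
    m = max(scores)                           # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [e / z for e in exps]                 # attention over memories
    dim = len(memory_values[0])
    return [sum(pi * v[d] for pi, v in zip(p, memory_values))
            for d in range(dim)]

# Two memory slots; the query matches slot 0 far more strongly.
keys = [[5.0, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = memory_hop([1.0, 0.0], keys, values)
print(round(out[0], 3))  # 0.993
```

Long-term facts (e.g. a movie knowledge base) and short-term dialogue history can simply be stored as separate groups of memory slots read by the same mechanism, which is the modification the slide refers to.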
Summary
Our vision (DeepHackLab + NN&DL LAB MIPT)
Large corpora (natural and unnatural)
- Where to find them? Russian or not? Russian...
- How many words are enough?
Pre-processing module
- Do we need words? What is the best representation?
End-to-end module
- Architecture?
Natural language generation
- The best suitable architecture?
Dialogue policy learning
- Reinforcement learning?
Dialogue state tracking
- What is the best representation?
Q&A / Knowledge module
- Architecture?
Evaluation module
- How to measure sense? How much is enough?
How to put it all together?
T. Mikolov et al. (2015), A Roadmap towards Machine Intelligence
J. Schmidhuber (2015), On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models
http://qa.deephack.me/
Thank you for your attention! Questions?
Bibliography (key refs)
1. Serban I. V. et al. (2015), A Survey of Available Corpora For Building Data-Driven Dialogue Systems http://arxiv.org/abs/1512.05742
2. Y. Kim et al. (2015), Character-Aware Neural Language Models http://arxiv.org/abs/1508.06615
3. Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014), Sequence to Sequence Learning with Neural Networks http://arxiv.org/abs/1409.3215
4. Oriol Vinyals, Quoc V. Le (2015), A Neural Conversational Model http://arxiv.org/abs/1506.05869
5. Iulian V. Serban et al. (2015), Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models http://arxiv.org/abs/1507.04808
6. Kaisheng Yao et al. (2015), Attention with Intention for a Neural Network Conversation Model http://arxiv.org/abs/1510.08565
7. J. Dodge et al. (2015), Evaluating prerequisite qualities for learning end-to-end dialog systems http://arxiv.org/abs/1511.06931
8. T. Mikolov et al. (2015), A Roadmap towards Machine Intelligence http://arxiv.org/abs/1511.08130
9. J. Schmidhuber (2015), On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models http://arxiv.org/abs/1511.09249