Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
EMNLP ’14 paper by Kyunghyun Cho et al.
Recurrent Neural Networks (1/3)
Recurrent Neural Networks (2/3)
- A variable-length sequence x = (x_1, …, x_T)
- A hidden state h, updated at each timestep as h_t = f(h_{t-1}, x_t)
- (Optional) an output y (e.g. the next symbol in the sequence)
- A non-linear activation function f, e.g. the logistic sigmoid or a long short-term memory (LSTM) unit
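A minimal sketch of the recurrent update h_t = f(h_{t-1}, x_t), not from the slides: it uses tanh for f instead of the logistic sigmoid or LSTM unit mentioned above, and the dimensions and parameter names (W_xh, W_hh, b_h) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
input_dim, hidden_dim = 8, 16

# Parameters of a plain tanh RNN cell
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    """One recurrent update h_t = f(h_{t-1}, x_t), here with f = tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Read a variable-length sequence x = (x_1, ..., x_T)
x = rng.normal(size=(5, input_dim))
h = np.zeros(hidden_dim)
for x_t in x:
    h = rnn_step(h, x_t)
print(h.shape)  # (16,)
```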
Recurrent Neural Networks (3/3)
- The output at each timestep t is the conditional probability p(x_t | x_{t-1}, …, x_1)
- e.g. output from a softmax layer:
  p(x_{t,j} = 1 | x_{t-1}, …, x_1) = exp(w_j h_t) / Σ_{j'=1}^{K} exp(w_{j'} h_t)
- Hence, the probability of the whole sequence x can be computed as
  p(x) = Π_{t=1}^{T} p(x_t | x_{t-1}, …, x_1)
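A toy illustration (my own, not code from the paper) of how the per-step softmax distributions combine into the sequence probability; the simple tanh recurrence and the parameter names W_emb, W_hh, W_out are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_dim = 10, 16   # illustrative sizes

W_emb = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # symbol embeddings
W_hh  = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # softmax weights w_j

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(tokens):
    """log p(x) = sum_t log p(x_t | x_{t-1}, ..., x_1) under a toy RNN."""
    h = np.zeros(hidden_dim)
    x_prev = np.zeros(hidden_dim)        # stands in for a start symbol
    log_p = 0.0
    for tok in tokens:
        h = np.tanh(x_prev + W_hh @ h)   # hidden update from the previous symbol
        p = softmax(W_out @ h)           # p(x_t = j | x_{t-1}, ..., x_1) for every j
        log_p += np.log(p[tok])
        x_prev = W_emb[tok]              # the observed symbol feeds the next step
    return log_p

print(sequence_log_prob([3, 1, 4, 1, 5]))
```

In practice one accumulates log-probabilities rather than multiplying raw probabilities, to avoid numerical underflow on long sequences.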
RNN Encoder-Decoder (1/3)
RNN Encoder-Decoder (2/3)
- Encoder
  - Input: a variable-length sequence x
  - Output: a fixed-length vector representation c
- Decoder
  - Input: the fixed-length vector representation c
  - Output: a variable-length sequence y
- Note that the decoder's hidden state h_t depends on h_{t-1}, y_{t-1}, and c.
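A compact sketch of this split, assuming simple tanh cells rather than the paper's gated unit; the parameter names (W_ex, W_eh, W_dh, W_dy, W_dc) and the way the decoder is initialized from c are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16  # shared input/hidden size, purely for illustration

# Encoder and decoder parameters (names are my own)
W_ex, W_eh = rng.normal(scale=0.1, size=(2, dim, dim))
W_dh, W_dy, W_dc = rng.normal(scale=0.1, size=(3, dim, dim))

def encode(xs):
    """Read a variable-length sequence x and return a fixed-length summary c."""
    h = np.zeros(dim)
    for x_t in xs:
        h = np.tanh(W_ex @ x_t + W_eh @ h)
    return h  # c: the encoder's final hidden state

def decoder_step(h_prev, y_prev, c):
    """Decoder hidden state h_t depends on h_{t-1}, y_{t-1}, and c."""
    return np.tanh(W_dh @ h_prev + W_dy @ y_prev + W_dc @ c)

xs = rng.normal(size=(7, dim))      # a length-7 input sequence
c = encode(xs)
h = np.tanh(W_dh @ c)               # one possible way to initialize the decoder from c
h = decoder_step(h, np.zeros(dim), c)
print(c.shape, h.shape)             # (16,) (16,)
```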
RNN Encoder-Decoder (3/3)
- Encoder and decoder are trained jointly to maximize the conditional log-likelihood
  max_θ (1/N) Σ_{n=1}^{N} log p_θ(y_n | x_n)
- Usage:
  - Generate an output sequence given an input sequence
  - Score a given pair of input and output sequences
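A self-contained sketch of the two usages with toy parameters: score_pair returns log p(y | x) for scoring a pair, and generate performs greedy decoding (the paper does not prescribe greedy search; it is just the simplest choice for illustration). All parameter names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, vocab = 16, 10

# Toy decoder parameters (illustrative only)
W_h, W_y, W_c = rng.normal(scale=0.1, size=(3, dim, dim))
W_out = rng.normal(scale=0.1, size=(vocab, dim))
E = rng.normal(scale=0.1, size=(vocab, dim))   # output-symbol embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score_pair(c, y_tokens):
    """Score a pair: log p(y | x) = sum_t log p(y_t | y_{<t}, c)."""
    h, y_prev, logp = np.tanh(W_c @ c), np.zeros(dim), 0.0
    for tok in y_tokens:
        h = np.tanh(W_h @ h + W_y @ y_prev + W_c @ c)
        logp += np.log(softmax(W_out @ h)[tok])
        y_prev = E[tok]
    return logp

def generate(c, length=5):
    """Generate an output sequence: greedily emit the most probable symbol."""
    h, y_prev, out = np.tanh(W_c @ c), np.zeros(dim), []
    for _ in range(length):
        h = np.tanh(W_h @ h + W_y @ y_prev + W_c @ c)
        tok = int(np.argmax(softmax(W_out @ h)))
        out.append(tok)
        y_prev = E[tok]
    return out

c = rng.normal(size=dim)   # pretend this came from the encoder
print(generate(c), score_pair(c, [1, 2, 3]))
```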
The Hidden Unit (1/2)
- A gated hidden unit (gated recurrent unit, GRU) with 2 gates:
  - Update gate z: decides how much the hidden state is updated with the new candidate state
  - Reset gate r: decides whether the previous hidden state is ignored
The Hidden Unit (2/2)
- Reset gate: r_j = σ([W_r x]_j + [U_r h_{t-1}]_j)
- Update gate: z_j = σ([W_z x]_j + [U_z h_{t-1}]_j)
- New (candidate) state: h̃_j^t = tanh([W x]_j + [U (r ⊙ h_{t-1})]_j)
- Final state: h_j^t = z_j h_j^{t-1} + (1 − z_j) h̃_j^t
  (σ is the logistic sigmoid, ⊙ the element-wise product)
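A direct numpy transcription of these four equations (bias terms are omitted, matching the notation above; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
input_dim, hidden_dim = 8, 16   # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters for the reset gate, update gate, and candidate state
W_r, W_z, W = rng.normal(scale=0.1, size=(3, hidden_dim, input_dim))
U_r, U_z, U = rng.normal(scale=0.1, size=(3, hidden_dim, hidden_dim))

def gru_step(h_prev, x):
    """One update of the gated hidden unit, following the equations above."""
    r = sigmoid(W_r @ x + U_r @ h_prev)          # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)          # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde      # interpolate old and new

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(h, x_t)
print(h.shape)  # (16,)
```

Note that, following the paper's formulation, z weights the previous state; many later GRU implementations swap the roles of z and 1 − z.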
Statistical Machine Translation
- The RNN encoder-decoder is used to score phrase pairs.
- Its score is an additional feature in the log-linear model of the phrase-based SMT framework.
- It is trained on each phrase pair, ignoring the pairs' frequencies in the original corpora.
- The new score is added to the existing phrase table.
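A schematic of how such an extra feature could be attached to phrase-table entries; the tuple-based table layout and the score_pair callable are hypothetical stand-ins for illustration, not the paper's (Moses-based) implementation.

```python
from typing import Callable, List, Tuple

def add_rnn_feature(
    phrase_table: List[Tuple[str, str, List[float]]],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[str, str, List[float]]]:
    """Append the encoder-decoder score log p(target | source) as one more
    feature of every phrase pair, leaving the existing features untouched."""
    return [
        (src, tgt, feats + [score_pair(src, tgt)])
        for src, tgt, feats in phrase_table
    ]

# Tiny example with made-up phrase pairs and a dummy scorer
table = [("la maison", "the house", [0.5, 0.2]),
         ("la maison bleue", "the blue house", [0.4, 0.1])]
print(add_rnn_feature(table, lambda s, t: -1.0))
```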