Show and Tell: A Neural Image Caption Generator

Show and Tell: A Neural Image Caption Generator (CVPR 2015). Presenters: Tianlu Wang, Yin Zhang. October 5th.

Neural Image Caption (NIC). Main goal: automatically describe the content of an image using properly formed English sentences. Human: "A young girl asleep on the sofa cuddling a stuffed bear." NIC: "A baby is asleep next to a teddy bear." Mathematically, the aim is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
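The likelihood objective factorizes over words by the chain rule, so training maximizes a sum of per-word log-probabilities. A numerical sketch (the per-word probabilities below are made-up illustrative numbers, not model outputs):

```python
import math

# log p(S|I) = sum_t log p(S_t | I, S_1..S_{t-1}) by the chain rule.
# Hypothetical probabilities a trained decoder might assign to the
# four words of a short caption (illustrative values only).
word_probs = [0.4, 0.3, 0.5, 0.2]

# Maximizing this sum is equivalent to maximizing p(Sentence | Image).
log_likelihood = sum(math.log(p) for p in word_probs)
print(log_likelihood)
```

Because each probability is below 1, the log-likelihood is negative; training pushes it toward 0 by raising the probability of each correct word given the image and the words so far.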

Inspiration from the Machine Translation task. The target sentence is generated by maximizing the likelihood P(T|S), where T is the target-language sentence and S is the source-language sentence. Uses the encoder-decoder structure: • Encoder (RNN): transforms the source sentence into a rich fixed-length vector • Decoder (RNN): takes the output of the encoder as input and generates the target sentence. Example: translating the source-language words "ABCD" into the target-language words "XYZQ".
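The encoder-decoder flow can be sketched as a toy: both functions below are hypothetical stand-ins for trained RNNs, and a lookup table plays the role of the learned source-to-target mapping.

```python
# Toy illustration of the encoder-decoder structure: translate the
# source tokens "A B C D" into the target tokens "X Y Z Q".

def encode(source_tokens):
    # A real encoder RNN compresses the sequence into a learned
    # fixed-length vector; here the "vector" is just the token tuple.
    return tuple(source_tokens)

def decode(state, max_len=10):
    # A real decoder emits one token per step from p(word | state,
    # previous words); here a toy table supplies the whole answer.
    table = {("A", "B", "C", "D"): ["X", "Y", "Z", "Q"]}
    return table.get(state, [])[:max_len]

print(decode(encode(["A", "B", "C", "D"])))  # -> ['X', 'Y', 'Z', 'Q']
```

The key structural point survives even in the toy: the decoder sees the source only through the single fixed-length encoding, which is exactly the interface NIC reuses with a CNN on the encoder side.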

NIC Model Architecture. Follows the encoder-decoder structure: • Encoder (deep CNN): transforms the image into a rich fixed-length vector • Decoder (RNN): takes the output of the encoder as input and generates the target sentence.

NIC Model Architecture. Choice of CNN: the winner of the ILSVRC 2014 classification competition (GoogLeNet). Choice of RNN: an LSTM RNN (a Recurrent Neural Network with LSTM cells). In the training process, the CNN was left unchanged; only the RNN part was trained.
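The freeze-the-CNN setup can be sketched as follows; `Param` and the parameter names are hypothetical stand-ins for a deep-learning framework's trainable parameters, not NIC's actual variable names.

```python
# Sketch of the training setup from the slide: the pretrained CNN
# encoder is frozen, and only the RNN decoder side receives updates.

class Param:
    """Hypothetical stand-in for a framework's model parameter."""
    def __init__(self, name, trainable):
        self.name = name
        self.trainable = trainable

params = [
    Param("cnn.conv1.weight", trainable=False),  # frozen encoder
    Param("cnn.fc.weight", trainable=False),     # frozen encoder
    Param("lstm.W_forget", trainable=True),      # decoder is trained
    Param("lstm.W_output", trainable=True),      # decoder is trained
    Param("word_embedding", trainable=True),     # decoder-side, trained
]

# An optimizer would only step the trainable subset.
updated = [p.name for p in params if p.trainable]
print(updated)
```

Keeping the pretrained image features fixed means the captioning data only has to teach the language side, which is far smaller than the CNN.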

RNN (Recurrent Neural Network) • Why? Sequential tasks: speech, text, video, etc. E.g., translating a word based on the previous ones. • Advantage: passes information from one step to the next, so information persists. • How? Loops: multiple copies of the same cell (module), each passing a message to its successor. Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
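The loop-with-shared-weights idea can be sketched as a minimal vanilla RNN cell (sizes and random weights are illustrative, not from the paper):

```python
import numpy as np

# The same two weight matrices are applied at every time step; the
# hidden state h is the "message" passed from each copy of the cell
# to its successor.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3)) * 0.1   # input  -> hidden
W_hh = rng.standard_normal((4, 4)) * 0.1   # hidden -> hidden

h = np.zeros(4)                            # initial hidden state
for x in rng.standard_normal((5, 3)):      # a 5-step input sequence
    h = np.tanh(W_xh @ x + W_hh @ h)       # one unrolled step
print(h.shape)
```

Because `h` is overwritten each step through the same weights, whatever the network needs later must survive repeated squashing through `tanh` — which is exactly where the long-term dependency problem on the next slide comes from.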

RNN & LSTM • Why is the LSTM better? The long-term dependency problem: the translation of the last word may depend on information from the first word, and when the gap between the relevant pieces of information grows, a plain RNN fails. • A Long Short-Term Memory network remembers information for long periods of time.

LSTM (Long Short-Term Memory). Cell state: information flows along it. Gates: optionally let information through.

LSTM cont. (forget gate). Inputs: the current input x and the previous output h. The gate output f is a vector whose elements lie between 0 and 1; it decides what information to throw away from the cell state.
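The forget gate can be sketched numerically; sizes and random weights are illustrative, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forget gate: f = sigmoid(W_f . [h_prev, x] + b_f).
# Each element of f lies strictly between 0 and 1 and scales how much
# of the corresponding cell-state entry survives.
rng = np.random.default_rng(0)
W_f = rng.standard_normal((4, 7)) * 0.1    # acts on concat([h, x]): 4 + 3 dims
b_f = np.zeros(4)

h_prev = np.zeros(4)                       # previous output
x = rng.standard_normal(3)                 # current input
f = sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)

c_prev = np.ones(4)                        # old cell state
c_after_forget = f * c_prev                # elementwise: discard information
print(f)
```

A gate value near 1 keeps that component of the memory almost intact; a value near 0 erases it.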

LSTM cont. (input gate). Decide what values will be updated: the input gate decides what new information will be stored in the cell state; a tanh layer pushes values to be between -1 and 1 and creates the new candidate values; the old cell state is then updated into the new cell state.
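The input gate and cell-state update can be sketched the same way (illustrative sizes and random weights, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per the slide:
#   i      = sigmoid(W_i . [h, x] + b_i)   decide which values to update
#   c_cand = tanh(W_c . [h, x] + b_c)      candidates, pushed into (-1, 1)
#   c_new  = f * c_old + i * c_cand        blend old state with candidates
rng = np.random.default_rng(0)
W_i, W_c = rng.standard_normal((2, 4, 7)) * 0.1
b_i = np.zeros(4)
b_c = np.zeros(4)

hx = np.concatenate([np.zeros(4), rng.standard_normal(3)])  # [h_prev, x]
f = sigmoid(rng.standard_normal(4))   # forget gate, as on the previous slide
i = sigmoid(W_i @ hx + b_i)           # input gate
c_cand = np.tanh(W_c @ hx + b_c)      # new candidate values in (-1, 1)
c_old = np.ones(4)
c_new = f * c_old + i * c_cand        # updated cell state
print(c_new.shape)
```

Note the update is additive: old memory scaled by `f` plus new content scaled by `i`, which is what lets gradients flow over long gaps.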

LSTM cont. (output gate). Decide what parts of the cell state we'll output, then emit those parts through the gate.
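The output gate completes the step (again with illustrative sizes and random weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per the slide:
#   o = sigmoid(W_o . [h, x] + b_o)   decide which parts of the state to emit
#   h = o * tanh(c)                   the cell's output / next hidden state
rng = np.random.default_rng(0)
W_o = rng.standard_normal((4, 7)) * 0.1
b_o = np.zeros(4)

hx = np.concatenate([np.zeros(4), rng.standard_normal(3)])  # [h_prev, x]
c = rng.standard_normal(4)                 # current cell state
o = sigmoid(W_o @ hx + b_o)                # output gate
h = o * np.tanh(c)                         # only the gated parts get out
print(h)
```

Because `tanh(c)` lies in (-1, 1) and `o` in (0, 1), every element of the emitted `h` is bounded below 1 in magnitude, while the full cell state `c` flows on to the next step unsquashed.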

Results. BLEU metric: https://en.wikipedia.org/wiki/BLEU
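The core of BLEU is clipped n-gram precision. A minimal unigram-only sketch, using the slide's two example captions (real BLEU combines 1- to 4-gram precisions with a brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    # Count how many candidate words appear in the reference, clipping
    # each word's credit at its count in the reference (so repeating
    # "a a a a" cannot inflate the score).
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(1, sum(cand.values()))

score = clipped_unigram_precision(
    "a baby is asleep next to a teddy bear",        # NIC caption
    "a young girl asleep on the sofa cuddling a stuffed bear",  # human caption
)
print(round(score, 3))
```

Here the captions share "a" (twice), "asleep", and "bear", giving 4 clipped matches out of 9 candidate words, so the score is 4/9 — a reminder that BLEU rewards word overlap, not meaning.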

References: • Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf and http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/ • Understanding LSTM Networks, colah's blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/