Show and Tell: A Neural Image Caption Generator
- Slides: 13
Show and Tell: A Neural Image Caption Generator (CVPR 2015) Presenters: Tianlu Wang, Yin Zhang October 5th
Neural Image Caption (NIC) Main goal: automatically describe the content of an image using properly formed English sentences. Human: "A young girl asleep on the sofa cuddling a stuffed bear." NIC: "A baby is asleep next to a teddy bear." Mathematically, the goal is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(S|I) of producing a target sequence of words S.
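The paper's training objective can be written out as maximizing log-likelihood over image–sentence pairs, with the sentence probability factored word by word using the chain rule:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```

Each conditional p(S_t | I, S_0, …, S_{t-1}) is modeled by the RNN decoder described on the next slides.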
Inspiration from the Machine Translation task The target sentence is generated by maximizing the likelihood p(T|S), where T is the target sentence and S is the source sentence Use the Encoder-Decoder structure • Encoder (RNN): transforms the source sentence into a rich fixed-length vector • Decoder (RNN): takes the output of the encoder as input and generates the target sentence An example: translating words written in a source language, "ABCD", to those in a target language, "XYZQ"
NIC Model Architecture Follows the Encoder-Decoder structure • Encoder (deep CNN): transforms the image into a rich fixed-length vector • Decoder (RNN): takes the output of the encoder as input and generates the target sentence
NIC Model Architecture Choice of CNN: GoogLeNet, winner of the ILSVRC 2014 classification competition Choice of RNN: LSTM RNN (Recurrent Neural Network with LSTM cells) During training, they left the CNN unchanged (initialized from pretrained weights) and only trained the RNN part.
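At inference time the decoder emits one word per step, feeding each word back in as the next input. A minimal greedy-decoding sketch is below; the tiny vocabulary and the toy `step` function are illustrative stand-ins for the trained CNN+LSTM (the paper itself uses beam search over a learned model):

```python
# Hedged sketch of NIC inference via greedy decoding. `step` is a
# hypothetical stand-in for one LSTM step conditioned on the image:
# it deterministically favors the next vocabulary entry, mimicking a
# learned p(S_t | image, S_0..S_{t-1}).

VOCAB = ["<start>", "a", "baby", "sleeps", "<end>"]

def step(state, word_id):
    """Return (new_state, scores over VOCAB) for one decoder step."""
    scores = [0.0] * len(VOCAB)
    scores[min(word_id + 1, len(VOCAB) - 1)] = 1.0  # favor the next word
    return state, scores

def greedy_caption(max_len=10):
    state, word_id, words = None, 0, []   # id 0 is the <start> token
    for _ in range(max_len):
        state, scores = step(state, word_id)
        word_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if VOCAB[word_id] == "<end>":
            break
        words.append(VOCAB[word_id])
    return " ".join(words)

print(greedy_caption())  # prints: a baby sleeps
```

Replacing the argmax with a k-best search over partial sentences gives the beam search the paper actually reports results with.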
RNN (Recurrent Neural Network) • Why? Sequential tasks: speech, text, video… E.g., translating a word based on the previous ones • Advantage: passes information from one step to the next — information persistence • How? Loops: multiple copies of the same cell (module), each passing a message to its successor Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
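The "loop" idea can be shown in a few lines. This is a deliberately scalar sketch of a vanilla RNN step with made-up weights (a real cell uses weight matrices and vector states): the same function is applied at every step, and the hidden state h carries information forward even after the input stops.

```python
import math

# Toy vanilla RNN cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).
# The weights w_x, w_h, b are illustrative placeholders, not learned.
def rnn_step(h_prev, x, w_x=0.5, w_h=0.8, b=0.0):
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.0, 0.0]:   # an input only at the first step...
    h = rnn_step(h, x)
    print(h)                # ...still influences every later state
```

Note how the state decays at each step — a hint of the vanishing signal that motivates LSTMs on the next slide.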
RNN & LSTM • Why is LSTM better? The long-term dependency problem: the translation of the last word may depend on information from the first word… when the gap to the relevant information grows, a plain RNN fails • Long Short-Term Memory networks can remember information for long periods of time.
LSTM (Long Short-Term Memory) Cell state: information flows along it! Gates: optionally let information through
LSTM Cont. (forget gate) Takes the input x and the previous output h, and produces f (a vector whose elements lie between 0 and 1) that decides what information to throw away from the cell state
LSTM Cont. (input gate) Decides what values will be updated: the input gate (a sigmoid) decides what new information will be stored in the cell state; a tanh layer pushes values to be between -1 and 1, creating the new candidate values; the old cell state is then updated into the new cell state
LSTM Cont. (output gate) Decides what parts of the cell state we'll output; the cell state is passed through tanh and multiplied by the gate so we output only the parts we decided to
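The three gate slides above can be collected into one step function. This is a scalar toy version with placeholder weights (real LSTMs use weight matrices over the concatenated [h, x] vector), but it follows the standard gate equations: forget, input, candidate, cell update, output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar sketch of one LSTM step; w_f, w_i, w_c, w_o, b are illustrative
# placeholders, not learned parameters.
def lstm_step(c_prev, h_prev, x,
              w_f=1.0, w_i=1.0, w_c=1.0, w_o=1.0, b=0.0):
    z = h_prev + x                    # toy stand-in for concatenated [h, x]
    f = sigmoid(w_f * z + b)          # forget gate: what to keep of c_prev
    i = sigmoid(w_i * z + b)          # input gate: what new info to admit
    c_tilde = math.tanh(w_c * z + b)  # candidate values, pushed into (-1, 1)
    c = f * c_prev + i * c_tilde      # cell-state update
    o = sigmoid(w_o * z + b)          # output gate: which parts to expose
    h = o * math.tanh(c)              # new output
    return c, h
```

Because the cell state is updated additively (c = f·c_prev + i·c̃) rather than squashed through a nonlinearity at every step, gradients survive across long gaps — the fix for the long-term dependency problem on the previous slide.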
Result BLEU: https://en.wikipedia.org/wiki/BLEU
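For intuition on the evaluation metric: BLEU compares a candidate caption against reference captions by clipped n-gram precision. The sketch below computes only the unigram part (BLEU-1) — full BLEU also combines higher-order n-grams and a brevity penalty — using the human and NIC captions from the earlier slide:

```python
from collections import Counter

# Simplified BLEU-1: modified unigram precision with clipping.
# (Full BLEU multiplies in 2- to 4-gram precisions and a brevity penalty.)
def bleu1(candidate, reference):
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Each candidate word is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

score = bleu1("a baby is asleep next to a teddy bear",
              "a young girl asleep on the sofa cuddling a stuffed bear")
print(score)  # 4 matched tokens (a, a, asleep, bear) out of 9
```

In practice BLEU is computed against multiple references per image, which softens the penalty for valid paraphrases.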
References: • Show and Tell: A Neural Image Caption Generator, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf — talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/ • Understanding LSTM Networks, colah's blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/