Caption Description 2015 6 30 Mun Jonghwan Caption
- Slides: 26
Caption Description 2015. 6. 30 Mun Jonghwan
Caption Description
Caption Description A man skiing down the snow covered mountain with a dark sky in the background. INPUT OUTPUT • This problem require - Identifying and detecting objects, scenes, people, etc - Reasoning about relationships, properties and activity of objects - Combining several sources of information into a coherent sentence
Traditional method • Filling the template with detector • Embedding
Novel approach • • Generating caption Caption(sentence) is sequential data Infer the word of sentence step by step Based on image and history of words infer next word
Show and Tell: A Neural Image Caption Generator - 2015 CVPR Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - 2015 ICML
CNN vs RNN <RNN> <CNN> output hidden input • An image • Spatial relation T-1 T • Sequence • Temporal relation T+1
Recurrent Neural Networks (RNNs) (a) RNN (b) RNN unrolled in time
Vanishing gradient problem • Gradient decays over time as new inputs overwrite the activation of hidden layer • The network forget the first inputs • Difficult to learn long-term dependecies
Long Short-Term Memory (LSTM) • Memory cell • 3 gates - input, output, forget
Long Short-Term Memory (LSTM) • Input(below), forget(left), output(above), open(o), closed(-) • Avoid vanishing gradient problem
The problem is similar to machine translation • Encoder & Decoder - RNN
The problem is similar to machine translation • Property - Variable-length output - End-to-end trainable • Encoder - CNN • Decoder - RNN
Structure of Encoder-Decoder with LSTM “A” CNN L S T M start
Structure of Encoder-Decoder with LSTM “A” “group” CNN L S T M start “A”
Structure of Encoder-Decoder with LSTM “A” “group” “of” CNN L S T M start L S T M “A” “group”
Structure of Encoder-Decoder with LSTM “A” “group” “of” CNN L S T M start L S T M “A” “group” END L S T M “market” A group of people shopping at an outdoor market
Limitation of Encoder-Decoder • Encoder-Decoder should compress all the necessary information of a whole image into a fixed-lengh vector - Difficult to capture detail of image - Difficult to describe compositionally novel images - Difficult to cope with longer sentence than the sentence in the training corpus
Attention model • Words in sentence have relevant region in image A group of men playing Frisbee in the park
Attention model • Words in sentence have relevant region in image • Learning alignment • predicting words from relevant region A group of men playing Frisbee in the park END L S T M L S T M group of in the park START A L S T M men playing Frisbee
Attention model
Attention model L S T M
Whole framework of attention model
Result
Result
Thank you
- What is mun
- Mun flow of debate
- Mun 101
- Point of entertainment mun
- Mun ugm
- Sonu hyang hui
- Mun arrival form
- Gsl speech example
- Mun introduction speech
- Greg naterer
- Mun chan
- Shad mun
- Mun earth science
- Operative clauses mun
- How to write a resolution mun
- Mun proposal example
- Chan ho mun
- Anatolia college mun
- Mun sign in
- Shad mun
- Ife pe and flc serum
- How to write a resolution mun
- Chan mun choon
- Mmun dress code
- How mun works
- Tuen mun hospital ambulatory care centre
- Mun position paper outline