- Slides: 42
Image Caption with Deep Learning. Yulia Kogan and Ron Shiff, 19.06.2016
References • J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014. • R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. • J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2015. • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2015. • A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015 (Oral).
Structure of the talk • Problem formulation • Models: RNN + CNN • Architecture details • Evaluation problems • Results • Vector arithmetic • Dense image caption (Karpathy et al.)
I think it’s a David Bowie holding cat and he seems
Problem formulation Useful for • Early childhood education • Foreign language education • Visually impaired people • Image retrieval and image search
Problem formulation Hard task: • Objects (cat, dog) • Attributes (white, furry) • Relations (playing together) • Location (in a room) • Describe it in proper language
(Generated): A square with burning street lamps and a street in the foreground
Different tasks Image: • Image description (produce new sentence) • Sentence retrieval (pick the best sentence) • Sentence ranking (pick the best sentence) • Image retrieval (pick the best image) Video (Donahue et al.): • Activity recognition (short label) • Video description (produce new sentence)
Models: RNN + CNN How to combine image and sentence? RNN + CNN: • Encoder-decoder model • Multimodal layer
Encoder-decoder model: machine translation
Encoder-decoder model: image caption
Encoder-decoder model: Vinyals et al.
Encoder-decoder model in time • Loss: log-likelihood of a word given image and context
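The formula on this slide did not survive extraction. As an assumption, the objective can be written in the form used by Vinyals et al. (Show and Tell), where $I$ is the image, $S = (S_0, \dots, S_N)$ is the caption as a word sequence and $\theta$ are the model parameters. The symbols are taken from that paper and not from the slide itself.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \dots, S_{t-1})
```

Each term is the log-likelihood of one word given the image and the words before it (the context), which the RNN summarizes in its hidden state.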
Multimodal layer
Multimodal layer at time 0 Loss: log-likelihood of a Word given Image and Context
Architecture decisions • Model • RNN/LSTM • Type of non-linearity (sigmoid, tanh, ReLU, etc.) • Feed image to RNN on every step or only once • Random initialization or pretrained models • How images and texts are fed
Architecture details: Mao et al.
Multimodal layer (Mao et al.): sentence + image
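A minimal sketch of how a multimodal layer could fuse the sentence and image inputs, assuming the form described by Mao et al.: the word embedding, the recurrent state and the image feature are each projected into a common space, summed, and passed through a scaled tanh. The function name, the dimensions and the random weights below are hypothetical and only illustrate the shapes involved.

```python
import numpy as np

def multimodal_layer(w_t, r_t, img, V_w, V_r, V_I):
    """m(t) = g(V_w w(t) + V_r r(t) + V_I I), with g an element-wise scaled tanh (assumed)."""
    pre_activation = V_w @ w_t + V_r @ r_t + V_I @ img
    return 1.7159 * np.tanh(2.0 / 3.0 * pre_activation)

# Hypothetical sizes: word embedding, recurrent state, image feature, multimodal space.
d_word, d_rnn, d_img, d_mm = 4, 5, 6, 3
rng = np.random.default_rng(0)

w_t = rng.normal(size=d_word)   # embedding of the current word
r_t = rng.normal(size=d_rnn)    # recurrent layer activation at time t
img = rng.normal(size=d_img)    # CNN feature vector of the image

V_w = rng.normal(size=(d_mm, d_word))
V_r = rng.normal(size=(d_mm, d_rnn))
V_I = rng.normal(size=(d_mm, d_img))

m_t = multimodal_layer(w_t, r_t, img, V_w, V_r, V_I)
```

In a full model, `m_t` would feed a softmax over the vocabulary to predict the next word.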
Architecture details: Vinyals et al.
Architecture: Donahue et al.
Problems of evaluation
1. Summer medieval festival.
2. Two men are fighting with swords.
3. Knights are having a tournament.
4. Lots of people in colourful dresses on green grass.
Evaluation (sentence generation) Human evaluation • Costly • Level of inter-human agreement is low (Vinyals et al.: 65%) Multiple references for one image (usually 5) • Still not enough diversity • Not a lot of data
Evaluation (sentence generation) BLEU-N score (~ precision) • BLEU-1: adequacy • BLEU-2, BLEU-3: fluency THERE IS A CAT
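The precision-like core of BLEU-N can be sketched as a clipped (modified) n-gram precision. This is a simplified illustration, not a full BLEU implementation: it omits the brevity penalty and the geometric mean over n. The candidate reuses the slide's "there is a cat"; the two reference captions are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Hypothetical reference captions for one image.
references = [
    "there is a cat on the table".split(),
    "a cat is sitting on the table".split(),
]
candidate = "there is a cat".split()

p1 = modified_precision(candidate, references, 1)          # unigram precision
p2 = modified_precision(candidate, references, 2)          # bigram precision
degenerate = modified_precision("a a a a".split(), references, 1)
```

Clipping is what keeps the degenerate candidate "a a a a" from scoring a perfect unigram precision: only one of its four tokens is credited.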
BLEU problems
Evaluation: Retrieval and Ranking • Recall@K (K = 1, 5, 10): # of images for which the correct sentence is retrieved in the top-K. • Medr: median rank of the first correct sentence (low is good).
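Both retrieval metrics can be computed from a matrix of image-sentence scores. The sketch below makes a simplifying assumption: each image has exactly one correct sentence, stored at the same index, whereas the datasets in the talk usually have five references per image. The function name and the score matrix are hypothetical, and Recall@K is reported as a fraction of images.

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 5, 10)):
    """scores[i, j]: similarity of image i and sentence j; sentence i is correct for image i."""
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)  # sentence indices per image, best first
    # 1-based rank of the correct sentence for every image.
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    median_rank = float(np.median(ranks))
    return recall_at_k, median_rank

# Hypothetical scores for 4 images and 4 sentences.
scores = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.1, 0.2],
    [0.9, 0.8, 0.1, 0.7],
    [0.1, 0.2, 0.3, 0.9],
])
recall_at_k, median_rank = retrieval_metrics(scores, ks=(1, 2))
```

Here the correct sentences land at ranks 1, 2, 4 and 1, so half of the images have their sentence at the top and the median rank is 1.5.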
Results: Vinyals et al.
Results: Kiros et al.
Results (pictures) I think it’s a dog that is standing in the dirt.
I think it’s a David Bowie holding cat and he seems
I think it’s a cat sitting on a table.
I’m not really confident but I think it’s a close up of a cat looking at the camera.
I’m not really confident but I think it’s a close up of a two giraffes near a tree.
Vector arithmetic • king – man + woman = queen • paris – france + poland = warsaw • word2vec (http://deeplearner.fz-qqq.net/)
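The analogy arithmetic above amounts to a nearest-neighbour search around the vector king - man + woman. A toy sketch follows, using hand-made 2-D embeddings that are purely hypothetical (axis 0 stands for "royalty", axis 1 for "gender"); real word2vec vectors are learned and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand for illustration only.
embeddings = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest_word(vector, exclude):
    """Word whose embedding has the highest cosine similarity to `vector`."""
    best_word, best_sim = None, -np.inf
    for word, emb in embeddings.items():
        if word in exclude:
            continue
        sim = float(vector @ emb / (np.linalg.norm(vector) * np.linalg.norm(emb)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
answer = nearest_word(query, exclude={"king", "man", "woman"})
```

The query words are excluded from the search, as is common practice, because the nearest neighbour of the result is otherwise often one of the inputs.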
Vector arithmetic (colors): Kiros et al.
Vector arithmetic (Structure): Kiros et al.
Karpathy et al.: dense captions
Karpathy et al.: dense captions (ranking) • Pretrain Region CNN (R-CNN) for object regions (instead of images) • Detect top 19 regions (bounding boxes) • Learn • Sentence-image score
Karpathy et al.: dense captions (ranking) • Learn • Sentence-image score • Loss
Karpathy et al. 2015: dense captions ranking, RCNN + BRNN
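The score and loss formulas on these slides were lost in extraction. The sketch below assumes the simplified form from Karpathy and Fei-Fei: each word is matched to its best-scoring image region and the matches are summed, and training uses a max-margin ranking loss that pushes matching image-sentence pairs above mismatched ones. The function names and the tiny embeddings are hypothetical, and the diagonal terms of the loss are dropped here since they only add a constant.

```python
import numpy as np

def image_sentence_score(regions, words):
    """regions: (n_regions, d), words: (n_words, d). Sum over words of the best region match."""
    sims = words @ regions.T              # (n_words, n_regions) inner products
    return float(sims.max(axis=1).sum())

def ranking_loss(S, margin=1.0):
    """S[k, l]: score of image k with sentence l; matching pairs lie on the diagonal."""
    diag = np.diag(S)
    cost_sentences = np.maximum(0.0, S - diag[:, None] + margin)  # rank sentences per image
    cost_images = np.maximum(0.0, S - diag[None, :] + margin)     # rank images per sentence
    np.fill_diagonal(cost_sentences, 0.0)
    np.fill_diagonal(cost_images, 0.0)
    return float(cost_sentences.sum() + cost_images.sum())

# Hypothetical embeddings: image A and sentence a share one direction, B and b the other.
images = [np.array([[1.0, 0.0], [0.5, 0.0]]), np.array([[0.0, 1.0], [0.0, 0.5]])]
sentences = [np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([[0.0, 1.0]])]

S = np.array([[image_sentence_score(img, sen) for sen in sentences] for img in images])
loss = ranking_loss(S)
```

With these well-separated embeddings every matching pair beats the mismatched ones by at least the margin, so the loss is zero; a score matrix with large off-diagonal entries would be penalized.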
Take-home message • Image caption is in good shape • Sequential nature of RNN / LSTM • Encoder-decoder model / multimodal layer • Evaluation problems • Vector arithmetic