Image Caption with Deep Learning
Yulia Kogan and Ron Shiff
19.06.2016

References
• J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
• R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
• J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2015.
• O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2015.
• A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015 (Oral).

Structure of the talk
• Problem formulation
• Models: RNN + CNN
• Architecture details
• Evaluation problems
• Results
• Vector arithmetic
• Dense image caption (Karpathy et al.)

I think it’s a David Bowie holding cat and he seems

Problem formulation
Useful for:
• Early childhood education
• Foreign language education
• Visually impaired people
• Image retrieval and image search

Problem formulation
Hard task:
• Objects (cat, dog)
• Attributes (white, furry)
• Relations (playing together)
• Location (in a room)
• Describe it in proper language

(Generated): A square with burning street lamps and a street in the foreground

Different tasks
Image:
• Image description (produce new sentence)
• Sentence retrieval (pick the best sentence)
• Sentence ranking (pick the best sentence)
• Image retrieval (pick the best image)
Video (Donahue et al.):
• Activity recognition (short label)
• Video description (produce new sentence)

Models: RNN + CNN
How to combine image and sentence? RNN + CNN:
• Encoder-decoder model
• Multimodal layer

Encoder-decoder model: machine translation

Encoder-decoder model: image caption

Encoder-decoder model: Vinyals et al.

Encoder-decoder model in time
Log-likelihood of a word given the image and the context (the previous words)
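A sketch of the objective, assuming the notation of Vinyals et al.: I is the image, S = (S_0, ..., S_N) is the caption, and θ stands for the model parameters. These symbols do not appear on the slide and are used here only for illustration. The sentence log-likelihood is the sum of the per-word terms, and training looks for the parameters that maximize it over all image-caption pairs:

```latex
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}; \theta),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)
```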

Multimodal layer

Multimodal layer at time 0
Loss: log-likelihood of a word given the image and the context

Architecture decisions
• Model
• RNN/LSTM
• Type of non-linearity (sigmoid, tanh, ReLU, etc.)
• Feed image to RNN on every step/once
• Random initialization/pretrained models
• How images and texts are fed

Architecture details: Mao et al.

Multimodal layer (Mao et al.): sentence + image
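A minimal sketch of how a multimodal layer can fuse the sentence side with the image side. It assumes the additive form described by Mao et al., m(t) = g(V_w w(t) + V_r r(t) + V_I I), with a scaled tanh as g. The function names, the tiny dimensions, and all weight values below are hypothetical and chosen only so the example runs; they are not taken from the slides.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def scaled_tanh(x):
    """Scaled tanh non-linearity (assumed choice of g)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)

def multimodal_layer(V_w, w, V_r, r, V_I, img):
    """Project word embedding w, recurrent state r and image feature img
    into a shared space, add them, and apply the non-linearity."""
    pw, pr, pi = matvec(V_w, w), matvec(V_r, r), matvec(V_I, img)
    return [scaled_tanh(a + b + c) for a, b, c in zip(pw, pr, pi)]

# Hypothetical toy inputs: 2-D word embedding, 2-D recurrent state, 3-D image feature.
V_w = [[1.0, 0.0], [0.0, 1.0]]
V_r = [[0.0, 1.0], [1.0, 0.0]]
V_I = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
w = [0.5, -0.5]
r = [0.25, 0.75]
img = [0.1, 0.9, -0.25]

m = multimodal_layer(V_w, w, V_r, r, V_I, img)
print(m)
```

In a full model the output m would feed a softmax over the vocabulary to predict the next word; that step is omitted here.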

Architecture details: Vinyals et al.

Architecture: Donahue et al.

Problems of evaluation

1. Summer medieval festival.
2. Two men are fighting with swords.
3. Knights are having a tournament.
4. Lots of people in colourful dresses on green grass.

Evaluation (sentence generation)
Human evaluation
• Costly
• Level of inter-human agreement is low (Vinyals et al.: 65%)
Multiple references for one image (usually 5)
• Still not enough diversity
• Not a lot of data

Evaluation (sentence generation)
BLEU-N score (~ precision)
• BLEU-1: adequacy
• BLEU-2, BLEU-3: fluency
THERE IS A CAT
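BLEU-N rests on clipped (modified) n-gram precision. The sketch below computes only that precision for a single candidate; it leaves out the brevity penalty and the geometric mean over n that the full BLEU score uses. The candidate reuses the slide's example sentence, while the reference sentence is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs
    # in the single reference that contains it most often.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "there is a cat".split()
references = ["a cat is sitting on a table".split()]  # hypothetical reference

p1 = modified_precision(candidate, references, 1)  # unigram precision
p2 = modified_precision(candidate, references, 2)  # bigram precision
print(p1, p2)
```

Three of the four candidate words occur in the reference, but only one of the three bigrams ("a cat") does, which illustrates why higher-order n-grams are read as a fluency signal.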

BLEU problems

Evaluation: Retrieval and Ranking
• Recall@K (K = 1, 5, 10): number of images for which the correct sentence is retrieved in the top K (high is good).
• Med r: median rank of the first correct sentence (low is good).
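Both retrieval metrics can be computed from one number per query: the 1-based rank of the first correct item. A minimal sketch follows; the rank list is hypothetical, and Recall@K is returned as a fraction of queries rather than a raw count, which is an assumption of this example.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Fraction of queries whose first correct item is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median rank of the first correct item (lower is better)."""
    return median(ranks)

# Hypothetical ranks of the first correct sentence for eight query images.
ranks = [1, 3, 7, 12, 2, 40, 5, 1]

for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.3f}")
print("Med r:", median_rank(ranks))
```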

Results: Vinyals et al.

Results: Kiros et al.

Results (pictures) I think it’s a dog that is standing in the dirt.

I think it’s a David Bowie holding cat and he seems

I think it’s a cat sitting on a table.

I’m not really confident but I think it’s a close up of a cat looking at the camera.

I’m not really confident but I think it’s a close up of a two giraffes near a tree.

Vector arithmetic
• king - man + woman = queen
• paris - france + poland = warsaw
• word2vec (http://deeplearner.fz-qqq.net/)
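The analogies above amount to adding and subtracting embedding vectors and then looking up the nearest remaining word by cosine similarity. The sketch below uses hand-made 2-D toy vectors, not real word2vec embeddings, so it only illustrates the mechanics; the vocabulary and all values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is the usual convention, since the nearest vector to the target is otherwise often one of the inputs.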

Vector arithmetic (colors): Kiros et al.

Vector arithmetic (Structure): Kiros et al.

Karpathy et al.: dense captions

Karpathy et al.: dense captions (ranking)
• Pretrain a Region CNN (RCNN) on object regions (instead of whole images)
• Detect top 19 regions (bounding boxes)
• Learn
• Sentence-image score

Karpathy et al.: dense captions (ranking)
• Learn
• Sentence-image score
• Loss
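A minimal sketch of the two ingredients named above, assuming the simplified formulation in Karpathy and Fei-Fei: the image-sentence score sums, over the words of the sentence, the best dot product with any region of the image, and the loss is a max-margin ranking loss that asks each matching pair to score higher than every mismatched pair. The region and word vectors are hypothetical 2-D toys, the margin of 1 is an assumed default, and the trivial l = k terms are skipped.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def image_sentence_score(regions, words):
    """Each word picks its best-matching region; the matches are summed."""
    return sum(max(dot(v, s) for v in regions) for s in words)

def ranking_loss(S, margin=1.0):
    """Max-margin ranking loss over a score matrix S[image][sentence],
    where image k and sentence k form the matching pair."""
    n = len(S)
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            loss += max(0.0, S[k][l] - S[k][k] + margin)  # wrong sentence for image k
            loss += max(0.0, S[l][k] - S[k][k] + margin)  # wrong image for sentence k
    return loss

# Hypothetical region vectors (per image) and word vectors (per sentence).
images = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.5, 0.0]],
]
sentences = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0]],
]

S = [[image_sentence_score(img, sen) for sen in sentences] for img in images]
loss = ranking_loss(S)
print(S, loss)
```

Here the only violated constraint is image 1 scoring sentence 0 within the margin of its own sentence, so the whole loss comes from that one term.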

Karpathy et al. 2015: dense captions ranking
RCNN + BRNN

Take-home message
• Image caption is in good shape
• Sequential nature of RNN / LSTM
• Encoder-decoder model / multimodal layer
• Evaluation problems
• Vector arithmetic
