CNNRNN A Unied Framework for Multilabel Image Classication
CNN-RNN: A Unified Framework for Multi-label Image Classification Xueying Bai, Jiankun Xu
Multi-label Image Classification • Co-occurrence dependency • Higher-order correlation: one label can be predicted using the previous label • Semantic redundancy: labels have overlapping meanings (cat and kitten)
Previous Models • Multiple single-label classification Fail to model the dependency between multiple labels • Graphic model Large amount of parameters; Can not model higher-order correlation
RNN-CNN Model • Learn the semantic redundancy and the cooccurrence dependencies • Have an end-to-end training process • Predict more objects that need contexts (higher-order correlation)
CNN-RNN Framework
Joint Embedding Model • Label embedding: the embedding vector in a lowd Euclidian space in which embeddings of semantically similar labels are close to each other • Image embedding: the embedding vector close to that of its associated labels in the same space • Exploit semantic redundancy problem: share classification parameters
Model Diagram • Output of CNN: Image embedding • Output of RNN (o(t)): new embedding including the information from previous label (to model higher order correlations)
LSTM
Recurrent Neural Network
Inference • Prediction Path • Beam Search Find top N labels in each time step as candidates Find top N prediction paths for each time (t+1)
Beam Search • When comes to ‘End’: add to the candidate path set • Termination condition: probability of current intermediate paths is smaller than that of all candidate paths.
Experiments • CNN module uses the 16 layers VGG network • Dimension of label embedding is 64 • Dimension of LSTM RNN layer is 512 • Test on Datasets: NUS-WIDE, MS COCO and VOC PASCAL 2007
• Evaluation Metric • Precision: correctly annotated labels/ generated labels • Recall: correctly annotated labels/ ground-truth labels • C-P, O-P; C-R, O-R • C-Fl, O-Fl: geometrical average • MAP
• NUS-WIDE • A web image dataset contains 269648 images and 5018 tags. • Test on dataset with 1000 tags and 81 tags.
• MS COCO • It contains 123 thousand images of 80 objects types. • Training data has 82783 images and testing data has 40504 images. • Most images have multiple objects.
• PASCAL VOC 2007 • Training data has 5011 images and testing data has 4952 images. • Use AP and m. AP to evaluate.
• Label embedding • The model effectively learns a joint label embedding
• Attention Visualization
Conclusion and Future Work • Combines the advantages of the joint image/label embedding and label co-occurrence models by employing CNN and RNN • Experimental results on several datasets show good performance • Predicting small objects is still a challenge.
• Reference: CNN-RNN: A Unified Framework for Multi-label Image Classification — Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu • Questions?
• Thank you all!
- Slides: 26