CNN-RNN: A Unified Framework for Multi-label Image Classification
Xueying Bai, Jiankun Xu

Multi-label Image Classification
• Co-occurrence dependency
• Higher-order correlation: one label can be predicted using the previous label
• Semantic redundancy: labels have overlapping meanings (e.g. cat and kitten)

Previous Models
• Multiple single-label classifiers: fail to model the dependency between labels
• Graphical models: require a large number of parameters and cannot model higher-order correlations

CNN-RNN Model
• Learns the semantic redundancy and the co-occurrence dependencies
• Is trained end to end
• Predicts more of the objects that need context (higher-order correlation)

CNN-RNN Framework

Joint Embedding Model
• Label embedding: a vector in a low-dimensional Euclidean space in which embeddings of semantically similar labels are close to each other
• Image embedding: a vector in the same space that lies close to the embeddings of the image's associated labels
• Exploits semantic redundancy by sharing classification parameters
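
To make "close in the same space" concrete, here is a toy sketch that ranks labels by Euclidean distance to an image embedding. The 2-D vectors and the names `LABEL_EMBEDDINGS` and `rank_labels` are invented for illustration. The actual model learns its embeddings and scores labels through the network, so this shows the geometric idea only, not the model's prediction rule.

```python
import math

# Hypothetical 2-D label embeddings, hand-picked for illustration only.
LABEL_EMBEDDINGS = {
    "cat": (0.9, 0.1),
    "kitten": (0.85, 0.2),
    "car": (-0.7, 0.6),
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_labels(image_embedding, label_embeddings):
    """Return label names sorted from nearest to farthest from the image."""
    return sorted(
        label_embeddings,
        key=lambda name: euclidean(image_embedding, label_embeddings[name]),
    )

# Hypothetical embedding of an image that shows a cat.
image_embedding = (0.8, 0.15)
ranking = rank_labels(image_embedding, LABEL_EMBEDDINGS)
```

With these made-up vectors, the semantically similar labels "cat" and "kitten" sit next to each other and both rank ahead of "car".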

Model Diagram
• Output of CNN: image embedding
• Output of RNN (o(t)): a new embedding that includes the information from the previous label (to model higher-order correlations)

LSTM

Recurrent Neural Network

Inference
• Prediction path
• Beam search
  • Find the top N labels at each time step as candidates
  • Find the top N prediction paths for each time step t+1

Beam Search
• When a path reaches the 'End' label, add it to the candidate path set
• Termination condition: the probability of every current intermediate path is smaller than that of all candidate paths
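
The beam search described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `next_label_probs` is a hypothetical stand-in for the trained CNN-RNN, `toy_model` uses hand-written probabilities, and skipping labels already on the path is an assumption of this sketch.

```python
import math

END = "END"

def beam_search(next_label_probs, beam_size):
    """Return finished paths as (log_prob, labels), best first.

    next_label_probs(path) is a hypothetical model interface: it returns
    {label: probability} for the next step, with END as one of the options.
    """
    beams = [(0.0, ())]  # intermediate paths: (log probability, labels so far)
    candidates = []      # candidate path set: paths that reached END
    while beams:
        expanded = []
        for log_p, path in beams:
            for label, p in next_label_probs(path).items():
                if p <= 0.0 or label in path:
                    continue  # assumption: no impossible or repeated labels
                score = log_p + math.log(p)
                if label == END:
                    candidates.append((score, path))
                else:
                    expanded.append((score, path + (label,)))
        # Keep the top-N candidate paths and the top-N intermediate paths.
        candidates = sorted(candidates, reverse=True)[:beam_size]
        beams = sorted(expanded, reverse=True)[:beam_size]
        # Stop once the best intermediate path is less probable than every
        # candidate path; extending a path can only lower its probability.
        if candidates and beams and beams[0][0] < candidates[-1][0]:
            break
    return candidates

def toy_model(path):
    # Hypothetical hand-written probabilities, not learned values.
    if not path:
        return {"sky": 0.6, "person": 0.3, END: 0.1}
    if path[-1] == "sky":
        return {"cloud": 0.7, "person": 0.1, END: 0.2}
    return {END: 0.9, "sky": 0.05, "cloud": 0.05}

results = beam_search(toy_model, beam_size=2)
best_log_prob, best_labels = results[0]
```

On this toy model the best path is sky, cloud, with probability 0.6 × 0.7 × 0.9 = 0.378. Working in log probabilities avoids underflow on long paths and does not change the ranking.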

Experiments
• The CNN module uses the 16-layer VGG network
• The label embedding has dimension 64
• The LSTM RNN layer has dimension 512
• Tested on three datasets: NUS-WIDE, MS COCO and PASCAL VOC 2007

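Evaluation Metrics
• Precision: correctly annotated labels / generated labels
• Recall: correctly annotated labels / ground-truth labels
• C-P, O-P: per-class and overall precision
• C-R, O-R: per-class and overall recall
• C-F1, O-F1: harmonic mean of the corresponding precision and recall
• mAP: mean average precision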
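
As a rough sketch of how these metrics can be computed, the function below takes one set of predicted labels and one set of ground-truth labels per image. The name `multilabel_metrics`, the toy data, and the choice to count an undefined ratio as 0 are assumptions of this example. F1 is computed with the standard harmonic-mean definition from the matching precision and recall.

```python
def multilabel_metrics(predicted, ground_truth, labels):
    """predicted / ground_truth: one set of labels per image."""
    correct = {c: 0 for c in labels}    # correctly annotated, per label
    generated = {c: 0 for c in labels}  # predicted, per label
    actual = {c: 0 for c in labels}     # ground truth, per label
    for pred, truth in zip(predicted, ground_truth):
        for c in pred:
            generated[c] += 1
            if c in truth:
                correct[c] += 1
        for c in truth:
            actual[c] += 1

    def ratio(n, d):
        return n / d if d else 0.0  # assumption: undefined ratios count as 0

    def f1(p, r):
        return ratio(2 * p * r, p + r)

    # Per-class scores average the per-label ratios;
    # overall scores pool the counts over all labels first.
    c_p = sum(ratio(correct[c], generated[c]) for c in labels) / len(labels)
    c_r = sum(ratio(correct[c], actual[c]) for c in labels) / len(labels)
    o_p = ratio(sum(correct.values()), sum(generated.values()))
    o_r = ratio(sum(correct.values()), sum(actual.values()))
    return {"C-P": c_p, "C-R": c_r, "C-F1": f1(c_p, c_r),
            "O-P": o_p, "O-R": o_r, "O-F1": f1(o_p, o_r)}

# Toy data: three images, three labels.
predicted = [{"cat", "dog"}, {"cat"}, {"car"}]
ground_truth = [{"cat"}, {"cat", "dog"}, {"car"}]
metrics = multilabel_metrics(predicted, ground_truth, ["cat", "dog", "car"])
```

In the toy data "dog" is never predicted correctly, which pulls the per-class scores (2/3) below the overall scores (3/4): per-class averaging weights rare labels as heavily as frequent ones.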

NUS-WIDE
• A web image dataset containing 269,648 images and 5,018 tags
• Tested on two label sets: 1000 tags and 81 tags

MS COCO
• Contains about 123 thousand images covering 80 object types
• The training set has 82,783 images and the test set has 40,504 images
• Most images contain multiple objects

PASCAL VOC 2007
• The training set has 5,011 images and the test set has 4,952 images
• Evaluated with AP and mAP
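
For reference, a minimal sketch of AP and mAP on toy scores. This is the plain, non-interpolated form of average precision. The official PASCAL VOC 2007 protocol uses an 11-point interpolated variant, which this sketch does not reproduce. The function names and the data are illustrative assumptions.

```python
def average_precision(scores, relevant):
    """AP for one class.

    scores[i] is the predicted confidence for image i;
    relevant[i] says whether image i truly contains the class.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_class):
    """per_class: list of (scores, relevant) pairs, one per class."""
    return sum(average_precision(s, r) for s, r in per_class) / len(per_class)

# Toy data: two classes scored on four images.
class_a = ([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
class_b = ([0.2, 0.7, 0.6, 0.4], [False, True, True, False])
ap_a = average_precision(*class_a)
ap_b = average_precision(*class_b)
map_score = mean_average_precision([class_a, class_b])
```

Class B ranks both of its positive images first and so reaches an AP of 1.0, while class A ranks one negative above a positive and scores lower.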

Label Embedding
• The model effectively learns a joint label embedding

Attention Visualization

Conclusion and Future Work
• Combines the advantages of the joint image/label embedding and label co-occurrence models by employing a CNN and an RNN
• Experimental results on several datasets show good performance
• Predicting small objects is still a challenge

Reference
• Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu. CNN-RNN: A Unified Framework for Multi-label Image Classification.

Questions?

Thank you all!