Object Bank
Presenter: Liu Changyu
Advisor: Prof. Alex Hauptmann
Interest: Multimedia Analysis
April 4th, 2013
Contents
• Introduction
• Model
• Algorithm
• Experiment
• Conclusion
CMU - Language Technologies Institute
Introduction
1. Research Question
1) Understanding the meanings and contents of images remains one of the most challenging problems in machine intelligence and statistical learning.
2) Current low-level image features are sufficient for a variety of visual recognition tasks, but they fall short for visual tasks that carry semantic meaning, so efficient high-level image features are often needed.
Introduction
2. What is Object Bank?
The Object Bank representation is a novel image representation for high-level visual tasks, which encodes the semantic and spatial information of the objects within an image. In Object Bank, an image is represented as a collection of scale-invariant response maps of a large number of pre-trained generic object detectors.
Introduction
3. Why use it?
Fig. 1 illustrates the gradient-based GIST features and the texture-based Spatial Pyramid representation of two different scenes (forested mountain vs. city street). Such schemes often fail to offer sufficient discriminative power, as one can see from the very similar image statistics in the examples in this figure.
Fig. 1: (Best viewed in color and with magnification.) Comparison of the OB representation with GIST and SIFT-SPM on mountain vs. city street.
Introduction
4. What is it used for?
Our main goals in using Object Bank are:
1) Optimize the Object Bank detection code.
2) Extend Object Bank to incorporate more objects.
Model --- Object Bank
A large number of object detectors are first applied to an input image at multiple scales. For each object at each scale, a three-level spatial pyramid representation of the resulting object filter map is used; the maximum response for each object in each grid cell is then computed, resulting in a feature vector whose length equals the number of objects for each grid cell. A concatenation of the features in all grid cells yields the OB descriptor of the image.
Fig. 2 Object Bank Model
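The dimension bookkeeping implied by this model can be sketched in a few lines of Python. The detector count of 177 is an assumption taken from the original Object Bank release (this slide only says "a large number"); the scale, component, and pyramid counts are the ones given later in the Experiment section.

```python
# Toy bookkeeping for the length of the OB descriptor.
# NUM_OBJECTS = 177 is an assumption from the original Object Bank
# release; the slide itself only says "a large number".
NUM_OBJECTS = 177
NUM_SCALES = 6                        # image scales used (see Experiment)
NUM_COMPONENTS = 2                    # components per root filter
PYRAMID_CELLS = 1 + 2 * 2 + 4 * 4     # three-level spatial pyramid

per_object = NUM_SCALES * NUM_COMPONENTS * PYRAMID_CELLS   # 252
total = NUM_OBJECTS * per_object
print(per_object, total)
```

Under these assumed counts, each object contributes a 252-dimensional chunk, and the full descriptor is their concatenation over all detectors.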
Algorithm
According to paper [1], the Object Bank learning algorithm is as follows:
1) Let X ∈ R^{N×J} denote the design matrix built on the J-dimensional Object Bank representations of N images;
2) Let y ∈ {−1, +1}^N denote the binary classification labels of the N samples.
3) This leads to the following learning problem:
min_β Σ_{i=1}^{N} L(y_i, β^T x_i) + λ R(β)   (1)
where L(·, ·) is a non-negative, convex loss and R(β) is a regularizer that avoids overfitting.
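Problem (1) can be instantiated, for example, with the logistic loss and an L2 (ridge) regularizer. A minimal numpy sketch follows; the function name, learning rate, and gradient-descent settings are my own, not from the paper, which also explores sparsity-inducing regularizers.

```python
import numpy as np

def train_linear_classifier(X, y, lam=0.01, lr=0.1, steps=2000):
    """Minimize (1/N) * sum_i L(y_i, beta^T x_i) + lam * ||beta||^2
    with the logistic loss L and plain gradient descent."""
    n, j = X.shape
    beta = np.zeros(j)
    for _ in range(steps):
        margins = y * (X @ beta)                      # y_i in {-1, +1}
        # d/dbeta of sum_i log(1 + exp(-margin_i))
        grad_loss = -(y / (1.0 + np.exp(margins))) @ X
        beta -= lr * (grad_loss / n + 2.0 * lam * beta)
    return beta
```

Swapping R(β) for an L1 or group-sparse penalty recovers the feature-sparsification variants discussed in [1].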
Experiment
We want to extend the original Object Bank approach and run the related experiments in the following steps:
1) List and number the objects needed in our experiment:
Object Names:
10200 - clock
10302 - goggles
10306 - spectacles
10477 - knife
10572 - key
10577 - keyboard
10638 - desktop computer
10658 - computer
1074 - dog
10790 - printer
10887 - faucet
……………
Experiment
2) Download the related bounding boxes from ImageNet.
3) Resize the original image: first get the image dimensions (a, b), then calculate the scaling ratio as
Ratio = 400 / min(a, b)
so the smaller axis of the image is scaled to 400 pixels. Fig. 3 illustrates this.
Fig. 3 Resizing Step
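The resizing rule above can be sketched as follows (helper names are my own, and rounding to integer pixel sizes is an assumption not spelled out on the slide):

```python
# Hypothetical helpers for the resizing step.
def resize_ratio(a, b, target=400):
    """Scale factor that maps the smaller image axis to `target` pixels."""
    return target / min(a, b)

def resized_dims(a, b, target=400):
    """New (a, b) after scaling both axes by the ratio."""
    r = resize_ratio(a, b, target)
    return round(a * r), round(b * r)

print(resized_dims(800, 600))
```

For an 800 × 600 image the ratio is 400/600, giving roughly 533 × 400: the smaller axis lands exactly on 400 pixels.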
Experiment
4) Get HOG features at different scales: after the rescaling above, HOG features are computed on multiple scaled versions of the image. Although HOG features are obtained for more scales, only six of them are used. These are the ratios applied to the image resized in the previous step:
Ratios: 1 (image obtained from the previous step), 0.707, 0.5, 0.3535, 0.25, 0.17677, 0.128
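These ratios are close to successive powers of 1/√2, so the image area roughly halves at each step. That geometric progression is my observation, not stated on the slide (e.g. the slide's last ratio 0.128 differs slightly from 2^-3 = 0.125):

```python
import math

# Successive scale ratios, each 1/sqrt(2) of the previous one,
# so the image area roughly halves at every step (assumed pattern).
ratios = [(1.0 / math.sqrt(2.0)) ** k for k in range(7)]
print([round(r, 4) for r in ratios])
```
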
Experiment
After resizing the images, calculate the HOG features for every image. These HOG features are used to obtain the response for every object. An example of the HOG features calculated for one image:
Fig. 4 Example of HOG features
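A stripped-down sketch of the HOG idea, assuming 8-pixel cells and 9 unsigned-orientation bins; a real HOG pipeline (e.g. the one used by Object Bank's deformable-part-model filters) additionally normalizes over blocks of cells, which is omitted here:

```python
import numpy as np

def hog_cells(img, cell=8, nbins=9):
    """Simplified HOG: per-cell histograms of gradient orientations,
    weighted by gradient magnitude. `img` is a 2-D grayscale array.
    Block normalization is intentionally omitted."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, nbins))
    binw = 180.0 / nbins
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a // binw).astype(int), nbins - 1)
            np.add.at(hist[i, j], idx, m)            # accumulate magnitudes
    return hist
```

On a horizontal intensity ramp every gradient points along the x-axis, so all the mass falls into the first orientation bin.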
Experiment
5) Get the response for each object: after computing the HOG features, we apply an object-specific filter to them. Each root filter has two components, and each component works at a different scale. As a result, we have 12 detection scales, because we obtained 6 image scales in the previous steps and every filter works at 2 scales: 6 × 2 = 12. These filter responses are stored in matrices following the same layout as the HOG features in image (d) of the previous figure. Thus, for the HOG features obtained at each ratio we have two filter responses, i.e., 12 filter responses in total.
Experiment
6) Get the spatial pyramids: we use 3 spatial pyramid levels, applied to each of the 12 responses for one object. The value for each box is the maximum response of the filter inside that box. For instance, at the second level we split the filter response with a 2 × 2 grid and pick the maximum response inside every box. This gives 21 (1 + 2×2 + 4×4) values for each of the 12 filter responses for one object, resulting in a 12 × 21 = 252-dimensional vector per object.
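The max-pooling over the three pyramid levels can be sketched as follows (the function name and the even grid split via `linspace` are my own choices):

```python
import numpy as np

def pyramid_max(resp, levels=(1, 2, 4)):
    """Max-pool a 2-D filter-response map over a spatial pyramid.
    For each level n, split the map into an n x n grid and keep the
    maximum response inside each box; concatenate all boxes."""
    h, w = resp.shape
    feats = []
    for n in levels:
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                feats.append(resp[ys[i]:ys[i+1], xs[j]:xs[j+1]].max())
    return np.array(feats)
```

Each response map yields 1 + 4 + 16 = 21 values, and concatenating the 12 maps for one object gives the 252-dimensional per-object vector.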
Experiment
7) Get the feature vector for one object: we now describe the layout of the 252-dimensional feature vector for one object. The vector is ordered by the different scales (remember that our original image is the one obtained in the first step).
Experiment
Each scale's chunk is divided into two pieces, because the root filter used in Object Bank has two components that work at different scales. So the 42 dimensions for every scale are split into two pieces of 21 dimensions each.
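Under the scale-major layout described above, the flat position of any entry can be computed directly; this index function is a hypothetical illustration of that layout, not code from the Object Bank release:

```python
# Hypothetical flat-index layout for one object's 252-dim vector:
# scale-major, then filter component, then the 21 pyramid cells.
N_SCALES, N_COMPONENTS, N_PYRAMID = 6, 2, 21

def feature_index(scale, component, cell):
    """Flat position of pyramid `cell` for a given scale and component."""
    return (scale * N_COMPONENTS + component) * N_PYRAMID + cell
```

So scale 0 occupies positions 0-41 (its two 21-dim pieces), scale 1 starts at 42, and the last entry sits at 251.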
Experiment
Finally, this is the distribution of the 21 dimensions: 1 value from pyramid level 1, 4 from level 2, and 16 from level 3.
Conclusion
1) It is a feasible method to use. The authors' experiments demonstrate that the Object Bank representation, which carries rich semantic-level image information, is more powerful on scene classification tasks than several other popular methods.
2) We can use and extend this approach according to the actual situation to complete the remaining experiments in the near future.
Reference
[1] Li-Jia Li, Hao Su, Eric P. Xing and Li Fei-Fei. Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification. Proceedings of the Neural Information Processing Systems (NIPS), 2010.
[2] Li-Jia Li, Hao Su, Yongwhan Lim and Li Fei-Fei. Objects as Attributes for Scene Classification. Proceedings of the 12th European Conference on Computer Vision (ECCV), 1st International Workshop on Parts and Attributes, 2010.
[3] Sreemanananth Sadanand and Jason J. Corso. Action Bank: A High-Level Representation of Activity in Video. CVPR, 2012.
[4] Pedro Felzenszwalb et al. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008.
Thank you!