Object recognition BY: KARAM ABU ALHIJA IBRAHEM IGBARIA LECTURER: HAGIT HEL-OR

Outline
• Introduction: What is object recognition? What does object recognition involve? Classification vs. detection
• Bag of words: bag of words, bag of features
• Part-based models: problem with bag of words, Pascal challenge

What is object recognition?

So what does object recognition involve?

Detection: are there people?

Verification: is that a lamp?

Identification: is that Potala Palace?

Image classification: assigning a class label to the image. People: present; buildings: present; cars: not present; …

Application

Difficulties: camera position, light source, transformations

Difficulties: variation within a class

Bag of words/features • Interest Point Detector • Interest Point Descriptor • K-means clustering • Support Vector Machine (SVM) classifier • Evaluation Metrics: Precision & Recall

Origin 1: Texture recognition • Texture is characterized by the repetition of basic elements or textons

Origin 1: Texture recognition

Origin 2: bag-of-words
• Orderless document representation: frequencies of words from a dictionary. Salton & McGill (1983)
[Figure: example documents over the dictionary (Car, Cat, Bike, …) and their bag-of-words count vectors: (2, 1, 1, …), (0, 1, 0, …), (3, 0, 1, …), (2, 3, 0, …)]
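The orderless representation can be illustrated with a short sketch. The helper name `bag_of_words` and the three-word dictionary are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

def bag_of_words(document, dictionary):
    """Count how often each dictionary word occurs in the document, ignoring word order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in dictionary]

# Illustrative dictionary (hypothetical, mirrors the words on the slide).
dictionary = ["car", "cat", "bike"]

print(bag_of_words("Car Cat Bike Car", dictionary))  # [2, 1, 1]
print(bag_of_words("Bike Car Car Cat", dictionary))  # [2, 1, 1], word order is discarded
```

Two documents with the same words in a different order get the same vector, which is exactly the property the bag-of-features model carries over to images.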

Bags of features People, ball, grass

Bag of features: outline
1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary
4. Represent images by frequencies of “visual words”

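[Diagram: learning and recognition pipelines. Both start with feature detection & representation and use the codewords dictionary to form the image representation; learning produces category models (and/or) classifiers, recognition produces the category decision.]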

Feature detection
• Regular grid
• Points of interest: Harris, SIFT

Harris

Harris is good but…

SIFT

Representation of the features: detect patches, normalize each patch, compute the SIFT descriptor [Lowe ’99]

Clustering: visual vocabulary … vector quantization

Why cluster?
• Clustering is a common method for learning a visual vocabulary, or codebook.
• It is an unsupervised learning process.
• Each cluster center produced by k-means becomes a codevector.
• The codebook can be learned on a separate training set.
• Provided the training set is sufficiently representative, the codebook will be “universal”.

The codebook is used for quantizing features: a vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook.
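A vector quantizer of this kind can be sketched in a few lines. The function names, the squared Euclidean distance, and the tiny 2-D codebook are assumptions for illustration; real descriptors such as SIFT have many more dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(feature, codebook):
    """Return the index of the codevector nearest to the feature vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(feature, codebook[i]))

# Hypothetical 2-D codebook with three codevectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(quantize((0.9, 1.2), codebook))  # 1
```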

K-means
1. Choose k points at random as the initial cluster centers.
2. Assign every point to the center closest to it.
3. Recompute each center as the mean of the points assigned to it, then go back to step 2.
Repeat until the centers stop changing.
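The k-means steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: points are tuples, distance is squared Euclidean, the initial centers are sampled with a fixed seed, and an empty cluster simply keeps its old center. The function names and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 2: assign to the closest center
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # step 3: recompute each center (an empty cluster keeps its old center)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:           # stop once the centers no longer change
            break
        centers = new_centers
    return centers

# Toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(points, k=2)))
```

In the bag-of-features setting the points would be feature descriptors, and the returned centers would serve as the codevectors of the visual vocabulary.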

K-means clustering (K=3)

Example of visual words

Visual vocabularies: issues
How to choose the vocabulary size?
• Too small: visual words are not representative of all patches.
• Too large: quantization artifacts, overfitting.

Image representation: a histogram of codeword frequencies. Each image is represented by a vector, typically of 1000-4000 dimensions.
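Building the histogram from already-quantized features can be sketched like this. The function name is hypothetical, and normalizing the counts so they sum to 1 is an assumption of this example; the slides do not say whether raw or normalized frequencies are used.

```python
def bag_of_features(word_indices, vocabulary_size):
    """Histogram of visual-word indices, normalized to sum to 1."""
    histogram = [0] * vocabulary_size
    for index in word_indices:
        histogram[index] += 1
    total = sum(histogram)
    if total == 0:
        return [0.0] * vocabulary_size   # image with no detected features
    return [count / total for count in histogram]

# Four features quantized against a hypothetical 4-word vocabulary.
print(bag_of_features([0, 2, 2, 1], vocabulary_size=4))  # [0.25, 0.25, 0.5, 0.0]
```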

Classification: learn a decision rule (classifier) that assigns bag-of-features representations of images to different classes, e.g. motorcycle vs. not motorcycle.

Classifier
• The classifier must be trained using a set of positive and negative examples.
• The classifier “learns” the regularities in the data.
• If training is successful, the classifier can classify an unknown example with a high degree of accuracy.

What are support vectors? The objective of a support vector machine is to find a hyperplane in N-dimensional space that distinctly separates the data points. The support vectors are the training points closest to that hyperplane. [Figure: support vectors in the X-Y plane]

SVM: support vector machine

SVM: the goal is to find a hyperplane that separates all training vectors into two classes. Scenario 1: identify the right hyperplane.

SVM, scenario 2: identify the right hyperplane. It is best to choose the hyperplane that leaves the maximum margin to both classes (in the figure, margins z1 and z2 with z2 > z1).

SVM, scenario 3: identify the right hyperplane (candidates A and B).

SVM, scenario 4: can we separate the two classes? Identify the right hyperplane.

SVM, scenario 5: find the hyperplane that segregates the two classes. The classes cannot be separated by a line in the (x, y) plane, but after adding the feature z = x^2 + y^2 they can be separated in the (x, z) plane. This is the idea behind the kernel trick.
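The effect of the extra feature z = x^2 + y^2 can be checked with a small sketch. The two concentric rings, the threshold value, and the function names are assumptions made for illustration. Note that this applies the feature map explicitly; a kernel SVM obtains the same effect implicitly through a kernel function and learns the threshold from data.

```python
import math

def lift(x, y):
    """Extra feature from scenario 5: z = x^2 + y^2."""
    return x * x + y * y

angles = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical data: one class on a circle of radius 1, the other on radius 3.
inner = [(math.cos(t), math.sin(t)) for t in angles]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in angles]

THRESHOLD = 4.0  # illustrative; any value between 1 and 9 separates these rings

def classify(x, y):
    """Linear decision in the lifted space: a threshold on z alone."""
    return "outer" if lift(x, y) > THRESHOLD else "inner"

print([classify(x, y) for x, y in inner])
print([classify(x, y) for x, y in outer])
```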

Example bag-of-words matches

Evaluation metrics
• Ground truth (GT)
• Result of method (RM)
• True positive (TP)
• False negative (FN)
• True negative (TN)
• False positive (FP)
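Precision and recall follow from these counts with the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below assumes boolean labels per image; the function names and the six-image example are hypothetical.

```python
def confusion_counts(ground_truth, predicted):
    """Return (TP, FP, FN, TN) for two equal-length lists of booleans."""
    pairs = list(zip(ground_truth, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    return tp, fp, fn, tn

def precision_recall(ground_truth, predicted):
    tp, fp, fn, _ = confusion_counts(ground_truth, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of detections that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of true objects that are found
    return precision, recall

ground_truth = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
print(confusion_counts(ground_truth, predicted))  # (2, 1, 1, 2)
print(precision_recall(ground_truth, predicted))
```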

Matching statistics Taken from Stanford university

So what happens when we get a new image?
1. Extract features.
2. For each feature, find the closest visual word in the dictionary.
3. Build a histogram of the visual words.
4. Use the histogram to classify which objects are in the image.

What about spatial information? Spatial pyramid matching

Pyramids
• Very useful for representing images.
• A pyramid is built from multiple copies of the image.
• Each level in the pyramid is 1/4 of the size of the previous level.
• The lowest level has the highest resolution.
• The highest level has the lowest resolution.
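A pyramid in which each level has 1/4 of the pixels of the previous one (half the width and half the height) can be sketched as below. Averaging each 2x2 block is an assumption of this example; the slides do not specify the filter, and a Gaussian pyramid would blur before subsampling. The function names and the 8x8 test image are hypothetical.

```python
def downsample(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
              + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)]
            for r in range(rows)]

def build_pyramid(image, levels):
    """Level 0 is the full-resolution image; each later level has 1/4 of the pixels."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

# Hypothetical 8x8 grayscale image stored as a list of rows.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(image, levels=3)
print([(len(level), len(level[0])) for level in pyramid])  # [(8, 8), (4, 4), (2, 2)]
```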

Bag of words + pyramids Level 0 Level 1 Level 2
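One common way to combine the two ideas, assumed here, is to split the image into 2^l x 2^l cells at level l, build a visual-word histogram per cell, and concatenate all histograms. The sketch below follows that scheme and omits the per-level weights used in spatial pyramid matching; the function name and the (x, y, word_index) feature format are hypothetical.

```python
def spatial_pyramid(features, width, height, vocabulary_size, levels=3):
    """Concatenate per-cell visual-word histograms over all pyramid levels.

    features: list of (x, y, word_index) with 0 <= x < width and 0 <= y < height.
    """
    descriptor = []
    for level in range(levels):
        cells = 2 ** level                      # cells per side at this level
        histograms = [[0] * vocabulary_size for _ in range(cells * cells)]
        for x, y, word in features:
            col = min(int(x * cells / width), cells - 1)
            row = min(int(y * cells / height), cells - 1)
            histograms[row * cells + col][word] += 1
        for histogram in histograms:
            descriptor.extend(histogram)
    return descriptor

# Four quantized features in a 10x10 image, 2-word vocabulary, levels 0 and 1.
features = [(1, 1, 0), (9, 1, 1), (1, 9, 1), (9, 9, 0)]
print(spatial_pyramid(features, 10, 10, vocabulary_size=2, levels=2))
# [2, 2, 1, 0, 0, 1, 0, 1, 1, 0]
```

Level 0 is the plain bag of features; the finer levels record where in the image each visual word occurred.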

Scene category dataset

Classification results

Part 2: part-based models

Problem with bag-of-words
• All have equal probability for bag-of-words methods.
• Location information is important.

Model: Parts and Structure

Pictorial structures
• Introduced by Fischler and Elschlager in 1973.
• Part-based models: each part represents local visual properties.
• “Springs” capture spatial relationships.

Local evidence + global decision
• Parts have a match quality at each image location.
• Local evidence is noisy.
• Parts are detected in the context of the whole model.

Matching problem
• The model is represented by a graph $G = (V, E)$.
• $V = \{v_1, \ldots, v_n\}$ are the parts.
• $(v_i, v_j) \in E$ indicates a connection between parts.
• $m_i(l_i)$ is the cost of placing part $i$ at location $l_i$.
• $d_{ij}(l_i, l_j)$ is a deformation cost.
• The optimal configuration for the object is $L = (l_1, \ldots, l_n)$ minimizing
$$E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j)$$
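The energy can be evaluated directly, and for a tiny model the optimal configuration can be found by trying every L. The sketch below assumes 1-D integer locations, parts indexed from 0, and a single deformation function shared by all edges (a simplification of the per-edge $d_{ij}$). All names and numbers are hypothetical.

```python
from itertools import product

def energy(L, match_cost, edges, deform_cost):
    """E(L) = sum_i m_i(l_i) + sum over edges (i, j) of d(l_i, l_j)."""
    unary = sum(match_cost[i][L[i]] for i in range(len(L)))
    pairwise = sum(deform_cost(L[i], L[j]) for i, j in edges)
    return unary + pairwise

def brute_force(match_cost, edges, deform_cost):
    """Try all k^n configurations; only feasible for very small models."""
    n, k = len(match_cost), len(match_cost[0])
    return min(product(range(k), repeat=n),
               key=lambda L: energy(L, match_cost, edges, deform_cost))

# Hypothetical model: n = 3 parts in a chain, k = 3 candidate locations each.
match_cost = [[0, 4, 4],   # m_0(l) for l = 0, 1, 2
              [4, 4, 0],   # m_1(l)
              [4, 0, 4]]   # m_2(l)
edges = [(0, 1), (1, 2)]

def deform(a, b):
    return abs(a - b)      # illustrative deformation cost

best = brute_force(match_cost, edges, deform)
print(best, energy(best, match_cost, edges, deform))  # (0, 2, 1) 3
```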

Matching problem
$$E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j)$$
• Assume $n$ parts and $k$ possible locations for each part.
• There are $k^n$ configurations $L$.
• If the graph is a tree, we can use dynamic programming: an $O(nk^2)$ algorithm.
• If $d_{ij}(l_i, l_j) = g(l_i - l_j)$, we can use min-convolutions: an $O(nk)$ algorithm.
• As fast as matching each part separately!

Different connectivity structures
[Figure: connectivity structures of part-based models with their matching complexities; the complexity labels in the figure are O(N^6), O(N^2), O(N^2) and O(N^3). Works cited in the figure: Fergus et al. ’03; Fei-Fei et al. ’03; Crandall et al. ’05; Csurka ’04; Vasconcelos ’00; Felzenszwalb & Huttenlocher ’00; Bouchard & Triggs ’05; Carneiro & Lowe ’06.]
From “Sparse Flexible Models of Local Features”, Gustavo Carneiro and David Lowe, ECCV 2006

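Dynamic programming on trees
$$E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j)$$
• For each $l_1$ find the best $l_2$: $\mathrm{Best}_2(l_1) = \min_{l_2} \left[ m_2(l_2) + d_{12}(l_1, l_2) \right]$
• “Delete” $v_2$ and solve the problem with the smaller model.
• Keep removing leaves until there is a single part left.
[Figure: leaf $v_2$ connected to $v_1$]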
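The leaf-removal procedure can be sketched as follows. This illustration assumes parts are numbered so that every part's parent has a smaller index (part 0 is the root), locations are integers 0..k-1, and one deformation function is shared by all edges; the names and the example model are hypothetical.

```python
def match_tree(match_cost, parent, deform_cost):
    """Minimize E(L) on a tree in O(n k^2) by repeatedly deleting leaves.

    parent[j] is the index of part j's parent, with parent[j] < j; parent[0] is unused.
    """
    n, k = len(match_cost), len(match_cost[0])
    cost = [list(row) for row in match_cost]   # m_i plus the Best tables of deleted children
    choice = [[0] * k for _ in range(n)]       # choice[j][l_i]: best l_j given the parent at l_i
    for j in range(n - 1, 0, -1):              # children are processed before their parents
        i = parent[j]
        for li in range(k):
            best = min(range(k), key=lambda lj: cost[j][lj] + deform_cost(li, lj))
            choice[j][li] = best
            cost[i][li] += cost[j][best] + deform_cost(li, best)   # "delete" part j
    L = [0] * n
    L[0] = min(range(k), key=lambda l: cost[0][l])   # a single part is left: pick its best spot
    for j in range(1, n):                            # then place the deleted parts again
        L[j] = choice[j][L[parent[j]]]
    return L, cost[0][L[0]]

# Hypothetical chain 0 - 1 - 2 with k = 3 locations per part.
match_cost = [[0, 4, 4], [4, 4, 0], [4, 0, 4]]
parent = [None, 0, 1]
print(match_tree(match_cost, parent, lambda a, b: abs(a - b)))  # ([0, 2, 1], 3)
```

Each edge costs O(k^2) because every parent location is paired with every child location, which gives the O(nk^2) total.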

Min-convolution speedup
• $\mathrm{Best}_2(l_1) = \min_{l_2} \left[ m_2(l_2) + d_{12}(l_1, l_2) \right]$
• Brute force: $O(k^2)$, where $k$ is the number of locations.
• Suppose $d_{12}(l_1, l_2) = g(l_1 - l_2)$: $\mathrm{Best}_2(l_1) = \min_{l_2} \left[ m_2(l_2) + g(l_1 - l_2) \right]$
• Min-convolution: $O(k)$ if $g$ is convex.
[Figure: parts $v_1$ and $v_2$]
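For one particular convex choice, $g(d) = c\,|d|$ with $c \ge 0$ on a 1-D grid of locations, the min-convolution can be computed with a forward and a backward pass. The sketch below compares that two-pass version with the brute-force definition; it covers only this special case, and other convex costs (for example quadratic) need a different linear-time algorithm. Names and data are hypothetical.

```python
def min_convolution_brute(m, c):
    """Best(l1) = min over l2 of [m(l2) + c * |l1 - l2|], computed in O(k^2)."""
    k = len(m)
    return [min(m[l2] + c * abs(l1 - l2) for l2 in range(k)) for l1 in range(k)]

def min_convolution_linear(m, c):
    """Same result in O(k) for g(d) = c * |d| with c >= 0, using two passes."""
    out = list(m)
    for i in range(1, len(out)):             # forward pass: best l2 at or left of l1
        out[i] = min(out[i], out[i - 1] + c)
    for i in range(len(out) - 2, -1, -1):    # backward pass: best l2 right of l1
        out[i] = min(out[i], out[i + 1] + c)
    return out

m2 = [5, 0, 7, 7, 1, 9]   # hypothetical match costs m_2(l_2) at six locations
print(min_convolution_brute(m2, 2))   # [2, 0, 2, 3, 1, 3]
print(min_convolution_linear(m2, 2))  # [2, 0, 2, 3, 1, 3]
```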

Finding motorbikes. Model with 6 parts: 2 wheels, 2 headlights, front and back of seat. (Pedro F. Felzenszwalb, University of Chicago)

Human Pose Estimation Pedro F. Felzenszwalb, University of Chicago

Example: five parts (eyes, nose, corners of the mouth). Training results:

Results

Occlusions

Pascal challenge
• The goal is to recognize objects from a number of visual object classes in realistic scenes.
• Only 20 object classes have been selected.
• There are two ways to take part:
  1. Participants may use systems built or trained using any methods or data, excluding the provided test sets.
  2. Systems are to be built or trained using only the provided training/validation data.

Pascal challenge The 20 classes are:

Other datasets
• Web-based annotation: building large annotated databases by relying on the collaborative effort of a large population of users.
• ESP and Peekaboom internet games: “bored human intelligence”.
• LabelMe

ESP game

LabelMe

Comparison of the years

References
• Richard Szeliski, Computer Vision: Algorithms and Applications: http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf
• Pedro F. Felzenszwalb, University of Chicago
• Juan Carlos Niebles and Ranjay Krishna, Stanford University
• http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
• Wikipedia, “Bag of words”

Thanks for listening