Convolutional Restricted Boltzmann Machines (CRBMs) for Feature Learning
Mohammad Norouzi
Advisor: Dr. Greg Mori
CS @ Simon Fraser University
27 Nov 2009 1

Problems
• Handwritten digit classification
• Human detection 3

Sliding Window Approach 4

Sliding Window Approach (Cont’d)
[Figure: decision boundary illustration; INRIA Person Dataset] 5

Success or failure of an object recognition algorithm hinges on the features used.
Input → Feature representation (our focus) → Learning / Classifier → Label (Human / Background, 0/1/2/3/…) 6

Local Feature Detector Hierarchies
Moving up the hierarchy, features become larger, more complicated, and less frequent. 7

Generative & Layerwise Learning
[Diagram: layers trained generatively, one at a time, with the CRBM] 8

Visual Features: Filtering
Filter kernel (feature), e.g. Sobel-style edge kernels:
  1  0 -1        -1  0  1
  2  0 -2        -2  0  2
  1  0 -1        -1  0  1
Filter response: [filtered-image figures] 9
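The filtering step above amounts to sliding the kernel over the image and taking a dot product at each location. A minimal sketch in plain NumPy, using the Sobel-style kernel from the slide and a made-up toy image (illustrative only, not the thesis's actual filters):

```python
import numpy as np

def filter2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the
    image and take a dot product at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# Sobel-style kernel from the slide: responds to vertical edges.
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]], dtype=float)

# Toy image: dark left half, bright right half (a vertical edge).
img = np.zeros((5, 5))
img[:, 3:] = 1.0
response = filter2d(img, sobel_x)  # strong response at the edge
```

The response map is large in magnitude exactly where the image crosses the vertical edge, which is the sense in which a filter kernel acts as a feature detector.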

Our approach to feature learning is generative (the CRBM model)
[Diagram: binary hidden variables over the image] 10

Related Work 11

Related Work
• Convolutional Neural Network (CNN) [Lecun et al. 98] (discriminative)
– Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation.
– Does not perform well on natural images [Ranzato et al. CVPR'07]
• Biologically plausible models (no learning)
– Hand-crafted first layer vs. randomly selected prototypes for second layer [Serre et al., PAMI'07] [Mutch and Lowe, CVPR'06] 12
Related Work (cont’d)
• Deep Belief Net [Hinton et al., NC'2006] (generative, unsupervised)
– A two-layer partially observed MRF, called the RBM, is the building block
– Learning is performed unsupervised and layer-by-layer, from the bottom layer upwards
• Our contributions: we incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add more complicated components such as pooling and sparsity into deep belief nets 13

Why Generative & Unsupervised
• Discriminative learning of deep and large neural networks has not been successful
– Requires large training sets
– Easily gets over-fitted for large models
– First-layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first-layer features generatively
– Switch to a discriminative model to select the discriminative features from those that are learned
– Discriminative fine-tuning is helpful

Details 15

CRBM
• The image is the visible layer, and the hidden layer is related to filter responses
• An energy-based probabilistic model, whose energy involves a dot product of vectorized matrices 16
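The slide's energy function did not survive extraction; for reference, a CRBM energy of the form used in this line of work (notation assumed, not taken from the slide) is:

$$ E(\mathbf{v},\mathbf{h}) = -\sum_{k=1}^{K} \mathbf{h}^k \bullet (\tilde{W}^k \ast \mathbf{v}) \;-\; \sum_{k=1}^{K} b_k \sum_{i,j} h^k_{ij} \;-\; c \sum_{i,j} v_{ij} $$

where $\ast$ denotes 2-D convolution, $\tilde{W}^k$ is filter $k$ flipped in both dimensions, and $\bullet$ is the dot product of vectorized matrices mentioned on the slide.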

Training CRBMs
• Maximum-likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable [Gibbs chain: data → sample]
• For CD learning we need to compute the conditionals P(h | v) and P(v | h). 17
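CD learning can be sketched most simply for the plain (non-convolutional) binary RBM; the convolutional case replaces the matrix products with convolutions. This is an illustrative CD-1 update with assumed variable names and shapes, not the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: batch of visible vectors, shape (n, nv)."""
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a visible reconstruction.
    pv1 = sigmoid(h0 @ W.T + c)
    ph1 = sigmoid(pv1 @ W + b)
    # Gradient approximation: <v h>_data - <v h>_reconstruction.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b = b + lr * (ph0 - ph1).mean(axis=0)
    c = c + lr * (v0 - pv1).mean(axis=0)
    return W, b, c

nv, nh = 6, 4
W = 0.01 * rng.standard_normal((nv, nh))
b, c = np.zeros(nh), np.zeros(nv)
v = (rng.random((8, nv)) < 0.5).astype(float)  # toy binary data
W, b, c = cd1_step(v, W, b, c)
```

The key point CD sidesteps is the intractable partition function: instead of sampling from the model distribution, the negative statistics come from a single Gibbs step away from the data.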

CRBM (Backward)
• Nearby hidden variables cooperate in reconstruction
• The conditional probabilities take the form 18

Learning the Hierarchy
• The structure is trained bottom-up and layerwise
• The CRBM model is used for training the filtering layers
• Filtering layers are followed by down-sampling layers
Pipeline: CRBM filtering → non-linearity → pooling (reduces the dimensionality) → … → classifier 19
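The down-sampling step in the pipeline can be sketched as non-overlapping max pooling over a filter-response map (an illustrative sketch with a made-up response map; the thesis's exact pooling operator may differ):

```python
import numpy as np

def max_pool(resp, p=2):
    """Down-sample a response map by taking the max over
    non-overlapping p x p blocks (truncating ragged edges)."""
    h, w = resp.shape
    h, w = h - h % p, w - w % p
    blocks = resp[:h, :w].reshape(h // p, p, w // p, p)
    return blocks.max(axis=(1, 3))

resp = np.array([[1., 2., 0., 1.],
                 [3., 4., 1., 0.],
                 [0., 1., 5., 2.],
                 [1., 0., 2., 6.]])
pooled = max_pool(resp)  # 4x4 map -> 2x2 map
```

Each pooled unit keeps only the strongest response in its block, which both reduces dimensionality and buys a small amount of translation invariance for the next layer.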

Responses
[Figure: input image, 1st-layer filters and responses, 2nd-layer filters and responses]

Experiments 21

Evaluation
MNIST digit dataset
• Training set: 60,000 images of digits of size 28 x 28
• Test set: 10,000 images
INRIA person dataset
• Training set: 2416 person windows of size 128 x 64 pixels and 4.5 x 10^6 negative windows
• Test set: 1132 positive and 2 x 10^6 negative windows 22

First-layer filters
• MNIST unlabeled digits: 15 filters of 5 x 5
• Gray-scale images of the INRIA positive set: 15 filters of 7 x 7 23

Second-Layer Features (MNIST)
• Hard to visualize the filters
• Instead, we show patches that respond strongly to each filter: 24

Second Layer Features (INRIA) 25

MNIST Results
• MNIST error rate when the model is trained on the full training set 26

Results: False Positives
[Figures: top five false positives, ranked 1st through 5th] 27-32

INRIA Results
• Adding our large-scale features significantly improves the performance of the HOG baseline 33

Conclusion
• We extended the RBM model to the Convolutional RBM, useful for domains with spatial locality
• We exploited CRBMs to train local hierarchical feature detectors, one layer at a time and generatively
• This method obtained results comparable to the state of the art in digit classification and human detection 34

Thank You 35

Hierarchical Feature Detector
[Diagram: stacked feature-detector layers] 36

Contrastive Divergence Learning 37

Training CRBMs (Cont'd)
• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is greater than 1
– Partition visible units into middle and border regions
• Instead of maximizing the likelihood, we (approximately) maximize

Enforcing Feature Sparsity
• The CRBM's representation is K (the number of filters) times overcomplete
• After a few CD learning iterations, V is perfectly reconstructed
• We enforce sparsity to tackle this problem
– Hidden bias terms were frozen at large negative values
• Having a single non-sparse hidden unit improves the learned features
– This might be related to the ergodicity condition
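The effect of freezing hidden biases at large negative values falls directly out of the logistic activation: the bottom-up input must overcome the bias before a unit turns on, so units fire rarely. The numbers below are illustrative, not from the thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Example bottom-up input (filter response) reaching a hidden unit.
activation = 3.0

p_on_free   = sigmoid(activation + 0.0)   # bias ~ 0: unit fires often
p_on_sparse = sigmoid(activation - 8.0)   # frozen bias of -8: unit rarely fires
```

With a bias of zero the unit is on with probability above 0.9, while the frozen negative bias pushes that below 0.01, which is the sparsity the slide describes.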

Probabilistic Meaning of Max
[Figure: numbered hidden units grouped for pooling]

The Classifier Layer
• We used an SVM as our final classifier
– RBF kernel for MNIST
– Linear kernel for INRIA
– For INRIA we combined our 4th-layer outputs and HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
– This lets the discriminative model set the thresholds itself

Why are HOG features added?
• Because part-like features are very sparse
• Having a template of the human figure helps a lot

RBM
• A two-layer pairwise MRF with a full set of hidden-visible connections [diagram: visible units v, hidden units h, weights w]
• The RBM is an energy-based model
• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward
• Contrastive Divergence learning is used for training
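The conditionals behind "inference is straightforward" did not survive extraction; they take the standard factorized textbook form, sketched below with assumed names and made-up shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Standard binary-RBM conditionals:
#   p(h_j = 1 | v) = sigmoid(sum_i w_ij v_i + b_j)
#   p(v_i = 1 | h) = sigmoid(sum_j w_ij h_j + c_i)
# Each side factorizes given the other, so inference is a single
# matrix product plus an elementwise sigmoid.
def p_h_given_v(v, W, b):
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, c):
    return sigmoid(h @ W.T + c)

W = np.zeros((3, 2))          # zero weights -> every conditional is 0.5
v = np.array([1., 0., 1.])
probs = p_h_given_v(v, W, np.zeros(2))
```

Because there are no hidden-hidden or visible-visible connections, both conditionals are exact, with no iterative inference needed.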

Why Unsupervised Bottom-Up
• Discriminative learning of deep structures has not been successful
– Requires large training sets
– Is easily over-fitted for large models
– First-layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first-layer features generatively
– Later, switch to a discriminative model to select the discriminative features from those learned

INRIA Results (Cont'd)
• Miss rate at different FPPW rates
• FPPI is a better indicator of performance
• More experiments on the size of features and the number of layers are desired






