Building High-Level Features Using Large Scale Unsupervised Learning

Building High-Level Features Using Large Scale Unsupervised Learning Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, A. Y. Ng Presenting: Din Malachi Tomer Gafni

OUTLINE: Introduction, Architecture, Learning and Optimization, Results, Conclusions

Introduction The Problem: building high-level, class-specific feature detectors from only unlabeled data. For example: is it possible to learn a face detector using only unlabeled images? Inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain (“grandmother neurons”). The need for large labeled sets poses a significant challenge for problems where labeled data are rare.

Three Kinds of Learning

High-Level and Low-Level Features Low-level features: minor details of the image, such as lines or dots, that can be picked up by, say, a convolutional filter. High-level features: built on top of low-level features to detect objects and larger shapes in the image.

High-Level and Low-Level Features Typically, the first few convolutional layers learn filters for finding lines, dots, curves, etc., while the later layers learn to recognize common objects and shapes.

From Recent Works to the Presented Work Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain high-level features (e.g., images with a bounding box around the face). Approaches that make use of unlabeled data have worked well for building low-level features, but poorly for building high-level features.

From Recent Works to the Presented Work In this work, we address the problem by scaling up the core components involved in training deep networks. Data set: 200 x 200 images from over 10 million YouTube videos. Model: deep autoencoder. Computational resources: 1,000 machines (16,000 cores). Press coverage: "Google Builds a Brain that Can Search for Cat Videos", Time, June 2012; "How Many Computers to Identify a Cat? 16,000", NYT, June 2012.

Autoencoder

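Restricted Boltzmann Machines (RBM) Introduced by Paul Smolensky (1986, as the "Harmonium") and popularized by Geoffrey Hinton. Only two layers (visible and hidden). Fully connected between the two layers, with no connections within a layer. Sigmoid activation function. Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012. 599-619.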

Restricted Boltzmann Machines (RBM) Reconstructions: the hidden activations become the inputs; the same weights are reused in the backward direction; the reconstructions are the outputs. Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012. 599-619.
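The forward and reconstruction passes described above can be sketched as follows. This is a minimal illustration, not the training procedure from the cited guide: it propagates probabilities instead of sampling binary states, and the weights, biases, and function names are hypothetical values chosen only for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, b_hidden):
    """P(h_j = 1 | v) for each hidden unit j; W[i][j] links visible i to hidden j."""
    return [sigmoid(b_hidden[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hidden))]

def reconstruct(h, W, b_visible):
    """P(v_i = 1 | h): the same W is reused, read in the opposite direction."""
    return [sigmoid(b_visible[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_visible))]

# Toy example: 3 visible units, 2 hidden units (weights chosen arbitrarily).
W = [[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5]]
v = [1.0, 0.0, 1.0]
h = hidden_probs(v, W, [0.0, 0.0])         # activations of the hidden layer
v_rec = reconstruct(h, W, [0.0, 0.0, 0.0])  # reconstruction of the visible layer
```

Training would adjust W so that v_rec moves closer to v; that step is omitted here.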

Multiple Layers of RBMs First, train between the visible layer and the first hidden layer. Then hidden layer 1 becomes the new ‘visible’ layer, and so on. The resulting stack is called a Deep Belief Network.

Back to the Autoencoder Unrolled multiple layers of RBMs. Hierarchy of representations with increasing level of abstraction. Each module transforms its input representation into a higher-level one.

Architecture of the Presented Paper Sparse autoencoder with three important ingredients: local receptive fields, pooling, and Local Contrast Normalization (LCN).

First Sublayer - Local Receptive Fields 18 x 18 pixel receptive-field windows. 8 feature maps (channels). Each neuron connects to all input channels.

Second Sublayer - Pooling L2 pooling: taking the square root of the sum of the squares of the activations. 5 x 5 overlapping windows. Each pooling unit pools over a single feature map.
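A minimal sketch of L2 pooling over one feature map, assuming a stride of 1 (the slide only says the windows overlap) and a plain list-of-lists representation. The function name and the toy input are illustrative.

```python
import math

def l2_pool(feature_map, size=5, stride=1):
    """L2-pool one 2-D feature map with overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Sum of squared activations inside the window, then square root.
            total = sum(feature_map[r + dr][c + dc] ** 2
                        for dr in range(size) for dc in range(size))
            out_row.append(math.sqrt(total))
        pooled.append(out_row)
    return pooled

ones = [[1.0] * 6 for _ in range(6)]
pooled = l2_pool(ones)  # 2 x 2 output; each value is sqrt(25) = 5.0
```

Because each call handles a single map, pooling stays within one feature, as on the slide.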

Third Sublayer - Local Contrast Normalization 5 x 5 overlapping windows. Connects to all input channels. Relatively dominant activations are preferred over high activations on all features. Enforces a sort of local competition between adjacent features, and between features at the same spatial location in different feature maps.
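One way to picture local contrast normalization is a subtractive step followed by a divisive step. The sketch below is a simplification under stated assumptions: it uses uniform window weights, clips windows at the border, and uses an arbitrary floor of 0.01; the function names are hypothetical and the weighting used in the paper may differ.

```python
import math

def mean(values):
    return sum(values) / len(values)

def neighborhood(maps, r, c, half):
    """All values, across every channel, in the window centered at (r, c).
    The window is clipped at the image border (an assumption of this sketch)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [m[i][j]
            for m in maps
            for i in range(max(0, r - half), min(rows, r + half + 1))
            for j in range(max(0, c - half), min(cols, c + half + 1))]

def apply_per_unit(maps, fn):
    """Build new maps where each unit is fn(channel_index, row, col)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[[fn(k, r, c) for c in range(cols)] for r in range(rows)]
            for k in range(len(maps))]

def local_contrast_normalize(maps, size=5, floor=0.01):
    half = size // 2
    # Subtractive step: remove the local mean taken over space and channels.
    centered = apply_per_unit(
        maps, lambda k, r, c: maps[k][r][c] - mean(neighborhood(maps, r, c, half)))

    # Divisive step: divide by the local root-mean-square, floored so that
    # flat regions are not amplified.
    def divide(k, r, c):
        rms = math.sqrt(mean([g * g for g in neighborhood(centered, r, c, half)]))
        return centered[k][r][c] / max(floor, rms)

    return apply_per_unit(centered, divide)

flat = [[[3.0] * 6 for _ in range(6)] for _ in range(2)]  # two constant channels
normalized = local_contrast_normalize(flat)  # constant input carries no contrast
```

A uniformly high input maps to zero everywhere, while a unit that stands out from its neighbors keeps a large response, which is the "relatively dominant" behavior the slide describes.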

One Layer Summary

Learning and Optimization Global reconstruction cost: ensures the representations encode important information about the data, i.e. they can reconstruct the input data. Group sparsity / spatial pooling: • Applied to the outputs of the second sublayer. • A lower sum of activations is preferred. • Encourages pooling to group similar features together to achieve invariances.
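The two terms above can be combined into a per-example cost. The sketch below assumes the cost is the squared reconstruction error plus a weighted sum of the L2-pooled outputs; the weight `lam`, the smoothing constant `eps`, and all names are illustrative choices, not values taken from the slides.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def sparse_autoencoder_cost(x, W_enc, W_dec, H, lam=0.1, eps=1e-2):
    a = matvec(W_enc, x)      # first sublayer: linear filter responses
    x_hat = matvec(W_dec, a)  # decoder: reconstruction of the input
    reconstruction = sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    # Second sublayer: each row of H selects the filters pooled together.
    squared = [ai * ai for ai in a]
    pooled = [math.sqrt(eps + s) for s in matvec(H, squared)]
    # A lower sum of pooled activations is preferred (sparsity term).
    return reconstruction + lam * sum(pooled)

# Toy example: identity encoder/decoder, one pooling unit over both filters.
identity = [[1.0, 0.0], [0.0, 1.0]]
cost = sparse_autoencoder_cost([1.0, 2.0], identity, identity, [[1.0, 1.0]])
```

With the identity weights the reconstruction is exact, so the whole cost comes from the sparsity term.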

Training the Network

Experiments Analysis of the learned representation in recognizing faces (“the face detector”). The test set consists of 37,000 images, of which 13,026 are labeled faces and the rest are distractors. After training, we use this set to measure the performance of each neuron in classifying faces against distractors.
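Scoring a single neuron as a classifier can be sketched by sweeping a threshold over its activations and keeping the best accuracy. The choice of 20 equally spaced thresholds between the minimum and maximum activation is an assumption of this sketch, and the activations and labels below are made-up toy data.

```python
def best_threshold_accuracy(activations, is_face, n_thresholds=20):
    """Best accuracy of the rule 'activation >= t means face' over a threshold sweep."""
    lo, hi = min(activations), max(activations)
    best = 0.0
    for k in range(n_thresholds):
        t = lo + (hi - lo) * k / (n_thresholds - 1)
        correct = sum((a >= t) == label for a, label in zip(activations, is_face))
        best = max(best, correct / len(activations))
    return best

acts = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]              # toy neuron outputs
labels = [True, True, True, False, False, False]   # True = face, False = distractor
accuracy = best_threshold_accuracy(acts, labels)
```

Repeating this for every neuron and taking the maximum identifies the "best neuron" for a given class.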

Results The best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training (81.7% accuracy in detecting faces). When the LCN sublayers are removed, the accuracy of the best neuron drops to 78.5%. Figure: histogram of activation values for face images (red) and random images (blue). Even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors.

Visualization Is the optimal stimulus of the neuron really a face? First method: visualizing the most responsive stimuli in the test set. Second method: performing numerical optimization to find the optimal stimulus, x* = argmax_x f(x; W, H) subject to ||x||_2 = 1, where f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x.

Visualization Top 48 stimuli of the best neuron from the test set. The optimal stimulus according to numerical constrained optimization.
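The second method can be sketched as projected gradient ascent: take a step that increases the neuron's output, then rescale the input back to unit norm. The unit-norm constraint, the fixed step size, the finite-difference gradient, and the linear `toy_neuron` standing in for f(x; W, H) are all assumptions made for this illustration.

```python
import math

def normalize(x):
    """Project x onto the unit sphere (||x||_2 = 1)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def optimal_stimulus(f, x0, step=0.1, iterations=200):
    x = normalize(x0)
    for _ in range(iterations):
        g = numerical_gradient(f, x)
        # Ascent step followed by projection back onto the constraint set.
        x = normalize([xi + step * gi for xi, gi in zip(x, g)])
    return x

w = [3.0, 4.0]  # hypothetical weights of a toy linear neuron
toy_neuron = lambda x: sum(wi * xi for wi, xi in zip(w, x))
x_star = optimal_stimulus(toy_neuron, [1.0, 0.0])
```

For a linear neuron the best unit-norm input points along the weight vector, so x_star approaches [0.6, 0.8]; for the real network f is nonlinear and the result depends on the starting point.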

Invariance Properties The face detector is robust against common object transformations: translation, scaling, and out-of-plane rotation.

Cat and Human Body Detector Is the network able to detect other high-level concepts? We construct two datasets, one for classifying human bodies and one for classifying cat faces. The high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% on cat faces and 76.7% on human bodies.

Summary of Numerical Comparisons

Conclusions In this work we simulated high-level, class-specific neurons using unlabeled data. The work shows that it is possible to train neurons to be selective for high-level concepts (human faces, human bodies, cat faces) using entirely unlabeled data. These neurons naturally capture complex invariances such as out-of-plane rotation and scale invariance.