Learning Deep Energy Models • Jiquan Ngiam, Zhenghao Chen, Pang Wei Koh, Andrew Y. Ng
Energy-Based Models (EBM) • Energy-based models associate a scalar energy with each configuration of the variables of interest • Learning corresponds to modifying that energy function • Desirable configurations should have low energy
Energy-Based Models (EBM) • Energy-based models define a probability distribution through an energy function: P(x) = exp(−E(x)) / Z • Lower energy corresponds to higher probability • Z = Σ_x exp(−E(x)) is the normalizing factor, called the partition function
Energy-Based Models (EBM) • An energy-based model can be learned by stochastic gradient descent on the empirical negative log-likelihood of the training data • The log-likelihood and loss function are similar to those of logistic regression
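The objective equation on this slide did not survive extraction; a minimal reconstruction in standard EBM notation, assuming a training set D of N examples:

```latex
% Empirical negative log-likelihood over a training set D = {x^(1), ..., x^(N)}
\mathcal{L}(\theta, \mathcal{D}) = -\frac{1}{N}\sum_{i=1}^{N} \log P\big(x^{(i)}\big),
\qquad P(x) = \frac{e^{-E(x)}}{Z}, \quad Z = \sum_{x} e^{-E(x)}
```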
EBM with Hidden Units • A data point x has an observable part (still denoted x) and a hidden part h • The free energy is defined as F(x) = −log Σ_h exp(−E(x, h)) • Which allows us to write P(x) = exp(−F(x)) / Z, with Z = Σ_x exp(−F(x))
EBM with Hidden Units • Given • We have
EBM with Hidden Units • The negative log-likelihood gradient for a data point x is −∂log P(x)/∂θ = ∂F(x)/∂θ − Σ_x̃ P(x̃) ∂F(x̃)/∂θ • Hence the average negative log-likelihood gradient over the training set is E_P̂[∂F(x)/∂θ] − E_P[∂F(x)/∂θ], where P̂ is the training-set empirical distribution and P is the model's distribution • The expectation over the model distribution is hard to compute and is usually estimated through sampling
Restricted Boltzmann Machine • The energy function of an RBM is defined as E(v, h) = −bᵀv − cᵀh − hᵀWv, where W is the matrix of weights connecting hidden and visible units, and b and c are the offsets of the visible and hidden layers
Restricted Boltzmann Machine • Sampling in an RBM can be carried out using Gibbs sampling on a Markov chain (see the sketch below) • Efficient learning algorithms have been developed for training RBMs
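A minimal sketch of the alternating (block) Gibbs updates for a binary-binary RBM, using the parameters W, b, c from the previous slide; this is an illustration under those assumptions, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v, W, b, c, n_steps=1, rng=np.random.default_rng()):
    """Run block Gibbs sampling in a binary-binary RBM starting from visible state v.
    W has shape (n_hidden, n_visible); b, c are visible and hidden offsets."""
    for _ in range(n_steps):
        # Sample hidden units given visibles: p(h_j = 1 | v) = sigmoid(c_j + W_j . v)
        h = (rng.random(c.shape) < sigmoid(c + W @ v)).astype(float)
        # Sample visible units given hiddens: p(v_i = 1 | h) = sigmoid(b_i + (W^T h)_i)
        v = (rng.random(b.shape) < sigmoid(b + W.T @ h)).astype(float)
    return v, h
```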
Deep Belief Network (DBN) • A deep belief network is a graphical model with undirected connections at the top hidden layers and directed connections in the lower layers • Greedy layerwise training by stacking restricted Boltzmann machines, each of which models the posterior distribution of the previous layer • Computationally expensive to train all layers of the DBN jointly
Deep Boltzmann Machine (DBM) • DBMs have undirected connections between all layers of the network • A similar layerwise training algorithm using RBMs is used to initialize the DBM, and all layers are jointly trained thereafter • Both DBNs and DBMs have multiple stochastic hidden layers, which makes inference and learning difficult, as computing the conditional posterior over the hidden units is intractable • Inference and learning are more tractable in models with only a single stochastic hidden layer (an RBM)
Deep Energy Models (DEM) • A feedforward neural network deterministically transforms the input • The output of the feedforward network is modeled with a layer of stochastic hidden units – The feedforward network extracts features from the input which are more easily modeled with a single stochastic layer • Since the hidden layers of the feedforward network are deterministic, all layers of the network can be trained jointly and efficiently
Deep Energy Models • Comparison of DBN, DBM and DEM • Dotted arrows in the DEM represent deterministic relationships
Deep Energy Models • Let gθ(v) denote the feedforward output of a neural network gθ • The undirected connection between gθ(v) and the set of binary stochastic hidden units h defines an energy function
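The energy on this slide did not survive extraction. A plausible reconstruction, assuming Gaussian (real-valued) visible units v, binary hidden units h, and the parameters W, c, σ introduced on the next slide; the exact form in the paper may differ in signs or bias terms:

```latex
% Assumed form: Gaussian visible units v, binary hidden units h
E(v, h) = \frac{1}{2\sigma^2} v^\top v - h^\top \big( W g_\theta(v) + c \big)

% Corresponding free energy after summing out the binary h
F(v) = \frac{1}{2\sigma^2} v^\top v - \sum_k \log\Big(1 + \exp\big(W_k g_\theta(v) + c_k\big)\Big)
```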
Learning Deep Energy Models • The model is trained by maximizing the log-likelihood of a training set • The parameters are those of the feedforward network gθ, the variance parameter σ, the weights W and the biases c • The derivative of the log-likelihood is ∂/∂θ E_P̂[log P(v)] = E_P[∂F(v)/∂θ] − E_P̂[∂F(v)/∂θ] • The first term is an expectation of the partial derivative (of the free energy) over the model distribution and the second an expectation over the data • The second term is straightforward to compute, but the first term requires sampling
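A minimal sketch of the stochastic gradient estimate implied by this slide: the data-term expectation minus the model-term expectation over the free-energy gradient. The helper grad_free_energy (returning ∂F(v)/∂θ, e.g. via backpropagation through gθ) is a hypothetical name, not from the paper:

```python
import numpy as np

def nll_gradient(grad_free_energy, data_batch, model_samples):
    """Estimate the gradient of the average negative log-likelihood:
    E_data[dF/dtheta] - E_model[dF/dtheta]."""
    data_term = np.mean([grad_free_energy(v) for v in data_batch], axis=0)
    model_term = np.mean([grad_free_energy(v) for v in model_samples], axis=0)
    return data_term - model_term

# A parameter update then descends this estimate (i.e., ascends the log-likelihood):
# theta -= learning_rate * nll_gradient(grad_free_energy, batch, hmc_samples)
```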
Learning Deep Energy Models • To sample from the model distribution, Hybrid Monte Carlo (HMC) is used • HMC draws samples from the model distribution by performing a physical simulation of an energy-conserving system to generate proposal moves • Using the HMC sampler to estimate the gradient of the data log-likelihood in this way is known as contrastive backpropagation
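For illustration, a minimal HMC transition over the free energy F(v) with a single leapfrog trajectory and a Metropolis accept/reject step. The helpers free_energy and grad_free_energy_v (gradient with respect to v) are hypothetical names, and the step size and trajectory length are not those used in the paper:

```python
import numpy as np

def hmc_step(v, free_energy, grad_free_energy_v, step_size=0.01, n_leapfrog=20,
             rng=np.random.default_rng()):
    """One Hybrid Monte Carlo transition targeting p(v) proportional to exp(-F(v))."""
    p = rng.normal(size=v.shape)                      # sample auxiliary momentum
    v_new, p_new = v.copy(), p.copy()

    # Leapfrog integration of the energy-conserving dynamics
    p_new -= 0.5 * step_size * grad_free_energy_v(v_new)
    for _ in range(n_leapfrog - 1):
        v_new += step_size * p_new
        p_new -= step_size * grad_free_energy_v(v_new)
    v_new += step_size * p_new
    p_new -= 0.5 * step_size * grad_free_energy_v(v_new)

    # Metropolis accept/reject using the total (potential + kinetic) energy
    current_H = free_energy(v) + 0.5 * np.sum(p ** 2)
    proposed_H = free_energy(v_new) + 0.5 * np.sum(p_new ** 2)
    if rng.random() < np.exp(current_H - proposed_H):
        return v_new
    return v
```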
Greedy Layerwise Training • DBN: train an RBM to model the posteriors of the hidden units in the previous layer • DEM: train the next layer to optimize the data likelihood while freezing the parameters of the earlier layers • The learning objective of the DEM is the data likelihood of the entire deep model
Folding Hidden Layers • After training a network with l layers, we can “fold” the top hidden layer into the model (a plausible form is sketched below) • Stochastic hidden layer -> deterministic layer • The next layer can then be trained using the folded network as its feedforward network gθ
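The folding equation did not survive extraction. A plausible reconstruction, assuming the trained stochastic layer's binary units are replaced by their conditional expectations (sigmoid activations); the layer superscripts are notational assumptions:

```latex
% Fold the trained stochastic layer (weights W^{(l)}, biases c^{(l)}) into the
% deterministic feedforward network by replacing its binary units with their expectations
g_\theta^{(l+1)}(v) = \sigma\big( W^{(l)} g_\theta^{(l)}(v) + c^{(l)} \big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```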
Joint Training for Multiple Layers • Joint training simply unfreezes the weights of the previous layers while optimizing the same objective function • In DBNs and DBMs, this requires sampling all the hidden layers of the network • A DEM does not require sampling the hidden layers of the network, as it uses deterministic hidden units • In practice, mixing greedy layerwise steps with joint training steps performed well
The Product of Student-t Model • Products of Experts (PoE) models – A restricted class of EBMs – The distribution is the normalized product of all the distributions represented by the individual “experts” • Product of Student-t (PoT) model
General Form of DEM • The free energy can be written in a general form (a plausible reconstruction is sketched below) • The original deep energy models can be viewed as one particular choice of the per-unit function in this form • By allowing other functions, we can recover models such as the product of Student-t (PoT) • Stacked PoT (SPoT) model: the neural network gθ is parameterized with log(1 + z^2) as the activation function
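The general form and the specific choices on this slide are missing from the extraction. A plausible reconstruction, assuming Gaussian visible units and a per-hidden-unit function f applied to linear projections of gθ(v); the exact functional forms, scale factors, and the PoT exponent α are assumptions:

```latex
% Assumed general form of the free energy
F(v) = \frac{1}{2\sigma^2} v^\top v + \sum_k f\big( W_k g_\theta(v) + c_k \big)

% Original DEM (binary stochastic hidden units):  f(z) = -\log(1 + e^{z})
% Product of Student-t (PoT) experts (assumed):   f(z) = \alpha \log(1 + z^{2})
```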
Experiments on Natural Images • DEM with sigmoidal units (sigmoid-DEM) and the SPoT model are compared • M1-M2-M12 denotes a model trained greedily layerwise on the 1st layer, then the 2nd layer, followed by joint training of the 1st and 2nd layers
Experiments on Natural Images • The first layer M1 quickly plateaus • Adding the second layer improves performance • Joint training resulted in a further improvement
Generative Deep Energy Models • The activations of each layer in gθ are used as features to learn a linear classifier, with weights U, for some associated labels y • The joint energy between the inputs v and the labels y is defined in terms of a_l, the activations of the lth layer, and y, a one-hot vector of image labels (a plausible form is sketched below) • The model is trained with a hybrid objective function
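The joint energy on this slide is missing. One plausible reconstruction, assuming the label term is linear in the layer activations via per-layer classifier weights U_l (the exact coupling used in the paper may differ):

```latex
% F(v) is the generative free energy; a_l are the activations of layer l
E(v, y) = F(v) - \sum_{l} y^\top U_l \, a_l
```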
Object Recognition • The hybrid model outperformed the state-of-the-art 3D DBN by a small margin of 0.3% • The fully discriminative SPoT model overfits and only achieves a test accuracy of 85.7% • Regularizing the model to be generative significantly helps it generalize beyond the training set
Thank You