Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model: Sparse Coding, Independent Component Analysis, and Minimum Entropy Segmentation

Jason Palmer
Department of Electrical and Computer Engineering, University of California San Diego

Introduction
1. Unsupervised learning of structure in continuous sensor data
   1. Data must be analyzed into component parts – reduced to a set of states of the world which are active or not active in various combinations
   2. Probabilistic modeling of states
2. Linear model
   1. Basis sets
   2. Hierarchical linear processes
   3. Also kernel / non-linear
3. Probability model
   1. Distributions of input variables – types of densities
   2. Conditionally independent inputs – Markov connection of states
Thesis topics
1. Types of distributions and representations that lead to efficient and monotonic algorithms using non-Gaussian densities
2. Calculating probabilities in the linear process model

Linear Process Model
• General form of the model, with special cases:
• Sparse coding: x(t) = A s(t), A overcomplete
• ICA: x(t) = A s(t), A invertible, s(t) = A^{-1} x(t)
• Blind deconvolution: x(t) is a convolutive mixture of s(t)
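For concreteness, a small synthetic-data sketch of the three special cases (the dimensions, Laplacian sources, and circular-shift convolution below are illustrative assumptions, not the thesis setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_src = 1000, 3
S = rng.laplace(size=(n_src, T))                 # sparse (Laplacian) sources s(t)

# ICA: square, invertible A, so s(t) = A^{-1} x(t) is recoverable
A_ica = rng.standard_normal((n_src, n_src))
X_ica = A_ica @ S

# Sparse coding: overcomplete A (fewer observations than sources)
A_over = rng.standard_normal((2, n_src))
X_over = A_over @ S

# Blind deconvolution: a convolutive mixture x(t) = sum_k A_k s(t - k)
# (circular shifts are used here only for brevity)
A_taps = 0.3 * rng.standard_normal((5, n_src, n_src))
X_conv = sum(A_taps[k] @ np.roll(S, k, axis=1) for k in range(5))
```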

Voice and Microphone (diagram): IID process → linear filters HT(z), HR(z), HM(z) → observed process

Sensor Arrays (diagram): Source 1 and Source 2 impinging on a sensor array

Convolutive Model

Different Impulse Response

EEG sources

Binocular Color Image Channels (R and L images, each with r, g, b channels)
- Represented as a 2-D field of 6-D vectors
- Binocular video can be represented as a 3-D field of 6-D vectors
- Use a block basis or a mixture of filters

Biological Systems (diagram): world signal x(t) mapped to an internal representation

Linear Process Mixture Model
• Plot of a speech signal: a woman speaking the word "sixteen" (annotated "S-s-s-s-i-i-i-k---ks-s-s----t-t-e-e-n-n")
• Speech is clearly non-stationary, but appears to be locally stationary

Source model (diagram): innovation signals passed through linear filters to generate the source signal s(t)

Observation Segment Model (diagram): mixing filter A(z) producing the observation x(t)

Generative Model (diagram): segment models A1(z), …, AM(z) generating x1(t), …, xM(t), combined into the observed x(t)

Outline
I. Types of probability densities
   A. Sub- and super-Gaussianity
   B. Representation in terms of Gaussians
      1. Convex variational representation – strong super-Gaussians
      2. Gaussian scale mixtures
         a. Multivariate Gaussian scale mixtures (ISA / IVA)
      3. Relationship between representations
II. Sparse coding and dictionary learning
   A. Optimization with a given (overcomplete) basis
      1. MAP – Generalized FOCUSS
         a. Global convergence of Iteratively Reweighted Least Squares (IRLS)
         b. Convergence rates
      2. Variational Bayes (VB) – Sparse Bayesian Learning
   B. Dictionary learning
      1. Lagrangian Newton algorithm
      2. Comparison in a Monte Carlo experiment
III. Independent Component Analysis
   A. Convexity of the ICA optimization problem – stability of ICA
      1. Fisher Information and Cramer-Rao lower bound on variance
   B. Super-Gaussian mixture source model
      1. Comparison between Gaussian and super-Gaussian updates
IV. Linear Process Mixture Model
   1. Probability of signal segments
   2. Mixture model segmentation

Sub- and Super-Gaussianity
• Super-Gaussian = more peaked than Gaussian, heavier tails
• Sub-Gaussian = flatter, more uniform, shorter tails than Gaussian
• Generalized Gaussian exp(-|x|^p): Laplacian (p = 1.0), Gaussian (p = 2.0), sub-Gaussian (p = 10.0)
• The component density determines the shape of the data distribution along the direction of the basis vector: super-Gaussian = concentrated near zero with some large values; sub-Gaussian = uniform around zero with no large values
• Super-Gaussians represent sparse random variables: most often zero, occasionally large in magnitude
• Sparse random variables model quantities with on/off, active/inactive states
(Figure: example super-Gaussian, sub-Gaussian, and combined sub- and super-Gaussian densities.)
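As a quick numerical check (not from the slides), the excess kurtosis of the generalized Gaussian exp(-|x|^p) can be evaluated in closed form for the three shape parameters mentioned above; positive values indicate super-Gaussianity, negative values sub-Gaussianity. The sampling line uses scipy's gennorm, whose shape parameter plays the role of p here.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import gennorm

# Excess kurtosis of the generalized Gaussian exp(-|x|^p):
#   Gamma(5/p) Gamma(1/p) / Gamma(3/p)^2 - 3
for p in (1.0, 2.0, 10.0):
    kurt = gamma(5.0 / p) * gamma(1.0 / p) / gamma(3.0 / p) ** 2 - 3.0
    label = "super-Gaussian" if kurt > 1e-9 else ("sub-Gaussian" if kurt < -1e-9 else "Gaussian")
    print(f"p = {p:4.1f}: excess kurtosis = {kurt:+.3f}  ({label})")

# Samples for visual comparison of peakedness and tails, if desired
samples = {p: gennorm.rvs(p, size=10_000) for p in (1.0, 2.0, 10.0)}
```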

Convex Variational Representation
• Convex / concave functions are the pointwise supremum / infimum of linear functions.
• A convex function f(x) may be concave in x^2, i.e. f(x) = g(x^2) with g concave on (0, ∞).
• Example: |x|^{3/2} is convex, but |x|^{3/2} = g(x^2) with g(y) = y^{3/4} concave, so |x|^{3/2} is concave in x^2.
• Example: |x|^4 is convex, and |x|^4 = g(x^2) with g(y) = y^2 still convex, so x^4 is convex in x^2.
• If f(x) is concave in x^2 and p(x) = exp(-f(x)), we say p(x) is strongly super-Gaussian; if f(x) is convex in x^2, p(x) is strongly sub-Gaussian.
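A minimal numeric illustration of the "concave in x^2" test for the generalized Gaussian family; the grid and the values of p below are arbitrary choices for the sketch.

```python
import numpy as np

# For p(x) proportional to exp(-|x|^p), f(x) = |x|^p = g(x^2) with g(y) = y^(p/2).
# g is concave on (0, inf) exactly when p <= 2 (the strongly super-Gaussian case).
def g_second_derivative(y, p):
    # d^2/dy^2 [ y^(p/2) ] = (p/2) (p/2 - 1) y^(p/2 - 2)
    return (p / 2.0) * (p / 2.0 - 1.0) * y ** (p / 2.0 - 2.0)

y = np.linspace(0.1, 10.0, 200)
for p in (0.5, 1.0, 1.5, 2.0, 4.0):
    concave = bool(np.all(g_second_derivative(y, p) <= 0.0))
    print(f"p = {p:3.1f}: g(y) = y^(p/2) concave on the grid -> {concave}")
```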

Scale Mixture Representation
• Gaussian Scale Mixtures (GSMs) are mixtures (weighted sums) of Gaussian densities with different variances, but all zero mean.
• A random variable with a GSM density can be represented as the product of a standard normal random variable Z and an arbitrary non-negative random variable W: X = Z W^{-1/2}.
• Multivariate densities can be modeled as the product of a non-negative scalar and a Gaussian random vector.
• Contribution: a general formula for multivariate GSMs.
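A small sampling sketch of the construction X = Z W^{-1/2} for a classical GSM example: with W Gamma-distributed (acting as a precision), X is Student-t. The degrees of freedom and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 3.0, 100_000

Z = rng.standard_normal(n)                             # standard normal factor
W = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)  # non-negative scale (precision); numpy's scale = 1/rate
X = Z * W ** -0.5                                      # GSM sample: Student-t with nu degrees of freedom

# Heavy tails relative to a Gaussian show up as a large sample excess kurtosis
print("sample excess kurtosis:", np.mean(X**4) / np.mean(X**2) ** 2 - 3.0)
```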

Relationship between Representations
• Criterion for p(x) = exp(-f(x)) = exp(-g(x^2)) to have a convex variational representation: concavity of g.
• The criterion for a GSM representation is given by the Bernstein-Widder theorem on complete monotonicity (CM).
• For the Gaussian representation, complete monotonicity is required.
• CM relationship (Bochner): if the complete monotonicity criterion holds, then g is concave, and thus the GSM representation implies the convex variational representation.

Sparse Regression – Variational MAP
• Bayesian linear model x = As + v: basis A, sources s, noise v.
• Can always be put in the form: min f(s) subject to As = x, with A overcomplete.
• For strongly super-Gaussian priors, p(s) = exp(-f(s)).
• Sources are independent, so the cost function is f(s) = Σ_i f(s_i) and the IRLS weight matrix is diagonal.
• Solve the reweighted least-squares system at each iteration.
• s_old satisfies As = x, so the right-hand side is negative, hence the left-hand side is negative (guaranteeing descent).

Sparse Regression – MAP – GSM
• For a Gaussian scale mixture prior p(s), we have s = z W^{-1/2}, and s is conditionally Gaussian given the scale variable W, suggesting an EM algorithm.
• The complete log likelihood is quadratic since s is conditionally Gaussian, and it is linear in W.
• For EM we need the expected value of W given x; but W → s → x is a Markov chain, so the conditional expectation given s and x reduces to conditioning on s.
• The GSM EM algorithm is thus the same as the strongly super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS).

Generalized FOCUSS
• The FOCUSS algorithm is a particular MAP algorithm for sparse regression with f(s) = |s|^p or f(s) = log|s|. It was derived by Gorodnitsky and Rao (1997) and Rao and Kreutz-Delgado (1998).
• With an arbitrary strongly super-Gaussian source prior, we obtain Generalized FOCUSS.
• Convergence is proved using Zangwill's Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of the iterates, and (3) closure (continuity) of the algorithm mapping.
• We prove a general theorem on boundedness of IRLS iterations with a diagonal weight matrix: the least-squares solution always lies in a bounded part of the orthant-constraint intersection. (Figure contrasts bounded and unbounded orthant-constraint intersections.)
• We also derive the convergence rate of Generalized FOCUSS for convex f(s). The convergence rate for concave f(s) was proved by Gorodnitsky and Rao; we give an alternative proof.
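A compact sketch of the FOCUSS-style IRLS iteration for min Σ_i |s_i|^p subject to As = x. The damping constant, iteration counts, and the toy problem below are illustrative choices, not the settings analyzed in the thesis.

```python
import numpy as np

def focuss(A, x, p=1.0, n_iter=100, eps=1e-8):
    """FOCUSS-style IRLS sketch for min sum_i |s_i|^p subject to A s = x.
    Assumes A (m x n, m < n) has full row rank."""
    s = A.T @ np.linalg.solve(A @ A.T, x)         # start from the minimum-norm solution
    for _ in range(n_iter):
        Pi = np.abs(s) ** (2.0 - p) + eps         # diagonal IRLS weights Pi_ii = |s_i|^(2-p)
        lam = np.linalg.solve((A * Pi) @ A.T, x)  # solve (A Pi A^T) lam = x
        s = Pi * (A.T @ lam)                      # reweighted least-squares update
    return s

# Toy usage: overcomplete A, sparse ground truth, exact measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 20))
s_true = np.zeros(20)
s_true[[3, 11]] = [1.0, -2.0]
print(np.round(focuss(A, A @ s_true, p=1.0), 2))
```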

Variational Bayes
• General form of Sparse Bayesian Learning / Type II ML:
  – Find the Normal density (mean and covariance) that minimizes an upper bound on the KL divergence from the true posterior density.
  – OR: MAP estimation of the hyperparameters (the scale variables W) in the GSM, instead of s.
  – OR: a Variational Bayes algorithm which finds the separable posterior q(s|x) q(W|x) that minimizes the KL divergence from the true posterior p(s, W | x).
• The bound is derived using a modified Jensen's inequality.
• Then minimize the bound by coordinate descent as before. This is also IRLS, with the same functional form but different diagonal weights.
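A minimal Sparse Bayesian Learning sketch in the Type II / EM style outlined above, under the simplifying assumptions of Gaussian conditionals s_i ~ N(0, gamma_i) and a fixed noise variance (the thesis treats general GSM priors, so this is illustrative only).

```python
import numpy as np

def sbl(A, x, sigma2=1e-3, n_iter=200):
    """EM-style Sparse Bayesian Learning sketch: alternate the Gaussian posterior
    over s with an update of the scale hyperparameters gamma."""
    n = A.shape[1]
    gamma = np.ones(n)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(1.0 / gamma))  # posterior covariance
        mu = Sigma @ A.T @ x / sigma2                                   # posterior mean
        gamma = np.maximum(mu ** 2 + np.diag(Sigma), 1e-12)             # EM hyperparameter update
    return mu, gamma

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 20))
s_true = np.zeros(20)
s_true[[4, 15]] = [1.5, -1.0]
x = A @ s_true + 0.01 * rng.standard_normal(10)
mu, _ = sbl(A, x)
print(np.round(mu, 2))
```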

Sparse Regression Example
• An example of sparse regression with an overcomplete basis.
• The line is the one-dimensional solution space (translated null space).
• Below, the posterior density p(s|x) in the null space is plotted for generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2.

Dictionary Learning
• Problem: Given data x_1, …, x_N, find an (overcomplete) basis A for which As = x and the sources are sparse.
• Three existing algorithms: (1) Lewicki-Sejnowski ICA, (2) Kreutz-Delgado FOCUSS-based, (3) Girolami VB-based.
• We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado's algorithm.
• These algorithms all have the same general form of dictionary update: (1) Lewicki-Sejnowski, (2) Kreutz-Delgado, (3) Girolami VB, (4) Lagrangian Newton.
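To make the overall structure concrete, here is a generic alternating dictionary-learning sketch; it uses an l1 (ISTA) sparse-coding step and a plain gradient step on A with column renormalization, and is not meant to reproduce any of the four algorithms listed above.

```python
import numpy as np

def ista(A, x, lam=0.1, n_iter=200):
    # Simple ISTA solver for min 0.5 ||x - A s||^2 + lam ||s||_1 (the sparse-coding step)
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the smooth part
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = s + A.T @ (x - A @ s) / L            # gradient step
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # soft threshold
    return s

def learn_dictionary(X, n_atoms, n_epochs=50, lr=0.05, seed=0):
    """Alternate sparse coding of each data column with a gradient update of A."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    A = rng.standard_normal((m, n_atoms))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(n_epochs):
        S = np.column_stack([ista(A, X[:, t]) for t in range(N)])
        A += lr * (X - A @ S) @ S.T / N          # reduce average reconstruction error
        A /= np.linalg.norm(A, axis=0)           # keep atoms at unit norm
    return A
```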

Dictionary Learning Monte Carlo
• Experiment: generate random A matrices, sparse sources s, and data x = As, with N = 100m.
• Test algorithms:
  – Girolami: p = 1.0, Jeffreys prior
  – Lagrangian Newton: p = 1.0, p = 1.1
  – Kreutz-Delgado: normalized and non-normalized
  – Lewicki-Sejnowski: p = 1.1, logistic
• Test cases: A 2 x 3 with sparsity 1; A 4 x 8 with sparsity 2; A 10 x 20 with sparsity 1-5.

Sparse Coding of EEG
• Goal: find synchronous "events" in multiple interesting components.
• Learn a basis for segments of length 100 across 5 channels.
• Events are rare, so the prior density is sparse.
(Figure: EEG scalp maps.)

EEG Segment Basis: Subspace 1
• Experimental task: the subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise clicks the right button.
• Each column is a basis vector: a segment of length 100 x 5 channels.
• Only the second channel is active in this subspace – related to incorrect responses by the subject – the subject hears a buzzer when a wrong response is given.
• Dictionary learning with time series must learn phase shifts.

EEG Segment Basis: Subspace 2 • In this subspace, channels 1 and 3 are active • Channel 3 crest slightly precedes channel 1 crest • This subspace is associated with correct response

EEG Segment Basis: Subspace 3
• In this subspace, channels 1 and 2 have phase-shifted 20 Hz bursts
• Not obviously associated with any recorded event

ICA
• ICA model: x = As, with A invertible, s = Wx.
• Maximum likelihood estimation of W = A^{-1}.
• For independent sources the likelihood factorizes over the source densities.
• The source densities are unknown and must be adapted – quasi-ML (Pham 1992).
• Since ML minimizes KL divergence over a parametric family, ML with the ICA model is equivalent to minimizing mutual information.
• If the sources are Gaussian, A cannot be identified, only the covariance.
• If the sources are non-Gaussian, A can be identified (Cheng, Rosenblatt).
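A minimal natural-gradient ML ICA sketch with a fixed tanh score function (a common super-Gaussian choice); the thesis adapts the source densities rather than fixing them, so this is only an illustration of the W = A^{-1} estimation.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=500, lr=0.01):
    """Natural/relative-gradient ascent on the ICA log-likelihood with a tanh score."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        S = W @ X
        phi = np.tanh(S)                              # score function for a super-Gaussian model
        W += lr * (np.eye(n) - phi @ S.T / N) @ W     # natural-gradient update
    return W

# Toy usage: mix two Laplacian sources and unmix them
rng = np.random.default_rng(3)
S_true = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = natural_gradient_ica(A @ S_true)
print(np.round(W @ A, 2))    # close to a scaled permutation matrix at convergence
```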

ICA Hessian
• Remarkably, the expected value of the Hessian of the ICA ML cost function can be calculated.
• Work with the "global system" C = WA, whose optimum is always the identity, C* = I.
• Using independence of the sources at the optimum, we can block diagonalize the Hessian linear operator H(B) = D in the global space into 2 x 2 blocks.
• This yields the main condition for positive definiteness and convexity of the ML problem at the optimum.
• The expected Hessian is the Fisher Information matrix; its inverse is the Cramer-Rao lower bound on the variance of unbiased estimators. (Plot shows the bound for an off-diagonal element with a generalized Gaussian prior.)
• The Hessian also allows a Newton method. For EM stability, the condition is replaced by a modified one.

Super-Gaussian Mixture Model
• The variational formulation also allows the derivation of a generalization of the Gaussian mixture model to strongly super-Gaussian mixtures.
• The update rules are similar to those of the Gaussian mixture model, but include the variational parameters.

Source Mixture Model Examples

ICA Mixture Model – Images • Goal: find an efficient basis for representing image patches. Data vectors are 12 x 12 blocks.

Covariance Square Root Sphere Basis

ICA: Single Basis

ICA Mixture Model: Model 1

ICA Mixture Model: Model 2

Image Segmentation 1
• Using the learned models, we classify each image block as coming from Model 1 or Model 2.
• Lower left shows the raw probability of Model 1; lower right shows the binary segmentation.
• Blue captures the high-frequency ground.
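A sketch of this block-classification step, assuming two learned (square) ICA models with unmixing matrices W1, W2 and Laplacian source densities; the density choice and the implicit uniform model prior are illustrative assumptions.

```python
import numpy as np

def block_loglik(x, W):
    # Log-likelihood of a vectorized block under an ICA model with unmixing matrix W
    # and unit-scale Laplacian sources: log|det W| + sum_i log(0.5 exp(-|s_i|))
    s = W @ x
    return np.linalg.slogdet(W)[1] + np.sum(-np.abs(s) - np.log(2.0))

def segment(blocks, W1, W2):
    """Label each block 1 or 2 according to the higher-likelihood model."""
    return np.array([1 if block_loglik(x, W1) >= block_loglik(x, W2) else 2
                     for x in blocks])
```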

Image Segmentation 2
• Again we classify each image block as coming from Model 1 or Model 2.
• Lower left shows the raw probability of Model 1; lower right shows the binary segmentation.
• Blue captures the high-frequency tree bark.

Image Model 1 basis densities
• Low-frequency components are not sparse and may be multimodal
• Edge filters in Model 1 are not as sparse as the higher-frequency components of Model 2

Image Model 2 densities • Densities are very sparse • Higher frequency components occur less often in the data • Convergence is less smooth

Generalized Gaussian shape parameter histograms
• Image bases (histogram over shape parameters, axis marks at 1.2 and 2.0): more sparse, edge filters, etc.
• EEG bases (same axis): less sparse, biological signals

Rate Distortion Theory
• A theorem of Gray shows that, for a finite autoregressive process, the optimal rate transform is the inverse of the mixing filter. (Diagram: Z(t) → H(z) → X(t) → H^{-1}(z) → Z(t).)
• For difference distortion measures, the result relates R_Z(D) and R_X(D).
• The proof seems to extend to general linear systems, and potentially to mixture models.
• To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate.

Time Series Segment Likelihood
• Multichannel convolution is a linear operation; the corresponding matrix is block Toeplitz.
• To calculate the likelihood we need its determinant – an extension of the Szegö limit theorem.
• This can be extended to multi-dimensional fields, e.g. image convolution.
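A sketch of a Szegö-style approximation to the log-determinant of the block-Toeplitz convolution matrix of a multichannel FIR filter A(z), assuming A(z) is invertible on the unit circle; the frequency-grid size is an arbitrary choice.

```python
import numpy as np

def blocktoeplitz_logdet(A_taps, T, n_freq=1024):
    """Approximate log|det| of the length-T block-Toeplitz matrix of the filter
    with m x m taps A_taps[k], via a Szego-type limit:
        (1/T) log|det| ~ mean over frequencies of log|det A(e^{jw})|."""
    A_freq = np.fft.fft(A_taps, n=n_freq, axis=0)            # frequency response, shape (n_freq, m, m)
    logdets = np.array([np.linalg.slogdet(A_freq[k])[1] for k in range(n_freq)])
    return T * logdets.mean()

# Toy usage: a 2-channel, 3-tap filter and a segment of length T = 256
taps = np.array([np.eye(2), 0.5 * np.eye(2), 0.25 * np.eye(2)])
print(blocktoeplitz_logdet(taps, T=256))
```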

Segmented EEG source time series
• The Linear Process Mixture Model was run on several sources with 2 models.
• Coherent activity is identified and segmented blindly.
• Spectral density resolution is greatly enhanced by eliminating noise.

Spectral Density Enhancement
• Spectra before segmentation / rejection (left) and after (right).
• Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum.
(Panels: all-series spectrum and segmented spectrum for the Source A and Source B channels.)

Future Work
• Fully implement the hierarchical linear process model
• Implement a Hidden Markov Model to learn relationships among the various model states
• Test new multivariate dependent density models
• Implement the multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates
• Implement the Linear Process Mixture Model in VLSI circuits

Publications
• "A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian Priors," Proceedings of the 36th Asilomar Conference on Signals and Systems, 2002.
• "A General Framework for Component Estimation," Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
• "Variational EM Algorithms for Non-Gaussian Latent Variable Models," Advances in Neural Information Processing Systems, 2005.
• "Super-Gaussian Mixture Source Model for ICA," Proceedings of the 6th International Symposium on Independent Component Analysis, 2006.
• "Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution," submitted to ICASSP 2007.

Summary of Contributions
• Convergence proof for the Generalized FOCUSS algorithm
• Proposal of the notion of strong super-Gaussianity and clarification of its relationship to Gaussian scale mixtures and kurtosis
• Extension of Gaussian scale mixtures to general multivariate dependency models using derivatives of the univariate density
• Derivation of the super-Gaussian mixture model, generalizing the monotonic Gaussian mixture algorithm with the same complexity
• Derivation of a Lagrangian Newton algorithm for overcomplete dictionary learning – best performance in Monte Carlo simulations
• Analysis of the convexity of EM-based ICA
• Proposal of the Linear Process Mixture Model and derivation of segment probability, enabling probabilistic modeling of nonstationary time series