Probabilistic Graphical Models
Variational Inference II: Mean Field and Generalized MF
Eric Xing, School of Computer Science, Carnegie Mellon University
Lecture 17, November 5, 2009
© Eric Xing @ CMU, 2005-2009
Inference Problems
- Compute the likelihood of observed data
- Compute the marginal distribution over a particular subset of nodes
- Compute the conditional distribution for disjoint subsets A and B
- Compute a mode of the density
Methods we have:
- Brute force: individual computations are independent
- Elimination: sharing of intermediate terms
- Message passing: forward-backward, max-product/BP, junction tree
Exponential Family GMs
- Canonical parameterization: p(x; θ) = exp{<θ, φ(x)> − A(θ)}, with canonical parameters θ, sufficient statistics φ(x), and log-normalization function A(θ)
- Effective canonical parameters, assembled from the local potentials of the graphical model
- Regular family: the domain {θ : A(θ) < +∞} is an open set
- Minimal representation: there does not exist a nonzero vector a such that <a, φ(x)> is a constant
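As a concrete sanity check of the canonical parameterization above, here is a minimal sketch using the Bernoulli distribution, the simplest minimal exponential family; the specific θ value is just an illustrative number:

```python
import math

# Bernoulli as a minimal exponential family:
#   p(x; theta) = exp(theta * x - A(theta)),  x in {0, 1},
# with sufficient statistic phi(x) = x and log-partition
#   A(theta) = log(1 + exp(theta)).

def log_partition(theta):
    # A(theta) computed by brute-force summation over the state space
    return math.log(sum(math.exp(theta * x) for x in (0, 1)))

theta = 1.3
# The closed form agrees with the brute-force sum
assert abs(log_partition(theta) - math.log(1 + math.exp(theta))) < 1e-12

# The density normalizes: sum_x exp(theta*x - A(theta)) = 1
total = sum(math.exp(theta * x - log_partition(theta)) for x in (0, 1))
assert abs(total - 1.0) < 1e-12
```

The same recipe (enumerate the state space, sum the unnormalized weights) is what makes small discrete examples useful for checking the variational identities in the rest of the lecture.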
Mean Parameterization
- The mean parameter associated with a sufficient statistic φ is defined as μ = E_p[φ(X)]
- Realizable mean parameter set: M = {μ : μ = E_p[φ(X)] for some distribution p}
- M is a convex subset of R^d
- In the discrete case, M is the convex hull of {φ(x)}; it is a convex polytope when the state space is finite
Convex Polytope
- Convex hull representation: M = conv{φ(x) : x in the state space}
- Half-plane (half-space) based representation: a finite intersection of linear constraints
- Minkowski-Weyl Theorem: any polytope can be characterized by a finite collection of linear inequality constraints
Conjugate Duality
- The conjugate dual of A: A*(μ) = sup_θ {<θ, μ> − A(θ)} (duality between MLE and Max-Ent)
- For all μ in the interior of M, there is a unique canonical parameter θ(μ) satisfying the moment-matching conditions E_{θ(μ)}[φ(X)] = μ
- The log-partition function has the variational form A(θ) = sup_{μ∈M} {<θ, μ> − A*(μ)}   (*)
- For all θ, the supremum in (*) is attained uniquely at the μ specified by the moment-matching conditions
- The map between θ and μ is a bijection for a minimal exponential family
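The variational form (*) can be checked numerically for the Bernoulli family, where A(θ) = log(1 + e^θ) and A*(μ) is the negative entropy. This is a minimal sketch with a hypothetical θ, maximizing over a grid of μ values:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def A(theta):                      # log-partition of a Bernoulli
    return math.log(1.0 + math.exp(theta))

def A_star(mu):                    # conjugate dual = negative entropy
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

theta = 0.7
# A(theta) = sup_{mu in (0,1)} { theta*mu - A*(mu) }, evaluated on a grid
grid = [i / 10000 for i in range(1, 10000)]
values = [theta * mu - A_star(mu) for mu in grid]
best = max(values)
mu_hat = grid[values.index(best)]

assert abs(best - A(theta)) < 1e-4          # supremum recovers A(theta)
assert abs(mu_hat - sigmoid(theta)) < 1e-3  # attained at moment matching
```

The argmax lands at μ = σ(θ) = E_θ[X], which is exactly the moment-matching condition stated above.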
Variational Inference In General
- An umbrella term for various mathematical tools that give optimization-based formulations of problems, together with techniques for their solution
- General idea:
  - Express a quantity of interest as the solution of an optimization problem
  - The optimization problem can be relaxed in various ways:
    - approximate the function to be optimized
    - approximate the set over which the optimization takes place
- Develops in parallel with MCMC as an alternative family of approximate inference methods
Bethe Variational Problem (BVP)
- We already have:
  - a convex (polyhedral) outer bound on the marginal polytope
  - the Bethe approximate entropy
- Combining the two ingredients, we have:
  - a simple structured problem (differentiable objective, and a constraint set that is a simple polytope)
  - the sum-product algorithm is the solver!
Connection to Sum-Product Alg.
- Lagrangian method for the BVP
- Sum-product and the Bethe variational problem (Yedidia et al., 2002):
  - For any graph G, any fixed point of the sum-product updates specifies a pair (τ*, λ*) that is a stationary point of the Lagrangian
  - For a tree-structured MRF, the solution is unique; the pseudomarginals τ* correspond to the exact singleton and pairwise marginal distributions of the MRF, and the optimal value of the BVP is equal to the log-partition function A(θ)
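The tree-exactness claim can be verified on a tiny example. The sketch below runs sum-product on a three-node chain MRF with hypothetical potential values, then compares the resulting beliefs against marginals computed by brute-force enumeration:

```python
import itertools
import math

# 3-node chain 0 - 1 - 2 over binary states, with node potentials psi
# and edge potentials psi_e (hypothetical numbers for illustration).
psi = {0: [1.0, 2.0], 1: [0.5, 1.5], 2: [2.0, 1.0]}
psi_e = {(0, 1): [[1.0, 0.5], [0.5, 2.0]],
         (1, 2): [[1.5, 1.0], [1.0, 0.8]]}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def edge_pot(s, t, xs, xt):
    if (s, t) in psi_e:
        return psi_e[(s, t)][xs][xt]
    return psi_e[(t, s)][xt][xs]

# Sum-product: m[(s, t)][xt] is the message from s to t; on a tree,
# iterating to a fixed point gives the exact marginals.
m = {(s, t): [1.0, 1.0] for s in nbrs for t in nbrs[s]}
for _ in range(10):
    for (s, t) in list(m):
        new = [sum(psi[s][xs] * edge_pot(s, t, xs, xt) *
                   math.prod(m[(u, s)][xs] for u in nbrs[s] if u != t)
                   for xs in (0, 1))
               for xt in (0, 1)]
        z = sum(new)
        m[(s, t)] = [v / z for v in new]

def bp_marginal(s):
    b = [psi[s][xs] * math.prod(m[(u, s)][xs] for u in nbrs[s])
         for xs in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

def exact_marginal(s):
    # brute-force marginals for comparison
    p = [0.0, 0.0]
    for x in itertools.product((0, 1), repeat=3):
        w = (psi[0][x[0]] * psi[1][x[1]] * psi[2][x[2]] *
             edge_pot(0, 1, x[0], x[1]) * edge_pot(1, 2, x[1], x[2]))
        p[x[s]] += w
    z = sum(p)
    return [v / z for v in p]

for s in range(3):
    assert all(abs(a - b) < 1e-10
               for a, b in zip(bp_marginal(s), exact_marginal(s)))
```

On a graph with cycles the same updates may still converge, but, as the next slide notes, the fixed point is then only a stationary point of the (approximate) Bethe problem, not the exact marginals.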
Inexactness of Bethe and Sum-Product
- From the Bethe entropy approximation: the Bethe entropy is not the true entropy, so the optimal value can differ from A(θ) even at realizable pseudomarginals
- From the pseudomarginal outer bound: the inclusion of the marginal polytope in the local-consistency polytope is strict for graphs with cycles
- Example: a single cycle on four nodes (1, 2, 3, 4)
Kikuchi Approximation
- Recall: the Bethe variational method uses a tree-based (Bethe) approximation to the entropy, and a tree-based outer bound on the marginal polytope
- The Kikuchi method extends these tree-based approximations to more general hypertrees
- Generalized pseudomarginal set, with normalization and marginalization constraints
- Hypertree-based approximate entropy
- Hypertree-based generalization of the BVP
Summary So Far
- Formulate the inference problem as a variational optimization problem
- The Bethe and Kikuchi free energies use approximations to the negative entropy
Next Step …
- We will develop a set of lower-bound methods
Tractable Subgraph
- Given a GM with a graph G, a subgraph F is tractable if we can perform exact inference on it
- Example: an embedded tree, or the fully disconnected subgraph
Mean Parameterization
- For an exponential family GM defined with graph G and sufficient statistics φ, the realizable mean parameter set is M(G; φ) = {μ : μ = E_p[φ(X)] for some p}
- For a given tractable subgraph F, the subset of mean parameters of interest is M(F; φ), the mean parameters realizable by distributions respecting the structure of F
- Inner approximation: M(F; φ) ⊆ M(G; φ)
Optimizing a Lower Bound
- Any mean parameter μ ∈ M(F; φ) yields a lower bound on the log-partition function: A(θ) ≥ <θ, μ> − A*(μ)
- Moreover, equality holds iff θ and μ are dually coupled, i.e., μ = E_θ[φ(X)]
- Proof idea: Jensen's inequality
- Optimizing the lower bound gives max_{μ∈M(F)} {<θ, μ> − A*(μ)}; this is an inference!
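The lower-bound property can be seen numerically on a small Ising model: for a fully factorized μ, the dual A*(μ) is the explicit negative entropy, and <θ, μ> − A*(μ) never exceeds A(θ). A minimal sketch with hypothetical parameter values, using brute-force enumeration for the exact log-partition function:

```python
import itertools
import math
import random

# 3-node Ising model:
#   p(x) ∝ exp( sum_s th[s]*x_s + sum_(s,t) thp[s,t]*x_s*x_t ), x_s in {0,1}
th = {0: 0.5, 1: -0.3, 2: 0.2}
thp = {(0, 1): 0.8, (1, 2): -0.4, (0, 2): 0.6}

def log_Z():
    # exact log-partition function by enumeration
    return math.log(sum(
        math.exp(sum(th[s] * x[s] for s in th) +
                 sum(v * x[s] * x[t] for (s, t), v in thp.items()))
        for x in itertools.product((0, 1), repeat=3)))

def lower_bound(mu):
    # <theta, mu> - A*(mu) for fully factorized q with q(x_s = 1) = mu[s];
    # for a product distribution, E_q[x_s x_t] = mu_s * mu_t
    energy = (sum(th[s] * mu[s] for s in th) +
              sum(v * mu[s] * mu[t] for (s, t), v in thp.items()))
    entropy = -sum(m * math.log(m) + (1 - m) * math.log(1 - m) for m in mu)
    return energy + entropy

A = log_Z()
random.seed(0)
for _ in range(100):
    mu = [random.uniform(0.01, 0.99) for _ in range(3)]
    assert lower_bound(mu) <= A + 1e-12  # every factorized mu lower-bounds log Z
```

Maximizing this bound over the factorized μ is exactly the naive mean field problem developed in the later slides.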
Mean Field Methods In General
- However, the lower bound can't be evaluated explicitly in general, because the dual function A* typically lacks an explicit form
- Mean field methods:
  - approximate the lower bound
  - approximate the realizable mean parameter set by a tractable subset M(F)
- The MF optimization problem: maximize the lower bound over M(F), where A* restricted to M(F) has an explicit form
- Still a lower bound? Yes: restricting the feasible set of a maximization can only decrease its value, and the objective is exact on M(F)
KL-divergence
- Kullback-Leibler divergence between two exponential family distributions with the same sufficient statistics:
  - Primal form: D(θ1 || θ2) = A(θ2) − A(θ1) − <μ1, θ2 − θ1>
  - Mixed form: D(μ1 || θ2) = A(θ2) + A*(μ1) − <μ1, θ2>
  - Dual form: D(μ1 || μ2) = A*(μ1) − A*(μ2) − <θ2, μ1 − μ2>
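The primal form can be checked against the definition of KL-divergence for the Bernoulli family. A minimal sketch with hypothetical canonical parameters:

```python
import math

def A(theta):                    # Bernoulli log-partition function
    return math.log(1.0 + math.exp(theta))

def mean(theta):                 # mu = grad A(theta) = sigmoid(theta)
    return 1.0 / (1.0 + math.exp(-theta))

def kl_direct(t1, t2):
    # KL divergence from the definition, summed over the state space
    out = 0.0
    for x in (0, 1):
        p1 = math.exp(t1 * x - A(t1))
        p2 = math.exp(t2 * x - A(t2))
        out += p1 * math.log(p1 / p2)
    return out

def kl_primal(t1, t2):
    # Primal form: D(t1 || t2) = A(t2) - A(t1) - <mu1, t2 - t1>
    return A(t2) - A(t1) - mean(t1) * (t2 - t1)

t1, t2 = 0.4, -1.1
assert abs(kl_direct(t1, t2) - kl_primal(t1, t2)) < 1e-12
```

The mixed form follows by substituting A*(μ1) = <θ1, μ1> − A(θ1), which is why it is the natural objective for mean field, where the first argument is a tractable mean parameter.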
Mean Field and KL-divergence
- Optimizing the lower bound max_{μ∈M(F)} {<θ, μ> − A*(μ)} is equivalent to minimizing a KL-divergence
- Therefore, mean field is doing min_{μ∈M(F)} D(μ || θ), i.e., minimizing KL(q || p) with q restricted to the tractable family
Naïve Mean Field
- Fully factorized variational distribution: q(x) = ∏_s q_s(x_s)
Naïve Mean Field for Ising Model
- Sufficient statistics and mean parameters: φ = {x_s, s ∈ V; x_s x_t, (s,t) ∈ E}, with μ_s = E[x_s] and μ_st = E[x_s x_t]
- Naïve mean field: the factorized distribution forces μ_st = μ_s μ_t
- Realizable mean parameter subset: M_F = {μ : 0 ≤ μ_s ≤ 1, μ_st = μ_s μ_t}
- Entropy: H(μ) = −Σ_s [μ_s log μ_s + (1 − μ_s) log(1 − μ_s)]
- Optimization problem: max over μ ∈ [0,1]^n of Σ_s θ_s μ_s + Σ_(s,t) θ_st μ_s μ_t + H(μ)
Naïve Mean Field for Ising Model
- Coordinate ascent on the optimization problem yields the update rule μ_s ← σ(θ_s + Σ_{t∈N(s)} θ_st μ_t), where σ is the logistic function
- μ_t resembles a "message" sent from node t to s
- the neighborhood N(s) forms the "mean field" applied to s
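The coordinate-ascent update above can be sketched end to end on a tiny model. The parameter values below are hypothetical; the final comparison against exact marginals (obtained by enumeration) illustrates that naive mean field is close but approximate:

```python
import itertools
import math

# 2x2 Ising grid over {0,1}: nodes 0-1 on top, 2-3 below; square edges
th = [0.3, -0.2, 0.1, 0.4]
edges = {(0, 1): 0.5, (0, 2): 0.5, (1, 3): 0.5, (2, 3): 0.5}
nbrs = {s: [] for s in range(4)}
for (s, t), w in edges.items():
    nbrs[s].append((t, w))
    nbrs[t].append((s, w))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Naive mean field coordinate ascent:
#   mu_s <- sigmoid( theta_s + sum_{t in N(s)} theta_st * mu_t )
mu = [0.5] * 4
for _ in range(200):
    for s in range(4):
        mu[s] = sigmoid(th[s] + sum(w * mu[t] for t, w in nbrs[s]))

# Exact marginals P(x_s = 1) by brute-force enumeration, for comparison
pm = [0.0] * 4
Z = 0.0
for x in itertools.product((0, 1), repeat=4):
    w = math.exp(sum(th[s] * x[s] for s in range(4)) +
                 sum(v * x[s] * x[t] for (s, t), v in edges.items()))
    Z += w
    for s in range(4):
        pm[s] += w * x[s]
pm = [p / Z for p in pm]

for s in range(4):
    assert abs(mu[s] - pm[s]) < 0.05  # close, but not exact in general
```

Each sweep updates one μ_s while holding its neighbors fixed, which is exactly the "mean field" interpretation: node s feels the averaged influence of its neighborhood.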
Non-Convexity of Mean Field
- Mean field optimization is always non-convex for any exponential family in which the state space is finite
- The finite convex hull M contains all the extreme points; each extreme point is a point-mass distribution, which is factorized, so all extreme points lie in M_F
- If M_F were a convex set, it would contain the convex hull of the extreme points, i.e., M_F = M, contradicting the strict inclusion
- Despite non-convexity, mean field has been used successfully
Structured Mean Field
- Mean field theory applies to any tractable subgraph
- Naïve mean field is based on the fully unconnected subgraph
- Variants based on structured subgraphs can be derived
Other Notations
- Mean parameterization form
- Distribution form, where the optimization is written over a tractable distribution q rather than over its mean parameters
- Naïve mean field for the Ising model in this notation
Examples to add
- GMF for Ising models
- Factorial HMM
- Bayesian Gaussian model
- Latent Dirichlet allocation
Summary
- Message-passing algorithms (e.g., belief propagation, mean field) solve approximate versions of the exact variational principle in exponential families
- There are two distinct components to the approximations:
  - inner or outer bounds on the marginal polytope M
  - various approximations to the entropy function
- BP: polyhedral outer bound and non-convex Bethe approximation
- MF: non-convex inner bound and exact form of the entropy
- Kikuchi: tighter polyhedral outer bound and better entropy approximation