Learning with Missing Data
Eran Segal, Weizmann Institute

Incomplete Data
• Hidden variables
• Missing values
• Challenges
  • Foundational – is the learning task well defined?
  • Computational – how can we learn with missing data?

Treating Missing Data
• How should we treat missing data?
• Case I: A coin is tossed on a table; occasionally it drops and the measurement is not taken
  • Sample sequence: H, T, ?, H
  • Treat the missing data by ignoring it
• Case II: A coin is tossed, but only heads are reported
  • Sample sequence: H, ?, ?, H
  • Treat the missing data by filling it in with Tails
• We need to consider the mechanism by which the data went missing (see the sketch below)
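
A minimal numeric sketch (not from the slides) of why the two cases call for different treatments; the toss sequences match the examples above.

```python
# Maximum-likelihood estimates of P(Heads) under the two treatments described above.

def mle_ignore_missing(tosses):
    """Case I: the coin drops at random, so we may simply discard the '?' entries."""
    observed = [t for t in tosses if t != "?"]
    return observed.count("H") / len(observed)

def mle_fill_with_tails(tosses):
    """Case II: only heads are reported, so every '?' is known to hide a Tails."""
    completed = ["T" if t == "?" else t for t in tosses]
    return completed.count("H") / len(completed)

print(mle_ignore_missing(["H", "T", "?", "H"]))   # 2/3
print(mle_fill_with_tails(["H", "?", "?", "H"]))  # 2/4; ignoring here would wrongly give 1.0
```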

Modeling the Data-Missing Mechanism
• X = {X1, ..., Xn} are random variables
• OX = {OX1, ..., OXn} are observability variables
  • Always observed
• Y = {Y1, ..., Yn} are new random variables
  • Val(Yi) = Val(Xi) ∪ {?}
  • Yi is a deterministic function of Xi and OXi: Yi = Xi when Xi is observed, and Yi = ? otherwise (see the sketch below)
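
A tiny sketch (an assumed formulation, following the definition above) of the deterministic observation variable:

```python
# Y_i deterministically combines the true value X_i with its observability O_Xi.
MISSING = "?"

def observe(x_i, o_xi):
    """Return Y_i: the value X_i if it was observed (O_Xi = True), otherwise '?'."""
    return x_i if o_xi else MISSING

print(observe("H", True), observe("T", False))  # H ?
```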

Modeling Missing Data Mechanism
• Case I (random missing values): OX does not depend on X
• Case II (deliberate missing values): OX depends on X
[Diagrams: in both cases X and OX are parents of the observed Y; Case II adds an edge from X to OX]

Treating Missing Data
• When can we ignore the missing-data mechanism and focus only on the likelihood?
  • For every Xi, Ind(Xi; OXi)
  • Missing at Random (MAR) is sufficient: the probability that the value of Xi is missing is independent of its actual value, given the other observed values
• In both cases, the likelihood decomposes (see the derivation sketched below)
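
A short sketch of the standard argument (not spelled out on the slide) for why MAR lets us drop the mechanism: write θ for the parameters of P(X), ψ for the parameters of the observability model, o for the observed values, and h for the hidden ones. Then

```latex
P(o, o_X \mid \theta, \psi)
  = \sum_{h} P(o, h \mid \theta)\, P(o_X \mid o, h, \psi)
  \overset{\text{MAR}}{=} P(o_X \mid o, \psi) \sum_{h} P(o, h \mid \theta)
  = P(o_X \mid o, \psi)\, P(o \mid \theta)
```

so maximizing over θ can ignore the P(o_X | o, ψ) factor entirely.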

Hidden (Latent) Variables
• Attempt to learn a model with hidden variables
  • In this case, MAR always holds (the variable is always missing)
• Why should we care about unobserved variables?
[Diagrams: a network X1, X2, X3 → H → Y1, Y2, Y3 with 17 parameters, versus the same distribution with H marginalized out, which requires 59 parameters (see the count below)]
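
A quick count reproducing the 17 vs. 59 figures, assuming all variables are binary with table CPDs (with H marginalized out, each Yi depends on X1..X3 and the preceding Y's):

```python
# Free parameters of a table CPD for a binary node with k binary parents: 2**k.
def cpd_params(num_parents):
    return 2 ** num_parents

# With the hidden variable: X1, X2, X3 -> H -> Y1, Y2, Y3
with_h = 3 * cpd_params(0) + cpd_params(3) + 3 * cpd_params(1)

# With H marginalized out: Y1 | X1:3, Y2 | Y1, X1:3, Y3 | Y1, Y2, X1:3
without_h = 3 * cpd_params(0) + cpd_params(3) + cpd_params(4) + cpd_params(5)

print(with_h, without_h)  # 17 59
```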

Hidden (Latent) Variables
• Hidden variables also appear in clustering
• Naïve Bayes model:
  • The Class (cluster) variable is hidden
  • The observed attributes are independent given the class
[Diagram: a hidden Cluster node with children X1, X2, ..., Xn; the Xi are observed, possibly with missing values]

Likelihood for Complete Data
[Figure: a table of fully observed training instances over X and Y, the two-node network X → Y with CPDs P(X) and P(Y|X), and the CPD table with entries θy0|x0, θy1|x0, θy0|x1, θy1|x1]
• The likelihood decomposes by variables
• The likelihood decomposes within CPDs (see the sketch below)
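
A minimal sketch of the decomposition for the X → Y network (the data and parameter values are illustrative, not taken from the slide): with complete data, the log-likelihood is a sum of independent terms, one per CPD entry, driven only by counts.

```python
import math
from collections import Counter

data = [("x0", "y0"), ("x0", "y1"), ("x1", "y0")]          # fully observed instances

theta_x = {"x0": 0.6, "x1": 0.4}
theta_yx = {("y0", "x0"): 0.7, ("y1", "x0"): 0.3,
            ("y0", "x1"): 0.2, ("y1", "x1"): 0.8}

counts_x = Counter(x for x, _ in data)
counts_xy = Counter(data)

# One term per variable, and within each CPD one term per entry.
log_lik = (sum(n * math.log(theta_x[x]) for x, n in counts_x.items())
           + sum(n * math.log(theta_yx[(y, x)]) for (x, y), n in counts_xy.items()))
print(log_lik)
```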

Likelihood for Incomplete Data
[Figure: the same X → Y network and CPD table, but the training instances now contain missing values ("?")]
• The likelihood does not decompose by variables
• The likelihood does not decompose within CPDs
• Computing the likelihood of each instance requires inference! (see the sketch below)
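
Continuing the illustrative two-node example, a sketch of why each instance now needs inference: a missing value is summed out per instance, so the per-instance likelihood mixes entries from different CPDs and no longer factors into independent terms.

```python
import math

theta_x = {"x0": 0.6, "x1": 0.4}
theta_yx = {("y0", "x0"): 0.7, ("y1", "x0"): 0.3,
            ("y0", "x1"): 0.2, ("y1", "x1"): 0.8}

def instance_likelihood(x, y):
    """Sum out whichever of X, Y is missing ('?') -- a tiny inference step."""
    xs = [x] if x != "?" else ["x0", "x1"]
    ys = [y] if y != "?" else ["y0", "y1"]
    return sum(theta_x[xv] * theta_yx[(yv, xv)] for xv in xs for yv in ys)

data = [("?", "y0"), ("x0", "y1"), ("?", "y0")]            # instances with missing values
print(sum(math.log(instance_likelihood(x, y)) for x, y in data))
```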

Bayesian Estimation
[Diagram: the meta-Bayesian network for parameter estimation, with parameter nodes θX, θY|X=0, θY|X=1 and instance nodes X[1], X[2], ..., X[M] and Y[1], Y[2], ..., Y[M]]
• With missing data, the parameter posteriors are not independent

Identifiability
• The likelihood can have multiple global maxima
• Example: the two-node network H → Y, with H hidden
  • We can rename the values of the hidden variable H
  • If H has two values, the likelihood has two global maxima
• With many hidden variables, there can be an exponential number of global maxima
• Multiple local and global maxima can also occur with missing data (not only with hidden variables)

MLE from Incomplete Data
• Nonlinear optimization problem: maximize L(D | θ)
• Gradient Ascent:
  • Follow the gradient of the likelihood w.r.t. the parameters
  • Add line search and conjugate-gradient methods to get fast convergence

MLE from Incomplete Data
• Nonlinear optimization problem: maximize L(D | θ)
• Expectation Maximization (EM):
  • Use the "current point" to construct an alternative function (which is "nice")
  • Guarantee: the maximum of the new function has a better score than the current point

MLE from Incomplete Data
• Nonlinear optimization problem: maximize L(D | θ)
• Both Gradient Ascent and EM:
  • Find local maxima
  • Require multiple restarts to approximate the global maximum
  • Require computations (inference) in each iteration

Gradient Ascent
• Theorem: for table CPDs,
    ∂ log P(D | θ) / ∂ θxi|pai = Σm P(xi, pai | o[m], θ) / θxi|pai
• Proof: how do we compute ∂ P(o[m] | θ) / ∂ θxi|pai ?

Gradient Ascent
[Equation slide: derivation of the gradient of the likelihood]

Gradient Ascent
• Requires computing P(xi, pai | o[m], θ) for all i, m (see the sketch below)
• Can be done with the clique-tree algorithm, since Xi and Pai are in the same clique
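
A minimal sketch for the illustrative two-node network X → Y (not code from the lecture): the gradient entry for a table-CPD parameter θy|x is Σm P(x, y | o[m], θ) / θy|x, so every gradient evaluation runs inference per instance.

```python
theta_x = {"x0": 0.5, "x1": 0.5}
theta_yx = {("y0", "x0"): 0.5, ("y1", "x0"): 0.5,
            ("y0", "x1"): 0.5, ("y1", "x1"): 0.5}
data = [("?", "y0"), ("x0", "y1"), ("?", "y0")]            # "?" marks a missing value

def posterior_xy(x_obs, y_obs):
    """P(X, Y | o[m], theta) by enumeration -- here the whole network is one clique."""
    joint = {(xv, yv): theta_x[xv] * theta_yx[(yv, xv)]
             for xv in ("x0", "x1") for yv in ("y0", "y1")
             if x_obs in ("?", xv) and y_obs in ("?", yv)}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

# d log P(D | theta) / d theta_{y|x} = sum_m P(x, y | o[m], theta) / theta_{y|x}
grad = {k: 0.0 for k in theta_yx}
for x_obs, y_obs in data:
    for (xv, yv), p in posterior_xy(x_obs, y_obs).items():
        grad[(yv, xv)] += p / theta_yx[(yv, xv)]
print(grad)
```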

Gradient Ascent Summary
• Pros
  • Flexible; can be extended to non-table CPDs
• Cons
  • Need to project the gradient onto the space of legal parameters
  • For reasonable convergence, needs to be combined with advanced methods (conjugate gradient, line search)

Expectation Maximization (EM)
• Tailored algorithm for optimizing likelihood functions
• Intuition
  • Parameter estimation is easy given complete data
  • Computing the probability of the missing data is "easy" (= inference) given the parameters
• Strategy
  • Pick a starting point for the parameters
  • "Complete" the data using the current parameters
  • Estimate parameters relative to the data completion
  • Iterate
• The procedure is guaranteed to improve at each iteration

Expectation Maximization (EM)
• Initialize the parameters to θ0
• Expectation (E-step):
  • For each data case o[m] and each family X, U, compute P(X, U | o[m], θi)
  • Compute the expected sufficient statistics for each x, u
• Maximization (M-step):
  • Treat the expected sufficient statistics as observed and set the parameters θi+1 to the MLE with respect to the ESS (a worked sketch follows below)
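
A worked sketch for the illustrative X → Y network with hypothetical data (not the lecture's own code): the E-step computes expected counts by per-instance inference, and the M-step takes the MLE as if those counts were observed.

```python
data = [("?", "y0"), ("x0", "y1"), ("?", "y0")]            # "?" marks a missing value

theta_x = {"x0": 0.4, "x1": 0.6}                           # arbitrary starting point theta_0
theta_yx = {("y0", "x0"): 0.6, ("y1", "x0"): 0.4,
            ("y0", "x1"): 0.3, ("y1", "x1"): 0.7}

for _ in range(50):
    # E-step: expected sufficient statistics N[x] and N[x, y]
    n_x = {"x0": 0.0, "x1": 0.0}
    n_xy = {k: 0.0 for k in theta_yx}
    for x_obs, y_obs in data:
        joint = {(xv, yv): theta_x[xv] * theta_yx[(yv, xv)]
                 for xv in ("x0", "x1") for yv in ("y0", "y1")
                 if x_obs in ("?", xv) and y_obs in ("?", yv)}
        z = sum(joint.values())                            # inference for this instance
        for (xv, yv), p in joint.items():
            n_x[xv] += p / z
            n_xy[(yv, xv)] += p / z
    # M-step: treat the expected counts as observed and take the MLE
    # (a real implementation would smooth the counts to avoid zero denominators)
    theta_x = {xv: n / sum(n_x.values()) for xv, n in n_x.items()}
    theta_yx = {(yv, xv): n_xy[(yv, xv)] / n_x[xv] for (yv, xv) in n_xy}

print(theta_x, theta_yx)
```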

Expectation Maximization (EM)
[Diagram of the EM loop: the initial network X → Y and the training data (X, Y) = (?, y0), (x0, y1), (?, y0) feed the E-step, which uses inference to produce expected counts N(X) and N(X, Y); the M-step reparameterizes to give the updated network; iterate]

Expectation Maximization (EM)
• Formal guarantees:
  • L(D : θi+1) ≥ L(D : θi)
    • Each iteration improves the likelihood
  • If θi+1 = θi, then θi is a stationary point of L(D : θ)
    • Usually, this means a local maximum
• Main cost:
  • Computing the expected counts in the E-step
  • Requires inference for each instance in the training set
    • Exactly the same as in gradient ascent!

EM – Practical Considerations
• Initial parameters
  • Highly sensitive to the starting parameters
  • Choose randomly
  • Choose by guessing from another source
• Stopping criteria
  • Small change in the data likelihood
  • Small change in the parameters
• Avoiding bad local maxima
  • Multiple restarts
  • Early pruning of unpromising starting points

EM in Practice – Alarm Network
• Alarm network
  • Data sampled from the true network
  • 20% of the data randomly deleted
[Figure: the 37-node Alarm network (MINVOLSET, INTUBATION, VENTLUNG, CATECHOL, HR, BP, ...)]

EM in Practice – Alarm Network
[Plots: training error and test error]

Partial Data: Parameter Estimation
• Non-linear optimization problem
• Methods for learning: EM and Gradient Ascent
  • Exploit inference for learning
• Challenges
  • Exploration of a complex likelihood/posterior
    • More missing data → many more local maxima
    • Cannot represent the posterior → must resort to approximations
  • Inference
    • Main computational bottleneck for learning
    • Learning large networks → exact inference is infeasible → resort to approximate inference

Structure Learning with Missing Data
• Distinguish two learning problems
  • Learning the structure for a given set of random variables
  • Introducing new hidden variables
    • How do we recognize the need for a new variable?
    • Where do we introduce a newly added hidden variable within G?
    • Open ended and less understood…

Structure Learning with Missing Data
• Theoretically, there is no problem
  • Define a score, and search for the structure that maximizes it
  • The likelihood term will require gradient ascent or EM
• Practically infeasible
  • Typically we have O(n²) candidates at each search step
  • Evaluating each candidate requires EM
  • Requires inference for each data instance of each candidate
  • Total running time per search step: O(n² × M × #EM iterations × cost of BN inference)

Typical Search
[Diagram: a current candidate structure over A, B, C, D and its neighbors obtained by adding, deleting, or reversing a single edge; each neighbor is marked "Requires EM"]

Structural EM
• Basic idea: use expected sufficient statistics to learn structure, not just parameters
  • Use the current network to complete the data using EM
  • Treat the completed data as "real" to score candidates
  • Pick the candidate network with the best score
  • Use the previously completed counts to evaluate networks in the next step
  • After several steps, compute a new data completion from the current network

Structural EM
• Conceptually
  • The algorithm maintains an actual distribution Q over completed datasets, as well as a current structure G and parameters θG
  • At each step we do one of the following:
    • Use ⟨G, θG⟩ to compute a new completion Q, and redefine θG as the MLE relative to Q
    • Evaluate candidate successors G' relative to Q and pick the best
• In practice (sketched below)
  • Maintain Q implicitly as the model ⟨G, θG⟩
  • Use the model to compute sufficient statistics MQ[x, u] when these are needed to evaluate new structures
  • Use the sufficient statistics to compute MLE estimates of candidate structures
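
A pseudocode-level sketch of the loop just described; `family_counts`, `mle_from_counts`, `neighbors`, and `score_from_counts` are hypothetical helpers standing in for inference, parameter estimation, the add/delete/reverse moves, and a decomposable score.

```python
def structural_em(data, G, theta, n_outer=10, n_search_steps=20):
    for _ in range(n_outer):
        # Freeze the completion model <G0, theta0>; it implicitly represents Q.
        G0, theta0, cache = G, theta, {}

        def M_Q(family, _model=(G0, theta0)):
            """Expected counts M_Q[x, u], computed lazily (by inference) per family."""
            if family not in cache:
                cache[family] = family_counts(*_model, data, family)
            return cache[family]

        theta = mle_from_counts(G, M_Q)                 # parametric step relative to Q
        for _ in range(n_search_steps):                 # structure search against the same Q
            best = max(neighbors(G), key=lambda G2: score_from_counts(G2, M_Q))
            if score_from_counts(best, M_Q) <= score_from_counts(G, M_Q):
                break                                   # no improving add/delete/reverse move
            G, theta = best, mle_from_counts(best, M_Q)
    return G, theta
```

Most candidate evaluations thus reuse cached counts; the expensive inference pass that computes a fresh data completion happens only once per outer round.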

Structural EM Benefits
• Many fewer EM runs
• The score relative to the completed data is decomposable!
  • Utilize the same benefits as structure learning with complete data
  • Each candidate network requires few recomputations
  • Here the savings are large, since each sufficient-statistics computation requires inference
• As in EM, we optimize a simpler score
• Can show improvement and convergence
  • An SEM step that improves the score on the completed data (D+) also improves the real score