Continuous Latent Variables -- Bishop (Xue Tian)
Continuous Latent Variables
• Explore models in which some, or all, of the latent variables are continuous
• Motivation: in many data sets
  – the dimensionality of the original data space is very high
  – the data points all lie close to a manifold of much lower dimensionality
Example
• data set: 100 x 100 pixel grey-level images
• dimensionality of the original data space: 100 x 100 = 10,000
• the digit 3 is embedded in each image, with its location and orientation varied at random
• only 3 degrees of freedom of variability
  – vertical translation
  – horizontal translation
  – rotation
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
PCA-maximum variance formulation
• PCA can be defined as (the goal)
  – the orthogonal projection of the data onto a lower-dimensional linear space (the principal subspace)
  – such that the variance of the projected data is maximized
PCA-maximum variance formulation
[figure] red dots: data points; purple line: principal subspace; green dots: projected points
PCA-maximum variance formulation
• data set: {x_n}, n = 1, 2, ..., N
• x_n: D dimensions
• goal:
  – project the data onto a space having dimensionality M < D
  – maximize the variance of the projected data
PCA-maximum variance formulation
• consider M = 1
• the direction is a D-dimensional unit vector $u_1$, with $u_1^T u_1 = 1$
• each point $x_n$ projects to the scalar $u_1^T x_n$
• mean of the projected data: $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$
• variance of the projected data: $\frac{1}{N}\sum_{n=1}^{N} (u_1^T x_n - u_1^T \bar{x})^2 = u_1^T S u_1$
  – where $S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$ is the data covariance matrix
PCA-maximum variance formulation
• goal: maximize the variance of the projected data
• maximize $u_1^T S u_1$ with respect to $u_1$
• introduce a Lagrange multiplier $\lambda_1$
  – a constrained maximization, to prevent $\|u_1\| \to \infty$
  – the constraint comes from $u_1^T u_1 = 1$: maximize $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$
• setting the derivative with respect to $u_1$ equal to zero gives $S u_1 = \lambda_1 u_1$
  – so $u_1$ is an eigenvector of S
  – the variance is $u_1^T S u_1 = \lambda_1$, so the maximum variance is obtained for the largest eigenvalue $\lambda_1$; $u_1$ is the first principal component
PCA-maximum variance formulation
• define additional PCs in an incremental fashion
• choose each new direction to
  – maximize the projected variance
  – be orthogonal to those already considered
• general case: M-dimensional projection
• the optimal linear projection is defined by
  – the M eigenvectors $u_1, \ldots, u_M$ of S
  – corresponding to the M largest eigenvalues $\lambda_1, \ldots, \lambda_M$
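A minimal sketch of the maximum-variance recipe in NumPy (the function name and variable names are illustrative, not from the slides): form the covariance matrix S, eigendecompose it, and keep the M eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_max_variance(X, M):
    """Return the M eigenvectors of the data covariance with the largest eigenvalues.

    X : (N, D) data matrix, one data point per row.
    """
    x_bar = X.mean(axis=0)                      # sample mean
    S = (X - x_bar).T @ (X - x_bar) / len(X)    # D x D covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder: largest eigenvalue first
    U = eigvecs[:, order[:M]]                   # columns u_1, ..., u_M
    return U, eigvals[order[:M]], x_bar

# projecting the data onto the principal subspace:
#   Z = (X - x_bar) @ U   gives each point's M-dimensional representation
```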
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
PCA-minimum error formulation
• PCA can alternatively be defined as the linear projection that minimizes the average projection cost (the goal)
  – average projection cost: mean squared distance between the data points and their projections
PCA-minimum error formulation
[figure] red dots: data points; purple line: principal subspace; green dots: projected points; blue lines: projection errors
PCA-minimum error formulation
• introduce a complete orthonormal set of D-dimensional basis vectors $\{u_i\}$, $i = 1, \ldots, D$
  – orthonormality: $u_i^T u_j = \delta_{ij}$
• each data point can be represented exactly by a linear combination of the basis vectors: $x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$
• taking the inner product with $u_j$ gives $\alpha_{nj} = x_n^T u_j$, so $x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i$
PCA-minimum error formulation
• approximate each data point using an M-dimensional subspace:
  $\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$
  – the $z_{ni}$ depend on the particular data point
  – the $b_i$ are constants, the same for all data points
• goal: minimize the mean squared distance $J = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$
• setting the derivative with respect to $z_{nj}$, $j = 1, \ldots, M$, to zero gives $z_{nj} = x_n^T u_j$
PCA-minimum error formulation
• setting the derivative with respect to $b_j$, $j = M+1, \ldots, D$, to zero gives $b_j = \bar{x}^T u_j$
• substituting back: $x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \{(x_n - \bar{x})^T u_i\}\, u_i$
• hence $J = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=M+1}^{D} (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=M+1}^{D} u_i^T S u_i$
• remaining task: minimize J with respect to the $u_i$
PCA-minimum error formulation
• consider M = 1, D = 2: minimize $J = u_2^T S u_2$
• introduce a Lagrange multiplier $\lambda_2$
  – a constrained minimization, to prevent $\|u_2\| \to 0$
  – the constraint comes from $u_2^T u_2 = 1$: minimize $u_2^T S u_2 + \lambda_2 (1 - u_2^T u_2)$
• setting the derivative equal to zero gives $S u_2 = \lambda_2 u_2$
  – so $u_2$ is an eigenvector of S
  – $J = \lambda_2$, so the minimum error is obtained by choosing $u_2$ as the eigenvector with the smallest eigenvalue $\lambda_2$
PCA-minimum error formulation
• general case: $J = \sum_{i=M+1}^{D} \lambda_i$
  – J is the sum of the eigenvalues belonging to those eigenvectors that are orthogonal to the principal subspace
• to obtain the minimum value of J:
  – discard the eigenvectors corresponding to the D - M smallest eigenvalues
  – the eigenvectors defining the principal subspace are therefore those corresponding to the M largest eigenvalues
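A small numerical check of this result on made-up data: the mean squared projection error J should equal the sum of the D - M smallest eigenvalues of S.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # toy data, N=200, D=5
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)
eigvals, eigvecs = np.linalg.eigh(S)                        # ascending eigenvalues

M = 2
U = eigvecs[:, -M:]                                         # top-M eigenvectors
X_tilde = x_bar + (X - x_bar) @ U @ U.T                     # projection onto the principal subspace
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))             # mean squared distance
print(np.isclose(J, eigvals[:-M].sum()))                    # True: J = sum of the D-M smallest eigenvalues
```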
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
PCA-application
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization
• note: PCA is unsupervised and depends only on the values x_n
• next: a worked example
PCA-example
• go through the steps to perform PCA on a small set of data
• following "Principal Components Analysis" by Lindsay Smith
• http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
PCA-example
• Step 1: get the data set (D = 2, N = 10)
PCA-example
• Step 2: subtract the mean
PCA-example
• Step 3: calculate the covariance matrix S (S is 2 x 2)
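A hedged sketch of steps 1-3 in NumPy. The data values are illustrative placeholders rather than the tutorial's exact numbers, and np.cov uses the 1/(N-1) convention whereas the earlier slides used 1/N; this rescales the eigenvalues but leaves the eigenvectors unchanged.

```python
import numpy as np

# step 1: get the data (D=2, N=10); placeholder values, not the tutorial's exact data
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# step 2: subtract the mean of each dimension
mean = data.mean(axis=0)
adjusted = data - mean

# step 3: calculate the 2 x 2 covariance matrix S (np.cov normalizes by N-1)
S = np.cov(adjusted, rowvar=False)
print(S)
```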
PCA-example
• Step 4: calculate the eigenvectors and eigenvalues of the covariance matrix S
• the eigenvector with the highest eigenvalue is the first principal component of the data set
PCA-example
• the two eigenvectors go through the middle of the points, like drawing a line of best fit
• these lines can be extracted to characterize the data
PCA-example
• in general, once the eigenvectors are found
• the next step is to order them by eigenvalue, highest to lowest
• this gives the PCs in order of significance
• we can then decide to ignore the less significant components
• this is where the notion of data compression and reduced dimensionality comes in
PCA-example
• Step 5: derive the new data set
• newData^T = eigenvectors^T x originalDataAdjust^T
• newData: 10 x 1 (using only the first eigenvector)
PCA-example
[figure] newData
PCA-example
• newData: 10 x 2 (using both eigenvectors)
PCA-example
• Step 6: get back the old data (decompression)
• if we took all the eigenvectors in the transformation, we get exactly the original data back
• otherwise, we lose some information
PCA-example
• newData^T = eigenvectors^T x originalDataAdjust^T
• equivalently newData^T = eigenvectors^(-1) x originalDataAdjust^T: when we take all the eigenvectors, they are orthogonal unit vectors, so the inverse of the eigenvector matrix equals its transpose
• inverting: originalDataAdjust^T = eigenvectors x newData^T
• adding the mean back: originalData^T = eigenvectors x newData^T + mean
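A sketch of steps 5 and 6 (projection and reconstruction) on the same illustrative data; the variable names mirror the slides' newData and originalDataAdjust, but the code itself is not taken from the tutorial.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
mean = data.mean(axis=0)
adjusted = data - mean                       # originalDataAdjust
S = np.cov(adjusted, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvecs = eigvecs[:, ::-1]                   # columns ordered by decreasing eigenvalue

# step 5: newData^T = eigenvectors^T x originalDataAdjust^T
new_data_full = adjusted @ eigvecs           # both PCs kept: 10 x 2
new_data_1pc = adjusted @ eigvecs[:, :1]     # only the first PC kept: 10 x 1

# step 6: originalData^T = eigenvectors x newData^T + mean
recovered_full = new_data_full @ eigvecs.T + mean        # exact reconstruction
recovered_1pc = new_data_1pc @ eigvecs[:, :1].T + mean   # lossy reconstruction
print(np.allclose(recovered_full, data))                 # True when all eigenvectors are kept
```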
PCA-example
[figure] newData: 10 x 1
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
PCA-high dimensional data
• the number of data points is smaller than the dimensionality of the data space: N < D
• example:
  – data set: a few hundred images
  – dimensionality: several million, corresponding to three colour values for each pixel
PCA-high dimensional data
• the standard algorithm for finding all eigenvectors of a D x D matrix is O(D^3); finding just the first M eigenvectors is O(M D^2)
• if D is really high, a direct PCA is computationally infeasible
PCA-high dimensional data
• when N < D
• a set of N points defines a linear subspace whose dimensionality is at most N - 1
• so there is little point in applying PCA for M > N - 1
• if we ran PCA anyway with M > N - 1:
  – at least D - N + 1 of the eigenvalues are 0
  – the corresponding eigenvectors are directions along which the data set has zero variance
PCA-high dimensional data
solution:
• define X as the N x D centred data matrix
  – nth row: $(x_n - \bar{x})^T$
• the covariance matrix is then $S = \frac{1}{N} X^T X$ (D x D), with eigenvector equation $\frac{1}{N} X^T X u_i = \lambda_i u_i$
PCA-high dimensional data
• pre-multiply both sides by X and define $v_i = X u_i$: this gives the eigenvector equation for the N x N matrix $\frac{1}{N} X X^T$, namely $\frac{1}{N} X X^T v_i = \lambda_i v_i$
• this matrix has the same N - 1 (nonzero) eigenvalues as S; S has an additional D - N + 1 zero eigenvalues
• the cost drops from O(D^3) to O(N^3)
• the eigenvectors of S are recovered as $u_i = \frac{1}{\sqrt{N \lambda_i}} X^T v_i$ (normalized to unit length)
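A sketch of this N < D trick on made-up data: eigendecompose the N x N matrix (1/N) X X^T and map its eigenvectors back to eigenvectors of S.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 50, 1000, 5
X_raw = rng.normal(size=(N, D))
X = X_raw - X_raw.mean(axis=0)               # centred data matrix; rows are (x_n - x_bar)^T

K_small = X @ X.T / N                        # N x N instead of D x D
lam, V = np.linalg.eigh(K_small)             # same nonzero eigenvalues as S
lam, V = lam[::-1], V[:, ::-1]               # sort in decreasing order

# map back: u_i is proportional to X^T v_i, then normalize to unit length
U = X.T @ V[:, :M]
U /= np.linalg.norm(U, axis=0)               # columns are the leading eigenvectors of S

# sanity check against S computed directly (feasible here because D is only moderate)
S = X.T @ X / N
print(np.allclose(S @ U[:, 0], lam[0] * U[:, 0]))   # True
```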
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
Kernel
• a kernel function is an inner product in a feature space: $k(x, x') = \phi(x)^T \phi(x')$
  – $\phi(x)$ is a mapping of x into the feature space
• the dimensionality of the feature space is typically greater than (or equal to) that of the input space
• the feature space mapping is used only implicitly, through the kernel
PCA-linear
• maximum variance formulation
  – the orthogonal projection of the data onto a lower-dimensional linear space
  – such that the variance of the projected data is maximized
• minimum-error formulation
  – the linear projection that
  – minimizes the average projection cost (mean squared distance)
• both formulations are linear
Kernel PCA
• data set: {x_n}, n = 1, 2, ..., N
• x_n: D dimensions
• assume the mean has been subtracted from each x_n (zero mean)
• the PCs are defined by the eigenvectors $u_i$ of S: $S u_i = \lambda_i u_i$, $i = 1, \ldots, D$, where $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$
Kernel PCA
• consider a nonlinear transformation $\phi(x)$ into an M-dimensional feature space
• each point $x_n$ is projected onto $\phi(x_n)$
• perform standard PCA in the feature space
• this implicitly defines a nonlinear principal component in the original data space
[figure] left: original data space; right: feature space; green lines: linear projection onto the first PC in feature space, which corresponds to a nonlinear projection in the original data space
Kernel PCA
• assume for now that the projected data $\phi(x_n)$ has zero mean
• the M x M covariance matrix in feature space is $C = \frac{1}{N}\sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T$
• its eigenvector equation is $C v_i = \lambda_i v_i$, $i = 1, \ldots, M$
• given the form of C, each $v_i$ is a linear combination of the $\phi(x_n)$: $v_i = \sum_{n=1}^{N} a_{in} \phi(x_n)$
Kernel PCA
• substituting and expressing this in terms of the kernel function $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$ gives, in matrix notation, $K^2 a_i = \lambda_i N K a_i$
• it suffices to solve $K a_i = \lambda_i N a_i$, $i = 1, \ldots, N$
  – $a_i$: the column vector with elements $a_{in}$
• the solutions of these two eigenvector equations differ only by eigenvectors of K having zero eigenvalues, which do not affect the principal component projections
Kernel PCA
• normalization condition for $a_i$: requiring $v_i^T v_i = 1$ gives $1 = \sum_{n=1}^{N}\sum_{m=1}^{N} a_{in} a_{im} \phi(x_n)^T \phi(x_m) = a_i^T K a_i = \lambda_i N\, a_i^T a_i$
Kernel PCA
• in feature space, the projection of a point x onto eigenvector i (the projected coordinates after PCA) is
  $y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} \phi(x)^T \phi(x_n) = \sum_{n=1}^{N} a_{in} k(x, x_n)$
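A minimal sketch of kernel PCA under the zero-mean-in-feature-space assumption (the Gram-matrix centring on a later slide removes that assumption); the Gaussian kernel, data, and names here are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                 # illustrative training data
K = rbf_kernel(X, X)

mu, A = np.linalg.eigh(K)                     # K a_i = (lambda_i N) a_i; mu holds lambda_i N
mu, A = mu[::-1], A[:, ::-1]                  # decreasing order
M = 4
A = A[:, :M] / np.sqrt(mu[:M])                # rescale so that a_i^T K a_i = 1

# projection of new points onto the first M nonlinear components:
#   y_i(x) = sum_n a_{in} k(x, x_n)
X_new = rng.normal(size=(10, 2))
Y = rbf_kernel(X_new, X) @ A                  # 10 x M projected coordinates
```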
Kernel PCA
• original data space
  – dimensionality: D
  – D eigenvectors
  – at most D linear PCs
• feature space
  – dimensionality: M, with M >> D (possibly even infinite)
  – M eigenvectors
  – so the number of nonlinear PCs can exceed D
• however, the number of nonzero eigenvalues cannot exceed N
Kernel PCA
• so far we assumed the projected data $\phi(x_n)$ has zero mean
• for a nonzero mean
  – we cannot simply compute the mean in feature space and then subtract it off
  – we must avoid working directly in feature space
• instead, formulate the algorithm purely in terms of the kernel function
Kernel PCA
• the centred Gram matrix, in matrix notation: $\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$
  – $1_N$: the N x N matrix in which every element equals 1/N
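A short sketch of this centring step; centre_gram is a made-up helper name.

```python
import numpy as np

def centre_gram(K):
    """Return K~ = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the all-1/N matrix."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# usage: eigendecompose centre_gram(K) instead of K before extracting the a_i
```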
Kernel PCA
• linear kernel $k(x, x') = x^T x'$: recovers standard PCA
• Gaussian kernel: $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$
• example: kernel PCA with a Gaussian kernel
Kernel PCA
• contours: lines along which the projection onto the corresponding PC is constant
Kernel PCA
disadvantage:
• it requires finding the eigenvectors of the N x N matrix $\tilde{K}$, rather than the D x D matrix S
• for large data sets, approximations are used
Outline
(two commonly used definitions of PCA give rise to the same algorithm)
• PCA-principal component analysis
  – maximum variance formulation
  – minimum-error formulation
  – application of PCA
  – PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
Probabilistic PCA
• standard PCA: a linear projection of the data onto a lower-dimensional subspace
• probabilistic PCA: the maximum likelihood solution of a probabilistic latent variable model
Probabilistic PCA
• the combination of a probabilistic model and EM allows us to deal with missing values in the data set
  – EM: the expectation-maximization algorithm
  – a method for finding maximum likelihood solutions for models with latent variables
Probabilistic PCA
• probabilistic PCA forms the basis for a Bayesian treatment of PCA
• in Bayesian PCA, the dimensionality of the principal subspace can be found automatically
Probabilistic PCA
• the probabilistic PCA model can be run generatively to provide samples from the distribution
• the simplest continuous latent variable model assumes
  – Gaussian distributions for both the latent and observed variables
  – a linear-Gaussian dependence of the observed variables on the state of the latent variables
Probabilistic PCA
• an explicit latent variable z (M x 1), corresponding to the principal-component subspace
• a Gaussian prior distribution over the latent variable: $p(z) = \mathcal{N}(z \mid 0, I)$
• a Gaussian conditional distribution for the observed variable x (D x 1): $p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \sigma^2 I)$
  – W: D x M matrix; the columns of W span the principal subspace
  – $\mu$: D-dimensional vector
Probabilistic PCA
• get a sample value of the observed variable by
  – choosing a value for the latent variable
  – then sampling the observed variable given that latent value
• x is defined by a linear transformation of z plus additive Gaussian noise:
  $x = W z + \mu + \epsilon$, where $\epsilon$ is D-dimensional zero-mean Gaussian noise with covariance $\sigma^2 I$
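A minimal sketch of this generative view: draw z from p(z), then x from p(x | z); the particular W, mu, and sigma values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, sigma = 2, 1, 0.3
W = np.array([[1.0], [0.5]])                  # D x M; columns span the principal subspace
mu = np.array([2.0, 1.0])                     # D-dimensional mean

def sample_ppca(n):
    z = rng.normal(size=(n, M))               # latent values, p(z) = N(0, I)
    eps = sigma * rng.normal(size=(n, D))     # isotropic Gaussian noise
    return z @ W.T + mu + eps                 # p(x | z) = N(W z + mu, sigma^2 I)

X = sample_ppca(500)                          # samples from the marginal p(x)
```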
Probabilistic PCA
[figure] data space: 2-dimensional; latent space: 1-dimensional
• draw a value of the latent variable z from its prior
• draw a value of x from an isotropic Gaussian distribution centred on the corresponding point of the subspace
• green ellipses: density contours of the marginal distribution p(x)
Probabilistic PCA
• this is a mapping from latent space to data space
• in contrast to standard PCA, which maps from the data space onto the principal subspace
Probabilistic PCA
• the Gaussian conditional distribution is $p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \sigma^2 I)$
• maximum likelihood PCA: determine the 3 parameters W, $\mu$, and $\sigma^2$
• for this we need an expression for the marginal $p(x) = \int p(x \mid z)\, p(z)\, dz = \mathcal{N}(x \mid \mu, C)$, with $C = W W^T + \sigma^2 I$
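A hedged sketch of the closed-form maximum-likelihood fit (the Tipping and Bishop result): mu_ML is the sample mean, sigma^2_ML is the average of the discarded eigenvalues, and W_ML = U_M (L_M - sigma^2_ML I)^(1/2), taking the arbitrary rotation R = I.

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood probabilistic PCA fit (assumes M < D)."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                       # decreasing eigenvalues
    sigma2 = lam[M:].mean()                              # average discarded variance
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))    # D x M
    C = W @ W.T + sigma2 * np.eye(D)                     # marginal p(x) = N(mu, C)
    return mu, W, sigma2, C
```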
Probabilistic PCA
• so far, we assumed the value of M is given
• in practice, we must choose a suitable value for M
  – for visualization: M = 2 or M = 3
  – plot the eigenvalue spectrum for the data set and seek a significant gap indicating a choice for M; in practice, such a gap is often not seen
  – Bayesian PCA, which can determine M automatically
  – employ cross-validation: select the value of M giving the largest log likelihood on a validation data set
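A small sketch of the eigenvalue-spectrum heuristic for choosing M: compute the fraction of variance each component explains and the cumulative total (the 0.9 target in the comment is just an example threshold).

```python
import numpy as np

def explained_variance(X):
    """Per-component and cumulative fractions of variance explained."""
    Xc = X - X.mean(axis=0)
    lam = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # decreasing eigenvalues
    ratio = lam / lam.sum()
    return ratio, np.cumsum(ratio)

# e.g. choose the smallest M whose cumulative ratio exceeds a target such as 0.9
```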
Probabilistic PCA
• example eigenvalue spectrum:
  – the only clear break is between the 1st and 2nd PCs
  – the 1st PC explains less than 40% of the variance, so more components are probably needed
  – the first 3 PCs explain two thirds of the total variability, so 3 might be a reasonable value of M