Learning a Kernel Matrix for Nonlinear Dimensionality Reduction

Learning a Kernel Matrix for Nonlinear Dimensionality Reduction. By K. Weinberger, F. Sha, and L. Saul. Presented by Michael Barnathan.

The Problem: Data lies on or near a manifold of lower dimensionality than the overall space, and is locally Euclidean. Example: data on a line in R³, or a flat area on a sphere. Goal: Learn a kernel that lets us work in the lower-dimensional space, i.e., "unfold" the manifold. First we need to know what the manifold is: its dimensionality, and how it can vary. [Figure: a 2-D manifold on a sphere (Wikipedia).]

Background Assumptions: The Kernel Trick. Mercer's Theorem: continuous, symmetric, positive semi-definite kernel functions can be represented as dot (inner) products in a high-dimensional space (Wikipedia; implied in the paper). So we replace the dot product with a kernel function, or "Gram matrix": K_nm = φ(x_n)ᵀ φ(x_m) = k(x_n, x_m). The kernel provides a mapping into a high-dimensional space. Consequence of Cover's theorem: the nonlinear problem then becomes linear. Example: in SVMs, x_iᵀ x_j → φ(x_i)ᵀ φ(x_j) = k(x_i, x_j). Linear dimensionality reduction techniques: SVD and derived techniques (PCA, ICA, etc.) remove linear correlations, which reduces dimensionality. Now combine these: kernel PCA for nonlinear dimensionality reduction! Map the input to a higher dimension using a kernel, then use PCA.
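The kernel PCA step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: the Gaussian kernel and all names here are stand-ins for whatever kernel is supplied (in SDE, the learned one).

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project data onto the top principal components in the feature space
    induced by an (N, N) kernel (Gram) matrix K."""
    N = K.shape[0]
    # Double-center the kernel matrix so the implicit features have zero mean.
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecompose; eigenvalues measure variance along each component.
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Coordinates of the inputs along the principal components.
    return vecs * np.sqrt(np.maximum(vals, 0))

# Toy run with a fixed Gaussian kernel (an arbitrary choice, not a learned one).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
Y = kernel_pca(K, n_components=2)
print(Y.shape)  # (50, 2)
```

Because the kernel matrix is centered before the eigendecomposition, the projected coordinates also have zero mean.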

The (More Specific) Problem: Data described by a manifold. Using kernel PCA, discover the manifold. There’s only one detail missing: How do we find the appropriate kernel? This forms the basis of the paper’s approach. It is also a motivation for the paper…

Motivation: Exploits properties of the data, not just its space. Relates kernel discovery to manifold learning. With the right kernel, kernel PCA will allow us to discover the manifold. So it has implications for both fields. Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning. Unlike previous methods, this approach is unsupervised; the kernel is learned automatically. Not specific to PCA; it can learn any kernel.

Methodology – Idea: Semidefinite programming (optimization). Preserve distances and angles between points: rotation and translation on a neighborhood. Fix the distances and angles between a point and its k nearest neighbors. Intuition: look for a locally isometric mapping from the space to the manifold. Represent points as a lattice of "steel balls", with neighborhoods connected by "rigid rods" that fix angles and distances (the local isometry constraint). Now pull the balls as far apart as possible (the objective function). The lattice flattens → lower dimensionality! The "balls" and "rods" represent the manifold… if the data is well sampled (Wikipedia), which shouldn't be a problem in practice.
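The "rigid rods" connect each point to its k nearest neighbors, and also pairs of points that share a common neighbor (so angles within a neighborhood are fixed too). A hypothetical helper sketching that construction, assuming a plain k-nearest-neighbor graph (the paper's own implementation is not shown):

```python
import numpy as np
from itertools import combinations

def constraint_pairs(X, k=4):
    """Pairs (i, j) whose distances are fixed by 'rigid rods': points that
    are direct neighbors, or that share a common neighbor."""
    N = X.shape[0]
    d = ((X[:, None] - X[None, :]) ** 2).sum(-1)       # squared distances
    # k nearest neighbors of each point (index 0 is the point itself).
    nbrs = [set(np.argsort(d[i])[1:k + 1]) for i in range(N)]
    pairs = set()
    for i in range(N):
        for j in nbrs[i]:
            pairs.add(tuple(sorted((i, int(j)))))       # direct neighbors
        for j, l in combinations(nbrs[i], 2):           # share neighbor i
            pairs.add(tuple(sorted((int(j), int(l)))))
    return pairs

X = np.random.default_rng(1).normal(size=(20, 3))
pairs = constraint_pairs(X, k=4)
print(len(pairs) > 0)  # True
```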

Optimization Constraints: Isometry: for all neighbors x_j, x_k of point x_i, and whenever x_j and x_k are neighbors of each other or of another common point. Let G be the Gram matrix of the inputs, G_ij = x_iᵀ x_j, and K the learned kernel matrix. We then have K_ii + K_jj − K_ij − K_ji = G_ii + G_jj − G_ij − G_ji. Positive semidefiniteness (required for the kernel trick): no negative eigenvalues. Centered on the origin (Σ_ij K_ij = 0), so eigenvalues measure the variance of the principal components. The dataset can be centered if it is not already.
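A quick numerical check of these two facts. This is illustrative only: `Phi` stands in for an explicit feature map φ, which is never actually computed in kernel methods.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(10, 5))
Phi -= Phi.mean(axis=0)           # center the features: sum_i phi(x_i) = 0
K = Phi @ Phi.T                   # Gram matrix K_ij = phi(x_i) . phi(x_j)

# Centering on the origin is equivalent to sum_ij K_ij = 0.
print(abs(K.sum()) < 1e-9)        # True

# K_ii + K_jj - K_ij - K_ji is the squared distance ||phi_i - phi_j||^2,
# which is what the isometry constraint pins to its input-space value.
i, j = 3, 7
lhs = K[i, i] + K[j, j] - K[i, j] - K[j, i]
rhs = ((Phi[i] - Phi[j]) ** 2).sum()
print(abs(lhs - rhs) < 1e-9)      # True
```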

Objective Function: We want to maximize pairwise distances; this is an inversion of SSE/MSE! So we maximize T(K) = (1/2N) Σ_ij ||φ(x_i) − φ(x_j)||², which is just Tr(K)! (Proof: not given in the paper.)
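The identity follows by expanding each squared distance into K_ii + K_jj − 2K_ij and using the centering constraint Σ_ij K_ij = 0, which leaves 2N·Tr(K). A numerical sketch of this (variable names are illustrative; `Phi` again stands in for the feature map):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(8, 4))
Phi -= Phi.mean(axis=0)           # centering constraint: sum_i phi(x_i) = 0
K = Phi @ Phi.T
N = K.shape[0]

# Sum of all pairwise squared distances in feature space.
pair_sum = sum(((Phi[i] - Phi[j]) ** 2).sum()
               for i in range(N) for j in range(N))

# Under centering, (1/2N) * pair_sum collapses to Tr(K).
print(abs(pair_sum / (2 * N) - np.trace(K)) < 1e-8)  # True
```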

Semidefinite Embedding (SDE): Maximize Tr(K) subject to: K ⪰ 0; Σ_ij K_ij = 0; and K_ii + K_jj − K_ij − K_ji = G_ii + G_jj − G_ij − G_ji for all i, j that are neighbors of each other or of a common point. This optimization is convex, and thus has a unique solution. Use semidefinite programming to perform the optimization (no SDP details in the paper). Once we have the optimal kernel, perform kernel PCA. This technique (SDE) is the paper's contribution.
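The paper gives no solver details, but the feasible set itself is easy to state as a checker (hypothetical helper names). Note that the input Gram matrix G trivially satisfies every constraint, so the program always has a feasible starting point:

```python
import numpy as np

def is_feasible(K, G, pairs, tol=1e-8):
    """Check the SDE constraints: K PSD, centered, locally isometric to G."""
    if np.linalg.eigvalsh(K).min() < -tol:   # K >= 0 (no negative eigenvalues)
        return False
    if abs(K.sum()) > tol:                   # centering: sum_ij K_ij = 0
        return False
    for i, j in pairs:                       # rigid-rod isometry constraints
        lhs = K[i, i] + K[j, j] - K[i, j] - K[j, i]
        rhs = G[i, i] + G[j, j] - G[i, j] - G[j, i]
        if abs(lhs - rhs) > tol:
            return False
    return True

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))
X -= X.mean(axis=0)                  # center the inputs
G = X @ X.T
pairs = [(0, 1), (2, 5), (3, 7)]     # placeholder neighbor pairs, for illustration
print(is_feasible(G, G, pairs))      # True: G is always a feasible point
```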

Experimental Setup: Four kernels: SDE (proposed), linear, polynomial, Gaussian. "Swiss Roll" dataset: 23 dimensions, of which 3 are meaningful (top right) and 20 are filled with small noise (not shown); 800 inputs; k = 4, p = 4, σ = 1.45 (σ of 4-neighborhoods). "Teapot" dataset: the same teapot, rotated 0 ≤ i < 360 degrees; 23,028 dimensions (76 × 101 × 3), but only one degree of freedom (the angle of rotation); 400 inputs; k = 4, p = 4, σ = 1541. "The handwriting dataset": no dimensionality or parameters specified (16 × 16 = 256-D?); 953 images; no images or kernel matrix shown.
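For reference, a common way to generate a basic Swiss roll: a 2-D sheet rolled up in 3-D. This is the generic construction, not necessarily the paper's exact dataset (which pads it out to 23 dimensions with noise); all parameters here are illustrative.

```python
import numpy as np

def swiss_roll(n=800, noise=0.0, seed=0):
    """Sample n points from a Swiss roll: a 2-D manifold embedded in R^3."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))    # position along the roll
    h = 21 * rng.random(n)                        # height along the sheet
    X = np.stack([t * np.cos(t), h, t * np.sin(t)], axis=1)
    return X + noise * rng.normal(size=X.shape)

X = swiss_roll(800)
print(X.shape)  # (800, 3)
```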

Results – Dimensionality Reduction: Two measures: the learned kernels (SDE) and their "eigenspectra": the variance captured by individual eigenvalues, normalized by the trace (the sum of the eigenvalues). The eigenspectrum seems to indicate the manifold's dimensionality. [Figures: "Swiss Roll", "Teapot", "Digits" results.]
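The trace-normalized eigenspectrum can be computed as below (an illustrative sketch). A low-rank Gram matrix concentrates its spectrum in a few eigenvalues, which mirrors how a sharp SDE eigenspectrum indicates the manifold's dimensionality.

```python
import numpy as np

def eigenspectrum(K):
    """Eigenvalues of a kernel matrix, sorted descending and normalized by
    the trace, so each entry is the fraction of variance it captures."""
    vals = np.linalg.eigvalsh(K)[::-1]   # eigvalsh returns ascending order
    return vals / vals.sum()

# A rank-2 Gram matrix: only two eigenvalues carry variance, suggesting a
# 2-D underlying structure.
rng = np.random.default_rng(5)
Y = rng.normal(size=(30, 2))
spec = eigenspectrum(Y @ Y.T)
print(spec[:2].sum() > 0.999)  # True: top two eigenvalues capture ~all variance
```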

Results – Large Margin Classification: Used SDE kernels with SVMs. Results were very poor: lowering dimensionality can impair separability, so the decision boundary may no longer be linearly separable. Error rates: 90/10 training/test split, mean of 10 experiments.

Strengths and Weaknesses. Strengths: Unsupervised, convex kernel optimization. Generalizes well in theory. Relates manifold learning and kernel learning. Easy to implement: just solve the optimization. Intuitive (stretching a string). Weaknesses: May not generalize well in practice (SVMs). Robustness: what if a neighborhood contains an outlier? Offline algorithm: the entire Gram matrix is required; only a problem if N is large. Implicit assumption that lower dimensionality is better, which is not always the case (as in SVMs, due to separability in higher dimensions). The paper doesn't mention SDP details: no algorithm analysis, complexity, etc., though complexity is "relatively high". In fact, there is no proof of convergence (according to the authors' other 2004 paper); Isomap, LLE, et al. already have such proofs.

Possible Improvements: Introduce slack variables for robustness: make the "rods" not rigid, but punished for "bending"; this would introduce a "C" parameter, as in SVMs. Incrementally accept minors of K for large values of N, and use incremental kernel PCA. Convolve the SDE kernel with others for SVMs? SDE unfolds the manifold; the other kernel makes the problem linearly separable again. This only makes sense if SDE simplifies the problem. Analyze the complexity of the SDP.

Conclusions: Using SDP, SDE can learn kernel matrices to "unfold" data embedded in manifolds; kernel PCA then reduces dimensionality. Excellent for nonlinear dimensionality reduction / manifold learning, without requiring parameters. Dramatic results when the difference in dimensionalities is high. Poorly suited for SVM classification.