Adaptive Metric Dimensionality Reduction

Aryeh Kontorovich (Ben-Gurion U.)
Joint work with: Lee-Ad Gottlieb (Ariel U.) and Robert Krauthgamer (Weizmann Institute)
Setting: Supervised binary classification in a metric space

- Instance (metric) space (𝒳, d)
- Probability distribution P on 𝒳 × {-1, +1} (agnostic PAC -- think noisy concept)
- Learner:
  - observes a sample S: n points (x, y) drawn iid ~ P
  - produces a hypothesis h: 𝒳 → {-1, +1}
- Generalization error: P[h(X) ≠ Y]
Metric space

- 𝒳 = set of points; d = distance function, d: 𝒳² → ℝ
- (𝒳, d) is a metric space if d is:
  - Nonnegative, with d(x, x′) = 0 ⇔ x = x′
  - Symmetric: d(x, x′) = d(x′, x)
  - Triangle inequality: d(x, x′) ≤ d(x, z) + d(z, x′)
- "No coordinates -- just distances" (figure: pairwise city distances, e.g. Tel Aviv-London 3553, Tel Aviv-Singapore 7955, London-Singapore 10847)
- Inner product ⇒ norm: ||x||² = ⟨x, x⟩; norm ⇒ metric: d(x, x′) = ||x − x′||; neither implication reverses
- Take-away: the metric assumption is far less restrictive
- How to classify in a metric space? Nearest neighbors! (some variant of)
Curse of dimensionality

- Learning in high dimensions is hard:
  - Statistically: many examples are needed
  - Computationally: building a classifier is expensive
- "Real" data tends to have:
  - High ambient dimension
  - Low intrinsic dimension
- Challenge: exploit low intrinsic dimensionality, both statistically and computationally
- This talk: some of the first such results in supervised learning
Nearest-Neighbor Classifier

h_NN(x) = label of the sample point closest to x

…is terrific:
- One of the oldest classification algorithms
- "Simple"
- Requires minimal geometric structure (a metric)
- Suitable for multi-class problems
- Asymptotically consistent (expected error at most twice the optimal Bayes rate)

…but has statistical and computational drawbacks:
- Infinite VC-dimension
- A distribution-free rate is impossible
- Exact computation requires Θ(n) time per query
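The rule above can be sketched in a few lines. This is a minimal illustration (not from the slides), assuming a Euclidean metric for concreteness and using the brute-force Θ(n) scan just mentioned:

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """1-NN rule: return the label of the sample point closest to x.
    The brute-force scan over all n points is the Theta(n) exact
    computation mentioned above."""
    dists = np.linalg.norm(X_train - x, axis=1)  # d(x, x_i) for every sample point
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1, 1])
print(nn_classify(X_train, y_train, np.array([0.1, 0.2])))  # -1
```

Any metric would do in place of the Euclidean norm; that is precisely the "minimal geometric structure" point.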
Cover & Hart 1967; Ben-David & Shalev-Shwartz 2014+

- η(x) = P[Y = 1 | X = x], assumed Lipschitz continuous: |η(x) − η(x′)| ≤ L·d(x, x′)
- Bayes-optimal classifier: threshold η(x) at 1/2, i.e. h*(x) = sign(η(x) − 1/2)
- Bayes error: err(h*) = E[min{η(X), 1 − η(X)}]
- Theorem: for the metric space ([0, 1]^k, ‖·‖₂),
  E[err(h_NN)] ≤ 2·err(h*) + O(L·k^{1/2} / n^{1/(k+1)})
- Tightness [curse of dimensionality]: n = Ω((L+1)^k) samples are needed.
  There exists a distribution with err(h*) = 0 for which n ≤ (L+1)^k / 2 implies E[err(h_NN)] > 1/4
A Newer Look

- Take a richer hypothesis class:
  - f maps 𝒳 to [-1, +1] instead of {-1, +1}
  - Classify by thresholding at zero
- [von Luxburg & Bousquet, JMLR '04]:
  - View the sample S = {(Xᵢ, Yᵢ)} as evaluations of a [-1, +1]-valued function
  - Lipschitz-extend the {-1, +1} data to a [-1, +1] function on the whole space
  - Margin d(S⁺, S⁻) = inverse Lipschitz constant
  - Algorithmically realized by nearest neighbor (with efficient NN search)
- Left open: smoothing / denoising / Structural Risk Minimization / regularization
- GKK '2010 addressed these issues
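The Lipschitz-extension view can be made concrete with the McShane extension formula f(x) = minᵢ (yᵢ + L·d(x, xᵢ)). A sketch (mine, not from the slides), assuming a Euclidean metric and taking L = 2 / margin per the inverse relation above:

```python
import numpy as np

def margin(X, y):
    """d(S+, S-): smallest distance between oppositely labeled sample points."""
    pos, neg = X[y == 1], X[y == -1]
    return min(np.linalg.norm(p - q) for p in pos for q in neg)

def extend(X, y, x):
    """McShane-style Lipschitz extension of the +/-1 labels to the whole
    space, with Lipschitz constant L = 2 / margin; classify by sign."""
    L = 2.0 / margin(X, y)
    return np.min(y + L * np.linalg.norm(X - x, axis=1))

X = np.array([[0.0], [1.0]])
y = np.array([-1, 1])
print(np.sign(extend(X, y, np.array([0.2]))))  # -1.0 (agrees with nearest neighbor)
```

With this choice of L the extension reproduces the labels on the sample itself, and its sign agrees with the nearest-neighbor rule, which is how the classifier is realized algorithmically.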
Doubling Dimension

- Definition: the ball B(x, r) = all points within distance r of x. The doubling constant λ (of a metric 𝒳) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius
  - First used by [Ass-83], algorithmically by [Cla-97]
  - The doubling dimension is ddim(𝒳) = log₂ λ(𝒳) [GKL-03]
- A metric is doubling if its doubling dimension is finite
  - Euclidean: ddim(ℝⁿ) = O(n)
- Summary:
  - Intimately connected to covering numbers
  - Analogue of Euclidean dimension
    - In geometry, one of many metric dimensions
    - In CS, basically just ddim
(Figure: a ball covered by half-radius balls; here λ ≥ 7.)
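For an empirical feel, one can estimate the doubling constant of a finite point set by greedily covering each point-centered ball with half-radius balls and taking the worst case. A sketch with illustrative helper names and a Euclidean metric; greedy covering only estimates λ from above, it need not achieve the exact minimum:

```python
import numpy as np

def cover_count(pts, center, r):
    """Greedily cover B(center, r) ∩ pts with balls of radius r/2
    centered at points of the set; return the number of balls used."""
    ball = pts[np.linalg.norm(pts - center, axis=1) <= r]
    count = 0
    while len(ball):
        c = ball[0]                                        # pick a new half-radius center
        ball = ball[np.linalg.norm(ball - c, axis=1) > r / 2]
        count += 1
    return count

def doubling_constant(pts):
    """Empirical doubling constant: worst cover count over all
    point-centered balls with radii drawn from the pairwise distances."""
    radii = {float(np.linalg.norm(a - b)) for a in pts for b in pts} - {0.0}
    return max(cover_count(pts, x, r) for x in pts for r in radii)

pts = np.array([[0.0], [1.0], [2.0], [3.0]])               # four points on a line
print(doubling_constant(pts))                              # 3, so ddim estimate log2(3)
```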
NN excess risk

- Previous: O(L·k^{1/2} / n^{1/(ddim+1)})
- GKK '10: O([L·ddim·log(n) / n]^{1/2})
Metric Dimensionality Reduction

- Runtime and sample complexity are exponential in ddim:
  - (1+ε)-approximate nearest neighbor search in time 2^{O(ddim)}·log n + ε^{-O(ddim)}
  - Generalization bounds decay as min{n^{-1/ddim}, L·ddim·n^{-1/2}}
- All existing bounds work with the ambient dimension:
  - Insensitive to the intrinsic data dimension
- What if the intrinsic data dimension is much lower than the ambient one? What if the data is close to being low-dimensional?
Principal Components Analysis (PCA)

- Data {Xᵢ} with Xᵢ ∈ ℝᴺ
- A k-dimensional subspace T ⊂ ℝᴺ induces distortion ε := Σᵢ ||Xᵢ − P_T(Xᵢ)||² [P_T(·) = orthogonal projection onto T]
- Dimension k and the (optimal) distortion have a simple relationship:
  ε = Σ_{j=k+1}^{N} σⱼ², where σ₁ ≥ σ₂ ≥ … ≥ σ_N are the singular values of the data matrix X
- Uses of PCA:
  - Denoising
  - Discovering the "inherent" dimension of the data
- How to choose the cutoff k?
  - Heuristics (such as looking for a "jump" σⱼ ≫ σⱼ₊₁)
  - To our knowledge, no principled guidelines
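The identity between the distortion of the best-fit subspace and the tail singular values is easy to check numerically; a small sketch on synthetic data (uncentered SVD, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))       # 50 data points as the rows of X, N = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s: sigma_1 >= ... >= sigma_N

k = 2
T = Vt[:k]                             # orthonormal basis for the top-k subspace
proj = X @ T.T @ T                     # P_T(X_i) for every row
distortion = np.sum(np.linalg.norm(X - proj, axis=1) ** 2)

# Optimal distortion equals the squared tail singular values: sum_{j>k} sigma_j^2
assert np.isclose(distortion, np.sum(s[k:] ** 2))
```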
PCA and supervised classification

- Labeled sample (Xᵢ, Yᵢ) with Xᵢ ∈ ℝᴺ and Yᵢ ∈ {-1, +1}
- Ambient space is high-dimensional: N ≫ 1
- Common heuristic: run PCA prior to SVM. Benefits:
  - Computational: everything is faster in lower dimensions!
  - Statistical (less well understood):
    - Denoising (heuristic)
    - Better generalization guarantees?
- Drawbacks:
  - Theoretically unmotivated
  - What is the "right" dimension / singular-value cutoff?
  - Won't this mess up the margin?
Principled Principal Components Analysis

- Labeled sample (Xᵢ, Yᵢ):
  - n sample points with ||Xᵢ|| ≤ 1 in ℝᴺ
  - Yᵢ ∈ {-1, +1}
- Thm [GKK '2013]: For all δ > 0, with probability ≥ 1 − δ: for all separating hyperplanes ||w|| ≤ 1 in ℝᴺ and all subspaces T ⊂ ℝᴺ with dim(T) = k, incurring distortion ε = Σᵢ ||Xᵢ − P_T(Xᵢ)||², we have
  E L_hinge(w·X, Y) ≤ (1/n) Σᵢ L_hinge(w·Xᵢ, Yᵢ) + 34(k/n)^{1/2} + 2(ε/n)^{1/2} + 3[log(2/δ)/(2n)]^{1/2}
- Distortion plays the role of an inverse margin
- To our knowledge, the first rigorous guide to the PCA cutoff
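The theorem suggests a concrete cutoff rule: pick the k minimizing the k-dependent terms 34(k/n)^{1/2} + 2(ε_k/n)^{1/2}. A sketch of that rule (the helper name `pca_cutoff` and the toy singular values are illustrative, not from the slides):

```python
import numpy as np

def pca_cutoff(s, n):
    """Choose k minimizing 34*sqrt(k/n) + 2*sqrt(eps_k/n),
    where eps_k = sum_{j>k} sigma_j^2 is the distortion at cutoff k.
    `s` = singular values in decreasing order, `n` = sample size."""
    s2 = np.asarray(s, dtype=float) ** 2
    eps = np.concatenate([np.cumsum(s2[::-1])[::-1], [0.0]])  # eps[k] for k = 0..N
    ks = np.arange(len(s2) + 1)
    bound = 34 * np.sqrt(ks / n) + 2 * np.sqrt(eps / n)
    return int(np.argmin(bound))

# Two strong directions plus a tiny residual: the bound balances the
# dimension term against the distortion term and picks k = 2.
print(pca_cutoff([100.0, 80.0, 1.0, 1.0], n=10_000))  # 2
```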
General Metric Spaces

- Metric space (𝒳, d); labeled sample (Xᵢ, Yᵢ)
- Distortion ε = Σᵢ d(Xᵢ, Ẋᵢ), where {Ẋᵢ} is a "perturbed set"
- Dimensionality reduction: ddim({Ẋᵢ}) < ddim({Xᵢ}) ≤ log₂ n
- The optimal tradeoff is dictated by the data via Rademacher analysis:
  R_n = O(L(1 + ε)·n^{-1/D}) [Dudley's chaining technique], where
  - L = Lipschitz constant of the induced hypothesis
  - ε = distortion
  - D = ddim({Ẋᵢ}), the intrinsic data dimension
- Generalization performance does not depend on the ambient dimension ddim({Xᵢ})
Dimensionality reduction algorithm

- Given a point set and a target dimension d: what is the smallest achievable distortion?
- An exact solution seems hard
- We give an (O(1), O(1))-bicriteria approximation
Hierarchies

- Every discrete space admits a point hierarchy:
  - Level 0: each point is the center of a 1-radius ball
  - Level 1: all 1-radius balls are covered by 2-radius balls (covering)
  - Ball centers obey a minimum interpoint distance (packing)
  - Level 2: one big ball of radius 4, and so on
- The hierarchy is covering, packing, and nested
- Key property: in a doubling hierarchy, each ball neighbors only a small number of other balls (at each level)
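The greedy construction of such a hierarchy can be sketched as follows, assuming Euclidean distance, with level-i centers forming a 2^i-net of the level below (function and variable names are illustrative):

```python
import numpy as np

def build_hierarchy(pts, levels):
    """Greedy 2^i-net hierarchy: each level-i center set is a subset of the
    level-(i-1) centers with pairwise distances > 2^i (packing), and every
    level-(i-1) center lies within 2^i of some level-i center (covering)."""
    hier = [pts]                                   # level 0: every point
    for i in range(1, levels + 1):
        r, remaining, net = 2.0 ** i, hier[-1], []
        while len(remaining):
            c = remaining[0]                       # promote a surviving center
            net.append(c)
            remaining = remaining[np.linalg.norm(remaining - c, axis=1) > r]
        hier.append(np.array(net))
    return hier

pts = np.arange(8.0).reshape(-1, 1)                # eight points on a line
sizes = [len(level) for level in build_hierarchy(pts, 3)]
print(sizes)  # [8, 3, 2, 1]
```

Because each level's centers are chosen from the level below, the nesting property comes for free; the greedy removal step enforces both packing and covering.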
Integer program

- Consider a hierarchy over the training sample
- An integer program extracts a sub-hierarchy with small doubling dimension:
  - Indicator variable z_j^i represents point x_j at level i (is x_j present in level i?)
  - Let N_j^i be the set of all i-level points close to point x_j
- Minimize the cost Σ c_j, subject to the constraints:
  - z_j^i ∈ {0, 1}
  - z_j^i ≤ z_j^{i-1}  (nested)
  - z_j^i ≤ |N_j^{i+1}|  (covering)
  - |N_j^i| ≤ 2^d  (small target doubling dimension)
  - c_j ≥ 2^i [(1 − z_j^0) − |N_j^{i+1}|]  (c_j is a proxy for the cost of deleting x_j)
Linear program

- Solving the integer program exactly is somewhat involved, so we use a bicriteria algorithm: approximate both the cost and the dimension
  - Linear relaxation: z_j^i ∈ [0, 1]
  - Rounding scheme
- Runtime: 2^{O(ddim)} + O(n log⁴ n)
Thank you

- Questions?