Approximate nearest neighbor for ℓp-spaces (2<p<∞) via embeddings
Yair Bartal (Hebrew U.), Lee-Ad Gottlieb (Ariel University)

Nearest neighbor search
• Problem definition:
  – Given a set of points S, preprocess S so that the following queries can be answered efficiently:
  – Exact (NNS): given a query point q, what is the closest point to q in S?
  – Approximate (ANN): given a query point q, find a point x in S whose distance from q is within some approximation factor of the distance to the closest point.
  – Goal: polynomial space, polylog query time
  – Goal: o(log n) approximation

Approx. nearest neighbor search
• Not possible in general metrics
• More restrictive spaces?
  – Good news!
  – Euclidean space
  – Normed spaces

ℓp normed spaces
• Norms: recall that for d-dimensional vectors x, y:
  $\|x-y\|_p = (|x_1-y_1|^p + \dots + |x_d-y_d|^p)^{1/p}$
[Figure: panels labeled ℓ1, ℓ2, ℓ∞]
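As a small illustration of the norm definition above, the following Python sketch computes ℓp distances and answers an exact nearest-neighbor query by linear scan. The names `lp_distance` and `nearest_neighbor` are illustrative and do not come from the talk. The scan is only a baseline for comparison, not one of the ANN structures discussed here.

```python
import math

def lp_distance(x, y, p):
    """Return the l_p distance between equal-length vectors x and y.

    p may be any real number >= 1, or math.inf for the l_infinity norm.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def nearest_neighbor(points, q, p):
    """Exact NNS by linear scan: the point of `points` closest to q in l_p."""
    return min(points, key=lambda s: lp_distance(s, q, p))
```

The scan takes time linear in the number of points per query, which is the cost the polylog-query structures in this talk aim to avoid.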

Approx. nearest neighbor search
• An efficient ANN structure features
  – polynomial space
  – polylog query time
  – o(log n), o(d) approximation
• Efficient ANN structures exist for
  – Euclidean (p=2): (1+ɛ)-ANN [IM '98, KOR '98]. Reduce dimension via JL, brute force in lower dimension.
  – ℓ∞: O(log log d)-ANN [Indyk '98]
• What about other norms?
  – 1≤p<2: (1+ɛ)-ANN, same as Euclidean
  – 2<p<∞: ? (previous: Andoni); subject of this paper

Summary of results
• Combine two algorithms for ℓp (2<p<∞):
  – Andoni: $O(\log\log d \cdot (\log n)^{1/p})$-ANN
  – New result: $2^{O(p)}$-ANN
• Analysis:
  – Equality at $p = \log\log\log d + (\log\log n)^{1/2}$
  – Worst-case approximation: $(\log\log d) \cdot \exp((\log\log n)^{1/2})$
  – Andoni is better for larger values of p, the new result for smaller values
• Improved bounds in metrics of low doubling dimension

Embeddings
• An embedding of set X into Y with distortion D is a mapping f: X→Y such that for all x, y ∈ X:
  – $1 \le c \cdot d_Y(f(x), f(y)) / d_X(x, y) \le D$
  – where c is any scaling constant
  – Relaxed: one side may be preserved only with constant probability
• If an embedding is non-expansive and has small contraction:
  – The nearest neighbor stays close, and far points are still relatively far
• If an embedding is non-contractive and has small expansion:
  – The nearest neighbor is only a little bit farther away, and far points remain far

Andoni’s algorithm
• Basic idea: [Andoni '09]
  – Embed ℓp space into ℓ∞
  – Run Indyk’s ANN algorithm for ℓ∞
• Embedding using Frechet random variables
  – max-stable distribution…

Frechet distribution
• For a random variable X ∼ Frechet:
  – $\Pr[X < x] = e^{-x^{-p}}$ for x > 0
• Max-stability:
  – Let X and $Z_1, \dots, Z_d$ be independent random variables, each ∼ Frechet
  – Let $v = (v_1, \dots, v_d)$ be a non-negative valued vector.
  – Then the random variable $Y := \max_i v_i Z_i$ is distributed as $\|v\|_p \cdot X$ (that is, $Y \sim \|v\|_p \cdot X$).

Frechet distribution
• Proof of max-stability: recall $Y := \max_i v_i Z_i$
  – $\Pr[Y \le x] = \Pr[\max_i v_i Z_i \le x] = \prod_i \Pr[Z_i \le x/v_i] = \prod_i e^{-(v_i/x)^p} = e^{-(\sum_i v_i^p)/x^p} = e^{-(\|v\|_p/x)^p}$
• Similarly,
  – $\Pr[\|v\|_p \cdot X \le x] = \Pr[X \le x/\|v\|_p] = e^{-(\|v\|_p/x)^p}$
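The max-stability identity can be checked numerically. The Python sketch below compares the two closed-form CDFs from the proof and also estimates the median of max_i v_i Z_i by sampling. It assumes the Z_i are independent and drawn by inverse-CDF sampling. All function names are illustrative and not from the talk.

```python
import math
import random

def frechet_cdf(x, p):
    """Pr[X <= x] for a Frechet variable with shape p: exp(-x^(-p)), x > 0."""
    return math.exp(-x ** (-p))

def sample_frechet(p, rng):
    """Inverse-CDF sample: if U ~ Uniform(0,1), (-ln U)^(-1/p) is Frechet."""
    u = rng.random()
    while u == 0.0:  # rng.random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / p)

def max_cdf(v, x, p):
    """Pr[max_i v_i Z_i <= x] for independent Frechet Z_i and positive v_i."""
    return math.prod(frechet_cdf(x / vi, p) for vi in v)

def scaled_cdf(v, x, p):
    """Pr[||v||_p * X <= x] for a single Frechet X."""
    norm = sum(vi ** p for vi in v) ** (1.0 / p)
    return frechet_cdf(x / norm, p)

def empirical_median_of_max(v, p, trials, rng):
    """Median of max_i v_i Z_i over `trials` independent draws."""
    ys = sorted(max(vi * sample_frechet(p, rng) for vi in v)
                for _ in range(trials))
    return ys[trials // 2]
```

For any positive vector v, `max_cdf` and `scaled_cdf` should agree up to rounding, and the empirical median should approach ǁvǁp · (ln 2)^(-1/p), the median of ǁvǁp · X.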

Review of Andoni’s embedding
• Define embedding $f_b : V \to \ell_\infty$ (b > 0):
  – Draw Frechet random variables $Z_1, \dots, Z_d$.
  – $f_b$ maps $v = (v_1, \dots, v_d)$ to $(v_1 b Z_1, \dots, v_d b Z_d)$
  – The resulting set is $V' \subset \ell_\infty$.
• Theorem: Set $b = (3 \ln n)^{1/p}$. Then $f_b$ satisfies
  – Non-contraction (for all points) with prob. > 1 − 1/n
  – Expansion: for any u, w ∈ V, with constant prob. $\|f_b(u) - f_b(w)\|_\infty \le b \|u - w\|_p$
• The expansion guarantee is needed for only one inter-point distance: between the query point and its nearest neighbor.
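A minimal Python sketch of the embedding defined above, assuming Frechet variables are drawn by inverse-CDF sampling and that the same draws are reused for every data point and for the query. The names `make_embedding`, `sample_frechet`, and `linf_distance` are illustrative, not from the talk, and the fixed seed is only for reproducibility.

```python
import math
import random

def sample_frechet(p, rng):
    """Inverse-CDF sample: (-ln U)^(-1/p) has CDF exp(-x^(-p))."""
    u = rng.random()
    while u == 0.0:  # rng.random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / p)

def make_embedding(n, d, p, rng):
    """Return (f_b, b, z) for an n-point set (n >= 2) in d dimensions.

    z holds the d Frechet draws; f_b scales coordinate i by b * z[i].
    """
    b = (3.0 * math.log(n)) ** (1.0 / p)
    z = [sample_frechet(p, rng) for _ in range(d)]

    def f_b(v):
        return [vi * b * zi for vi, zi in zip(v, z)]

    return f_b, b, z

def linf_distance(x, y):
    """l_infinity distance between two embedded vectors."""
    return max(abs(a - c) for a, c in zip(x, y))

rng = random.Random(0)  # fixed seed, for reproducibility only
f_b, b, z = make_embedding(n=100, d=3, p=3, rng=rng)
u, w = [1.0, 2.0, 0.0], [0.0, 2.0, 1.0]
print(linf_distance(f_b(u), f_b(w)))  # compare against b * ||u - w||_p
```

Because f_b is linear, the embedded distance between u and w equals the ℓ∞ norm of f_b(u − w), which is what the analysis of the theorem relies on.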

Analysis of Andoni’s embedding
• Theorem: Set $b = (3 \ln n)^{1/p}$. Then $f_b$ satisfies
  – Non-contraction (all points) with prob. > 1 − 1/n
  – Expansion: for any u, w ∈ V, with constant prob. $\|f_b(u) - f_b(w)\|_\infty \le b \|u - w\|_p$
• Proof of non-contraction:
  – Take v with $\|v\|_p = 1$. By max-stability, $\|f_b(v)\|_\infty \sim b\|v\|_p \cdot X = b \cdot X$.
  – By definition of the Frechet distribution, $\Pr[\|f_b(v)\|_\infty < 1] = \Pr[b \cdot X < 1] = e^{-(1/b)^{-p}} = e^{-b^p} = e^{-3 \ln n} = n^{-3}$.
  – Since the embedding is linear, v may be taken to be the normalized difference of any two vectors in V, so the probability that any of the $n^2$ inter-point distances decreases is less than $n^2 \cdot n^{-3} = 1/n$.
• Proof of expansion:
  – Same approach: $\Pr[\|f_b(v)\|_\infty \le b] = \Pr[b \cdot X < b] = \Pr[X < 1] = e^{-1}$
  – So the expansion is bounded by b with constant probability.
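The arithmetic in the proof can be reproduced directly. This Python sketch evaluates the single-distance contraction probability, the union bound over n² pairs, and the expansion probability, for the choice b = (3 ln n)^(1/p). The function names are illustrative.

```python
import math

def scale_b(n, p):
    """The scaling b = (3 ln n)^(1/p) from the theorem."""
    return (3.0 * math.log(n)) ** (1.0 / p)

def contraction_prob(n, p):
    """Pr[b * X < 1] = exp(-(1/b)^(-p)) = exp(-b^p) for one unit vector."""
    b = scale_b(n, p)
    return math.exp(-(1.0 / b) ** (-p))

def union_bound(n, p):
    """n^2 times the single-pair probability: bounds Pr[any pair contracts]."""
    return n * n * contraction_prob(n, p)

def expansion_prob():
    """Pr[X < 1] = exp(-1): chance that one distance expands by at most b."""
    return math.exp(-1.0)
```

For example, with n = 1000 the single-pair probability is n^(-3) and the union bound is 1/n, independent of p.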

Summary
• Embed ℓp space into ℓ∞
  – distortion $O(b) = O((\ln n)^{1/p})$
• Run Indyk’s ANN algorithm for ℓ∞
  – O(log log d)-ANN
• Final guarantee
  – $O(\log\log d \cdot \ln^{1/p} n)$-ANN

An improvement
• We can improve the guarantees of Andoni’s algorithm by considering the doubling dimension of the space.
  – Doubling constant: the number of half-radius balls necessary to cover a big ball.
  – Doubling dimension: log(doubling constant)
  – For example, d-dimensional Euclidean space has doubling dimension Θ(d)

Improvement outline
• Nearest neighbor search can be reduced to a series of subproblems
  – Searches on spaces with small aspect ratio
• So we can take a net on the subspaces, and run Andoni’s algorithm on the nets instead
  – Size of net: $\mathrm{ddim}^{O(\mathrm{ddim})}$
• Approximation:
  – Andoni: $O(\log\log d \cdot (\log n)^{1/p})$
  – Improved: $O(\log\log d \cdot (\mathrm{ddim} \log \mathrm{ddim})^{1/p})$

New algorithm
• Basic idea:
  – Embed ℓp space into ℓ2
  – Run ANN algorithm for ℓ2
• Embedding using the Mazur map

Mazur map
• The Mazur map is a mapping from ℓp to ℓq, for any 0 < p, q < ∞. The mapping of vector v ∈ ℓp is defined as
  $M(v) = (|v_0|^{p/q}, |v_1|^{p/q}, \dots, |v_{m-1}|^{p/q})$
• For set V, let C satisfy $C \ge \|v\|_p$ for all v ∈ V. Our embedding f is the Mazur map from ℓp to ℓ2, scaled down by a factor $(p/2) C^{p/2-1}$
  – f is non-expansive.
  – Contraction: if $\|x - y\|_p = u$, then $\|f(x) - f(y)\|_2 \ge 2p^{-1} (2C)^{1-p/2} u^{p/2}$ [Benyamini & Lindenstrauss '00]
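A short Python sketch of the scaled Mazur map into ℓ2, checked on one hand-picked pair of points. One assumption goes beyond the slide's formula: the sketch keeps the sign of each coordinate, as in the standard definition of the Mazur map, so that x and −x do not collide. The function names are illustrative.

```python
import math

def lp_norm(v, p):
    """The l_p norm of vector v."""
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def mazur_embed(v, p, C):
    """Mazur map from l_p to l_2, divided by (p/2) * C^(p/2 - 1).

    Assumption: coordinates keep their sign (sign(a) * |a|^(p/2)).
    """
    scale = (p / 2.0) * C ** (p / 2.0 - 1.0)
    return [math.copysign(abs(a) ** (p / 2.0), a) / scale for a in v]

def contraction_lower_bound(p, C, u):
    """The bound 2 p^-1 (2C)^(1 - p/2) u^(p/2) on the embedded distance."""
    return (2.0 / p) * (2.0 * C) ** (1.0 - p / 2.0) * u ** (p / 2.0)

p, C = 4, 1.0
x, y = [1.0, 0.0], [0.0, 1.0]                         # both have l_4 norm 1 <= C
fx, fy = mazur_embed(x, p, C), mazur_embed(y, p, C)
u = lp_norm([a - b for a, b in zip(x, y)], p)         # original l_4 distance
image = lp_norm([a - b for a, b in zip(fx, fy)], 2)   # embedded l_2 distance
assert contraction_lower_bound(p, C, u) <= image <= u
```

On this pair the embedded distance lies between the contraction bound and the original distance, matching the two properties stated for f. A single example does not establish either property in general.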

ANN via the Mazur map
• The distortion of our embedding is large and depends on the diameter C of the space: a distance u is only guaranteed to map to at least $2p^{-1} (2C)^{1-p/2} u^{p/2}$
• But we can show that this guarantee is sufficient to solve a specific case of nearest neighbor in ℓp:
  – the c-bounded nearest neighbor problem.

C-bounded nearest neighbor
• Define the c-bounded near neighbor problem, where $c \ge \|v\|_p$ for all v ∈ V:
  – If there is a point in V within distance 1 of query q, return it or some point in V within distance c/9 of q. This is a c/9-ANN.
  – If there is no point in V within distance 1 of query q, return null or some point in V within distance c/9 of q.

C-bounded nearest neighbor
• Approximately solve the c-bounded nearest neighbor problem in ℓp, for $c = p \cdot 18^{p/2}$
  – Embed from ℓp to ℓ2
  – Compute a 2-ANN in ℓ2
• Analysis: the Mazur map ensures that inter-point distances of c/9 or greater map to at least $2p^{-1} (2c)^{1-p/2} (c/9)^{p/2} = 4$.
• If q possesses a neighbor in the original space at distance 1 or less, that neighbor is at distance at most 1 in the embedded space, since the embedding is non-expansive. The 2-ANN therefore finds a point at distance at most 2 in the embedded space. This is less than 4, so the point is at distance less than c/9 in the original space.
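The choice c = p · 18^(p/2) is what makes the far-distance threshold come out to exactly 4. The Python sketch below evaluates the expression from the analysis for a few values of p. The function names are illustrative.

```python
def bounded_c(p):
    """The bound c = p * 18^(p/2) used for the c-bounded problem."""
    return p * 18.0 ** (p / 2.0)

def far_threshold(p):
    """Embedded-distance lower bound for original distances >= c/9.

    Evaluates 2 p^-1 (2c)^(1 - p/2) (c/9)^(p/2) with c = bounded_c(p).
    """
    c = bounded_c(p)
    return (2.0 / p) * (2.0 * c) ** (1.0 - p / 2.0) * (c / 9.0) ** (p / 2.0)

for p in (2.5, 3, 4, 6, 10):
    print(p, far_threshold(p))  # each value is 4 up to floating-point rounding
```

Algebraically the expression simplifies to (4/p) · 18^(-p/2) · c, so substituting c = p · 18^(p/2) gives 4 for every p.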

C-bounded nearest neighbor
• We can show that the c-bounded nearest neighbor problem can be used to give a c-ANN for the regular (unbounded) problem.
• Final result: $2^{O(p)}$-ANN

Conclusion
• Combine two algorithms for ℓp (2<p<∞):
  – Andoni: $O(\log\log d \cdot (\log n)^{1/p})$-ANN
  – New result: $2^{O(p)}$-ANN
• Worst-case approximation: $(\log\log d) \cdot \exp((\log\log n)^{1/2})$
• Improved bounds in metrics of low doubling dimension