Optimal Data Dependent Hashing for Approximate Near Neighbors

Optimal Data Dependent Hashing for Approximate Near Neighbors Alexandr Andoni (Simons Inst. / Columbia) Ilya Razenshteyn (MIT, now at IBM Almaden)

Near Neighbor Search •

Approximate Near Neighbor Search (ANN) •

Prior work •

Locality Sensitive Hashing (LSH) • From the definition of ANN

From LSH to ANN •

Bounds on LSH Distance metric Reference 1/4 (Andoni-Indyk 2006) 1/2 (O’Donnell-Wu-Zhou 2011) (Indyk-Motwani 1998) (O’Donnell-Wu-Zhou 2011) LSH? upon Can one improve Yes! First in (Andoni-Indyk-Nguyen- R 2014)

How to do better than LSH? •

The main result In this work we show optimal* data-dependent space partitions for the Euclidean and Manhattan/Hamming distances * After proper formalization

The main result (quantitative) Distance metric Reference 1/4 (Andoni-Indyk 2006) (O’Donnell-Wu-Zhou 2011) 1/7 This work 1/2 (Indyk-Motwani 1998) (O’Donnell-Wu-Zhou 2011) 1/3 This work

The plan • Focus on the Euclidean distance • Outline: • Random datasets (follows from (Andoni-Indyk-Nguyen- R 2014)) • Worst-case dataset → randomly-looking parts

Random case •

General case •

Handling clusters •

Overall bookkeeping • Root Clustering Cluster Random Voronoi LSH Part

Fast preprocessing •

Dynamic case • Classical LSH allows insertions and deletions “for free” • Insertions and deletions in ≈ query time by using dynamization techniques (Overmars-van Leeuwen 1981)

Conclusions and open problems • Quadratically improve upon the classical LSH (Indyk-Motwani 1998) and (Andoni -Indyk 2006) • Optimal data-dependent space partitions • Main idea: partition a worst-case dataset into pseudo-random pieces • High-dimensional geometric regularity lemma? • More applications of this approach? • Practical version of Voronoi LSH for ANN on a sphere (Andoni-Indyk-Kapralov-Laarhoven- R-Schmidt 2015) • can clustering be made practical too? Any questions?