When Is “Nearest Neighbor” Meaningful?
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft
Talk overview
• Motivation
• Instability definition
• Relationship between instability and indexability
• Analysis of workloads
• Conclusions
• Future work
Definition of “Nearest Neighbor”
• Given a relation, the nearest neighbor problem determines which tuple in the relation is closest to some given tuple (not necessarily from the original relation) under some distance function.
• Usually the fields of the relation are reals, and the distance function is a metric. L2 is the most frequently used metric.
• High dimensional nearest neighbor problems usually stem from similarity and approximate matching problems.
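As a concrete point of reference (not from the slides), a minimal brute-force sketch in Python, assuming real-valued fields and the L2 metric; the array names and sizes are illustrative only:

```python
import numpy as np

def nearest_neighbor(data, query):
    """Index of the row of `data` closest to `query` under the L2 metric.

    `data` is an (n, d) array of n tuples with d real-valued fields;
    `query` is a length-d vector, not necessarily a row of `data`.
    """
    dists = np.linalg.norm(data - query, axis=1)  # L2 distance to every tuple
    return int(np.argmin(dists))

# Toy usage: 1000 tuples with 20 real-valued fields
rng = np.random.default_rng(0)
data = rng.uniform(size=(1000, 20))
query = rng.uniform(size=20)
print(nearest_neighbor(data, query))
```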
Motivation
• Nearest neighbor processing techniques perform badly in high dimensionality. Why?
• Is there a fundamental reason for this breakdown?
• Is more than performance affected by this breakdown?
• Are there high dimensional scenarios in which these techniques may perform well?
Instability
[Figures: a typical query in 2D vs. an unstable query in 2D]
Formal definition of instability (i.e., as dimensionality increases, all points become equidistant with respect to the query point)
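A sketch of the formal statement, reconstructed from the published paper (the slide's formula is not preserved in this transcript): writing DMIN_m and DMAX_m for the distances from the query point to its nearest and farthest data point in dimensionality m, the workload is unstable when, for every tolerance ε > 0,

```latex
% Instability (reconstructed from the paper): as dimensionality m grows,
% the farthest point is, with probability approaching 1, at most a factor
% (1 + epsilon) farther from the query than the nearest point.
\[
  \forall \varepsilon > 0 : \qquad
  \lim_{m \to \infty}
  \Pr\bigl[\, \mathit{DMAX}_m \le (1 + \varepsilon)\, \mathit{DMIN}_m \,\bigr] = 1 .
\]
```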
Instability and indexability
If a workload has the following properties:
1) The workload is unstable
2) The query distribution follows the data distribution
3) Distance is calculated using the L2 metric
4) The number of data points is constant for all dimensionalities
then, as dimensionality increases, the probability that all (non-trivial) convex decompositions of the space result in examining all data points becomes 1.
Instability tool
IID result application
Assume the following:
• The data distribution and query distribution are IID in all dimensions.
• All the appropriate moments are finite (i.e., up to the ⌈2p⌉-th moment).
• The query point is chosen independently of the data points.
Variance goes to 0 result application
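The underlying result (a hedged sketch, reconstructed from the published paper rather than from the missing slide body): let D_m be the distance from the query point to a random data point in dimensionality m and p the exponent of the metric; if the relative variance of D_m^p vanishes, all distances concentrate and the workload is unstable.

```latex
% Sufficient condition for instability (sketch, following the paper):
\[
  \lim_{m \to \infty}
  \frac{\operatorname{Var}\!\left[ D_m^{\,p} \right]}
       {\left( \mathbb{E}\!\left[ D_m^{\,p} \right] \right)^{2}} = 0
  \;\;\Longrightarrow\;\;
  \forall \varepsilon > 0 :\;
  \lim_{m \to \infty}
  \Pr\bigl[\, \mathit{DMAX}_m \le (1 + \varepsilon)\, \mathit{DMIN}_m \,\bigr] = 1 .
\]
```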
Examples that meet our condition:
• All dimensions are IID; Q ~ P (query distribution follows data distribution)
• Variance converges to 0 at a bounded rate; Q ~ P
• Variance converges to infinity at a bounded rate; Q ~ P
• All dimensions have some correlation; Q ~ P
• Variance converges to 0 at a bounded rate, all dimensions have some correlation; Q ~ P
• The data contains perfect clusters; Q ~ IID uniform
Examples that don’t meet our condition:
• All dimensions are completely correlated; Q ~ P (see the sketch below)
• All dimensions are linear combinations of a fixed number of IID random variables; Q ~ P
• The data contains perfect clusters; Q ~ P; a special case of this is the approximate matching problem
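For intuition on the fully correlated case (a toy sketch, not from the slides): when every dimension is a copy of one underlying random variable, adding dimensions only rescales all distances by the same factor, so the contrast DMAX/DMIN is the same as in one dimension no matter how high the dimensionality gets.

```python
import numpy as np

# Fully correlated dimensions: contrast does not collapse as dim grows.
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)                   # one underlying random variable
q_val = rng.uniform()                        # query follows the data distribution
for dim in (1, 10, 100, 1000):
    data = np.tile(x[:, None], (1, dim))     # every dimension is a copy of x
    q = np.full(dim, q_val)
    d = np.linalg.norm(data - q, axis=1)     # equals sqrt(dim) * |x - q_val|
    print(dim, round(d.max() / d.min(), 2))  # same ratio for every dim
```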
IID contrast as dimensionality increases
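A small simulation sketch (not from the talk) of the IID experiment: data and query are IID uniform in every dimension, and the average contrast DMAX/DMIN drifts toward 1 as dimensionality grows.

```python
import numpy as np

def mean_contrast(n_points, dim, n_queries=100, seed=0):
    """Average DMAX/DMIN over random queries for IID uniform data."""
    rng = np.random.default_rng(seed)
    data = rng.uniform(size=(n_points, dim))
    ratios = []
    for _ in range(n_queries):
        q = rng.uniform(size=dim)                 # query follows the data distribution
        d = np.linalg.norm(data - q, axis=1)
        ratios.append(d.max() / d.min())
    return float(np.mean(ratios))

for dim in (1, 2, 10, 100, 1000):
    print(dim, round(mean_contrast(1000, dim), 2))
```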
Contrast as dimensionality increases
Contrast in ideally clustered data
[Figure panels. Top right: typical distance distribution. Bottom left: ideal clusters. Bottom right: distance distribution for ideally clustered data/queries.]
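A companion sketch (again not from the slides) for the ideally clustered case: the data forms tight clusters and each query is drawn from the same clustered distribution (Q ~ P), so the nearest cluster stays far closer than the rest and the contrast remains large even in high dimensionality. All names and parameters here are illustrative.

```python
import numpy as np

def clustered_contrast(n_clusters, per_cluster, dim, noise=1e-3, seed=0):
    """Average DMAX/DMIN when data and queries come from tight clusters."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(size=(n_clusters, dim))
    # Ideally clustered data: each point is a cluster center plus tiny noise.
    data = (np.repeat(centers, per_cluster, axis=0)
            + rng.normal(scale=noise, size=(n_clusters * per_cluster, dim)))
    ratios = []
    for c in centers:                              # queries follow the data distribution
        q = c + rng.normal(scale=noise, size=dim)
        d = np.linalg.norm(data - q, axis=1)
        ratios.append(d.max() / d.min())
    return float(np.mean(ratios))

for dim in (10, 100, 1000):
    print(dim, round(clustered_contrast(10, 100, dim), 1))  # stays large as dim grows
```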
Contrast for a real image database
Distance distribution for fixed query NN
[Three consecutive slides of figures]
Conclusions
• Serious questions are raised about techniques that map approximate similarity into high dimensional nearest neighbor problems.
• The ease with which linear scan beats more complex access methods for high-D nearest neighbor is explained by our theorem.
• These results should not be taken to mean that all high dimensional nearest neighbor problems are badly framed, or that more complex access methods will always fail on individual high-D data sets.
Future Work
• Examine the contrast produced by various mappings of similarity problems into high dimensional spaces
• Does contrast fully capture the difficulty associated with the high dimensional nearest neighbor problem?
• If so, find an indexing structure for nearest neighbor which has guaranteed good performance in high contrast situations
• Determine the performance of various indexing structures compared to linear scan as dimensionality increases