On Fast Non-Metric Similarity Search by Metric Access Methods

On Fast Non-Metric Similarity Search by Metric Access Methods
Tomáš Skopal (tomas@skopal.net)
Charles University in Prague, Faculty of Mathematics and Physics, Department of Software Engineering, Prague, Czech Republic
EDBT 2006, Munich

Presentation Outline
- introduction
  - motivation of non-metric similarity search
  - metric access methods, intrinsic dimensionality
- our objective: fast non-metric search
  - turning non-metric into metric
  - the TriGen algorithm
  - experimental results
  - conclusions and future work

Similarity Search in Multimedia Databases
- non-structured data instances
  - multimedia objects, texts, sequences, time series, etc.
- distance function d: U × U → R
  - d(O_1, O_2) is interpreted as a dissimilarity score of two objects
  - metric properties (∀ O_i, O_j, O_k ∈ U); the first three define a semi-metric, all four a metric
    - reflexivity: d(O_i, O_j) = 0 ⇔ O_i = O_j
    - positivity: d(O_i, O_j) > 0 ⇔ O_i ≠ O_j
    - symmetry: d(O_i, O_j) = d(O_j, O_i)
    - triangular inequality: d(O_i, O_j) + d(O_j, O_k) ≥ d(O_i, O_k)
  - triangular triplet (a, b, c): a + b ≥ c & a + c ≥ b & b + c ≥ a
  - when the triangular inequality is satisfied by d, then for every O_1, O_2, O_3 ∈ U the triplet (d(O_1, O_2), d(O_2, O_3), d(O_1, O_3)) is a triangular triplet
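The properties listed on this slide can be checked by brute force on a small sample. The following is a minimal sketch, assuming distinct objects; the names `violated_properties`, `absolute` and `squared` are illustrative and do not come from the talk.

```python
from itertools import permutations

def is_triangular_triplet(a, b, c):
    """True if a + b >= c, a + c >= b and b + c >= a."""
    return a + b >= c and a + c >= b and b + c >= a

def violated_properties(objects, d):
    """Return the set of metric properties that d violates on distinct `objects`."""
    violated = set()
    for x in objects:
        if d(x, x) != 0:
            violated.add("reflexivity")
    for x, y in permutations(objects, 2):
        if d(x, y) <= 0:
            violated.add("positivity")
        if d(x, y) != d(y, x):
            violated.add("symmetry")
    for x, y, z in permutations(objects, 3):
        if d(x, y) + d(y, z) < d(x, z):
            violated.add("triangular inequality")
    return violated

points = [0.0, 1.0, 3.0]
absolute = lambda x, y: abs(x - y)      # a metric on real numbers
squared = lambda x, y: (x - y) ** 2     # a semi-metric: only the triangle fails

print(violated_properties(points, absolute))
print(violated_properties(points, squared))
```

The squared difference keeps reflexivity, positivity and symmetry, so it is a semi-metric in the sense used on this slide, but the triplet (1, 4, 9) it generates on these points is not triangular.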

Metric Access Methods
- given a metric d and a dataset S ⊆ U, metric access methods (MAMs) can be used to organize the objects of S
  - Reason: fast query processing (range and k-nearest-neighbor queries)
  - Principle of MAMs: structured decomposition of objects into equivalence classes, such that only some “candidate” classes have to be searched when querying
    - the filtering of non-relevant classes is possible due to the metric properties (esp. the triangular inequality)
  - Examples: M-tree, PM-tree, D-index, gh-tree, vp-tree, LAESA, etc.
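How the triangular inequality allows filtering can be illustrated with a single pivot. This is a minimal sketch of pivot-based filtering in general, not the implementation of any of the MAMs named above; `build_pivot_table` and `range_query` are hypothetical helper names.

```python
def build_pivot_table(objects, pivot, d):
    """Precompute d(o, pivot) for every object (done once, at indexing time)."""
    return [(o, d(o, pivot)) for o in objects]

def range_query(table, pivot, d, query, radius):
    """Return (result, number of distance computations against objects)."""
    d_qp = d(query, pivot)
    result, computed = [], 0
    for o, d_op in table:
        # Triangular inequality: d(query, o) >= |d(query, pivot) - d(o, pivot)|.
        if abs(d_qp - d_op) > radius:
            continue  # safely filtered without computing d(query, o)
        computed += 1
        if d(query, o) <= radius:
            result.append(o)
    return result, computed

d = lambda x, y: abs(x - y)
table = build_pivot_table([1, 2, 8, 9, 15], pivot=0, d=d)
hits, computed = range_query(table, 0, d, query=8.5, radius=1.0)
print(hits, computed)
```

If d violated the triangular inequality, the lower bound in the comment would not hold and relevant objects could be filtered out, which is why a non-metric measure cannot be plugged into such a structure directly.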

Metric Access Methods, cont.
- intrinsic dimensionality
  - definition (as proposed in [4]): ρ(S, d) = μ² / (2σ²), where μ is the mean and σ² is the variance of the distance distribution in S
  - indicates how efficiently (quickly) a dataset S could be queried using a metric d
    - low ρ (e.g. below 10) means the dataset is well-structured, i.e. there exist tight clusters of objects
    - high ρ means the dataset is poorly structured, i.e. objects are almost equally distant
  - [figure: low intrinsic dimensionality vs. high intrinsic dimensionality]
  - in consequence, intrinsically high-dimensional datasets are hard to organize, so that querying becomes inefficient (sequential scan)
  - example: an M-tree hierarchy built on a high-dimensional dataset
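The definition of ρ can be evaluated directly on all pairwise distances of a small sample. The sketch below assumes Euclidean distance and two hand-made datasets, which are illustrative only.

```python
from itertools import combinations
from math import dist
from statistics import fmean, pvariance

def intrinsic_dimensionality(objects, d):
    """rho(S, d) = mu^2 / (2 * sigma^2) over all pairwise distances in S."""
    distances = [d(x, y) for x, y in combinations(objects, 2)]
    mu = fmean(distances)
    return mu * mu / (2 * pvariance(distances, mu))

# Two tight clusters on a line: distances are either very small or very large.
clustered = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]

# Scaled unit vectors in six dimensions: all objects are almost equally distant.
scales = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
spread = [tuple(s if i == j else 0.0 for j in range(6)) for i, s in enumerate(scales)]

rho_clustered = intrinsic_dimensionality(clustered, dist)
rho_spread = intrinsic_dimensionality(spread, dist)
print(rho_clustered, rho_spread)
```

The clustered sample yields a ρ below 1, while the nearly equidistant sample yields a ρ far above 10, matching the low and high cases described on the slide.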

Metric vs. non-metric measures
- non-metric measures are often robust (resistant to outliers, errors in objects, etc.)
  - the symmetry and mainly the triangular inequality are often violated
  - [figure: violated symmetry, a ≠ b; violated triangular inequality, a > b + c]
- cannot be directly used with MAMs

Examples of Non-metric measures
- various k-median distances
  - measure distance between the two (k-th) most similar portions in objects
- COSIMIR
  - back-propagation network with a single output neuron serving as a distance, allows training
- Dynamic Time Warping distance
  - sequence alignment technique
  - minimizes the sum of distances between sequence elements
- fractional L_p distances
  - generalization of Minkowski distances (p < 1)
  - more robust to extreme differences in coordinates
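Of the measures listed, the fractional L_p distance is the easiest to write down. A minimal sketch with p = 0.5 on three hand-picked points (the points and the function name are illustrative) shows the triangular inequality being violated:

```python
def fractional_lp(x, y, p=0.5):
    """Minkowski-style distance with p < 1 (a fractional L_p distance)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
d_ab = fractional_lp(a, b)
d_bc = fractional_lp(b, c)
d_ac = fractional_lp(a, c)
print(d_ab, d_bc, d_ac)  # the direct distance exceeds the detour through b
```

The measure stays symmetric, reflexive and positive, so only the triangular inequality has to be repaired.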

Turning Non-metric into Metric
- the reflexivity & positivity
  - by setting a minimum distance lower bound d⁻ < 0, i.e. O_1 ≠ O_2 ⇒ d_rp(O_1, O_2) = d(O_1, O_2) + |d⁻| + some small value, otherwise d_rp(O_1, O_2) = 0
- the symmetry
  - e.g. d_s(O_1, O_2) = min(d(O_1, O_2), d(O_2, O_1))
  - the query is processed using d_s, and the query result is re-filtered using d
- how to satisfy the triangular inequality?
  - we apply a modifying function f on d, turning the semi-metric into a metric
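The first two repairs can be combined in one wrapper. The sketch below assumes that objects can be compared with `==` and that a lower bound of the raw measure is known; `make_semi_metric`, `raw` and `eps` are hypothetical names chosen for illustration.

```python
def make_semi_metric(d, lower_bound, eps=1e-9):
    """Wrap a raw measure d so that it is reflexive, positive and symmetric."""
    shift = abs(lower_bound) + eps  # |d^-| plus "some small value"

    def d_s(x, y):
        if x == y:
            return 0.0                           # reflexivity
        return min(d(x, y), d(y, x)) + shift     # symmetry, then positivity

    return d_s

# An illustrative asymmetric measure that can be as low as -0.5.
def raw(x, y):
    return (x - y) ** 2 - 0.5 + (0.3 if x > y else 0.0)

d_s = make_semi_metric(raw, lower_bound=-0.5)
print(d_s(1, 2), d_s(2, 1), d_s(2, 2))
```

Because the minimum was used for symmetrization, a query processed with `d_s` can return objects that the original measure would reject, which is why the slide re-filters the result using d.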

SP-modifiers
Let f be a function f: R → R, such that f(0) = 0 and f is increasing (i.e. f(x) > f(y) ⇔ x > y). For similarity search purposes f(d(·, ·)), further denoted as d_f, can be safely used instead of just d. (In case of a range query (Q, r_Q) the query radius r_Q is modified to f(r_Q).)
Proof: All similarity orderings are preserved.
1) Consider the set of all pairs of objects from U.
2) Create an ordering of the pairs with respect to the distances of the two objects in the pair.
3) The ordering does not change after the application of any f on the distances, because f is increasing.
We call such a function f a similarity-preserving modifier (SP-modifier).
[figure: example SP-modifiers f_1, f_2, f_3, f_4]
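The claim that d_f can replace d can be observed on a toy example. This is a minimal sketch using the square root as the increasing function f and a sequential range query; the dataset and helper names are illustrative.

```python
from math import sqrt

def range_query(objects, d, query, radius):
    """Sequential range query: all objects within `radius` of `query`."""
    return [o for o in objects if d(query, o) <= radius]

def modified(d, f):
    """Build d_f, i.e. f applied to the distance returned by d."""
    return lambda x, y: f(d(x, y))

d = lambda x, y: (x - y) ** 2
f = sqrt                       # increasing, f(0) = 0
d_f = modified(d, f)

objects = [1, 4, 6, 9, 12]
query, radius = 5, 4.5

original = range_query(objects, d, query, radius)
with_f = range_query(objects, d_f, query, f(radius))  # radius modified to f(r_Q)

ranking = sorted(objects, key=lambda o: d(query, o))
ranking_f = sorted(objects, key=lambda o: d_f(query, o))
print(original, with_f, ranking == ranking_f)
```

Both the range query result and the nearest-neighbor ranking are identical, as the ordering argument in the proof predicts.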

TG-modifiers
We want to find an SP-modifier that forces d to satisfy the triangular inequality:
- any concave SP-modifier f is metric-preserving (proof in [3])
  - when applied on any metric d(·, ·), d_f is a metric as well
  - when applied on a triangular triplet (a, b, c), (f(a), f(b), f(c)) is a triangular triplet as well
- any concave SP-modifier is triangle-generating (TG-modifier)
  - when applied on all possible triplets, some of them become triangular (theory of concave functions)
    - the more concave f, the more triplets become triangular
  - once a triplet becomes triangular, it remains triangular after the application of any other TG-modifier
Theorem: Every semi-metric can be turned by a single TG-modifier into a metric.
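Both properties can be observed on individual triplets. The sketch below uses the square root as one example of a concave SP-modifier; the two triplets are made up for illustration.

```python
from math import sqrt

def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a

def apply(f, triplet):
    """Apply the modifier f to each distance of a triplet."""
    return tuple(f(x) for x in triplet)

triangular = (3.0, 4.0, 5.0)
non_triangular = (9.0, 2.0, 4.0)   # 9 > 2 + 4

print(is_triangular(*apply(sqrt, triangular)))      # stays triangular
print(is_triangular(*non_triangular))               # violates the inequality
print(is_triangular(*apply(sqrt, non_triangular)))  # repaired by the concave f
```

A single application of the square root is enough here; a triplet with a larger violation would need a more concave modifier.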

Proof: Incremental Triplet Stretching
We repeatedly apply TG-modifiers on all triplets (generated by d(·, ·) on S), starting with a less concave TG-modifier and proceeding with more concave ones.
[figure: a non-triangular triplet (a, b, c) with a > b + c, stretched by successive modifiers]
- low-concave f_1: f_1(a) > f_1(b) + f_1(c)
- high-concave f_2: f_2(a) ≤ f_2(b) + f_2(c)
- extremely high-concave f_3: (f_3(a), f_3(b), f_3(c))
We continue applying more and more concave TG-modifiers (e.g. by nesting them) until we turn all the triplets into triangular ones.
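The nesting idea can be traced on a single badly violated triplet. This is a minimal sketch that nests the square root (each nesting yields a more concave modifier); `nestings_needed` and its `limit` parameter are hypothetical.

```python
from math import sqrt

def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a

def nestings_needed(triplet, f, limit=64):
    """Nest f until the triplet becomes triangular; return the number of applications."""
    count = 0
    while not is_triangular(*triplet):
        if count == limit:
            raise ValueError("triplet did not become triangular within the limit")
        triplet = tuple(f(x) for x in triplet)
        count += 1
    return count

print(nestings_needed((100.0, 1.0, 1.0), sqrt))  # 100 > 1 + 1, needs several nestings
print(nestings_needed((3.0, 4.0, 5.0), sqrt))    # already triangular
```

Nesting the square root k times equals raising to the power 1/2^k, so the loop walks through an increasingly concave family, as in the stretching argument above.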

Optimal TG-modifier
There exist infinitely many TG-modifiers that turn a given semi-metric into a metric. However, not all are suitable for fast similarity search.
Example [figure: the example TG-modifier f]: This TG-modifier turns every semi-metric into a metric, but is useless for searching by MAMs. All classes maintained by a MAM overlap every query, so the search deteriorates to a sequential scan.
The optimal TG-modifier should:
- turn every non-triangular triplet generated by d (considering the objects from S) into a triangular one (i.e. enforce the triangular inequality)
- keep the intrinsic dimensionality of S with respect to d_f as low as possible
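The tension between the two requirements can be made visible numerically. The sketch below applies an assumed power-style modifier x^(1/(1+w)) with growing w to a made-up list of distances and reports ρ = μ² / (2σ²) each time; neither the modifier nor the data is taken from the slides.

```python
from statistics import fmean, pvariance

def rho(distances):
    """Intrinsic dimensionality mu^2 / (2 * sigma^2) of a distance sample."""
    mu = fmean(distances)
    return mu * mu / (2 * pvariance(distances, mu))

def power_modifier(x, w):
    """Assumed concave modifier: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

distances = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.0]
weights = (0, 1, 4, 16)
rhos = [rho([power_modifier(x, w) for x in distances]) for w in weights]

for w, r in zip(weights, rhos):
    print(w, round(r, 2))
```

The more concave the modifier, the closer all modified distances move to each other and the higher ρ becomes, which is the sequential-scan degeneration warned about in the example.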

Scaling the concavity
How to find an optimal TG-modifier for a given d (and S)? We make use of some predefined TG-bases:
- a TG-base is an extended TG-modifier that takes a concavity weight w ≥ 0 as a second parameter, i.e. f: R × R → R
  - for w = 0, the TG-base turns into the identity, i.e. f(x, 0) = x
  - with increasing w, the TG-modifier f(x, w) becomes more concave
- the greater w (thus the more concave f),
  - the more triplets become triangular
  - the higher the intrinsic dimensionality is
- we can relax the strict condition of needing all triplets to become triangular by introducing a TG-error tolerance θ (the tolerated ratio of non-triangular triplets to all triplets)
  - a choice of exact or approximate search (θ = 0 or θ > 0)
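The effect of the concavity weight on the share of non-triangular triplets can be sketched as follows. The fractional-power form x^(1/(1+w)) is an assumption made for illustration (the slides do not show a formula), as are the sample points and the squared Euclidean measure.

```python
from itertools import combinations

def fp_base(x, w):
    """Assumed TG-base: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

def tg_error(objects, d, w):
    """Fraction of object triplets whose modified distances are not triangular."""
    bad = total = 0
    for x, y, z in combinations(objects, 3):
        a, b, c = sorted(fp_base(v, w) for v in (d(x, y), d(y, z), d(x, z)))
        total += 1
        if a + b < c:  # only the largest side can violate the inequality
            bad += 1
    return bad / total

# Squared Euclidean distance: a semi-metric that violates the triangular inequality.
d = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
objects = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 2.0)]

errors = [tg_error(objects, d, w) for w in (0.0, 0.5, 1.5)]
print(errors)
```

With w = 0 some triplets are non-triangular, and with a sufficiently large w none are; a tolerance θ > 0 would allow stopping at a smaller w.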

Proposed TG-bases
- general-purpose TG-bases
  - Fractional Power TG-base (FP-base)
  - Rational Bézier Quadric TG-bases (RBQ-bases)
    - each such TG-base is additionally parameterized by the second Bézier point (a, b)
    - choosing a different (a, b) allows predefining the “place of maximum concavity” in the TG-base
- we need to find an optimal w for a TG-base f, such that d_f becomes a metric but w is as low as possible: the TriGen algorithm

The TriGen algorithm
The algorithm finds a TG-modifier (formed by a TG-base and the appropriate concavity weight w) which turns a given semi-metric d into an (approximated) metric, while the intrinsic dimensionality is kept as low as possible. When searching for the optimal concavity weight, the algorithm repeatedly halves the concavity interval.
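The halving search can be sketched for a single TG-base. This is a simplified illustration and not the published TriGen procedure: it assumes a fractional-power base x^(1/(1+w)), works on a precomputed sample of distance triplets, and omits the comparison of several TG-bases by intrinsic dimensionality. All names and the sample data are hypothetical.

```python
def fp_base(x, w):
    """Assumed TG-base: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

def tg_error(triplets, w):
    """Fraction of distance triplets that are not triangular after modification."""
    bad = 0
    for triplet in triplets:
        a, b, c = sorted(fp_base(x, w) for x in triplet)
        if a + b < c:
            bad += 1
    return bad / len(triplets)

def trigen_weight(triplets, tolerance=0.0, iterations=24):
    """Find a small w with tg_error <= tolerance by doubling, then halving."""
    if tg_error(triplets, 0.0) <= tolerance:
        return 0.0                      # simplification: no modification needed
    w_lb, w_ub, w, w_best = 0.0, None, 1.0, None
    for _ in range(iterations):
        if tg_error(triplets, w) <= tolerance:
            w_ub = w_best = w           # good enough, try a less concave modifier
        else:
            w_lb = w                    # too many non-triangular triplets
        # Double until an upper bound is known, then halve the interval.
        w = (w_lb + w_ub) / 2 if w_ub is not None else 2 * w
    return w_best

sample_triplets = [(1.0, 1.0, 4.0), (2.0, 3.0, 4.0), (1.0, 2.0, 3.5)]
exact = trigen_weight(sample_triplets)                   # tolerance 0
approx = trigen_weight(sample_triplets, tolerance=0.34)  # one triplet may stay bad
print(exact, approx)
```

Allowing a non-zero tolerance yields a smaller weight, hence a less concave modifier, at the price of an approximate search.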

Experimental Results
The testbed:
- two datasets (1 real: images (histograms), 1 synthetic: polygons)
- 10 non-metric measures (6 for images, 4 for polygons)
  - TriGen was used to turn each semi-metric into a metric
- 2 MAMs: M-tree and PM-tree
Testing of:
1) intrinsic dimensionalities of the datasets (with respect to d_f, where f is the TG-modifier found by TriGen)
2) k-NN queries: performance and retrieval error (when the TG-error tolerance θ > 0)

Experiments: intrinsic dimensionalities

Experiments: k-NN queries

Conclusions and Future Work
We have presented:
- a way of fast searching in non-metric datasets by metric access methods
  - in particular, the TriGen algorithm for turning any semi-metric into a metric
- future work:
  - a generalized framework for fast exact and approximate similarity search (either metric or non-metric), in combination with previous work [2]

References
[1] T. Skopal, J. Pokorný, V. Snášel. Nearest Neighbours Search using the PM-tree. In DASFAA 2005, Beijing, China, pages 803–815. LNCS 3453, Springer.
[2] T. Skopal, P. Moravec, J. Pokorný, V. Snášel. Metric Indexing for the Vector Model in Text Retrieval. In SPIRE 2004, Padova, Italy, pages 183–195. LNCS 3246, Springer.
[3] P. Corazza. Introduction to metric-preserving functions. American Mathematical Monthly 104(4), 1999.
[4] E. Chávez, G. Navarro. A Probabilistic Spell for the Curse of Dimensionality. In ALENEX 2001, LNCS 2153, Springer.