Lower Bounds for Embedding Edit Distance into Normed Spaces
A. Andoni, M. Deza, A. Gupta, P. Indyk, S. Raskhodnikova

Definitions
- Edit distance between two strings:
  - Minimum number of edit operations needed to transform one string into another
- Edit operation:
  - Insertion, deletion, or substitution of one character
- Edit distance = Levenshtein metric

Edit Distance
- Important in:
  - Computational biology
  - Text processing
- Computational bottlenecks:
  - Widely used algorithm takes quadratic time
  - No efficient algorithm known for nearest neighbor computation
- New approach for dealing with edit distance:
  - Embedding into a normed space

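For reference, the quadratic-time computation can be sketched with the standard dynamic program over string prefixes. The slides do not name the algorithm, so treating it as this textbook recurrence is an assumption; the function name `edit_distance` is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard O(|s| * |t|) dynamic program."""
    # prev[j] = distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The two nested loops are what make the running time quadratic in the string length.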
Embedding
- Definition:
  - A mapping f: Strings → l_p^d, such that for any pair of strings s and s':
    Edit(s, s') ≤ ||f(s) - f(s')||_p ≤ c·Edit(s, s')
  - The factor c is called distortion
- Useful to embed edit distance into a normed space because:
  - Efficient algorithms working on normed spaces are known (e.g. nearest neighbor computation)
  - Can compute (approximately) edit distance in subquadratic time, if computing the mapping takes subquadratic time

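For a finite point set, the distortion in the definition above can be computed directly. The sketch below is an illustration, not taken from the slides: it allows a uniform rescaling of the embedding, so the distortion is the largest expansion ratio divided by the smallest. The names `distortion`, `d_orig` and `d_emb` are hypothetical.

```python
from itertools import combinations


def distortion(points, d_orig, d_emb):
    """Smallest c such that, after rescaling, d_orig <= d_emb <= c * d_orig on all pairs."""
    ratios = [d_emb(x, y) / d_orig(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Toy example: three points at pairwise distance 1, placed on a line at 0, 1, 2.
points = ["a", "b", "c"]
position = {"a": 0, "b": 1, "c": 2}
d_uniform = lambda x, y: 1
d_line = lambda x, y: abs(position[x] - position[y])
print(distortion(points, d_uniform, d_line))  # 2.0
```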
Embedding edit distance into a normed space
- Essentially nothing known
- If moving a contiguous block of characters is allowed as a single edit operation:
  - Can embed the new metric into l_1 with distortion O(log d · log* d) [CPSV'00] (d = length of strings to embed)

Result in this paper
- A lower bound of 3/2 on the distortion of embedding into l_1 and (l_2)^2
- The bound cannot be improved using our technique

Structure of the argument
- Will show that:
  - Edit metric contains the shortest path metric over the K_{2,n} graph (K_{2,n}-metric) as induced subgraph
  - K_{2,n}-metric not embeddable into (l_2)^2 with low distortion
- Conclude that:
  - Edit metric not embeddable into (l_2)^2 with distortion better than 3/2
  - Edit metric not embeddable into l_1 with distortion better than 3/2, since l_1-metric can be embedded isometrically into (l_2)^2 [LLR94]
- Show that:
  - The bound of 3/2 is tight for the considered graph

K_{2,n} metric: induced subgraph of edit metric
- Vertices of the graph are A_1, A_2, B_1, B_2, …, B_n
- Edges are (A_i, B_j), where 1≤i≤2, 1≤j≤n
- The mapping:
  - A_1 is mapped to the string (10)^n
  - A_2 is mapped to the string (10)^(n-1)
  - B_j is mapped to the string (10)^(j-1) 1 (10)^(n-j)
- [Figure: the mapping for n=4, where A_1 = (10)^4 = 10101010, A_2 = 101010, B_1 = 1101010, B_2 = 1011010, B_3 = 1010110, B_4 = 1010101]

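The claim that these strings realize the K_{2,n}-metric can be checked by brute force for small n. The sketch below builds the strings from the mapping above and compares edit distances with shortest-path distances in K_{2,n} (1 across the bipartition, 2 within a side). The helper names are illustrative, and the edit distance routine is the textbook dynamic program, assumed rather than taken from the slides.

```python
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance (insert, delete, substitute), row-by-row DP."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def k2n_strings(n):
    """Strings assigned to A1, A2, B1..Bn by the mapping on the slide."""
    strings = {"A1": "10" * n, "A2": "10" * (n - 1)}
    for j in range(1, n + 1):
        strings[f"B{j}"] = "10" * (j - 1) + "1" + "10" * (n - j)
    return strings


def k2n_distance(u, v):
    """Shortest-path distance in K_{2,n}: 1 between an A and a B, 2 otherwise."""
    return 1 if u[0] != v[0] else 2


def is_isometric(n):
    s = k2n_strings(n)
    return all(edit_distance(s[u], s[v]) == k2n_distance(u, v)
               for u, v in combinations(s, 2))


print(all(is_isometric(n) for n in range(1, 7)))  # True
```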
Lower bound for embedding K_{2,n} graph into (l_2)^2
- Theorem 1:
  - For any ε>0, there exists some n such that K_{2,n}-metric cannot be embedded into (l_2)^2 with distortion less than (3/2 - ε)

Proof of theorem 1
- Let:
  - B_{-1} = A_1 and B_0 = A_2
  - f = some embedding of K_{2,n}-metric into (l_2)^2 with distortion c
- The metric over points f(B_{-1}), …, f(B_n) needs to satisfy the negative type inequality:
  - For any integers b_{-1}, …, b_n that sum up to 0:
    Σ_{-1≤i<j≤n} b_i b_j ||f(B_i) - f(B_j)||_2^2 ≤ 0
- With suitable values for n and b_i, the inequality gives: c ≥ 3/2 - ε

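The slides do not state which values of n and b_i are used. One choice that appears to work, given here as an assumption rather than as the authors' choice, is n even with b_{-1} = b_0 = -n/2 and b_1 = … = b_n = 1. Writing D_{ij} for the squared l_2 distances and using 1 ≤ D ≤ c on edges and D ≥ 2 on non-adjacent pairs:

```latex
% Assumed choice (not from the slides): n even, b_{-1} = b_0 = -n/2, b_1 = \dots = b_n = 1.
% D_{ij} = \|f(B_i) - f(B_j)\|_2^2, and d \le D \le c\,d for the K_{2,n}-metric d.
\begin{align*}
0 &\ge \sum_{-1 \le i < j \le n} b_i b_j D_{ij} \\
  &= \frac{n^2}{4}\, D_{-1,0} \;+ \sum_{1 \le i < j \le n} D_{ij}
     \;-\; \frac{n}{2} \sum_{i \in \{-1,0\}} \sum_{j=1}^{n} D_{ij} \\
  &\ge \frac{n^2}{4} \cdot 2 \;+\; \binom{n}{2} \cdot 2 \;-\; \frac{n}{2} \cdot 2n \cdot c \\
  &= \frac{3n^2}{2} - n - c\,n^2 .
\end{align*}
% Hence c \ge 3/2 - 1/n, which is at least 3/2 - \varepsilon once n \ge 1/\varepsilon.
```

The positive terms come from the two pairs of non-adjacent vertex classes (distance 2), the negative terms from the 2n edges (distance 1, stretched by at most c).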
3/2 is a tight bound
- Will prove that 3/2 is a tight bound for embedding K_{2,n}-metric into l_1
- Theorem 2:
  - There exists an embedding f of K_{2,n}-metric into l_1 with distortion 3/2

Proof of theorem 2
- Will combine two embeddings f_1 and f_2
- f_1 is:
  - f_1(A_1) = (0, …, 0)
  - f_1(A_2) = (1, …, 1)/2^n
  - f_1(B_j) = (bin(0)_j, …, bin(2^n - 1)_j)/2^n, where bin(i)_j = j-th bit of the binary representation of integer i
- f_1 satisfies:
  - ||f_1(A_1) - f_1(A_2)||_1 = 1
  - ||f_1(A_i) - f_1(B_j)||_1 = 1/2, for 1≤i≤2, 1≤j≤n
  - ||f_1(B_i) - f_1(B_j)||_1 = 1/2, for 1≤i<j≤n

Proof of theorem 2 (cont)
- f_2 is:
  - f_2(A_1) = f_2(A_2) = (0, …, 0)
  - f_2(B_j) = e_j/2 (e_j = vector with 1 at the j-th position and 0 elsewhere)
- f_2 satisfies:
  - ||f_2(A_1) - f_2(A_2)||_1 = 0
  - ||f_2(A_i) - f_2(B_j)||_1 = 1/2, for 1≤i≤2, 1≤j≤n
  - ||f_2(B_i) - f_2(B_j)||_1 = 1, for 1≤i<j≤n
- If f_1 and f_2 induce metrics D_1 and D_2:
  - 2·D_1 + D_2 provides a distortion of 3/2

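The combined metric 2·D_1 + D_2 is the l_1 distance between the concatenated vectors (2·f_1, f_2), so the claim can be checked numerically for small n. The sketch below assumes f_1 lives in dimension 2^n with scaling 1/2^n, which is the reading consistent with the stated norms; function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def embed(n):
    """Concatenation of 2*f1 and f2 for every vertex of K_{2,n}, as exact rationals."""
    m = 2 ** n
    f1 = {"A1": [Fraction(0)] * m, "A2": [Fraction(1, m)] * m}
    f2 = {"A1": [Fraction(0)] * n, "A2": [Fraction(0)] * n}
    for j in range(1, n + 1):
        # coordinate i of f1(Bj) is the j-th bit of i, scaled by 1/2^n
        f1[f"B{j}"] = [Fraction((i >> (j - 1)) & 1, m) for i in range(m)]
        # f2(Bj) = e_j / 2
        f2[f"B{j}"] = [Fraction(int(k == j), 2) for k in range(1, n + 1)]
    return {v: [2 * x for x in f1[v]] + f2[v] for v in f1}


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def k2n_distance(u, v):
    """Shortest-path distance in K_{2,n}: 1 between an A and a B, 2 otherwise."""
    return 1 if u[0] != v[0] else 2


def combined_distortion(n):
    f = embed(n)
    ratios = [l1(f[u], f[v]) / k2n_distance(u, v) for u, v in combinations(f, 2)]
    return max(ratios) / min(ratios)


print(combined_distortion(4))  # 3/2
```

Pairs on the same side keep their distance of 2 exactly, while each edge is stretched from 1 to 3/2, which is where the distortion comes from.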
Computational Experiments
- Goal: raise lower bound (of 3/2)
- Tried following approaches:
  - Optimal embedding of strings of length up to d
    - into l_1 using cut-metric formulation
    - into (l_2)^2 using semidefinite programming
  - Lower bounds via expansion properties of metric

Optimal embedding into l_1
- A metric is embeddable into l_1 iff it can be represented as a convex combination of cut metrics
- For computing optimal distortion can use linear programming
- Deficiency: number of variables is 2^(|X|-1), where |X| = 2^(d+1) - 1
  - Infeasible for d>3
  - For d=3, distortion is 4/3 < 3/2

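A quick calculation shows why the linear program stops being feasible past d=3. The sketch assumes a binary alphabet (consistent with |X| = 2^(d+1) - 1) and reads the variable count on the slide as 2^(|X|-1), one variable per cut; both function names are illustrative.

```python
def num_strings(d):
    """Number of binary strings of length 0..d, i.e. |X| = 2^(d+1) - 1."""
    return 2 ** (d + 1) - 1


def num_cut_variables(d):
    """LP variables under the assumed count of 2^(|X|-1), one per cut of X."""
    return 2 ** (num_strings(d) - 1)


for d in range(1, 5):
    print(d, num_strings(d), num_cut_variables(d))
```

For d=3 this gives 15 strings and 16384 variables; for d=4 it already gives 31 strings and about 10^9 variables.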
Optimal embedding into (l_2)^2
- Formulated as a semidefinite programming problem
- For d=5, obtained optimal distortion of ~1.30 < 3/2
- Could not run for d=6 since it would require ~2 GB of memory

Lower bounds via expansion
- Idea:
  - To show that the graph underlying edit metric is a "good" expander
- Considered "two-layers" graph G:
  - The graph of all strings of length d and d-1
  - Regular with added self-loop edges (up to degree Δ = 3d-1)
  - Shortest path metric over G = induced subgraph of edit metric

Expansion
- Goal:
  - To find C such that for any set A of vertices:
    |e(A, V-A)| ≥ C·|A|·|V-A|/n
    (e(A, B) = set of edges between A and B)
- Then:
  - Distortion ≥ S·C·avg(G)/Δ, where
    - S = const
    - avg(G) = average distance in G
    - C ≥ "eigenvalue gap"

Eigenvalue gap
- Can compute eigenvalues efficiently
- Was not large enough:
  - ~2.7 for d=4, 8, 12, 16
  - For comparison: 2 for hypercube (embeddable isometrically into l_1)
  - The resulting lower bound on distortion is below 3/2 for d≤16

Conclusion
- Lower bound of 3/2 for distortion of embedding edit metric into l_1 and (l_2)^2
  - Using K_{2,n}-metric
  - Tight bound for K_{2,n}-metric