Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica

  • Slides: 11
Download presentation
Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa Paolo Ferragina, Università

Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa Paolo Ferragina, Università di Pisa

A basic problem Given a dictionary D of strings, having variable length, design a

A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain g 5) Prefix. Suffix(a, b) = Prefix(a) Suffix(b) (Compacted) Trie L Two versions: for D and for DR + Intersect answers L Need to store D for resolving edge-labels Paolo Ferragina, Università di Pisa

A basic problem Given a dictionary D of strings, having variable length, compress them

A basic problem Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain by g 5) Prefix. Suffix(a, b) = Prefix(a) Suffix(b) Permuterm Index (Garfield, 76) Reduce any query to a “prefix query” over a larger dictionary Paolo Ferragina, Università di Pisa

Permuterm Index [Garfield, 1976] Take a dictionary D={yahoo, google} 1. Append a special char

Permuterm Index [Garfield, 1976] Take a dictionary D={yahoo, google} 1. Append a special char $ to the end of each string 2. Generate all rotations of these strings Permuterm Dictionary yahoo$y hoo$yah o$yahoo google$g ogle$goo le$googl $google Paolo Ferragina, Università di Pisa Prefix(ya) = Prefix($ya) Suffix(oo) = Prefix(oo$) Substring(oo) = Prefix(oo) Prefix. Suffix(y, o)= Prefix(o$y) Space problems Any query on D reduces to a prefix-query on P[D]

The FM-index [Ferragina-Manzini, JACM ‘ 05] The result: ü Count(P): O(p) time ü Locate(P):

The FM-index [Ferragina-Manzini, JACM ‘ 05] The result: ü Count(P): O(p) time ü Locate(P): O(occ * polylog(|T|)) time ü Display( T[i, i+L] ): O( L + polylog(|T|) ) time ü Space occupancy: |T| Hk(T) + o(|T| log |S|) bits New concept: The FM-index is an opportunistic data structure The main idea is to reduce substring search to some basic operations over arrays of symbols Paolo Ferragina, Università di Pisa Compressed Permuterm index builds upon the best two features of the FM-index

Third ingredient: FM-index substring search P = si unknown occ=2 [lr-fr+1] fr lr Paolo

Third ingredient: FM-index substring search P = si unknown occ=2 [lr-fr+1] fr lr Paolo Ferragina, Università di Pisa #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#miss ssissippi#m L i p s s m # p i s s i i Count(P[1, p]): Finds <fr, lr> in O(p) time

Compressed Permuterm Index Lexicographically sorted Z = $hat$hip$hot$# Build FM-index to support substring searches

Compressed Permuterm Index Lexicographically sorted Z = $hat$hip$hot$# Build FM-index to support substring searches Some queries are trivial. . . Prefix(a) = Substring search($a) within Z Suffix(b) = Substring search(b$) within Z Substr(g) = Substring search(g) within Z Paolo Ferragina, Università di Pisa

Prefix. Suffix search unknown i=2 Key property: Last char of si is at L[i+1]

Prefix. Suffix search unknown i=2 Key property: Last char of si is at L[i+1] Cyclic-LF[i] If (i > #D) return LF[i] else return LF[i+1] CLF[2] Paolo Ferragina, Università di Pisa

Prefix. Suffix(ho, p) unknown $ho LF CLF Prefix. Suffix(P): Search FM-index of Z using

Prefix. Suffix(ho, p) unknown $ho LF CLF Prefix. Suffix(P): Search FM-index of Z using Cyclic-LF instead of LF Paolo Ferragina, Università di Pisa No change in time/space bounds of compressed indexes

Rank and Select of strings unknown Z = $hat$hip$hot$# Other queries. . . Rank(s)

Rank and Select of strings unknown Z = $hat$hip$hot$# Other queries. . . Rank(s) = row of $s$ Select(i)= backw from L[i+1] Paolo Ferragina, Università di Pisa

A test on URLs Choose your trade-off Tr ad e- of f % dict-size

A test on URLs Choose your trade-off Tr ad e- of f % dict-size msec/char, and space close to bzip • Time close to Front-Coding (4 msec/char), but <50% of its space • Time of 20 60 Paolo Ferragina, Università di Pisa