Dictionary search
Exact string search
Paper on Cuckoo Hashing

Exact String Search
Given a dictionary D of K strings, of total length N, store them so that we can efficiently support searches for a pattern P over them.
Approach: hashing.

Hashing with chaining

Key issue: a good hash function
Basic assumption: uniform hashing.
Average number of keys per slot $= n \cdot (1/m) = n/m = \alpha$ (the load factor).
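To make the load factor concrete, here is a minimal sketch of hashing with chaining. The class name `ChainedHashTable` and its methods are illustrative, and Python's built-in `hash` merely stands in for a "good" hash function; none of this comes from the slides.

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list of keys."""

    def __init__(self, m):
        self.m = m                          # number of slots
        self.slots = [[] for _ in range(m)]
        self.n = 0                          # number of stored keys

    def _slot(self, key):
        # Assumption: the built-in hash behaves like a uniform hash function.
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._slot(key)]
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Only one chain is scanned, so the cost depends on its length.
        return key in self.slots[self._slot(key)]

    def load_factor(self):
        # Average number of keys per slot: n / m.
        return self.n / self.m


table = ChainedHashTable(m=8)
for word in ["systile", "syzygetic", "syzygial", "syzygy"]:
    table.insert(word)
```

With 4 keys in 8 slots, `table.load_factor()` is 0.5, which is the average chain length a search has to scan.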

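Search cost: expected $O(1 + \alpha)$ under uniform hashing, which is $O(1)$ when the table size $m$ is proportional to $n$.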

In practice
A trivial hash function is $h(k) = k \bmod m$, with $m$ prime.

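A “provably good” hash
Let $L$ = max string length (in bits) and $m$ = table size (prime). Split the key $k$ into $r + 1$ chunks $k_0, k_1, k_2, \ldots, k_r$ of $\approx \log_2 m$ bits each, so $r \approx L / \log_2 m$. Pick $a = (a_0, a_1, a_2, \ldots, a_r)$, with each $a_i$ selected at random in $[0, m)$, and set
$$h_a(k) = \left( \sum_{i=0}^{r} a_i k_i \right) \bmod m.$$
$m$ need not be prime: use $(\ldots \bmod p) \bmod m$ with $p$ prime.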
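The following is a hedged sketch of the "(... mod p) mod m" variant for strings. It assumes the chunks are the UTF-8 bytes of the key (so each chunk is smaller than the prime p), that p is the Mersenne prime 2^31 - 1, and that the coefficients are drawn from [0, p); the name `make_hash` and its parameters are hypothetical.

```python
import random


def make_hash(m, max_len, p=2_147_483_647, seed=None):
    """Return h(s) = ((sum_i a_i * k_i) mod p) mod m.

    Assumptions of this sketch: the chunks k_i are the bytes of s,
    p = 2**31 - 1 is prime, and m may be any table size up to p.
    """
    rng = random.Random(seed)
    # One random coefficient per chunk position, drawn once per function.
    a = [rng.randrange(p) for _ in range(max_len)]

    def h(s):
        chunks = s.encode("utf-8")
        if len(chunks) > max_len:
            raise ValueError("key longer than max_len chunks")
        return sum(ai * ki for ai, ki in zip(a, chunks)) % p % m

    return h


h = make_hash(m=101, max_len=32, seed=42)
slot = h("syzygetic")
```

Drawing a fresh coefficient vector (a new `seed`) gives a different function from the same family, which is what rehashing relies on.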

Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored.
[Figure: keys A, B, C, D, E stored across the two tables]
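A minimal sketch of the idea, under stated assumptions: the two hash functions are simulated by salting Python's built-in `hash`, an insertion gives up after `max_kicks` evictions, and a failed insertion triggers a rebuild with new salts and larger tables. The class name, parameters, and rebuild policy are illustrative choices, not taken from the paper or the slides.

```python
import random


class CuckooHashTable:
    """Two tables; every key has exactly one candidate cell in each."""

    def __init__(self, size=11, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_salts()

    def _new_salts(self):
        # Assumption: salting the built-in hash gives two independent choices.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _pos(self, which, key):
        return hash((self.salts[which], key)) % self.size

    def contains(self, key):
        # A lookup probes at most two cells.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return
            which = 1 - which  # the evicted key tries its other table
        self._rehash(key)

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.size = 2 * self.size + 1
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_salts()
        for k in keys:
            self.insert(k)


table = CuckooHashTable()
for key in "ABCDEFG":
    table.insert(key)
```

The eviction loop is the "cuckoo" step: each displaced key moves to its alternative cell, possibly displacing another key in turn.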

A running example
[Figures: four successive snapshots of the two tables; keys listed in the order they appear on each slide]
1. A B C F E D
2. A B E D C F
3. A B G E D C F
4. E G B A D C F

Cuckoo Hashing Examples
Random (bipartite) graph: node = cell, edge = key.
[Figure: graph over the keys A, B, C, D, E, F, G]

Natural Extensions
• More than 2 hashes (choices) per key.
  • Very different: hypergraphs instead of graphs.
  • Higher memory utilization: 3 choices give 90+% in experiments, 4 choices about 97%,
  • but more insert time (and random access).
• 2 hashes + bins of size B.
  • Balanced allocation and tightly O(1)-size bins.
  • Insertion sees a tree of possible evict+insert paths.
  • More memory... but more local.

Dictionary search
Making one-sided errors
Paper on Bloom Filter

Crawling
How to keep track of the URLs visited by a crawler?
• URLs are long
• Check should be very fast
• Small errors are tolerable (≈ page not crawled)
Bloom Filter over crawled URLs
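As an illustration of a Bloom filter over crawled URLs, here is a small sketch. It assumes the k bit positions are derived from one SHA-256 digest by double hashing (position i is h1 + i·h2 mod m), which is a common substitute for k independent hash functions and not something the slides prescribe; the class and parameter names are illustrative.

```python
import hashlib


class BloomFilter:
    """m-bit array with k probes per item: no false negatives, some false positives."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Assumption: double hashing over a SHA-256 digest simulates k hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


visited = BloomFilter(m=8 * 1000, k=6)  # about 8 bits per URL for 1000 URLs
visited.add("http://checkmate.com/All_Natural/Applied.html")
```

A crawler would test `url in visited` before fetching: a false positive only means a page is skipped, which matches the "small errors are tolerable" requirement.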

Searching with errors...

Problem: false positives

Not perfectly true but...

For $m/n = 8$, the optimal $k$ is $8 \ln 2 = 5.545\ldots$
We do have an explicit formula for the optimal $k$: $k = (m/n) \ln 2$.
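The numbers can be checked with a few lines, using the standard approximation of the false-positive rate, $(1 - e^{-kn/m})^k$, which is minimized at $k = (m/n) \ln 2$. The function names below are illustrative.

```python
import math


def false_positive_rate(m, n, k):
    """Standard approximation for m bits, n keys, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Real-valued k that minimizes the approximation above."""
    return (m / n) * math.log(2)


k_opt = optimal_k(m=8, n=1)  # 8 bits per key
rates = {k: false_positive_rate(8, 1, k) for k in (4, 5, 6, 7)}
```

Since k must be an integer, one compares the neighbouring values: with 8 bits per key, k = 6 gives a false-positive rate of roughly 2%, slightly better than k = 5.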

Dictionary search
Prefix-string search
Reading 3.1 and 5.2

Prefix-string Search
Given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P over them.

Trie: speeding-up searches
[Figure: compacted trie with edge labels s, y, z, stile, aibelyite, zyg, etic, ial, omo, czecin, ygy; the numbers in the figure label its nodes and leaves]
Pro: O(p) search time
Cons: space for edge + node labels and the tree structure
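A minimal sketch of prefix search on a trie. For brevity this is an uncompacted trie built from nested dictionaries (one node per character) rather than the compacted trie of the figure; the class name and the empty-string end marker are illustrative choices.

```python
class Trie:
    """Character trie stored as nested dicts; "" marks the end of a string."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-string marker

    def with_prefix(self, prefix):
        # Walk down the prefix: O(p) steps for a pattern of length p.
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # Collect every string stored in the subtree below the prefix.
        results, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            for ch, child in cur.items():
                if ch == "":
                    results.append(text)
                else:
                    stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for w in ["systile", "syzygetic", "syzygial", "syzygy",
          "szaibelyite", "szczecin", "szomo"]:
    trie.insert(w)
matches = trie.with_prefix("syzyg")
```

Locating the subtree takes time proportional to the pattern length; reporting the matches adds time proportional to the size of that subtree.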

Front-coding: squeezing strings
Sorted strings share long prefixes (…. systile syzygetic syzygial syzygy ….), so each string is stored as the length of the prefix it shares with the previous string, followed by the remaining suffix.

Sorted URLs:
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html

Front-coded (shared-prefix length, suffix), as listed on the slide:
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html...

Compression figure on the slide: 45%. Gzip may be much better...
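Front-coding can be sketched in a few lines. This version assumes the input is already sorted and encodes each string against its immediate predecessor, with the first string stored in full (shared length 0); the function names are illustrative.

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length with the previous one, suffix)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        limit = min(len(prev), len(s))
        while lcp < limit and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by reusing lcp characters of the previous one."""
    decoded, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded


words = ["systile", "syzygetic", "syzygial", "syzygy"]
codes = front_encode(words)
# codes == [(0, "systile"), (2, "zygetic"), (5, "ial"), (5, "y")]
```

Decoding is sequential: a string can only be rebuilt after its predecessor, which is why front-coding is usually applied to small buckets that each restart with a full string.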

2-level indexing
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
A disadvantage:
• Trade-off ≈ speed vs space (because of bucket size)
[Figure: Internal Memory holds a CT (compacted trie) on a sample of the strings, here systile and szaibelyite; Disk holds the front-coded buckets: …. 70 systile 92 zygetic 85 ial 65 y 110 szaibelyite 82 czecin 92 omo ….]
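The two-level layout can be sketched as follows, with simplifying assumptions: a sorted list searched with `bisect` stands in for the compacted trie on the sample, and each bucket is kept as a plain Python list where a real index would store a front-coded disk block. The function names and the bucket size are illustrative.

```python
from bisect import bisect_right


def build_index(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample each bucket's first string."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [bucket[0] for bucket in buckets]  # kept in internal memory
    return sample, buckets


def prefix_search(sample, buckets, prefix):
    """Return all strings that start with prefix, scanning as few buckets as possible."""
    # Last bucket whose first string is <= prefix (or bucket 0 if none).
    i = max(bisect_right(sample, prefix) - 1, 0)
    matches = []
    for bucket in buckets[i:]:        # each bucket read models one I/O
        for s in bucket:
            if s.startswith(prefix):
                matches.append(s)
            elif s > prefix:          # past the range of possible matches
                return matches
    return matches


words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_index(words, bucket_size=4)
hits = prefix_search(sample, buckets, "syzyg")
```

Larger buckets shrink the in-memory sample and compress better, but make each bucket scan longer, which is the speed versus space trade-off noted above.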