Controlling the Chunk Size in Deduplication Systems
M. Hirsch, S. T. Klein, D. Shapira, Y. Toaff
ISRAEL

Background and motivation
  • Compression
  • Deduplication
  • Partition into chunks
  • Apply hash function
  • 4 K – 16 M
  • Store fingerprints: hash / B-tree
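The steps listed on this slide (partition into chunks, apply a hash function, store the fingerprints) can be illustrated with a minimal Python sketch. It assumes fixed-size chunks of 4 K and SHA-1 fingerprints, and uses a plain dictionary in place of the hash table or B-tree; the names `deduplicate` and `restore` are hypothetical and not taken from the slides.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # fingerprint -> chunk bytes (stand-in for the hash table / B-tree)
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()  # assumed fingerprint function
        store.setdefault(fingerprint, chunk)           # store only the first copy
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored chunks."""
    return b"".join(store[fp] for fp in recipe)
```

With input made of three identical 4 K blocks followed by a different one, the store holds two chunks while the recipe lists four fingerprints.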

Background and motivation
  • Algorithm for storing a repository
  • Hash table: signature size k bits, 2^k entries
  • Repository chunks
[Figure: hash table and repository chunks]
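A table with 2^k entries addressed by a k-bit signature could look roughly like the sketch below. The class `SignatureTable`, the use of SHA-1 to derive the signature, and the policy of overwriting an entry on a signature collision are all assumptions made for illustration; the slide does not specify them.

```python
import hashlib

def signature(chunk: bytes, k: int) -> int:
    """Assumed signature: the top k bits of the chunk's SHA-1 digest."""
    digest = hashlib.sha1(chunk).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - k)

class SignatureTable:
    """Hypothetical direct-addressed table with 2^k entries."""

    def __init__(self, k: int = 16):
        self.k = k
        self.entries = [None] * (1 << k)   # one slot per possible k-bit signature
        self.repository = []               # stored chunks, in arrival order

    def add(self, chunk: bytes) -> int:
        """Return the repository index of chunk, storing it if it is new."""
        sig = signature(chunk, self.k)
        index = self.entries[sig]
        if index is not None and self.repository[index] == chunk:
            return index                   # duplicate: nothing new is stored
        self.repository.append(chunk)      # new chunk, or a signature collision
        self.entries[sig] = len(self.repository) - 1
        return self.entries[sig]
```

Comparing the chunk bytes before reporting a duplicate guards against two different chunks that share a signature.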

Background and motivation
Chunk size dilemma
  • Small chunks: more overhead
  • Large chunks: less deduplication
  • Fixed-size chunks: easier
  • Variable-size chunks: more robust

Variable length chunks
  • Seed
  • Hash function
  • Expected size of chunk: [formula not preserved]
  • BUT great variability
  • Max and min sizes (1 K, 8 K)
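Variable-length chunking with a hash function and imposed minimum and maximum sizes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a polynomial rolling hash over a 16-byte window, declares a cut when the low 11 bits of the hash are zero (so a cut occurs with probability about 2^-11 at each eligible position), and applies the 1 K and 8 K limits from the slide. The function name and all parameter values other than those limits are hypothetical.

```python
def chunk_boundaries(data: bytes, window: int = 16, mask_bits: int = 11,
                     min_size: int = 1024, max_size: int = 8192):
    """Return the end offsets of content-defined chunks."""
    prime = (1 << 61) - 1                 # assumed modulus (a Mersenne prime)
    base = 257                            # assumed base of the polynomial hash
    mask = (1 << mask_bits) - 1
    top = pow(base, window - 1, prime)    # weight of the byte leaving the window
    cuts, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % prime   # drop the oldest byte
        h = (h * base + byte) % prime                  # bring in the new byte
        size = i + 1 - start
        if size < min_size:
            continue                      # artificial lower limit: no cut yet
        if (h & mask) == 0 or size >= max_size:
            cuts.append(i + 1)            # hash condition met, or forced cut at max_size
            start = i + 1
    if start < len(data):
        cuts.append(len(data))            # final, possibly short, chunk
    return cuts
```

The forced cut at `max_size` depends on where the chunk started rather than on the content, which is the kind of artificial cutoff point the slides go on to criticize.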

Variable length chunks
Problem of artificial cutoff points:
  • Not robust
  • Not reproducible
  • Inconvenient distribution

New segmentation procedure
Use a sequence of functions and constants
  1) All functions are easily calculable
  2) There exists an increasing sequence of probabilities
  3) Conditions are inclusive
such that [formulas not preserved]
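One way to picture a sequence of cut conditions with increasing probabilities is the sketch below. It is only an assumption about how such a scheme might be realized, since the formulas on this slide were not preserved: the condition at each stage is "the low k bits of the hash are zero", and k shrinks as the chunk grows, so the cut probability 2^-k rises from stage to stage. The conditions are inclusive in the sense that a hash value satisfying a stricter condition also satisfies every looser one. The schedule values, the use of `zlib.crc32` as the hash, and all names are hypothetical.

```python
import zlib

# Hypothetical schedule: (chunk size reached, number of low hash bits that must be zero).
SCHEDULE = [(1024, 13), (2048, 12), (3072, 11), (4096, 10),
            (5120, 8), (6144, 6), (7168, 3), (8192, 0)]
WINDOW = 16  # assumed number of trailing bytes that are hashed

def required_zero_bits(size: int):
    """Bits that must be zero at this chunk size, or None if no cut is allowed yet."""
    bits = None
    for threshold, k in SCHEDULE:
        if size >= threshold:
            bits = k
    return bits

def is_cut(h: int, bits: int) -> bool:
    """True when the low `bits` bits of h are all zero (always true for bits == 0)."""
    return (h & ((1 << bits) - 1)) == 0

def segment(data: bytes):
    """Return chunk end offsets using conditions that loosen as the chunk grows."""
    cuts, start = [], 0
    for i in range(len(data)):
        bits = required_zero_bits(i + 1 - start)
        if bits is None:
            continue
        h = zlib.crc32(data[max(0, i + 1 - WINDOW):i + 1])  # stand-in hash of the window
        if is_cut(h, bits):
            cuts.append(i + 1)
            start = i + 1
    if start < len(data):
        cuts.append(len(data))
    return cuts
```

In this sketch the last stage requires zero bits, so a cut is certain once the chunk reaches 8192 bytes; the upper limit is then the end point of the probability sequence rather than a separate rule.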

New segmentation procedure
  • Small inserts and deletes

New segmentation procedure
  • To get [formula not preserved], set [formula not preserved]

New segmentation procedure
  • P: large random prime
  • C: large constant

Example distribution

[Figure: cumulative probabilities and individual probabilities]

Experimental results
                    Number   Avg size   Std dev   Number   Avg size   Std dev
Constant              15.7       2127      2347      5.5       2502      2568
Variable probab.      15.8       2176      1014      5.9       2273      1081

Thank you!

Using fractional bits
