Controlling the Chunk Size in Deduplication Systems
M. Hirsch, S. T. Klein, D. Shapira, Y. Toaff (Israel)
Background and motivation: compression, deduplication. Partition the data into chunks (4 K – 16 M); apply a hash function to each chunk; store the fingerprints (hash table / B-tree).
Background and motivation: algorithm for storing a repository. Hash table: signature size k bits, 2^k entries. [Figure: repository chunks mapped to hash table entries]
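The storing algorithm outlined on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `store_repository`, the use of SHA-1 as the fingerprint, fixed-size chunks, and taking the low k bits of the fingerprint as the table signature are all assumptions made for the example.

```python
import hashlib


def store_repository(data: bytes, chunk_size: int = 4096, k: int = 16):
    """Deduplicate `data` with fixed-size chunks and a table of 2**k entries.

    Assumptions (not from the slides): SHA-1 fingerprints, and the low
    k bits of the fingerprint used as the table signature.
    """
    table = [[] for _ in range(2 ** k)]  # one bucket per k-bit signature
    stored = []                          # unique chunks actually kept
    recipe = []                          # per input chunk: index into `stored`

    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha1(chunk).digest()
        signature = int.from_bytes(fingerprint, "big") % (2 ** k)

        # Look for an identical fingerprint in the bucket.
        for known_fp, index in table[signature]:
            if known_fp == fingerprint:
                recipe.append(index)     # duplicate: store a reference only
                break
        else:
            table[signature].append((fingerprint, len(stored)))
            recipe.append(len(stored))
            stored.append(chunk)         # new chunk: store the data itself

    return stored, recipe
```

The original data can be rebuilt with `b"".join(stored[i] for i in recipe)`; the saving comes from `stored` holding each distinct chunk only once.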
Background and motivation: chunk size dilemma. Small chunks: more overhead. Large chunks: less deduplication. Fixed-size chunks: easier. Variable-size chunks: more robust.
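Why fixed-size chunks are less robust can be seen in a small experiment: inserting a single byte at the front shifts every chunk boundary, so no chunk of the edited data matches a chunk of the original. The helper names, the 1 K chunk size, and the SHA-1 fingerprint below are assumptions chosen for the illustration.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int):
    """Split `data` into consecutive chunks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fingerprints(a: bytes, b: bytes, size: int) -> int:
    """Count distinct chunk fingerprints that occur in both inputs."""
    fa = {hashlib.sha1(c).digest() for c in fixed_chunks(a, size)}
    fb = {hashlib.sha1(c).digest() for c in fixed_chunks(b, size)}
    return len(fa & fb)


rng = random.Random(0)                       # fixed seed for repeatability
original = bytes(rng.randrange(256) for _ in range(64 * 1024))
edited = b"X" + original                     # one byte inserted at the front

same = shared_fingerprints(original, original, 1024)   # all 64 chunks match
shifted = shared_fingerprints(original, edited, 1024)  # boundaries all moved
print(same, shifted)
```

With content-defined (variable-size) boundaries, the cut points move together with the data, so most chunks after a small edit are found again.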
Variable length chunks: seed, hash function. Expected size of chunk: [formula not preserved]. BUT great variability. Min and max sizes: 1 K, 8 K.
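A common way to realize variable-length chunks is to slide a short window over the data, hash it, and cut wherever the hash satisfies a fixed condition, with minimum and maximum sizes enforced on top. The sketch below assumes a polynomial rolling hash and the condition `h % divisor == target`; the function name and all parameter names are hypothetical and not taken from the slides. With a condition that holds with probability 1/`divisor`, a cut is expected about every `divisor` bytes once the minimum size is passed, but individual chunk sizes vary widely.

```python
def cdc_chunks(data: bytes, window: int = 16, divisor: int = 2048,
               target: int = 0, min_size: int = 1024, max_size: int = 8192):
    """Content-defined chunking with a rolling hash and min/max sizes."""
    BASE = 257
    MOD = (1 << 61) - 1
    top = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window

    chunks = []
    start = 0                          # start of the current chunk
    h = 0                              # rolling hash of the last `window` bytes
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte

        length = i - start + 1
        natural_cut = length >= min_size and h % divisor == target
        forced_cut = length >= max_size              # artificial cutoff point
        if natural_cut or forced_cut:
            chunks.append(data[start:i + 1])
            start = i + 1

    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks
```

The forced cut at `max_size` does not depend on the content, which is the kind of artificial cutoff point that weakens robustness.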
Variable length chunks: problem of artificial cutoff points. Not robust; not reproducible; inconvenient distribution.
New segmentation procedure: use a sequence of functions and constants. 1) All functions are easily calculable. 2) There exists an increasing sequence of probabilities. 3) Conditions are inclusive, such that [condition not preserved].
New segmentation procedure: small inserts and deletes.
New segmentation procedure: to get [expression not preserved], set [expression not preserved].
New segmentation procedure: P is a large random prime; C is a large constant.
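One possible way to realize a sequence of inclusive cut conditions with increasing probabilities is sketched below. The actual functions and constants of the procedure are not preserved in these slides, so everything here is an assumption: the condition `h % prime < thresholds[j]`, the region length, the probability values, and the fixed (not random) prime 1000003. The thresholds grow with the current chunk length, so each condition contains the previous one, and a cut becomes more likely the longer the chunk gets. The constant C is not used in this sketch.

```python
def graded_chunks(data: bytes, window: int = 16, prime: int = 1_000_003,
                  min_size: int = 512, region: int = 512,
                  probs=(1 / 4096, 1 / 1024, 1 / 256, 1 / 64, 1 / 16, 1 / 4),
                  max_size: int = 8192):
    """Chunking with cut probabilities that increase with the chunk length.

    Hypothetical condition family: cut when h % prime < thresholds[j],
    where j is the index of the region the current length falls in.
    """
    thresholds = [max(1, int(p * prime)) for p in probs]
    assert thresholds == sorted(thresholds)   # nested (inclusive) conditions

    BASE, MOD = 257, (1 << 61) - 1
    top = pow(BASE, window - 1, MOD)          # weight of the outgoing byte

    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD

        length = i - start + 1
        if length < min_size:
            continue
        # Region index grows with the length; the last region is open-ended.
        j = min((length - min_size) // region, len(thresholds) - 1)
        if h % prime < thresholds[j] or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1

    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks
```

Because the later conditions hold with high probability, the hard limit `max_size` is kept only as a safety net and should rarely decide a boundary.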
Example distribution [Figure: cumulative probabilities and individual probabilities]
Experimental results

                    number  Avg size  Std dev    number  Avg size  Std dev
Constant             15.7     2127     2347       5.5      2502     2568
Variable probab.     15.8     2176     1014       5.9      2273     1081
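To compare the spread of chunk sizes between the two methods, the standard deviation can be divided by the average size. The short computation below uses the average and standard deviation values from the results above; the grouping of the six numbers per row into two (number, average, standard deviation) triples is an assumption about the original table layout.

```python
# (average size, standard deviation) for the two column groups of each row
results = {
    "constant": [(2127, 2347), (2502, 2568)],
    "variable": [(2176, 1014), (2273, 1081)],
}

# Relative spread: standard deviation divided by average chunk size
ratios = {
    name: [round(std / avg, 2) for avg, std in pairs]
    for name, pairs in results.items()
}

for name, values in ratios.items():
    print(name, values)
# constant [1.1, 1.03]
# variable [0.47, 0.48]
```

At a similar average chunk size, the standard deviation of the variable-probability method is less than half that of the constant method in both column groups.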