Rsync: Efficiently Synchronizing Files Using Hashing
By David Shao
For CS 265, Spring 2004

Problem
- Want to synchronize with a newer version of a file on a remote server
- Want to minimize the data sent over a slow network link
- Want to minimize (round-trip) communication latencies

Solution: Rsync
- Open source software project
- http://samba.anu.edu.au/rsync/
- Command-line driven server and client for Unix-like systems
- Synchronizes directories as well as files
- Andrew Tridgell's Ph.D. thesis

Overview of How Hashing Is Used
- Can reduce the amount of data sent if willing to live with a very small probability of inaccuracy
- Several layers of hashing: a fast but less accurate hash and a slower but almost always accurate hash are both used

Ideal Case
- Divide files into equal-sized blocks
- Files are almost identical except for relatively few blocks
- The receiver already has almost all of the data blocks it needs, but how can it know which ones?
[Figure: Receiver and Sender]

Ideal Protocol
- Receiver to sender: hashes of blocks
- Sender to receiver: commands on how to build the file

Sender Analyzes Own Blocks
[Figure: the hash of a sender block is compared against the hashes of receiver blocks 1 through 4]

Commands: Copy or Add
- COPY: If the receiver already has the data block, just tell it to copy the block.
- ADD: If the receiver does not have a data block, send the block.
- COPY is cheap, ADD is expensive
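
The ideal protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`block_hashes`, `make_commands`, `apply_commands`) are hypothetical, and SHA-256 is used as the block hash purely for convenience. It is not rsync's actual wire format.

```python
import hashlib

def block_hashes(data: bytes, block_size: int) -> dict:
    """Receiver side: map each aligned block's hash to its block index."""
    table = {}
    for offset in range(0, len(data), block_size):
        digest = hashlib.sha256(data[offset:offset + block_size]).digest()
        table.setdefault(digest, offset // block_size)
    return table

def make_commands(new_data: bytes, receiver_table: dict, block_size: int) -> list:
    """Sender side: COPY for blocks the receiver has, ADD (with data) otherwise."""
    commands = []
    for offset in range(0, len(new_data), block_size):
        block = new_data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest in receiver_table:
            commands.append(("COPY", receiver_table[digest]))
        else:
            commands.append(("ADD", block))
    return commands

def apply_commands(old_data: bytes, commands: list, block_size: int) -> bytes:
    """Receiver side: rebuild the new file from old blocks plus ADD payloads."""
    out = bytearray()
    for op, arg in commands:
        if op == "COPY":
            out += old_data[arg * block_size:(arg + 1) * block_size]
        else:
            out += arg
    return bytes(out)

old = b"aaaabbbbccccdddd"   # receiver's file
new = b"aaaaXXXXccccdddd"   # sender's file, one block changed
cmds = make_commands(new, block_hashes(old, 4), 4)
rebuilt = apply_commands(old, cmds, 4)
```

Only the changed block travels as literal data; the other three blocks are rebuilt from what the receiver already holds.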

Advantage of Ideal
- If COPY, network traffic is reduced by a factor of approximately L / h, where L is the block size and h is the size of a hash of a block of size L
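
A rough worked example of the L / h saving, under assumed numbers: a 1024-byte block size and a 16-byte hash (the size of an MD4 digest). The helper `estimated_traffic` is hypothetical and ignores the small cost of the COPY commands themselves.

```python
def estimated_traffic(num_blocks: int, changed_blocks: int,
                      block_size: int, hash_size: int) -> int:
    """Approximate bytes on the wire for the ideal protocol."""
    hashes_sent = num_blocks * hash_size         # receiver -> sender
    literal_data = changed_blocks * block_size   # ADD payloads, sender -> receiver
    return hashes_sent + literal_data

per_block_factor = 1024 / 16                 # L / h for a single COPY
full = 1000 * 1024                           # sending all 1000 blocks outright
delta = estimated_traffic(1000, 10, 1024, 16)
```

With 10 of 1000 blocks changed, the estimate is 26,240 bytes instead of 1,024,000, roughly 39 times less.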

Disadvantage of Ideal
- Example: Edit source code, delete a comment at the beginning
- Blocks are no longer neatly aligned
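
The alignment problem is easy to reproduce. In this sketch (hypothetical helper name, SHA-256 as a stand-in block hash, an arbitrary 4-byte block size), deleting a leading comment shifts every later byte, so no aligned block of the new file matches any aligned block of the old one, even though the new file's content is entirely present in the old file.

```python
import hashlib

def fixed_block_digests(data: bytes, block_size: int) -> set:
    """Hashes of blocks taken only at multiples of block_size."""
    return {hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)}

old = b"# comment\nint main() { return 0; }\n"
new = old[len(b"# comment\n"):]          # delete the comment at the beginning

shared = fixed_block_digests(old, 4) & fixed_block_digests(new, 4)
shared_count = len(shared)               # aligned blocks in common
still_present = new in old               # the data itself is all still there
```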

Compute More Hashes
- Sender needs to compute a hash at every byte position
- More expensive: L times more hashes computed by the sender
- Use a weaker, faster hash to weed out most non-matching candidates
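
Ordinary Sum of Bytes
- Rolling-type property: the sum of L bytes starting at position i+1 is almost the same as the sum starting at i.
- To get the sum starting at i+1: subtract the byte leaving the window (red), add the byte entering it (green); the shared bytes (yellow) stay the same.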
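
A minimal sketch of the rolling property for a plain byte sum. The function name `rolling_sums` is hypothetical; the point is that each step costs one subtraction and one addition regardless of L.

```python
def rolling_sums(data: bytes, L: int) -> list:
    """Sum of every L-byte window, each derived from the previous one."""
    if len(data) < L:
        return []
    s = sum(data[:L])                  # window starting at position 0
    sums = [s]
    for i in range(len(data) - L):
        s = s - data[i] + data[i + L]  # drop the leaving byte, add the entering byte
        sums.append(s)
    return sums

data = b"rolling checksum"
rolled = rolling_sums(data, 4)
direct = [sum(data[i:i + 4]) for i in range(len(data) - 4 + 1)]
```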

Disadvantage of a Simple Sum
- A simple sum is too symmetric
- The sum of "All men are mortals" is the same as the sum of "All mortals are men"
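
The symmetry can be checked directly: the two sentences contain the same bytes in a different order, so any order-insensitive sum cannot tell them apart.

```python
a = b"All men are mortals"
b = b"All mortals are men"

same_sum = sum(a) == sum(b)   # byte order does not affect a plain sum
same_text = a == b
```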

Weighted Sum
- First bytes have more weight than the tail ones (an arbitrary decision)
[Figure: byte positions 0 through 6]

Reordering the i + 1 Sum
- The red part is subtracted and the green part is added. The yellow part is the same.
[Figure: byte positions 0 through 6]
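
One way to make the weighted sum roll, sketched under an assumption the slides do not spell out: weights L, L-1, ..., 1 from the first byte of the window to the last. With that choice, the weighted sum at i+1 equals the weighted sum at i, minus L times the leaving byte, plus the plain sum of the new window. The function names are hypothetical.

```python
def weighted_sum(data: bytes, i: int, L: int) -> int:
    """Direct computation: byte j of the window gets weight L - j."""
    return sum((L - j) * data[i + j] for j in range(L))

def rolling_weighted_sums(data: bytes, L: int) -> list:
    """Weighted sum of every L-byte window, updated in constant time per step."""
    if len(data) < L:
        return []
    a = sum(data[:L])               # plain sum of the current window
    b = weighted_sum(data, 0, L)    # weighted sum of the current window
    out = [b]
    for i in range(len(data) - L):
        a = a - data[i] + data[i + L]   # plain sum of the window at i+1
        b = b - L * data[i] + a         # weighted sum of the window at i+1
        out.append(b)
    return out

data = b"All men are mortals"
rolled = rolling_weighted_sums(data, 5)
direct = [weighted_sum(data, i, 5) for i in range(len(data) - 5 + 1)]

# Unlike the plain sum, the weighted sum depends on byte order.
s1 = weighted_sum(b"All men are mortals", 0, 19)
s2 = weighted_sum(b"All mortals are men", 0, 19)
```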

Further Enhancements
- Compute a separate (MD4) signature for the entire file
- Reconstruct the new file using temporary storage so that the old version is never removed until the new one is known to be good
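
A sketch of both enhancements together, not rsync's actual code: the rebuilt data is written to a temporary file, and the old file is replaced only if a whole-file digest matches the one supplied by the sender. The function name `replace_if_verified` is hypothetical, and SHA-256 stands in for MD4, which may be unavailable through `hashlib` in some Python builds.

```python
import hashlib
import os
import tempfile

def replace_if_verified(target_path: str, new_bytes: bytes,
                        expected_digest: str) -> bool:
    """Write new_bytes to a temp file; swap it in only if the digest matches."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same directory as the target
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(new_bytes)
    if hashlib.sha256(new_bytes).hexdigest() != expected_digest:
        os.remove(tmp_path)             # bad rebuild: the old file is untouched
        return False
    os.replace(tmp_path, target_path)   # the old version disappears only now
    return True
```

Creating the temporary file in the target's own directory keeps the final `os.replace` on a single filesystem.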

Synchronizing Directories
- Divide the receiving side into a separate receiver and generator
[Figure: Receiver, Sender, and Generator]

Summary of Hashing Used
- A weaker, easier-to-compute hash with the rolling property
- A stronger hash (MD4) once most candidates have been weeded out
- A signature over the entire file as a separate check
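
The first two layers can be combined as in the sketch below. It is a simplification under stated assumptions: the weak hash is the plain byte sum (not the weighted variant), SHA-256 stands in for MD4, the name `find_matches` is hypothetical, and every offset is checked without skipping ahead after a match.

```python
import hashlib

def find_matches(old: bytes, new: bytes, L: int) -> list:
    """Return (offset in new, block index in old) for every matching block."""
    # Receiver's table: weak hash -> list of (strong hash, block index).
    table = {}
    for idx in range(len(old) // L):
        block = old[idx * L:(idx + 1) * L]
        table.setdefault(sum(block), []).append(
            (hashlib.sha256(block).digest(), idx))

    matches = []
    if len(new) < L:
        return matches
    weak = sum(new[:L])
    i = 0
    while True:
        candidates = table.get(weak)
        if candidates:                      # weak hit: now pay for the strong hash
            strong = hashlib.sha256(new[i:i + L]).digest()
            for cand_strong, idx in candidates:
                if cand_strong == strong:
                    matches.append((i, idx))
                    break
        if i + L >= len(new):
            break
        weak = weak - new[i] + new[i + L]   # roll the weak hash one byte forward
        i += 1
    return matches

old = b"aaaabbbbccccdddd"
new = b"X" + old                            # one byte inserted at the front
matches = find_matches(old, new, 4)
```

Here the strong hash is computed only at the four offsets where the weak hash hits, and all four old blocks are found despite the one-byte shift.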