Mobile File System <AFS, Coda, Bayou> Byung Chul Tak

AFS § Andrew File System • Distributed computing environment developed at CMU • Provides transparent access to remote shared files • The most important design goal: scalability • Allows existing UNIX programs to access AFS files without modification or recompilation

AFS § Two design characteristics • Whole-file serving ◦ The entire contents of directories and files are transmitted to client computers by AFS servers • Whole-file caching ◦ A copy of a file is stored in a cache on the local disk ◦ The file cache is permanent (it survives client reboots)

AFS § Usage scenario • A client issues an open system call for a file ◦ If there is no copy in the local cache ∙ the server is located ∙ a request for a copy of the file is made • The copy is stored in the local UNIX file system and opened • Subsequent reads and writes are applied to the local copy • When the client issues a close system call ◦ if the local copy has been updated, its contents are sent back to the server
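
The open/close flow above can be sketched as a toy in-memory model. The names `ToyServer` and `ToyVenus` are hypothetical stand-ins for Vice and Venus; real AFS intercepts system calls in the kernel and keeps the cache on local disk, neither of which is modeled here.

```python
class ToyServer:
    """Stands in for a Vice server: holds the authoritative file contents."""

    def __init__(self):
        self.files = {}     # path -> bytes
        self.fetches = 0    # number of whole-file transfers served

    def fetch(self, path):
        self.fetches += 1
        return self.files[path]

    def store(self, path, data):
        self.files[path] = data


class ToyVenus:
    """Stands in for Venus: whole-file cache with write-back on close."""

    def __init__(self, server):
        self.server = server
        self.cache = {}     # path -> bytes (the local copy)
        self.dirty = set()  # paths modified since they were fetched

    def open(self, path):
        if path not in self.cache:
            # No local copy: fetch the whole file from the server.
            self.cache[path] = self.server.fetch(path)
        return path

    def read(self, path):
        return self.cache[path]          # served from the local copy

    def write(self, path, data):
        self.cache[path] = data          # applied to the local copy only
        self.dirty.add(path)

    def close(self, path):
        if path in self.dirty:
            # The local copy was updated: send its contents back.
            self.server.store(path, self.cache[path])
            self.dirty.discard(path)
```

In this sketch a second `open` of the same path never contacts the server, which is the point of whole-file caching; keeping that cached copy fresh is the job of the callback mechanism described later.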

AFS § Assumptions • Most files are small • Reads are much more common than writes • Sequential access is common, and random access is rare • Most files are read and written by only one user ◦ When a file is shared, it is usually only one user who modifies it • Files are referenced in bursts, and there is high temporal locality

AFS § Distribution of processes in AFS [Figure: each workstation runs user programs and a Venus process on top of the UNIX kernel; each server runs a Vice process on top of the UNIX kernel; workstations and servers communicate over the network]

AFS § Two software components • Venus ◦ A user-level process that runs in each client computer • Vice ◦ The server software that runs as a user-level UNIX process in each server computer

AFS § System call interception in AFS • BSD UNIX is modified to intercept file system calls • Venus manages the cache ◦ A partition on the local disk is used as a cache [Figure: on a workstation, UNIX file system calls from a user program enter the UNIX kernel; non-local file operations are passed to Venus, while local ones go to the UNIX file system and the local disk]

AFS § File identifier • Files and directories in the shared file space are identified by a 96-bit fid ◦ Venus translates file pathnames into fids ◦ Layout: volume number (32 bits), file handle (32 bits), uniquifier (32 bits) ◦ volume number ∙ In AFS, files are grouped into volumes ◦ file handle ∙ identifies the file within the volume ◦ uniquifier ∙ ensures that file identifiers are not reused
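
As an illustration of the 96-bit layout, the three 32-bit fields can be packed into 12 bytes. This is only a sketch: the function names are hypothetical, and the big-endian byte order is an assumption, not something the slides specify.

```python
import struct

# Three unsigned 32-bit integers; ">" (big-endian) is an assumed byte order.
FID_FORMAT = ">III"


def pack_fid(volume, handle, uniquifier):
    """Pack volume number, file handle and uniquifier into a 12-byte fid."""
    return struct.pack(FID_FORMAT, volume, handle, uniquifier)


def unpack_fid(fid):
    """Return the (volume, handle, uniquifier) tuple stored in a fid."""
    return struct.unpack(FID_FORMAT, fid)
```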

AFS § Cache consistency • Based on the callback promise § Callback promise ◦ ensures that cached copies of files are updated when another client closes the same file after updating it • Vice supplies a copy of a file to Venus, together with a callback promise ◦ a token issued by Vice with two states: valid, cancelled • When Venus receives a callback, it sets the callback promise token to cancelled • Venus checks the callback promise when a user issues an open ◦ if it is cancelled, a fresh copy must be fetched
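
A minimal sketch of the callback-promise idea, with hypothetical `Vice` and `Venus` classes that call each other directly instead of using RPC. Failure handling and promise expiry are deliberately left out.

```python
class Vice:
    """Toy server: remembers which clients hold a callback promise per file."""

    def __init__(self):
        self.files = {}
        self.callbacks = {}   # path -> set of Venus clients holding a promise

    def fetch(self, path, client):
        # Supplying a copy also registers a callback promise for the client.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data, writer):
        self.files[path] = data
        # Call back every other client that holds a promise on this file.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)
        self.callbacks[path] = {writer}


class Venus:
    """Toy client cache manager."""

    def __init__(self, vice):
        self.vice = vice
        self.cache = {}     # path -> data
        self.promise = {}   # path -> "valid" or "cancelled"

    def break_callback(self, path):
        self.promise[path] = "cancelled"

    def open(self, path):
        # Fetch a fresh copy unless a valid promise covers the cached one.
        if self.promise.get(path) != "valid":
            self.cache[path] = self.vice.fetch(path, self)
            self.promise[path] = "valid"
        return self.cache[path]

    def close(self, path, new_data=None):
        if new_data is not None:
            self.cache[path] = new_data
            self.vice.store(path, new_data, self)
```

With two clients that have both opened the same file, a `close` with new data on one of them marks the other's promise as cancelled, so the other's next `open` fetches the updated contents.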

CODA § Evolution from AFS § Mechanisms for high availability • Disconnected operation ◦ a mode of operation in which a client continues to use data during a network failure ◦ while disconnected, the client relies on its local cache ◦ a cache miss is reported as a failure • Server replication ◦ allows volumes to have read-write replicas at more than one server

CODA § Venus states • Hoarding state ◦ to hoard useful data in anticipation of disconnection • Emulation state ◦ enter upon disconnection ◦ Venus assumes full responsibility of file operations • Reintegration state ◦ Venus propagates changes made during emulation to the server ◦ validate all cached objects before use

CODA § Design philosophies for extending CODA • Don’t punish strongly connected clients ◦ it is unacceptable to degrade the performance of strongly connected clients on account of weakly connected ones • Don’t make life worse than when disconnected ◦ users will not tolerate substantial performance degradation • Do it in the background if you can ◦ e.g., turn foreground network delay into background activity • When in doubt, seek user advice ◦ As connectivity weakens, the price of misjudgment increases

CODA § CODA extensions • Transport protocol refinements ◦ code separation of the RPC2 and SFTP protocols • Rapid cache validation ◦ raising the granularity of cache validation • Trickle reintegration ◦ propagating updates to servers asynchronously • User-assisted miss handling ◦ asking for user input before fetching a large file

CODA § Rapid cache validation • Under the previous implementation ◦ Reintegration was slow ∙ every cached object had to be validated after reconnection • Solution adopted ◦ Tracking server state at multiple levels of granularity ◦ A version stamp for each volume ∙ if the volume's version stamp is still valid, all cached objects in that volume are valid ∙ if the version stamp is invalid, each cached object is validated as usual
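
The two-level check can be sketched as follows. The dictionary layout and the function name `validate_cache` are assumptions made for illustration; the sketch only shows how one volume-stamp comparison can replace many per-object checks.

```python
def validate_cache(cached_volumes, server_volumes):
    """Return (names of valid cached objects, number of per-object checks).

    Both arguments map volume id -> {"stamp": int, "objects": {name: version}}.
    """
    valid = set()
    object_checks = 0
    for vol_id, cached in cached_volumes.items():
        server = server_volumes[vol_id]
        if cached["stamp"] == server["stamp"]:
            # One comparison validates every cached object in the volume.
            valid.update(cached["objects"])
            continue
        # Volume stamp is stale: fall back to validating object by object.
        for name, version in cached["objects"].items():
            object_checks += 1
            if server["objects"].get(name) == version:
                valid.add(name)
    return valid, object_checks
```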

CODA § Trickle Reintegration • State modification ◦ Write disconnected state ∙ Updates are logged and propagated via trickle reintegration • Reintegration runs in the background • A user can force a full reintegration [Figure: Venus state transitions among Hoarding, Emulating, and Write Disconnected, with edges labeled disconnection, strong connection, and weak connection]
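
The state diagram can be written down as a small transition table. Only the state names and edge labels are visible on this slide, so which edge connects which pair of states is an assumption here, as is the plain "connection" event out of the emulating state.

```python
# (current state, event) -> next state. The edge placement is assumed.
TRANSITIONS = {
    ("hoarding", "disconnection"): "emulating",
    ("hoarding", "weak connection"): "write disconnected",
    ("emulating", "connection"): "write disconnected",      # assumed edge
    ("write disconnected", "strong connection"): "hoarding",
    ("write disconnected", "disconnection"): "emulating",
}


def next_state(state, event):
    """Return the next Venus state; events with no listed edge are ignored."""
    return TRANSITIONS.get((state, event), state)
```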

CODA • Log optimization ◦ key to reducing the volume of reintegration data ◦ basic concept ∙ In emulation state, Venus logs updates ∙ When a log record is appended to the CML (Client Modify Log), Venus checks whether it cancels or overrides earlier records ◦ Trickle reintegration reduces the opportunity for optimization ∙ Records should spend enough time in the CML for optimizations to be effective
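
A toy version of the cancel/override check might look like this. The record format `(op, path)` and the two rules shown (a newer store overrides older stores of the same file; removing a file that was created within the log cancels both records) are illustrative assumptions, not the full rule set Venus uses.

```python
def append_record(cml, op, path):
    """Append (op, path) to the client modify log, optimizing as we go."""
    if op == "store":
        # A newer store of the same file overrides earlier stores.
        cml[:] = [r for r in cml if r != ("store", path)]
    elif op == "remove":
        creates = [i for i, r in enumerate(cml) if r == ("create", path)]
        if creates:
            # The file was created in this log: the create, its stores and
            # this remove cancel out, so nothing needs to be reintegrated.
            start = creates[-1]
            cml[start:] = [r for r in cml[start:] if r[1] != path]
            return cml
        # The file exists on the server: earlier stores become pointless.
        cml[:] = [r for r in cml if r != ("store", path)]
    cml.append((op, path))
    return cml
```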

CODA • Log optimization ◦ Aging ∙ A record is not eligible for reintegration until it has spent a minimum amount of time in the CML ▫ this minimum is the aging window, A [Figure: CML during reintegration, from log head to log tail; a reintegration barrier separates the records older than A from the newer ones]
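
Aging can be sketched as a scan from the log head that stops at the first record younger than the aging window. The function name and the `(timestamp, record)` representation are hypothetical.

```python
def eligible_for_reintegration(cml, now, aging_window):
    """Return the records in front of the reintegration barrier.

    cml is a list of (timestamp, record) pairs, oldest (log head) first.
    """
    eligible = []
    for timestamp, record in cml:
        if now - timestamp < aging_window:
            # Barrier: this record and all later ones stay in the log,
            # where they can still be cancelled or overridden.
            break
        eligible.append(record)
    return eligible
```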

CODA § Seeking User Advice • Transparency is not always acceptable ◦ Under low bandwidth, a file fetch can take a long time, which can annoy the user ◦ In some cases, a user is willing to tolerate a long delay when the file is important • Patience threshold ◦ The maximum time a user is willing to wait for a particular file, or the equivalent file size ◦ a function of hoard priority P and bandwidth ∙ hoard priority: the importance of a file as perceived and specified by the user

CODA § Seeking User Advice (cont’d) • Patience threshold model ◦ τ: threshold ◦ β, δ: scaling parameters ◦ α: lower bound ◦ P: hoard priority • Handling misses ◦ In the status walk, Venus obtains status for missing objects and decides whether to fetch them ◦ In the data walk, Venus fetches the contents from the server ∙ If the file size is above the patience threshold, a screen is shown to the user to collect the user's decision
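
The formula itself did not survive on this slide, only its symbols. The sketch below assumes an exponential form, τ = α + β·e^(δP), which fits the listed symbols; both that form and the default parameter values are assumptions chosen for illustration.

```python
import math


def patience_threshold(priority, alpha=2.0, beta=1.0, delta=0.01):
    """Assumed model: tau = alpha + beta * exp(delta * P), in seconds."""
    return alpha + beta * math.exp(delta * priority)


def should_fetch_silently(size_bytes, bandwidth_bps, priority):
    """Fetch without asking only if the transfer fits within the threshold."""
    fetch_seconds = size_bytes * 8 / bandwidth_bps
    return fetch_seconds <= patience_threshold(priority)
```

Under these assumed parameters, a 1 KB file over a 9600 bit/s link is fetched silently at any priority, while a 1 MB file would trigger the user prompt unless its hoard priority is very high.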

BAYOU § Bayou • A replicated, weakly consistent storage system for mobile computing environments § Design Philosophy • Applications must know they may read inconsistent data • Applications must know there may be conflicts • Clients can read and write to any replica without the need for coordination • The definition of a conflict depends on the application's semantics

BAYOU § System model • Each data collection is replicated in full at a number of servers • Bayou provides two basic operations ◦ read and write • Client can use any server’s data ◦ client can read and submit write ◦ once write is accepted, client has no further responsibilities ◦ client does not wait for the write to propagate • Anti-entropy session ◦ Bayou servers propagate writes during pair-wise contact

BAYOU § Conflict Detection and Resolution • Supporting application-specific, per-write conflict detection and resolution § Two mechanisms ◦ permit clients to indicate how to detect conflict and how to resolve • dependency check • merge procedures

BAYOU § Dependency checks • Each write operation includes a dependency check • A SQL-like query is used • A conflict is detected if the expected value is not returned

BAYOU § Merge procedures • Each write operation includes a merge procedure ◦ written in a high-level, interpreted language • When automatic resolution is impossible, the merge procedure still runs to completion and produces a log of the conflict ◦ The user can resolve it manually later

BAYOU • Bayou write implementation • Bayou write call example [Figure: a Bayou write call made up of three parts: an update, a dependency check, and a merge procedure]
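
The three parts of a write can be sketched with a dictionary standing in for the database. The room-reservation scenario, the helper names `apply_write` and `reserve`, and the use of Python callables in place of SQL-like queries and interpreted merge scripts are all assumptions made for illustration.

```python
def apply_write(db, write):
    """Apply a Bayou-style write to db (a dict); return the path taken."""
    if write["dep_query"](db) == write["dep_expected"]:
        write["update"](db)          # dependency check passed
        return "update"
    write["merge"](db)               # conflict detected: run merge procedure
    return "merge"


def reserve(slot, alternates, who):
    """Build a write that reserves slot, falling back to the alternates."""

    def merge(db):
        for alt in alternates:
            if alt not in db:
                db[alt] = who
                return
        # No automatic resolution: log the conflict for the user.
        db.setdefault("conflict_log", []).append((who, slot))

    return {
        "dep_query": lambda db: slot in db,   # "is the slot already taken?"
        "dep_expected": False,
        "update": lambda db: db.__setitem__(slot, who),
        "merge": merge,
    }
```

Because the dependency check and merge procedure travel with the write, every server that applies it reaches the same outcome without consulting the client again.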

BAYOU § Replica Consistency • Eventual consistency ◦ Bayou guarantees that all servers eventually receive all writes • Consistency is maintained via a pair-wise anti-entropy process

BAYOU § Anti-entropy process • Brings two replicas up to date • Accept-stamp ◦ Monotonically increasing number assigned by the server when it receives a write ◦ defines a total order over all writes accepted by that server ◦ defines a partial order over all writes in the system • Basic design ◦ a one-way operation between pairs of servers ◦ works by propagating write operations ◦ write propagation is constrained by the accept-order

BAYOU § Pair-wise anti-entropy • unidirectional process • one server brings the other up-to-date by propagating writes unknown to it § Prefix property • A server R that holds a write stamped Wi that was initially accepted by another server X will also hold all writes accepted by X prior to Wi • Accept-stamp is used to achieve this property in Bayou

BAYOU § Basic anti-entropy algorithm • The sending server gets the version vector from the receiving server • It traverses its write-log and sends the writes not covered by the version vector [Figure: entries s1 through s6 alongside the version vector of R]

    anti-entropy(S, R) {
        Get R.V from receiving server R
        # now send all the writes unknown to R
        w = first write in S.write-log
        WHILE (w) DO
            IF R.V(w.server-id) < w.accept-stamp THEN
                # w is new for R
                SendWrite(R, w)
            w = next write in S.write-log
        END
    }
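
A runnable rendering of the pseudocode, under a few assumptions: writes are `(server_id, accept_stamp, payload)` tuples, the version vector is a dictionary, and accept-stamps start at 1 so that 0 can mean "nothing seen from this server". The `receive` helper is hypothetical and skips the undo/redo step a real server would need.

```python
def anti_entropy(sender_log, receiver_vv):
    """Return the writes in sender_log that receiver_vv does not cover.

    sender_log: list of (server_id, accept_stamp, payload) in log order.
    receiver_vv: dict server_id -> highest accept-stamp held by the receiver.
    """
    to_send = []
    for server_id, accept_stamp, payload in sender_log:
        if receiver_vv.get(server_id, 0) < accept_stamp:
            # This write is new for the receiver.
            to_send.append((server_id, accept_stamp, payload))
    return to_send


def receive(receiver_log, receiver_vv, writes):
    """Append received writes and advance the receiver's version vector."""
    for server_id, accept_stamp, payload in writes:
        receiver_log.append((server_id, accept_stamp, payload))
        receiver_vv[server_id] = max(receiver_vv.get(server_id, 0), accept_stamp)
```

Sending in log order is what lets a single number per server summarize everything the receiver holds.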

BAYOU § Anti-entropy process • A receiving server may receive a write that precedes some writes on the server ◦ Server must undo the effect and redo with new writes • Each server maintains a log of all write operations it has received • The write log may become excessively long ◦ log truncation is necessary especially for mobile systems

BAYOU § Write-log management • Log truncation ◦ When two servers engage in anti-entropy, one server may have discarded writes that the other needs ◦ In such cases, a full database transfer is required • Write stability ◦ Committed writes are introduced to allow log management ∙ committed write: one whose position in the write-log will not change, and which will never be re-executed

BAYOU § Write stability • Primary-commit protocol ◦ One replica server is designated as the primary replica ◦ The primary replica commits the position of a write in the log ◦ CSN (Commit Sequence Number) ∙ monotonically increasing number assigned to committed writes ◦ CSNs are propagated back to all other servers during the anti-entropy process
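
A small sketch of how commit sequence numbers could fix positions in a write-log. The `Primary` class and `order_log` function are hypothetical. Placing committed writes before tentative ones, and ordering tentative writes by accept-stamp with the server id as a tie-breaker, are assumptions about the log layout rather than something stated on the slide.

```python
class Primary:
    """Toy primary replica: hands out monotonically increasing CSNs."""

    def __init__(self):
        self.next_csn = 1

    def commit(self, write):
        write["csn"] = self.next_csn
        self.next_csn += 1
        return write


def order_log(writes):
    """Order writes: committed ones by CSN, then tentative ones.

    Each write is a dict with 'csn' (int or None), 'accept_stamp', 'server_id'.
    """
    committed = sorted(
        (w for w in writes if w["csn"] is not None),
        key=lambda w: w["csn"],
    )
    tentative = sorted(
        (w for w in writes if w["csn"] is None),
        key=lambda w: (w["accept_stamp"], w["server_id"]),
    )
    return committed + tentative
```

Once a write has a CSN, later arrivals cannot be ordered in front of it, which is what makes it safe to truncate the committed part of the log.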

BAYOU § Anti-entropy protocol extensions • Server reconciliation using transportable media • Support for session guarantees and eventual consistency • Light-weight server creation and retirement