Highly Available ACID Memory
Vijayshankar Raman

Introduction
§ Why ACID memory?
  • non-database apps:
    - want updates to critical data to be atomic and persistent
    - synchronization is useful when multiple threads are accessing critical databases
    - concurrency control and recovery logic run through most of the database code
    - extremely complicated, and hard to get right
    - bugs lead to data loss -- disastrous!

Project goal
§ Take recovery logic out of apps
  • build a simple user-level library that provides recoverable, transactional memory
    + all the logic in one place => easy to debug, maintain
    + easy to make use of hardware advances
  • use replication and persistent memory for recovery -- instead of writing logs
    + simpler to implement
    + simpler for applications to use

Questions to answer
§ program simplicity vs. performance
  • how much do we lose by replicating instead of logging?
§ on a cluster, can we use replication directly for availability?
  • traditionally, availability is handled on top of the recovery system

Outline
§ Introduction
§ Acid Memory API
§ Single Node design & implementation
§ Evaluation
§ High Availability: multiple node design and implementation
§ Evaluation
§ Conclusion

Acid Memory API
§ Transaction manager interface
  • TransactionManager(database name, acid memory area)
§ Transaction interface
  • beginTransaction()
  • getLock(memory region 1, READ/WRITE)
  • getLock(memory region 2, READ/WRITE) ...
    - memory region = virtual address prefix
  • commit/abort() -- all locks released
§ combine concurrency control with recovery
  • recovery done on write-locked regions
§ supports fine granularity locking => cannot use VM for recovery
§ applications can modify data directly
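The slide identifies a lockable memory region by a virtual address prefix but does not say how prefixes are encoded. The sketch below is one possible reading, not the library's actual representation: the `Region` struct, its `bits` field, and the helpers `mask`, `contains`, `overlaps`, and `conflicts` are all hypothetical names introduced for illustration. It shows how READ/WRITE lock conflicts could be decided when regions are prefixes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding: a region is the set of addresses that share the
// top `bits` bits of `prefix` (bits = 52 gives a 4 KB region on 64-bit).
struct Region {
    std::uint64_t prefix;
    unsigned bits;  // number of significant leading bits, 0..64
};

enum LockMode { READ, WRITE };

// Mask with the top `bits` bits set; avoids an undefined 64-bit shift.
inline std::uint64_t mask(unsigned bits) {
    return bits >= 64 ? ~0ULL : ~(~0ULL >> bits);
}

// True if `addr` lies inside region `r`.
inline bool contains(const Region& r, std::uint64_t addr) {
    return (addr & mask(r.bits)) == (r.prefix & mask(r.bits));
}

// Two prefix regions overlap exactly when they agree on the shorter prefix.
inline bool overlaps(const Region& a, const Region& b) {
    unsigned shorter = a.bits < b.bits ? a.bits : b.bits;
    return (a.prefix & mask(shorter)) == (b.prefix & mask(shorter));
}

// Overlapping locks conflict unless both are READ locks.
inline bool conflicts(const Region& a, LockMode ma,
                      const Region& b, LockMode mb) {
    return overlaps(a, b) && (ma == WRITE || mb == WRITE);
}
```

Under this encoding a lock on a whole page conflicts with a WRITE lock on any word inside it, which is one way to get the fine-granularity locking the slide mentions.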

Implementation
§ assume non-volatile memory (NVRAM, battery backup)
§ assume persistent file cache
§ acid memory area mmap’d from file
§ persistence => writes are permanent
§ getLock(WRITE) -- copy the region onto mirror area
§ transaction abort / system crash
  • undo changes on all write-locked regions using copy in mirror area
§ only overhead of recovery is a memcpy on each write lock
[Figure: acid memory area (master copy and mirror), mmap’d from a disk file]
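The copy-on-write-lock scheme above can be sketched as follows. This is a minimal illustration, not the library's code: `MirroredArea` and its methods are hypothetical names, the two areas are ordinary heap buffers rather than a persistent mmap'd file, and each region is assumed to be write-locked at most once per transaction. A real implementation would also have to keep the list of locked regions in persistent memory so that undo still works after a crash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch: a master copy plus a same-sized mirror area.
class MirroredArea {
public:
    explicit MirroredArea(std::size_t bytes)
        : master_(bytes, 0), mirror_(bytes, 0) {}

    // Applications modify the master copy directly through this pointer.
    unsigned char* data() { return master_.data(); }

    // Write lock: save the region's current contents in the mirror.
    // Assumes the region is not already write-locked in this transaction.
    void writeLock(std::size_t offset, std::size_t len) {
        assert(offset + len <= master_.size());
        std::memcpy(mirror_.data() + offset, master_.data() + offset, len);
        locked_.push_back({offset, len});
    }

    // Commit: the master already holds the new data, so only drop the locks.
    void commit() { locked_.clear(); }

    // Abort or crash recovery: restore each write-locked region from the mirror.
    void abort() {
        for (const Region& r : locked_)
            std::memcpy(master_.data() + r.offset,
                        mirror_.data() + r.offset, r.len);
        locked_.clear();
    }

private:
    struct Region { std::size_t offset; std::size_t len; };
    std::vector<unsigned char> master_;
    std::vector<unsigned char> mirror_;
    std::vector<Region> locked_;
};
```

The only per-transaction work beyond lock bookkeeping is the `memcpy` in `writeLock`, which matches the slide's claim about recovery overhead.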

Evaluation
§ Overhead of acid memory
  • read lock: 35 usec (lock manager overhead)
  • write lock: 35 usec + 5.5 usec/KB (memcpy cost)
  • much less than methods that write a log to disk
§ Ease of programming
  • application only needs to acquire locks to become recoverable
  • can manipulate the data directly -- no need to call a special function on every update
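The measured overheads above amount to a simple linear cost model. The sketch below just encodes the two numbers from the slide; the function and constant names are made up for illustration, and the model ignores anything the slide does not report (cache effects, lock contention).

```cpp
#include <cassert>

// Constants taken from the slide's measurements.
constexpr double kLockManagerUsec = 35.0;  // fixed lock manager overhead
constexpr double kMemcpyUsecPerKB = 5.5;   // mirror copy cost per KB

// Read locks pay only the lock manager overhead.
constexpr double readLockUsec() { return kLockManagerUsec; }

// Write locks additionally copy the region to the mirror area.
constexpr double writeLockUsec(double regionKB) {
    return kLockManagerUsec + kMemcpyUsecPerKB * regionKB;
}
```

For example, write-locking a 4 KB region costs 35 + 5.5 × 4 = 57 usec under this model.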

Example: suppose I want to transfer 1 M $ from A’s account to B’s

Using logging (Update() creates the needed logs):

    BeginTransaction();
    getLock(A’s account, WRITE);
    getLock(B’s account, WRITE);
    read(A’s account, a);
    read(B’s account, b);
    a = a - 1000000;
    b = b + 1000000;
    Update(A’s account, a);
    Update(B’s account, b);
    commit();

With ACID memory:

    /* a points to A’s account */
    /* b points to B’s account */
    trans = new Transaction(transMgr);
    trans->getLock(a, WRITE);
    trans->getLock(b, WRITE);
    *a = *a - 1000000;   /* modify the data in place */
    *b = *b + 1000000;
    trans->commit();

§ Performance comparison: acid memory vs. logging
  • consider a transaction updating integers in a 1 KB data-structure
  • Acid memory: write-lock the data-structure
  • Logging: write-lock the structure and update each integer separately
[Figure: time (in microseconds) vs. number of integer writes]
  • logging each individual update is a bit faster, up to a point
  • acid memory gives okay performance with very easy programmability

Replication for availability
[Figure: a transaction processing monitor replicating across DBMS instances]
§ traditionally, availability has been handled in a separate layer -- above recovery
§ can we handle both recovery and availability via the same mechanism?

Architecture
[Figure: client and transaction handler; owner (data, lock manager); replicas (data)]
§ transactions run by transaction handler
§ all lock requests must go to owner
§ data in all replicas must be kept in sync
§ balance load by partitioning data
  • different owner for each partition
§ failure model
  • fail-stop: nodes never send incorrect messages to others
  • failed nodes never recover data after crash
  • network never fails

[Figure: client and transaction handler; owner (data, lock manager); replicas (data)]
§ Reads: client gets data from a random replica
§ Writes: must update all replicas
  • on commit, transaction sends new data to owner
  • owner propagates update atomically to all replicas
    - 3 phase non-blocking commit protocol: always ensure that there is someone to take over the propagation if you crash
§ if owner crashes, fail-over to a replica
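The read, write, and fail-over rules above can be modelled in a single process. The sketch below is only a toy under stated assumptions: `ReplicatedPartition` and its methods are hypothetical names, replicas are in-memory vectors rather than separate nodes, and the write path only models the end state of a successful propagation. It does not implement the 3 phase commit protocol, and choosing the first live replica as the new owner is an arbitrary choice made here.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical in-process model of one data partition and its replicas.
class ReplicatedPartition {
public:
    ReplicatedPartition(std::size_t replicas, std::size_t words)
        : copies_(replicas, std::vector<int>(words, 0)),
          alive_(replicas, true), owner_(0), rng_(12345) {}

    // Reads: served by a randomly chosen live replica.
    int read(std::size_t index) {
        std::vector<std::size_t> live = liveReplicas();
        assert(!live.empty());
        std::uniform_int_distribution<std::size_t> pick(0, live.size() - 1);
        return copies_[live[pick(rng_)]][index];
    }

    // Writes: at commit, the new value reaches every live replica.
    // A real system needs an atomic commit protocol for this step.
    void commitWrite(std::size_t index, int value) {
        for (std::size_t r : liveReplicas()) copies_[r][index] = value;
    }

    // Fail-stop crash; if the owner crashed, fail over to a live replica.
    void crash(std::size_t replica) {
        alive_[replica] = false;
        if (replica == owner_) {
            std::vector<std::size_t> live = liveReplicas();
            assert(!live.empty());
            owner_ = live.front();
        }
    }

    std::size_t owner() const { return owner_; }

private:
    std::vector<std::size_t> liveReplicas() const {
        std::vector<std::size_t> live;
        for (std::size_t r = 0; r < alive_.size(); ++r)
            if (alive_[r]) live.push_back(r);
        return live;
    }

    std::vector<std::vector<int>> copies_;
    std::vector<bool> alive_;
    std::size_t owner_;
    std::mt19937 rng_;
};
```

Because every live replica holds the committed data, a read gives the same answer whichever replica serves it, and recovery after an owner crash amounts to picking a new owner rather than replaying a log.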

Evaluation
+ very fast recovery -- 424 usecs
+ get fast transactions without non-volatile memory
- writes are slower
  · 4n messages at commit if n replicas
  · still, this is faster than logging to disk
- homogeneous software: susceptible to bugs

Conclusions
§ Acid memory easier to use
§ Performance relative to logging not too bad
§ replication gives fast recovery

Future Work
§ Using cache for replication
§ when/how much to replicate?

Additional Slides

Evaluation, w.r.t. logging based approach
§ Ease of implementation
  • very little to code, mostly lock manager stuff
  • whereas in a traditional DBMS:
    - specialized buffer manager
    - log manager
    - complex recovery mechanism

How to make file cache persistent
§ Rio (Chen et al., 1996)
§ place file cache in non-volatile memory
§ protect it against OS crashes using VM protection
§ flush pages in file cache to disk files on reboot