The Single Node B-tree for Highly Concurrent Distributed Data Structures
Barbara Hohlt
Why a B-tree DDS?
• To do range queries (the queries need NOT be degree-3 transaction protected)
• Need only sequential scans over related indexed items (e.g., retrieve mail messages 3-50)
  – Performance impact illustrated later
Prototype DDS: Distributed B-tree
[Figure: clients connect over a WAN to service front-ends with DDS libraries; the libraries connect over a SAN to storage "bricks"]
• Clients interact with any service "front-end": all persistent service state is in the DDS and is consistent throughout the entire cluster
• The service interacts with the DDS via a library; the library is the 2PC coordinator, handles partitions, replication, etc., and exports the B-tree API
• A "brick" is a durable single-node B-tree plus RPC skeletons for network access; a brick can be on the same node as the service
• Example shows a distributed B-tree partition with 3 replicas in its group
Architecture
[Figure: pull-event services (workers) with DDS libraries connected over a WAN to clients and over a SAN to storage "bricks"]
• Clients interact with any service "front-end" [all persistent state is in the DDS and is consistent across the cluster]
• The service interacts with the DDS via a library [the library is the 2PC coordinator, handles partitioning, replication, etc., and exports the B-tree + HT API]
• A "brick" is a durable single-node B-tree or HT plus RPC skeletons for network access
• Example shows a distributed DDS partition with 3 replicas in its group
Component Layers
[Figure: layered stack — Application / Distributed B-trees / Single-Node B-trees / Buffer Cache / Asynchronous I/O Core ("sinks and sources") over the file system, TCP network, VIA network, and raw disk storage; requests and completions are queued between layers]
• The application layer makes "search" and "insert" requests to a B-tree instance
• The B-tree determines what data blocks it needs and fetches them from the global buffer cache
• If the cache does not have the needed blocks, it fetches them from the global I/O core, transparently to the B-tree instance
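As a rough illustration of this fetch path, here is a minimal Go sketch (all names are illustrative, not the brick's actual API): a cache hit returns the block immediately, while a miss is handed to an asynchronous I/O core and the block arrives later on a completion queue.

// Sketch of the fetch path described above: a B-tree instance asks the
// global buffer cache for a block; on a miss the cache hands the request
// to the asynchronous I/O core and the result arrives later on the
// instance's completion queue. Names (BlockID, Completion, ...) are
// illustrative, not the original API.
package main

import (
	"fmt"
	"sync"
)

type BlockID int

type Completion struct {
	ID   BlockID
	Data []byte
}

type BufferCache struct {
	mu     sync.Mutex
	blocks map[BlockID][]byte
	ioReqs chan BlockID // queued requests into the I/O core
}

// Fetch returns a cached block immediately, or queues an asynchronous
// read and returns nil; the caller later sees the block on its completion queue.
func (c *BufferCache) Fetch(id BlockID) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.blocks[id]; ok {
		return b
	}
	c.ioReqs <- id // miss: hand off to the I/O core
	return nil
}

func main() {
	cache := &BufferCache{blocks: map[BlockID][]byte{1: []byte("cached")}, ioReqs: make(chan BlockID, 8)}
	completions := make(chan Completion, 8) // per-instance completion queue

	// The I/O core: drains queued requests and posts queued completions.
	go func() {
		for id := range cache.ioReqs {
			data := []byte(fmt.Sprintf("block %d from disk", id))
			cache.mu.Lock()
			cache.blocks[id] = data
			cache.mu.Unlock()
			completions <- Completion{ID: id, Data: data}
		}
	}()

	if b := cache.Fetch(1); b != nil {
		fmt.Println("hit:", string(b))
	}
	if cache.Fetch(2) == nil {
		done := <-completions
		fmt.Println("miss completed:", done.ID, string(done.Data))
	}
}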
API Flavor
• SN_Btree.Close.Request, SN_Btree.Close.Complete
• SN_Btree.Create.Request, SN_Btree.Create.Complete
• SN_Btree.Open.Request, SN_Btree.Open.Complete
• SN_Btree.Destroy.Request, SN_Btree.Destroy.Complete
• SN_Btree.Read.Request, SN_Btree.Read.Complete
• SN_Btree.Write.Request, SN_Btree.Write.Complete
• SN_Btree.Remove.Request, SN_Btree.Remove.Complete
API Flavor, Contd.
• Distributed_Btree.Create.Request, Distributed_Btree.Create.Complete
• Distributed_Btree.Destroy.Request, Distributed_Btree.Destroy.Complete
• Distributed_Btree.Read.Request, Distributed_Btree.Read.Complete
• …
• Errors: timeout (even after retries), replica_dead, lockgrab_failed, doesn't exist, etc.
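A minimal Go sketch of the split-phase request/completion flavor listed above; the struct, field, and error names are guesses modeled on the slides, not the original definitions.

// Every operation is a Request that is queued and, later, a Complete that
// is delivered on a completion queue. Names here are illustrative.
package main

import "fmt"

type BtreeError int

const (
	OK BtreeError = iota
	ErrTimeout // even after retries
	ErrReplicaDead
	ErrLockGrabFailed
	ErrDoesNotExist
)

type ReadRequest struct {
	TreeID int
	Key    []byte
	Done   chan ReadComplete // where the completion is posted
}

type ReadComplete struct {
	Key   []byte
	Value []byte
	Err   BtreeError
}

func main() {
	requests := make(chan ReadRequest, 16)

	// A worker drains requests and posts completions (lookup is stubbed).
	go func() {
		for r := range requests {
			r.Done <- ReadComplete{Key: r.Key, Value: []byte("value"), Err: OK}
		}
	}()

	done := make(chan ReadComplete, 1)
	requests <- ReadRequest{TreeID: 1, Key: []byte("mail:42"), Done: done}
	c := <-done
	fmt.Printf("read %q -> %q (err=%d)\n", c.Key, c.Value, c.Err)
}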
Evaluation Metrics
• Speedup: performance versus resources (data size fixed)
• Scaleup: data size versus resources (fixed performance)
• Sizeup: performance versus data size
• Throughput: total number of reads/writes completed per second
• Latency: for satisfying a single request
Single Node B-tree Performance
[Figure: B-tree throughput, in megabits per second]
Single Node B-tree Performance
[Figure]
FSM-based Data Scheduling
• Scheduling is for:
  – Performance (including fairness and avoiding starvation)
  – Correctness/isolation
• This functionality has traditionally resided in two different modules (the kernel schedules threads, the app/database schedules locks), and each module is optimized individually
• Our claim: there can be significant performance wins from jointly optimizing both
How to Achieve Isolation?
• Use threads and locks
• Do careful scheduling (e.g., B-trees)
  – Unify all scheduling decisions
• Problem: such globally optimal scheduling is hard
  – In restricted settings, similar to hardware scoreboarding techniques
• A useful lesson for database concurrency (sketched below):
  – You can choose the order of operations to avoid conflicts (have a prepare/prefetch phase) and avoid holding locks across blocking I/O (lesson: do not lock if you block)
  – This can be implemented more naturally with asynchronous FSMs than with straight-line threaded code
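A minimal Go sketch of the prepare/prefetch idea, assuming a simple in-memory block cache: every block the operation may touch is fetched before any lock is taken, so no lock is ever held across blocking I/O. The helper names are illustrative.

// Phase 1 (prefetch) may block on I/O but holds no locks; phase 2 (apply)
// holds the lock but is purely in-memory.
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu    sync.Mutex         // guards the in-memory blocks
	cache = map[int][]byte{} // block id -> contents
)

// prefetch simulates asynchronous reads of every block the operation may
// need along its most probable execution path.
func prefetch(ids []int) {
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			time.Sleep(10 * time.Millisecond) // pretend disk I/O
			mu.Lock()
			cache[id] = []byte(fmt.Sprintf("block %d", id))
			mu.Unlock()
		}(id)
	}
	wg.Wait()
}

// apply runs the critical section; by now every block is resident, so the
// lock is never held across blocking I/O.
func apply(ids []int) {
	mu.Lock()
	defer mu.Unlock()
	for _, id := range ids {
		cache[id] = append(cache[id], " (updated)"...)
	}
}

func main() {
	need := []int{1, 2, 3}
	prefetch(need) // phase 1: no locks held while we may block
	apply(need)    // phase 2: short, in-memory, lock-protected
	fmt.Println(string(cache[2]))
}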
Benefits of Using FSMs + Events for Concurrency Control
• Control-flow-based concurrency control, as opposed to lock-based concurrency control:
  – Can avoid wrong scheduling decisions
  – Unnecessary locks can be eliminated
  – "Locks" can be released faster
  – More flexibility for concurrency control based on isolation requirements
• Explicit concurrency control also avoids deadlocks, priority inversions, race conditions, and convoy formations
[Figure: transactions T1 and T2 accessing blocks b1 and b2]
The Convoy Problem Illustrated
• Most tasks execute lock-coupling code like:
    lock(b); read(b); lock(b->next); unlock(b); …
• Problem: if task T1 blocks on I/O for b4, then task T2 cannot unlock b3 to acquire a lock on b4, task T3 cannot unlock b2 to acquire a lock on b3, and so on, forming a convoy even though most blocks are in the cache and each task may require only a finite number of locks
[Figure: chain b1–b2–b3–b4; b1 locked by T4 waiting for a lock on b2, b2 locked by T3 waiting for a lock on b3, b3 locked by T2 waiting for a lock on b4, b4 locked by T1 which is blocked on I/O — the convoy]
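The convoy can be reproduced with a small Go sketch of the hand-over-hand (lock-coupling) traversal above; the Block type, chain length, and timings are illustrative, not the brick's code.

// One task "blocks on I/O" at the last block while still holding its lock,
// and every task behind it piles up into a convoy even though the other
// tasks never touch I/O.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Block struct {
	mu   sync.Mutex
	next *Block
}

// traverse walks the chain with hand-over-hand locking:
// lock(b); read(b); lock(b.next); unlock(b); ...
func traverse(b *Block, slowIO bool) {
	b.mu.Lock()
	for b != nil {
		if slowIO && b.next == nil {
			time.Sleep(200 * time.Millisecond) // blocked on I/O while holding the lock
		}
		next := b.next
		if next != nil {
			next.mu.Lock() // tasks behind us wait here
		}
		b.mu.Unlock()
		b = next
	}
}

func main() {
	b4 := &Block{}
	b3 := &Block{next: b4}
	b2 := &Block{next: b3}
	b1 := &Block{next: b2}

	var wg sync.WaitGroup
	for i, slow := range []bool{true, false, false, false} { // T1 is the one that blocks on I/O
		wg.Add(1)
		go func(i int, slow bool) {
			defer wg.Done()
			t0 := time.Now()
			traverse(b1, slow)
			// T2..T4 also take on the order of 200 ms even though they never touch I/O.
			fmt.Printf("T%d finished after %v\n", i+1, time.Since(t0))
		}(i, slow)
		time.Sleep(10 * time.Millisecond) // stagger arrivals so T1 leads the chain
	}
	wg.Wait()
}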
Scheduling Based on Data Availability
• Two transactions, T1 and T2, request blocks {b1, b2} and {b1, b3} respectively, and T1 acquires the lock on b1 first
• Problem: if T1 acquires the lock on b2 and then blocks on I/O, T2 cannot make progress, even though the data for both b1 and b3 is available to T2
• Lesson: schedule based on how data becomes available, not on how requests enter the system
[Figure: b1 locked by T1; b2 locked by T1 and blocked on I/O; b3 ready; T2 blocked by T1]
Scheduling Based on Data Availability (Example of Misordering)
Transferring funds from checking to savings:
Begin(transaction)
  1: read(checking_account)
  2: read(savings_account)
  3: read(teller)   // in cache
  4: read(bank)     // in cache
  5: update(savings_account)
  6: update(checking_account)
  7: update(teller)
  8: update(bank)
End(transaction)
• If steps 3 and 4 were swapped with 1 and 2, we would be blocking on I/O while holding locks on the bank and teller balances
• In a global scheduling model the ordering of reads does not matter, because a request does not start execution until all the required data on its most probable execution path is available
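A minimal Go sketch of that global scheduling model, with illustrative names: a transaction is admitted only once every block on its most probable execution path is resident, so the order of its reads no longer matters and it never blocks on I/O while holding locks.

// admit partitions waiting transactions by data availability; the rest
// stay queued until prefetches for their missing blocks complete.
package main

import "fmt"

type Txn struct {
	Name  string
	Needs []string // blocks on the most probable execution path
}

func admit(waiting []Txn, resident map[string]bool) (runnable, deferred []Txn) {
	for _, t := range waiting {
		ready := true
		for _, b := range t.Needs {
			if !resident[b] {
				ready = false
				break
			}
		}
		if ready {
			runnable = append(runnable, t)
		} else {
			deferred = append(deferred, t)
		}
	}
	return
}

func main() {
	resident := map[string]bool{"teller": true, "bank": true, "savings": true}
	waiting := []Txn{
		{Name: "transfer", Needs: []string{"checking", "savings", "teller", "bank"}},
		{Name: "audit", Needs: []string{"teller", "bank"}},
	}
	run, wait := admit(waiting, resident)
	fmt.Println("runnable:", run)  // audit: all of its blocks are in cache
	fmt.Println("deferred:", wait) // transfer: waits for "checking" to be fetched
}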
Distributed Synchronization
[Figure: transactions T1–T4 on processors P1–P4 operating on blocks b1 and b2]
• Conventional lock-based implementations serialize the lock manager code; in the example above, T1 serializes against T3, although T1 and T3 should ideally execute concurrently
• With FSMs running on multiprocessors, distributed synchronization on distinct queues is possible without requiring a static data partition
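A small Go sketch of synchronization on distinct queues, assuming one request queue per block drained by its own worker FSM; names are illustrative.

// Operations on different blocks never pass through a single serialized
// lock manager: each block's queue is processed independently.
package main

import (
	"fmt"
	"sync"
)

type Op struct {
	Txn   string
	Block string
}

func main() {
	// One queue per block; ops on b1 and b2 proceed independently.
	queues := map[string]chan Op{
		"b1": make(chan Op, 8),
		"b2": make(chan Op, 8),
	}

	var wg sync.WaitGroup
	for block, q := range queues {
		wg.Add(1)
		go func(block string, q chan Op) { // the per-queue worker FSM
			defer wg.Done()
			for op := range q {
				fmt.Printf("%s executed on %s\n", op.Txn, block)
			}
		}(block, q)
	}

	// T1 and T3 touch different blocks, so they run concurrently instead
	// of serializing against a global lock manager.
	queues["b1"] <- Op{Txn: "T1", Block: "b1"}
	queues["b2"] <- Op{Txn: "T3", Block: "b2"}
	queues["b1"] <- Op{Txn: "T2", Block: "b1"}
	queues["b2"] <- Op{Txn: "T4", Block: "b2"}

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}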
Single Node B-tree "Brick"
[Figure: a global event queue and a global buffer cache shared by multiple B-tree "instances"; requests flow through the global queue, completions through per-instance completion queues]
• B-tree requests are queued in the global event queue
• Request completions are queued in the individual B-tree completion queues
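A minimal Go sketch of the brick's queue structure as described above, with illustrative types: one global event queue for incoming requests and one completion queue per B-tree instance.

// A dispatcher drains the global queue and posts each result onto the
// requesting instance's completion queue (the B-tree work itself is stubbed).
package main

import (
	"fmt"
	"sync"
)

type Request struct {
	Instance int
	Key      string
}

type Instance struct {
	id          int
	completions chan string // per-instance completion queue
}

func main() {
	global := make(chan Request, 16) // the global event queue
	instances := []*Instance{
		{id: 0, completions: make(chan string, 16)},
		{id: 1, completions: make(chan string, 16)},
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // the brick's dispatcher loop
		defer wg.Done()
		for r := range global {
			result := fmt.Sprintf("looked up %q", r.Key) // stand-in for the B-tree work
			instances[r.Instance].completions <- result
		}
		for _, inst := range instances {
			close(inst.completions)
		}
	}()

	global <- Request{Instance: 0, Key: "a"}
	global <- Request{Instance: 1, Key: "b"}
	close(global)

	wg.Wait()
	for _, inst := range instances {
		for c := range inst.completions {
			fmt.Printf("instance %d completion: %s\n", inst.id, c)
		}
	}
}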
FSM for Non-blocking Fetch
[State diagram: from start, while key <= highkey and the node is not a leaf, keep moving down; while key > highkey (whether the node has descendants or is a leaf), move right along the right-link; when key <= highkey and the node is a leaf, stop]
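The same FSM can be sketched in Go over a simplified in-memory B-link-style node; the layout here is illustrative, not the brick's on-disk format.

// fetch moves right while the key exceeds the node's high key, moves down
// while the node is not a leaf, and stops at a leaf whose high key covers
// the key.
package main

import "fmt"

type node struct {
	highKey  int
	keys     []int
	children []*node // empty for leaves
	right    *node   // right-link to the sibling
	leaf     bool
}

type state int

const (
	movingDown state = iota
	movingRight
	stop
)

// fetch walks the FSM and returns the leaf that should contain key.
// (This sketch assumes a right sibling exists whenever key > highKey.)
func fetch(n *node, key int) *node {
	s := movingDown
	for s != stop {
		switch {
		case key > n.highKey: // follow the right-link
			s = movingRight
			n = n.right
		case !n.leaf: // descend to the child covering key
			s = movingDown
			i := 0
			for i < len(n.keys) && key > n.keys[i] {
				i++
			}
			n = n.children[i]
		default: // leaf and key <= highKey
			s = stop
		}
	}
	return n
}

func main() {
	leafA := &node{leaf: true, highKey: 40, keys: []int{25, 36, 40}}
	leafB := &node{leaf: true, highKey: 99, keys: []int{47, 51, 62, 99}}
	leafA.right = leafB
	root := &node{highKey: 99, keys: []int{40}, children: []*node{leafA, leafB}}

	fmt.Println("leaf for 51 holds:", fetch(root, 51).keys)
}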
Splitting node a into nodes a' and b'
[Figure, four steps (a)-(d): the entries of a are divided between a' and a new node b'; b' is right-linked to a's old sibling c; a' is right-linked to b'; finally the parent f is updated to f' with a pointer to b']
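A Go sketch of that split sequence, using a further-simplified leaf-only node with just keys, a high key, and a right-link (illustrative, not the actual brick code): the right-links are fixed up before the parent learns about the new node.

// split divides a full node a into a' (reusing a) and a new node b'.
package main

import "fmt"

type node struct {
	highKey int
	keys    []int
	right   *node
	leaf    bool
}

func split(a *node) *node {
	mid := len(a.keys) / 2
	b := &node{
		leaf:    a.leaf,
		keys:    append([]int(nil), a.keys[mid:]...),
		highKey: a.highKey,
		right:   a.right, // b' links to a's old right sibling (c)
	}
	a.keys = a.keys[:mid]
	a.highKey = a.keys[len(a.keys)-1]
	a.right = b // a' links right to b' before the parent knows about b'
	return b    // the caller then inserts a pointer to b' into the parent f, yielding f'
}

func main() {
	c := &node{leaf: true, keys: []int{70, 80}, highKey: 80}
	a := &node{leaf: true, keys: []int{10, 20, 30, 40}, highKey: 40, right: c}
	b := split(a)
	fmt.Println("a':", a.keys, "high", a.highKey)
	fmt.Println("b':", b.keys, "high", b.highKey, "right ->", b.right.keys)
}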
A Single Node B-tree
[Figure: a metadata block, internal index nodes, and leaf nodes; one leaf holds keys 48, 51, 53, 56 with their values]
B-link node / leaf layout: (K_0, P_0) . . . (K_2k+1, P_2k+1)
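A Go sketch of this node layout, with illustrative field names: an array of (K_i, P_i) pairs plus the high key and right-link that the non-blocking fetch FSM relies on.

// In an index node each P_i is a child block number; in a leaf it locates
// the stored value for K_i.
package main

import "fmt"

type pair struct {
	key int
	ptr uint64 // child block number in an index node, or record locator in a leaf
}

type blinkNode struct {
	leaf    bool
	entries []pair // (K_0, P_0) ... (K_2k+1, P_2k+1)
	highKey int    // upper bound on keys covered by this node
	right   uint64 // block number of the right sibling (the B-link)
}

func main() {
	n := blinkNode{
		leaf:    true,
		entries: []pair{{48, 1001}, {51, 1002}, {53, 1003}, {56, 1004}},
		highKey: 56,
		right:   77,
	}
	fmt.Printf("leaf=%v entries=%d highKey=%d right=%d\n",
		n.leaf, len(n.entries), n.highKey, n.right)
}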