Harmonia: Near-Linear Scalability for Replicated Storage with In-Network Conflict Detection
Hang Zhu, Zhihao Bai, Jialin Li, Ellis Michael, Dan R. K. Ports, Ion Stoica, Xin Jin

Replication: a fundamental tool for distributed storage
[Figure: clients send requests to the replicated servers and receive replies; a replication protocol ensures consistency]
- Best case: the performance of a single node

Achieve linear scalability with strong consistency?
- Naïve approach: allow any replica to serve reads
  - Read-ahead anomalies
    - Primary-backup protocols
  - Read-behind anomalies
    - Quorum-based protocols
- Protocol-level approach: CRAQ
  - An extra phase for each write
  - Unclear how to extend to other protocols

Our Approach: Harmonia
- Design goals
  - Generality
    - Support different protocols
  - Minimal overhead
    - Avoid additional overhead of tracking the dirty set (objects with pending writes)
    - In-switch implementation
- Key observation: the size of the dirty set at any given time is small
- Basic idea
  - Send reads on objects without pending writes to any replica
  - Pass other requests to the underlying replication protocol

Harmonia Architecture
[Figure: clients reach the storage servers of a replicated storage rack through a ToR switch. The switch data plane performs L2/L3 routing and read-write conflict detection, and holds the sequence number, the last-committed point, and the dirty set (a table of obj_id and seq entries).]

How to build a strongly consistent replicated system with near-linear scalability?
- How to maintain the dirty set in network?
- How to achieve near-linear scalability?
- How to guarantee consistency?
- How to handle failures?

Programmable switch architecture
[Figure: packets pass through a custom parser, an ingress pipeline, queues, and an egress pipeline. Each pipeline is a sequence of match-action stages (Stage 1, Stage 2, …), and each stage has its own register array.]

Harmonia packet format
[Figure: ETH | IP | TCP/UDP | Harmonia Header (TYPE, OBJ_ID, SEQ_NUM) | Payload]
- ETH, IP: existing protocols, used for L2/L3 routing
- TCP/UDP: reserved port #
- TYPE: READ, WRITE, etc.

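To make the header layout concrete, here is a minimal sketch of packing and parsing the three header fields. The field widths (1-byte TYPE, 4-byte OBJ_ID, 4-byte SEQ_NUM), the numeric type codes, and the `WRITE_COMPLETION` name are assumptions for illustration; the slide only names the fields and the READ and WRITE types.

```python
import struct
from enum import IntEnum


class MsgType(IntEnum):
    # READ and WRITE come from the slide; the numeric codes and
    # WRITE_COMPLETION are hypothetical.
    READ = 0
    WRITE = 1
    WRITE_COMPLETION = 2


# Assumed layout in network byte order: TYPE (1 byte), OBJ_ID (4), SEQ_NUM (4).
HEADER = struct.Struct("!BII")


def pack_header(msg_type, obj_id, seq_num=0):
    """Serialize the Harmonia header fields into bytes."""
    return HEADER.pack(msg_type, obj_id, seq_num)


def unpack_header(data):
    """Split a datagram body into (type, obj_id, seq_num, payload)."""
    msg_type, obj_id, seq_num = HEADER.unpack_from(data)
    return MsgType(msg_type), obj_id, seq_num, data[HEADER.size:]
```

In this sketch the header would sit directly after the UDP header, so a switch parser can find it at a fixed offset once the reserved port has matched.
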
Insert into dirty set
[Figure: a write query (obj_id=A) is hashed with h1(A), h2(A), and h3(A) into the register arrays of Stage 1, Stage 2, and Stage 3. Each array holds (obj, seq) entries, and the write is stored as A with seq 6 in a free slot.]
- Index the register arrays with the hash of the object ID
- Iterate over stages until an empty slot is found
- Store the object ID and the sequence number

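The insertion steps above can be sketched in software as a multi-stage hash table. This is a simplified model, not the P4 implementation: the slot count, the CRC32-based hash, and the helper names are assumptions, and the slide does not say what the switch does when every candidate slot is occupied, so the sketch simply reports failure.

```python
import zlib

NUM_STAGES = 3        # the slides mention a 3-stage register array
SLOTS_PER_STAGE = 4   # hypothetical size, kept tiny for illustration


def slot_index(stage, obj_id):
    # Hypothetical per-stage hash: CRC32 salted with the stage number.
    return zlib.crc32(f"{stage}:{obj_id}".encode()) % SLOTS_PER_STAGE


def new_dirty_set():
    return [[None] * SLOTS_PER_STAGE for _ in range(NUM_STAGES)]


def insert(stages, obj_id, seq):
    """Store (obj_id, seq); return False if no candidate slot is free."""
    candidates = [(slots, slot_index(s, obj_id)) for s, slots in enumerate(stages)]
    # If the object is already tracked, overwrite its sequence number.
    for slots, i in candidates:
        if slots[i] is not None and slots[i][0] == obj_id:
            slots[i] = (obj_id, seq)
            return True
    # Otherwise take the first empty slot, stage by stage.
    for slots, i in candidates:
        if slots[i] is None:
            slots[i] = (obj_id, seq)
            return True
    return False
```

Each object has exactly one candidate slot per stage, so a lookup or insert touches at most three registers, which is what lets the structure fit a fixed-depth switch pipeline.
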
Remove from dirty set
[Figure: a write completion (obj_id=A, seq=6) is hashed with h1(A), h2(A), and h3(A) into the register arrays of Stage 1, Stage 2, and Stage 3, and the matching (obj, seq) entry is cleared.]
- Iterate over multiple stages
- Remove the object ID and sequence number

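Removal can be sketched the same way: probe the object's candidate slot in each stage and clear the entry that matches. The hash helper is hypothetical, and the `entry seq <= completed seq` guard is an assumption; the slide only shows a completion whose sequence number equals the stored one. The guard is meant to keep an entry alive when a newer write to the same object is still pending.

```python
import zlib


def slot_index(stage, obj_id, num_slots):
    # Hypothetical per-stage hash: CRC32 salted with the stage number.
    return zlib.crc32(f"{stage}:{obj_id}".encode()) % num_slots


def remove(stages, obj_id, seq):
    """Clear obj_id's entry if the completed write is at least as new as it.

    `stages` is a list of register arrays; each slot is None or (obj_id, seq).
    Returns True if an entry was cleared.
    """
    for stage, slots in enumerate(stages):
        i = slot_index(stage, obj_id, len(slots))
        entry = slots[i]
        if entry is not None and entry[0] == obj_id and entry[1] <= seq:
            slots[i] = None
            return True
    return False
```
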
Handling write
[Figure: (1) the client sends Write (obj_id=A) to the switch, whose data plane holds commit=0 and a dirty set with entries E 1 and B 2. (2) The switch adds the entry A 3 and forwards the write to the Primary. Storage servers: Primary, Backup 1, Backup 2.]
- Add the object ID into the dirty set
- Forward the request to the primary

Handling write-completion
[Figure: steps 3 and 4 of Write (obj_id=A). The switch data plane holds commit=0 and a dirty set with entries X 4, C 5, and A 3. When the write completion arrives, the switch deletes A 3 and sets commit=3. Storage servers: Primary, Backup 1, Backup 2.]
- Delete the entry from the dirty set
- Update the last-committed point

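The write path and the write-completion path can be modeled together as a small state machine. This is a sketch under assumptions: the dirty set is a plain dict rather than register arrays, the switch is assumed to assign sequence numbers from a counter, and the class and method names are hypothetical.

```python
class SwitchModel:
    """Toy model of the switch state touched by writes."""

    def __init__(self):
        self.next_seq = 0        # assumed: the switch stamps each write
        self.dirty = {}          # obj_id -> seq of the pending write
        self.last_committed = 0  # "commit" in the figures

    def on_write(self, obj_id):
        # Add the object to the dirty set and forward to the primary.
        self.next_seq += 1
        self.dirty[obj_id] = self.next_seq
        return ("primary", obj_id, self.next_seq)

    def on_write_completion(self, obj_id, seq):
        # Delete the entry unless a newer write to obj_id is still pending.
        if obj_id in self.dirty and self.dirty[obj_id] <= seq:
            del self.dirty[obj_id]
        # Advance the last-committed point; it never moves backwards.
        self.last_committed = max(self.last_committed, seq)
```

Replaying the figures: with E and B already pending, a write to A is stamped with seq 3, and its completion removes A and moves the last-committed point to 3.
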
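Handling read and reply on objects in the dirty set
[Figure: the client sends Read (obj_id=E); the switch data plane holds commit=0 and a dirty set with entries E 1 and B 2. Steps 1 to 4 carry the request through the switch to the Primary and the reply back to the client. Storage servers: Primary, Backup 1, Backup 2.]
- Send the request to the primary for consistency
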
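Handling read and reply on objects not in the dirty set
[Figure: the client sends Read (obj_id=C); the switch data plane holds commit=0 and a dirty set with entries E 1 and B 2. Steps 1 to 4 carry the request through the switch to a backup and the reply back to the client. Storage servers: Primary, Backup 1, Backup 2.]
- Send the request to a backup for better performance
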
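The two read cases reduce to a single membership test on the dirty set. A minimal sketch, assuming the dirty set is a dict and that the switch picks uniformly among replicas for clean objects (the selection policy is not stated in the slides; the function name is hypothetical):

```python
import random


def route_read(obj_id, dirty, replicas, primary, rng=random):
    """Pick the server that should answer a read for obj_id."""
    if obj_id in dirty:
        # A write is pending: fall back to the replication protocol's
        # normal read path, which here means the primary.
        return primary
    # No pending write: any replica holds the latest committed value.
    return rng.choice(replicas)
```

Because the dirty set is small at any given time, most reads take the second branch, which is where the read throughput of all replicas adds up.
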
How to build a strongly consistent replicated system with near-linear scalability?
- How to maintain the dirty set in network?
- How to achieve near-linear scalability?
- How to guarantee consistency? Sequence number + last-committed point
- How to handle failures? Monotonically increasing switch ID

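One way these two mechanisms could be used on the server side is sketched below. It is an interpretation, not a description of the actual protocol: it assumes a primary-backup setting, that each fast-path read carries the switch's last-committed point and switch ID, and that a replica remembers the sequence number of the last write it applied to each object. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # obj_id -> (value, seq of the last write applied locally)
    store: dict = field(default_factory=dict)
    # Highest switch ID this replica has been told about.
    current_switch_id: int = 0

    def apply_write(self, obj_id, value, seq):
        self.store[obj_id] = (value, seq)

    def fast_read(self, obj_id, last_committed, switch_id):
        # Failure handling: ignore reads routed by a superseded switch.
        if switch_id < self.current_switch_id:
            return ("reject", None)
        value, seq = self.store.get(obj_id, (None, 0))
        # Consistency: a local write newer than the last-committed point
        # may not be committed yet, so do not serve it from this replica.
        if seq > last_committed:
            return ("forward_to_primary", None)
        return ("ok", value)
```

The comparison against the last-committed point guards against returning uncommitted data, and the switch ID comparison keeps a replaced switch, whose dirty set may be out of date, from steering reads.
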
Implementation
- Testbed
  - Twelve server machines connected by a 6.5 Tbps Barefoot Tofino switch
- Switch
  - Written in P4
  - 3-stage register array for the dirty set
- Server
  - DPDK client to generate workload

Evaluation
- Generality
- Scalability
- Comparison with in-switch storage (NetCache, NetChain)
- Switch resource usage and overhead
- Performance under failures

Generality with storage architectures and workloads
[Figures: results across storage architectures and across workloads]
- A general approach for different storage architectures
- A general approach for different workloads

Generality with replication protocols
[Figure panels: Primary-backup; Quorum-based]
- A general approach to a variety of protocols

Scalability
[Figure panels: Read-only; 5% Writes; Write-only]
- Near-linear scalability for read-intensive workloads

Conclusion
- Harmonia is a new replicated storage architecture
- Near-linear scalability
- Strong consistency
- In-network conflict detection
- New-generation programmable switches

Thanks!