Asynchronous rebalancing of a replicated tree

Marek Zawirski, INRIA & UPMC, France
Marc Shapiro, INRIA & LIP6, France
Nuno Preguiça, UNL, Portugal

CFSE, May 2011, Saint-Malo

Summary
• Overview of Treedoc:
  – Abstractly, an always-responsive replicated sequence
  – Built as a replicated ordering tree
• Problem faced:
  – Tree unbalance
• Solution for asynchronous tree rebalance
• Algorithm requirements statement
• Novel F-translate algorithm

Zawirski, Shapiro, Preguiça - Asynchronous rebalancing of a replicated tree

Treedoc – a replicated sequence
• Sequence of atoms:
  – Ops: read, addAt, removeAt
• Replicated representation:
  – Grow-only binary tree
  – Stable, unique position ids
• Total order "<": infix traversal
[Figure: replicas 1, 2 and 3 each hold the same tree: root I, left child L (edge 0), right child P (edge 1). Infix traversal yields L I P; the position id of P is 1.]
[Shapiro, Preguiça et al., 2007, 2009]

Treedoc – a replicated sequence
[Figure, animation over four slides: one replica executes addAt(S, S) locally. The new position satisfies I < S < P and receives the id S = 10, giving the sequence L I S P. The replica then executes removeAt(P), leaving the sequence L I S. The other replicas still hold L I P.]
[Shapiro, Preguiça et al., 2007, 2009]

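The position ids above can be modelled as bit paths from the root (0 = left edge, 1 = right edge), which makes the infix order easy to compute. The Python sketch below is an illustration under that assumption, not the Treedoc implementation: `infix_key`, the `Treedoc` class and its method names are hypothetical, replica colours are left out, and a removed atom is modelled as a node that stays in the tree without content.

```python
from typing import Dict, List, Optional


def infix_key(path: str) -> str:
    """Sort key that realises the infix order on bit-path ids.

    A left edge '0' stays '0', a right edge '1' becomes '2', and a
    terminating '1' places a node after its left subtree and before
    its right subtree under plain string comparison.
    """
    return path.replace("1", "2") + "1"


class Treedoc:
    """Single-replica sketch: a grow-only map from position id to atom."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Optional[str]] = {}

    def add_at(self, pos: str, atom: str) -> None:
        self.nodes[pos] = atom

    def remove_at(self, pos: str) -> None:
        # The node is kept (the tree only grows); only its atom is discarded.
        self.nodes[pos] = None

    def read(self) -> List[str]:
        ordered = sorted(self.nodes, key=infix_key)
        return [self.nodes[p] for p in ordered if self.nodes[p] is not None]


doc = Treedoc()
doc.add_at("", "I")    # root
doc.add_at("0", "L")   # left child of I
doc.add_at("1", "P")   # right child of I, id 1
doc.add_at("10", "S")  # left child of P, hence I < S < P
after_add = doc.read()
doc.remove_at("1")     # removeAt(P)
after_remove = doc.read()
print(after_add, after_remove)  # ['L', 'I', 'S', 'P'] ['L', 'I', 'S']
```

The string key avoids building an explicit tree: comparing two keys is equivalent to comparing the positions of the two nodes in an infix traversal.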
Treedoc – a replicated sequence
• Operation-based replication:
  – Immediate local execution
  – Propagate (cbcast) & replay
[Figure: the operations addAt(S, S) and removeAt(P) are propagated to the other replicas and replayed there.]
[Shapiro, Preguiça et al., 2007, 2009]

Treedoc – a replicated sequence
• Operation-based replication:
  – Immediate local execution
  – Propagate (cbcast) & replay
  – Concurrent operations commute
  – Eventually consistent
• Predefined order: red < green < blue…
[Figure, animation over three slides: two replicas concurrently execute addAt(A, A) and addAt(E, E) at the same tree position, so replica 3 cannot order A and E by position alone ("?"). The predefined order on replica colours breaks the tie, and all replicas converge on the sequence E A L I S.]
[Shapiro, Preguiça et al., 2007, 2009]

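The tie-break by colour can be sketched as follows. This is a simplified model, not the Treedoc data structure: a position id is assumed to be a pair (bit path, colour of the creating replica), the assignment of colours to replicas and the path `"00"` are invented for the example, and `replay` is a hypothetical helper that applies a list of addAt operations in the given order.

```python
from typing import Dict, List, Tuple

# Predefined order on replica colours, as on the slide.
COLOUR_RANK = {"red": 0, "green": 1, "blue": 2}

PosId = Tuple[str, str]  # (bit path, colour of the creating replica)
Op = Tuple[PosId, str]   # addAt(position id, atom)


def sort_key(pos: PosId) -> Tuple[str, int]:
    """Infix order on the path first, colour rank as tie-break."""
    path, colour = pos
    return (path.replace("1", "2") + "1", COLOUR_RANK[colour])


def replay(ops: List[Op]) -> List[str]:
    """Apply addAt operations in delivery order and read the sequence."""
    state: Dict[PosId, str] = {}
    for pos, atom in ops:
        state[pos] = atom
    return [state[p] for p in sorted(state, key=sort_key)]


base = [(("", "blue"), "I"), (("0", "blue"), "L"), (("10", "blue"), "S")]
add_a = (("00", "green"), "A")  # issued at one replica
add_e = (("00", "red"), "E")    # issued concurrently at another replica

order_1 = replay(base + [add_a, add_e])
order_2 = replay(base + [add_e, add_a])
print(order_1, order_2)  # both: ['E', 'A', 'L', 'I', 'S']
```

Because the two ids differ in colour, the operations touch different keys and their delivery order does not matter, which is the commutativity the slide relies on.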
The tree rebalance problem
• With time the tree gets worse and worse
  – Unbalanced, empty nodes, lots of colors…
  – Various negative impacts
• Tree rebalance:
  – Create minimal tree from non-empty nodes
  – Keep order "<"
  – Use single color (white)
  – New ids (rectangles), incompatible with old
• Challenge:
  – How to avoid system-wide consensus?
[Figure: an unbalanced tree with empty nodes and several colors is rebalanced into a minimal single-color tree.]

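A rebalance of this kind can be sketched in a few lines. The function below is an assumption-laden illustration, not the authors' code: ids are plain bit paths without colours, `rebalance` is a hypothetical name, and the "minimal tree" is taken to be the balanced binary tree obtained by placing the middle atom at the root recursively.

```python
from typing import Dict, List, Optional


def infix_key(path: str) -> str:
    """Sort key for the infix order on bit-path ids (0 = left, 1 = right)."""
    return path.replace("1", "2") + "1"


def rebalance(nodes: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Build a balanced tree from the non-empty nodes, keeping the order."""
    atoms: List[str] = [
        nodes[p] for p in sorted(nodes, key=infix_key) if nodes[p] is not None
    ]
    new_nodes: Dict[str, str] = {}

    def place(lo: int, hi: int, path: str) -> None:
        # Put the middle atom of atoms[lo:hi] at `path`, then recurse.
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        new_nodes[path] = atoms[mid]
        place(lo, mid, path + "0")
        place(mid + 1, hi, path + "1")

    place(0, len(atoms), "")
    return new_nodes


# A right-leaning tree with one empty node (None) at path "11".
old = {"": "A", "1": "B", "11": None, "111": "C", "1111": "D"}
new = rebalance(old)
print(new)  # {'': 'C', '0': 'B', '00': 'A', '1': 'D'}
```

The example also shows why new ids are incompatible with old ones: path `"1"` names B before the rebalance and D after it, so operations issued against the old tree cannot be replayed unchanged.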
The core-nebula architecture
Idea: limit consensus to a smaller number of replicas [Leția et al., 2009]
Divide replicas into two disjoint sets:
CORE
• a stable group
• execute tree operations & agree on rebalance
✓ easier agreement
NEBULA
• sites join & leave, dynamic
• generate tree operations
• learn about rebalance
• perform catch-up protocol to integrate concurrent changes
✓ never blocked

Rebalance in core, catch-up from nebula
[Figure, animation over three slides: a core replica and nebula replicas 1, 2 and 3 start from the same tree; the operations addAt(S, S), removeAt(P) and removeAt(I) are issued and exchanged among them.]
• Any pair of replicas can exchange operations in the same epoch

Rebalance in core, catch-up from nebula
[Figure: the core reaches a final state and rebalances it; the labels addAt(P, P), addAt(T, T) and addAt(U, U) mark operations issued at different replicas in the old epoch.]
• Any pair of replicas can exchange operations in the same epoch
• rebalance@core initiates a new epoch
• rebalance@core and operations@core are inherently concurrent to ops@nebula!

Rebalance in core, catch-up from nebula
[Figure, animation over three slides: nebula 1 catches up with the core and moves to the rebalanced tree; the position of its own atom P in the new tree is still open ("?").]
• Pairwise catch-up moves a nebula replica to the next epoch
  – replay core ops until final state
  – replay rebalance on final nodes
  – translate nodes of nebula operations concurrent to (||) rebalance into the new tree

Naive translation algorithm(s)
• Naive translation algorithm: create a new position respecting the old order observed at the nebula replica ~[Leția et al., 2009]
[Figure, animation over five slides: nebula 1 catches up and translates P so that L < P < S; nebula 2 catches up and translates U so that L < U < S. After the translated addAt operations are exchanged, the relative order of P and U in the new tree (U < P versus P < U) contradicts the order observed at nebula 2.]
• Order observed at nebula 2 broken!

Towards correct translate: requirements
Below, translate(X) denotes the new-epoch id assigned to the old-epoch id X.
1. Order-preserving
   – For every X, Y the order is preserved between epochs: X < Y ⇒ translate(X) < translate(Y)
2. Deterministic
   – For every X, nebula_i, nebula_j, X is translated identically: translate(X)@nebula_i = translate(X)@nebula_j
3. Non-disruptive
   – For every X created by addAt and Y created by translate: X ≠ Y
Solution: designate all cases in advance using final state only!

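Requirement 1 lends itself to a mechanical check. The sketch below assumes bit-path ids ordered by infix traversal and represents a translation as a plain dictionary from old ids to new ids; `order_violations` and both example tables are hypothetical and only illustrate what a violation looks like. Requirements 2 and 3 are not checked here.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def infix_key(path: str) -> str:
    """Sort key for the infix order on bit-path ids (0 = left, 1 = right)."""
    return path.replace("1", "2") + "1"


def order_violations(translation: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return old-id pairs (x, y) with x < y whose translations are not in order."""
    ordered = sorted(translation, key=infix_key)
    bad: List[Tuple[str, str]] = []
    # combinations() keeps input order, so x precedes y in the old tree.
    for x, y in combinations(ordered, 2):
        if not infix_key(translation[x]) < infix_key(translation[y]):
            bad.append((x, y))
    return bad


# Old order of the four ids: "0" < "10" < "1" < "11".
preserving = {"0": "00", "10": "0", "1": "", "11": "1"}
broken = {"0": "0", "10": "01", "1": "010", "11": ""}

ok = order_violations(preserving)
bad = order_violations(broken)
print(ok, bad)  # [] [('10', '1')]
```

In the second table the old ids `"10"` and `"1"` swap their relative order after translation, which is the kind of flaw the naive algorithm exhibits.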
F-Translate algorithm & sentinels
• Sentinel position X_i:
  – Designated position for every potential translate
  – X < X_1 < X_2 < … < X_imax < Y
  – Materialized on translate
• F-Translate cases:
  – Right child of final node
  – Child of empty final node
  – Etc.
[Figure, animation over six slides: the rebalanced final state holds L, T and S, with sentinel positions L_1 to L_4 reserved after the final node L. As nebula replicas 1, 2 and 3 catch up, P and U are materialized at the same sentinel positions on every replica; after a further removeAt(I) all replicas show the same tree.]

F-Translate: sentinel implementation
• How to implement sentinelAfter?
  – Empty nodes are necessary!
  – Are more empty nodes discarded than introduced? Yes, but…
• How to minimize empty nodes in sentinelAfter?
  – O(1) number of empty nodes / path
  – Using a balanced binary tree: + O(log imax) / path

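One way to read the O(log imax) bound is as the length of the path to a sentinel when the imax sentinels after a node are laid out as a balanced binary subtree. The sketch below follows that reading and is not taken from the prototype: `sentinel_path` is a hypothetical helper, sentinels are numbered from 1, and the returned bit path is relative to the root of the reserved subtree, whose place in the new tree is left open here.

```python
def sentinel_path(i: int, imax: int) -> str:
    """Relative bit path of the i-th of imax sentinels in a balanced subtree.

    The middle sentinel sits at the subtree root; smaller indices go
    left ('0') and larger ones go right ('1'), so the infix order of
    the paths matches the order X_1 < X_2 < ... < X_imax.
    """
    if not 1 <= i <= imax:
        raise ValueError("sentinel index out of range")
    lo, hi, path = 1, imax, ""
    while True:
        mid = (lo + hi) // 2
        if i == mid:
            return path
        if i < mid:
            hi = mid - 1
            path += "0"
        else:
            lo = mid + 1
            path += "1"


paths = [sentinel_path(i, 4) for i in range(1, 5)]
print(paths)  # ['0', '', '1', '11']
```

With this layout the longest relative path among imax sentinels has floor(log2(imax)) edges, for example 6 edges for imax = 100.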
Summary
• Problem faced:
  – Tree unbalance
• Solution for asynchronous tree rebalance
• Algorithm requirements statement
• Novel F-translate algorithm:
  – Identify and utilize final state prior to rebalance
  – Use sentinel positions
  – Prototype catch-up implementation
• Future work?
  – Evaluation of sentinelAfter implementation
  – Formal order-preservation proof

Appendix: the unbalance problem
• Use sparse tree and heuristic to assign PosID [Weiss et al., '09]
  – Tested on Wikipedia traces; works at the cost of possible anomaly
• Use list instead of a tree [Roh et al., '10]
  – Different costs and convergence characteristics?
• Rebalance the tree [Shapiro, Preguiça et al., '09]
  – System-wide consensus; inherent limitations
  – The core-nebula idea [Leția et al., '09]; incorrect translation
• This work brings:
  – More formalization of the core-nebula for asynchronous systems
  – Flaws revealed in naive algorithms
  – Translation requirements statement
  – Novel F-translate algorithm
  – First prototype implementation in Java (subject to further studies)