A Cost-based Heterogeneous Recovery Scheme for Distributed Storage

A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes
Yunfeng Zhu 1, Patrick P. C. Lee 2, Liping Xiang 1, Yinlong Xu 1, Lingling Gao 1
1 University of Science and Technology of China
2 The Chinese University of Hong Kong
DSN'12

Fault Tolerance
Ø Fault tolerance becomes more challenging in modern distributed storage systems
• Increase in scale
• Usage of inexpensive but less reliable storage nodes
Ø Fault tolerance is ensured by introducing redundancy across storage nodes
• Replication (figure: chunks A and B, each stored three times)
• Erasure codes, e.g., Reed-Solomon codes (figure: chunks A and B stored alongside coded chunks such as A+2B)

XOR-Based Erasure Codes
Ø Encoding/decoding involve XOR operations only
• Low computational overhead
Ø Different redundancy levels
• 2-fault tolerant: RDP, EVENODD, X-Code
• 3-fault tolerant: STAR
• General-fault tolerant: Cauchy Reed-Solomon (CRS)

Failure Recovery
Ø Recovering node failures is necessary
• Preserve the required redundancy level
• Avoid data unavailability
Ø Single-node failure recovery
• Single-node failure occurs more frequently than a concurrent multi-node failure

Example: Recovery in RDP
Ø An RDP code example with 8 nodes (node 6 holds row parity, node 7 holds diagonal parity)
node 0   node 1   node 2   node 3   node 4   node 5   node 6   node 7
d0,0     d0,1     d0,2     d0,3     d0,4     d0,5     d0,6     d0,7
d1,0     d1,1     d1,2     d1,3     d1,4     d1,5     d1,6     d1,7
d2,0     d2,1     d2,2     d2,3     d2,4     d2,5     d2,6     d2,7
d3,0     d3,1     d3,2     d3,3     d3,4     d3,5     d3,6     d3,7
d4,0     d4,1     d4,2     d4,3     d4,4     d4,5     d4,6     d4,7
d5,0     d5,1     d5,2     d5,3     d5,4     d5,5     d5,6     d5,7
Ø Let's say node 0 fails. How do we recover node 0?

Conventional Recovery
Ø Idea: use only row parity sets. Recover each lost data symbol (i.e., data chunk) independently
(Figure: the RDP array, nodes 0 to 7, with the symbols read during recovery highlighted.)
Ø Read symbols: 36
Ø Different metrics can be used to measure the efficiency of a recovery scheme
Ø Then how do we recover node 0 efficiently?

Minimize Number of Read Symbols
Ø Idea: use a combination of row and diagonal parity sets to maximize overlapping symbols [Xiang, ToS'11]
(Figure: the RDP array, nodes 0 to 7, with the symbols read during recovery highlighted.)
Ø Read symbols: 27
Ø Improvement over conventional recovery: 25%

Need A New Metric?
Ø A modern storage system is naturally composed of heterogeneous types of storage nodes
• System upgrades
• New node addition
Ø A heterogeneous environment
(Figure: a proxy, a new node, and nodes 0 to 7, with link bandwidths of 26, 68, 86, 109, 110, 110, and 113 Mbps.)
Ø Need a new efficient failure recovery solution for heterogeneous environments!

Related Work
Ø Hybrid recovery
• Minimize number of read symbols for RAID-6 XOR-based erasure codes
• e.g., RDP [Xiang, ToS'11], EVENODD [Wang, Globecom'10]
Ø Enumeration recovery [Khan, FAST'12]
• Enumerate all recovery possibilities to achieve optimal recovery for general XOR-based erasure codes
Ø Greedy recovery [Zhu, MSST'12]
• Efficient search of recovery solutions for general XOR-based erasure codes
Ø Regenerating codes [Dimakis, ToIT'10]
• Nodes encode data during recovery
• Minimize recovery bandwidth
• Heterogeneous case considered in [Li, Infocom'10], but requires node encoding and collaboration

Challenges
Ø How to enable efficient failure recovery for heterogeneous settings?
• Minimizing # of read symbols targets homogeneous settings
• Performance is bottlenecked by poorly performing nodes
Ø How to quickly find the recovery strategy?
• Minimizing # of read symbols: a deterministic metric
• Minimizing general cost: a non-deterministic metric
• Recovery decision typically can't be pre-determined

Our Contributions
Cost-based single-node failure recovery for heterogeneous distributed storage systems
Ø Target two RAID-6 codes: RDP and EVENODD
• XOR-based encoding operations
Ø Goals:
• Minimize search time
• Minimize recovery cost

Our Contributions
Ø Formulate an optimization problem for single-node failure recovery in heterogeneous settings
Ø Propose a cost-based heterogeneous recovery (CHR) algorithm
• Narrow down search space
• Suitable for online recovery
Ø Implement and experiment on a heterogeneous networked storage testbed

Model Formulation
Ø Nodes: V_0, V_1, …, V_k, …, V_{p-1}, V_p (V_k is the failed node)
Ø Weights: w_0, w_1, …, w_{p-1}, w_p
Ø Download distribution: y_0, y_1, …, y_{p-1}, y_p (y_i is the number of symbols read from V_i)
Ø Our formulation: minimize the total recovery cost
\[ \min \; C = \sum_{i=0,\, i \neq k}^{p} w_i \, y_i \]

Physical Meanings
Ø w_i = 1 for all i
• C: total number of symbols being read from surviving nodes
Ø w_i = inverse of transmission bandwidth of node V_i
• C: total amount of transmission time to download symbols from surviving nodes
Ø w_i = monetary cost of migrating per unit of data outbound from node V_i
• C: total monetary cost of migrating data from surviving nodes (or clouds)
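The three interpretations can be sketched in a few lines of Python, assuming the cost is the weighted sum of per-node downloads. The function name `recovery_cost`, the sample download distribution, and the bandwidth and price values are illustrative assumptions, not figures from the talk.

```python
def recovery_cost(weights, downloads):
    """Total recovery cost: sum over surviving nodes of w_i * y_i."""
    if len(weights) != len(downloads):
        raise ValueError("one weight per surviving node is required")
    return sum(w * y for w, y in zip(weights, downloads))

# Hypothetical download distribution over five surviving nodes.
downloads = [3, 2, 2, 3, 2]

# w_i = 1: the cost is the number of symbols read.
symbols_read = recovery_cost([1] * 5, downloads)

# w_i = 1 / bandwidth: the cost is proportional to transmission time.
bandwidth_mbps = [100, 100, 1000, 1000, 1000]    # assumed values
transfer_cost = recovery_cost([1 / b for b in bandwidth_mbps], downloads)

# w_i = price per unit of outbound data: the cost is a monetary total.
price_per_symbol = [0.5, 0.25, 0.25, 0.5, 1.0]   # assumed values
monetary_cost = recovery_cost(price_per_symbol, downloads)
```

Only the weight vector changes between the three cases; the download distribution and the objective stay the same.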

Solving the Model
Ø Important: which symbols are fetched from surviving nodes must follow the inherent rules of the specific coding scheme
Ø To solve the model, we introduce the recovery sequence (x_0, x_1, …, x_{p-2}, 0)
• x_i = 0: d_{i,k} is recovered from its row parity set
• x_i = 1: d_{i,k} is recovered from its diagonal parity set
1) Each recovery sequence represents a feasible recovery solution
2) The download distribution can be represented by the recovery sequence
Ø An example (6 nodes, node 0 lost):
node 0   node 1   node 2   node 3   node 4   node 5
d0,0     d0,1     d0,2     d0,3     d0,4     d0,5
d1,0     d1,1     d1,2     d1,3     d1,4     d1,5
d2,0     d2,1     d2,2     d2,3     d2,4     d2,5
d3,0     d3,1     d3,2     d3,3     d3,4     d3,5
Ø Recovery sequence: (0, 0, 1, 1, 0)
Ø Download distribution: (3, 2, 2, 3, 2)
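The mapping from a recovery sequence to a download distribution can be sketched as follows. This is a minimal sketch, assuming the standard RDP layout (node p-1 holds row parity, node p holds diagonal parity, and diagonal i covers the symbols d_{r,c} with (r + c) mod p = i) and assuming the failed node is node 0. The function name `rdp_download_distribution` is hypothetical.

```python
def rdp_download_distribution(p, seq):
    """Symbols read from nodes 1..p when node 0 of an RDP array fails.

    seq is the recovery sequence (x_0, ..., x_{p-2}, 0) from the slides.
    """
    if len(seq) != p or seq[-1] != 0:
        raise ValueError("expected p entries ending in a fixed 0")
    rows = p - 1
    needed = set()                      # (row, node) pairs that must be read
    for i in range(rows):
        if seq[i] == 0:
            # Row parity set of row i: nodes 1..p-1 (node p-1 is row parity).
            needed.update((i, c) for c in range(1, p))
        else:
            # Diagonal i: symbols (r, c) with (r + c) mod p == i.
            for c in range(1, p):
                r = (i - c) % p
                if r < rows:            # row p-1 is an imaginary all-zero row
                    needed.add((r, c))
            needed.add((i, p))          # diagonal parity symbol on node p
    counts = [0] * (p + 1)
    for _, node in needed:
        counts[node] += 1
    return counts[1:]

print(rdp_download_distribution(5, (0, 0, 1, 1, 0)))  # [3, 2, 2, 3, 2]
```

Symbols shared between a row parity set and a diagonal parity set are counted once, which is why mixing the two kinds of parity sets reads fewer symbols than using rows alone.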

Solving the Model (2)
Ø Step 1: use the recovery sequence to represent downloads
Ø Step 2: narrow down the search space by only considering min-read recovery sequences (i.e., those that download the minimum number of read symbols during recovery)
Ø Step 3: reformulate the model as minimizing the total recovery cost over min-read recovery sequences

Expensive Enumeration
Ø Challenge: too many min-read recovery sequences to enumerate, even after we narrow down the search space
p     Total # of recovery sequences   # of min-read recovery sequences   # of unique min-read recovery sequences
5     16                              6                                  2
7     64                              20                                 4
11    1024                            252                                26
13    4096                            924                                74
17    65536                           12870                              698
19    262144                          48620                              2338
23    4194304                         705432                             28216
29    268435456                       40116600                           1302688
Ø Observation: many min-read recovery sequences return the same download distribution
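For small p the three counts can be reproduced by brute force. The sketch below assumes the standard RDP layout with node 0 failed, and assumes that "unique" means "distinct download distribution". The helper names `distribution` and `count_sequences` are hypothetical.

```python
from itertools import product

def distribution(p, x):
    """Reads per surviving node (nodes 1..p); x holds x_0..x_{p-2}."""
    full = tuple(x) + (0,)              # append the fixed trailing 0
    # Symbol (r, c) is read if row r is rebuilt by row parity, or if the
    # diagonal through (r, c), numbered (r + c) mod p, is used.
    per_node = [sum(1 for r in range(p - 1)
                    if full[r] == 0 or full[(r + c) % p] == 1)
                for c in range(1, p)]
    return tuple(per_node) + (sum(full),)   # node p: one parity per diagonal

def count_sequences(p):
    """Return (# sequences, # min-read sequences, # distinct distributions)."""
    best, n_min, dists = None, 0, set()
    for x in product((0, 1), repeat=p - 1):
        d = distribution(p, x)
        total = sum(d)
        if best is None or total < best:
            best, n_min, dists = total, 0, set()
        if total == best:
            n_min += 1
            dists.add(d)
    return 2 ** (p - 1), n_min, len(dists)

print(count_sequences(5))   # (16, 6, 2)
print(count_sequences(7))   # (64, 20, 4)
```

The exhaustive loop grows as 2^(p-1), which is exactly the cost the talk sets out to avoid for larger p.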

Optimize Enumeration Process
Ø Two conditions under which different recovery sequences have the same download distribution:
• Shift condition: (0, 0, 0, 1, 1, 1, 0), (0, 0, 1, 1, 1, 0, 0), (0, 1, 1, 1, 0, 0, 0), (1, 1, 1, 0, 0) …
• Reverse condition: (0, 0, 0, 1, 1, 1, 0), (0, 1, 1, 1, 0, 0, 0)
Ø Key idea: not all recovery sequences need to be enumerated (details in the paper)
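Both conditions can be checked numerically on the complete sequences listed above. This is a small sketch, assuming the standard RDP layout with p = 7 and node 0 failed; `distribution` and `shift_left` are hypothetical helper names.

```python
def distribution(p, seq):
    """Reads per surviving node (nodes 1..p); seq includes the trailing 0."""
    # Symbol (r, c) is read if row r uses row parity, or if the diagonal
    # through (r, c), numbered (r + c) mod p, is used.
    per_node = [sum(1 for r in range(p - 1)
                    if seq[r] == 0 or seq[(r + c) % p] == 1)
                for c in range(1, p)]
    return tuple(per_node) + (sum(seq),)

def shift_left(seq):
    """Cyclic shift by one position."""
    return seq[1:] + seq[:1]

a = (0, 0, 0, 1, 1, 1, 0)
b = shift_left(a)            # (0, 0, 1, 1, 1, 0, 0)
c = shift_left(b)            # (0, 1, 1, 1, 0, 0, 0)
reversed_a = a[::-1]         # also (0, 1, 1, 1, 0, 0, 0)

for seq in (a, b, c, reversed_a):
    print(seq, distribution(7, seq))   # each prints (5, 4, 3, 3, 4, 5, 3)
```

A shifted or reversed sequence is only usable when its last entry is still 0, since that position is fixed in the recovery sequence.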

Cost-based Heterogeneous Recovery (CHR) Algorithm: Intuition
Ø Step 1: initialize a bitmap to track all possible min-read recovery sequences R
Ø Step 2: compute the recovery cost of R
Ø Step 3: mark all shifted and reversed sequences of R as being enumerated
Ø Step 4: switch to another R; return the one with minimum cost

Example
(Figure: a proxy, a new node, and nodes 0 to 7, with link bandwidths of 26, 68, 86, 109, 110, 110, and 113 Mbps.)
Ø Symbols read from nodes 1 to 7 (node 0 is the failed node):
• Our proposed CHR algorithm: 3, 5, 4, 4, 5, 3, 3
• Hybrid approach [Xiang, ToS'11]: 5, 4, 3, 3, 4, 5, 3

Recovery Cost Comparison
Ø CHR vs. hybrid approach: recovery cost reduced by 25.89%
Ø CHR vs. conventional approach: recovery cost reduced by 40.91%

Simulation Studies (1): Traverse Efficiency
Ø Evaluate the computational time of CHR
p     Naive traverse time (ms)   CHR's traverse time (ms)   Improved rate (%)
5     0.0220                     0.0100                     54.55
7     0.0950                     0.0310                     67.37
11    2.3160                     0.3910                     83.12
13    11.9840                    1.6150                     86.52
17    107.7410                   10.0790                    90.65
19    455.2760                   40.5370                    91.10
23    9230.7800                  691.2800                   92.51
29    752296.2700                45423.5570                 93.96
Ø CHR reduces the traverse time of the naive approach by over 90% once p reaches 17!
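The improved rate is consistent with the relative reduction in traverse time. The symbols T_naive and T_CHR are introduced here for illustration and do not appear on the slide:

\[ \text{Improved rate} = \frac{T_{\text{naive}} - T_{\text{CHR}}}{T_{\text{naive}}} \times 100\% \]

For p = 5, this gives (0.0220 - 0.0100) / 0.0220 ≈ 54.55%, and for p = 29, (752296.2700 - 45423.5570) / 752296.2700 ≈ 93.96%.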

Simulation Studies (2): Robustness Efficiency
Ø Evaluate if CHR achieves the global optimum among all the feasible recovery sequences
p     Probability of hitting the global optimum (%)   Max improvement of the global optimum (%)
5     94.9                                            6.12
7     94.5                                            5.54
11    93.6                                            5.98
13    93.2                                            6.46
17    92.8                                            5.97
19    93.1                                            5.73
Ø CHR has a very high probability (92.8% or more in these runs) of hitting the globally optimal recovery cost!

Simulation Studies (3): Recovery Efficiency
Ø Evaluate, via 100 runs for each p, the recovery efficiency of CHR in a heterogeneous storage environment
• CHR can reduce recovery cost by up to 50% over the conventional approach
• CHR can reduce recovery cost by up to 30% over the hybrid approach

Experiments
Ø Experiments on a networked storage testbed
• Conventional vs. Hybrid vs. CHR
• Default chunk size = 1 MB
• Communication via ATA over Ethernet (AoE)
• Consider two codes: RDP and EVENODD
• Only RDP results shown in this talk
Ø Recovery operation:
• Read chunks from surviving nodes
• Reconstruct lost chunks
• Write reconstructed chunks to a new node
(Figure: the recovery process, with storage nodes connected through a Gigabit switch.)

Experiments
Ø Two types of Ethernet interface cards used by the physical storage devices
• 100 Mbps: set weight = 1/(100 Mbps)
• 1 Gbps: set weight = 1/(1 Gbps)
Ø Configuration for RDP code:
p     Total # of nodes   # of nodes with 100 Mbps   # of nodes with 1 Gbps
5     6                  2                          4
7     8                  3                          5
11    12                 5                          7
13    14                 6                          8
17    18                 9                          9
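The weight assignment can be expressed as a small lookup. This sketch copies the node counts from the configuration table; the ordering of slow and fast nodes, the unit (per Mbps), and the names `RDP_CONFIG` and `node_weights` are assumptions, since the slides do not say which nodes get which interface card.

```python
# p -> (total nodes, nodes with 100 Mbps, nodes with 1 Gbps), from the table.
RDP_CONFIG = {
    5: (6, 2, 4),
    7: (8, 3, 5),
    11: (12, 5, 7),
    13: (14, 6, 8),
    17: (18, 9, 9),
}

def node_weights(p):
    """Per-node weights, the inverse of the link bandwidth in Mbps."""
    total, slow, fast = RDP_CONFIG[p]
    # Assumption: slow nodes are listed first; the real placement is unknown.
    weights = [1 / 100] * slow + [1 / 1000] * fast
    if len(weights) != total:
        raise ValueError("node counts do not add up to the total")
    return weights

weights = node_weights(11)
```

With these weights a symbol read from a 100 Mbps node costs ten times as much as one read from a 1 Gbps node.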

Different Number of Storage Nodes
Ø Total recovery time for RDP
• CHR improves on the conventional approach by 21-31%
• CHR improves on the hybrid approach by 15-20%

Different Chunk Size
Ø Total recovery time for RDP (p = 11)
• CHR improves on the conventional approach by 18-26%
• CHR improves on the hybrid approach by 14-19%

Different Failed Nodes
Ø Total recovery time for RDP (p = 11)
• CHR still outperforms the conventional and hybrid approaches

Conclusions
Ø Address single-node failure recovery for RAID-6 coded heterogeneous storage systems
Ø Formulate a computation-efficient optimization model
Ø Propose a cost-based heterogeneous recovery algorithm
Ø Validate the effectiveness of the CHR algorithm through extensive simulations and testbed experiments
Ø Future work:
• Different cost formulations
• Extension for general XOR-based erasure codes
• Degraded reads
Ø Source code:
• http://ansrlab.cse.cuhk.edu.hk/software/chr/

Backup

Cost-based Heterogeneous Recovery (CHR) Algorithm
Notation:
F        A bitmap that identifies if a min-read recovery sequence has been enumerated
R, C     A min-read recovery sequence with its recovery cost
R*, C*   The min-cost recovery sequence with the minimum total recovery cost
Algorithm:
1 Initialize F[0 … 2^(p-1) - 1] with 0-bits; initialize R with 1-bits followed by 0-bits; initialize R* with R; initialize C* with MAX_VALUE
2 If R is null, go to Step 4. Convert R into integer value v; if R has already been enumerated, go to Step 3. Mark all the shifted and reversed recovery sequences of R as being enumerated; calculate the recovery cost C of R; update R* and C* if necessary
3 Get the next min-read recovery sequence R and go to Step 2
4 Finally, initialize R with all 0-bits; calculate the recovery cost C of R; update R* and C* if necessary
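The four steps can be sketched in Python as below. This is an illustrative sketch, not the authors' released code. It assumes the standard RDP layout with node 0 failed, and it assumes that the min-read sequences are those with exactly (p-1)/2 ones, which agrees with the counts in the enumeration table. The weights and all function names are hypothetical.

```python
from itertools import combinations

def distribution(p, x):
    """Reads per surviving node (nodes 1..p); x holds x_0..x_{p-2}."""
    full = tuple(x) + (0,)
    per_node = [sum(1 for r in range(p - 1)
                    if full[r] == 0 or full[(r + c) % p] == 1)
                for c in range(1, p)]
    return per_node + [sum(full)]

def to_int(x):
    # x_0 is the least significant bit, so (1, 1, 1, 0, 0, 0) maps to 7.
    return sum(bit << i for i, bit in enumerate(x))

def equivalents(x):
    """Valid cyclic shifts and reversals of x padded with the trailing 0."""
    full = list(x) + [0]
    for base in (full, full[::-1]):
        for s in range(len(full)):
            rot = base[s:] + base[:s]
            if rot[-1] == 0:            # the last entry must stay 0
                yield tuple(rot[:-1])

def chr_search(p, weights):
    def cost(x):
        return sum(w * y for w, y in zip(weights, distribution(p, x)))

    seen = bytearray(1 << (p - 1))            # Step 1: bitmap F
    best_x, best_cost = None, float("inf")    # R*, C*
    for ones in combinations(range(p - 1), (p - 1) // 2):
        x = tuple(1 if i in ones else 0 for i in range(p - 1))
        if seen[to_int(x)]:                   # Step 2: already enumerated
            continue
        for e in equivalents(x):              # mark shifts and reversals
            seen[to_int(e)] = 1
        c = cost(x)
        if c < best_cost:
            best_x, best_cost = x, c
    zero = (0,) * (p - 1)                     # Step 4: all-zero sequence
    if cost(zero) < best_cost:
        best_x, best_cost = zero, cost(zero)
    return best_x + (0,), best_cost

weights = [10, 1, 1, 1, 1, 10, 1]             # hypothetical per-node costs
best, best_cost = chr_search(7, weights)
print(best, best_cost)   # (1, 0, 1, 0, 1, 0, 0) 81
```

With these made-up weights, nodes 1 and 6 are expensive, so the search settles on the sequence that reads only 3 symbols from each of them.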

Example
(Figure: a proxy, a new node, and nodes 0 to 7, with link bandwidths of 26, 68, 86, 109, 110, 110, and 113 Mbps; symbols read from nodes 1 to 7: 3, 5, 4, 4, 5, 3, 3.)
Step 1: Initialize F[0..63] with 0-bits, R = {1110000}, and the recovery cost C* = MAX_VALUE
Step 2: F[7] = 1; mark R's shifted and reversed recovery sequences: F[56] = F[28] = F[14] = 1; calculate the recovery cost for R, which gives C = 0.7353α; update R*, C* with R, C
Step 3: Get the next min-read recovery sequence R and go to Step 2
Step 4: Finally, we find that R* = {1010100} and C* = 0.5449α

Recovery Cost Comparison
Ø CHR vs. hybrid approach: recovery cost reduced by 25.89%
Ø CHR vs. conventional approach: recovery cost reduced by 40.91%
(Figure: symbols read from nodes 1 to 7 under the hybrid approach: 5, 4, 3, 3, 4, 5, 3.)

Different Number of Storage Nodes
Ø Consider the overall performance of the complete recovery operation for EVENODD

Different Chunk Size
Ø Evaluate the impact of chunk size for EVENODD on the recovery time performance

Different Failed Nodes
Ø Evaluate the recovery time performance for EVENODD when the failed node is in a different column