General Feature Selection for Failure Prediction in Largescale

Motivation Ø Solid-state drive (SSD) failures are prevalent in production Ø Machine learning techniques

Our Contribution Ø Measurement study on nearly 500 K SSDs at Alibaba • Feature

Data Collection Ø Collect nearly 500 K SSDs of 6 drive models from 3

Background Ø SSD Failure prediction • • Offline classification problem View raw and normalized

Feature Importance Characterization Ø Feature importance for SSD failure prediction • Feature importance: effectiveness

Feature Importance Characterization Ø Feature importance using different selection approaches • Rankings of top

Feature Importance Characterization Ø Feature importance with different wear-out degrees • Use normalized value

Feature Importance Characterization Ø Feature importance with different wear-out degrees • Divide SSDs into

WEFR Ø Wear-out-updating Ensemble Feature Ranking • General feature selection for SSD failure prediction

WEFR Ø Workflow • Input: all features of SMART attributes from the same drive

WEFR Ø Robust feature selection • Perform preliminary feature selection approaches • Examine the

Evaluation Ø Methodology • The first 21, 22, and 23 months as training phases

Evaluation Ø Effectiveness of robust feature selection • Overall, WEFR improves F 0. 5

Evaluation Ø Effectiveness of automated feature selection • F 0. 5 -score of WEFR

Evaluation Ø Effectiveness of updating feature selection • WEFR improves precision and F 0.

Evaluation Ø Runtime of state-of-the-art feature selection and WEFR • Average runtime with MC

Conclusion Ø WEFR, general feature selection to select features from SMART attributes • Robust

Slides: 19

Download presentation

General Feature Selection for Failure Prediction in Large-scale SSD Deployment Fan Xu 1, Shujie Han 2, Patrick P. C. Lee 2, Yi Liu 1, Cheng He 1, and Jiongzhou Liu 1 1 Alibaba 2 The Group Chinese University of Hong Kong 1

Motivation Ø Solid-state drive (SSD) failures are prevalent in production Ø Machine learning techniques complement traditional redundancy protection schemes • Self-Monitoring Analysis and Reporting Technology (SMART) logs Ø Feature selection for general SSD failure prediction • One feature selection approach may be inapplicable for all drive models • Various feature selection approaches may select different sets of features • Considering SSD characteristics, how to select features? 2

Our Contribution Ø Measurement study on nearly 500 K SSDs at Alibaba • Feature importance characterization Ø WEFR, general feature selection for SSD failure prediction • An automated and robust manner • Update features with wear-out degree Ø Evaluation on nearly 500 K SSDs at Alibaba • Improve F 0. 5 -score by 10% (4 -6%) compared to no feature selection (existing approaches) with comparable runtime performance Ø Release our dataset for public use 3

Data Collection Ø Collect nearly 500 K SSDs of 6 drive models from 3 vendors over two-year span (Jan. 2018 - Dec. 2019) at Alibaba Ø Data types: SMART logs and trouble tickets 4

Background Ø SSD Failure prediction • • Offline classification problem View raw and normalized values of each SMART attribute as two features Input variables: a vector of features Output variable: drive status (0 means healthy and 1 means failed) Ø Feature selection approaches • Pearson correlation, Spearman correlation, J-index, Random forest (RF), and XGBoost 5

Feature Importance Characterization Ø Feature importance for SSD failure prediction • Feature importance: effectiveness of features for predicting SSD failures • Feature importance scores for all features using Random forest • Different correlations between features and SSD failures • Trivial features as noises may decrease the prediction accuracy 6

Feature Importance Characterization Ø Feature importance using different selection approaches • Rankings of top five important features vary across feature selection approaches • Which feature selection approach is more effective? • How many important features should we select? 7

Feature Importance Characterization Ø Feature importance with different wear-out degrees • Use normalized value of media wear-out indicator (MWI_N) • Measure survival rate on each value in the range of MWI_N • Survival rate changes with MWI_N for MA 1, MA 2, MC 1, and MC 2 • SSD characteristics may change with the wear-out degree Fig. 1: Relationships between the survival rate and MWI_N. Green points are change points detected using the Bayesian change point detection. We omit MB 1 and MB 2 here due to a small range of MWI_N. 8

Feature Importance Characterization Ø Feature importance with different wear-out degrees • Divide SSDs into two groups by the threshold of MWI_N • Top five important features vary across different groups of MWI_N • Need to update features when predicting SSD failures with MWI_N 9

WEFR Ø Wear-out-updating Ensemble Feature Ranking • General feature selection for SSD failure prediction Ø Challenges and key techniques • Robust feature selection • Combine different feature selection approaches • Automated feature selection • Determine the number of selected features automatically • Updating feature selection • Select features with different values of MWI_N • Update the selected features over time 10

WEFR Ø Workflow • Input: all features of SMART attributes from the same drive model • Robust feature selection • Perform preliminary feature selection • Filter out the outlier feature selection approach by checking similarities of rankings • Automated feature selection • Automatically determine the feature count • Updating feature selection • Detect change points for survival rate • Update selected features with wear-out degree • Output: selected features 11

WEFR Ø Robust feature selection • Perform preliminary feature selection approaches • Examine the similarity between rankings of preliminary feature selection • Filter out the outlier feature selection approach Ø Automated feature selection • Measure feature effectiveness and determine feature count automatically Ø Updating feature selection • Detect change points of survival rate and divide SSDs into different groups of MWI_N • Update features for each group of MWI_N periodically 12

Evaluation Ø Methodology • The first 21, 22, and 23 months as training phases for the 22 th, 23 th, and 24 th month as testing phases • Set ratio of training period length to validation period length as 8: 2 • Use Random Forest as prediction model Ø Metrics • Precision, recall, F 0. 5 -score • F 0. 5 -score is to weigh precision twice as important as recall, so as to reduce the cost of replacing a healthy SSD 13

Evaluation Ø Effectiveness of robust feature selection • Overall, WEFR improves F 0. 5 -score by 10% compared to no feature selection • WEFR generally improves F 0. 5 -score by 4 -6% compared to the state-ofthe-art feature selection approaches 14

Evaluation Ø Effectiveness of automated feature selection • F 0. 5 -score of WEFR is always higher than or equal to the highest F 0. 5 score when fixing the percentage of selected features 15

Evaluation Ø Effectiveness of updating feature selection • WEFR improves precision and F 0. 5 -score with updating selected features for MA 1, MA 2, MC 1, and MC 2 compared to no updating feature selection 16

Evaluation Ø Runtime of state-of-the-art feature selection and WEFR • Average runtime with MC 1 over 20 rounds • Spearman correlation is the slowest one among state-of-the-art approaches • WEFR executes state-of-the-art approaches in parallel, so its runtime is comparable with those approaches 17

Conclusion Ø WEFR, general feature selection to select features from SMART attributes • Robust feature selection • Determine feature count automatically • Update features with wear-out degree Ø Evaluation on nearly 500 K SSDs • WEFR generally improves the prediction accuracy across six drive models Ø Dataset: • https: //github. com/alibaba-edu/dcbrain/tree/master/ssd_smart_logs 18

Thank You! Q&A 19