Understanding Real World Data Corruptions in Cloud Systems
Peipei Wang, Daniel Dean, Xiaohui Gu
Computer Science, North Carolina State University

Motivation

HDFS Background
1. HDFS write operation
2. Log this operation, log block location
3. Return three block locations
[Figure labels: Client (data, memory); NameNode (memory; disk: HDFS system files); DataNode A, DataNode B, DataNode C (memory; disk: block, block metadata). Metadata fields shown: block size, checksum, timestamp, version.]
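
The block metadata in the figure includes a checksum. As a minimal sketch of how a stored checksum can expose a corrupted block when it is read back, assuming a single CRC32 over the whole block (the slide does not show HDFS's actual checksum layout, and `BlockChecksumSketch` is a hypothetical name, not HDFS code):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical illustration of checksum-based block verification; not HDFS code. */
final class BlockChecksumSketch {
    /** Checksum that would be stored in the block metadata at write time. */
    static long checksumOf(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** True when the block read back still matches the stored checksum. */
    static boolean matches(byte[] block, long storedChecksum) {
        return checksumOf(block) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "block contents".getBytes(StandardCharsets.UTF_8);
        long stored = checksumOf(block);
        System.out.println("intact copy matches: " + matches(block, stored));
        block[0] ^= 0x01; // flip one bit to simulate corruption on disk
        System.out.println("corrupted copy matches: " + matches(block, stored));
    }
}
```

A check like this only helps if it runs on every path that reads the block, and if the checksum itself was computed from correct data.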

Methodology
▪ Randomly sampled 138 Hadoop bug incidents that are related to data corruption
  – All incidents are resolved bug incidents
  – Manually studied each bug report (e.g., bug descriptions, patches)

System name        System file corruption  Metadata corruption  Block corruption  Misreported corruption
Hadoop 1.x         15                      11                   46                4
Hadoop 2.x (YARN)  1                       0                    7                 0
HDFS 1.x           17                      7                    23                7
HDFS 2.x           8                       22                   10                0

Outline
▪ State of the art
▪ Research goals
▪ Data corruption impact
▪ Data corruption detection
▪ Data corruption causes
▪ Data corruption handling
▪ Key findings
▪ Future work
▪ Conclusion

State of the Art
§ Data corruption studies [Zhang et al. FAST '10, Schroeder et al. FAST '07]
  - Focused on hardware-induced data corruption problems
§ Data corruption detection frameworks [Yang et al. OSDI '06, Subramanian et al. ICDE '10]
  - Reactive approaches, for stand-alone systems (e.g., file system)
§ Bug characteristic studies [Jin et al. PLDI '12, Lu et al. ASPLOS '08]
  - Focused on software bugs (e.g., performance bugs, concurrency bugs)

Research Goals
§ Understand real-world software-induced data corruptions
  - What impact can data corruption have on the application and system?
  - How is data corruption detected?
  - What are the causes of the data corruption?
  - What problems can occur while attempting to handle data corruption?

Data Corruption Impact on System
▪ Integrity: block, metadata, system file
▪ Availability: Hadoop failures, MapReduce job failures
▪ Performance: time delay, decreased throughput

Data Corruption Impact Examples
HDFS-3277: fsimage load failure
HDFS-2798: Thread cannot complete file operation
[Figure labels: NameNode, memory, file system state, disk, fsimage, Hadoop failures; block, appending, scanner, memory, disk, block metadata, matched, unmatched, job failures, time delay]

Data Corruption Detection
[Pie chart]
  Correct corruption detection    25%
  Silent data corruption          42%
  Imprecise corruption detection  21%
  Misreported corruption          12%
Existing data corruption detection schemes are insufficient

Data Corruption Detection Example
HDFS-1483: silent data corruption
[Figure: the client calls getBlockLocations() on the NameNode and receives the block location on DataNode A/B/C. DataNode A, DataNode B, and DataNode C each hold a copy of the block; the legend distinguishes corrupted and uncorrupted blocks.]
Client does not know about the block corruption
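
The figure suggests that the returned locations carry no indication of which replicas are corrupted. A minimal sketch of the idea of attaching a corruption flag to each location so the client can skip bad replicas; the `Replica` type and `readableNodes` helper are hypothetical and are not the HDFS client API:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; these types are not the HDFS client API. */
final class ReplicaLocationSketch {
    /** One replica location, with a flag saying whether that copy is known to be corrupt. */
    static final class Replica {
        final String dataNode;
        final boolean corrupt;

        Replica(String dataNode, boolean corrupt) {
            this.dataNode = dataNode;
            this.corrupt = corrupt;
        }
    }

    /** Returns the DataNodes whose replica is not marked corrupt, in the original order. */
    static List<String> readableNodes(List<Replica> replicas) {
        List<String> nodes = new ArrayList<>();
        for (Replica r : replicas) {
            if (!r.corrupt) {
                nodes.add(r.dataNode);
            }
        }
        return nodes;
    }
}
```

Without such a flag, the client in the figure has no way to tell a corrupted location from a healthy one before reading.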

Data Corruption Detection Example
HDFS-1524: Misreported data corruption
[Figure: the NameNode loads the compressed fsimage from disk into memory; 4 bytes of compression-related information are left unread.]
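
The slide does not say how the unread bytes turn into a corruption report. One plausible mechanism, stated here as an assumption, is a digest computed over the bytes the loader actually read: if trailing bytes are skipped, the digest no longer matches the one recorded for the whole file, even though the file is intact. A minimal sketch with hypothetical names and a made-up 20-byte file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical illustration; not the NameNode image loader. */
final class DrainBeforeDigestSketch {
    /** Digest over every byte of the file, as recorded when the file was written. */
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads payloadLength bytes, optionally drains the rest, and digests what was read. */
    static byte[] loadAndDigest(byte[] file, int payloadLength, boolean drain) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (DigestInputStream in =
                     new DigestInputStream(new ByteArrayInputStream(file), md)) {
                byte[] payload = new byte[payloadLength];
                int off = 0;
                while (off < payloadLength) {
                    int n = in.read(payload, off, payloadLength - off);
                    if (n < 0) throw new IOException("file shorter than expected");
                    off += n;
                }
                if (drain) {
                    byte[] rest = new byte[64];
                    while (in.read(rest) != -1) {
                        // discard: the bytes only need to pass through the digest
                    }
                }
            }
            return md.digest();
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = new byte[20]; // 16 payload bytes plus 4 trailing bytes
        byte[] recorded = md5(file);
        System.out.println(Arrays.equals(recorded, loadAndDigest(file, 16, false)));
        System.out.println(Arrays.equals(recorded, loadAndDigest(file, 16, true)));
    }
}
```

Draining the stream before comparing digests is one way to keep an intact file from being flagged as corrupt.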

Data Corruption Causes

Cause                              Number of incidents
Improper runtime checking          25
Race condition                     26
Inconsistent state                 16
Improper network failure handling  5
Improper node crash handling       10
Incorrect name/value               5
Lib/command errors                 4
Compression-related errors         4
Incorrect data movement            2

Data Corruption Causes Example
HDFS-3626: Improper runtime check given invalid file path
Command with invalid path:
    hadoop fs -put filename hdfs://localhost:8020//temp/filename
edits.log:
    Mkdir (path=/)
    Mkdir (path=//temp)    (illegal operation)
    Add block
    Set timestamp
    Update block
    …
Hadoop failed to load edits.log
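
In this example a path containing `//` reaches the edit log and later prevents it from loading. A minimal sketch of the kind of runtime check that would reject such a path before anything is logged; `PathCheckSketch` and its rules are hypothetical and are not the check Hadoop applies:

```java
/** Hypothetical illustration; not the validation Hadoop performs. */
final class PathCheckSketch {
    /** Accepts "/" and absolute paths with no empty, "." or ".." components. */
    static boolean isValidPath(String path) {
        if (path == null || !path.startsWith("/")) {
            return false;
        }
        if (path.equals("/")) {
            return true;
        }
        // Limit -1 keeps trailing empty strings, so "/temp/" is rejected too.
        String[] parts = path.substring(1).split("/", -1);
        for (String part : parts) {
            if (part.isEmpty() || part.equals(".") || part.equals("..")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidPath("/temp/filename"));
        System.out.println(isValidPath("//temp/filename"));
    }
}
```

Rejecting the request at this point keeps the invalid operation out of the log, which matters because the log is replayed on restart.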

Data Corruption Causes Example
HADOOP-3069: Improper network failure handling

    try {
        …
        TransferFsImage.getFileServer(response.getOutputStream(), nn.getFsImageName());
        …
    } catch (IOException e) {
        …
        response.sendError(…);
    }

Secondary NameNode

    void getFileServer(outstream, …) {
        try {
            …
            outstream.write(buf, 0, num);
            …
        } finally {
            outstream.close();
            …
        }
    }
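
One reading of the slide is that `getFileServer` closes `outstream` in its `finally` block even when the write fails, so the receiving side may see an ordinary end of stream and treat a truncated image as complete. A minimal sketch of one possible guard on the receiving side, assuming the sender advertises the file length separately; the class, the method names, and the length parameter are hypothetical, and this is not presented as the fix applied in HADOOP-3069:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical illustration of a length-checked transfer; not Hadoop code. */
final class LengthCheckedCopySketch {
    /** Copies the stream and fails if the byte count differs from the advertised length. */
    static long copyExactly(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        if (total != expectedLength) {
            throw new IOException(
                "expected " + expectedLength + " bytes, received " + total);
        }
        return total;
    }

    /** In-memory convenience wrapper: true only if the transfer was complete. */
    static boolean transferSucceeds(byte[] received, long advertisedLength) {
        try {
            copyExactly(new ByteArrayInputStream(received),
                        new ByteArrayOutputStream(), advertisedLength);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With a check like this, a clean close after a partial write surfaces as an error instead of silently replacing a good image with a truncated one.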

Existing Data Corruption Handling Schemes
Data corruption handling:
▪ Data replication
▪ Data recovery
▪ Data deletion
▪ Simple re-execution

Problems in Data Corruption Handling Schemes
HDFS-4799: Incorrect data deletion
[Figure labels: NameNode; DataNode A, DataNode B, DataNode C, DataNode D, DataNode E, DataNode F; reboot]
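
The slide gives only the title of this incident, so the following is a generic illustration rather than a description of HDFS-4799. It sketches one conservative rule a handler could apply before deleting a replica: delete only if some other live replica is not marked corrupt. All names are hypothetical.

```java
import java.util.Set;

/** Hypothetical illustration of a conservative deletion guard; not HDFS code. */
final class SafeDeletionSketch {
    /**
     * Returns true only if at least one live node other than the target
     * holds a replica that is not marked corrupt.
     */
    static boolean maySafelyDelete(String target, Set<String> liveNodes,
                                   Set<String> corruptNodes) {
        for (String node : liveNodes) {
            if (!node.equals(target) && !corruptNodes.contains(node)) {
                return true;
            }
        }
        return false;
    }
}
```

A rule of this kind trades storage for safety: a suspect replica is kept until a healthy copy is confirmed to be reachable, for example after rebooted nodes have reported back.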

Key Findings
▪ The impact of data corruption is not limited to data integrity
▪ Existing data corruption detection schemes are quite insufficient
▪ There are various causes of data corruption
▪ Existing data corruption handling mechanisms make frequent mistakes

Future Work
▪ Data corruption detection schemes
  – Trace data-related operations
  – Anomaly detection over the operation logs
  – Advantages: proactive
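
The slide does not specify a detection technique. As a simple stand-in chosen for illustration, the sketch below learns which operation-to-operation transitions occur in traces assumed to be healthy and flags any transition it has never seen. The class name and the operation names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of transition-based anomaly detection over an operation log. */
final class OperationLogAnomalySketch {
    private final Set<String> seenTransitions = new HashSet<>();

    /** Records every adjacent pair of operations from a trace assumed to be healthy. */
    void learn(List<String> trace) {
        for (int i = 1; i < trace.size(); i++) {
            seenTransitions.add(trace.get(i - 1) + "->" + trace.get(i));
        }
    }

    /** Returns the indices of operations reached through a transition never seen in training. */
    List<Integer> anomalies(List<String> trace) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 1; i < trace.size(); i++) {
            if (!seenTransitions.contains(trace.get(i - 1) + "->" + trace.get(i))) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

Because the check runs on the operation trace rather than on stored data, it can raise a flag before a corrupted result is written or read, which is the proactive property the slide names.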

Conclusion
▪ Characteristic study of 138 real-world data corruption incidents
  – Software-induced data corruptions are prevalent
  – Data corruption detection schemes need to be improved
  – Replication cannot completely solve data corruption problems
  – Data corruption handling schemes may introduce other issues (e.g., mistaken block deletion, resource hogging)

Thank you!