SRE Data Durability Current State and Future Plans
Maria Arsuaga-Rios (IT-ST-FDO)

Index
• Goal and workflow
• Classifying and repairing tool
• Sources detection and evolution
• Main plans and actions

Eos-ops-repair
• Code & automatic rpm creation: https://gitlab.cern.ch/eos-ops-durability
• Installed via puppet in all mgms
• Running in mgms or from rundeck
• Output sent to:
  – Cernbox: https://cernbox.cern.ch/index.php/apps/files/?dir=/__myprojects/eos/Durability&
  – Elasticsearch: https://eseosmon.cern.ch/kibana/app/kibana#/discover
• Grafana for monitoring: https://filercarbon.cern.ch/grafana/d/JzDQWU7Zz/durability-classification

Eos-ops-repair
• How it works:
  1. Classify the files per problem by providing the list of fids or pathnames.
  2. Each category produces one file listing the affected fids.
  3. --repair: Repair by category, running the corresponding repair method.
  4. --send: Send the data to ES (the data is always sent to Cernbox).
• Example:
  eos-ops-repair files_to_repair_<date> -i <instance: eg lhcb> --id_type path/dec --repair --send
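
The invocation above could be assembled by a wrapper, for instance from a rundeck job. The following is a minimal sketch under stated assumptions: it uses only the options shown on the slide (`-i`, `--id_type`, `--repair`, `--send`), it assumes that `path/dec` means "either `path` or `dec`", and the helper name and the sample file name are hypothetical.

```python
import shlex


def build_repair_command(file_list, instance, id_type="path", repair=False, send=False):
    """Assemble an eos-ops-repair argument vector (hypothetical helper)."""
    # Assumption: the slide's "path/dec" is a choice between two id types.
    if id_type not in ("path", "dec"):
        raise ValueError(f"unsupported id type: {id_type!r}")
    argv = ["eos-ops-repair", file_list, "-i", instance, "--id_type", id_type]
    if repair:
        argv.append("--repair")  # repair by category after classifying
    if send:
        argv.append("--send")    # additionally send the data to ES
    return argv


# Placeholder input file name; the real naming scheme is files_to_repair_<date>.
cmd = build_repair_command("files_to_repair_example", "lhcb", repair=True, send=True)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.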

Eos-ops-repair
• What does it classify? 17 categories:
  – ALL REP CORRUPTED MATCH XS AND SIZE
  – ALL REP OK
  – NO_REP
  – MULTIPLE REP NS XS 0 OR 1
  – ALL REP CORRUPTED
  – ONE REP OK
  – ONE REP NS XS 0 OR 1
  – ONE REP CORRUPTED
  – REP EXCEEDED ONE OK
  – AT LEAST ONE REP OK
  – REMOVED FILE “found deletion tombstone”
  – FILE NOT EXIST
  – ALL REP CORRUPTED MATCH XS NO SIZE
  – ZERO FILE SIZE
  – MISSING CONTAINER “Error while fetching ContainerMD”
  – LOST&FOUND
  – BIG FILE (>3 m timeout)
• Groups in the slide legend: Possible loss, Nothing to repair, Repaired automatically, Corner cases, Other filters

Eos-ops-repair: repaired automatically or no repair needed (9.5 cases)
• Actions shown: Nothing to repair, Adjustreplica, File verify & Adjustreplica
• Categories:
  – ALL REP OK
  – ONE REP OK
  – REP EXCEEDED ONE OK
  – ONE REP CORRUPTED
  – ALL REP CORRUPTED MATCH XS AND SIZE
  – FILE NOT EXIST
  – ONE REP NS XS 0 OR 1
  – BIG FILE (>3 m timeout) *
  – MULTIPLE REP NS XS 0 OR 1
  – ZERO FILE SIZE
• * Only repaired when it comes from the one-replica detection.
• If the automatic repair cannot fix a file, the file is classified with the suffix “no_repaired”.

Eos-ops-repair: to investigate (6.5 cases)
• Labels shown: Possible loss, Possible bugs (from older or new versions), Possible to be repaired automatically
• Categories:
  – NO_REP
  – AT LEAST ONE REP OK
  – MISSING CONTAINER “Error while fetching ContainerMD”
  – BIG FILE (>3 m timeout)
  – REMOVED FILE “found deletion tombstone”
  – ALL REP CORRUPTED
  – LOST&FOUND
  – ALL REP CORRUPTED MATCH XS NO SIZE
• * Not considering ghost and aborted files
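
The split between "repaired automatically or nothing to repair" and "to investigate" can be written down as plain data. The sketch below only restates the category names from the slides; the set names and the `handling` helper are hypothetical, and it says nothing about how eos-ops-repair is actually implemented.

```python
# Categories listed on the "repaired automatically or no repair needed" slide.
AUTO_OR_NOTHING = {
    "ALL REP OK",
    "ONE REP OK",
    "REP EXCEEDED ONE OK",
    "ONE REP CORRUPTED",
    "ALL REP CORRUPTED MATCH XS AND SIZE",
    "FILE NOT EXIST",
    "ONE REP NS XS 0 OR 1",
    "BIG FILE",  # listed on both slides
    "MULTIPLE REP NS XS 0 OR 1",
    "ZERO FILE SIZE",
}

# Categories listed on the "to investigate" slide.
TO_INVESTIGATE = {
    "NO_REP",
    "AT LEAST ONE REP OK",
    "MISSING CONTAINER",
    "BIG FILE",  # listed on both slides
    "REMOVED FILE",
    "ALL REP CORRUPTED",
    "LOST&FOUND",
    "ALL REP CORRUPTED MATCH XS NO SIZE",
}


def handling(category):
    """Return the slide group(s) in which a category appears."""
    groups = []
    if category in AUTO_OR_NOTHING:
        groups.append("auto")
    if category in TO_INVESTIGATE:
        groups.append("investigate")
    if not groups:
        raise KeyError(category)
    return groups


print(len(AUTO_OR_NOTHING | TO_INVESTIGATE))  # number of distinct categories
print(handling("BIG FILE"))
```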

Sources detection
– One replica files: detected every day
– Faulty files in backup: detected every day
– Faulty files in drain: detected manually
– One replica layout: detected every day
– Unlinked files: detected every day
– Exceeded replicas: detected every day
– Cursed names: detected every day
– Directories without attributes: ongoing

One replica files
• Detected and repaired automatically every day from rundeck in all instances.
• Repaired automatically:
  – One replica and corrupted
  – One replica and corrupted with xs 0 in namespace
• Possible loss:
  – One replica in namespace but physically missing

One replica files
• Detected and repaired automatically every day from rundeck in all instances.
• Repairs reduced the number of one-replica files by 99.29% (except Alice), from 140 k to 1 k.
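
As a quick check of the quoted figure, the relative decrease can be recomputed from the two rounded counts on the slide (140 k and 1 k). The function name is illustrative only.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Rounded counts from the slide: 140 k files before, 1 k after.
print(round(percent_decrease(140_000, 1_000), 2))  # 99.29
```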

One replica files
• However, new one-replica files keep appearing.
• Action: Investigate the root cause of this issue.
• How?: We need a team of two people (one developer and one operator (myself)).

One replica files
• Alice needs a faster clean-up; going to the end would be a decrease of 99.98%.
• Action: Separate job just for Alice.

Faulty files in backup
• Classified automatically from rundeck for all home instances.
• Not repaired: we should understand the issue first.
• Action: We need a team of two people (one developer and one operator (myself)).
• Categories shown: NO_REP, ALL REP CORRUPTED MATCH XS AND SIZE, ALL REP CORRUPTED, MULTIPLE REP NS XS 0 OR 1, ALL REP OK

Faulty files in backup
• Why does the backup not detect one-replica files?

Faulty files in backup
• Why does the backup not detect not-all-replica files?

Faulty files in drain
• Adding draining metrics to our probe in graphite to collect:
  – Number of files and filesystems when disk booted.
  – Number of files and filesystems with bootfailure.
  – Number of files and filesystems with opserr.
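
The three pairs of counters above could be emitted in Graphite's plaintext format, one `<path> <value> <timestamp>` line per metric. This is a minimal sketch, not the actual probe: the metric prefix, the state keys and the sample values are all made up for illustration.

```python
import time


def graphite_lines(prefix, counts, timestamp=None):
    """Format per-state (files, filesystems) counts as Graphite plaintext lines."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    lines = []
    for state, (files, filesystems) in sorted(counts.items()):
        lines.append(f"{prefix}.{state}.files {files} {ts}")
        lines.append(f"{prefix}.{state}.filesystems {filesystems} {ts}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: state -> (number of files, number of filesystems).
sample = {"booted": (1200, 3), "bootfailure": (40, 1), "opserr": (7, 1)}
print(graphite_lines("eos.drain.example", sample, timestamp=1600000000), end="")
```

The returned string can be written as-is to a socket connected to a Carbon plaintext listener.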

Faulty files in drain
• Original faulty files classified
• Files that could not be repaired by the automatic repair
• What about making it automatic from Winston?

One replica layout
• Motivation: Possible loss
• Status: detected every day
• Plan: To be fixed automatically (repaired in CMS)

Exceeded replicas and Unlinked replicas
• Motivation: Wasting space
• Status: detected every day for all instances
• Plan: To be fixed automatically

Cursed-names (., /)
• Motivation: Cannot be accessed or removed
• Status: detected every day (except Alice)
• Plan: To be fixed automatically (rename them)

Directories without attributes
• Motivation: Possible file loss and user problems (e.g. permissions)
• Status: ongoing
• Plan: To be tested

Main Plans and Actions for next meeting
1. Two-person team for investigation (dev + op)
   a. Select first instance for checking “no rep” (Homei0?); example in backup?
   b. Create Jira tickets for each case
   c. Persons: Maria & ?
2. Repair Alice: Maria & some collaboration with Roberto?
3. Drain repair from Winston? Collaboration with Luca?