Data Quality merging and cleaning of microdata Workshop



















- Slides: 19
Data Quality: merging and cleaning of microdata Workshop on populating education indicators and standardising microdata Noumea New Caledonia 21 - 25 OCTOBER 2019
Data Quality: validation and verification • Data integration • Data validation • Data verification
Data integration Most surveys collect data from different statistical units e. g. households and persons To analyse microdata we need to integrate data from different datasets by linking data to a common ID variable e. g. Hhold_ID In this way we can analyse individual characteristics by household variables e. g. Hhold Size, Location, Sex of Head of Hhold, Wealth Index
Data integration File 2 File 1 Hhold_ID Hhold Size Person_ID Sex of Head of Hhold Person Sex Hhold Location Person Age Hhold_Income Education Attendance Hhold_Wealth Grade Attending Education Attainment Qualification
Lookup Tables Persons Households
Data Validation To validate the quality of the dataset we need to ensure that all recorded values: • are not missing or incomplete • can exist as a valid measure • are within expected range (numeric data) • are correctly labelled (categorical data)
Missing Data • Check the dataset to identify any missing values for each variable • If possible obtain the missing value by referring to the source survey or questionnaire • If not possible, estimate a plausible value based on similar records (imputation) • If missing at random, then exclude from analysis – depend on frequency of missing data
Invalid Data • Check the dataset to identify any invalid or unexpected values for each variable • If possible correct the invalid data by referring to the source survey or questionnaire • If not possible, record a valid value based on similar records (estimation) • If not able to correct, then exclude from analysis - depend on frequency of missing data
Range Checks • Check the dataset to identify any out of range values for each numeric variable • If possible correct the out-of-range data by referring to the source survey or questionnaire • If not possible, record a valid value based on similar records (estimation) • If not able to correct, then exclude from analysis - depend on frequency of missing data
Mis-labelling Data • Check the dataset to identify any mis-labelled values for each categorical variable • Correct the mis-labelled data by referring to similar records • If not able to correct, then exclude from analysis - depend on frequency of missing data
Data validation
Data Verification To verify the accuracy and correctness of micro-data we need to check for: • internal consistency • consistency over time • consistency with external sources
Internal Consistency • Check that each variable in a dataset is consistent with other variables • If there is inconsistency then business rules should be applied • For example, if age of student is inconsistent with year level than change age to reflect the expected age for that year level.
Internal Consistency – Check Records =IF(J 2 -N 2>6, "Check", "") =IF(J 2 -N 2<5, "Check", "")
Internal Consistency - Check Outliers
Consistency over time • Check that annual variation in a numeric variable is within expected levels • If there is inconsistency then investigation of the variance should be investigated and reasons documented • For example, if a school roll increases or reduces significantly then check if the change is valid and identify the reason for the change.
Consistency over time
Consistency with other data sources • Check that both categorical and numeric data is consistent with data from other data sources • If there is inconsistency then investigate the reason for the difference and make a correction if data is shown to be incorrect. • For example, if a school roll is lower than the school attendance records then adjust roll or attendance record accordingly
Consistency with other data sources