Data Quality merging and cleaning of microdata Workshop

  • Slides: 19
Download presentation
Data Quality: merging and cleaning of microdata Workshop on populating education indicators and standardising

Data Quality: merging and cleaning of microdata Workshop on populating education indicators and standardising microdata Noumea New Caledonia 21 - 25 OCTOBER 2019

Data Quality: validation and verification • Data integration • Data validation • Data verification

Data Quality: validation and verification • Data integration • Data validation • Data verification

Data integration Most surveys collect data from different statistical units e. g. households and

Data integration Most surveys collect data from different statistical units e. g. households and persons To analyse microdata we need to integrate data from different datasets by linking data to a common ID variable e. g. Hhold_ID In this way we can analyse individual characteristics by household variables e. g. Hhold Size, Location, Sex of Head of Hhold, Wealth Index

Data integration File 2 File 1 Hhold_ID Hhold Size Person_ID Sex of Head of

Data integration File 2 File 1 Hhold_ID Hhold Size Person_ID Sex of Head of Hhold Person Sex Hhold Location Person Age Hhold_Income Education Attendance Hhold_Wealth Grade Attending Education Attainment Qualification

Lookup Tables Persons Households

Lookup Tables Persons Households

Data Validation To validate the quality of the dataset we need to ensure that

Data Validation To validate the quality of the dataset we need to ensure that all recorded values: • are not missing or incomplete • can exist as a valid measure • are within expected range (numeric data) • are correctly labelled (categorical data)

Missing Data • Check the dataset to identify any missing values for each variable

Missing Data • Check the dataset to identify any missing values for each variable • If possible obtain the missing value by referring to the source survey or questionnaire • If not possible, estimate a plausible value based on similar records (imputation) • If missing at random, then exclude from analysis – depend on frequency of missing data

Invalid Data • Check the dataset to identify any invalid or unexpected values for

Invalid Data • Check the dataset to identify any invalid or unexpected values for each variable • If possible correct the invalid data by referring to the source survey or questionnaire • If not possible, record a valid value based on similar records (estimation) • If not able to correct, then exclude from analysis - depend on frequency of missing data

Range Checks • Check the dataset to identify any out of range values for

Range Checks • Check the dataset to identify any out of range values for each numeric variable • If possible correct the out-of-range data by referring to the source survey or questionnaire • If not possible, record a valid value based on similar records (estimation) • If not able to correct, then exclude from analysis - depend on frequency of missing data

Mis-labelling Data • Check the dataset to identify any mis-labelled values for each categorical

Mis-labelling Data • Check the dataset to identify any mis-labelled values for each categorical variable • Correct the mis-labelled data by referring to similar records • If not able to correct, then exclude from analysis - depend on frequency of missing data

Data validation

Data validation

Data Verification To verify the accuracy and correctness of micro-data we need to check

Data Verification To verify the accuracy and correctness of micro-data we need to check for: • internal consistency • consistency over time • consistency with external sources

Internal Consistency • Check that each variable in a dataset is consistent with other

Internal Consistency • Check that each variable in a dataset is consistent with other variables • If there is inconsistency then business rules should be applied • For example, if age of student is inconsistent with year level than change age to reflect the expected age for that year level.

Internal Consistency – Check Records =IF(J 2 -N 2>6, "Check", "") =IF(J 2 -N

Internal Consistency – Check Records =IF(J 2 -N 2>6, "Check", "") =IF(J 2 -N 2<5, "Check", "")

Internal Consistency - Check Outliers

Internal Consistency - Check Outliers

Consistency over time • Check that annual variation in a numeric variable is within

Consistency over time • Check that annual variation in a numeric variable is within expected levels • If there is inconsistency then investigation of the variance should be investigated and reasons documented • For example, if a school roll increases or reduces significantly then check if the change is valid and identify the reason for the change.

Consistency over time

Consistency over time

Consistency with other data sources • Check that both categorical and numeric data is

Consistency with other data sources • Check that both categorical and numeric data is consistent with data from other data sources • If there is inconsistency then investigate the reason for the difference and make a correction if data is shown to be incorrect. • For example, if a school roll is lower than the school attendance records then adjust roll or attendance record accordingly

Consistency with other data sources

Consistency with other data sources