Automatic Editing with Hard and Soft Edits – Some First Experiences
Sander Scholtus, Sevinç Göksen (Statistics Netherlands)

Introduction
• Error localisation problem:
  • Try to identify variables with erroneous/missing values
• Edits:
  • Constraints that should be satisfied by the data
  • Hard (fatal), e.g. Turnover − Costs = Profit
  • Soft (query), e.g. Profit / Turnover ≤ 0.6
• Manual editing: hard and soft edits
• Automatic editing: only hard edits
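As an illustration of the two edit types, the following minimal sketch checks a single record against the hard and soft example edits from this slide. The record layout, the function name `check_edits` and the edit labels are hypothetical, and the soft edit is rewritten without division on the assumption that turnover is non-negative.

```python
def check_edits(record):
    """Return the names of the hard and soft edits that the record fails."""
    hard_failed, soft_failed = [], []
    # Hard (fatal) edit from the slide: Turnover - Costs = Profit
    if record["turnover"] - record["costs"] != record["profit"]:
        hard_failed.append("balance")
    # Soft (query) edit: Profit / Turnover <= 0.6, written without division
    # (assumes turnover >= 0) so that a zero turnover does not raise an error
    if record["profit"] > 0.6 * record["turnover"]:
        soft_failed.append("profit_ratio")
    return hard_failed, soft_failed


# Hypothetical record: consistent with the hard edit, but suspicious
record = {"turnover": 100, "costs": 20, "profit": 80}
print(check_edits(record))
```

A record like this one would pass automatic editing that uses only hard edits, although a manual editor would probably query it.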

Error localisation (1)
• Fellegi and Holt (1976):
  • Find the smallest (weighted) number of variables that can be imputed so that all edits are satisfied
  • Minimise [formula not preserved] so that all edits are satisfied
  • No room for soft edits
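The Fellegi-Holt criterion described in words above is commonly written as a weighted count of changed variables. The following is a sketch of that standard form, not necessarily the exact notation of the original slide; the symbols $w_j$ (reliability weight of variable $j$), $y_j$ (1 if variable $j$ is imputed, 0 otherwise) and $D_{FH}$ are assumptions.

```latex
\min_{y \in \{0,1\}^p} \; D_{FH}(y) = \sum_{j=1}^{p} w_j \, y_j
\quad \text{such that all edits can be satisfied by changing only the variables with } y_j = 1 .
```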

Error localisation (2)
• Alternative approach:
  • Choose a function D_soft that measures the degree of suspicion associated with particular soft edit failures
  • Minimise [formula not preserved] so that all hard edits are satisfied
• Prototype algorithm in R (based on editrules)
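To make the alternative approach concrete, here is a brute-force sketch in Python. It is not the R prototype based on editrules. It rests on two loudly stated assumptions: the objective is taken to be the plain sum of the weighted number of changed variables and D_soft (the slide's actual formula is not preserved), and each variable can only take values from a small finite grid. All names (`localise`, `var_weights`, `grid`) are hypothetical.

```python
from itertools import combinations, product


def localise(record, hard_edits, soft_edits, var_weights, grid):
    """Brute-force search for the cheapest set of variables to change.

    hard_edits: list of predicates that must hold after imputation.
    soft_edits: list of (predicate, failure_weight) pairs.
    Assumed objective: sum of variable weights + sum of failed soft edit weights.
    """
    best = None
    names = list(record)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            d_fh = sum(var_weights[v] for v in subset)
            # Try every combination of grid values for the chosen variables
            for values in product(grid, repeat=size):
                candidate = {**record, **dict(zip(subset, values))}
                if not all(edit(candidate) for edit in hard_edits):
                    continue  # hard edits must always be satisfied
                d_soft = sum(s for edit, s in soft_edits if not edit(candidate))
                cost = d_fh + d_soft
                if best is None or cost < best[0]:
                    best = (cost, subset, candidate)
    return best


hard = [lambda r: r["turnover"] - r["costs"] == r["profit"]]
soft = [(lambda r: r["profit"] <= 0.6 * r["turnover"], 0.5)]
weights = {"turnover": 1, "costs": 1, "profit": 1}
record = {"turnover": 100, "costs": 20, "profit": 40}

result = localise(record, hard, soft, weights, range(0, 201, 10))
print(result)
```

With the hard edit alone, changing any one of the three variables repairs this record at equal cost. Adding the soft edit penalises the repairs through turnover or profit, so the search singles out costs as the variable to change.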

Simulation study (1)
• Two data sets:
  • Dutch SBS 2007, medium-sized wholesale businesses
  • Raw and manually edited data available
  • One half used as test data, one half as reference data
• Test data set 1:
  • 728 records, 12 variables, 16 hard edits, 10 soft edits
  • Synthetic errors
• Test data set 2:
  • 580 records, 10 variables, 17 hard edits, 24 soft edits
  • Real errors

Simulation study (2)
% records with perfect solution:
editing approach (choice of D_soft)                 data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits as hard edits                             36.8%        n/a

Choices for D_soft – fixed weights (1)
• Fixed failure weights: [formula not preserved]
• Resulting target function to be minimised: [formula not preserved]
• Higher failure weight: ‘harder’ soft edit

Choices for D_soft – fixed weights (2)
• Possible choices for s_k:
  A. All failure weights equal to 1
  B. Proportion of records that satisfy edit k in manually edited reference data
     Interpretation: P(edited record satisfies edit k)
  C. P(edited record satisfies edit k | raw record fails edit k)
• Alternative: categorised versions of B and C
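The three weight choices can be estimated from paired raw and manually edited reference records. The sketch below assumes that D_soft is the sum of the failure weights of the failed soft edits, which is how the results tables label this approach. The function names and the toy data are hypothetical.

```python
def failure_weights(raw, edited, edit):
    """Estimate failure weights A, B and C for one soft edit.

    raw, edited: parallel lists holding the raw and manually edited
    version of each reference record; edit: predicate on a record.
    """
    satisfied = [edit(r) for r in edited]
    weight_a = 1.0
    # B: proportion of edited records that satisfy the edit
    weight_b = sum(satisfied) / len(edited)
    # C: same proportion, restricted to records whose raw version fails
    raw_fail = [i for i, r in enumerate(raw) if not edit(r)]
    weight_c = (sum(satisfied[i] for i in raw_fail) / len(raw_fail)
                if raw_fail else None)
    return weight_a, weight_b, weight_c


def d_soft(record, soft_edits, weights):
    """Sum of the failure weights of all soft edits that the record fails."""
    return sum(w for edit, w in zip(soft_edits, weights) if not edit(record))


def profit_ratio(r):
    # Soft edit Profit / Turnover <= 0.6, written without division
    return r["profit"] <= 0.6 * r["turnover"]


# Toy reference data (hypothetical values)
raw = [{"turnover": 100, "profit": p} for p in (50, 70, 90, 80)]
edited = [{"turnover": 100, "profit": p} for p in (50, 50, 90, 40)]

a, b, c = failure_weights(raw, edited, profit_ratio)
print(a, b, c)
print(d_soft({"turnover": 100, "profit": 75}, [profit_ratio], [b]))
```

Under choice C, a soft edit gets a high weight only if manual editors usually resolved its failures, which is why it conditions on the raw record failing the edit.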

Simulation study (3)
% records with perfect solution:
editing approach (choice of D_soft)                 data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%

Choices for D_soft – quantile edits (1)
• Drawback of fixed failure weights: no difference between large and small edit failures
• Trick: quantile edits

Choices for D_soft – quantile edits (2)
• Idea: use different versions of the same edit by varying one of the constants
• Choose values for this constant based on the fraction of reference data records that fail the resulting edit (e.g. 1%, 5%, 10%)

Choices for D_soft – quantile edits (3)
• Example: ratio edit x_1 / x_3 ≥ c
% records failed in ref. data   c      quantile edit       s_k   cumul. s_k
10%                             0.75   x_1 / x_3 ≥ 0.75    1     1
5%                              0.60   x_1 / x_3 ≥ 0.60    1     2
1%                              0.10   x_1 / x_3 ≥ 0.10    1     3
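The quantile-edit construction can be sketched as follows: pick each constant c so that a given fraction of reference records fails x_1 / x_3 ≥ c, then sum the weights of all versions of the edit that a record fails. The simple empirical quantile rule and the function names are assumptions. The constants 0.75, 0.60 and 0.10 and the unit weights come from the example table.

```python
def quantile_constants(ratios, fractions):
    """For each fraction p, pick c so that about a share p of the
    reference ratios lies strictly below c (i.e. fails ratio >= c)."""
    ordered = sorted(ratios)
    n = len(ordered)
    return {p: ordered[round(p * n)] for p in fractions}


def quantile_d_soft(ratio, constants, weights):
    """Sum the weights of every quantile edit 'ratio >= c' that fails."""
    return sum(w for c, w in zip(constants, weights) if ratio < c)


# Hypothetical reference ratios 0.00, 0.01, ..., 0.99
reference = [i / 100 for i in range(100)]
print(quantile_constants(reference, [0.10, 0.05, 0.01]))

# Constants and weights from the example table
constants, weights = [0.75, 0.60, 0.10], [1, 1, 1]
for ratio in (0.80, 0.65, 0.50, 0.05):
    print(ratio, quantile_d_soft(ratio, constants, weights))
```

A record with a ratio just below 0.75 fails one version of the edit, while a record with a ratio below 0.10 fails all three, so larger failures accumulate a larger penalty.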

Simulation study (4)
% records with perfect solution:
editing approach (choice of D_soft)                 data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%
10-5-1%-quantile edits, weights 0.33-0.33           54.4%        63.4%
10-5-1%-quantile edits, weights 0.90-0.05           56.5%        63.8%

Choices for D_soft – dynamic expressions
• Size of edit failure: e_k
• Linear equality edit: a_k1 x_1 + … + a_kp x_p + b_k = 0
  Take: e_k = | a_k1 x_1 + … + a_kp x_p + b_k |
• Linear inequality edit: a_k1 x_1 + … + a_kp x_p + b_k ≥ 0
  Take: e_k = max{ 0, −(a_k1 x_1 + … + a_kp x_p + b_k) }
• Use reference data to standardise: [formula not preserved]
• Linear sum: [formula not preserved]
• Mahalanobis distance: [formula not preserved]
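The two definitions of the failure size e_k translate directly into code. The sketch below also shows one possible standardised linear sum, in which each e_k is divided by a scale factor assumed to have been estimated from reference data. Since the standardisation formula is not preserved, that part is an assumption, and all function names are hypothetical.

```python
def linear_form(a, b, x):
    """Evaluate a_k1*x_1 + ... + a_kp*x_p + b_k."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b


def failure_size_equality(a, b, x):
    """e_k for an equality edit (linear form = 0)."""
    return abs(linear_form(a, b, x))


def failure_size_inequality(a, b, x):
    """e_k for an inequality edit (linear form >= 0)."""
    return max(0.0, -linear_form(a, b, x))


def standardised_sum(sizes, scales):
    """Assumed form of the linear sum: each failure size divided by a
    scale factor taken from reference data (hypothetical choice)."""
    return sum(e / s for e, s in zip(sizes, scales))


# Variables ordered as (turnover, costs, profit)
x = (100, 60, 90)
# Equality edit: turnover - costs - profit = 0
e_balance = failure_size_equality((1, -1, -1), 0, x)
# Inequality edit: 0.6 * turnover - profit >= 0
e_ratio = failure_size_inequality((0.6, 0, -1), 0, x)
print(e_balance, e_ratio)
print(standardised_sum([e_balance, e_ratio], [25.0, 10.0]))
```

Unlike fixed weights, these expressions grow with the size of the violation, so a record that misses an edit narrowly contributes less to D_soft than one that misses it by a wide margin.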

Simulation study (5)
% records with perfect solution:
editing approach (choice of D_soft)                 data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%
10-5-1%-quantile edits, weights 0.33-0.33           54.4%        63.4%
10-5-1%-quantile edits, weights 0.90-0.05           56.5%        63.8%
sum of standardised soft edit failures              49.2%        ?
Mahalanobis distance of soft edit failures          46.8%        ?

Conclusion
• Using soft edits improved error localisation
• Choice of D_soft:
  • Results not unequivocal
  • Quantile edits seem to work well
  • Room for improvement
• Future work:
  • Extended simulation study with mixed data/edits