Challenges in data linkage: error and bias
Katie Harron, October 2014
UCL Institute of Child Health
k.harron@ucl.ac.uk
The linkage problem
Link status   Match (pair from same individual)   Non-match (pair from different individuals)
Link          Identified match                    False match
Non-link      Missed match                        Identified non-match
Deterministic linkage in Hospital Episode Statistics (HES)
Step 1: Sex, Date of Birth, NHS Number
Step 2: Sex, Date of Birth, Postcode, Local Patient Identifier within Provider
Step 3: Sex, Date of Birth, Postcode
Few false matches; more missed matches
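A stepwise deterministic rule of this kind can be sketched in Python as below. This is a minimal illustration, not the HES implementation: the record layout, the field names (`sex`, `dob`, `nhs_number`, `postcode`, `local_id`), the grouping of identifiers into steps, and the rule that a missing identifier never counts as agreement are all assumptions, and the example records are made up.

```python
from typing import Optional

# Hypothetical match steps, tried in order from strictest to most relaxed.
STEPS = [
    ("sex", "dob", "nhs_number"),
    ("sex", "dob", "postcode", "local_id"),
    ("sex", "dob", "postcode"),
]


def agrees(a: dict, b: dict, fields: tuple) -> bool:
    """True if every field is present in both records and identical."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)


def match_step(a: dict, b: dict) -> Optional[int]:
    """Return the 1-based step at which two records link, or None if no step links them."""
    for step, fields in enumerate(STEPS, start=1):
        if agrees(a, b, fields):
            return step
    return None


# Made-up records for illustration only.
rec_a = {"sex": "F", "dob": "2010-03-14", "nhs_number": None,
         "postcode": "AB1 2CD", "local_id": "P001"}
rec_b = {"sex": "F", "dob": "2010-03-14", "nhs_number": "0000000001",
         "postcode": "AB1 2CD", "local_id": "P001"}
rec_c = {"sex": "F", "dob": "2010-03-14", "nhs_number": "0000000001",
         "postcode": "XY9 8ZW", "local_id": "P777"}

print(match_step(rec_a, rec_b))  # links at step 2 (NHS number missing in rec_a)
print(match_step(rec_b, rec_c))  # links at step 1 (NHS number agrees)
print(match_step(rec_a, rec_c))  # no link: postcode differs, NHS number missing
```

Requiring exact agreement at every step keeps false matches rare, but any recording error or missing value in a required identifier produces a missed match.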
Quality of unique identifiers
166,406 records of admissions to paediatric intensive care (PICANet)
85,137 non-matches, of which 46 (0.1%) had the same NHS number
81,269 matches, of which 3,207 (4%) had different NHS numbers
Hagger-Johnson et al. Causes and consequences of data linkage errors: false and missed matches following linkage of hospital data (under review)
Deterministic linkage with pseudonymisation at source Courtesy of Peter Jones, ONS Beyond 2011 programme
Probabilistic linkage
Primary file: Ronald Fisher
Linking file: Karl Pearson, Carl Gauss, Ronald Fisher
Candidate pairs 1 to 3 range from low match weight to high match weight; the highest weight is retained
Probabilistic linkage
m-probability: $m = P(\gamma = 1 \mid M)$ = sensitivity, the probability of agreement given that the records are from the same subject
u-probability: $u = P(\gamma = 1 \mid U)$ = 1 - specificity, the probability of agreement given that the records are from different subjects
Log ratio: $w = \log_2(m/u)$ if identifiers agree; $w = \log_2[(1-m)/(1-u)]$ if identifiers disagree
Match weight: $W = \sum_i w_i$
Pairs range from low match weight to high match weight; the highest weight is retained
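The match weight calculation can be sketched in Python as follows. The m- and u-probabilities used here are illustrative values chosen for the example, not estimates from any dataset, and the function and field names are hypothetical.

```python
import math


def field_weight(m: float, u: float, agree: bool) -> float:
    """Log2 likelihood ratio for one identifier.

    m: P(agreement | records from same subject)
    u: P(agreement | records from different subjects)
    """
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(agreement: dict, m_probs: dict, u_probs: dict) -> float:
    """Sum the per-identifier weights to give the overall match weight W."""
    return sum(
        field_weight(m_probs[f], u_probs[f], agreement[f]) for f in agreement
    )


# Illustrative probabilities only (assumed, not taken from the slides).
M = {"sex": 0.99, "dob": 0.95, "nhs_number": 0.98}
U = {"sex": 0.50, "dob": 0.001, "nhs_number": 0.0001}

all_agree = {"sex": True, "dob": True, "nhs_number": True}
dob_differs = {"sex": True, "dob": False, "nhs_number": True}

print(match_weight(all_agree, M, U))
print(match_weight(dob_differs, M, U))
```

Agreement on a discriminating identifier (small u, such as NHS number) adds a large positive weight, whereas agreement on sex adds little because two records from different people agree on sex about half the time.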
Probabilistic linkage
Non-matches (low match weight): e.g. agreement on sex, disagreement on date of birth
Matches (high match weight): e.g. agreement on NHS number
In between, pairs agree on some identifiers and disagree on others, because of:
- Chance (same date of birth)
- Missing data
- Recording errors
Matches and non-matches by match weight (low to high)
Pairs above the threshold are links
Matches below the threshold are missed matches; non-matches above the threshold are false matches
Two thresholds
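The trade-off between false matches and missed matches can be sketched in Python. The weights and true match statuses below are made up, and the use of two thresholds as an upper and a lower cut-off with an uncertain region in between is an assumption about how the thresholds are applied.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Assign a pair to link, non-link, or the uncertain region between two cut-offs."""
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "uncertain"


def error_counts(pairs, threshold: float):
    """Count (false matches, missed matches) for a single threshold.

    pairs: iterable of (match_weight, is_true_match) tuples.
    """
    false_matches = sum(1 for w, true in pairs if w >= threshold and not true)
    missed_matches = sum(1 for w, true in pairs if w < threshold and true)
    return false_matches, missed_matches


# Made-up pairs: (match weight, true match status).
pairs = [(-6.0, False), (-1.5, False), (2.0, True),
         (3.5, False), (8.0, True), (15.0, True)]

print(error_counts(pairs, 0.0))  # low threshold: (1, 0), one false match
print(error_counts(pairs, 5.0))  # high threshold: (0, 1), one missed match
print([classify(w, 0.0, 5.0) for w, _ in pairs])
```

Raising the threshold trades false matches for missed matches; in practice the true match status is known only for a gold-standard subset.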
Evaluating linkage quality
- Small amounts of linkage error can result in substantially biased results
- The impact of linkage error on results is rarely reported
- Linkage error affects different types of analysis in different ways
Why it’s important to evaluate linkage error Schmidlin et al (2013) Impact of unlinked deaths and coding changes on mortality trends in the Swiss National Cohort. BMC Med Inform Decis Mak 13 (1): 1
Why it’s important to evaluate linkage error Hobbs, G. & Vignoles, A., 2007. Is free school meal status a valid proxy for socio-economic status (in schools research)? Centre for the Economics of Education; London School of Economics and Political Science.
Why it’s important to evaluate linkage error Highly sensitive; highly specific. Lariscy (2011). Differential Record Linkage by Hispanic Ethnicity and Age in Linked Mortality Studies: Implications for the Epidemiologic Paradox. J Aging Health 23(8): 1263-84
Why it’s important to evaluate linkage error Ford et al 2006. Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology 20(4): 329-337.
Evaluating linkage quality
i) Sensitivity analysis using different linkage criteria (highly sensitive; highly specific)
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
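A sensitivity analysis over linkage criteria can be sketched in Python: the same estimate is recomputed under a highly specific and a highly sensitive threshold and the results are compared. The pairs, thresholds and binary outcome below are made up, and `linked_rate` is a hypothetical helper.

```python
def linked_rate(pairs, threshold: float) -> float:
    """Proportion with the outcome among pairs accepted as links at this threshold.

    pairs: iterable of (match_weight, outcome) with outcome coded 0 or 1.
    """
    linked = [outcome for weight, outcome in pairs if weight >= threshold]
    return sum(linked) / len(linked) if linked else float("nan")


# Made-up candidate pairs: (match weight, binary outcome from the linked record).
pairs = [(12.0, 1), (11.0, 0), (9.0, 0), (7.5, 0),
         (4.0, 1), (3.0, 1), (1.0, 0), (-2.0, 0)]

# Assumed cut-offs: a high one (few false matches) and a low one (few missed matches).
for label, threshold in (("highly specific", 8.0), ("highly sensitive", 0.0)):
    print(label, threshold, round(linked_rate(pairs, threshold), 3))
```

If the estimate moves materially between the two criteria, the analysis is sensitive to linkage error and the range should be reported alongside the main result.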
Imputation for linkage
Prior-informed imputation
[Figure: records in the primary file (Record 1 to Record n) alongside candidate records in the linking file, each either an exact match or carrying a match weight (e.g. 10, 5, 4, 3, 1), with the corresponding value of the variable of interest (high, med, low).]
Goldstein et al. Stat Med 2012; 31(28): 3481-3493
Harron et al. BMC Med Res Method 2014; 14(1): 36
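The general idea of combining match weights with an imputation model can be sketched in Python. This is a loose illustration under stated assumptions, not the procedure in the cited papers: converting log2 match weights to a prior by normalising 2**W over the candidates, the likelihood values, the acceptance cut-off and all function names are assumptions made for the example.

```python
def prior_from_weights(weights):
    """Turn log2 match weights into a prior over candidates (assumed normalisation)."""
    raw = [2.0 ** w for w in weights]
    total = sum(raw)
    return [r / total for r in raw]


def posterior(prior, likelihood):
    """Combine the linkage prior with a likelihood for each candidate value."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def accept_or_impute(values, post, cutoff: float):
    """Return the best candidate value if its posterior reaches the cutoff, else None.

    None signals that the value would instead be drawn from the imputation model.
    """
    best = max(range(len(post)), key=post.__getitem__)
    return values[best] if post[best] >= cutoff else None


# Made-up candidates for one primary-file record.
values = ["med", "low", "high"]
weights = [10.0, 5.0, 4.0]
likelihood = [0.3, 0.5, 0.2]  # hypothetical P(value | covariates)

post = posterior(prior_from_weights(weights), likelihood)
print(accept_or_impute(values, post, cutoff=0.9))

# Two equally weighted candidates: neither is accepted.
flat = posterior(prior_from_weights([1.0, 1.0]), [0.5, 0.5])
print(accept_or_impute(["low", "high"], flat, cutoff=0.9))
```

The point of the sketch is that uncertain links are not forced into link or non-link: information from all candidate records is carried forward into the analysis.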
Implications for data providers
i) Sensitivity analysis using different thresholds
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
Required from data providers:
- Availability of all candidate records (linked and unlinked)
- Subset of data where true match status is known (gold-standard)
Harron et al 2012. Opening the black box of record linkage. J Epidemiol Commun H 66(12): 1198
Summary
- Data linkage is a powerful tool for enhancing administrative data
- Linkage error has important effects on analyses
- Results vary according to choice of thresholds and methods
- Taking error into account is possible without releasing identifiable data
- Communication between linkers and data users is vital
Acknowledgements and funding
Harvey Goldstein, Ruth Gilbert, Gareth Hagger-Johnson and Angie Wade, UCL Institute of Child Health
Berit Muller-Pebody, Public Health England
Roger Parslow, Tom Fleming, Lee Norman and the PICANet team, University of Leeds
This work was supported by funding from the National Institute for Health Research Health Technology Assessment (NIHR HTA) programme (project number 08/13/47). The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the HTA programme, NIHR, NHS or the Department of Health. The authors state no conflicts of interest.