Repeated anonymised samples of administrative records: an application to social security data in Brazil
Rigan A. C. Gonzalez (DATAPREV-Brazil)
Pedro L. N. Silva (University of Southampton-UK)

Outline
• Introduction and motivation
• Sample design and selection
• Some results from the selected anonymised samples
• Conclusions and discussion

Social Security Databases
• The Brazilian Social Security Administration (SSA) maintains huge databases of contributors and beneficiaries enrolled in the social security system
• The records held provide a rich source of information about participation in the formal labour market and the distribution of social security benefits
• In particular, they provide a longitudinal perspective that is unavailable from other sources
  – There are no major longitudinal surveys covering the working-age population in Brazil

SSA databases – main issues
• Confidentiality and security requirements mean that they are inaccessible for research purposes
• Currently used only for production of aggregate-level summaries, published on a regular basis
  – Pre-defined cross-classified tables, at a high level of aggregation
  – Broad indicators only
• Not available for user-specific analysis
• One idea: anonymised samples of records

Anonymised Samples of Records
• Enable dissemination of individual anonymised microdata
• While protecting the confidentiality of individual records
• Popularised by applications in population censuses
• More recently, also applied to administrative records
  – Drazga (2008) describes the US experience
  – Examples exist from other countries, such as the UK

Anonymised Samples of Jobs Database
• Goal: to design samples of SSA database records to be extracted and made available for analysis on a regular basis
• Proposed sample design: stratified simple random sampling at each time point
• Rotation strategy: use Permanent Random Numbers (PRNs, e.g. Ohlsson 1995) to control sample overlap across time
  – Enables longitudinal analysis
  – Enables each sample to represent the updated survey population
  – Simple, but effective rotation control

Sample Design & Selection
• Target population = all jobs held by workers affiliated to the General Social Security Regime (GSSR) in the reference period
• Reference period = July 2001 to June 2002
• Key domains of analysis defined as the cross-classification of states (27 levels) x SIC of employer (four ‘sectors’):
  – 1 = Manufacturing
  – 2 = Trade and distribution services
  – 3 = Other services
  – 4 = Agriculture, construction and other productive activities
• Main targets of inference: job status distribution:
  – 1 = Active
  – 2 = New admission
  – 3 = Terminated in current month
  – 4 = Terminated in previous periods
  – 5 = Not reported

Stratification & Sample Size
• 57 explicit strata
  – 40 strata = 10 states by 4 SIC groups
  – + 17 states with no further stratification (state-only strata)
• Sample size in each stratum set to estimate proportions of at least 1.5% with a CV no larger than 10%
  – nh = 6,300 records in each of the 40 state by SIC strata
  – nh = 12,600 records in each of the 17 state-only strata
• Larger size in the 17 state-only strata to enable domain estimation by SIC with some confidence
• Total sample size n = 466,200 job records (< 1.5% of total)
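The allocation above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the slides do not state the formula used to set nh, so the function `cv_of_proportion` (a hypothetical name) assumes the textbook coefficient of variation of an estimated proportion under simple random sampling, with an optional finite population correction.

```python
import math

def cv_of_proportion(p, n, sampling_fraction=0.0):
    """CV of an estimated proportion p under SRS with sample size n.

    Assumed textbook formula, not taken from the slides:
    CV = sqrt((1 - f) * (1 - p) / (n * p)), where f is the sampling fraction.
    """
    return math.sqrt((1.0 - sampling_fraction) * (1.0 - p) / (n * p))

# Stratum sample sizes quoted on the slide.
n_state_by_sic = 6300   # each of the 40 state by SIC strata
n_state_only = 12600    # each of the 17 state-only strata

# Total sample size across the 57 explicit strata.
total_n = 40 * n_state_by_sic + 17 * n_state_only

# CV for the smallest proportion of interest (1.5%), ignoring the fpc.
cv_small = cv_of_proportion(0.015, n_state_by_sic)
cv_large = cv_of_proportion(0.015, n_state_only)

print(f"total n = {total_n}")
print(f"CV at p = 1.5%, n = 6,300:  {cv_small:.4f}")
print(f"CV at p = 1.5%, n = 12,600: {cv_large:.4f}")
```

The total reproduces the 466,200 records quoted above. Without the finite population correction, the CV for a 1.5% proportion with 6,300 records comes out very close to the 10% target; any correction applied in strata with non-negligible sampling fractions would lower it further.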

Rotation Scheme
• Designed to rotate out 1/12 of the sample at each new selection period
• We used monthly samples, but this can easily be changed to other periods, such as quarters, semesters or years
• Expected time in sample for each record: 12 months (or periods)
• Actual time in sample is not fixed, because rotation control under PRN sampling is stochastic
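To make the rotation idea concrete, here is a minimal sketch of PRN-based selection in a single stratum. It is not the authors' implementation: the slides do not say which PRN variant was used, so the sketch assumes sequential selection (take the n units whose PRNs come first after a start point) and rotation by shifting the start point by 1/12 of the sampling fraction each month. The stratum size, the seed and the function names are all hypothetical.

```python
import random

def assign_prns(unit_ids, seed=2002):
    """Give every unit one uniform random number that it keeps permanently."""
    rng = random.Random(seed)
    return {uid: rng.random() for uid in unit_ids}

def prn_sample(prns, n, start):
    """Select the n units whose PRNs come first after `start`, wrapping at 1."""
    ordered = sorted(prns, key=lambda uid: (prns[uid] - start) % 1.0)
    return set(ordered[:n])

# Hypothetical stratum: 12,000 jobs, sample of 1,200 (sizes are illustrative).
N, n = 12_000, 1_200
prns = assign_prns(range(N))

# Assumed rotation rule: move the start point by 1/12 of the sampling
# fraction each month, so roughly 1/12 of the sample rotates out.
step = (n / N) / 12
samples = [prn_sample(prns, n, start=month * step) for month in range(13)]

overlap_1 = len(samples[0] & samples[1]) / n
overlap_6 = len(samples[0] & samples[6]) / n
print(f"overlap after 1 month:  {overlap_1:.3f}")
print(f"overlap after 6 months: {overlap_6:.3f}")
```

Because the number of PRNs falling in each shifted interval is random, the share rotated out each month is only approximately 1/12, which is one way to read the remark that time in sample is not fixed. Under this scheme a newly registered job would simply receive its own PRN and enter the sample whenever that number falls in the current selection window.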

Sample Sizes for Alternative Analysis

Selected estimates of total and proportions of jobs by status – April 2002

Selected estimates of counts and proportions of new jobs by activity sector – April 2002

Scatter plot of estimated proportions of new admissions and their CVs – April 2002

Proportions of jobs terminated in month t+k, for jobs existing (Active) or started (New admissions) in January 2002 (k=0)

Conclusions and discussion (1)
• Brazilian SSA could improve its approach for releasing statistical information about the formal labour market by providing access to anonymised samples of jobs
• This would enable satisfying analytical needs of many specialised users, while still protecting the confidentiality of individual records
• This would substantially enhance the capacity for the study and evaluation of the impact of public policies regarding the Social Security system in Brazil

Conclusions and discussion (2)
• The sample design proposed worked well in our application
• All the sample selection, estimation and analysis activities were carried out using a standard desktop microcomputer
• Once the samples are made available, analysts should have no difficulty in exploring the data for their own estimation and analysis activities
• The various analyses carried out with the selected samples illustrate the potential of such samples for analytical use

Conclusions and discussion (3)
• For cross-sectional estimates in any given month, the sample of approximately 466,200 records delivers precise estimates for some fine domains of interest
• For longitudinal analyses with samples six months apart, the sample would still have approximately 233,100 matched records available
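The matched-record figure follows from the rotation rate. As a rough check, the sketch below assumes that exactly 1/12 of the sample rotates out each month and ignores both the stochastic variation of PRN rotation and jobs leaving the population; `expected_matched` is a hypothetical helper, not something named in the slides.

```python
def expected_matched(n, months_apart, periods_in_sample=12):
    """Expected number of records common to two samples `months_apart` apart,
    assuming an equal share of the sample rotates out in every period."""
    remaining = max(periods_in_sample - months_apart, 0)
    return n * remaining // periods_in_sample

n_total = 466_200
matched_6 = expected_matched(n_total, 6)
print(f"expected matched records six months apart: {matched_6}")
```

With six of the twelve rotation groups still in common, half of the 466,200 records remain, which agrees with the approximate figure quoted above.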

Future Work
• Improved weighting methods for longitudinal analyses (e.g. following Lavallée 1995)
• Detailed analysis of disclosure risks associated with the proposed sampling strategy
• Assess impact and introduce control measures to reduce bias caused by late reporting of new jobs (births) and jobs terminated (deaths)

Thanks for your attention.

References
• DRAZGA, L. (2008). Uses of Administrative Data at the U.S. Social Security Administration.
• GONZALEZ, R. A. C. (2005). Amostragem longitudinal em registros administrativos: uma aplicação à previdência social. Rio de Janeiro: Escola Nacional de Ciências Estatísticas, MSc dissertation.
• LAVALLÉE, P. (1995). Cross-sectional weighting of longitudinal surveys of individuals and households using the weight share method. Survey Methodology, v. 21, nº 1, p. 25-32.
• OHLSSON, E. (1995). Coordination of Samples using Permanent Random Numbers. In: Cox, Binder, Chinnappa, Christianson, Colledge & Kott (eds.) Business Survey Methods, New York, Wiley, p. 153-169.