SOEP and DOI: Requirements and Challenges
Jan Goebel
February 1, 2011
Workshop: Persistent Identifiers for the Social Sciences
Content
1. SOEP Overview
2. Problems
3. Conclusions
SOEP Overview
• The Socio-Economic Panel Study (SOEP) is a representative longitudinal study of private households in Germany
• Annual survey since 1984 of about 10,000 households (around 20,000 persons)
• Topics include household composition, occupational biographies, employment, earnings, health, and indicators of subjective well-being
SOEP is an ongoing survey
• In common with all panel surveys:
• Each year we distribute an enhanced version with new and changed data
• Questions change, new topics are added, ...
→ We do a lot, but not just replication!
• Even "archived data" can change, e.g. a change in the coding scheme of ISCO
SOEP is not one dataset but a complex data structure
• The SOEP (User DVD) currently consists of:
– More than 320 data files
– About 40,000 variables
• Which granularity to choose for citation?
– The complete SOEP distribution of one year?
– "Connected" SOEP parts, e.g. individual questionnaires, household (HH) questionnaires, generated datasets?
– Each data file?
– Each variable (for each year, or only once as a longitudinal concept)?
"The SOEP" is available in different versions
• European users: 100% version (English, German, different formats for SAS/SPSS/Stata/ASCII)
• Non-EU users: 95% version (95% of cases)
• International comparative research: part of the CNEF (Cross National Equivalent File)
• SOEP Geocodes (supplementary CD): regional planning regions, community types, etc.
• Country codes, community codes, zip codes, microm: only by remote execution or at the Research Data Center (RDC SOEP)
• SOEP Pretests
• SOEP Related Studies
SOEP can change between distributions because of updates
• Updates of weighting schemes or even bug fixes (also possible for older waves)
• Sometimes more than one update between distributions (cumulative updates?)
• How can a user know which version she is using?
– Message-Digest Algorithm (MD5)
– Secure Hash Algorithm (SHA-2)
– Universal Numeric Fingerprint (UNF)
• Does rounding matter?
• German/English labels, different formats (SPSS, Stata, ...)
• What if an update only fixes a label bug?
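
The questions on this slide can be made concrete with a small sketch. It uses Python's standard `hashlib` to compute a SHA-256 digest (a member of the SHA-2 family) and shows that a byte-level hash changes when only the formatting of a number changes. The function names are hypothetical, and `rounded_fingerprint` is only a simplified illustration of the idea behind a rounding-aware fingerprint; it is not the UNF algorithm.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rounded_fingerprint(values, digits: int = 7) -> str:
    """Hash numeric values after rounding them to a canonical text form.

    Simplified illustration only (NOT the UNF algorithm): each value is
    written in exponential notation with a fixed number of digits, so that
    differences below that precision do not change the fingerprint.
    """
    canonical = "\n".join(f"{value:.{digits}e}" for value in values)
    return sha256_of_bytes(canonical.encode("utf-8"))


# The same number written in two ways gives two different byte-level hashes.
print(sha256_of_bytes(b"1.5") == sha256_of_bytes(b"1.50"))

# After canonical rounding, tiny numeric differences no longer matter.
print(rounded_fingerprint([1.5]) == rounded_fingerprint([1.50000000001]))
```

A byte-level hash identifies one file in one format, so the SPSS and Stata versions of the same data, or two files that differ only in a label, get different digests. A content-based fingerprint trades that strictness for stability across formats.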
Conclusions
• Nesting of DOIs should be possible:

  Print                      DOI         SOEP example
  Edited book                Survey      SOEP DVD
  Article in book            Data file   SOEP dataset $PGEN
  Table in article in book   Variable    SOEP dataset $pgen variable

  Example DOIs:
  – SOEP dataset: 10.1000/soep.26.hgen
  – SOEP dataset $pgen variable: 10.1000/soep.26.hgen.ihinc$$
• It should be possible for a user to identify the data, including the version. The metadata of a DOI should include a hash (such as SHA-2) for each data file and format, and this hash must also be persistent
• Commitment of the data provider to persistence
• Identifying the data source is not enough to make a scientific empirical analysis reproducible; normally you also need the syntax
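
The first two conclusions can be sketched in code. The example below assumes that nesting is expressed by dot-separated parts in the DOI suffix, as the example DOIs on the slide suggest; this is an illustrative convention, not a feature of the DOI system itself. The prefix `10.1000` is the example prefix from the slide, and the function names, record layout, and file contents are hypothetical.

```python
import hashlib
from typing import Dict, Optional

EXAMPLE_PREFIX = "10.1000"  # example prefix as used on the slide


def nested_doi(*parts: str, prefix: str = EXAMPLE_PREFIX) -> str:
    """Build a DOI whose suffix nests levels with dots (assumed convention)."""
    return f"{prefix}/" + ".".join(parts)


def parent_doi(doi: str) -> Optional[str]:
    """Return the DOI one nesting level up, or None at the top level."""
    prefix, suffix = doi.split("/", 1)
    head, separator, _ = suffix.rpartition(".")
    return f"{prefix}/{head}" if separator else None


def checksum_record(doi: str, files_by_format: Dict[str, bytes]) -> dict:
    """Attach one SHA-256 digest per file format to a DOI (hypothetical layout)."""
    return {
        "doi": doi,
        "sha256": {
            fmt: hashlib.sha256(content).hexdigest()
            for fmt, content in sorted(files_by_format.items())
        },
    }


dataset = nested_doi("soep", "26", "hgen")
variable = nested_doi("soep", "26", "hgen", "ihinc$$")
print(dataset)               # 10.1000/soep.26.hgen
print(variable)              # 10.1000/soep.26.hgen.ihinc$$
print(parent_doi(variable))  # 10.1000/soep.26.hgen

# Placeholder bytes stand in for the real Stata and SPSS files.
record = checksum_record(dataset, {"stata": b"stata bytes", "spss": b"spss bytes"})
print(sorted(record["sha256"]))
```

Storing one digest per format lets a user verify her local copy against the DOI metadata, whichever format she downloaded.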
Thank you for your attention!