The Reproducible Research Advantage Why + how to make your research more reproducible Presentation for the Center for Open Science June 17, 2015 April Clyburne-Sherin
Objectives • What is reproducibility? • Why practice reproducibility? • What is necessary for research to be reproducible? • How can you make your research reproducible? • What is literate programming?
What is reproducibility? Replicability • Replication of findings is the highest standard for evaluating evidence • Focuses on validating the scientific claim Scientific method Observation Analysis Question Replication Testing Hypothesis Prediction
What is reproducibility? Replicability • Replication of findings is the highest standard for evaluating evidence • Focuses on validating the scientific claim • Many studies cannot be replicated Scientific method Observation Analysis Question Replication Testing Hypothesis Prediction
What is reproducibility? Replicability • Replication of findings is the highest standard for evaluating evidence • Focuses on validating the scientific claim • Many studies cannot be replicated Scientific method Observation Analysis Testing ? Prediction Question Hypothesis
What is reproducibility? Reproducibility • Reproduction of study findings using study materials • Requires transparency of methods, data, and code • Focuses on the validity of the data analysis • Limited type of replication • Minimum standard for any scientific study Scientific method Observation Analysis Question Reproduction Testing Hypothesis Prediction
Why practice reproducibility? • Study report is enough to: • Assess study justification • Assess study design • Understand how the experiment was conducted • Assess the relevance of findings Study report Reported results Figures Tables Numerical summaries
Why practice reproducibility? • Study report is not enough to: • Assess errors in analyses • Assess the sensitivity of findings to assumptions • Reproduce the analyses • Cannot evaluate the study analyses and findings using a study report alone. Study report Reported results Figures Tables Numerical summaries
Why practice reproducibility? Study report Processing Raw data Analysing Analytic data Raw results Reporting Reported results Figures Tables Numerical summaries
Why practice reproducibility? Study report Processing Raw data Analysing Analytic data Reporting Raw results Reported results Figures Tables Numerical summaries To fully assess the analyses and findings of a study, we need more information.
Why practice reproducibility? The idealist The pragmatist • Shoulders of giants! • Minimum scientific standard • Allows others to build on your findings • Improved transparency • Increased transfer of knowledge • Increased utility of your data + methods • Data sharing citation advantage (Piwowar 2013) • “It takes some effort to organize your research to be reproducible… the principal beneficiary is generally the author herself.” - Schwab & Claerbout • Improves capacity for complex and large datasets or analyses • Increased productivity
What is necessary for research to be reproducible? Study report Processing Raw data Analysing Analytic data Raw results Reporting Reported results Figures Tables Numerical summaries
What is necessary for research to be reproducible? Study report Processing code Raw data Analytic code Analytic data Presentation code Raw results Reported results Figures Tables Numerical summaries
What is necessary for research to be reproducible? Study report Processing code Raw data Analytic code Analytic data Presentation code Raw results Reported results Figures Tables Numerical summaries 1. Data + metadata 2. Code 3. Documentation of data + code
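The three code stages in the diagram (processing code, analytic code, presentation code) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's own workflow: the toy dataset and the function names `process`, `analyse`, and `present` are hypothetical. The point is that every arrow from raw data to reported results is a script rather than a manual step.

```python
from statistics import mean

# Hypothetical raw data: one record per measurement, one value missing.
RAW_DATA = [
    {"group": "control", "score": "4.0"},
    {"group": "control", "score": "6.0"},
    {"group": "treated", "score": "9.0"},
    {"group": "treated", "score": ""},  # missing measurement
    {"group": "treated", "score": "11.0"},
]


def process(raw):
    """Processing code: raw data -> analytic data (drop missing, convert types)."""
    return [
        {"group": row["group"], "score": float(row["score"])}
        for row in raw
        if row["score"] != ""
    ]


def analyse(analytic):
    """Analytic code: analytic data -> raw results (mean score per group)."""
    groups = sorted({row["group"] for row in analytic})
    return {
        group: mean(row["score"] for row in analytic if row["group"] == group)
        for group in groups
    }


def present(results):
    """Presentation code: raw results -> reported numerical summary."""
    return "\n".join(
        f"{group}: mean score = {value:.1f}" for group, value in results.items()
    )


if __name__ == "__main__":
    print(present(analyse(process(RAW_DATA))))
```

Because each stage is a function of the previous stage's output, sharing the data and these three pieces of code is enough for someone else to regenerate the reported summary.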
How can you make your research reproducible? 1. Plan for reproducibility before you start • Data management plan • Informative naming + location • Study plan + pre-analysis plan 2. Keep track of things • Version control • Documentation 3. Let your computer do the work • Use software that can be coded • Literate programming 4. Archive + share your materials
1. Plan for reproducibility before you start Data management plan • Prepare to share • Data that is well managed from the start is easier to prepare for sharing • Smooths transitions between researchers • Protects you if questions are raised about data validity • Metadata provides context • Document metadata while collecting to save time • Use open data formats rather than proprietary: .csv, .txt, .png How? • Data: – Collected – Stored – Documented – Managed • Metadata: – Collected – Documented / Version control
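One possible way to act on "use open data formats" and "document metadata while collecting" is to write the data as CSV and the metadata to a plain-text file beside it. The sketch below is an assumption-laden illustration: the helper `save_with_metadata`, the `_metadata.json` naming, and the choice of JSON for the metadata file are all hypothetical and not prescribed by the slides.

```python
import csv
import json
from pathlib import Path


def save_with_metadata(rows, fieldnames, metadata, data_path):
    """Write rows to a CSV file and the metadata to a JSON file beside it."""
    data_path = Path(data_path)
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # Hypothetical convention: assay.csv is described by assay_metadata.json.
    meta_path = data_path.with_name(data_path.stem + "_metadata.json")
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path
```

Both files are plain text, so they open without proprietary software and can be tracked line by line under version control.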
1. Plan for reproducibility before you start Informative name + location • Plan your file naming + location system a priori • Names and locations should be distinctive, consistent, and informative: – What it is – Why it exists – How it relates to other files
1. Plan for reproducibility before you start Informative name + location • The rules don’t matter. That you have rules matters. • Make it machine readable: – Default ordering – Use of meaningful delimiters and tags – Example: use “_” and “-” to store metadata in name (e.g., YYYY-MM-DD_assay_sample-set_well) • Make it human readable: – Choose self-explanatory names and locations
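A short Python sketch of why delimiter-based names are machine readable. It assumes "_" separates fields, "-" separates words inside a field, and the first field is an ISO-style date (YYYY-MM-DD); the file names and the helper `parse_name` are hypothetical examples.

```python
from datetime import date


def parse_name(filename):
    """Split a name like 2015-06-17_elisa_plate-2_A01.csv into metadata fields."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    day, assay, sample_set, well = stem.split("_")  # "_" separates fields
    return {
        "date": date.fromisoformat(day),
        "assay": assay,
        "sample_set": sample_set.split("-"),  # "-" separates words in a field
        "well": well,
    }


if __name__ == "__main__":
    names = [
        "2015-06-17_elisa_plate-2_A01.csv",
        "2015-01-05_elisa_plate-1_B03.csv",
    ]
    # Leading ISO dates make the default (alphabetical) order chronological.
    for name in sorted(names):
        print(parse_name(name))
```

Putting the date first in year-month-day order is what gives the "default ordering" benefit: a plain alphabetical sort of the directory is also a chronological sort.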
1. Plan for reproducibility before you start Study plan • Pre-register your study plan before you look at your data! • Hypothesis • Study design – Type of design – Sampling – Power and sample size – Randomization? • Variables measured – Meaningful effect size • Variables constructed – Data processing • Etc… Open Science Framework ClinicalTrials.gov
1. Plan for reproducibility before you start Pre-analysis plan • Define data analysis set • Statistical analyses – Primary – Secondary – Exploratory • Missing data • Outliers • Multiplicity • Subgroups + covariates (Adams-Huet and Ahn, 2009) Raw data Processing Analytic data Analysing Raw results
2. Keep track of things Version control • Everything created manually should use version control • Tracks changes to files, code, metadata • Allows you to revert to old versions • Make incremental changes: commit early, commit often • Git / GitHub / Bitbucket Version control for data • Metadata should be version controlled
2. Keep track of things Documentation • Document your software environment (e.g., dependencies, libraries, sessionInfo() in R) • Everything done by hand or not automated from data and code should be precisely documented: – README files • Make raw data read only – You won’t edit it by accident – Forces you to document or code data processing • Document in code comments
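The slide names sessionInfo() in R for documenting the software environment. As a rough Python counterpart, and only as a sketch, the two helpers below record interpreter and package versions and strip write permission from a raw data file. The function names `describe_environment` and `make_read_only` are hypothetical; the library calls are from the Python standard library.

```python
import os
import platform
import stat
from importlib import metadata


def describe_environment(packages=()):
    """Return the Python version, platform, and versions of selected packages."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = "not installed"
    return info


def make_read_only(path):
    """Remove write permission so the raw file cannot be edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Saving the output of `describe_environment` next to the results, and calling `make_read_only` once on each raw file, means any later change to the data has to happen in processing code rather than by hand.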
3. Let your computer do the work Use software that can be coded • Graphical user interfaces are hard to reproduce. • Telling a computer what to do maximizes reproducibility. • Teaching a computer what to do is telling researchers who use your code what to do.
3. Let your computer do the work Literate programming • Links data, code, output, and documentation • Combines code “chunks” with text and output • Requires a documentation language + a programming language • Produces documents in HTML, PDF, and more • RStudio + R Notebook, Sweave, or knitr
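To make the idea of combining code "chunks" with text and output concrete, here is a toy Python sketch. It is not how knitr or Sweave work internally; the `weave` function, the chunk format, and the `#>` output prefix are hypothetical choices made for illustration, and running code with `exec` is only suitable for code you trust.

```python
import contextlib
import io


def weave(chunks):
    """Render ("text", str) and ("code", str) chunks into one plain-text report.

    Code chunks run in a shared namespace, and whatever they print is placed
    in the report directly under the code that produced it.
    """
    namespace = {}
    report = []
    for kind, body in chunks:
        if kind == "text":
            report.append(body)
        elif kind == "code":
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(body, namespace)  # trusted, illustrative code only
            report.extend("    " + line for line in body.splitlines())
            report.extend("    #> " + line for line in buffer.getvalue().splitlines())
        else:
            raise ValueError(f"unknown chunk kind: {kind!r}")
    return "\n".join(report)


CHUNKS = [
    ("text", "Mean reaction time across three trials:"),
    ("code", "times = [410, 380, 425]\nprint(sum(times) / len(times))"),
    ("text", "The mean is recomputed from the data each time the report is built."),
]

if __name__ == "__main__":
    print(weave(CHUNKS))
```

The reported number is never typed by hand: it appears in the document only because the code that computes it ran, which is the link between data, code, output, and documentation that literate programming provides.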
4. Archive + share your materials Open Science Framework RPubs
How can you make your research reproducible? 1. Plan for reproducibility before you start • Data management plan – Prepare to share • Informative naming + location – The rules don’t matter. That you have rules matters. • Study plan + pre-analysis plan – Pre-register your plan 2. Keep track of things • Version control – Track your changes • Documentation – Everything done by hand 3. Let your computer do the work • Use software that can be coded – Teaching a computer is teaching others • Literate programming – Link data, code, output, and documentation 4. Archive + share your materials • Where doesn’t matter. That you share matters.
How to learn more • Organizing a project for reproducibility – Reproducible Science Curriculum by Jenny Bryan – https://github.com/reproducible-science-curriculum/ • Data management – Data Management from Software Carpentry by Orion Buske – http://softwarecarpentry.org/v4/data/mgmt.html • Literate programming – Literate Statistical Programming by Roger Peng – https://www.youtube.com/watch?v=YcJb1HBc-1Q • Version control – Version Control by Software Carpentry – http://softwarecarpentry.org/v4/vc/ • Sharing materials – Open Science Framework by Center for Open Science – https://osf.io/
An example of reproducible analyses using R + Open Science Framework 1. Pre-register analysis plan 2. Read-only dataset 3. Version control of analyses 4. Literate programming using knitr