Workshop to Support Data Science Workflows
Practical steps for increasing the openness and reproducibility of data science
Supporting Research Workflows in Data Science, UVA, November 11, 2016
Natalie Meyers

Objectives
Session 1
• Understanding reproducible research
• Setting up a reproducible project
• Keeping track of things
Session 2
• Containing bias
• Sharing your work

© COPYRIGHT FIRST LOOK MEDIA 2016
https://osf.io/ezcuj/
Repeat After Me by Maki Naro, Oct 6, 2016, in The Nib. Used with permission.
URL: https://thenib.com/repeat-after-me

Technology to enable change
Training to enact change
Incentives to embrace change

• OSF
• TOP Guidelines
• Badges for Open Practices
• Reproducibility Projects
• PreReg & Registered Reports
• SHARE

What is the problem?
Scientific method
• Best way to learn how the world works
• Replication of findings is the highest standard for evaluating evidence
• Replication of methods allows reuse and extension of new knowledge
[Figure: scientific method cycle, labeled Observation, Analysis, Question, Replication, Testing, Hypothesis, Prediction]

What is reproducibility?
Scientific method
• Computational reproducibility
• Empirical reproducibility
• Conceptual reproducibility
[Figure: scientific method cycle, labeled Observation, Analysis, New data, Question, Replication, Testing, Hypothesis, Prediction]

What are the barriers?
[Figure: data science workflow, labeled Disseminate, Acquire data, Evaluate, Design, Prepare data, Model, Explore]

What are the barriers?
• Statistical
  – Low power
  – Researcher degrees of freedom
• Transparency
  – Poor documentation
  – Poor reporting
  – Lack of sharing

Why practice reproducibility?
The idealist / The pragmatist
• Shoulders of giants!
• Validates scientific knowledge
• Allows others to build on your findings
• Improved transparency
• Increased transfer of knowledge
• Increased utility of your data + methods
• Increased efficiency
• Reduces false leads based on irreproducible findings
• Data sharing citation advantage (Piwowar 2013)
• “It takes some effort to organize your research to be reproducible… the principal beneficiary is generally the author herself.” (Schwab & Claerbout)

How can you make your research reproducible?
1. Plan for reproducibility before you start
  • Create a study plan
  • Set up a reproducible project
2. Keep track of things
  • Documentation
  • Version control
3. Contain bias
  • Registration
  • Reporting
4. Archive + share your materials

1. Plan for reproducibility before you start
Create a study plan
• Create a study plan before you gather your data
• Begin documentation early
• Shows evolution of study
How?
• Research questions + hypotheses
• Study design
  – Type of design
  – Sampling
  – Power and sample size
  – Randomization?
• Variables measured
  – Meaningful effect size
• Variables constructed
  – Data processing
• Data management
• Analyses
• Sharing
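
The "Power and sample size" item in the study plan can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the workshop materials: it assumes a two-sided comparison of two group means, uses the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2, and the function name sample_size_per_group is hypothetical. An exact t-based calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d) you consider
    meaningful. This uses the normal approximation, so treat the result
    as a planning estimate rather than an exact requirement.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {sample_size_per_group(d)} participants per group")
```

Writing the effect size, alpha, and power into the study plan before data collection makes the sample size decision auditable later.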

1. Plan for reproducibility before you start
Set up a reproducible project
How?
• Set up a centralized location for project management: https://osf.io/
• Organization is especially important for collaboration
• Easily find the most recent file version
• Eases transition between lab members
• Allows for back-up and version control

https://osf.io/institutions/uva/

https://accounts.osf.io/login?campaign=institution

Institutional Login

1. Create an OSF project

1. Wiki, file tree, components, citation, GUID

1. Giving contributors access

1. Creating a wiki

1. Adding organizational structure components

2. Keep track of things
Documentation
• Document everything done by hand
• Document your software environment (e.g., dependencies, libraries, sessionInfo() in R)
• Everything done by hand, or not automated from data and code, should be precisely documented:
• Make raw data read-only
  – You won’t edit it by accident
  – Forces you to document or code data processing
• Document in code comments
  – README files
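
For Python projects, the two documentation habits above (recording the software environment and making raw data read-only) might look roughly like the sketch below. It is an illustration under stated assumptions, not part of the workshop materials: the function names describe_environment and make_read_only are hypothetical, and only standard-library calls are used.

```python
import platform
import stat
from importlib import metadata
from pathlib import Path


def describe_environment() -> dict:
    """Collect interpreter, OS, and installed-package versions.

    A rough Python counterpart to R's sessionInfo(); save the result
    next to your analysis outputs.
    """
    packages = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages.append(f"{name}=={dist.version}")
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": sorted(packages),
    }


def make_read_only(path: Path) -> None:
    """Remove write permission for user, group, and others."""
    mode = stat.S_IMODE(path.stat().st_mode)
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    env = describe_environment()
    print(env["python"], env["platform"], len(env["packages"]), "packages")
```

On Windows, chmod only toggles the file's read-only flag, which is still enough to guard against accidental edits.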

2. Keep track of things
Version control
• Track your changes
• Everything created manually should use version control
• Tracks changes to files, code, metadata
• Allows you to revert to old versions
• Make incremental changes: commit early, commit often
• Git / GitHub / Bitbucket
Version control for data
• Metadata should be version controlled
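
One way to bring data files under the same discipline, without committing large files themselves, is to commit a small checksum manifest alongside the metadata. This is an assumption of this sketch rather than a practice named in the slides, and the helper names sha256_of, build_manifest, and changed_files are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file path (relative to data_dir) to its SHA-256 digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict, new: dict) -> list:
    """Names of files that were added, removed, or whose contents differ."""
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))
```

Saving the manifest as a text file (for example with json.dump) and committing it gives a versioned record of exactly which data each analysis used.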

1. Add-ons

3. Contain bias
Share important moments in your study
• Create public registrations of your study
• Improves transparency
• Improves accountability
• Counters selective reporting and outcome reporting bias
• Preregistration of all study plans helps counter publication bias

Publication bias
Fanelli D (2010) “Positive” Results Increase Down the Hierarchy of the Sciences. PLoS ONE 5(4): e10068.

3. Register your study
AsPredicted: Preregistration made easy
1. Have any data been collected for this study already?
2. Hypothesis. What's the main question being asked or hypothesis being tested in this study?
3. Dependent variable. Describe the key dependent variable(s) specifying how they will be measured.
4. Conditions. How many and which conditions will

3. Contain bias
Analysis plan
• Register your analysis plan
• Defines your confirmatory analyses
• Decreases researcher degrees of freedom
How?
• Define data analysis set
• Statistical analyses
  – Primary
  – Secondary
  – Exploratory
• Missing data
• Outliers
• Multiplicity
• Subgroups + covariates
(Adams-Huet and Ahn, 2009)
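
Multiplicity is one of the items an analysis plan should settle in advance: which correction will be applied when several hypotheses are tested. The slides do not prescribe a method; the sketch below assumes the Holm step-down procedure as one common choice, and the function name holm_adjust is hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is
    capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]
    for p, adj in zip(raw, holm_adjust(raw)):
        print(f"raw p = {p:.3f}, Holm-adjusted p = {adj:.3f}")
```

Naming the correction in the registered plan removes one researcher degree of freedom: the choice can no longer be made after seeing which tests came out significant.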

Researcher degrees of freedom
Simmons, Nelson, & Simonsohn (2012)

3. Contain bias
Reporting
• Report transparently + completely
• Transparently means:
  – Readers can use the findings
  – Replication is possible
  – Users are not misled
  – Findings can be pooled in meta-analyses
• Completely means:
  – All results are reported, no matter their direction or statistical significance
How?
• SAMPL
• Avoid HARKing: Hypothesizing After the Results are Known
• Report all deviations from your study plan
• Report which decisions were made after looking at the data

3. Register your analysis
AsPredicted: Preregistration made easy
5. Analyses. Specify exactly which analyses you will conduct to examine the main question/hypothesis.
6. More analyses. Any secondary analyses?
7. Sample Size. How many observations will be collected or what will determine sample size?
8. Other. Anything else you would like to pre-register? (e.g., data

3. How to register

How can you make your research reproducible?
1. Plan for reproducibility before you start
  • Create a study plan
    – Begin documentation at study inception
  • Set up a reproducible project
    – Centralize and organize your project management
2. Keep track of things
  • Documentation
    – Document provenance, your environment, + everything done by hand
  • Version control
    – Track your changes
3. Contain bias
  • Registration
    – Share important moments in your study
  • Reporting
    – Report transparently + completely
4. Archive + share your materials

Thomas Harriot’s “release dates” timeline
Thomas Harriot (1560–1621)
• Corresponded with Tycho Brahe, Johannes Kepler and Galileo Galilei
• 1607: Recorded observations of comet
• 26 July 1609: Was the first to observe the moon using a telescope and to record his observations
• Late Nov or Dec: Galileo observes moon w/ telescope; publishes in 1610
• 1610: Observed sunspots
• 1631: Artis Analyticae Praxis published in Latin
• Harriot’s accurate observations of the 1607 comet used by Friedrich Wilhelm Bessel, who is computing the comet's orbital elements & realizes that the 1607 comet was "Halley's comet"
• Harriot’s observational work recommended for publication by Count DeBruhl
• Rigaud Supplement including Harriot’s observations published
  [Timeline dates shown for the three entries above: 1784, 1785, 1833]
• 2007: Artis Analyticae Praxis published in English
• 2012: Notebooks made open online

Thomas Harriot’s excuses
An excerpt from Allan Chapman’s Thomas Harriot: The First Telescopic Astronomer
• Too Busy?
• Too Paranoid?
• Not Lazy!
• Not Timid!

Funder Mandates for Where to Archive + share HAVE A GRANT U YO S

Funder Mandates for Where to Archive + share HAVE A GRANT U YO S S E L UN * r e t t a m t ’ n s oe !! d N e O r I e T Wh TEN T A Y PA N E H T • If you submitted a Data Management Plan with your proposal, follow it! • Deposit where your funder requires or recommends * CHECK what your funder mandates https: //cendi. gov/projects/Public_Access_Plans_ US_Fed_Agencies. html 46

4. Archive + share your materials
Share your materials
• Where doesn’t matter*. That you share matters.
• Get credit for your code, your data, your methods
• Increase the impact of your research
Open Science Framework
*BUT CHECK what your funder mandates

4. Share your work

4. Increasing discoverability

OSF for Meetings
free poster + presentation service
https://osf.io/meetings

Open Proceedings: Use OSF for Meetings + an OSF Project
https://osf.io/view/containerstrategies/
https://osf.io/y9mpx/

OSF for Institutions
integration with local research services
https://osf.io/institutions/cos

How to learn more
• Reproducible Science Curriculum by Jenny Bryan
  – https://github.com/reproducible-science-curriculum/
• 23 Things Libraries for Research Data
• Practical Steps for Increasing the Openness and Reproducibility of Research Data by Natalie Meyers
• Literate programming
  – Literate Statistical Programming by Roger Peng
  – https://www.youtube.com/watch?v=YcJb1HBc-1Q
• Version control
  – Version Control by Software Carpentry
  – http://softwarecarpentry.org/v4/vc/
• Data management
  – Data Management from Software Carpentry by Orion Buske
  – http://softwarecarpentry.org/v4/data/mgmt.html

Reproducibility training
free stats + methods training
http://cos.io/stats_consulting

Transparency and Openness Promotion (TOP) Guidelines
http://cos.io/top

• Modular standards
• Low barrier to entry
• Discipline agnostic
http://cos.io/top

• Level 1: Disclose
• Level 2: Require
• Level 3: Verify
http://cos.io/top

“The policy of the __ is to publish papers where authors indicate whether the data, methods used in the analysis, and materials used to conduct the research will be made available to any researcher for purposes of reproducing the results or replicating the procedure.
• Authors must, in acknowledgments or the first footnote, indicate if they will or will not make their data, analytic methods, and study materials available to other researchers.
• If an author agrees to make materials available, the author must specify where that material will be available.”
http://cos.io/top

“The policy of the ______ is to publish papers only if the data used to conduct the research are clearly and precisely documented and are maximally available to any researcher for purposes of reproducing the results or replicating the procedure.
Details of:
• What must be shared
• Legal and Ethical Exceptions
  – Disclosure at onset of review
• Using trusted repositories”
http://cos.io/top

“... are maximally available to any researcher for purposes of reproducing the results or replicating the procedure. All materials supporting the claims made by the author must be made available to the journal prior to publication. The journal, or an entity acting on behalf of the journal, will verify that the findings are replicable using the author’s data and methods of analysis. Failure to replicate at this stage may result in the paper not being published.”
http://cos.io/top

752 Journals
63 Organizations
http://cos.io/top

Badges for Open Practices
article signals open behaviour
https://osf.io/rfgdw

ACM: Incentivizing Reproducibility
• The ACM Task Force on Data, Software, and Reproducibility in Publication. http://www.acm.org/publications/task-force-on-data-software-and-reproducibility
• Ronald F. Boisvert. 2016. Incentivizing reproducibility. Commun. ACM 59, 10 (September 2016), 5-5. DOI: http://dx.doi.org/10.1145/2994031
• Result and Artifact Review and Badging. ACM. June 8, 2016. http://www.acm.org/publications/policies/artifact-review-badging

Case Study: Psychological Science
article signals open behaviour
https://osf.io/rfgdw

Where to get help
• reproducibility training: stats-consulting@cos.io
• osf: support@osf.io
• work with COS: contact@cos.io
Slides: https://osf.io/j79ud/
Workshop Materials: All Materials https://github.com/csoderberg/OSF-Curriculum