Data as code: Data management for reproducible research
Martin O’Reilly, Principal Research Software Engineer, The Alan Turing Institute
08/09/2017
@martinoreilly | @turinginst

The Alan Turing Institute is the national centre for data science, headquartered at the British Library.

Turing Research Engineering
• Radka Jersakova
• May Yong
• Tim Hobson
• James Geddes
• James Hetherington

Turing Research Fellows
• Kirstie Whitaker
• Tomas Petricek

Data management for reproducible research

FAIR Data Principles
• Findable
• Accessible
• Interoperable
• Re-usable
Source: FORCE11 website. https://www.force11.org/group/fairprinciples. Accessed on 07 Sep 2017

Code management for reproducible research
• How do I get your code?
  • Online repositories and persistent archives with versioning support
• How do I use your code?
  • Documentation, examples, packages, virtual machines, containers
• How do I build on your code?
  • Documentation, readable code, tests
• How do I trust your code?
  • Tests, examples, readable code
• What am I allowed to do with your code?
  • Licence

Data management for reproducible research
• How do I get your data?
  • Online repositories with versioning and APIs for data access
• How do I use your data?
  • Documentation, metadata, common data formats, data packages
• How do I build on your data?
  • Record of provenance and processing, compatible content, linkable to other data
• How do I trust your data?
  • Record of provenance and processing, versioning
• What am I allowed to do with your data?
  • Licences, terms of use, data access agreements, ethics

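One way to make a "record of provenance and processing" concrete is to store checksums of a step's input and output alongside a description of the step. The following is a minimal sketch using only the Python standard library. The function names `sha256_of` and `provenance_record`, the record fields, and the sample data are illustrative assumptions, not part of the talk or of any standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(raw: bytes, processed: bytes, step: str) -> dict:
    """Describe one processing step by the checksums of its input and output."""
    return {
        "step": step,
        "input_sha256": sha256_of(raw),
        "output_sha256": sha256_of(processed),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up raw data with stray whitespace around some fields.
raw = b"country,value\nUK, 10 \nFR, 20 \n"

# Hypothetical processing step: strip whitespace around every field.
processed = "\n".join(
    ",".join(field.strip() for field in line.split(","))
    for line in raw.decode().splitlines()
).encode()

record = provenance_record(raw, processed, step="strip whitespace from fields")
print(json.dumps(record, indent=2))
```

Anyone holding the raw data and the processing code can recompute both digests and check them against the record, which covers part of the "trust" question above.
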
Good examples

UN Comtrade database
• Web API for programmatic access
• Can apply current and historical classification codes to entire dataset
• Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017

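Selecting a subset "along multiple dimensions" through a web API usually comes down to encoding one value per dimension as a query parameter. The sketch below shows the idea without touching the network. The endpoint `https://example.org/api/get` and the parameter names (`reporter`, `partner`, `year`, `classification`) are hypothetical placeholders, not the actual UN Comtrade API; the real names should be taken from the Comtrade documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://example.org/api/get"


def build_query(reporter: str, partner: str, year: int, classification: str) -> str:
    """Build a URL that selects one slice of a trade dataset."""
    params = {
        "reporter": reporter,
        "partner": partner,
        "year": year,
        "classification": classification,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query(reporter="GBR", partner="FRA", year=2016, classification="HS")
print(url)

# Round-trip check: the query string decodes back to the chosen dimensions.
print(parse_qs(urlsplit(url).query))
```

Because the whole selection is captured in one URL, the exact subset used in an analysis can be recorded and re-requested later.
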
UN Comtrade database
• Third-party R package available for querying web API
Source: Screenshot from comtradr R package GitHub README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017

ConnectomeDB
• Website requires registration and login
Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017

ConnectomeDB
• One-time click for acceptance of terms
• Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017

The Gamma
Dot-driven development
• Intellisense autocomplete for data exploration
• Interactive dynamic data preview
• Uses F# type providers
• For more details, see http://tomasp.net/academic/papers/pivot/
Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017

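The idea of exploring data by typing a dot and choosing from completions can be approximated in Python, as a rough analogue only. The sketch below is not how The Gamma or F# type providers work: type providers supply static types at compile time, whereas this wrapper resolves names at run time. The class name `DotNode` and the figures are made up for illustration.

```python
class DotNode:
    """Expose the keys of a nested dict as attributes, so interactive
    shells can offer them as completions after a dot."""

    def __init__(self, data: dict):
        self._data = data

    def __dir__(self):
        # Completion tools typically call dir() to list candidate names.
        return list(self._data)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so that the next dot also works.
        return DotNode(value) if isinstance(value, dict) else value


# Made-up figures, purely for illustration.
expenditure = DotNode({
    "health": {"y2015": 100, "y2016": 110},
    "education": {"y2015": 80, "y2016": 85},
})

print(dir(expenditure))
print(expenditure.health.y2016)
```

In an interactive shell, typing `expenditure.` and pressing Tab would then suggest the category names, and a misspelt category raises `AttributeError` instead of silently returning nothing.
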
The Gamma
• Sub categories indicated by initial numerals
• Sub-sub categories indicated by text formatting
• Subtotals indicated by background colour
Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017

The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017

Dream data

My wish list
• Repository supporting versioning and content-aware sub-setting
• Data includes raw and processed data, with code to replicate processing
• Content-aware, on-demand differential download
• Automatable access to data requiring an access agreement / authentication
• Data accessible as native code objects
• Documentation accessible in context of data presentation
• Standard, machine-readable licences
• Repository tracks download / usage stats

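A differential download can be sketched as a comparison of two manifests, each mapping a file name to a checksum of its content. The example below works at file level only, which is coarser than the content-aware behaviour on the wish list. The function names `manifest` and `plan_download` and the in-memory "files" are assumptions made for illustration, not an existing repository API.

```python
import hashlib


def manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in files.items()}


def plan_download(local: dict[str, str], remote: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list what a client needs to fetch or delete."""
    return {
        # New files and files whose content changed.
        "fetch": sorted(n for n, digest in remote.items() if local.get(n) != digest),
        # Files that no longer exist in the repository version.
        "delete": sorted(n for n in local if n not in remote),
    }


# In-memory stand-ins for a local copy and a newer repository version.
local_files = {"a.csv": b"1,2\n", "b.csv": b"3,4\n", "old.csv": b"0\n"}
remote_files = {"a.csv": b"1,2\n", "b.csv": b"3,5\n", "c.csv": b"6,7\n"}

plan = plan_download(manifest(local_files), manifest(remote_files))
print(plan)  # {'fetch': ['b.csv', 'c.csv'], 'delete': ['old.csv']}
```

Only `b.csv` (changed) and `c.csv` (new) would be transferred; the unchanged `a.csv` is skipped.
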
Interesting tools
Repositories
• Figshare, Zenodo, Dataverse, DataONE, Dryad
Data access
• Repository APIs, rOpenSci, SPARQL
Data formats
• RDF, OWL, Research object bundles, BagIt, Frictionless data
Differencing data
• Daff (tables), data-diff (JSON), data-diff (Python)
Provenance / processing record
• Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)

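To illustrate what differencing tabular data means, the sketch below compares two versions of a small CSV table by a key column and reports added, removed, and changed rows. It is a simplified illustration and not the algorithm used by Daff or any other tool listed above. The function names `load_rows` and `diff_tables` and the sample tables are made up.

```python
import csv
import io


def load_rows(text: str, key: str) -> dict[str, dict[str, str]]:
    """Parse CSV text into a dict of rows indexed by the key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def diff_tables(old: str, new: str, key: str) -> dict[str, list[str]]:
    """Report keys of rows added, removed, or changed between two CSV versions."""
    a, b = load_rows(old, key), load_rows(new, key)
    return {
        "added": sorted(k for k in b if k not in a),
        "removed": sorted(k for k in a if k not in b),
        "changed": sorted(k for k in a if k in b and a[k] != b[k]),
    }


# Two made-up versions of the same table.
v1 = "code,name,value\nGB,United Kingdom,10\nFR,France,20\nDE,Germany,30\n"
v2 = "code,name,value\nGB,United Kingdom,10\nFR,France,25\nES,Spain,40\n"

print(diff_tables(v1, v2, key="code"))
# {'added': ['ES'], 'removed': ['DE'], 'changed': ['FR']}
```

Keying rows by an identifier rather than comparing line by line means the result does not depend on row order, which is the main difference from a plain text diff.
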
turing.ac.uk | @turinginst
moreilly@turing.ac.uk | @martinoreilly