Deciding what to keep and where to keep it

Deciding what to keep …and where to keep it
Angus Whyte, Digital Curation Centre
Research Data Management at University of Aberdeen & RGU, 7th October 2014
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License

Outline
• Why select, rather than ‘file and forget’?
• Take five steps to inform your choice…
  ① Think. What could be reused for what purpose?
  ② Recognise compliance risks
  ③ (Gu)estimate long-term value
  ④ Judge the cost factors
  ⑤ Decide – what action is needed
• The onus is on you, but it’s a partnership
• So what tools and practical help do you need?

Storage Strategies
Good practice: weigh up risks, value, and costs
• Select, share, safeguard what you can afford to, or dispose of it
• Findable • Accessible • Interoperable • Reusable (FAIR Principles, www.force11.org/group/fairgroup)
Bad practice: keep everything until… lost by natural wastage
• Fragmented
• Risking unauthorised disclosure or loss
• Bit rot
• Media degradation
• Obsolescence (software, device, format, media)
• Fire, flood, theft
• Organisation failure

Why not keep it all?
Globally, data volumes are doubling every two years
John Gantz and David Reinsel 2011, Extracting Value from Chaos, www.emc.com/digital_universe

Data volumes escalate
Volumes are rising faster in data-intensive research domains, e.g. DNA sequence data is doubling every 6-8 months
“ELIXIR and Open Data: View from an ELIXIR Node”, Barend Mons, ELIXIR Launch event, 18th Dec 2013
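
As a rough illustration of how far apart the two doubling rates quoted above are (every two years globally, every 6-8 months for DNA sequence data), the sketch below assumes steady exponential growth and computes the growth factor as 2^(elapsed time / doubling time). The steady-growth model, the function name and the four-year horizon are illustrative assumptions; the sources give only the doubling times.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Factor by which a volume grows under steady doubling (assumed model)."""
    return 2 ** (elapsed_months / doubling_months)


# Doubling times quoted on the slides; the 48-month horizon is an arbitrary example.
doubling_times = {
    "All data, globally (24 months)": 24,
    "DNA sequence data (8 months)": 8,
    "DNA sequence data (6 months)": 6,
}

for label, months in doubling_times.items():
    print(f"{label}: x{growth_factor(48, months):.0f} after 4 years")
```

Under these assumptions, four years brings about a 4-fold increase globally, against a 64-fold to 256-fold increase for sequence data.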

Storage mgmt costs rise long-term
Hardware costs decline, but power and staff costs keep rising
David Rosenthal, blog.dshr.org/2012/05/lets-just-keep-everything-forever-in.html

While data availability declines
Nature News, 19 Dec 2013, www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

What to do? Data appraisal… a ‘later stage’ plan for your data
① Could this data be re-used?
② Must it be kept to manage compliance risk?
③ Should it be kept for its potential value? And…
④ Considering costs,
⑤ Will ✔ or won’t ✗ it be kept, and shared on what terms?
(Diagram labels: Researchers; guidance & attractive choices; Institutions; managed storage; external repositories)

Step 1 (?) What ‘must’ be kept?
Some data may be part of the research record, evidence for e.g.…
• Audit purposes
• Health & Safety (Lab book)
• Contractual requirement
Jisc Infonet Guidance on Managing Research Records: tools.jiscinfonet.ac.uk/downloads/bcs-rrs/managing-research-records.pdf
What counts here? It depends on the purposes the data has been used for.
Compliance is also about data that won’t be kept, or may only be shared with approved researchers… Research Ethics, Duty of Confidentiality, Data Protection Act, Human Rights Act, Statistics & Registration Services Act.
UK Data Archive: http://www.data-archive.ac.uk/create-manage/consent-ethics/legal

Step 1 (?) What ‘must’ be kept?
What about Funding Body data policies?
• “Data with acknowledged long-term value” (RCUK Common Principles on Data Policy)
• “Data, information and other electronic resources of long-term interest” (ESRC UK Data Archive Collections Development Policy)
• “Where data underpins published research there is much greater expectation that it will be kept” (Ben Ryan, EPSRC)
What counts depends on the data’s value for purposes it has served or may serve, so consider these as a first step.

Step 1 (?) What ‘must’ be kept?
Don’t forget Journal policies…
• “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. …Nature journals reserve the right to refuse publication in cases where authors do not provide adequate assurances that they can comply with the journal's requirements for sharing materials.” http://www.nature.com/authors/policies/availability.html
• “Changemakers are journals with high impact factors…. Progressive policies are not widespread, but are being adopted rapidly” Victoria Stodden, “Re-use and Reproducibility: Opportunities and Challenges”, Open Repositories, 2013

Step 2 → 1: What could it be reused for?
Step back and reflect – typical reuse purposes
1. Verification
2. Further analysis
3. Reputation building
4. Resource development
5. Further publications inc. data articles
6. Learning and teaching materials
7. Private reference
Then relative to these, which data must be kept, and which data and related materials will have significant value?

e.g. High Energy Physics community
Levels of data to preserve, each with its reuse purpose:
1) Additional documentation: publication-related information search (e.g. wikis, news forums)
2) Data in a simplified format: outreach, simple training analyses
3) Analysis level software and the data format: full scientific analysis based on existing reconstruction
4) Reconstruction and simulation software and basic level data: full potential of the experimental data
Adapted from: DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics, May 2012. http://arxiv.org/abs/1205.4667

Step 3 What data should have value?
Indicators that data have value
1. Quality of the data and its description: complete, accurate, reliable, valid, representative etc.
2. Demand high: known users, integration potential, reputation, recommendation, appeal
3. Replication difficulty: difficult, costly, or impossible to reproduce
4. Low barriers: legal/ethical, copyright; non-restrictive terms and conditions
5. Rarity: unique copy or other copies at risk
Which related material does the data depend on for its value?

Step 4 Cost factors
Consider these when deciding what to keep because
• Costs incurred during the project may add to the data’s value
• You need to make sure post-project costs are covered
1. Creation, collection & cleaning
2. Short-term storage & backup
3. Short-term access & security
4. Team communication & development
5. Preservation & long-term access
What action needs to be taken to ensure preservation is costed?

Step 5 Your data appraisal
Establish a clear idea of what data needs to be packaged at the end
1. Title, contributors, description, access rights *
2. Reuse purpose(s)
3. Value for purpose
4. Risk of budget shortfall
5. Keep it or not? *
6. Reasons for disposal *
7. Actions to prepare for preservation or disposal *
* What anyone outside the project most needs to know (but the rest will help)
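
The seven appraisal items above could be captured as one record per dataset. The following is only a minimal sketch: the class name, field names and example values are hypothetical and do not follow any standard metadata schema. The `outside_summary` method returns just the starred items (1, 5, 6, 7), i.e. what someone outside the project most needs to know.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AppraisalRecord:
    """One dataset's appraisal; field names are illustrative assumptions."""
    title: str                                  # item 1 *
    contributors: List[str]                     # item 1 *
    description: str                            # item 1 *
    access_rights: str                          # item 1 *
    reuse_purposes: List[str] = field(default_factory=list)  # item 2
    value_for_purpose: str = ""                 # item 3
    risk_of_budget_shortfall: str = ""          # item 4
    keep: bool = True                           # item 5 *
    reasons_for_disposal: str = ""              # item 6 *
    actions: List[str] = field(default_factory=list)         # item 7 *

    def outside_summary(self) -> Dict[str, object]:
        """Return only the starred items (1, 5, 6, 7)."""
        return {
            "title": self.title,
            "contributors": self.contributors,
            "description": self.description,
            "access_rights": self.access_rights,
            "keep": self.keep,
            "reasons_for_disposal": self.reasons_for_disposal,
            "actions": self.actions,
        }


# Entirely fictional example values.
record = AppraisalRecord(
    title="Interview transcripts (hypothetical example)",
    contributors=["A. Researcher"],
    description="Anonymised transcripts from a fictional study",
    access_rights="Approved researchers only",
    reuse_purposes=["Verification", "Further analysis"],
    keep=True,
    actions=["Deposit in a repository"],
)
print(record.outside_summary())
```

Keeping the unstarred items (reuse purpose, value, budget risk) in the same record means the reasoning behind a keep-or-dispose decision stays with the data, even if only the summary is published.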

Who should help appraise?
RLUK ‘skills gaps’ survey of Subject Librarians & Managers
“…nine key areas where future involvement by Subject Librarians is considered to be important now and is also expected to grow sharply…
1. Ability to advise on preserving research outputs (49% see as essential in 2-5 years; 10% now)
2. Knowledge to advise on data management and curation (48% essential in 2-5 years; 16% now)…”
Mary Auckland 2012, Reskilling Libraries for Research

Who else?
Others who may be involved in appraising research data…
• Domain specialists
• Archives
• Research Office: Business development
• IT Support / Research Computing
• Research Ethics Committee
• Records Management / FOI Compliance
• Facilities Managers (if physical samples involved)

Where should it go?
Institutions aiming to offer a range of options
• Secure managed storage / disposal
• Institutional data catalogue (most universities establishing)
• Institutional data repository (if nowhere else it can go)
• Help to find an external repository

Finding external repositories
• General directories: Re3data.org, Databib.org
• Domain-specific directories, e.g. life sciences: Biosharing.org
• Data journal recommendations: Edinburgh research data blog, Sources of dataset peer review
• Funding body recommendations, e.g. Wellcome Trust: Data repositories and database sources