
RDM Research Data Management Basics, Meilahti
Siiri Fuchs, 31.3.2020
University of Helsinki, Data Support, datasupport@helsinki.fi
ORCID: 0000-0003-1391-9959, siiri.fuchs@helsinki.fi

Materials: https://datasupport.helsinki.fi/services/courses-workshops
Direct link: https://wiki.helsinki.fi/x/jb5ZDQ
• Meilahti_RDM_SF_20200331.pptx
• Meilahti_RDM_data_spring-20.xlsx

Content
What & why: Research data management
1. Data types
2. Ethical & legal issues
3. Data documentation
4. Storing solutions by UH IT Services
5. Opening, publishing and archiving
6. Data management responsibilities and resources
Useful to know! (links etc.)

What is research data management

Why manage research data?

Why manage research data? FAIR principles
These principles provide guidance for scientific data management and promote the maximum use of research data.
• Council of the European Union: Optimal reuse of research data can only be realized if data are consistent with the FAIR principles.
• The Ministry of Education and Culture is committed to these principles, and the Fairdata services are being developed.

FAIR principles
• Findable: Data and supplementary materials have sufficiently rich metadata and a persistent identifier (e.g. DOI).
• Accessible: Metadata and data are understandable to humans and machines. Data is deposited in a trusted repository.
• Interoperable: Metadata use a formal, accessible, shared, and broadly applicable language, and data uses vocabularies.
• Reusable: Data and collections have clear usage licenses and provide information on provenance.

1. What is data?

1. Data types
• Research material = / ≠ data: a general term, or only digital material?
• Data collected by various methods: (biological) samples, measurements, surveys, interviews, imaging techniques, curated collections, etc.
• Data produced during the research project: analysis results, sequences, field diaries, physical or electronic lab journals, copies of physical artifacts, source code, algorithms, software, etc.
• Data reused from various origins: biobank samples, archive materials, data from repositories, code, etc.

1. What is your data?
Data type | Source of the data | File format
DNA sample | Collected | Sample tube
Analysed DNA sample | Produced | .xlsx
Statistical data x | Reused from Tilastokeskus (Statistics Finland) | Database
Size estimate: 2 GB

1. Data types
Prefer widely used open formats.
Data type | Source of the data | File format | Size estimate
Questionnaire | Collected | .pdf, .docx, .xlsx, paper |
Analyzed questionnaire | Produced | .csv, .xlsx |
RNA sample | Collected | |
Gene expression (qPCR) results | Produced | .xlsx |
Sequence | Produced | FASTA, BAM, .xlsx |
MRI images | Reused from X | .tiff |
Videos | Collected | .avi |
Transcripts of the videos | Produced | .docx |
Analysis codes | Produced | .txt |
Lab notebooks + metadata files | Produced | Paper, SciNote program, .txt, .csv |
Managerial documents (consents, agreements, contracts etc.) | Collected / produced | Paper, .pdf |
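The file format and size estimate columns of such an inventory can be filled in from an existing project folder. The following is a minimal sketch, assuming the data already sits in one directory tree; the function names `summarize_by_extension` and `format_size` are illustrative and not part of any UH tool.

```python
from collections import defaultdict
from pathlib import Path


def summarize_by_extension(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lower-case the suffix so ".CSV" and ".csv" are counted together.
            ext = path.suffix.lower() or "(no extension)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}


def format_size(num_bytes):
    """Format a byte count with decimal units (1 GB = 10**9 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


# Example use (the folder name is an assumption):
# for ext, (count, size) in sorted(summarize_by_extension("my_project").items()):
#     print(ext, count, format_size(size))
```

Grouping by extension also makes it easy to spot formats that could be exported to a more widely used open format before publishing.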

RESEARCH & LAW: A VERY BRIEF OVERVIEW BY SIRPA KOVANEN, LEGAL COUNSEL, RESEARCH SERVICES

2. Ethical and legal issues
• Data protection guide for researchers by UH (Flamma)
• Data protection Yammer group (UH, Yammer): ask questions and find relevant documents (e.g. informing participants)
• Instruction on concluding an agreement
• tutkimuksenjuristit@helsinki.fi
• FSD: Informing Research Participants

2. Ethical and legal issues – your data?
Data type | Sensitivity / personal data + controller | Owner
Questionnaire | Yes + controller UH / HUS | UH, agreements done
Analyzed questionnaire | No, anonymized | UH, agreements done

3. Data documentation = describing the data

3. Data documentation = describing the data
Data about data = metadata
• Data about the data: interpretation of the data. What are the specific data files, where to find them, how have they been named, what do the variables mean, etc.
• Project metadata = "label" of the data set, context; should always be published. How, when and where the data was collected. Subject, keywords, etc.
• Discovery metadata / administrative metadata: label of the data set + persistent identifier + research records (e.g. administrative documents and other research-related descriptions).

3. Data documentation
• Should be planned before starting to collect data, by creating a data management plan (DMP) alongside the actual research plan.
• Be proactive: it is not a huge task when started in time!
• Why do it?
• Data will be more understandable for you as well as for others = easier to share.
• Investing in documentation during the project saves time when publishing the dataset.
• Competitive advantage!

3. Data documentation: General rules to follow
1. If possible, use metadata standards and controlled vocabularies.
• Describe data in a controlled format using vocabularies.
• Use field- and discipline-specific standards, if suitable standards exist.
• Some data repositories require the use of a metadata standard.
Where to find:
• The Digital Curation Centre (DCC) has gathered discipline-specific metadata standards.
• The EMBL-EBI Ontology Lookup Service is a repository for biomedical ontologies.

3. Data documentation: General rules to follow
2. If available, use data management software to make documenting easier.
• Software takes in various data and converts it into a database. Metadata is generated automatically when new inputs are made.
• Easy error spotting: inputs out of range can be detected automatically.
• Electronic laboratory notebooks (Splice bio article 2019)
• Easy to share and control access; usually have safe storage and search tools
• E.g. SciNote, Benchling, RSpace
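The idea of automatic out-of-range detection can be illustrated without any particular data management software. This is a minimal sketch under stated assumptions: the variable names and limits in `ALLOWED_RANGES` are invented for illustration, and real tools will have their own validation rules.

```python
# Hypothetical limits: variable name -> (minimum, maximum) allowed value.
ALLOWED_RANGES = {
    "age_years": (18, 100),
    "body_temp_c": (30.0, 45.0),
}


def find_out_of_range(rows, allowed_ranges):
    """Return (row_index, variable, raw_value) for each suspicious entry.

    An entry is reported when it is missing, not a number, or outside
    the allowed range for its variable.
    """
    problems = []
    for index, row in enumerate(rows):
        for variable, (low, high) in allowed_ranges.items():
            raw = row.get(variable)
            try:
                value = float(raw)
            except (TypeError, ValueError):
                problems.append((index, variable, raw))
                continue
            if not low <= value <= high:
                problems.append((index, variable, raw))
    return problems


rows = [
    {"age_years": "34", "body_temp_c": "36.8"},
    {"age_years": "340", "body_temp_c": "36.5"},  # typing error in age
    {"age_years": "51", "body_temp_c": ""},       # missing measurement
]
print(find_out_of_range(rows, ALLOWED_RANGES))
```

Running such a check at data entry time, rather than at analysis time, is what makes the error spotting "easy".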

3. Data documentation: General rules to follow
3. Get familiar with the following methods and choose the ones that suit you:
• Data dictionaries and codebooks: dictionaries explain the variables used in a dataset. Codebooks are collections of the codes, algorithms and calculations used.
• Directory structure: create a folder structure to suit your project needs.
• Tagging files: tags are keywords assigned to files, which make organizing and searching files easier.
• File naming conventions: create a meaningful but brief system with unique names.
• Version control: an automatic version control system is preferred.
• Readme files: text documents (e.g. in .txt format) providing information about data files to ensure they are interpreted correctly.
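A file naming convention is easier to follow when names are generated and checked by a small script. The sketch below assumes one possible convention, `<project>_<content>_<YYYYMMDD>_v<NN>.<ext>`; the pattern and the helper names `build_name` and `parse_name` are hypothetical, and each project should define its own system.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<content>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"(?P<project>[A-Za-z0-9-]+)_"
    r"(?P<content>[A-Za-z0-9-]+)_"
    r"(?P<date>\d{8})_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)"
)


def build_name(project, content, day, version, ext):
    """Build a file name that follows the convention, or raise ValueError."""
    name = f"{project}_{content}_{day:%Y%m%d}_v{version:02d}.{ext}"
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name does not follow the convention: {name!r}")
    return name


def parse_name(name):
    """Return the parts of a conforming name as a dict, or None otherwise."""
    match = NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


print(build_name("meilahti", "qpcr-results", date(2020, 3, 31), 1, "csv"))
print(parse_name("final_FINAL2.xlsx"))
```

Putting the date in YYYYMMDD form and zero-padding the version number keeps files in chronological order when sorted by name.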

3. Data documentation: General rules to follow
Guide: https://www.helsinki.fi/en/research/guide-for-datadocumentation
PDF guide: https://doi.org/10.5281/zenodo.1683181

4. Storing solutions by UH IT Services

3. Data documentation – storage?
Data type | Metadata / documentation | Storage during project
Questionnaire | Readme.txt | File cabinet in PI room
Analyzed questionnaire | Data dictionary, file naming system | UH Group folder
Measurement data with personal identifiers | | CSC ePouta / UH Umpio
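The readme file and data dictionary listed as documentation can be produced as plain text and CSV, two widely used open formats. The following is a rough sketch only: the field names (`variable`, `description`, `unit`, `allowed_values`) and the readme headings are assumptions, not a UH template.

```python
import csv
import io

# Assumed column set for a simple data dictionary.
FIELDS = ["variable", "description", "unit", "allowed_values"]


def data_dictionary_csv(variables):
    """Serialize a list of variable descriptions (dicts) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(variables)
    return buffer.getvalue()


def readme_text(title, creator, collected, files):
    """Build a short readme; files is a list of (file_name, description)."""
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        f"Data collected: {collected}",
        "",
        "Files:",
    ]
    for name, description in files:
        lines.append(f"  {name}: {description}")
    return "\n".join(lines) + "\n"


variables = [
    {"variable": "age_years", "description": "Age at sampling",
     "unit": "years", "allowed_values": "18-100"},
]
print(data_dictionary_csv(variables))
print(readme_text("Example questionnaire", "Example Researcher", "2020-03",
                  [("data.csv", "analyzed responses")]))
```

The resulting text can be saved next to the data files, for example as Readme.txt and a separate dictionary file in the same folder.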

5. Opening, publishing and archiving How, when, where and to whom will the data be made available? How and where will data with long-term value be made available?

5. Opening, publishing and archiving
How, when, where and to whom will the data be made available?
"As open as possible, as closed as necessary"
• All data should be opened if possible, usually at the end of the project.
• Open science principles do not force you to open everything.
• Sensitive data (including pseudonymized data) cannot be opened, but its metadata (description) most likely can be.
Benefits of sharing data
Personal benefits:
• More visibility
• Contacts & joint publications
• Scientific merits
Community benefits:
• Your institute benefits from your success
• Transparency of data
• Efficient use of research funds

5. Opening, publishing and archiving
How, when, where and to whom will the data be made available?
1. Publishing and sharing data in a data repository
2. Publishing in a data journal
Why use online repositories?
• Preservation of data beyond work contract length
• Access to data from anywhere
• Discoverability of data by search engines
• Citation system and PIDs
• Bookkeeping of data downloads
• Getting visibility for your work
• Funders and publishers require data to be made available online.

5. Opening, publishing and archiving
1. Publishing and sharing data in a data repository
• Make your data findable, accessible and citable, and/or comply with funder requirements, by choosing a repository where:
  o A persistent identifier (e.g. DOI, URN, Handle, ARK, PURL) is given: a permanent link which points to the data, making your data findable and citable;
  o A license is given or can be chosen, creating clarity and certainty for potential users of your data.
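How a persistent identifier becomes a permanent link can be shown with a DOI, which resolves through https://doi.org/. This is a minimal sketch: the regular expression reflects the common shape of modern DOIs rather than the full DOI syntax, and the helper names `normalize_doi` and `doi_url` are illustrative.

```python
import re
from urllib.parse import quote

# Common shape of modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
# This is a practical approximation, not the complete DOI specification.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(text):
    """Strip a resolver URL or 'doi:' prefix; return the bare DOI or None."""
    candidate = text.strip()
    for prefix in PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    return candidate if DOI_PATTERN.fullmatch(candidate) else None


def doi_url(doi):
    """Return the resolver link for a bare DOI."""
    return "https://doi.org/" + quote(doi, safe="/")


# The DOI of the PDF guide cited in the documentation section of this lecture.
doi = normalize_doi("doi:10.5281/zenodo.1683181")
print(doi_url(doi))
```

Citing the resolver link rather than the repository's current web address keeps the reference valid even if the repository reorganizes its site.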

5. Opening, publishing and archiving
1. Publishing and sharing data in a data repository
How to choose a license?
o A license states what a user is allowed to do with your data and creates clarity and certainty for potential users. CC0 is recommended by UH.
o Creative Commons: CC Choose
o UH license guide & guide in Finnish

5. Opening, publishing and archiving
1. Publishing and sharing data in a data repository
Choose a repository: How to choose?
• Favour archives with a certificate for long-term preservation (e.g. CoreTrustSeal-certified).
• What are the costs per dataset or gigabyte?
• What is the physical storage location of data? EU or US?
• What is the default license?
• Is long-term preservation guaranteed or not?
Look for repositories!
• Re3data
• ELIXIR Deposition Databases for Biomolecular Data
• B2Share
• IDA (general) & Etsin (metadata) by CSC
• Zenodo
• Dryad
• Figshare
• FSD (The Finnish Social Science Data Archive): questionnaires, interviews etc.

5. Opening, publishing and archiving
2. Publishing in a data journal
• Publish the dataset in a peer-reviewed data journal.
• Typically, a publication in a data journal consists of an abstract, an introduction, a data description with methods and materials, and a short conclusion on reuse opportunities.
• There are general and disciplinary data journals. Examples of generic data journals: Scientific Data; Data in Brief; Data Science Journal.
https://datasupport.helsinki.fi/services/guide-publishing-and-opening-your-data

5. Opening, publishing and archiving
Long-term preservation >25 years
• For tens and even hundreds of years = over generations
• Technical challenge:
  o Hardware, software and file formats age, but the information must still be kept usable.
  o Repositories specialized in long-term preservation will keep the bits safe.
• Data should be self-explanatory = documentation well done
• Fairdata-PAS is coming. UH instructions here.
“All information that is needed to replicate a study should be preserved, and everything that is potentially useful for others.” – Sarah Jones / DCC
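One common way to check that "the bits" of a dataset are still intact is to store a checksum for every file and recompute it later. The sketch below uses SHA-256 from the Python standard library; it illustrates the general fixity-checking idea and is not a description of how Fairdata-PAS or any specific repository works. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative_path: checksum} for every file under root."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed or altered since the manifest."""
    current = build_manifest(root)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


# Example use (the folder name is an assumption):
# manifest = build_manifest("my_dataset")   # store this alongside the data
# print(changed_files("my_dataset", manifest))
```

A manifest like this only detects change; keeping the information usable as formats age still depends on documentation and on the repository's preservation services.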

5. Opening, publishing and archiving – your data?
Columns: Data type; Opening; Long-term archiving
• Questionnaire: No, metadata in Etsin
• Analyzed questionnaire: xxx; No, discarded after 15 years; xxx
• DNA sequence: ENA; -
• DNA samples: -; Biobank x (archiving)
• Electron microscopy images: EMDB, will be saved there x years; -

6. Data management responsibilities and resources

6. Data management responsibilities and resources
• Who is responsible for data management tasks?
  o Are responsibilities allocated to one person or is the whole research group involved?
  o Who is responsible for, and controls, data protection and information security issues?
  o Is an expert / employee needed?
• What resources (time & workload) are needed for data management?

www.menti.com

We need your help – give us feedback! Thank you!
https://elomake.helsinki.fi/lomakkeet/94070/lomake.html

Useful to know…

SUCCESS: RDM basic lecture, DMP workshop, DMP review service (datasupport@helsinki.fi)

Guides & help
• Research data management guide (UH)
• Datanhallinnan perusopas (basic guide to data management, in Finnish)
• Open access guide
• Research data services at UH
• DMPTuuli for making a Data Management Plan
• Help in any matter regarding research data: datasupport@helsinki.fi

ELIXIR Finland
• ELIXIR Finland provides an authentication and authorisation infrastructure service to manage data access applications and access rights to sensitive datasets, analysis software for gene data (Chipster), cloud services (CSC Cloud) and training.
• http://www.elixir-finland.org/en/frontpage/
• ELIXIR training portal (TeSS): https://tess.elixir-europe.org/
• ELIXIR Deposition Databases for Biomolecular Data