Comparative study of Data Repository Software with Reference

- Slides: 31
Comparative Study of Data Repository Software with Reference to Harvesting Data in the Context of Library and Information Science
Prof. Dr. Mohan Raj Pradhan
Ms. Parbati Pandey
Introduction
• Data is becoming more important to business decisions.
• This requires tools that can collect, store and help analyze data.
• A data repository is a tool that is common in scientific research but also useful for managing business data.
• A data repository is also known as a data library or data archive.
• A data repository is a large database infrastructure that can collect, manage, and store data sets for data analysis, sharing and reporting.
Data Repository: The What
• IT infrastructure (cloud-based/online) set up to manage, share, access, maintain, and archive datasets.
• An application database specialized in storing metadata of data files, datasets and databases.
• Differs from a publication repository mainly in its ability to:
  • Store metadata at different levels of a hierarchy.
  • Store and ingest data files in various formats for long-term preservation.
http://www.infotoday.com/cilmag/apr16/Uzwyshyn--Research.Data-Repositories.shtml
Data Repository: The Why
• Easy information discovery
• Easy and efficient access
• More contact and intensified impact
• Persistent access (through a persistent URL), i.e. data is made citable through the assignment of a DOI (Digital Object Identifier)
• Long-term storage and preservation
• Allows unprecedented use, analysis and discovery through interoperability and interlinking with other repositories
• Most major federal grant agencies (NIH, NSF, NEH, USDA) require data access as a mandatory part of the grant proposal and oversight process
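The point about persistent access can be illustrated with a minimal Python sketch that turns a DOI into a resolvable URL. The function name `doi_to_url` is hypothetical and the validation is deliberately loose; it only assumes that DOIs start with the `10.` prefix and are resolved through `https://doi.org/`.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI into a persistent, resolvable URL (illustrative helper)."""
    doi = doi.strip()
    # Accept DOIs already written with a "doi:" prefix or as a resolver URL.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only: a DOI has a "10." prefix and a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1007/s10209-016-0475-y"))
```

A citation that carries this URL keeps working even if the repository later moves the dataset to a different landing page, because only the DOI registration has to be updated.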
Data Repository: The Why
• Collecting all data at one place
• Statistics on downloads and citations
What makes Data Management Repositories useful?
• Makes faculty, departmental and institutional research available
• Allows publication of negative data
Research Data Repository Software Characteristics
• Hosted locally or remotely on a server
• Software contains collaborative options
• Open source or proprietary software
• Wide variety of data types (Excel to SPSS to various discipline-specific formats)
Perceived Benefits of Data Repository
• Can share publications and research data
• Makes research data more widely available
• Statistics available on downloads and citations of data
• Saving various versions of a dataset (data lifecycle)
• Collecting all data in one place
Research data management tools
• A survey was done to identify currently implemented standards, requirements and features related to research data repositories.
• Based on this, five well-known platforms were chosen for this study: DSpace, CKAN, Zenodo, Figshare and Dataverse.
• These tools are evaluated according to a set of key aspects:
  • architecture,
  • metadata handling capabilities,
  • interoperability,
  • content dissemination,
  • search features, and
  • community acceptance.
Architecture (platform order: DSpace, CKAN, Figshare, Zenodo, Dataverse)
• Deployment: Installation package; Service; Installation package
• Storage location: Local or remote; Remote; Local or remote
• Maintenance costs: Infrastructure management; Monthly fee; e-mail based free of cost
• Open source: √ √ × × √
• Platform customization: √ √ ×
• Community policies: √
• Embargo period: √
• Private storage: √ √
• Content versioning: × √ × × √
• Pre-reserving DOI: √ × √ √ √
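The rows of the architecture comparison that carry a mark for all five platforms can be queried as a small feature matrix. The following Python sketch encodes only those three complete rows (open source, content versioning, pre-reserving DOI); the names `FEATURES` and `platforms_with` are illustrative and not part of any of the platforms.

```python
PLATFORMS = ["DSpace", "CKAN", "Figshare", "Zenodo", "Dataverse"]

# One list per feature, in the same order as PLATFORMS (True = check mark).
FEATURES = {
    "open_source":        [True,  True,  False, False, True],
    "content_versioning": [False, True,  False, False, True],
    "pre_reserving_doi":  [True,  False, True,  True,  True],
}


def platforms_with(*features: str) -> list:
    """Return the platforms that offer every requested feature."""
    columns = [FEATURES[name] for name in features]
    return [
        platform
        for index, platform in enumerate(PLATFORMS)
        if all(column[index] for column in columns)
    ]


print(platforms_with("open_source", "content_versioning"))
print(platforms_with("open_source", "pre_reserving_doi"))
```

Combining criteria this way makes the trade-offs in the comparison explicit: an institution that needs both an open-source platform and content versioning is left with a shorter list than one that needs either feature alone.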
Architecture… Metadata (platform order: DSpace, CKAN, Figshare, Zenodo, Dataverse)
• Required fields: DSpace: Title, Date of issue; CKAN: Title; Figshare: Author, title, categories, description; Zenodo: Type, DOI, author, title, description; Dataverse: Title, Author, Description, Contact Email, Subject, and DOI
• Exporting schemas: DSpace: Any pre-loaded schema; CKAN: ×; Figshare: DC; Zenodo: DC, MARCXML; Dataverse: XML
• Schema flexibility: DSpace and CKAN: Flexible; Figshare: Fixed; Zenodo: Fixed; Dataverse: Flexible
• Validation: √ × × √ √
• Versioning: × √ × × √
Architecture… Dissemination (platform order: DSpace, CKAN, Figshare, Zenodo, Dataverse)
• API: √ √ √
• OAI-PMH compliance: DSpace: √; CKAN: with ckanext-harvest installer; Figshare: √; Zenodo: √; Dataverse: √
• Faceted search: √ √ √
Architecture
• Most of the above-mentioned software is open source and gives users some flexibility.
• Speedy and simple deployment of the software is a crucial part of the implementation.
• Open-source software can be installed in house, whereas platforms like Figshare and Zenodo are installed and implemented with the help of the developer.
• DSpace, Dataverse and CKAN give better control over the recorded data, as they are open source.
Architecture…
• Proprietary platforms such as Figshare or Zenodo are not viable platforms for researchers and institutions, as they have to rely on the developers.
• DSpace, CKAN, Dataverse and Zenodo permit customization, with improvements ranging from small interface modifications to the development of new data visualization plugins to satisfy the needs of their users; in Zenodo this takes the form of parametrized settings, such as community-level policies, that can be further customized.
• DSpace, Zenodo and Dataverse let users stipulate an embargo period, whereas CKAN and Figshare offer private (reserved) storage to let researchers control the data publication mode.
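The two publication controls mentioned above, embargo periods and private storage, can be sketched as a simple access check. This is a minimal Python illustration under assumed names (`Deposit`, `embargo_until`, `private`, `is_publicly_downloadable`); none of the platforms is claimed to implement it this way.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Deposit:
    title: str
    embargo_until: Optional[date] = None  # None means no embargo was set
    private: bool = False                 # private storage: never public


def is_publicly_downloadable(deposit: Deposit, today: date) -> bool:
    """A deposit is public once it is not private and any embargo has ended."""
    if deposit.private:
        return False
    return deposit.embargo_until is None or today >= deposit.embargo_until


survey = Deposit("Reader survey 2019", embargo_until=date(2020, 1, 1))
print(is_publicly_downloadable(survey, date(2019, 6, 30)))  # still embargoed
print(is_publicly_downloadable(survey, date(2020, 1, 1)))   # embargo has ended
```

The difference between the two controls is who lifts the restriction: an embargo expires on its own at a known date, while private storage stays closed until the researcher changes it.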
Metadata
• Zenodo and Figshare are able to export records that comply with established metadata schemas (Dublin Core and MARC-XML respectively).
• DSpace goes further by exporting DIPs (Dissemination Information Packages) that include METS metadata records, thus enabling the ingestion of these packages into a long-term preservation workflow.
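To make "exporting records that comply with Dublin Core" concrete, here is a minimal Python sketch that serializes a record into an `oai_dc` XML fragment using only the standard library. The namespaces are the standard Dublin Core and OAI ones; the function name `to_dublin_core` and the sample record are assumptions made for illustration, not any platform's export code.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# The fifteen elements of simple (unqualified) Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_dublin_core(record: dict) -> str:
    """Serialize {element: value or list of values} as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        if isinstance(values, str):
            values = [values]  # every DC element is repeatable
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record.
print(to_dublin_core({
    "title": "Sample dataset",
    "creator": ["Doe, J.", "Roe, R."],
    "date": "2019-06-30",
}))
```

Because every element is optional and repeatable, a fixed schema such as Dublin Core is easy to exchange between repositories but cannot carry domain-specific fields, which is the limitation the flexible platforms address.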
Metadata…
• Although CKAN and Dataverse metadata records do not follow any standard schema, the platforms allow the inclusion of a dictionary of key-value pairs that can be used to record domain-specific metadata as a complement to generic metadata descriptions.
• Neither platform natively supports collaborative validation stages where curators and researchers enforce the correct data and metadata structure, but Zenodo allows users to create a highly curated area within communities, as highlighted in the “validation” feature.
• If the policy of a particular community specifies manual validation, every deposit has to be validated by the community curator.
• Tracking content changes is an important issue in data management. CKAN provides an audit trail for each deposited dataset by showing all changes made to it since its deposit.
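The idea of complementing generic metadata with a dictionary of key-value pairs can be sketched as follows. The record layout, a list of `{"key": ..., "value": ...}` entries under `extras`, is loosely modelled on the way CKAN datasets carry extra fields; the helper names `build_dataset` and `get_extra` and the sample values are hypothetical.

```python
def build_dataset(title: str, extras: dict) -> dict:
    """Combine a generic field (title) with free-form key-value metadata."""
    return {
        "title": title,
        # Values are stored as strings; keys are sorted for stable output.
        "extras": [
            {"key": key, "value": str(value)}
            for key, value in sorted(extras.items())
        ],
    }


def get_extra(dataset: dict, key: str, default=None):
    """Look up one domain-specific field by its key."""
    for pair in dataset["extras"]:
        if pair["key"] == key:
            return pair["value"]
    return default


# Hypothetical domain-specific fields for a soil science dataset.
soil = build_dataset("Soil samples", {"instrument": "pH meter", "depth_cm": 30})
print(soil["extras"])
print(get_extra(soil, "depth_cm"))
```

The flexibility comes at a price: nothing in this structure stops two depositors from using different keys for the same concept, which is why the validation stage discussed above matters.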
Interoperability and Dissemination
• All of the evaluated platforms allow the development of external clients and tools, as they already provide their own APIs for exposing metadata records to the outside community, but there are some differences regarding standards compliance.
• Zenodo and DSpace natively comply with OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting).
• This is a widely used protocol that promotes interoperability between repositories while also streamlining data dissemination, and it is a valuable resource for harvesters that index the contents of a repository.
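A harvester's use of OAI-PMH can be sketched in a few lines of Python. The example builds a `ListRecords` request URL and parses a response; to stay self-contained it parses an embedded sample response instead of contacting a server. The base URL `https://repo.example.org/oai`, the sample record and the helper names are assumptions; the verb, the `metadataPrefix` parameter and the XML namespaces come from the protocol itself. Paging with `resumptionToken` is left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     set_spec: str = "") -> str:
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return base_url + "?" + urlencode(params)


def parse_records(xml_text: str) -> list:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


# Hypothetical response, shaped like an oai_dc ListRecords reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample dataset</dc:title>
          <dc:creator>Doe, J.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

print(list_records_url("https://repo.example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

Because every compliant repository answers the same request in the same XML structure, one harvester of this kind can index several repositories without platform-specific code, which is the interoperability benefit named above.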
Advantage of DSpace
• Can comply with domain-level metadata schemas
• Is open-source and has a wide supporting community
• Has extensive, community-maintained documentation
• Can be fully under the institution's control
• Structured metadata representation
• Compliant with OAI-PMH
• Supports Dublin Core and MARCXML for metadata exporting
Advantage of CKAN
• Is open-source and widely supported by the developer community
• Features extensive and comprehensive documentation
• Allows deep customization of its features
Advantage of CKAN…
• Can be fully under the institution's control
• Supports unrestricted (non-standards-compliant) metadata
• Has faceted search with fuzzy matching
• Records dataset change logs and versioning information
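Faceted search with fuzzy matching can be illustrated with a small, self-contained Python sketch. It is not how CKAN implements search; it only shows the two ideas with the standard library: counting values per facet, and tolerating a misspelled query through `difflib.get_close_matches`. The sample datasets and function names are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical catalogue entries.
DATASETS = [
    {"title": "Rainfall records", "format": "CSV", "tags": ["climate"]},
    {"title": "Library loans", "format": "XLS", "tags": ["library"]},
    {"title": "Rainfall forecast", "format": "CSV", "tags": ["climate", "forecast"]},
]


def facet_counts(datasets: list, field: str) -> dict:
    """Count how many datasets carry each value of a facet field."""
    counts = Counter()
    for dataset in datasets:
        value = dataset[field]
        counts.update(value if isinstance(value, list) else [value])
    return dict(counts)


def fuzzy_search(datasets: list, query: str, cutoff: float = 0.8) -> list:
    """Return titles containing a word that closely matches the query."""
    hits = []
    for dataset in datasets:
        words = dataset["title"].lower().split()
        if get_close_matches(query.lower(), words, n=1, cutoff=cutoff):
            hits.append(dataset["title"])
    return hits


print(facet_counts(DATASETS, "format"))
print(fuzzy_search(DATASETS, "rainfal"))  # misspelled on purpose
```

The facet counts are what a search interface shows next to each filter, and the fuzzy step is what lets a misspelled query still reach the intended datasets.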
Advantage of Figshare
• Gives credit to authors through citations and references
• Can export references to Mendeley, DataCite, RefWorks, EndNote, NLM and Reference Manager
• Records statistics related to citations and shares
• Does not require any maintenance
Advantage of Zenodo
• Allows creating communities to validate submissions
• Supports Dublin Core, MARC and MARCXML for metadata exporting
• Can export references to BibTeX, DataCite, DC, EndNote, NLM, RefWorks
• Complies with OAI-PMH for data dissemination
• Does not require any maintenance
• Includes metadata records in the searchable fields
Advantage of Dataverse
• Is open-source and widely supported by the developer community
• Generates data citations automatically
• Offers multiple publishing workflows
• Faceted search as well as tags can be used for searches
Advantage of Dataverse…
• Predefined roles, plus custom roles that can be designed and assigned to users
• Branding, metadata-based facets, sub-dataverses, featured dataverses
• Re-formatting, summary statistics, and analysis for tabular files through integration with TwoRavens
Advantage of Dataverse…
• Mapping of geospatial files and integration with WorldMap
• Restricted files, as well as the ability to request access to restricted files
• Three levels of metadata, i.e. description/citation, domain-specific or custom fields, and file metadata
Advantage of Dataverse…
• Search API, data deposit API, etc.
• Notifications are shown to the user and also sent by e-mail for access requests, role changes, and when data is published
• CC0 waiver by default, terms of use that can be customised by the user, and download statistics
• Can export references to EndNote XML, RIS or BibTeX format
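Reference export, which several of the platforms offer, amounts to rendering the citation metadata of a dataset in a format that reference managers understand. The Python sketch below produces a BibTeX `@misc` entry; the function name `to_bibtex`, the choice of `@misc`, and the sample dataset (including its DOI) are assumptions for illustration and do not reproduce the output of Dataverse or any other platform.

```python
def to_bibtex(key: str, authors: list, title: str, year: int,
              publisher: str, doi: str) -> str:
    """Render dataset citation metadata as a BibTeX @misc entry."""
    fields = {
        "author": " and ".join(authors),  # BibTeX separates authors with "and"
        "title": title,
        "year": str(year),
        "publisher": publisher,
        "doi": doi,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


# Hypothetical dataset; the DOI below is a placeholder, not a real identifier.
print(to_bibtex(
    key="doe2019survey",
    authors=["Doe, J.", "Roe, R."],
    title="Reader survey 2019",
    year=2019,
    publisher="Example Repository",
    doi="10.1234/example.5678",
))
```

RIS and EndNote XML exports follow the same pattern with a different serialization, so a repository usually keeps one internal citation record and renders it on request.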
Conclusion
• The open-source licenses of Dataverse, CKAN and DSpace were highlighted, as they allow the platforms to be updated and customized while keeping the core functionalities intact.
• Live demos are available for Dataverse, CKAN, DSpace and Zenodo.
• CKAN is mainly used by governmental institutions to disclose their data.
• DSpace enables system administrators to parametrize additional metadata schemas that can be used to describe resources.
Conclusion…
• DSpace is often compared with Dataverse and is used for storing scientific data.
• Zenodo and Figshare provide ways to reserve a permanent link and a DOI, even if the actual dataset is under embargo at the time of first citation.
• DSpace, Dataverse and CKAN can be installed on an institutional server instead of relying on external storage provided by contracted services.
Conclusion…
• Dataverse repository software focuses mainly on social science data; its tools for analyzing and exploring data work only for tabular data.
• Geospatial data is handled by Dataverse with the help of WorldMap.
• Dataverse also has features such as the Guestbook template, which records the details of users who download the data.
Re-Mix
• Harvesting XML format
• Notification through e-mail whenever an update is made.
• LiveDVD-Koha, DSpace, VuFind, SubjectsPlus, and WordPress. Plugin of VuFind.
References
• Amorim, R. C., Castro, J. A., Rocha, J., & Ribeiro, C. (2015). A Comparative Study of Platforms for Research Data Management: Interoperability, Metadata Capabilities and Integration Potential. In L. P. R. Alvaro Rocha, Ana Maria Correia, Sandor Costanzo (Eds.), Maturity, Benefits and Project Management Shaping Project Success (pp. 101–111). Springer International Publishing. https://doi.org/10.1007/978-3-319-16486-1_10
• Amorim, R. C., Castro, J. A., Rocha da Silva, J., & Ribeiro, C. (2017). A comparison of research data management platforms: architecture, flexible metadata and interoperability. Universal Access in the Information Society, 16(4), 851–862. https://doi.org/10.1007/s10209-016-0475-y
• Breu, F., Guggenbichler, S., & Wollmann, J. (2008). Research and Advanced Technology for Digital Libraries. Vasa.
• Brook, C. (2018). What is a Data Repository. Retrieved June 30, 2019, from https://digitalguardian.com/blog/what-data-repository
• Devarakonda, R., Palanisamy, G., Green, J. M., & Wilson, B. E. (2011). Data sharing and retrieval using OAI-PMH. Earth Science Informatics, 4(1), 1–5. https://doi.org/10.1007/s12145-010-0073-0
• Institute for Quantitative Social Sciences. (2019). Features The Dataverse Project. Retrieved June 30, 2019, from https://dataverse.org/softwarefeatures
• Lyon, L. (2007). Dealing with Data: Roles, Rights, Responsibilities and Relationships. Consultancy Report. JISC Digital Repositories Conference, Manchester, June 2007, 1–65.
• Mahato, S. S., & Gajbe, S. B. (2018). A Comparative study of Open source data repository software: Dataverse and CKAN. Library Herald, 56(1), 36. https://doi.org/10.5958/0976-2469.2018.00005.2
• Rocha da Silva, J., Ribeiro, C., & Correia Lopes, J. (2012). Managing multidisciplinary research data: Extending DSpace to enable long-term preservation of tabular datasets. iPres 2012 Conference, 105–108. Retrieved from https://ipres.ischool.utoronto.ca/sites/ipres.ischool.utoronto.ca/files/iPres 2012 Conference Proceedings
• Willis, C., Greenberg, J., … (2012). Analysis and Synthesis of Metadata Goals. Journal of the American Society for Information Science and Technology, 63(8), 1505–1520. https://doi.org/10.1002/asi