CISL's Research Data Archive (RDA): Description and Methods

CISL’s Research Data Archive (RDA): Description and Methods
Joseph L. “Joey” Comeaux
Computational & Information Systems Laboratory
National Center for Atmospheric Research

Outline
• Description of CISL RDA
• Metadata
• Sustainable Data Curation
• Considerations for Archiving Model Data
• Lessons learned

CISL Research Data Archive (RDA)
• Reference datasets maintained for use by the research community
• Primarily meteorological and oceanographic datasets
• Receives a high level of curation and stewardship
• More than 200 person-years invested in the RDA
• Managed by 8 staff members
• 616 datasets (~10-20 new datasets added annually)
• 438 TB (currently)
• 3.7 million files

Contents of the RDA 616 datasets

ACCESS MODES
NCAR Mass Storage System
- Primary mode of access (NCAR users)
- Most users from US
Internet
- Non-NCAR users
- Many international users

Storage Metrics MSS Online

Unique Users

Amount of Data Delivered

Long-term RDA user metrics

Comeaux/Worley/Dattore - SCD/DSS 1/31/2022 10

METADATA
Several Levels of Metadata
• Dataset
  o Search and discovery
  o Dataset usefulness
• File Level
  o Description of file content
  o Relates files to datasets

Dataset Level Metadata
• Model or Obs, Variables, Levels, POR (period of record) …
• Use controlled vocabularies (GCMD, ISO, THREDDS)
• Guided entry via a Web-based GUI
• Saved to a MySQL database (and XML files as backup)
• Exportable to DIF (NASA GCMD), THREDDS (UCAR CDP); can include others as needed
• Dynamically create dataset web pages
  § Easy to create user interfaces that search the metadata and return relevant results
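
The dataset-level workflow above (controlled vocabularies on entry, XML files as a secondary copy) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the vocabulary terms, the record fields, the identifier, and the XML layout are hypothetical and are not the RDA schema, DIF, or THREDDS.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary; a real archive would load terms
# from a published keyword list rather than hard-coding them.
ALLOWED_VARIABLES = {"air temperature", "geopotential height", "sea surface temperature"}
ALLOWED_TYPES = {"model", "observation"}


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if record["data_type"] not in ALLOWED_TYPES:
        problems.append(f"unknown data type: {record['data_type']}")
    for var in record["variables"]:
        if var not in ALLOWED_VARIABLES:
            problems.append(f"variable not in vocabulary: {var}")
    if record["start_year"] > record["end_year"]:
        problems.append("period of record ends before it starts")
    return problems


def to_xml(record):
    """Serialize a record to an XML string (illustrative layout, not DIF)."""
    root = ET.Element("dataset", id=record["id"])
    ET.SubElement(root, "title").text = record["title"]
    ET.SubElement(root, "dataType").text = record["data_type"]
    por = ET.SubElement(root, "periodOfRecord")
    por.set("start", str(record["start_year"]))
    por.set("end", str(record["end_year"]))
    variables = ET.SubElement(root, "variables")
    for var in record["variables"]:
        ET.SubElement(variables, "variable").text = var
    return ET.tostring(root, encoding="unicode")


# Made-up example record; the identifier and title are placeholders.
example = {
    "id": "ds000.0",
    "title": "Example gridded analysis",
    "data_type": "model",
    "variables": ["air temperature", "geopotential height"],
    "start_year": 1979,
    "end_year": 2002,
}

if __name__ == "__main__":
    print(validate_record(example))
    print(to_xml(example))
```

Rejecting free-text terms at entry time is what keeps later searches consistent, since every dataset describing the same quantity then uses the same keyword.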

File Content Metadata
• Scan a data file; inventory its contents
  o Command-line utilities read the data files and extract the metadata
  o Metadata are saved to a MySQL database and a system of XML files
  o Works with many Model and Obs formats
• Provides more detailed and up-to-date search/discovery metadata, leading to better (more relevant) results when searching for datasets
• Facilitates the discovery of specific data files within an RDA dataset
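
As a rough illustration of what "inventory its contents" can mean, the sketch below summarizes the records found in one file into file-level metadata. It assumes a format-specific decoder already yields (variable, level, valid_time) tuples; that decoder, the tuple layout, and the output field names are hypothetical and do not describe the actual RDA utilities.

```python
from datetime import datetime


def inventory(records):
    """Summarize decoded records into file-level metadata.

    `records` is an iterable of (variable, level, valid_time) tuples; the
    decoder that produces them from a real data format is assumed, not shown.
    """
    variables, levels, times = set(), set(), []
    for variable, level, valid_time in records:
        variables.add(variable)
        levels.add(level)
        times.append(valid_time)
    if not times:
        raise ValueError("file contains no records")
    return {
        "variables": sorted(variables),
        "levels": sorted(levels),
        "start": min(times),
        "end": max(times),
        "record_count": len(times),
    }


# Made-up records standing in for the output of a real decoder.
sample_records = [
    ("temperature", 500, datetime(1979, 1, 1, 0)),
    ("temperature", 850, datetime(1979, 1, 1, 6)),
    ("height", 500, datetime(1979, 1, 1, 0)),
]

if __name__ == "__main__":
    print(inventory(sample_records))
```

A summary like this, stored per file, is what lets a search answer "which files in this dataset contain 500 hPa heights for January 1979" without opening the data.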

File XREF Metadata
• Provides a cross-reference (Xref) from individual data files to datasets
• Command-line utilities archive data and create metadata
• Relies on MySQL
• Allows grouping and organization of files
• Tracks both MSS and Web files
• Tracks usage and allows metrics
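
A file-to-dataset cross-reference with usage tracking can be sketched as three relational tables. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table names, columns, and sample rows are all assumptions for illustration, not the RDA schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE datasets (id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE files (
    path       TEXT PRIMARY KEY,
    dataset_id TEXT NOT NULL REFERENCES datasets(id),
    location   TEXT NOT NULL CHECK (location IN ('mss', 'web')),
    file_group TEXT,
    size_bytes INTEGER NOT NULL
);
CREATE TABLE accesses (
    path        TEXT NOT NULL REFERENCES files(path),
    user_id     TEXT NOT NULL,
    accessed_on TEXT NOT NULL
);
"""


def dataset_for_file(conn, path):
    """Look up which dataset a file belongs to (None if the file is unknown)."""
    row = conn.execute("SELECT dataset_id FROM files WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None


def usage_by_dataset(conn):
    """Return (dataset_id, unique_users, bytes_delivered) per accessed dataset."""
    return conn.execute(
        """
        SELECT f.dataset_id, COUNT(DISTINCT a.user_id), SUM(f.size_bytes)
        FROM accesses AS a JOIN files AS f ON f.path = a.path
        GROUP BY f.dataset_id
        ORDER BY f.dataset_id
        """
    ).fetchall()


# Made-up sample rows; identifiers and paths are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO datasets VALUES (?, ?)",
                 [("ds_a", "Example A"), ("ds_b", "Example B")])
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", [
    ("/a/1", "ds_a", "mss", "1979", 100),
    ("/a/2", "ds_a", "web", "1980", 200),
    ("/b/1", "ds_b", "web", None, 50),
])
conn.executemany("INSERT INTO accesses VALUES (?, ?, ?)", [
    ("/a/1", "u1", "2008-01-01"),
    ("/a/2", "u1", "2008-01-02"),
    ("/a/2", "u2", "2008-01-03"),
    ("/b/1", "u2", "2008-01-03"),
])

if __name__ == "__main__":
    print(dataset_for_file(conn, "/a/2"))
    print(usage_by_dataset(conn))
```

Because every access row resolves to a dataset through the files table, the unique-user and data-delivered metrics fall out of a single join rather than a separate accounting system.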

METADATA
Advantages of a GOOD, ROBUST metadata system
Makes metrics easy to create:
• You can track dataset usage and users
• Provides information on archive size and growth
• Useful when analyzing future equipment and staff needs, and thus funding

METADATA
Advantages of a GOOD, ROBUST metadata system
Quality of metadata
• Directly related to the usefulness of search and discovery at both the dataset level and the individual file level
• Improves ability and speed for subset generation and automation
Improves the long-term viability of the archive
• Reduces the chances of losing or throwing out data that are not adequately described with metadata
• Facilitates preservation activities (backups, off-site replication, etc.)
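
One way file-level metadata can support preservation activities is a fixity check: if the catalog stores a checksum per file, a backup or off-site replica can be verified against it. The slides do not say the RDA does this; the sketch below is an assumption-laden illustration, with a hypothetical catalog represented as a plain dictionary of relative path to SHA-256 digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(catalog, replica_root):
    """Return the sorted relative paths that are missing or altered in a replica.

    `catalog` maps relative path -> expected SHA-256 hex digest (hypothetical
    stand-in for checksums held in a metadata database).
    """
    bad = []
    for rel_path, expected in catalog.items():
        candidate = Path(replica_root) / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

A typical use would be to build the catalog when files are first archived, then run `find_mismatches` against each backup copy on a schedule and re-copy anything it reports.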

Sustainable Data Curation
• Stable Funding
• Backup Plans
• Enriched Staff
  o Knowledgeable
  o Consistent Levels
• Data Formats
• Robust Storage
• Partnerships

Sustainable Data Curation
Stable Funding
Staff
• Focused on Data Management
  o Not project specific
  o Allows flexibility
  o Necessary to keep curated collection viable
• Knowledgeable and educated in the specific discipline
  o Important for checking integrity of data
  o Choosing organization of data
  o Creating adequate metadata
  o Designing access system and assisting users
• Consistent Staffing Levels
  o Dedicated to best practices in archiving and stewardship
  o Great deal of knowledge held by staff, regardless of documentation
  o Value of human-based knowledge should not be underestimated
  o We find ~10 years is good

Sustainable Data Curation
Robust Storage Facilities
• Capable of meeting growth needs
  o NCAR -> tape-based Mass Storage System (MSS)
  o Size more than doubles every 2.5 years
  o Currently > 6 PB
• Must be able to handle data migration across generations of media (oozing)
  o Tape sizes in MSS: 20 GB -> 60 GB -> 200 GB -> 1000 GB
  o Oozing must not interrupt normal, day-to-day operations
• Provide access speeds able to handle daily curation and stewardship activities
Backups
• Loss of data attributed to 2 general causes
  o Equipment, Environmental
  o Lack of knowledge
• Resolution
  o Store copies of irreplaceable data at separate facilities
  o Backup copies of data should be stored on different drives/tapes than originals
  o Knowledgeable Staff
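
The growth figures on this slide (size more than doubling every 2.5 years, currently above 6 PB) imply simple exponential planning arithmetic. The sketch below treats them as exact values, which is an assumption: the slide gives both as lower bounds, so the results are best read as minimum estimates. The function names are hypothetical.

```python
import math


def projected_size(current_pb, years, doubling_years=2.5):
    """Size after `years`, assuming steady doubling every `doubling_years`."""
    return current_pb * 2 ** (years / doubling_years)


def years_until(current_pb, target_pb, doubling_years=2.5):
    """Years until the archive reaches `target_pb` under the same assumption."""
    return doubling_years * math.log2(target_pb / current_pb)


if __name__ == "__main__":
    # Two doublings in 5 years: 6 PB -> 12 PB -> 24 PB.
    print(projected_size(6, 5))
    # Three doublings are needed to go from 6 PB to 48 PB.
    print(years_until(6, 48))
```

Under these assumptions the system would hold at least 24 PB after five years, which is the kind of figure that feeds the equipment and funding planning mentioned earlier.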

Sustainable Data Curation
Format
• Ensure data access for long term
  o Fully documented to the byte level
  o Non-proprietary
• Practices to avoid
  o Formats should not be dependent on OS, hardware or applications
  o Latest/Greatest formats not always best for your situation
Partnerships
• No single institute can “do it all”
• Most users “need/want it all”
• Good way to share some costs
• National and international

Reanalysis Projects
• Prime example of data curation and stewardship
• Encompass all 6 major aspects of good data curation
• A main feature of the RDA and a very valuable resource for a wide variety of climate and weather studies

Most Current Reanalysis Projects
              Temporal Range     Highest Resolution
Name          Start   End        Temporal   Horizontal   Vertical
NCEP/NCAR     1948    Ongoing    6 hours    209 km       17 Plvl
NCEP-DOE      1979    Ongoing    6 hours    209 km       17 Plvl
ECMWF ERA-40  1957    2002       6 hours    125 km       23 Plvl
NCEP NARR     1979    Ongoing    3 hours    32 km        29 Plvl
Japanese JRA  1979    Ongoing    6 hours    125 km       23 Plvl

Considerations for Archiving Model Output
• Know Your User Base
  o Manner in which data will be used
  o How to organize the data
  o Which model and what fields to archive
  o How long data from each model needs to be kept
• Backups
• Partnerships
• Plan storage carefully
• Create necessary metadata – dataset and file level

Considerations for Archiving Model Output
• Diverse delivery system for access – web/ftp/mss/media
• Transfer method for receiving archive
• Data tools and formats
• Known issues of models
  o Who/How will questions be handled
• Task often larger than expected
  o Reorganize to meet user needs
  o Fixes/changes to model output
  o Changes in model resolution, variables, levels
  o Sub-setting needed
  o Moving large model output around

LESSONS LEARNED
• Create necessary Metadata
  o Do not do just the minimal amount
  o Use standards where possible
  o Store in a useful, manageable system
  o Tightly couple files to datasets
  o Use dynamic web interfaces to reflect current state
• Organize archive files to align with ‘most’ user demands
• Offer multiple modes of access to the data
• Know your users
  o Track metrics so resources can be applied

LESSONS LEARNED
• How much software do you support?
• Balance between real-time access and delayed mode
• Simplify data access where possible
• Plan backup and recovery immediately
• Staff educated in the particular discipline needed
• Assign consultants to each dataset

Thank you
Questions and/or comments