Introduction to Biological Databases and Data Archiving: Creating and Maintaining a Data Archive
LIFECYCLE SUPPORT CRADLE TO GRAVE – NO GRAVE
PDB as an Example
Pre-Deposition
• Pre-deposition anonymous validation services
• Data harvesting services and tools
• Chemical reference data
• Metadata reference dictionaries
• Documentation for data formats and data processing procedures
At Deposition
• Capture original data sets, descriptive metadata, and supporting data sets from the depositor
• Assign an accession code to the entry after mandatory data requirements are satisfied
• Process and annotate the entry and return any queries and/or validation reports to the depositor
• These steps are repeated as required
• All of these steps are documented with internal audit records
• All original data sets, intermediate data files, and depositor communications are preserved in a collection of versioned data files
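The deposition steps above can be sketched as a small data model. This is only an illustration under stated assumptions: the `Deposition` class, its field names, and the item names used in the usage note are hypothetical and do not reflect the actual wwPDB deposition system. It shows three ideas from the slide: every upload is kept as a new version, an accession code is assigned only once mandatory items are present, and each step leaves an audit record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass
class Deposition:
    """Hypothetical model of one deposition session (not the wwPDB schema)."""
    required_items: frozenset
    received: Dict[str, List[bytes]] = field(default_factory=dict)  # item -> versions
    audit: List[Tuple[str, str]] = field(default_factory=list)      # (timestamp, action)
    accession_code: Optional[str] = None

    def _log(self, action: str) -> None:
        # Internal audit record: when something happened and what it was.
        self.audit.append((datetime.now(timezone.utc).isoformat(), action))

    def upload(self, item: str, content: bytes) -> int:
        """Store a new version of an item; earlier versions are never overwritten."""
        versions = self.received.setdefault(item, [])
        versions.append(content)
        self._log(f"upload {item} v{len(versions)}")
        return len(versions)

    def missing(self) -> set:
        """Mandatory items that have not been uploaded yet."""
        return set(self.required_items) - set(self.received)

    def assign_accession(self, code: str) -> bool:
        """Assign a code only once, and only when all mandatory items are present."""
        if self.missing() or self.accession_code is not None:
            self._log(f"accession refused, missing={sorted(self.missing())}")
            return False
        self.accession_code = code
        self._log(f"accession assigned {code}")
        return True
```

For example, a session created with `required_items=frozenset({"coordinates", "experimental_data"})` refuses `assign_accession` until both items have been uploaded, and a corrected coordinate file uploaded later is stored alongside the first version rather than replacing it.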
Annotation (Post-Deposition)
• Validation reports produced during entry processing may be required for the editorial review of the citation describing the entry
• Data may be embargoed within the archive until the publication of the citation describing the entry (typically <1 year)
• Both automated and manual processing are required to manage the embargo period
• Depositor notification and acknowledgement typically occur at the end of the embargo period
• All processing details and communication related to the embargo are preserved with the data entry
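The automated part of embargo handling can be sketched as a simple date check. The function name `is_releasable` and the treatment of one year as a hard upper limit are assumptions made for illustration; the slide only says embargoes typically last less than a year, and manual review is outside this sketch.

```python
from datetime import date, timedelta
from typing import Optional


def is_releasable(
    deposited: date,
    published: Optional[date],
    today: date,
    max_embargo: timedelta = timedelta(days=365),  # assumed cap, not an archive rule
) -> bool:
    """Return True when an embargoed entry may leave the embargo.

    An entry is releasable once its citation has been published, or once
    the assumed maximum embargo period has elapsed since deposition.
    """
    if published is not None and published <= today:
        return True
    return today >= deposited + max_embargo
```

A nightly job could call this for every embargoed entry and queue the ones that return `True` for depositor notification, leaving the final decision to an annotator.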
Data Release
• Only the final data products of the deposition and annotation are released into the public archive (e.g., structure files, supporting experimental data, and validation reports)
• Data are released on a weekly schedule, coordinated with other wwPDB deposition sites
• Data from worldwide sites are first checked for consistency and then integrated into a single repository data archive file system
• The integrated or master copy of the data archive is replicated to each wwPDB distribution site to enable synchronized public release
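One way to picture the "check for consistency, then integrate" step is a merge that refuses to proceed when two sites submit different content under the same entry identifier. This is a minimal sketch under that assumption; the function `integrate_releases`, the site names, and the in-memory dictionaries are hypothetical stand-ins for the real file-system-based process, whose actual consistency checks are not described in the slides.

```python
from typing import Dict, List, Tuple


def integrate_releases(site_releases: Dict[str, Dict[str, bytes]]) -> Dict[str, bytes]:
    """Merge per-site release sets into one master copy.

    site_releases maps a site name to {entry_id: file content}.
    Raises ValueError if two sites disagree about the same entry.
    """
    master: Dict[str, bytes] = {}
    origin: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, str]] = []

    for site in sorted(site_releases):            # deterministic processing order
        for entry_id, content in site_releases[site].items():
            if entry_id in master and master[entry_id] != content:
                conflicts.append((entry_id, origin[entry_id], site))
            else:
                master[entry_id] = content
                origin.setdefault(entry_id, site)

    if conflicts:
        raise ValueError(f"inconsistent entries: {conflicts}")
    return master
```

Only when the merge succeeds would the resulting master copy be replicated to the distribution sites, so that every site publishes the same weekly release.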
Post Release
• Changes in the core data content of an entry require the assignment of a new accession code
• Accession reassignment is documented in both the obsoleted and superseding data entries, and both entries remain in the archive
• Smaller entry revisions are permitted and documented within internal audit records
• Archive-wide data entry updates to improve content and uniformity are performed periodically
• The state of the full data archive is preserved on an annual schedule, and these snapshots are maintained in the public view
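The obsolete/supersede rule can be illustrated with a small record type in which the link is stored on both entries and nothing is deleted. The `Entry` class, its field names, and the status strings are assumptions for this sketch and are not the archive's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    """Hypothetical archive entry record."""
    code: str
    status: str = "current"                  # "current" or "obsolete"
    supersedes: Optional[str] = None         # code of the entry this one replaces
    superseded_by: Optional[str] = None      # code of the replacing entry
    revisions: List[str] = field(default_factory=list)  # minor-revision audit notes


def supersede(archive: Dict[str, Entry], old_code: str, new_code: str) -> Entry:
    """Replace old_code with a new entry; both remain in the archive."""
    if new_code in archive:
        raise ValueError(f"{new_code} already exists")
    old = archive[old_code]
    if old.status == "obsolete":
        raise ValueError(f"{old_code} is already obsolete")
    old.status = "obsolete"
    old.superseded_by = new_code             # documented in the obsoleted entry
    new = Entry(code=new_code, supersedes=old_code)  # and in the superseding entry
    archive[new_code] = new
    return new


def revise(entry: Entry, note: str) -> None:
    """Record a smaller revision without changing the accession code."""
    entry.revisions.append(note)
```

Because the old entry is only marked obsolete, a reader following a citation to the original code can still find it and be pointed to its replacement.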
Disaster Recovery
• Multiple on-line copies of the data archive are maintained at distinct physical sites
• Internal data are also archived to magnetic tape (e.g., correspondence with depositors)
• In the future, data copies will also be archived using remote storage services (e.g., Amazon, Rackspace, Microsoft Azure)
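Keeping several copies is only useful if they can be shown to match. A common technique, assumed here rather than taken from the slides, is to compare per-file checksums between a primary copy and a replica. The function names and the report layout below are hypothetical; only Python standard-library calls are used.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_copies(primary: Path, replica: Path) -> Dict[str, List[str]]:
    """Report files missing from, extra in, or different in the replica."""
    a, b = build_manifest(primary), build_manifest(replica)
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

An empty report (all three lists empty) indicates that the replica matches the primary copy file for file; anything else points to the specific paths that need to be re-synchronized.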
This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. Funded by Grant R25 LM012286 from the National Library of Medicine of the National Institutes of Health.