Stanford Archival Repository Project
Brian Cooper, Arturo Crespo, Hector Garcia-Molina
Department of Computer Science, Stanford University

Data does not live forever
• Much data is stored digitally (perhaps exclusively)
  – Text
  – Multimedia (images, sound, etc.)
  – Scientific data
• But digital storage is currently unreliable
  – Magnetic tapes decay, break or lose magnetism
  – Disks crash
  – Buildings burn down
  – Users delete data (accidentally or maliciously)

Data does not live forever
• Digital information already lost:
  – Early NASA records
  – U.S. Census information
  – Toxic waste records
• Decay time for common media:
  – Magnetic tapes: 10-20 years
  – CD-ROM: 5-50 years
  – Hard drive: 3-5 years

Digital archiving
• Digital archivists need:
  – A reliable system to store digital data for long periods without losing it
  – Convenient tools to add new data and manage data already archived
  – Methods for finding the “best” configuration
    » Most reliable
    » Most cost effective
    » Etc.

Archival Repository Project
• Goal: Reliably archive digital information for long periods of time (decades or centuries)
  – Focus on “preserving bits”
  – Preserving meaning: future work
• Strategy
  – Replicate objects
  – Automatically detect and correct errors
• Our project
  – Stanford Archival Vault (SAV): reliably archives data
  – InfoMonitor: automatically adds newly created data to the archive
  – ArchSim: a simulation tool to model archives

Architecture
[Diagram: a local archive and a remote archive connected over the Internet. Labels: Users; SAV; SAV Archive; InfoMonitor; Archived data; Filesystem]

SAV architecture
[Diagram labels: Data Creation/Import; User Interface; Upper Layers; Reliability Layer; Object Store; Remote SAV Sites. Groupings: “Core” SAV components; Upper layers; Application/user level]

Write-once repository
• Deletions/modifications disallowed
  – Any object deleted or modified must have been corrupted, and is replaced
• Challenges
  – Constructing structures of objects
    » Object references constrained to point from new to old objects
  – Representing modifications
    » Archive new versions of objects, forming a version chain
  – Finding objects
    » Indexes

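The write-once constraints above can be illustrated with a small in-memory sketch. It is not SAV's actual interface: the class `WriteOnceStore`, the helper `version_chain`, the use of SHA-256 as the object signature, and the convention that a version's first reference is its predecessor are all assumptions made for illustration.

```python
import hashlib


class WriteOnceStore:
    """Toy in-memory write-once store (hypothetical, not SAV's API)."""

    def __init__(self):
        # signature -> (payload bytes, tuple of referenced signatures)
        self._objects = {}

    @staticmethod
    def signature(payload, refs):
        # Assumption: an object is named by a hash of its content and references.
        h = hashlib.sha256()
        h.update(payload)
        for r in refs:
            h.update(r.encode("ascii"))
        return h.hexdigest()

    def put(self, payload, refs=()):
        refs = tuple(refs)
        # References may only point at objects that already exist (new -> old).
        for r in refs:
            if r not in self._objects:
                raise KeyError(f"unknown reference: {r}")
        sig = self.signature(payload, refs)
        # Re-adding identical content is a no-op; nothing is ever overwritten.
        self._objects.setdefault(sig, (payload, refs))
        return sig

    def get(self, sig):
        return self._objects[sig]

    def is_intact(self, sig):
        # A stored object whose hash no longer matches its name was corrupted.
        payload, refs = self._objects[sig]
        return self.signature(payload, refs) == sig


def version_chain(store, sig):
    """Follow each version's first reference back to the oldest version."""
    chain = [sig]
    while True:
        _payload, refs = store.get(chain[-1])
        if not refs:
            return chain
        chain.append(refs[0])


store = WriteOnceStore()
v1 = store.put(b"draft of report")
v2 = store.put(b"final report", refs=[v1])  # new version points back at the old one
print(version_chain(store, v2) == [v2, v1])
```

Because a "modification" is just a new object that references the old one, the old version is never touched, and any stored object that fails `is_intact` can be treated as corrupted.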
Write-once repository: Indexes
• Key to performance
  – Locate an object quickly by its signature, solve the “Who points to me?” problem, etc.
• Disposable indexes
  – Can be rebuilt at any time from SAV objects
• “Bookmarks” are used to find collections of objects by an indexed name

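A disposable "who points to me?" index might look like the following sketch. The function name `build_referrer_index` and the representation of objects as a mapping from an object id to the ids it references are assumptions for illustration, not SAV's data model.

```python
from collections import defaultdict


def build_referrer_index(objects):
    """Build a reverse-reference index from scratch.

    objects: mapping of object id -> iterable of ids that object references.
    Returns a mapping of object id -> set of ids that point to it.
    """
    referrers = defaultdict(set)
    for obj_id, refs in objects.items():
        for target in refs:
            referrers[target].add(obj_id)
    return dict(referrers)


# Hypothetical archive contents: "c" points to "a" and "b", "b" points to "a".
objects = {
    "a": [],
    "b": ["a"],
    "c": ["a", "b"],
}
index = build_referrer_index(objects)
print(sorted(index["a"]))
```

The index holds nothing that is not derivable from the objects themselves, so losing it costs only the time needed to rebuild it.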
Write-once repository: Indexes
[Diagram labels: SAV; Bookmark (with well-known name)]

Replication: Site networks
• Sites form “replication agreements”
  – Agree to replicate data
  – Specify data to replicate in agreement
    » May be a subset of all of the data in the archive
  – Periodically connect and compare data, looking for errors
[Diagram: strongly connected and weakly connected site networks]

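The periodic "connect and compare" step can be sketched as below, under stated assumptions: each site is modeled as a dictionary from signature to bytes, the signature is taken to be a SHA-256 digest of the content, and the names `digest` and `reconcile` are hypothetical. The real SAV protocol is not described on the slide.

```python
import hashlib


def digest(data):
    # Assumption: an object's signature is the SHA-256 hash of its bytes.
    return hashlib.sha256(data).hexdigest()


def reconcile(site_a, site_b):
    """Compare two sites and repair missing or corrupted copies in place.

    Returns (repaired, lost): repaired is a list of (site name, signature)
    pairs that were restored; lost lists signatures with no intact copy.
    """
    repaired, lost = [], []
    for sig in sorted(set(site_a) | set(site_b)):
        a_ok = sig in site_a and digest(site_a[sig]) == sig
        b_ok = sig in site_b and digest(site_b[sig]) == sig
        if a_ok and not b_ok:
            site_b[sig] = site_a[sig]
            repaired.append(("b", sig))
        elif b_ok and not a_ok:
            site_a[sig] = site_b[sig]
            repaired.append(("a", sig))
        elif not a_ok and not b_ok:
            lost.append(sig)
    return repaired, lost


page, article = b"web page", b"usenet article"
site_a = {digest(page): page, digest(article): article}
site_b = {digest(page): b"corrupted"}  # one bad copy, one missing copy
repaired, lost = reconcile(site_a, site_b)
print(len(repaired), len(lost))
```

This only works because the repository is write-once: since legitimate modifications never happen, any copy that disagrees with its signature can safely be overwritten by an intact one.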
Replication: Data sets
• SAV replicates different data sets separately
  – E.g., web pages under agreement A, Usenet articles under agreement B
  – “Replication sets” should grow without human intervention
  – Traverse link structure to find objects in set

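Finding the members of a replication set by traversing links can be sketched as a plain breadth-first search. The function name `replication_set`, the `links` mapping, and the object names in the example are hypothetical; the slide does not say which traversal SAV uses.

```python
from collections import deque


def replication_set(roots, links):
    """Return every object reachable from the starting objects.

    roots: iterable of object ids where the traversal starts.
    links: mapping of object id -> iterable of ids it references.
    """
    roots = list(roots)
    seen = set(roots)
    queue = deque(roots)
    while queue:
        obj = queue.popleft()
        for target in links.get(obj, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical archive with two independent data sets.
links = {
    "web-start": ["page2"],
    "page2": ["page1"],
    "page1": [],
    "news-start": ["article1"],
    "article1": [],
}
print(sorted(replication_set(["web-start"], links)))
```

Each agreement needs only a starting point; anything later linked from the set is picked up by the next traversal without anyone listing it by hand.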
Replication: Data sets
[Diagram labels: SAV; Start traversal]

User interface [figure]
User interface [figure]
Object store performance [figure]
Reliability layer performance [figure]

The InfoMonitor
• Goal
  – Create a convenient, transparent mechanism for getting data from existing stores into the archive
• Architecture
[Diagram labels: Users; SAV Archive; InfoMonitor; Filesystem]

Detecting new data
• Must find and archive new data
  – Filesystem will not signal data writes
  – Users should not have to explicitly “check-in” data
• Scanning
  – Quick scan: detect changes using timestamps
  – Slow scan: detect changes using file contents
• Filtering
  – Automatically decide what to archive
  – Use filtering rules

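The two scan modes and the filtering step can be sketched as follows. This is not InfoMonitor's implementation: the function names, the use of filename glob patterns as "filtering rules", and the use of SHA-256 to compare file contents are all assumptions for illustration.

```python
import fnmatch
import hashlib
import os


def passes_filter(path, include_patterns):
    # Assumption: a filtering rule is a glob pattern on the file name.
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, p) for p in include_patterns)


def _scan(root, state, include_patterns, fingerprint):
    """Return files whose fingerprint differs from the one recorded in state."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not passes_filter(path, include_patterns):
                continue
            current = fingerprint(path)
            if state.get(path) != current:
                changed.append(path)
                state[path] = current
    return sorted(changed)


def quick_scan(root, state, include_patterns):
    # Cheap: compares modification timestamps only.
    return _scan(root, state, include_patterns,
                 lambda p: os.stat(p).st_mtime_ns)


def slow_scan(root, state, include_patterns):
    # Expensive: reads every file and compares a hash of its contents.
    def content_hash(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return _scan(root, state, include_patterns, content_hash)
```

A possible usage pattern is to run `quick_scan` often and `slow_scan` rarely, each with its own `state` dictionary; the slow scan catches changes that leave the timestamp untouched, at the cost of reading every file.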
User interface [figure]
User interface [figure]
InfoMonitor performance [figure]

Designing Archival Repositories
• Designer needs to answer questions like:
  – What is the minimum number of copies of a document needed to ensure its preservation?
  – Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates?
  – Are two sites enough to guarantee preservation?
  – How often should we scan the repositories for errors?
  – What’s the MTTF of this design?

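A back-of-the-envelope calculation shows why these questions need a model. The sketch below is not ArchSim and not the authors' model; it uses textbook results that assume independent, exponentially distributed disk lifetimes and, for the second function, exponentially distributed repair times. The function names and the example numbers (10-year and 5-year disk MTTF, 0.1-year mean repair time) are hypothetical.

```python
def mttf_without_repair(copies, disk_mttf_years):
    """Mean time until all copies are lost, with no error detection or repair.

    For n independent exponential lifetimes with mean m, the expected time
    until the last one fails is m * (1 + 1/2 + ... + 1/n).
    """
    return disk_mttf_years * sum(1.0 / k for k in range(1, copies + 1))


def mttf_two_copies_with_repair(disk_mttf_years, mean_repair_years):
    """Mean time to data loss for two copies when a failed copy is restored.

    Markov model with failure rate lam per disk and repair rate mu:
    MTTF = (3 * lam + mu) / (2 * lam ** 2).
    """
    lam = 1.0 / disk_mttf_years
    mu = 1.0 / mean_repair_years
    return (3 * lam + mu) / (2 * lam ** 2)


one_good_disk = mttf_without_repair(1, 10.0)             # 10.0 years
two_cheap_disks = mttf_without_repair(2, 5.0)            # 7.5 years
two_cheap_repaired = mttf_two_copies_with_repair(5.0, 0.1)
print(one_good_disk, two_cheap_disks, round(two_cheap_repaired, 1))
```

Under these assumptions, two cheap disks lose to one good disk if nobody checks them, but win by a wide margin once failed copies are detected and replaced quickly, which is why the scan interval appears among the design questions.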
Contributions
• A comprehensive model for an archival repository
• A powerful simulation tool, ArchSim, for evaluating archival repositories and the available strategies
• A detailed case study of a hypothetical TR repository operated between Stanford and MIT

How important is having good disks? [figure]
Preventive maintenance [figure]

Current and future work
• New models for replication agreements and “data trading”
• Archiving the World Wide Web
• Modeling cost
• Managing “meaning”
• Security
• Alternative object naming schemes
• Other “upper layers,” e.g. user access, metadata, etc.

Conclusion
• Digital librarians need tools to preserve data
• Our project addresses this need
  – Reliable storage: SAV
  – Convenient access: InfoMonitor
  – Finding the best configuration: ArchSim
• More work must be done to refine these models
  – More automation
  – More flexibility
  – Answer a wider range of design questions

For more information
http://www-db.stanford.edu/archivalrep
Brian Cooper: cooperb@db.stanford.edu
Arturo Crespo: crespo@db.stanford.edu
Hector Garcia-Molina: hector@db.stanford.edu