
Stanford Archival Repository Project
Brian Cooper, Arturo Crespo, Hector Garcia-Molina
Department of Computer Science, Stanford University

Data does not live forever
• Much data is stored digitally (perhaps exclusively)
  – Text
  – Multimedia (images, sound, etc.)
  – Scientific data
• But digital storage is currently unreliable
  – Magnetic tapes decay, break, or lose magnetism
  – Disks crash
  – Buildings burn down
  – Users delete data (accidentally or maliciously)

Data does not live forever
• Digital information already lost:
  – Early NASA records
  – U.S. Census information
  – Toxic waste records
• Decay time for common media:
  – Magnetic tapes: 10-20 years
  – CD-ROM: 5-50 years
  – Hard drive: 3-5 years

Digital archiving
• Digital archivists need:
  – A reliable system to store digital data for long periods without losing it
  – Convenient tools to add new data and manage data already archived
  – Methods for finding the “best” configuration
    » Most reliable
    » Most cost-effective
    » Etc.

Archival Repository Project
• Goal: Reliably archive digital information for long periods of time (decades or centuries)
  – Focus on “preserving bits”
  – Preserving meaning: future work
• Strategy
  – Replicate objects
  – Automatically detect and correct errors
• Our project
  – Stanford Archival Vault (SAV): reliably archives data
  – InfoMonitor: automatically adds newly created data to the archive
  – ArchSim: a simulation tool to model archives

Architecture
[Diagram labels: Local archive (Users, SAV Archive, InfoMonitor, Filesystem, Archived data); Internet; Remote archive (Users, SAV, Archived data)]

SAV architecture
[Diagram labels: Application/user level (Data Creation/Import, User Interface); Upper layers; “Core” SAV components (Reliability Layer, Object Store); Remote SAV Sites]

Write-once repository
• Deletions/modifications disallowed
  – Any object found deleted or modified must have been corrupted, and is replaced
• Challenges
  – Constructing structures of objects
    » Object references are constrained to point from new objects to old objects
  – Representing modifications
    » Archive each new version of an object, forming a version chain
  – Finding objects
    » Indexes
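
The write-once rules above can be illustrated with a small in-memory sketch. This is not SAV's implementation: the `WriteOnceStore` class, the helper names, the JSON record layout, and the use of SHA-256 as the object signature are all assumptions made for illustration.

```python
import hashlib
import json


class WriteOnceStore:
    """Toy in-memory store (hypothetical): objects are keyed by a content signature."""

    def __init__(self):
        self._objects = {}  # signature -> bytes

    @staticmethod
    def signature(data):
        # Assumption: SHA-256 stands in for whatever signature SAV uses.
        return hashlib.sha256(data).hexdigest()

    def put(self, data):
        sig = self.signature(data)
        # Storing identical content twice is harmless; nothing is overwritten.
        self._objects.setdefault(sig, data)
        return sig

    def get(self, sig):
        return self._objects[sig]

    def is_corrupt(self, sig):
        # Deletion and modification are disallowed, so a missing object or a
        # signature mismatch can only mean corruption.
        data = self._objects.get(sig)
        return data is None or self.signature(data) != sig


def archive_version(store, text, previous_sig=None):
    """Archive a new version that references the previous one (new -> old)."""
    record = json.dumps({"prev": previous_sig, "text": text}, sort_keys=True)
    return store.put(record.encode("utf-8"))


def version_chain(store, newest_sig):
    """Walk the chain from the newest version back to the oldest."""
    texts = []
    sig = newest_sig
    while sig is not None:
        record = json.loads(store.get(sig).decode("utf-8"))
        texts.append(record["text"])
        sig = record["prev"]
    return texts


store = WriteOnceStore()
v1 = archive_version(store, "draft")
v2 = archive_version(store, "final", previous_sig=v1)
chain = version_chain(store, v2)  # newest first
```

In this sketch a reference is a content signature, so an object can only point at objects that already existed when it was written, which is one way the new-to-old constraint can arise.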

Write-once repository: Indexes
• Key to performance
  – Locate an object quickly by its signature, answer the “Who points to me?” question, etc.
• Disposable indexes
  – Can be rebuilt at any time from SAV objects
• “Bookmarks” are used to find collections of objects by an indexed name

Write-once repository: Indexes
[Figure labels: SAV; Bookmark (with well-known name)]
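
Because the indexes are disposable, they can be thought of as a pure function of the archived objects. A minimal sketch, assuming a hypothetical object layout in which each object carries an optional `name` (for bookmarks) and a list of `refs` to other signatures; neither field name comes from SAV:

```python
from collections import defaultdict


def rebuild_indexes(objects):
    """Rebuild both indexes from scratch.

    objects: dict mapping signature -> {"name": str or None, "refs": [signature, ...]}
    Returns (by_name, pointed_by).
    """
    by_name = {}                   # bookmark name -> signature
    pointed_by = defaultdict(set)  # signature -> signatures that reference it
    for sig, obj in objects.items():
        if obj.get("name"):
            by_name[obj["name"]] = sig
        for target in obj.get("refs", []):
            pointed_by[target].add(sig)
    return by_name, dict(pointed_by)


# Hypothetical archive contents: a bookmark pointing at a report,
# which in turn points at an older figure.
objects = {
    "sig-figure": {"name": None, "refs": []},
    "sig-report": {"name": None, "refs": ["sig-figure"]},
    "sig-bookmark": {"name": "tech-reports", "refs": ["sig-report"]},
}
by_name, pointed_by = rebuild_indexes(objects)
```

Losing either dictionary costs only a rescan of the objects, which is what makes the indexes safe to discard.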

Replication: Site networks
• Sites form “replication agreements”
  – Agree to replicate data
  – Specify data to replicate in agreement
    » May be a subset of all of the data in the archive
  – Periodically connect and compare data, looking for errors
[Figure: example site networks, labeled “Strongly connected” and “Weakly connected”]
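
The periodic compare-and-repair step can be sketched for two sites as follows. This is an illustration only: each site is modeled as a dictionary holding just the objects covered by one agreement, SHA-256 is assumed as the signature, and `reconcile` is a hypothetical name.

```python
import hashlib


def digest(data):
    # Assumption: SHA-256 stands in for the real object signature.
    return hashlib.sha256(data).hexdigest()


def is_good(site, sig):
    """An object is good if it is present and still matches its signature."""
    data = site.get(sig)
    return data is not None and digest(data) == sig


def reconcile(site_a, site_b):
    """Compare two sites and repair each from the other where possible.

    Returns (repaired, lost): repaired is a list of (site_label, signature)
    pairs; lost lists signatures with no good copy at either site.
    """
    repaired, lost = [], []
    for sig in sorted(set(site_a) | set(site_b)):
        good_a, good_b = is_good(site_a, sig), is_good(site_b, sig)
        if good_a and not good_b:
            site_b[sig] = site_a[sig]
            repaired.append(("b", sig))
        elif good_b and not good_a:
            site_a[sig] = site_b[sig]
            repaired.append(("a", sig))
        elif not good_a and not good_b:
            lost.append(sig)
    return repaired, lost


doc1, doc2 = b"tech report 1", b"tech report 2"
s1, s2 = digest(doc1), digest(doc2)
site_a = {s1: doc1, s2: doc2}
site_b = {s1: b"bit rot"}  # one corrupted copy, one missing copy
repaired, lost = reconcile(site_a, site_b)
```

In a write-once archive a missing object is treated the same way as a damaged one, so both cases are repaired by copying from the site that still has a good copy.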

Replication: Data sets
• SAV replicates different data sets separately
  – E.g., web pages under agreement A, Usenet articles under agreement B
  – “Replication sets” should grow without human intervention
  – Traverse link structure to find objects in set

Replication: Data sets
[Figure labels: SAV; Start traversal]
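
Finding the members of a replication set by traversing the link structure is, in essence, a graph reachability computation. A minimal sketch, assuming the links are available as a dictionary from each object to the objects it references; the object names and the `replication_set` helper are hypothetical:

```python
from collections import deque


def replication_set(links, start):
    """Return every object reachable from `start` by following references."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical archive with two independent data sets.
links = {
    "web-bookmark": ["page2"],
    "page2": ["page1", "img1"],
    "page1": ["img1"],
    "news-bookmark": ["article1"],
}
web_set = replication_set(links, "web-bookmark")
news_set = replication_set(links, "news-bookmark")
```

Rerunning the traversal picks up whatever has become reachable from the start object, so the set can grow without anyone listing its members by hand.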

User interface

User interface

Object store performance

Reliability layer performance

The InfoMonitor
• Goal
  – Create a convenient, transparent mechanism for getting data from existing stores into the archive
• Architecture
[Diagram labels: Users; Filesystem; InfoMonitor; SAV Archive]

Detecting new data
• Must find and archive new data
  – Filesystem will not signal data writes
  – Users should not have to explicitly “check in” data
• Scanning
  – Quick scan: detect changes using timestamps
  – Slow scan: detect changes using file contents
• Filtering
  – Automatically decide what to archive
  – Use filtering rules
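
The quick scan, slow scan, and filtering rules can be sketched with one scanning loop and two interchangeable fingerprint functions. This is not the InfoMonitor's code: the function names, the glob-style rule format, and the example patterns are all assumptions.

```python
import fnmatch
import hashlib
import os

# Hypothetical filtering rules, written as glob patterns.
INCLUDE = ("*.txt", "*.tex")
EXCLUDE = ("*.tmp", "*~")


def should_archive(path):
    name = os.path.basename(path)
    if any(fnmatch.fnmatch(name, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch.fnmatch(name, pattern) for pattern in INCLUDE)


def by_timestamp(path):
    """Quick-scan fingerprint: cheap, but only sees timestamp changes."""
    return os.stat(path).st_mtime_ns


def by_contents(path):
    """Slow-scan fingerprint: reads the whole file and hashes it."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def scan(root, state, fingerprint):
    """Return files under `root` whose fingerprint differs from `state`.

    `state` maps path -> last fingerprint seen and is updated in place.
    Keep a separate `state` dictionary for each fingerprint function.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not should_archive(path):
                continue
            mark = fingerprint(path)
            if state.get(path) != mark:
                state[path] = mark
                changed.append(path)
    return sorted(changed)
```

A caller might run `scan(root, quick_state, by_timestamp)` frequently and `scan(root, slow_state, by_contents)` occasionally, since the slow scan also catches changes that leave the timestamp untouched.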

User interface

User interface

InfoMonitor performance

Designing Archival Repositories
• Designer needs to answer questions like:
  – What is the minimum number of copies of a document needed to ensure its preservation?
  – Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates?
  – Are two sites enough to guarantee preservation?
  – How often should we scan the repositories for errors?
  – What is the MTTF of this design?
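
The question of one expensive disk versus two inexpensive ones can be explored with a back-of-the-envelope calculation. The sketch below is not ArchSim's model: it assumes independent, exponentially distributed disk lifetimes and no error detection or repair, and the MTTF figures are made-up inputs.

```python
import math


def mttf_without_repair(disk_mttf_years, copies):
    """Expected time until every copy has failed.

    For independent exponential lifetimes with mean m, the expected
    maximum of n lifetimes is m * (1 + 1/2 + ... + 1/n).
    """
    return disk_mttf_years * sum(1.0 / k for k in range(1, copies + 1))


def loss_probability(disk_mttf_years, copies, horizon_years):
    """Probability that all copies fail within the horizon, with no repair."""
    single = 1.0 - math.exp(-horizon_years / disk_mttf_years)
    return single ** copies


# Hypothetical inputs: one disk with a 10-year MTTF versus
# two disks with a 5-year MTTF each.
one_good_disk = mttf_without_repair(10.0, copies=1)
two_cheap_disks = mttf_without_repair(5.0, copies=2)
loss_one = loss_probability(10.0, copies=1, horizon_years=1.0)
loss_two = loss_probability(5.0, copies=2, horizon_years=1.0)
```

Under these assumptions the single disk has the longer MTTF (10 years against 7.5), yet the pair is less likely to lose the data within the first year (about 0.033 against about 0.095). Which design looks better depends on the time horizon and on how quickly failed copies are detected and replaced, which is the kind of trade-off a simulator is meant to explore.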

Contributions
• A comprehensive model for an archival repository
• A powerful simulation tool, ArchSim, for evaluating archival repositories and the available strategies
• A detailed case study for a hypothetical TR repository operated between Stanford and MIT

How important is having good disks?

Preventive maintenance

Current and future work
• New models for replication agreements and “data trading”
• Archiving the World Wide Web
• Modeling cost
• Managing “meaning”
• Security
• Alternative object naming schemes
• Other “upper layers,” e.g. user access, metadata, etc.

Conclusion
• Digital librarians need tools to preserve data
• Our project addresses this need
  – Reliable storage: SAV
  – Convenient access: InfoMonitor
  – Finding the best configuration: ArchSim
• More work must be done to refine these models
  – More automation
  – More flexibility
  – Answer a wider range of design questions

For more information
http://www-db.stanford.edu/archivalrep
Brian Cooper: cooperb@db.stanford.edu
Arturo Crespo: crespo@db.stanford.edu
Hector Garcia-Molina: hector@db.stanford.edu