Stanford Archival Vault (SAV)
Brian Cooper, Hector Garcia-Molina
Department of Computer Science, Stanford University
{cooperb, hector}@db.stanford.edu
http://www-diglib.stanford.edu/~testbed/doc2/ArchivalRepositories/

Problem: Preserving Data
• Data decays over time
  • Media decay
  • System failure
  • Human error/malicious actions
• “Preserving the bits”
  • Goal: Ensure bits survive failures
  • Not dealing with the harder problem of meaning (yet)
    • E.g., formats, natural language, etc.
• Redundancy + periodic verification = reliability

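
The "redundancy + periodic verification" idea can be illustrated with a minimal sketch. It assumes that each object is kept as several copies and that a SHA-256 checksum recorded at archive time is trusted; the names `digest` and `verify_and_repair` are hypothetical and not taken from SAV.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as the object's checksum."""
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(copies: list[bytes], expected: str) -> list[bytes]:
    """Replace every copy whose digest differs from `expected` with an intact copy.

    Raises ValueError if no intact copy survives (redundancy was insufficient).
    """
    good = next((c for c in copies if digest(c) == expected), None)
    if good is None:
        raise ValueError("all copies are corrupted; the object is lost")
    return [c if digest(c) == expected else good for c in copies]


# Usage: three redundant copies, one of which has decayed.
original = b"archived web page"
checksum = digest(original)  # recorded when the object was archived (assumption)
copies = [original, b"archived wXb page", original]
repaired = verify_and_repair(copies, checksum)
```

Running the check periodically bounds how long a decayed copy can go unnoticed, which is why verification is paired with redundancy rather than replacing it.
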
SAV Architecture
Data Creation/Import
  • Collects data for archiving
  • E.g., a web crawler
User Interface
  • Allows direct access to archived data
  • Allows SAV configuration
Upper Layers
  • E.g., security, indexing, metadata, etc.
Reliability Layer
  • Ensures objects survive failures
Remote SAV Sites
  • Objects are replicated to remote sites to provide reliability
Object Store
  • Basic object storage and retrieval
  • Manages references between objects
[Diagram legend: “Core” SAV components; unimplemented upper layers; application/user level]

Replication: Site networks
• Sites form “replication agreements”
  • Agree to replicate data
  • Specify data to replicate in agreement
    • May be a subset of all of the data in the archive
  • Periodically connect and compare data, looking for errors
[Figure: SAV sites linked by replication agreements; panels: strongly connected, weakly connected]

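
The periodic comparison between two sites under an agreement might look like the sketch below. It assumes that objects are keyed by a content signature (SHA-256 here), so a copy whose bytes no longer hash to its key is treated as corrupted. `Site` and `reconcile` are hypothetical names, not SAV's actual interfaces.

```python
import hashlib


def sig(data: bytes) -> str:
    """Content signature of an object (assumed to be SHA-256 in this sketch)."""
    return hashlib.sha256(data).hexdigest()


class Site:
    """A hypothetical SAV site: maps signature -> object bytes."""

    def __init__(self, name: str):
        self.name = name
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        s = sig(data)
        self.objects[s] = data
        return s

    def intact(self, s: str) -> bool:
        """True if the site holds the object and its bytes still match the signature."""
        data = self.objects.get(s)
        return data is not None and sig(data) == s


def reconcile(a: Site, b: Site, agreed: set[str]) -> list[str]:
    """Compare the agreed objects on both sites; copy from the intact side.

    Returns the signatures that had to be repaired or newly replicated.
    If both copies are damaged, nothing can be done here and the object is skipped.
    """
    repaired = []
    for s in sorted(agreed):
        ok_a, ok_b = a.intact(s), b.intact(s)
        if ok_a and not ok_b:
            b.objects[s] = a.objects[s]
            repaired.append(s)
        elif ok_b and not ok_a:
            a.objects[s] = b.objects[s]
            repaired.append(s)
    return repaired


# Usage: the agreement covers only a subset of the local archive.
local, remote = Site("local"), Site("remote")
page = local.put(b"<html>db group</html>")
private = local.put(b"local-only notes")  # not part of the agreement
agreement = {page}

reconcile(local, remote, agreement)       # first connection replicates the page
remote.objects[page] = b"bit rot"         # simulate corruption at the remote site
fixed = reconcile(local, remote, agreement)
```
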
Replication: Data sets
• SAV replicates different data sets separately
  • E.g., web pages under agreement A, Usenet articles under agreement B
• “Replication sets” should grow without human intervention
  • Traverse link structure to find objects in set
• A new object automatically becomes part of the correct replication set
[Figure: object graph inside a SAV with a "start traversal" object per set; legend: object in replication set, object not in replication set, new object added to SAV, object reference]

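
Finding a replication set by traversing the link structure can be sketched as a breadth-first search. The slides do not state which direction the traversal follows; this sketch assumes it walks references backwards, from a start object to the newer objects that point to it. The function name and the object names are illustrative only.

```python
from collections import deque


def replication_set(start: str, references: dict[str, list[str]]) -> set[str]:
    """Return every object linked (directly or transitively) to `start`.

    `references[obj]` lists the objects that `obj` points to. The traversal
    follows these links in reverse, which is an assumption of this sketch.
    """
    # Invert the reference map: target -> objects that point to it.
    pointed_by: dict[str, list[str]] = {}
    for src, targets in references.items():
        for target in targets:
            pointed_by.setdefault(target, []).append(src)

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in pointed_by.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Usage: two independent data sets stored in the same archive.
references = {
    "web-root": [],
    "page1": ["web-root"],
    "page2": ["page1"],
    "usenet-root": [],
    "article1": ["usenet-root"],
}
web_before = replication_set("web-root", references)

# A newly archived page that references an existing page joins the web set
# on the next traversal, with no manual bookkeeping.
references["page3"] = ["page2"]
web_after = replication_set("web-root", references)
```
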
Write-once repository
• Deletions/modifications disallowed
  • Any object deleted or modified must have been corrupted, and is replaced
• Challenges
  • Constructing structures of objects
    • Object references constrained to point from new to old objects
  • Representing modifications
    • Archive new version of objects = version chain
  • Finding objects
    • Indexes

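
A minimal sketch of a write-once store with a version chain follows. It assumes that an object's signature is a SHA-256 hash over its data and its references, and that the first reference of a new version names the previous version; both are conventions of this sketch, and `WriteOnceStore` is a hypothetical class.

```python
import hashlib


class WriteOnceStore:
    """Objects can be added but never modified or deleted."""

    def __init__(self):
        self._objects: dict[str, tuple[bytes, tuple[str, ...]]] = {}

    def put(self, data: bytes, refs: tuple[str, ...] = ()) -> str:
        # A reference must name an object that already exists, so references
        # can only point from new objects to old ones.
        for r in refs:
            if r not in self._objects:
                raise KeyError(f"reference to unknown object {r}")
        h = hashlib.sha256(data)
        for r in refs:
            h.update(b"\x00" + r.encode())
        s = h.hexdigest()
        # setdefault never overwrites: re-adding identical content is a no-op.
        self._objects.setdefault(s, (data, tuple(refs)))
        return s

    def get(self, s: str) -> tuple[bytes, tuple[str, ...]]:
        return self._objects[s]

    def version_chain(self, s: str) -> list[bytes]:
        """Walk from a version back to the oldest one, newest first."""
        chain = []
        current = s
        while current is not None:
            data, refs = self._objects[current]
            chain.append(data)
            current = refs[0] if refs else None  # first ref = previous version
        return chain


# Usage: a "modification" is archived as a new object pointing at the old one.
store = WriteOnceStore()
v1 = store.put(b"draft")
v2 = store.put(b"draft, revised", refs=(v1,))
history = store.version_chain(v2)
```

Because nothing is ever overwritten, any stored object whose bytes stop matching its signature can be treated as corrupted rather than edited.
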
Write-once repository: Indexes
• Key to performance
  • Locate an object quickly using its signature, “Who points to me?” problem, etc.
• Disposable indexes
  • Can be rebuilt at any time from SAV objects
• “Bookmarks” used to find collections of objects using indexed name
[Figure: a SAV containing a bookmark (with well-known name); related objects, e.g., from the same web site; and other, unrelated objects]

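
The sketch below shows why such indexes are disposable: all three are derived purely by scanning the stored objects, so they can be thrown away and rebuilt. The object layout (`sig`, `refs`, optional `bookmark` fields) and the assumption that related objects reference the bookmark object are hypothetical choices made for this example.

```python
from collections import defaultdict


def build_indexes(objects):
    """Scan all objects once and derive three in-memory indexes.

    Returns (by_sig, pointed_by, bookmarks):
      by_sig      signature -> object            (fast lookup by signature)
      pointed_by  signature -> set of referrers  ("who points to me?")
      bookmarks   well-known name -> signature   (entry points to collections)
    """
    by_sig = {}
    pointed_by = defaultdict(set)
    bookmarks = {}
    for obj in objects:
        by_sig[obj["sig"]] = obj
        for target in obj["refs"]:
            pointed_by[target].add(obj["sig"])
        name = obj.get("bookmark")
        if name is not None:
            bookmarks[name] = obj["sig"]
    return by_sig, pointed_by, bookmarks


# Usage: one bookmarked object, two related objects, one unrelated object.
objects = [
    {"sig": "a1", "refs": [], "bookmark": "db-group-site"},
    {"sig": "b2", "refs": ["a1"]},
    {"sig": "c3", "refs": ["a1"]},
    {"sig": "d4", "refs": []},
]
by_sig, pointed_by, bookmarks = build_indexes(objects)
root = bookmarks["db-group-site"]
related = pointed_by[root]

# Losing the indexes costs only a rescan: rebuilding gives the same result.
rebuilt = build_indexes(objects)
```
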
Implementation
• SAV built using Java and CORBA
• Tested on Stanford Database Group website
• Basic user interface for archive management

Future work
• Archiving the whole Internet
  • Scalability
  • Defining meaningful subsets
• Other replication models?
• Preserving meaning
• Security
  • Preserving sensitive documents
  • Protecting intellectual property