Beyond code: Versioning data with Git and Mercurial
Stephanie Collett and Martin Haye, California Digital Library, University of California
Not on Agenda
Agenda • Background • Case Study #1: eScholarship Backup • Case Study #2: Zephir Metadata • Summary
Code Version Control Repository
Data/Metadata Version Control Repository
Why distributed?
Case #1 eScholarship Data/Metadata Backup
eScholarship
~50k scholarly works
XML Metadata } 10 files per work
XML Metadata } ~500,000 files total
XML Metadata Single Mercurial Repository
Working Repository Backup Repository Nightly Sync (hg push)
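A nightly job along the lines of this slide could be sketched as follows. This is a minimal sketch, not the presenters' actual script: the function names, the commit message format, and the repository locations are hypothetical. It relies only on the standard `hg addremove`, `hg commit`, and `hg push` commands, and on the fact that `hg commit` and `hg push` exit with status 1 when there is nothing to commit or push.

```python
import subprocess
from datetime import date


def nightly_sync_commands(backup_repo, day):
    """Return (argv, acceptable_exit_codes) pairs for one nightly run."""
    # Hypothetical commit message format; the slides do not specify one.
    message = f"Nightly snapshot {day.isoformat()}"
    return [
        # Track new files and forget deleted ones in the working repository.
        (["hg", "addremove"], {0}),
        # Exit code 1 from `hg commit` means "nothing changed", not an error.
        (["hg", "commit", "-m", message], {0, 1}),
        # Exit code 1 from `hg push` means "nothing to push".
        (["hg", "push", backup_repo], {0, 1}),
    ]


def run_nightly_sync(working_repo, backup_repo, day=None):
    """Run the commands in order inside working_repo; stop on a real failure."""
    for argv, ok_codes in nightly_sync_commands(backup_repo, day or date.today()):
        result = subprocess.run(argv, cwd=working_repo)
        if result.returncode not in ok_codes:
            raise RuntimeError(f"{' '.join(argv)} exited with {result.returncode}")
```

A scheduler such as cron would call `run_nightly_sync("/path/to/working", "/path/to/backup")` once per night; both paths are placeholders.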
XML Metadata Single Mercurial Repository .hgignore
Working Storage Backup Storage Nightly Sync (rsync)
30-60 minutes for the batch job
Logs Date Annotation Change } Commit History
Case #2 Zephir Metadata Management System
Zephir
File system record/
File system record/ marc.xml
File system record/ marc.xml attributes.xml summary.xml transform.xsl
File system record/ .git/ marc.xml attributes.xml summary.xml transform.xsl
. . .
/pairtree/ab/cd/e/record/.git
/pairtree/ab/cd/ea/record/.git
/pairtree/ab/cd/ez/record/.git
/pairtree/ab/cd/f2/record/.git
/pairtree/ab/cd/f9/record/.git
/pairtree/ab/cd/ff/record/.git
/pairtree/ab/cd/fm/record/.git
/pairtree/ab/cd/fq/record/.git
/pairtree/ab/cd/gi/record/.git
/pairtree/ab/cd/gw/record/.git
/pairtree/ab/cd/gz/record/.git
/pairtree/ab/cd/hs/record/.git
/pairtree/ab/cd/ht/record/.git
/pairtree/ab/cd/i/record/.git
. . .
} 10 million
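The directory layout above follows the pairtree convention of splitting an identifier into two-character segments. A minimal sketch of that mapping is shown here; the function name and the example identifiers are hypothetical, inferred from the paths on the slide, and the character-escaping rules of the full Pairtree specification are deliberately left out.

```python
from pathlib import PurePosixPath


def pairtree_path(identifier, root="/pairtree", leaf="record"):
    """Map an identifier to its record directory under a pairtree root.

    The identifier is cut into two-character segments; a trailing single
    character becomes its own segment. Identifier cleaning/escaping from
    the full Pairtree specification is NOT implemented in this sketch.
    """
    segments = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(root, *segments, leaf)


# Hypothetical identifiers that would produce paths like those on the slide.
print(pairtree_path("abcde") / ".git")
print(pairtree_path("abcdf2") / ".git")
```

Because every record gets its own directory, each one can hold an independent `.git` repository without the repositories interfering with one another.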
Individually
Versioning + Audit Trail + Diffing + Debugging
Collectively
record/ marc.xml
1 file, ~4k
record/ marc.xml attributes.xml summary.xml transform.xsl
4 files, ~36k
.git/ branches/ config description HEAD hooks/ index info/ objects/ refs/
record/ + record/.git 43 files, ~132k
record/ + record/.git ~132k x 10 million
record/ + record/.git 43 files x 10 million
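The multiplication on these slides can be worked through explicitly. The sketch below uses only the per-record figures from the slides (4 files and ~36k without `.git`, 43 files and ~132k with it) and assumes that "k" means kilobytes of 1,000 bytes; the function name is hypothetical.

```python
RECORDS = 10_000_000  # "10 million" records, from the slides


def collection_totals(files_per_record, kb_per_record, records=RECORDS):
    """Return (total file count, total size in terabytes) for the collection."""
    total_files = files_per_record * records
    # Assumption: 1 kB = 1,000 bytes, so 1 TB = 1000**3 kB.
    total_tb = kb_per_record * records / 1000**3
    return total_files, total_tb


data_only = collection_totals(4, 36)     # record/ without .git
with_git = collection_totals(43, 132)    # record/ + record/.git

print(f"data only: {data_only[0]:,} files, {data_only[1]:.2f} TB")
print(f"with .git: {with_git[0]:,} files, {with_git[1]:.2f} TB")
```

Under these assumptions, the per-record repositories raise the collection from 40 million files and about 0.36 TB to 430 million files and about 1.32 TB.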
Command Line vs. API
Grit Gem (Git) vs. Rugged Gem (libgit2)
Grit Gem (Git)
Rugged Gem (libgit2)
Grit vs. Rugged
Grit: • add files • commit
Rugged: • add files • determine changes • determine parent • commit • replace HEAD
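The longer list of steps can be illustrated with Git's own low-level ("plumbing") commands. The sketch below is an analogy only: it does not use the Grit or Rugged APIs, the function names and the author identity are hypothetical, and it assumes a `git` executable is on the PATH. Each numbered comment corresponds to one of the five steps listed for Rugged.

```python
import os
import subprocess


def git(repo_dir, *args, check=True):
    """Run one git command inside repo_dir and return its stripped stdout."""
    # Hypothetical identity so commit-tree works without global git config.
    env = dict(os.environ,
               GIT_AUTHOR_NAME="example", GIT_AUTHOR_EMAIL="example@example.org",
               GIT_COMMITTER_NAME="example", GIT_COMMITTER_EMAIL="example@example.org")
    result = subprocess.run(["git", *args], cwd=repo_dir, env=env,
                            capture_output=True, text=True, check=check)
    return result.stdout.strip()


def low_level_commit(repo_dir, paths, message):
    """Commit the given paths step by step; return the commit id or None."""
    # 1. add files: stage them in the index, then snapshot the index as a tree.
    git(repo_dir, "update-index", "--add", "--", *paths)
    tree = git(repo_dir, "write-tree")

    # 3. determine parent: empty string when the repository has no commits yet.
    parent = git(repo_dir, "rev-parse", "--verify", "--quiet", "HEAD", check=False)

    # 2. determine changes: skip the commit if the tree matches the parent's tree.
    if parent and git(repo_dir, "rev-parse", parent + "^{tree}") == tree:
        return None

    # 4. commit: create the commit object (this does not move any branch).
    args = ["commit-tree", tree]
    if parent:
        args += ["-p", parent]
    commit = git(repo_dir, *args, "-m", message)

    # 5. replace HEAD: point the current branch at the new commit.
    git(repo_dir, "update-ref", "HEAD", commit)
    return commit
```

With a high-level interface, all of this collapses into "add files" and "commit"; the low-level route exposes each step, which means more code but also more control.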
Summary
Add Remove Commit Log Diff
vs.
texty data, small files, 100-10,000 files per repository
If it looks like code, even if it's data, it will probably work