Digital Asset Management and Publication with Lady Bird
Eric James, programmer/analyst, Library IT, Yale University Library
eric.james@yale.edu
12 July 2013
What is LadyBird?
• Bebop song by Tadd Dameron
• First Lady, Lyndon B. Johnson presidency
• Old dog from King of the Hill
• Digital asset management tool
LadyBird - Digital Asset Management Tool
From its origin, LadyBird has been a system that processes metadata and temporarily houses digital assets to be published. It provides a configurable system for migrating digital objects and collections, normalizing metadata, and preserving and publishing content.
• Initially written in Microsoft .NET and C#, hosted on Windows Server 2008 using Microsoft SQL Server 2008
• Some work on Java modules (for import)
• Wish list: migrate to JRuby/Rails
LadyBird components
• Web interface
• Job processing engine - imports
• Export processing engine - exports
• Bag creation
• Heartbeat monitor
• Application cleanup system
This presentation focuses on the workflow and concepts involved in publishing digital objects with metadata to Fedora.
LadyBird concepts I
• The core of the application is the object table
• Collection: departments within the library and Yale (comes into play later when discussing the c# tables)
• Project: projects specific to a collection
• An object belongs to a project and a project belongs to a collection
• Currently 16 collections with 34 projects and 1.53 million objects
• We call objects "oids". Technically "oid" is the object id column of the object table, but we tend to use it to describe the whole ball of wax
• User table: each cataloger is registered, and role and permission settings are used throughout the app
LadyBird concepts II
• Processing objects is all about the spreadsheet
• Each row is an object
• Each column represents either a function or a metadata field
• Function examples:
  {F1} is the object as identified by oid (primary key of the object table); if left blank, that is the signal to create a new oid
  {F4} is the parent oid (for complex objects)
  {F40} can have the value PUBLISH, telling LadyBird to auto-publish this object
• Metadata examples: {FDID=58} call number, {FDID=262} Host, creator, etc.
• The cataloger can take advantage of Excel functionality (like repeating fields) to quickly create a spreadsheet for batch import.
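The column convention above can be sketched in Ruby. This is a minimal illustration only, not LadyBird's actual import code (which is C#): the helper name `parse_row`, the returned hash keys, and the idea that a repeated {FDID=n} column carries a repeating field are all assumptions made for the example.

```ruby
# Hypothetical helper: split one tab-delimited spreadsheet row into
# function columns ({F1}, {F4}, {F40}) and metadata columns ({FDID=n}).
def parse_row(header_line, row_line)
  headers = header_line.chomp.split("\t", -1)
  values  = row_line.chomp.split("\t", -1)

  functions = {}
  metadata  = Hash.new { |hash, fdid| hash[fdid] = [] }

  headers.zip(values).each do |header, value|
    value = value.to_s.strip
    case header
    when /\A\{F(\d+)\}\z/
      functions[$1.to_i] = value
    when /\A\{FDID=(\d+)\}\z/
      # Assumption: a repeated {FDID=n} column holds a repeating field.
      metadata[$1.to_i] << value unless value.empty?
    end
  end

  {
    oid:        functions[1].to_s.empty? ? nil : functions[1].to_i, # blank {F1}: create a new oid
    parent_oid: functions[4].to_s.empty? ? nil : functions[4].to_i,
    publish:    functions[40].to_s.upcase == "PUBLISH",
    metadata:   metadata
  }
end

# Made-up sample row: no oid yet, auto-publish, two call number values.
header = "{F1}\t{F4}\t{F40}\t{FDID=58}\t{FDID=58}"
row    = "\t\tPUBLISH\tMSS 123\tMSS 124"
p parse_row(header, row)
```

Keeping functions and metadata in separate structures mirrors the slide's distinction: function columns steer processing, while FDID columns carry values destined for the metadata tables.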
LadyBird concepts III
field_definition (fdid) table (230 metadata fields)
fdid: 51 52 53 54 55 56 57 58 59 60
name: Cataloger; Record source; Record date; Record modified date; Record ID; Local record ID, other; Call number; Accession number; Box
The values are either strings or acid values (more on acids later).
LadyBird concepts IV
• Import tables: all about the spreadsheets, though you can also import MARC or EAD records by bibid, barcode, or handle. In that case the records are deserialized into fdids, and any spreadsheet data overrides the records.
  im_job (1 master row per spreadsheet)
  im_job_exHead (column headers from spreadsheet)
  im_job_contents (values)
  im_files (for files)
  import_checksum (for files)
  im_job_contents_history
• Job tracking (overall tracking associates an imported oid with a specific job)
  trk_project
  trk_job_contents
  trk_oid
LadyBird concepts V
• The c# tables: c for "current", # for each collection
• The "metadata home": data imported to the im tables is finally transferred here
• There is a set of tables for each collection. Example: # = 13 (collection: Hydra, project: Hydra Test)
  c13 (master list of oids)
  c13_strings
  c13_longstrings
  c13_acid
• Each row contains basically an oid/fdid/value triple, so given an oid one can get all metadata fields for that object as rows from these tables. Each row also has a favid for additional values associated with the fdid.
• There are also corresponding p# tables (p for "past") that keep an audit trail of any updates to specific oids.
• The c# tables are designed for high volume. Better options, such as hashing, are being explored.
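The oid/fdid/value row layout can be illustrated with a small in-memory sketch. The rows below are made-up stand-ins for a c13_strings-style table, and `metadata_for` is a hypothetical helper; the real lookup would be a SQL query against the c# tables.

```ruby
# Hypothetical helper: gather all metadata for one oid from
# oid/fdid/value rows, grouping repeated fdids into arrays.
def metadata_for(rows, oid)
  rows.select { |row| row[:oid] == oid }
      .group_by { |row| row[:fdid] }
      .transform_values { |matches| matches.map { |row| row[:value] } }
end

# Made-up sample rows (not real LadyBird data).
rows = [
  { oid: 1001, fdid: 58, value: "MSS 123" },
  { oid: 1001, fdid: 62, value: "Adair, James, ca. 1709-1783" },
  { oid: 1002, fdid: 58, value: "MSS 456" }
]

p metadata_for(rows, 1001)
```

Because every field is stored as its own row, adding a new metadata field needs no schema change, which is one plausible reason for choosing this layout for high volume.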
LadyBird concepts VI
• Acid: authority control, a system for using controlled vocabulary for metadata fields
• fdid 62 = Host, Creator
acid    fdid  value
126434  62    Luhan, Mabel Dodge, 1879-1962
126626  62    Dobbs, Arthur, 1689-1765
126628  62    Filson, John, ca. 1747-1788
126630  62    Thomson, Charles, 1729-1824
126632  62    Hutchins, Thomas, 1730-1789
126635  62    Adair, James, ca. 1709-1783
So if, for an oid row in the spreadsheet, the fdid 62 column is given the value 126635, that field resolves to "Adair, James, ca. 1709-1783".
Currently 155,415 values. Potential for more sophisticated uses with linked data.
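Acid resolution amounts to a lookup from an acid to its controlled value. The sketch below uses an in-memory hash in place of the authority table; `resolve_acid` is a hypothetical name, and checking that the acid belongs to the requested fdid is an assumption about how the lookup is scoped.

```ruby
# Hypothetical helper: resolve an acid to its controlled value,
# returning nil when the acid is unknown or belongs to another fdid.
def resolve_acid(authority, fdid, acid)
  entry = authority[acid]
  return nil if entry.nil? || entry[:fdid] != fdid
  entry[:value]
end

# Two entries copied from the table on the slide.
authority = {
  126434 => { fdid: 62, value: "Luhan, Mabel Dodge, 1879-1962" },
  126635 => { fdid: 62, value: "Adair, James, ca. 1709-1783" }
}

puts resolve_acid(authority, 62, 126635)
```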
LadyBird sample workflow start
• Workstation mounted with a job folder for both import and export
  Windows: \\birdcage.library.yale.edu\project25\import
  Mac: smb://birdcage.library.yale.edu/project25//import//
  Windows: \\birdcage.library.yale.edu\project25\export
  Mac: smb://birdcage.library.yale.edu/project25//export//
• Project 25 corresponds to an entry in the project table
• Create a folder in the import directory and drag files into folders or subfolders
• LadyBird will detect that folder and create a job for it under the "Dashboard" menu selection
LadyBird dashboard
Add digital object to folder
Go to dashboard and process this folder
Receive email confirmation
Subject: LadyBird Import Complete job: test_open_rep
Your import has been processed. test_open_rep
Visit your dashboard in Ladybird for your most recent jobs. http://ladybird.library.yale.edu/user_jobs.aspx
View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12307
* A jobcomplete.txt file with the time is added to the import folder so the app knows that the directory is complete.
View job
View set
New object -> Metadata (form)
Or, from View Set, "Export as Job"
Receive export email confirmation
Subject: LadyBird Export Ready
Your export is ready. \\birdcage\project25\export\ermadmix_46371_06262013_165116.xls
Spreadsheet: fill in and save as tab-delimited text file
Import
Import email confirmation
Subject: LadyBird Import Complete job: ermadmix_import_062613_171134
Your import has been processed. ermadmix_import_062613_171134
Visit your dashboard in Ladybird for your most recent jobs. http://ladybird.library.yale.edu/user_jobs.aspx
View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12313
Publish
• Publishes automatically if {F40}=publish
• Or use the interface to check the file and metadata and explicitly click the publish button
Publish (behind the scenes)
• The oid is added to the hydra table with date (when added) and date published (when processing is complete) timestamps
id     oid       date                     date published
…      …
39176  10684347  2013-06-26 16:01:11.043  2013-06-26 17:14:05.900
39177  10684348  2013-06-26 16:01:11.043  2013-06-26 17:14:07.457
39178  10684349  2013-06-26 16:01:11.043  2013-06-26 17:14:09.017
39179  10684350  2013-06-26 16:01:11.043  2013-06-26 17:14:10.577
39180  10684351  2013-06-26 16:01:11.043  2013-06-26 17:14:12.137
39181  10684352  2013-06-26 16:01:11.043  2013-06-26 17:14:13.697
…      …
oid added to hydra_publish table
Key fields:
hpid: 23703
hcmid: 2
cid: 9
pid: 27
oid: 10681633
_oid: 0
zindex: 0
hydraID: null
dateReady: 2013-06-26 16:01:55.430
dateHydraStart: null
Rows for oid added to hydra_publish_path table
Key fields with example:
hppid: 139004
hpid: 26340
type: jp2
pathHTTP: http://lbfiles.library.yale.edu/10684274.jp2
pathUNC: \storage. yale. eduhomeladybird-801001 yulladybirdproject 27publishdl106842741758. 02. 00_page 1. jp 2
md5: 35433b00ca9de2cdaed275c455339090
controlGroup: M
mimeType: image/jp2
dsid: jp2
ingestMethod: filepath
oidPointer: null
hydra_publish_path - typical files
• xml rights (Hydra rights)
• xml metadata (MODS descMetadata)
• xml access (home-grown granular rights)
• pdf (transcript, YIPP)
• pdf2 (annotated transcript, YIPP)
• jp2 (derivative)
• jpg (derivatives)
• tif (master)
descMetadata - creation
A service (C# class and methods) is called upon hydra publish. It iterates through all the fdids for an oid and uses the XML DOM to create a MODS file. This is basically a mapping of field definitions to the MODS schema. There is the potential to map the fdids to any metadata format.
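The fdid-to-MODS mapping idea can be sketched as follows. This is not the C# service described above: it is written in Ruby, it builds the XML with plain strings rather than an XML DOM, and the mapping itself (fdid 58 to location/shelfLocator, fdid 62 to name/namePart) is a hypothetical choice for illustration, not LadyBird's real mapping.

```ruby
# Escape the characters that are unsafe in XML text content.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Hypothetical mapping: fdid => nested MODS element path.
def sample_mods_mapping
  {
    58 => %w[location shelfLocator], # call number (assumed target element)
    62 => %w[name namePart]          # Host, Creator (assumed target element)
  }
end

# Build a MODS document from { fdid => [values] }; unmapped fdids are skipped.
def build_mods(metadata, mapping)
  body = metadata.flat_map do |fdid, values|
    path = mapping[fdid]
    next [] if path.nil?
    values.map do |value|
      open_tags  = path.map { |tag| "<mods:#{tag}>" }.join
      close_tags = path.reverse.map { |tag| "</mods:#{tag}>" }.join
      "  #{open_tags}#{xml_escape(value)}#{close_tags}"
    end
  end
  [%(<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">), *body, "</mods:mods>"].join("\n")
end

metadata = { 58 => ["MSS 123"], 62 => ["Adair, James, ca. 1709-1783"] }
puts build_mods(metadata, sample_mods_mapping)
```

Because the mapping is plain data, targeting another metadata format would mean supplying a different mapping table rather than changing the traversal, which matches the slide's remark that fdids could map to any format.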
accessMetadata
Rights metadata
Transition into the Fedora/Hydra world
select * from hydra_content_model
id  date                     uid  contentModel
1   2013-04-25 08:50:20.043  1    simple
2   2013-04-25 08:50:26.350  1    complexParent
3   2013-04-25 08:50:30.420  1    complexChild
contentModel maps to the ActiveFedora model.
Transition into the Fedora/Hydra world II
select * from hydra_content_model_ds
Columns: id, date, uid, hcmid, dsid, ingMethod, required
dsid and ingMethod values shown:
dsid            ingMethod
accessMetadata  pullHTTP
descMetadata    pullHTTP
rightsMetadata  pullHTTP
tif             filepath
jp2             filepath
jpg             filepath
oidPointer      pointer
pdf             filepath
pdf2            filepath
hcmid takes the values 1 to 3, and required is y or n.
Example - simple content model
  require "active-fedora"

  # A single image with derivatives; belongs to a collection.
  class Simple < ActiveFedora::Base
    belongs_to :collection, :property => :is_member_of

    has_metadata :name => 'descMetadata',     :type => Hydra::Datastream::SimpleMods
    has_metadata :name => 'accessMetadata',   :type => Hydra::Datastream::AccessConditions
    has_metadata :name => 'rightsMetadata',   :type => Hydra::Datastream::Rights
    has_metadata :name => 'propertyMetadata', :type => Hydra::Datastream::Properties

    # LadyBird identifiers are read from and written to the properties datastream.
    delegate :oid,       :to => "propertyMetadata", :unique => true
    delegate :projid,    :to => "propertyMetadata", :unique => true
    delegate :cid,       :to => "propertyMetadata", :unique => true
    delegate :zindex,    :to => "propertyMetadata", :unique => true
    delegate :parentoid, :to => "propertyMetadata", :unique => true
  end
Example - Properties datastream
  require 'active_fedora'

  module Hydra
    module Datastream
      class Properties < ActiveFedora::OmDatastream
        # ERJ note: ladybird pid = projid, ladybird _oid = parentoid
        set_terminology do |t|
          t.root(:path => "root")
          t.oid(:path => "oid")
          t.cid(:path => "cid")
          t.projid(:path => "projid")
          t.zindex(:path => "zindex")
          t.parentoid(:path => "parentoid")
          t.ztotal(:path => "ztotal")
          t.oidpointer(:path => "oidpointer")
        end

        # Index the LadyBird identifiers as Solr fields.
        def to_solr(solr_doc = Hash.new)
          super(solr_doc)
          solr_doc['oid_isi'] = oid
          solr_doc['cid_isi'] = cid
          solr_doc['projid_isi'] = projid
          solr_doc['zindex_isi'] = zindex
          solr_doc['parentoid_isi'] = parentoid
          solr_doc['ztotal_isi'] = ztotal
          solr_doc['oidpointer_isi'] = oidpointer
          solr_doc
        end
      end
    end
  end
Workflow review
1. Add a folder with files to the import folder.
2. Process the folder. This creates the records in the database (oids, job tracking, c# instances, and file derivatives).
3. Export spreadsheet. This creates a spreadsheet template for the folder of files in (1).
4. Fill in metadata in the spreadsheet: the main cataloging task.
5. Import spreadsheet. This ultimately populates the c# tables with metadata from the oid rows of the spreadsheet.
6. Publish to Hydra. This creates the hydra table rows with serialized metadata files (MODS, access rights) and stages files in storage for ingest.
Ingest task
• Set up within a Hydra project
• gem 'tiny_tds' connects to the LadyBird SQL Server database
app/models (objects)
• collection.rb: maps to pid (project) in LadyBird, parent to simple.rb and complex_parent.rb
• simple.rb: 1 image with derivatives, no hierarchy
• complex_parent.rb: parent to a set of images (like a book or image set)
• complex_child.rb: 1 image with derivatives (like a page)
These relate to the hydra_content_model table.
app/models (datastreams)
• coll_properties.rb
• rights.rb
• access_conditions.rb
• simple_mods.rb
simple_mods.rb - indexing
rake yulhy4:ingest I
Properties:
• SQL Server connection config
• Mount of LadyBird storage
Uses the hydra_publish table as a queue (driven by this query until done):
  select top 1 a.hpid, a.oid, a.cid, a.pid, b.contentModel, a._oid
  from dbo.hydra_publish a, dbo.hydra_content_model b
  where a.dateHydraStart is null
    and a.dateReady is not null
    and a._oid = 0
    and a.hcmid is not null
    and a.hcmid = b.hcmid
    and a.action = 'insert'
  order by a.dateReady
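The queue behaviour of that query can be sketched in plain Ruby over in-memory rows. This is an illustration under assumptions, not the rake task itself: the real task runs the SQL through tiny_tds, and the idea that a row is claimed by setting dateHydraStart is inferred from the query's filter rather than shown in the slides. The names `next_ready` and `drain_queue` are hypothetical.

```ruby
# Pick the oldest ready row, mirroring the WHERE and ORDER BY clauses:
# not yet started, ready, top-level (_oid = 0), has a content model, action 'insert'.
def next_ready(rows)
  candidates = rows.select do |row|
    row[:dateHydraStart].nil? &&
      !row[:dateReady].nil? &&
      row[:_oid] == 0 &&
      !row[:hcmid].nil? &&
      row[:action] == "insert"
  end
  candidates.min_by { |row| row[:dateReady] }
end

# Keep taking rows until none qualify; returns the hpids in processing order.
def drain_queue(rows)
  processed = []
  while (row = next_ready(rows))
    row[:dateHydraStart] = Time.now # assumption: claiming a row sets this column
    processed << row[:hpid]
  end
  processed
end

# Made-up rows; timestamps are strings in a sortable format.
rows = [
  { hpid: 1, dateReady: "2013-06-26 16:05:00.000", dateHydraStart: nil, _oid: 0, hcmid: 2, action: "insert" },
  { hpid: 2, dateReady: "2013-06-26 16:01:55.430", dateHydraStart: nil, _oid: 0, hcmid: 1, action: "insert" },
  { hpid: 3, dateReady: "2013-06-26 16:00:00.000", dateHydraStart: nil, _oid: 10684347, hcmid: 3, action: "insert" }
]
p drain_queue(rows)
```

Selecting one row at a time, as the `top 1` query does, keeps the task restartable: an interrupted run resumes with whatever rows still have a null dateHydraStart.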
rake yulhy4:ingest II
ActiveFedora ingest
Create a new object based on the content model:
  obj = Simple.new
  obj = ComplexParent.new
  obj = ComplexChild.new
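Choosing the class from the contentModel value can be sketched as a lookup table. The three classes below are empty stand-ins so the example runs without Fedora; in the real project they would be the ActiveFedora models. `model_classes` and `build_object` are hypothetical names.

```ruby
# Empty stand-ins for the ActiveFedora models (assumption, for illustration only).
class Simple; end
class ComplexParent; end
class ComplexChild; end

# contentModel values as they appear in hydra_content_model.
def model_classes
  {
    "simple"        => Simple,
    "complexParent" => ComplexParent,
    "complexChild"  => ComplexChild
  }
end

# Instantiate the model matching a contentModel string; fail loudly on unknown values.
def build_object(content_model)
  klass = model_classes.fetch(content_model) do
    raise ArgumentError, "unknown content model: #{content_model}"
  end
  klass.new
end

obj = build_object("complexParent")
puts obj.class
```

An explicit table avoids turning database strings into class names dynamically, so an unexpected contentModel value fails with a clear error instead of instantiating an arbitrary class.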
rake yulhy4:ingest III
Iterate through all datastreams for the content model:
  select hcmds.dsid as dsid, hcmds.ingestMethod as ingestMethod, hcmds.required as required
  from dbo.hydra_content_model hcm, dbo.hydra_content_model_ds hcmds
  where hcm.contentModel = '#{contentModel}'
    and hcm.hcmid = hcmds.hcmid
For each row returned by the query above, get the datastream info for the oid:
  select type, pathHTTP, pathUNC, md5, controlGroup, mimeType, dsid, OIDpointer
  from dbo.hydra_publish_path
  where hpid = #{i["hpid"]} and dsid = '#{dsid}'
Verify checksums and use ActiveFedora to ingest the datastreams.
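The checksum verification step can be sketched with Ruby's standard Digest library. The helper name `checksum_matches?` is hypothetical, and the sketch assumes the md5 column holds a hex digest of the staged file, as the 32-character example value on the hydra_publish_path slide suggests.

```ruby
require "digest/md5"
require "tempfile"

# Compare a file's MD5 hex digest with the expected value, ignoring case
# and surrounding whitespace.
def checksum_matches?(path, expected_md5)
  Digest::MD5.file(path).hexdigest.casecmp?(expected_md5.to_s.strip)
end

# Demo with a temporary file standing in for a staged datastream file.
Tempfile.create("staged") do |file|
  file.write("sample bytes")
  file.flush
  expected = Digest::MD5.hexdigest("sample bytes")
  puts checksum_matches?(file.path, expected)
end
```

A datastream whose checksum does not match would presumably be skipped or reported rather than ingested; how the real task handles a mismatch is not stated in the slides.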
rake yulhy4:ingest IV
Add LadyBird-specific info to the properties datastream:
• oid
• cid
• pid
• zindex
• _oid
Add hierarchical info to RELS-EXT:
• simple and complex_parent: is_member_of a collection
• complex_child: is_member_of a complex_parent
Some discussion about adding more linked data.
rake yulhy4:ingest V
rake yulhy4:ingest VI
Blacklight
Review
Future
hydra_publish: revise already ingested content
• action='update'
• action='insert'
Archivematica (by Artefactual)
• Replace the ingest task with a custom workflow
• GUI interface
• Human decision points and manual processing
• Technical metadata generation (FITS)
• Provenance (JHOVE)
• Issues: how to employ OAIS packages (SIP, AIP, DIP) for objects without a natural package structure?
Contributors
• Eric James
• Lakeisha Robinson
• Kalee Sprague
• Osman Din
• Jay Terray
• Rebekeh Irwin
• Mike Friscia
Thank you