A Multi-Tiered Architecture for Distributed Data Collection and Centralized Data Delivery
Stacy Kowalczyk and James Halliday
April 28, 2008

Project Overview
IN Harmony is
• An IMLS-funded grant
  • Awarded in Fall 2004
  • To be completed in Fall 2008
• A partnership of
  • Indiana University Digital Library Program
  • Indiana University Lilly Library
  • Indiana State Museum
  • Indiana Historical Society
IN Harmony – DLP Spring Forum 2008 April 28, 2008

Project Goals
1. To provide a model for fostering collaborative digital library development by partnering with institutions with complementary collections;
2. To digitize a portion of the sheet music from these collections and offer access to these materials free of charge on the web;
3. To bring these materials and their attendant metadata together on a single web site, offering both federated searching of the entire collection and searching of one or more selected collections.

Deliverables
• Tools to
  • Process the images
  • Capture metadata
  • Provide search and display functions
• 10,000 pieces of sheet music scanned and cataloged
  • 4,000 Indiana University Lilly Library
  • 2,000 Indiana State Museum
  • 2,000 Indiana Historical Society

Cataloging and Imaging Workflow Goals
• Data integrity
  • Quality of the scans
  • Quality of the metadata
  • Accuracy of the links between page images
  • Accuracy of the links between metadata and images
• Simplicity of use
• Balance of flexibility and constraints

Cataloging and Imaging Use Cases
1. Catalog first
2. Scanning first
3. Metadata created in another system and imported into IN Harmony

Digitizing Quality Control
• Two-phase Quality Control process
• Automated QC process verifies:
  • All TIFF tags of every digital file
  • TIFF must be uncompressed
  • File names
  • Embedded profile appropriate to its bit depth
  • Consistency of pixel dimensions within a score
  • Appropriate resolution

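A few of the automated checks listed above can be sketched in Python. This is only an illustration, not the project's Perl application: the `PageScan` record, the `MIN_DPI` threshold, and the reading of "consistent pixel dimensions" as exact equality are all assumptions. The one fixed fact used is that value 1 of the TIFF Compression tag (259) means no compression.

```python
from dataclasses import dataclass

# Hypothetical threshold; the slides do not state the required resolution.
MIN_DPI = 300
UNCOMPRESSED = 1  # TIFF Compression tag (259): value 1 means no compression


@dataclass
class PageScan:
    """Hypothetical record of values already read from one page's TIFF tags."""
    filename: str
    width: int
    height: int
    dpi: int
    compression: int


def check_score(pages):
    """Return a list of problems; an empty list means the score passes."""
    problems = []
    for page in pages:
        if page.compression != UNCOMPRESSED:
            problems.append(f"{page.filename}: TIFF is compressed")
        if page.dpi < MIN_DPI:
            problems.append(f"{page.filename}: {page.dpi} dpi is below {MIN_DPI}")
    # Assumption: "consistent" means every page has identical pixel dimensions.
    sizes = {(p.width, p.height) for p in pages}
    if len(sizes) > 1:
        problems.append("inconsistent pixel dimensions within score")
    return problems
```

Returning a list of messages rather than a single pass/fail flag lets an operator see every failing page in one run.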
Digitizing Quality Control (2)
• Manual QC – at 100% pixel display, verify:
  • Correct page orientation and order
  • Correct color balance
  • Sharp and in-focus scan
  • No digital artifacts
• When all QC is passed, derivative files are created
  • Large and small JPGs for screen delivery
  • PDF sized for 8.5 x 11 printing

Digitizing Quality Control Software
Designing the metadata model
• User studies
• Work with the partners
• Define fields
• Write cataloging guidelines with partner input
• Representation in MODS

Types of fields
• Title elements
• Name elements
• Publication elements
• Subject elements
• Identification elements
• Note elements
• Cover information

Metadata Collection Tool
Public Search and Discovery System Demo
ARCHITECTURE OVERVIEW
JIM HALLIDAY

IN Harmony Technical Overview
[Diagram. Components shown: Scanner, FTP, Quality Control Perl Web Application, Java Swing Cataloging Client, Authentication Service, Oracle, MODS Export, Fedora, Mass Storage System, SRU and http, Web Browser]

Getting Data Into IN Harmony
2 primary data sources
• Cataloging client
• Image QC/upload application
Other data sources
• XML data exported from other cataloging systems
• Score images exported from older systems

Image QC/upload application
1. User scans scores and uploads to IN Harmony server
2. User accesses Perl-based web application to initiate automated quality control
3. A second user proceeds with manual QC, then uses web application to signal that manual QC is finished
4. The application moves and backs up the files, creates derivatives, and alerts both Fedora and the internal database that the process is complete

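The four steps above form a strictly ordered pipeline, which could be modeled as a small state machine. The sketch below is illustrative only: the `Stage` names and the `advance` helper are hypothetical and do not come from the IN Harmony code.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names, one per step of the upload/QC workflow."""
    UPLOADED = 1
    AUTO_QC_PASSED = 2
    MANUAL_QC_PASSED = 3
    FINALIZED = 4


# Each stage may only move to the single stage that follows it.
NEXT_STAGE = {
    Stage.UPLOADED: Stage.AUTO_QC_PASSED,
    Stage.AUTO_QC_PASSED: Stage.MANUAL_QC_PASSED,
    Stage.MANUAL_QC_PASSED: Stage.FINALIZED,
}


def advance(stage):
    """Return the stage that follows `stage`, or raise if it is the last one."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.name} is the final stage")
    return NEXT_STAGE[stage]
```

Keeping the transitions in one table makes it impossible for a score to be finalized without having passed both QC phases.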
IN Harmony Derivatives
• Three sizes of JPGs produced per page
  • Full (1200 px high)
  • Screen (600 px high)
  • Thumb (200 px high)
• Multi-page, playable PDF
  • Approx. 1 MB for an average score

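Since the three JPG sizes are specified by height alone, each width presumably follows from the master image's aspect ratio. A minimal sketch of that arithmetic follows; the function name and the rounding choice are assumptions, and only the three heights come from the slide.

```python
# Target heights from the slide; widths are derived from the master's aspect ratio.
DERIVATIVE_HEIGHTS = {"full": 1200, "screen": 600, "thumb": 200}


def derivative_sizes(master_width, master_height):
    """Map each derivative name to (width, height), preserving aspect ratio."""
    sizes = {}
    for name, height in DERIVATIVE_HEIGHTS.items():
        # Scale the width by the same factor applied to the height.
        width = round(master_width * height / master_height)
        sizes[name] = (width, height)
    return sizes
```

For a hypothetical 3000 x 4000 px master, this gives a 900 x 1200 full image, a 450 x 600 screen image, and a 150 x 200 thumbnail.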
IN Harmony cataloging client
• Standalone Java Swing-based client
• Connects to Oracle database and outputs MODS for Fedora ingestion
• Implemented as a client-server application via web services using Axis
• Specialized UI components (such as ‘smart’ combo boxes) assist with quick, correct data entry

Internal IN Harmony database
• Oracle database stores record and user data in our own internal format
• Communicates with upload/QC application and cataloging client
• Cataloging client and internal scripts can output to MODS format for ingestion into Fedora

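To illustrate what exporting an internal record to MODS might look like, here is a minimal Python sketch. The input dictionary and its keys are hypothetical stand-ins for the internal format, which the slides do not describe, and the output covers only a title, one name, and one identifier. The namespace and element names are those of MODS version 3.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def record_to_mods(record):
    """Serialize a dict with 'title', 'composer', and 'identifier' keys as MODS."""
    def el(parent, tag, text=None, **attrs):
        # Create a namespaced child element with optional text and attributes.
        child = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}", attrs)
        child.text = text
        return child

    mods = ET.Element(f"{{{MODS_NS}}}mods")
    el(el(mods, "titleInfo"), "title", record["title"])
    name = el(mods, "name", type="personal")
    el(name, "namePart", record["composer"])
    el(mods, "identifier", record["identifier"], type="local")
    return ET.tostring(mods, encoding="unicode")
```

A call such as `record_to_mods({"title": "Example Waltz", "composer": "Doe, Jane", "identifier": "example-0001"})` returns a single `mods:mods` element as a string; the sample values are invented for illustration.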
IN Harmony authentication
• CAS (IU’s Central Authentication Service) is used to authenticate all users
• Non-IU users must create IU Guest Accounts to authenticate
• All account/password maintenance in user’s control

Fedora and IN Harmony
• Fedora used as a single storage and infrastructure solution for Digital Library Program projects at IU
• Data (score images and metadata) ingested into Fedora and referenced as METS objects
• Master images sent to IU’s mass storage system
• Derivatives stored internally
• Objects indexed using Lucene for SRU-based searching

Fedora Object Model
[Diagram. Object levels shown: Collection, Sheet music, Copy, Page]

IN Harmony end-user interface
• Java Struts-based web application
  • Offers searching, browsing, and record display
  • Each partner institution is offered a personalized view of their data only
• Interaction with Fedora
  • Application sends CQL queries to Fedora and retrieves MODS data, which is transformed via XSLT
  • PURLs (persistent URLs) are used to access image derivatives

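As a rough idea of what sending a CQL query over SRU involves, the sketch below builds a `searchRetrieve` request URL in Python. The parameter names are standard SRU 1.1; the base URL, the `mods` schema short name, and the `dc.title` index are assumptions, since the slides do not give the real endpoint or its configuration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real SRU base URL is not given in the slides.
SRU_BASE = "http://example.org/sru"


def build_sru_url(cql_query, max_records=10, start=1):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": max_records,
        "recordSchema": "mods",  # assumed short name for the MODS schema
    }
    # urlencode percent-escapes the quotes and operators in the CQL string.
    return f"{SRU_BASE}?{urlencode(params)}"
```

For example, `build_sru_url('dc.title = "waltz"')` yields a URL whose response would contain MODS records, which the web application could then transform via XSLT for display.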
METS Navigator
• METS Navigator is used to page through scores online
• Uses METS structmap to facilitate navigation
• Allows views of multiple sizes of images
• Released by IU as open source – see http://metsnavigator.sourceforge.net

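To show how a structMap can drive page-by-page navigation, the sketch below reads the page order out of a METS document in Python. It is not METS Navigator's own code: the sample document is a hypothetical, heavily trimmed example, and the use of `TYPE="page"` divs with `ORDER` attributes is an assumption about how the scores are encoded.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily trimmed METS document for illustration only.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="physical">
    <mets:div TYPE="score" LABEL="Example Waltz">
      <mets:div TYPE="page" ORDER="2" LABEL="Page 1"><mets:fptr FILEID="p2"/></mets:div>
      <mets:div TYPE="page" ORDER="1" LABEL="Cover"><mets:fptr FILEID="p1"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def page_sequence(mets_xml):
    """Return (label, file id) pairs for each page div, sorted by ORDER."""
    root = ET.fromstring(mets_xml)
    pages = root.findall(".//mets:structMap//mets:div[@TYPE='page']", NS)
    pages.sort(key=lambda d: int(d.get("ORDER")))
    return [(d.get("LABEL"), d.find("mets:fptr", NS).get("FILEID")) for d in pages]
```

Sorting on `ORDER` rather than relying on document order means the viewer would still page correctly if the divs were written out of sequence, as they are in the sample.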
IN Harmony Links
• IN Harmony Public Interface
• IN Harmony Project Information
• Cataloging Tool release date – June 2008

Questions?