Chronicling America and the National Digital Newspaper Program

Chronicling America and the National Digital Newspaper Program: Technical Aspects
• Part 1: Newspapers and Microfilm
    • Challenges
    • USNP
• Part 2: Technical Details
    • Image views
    • Text searching
    • Indexing
• Part 3: Managing a newspaper digitization project
PIALA 2010 UH Manoa Hamilton Library

Challenges
• Newspapers are a difficult medium
• Never meant to last, made for daily use and disposal
• Pages crumble and acid corrodes the materials
• Tracking serial publications over time
• Patron demand increased, storage space grew scarce, binding costs rose

Microfilm
• Adopted in the 1920s as a standard
• Turns newspapers from a storage nightmare into a relatively easy medium to handle
• Libraries had to decide what to do with the hard copy
    • Keep in holdings?
    • Deaccession?

United States Newspaper Program (USNP)
• Began in 1982
• Funded by the National Endowment for the Humanities (NEH), managed by the Library of Congress
• The University of Hawai’i, with the Hawaiian Historical Society, Hawai’i State Archives, and State Library, contributed for Hawai’i
• In the mid-2000s: the USNP had received over $54 million in NEH support and non-federal contributions of approximately $19.6 million
• Bibliographic records for over 140,000 newspaper titles; access to 70 million pages of newsprint on microfilm

USNP
• Goal: locate, catalog, and microfilm newspapers
• Hawai’i microfilmed 260,000 pages and cataloged 476 titles
• Program ended in 2007

USNP Preservation Microfilming Guidelines
• Optimum legibility
    • Image orientation and reduction ratios to fill the frame and obtain the greatest degree of legibility in public use copies
• Quality
    • Each roll of first-generation film shall be inspected frame by frame by both the filming agency and the project for density and resolution, and to determine that the film is free of emulsion scratches, abrasions, fingerprints, spots, fog, and other defects
• http://www.loc.gov/preserv/usnpguidelines.html

USNP Preservation Microfilming Guidelines
• Density
    • No fewer than five readings at the start, middle, and end of each reel, taken with a transmission densitometer calibrated daily
    • Maximum density (Dmax) measurements taken on an exposed image area with no words or graphics
    • Background densities no lower than 0.80 and no higher than 1.20; lower densities preferred for older pages and to facilitate production of reader-printer and enlargement prints
    • Base-plus-fog density (Dmin) on the master negative shall not exceed 0.10
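The density thresholds on this slide lend themselves to a simple automated check once readings are in a spreadsheet. The following is a minimal Python sketch, not part of the USNP guidelines or of any Library of Congress tool. The function name `check_reel_density`, its parameters, and the reading of "five readings" as a per-reel minimum are assumptions made for illustration.

```python
from typing import List

# Thresholds taken from the guideline figures quoted on the slide.
BACKGROUND_MIN = 0.80
BACKGROUND_MAX = 1.20
DMIN_MAX = 0.10
MIN_READINGS = 5  # assumption: treated here as a minimum per reel


def check_reel_density(background: List[float], dmin: float) -> List[str]:
    """Return a list of problems found; an empty list means the reel passes."""
    problems = []
    if len(background) < MIN_READINGS:
        problems.append(
            f"only {len(background)} readings; need at least {MIN_READINGS}"
        )
    for i, value in enumerate(background, start=1):
        if not BACKGROUND_MIN <= value <= BACKGROUND_MAX:
            problems.append(
                f"reading {i} ({value:.2f}) outside "
                f"{BACKGROUND_MIN:.2f}-{BACKGROUND_MAX:.2f}"
            )
    if dmin > DMIN_MAX:
        problems.append(f"Dmin {dmin:.2f} exceeds {DMIN_MAX:.2f}")
    return problems


if __name__ == "__main__":
    # Hypothetical readings for one reel; the last one is too dense.
    for problem in check_reel_density([0.95, 1.02, 0.88, 1.10, 1.25], dmin=0.08):
        print(problem)
```

Returning a list of messages rather than a single pass/fail flag keeps every failed reading visible, which is useful when deciding whether a reel needs to be re-filmed or can still be scanned.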

National Endowment for the Humanities and Library of Congress created NDNP
• No single US collection of newspapers
• Every institution focuses on particular themes relating to its collecting plans
• Thousands of volumes of newspapers spread across the country
• Enhance access to newspapers, building on the foundation of the United States Newspaper Program

NDNP Overview
• 2-year awards to state projects, renewable
• Digitize 100,000 pages of microfilmed newspapers
• Newspapers picked must date from between 1836 and 1922
• Historical essays on each newspaper
• Collation and quality control on all papers

NDNP Goals
• 20-year span with phased, sustainable development of a 30-million-page database
• Establish technical conversion specs and practices for efficient basic discovery and access
• Develop production tools to ensure good digital objects that can be managed and preserved long-term
• Provide public access to, and take preservation responsibility for, the digitized newspapers
• Create a national resource of historically significant newspapers from all the states and U.S. territories

NDNP Microfilm-related Challenges
• Where are the master reels?
• Copyright issues (who filmed the newspapers and owns the master microfilm?)
• Technical specifications (poorly filmed, low density readings, etc.)
• Microfilm standards applied vary widely

No universally accepted metadata standard for historical newspapers
• Online historical newspapers produced by the public or private sector existed as discrete systems, with metadata structures not designed for interoperability
• Titles, issues, pages, and reels all need to be represented as different yet related classes of objects

NDNP Digital Deliverables
• Images scanned at 300-400 dpi
• Three formats:
    • Grayscale, uncompressed TIFF 6.0 images
    • Compressed JPEG 2000 images
    • PDF image with hidden text
• Accompanying structural and technical metadata
• OCR text for all pages

NDNP Scanning specifications
• De-skew images with a skew of greater than 3 degrees
• Crop to visible edge of page
• Capture grayscale preservation microfilm targets

NDNP OCR specifications
• Conform to ALTO XML schema
    • ALTO (Analyzed Layout and Text Object) is an XML (Extensible Markup Language) schema that details technical metadata for describing the layout and content of physical text resources
• Bounding box coordinate data
    • Each column is sectioned and coordinates are used to place words
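To make the bounding-box idea concrete, here is a small Python sketch that reads word coordinates from an ALTO-style file. ALTO stores each recognized word as a `String` element whose `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes give the text and its box. The sample below is a simplified, hypothetical fragment with no XML namespace (real NDNP files declare one and carry far more detail), and the helper names `word_boxes` and `find_word` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO-like sample with made-up content.
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="5000" HEIGHT="7000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Honolulu" HPOS="120" VPOS="300" WIDTH="410" HEIGHT="60"/>
            <String CONTENT="Harbor" HPOS="550" VPOS="300" WIDTH="320" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_xml):
    """Yield (word, left, top, right, bottom) for every String element."""
    root = ET.fromstring(alto_xml)
    for s in root.iter("String"):
        left = float(s.get("HPOS"))
        top = float(s.get("VPOS"))
        right = left + float(s.get("WIDTH"))
        bottom = top + float(s.get("HEIGHT"))
        yield (s.get("CONTENT"), left, top, right, bottom)


def find_word(alto_xml, term):
    """Return the boxes of all words equal to term, ignoring case."""
    return [box for box in word_boxes(alto_xml) if box[0].lower() == term.lower()]


if __name__ == "__main__":
    # A viewer could draw a highlight rectangle at each returned box.
    print(find_word(SAMPLE, "harbor"))
```

This is the mechanism that lets a search interface highlight a hit on the page image: the search runs against the OCR text, and the coordinates say where on the scan to draw the highlight.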

NDNP Metadata requirements (metadata is information about information)
• METS (Metadata Encoding and Transmission Standard) format records preservation metadata
• Structural metadata to relate pages to title, date, and edition; sequence pages within issue or section; and identify image and OCR files
• Technical metadata to support the functions of the Library of Congress repository

XML Rules
• Single, unique root element
• Matching open/close tags
• Consistent capitalization
• Correctly nested elements (no overlapping elements)
• Attribute values enclosed in quotes
• No repeating attributes in an element
• Provides international, vendor-independent standard for describing information
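Each of these well-formedness rules can be demonstrated with any standard XML parser, which rejects a document that breaks one of them. The sketch below uses Python's built-in `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# One sample per rule; only the first is well formed.
EXAMPLES = {
    "well formed": '<issue date="1910-01-01"><page n="1"/></issue>',
    "two root elements": "<issue/><issue/>",
    "inconsistent capitalization": "<issue></Issue>",
    "overlapping elements": "<issue><page></issue></page>",
    "unquoted attribute value": "<issue date=1910></issue>",
    "repeated attribute": '<issue date="1910" date="1911"/>',
}

if __name__ == "__main__":
    for label, text in EXAMPLES.items():
        ok, error = is_well_formed(text)
        status = "OK" if ok else "rejected (" + error + ")"
        print(f"{label}: {status}")
```

Note that well-formedness is a weaker check than validation against a schema such as METS or ALTO: a file can parse cleanly and still use the wrong elements.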

Family of XML data standards includes:
• METS – Metadata Encoding and Transmission Standard
• MODS – Metadata Object Description Schema
• PREMIS – PREservation Metadata Implementation Strategies
• EAD – Encoded Archival Description

METS (Metadata Encoding and Transmission Standard)
• XML Schema for the purpose of creating XML files that define:
    • the hierarchical structure of digital library objects (images, text files, etc.)
    • the names and locations of the files
    • the associated metadata (e.g., MODS)

Metadata Object Description Schema (MODS)
• An XML Schema designed for expressing bibliographic data (think of it as an alternative to the MARC format)

Sections of a METS file
<mets>
    <metsHdr/> - METS header (document talks about itself)
    <dmdSec/> - Descriptive metadata (MODS, etc.)
    <amdSec/> - Administrative metadata (copyright info, etc.)
    <fileSec/> - File section (names and locations of files)
    <structMap/> - Structural map (relationships of the parts)
    <structLink/> - Linking information
    <behaviorSec/> - Binding executables/actions to object
</mets>
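As a rough illustration of how software walks these sections, the Python sketch below lists the top-level sections of a METS document in order and reports any that are absent. The skeleton is a hypothetical, nearly empty file, not an NDNP deliverable; `list_sections` and `missing_sections` are invented names, and the default `required` tuple is illustrative only (a real project would take its requirements from the schema and the NDNP profile).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical skeleton; real files carry much more detail in each section.
SKELETON = f"""<mets xmlns="{METS_NS}">
  <metsHdr/>
  <dmdSec ID="dmd1"/>
  <amdSec/>
  <fileSec/>
  <structMap><div/></structMap>
</mets>"""


def list_sections(mets_xml):
    """Return the local names of the top-level sections, in document order."""
    root = ET.fromstring(mets_xml)
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return [child.tag.split("}", 1)[-1] for child in root]


def missing_sections(mets_xml, required=("structMap",)):
    """Return the names in `required` that do not appear at the top level."""
    present = set(list_sections(mets_xml))
    return [name for name in required if name not in present]


if __name__ == "__main__":
    print(list_sections(SKELETON))
    print(missing_sections(SKELETON, required=("fileSec", "structMap", "structLink")))
```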

Title METS
• Combines bibliographic and holdings data in a single title record, converted from MARC to MARC XML format
• Titles digitized will have additional data
    • descriptive essays, more precise geographic coverage data
    • which is put in a Metadata Object Description Schema (MODS) object within the larger METS document

Issue and Reel METS
• Issue METS
    • Issue data
    • Page data
• Reel METS
    • Reel data
    • Target data

WHY?
• XML structure used by software for creation of multiple outputs:
    • HTML/XHTML for Web display; PDF for printing
• Ease of editing (single records or batches of records)
• Ability to validate data
• Ease of data management and publishing
• Interoperability
    • Repository submission and OAI harvesting

All that coding pays off for the user when SEARCHING
• Geographic metadata
• Title metadata
• Date metadata

Keyword searching
• OCR/OWR does not yield article “transcriptions”; text OCR’d from images of newspapers is used for searching purposes
• Several options
    • ANY of the words, ALL of the words
    • EXACT PHRASE
    • Proximity search: look for words within 5, 10, 50, or 100 words of one another
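The four search options can be sketched over plain OCR text in a few lines of Python. This is only an illustration of what each option means, not how Chronicling America implements search (a production system would use a full-text index); the function names and the simple tokenizer are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match_any(text, terms):
    """ANY of the words: at least one term appears."""
    words = set(tokenize(text))
    return any(t.lower() in words for t in terms)


def match_all(text, terms):
    """ALL of the words: every term appears, in any order."""
    words = set(tokenize(text))
    return all(t.lower() in words for t in terms)


def match_phrase(text, phrase):
    """EXACT PHRASE: the terms appear consecutively and in order."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def match_proximity(text, term_a, term_b, distance):
    """Proximity: the two terms occur within `distance` words of each other."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)


if __name__ == "__main__":
    # Hypothetical OCR text for one page.
    page = "The steamer arrived at Honolulu harbor on Monday with sugar from Maui"
    print(match_any(page, ["whale", "sugar"]))
    print(match_all(page, ["whale", "sugar"]))
    print(match_phrase(page, "Honolulu harbor"))
    print(match_proximity(page, "steamer", "sugar", 10))
```

Because the underlying text comes from OCR, a misread word simply fails to match, which is one reason results from historical newspapers are never complete.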

Page thumbnail view
• Click on thumbnail or description of page to view larger version

Page view
• A different format can be selected with one click

Browse Issues
• A calendar view indicating which issues have been digitized
• Can change which year you’re viewing
• Browse First Pages

Project Management
• From Microfilm to Digital Images
• Managing a Newspaper Conversion Project

NDNP & University of Hawai’i
• UH’s first grant began in July 2008, running until June 2010
• Grant renewed: July 2010-June 2012
• Utilizing the microfilm created under the USNP
    • Excellent quality microfilm (in theory)
    • Fewer problems with cataloging/description, acquiring 2N duplicates (in theory)

Project Management
• Request for Proposals (RFP)
    • Include all LC technical specifications
• Position description(s)
    • Coordinator, students
• Hiring and training

Project components
• Microfilm identification and duplication
• Digitization
• Metadata creation & validation

Microfilm selection
• Choose what is important to your institution(s), if possible
• Copyright
    • Reels created by or for your institution
    • Reels by ProQuest, etc.: you may have to ask for permission and pay much higher duplication fees
• Decide
    • Complete runs of a few titles, or many short/incomplete runs of a lot of titles

Vendors
• iArchives
    • Leaders in the field
    • Lots of experience
• OCLC/BSLW (Backstage Library Works)
• Apex/Covantage
• Northern Micrographics (NMT)
• Local or national microfilm duplication companies

Equipment
• Ten 500 GB external hard drives (Western Digital MyBooks) and Pelican cases
• 1 PC with double monitor
• Software: Library of Congress’ Digital Viewer and Validator (DVV)
• Densitometer
• Microfilm reader/scanner

Our Stuff
• Densitometer
• Pelican cases
• Microfilm scanner
• PC with 2 monitors & portable HDs (red)

Staffing
• Project Coordinator
    • Quality Control Technician
• Graduate students
• Advisory Board
• Subject/history/newspaper specialists

Metadata Collection
• Density readings
• Recorded onto a spreadsheet

Preparing the Microfilm: Metadata
• Data from OCLC MARC record & local holdings

Preparing the Microfilm: Collation
• Review use copy of reel
    • Missing issues or pages
    • Duplicate issues or pages
    • Mutilated pages
    • Other abnormalities (e.g., pages out of order, incorrect dates)
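Once issue dates have been keyed into the collation spreadsheet in reel order, some of these checks can be partly automated. The Python sketch below flags duplicate, missing, and out-of-order issue dates. It assumes a fixed publication interval, which is a simplification (many papers skipped Sundays and holidays, so flagged gaps still need human review), and the function name `collate` and its parameters are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def collate(issue_dates, interval_days=1):
    """Report duplicate, missing, and out-of-order issue dates.

    issue_dates is the list of dates in the order they appear on the reel.
    """
    counts = Counter(issue_dates)
    duplicates = sorted(d for d, n in counts.items() if n > 1)

    # Walk from the first to the last date seen, stepping by the interval.
    present = sorted(counts)
    missing = []
    if present:
        step = timedelta(days=interval_days)
        expected = present[0]
        while expected <= present[-1]:
            if expected not in counts:
                missing.append(expected)
            expected += step

    # An issue is out of order if it is earlier than the one filmed before it.
    out_of_order = [b for a, b in zip(issue_dates, issue_dates[1:]) if b < a]
    return {"duplicates": duplicates, "missing": missing, "out_of_order": out_of_order}


if __name__ == "__main__":
    # Hypothetical reel of a daily paper.
    reel = [date(1910, 1, 1), date(1910, 1, 2), date(1910, 1, 2),
            date(1910, 1, 5), date(1910, 1, 4)]
    for kind, dates in collate(reel).items():
        print(kind, [d.isoformat() for d in dates])
```

Mutilated pages and incorrectly printed dates cannot be detected this way; those still require looking at the film.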

Preparing the Microfilm: Collation
• Review use copy, record data on spreadsheet

iArchives Digitization Workflow
[Workflow diagram. Labeled elements: Film Scanning; Split, De-Skew, Crop; Image Processing; Image Metadata; Page/Reel Metadata; OCR Framework; Post Process; Customer Deliverables; Shared Storage (NAS); Workflow Manager DB; Automated Processing Cloud; QC checkpoints throughout. Key: automatic process (image processing, OCR, …); manual process (image + page metadata); quality control.]

Scan QC

Split, Crop & DeSkew

iArchives OWR Framework
• 3 leading OCR software programs
• 2,000 Word Dictionary
• 2,000 Name Dictionary
• OWR

Post-vendor validation
• Once the hard drive is returned, we verify/validate the batch using the DVV program
    • Verification compares the metadata listed in the master XML file to the metadata found in the issue XML files for correctness
    • Validation is done if a new master XML file needs to be created. It creates checksums for each file and records them in the subsequent metadata
• Copy contents of hard drive onto our server
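The checksum step matters again when the batch is copied from the hard drive to the server, since comparing checksums before and after the copy shows whether any file changed in transit. The Python sketch below illustrates that idea with the standard `hashlib` module. It is not the DVV's implementation: the slides do not say which checksum algorithm or file layout the DVV uses, so SHA-256 and the function names here are assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of one file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def batch_manifest(batch_dir):
    """Map each file's path (relative to batch_dir) to its checksum."""
    root = Path(batch_dir)
    return {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Return paths whose checksum differs or that exist in only one manifest."""
    return sorted(
        k for k in before.keys() | after.keys() if before.get(k) != after.get(k)
    )
```

A typical use would be to build one manifest from the vendor's drive and another from the server copy, then confirm that `changed_files` returns an empty list before the drive is wiped and reused.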

Quality Control
• Image quality
    • Too dark? Too light? Skewed?
• Correct image?
    • Compare digitized image to microfilmed image
    • No Missing Issue/Page tags
• Review metadata
    • Dates
    • LCCN #
    • Locations

Thumbnail View
• Can use DVV or any graphics program

Quality Control
• LC Digital Viewer and Validator (DVV)

Metadata Viewer

OCR

Headers

Title Essays - 500 words
• Describes newspaper’s history
    • Date of establishment
    • Editors
    • Type of news reported
    • Political viewpoint
    • Where is the paper today?
• Published to Chronicling America

Links
• Chronicling America: http://chroniclingamerica.loc.gov/
• Library of Congress: http://www.loc.gov/ndnp/
• National Endowment for the Humanities: http://www.neh.gov/projects/ndnp.html
• Hawai’i Newspapers: a union list: http://evols.library.manoa.hawaii.edu/handle/10524/2089
• Using <METS> and <MODS> to Create XML Standards-based Digital Library Applications: http://www.loc.gov/standards/mods/presentations/mets-mods-morgan-ala07/

Thank You! Mahalo! Kinisou Chapur!
• Questions? Comments?
• Email us at:
    ♦ chantiny@hawaii.edu
    ♦ erenst@hawaii.edu
• https://sites.google.com/a/hawaii.edu/ndnp-hawaii/