HATHITRUST A Shared Digital Repository HathiTrust Outside-In

HATHITRUST A Shared Digital Repository HathiTrust Outside-In University of Michigan Law School June 14, 2011 Jeremy York HathiTrust Project Librarian

Outline • Front end • What you see • Backend – About (Mission and Goals) – Governance – Content – Services (including differences from Google) – How work gets done – Costs – Shared Strategies/Benefits

Front End

• Skip navigation link • Info about SSD service & link to accessibility page • Descriptive headings added (hidden from GUI with CSS) • Added labels & descriptive titles to forms & ToC table • Access keys for navigating pages with keyboard • Images used for style are in CSS, so no need to use alt tags

Access Matrix Type of work Public domain worldwide Public domain in the US Search – Bib and Full text World View Full-PDF download Print on Demand World US World if no restrictions, Partners if restrictions US if no restrictions, US partners if restrictions World if no restrictions Open World Access (+Creative Commons) In copyright World (and undetermined) World US Print Section 108 disabilities (preservation uses) Partners N/A worldwide US Partners World with Partners permission worldwide if no restrictions Not available Not Partners available US and worldwide, where applicable N/A Partners US and worldwide, where applicable

Backend

About

Partnership • Arizona State University • Boston University • Baylor University • California Digital Library • Columbia University • Cornell University • Dartmouth College • Duke University • Emory University • Harvard University Library • Indiana University • Johns Hopkins University • Library of Congress • Massachusetts Institute of Technology • Michigan State University • New York Public Library • North Carolina Central University • North Carolina State University • Northwestern University • The Ohio State University • The Pennsylvania State University • Princeton University • Purdue University • Stanford University • Texas A&M University • Universidad Complutense de Madrid • University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz) • The University of Chicago • University of Illinois at Chicago • The University of Iowa • University of Maryland • University of Michigan • University of Minnesota • The University of North Carolina at Chapel Hill • University of Pennsylvania • University of Pittsburgh • University of Utah • University of Virginia • University of Washington • University of Wisconsin-Madison • Utah State University • Yale University Library

Digital Repository • Launched 2008 • Initial focus on digitized book and journal content • “Light” archive – As accessible as possible within the bounds of law

The Name • The meaning behind the name – Hathi (hah-tee)--Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy

Mission • To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge

Goals • Comprehensive collection • Preservation…with Access • Shared strategies – Collection management, development – Preservation – Copyright – Efficient user services • Openness

Governance

Governance • Budget/Finances • Decision-making • Strategic Advisory Board • Executive Committee • HathiTrust • Guidance on Policy, Planning

Executive Committee • Paul Courant, University Librarian and Dean of Libraries, UM • Laine Farley, Executive Director, CDL • John King, Vice Provost for Academic Information, UM • Paula Kaufman, University Librarian and Dean of Libraries, UI • Brian Schottlaender, University Librarian, UCSD • Ed Van Gemert, Deputy Director of Libraries, UW-Madison (ex officio) • Brenda Johnson, Dean of Libraries, IU • Brad Wheeler, Chief Information Officer, IU • John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board • Ed Van Gemert (Chair), Deputy Director of Libraries, University of Wisconsin-Madison • John Butler, AUL for Information Technology, University of Minnesota • Patricia Cruse, Director, Preservation, CDL • Todd Grappone, AUL for Digital Initiatives & IT, UCLA • Julia Kochi, Director, Digital Library and Collections, UC San Francisco • Sarah Pritchard, University Librarian, Northwestern University • Paul Soderdahl, Director, LIT, University of Iowa • John Wilkin, Executive Director, HathiTrust (ex officio) • Robert Wolven, Columbia University

Constitutional Convention • October 2011 • Delegates from each institution and consortium – Carry a certain number of votes, determined according to a formula approved by the Executive Committee • 3-year review • Proposals – Print management – Ballot proposals

Content

What is in HathiTrust? • 8,825,372 Total volumes • 2,407,570 Public Domain • 4,819,000 Book titles • 214,719 Serial titles * As of June 14, 2011

Content Sources * As of June 13, 2011

Content Distribution * As of June 13, 2011

Dates * As of June 13, 2011

Breakdown of HathiTrust book corpus by publication date • Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011

Breakdown of HathiTrust book corpus by publication date

Language Distribution (1) The top 10 languages make up ~86% of all content * As of June 13, 2011

Language Distribution (2) The next 40 languages make up ~13% of total * As of June 13, 2011

Content over time [chart: share of content (0% to 100%) by contributing institution, by month] • Michigan • California • Wisconsin • Cornell • NYPL • Princeton • Indiana • Minnesota • Harvard • LoC • Columbia • Chicago • Madrid * As of June 13, 2011

Content Growth

Services

Services (1) • Ingest – Book and Journal content • Google • Internet Archive • In-house, other vendor digitization – Images, Audio, Born Digital (coming soon…) • Two parts – Content – Bibliographic metadata

Services (2) • Long-term preservation – Bit-level, migration – Standard and open formats (ITU G4 TIFF, JPEG 2000, JPG, Unicode) – Validation, integrity, redundancy – OAIS • How reliable is it? – DRAMBORA, TRAC

Technology - OAIS • MARC record extensions (Aleph) • Rights DB • GROOVE (JHOVE) • Page Turner • HathiTrust API • OAI • GeoIP DB • CNRI Handles • [Solr] • Google • Internet Archive • In-house Conversion • GRIN • Internal Data Loading • METS/PREMIS object • TIFF G4/JPEG 2000 • OCR • MD5 checksums • Isilon Site Replication • TSM • MD5 checksum validation • METS object • PNG • OCR • PDF
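The "MD5 checksums" and "MD5 checksum validation" boxes in the diagram can be illustrated with a minimal Python sketch. It assumes a hypothetical manifest that maps file names to expected MD5 digests; the slide does not describe how checksums are actually stored in the METS/PREMIS object, so the function names and the manifest layout here are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(directory, manifest):
    """Compare stored checksums against freshly computed ones.

    `manifest` maps file names to expected MD5 hex digests (a hypothetical
    layout, not the repository's real METS structure). Returns the sorted
    names of files that are missing or whose digest does not match.
    """
    failures = []
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != expected.lower():
            failures.append(name)
    return sorted(failures)
```

An empty result from `validate_manifest` means every listed file is present and unchanged; any returned name would be a candidate for restoring from a replica.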

Quality • Partner Digitization • Google Digitization • Quality work / Volume certification • feedback@issues.hathitrust.org

Services (3) • Preservation…with Access – As part of preservation, service to partners, and as public good – Discovery • Bibliographic (temporary catalog, OCLC/HathiTrust catalog) • Full-text – Reading • Interface optimized for users with print disabilities – Collections

Services (4) • Rights Management – Rights Database – Copyright review • IMLS Grant awarded to University of Michigan in 2008 to determine copyright status of books published in the US between 1923 and 1963 • 18 staff members, 4 institutions – Indiana University – University of Michigan – University of Minnesota – University of Wisconsin • 140,000 reviewed through CRMS • 77,500 (54%) in public domain

Copyright status of books published pre-1923 and US works published 1923-1963

Services (5) • Data Availability – Tab-delimited inventory files – Bibliographic API – Data API – OAI feed of public domain – SFX target – Summon
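As a rough illustration of working with a tab-delimited inventory file, the sketch below counts rows per rights code using Python's standard `csv` module. The column names (`volume_id`, `rights`, `title`), the sample identifiers, and the rights codes in the sample are assumptions made for the example; the slide does not specify the file layout.

```python
import csv
import io
from collections import Counter


def count_by_rights(tsv_text, columns=("volume_id", "rights", "title")):
    """Count rows per rights code in headerless tab-delimited text.

    The column layout is hypothetical; pass the real column names once
    the actual file format is known.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=columns,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    return Counter(row["rights"] for row in reader)


# Made-up sample rows, for illustration only.
sample = (
    "vol.001\tpd\tExample title A\n"
    "vol.002\tic\tExample title B\n"
    "vol.003\tpd\tExample title C\n"
)
print(count_by_rights(sample))
```

Reading with `quoting=csv.QUOTE_NONE` is a deliberate choice here: bibliographic titles often contain stray quotation marks, and a plain tab split avoids misparsing them.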

Some Examples of Use • Catalogs – UM loaded every record – Chicago links to public domain volumes owned in print – TROVE harvesting through OAI – OCLC loads records into WorldCat • Link Resolvers – UC created SFX target • Vendors – H. W. Wilson database links to public domain volumes – ProQuest full-text index via Summon

Services (6) • Collaborative Development Environment – Active repository development • Support for Computational Research – Datasets • 120,000-volume set • Google-digitized public domain – Protocol-based access – Research Center

How does work get done? • Collective work – e.g., working groups – Perform the work of the partnership – Now 40+ people across partner institutions • Distributed work – Driven by needs of institutions – Able to leverage across the partnership – Projects, e.g., grant work, ingest specifications, page-turner, bibliographic data management • Leverage expertise across institutions

Working Groups (1) • Operational focus – Appointed by Executive Director in coordination with Executive Committee – Current • Usability • User Support • Communications – Previous • Development Environment • Storage • Research Center

Working Groups (2) • Planning or Exploratory focus – Appointed by Strategic Advisory Board – Recommendations reviewed by SAB and XCom; may call for subsequent implementation • • Collections Committee Surrogates Quality, Ingest, and Error rate Discovery

How is work prioritized? • Initial functional objectives • Collective processes – Working groups and committees

HathiTrust Functional Framework: Governance Budget, Finances Decision-making Enterprise Management Repository Administration Communication and Coordination with partner institutions Hardware configuration and maintenance Data management (content storage, backup, integrity checks, deletion) Project management Policy Planning Web and application server configuration and maintenance Security Hardware selection and replacement Content and Metadata specifications Permissions Rights Management Bibliographic Data Management Copyright determination Entity description (record-level) Copyright review Object identification (item-level) Copyright information management (database) Data availability Collection Development Digital • Expansion beyond books and journals (born-digital, images and maps, audio) • Selection of content (for non-Google volume ingest and pilot projects) Print • Cloud Library (effect of digital on print) Rightsholder permissions Disaster Recovery Logging e-Commerce Content Ingest Print on Demand Financial contributions of partners Content Access Processes for ensuring content integrity Quality Assurance User Services Transformation PageTurner Quality Review Usability Validation Collection Builder Content Certification User support (helpdesk) Large-scale Search Research Center Bibliographic Catalog APIs Outreach Project website Monthly newsletter Papers and presentations Communication with potential partners Surveys, general inquiries Repository evaluation and audit (e.g., DRAMBORA, TRAC) Legal Risk management (use of materials) Partner agreements Advocacy

Costs

Costs • Base funding from partner institutions • Basic infrastructure costs • Commitments in 5-year periods

How much does it cost? (1) Cost

How much does it cost? (2) • $0.149/volume/year for Google-digitized • $0.489/volume/year for IA-digitized • $0.154/volume/year for all content • $3.40 per GB
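To make the per-volume rates concrete, here is a small worked calculation in Python. It uses only the two source-specific rates quoted above; the partner's volume counts are hypothetical, and the functions are an arithmetic illustration rather than the partnership's actual fee formula.

```python
from decimal import Decimal

# Per-volume annual rates quoted on the slide (USD per volume per year).
RATE_GOOGLE = Decimal("0.149")
RATE_IA = Decimal("0.489")


def annual_storage_cost(google_volumes, ia_volumes):
    """Estimate yearly cost in USD for a hypothetical mix of volumes."""
    return RATE_GOOGLE * google_volumes + RATE_IA * ia_volumes


def blended_rate(google_volumes, ia_volumes):
    """Average cost per volume per year for that mix, rounded to 3 places."""
    total = google_volumes + ia_volumes
    cost = annual_storage_cost(google_volumes, ia_volumes)
    return (cost / total).quantize(Decimal("0.001"))


# Hypothetical partner: 100,000 Google-digitized and 1,000 IA-digitized volumes.
print(annual_storage_cost(100_000, 1_000))
print(blended_rate(100_000, 1_000))
```

For this made-up mix the estimate is $15,389 per year, or roughly $0.152 per volume. A blended rate close to the Google rate, like the slide's $0.154 for all content, is what one would expect when most volumes are Google-digitized.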

Cost Model 1. Based on contributed content 2. Based on overlap with print collections – Public Domain / In-copyright – Depends on Print Holdings Database • Costs • Lawful uses of materials • Complete picture • Volumes institutions own or have owned – OCLC number; Bib record ID; Condition; Holding Status

Shared Strategies/Benefits

How Different from Google? • Preservation • Content • Collective work • Uses of materials • Own trajectory • Partnership – Not just about digital content or repository – Address challenges – Fulfill mission – Provide services for our communities

A global change in the library environment • Academic print book collection already substantially duplicated in mass digitized book corpus • June 2009 Median duplication: 19% • June 2010 Median duplication: 31% [chart: % of Titles in Local Collection vs. Rank in 2008 ARL Investment Index]

Digitized Books in Shared Repositories • ~3.5M titles • ~2.5M • ~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories [chart: Unique Titles over time; series: Mass digitized books in Hathi digital repository, Mass digitized books in shared print repositories]

Shared Strategies • Copyright • Preservation – Digital and print • Discovery / Use • Bibliographic Indeterminacy • Consolidate development talent • Collective Attention to solving shared problems

How to find out more • Website “About” section – http://www.hathitrust.org/about • Twitter – http://twitter.com/hathitrust • Monthly newsletter – http://www.hathitrust.org/updates_rss (RSS) • Contact us – feedback@issues.hathitrust.org – jjyork@umich.edu

Thank you very much!
