
Unified Digital Format Registry: a semantic registry for digital preservation
Sustaining the Unified Digital Format Registry (UDFR)
Digital Preservation 2012, Library of Congress, July 24-25, 2012
Stephen Abrams, UC Curation Center, California Digital Library
http://www.cdlib.org/uc3

Agenda
  • Background
  • Current status
  • Demonstration
  • Next steps

Why formats?
  • “Format” is the dividing line between bits and information
    ffd8 ffe0 0010 4a46 4946 0001 0201 0083 0000 ffed 0fb0 5068 6f74 6f73 686f 7020 332e 3000 3842 494d 03e9 0a50 7269 6e74 2049 6e66 6f00 0000 0078 0000 0048 0000 02f4 0240 ffee 0306 0252 0347 0528 03fc 0002 0000 0048 0000 02d8 0228 0001 0000 0064 0000 0001 0003 0...
    SOI APP0 APP13 APP2 DQT SOF0 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...
    JFIF 1.2   IPTC   ICC   183 x 512
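
The step from raw bytes to named JPEG segments can be sketched in a few lines. This is a minimal illustration, not part of UDFR: the function names and the sample byte string are assumptions made for the example. The sample is a synthetic, well-formed SOI + APP0 + EOI stream rather than the abbreviated hex shown on the slide.

```python
import struct

# Names for a few JPEG marker codes (the byte that follows 0xFF).
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT",
    0xDA: "SOS", 0xD9: "EOI",
}


def list_segments(data):
    """Return (marker name, payload) pairs, stopping at SOS or EOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    segments = [("SOI", b"")]
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, "0x%02X" % code)
        if code == 0xD9:  # EOI carries no length field
            segments.append((name, b""))
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((name, data[pos + 4:pos + 2 + length]))
        if code == 0xDA:  # entropy-coded data follows SOS; stop here
            break
        pos += 2 + length
    return segments


def jfif_version(app0_payload):
    """Read the (major, minor) version from a JFIF APP0 payload."""
    if app0_payload[:5] != b"JFIF\x00":
        raise ValueError("APP0 segment is not JFIF")
    return app0_payload[5], app0_payload[6]


# Synthetic sample: SOI, a 16-byte JFIF 1.2 APP0 segment, EOI.
sample = bytes.fromhex(
    "ffd8" "ffe0" "0010" "4a46494600" "0102" "01" "0083" "0083" "0000" "ffd9"
)
segments = list_segments(sample)
names = [name for name, _ in segments]
version = jfif_version(segments[1][1])
```

Without knowledge of the format, `sample` is 22 opaque bytes; with it, the same bytes read as a JFIF 1.2 header.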

Why formats?
  • There are many necessary preservation activities that can be usefully performed on bits qua bits
  • But to preserve information you must act on formatted bits and know what those formats represent
      Preservation of content syntax and semantics (both the structure and meaning of the digital representation)

Unified Digital Format Registry
  • “A reliable, publicly accessible, and sustainable knowledge base of file format representation information for use by the digital preservation community”
    http://udfr.org/
    udfr-l@listserv.ucop.edu
      “Unification” of the function and holdings of PRONOM and GDFR, available July 3, 2012
        http://www.nationalarchives.gov.uk/PRONOM
        http://gdfr.info/
      Funded by the Library of Congress
      Open source platform / GPL
      Semantic wiki

A bit of history …
  • PRONOM – National Archives [UK], 2002
    http://www.nationalarchives.gov.uk/PRONOM
      “ready access to reliable technical information about the nature of electronic records”
  • JHOVE – Harvard, 2003
    http://hul.harvard.edu/jhove
      “digital object validation and characterization”
  • Global Digital Format Registry (GDFR) – Harvard/OCLC, 2006
    http://gdfr.info/
      “a distributed and replicated registry of format information populated and vetted by experts and enthusiasts world-wide”

A bit of history …
  • Proto-UDFR – Ad hoc stakeholder community, 2009
      Resolve PRONOM IPR issues and develop a community-supported open source solution
      Advance beyond legacy RDBMS (PRONOM) and XMLDB (GDFR) technology
  • UDFR – CDL, January 2011
    http://udfr.org/
    udfr-l@listserv.ucop.edu
      “a semantic registry for digital preservation”
      LC/NDIIPP funded
      Stakeholder meeting, April 2011
      Beta release, November 2011
      Production release, July 2012

Representation information
  • What you need to know about something in order to exploit that thing meaningfully [OAIS/ISO 14721]
  • Information that lets you answer important preservation questions (directly or indirectly)
      What format is it?
      What are its significant properties?
      Is it valid?
      Is it at risk?
      How can I render/play/read it?
      What can it be transformed into?
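
The first question, “What format is it?”, is commonly answered by comparing a file against internal signatures (byte patterns in the content) and external signatures (such as the filename extension). The sketch below assumes a tiny hand-written signature table; the table and the `identify` function are hypothetical and are not drawn from UDFR or PRONOM data.

```python
from pathlib import PurePath

# Illustrative signature table (an assumption for this example only).
# Each entry: (format name, magic bytes at offset 0, filename extensions).
SIGNATURES = [
    ("PNG", b"\x89PNG\r\n\x1a\n", {".png"}),
    ("JPEG", b"\xff\xd8\xff", {".jpg", ".jpeg"}),
    ("PDF", b"%PDF-", {".pdf"}),
    ("GIF", b"GIF8", {".gif"}),
]


def identify(filename, head):
    """Identify a file by internal signature (content) and external
    signature (extension), and report whether the two agree."""
    suffix = PurePath(filename).suffix.lower()
    internal = next(
        (name for name, magic, _ in SIGNATURES if head.startswith(magic)), None
    )
    external = next(
        (name for name, _, exts in SIGNATURES if suffix in exts), None
    )
    return {
        "internal": internal,
        "external": external,
        "consistent": internal is not None and internal == external,
    }


good = identify("scan.jpg", b"\xff\xd8\xff\xe0\x00\x10JFIF\x00")
mislabeled = identify("report.pdf", b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
```

A mismatch between the two results, as in `mislabeled`, is itself useful preservation information: the content should be trusted over the name.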

Why semantic?
  • The semantic web lets anyone say anything about anything
      Understandable to both people and machines
  • The web is (or soon will be) a semantic web
      Linked Data interoperability
      http://linkeddata.org/

Why semantic?
  • Triples all the way down…
      Data expressed as triples
      Data definition (i.e., ontology) expressed as triples
      Ontology definition expressed as triples
      …
  • Facilitates self-configuration and easy extension
      However, the form and function of a semantic wiki may be unfamiliar
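
“Triples all the way down” can be illustrated with plain tuples. In this sketch the two `http://udfr.org/` namespace URIs are the ones given on the ontology slide, but the local names (`example-format`, `FileFormat`) are hypothetical; the RDF, RDFS and OWL terms are the standard W3C ones. The same list holds instance data, the ontology that describes it, and the definition of the ontology language itself.

```python
UDFRS = "http://udfr.org/onto#"   # schema namespace (from the slides)
UDFR = "http://udfr.org/udfr/"    # instance namespace (from the slides)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Local names below are hypothetical, chosen only for illustration.
TRIPLES = [
    # Data
    (UDFR + "example-format", RDF_TYPE, UDFRS + "FileFormat"),
    (UDFR + "example-format", RDFS_LABEL, "Example format"),
    # Data definition (ontology)
    (UDFRS + "FileFormat", RDF_TYPE, OWL_CLASS),
    # Ontology definition
    (OWL_CLASS, RDF_TYPE, RDFS_CLASS),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def type_chain(triples, node):
    """Follow rdf:type links upward from node, one level at a time."""
    chain, seen = [], {node}
    while True:
        types = [o for _, _, o in match(triples, s=node, p=RDF_TYPE)]
        if not types or types[0] in seen:
            return chain
        node = types[0]
        seen.add(node)
        chain.append(node)


chain = type_chain(TRIPLES, UDFR + "example-format")
```

Because every level is stored the same way, one query mechanism (`match` here, SPARQL in a real triple store) works on data and schema alike, which is what makes extension cheap.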

Provenance
  • Open contribution
      Self-registration, but no further barriers
      Complete change history at the assertion level
        ● Who made the assertion, and when
        ● Confidence based on individual/institutional reputation
      Imprimatur of technically knowledgeable reviewers
      “Trust, but verify”
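
One way to picture change history at the assertion level is an append-only log in which each entry records the statement, who made it, and when. The classes and field names below are assumptions made for illustration; they do not describe how OntoWiki or UDFR store history internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    value: str
    contributor: str                   # who made the assertion
    asserted_at: datetime              # and when
    reviewed_by: Optional[str] = None  # reviewer imprimatur, if any


class ChangeLog:
    """Append-only history: earlier assertions are kept, never overwritten."""

    def __init__(self):
        self._entries = []

    def record(self, assertion):
        self._entries.append(assertion)

    def history(self, subject, predicate):
        return [
            a for a in self._entries
            if a.subject == subject and a.predicate == predicate
        ]

    def current(self, subject, predicate):
        entries = self.history(subject, predicate)
        return entries[-1] if entries else None


log = ChangeLog()
log.record(Assertion("example-format", "label", "Exmaple format", "alice",
                     datetime(2012, 7, 3, tzinfo=timezone.utc)))
log.record(Assertion("example-format", "label", "Example format", "bob",
                     datetime(2012, 7, 24, tzinfo=timezone.utc)))
latest = log.current("example-format", "label")
```

A consumer can then weigh `latest` by its contributor and by whether `reviewed_by` is set, which is the “trust, but verify” model in miniature.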

Roles
  • Consumer
      Anonymous read
  • Contributor
      Read + write
      Self-registration
  • Reviewer
      Read + write + review
      Administratively granted
  • Administrator
      Read + write + review + administer
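
The four roles are cumulative, so they map naturally onto a set of permission flags. The following is only a sketch of that idea using Python's `enum.Flag`; the names `Permission`, `ROLES` and `can` are hypothetical and say nothing about how the registry enforces access.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    REVIEW = auto()
    ADMINISTER = auto()


# Each role adds one permission to the role before it.
ROLES = {
    "consumer": Permission.READ,
    "contributor": Permission.READ | Permission.WRITE,
    "reviewer": Permission.READ | Permission.WRITE | Permission.REVIEW,
    "administrator": (Permission.READ | Permission.WRITE
                      | Permission.REVIEW | Permission.ADMINISTER),
}


def can(role, permission):
    """True if the named role includes the given permission."""
    return permission in ROLES[role]
```

For example, `can("contributor", Permission.WRITE)` is true while `can("contributor", Permission.REVIEW)` is not.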

Technology stack
  • Apache httpd          http://httpd.apache.org/
  • HTTP / SPARQL         http://www.w3.org/TR/rdf-sparql-query
  • RDFauthor/JavaScript  http://aksw.org/Projects/RDFauthor
  • OntoWiki              http://ontowiki.net/
  • Zend framework        http://framework.zend.com/
  • PHP                   http://www.php.net/
  • Noid                  http://wiki.ucop.edu/display/Curation/NOID
  • Erfurt API            http://aksw.org/Projects/Erfurt
  • Virtuoso quadstore    http://virtuoso.openlinksw.com/
  • RDF                   http://www.w3.org/RDF
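
Since the stack exposes SPARQL over HTTP, a client query amounts to an HTTP GET with the query text in a `query` parameter, as the SPARQL protocol specifies. The sketch below only builds such a request; it does not send it. The endpoint path `http://udfr.org/sparql` is an assumption, since the slides do not give the endpoint URL, and the query deliberately uses no UDFR vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint URL; the real path is not stated in the slides.
ENDPOINT = "http://udfr.org/sparql"

# A generic query that works against any triple store.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
sent_query = parse_qs(urlsplit(request.full_url).query)["query"][0]
```

Passing `request` to `urllib.request.urlopen` would perform the query against a live endpoint.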

Code repository
  • All code (and ontologies) managed in public repositories at GitHub
    https://github.com/UDFR
      OntoWiki   https://github.com/UDFR/OntoWiki
                 Forked from https://github.com/AKSW/OntoWiki
      Erfurt     https://github.com/UDFR/Erfurt
                 Forked from https://github.com/AKSW/Erfurt
      RDFauthor  https://github.com/UDFR/RDFauthor
                 Forked from https://github.com/AKSW/RDFauthor
  • All CDL development available under GPL license

UDFR schema
[Diagram: UDFR schema]
  Node labels: Abstract Base …, Controlled Vocabulary, Process, IPR, Software, Agent, Hardware, Abstract Product, Grammar, Media, Character Encoding, Holding, Abstract Format, Document, File Format, Abstract Signature, Digest, Assessment, Compression Algorithm, File, External Signature, Internal Signature
  Edge labels: holder, embodies, owner, ipr, dependency, creator, assessment, grammar, reference, file, specification, signature, maintainer, input / output, product, digest

Code repository
  • All ontologies (and code) managed in public repositories at GitHub
    https://github.com/UDFR
      Ontologies  https://github.com/UDFR-Models
        ● udfrs    [onto.owl]     UDFR schema         http://udfr.org/onto#
        ● udfr     [udfr.owl]     UDFR instance data  http://udfr.org/udfr/
        ● profile  [profile.owl]  UDFR user profiles  http://udfr.org/profile/
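
The three prefixes above are shorthand for full namespace URIs. A small sketch of how prefixed names expand to URIs and back follows; the prefix table is taken from the slide, while the functions and the local names used in the example (`FileFormat`, `example-format`) are hypothetical.

```python
# Prefix-to-namespace table as listed on the slide.
PREFIXES = {
    "udfrs": "http://udfr.org/onto#",
    "udfr": "http://udfr.org/udfr/",
    "profile": "http://udfr.org/profile/",
}


def expand(curie):
    """Expand a prefixed name such as 'udfrs:FileFormat' to a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError("unknown prefix in %r" % curie)
    return PREFIXES[prefix] + local


def compact(uri):
    """Shorten a full URI to a prefixed name, preferring the longest
    matching namespace; return the URI unchanged if none matches."""
    best = max(
        (p for p, ns in PREFIXES.items() if uri.startswith(ns)),
        key=lambda p: len(PREFIXES[p]),
        default=None,
    )
    return uri if best is None else best + ":" + uri[len(PREFIXES[best]):]


# Local names here are hypothetical examples.
schema_uri = expand("udfrs:FileFormat")
instance_name = compact("http://udfr.org/udfr/example-format")
```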

Initial data loads
  • PRONOM as of 2012-02-21
    http://www.nationalarchives.gov.uk/PRONOM
        846  file formats
         28  character encodings
         17  compression algorithms
      1,237  identifiers
      1,006  external signatures
        494  internal signatures
         71  MIME types (not in Appspot)
        156  agents
        268  software packages
      2,080  software processes
         23  IPR statements
        217  relationships
    Slide annotations: deduplicated, June 2012
    Unlabeled figures on the slide: 54 8 7, 8 16; 8, 274
  • Special thanks to TNA
      ► Spencer Ross
      ► Tracey Powell
      ► Tim Gollins

Initial data loads
  • MIME types from Appspot as of 2012-02-22
    http://mediatypes.appspot.com/
      “Routinely scraped from IANA using code in the mediatypes Google Code project”
        809  application/*
        125  audio/*
         39  image/*
         19  message/*
         14  model/*
         14  multipart/*
         51  text/*
         56  video/*
      1,127  total
      Plus 71 defined by PRONOM
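
The per-type counts on this slide sum to the 1,127 figure, which can be checked directly. The combined figure at the end assumes the 71 PRONOM-defined MIME types are disjoint from the Appspot set, as the “not in Appspot” note on the PRONOM slide suggests; the variable names are illustrative.

```python
# Counts per top-level media type, as listed on the slide.
APPSPOT_COUNTS = {
    "application": 809,
    "audio": 125,
    "image": 39,
    "message": 19,
    "model": 14,
    "multipart": 14,
    "text": 51,
    "video": 56,
}
PRONOM_ONLY = 71  # MIME types defined by PRONOM and not in Appspot

appspot_total = sum(APPSPOT_COUNTS.values())
# Assumes the two sets do not overlap.
combined_total = appspot_total + PRONOM_ONLY

print(appspot_total, combined_total)
```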

Data licensing
  • PRONOM data contributed under UK Open Government Licence (OGL)
    http://www.nationalarchives.gov.uk/doc/open-government-licence/
  • Other submissions contributed under Creative Commons Attribution license (CC-BY)
    http://creativecommons.org/licenses/by/3.0/

UI layout
http://udfr.org/
[Screenshot with labeled panes]
  Workspace pane
    • Function dependent
  OntoWiki pane
    • Register/login/logout
    • SPARQL query form
    • Documentation
    • Session reset
  Knowledge base pane
  Ontology browser pane
  Register/login pane

Contextual menus
http://udfr.org/
[Screenshot with a contextual menu labeled]

User’s Guide
http://udfr.org/docs/UDFR-Users-Guide-v1.0.0.pdf

Demonstration
http://udfr.org/

Next steps
  • Operational control
      CDL will continue to host the UDFR for one year while a more permanent hosting strategy can be identified
  • Administrative control
      The “admin” role, necessary for adding user privileges, modifying the ontologies, and bulk imports, is held by CDL staff
      How can this responsibility be shared?
  • Technical control
      How to share “committer” responsibility for the codebase?
      How to coordinate additional development activity?

Next steps
  • Technical development
      Synchronization with PRONOM and other external sources of bulk imports
      UI enhancements to lower the learning curve
      RESTful API (in addition to SPARQL endpoint)
      Replication to mirror sites
      Others?
  • Bring under the OPF code repository/issue tracking umbrella

Next steps
  • Import additional data sources
      Library of Congress Sustainability of Digital Formats
        http://www.digitalpreservation.gov/formats/
      IT History Society hardware database
        http://www.ithistory.org/hardware-name.php
      NIST NSRL (National Software Reference Library)
        http://www.nsrl.nist.gov/
      Stanford CPUdb
        http://cpudb.stanford.edu/
      TOTEM (Trustworthy Online Technical Environment Metadata) database
        http://keep-totem.co.uk/
      Other candidates?
      How important is merging?

Next steps
  • Encourage adoption and use
      Identify an evangelist
      Marketing/outreach
      Cf. Chris Rusbridge’s blog post asking “What was the problem” that UDFR was trying to solve
        http://unsustainableideas.wordpress.com/2012/07/04/the-solution-is-42-what-was-the-problem/
  • Enable the reviewer function
      Who will review?
      What are the criteria?
  • Sustainable community governance
      Who will make the decisions?

Questions and discussion

For more information
  • UDFR
      http://udfr.org/
      http://github.com/UDFR
      udfr-l@listserv.ucop.edu (to subscribe, mail “SUB UDFR-L <name>” to listserv@ucop.edu)
  • OntoWiki   http://ontowiki.net/Projects/OntoWiki
  • Erfurt     http://aksw.org/Projects/Erfurt
  • RDFauthor  http://aksw.org/Projects/RDFauthor
  • Zend       http://framework.zend.com/
  • Virtuoso   http://www.openlinksw.com/dataspace/dav/wiki/Main/VOSRDFWP
  • AKSW, Universität Leipzig  http://aksw.org/
      Philipp Frischmuth, Sebastian Tramp, Norman Heino
  • National Archives, UK  http://www.nationalarchives.gov.uk/
      Tim Gollins, Tracey Powell, Spencer Ross
  • Library of Congress  http://www.digitalpreservation.gov
      Martha Anderson, Leslie Johnston
  • UC Curation Center  http://www.cdlib.org/uc3  uc3@ucop.edu
      Stephen Abrams, Lisa Dawn Colvin, Patricia Cruse, John Kunze, Margaret Low, Mark Reyes, Abhishek Salve, Marisa Strong