Digital Curation Centre: a centre of expertise in data curation and preservation
Long term preservation: an overview
Michael Day
Digital Curation Centre
UKOLN, University of Bath
http://www.ukoln.ac.uk/
Joint Workshop on Electronic Publishing, Lund, Sweden, 15 April 2005

Session overview
– Quick introduction
– A fifteen year view
– Overview of current issues

What is digital preservation?
– Dealing with the potential technical problems that impede continued access to digital resources (of all types)
– No longer possible to place physical artefact on a shelf and ignore for 100+ years
– Not just a technical problem:
  • "... The planning, resource allocation, and application of preservation methods and technologies to ensure that digital information of continuing value remains accessible and usable" - Margaret Hedstrom (1998)

What is digital curation?
– New(ish) term, from science data world (e.g. bioinformatics)
– Reflects those extra things that need to be done to facilitate access and reuse
– "The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and reuse" - Philip Lord, et al. (2004)

Why is it a problem? (1)
– An increasing flood of 'born-digital' data
  • The World Wide Web
    – Comprises billions of pages + "deep Web"
    – Internet Archive = >1 petabyte, and growing @ 20 TB per month (http://www.archive.org/)
  • Data deluge in science and engineering
    – Petabytes generated by high throughput instruments, streamed from sensors and satellites, etc.
  • 5 exabytes of new information created in 2002:
    – http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
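
To give a feel for the Internet Archive figures quoted above, here is a rough back-of-envelope calculation. It is only a sketch: it assumes decimal units (1 PB = 1000 TB), treats ">1 petabyte" as exactly 1 PB, and assumes the 20 TB per month rate stays constant, none of which the slide states.

```python
# Assumptions (not from the slides): decimal units and constant, linear growth.
TB_PER_PB = 1000


def months_to_reach(target_tb: float, current_tb: float, growth_tb_per_month: float) -> float:
    """Months needed to grow linearly from current_tb to target_tb."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (target_tb - current_tb) / growth_tb_per_month)


current_tb = 1 * TB_PER_PB   # "> 1 petabyte" taken as exactly 1 PB
growth = 20                  # TB per month, as quoted on the slide

months = months_to_reach(2 * TB_PER_PB, current_tb, growth)
print(f"Months to add a second petabyte: {months:.0f}")
```

Under these assumptions the collection would take a little over four years to double; any acceleration in growth shortens that.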

Why is it a problem? (2)
– Need for (open) access to this data
  • Results in added scientific value
  • New analytic techniques
  • 2004 - OECD member states endorsed the principle that publicly funded research data should be openly available to the maximum extent possible

Technical problems
– Media longevity
  • Estimated lifetimes are short compared to paper or good quality microform
  • Solutions: more durable media, 'refreshing' regimes
– Hardware and software obsolescence
  • Relatively short obsolescence cycles for hardware, peripherals, media, and software
  • For example, BBC Domesday Project (1986) hybrid videodisc
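
A 'refreshing' regime amounts to copying the bit stream onto fresh media and confirming that nothing changed on the way. The sketch below is one possible way to do that with a checksum comparison; the function names `sha256_of` and `refresh`, the file names, and the choice of SHA-256 are illustrative assumptions, not something the slides prescribe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and verify the copy bit-for-bit.

    Returns the checksum on success; raises IOError on a mismatch.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"fixity check failed for {destination}")
    return after


# Demonstration with temporary directories standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old_copy = Path(tmp) / "old_media" / "report.txt"
    old_copy.parent.mkdir()
    old_copy.write_bytes(b"machine-readable record")
    new_copy = Path(tmp) / "new_media" / "report.txt"
    checksum = refresh(old_copy, new_copy)
    copied_bytes = new_copy.read_bytes()

print(checksum)
```

Refreshing of this kind only guards against media decay. It does nothing about hardware and software obsolescence, which is why it cannot be the whole strategy.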

Preservation strategies (1)
– Technology preservation
  • The preservation of an information object together with all of the hardware and software needed to interpret it
    – But will lead to museums of "ageing and incompatible computer hardware" - Mary Feeney (1999)
    – Has key role in the rescue of digital objects (digital archaeology)
– Emulation
  • The preservation of original application software, which is then run on emulators that mimic the behaviour of obsolete hardware and operating systems
    – Development of 'virtual machines' that will be migrated to work on different platforms (Jeff Rothenberg, 1998)
    – Universal Virtual Computer (UVC) concept

Preservation strategies (2)
– Migration
  – Managed transformations
  – The periodic transfer of digital information from one hardware and software configuration to another, or from one generation of computer technology to a subsequent one - CPA/RLG report (1996)
  – Widely used strategy, e.g. on ingest into a repository
  – Problems with preserving the integrity of an object
– Encapsulation
  – Self-describing objects, e.g. information package in OAIS model, METS, Buckets, Universal Preservation Format
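
The integrity problem with migration can be shown with a very small case: re-encoding a text file from an MS-DOS era code page to UTF-8. This is a sketch under stated assumptions. The function `migrate_text`, the record fields, and the choice of code page 437 are all hypothetical, and real migrations (word-processor formats, databases) are far harder to verify.

```python
import hashlib


def migrate_text(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> dict:
    """Re-encode text and keep evidence about the transformation."""
    text = data.decode(source_encoding)      # strict: raises on undecodable bytes
    migrated = text.encode(target_encoding)  # strict: raises if a character cannot be represented
    return {
        "migrated": migrated,
        "source_encoding": source_encoding,
        "target_encoding": target_encoding,
        "source_sha256": hashlib.sha256(data).hexdigest(),
        "target_sha256": hashlib.sha256(migrated).hexdigest(),
        "characters": len(text),
    }


def same_text(original: bytes, source_encoding: str, migrated: bytes, target_encoding: str) -> bool:
    """Independent check: do both byte streams decode to identical characters?"""
    return original.decode(source_encoding) == migrated.decode(target_encoding)


# "Cafe" with an acute accent and a one-quarter sign, written with escapes
# so the example does not depend on the encoding of this source file.
sample = "Caf\u00e9, 5\u00bc inch floppy"
legacy = sample.encode("cp437")  # stand-in for a legacy file
result = migrate_text(legacy, "cp437")
verified = same_text(legacy, "cp437", result["migrated"], "utf-8")
```

The two checksums differ even though the text is unchanged, so bit-level fixity cannot confirm a successful migration. Some agreed notion of which properties must survive (here, the character sequence) has to be checked instead.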

Preservation strategies (3)
– Metadata and documentation
  – All digital preservation strategies depend - to some extent - on the creation, capture and maintenance of metadata
  – "Preserving the right metadata is key to preserving digital objects" (ERPANET Briefing Paper, 2003)
  – The various types of data that will allow the re-creation and interpretation of the structure and content of digital data over time (Ludäscher, Marciano & Moore, 2001)
  – Reference Model for an Open Archival Information System (OAIS) - ISO 14721:2003
  – PREMIS working group
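
As an illustration of the kind of metadata this slide has in mind, the sketch below records an identifier, a format, fixity information, and a log of preservation events for one object. The class `PreservationRecord` and all field names are invented for this example. They are not taken from OAIS or from the PREMIS working group's output, which define their own vocabularies.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class PreservationRecord:
    """Hypothetical minimal metadata kept alongside a digital object."""
    identifier: str
    format_name: str
    sha256: str
    size_bytes: int
    events: list = field(default_factory=list)

    def log_event(self, event_type: str, outcome: str) -> None:
        """Append a timestamped event to the object's history."""
        self.events.append({
            "type": event_type,
            "outcome": outcome,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })


def describe(identifier: str, format_name: str, content: bytes) -> PreservationRecord:
    """Create a record at ingest, capturing fixity information."""
    record = PreservationRecord(
        identifier, format_name, hashlib.sha256(content).hexdigest(), len(content)
    )
    record.log_event("ingest", "success")
    return record


def fixity_ok(record: PreservationRecord, content: bytes) -> bool:
    """Re-check content against the stored checksum and log the result."""
    ok = (hashlib.sha256(content).hexdigest() == record.sha256
          and len(content) == record.size_bytes)
    record.log_event("fixity check", "success" if ok else "failure")
    return ok


content = b"survey data, wave 1"
record = describe("example:0001", "text/plain", content)
intact = fixity_ok(record, content)
damaged = fixity_ok(record, content + b"?")
serialised = json.dumps(asdict(record), indent=2)  # plain-text form that can travel with the object
```

Serialising the record as plain text alongside the content is one simple reading of the 'self-describing object' idea; a real repository would more likely use an established schema.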

A fifteen year retrospective
– Based on my dissertation:
  • "Preservation problems of electronic text and data" - Loughborough University (1989)
  • Overview of the state of the art in digital preservation in the late 1980s
  • Hardware and software used = IBM PC XT, MS-DOS, 5¼" floppy disks, shareware word processing program (Galaxy)

The 1980s - contexts
– Still faith in the "paperless" future
– Electronic publishing in its infancy
  • Online databases (mainly bibliographic)
  • Viewdata systems (e.g. Minitel, Prestel)
  • Experiments with electronic journals (e.g. BLEND, Project Quartet) and electronic document delivery systems (ADONIS)
  • CD-ROM databases

The 1980s - issues (1)
– Digital preservation issues:
  • Major focus on the longevity of media
    – e.g. BNB Research Fund funded comparison of microfilm, magnetic media, and optical disks for archival storage (1983)
    – Interest in the potential value of new types of optical media, e.g. videodisc, Compact Disc (CD-ROM, CD-R)
    – No promising results from initial research

The 1980s - issues (2)
– Knowledge that media longevity was not the only issue
  • "The problem with machine-readable records is the long term availability of the machines rather than the physical decay of the recording mechanism" - John Mallinson (1986)
  • Brief consideration of COM (microform) for long-term storage

The 1980s - experiences (1)
– National archives:
  • A focus in some countries on machine-readable records from the 1960s
  • The principle that machine-readable records should be treated in the same manner as conventional records was established very early on, e.g. by Meyer Fishbein (1972)
  • Also, there was an early recognition of the importance of documentation and economic factors

The 1980s - experiences (2)
– Data archives:
  • Storage of social science survey data started in the punched-card era (1940s)
  • ESRC Data Archive established 1967
    – Recognised the importance of developing procedures to manage data (e.g. migration on ingest) and of standardised descriptions (metadata)
– National libraries:
  • Were considering legal deposit obligations

The 1980s - summing up
– Some differences with the position today, e.g.:
  – General lack of awareness
  – Focus on media longevity, 'refreshing' strategies
  – Little practical experience (except for data archives)
– Some continuity, e.g. it was recognised:
  – That the obsolescence of hardware (and software environments) was a serious problem
  – That data management strategies and documentation/metadata were important
  – That digital resources were not conceptually different to non-digital ones

The current context (1)
– The World Wide Web
– Changes in scholarly communication, e.g.:
  • Increased use of electronic journals, e-print repositories
  • Changes in scientific practice: data-intensive science, Grid computing, petabyte-scale storage, e-research
  • Current focus on open access
– Similar developments elsewhere, e.g.:
  • Broadcasting, e-commerce, e-government, ...

The current context (2)
– Task Force on Archiving of Digital Information (1996)
  • in UK led to influential research projects like Cedars, eventually to the Digital Preservation Coalition (DPC)
– Major current initiatives:
  • US National Digital Information Infrastructure and Preservation Program (NDIIPP)
  • ERPANET, NESTOR, KB's e-Depot, etc.
  • UK Digital Curation Centre

Digital Curation Centre (1)
– Funded from 2004 for three years by the JISC and the e-Science Core Programme
– Main aim: "continuing improvement in the quality of data curation and digital preservation"
– Will focus on all aspects of the research process, e.g. from data creation to publication and beyond, also on the work of repositories and data archives
– Not itself a digital repository, but offering outreach and practical services to assist those who curate data …

Digital Curation Centre (2)
– Main activities:
  • Advisory services and outreach
  • Development
    – Registries of Representation Information, testing of tools, …
  • Research programme
    – Role of annotation, legal and socioeconomic issues, …
  • Collaborative network of associates
– Partners: Universities of Edinburgh (lead), Glasgow and Bath (UKOLN), CCLRC
– http://www.dcc.ac.uk/

Key developments (1)
– Greater awareness of the issues
  – Digital preservation now beginning to be taken seriously by governments and NGOs (e.g. Unesco Charter on the Preservation of Digital Heritage, World Summit on the Information Society)
– More experience with developing systems and tools, e.g.:
  – DIAS (IBM), DSpace, Fedora, Internet Archive, LOCKSS, OCLC Digital Archive, PANDAS, PubMed Central, Storage Resource Broker, etc.
  – Journal publishers co-operating with KB on e-Depot

Key developments (2)
– Standards
  • Reference Model for an Open Archival Information System (OAIS) - ISO 14721:2003
    – A reference model, not a blueprint - but increasingly influential
  • Preservation metadata
    – Current focus on PREMIS working group, supported by OCLC and Research Libraries Group
    – Other activity ongoing, e.g. in scientific research domains

Research (1)
• Some key requirements identified in:
  • It's about time: research challenges in digital archiving and long-term preservation, National Science Foundation and Library of Congress (2003): http://www.digitalpreservation.gov/repor/NSF_LC_Final_Report.pdf
  • Invest to Save: report and recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation (2003): http://delos-noe.iei.pi.cnr.it/activities/internationalforum/Joint-WGs/digitalarchiving/Digitalarchiving.pdf

Research (2)
– DELOS preservation cluster:
  • Frameworks for the analysis of preservation strategies
  • Building preservation functionality into digital libraries
  • File formats and metadata
  • Workshop on Digital repositories: interoperability and common services, Crete, 11-13 May 2005: http://www.ukoln.ac.uk/events/delos-rep-workshop/

Research (3)
– Current JISC research programmes:
  • Supporting Digital Preservation and Asset Management in Institutions
    – Relatively small-scale projects: assessment tools, training, user guides, etc.
  • Digital Repositories (deadline last week)
    – Building on Focus on Access to Institutional Resources (FAIR) programme
  • http://www.jisc.ac.uk/

Some issues (1)
– Open access repositories and preservation:
  – Exact role of repositories still evolving:
    » Some advocates of open access treat digital preservation concerns as a distraction to the primary task of "filling up the archives"
    » But the recent National Institutes of Health public access policy requests grantees to submit publications to PubMed Central - emphasising its role for permanent preservation
  – Disaggregated model proposed, whereby not all repositories will have preservation responsibilities
    » Possible need for mechanisms for transferring content to third parties, e.g. national libraries

Some issues (2)
– Trusted repositories:
  • Attributes and responsibilities of 'trusted repositories' defined by RLG and OCLC working group (2002)
    – Builds on 1996 Task Force report and OAIS model
    – Attributes include the viability and financial sustainability of the organisation, and the need for accountability
    – Question whether these (and other criteria) could be used as a basis for certification is being explored by the Task Force on Digital Repository Certification, supported by RLG and the National Archives and Records Administration (NARA)

Some issues (3)
– Collection development:
  • Selection/appraisal, storage, access, 'de-selection'
  • Preservation issues need to be considered early in an object's life-cycle (the traditional 'transfer to repository' model will not work)
    – Rethinking concept of 'custody'
  • Cannot be done in isolation
    – Sharing responsibilities across repositories while maintaining useful redundancy
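
One way to picture 'useful redundancy' is several repositories holding copies of the same object and periodically comparing them. The sketch below flags copies that disagree with the majority checksum. It is a toy illustration: the function `audit_replicas` and the repository names are hypothetical, and it does not reproduce the protocol of any actual system such as LOCKSS.

```python
import hashlib
from collections import Counter


def audit_replicas(replicas: dict) -> dict:
    """Compare checksums of copies held by different repositories.

    replicas maps a repository name to the bytes it holds. Returns the
    majority checksum (None if no strict majority exists) and the names
    of the repositories whose copies are suspect.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in replicas.items()}
    top_digest, top_count = Counter(digests.values()).most_common(1)[0]
    if top_count <= len(replicas) / 2:
        # No strict majority: every copy has to be treated as suspect.
        return {"consensus": None, "suspect": sorted(digests)}
    suspect = sorted(name for name, digest in digests.items() if digest != top_digest)
    return {"consensus": top_digest, "suspect": suspect}


good = b"electronic journal article"
report = audit_replicas({
    "repository_a": good,
    "repository_b": good,
    "repository_c": b"electronic journal artic1e",  # simulated corruption
})
print(report["suspect"])
```

With only two copies a disagreement cannot be resolved by voting, which is one practical argument for sharing responsibility across more than two repositories.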

Some issues (4)
– Legal issues:
  • Repositories need the legal right to copy, migrate, reverse engineer software, etc.
  • Problems with identifying rights holders
  • Access - are "dark archives" the answer?

Some issues (5)
– Economic issues:
  • Still very little known about costs over the long term
  • No widely used economic models
  • Research-type funding is not long-term
    – Recent draft report for National Science Foundation asks whether digital collections should be treated like scientific facilities

Summing up (1)
– Major differences from the late 1980s
  • Problem has grown, but awareness of it is now much higher
  • Many research projects, vendors, services, etc. now investigating this problem - not always particularly co-ordinated
  • Encouraging signs in funding of NDIIPP, DCC and other recent initiatives

Summing up (2)
– Co-operation is essential
  • Some progress, e.g. DPC, ERPANET
  • Need to work out how trusted repositories will work together in a distributed network
  • Need for training
– Many problems remain to be resolved
  • Research (e.g. into provenance of data, the role of file format registries)
  • Development of tools
  • Integrating existing work

More information
– National Library of Australia's Preserving Access to Digital Information (PADI) gateway: http://www.nla.gov.au/padi/
– Joint DPC and PADI bulletin What's New in Digital Preservation: http://www.dpconline.org/graphics/whatsnew/
– UK Digital Curation Centre: http://www.dcc.ac.uk/

Acknowledgements
The Digital Curation Centre is funded by the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils and the Core e-Science Programme of the UK research councils. The consortium comprises the University of Edinburgh (lead partner), the University of Glasgow, the Council for the Central Laboratory of the Research Councils, and the University of Bath (UKOLN). http://www.dcc.ac.uk/

UKOLN is funded by the Council for Museums, Libraries and Archives (MLA) and the JISC, as well as by project funding from the JISC, the European Union and other sources. UKOLN also receives support from the University of Bath, where it is based. http://www.ukoln.ac.uk/