Digital Curation Centre: a centre of expertise in data curation and preservation
The Digital Curation Centre: Developing support for digital curation
David Giaretta
Funders: [logos]

Outline
• Fundamentals
• Details
• Current work
• Collaborations
• Conclusions

Organisation to Engage & Collaborate
[Diagram labels: curation organisations, e.g. DPC; communities of practice: users; community support & outreach; Collaborative Associates Network of Data Organisations; service definition & delivery; management & admin support; research; collaborators; development; co-ordination; testbeds & tools; industry; standards bodies]

What is actually needed? MONEY

Fundamentals
• The OAIS Reference Model is a key (the only?) standard for the long-term preservation of information
• Digital Curation covers many issues, including financial, scientific, technical, legal and sociological ones
• OAIS does not cover all these issues and so is only part of the solution

OAIS Reference Model – Functional Model

Representation Net

Preservation Issues
Given a file or a stream of bits, how does one know what Representation Information is needed? (This question applies to Representation Information itself as well as to the digital objects we are primarily interested in preserving and using.) How does one know, for example, whether this thing is in FITS format?
• Someone may simply “know” what it is and how to deal with it, i.e. the bits are within the Knowledge Base
• One may be able to recognise the format by looking for various types of patterns
• One may feed the bits into all available interpreters to see which accept the data as valid
• Other means…
The only safe way: have an associated label which points to the appropriate Representation Information
– Note: this does not exclude the other methods, e.g. for data rescue
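
The identification options above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names and the label value are hypothetical, and the pattern check relies solely on the fact that a FITS file begins with the keyword SIMPLE padded to eight characters and followed by "=". An explicit label is trusted first; pattern matching is kept as a fallback of the kind one might use for data rescue.

```python
from typing import Optional, Tuple

# A FITS primary header starts with the keyword SIMPLE padded to 8 characters, then "=".
FITS_MAGIC = b"SIMPLE  ="


def sniff_format(data: bytes) -> Optional[str]:
    """Guess a format from byte patterns; return None when nothing matches."""
    if data.startswith(FITS_MAGIC):
        return "FITS"
    if data.lstrip().startswith(b"<?xml"):
        return "XML"
    return None


def identify(data: bytes, label: Optional[str] = None) -> Tuple[str, str]:
    """Return (format or RepInfo identifier, how it was determined)."""
    if label is not None:
        # The safe way: an explicit label pointing at Representation Information.
        return label, "label"
    guess = sniff_format(data)
    if guess is not None:
        # Fallback by pattern recognition, e.g. for data rescue.
        return guess, "pattern"
    return "unknown", "none"
```

A call such as `identify(bits, label="repinfo:example-fits")` returns the label untouched (the identifier shown is made up), while `identify(bits)` falls back to sniffing.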

High Level Conceptual View
Example of use of Representation Information Labelling

Layered Model from OAIS

Structure – including Formats
• Distinguish:
  – formats which are used mainly for rendering – to be followed by human inspection, and
  – formats used for automated processing
• Distinguish:
  – Things with unknown structure – needs software
    • proprietary software, e.g. MS Word
    • Open Source software, e.g. CDF
  – Things with known structure
    • ASCII file, FITS file etc.
  – Document the format
  – Use a description language if possible, e.g. EAST
  – The EAST tools are themselves Representation Information which in due course will have to be fully defined – the closure of their Representation Nets will be the EAST standard
  – Higher level definitions should include useful scientific objects and humanities objects

Representation Information (2)
• Semantics
  – Hard problem – start with Data Dictionaries
  – Implications: the Representation Information Repository should include Data Dictionaries, followed by more general semantics

Time Dependent Information
– Many, perhaps most, datasets change over time and the state at each particular moment in time may be important. It may be useful to break the issue into separate parts.
  • At each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net.
  • Efficient storage of a series of snapshots may lead one to store differences or include time tags in the data
    – Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
– Also applies to ANNOTATION – who said what about which, and when did they say it
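
As a rough illustration of the "store differences" idea, the Python sketch below keeps one base snapshot plus time-tagged deltas and rebuilds the state at a requested time. Everything here is an assumption made for the example: a dataset is modelled as a flat dictionary, time tags are integers, and the names `make_delta` and `snapshot_at` are hypothetical. The reconstruction rule inside `snapshot_at` is exactly the kind of procedure the additional Representation Information would have to describe.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Any]
# (time tag, keys added or changed, keys removed)
Delta = Tuple[int, Dict[str, Any], List[str]]


def make_delta(time_tag: int, old: Snapshot, new: Snapshot) -> Delta:
    """Encode the change from `old` to `new` as a time-tagged difference."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return time_tag, changed, removed


def snapshot_at(base: Snapshot, deltas: List[Delta], t: int) -> Snapshot:
    """Rebuild the dataset state at time `t` from the base snapshot and deltas."""
    state = dict(base)
    for time_tag, changed, removed in sorted(deltas, key=lambda d: d[0]):
        if time_tag > t:
            break  # later changes do not belong to the requested snapshot
        state.update(changed)
        for key in removed:
            state.pop(key, None)
    return state
```

Without the rule implemented by `snapshot_at`, the stored deltas alone would not be enough to recover any particular moment's snapshot.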

Actions and Processes (Behaviour)
• Some information has, as an integral part of its content, an implicit or explicit process associated with it
  – An example of this is a database or other time dependent or reactive system such as a Neural Net
• Emulations
  – Universal Virtual Computer (UVC)
  – A very well specified VM, e.g. JVM

Is saying “it’s XML” enough?
<?xml version='1.0'?>
<VOTABLE version="1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1"
  xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<!-- ! VOTable written by uk.ac.starlink.votable.VOTableWriter !-->
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Credits" value="Column explanations provided by Mike Read (ROE) from 6dfGS project."/>
<PARAM arraysize="*" datatype="char" name="Conversion" value="Converted from 6dfgs_E7.fld.gz by Mark Taylor (Starlink) using STIL."/>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
[slide callout: Or here]
</FIELD>
<FIELD arraysize="11" datatype="char" name="RA" unit="HMS">
<DESCRIPTION>Right Ascension J2000</DESCRIPTION>
</FIELD>
<FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
<DATA>
<FITS>
<STREAM encoding='base64'>
U0lNUExFICA9ICAgICBUIC8gU3RhbmRhcmQgRklUUyBm
b3JtYXQgICAgICBCSVRQSVggID0gICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgIE5BWElTICAgPSAgICAgMCAv
IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAg
NO!
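
The point of this slide can be made concrete with a small Python sketch: the label "XML" only describes the outermost layer, and each inner layer (the VOTable schema, the base64 encoding, the embedded FITS data) needs its own Representation Information. The sample document is built inside the code and is a cut-down, well-formed stand-in for the excerpt on the slide, not the original file; the function names are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET
from typing import List

# Namespace taken from the VOTable 1.1 excerpt on the slide.
NS = "{http://www.ivoa.net/xml/VOTable/v1.1}"


def make_sample() -> str:
    """Build a tiny well-formed stand-in for the slide's VOTable excerpt."""
    payload = base64.b64encode(b"SIMPLE  =     T / Standard FITS format").decode("ascii")
    return (
        "<?xml version='1.0'?>"
        '<VOTABLE version="1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1">'
        "<RESOURCE><TABLE><DATA><FITS>"
        f"<STREAM encoding='base64'>{payload}</STREAM>"
        "</FITS></DATA></TABLE></RESOURCE></VOTABLE>"
    )


def layers(doc: str) -> List[str]:
    """List the layers of representation found in the document, outermost first."""
    found = ["XML"]
    root = ET.fromstring(doc)
    if root.tag == NS + "VOTABLE":
        found.append("VOTable")
    for stream in root.iter(NS + "STREAM"):
        if stream.get("encoding") == "base64":
            found.append("base64")
            raw = base64.b64decode(stream.text or "")
            if raw.startswith(b"SIMPLE  ="):
                found.append("FITS")  # binary data hidden inside the XML
    return found
```

Calling `layers(make_sample())` walks down through every layer; a reader holding only an XML parser would stop at the first.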

Registry/Repository
• Use existing standards
  – UDDI
    • No repository
  – ebXML
Additional advantage: helps integration with the GRID
• Registry/Repository access
  – Interface and protocols
  – JAXR “standard”
  – Can talk to UDDI and ebXML registries
  – FreebXML implementation
• many access methods
  – URL, Web Services, API, etc.

ebXML Registry Version 3.0: Simplified View of Architecture
Source: ebXML Registry Services and Protocols Committee Draft, 10 February 2005

Registry/Repository (regrep)
• Has to be a trusted repository (of RepInfo)
  – Authenticity of RepInfo
  – Access control
  – Certificates/Digests (are they trustable over the long term?)
• Extensibility
• Distributed
  – Share the effort
• Notification Service

Simple Buy-In
• Need to add RepInfo to your Data Objects?
• Does the RepInfo already exist?
  – Yes: get its ID and put that in a label
  – No: register what you have – be assigned an ID.
• Add more details later when needed
• Or others can add more details
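
The buy-in workflow above could look roughly like the following Python sketch. It uses a purely in-memory stand-in for the Registry/Repository; the class, its methods and the `repinfo:NNNN` identifier scheme are all invented for illustration and do not describe the DCC registry or any ebXML interface.

```python
from typing import Dict, List, Optional


class RepInfoRegistry:
    """Toy in-memory registry of Representation Information (hypothetical)."""

    def __init__(self) -> None:
        self._details: Dict[str, List[str]] = {}  # ID -> accumulated details
        self._by_name: Dict[str, str] = {}        # name -> ID

    def lookup(self, name: str) -> Optional[str]:
        """Return the ID if RepInfo with this name already exists."""
        return self._by_name.get(name)

    def get_or_register(self, name: str, details: str) -> str:
        """Yes: reuse the existing ID. No: register what you have and get an ID."""
        existing = self.lookup(name)
        if existing is not None:
            return existing
        rep_id = f"repinfo:{len(self._details) + 1:04d}"
        self._details[rep_id] = [details]
        self._by_name[name] = rep_id
        return rep_id

    def add_details(self, rep_id: str, details: str) -> None:
        """Anyone can add more details later, when needed."""
        self._details[rep_id].append(details)

    def details(self, rep_id: str) -> List[str]:
        return list(self._details[rep_id])


def label_object(data: bytes, rep_id: str) -> dict:
    """Attach the RepInfo ID to a data object as a simple label."""
    return {"repinfo_id": rep_id, "size": len(data)}
```

The data producer only needs `get_or_register` and `label_object`; enriching the entry with `add_details` can happen later and can be done by someone else.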

Archival Information Package

Preservation Description Info

AIP implications – PDI
• define standard Preservation Metadata – based initially on OCLC work and also CCLRC work etc.
• define adequate Packaging technique – almost certainly XML based
• define recommended tools and procedures for creating Fixity Information such as checksums and digests, together with associated Representation Information
• investigate authentication systems
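
A minimal sketch of Fixity Information, assuming a digest computed with Python's standard `hashlib` module. The record layout and function names are hypothetical. The algorithm name is stored next to the digest because a bare hex string cannot be checked later without knowing how it was produced, which is where the associated Representation Information comes in.

```python
import hashlib


def make_fixity(data: bytes, algorithm: str = "sha256") -> dict:
    """Compute a digest and record which algorithm produced it."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"algorithm": algorithm, "digest": digest}


def verify_fixity(data: bytes, fixity: dict) -> bool:
    """Recompute the digest with the recorded algorithm and compare."""
    recomputed = hashlib.new(fixity["algorithm"], data).hexdigest()
    return recomputed == fixity["digest"]
```

`verify_fixity` returns False as soon as a single byte of the object changes, but it says nothing about who produced the record; that is the separate authentication question raised in the last bullet.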

Standards and Audit & Certification
• How can people know who to entrust with their information?
• There is a demand for a certification process for
  – Repositories
  – Components, e.g. archive storage
  – Software
• Certifications such as ISO 9000 and ISO 17799 do not do the job
• Group of international experts led by RLG and NARA is drafting a Certification standard – going to Technical Editor at the end of June

…dual benefit
• Provide assistance in planning data archives
  – Hardware
  – Software
• Provide tools and assistance in “Preservation Planning” activities
  – This is the area most likely to be deficient in archives
  – Representation Information
  – Preservation “metadata”
• Science Data and also digital libraries

BADC Background
• The OAIS Reference Model is an important international standard for archive systems
• British Atmospheric Data Centre (BADC) is examining its use of the Atlas Petabyte Store (APS) as its long term storage system
• This presentation compares the BADC/APS system to OAIS
• Context:
  – Looking at Certification of BADC and APS
  – CCLRC as member of Digital Curation Centre (DCC)

BADC mapped to OAIS

Follow-on Activities
• Research Libraries Group has established a web page to track OAIS implementation efforts and issues
  – http://www.rlg.org/longterm/oais.html
• CCSDS/ISO Producer-Archive Interface Methodology Standard
  – Provides framework for Producer/Archive interactions
  – Identifies steps and types of information exchanged during the ‘negotiation’
  – May be used as a checklist by archives
• CCSDS Certification Coordination Function
  – Will track and summarize various archive certification efforts
  – Will attempt to extract high-level model/checklist
  – RLG is organizing a group to establish certification approaches

Working with Others
• DLF
• PRONOM
• GGF/PA
• NARA
• LoC
• RLG
• JISC community
• E-Science Community
• Australian National Library
• EU Panel of Experts
Development info – see http://dev.dcc.ac.uk for details of Wiki and email list open to all

FP7 - R&D Themes proposed
• Technical tools to support a variety of preservation strategies
• Representation information, registries of Representation Information
• Managing complex dynamic datasets and databases
• Developing distributed archives and network solutions from a preservation perspective
• Devising new approaches to the development of IT solutions from the perspective of the durability of information
• Developing life-cycle costings, value chain analyses and other economic models to support sustainable long and very long term preservation and access

Possible OAIS Support Architecture
[Diagram labels: Certification; Accreditation; Distributed OAIS Repository; DRM; Authentication; Local OAIS Repository; Bit Management; Representation Information Repository; Persistent Identifiers; Registries; Standards; Authorisation; Tools & Services; GRID Services]

Conclusions
• Some basics are in place
• Now need to
  – begin populating Registry/Repository
  – work with projects to capture Representation Information and assign Persistent IDs
  – test Draft Certification document against the real world
  – provide a number of tools to help projects
  – Collaborate with a number of related