Certification of CERN as a Trusted Digital Repository

Certification of CERN as a Trusted Digital Repository
ISO 16363, based on OAIS (ISO 14721)
DPHEP and Cross-Group Opportunities (and Cross-Department, Cross-Organisation…)
Jamie.Shiers@cern.ch
ITMM, July 2016
International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

WHERE TO START?

4 years since the Higgs discovery!

Background to DPHEP
• DPHEP started as a Study Group initiated by DESY in 2008/9
  – See http://www.dphep.org/ and The Road to DPHEP (June 2015)
• It was later adopted by ICFA as a panel (1 of 7)
• A Blueprint Document was published in May 2012 and was input to the ESPP update
  – Data preservation included in the approved strategy document
• The Study Group migrated to a Collaboration from 2013
  – ICFA statement
  – Collaboration Agreement
  – 2015 Status Report
• The CERN Services (cross-department) for DP are described in this iPRES paper / IT note (IT, EP, SIS)
  (A (p)repeat of that PPT may make sense)
• The general direction for Certification is described in the June 2016 CERN Courier

THINGS I AM NOT GOING TO COVER

DPHEP: An International study group on data preservation
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25 2012

CROSS-GROUP OPPORTUNITIES

Opportunities Exist…
• As part of the technical work providing services for DP according to the Use Cases of the experiments
  – These closely match requirements from Funders for “Data Management Plans”
  – Main opportunities here are:
    • To situate on-going work as part of “something bigger” (part of ESPP)
    • To get recognition for “background work”, e.g. for LEP
• As part of the Certification Process for CERN as a Trusted Digital Repository
  – Expertise is spread over many people (not just in IT)
  – Learn more about CERN procedures and again situate work as part of a CERN strategic activity
  ➢ Goal is to complete prior to next ESPP update and provide input to it
• In “new” activities, where past experience and knowledge may be relevant
  – E.g. OPERA data
    • Helping to prepare a Data Management Plan (DMP) for OPERA (& other experiments)
    • Helping with the implementation (conversion of 70 TB of Oracle data to non-proprietary format(s))
➢ Bottom line: “See and be seen”

BACKUP SLIDES

Requirements from Funding Agencies
• To integrate data management planning into the overall research plan, all proposals submitted to the Office of Science for research funding are required to include a Data Management Plan (DMP) of no more than two pages that describes how data generated through the course of the proposed research will be shared and preserved or explains why data sharing and/or preservation are not possible or scientifically appropriate.
• At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
• Similar requirements from European FAs and EU (H2020)

H2020: Annex 1 (DMP Template)
The DMP should address the points below…
1. Data set reference and name
   • Identifier for the DS to be produced
2. Data set description
   • Description; origin; nature & scale; to whom useful; underpins publication? similar data?
3. Standards and metadata
   • Reference to standards of the discipline
4. Data sharing
   • How will it be shared? Embargo periods? Mechanisms for dissemination, s/w and other tools for re-use, access open or restricted to groups, where is repository? Type of repository?
5. Archiving and preservation
   • Description of procedures, how long will it be preserved? End volume? Costs? How will these be covered?
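
The five template points above can be treated as a simple record that a project fills in per data set. The following is a minimal sketch only: the class name, field names and the `missing_sections` helper are hypothetical, chosen for illustration, and are not part of the H2020 template or of any CERN tool.

```python
from dataclasses import dataclass, fields


@dataclass
class DataManagementPlan:
    """One data set entry; the five fields mirror the five template points."""
    reference_and_name: str = ""
    description: str = ""
    standards_and_metadata: str = ""
    data_sharing: str = ""
    archiving_and_preservation: str = ""


def missing_sections(dmp: DataManagementPlan) -> list[str]:
    """Return the names of template sections that are still empty."""
    return [f.name for f in fields(dmp) if not getattr(dmp, f.name).strip()]


# Illustrative draft with made-up placeholder values, not a real data set.
draft = DataManagementPlan(
    reference_and_name="example-dataset-001",
    description="Placeholder description of an example data set",
)
print(missing_sections(draft))
```

A check of this kind only shows which sections are empty; whether a filled-in section is adequate remains a matter for review.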

HEP LTDP Use Cases
1. Bit preservation as a basic “service” on which higher level components can build;
   ➢ “Maybe CERN does bit preservation better than anyone else in the world”
2. Preserve data, software, and know-how in the collaborations; basis for reproducibility;
3. Share data and associated software with (wider) scientific community, such as theorists or physicists not part of the original collaboration;
4. Open access to reduced data sets to general public.
➢ Basically, a reflection of DMP requirements

LHC Experiments’ Data Policies
• These are essentially “extended DMPs” that capture the small variations between each experiment
  – Variations in duration of embargo periods, designated communities, fraction of data released
• A generic “WLCG DMP” exists – just like a generic WLCG TDR (complemented by experiment-specific reports)
• More detail in talk about CMS experience with data releases at ADMP workshop

3.5. Will there be need for an adjustment of the general CERN data policy?
• CERN will establish a data policy that is in line with funding agency requirements, including in terms of Open Access (Science).
• This can be expected to be largely similar to that adopted by the 4 main LHC experiments, with a significant fraction of the data released after a reasonable embargo period.
  • The duration of the embargo period and the fraction of the data to be released would be determined based on experience, resource requirements and scientific, educational and cultural benefits.
  • Given that the total dataset of the (HL-)LHC will be in the Exabyte range, the volume of data to be released will eventually become significant and the appropriate resources must be factored into any planning.
5 November 2015, IT 2016

Which Certification Strategy?
• “Trusted” or “certified” digital repositories
  – (Also cost recovery for repositories)
• Several such standards exist: CERN (WLCG) following ISO 16363 route
  – Some sites start with DSA, then DIN, then ISO
    • Even DANS! (The originators of DSA)
  – This would not work at CERN…
• At CERN, the closest thing to a “mission statement” is an Operational Circular
  – This, and other steps required for “certification” could not realistically be repeated as we moved up the ladder…

Certification – Current Status
• Original idea was to perform Certification in the context of WLCG
• However:
  a) Quite a few of the metrics concern the (CERN) site;
  b) Interest also in an OAIS archive for “CERN’s Digital Memory”;
  c) The two are linked: policies, strategies, mission statements for the former are part of the latter;
  d) Some things will be easier in the latter which will in turn help the former
➢ Current thinking: (self-)certify site-wise; “project-specific details” via “Project DMPs”

ISO 16363 metrics: Organisational Infrastructure
3.1 Governance & Organisational Viability
    Mission Statement, Preservation Policy, Implementation plan(s) etc. [ CERN, project(s) ]
3.2 Organisational Structure & Staffing
    Duties, staffing, professional development etc. [ APT etc. ]
3.3 Procedural accountability & preservation policy framework
    Designated communities, knowledge bases, policies & reviews, change management, transparency & accountability etc. [ At least partially projects ]
3.4 Financial sustainability
    Business planning processes, financial practices and procedures etc.
3.5 Contracts, licenses & liabilities
    For the digital materials preserved… [ CERN? Projects? ]

➢ Logical to have an Operational Circular for “Data”
  – Obviously should include “meta-data” (as per DPHEP SR)
    • Software + environment, documentation etc.
  – Symmetry with OC 3 and OC 6
    • Archival material and archiving at CERN
    • CERN scientific documents
    • [ CERN scientific data, s/w, doc + meta-data ]
• This could address “Mission Statement” and “DP Policy” in ISO 16363 (as OC 3 does)
• Complemented by:
  – Data Preservation Plan (inter-departmental) with ~3 year outlook
    • Work together on this “PoW” for DP/DM
    • Include also experiment plans or as part of their DMPs?
  – Experiment / Project Data Management Plan
  – Data Policy (extended DMP – à la LHC)

Infrastructure & Security Risk Management
5.1 Technical Infrastructure Risk Management [ We do all of this, but is it documented? ]
    Technology watches, h/w & s/w changes, detection of bit corruption or loss, reporting, security updates, storage media refreshing, change management, critical processes, handling of multiple data copies etc.
    OC 5, …
5.2 Security Risk Management [ Do we do all of this, and is it documented? ]
    Security risks (data, systems, personnel, physical plant), disaster preparedness and recovery plans …
    OC 2, …
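
“Detection of bit corruption or loss” under 5.1 is commonly done by comparing stored checksums against freshly computed ones. The sketch below illustrates that general idea only. It assumes SHA-256 and a manifest held as a plain dictionary of relative path to expected digest; the slides do not say which checksum algorithm or tooling CERN uses, and the function names `sha256_of` and `find_corrupted` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest: dict[str, str], root: Path) -> list[str]:
    """Return manifest entries whose file is missing or no longer matches.

    `manifest` maps a path relative to `root` to the digest recorded at
    ingest time (an assumed layout, for illustration only).
    """
    damaged = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative_path)
    return damaged
```

Run periodically over each stored copy, a check like this reports both loss (missing file) and corruption (digest mismatch); repairing from another copy is a separate step not shown here.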

Covered in section 4 of ISO 16363

Data Preservation & Certification of Trusted Digital Repositories: Helps Address the Goals Below
• F.A.I.R. and Open Data: Requires Effort & Resources
• Data Management Plans: Sharing, Re-Use; Reproducibility of Results

Concluding Remarks
• Data Preservation is a Journey – Not a Destination
  – “Once you stop pedalling, you stop & fall off”
• Data Preservation is not an Island – it is part of a much bigger picture, including the full data lifecycle
  – You can’t share or re-use data, nor reproduce results, if you haven’t first preserved it

(Self-)Certification
✓ Requires us to formalise / document some of our existing practices… (incl. “bit preservation”)
✓ To “complete” work in certain areas (e.g. disaster preparedness / recovery)
✓ It needs effort / knowledge from a wide range of groups / people
✓ We also need to define a “PoW for Preservation”
Important milestone: update of the European Strategy for Particle Physics (ESPP): ~2019-2020

How to move forward?
• In some cases, I know (we all know) who the suspects are
  – Typically “senior” people – ideally should include also some younger people for continuity / knowledge transfer
• In other cases I do not know: do the GLs?
• We do not have to address all metrics in parallel (but could do e.g. for section 5 – “Risk Management”)
• Formal CERN documents, such as OCs, need to be prepared carefully: hopefully few (just one?) of these
➢ First step: identify suspects then discuss together (including with EP, SIS & expts) how to address metrics
  – Suspects need not be / become experts in ISO 16363
  – Some existing thoughts already in DPHEP Wiki (DPHEP-IB e-group)
  – Level of involvement: from a few hours (e.g. if the information exists, e.g. in PowerPoint, but not in a document with a DOI) up (e.g. for disaster recovery)

Volunteers Please Step Forward!