
Usecases: 1. ISIS Neutron Source 2. DP for HEP
Matthew Viljoen, STFC, UK
APARSEN-EGI workshop: Preserving Big Data for Research
Amsterdam Science Park, 4-6 March 2014

Use Case 1 ISIS Neutron Source

Setting the Scene – Big Data at SCD, STFC
• Solutions using CASTOR, DMF, SDB, Panasas and home grown
• Primarily Linux based. Oracle StorageTek SL8500 robot with T10K (A-D) media
• 18 PB on tape and 9 PB on disk (CASTOR); 6 PB on disk (Panasas)
• Users:
  • High Energy Particle Physics (CERN users)
  • STFC Facilities (Diamond Synchrotron, ISIS Neutron Source, …)
• Complete end-to-end data solution offered for large scale facilities:
  • Data ingest, data archival, metadata, portal for data retrieval and DOI services

ISIS Neutron Source
• Pulsed neutron and muon source at RAL, Harwell, UK. Run by STFC
• ~3000 scientists supported, in areas including:
  • clean energy and the environment, pharmaceuticals and health care
  • nanotechnology and materials engineering, catalysis and polymers
  • fundamental studies of materials
• Techniques: muon spectroscopy, neutron diffraction, neutron spectroscopy, neutron reflectometry, small angle scattering
• Data collection:
  • From KBs to GBs per visit. ~11 TB to date
  • New experiments (e.g. IMAT) up to 2 TB per visit

ISIS Data Policy, Management and Access
• Well defined policy: http://www.isis.stfc.ac.uk/user-office/data-policy11204.html
  3.1.1 All raw data and the associated metadata obtained as a result of free (non-commercial) access to ISIS, reside in the public domain, with ISIS acting as the custodian.
  3.1.2 All raw data and the associated metadata obtained as a result of ‘commercial-in-confidence’ access to ISIS will be owned exclusively by the commercial user. Commercial users must agree with their relevant instrument scientists how they wish their raw data and metadata to be managed before the start of any experiment.
• Also:
  3.3.1 Access to raw data and metadata beyond the period that it is stored on instrument-related computers will be via a searchable on-line catalogue.
  3.3.2 Access to the on-line catalogue will be restricted to those who register with STFC/ISIS as users of the on-line catalogue.

Data Management for ISIS (diagram; callout: "DP Here!")

Accessing Data via DOI – Landing Page

Accessing Data via DOI – Data Portal

Data Preservation solution – Tessella Safety Deposit Box (SDB)
• Primary copy on disk (Windows File Store). Served to users on demand.
• Copy of ALL data stored for long term backup and preservation on tape using SDB by Tessella (and DMF)
• SDB uses SIP at ingest which reads OAIS NeXus standard file format.
• NeXus validator checks data. Metadata generated. Well defined data. (See nexusformat.org; synchrotron/neutron scattering driven.)
• Definable workflows for migration of data to new formats.
• Continuous validation of data against ‘bit rot’
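The continuous validation against bit rot mentioned above is typically done by recomputing checksums and comparing them with values recorded at ingest. The following is a minimal sketch of that general idea, not a description of how SDB implements it; the function names, the manifest layout (relative path mapped to SHA-256 digest) and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest: dict[str, str], root: Path) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    `manifest` is a hypothetical record made at ingest time:
    relative file path -> expected SHA-256 hex digest.
    """
    damaged = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative_path)
    return damaged
```

Run periodically over the archive, a check like this flags files that need to be restored from another copy.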

Unresolved issues
• The Data Preservation store is a dark archive. Mechanisms for accessing it are yet to be put in place.
• Future data volume increase. How many copies? All on spinning disk?
• Granularity of DOIs, and how do we relate datasets together (raw -> reduced -> derived)? What if they all have different DOIs?
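One possible way to think about the DOI question above is to give each dataset its own DOI and record explicit "derived from" links between them, so the chain raw -> reduced -> derived can be walked back. The sketch below only illustrates that idea and is not something the talk proposes; the `Dataset` class, the `lineage` function and the DOIs (written with the 10.5555 example prefix) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    doi: str
    stage: str  # e.g. "raw", "reduced" or "derived"
    derived_from: list[str] = field(default_factory=list)  # parent DOIs


def lineage(doi: str, catalogue: dict[str, Dataset]) -> list[str]:
    """Return the DOI itself followed by every ancestor, nearest first."""
    ordered: list[str] = []
    stack = [doi]
    seen: set[str] = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        # Push parents in reverse so the first listed parent is visited first.
        stack.extend(reversed(catalogue[current].derived_from))
    return ordered


# Hypothetical catalogue entries, for illustration only.
catalogue = {
    "10.5555/raw.1": Dataset("10.5555/raw.1", "raw"),
    "10.5555/reduced.1": Dataset("10.5555/reduced.1", "reduced", ["10.5555/raw.1"]),
    "10.5555/derived.1": Dataset("10.5555/derived.1", "derived", ["10.5555/reduced.1"]),
}
```

Calling `lineage("10.5555/derived.1", catalogue)` walks from the derived dataset back to the raw one, which is the kind of relation a landing page could expose.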

Usecase 1 - Summary
• All ISIS data stored and available for download (with provisos in DM plan)
• Data preservation in place for retaining data for the long term
• Scientists responsible for documentation/annotation of their data and provenance

Usecase 2 – DP for HEP
With thanks to Jamie Shears for his input
Views expressed here are my own

Intro to DP( )HEP
DPHEP…
• is a study group focusing on data persistency and long term analysis for HEP, including LHC data at CERN. Representation from many national labs
• aims to converge on a common set of specifications for this.

The Problem
Particle accelerators are very expensive, e.g. €3 bn for the LHC. To maximize returns, we need to preserve data and knowledge to reproduce past analyses and perform new ones. In the past, DP has been done in a somewhat ‘ad hoc’ way.

Exascale Preservation
• Current WLCG archives are 10s of PB (CERN has 100 PB). Estimates for the next 2 decades are up to 5 EB
• Scaling up past DP successes, e.g.
  • LEP - 10 TB until 2000. Data/SW still available and usable.
  • Past DESY HERA experiments – 1 PB preserved + usable
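To get a feel for the scale-up quoted above, the implied growth rate can be worked out as simple compound growth. This is a back-of-the-envelope sketch only: it assumes the 100 PB CERN figure as the starting point, 5 EB taken as 5000 PB, exactly 20 years, and a steady year-on-year rate, none of which the slide states.

```python
def implied_annual_growth(start_pb: float, end_pb: float, years: int) -> float:
    """Constant yearly growth rate that turns start_pb into end_pb over `years`."""
    return (end_pb / start_pb) ** (1 / years) - 1


# Assumed inputs: 100 PB today, 5 EB (5000 PB) in 20 years.
rate = implied_annual_growth(100, 5000, 20)
print(f"Implied growth: {rate:.1%} per year")
```

Under these assumptions a 50-fold increase over two decades corresponds to growth of roughly 22% per year.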

DPHEP Approach
1. Digital library tools & services, together with a Portal
2. Sustainable software, coupled with advanced virtualization techniques and validation frameworks
3. Draw from proven past bit preservation successes together with a sustainable funding model with an outlook to 2040/50
4. Open Data – over and above simple Open Access

Challenges
• Not all HEP data is open. Experiments are reviewing their Open Data policies.
• Training needs are different for different communities:
  • DP service providers in bit preservation
  • Software developers
  • Scientists
• The documentation problem in DP HEP. Who, what and how much.
• Technological difficulties of DP HEP and scaling up to Exascale
• Porting software is time consuming. Will old software compile on new compilers?

Thank you. Questions?
matthew.viljoen@stfc.ac.uk
