iRODS workshop, September 26-28, 2012, Linköping (Sweden)
iRODS experience in DEISA
Agnès Ansari – agnes.ansari@idris.fr

The iRODS usability for the DEISA infrastructure activity
• Investigate the current data management and access needs within DEISA
• Find use cases that are representative of the data management trends
• Investigate iRODS customization capabilities
• Customize iRODS as needed for the use cases
• Evaluate how well iRODS matches these requirements

The data management and access questionnaire
• The aim of this questionnaire was to evaluate the applicability of iRODS for DEISA (Nov 2009)
• It documents and improves knowledge of users' data management, access and organization practices
• We received completed forms from end users from the IDRIS, HLRS, FZJ and LRZ WP7 teams
• It can be useful for DECI projects, virtual organizations or any HPC project

Topic: Description

Research domain and organization
• Primary area of research and project organization: describe the scientific domain and the team/organization size
• Computation organization: specify where the data are and where the computations run (single, main, distributed sites)

Data processing types and scenarios
• Data access control and security: specify if the data is sensitive or not and if encryption is needed
• Kind of computation: specify if simulations or data analysis/processing are computationally or I/O intensive
• Overall data volume for the project: specify the range, from less than 10 GB up to more than 100 TB
• I/O volume for a typical processing job: specify the range, from 100 MB up to 1 TB
• Typical I/O schemes: specify write type jobs, read-write type jobs or read-update type jobs

Data management
• Data access type: describe the use of a database, a POSIX interface, an I/O software layer (HDF5), parallel I/O
• File access, fraction of the file read in, multiple reads scenarios: describe multiple readings of input files, fraction of the files read in, random access or random readings of multiple input files
• File creations and data writes: describe if temporary or permanent storage is used during job execution
• Data storage: specify if the data is stored at one site, spread over multiple sites (from/to), or on a workstation
• Files management: describe the organization of files in directories/sub-directories or the use of metadata
• Data transfer tools: describe the tools used to transfer data

Data management questionnaire feedback (1)
• Filled in by DECI projects from different research domains: engineering, particle physics, ...
• Project organization: rather small groups (<5 persons) and some collaborations (<20 persons)
• Computation organization: data and computation are distributed over several sites, so data has to be exchanged between sites
• In most cases, data is considered non-sensitive, so usually all group members can access the data and data encryption is not needed
• Protocols used: scp, rsync, gridftp

Data management questionnaire feedback (2)
• Computationally intensive simulations are performed rather than I/O intensive simulations or data analysis/processing: the I/O rate is less than 1 MB of disk I/O per second of CPU time (low I/O)
• Data volume per project: 1 to 100 TB
• I/O volume per job: 50 to 500 GB
• I/O scheme: read input files/data, write output files/data (read-write type jobs) rather than read-update type jobs
• Files are organized in directories/sub-directories. Files are identified by their names (using specific naming conventions)
• Data access: direct access to data or files, or parallel I/O
• File access: complete file reading or multiple readings. Random access or random readings are seldom used
• Data storage: at one computing site or spread over several (GPFS is not the only data access method)

Use case for using iRODS in DEISA
• Collaboration between 2 DECI projects running jobs over 2 DEISA sites
– Management of distributed and shared data
– Heterogeneous computing environment
– Heterogeneous set of data storage spaces
– Various user profiles (DEISA and non-DEISA accounts) that have to collaborate by sharing data

[Figure: storage spaces at DEISA Site_a and DEISA Site_b. Labels: permanent storage (HOME/WORKDIR) and archive storage for local users and DEISA users; distributed storage on a specific node (DEISA_HOME/DEISA_DATA) for DEISA users; GPFS distributed and shared storage.]

[Figure: data copies between DEISA site_a and DEISA Site_b over the DEISA network. Labels: $DEISA_DATA/CR distributed/shared storage (GPFS), available on login nodes only; permanent storage $WORKDIR at site_a and $WORK at Site_b (job computing environments); archive storage and archive storage server; data copy (I/O) of input data/results for localuser_site_a and localuser_site_b.]

The 3-step workflow simulation
• Prototype used to match the use case
• It is composed of 3 phases, run sequentially in 2 different computing centers
– First phase: production
– Second phase: first processing of the files produced in phase 1 to reduce data volume
– Third phase: final analysis of the data from phase 2

The 3-step simulation
[Figure: Site 1 runs production, processing and analysis; Site 2 runs production and processing. Both sites use the iRODS database and iRODS resources on distributed and shared storage.]
• Cubic mesh composed of a set of cells following the X, Y, Z axes as a function of time T
• The modeling over the 2 sites is specified by a time range (time_1, time_2)
• Production step: file set E1 with related metadata on each site
• Processing step: subsets of E1 processed to produce the E2 set with related metadata on each site
• Analysis step: subsets of all E2 analysed to produce the E3 data set with related metadata

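The three steps above can be modeled with a small Python sketch. It is only an illustration of how metadata could follow the files from E1 to E3; it does not use iRODS, and the site names, file names, time ranges and metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str
    site: str
    metadata: dict = field(default_factory=dict)


def production(site, t_start, t_end, n_files):
    """Production step: build file set E1 for one site, one file per time slice."""
    step = (t_end - t_start) // n_files
    return [
        DataFile(
            name=f"E1_{site}_{i}.dat",
            site=site,
            metadata={
                "phase": "production",
                "t_start": t_start + i * step,
                "t_end": t_start + (i + 1) * step,
            },
        )
        for i in range(n_files)
    ]


def reduce_files(inputs, name, site, phase):
    """Derive one file from a subset of inputs and record its provenance as metadata."""
    return DataFile(
        name=name,
        site=site,
        metadata={
            "phase": phase,
            "t_start": min(f.metadata["t_start"] for f in inputs),
            "t_end": max(f.metadata["t_end"] for f in inputs),
            "inputs": [f.name for f in inputs],
        },
    )


# Hypothetical time ranges: site1 models [0, 100), site2 models [100, 200).
e1 = production("site1", 0, 100, 4) + production("site2", 100, 200, 4)

# Processing step: each site reduces its own E1 subset to one E2 file.
e2 = [
    reduce_files([f for f in e1 if f.site == site], f"E2_{site}.dat", site, "processing")
    for site in ("site1", "site2")
]

# Analysis step: all E2 files are combined into E3 on site 1.
e3 = reduce_files(e2, "E3.dat", "site1", "analysis")
print(e3.metadata)
```

Carrying the time range and the input list as metadata means a later step can select its inputs by querying attributes rather than by parsing file names.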
Metadata attached during the production phase
– WebDavis – ARCS (Australian Research Collaboration Service)

Data organization: files management in the iRODS virtual data space
• Access rights (ACL) for iRODS users on the distributed/shared data
• The user sees a virtual data space
• Virtual data collections (similar to directories) with virtual files spanning multiple physical resources: physical files at IDRIS, physical files at HLRS
• User data collections: Proj_data (Production, Processing, Analysis) and Ref_data

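The idea of a virtual data space can be sketched as a catalog that maps logical paths to physical replicas and checks read access per collection. This is a simplified Python model, not the iRODS catalog or its API; the class, the user name "alice", and all paths are assumptions made for illustration.

```python
class VirtualDataSpace:
    """Toy catalog: logical paths map to physical replicas on named resources."""

    def __init__(self):
        self._replicas = {}  # logical path -> list of (resource, physical path)
        self._readers = {}   # collection path -> set of users allowed to read

    def register(self, logical_path, resource, physical_path):
        self._replicas.setdefault(logical_path, []).append((resource, physical_path))

    def grant_read(self, collection, user):
        self._readers.setdefault(collection.rstrip("/"), set()).add(user)

    def can_read(self, user, logical_path):
        # A user may read a file if read access was granted on an enclosing collection.
        return any(
            logical_path.startswith(collection + "/") and user in users
            for collection, users in self._readers.items()
        )

    def list_collection(self, user, collection):
        prefix = collection.rstrip("/") + "/"
        return sorted(
            path for path in self._replicas
            if path.startswith(prefix) and self.can_read(user, path)
        )

    def resolve(self, user, logical_path):
        if not self.can_read(user, logical_path):
            raise PermissionError(f"{user} may not read {logical_path}")
        return list(self._replicas[logical_path])


space = VirtualDataSpace()
# Hypothetical physical locations at the two sites.
space.register("/Proj_data/Production/e1_0.dat", "IDRIS", "/idris/store/e1_0.dat")
space.register("/Proj_data/Production/e1_1.dat", "HLRS", "/hlrs/store/e1_1.dat")
space.register("/Ref_data/mesh.dat", "IDRIS", "/idris/store/mesh.dat")
space.grant_read("/Proj_data", "alice")

print(space.list_collection("alice", "/Proj_data/Production"))
print(space.resolve("alice", "/Proj_data/Production/e1_1.dat"))
```

The user only ever handles logical paths; which site holds the physical file is resolved by the catalog.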
iRODS customization areas
• Disk quota management (quota check before moving data)
– applied to almost all storage spaces, including the shared repository
• Cleaning policy when the disk quota is reached
• Backup management
– applied to periodically back up the shared repository during the project life
– applied at the end of the projects
– development of file and collection backup rules, to back up data to a local or remote zone
• User metadata management (instead of flat files that gather data file information and location)

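The quota check and cleaning policy listed above can be illustrated with a short Python model. The slides do not say how the cleaning policy picks files, so the least-recently-accessed rule used here is an assumption, as are all names and sizes; the actual customization was written as iRODS rules, not Python.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSpace:
    name: str
    quota_bytes: int
    files: dict = field(default_factory=dict)  # file name -> (size_bytes, last_access)

    @property
    def used_bytes(self):
        return sum(size for size, _ in self.files.values())


def can_store(space, size_bytes):
    """Quota check: would the file fit without exceeding the quota?"""
    return space.used_bytes + size_bytes <= space.quota_bytes


def clean(space, needed_bytes):
    """Assumed policy: drop least recently accessed files until needed_bytes fit."""
    removed = []
    oldest_first = sorted(space.files.items(), key=lambda item: item[1][1])
    for name, _ in oldest_first:
        if can_store(space, needed_bytes):
            break
        del space.files[name]
        removed.append(name)
    return removed


def move(name, size_bytes, now, dest):
    """Check the quota before moving data; clean the destination if it is full."""
    if size_bytes > dest.quota_bytes:
        raise ValueError(f"{name} is larger than the quota of {dest.name}")
    removed = [] if can_store(dest, size_bytes) else clean(dest, size_bytes)
    dest.files[name] = (size_bytes, now)
    return removed


# Hypothetical shared repository with a 100-byte quota.
shared = StorageSpace("shared_repository", quota_bytes=100)
move("a.dat", 40, now=1, dest=shared)
move("b.dat", 40, now=2, dest=shared)
removed = move("c.dat", 50, now=3, dest=shared)
print(removed, sorted(shared.files), shared.used_bytes)
```

Running the check before the transfer avoids starting a copy that would fail partway through on a full storage space.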
Conclusion on iRODS feasibility for DEISA
The 3-step simulation has shown the capability of iRODS:
– to manage a distributed and shared context with various storage spaces and user profiles
– to store and retrieve data files
– to attach metadata and query metadata
– to define a logical data organization
– to manage disk quotas and the cleaning policy
– to set up automatic backup procedures