Cross-Tier Unified Namespace
Johann Lombardi, Extreme Storage Architecture & Development, Intel
LAD’18, France
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others. © 2018 Intel Corporation.
Copyright © 2018 Intel® Corporation
Agenda
What this talk is about
§ Multi-tier integration
 – Scale-out object store / DAOS
 – Parallel File System (PFS) / Lustre
§ Expose unified namespace to end users
§ Efficient dataset movement
What this talk is not about
§ Burst buffers or transparent caching
§ DAOS internals
 – Ping me separately if you are interested in the open-source DAOS project
§ A comparison between Lustre and DAOS
[Figure: Compute Nodes accessing a Scale-out Object Store (DAOS, RADOS, S3) and a PFS (POSIX)]
Targeted Storage Architecture
[Figure: architecture diagram. Labels recovered: Compute Nodes; DAOS Nodes (Scale-out Object Store Tier); DAOS Protocol; Gateway Nodes (I/O Forwarding, Data Movement); Lustre Protocol; Lustre OSS & MDS Nodes (PFS Tier)]
The information on this page is subject to the use and disclosure restrictions provided on Page 2 of this document.
Distributed Async Object Storage
[Figure: DAOS software stack]
§ 3rd Party Applications, Rich Data Models, HPC Workflow
§ Relaxed POSIX I/O, HDF5, Apache Arrow, SQL, …
§ DAOS Storage Engine (Open Source, Apache 2.0 License)
§ Storage Platform
 – Data Plane: SPDK, Mercury/OFI, PMDK
 – Control Plane: gRPC
§ Storage Media: NVRAM, SCM, NVMe, HDD
Distributed Async Object Storage
[Figure: the DAOS stack from the previous slide, overlaid with example containers, each identified by a UUID]
§ POSIX Container: root dir, dirs, data files
§ File-per-process Container: data files
§ HDF5 Container: groups, datasets, data
§ KV store Container: key/value pairs
§ Columnar DB Container: keys and values
§ ACG Container: nodes
Unified Namespace Concept
[Figure: directory tree rooted at /mnt/prod]
§ Regular Lustre directories & files: users, Buzz, .shrc, moon.mpg, projects, Apollo, Gemini, libs, mkl.so, hdf5.so
§ File/dir with special extended attribute (EA), shown as "EA: CUUID"
 – Simul.h5: HDF5 Container (root group, groups, datasets, data)
 – Result.dn: DAOS POSIX Container (dirs, files, data)
 – Simul.out: DAOS MPI-IO Container (MPI-IO files)
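A minimal sketch of how a client could tell a container entry point from a regular Lustre entry by probing the special EA. The attribute name `user.daos.cuuid` and the ASCII UUID encoding are assumptions made for illustration; the slides only label the attribute "EA: CUUID". `os.getxattr` is the standard Linux-only Python call and is used only when no lookup function is injected.

```python
import os
import uuid
from typing import Callable, Optional

# Hypothetical attribute name; the slides only call it "EA: CUUID".
CONTAINER_EA = "user.daos.cuuid"


def container_uuid(
    path: str,
    getxattr: Optional[Callable[[str, str], bytes]] = None,
) -> Optional[uuid.UUID]:
    """Return the container UUID stored on `path`, or None for a regular entry."""
    if getxattr is None:
        getxattr = os.getxattr  # Linux-only standard library call
    try:
        raw = getxattr(path, CONTAINER_EA)
    except OSError:
        # Attribute missing (or xattrs unsupported): treat as a regular file/dir.
        return None
    try:
        # Assumption: the EA value is the UUID in its ASCII string form.
        return uuid.UUID(raw.decode("ascii").strip())
    except (UnicodeDecodeError, ValueError):
        # The attribute exists but does not hold a well-formed UUID.
        return None
```

Injecting `getxattr` keeps the sketch usable on systems without extended attribute support and makes the lookup easy to fake.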
What’s really stored in the PFS?
[Figure: the same /mnt/prod tree, as stored in Lustre]
§ Regular Lustre directories & files: users, Buzz, .shrc, moon.mpg, projects, Apollo, Gemini, libs, mkl.so, hdf5.so
§ Simul.h5, Result.dn and Simul.out, each carrying EA: CUUID
 – Empty file/dir!
Use Case: Readdir Lustre Directory
[Figure: on the Compute Node, the Application calls into the POSIX IOF FUSE Daemon, which is connected to Lustre and DAOS. Arrow types: System call, Lustre protocol, DAOS protocol]
1. readdir
2. lookup (intent=readdir)
3. getxattr
4. readdir results
Use Case: Readdir POSIX Container
[Figure: on the Compute Node, the Application calls into the POSIX IOF FUSE Daemon, which is connected to Lustre and DAOS. Arrow types: System call, Lustre protocol, DAOS protocol]
1. readdir
2. lookup (intent=readdir)
3. getxattr (Lustre)
4. readdir (UUID) (DAOS)
5. readdir results
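The two readdir flows can be sketched as a single dispatch on the EA. This is a toy model under stated assumptions: the dictionaries stand in for Lustre and DAOS, the UUID string is made up, and none of the names below are real Lustre, DAOS or IOF APIs.

```python
from typing import Dict, List

# Toy stand-ins for the two tiers (assumptions, not real Lustre or DAOS APIs).
LUSTRE_DIRS: Dict[str, List[str]] = {
    "Apollo": ["Simul.h5", "Simul.out", "Result.dn"],
    "Apollo/Result.dn": [],  # empty placeholder directory in the PFS
}
LUSTRE_EAS: Dict[str, str] = {
    "Apollo/Result.dn": "cuuid-0001",  # hypothetical container UUID
}
DAOS_CONTAINERS: Dict[str, List[str]] = {
    "cuuid-0001": ["bar1", "bar2", "bar3"],
}


def readdir(path: str) -> List[str]:
    """Model of the FUSE daemon's readdir handling for both use cases."""
    # Lookup in Lustre: the directory (or its placeholder) must exist there.
    if path not in LUSTRE_DIRS:
        raise FileNotFoundError(path)
    # getxattr: fetch the special EA, if any.
    cuuid = LUSTRE_EAS.get(path)
    if cuuid is None:
        # Regular Lustre directory: entries come straight from the PFS.
        return list(LUSTRE_DIRS[path])
    # POSIX container: readdir(UUID) is served by the object store instead.
    return list(DAOS_CONTAINERS[cuuid])
```

In this model `readdir("Apollo")` is answered by the Lustre stand-in, while `readdir("Apollo/Result.dn")` is redirected to the container named by the EA.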
Use Case: DAOS-aware I/O Middleware
[Figure: on the Compute Node, the Application uses DAOS-aware middleware alongside the POSIX IOF FUSE Daemon. Arrow types: System call, Lustre protocol, DAOS protocol]
1. getxattr
2. lookup (intent=getxattr) (Lustre)
3. return UUID
4. DAOS API (UUID) (DAOS)
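A minimal sketch of the middleware flow: fetch the UUID through the namespace, then go to the object store with it. All three callables are hypothetical placeholders, and the fallback to a plain POSIX open when no EA is present is an assumption that the slide does not show.

```python
from typing import Callable, Optional, Tuple

Handle = Tuple[str, str]  # (tier, identifier): a toy handle, not a real DAOS handle


def open_dataset(
    path: str,
    get_cuuid: Callable[[str], Optional[str]],
    daos_open: Callable[[str], Handle],
    posix_open: Callable[[str], Handle],
) -> Handle:
    """Open `path` through DAOS when it carries a container UUID."""
    # Steps 1-3: getxattr on the path; Lustre returns the UUID if the EA is set.
    cuuid = get_cuuid(path)
    if cuuid is not None:
        # Step 4: use the DAOS API directly with the UUID.
        return daos_open(cuuid)
    # Assumption (not on the slide): regular entries fall back to POSIX I/O.
    return posix_open(path)
```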
Special File/Dir Representation
Regular Extended Attribute (EA)
§ Portable
§ Performance Impact
 – Extra EA fetch on every lookup
§ Can’t prevent Lustre file/dir from being created under the special directory
[Figure: Lustre Client runs fd = open(apollo/simul.out), then fgetxattr(fd, DAOS_EA); the MDTs receive open/lookup followed by getxattr]
Special LOV EA
§ Not Portable
§ Minimal Performance Impact
 – No extra RPC
§ Prohibit regular file/dir creation
[Figure: Lustre Client runs fd = open(apollo/simul.out), then fgetxattr(fd, LOV_EA); the MDTs receive open/lookup only]
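The performance difference between the two representations can be put into numbers with a deliberately simple model. It assumes one open/lookup RPC per access, plus one getxattr RPC when a regular EA is used. Real RPC counts depend on client caching and are not given in the slides.

```python
def metadata_rpcs(lookups: int, special_lov_ea: bool) -> int:
    """RPCs sent to the MDTs for `lookups` opens under the simple model above."""
    if lookups < 0:
        raise ValueError("lookups must be non-negative")
    # Regular EA: open/lookup plus an extra getxattr.
    # Special LOV EA: the EA is available after open/lookup, so no extra RPC.
    per_lookup = 1 if special_lov_ea else 2
    return lookups * per_lookup


print(metadata_rpcs(1000, special_lov_ea=False))  # prints 2000
print(metadata_rpcs(1000, special_lov_ea=True))   # prints 1000
```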
Dataset Movement
Mover
§ Specific data mover
§ Format conversion
 – Middleware-dependent: convert container from DAOS format to POSIX format & vice versa
 – Middleware-agnostic: (de)serialize container to image files/dirs in Lustre
§ Explore how to use layout swap functionality
Integration with Lustre Client Container Image (CCI)
§ Local ldiskfs image mounted transparently on Lustre client
 – Written back to OSTs
 – High IOPS per client since MDTs not involved
§ Accelerate migration of POSIX containers
[Figure: the Mover sits between DAOS and Lustre. In Lustre, Apollo holds Result.dn (EA: CUUID) and Simul.h5 (EA: CUUID); in DAOS, an HDF5 Container (root, groups, datasets) and a POSIX Container. After migration, Apollo holds Result.dn (with bar1, bar2, bar3) and Simul.h5]
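For a POSIX container, the DAOS-to-POSIX conversion amounts to materialising the container's tree as files and directories, and the reverse reads them back. The sketch below models a container as a nested dictionary, which is an assumption for illustration only; a real mover would read and write through the DAOS and Lustre clients.

```python
import os
import tempfile
from typing import Dict, Union

# Toy container model: directory names map to sub-trees, file names to bytes.
Tree = Dict[str, Union[bytes, "Tree"]]


def serialize(tree: Tree, root: str) -> None:
    """Write the container tree under `root` as regular files and directories."""
    for name, node in tree.items():
        target = os.path.join(root, name)
        if isinstance(node, dict):
            os.makedirs(target, exist_ok=True)
            serialize(node, target)
        else:
            with open(target, "wb") as fh:
                fh.write(node)


def deserialize(root: str) -> Tree:
    """Rebuild the container tree from the files and directories under `root`."""
    tree: Tree = {}
    for name in sorted(os.listdir(root)):
        target = os.path.join(root, name)
        if os.path.isdir(target):
            tree[name] = deserialize(target)
        else:
            with open(target, "rb") as fh:
                tree[name] = fh.read()
    return tree


if __name__ == "__main__":
    container: Tree = {"bar1": b"a", "bar2": b"b", "sub": {"bar3": b"c"}}
    with tempfile.TemporaryDirectory() as workdir:
        serialize(container, workdir)
        assert deserialize(workdir) == container  # round trip preserves content
```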
Summary
Lustre change proposal
§ Extend LOV EA
 – New layout type to point at external tier
 – Generic feature based on UUID
 – Can be integrated with any scale-out object stores
 – Opportunity to leverage layout swap functionality for cross-tier migration
§ Effort tracked in LU-11376
 – Goal is to merge feature upstream
 – Feedback is welcomed!
Resources
§ POSIX I/O Forwarding
 – https://github.com/daos-stack/iof
§ DAOS
 – http://daos.io
 – https://github.com/daos-stack/daos
§ Contacts
 – johann.lombardi@intel.com
 – bruno.faccini@intel.com
 – riaux.jb@intel.com
Unified Namespace Implementation – POSIX IOF
[Figure: component diagram]
§ Compute Node: Application; HDF5, Apache Arrow; Interception Library; DAOS POSIX; POSIX I/O Forwarding FUSE Daemon; DAOS Client Library
§ Gateway Nodes: POSIX I/O Forwarding Service; Lustre Client
§ Storage tiers: DAOS, Lustre