
ZUFS - Zero-copy User-mode FS

A new interface for a new breed of user-mode filesystems that require:
- Extremely low latency
- Synchronous & DAX
- NUMA-aware access

Boaz Harrosh @ Linux Plumbers

© 2017 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---


In theory

[Diagram: applications (APP) enter the kernel, where ZUF (the Zu Feeder) hands their requests to ZU threads (zt), one per CPU. The ZU threads belong to ZUS (the Zu Server) in user space, which loads Zufs-foo.so, Zufs-bar.so and Zufs-mem.so.]


In Theory

• ZT - ZUFS Thread: one per CPU, with affinity to a single CPU (thread_fifo/rr)
• Special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
• ZT-vma: a 4M vma is mmapped as a zero-copy communication area per ZT
• IOCTL_ZU_WAIT_OPT: the thread sleeps in the kernel, waiting for an operation
• On app IO, the ZT of the current CPU is selected and the app pages are mapped into the ZT-vma. The server thread is released with an operation
• After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation
• On exit (or server crash) the file is closed and the kernel cleans up all resources
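The handshake between an application and its ZT, as listed above, can be modeled as a small state machine. What follows is only a user-space sketch for illustration, not the ZUFS kernel code: the type zt_model and the functions zt_app_dispatch and zt_server_wait are hypothetical names, and the real implementation sleeps and wakes threads rather than flipping an enum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one ZT; not the real struct zufs_thread. */
enum zt_state {
	ZT_RUNNING,   /* server thread is in user space, not yet waiting */
	ZT_WAITING,   /* server thread sleeps in the kernel (IOCTL_ZU_WAIT_OPT) */
	ZT_EXECUTING, /* an operation was handed over; the app sleeps */
};

struct zt_model {
	enum zt_state state;
	int pending_op;    /* operation handed to the server, 0 if none */
	int last_result;   /* result of the last completed operation */
	bool pages_mapped; /* app pages currently mapped into the ZT-vma */
};

/* App side: hand an operation to the ZT of the current CPU.
 * Returns false if the server thread is not waiting for work. */
static bool zt_app_dispatch(struct zt_model *zt, int op)
{
	if (zt->state != ZT_WAITING)
		return false;
	zt->pages_mapped = true;  /* app pages go into the ZT-vma */
	zt->pending_op = op;
	zt->state = ZT_EXECUTING; /* server is released, app now sleeps */
	return true;
}

/* Server side: return to the kernel with the result of the previous
 * operation (ignored if there was none) and wait for the next one. */
static void zt_server_wait(struct zt_model *zt, int result)
{
	if (zt->state == ZT_EXECUTING) {
		zt->pages_mapped = false; /* unmapped on return */
		zt->last_result = result; /* the app is released with this */
		zt->pending_op = 0;
	}
	zt->state = ZT_WAITING;
}
```

In this model a dispatch attempted before the server thread has entered its wait is refused, which mirrors the requirement that an operation can only be handed to a ZT that is already sleeping in the kernel.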


In theory

[Diagram: ZUF (the Zu Feeder) in the kernel maps the app pages into the server VM, at the zt-vma of the Zu Thread inside ZUS (the Zu Server) in user space. The pages are unmapped on return.]


In Theory

• Async operation is also supported
• The server must not sleep in a ZT. All locks are trylocks. If a lock cannot be taken, the operation is queued and the server returns EAGAIN. The server later completes the operation asynchronously, and the app is then woken up
• Do we need PAGE_CACHE support? Here too, write/read_pages() maps the page cache into the zt-vma
• Application mmap goes in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM
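The "trylock, else queue and return EAGAIN" rule can be sketched in plain C11. This is an assumption-laden illustration, not ZUS code: obj_model, obj_submit and obj_unlock_and_complete are hypothetical names, the lock is a C11 atomic_flag standing in for whatever trylock the server uses, and the deferred queue is left unprotected because the sketch is exercised from a single thread.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_DEFERRED 16

/* Hypothetical per-object state; not part of the ZUS API. */
struct obj_model {
	atomic_flag lock;           /* only ever try-locked, never blocked on */
	int deferred[MAX_DEFERRED]; /* operations queued for async completion */
	size_t ndeferred;
	int applied;                /* operations applied so far */
};

static void obj_init(struct obj_model *o)
{
	atomic_flag_clear(&o->lock);
	o->ndeferred = 0;
	o->applied = 0;
}

/* Called on a ZT, so it must not sleep. Returns 0 when the operation
 * was done synchronously, -EAGAIN when it was queued for later, and
 * -ENOSPC when the queue is full. */
static int obj_submit(struct obj_model *o, int op)
{
	if (atomic_flag_test_and_set(&o->lock)) {
		/* Lock is held elsewhere: queue instead of sleeping. */
		if (o->ndeferred == MAX_DEFERRED)
			return -ENOSPC;
		o->deferred[o->ndeferred++] = op;
		return -EAGAIN;
	}
	o->applied++;
	atomic_flag_clear(&o->lock);
	return 0;
}

/* Called by the lock holder: finish the queued operations, then
 * release the lock. Returns how many were completed; in the real
 * design each completion would wake one sleeping app. */
static size_t obj_unlock_and_complete(struct obj_model *o)
{
	size_t done = o->ndeferred;

	o->applied += (int)done;
	o->ndeferred = 0;
	atomic_flag_clear(&o->lock);
	return done;
}
```

A real server would also have to make the deferred queue itself safe against concurrent ZTs, for example with a lock-free list; that part is deliberately left out here.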


Raw Results: FUSE vs. ZUFS vs. In-Kernel FS

FUSE
Threads         Op/s    Lat [us]
      1       71,820        13.5
      2      148,083        13.1
      4      212,133        18.3
      8      209,799        37.6
     12      201,689        58.7
     18      174,823       101.8
     24      149,413       159.0
     36      148,276       240.7
     48      145,296       327.3

ZUFS
Threads         Op/s    Lat [us]
      1      200,799         4.6
      2      314,321         5.9
      4      565,574         6.6
      8    1,113,138         6.6
     12    1,598,451         6.8
     18    1,648,689         7.8

In-Kernel FS
Threads         Op/s    Lat [us]
      1      388,361    2.271589
      2      635,115    2.604376
      4    1,260,307    2.626361
      8    2,744,963    2.485292
     12    2,126,945    5.020506
     18    4,350,995    3.386433
     24    4,211,180    4.784997
     36    3,057,166    9.291997
     48    3,148,972   10.382461


Motivation for ZUFS (for near-memory speed PM media)

• Measured on a dual-socket Intel Xeon 2650 v4 (48 HW threads) with DRAM-backed PM
• Random 4 KB DirectIO write(ish) access


Why is the mm patch required

MMAP_LOCAL_CPU
• Own-core TLB invalidate
• Secure file system signing?

[Chart: ZUFS w/wo mm patch. Latency [us] versus IOPS, with one curve for ZUFS_unpatched_mm and one for ZUFS_patched_mm.]


ZUFS penalty: Raw Results w/ and w/o mm patch

Patched
Threads         Op/s    Lat [us]
      1      200,799         4.6
      2      314,321         5.9
      4      565,574         6.6
      8    1,113,138         6.6
     12    1,598,451         6.8
     18    1,648,689         7.8
     24    1,702,285         8.0
     36    1,783,346        13.4
     48    1,741,873        17.4

Unpatched
Threads         Op/s    Lat [us]
      1      185,391         4.9
      2      197,993         9.6
      4      310,597        12.1
      8      546,702        13.8
     12      641,728        17.2
     18      744,750        22.2
     24      790,805        28.3


Additional Design Considerations

• Single ZUS application server
• ZUFS filesystems are .so libraries loaded into ZUS (preconfigured or at run time)
• Regular mount command. New super blocks are created
• Devices are managed and owned by ZUF in the kernel
• Bind mount also works, the regular way
• ZUS-API with fs-plugins is very close to the VFS API
• Support for compiling zus-plugins as kernel modules, also fed by ZUF?
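To make the plugin idea concrete, here is a minimal sketch of how a server could keep a table of filesystem plugins and look one up by type name at mount time. Everything in it is hypothetical: the slides do not show the ZUS-API, so zus_fs_ops_model, its members, zus_register_fs_model and zus_find_fs_model are invented for illustration and only imitate the general shape of a VFS-like operations table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table, loosely shaped like a VFS ops
 * structure. This is NOT the real ZUS-API. */
struct zus_fs_ops_model {
	const char *name; /* filesystem type, as given to mount -t */
	int (*mount)(void *sb_private);
	long (*read)(void *inode, void *buf, size_t len,
		     unsigned long long pos);
	long (*write)(void *inode, const void *buf, size_t len,
		      unsigned long long pos);
};

#define MAX_FS 8

/* Plugins known to the single server process. */
static const struct zus_fs_ops_model *registered[MAX_FS];
static size_t nregistered;

/* Called once per plugin, e.g. after its .so has been loaded.
 * Returns 0 on success, -1 on bad input or a full table. */
static int zus_register_fs_model(const struct zus_fs_ops_model *ops)
{
	if (!ops || !ops->name || nregistered == MAX_FS)
		return -1;
	registered[nregistered++] = ops;
	return 0;
}

/* Called on mount: find the plugin for the requested type name.
 * Returns NULL when no such plugin is registered. */
static const struct zus_fs_ops_model *zus_find_fs_model(const char *name)
{
	size_t i;

	for (i = 0; i < nregistered; i++)
		if (strcmp(registered[i]->name, name) == 0)
			return registered[i];
	return NULL;
}
```

Keeping the table close to VFS conventions is what would make the last bullet plausible: the same plugin source could, in principle, be built either as a .so for ZUS or as a kernel module.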


Thank you

Please talk to me about ZUFS
boazh@netapp.com


static void _zu_wakeup_fss(struct zufs_thread *zt)
{
	zt->fss_wakeup = true;
	wake_up(&zt->fss_wq);
}

static void _zu_wakeup_app(struct zufs_thread *zt)
{
	zt->app_wakeup = true;
	wake_up(&zt->app_wq);
}

static int _zu_wait_fss(struct zufs_thread *zt)
{
	zt->fss_wakeup = false;
	return wait_event_interruptible(zt->fss_wq, zt->fss_wakeup);
}

static int _zu_wait_app(struct zufs_thread *zt)
{
	zt->app_wakeup = false;
	return wait_event_interruptible(zt->app_wq, zt->app_wakeup);
}

static int _zu_wait(struct file *file, void *parg)
{
	struct zufs_thread *zt;
	int cpu = smp_processor_id();
	int err;

	err = _zt_from_f(file, cpu, &zt);
	if (unlikely(err))
		goto err;

	zt->fss_waiting = true;
	if (zt->app_waiting) {
		/* previous operation is done: unmap and release the app */
		_unmap_pages(zt, zt->pages, zt->nump);
		zt->app_waiting = false;
		get_user(zt->next_opt.hdr.err, (int *)parg);
		_zu_wakeup_app(zt);
	}

	_zu_wait_fss(zt);
	zt->fss_waiting = false;

	/* call map here at the zuf thread so we need no locks */
	if (zt->next_opt.operation && zt->next_opt.operation < ZUS_OP_BREAK)
		_map_pages(zt, zt->pages, zt->nump, false);

	err = copy_to_user(parg, &zt->next_opt, sizeof(zt->next_opt));
	return err;

err:
	put_user(err, (int *)parg);
	return err;
}

int zufs_dispatch(struct m1fs_sb_info *sbi, int operation, uint pgoffset,
		  struct page **pages, uint nump, u64 filepos, uint len)
{
	int cpu = smp_processor_id();
	struct zufs_thread *zt;

	if ((cpu < 0) || (sbi->_max_zts <= cpu))
		return -ERANGE;

	zt = &sbi->_all_zt[cpu];
	if (unlikely(!zt->file))
		return -EIO;

	while (!zt->fss_waiting) {
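Seen from user space, the kernel code above implies a simple loop for each ZT: report the result of the previous operation, sleep until the next one arrives, execute it, repeat. The sketch below shows that loop under stated assumptions. The struct zu_op_model layout is hypothetical (only a leading int error field is suggested by the get_user call on parg), ZU_OP_BREAK_MODEL merely stands in for ZUS_OP_BREAK, and the ioctl is hidden behind a function pointer so the loop can be tried against a scripted fake instead of ZUF.

```c
#include <stddef.h>

#define ZU_OP_BREAK_MODEL 100 /* hypothetical stand-in for ZUS_OP_BREAK */

/* Hypothetical layout; the real next_opt structure is not shown. */
struct zu_op_model {
	int err;       /* in: result of the previous operation */
	int operation; /* out: next operation, 0 if none */
};

/* Stand-in for ioctl(fd, IOCTL_ZU_WAIT_OPT, op): hands op->err back
 * for the previous operation and blocks until the next one arrives. */
typedef int (*zu_wait_fn)(void *ctx, struct zu_op_model *op);
typedef int (*zu_handler_fn)(const struct zu_op_model *op);

/* Runs one ZT until a break operation or a wait failure.
 * Returns the number of operations executed. */
static int zt_serve(zu_wait_fn wait, void *ctx, zu_handler_fn handle)
{
	struct zu_op_model op = { 0, 0 };
	int served = 0;

	for (;;) {
		if (wait(ctx, &op) != 0)
			break;
		if (op.operation >= ZU_OP_BREAK_MODEL)
			break;
		if (op.operation == 0) {
			op.err = 0; /* woken without work */
			continue;
		}
		op.err = handle(&op);
		served++;
	}
	return served;
}

/* Scripted fake of the kernel side, for trying the loop without ZUF. */
struct fake_kernel {
	const int *script; /* operations to hand out, ending with a break */
	size_t next;
	int results[8];    /* results reported back by the server */
	size_t nresults;
};

static int fake_wait(void *ctx, struct zu_op_model *op)
{
	struct fake_kernel *k = ctx;

	if (k->next > 0 && k->nresults < 8)
		k->results[k->nresults++] = op->err; /* previous result */
	op->operation = k->script[k->next++];
	return 0;
}

/* Arbitrary handler for illustration: the result is the negated op. */
static int negate_handler(const struct zu_op_model *op)
{
	return -op->operation;
}
```

Passing the result of one operation and fetching the next in a single call is the point of the design: each operation costs the server thread exactly one trip into the kernel.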


Abstract

• FUSE has enabled user-space file systems since kernel 2.6.14. It is a widely popular vehicle for rapid development, and tens of file systems have used it to date. FUSE is asynchronous in nature and relies heavily on the operating system page cache. It was designed with hard-drive latency in mind and was measured to add a penalty of 12.5 to 1000 microseconds [us], depending on the load.
• Emerging persistent memory technologies, such as NVDIMM-N and 3D XPoint / MRAM / ReRAM based NVDIMM, operate at near-memory speed and require a different user-space file system mechanism, one that is tuned for latency.
• The motivation of this work is to enable a new breed of user-mode work based on the above technologies, which typically respond within a single microsecond, faster than any caching, redundant data copying and queuing.
• ZUFS, pronounced Zoo-FS, stands for Zero-copy User-mode FS. It is a new kernel project designed to fill that gap.