ZFS The Last Word in Filesystem lwhsu 2019

ZFS The Last Word in Filesystem. Credits: lwhsu (2019, CC-BY), tzute (2018), ? (?-2018), Philip Paeps <Philip@FreeBSD.org> (CC-BY), Benedict Reuschling <bcr@FreeBSD.org> (CC-BY)

RAID
q Redundant Array of Independent Disks
q A group of drives glued into one

Common RAID types
q JBOD
q RAID 0
q RAID 1
q RAID 5
q RAID 6
q RAID 10
q RAID 50
q RAID 60

JBOD (Just a Bunch Of Disks)
https://zh.wikipedia.org/zh-tw/RAID

RAID 0 (Stripe)
https://zh.wikipedia.org/zh-tw/RAID

RAID 0 (Stripe)
q Stripes data across multiple devices
q Increases read/write speed
q Data is lost if ANY of the devices fails

RAID 1 (Mirror)
https://zh.wikipedia.org/zh-tw/RAID

RAID 1 (Mirror)
q Devices contain identical data
q 100% redundancy
q Faster reads (but writes might be slower)

RAID 5
https://zh.wikipedia.org/zh-tw/RAID

RAID 5
q Distributed single parity: survives one drive failure
q Slower than RAID 0 / RAID 1
q Higher CPU usage

RAID 6
https://zh.wikipedia.org/zh-tw/RAID

RAID 6
q Slower than RAID 5
q Uses two different error-correcting codes (double parity: survives two drive failures)
q Usually implemented in hardware

RAID 10
q RAID 1+0 (a stripe of mirrors)
https://zh.wikipedia.org/zh-tw/RAID

RAID 50?
https://www.icc-usa.com/wp-content/themes/icc_solutions/images/raid-calculator/raid-50.png

RAID 60?
https://www.icc-usa.com/wp-content/themes/icc_solutions/images/raid-calculator/raid-60.png

Issues of RAID
q https://en.wikipedia.org/wiki/RAID#Weaknesses
• Correlated failures
Ø Use different batches of drives!
• Unrecoverable read errors during rebuild
• Increasing rebuild time and failure probability
• Atomicity, including parity inconsistency due to system crashes
• Write-cache reliability
q Know the limitations and make decisions for your scenario

Software Implementations
q Linux – mdadm
q FreeBSD – GEOM classes
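For reference, a minimal sketch of what these look like in practice (the disk names ada1/ada2 and sdb/sdc and the gm0 label are placeholders, not from the original slides). FreeBSD, using the GEOM mirror class:
# gmirror load
# gmirror label -v gm0 /dev/ada1 /dev/ada2
# newfs /dev/mirror/gm0
# mount /dev/mirror/gm0 /mnt
The rough mdadm equivalent on Linux:
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
# mkfs.ext4 /dev/md0
# mount /dev/md0 /mnt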

Here comes ZFS

Why ZFS?
q Filesystem is always consistent
• Never overwrites an existing block (transactional copy-on-write)
• State atomically advances at checkpoints
• Metadata redundancy and data checksums
q Snapshots (ro) and clones (rw) are cheap and plentiful
q Flexible configuration
• Stripe, mirror, single/double/triple parity RAID-Z
q Fast remote replication and backups
q Scalable (the first 128-bit filesystem)
q SSD and memory friendly
q Easy administration (2 commands: zpool & zfs)
https://www.bsdcan.org/2015/schedule/events/525.en.html

End-to-end data integrity
q Disks
q Controllers
q Cables
q Firmware
q Device drivers
q Non-ECC memory

Disk block checksums
q Checksums are stored with the data blocks
q Any self-consistent block will have a correct checksum
q Can't even detect stray writes
q Inherently limited to single filesystems or volumes
Disk block checksums only validate media:
✓ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite

ZFS data authentication
q Checksums are stored in parent block pointers
q Fault isolation between data and checksum
q Entire storage pool is a self-validating Merkle tree
ZFS data authentication validates the entire I/O path:
✓ Bit rot
✓ Phantom writes
✓ Misdirected reads and writes
✓ DMA parity errors
✓ Driver bugs
✓ Accidental overwrite

Traditional storage architecture
q Single partition or volume per filesystem
q Each filesystem has limited I/O bandwidth
q Filesystems must be manually resized
q Storage is fragmented

ZFS pooled storage
q No partitions required
q Storage pool grows automatically
q All I/O bandwidth is always available
q All storage in the pool is shared

Copy-on-write transactions

Simple administration
Only two commands:
1. Storage pools: zpool
• Add and replace disks
• Resize pools
2. Filesystems: zfs
• Quotas, reservations, etc.
• Compression and deduplication
• Snapshots and clones
• atime, readonly, etc.

Storage Pools

ZFS Pools
q ZFS is not just a filesystem
q ZFS = filesystem + volume manager
q Works out of the box
q "Z"uper "z"imple to create
q Controlled with a single command:
• zpool

ZFS Pool Components
q A pool is created from "Virtual Devices" (vdevs)
q disk: A real disk (typically under /dev)
q file: A file
q mirror: Two or more disks mirrored together
q raidz1/2/3: Three or more disks with single/double/triple parity (RAID 5/6-like)
q spare: A spare drive
q log: A write log device (ZIL SLOG; typically SSD)
q cache: A read cache device (L2ARC; typically SSD)
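A small sketch of a pool combining several of the vdev types listed above (md0..md4 are placeholder devices; in a real deployment the log and cache devices would be SSDs):
# zpool create tank mirror /dev/md0 /dev/md1
# zpool add tank log /dev/md2
# zpool add tank cache /dev/md3
# zpool add tank spare /dev/md4
# zpool status tank
zpool status will then show the mirror, log, cache and spare sections separately.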

RAID in ZFS
q Dynamic stripe: intelligent RAID 0
• zfs copies=1 | 2 | 3
q Mirror: RAID 1
q RAID-Z1: improved RAID 5 (single parity)
q RAID-Z2: improved RAID 6 (double parity)
q RAID-Z3: triple parity
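A minimal sketch of the copies property mentioned above (the dataset name is illustrative):
# zfs set copies=2 tank/important
# zfs get copies tank/important
Extra copies protect against isolated bad blocks on a disk, but not against losing the whole disk; that still requires mirror or raidz redundancy.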

Storage pools: Creating storage pools (1/2)
To create a storage pool named "tank" from a single disk:
# zpool create tank /dev/md0
After creating a storage pool, ZFS will automatically:
q Create a filesystem with the same name (e.g. tank)
q Mount the filesystem under that name (e.g. /tank)
ZFS can use disks directly. There is no need to create partitions or volumes. The storage is immediately available.

Storage pools: Creating storage pools (2/2)
All configuration is stored with the storage pool and persists across reboots. No need to edit /etc/fstab.

# mount | grep tank
# ls -al /tank
ls: /tank: No such file or directory
# zpool create tank /dev/md0
# mount | grep tank
tank on /tank (zfs, local, nfsv4acls)
# ls -al /tank
total 9
drwxr-xr-x   2 root  wheel   2 Oct 12 12:17 .
drwxr-xr-x  23 root  wheel  28 Oct 12 12:17 ..
# reboot
[...]
# mount | grep tank
tank on /tank (zfs, local, nfsv4acls)

Storage pools: Displaying pool status

# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1016G    83K  1016G        -         -     0%     0%  1.00x  ONLINE  -

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          md0       ONLINE       0     0     0

errors: No known data errors

Storage pools: Displaying I/O statistics
ZFS contains a built-in tool to display I/O statistics. Given an interval in seconds, statistics will be displayed continuously until the user interrupts with Ctrl+C. Use -v (verbose) to display more detailed statistics.

# zpool iostat 5
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank          83K  1016G      0      0    234    841

# zpool iostat -v
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank          83K  1016G      0      0    206    739
  md0         83K  1016G      0      0    206    739
----------  -----  -----  -----  -----  -----  -----

Storage pools: Destroying storage pools
Destroying a storage pool is a constant-time operation. If you want to get rid of your data, ZFS will help you do it very quickly!
All data on a destroyed pool will be irretrievably lost.

# time zpool create tank /dev/md0
        0.06 real         0.00 user         0.02 sys
# time zpool destroy tank
        0.09 real         0.00 user         0.00 sys

Storage pools: Creating stripes
A pool with just one disk does not provide any redundancy, capacity or even adequate performance.
Stripes offer higher capacity and better performance (reading will be parallelised), but they provide no redundancy.

# zpool create tank /dev/md0 /dev/md1
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          md0       ONLINE       0     0     0
          md1       ONLINE       0     0     0

errors: No known data errors

# zpool list
NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH
tank  1.98T    86K  1.98T    0%  1.00x  ONLINE

Storage pools: Creating mirrors (RAID-1)
Mirrored storage pools provide redundancy against disk failures and better read performance than single-disk pools. However, mirrors only have 50% of the capacity of the underlying disks.

# zpool create tank mirror /dev/md0 /dev/md1
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0

errors: No known data errors

# zpool list
NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH
tank  1016G    93K  1016G    0%  1.00x  ONLINE

Storage pools: Creating raidz groups

# zpool create tank \
>     raidz1 /dev/md0 /dev/md1 /dev/md2 /dev/md3
# zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
            md3     ONLINE       0     0     0

errors: No known data errors

Storage pools: Combining vdev types
Single disks, stripes, mirrors and raidz groups can be combined in a single storage pool.
ZFS will complain when adding devices would make the pool less redundant.
`zpool add log/cache/spare`

# zpool create tank mirror /dev/md0 /dev/md1
# zpool add tank /dev/md2
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses mirror and new vdev is disk

# zpool create tank \
>     raidz2 /dev/md0 /dev/md1 /dev/md2 /dev/md3
# zpool add tank \
>     raidz /dev/md4 /dev/md5 /dev/md6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 2 device parity and new vdev uses 1

Storage pools: Increasing storage pool capacity
More devices can be added to a storage pool to increase capacity without downtime.
Data will be striped across the disks, increasing performance, but there will be no redundancy. If any disk fails, all data is lost!

# zpool create tank /dev/md0
# zpool add tank /dev/md1
# zpool list
NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH
tank  1.98T   233K  1.98T    0%  1.00x  ONLINE

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          md0       ONLINE       0     0     0
          md1       ONLINE       0     0     0

errors: No known data errors

Storage pools: Creating a mirror from a single-disk pool (1/4)
A storage pool consisting of only one device can be converted to a mirror. In order for the new device to mirror the data of the already existing device, the pool needs to be "resilvered". This means that the pool synchronises both devices to contain the same data at the end of the resilver operation. During resilvering, access to the pool will be slower, but there will be no downtime.

Storage pools: Creating a mirror from a single-disk pool (2/4)

# zpool create tank /dev/md0
# zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          md0       ONLINE       0     0     0

errors: No known data errors

# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1016G    93K  1016G        -         -     0%     0%  1.00x  ONLINE  -

Storage pools: Creating a mirror from a single-disk pool (3/4)
q `zpool attach`

# zpool attach tank /dev/md0 /dev/md1
# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Oct 12 13:55:56 2018
        5.03M scanned out of 44.1M at 396K/s, 0h1m to go
        5.03M resilvered, 11.39% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0  (resilvering)

errors: No known data errors

Storage pools: Creating a mirror from a single-disk pool (4/4)

# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 44.2M in 0h1m with 0 errors on Fri Oct 12 13:56:29 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0

errors: No known data errors

# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1016G  99.5K  1016G        -         -     0%     0%  1.00x  ONLINE  -

zpool command: zpool(8)
q zpool list: list all the zpools
q zpool create/destroy: create/destroy a zpool
q zpool status [pool name]: show the status of a zpool
q zpool scrub: try to discover silent errors or hardware failure
q zpool history [pool name]: show all the history of a zpool
q zpool add <pool name> <vdev>: add additional capacity into a pool
q zpool export/import [pool name]: export or import a given pool
q zpool set/get <properties/all>: set or show zpool properties
q zpool online/offline <pool name> <vdev>: set a device in a zpool to the online/offline state
q zpool attach/detach <pool name> <device> <new device>: attach a new device to / detach a device from a zpool
q zpool replace <pool name> <old device> <new device>: replace an old device with a new device

Zpool properties
`zpool get all zroot`

NAME   PROPERTY                VALUE                  SOURCE
zroot  size                    460G                   -
zroot  capacity                4%                     -
zroot  altroot                 -                      default
zroot  health                  ONLINE                 -
zroot  guid                    13063928643765267585   default
zroot  version                 -                      default
zroot  bootfs                  zroot/ROOT/default     local
zroot  delegation              on                     default
zroot  autoreplace             off                    default
zroot  cachefile               -                      default
zroot  failmode                wait                   default
zroot  listsnapshots           off                    default
zroot  feature@async_destroy   enabled                local
zroot  feature@device_removal  enabled                local

Zpool Sizing
q ZFS reserves 1/64 of pool capacity as a safeguard to protect CoW
q RAIDZ1 space = total drive capacity - 1 drive
q RAIDZ2 space = total drive capacity - 2 drives
q RAIDZ3 space = total drive capacity - 3 drives
q Dynamic stripe of 4 * 100 GB = 400 GB / 1.016 = ~390 GB
q RAIDZ1 of 4 * 100 GB = 300 GB - 1/64th = ~295 GB
q RAIDZ2 of 4 * 100 GB = 200 GB - 1/64th = ~195 GB
q RAIDZ2 of 10 * 100 GB = 800 GB - 1/64th = ~780 GB
http://cuddletech.com/blog/pivot/entry.php?id=1013

ZFS Dataset

ZFS Datasets
q Three forms:
• filesystem: just like a traditional filesystem
• volume: a block device
• snapshot: a read-only version of a filesystem or volume at a given point in time
q Datasets can be nested
q Each dataset has associated properties that can be inherited by sub-filesystems
q Controlled with a single command:
• zfs(8)

Filesystem Datasets
q Create a new dataset with:
• zfs create <pool name>/<dataset name>(/<dataset name>/…)
q A new dataset inherits the properties of its parent dataset
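A small sketch of nested datasets and property inheritance; the pool/dataset names and the lz4 value are illustrative, and the output is representative rather than copied from the slides:
# zfs create tank/home
# zfs set compression=lz4 tank/home
# zfs create tank/home/alice
# zfs get -r compression tank/home
NAME             PROPERTY     VALUE  SOURCE
tank/home        compression  lz4    local
tank/home/alice  compression  lz4    inherited from tank/home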

Volume Datasets (ZVols)
q Block storage
q Located at /dev/zvol/<pool name>/<dataset>
q Useful for:
• iSCSI
• Other non-ZFS local filesystems
• Virtual machine images
q Support "thin provisioning" ("sparse volumes")
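A minimal sketch of creating zvols (names and sizes are illustrative):
# zfs create tank/vols
# zfs create -V 10G tank/vols/disk0
# zfs create -s -V 100G tank/vols/sparse0
# ls /dev/zvol/tank/vols
disk0   sparse0
The -s flag makes a sparse ("thin provisioned") volume: space is only allocated as blocks are actually written.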

Dataset properties
$ zfs get all zroot

NAME   PROPERTY       VALUE                  SOURCE
zroot  type           filesystem             -
zroot  creation       Mon Jul 21 23:13 2014  -
zroot  used           22.6G                  -
zroot  available      423G                   -
zroot  referenced     144K                   -
zroot  compressratio  1.07x                  -
zroot  mounted        no                     -
zroot  quota          none                   default
zroot  reservation    none                   default
zroot  recordsize     128K                   default
zroot  mountpoint     none                   local
zroot  sharenfs       off                    default

zfs command: zfs(8)
q zfs set/get <prop. | all> <dataset>: set or show properties of datasets
q zfs create <dataset>: create a new dataset
q zfs destroy: destroy datasets/snapshots/clones...
q zfs snapshot: create snapshots
q zfs rollback: roll back to a given snapshot
q zfs promote: promote a clone to the origin of the filesystem
q zfs send/receive: send/receive a data stream of a snapshot

Snapshots

Snapshot
q Read-only copy of a dataset or volume
q Useful for file recovery or full dataset rollback
q Denoted by the @ symbol
q Snapshots are extremely fast (faster than deleting data!)
q Snapshots occupy (almost) no space until the original data starts to diverge
q How ZFS snapshots really work (Matt Ahrens)
• https://www.bsdcan.org/2019/schedule/events/1073.en.html

Snapshots: Creating and listing snapshots (1/2)
q A snapshot only needs an identifier
• Can be anything you like!
• A timestamp is traditional
• But you can use more memorable identifiers too…

# zfs snapshot tank/users/alice@myfirstbackup
# zfs list -t snapshot
NAME                             USED  AVAIL  REFER  MOUNTPOINT
tank/users/alice@myfirstbackup      0      -    23K  -

# zfs list -rt all tank/users/alice
NAME                             USED  AVAIL  REFER  MOUNTPOINT
tank/users/alice                  23K   984G    23K  /tank/users/alice
tank/users/alice@myfirstbackup      0      -    23K  -

Snapshots: Creating and listing snapshots (2/2)
q Snapshots save only the changes between the time they were created and the previous (if any) snapshot
q If data doesn't change, snapshots occupy zero space

# echo hello world > /tank/users/alice/important_data.txt
# zfs snapshot tank/users/alice@mysecondbackup
# zfs list -rt all tank/users/alice
NAME                              USED  AVAIL  REFER  MOUNTPOINT
tank/users/alice                 36.5K   984G  23.5K  /tank/users/alice
tank/users/alice@myfirstbackup     13K      -    23K  -
tank/users/alice@mysecondbackup      0      -  23.5K  -

Snapshots: Differences between snapshots
q ZFS can display the differences between snapshots

# touch /tank/users/alice/empty
# rm /tank/users/alice/important_data.txt
# zfs diff tank/users/alice@mysecondbackup
M       /tank/users/alice/
-       /tank/users/alice/important_data.txt
+       /tank/users/alice/empty

Character  Type of change
+          File was added
-          File was deleted
M          File was modified
R          File was renamed

Snapshots: Rolling back snapshots (1/2)
q Snapshots can be rolled back to undo changes
q All files changed since the snapshot was created will be discarded

# echo hello_world > important_file.txt
# echo goodbye_cruel_world > also_important.txt
# zfs snapshot tank/users/alice@myfirstbackup
# rm *
# ls
# zfs rollback tank/users/alice@myfirstbackup
# ls
also_important.txt   important_file.txt

Snapshots: Rolling back snapshots (2/2)
q By default, the latest snapshot is rolled back. To roll back to an older snapshot, use -r
q Note that intermediate snapshots will be destroyed
q ZFS will warn about this

# touch not_very_important.txt
# touch also_not_important.txt
# ls
also_important.txt       important_file.txt
also_not_important.txt   not_very_important.txt
# zfs snapshot tank/users/alice@mysecondbackup
# zfs diff tank/users/alice@myfirstbackup \
>     tank/users/alice@mysecondbackup
M       /tank/users/alice/
+       /tank/users/alice/not_very_important.txt
+       /tank/users/alice/also_not_important.txt
# zfs rollback tank/users/alice@myfirstbackup
# zfs rollback -r tank/users/alice@myfirstbackup
# ls
also_important.txt   important_file.txt

Snapshots: Restoring individual files
q Sometimes, we only want to restore a single file, rather than rolling back an entire snapshot
q ZFS keeps snapshots in a very hidden .zfs/snapshot directory
• It's like magic :-)
• Set snapdir=visible to unhide it
q Remember: snapshots are read-only. Copying data to the magic directory won't work!

# ls
also_important.txt   important_file.txt
# rm *
# ls .zfs/snapshot/myfirstbackup
also_important.txt   important_file.txt
# cp .zfs/snapshot/myfirstbackup/* .
# ls
also_important.txt   important_file.txt

Snapshots: Cloning snapshots
q Clones represent a writeable copy of a read-only snapshot
q Like snapshots, they occupy no space until they start to diverge

# zfs list -rt all tank/users/alice
NAME                              USED  AVAIL  REFER  MOUNTPOINT
tank/users/alice                  189M   984G   105M  /tank/users/alice
tank/users/alice@mysecondbackup      0      -   105M  -

# zfs clone tank/users/alice@mysecondbackup tank/users/eve
# zfs list tank/users/eve
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank/users/eve      0   984G   105M  /tank/users/eve

Snapshots: Promoting clones
q Snapshots cannot be deleted while clones exist
q To remove this dependency, clones can be promoted to "ordinary" datasets
q Note that by promoting the clone, it immediately starts occupying space

# zfs destroy tank/users/alice@mysecondbackup
cannot destroy 'tank/users/alice@mysecondbackup': snapshot has dependent clones
use '-R' to destroy the following datasets:
tank/users/eve

# zfs list tank/users/eve
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank/users/eve      0   984G   105M  /tank/users/eve

# zfs promote tank/users/eve
# zfs list tank/users/eve
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank/users/eve   189M   984G   105M  /tank/users/eve

Self-healing data

Traditional mirroring

Self-healing data in ZFS

Self-healing data demo: Store some important data (1/2)
q We have created a redundant pool with two mirrored disks and stored some important data on it
q We will be very sad if the data gets lost! :-(

# zfs list tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank    74K   984G    23K  /tank
# cp -a /some/important/data/ /tank/
# zfs list tank
NAME    USED  AVAIL  REFER  MOUNTPOINT
tank   3.23G   981G  3.23G  /tank

Self-healing data demo: Store some important data (2/2)

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0

errors: No known data errors

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1016G  3.51G  1012G        -         -     0%     0%  1.00x  ONLINE  -

Self-healing data demo: Destroy one of the disks (1/2)
Caution!
This example can destroy data when used on the wrong device or a non-ZFS filesystem!
Always check your backups!

# zpool export tank
# dd if=/dev/random of=/dev/md1 bs=1m count=200
# zpool import tank

Self-healing data demo: Destroy one of the disks (2/2)

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     5

errors: No known data errors

Self-healing data demo: Make sure everything is okay (1/3)

# zpool scrub tank
# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Oct 12 22:57:36 2018
        191M scanned out of 3.51G at 23.9M/s, 0h2m to go
        186M repaired, 5.32% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0 1.49K  (repairing)

errors: No known data errors

Self-healing data demo: Make sure everything is okay (2/3)

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 196M in 0h0m with 0 errors on Fri Oct 12 22:58:14 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0 1.54K

errors: No known data errors

Self-healing data demo: Make sure everything is okay (3/3)

# zpool clear tank
# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 196M in 0h0m with 0 errors on Fri Oct 12 22:58:14 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0

errors: No known data errors

Self-healing data demo: But what if it goes very wrong? (1/2)

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Fri Oct 12 22:46:01 2018
        498M scanned out of 3.51G at 99.6M/s, 0h0m to go
        19K repaired, 13.87% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0 1.48K
          mirror-0  ONLINE       0     0 2.97K
            md0     ONLINE       0     0 1.48K
            md1     ONLINE       0     0 1.48K

errors: 1515 data errors, use '-v' for a list

Self-healing data demo: But what if it goes very wrong? (2/2)

# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 19K in 0h0m with 1568 errors on Fri Oct 12 22:46:25 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0 1.53K
          mirror-0  ONLINE       0     0 3.07K
            md0     ONLINE       0     0 1.53K
            md1     ONLINE       0     0 1.53K

errors: Permanent errors have been detected in the following files:

        /tank/FreeBSD-11.2-RELEASE-amd64.vhd.xz
        /tank/base-amd64.txz
        /tank/FreeBSD-11.2-RELEASE-amd64-disc1.iso.xz
        /tank/intro_slides.pdf

Deduplication

Duplication
Intentional duplication
q Backups, redundancy
Unintentional duplication
q Application caches
q Temporary files
q Node.js (Grrr!)

Deduplication
q Implemented at the block layer
q ZFS detects when it needs to store an exact copy of a block
q Only a reference is written rather than the entire block
q Can save a lot of disk space

Deduplication: Memory cost
q ZFS must keep a table of the checksums of every block it stores
q Depending on the blocksize, this table can grow very quickly
q The deduplication table must be fast to access or writes slow down
q Ideally, the deduplication table should fit in RAM
q Keeping an L2ARC on fast SSDs can reduce the cost somewhat
Rule of thumb: 5 GB of RAM for each TB of data stored
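A rough back-of-the-envelope estimate behind that rule of thumb (the ~320 bytes per deduplication-table entry is a commonly cited approximation, not an exact constant):
  1 TB of data at 128 KB average block size ≈ 8M unique blocks
  8M blocks x ~320 bytes per DDT entry ≈ 2.5 GB of RAM
Smaller average block sizes double or quadruple the entry count, which is why ~5 GB of RAM per TB is the safer planning figure.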

Deduplication: Is it worth it? (1/2)
q The ZFS debugger (zdb) can be used to evaluate if turning on deduplication will save space in a pool
q In most workloads, compression will provide much more significant savings than deduplication
q Consider whether the cost of RAM is worth it
q Also keep in mind that it is a lot easier and cheaper to add disks to a system than it is to add memory

Deduplication demo: Is it worth it? (2/2)

# zdb -S tank
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    25.1K   3.13G   3.13G   3.13G    25.1K   3.13G   3.13G   3.13G
     2    1.48K    189M    189M    189M    2.96K    378M    378M    378M
 Total    26.5K   3.32G   3.32G   3.32G    28.0K   3.50G   3.50G   3.50G

dedup = 1.06, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.06

Deduplication demo: Control experiment (1/2)

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G  79.5K  7.50G        -         -     0%     0%  1.00x  ONLINE  -

# zfs get compression,dedup tank
NAME  PROPERTY     VALUE  SOURCE
tank  compression  off    default
tank  dedup        off    default

# for p in `seq 0 4`; do
>   zfs create tank/ports/$p
>   portsnap -d /tmp/portsnap -p /tank/ports/$p extract &
> done

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G  2.14G  5.36G        -         -     3%    28%  1.00x  ONLINE  -

Deduplication demo: Control experiment (2/2)

# zdb -S tank
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     4     131K    374M    374M    374M     656K   1.82G   1.82G   1.82G
     8    2.28K   4.60M   4.60M   4.60M    23.9K   48.0M   48.0M   48.0M
    16      144    526K    526K    526K    3.12K   10.5M   10.5M   10.5M
    32       22   23.5K   23.5K   23.5K      920    978K    978K    978K
    64        2   1.50K   1.50K   1.50K      135    100K    100K    100K
   256        1     512     512     512      265    132K    132K    132K
 Total     134K    379M    379M    379M     685K   1.88G   1.88G   1.88G

dedup = 5.09, compress = 1.00, copies = 1.00, dedup * compress / copies = 5.09

Deduplication demo: Enabling deduplication

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G  79.5K  7.50G        -         -     0%     0%  1.00x  ONLINE  -

# zfs get compression,dedup tank
NAME  PROPERTY     VALUE  SOURCE
tank  compression  off    default
tank  dedup        on     local

# for p in `seq 0 4`; do
>   zfs create tank/ports/$p
>   portsnap -d /tmp/portsnap -p /tank/ports/$p extract &
> done

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G   670M  6.85G        -         -     6%     8%  5.08x  ONLINE  -

Deduplication demo: Compare with compression

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G  79.5K  7.50G        -         -     0%     0%  1.00x  ONLINE  -

# zfs get compression,dedup tank
NAME  PROPERTY     VALUE   SOURCE
tank  compression  gzip-9  local
tank  dedup        off     default

# for p in `seq 0 4`; do
>   zfs create tank/ports/$p
>   portsnap -d /tmp/portsnap -p /tank/ports/$p extract &
> done

# zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.50G   752M  6.77G        -         -     3%     9%  1.00x  ONLINE  -

Deduplication Summary
q ZFS deduplication can save a lot of space under some workloads, but at the expense of a lot of memory
q Often, compression will give similar or better results
q Always check with zdb -S whether deduplication would be worth it

Control experiment   2.14G
Deduplication         670M
Compression           752M

Performance Tuning

General tuning tips
q System memory
q Access time
q Dataset compression
q Deduplication
q ZFS send and receive

Random Access Memory
q ZFS performance depends on the amount of system memory
• Recommended minimum: 1 GB
• 4 GB is OK
• 8 GB and more is good

Dataset compression
q Saves space
q Increases CPU usage
q Increases data throughput
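A minimal sketch of enabling compression (the dataset name, the lz4 choice and the compressratio value shown are illustrative):
# zfs set compression=lz4 tank/logs
# zfs get compression,compressratio tank/logs
NAME       PROPERTY       VALUE  SOURCE
tank/logs  compression    lz4    local
tank/logs  compressratio  2.35x  -
Only data written after the property is set gets compressed; existing blocks stay as they were.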

Deduplication
q Requires even more memory
q Increases CPU usage

ZFS send/recv
q Use a buffer for large streams
• misc/buffer
• misc/mbuffer (network capable)
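A minimal sketch of pushing a snapshot through mbuffer to a remote machine (host, pool and snapshot names plus the buffer sizes are illustrative):
# zfs snapshot tank/data@backup1
# zfs send tank/data@backup1 | mbuffer -s 128k -m 1G | \
>     ssh backuphost "zfs receive backuppool/data"
Later runs can send only the changes as an incremental stream:
# zfs send -i tank/data@backup1 tank/data@backup2 | \
>     ssh backuphost "zfs receive backuppool/data"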

Database tuning
q For PostgreSQL and MySQL users, it is recommended to use a different recordsize than the default 128K
q PostgreSQL: 8K
q MySQL MyISAM storage: 8K
q MySQL InnoDB storage: 16K
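A small sketch of applying these recordsizes (dataset names are illustrative); note that recordsize only affects files written after it is set, so set it before loading the database:
# zfs create -o recordsize=8k tank/pgdata
# zfs create -o recordsize=16k tank/mysql
# zfs get recordsize tank/pgdata tank/mysql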

File Servers
q Disable access time
q Keep the number of snapshots low
q Dedup only if you have lots of RAM
q For heavy write workloads, move the ZIL to separate SSD drives
q Optionally disable the ZIL for datasets (beware of the consequences)
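A minimal sketch of the access-time and separate-ZIL items (pool, dataset and device names are illustrative; nvd0 stands in for an SSD/NVMe device):
# zfs set atime=off tank/export
# zpool add tank log /dev/nvd0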

Webservers
q Disable redundant data caching
• Apache
Ø EnableMMAP Off
Ø EnableSendfile Off
• Nginx
Ø sendfile off
• Lighttpd
Ø server.network-backend="writev"

Cache and Prefetch

ARC (Adaptive Replacement Cache)
q Resides in system RAM
q Major speedup to ZFS
q The size is auto-tuned
Defaults:
q arc max: memory size - 1 GB
q metadata limit: ¼ of arc_max
q arc min: ½ of arc_meta_limit (but at least 16 MB)

Tuning ARC
q ARC can be disabled at the per-dataset level
q The maximum size can be limited
q Increasing arc_meta_limit may help if working with many files
q # sysctl kstat.zfs.misc.arcstats.size
q # sysctl vfs.zfs.arc_meta_used
q # sysctl vfs.zfs.arc_meta_limit
q http://www.krausam.de/?p=70
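A minimal sketch of these knobs on FreeBSD (the dataset name and size values are arbitrary examples, not recommendations):
# zfs set primarycache=metadata tank/scratch
# zfs set primarycache=none tank/scratch
The first keeps only metadata for that dataset in the ARC; the second bypasses the ARC for it entirely. To cap the ARC, set tunables in /boot/loader.conf and reboot:
vfs.zfs.arc_max="8589934592"
vfs.zfs.arc_meta_limit="2147483648"
(8 GB and 2 GB respectively, expressed in bytes.) Current usage can then be checked with the sysctls listed above.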

L2ARC
q Level 2 Adaptive Replacement Cache
• Designed to run on fast block devices (SSD)
• Helps primarily read-intensive workloads
• Each device can be attached to only one ZFS pool
q # zpool add <pool name> cache <vdevs>
q # zpool remove <pool name> <vdevs>

Tuning L2ARC
q Enable prefetch for streaming or serving of large files
q Configurable on a per-dataset basis
q Turbo warmup phase may require tuning (e.g. set to 16 MB)
q vfs.zfs.l2arc_noprefetch
q vfs.zfs.l2arc_write_max
q vfs.zfs.l2arc_write_boost

ZIL
q ZFS Intent Log
• Guarantees data consistency on fsync() calls
• Replays transactions in case of a panic or power failure
• Uses a small amount of storage space on each pool by default
q To speed up writes, deploy the ZIL on a separate log device (SSD)
q Per-dataset sync behavior can be configured:
• # zfs set sync=[standard|always|disabled] dataset
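A minimal sketch of a separate log device and the per-dataset sync settings (device and dataset names are illustrative):
# zpool add tank log mirror /dev/nvd0 /dev/nvd1
# zfs set sync=always tank/db
# zfs set sync=disabled tank/tmp
A mirrored SLOG avoids losing in-flight synchronous writes if one log device dies; sync=disabled makes fsync() return immediately, trading crash safety for speed.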

File-level Prefetch (zfetch)
q Analyses read patterns of files
q Tries to predict next reads
q Loader tunable to enable/disable zfetch: vfs.zfs.prefetch_disable

Device-level Prefetch (vdev prefetch)
q Reads data after small reads from pool devices
q Useful for drives with higher latency
q Consumes constant RAM per vdev
q Is disabled by default
q Loader tunable to enable/disable vdev prefetch: vfs.zfs.vdev.cache.size=[bytes]

ZFS Statistics Tools
q # sysctl vfs.zfs
q # sysctl kstat.zfs
Using tools:
q zfs-stats: analyzes settings and counters since boot
q zfs-mon: real-time statistics with averages
Both tools are available in ports under sysutils/zfs-stats

References
q ZFS: The last word in filesystems (Jeff Bonwick & Bill Moore)
q ZFS tuning in FreeBSD (Martin Matuška):
• Slides
Ø http://blog.vx.sk/uploads/conferences/EuroBSDcon2012/zfs-tuninghandout.pdf
• Video
Ø https://www.youtube.com/watch?v=PIpI7Ub6yjo
q Becoming a ZFS Ninja (Ben Rockwood):
• http://www.cuddletech.com/blog/pivot/entry.php?id=1075
q ZFS Administration:
• https://pthree.org/2012/12/14/zfs-administration-part-ix-copy-on-write

References (cont.)
q https://www.freebsd.org/doc/zh_TW/books/handbook/zfs-zfs.html
q "ZFS Mastery" books (Michael W. Lucas & Allan Jude)
• FreeBSD Mastery: ZFS
• FreeBSD Mastery: Advanced ZFS
q ZFS for Newbies (Dan Langille)
• https://www.youtube.com/watch?v=3oG1U5AI9A&list=PLskKNopggjc6NssLc8GEGSiFYJLYdlTQx&index=20
q The future of OpenZFS and FreeBSD (Allan Jude)
• https://www.youtube.com/watch?v=gmaHZBwDKho&list=PLskKNopggjc6NssLc8GEGSiFYJLYdlTQx&index=23
q How ZFS snapshots really work (Matt Ahrens)
• https://www.bsdcan.org/2019/schedule/events/1073.en.html