Linux Virtual File System
Peter J. Braam (CMU)

Aims
• Present the data structures in the Linux VFS
• Provide information about flow of control
• Describe the methods and invariants needed to implement a new file system
• Illustrate with some examples

History
• BSD implemented VFS for NFS: the aim was to dispatch to different filesystems
• VMS had an elaborate filesystem
• NT and Windows 95 have VFS-type interfaces
• Newer systems integrate VM with the buffer cache

Linux Filesystems
• Media based
– ext2 - Linux native
– ufs - BSD
– fat - DOS FS
– vfat - Win 95
– hpfs - OS/2
– minix - well…
– isofs - CDROM
– sysv - SysV Unix
– hfs - Macintosh
– affs - Amiga Fast FS
– NTFS - NT’s FS
– adfs - Acorn/StrongARM
• Network
– nfs
– Coda
– AFS - Andrew FS
– smbfs - LAN Manager
– ncpfs - Novell
• Special ones
– procfs - /proc
– umsdos - Unix in DOS
– userfs - redirector to user

Linux Filesystems (ctd)
• Forthcoming:
– devfs - device file system
– DFS - DCE distributed FS
• Varia:
– cfs - crypt filesystem
– cfs - cache filesystem
– ftpfs - ftp filesystem
– mailfs - mail filesystem
– pgfs - Postgres versioning file system
• Linux serves (unrelated to the VFS!)
– NFS - user & kernel
– Coda
– AppleShare - netatalk/CAP
– SMB - samba
– NCP - Novell

Usefulness
"Linux is Obsolete" (Andrew Tanenbaum)

Linux VFS
• Multiple interfaces build up the VFS:
– files
– dentries
– inodes
– superblock
– quota
• VFS can do all caching & provides utility functions to the FS
• FS provides methods to VFS; many are optional

User level file access
• Typical user level types and code:
– pathnames: "/myfile"
– file descriptors: fd = open("/myfile", …)
– attributes in struct stat: stat("/myfile", &mybuf), chmod, chown, ...
– offsets: write, read, lseek
– directory handles: DIR *dh = opendir("/mydir")
– directory entries: struct dirent *ent = readdir(dh)

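The user level types listed above can be exercised with a few POSIX calls. This is a minimal sketch, not part of the original slides; the helper names write_and_stat and dir_contains are made up for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `text` to `path` through a file descriptor, then ask stat()
 * for the file size. Returns the size, or -1 on any error. */
long write_and_stat(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    if (n != (ssize_t)len)
        return -1;

    struct stat st;                 /* attributes in struct stat */
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Walk directory `dir` with a directory handle and report whether it
 * has an entry called `name`: 1 if found, 0 if not, -1 on error. */
int dir_contains(const char *dir, const char *name)
{
    DIR *dh = opendir(dir);
    if (dh == NULL)
        return -1;

    int found = 0;
    struct dirent *ent;
    while ((ent = readdir(dh)) != NULL) {
        if (strcmp(ent->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(dh);
    return found;
}
```

Each call ends up in the VFS, which resolves the pathname or file descriptor and dispatches to the file system holding the file.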
VFS
• Manages kernel level file abstractions in one format for all file systems
• Receives system call requests from user level (e.g. write, open, stat, link)
• Interacts with a specific file system based on mount point traversal
• Receives requests from other parts of the kernel, mostly from memory management

File system level
• Individual file systems
– responsible for managing file & directory data
– responsible for managing metadata: timestamps, owners, protection etc.
– translate data between
  • particular FS data: e.g. disk data, NFS data, Coda/AFS data
  • VFS data: attributes etc. in standard format
– e.g. nfs_getattr(…) returns attributes in VFS format and acquires attributes in NFS format to do so

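The translation step can be sketched as a function that converts FS-specific attributes into one common layout. Both structs and the function demo_getattr are hypothetical stand-ins, not the real NFS or VFS definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical FS-specific format: how attributes might arrive from
 * a server, with the 64-bit size split into two 32-bit words. */
struct demo_wire_attr {
    uint32_t mode;
    uint32_t uid, gid;
    uint32_t size_hi, size_lo;
    uint32_t mtime_sec;
};

/* Hypothetical common format that a VFS would hand to its callers. */
struct demo_vfs_attr {
    unsigned int mode;
    unsigned int uid, gid;
    uint64_t size;
    long mtime;
};

/* Translate FS data into the common format. */
void demo_getattr(const struct demo_wire_attr *in, struct demo_vfs_attr *out)
{
    out->mode  = in->mode;
    out->uid   = in->uid;
    out->gid   = in->gid;
    out->size  = ((uint64_t)in->size_hi << 32) | in->size_lo;
    out->mtime = (long)in->mtime_sec;
}
```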
Anatomy of stat system call

    sys_stat(path, buf)
    {
        dentry = namei(path);                     /* establish VFS data */
        if ( dentry == NULL )
            return -ENOENT;

        inode = dentry->d_inode;
        rc = inode->i_op->i_permission(inode);    /* call into inode layer of filesystem */
        if ( rc ) {
            dput(dentry);                         /* drop the reference taken by namei */
            return -EPERM;
        }

        rc = inode->i_op->i_getattr(inode, buf);  /* call into inode layer of filesystem */
        dput(dentry);
        return rc;
    }

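The dispatch through inode->i_op can be tried out in user space. The following is a toy model under stated assumptions: all types are simplified stand-ins for the kernel structures, the dentry is passed in directly instead of coming from namei, and a plain counter replaces the reference taken by namei and dropped by dput.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode;
struct demo_stat { long size; };

/* Method table supplied by the file system. */
struct inode_operations {
    int (*permission)(struct inode *inode);
    int (*getattr)(struct inode *inode, struct demo_stat *buf);
};

struct inode {
    long i_size;
    int i_readable;                        /* toy permission bit */
    const struct inode_operations *i_op;
};

struct dentry {
    struct inode *d_inode;
    int d_count;                           /* use count */
};

/* Methods of a hypothetical file system. */
int demo_permission(struct inode *inode)
{
    return inode->i_readable ? 0 : -EPERM;
}

int demo_getattr(struct inode *inode, struct demo_stat *buf)
{
    buf->size = inode->i_size;
    return 0;
}

const struct inode_operations demo_iops = { demo_permission, demo_getattr };

struct inode demo_make_inode(long size, int readable)
{
    struct inode ino = { size, readable, &demo_iops };
    return ino;
}

/* VFS side: take a reference, call into the FS, drop the reference
 * on every exit path. */
int demo_stat_call(struct dentry *dentry, struct demo_stat *buf)
{
    if (dentry == NULL)
        return -ENOENT;
    dentry->d_count++;                     /* namei would do this */

    struct inode *inode = dentry->d_inode;
    int rc = inode->i_op->permission(inode);
    if (rc == 0)
        rc = inode->i_op->getattr(inode, buf);

    dentry->d_count--;                     /* dput */
    return rc;
}
```

The VFS code never looks inside the file system's data; it only follows the method table.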
Anatomy of fstatfs system call

    sys_fstatfs(fd, buf)
    {
        /* for things like "df" */
        file = fget(fd);                              /* translate fd to VFS data structure */
        if ( file == NULL )
            return -EBADF;

        superb = file->f_dentry->d_inode->i_super;
        rc = superb->sb_op->sb_statfs(superb, buf);   /* call into superblock layer of filesystem */
        fput(file);                                   /* release the reference taken by fget */
        return rc;
    }

Data structures
• VFS data structures for:
– VFS handle to the file: inode (BSD: vnode)
– User instantiated file handle: file (BSD: file)
– The whole filesystem: superblock (BSD: vfs)
– A name to inode translation: dentry

Shorthand method notation
• super block methods: sss_methodname
• inode methods: iii_methodname
• dentry methods: ddd_methodname
• file methods: fff_methodname
• instead of inode->i_op->lookup we write iii_lookup

namei

    struct dentry *namei(parent, name)              /* VFS */
    {
        if (dentry = d_lookup(parent, name))        /* FS: ddd_hash(parent, name) */
            ddd_revalidate(dentry)                  /* FS */
        else
            iii_lookup(parent, name)                /* FS */
    }

    struct inode *iget(ino, dev)                    /* VFS */
    {
        /* try cache, else: */
        sss_read_inode(…)                           /* FS */
    }

Superblocks
• Handle metadata only (attributes etc.)
• Responsible for retrieving and storing metadata from the FS media or peers
• Superblock structs hold things like:
– device, blocksize, dirty flags, list of dirty inodes
– super operations
– wait queue
– pointer to the root inode of this FS

Super Operations (sss_)
• Ops on inodes:
– read_inode
– put_inode
– write_inode
– delete_inode
– clear_inode
– notify_change
• Superblock manipulations:
– read_super (mount)
– put_super (unmount)
– write_super (sync, unmount)
– statfs (attributes)

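Since many methods are optional, VFS-side code has to check a method pointer before calling it. A minimal sketch with simplified, hypothetical types; returning -ENOSYS for a missing statfs is a choice made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct super_block;
struct demo_statfs { long f_bfree; };

/* Cut-down method table; a NULL entry means "not provided". */
struct super_operations {
    void (*write_super)(struct super_block *sb);
    int (*statfs)(struct super_block *sb, struct demo_statfs *buf);
};

struct super_block {
    int s_dirt;                             /* dirty flag */
    long s_free_blocks;
    const struct super_operations *s_op;
};

/* VFS side: only call methods the file system supplies. */
int demo_vfs_statfs(struct super_block *sb, struct demo_statfs *buf)
{
    if (sb->s_op == NULL || sb->s_op->statfs == NULL)
        return -ENOSYS;
    return sb->s_op->statfs(sb, buf);
}

void demo_vfs_sync_super(struct super_block *sb)
{
    if (sb->s_dirt && sb->s_op != NULL && sb->s_op->write_super != NULL)
        sb->s_op->write_super(sb);
}

/* A hypothetical FS providing both methods, and one providing none. */
void demo_fs_write_super(struct super_block *sb)
{
    sb->s_dirt = 0;                         /* pretend it was written out */
}

int demo_fs_statfs(struct super_block *sb, struct demo_statfs *buf)
{
    buf->f_bfree = sb->s_free_blocks;
    return 0;
}

const struct super_operations demo_full_sops = { demo_fs_write_super, demo_fs_statfs };
const struct super_operations demo_minimal_sops = { NULL, NULL };
```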
Inodes
• Inodes are the VFS abstraction for the file
• Inode has operations (iii_methods)
• VFS maintains an inode cache, NOT the individual FS’s (compare NT, BSD etc.)
• Inodes contain an FS specific area where:
– ext2 stores disk block numbers etc.
– AFS would store the FID
• Extraordinary inode ops are good for dealing with stale NFS file handles etc.

What’s inside an inode - 1
• caching:
– list_head i_hash
– list_head i_list
– list_head i_dentry
– int i_count
• identifies file:
– long i_ino
– int i_dev
• usual stuff:
– {m, a, c}time
– {u, g}id
– mode
– size
– nlink

What’s inside an inode - 2
• which FS:
– superblock i_sb
– inode_ops i_op
• for mmap, networking, waiting:
– wait objects, semaphore lock
– vm_area_struct
– pipe/socket info
– page information
• FS specific info (block numbers, fids etc.):
– union { ext2fs_inode_info i_ext2; nfs_inode_info i_nfs; coda_inode_info i_coda; ... } u

Inode state
• Inode can be on one or two lists:
– (hash & in_use) or (hash & dirty) or unused
– inode has a use count i_count
• Transitions
– unused → hash: iget calls sss_read_inode
– dirty → in_use: sss_write_inode
– hash → unused: calls sss_clear_inode, but if i_nlink = 0, iput calls sss_delete_inode when i_count falls to 0

Inode Cache
Players:
1. iget: if i_count>0 ++
2. iput: if i_count>1 --
3. free_inodes
4. syncing inodes
[Diagram: inode_hashtable holding used and dirty inodes, plus a list of unused inodes, with FS storage behind each transition]
– sss_read_inode (iget): FS storage to used inodes
– mark_inode_dirty: used inodes to dirty inodes (media fs only)
– sss_write_inode (sync one): dirty inodes to FS storage
– sss_clear_inode (freeing inodes) or sss_delete_inode (iput): to unused inodes

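The inode states and transitions can be modelled as a small state machine. This is a simplified user-space sketch, not kernel code: the three states collapse the list membership described above, and counters stand in for the sss_ methods a real file system would supply.

```c
#include <assert.h>

enum demo_state { DEMO_UNUSED, DEMO_IN_USE, DEMO_DIRTY };

struct demo_inode {
    int i_count;                          /* use count */
    int i_nlink;                          /* 0 once the file is unlinked */
    enum demo_state state;
    int reads, writes, clears, deletes;   /* stand-ins for sss_ calls */
};

/* unused -> hash: the first user makes the FS read the inode. */
void demo_iget(struct demo_inode *ino)
{
    if (ino->state == DEMO_UNUSED) {
        ino->reads++;                     /* sss_read_inode */
        ino->state = DEMO_IN_USE;
    }
    ino->i_count++;
}

void demo_mark_dirty(struct demo_inode *ino)
{
    if (ino->state == DEMO_IN_USE)
        ino->state = DEMO_DIRTY;
}

/* dirty -> in_use: write the inode back. */
void demo_sync(struct demo_inode *ino)
{
    if (ino->state == DEMO_DIRTY) {
        ino->writes++;                    /* sss_write_inode */
        ino->state = DEMO_IN_USE;
    }
}

/* Last iput of an unlinked file deletes it; otherwise it stays cached. */
void demo_iput(struct demo_inode *ino)
{
    assert(ino->i_count > 0);
    if (--ino->i_count == 0 && ino->i_nlink == 0) {
        ino->deletes++;                   /* sss_delete_inode */
        ino->state = DEMO_UNUSED;
    }
}

/* hash -> unused: reclaim a cached inode nobody uses. Returns 1 if freed. */
int demo_free(struct demo_inode *ino)
{
    if (ino->i_count > 0 || ino->state == DEMO_UNUSED)
        return 0;
    demo_sync(ino);                       /* write back first if dirty */
    ino->clears++;                        /* sss_clear_inode */
    ino->state = DEMO_UNUSED;
    return 1;
}
```

Note that an inode whose count drops to 0 stays cached until it is reclaimed, so a later iget need not go back to the FS.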
Sales
Red Hat Software sold 240,000 copies of Red Hat Linux in 1997 and expects to reach 400,000 in 1998.
Estimates of installed servers (InfoWorld):
- Linux: 7 million
- OS/2: 5 million
- Macintosh: 1 million

Inode operations (iii_)
• lookup: return inode
– calls iget
• creation/removal
– create
– link
– unlink
– symlink
– mkdir
– rmdir
– mknod
– rename
• symbolic links
– readlink
– follow link
• pages
– readpage, writepage, updatepage - read or write page. Generic for media fs.
– bmap - return disk block number of logical block
• special operations
– revalidate - see dentry section
– truncate
– permission

Dentry world
• Dentry is a name to inode translation structure
• Cached aggressively by VFS
• Eliminates lookups by FS & private caches
– timing on Coda FS: ls -lR of 1000 files after priming cache
  • linux 2.0.32: 7.2 secs
  • linux 2.1.92: 0.6 secs
– disk fs: less benefit, NFS even more
• Negative entries!
• Namei is dramatically simplified

Inside dentries
• name
• pointer to inode
• pointer to parent dentry
• list head of children
• chains for lots of lists
• use count

Dentry associated lists
[Diagram with legend for inodes, dentries, list heads and pointers]
• inode/dentry relationship
– inode i_dentry list head, dentry d_inode pointer, d_alias chains
– place: d_instantiate
– remove: dentry_iput
• dentry tree relationship
– d_parent pointer, d_child chains
– place: d_alloc
– remove: d_prune, d_invalidate, d_put

Dcache
[Diagram: dentry_hashtable (d_hash chains, list heads indexed by dhash(parent, name)) and unused dentries (d_lru chains); namei, iii_lookup and d_add place entries; prune, d_invalidate and d_drop remove them]
• namei tries cache: d_lookup
– ddd_compare
• Success: ddd_revalidate
– d_invalidate if fails
– proceed if success
• Failure: iii_lookup - find inode
– iget
  • sss_read_inode
– finish:
  • d_add
– can give negative entry in dcache

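The effect of the cache, including negative entries, shows up even in a toy version. A minimal sketch under stated assumptions: a flat array replaces the hash chains, there is a single directory, inode number 0 marks a negative entry, and demo_fs_lookup is a hypothetical file system that knows only one name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CACHE_SIZE 16
#define DEMO_NAME_MAX   32

struct demo_dentry {
    int used;
    char name[DEMO_NAME_MAX];
    long ino;                             /* 0 = negative entry */
};

static struct demo_dentry demo_cache[DEMO_CACHE_SIZE];
int demo_fs_lookups;                      /* counts calls into the FS */

/* Hypothetical iii_lookup: only "passwd" exists, as inode 7. */
long demo_fs_lookup(const char *name)
{
    demo_fs_lookups++;
    return strcmp(name, "passwd") == 0 ? 7 : 0;
}

/* d_lookup: search the cache; the name comparison plays ddd_compare. */
struct demo_dentry *demo_d_lookup(const char *name)
{
    for (int i = 0; i < DEMO_CACHE_SIZE; i++)
        if (demo_cache[i].used && strcmp(demo_cache[i].name, name) == 0)
            return &demo_cache[i];
    return NULL;
}

/* d_add: enter a (possibly negative) result into the cache. */
struct demo_dentry *demo_d_add(const char *name, long ino)
{
    if (strlen(name) >= DEMO_NAME_MAX)
        return NULL;                      /* too long to cache here */
    for (int i = 0; i < DEMO_CACHE_SIZE; i++) {
        if (!demo_cache[i].used) {
            demo_cache[i].used = 1;
            strcpy(demo_cache[i].name, name);
            demo_cache[i].ino = ino;
            return &demo_cache[i];
        }
    }
    return NULL;                          /* cache full */
}

/* namei: try the cache first, ask the FS only on a miss. */
long demo_namei(const char *name)
{
    struct demo_dentry *d = demo_d_lookup(name);
    if (d != NULL)
        return d->ino;
    long ino = demo_fs_lookup(name);
    demo_d_add(name, ino);
    return ino;
}
```

A second lookup of a missing name is answered from the negative entry without calling into the FS again.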
Dentry methods
• ddd_revalidate: can force new lookup
• ddd_hash: compute hash value of name
• ddd_compare: are names equal?
• ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity

Dentry particulars
• ddd_hash and ddd_compare have to deal with extraordinary cases for msdos/vfat:
– case insensitive
– long and short filename pleasantries
• ddd_revalidate - can force new lookup if inode not in use:
– used for NFS/SMBfs aging
– used for Coda/AFS callbacks

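For a case insensitive file system, the hash and compare methods must agree: names that compare equal have to hash to the same value. A minimal sketch with hypothetical function names; returning 0 for a match follows the strcmp convention and is an assumption of this example, and the long/short filename handling of vfat is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hash a name after folding each character to lower case, so that
 * "README.TXT" and "readme.txt" land on the same hash chain. */
unsigned long demo_ci_hash(const char *name, size_t len)
{
    unsigned long h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + (unsigned long)tolower((unsigned char)name[i]);
    return h;
}

/* Compare two names ignoring case. Returns 0 when they match. */
int demo_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 1;
    for (size_t i = 0; i < alen; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 1;
    return 0;
}
```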
Style
"Dijkstra probably hates me" (Linus Torvalds)

Memory mapping
• vm_area structure has
– vm_operations
– inode, addresses etc.
• vm_operations
– map, unmap
– swapin, swapout
– nopage - read when page isn’t in VM
• mmap
– calls on iii_readpage
– keeps a use count on the inode until unmap

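From user level, this path is exercised by mapping a file and touching its pages. A minimal sketch using POSIX mmap; the helper name read_via_mmap is made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy up to cap-1 bytes of the file at `path` into `out` through a
 * read-only mapping and NUL-terminate the result.
 * Returns the number of bytes copied, or -1 on error or empty file. */
long read_via_mmap(const char *path, char *out, size_t cap)
{
    if (cap == 0)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    size_t n = len < cap - 1 ? len : cap - 1;
    memcpy(out, p, n);         /* touching the pages brings them in */
    out[n] = '\0';

    munmap(p, len);            /* the mapping's hold on the file ends here */
    return (long)n;
}
```

Closing the descriptor before reading works because the mapping itself keeps the file in use until munmap.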