
Windows Kernel Internals II
Advanced File Systems
University of Tokyo – July 2004

Dave Probert, Ph.D.
Advanced Operating Systems Group
Windows Core Operating Systems Division
Microsoft Corporation

© Microsoft Corporation 2004

Disk Basics
  • Volume exported via device object
  • Addressed by byte offset and length
  • Enforced on sector boundaries
  • NTFS allocation unit: clusters
  • Round size down to clusters
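The arithmetic behind these bullets can be sketched as follows. This is a minimal illustration, assuming a 512-byte sector and a 4 KB cluster; both sizes and all helper names are hypothetical and do not come from the slides.

```python
SECTOR_SIZE = 512     # assumed sector size in bytes (illustrative)
CLUSTER_SIZE = 4096   # assumed NTFS cluster size in bytes (illustrative)


def is_sector_aligned(offset: int, length: int) -> bool:
    """A volume request must start on and cover whole sectors."""
    return offset % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0


def usable_clusters(volume_bytes: int) -> int:
    """Round the volume size down to a whole number of clusters."""
    return volume_bytes // CLUSTER_SIZE


def byte_offset_to_cluster(offset: int) -> tuple[int, int]:
    """Split a byte offset into (cluster number, offset within the cluster)."""
    return divmod(offset, CLUSTER_SIZE)
```

Under these assumptions a 10,000-byte volume yields two usable clusters; the trailing 1,808 bytes are never allocated.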

Storage Management
Volumes may span multiple logical disks

  Partitioning        Description                               Benefits
  spanned             logical catenation of arbitrary
                      sized volumes
  striped (RAID-0)    interleaved same-sized volumes            read/write perf
  mirrored (RAID-1)   redundant writes to same-sized volume,    reliability, read perf
                      alternate reads
  RAID-5              striped volumes w/ parity                 reliability, size, read perf
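Two of the layouts in the table reduce to simple arithmetic, sketched below. This is not how any Windows volume manager is implemented: the sketch assumes interleaving at single-block granularity and ignores parity rotation, and the function names are hypothetical.

```python
from functools import reduce


def stripe_location(logical_block: int, num_disks: int) -> tuple[int, int]:
    """RAID-0 style interleave: return (disk index, block index on that disk)."""
    return logical_block % num_disks, logical_block // num_disks


def parity(blocks: list[bytes]) -> bytes:
    """RAID-5 style parity: bytewise XOR of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Recover one lost block by XOR-ing the survivors with the parity block."""
    return parity(surviving + [parity_block])
```

Because XOR is its own inverse, the same routine that computes parity also reconstructs a missing block, which is where the reliability benefit in the RAID-5 row comes from.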

File System Device Stack
[Diagram: components on the I/O path from application to disk]
  • User mode: Application, Kernel32 / ntdll
  • Kernel mode: NT I/O Manager, File System Filters, File System Driver, Cache Manager, Virtual Memory Manager, Disk Class Manager, Disk Driver, Partition/Volume Storage Manager
  • DISK

NTFS Deals with files
  • Partition is collection of files
  • Common routines for all meta-data
  • Utilizes MM and Cache Manager
  • No specific on-disk locations

CacheManager overview
  • Cache manager
      – kernel-mode routines
      – asynchronous worker routines
      – interface between filesystems and VM mgr
  • Functionality
      – access methods for pages of file data on opened files
      – automatic asynchronous read ahead
      – automatic asynchronous write behind (lazy write)
      – supports “Fast I/O” – IRP bypass

Datastructure Layout
[Diagram: kernel structures reachable from a handle]
  • Handle → File Object
  • File Object → FS Handle Context (2), Filesystem File Context, Private Cache Map (Cc), Section Object Pointers
  • Section Object Pointers → Data Section (Mm), Image Section (Mm), Shared Cache Map (Cc)
  • File Object == Handle (U or K), not one per file
  • Section Object Pointers and FS File Context are shared per stream

Datastructures
  • File Object
      – FsContext – per physical stream context
      – FsContext2 – per user handle stream context; not all streams have handle context (metadata)
      – SectionObjectPointers – the point of “single instancing”
          • DataSection – exists if the stream has had a mapped section created (for use by Cc or user)
          • SharedCacheMap – exists if the stream has been set up for the cache manager
          • ImageSection – exists for executables
      – PrivateCacheMap – per handle Cc context (readahead) that also serves as reference from this file object to the shared cache map

Cache View Management
  • A Shared Cache Map has an array of View Access Control Block (VACB) pointers which record the base cache address of each view
      – promoted to a sparse form for files > 32 MB
  • Access interfaces map File+FileOffset to a cache address
  • Taking a view miss results in a new mapping, possibly unmapping an unreferenced view in another file (views are recycled LRU)
  • Since a view is fixed size, mapping across a view is impossible – Cc returns one address
  • Fixed size means no fragmentation …

View Mapping
[Diagram: file offsets mapped through the VACB array]

  File Offset          VACB Array entry
  0 – 256 KB           c1000000
  256 KB – 512 KB      <NULL>
  512 KB – 768 KB      cf0c0000
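The lookup that the View Mapping slide illustrates can be sketched as below. The array values are the three entries shown on the slide (c1000000, <NULL>, cf0c0000) and the 256 KB view size is taken from its offset ranges; the function names and the use of `None` for a view miss are assumptions made for this sketch, not the Cache Manager's actual code.

```python
VIEW_SIZE = 256 * 1024  # fixed view size, per the slide's offset ranges

# Base cache address of each view; None stands for the <NULL> (unmapped) entry.
vacb_array = [0xC1000000, None, 0xCF0C0000]


def cache_address(file_offset: int):
    """Map a file offset to a cache address, or None on a view miss."""
    index, offset_in_view = divmod(file_offset, VIEW_SIZE)
    base = vacb_array[index] if index < len(vacb_array) else None
    if base is None:
        return None  # view miss: a real cache would map a new view here
    return base + offset_in_view


def crosses_view(file_offset: int, length: int) -> bool:
    """True if the byte range spans more than one view."""
    return file_offset // VIEW_SIZE != (file_offset + length - 1) // VIEW_SIZE
```

A range for which `crosses_view` is true cannot be described by a single returned address, which is why callers must split such accesses at view boundaries.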

CacheManager Interface Summary
  • File objects start out unadorned
  • CcInitializeCacheMap to initiate caching via Cc on a file object
      – set up the Shared/Private Cache Map & Mm if necessary
  • Access methods (Copy, Mdl, Mapping/Pinning)
  • Maintenance Functions
  • CcUninitializeCacheMap to terminate caching on a file object
      – tear down S/P Cache Maps
      – Mm lives on. Its data section is the cache!

CacheManager / FS Diagram
[Diagram: I/O flow between the cache manager, the filesystem, the memory manager and storage]
  • Entry points: Fast IO Read/Write, IRP-based Read/Write
  • Components: Cache Manager, Filesystem, Memory Manager, Storage Drivers, Disk
  • Labeled paths: Cached IO; Cache Access, Flush, Purge; Page Fault; Noncached IO

File System Notes
  • Three basic types of IO
      – cached, non-cached, paging
  • Three file sizes
      – file size, allocation size, valid data length
  • Three worker threads
      – Mm’s modified page writer (paging file)
      – Mm’s mapped page writer (mapped files)
      – Cc’s lazy writer pool (flushes views)
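How the three file sizes relate can be sketched as follows. The sketch assumes an ordinary stream (not sparse or compressed), for which valid data length ≤ file size ≤ allocation size, and relies on the behavior that bytes between the valid data length and the file size read back as zeros. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamSizes:
    file_size: int          # logical end of file
    allocation_size: int    # bytes reserved on disk (whole clusters)
    valid_data_length: int  # bytes actually written so far

    def is_consistent(self) -> bool:
        """Assumed invariant for an ordinary stream."""
        return 0 <= self.valid_data_length <= self.file_size <= self.allocation_size


def read(sizes: StreamSizes, on_disk: bytes, offset: int, length: int) -> bytes:
    """Read from a stream: clip at file size, zero-fill past valid data length."""
    end = min(offset + length, sizes.file_size)
    if offset >= end:
        return b""
    valid_end = min(end, sizes.valid_data_length)
    from_disk = on_disk[offset:valid_end] if offset < valid_end else b""
    return from_disk + b"\x00" * (end - offset - len(from_disk))
```

For example, with a file size of 10 and a valid data length of 4, a read starting at offset 2 returns two real bytes followed by six zeros.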

Cache Manager Summary
  • Virtual block cache for files, not logical block cache for disks
  • Memory manager is the ACTUAL cache manager
  • Cache Manager context integrated into FileObjects
  • Cache Manager manages views on files in kernel virtual address space
  • I/O has special fast path for cached accesses
  • The Lazy Writer periodically flushes dirty data to disk
  • Filesystems need two interfaces to CC: map and pin

NTFS on-disk structure
Some NTFS system files
  • $Bitmap
  • $BadClus
  • $Boot
  • . (root directory)
  • $Logfile
  • $Volume
  • $MftMirr
  • $Secure

$Mft File
  • Data is entirely File Records
  • File Records are fixed size
  • Every file on volume has a File Record
  • File records are recycled
  • Reserved area for system files
  • Critical file records mirrored in $MftMirr

File Records
  • ‘Base’ file record for each file
  • Header followed by ‘Attributes’
  • Additional file records as needed
  • Update Sequence Array
  • ID by offset and sequence number
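The slide names the Update Sequence Array without describing it. The sketch below follows the commonly documented scheme, which is an assumption here rather than something the slides state: before a record is written, the last two bytes of every 512-byte stride are saved in the array and overwritten with the current update sequence number, and on read they are checked and restored. The offsets passed in are hypothetical parameters, not a fixed header layout.

```python
import struct

STRIDE = 512  # assumed protection stride in bytes


def apply_fixups(record: bytes, usa_offset: int, usa_count: int) -> bytes:
    """Verify and undo update-sequence fixups on a multi-sector record.

    usa_count covers the update sequence number plus one saved word per stride.
    """
    buf = bytearray(record)
    usa = struct.unpack_from(f"<{usa_count}H", buf, usa_offset)
    usn, saved_words = usa[0], usa[1:]
    for i, original in enumerate(saved_words):
        tail = (i + 1) * STRIDE - 2  # last two bytes of stride i
        (found,) = struct.unpack_from("<H", buf, tail)
        if found != usn:
            raise ValueError(f"torn write detected in stride {i}")
        struct.pack_into("<H", buf, tail, original)
    return bytes(buf)
```

A mismatch in any stride means the record was only partially written, so the reader can reject it instead of parsing inconsistent attributes.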

[Diagram: how file D:\Letters (File ID 0x200) is described in $Mft and laid out on disk]
  • File contents: ABCDEFGHIJKLMNOPQRSTUV
  • File $Mft (file record, base record, runs described):
      – 100, 200: JK, LM, NO
      – 200, 0: ABCDEF, GHI
      – 280, 200: PQRST, UV
  • Physical Disk (runs in disk order): PQRST, GHI, LM, UV, ABCDEF, JK, NO

File Basics
  • Timestamps
  • File attributes (DOS + NTFS)
  • Filename (+ hard links)
  • Data streams
  • ACL
  • Indexes

File Building Blocks
  • File Records
  • Ntfs Attributes
  • Allocated clusters

File Record Header
  • USA Header
  • Sequence Number
  • First Attribute Offset
  • First Free Byte and Size
  • Base File Record
  • IN_USE bit
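Because file records are recycled, the sequence number in the header lets a stale reference be told apart from a live one. The sketch below assumes the commonly documented 64-bit file reference layout (48-bit record number in the low bits, 16-bit sequence number in the high bits); that layout and the helper names are assumptions, not taken from the slides.

```python
RECORD_BITS = 48
RECORD_MASK = (1 << RECORD_BITS) - 1


def make_file_reference(record_number: int, sequence: int) -> int:
    """Pack a record number and its sequence number into one 64-bit ID."""
    return ((sequence & 0xFFFF) << RECORD_BITS) | (record_number & RECORD_MASK)


def split_file_reference(ref: int) -> tuple[int, int]:
    """Return (record number, sequence number)."""
    return ref & RECORD_MASK, ref >> RECORD_BITS


def is_stale(ref: int, current_sequence: int) -> bool:
    """A reference is stale if the record was recycled since it was taken."""
    return split_file_reference(ref)[1] != current_sequence
```

A reference taken while record 0x200 had sequence number 3 stops matching once the record is freed and reused with sequence number 4.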

NTFS Attributes
  • Type code and optional name
  • Resident or non-resident
  • Header followed by value
  • Sorted within file record
  • Common code for operations

MFT File Record
  • $STANDARD_INFORMATION (Time Stamps, DOS Attributes)
  • $FILE_NAME - VeryLongFileName.Txt
  • $FILE_NAME - VERYLO~1.TXT
  • $DATA (Default Data Stream)
  • $DATA - “VeryLongFileName.Txt:A named stream”
  • $END
  • (Available for attribute growth or new attribute)

Attribute Header
  • Length
  • Form
  • Name and name length
  • Flags (Compressed, Encrypted, Sparse)

Resident Attributes
  • Data follows attribute header
  • ‘Allocation Size’ on 8-byte boundary
  • May grow or shrink
  • Convert to non-resident
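The 8-byte rule and the resident/non-resident decision can be sketched as below. The header length and free-space figures used in any call are hypothetical inputs; the real conversion policy in NTFS is more involved than this single comparison.

```python
def quad_align(size: int) -> int:
    """Round a size up to the next 8-byte boundary."""
    return (size + 7) & ~7


def fits_resident(header_len: int, value_len: int, free_bytes: int) -> bool:
    """True if header plus value, 8-byte aligned, fits in the record's free space."""
    return quad_align(header_len + value_len) <= free_bytes
```

When a growing value no longer fits, the attribute would be converted to non-resident form and its data moved to allocated clusters.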

Non-Resident Attributes
  • Data stored in allocated disk clusters
  • May describe sub-range of stream
  • Sizes and stream properties
  • Mapping pairs for on-disk runs

Some Attribute Types
  • $STANDARD_INFORMATION
  • $FILE_NAME
  • $SECURITY_DESCRIPTOR
  • $DATA
  • $INDEX_ROOT
  • $INDEX_ALLOCATION
  • $BITMAP
  • $EA

Mapping Pairs
  • Stored in a byte optimal format
  • Represents allocation and holes
  • Each pair is relative to prior run
  • Used to represent compression/sparse
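A decoder for such a run list might look like the sketch below. It assumes the commonly documented encoding, which the slides do not spell out: each pair starts with a header byte whose low nibble gives the byte count of the run length and whose high nibble gives the byte count of a signed cluster offset relative to the previous run; an offset size of zero marks a hole, and a zero header byte ends the list.

```python
def decode_mapping_pairs(data: bytes) -> list[tuple[object, int]]:
    """Decode a run list into (starting LCN or None for a hole, length in clusters)."""
    runs = []
    pos = 0
    lcn = 0  # running logical cluster number; each offset is relative to it
    while pos < len(data) and data[pos] != 0:
        header = data[pos]
        length_size = header & 0x0F
        offset_size = header >> 4
        pos += 1
        length = int.from_bytes(data[pos:pos + length_size], "little")
        pos += length_size
        if offset_size == 0:
            runs.append((None, length))  # hole: no clusters allocated
        else:
            delta = int.from_bytes(data[pos:pos + offset_size], "little", signed=True)
            lcn += delta
            runs.append((lcn, length))
        pos += offset_size
    return runs
```

Storing only as many bytes as each value needs, and only the difference from the prior run, is what makes the format byte optimal; a negative delta simply means the next run sits earlier on the disk.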

Indexes
  • File name and view indexes
  • Indexes are B-trees
  • Entries stored at each level
  • Intermediate nodes have down pointers
  • $INDEX_ROOT
  • $INDEX_ALLOCATION
  • $BITMAP

Index Implementation
  • Top level - $INDEX_ROOT
  • Index buckets - $INDEX_ALLOCATION
  • Available buckets - $BITMAP

[Diagram: example index]
  • $INDEX_ROOT: E, J, R, end
  • Index buckets below the root: ABC, GI, NPQ, Z
  • $INDEX_ALLOCATION (bucket order): unused, GI, ABC, data, Z, NPQ
  • $BITMAP: 0x36 (00110110)
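The example index can be walked in a few lines. The sketch reads the diagram as a root holding E, J, R and an end entry, with down pointers to the buckets ABC, GI, NPQ and Z respectively, and treats bit i of $BITMAP (least significant bit first) as the in-use flag for bucket i of $INDEX_ALLOCATION. Both readings are assumptions about the diagram, and the data structures are simplified to single-letter keys.

```python
# (key, child bucket) pairs; the last entry is the keyless "end" entry.
ROOT = [("E", "ABC"), ("J", "GI"), ("R", "NPQ"), (None, "Z")]

# Bucket order as listed for $INDEX_ALLOCATION in the diagram.
INDEX_ALLOCATION = ["unused", "GI", "ABC", "data", "Z", "NPQ"]


def lookup(name: str):
    """Return 'root', the bucket holding the key, or None if it is absent."""
    for key, child in ROOT:
        if key is not None and name == key:
            return "root"
        if key is None or name < key:
            return child if name in child else None
    return None


def buckets_in_use(bitmap: int, count: int) -> list[int]:
    """List the bucket numbers whose bit is set in the bitmap."""
    return [i for i in range(count) if (bitmap >> i) & 1]
```

Under this reading, 0x36 marks buckets 1, 2, 4 and 5 as in use, which lines up with the buckets labeled GI, ABC, Z and NPQ.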

$ATTRIBUTE_LIST
  • Needed for multi-file record file
  • Entry for each attribute in file
  • Resident or non-resident form
  • Must be in base file record

Attribute List (example)
  • Base Record 0x200
      – 0x10 - Standard
      – 0x20 - Attribute List
      – 0x30 - FileName
      – 0x80 - Default Data
      – 0x80 - Data1 “Owner”
  • Aux Record 0x180
      – 0x30 - FileName
      – 0x80 - Data “Author”
      – 0x80 - Data0 “Owner”
      – 0x80 - Data “Writer”

Attribute List (example cont.)
[Table only partly recoverable; column contents as extracted]
  • Code: 0x10 ($Standard), 0x30 ($Filename), 0x80 ($Data), 0x80 ($Data)
  • FR: 0x200, 0x180
  • VCN: 0, 0, 0, 40
  • Name: (Not Present), “Author”, “Owner”, “Writer”

Discussion