Windows Kernel Internals II: Processes, Threads, Virtual Memory

University of Tokyo – July 2004*
Dave Probert, Ph.D.
Advanced Operating Systems Group
Windows Core Operating Systems Division
Microsoft Corporation
© Microsoft Corporation 2004

Windows Architecture
[Figure: layered architecture diagram. Labels recovered from the slide:]
User-mode:
• Applications
• Subsystem servers
• DLLs
• System Services
• Login/GINA
• Kernel32
• Critical services
• User32 / GDI
• ntdll / run-time library
Kernel-mode:
• Trap interface / LPC
• Security refmon
• IO Manager: File filters, File systems, Volume mgrs, Device stacks
• Virtual memory
• Procs & threads
• Scheduler
• Cache mgr
• FS run-time
• exec synchr
• Win32 GUI
• Object Manager / Configuration Management
• Kernel run-time / Hardware Adaptation Layer

Process
• Container for an address space and threads
• Associated user-mode Process Environment Block (PEB)
• Primary access token
• Quota, debug port, handle table, etc.
• Unique process ID
• Queued to the job, global process list, and session list
• MM structures like the working set, VAD tree, AWE, etc.

Thread
• Fundamental schedulable entity in the system
• Represented by ETHREAD, which includes a KTHREAD
• Queued to the process (both E and K thread)
• IRP list
• Impersonation access token
• Unique thread ID
• Associated user-mode Thread Environment Block (TEB)
• User-mode stack
• Kernel-mode stack
• Processor Control Block (in KTHREAD) for CPU state when not running

Job
• Container for multiple processes
• Queued to global job list, processes, and jobs in the job set
• Security token filters and job token
• Completion ports
• Counters, limits, etc.

Process/Thread structure
[Figure: object relationships. Labels recovered from the slide: Object Manager; Any Handle Table; Process Object; Process' Handle Table; Virtual Address Descriptors; Thread (three shown); Files; Events; Devices; Drivers]

KPROCESS fields
DISPATCHER_HEADER Header
ULPTR DirectoryTableBase[2]
KGDTENTRY LdtDescriptor
KIDTENTRY Int21Descriptor
USHORT IopmOffset
UCHAR Iopl
volatile KAFFINITY ActiveProcessors
ULONG KernelTime
ULONG UserTime
LIST_ENTRY ReadyListHead
SINGLE_LIST_ENTRY SwapListEntry
LIST_ENTRY ThreadListHead
KSPIN_LOCK ProcessLock
KAFFINITY Affinity
USHORT StackCount
SCHAR BasePriority
SCHAR ThreadQuantum
BOOLEAN AutoAlignment
UCHAR State
BOOLEAN DisableBoost
UCHAR PowerState
BOOLEAN DisableQuantum
UCHAR IdealNode

EPROCESS fields
KPROCESS Pcb
EX_PUSH_LOCK ProcessLock
LARGE_INTEGER CreateTime
LARGE_INTEGER ExitTime
EX_RUNDOWN_REF RundownProtect
HANDLE UniqueProcessId
LIST_ENTRY ActiveProcessLinks
Quota fields
SIZE_T PeakVirtualSize
SIZE_T VirtualSize
LIST_ENTRY SessionProcessLinks
PVOID DebugPort
PVOID ExceptionPort
PHANDLE_TABLE ObjectTable
EX_FAST_REF Token
PFN_NUMBER WorkingSetPage
KGUARDED_MUTEX AddressCreationLock
KSPIN_LOCK HyperSpaceLock
struct _ETHREAD *ForkInProgress
ULONG_PTR HardwareTrigger
PMM_AVL_TABLE PhysicalVadRoot
PVOID CloneRoot
PFN_NUMBER NumberOfPrivatePages
PFN_NUMBER NumberOfLockedPages
PVOID Win32Process
struct _EJOB *Job
PVOID SectionObject
PVOID SectionBaseAddress
PEPROCESS_QUOTA_BLOCK QuotaBlock

EPROCESS fields cont.
PPAGEFAULT_HISTORY WorkingSetWatch
HANDLE Win32WindowStation
HANDLE InheritedFromUniqueProcessId
PVOID LdtInformation
PVOID VadFreeHint
PVOID VdmObjects
PVOID DeviceMap
PVOID Session
UCHAR ImageFileName[16]
LIST_ENTRY JobLinks
PVOID LockedPagesList
LIST_ENTRY ThreadListHead
ULONG ActiveThreads
PPEB Peb
IO Counters
PVOID AweInfo
MMSUPPORT Vm
Process Flags
NTSTATUS ExitStatus
UCHAR PriorityClass
MM_AVL_TABLE VadRoot

KTHREAD fields
DISPATCHER_HEADER Header
LIST_ENTRY MutantListHead
PVOID InitialStack, StackLimit
PVOID KernelStack
KSPIN_LOCK ThreadLock
ULONG ContextSwitches
volatile UCHAR State
KIRQL WaitIrql
KPROC_MODE WaitMode
PVOID Teb
KAPC_STATE ApcState
KSPIN_LOCK ApcQueueLock
LONG_PTR WaitStatus
PRKWAIT_BLOCK WaitBlockList
BOOLEAN Alertable, WaitNext
UCHAR WaitReason
SCHAR Priority
UCHAR EnableStackSwap
volatile UCHAR SwapBusy
LIST_ENTRY WaitListEntry
NEXT SwapListEntry
PRKQUEUE Queue
ULONG WaitTime
SHORT KernelApcDisable
SHORT SpecialApcDisable
KTIMER Timer
KWAIT_BLOCK WaitBlock[N+1]
LIST_ENTRY QueueListEntry
UCHAR ApcStateIndex
BOOLEAN ApcQueueable
BOOLEAN Preempted
BOOLEAN ProcessReadyQueue
BOOLEAN KernelStackResident

KTHREAD fields cont.
UCHAR IdealProcessor
volatile UCHAR NextProcessor
SCHAR BasePriority
SCHAR PriorityDecrement
SCHAR Quantum
BOOLEAN SystemAffinityActive
CCHAR PreviousMode
UCHAR ResourceIndex
UCHAR DisableBoost
KAFFINITY UserAffinity
PKPROCESS Process
KAFFINITY Affinity
PVOID ServiceTable
PKAPC_STATE ApcStatePtr[2]
KAPC_STATE SavedApcState
PVOID CallbackStack
PVOID Win32Thread
PKTRAP_FRAME TrapFrame
ULONG KernelTime, UserTime
PVOID StackBase
KAPC SuspendApc
KSEMAPHORE SuspendSema
PVOID TlsArray
LIST_ENTRY ThreadListEntry
UCHAR LargeStack
UCHAR PowerState
UCHAR Iopl
CCHAR FreezeCnt, SuspendCnt
UCHAR UserIdealProc
volatile UCHAR DeferredProc
UCHAR AdjustReason
SCHAR AdjustIncrement

ETHREAD fields
KTHREAD tcb
Timestamps
LPC locks and links
CLIENT_ID Cid
ImpersonationInfo
IrpList
pProcess
StartAddress
Win32StartAddress
ThreadListEntry
RundownProtect
ThreadPushLock

Process Synchronization
• ProcessLock – protects thread list, token
• RundownProtect – cross-process address space, image section, and handle table references
• Token, Prefetch – use fast referencing
• Token, Job – torn down at last process dereference without synchronization

Thread scheduling states
[Figure: thread scheduling state diagram]

Thread scheduling states
• Main quasi-states:
  – Ready – able to run
  – Running – current thread on a processor
  – Waiting – waiting for an event
• For scalability, Ready is three real states:
  – DeferredReady – queued on any processor
  – Standby – will imminently start Running
  – Ready – queued on target processor by priority
• Goal is granular locking of thread priority queues
• Red states are related to swapped stacks and processes
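
The split of Ready into three real states can be sketched as a small C enum, with a helper that folds the real states back into the three quasi-states. All identifiers below are hypothetical names chosen for illustration; they are not the kernel's own definitions.

```c
#include <assert.h>

/* Hypothetical names for illustration only. */
typedef enum {
    STATE_DEFERRED_READY, /* queued on any processor */
    STATE_STANDBY,        /* will imminently start running */
    STATE_READY,          /* queued on the target processor by priority */
    STATE_RUNNING,        /* current thread on a processor */
    STATE_WAITING         /* waiting for an event */
} thread_state;

typedef enum { QUASI_READY, QUASI_RUNNING, QUASI_WAITING } quasi_state;

/* Fold a real scheduling state into one of the three main quasi-states. */
static quasi_state to_quasi_state(thread_state s)
{
    switch (s) {
    case STATE_DEFERRED_READY:
    case STATE_STANDBY:
    case STATE_READY:
        return QUASI_READY;
    case STATE_RUNNING:
        return QUASI_RUNNING;
    case STATE_WAITING:
        return QUASI_WAITING;
    }
    assert(!"unknown thread state");
    return QUASI_WAITING;
}
```

The states tied to swapped stacks and processes are left out of this sketch because the slide text does not name them.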

Process Lifetime
• Created as an empty shell
• Address space created with only ntdll and the main image unless forked
• Handle table created empty or populated via duplication from parent
• Process is partially destroyed on last thread exit
• Process totally destroyed on last dereference
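
The two-stage teardown can be pictured with two independent counters: one for live threads and one for outstanding references. The following is a toy model under that assumption; the struct, field, and function names are hypothetical, and the sketch says nothing about what "partial" teardown actually releases.

```c
#include <stdbool.h>

/* Toy model: names are hypothetical, not kernel structures. */
typedef struct {
    unsigned active_threads;   /* threads still running in the process */
    unsigned ref_count;        /* outstanding references to the process object */
    bool partially_destroyed;  /* set when the last thread exits */
    bool fully_destroyed;      /* set when the last reference is dropped */
} process_model;

/* A thread exits; the last exit triggers the partial teardown. */
static void thread_exit(process_model *p)
{
    if (p->active_threads > 0 && --p->active_threads == 0)
        p->partially_destroyed = true;
}

/* A reference is dropped; the last dereference triggers total teardown. */
static void dereference(process_model *p)
{
    if (p->ref_count > 0 && --p->ref_count == 0)
        p->fully_destroyed = true;
}
```

In this model a process whose threads have all exited can stay around indefinitely as long as some other component still holds a reference to it.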

Thread Lifetime
• Created within a process with a CONTEXT record
• Starts running in the kernel but has a trap frame to return to user mode
• Kernel queues user APC to do ntdll initialization
• Terminated by a thread calling NtTerminateThread/Process

Summary: Native NT Process APIs
NtCreateProcess()
NtTerminateProcess()
NtQueryInformationProcess()
NtSetInformationProcess()
NtGetNextThread()
NtSuspendProcess()
NtResumeProcess()
NtCreateThread()
NtTerminateThread()
NtSuspendThread()
NtResumeThread()
NtGetContextThread()
NtSetContextThread()
NtQueryInformationThread()
NtSetInformationThread()
NtAlertThread()
NtQueueApcThread()

Virtual Memory Manager Features
• Provides 4 GB flat virtual address space (IA32)
• Manages process address space
• Handles page faults
• Manages process working sets
• Manages physical memory
• Provides memory-mapped files
• Allows pages shared between processes
• Facilities for I/O subsystem and device drivers
• Supports file system cache manager

Virtual Memory Manager NT Internal APIs
NtCreatePagingFile
NtAllocateVirtualMemory(Proc, Addr, Size, Type, Prot)
  Process: handle to a process
  Protection: NOACCESS, EXECUTE, READONLY, READWRITE, NOCACHE
  Flags: COMMIT, RESERVE, PHYSICAL, TOP_DOWN, RESET, LARGE_PAGES, WRITE_WATCH
NtFreeVirtualMemory(Process, Address, Size, FreeType)
  FreeType: DECOMMIT or RELEASE
NtQueryVirtualMemory
NtProtectVirtualMemory

Virtual Memory Manager NT Internal APIs
Pagefault
NtLockVirtualMemory, NtUnlockVirtualMemory
  – locks a region of pages within the working set list
  – requires PROCESS_VM_OPERATION on target process and SeLockMemoryPrivilege
NtReadVirtualMemory, NtWriteVirtualMemory(Proc, Addr, Buffer, Size)
NtFlushVirtualMemory

Virtual Memory Manager NT Internal APIs
NtCreateSection – creates a section but does not map it
NtOpenSection – opens an existing section
NtQuerySection – queries attributes for a section
NtExtendSection
NtMapViewOfSection(Sect, Proc, Addr, Size, …)
NtUnmapViewOfSection

Virtual Memory Manager NT Internal APIs
APIs to support AWE (Address Windowing Extensions)
  – Private memory only
  – Map only in current process
  – Requires LOCK_VM privilege
NtAllocateUserPhysicalPages(Proc, NPages, &PFNs[])
NtMapUserPhysicalPages(Addr, NPages, PFNs[])
NtMapUserPhysicalPagesScatter
NtFreeUserPhysicalPages(Proc, &NPages, PFNs[])
NtResetWriteWatch
NtGetWriteWatch
  Read out dirty bits for a section of memory since last reset

Allocating kernel memory (pool)
• Tightest x86 system resource is KVA (Kernel Virtual Address space)
• Pool allocates in small chunks:
  – < 4 KB: 8 B granularity
  – >= 4 KB: page granularity
• Paged and non-paged pool
  – Paged pool backed by pagefile
• Special pool used to find corruptors
• Lots of support for debugging/diagnosis
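
As a rough illustration of the two granularities, the helper below rounds a requested size up the way the slide describes: 8-byte steps below 4 KB and whole pages from 4 KB upward. It is a sketch only; the function name and constants are hypothetical, and it ignores pool headers and any other per-allocation bookkeeping.

```c
#include <stddef.h>

#define PAGE_SIZE          4096u /* x86 page size */
#define SMALL_GRANULARITY  8u    /* granularity for requests below one page */

/* Round a request up to the granularity that applies to its size class. */
static size_t pool_rounded_size(size_t bytes)
{
    size_t g = (bytes < PAGE_SIZE) ? SMALL_GRANULARITY : PAGE_SIZE;
    return (bytes + g - 1) / g * g;
}
```

For example, a 13-byte request rounds to 16 bytes, while a 4097-byte request rounds to two full pages.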

x86 kernel virtual address space
[Figure: map of the system address space. Boundary addresses shown on the slide, in ascending order: 80000000, A4000000, C0400000, C0800000, C0C00000, C1000000, E8000000, FFBE0000, FFC00000]
Regions, in the order shown:
• System code, initial non-paged pool
• Session space (win32k.sys)
• Sysptes overflow, cache overflow
• Page directory self-map and page tables
• Hyperspace (e.g. working set list)
• Unused – no access
• System working set list
• System cache
• Paged pool
• Reusable system VA (sysptes)
• Non-paged pool expansion
• Crash dump information
• HAL usage

Valid x86 Hardware PTEs
Bits 31–12: Pageframe
Bits 11–9: R (Reserved)
Bit 8: G (Global)
Bit 7: R (Reserved)
Bit 6: D (Dirty)
Bit 5: A (Accessed)
Bit 4: Cd (Cache disabled)
Bit 3: Wt (Write through)
Bit 2: O (Owner)
Bit 1: W (Write)
Bit 0: 1 (valid)
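
The bit layout above translates directly into shifts and masks. The decoder below is a minimal sketch for a 32-bit non-PAE PTE; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a valid 32-bit x86 PTE (hypothetical helper type). */
typedef struct {
    uint32_t page_frame;   /* bits 31-12 */
    bool global;           /* bit 8 */
    bool dirty;            /* bit 6 */
    bool accessed;         /* bit 5 */
    bool cache_disabled;   /* bit 4 */
    bool write_through;    /* bit 3 */
    bool owner;            /* bit 2 */
    bool write;            /* bit 1 */
    bool valid;            /* bit 0 */
} x86_pte_fields;

/* Split a raw PTE value into its fields; reserved bits are ignored. */
static x86_pte_fields decode_pte(uint32_t pte)
{
    x86_pte_fields f;
    f.page_frame     = pte >> 12;
    f.global         = (pte >> 8) & 1u;
    f.dirty          = (pte >> 6) & 1u;
    f.accessed       = (pte >> 5) & 1u;
    f.cache_disabled = (pte >> 4) & 1u;
    f.write_through  = (pte >> 3) & 1u;
    f.owner          = (pte >> 2) & 1u;
    f.write          = (pte >> 1) & 1u;
    f.valid          = pte & 1u;
    return f;
}
```

Decoding the value 0x12345067, for instance, gives page frame 0x12345 with the valid, write, owner, accessed, and dirty bits set.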

Virtual Address Translation
[Figure: CR3 points to the page directory (PD, 1024 PDEs); a PDE points to a page table (PT, 1024 PTEs); a PTE points to a 4096-byte page holding the DATA]
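
With 1024 PDEs, 1024 PTEs, and 4096-byte pages, a 32-bit virtual address splits into 10 + 10 + 12 bits. A minimal sketch of that split, using hypothetical helper names:

```c
#include <stdint.h>

/* Top 10 bits select one of the 1024 page directory entries. */
static uint32_t pd_index(uint32_t va) { return va >> 22; }

/* Middle 10 bits select one of the 1024 page table entries. */
static uint32_t pt_index(uint32_t va) { return (va >> 12) & 0x3FFu; }

/* Low 12 bits are the byte offset within the 4096-byte page. */
static uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }
```

For the address 0xE4321ABC this yields page directory index 0x390, page table index 0x321, and offset 0xABC.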

Self-mapping page tables
• Page Table Entries (PTEs) and Page Directory Entries (PDEs) contain Physical Frame Numbers (PFNs)
  – But the kernel runs with virtual addresses
• To access a PDE/PTE from the kernel, use the self-map for the current process: PageDirectory[0x300] uses PageDirectory as PageTable
  – GetPdeAddress(va): 0xc0300000[va>>20]
  – GetPteAddress(va): 0xc0000000[va>>10]
• PDE/PTE formats are compatible!
• Access another process's VA via thread 'attach'
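
The GetPdeAddress and GetPteAddress shorthand above can be spelled out as plain address arithmetic. In this sketch the index is shifted left by 2 because each entry is 4 bytes, which is the same as taking va>>20 or va>>10 with the low two bits cleared. The constant and function names are hypothetical; the base addresses are the ones given on the slide.

```c
#include <stdint.h>

#define PTE_BASE 0xC0000000u /* virtual address where the self-map exposes all PTEs */
#define PDE_BASE 0xC0300000u /* virtual address where the self-map exposes the PDEs */

/* Virtual address of the PDE that maps va. */
static uint32_t get_pde_address(uint32_t va)
{
    return PDE_BASE + ((va >> 22) << 2);
}

/* Virtual address of the PTE that maps va. */
static uint32_t get_pte_address(uint32_t va)
{
    return PTE_BASE + ((va >> 12) << 2);
}
```

Two quick checks: get_pte_address(0xE4321000) gives 0xC0390C84, and get_pte_address(PTE_BASE) gives PDE_BASE, which is the self-map property that the page directory doubles as the page table for the page-table region.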

Self-mapping page tables
Virtual access to PageDirectory[0x300]
Phys: PD[0xc0300000>>22] = PD
Virt: *(0xc0300c00) == PD
[Figure: CR3 points to PD; PD entry 0x300 points back to PD, so the PTE reached is PD[0x300] itself. Address bits of 0xc0300c00: 1100 0000 0011 0000 0000 1100 0000 0000]

Self-mapping page tables
Virtual access to PTE for va 0xe4321000
GetPteAddress: 0xe4321000 => 0xc0390c84
[Figure: CR3 points to PD; PD entry 0x300 selects PD as the page table; entry 0x390 selects PT; offset 0xc84 is PTE index 0x321. Address bits of 0xc0390c84: 1100 0000 0011 1001 0000 1100 1000 0100]
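
The self-map can be checked with a toy software page walk. The sketch below models physical memory as four frames, installs a self-map entry at page directory index 0x300, and maps va 0xe4321000. Everything here (frame numbers, names, the FAULT sentinel) is a hypothetical model, not kernel code; only the indices and addresses come from the slides.

```c
#include <stdint.h>

#define FRAMES     4u          /* toy physical memory: four 4 KB frames */
#define PTE_VALID  0x1u
#define FAULT      UINT32_MAX  /* returned when a PDE or PTE is not valid */

#define PD_FRAME   1u          /* hypothetical frame numbers for this example */
#define PT_FRAME   2u
#define DATA_FRAME 3u

static uint32_t phys[FRAMES][1024]; /* each frame holds 1024 32-bit entries */

/* Two-level walk, as the MMU would do it: returns a physical address. */
static uint32_t translate(uint32_t cr3_frame, uint32_t va)
{
    uint32_t pde, pte;
    if (cr3_frame >= FRAMES) return FAULT;
    pde = phys[cr3_frame][va >> 22];
    if (!(pde & PTE_VALID) || (pde >> 12) >= FRAMES) return FAULT;
    pte = phys[pde >> 12][(va >> 12) & 0x3FFu];
    if (!(pte & PTE_VALID)) return FAULT;
    return (pte & ~0xFFFu) | (va & 0xFFFu);
}

/* Build the mapping used on the slides, including the self-map entry. */
static void build_example(void)
{
    phys[PD_FRAME][0x300] = (PD_FRAME << 12) | PTE_VALID;   /* self-map */
    phys[PD_FRAME][0x390] = (PT_FRAME << 12) | PTE_VALID;   /* PDE for 0xe4321000 */
    phys[PT_FRAME][0x321] = (DATA_FRAME << 12) | PTE_VALID; /* PTE for 0xe4321000 */
}
```

After build_example(), translating 0xc0390c84 with CR3 set to PD_FRAME lands at byte offset 0xc84 inside PT_FRAME, which is exactly where the PTE for 0xe4321000 is stored, and translating 0xc0300c00 lands on PD entry 0x300 itself.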

x86 Invalid PTEs
Page file PTE:
  Bits 31–12: Page file offset
  Bit 11: Transition = 0
  Bit 10: Prototype = 0
  Bits 9–5: Protection
  Bits 4–1: PFN (page file number)
  Bit 0: 0 (invalid)
Transition PTE:
  Bits 31–12: Page frame number
  Bit 11: Transition = 1
  Bit 10: Prototype = 0
  Bits 9–5: Protection
  Bits 4–1: HW ctrl (Cache disable, Write through, Owner, Write)
  Bit 0: 0 (invalid)

x86 Invalid PTEs
• Demand zero: page file PTE with zero offset and PFN
• Unknown: PTE is completely zero or page table doesn't exist yet. Examine VADs.
Pointer to prototype PTE:
  Bits 31–11: pPte bits 7–27
  Bit 10: 1 (prototype)
  Bits 7–1: pPte bits 0–6
  Bit 0: 0 (invalid)
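
The invalid PTE kinds can be told apart with a few bit tests. The classifier below is a sketch that assumes the following non-PAE x86 layout, as read from the slides: bit 0 is the valid bit, bit 10 is the prototype bit, bit 11 is the transition bit, and a page file PTE keeps its offset in bits 31–12 and its page file number in bits 4–1. The enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    PTE_KIND_VALID,
    PTE_KIND_PROTOTYPE,   /* pointer to a prototype PTE */
    PTE_KIND_TRANSITION,  /* page still in memory, on a list */
    PTE_KIND_PAGE_FILE,   /* contents live in a paging file */
    PTE_KIND_DEMAND_ZERO, /* page file format with zero offset and file number */
    PTE_KIND_UNKNOWN      /* completely zero: examine the VADs */
} pte_kind;

/* Classify a raw 32-bit PTE value under the assumed bit layout. */
static pte_kind classify_pte(uint32_t pte)
{
    if (pte & 0x1u)        return PTE_KIND_VALID;
    if (pte == 0)          return PTE_KIND_UNKNOWN;
    if (pte & (1u << 10))  return PTE_KIND_PROTOTYPE;
    if (pte & (1u << 11))  return PTE_KIND_TRANSITION;
    if ((pte >> 12) == 0 && ((pte >> 1) & 0xFu) == 0)
        return PTE_KIND_DEMAND_ZERO;
    return PTE_KIND_PAGE_FILE;
}
```

Note that a demand-zero PTE is distinguishable from an unknown one only because its protection bits are nonzero.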

Prototype PTEs
• Kept in an array in the segment structure associated with section objects
• Six PTE states:
  – Active/valid
  – Transition
  – Modified-no-write
  – Demand zero
  – Page file
  – Mapped file

Physical Memory Management

Paging Overview
• Working sets: list of valid pages for each process (and the kernel)
• Pages 'trimmed' from a working set go on lists:
  – Standby list: pages backed by disk
  – Modified list: dirty pages to push to disk
  – Free list: pages not associated with disk
  – Zero list: supply of demand-zero pages
• Modified/standby pages can be faulted back into a working set without disk activity (soft fault)
• Background system threads trim working sets, write modified pages, and produce zero pages based on memory state and config parameters
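
The list rules above can be condensed into two small predicates. This is a sketch with hypothetical names; it assumes, consistently with the list descriptions, that a trimmed dirty page goes to the modified list and a trimmed clean page goes to the standby list.

```c
#include <stdbool.h>

typedef enum { LIST_STANDBY, LIST_MODIFIED, LIST_FREE, LIST_ZERO } page_list;

/* Which list receives a page trimmed from a working set (assumed rule). */
static page_list list_for_trimmed_page(bool dirty)
{
    return dirty ? LIST_MODIFIED : LIST_STANDBY;
}

/* Pages on the standby and modified lists can be soft-faulted back in. */
static bool can_soft_fault(page_list list)
{
    return list == LIST_STANDBY || list == LIST_MODIFIED;
}
```

A page that has reached the free or zero list no longer carries its old contents for the process, so a fault on it cannot be resolved as a soft fault.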

Managing Working Sets
• Aging pages: increment age counts for pages which haven't been accessed
• Estimate unused pages: count in working set and keep a global count of the estimate
• When getting tight on memory: replace rather than add pages when a fault occurs in a working set with significant unused pages
• When memory is tight: reduce (trim) working sets which are above their maximum
• Balance Set Manager: periodically runs the Working Set Trimmer, also swaps out kernel stacks of long-waiting threads
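
Aging, estimating, and trimming can be illustrated with a toy working set. The sketch below is not the memory manager's algorithm; the struct, the age threshold, and the oldest-first victim choice are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One page of a toy working set (hypothetical model). */
typedef struct {
    bool accessed;  /* set by hardware when the page is touched */
    unsigned age;   /* number of aging passes since the last access */
    bool resident;  /* still in the working set */
} ws_page;

/* Aging pass: reset the age of accessed pages, increment it for the rest. */
static void age_pages(ws_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].resident) continue;
        pages[i].age = pages[i].accessed ? 0 : pages[i].age + 1;
        pages[i].accessed = false;
    }
}

/* Estimate unused pages: resident pages at or above an age threshold. */
static size_t estimate_unused(const ws_page *pages, size_t n, unsigned threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (pages[i].resident && pages[i].age >= threshold) count++;
    return count;
}

/* Trim the oldest pages until the working set is within its maximum. */
static size_t trim_to_maximum(ws_page *pages, size_t n, size_t maximum)
{
    size_t resident = 0, trimmed = 0;
    for (size_t i = 0; i < n; i++)
        if (pages[i].resident) resident++;
    while (resident > maximum) {
        size_t victim = n;
        for (size_t i = 0; i < n; i++)
            if (pages[i].resident &&
                (victim == n || pages[i].age > pages[victim].age))
                victim = i;
        pages[victim].resident = false;
        resident--;
        trimmed++;
    }
    return trimmed;
}
```

Running an aging pass before trimming means recently touched pages have age 0 and are the last candidates to leave the working set.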

Discussion