

WINDOWS MEMORY MANAGEMENT PART 1
UNDERSTANDING KERNEL MODE AND USER MODE IN WINDOWS
Presented by Anand George

Areas of discussion
• Part 1
  ○ Basics. No PAE in the first part; mostly protected mode and paging.
  ○ Intel x86 only.
• Part 2
  ○ PAE, with a short discussion of AWE.
  ○ 64 bit.
  Note: Parts 1 and 2 apply to both Linux and Windows (mostly processor material), but the experiments and demos will be on Windows.
• Part 3
  ○ The PFN database and physical memory management.
  ○ A walk-through of the page fault handler (of Windows 7) by putting strategic breakpoints in KD and viewing the states. (No source code, only disassembly.)
• Part 4
  ○ Implementation of important Windows kernel memory management APIs (Mm****) and some undocumented APIs such as NtAllocateVirtualMemory (including user-mode VirtualAlloc).
  ○ Looking directly at the processor state we are interested in, with the help of a custom kernel driver and KD.

About this Series
• Not for complete beginners.
• An understanding of the C or assembly programming language is expected.
• Knowledge of CPU architectures, CPU registers and instructions is good to have.

Understanding Protection in Modern Operating systems

What is protection?
• Do not allow a program to access or modify other programs' data in a multitasking environment.
• Do not allow a program to access or modify OS data.
• Do not allow a program to directly access system hardware resources such as the mouse, keyboard or monitor.

Protection: how is it implemented?
• It is practically impossible for an OS to implement protection without assistance from the CPU (unless a virtual CPU is emulated, as Java/.NET do).
• Modern operating systems (like Windows or Linux) use the CPU features called “Paging” and “Segmentation” to implement protection.
• Modern operating systems run in the protected mode of the CPU.

Protected Mode in Modern OS like Linux or Windows
• Segmentation
  ○ Practically turned off by using the flat model.
  ○ The code segment protection feature is still used.
• Paging
  ○ Not just for virtual memory.
  ○ Protection is also implemented with it.
• Windows uses both segmentation and paging to implement protection attributes.

Abstract working of protection
• Any security system has the following attributes:
  ○ An identity verification mechanism.
  ○ A check or verification of that identity.
  ○ Access to a particular privileged resource if the check succeeds, and no access otherwise.

Abstract working of protection: Security Systems
• Example: an airport
  ○ An identity verification mechanism: your passport.
  ○ A check or verification of that identity: government authorities at the airport check the visa stamped in your passport, or the type of passport.
  ○ Access to a particular privileged resource if the check succeeds, and no access otherwise: access to a flight to a particular country or place if the passport is valid.

Abstract working of protection: Security Systems
• Example: the building premises of a company
  ○ An identity verification mechanism: your identity card.
  ○ A check or verification of that identity: the point where you swipe your card.
  ○ Access to a particular privileged resource if the check succeeds, and no access otherwise: the door opens if your card was issued by the company.

A couple of implicit points
• You cannot make a passport yourself or stamp a visa yourself; the visa is issued by the country you are going to.
• Likewise, you cannot make the ID card yourself; it is issued by the company whose premises you need to access. Very important.

Abstract working of protection: Security Systems, example of a generic protected-mode CPU
• An identity verification mechanism
  ○ The current value in a special register in the CPU, which designates the privilege (like a visa) of the code currently being executed.
• A check or verification of that identity
  ○ Circuitry inside the CPU checks whether the value in that special register is sufficient for the action the instruction is trying to perform.
• Access to a particular privileged resource if the check succeeds, and no access otherwise
  ○ If the CPU's check succeeds, the instruction executes and access to the resource (usually a specific memory region, or device/CPU registers) is granted. Otherwise an exception is raised and control goes to an exception handler installed by the OS (wrong visa: go to the authorities for further investigation).

Again, the implicit point
• The so-called special register cannot be modified by a less privileged program, just as we cannot issue a visa or an ID card to ourselves.

Generic CPU protection in other words
• There is a region of memory, and a class of CPU instructions, that require a special value (high privilege) in a special register of the CPU in order to be accessed or executed, respectively.
• The special register itself can be modified only if it already holds that special value (high privilege).
• If the special register does not hold that special value and the code tries to access the privileged memory region or execute a privileged instruction, an exception is generated and control passes to privileged code (the OS), which 'handles' the situation, usually by terminating the unprivileged code that tried to access or execute privileged resources (memory or instructions).

Now let's discuss the 32-bit Intel architecture.

Introduction to Control Registers of Intel CPU
• They are like the switches of a TV: you can turn different features on and off.
• They are 32-bit registers inside the CPU (on 32-bit x86).
• Software can read, write or modify them with CPU instructions, but only from the most privileged level (ring 0, explained later).
• We use control registers to turn on protection in Intel CPUs.

Control Registers we are interested in
• There are many CRs on IA-32, but this presentation needs only 2.
• CR0
  ○ Bit 0 (PE) turns protection (segmentation) on and off.
  ○ The last bit, bit 31 (PG), enables or disables paging.
• CR3
  ○ Once paging is enabled, CR3 points to the physical base address of the page directory. (Details later.)
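The two CR0 switches can be illustrated with plain bit masks. This is a minimal C sketch that only manipulates a CR0 value held in an ordinary variable (reading or writing the real CR0 needs ring 0 and a MOV to/from CR0); the macro and function names are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR0 bits named on the slide (Intel names). */
#define CR0_PE (1u << 0)   /* Protection Enable: protected mode, segmentation protection */
#define CR0_PG (1u << 31)  /* Paging: address translation through CR3 */

/* True if this CR0 value has protected mode turned on. */
bool protected_mode_on(uint32_t cr0)
{
    return (cr0 & CR0_PE) != 0;
}

/* True if paging is on. The CPU rejects (#GP) setting PG while PE is clear,
   so a valid CR0 with PG set always has PE set too. */
bool paging_on(uint32_t cr0)
{
    return (cr0 & CR0_PG) != 0 && protected_mode_on(cr0);
}

/* CR0 value with both switches on and every other bit preserved. */
uint32_t enable_protection_and_paging(uint32_t cr0)
{
    assert((cr0 & CR0_PG) == 0 || (cr0 & CR0_PE) != 0); /* input must be a valid CR0 */
    return cr0 | CR0_PE | CR0_PG;
}
```

For example, starting from the value 0x00000010, `enable_protection_and_paging` returns 0x80000011: bits 0 and 31 are set and bit 4 is left untouched.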

Segmentation is the root of security and protection in the Intel CPU
• Segmentation works with segment descriptors, segment registers, GDTR and LDTR.
• The segment registers index into the descriptor tables, which contain the segment descriptors.
• GDTR and LDTR point to the base of the descriptor tables.
• An 8-byte segment descriptor holds:
  ○ Base
  ○ Limit
  ○ Protection information, plus other fields
• Setting the base to 0 and the limit to 4 GB effectively turns off segmentation, but the protection provided by segmentation can still be used, as explained in the rest of this presentation.
• Note: Windows DOES NOT use segmentation in the traditional sense, to address memory in base:offset form, but only for security.

Intel x86 without PAE: implementation in Windows
• As mentioned, to enable segmentation (protection) and paging, the following 2 bits of the CR0 register must be set (= 1):
  ○ Bit 0, the PE (Protection Enable) bit.
  ○ Bit 31 (the last bit), the PG (Paging) bit.
  Note: the rest of the presentation assumes that the OS, which was started in real mode (both bits 0), has set these 2 bits.
• The “special register” mentioned earlier is the code segment register (CS) on the Intel x86 architecture.
• CS has 2 different, potentially unrelated, security attributes associated with it:
  1. The Requested Privilege Level (RPL): the 2 LSBs of the value in the CS register.
  2. The Current Privilege Level (CPL) field: the 2-bit privilege field in the 8-byte segment descriptor selected by the 13 MSBs of the 16-bit CS register. (The Intel manuals call this descriptor field the DPL; for the code segments Windows uses, its value equals the CPL.)

Intel x86 implementation in Windows [cont.]
• Windows mainly uses 2 CS values, and in both cases RPL = CPL, so we need not worry about the difference between them.
• For the curious: if RPL is not equal to CPL, the CPU uses the lesser privilege of the two (the numerically larger value).
• For the rest of the presentation we use the term CPL, which can mean either CPL or RPL, as both are the same in Windows.

Now, what is CPL?
• CPL is a 2-bit number that determines the privilege level of the instruction currently pointed to by the EIP register.
• Being 2 bits, it can be 0, 1, 2 or 3; these are also called the rings of protection. 0 is the most privileged and 3 the least.
• Windows uses only 0 and 3. When the CPL is 0, we say code is executing in kernel mode; when the CPL is 3, we say code is executing in user mode.

Role of GDTR and LDTR
• GDTR (Global Descriptor Table Register) is a register in the Intel CPU that contains the address of the Global Descriptor Table (GDT).
• It is 48 bits wide: the lower 16 bits hold the table limit (size), which we don't care much about in this presentation, and the upper 32 bits hold the base address of the GDT.
• LDTR (Local Descriptor Table Register) is similar to GDTR. As Windows doesn't use it, it is out of the scope of our discussion.
• The 13 MSBs of the 16-bit CS value index into a descriptor table, and the 2 LSBs are the RPL. What about the 1 remaining bit (bit 2)?
  ○ It decides whether the index refers to the GDT or the LDT: 0 selects the GDT (the table pointed to by GDTR) and 1 selects the LDT (the table pointed to by LDTR). In Windows this bit of CS is always 0, which means the GDT.
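The selector layout (13-bit index, 1 table-indicator bit, 2-bit RPL) can be decoded with shifts and masks. A minimal C sketch; the struct and function names are hypothetical, and 0x08 and 0x1B are the CS values commonly seen on 32-bit Windows in kernel mode and user mode respectively (confirm with `r cs` in WinDbg).

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit segment selector, e.g. the value held in CS. */
struct selector {
    uint16_t index; /* bits 15..3: descriptor number in the table */
    uint8_t  ti;    /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    uint8_t  rpl;   /* bits 1..0: requested privilege level */
};

struct selector decode_selector(uint16_t value)
{
    struct selector s;
    s.index = (uint16_t)(value >> 3);
    s.ti    = (uint8_t)((value >> 2) & 1u);
    s.rpl   = (uint8_t)(value & 3u);
    return s;
}

/* Byte offset of the selected descriptor from the table base.
   Descriptors are 8 bytes, so this is index * 8. */
uint32_t descriptor_offset(uint16_t value)
{
    struct selector s = decode_selector(value);
    return (uint32_t)s.index * 8u;
}
```

`decode_selector(0x1B)` gives index 3, TI 0 (GDT) and RPL 3, and that descriptor sits 0x18 bytes past the GDT base; 0x08 gives index 1 with RPL 0.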

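Diagram: CS, GDTR and GDT. Inside the CPU, GDTR holds the GDT size and base address, LDTR holds the LDT size and base address, and the CS register holds the GDT index, the table-indicator bit and the RPL. In memory, the GDT and LDT hold the descriptors; an entry in the GDT contains Base, Limit, CPL and other flags.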

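Segment descriptor format (four 16-bit words; word 0 is at the lowest address)
Word 0: bits 15-0   Segment Limit (0-15)
Word 1: bits 15-0   Base Address (0-15)
Word 2: bits 7-0    Base Address (16-23); bits 14-13 CPL; remaining bits: other flags
Word 3: bits 3-0    Segment Limit (16-19); bit 7 G; bits 15-8 Base Address (24-31); remaining bits: other flags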
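Reassembling the scattered base and limit pieces is the fiddly part of the format above. The sketch below reads the 8-byte descriptor as one little-endian 64-bit value; the names are hypothetical. The example values 0x00CF9A000000FFFF and 0x00CFFA000000FFFF are the conventional flat ring 0 and ring 3 code segment encodings, assumed here to match what Windows installs; compare them against `dg` output in WinDbg.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an 8-byte segment descriptor, read as one little-endian 64-bit value. */
struct seg_desc {
    uint32_t base;        /* bits 16..39 and 56..63 */
    uint32_t limit_field; /* 20-bit limit: bits 0..15 and 48..51 */
    unsigned g;           /* bit 55: 1 = limit counted in 4 KB units */
    unsigned priv;        /* bits 45..46: the slide's "CPL" field (DPL in Intel manuals) */
};

struct seg_desc decode_descriptor(uint64_t d)
{
    struct seg_desc s;
    s.base = (uint32_t)((d >> 16) & 0xFFFFFFu)              /* Base Address (0-23)   */
           | (uint32_t)(((d >> 56) & 0xFFu) << 24);         /* Base Address (24-31)  */
    s.limit_field = (uint32_t)(d & 0xFFFFu)                 /* Segment Limit (0-15)  */
                  | (uint32_t)(((d >> 48) & 0xFu) << 16);   /* Segment Limit (16-19) */
    s.g    = (unsigned)((d >> 55) & 1u);
    s.priv = (unsigned)((d >> 45) & 3u);
    return s;
}

/* Segment size in bytes: the last valid offset plus one. */
uint64_t segment_size(struct seg_desc s)
{
    uint64_t last = s.g ? (((uint64_t)s.limit_field << 12) | 0xFFFu)
                        : (uint64_t)s.limit_field;
    return last + 1;
}
```

Both example values decode to base 0, limit field 0xFFFFF and G = 1, i.e. a 4 GB segment, and differ only in the privilege field (0 versus 3).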

Demo
• Segment registers in WinDbg.
• GDTR in WinDbg.
• Base, limit and flags of a segment descriptor.
• View of the 2 segment descriptors used by Windows.

Summary: Segmentation
• Both code segments Windows uses have base 0 and a limit of 0xFFFFFFFF (a 20-bit limit field of 0xFFFFF counted in 4 KB units, because the G bit is set), which means a 4 GB segment. This is called the flat memory model.
• It means that even a code segment with CPL = 3 can reach the entire 4 GB. That is not the whole story in Windows: this is where the owner bit in the page table/page directory entry comes into the picture (the next topic, along with paging).
• So segmentation is the part of security that determines the CPL of the current code segment, i.e. the privilege of the currently executing code. The value in CS cannot be changed by code running at a less privileged level than CPL 0 (kernel mode).
• By Windows' design, segmentation does not impose any requirement on the privilege needed to access a memory block; paging does that.
• We didn't discuss how memory is accessed via segmentation (the segment:offset details), because in the flat memory model segmentation contributes nothing to the final physical address; paging does. So in the flat model, segmentation just provides the current privilege level of the code being executed.
• We also did not discuss DPL (descriptor privilege level), as it is used for cross-segment memory access. At any time in Windows there is only one segment in play, which is 4 GB in size and spans every addressable location, so the DPL always meets or beats the CPL. In fact, it meets it.

Unfortunately, it's too early to explain completely how things work with segmentation. For the final picture we need to discuss the details of paging as well.

Paging
• The CPU starts in real mode.
• Setting the first bit of CR0 (the PE bit) turns on segmentation (protected mode).
• Setting the last bit of CR0 (the PG bit) turns on paging.
• Once the last bit of CR0 is set, no software instruction, irrespective of privilege (kernel or user), can directly access physical memory by any means.

Paging
• The memory addresses given by programs are translated by the CPU. The addresses programs use are called virtual addresses, and they are translated into physical addresses.
• Both the physical and the virtual address spaces are divided into chunks called PAGES. Each virtual page has a page table entry in a table in physical memory called the page table, which the CPU uses for the translation. The input to the translation is a virtual address and the output is a physical address.
• Translation adds a slight performance hit, but it enables the CPU to implement virtual memory and security.

Virtual Address
• Any address a program gives to the CPU once the paging bit in the CPU is turned on.
• Example:

    int main(void)
    {
        int a = 100;
        int *p = &a;   /* p holds a virtual address */
        return 0;
    }

  In the above program, p contains a virtual address.

Physical Address
• The actual address in physical memory: the binary number the CPU puts on the bus connected to the memory chips to fetch data into the CPU.
• Almost useless to programs, as programs have no access to physical memory once paging is turned on. Whatever address you give the CPU from any instruction (program), it will try to translate, “thinking” that it is a virtual address.
• Note: throughout this presentation, physical address means bus address, although that may not be the case in reality. We plan to discuss the old HAL kernel API HalTranslateBusAddress, which will make the difference clear.

Page table contents
• The CPU uses the page table to translate a virtual address to a physical address.
• A table row is called a page table entry.
• One column is the virtual address of the page, one column is the physical address of the page, and one column holds flags.
• You can think of the index as the virtual address, although in reality the table is arranged as a multi-level tree (details later).
• 2 very important pieces of information are present in the flags column of each page table entry:
  1. Whether the current CPU mode has the privilege to access this page.
  2. Whether the page is present in the RAM of the system.
  This is VERY important. (A concrete discussion follows in the next slides.)

Paging (Logical Diagram): the CPU translates the virtual addresses of a Hello World C program through a page table kept in RAM.

Virtual address (index)   Present?/accessible? bits   Physical address
0                         01                          no physical address
1                         11                          100
2                         10                          200
3                         11                          300
4                         10                          400
5                         11                          500
6                         01                          no physical address
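The logical diagram can be turned into a toy lookup. This C sketch hard-codes the seven rows above and treats the second bit as “accessible from user mode”, which is our reading of the diagram; the names are hypothetical, and real x86 entries look different, as the following slides show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One row of the logical page table in the diagram. */
struct toy_pte {
    bool     present;     /* first bit: the page is in RAM */
    bool     accessible;  /* second bit: user-mode code may touch it */
    uint32_t phys;        /* physical address, meaningful only if present */
};

/* The seven rows of the diagram, indexed by virtual page number. */
static const struct toy_pte toy_table[] = {
    {false, true, 0},  {true, true, 100},  {true, false, 200}, {true, true, 300},
    {true, false, 400}, {true, true, 500}, {false, true, 0},
};

#define TOY_PAGES (sizeof toy_table / sizeof toy_table[0])

/* Translate a virtual page number. Returns false where the CPU would
   raise a fault: page not present, or not accessible to user mode. */
bool toy_translate(uint32_t vpn, bool user_mode, uint32_t *phys_out)
{
    assert(vpn < TOY_PAGES);
    const struct toy_pte *e = &toy_table[vpn];
    if (!e->present)
        return false;              /* not in RAM: page fault */
    if (user_mode && !e->accessible)
        return false;              /* privilege violation */
    *phys_out = e->phys;
    return true;
}
```

With this table, a user-mode lookup of page 1 yields 100, while pages 0 and 2 fail: page 0 is not present and page 2 is not accessible from user mode.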

TLB
• A frequently accessed subset of the page table is cached by the CPU (inside the chip) for fast translation.
• This cache is called the Translation Lookaside Buffer, or TLB.
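A TLB behaves like a small cache in front of the page table walk. The sketch below is a hypothetical direct-mapped TLB with 8 slots; real TLB size and organisation vary by CPU model, and `walk_page_table` is a stand-in mapping rather than a real walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_SLOTS 8u  /* hypothetical size; real TLBs differ per CPU model */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t pfn; };

static struct tlb_entry tlb[TLB_SLOTS];
static unsigned walks;  /* how often the slow page-table path was taken */

/* Stand-in for the walk through the page table in RAM
   (hypothetical mapping: physical page = virtual page + 0x100). */
static uint32_t walk_page_table(uint32_t vpn)
{
    walks++;
    return vpn + 0x100;
}

/* Translate a virtual page number, consulting the TLB first. */
uint32_t tlb_translate(uint32_t vpn)
{
    struct tlb_entry *slot = &tlb[vpn % TLB_SLOTS];  /* direct-mapped */
    if (slot->valid && slot->vpn == vpn)
        return slot->pfn;                            /* hit: no table walk */
    uint32_t pfn = walk_page_table(vpn);             /* miss: walk the table */
    *slot = (struct tlb_entry){ true, vpn, pfn };
    return pfn;
}

/* Drop every cached entry (on x86, loading CR3 flushes the non-global ones). */
void tlb_flush(void)
{
    for (unsigned i = 0; i < TLB_SLOTS; i++)
        tlb[i].valid = false;
}
```

Repeated lookups of the same page hit the TLB, so `walks` does not grow; a conflicting page that maps to the same slot, or a flush, forces another walk.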

Page of memory
• A block of memory to which you can apply protection.
• There is one page table entry in the table per page.
• The bits of that page table entry determine the attributes of the page.
• Page size varies on 32-bit x86: 4 KB and 4 MB pages are used.

Page size trade-offs
• Think of page table entries (32 bits each) as managers and the bytes in the pages as team members.
• A large page (say 4 MB) is like many people reporting to the same manager. Problems:
  ○ It is very difficult to take care of the interests of each member.
  ○ The same set of rules (page table entry attributes) applies to all members, and some rules may not fit every member.

Page size trade-offs (cont.)
• Advantages of large pages:
  ○ Only one manager (page table entry) is needed for many people, so the manager's salary (memory needed for page table entries) is saved.
  ○ The managers' managers' salaries can also be saved (space for all the page table entries).

Page size trade-offs (cont.)
• Small pages are like a manager with fewer team members. Advantages:
  ○ People in similar projects and trades are grouped.
  ○ Individual interests match the interest of the team better.

Page size trade-offs (cont.)
• Problems with small pages:
  ○ More managers are needed, and more salary must be paid to them.
  ○ More managers' managers are needed as well, so the total cost increases.
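The manager analogy can be put in numbers. Assuming every page of the 4 GB address space is mapped, and counting only the 4-byte entries that point at pages (not the page directory above them), this C sketch compares 4 KB and 4 MB pages; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ADDRESS_SPACE (1ull << 32)  /* 4 GB of 32-bit virtual address space */
#define ENTRY_SIZE    4ull          /* one 32-bit entry, no PAE */

/* Number of "managers" (entries) needed to map the whole 4 GB. */
uint64_t entries_needed(uint64_t page_size)
{
    assert(page_size != 0 && ADDRESS_SPACE % page_size == 0);
    return ADDRESS_SPACE / page_size;
}

/* Their "salary": bytes of memory spent on those entries. */
uint64_t table_bytes(uint64_t page_size)
{
    return entries_needed(page_size) * ENTRY_SIZE;
}
```

4 KB pages need 1,048,576 entries (4 MB of entries), while 4 MB pages need only 1,024 entries (4 KB).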

Page size in 32-bit Windows
• Mixed sizes are used.
• Some binaries, like ntoskrnl, are loaded into 4 MB pages, as they are static and never paged out.
• Most other regions use 4 KB pages.

Paging: a concrete discussion
• Before turning on the paging bit, the OS should:
  ○ Create the page directory.
  ○ Create the page tables.
  ○ Store the page directory base in the upper 20 bits of CR3.
• The entire setup is a two-level tree (page directory, then page tables). Until this point (CR0 bit 31 = 1), the OS has direct access to physical memory.
• One more thing both Windows and Linux do before turning on paging is build a database of the physical memory layout (in Windows it is called the PFN database). This is not mandated by the CPU.

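Address Translation with Paging (diagram)
• The linear address from the segmentation unit (the virtual address) is split into the MSB 10 bits, the next 10 bits and the LSB 12 bits.
• CR3 gives the page directory base; the MSB 10 bits select a PDE in the page directory.
• The PDE gives the page table base; the next 10 bits select a PTE in the page table.
• The PTE gives the physical page base; the LSB 12 bits select the byte within that page.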
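The 10/10/12 split and the two lookups can be simulated in C. This minimal sketch uses a 12 KB array as hypothetical “physical memory” holding a page directory at 0x0000, one page table at 0x1000 and a data page at 0x2000; it checks only the V bit and ignores 4 MB PDEs and the other flags. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V      0x1u         /* Valid (present) bit */
#define FRAME_MASK 0xFFFFF000u  /* MSB 20 bits: physical page base */

/* Hypothetical 12 KB of "physical memory": page directory at 0x0000,
   one page table at 0x1000, one data page at 0x2000. */
static uint32_t phys_mem[0x3000 / 4];

static uint32_t read_phys32(uint32_t pa)
{
    assert(pa % 4 == 0 && pa / 4 < sizeof phys_mem / sizeof phys_mem[0]);
    return phys_mem[pa / 4];
}

static void write_phys32(uint32_t pa, uint32_t value)
{
    assert(pa % 4 == 0 && pa / 4 < sizeof phys_mem / sizeof phys_mem[0]);
    phys_mem[pa / 4] = value;
}

/* The 10 | 10 | 12 split of a 32-bit virtual address. */
uint32_t pd_index(uint32_t va)    { return va >> 22; }
uint32_t pt_index(uint32_t va)    { return (va >> 12) & 0x3FFu; }
uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }

/* CR3 -> PDE -> PTE -> byte. Returns false where the CPU would raise a
   page fault. Each entry is 4 bytes, hence the "* 4". */
bool translate(uint32_t cr3, uint32_t va, uint32_t *pa_out)
{
    uint32_t pde = read_phys32((cr3 & FRAME_MASK) + pd_index(va) * 4);
    if (!(pde & PTE_V))
        return false;
    uint32_t pte = read_phys32((pde & FRAME_MASK) + pt_index(va) * 4);
    if (!(pte & PTE_V))
        return false;
    *pa_out = (pte & FRAME_MASK) | page_offset(va);
    return true;
}
```

For example, with CR3 = 0, after writing 0x1001 into PDE 0x48 and 0x2001 into PTE 0x345 of the table at 0x1000, the virtual address 0x12345678 splits into 0x48 / 0x345 / 0x678 and translates to physical address 0x2678.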

Address Translation Demo
• WinDbg commands: dc, !dc (!d*), !pte, !vtop

Page Directory Entry or Page Table Entry (both have the same format)
• Each PDE or PTE is a 32-bit entry.
• MSB 20 bits: the physical page base.
• LSB 12 bits: flags, with U, W and V in the lowest 3 bits.

Meaning of the flag letters in the LSB 12 bits
C   Copy on write.
G   Global.
L   Large page. This only occurs in PDEs, never in PTEs.
D   Dirty.
A   Accessed.
N   Cache disabled.
T   Write-through.
U   Owner (user mode or kernel mode).
W   Writeable or read-only. Only on multiprocessor computers and any computer running Windows Vista or later.
V   Valid.
E   Executable page. For platforms that do not support a hardware execute/noexecute bit, including many x86 systems, the E is always displayed.
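The hardware flags can be decoded with masks. The bit positions below are the Intel-defined ones for a 32-bit, non-PAE entry; pairing them with the letters of the table above is our assumption (for example, Windows may derive the displayed W from a software bit on multiprocessor systems), and the software-defined C and E flags are left out. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Intel-defined low bits of a 32-bit (non-PAE) PDE/PTE, printed with the
   letters from the table. Software bits (C, E) are not decoded. */
static const struct { uint32_t mask; char letter; } pte_bits[] = {
    {1u << 8, 'G'},  /* Global */
    {1u << 7, 'L'},  /* Large page (PS), meaningful in a PDE */
    {1u << 6, 'D'},  /* Dirty */
    {1u << 5, 'A'},  /* Accessed */
    {1u << 4, 'N'},  /* Cache disabled (PCD) */
    {1u << 3, 'T'},  /* Write-through (PWT) */
    {1u << 2, 'U'},  /* Owner: 1 = user, 0 = kernel (U/S) */
    {1u << 1, 'W'},  /* Writeable (R/W) */
    {1u << 0, 'V'},  /* Valid (P) */
};

/* Fills out with one character per flag (its letter if set, '-' if clear)
   plus a terminating NUL, so out needs 10 bytes. Returns the page base. */
uint32_t describe_pte(uint32_t pte, char *out)
{
    const size_t n = sizeof pte_bits / sizeof pte_bits[0];
    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (pte & pte_bits[i].mask) ? pte_bits[i].letter : '-';
    out[n] = '\0';
    return pte & 0xFFFFF000u;
}
```

`describe_pte(0x00402067, buf)` returns the page base 0x00402000 and fills `buf` with --DA--UWV: a valid, writable, user-owned page that has been accessed and written.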

We are most interested here in the 3rd bit of the PTE (bit 2, counting from 0): the U, or owner, bit.

Combining segmentation and paging: privilege check
• The current privilege level is determined by the CS register, as we saw before.
• To access a page whose 3rd bit (the owner, or user/supervisor, bit) in the PTE is not set (= 0), the CPL must be 0. (Strictly, the CPU allows any CPL below 3, but Windows uses only 0 and 3.)
• If this bit is 1, the page can be accessed from any CPL, including 3.

Combining segmentation and paging: privilege check in Windows
• In Windows, of the 4 GB of addressable locations, all pages in the upper 2 GB (0x80000000 to 0xFFFFFFFF) have the 3rd bit (owner bit) of their PTE set to 0. So to access (read or write) those pages, the CPL must be 0.
• The upper 2 GB is therefore called the kernel address space, or privileged memory region. This is Windows-specific: the OS developer decides how many pages need privileged access. Windows keeps all OS-related data structures in this region.
• An application like Notepad or a web browser has no access to this region, as it runs at privilege level 3 all the time. If it attempts an access, an exception occurs and control goes to the OS to take action (usually terminating the application).

Combining segmentation and paging: privilege check in Windows
• In the lower 2 GB (0x00000000 to 0x7FFFFFFF), every page's PTE has the 3rd bit (owner bit) set (= 1).
• This means those pages can be accessed from any privilege level.
• This part of the address space is also called the user-mode address space, where applications like Notepad or Word keep and access their data.
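The paging half of the privilege check, plus the Windows address split, fits in two small functions. A minimal C sketch with hypothetical names; it assumes the default 2 GB / 2 GB split (not the /3GB boot configuration) and looks only at the PTE owner bit, although the CPU also checks the owner bit of the PDE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_OWNER   (1u << 2)     /* 3rd bit: 1 = user page, 0 = kernel-only page */
#define KERNEL_BASE 0x80000000u   /* default Windows split: upper 2 GB is kernel space */

/* Paging half of the check: may code at this CPL touch a page with this PTE?
   (The CPU also requires the owner bit in the PDE; not modelled here.) */
bool page_access_allowed(unsigned cpl, uint32_t pte)
{
    assert(cpl <= 3);
    if (pte & PTE_OWNER)
        return true;     /* user page: any CPL, including 3 */
    return cpl < 3;      /* kernel page: supervisor mode only; Windows uses CPL 0 */
}

/* Which half of the 4 GB space an address belongs to under the default split. */
bool is_kernel_address(uint32_t va)
{
    return va >= KERNEL_BASE;
}
```

A PTE value such as 0x00402067 (owner bit set) passes at CPL 3, while 0x00403163 (owner bit clear) passes only at CPL 0.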

Important
• Neither the PTEs nor CS can be changed from user mode.

Protection Demo

How does the switch between privileged and non-privileged segments happen?
• From privileged to non-privileged (kernel mode to user mode), things are easy, as privileged code can change the CS register and load a non-privileged value into it.
• The reverse switch (user mode to kernel mode) happens via a software or hardware interrupt. For example:
  ○ Executing a privileged instruction or accessing privileged memory.
  ○ Instructions like SYSENTER, INT 3, INT xx, etc.
  ○ A hardware interrupt from a NIC, mouse, keyboard, etc.
  ○ A page fault.
• So, in short, any kind of interrupt switches the CPU into kernel mode, as the handlers of those interrupts and exceptions are in kernel mode.
• It is like an emergency call (911/112, or 100 in India) reaching a specific place that knows what to do and has the privilege to do it.

IOPL flags (2 bits) of the EFLAGS register
• The following 6 instructions have additional security checking beyond what we have discussed so far: CLI, STI, IN, INS, OUT, OUTS.
• To execute these instructions, the current privilege level of the code must meet or beat the IOPL, which means the CPL must be numerically equal to or less than the IOPL bits in the EFLAGS register.
• The CPU makes one exception to this rule for the last 4 instructions (the I/O permission bitmap in the TSS), but in Windows the rule above holds.
• In Windows, IOPL is always 0, which means these instructions can be executed only in privileged mode (kernel mode).
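The IOPL rule is a comparison between two 2-bit numbers. This C sketch extracts IOPL from EFLAGS bits 12 and 13 and applies the “meet or beat” test; it deliberately ignores the I/O permission bitmap exception, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IOPL_SHIFT 12   /* IOPL occupies EFLAGS bits 12-13 */

/* The 2-bit I/O privilege level stored in EFLAGS. */
unsigned iopl_of(uint32_t eflags)
{
    return (eflags >> EFLAGS_IOPL_SHIFT) & 3u;
}

/* "Meet or beat": CLI, STI, IN, INS, OUT and OUTS pass this check only when
   CPL <= IOPL. The TSS I/O permission bitmap exception is not modelled. */
bool io_sensitive_allowed(unsigned cpl, uint32_t eflags)
{
    assert(cpl <= 3);
    return cpl <= iopl_of(eflags);
}
```

With an EFLAGS value such as 0x00000202 (IOPL 0), only CPL 0 passes, which matches the Windows behaviour described above.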

Implementation of Different Process Address Spaces
• Every process in Windows has its own CR3 value.
• When a process context switch happens, the new process's CR3 value is loaded into the CR3 register.
• Once CR3 changes, it is as if a whole new page table has been loaded, so the translation changes accordingly.
• In every process's page tables, the upper 2 GB maps to the same physical memory, which is common to all processes (but not accessible from user mode).
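The effect of switching CR3 can be shown with a simplified model. In this C sketch every page directory entry maps a 4 MB large page directly, so one lookup completes a translation; each hypothetical process gets a private mapping for its lowest 4 MB and the same kernel entry at index 512 (virtual 0x80000000). Real Windows uses mostly 4 KB pages and many more entries, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PDE_V       0x001u        /* valid */
#define PDE_U       0x004u        /* owner: user accessible */
#define PDE_PS      0x080u        /* 4 MB large page */
#define LARGE_FRAME 0xFFC00000u   /* physical base of a 4 MB page */

typedef uint32_t page_dir[1024];  /* one per process; CR3 would point at it */

/* Translate using whichever directory "CR3" currently selects. */
uint32_t translate_large(const uint32_t *cr3_dir, uint32_t va)
{
    uint32_t pde = cr3_dir[va >> 22];
    assert((pde & PDE_V) && (pde & PDE_PS));   /* sketch: no fault handling */
    return (pde & LARGE_FRAME) | (va & 0x3FFFFFu);
}

/* Private user mapping for the lowest 4 MB; entry 512 (0x80000000) is the
   same kernel mapping in every process and has no owner bit. */
void build_process_dir(page_dir dir, uint32_t user_frame, uint32_t kernel_frame)
{
    for (int i = 0; i < 1024; i++)
        dir[i] = 0;
    dir[0]   = (user_frame & LARGE_FRAME) | PDE_PS | PDE_U | PDE_V;
    dir[512] = (kernel_frame & LARGE_FRAME) | PDE_PS | PDE_V;
}
```

If process A's user frame is 0x00400000, process B's is 0x00800000 and both share kernel frame 0, then virtual address 0x00001000 resolves to 0x00401000 in A and 0x00801000 in B, while 0x80001000 resolves to 0x00001000 in both.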

Demo: Windows process implementation
• !process 0 0: get the page directory base.
• Translate the same address in two different process address spaces and prove that it maps to 2 different physical memory locations.
• Implementation of shared memory, or the Section object.
• Look at an address in the upper 2 GB of the address space and make sure it maps to the same physical address in both processes.

3 Windows kernel APIs worth mentioning
1. MmMapIoSpace
2. MmGetPhysicalAddress
3. HalTranslateBusAddress (Note: obsolete even for WDM PnP drivers; WDF is the latest replacement. Many students had questions about this API, so it is added to this discussion. Although it is obsolete for the outside world, Windows calls this API quite frequently, even in Windows 8.)

Demo
1. MmMapIoSpace
2. MmGetPhysicalAddress
3. HalTranslateBusAddress

Page fault handling
• If the V bit is not set, the page is not in memory.
• The CPU treats this as an exception (a page fault) and transfers control to the OS page fault handler registered in the interrupt descriptor table (vector 14); the CPU stores the faulting virtual address in CR2.
• The OS page fault handler handles the exception, reads the page from disk, and lets the CPU restart the instruction that caused the fault.
• In this presentation we are not going to discuss this process in detail.
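The fault-and-restart loop can be mimicked in user-space C. This sketch is purely illustrative: the “CPU” is a load function that checks a V flag, the “handler” copies the page from a hypothetical in-memory backing store, and there is no real CR2, IDT or disk involved. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES    4u
#define PAGE_SIZE 4096u

struct vpage {
    bool    valid;             /* the V bit */
    uint8_t frame[PAGE_SIZE];  /* stand-in for the physical page */
};

static struct vpage pages[NPAGES];
static uint8_t backing_store[NPAGES][PAGE_SIZE]; /* hypothetical "disk" */
static unsigned faults;                          /* handler invocations */

/* "OS" side: read the page in from disk and set V. On real x86 the faulting
   address would be taken from CR2. */
static void page_fault_handler(uint32_t faulting_address)
{
    uint32_t vpn = faulting_address / PAGE_SIZE;
    faults++;
    memcpy(pages[vpn].frame, backing_store[vpn], PAGE_SIZE);
    pages[vpn].valid = true;
}

/* "CPU" side: a byte load that faults while V is clear, then restarts. */
uint8_t load_byte(uint32_t va)
{
    uint32_t vpn = va / PAGE_SIZE;
    assert(vpn < NPAGES);
    while (!pages[vpn].valid)
        page_fault_handler(va);  /* fault, handle, restart the access */
    return pages[vpn].frame[va % PAGE_SIZE];
}
```

The first load from a page runs the handler once; later loads from the same page find V set and do not fault.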

Some Final Q and A
• Q: Why can't we access another process's memory directly?
  A: There is one CR3 and one 4 GB segment at a time, so only one process address space is active at a time. There is no other address space to “access”.
• Q: How do we get the common kernel-mode address space (upper 2 GB)?
  A: The system page table entries are the same for all processes, which means the upper 2 GB of pages map to the same physical memory.
• Q: How does interprocess communication work?
  A: Either by mapping the same physical memory into 2 different process address spaces, or by sharing the data through the kernel address space; both ways are fundamentally the same.
• Q: What happens when we touch a kernel-mode address from user mode?
  A: In user mode the CPL is 3, and for a kernel page the owner bit in the PTE is 0, so an exception is generated and the CPU switches into the kernel to handle it (usually terminating the application, depending on the handling).
• Q: Why is the physical memory used by a DLL like ntdll.dll not replicated, although it is being used by all processes?
  A: The same physical pages used by ntdll.dll are mapped into the different process virtual address spaces, so all those virtual addresses translate to the same physical addresses.

Thank you.