Virtualization

Creating a virtual (i.e. not really existing) version of something:
  o Hardware
  o Network
  o Memory
  o Storage
  o …

Operating Systems, Spring 2020, I. Dinur, D. Hendler and M. Kogan-Sadetsky

Basic concepts
• Virtual Machine (VM)
• Host
• Guest
• Hypervisor / Virtual Machine Monitor (VMM)
  o Type 2 VM software: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM
  o Type 1 VM software: VMware ESX, Microsoft Hyper-V, Xen

Why use virtualization?
✓ Server consolidation
  o Multiple VMs / OSs / services on one physical machine
[Figure: Unix, Linux and Windows 10 guests running on VM software on top of a host OS]

Why use virtualization?
✓ Isolation
  o VMs are completely isolated from each other (multi-user security)
  o Only the VMM has full control of the VMs

Why use virtualization?
✓ Fault containment (recovery)
  o If a VM crashes, it can be rebooted without affecting the other VMs

Why use virtualization?
✓ VM migration
  o Moving a VM to another server is easy, for example when the VM needs more HW resources
[Figure: a VM is removed from the VM software on one host OS and placed on the VM software of another host]

Why use virtualization?
✓ Testing non-existing (novel) HW architectures virtually
[Figure: alongside the Unix, Linux and Windows 10 guests, the VM software exposes a virtual "x90" architecture]

Types of virtualization
• Full virtualization – the guest OS runs unmodified
  o The guest OS does not know that it runs on a VM and not on a real machine
Our focus in the course is on full virtualization.
[Figure: an unmodified Windows 10 runs on VM software on top of a host OS]

Types of virtualization
• Para-virtualization – the guest OS must be aware of virtualization; modifications to the guest OS source code are required
  o The guest OS can cooperate directly with the VMM, and thus VM performance may be better
[Figure: a modified Windows 10 runs on VM software on top of a host OS]

Hypervisor (VMM) must provide:
• Safety: the hypervisor should have full control of virtualized resources
• Fidelity: program behavior on a VM should be identical to its behavior on bare hardware
• Efficiency: as much as possible, run directly on the hardware, without hypervisor intervention

Classic virtualization: trap-and-emulate
• A trap is caused when the guest OS (or any other process that is not the host OS) tries to run a privileged instruction.
• No trap is caused when it executes a sensitive instruction that is not privileged.
• Sequence:
  1) Trap
  2) Interrupt handler in the VMM (HW emulation, i.e. virtual hardware)
  3) Return to the process
• After the trap, the hypervisor updates the state it keeps for the VM as required by the instruction.
[Figure: VM 1 and VM 2 run above the VMM, which runs on the HW]
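The three steps above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the instruction names, the `PRIVILEGED` set, and the `VirtualCPU` fields are hypothetical and not taken from any real hypervisor.

```python
from dataclasses import dataclass

# Hypothetical subset of privileged instructions, for illustration only.
PRIVILEGED = {"cli", "sti"}


@dataclass
class VirtualCPU:
    """Per-VM state that the hypervisor keeps in software."""
    interrupts_enabled: bool = True
    traps: int = 0


class Trap(Exception):
    """Raised by the simulated hardware for a privileged instruction in user mode."""


def hardware_execute(instr, user_mode):
    """Simulated CPU: privileged instructions trap when run in user mode."""
    if instr in PRIVILEGED and user_mode:
        raise Trap(instr)
    # Anything else runs directly, without hypervisor involvement.


def emulate(vcpu, instr):
    """Hypervisor trap handler: apply the instruction's effect to the virtual state."""
    vcpu.traps += 1
    if instr == "cli":
        vcpu.interrupts_enabled = False
    elif instr == "sti":
        vcpu.interrupts_enabled = True


def run_guest(vcpu, program):
    for instr in program:
        try:
            hardware_execute(instr, user_mode=True)  # the guest runs de-privileged
        except Trap:
            emulate(vcpu, instr)  # 1) trap, 2) handler, 3) return to the guest


vcpu = VirtualCPU()
run_guest(vcpu, ["mov", "cli", "add", "sti", "cli"])
print(vcpu)
```

Only the three privileged instructions reach the hypervisor; the other two run at native speed, which is the efficiency argument for this scheme.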

x86 virtualization problem I
• Some sensitive instructions are not privileged
• Example: the popf instruction
  o Pops 16 bits from the stack into the flags register
  o The interrupt flag (IF) in the flags register controls interrupts; clearing it masks (i.e. disables) them
  o When popf is executed in user space, the IF flag stays unchanged
  o When popf is executed in kernel space, the IF flag is changed
  o Since popf may be executed in both modes, this instruction is not privileged
  o Since popf has a different result in each mode, this instruction is sensitive (to the execution mode)
• What happens when the guest OS runs popf?
  o The IF flag in Eflags on the host CPU remains unchanged and no trap occurs, whereas we actually want to change the IF flag on the VM's virtual CPU
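The mode-dependent behavior of popf can be modelled in a few lines. This is a deliberately simplified sketch: it assumes that user mode means the flag may not be modified, it ignores all other protected bits, and the function name and signature are made up for illustration. The one hard fact used is that the interrupt-enable flag (named IF in the x86 manuals) is bit 9 of the flags register.

```python
IF_MASK = 1 << 9  # interrupt-enable flag: bit 9 of (E)FLAGS


def popf(flags, popped, kernel_mode):
    """Toy model of popf: return the new flags value.

    In kernel mode every bit is loaded from the popped value.
    In user mode the interrupt flag is silently preserved; no trap is raised.
    """
    if kernel_mode:
        return popped
    return (popped & ~IF_MASK) | (flags & IF_MASK)


flags = IF_MASK                                  # interrupts currently enabled
in_kernel = popf(flags, 0, kernel_mode=True)     # flag is cleared
in_user = popf(flags, 0, kernel_mode=False)      # flag silently stays set
print(bool(in_kernel & IF_MASK), bool(in_user & IF_MASK))
```

Because the user-mode case neither changes the flag nor traps, a de-privileged guest kernel gets a wrong result and the VMM never finds out.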

Trap-and-emulate – difficulties on x86
• Sensitive instructions: provide control over real (i.e. not virtual) HW resources
  o Access to some CPU registers: CR3 register; CS, DS, SS registers
  o Access to the MMU: page table
  o Enable / disable CPU interrupts
  o Timers
  o I/O devices
  By executing a sensitive instruction directly, the guest OS might run incorrectly.
• Privileged instructions: cause a trap if executed in user mode

Theorem [Popek and Goldberg, 1974]: A machine can be virtualized (using trap-and-emulate) if every sensitive instruction is privileged.
This was not supported by x86 processors prior to 2005. In 2005, Intel/AMD introduced virtualization HW support.

Two solutions are possible:
1. Make the guest OS aware of its "guestness"
2. Make all sensitive instructions privileged (so that they trap when the guest OS runs them)
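The Popek and Goldberg condition is a subset test, which can be written down directly. The instruction sets below are a tiny hand-picked sample chosen for illustration; they are an assumption, not a complete classification of the x86 instruction set.

```python
def is_virtualizable(sensitive, privileged):
    """Trap-and-emulate works if every sensitive instruction is privileged."""
    return sensitive <= privileged


def offending(sensitive, privileged):
    """Sensitive instructions that would run without trapping."""
    return sensitive - privileged


# Illustrative sample only: popf is sensitive but not privileged.
sensitive = {"cli", "mov_cr3", "popf"}
privileged = {"cli", "mov_cr3", "hlt"}

print(is_virtualizable(sensitive, privileged))
print(offending(sensitive, privileged))
```

The second solution on the slide amounts to making `offending(...)` return the empty set, which is what the hardware support added in 2005 achieves.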

x86 virtualization problem II
• Some instructions can take segment selectors (cs, ds, ss) as arguments even in user mode, so segment selectors can be read
• The selectors have two bits that hold the current privilege level
  o x86 (beginning with the 386) has four privilege levels (ring 0 to ring 3)
  o The guest OS thinks that it is in ring 0
  o The guest OS is actually in ring 1
• Result: guest OS confusion
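To see how a guest could notice the mismatch, recall that the privilege level sits in the two low bits of a selector. The selector values below are illustrative assumptions, not values taken from any particular OS.

```python
def privilege_level(selector):
    """Return the ring number stored in the two low bits of a segment selector."""
    return selector & 0b11


expected_cs = 0x08   # hypothetical kernel code selector with ring 0 in its low bits
observed_cs = 0x09   # same descriptor index, but the guest really runs in ring 1

print(privilege_level(expected_cs), privilege_level(observed_cs))
```

A guest kernel that reads its own cs and checks these bits sees ring 1 where it expects ring 0, and reading the selector does not trap.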

Outline
• Concepts, classical CPU virtualization
  o Binary translation
• Memory virtualization

Dynamic binary translation (by VMM) of guest OS machine code
• Binary translation is the process of translating one instruction set to another
• Translate the guest OS code on the fly
• The translator reads the next Basic Block (BB) of the guest OS
  o It stops upon a control-flow instruction (i.e. jump, call, loop, ret instructions)
• It decodes the BB instructions and creates Intermediate Representation (IR) objects for them
• It replaces all sensitive and privileged instructions by changes to the appropriate virtual data structure
• IR objects are gathered into a Translation Unit (TU)
• Execute only translated code

Notes:
  o Translation adds overhead for every instruction, similarly to an interpreter.
  o At the end of each TU there is a trap instruction, which activates the VMM to choose the next TU to run.
  o Both privileged and sensitive instructions are translated. This is much faster than trapping on privileged instructions.

[Figure: the guest OS executable is split into basic blocks BB 1 to BB 5, which binary translation turns into translation units TU 1 to TU 5 held in RAM. Ordinary instructions (xor ebx, ebx; mov eax, 1500; test ebx, ebx; jz exit; ...) are copied unchanged, cli becomes vcpu.flags.IF=0, and popf becomes vcpu.popf.]
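A minimal sketch of the block-splitting and rewriting steps, assuming instructions are plain strings and skipping the IR stage entirely. The `CONTROL_FLOW` and `REWRITES` tables and the `trap_to_vmm` marker are hypothetical names introduced here; a real translator also rewrites the terminating branch itself, which this sketch does not attempt.

```python
# Hypothetical tables, for illustration only.
CONTROL_FLOW = {"jmp", "jz", "call", "loop", "ret"}
REWRITES = {"cli": "vcpu.flags.IF = 0", "popf": "vcpu.popf"}


def mnemonic(instr):
    return instr.split()[0]


def split_basic_blocks(code):
    """Cut the instruction stream after every control-flow instruction."""
    blocks, current = [], []
    for instr in code:
        current.append(instr)
        if mnemonic(instr) in CONTROL_FLOW:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def translate_block(block):
    """Build a translation unit: rewrite listed instructions, copy the rest."""
    tu = [REWRITES.get(mnemonic(instr), instr) for instr in block]
    tu.append("trap_to_vmm")  # marker: control returns to the VMM after the TU
    return tu


guest_code = ["xor ebx, ebx", "cli", "test ebx, ebx", "jz exit",
              "mov eax, 1500", "popf", "ret"]
blocks = split_basic_blocks(guest_code)
for block in blocks:
    print(translate_block(block))
```

Most instructions pass through unchanged, so the translated code stays close to native speed; only the rewritten ones touch the virtual CPU state.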

Example:
• The cli (clear interrupt flag) instruction is privileged (on a real CPU)
• It is translated to "vcpu.flags.IF=0" (on the VM's virtual CPU)

Dynamic binary translation with caching
• The translation cache (TC) stores the translations done so far
• Translation of each block occurs only once
• Static translation cannot handle dynamic control transfer, when:
  o A jump depends on the content of a memory address
  o A function is called indirectly (through a function pointer)
• User code is not translated
[Figure: the basic blocks BB 1 to BB 5 of the guest OS executable are binary-translated into TU 1 to TU 5, which are kept in the translation cache in RAM; cli becomes vcpu.flags.IF=0 and popf becomes vcpu.popf]
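The "translate only once" property can be sketched as a dictionary keyed by the guest address of each basic block. The class, the toy `translate` function, and the address 0x1000 are assumptions made for this illustration.

```python
class TranslationCache:
    """Maps the guest address of a basic block to its translation unit."""

    def __init__(self, translate):
        self._translate = translate
        self._cache = {}
        self.misses = 0

    def lookup(self, guest_addr, block):
        if guest_addr not in self._cache:
            self.misses += 1  # first visit: translate and remember
            self._cache[guest_addr] = self._translate(block)
        return self._cache[guest_addr]


def translate(block):
    """Toy translator: rewrite cli, copy everything else unchanged."""
    return ["vcpu.flags.IF = 0" if instr == "cli" else instr for instr in block]


tc = TranslationCache(translate)
loop_body = ["cli", "test ebx, ebx", "jz exit"]
for _ in range(3):            # the same block is executed three times
    tu = tc.lookup(0x1000, loop_body)
print(tc.misses, tu)
```

Keying by address is also what lets the translator cope with indirect jumps: the target is only known at run time, so it is looked up (and translated if missing) when control actually reaches it.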

Virtualization prior to HW support
Figure 7-4. The binary translation rewrites the guest operating system running in ring 1, while the hypervisor runs in ring 0.

VMware binary translation: example
[Figure: the guest OS executable (BB 1 to BB 5) goes through binary translation; the first TU becomes a compiled code fragment (CCF). Most of the code is identical (unchanged); the conditional jump ‘jge prime’ is translated.]

VMware binary translation example: output
[Figure: translated output]
These continuations remain because the respective basic blocks were not executed.

Outline
• Concepts, classical CPU virtualization
  o Binary translation
• Memory virtualization

Memory allocation
• Each VM usually receives a contiguous set of physical addresses
  o usually 1-4 GB
• The guest OS allocates pages to guest processes
• The VMM must ensure that the virtual pages of guest OS processes are mapped only to the page frames assigned to that VM
• The VMM manages the partitioning of memory among VMs
[Figure: RAM partitioned into regions belonging to different VMs]

Memory management
• Assumptions of the guest OS:
  o Physical memory is a contiguous block of addresses from 0 to some n
  o The guest OS can map any virtual page to any page frame
• The hypervisor must:
  o Partition memory among VMs
  o Ensure that virtual pages are mapped only to assigned page frames
• TLB miss
  o A miss in a HW-managed TLB (e.g. x86) causes the HW to look up the page in the page table
• The VM's OS must not manage the real (host) page table

Option 1: brute force
• Guest page tables are read- and write-protected in the host system
  o R/W access is disabled so that a trap occurs when the guest OS tries to read or edit the page directory or a page table
• If the guest OS reads a page table (e.g. for page eviction), writes a page table (e.g. after a page fault), or changes CR3, the system traps
• The hypervisor then uses a VM memory layout in RAM to:
  o Return answers to the VM (e.g. which page to evict)
  o Update the RAM layout (e.g. add a mapping of a virtual page to an allocated physical frame)
• The hypervisor switches the VM memory layout when a new VM is scheduled
[Figure: the guest OS page table and the hypervisor's VM memory layout; CR3, TLB and MMU in the CPU. An access raises an interrupt, and then the VMM treats the page table access correctly.]

Option 2: shadow page tables
• The hypervisor maintains “shadow page tables”
• Guest page tables map: Guest VA (GVA) → Guest PA (GPA)
• Shadow page tables map: Guest VA → Host PA (HPA); this is the real mapping
• When a guest process accesses a virtual address, the MMU translates GVA to HPA correctly, because it is aware only of the shadow page table
• The hypervisor is not involved when the guest OS updates its page table
  o Result: possibly inconsistent guest page table and shadow page table
VA – virtual address; PA – physical address
[Figure: the guest OS page table (G-CR3) and the hypervisor's shadow page table, which is the one used by the CPU's MMU and TLB; on an interrupt the VMM corrects the page table]

Option 2: shadow page tables
• If the address is in the TLB (i.e. a TLB hit), the mapping found is correct and there is nothing to do
• When a guest process causes a page fault (i.e. the virtual page is marked as “absent” in the shadow page table):
  o The hypervisor begins execution
  o The hypervisor must check the guest page table to detect which of two possible scenarios happened:
    § Guest page fault: there is no translation in the guest page tables, so the guest OS must handle the page fault (and run its page replacement algorithm if needed)
    § Guest translation found: the hypervisor must update the shadow page table accordingly
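The two scenarios can be sketched with page tables modelled as dictionaries from page number to page number. All names here are hypothetical, and `gpa_to_hpa` stands in for whatever record the hypervisor keeps of the frames it assigned to the VM.

```python
class GuestPageFault(Exception):
    """The fault must be forwarded to the guest OS."""


def handle_shadow_fault(gva_page, guest_pt, gpa_to_hpa, shadow_pt):
    """Hypervisor handler for a page marked absent in the shadow page table."""
    if gva_page not in guest_pt:
        # Scenario 1: the guest has no translation either.
        raise GuestPageFault(gva_page)
    # Scenario 2: the guest mapping exists; fill in the shadow entry.
    gpa_page = guest_pt[gva_page]
    shadow_pt[gva_page] = gpa_to_hpa[gpa_page]
    return shadow_pt[gva_page]


guest_pt = {0: 5}        # GVA page 0 -> GPA page 5
gpa_to_hpa = {5: 42}     # GPA page 5 -> HPA frame 42 (assigned by the hypervisor)
shadow_pt = {}

print(handle_shadow_fault(0, guest_pt, gpa_to_hpa, shadow_pt), shadow_pt)
```

After the handler runs, the MMU can resolve GVA page 0 straight from the shadow table, so later accesses to that page need no hypervisor involvement.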

Option 2: shadow page tables
• There is no performance overhead as long as there are no page faults
• Shadow page tables should be cached, so that once a VM is rescheduled its page table does not have to be rebuilt from scratch

Shadow page tables – updating CR3
[Figure sequence: virtual CR3 and real CR3]

Undiscovered guest page table
[Figure sequence: virtual CR3 and real CR3]

Option 3: Extended/nested page tables
• The name implies having page tables within page tables
• The essence of the idea is a hardware assist
  o The hardware has an extra pointer and the ability to walk an extra set of page tables
  o The idea is called Extended Page Tables (EPT) by Intel
• Guest page tables hold the Guest VA → Guest PA mapping, accessed through the standard CR3
• Extended page tables hold the Host VA → Host PA mapping, accessed through EPTP (the EPT pointer)
• Host VA = Guest PA
[Figure: the hypervisor (VMM) holds a host page table for each of VM 1, VM 2 and VM 3; the CPU's MMU uses CR3 for the guest OS page table and EPTP for the host page table]

Walking extended page tables
[Figure: walk through the extended page tables]

Option 3: Extended/nested page tables
• The TLB, as usual, holds Guest VA → Host PA
• On a memory access:
  o If the address is found in the TLB, there is no problem
  o If it is not in the TLB but there is no page fault, the hardware walks both tables and updates the TLB
  o If there is a page fault, the hypervisor gets the host virtual page (guest physical page) and maps it to a host physical page
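The three cases on a memory access can be sketched with single-level tables modelled as dictionaries. This is a simplification made for illustration: real hardware walks multi-level tables, and the function name, the `free_frames` list, and the frame numbers are assumptions. Guest page faults (a missing guest page table entry) are not modelled.

```python
def access(gva_page, tlb, guest_pt, ept, free_frames):
    """Translate a guest virtual page to a host physical frame."""
    if gva_page in tlb:
        return tlb[gva_page]              # TLB hit: nothing else to do
    gpa_page = guest_pt[gva_page]         # first walk: Guest VA -> Guest PA
    if gpa_page not in ept:
        # Fault on the second table: the hypervisor maps the guest
        # physical page to a free host physical frame.
        ept[gpa_page] = free_frames.pop()
    hpa_page = ept[gpa_page]              # second walk: Guest PA -> Host PA
    tlb[gva_page] = hpa_page              # the TLB caches Guest VA -> Host PA
    return hpa_page


tlb = {}
guest_pt = {0: 7, 1: 8}
ept = {7: 100}
free_frames = [200]

first = access(0, tlb, guest_pt, ept, free_frames)    # both walks succeed
second = access(1, tlb, guest_pt, ept, free_frames)   # hypervisor fills the EPT
print(first, second, tlb)
```

Unlike shadow page tables, the guest can edit `guest_pt` freely here without the hypervisor keeping a second copy in sync, since the hardware combines the two tables on every TLB miss.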

Sources
• “Modern Operating Systems”, 4th edition, A. Tanenbaum and H. Bos
• “Virtual Machines”, J. E. Smith and R. Nair
• A presentation by Niv Gilboa from CSE@BGU
• “Formal requirements for virtualizable third generation architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974
• “A comparison of software and hardware techniques for x86 virtualization”, K. Adams and O. Agesen, ASPLOS 2006
• A presentation by VMware