Xen 3.0 and the Art of Virtualization

Xen 3.0 and the Art of Virtualization
Ian Pratt, XenSource Inc. and University of Cambridge
Keir Fraser, Steve Hand, Christian Limpach and many others…

Outline
• Virtualization Overview
• Xen Architecture
• New Features in Xen 3.0
• VM Relocation
• Xen Roadmap
• Questions

Virtualization Overview
• Single OS image: OpenVZ, Vservers, Zones
  - Group user processes into resource containers
  - Hard to get strong isolation
• Full virtualization: VMware, VirtualPC, QEMU
  - Run multiple unmodified guest OSes
  - Hard to efficiently virtualize x86
• Para-virtualization: Xen
  - Run multiple guest OSes ported to a special arch
  - The Xen/x86 arch is very close to normal x86

Virtualization in the Enterprise
• Consolidate under-utilized servers
• Avoid downtime with VM Relocation
• Dynamically re-balance workload to guarantee application SLAs
• Enforce security policy

Xen 2.0 (5 Nov 2004)
• Secure isolation between VMs
• Resource control and QoS
• Only the guest kernel needs to be ported
  - User-level apps and libraries run unmodified
  - Linux 2.4/2.6, NetBSD, FreeBSD, Plan 9, Solaris
• Execution performance close to native
• Broad x86 hardware support
• Live relocation of VMs between Xen nodes

Para-Virtualization in Xen
• Xen extensions to the x86 arch
  - Like x86, but Xen is invoked for privileged ops (sketch below)
  - Avoids binary rewriting
  - Minimizes the number of privilege transitions into Xen
  - Modifications relatively simple and self-contained
• Modify the kernel to understand the virtualised environment
  - Wall-clock time vs. virtual processor time
    - Want both types of alarm timer
  - Expose real resource availability
    - Enables the OS to optimise its own behaviour
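To make the hypercall idea concrete, here is a small, self-contained sketch of how a paravirtualized guest kernel might route a privileged operation (arming a one-shot alarm timer) through the hypervisor instead of executing it directly. The hypercall number is assumed from the classic Xen ABI, and the stub only logs the request; it is not the real xen-linux code.

```c
#include <stdint.h>
#include <stdio.h>

#define __HYPERVISOR_set_timer_op 15   /* hypercall index, assumed from the classic ABI */

static long hypercall1(unsigned int op, uint64_t arg)
{
    /* On real Xen/x86_32 this would trap into the hypervisor (historically
     * via int 0x82) with op and arg in registers.  Here we only log the
     * request so the sketch runs anywhere. */
    printf("hypercall %u, arg=%llu\n", op, (unsigned long long)arg);
    return 0;
}

/* Guest kernel asks Xen for a one-shot alarm at an absolute system time,
 * one of the two kinds of alarm timer mentioned above. */
static void guest_set_alarm(uint64_t deadline_ns)
{
    hypercall1(__HYPERVISOR_set_timer_op, deadline_ns);
}

int main(void)
{
    guest_set_alarm(1000000000ULL);    /* fire roughly 1 s into system time */
    return 0;
}
```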

Xen 3.0 Architecture (diagram): VM0 hosts the Device Manager & Control s/w on a GuestOS (XenLinux) with native device drivers, back-end drivers and the control interface; VM1 and VM2 run unmodified user software on GuestOSes (XenLinux) with front-end device drivers; VM3 runs unmodified user software on an unmodified GuestOS (WinXP) via VT-x, also using front-end device drivers. The Xen Virtual Machine Monitor provides the control IF, safe HW IF, event channels, virtual CPU and virtual MMU over the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE), with AGP/ACPI/PCI support, on x86_32, x86_64 and IA64.

I/O Architecture
• Xen IO-Spaces delegate protected access to specified h/w devices to guest OSes
  - Virtual PCI configuration space
  - Virtual interrupts
  - (Need an IOMMU for full DMA protection)
• Devices are virtualised and exported to other VMs via Device Channels (sketch below)
  - Safe asynchronous shared-memory transport
  - 'Backend' drivers export to 'frontend' drivers
  - Net: use normal bridging, routing, iptables
  - Block: export any blk dev, e.g. sda4, loop0, vg3
• (Infiniband / "Smart NICs" for direct guest IO)
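As a rough illustration of a device channel, the sketch below shows a minimal single-producer/single-consumer shared-memory ring between a hypothetical frontend and backend block driver. The real Xen rings (io/ring.h) add grant references, memory barriers and event-channel notification, all omitted here.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_SIZE 8                      /* must be a power of two */

struct request { uint64_t sector; uint32_t nr_sectors; };

struct ring {
    volatile uint32_t req_prod;          /* advanced by the frontend */
    volatile uint32_t req_cons;          /* advanced by the backend  */
    struct request req[RING_SIZE];
};

/* Frontend: queue a block request if there is space. */
static int frontend_put(struct ring *r, const struct request *rq)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                       /* ring full */
    r->req[r->req_prod % RING_SIZE] = *rq;
    r->req_prod++;                       /* real code: write barrier, then notify via event channel */
    return 0;
}

/* Backend: consume one request if available. */
static int backend_get(struct ring *r, struct request *rq)
{
    if (r->req_cons == r->req_prod)
        return -1;                       /* ring empty */
    *rq = r->req[r->req_cons % RING_SIZE];
    r->req_cons++;
    return 0;
}

int main(void)
{
    struct ring r;
    struct request rq = { .sector = 2048, .nr_sectors = 8 }, out;
    memset(&r, 0, sizeof r);
    frontend_put(&r, &rq);
    if (backend_get(&r, &out) == 0)
        printf("backend got request: sector %llu, %u sectors\n",
               (unsigned long long)out.sector, out.nr_sectors);
    return 0;
}
```

Free-running producer/consumer counters with a power-of-two ring size keep the full/empty tests simple, which is the same basic trick the real shared rings rely on.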

System Performance (chart): relative performance (0.0–1.1) on SPEC INT2000 (score), Linux build time (s), OSDB-OLTP (tup/s) and SPECweb99 (score), for the benchmark suite running on Linux (L), Xen (X), VMware Workstation (V), and UML (U).

Scalability (chart): aggregate score for 2, 4, 8 and 16 simultaneous SPECweb99 instances on Linux (L) and Xen (X), y-axis 0–1000.

x86_32 (diagram: user (U, ring 3) occupies 0–3 GB, the kernel (S, ring 1) sits above 3 GB, and Xen (S, ring 0) is reserved at the top of the 4 GB address space)
• Xen reserves the top of the VA space
• Segmentation protects Xen from the kernel
• System call speed unchanged
• Xen 3 now supports PAE for >4 GB memory

x86_64 (diagram: user (U) occupies 0 to 2^47, a reserved non-canonical hole follows, and the kernel (U) and Xen (S) occupy the top 2^47 bytes below 2^64)
• Large VA space makes life a lot easier, but:
• No segment limit support
  - Need to use page-level protection to protect the hypervisor

x86_64 (diagram: user-space and the kernel both run in ring 3 (U), Xen runs in ring 0 (S); syscall/sysret transitions pass through Xen)
• Run user-space and kernel in ring 3 using different pagetables
  - Two PGDs (PML4s): one with user entries; one with user plus kernel entries
• System calls require an additional syscall/sysret via Xen
• Per-CPU trampoline to avoid needing GS in Xen

Para-Virtualizing the MMU
• Guest OSes allocate and manage their own PTs
  - Hypercall to change the PT base (sketch below)
• Xen must validate PT updates before use
  - Allows incremental updates, avoids revalidation
• Validation rules applied to each PTE:
  1. Guest may only map pages it owns*
  2. Pagetable pages may only be mapped RO
• Xen traps PTE updates and emulates them, or 'unhooks' the PTE page for bulk updates
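A minimal sketch of the validated-update path described above: rather than writing PTEs directly, the guest batches updates and hands them to Xen in one hypercall, amortizing the privilege transition. The request layout is modelled on Xen's mmu_update interface, but the validating hypercall is stubbed and the machine addresses are invented.

```c
#include <stdint.h>
#include <stdio.h>

struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE to update */
    uint64_t val;   /* new PTE value (machine frame | flags) */
};

/* Stand-in for HYPERVISOR_mmu_update(); real Xen validates each entry:
 * the guest may only map frames it owns, and page-table pages only RO. */
static int hypervisor_mmu_update(const struct mmu_update *reqs, int count)
{
    for (int i = 0; i < count; i++)
        printf("validate and apply PTE @%#llx = %#llx\n",
               (unsigned long long)reqs[i].ptr,
               (unsigned long long)reqs[i].val);
    return 0;
}

int main(void)
{
    /* Batch two updates to amortize the privilege transition into Xen. */
    struct mmu_update batch[2] = {
        { .ptr = 0x1000008, .val = 0x2000ULL | 0x1 /* present    */ },
        { .ptr = 0x1000010, .val = 0x3000ULL | 0x3 /* present|rw */ },
    };
    return hypervisor_mmu_update(batch, 2);
}
```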

Writeable Page Tables: 1 – Write fault (diagram: guest reads go through the virtual → machine page table as usual; the first guest write to a page-table page faults from the guest OS into the Xen VMM)

Writeable Page Tables: 2 – Emulate? (diagram: on the write fault, Xen decides whether to simply emulate the single PTE update)

Writeable Page Tables: 3 – Unhook (diagram: otherwise Xen unhooks the page-table page from the virtual → machine mapping, so subsequent guest reads and writes to it no longer trap)

Writeable Page Tables: 4 – First Use (diagram: when the guest next uses the unhooked page table, the access page-faults into the Xen VMM)

Writeable Page Tables: 5 – Re-hook (diagram: Xen validates the modified entries and re-hooks the page table, and guest reads and writes proceed through the virtual → machine mapping again)
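Putting the five diagrams together, this hypothetical sketch captures the decision Xen makes on a write fault to a page-table page (emulate a single update, or unhook the page for bulk updates) and the validate-and-re-hook step when the page table is next used. The helpers and heuristic are invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

struct pt_page { int pending_writes; bool hooked; };

static void validate_and_rehook(struct pt_page *pt)
{
    printf("validating %d updated PTEs, re-hooking page\n", pt->pending_writes);
    pt->pending_writes = 0;
    pt->hooked = true;
}

/* Called by Xen when the guest write-faults on a page-table page. */
static void on_pt_write_fault(struct pt_page *pt, bool expect_more_writes)
{
    if (!expect_more_writes) {
        printf("emulating single PTE write\n");      /* step 2: emulate  */
        return;
    }
    pt->hooked = false;                              /* step 3: unhook   */
    printf("unhooked page for bulk updates\n");
}

/* Called when the MMU faults because the unhooked page table is used again. */
static void on_pt_first_use(struct pt_page *pt)
{
    if (!pt->hooked)
        validate_and_rehook(pt);                     /* steps 4 and 5    */
}

int main(void)
{
    struct pt_page pt = { .pending_writes = 0, .hooked = true };
    on_pt_write_fault(&pt, true);
    pt.pending_writes = 42;       /* guest writes PTEs directly, untrapped */
    on_pt_first_use(&pt);
    return 0;
}
```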

MMU Micro-Benchmarks (chart): lmbench page fault (µs) and process fork (µs) results, relative scale 0.0–1.1, on Linux (L), Xen (X), VMware Workstation (V), and UML (U).

SMP Guest Kernels
• Xen extended to support multiple VCPUs
  - Virtual IPIs sent via Xen event channels
  - Currently up to 32 VCPUs supported
• Simple hotplug/unplug of VCPUs
  - From within the VM or via control tools
  - Optimise the one-active-VCPU case by binary patching spinlocks (sketch below)
• NB: many applications exhibit poor SMP scalability; they are often better off running as multiple instances, each in their own OS
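One way to picture the spinlock optimisation mentioned above: while only one VCPU is online, lock operations can be reduced to near no-ops and switched back when a second VCPU appears. The real kernel binary-patches the lock instructions in place; this sketch merely swaps a function pointer to convey the idea.

```c
#include <stdio.h>

static void lock_uniprocessor(void) { /* nothing to do: no other VCPU can race with us */ }
static void lock_smp(void)          { printf("taking real spinlock\n"); }

/* Current spinlock implementation; the real kernel patches the lock code in
 * place rather than calling through a pointer. */
static void (*spin_lock)(void) = lock_uniprocessor;

/* Called from the VCPU hotplug/unplug path. */
static void vcpus_online_changed(int nr_online)
{
    spin_lock = (nr_online > 1) ? lock_smp : lock_uniprocessor;
}

int main(void)
{
    spin_lock();                 /* cheap path while a single VCPU is active  */
    vcpus_online_changed(2);     /* second VCPU hot-plugged: use real locking */
    spin_lock();
    return 0;
}
```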

SMP Guest Kernels
• Takes great care to get good SMP performance while remaining secure
  - Requires extra TLB synchronization IPIs
• SMP scheduling is a tricky problem
  - Wish to run all VCPUs at the same time
  - But strict gang scheduling is not work-conserving
  - Opportunity for a hybrid approach
• The paravirtualized approach enables several important benefits
  - Avoids many virtual IPIs
  - Allows 'bad preemption' avoidance
  - Auto hotplug/unplug of CPUs

VT-x / Pacifica: hvm
• Enables guest OSes to be run without modification
  - E.g. legacy Linux, Windows XP/2003
• CPU provides vmexits for certain privileged instrs
• Shadow page tables used to virtualize the MMU
• Xen provides simple platform emulation
  - BIOS, APIC, IOAPIC, RTC, Net (pcnet32), IDE emulation
• Install paravirtualized drivers after booting for high-performance IO
• Possibility for CPU and memory paravirtualization
  - Non-invasive hypervisor hints from the OS

HVM architecture (diagram): Domain 0 (64-bit XenLinux) runs the control panel (xm/xend), native device drivers, backend virtual drivers and the IO emulation for the guest BIOS / virtual platform; Domain N runs a paravirtualized 64-bit XenLinux with frontend virtual drivers; 32-bit and 64-bit VMX guest VMs run unmodified OSes, trapping to Xen via VMExit and serviced through callbacks/hypercalls and event channels. The Xen hypervisor provides the control interface, processor scheduler, event channels, memory management, hypercalls and platform I/O (PIT, APIC, IOAPIC).

MMU Virtualization: Shadow-Mode (diagram: the guest OS reads and writes its own virtual → pseudo-physical page tables; the VMM propagates the updates into shadow virtual → machine tables that the MMU hardware walks, and reflects accessed & dirty bits back to the guest)
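A minimal sketch of the propagation step in shadow mode, under assumed data structures: a guest PTE names a pseudo-physical frame, and the VMM rewrites it through a per-domain physical-to-machine (p2m) map to produce the shadow entry the MMU actually walks.

```c
#include <stdint.h>
#include <stdio.h>

#define PTE_FLAGS_MASK 0xFFFULL

/* Hypothetical pseudo-physical -> machine frame map for one small domain. */
static const uint64_t p2m[4] = { 0x8a000, 0x12000, 0x77000, 0x03000 };

/* Build the shadow (virtual -> machine) PTE from a guest (virtual ->
 * pseudo-physical) PTE; the real VMM also tracks accessed/dirty bits. */
static uint64_t shadow_pte(uint64_t guest_pte)
{
    uint64_t pfn   = guest_pte >> 12;            /* pseudo-physical frame */
    uint64_t flags = guest_pte & PTE_FLAGS_MASK;
    return (p2m[pfn] << 12) | flags;
}

int main(void)
{
    uint64_t gpte = (2ULL << 12) | 0x7;          /* pfn 2, present|rw|user */
    printf("guest PTE %#llx -> shadow PTE %#llx\n",
           (unsigned long long)gpte, (unsigned long long)shadow_pte(gpte));
    return 0;
}
```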

Xen Tools (diagram): in dom0, the management interfaces (xm, CIM, web services) sit on xmlib, which uses xenstore, the domain builder, control and save/restore logic, and libxc; requests reach Xen through the privileged command (dom0_op) interface, and guest domains such as dom1 are wired up via the xenbus back end / front end.

VM Relocation: Motivation (diagram: a VM moving between two Xen hosts)
• VM relocation enables:
  - High availability
    - Machine maintenance
  - Load balancing
    - Statistical multiplexing gain

Assumptions (diagram: Xen hosts attached to shared storage)
• Networked storage
  - NAS: NFS, CIFS
  - SAN: Fibre Channel
  - iSCSI, network block dev
  - DRBD network RAID
• Good connectivity
  - Common L2 network
  - L3 re-routing

Challenges
• VMs have lots of state in memory
• Some VMs have soft real-time requirements
  - E.g. web servers, databases, game servers
  - May be members of a cluster quorum
  ⇒ Minimize down-time
• Performing relocation requires resources
  ⇒ Bound and control the resources used

Relocation Strategy (sketch of the pre-copy loop below)
• Stage 0: pre-migration. VM active on host A; destination host selected (block devices mirrored).
• Stage 1: reservation. Initialize container on target host.
• Stage 2: iterative pre-copy. Copy dirty pages in successive rounds.
• Stage 3: stop-and-copy. Suspend VM on host A; redirect network traffic; synchronize remaining state.
• Stage 4: commitment. Activate VM on host B; VM state on host A released.
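The iterative pre-copy stage can be summarised by the loop sketched below: copy every page once, then keep resending only the pages dirtied during the previous round until the remaining dirty set is small (or a round limit is hit), and finally stop the VM to copy the residue. The thresholds and dirty-tracking interface here are invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES       1024
#define STOP_THRESHOLD 16     /* small enough to copy during the brief downtime */
#define MAX_ROUNDS     10

static bool dirty[NR_PAGES];  /* stand-in for Xen's log-dirty bitmap */

static void send_page(int pfn) { (void)pfn; /* transport omitted */ }

/* Count (and in real code, fetch and reset) the pages dirtied this round. */
static int collect_dirty_pages(void)
{
    int n = 0;
    for (int pfn = 0; pfn < NR_PAGES; pfn++)
        if (dirty[pfn])
            n++;
    return n;
}

static void migrate(void)
{
    int to_send = NR_PAGES, round = 0;

    /* Stage 2: round 1 sends every page; later rounds resend only pages
     * dirtied while the previous round was being transmitted. */
    while (to_send > STOP_THRESHOLD && round < MAX_ROUNDS) {
        for (int pfn = 0; pfn < NR_PAGES; pfn++)
            if (round == 0 || dirty[pfn])
                send_page(pfn);
        to_send = collect_dirty_pages();
        round++;
    }

    /* Stage 3: pause the VM, send the remaining dirty pages and CPU state. */
    printf("stop-and-copy after %d rounds with %d pages left\n", round, to_send);
}

int main(void)
{
    migrate();
    return 0;
}
```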

Pre-Copy Migration: Round 1

Pre-Copy Migration: Round 2

Pre-Copy Migration: Final

Web Server Relocation

Iterative Progress: SPECWeb 52 s

Quake 3 Server relocation

Current Status (table): feature support matrix across the x86_32, x86_32p (PAE), x86_64, IA64 and Power ports, covering Privileged Domains, Guest Domains, SMP Guests, Save/Restore/Migrate, >4 GB memory, VT, and Driver Domains.

3.1 Roadmap
• Improved full-virtualization support
  - Pacifica / VT-x abstraction
  - Enhanced IO emulation
• Enhanced control tools
• Performance tuning and optimization
  - Less reliance on manual configuration
• NUMA optimizations
• Virtual bitmap framebuffer and OpenGL
• Infiniband / "Smart NIC" support

IO Virtualization
• IO virtualization in s/w incurs overhead
  - Latency vs. overhead tradeoff
    - More of an issue for network than storage
  - Can burn 10-30% more CPU
• The solution is well understood
  - Direct h/w access from VMs
    - Multiplexing and protection implemented in h/w
  - Smart NICs / HCAs
    - Infiniband, Level 5, Aarohi, etc.
    - Will become commodity before too long

Research Roadmap
• Whole-system debugging
  - Lightweight checkpointing and replay
  - Cluster/distributed system debugging
• Software-implemented h/w fault tolerance
  - Exploit deterministic replay
• Multi-level secure systems with Xen
• VM forking
  - Lightweight service replication, isolation

Conclusions
• Xen is a complete and robust hypervisor
• Outstanding performance and scalability
• Excellent resource control and protection
• Vibrant development community
• Strong vendor support
• Try the demo CD to find out more! (or Fedora 4/5, SUSE 10.x)
• http://xensource.com/community

Thanks!
• XenSource is looking for great hackers to work full-time on Xen in the Cambridge, UK office. If you're interested, please send me email!
• ian@xensource.com