OpenStack: NUMA Technology and Its Application in Nova

Who?
1. Name: Rui Chen (陈锐)
2. Employer: Huawei
3. Weibo: @kiwik
4. Blog: http://kiwik.github.io

Contents
A. NUMA
B. Nova & NUMA
C. Nova & NUMA & ?

[1] NUMA: Non-Uniform Memory Access

Architecture: SMP, NUMA and MPP
1. SMP: Symmetric Multi-Processor
2. NUMA: Non-Uniform Memory Access
3. MPP: Massive Parallel Processing

SMP: sharing RAM

NUMA: sharing RAM, local access is faster

MPP: sharing nothing

NUMA Topology (diagram)

[2] Nova & NUMA: guest NUMA node placement & topology

Use Case #1: Allocate the guest's vCPUs and RAM from a single host NUMA node, avoiding cross-node RAM access, improving guest performance, and reducing unpredictable latency. Example 1: hw:numa_nodes=1
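A minimal sketch of setting this extra spec on a flavor; the flavor name chenrui_f1 is a hypothetical example, not from the slides:
$ nova flavor-key chenrui_f1 set hw:numa_nodes=1
# equivalent with the unified client:
$ openstack flavor set chenrui_f1 --property hw:numa_nodes=1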

Use Case #2: When the guest's vCPUs and RAM exceed a single host NUMA node, split the guest into multiple guest NUMA nodes and map them onto the host NUMA topology. This lets the guest OS see the guest NUMA layout and optimize application resource scheduling, e.g. for a database. Example 2: hw:numa_nodes=N
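A minimal sketch (hypothetical flavor name); to my understanding, when only hw:numa_nodes is set Nova splits the vCPUs and RAM evenly across the guest nodes, and the per-node extra specs on the configuration slide below allow uneven layouts:
$ nova flavor-key chenrui_f2 set hw:numa_nodes=2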

What is a guest in QEMU/KVM?
1. Guest: a Linux process on the compute node
2. vCPU: a special thread inside the guest process, scheduled by the host OS
How to place a guest into a NUMA node? (see the sketch below)
1. Schedule the vCPU threads onto the pCPU set of a NUMA node
2. Allocate RAM from the local node
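As a conceptual illustration only (Nova/libvirt do this through per-vCPU pinning in the domain XML, not through numactl), binding an arbitrary process, such as a QEMU guest, to one NUMA node looks like this:
$ numactl --cpunodebind=0 --membind=0 qemu-system-x86_64 <qemu args omitted>
# --cpunodebind=0: run only on the CPUs of node 0; --membind=0: allocate RAM only from node 0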

Related configuration (flavor & image extra specs); example commands follow below:
• hw:numa_nodes=NN – number of guest NUMA nodes
• hw:numa_mempolicy=preferred|strict – RAM allocation policy (not implemented in the Juno code)
• hw:numa_cpus.0=<cpu-list> – vCPUs in guest NUMA node 0
• hw:numa_cpus.1=<cpu-list> – vCPUs in guest NUMA node 1
• hw:numa_mem.0=<ram-size> – RAM size allocated in guest NUMA node 0
• hw:numa_mem.1=<ram-size> – RAM size allocated in guest NUMA node 1
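A sketch of how the chenrui_f flavor shown two slides below could have been created; the names and values come from that slide, while the exact commands are a reconstruction:
$ nova flavor-create chenrui_f 11 512 1 8        # name, ID, RAM (MB), disk (GB), vCPUs
$ nova flavor-key chenrui_f set hw:numa_nodes=2
$ nova flavor-key chenrui_f set hw:numa_cpus.0=0,1,2,3,4,5 hw:numa_cpus.1=6,7
$ nova flavor-key chenrui_f set hw:numa_mem.0=384 hw:numa_mem.1=128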

Nova implementation workflow
1. nova-api: generates the guest NUMA topology and saves it in instance_extra
2. nova-scheduler: decides whether a host can satisfy the guest NUMA topology
3. nova-compute: instance_claim checks whether the host's NUMA resources are sufficient and builds the mapping from guest NUMA nodes to host NUMA nodes
4. libvirt driver: generates the corresponding libvirt.xml from that mapping
5. resource tracker: refreshes the host's NUMA resource usage

Host & Guest NUMA topology

Host NUMA: 2 NUMA nodes, ~40 GB RAM per node, CPU & RAM distances:

stack@openstack:/home/devstack [master]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 4 5 6 7 12 13 14 15
node 0 size: 40232 MB
node 0 free: 36833 MB
node 1 cpus: 0 1 2 3 8 9 10 11
node 1 size: 40318 MB
node 1 free: 37822 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Guest NUMA (Nova flavor chenrui_f): 8 vCPUs, 2 NUMA nodes, 512 MB RAM:

+-----------------------------+--------------------------------------------------------------+
| Property                    | Value                                                        |
+-----------------------------+--------------------------------------------------------------+
| OS-FLV-DISABLED: disabled   | False                                                        |
| OS-FLV-EXT-DATA: ephemeral  | 0                                                            |
| disk                        | 1                                                            |
| extra_specs                 | {"hw:numa_cpus.0": "0,1,2,3,4,5", "hw:numa_cpus.1": "6,7",   |
|                             |  "hw:numa_nodes": "2", "hw:numa_mem.1": "128",               |
|                             |  "hw:numa_mem.0": "384"}                                     |
| id                          | 11                                                           |
| name                        | chenrui_f                                                    |
| os-flavor-access: is_public | True                                                         |
| ram                         | 512                                                          |
| rxtx_factor                 | 1.0                                                          |
| swap                        |                                                              |
| vcpus                       | 8                                                            |
+-----------------------------+--------------------------------------------------------------+
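For reference, a guest using this flavor would be booted in the usual way; the image placeholder and the instance name chenrui_vm are illustrative, not from the slides:
$ nova boot --flavor chenrui_f --image <image-id> chenrui_vm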

libvirt.xml
• pin each vCPU to a pCPU set in <cputune>
• export the guest NUMA topology in <cpu>/<numa>

<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>524288</currentMemory>
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='4-7,12-15'/>
  <vcpupin vcpu='1' cpuset='4-7,12-15'/>
  <vcpupin vcpu='2' cpuset='4-7,12-15'/>
  <vcpupin vcpu='3' cpuset='4-7,12-15'/>
  <vcpupin vcpu='4' cpuset='4-7,12-15'/>
  <vcpupin vcpu='5' cpuset='4-7,12-15'/>
  <vcpupin vcpu='6' cpuset='0-3,8-11'/>
  <vcpupin vcpu='7' cpuset='0-3,8-11'/>
</cputune>
<cpu>
  <topology sockets='8' cores='1' threads='1'/>
  <numa>
    <cell cpus='0-5' memory='393216'/>
    <cell cpus='6-7' memory='131072'/>
  </numa>
</cpu>
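One way to inspect the XML libvirt is actually using for the running guest; domain ID 10 is the one queried with virsh vcpuinfo on the next slide, and Nova also keeps a copy named libvirt.xml in the instance directory:
$ sudo virsh dumpxml 10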

Double check: numastat & virsh vcpuinfo

stack@openstack:~/logs$ sudo numastat -c qemu-system-x86_64

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
7161 (qemu-syste    703     12   714
11839 (qemu-syst    700     13   712
12774 (qemu-syst    162    550   713
15115 (qemu-syst     12    698   711
16379 (qemu-syst    111      5   117
17493 (sudo)                       2
---------------  ------ ------ -----
Total              1689   1279  2969

stack@openstack:~/data/nova/instances/3722ba51-cefa-48e1-b11e-fd6ec14865ee$ virsh vcpuinfo 10
VCPU:           0
CPU:            4
State:          running
CPU time:       1.9s
CPU Affinity:   ----yyyy----yyyy

VCPU:           1
CPU:            15
State:          running
CPU time:       0.4s
CPU Affinity:   ----yyyy----yyyy

VCPU:           2
CPU:            5
State:          running
CPU time:       0.0s
CPU Affinity:   ----yyyy----yyyy

VCPU:           3
CPU:            12
State:          running
CPU time:       0.0s
CPU Affinity:   ----yyyy----yyyy

VCPU:           4
CPU:            13
State:          running
CPU time:       0.0s
CPU Affinity:   ----yyyy----yyyy

VCPU:           5
CPU:            14
State:          running
CPU time:       0.1s
CPU Affinity:   ----yyyy----yyyy

VCPU:           6
CPU:            2
State:          running
CPU time:       0.0s
CPU Affinity:   yyyy----yyyy----

VCPU:           7
CPU:            0
State:          running
CPU time:       0.1s
CPU Affinity:   yyyy----yyyy----
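If the guest image includes numactl (an assumption about the image, not from the slides), the exported topology can also be confirmed from inside the guest:
$ numactl --hardware   # run inside the guest; should report 2 nodes: 6 vCPUs / ~384 MB and 2 vCPUs / ~128 MB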

Where is RAM?

<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>524288</currentMemory>
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='4-7,12-15'/>
  <vcpupin vcpu='1' cpuset='4-7,12-15'/>
  <vcpupin vcpu='2' cpuset='4-7,12-15'/>
  <vcpupin vcpu='3' cpuset='4-7,12-15'/>
  <vcpupin vcpu='4' cpuset='4-7,12-15'/>
  <vcpupin vcpu='5' cpuset='4-7,12-15'/>
  <vcpupin vcpu='6' cpuset='0-3,8-11'/>
  <vcpupin vcpu='7' cpuset='0-3,8-11'/>
</cputune>
<cpu>
  <topology sockets='8' cores='1' threads='1'/>
  <numa>
    <cell cpus='0-5' memory='393216'/>
    <cell cpus='6-7' memory='131072'/>
  </numa>
</cpu>

Why? The kernel will always try to allocate memory from the NUMA node that matches the one the guest CPUs are executing on.
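The XML above contains no explicit memory binding; RAM placement follows from the vCPU pinning plus the kernel behaviour described above. For comparison, libvirt can also pin guest memory per cell explicitly with <numatune>; a sketch consistent with the mapping above (standard libvirt syntax, not what this Nova/Juno version generates):
<numatune>
  <!-- overall policy: allocate strictly from host nodes 0-1 -->
  <memory mode='strict' nodeset='0-1'/>
  <!-- guest cell 0 (vCPUs 0-5) gets its RAM from host node 0 -->
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <!-- guest cell 1 (vCPUs 6-7) gets its RAM from host node 1 -->
  <memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>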

[3] Nova & NUMA & ? (I/O NUMA scheduling, vCPU pinning, large pages, etc.)

NFV
1. NUMA placement: https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement
2. vCPU topology: https://blueprints.launchpad.net/nova/+spec/virt-driver-vcpu-topology
3. I/O based NUMA scheduling: https://blueprints.launchpad.net/nova/+spec/input-output-based-numa-scheduling
4. CPU pinning: https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-pinning
5. Large pages: https://blueprints.launchpad.net/nova/+spec/virt-driver-large-pages
6. SR-IOV: https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov
7. vhost-user: https://blueprints.launchpad.net/nova/+spec/vif-vhostuser

Thanks!