ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks
George Prekas, Marios Kogias, Edouard Bugnion

Problem: Serve μs-scale RPCs
• Applications: KV-stores, in-memory databases
• Datacenter environment:
  – Complex fan-out/fan-in patterns (load balancer → root → leaf nodes)
  – The tail-at-scale problem
  – Tail-latency service-level objectives (SLOs)
• Goal: improve throughput at an aggressive tail-latency SLO
• How? Focus within the leaf nodes:
  – Reduce system overheads
  – Achieve better scheduling

Elementary Queuing Theory
• Processing discipline: FCFS vs. processor sharing; multi-queue vs. single-queue
• Inter-arrival distribution (λ): Poisson
• Service-time distribution (μ): fixed, exponential, bimodal
• The models assume no OS overheads and are independent of absolute service time, so they serve as an upper performance bound
[Diagram: arrivals at rate λ feeding FCFS queue(s) served at rate μ]

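To make the single-queue advantage concrete, here is a worked example of my own (not on the slide), assuming Poisson arrivals, exponential service with mean S̄ = 1/μ, and a single FCFS server at load ρ:

```latex
% 99th-percentile response time of an M/M/1 FCFS queue (standard result).
% The response time T is exponentially distributed with rate \mu(1-\rho):
\[
  P(T > t) = e^{-\mu(1-\rho)\,t}
  \quad\Longrightarrow\quad
  T_{99} = \frac{\ln 100}{\mu(1-\rho)} .
\]
% With \bar{S} = 1/\mu = 10\,\mu\text{s} and the slides' SLO of
% 10\,\bar{S} = 100\,\mu\text{s}:
\[
  T_{99} \le 100\,\mu\text{s}
  \;\Longleftrightarrow\;
  \frac{10\,\ln 100}{1-\rho} \le 100
  \;\Longleftrightarrow\;
  \rho \le 1 - \tfrac{\ln 100}{10} \approx 0.54 .
\]
```

A partitioned (multi-queue) system with random dispatch behaves like many independent M/M/1 queues and inherits this roughly 54% per-core limit, whereas a single queue feeding all cores makes a request wait only when every core is busy. This is the transient-load-imbalance argument behind the single-queue models' better throughput at the SLO.
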
Baseline Systems
• Linux kernel networking (epoll), partitioned connections: medium complexity, not work-conserving, multi-queue
• Linux kernel networking (epoll), floating connections: high complexity, work-conserving, single-queue
• Dataplanes (userspace networking), partitioned connections: low complexity, not work-conserving, multi-queue
Can we build a system with low overheads that achieves work conservation?

Upcoming
• Key observations:
  – Single-queue systems perform better in theory
  – Dataplanes, despite being multi-queue systems, perform better in practice
• Key contribution: ZygOS combines the best of both worlds:
  – Reduced system overheads, similar to dataplanes
  – Convergence to a single-queue model

Analysis
• Metric to optimize: load sustained at the tail-latency SLO
• Method: run timescale-independent simulations and synthetic benchmarks on a real system
• Questions:
  – Which queuing model achieves better throughput?
  – Which system converges to its model at low service times?

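The simulator itself is not shown in the deck; the sketch below is my own minimal, illustrative version of such a timescale-independent simulation (hypothetical parameters: 16 workers, exponential service, random dispatch for the partitioned case), comparing the 99th-percentile response time of a single central FCFS queue against per-core FCFS queues:

```c
/* Minimal single-queue vs. multi-queue FCFS simulation (illustrative sketch,
 * not the authors' simulator).  Times are in units of the mean service time,
 * so the comparison is timescale-independent. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CORES 16
#define JOBS  2000000

static double uniform01(void) {               /* uniform in (0,1) */
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

static double exp_rnd(double mean) {          /* exponential random variate */
    return -mean * log(uniform01());
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double p99(double *v, int n) {         /* 99th percentile */
    qsort(v, n, sizeof(double), cmp_double);
    return v[(int)(0.99 * n)];
}

int main(void) {
    double load     = 0.80;                   /* offered load per core */
    double mean_iat = 1.0 / (load * CORES);   /* mean service time is 1.0 */
    double *resp_single = malloc(JOBS * sizeof(double));
    double *resp_multi  = malloc(JOBS * sizeof(double));
    double free_s[CORES] = {0};               /* single queue: per-server next-idle time */
    double free_m[CORES] = {0};               /* partitioned: per-queue next-idle time   */
    double t = 0.0;

    for (int i = 0; i < JOBS; i++) {
        t += exp_rnd(mean_iat);               /* Poisson arrivals */
        double svc = exp_rnd(1.0);            /* exponential service time */

        /* Single central FCFS queue: the job starts on the earliest-available core. */
        int best = 0;
        for (int c = 1; c < CORES; c++)
            if (free_s[c] < free_s[best]) best = c;
        double start = free_s[best] > t ? free_s[best] : t;
        free_s[best] = start + svc;
        resp_single[i] = free_s[best] - t;

        /* Partitioned FCFS queues: the job is dispatched to a random core. */
        int q = rand() % CORES;
        start = free_m[q] > t ? free_m[q] : t;
        free_m[q] = start + svc;
        resp_multi[i] = free_m[q] - t;
    }

    printf("load %.2f: p99 single-queue = %.1f, p99 multi-queue = %.1f "
           "(x mean service time)\n",
           load, p99(resp_single, JOBS), p99(resp_multi, JOBS));
    free(resp_single);
    free(resp_multi);
    return 0;
}
```

Compile with `cc -O2 sim.c -lm`; sweeping `load` and the service-time distribution reproduces the qualitative gap between the single-queue and multi-queue models that the next slides plot.
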
Latency vs Load – Queuing Models
[Plots: 99th-percentile latency vs. load under fixed, exponential, and bimodal service-time distributions; SLO = 10 × AVG[service_time]]
• Single-queue models provide better throughput at the SLO because multi-queue models suffer transient load imbalance
• The mismatch grows with higher service-time dispersion

Latency vs Load – Service Time 10 μs
[Plots: 99th-percentile latency vs. load under fixed, exponential, and bimodal service-time distributions; SLO = 10 × AVG[service_time]; IX from Belay et al., OSDI 2014]

Latency vs Load – Service Time 25 μs
[Plots: 99th-percentile latency vs. load under fixed, exponential, and bimodal service-time distributions; SLO = 10 × AVG[service_time]; IX from Belay et al., OSDI 2014]
• Linux with floating connections outperforms IX
• Dataplanes perform better only at very low service times with low dispersion

ZygOS Approach
• Dataplane aspects:
  – Reduced system overheads
  – Share-nothing network processing
• Single-queue aspects:
  – Work conservation
  – Reduced head-of-line blocking
⇒ Implement work stealing to achieve work conservation in a dataplane

Background on IX
[Diagram: the IX run-to-completion pipeline: (1) packets polled from the RX FIFO, (2) TCP/IP processing and timers in guest ring 0, (3) event conditions delivered to the event-driven app via libIX in ring 3, (4) batched syscalls back into the kernel, (5) TCP/IP output processing, (6) TX]

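To make the numbered steps concrete, here is a toy, compilable sketch of a run-to-completion loop in this style; every name and stub is a placeholder of mine, not the real IX dataplane or the libIX API:

```c
/* Toy sketch of a batched, run-to-completion dataplane loop (steps 1-6 of
 * the diagram above).  All helpers are hypothetical stubs. */
#include <stdio.h>

#define BATCH_MAX 64

struct pkt   { int len;  };        /* a received packet              */
struct event { int conn; };        /* an "event condition" for the app */
struct call  { int conn; };        /* one entry of a batched syscall  */

/* Stubs standing in for the NIC driver, the TCP/IP stack and the app. */
static int  nic_poll_rx(struct pkt *rx, int max)               { (void)rx; (void)max; return 2; }
static int  tcp_input(struct pkt *rx, int n, struct event *ev) { (void)rx; for (int i = 0; i < n; i++) ev[i].conn = i; return n; }
static int  timer_run(struct event *ev)                        { (void)ev; return 0; }
static int  app_handle(struct event *ev, int n, struct call *c){ for (int i = 0; i < n; i++) c[i].conn = ev[i].conn; return n; }
static void tcp_apply_calls(struct call *c, int n)             { (void)c; printf("TX for %d connections\n", n); }
static void nic_flush_tx(void)                                 { }

int main(void)
{
    struct pkt   rx[BATCH_MAX];
    struct event ev[BATCH_MAX];
    struct call  calls[BATCH_MAX];

    for (int iter = 0; iter < 3; iter++) {       /* the real loop never exits   */
        int npkt = nic_poll_rx(rx, BATCH_MAX);   /* 1. poll a batch from RX     */
        int nev  = tcp_input(rx, npkt, ev);      /* 2. TCP/IP -> event conditions */
        nev     += timer_run(ev + nev);          /*    plus expired timers      */
        int nc   = app_handle(ev, nev, calls);   /* 3. upcall to the ring-3 app,
                                                    4. which returns batched syscalls */
        tcp_apply_calls(calls, nc);              /* 5. TCP/IP output processing */
        nic_flush_tx();                          /* 6. push packets to the NIC  */
    }
    return 0;
}
```

The point of the batching is that a single kernel crossing amortizes over many events and syscalls, which is where the low per-request overhead of dataplanes comes from.
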
ZygOS Design (vs. IX Design)
1. Application layer: an event-based application that is agnostic to work stealing
2. Shuffle layer: a per-core list of ready connections that allows stealing
3. Network layer: coherence- and synchronization-free network processing

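A minimal, compilable sketch (my own, with hypothetical names; not the ZygOS sources) of what such a shuffle layer can look like: a per-core FIFO of ready connections that the home core drains and that an idle core can steal from, which is what makes the system work-conserving:

```c
/* Sketch of a ZygOS-style "shuffle layer": per-core FIFOs of ready
 * connections, drained by the home core and stealable by idle cores.
 * Hypothetical names and simplified locking. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NCORES 4

struct conn {                       /* a connection with pending events */
    int id;
    struct conn *next;
};

struct shuffle_queue {              /* one per core */
    pthread_mutex_t lock;
    struct conn *head, *tail;
};

static struct shuffle_queue shq[NCORES];

static void shuffle_push(int core, struct conn *c)   /* network layer enqueues */
{
    c->next = NULL;
    pthread_mutex_lock(&shq[core].lock);
    if (shq[core].tail) shq[core].tail->next = c; else shq[core].head = c;
    shq[core].tail = c;
    pthread_mutex_unlock(&shq[core].lock);
}

static struct conn *shuffle_pop(int core)            /* FIFO pop, home or thief */
{
    pthread_mutex_lock(&shq[core].lock);
    struct conn *c = shq[core].head;
    if (c) {
        shq[core].head = c->next;
        if (!shq[core].head) shq[core].tail = NULL;
    }
    pthread_mutex_unlock(&shq[core].lock);
    return c;
}

/* A core first drains its own queue, then tries to steal a ready connection
 * from another core so no core idles while work is pending (work conservation). */
static struct conn *shuffle_next(int self)
{
    struct conn *c = shuffle_pop(self);
    for (int v = (self + 1) % NCORES; !c && v != self; v = (v + 1) % NCORES)
        c = shuffle_pop(v);
    return c;
}

int main(void)
{
    for (int i = 0; i < NCORES; i++)
        pthread_mutex_init(&shq[i].lock, NULL);

    struct conn *a = malloc(sizeof *a), *b = malloc(sizeof *b);
    a->id = 1; b->id = 2;
    shuffle_push(0, a);             /* core 0's network layer has two ready conns */
    shuffle_push(0, b);

    struct conn *got = shuffle_next(2);   /* idle core 2 steals from core 0 */
    printf("core 2 runs conn %d stolen from core 0\n", got ? got->id : -1);
    free(a); free(b);
    return 0;
}
```

In the real system a stealing core must still route any resulting syscalls back to the connection's home core; that is the "remote syscalls" path in the architecture slide that follows.
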
ZygOS Architecture
[Diagram: per core, the event-driven app with libIX in ring 3 (application layer); a shuffle queue and a remote-syscall queue in guest ring 0 (shuffle layer); and per-core TCP/IP, timers, and RX/TX queues (network layer), shown for a home core and a remote core]

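The "Remote Syscalls" box is what keeps the network layer coherence-free: a core that stole a connection never touches another core's TCP/IP state or TX queue; it posts the syscall to the home core's queue instead. Below is a toy single-producer/single-consumer sketch of mine (hypothetical names, not the ZygOS implementation):

```c
/* Sketch of the remote-syscall path: a stealing core hands TX work back to
 * the connection's home core through a per-core mailbox. */
#include <stdatomic.h>
#include <stdio.h>

#define MAILBOX_SLOTS 128

struct remote_call { int conn; int op; };      /* e.g. "send reply on conn" */

struct mailbox {                               /* one per (home) core */
    _Atomic unsigned head, tail;
    struct remote_call slot[MAILBOX_SLOTS];
};

static struct mailbox mbox[4];                 /* 4 cores in this toy */

/* Called on the remote core after it finishes application work for a stolen
 * connection: hand the TX work back to the connection's home core. */
static int remote_syscall_post(int home, struct remote_call rc)
{
    struct mailbox *m = &mbox[home];
    unsigned t = atomic_load(&m->tail);
    if (t - atomic_load(&m->head) == MAILBOX_SLOTS)
        return -1;                             /* full: caller must retry */
    m->slot[t % MAILBOX_SLOTS] = rc;
    atomic_store(&m->tail, t + 1);
    return 0;
}

/* Called from the home core's dataplane loop: drain and execute the calls
 * against its own (coherence-free) TCP/IP state and TX queue. */
static void remote_syscall_drain(int self)
{
    struct mailbox *m = &mbox[self];
    while (atomic_load(&m->head) != atomic_load(&m->tail)) {
        struct remote_call rc = m->slot[atomic_load(&m->head) % MAILBOX_SLOTS];
        printf("core %d: tcp_send(conn %d)\n", self, rc.conn);
        atomic_fetch_add(&m->head, 1);
    }
}

int main(void)
{
    remote_syscall_post(0, (struct remote_call){ .conn = 7, .op = 1 });
    remote_syscall_drain(0);                   /* home core 0 executes the TX */
    return 0;
}
```

The home core drains this queue from its dataplane loop; if it is stuck in userspace, that is exactly the TX head-of-line-blocking case handled with an IPI later in the talk.
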
Execution Model
[Animation over several slides showing a home core and a remote core: packets arrive on the home core's RX queue, its network layer (TCP/IP, timers) posts ready connections to the shuffle queue, application work runs in ring 3 via libIX, and work picked up by the remote core returns results through the remote-syscall queue to the home core's TX path]

Dealing with Head-of-Line Blocking
• Definition: the home core is busy in userspace while tasks are pending and remote cores sit idle
• ZygOS sends an inter-processor interrupt (IPI) when HOL blocking is detected
• High service-time dispersion increases HOL blocking

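A small compilable illustration (my own approximation, not the ZygOS code) of the detection condition; in the real system the notification is an inter-processor interrupt that kicks the home core out of userspace and back into its network layer:

```c
/* Illustrative sketch of HOL-blocking detection, executed by an idle remote
 * core that inspects another core's state. */
#include <stdbool.h>
#include <stdio.h>

struct core_state {
    bool in_user_space;     /* core is currently running application code          */
    int  pending_rx;        /* packets queued in its RX ring, not yet processed     */
    int  pending_remote_tx; /* completed remote syscalls waiting for its TX path    */
};

/* Head-of-line blocking: the home core sits in userspace while work that only
 * it can perform (protocol RX processing or TX of remote-syscall results) is
 * pending and at least one other core (the caller) is idle. */
static bool hol_detected(const struct core_state *home)
{
    return home->in_user_space &&
           (home->pending_rx > 0 || home->pending_remote_tx > 0);
}

/* In ZygOS this would be an IPI forcing the home core back into its
 * (guest ring 0) network layer; here it is just a stub. */
static void send_ipi(int home_core)
{
    printf("IPI -> core %d: finish protocol processing / flush TX\n", home_core);
}

int main(void)
{
    struct core_state home = { .in_user_space = true, .pending_rx = 3,
                               .pending_remote_tx = 0 };
    if (hol_detected(&home))   /* executed by an idle remote core */
        send_ipi(0);
    return 0;
}
```
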
HOL Blocking on the RX Path
[Animation: an idle remote core detects pending RX work on a home core that is busy in userspace and sends it an IPI; the home core then runs its network layer and exposes the work to the shuffle layer for stealing]

HOL Blocking on the TX Path
[Animation: a remote core that finished stolen work posts a remote syscall; if the home core is busy in userspace it is interrupted with an IPI so the reply is transmitted without waiting]

Evaluation Setup
• Environment: 10+1 Xeon servers (16-hyperthread server machine), Quanta/Cumulus 48×10 GbE switch
• Experiments: synthetic micro-benchmarks, Silo [SOSP 2013], Memcached
• Baselines: IX, Linux (partitioned and floating connections)

Latency vs Load – Service Time 10 μs (Evaluation)
[Plots over several slides: 99th-percentile latency vs. load under fixed, exponential, and bimodal service-time distributions; SLO = 10 × AVG[service_time]; IX from Belay et al., OSDI 2014]
• The IPI mechanism provides a clear benefit
• ZygOS behaves closer to the single-queue model

Silo with a TPC-C Workload
• 5 transaction types with high service-time variability; average service time 33 μs
• Open-loop experiment with 2752 TCP connections
• Metric: 99th-percentile latency

Silo with TPC-C workload 1. 63 x speedup over Linux 3. 68 x lower 99 th latency 35
Conclusion
ZygOS: a datacenter operating system for low latency
• Reduced system overheads
• Converges to a single-queue model
• Work conservation through work stealing
• Reduced HOL blocking through lightweight IPIs
We ♥ open source: https://github.com/ix-project/zygos