ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks

ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks
George Prekas, Marios Kogias, Edouard Bugnion

Problem: Serve μs-scale RPCs
• Applications: KV-stores, in-memory DBs
• Datacenter environment:
  • Complex fan-out/fan-in patterns
  • Tail-at-scale problem
  • Tail-latency Service-Level Objectives (SLOs)
• Goal: Improve throughput at an aggressive tail-latency SLO
• How? Focus within the leaf nodes
  • Reduce system overheads
  • Achieve better scheduling
[Figure: load balancer, root nodes, and leaf nodes]

Elementary Queuing Theory
• Processor
  • FCFS
  • Processor Sharing
• Multi/Single Queue
• Inter-arrival Distribution (λ)
  • Poisson
• Service Time Distribution (μ)
  • Fixed
  • Exponential
  • Bimodal
• No OS overheads
• Independent of service time
• Upper performance bound
[Figure: queuing diagram with labels λ, FCFS, S, μ]
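
A minimal simulation sketch of the idealized models listed above, assuming Poisson arrivals, FCFS service, no OS overheads, and a mean service time normalized to 1. The function names, the 16-server default, and the bimodal mix (90% of requests at 0.5, 10% at 5.5) are illustrative assumptions, not values taken from the slides.

```python
import heapq
import random

def fixed(rng):
    return 1.0

def exponential(rng):
    return rng.expovariate(1.0)

def bimodal(rng):
    # Assumed mix with mean 1.0: 90% short (0.5), 10% long (5.5).
    return 0.5 if rng.random() < 0.9 else 5.5

def p99_latency(load, service_time, single_queue,
                num_servers=16, num_jobs=50_000, seed=1):
    """99th percentile sojourn time, in units of the mean service time."""
    rng = random.Random(seed)
    arrival_rate = load * num_servers      # mean service time is 1.0
    free_at = [0.0] * num_servers          # when each server next goes idle
    now = 0.0
    latencies = []
    for _ in range(num_jobs):
        now += rng.expovariate(arrival_rate)   # Poisson arrival process
        service = service_time(rng)
        if single_queue:
            # Shared FCFS queue: the job runs on whichever server frees up first.
            start = max(now, heapq.heappop(free_at))
            heapq.heappush(free_at, start + service)
        else:
            # Partitioned queues: the job is bound to one randomly chosen server.
            i = rng.randrange(num_servers)
            start = max(now, free_at[i])
            free_at[i] = start + service
        latencies.append(start + service - now)
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]

if __name__ == "__main__":
    for name, dist in [("fixed", fixed), ("exponential", exponential),
                       ("bimodal", bimodal)]:
        single = p99_latency(0.7, dist, single_queue=True)
        multi = p99_latency(0.7, dist, single_queue=False)
        print(f"{name:12s} single-queue p99={single:.2f}  multi-queue p99={multi:.2f}")
```

Because the model ignores all system overheads, the numbers it prints are an upper bound on what a real system can achieve, which is how the slide frames the queuing results.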

Baseline

System                  Linux            Linux            Dataplanes
Networking              Kernel (epoll)   Kernel (epoll)   Userspace
Connection Delegation   Partitioned      Floating         Partitioned
Complexity              Medium           High             Low
Work Conservation       ✖                ✔                ✖
Queuing                 Multi-Queue      Single Queue     Multi-Queue

Can we build a system with low overheads that achieves work conservation?

Upcoming
• Key Observations:
  • Single-queue systems perform theoretically better
  • Dataplanes, despite being multi-queue systems, perform practically better
• Key Contributions:
  • ZygOS combines the best of the two worlds:
    • Reduced system overheads similar to dataplanes
    • Convergence to a single-queue model

Analysis
• Metric to optimize: Load @ Tail-Latency SLO
• Run timescale-independent simulations
• Run synthetic benchmarks on a real system
• Questions:
  • Which model achieves better throughput?
  • Which system converges to its model at low service times?
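
The "Load @ Tail-Latency SLO" metric can be sketched as follows: sweep the offered load, record the tail latency at each point, and report the highest load that still meets the SLO. The function name and the sweep values below are made up for illustration; the SLO of 10 × the average service time is the one used in the latency plots of this deck.

```python
def max_load_at_slo(measurements, slo):
    """Highest offered load whose tail latency stays within the SLO.

    measurements: iterable of (load, p99_latency) pairs.
    The sweep stops at the first violation, so an outlier beyond the
    saturation point is not counted.
    """
    best = None
    for load, p99 in sorted(measurements):
        if p99 > slo:
            break
        best = load
    return best

if __name__ == "__main__":
    # Hypothetical sweep: load in requests/s, 99th percentile latency in μs.
    sweep = [(100_000, 35.0), (200_000, 48.0), (300_000, 92.0), (400_000, 310.0)]
    avg_service_time_us = 10.0
    print(max_load_at_slo(sweep, slo=10 * avg_service_time_us))
```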

Latency vs Load – Queuing model
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
• Greater mismatch at high dispersion
• Single-queue models provide better throughput at SLO because of transient load imbalance
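
As a worked instance of the multi-queue bound for exponential service times, assume each partitioned queue behaves as an independent M/M/1 FCFS queue. This is an idealization, since the slides only show simulated curves. Here μ denotes the service rate (so the average service time is 1/μ), λ the arrival rate at one queue, and ρ = λ/μ the load. The sojourn time of an M/M/1 FCFS queue is exponentially distributed with rate μ − λ, which gives:

```latex
\begin{align*}
P(T > t) &= e^{-(\mu - \lambda)\,t} \\
t_{99} &= \frac{\ln 100}{\mu - \lambda} = \frac{\ln 100}{\mu\,(1 - \rho)} \\
t_{99} \le \frac{10}{\mu}
  &\iff \rho \le 1 - \frac{\ln 100}{10} \approx 0.54
\end{align*}
```

Under these assumptions a partitioned system cannot be loaded beyond roughly 54% at this SLO, however many cores it has. A shared queue is not subject to this per-queue limit.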

Latency vs Load – Service Time 10μs
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
IX: Belay et al., OSDI 2014

Latency vs Load – Service Time 25μs
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
• Linux Floating outperforms IX
• Dataplanes perform better only in very low service times with low dispersion
IX: Belay et al., OSDI 2014

ZygOS Approach
• Dataplane aspect:
  • Reduced system overheads
  • Share-nothing network processing
• Single-queue system:
  • Work conservation
  • Reduction of head-of-line blocking
Implement work-stealing to achieve work conservation in a dataplane

Background on IX
[Figure: IX architecture with steps numbered 1 to 6. Labels: event-driven app, libIX (Ring 3); Event Conditions, Batched Syscalls; TCP/IP, RX FIFO, Timer, RX, TX (Guest Ring 0)]

ZygOS Design
1. Application layer: event-based application that is agnostic to work-stealing
2. Shuffle layer: includes a per-core list of ready connections that allows stealing
3. Network layer: coherence- and sync-free network processing
[Figure labels: IX Design, ZygOS Design]
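
A minimal sketch of the shuffle-layer idea, assuming a lock-protected FIFO per core and a simple linear scan over victims. The class and function names, the locking scheme, and the victim order are all assumptions made for illustration, not the ZygOS implementation.

```python
import threading
from collections import deque

class ShuffleQueue:
    """Per-core list of ready connections that other cores may steal from."""

    def __init__(self):
        self._ready = deque()
        self._lock = threading.Lock()

    def push(self, conn):
        with self._lock:
            self._ready.append(conn)

    def pop(self):
        with self._lock:
            return self._ready.popleft() if self._ready else None

def next_connection(core_id, queues):
    """Return (connection, owner_core): local work first, otherwise steal."""
    conn = queues[core_id].pop()
    if conn is not None:
        return conn, core_id
    for victim in range(len(queues)):
        if victim != core_id:
            conn = queues[victim].pop()
            if conn is not None:
                return conn, victim   # stolen from a remote core's queue
    return None, None                 # every queue is empty

if __name__ == "__main__":
    queues = [ShuffleQueue() for _ in range(4)]
    queues[0].push("conn-A")
    queues[0].push("conn-B")
    print(next_connection(0, queues))   # core 0 takes its own work
    print(next_connection(2, queues))   # idle core 2 steals from core 0
```

The application layer only ever sees the connection it is handed, which is one way to read "agnostic to work-stealing": whether the event came from the local or a remote queue is hidden below it.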

ZygOS Architecture
[Figure: ZygOS layers on a home core and a remote core. Labels: event-driven app, libIX, Application Layer (Ring 3); Shuffle Layer with Shuffle Queue and Remote Syscalls, Network Layer with TCP/IP, Timer, RX, TX (Guest Ring 0)]

Execution Model
[Figure: animation over slides 15 to 22 stepping through the execution model on a home core and a remote core. Labels: event-driven app, libIX (Ring 3); Shuffle Layer, Shuffle Queue, Remote Syscalls, TCP/IP, Timer, RX, TX (Guest Ring 0)]

Dealing with Head-of-Line Blocking
• Definition:
  • Home core busy in userspace
  • Pending tasks
  • Idle remote cores
• Inter-processor interrupts (IPIs) if HOL is detected
• High service-time dispersion increases HOL
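
The three-part definition above can be read as a simple predicate. The sketch below is only that reading: the `CoreState` fields and the `needs_ipi` name are hypothetical, and how ZygOS actually tracks these conditions is not described on the slide.

```python
from dataclasses import dataclass

@dataclass
class CoreState:
    in_userspace: bool      # core is currently running application code
    pending_tasks: int      # work queued for this core that it has not handled
    idle: bool = False      # core has nothing to run

def needs_ipi(home, remotes):
    """True when all three HOL conditions from the definition hold at once."""
    return (home.in_userspace
            and home.pending_tasks > 0
            and any(r.idle for r in remotes))

if __name__ == "__main__":
    home = CoreState(in_userspace=True, pending_tasks=3)
    remotes = [CoreState(False, 0, idle=True), CoreState(True, 1)]
    print(needs_ipi(home, remotes))
```

If any one condition is missing (the home core is already in the kernel, nothing is pending, or no remote core is idle), there is nothing for an interrupt to unblock, so the predicate is false.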

HOL – RX
[Figure: execution-model diagram annotated with an IPI, home core and remote core. Labels: event-driven app, libIX (Ring 3); Shuffle Layer, Shuffle Queue, Remote Syscalls, TCP/IP, Timer, RX, TX (Guest Ring 0)]

HOL – TX
[Figure: execution-model diagram annotated with an IPI, home core and remote core. Labels: event-driven app, libIX (Ring 3); Shuffle Layer, Shuffle Queue, Remote Syscalls, TCP/IP, Timer, RX, TX (Guest Ring 0)]

Evaluation Setup
• Environment:
  • 10+1 Xeon servers
  • 16-hyperthread server machine
  • Quanta/Cumulus 48 × 10GbE switch
• Experiments:
  • Synthetic micro-benchmarks
  • Silo [SOSP 2013]
  • Memcached
• Baselines:
  • IX
  • Linux (partitioned and floating connections)

Latency vs Load – Service Time 10μs
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
IX: Belay et al., OSDI 2014

Latency vs Load – Service Time 10μs
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
• Interrupt benefit

Latency vs Load – Service Time 10μs
[Figure: 99th percentile latency vs. load; panels: Fixed, Exponential, Bimodal; SLO: 10 × AVG[service_time]]
• Closer to Single-Queue

Silo with TPC-C workload
• 5 types of transactions
• Service time variability
• Average service time: 33μs
• Open-loop experiment
• 2752 TCP connections
• 99th percentile latency

Silo with TPC-C workload
• 1.63× speedup over Linux
• 3.68× lower 99th percentile latency

Conclusion
ZygOS: A datacenter operating system for low latency
• Reduced system overheads
• Converges to a single-queue model
• Work conservation through work stealing
• Reduce HOL through lightweight IPIs
We ♥ open source: https://github.com/ix-project/zygos