(Title slide: Server and Client.)

Kernel time vs. user application time: NSC DNS Server 92% / 8%, Redis Key-Value Store 87% / 13%, Nginx HTTP Load Balancer 77% / 23%, Lighttpd HTTP Server 78% / 22%.
Round-trip latency (µs): Linux socket 30 inter-host vs. 1.6 with RDMA, and 11 intra-host vs. 0.25 with shared memory (SHM).

In the standard Linux design, the user application calls into the kernel socket layer (VFS) and TCP/IP stack, which drives the NIC through a packet API on both hosts. Prior work restructures this path in three ways:
• Kernel socket optimizations (FastSocket, MegaPipe, StackMap, ...): keep the kernel TCP/IP stack and the NIC packet API.
• User-space stacks (IX, Arrakis, Sandstorm, mTCP, LibVMA, OpenOnload, ...): move the socket VFS and TCP/IP layers into user space on top of the NIC packet API.
• Hardware-based transports (RSocket, SDP, FreeFlow, ...): keep a user-space socket VFS and offload the transport to an RDMA NIC via the RDMA API.

Life of a Linux socket operation. The application issues a send() socket call; the C library turns it into the send() syscall; the kernel locks the socket FD, enters VFS send, copies the data and allocates memory for the TCP send buffer, runs TCP/IP, and performs packet processing (netfilter, tc, ...) before the network packet leaves. On the receiving side, packet processing and TCP/IP place the data into the TCP receive buffer; event notification and process scheduling wake up the receiving process, which issues the recv() syscall through the C library; the kernel locks the FD, enters VFS recv, and copies the data out, freeing the buffer.
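
For reference, a minimal blocking send()/recv() pair that exercises exactly this path, assuming fd is an already-connected TCP socket; the comments mark the stages listed above.

```c
/* Minimal blocking send/recv over an already-connected TCP socket fd,
 * annotated with the kernel stages the path above goes through. */
#include <sys/socket.h>
#include <unistd.h>

ssize_t ping(int fd, char *buf, size_t len)
{
    /* C library shim -> send() syscall -> lock FD, VFS send,
     * copy buf into the TCP send buffer, TCP/IP, netfilter/tc,
     * packet leaves through the NIC. */
    if (send(fd, buf, len, 0) < 0)
        return -1;

    /* NIC interrupt -> packet processing -> TCP receive buffer ->
     * event notification -> process wakeup -> recv() syscall ->
     * lock FD, VFS recv, copy into buf, free kernel memory. */
    return recv(fd, buf, len, 0);
}
```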

Round-trip overhead breakdown (ns). Per-operation totals, inter-host / intra-host: Linux 15000 / 5800; kernel-bypass baselines (LibVMA, RSocket) land between roughly 1000 and 2200; SocksDirect reaches 850 / 150. The per-operation cost breaks down into the C library shim, kernel crossing (syscall), socket FD locking, buffer management, TCP/IP protocol, packet processing, NIC doorbell and DMA, NIC processing and wire, NIC interrupt handling, and process wakeup; the per-kbyte cost consists of the data copy and the wire transfer.

Compatibility
• Drop-in replacement, no application modification (sketched below)
Isolation
• Security isolation among containers and applications
• Enforce access control policies
High Performance
• High throughput
• Low latency
• Scalable with the number of CPU cores
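
One standard way to get such a drop-in replacement is to interpose on libc's socket calls with LD_PRELOAD. The sketch below is only illustrative: sd_owns_fd() and sd_send() are hypothetical stand-ins for a user-space transport, not SocksDirect's actual libsd interface.

```c
/* shim.c -- illustrative LD_PRELOAD interposer for send().
 * Build: gcc -shared -fPIC -o libshim.so shim.c -ldl
 * Run an unmodified application with LD_PRELOAD=./libshim.so. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Placeholder stubs standing in for a hypothetical user-space transport. */
static bool sd_owns_fd(int fd) { (void)fd; return false; }
static ssize_t sd_send(int fd, const void *buf, size_t len, int flags)
{ (void)fd; (void)buf; (void)flags; return (ssize_t)len; }

static ssize_t (*real_send)(int, const void *, size_t, int);

ssize_t send(int sockfd, const void *buf, size_t len, int flags)
{
    if (!real_send)   /* resolve libc's send() on first use */
        real_send = (ssize_t (*)(int, const void *, size_t, int))
                        dlsym(RTLD_NEXT, "send");
    if (sd_owns_fd(sockfd))
        return sd_send(sockfd, buf, len, flags);   /* user-space fast path   */
    return real_send(sockfd, buf, len, flags);     /* fall back to the kernel */
}
```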

A per-host monitor process mediates the control plane: connect and accept requests go through the monitor, which enforces the ACL rules. Once a connection is set up, the peer processes exchange data directly over a shared buffer with send and recv.
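
For illustration, the kind of check a monitor could run on each connect/accept request; the rule format below is hypothetical, not SocksDirect's.

```c
/* Illustrative ACL check for connection requests (first matching rule wins,
 * default deny). The rule layout is a made-up example. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct acl_rule {
    uint32_t src_container;   /* requesting container id        */
    uint32_t dst_ip;          /* destination IPv4 in host order */
    uint16_t dst_port;        /* 0 means any port               */
    bool     allow;
};

static bool acl_permits(const struct acl_rule *rules, size_t n,
                        uint32_t src_container, uint32_t dst_ip,
                        uint16_t dst_port)
{
    for (size_t i = 0; i < n; i++) {
        const struct acl_rule *r = &rules[i];
        if (r->src_container == src_container &&
            r->dst_ip == dst_ip &&
            (r->dst_port == 0 || r->dst_port == dst_port))
            return r->allow;
    }
    return false;   /* no rule matched: deny */
}
```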

Each application links libsd, the user-space socket library. Within a host, applications and the per-host monitor communicate over shared-memory queues; across hosts (Host 1 to Host 2), libsd uses the NIC's RDMA API directly. For peers that do not run SocksDirect (Host 3), the monitor falls back to the kernel TCP/IP stack and the NIC packet API.
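
The per-connection transport choice implied by this figure can be summarized as a small dispatch function; the enum and function names below are hypothetical, not libsd's actual interface.

```c
/* Illustrative transport selection: shared memory within a host, RDMA to
 * peers that also run the user-space stack, kernel TCP/IP otherwise. */
#include <stdbool.h>
#include <stdint.h>

enum transport { TRANSPORT_SHM, TRANSPORT_RDMA, TRANSPORT_TCP };

static enum transport choose_transport(uint32_t local_host, uint32_t peer_host,
                                        bool peer_runs_socksdirect)
{
    if (peer_host == local_host)
        return TRANSPORT_SHM;    /* intra-host: shared-memory queue        */
    if (peer_runs_socksdirect)
        return TRANSPORT_RDMA;   /* inter-host, both sides accelerated     */
    return TRANSPORT_TCP;        /* fall back to the kernel TCP/IP stack   */
}
```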

Round-trip overhead breakdown, Linux vs. SocksDirect (ns):
• C library shim: 15 vs. 15
• Kernel crossing (syscall): 205 vs. none
• Socket FD locking: 160 vs. none
• Buffer management: 430 vs. 50
• TCP/IP protocol: 360 vs. none
• Packet processing: 500 vs. none
• NIC doorbell and DMA: 2100 vs. 600 (inter-host only)
• NIC processing and wire: 200 (inter-host)
• Handling NIC interrupt: 4000 vs. none (polling)
• Process wakeup: eliminated by polling
• Per-operation total: 15000 vs. 850 inter-host, 5800 vs. 150 intra-host
• Per kbyte: for Linux, the data copy (365) and, inter-host, the wire transfer (160) dominate

Socket sharing without per-operation locking: instead of Sender 1 and Sender 2 locking a shared queue, a send token is passed between the senders of a shared socket, so only the token holder writes into the queue. Symmetrically, when Receiver 1 and Receiver 2 share the receiving end, a receive token is passed between them, and only the token holder reads from the queue.
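
A sketch of the token idea in shared memory follows; in SocksDirect the monitor coordinates the handoff, and the field names here are hypothetical. The point is that the data path itself needs no lock: only the token holder touches the ring.

```c
/* Illustrative send-token handoff between two processes sharing a socket.
 * The token lives in shared memory; handoff is lazy, so the common
 * single-sender case pays nothing. */
#include <stdatomic.h>
#include <stdbool.h>

struct shared_sock {
    _Atomic int  token_owner;     /* id of the process allowed to send   */
    _Atomic bool handoff_wanted;  /* another sender asked for the token  */
};

/* Called by the current holder between sends: hand the token over only
 * when someone asked for it. */
static void maybe_release_token(struct shared_sock *s, int next_owner)
{
    if (atomic_load(&s->handoff_wanted)) {
        atomic_store(&s->handoff_wanted, false);
        atomic_store(&s->token_owner, next_owner);
    }
}

/* Called by a sender that does not hold the token; the caller retries
 * (or asks the monitor) until this returns true. */
static bool try_acquire_token(struct shared_sock *s, int my_id)
{
    if (atomic_load(&s->token_owner) == my_id)
        return true;                          /* already ours            */
    atomic_store(&s->handoff_wanted, true);   /* ask the holder to yield */
    return atomic_load(&s->token_owner) == my_id;
}
```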

Fork support. The file descriptor table and the intra-host SHM queues live in pages shared between parent and child, so sockets opened before fork (FDs 3, 4, 5 in the figure) remain usable in both processes; socket data pages are copy-on-write. RDMA queue pairs cannot be shared across fork, so the child re-creates its RDMA QPs and the associated socket data on demand, and sockets it opens after fork go into a private part of its FD table.
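
One way to get the shared FD table behavior is to place the table in a shared anonymous mapping before fork(), so both processes keep updating the same physical pages. The sketch below uses a hypothetical table layout.

```c
/* Illustrative: keep the socket FD table in a MAP_SHARED|MAP_ANONYMOUS
 * mapping so parent and child see one coherent table after fork(). */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_FDS 1024

struct fd_entry { int in_use; uint64_t queue_id; };
struct fd_table { struct fd_entry entries[MAX_FDS]; };

static struct fd_table *fd_table_create(void)
{
    /* fork() duplicates the mapping, but both processes keep referencing
     * the same shared pages. */
    void *p = mmap(NULL, sizeof(struct fd_table), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (struct fd_table *)p;
}

int main(void)
{
    struct fd_table *t = fd_table_create();
    if (!t) return 1;
    t->entries[3].in_use = 1;        /* "socket" 3 opened before fork       */
    if (fork() == 0) {
        t->entries[4].in_use = 1;    /* child's update is visible to parent */
        _exit(0);
    }
    return 0;
}
```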

(Latency breakdown table repeated: Linux vs. SocksDirect round-trip overheads, as above.)

Ring buffer design (head/tail, per-socket).
Shared ring buffer (kernel approach):
• Many sockets share a ring buffer; the receiver segregates packets arriving from the NIC
• Buffer allocation overhead and internal fragmentation
Per-socket ring buffer (SocksDirect):
• One ring buffer per socket; the sender segregates packets via the RDMA or SHM address it writes to
• Back-to-back packet placement minimizes buffer management overhead
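
A minimal sketch of such a per-socket ring with back-to-back placement; field names are hypothetical and wrap-around of a single message is left out for brevity.

```c
/* Illustrative per-socket ring buffer: head/tail are monotonically growing
 * byte counters; each message is a small header followed by its payload,
 * so no per-packet buffers are allocated. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE (64u * 1024u)         /* power of two */

struct ring {
    uint32_t head;                      /* next byte the receiver reads  */
    uint32_t tail;                      /* next byte the sender writes   */
    uint8_t  data[RING_SIZE];
};

struct msg_hdr { uint32_t len; };

static bool ring_send(struct ring *r, const void *payload, uint32_t len)
{
    uint32_t need = (uint32_t)sizeof(struct msg_hdr) + len;
    uint32_t off  = r->tail % RING_SIZE;
    if (RING_SIZE - (r->tail - r->head) < need)
        return false;                   /* not enough free space (credits) */
    if (off + need > RING_SIZE)
        return false;                   /* sketch: skip wrap-around handling */
    struct msg_hdr h = { len };
    memcpy(&r->data[off], &h, sizeof h);
    memcpy(&r->data[off + sizeof h], payload, len);
    r->tail += need;                    /* back-to-back: no per-message alloc */
    return true;
}
```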

Inter-host, each socket keeps two copies of its ring buffer, one at the sender and one at the receiver (the sender tracks send_next and the ring's head/tail). The sender pushes data into the receiver's copy with one-sided RDMA write, and the receiver returns credits (i.e. the amount of freed buffer space) in batches, also with RDMA write. RDMA write with immediate is used to preserve ordering, and a shared completion queue amortizes the polling overhead.
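
A sketch of the sender-side verb, assuming the queue pair, the local memory region, and the peer ring's address/rkey were exchanged at connection setup; the credit and sequence handling around it is simplified and error handling is omitted.

```c
/* Illustrative: post a one-sided RDMA write-with-immediate that copies one
 * ring-buffer segment into the receiver's copy of the ring. */
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <stdint.h>

static int push_segment(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey, uint32_t seq)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = seq,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE_WITH_IMM, /* generates a CQE at the receiver */
        .send_flags = IBV_SEND_SIGNALED,
        .imm_data   = htonl(seq),                 /* lets the receiver learn the new tail */
        .wr.rdma = {
            .remote_addr = remote_addr,           /* receiver's ring + offset */
            .rkey        = rkey,
        },
    };
    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}
```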

(Latency breakdown table repeated: Linux vs. SocksDirect round-trip overheads, as above.)

Why Linux copies. The sender application fills a buffer (memcpy(buf, data, size)) and calls send(buf, size); the kernel copies buf into a socket buffer, which the NIC DMA-reads to build the network packet. On the receiver, the NIC DMAs into a socket buffer, an event (epoll) wakes the application, which allocates user_buf and calls recv(user_buf, size), and the kernel copies the socket buffer into it. If the NIC instead DMA-read the user buffer directly, the application could overwrite buf after send() returns and the NIC would read wrong data; the copy is what makes the send()/recv() semantics safe.
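
The problem in code: POSIX allows the application to reuse the buffer as soon as send() returns, so a transport that defers the NIC's DMA read of the user buffer must still make the first message appear intact.

```c
/* Illustrative: why naive zero-copy breaks send() semantics. If send() only
 * queued the user buffer for a later NIC DMA read instead of copying it,
 * the overwrite below could race with the DMA and the receiver would see
 * 'B's (or a mix) instead of 'A's in the first message. */
#include <string.h>
#include <sys/socket.h>

void send_twice(int fd, char *buf, size_t size)
{
    memset(buf, 'A', size);
    send(fd, buf, size, 0);   /* POSIX: buf may be reused once send returns   */
    memset(buf, 'B', size);   /* ...so this must not corrupt the first message */
    send(fd, buf, size, 0);
}
```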

Send physical pages instead of copying:
• Map 1 page: 0.78 µs; copy 1 page: 0.40 µs
• Map 32 pages: 1.2 µs; copy 32 pages: 13.0 µs
Remapping a single page is slower than copying it, but remapping is close to a fixed cost per batch, so batching page remappings makes zero copy win for large messages.
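
A quick back-of-the-envelope check of these numbers (they are the slide's measurements; other machines will differ):

```c
/* Copying scales linearly with the number of pages, while remapping a batch
 * is close to a fixed cost, so zero copy wins for large messages. */
#include <stdio.h>

int main(void)
{
    double copy_per_page = 0.40;            /* us, measured for 1 page   */
    double map_1 = 0.78, map_32 = 1.2;      /* us, measured batch costs  */
    printf("copy 32 pages: %.1f us\n", 32 * copy_per_page);  /* ~12.8 us, matches 13.0 */
    printf("map 32 pages : %.1f us (%.3f us/page)\n", map_32, map_32 / 32);
    printf("map 1 page   : %.2f us\n", map_1);
    return 0;
}
```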

(Latency breakdown table repeated: Linux vs. SocksDirect round-trip overheads, as above.)

A Linux process wakeup (mutex, semaphore, blocking read) is far more expensive than a cooperative context switch (sched_yield). SocksDirect therefore:
• Pins each thread to a core
• Lets each thread poll for a time slice and call sched_yield when the slice ends
• Runs all threads on a core in round-robin order
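
A minimal sketch of this polling discipline using standard Linux APIs; poll_queues() is a stub standing in for the library's receive polling, and the 100 µs slice is an arbitrary choice.

```c
/* Pin the thread to a core, poll for a bounded time slice, then
 * sched_yield() so other threads pinned to the same core run round-robin.
 * Build: gcc -O2 -pthread poll.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <time.h>

static bool poll_queues(void) { return false; } /* stub for receive polling */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    pin_to_core(*(int *)arg);
    const long slice_ns = 100 * 1000;           /* poll ~100 us, then yield */
    for (;;) {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            poll_queues();
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                 (now.tv_nsec - start.tv_nsec) < slice_ns);
        sched_yield();                          /* cooperative context switch */
    }
    return NULL;
}

int main(void)
{
    int core = 0;
    pthread_t t;
    pthread_create(&t, NULL, worker, &core);
    pthread_join(t, NULL);                      /* runs forever; Ctrl-C to stop */
    return 0;
}
```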

SocksDirect design principles:
• Monitor and library in user space
• Shared-nothing; message passing for communication
• Hardware-based transports: RDMA and SHM
• Token-based socket sharing
• Optimize the common cases, prepare for all cases
• Per-socket ring buffer
• Batched page remapping
• Cooperative context switch

Architecture recap: user applications on Host 1 and Host 2 link libsd and communicate across hosts through the NIC RDMA API; within a host, applications and the monitor communicate over memory queues.

Contributions of this work:
• An analysis of the performance overheads in Linux sockets.
• The design and implementation of SocksDirect, a high-performance user-space socket system that is compatible with Linux and preserves isolation among applications.
• Techniques to support fork, token-based connection sharing, an allocation-free ring buffer, and zero copy, which may be useful in scenarios beyond sockets.
• Evaluations showing that SocksDirect achieves performance comparable to RDMA and SHM queues and significantly speeds up existing applications.

• Kernel socket optimizations (FastSocket, MegaPipe, StackMap, ...): good compatibility, but leave many overheads on the table.
• User-space TCP/IP stacks (IX, Arrakis, Sandstorm, mTCP, LibVMA, OpenOnload, ...): do not support fork, container live migration, or ACLs; use the NIC to forward intra-host packets (SHM is faster); fail to remove the payload copy, locking, and buffer management overheads.
• Offload to hardware transports (RSocket, SDP, FreeFlow, ...): lack support for important APIs (e.g. epoll) and share the drawbacks of user-space TCP/IP stacks.

Intra-host baselines: application socket send/recv over the loopback interface takes 10 µs RTT at 0.9 M op/s throughput; application pipe read/write takes 8 µs RTT at 1.2 M op/s throughput.
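
For comparison on your own machine, a small ping-pong benchmark for the pipe case (the socket case is analogous); absolute numbers depend on the kernel and CPU.

```c
/* Pipe ping-pong: parent and child bounce one byte back and forth and the
 * parent reports the mean round-trip time. */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int ptoc[2], ctop[2];               /* parent->child, child->parent */
    char b = 0;
    if (pipe(ptoc) || pipe(ctop)) return 1;

    if (fork() == 0) {                  /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            if (read(ptoc[0], &b, 1) != 1) _exit(1);
            if (write(ctop[1], &b, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {  /* parent: send, wait for echo */
        if (write(ptoc[1], &b, 1) != 1) return 1;
        if (read(ctop[0], &b, 1) != 1) return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("pipe ping-pong: %.2f us mean RTT over %d rounds\n",
           ns / ROUNDS / 1000.0, ROUNDS);
    return 0;
}
```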

Zero copy via page remapping between sender and receiver (pages A, B/B', C in the figure, steps ① through ⑦): instead of copying the payload, the sender's data pages are transferred with a one-sided RDMA write and remapped into the receiver's address space.

The monitor assigns FDs and dispatches connections to per-socket queues: in the example, FD 3 connects sender S1 to receiver R1, FD 4 connects S1 to R2, and FD 5 connects S2 to R2. S1 holds FDs 3 and 4, S2 holds 5, R1 holds 3, and R2 holds 4 and 5, each backed by its own socket queue.

When two senders share a socket, the monitor coordinates the send-token handoff: Sender 1 and Sender 2 each request the send token through the monitor, and whichever currently holds it writes into the data queue to the receiver.