Network Requirements for Resource Disaggregation Peter Gao (Berkeley), Akshay Narayan (MIT), Sagar Karandikar (Berkeley), Joao Carreira (Berkeley), Sangjin Han (Berkeley), Rachit Agarwal (Cornell), Sylvia Ratnasamy (Berkeley), Scott Shenker (Berkeley/ICSI)
Disaggregated Datacenters
• Current datacenter: server-centric
• Future datacenter: disaggregated?
[figure: server-centric rack vs. disaggregated rack of CPU, GPU, and memory blades connected by the datacenter network; prototype efforts include HP The Machine, Huawei NUWA, Intel RSD, Facebook, SeaMicro, Berkeley FireBox]
Disaggregation Benefits (Architecture Community)
• Overcome the memory capacity wall
• Higher resource density
• Simpler hardware design
• Relaxed power & capacity scaling
Network Is the Key
• Server-centric: resources linked by internal interconnects (e.g., QPI, SMI, PCI-e)
• Disaggregated: CPU, GPU, and memory blades linked by the network
• Do we need specialized network hardware? Existing prototypes use specialized hardware, such as silicon photonics and PCI-e
Questions (metric: worst-case application performance degradation)
• What end-to-end latency and bandwidth must the network provide for legacy apps?
• Do existing transport protocols meet these requirements?
• Do existing OS network stacks meet these requirements? (Current stacks are not sufficient ✘, but solutions are feasible)
• Can commodity network hardware meet these requirements? (Commodity hardware and switches may be sufficient ✔)
[diagram: application → OS → transport → NIC → switch → remote resource]
Assumptions
• Cache coherence: limited coherence domain; small amount of local cache (how much?)
• Memory: page-level remote memory access
• Storage: block-level distributed data placement
• Scale: rack-scale? Datacenter-scale?
Latency and Bandwidth Requirements
Methodology: workload-driven
• 10 workloads on 8 applications; ~125 GB input data
• 5 m3.2xlarge EC2 nodes, Virtual Private Cloud enabled
• Batch processing: wordcount and sort (Spark, Hadoop), PageRank (Timely Dataflow), collaborative filtering (GraphLab)
• Interactive: key-value store (memcached, HERD), SQL (Spark SQL)
• Streaming: Spark Streaming
Disaggregated Datacenter Emulator
• Partition the machine's memory into local RAM (free to access) and emulated remote RAM, backed by the machine's own memory
• A special swap device handles page faults to remote memory
• Inject latency and bandwidth constraints via the swap device: delay = latency + request size / bandwidth
• Akin to a dedicated link between CPU and remote memory
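To make the injected delay concrete, here is a minimal sketch of the per-request charge the emulator applies, assuming 4 KB page-granularity accesses; the names are illustrative, not taken from the authors' swap-device implementation:

    # Hypothetical model of the emulator's injected delay (not the authors'
    # actual swap-device code): delay = latency + request size / bandwidth.
    PAGE_SIZE = 4096  # bytes; page-level remote memory access (see Assumptions)

    def remote_access_delay_us(latency_us, bandwidth_gbps, request_bytes=PAGE_SIZE):
        """Delay injected for one remote memory request, in microseconds."""
        transmission_us = request_bytes * 8 / (bandwidth_gbps * 1e3)  # Gbps -> bits/us
        return latency_us + transmission_us

    print(remote_access_delay_us(3, 40))  # a 4 KB page costs ~3.8 us at 3 us / 40 Gbps

Because the delay is charged per page fault, applications that demand more memory bandwidth pay it more often.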
Latency and Bandwidth Requirement
[figure: application performance degradation at 1/5/10 us latency and 10/40/100 Gbps bandwidth, with a 5% degradation line marked]
~3 us latency / 40 Gbps bandwidth is enough, ignoring queueing delay
*Note: delay = latency + request size / bandwidth
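As a sanity check on this takeaway, the same delay model can sweep the latency/bandwidth grid the figure appears to cover (the axis values are my reading of the figure residue):

    # Per-page delay at each (latency, bandwidth) point of the sweep.
    for latency_us in (1, 5, 10):
        for gbps in (10, 40, 100):
            delay = latency_us + 4096 * 8 / (gbps * 1e3)
            print(f"{latency_us} us / {gbps} Gbps -> {delay:.2f} us per 4 KB page")
    # At the slide's operating point of ~3 us / 40 Gbps, each 4 KB page
    # fault costs ~3.8 us, ignoring queueing delay.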
Understanding Performance Degradation
[figure: degradation vs. application memory bandwidth for Spark Wordcount, Spark SQL BDB, HERD YCSB, Timely Pagerank, Spark Sort, GraphLab CF, Memcached YCSB, Spark Streaming Wordcount, Hadoop Sort, Hadoop Wordcount]
Performance degradation is correlated with application memory bandwidth
Requirements so far:
• 3 us end-to-end latency
• 40 Gbps dedicated link (no queueing delay)
[diagram: application → OS → transport → NIC → switch → remote resource]
Transport Simulation Setting
• Instrument the special swap device to collect a flow trace
• Feed the flow trace into a network simulator
• Measure the flow completion time distribution
Takeaway: we need new transport protocols
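A toy version of this pipeline, assuming a single FIFO link rather than the authors' packet-level simulator, shows how queueing stretches flow completion times (all names here are illustrative):

    # Replay a (start_time_us, size_bytes) flow trace through one shared link
    # and return the flow-completion-time (FCT) distribution. A toy FIFO
    # model, not the authors' network simulator.
    def fct_distribution(trace, link_gbps=40.0, base_latency_us=3.0):
        fcts = []
        link_free_at = 0.0  # when the link finishes the previous flow
        for start_us, size_bytes in sorted(trace):
            begin = max(start_us, link_free_at)            # wait out queueing
            link_free_at = begin + size_bytes * 8 / (link_gbps * 1e3)
            fcts.append(link_free_at + base_latency_us - start_us)
        return fcts

    # Two 4 KB page faults arriving together: the second queues behind the first.
    print(fct_distribution([(0.0, 4096), (0.0, 4096)]))  # ~[3.8, 4.6] us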
Application Performance Degradation (with queueing delay)
[figure: degradation at 40 Gbps and 100 Gbps, rack scale and datacenter scale, against the no-queueing baselines; ~5% degradation line marked]
• 40 Gbps network: ~5% degradation at rack scale
• 100 Gbps network: datacenter scale for some apps, rack scale for others
Requirements so far:
• 3 us end-to-end latency
• 40 Gbps dedicated link
• Efficient transport
• 100 Gbps network
[diagram: application → OS → transport → NIC → switch → remote resource]
Is 100 Gbps / 3 us achievable?
Feasibility of End-to-End Latency Within a Rack (3 us target)
• Propagation: 0.32 us
• Transmission: 0.8 us
• Switching: 2 us
• Data copying (OS): 1.9 us
*Numbers estimated optimistically based on existing hardware
Reducing the end-to-end latency toward the 3 us target:
• Cut-through switching: switching drops from 2 us to 0.48 us
• NIC integration: data copying drops (the slide shows ~1 us)
• Use RDMA to bypass the OS on the remote side
• Feasible to meet the target across the datacenter?
*Numbers estimated optimistically based on existing hardware
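Summing the slide's optimistic estimates shows why these optimizations are needed; the ~1 us post-integration copying cost is my reading of the slide, so treat the second total as approximate:

    # End-to-end latency budget within a rack, in microseconds.
    baseline = 0.32 + 0.8 + 2.0 + 1.9     # propagation + transmission +
                                          # store-and-forward switching + OS copying
    print(baseline)                       # ~5.0 us: misses the 3 us target

    optimized = 0.32 + 0.8 + 0.48 + 1.0   # cut-through switching; NIC
                                          # integration + RDMA shrink copying
    print(optimized)                      # ~2.6 us: within the 3 us target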
Summary: requirements and feasibility
• 3 us end-to-end latency, 40 Gbps dedicated link
• Efficient transport (pFabric, SIGCOMM '13; pHost, CoNEXT '15)
• 100 Gbps network (available)
• Kernel bypass (RDMA, common)
• CPU-NIC integration (coming soon)
• Cut-through switches (common?)
• 100 Gbps links (available)
[diagram: application → OS → transport → NIC → switch → remote resource]
What's Next?
• Application design
• Rethinking the OS stack (storage and network)
• Failure models
• Network fabric design
Please refer to our paper for evaluations on improving application performance in disaggregated datacenters
Thank You!
Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, Scott Shenker
Discussion Points
• Scheduler design in a disaggregated server
• Rack scale vs. datacenter scale
• Applications with even higher memory bandwidth requirements?
• Performance degradation is OK, but what about energy?
• Future of servers with specialization: FPGAs, ASICs
• Security concerns of disaggregation