15-744: Computer Networking, L-14: Data Center Networking II

Overview
• Data Center Topologies
• Data Center Scheduling

Current solutions for increasing data center network bandwidth (Fat-Tree, BCube):
1. Hard to construct
2. Hard to expand

Background: Fat-Tree
• Inter-connect racks (of servers) using a fat-tree topology
• Fat-Tree: a special type of Clos network (after C. Clos)
• K-ary fat tree: three-layer topology (edge, aggregation, and core); the element counts are worked out in the sketch after this slide
  • Each pod consists of (k/2)^2 servers and 2 layers of k/2 k-port switches
  • Each edge switch connects to k/2 servers and k/2 aggregation switches
  • Each aggregation switch connects to k/2 edge and k/2 core switches
  • (k/2)^2 core switches, each connecting to k pods
[Figure: fat-tree with k=4]
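To make the counts above concrete, here is a minimal sketch (not from the slides; the function name is ours) that derives the size of a k-ary fat tree from the relations on this slide:

```python
def fat_tree_sizes(k):
    """Element counts for a k-ary fat tree (k even): k pods, each with
    (k/2)^2 servers, k/2 edge and k/2 aggregation switches, plus
    (k/2)^2 core switches."""
    assert k % 2 == 0, "k must be even"
    half = k // 2
    return {
        "pods": k,
        "servers": k * half * half,          # k^3 / 4 in total
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
    }

# The k=4 fat tree in the figure supports 16 servers; a fat tree built
# from 48-port switches (k=48) supports 27,648 servers.
print(fat_tree_sizes(4))
print(fat_tree_sizes(48))
```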

Why Fat-Tree?
• Fat tree has identical bandwidth at any bisection
• Each layer has the same aggregate bandwidth
• Can be built using cheap devices with uniform capacity
  • Each port supports the same speed as the end host
  • All devices can transmit at line speed if packets are distributed uniformly along available paths
• Great scalability: a k-port switch supports k^3/4 servers
[Figure: fat-tree network with K = 3 supporting 54 hosts]

An alternative: hybrid packet/circuit switched data center network
Goals of this work:
– Feasibility: a software design that enables efficient use of optical circuits
– Applicability: application performance over a hybrid network

Optical circuit switching vs. electrical packet switching
• Switching technology: electrical is store-and-forward packet switching; optical is circuit switching
• Switching capacity: electrical reaches 16 x 40 Gbps at the high end (e.g., Cisco CRS-1); optical offers 320 x 100 Gbps on the market (e.g., Calient FiberConnect)
• Switching time: electrical forwards at packet granularity; optical reconfigures in less than 10 ms (e.g., a MEMS optical switch)

Optical Circuit Switch
• [Figure: light from the input glass fiber bundle passes through lenses, a fixed mirror, and mirrors on motors that rotate to steer it to Output 1 or Output 2 (Nathan Farrington, SIGCOMM, 2010-09-02)]
• Does not decode packets
• Takes time to reconfigure

Optical circuit switching vs. electrical packet switching (continued)
• Switching technology: electrical is store-and-forward packet switching; optical is circuit switching
• Switching capacity: electrical reaches 16 x 40 Gbps at the high end (e.g., Cisco CRS-1); optical offers 320 x 100 Gbps on the market (e.g., Calient FiberConnect)
• Switching time: electrical forwards at packet granularity; optical reconfigures in less than 10 ms
• Switching traffic: electrical suits bursty, uniform traffic; optical suits stable, pair-wise traffic

Optical circuit switching is promising despite slow switching time
• [IMC 09][HotNets 09]: “Only a few ToRs are hot and most their traffic goes to a few other ToRs. …”
• [WREN 09]: “…we find that traffic at the five edge switches exhibit an ON/OFF pattern…”
Full bisection bandwidth at packet granularity may not be necessary

Hybrid packet/circuit switched network architecture
• Electrical packet-switched network for low-latency delivery
• Optical circuit-switched network for high-capacity transfer
• Optical paths are provisioned rack-to-rack
  – A simple and cost-effective choice
  – Aggregate traffic on a per-rack basis to better utilize optical circuits

Design requirements
• Control plane:
  – Traffic demand estimation
  – Optical circuit configuration
• Data plane:
  – Dynamic traffic de-multiplexing
  – Optimizing circuit utilization (optional)

c-Through (a specific design)
• No modification to applications and switches
• Leverages end hosts for traffic management
• Centralized control for circuit configuration

c-Through: traffic demand estimation and traffic batching
• Applications write into enlarged per-flow socket buffers; buffer occupancy yields a per-rack traffic demand vector (a rough sketch follows below)
  1. Transparent to applications
  2. Packets are buffered per-flow to avoid head-of-line (HOL) blocking
• Accomplishes two requirements:
  – Traffic demand estimation
  – Pre-batching data to improve optical circuit utilization
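As a rough illustration of how an end host could derive the per-rack demand vector (the flow list and rack-lookup helper are illustrative assumptions; c-Through reads occupancy from enlarged kernel socket buffers):

```python
from collections import defaultdict

def per_rack_demand(flows, rack_of):
    """Sum the bytes queued in per-flow socket buffers by destination rack.

    flows:   iterable of (dst_host, bytes_queued) pairs, one per flow
    rack_of: function mapping a destination host to its rack id
    """
    demand = defaultdict(int)
    for dst_host, bytes_queued in flows:
        demand[rack_of(dst_host)] += bytes_queued
    return dict(demand)

# Example: three flows from this host, two of them headed to rack B.
flows = [("h7", 4_000_000), ("h9", 2_500_000), ("h21", 300_000)]
rack_of = {"h7": "rackB", "h9": "rackB", "h21": "rackC"}.get
print(per_rack_demand(flows, rack_of))  # {'rackB': 6500000, 'rackC': 300000}
```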

c-Through: optical circuit configuration
• Hosts send their traffic demand to a central controller; the controller pushes back the circuit configuration
• Uses Edmonds’ algorithm to compute the optimal configuration, i.e., a maximum-weight matching of racks (see the sketch below)
• Many ways to reduce the control traffic overhead
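A minimal sketch of that matching step, using networkx (whose general matching routine is based on the blossom method introduced by Edmonds' algorithm); the demand numbers are made up for illustration:

```python
import networkx as nx

# Illustrative inter-rack demand estimates (bytes queued), as reported by hosts.
demand = {
    ("rackA", "rackB"): 9_000_000,
    ("rackA", "rackC"): 1_000_000,
    ("rackB", "rackD"): 2_000_000,
    ("rackC", "rackD"): 8_000_000,
}

g = nx.Graph()
for (src, dst), queued in demand.items():
    g.add_edge(src, dst, weight=queued)

# Each rack has a single optical port, so a circuit configuration is a matching;
# maximizing matched weight favors the heaviest rack pairs.
circuits = nx.max_weight_matching(g, maxcardinality=True)
print(circuits)  # e.g., {('rackA', 'rackB'), ('rackC', 'rackD')}
```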

c-Through: traffic de-multiplexing
• VLAN-based network isolation: one VLAN (#1) for the packet-switched network, another (#2) for the circuit-switched network
  – No need to modify switches
  – Avoids the instability caused by circuit reconfiguration
• Traffic control on hosts:
  – Controller informs hosts about the circuit configuration
  – A traffic de-multiplexer on each end host tags packets with the appropriate VLAN (a sketch of this follows below)
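A toy sketch of that per-host de-multiplexer; the VLAN IDs and the rack-level bookkeeping are assumptions for illustration, not c-Through's actual implementation:

```python
PACKET_VLAN = 1    # always-available electrical packet-switched network
CIRCUIT_VLAN = 2   # optical circuit-switched network

class Demultiplexer:
    """Tags outgoing traffic with the VLAN of the network it should use."""

    def __init__(self):
        self.circuit_peer = None   # rack reachable over the current optical circuit

    def on_circuit_update(self, peer_rack):
        """Called when the controller announces a new circuit configuration."""
        self.circuit_peer = peer_rack

    def vlan_for(self, dst_rack):
        return CIRCUIT_VLAN if dst_rack == self.circuit_peer else PACKET_VLAN

demux = Demultiplexer()
demux.on_circuit_update("rackB")
print(demux.vlan_for("rackB"))  # 2: ride the optical circuit
print(demux.vlan_for("rackC"))  # 1: fall back to the packet-switched network
```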


Overview
• Data Center Topologies
• Data Center Scheduling

Datacenters and OLDIs
§ OLDI = OnLine Data Intensive applications
§ E.g., web search, retail, advertisements
§ An important class of datacenter applications
§ Vital to many Internet companies
OLDIs are critical datacenter applications

OLDIs
• Partition-aggregate
  • Tree-like structure
  • Root node sends query
  • Leaf nodes respond with data
• Deadline budget split among nodes and network
  • E.g., total = 300 ms, parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses → affect user experience & revenue

Challenges Posed by OLDIs
Two important properties:
1) Deadline bound (e.g., 300 ms)
  • Missed deadlines affect revenue
2) Fan-in bursts
  § Large data, 1000s of servers
  § Tree-like structure (high fan-in)
  § Fan-in bursts → long “tail latency”
  § Network shared with many apps (OLDI and non-OLDI)
Network must meet deadlines & handle fan-in bursts

Current Approaches
• TCP: deadline agnostic, long tail latency
  § Congestion timeouts (slow), ECN (coarse)
• Datacenter TCP (DCTCP) [SIGCOMM '10]
  § First to comprehensively address tail latency
  § Finely varies sending rate based on the extent of congestion (a sketch of this adaptation follows below)
  § Shortens tail latency, but is not deadline aware
  § ~25% missed deadlines at high fan-in & tight deadlines
DCTCP handles fan-in bursts, but is not deadline-aware
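A minimal sketch of the DCTCP-style adaptation referred to above (this follows the published DCTCP update rule; the class and parameter defaults are illustrative, not the kernel implementation):

```python
class DctcpSender:
    """Keeps a smoothed estimate of congestion and cuts the window in
    proportion to it, instead of halving on every congestion signal."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd     # congestion window, in packets
        self.alpha = 0.0     # smoothed fraction of ECN-marked packets
        self.g = g           # EWMA gain

    def on_window_acked(self, acked, marked):
        """Called once per window of ACKs: `marked` of `acked` packets carried ECN marks."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            self.cwnd *= 1 - self.alpha / 2   # mild cut under light congestion, deep cut under heavy
        else:
            self.cwnd += 1                    # otherwise grow as in standard congestion avoidance
        return self.cwnd

sender = DctcpSender()
print(sender.on_window_acked(acked=10, marked=3))  # some marks: window shrinks slightly
print(sender.on_window_acked(acked=10, marked=0))  # no marks: window grows
```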

D2TCP
• Deadline-aware and handles fan-in bursts
• Key idea: vary sending rate based on both deadline and extent of congestion
  § Built on top of DCTCP
  § Distributed: uses per-flow state at end hosts
  § Reactive: senders react to congestion; no knowledge of other flows

D2TCP's Contributions
1) Deadline-aware and handles fan-in bursts
  § Elegant gamma-correction for congestion avoidance (a sketch follows below)
  § Far-deadline flows back off more; near-deadline flows back off less
  § Reactive, decentralized, state only at end hosts
2) Does not hinder long-lived (non-deadline) flows
3) Coexists with TCP → incrementally deployable
4) No change to switch hardware → deployable today
D2TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively
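The slides do not give the formula; as a rough sketch following the published D2TCP design, the DCTCP congestion estimate alpha is raised to a deadline-imminence factor d before the window is cut, so under the same congestion level a far-deadline flow yields more than a near-deadline one:

```python
def d2tcp_cwnd_update(cwnd, alpha, d, marked):
    """One window update in the D2TCP style (sketch).

    alpha:  DCTCP-style smoothed fraction of ECN-marked packets (0..1)
    d:      deadline-imminence factor, clamped to roughly [0.5, 2.0] in the
            paper; d > 1 for near-deadline flows, d < 1 for far-deadline
            flows, and d = 1 falls back to plain DCTCP behavior
    marked: whether any packets in this window were ECN-marked
    """
    penalty = alpha ** d              # the gamma correction
    if marked:
        return cwnd * (1 - penalty / 2)
    return cwnd + 1

# Same congestion level (alpha = 0.5), different deadlines:
print(d2tcp_cwnd_update(20, 0.5, d=0.5, marked=True))  # far deadline:  ~12.9, backs off more
print(d2tcp_cwnd_update(20, 0.5, d=2.0, marked=True))  # near deadline: ~17.5, backs off less
```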

Coflow Definition


Data Center Summary
• Topology
  • Easy deployment/costs
  • High bisection bandwidth makes placement less critical
  • Augment on demand to deal with hot spots
• Scheduling
  • Delays are critical in the data center
  • Can try to handle this in congestion control
  • Can try to prioritize traffic in switches
  • May need to consider dependencies across flows to improve scheduling

Review
• Networking background
  • OSPF, RIP, TCP, etc.
• Design principles and architecture
  • E2E and Clark
• Routing/Topology
  • BGP, power laws, HOT topology

Review
• Resource allocation
  • Congestion control and TCP performance
  • FQ/CSFQ/XCP
• Network evolution
  • Overlays and architectures
  • OpenFlow and Click
  • SDN concepts
  • NFV and middleboxes
• Data centers
  • Routing
  • Topology
  • TCP
  • Scheduling