ProjecToR: Agile Reconfigurable Data Center Interconnect
Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Nikhil Devanur, Janardhan Kulkarni, Gireeja Ranade, Pierre Blanche, Houman Rastegarfar, Madeleine Glick, Daniel Kilper

Today's data center interconnects
[Figure: ToRs A, B, C, D connected by 10 Gbps links, with two example demand matrices]
Ideal demand matrix: uniform and static
Non-ideal demand matrix: skewed and dynamic
Static capacity between ToR pairs

Need for a reconfigurable interconnect
Data:
• 200K servers across 4 production clusters
• Cluster sizes: 100 to 2,500 racks
Observation:
• Many rack pairs exchange little traffic
• Only some hot rack pairs are active
Implication:
• Static topology with uniform capacity:
  • Over-provisioned for most rack pairs
  • Under-provisioned for a few others
Reconfigurable interconnect: dynamically provide additional capacity between hot rack pairs

Desirable properties of a reconfigurable interconnect
[Figure labels: static, reconfigurable, optical switch; ToRs A, B, C, D]
Observation:
• Traffic matrices differ widely
Implication:
• Difficult to determine the static vs. reconfigurable divide
(Seamless interconnect)

Desirable properties of a reconfigurable interconnect
Observation:
• Source racks send large amounts of traffic to many other racks
Implications:
• Should create direct links to many other racks (high fan-out)
• Should switch quickly among destinations (low switching time)

Properties of reconfigurable interconnects
Properties compared: seamless, high fan-out, low switching time
Systems and their enabler technology:
• Helios, Mordia [SIGCOMM'10, SIGCOMM'13]: optical circuit switch
• 3D Beamforming, Flyways [SIGCOMM'12, SIGCOMM'11, HotNets'09]: 60 GHz
• FireFly [SIGCOMM'14]: free-space optics
• ProjecToR: free-space optics

ProjecToR interconnect
• Free-space topology (seamless)
• 18,000 fan-out (60x more than optical circuit switches)
• 12 µs switching time (2,500x faster than optical circuit switches)
[Figure labels: laser, photodetector, static topology]

Reconfiguration in a ProjecToR interconnect
• Digital micromirror device to redirect light
• Mirror assembly to magnify reach

Digital Micromirror Device (DMD)
[Figure labels: array of micromirrors (10 µm); memory cell]

Using DMDs to redirect light
[Figure: bit pattern of 0s and 1s loaded onto the micromirror array]
• Theoretical number of accessible locations: total number of micromirrors
  • 768 x 768 = 589,824
• Cross-talk between adjacent locations
• Achievable number of accessible locations
  • 768 x 768 / 32 = 18,432

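The fan-out arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `accessible_locations` and its parameter names are made up for illustration, while the 768 x 768 array size and the divisor of 32 are taken from the slide.

```python
def accessible_locations(rows, cols, crosstalk_factor):
    """Return (theoretical, achievable) counts of accessible locations.

    theoretical: one location per micromirror.
    achievable:  theoretical count divided by the cross-talk factor
                 (32 on the slide; the name is an illustrative assumption).
    """
    theoretical = rows * cols
    achievable = theoretical // crosstalk_factor
    return theoretical, achievable


theoretical, achievable = accessible_locations(768, 768, 32)
print(theoretical)  # 589824
print(achievable)   # 18432
```
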
Using mirror assemblies to magnify reach
• Challenge: DMDs have a narrow angular reach
• Solution: couple DMDs with angled mirrors

Questions to answer
• How feasible is a ProjecToR interconnect?
  • Built and micro-benchmarked a small ProjecToR prototype
  • Robustness to environmental conditions
• How should packets be routed in a ProjecToR interconnect?
  • Devised a scheduling algorithm and simulated its performance
• How much does a ProjecToR interconnect cost?
  • Estimated cost based on the cost breakdown of each component

Prototype: A 3-ToR ProjecToR interconnect
[Figure: ToR 1, ToR 2, ToR 3]

Prototype: A 3-ToR ProjecToR interconnect
[Figure labels: source laser, DMD, mirrors reflecting to ToR 2 and ToR 3]

Prototype: throughput
[Figure: CDF of TCP throughput (Gbps) for the ProjecToR link and a wired link; x-axis ticks from 8.78 to 9.38 Gbps]

Prototype: switching time
[Figure: ToR 1, ToR 2, ToR 3]

Prototype: switching time
[Figure: receive power (dBm) vs. time (µs) for ToR 1 -> ToR 2 and ToR 1 -> ToR 3; annotated switching time: 12 µs]

Connecting lasers and photodetectors
[Figure: lasers and photodetectors on ToR 1, ToR 2, ToR 3; dedicated topology and opportunistic links]
• Two-topology approach
  • Slow-switching topology, or dedicated topology
  • Fast-switching links, or opportunistic links

Routing packets
[Figure labels: virtual output queues at ToR 1, ToR 2, ToR 3; opportunistic link; dedicated topology]
K-shortest paths routing

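As a rough illustration of K-shortest paths routing, the sketch below enumerates loop-free paths between two ToRs in order of hop count, using a best-first search over partial paths. It is not the routing code behind the talk: the function name `k_shortest_paths`, the adjacency-dict representation, and the choice of hop count as the path metric are all assumptions made for the example, and the approach is only practical for small topologies. Virtual output queues are not modeled.

```python
import heapq


def k_shortest_paths(adj, src, dst, k):
    """Return up to k loop-free paths from src to dst, fewest hops first.

    adj maps each node to a list of neighbours (hypothetical representation).
    """
    heap = [(0, [src])]  # entries are (hop count, path so far)
    found = []
    while heap and len(found) < k:
        hops, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            found.append(path)
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path:  # keep paths loop-free
                heapq.heappush(heap, (hops + 1, path + [nxt]))
    return found


# Toy 3-ToR topology in which every ToR can reach the other two.
topology = {
    "ToR1": ["ToR2", "ToR3"],
    "ToR2": ["ToR1", "ToR3"],
    "ToR3": ["ToR1", "ToR2"],
}
paths = k_shortest_paths(topology, "ToR1", "ToR3", k=2)
print(paths)  # [['ToR1', 'ToR3'], ['ToR1', 'ToR2', 'ToR3']]
```
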
Scheduling opportunistic links
• Given a set of potential links and the current traffic demand, find a set of active opportunistic links
[Figure: source and destination ToRs (ToR 1, ToR 2, ToR 3)]

Scheduling opportunistic links
• Standard switch scheduling problem
• Blossom matching
• Matrix decomposition
• Centralized scheduler
• Single-tiered matching
[Figure: matching between sources (input) and destinations (output)]

Scheduling opportunistic links
• Standard switch scheduling problem: blossom matching, matrix decomposition; centralized scheduler; single-tiered matching
• ProjecToR scheduling: decentralized and two-tiered (source ToRs, destination ToRs)
• Extended the Gale-Shapley algorithm for finding stable matches [GS-1962]
• Constant-competitive against an offline optimal allocation
[Figure: source ToRs (input) matched to destination ToRs (output)]

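For readers unfamiliar with stable matching, here is a minimal sketch of the classic Gale-Shapley procedure that the slide says was extended. It is the textbook algorithm, not the ProjecToR extension: sources propose to destinations in preference order, and each destination keeps the best proposer seen so far. The function name, the preference dictionaries, and the S1/S2/D1/D2 labels are hypothetical; how preferences would be derived from traffic demand is not specified here. The sketch assumes every destination ranks every source that lists it.

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Textbook Gale-Shapley; returns a dict mapping proposer -> acceptor."""
    # rank[a][p] is p's position in a's preference list (lower is better).
    rank = {a: {p: i for i, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to try
    engaged = {}                                  # acceptor -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        prefs = proposer_prefs[p]
        if next_choice[p] >= len(prefs):
            continue  # p has exhausted its list and stays unmatched
        a = prefs[next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(a)
        if current is None:
            engaged[a] = p                 # a was free: accept p
        elif rank[a][p] < rank[a][current]:
            engaged[a] = p                 # a prefers p: release current
            free.append(current)
        else:
            free.append(p)                 # a rejects p: p tries again later
    return {p: a for a, p in engaged.items()}


# Hypothetical preferences for two sources and two destinations.
src_prefs = {"S1": ["D1", "D2"], "S2": ["D1", "D2"]}
dst_prefs = {"D1": ["S2", "S1"], "D2": ["S1", "S2"]}
matching = gale_shapley(src_prefs, dst_prefs)
print(matching)
```

In this toy input both sources prefer D1, D1 prefers S2, and so S1 ends up matched with D2; no source-destination pair would both rather be matched with each other, which is what makes the result stable.
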
Simulations
Fat tree, FireFly, ProjecToR
• 128-ToR (1,024 servers) with 16 lasers and photodetectors
• Day-long traffic matrix: to build the dedicated topology
• 5-min traffic matrix: to generate the probability of ToR pair communication
• TCP flow arrivals with a Poisson arrival rate and realistic flow sizes

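Poisson flow arrivals, as mentioned in the last bullet, are commonly generated by drawing exponentially distributed gaps between consecutive flows. The sketch below shows that idea only; the function name, the rate of 100 flows per second, the one-second duration, and the seed are illustrative assumptions and not the simulation parameters from the talk. Flow sizes are not modeled.

```python
import random


def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Return sorted arrival times in (0, duration_sec) for a Poisson process.

    Inter-arrival gaps are drawn from an exponential distribution with
    mean 1 / rate_per_sec, which yields Poisson arrivals.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return times
        times.append(t)


# Illustrative values only: 100 flows per second on average, for one second.
arrivals = poisson_arrival_times(rate_per_sec=100.0, duration_sec=1.0)
print(len(arrivals))
```
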
Simulation results
[Figure: average flow completion time (ms) vs. average load (%) for fat tree, FireFly, and ProjecToR; annotation: 95%]
Fat tree: no reconfigurability
FireFly: slow switching time (-), low fan-out (-)
ProjecToR: reconfigurable (+), switching time of 12 µs (+), high fan-out (+)
• Tail flow completion time
• Different traffic matrices
• Impact of fan-out
• Impact of switching time

ProjecToR: A reconfigurable data center interconnect
Seamless, high fan-out, low switching time
Small prototype demonstrates feasibility
Decentralized flow scheduling algorithm
[Figure: ToR 1, ToR 2, ToR 3]