Jupiter Rising: A decade of Clos topologies and centralized control in Google’s datacenter networks

Credits: Authors of “Jupiter Rising” [SIGCOMM 2015]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. And several teams at Google: Platforms Networking Hardware and Software Development, Platforms SQA, Mechanical Engineering, Cluster Engineering, NetOps, Global Infrastructure Group (GIG), and SRE.

Grand challenge for datacenter networks
• Tens of thousands of servers interconnected in clusters
• Islands of bandwidth were a key bottleneck for Google a decade ago
  ■ Engineers struggled to optimize for b/w locality
  ■ Stranded compute/memory resources
  ■ Hindered app scaling
[Figure: bandwidth islands in a datacenter — roughly 1 Gbps / machine within a rack, 100 Mbps / machine within a small cluster, 1 Mbps / machine across the datacenter]

Grand challenge for datacenter networks
• Challenge: Flat b/w profile across all servers
• Simplify job scheduling (remove locality)
• Save significant resources via better bin-packing
• Allow application scaling
[Figure: X Gbps / machine flat bandwidth across the datacenter]

Motivation
• Traditional network architectures
  • Cost prohibitive
  • Could not keep up with our bandwidth demands
  • Operational complexity of “box-centric” deployment
• Opportunity: A datacenter is a single administrative domain
  • One organization designs, deploys, controls, operates the n/w
  • . . . And often also the servers

Three pillars that guided us
• Merchant silicon: General-purpose, commodity-priced, off-the-shelf switching components
• Clos topologies: Accommodate low-radix switch chips to scale nearly arbitrarily by adding stages (see the sketch below)
• Centralized control / management
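As a back-of-the-envelope illustration of the Clos pillar (not from the talk, and not Jupiter’s actual parameters): a short Python sketch of how a folded-Clos / fat-tree fabric built only from identical k-port merchant-silicon chips grows its host count and full-bisection bandwidth as tiers are added.

```python
# Illustrative only (not Jupiter's actual parameters): scaling of a folded-Clos /
# fat-tree fabric built from identical k-port switch chips.

def fat_tree_hosts(ports_per_chip: int, levels: int = 3) -> int:
    """Hosts supported by a non-blocking folded Clos with `levels` tiers of
    k-port switches (classic result: 2 * (k/2)^levels; k^3/4 for 3 tiers)."""
    k = ports_per_chip
    assert k % 2 == 0, "each chip splits its ports between up- and down-links"
    return 2 * (k // 2) ** levels

def bisection_bw_tbps(ports_per_chip: int, levels: int, link_gbps: float) -> float:
    """Full-bisection bandwidth in Tbps: every host can send at line rate,
    so bisection = hosts * link_rate / 2."""
    return fat_tree_hosts(ports_per_chip, levels) * link_gbps / 2 / 1000

if __name__ == "__main__":
    # e.g. 24- or 48-port 10G merchant silicon, three or five tiers
    for k, tiers in [(24, 3), (48, 3), (24, 5)]:
        h = fat_tree_hosts(k, tiers)
        bw = bisection_bw_tbps(k, tiers, link_gbps=10)
        print(f"{k}-port chips, {tiers} tiers: {h} hosts, {bw:.1f} Tbps bisection")
```

The point of the classic 2·(k/2)^levels result is that scale comes from adding stages of cheap chips rather than from buying a bigger chassis.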

SDN: The early days
• Control options
  • Protocols: OSPF, IS-IS, BGP, etc.; box-centric config/management
  • Build our own
• Reasons we chose to build our own central control/management:
  • Limited support for multipath forwarding
  • No robust open source stacks
  • Broadcast protocol scalability a concern at scale
  • Network manageability painful with individual switch configs

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Outline
• Motivation
• Network evolution
• Centralized control / management
• Experience

[Chart: bisection b/w (bps, log scale, 1 T to 1000 T) vs. year, ’04–’13]

2004 state of the art: 4-Post cluster network
[Figure: Cluster Routers 1–4 connected to ToR switches in Server Racks 1–512; link labels 2 x 10G and 1G]
+ Standard network configuration
- Scales to 2 Tbps (limited by the biggest router)
- Scale up: forklift the cluster when upgrading routers

DCN bandwidth growth demanded much more

Five generations of Clos for Google scale
[Figure: generic Clos fabric — Spine Blocks 1..M on top, Edge Aggregation Blocks 1..N below, server racks with ToR switches at the bottom]
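To make the block diagram concrete, here is a toy sketch (illustrative block counts, not Jupiter’s) that models the spine-block / aggregation-block mesh as a graph and counts the two-hop paths between aggregation blocks; this path diversity is what later slides call massive multipath.

```python
# Sketch (illustrative parameters, not Jupiter's actual block counts):
# a Clos fabric as a bipartite graph of spine blocks and edge aggregation
# blocks. Every aggregation block has an uplink bundle to every spine block,
# so any two aggregation blocks are joined by M two-hop paths.

from itertools import product

def build_clos(num_spine_blocks: int, num_agg_blocks: int):
    """Return aggregation blocks, spine blocks, and the set of uplink bundles."""
    aggs = [f"agg{i}" for i in range(num_agg_blocks)]
    spines = [f"spine{j}" for j in range(num_spine_blocks)]
    links = set(product(aggs, spines))      # full mesh between the two tiers
    return aggs, spines, links

def paths_between(agg_a: str, agg_b: str, spines, links) -> list:
    """All two-hop paths agg_a -> spine -> agg_b."""
    return [(agg_a, s, agg_b) for s in spines
            if (agg_a, s) in links and (agg_b, s) in links]

if __name__ == "__main__":
    aggs, spines, links = build_clos(num_spine_blocks=8, num_agg_blocks=4)
    p = paths_between("agg0", "agg1", spines, links)
    print(f"{len(p)} edge-disjoint paths between agg0 and agg1")  # -> 8
```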

[Chart: bisection b/w timeline — 4 Post]

Firehose 1.0
+ Scales to 10 T (10 K servers @ 1G)
- Issues with servers housing switch cards
- Was not deployed in production

[Chart: bisection b/w timeline — 4 Post, Firehose 1.0]

Firehose 1.1
+ Chassis-based solution (but no backplane)
- Bulky CX4 copper cables restrict scale

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Firehose 1.1
+ In production as a “bag-on-side”
+ Central control and management

[Chart: bisection b/w timeline — 4 Post, Firehose 1.0, Firehose 1.1]

Watchtower
+ Chassis with backplane
+ Fiber (10G) in all stages
+ Scale to 82 Tbps fabric
+ Global deployment

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Watchtower
+ Cable bundling saves 40% TCO
+ 10x reduction in fiber runs to deploy

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Watchtower
+ Connect externally via border routers
+ Massive external burst b/w
+ Enables cross-cluster MapReduce
+ Retire Cluster Routers completely

[Chart: bisection b/w timeline — 4 Post, Firehose 1.0, Firehose 1.1, Watchtower]

Saturn
+ 288 x 10G port chassis
+ Enables 10G to hosts
+ Scales to 207 Tbps fabric
+ Reuse in WAN (B4)

[Chart: bisection b/w timeline — 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]

Jupiter topology
+ Scales out building-wide to 1.3 Pbps

Jupiter racks
+ Enables 40G to hosts
+ External control servers
+ OpenFlow

[Chart: bisection b/w timeline — 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter (1.3 P)]

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Network control and config
New conventional wisdom from engineering systems at scale:
• Logically centralized control plane beats full decentralization
• Centralized configuration and management dramatically simplifies system aspects

Network config/management
[Figure: a n/w designer writes a Spec (size of spine, base IP prefix, ToR rack indexes, . . .); fabric tools/scripts expand it into a Fabric DB (Bill of Materials, rack config, port maps, cable bundles, CPN design, cluster config, . . .) that drives the datacenter network fabric of switches (X, Y, Z) and servers]
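A minimal sketch of the workflow in the figure, with hypothetical field names (spine_blocks, base_prefix, tor_racks) standing in for the slide’s spec bullets: a small declarative cluster spec is expanded by fabric tooling into per-switch config stubs and a port map, so no individual switch is ever hand-configured.

```python
# Sketch only: expand a tiny declarative cluster "spec" into per-switch
# config stubs and a port map. Field names are hypothetical, chosen to
# mirror the bullets on the slide; the real spec and artifacts are richer.

import ipaddress
from dataclasses import dataclass

@dataclass
class ClusterSpec:
    spine_blocks: int          # "size of spine"
    base_prefix: str           # "base IP prefix"
    tor_racks: int             # "ToR rack indexes"

def generate_configs(spec: ClusterSpec) -> dict:
    """Return {switch_name: config dict}; one server subnet per ToR rack."""
    subnets = ipaddress.ip_network(spec.base_prefix).subnets(new_prefix=26)
    configs = {}
    for rack, subnet in zip(range(spec.tor_racks), subnets):
        configs[f"tor{rack}"] = {
            "server_subnet": str(subnet),
            # each ToR gets one uplink bundle per spine block
            "uplinks": [f"spine{s}:port{rack}" for s in range(spec.spine_blocks)],
        }
    for s in range(spec.spine_blocks):
        configs[f"spine{s}"] = {
            "downlinks": [f"tor{r}:uplink{s}" for r in range(spec.tor_racks)],
        }
    return configs

if __name__ == "__main__":
    spec = ClusterSpec(spine_blocks=4, base_prefix="10.128.0.0/16", tor_racks=8)
    for name, cfg in generate_configs(spec).items():
        print(name, cfg)
```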

Network config/management
[Figure: a n/w operator drives n/w management actions (e.g. install, push config, update software, drain, . . .) on the datacenter network fabric of switches (X, Y, Z) and servers; scalable infrastructure monitors, alerts, and collects/processes logs into network and server monitoring, keyed by the cluster config (switches, links, servers)]

Firepath route controller
[Figure: Firepath Clients 1..N send interface state updates over the Control Plane Network to the Firepath Master via the FMRP protocol; the master maintains the Link State database]
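A toy model of the pattern in the figure (class names follow the slide; the message handling and redistribution logic are invented for illustration and are not the real FMRP): each client reports only its local interface state, the master folds every report into a single link-state database, and the merged snapshot is pushed back to all clients.

```python
# Toy illustration of centralized link-state aggregation (not the real FMRP).
# Clients report only local interface state; the master owns the fabric-wide
# link-state database and redistributes it to all clients.

class FirepathMaster:
    def __init__(self):
        self.link_state = {}          # (client, interface) -> "up" | "down"
        self.clients = []

    def register(self, client):
        self.clients.append(client)

    def interface_state_update(self, client_id: str, iface: str, state: str):
        """Handle one FMRP-style update and push the merged view back out."""
        self.link_state[(client_id, iface)] = state
        snapshot = dict(self.link_state)
        for c in self.clients:
            c.apply_link_state(snapshot)   # clients recompute routes from this

class FirepathClient:
    def __init__(self, client_id: str, master: FirepathMaster):
        self.client_id = client_id
        self.master = master
        self.view = {}
        master.register(self)

    def local_interface_change(self, iface: str, state: str):
        self.master.interface_state_update(self.client_id, iface, state)

    def apply_link_state(self, snapshot: dict):
        self.view = snapshot           # real clients derive forwarding tables

if __name__ == "__main__":
    master = FirepathMaster()
    c1, c2 = FirepathClient("sw1", master), FirepathClient("sw2", master)
    c1.local_interface_change("port3", "down")
    print(c2.view)   # sw2 now sees sw1/port3 down without any flooding
```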

Firepath route controller
[Figure: as before, Firepath Clients exchange interface state updates with the Firepath Master (Link State database) over the Control Plane Network via FMRP; border-router clients additionally run BGP, speaking eBGP (inband) to external BGP peers]

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear (answer: integrate a BGP stack on the border routers)
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers (answer: tune switches, e.g. ECN, and hosts, e.g. DCTCP; see the sketch below)
  • High availability from cheap/less reliable components
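Roughly why ECN-marking switches plus DCTCP hosts compensate for small on-chip buffers (a textbook sketch of the DCTCP window rule per RFC 8257, not Google’s host stack): the sender tracks the fraction of ECN-marked packets per RTT and cuts its window in proportion to that fraction, keeping switch queues short without the throughput collapse of halving on every mark.

```python
# Textbook DCTCP congestion-window rule (illustrative; not Google's stack).
# alpha estimates the fraction of ECN-marked packets; the window is cut in
# proportion to alpha instead of TCP's halving on any loss/mark.

class DctcpSender:
    def __init__(self, cwnd_pkts: float = 100.0, g: float = 1.0 / 16):
        self.cwnd = cwnd_pkts
        self.alpha = 0.0      # EWMA of marked fraction
        self.g = g            # EWMA gain (RFC 8257 suggests 1/16)

    def on_ack_window(self, acked_pkts: int, marked_pkts: int):
        """Called once per RTT with how many ACKed packets carried ECN marks."""
        frac = marked_pkts / max(acked_pkts, 1)
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked_pkts > 0:
            self.cwnd *= (1 - self.alpha / 2)   # gentle, proportional cut
        else:
            self.cwnd += 1                      # standard additive increase

if __name__ == "__main__":
    s = DctcpSender()
    for rtt in range(20):
        marked = 10 if rtt % 4 == 0 else 0      # mild, periodic marking
        s.on_ack_window(acked_pkts=100, marked_pkts=marked)
        print(f"rtt {rtt:2d}: alpha={s.alpha:.3f} cwnd={s.cwnd:6.1f}")
```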

Challenges faced in building our own solution
• Topology and deployment
  • Introducing our network to production
  • Unmanageably high number of cables/fiber
  • Cluster-external burst b/w demand
• Control and management
  • Operating at huge scale
  • Routing scalability / routing with massive multipath
  • Interop with external vendor gear
• Performance and reliability
  • Small on-chip buffers
  • High availability from cheap/less reliable components (answer: redundancy; diversity; implement only what was needed)

Experience: Outages
Three broad categories of outages:
• Control software failures at scale
  • Cluster-wide reboot did not converge
    ■ Liveness protocol contended for CPU with the routing process
  • Cannot test at scale in a hardware lab
    ■ Developed virtualized testbeds
• Aging hardware exposes corner cases
• Component misconfigurations
[Figure: failure cascade — large remote link churn and local link churn drive frequent topology updates; the routing client and link-liveness protocol peg the embedded CPU, causing missed heartbeats]

Grand challenge for datacenter networks
• Challenge: Flat b/w profile across all servers
• Simplify job scheduling (remove locality)
• Save significant resources (better bin-packing)
• Allow application scaling
• Scaled datacenter networks to Petabit scale in under a decade
• Bonus: reused solution in campus aggregation and WAN
[Figure: X Gbps / machine flat bandwidth across the datacenter]

Backup

What’s different about datacenter networking
• Single administrative domain (homogeneity, protocol modification much easier)
• More plentiful and more uniform bandwidth (aggregate bandwidth of the entire Internet in one building)
• Tiny round-trip times
• Massive multipath
• Little buffering
• Latency/tail latency as important as bandwidth
• From client-server to large-scale, massively parallel computation

Watchtower
+ Depop deployments
+ Cable bundling
+ Connect externally via Border Routers
+ Reuse in inter-cluster networking

Job mix on an example cluster

No locality within blocks

Firepath route controller
[Figure: both switch types run a Firepath Client (CONFIG, port status, route updates) over an embedded stack, kernel, and device drivers, speaking the Firepath protocol to the Firepath Master; a CBR switch additionally runs a BGP stack with its RIB, redistributing intra- and inter-cluster routes and handling eBGP packets, while a non-CBR fabric switch does not]
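A sketch of the redistribution role the CBR box plays in the figure (types and the export policy are invented for illustration, not the production code path): BGP-learned external routes are injected toward Firepath, and cluster prefixes learned from Firepath are announced back out over eBGP.

```python
# Sketch of two-way route redistribution on a cluster border router (CBR).
# Types and policy are invented for illustration; the real CBR integrates a
# BGP stack with the Firepath client as shown on the slide.

from ipaddress import ip_network

class CbrRedistributor:
    def __init__(self, cluster_prefixes):
        self.cluster_prefixes = [ip_network(p) for p in cluster_prefixes]
        self.rib = {}                      # prefix -> (protocol, next_hop)

    def learn_bgp(self, prefix: str, next_hop: str):
        """External route from an eBGP peer: install and hand to Firepath."""
        self.rib[ip_network(prefix)] = ("bgp", next_hop)
        return ("to_firepath", prefix, next_hop)

    def learn_firepath(self, prefix: str, next_hop: str):
        """Intra-cluster route from Firepath: install, and advertise over
        eBGP only if it falls inside the cluster's own aggregate prefixes."""
        net = ip_network(prefix)
        self.rib[net] = ("firepath", next_hop)
        if any(net.subnet_of(agg) for agg in self.cluster_prefixes):
            return ("announce_ebgp", prefix)
        return ("no_export", prefix)

if __name__ == "__main__":
    cbr = CbrRedistributor(cluster_prefixes=["10.128.0.0/16"])
    print(cbr.learn_bgp("0.0.0.0/0", next_hop="peer1"))
    print(cbr.learn_firepath("10.128.4.0/24", next_hop="spine2"))
```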

Experience: Congestion
Packet loss initially > 1%. Mitigations:
• QoS
• Switch buffer tuning
• Upgradable bandwidth
• ECMP configuration tuning (see the sketch below)
• ECN + DCTCP
• Congestion window bound
Packet loss reduced to < 0.01%
[Figure: drops in an example Saturn cluster]
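To illustrate the ECMP item in the mitigation list (a generic five-tuple hash sketch; real switches use configurable hardware hash functions, which is what “ECMP configuration tuning” adjusts): flows are spread across equal-cost next hops by hashing packet headers, so the hash inputs and the set of programmed next hops directly determine load balance, and hence buffer pressure and drops.

```python
# Generic ECMP five-tuple hashing sketch (illustrative; real switches use
# hardware hash functions with configurable field selection).

import hashlib
from collections import Counter

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of the equal-cost next hops for this flow.
    All packets of a flow hash identically, so a flow never reorders."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

if __name__ == "__main__":
    hops = [f"spine{i}" for i in range(8)]
    # many flows from one host: check how evenly they spread across 8 spines
    counts = Counter(
        ecmp_next_hop("10.0.0.1", "10.0.1.2", 32768 + f, 443, "tcp", hops)
        for f in range(10_000)
    )
    print(dict(counts))
```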