Jupiter Rising: A decade of Clos topologies and centralized control in Google's datacenter networks
Credits
Authors of "Jupiter Rising" [SIGCOMM 2015]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat
And several teams at Google: Platforms Networking Hardware and Software Development, Platforms SQA, Mechanical Engineering, Cluster Engineering, NetOps, Global Infrastructure Group (GIG), and SRE.
Grand challenge for datacenter networks
• Tens of thousands of servers interconnected in clusters
• Islands of bandwidth were a key bottleneck for Google a decade ago
  ■ Engineers struggled to optimize for bandwidth locality
  ■ Stranded compute/memory resources
  ■ Hindered application scaling
[Figure: bandwidth hierarchy in a datacenter — 1 Gbps/machine within a rack, 100 Mbps/machine within a small cluster, 1 Mbps/machine across the datacenter]
Grand challenge for datacenter networks
• Challenge: flat bandwidth profile across all servers
  ■ Simplify job scheduling (remove locality)
  ■ Save significant resources via better bin-packing
  ■ Allow application scaling
[Figure: X Gbps/machine flat bandwidth across the datacenter]
Motivation
• Traditional network architectures
  ■ Cost prohibitive
  ■ Could not keep up with our bandwidth demands
  ■ Operational complexity of "box-centric" deployment
• Opportunity: a datacenter is a single administrative domain
  ■ One organization designs, deploys, controls, and operates the network
  ■ . . . and often also the servers
Three pillars that guided us
• Merchant silicon: general-purpose, commodity-priced, off-the-shelf switching components
• Clos topologies: accommodate low-radix switch chips to scale nearly arbitrarily by adding stages
• Centralized control / management
SDN: The early days
• Control options
  ■ Protocols: OSPF, IS-IS, BGP, etc.; box-centric config/management
  ■ Build our own
• Reasons we chose to build our own central control/management:
  ■ Limited support for multipath forwarding
  ■ No robust open-source stacks
  ■ Broadcast protocol scalability a concern at scale
  ■ Network manageability painful with individual switch configs
Challenges faced in building our own solution
• Topology and deployment
  ■ Introducing our network to production
  ■ Unmanageably high number of cables/fiber
  ■ Cluster-external burst bandwidth demand
• Control and management
  ■ Operating at huge scale
  ■ Routing scalability / routing with massive multipath
  ■ Interop with external vendor gear
• Performance and reliability
  ■ Small on-chip buffers
  ■ High availability from cheap/less reliable components
Outline
• Motivation
• Network evolution
• Centralized control / management
• Experience
[Chart: cluster bisection bandwidth (bps, log scale, 1 T–1000 T) vs. year, 2004–2013]
2004 state of the art: 4-Post cluster network
[Diagram: four cluster routers; each ToR switch has 2 x 10 G uplinks and 1 G links to its servers; server racks 1 through 512]
+ Standard network configuration
- Scales to 2 Tbps (limited by the biggest router)
- Scale up: forklift the cluster when upgrading routers
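A back-of-the-envelope check shows why this design produces the bandwidth islands from the earlier slide. The 512-rack and 2 Tbps figures come from this slide; the 40-servers-per-rack density is an assumption for illustration:

```python
# Back-of-the-envelope for the 4-Post design. Rack count and the 2 Tbps
# fabric cap are from the slide; 40 servers/rack is an assumption.
racks = 512
servers_per_rack = 40                  # assumed density
fabric_gbps = 2_000                    # ~2 Tbps, limited by the biggest router
in_rack_gbps = 1.0                     # 1 G NICs run at full rate inside a rack

cross_fabric_gbps = fabric_gbps / (racks * servers_per_rack)
print(f"within rack:   {in_rack_gbps:.1f} Gbps/server")
print(f"across fabric: {cross_fabric_gbps:.3f} Gbps/server")  # ~0.1 Gbps
```

Under these assumptions a server gets 1 Gbps to its rack neighbors but only ~100 Mbps to an arbitrary machine in the datacenter, matching the hierarchy shown earlier.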
DCN bandwidth growth demanded much more
Five generations of Clos for Google scale
[Diagram: spine blocks 1 … M above edge aggregation blocks 1 … N, connecting down to server racks with ToR switches]
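The scaling argument behind this picture can be made concrete with the textbook 3-stage folded Clos (fat-tree). This is a minimal sketch for identical radix-k chips, not the exact block structure of Google's fabrics, which use more elaborate aggregation and spine blocks:

```python
# Counts for a classic 3-stage folded Clos (fat-tree) built from identical
# radix-k switch chips. Illustrative only; the scaling idea (add stages,
# not bigger boxes) is the same one the slide describes.
def fat_tree(k: int) -> dict:
    assert k % 2 == 0, "radix must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 host-facing per pod
        "aggregation_switches": k * (k // 2),  # k/2 spine-facing per pod
        "spine_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,                  # at full bisection bandwidth
    }

print(fat_tree(24)["hosts"])   # 3,456 hosts from humble 24-port chips
print(fat_tree(64)["hosts"])   # 65,536 hosts; scale grows as k^3 / 4
```

Host count grows cubically in chip radix, which is why low-radix merchant silicon can still reach building scale.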
[Chart: bisection bandwidth vs. year — 4-Post plotted at ~2004]
Firehose 1.0
[Chart: bisection bandwidth vs. year, Firehose 1.0 added]
+ Scales to 10 T (10 K servers @ 1 G)
- Issues with servers housing switch cards
- Was not deployed in production
[Chart: bisection bandwidth vs. year — 4-Post, Firehose 1.0]
Firehose 1.1
[Chart: bisection bandwidth vs. year, Firehose 1.1 added]
+ Chassis-based solution (but no backplane)
- Bulky CX4 copper cables restrict scale
Challenges faced in building our own solution (recap; next up: introducing our network to production)
Firehose 1.1
+ In production as a "bag-on-side" (deployed alongside the existing cluster network)
+ Central control and management
[Chart: bisection bandwidth vs. year — 4-Post, Firehose 1.0, Firehose 1.1]
Watchtower
[Chart: bisection bandwidth vs. year, Watchtower added]
+ Chassis with backplane
+ Fiber (10 G) in all stages
+ Scales to 82 Tbps fabric
+ Global deployment
Challenges faced in building our own solution (recap; next up: the unmanageably high number of cables/fiber)
Watchtower
+ Cable bundling saves 40% TCO
+ 10x reduction in fiber runs to deploy
Challenges faced in building our own solution (recap; next up: cluster-external burst bandwidth demand)
Watchtower
+ Connect externally via border routers
+ Massive external burst bandwidth
+ Enables cross-cluster MapReduce
+ Retire cluster routers completely
[Chart: bisection bandwidth vs. year — 4-Post, Firehose 1.0, Firehose 1.1, Watchtower]
Saturn
[Chart: bisection bandwidth vs. year, Saturn added]
+ 288 x 10 G port chassis
+ Enables 10 G to hosts
+ Scales to 207 Tbps fabric
+ Reused in the WAN (B4)
[Chart: bisection bandwidth vs. year — 4-Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
Jupiter topology
+ Scales out building-wide to 1.3 Pbps
Jupiter racks
+ Enables 40 G to hosts
+ External control servers
+ OpenFlow
[Chart: bisection bandwidth vs. year — Jupiter reaches 1.3 Pbps by 2013]
Challenges faced in building our own solution (recap; next up: control and management at huge scale)
Network control and config
New conventional wisdom from engineering systems at scale:
• A logically centralized control plane beats full decentralization
• Centralized configuration and management dramatically simplifies system aspects
Network config/management
[Diagram: a network designer writes a spec (size of spine, base IP prefix, ToR rack indexes, . . .); fabric tools/scripts expand it into a fabric DB (bill of materials, rack config, port maps, cable bundles, CPN design, cluster config, . . .) that drives configuration of every switch and server in the datacenter network fabric]
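A minimal sketch of the spec-driven workflow on this slide: one declarative spec fans out into per-switch configs. The spec fields (size of spine, base IP prefix, ToR rack indexes) mirror the slide; the generated config shape, names, and /26-per-rack subnetting are hypothetical:

```python
# Hypothetical illustration of spec -> fabric DB expansion; not Google's
# tooling. Field names follow the slide, output layout is invented.
import ipaddress

def generate_fabric_db(spec: dict) -> dict:
    base = ipaddress.ip_network(spec["base_prefix"])
    rack_subnets = base.subnets(new_prefix=26)   # one /26 per rack (assumed)
    db = {}
    for rack, subnet in zip(spec["tor_rack_indexes"], rack_subnets):
        db[f"tor-{rack}"] = {
            "subnet": str(subnet),
            "uplinks": [f"spine-{s}" for s in range(spec["spine_size"])],
        }
    return db

spec = {"spine_size": 4, "base_prefix": "10.4.0.0/16", "tor_rack_indexes": [1, 2, 3]}
for switch, cfg in generate_fabric_db(spec).items():
    print(switch, cfg)
```

The design point is that operators edit one spec, never individual switches; every box's config is derived, so fabrics stay internally consistent by construction.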
Network config/management
[Diagram: a network operator drives management actions (e.g., install, push config, update software, drain, . . .) across switches; scalable infrastructure monitors, alerts, and collects/processes logs from switches, links, and servers (network and server monitoring, cluster config)]
Firepath route controller
[Diagram: the Firepath Master holds the link-state database; Firepath clients 1 … N on the switches exchange interface state updates with the master via the FMRP protocol over the Control Plane Network]
Firepath route controller
[Diagram: as on the previous slide, plus border routers that run a Firepath client alongside a BGP stack (clients 1 … M) and speak eBGP in-band with external BGP peers]
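As a rough sketch of the pattern these two slides describe (not Google's implementation): clients push interface state to the master, which maintains the link-state database and centrally computes all equal-cost next hops for the massively multipathed fabric:

```python
# Toy model of the Firepath pattern: clients report interface state over
# FMRP; the master rebuilds a link-state DB and computes ECMP next hops.
# Everything here is illustrative.
import collections

class FirepathMaster:
    def __init__(self):
        self.link_state = {}                     # (switch, peer) -> bool (up?)

    def on_interface_update(self, switch, peer, up):
        """Handle an interface-state update from a Firepath client."""
        self.link_state[(switch, peer)] = up

    def next_hops_toward(self, dst):
        """BFS over up links from dst; keep ALL shortest-path next hops."""
        graph = collections.defaultdict(set)
        for (a, b), up in self.link_state.items():
            if up:
                graph[a].add(b)
                graph[b].add(a)
        dist = {dst: 0}
        hops = collections.defaultdict(set)      # node -> next hops toward dst
        queue = collections.deque([dst])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                if dist[nbr] == dist[node] + 1:  # node is one hop closer to dst
                    hops[nbr].add(node)
        return hops

m = FirepathMaster()
for a, b in [("tor1", "agg1"), ("tor1", "agg2"), ("agg1", "sp1"), ("agg2", "sp1")]:
    m.on_interface_update(a, b, up=True)
print(m.next_hops_toward("sp1")["tor1"])         # {'agg1', 'agg2'}: ECMP
```

Centralizing this computation sidesteps the broadcast-scalability and multipath limitations of the box-by-box protocols listed earlier.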
Challenges faced in building our own solution (recap)
• Interop with external vendor gear → integrate a BGP stack on the border routers
Challenges faced in building our own solution (recap)
• Small on-chip buffers → tune switches (e.g., ECN) and hosts (DCTCP)
Challenges faced in building our own solution (recap)
• High availability from cheap/less reliable components → redundancy, diversity, and implementing only what was needed
Experience: Outages
Three broad categories of outages:
• Control software failures at scale
  ■ A cluster-wide reboot did not converge: the link-liveness protocol contended for CPU with the routing process
  ■ Cannot test at scale in a hardware lab → developed virtualized testbeds
• Aging hardware exposing corner cases
• Component misconfigurations
[Diagram: the churn feedback loop — large remote link churn and local link churn cause frequent topology updates; the routing client and link-liveness processing peg the embedded CPU, leading to missed heartbeats and further churn]
Grand challenge for datacenter networks
• Challenge: flat bandwidth profile across all servers
  ■ Simplify job scheduling (remove locality)
  ■ Save significant resources (better bin-packing)
  ■ Allow application scaling
• Scaled datacenter networks to Petabit scale in under a decade
• Bonus: reused the solution in campus aggregation and the WAN
[Figure: X Gbps/machine flat bandwidth across the datacenter]
Backup
What's different about datacenter networking
• Single administrative domain (homogeneity; protocol modification much easier)
• More plentiful and more uniform bandwidth (aggregate bandwidth of the entire Internet in one building)
• Tiny round-trip times
• Massive multipath
• Little buffering
• Latency/tail latency as important as bandwidth
• From client-server to large-scale, massively parallel computation
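"Massive multipath" is typically exploited with per-flow ECMP: hashing the 5-tuple keeps each flow on one path (avoiding reordering) while the flow population spreads across all equal-cost links. A minimal sketch; the CRC32 hash is illustrative, not what switch ASICs use:

```python
# Per-flow ECMP sketch: a deterministic hash of the 5-tuple picks one of
# the equal-cost uplinks. Hash choice here is just for illustration.
import zlib

def ecmp_pick(src_ip, dst_ip, proto, sport, dport, uplinks):
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]
# Same flow always takes the same uplink; different flows spread out.
print(ecmp_pick("10.0.1.5", "10.0.9.7", 6, 51234, 80, uplinks))
print(ecmp_pick("10.0.1.5", "10.0.9.7", 6, 51235, 80, uplinks))
```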
Watchtower
[Chart: bisection bandwidth vs. year]
+ Depop (depopulated) deployments
+ Cable bundling
+ Connect externally via border routers
+ Reuse in inter-cluster networking
Job mix on an example cluster
No locality within blocks
Firepath route controller
[Diagram: the Firepath Master speaks the Firepath protocol to clients. On a non-CBR fabric switch, the Firepath client receives CONFIG, reports port status, and applies route updates through the embedded stack (kernel and device drivers). On a CBR (cluster border router) switch, the client additionally works with a BGP process and RIB, redistributing intra-/inter-cluster routes, with eBGP packets flowing through the embedded stack]
Experience: Congestion
Packet loss was initially > 1%. Mitigations:
• QoS
• Switch buffer tuning
• Upgradable bandwidth
• ECMP configuration tuning
• ECN + DCTCP
• Congestion window bound
Packet loss reduced to < 0.01%
[Chart: drops in an example Saturn cluster]
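The ECN + DCTCP item works by cutting the sender's window in proportion to the fraction of ECN-marked packets, rather than halving it as standard TCP does, which keeps queues short without collapsing throughput on shallow-buffered merchant silicon. A minimal sketch of that update rule; the gain g = 1/16 is the value suggested in the DCTCP paper, not Google's tuning:

```python
# DCTCP window update: alpha is an EWMA of the fraction of ECN-marked
# packets per RTT; the window is cut by alpha/2 instead of TCP's 1/2.
class DctcpSender:
    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd, self.g, self.alpha = cwnd, g, 0.0

    def on_rtt(self, acked, marked):
        """Call once per RTT with counts of ACKed and ECN-marked packets."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))  # gentle cut
        else:
            self.cwnd += 1.0                                        # additive increase

s = DctcpSender()
s.on_rtt(acked=10, marked=3)   # mild congestion: small, proportional cut
print(round(s.cwnd, 2), round(s.alpha, 3))
```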