Saturating the Transceiver Bandwidth: Switch Fabric Design on FPGAs
FPGA 2012
Zefu Dai, Jianwen Zhu
University of Toronto

Switches Are Important
Cisco Catalyst 6500 Series switches, since 1999:
- Total sales: $42 billion
- Total systems: 700,000
- Total ports: 110 million
- Total customers: 25,000
- Average price per system: $60K
- Average price per port: $381
Commodity 48-port 1 GigE switch: $52/port
Sources:
- http://www.cisco.com/en/US/prod/collateral/modules/ps2797/ps11878/at_a_glance_c45-652087.pdf
- http://www.albawaba.com/cisco-readies-catalyst-6500-tackle-next-decades-networking-challenges-383275

Inside the Box
Function blocks: virtualization, admin. control, configuration, firewall, L2 bridge, security, QoS, traffic management, routing
[Figure labels: line card, optical fiber, backplane, SERDES, FPGAs; rates shown: 10 Gb/s, 40 Gb/s, 3.125 Gb/s, 6.6 Gb/s, 2 Tb/s]
Critical: connects to everyone

Go Beyond the Line Card
XC7VH870T (single chip):
- 72 GTX: 13.1 Gb/s
- 16 GTZ: 28.05 Gb/s
- Raw total: 1.4 Tb/s
Next-generation line rate: 160 Gb/s
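
The raw total on this slide can be checked with a few lines of arithmetic. The sketch below only multiplies the lane counts by the per-lane rates quoted on the slide; the dictionary layout and function name are illustrative.

```python
# Transceiver counts and per-lane rates as listed on the slide.
lanes = {
    "GTX": (72, 13.1),    # (lane count, Gb/s per lane)
    "GTZ": (16, 28.05),
}

def raw_total_gbps(lanes):
    """Sum count * rate over all transceiver types."""
    return sum(count * rate for count, rate in lanes.values())

total = raw_total_gbps(lanes)
print(f"raw total: {total:.1f} Gb/s = {total / 1000:.2f} Tb/s")
```

The sum is about 1392 Gb/s, which the slide rounds to 1.4 Tb/s.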

Can We Do Switch Fabric?
- Is a single-chip switch fabric possible on an FPGA that saturates the transceiver BW?
- Is it possible to rival ASIC performance?
FPGA vs. ASIC gap (Prof. Jonathan Rose, FPGA'06): area 40x, speed 3x, power 12x

What Is a Switch Fabric?
- Two major tasks:
  - Provide data path connection (crossbar)
  - Resolve congestion (buffers)
- Little computation involved
[Figure: input buffers, crossbar, output buffers; ports 1 … N, line rate R]

Switch Fabric Architectures
- Buffer location marks the difference:
  - Output Queued (OQ) switch: buffered at the output
  - Input Queued (IQ) switch: buffered at the input
  - Crosspoint Queued (CQ) switch: buffered at the crossbar (XB)
[Figure: input buffers, crossbar, output buffers; ports 1 … N]

Ideal Switch
- OQ switch:
  - Best performance
  - Simplest crossbar (distributed multiplexers)
  - 2NR memory bandwidth requirement
[Figure: crossbar with output buffers labeled (N+1)R; centralized shared memory (CSM) labeled 2NR; ports 1 … N]

Lowest Memory Requirement Switch
- IQ switch:
  - 2R memory bandwidth
  - Low performance: 58% throughput cap due to head-of-line (HOL) blocking
  - Scheduling is a maximum bipartite matching problem
[Figure: input buffers labeled 2R feeding the crossbar; ports 1 … N]
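
The 58% cap can be reproduced with a small Monte Carlo experiment. The sketch below assumes the textbook setting behind that number: FIFO input queues that are always backlogged, uniformly random destinations, and each output serving one randomly chosen contender per time slot. The function name, parameters, and seed are illustrative and not taken from the talk. For large N the measured value should approach 2 - sqrt(2), about 0.586.

```python
import random

def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate per-port throughput of a saturated FIFO input-queued switch."""
    rng = random.Random(seed)
    # Under saturation every input always has a head-of-line (HOL) cell;
    # hol[i] is the output port requested by input i's HOL cell.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group the contending inputs by requested output.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves one contender chosen at random;
        # the losers stay blocked at the head of their FIFO.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            delivered += 1
    return delivered / (n_slots * n_ports)

throughput = hol_saturation_throughput(n_ports=32, n_slots=5000)
print(f"saturation throughput, N=32: {throughput:.3f}")
```

Cells behind a blocked head cannot be served even when their own output is idle, which is why the result stays well below 1.0.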

Most Popular Switch
- Combined Input and Output Queued (CIOQ) switch:
  - Internal speedup S: read (write) S cells from (to) memory per time slot
  - Emulates an OQ switch with S = 2
  - (S+1)R memory bandwidth
  - Complex scheduling
  - 1 <= S <= N
[Figure: crossbar with buffers labeled (S+1)R; ports 1 … N]

State-of-the-art Switch
- Combined Input and Crosspoint Queued (CICQ) switch:
  - High performance, close to an OQ switch
  - Simple scheduling
  - 2R memory bandwidth
  - N^2 buffers
- IBM Prizma switch (2004): 2 Tb/s
[Figure: crossbar with crosspoint buffers labeled 2R; ports 1 … N]
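
The memory bandwidth figures quoted across the architecture slides can be put side by side. This is a minimal sketch that only encodes the expressions on the slides (2NR for centralized shared memory, (N+1)R per output buffer, 2R for IQ and CICQ, (S+1)R for CIOQ); the architecture keys and the function name are made up for illustration.

```python
def memory_bandwidth(arch, n_ports, rate, speedup=2):
    """Memory bandwidth per buffer, using the expressions quoted on the slides."""
    if arch == "OQ-shared":      # centralized shared memory: N writes + N reads
        return 2 * n_ports * rate
    if arch == "OQ":             # one output buffer: N writes + 1 read
        return (n_ports + 1) * rate
    if arch == "IQ":             # one write + one read
        return 2 * rate
    if arch == "CIOQ":           # S transfers through the fabric + 1 on the line
        return (speedup + 1) * rate
    if arch == "CICQ":           # one write + one read per crosspoint buffer
        return 2 * rate
    raise ValueError(f"unknown architecture: {arch}")

# Example: N = 16 ports at R = 10 Gb/s, CIOQ with S = 2.
for arch in ("OQ-shared", "OQ", "IQ", "CIOQ", "CICQ"):
    print(f"{arch:10s} {memory_bandwidth(arch, 16, 10):4d} Gb/s")
```

CICQ matches IQ in per-buffer bandwidth but pays for it with N^2 separate buffers, which is the cost the following slides address.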

Switch Fabric on FPGA?
- FPGA resources: LUTs, wires, DSPs, …
- Idea: memory is the switch!
[Figure: FPGA with SERDES and SRAM blocks; labels: "?", "SAME technology!!", "Saturate the transceiver BW"]

Start Point
- CICQ switch: relies on a large number of small buffers
- Xilinx 18 kbit dual-port BRAM (500 MHz): 36 Gb/s
- R = 10 Gb/s, so each crosspoint buffer uses 2R = 20 Gb/s: waste of BW!
- N = 100 gives N^2 = 10k buffers: BRAMs run out!
[Figure: N x N grid of crosspoint buffers, each labeled 2R]
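
The two complaints on this slide (running out of BRAMs and wasting bandwidth) follow from simple arithmetic. The sketch below assumes one BRAM per crosspoint buffer, which is how the slide arrives at N^2 = 10k; the function name is illustrative.

```python
def cicq_bram_budget(n_ports, rate_gbps, bram_bw_gbps):
    """BRAM count and per-BRAM bandwidth utilization for a plain CICQ crossbar."""
    brams = n_ports ** 2                 # one BRAM per crosspoint buffer (assumed)
    used = 2 * rate_gbps                 # one write stream + one read stream at rate R
    utilization = used / bram_bw_gbps    # fraction of the BRAM bandwidth actually used
    return brams, utilization

brams, utilization = cicq_bram_budget(n_ports=100, rate_gbps=10, bram_bw_gbps=36)
print(f"BRAMs needed: {brams}, bandwidth used per BRAM: {utilization:.0%}")
```

With the slide's numbers, each BRAM uses 20 of its 36 Gb/s, so roughly 44% of the memory bandwidth sits idle.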

Memory Can be Shared
- Use x BRAMs to emulate y crosspoint buffers (x < y)
- Xilinx 18 kbit dual-port BRAM (500 MHz): 36 Gb/s
- Example: x = 5, y = 9, since 5 * 36 = 9 * 20
- Total: 5 * (N/3)^2 BRAMs
[Figure: N x N grid of crosspoint buffers at 2R, with a 3 x 3 group mapped onto shared BRAMs]
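
The x = 5, y = 9 example is a bandwidth-matching calculation. The sketch below generalizes it under two assumptions that are not on the slide: the BRAM count is rounded up when the bandwidths do not divide evenly, and the group side divides N. Function names are illustrative.

```python
import math

def brams_for_group(crosspoints, rate_gbps, bram_bw_gbps):
    """BRAMs whose aggregate bandwidth covers `crosspoints` buffers at 2R each."""
    return math.ceil(crosspoints * 2 * rate_gbps / bram_bw_gbps)

def total_brams(n_ports, group_side, rate_gbps, bram_bw_gbps):
    """BRAMs for the whole crossbar when crosspoints are grouped into squares."""
    groups = (n_ports // group_side) ** 2      # assumes group_side divides n_ports
    return groups * brams_for_group(group_side ** 2, rate_gbps, bram_bw_gbps)

x = brams_for_group(crosspoints=9, rate_gbps=10, bram_bw_gbps=36)
print(f"x = {x} BRAMs emulate y = 9 crosspoint buffers")
print(f"N = 99: {total_brams(99, 3, 10, 36)} BRAMs instead of {99 ** 2}")
```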

Memory is the Switch
- Logical view: a small OQ switch, with all queues in the same memory
- Physical view: the BRAM's address decoder and sense amps act as the crossbar
[Figure: 3 x 3 example, logical and physical views]

Memory Can be Borrowed
- Give a busy buffer more memory space
- Enable efficient multicast support
- Free address pool and pointer queues: implemented in BRAMs
[Figure: BRAMs shared by queues 1 to 3, with a free address pool and pointer queues]
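
One way to picture the free address pool and pointer queues is as a software model. The sketch below is not the authors' hardware design: the class name, the reference count used to recycle multicast cells, and the refusal to write when memory is full are all assumptions made for illustration.

```python
from collections import deque

class SharedBuffer:
    """Cell memory shared by several logical output queues (illustrative model)."""

    def __init__(self, n_cells, n_outputs):
        self.cells = [None] * n_cells                        # stands in for the BRAM data array
        self.free = deque(range(n_cells))                    # free address pool
        self.queues = [deque() for _ in range(n_outputs)]    # per-output pointer queues
        self.refs = [0] * n_cells                            # pending readers per cell (assumed)

    def write(self, data, outputs):
        """Store one cell once and enqueue its address for every destination."""
        if not outputs:
            raise ValueError("a cell needs at least one destination")
        if not self.free:
            return False                                     # memory full, caller must hold the cell
        addr = self.free.popleft()
        self.cells[addr] = data
        self.refs[addr] = len(outputs)
        for out in outputs:
            self.queues[out].append(addr)
        return True

    def read(self, out):
        """Dequeue the next cell for `out`; recycle the address after the last reader."""
        if not self.queues[out]:
            return None
        addr = self.queues[out].popleft()
        data = self.cells[addr]
        self.refs[addr] -= 1
        if self.refs[addr] == 0:
            self.free.append(addr)
        return data

buf = SharedBuffer(n_cells=2, n_outputs=3)
buf.write("a", [0, 2])      # multicast: one stored copy, two pointers
buf.write("b", [1])
```

Because queues only hold addresses, a busy output can occupy as many cells as the pool allows, and a multicast cell is stored once regardless of its fan-out.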

Group Crosspoint Queued (GCQ) Switch
- Similar to Dally's HC switch, but:
  - Higher memory requirement, for simpler scheduling and better performance
  - Uses SRAMs as small switches (not just buffers)
  - Multicast support
- Memory Based Sub-switch (MBS):
  - Efficient utilization of BRAMs
  - Simple scheduler (time multiplexing)
[Figure: crossbar built from a grid of MBS blocks; ports 1 … N]

Resource Savings
- Each S x S sub-switch requires P BRAMs:
  - P * B = 2SR (R: line rate; B: bandwidth of one BRAM)
  - P/S = 2R/B (constant)
- Total BRAMs required for the entire crossbar:
  - P * (N/S)^2 = (2 N^2 R / B) / S = C/S (C is constant)
  - Savings: N^2 - C/S (the larger S, the greater the savings)
- Max S is limited by the minimum packet size:
  - Aggregate data width of the BRAMs <= minimum packet size
  - TCP packet (40 bytes): max S = 8
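
The savings formula can be tabulated for a few sub-switch sizes. In this sketch P is rounded up to a whole number of BRAMs and S is assumed to divide N; neither detail is stated on the slide. The port count N = 96 is an arbitrary example value, and the function names are illustrative.

```python
import math

def gcq_brams(n_ports, s, rate_gbps, bram_bw_gbps):
    """Return (P, total BRAMs) for a crossbar built from S x S sub-switches."""
    p = math.ceil(2 * s * rate_gbps / bram_bw_gbps)   # P * B >= 2 * S * R
    subswitches = (n_ports // s) ** 2                 # assumes S divides N
    return p, p * subswitches

def savings(n_ports, s, rate_gbps, bram_bw_gbps):
    """BRAMs saved relative to a CICQ crossbar with N^2 crosspoint buffers."""
    _, total = gcq_brams(n_ports, s, rate_gbps, bram_bw_gbps)
    return n_ports ** 2 - total

for s in (2, 4, 8):
    p, total = gcq_brams(96, s, 10, 36)
    print(f"S={s}: P={p}, total={total}, saved={savings(96, s, 10, 36)}")
```

The total shrinks roughly as 1/S, matching the C/S term on the slide, with small deviations caused by rounding P up.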

Hardware Implementation
Device          N   S  Data width  Registers    LUTs         BRAMs      Savings
Virtex-6 240T   16  4  256 bit     36945 (12%)  49537 (32%)  224 (27%)  288 (56%)
Spartan-6 150T  9   3  256 bit     27028 (14%)  37285 (40%)  95 (36%)   67 (41%)
- Saturates the transceiver BW
- Small S due to automatic P&R (BRAM frequency as low as 160 MHz)
- Clock domain crossing (CDC): source-synchronous CDC technique

Performance Evaluation
- Experimental setup:
  - Booksim: network simulator by Dally's group
  - Tested switches: IQ, OQ, CICQ, and GCQ configurations with different S
- Simulation parameters:
  - Flit delay: 2
  - Credit delay: 2
  - N: 16
  - Flit size: 32 bytes
  - Packet size: 1 flit, 16 flits
  - Traffic: uniform
  - Network topology: fat tree
  - Routing: nearest common ancestor

Maximum Throughput Test
- Keep a minimum buffer in the crossbar; sweep the input buffer depth
- HOL blocking phenomenon

Buffer Memory Efficiency
- Keep a minimum buffer at the input; sweep the crossbar buffer size
[Plot curves: OQ, GCQ, CICQ]

Packet Latency Test
- Limit the total buffer size to 1k flits
[Plot curves: CICQ, GCQ, IQ, OQ]

Discussion: Why Can't ISE Meet Timing?
- Hierarchical compilation, place and route
  - The design is regular and symmetric

Discussion: Why Is BRAM Slow?
- CACTI prediction based on ITRS-LSTP data
- Xilinx BRAM (18 kbit) performance:
  - 45 nm to 28 nm, with little improvement in BRAM performance?
- Reference points:
  - Fulcrum FM4000, 130 nm, 2 MB: 720 MHz
  - Sun SPARC, 90 nm, 4 MB L2: 800 MHz
  - Intel Xeon, 65 nm, 16 MB L3: 850 MHz

Discussion: Extrapolation on Virtex-7
- Virtex-7 XC7VH870T:
  - Max BRAM frequency: 600 MHz
  - Total of 2820 18 kbit BRAMs
  - 1.4 Tb/s transceiver bandwidth
- CICQ requirement: > 10k BRAMs
- Proposed GCQ switch:
  - Assume 400 MHz BRAM frequency: 2R/B = 0.69
  - (1 - P/S^2) = 91%
  - 1014 18 kbit BRAMs (36%)
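
The numbers on this slide can be reproduced under a few assumptions that the slide does not state explicitly: a line rate of R = 10 Gb/s, BRAM bandwidth scaling linearly with clock frequency from the 36 Gb/s at 500 MHz quoted earlier, a sub-switch size of S = 8, and N = 104 ports. The port count is inferred here only because 6 * 13^2 = 1014 matches the slide's total; treat it as a guess.

```python
import math

R = 10.0                  # Gb/s line rate (assumed)
B = 36.0 * 400 / 500      # Gb/s per BRAM at 400 MHz (assumed linear scaling)
S = 8                     # sub-switch size (assumed, the max S from the earlier slide)
N = 104                   # port count (inferred so that the total matches the slide)
DEVICE_BRAMS = 2820       # 18 kbit BRAMs on the XC7VH870T, from the slide

ratio = 2 * R / B                         # P/S = 2R/B
P = math.ceil(ratio * S)                  # BRAMs per S x S sub-switch, rounded up
saving_per_subswitch = 1 - P / S ** 2     # versus S^2 crosspoint BRAMs
total = P * (N // S) ** 2                 # BRAMs for the whole crossbar

print(f"2R/B = {ratio:.2f}, P = {P}")
print(f"1 - P/S^2 = {saving_per_subswitch:.0%}")
print(f"total = {total} BRAMs ({total / DEVICE_BRAMS:.0%} of the device)")
print(f"CICQ would need N^2 = {N ** 2} BRAMs")
```

Under these assumptions the CICQ requirement of N^2 = 10816 BRAMs is also consistent with the "> 10k" figure on the slide.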

Questions?

Conclusions
- FPGAs can rival ASICs in switch fabric design
  - Resources remain for other functions
- Big room for improvement in the performance of FPGA on-chip SRAM
- FPGA CAD tools can do a better job

Roadmap for Data Link Speed
Nathan Binkert et al., ISCA 2011
Link characteristics
  Process (nm)              45    32    22
  Link speed (Gb/s)         80    160   320
  Max link length (m)       10
  In-flight data (bytes)    1107  2214  4428
Optical link parameters
  Data wavelengths          8     16    32
  Optical data rate (Gb/s)  10
Electrical link parameters
  SERDES speed (Gb/s)       10    20    32
  SERDES channels           8     8     10
Grows in channel count, NOT channel speed

Interlaken
[Figure: chip x strips a packet into 64-bit cells (C) and interleaves them across 10 Gb/s channels (CH) of a 40 Gb/s Interlaken link; chip y assembles the cells back into a packet]
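
The strip, interleave, and assemble steps in the figure can be modeled with round-robin striping. This is a simplified sketch of the idea only, not the Interlaken protocol itself (no framing, control words, or flow control); the function names and the choice of four channels are assumptions.

```python
CELL_BYTES = 8  # 64-bit cells, as on the slide

def strip(packet, n_channels):
    """Cut a packet into 64-bit cells and deal them round-robin over the channels."""
    cells = [packet[i:i + CELL_BYTES] for i in range(0, len(packet), CELL_BYTES)]
    channels = [[] for _ in range(n_channels)]
    for idx, cell in enumerate(cells):
        channels[idx % n_channels].append(cell)
    return channels

def assemble(channels):
    """Read the channels back in the same round-robin order and rebuild the packet."""
    out = []
    depth = max((len(ch) for ch in channels), default=0)
    for row in range(depth):
        for ch in channels:
            if row < len(ch):
                out.append(ch[row])
    return b"".join(out)

packet = bytes(range(40))          # a 40-byte packet becomes five 64-bit cells
channels = strip(packet, 4)        # four 10 Gb/s channels form a 40 Gb/s link
rebuilt = assemble(channels)
```

Because the receiver reads in the same round-robin order as the sender, the cells come back in their original sequence without per-cell sequence numbers.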

High Link Speed Processing
- Parallel processing
- Memory bounded
- Prefer a high-radix switch with thin ports over a low-radix switch with fat ports
[Figure: a 40 Gb/s link is stripped into 10 Gb/s streams, handled by parallel switches, and assembled again]

Latest Improvement Upon CICQ
- Hierarchical crossbar (Dally):
  - Requires 40% fewer buffers
  - Higher requirement on memory BW and on scheduling of sub-switches
- Cray BlackWidow (2006): 2 Tb/s
[Figure: hierarchical crossbar of CIOQ sub-switches; buffers labeled 2R and (S+1)R; ports 1 … N]

Can We Do Single-Chip Switch Fabric?
- FPGA vs. ASIC:
  - Comparable transceiver BW
  - Abundant on-chip SRAMs (same as ASICs)
[Figure: FPGA and ASIC, each with SERDES and SRAM blocks; labels: "SAME technology!!", "Saturate the transceiver BW", "Can't do better"]