How to make FPGA router for high-speed networks

『How to make FPGA router for high-speed networks』
Keio University SFC, Murai Lab
Takeshi Matsuya
macchan@sfc.wide.ad.jp

Open Router Competition
• INTEROP 2012 event at Makuhari Messe
  • 13 June 2012
• Theme
  • Improve OpenRouter, whether in software or in hardware.
• http://www.interop.jp/2012/orc/

WHAT'S FPGA?
Field Programmable Gate Array

• RAM: Address 24 bit, Data 8 bit, 16 MB
• FPGA 4-input LUT: Address 4 bit (inputs A, B, C, D), Data 1 bit, 16-bit memory

A LUT implements logic (AND, OR) as a memory lookup. Example: (A == 1 and B == 1) and (C == 1 or D == 1)

Addr  A3 (A)  A2 (B)  A1 (C)  A0 (D)  Data
0     0       0       0       0       0
1     0       0       0       1       0
2     0       0       1       0       0
3     0       0       1       1       0
4     0       1       0       0       0
5     0       1       0       1       0
6     0       1       1       0       0
7     0       1       1       1       0
8     1       0       0       0       0
9     1       0       0       1       0
A     1       0       1       0       0
B     1       0       1       1       0
C     1       1       0       0       0
D     1       1       0       1       1
E     1       1       1       0       1
F     1       1       1       1       1
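The LUT idea can be sketched in software. The following is a minimal model, not FPGA vendor code: the function names `lut4_build` and `lut4_read` are made up for illustration, and the bit order (A on address bit 3, D on bit 0) follows the truth table of this slide.

```c
#include <stdint.h>

/* Build the 16-bit LUT contents for (A && B) && (C || D).
 * Address bit 3 = A, bit 2 = B, bit 1 = C, bit 0 = D. */
uint16_t lut4_build(void)
{
    uint16_t init = 0;
    for (unsigned addr = 0; addr < 16; ++addr) {
        unsigned a = (addr >> 3) & 1u;
        unsigned b = (addr >> 2) & 1u;
        unsigned c = (addr >> 1) & 1u;
        unsigned d = addr & 1u;
        if ((a && b) && (c || d))
            init |= (uint16_t)(1u << addr);
    }
    return init;
}

/* Evaluating the logic is one memory read: the inputs form the address. */
unsigned lut4_read(uint16_t init, unsigned addr)
{
    return (init >> (addr & 0xFu)) & 1u;
}
```

Only addresses D, E and F hold a 1, so the whole function fits in 16 bits of memory.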

HI-SPEED NETWORK
• Throughput
  • Wire speed
• Low latency
• Frame/Packet forwarding
  • Switch/Router

Device (interface speed)              Throughput     Latency                      Note
InfiniBand Switch Chip (2 Gbps)       Wire speed     200 ns (SDR)                 http://en.wikipedia.org/wiki/InfiniBand
PCI Express Switch Chip (2.5 Gbps)    Wire speed     150 ns                       http://www.plxtech.com/products/expresslane/switches
L3 Switch (1 Gbps)                    Wire speed     6,000 ns (64 Byte Frame)     Gigabit Interface 64*8(ns) = 512(ns)
PC Router (1 Gbps)                    ≦ Wire speed   24,000 ns (64 Byte Frame)    Gigabit Interface 64*8(ns) = 512(ns)
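The 512 ns in the Note column is the time needed just to put a 64-byte frame on a gigabit link (one byte takes 8 ns). A small sketch of that arithmetic, with `serialization_ns` as an illustrative name:

```c
#include <stdint.h>

/* Time in nanoseconds to serialize `bytes` bytes onto a link that
 * carries `bits_per_sec` bits per second. */
uint64_t serialization_ns(uint64_t bytes, uint64_t bits_per_sec)
{
    return bytes * 8u * 1000000000ull / bits_per_sec;
}
```

For example, `serialization_ns(64, 1000000000ull)` gives 512, which is the floor that any store-and-forward device adds on top of its own processing latency.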

FPGA Interface for Ethernet
[Figure: Receive and Transmit paths between the PHY (PCS, PMA, PMD) and the user logic]

SPEED            BUS WIDTH @CLOCK
100M Ethernet    4 bit @25 MHz, 2 bit @50 MHz
Giga Ethernet    8 bit @125 MHz, 4 bit @250 MHz
10G Ethernet     64 bit @156 MHz, 32 bit @312 MHz
40G Ethernet     320 bit @125 MHz, 128 bit @312.5 MHz
100G Ethernet    320 bit @312.5 MHz

Implement Buffered Repeater
[Figure: Giga Port #0 (PHY: PCS, PMA, PMD) delivers rx_data (8 bit) to the user logic (repeater.v, clock 125 MHz), which drives tx_data (8 bit) to Giga Port #1]

    module repeater (
        input            clock,
        input            rx0_data_valid,  // 0: invalid 1: valid
        input      [7:0] rx0_data,
        output reg       tx1_enable,      // 0: disable 1: enable
        output reg [7:0] tx1_data
    );

    // Copy every valid received byte to the transmit side one clock later.
    always @(posedge clock) begin
        if (rx0_data_valid) begin
            tx1_enable <= 1;
            tx1_data   <= rx0_data;
        end else begin
            tx1_enable <= 0;
        end
    end

    endmodule

Routing
• Next hop IP address Lookup
  • FIB (Forwarding Information Base)
• Changing
  • Destination MAC address
  • TTL
• Recalculation
  • IP Header Checksum
• Filtering
  • Permit / Deny
• Priority Control
  • QoS
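The "Changing TTL" and "Recalculation" steps can be illustrated in software. This is a sketch of the standard IPv4 header checksum (one's-complement sum of 16-bit words), not the FPGA implementation from the talk; the function names are hypothetical and the header is assumed to be a plain byte array in network order.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over the header, folded to 16 bits and inverted.
 * The checksum field must be zero while a new value is computed. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement TTL and rewrite the checksum.
 * Returns 0 if the packet must be dropped (TTL expired), 1 otherwise. */
int ipv4_forward_update(uint8_t *hdr)
{
    size_t len = (size_t)(hdr[0] & 0x0Fu) * 4;  /* IHL counts 32-bit words */
    if (hdr[8] <= 1)
        return 0;
    hdr[8]--;                                   /* TTL is byte 8 */
    hdr[10] = 0;                                /* clear the old checksum */
    hdr[11] = 0;
    uint16_t sum = ipv4_checksum(hdr, len);
    hdr[10] = (uint8_t)(sum >> 8);
    hdr[11] = (uint8_t)(sum & 0xFFu);
    return 1;
}
```

Running `ipv4_checksum` over a header that already contains a valid checksum returns 0, which is a convenient receive-side check.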

IDEAL ROUTING LATENCY
[Figure: timeline of one frame on Giga Ethernet (8 ns/cycle @125 MHz) passing through PHY RX (PMD, PMA, PCS), receive-side frame processing, IP lookup, transmit-side frame processing with checksum calculation (Calc SUM), and PHY TX (PCS, PMA, PMD). Frame fields shown: MAC Dest, MAC Src, IP TTL, IP SUM, IP Src, IP Dest, IP OPT (0-40 Byte).]
• Time labels on the receive side: 336 ns, 336~656 ns
• IP LOOKUP
• Time labels on the transmit side: 64 ns, 336~384 ns
• PHY label: 180 ns
• Estimated total: 180 + 384 + 180 = 744 ns?

DESIGN
• IP Address Lookup
• Pipeline
• Priority Control

IP ADDR LOOKUP
• TCAM
• SRAM
  • Hash and sequential
  • Direct addressing
• Cache
[Figure: a RAM takes an address and returns data; a CAM takes data and returns an address]

CAM (Content Addressable Memory)
[Figure: the search data (ex. 0012F27C0AE0) is compared against every CAM word; the matchlines feed an encoder, whose output (02) goes through a decoder to address the SRAM]

Addr  CAM (48 bit × 4 words)  SRAM
0     0012E258C982            port 4
1     0023DFDFFAB6            port 12
2     0012F27C0AE0            port 7
3     001192230842            port 1

• Examples
  • MAC address table
  • ARP table
  • CPU cache controller, TLB
  • Database engines
※ http://www.pagiamtz.com/pubs/pagiamtzis-jssc2006.pdf
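As a rough software analogue of the CAM on this slide: hardware compares all words in parallel in one cycle, while the sketch below simply loops. `cam_search` and `CAM_WORDS` are illustrative names, and 48-bit MAC addresses are assumed to be stored in the low bits of a `uint64_t`.

```c
#include <stdint.h>

#define CAM_WORDS 4

/* Returns the index of the word equal to `key`, or -1 if nothing matches.
 * The returned index plays the role of the encoder output that addresses
 * the SRAM holding the result (here, the output port). */
int cam_search(const uint64_t cam[CAM_WORDS], uint64_t key)
{
    for (int i = 0; i < CAM_WORDS; ++i)
        if (cam[i] == key)
            return i;
    return -1;
}
```

With the four words from the slide, searching for 0012F27C0AE0 yields index 2.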

TCAM (Ternary CAM)
[Figure: the search data (ex. 133.27.4.130) is compared against every TCAM word; the matchlines feed a priority encoder, whose output (02) goes through a decoder to address the SRAM]

Addr  TCAM (32 bit × 4 words)   SRAM (Next hop)
0     0101???? (1.1.?.?)        203.178.5.23
1     080808?? (8.8.8.?)        203.178.1.5
2     851B04?? (133.27.4.?)     203.178.4.1
3     851B???? (133.27.?.?)     203.178.9.19

• Examples
  • Forwarding Information Base
  • Access Control List
  • QoS List
  • Intrusion Prevention System
※ http://www.pagiamtz.com/pubs/pagiamtzis-jssc2006.pdf
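A ternary match can be modelled as a value plus a mask, where a cleared mask bit stands for "?". The following sketch assumes that representation and assumes the priority encoder picks the lowest matching index; `tcam_entry` and `tcam_search` are hypothetical names.

```c
#include <stdint.h>

struct tcam_entry {
    uint32_t value;  /* bits to compare */
    uint32_t mask;   /* 1 = compare this bit, 0 = don't care ("?") */
};

/* Priority encoder: the lowest matching index wins, so more specific
 * prefixes must be stored before less specific ones. */
int tcam_search(const struct tcam_entry *t, int n, uint32_t key)
{
    for (int i = 0; i < n; ++i)
        if (((key ^ t[i].value) & t[i].mask) == 0)
            return i;
    return -1;
}
```

With the slide's entries, 133.27.4.130 matches both 851B04?? and 851B????, and the priority encoder returns index 2 because it comes first. This is how the ordering of entries implements longest-prefix match.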

Hash and Sequential

Direct addressing
• Basic idea (Subnet /32): IPv4 Addr[31:0] is used as a 32-bit address into a RAM (4 GB, Data 8 bit); the decoded data selects the Next hop held in SRAM
• Subnet /24: IPv4 Addr[31:9] is used as a 23-bit address into a RAM (8 MB, Data 4 bit), together with IPv4 Addr[8:8]
• Next hop (SRAM): 203.178.5.23, 203.178.1.5, 203.178.4.1, 203.178.9.19
• http://bgp.potaroo.net/as2.0/bgp-active.html

C code

    #include <stdint.h>

    struct {
        uint32_t ip;
        char subnet;
        char next_hop_id;
    } fib[1000000];
    int fib_max;                        /* index of the last valid entry in fib[] */
    uint32_t next_hop[256];             /* next-hop table indexed by next_hop_id */
    unsigned char mem_table[1u << 28];  /* indexed by ip >> 4: one entry per /28 block */

    /* Shorter prefixes are written first so that longer prefixes overwrite them. */
    void write_fib(void)
    {
        uint32_t ip, ip_start, ip_end;
        int subnet, i;

        for (subnet = 2; subnet <= 28; ++subnet)
            for (i = 0; i <= fib_max; ++i)
                if (fib[i].subnet == subnet) {
                    ip_start = fib[i].ip & ~((1u << (32 - subnet)) - 1);
                    ip_end   = fib[i].ip |  ((1u << (32 - subnet)) - 1);
                    for (ip = (ip_start >> 4); ip <= (ip_end >> 4); ++ip)
                        mem_table[ip] = fib[i].next_hop_id;
                }
    }

    /* The lookup itself is a single memory read. */
    uint32_t get_nexthop(uint32_t ip)
    {
        return next_hop[mem_table[ip >> 4]];
    }
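To make the overwrite order concrete, here is a smaller self-contained variant of the same direct-addressing idea. It is a sketch under assumptions that are not from the slide: one table slot per /24 block (16 MB), next-hop ID 0 meaning "no route", and the names `slot_table`, `table_insert` and `table_lookup`.

```c
#include <stdint.h>

#define GRANULARITY_BITS 24                      /* assumed: one slot per /24 */
#define SLOT_SHIFT (32 - GRANULARITY_BITS)

static uint8_t slot_table[1u << GRANULARITY_BITS];  /* 0 = no route */

/* Fill every slot covered by prefix/len, for len between 1 and 24.
 * Callers insert shorter prefixes first so longer ones overwrite them. */
void table_insert(uint32_t prefix, int len, uint8_t next_hop_id)
{
    uint32_t span  = 1u << (32 - len);           /* addresses covered */
    uint32_t start = prefix & ~(span - 1u);
    uint32_t end   = start | (span - 1u);
    for (uint32_t slot = start >> SLOT_SHIFT; slot <= (end >> SLOT_SHIFT); ++slot)
        slot_table[slot] = next_hop_id;
}

/* Lookup is one indexed read, independent of the number of routes. */
uint8_t table_lookup(uint32_t ip)
{
    return slot_table[ip >> SLOT_SHIFT];
}
```

Inserting 133.27.0.0/16 with ID 1 and then 133.27.4.0/24 with ID 2 makes 133.27.4.130 resolve to ID 2 while 133.27.9.1 still resolves to ID 1. The trade-off is memory: lookup cost stays constant, but the table size is fixed by the chosen granularity rather than by the number of routes.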

Verilog code

    always @(posedge sys_clk) begin
        case (frame_counter)
            // Destination MAC address, one byte per clock
            12'd00: eth_dest[47:40] <= dout[7:0];
            12'd01: eth_dest[39:32] <= dout[7:0];
            12'd02: eth_dest[31:24] <= dout[7:0];
            12'd03: eth_dest[23:16] <= dout[7:0];
            // EtherType
            12'd12: eth_type[15:8] <= dout[7:0];
            12'd13: eth_type[7:0]  <= dout[7:0];
            // IPv4 destination address
            12'd30: ipv4_dest_ip[31:24] <= dout[7:0];
            12'd31: ipv4_dest_ip[23:16] <= dout[7:0];
            12'd32: ipv4_dest_ip[15:8]  <= dout[7:0];
            12'd33: begin
                ipv4_dest_ip[7:0] <= dout[7:0];
                // The last address byte has arrived: start the lookup at once.
                if (forward_router == 1'b1 && bridge_mode == 1'b0) begin
                    ip4lookup_req <= 1'b1;
                    search_ip4    <= {ipv4_dest_ip[31:8], dout[7:0]};
                end
            end
        endcase
    end
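The byte-counter style of parsing translates directly to software. The sketch below mirrors the offsets used on this slide (EtherType at bytes 12 and 13, IPv4 destination address at bytes 30 to 33, counted from the first byte of the destination MAC); the struct and function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct parsed {
    uint16_t eth_type;
    uint32_t ipv4_dest_ip;
};

/* Feed one frame byte per call, like one clock cycle of the case statement.
 * Returns 1 on the byte that completes the destination IP, else 0. */
int parse_byte(struct parsed *p, size_t counter, uint8_t b)
{
    switch (counter) {
    case 12: p->eth_type      = (uint16_t)(b << 8);  break;
    case 13: p->eth_type     |= b;                   break;
    case 30: p->ipv4_dest_ip  = (uint32_t)b << 24;   break;
    case 31: p->ipv4_dest_ip |= (uint32_t)b << 16;   break;
    case 32: p->ipv4_dest_ip |= (uint32_t)b << 8;    break;
    case 33: p->ipv4_dest_ip |= b;                   return 1;
    default: break;
    }
    return 0;
}
```

The return value corresponds to raising the lookup request: the search can start as soon as byte 33 arrives, long before the rest of the frame has been received.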

PIPELINE (1/2)
[Figure: pipeline timing diagram. Stages 00 to 08 are listed across the top; each row starts again from 00, so that every stage handles the data item that the previous stage handled one clock earlier.]

PIPELINE (2/2)
[Figure: frame byte position held by each pipeline stage over nine consecutive clocks]

Stage 00 (Read)    29 30 31 32 33 34 35 36 37
Stage 01           28 29 30 31 32 33 34 35 36
Stage 02           27 28 29 30 31 32 33 34 35
Stage 03           26 27 28 29 30 31 32 33 34
Stage 04           25 26 27 28 29 30 31 32 33
...
Stage 35           00 00 01 02
Stage 36           00 00 01
Stage 37 (Write)   00 00 00

Annotations: Receive Destination IP address, Request IP addr lookup, Reply IP addr lookup, Transmit packet

Verilog code

    always @(posedge sys_clk) begin
        // Delay line: each stage holds the previous stage's data one clock later.
        dout01 <= {rd_valid, rx_fifo};
        dout02 <= dout01;
        dout03 <= dout02;
        dout04 <= dout03;
        dout05 <= dout04;
        dout06 <= dout05;
        // ...
        dout39 <= dout38;
        dout40 <= dout39;
        dout41 <= dout40;
        dout42 <= dout41;

        // Receive side: parse header fields as the bytes arrive.
        if (rd_valid == 1'b1) begin
            counter00 <= counter00 + 16'd1;
            case (counter00)
                16'd00: ethr_dest[47:40] <= rx_fifo[7:0];
                16'd01: ethr_dest[39:32] <= rx_fifo[7:0];
            endcase
        end else
            counter00 <= 16'd0;

        // Transmit side: rewrite header fields as the delayed bytes leave.
        if (dout42[8] == 1'b1) begin
            counter42 <= counter42 + 16'd1;
            case (counter42)
                16'd00: begin
                    tx_fifo      <= {1'b1, etht_dest[47:40]};
                    forward_port <= 16'b0000001000;
                end
                16'd01: tx_fifo <= {1'b1, etht_dest[39:32]};
                16'd02: tx_fifo <= {1'b1, etht_dest[31:24]};
            endcase
        end else
            counter42 <= 16'd0;
    end
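The chain of `dout01` to `dout42` registers is a fixed delay line: the frame keeps moving while the lookup runs, and the first byte reaches the transmit side only after the new destination MAC is known. A minimal software model of that behaviour, with `delay_line` and `delay_step` as illustrative names:

```c
#include <stdint.h>

#define STAGES 42

struct delay_line {
    uint16_t dout[STAGES];   /* dout[0] models dout01, dout[41] models dout42 */
};

/* One clock edge: shift every stage and load the new input.
 * Returns the value now visible at the last stage. */
uint16_t delay_step(struct delay_line *d, uint16_t in)
{
    for (int i = STAGES - 1; i > 0; --i)
        d->dout[i] = d->dout[i - 1];
    d->dout[0] = in;
    return d->dout[STAGES - 1];
}
```

The first input appears at the output on the 42nd call, which at 8 ns per clock gives the receive-side logic 42 cycles to finish parsing and lookup before the transmit-side counter starts.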

PRIORITY CONTROL 1
[Figure: frames received on Port#1 (RX) and Port#2 (RX) pass through FIFOs and are forwarded (FORWARD) to Port#3 (TX). Each frame is drawn as a sequence of fields (MAC Dest, ..., IP TOS, ..., IP SUM, IP DST); one frame carries IP TOS (0) and the other IP TOS (7).]

PRIORITY CONTROL 2
[Figure: the same arrangement one step later. Frames from Port#1 (RX) and Port#2 (RX) pass through FIFOs and are forwarded (FORWARD) to Port#3 (TX); the fields shown are MAC Dest, IP TOS (0), IP TOS (7), IP SUM and IP DST.]

PIPELINE (2/2)
[Figure: the stage table of the earlier PIPELINE (2/2) slide, from Stage 00 (Read) to Stage 37 (Write), repeated with one new annotation]

Annotation: Adding frame TOS

PRIORITY CONTROL 3
[Figure: frames received on Port#1 (RX) and Port#2 (RX) pass through FIFOs and are forwarded (FORWARD) to Port#3 (TX). The frames are now labelled FRAME TOS (0) and FRAME TOS (7) in addition to the fields MAC Dest, ..., IP SUM, IP DST.]

PRIORITY CONTROL 4
[Figure: the same arrangement one step later. Frames from Port#1 (RX) and Port#2 (RX) pass through FIFOs and are forwarded (FORWARD) to Port#3 (TX); the fields shown are FRAME TOS (0), MAC Dest, IP SUM and IP DST.]

IMPLEMENTATION OF FIBNIC

The pps problem

         Frame size 64                                              Frame size 1518
Speed    packets per second    time/packet   CPU clock/pkt ※1      pps
100M     148,809 pps           6,905 ns      27,600 clock           8,127 pps
Giga     1,488,095 pps         690 ns        2,760 clock            81,274 pps
10G      14,880,950 pps        69 ns         276 clock              812,743 pps
40G      59,523,800 pps        17 ns         68 clock               3,250,972 pps
100G     148,809,500 pps       7 ns          28 clock               8,127,433 pps

※1 @4 GHz
※ For one port and one-way traffic.
The processing load can be reduced by adding CPU cores or by also using a GPU.
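The per-packet budget can be derived from the frame size. The sketch below assumes the usual Ethernet wire overhead of 20 bytes per frame (8 bytes of preamble and SFD plus a 12-byte minimum inter-frame gap); the function names are illustrative. Exact integer arithmetic gives 672 ns and 2,688 cycles for a 64-byte frame on a gigabit link at 4 GHz, slightly below the table's 690 ns and 2,760 clocks, which appear to be rounded.

```c
#include <stdint.h>

#define WIRE_OVERHEAD_BYTES 20u  /* 8 preamble + SFD, 12 minimum inter-frame gap */

/* Time slot one frame occupies on the wire, in nanoseconds. */
uint64_t packet_time_ns(uint64_t link_bps, uint64_t frame_bytes)
{
    return (frame_bytes + WIRE_OVERHEAD_BYTES) * 8u * 1000000000ull / link_bps;
}

/* CPU cycles available per packet when forwarding at wire speed. */
uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t link_bps, uint64_t frame_bytes)
{
    return cpu_hz * (frame_bytes + WIRE_OVERHEAD_BYTES) * 8u / link_bps;
}
```

At 100 Gbps the same calculation leaves fewer than 30 cycles per minimum-size packet, which is the motivation for moving the forwarding path into hardware.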

The bus and memory bandwidth problem
[Figure: bar chart of bandwidth in GB/s (axis 0, 10, 20), with separate bars for transmit and receive]
• DDR3-1600: 12.8 GB/s
• PCI-Express G2 ×8: 4 GB/s ※1
• 10G Ethernet: 1.25 GB/s
• 40G Ethernet: 5 GB/s
• 100G Ethernet: 12.5 GB/s
※1 One-way traffic

Demo
• Demo 1: http://web.sfc.wide.ad.jp/~macchan/fibnic_demo1.mov
• Demo 2 (405,000 routes): http://web.sfc.wide.ad.jp/~macchan/fibnic_demo2.mov

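IPv4 performance evaluation (reference values)

• NIC + PC router, i5 2.8 GHz (Linux 3.3.7)
  • Conditions: Giga × 2 ports, 2 IPv4 routes
  • 64-byte frame: latency 24,144 ns; inter-frame gap 270 byte, 365,497 pps
  • 1518-byte frame: latency 101,152 ns; inter-frame gap 580 byte, 59,355 pps
• FIBNIC + PC router, Celeron 2.53 GHz (Linux 3.3.7)
  • Conditions: Giga × 4 ports, 405,452 IPv4 routes
  • 64-byte frame: latency 976 ns; inter-frame gap 12 byte, 1,488,095 pps
  • 1518-byte frame: latency 976 ns; inter-frame gap 12 byte, 81,274 pps
• L3 switch from a certain vendor
  • Conditions: Giga × 4 ports, 4 IPv4 routes
  • 64-byte frame: latency 6,600 ns; inter-frame gap 13 byte, 1,470,588 pps
  • 1518-byte frame: latency 20,608 ns; inter-frame gap 13 byte, 81,222 pps

※1 Inter-frame gap / pps: the minimum inter-frame gap at which no packets are lost (the Ethernet standard minimum is 12).
※2 Same as the theoretical value.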
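The pps figures follow from the measured inter-frame gap. As a hedged check, assuming each frame is preceded by 8 bytes of preamble and SFD and followed by the stated gap on a 1 Gbps link (`pps_with_gap` is an illustrative name):

```c
#include <stdint.h>

/* Packets per second on a link when every frame is followed by
 * `gap_bytes` of idle time; 8 bytes cover preamble and SFD. */
uint64_t pps_with_gap(uint64_t link_bps, uint64_t frame_bytes, uint64_t gap_bytes)
{
    return link_bps / ((frame_bytes + 8u + gap_bytes) * 8u);
}
```

For 64-byte frames this gives 1,488,095 pps at a 12-byte gap, 1,470,588 pps at 13 bytes, and 365,497 pps at 270 bytes, matching the evaluation figures.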

FIBNIC latency breakdown
IPv4 976 ns (IPv6 1,064 ns)

Component                               Latency (IPv4)   Cycles @125 MHz
FIBNIC circuit                          280 ns           35 cycle
Frame up to Dest IP (8+34 bytes)        336 ns           42 cycle
RGMII/GMII conversion circuit           40 ns            5 cycle
PHY chip                                320 ns           40 cycle

• FIBNIC circuit
  • IP Lookup, Forwarding, Filtering, Priority processing
• Frame up to Dest IP: 8+34 bytes
  • The part of the frame that must be received before the Dest IP needed for IP routing is available (the 8 is the preamble + Start Frame Delimiter)
• RGMII/GMII conversion circuit
  • Conversion between 4 bit @250 MHz and 8 bit @125 MHz
• PHY chip
  • Converts the signal on the cable into digital signals that an ASIC or FPGA can connect to
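The 976 ns total is the sum of the four components, each a whole number of 8 ns cycles at 125 MHz. A small sketch of that bookkeeping (the function name is made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a list of per-component cycle counts and convert to nanoseconds. */
uint32_t total_latency_ns(const uint32_t *cycles, size_t n, uint32_t ns_per_cycle)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += cycles[i];
    return total * ns_per_cycle;
}
```

With the cycle counts 35, 42, 5 and 40 and 8 ns per cycle, the result is 122 cycles, or 976 ns.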

Open source network tester
• Magukara
  • FPGA-based open source hardware
  • Lattice ECP3 Versa Development Kit ($299!!)
  • 1000Base-T, IPv4/v6 support
  • URL: https://github.com/Murailab-arch/magukara

Workshop introduction

[Figure: FIBNIC block diagram for four ports (legend: Instance Module, FIFO/Memory)]
• Receive (per port, shown for port 0): PHY#0 RX → RX0 GMII2FIFO9 ※1 → RX0_PHYQ (ASFIFO9) → RX0 ROUTER ※2. PHY#1 RX, PHY#2 RX and PHY#3 RX have the same path.
• Lookup: FIB, LOOKUPFIB ※5, ARP/NDP (ARPNDP)
• Host side: NIC ※6 (NIC0) with TX-BUF and RX-BUF (DualPort RAM) on the NIC BUS, and the PCI/PCIe CONTROLLER ※7. Queues RX0-NC0 (SFIFO9) and RX0-ARP0 (SFIFO9) lead from the router to the NIC and to ARP/NDP.
• Transmission (per port): the queues RX0-TXn, RX1-TXn, RX2-TXn, RX3-TXn and NCn-TXn (SFIFO) feed TXn MIXER ※3 → TXn_MIXQ (SFIFO9) → TXn_PHYQ (ASFIFO9) → TXn FIFO9TOGMII ※4 → PHY#n TX, for n = 0 to 3.

※1 Clock conversion between the PHY and the system, and removal of PREAMBLE, SFD, etc.
※2 Decides the forwarding destination; on forwarding, replaces the Dest MAC, decrements the TTL and recalculates the IP SUM; forwarding with drop; CRC removal
※3 Priority mix queuing by ToS, and drop
※4 Clock conversion between the system and the PHY, and appending of PREAMBLE, SFD and CRC
※5 Looks up the FIB or ARP and decides the Dest MAC and the destination PORT
※6 Basic operation as a NIC
※7 Basic operation as PCI

Conditions for forwarding to the NIC: HDL description (example)

    wire frame_type_ipv4 = (eth_type == 16'h0800);
    wire frame_type_ipv6 = (eth_type == 16'h86dd);
    wire forward_nic = (frame_type_ipv4 && (ipv4_dest_ip == interface_ipv4_addr || ipv4_dest_ip == 32'h0)) || eth_dest_addr[40] == 1'b1 || ...;