Protocols Working with 10 Gigabit Ethernet
Richard Hughes-Jones
The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks"
ESLEA Closing Conference, Edinburgh, March 2007
- Introduction
- 10 GigE on SuperMicro X7DBE
- 10 GigE on SuperMicro X5DPE-G2
- 10 GigE and TCP – monitor with web100, disk writes
- 10 GigE and Constant Bit Rate transfers
- UDP + memory access
- GÉANT 4 Gigabit tests
Udpmon: Latency & Throughput Measurements
- UDP/IP packets sent between back-to-back systems
  - Similar processing to TCP/IP but no flow control & congestion avoidance algorithms
- Latency
  - Round-trip times measured using request-response UDP frames
  - Latency as a function of frame size:
    - Slope s given by the per-byte costs along the path: mem-mem copy(s) + PCI + Gigabit Ethernet + PCI + mem-mem copy(s)
    - Intercept indicates processing times + HW latencies
  - Histograms of 'singleton' measurements
  - Tells us about: behaviour of the IP stack, the way the HW operates, interrupt coalescence
- UDP Throughput
  - Send a controlled stream of UDP frames spaced at regular intervals
  - Vary the frame size and the frame transmit spacing & measure:
    - The time of first and last frames received
    - The number of packets received, lost & out of order
    - Histogram of inter-packet spacing of received packets
    - Packet loss pattern
    - 1-way delay
    - CPU load
    - Number of interrupts
  - Tells us about: behaviour of the IP stack, the way the HW operates, and the capacity & available throughput of the LAN / MAN / WAN
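The request-response latency idea can be sketched in a few lines of C. This is only an illustration of the technique, not the udpmon source; the echoing peer, port 5001 and the command-line handling are assumptions.

/* Minimal request-response UDP latency probe in the spirit of udpmon.
 * Sends a frame of a chosen size to an echoing peer and records the
 * round-trip time.  Hostname, port and frame size are illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "127.0.0.1";  /* echoing peer */
    int frame_size   = argc > 2 ? atoi(argv[2]) : 64;     /* bytes of UDP payload */
    int repeats      = 1000;

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, host, &peer.sin_addr);

    char *buf = calloc(1, frame_size);
    for (int i = 0; i < repeats; i++) {
        double t0 = now_us();
        sendto(sock, buf, frame_size, 0, (struct sockaddr *)&peer, sizeof(peer));
        recv(sock, buf, frame_size, 0);                    /* wait for the echoed frame */
        printf("%d %.2f\n", frame_size, now_us() - t0);    /* frame size, RTT in us */
    }
    free(buf);
    close(sock);
    return 0;
}

Plotting the mean RTT against frame size gives the slope and intercept described above; histogramming the individual values gives the 'singleton' distributions.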
Throughput Measurements
- UDP throughput with udpmon
- Send a controlled stream of UDP frames spaced at regular intervals
- Sequence of the test (sender and receiver exchange):
  1. Sender asks the receiver to zero its statistics; the receiver replies OK.
  2. Sender transmits a set number of n-byte data frames at regular intervals (a fixed wait time between packets), recording the time to send; the receiver records the time to receive and histograms the inter-packet arrival times.
  3. Sender signals the end of the test and requests the remote statistics.
  4. Receiver returns: number received, number lost + loss pattern, number out of order, CPU load & number of interrupts, 1-way delay.
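A paced sender of the kind described above can be sketched as follows. It is an illustrative stand-in, not udpmon itself: the destination address, port, packet count and spacing are made-up values.

/* Minimal paced UDP sender: transmits `count` frames of `frame_size` bytes
 * with a fixed inter-packet wait, busy-waiting on a monotonic clock so the
 * spacing is not limited by timer granularity. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const int frame_size = 8972;      /* UDP payload that fits a 9000-byte MTU */
    const double wait_us = 8.0;       /* inter-packet spacing */
    const long count     = 1000000;

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

    char *buf = calloc(1, frame_size);
    double next = now_us();
    for (long i = 0; i < count; i++) {
        memcpy(buf, &i, sizeof(i));   /* sequence number lets the receiver spot loss & reordering */
        sendto(sock, buf, frame_size, 0, (struct sockaddr *)&dst, sizeof(dst));
        next += wait_us;
        while (now_us() < next)       /* spin until the next transmit time */
            ;
    }
    free(buf);
    close(sock);
    return 0;
}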
High-end Server PCs for 10 Gigabit
- Boston/Supermicro X7DBE
- Two Dual-Core Intel Xeon Woodcrest 5130
  - 2 GHz
  - Independent 1.33 GHz front-side buses
- 530 MHz FB memory (serial)
  - Parallel access to 4 banks
- Chipsets: Intel 5000P MCH – PCIe & memory; ESB2 – PCI-X, GE etc.
- PCI
  - 3 × 8-lane PCIe buses
  - 3 × 133 MHz PCI-X
- 2 × Gigabit Ethernet
- SATA
10 GigE Back-to-Back: UDP Latency
- Motherboard: Supermicro X7DBE
- Chipset: Intel 5000P MCH
- CPU: 2 × Dual-Core Intel Xeon 5130, 2 GHz, 4096 kB L2 cache
- Memory bus: 2 independent, 1.33 GHz
- PCIe 8-lane
- Linux kernel 2.6.20-web100_pktd-plus
- Myricom NIC 10G-PCIE-8A-R Fibre
  - myri10ge v1.2.0 + firmware v1.4.10
  - rx-usecs=0, coalescence OFF
  - MSI=1
  - Checksums ON
  - tx_boundary=4096
- MTU 9000 bytes
- Histogram FWHM ~1-2 µs
- Latency 22 µs & very well behaved
- Latency slope 0.0028 µs/byte
- Back-to-back expectation: 0.00268 µs/byte
  - Mem 0.0004 + PCIe 0.00054 + 10 GigE 0.0008 + PCIe 0.00054 + Mem 0.0004
10 GigE Back-to-Back: UDP Throughput
- Kernel 2.6.20-web100_pktd-plus
- Myricom 10G-PCIE-8A-R Fibre
  - rx-usecs=25, coalescence ON
- MTU 9000 bytes
- Max throughput 9.4 Gbit/s
- Note the rate for 8972-byte packets
- ~0.002% packet loss in 10 M packets, in the receiving host
- Sending host: 3 CPUs idle; for packet spacings <8 µs, 1 CPU is >90% in kernel mode, including ~10% soft interrupt
- Receiving host: 3 CPUs idle; for packet spacings <8 µs, 1 CPU is 70-80% in kernel mode, including ~15% soft interrupt
10 GigE UDP Throughput vs Packet Size
- Motherboard: Supermicro X7DBE
- Linux kernel 2.6.20-web100_pktd-plus
- Myricom NIC 10G-PCIE-8A-R Fibre
- myri10ge v1.2.0 + firmware v1.4.10
  - rx-usecs=0, coalescence ON
  - MSI=1
  - Checksums ON
  - tx_boundary=4096
- Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries
- Model the data transfer time as t = C + m*Bytes
  - C includes the time to set up transfers
  - Fit is reasonable: C = 1.67 µs, m = 5.4e-4 µs/byte
  - Steps consistent with C increasing by 0.6 µs
- The Myricom driver segments the transfers, limiting the DMA to 4096 bytes – PCIe chipset dependent!
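To see what the fitted model implies, the short sketch below evaluates t = C + m*Bytes with the quoted constants and converts it to a throughput. The rule that C grows by 0.6 µs for each extra 4096-byte DMA segment is an assumption made for illustration of the observed steps.

/* Sketch of the transfer-time model t = C + m*Bytes, with C stepping up by
 * 0.6 us each time the driver must split the DMA into another 4096-byte
 * segment.  Constants are the fitted values quoted above; the step rule is
 * an assumption for illustration. */
#include <stdio.h>

int main(void)
{
    const double C_us    = 1.67;    /* setup time per transfer, fitted */
    const double m_us    = 5.4e-4;  /* per-byte cost, fitted */
    const double step_us = 0.6;     /* extra setup per additional 4096-byte DMA segment */

    for (int bytes = 1000; bytes <= 9000; bytes += 1000) {
        int extra_segments = (bytes - 1) / 4096;          /* segments beyond the first */
        double t_us = C_us + step_us * extra_segments + m_us * bytes;
        double gbit_s = bytes * 8.0 / (t_us * 1000.0);    /* throughput implied by the model */
        printf("%5d bytes: t = %6.2f us, %5.2f Gbit/s\n", bytes, t_us, gbit_s);
    }
    return 0;
}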
10 GigE via Cisco 7600: UDP Latency
- Motherboard: Supermicro X7DBE
- PCIe 8-lane
- Linux kernel 2.6.20 SMP
- Myricom NIC 10G-PCIE-8A-R Fibre
  - myri10ge v1.2.0 + firmware v1.4.10
  - rx-usecs=0, coalescence OFF
  - MSI=1
  - Checksums ON
- MTU 9000 bytes
- Latency 36.6 µs & very well behaved
- Switch latency 14.66 µs
- Switch internal: 0.0011 µs/byte
  - PCIe 0.00054
  - 10 GigE 0.0008
The "SC|05" Server PCs
- Boston/Supermicro X6DHE
- Two Intel Xeon Nocona
  - 3.2 GHz
  - 2048 kB cache
  - Shared 800 MHz front-side bus
- DDR2-400 memory
- Chipset: Intel 7520 Lindenhurst
- PCI
  - 2 × 8-lane PCIe buses
  - 1 × 4-lane PCIe bus
  - 3 × 133 MHz PCI-X
- 2 × Gigabit Ethernet
10 GigE X7DBE – X6DHE: UDP Throughput
- Kernel 2.6.20-web100_pktd-plus
- Myricom 10G-PCIE-8A-R Fibre
  - myri10ge v1.2.0 + firmware v1.4.10
  - rx-usecs=25, coalescence ON
- MTU 9000 bytes
- Max throughput 6.3 Gbit/s
- Packet loss ~40-60% in the receiving host
- Sending host: 3 CPUs idle; 1 CPU is >90% in kernel mode
- Receiving host: 3 CPUs idle; for packet spacings <8 µs, 1 CPU is 70-80% in kernel mode, including ~15% soft interrupt
So now we can run at 9.4 Gbit/s – can we do any work?
10 GigE X7DBE: TCP iperf
- No packet loss
- MTU 9000, TCP buffer 256 kB, BDP ~330 kB
- Web100 plots of TCP parameters:
- Cwnd
  - SlowStart then slow growth
  - Limited by the sender!
- Duplicate ACKs
  - One event of 3 DupACKs
- Packets retransmitted
- Throughput (Mbit/s)
  - iperf throughput 7.77 Gbit/s
10 GigE X7DBE: TCP iperf
- Packet loss 1:50,000 (recv-kernel patch)
- MTU 9000, TCP buffer 256 kB, BDP ~330 kB
- Web100 plots of TCP parameters:
- Cwnd
  - SlowStart then slow growth
  - Limited by the sender!
- Duplicate ACKs
  - ~10 DupACKs for every lost packet
- Packets retransmitted
  - One per lost packet
- Throughput (Mbit/s)
  - iperf throughput 7.84 Gbit/s
10 GigE X7DBE: CBR/TCP
- Packet loss 1:50,000 (recv-kernel patch)
- tcpdelay: message 8120 bytes, wait 7 µs
- RTT 36 µs
- TCP buffer 256 kB, BDP ~330 kB
- Web100 plots of TCP parameters:
- Cwnd
  - Dips as expected
- Duplicate ACKs
  - ~15 DupACKs for every lost packet
- Packets retransmitted
  - One per lost packet
- Throughput (Mbit/s)
  - tcpdelay throughput 7.33 Gbit/s
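tcpdelay itself is not reproduced here, but the constant-bit-rate idea – fixed-size messages written to a TCP socket at a fixed spacing – can be sketched as below. The message size and wait mirror the slide; the address, port and loop count are illustrative and not tcpdelay's real interface.

/* Minimal constant-bit-rate sender over TCP: writes a fixed-size message,
 * then spins until the next transmit slot, so the offered load is constant
 * rather than "as fast as TCP will go". */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const int msg_size   = 8120;   /* bytes per message, as on the slide */
    const double wait_us = 7.0;    /* spacing between messages */
    const long count     = 1000000;

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);
    if (connect(sock, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        return 1;

    char *msg = calloc(1, msg_size);
    double next = now_us();
    for (long i = 0; i < count; i++) {
        /* send() may return a short write; loop until the whole message has gone */
        for (int sent = 0; sent < msg_size; ) {
            int n = send(sock, msg + sent, msg_size - sent, 0);
            if (n <= 0) { free(msg); close(sock); return 1; }
            sent += n;
        }
        next += wait_us;
        while (now_us() < next)    /* spin to hold the constant message rate */
            ;
    }
    free(msg);
    close(sock);
    return 0;
}

When TCP's congestion window dips after a loss, the socket write blocks and the sender falls behind its schedule, which is exactly the behaviour the Cwnd and throughput plots show.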
Back-to-Back UDP with Memory Access
- Send UDP traffic back-to-back over 10 GigE
- On the receiver, run an independent memory-write task
  - L2 cache 4096 kB
  - 8000 kB blocks
  - 100% user mode
- Achievable UDP throughput:
  - mean 9.39 Gbit/s, sigma 106
  - mean 9.21 Gbit/s, sigma 37
  - mean 9.2 Gbit/s, sigma 30
- Packet loss:
  - mean 0.04%
  - mean 1.4%
  - mean 1.8%
- CPU load (from top): one CPU spends ~75% in system mode plus ~18% in soft interrupts servicing the network, one CPU is 100% in user mode running the memory-write task, and the remaining CPUs are essentially idle.
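The memory-write load used on the receiver can be approximated by a trivial user-mode program that streams writes through a block larger than the L2 cache. The 8000 kB block size follows the slide; the rest is an assumed stand-in for the real test code.

/* Sketch of the independent memory-write task: repeatedly writes through a
 * block much larger than the 4096 kB L2 cache so the CPU and memory bus stay
 * busy in pure user mode. */
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    const size_t block_bytes = 8000UL * 1024UL;   /* 8000 kB, larger than the 4096 kB L2 */
    uint64_t *block = malloc(block_bytes);
    if (!block)
        return 1;

    uint64_t value = 0;
    for (;;) {                                    /* run until killed, 100% user mode */
        size_t words = block_bytes / sizeof(uint64_t);
        for (size_t i = 0; i < words; i++)
            block[i] = value + i;                 /* streaming writes defeat the cache */
        value++;
    }
    return 0;                                     /* not reached */
}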
ESLEA-FABRIC: 4 Gbit Flows over GÉANT
- Set up a 4 Gigabit lightpath between GÉANT PoPs
  - Collaboration with Dante
  - GÉANT Development Network London – London or London – Amsterdam, and GÉANT Lightpath service CERN – Poznan
  - PCs in their PoPs with 10 Gigabit NICs
- VLBI tests:
  - UDP performance: throughput, jitter, packet loss, 1-way delay, stability
  - Continuous (days-long) data flows – VLBI_UDP and multi-Gigabit TCP performance with current kernels
  - Experience for FPGA Ethernet packet systems
- Dante interests:
  - Multi-Gigabit TCP performance
  - The effect of (Alcatel) buffer size on bursty TCP using bandwidth-limited lightpaths
Options Using the GÉANT Development Network
- 10 Gigabit SDH backbone
- Alcatel 1678 MCC
- Node locations: London, Amsterdam, Paris, Prague, Frankfurt
- Can do traffic routing, so long-RTT paths can be made
- Available now (2007)
- Less pressure for long-term tests
Options Using the GÉANT Lightpaths
- Set up a 4 Gigabit lightpath between GÉANT PoPs
  - Collaboration with Dante
  - PCs in Dante PoPs
- 10 Gigabit SDH backbone
- Alcatel 1678 MCC
- Node locations: Budapest, Geneva, Frankfurt, Milan, Paris, Poznan, Prague, Vienna
- Can do traffic routing, so long-RTT paths can be made
- Ideal: London – Copenhagen
Any Questions?
Backup Slides
10 Gigabit Ethernet: UDP Throughput
- 1500-byte MTU gives ~2 Gbit/s
- Used 16144-byte MTU, maximum user length 16080
- DataTAG Supermicro PCs
  - Dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes
  - Wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs
  - Dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes
  - Wire rate of 5.7 Gbit/s
- SLAC Dell PCs
  - Dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes
  - Wire rate of 5.4 Gbit/s
10 Gigabit Ethernet: Tuning PCI-X
- 16080-byte packets every 200 µs
- Intel PRO/10GbE LR adapter
- PCI-X bus occupancy vs mmrbc
  - Measured times, and times based on PCI-X timing from the logic analyser
  - Plots for mmrbc = 512, 1024, 2048 and 4096 bytes, with the PCI-X sequence broken into CSR access, data transfer, and interrupt & CSR update
  - Expected throughput ~7 Gbit/s
  - Measured 5.7 Gbit/s at mmrbc 4096 bytes
10 Gigabit Ethernet: TCP Data Transfer on PCI-X
- Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons
- Connected via a 6509
- XFrame II NIC
- PCI-X mmrbc 4096 bytes, 66 MHz
- Two 9000-byte packets back-to-back on the bus (data transfer and CSR access visible on the analyser)
- Average rate 2.87 Gbit/s
- Burst of packets, length 646.8 µs
- Gap between bursts 343 µs
- 2 interrupts per burst
10 Gigabit Ethernet: UDP Data Transfer on PCI-X
- Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons
- Connected via a 6509
- XFrame II NIC
- PCI-X mmrbc 2048 bytes, 66 MHz
- One 8000-byte packet:
  - 2.8 µs for CSR access
  - 24.2 µs data transfer, effective rate 2.6 Gbit/s
- 2000-byte packets, wait 0 µs: ~200 ms pauses
- 8000-byte packets, wait 0 µs: ~15 ms between data blocks
10 Gigabit Ethernet: Neterion NIC Results
- X5DPE-G2 Supermicro PCs back-to-back
- Dual 2.2 GHz Xeon CPU, FSB 533 MHz
- XFrame II NIC
- PCI-X mmrbc 4096 bytes
- Low UDP rates, ~2.5 Gbit/s
- Large packet loss
- TCP:
  - One iperf TCP data stream: 4 Gbit/s
  - Two bi-directional iperf TCP data streams: 3.8 & 2.2 Gbit/s
SC|05 Seattle – SLAC 10 Gigabit Ethernet
- 2 lightpaths:
  - Routed over ESnet
  - Layer 2 over UltraScience Net
- 6 Sun V20z systems per λ
- dcache remote disk data access
  - 100 processes per node
  - Each node sends or receives
  - One data stream is 20-30 Mbit/s
- Used Neterion NICs & Chelsio TOE
- Data also sent to StorCloud using fibre channel links
- Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk