High Performance WAN Testbed Experiences & Results
Les Cottrell – SLAC
Prepared for CHEP03, San Diego, March 2003
http://www.slac.stanford.edu/grp/scs/net/talk/chep03-hiperf.html
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), by the SciDAC base program.
Outline
• Who did it?
• What was done?
• How was it done?
• Who needs it?
• So what’s next?
• Where do I find out more?
Who did it: Collaborators and sponsors
• Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
• SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti
• LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
• NIKHEF/UvA: Cees de Laat, Antony
• CERN: Olivier Martin, Paolo Moroni
• ANL: Linda Winkler
• DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telekom, Information Society Technologies
• Cisco, Level(3), Intel
• DoE, European Commission, NSF
What was done?
• Set a new Internet2 TCP land speed record: 10,619 Tbit-meters/sec (see http://lsr.internet2.edu/)
• With 10 streams achieved 8.6 Gbps across the US
• Beat the Gbps limit for a single TCP stream across the Atlantic: one Terabyte transferred in less than one hour

When            From       To         Bottleneck  MTU     Streams  TCP       Throughput
Nov ’02 (SC02)  Amsterdam  Sunnyvale  1 Gbps      9000 B  1        Standard  923 Mbps
Nov ’02 (SC02)  Baltimore  Sunnyvale  10 Gbps     1500 B  10       FAST      8.6 Gbps
Feb ’03         Sunnyvale  Geneva     2.5 Gbps    9000 B  1        Standard  2.38 Gbps
10 GigE Data Transfer Trial
On February 27-28, over a Terabyte of data was transferred in 3700 seconds by S. Ravot of Caltech between the Level 3 PoP in Sunnyvale, near SLAC, and CERN. The data passed through the TeraGrid router at StarLight from memory to memory as a single TCP/IP stream at an average rate of 2.38 Gbps (using large windows and 9 KByte “jumbo” frames). This beat the former record by a factor of approximately 2.5, and used the US-CERN link at 99% efficiency.
Original slide by: Olivier Martin, CERN
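A quick back-of-envelope check of the figures quoted above (a sketch only; the exact payload is an assumption, since the slide says "over a Terabyte"):

# Rough check of the quoted average rate (assumes exactly 1.0 TByte moved;
# with "over a Terabyte" the true average is somewhat higher, near 2.38 Gbps).
bytes_moved = 1.0e12
seconds = 3700
avg_gbps = bytes_moved * 8 / seconds / 1e9
print(f"Average rate: {avg_gbps:.2f} Gbit/s")          # ~2.16 Gbit/s

# Rate needed for the "one Terabyte in less than one hour" milestone:
print(f"1 TByte/hour needs: {1.0e12 * 8 / 3600 / 1e9:.2f} Gbit/s")   # ~2.22 Gbit/s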
How was it done: Typical testbed
[Diagram: Sunnyvale-Chicago-Amsterdam-Geneva path (SNV, CHI, AMS, GVA), > 10,000 km. OC192/POS (10 Gbits/s) and 2.5 Gbits/s (EU+US) links; Cisco 7609, GSR and Juniper T640 routers; 6*2 and 12*2 cpu servers plus 4 disk servers at the end sites. The Sunnyvale section was deployed for SC2002 (Nov 02).]
Typical Components
• CPU (disk servers and compute servers)
  – Pentium 4 (Xeon) with 2.4 GHz cpu
  – For GE used SysKonnect NIC
  – For 10 GE used Intel NIC
  – Linux 2.4.19 or 20
• Routers
  – Cisco GSR 12406 with OC192/POS & 1 and 10 GE server interfaces (loaned, list > $1M)
  – Cisco 760x
  – Juniper T640 (Chicago)
• Level(3) OC192/POS fibers (loaned; SNV-CHI monthly lease cost ~ $220K)
[Photos: disk servers, compute servers, GSR with earthquake strap and heat sink; note bootees]
Challenges
• PCI bus limitations (66 MHz * 64 bit = 4.2 Gbits/s at best)
• At 2.5 Gbits/s and 180 msec RTT requires a 120 MByte window
• Some tools (e.g. bbcp) will not allow a large enough window
  – (bbcp limited to 2 MBytes)
• Slow start problem: at 1 Gbits/s it takes about 5-6 secs for a 180 msec link
  – i.e. if we want 90% of the measurement in the stable (non slow start) phase, we need to measure for 60 secs
  – need to ship >700 MBytes at 1 Gbits/s
• After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1 Gbits/s
  – i.e. a loss rate of 1 in ~2 Gpkts (3 Tbits), or a BER of 1 in 3.6*10^12
[Plot: recovery after a loss, Sunnyvale-Geneva, 1500 Byte MTU, stock TCP]
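A minimal sketch of the arithmetic behind these bullets, assuming a 1500 B MTU and ignoring header overhead and delayed ACKs (which is partly why the slide's figures come out somewhat larger):

def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: the window needed to keep the pipe full."""
    return rate_bps * rtt_s / 8

# PCI bus ceiling: 66 MHz clock * 64 bit wide bus
print(f"PCI bus: {66e6 * 64 / 1e9:.1f} Gbit/s at best")

# Window needed at 2.5 Gbit/s over a 180 ms RTT (the slide allows extra headroom)
print(f"Window: ~{bdp_bytes(2.5e9, 0.180) / 2**20:.0f} MByte")

# Standard TCP (Reno) recovery after one loss at full window:
# the window is halved, then regrows ~1 packet per RTT back to the full window.
mtu, rate, rtt = 1500, 1e9, 0.180
window_pkts = bdp_bytes(rate, rtt) / mtu
recovery_s = (window_pkts / 2) * rtt
print(f"Recovery: ~{recovery_s / 60:.0f} minutes (longer still with delayed ACKs)")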
Windows and Streams
• Well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput
• Effectively reduces the impact of a loss by 1/n, and improves recovery time by 1/n
• The optimum windows & streams change as the path changes (e.g. utilization), so n is hard to optimize
• Can be unfriendly to others
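One rough way to see why (not from the talk) is the well-known Mathis et al. estimate of steady-state Reno throughput; the MSS, RTT and loss rate below are illustrative assumptions:

import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state throughput of one standard TCP stream."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)

mss, rtt, p = 1500, 0.180, 1e-6          # assumed values for a long, fairly clean path
single = mathis_throughput_bps(mss, rtt, p)
print(f" 1 stream : {single / 1e6:7.1f} Mbit/s")
# Each of n parallel streams sees roughly 1/n of the losses per window, so the
# aggregate grows about linearly with n until the link or window limit is hit.
for n in (2, 4, 8, 16):
    print(f"{n:2d} streams: {n * single / 1e6:7.1f} Mbit/s aggregate (if unconstrained)")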
Even with big windows (1 MB) still need multiple streams with Standard TCP
• ANL, Caltech & RAL reach a knee (between 2 and 24 streams)
• Above the knee performance still improves, but slowly, maybe due to squeezing out others and taking more than a fair share because of the large number of streams
• Streams and windows can change during the day, hard to optimize
New TCP Stacks
• Reno (AIMD) based: loss indicates congestion
  – Back off less when congestion is seen
  – Recover more quickly after backing off
• Scalable TCP: exponential recovery
  – Tom Kelly, “Scalable TCP: Improving Performance in Highspeed Wide Area Networks”, submitted for publication, December 2002
• High Speed TCP: same as Reno for low performance, then increases the window more & more aggressively (using a table) as the window grows
• Vegas based: RTT indicates congestion
  – Caltech FAST TCP: quicker response to congestion, but …
[Plot: response functions of Standard, Scalable and High Speed TCP]
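For concreteness, a schematic of the per-ACK / per-loss window rules behind these names (window w in packets). This is a sketch drawn from the published descriptions with their commonly quoted constants, not the kernel patches actually used in the tests:

def reno(w, loss):
    return w * 0.5 if loss else w + 1.0 / w        # +1 packet per RTT, halve on loss

def scalable(w, loss):
    return w * 0.875 if loss else w + 0.01         # exponential growth, back off by 1/8

def highspeed(w, loss, a=1.0, b=0.5):
    # RFC 3649 looks a(w) and b(w) up in a table: as w grows, the increase a
    # gets larger and the back-off b smaller; for small w it behaves like Reno.
    return w * (1.0 - b) if loss else w + a / w

def fast(w, base_rtt, rtt, alpha=200, gamma=0.5):
    # Delay-based (Vegas family): each RTT, move toward the window that keeps
    # about alpha packets queued in the path, driven by measured RTT, not loss.
    return min(2 * w, (1 - gamma) * w + gamma * (base_rtt / rtt * w + alpha))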
Stock vs FAST TCP (MTU = 1500 B)
• Need to measure all parameters to understand the effects of parameters and configurations:
  – Windows, streams, txqueuelen, TCP stack, MTU, NIC card
  – A lot of variables
• Examples of 2 TCP stacks:
  – FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables to tune by 1)
[Plots: Stock TCP, 1500 B MTU, 65 ms RTT vs. FAST TCP, 1500 B MTU, 65 ms RTT]
Jumbo frames
• Become more important at higher speeds:
  – Reduce interrupts to the CPU and packets to process, reducing cpu utilization (rates sketched below)
  – Similar effect to using multiple streams (T. Hacker)
• Jumbos can achieve >95% utilization SNV to CHI or GVA with 1 or multiple streams at up to 1 Gbit/s
• Factor of 5 improvement over single stream 1500 B MTU throughput for stock TCP (SNV-CHI (65 ms) & CHI-AMS (128 ms))
• Complementary approach to a new stack
• Deployment doubtful:
  – Few sites have deployed
  – Not part of the GE or 10 GE standards
[Plot: 1500 B vs. jumbo frame throughput]
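The packet/interrupt-rate argument in the first bullet, as a rough sketch (line-rate packet counts only; real NICs add interrupt coalescing, so the CPU saving is indicative rather than exact):

# Packets per second needed to fill 1 Gbit/s with standard vs jumbo frames.
for mtu in (1500, 9000):
    pps = 1e9 / (mtu * 8)
    print(f"MTU {mtu:5d} B: ~{pps:,.0f} packets/s")
# ~83,000 pkt/s at 1500 B vs ~14,000 pkt/s at 9000 B: six times fewer packets
# (and interrupts) for the end hosts to handle at the same data rate.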
TCP stacks with 1500 B MTU @ 1 Gbps
[Plot; label: txqueuelen]
Jumbo frames, new TCP stacks at 1 Gbits/s
[Plot: SNV-GVA]
Other gotchas
• Large windows and a large number of streams can cause the last stream to take a long time to close
• Linux memory leak
• Linux TCP configuration caching
• What window size is actually used/reported? (see the sketch below)
• 32 bit counters in iperf and routers wrap; need the latest releases with 64 bit counters
• Effects of txqueuelen (number of packets queued for the NIC)
• Some routers do not pass jumbos
• Performance differs between drivers and NICs from different manufacturers
  – May require tuning a lot of parameters
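One example of the "what window is actually used/reported" gotcha, as a small sketch: on Linux the kernel doubles the socket buffer you request with setsockopt (to cover its own bookkeeping) and caps it at net.core.wmem_max / rmem_max, so the value read back rarely matches what was asked for.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
requested = 8 * 1024 * 1024                          # ask for an 8 MByte send buffer
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"requested {requested} bytes, kernel reports {granted} bytes")
s.close()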
Who needs it?
• HENP – current driver
• Data intensive science:
  – Astrophysics, global weather, fusion, seismology
• Industries such as aerospace, medicine, security …
• Future: media distribution
  – Gbits/s = 2 full length DVD movies/minute
  – 2.36 Gbits/s is equivalent to transferring a full CD in 2.3 seconds (i.e. 1565 CDs/hour), or 200 full length DVD movies in one hour (i.e. 1 DVD in 18 seconds)
  – Will sharing movies be like sharing music today?
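The media-distribution equivalences follow from simple arithmetic; a sketch with assumed media sizes (650 MByte per CD, 4.7 GByte per DVD) that roughly reproduces the figures above:

rate_bps = 2.36e9
for name, size_bytes in (("CD ", 650e6), ("DVD", 4.7e9)):
    t = size_bytes * 8 / rate_bps
    print(f"{name}: {t:5.1f} s each, ~{3600 / t:,.0f} per hour")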
What’s next?
• Break the 2.5 Gbits/s limit
• Disk-to-disk throughput & useful applications
  – Need faster cpus (extra 60% MHz/Mbit/s over TCP for disk to disk); understand how to use multi-processors
• Evaluate new stacks with real-world links, and other equipment
  – Other NICs
  – Response to congestion, pathologies
  – Fairness
  – Deploy for some major (e.g. HENP/Grid) customer applications
• Understand how to make 10 GE NICs work well with 1500 B MTUs
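A rough reading of the CPU requirement above (a sketch: the ~1 MHz per Mbit/s baseline for memory-to-memory TCP is a common rule of thumb and an assumption here; the 60% surcharge for disk to disk is from the slide):

baseline_mhz_per_mbps = 1.0       # assumed rule of thumb for memory-to-memory TCP
disk_factor = 1.6                 # slide: disk-to-disk costs an extra ~60%
for rate_gbps in (1.0, 2.5, 10.0):
    need_ghz = rate_gbps * 1e3 * baseline_mhz_per_mbps * disk_factor / 1e3
    print(f"{rate_gbps:4.1f} Gbit/s disk-to-disk needs ~{need_ghz:.1f} GHz of CPU")
# Even 2.5 Gbit/s would need ~4 GHz, beyond the 2.4 GHz Xeons used here,
# hence the need for faster cpus and for learning to use multi-processors.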
More Information
• Internet2 Land Speed Record publicity
  – www-iepm.slac.stanford.edu/lsr/
  – www-iepm.slac.stanford.edu/lsr2/
• 10 GE tests
  – www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
  – sravot.home.cern.ch/sravot/Networking/10GbE_test.html
• TCP stacks
  – netlab.caltech.edu/FAST/
  – datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf
  – www.icir.org/floyd/hstcp.html
• Stack comparisons
  – www-iepm.slac.stanford.edu/monitoring/bulk/fast/
  – www.csm.ornl.gov/~dunigan/net100/floyd.html
Impact on others
[Plot]