High Performance Networking for ALL Members of GridPP
High Performance Networking for ALL Members of GridPP
- GridPP are in many network collaborations, including close links with:
  - MB-NG
  - SLAC
  - UKERNA, SURFnet and other NRNs
  - Dante, Internet2
  - Starlight, NetherLight
  - GGF, RIPE
  - Industry …
Network Monitoring [1]
- Architecture (diagram)
- DataGrid WP7 code extended by Gareth (Manc)
- Technology transfer to UK e-Science; developed by Mark Lees (DL)
- Fed back into DataGrid by Gareth
- Links to: GGF NM-WG, Dante, Internet2
- Characteristics, schema & web services
Success
Network Monitoring [2]
- 24 Jan to 4 Feb 04: TCP iperf, RAL to HEP sites - only 2 sites > 80 Mbit/s
- 24 Jan to 4 Feb 04: TCP iperf, DL to HEP sites
HELP!
High bandwidth, long distance… Where is my throughput?
Robin Tasker, CCLRC Daresbury Laboratory, UK [r.tasker@dl.ac.uk]
DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459
RIPE-47, Amsterdam, 29 January 2004
Throughput… What's the problem?
- One Terabyte of data transferred in less than an hour
- On February 27-28 2003, the transatlantic DataTAG network was extended, i.e. CERN - Chicago - Sunnyvale (>10,000 km). For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream. The transfer was accomplished from Sunnyvale to Geneva at a rate of 2.38 Gbit/s.
Internet2 Land Speed Record
- On October 1 2003, DataTAG set a new Internet2 Land Speed Record by transferring 1.1 Terabytes of data in less than 30 minutes from Geneva to Chicago across the DataTAG provision, corresponding to an average rate of 5.44 Gbit/s using a single TCP (Reno) stream.
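Both records can be sanity-checked from the quoted figures (taking 1 Terabyte = 10^12 bytes): 1 TByte in one hour is 8 x 10^12 bits / 3600 s ≈ 2.2 Gbit/s, consistent with the sustained 2.38 Gbit/s quoted above; and 1.1 TBytes at an average of 5.44 Gbit/s is 8.8 x 10^12 bits / 5.44 x 10^9 bit/s ≈ 1620 s, i.e. about 27 minutes, just inside the 30-minute figure.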
So how did we do that?
- Management of the end-to-end connection:
  - Memory-to-memory transfer; no disk system involved
  - Processor speed and system bus characteristics
  - TCP configuration - window size and frame size (MTU)
  - Network Interface Card and associated driver and their configuration
- End-to-end "no loss" environment from CERN to Sunnyvale!
  - At least a 2.5 Gbit/s capacity pipe on the end-to-end path
  - A single TCP connection on the end-to-end path
  - No real user application
- That's to say - not the usual user experience!
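To see why the window size matters: a single TCP stream can only fill the pipe if its window covers the bandwidth-delay product of the path, window ≥ bandwidth x RTT. As an illustration (the RTT here is an assumption, not quoted on the slide), a CERN - Sunnyvale path with an RTT of roughly 180 ms at the rate above needs 2.38 Gbit/s x 0.18 s ≈ 430 Mbit ≈ 54 MBytes of window - far beyond default socket buffers and well past the 64 KByte limit of the basic TCP window field, so large buffers and the window-scaling option are essential.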
Realistically - what's the problem & why do network research?
- End system issues:
  - Network Interface Card and driver and their configuration
  - TCP and its configuration
  - Operating system and its configuration
  - Disk system
  - Processor speed
  - Bus speed and capability
- Network infrastructure issues:
  - Obsolete network equipment
  - Configured bandwidth restrictions
  - Topology
  - Security restrictions (e.g., firewalls)
  - Sub-optimal routing
- Transport protocols
- Network capacity and the influence of others!
  - Many, many TCP connections
  - Mice and elephants on the path
  - Congestion
End Hosts: Buses, NICs and Drivers
- Use UDP packets to characterise the Intel PRO/10GbE Server Adapter (a probe sketch follows this slide)
  - SuperMicro P4DP8-G2 motherboard
  - Dual Xeon 2.2 GHz CPUs
  - 400 MHz system bus
  - 133 MHz PCI-X bus
- Plots: throughput, latency, bus activity
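A minimal sketch of the kind of UDP probe used for this characterisation, in the spirit of the UDPmon kit listed on the URLs slide (this is not its actual code): send a stream of fixed-size datagrams with a chosen inter-packet wait and report the achieved user-data rate. The receiver address, port and spacing are illustrative assumptions, and error handling is omitted.

    /* UDP throughput probe sketch: fixed-size datagrams with a chosen
     * inter-packet wait. Real tools busy-wait for sub-10us spacing;
     * usleep() granularity is too coarse for that. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        const int pkt_size = 1472;     /* fills a 1500-byte MTU frame */
        const int n_pkts   = 10000;
        const int wait_us  = 10;       /* inter-packet spacing under test */
        char buf[1472];
        memset(buf, 0, sizeof(buf));

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5001);                        /* hypothetical port */
        inet_pton(AF_INET, "192.168.1.10", &dst.sin_addr);   /* hypothetical receiver */

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < n_pkts; i++) {
            sendto(s, buf, pkt_size, 0, (struct sockaddr *)&dst, sizeof(dst));
            usleep(wait_us);
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("sent %.1f Mbit/s of user data\n",
               8.0 * pkt_size * (double)n_pkts / secs / 1e6);
        close(s);
        return 0;
    }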
End Hosts: Understanding NIC Drivers
- Linux driver basics - TX:
  - Application system call
  - Encapsulation in UDP/TCP and IP headers
  - Enqueue on device send queue
  - Driver places information in DMA descriptor ring
  - NIC reads data from main memory via DMA and sends on wire
  - NIC signals to processor that TX descriptor sent
- Linux driver basics - RX:
  - NIC places data in main memory via DMA to a free RX descriptor
  - NIC signals RX descriptor has data
  - Driver passes frame to IP layer and cleans RX descriptor
  - IP layer passes data to application
- Linux NAPI driver model (sketched after this slide):
  - On receiving a packet, NIC raises interrupt
  - Driver switches off RX interrupts and schedules RX DMA ring poll
  - Frames are pulled off the DMA ring and processed up to the application
  - When all frames are processed, RX interrupts are re-enabled
  - Dramatic reduction in RX interrupts under load
- "Improving the performance of a Gigabit Ethernet driver under Linux"
  http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/
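A schematic sketch of the NAPI receive pattern described above. It uses present-day kernel helper names (napi_schedule, napi_complete, netif_receive_skb); the 2004-era interface hung the poll routine off struct net_device, and the nic_* helpers here are hypothetical hardware accessors for an imaginary NIC, so this illustrates the pattern rather than being a drop-in driver.

    /* Schematic NAPI receive path: interrupt handler defers work,
     * poll routine drains the RX ring with interrupts masked. */
    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    struct nic_priv {
        struct napi_struct napi;
        /* ... descriptor rings, register mappings, etc. ... */
    };

    static irqreturn_t nic_interrupt(int irq, void *dev_id)
    {
        struct nic_priv *priv = dev_id;

        nic_disable_rx_irq(priv);     /* switch off RX interrupts...      */
        napi_schedule(&priv->napi);   /* ...and schedule the RX ring poll */
        return IRQ_HANDLED;
    }

    static int nic_poll(struct napi_struct *napi, int budget)
    {
        struct nic_priv *priv = container_of(napi, struct nic_priv, napi);
        int done = 0;

        /* pull frames off the DMA ring and push them up the stack */
        while (done < budget && nic_rx_ring_has_frame(priv)) {
            struct sk_buff *skb = nic_pull_frame(priv);
            netif_receive_skb(skb);
            done++;
        }

        /* ring drained before the budget ran out:
         * leave polling mode and re-enable RX interrupts */
        if (done < budget) {
            napi_complete(napi);
            nic_enable_rx_irq(priv);
        }
        return done;
    }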
Protocols: TCP (Reno) - Performance
- AIMD and high bandwidth, long distance networks: poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm
  - For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b x cwnd (Multiplicative Decrease, b = 1/2)
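A compact sketch of these two Reno update rules in the congestion-avoidance phase (slow start, fast recovery and timeouts are omitted):

    /* TCP Reno congestion avoidance, per the rules above: additive
     * increase a = 1 segment per RTT, multiplicative decrease b = 1/2
     * on a loss event. cwnd is held in segments, as a double for clarity. */
    static const double RENO_A = 1.0;
    static const double RENO_B = 0.5;

    double reno_on_ack(double cwnd)
    {
        return cwnd + RENO_A / cwnd;   /* +a/cwnd per ACK, ~ +a per RTT */
    }

    double reno_on_loss(double cwnd)
    {
        return cwnd - RENO_B * cwnd;   /* halve the window */
    }

Because the increase is only about one segment per RTT, regrowing a multi-thousand-segment window after a single loss on a long fat path takes thousands of RTTs - the root of the poor wide-area performance noted above.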
Protocols: HighSpeed TCP & Scalable TCP
- Adjusting the AIMD algorithm - TCP Reno:
  - For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b x cwnd (Multiplicative Decrease, b = 1/2)
- HighSpeed TCP: a and b vary depending on the current cwnd, where
  - a increases more rapidly with larger cwnd and, as a consequence, returns to the 'optimal' cwnd size sooner for the network path; and
  - b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
- Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd, such that the increase is greater than for TCP Reno and the decrease on loss is less than for TCP Reno
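For comparison, a sketch of the Scalable TCP variant of the same two hooks, using the commonly quoted constants a = 0.01 per ACK and b = 0.125 per loss (the constants come from Kelly's Scalable TCP work, not from the slide). HighSpeed TCP keeps the Reno form but looks a and b up from a table parameterised by the current cwnd, per Floyd's HighSpeed TCP specification.

    /* Scalable TCP update rules. The increment no longer depends on
     * cwnd, so the time to recover after a loss is independent of the
     * window size - hence "scalable". */
    static const double STCP_A = 0.01;
    static const double STCP_B = 0.125;

    double scalable_on_ack(double cwnd)
    {
        return cwnd + STCP_A;           /* fixed increment per ACK */
    }

    double scalable_on_loss(double cwnd)
    {
        return cwnd - STCP_B * cwnd;    /* give back only 12.5% on loss */
    }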
Protocols: HighSpeed TCP & Scalable TCP
- HighSpeed TCP (plot)
- Scalable TCP (plot)
- HighSpeed TCP implemented by Gareth (Manc)
- Scalable TCP implemented by Tom Kelly (Camb)
- Integration of stacks into the DataTAG kernel: Yee (UCL) + Gareth
Success
Some Measurements of Throughput: CERN - SARA
- Using the GÉANT backup link
  - 1 GByte file transfers
  - Blue: data; Red: TCP ACKs
- Standard TCP
  - Average throughput 167 Mbit/s
  - Users see 5 - 50 Mbit/s!
- HighSpeed TCP
  - Average throughput 345 Mbit/s
- Scalable TCP
  - Average throughput 340 Mbit/s
Users, The Campus & the MAN [1]
(Pete White, Pat Meyrs)
- NNW to SJ4 access: 2.5 Gbit PoS - hits 1 Gbit, 50%
- Man to NNW access: 2 x 1 Gbit Ethernet
Users, The Campus & the MAN [2]
- LMN to site 1 access: 1 Gbit Ethernet
- LMN to site 2 access: 1 Gbit Ethernet
- Message:
  - Continue to work with your network group
  - Understand the traffic levels
  - Understand the network topology
10 GigEthernet: Tuning PCI-X
10 GigEthernet at SC2003 BW Challenge (Phoenix)
- Three server systems with 10 GigEthernet NICs
- Used the DataTAG altAIMD stack, 9000 byte MTU
- Streams from the SLAC/FNAL booth in Phoenix to:
  - Palo Alto PAIX, 17 ms rtt
  - Chicago Starlight, 65 ms rtt
  - Amsterdam SARA, 175 ms rtt
Helping Real Users [1]: Radio Astronomy VLBI
- PoC with NRNs & GÉANT
- 1024 Mbit/s, 24 on 7 - NOW
VLBI Project: Throughput, Jitter & 1-way Delay
- Throughput: 1472 byte packets, Manchester -> Dwingeloo (JIVE)
- Jitter: 1472 byte packets, Man -> JIVE; FWHM 22 µs (B2B 3 µs)
- 1-way delay - note the packet loss (points with zero 1-way delay)
VLBI Project: Packet Loss Distribution
- Measure the time between lost packets in the time series of packets sent
- Lost 1410 packets in 0.6 s
- Is it a Poisson process?
- Assume the Poisson process is stationary: λ(t) = λ
- Use the probability density function: P(t) = λ e^(-λt)
- Mean λ = 2360 /s [mean interval 426 µs]
- Plot of the log: slope -0.0028, expect -0.0024
- Could be an additional process involved
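A worked check of these numbers under the stationary-Poisson assumption: the mean inter-loss time is 1/λ, and the natural-log plot of the inter-loss-time density has slope -λ (per µs when λ is expressed per µs), which is where the "expect -0.0024" comes from.

    /* Check of the Poisson loss model on the slide: 1410 lost packets
     * in 0.6 s gives lambda, the expected mean spacing, and the
     * expected slope of ln P(t) versus t in microseconds. */
    #include <stdio.h>

    int main(void)
    {
        double lost = 1410.0;
        double period_s = 0.6;

        double lambda_per_s = lost / period_s;      /* ~2350 /s, slide quotes 2360 /s */
        double mean_gap_us  = 1e6 / lambda_per_s;   /* ~426 us */
        double slope_per_us = -lambda_per_s / 1e6;  /* expected ln-slope ~ -0.0024 */

        printf("lambda         = %.0f /s\n", lambda_per_s);
        printf("mean gap       = %.0f us\n", mean_gap_us);
        printf("expected slope = %.4f per us (measured: -0.0028)\n", slope_per_us);
        return 0;
    }

The measured slope of -0.0028 being steeper than the Poisson prediction is what motivates the "could be an additional process involved" remark.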
VLBI Traffic Flows - Only testing!
- Manchester - NetNorthWest - SuperJANET access links
  - Two 1 Gbit/s
- Access links: SJ4 to GÉANT to SURFnet
Throughput & PCI Transactions on the Mark5 PC
- Mark5 uses a Supermicro P3TDLE motherboard
  - 1.2 GHz PIII
  - Memory bus 133/100 MHz
  - 2 x 64 bit 66 MHz PCI
  - 4 x 32 bit 33 MHz PCI
- On the buses: Gigabit Ethernet NIC, IDE disc pack, SuperStor input card, logic analyser display
- Test loop: read/write n bytes, then wait time
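For orientation, the raw burst capacity of a PCI segment is simply bus width x clock rate, so for the bus configurations discussed in this part of the talk:
  64 bit x 66 MHz ≈ 4.2 Gbit/s (≈ 528 MBytes/s)
  64 bit x 33 MHz ≈ 2.1 Gbit/s (≈ 264 MBytes/s)
  32 bit x 33 MHz ≈ 1.06 Gbit/s (≈ 132 MBytes/s)
Real transfers achieve less once CSR accesses, burst setup and arbitration with the other devices on the segment are counted - which is what the logic analyser traces on the following slides measure.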
PCI Activity: Read Multiple Data Blocks, 0 Wait
- Read 999424 bytes; data block 131,072 bytes; PCI data transfer bursts of 4096 bytes
- Each data block: setup CSRs, data movement, update CSRs
- For 0 wait between reads:
  - Data blocks ~600 µs long take ~6 ms
  - Then a 744 µs gap
- PCI transfer rate 1188 Mbit/s (148.5 MBytes/s)
- Read_sstor rate 778 Mbit/s (97 MBytes/s)
- PCI bus occupancy: 68.44%
- Concern about Ethernet traffic: 64 bit 33 MHz PCI needs ~82% for 930 Mbit/s; expect ~360 Mbit/s
PCI Activity: Read Throughput
- Flat, then 1/t dependence
- ~860 Mbit/s for read blocks >= 262144 bytes
- CPU load ~20%
- Concern about the CPU load needed to drive a Gigabit link
Helping Real Users [2]: HEP - BaBar & CMS Application Throughput
BaBar Case Study: Disk Performance
- BaBar disk server:
  - Tyan Tiger S2466N motherboard
  - 1 x 64 bit 66 MHz PCI bus
  - Athlon MP 2000+ CPU
  - AMD-760 MPX chipset
  - 3Ware 7500-8 RAID5
  - 8 x 200 GB Maxtor IDE 7200 rpm disks
- Note the VM parameter readahead max
- Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
- Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
BaBar: Serial ATA RAID Controllers
- 3Ware, 66 MHz PCI
- ICP, 66 MHz PCI
BaBar Case Study: RAID Throughput & PCI Activity
- 3Ware 7500-8 RAID5, parallel EIDE
- 3Ware forces the PCI bus to 33 MHz
- BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
- Disk-disk throughput with bbcp: 40 - 45 MBytes/s (320 - 360 Mbit/s)
- PCI bus effectively full!
- Plots: read from RAID5 disks; write to RAID5 disks
MB-NG BaBar Case Study: MB-NG and SuperJANET4 Development Network
(diagram: MAN, MCC, RAL and SJ4 Dev domains; OSM-1OC48-POS-SS interfaces; PCs with 3Ware RAID5; Gigabit Ethernet; 2.5 Gbit PoS access; 2.5 Gbit PoS core; MPLS; admin domains)
- Status / Tests:
  - Manc host has the DataTAG TCP stack
  - RAL host now available
  - BaBar-BaBar mem-mem
  - BaBar-BaBar real data over MB-NG
  - BaBar-BaBar real data over SJ4
  - SJ4 Dev - SJ4 Dev real data over MB-NG
  - mbng-mbng real data over SJ4
  - Different TCP stacks already installed
MB-NG Study of Applications: MB-NG and SuperJANET4 Development Network
(diagram: MAN, MCC, UCL and SJ4 Dev domains; OSM-1OC48-POS-SS interfaces; PCs with 3Ware RAID0; Gigabit Ethernet; 2.5 Gbit PoS access; 2.5 Gbit PoS core; MPLS; admin domains)
MB-NG: 24 Hours HighSpeed TCP mem-mem
- TCP mem-mem, lon2-man1
- Interrupt coalescence: Tx 64, Tx-abs 64, Rx-abs 128
- 941.5 Mbit/s +- 0.5 Mbit/s
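941.5 Mbit/s is essentially Gigabit Ethernet line rate for this configuration. A quick check, assuming a 1500-byte MTU and the TCP timestamp option (an assumption, not stated on the slide): each 1500-byte IP packet occupies 1538 bytes on the wire (preamble 8 + header 14 + FCS 4 + inter-frame gap 12) and carries 1448 bytes of TCP payload (1500 - 20 IP - 20 TCP - 12 timestamps), so the maximum goodput is 1448/1538 x 1000 Mbit/s ≈ 941.5 Mbit/s - exactly the measured plateau.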
MB-NG: Gridftp Throughput, HighSpeed TCP
- Interrupt coalescence 64 / 128
- txqueuelen 2000
- TCP buffer 1 MByte (rtt x BW = 750 kbytes)
- Plots: interface throughput, ACKs received, data moved
- 520 Mbit/s
- Same for B2B tests
- So it's not that simple!
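The buffer sizing is consistent: for a 1 Gbit/s path, rtt x BW = 750 kbytes implies an rtt of 750 x 10^3 bytes x 8 / 10^9 bit/s = 6 ms, and the 1 MByte socket buffer comfortably covers that bandwidth-delay product - so the 520 Mbit/s ceiling is not a window limit, which is part of why "it's not that simple".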
MB-NG: Gridftp Throughput + Web100
- Throughput (Mbit/s): see alternating 600/800 Mbit/s and zero
- Cwnd smooth
- No dup ACKs / send stalls / timeouts
MB-NG: HTTP Data Transfers, HighSpeed TCP
- Apache web server out of the box!
- Prototype client - curl http library
- 1 MByte TCP buffers
- 2 GByte file
- Throughput 72 MBytes/s
- Cwnd - some variation
- No dup ACKs / send stalls / timeouts
More Information - Some URLs
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html