Realization and Utilization of high-BW TCP on real application
Kei Hiraki
Data Reservoir / GRAPE-DR project
The University of Tokyo

Computing System for Real Scientists
• Fast CPU, huge memory and disks, good graphics
  – Cluster technology, DSM technology, graphics processors
  – Grid technology
• Very fast remote file access
  – Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  – No complex middleware, and not even small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists

Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
  – Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer on a Long Fat pipe Network
  – > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of available bandwidth
  – Transferred file data rate > 90% of available bandwidth
    • Including header overheads and initial negotiation overheads
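
The targets above imply a very large amount of data in flight. A minimal sketch of the bandwidth-delay product for the stated 10 Gbps and 400 ms RTT follows; the function names and the 1460-byte MSS are assumptions made for illustration, not values from the slides.

```python
import math


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8


def window_segments(bandwidth_bps: float, rtt_s: float, mss_bytes: int = 1460) -> int:
    """Window, in MSS-sized segments, needed to keep the pipe full.

    mss_bytes=1460 is an assumption (1500-byte MTU minus 40 bytes of
    IPv4 and TCP headers without options).
    """
    return math.ceil(bdp_bytes(bandwidth_bps, rtt_s) / mss_bytes)


if __name__ == "__main__":
    # Slide targets: 10 Gbps and 400 ms RTT.
    print(f"BDP: {bdp_bytes(10e9, 0.4) / 1e6:.0f} MB")
    print(f"window: {window_segments(10e9, 0.4)} segments")
```

A single stream therefore needs a window of roughly 500 MB, far beyond default socket buffer sizes, which is one reason single-stream transfer on such a path is hard.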

Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: very high-speed attached processor to a server
  – 2004 – 2008
  – Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system
  – 1 GFLOPS / processor
  – 1024 processors / chip
  – 8 chips / PCI card
  – 2 PCI cards / server
  – 2 M processors / system
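
The per-level figures on this slide multiply out to the stated system totals. The sketch below only checks that arithmetic; the constant and function names are illustrative, and the values are the ones listed on this slide.

```python
# Figures as listed on the slide.
GFLOPS_PER_PROCESSOR = 1
PROCESSORS_PER_CHIP = 1024
CHIPS_PER_PCI_CARD = 8
PCI_CARDS_PER_SERVER = 2
SERVERS = 128


def total_processors() -> int:
    """Number of processors in the whole cluster."""
    return PROCESSORS_PER_CHIP * CHIPS_PER_PCI_CARD * PCI_CARDS_PER_SERVER * SERVERS


def peak_pflops() -> float:
    """Peak performance in PFLOPS (1 PFLOPS = 1e6 GFLOPS)."""
    return total_processors() * GFLOPS_PER_PROCESSOR / 1e6


if __name__ == "__main__":
    print(total_processors())  # about 2 M processors
    print(peak_pflops())       # about 2 PFLOPS
```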

Data-intensive scientific computation through global networks
[Figure: data sources (X-ray astronomy satellite ASUKA, Nobeyama Radio Observatory (VLBI), nuclear experiments, Belle experiments, Digital Sky Survey, SUBARU Telescope, GRAPE-6) linked by Data Reservoirs over a very high-speed network; distributed shared files with local accesses; data analysis at the University of Tokyo]

Basic Architecture
[Figure: two Data Reservoirs, each with cache disks and local file accesses, connected by a high-latency, very high-bandwidth network; disk-block-level parallel and multi-stream transfer; distributed shared data (DSM-like architecture)]

File accesses on Data Reservoir
[Figure: scientific detectors and user programs access file servers (1st-level striping); file servers access disk servers through an IP switch by iSCSI (2nd-level striping); disk servers: IBM x345 (2.6 GHz x 2)]
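
The slide names two striping levels but does not give the mapping. As a purely hypothetical illustration, the sketch below assumes plain round-robin striping at both levels; the function name, the parameters, and the round-robin policy are all assumptions, not the Data Reservoir implementation.

```python
def locate_block(block: int, n_file_servers: int, n_disk_servers: int) -> tuple[int, int, int]:
    """Map a logical block to (file server, disk server, offset on that disk).

    Assumed policy (not from the slides): round-robin over file servers
    first (1st-level striping), then round-robin over the disk servers
    behind each file server (2nd-level striping).
    """
    if block < 0 or n_file_servers <= 0 or n_disk_servers <= 0:
        raise ValueError("block must be >= 0 and server counts must be > 0")
    file_server = block % n_file_servers
    local_block = block // n_file_servers
    disk_server = local_block % n_disk_servers
    offset = local_block // n_disk_servers
    return file_server, disk_server, offset


if __name__ == "__main__":
    for b in range(8):
        print(b, locate_block(b, n_file_servers=4, n_disk_servers=4))
```

With this kind of mapping, consecutive blocks land on different servers, so a large sequential transfer spreads over all disks in parallel.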

Global Data Transfer
[Figure: scientific detectors, user programs, file servers, IP switch, disk servers, and global network; iSCSI bulk transfer]

Problems found in 1st generation Data Reservoir
• Low TCP bandwidth due to packet losses
  – TCP congestion window size control
  – Very slow recovery from fast recovery phase (> 20 min)
• Imbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – Users and other network users are interested only in the total behavior of the parallel TCP streams

Fast Ethernet vs. GbE
• Iperf in 30 seconds
• Min/Avg: Fast Ethernet > GbE
[Figure panels: FE, GbE]

Packet Transmission Rate
• Bursty behavior
  – Transmission in 20 ms against RTT 200 ms
  – Idle in the remaining 180 ms
[Figure annotation: packet loss occurred]

Packet Spacing
• Ideal story
  – Transmit a packet every RTT/cwnd
  – 24 μs interval for 500 Mbps (MTU 1500 B)
  – High load if implemented in software only
  – Low overhead because of limited use at slow start phase
[Figure annotation: RTT/cwnd]
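
The 24 μs figure follows directly from the packet size and the target rate. A minimal sketch of that arithmetic, with illustrative function names:

```python
def packet_interval_s(rate_bps: float, packet_bytes: int = 1500) -> float:
    """Time between packet starts when sending packet_bytes-sized packets at rate_bps."""
    return packet_bytes * 8 / rate_bps


def pacing_interval_s(rtt_s: float, cwnd_packets: float) -> float:
    """Ideal spacing: spread one window of packets evenly over one RTT."""
    return rtt_s / cwnd_packets


if __name__ == "__main__":
    # 1500-byte packets at 500 Mbps, as on the slide.
    print(f"{packet_interval_s(500e6) * 1e6:.1f} us")
```

Both forms give the same interval when cwnd equals the number of packets the path carries per RTT at the target rate.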

Example Case of 8 IPG
• Success on Fast Retransmit
  – Smooth transition to Congestion Avoidance
  – CA takes 28 minutes to recover to 550 Mbps
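
A recovery time of tens of minutes is what standard additive increase predicts on a long path. The sketch below is a rough order-of-magnitude model, not a reproduction of the measurement: it assumes the window was halved on loss, that it grows by a fixed number of segments per RTT, a 200 ms RTT (the value on the Packet Transmission Rate slide; the RTT of this run is not stated), and a 1460-byte MSS.

```python
def ca_recovery_time_s(
    target_bps: float,
    rtt_s: float,
    mss_bytes: int = 1460,
    growth_segments_per_rtt: float = 1.0,
) -> float:
    """Time for additive increase to climb from half the target window back to it.

    Assumptions (not from the slides): the window was halved on loss and
    grows by growth_segments_per_rtt each RTT. With delayed ACKs the growth
    can be closer to 0.5 segments per RTT.
    """
    target_window = target_bps * rtt_s / (8 * mss_bytes)  # segments
    rtts_needed = (target_window / 2) / growth_segments_per_rtt
    return rtts_needed * rtt_s


if __name__ == "__main__":
    for growth in (1.0, 0.5):
        minutes = ca_recovery_time_s(550e6, 0.2, growth_segments_per_rtt=growth) / 60
        print(f"growth {growth} seg/RTT: {minutes:.1f} min")
```

Under these assumptions the model gives roughly 16 minutes at one segment per RTT and about twice that at half a segment per RTT, the same order as the 28 minutes reported.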

Best Case of 1023 B IPG
• Like the Fast Ethernet case
  – Proper transmission rate
• Spurious retransmits due to reordering

Imbalance within parallel TCP streams
• Imbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – Meaningless unfairness among parallel streams
  – Users and other network users are interested only in the total behavior of the parallel TCP streams
• Our approach
  – Keep Σ cwnd_i constant for fair TCP network usage toward other users
  – Balance each cwnd_i by communicating between parallel TCP streams
[Figure: BW vs. time plots]
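
The approach above can be sketched as a redistribution step that leaves the sum of the windows unchanged. This is a simplified illustration: the function name and the policy of equalizing every window in a single step are assumptions, and the slides do not describe how the streams actually exchange their window values.

```python
def balance_cwnds(cwnds: list[int]) -> list[int]:
    """Equalize per-stream windows (in segments) while keeping their sum constant.

    Any remainder is given, one segment each, to the first streams so that
    sum(result) == sum(cwnds) exactly.
    """
    if not cwnds:
        raise ValueError("at least one stream is required")
    base, extra = divmod(sum(cwnds), len(cwnds))
    return [base + (1 if i < extra else 0) for i in range(len(cwnds))]


if __name__ == "__main__":
    before = [100, 20, 60, 21]
    after = balance_cwnds(before)
    print(before, "->", after, "sum unchanged:", sum(before) == sum(after))
```

Because the total is preserved, the aggregate load seen by other users of the network stays the same; only the split among the parallel streams changes.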

3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger

Utilization of 10 Gbps network
• A single-box 10 Gbps Data Reservoir server
• Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
• Two Chelsio T110 TCP off-loading NICs
• Disk arrays for necessary disk bandwidth
• Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Figure: Quad Opteron server (SUN V40z, Linux 2.6.6) running Data Reservoir software; two Chelsio T110 TCP NICs on PCI-X buses, 10GBASE-SR links to a 10G Ethernet switch; Ultra320 SCSI adaptors]

Tokyo-CERN experiment (Oct. 2004)
• CERN-Amsterdam-Chicago-Seattle-Tokyo
  – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
  – 18,500 km WAN PHY connection
• Performance result
  – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  – 7.53 Gbps (TCP payload), 8K jumbo frame, iperf
  – 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams

Network used in the experiment
[Figure: Tokyo-CERN network connection: Tokyo, Seattle, Vancouver, Calgary, Minneapolis, Chicago, Amsterdam, and Geneva over IEEAF, CANARIE / CA*net 4, and SURFnet. Legend: end systems; L1 or L2 switch]

Network topology of CERN-Tokyo experiment
[Figure: network topology.
 Sites: Data Reservoir at Univ. of Tokyo, T-LEX (Tokyo), Pacific Northwest Gigapop (Seattle), Vancouver, Minneapolis, StarLight (Chicago), NetherLight (Amsterdam), Data Reservoir at CERN (Geneva).
 Networks: WIDE / IEEAF, CA*net 4, SURFnet.
 Switches: Fujitsu XG800 (12 port), Extreme Summit 400, Foundry BI MG8, Foundry NetIron 40G, Foundry FEXx448.
 End systems at each Data Reservoir: IBM x345 servers on GbE (dual Intel Xeon 2.4 GHz, 2 GB memory; Linux 2.6.6 (No. 2-7), Linux 2.4.X (No. 1)) and an Opteron server (dual Opteron 248 2.2 GHz, 1 GB memory, Chelsio T110 NIC, 10GBASE-LW, Linux 2.6.6 (No. 2-6))]

LSR experiments
• Target
  – > 30,000 km LSR distance
  – L3 switching at Chicago and Amsterdam
  – Period of the experiment
    • 12/20 – 1/3
    • Holiday season, when public research networks are vacant
• System configuration
  – A pair of Opteron servers with Chelsio T110 (at N-otemachi)
  – Another pair of Opteron servers with Chelsio T110 for competing traffic generation
  – ClearSight 10 Gbps packet analyzer for packet capturing

Network used in the experiment
[Figure 2. Network connection: Tokyo, Seattle, Vancouver, Calgary, Minneapolis, Chicago, NYC, and Amsterdam over IEEAF/Tyco/WIDE, APAN/JGN2, CANARIE / CA*net 4, Abilene, and SURFnet. Legend: router or L3 switch; L1 or L2 switch]

Single stream TCP: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo
[Figure: path diagram.
 End systems (Tokyo): WIDE Opteron servers (Opteron 1, Opteron 3) with Chelsio T110 NICs, Fujitsu XG800, ClearSight 10 Gbps capture.
 Sites: Univ of Tokyo, T-LEX (Tokyo), Pacific Northwest Gigapop (Seattle), Vancouver, Calgary, Minneapolis, StarLight (Chicago), MANLAN (New York), NetherLight and University of Amsterdam (Amsterdam).
 Equipment along the path: ONS 15454, OME 6550, HDXc, Foundry NetIron 40G, Procket 8812, Procket 8801, Force10 E600, Force10 E1200, T640, CISCO 12416, CISCO 6509.
 Links: WAN PHY and OC-192, crossing the Pacific and Atlantic oceans.
 Networks: IEEAF/Tyco/WIDE, CANARIE, SURFnet, Abilene, TransPAC, APAN/JGN2.
 Legend: router or L3 switch; L1 or L2 switch]

Network Traffic on routers and switches
[Figure: traffic graphs for the submitted run: StarLight Force10 E1200; University of Amsterdam Force10 E600; Abilene T640 (NYCM to CHIN); TransPAC Procket 8801]

Summary
• Single-stream TCP
  – We removed TCP-related difficulties
  – Now I/O bus bandwidth is the bottleneck
  – Cheap and simple servers can enjoy a 10 Gbps network
• Lack of methodology in high-performance network debugging
  – 3 days of debugging (working overnight)
  – 1 day of stable period (usable for measurements)
  – The network may feel fatigue; some trouble must happen
  – We need something effective
• Detailed issues
  – Flow control (and QoS)
  – Buffer size and policy
  – Optical level setting

Systems used in long-distance TCP experiments
[Photos: CERN, Pittsburgh, Tokyo]

Efficient and effective utilization of high-speed internet
• Efficient and effective utilization of a 10 Gbps network is still very difficult
• PHY, MAC, data link, and switches
  – 10 Gbps is ready to use
• Network interface adaptor
  – 8 Gbps is ready to use, 10 Gbps in several months
  – Proper offloading, RDMA implementation
• I/O bus of a server
  – 20 Gbps is necessary to drive a 10 Gbps network
• Drivers, operating system
  – Too many interrupts, buffer memory management
• File system
  – Slow NFS service
  – Consistency problem

Difficulty in 10 Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
  – High CPU utilization (performance limited by CPU)
    • Too many context switches
    • Too many interrupts from the network adaptor (> 30,000/s)
    • Data copy from buffers to buffers
  – I/O bus bottleneck
    • PCI-X/133: maximum 7.6 Gbps data transfer
    • Waiting for PCI-X/266 or PCI-express x8 or x16 NIC
  – Disk performance
    • Performance limit of RAID adaptor
    • Number of disks for data transfer (> 40 disks are required)
• File system
  – High BW in file service is more difficult than data sharing
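
The bus and disk limits above can be put in perspective with a little arithmetic. The sketch below assumes a 64-bit PCI-X bus clocked at 133 MHz (the bus width is not stated on the slide) and reads the 7.6 Gbps figure as the achievable rate on that bus; the function names are illustrative.

```python
def pcix_raw_gbps(clock_mhz: float = 133.0, width_bits: int = 64) -> float:
    """Raw PCI-X bandwidth in Gbps; the 64-bit width is an assumption."""
    return clock_mhz * 1e6 * width_bits / 1e9


def per_disk_mbytes_per_s(total_gbps: float, n_disks: int) -> float:
    """Sustained rate each disk must deliver to reach total_gbps in aggregate."""
    return total_gbps * 1e9 / 8 / n_disks / 1e6


if __name__ == "__main__":
    raw = pcix_raw_gbps()
    print(f"PCI-X/133 raw: {raw:.3f} Gbps, 7.6 Gbps is {7.6 / raw:.0%} of raw")
    print(f"per-disk rate for 10 Gbps over 40 disks: {per_disk_mbytes_per_s(10, 40):.2f} MB/s")
```

Under these assumptions a single PCI-X/133 bus cannot carry 10 Gbps even in theory, and 40 disks must each sustain about 31 MB/s to fill the link.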

High-speed IP network in supercomputing (GRAPE-DR project)
• World's fastest computing system
  – 2 PFLOPS in 2008 (performance on actual application programs)
• Construction of general-purpose massively parallel architecture
  – Low power consumption in PFLOPS-range performance
  – MPP architecture more general-purpose than vector architecture
• Use of commodity network for interconnection
  – 10 Gbps optical network (2008) + MEMS switches
  – 100 Gbps optical network (2010)

Target performance
[Figure: FLOPS vs. year (1970 to 2050), with trend lines for parallel processors and processor chips; marked systems: Earth Simulator (40 TFLOPS), GRAPE-DR (2 PFLOPS), KEISOKU supercomputer (10 PFLOPS)]

GRAPE-DR architecture
• Massively parallel processor
• Pipelined connection of a large number of PEs
• SIMASD (Single Instruction on Multiple and Shared Data)
  – All instructions operate on data in local memory and shared memory
  – Extension of vector architecture
• Issues
  – Compiler for SIMASD architecture (currently under development: flat-C)
[Figure: 512 PEs, each with local memory, integer ALU, and floating-point ALU; on-chip network; on-chip shared memory; connection to the outside world and shared memory]

Hierarchical construction of GRAPE-DR
• Chip: 512 PE, 512 GFLOPS
• PCI board: 2 K PE, 2 TFLOPS
• Server: 8 K PE, 8 TFLOPS
• Node: 1 M PE, 1 PFLOPS
• System: 2 M PE, 2 PFLOPS
[Figure: hierarchy diagram; label: memory]

Network architecture inside a GRAPE-DR system
[Figure: AMD-based server (memory, memory bus, 100 Gbps optical interface, KOE); IP storage system (iSCSI server); MEMS-based optical switch; highly functional router to the outside IP network; adaptive compiler; total system conductor for dynamic optimization]

Fujitsu Computer Technologies, LTD