The ongoing evolution from Packet-based Networks to Hybrid Networks in Research & Education Networks
Olivier Martin, CERN
NEC'2005 Conference, Varna (Bulgaria), 16 September 2005

Presentation Outline
• The demise of conventional packet-based networks in the R&E community
• The advent of community-managed dark fiber networks
• The Grid & its associated Wide Area Networking challenges
• « On-demand Lambda Grids »
• Ethernet over SONET & new standards
  – WAN-PHY, GFP, VCAT/LCAS, G.709, OTN

[Figure: system capacity (Mbit/s, log scale) versus year, 1985 to 2005. Series: optical DWDM capacity; I/O rates = optical wavelength capacity; Ethernet; Internet backbone. Labelled points range from T1, T3 and Ethernet through Fast Ethernet, GigE, 10-GE and 40-GE, OC-3c, OC-12c, OC-48c, OC-192c and OC-768c, and from 135 Mbit/s, 565 Mbit/s and 1.7 Gbit/s systems up to 2, 4, 8, 16, 32, 160 and 1024 x 10 Gbit/s DWDM systems.]

Internet Backbone Speeds
[Figure: backbone speed in Mbps over time. Labels: T1 lines, T3 lines, ATM VCs, OC-3c, OC-12c.]

High Speed IP Network Transport Trends
[Figure: evolution of the IP transport stack, with multiplexing, protection and management at every layer. Stages: B-ISDN, IP over ATM, IP over SONET/SDH, IP over Optical. Layers shown: IP, signalling, ATM, SONET/SDH, optical.]
Trend: higher speed, lower cost, complexity and overhead

Network Exponentials
• Network vs. computer performance
  – Computer speed doubles every 18 months
  – Network speed doubles every 9 months
  – Difference = order of magnitude per 5 years
• 1986 to 2000
  – Computers: x 500
  – Networks: x 340,000
• 2001 to 2010
  – Computers: x 60
  – Networks: x 4000
Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan 2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.
Source slide: Intro to Grid Computing and Globus Toolkit™, October 12, 2001
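The "order of magnitude per 5 years" figure follows directly from the two doubling times quoted on the slide. The sketch below is only an illustration of that arithmetic; the function name `growth_factor` is made up here and is not part of the original slides.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth over `months` for a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


FIVE_YEARS_MONTHS = 60

# Doubling times taken from the slide: 18 months (computers), 9 months (networks).
computers = growth_factor(FIVE_YEARS_MONTHS, 18)
networks = growth_factor(FIVE_YEARS_MONTHS, 9)

# How much faster networks grow than computers over five years.
gap = networks / computers

print(f"computers x{computers:.1f}, networks x{networks:.1f}, gap x{gap:.1f}")
```

With these doubling times the gap comes out at roughly a factor of ten over five years, which is the order of magnitude the slide refers to.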

Know the user (3 of 12)
[Figure: number of users versus BW requirements, curve F(t), with user classes A, B and C and the ADSL and GigE LAN bandwidth marks.]
A -> Lightweight users, browsing, mailing, home use
B -> Business applications, multicast, streaming, VPNs, mostly LAN
C -> Special scientific applications, computing, data grids, virtual presence

What the user (4 of 12)
[Figure: total BW versus BW requirements, with user classes A, B and C and the ADSL and GigE LAN bandwidth marks.]
A -> Need full Internet routing, one to many
B -> Need VPN services on/and full Internet routing, several to several
C -> Need very fat pipes, limited multiple Virtual Organizations, few to few

So what are the facts (5 of 12)
• Costs of fat pipes (fibers) are one third of the cost of the equipment to light them up
  – This is what Lambda salesmen told Cees de Laat (University of Amsterdam & SURFnet)
• Costs of optical equipment: 10% of switching, 10% of full routing equipment for the same throughput
  – 100 Byte packet @ 10 Gb/s -> 80 ns to look up in a 100 MByte routing table (light speed from me to you on the back row!)
• Big sciences need fat pipes
• Bottom line: create a hybrid architecture which
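The 80 ns lookup budget is simply the time a 100-byte packet occupies a 10 Gb/s link. A minimal sketch of that arithmetic follows; the helper name `serialization_time_s` is an assumption made for illustration, and the light-travel distance is computed for vacuum.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum


def serialization_time_s(packet_bytes: int, link_bps: float) -> float:
    """Time one packet occupies the link, in seconds."""
    return packet_bytes * 8 / link_bps


# Values from the slide: 100-byte packets on a 10 Gb/s link.
budget_s = serialization_time_s(100, 10e9)
budget_ns = budget_s * 1e9

# Distance light covers in that time (the "back row" remark on the slide).
light_distance_m = SPEED_OF_LIGHT_M_PER_S * budget_s

print(f"per-packet budget: {budget_ns:.0f} ns, light travels {light_distance_m:.1f} m")
```

A router forwarding back-to-back 100-byte packets at line rate therefore has about 80 ns per routing decision, the time light needs to cross roughly 24 m.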

Utilization trends
[Figure: network utilization in Gbps over time against the network capacity limit; Jan 2005 is marked.]

Today’s hierarchical IP network
[Figure labels: Other national networks; National or Pan-National IP Network; NREN A, NREN B, NREN C, NREN D; Regional; University.]

Tomorrow’s peer to peer IP network
[Figure labels: World; National DWDM Network; Child Lightpaths; NREN A, NREN B, NREN C, NREN D; Regional; University; Server.]

Creation of application VPNs
Direct connect bypasses campus firewall
[Figure labels: University; Dept; Research Network; Commodity Internet; CERN; High Energy Physics Network; Bio-informatics Network; eVLBI Network.]

Production vs Research Campus Networks
> Increasingly, campuses are deploying parallel networks for high-end users
> Reduces costs by providing high-end network capability only to those who need it
> Limitations of campus firewall and border router are eliminated
> Many issues with regard to security, back-door routing, etc.
> Campus networks may follow the same evolution as campus computing
> Discipline-specific networks are being extended into the campus

UCLP intended for projects like National LambdaRail
CAVEwave acquires a separate wavelength between Seattle and Chicago and wants to manage it as part of its network, including add/drop, routing, partition, etc.
[Figure labels: NLR condominium lambda network; original CAVEwave.]

GEANT2 PoP Design

UltraLight Optical Exchange Point
• L1, L2 and L3 services
• Interfaces
  – 1GE and 10GE
  – 10GE WAN-PHY (SONET friendly)
• Hybrid packet- and circuit-switched PoP
  – Interface between packet- & circuit-switched networks
[Figure: photonic switch. Calient or Glimmerglass Photonic Cross Connect Switch.]

LHC Data Grid Hierarchy
CERN/Outside resource ratio ~1:2; Tier 0/(Tier 1)/(Tier 2) ~1:1:1
[Figure: tiered data flow from the experiment down to workstations.
  Experiment to Online System: ~PByte/sec
  Online System to Tier 0+1: ~100-400 MBytes/sec
  Tier 0+1: CERN, 700k SI95, ~1 PB disk, tape robot
  Tier 1: IN2P3 Center, INFN Center, RAL Center, FNAL (200k SI95; 600 TB)
  Tier 2: Tier 2 Centers
  Tier 3: Institutes (~0.25 TIPS each)
  Tier 4: Workstations, physics data cache
  Link speeds shown between the lower tiers: 10 Gbps, 2.5/10 Gbps, ~2.5 Gbps, 0.1–1 Gbps]
Physicists work on analysis “channels”. Each institute has ~10 physicists working on one or more channels.

Deploying the LHC Grid
[Figure labels: The LHC Computing Centre; CERN Tier 0; CERN Tier 1; Tier 1 (UK, USA, France, Italy, Japan, Germany, Taipei?); Tier 2 (Lab a, Lab b, Lab c, Lab m, Uni a, Uni b, Uni n, Uni x, Uni y); Tier 3 physics department; Desktop; grid for a regional group; grid for a physics study group. Credit: les.robertson@cern.ch]

What you get
[Figure labels: physicist; CERN Tier 0; CERN Tier 1; Tier 1 (UK, USA, France, Italy, Japan, Germany); Tier 2 (Lab a, Lab b, Lab c, Lab m, Uni a, Uni b, Uni n, Uni x, Uni y); physics department. Credit: les.robertson@cern.ch]

Main Networking Challenges
• Fulfill the, yet unproven, assertion that the network can be « nearly » transparent to the Grid
• Deploy suitable Wide Area Network infrastructure (50-100 Gb/s)
• Deploy suitable Local Area Network infrastructure (matching or exceeding that of the WAN)
• Seamless interconnection of LAN & WAN infrastructures
  – firewall?
• End-to-end issues: transport protocols, PCs (Itanium, Xeon), 10GigE NICs (Intel, S2io). Where are we today?
  – memory to memory: 7.5 Gb/s (PCI bus limit)
  – memory to disk: 1.2 MB (Windows 2003 server/NewiSys)
  – disk to disk: 400 MB (Linux), 600 MB (Windows)

Main TCP issues
• Does not scale to some environments
  – High speed, high latency
  – Noisy
• Unfair behaviour with respect to:
  – Round Trip Time (RTT)
  – Frame size (MSS)
  – Access bandwidth
• Widespread use of multiple streams in order to compensate for inherent TCP/IP limitations (e.g. GridFTP, bbFTP):
  – A bandage rather than a cure
• New TCP/IP proposals in order to restore performance in single-stream environments
  – Not clear if/when they will have a real impact
  – In the meantime there is an absolute requirement for backbones with:
    – zero packet losses,
    – and no packet re-ordering
  – Which reinforces the case for “lambda Grids”

TCP dynamics (10 Gbps, 100 ms RTT, 1500 Byte packets)
• Window size (W) = Bandwidth * Round Trip Time
  – W_bits = 10 Gbps * 100 ms = 1 Gb
  – W_packets = 1 Gb / (8 * 1500) = 83333 packets
• Standard Additive Increase Multiplicative Decrease (AIMD) mechanisms:
  – W = W/2 (halving the congestion window on a loss event)
  – W = W + 1 (increasing the congestion window by one packet every RTT)
• Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT:
  – RTT * W_packets / 2 = 1.157 hours
  – In practice, 1 packet per 2 RTT because of delayed ACKs, i.e. 2.31 hours
• Packets per second:
  – W_packets / RTT = 833'333 packets per second
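The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the slide's own parameters (10 Gbps, 100 ms RTT, 1500-byte packets); the variable names are chosen here for illustration.

```python
LINK_BPS = 10e9        # 10 Gbps link
RTT_S = 0.100          # 100 ms round trip time
PACKET_BYTES = 1500    # packet size used on the slide

# Bandwidth*delay product: the window needed to fill the pipe.
window_bits = LINK_BPS * RTT_S
window_packets = window_bits / (8 * PACKET_BYTES)

# AIMD congestion avoidance: after a loss the window is halved, then grows
# by one packet per RTT, so W/2 round trips are needed to get back to W.
recovery_s = RTT_S * window_packets / 2
recovery_hours = recovery_s / 3600

# With delayed ACKs the window grows by one packet every two RTTs.
delayed_ack_hours = 2 * recovery_hours

# Sending rate at full window: one window of packets per RTT.
packets_per_second = window_packets / RTT_S

print(f"window: {window_packets:.0f} packets")
print(f"recovery: {recovery_hours:.3f} h ({delayed_ack_hours:.2f} h with delayed ACKs)")
print(f"rate: {packets_per_second:.0f} packets/s")
```

Note that the packet rate is the window divided by the RTT, which is what yields the 833'333 packets per second quoted on the slide.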

Single TCP stream performance under periodic losses
[Figure: bandwidth utilization versus packet loss rate, bandwidth available = 1 Gbps. At a loss rate of 0.01%: LAN BW utilization = 99%, WAN BW utilization = 1.2%.]
• TCP throughput is much more sensitive to packet loss in WANs than in LANs
  – TCP’s congestion control algorithm (AIMD) is not well suited to gigabit networks
  – The effect of packet loss can be disastrous
• TCP is inefficient in high bandwidth*delay networks
• The future performance outlook for computational grids looks bad if we continue to rely solely on the widely deployed TCP Reno

Responsiveness
• Time to recover from a single packet loss: $r = \frac{C \cdot RTT^2}{2 \cdot MSS}$, where C is the capacity of the link
  Path                 Bandwidth  RTT (ms)  MTU (Byte)  Time to recover
  LAN                  10 Gb/s    1         1500        430 ms
  Geneva-Chicago       10 Gb/s    120       1500        1 hr 32 min
  Geneva-Los Angeles   1 Gb/s     180       1500        23 min
  Geneva-Los Angeles   10 Gb/s    180       1500        3 hr 51 min
  Geneva-Los Angeles   10 Gb/s    180       9000        38 min
  Geneva-Los Angeles   10 Gb/s    180       64 k (TSO)  5 min
  Geneva-Tokyo         1 Gb/s     300       1500        1 hr 04 min
• Large MTU accelerates the growth of the window
• Time to recover from a packet loss decreases with large MTU
• Larger MTU reduces overhead per frame (saves CPU cycles, reduces the number of packets)
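The responsiveness formula r = C * RTT^2 / (2 * MSS) can be evaluated directly. The sketch below is an illustration under one explicit assumption that is not stated on the slide: the MSS is taken as the MTU minus 40 bytes of TCP/IP headers. The function name `recovery_time_s` is likewise made up for this example.

```python
HEADER_BYTES = 40  # assumed TCP/IP header overhead, so MSS = MTU - 40


def recovery_time_s(capacity_bps: float, rtt_s: float, mtu_bytes: int) -> float:
    """Time for AIMD to regain full rate after one loss: C * RTT^2 / (2 * MSS)."""
    mss_bits = (mtu_bytes - HEADER_BYTES) * 8
    return capacity_bps * rtt_s ** 2 / (2 * mss_bits)


# A few rows of the table: (path, capacity in bit/s, RTT in s, MTU in bytes).
rows = [
    ("LAN", 10e9, 0.001, 1500),
    ("Geneva-Los Angeles", 1e9, 0.180, 1500),
    ("Geneva-Los Angeles", 10e9, 0.180, 1500),
    ("Geneva-Los Angeles", 10e9, 0.180, 9000),
    ("Geneva-Tokyo", 1e9, 0.300, 1500),
]

results = {(path, cap, mtu): recovery_time_s(cap, rtt, mtu) for path, cap, rtt, mtu in rows}

for (path, cap, mtu), seconds in results.items():
    print(f"{path:20s} {cap / 1e9:4.0f} Gb/s MTU {mtu:5d}: {seconds / 60:8.2f} min")
```

Under that MSS assumption the rows evaluated here agree with the table to within rounding, and the quadratic dependence on RTT explains why the same loss costs milliseconds on a LAN but hours on a transatlantic 10 Gb/s path.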

Internet2 land speed record history (IPv4 & IPv6), period 2000-2004

Layer 1/2/3 networking (1)
• Conventional layer 3 technology is no longer fashionable because of:
  – High associated costs, e.g. 200/300 KUSD for a 10G router interface
  – Implied use of shared backbones
• The use of layer 1 or layer 2 technology is very attractive because it helps to solve a number of problems, e.g.
  – 1500 bytes Ethernet frame size (layer 1)
  – Protocol transparency (layer 1 & layer 2)
  – Minimum functionality hence, in theory, much lower costs (layer 1 & 2)

Layer 1/2/3 networking (2)
« On-demand Lambda Grids » are becoming very popular:
• Pros:
  – Circuit-oriented model like the telephone network, hence no need for complex transport protocols
  – Lower equipment costs (i.e. « in theory » a factor 2 or 3 per layer)
  – The concept of a dedicated end-to-end light path is very elegant
• Cons:
  – « End to end » still very loosely defined, i.e. site to site, cluster to cluster or really host to host
  – Higher circuit costs, scalability, additional middleware to deal with circuit set up/tear down, etc.
  – Extending dynamic VLAN functionality is a potential nightmare!

« Lambda Grids »: what does it mean?
• Clearly different things to different people, hence the apparently easy consensus!
• Conservatively, on-demand « site to site » connectivity
  – Where is the innovation?
  – What does it solve in terms of transport protocols?
  – Where are the savings?
    – Fewer interfaces needed (customer) but more standby/idle circuits needed (provider)
    – Economics from the service provider vs the customer perspective?
      – Traditionally, switched services have been very expensive
        » Usage vs flat charge
        » Break even, switched vs leased, few hours/day
        » Why would this change?
    – In case there are no savings, why bother?
• More advanced, cluster to cluster
  – Implies even more active circuits in parallel
  – Is it realistic?
• Even more advanced, host to host or even « per flow »
  – All optical
  – Is it really realistic?

Some Challenges
• Real bandwidth estimates given the chaotic nature of the requirements
• End-to-end performance given the whole chain involved
  – (disk-bus-memory-bus-network-bus-memory-bus-disk)
• Provisioning over complex network infrastructures (GEANT, NRENs, etc.)
• Cost model for options (packet + SLAs, circuit switched, etc.)
• Consistent performance (dealing with firewalls)
• Merging leading-edge research with production networking

Tentative conclusions
• There is a very clear trend towards community-managed dark fiber networks
  – As a consequence, National Research & Education Networks are evolving into Telecom Operators. Is it right?
    – In the short term, almost certainly YES
    – In the longer term, probably NO
  – In many countries, there is NO other way to have affordable access to multi-Gbit/s networks, therefore this is clearly the right move
• The Grid & its associated Wide Area Networking challenges
  – « On-demand Lambda Grids » are, in my opinion, extremely doubtful!
• Ethernet over SONET & new standards will revolutionize the Internet
  – WAN-PHY (IEEE) has, in my opinion, NO future!
  – However, GFP, VCAT/LCAS, G.709 and OTN are very likely to have a very bright future

Single TCP stream between Caltech and CERN
• Available (PCI-X) bandwidth = 8.5 Gbps
• RTT = 250 ms (16'000 km)
• 9000 Byte MTU
• 15 min to increase throughput from 3 to 6 Gbps
• Sending station:
  – Tyan S2882 motherboard, 2 x Opteron 2.4 GHz, 2 GB DDR
• Receiving station:
  – CERN OpenLab: HP rx4640, 4 x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
• Network adapter:
  – S2IO 10 GbE
[Figure: throughput trace annotated with "CPU load = 100%", "Single packet loss" and "Burst of packet losses".]

High Throughput Disk to Disk Transfers: From 0.1 to 1 GByte/sec
• Server hardware (rather than network) bottlenecks:
  – Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, IO chipset
  – PCI-X bus bandwidth: 8.5 Gbps [133 MHz x 64 bit]
• Link aggregation (802.3ad): logical interface with two physical interfaces on two independent PCI-X buses
• LAN test: 11.1 Gbps (memory to memory)
Performance in this range (from 100 MByte/sec up to 1 GByte/sec) is required to build a responsive Grid-based Processing and Analysis System for LHC
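The 8.5 Gbps PCI-X figure is the product of the bus clock and the bus width given in brackets. The sketch below is a rough illustration only: it ignores bus protocol overhead, and the 7.5 Gb/s single-NIC memory-to-memory rate it compares against is the one quoted elsewhere in this deck.

```python
PCIX_CLOCK_HZ = 133e6    # 133 MHz bus clock, from the slide
PCIX_WIDTH_BITS = 64     # 64-bit bus, from the slide

# Theoretical peak: one bus-width transfer per clock cycle.
peak_bps = PCIX_CLOCK_HZ * PCIX_WIDTH_BITS
peak_gbps = peak_bps / 1e9

# Single-NIC memory-to-memory rate quoted elsewhere in the deck.
measured_gbps = 7.5
bus_utilization = measured_gbps / peak_gbps

print(f"PCI-X peak: {peak_gbps:.2f} Gbps, single NIC reaches {bus_utilization:.0%} of it")
```

Because a single 10GE NIC already consumes most of one PCI-X bus, the aggregated 11.1 Gbps result needs two NICs on two independent buses.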

Transferring a TB from Caltech to CERN in 64-bit MS Windows
• Latest disk to disk over 10 Gbps WAN: 4.3 Gbits/sec (536 MB/sec), 8 TCP streams from CERN to Caltech; 1 TB file
• 3 Supermicro Marvell SATA disk controllers + 24 7200 rpm SATA disks
  – Local disk IO: 9.6 Gbits/sec (1.2 GBytes/sec read/write, with <20% CPU utilization)
• S2io SR 10GE NIC
  – 10GE NIC: 7.5 Gbits/sec (memory-to-memory, with 52% CPU utilization)
  – 2*10GE NIC (802.3ad link aggregation): 11.1 Gbits/sec (memory-to-memory)
• Memory-to-memory WAN data flow and local memory-to-disk read/write flow are not matched when combining the two operations
• Quad Opteron AMD 848 2.2 GHz processors with 3 AMD-8131 chipsets: 4 64-bit/133 MHz PCI-X slots
• Interrupt Affinity Filter: allows a user to change the CPU affinity of the interrupts in a system
• Overcome packet loss with re-connect logic
• Proposed Internet2 Terabyte File Transfer Benchmark
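To put the 536 MB/sec result in perspective, the sketch below estimates how long a 1 TB file takes at that rate and at the 1 GByte/sec target. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not specify; the function name is illustrative.

```python
TERABYTE_BYTES = 1e12  # assumption: decimal terabyte


def transfer_minutes(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Transfer time in minutes at a constant sustained rate."""
    return size_bytes / rate_bytes_per_s / 60


achieved_minutes = transfer_minutes(TERABYTE_BYTES, 536e6)  # 536 MB/sec, from the slide
target_minutes = transfer_minutes(TERABYTE_BYTES, 1e9)      # 1 GByte/sec target

print(f"1 TB at 536 MB/s: {achieved_minutes:.1f} min; at 1 GB/s: {target_minutes:.1f} min")
```

Under these assumptions the measured rate moves a terabyte in about half an hour, while the 1 GByte/sec goal would roughly halve that.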

UltraLight: Developing Advanced Network Services for Data Intensive HEP Applications
• UltraLight: a next-generation hybrid packet- and circuit-switched network infrastructure
  – Packet-switched: cost-effective solution; requires ultrascale protocols to share 10G efficiently and fairly
  – Circuit-switched: scheduled or sudden “overflow” demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, Atlantic, Canada, …
• Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component
  – Using MonALISA to monitor and manage global systems

UltraLight MPLS Network
• Compute a path from one given node to another such that the path does not violate any constraints (bandwidth/administrative requirements)
• Ability to set the path the traffic will take through the network (with simple configuration, management, and provisioning mechanisms)
  – Take advantage of the multiplicity of waves/L2 channels across the US (NLR, HOPI, Ultranet and Abilene/ESnet MPLS services)
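Constraint-based path computation of the kind described in the first bullet is commonly done by pruning links that violate the constraint and then running a shortest-path search on what remains. The sketch below illustrates that general idea only; it is not the UltraLight implementation, and the topology, link costs and bandwidths are hypothetical.

```python
import heapq


def constrained_shortest_path(links, src, dst, min_bandwidth_gbps):
    """Return (cost, path) for the cheapest path whose every link offers at
    least min_bandwidth_gbps, or None if no such path exists."""
    # Step 1: prune links that violate the bandwidth constraint.
    graph = {}
    for a, b, cost, bandwidth in links:
        if bandwidth >= min_bandwidth_gbps:
            graph.setdefault(a, []).append((b, cost))
            graph.setdefault(b, []).append((a, cost))

    # Step 2: Dijkstra's algorithm on the pruned graph.
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None


# Hypothetical topology: (node, node, administrative cost, available Gbps).
LINKS = [
    ("CERN", "Chicago", 1, 2.5),
    ("CERN", "New York", 2, 10),
    ("New York", "Chicago", 1, 10),
    ("Chicago", "Caltech", 2, 10),
]

best_effort = constrained_shortest_path(LINKS, "CERN", "Caltech", 1)
ten_gig = constrained_shortest_path(LINKS, "CERN", "Caltech", 10)
unreachable = constrained_shortest_path(LINKS, "CERN", "Caltech", 40)

print(best_effort)
print(ten_gig)
print(unreachable)
```

In this toy topology a 1 Gbps request takes the direct CERN-Chicago link, a 10 Gbps request is forced onto the longer route via New York, and a 40 Gbps request cannot be satisfied at all. Administrative constraints could be handled the same way, by filtering links before the search.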

Summary
• For many years the Wide Area Network has been the bottleneck; this is no longer the case in many countries, thus making deployment of a data-intensive Grid infrastructure possible!
  – Recent I2LSR records show for the first time ever that the network can be truly transparent and that throughputs are limited by the end hosts
  – The challenge has shifted from getting adequate bandwidth to deploying adequate infrastructure to make effective use of it!
• Some transport protocol issues still need to be resolved; however, there are many encouraging signs that practical solutions may now be in sight
• 1 GByte/sec disk to disk challenge. Today: 1 TB at 536 MB/sec from CERN to Caltech
  – Still in early stages; expect substantial improvements
• Next generation network and Grid system: UltraLight
  – Deliver the critical missing component for future eScience: the integrated, managed network
  – Extend and augment existing grid computing infrastructures

10G DataTAG testbed extension to Telecom World 2003 and Abilene/Cenic
On September 15, 2003, the DataTAG project was the first transatlantic testbed offering direct 10GigE access using Juniper’s VPN layer 2/10GigE emulation.