ESnet End-to-end Internet Monitoring
Les Cottrell and Warren Matthews, SLAC, and David Martin, HEPNRC
Presented at the ESSC Review Meeting, Berkeley, May 1998
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM)
3/4/98 z: cottrellesccmay 98esscmay 98. ppt 1

Outline of Talk
• Why are we (ESnet/HENP community) measuring?
• What are we measuring & how?
• What do we see?
• What does it mean?
• Summary
  – Deployment/development, Internet Performance, Next Steps
  – Collaborations

Why go to the effort?
• Internet woefully under-measured & under-instrumented
• Internet very diverse: no single path is typical
• Users need end-to-end measurements for:
  – realistic expectations, planning information
  – guidelines for setting and validating SLAs
  – information to help in identifying problems
  – help to decide where to apply resources
• Complements ESnet utilization measurements
• Provides information for reporting problems to NOC

Our Main Tool (PingER) is Ping Based
• “Universally available”, easy to understand
  – no software for clients to install
• Low network impact
• Provides useful real-world measures of response time, loss, reachability, unpredictability
• Now monitoring from 14 sites in 8 countries, covering > 500 links in 22 countries (> 300 sites)
• Resources: 6 bps/link, ~600 kBytes/month/link

Measurement Architecture
[Diagram. Labels: Monitoring, Remote, Ping, Cache, Archive, Analysis, Reports & Data, HTTP, WWW, SLAC, HEPNRC]

Ping Loss Quality
• Want a quick-to-grasp indicator of link quality
• Loss is the most sensitive indicator
  – Studies by IBM on the economic value of response time showed there is a threshold around 4-5 secs where complaints increase
  – loss of a packet requires a ~4 sec TCP retry timeout
  – For packet loss we use the following thresholds:
    • 0-1% = Good
    • 1-2.5% = Acceptable
    • 2.5-5% = Poor
    • 5-12% = Very Poor
    • > 12% = Bad (unusable for interactive work)
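The loss thresholds above map directly onto a small lookup. The following is a minimal sketch, not the PingER code: the function and constant names are hypothetical, and the handling of values that fall exactly on a boundary (assigned here to the worse category) is an assumption, since the slide does not say which side a boundary belongs to.

```python
from bisect import bisect_right

# Upper bounds (in percent) of the first four quality bands from the slide.
LOSS_BOUNDS = [1.0, 2.5, 5.0, 12.0]
LOSS_LABELS = ["good", "acceptable", "poor", "very poor", "bad"]


def loss_quality(loss_percent: float) -> str:
    """Return the quality label for a packet-loss percentage.

    Assumption: a value exactly on a boundary (e.g. 5.0) is placed in the
    worse of the two neighbouring bands.
    """
    if not 0.0 <= loss_percent <= 100.0:
        raise ValueError("loss must be a percentage between 0 and 100")
    # bisect_right counts how many bounds are <= the value, which is the
    # index of the band the value falls into.
    return LOSS_LABELS[bisect_right(LOSS_BOUNDS, loss_percent)]


if __name__ == "__main__":
    for sample in (0.1, 1.5, 3.2, 5.0, 40.0):
        print(f"{sample:5.1f}% -> {loss_quality(sample)}")
```

Keeping the bounds and labels in two parallel lists means a change to the thresholds touches only the data, not the logic.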

Quality Distributions from SLAC
• ESnet median good quality
• Other groups poor or very poor

Aggregation/Grouping
• Critical for 14 monitoring sites & > 500 links
• Group measurements by:
  – area (e.g. N. America W, N. America E, W. Europe, Japan, Asia, others, or by country, or TLD)
  – trans-oceanic links, intercontinental links, crossing IXP
  – ISP (ESnet, vBNS/I2, TEN-34 ...)
  – by monitoring site
  – one site seen from multiple sites
  – common interest/affiliation (XIWT, HENP, Expmt …)
• Beware: reduces statistics, choice of sites critical
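One of the groupings listed above, by TLD of the remote site, can be sketched as follows. This is an illustration only: the loss values are invented, the helper names are hypothetical, and the median is used as the group summary on the assumption that it is the statistic of interest. The link count is returned alongside each median because grouping reduces statistics.

```python
from collections import defaultdict
from statistics import median

# Invented sample data: (monitoring site, remote site, monthly packet loss in %).
links = [
    ("slac.stanford.edu", "cern.ch", 2.0),
    ("fnal.gov", "cern.ch", 4.0),
    ("slac.stanford.edu", "desy.de", 1.0),
    ("slac.stanford.edu", "kek.jp", 0.5),
    ("fnal.gov", "kek.jp", 1.5),
    ("triumf.ca", "kek.jp", 6.0),
]


def tld(host: str) -> str:
    """Return the top-level domain of a host name, e.g. 'cern.ch' -> 'ch'."""
    return host.rsplit(".", 1)[-1]


def median_loss_by_tld(samples):
    """Group links by the remote site's TLD; return {tld: (median loss, link count)}."""
    groups = defaultdict(list)
    for _monitor, remote, loss in samples:
        groups[tld(remote)].append(loss)
    return {name: (median(values), len(values)) for name, values in groups.items()}


if __name__ == "__main__":
    for name, (med, count) in sorted(median_loss_by_tld(links).items()):
        print(f"{name}: median loss {med}% over {count} link(s)")
```

A group with only one or two links, such as `de` in this sample, says little about the region as a whole, which is the caution the slide raises.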

Tabular Navigation Tool
• Select grouping, e.g. Intercontinental, TLDs, Site to site ...
• Select metric: Response, Loss, Quiescence, Reachability ...
• Select month (goes back to Jul-97)
• Table labels: Monitoring site, Remote site
• Colored by quality:
  – < 62.5 ms excellent (white)
  – < 125 ms good (green)
  – < 250 ms poor (yellow)
  – < 500 ms very poor (pink)
  – > 500 ms bad (red)
• Drill down:
  – Site, to show all sites monitoring it
  – Value, to see all links contributing
• MouseOver:
  – to see number of links
  – to see country
  – to see monitoring site

Drill down (all sites monitoring CERN)
• Also provides Excel for DIY
• Select one of these groups
• Sort
• Sites shown: CMU, CNAF, RL, FNAL, SLAC, DESY, Carleton, RMKI, CERN, KEK

Overall Improvements Jan-95 to Nov-97
• For about 80 remote sites seen from SLAC
• Response time improved between 1 and 2.5% / month
• Loss: similar (closer to 2.5%/month)
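To get a feel for what a monthly improvement rate means over the whole Jan-95 to Nov-97 period (34 months), the rates can be compounded. This is a sketch under one assumption that the slides do not state: that "x% / month" means the metric shrinks by x% of its current value each month. The function names are hypothetical.

```python
def remaining_fraction(monthly_improvement: float, months: int) -> float:
    """Fraction of the original value left after compounding a monthly reduction.

    monthly_improvement is a fraction, e.g. 0.025 for 2.5% per month.
    """
    return (1.0 - monthly_improvement) ** months


def monthly_rate_for_factor(factor: float, months: int) -> float:
    """Monthly reduction (as a fraction) that gives an improvement of `factor` in `months`."""
    return 1.0 - (1.0 / factor) ** (1.0 / months)


if __name__ == "__main__":
    months = 34  # Jan-95 to Nov-97
    for rate in (0.01, 0.025):
        left = remaining_fraction(rate, months)
        print(f"{rate:.1%}/month over {months} months leaves {left:.0%} of the original value")
    # The deck also quotes "factor 2 in 6 months" for some groups of links.
    print(f"factor 2 in 6 months is about {monthly_rate_for_factor(2.0, 6):.1%}/month")
```

Under this reading, 1% per month leaves roughly 71% of the starting value after 34 months and 2.5% per month leaves roughly 42%, while a factor of 2 in 6 months corresponds to about 11% per month, a much faster rate than the long-term trend.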

How does it look for ESnet Researchers getting to US sites (280 links, 28 States)?
• Within ESnet excellent (median loss 0.1%)
• To vBNS sites very good (~ 2 * loss for ESnet)
• DOE funded Universities not on vBNS/ESnet
  – acceptable to poor, getting better (factor 2 in 6 months)
  – lot of variability (e.g.)
    • Brown (T), UMass (T) = unacceptable (>= 12%)
    • Pitt*, SC*, ColoState*, UNM (T), UOregon (T), Rochester*, UC*, OleMiss*, Harvard (1q98), UWashington (T) = v. poor (> 5%)
    • Syracuse (T), Purdue (T), Hawaii* = poor (>= 2.5%)
  – *=no vBNS plans, T= vBNS date TBD, V=on vBNS

University access changes in last year
• A year ago we looked at Universities with large DOE programs
• Identified ones with poor (>2.5%) or worse (>5%) performance
  – UOregon (T), Harvard (1q98), UWashington (T) = very poor (>= 5%)
  – JHU (V), Duke (V), UCSD (V), UMich (T), UColo (V), UPenn (T), UMN (V), UCI (T), UWisc (V) = acceptable (>1%)/good
  – *=no vBNS plans, T= vBNS date TBD, V=on vBNS

Canada
• 20 links, 9 remote sites, 7 monitoring sites
• Seems to depend most on the remote site
  – UToronto bad to everyone
  – Carleton, Laurentian, McGill poor
  – Montreal, UVic acceptable/good
  – TRIUMF good with ESnet, poor to CERN

Europe
• Divides up into 2
  – TEN-34 backbone sites (de, uk, nl, ch, fr, it, at)
    • within Europe good performance
    • from ESnet good to acceptable, except nl, fr (Renater) & uk are bad
  – Others
    • within Europe performance poor
    • from ESnet bad to es, il, hu, pl; acceptable for cz

Asia
• Israel bad
• KEK & Osaka good from US, very poor from Canada
• Tokyo poor from US
• Japan-CERN/Italy acceptable, Japan-DESY bad
• FSU bad to Moscow, acceptable to Novosibirsk
• China is bad everywhere

Intercontinental Grouping (Loss)
• Looks pretty bad for intercontinental use
• Improving (about factor of 2 in last 6 months)

Summary 1/5
• Deployment/Development
  – ESnet/HENP/ICFA has 14 collection sites in 8 countries collecting data on > 500 links involving 22 countries
  – HEPNRC archiving/analyzing, SLAC analyzing
  – 600 KB/month/link, 6 bps/link, 0.25 FTE @ archive site, 1.5-2.5 FTE on analysis
  – reports available worldwide to end-users to access, navigate, review & customize (via Excel) & see quality
  – 4 GBytes of data available to experts for analysis
  – tools available for others to monitor, archive, analyze
• XIWT/IPWT chose & deployed PingER
  – ~ 10 collection sites are now monitoring 41 beacon sites

Summary 2/5
• Deployment/Development
• Next Steps
  – Improve tools:
    • Improve statistical robustness: Poisson sampling, medians
    • More groupings, beacon sites, matched pairs, for comparison
    • More navigation features to drill down
    • Better/easier identification of common bottlenecks
    • Prediction (extrapolations, develop models, configure and validate with data)
  – Pursuing deployment of dedicated PC based monitor platforms: IETF Surveyor & NIMI/LBNL
    • NIMIs up & running at PSC, LBNL, FNAL, SLAC, CERN (CH), working with RAL (UK), KEK (JP), DESY (DE)
    • Will provide throughput, traceroute & one way ping measurements
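Poisson sampling, named above as a way to improve statistical robustness, means spacing the probes with independent, exponentially distributed gaps rather than at fixed intervals, so that measurements do not lock onto periodic network behaviour. The following is a minimal sketch of such a schedule. The function name, the 30-minute mean interval and the one-day duration are illustrative assumptions, not values taken from the slides.

```python
import random


def poisson_schedule(mean_interval_s: float, duration_s: float, seed=None):
    """Return probe start times (seconds from t=0) forming a Poisson process.

    Gaps between consecutive probes are drawn from an exponential
    distribution whose mean is mean_interval_s.
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # random.expovariate takes the rate, i.e. 1 / mean gap.
        t += rng.expovariate(1.0 / mean_interval_s)
        if t >= duration_s:
            return times
        times.append(t)


if __name__ == "__main__":
    # Hypothetical settings: on average one probe every 30 minutes, for one day.
    schedule = poisson_schedule(mean_interval_s=1800, duration_s=86400, seed=1)
    print(f"{len(schedule)} probes scheduled")
    print("first few start times (s):", [round(t) for t in schedule[:5]])
```

The average probe rate is the same as with a fixed timer; only the spacing is randomized.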

Summary 3/5
• Deployment/Development
• Next Steps
• Internet Performance (summary for our 500 links)
  – Performance within ESnet is good
  – Performance to vBNS good (median loss ~ 2 * ESnet)
  – Performance to non ESnet/vBNS sites is acceptable to poor
  – Intercontinental performance is very poor to bad
  – Response time improving by 1-2% / month
  – Packet loss improving between SLAC & other sites by 3% / month since Jan-95
  – Very dynamic

Summary 4/5
• Deployment/Development
• Next Steps
• Internet Performance (continued):
  – Links to sites outside N. America vary from good (KEK) to bad
  – Canada a mixed bag: depending on remote site it is acceptable to bad
  – TEN-34 backbone countries (exc UK) good to acceptable
  – Otherwise Europe poor to bad
  – Asia (apart from some Japanese sites) is bad
  – Rest of world generally poor to bad

Summary 5/5
• Deployment/Development
• Next Steps
• Internet Performance
• Lots of collaboration & sharing:
  – SLAC & HEPNRC leading effort on PingER
  – 14 monitoring sites, ~ 400 remote sites
  – Monitoring site tools CERN & CNAF/INFN, Oxford/TracePing
  – MapPing/MAPNet working with NLANR
  – TRIUMF Traceroute topology Map
  – NIMI/LBNL & Surveyor/IETF/IPPM
  – Industry: XIWT/IPWT, also SBIR from NetPredict on prediction
  – Talks at IETF, XIWT, ICFA, ESSC, ESCC, Interface’98, CHEP…
  – Lots of support: DOE/MICS/ESSC/ESnet, ICFA, XIWT

More Information & extra info follows
• ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
  – http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
• WAN Monitoring at SLAC has lots of links
  – http://www.slac.stanford.edu/comp/net/wan-mon.html
• Tutorial on WAN Monitoring
  – http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
• PingER History tables
  – http://www.slac.stanford.edu/xorg/iepm/pinger/table.html
• NIMI
  – http://www.psc.edu/~mahdavi/nimi_paper/NIMI.html

Perception of Packet Loss
• Above 4-6% packet loss video conferencing becomes irritating, and non-native language speakers become unable to communicate.
• The occurrence of long delays of 4 seconds or more at a frequency of 4-5% or more is also irritating for interactive activities such as telnet and X windows.
• Above 10-12% packet loss there is an unacceptable level of back-to-back loss of packets and extremely long timeouts, connections start to get broken, and video conferencing is unusable.

180 Day Ping Performance SLAC-CERN

Running 10 week averages
• Sorted on biggest change
• Standard deviation gives idea of loading

Quiescence
• Frequency of zero packet loss (for all time, not cut on prime time)
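Quiescence as defined above can be computed directly from a series of per-measurement loss values. This is a minimal sketch with a hypothetical function name; it assumes each sample is the packet-loss percentage of one ping measurement and, following the slide, applies no prime-time cut.

```python
def quiescence(loss_samples):
    """Return the percentage of measurements that saw zero packet loss.

    loss_samples: iterable of per-measurement packet-loss percentages,
    covering all hours of the day.
    """
    samples = list(loss_samples)
    if not samples:
        raise ValueError("need at least one measurement")
    zero_loss = sum(1 for loss in samples if loss == 0)
    return 100.0 * zero_loss / len(samples)


if __name__ == "__main__":
    # Invented series: three of four measurements lost no packets.
    print(quiescence([0, 0, 10, 0]))  # 75.0
```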

Response & Loss Improvements
• Improved between 1 and 2.5% / month
• Response & Loss similar improvements

Top Level Domain Grouping (Loss)
• Diagonals are within TLD
• US good/accept for it, de, ch & cz
• Hungary is poor
• China unusable
• Canada poor to bad
• UK - US bad

US ESnet & vBNS
• ESnet: median 0.1%, 36 links, 17 unique remote sites, 6 monitoring sites
• vBNS: median 0.3%, 30 links, 18 unique remote sites, 4 monitoring sites
• .EDU, non ESnet/vBNS: median 1.5% (avg 3.2%), 54 links, 36 unique remote sites, 3 monitoring sites

[Figure: Advanced to U Chicago to Advanced; panels: Loss, Delay]

MapPing
• Java Applet, based on MapNet from NLANR
  – Colors links by performance
  – Selection:
    • collection site
    • performance metric
    • month
    • zoom level
  – Mouse over gives coords

Traceroute Topology Tool
• Reverse traceroute servers
• Traceping
• TopologyMap From TRIUMF
  – Ellipses show node on route
  – Open ellipse is measurement node
  – Blue ellipse not reachable
  – Keeps history
• Map labels: KEK, FNAL, DESY, CERN