Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring
Peng Sun, Minlan Yu, Michael J. Freedman, Jennifer Rexford. Princeton University. August 19, 2011
Performance Bottlenecks • Server APP (CDN servers): Writes too slowly • Server OS (CDN servers): Insufficient send buffer or small initial congestion window • Internet: Network congestion • Clients: Insufficient receive buffer
Reaction to Each Bottleneck • Server APP is bottleneck: Debug application • Server OS is bottleneck: Tune buffer size, or upgrade server • Internet is bottleneck: Circumvent the congested part of network • Client is bottleneck: Notify client to change
Previous Techniques Not Enough • Application logs: No details of network activities • Packet sniffing: Expensive to capture • Active probing: Extra load on network • Transport-layer stats: Directly reveal perf. bottlenecks
How TCP Stats Reveal Bottlenecks • CDN server applications: Insufficient data in send buffer • Server network stack: Send buffer full or initial congestion window too small • Network path (Internet): Packet loss • Clients: Receive window too small
Measurement Framework • Collect TCP statistics • Web100 kernel patch • Extract useful TCP stats for analyzing perf. • Analysis tool • Bottleneck classifier for individual connections • Cross-connection correlation at AS level • Map conn. to AS based on RouteViews • Correlate bottlenecks to drive CDN decisions
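The "map conn. to AS" step can be illustrated with a longest-prefix match. This is only a minimal sketch: the prefix table, the function names `build_table` and `lookup_asn`, and the AS numbers (taken from the documentation-reserved range) are hypothetical, and a real deployment would load prefixes from a RouteViews BGP table dump rather than a hand-written list.

```python
import ipaddress

def build_table(entries):
    """Build a lookup table from (prefix, asn) pairs.

    Entries are sorted by prefix length, longest first, so the first
    match found during lookup is the most specific one.
    """
    table = [(ipaddress.ip_network(prefix), asn) for prefix, asn in entries]
    return sorted(table, key=lambda e: e[0].prefixlen, reverse=True)

def lookup_asn(table, ip):
    """Return the AS number of the most specific prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    for network, asn in table:
        if addr in network:
            return asn
    return None

# Hypothetical prefixes and AS numbers, for illustration only.
TABLE = build_table([
    ("198.51.100.0/24", 64496),
    ("198.51.100.128/25", 64497),
])
```

A linear scan keeps the sketch short; a trie would be the usual choice for a full routing table.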
How Bottleneck Classifier Works • Small initial Cwin: Slow start limits perf.; Network Stack is bottleneck • Cwin drops greatly and packet loss: Network path is bottleneck • BytesInSndBuf = Rwin: Rwin limits sending; Client is bottleneck
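The classifier rules can be sketched as a function over one polled sample of TCP statistics. This is a rough illustration, not the authors' implementation: the `TcpSample` fields, the "halved Cwin" test for a large drop, the `>=` relaxation of "BytesInSndBuf = Rwin", the rule order, and the label names are all assumptions made here.

```python
from dataclasses import dataclass

@dataclass
class TcpSample:
    bytes_in_snd_buf: int  # bytes waiting in the send buffer
    snd_buf_size: int      # send buffer capacity in bytes
    cwin: int              # current congestion window in bytes
    prev_cwin: int         # congestion window at the previous poll
    rwin: int              # receive window advertised by the client
    lost_packets: int      # packets lost since the previous poll
    in_slow_start: bool    # whether the connection is in slow start

def classify(s: TcpSample) -> str:
    """Label one sample with a bottleneck (thresholds are assumptions)."""
    # Packet loss together with a large Cwin drop (assumed: at least halved).
    if s.lost_packets > 0 and s.cwin <= s.prev_cwin // 2:
        return "network_path"
    # Buffered data reaches the receive window: Rwin limits sending.
    if s.bytes_in_snd_buf >= s.rwin:
        return "client"
    # Send buffer is full.
    if s.bytes_in_snd_buf >= s.snd_buf_size:
        return "network_stack"
    # Data is waiting, but slow start keeps Cwin below it.
    if s.in_slow_start and s.bytes_in_snd_buf > s.cwin:
        return "network_stack"
    # The application has not written enough data to fill the window.
    if s.bytes_in_snd_buf < s.cwin:
        return "server_app"
    return "none"
```

Checking the network and client rules first is a design choice of this sketch, so that a connection hit by loss is not attributed to the server.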
CoralCDN Experiment • CoralCDN serves 1 million clients per day • Experiment environment • Deployment: A Clemson PlanetLab node • Polling interval: 50 ms • Traces to show: Feb 19th – 25th, 2011 • Total # of conn.: 209 K • After removing cache-miss conn.: 137 K (total 2008 ASes) • Log space overhead • < 200 MB per Coral server per day
What Are the Major Bottlenecks for Individual Clients? • We calculate the fraction of its lifetime that each connection spends under each bottleneck
Bottleneck | % of Conn. with Bottleneck for >40% of Lifetime | Reason
Server Application | 10.75% | Slow CPU or scarce disk resources of the PlanetLab node
Server Network Stack | 18.72% | Congestion window rises too slowly for short conn. (>80% of the connections last <1 second)
Network Path | 3.94% | Spotty network (discussed in next slide)
Clients | 1.27% | Receive buffer too small (most of them are <30 KB)
Our suggestions: • Use larger initial congestion window • More powerful PlanetLab machines • Filter them out of decision making
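The per-connection metric on this slide, the fraction of lifetime spent under each bottleneck, can be sketched as follows. The sketch assumes each connection is represented by one bottleneck label per polling sample (or `None` when no bottleneck was detected), so that the share of samples approximates the share of time; the function names and label strings are hypothetical.

```python
from collections import Counter

def lifetime_fractions(labels):
    """Fraction of polling samples spent under each bottleneck.

    labels: one bottleneck label per poll, or None when no bottleneck
    was detected. With a fixed polling interval, the fraction of
    samples approximates the fraction of the connection's lifetime.
    """
    if not labels:
        return {}
    counts = Counter(label for label in labels if label is not None)
    return {name: n / len(labels) for name, n in counts.items()}

def dominant_bottlenecks(labels, threshold=0.40):
    """Bottlenecks present for more than `threshold` of the lifetime."""
    fractions = lifetime_fractions(labels)
    return sorted(name for name, frac in fractions.items() if frac > threshold)
```

For example, a connection polled ten times with five "network_stack" samples and three "server_app" samples would count toward the Server Network Stack row only.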
AS-Level Correlation • CDNs make decisions at the AS level • e.g., change server selection for 1.1.1.0/24 • Explore at the AS level: • Filter out non-network bottlenecks • Whether network problems exist • Whether the problem is consistent
Filtering Out Non-Network Bottlenecks • CDNs change server selection if clients have low throughput • Non-network factors can limit throughput • 236 out of 505 low-throughput ASes are limited by non-network bottlenecks • Filtering is helpful: • Don’t worry about things CDNs cannot control • Produce more accurate estimates of perf.
Network Problem at AS Level • CDNs make decisions at the AS level • Whether conn. in the same AS have a common network problem • For 7.1% of the ASes, half of conn. have >10% packet loss rate • Network problems are significant at the AS level
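The AS-level test on this slide can be sketched as below. It is an illustration under assumptions: loss rates are given per connection as fractions between 0 and 1, grouped by AS number, and "half of conn." is read as "at least half". The function names and the input layout are hypothetical.

```python
def as_has_common_loss(loss_rates, loss_threshold=0.10, conn_fraction=0.5):
    """True if at least conn_fraction of an AS's connections exceed
    loss_threshold in packet loss rate.

    loss_rates: per-connection loss rates (0.0 to 1.0) for one AS.
    """
    if not loss_rates:
        return False
    lossy = sum(1 for rate in loss_rates if rate > loss_threshold)
    return lossy / len(loss_rates) >= conn_fraction

def fraction_of_lossy_ases(loss_by_as):
    """Share of ASes whose connections show a common loss problem.

    loss_by_as: dict mapping AS number to a list of per-connection
    loss rates.
    """
    if not loss_by_as:
        return 0.0
    flagged = sum(1 for rates in loss_by_as.values() if as_has_common_loss(rates))
    return flagged / len(loss_by_as)
```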
Consistent Packet Loss of AS • CDNs care about the predictive value of measurements • Analyze the variance of average packet loss rates • Each epoch (1 min) has nonzero average loss rate • Loss rate is consistent across epochs (standard deviation < mean)
Analysis Length | # of ASes with Consistent Packet Loss
One Week | 377 / 2008
One Day (Feb 21st) | 122 / 739
One Hour (Feb 21st, 18:00~19:00) | 19 / 121
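The two consistency criteria (nonzero average loss in every epoch, and standard deviation below the mean) can be sketched for a single AS as follows. The input is assumed to be one average loss rate per 1-minute epoch; using the population standard deviation is a choice made here, since the slide does not say which estimator was used.

```python
import statistics

def has_consistent_loss(epoch_loss_rates):
    """Check whether an AS shows consistent packet loss.

    epoch_loss_rates: average packet loss rate of the AS in each
    epoch (one value per 1-minute epoch).
    """
    if not epoch_loss_rates:
        return False
    # Criterion 1: every epoch has a nonzero average loss rate.
    if any(rate <= 0 for rate in epoch_loss_rates):
        return False
    # Criterion 2: standard deviation across epochs is below the mean.
    mean = statistics.fmean(epoch_loss_rates)
    spread = statistics.pstdev(epoch_loss_rates)
    return spread < mean
```

An AS with per-epoch loss rates of 0.10, 0.12 and 0.08 passes both criteria, while one with a loss-free epoch or a single large spike does not.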
Conclusion & Future Work • Use TCP-level stats to detect performance bottlenecks • Identify major bottlenecks for a production CDN • Discuss how to improve the CDN’s operation with our tool • Future work • Automatic and real-time analysis integrated into CDN operation • Detect the problematic AS on the path • Combine TCP-level stats with application logs to debug online services
Thanks! Questions?