More “normal” than Normal: Scaling distributions in complex systems Walter Willinger (AT&T Labs-Research) David Alderson (Caltech) John C. Doyle (Caltech) Lun Li (Caltech) Winter Simulation Conference 2004

Acknowledgments
• Reiko Tanaka (RIKEN, Japan)
• Matt Roughan (U. Adelaide, Australia)
• Steven Low (Caltech)
• Ramesh Govindan (USC)
• Neil Spring (U. Maryland)
• Stanislav Shalunov (Abilene)
• Heather Sherman (CENIC)

Agenda More “normal” than Normal • Scaling distributions, power laws, heavy tails • Invariance properties High Variability in Network Measurements • Case Study: Internet Traffic (HTTP, IP) – Model Requirement: Internal Consistency – Choice: Pareto vs. Lognormal • Case Study: Internet Topology (Router-level) – Model Requirement: Resilience to Ambiguity – Choice: Scale-Free vs. HOT

20th Century’s 100 largest disasters worldwide
[Figure: log-log plot of log(rank) vs. log(size). Series: technological disasters ($10B), natural disasters ($100B), US power outages (10M of customers, 1985-1997). The data follow a line of slope = -1.]

Note: it is helpful to use cumulative distributions to avoid statistics mistakes. Here log(cumulative frequency) = log(rank).

• Typical events are relatively small (see the median)
• Largest events are huge (by orders of magnitude)

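20th Century’s 100 largest disasters worldwide
A random variable X is said to follow a power law with index $\alpha > 0$ if $P[X > x] \approx c\,x^{-\alpha}$ as $x \to \infty$.
[Figure: the disaster data on log-log scale with a reference line of slope = -1 ($\alpha = 1$).]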

[Figure: US power outages (10M of customers, 1985-1997) on log-log scale, slope = -1 ($\alpha = 1$).]
A large event is not inconsistent with statistics.

Observed power law relationships
• Species within plant genera (Yule 1925)
• Mutants in bacterial populations (Luria and Delbrück 1943)
• Economics: income distributions, city populations (Simon 1955)
• Linguistics: word frequencies (Mandelbrot 1997)
• Forest fires (Malamud et al. 1998)
• Internet traffic: flow sizes, file sizes, web documents (Crovella and Bestavros 1997)
• Internet topology: node degrees in physical and virtual graphs (Faloutsos et al. 1999)
• Metabolic networks (Barabási and Oltvai 2004)

Notation
• Nonnegative random variable X
• CDF: $F(x) = P[X \le x]$
• Complementary CDF (CCDF): $1 - F(x) = P[X > x]$
NB: Avoid descriptions based on probability density f(x)!

[Figure: two log-log panels over sizes 10^0 to 10^6. Left: cumulative rank-size relationship (rank vs. size). Right: frequency-based relationship (frequency vs. size).]
Avoid non-cumulative frequency relationships for power laws
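A minimal sketch of the cumulative rank-size computation, assuming synthetic Pareto samples in place of real data. The function name `empirical_ccdf` and the cutoff `1e-3` are illustrative choices, not part of the talk. The least-squares line is only a visual check of the log-log slope, not a recommended estimator of the tail index.

```python
import numpy as np

def empirical_ccdf(samples):
    """Return sorted sizes and the empirical CCDF (rank / n) at each size."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    ranks = n - np.arange(n)      # rank 1 = largest observation
    return x, ranks / n

rng = np.random.default_rng(0)
alpha = 1.0
# Synthetic Pareto sample with minimum 1: X = U**(-1/alpha), U uniform on (0, 1]
sizes = (1.0 - rng.random(10_000)) ** (-1.0 / alpha)

x, ccdf = empirical_ccdf(sizes)
keep = ccdf >= 1e-3               # drop the few noisiest points in the far tail
slope, intercept = np.polyfit(np.log10(x[keep]), np.log10(ccdf[keep]), 1)
print(f"fitted log-log slope: {slope:.2f} (power law predicts {-alpha:.2f})")
```

No binning is involved, so every observation contributes one point to the rank-size plot.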

For many commonly used distribution functions
• Right tails decrease exponentially fast
• All moments exist and are finite
• Corresponding variable X exhibits low variability (i.e. concentrates tightly around its mean)

Subexponential Distributions
Following Goldie and Klüppelberg (1998), we say that F (or X) is subexponential if
$\lim_{x \to \infty} P[X_1 + \cdots + X_n > x] \,/\, P[\max(X_1, \ldots, X_n) > x] = 1$ for all $n \ge 2$,
where $X_1, X_2, \ldots, X_n$ are IID non-negative random variables with distribution function F.
This says that the sum $X_1 + \cdots + X_n$ is likely to be large iff $\max(X_i)$ is large (i.e. there is a non-negligible probability of extremely large values in a subexponential sample).
This implies for subexponential distributions that
$e^{\varepsilon x}\, P[X > x] \to \infty$ as $x \to \infty$, for all $\varepsilon > 0$
(i.e. right tail decays more slowly than any exponential)

Heavy-tailed (Scaling) Distributions
A subexponential distribution function F(x) (or random variable X) is called heavy-tailed or scaling if for some $0 < \alpha < 2$
$P[X > x] \approx c\,x^{-\alpha}$ as $x \to \infty$   (1)
for some constant $0 < c < \infty$. Parameter $\alpha$ is called the tail index
• $1 < \alpha < 2$: F has finite mean, infinite variance
• $0 < \alpha \le 1$: F has infinite mean, infinite variance
• In general, all moments of order $\beta \ge \alpha$ are infinite.

Simple Constructions for Heavy-Tails
• For U uniform in [0, 1], set X = 1/U, then X is heavy-tailed with $\alpha = 1$.
• For E (standard) exponential, set X = exp(E), then X is heavy-tailed with $\alpha = 1$.
• The mixture of exponential distributions with parameter $1/\lambda$ having a (centered) Gamma(a, b) distribution is a Pareto distribution with $\alpha = a$.
• The distribution of the time between consecutive visits to zero of a symmetric random walk is heavy-tailed with $\alpha = 1/2$.
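The first two constructions can be checked numerically. The sketch below assumes NumPy and a hypothetical helper `tail_prob`. Both constructions give $P[X > x] = 1/x$ exactly for $x \ge 1$, which is a power law with $\alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = 1.0 - rng.random(n)        # uniform on (0, 1], avoids division by zero
e = rng.exponential(1.0, n)    # standard exponential

x_inv = 1.0 / u                # construction 1: X = 1/U
x_exp = np.exp(e)              # construction 2: X = exp(E)

def tail_prob(samples, x):
    """Empirical estimate of P[X > x]."""
    return np.mean(samples > x)

for x in (10.0, 100.0):
    # Both empirical tails should be close to the exact value 1/x
    print(x, tail_prob(x_inv, x), tail_prob(x_exp, x), 1.0 / x)
```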

Power Laws
• Scaling distributions are also called power law distributions.
• We will use notions of power laws, scaling distributions, and heavy tails interchangeably, requiring only that
$P[X > x] \approx c\,x^{-\alpha}$ as $x \to \infty$   (1)
Note that (1) implies
$\log P[X > x] \approx \log c - \alpha \log x$
In other words, the CCDF when plotted on log-log scale follows an approximate straight line with slope $-\alpha$.

Why “Heavy Tails” Matter …
• Risk modeling (insurance)
• Load balancing (CPU, network)
• Job scheduling (Web server design)
• Combinatorial search (Restart methods)
• Complex systems studies (SOC vs. HOT)
• Understanding the Internet
– Behavior (traffic modeling)
– Structure (topology modeling)

Power laws are ubiquitous
• High variability phenomena abound in natural and man-made systems
• Tremendous attention has been directed at whether or not such phenomena are evidence of universal properties underlying all complex systems
• Recently, discovering and explaining power law relationships has been a minor industry within the complex systems literature
• We will use the Internet as a case study to examine what power laws do or don’t have to say about its behavior and structure.
First, we review some basic properties of scaling distributions

Response to Conditioning
• If X is heavy-tailed with index $\alpha$, then the conditional distribution of X given that X > w satisfies
$P[X > x \mid X > w] = P[X > x] / P[X > w] \approx c_1\, x^{-\alpha}$
with a constant $c_1$ that depends on w. For large values of x, this is identical to the unconditional distribution P[X > x], except for a change in scale.
• The non-heavy-tailed exponential distribution has conditional distribution of the form
$P[X > x \mid X > w] = e^{-\lambda (x - w)}$
The response to conditioning is a change in location, rather than a change in scale.

Mean Residual Lifetime
• An important feature that distinguishes heavy-tailed distributions from non-heavy-tailed counterparts
• For the exponential distribution with parameter $\lambda$, mean residual lifetime is constant: $E[X - x \mid X > x] = 1/\lambda$
• For a scaling distribution with parameter $\alpha$ (with $\alpha > 1$), mean residual lifetime is increasing: $E[X - x \mid X > x] \approx x / (\alpha - 1)$
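A short worked derivation of the conditioning and mean-residual-lifetime statements. It assumes an exact Pareto tail rather than the asymptotic form, and the symbol m for the minimum value is introduced here for illustration only.

```latex
% Assumption: exact Pareto tail, P[X > u] = (u/m)^{-\alpha} for u \ge m,
% with \alpha > 1 and x \ge w \ge m.
\begin{align*}
P[X > x \mid X > w] &= \frac{P[X > x]}{P[X > w]}
   = \left(\frac{x}{w}\right)^{-\alpha}
   && \text{(change in scale by } w\text{)} \\
E[X - x \mid X > x] &= \int_x^\infty \frac{P[X > u]}{P[X > x]}\,du
   = \int_x^\infty \left(\frac{u}{x}\right)^{-\alpha} du
   = \frac{x}{\alpha - 1}
   && \text{(linear in } x\text{)} \\[4pt]
% Exponential comparison, P[X > u] = e^{-\lambda u}:
P[X > x \mid X > w] &= e^{-\lambda (x - w)}
   && \text{(change in location by } w\text{)} \\
E[X - x \mid X > x] &= \int_x^\infty e^{-\lambda (u - x)}\,du = \frac{1}{\lambda}
   && \text{(constant)}
\end{align*}
```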

Key Mathematical Properties of Scaling Distributions
• Response to conditioning (change in scale)
• Mean residual lifetime (linearly increasing)
Invariance Properties
• Invariant under aggregation
– Non-classical CLT and stable laws
• (Essentially) invariant under maximization
– Domain of attraction of Fréchet distribution
• (Essentially) invariant under mixture
– Example: The largest disasters worldwide
• Invariant under marginalization

Linear Aggregation: Classical Central Limit Theorem
• A well-known result
– X(1), X(2), … independent and identically distributed random variables with distribution function F (mean $\mu < \infty$ and variance 1)
– S(n) = X(1) + X(2) + … + X(n), the n-th partial sum
– $(S(n) - n\mu)/\sqrt{n} \to N(0, 1)$ in distribution as $n \to \infty$
• More general formulations are possible
• Often-used argument for the ubiquity of the normal distribution

Linear Aggregation: Non-classical Central Limit Theorem
• A less well-known result
– X(1), X(2), … independent and identically distributed with common distribution function F that is heavy-tailed with $1 < \alpha < 2$
– S(n) = X(1) + X(2) + … + X(n), the n-th partial sum
– $(S(n) - n\mu)/n^{1/\alpha}$ converges in distribution to a stable law with index $\alpha$ as $n \to \infty$
• The limit distribution is heavy-tailed with index $\alpha$
• More general formulations are possible
• Gaussian distribution is special case when $\alpha = 2$
• Rarely taught in most Stats/Probability courses

Maximization: Maximum Domain of Attraction
• A not so well-known result (extreme-value theory)
– X(1), X(2), … independent and identically distributed with common distribution function F that is heavy-tailed with $1 < \alpha < 2$
– M(n) = max(X(1), …, X(n)), the n-th successive maximum
– $P[M(n)/(c\,n)^{1/\alpha} \le x] \to G(x)$ as $n \to \infty$
• G is the Fréchet distribution $G(x) = \exp(-x^{-\alpha})$
• G is heavy-tailed with index $\alpha$
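The convergence to the Fréchet law can be checked without simulation in one special case. The sketch below assumes an exact Pareto distribution with minimum 1, so that $P[X > x] = x^{-\alpha}$ and the scale constant is c = 1. The function names are illustrative.

```python
import math

def cdf_normalized_max(x, n, alpha):
    """Exact P[M(n) / n**(1/alpha) <= x] for n iid Pareto(alpha) variables with minimum 1."""
    threshold = x * n ** (1.0 / alpha)
    if threshold < 1.0:
        return 0.0
    # Each variable is <= threshold with probability 1 - threshold**(-alpha)
    return (1.0 - threshold ** (-alpha)) ** n

def frechet_cdf(x, alpha):
    """Fréchet distribution function G(x) = exp(-x**(-alpha)) for x > 0."""
    return math.exp(-x ** (-alpha))

alpha = 1.5
for n in (10, 1_000, 100_000):
    print(n, cdf_normalized_max(2.0, n, alpha), frechet_cdf(2.0, alpha))
```

The exact value is $(1 - x^{-\alpha}/n)^n$, which tends to $\exp(-x^{-\alpha})$ as n grows.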

Weighted Mixture
• A little known result
– X(1), X(2), … independent random variables having distribution functions $F_i$ that are heavy-tailed with common index $1 < \alpha < 2$, but possibly different scale coefficients $c_i$
– Consider the weighted mixture W(n) of X(i)’s
– Let $p_i$ be the probability that W(n) = X(i), with $p_1 + \cdots + p_n = 1$, then one can show
$P[W(n) > x] \approx c_W\, x^{-\alpha}$ as $x \to \infty$
where $c_W = \sum_i p_i c_i$ is the weighted average of the separate scale coefficients $c_i$.
• Thus, the weighted mixture of scaling distributions is also scaling with the same tail index, but a different scale coefficient

Multivariate Case: Marginalization
• For a random vector $X \in R^d$, if all linear combinations $Y = \sum_k b_k X_k$ are stable with index $\alpha \ge 1$, then X is a stable vector in $R^d$ with index $\alpha$.
• Conversely, if X is an $\alpha$-stable random vector in $R^d$ then any linear combination $Y = \sum_k b_k X_k$ is an $\alpha$-stable random variable.
• Marginalization
– The marginal distribution of a multivariate heavy-tailed random variable is also heavy-tailed
– Consider convex combination denoted by multipliers b = (0, …, 0, 1, 0, …, 0) that projects X onto the kth axis
– All stable laws (including the Gaussian) are invariant under this type of transformation

Invariance Properties

                   Gaussian Distributions   Scaling Distributions
Aggregation        Yes                      Yes
Maximization       No                       Yes
Mixture            No                       Yes
Marginalization    Yes                      Yes

• For low variability data, minimal conditions on the distribution of individual constituents (i.e. finite variance) yield the classical CLT
• For high variability data, a more restrictive assumption (i.e. right tail of the distribution of the individual constituents must decay at a certain rate) yields greater invariance

Scaling: “more normal than Normal” • Aggregation, mixture, maximization, and marginalization are transformations that occur frequently in natural and engineered systems and are inherently part of many measured observations that are collected about them. • Invariance properties suggest that the presence of scaling distributions in data obtained from complex natural or engineered systems should be considered the norm rather than the exception. • Scaling distributions should not require “special” explanations.

Our Perspective
• Gaussian distributions as the natural null hypothesis for low variability data
– i.e. when variance estimates exist, are finite, and converge robustly to their theoretical value as the number of observations increases
• Scaling distributions as natural and parsimonious null hypothesis for high variability data
– i.e. when variance estimates tend to be ill-behaved and either converge very slowly or fail to converge altogether as the size of the data set increases

High-Variability in Network Measurements: Implications for Internet Modeling and Model Validation

G. E. P. Box: “All models are wrong, …
• … but some are useful.”
– Which ones?
– In what sense?
• … but some are less wrong.
– Which ones?
– In what sense?
• Mandelbrot’s version:
– “When exactitude is elusive, it is better to be approximately right than certifiably wrong.”

What about Internet measurements?
• High-volume data sets
– Individual data sets are huge
– Huge number of different data sets
– Even more and different data in the future
• Rich semantic context of the data
– A packet is more than arrival time and size
• Internet is full of “high variability”
– Link bandwidth: Kbps – Gbps
– File sizes: a few bytes – Mega/Gigabytes
– Flows: a few packets – 100,000+ packets
– In/out-degree (Web graph): 1 – 100,000+
– Delay: Milliseconds – seconds and beyond

On Traditional Internet Modeling • Step 0: Data Analysis – One or more sets of comparable measurements • Step 1: Model Selection – Choose parametric family of models/distributions • Step 2: Parameter Estimation – Take a strictly static view of data • Step 3: Model Validation – Select “best-fitting” model – Rely on some “goodness-of-fit” criteria/metrics – Rely on some performance comparison How to deal with “high variability”? – Option 1: High variability = large, but finite variance – Option 2: High variability = infinite variance

Some Illustrative Examples • Some commonly-used plotting techniques – Probability density functions (pdf) – Cumulative distribution functions (CDF) – Complementary CDF (CCDF) • Different plots emphasize different features – Main body of the distribution vs. tail – Variability vs. concentration – Uni- vs. multi-modal

[Figure: Probability density functions f(x) vs. x for Lognormal(0, 1), Gamma(.53, 3), Exponential(1.6), Weibull(.7, .9), Pareto(1, 1.5).]

[Figure: Cumulative distribution functions F(x) vs. x for the same five distributions.]

[Figure: Complementary CDFs, log(1 - F(x)) vs. log(x), for Lognormal(0, 1), Gamma(.53, 3), Exponential(1.6), Weibull(.7, .9), then with ParetoII(1, 1.5) and ParetoI(0.1, 1.5) added.]

By Example Internet Traffic • HTTP Connection Sizes from 1996 • IP Flow Sizes (2001) Internet Topology • Router-level connectivity (1996, 2002)

HTTP Connection Sizes (1996)
– 1 day of LBL’s WAN traffic (in- and outbound)
– About 250,000 HTTP connection sizes (bytes)
– Courtesy of Vern Paxson
[Figure: CCDF 1 - F(x) vs. x (HTTP size) on log-log scale, HTTP data.]

HTTP Connection Sizes (1996)
How to deal with “high variability”?
– Option 1: High variability = large, but finite variance
– Option 2: High variability = infinite variance
• Fitted 2-parameter Pareto ($\alpha$ = 1.27, m = 2000)
• Fitted 2-parameter Lognormal ($\mu$ = 6.75, $\sigma$ = 2.05)
[Figure: CCDF 1 - F(x) vs. x (HTTP size) on log-log scale: HTTP data, fitted Lognormal, fitted Pareto.]
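The slides do not say which fitting procedure produced these parameters. As a hedged sketch, the following uses the standard closed-form maximum-likelihood estimates for a Pareto model with known minimum m and for a lognormal model. The data are synthetic, generated with the slide's Pareto parameters purely for illustration. The function names `fit_pareto` and `fit_lognormal` are hypothetical.

```python
import numpy as np

def fit_pareto(samples, m):
    """MLE of the tail index for a Pareto type I model with known minimum m."""
    x = np.asarray(samples, dtype=float)
    x = x[x >= m]                          # the model only covers values >= m
    return x.size / np.sum(np.log(x / m))

def fit_lognormal(samples):
    """MLE-style estimates (mu, sigma) from the mean and std of the log data."""
    logs = np.log(np.asarray(samples, dtype=float))
    return logs.mean(), logs.std(ddof=1)

rng = np.random.default_rng(2)
m_true, alpha_true = 2000.0, 1.27          # illustration values only
# Synthetic stand-in for the measured sizes (assumption: exact Pareto)
synthetic = m_true * (1.0 - rng.random(50_000)) ** (-1.0 / alpha_true)

alpha_hat = fit_pareto(synthetic, m_true)
mu_hat, sigma_hat = fit_lognormal(synthetic)
print(f"Pareto fit: alpha = {alpha_hat:.3f}")
print(f"Lognormal fit: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Both fits always return numbers, which is why a fit on its own does not settle the finite vs. infinite variance question.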

IP Flow Sizes (2001)
– 4-day period of traffic at Auckland
– About 800,000 IP flow sizes (bytes)
– Courtesy of NLANR and Joel Summers
[Figure: CCDF 1 - F(x) vs. x (IP flow size) on log-log scale, IP flow data.]

IP Flow Sizes (2001)
How to deal with “high variability”?
– Option 1: High variability = large, but finite variance
– Option 2: High variability = infinite variance
[Figure: CCDF 1 - F(x) vs. x (IP flow size) on log-log scale: IP flow data, fitted Lognormal, fitted Pareto.]

[Figure: Samples from Pareto Distribution. CCDF 1 - F(x) vs. x on log-log scale: fitted Pareto, samples from fitted Pareto.]

[Figure: Samples from Lognormal Distribution. CCDF 1 - F(x) vs. x on log-log scale: fitted Lognormal, samples from fitted Lognormal.]

[Figure: four-panel summary. Top: HTTP data and IP flow data, each with fitted Lognormal and fitted Pareto. Bottom: samples from the fitted Pareto and from the fitted Lognormal.]

Traditional Modeling Approach
• Step 0: Data Analysis
• Step 1: Model Selection
• Step 2: Parameter Estimation
• Step 3: Model Validation
Criticism of Traditional Approach
• Highly predictable outcome
– Always doable, no surprises
– Cause for endless discussions (Downey ’01)
• Curve fitting: when “more” means “better” …
– Adding parameters improves fit
• Inadequate “goodness-of-fit” criteria due to
– Voluminous data sets
– Dependencies, high-variability, non-stationarities

Beyond Traditional Internet Modeling • Requirement 1: Internal Model Consistency – Exploit high volume of available data – Learn from Mandelbrot and Tukey – Example: Understanding HTTP and IP data • Requirement 2: External Model Consistency – Exploit rich semantic of available data – Learn more from Mandelbrot and Cox – Example: Understanding Internet topology data • Requirement 3: Resilience to Ambiguous Data – High variability to the rescue – Again, look up Mandelbrot!

Internal Model Consistency • Take dynamic view of data – Rely on traditional modeling approach for initial (small) subset of available data (model M(0)) – Consider successively larger subsets (models M(k)) – Analyze resulting family of models M(0), …, M(n) • Approach: Tukey’s “borrowing strength” idea – Borrowing strength from large data sets – Simple way to exploit high-volume data sets – Traditional modeling as a means, not as an end in itself • Internally consistent family of models – Parameter estimates converge quickly/robustly – 95% Confidence intervals become nested • Internally inconsistent family of models – Parameter estimates don’t converge – 95% CI’s don’t overlap

HTTP Data: Lognormal Family of Models • Lognormal model assumes finite variance • Tool: Mandelbrot’s “sequential moment plots” – Plot moment estimates as a function of n (sample size) – Plot corresponding 95% CI as a function of n – Look for convergence/divergence as n approaches the full sample size • Practical implementation – Working with raw data – Working with transformations of raw data – Working with random permutation of transformations of raw data
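A minimal sketch of a sequential moment plot, assuming synthetic data in place of the HTTP trace. It computes the sample standard deviation over nested prefixes of the data and over a random permutation. The function name `sequential_std` and the grid of 50 sample sizes are illustrative choices.

```python
import numpy as np

def sequential_std(samples, sizes):
    """Sample standard deviation of the first n observations, for each n in sizes."""
    x = np.asarray(samples, dtype=float)
    return np.array([x[:n].std(ddof=1) for n in sizes])

rng = np.random.default_rng(3)
N = 200_000
sizes = np.linspace(1_000, N, 50).astype(int)      # nested subset sizes

pareto = (1.0 - rng.random(N)) ** (-1.0 / 1.27)    # infinite-variance model
exponential = rng.exponential(1.0, N)              # finite-variance model

std_pareto = sequential_std(pareto, sizes)
std_exp = sequential_std(exponential, sizes)
# Random permutation separates ordering effects from marginal behaviour
std_pareto_perm = sequential_std(rng.permutation(pareto), sizes)

print("exponential STD(n), last 3:", std_exp[-3:])
print("Pareto STD(n), last 3:     ", std_pareto[-3:])
```

Plotting each array against `sizes` gives curves of the kind described above. The exponential curve settles near 1, while the Pareto curve jumps whenever a new extreme value enters the sample.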

Sequential Moment Plots: HTTP Raw Data
• Let D be original data set of size N
• Build sequential models $M_0, M_1, \ldots, M_N$ using nested data sets $D_0 \subset D_1 \subset \cdots \subset D$ of size $N_0 < N_1 < \cdots < N$
• Plot sample STD as a function of n (sample size)
[Figure: STD(n) (x 10^4) vs. n (number of observations, x 10^5). Curves added in successive builds: HTTP data (original), HTTP data (permutation), Lognormal, Pareto, Exponential.]

HTTP: Log-transformed Raw Data
• Sequential estimates $\hat{\sigma}(n)$ of parameter $\sigma(n)$ for fitted Lognormal model $M_n$, together with 95% CI
• Individual fitted lognormals appear adequate for data $D_i$??
• Successive models are inconsistent (i.e. non-overlapping CIs)
• Minor differences in $\hat{\sigma}(n)$ translate into very substantial differences for the standard deviation estimates $\hat{s}(n)$
[Figure: two panels vs. n (number of observations, x 10^5). Left: estimate $\hat{\sigma}(n)$ with 95% CI. Right: estimate $\hat{s}(n)$ (x 10^5) with approx. 95% CI.]

HTTP: Permuted & Transformed Raw Data
• Question: Are the jumps in the estimate of $\sigma(n)$ the result of dependencies in the data?
• Answer: Data permutation gives the appearance of convergence
[Figure: two panels of $\hat{\sigma}(n)$ with 95% CI vs. n (number of observations, x 10^5). Left: log-transformed raw data. Right: random permutation of log-transformed raw data.]

HTTP: Does the log-transformed data fit a normal?
[Figure: normal probability plot of the log-transformed HTTP data.]

Modeling HTTP Data
Lognormal models:
• Raw data
– Shows lack of convergence of 2nd moment estimates
• Transformed data
– Shows impact of dependencies in the data
• Transformed and permuted data
– Lognormal model is internally inconsistent
Example of being “certifiably wrong”

HTTP Data: Pareto Family of Models • Pareto model assumes infinite variance, but is defined in terms of tail index • Tool: “Sequential tail index estimate plots” – Plot tail index estimates as a function of n – Plot corresponding 95% CI as a function of n – Look for convergence/divergence as n approaches the full sample size • Practical implementation – Working with raw data – Working with random permutation of raw data
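The slides do not name the tail index estimator. As a sketch under that caveat, the following uses the Hill estimator with its usual asymptotic 95% confidence interval, applied to nested prefixes of a synthetic Pareto sample. The function names and the choice of using the top 10% of each prefix are assumptions made for illustration.

```python
import numpy as np

def hill_estimate(samples, k):
    """Hill estimate of the tail index from the k largest values, with approx. 95% CI."""
    x = np.sort(np.asarray(samples, dtype=float))
    top = x[-k:]                      # k largest observations
    threshold = x[-k - 1]             # (k+1)-th largest observation
    alpha_hat = 1.0 / np.mean(np.log(top / threshold))
    half_width = 1.96 * alpha_hat / np.sqrt(k)
    return alpha_hat, alpha_hat - half_width, alpha_hat + half_width

def sequential_tail_index(samples, sizes, tail_fraction=0.1):
    """Rows of (estimate, CI low, CI high) for the first n observations, n in sizes."""
    rows = []
    for n in sizes:
        k = max(int(tail_fraction * n), 10)
        rows.append(hill_estimate(samples[:n], k))
    return np.array(rows)

rng = np.random.default_rng(4)
N = 200_000
data = (1.0 - rng.random(N)) ** (-1.0 / 1.27)      # synthetic Pareto, alpha = 1.27
sizes = np.linspace(2_000, N, 40).astype(int)

estimates = sequential_tail_index(data, sizes)
print("first:", estimates[0])
print("last: ", estimates[-1])
```

For an internally consistent family of models, the estimates settle down and the intervals shrink around a common value as n grows.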

HTTP: Sequential Tail Index Estimate Plots
• Sequential estimates $\hat{\alpha}(n)$ of parameter $\alpha(n)$ for fitted Pareto model $M_n$, together with 95% CI
• Successive fitted Paretos appear largely consistent with one another (i.e. overlapping CIs)
[Figure: two panels of $\hat{\alpha}(n)$ with 95% CI vs. n (number of observations, x 10^5). Left: raw data. Right: random permutation of raw data.]

HTTP: Does the data fit a Pareto?
[Figure: quantile-quantile plot, Y quantiles vs. X quantiles.]

Modeling HTTP Data
Lognormal models:
• Raw data
– Shows lack of convergence of 2nd moment estimates
• Transformed data
– Shows impact of dependencies in the data
• Transformed and permuted data
– Lognormal model is internally inconsistent
Example of being “certifiably wrong”
Pareto Family of Models:
• Raw data
– Moment estimates are problematic
– Tail index estimates converge quickly
• Permutation of raw data
– Tail index estimates converge robustly (irrespective of dependencies in the data)
– Pareto models are internally consistent
Example of being “approximately right”

“All models are wrong…” but some are less wrong.
[Figure: two panels. Left: normal probability plot, HTTP: fitted Lognormal. Right: quantile-quantile plot (Y quantiles vs. X quantiles), HTTP: fitted Pareto.]

Some Sanity Checks • Fitting Pareto model to Lognormal sample – Generate iid sample from a Lognormal model – Check sequential tail index estimate plot

[Figure: sequential tail index estimates $\hat{\alpha}(n)$ with 95% CI vs. n (number of observations, x 10^5). Using a Pareto model for lognormal data.]

Some Sanity Checks
• Fitting Pareto model to Lognormal sample
– Generate iid sample from a Lognormal model
– Check sequential tail index estimate plot
• Result: sequential tail index estimates diverge
• Fitting Lognormal model to Pareto sample
– Generate iid sample from a Pareto model
– Check sequential standard deviation plot
– Check normal probability plot

[Figure: normal probability plot. Using a lognormal model for Pareto data.]

Some Sanity Checks • Fitting Pareto model to Lognormal sample – Generate iid sample from a Lognormal model – Check sequential tail index estimate plot • Result: sequential tail index estimates diverge • Fitting Lognormal model to Pareto sample – Generate iid sample from a Pareto model – Check sequential standard deviation plot – Check normal probability plot • Result: transformed data is not Gaussian
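One way to sketch both sanity checks numerically. This is a variant under stated assumptions, not the procedure behind the plots: it uses the Hill estimator at two tail depths instead of a sequential plot, and sample skewness instead of a normal probability plot. The parameters (lognormal with sigma = 2, Pareto with tail index 1.5) are illustrative.

```python
import numpy as np

def hill_tail_index(samples, tail_fraction):
    """Hill estimate of the tail index from the top tail_fraction of the data."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(tail_fraction * x.size)
    return 1.0 / np.mean(np.log(x[-k:] / x[-k - 1]))

def skewness(values):
    """Sample skewness; close to 0 for Gaussian data."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(5)
n = 200_000
lognormal = rng.lognormal(mean=0.0, sigma=2.0, size=n)
pareto = (1.0 - rng.random(n)) ** (-1.0 / 1.5)

# Check 1: Pareto model on lognormal data. The estimate changes with how far
# into the tail we look, so no single tail index describes the sample.
ln_wide = hill_tail_index(lognormal, 0.1)
ln_deep = hill_tail_index(lognormal, 0.005)

# Same estimator on genuinely Pareto data: both depths roughly agree.
pa_wide = hill_tail_index(pareto, 0.1)
pa_deep = hill_tail_index(pareto, 0.005)

# Check 2: lognormal model on Pareto data. log(X) is exponential, not
# Gaussian, so its skewness is near 2 rather than 0.
log_skew = skewness(np.log(pareto))

print(f"lognormal data: {ln_wide:.2f} (top 10%) vs {ln_deep:.2f} (top 0.5%)")
print(f"Pareto data:    {pa_wide:.2f} (top 10%) vs {pa_deep:.2f} (top 0.5%)")
print(f"skewness of log(Pareto): {log_skew:.2f}")
```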

Finite Variance vs Infinite Variance?
– Sequential moment plots: IP raw data
– Sequential estimates of $\sigma(n)$: log-transformed raw data
– Sequential tail index plots: estimates of $\alpha(n)$
[Figure: CCDF 1 - F(x) vs. x (IP flow size) on log-log scale: IP flow data, fitted Lognormal, fitted Pareto.]

Sequential Moment Plots: IP Raw Data
• Let D be original data set of size N
• Build sequential models $M_0, M_1, \ldots, M_N$ using nested data sets $D_0 \subset D_1 \subset \cdots \subset D$ of size $N_0 < N_1 < \cdots < N$
• Plot sample STD as a function of n (sample size)
[Figure: STD(n) (x 10^6) vs. n (number of observations, x 10^5). Curves added in successive builds: IP flow data (original), IP flow data (permutation), Lognormal, Pareto, Exponential.]

IP: Log-transformed Raw Data
• Sequential estimates $\hat{\sigma}(n)$ of parameter $\sigma(n)$ for fitted Lognormal model $M_n$, together with 95% CI
• Individual fitted lognormals appear adequate for data $D_i$, but successive models are inconsistent (i.e. non-overlapping CIs)
• Minor differences in $\hat{\sigma}(n)$ translate into very substantial differences for the standard deviation estimates $\hat{s}(n)$
[Figure: two panels vs. n (number of observations, x 10^5). Left: estimate $\hat{\sigma}(n)$ with 95% CI. Right: estimate $\hat{s}(n)$ (x 10^5) with approx. 95% CI.]

IP Data: Sequential Tail Index Estimate Plots
• Sequential estimates $\hat{\alpha}(n)$ of parameter $\alpha(n)$ for fitted Pareto model $M_n$, together with 95% CI
• Successive fitted Paretos appear largely consistent with one another (i.e. overlapping CIs)
[Figure: $\hat{\alpha}(n)$ with 95% CI vs. n (number of observations, x 10^5).]

Modeling HTTP and IP Data
Lognormal models:
• Raw data
– Shows lack of convergence of 2nd moment estimates
• Transformed data
– Shows impact of dependencies in the data
• Transformed and permuted data
– Lognormal model is internally inconsistent
Example of being “certifiably wrong”
Pareto Family of Models:
• Raw data
– Moment estimates are problematic
– Tail index estimates converge quickly
• Permutation of raw data
– Tail index estimates converge robustly (irrespective of dependencies in the data)
– Pareto models are internally consistent
Example of being “approximately right”

Beyond Traditional Internet Modeling • Requirement 1: Internal Model Consistency – Exploit high volume of available data – Learn from Mandelbrot and Tukey – Example: Understanding HTTP and IP data • Requirement 2: External Model Consistency – Exploit rich semantic of available data – Learn more from Mandelbrot and Cox – Example: Understanding self-similar Internet traffic • Requirement 3: Resilience to Ambiguous Data – High variability to the rescue – Again, look up Mandelbrot – Example: Understanding Internet topology data

Internet Traffic: Poisson Models • Internally inconsistent – Earlier criterion applied to processes – D. Figueiredo et al. (2004) • Externally inconsistent – Aggregate Poisson is incompatible with high variability of the higher-layer constituents • Example of being “verifiably wrong”

Internet Traffic: Self-Similar Models • Internally consistent – Earlier criterion applied to processes – D. Figueiredo et al. (2004) • Externally consistent – Mandelbrot/Cox construction – LRD via high variability of the higher-layer constituents – Optimal web layout: heavy-tailed HTTP data • Example of being “approximately right”

Models of Self-Similar Traffic
Mandelbrot’s Construction
• Renewal reward processes and their aggregates
– Aggregate is made up of many constituents
– Each constituent is of the on/off type
– On/off periods have a “duration”
– Constituents make contributions (“rewards”) when “on”
– Constituents make no contributions when “off”
Cox’s construction
• Known as immigration-death or M/G/∞ process
– Aggregate traffic is made up of many connections
– Connections arrive at random
– Each connection has a “size” (number of packets)
– Each connection transmits packets at some “rate”
• The limiting regimes for the aggregate are essentially the same as those for Mandelbrot’s construction

External Model Consistency • Cross-layer view of models – Aggregate link traffic (packet-level) – Semantic context in packet trace data allows for identification of higher-layer constituents [IP flow, TCP connections, HTTP requests/responses, etc. ] – Aggregate link traffic (higher-layer constituents) • External model consistency – Models respect layered network architecture – Models are required to be consistent across layers – Models explain observed phenomena at different layers

[Figure: cumulative frequency vs. size of events, both log (base 10), decimated data. Series: data compression codewords (Huffman), WWW files in Mbytes (Crovella), forest fires in 1000 km^2 (Malamud). Slopes: -1 for web files and codewords, -1/2 for fires.]
• Most files are small (“mice”)
• Most packets are in a few large files (“elephants”)

[Figure: probability of user access. Mice are delay sensitive; elephants are bandwidth sensitive.]

Generalized “coding” theory
Data compression (Shannon)
• Minimize avg file transfer
• No feedback
• Discrete (0-d) topology
Web layout
• Minimize avg file transfer
• Feedback
• 1-d topology
Reference: Zhu, X., J. Yu, and J. C. Doyle. Heavy Tails, Generalized Coding, and Optimal Web Layout.

Data [Figure: DC and WWW curves.]

Data + Model/Theory [Figure: DC and WWW curves.]

Data + Model/Theory [Figure: DC and WWW curves.]
Unified “source coding” theory:
1. Data compression (Shannon)
2. Web layout
3. Other network applications

How general is this mice/elephant picture?
• Selecting and reading books
• Selecting and reading magazine articles
• Selecting and viewing television
• Deciding what movie to go to
• Deciding where to go on vacation
• Deciding which meetings and classes to attend
• Etc.

Internet traffic Links

Typical web traffic [Figure: log(freq > size) vs. log(file size).] Heavy-tailed web traffic: p ∝ s^(-α), α > 1.0. Web servers: traffic is streamed out on the net, creating fractal Gaussian Internet traffic (Willinger, …)

Fat tail web traffic time Is streamed onto the Internet creating long-range correlations with

Typical web traffic [Figure: log(freq > size) vs. log(file size).] Heavy-tailed web traffic: p ∝ s^(-α), α > 1.0. Web servers: traffic is streamed out on the net. Externally consistent, rigorous theory with supporting measurements.

The “Closing the Loop” Approach
1. Discovery (data-driven)
2. Modeling, subject to internal and external consistency
3. Proposed explanation in terms of elementary concepts or mechanisms (mathematics)
4. Step 3 suggests first-of-its-kind measurements or revisiting existing measurements related to checking the elementary concepts or mechanisms
5. Empirical validation of elementary concepts or mechanisms using the data collected in Step 4

Why “Closing the Loop” is Progress • Departure from classical “data-fitting” • Validation is moved to a more elementary or fundamental level • Fully exploits the context in which measurements are made (“start with data, end with data”) • If successful, provides actual explanation of “emergent” phenomena (new insight) • Shows inherent limitations and weaknesses of proposed model, suggests further improvements

Modeling Internet Traffic
– More than “curve fitting”
– More than “follows a power law”
– Fully consistent with theory and empirical evidence
– Validated by “closing the loop”
[Figure: two panels of 1 - F(x) vs. x on log-log axes. Left: HTTP data (x = HTTP size). Right: IP flow data (x = IP flow size).]
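Plots of 1 - F(x) on log-log axes, like the two panels on this slide, can be produced from raw samples along the following lines. This is a descriptive sketch, not the authors' procedure: the function names are illustrative, ties are not handled specially, and a least-squares slope on log-log axes is only a rough diagnostic rather than a rigorous tail-index estimator.

```python
import math

def empirical_ccdf(samples):
    """Return (x, fraction of samples >= x) pairs, sorted by x.

    Every fraction is positive, so the pairs can be drawn on log-log axes.
    """
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (n - i) / n) for i, x in enumerate(ordered)]

def loglog_slope(points):
    """Least-squares slope of log10(p) against log10(x) for (x, p) pairs."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(p) for _, p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near -α over a wide range of x is consistent with a power-law tail of index α, but a similar-looking line can also arise from other distributions over a limited range, so the slope alone should not be treated as validation.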

Beyond Traditional Internet Modeling • Requirement 1: Internal Model Consistency – Exploit high volume of available data – Learn from Mandelbrot and Tukey – Example: Understanding HTTP and IP data • Requirement 2: External Model Consistency – Exploit rich semantics of available data – Learn more from Mandelbrot and Cox – Example: Understanding self-similar Internet traffic • Requirement 3: Resilience to Ambiguous Data – High variability to the rescue – Again, look up Mandelbrot – Example: Understanding Internet topology data

Internet Topology What does the structure of the Internet look like? • Internet router-level topology – Physical connectivity – Direct inspection generally not possible • Available measurements: Traceroute-based – Pansiot and Grad (1998) – Rocketfuel data (Spring et al. 2002) – A few accurate router-level maps • Other models: AS graphs, WWW graphs

Router-Level Topology Routers Hosts • Nodes are machines (routers or hosts) running IP protocol • Measurements taken from traceroute experiments that infer topology from traffic sent over network • Subject to sampling errors and bias • Requires careful interpretation

AS Topology • Nodes are entire networks (ASes) • Links = peering relationships between ASes • Relationships inferred from Border Gateway Protocol (BGP) information • Really a measure of business relationships, not network structure [Figure: example graph of four ASes, AS 1 to AS 4.]

Pansiot-Grad data (1995) of router-level Internet connectivity based on large-scale traceroute experiments [Figure: node rank vs. node degree, log-log axes.] Faloutsos et al. (1999): Power law degree distribution

Internet Topology: Scale-Free Models • Key assumptions – Data: Taken at face value – Node degree distribution: Power law • Key claims (Albert, Jeong, Barabasi 2000) – Internet router-level topology is “scale-free” (Definition of “scale-free” is mathematically imprecise.) – High-degree routers are centrally located (“hubs”) – Router-level topology has hub-like core – Discovery of the “Achilles’ heel” of the Internet

On Resilience to Data Ambiguity • Traceroute-based measurements – Bias (location of sources) – Incompleteness (number of destinations) – Errors (alias resolution) – Layer 3 (IP) vs. layer 2 issues • Inferred node degree distribution – Observed power law may be artifact of data – Where are the highly-connected nodes?
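The incompleteness issue can be illustrated with a toy experiment. The sketch below assumes a single measurement source that learns exactly one shortest path to every reachable node, which is a strong simplification of real traceroute behavior; alias resolution errors and layer 2 effects are not modeled, and all names are hypothetical.

```python
import random
from collections import deque

def random_graph(n, p, seed=0):
    """Undirected random graph: each of the n*(n-1)/2 pairs is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def traceroute_view(adj, source):
    """Adjacency seen when one source follows a single shortest path to each node.

    The observed edges form a breadth-first tree rooted at the source, so any
    link that lies on none of the chosen paths stays invisible.
    """
    seen = {v: set() for v in adj}
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                seen[u].add(v)
                seen[v].add(u)
                queue.append(v)
    return seen
```

Comparing `len(adj[v])` with `len(seen[v])` for each node shows how far the inferred degrees can drift from the true ones. In this toy setup the true degrees are fairly homogeneous, while the observed tree concentrates degree near the source.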

Internet Topology: Scale-Free Models • Exploit semantic context of available data – Core routers have low degrees – High-degree routers at the edge of the network – Lack of high variability in router-level core networks

Node degree distribution for AS 7018 (Rocketfuel) [Figure: node rank vs. node degree, log-log axes. Series: all nodes, r1 nodes, r0 nodes.] High variability is toward the network edge. • Nodes categorized by “radius” • “r0” nodes are most “central” (i.e., in the network core)

A closer look at “r0” (core) nodes… [Figure: Degree Distribution for AS 7018 - By Router Type. Node rank vs. node degree, log-log axes. Series: all core routers, access routers, backbone routers.] • Access routers: traffic aggregation within each POP • Backbone routers: connectivity between POPs

Model Validation: Scale-Free Models • Exploit semantic context of available data – Core routers have low degrees – High-degree routers at the edge of the network – Lack of high variability in router-level core networks • Scale-free models and Internet topology – Not resilient to ambiguities in the data – Externally inconsistent (hub nodes in the core) – Ignore all engineering details – Example of being “certifiably wrong” – The Internet is exactly the opposite of what scale-free models claim in essentially every meaningful aspect

[Figure: topology panels labeled PA, HOT, Abilene-inspired, PLRG, and Sub-optimal.]

Internet Topology: Scale-Rich Models • Key assumption – Heuristically optimized topology (HOT) design • Approach – Perspective of individual Internet Service Provider (ISP) – Consider economic and technological forces at work – Reconcile engineering tradeoffs in design • Key implications – Mesh-like core of low degree routers – High-degree nodes are at the edge of the network – The Internet “Achilles’ heel” is not connectivity • Scale-rich models and Internet topology – Resilient to ambiguities in the data – Externally consistent – Example of being “approximately right”

Router Technology Constraint [Figure: Cisco 12416 GSR, circa 2002. Bandwidth (Gbps) vs. degree, log-log axes, showing total bandwidth and bandwidth per degree. Labeled configurations: 15 x 10 GE, 15 x 3 x 1 GE, 15 x 4 x OC12, 15 x 8 FE. The technology constraint runs from high bandwidth, low degree to high degree, low bandwidth.]
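The tradeoff on this slide can be tabulated from the configuration labels. The sketch below assumes each label means line cards times ports per card (with one port per card for the 10 GE case) and uses nominal interface line rates; these readings are assumptions, not values taken from the plot.

```python
# Nominal line rates in Gbps for each interface type.
LINE_RATE_GBPS = {"10GE": 10.0, "1GE": 1.0, "OC12": 0.622, "FE": 0.1}

# (line cards, ports per card, interface type), as read from the slide labels.
# Treating "15 x 10 GE" as 15 cards with one port each is an assumption.
CONFIGS = [(15, 1, "10GE"), (15, 3, "1GE"), (15, 4, "OC12"), (15, 8, "FE")]

def degree_and_bandwidth(cards, ports_per_card, interface):
    """Return (router degree, total bandwidth in Gbps) for one configuration."""
    degree = cards * ports_per_card
    total_gbps = degree * LINE_RATE_GBPS[interface]
    return degree, total_gbps

if __name__ == "__main__":
    for cards, ports, interface in CONFIGS:
        degree, total = degree_and_bandwidth(cards, ports, interface)
        print(f"{interface:>5}: degree {degree:3d}, total {total:6.1f} Gbps")
```

Under these assumptions the degree rises from 15 to 120 across the four configurations while the total bandwidth falls, which is the high bandwidth, low degree versus high degree, low bandwidth tradeoff named on the slide.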

Aggregate Router Feasibility [Figure: approximate aggregate feasible region, with areas labeled core technologies, edge technologies, and older/cheaper technologies.] Source: Cisco Product Catalog, June 2002

Heuristically Optimal Topology • Mesh-like core of fast, low-degree routers • High-degree nodes are at the edges [Figure: layers labeled Cores, Edges, Hosts.]

[Figure: Abilene Backbone Physical Connectivity (as of December 16, 2003). Map of the backbone nodes (Seattle, Sunnyvale, Los Angeles, Denver, Kansas City, Houston, Indianapolis, Chicago, Atlanta, Wash D.C., New York) and the regional networks and institutions attached to them. Link capacity legend: 0.1-0.5 Gbps, 0.5-1.0 Gbps, 1.0-5.0 Gbps, 5.0-10.0 Gbps.]

[Figure: Connection speed (Mbps) vs. rank (number of users), log-log axes. From fastest to slowest: Ethernet 1-10 Gbps (high performance computing); Ethernet 10-100 Mbps (academic and corporate); Broadband Cable/DSL ~500 Kbps and Dial-up ~56 Kbps (residential and small business). A few users have very high speed connections; most users have low speed connections.]
• High variability in population density
• High variability in willingness to pay for bandwidth by end users

Router-Level Topologies: Rocketfuel
Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP Topologies with Rocketfuel. ACM SIGCOMM 2002.

AS     Name               Routers    Links   POPs
1221   Telstra (Aus.)       4,440    4,996     54
1239   Sprintlink (US)     11,889   15,263     25
1755   Ebone (EU)             438    1,192     26
2914   Verio (US)           7,574   19,175    103
3257   Tiscali (EU)           618      839     52
3356   Level 3 (US)         2,064    8,669     44
3967   Exodus (US)            688    2,166     22
4755   VSNL (India)           664      484      8
6461   Abovenet (US)          843    2,667     22
7018   AT&T (US)           13,993   18,083    109

Validation from ISPs: “good” to “excellent”

External Consistency: Improving Rocketfuel
Approach:
• Use additional context-specific information to validate and augment the data collected by Rocketfuel
• Use knowledge about Heuristically Optimal Topology to “reverse-engineer” the structure within an ISP Point of Presence (PoP)
• Unexpected result: node duplicates in large PoPs
AS 7018: 9261 total nodes, 640 core nodes, 156 duplicates (24%), 484 unique core nodes
AS 1239: 7043 total nodes, 673 core nodes, 215 duplicates (32%), 458 unique core nodes
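The duplicate figures for the two ASes can be cross-checked with a few lines of arithmetic. The helper below is purely illustrative; it assumes the quoted percentage is the share of duplicates among core nodes, rounded to the nearest integer.

```python
def dedupe_summary(core_nodes, duplicates):
    """Return (unique core nodes, duplicate share in percent, rounded)."""
    unique = core_nodes - duplicates
    share_percent = round(100 * duplicates / core_nodes)
    return unique, share_percent

if __name__ == "__main__":
    print("AS 7018:", dedupe_summary(640, 156))
    print("AS 1239:", dedupe_summary(673, 215))
```

Under that assumption the helper reproduces the unique-node counts and percentages listed on the slide for both AS 7018 and AS 1239.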

AS 7018: Phoenix, AZ

Lessons Learned
High Variability and Scaling Distributions
• Don’t be surprised!
• Don’t fight high variability when it’s apparent!
– There are ways to check for genuine high variability
• Exploit high variability when it’s there!
– Provides basis for explanatory modeling
• Don’t force high variability when it’s absent!
– A straight-looking log-log plot is not a proof
Internet Modeling
• Need for internal and external consistency
• Need for “closing the loop”: empirical validation
• Explanatory and not merely descriptive modeling

Some References
• W. Willinger, D. Alderson, J. C. Doyle, and L. Li, More “normal” than Normal: Scaling distributions in complex systems, WSC 2004.
• W. Willinger, D. Alderson, and L. Li, A pragmatic approach to dealing with high-variability in network measurements, Proc. ACM SIGCOMM IMC 2004, Taormina, Italy.
• L. Li, D. Alderson, W. Willinger, and J. Doyle, A first-principles approach to understanding the Internet’s router-level topology, Proc. ACM SIGCOMM 2004, Portland, OR.
• D. Figueiredo, B. Liu, A. Feldmann, V. Mishra, D. Towsley, and W. Willinger, On TCP and self-similar traffic, Performance Evaluation (to appear).
• W. Willinger, R. Govindan, S. Jamin, V. Paxson, and S. Shenker, Critically examining criticality: Scaling phenomena in the Internet, PNAS, Vol. 99, 2002.
• X. Zhu, J. Yu, and J. C. Doyle, Heavy Tails, Generalized Coding, and Optimal Web Layout, Proc. of the IEEE Infocom 2001.

More “normal” than Normal: Scaling distributions in complex systems
Walter Willinger (AT&T Labs-Research), David Alderson (Caltech), John C. Doyle (Caltech), Lun Li (Caltech)
alderd@cds.caltech.edu
www.cds.caltech.edu/~alderd/topology/