Calculation of uptime and service levels – a theoretical approach
TERENA Networking Conference 2012, Reykjavik, 23/5-2012
Deputy Director Martin Bech, UNI-C/DeIC
martin.bech@uni-c.dk

The challenge: SLA calculation
• The days of "best effort" are long gone…
• We are expected to have an SLA with a specific operational efficiency for each of our services
• We often create our services by combining components and contributions from subcontractors with varying SLAs
• How do you calculate the uptime/service level you can safely promise for the combined service?
23-05-2012 SLA Calculation - Martin Bech

Availability – the central measurement of operational efficiency
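As a hedged reminder, availability over a measurement period is conventionally defined as the share of the period in which the service was up. The symbols below are illustrative and not taken from the slides.

```latex
A = \frac{T_{\text{up}}}{T_{\text{up}} + T_{\text{down}}},
\qquad
\text{relative downtime} = 1 - A = \frac{T_{\text{down}}}{T_{\text{up}} + T_{\text{down}}}
```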

Service Levels – how are they specified?

99.7% - what does that actually mean?
• A guaranteed operational efficiency of 99.7%: does that mean that the service is down 0.3% of the time?
• No, hopefully it is down much less!
• According to our observations, simple WAN connections are down only 0.11% of the time (explanation will follow…)
• However, the suppliers have chosen 0.3% in order not to pay penalties too often
• In fact, the service fails to meet the guarantee in only 6.2% of all quarters, or roughly once every 4 years

The relation between guaranteed and actual uptime
In order to get an idea about this relation, we look at those WAN lines of the Danish NREN which have the following properties:
• They are single WAN lines without protection or redundancy
• They include both dark fibres and transmission capacities
• We have our own surveillance measurements for them
• What we measure is as close to the performance of the supplier as we can get
• They have guaranteed uptimes of 99.7%, calculated over a quarter (corresponding to a downtime of 6.5 hours/quarter)
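The 6.5 hours/quarter figure can be checked with a minimal sketch. The quarter length of 91.25 days (365/4) and the function name are assumptions made for illustration.

```python
def allowed_downtime_hours(guaranteed_uptime: float, period_days: float = 91.25) -> float:
    """Hours of downtime permitted by an uptime guarantee over one period.

    period_days defaults to 365/4, an assumed average quarter length.
    """
    return (1.0 - guaranteed_uptime) * period_days * 24.0


# A 99.7% guarantee over one quarter allows roughly six and a half hours.
print(round(allowed_downtime_hours(0.997), 2))
```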

In our network…

…we are monitoring a lot of units

We select the relevant WAN links (single links with 99.7% guaranteed uptime over a quarter)

The monitoring details
• Done with "Linux style" ping -c 5 -s 10
• Every 10 minutes
• If the packet loss is 0 and the ping time is under a threshold, the measurement is marked as "green" (well, blue, actually)
• If the packet loss is 100%, the measurement is marked as "red"
• Anything between those two is marked as "yellow"
• All measurements that are not "green" are counted as downtime
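The colour rules above can be sketched as a small classifier. The threshold value, the `PingResult` type and the function names are hypothetical; the slides mention a ping-time threshold but give no number.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: the slides do not state the actual value.
RTT_THRESHOLD_MS = 100.0


@dataclass
class PingResult:
    packet_loss_pct: float        # 0.0 to 100.0, from one 5-packet ping run
    avg_rtt_ms: Optional[float]   # None when no replies were received


def classify(result: PingResult) -> str:
    """Map one measurement to 'green', 'yellow' or 'red'."""
    if result.packet_loss_pct >= 100.0:
        return "red"
    if (result.packet_loss_pct == 0.0
            and result.avg_rtt_ms is not None
            and result.avg_rtt_ms < RTT_THRESHOLD_MS):
        return "green"
    return "yellow"


def counts_as_downtime(result: PingResult) -> bool:
    """Every measurement that is not green counts as downtime."""
    return classify(result) != "green"
```

Note that a measurement with no packet loss but a slow reply ends up "yellow" and therefore counts as downtime, which makes the measured downtime a conservative estimate.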

Calculation of relative downtime is completely traditional
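A minimal sketch of such a traditional calculation, assuming relative downtime is the share of measurements in a quarter that are not "green". The function name and the sample data are made up for illustration.

```python
def relative_downtime(statuses):
    """Fraction of measurements in a period that were not 'green'."""
    if not statuses:
        raise ValueError("no measurements in period")
    not_green = sum(1 for status in statuses if status != "green")
    return not_green / len(statuses)


# Invented sample: one measurement every 10 minutes over an assumed
# 91.25-day quarter gives 91.25 * 144 = 13140 measurements.
sample = ["green"] * 13100 + ["yellow"] * 25 + ["red"] * 15
print(f"{relative_downtime(sample):.4%}")
```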

We look at 27 non-protected WAN lines – all with 99.7% guaranteed uptime
[Figure legend: ≥ 99.7%, < 99.7%, Discarded]

Measurements from 450 quarters (corresponding to some 112 years)
[Histogram: no. of quarters versus relative downtime]
The whole dataset is not shown here, as there are quarters with higher downtime

The long tail (here as a probability distribution for the whole dataset)
[Plot annotation: Outliers]

We discard the high values (outliers)
We ignore them because:
• Erroneous measurements: some of them are not the supplier's fault, but are owing to a practice of stopping surveillance a while after a connection is taken down
• Lack of registration of exceptions: some of them are due to service windows that are not registered in the surveillance system
• Force majeure: a few are due to tragic accidental outages that we may presume were not foreseen by the supplier, and which were therefore not taken into account when he calculated prices, guarantees and penalties

Looks like an exponential distribution

The exponential probability distribution
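For reference, the standard textbook form of the exponential distribution is given below. Here x is the relative downtime in a quarter, μ its mean and g a guaranteed maximum downtime; these symbol names are assumptions, not notation from the slides.

```latex
f(x) = \frac{1}{\mu}\, e^{-x/\mu}, \quad x \ge 0,
\qquad
P(X \le g) = 1 - e^{-g/\mu},
\qquad
P(X > g) = e^{-g/\mu}
```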

Distribution of relative downtimes of single WAN lines in the Danish NREN
[Plot: probability versus relative downtime, with markers for the supplier's guarantee and the supplier's actual average performance]

The cumulative distribution
93.85% of the quarters are within the guaranteed availability, corresponding to a breach of guarantee in 6.15% of the quarters, or approximately every 4th year
[Plot marker: guaranteed maximum downtime 0.3%]
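If quarterly downtime is taken to be exponentially distributed, the observed breach share and the guarantee together fix the mean downtime. The sketch below assumes exactly that model; the function name is illustrative.

```python
import math


def mean_downtime_from_breach(guaranteed_downtime: float, breach_fraction: float) -> float:
    """Mean relative downtime implied by an exponential model.

    Uses the exponential tail P(X > g) = exp(-g / mean),
    so mean = g / -ln(P(X > g)).
    """
    return guaranteed_downtime / -math.log(breach_fraction)


# 0.3% guaranteed maximum downtime, breached in 6.15% of quarters.
mean = mean_downtime_from_breach(0.003, 0.0615)
print(f"{mean:.4%}")
```

The result is close to the 0.11% average downtime quoted earlier for simple WAN connections.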

A few bold assumptions and we are ready to calculate our SLA

Conversion to average downtime

Guaranteed uptime    Average uptime
95%                  98.20%
96%                  98.56%
97%                  98.92%
98%                  99.28%
99%                  99.64%
99.5%                99.82%
99.7%                99.892%
99.8%                99.928%
99.9%                99.964%
99.95%               99.982%
99.99%               99.9964%
100%                 100%
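The values in this conversion are consistent with a fixed ratio of 0.36 between average and guaranteed downtime (for example, 5% guaranteed downtime maps to 1.8% average downtime). That ratio is inferred from the numbers, not stated on the slides, and the function names below are illustrative.

```python
# Inferred from the conversion values (1.8% / 5% = 0.36); an assumption.
DOWNTIME_RATIO = 0.36


def guarantee_to_average(guaranteed_uptime: float) -> float:
    """Convert a guaranteed uptime into the expected average uptime."""
    return 1.0 - DOWNTIME_RATIO * (1.0 - guaranteed_uptime)


def average_to_guarantee(average_uptime: float) -> float:
    """Inverse conversion: average uptime back into a guaranteed uptime."""
    return 1.0 - (1.0 - average_uptime) / DOWNTIME_RATIO


for g in (0.95, 0.997, 0.9995):
    print(f"{g:.2%} -> {guarantee_to_average(g):.4%}")
```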

Serial connection: When both parts need to be up
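In the usual textbook form, and assuming the two parts fail independently, the probability that a serial connection is up is the product of the individual probabilities. The symbols p_1 and p_2 are illustrative.

```latex
p_{\text{serial}} = p_1 \cdot p_2
```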

Special case of serial connection: Many identical systems
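For n identical and independent systems in series, each up with probability p, the standard product rule reduces to a power. The symbols are again illustrative.

```latex
p_{\text{serial}} = \prod_{i=1}^{n} p = p^{\,n}
```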

Parallel connection: When just one system needs to be up
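Under the same independence assumption, a parallel connection is down only when both systems are down at once, which gives the familiar form below (illustrative symbols).

```latex
p_{\text{parallel}} = 1 - (1 - p_1)(1 - p_2)
```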

Now we master the process, with a serial connection of two 95% services as an example
• Guaranteed uptime: 95% and 95%
• Converted to probabilities (or performance): 98.2% and 98.2%
• Multiplication of probabilities gives the combined probability: 96.43%
• Converted back into a guaranteed uptime: 90.09%
Which is (slightly) different from the less accurate result we get if we treat the guaranteed uptimes as probabilities: 95% ∙ 95% = 90.25%
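The arithmetic of this example can be written out step by step. The factor 0.36 is the ratio between average and guaranteed downtime implied by the 95% to 98.2% conversion; it is an inferred value, and the symbols are illustrative.

```latex
\begin{align*}
p &= 1 - 0.36\,(1 - 0.95) = 0.982 \\
p_{\text{serial}} &= 0.982 \cdot 0.982 = 0.964324 \\
g_{\text{serial}} &= 1 - \frac{1 - 0.964324}{0.36} = 1 - 0.0991 = 0.9009
\end{align*}
```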

Another example, with a parallel connection of two 95% services
• Guaranteed uptime: 95% and 95%
• Converted to probabilities (or performance): 98.2% and 98.2%
• Combination of probabilities gives the combined probability: 99.968%
• Converted back into a guaranteed uptime: 99.91%
Which is (slightly) different from the less accurate result we get if we treat the guaranteed uptimes as probabilities: 1 - (1 - 95%) ∙ (1 - 95%) = 99.75%
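The parallel example can be reproduced with a short sketch. It assumes independent failures and the 0.36 downtime ratio implied by the 95% to 98.2% conversion; the names are illustrative.

```python
# Ratio of average to guaranteed downtime, inferred from 95% -> 98.2%.
DOWNTIME_RATIO = 0.36


def parallel_guarantee(guarantees):
    """Guaranteed uptime of a parallel connection of independent services.

    Converts each guarantee to an average downtime probability, multiplies
    them (all must be down at once), and converts the result back.
    """
    combined_down = 1.0
    for g in guarantees:
        combined_down *= DOWNTIME_RATIO * (1.0 - g)
    return 1.0 - combined_down / DOWNTIME_RATIO


result = parallel_guarantee([0.95, 0.95])
naive = 1.0 - (1.0 - 0.95) * (1.0 - 0.95)
print(f"converted: {result:.2%}, naive: {naive:.2%}")
```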

Why bother with this method?
• For a serial connection of 2 WAN lines with 99.7% guaranteed uptime, the resulting guaranteed uptime is 99.40032%, which is not significantly different from 99.7% ∙ 99.7% = 99.4009% (half a minute over a whole quarter)
• However, as we have seen, for larger downtimes and more complex situations this new method produces significantly more accurate results
• And this new insight also lets you do something else: calculate the expected penalties if you change the guaranteed uptime (see the examples on the following pages…)
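As a hedged illustration of the penalty idea: under the exponential model, the share of quarters in which a guarantee is breached follows from the offered guarantee and the average downtime. The sketch assumes a single line with 0.108% average downtime (the 99.892% average uptime that corresponds to a 99.7% guarantee); the function name is illustrative.

```python
import math


def breach_probability(offered_downtime: float, mean_downtime: float) -> float:
    """Share of quarters in which downtime exceeds the offered guarantee.

    Assumes quarterly downtime is exponentially distributed.
    """
    return math.exp(-offered_downtime / mean_downtime)


MEAN_DOWNTIME = 0.00108  # 0.108% average downtime, an assumed input

for offered in (0.002, 0.003, 0.005):
    p = breach_probability(offered, MEAN_DOWNTIME)
    print(f"guarantee {1 - offered:.1%}: breached in {p:.1%} of quarters")
```

Multiplying the breach probability by the contractual penalty per breach would give a rough expected penalty per quarter.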

The SLA of the Danish NREN (for the basic connectivity)
• Guaranteed uptime of 99.7%, measured over each quarter for each connection
• Guaranteed uptime of 100.0% for optical connections which are protected by redundancy

Example: Odense-Copenhagen P2P
• SDU Odense to Panum POP: national ring, uptime 100.0%
• Panum POP to SIF: dark fibre Panum-SIF, uptime 99.7%
• SIF to DKUNI: dark fibre SIF-DKUNI, uptime 99.7%
• Resulting uptime for DKUNI: 99.4%
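A path like this one can be evaluated with a small helper for serial chains of any length. The sketch assumes independent segments and the 0.36 downtime ratio implied by the conversion values earlier in the talk; the names are illustrative.

```python
# Ratio of average to guaranteed downtime, inferred from the conversion values.
DOWNTIME_RATIO = 0.36


def serial_guarantee(guarantees):
    """Guaranteed uptime of independent segments connected in series.

    Converts each guarantee to an average uptime probability, multiplies
    them (all must be up), and converts the product back to a guarantee.
    """
    combined_up = 1.0
    for g in guarantees:
        combined_up *= 1.0 - DOWNTIME_RATIO * (1.0 - g)
    return 1.0 - (1.0 - combined_up) / DOWNTIME_RATIO


# National ring (100.0%), then two dark fibres (99.7% each).
path = [1.0, 0.997, 0.997]
result = serial_guarantee(path)
print(f"{result:.5%}")
```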

Example 2: Odense-Sønderborg P2P
[Diagram: SDU Esbjerg, dark fibre (uptime 99.7%), Kolding POP, dark fibre (uptime 99.7%), national ring (uptime 100.0%), SDU Odense, transmission capacity (uptime 99.7%), SDU Sønderborg]

You are now equipped to do this yourselves!
You can now calculate:
• What SLAs you can promise your users
• Whether the SLAs in your suppliers' contracts are adequate given what you have already promised your users
• What penalties you are likely to pay or receive in the future
• Which SLAs are sufficient to specify when procuring services in the future
Bon courage…

For your handbook of formulas for combining services with exponentially distributed downtimes