Establishing a Service Level Agreement SLA tg Thomas

Establishing a Service Level Agreement SLA =tg= Thomas Grohser SQL Server MVP SQL Server Performance Engineering

select * from =tg= where topic = =tg= Thomas Grohser email: tg@grohser. com Web: www. grohser. com twitter: @tgrohser Focus on SQL Server Performance Engineering, Infrastructure and Architecture New Book coming in 2012 Close Relationship with • SQLCAT (SQL Server Customer Advisory Team) • SCAN (SQL Server Customer Advisory Network) • TAP (Technology Adoption Program) • Product Teams in Redmond Active PASS member and PASS Summit Speaker @@Version Remark SQL 4. 21 First SQL Server ever used (1994) SQL 6. 0 First Log Shipping with failover SQL 6. 5 First SQL Server Cluster (NT 4. 0 + Wolfpack) SQL 7. 0 2+ billion rows / month in a single Table SQL 2000 938 days with 100% availability SQL 2000 IA 64 First SQL Server on Itanium IA 64 SQL 2005 IA 64 First OLTP long distance database mirroring SQL 2008 IA 64 First Replication into mirrored databases SQL 2008 R 2 IA 64 SQL 2008 R 2 x 64 First 256 CPUs & >500. 000 STMT/sec First Scale out > 1. 000 STMT/sec First time 1. 2+ trillion rows in a table SQL 2012 (Denali) > 220. 000 Transactions per second Can’t wait to push the limits even further

Agenda • Why & When? • What & How? • Q&A

Why do we need SLA’s • Management and coworkers need to understand agree to reality • Help you to request and argue the resources you need • Avoid lawsuits

Rule Number One! • SLA first • Solution later • If you already have a solution don’t agree to a SLA the solution can’t support

What should be in a SLA? • Everything – – – Operational requirements Maintenance windows Responsibilities Dependencies What happens if the SLA is not met

RPO – Recovery Point Objective • In plain English: How much data can we lose? – Samples • Your last log backup is from 12 minutes ago • Your last full backup is from last week • You do not have a backup 12 minutes 1 week all

RTO – Recovery Time Objective • In plain English: How much time after a failure till we have to be available again • Samples • Your restore takes 6 hours 6+ hours • Your last backup does not work you have to go to tape 24+ hours

Availability • Time the database is available within a period of time divided by the length of the period of time. • Don’t confuse luck with availability! – How fast to you think you can fix data corruption or human error in your database?

Availability • • • 99. 0 % 99. 7 % 99. 9999 % the famous five nines

Availability how long can I be offline Availability 1 Year Days Hours 0% 365. 25 8766 Minutes Seconds 525960 31557600 1 Month Days Hours 30. 4375 730. 5 0. 30 7. 31 438 26298 Minutes Seconds 43830 2629800 99. 0 % 3. 65 87. 66 5260 315576 99. 7 % 1. 10 26. 30 1578 94673 2. 19 131 7889 99. 9 % 8. 77 526 31558 0. 73 44 2630 99. 99 % 0. 88 53 3156 4 263 5 316 26 0. 5 32 3 99. 999% 99. 9999%

Available • Is a database available when – – – – It is online in SSMS? I can login? I can select data? I can update data? I can insert data? I can change the schema? … you get the idea?

When is a database needed? • Is the database used on the web 24 x 7 or just in the office from 9 to 5 or just once a month to process payroll? • Do the availability requirements apply all the time or just during the periods its actually used?

Service Windows • Specify times when you can service your system • The more the better – Every night from 11 pm till 5 am, all day Saturday and Sunday, except the weekend before the year end results are due. – First Sunday every month from 2 am till 4 am

Planned versus Unplanned • Big debate is planned maintenance part of the yearly downtime or not? • Big difference between the two cases • Make sure its clearly defined and understood

Monitoring availability • 99. 999% is equivalent to less than 5. 2 minutes of outage per year or less than a 0. 8 seconds per day • This requires you to do an availability check at least every 0. 4 seconds otherwise you waste valuable seconds.

Availability • Having a certain availability vs. guaranteeing it. • Easy to end up with 100% availability • Hard to guarantee even 99. 7

Differentiate between HA, DR and LR • HA … High Availability • DR … Disaster recovery • LR … Last Resort • Have different RPO and RTO values for all three cases. • Define worst case scenarios each level has to deal with

HA … High Availability • • • RTO: seconds to minutes RPO: Zero to seconds Automatic failover HA site usually close by (< 50 km) Well tested (maybe with each patch or release)

DR … Disaster recovery • • • RTO: minutes to hours RPO: seconds to minutes (even hours) Manual failover into prepared environment DR site usually several hundred km away Tested from time to time

LR … Last Resort • RTO: days to weeks • RPO: minutes to hours (even a whole day) • Rebuild system from scratch – Hardware has to be ordered – Floor space, connectivity to be rented • LR site usually on different continent and jurisdiction • Have a rough plan

Define worst case scenario for HA • Failure of a single component • Failure of two components (which are of a different kind) That means you need • Failure of server everything at least three (3) times (not so easy for disks) • Failure of multiple servers • Failure of any two components

Define worst case scenario for DR • • • Human error Failure of server Failure of multiple servers Partial failure of data center Full failure of data center Failure of multiple data centers

Define worst case scenario for LR • • • Destruction / failure of multiple datacenters Natural disaster Sabotage Political incident (i. e. war, regime change) Destruction of planet earth …

Outside SQL Server • Make sure you state that you depend on the underlying infrastructure and failures of that infrastructure don’t count for you! • Make sure no processes are interfering – Example: async database mirroring + failover OK for loosing data. Only one person allowed to give the OK. 3 weeks vacation and availability is down to 94. 2%

Dependencies • Who needs this database/server • What does this server need to operate – – – – Power Cooling Network Firewall rules Domain Controller Other servers (linked server) …

Responsibilities • Who can actually make a decision for a database/server • Who owns the data • Who needs to be notified if something is wrong

Backup retention and granularity • How far must you be able to go back? – Hours, Days, Weeks, Months, Years – And how accurate must the restore be – I need the database restored to November 15, 2008 at 6: 27… – How much time do you need for this historic restores – Test them from time to time (you need the resources and time)

Tips for keeping the SLA • Make sure your monitoring and alerting works • Monitor your Monitoring • Test your HA, DR, LR solutions regular and especially after every change to your infrastructure.

DR and LR Instructions • Keep printed copies in several places trust me your electronic documentation won’t be there when you are in a DR or LR situation

Summary • Remember Rule Number one – SLA first – Solution later – If you already have a solution don’t agree to a SLA the solution can’t support – The laws of physics apply (even to the best DBA : -)

THANK YOU! For attending this session and PASS SQLRally Nordic 2011, Stockholm