From SLA to HADR solution tg Thomas Grohser

select * from =tg= where topic = =tg= Thomas Grohser, NTT DATA Senior Director

NTT DATA Overview • 20, 000 professionals – Optimizing balanced global delivery • $1.

Agenda § § § What can go wrong? A look at the SLA A

What can go wrong? § § § Single component failure Server failure Datacenter failure

A look at the SLA § Don’t confuse luck with keeping the SLA! §

RPO – Recovery Point Objective § In plain English: How much data can we

RTO – Recovery Time Objective § In plain English: How much time after a

HA/DR/LR explained § HA – High Availability § § RTO: seconds to minutes RPO:

A good bare minimum SLA § § § HA RTO RPO Single component failure

Sample SLA § § § RTO RPO LR 3 days up to one day

Backup / Restore § Available since SQL 4. x (1. x but that one

Log Shipping § Available since SQL 6. 0 (manual possible before) § Take Transaction

Failover Clustering § Available since SQL 6. 5 (useful since 7. 0) § §

Database Mirroring § Available since SQL 2005 § Transactions are set to a secondary

Always. On FCI – Failover Cluster Instance § Available since SQL 2012 § Very

Always. On AG – Availability Groups § Available since SQL 2012 (enhanced in 2014

Other HA/DR Technologies § use at your own risk § Transactional/Merge/Peer 2 Peer Replication

Combining HA/DR Features § Backup/Restore § Must (not can – must) be combined with

RTO - Summary § § § § Detect Failure Verify Failure Notify Failure (only

How long does it take to detect a failure § The more severe the

How long does it take to react to a failure § Automatic Failover (seconds,

How long does it take to fail over § It depends § Backup Restore

Database Recovery § It depends § On oldest uncommitted transaction § Number and size

RTO/RPO Improvement § To handle data deletion / corruption § Multiple delayed log shipping

THANK YOU! and may the force be with you… Questions? thomas. grohser@nttdata. com tg@grohser.

Slides: 27

Download presentation

From SLA to HA/DR solution =tg= Thomas Grohser, NTT Data SQL Server MVP SQL Server Performance Engineering SQL Saturday #545 Pittsburgh, PA October 1 st 2016

select * from =tg= where topic = =tg= Thomas Grohser, NTT DATA Senior Director Technical Solutions Architecture email: Thomas. grohser@nttdata. com / tg@grohser. com Focus on SQL Server Security, Performance Engineering, Infrastructure and Architecture New Papers coming 2016 @@Version Remark SQL 4. 21 First SQL Server ever used (1994) SQL 6. 0 First Log Shipping with failover SQL 6. 5 First SQL Server Cluster (NT 4. 0 + Wolfpack) SQL 7. 0 2+ billion rows / month in a single Table SQL 2000 938 days with 100% availability SQL 2000 IA 64 First SQL Server on Itanium IA 64 SQL 2005 IA 64 First OLTP long distance database mirroring SQL 2008 IA 64 First Replication into mirrored databases SQL 2008 R 2 IA 64 SQL 2008 R 2 x 64 First 256 CPUs & >500. 000 STMT/sec 22 Years SQL Server First Scalewith out > 1. 000 STMT/sec First time 1. 2+ trillion rows in a table Close Relationship with • SQLCAT (SQL Server Customer Advisory Team) • SCAN (SQL Server Customer Advisory Network) • TAP (Technology Adoption Program) • Product Teams in Redmond Active PASS member and PASS Summit Speaker SQL 2012 > 220. 000 Transactions per second > 1. 3 Trillion Rows in a table SQL 2014 > 400. 000 Transactions per second Fully automated deploy and management Always. On Automatic HA and DR SQL 2016 Can’t wait to raise the bar again

NTT DATA Overview • 20, 000 professionals – Optimizing balanced global delivery • $1. 6 B – Annual revenues with history of above-market growth • Long-term relationships – >1, 000 clients; mid-market to large enterprise • Delivery excellence – Enabled by process maturity, tools and accelerators • Flexible engagement – Spans consulting, staffing, managed services, outsourcing, and cloud • Industry expertise – Driving depth in select industry verticals Why NTT DATA for MS Services: • NTT DATA is a Microsoft Gold Certified Partner. We cover the entire MS Stack, from applications to infrastructure to the cloud • Proven track record with 500+ MS solutions delivered in the past 20 years

Agenda § § § What can go wrong? A look at the SLA A tour on HA/DR features/options The Surprise Q&A ATTENTION: Important Information may be displayed on any slide at any time! ! Without Warning !

What can go wrong? § § § Single component failure Server failure Datacenter failure Network failure between datacenters Data loss/corruption § There is a lot more that can go wrong but we have only one hour today

A look at the SLA § Don’t confuse luck with keeping the SLA! § It is easy to achieve 99. 999% § It is almost impossible to guarantee § SLA burns down to § RPO § RTO different values during different times of the day/week/month

RPO – Recovery Point Objective § In plain English: How much data can we lose? § None § This is what you always aim for § Some § § Seconds Minutes Hours Days § All § This is not the worst that can happen § All and the competition “finds” it

RTO – Recovery Time Objective § In plain English: How much time after a failure till we have to be up and running again § None § Impossible § Some § § Seconds Minutes Hours Days § Forever

HA/DR/LR explained § HA – High Availability § § RTO: seconds to minutes RPO: Zero to seconds Automatic failover Well tested (maybe with each patch or release) § DR – Disaster Recovery § § RTO: minutes to hours RPO: seconds to minutes (even hours) Manual failover into prepared environment Tested from time to time § LR – Last Resort § RTO: days to weeks § RPO: minutes to hours (even a whole day) § Rebuild system from scratch (Hardware has to be ordered, Floor space, connectivity to be rented) § Have a rough plan

A good bare minimum SLA § § § HA RTO RPO Single component failure Server failure Datacenter failure Network failure between datacenters Data loss/corruption § § § Days a lot Deletion Corruption Sabotage HA HA (DR) DR HA/DR HA/DR/LR DR (LR) LR (DR) Any different double failure Any triple failure HA DR DR LR LR Priorities § § to LR Hours Minutes Concurrent failures § § DR Minutes Seconds Failure Modes (single failures) § § § Seconds Close to Zero Recoverability (no data loss) Availability (keep going) Performance (keep running) Defined maintenance windows and adjusted values based on the day and time of the day

Sample SLA § § § RTO RPO LR 3 days up to one day 30 Minutes after end of window before maintenance window 3 days up to one day Deletion Corruption Sabotage HA HA DR DR LR Concurrent failures § § § Any different double failure Any triple failure HA DR/LR LR Priorities § § N/A Single component failure Server failure Datacenter failure Network failure between datacenters Data loss/corruption § § DR < 30 Minutes up to 1 minute Failure Modes (single failures) § § § < 30 Seconds Close to Zero During Maintenance Window § § § HA RTO RPO Recoverability (no data loss) Availability (keep going) Performance (keep running) Defined maintenance windows Saturday 1 pm till Sunday 9 pm except if EOM

A tour on HA/DR features/options

Backup / Restore § Available since SQL 4. x (1. x but that one didn’t run on Windows) § A backup is a consistent point in time COPY of the database in a different location Backups contain all data up to the point the backup finishes Snapshots are not backups Backups need to be verified (restored and then DBCC CHECKDB validated) RTO § § § Depends on § backup/restore solution and disk speed and DB size § § § Native Backup/Restore § 1 Gb/s network § Local Rotating Spindle 5 -6 GB/minute 300 -350 GB/hour/NIC 5 -12 GB/minute 500 -700 GB/hour/spindle Database Recovery Time RPO § Depends on the backup frequency and if the last backup is good § There is no excuse for not having a valid backup!

Log Shipping § Available since SQL 6. 0 (manual possible before) § Take Transaction Log backups and optional restore them as fast as possible or with a delay to one or more targets. § Manual failover § Failback possible but not if managed via GUI § RTO § Manual process § Depends on § frequency and timing of backup, copy, restore job § Database Recovery Time § RPO § Depends on frequency and timing of backup and copy job § Restores fail from time to time lower RPO till fixed § Stagger Backup, Copy and Restore jobs (by default at the same time) § Backup, copy a minute later, restore a minute later § Don’t run the jobs at the same time for all databases

Failover Clustering § Available since SQL 6. 5 (useful since 7. 0) § § § Disk are connected to multiple servers and are used only on the active node. The SQL Server Service is installed on all nodes but active only on one at a time. The most important component (data) is the only one that does not exists twice Depending on technology the nodes have to be close to very close to each other RTO § § Automatic failover in some cases Depends on § § § § Storage System Number of LUNS (a. k. a. Volumes, Disks) System Shutdown Time (depends on amount of memory and last checkpoint time) System Startup Time Database Recovery Time Depends on DNS refresh time if cross network failure happens RPO § § Since the same disks are used its either zero (good) or forever (bad) Not covering data deletion/corruption

Database Mirroring § Available since SQL 2005 § Transactions are set to a secondary machine that holds a copy of the database and committed on all machines. Synchronous or asynchronous can be configured per replica. Replicas can be made readable. § Automatic Failover if witness is configured or odd number of replicas. § Works great across datacenters, can be combined with Log Shipping and/or Clustering § RTO § Database is already running with a warm cache on the secondary § Depending on the size of the redo queue § Depending on Database Recovery Time § RPO § Depending on synchronous (zero) or asynchronous (depending on network speed and transaction volume) § Issue: if secondary fails primary continues without secondary after a few seconds. Then RPO can increase. This must be caught by monitoring § Not covering data deletion might cover some cases of physical data corruption

Always. On FCI – Failover Cluster Instance § Available since SQL 2012 § Very similar to Failover Clustering § Better failure detection and automatic failover § Other than that same RTO/RPO than clustering

Always. On AG – Availability Groups § Available since SQL 2012 (enhanced in 2014 and 2016) § Transactions are set to multiple secondary machines that hold copies of the database and are committed to both machines. Synchronous or asynchronous commit can be configured per replica. Replicas can be made readable. Automatic Failover if witness is configured or odd number of replicas. Works great across datacenters, can be combined with Log Shipping and/or Always. On FCI § § § RTO § § § Database is already running with a warm cache on the secondary’s Depending on redo queue size Depending on Database Recovery Time Depends on DNS refresh time if cross network failure happens RPO § Depending on synchronous (zero) or asynchronous (depending on network speed and transaction volume) § Issue: if secondary’s fails primary continues without secondary after a few seconds. Then RPO can increase. This must be caught by monitoring § Not covering data deletion might cover some cases of physical data corruption

Other HA/DR Technologies § use at your own risk § Transactional/Merge/Peer 2 Peer Replication § All kinds of storage replications § Please do a real test when using more than one disk § Real Test = Full random writing load generated by SQL Server (e. g. SQL Stress) on all disks and then pull the plug § If the databases come online on the other side with no data loss and corruption try again § If this works 3 times you will be the first person on the planet to have successfully tested this § All kinds of VM replication § Magic and luck (a. k. a. software driven solutions)

Combining HA/DR Features § Backup/Restore § Must (not can – must) be combined with all others. You need backups of your database § Log Shipping § Should (maybe even must) be combined with all others. § Some extra work required if combined with Mirroring/Always. On Availability Groups § Log backups must be taken from secondary after failover § Mirroring/Always. On Availability Groups without automatic failover and clustering works as expected § Mirroring/Always. On Availability Groups with automatic failover and clustering always creates a double failover

RTO - Summary § § § § Detect Failure Verify Failure Notify Failure (only with manual failover) Verify Failure (only with manual failover) Initiate Failover Database Recovery Repointing users (only with manual failover)

How long does it take to detect a failure § The more severe the problem the shorter usually the detection time § How often do you check your system health? § Every 5 minutes § That means if I as a DBA can failover to system to DR in less then 4. 9 minutes its not actually an outage?

How long does it take to react to a failure § Automatic Failover (seconds, depending on configuration) § Manual Failover § § § Who is the DBA on call Does he/she pick up immediate Communicate the problem Connect to Network Login

How long does it take to fail over § It depends § Backup Restore Hours § Log Shipping Minutes § Clustering / Always. On FCI Seconds to Minutes § Plus time to mount the disks § Eventual time to verify the disks (since Windows Server 2012 that’s seconds, could be hours before that). § DB Mirroring / Always. On AG Seconds § A bit more its cross network failover

Database Recovery § It depends § On oldest uncommitted transaction § Number and size of transactions in flight that need rollback § So somewhere between § Less than a second § Several weeks

RTO/RPO Improvement § To handle data deletion / corruption § Multiple delayed log shipping copies of the database § Stop restore the moment you detect corruption, recover till before corruption happened, bring database online § Recover data § To minimize Database Recovery Time § Monitor long running transactions and alert on them. Design systems around them. § Partition large tables and rebuild indexes partition by partition

THANK YOU! and may the force be with you… Questions? thomas. grohser@nttdata. com tg@grohser. com