SQL Server 2012 Always On HA and DR

Who and Why • Who am I? • Why I am speaking about this?

Setting the Stage • • • Always. On ≠ Availability Groups Always. On =

Key Learnings from Early Customer Deployments • • • • Windows Cluster (Windows Server

Key Learnings from Early Customer Deployments • • • • Organizational structure Typically, teams

SQL Server 2012 Always. On HA+DR Design Patterns SQL Server 2012 Always. On Solution

SQL Server 2012: Multi-site FCI for HA and DR Primary Site Node 1 Active

SQL Server 2012: AG for HA and DR Primary Data Center Fileshare Witness Primary

SQL Server 2012: FCI for HA, AG for DR Primary Site Windows Server Failover

Lessons Learned from Customer Deployments • We will discuss two Tier-1 Mission-Critical deployments to

Customer Deployment Example: Service. U Corporation

FCI + DBM (prior to SQL Server 2012) (FCIs for local high availability, DBM

FCI + AG (SQL Server 2012) (FCIs for local high availability, AG for Disaster

Lesson 1: Choose Appropriate Quorum Model • For mission-critical infrastructure, Service. U has extra

Lesson 2: Understand Operational Procedures for DR Scenario: Disaster = entire primary site is

Lesson 3: Instance Name & Path Plan the file paths and instance names before

Lesson 4: Concurrent FCI setup on multiple nodes will set possible owners incorrectly Initial

Lesson 5: After adding a node, possible owners = every node in the Windows

Lesson 6: Failover Cluster Manager shows as if a FCI can be moved to

Customer Deployment Example: MSIT SAP ERP Deployment

DBM + LS (Prior to SQL Server 2012) Database Mirroring for local HA, Log

Deployment Architecture: SQL Server 2012 Availability Group for HA and DR File share for

Lesson 1: Working across organizational boundaries • Working with the Windows Team in IT

Lesson 2: Choose Appropriate Quorum Model • Choice of Quorum Model • What do

Lesson 3: Monitoring and Alerting AG • Alerting compared to DBM • Challenge: No

Lesson 4: DO NOT use Windows Failover Cluster Manager to perform AG failover DO

Slides: 29

Download presentation

SQL Server 2012: Always. On HA and DR Design Patterns, and Lessons Learned from Early Customer Deployments Sanjay Mishra SQLCAT

Who and Why • Who am I? • Why I am speaking about this?

Setting the Stage • • • Always. On ≠ Availability Groups Always. On = { SQL Server Failover Cluster Instances, Availability Groups } Availability Groups ≠ Database Mirroring

Key Learnings from Early Customer Deployments • • • • Windows Cluster (Windows Server Failover Cluster, WSFC) is the foundation for HA and DR in SQL Server 2012 Always. On inherits all “characteristics” of Windows Cluster every single Always. On deployment is a Windows Cluster deployment Windows Cluster must understand Windows Cluster if you want to deploy, operate, monitor, troubleshoot, administer Always. On key areas are: quorum model, cluster network communication, DR procedures, cluster. exe, Power. Shell Windows Cluster ≠ SQL Cluster (SQL Server Failover Cluster Instance) therefore, is NOT necessarily a shared-storage cluster Windows Cluster many key enhancements have been made to Windows Cluster specifically for SQL Server 2012 Always. On • Asymmetric Disk • Node Votes • Asymmetric Disk as Quorum resource

Key Learnings from Early Customer Deployments • • • • Organizational structure Typically, teams and skills are organized into separate groups – SQL Server DBA team and Windows Server Admin team Always. On reaches out beyond the SQL Server DBAs need to work closely with Windows / Network Administration teams • Not just for initial deployment, but for troubleshooting and disaster recovery as well Historical experience need to unlearn and relearn a few things if you are already experienced with Windows Cluster, but new to Always. On For example, if you haven’t read the Windows Cluster documentation in the last few months, it is worth a re-read now New/Different Tools for administration and troubleshooting Windows cluster log Failover Cluster Manager Powershell cluster. exe Knowledge of Power. Shell and cluster. exe command lines will come very handy

SQL Server 2012 Always. On HA+DR Design Patterns SQL Server 2012 Always. On Solution Characteristics HA+DR Solution Multi-site Failover Cluster 1 Instance (FCI) for HA and DR 2 Availability Group for HA and DR Failover Cluster Instance for 3 local HA + Availability Group for DR Corresponding Pre-Denali Solution • • • Shared Storage solution * Instance Level HA (automatic) Instance Level DR (automatic *) Uses storage replication Doesn’t require database to be in FULL recovery model Multi-site FCI using stretch VLAN • • • Non-Shared Storage solution (Group of) Database Level HA (automatic) (Group of) Database Level DR (manual) DR replica can be Active Secondary Requires database to be in FULL recovery model Database Mirroring for Local HA and Log Shipping for DR • Combined Shared Storage and Non-Shared Storage Instance Level HA (automatic) (Group of) Database Level DR (manual) DR replica can be Active Secondary Requires database to be in FULL recovery model Asymmetric storage is the key to this solution Failover Cluster Instance for Local HA and Database Mirroring for DR • • •

SQL Server 2012: Multi-site FCI for HA and DR Primary Site Node 1 Active DR Site Windows Server Failover Cluster Node 3 Node 2 Passive SQL-FCI Storage Replication Passive Node 4 Passive

SQL Server 2012: AG for HA and DR Primary Data Center Fileshare Witness Primary Disaster Recovery Data Center Windows Server Failover Cluster Secondary Synchronous / Asynchronous Availability Group

SQL Server 2012: FCI for HA, AG for DR Primary Site Windows Server Failover Cluster Node 3 Node 2 Node 1 DR Site Node 4 SQL-FCI-2 SQL-FCI-1 Secondary Primary Availability Group

Lessons Learned from Customer Deployments • We will discuss two Tier-1 Mission-Critical deployments to relate lessons learned with specific scenarios • Service. U • Microsoft IT SAP ERP Deployment

Customer Deployment Example: Service. U Corporation

FCI + DBM (prior to SQL Server 2012) (FCIs for local high availability, DBM for Disaster Recovery) PRIMARY SECONDARY Database Mirroring Asynchronous Site 1 • Two independent Windows clusters – one at each site • Each site has a 3 -node FCI • Using Disk-Only Quorum Site 2

FCI + AG (SQL Server 2012) (FCIs for local high availability, AG for Disaster Recovery) PRIMARY SECONDARY Availability Group Asynchronous Site 1 Site 2 • With SQL Server 2012 and AGs, it is a single Windows cluster spanning the two sites, instead of a different Windows cluster at each site. • NOTE: Asymmetric Storage requires Windows Server 2008 R 2 + SP 1 or QFE’s

Lesson 1: Choose Appropriate Quorum Model • For mission-critical infrastructure, Service. U has extra passive nodes at each site (3 node FCIs with only 1 active SQL Server instance) so that in the event of a hardware failure, still have extra node for local HA • Service. U SLA demands that SQL Server be online, with no user intervention, as long as one node (out of 3) is up in the primary site => Last Man Standing • Node Majority or Node+Fileshare Majority can’t provide Last Man Standing • In FCI+AG configuration across two sites, the storage is asymmetric storage. Can asymmetric disk resource be used as quorum? 14

FCI + AG (SQL Server 2012) (FCIs for local high availability, AG for Disaster Recovery) PRIMARY SECONDARY Solution: Windows Server 2008 now supports Asymmetric Disk-Only Quorum Availability Group Asynchronous Site 1 Site 2 Asymmetric Disk-Only Quorum • Must be configured with cluster. exe; not supported in GUI • Requires testing and thorough knowledge of Windows clustering • Allows “Last Man Standing”

Lesson 2: Understand Operational Procedures for DR Scenario: Disaster = entire primary site is unavailable Considerations • #1: Must force quorum on the DR site to get Windows cluster back online (any quorum model) • Impact: change operational procedures • /FQ switch seems to work faster than the GUI • #2: Rethink quorum model after disaster (any quorum model) • Re-establishing quorum is a Windows cluster admin job, not DBA job. • Until this is done you cannot bring SQL online. • Who needs to be involved? • #3: Changing quorum model is a multi-step process for Service. U • Convert from disk-only to node majority • Then, convert from node majority to disk-only (with a separate new disk resource at the DR site) Ensure the process is documented and understood by all players. 16

Lesson 3: Instance Name & Path Plan the file paths and instance names before starting the first installation • Default installation paths on shared disk: • . . MSSQL 11. Instance. NameMSSQLDATA • . . MSSQL 11. Instance. NameMSSQLLog • Requirements for AG between two FCIs: • DIFFERENT instance names between FCIs • Recommendation for all AG deployments: • SAME file path across all FCIs • Solution: Be sure to use a path for Data and Logs that is always the same for all instances participating in an AG. File path recommendations is the same as in DBM, but this is more significant now because it is one Windows cluster, and the two FCI instance names can’t be 17 the same like they could in DBM between two FCIs.

Lesson 4: Concurrent FCI setup on multiple nodes will set possible owners incorrectly Initial setup • Node 1 and Node 2 added during the initial install. • Both are possible owners of the FCI What happened during a concurrent install on Node 3 and Node 4: 1. Node 3 setup started (possible owners=Node 1 & Node 2) 2. Node 4 setup started a few seconds later (possible owners=Node 1 & Node 2) 3. Node 3 setup finished: possible owners set to previous owners (Node 1 & Node 2) PLUS this node (Node 3) 4. Node 4 setup finished: possible owners set to previous owners (Node 1 & Node 2) PLUS this node (Node 4) 5. End result: Possible owners=Node 1, Node 2, Node 4 Correct result: Possible owners=Node 1, Node 2, Node 3, Node 4 Solution: DO NOT to use concurrent installation – it is unsupported and NOT recommended. 18

Lesson 5: After adding a node, possible owners = every node in the Windows Cluster Example: After the DR nodes were added to the Windows Cluster, the DR nodes show up as possible owners of the primary site FCI. Solution: Manually remove invalid owners from every resource (every time you 19 add a node). Not necessary on disks.

Lesson 6: Failover Cluster Manager shows as if a FCI can be moved to a node that is not a possible owner Issue #3: Move operations allow selection of nodes that are not possible owners 20

Customer Deployment Example: MSIT SAP ERP Deployment

DBM + LS (Prior to SQL Server 2012) Database Mirroring for local HA, Log Shipping for DR Production Witness Synchronous DBM Log Shipping Disaster Recovery Test SAP Volume Test and Integration System Image of production 22

Deployment Architecture: SQL Server 2012 Availability Group for HA and DR File share for Cluster Quorum Production Sync 1 Production Availability Group Log on Shipping production DBMS cluster A SAP production CI cluster containing File Share quorum for DBMS cluster SAP test CI cluster containing File Share quorum for test DBMS cluster 0 Test 1 1 nc sy Sync Test Availability Group on test DBMS cluster Main Site DR Site SAP Volume Test and Integration System Image of production 23

Lesson 1: Working across organizational boundaries • Working with the Windows Team in IT • Organization: Separate team from the DBA team • DBA team underestimated the role of Windows Cluster • Challenge: Concept of a Windows Cluster with “no” (? ) shared storage

Lesson 2: Choose Appropriate Quorum Model • Choice of Quorum Model • What do we chose: Node Majority vs Node+Fileshare Majority • The Windows IT team didn’t know anything other than a shared storage quorum • Picked Node+Fileshare Majority • Node votes • Don’t want issues at DR data center or the connectivity between the data centers affect HA in the primary data center • Assigned zero vote to DR server

Lesson 3: Monitoring and Alerting AG • Alerting compared to DBM • Challenge: No DBM Monitor • Developed custom scripts for monitoring and alerting key measures • Alternative: System Center Operations Manager

Lesson 4: DO NOT use Windows Failover Cluster Manager to perform AG failover DO NOT change the preferred owners and possible owners settings for the AG listener. • DO NOT move AG from one to another node using the FCM. • Solution: Use SSMS or T-SQL • •

Questions ? sanjaymi@microsoft. com

Thank You!