Disaster Recovery: Plan Development
Mark Carlson, VP of Information Technology
SMI – Carrollton, GA
ERICSA 50th Annual Training Conference & Exposition ▪ May 19 – 23 ▪ Hilton Orlando Lake Buena Vista, Florida

Topics
• Original versus Current BC Plan
• Roles and People
• Plan Development
  – Plan Creation
  – Tabletop Exercises
  – Recovery Tests
  – Disaster Events

Original Plan
• FEMA Based
  – Full-featured BCP with focus on “the plan”
• Components
  – Risk Assessment
  – Emergency Communication Plan Quick Reference
  – Emergency Action Plan
  – Incident Prevention Plan
  – Testing Scripts
  … and more

Current Plan
• “Operations” based
  – Simplified DR plan with focus on high risk and the highest probability outage scenarios
• Primary Component
  – Scenario Tables (including onsite and offsite)
• Associated Components
  – Tabletop exercises
  – Testing scripts

From Then to Now
Previous Plan
• Formal BC plan perspective
• Offsite recovery
• Monolithic document
• Created independently
• Unknown shelf-ware
New Plan
• Operational service perspective
• Off and Onsite recovery
• Bite-sized pieces
• Shared among Testing, IM, SR, CM processes
• Underpinning knowledge, Shared during outages

Roles and People
• Business Continuity Team
  – BC Plan Manager
  – BC Plan Testing Team
• Incident Response Team
  – Incident Response Coordinator
  – Recovery Agents…

Roles and People Cont.
• Specific Positions
  – Operational Management
  – Technical Management
  – Chief Operating Officer
• Personalities

Plan Development: Workflow
[Workflow diagram: Update, Plan Creation, Tabletop Exercises, Disaster Test, Disaster]

Plan Creation
[Workflow diagram: Update, Plan Creation, Tabletop Exercises, Recovery Test, Disaster]

Plan Creation
• Survey Operational Environment
• Define High Risk, Most Common Scenarios
• Organize by Operational / Technical Services
• Plug Recovery Activities into “Scenario Tables”
• Create Tabletop & Testing Script Documents

Plan Creation: Table TOC

Scenario Table Structure
• Scenario Setup
  – Description, Services Affected, Mitigation, Symptoms, Response Coordinator
• Phase 1 – Impact Assessment
• Phase 2 – Recovery
• Phase 3 – Restoration
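
The structure above could be captured as a simple record. This is only a sketch: the field names follow the slide, but the class name, the list-of-activities layout for each phase, and the example scenario are assumptions made for illustration, not the presenter's actual table format.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioTable:
    """One scenario table: setup fields plus the three response phases."""

    # Scenario Setup fields, named as on the slide
    description: str
    services_affected: list[str]
    mitigation: str
    symptoms: list[str]
    response_coordinator: str
    # Each phase holds an ordered list of activities (layout is an assumption)
    impact_assessment: list[str] = field(default_factory=list)  # Phase 1
    recovery: list[str] = field(default_factory=list)           # Phase 2
    restoration: list[str] = field(default_factory=list)        # Phase 3

    def phases(self) -> list[tuple[str, list[str]]]:
        """Return (phase name, activities) pairs in working order."""
        return [
            ("Phase 1 - Impact Assessment", self.impact_assessment),
            ("Phase 2 - Recovery", self.recovery),
            ("Phase 3 - Restoration", self.restoration),
        ]


# Hypothetical scenario, invented purely for illustration
table = ScenarioTable(
    description="Primary file server unavailable",
    services_affected=["Document storage"],
    mitigation="Nightly backups",
    symptoms=["Users cannot open shared files"],
    response_coordinator="Incident Response Coordinator",
    impact_assessment=["Confirm outage scope"],
    recovery=["Restore from last backup"],
    restoration=["Return service to primary server"],
)

for name, activities in table.phases():
    print(name, "->", activities)
```

Keeping one record per scenario fits the "bite-sized pieces" idea: a single table can be handed to a tabletop exercise or a recovery test without pulling in the rest of the plan.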

Sample Scenario Table

Sample Scenario Table Cont.

Tabletop Exercises
[Workflow diagram: Update, Plan Creation, Tabletop Exercises, Recovery Test, Disaster]

Tabletop Exercise
• Scenario Table Walkthrough
  – Mental exercise for DR team (Recovery Phase)
  – Notes appended to the scenario tables
• Scenario Table Quality Improvements
  – BC Manager compiles feedback
  – Operational validation
  – Technical details may be added
  – Contact information verified

Tabletop Exercise Document

Recovery Test
[Workflow diagram: Update, Plan Creation, Tabletop Exercises, Recovery Test, Disaster]

Recovery Test Documents
• Testing Scripts
• Disaster Recovery Scorecard

Recovery Test Scripts

Recovery Test Scripts Cont.

Testing Scorecard
• Setup & Dismantle Results
• Application Results
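
A scorecard along these lines could be tallied as follows. The scorecard slides themselves did not survive as text, so this is a hedged sketch: the entry fields, the pass/fail scoring, and the item names are all assumptions rather than the presenter's actual layout.

```python
from dataclasses import dataclass


@dataclass
class ScorecardEntry:
    section: str      # e.g. "Setup & Dismantle" or "Applications"
    item: str         # what was tested (hypothetical names below)
    passed: bool
    notes: str = ""


def summarize(entries: list[ScorecardEntry]) -> dict[str, tuple[int, int]]:
    """Return {section: (items passed, items tested)}."""
    summary: dict[str, tuple[int, int]] = {}
    for entry in entries:
        passed, total = summary.get(entry.section, (0, 0))
        summary[entry.section] = (passed + int(entry.passed), total + 1)
    return summary


# Hypothetical results from one recovery test
entries = [
    ScorecardEntry("Setup & Dismantle", "Network link to recovery site", True),
    ScorecardEntry("Setup & Dismantle", "Server restore", False, "Ran long"),
    ScorecardEntry("Applications", "Payment application login", True),
]
scorecard = summarize(entries)

for section, (passed, total) in scorecard.items():
    print(f"{section}: {passed}/{total} passed")
```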

Testing Scorecard: Infrastructure

Testing Scorecard: Applications

Disaster Event
[Workflow diagram: Update, Plan Creation, Tabletop Exercises, Disaster Test, Disaster]

Disaster Event
• Conference Call
• Disseminate the Scenario
• Involve the BC Manager
• Treat it as a Tabletop Exercise
• Update Plan

From Then to Now
Previous Plan
• Formal BC plan perspective
• Offsite recovery
• Monolithic document
• Created independently
• Unknown shelf-ware
New Plan
• Operational service perspective
• Off and Onsite recovery
• Bite-sized pieces
• Shared among Testing, IM, SR, CM processes
• Underpinning knowledge, Shared during outages

Remember
• Organization is Key
  – Define operational and technical services to drive plan scenarios and to integrate with processes
• Keep it Simple and Practical
  – The most comprehensive Business Continuity plan is often the least used
  – Create & tabletop scenarios with a history FIRST
• Leverage Personalities and People

Logistics of Disaster Recovery: The First 24-72 Hours
John Popa, Xerox State and Local Solutions

Definitions
• Business Continuity
• Disaster Recovery
• Business Recovery
• Back up Hot Site
• Back up Cold Site

Our Approach
• Designate a primary DR site for each of our operational State Disbursement Units
• PA SDU is the hot site for 8 of the XEROX operated SDUs
• Same or similar hardware, software and procedures
• All systems are loaded, maintained and tested on a periodic basis

New York Preparedness Example
• Two DR tests per year
• Client attends and brings their own data set
• Each task must be completed at the recovery site and match the test case outcomes predicted by the customer
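
The pass criterion in the last bullet (every task completed, every outcome matching the customer's prediction) can be expressed as a simple comparison. The sketch below assumes outcomes are recorded as text keyed by a test case ID; the function name, the IDs, and the outcome strings are hypothetical.

```python
def mismatched_cases(predicted: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return IDs of test cases that were not completed or did not match."""
    return [
        case_id
        for case_id, expected in predicted.items()
        if case_id not in actual or actual[case_id] != expected
    ]


# Hypothetical test cases: outcomes predicted by the customer vs. recovery-site results
predicted = {
    "TC-01": "payment posted",
    "TC-02": "check printed",
    "TC-03": "deposit balanced",
}
actual = {
    "TC-01": "payment posted",
    "TC-02": "print failed",
    # TC-03 was never run at the recovery site
}

failures = mismatched_cases(predicted, actual)
test_passed = not failures
print("Failed or incomplete cases:", failures)
```

A case that was never run counts as a failure here, reflecting the requirement that each task be completed at the recovery site, not merely that the completed ones match.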

LOGISTICS IN A TRUE DISASTER

A Disaster is Declared…
What needs to be done in 24-72 hours?
• Communicate with State Leadership
• Communicate with the affected site personnel
• Staffing the additional work
• Ready technology to activate hardware and software
• Initiate operations checklists
• Refine the process where needed
• Affected site begins business recovery analysis
• Communicate with Internal Leadership
• Commit to restore operations in the hot site

Staffing
• Mobilize Xerox and local staffing resources
• Transportation and lodging for staff relocating to the recovery site
• Activate training for all staff working the DR Project
• Activate local security for all new and visiting staff

Communications
• Command Center coordinates disaster site recovery and hot site operation
• Establish hour-by-hour communication between Command Center and all recovery functional elements
• Establish local communications capability for visiting staff members
• Ensure connectivity and communications with the depository bank are established and working
• Establish client-related communications, reporting, status, and program-related items
• Ensure IVR or web-based communications reflect current disaster status, alternate resources and recovery timeframes to keep all stakeholders informed

Infrastructure
• Facility
  – Ready office space and training facilities
  – Install/energize additional office equipment
  – Prepare storage/secure processing areas
• Equipment
  – Prepare work stations
  – Printer access
  – Provide additional telecom services
  – Review and implement records retention requirements and storage areas
• Technology
  – Establish and test connectivity internal/external
  – Prepare scanners, robotics, and ancillary systems
  – Prepare systems security
  – Energize and test processing equipment

Operations
• Segregation of work from all other work performed for the host State
• Payment Processing – scanning, payment recording, deposit preparation
• Transmissions/Reconciliation Process
• Disbursements Processing
• Banking and Reconciliation
• Program Reporting
• Print Services (Coupons, Checks, Special Processing Instructions)
• Customer Service

Refining the Process
• Analyze / Communicate Site Performance and Recommend Improvements
• Implement Improvements and Disaster Recovery Plan Updates
• Stabilize Processing Environment
• Rebuilding
  – The Disaster Site assesses damage, plans for, and rebuilds
  – Coordinates with DR Site for planning and cutover
  – Return to normal operations

Physical Disasters: Preventative Planning and Recovery
Alisha A. Griffin, IV-D Director, New Jersey

Disaster:
• A sudden calamitous event bringing great damage or destruction
• A sudden or great misfortune or failure
Synonyms
• apocalypse, catastrophe, debacle, tragedy

Goals:
• Manage Effectively - a potential disaster or only an incident / failure
• Ensure Quality Customer Service

Key Components
• Ownership
• Documentation and Planning
• Disaster Recovery
• Business Continuity

Ownership
• It’s Everyone’s Responsibility
  – Technical
  – Program
• Communication
  – Internal
  – External
  – Customers

Responsibility:
• Technical
  – Create and Maintain Matrix
    ▪ All infrastructures
    ▪ All interfaces
    ▪ All Connections
  – Prioritize
• Program
  – Key Managers – Central
  – Key Managers – Local

Communication:
• Internal – Program / Team
  – Maintain a master list of:
    ▪ Primary email
    ▪ Secondary email
    ▪ Primary phone number
    ▪ Secondary phone numbers
  – Agenda item to regular meetings
  – Service Level Agreements (SLAs)
  – Internal – Executive
    ▪ Public Information Officer
    ▪ Engage other Agency Heads
    ▪ General
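
One entry in the master contact list might look like the sketch below. The four contact fields come from the slide; the class name, the example contact, and the order in which channels are tried (primary email, primary phone, then the secondary channels) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:
    name: str
    primary_email: str
    secondary_email: Optional[str] = None
    primary_phone: Optional[str] = None
    secondary_phones: list[str] = field(default_factory=list)

    def channels(self) -> list[tuple[str, str]]:
        """List (kind, value) pairs in the order they would be tried."""
        ordered = [
            ("email", self.primary_email),
            ("phone", self.primary_phone),
            ("email", self.secondary_email),
        ]
        ordered += [("phone", number) for number in self.secondary_phones]
        # Skip channels that were never filled in
        return [(kind, value) for kind, value in ordered if value]


# Hypothetical entry; the address and numbers are placeholders
manager = Contact(
    name="Key Manager - Central",
    primary_email="manager@example.org",
    primary_phone="555-0100",
    secondary_phones=["555-0101"],
)
print(manager.channels())
```

Keeping the secondary channels on the same record matters when a storm takes out office email or desk phones and the fallback has to be found quickly.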

Communication
• External:
  – Contracts / Outside Agencies
    ▪ Incorporate into contracts
    ▪ Incorporate into Matrix
    ▪ Assign a point of communication
• Customers:
  – Notification
    ▪ Websites
    ▪ IVR
    ▪ Notices
    ▪ Press Releases
    ▪ Postings

Documentation and Planning
• Everyday / Don’t Delay
• Rigorous Attention and Priority
• Schedule Regular Updates
• QA

Disaster Recovery
• Inventory of Systems
  – Main
  – Side components / key interfaces
  – Local operations
• Continuity
  – Assess for vulnerability
  – Alternative options
  – Full and partial
  – Short and long term

Disaster Recovery
• Biennial Requirements
  – Environments
    ▪ Full
    ▪ Partial
  – Tests
    ▪ Regular / Episodic
    ▪ Full / Partial

DR Readiness Checklist
Columns: Step, Action, Resource Group, Responsible, Duration Estimate, Elapsed Estimate, Date, Time, Status, Description / Notes
Actions (in slide order):
• Send e-mail to DR Distribution list of GO/NO GO decision. Notify Stakeholders of test
• Ensure accessibility and privileges in NJKiDS DR servers including Autosys for Batch Operations
• Verify Connectivity from SAFE, Datamation and Bull Mainframe
• Verify and E-mail: NJKiDS processing is complete
• Notify the DR Coordinator that batch processing is complete
• Verify all code is in synch between production and DR servers
• Verify all Interface files are in synch between production and DR servers
• Connectivity test using test page
• Notify Operations Restoration and Support Team (EBSU) to prepare servers
Resource Groups listed: DR Systems Team; DR Coordinator or Deputy; DR Coordinator or Deputy, Program Owner; NJKiDS Batch; Operations
Responsible, date, status and notes (in slide order):
• Rob Cislak; Duration Estimate: 2 weeks
• 10/5; Completed; At least 2 weeks prior to test
• Rob Cislak; 10/18; Completed; At least 1 week before test
• Rob Cislak; 10/15; Completed; At least 1 week before test
• Jayakumar, Vijay Prabhu; 10/16; Completed; At least 1 week before test
• Greg Steen; 10/19; Completed; At least 1 week before test
• Jayakumar, Vijay Prabhu; 10/20; Day of cutover
• Operations; Jayakumar, Vijay Prabhu; EBSU; AI Gottsch; 10/20; Day of cutover
• NJKiDS Batch; Jayakumar, Vijay Prabhu; 10/20; Day of cutover
• Ed Michalak, Rob Cislak; 10/20; Day of cutover
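
The checklist records a Status per step and includes a GO/NO GO decision e-mail. One way to derive that decision is sketched below, under the assumption (not stated on the slide) that the test goes ahead only when every readiness step is marked "Completed". The class, the function, and the step numbering are hypothetical; the action wording is borrowed from the slide.

```python
from dataclasses import dataclass


@dataclass
class ChecklistStep:
    step: int
    action: str
    status: str = ""   # the slide shows "Completed" for finished steps


def go_no_go(steps: list[ChecklistStep]) -> tuple[str, list[int]]:
    """Return ("GO" or "NO GO", step numbers still open)."""
    open_steps = [s.step for s in steps if s.status.strip().lower() != "completed"]
    return ("GO" if not open_steps else "NO GO", open_steps)


# Step numbers here are illustrative, not the slide's numbering
steps = [
    ChecklistStep(1, "Verify connectivity to the DR system", "Completed"),
    ChecklistStep(2, "Verify all code is in synch between production and DR servers", "Completed"),
    ChecklistStep(3, "Verify all interface files are in synch between production and DR servers"),
]
decision, open_steps = go_no_go(steps)
print(decision, "open steps:", open_steps)
```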

Failover to DR Checklist
Columns: Step, Action, Resource Group, Responsible, Duration Estimate, Elapsed Estimate, Date, Time, Status, Description / Notes
1. Initial Damage assessment. Resource Group: Disaster Recovery Coordinator/Deputy DRC. Responsible: Ed Michalak, Rob Cislak. Date: 10/21. Time: 7:00 AM.
2. Confirm a Disaster has been Declared. Resource Group: Disaster Recovery Coordinator/Deputy DRC. Responsible: Ed Michalak, Rob Cislak. Date: 10/21. Time: 7:00 AM.
3. Contact all DR Team(s) Members. Resource Group: Disaster Recovery Coordinator/Deputy DRC. Responsible: Ed Michalak, Rob Cislak. Date: 10/21. Time: 7:00 AM. Notes: EMAIL DR team after completion.
4. Ensure DR Readiness. Duration Estimate: 30 mins.
Ensure Network Readiness. Resource Group: Emergency Management Team; Restoration and Support Operations Team (Network). Responsible: Steve Pohler. Date: 10/21. Time: 7:15 AM. Notes: EMAIL DR team after completion.
  a. Network monitoring tools (PRTG, Whatsup, etc). Resource Group: Restoration and Support Operations Team (Network). Responsible: Vivek Bansal, Paul Bostwick. Date: 10/21.
  b. Wide Area Link. Resource Group: Restoration and Support Operations Team (Network). Responsible: Vivek Bansal, Paul Bostwick. Date: 10/21.
  c. Routers. Resource Group: Restoration and Support Operations Team (Network). Responsible: Vivek Bansal, Paul Bostwick. Date: 10/21.
  d. Switches. Resource Group: Restoration and Support. Responsible: Vivek Bansal, Paul Bostwick. Date: 10/21.

October 29, 2012 - Super Storm Sandy hits the east coast

Hurricane Sandy Keys

Hurricane Sandy Keys (continued)
• Management
• Lessons Learned
  – Flooding of the Delaware / Hurricane Irene / Anthrax
• Communication
  – Emergency Management
  – Loss of contact
• Business Continuity – Options
  – Assessing and reassessing regularly
  – Staff impact

Outcomes

Future Improvement
• Movement of checks – done
• Distributed Systems
  – Ability to prioritize and respond
• Reliance on other Systems
• Document Imaging

Questions?
