Modern Distributed Systems Design Security and High Availability

1. Measuring Availability
2. Highly Available Data Management
3. Redundant System Design

Measuring Availability
• How are resiliency and high availability interconnected?
• Define downtime and what causes it.
• How to measure availability?

Measuring Availability

Define Downtime
• Downtime can be defined as follows: “If a user cannot get his job done on time, the system is down.”

What causes downtime?
• Planned: the easiest kind to reduce. It includes scheduled system maintenance, hot-swappable hard drives, cluster upgrades and even failovers. Usually 30% of all downtime.
• People, or the human factor: dumb mistakes, plus increasingly complex IT equipment, software and protocols that demand greater knowledge from engineers. Usually 15% of all downtime.
• Software failures: due to software bugs and viruses (40% of all downtime).

How to measure availability?
Availability = MTBF / (MTBF + MTTR)
where MTBF is the “mean time between failures” and MTTR is the “maximum time to repair”.
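The formula can be tried out with a few lines of Python. This is a minimal sketch: the function names, the 8,760-hour year and the sample numbers are assumptions chosen for illustration, not figures from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction in [0, 1]."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Downtime per year implied by an availability fraction."""
    return (1.0 - avail) * hours_per_year


# Illustrative numbers only: one failure every 10,000 hours, 1 hour to repair.
a = availability(10_000, 1)
print(f"availability  = {a:.4%}")
print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")
```

Because MTTR sits in the denominator, shortening repair time raises availability just as lengthening the time between failures does.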

What can go wrong?
• Hardware
• Environmental and Physical Failures
• Network Failures
• Database System Failures
• Web Server Failures
• File and Print Server Failures

The Cost of Downtime.

Levels of Availability:
1. Regular Availability
2. Increased Availability
3. High Availability
4. Disaster Recovery
5. Fault-Tolerant System

Highly Available Data Management
• Data management is the most sensitive area of modern distributed systems.
• Quick overview of existing data topologies

Redundant System Design
• Redundant storage (RAID, Multi-hosting, Multi-Pathing, Disk Array, JBOD, etc.)
• Failover Configurations and Management
• Introduction to SAN and Fibre Channel protocol
• Security aspects of data management in Storage Area Networks

Redundant storage

Redundant Storage (RAID 5)
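RAID 5 protects a stripe with XOR parity, so any single lost block can be rebuilt from the surviving blocks. The sketch below illustrates only that parity idea. The block contents and the helper name xor_blocks are made up for the example, and a real array also rotates the parity block across the disks, which is not modelled here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (toy sizes).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```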

Failover Configurations and Management
Failover must meet the following requirements:
• Transparent to the client;
• Quick (no more than 5 min, ideally 0-2 min);
• Minimal manual intervention, guaranteed data access.

Failover components:
• Two servers: one primary, the other takeover;
• Two network connections; a third is highly recommended;
• All disks on a failover pair should have some sort of redundancy;
• Application portability;
• No single point of failure.
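One common way for the takeover server to notice a dead primary is a heartbeat with a missed-beat threshold. The sketch below is a hypothetical toy model, not a real cluster manager: the class name, the 5-second interval and the three-miss threshold are all assumptions. Timestamps are passed in explicitly so the logic can be followed without waiting.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Toy heartbeat monitor for a primary/takeover pair (hypothetical)."""
    heartbeat_interval_s: float = 5.0  # assumed gap between heartbeats
    missed_before_failover: int = 3    # assumed tolerance before takeover
    last_heartbeat_s: float = 0.0
    active: str = "primary"

    def record_heartbeat(self, now_s: float) -> None:
        """Remember when the primary was last heard from."""
        self.last_heartbeat_s = now_s

    def detection_time_s(self) -> float:
        """Silence needed before the primary is declared down."""
        return self.heartbeat_interval_s * self.missed_before_failover

    def check(self, now_s: float) -> str:
        """Promote the takeover server once the primary has been silent too long."""
        silence = now_s - self.last_heartbeat_s
        if self.active == "primary" and silence >= self.detection_time_s():
            self.active = "takeover"
        return self.active


monitor = FailoverMonitor()
monitor.record_heartbeat(100.0)
print(monitor.check(110.0))  # 10 s of silence: below the 15 s threshold
print(monitor.check(115.0))  # 15 s of silence: threshold reached
```

With these assumed settings detection takes 15 seconds, which leaves most of a 0-2 minute failover budget for moving disks, addresses and applications to the takeover server.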

Symmetric Failover

Asymmetric Failover

Fibre Channel, SAN, IP Storage

Security in IP Storage Networks
• Security in Fibre Channel SANs
• Security Options for IP Storage Networks

Fibre Channel SAN Security
• Port or hard zoning
• WWN Zoning
• LUN Masking
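WWN zoning and LUN masking can be pictured as two checks applied in sequence: the initiator and the storage port must share a zone, and the requested LUN must be unmasked for that initiator. The sketch below is a toy model only. The WWNs, zone names and tables are invented, and in a real SAN these checks are enforced by the switches and the storage array rather than by host code.

```python
# Hypothetical World Wide Names, for illustration only.
HOST_A = "10:00:00:00:c9:00:00:0a"
HOST_B = "10:00:00:00:c9:00:00:0b"
ARRAY = "50:06:01:60:00:00:00:01"

# WWN zoning: devices may talk only if they share at least one zone.
zones = {
    "zone_a": {HOST_A, ARRAY},
    "zone_b": {HOST_B, ARRAY},
}

# LUN masking: per storage port, the LUNs each initiator is allowed to see.
lun_masks = {ARRAY: {HOST_A: {0, 1}, HOST_B: {2}}}


def in_same_zone(initiator: str, target: str) -> bool:
    """True if some zone contains both WWNs."""
    return any(initiator in members and target in members
               for members in zones.values())


def can_access(initiator: str, target: str, lun: int) -> bool:
    """Apply zoning first, then LUN masking."""
    if not in_same_zone(initiator, target):
        return False
    return lun in lun_masks.get(target, {}).get(initiator, set())


print(can_access(HOST_A, ARRAY, 0))  # zoned together and LUN 0 is unmasked
print(can_access(HOST_B, ARRAY, 0))  # zoned together but LUN 0 is masked
```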

Security Options for IP Storage Networks
• iSNS
• LUN Masking as in Fibre Channel and VLAN tagging
• IP Security or IPSec
• ACL