Department of Computer Engineering
Hardware Redundancy
MTAT.03.240 Seminar on Enterprise Software
Olgun Cakabey, Othmar Mwambe
December 2010
UT – Software Engineering
B06337, B06324

Agenda
• What is Redundancy?
• Introduction to Hardware Redundancy
• Hardware Components
• Disk Storage
  – RAID (Redundant Array of Independent Disks)
  – RAID Configurations
• Hardware Redundancy Techniques
• Conclusions
• References
• Demo

What is Redundancy?
• In engineering, redundancy is the duplication of critical components of a system with the intention of increasing the system's reliability, usually in the form of a backup or fail-safe.

Concept of Redundancy
• Hardware redundancy is the addition of extra hardware, usually to detect or tolerate faults.
• Software redundancy is the addition of extra software, beyond what is needed to perform a given function, to detect and possibly tolerate faults.
• Information redundancy is the addition of extra information beyond that required to implement a given function; for example, error-detection codes.
• Time redundancy uses additional time to perform the functions of a system so that fault detection, and often fault tolerance, can be achieved. Transient faults can be tolerated this way.
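
As a rough illustration of time redundancy, the sketch below repeats a computation until two consecutive runs agree, which is one simple way to ride out a transient fault. The function names and the simulated glitch are hypothetical and exist only for this example.

```python
def run_with_time_redundancy(compute, value, max_rounds=3):
    """Repeat a computation until two consecutive runs agree."""
    previous = compute(value)
    for _ in range(max_rounds):
        current = compute(value)
        if current == previous:
            return current
        previous = current
    raise RuntimeError("results never agreed; the fault may be permanent")


def make_flaky_square(glitch_on_call):
    """Hypothetical module: squares its input but is wrong on one call."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        # Simulate a transient fault on exactly one invocation.
        return x * x + 1 if calls["n"] == glitch_on_call else x * x

    return square


result = run_with_time_redundancy(make_flaky_square(glitch_on_call=1), 7)
```

The price is extra execution time rather than extra hardware, and a permanent fault is detected but not tolerated.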

Introduction to Hardware Redundancy
• Hardware redundancy concentrates not only on recovery from failures but also on protection against them.
• It always demands a trade-off between cost and achievable dependability.
• Costs: additional components, area, power, shielding, ...
[Figure: Computer without redundancy]

Hardware Components
• The parts of a computer system most often considered when discussing hardware redundancy are the CPU, memory, backplane and system bus, I/O and network cards, power supplies, and cables and connections.
• Some systems have a layer between the CPUs and the operating system, sometimes called a hypervisor.

Disk Storage

Types of Storage Disks
• Disks are one of the most important parts of a computer system, as they store the data, application programs, and operating systems.
• Data disks
• Operating system disks, e.g. a bootable CD

RAID (Redundant Array of Independent Disks)
• RAID is a way of storing the same data in different places (thus, redundantly) on multiple hard disks.
• There are different RAID levels but, apart from RAID 0, they all follow the same idea: the data of one I/O request (read or write) coming from the computer system is sent to the RAID group and distributed there across multiple disks, enriched with redundant information to protect against disk failure(s).

RAID Configurations
• If a disk drive fails, the redundant RAID group is able to reconstruct the lost information.
• Two parameters describe a stripe: the number of disks (also called the stripe width) and the number of bytes written to one disk as a chunk (the chunk size).
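
To make the two stripe parameters concrete, here is a minimal sketch of how a logical byte offset could be mapped to a disk and an offset on that disk. The function name and the simple round-robin layout (no parity placement) are assumptions for illustration, not a description of any particular controller.

```python
def locate_block(logical_offset, stripe_width, chunk_size):
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Assumes plain round-robin striping: chunk 0 goes to disk 0,
    chunk 1 to disk 1, and so on, wrapping after stripe_width chunks.
    """
    chunk_number = logical_offset // chunk_size
    disk = chunk_number % stripe_width
    stripe_row = chunk_number // stripe_width
    offset_on_disk = stripe_row * chunk_size + logical_offset % chunk_size
    return disk, offset_on_disk


# Four disks, 64 KiB chunks: the fifth chunk wraps back to disk 0.
first = locate_block(0, stripe_width=4, chunk_size=64 * 1024)
wrapped = locate_block(4 * 64 * 1024 + 10, stripe_width=4, chunk_size=64 * 1024)
```

A large sequential request touches every disk in turn, which is why striping spreads the I/O load.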

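How the reconstruction of data works
• Parity checking is a rudimentary method of detecting simple, single-bit errors in a memory system.
• In a RAID group the same idea is applied per block: the parity block is the XOR of the data blocks, so one missing block can be recomputed from the remaining blocks and the parity.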
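
The following sketch shows both uses of parity under simple assumptions: an even-parity bit that detects a single flipped bit in a word, and byte-wise XOR parity that rebuilds one lost block. The helper names and the sample data are made up for the example.

```python
from functools import reduce


def parity_bit(word):
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Detection: a single flipped bit changes the parity of the word.
word = 0b1011
stored_parity = parity_bit(word)
corrupted = word ^ 0b0100
error_detected = parity_bit(corrupted) != stored_parity

# Reconstruction: XOR the surviving data blocks with the parity block.
data = [b"\x0f\xa0", b"\x33\x55", b"\xc1\x7e"]
parity = xor_blocks(data)
rebuilt = xor_blocks([data[0], data[2], parity])  # pretend data[1] was lost
```

A single parity bit only detects an error; reconstruction works in the RAID case because the array knows which disk failed.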

RAID 0
• Block-level striping without parity or mirroring.
• Provides improved performance and additional storage but no redundancy or fault tolerance.
• Combines several disks into one stripe so that the I/O load is evenly distributed between the disks.

RAID 1
• Mirroring without parity or striping.
• This is the first, and simplest, level of redundancy: data is written identically to multiple disks (a "mirrored set").
• This keeps computational overhead low, since no parity has to be calculated, and provides good performance.
• Mirroring can decrease write performance slightly, as twice the amount of data needs to be transferred.
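
A toy model of a mirrored set, assuming in-memory dictionaries stand in for disks; the class and method names are hypothetical. It shows why a write costs one transfer per copy while a read needs only one surviving copy.

```python
class MirroredSet:
    """Toy RAID 1 model: every block is written to all working copies."""

    def __init__(self, copies=2):
        self.disks = [dict() for _ in range(copies)]
        self.failed = set()

    def write(self, block_no, data):
        # One transfer per working copy: this is the write overhead.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block_no] = data

    def fail(self, disk_index):
        # Simulate a disk outage; its contents are gone.
        self.failed.add(disk_index)
        self.disks[disk_index].clear()

    def read(self, block_no):
        # Any surviving copy can serve the read.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block_no]
        raise OSError("all mirror copies have failed")


mirror = MirroredSet()
mirror.write(0, b"payload")
mirror.fail(0)
survivor_data = mirror.read(0)
```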

RAID 3
• Byte-level striping with dedicated parity.
• Each single I/O request is distributed over all data disks.
• The performance of RAID 3 is very good for large, single requests, as all disks are used equally.
• Disadvantage: to reconstruct a failed drive, all the data on the remaining disks needs to be read, which makes reconstruction much slower than with RAID 1.

RAID 5
• Block-level striping with distributed parity.
• RAID 5 is inefficient on small writes: each time a block is written, the old data block and the old parity block must first be read so that the new parity can be computed, and then both blocks are written.
• Disadvantage: like RAID 3, RAID 5 has slow recovery times, since all the data needs to be read in order to reconstruct the lost data.
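
The small-write penalty can be sketched with XOR parity: the new parity is the old parity with the old data "removed" and the new data "added", which is why two reads precede the two writes. The function names and sample blocks are illustrative assumptions.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data, old_parity, new_data):
    """Read-modify-write update: new_parity = old_parity ^ old_data ^ new_data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# One stripe with three data blocks and one parity block.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
parity = xor_bytes(xor_bytes(d0, d1), d2)

# Overwrite d1 only: two reads (old d1, old parity), two writes (new d1, new parity).
new_d1 = b"\x7f\x00"
updated_parity = small_write_parity(d1, parity, new_d1)

# Recomputing parity from the whole stripe gives the same result.
full_recompute = xor_bytes(xor_bytes(d0, new_d1), d2)
```

The shortcut avoids reading the untouched data blocks, but a one-block write still costs four disk operations.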

RAID 6 / Double-Parity RAID
• Provides fault tolerance against two drive failures.
• This makes larger RAID groups more practical, especially for high-availability systems.

RAID 10 and RAID 01
• Combining stripes and mirrors.
• Sometimes it is useful to combine multiple RAID groups with different RAID levels.
• Disk outages in the RAID 10 configuration leave the mirror intact, though without redundancy.
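
To compare the two combinations, the sketch below enumerates double-disk failures in a four-disk group. It assumes the conventional naming, where RAID 10 is a stripe across mirrored pairs and RAID 01 is a mirror of two striped sets; the disk numbering and function names are hypothetical.

```python
from itertools import combinations

MIRROR_PAIRS = [(0, 1), (2, 3)]  # RAID 10: stripe across two mirrored pairs
STRIPE_SETS = [(0, 1), (2, 3)]   # RAID 01: mirror of two striped sets


def raid10_survives(failed):
    # Every mirrored pair must keep at least one working disk.
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)


def raid01_survives(failed):
    # At least one striped set must be completely intact.
    return any(all(d not in failed for d in s) for s in STRIPE_SETS)


def count_survivable(survives, disks=4, failures=2):
    """Count the failure combinations that leave the data readable."""
    return sum(survives(set(combo)) for combo in combinations(range(disks), failures))


raid10_ok = count_survivable(raid10_survives)  # out of 6 two-disk failures
raid01_ok = count_survivable(raid01_survives)
```

Both layouts survive any single failure; under these assumptions the stripe of mirrors tolerates more of the possible double failures than the mirror of stripes.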

Comparison

Hardware Redundancy Techniques
• Passive techniques
• Active techniques
• Hybrid techniques

Passive Techniques
• Also known as static techniques.
• Implement fault masking.
• The fault does not show up, since it is transparently removed.
• No action from the system is required.
• No reconfiguration: inherently fault tolerant.
• Examples: voting, error-correcting codes, N-modular redundancy (NMR), flux summing, special logic, TMR with duplex.
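
Voting is the simplest passive technique to sketch. The function below is a minimal majority voter for N module outputs (N = 3 gives TMR); the function name and the choice to raise an error when no strict majority exists are assumptions of this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


# TMR: one faulty module out of three is masked without being identified.
masked = majority_vote([5, 5, 9])

# 5-modular redundancy masks up to two faulty modules.
masked_nmr = majority_vote([1, 1, 1, 2, 3])
```

The voter never locates or removes the faulty module, which matches the "no reconfiguration" property; the voter itself remains a single point of failure in this sketch.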

Fault Masking
• Fault masking "hides" faults that occur.
• It does not require detecting faults, but it does require containment of faults (the effect of every fault should be local).

Active Techniques
• Also known as dynamic techniques.
• Actions are required for a correct result: detection, localization, containment, recovery.
• No fault masking.
• Do not attempt to prevent faults from producing errors within the system.
• After fault detection, the system is reconfigured to avoid a failure by removing the faulty hardware from the system.

Active Techniques (continued)
• Most common in applications that can tolerate temporary erroneous results
  – satellite systems: preferable to have temporary failures than a high degree of redundancy
• Examples: standby sparing, duplication with comparison, pair-and-a-spare, watchdog timer
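
Standby sparing can be sketched as detect, then reconfigure. In this toy version a fault is assumed to show up as a raised exception, which stands in for whatever detection mechanism a real system uses; the class, the module functions, and that detection model are all hypothetical.

```python
class StandbySparing:
    """One active module; spares are switched in after a detected fault."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)

    def call(self, x):
        while True:
            try:
                return self.active(x)
            except Exception:
                # Fault detected: remove the faulty module and reconfigure.
                if not self.spares:
                    raise RuntimeError("no spares left")
                self.active = self.spares.pop(0)


def broken_module(x):
    raise ValueError("hardware fault")


def healthy_module(x):
    return x + 1


system = StandbySparing(broken_module, [healthy_module])
output = system.call(1)
```

Unlike voting, nothing masks the fault while it is being handled, so the caller sees a delay (or, with other detection schemes, a temporarily wrong result) before the spare takes over.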

Hybrid Techniques
• A combination of passive and active techniques
• Fault masking + reconfiguration
  – use fault masking to prevent erroneous results (prevent temporary errors)
  – provide spares to replace faulty hardware (high reliability)

Hybrid Techniques (continued)
• Expensive, but better suited to achieving higher reliability and more fault tolerance
• Types: self-purging redundancy, N-modular redundancy with spares, triple-duplex architecture
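
N-modular redundancy with spares combines the two ideas: a vote masks the fault, and any module that disagreed with the majority is replaced by a spare. The sketch below is one possible reading under simple assumptions; the function name, the in-place list updates, and the sample modules are hypothetical.

```python
from collections import Counter


def nmr_with_spares(active, spares, x):
    """Vote over the active modules, then swap out any that disagreed.

    Both lists are modified in place (an assumption of this sketch).
    """
    outputs = [module(x) for module in active]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(active):
        raise RuntimeError("no majority among active modules")
    for index, out in enumerate(outputs):
        if out != value and spares:
            active[index] = spares.pop(0)  # reconfiguration step
    return value


def good(x):
    return x * 2


def stuck(x):
    return -1  # faulty module with a stuck output


modules = [good, stuck, good]
spare_pool = [good]
voted = nmr_with_spares(modules, spare_pool, 4)
```

The caller gets the correct result immediately (masking) and the group is back to full strength for the next fault (reconfiguration), at the cost of the extra spare hardware.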

Conclusions
• Redundancy is never free!
• The choice depends on the application:
  – critical computation (momentary erroneous results are not acceptable): passive or hybrid
  – long-life, high-availability systems (the system should be restored quickly): active
  – very critical applications (highest reliability): hybrid

References
• [1] Schmidt, Klaus. High Availability and Disaster Recovery: Concepts, Design, Implementation. Springer, 2009.
• [2] http://en.wikipedia.org/wiki/Redundancy_%28engineering%29
• [3] http://en.wikipedia.org/wiki/RAID
• [4] Siewiorek, Daniel P.; Swarz, Robert S. Reliable Computer Systems, 3rd ed. Wellesley, MA: A. K. Peters, Ltd., 1998. ISBN 156881092X.

Thank you! Any questions?