COMP 1321 Backup of data Richard Henson December 2015

Week 10 – Back Up and Fault Tolerance
• Objectives
  – describe different kinds of solutions available for data backup
  – explain the concept and principles of fault tolerance

If it can go wrong, it will! …………Murphy’s Law

Backing Up and Fault Tolerance
• In terms of computing…
  – “back up” is what is done to data, in case the original is corrupted for some reason
    » e.g. all computer users should back up any files they save and want to use again later…
    » e.g. all network users should have their “user data” saved as part of good network processes
  – “fault tolerance” is more fundamental, and concerned with 100% availability
    » relates to hardware as well as the software required to manage that hardware…
• but back up is an essential part of fault tolerance

Fault Tolerance of Data
• Data on storage media is easily corrupted or deleted
  – magnetic disks particularly sensitive to data loss
  – must be backed up at all times
• Useful also to store contents of memory
  – otherwise lost whenever there is a power interruption or system malfunction

Backing Up Data
• If complete hard disk is regularly copied…
  – massive amounts of data will soon accumulate…
  – need a storage medium that can hold very large quantities
• Once upon a time, tape storage was the preferred method
  – use a new back up tape every day
  – keep the old ones carefully labelled in a safe place

Which Data Should be Backed Up?
• Can be classified into several types
  – System data
  – Critical System data
  – Application data
  – User data
• May be backed up in different ways, with differing regularity

What data should NOT be backed up
• Probably two categories:
  – that which is safely stored elsewhere and can be restored at leisure
    » e.g. applications on CD
  – that which won’t be used again and won’t be missed
    » e.g. temporary files
    » read/unread emails that aren’t important
    » saved Word files, etc. that are no longer needed

Clearing out data that is no longer needed
• According to a visionary researcher:
  – "All computer-mediated processes produce data. Unless dealt with, it stays around.
  – And its after-effects can be pretty toxic.
  – And, just as 100 years ago we ignored pollution in our rush to build the Industrial Age, today we’re ignoring data in our rush to build the Information Age.
  – And, I believe, 100 years from now our great-grandchildren will look back at the decisions we made and wonder how we could have been so ignorant and short-sighted." (Bruce Schneier, 2008)

Automatic tidying up of data
• The answer is simple…
  – BACK UP processes should be accompanied by DELETE processes!
  – not yet accepted practice…
• This is good information management
  – reduces risk of information getting into the wrong hands
  – and helps ensure compliance with UK Data Protection Legislation

Essential/Important System Data
• Essential: what is needed for a healthy boot up
  – Microsoft networks refer to this as SYSTEM STATE DATA
  – highly dynamic
    » regular back up essential
• Data to support utilities
  – required for “housekeeping” duties
  – back up every time not essential
    » data available on CD
• Data to support services
  – as utilities data

Backing up User Data
• A number of approaches available:
  – Incremental backup
    » just files that have changed since the last backup of any kind (full or incremental)
  – Differential backup
    » just files that have changed since the last full backup (e.g. later datestamp)
  – Full backup
    » all data backed up
• What about critical system data? e.g. Windows registry settings
  – differential backup?
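
The difference between the three approaches can be sketched in a few lines of Python. This is only an illustration, assuming the usual definitions (incremental = changed since the last backup of any kind, differential = changed since the last full backup). The `FileInfo` type, the `select_for_backup` function and the timestamps are hypothetical and not taken from any real backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    modified: float  # last-modified time (illustrative units)


def select_for_backup(files, kind, last_full, last_backup):
    """Return the paths a backup run of the given kind would copy.

    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any kind
    """
    if kind == "full":
        return [f.path for f in files]
    if kind == "differential":
        return [f.path for f in files if f.modified > last_full]
    if kind == "incremental":
        return [f.path for f in files if f.modified > last_backup]
    raise ValueError(f"unknown backup kind: {kind}")


# Hypothetical data: full backup at time 20, an incremental at time 30.
files = [
    FileInfo("report.doc", modified=10.0),  # unchanged since the full backup
    FileInfo("budget.xls", modified=25.0),  # changed after the full backup
    FileInfo("notes.txt", modified=35.0),   # changed after the last incremental
]
```

With these values a differential run picks up both changed files, while an incremental run picks up only `notes.txt`. The trade-off is that restoring from incrementals needs the full backup plus every incremental since, whereas a differential restore needs only the full backup plus the latest differential.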

Tape Backup?
• Can store many gigabytes of data on a single tape
• Storage is fairly rapid
• BUT… tape is no longer regarded as the natural choice…
  – storage medium is still magnetic
  – retrieval can be very slow

The Backup Process
• Handled by software
  – easily scheduled to be automatic
• Data could be backed up to a variety of alternative media
  – e.g. removable hard disk
• A lot of backup data will accumulate…
  – general rule: discard the oldest backup once three “generations” are held
  – known as grandfather-father-son
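
The three-generation rule can be sketched as a fixed-size queue. This is a minimal illustration, assuming one backup per generation. The `GenerationStore` class is hypothetical, and real backup software usually offers more elaborate daily/weekly/monthly rotation.

```python
from collections import deque


class GenerationStore:
    """Keeps only the most recent backups (three generations by default)."""

    def __init__(self, generations=3):
        self._backups = deque(maxlen=generations)

    def add(self, label):
        """Record a new backup; return the label that was discarded, if any."""
        full = len(self._backups) == self._backups.maxlen
        dropped = self._backups[0] if full else None
        self._backups.append(label)  # deque drops the oldest entry when full
        return dropped

    def kept(self):
        """Backups currently held, oldest (grandfather) first."""
        return list(self._backups)
```

After adding "mon", "tue", "wed" and "thu" in turn, the store holds "tue", "wed" and "thu": the "mon" backup is discarded when the fourth one arrives.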

Other Alternatives to Tape Backup
• Server data could also be backed up to:
  – a USB-linked hard drive
  – another computer on the network
  – a computer on another network in a different location
    » easily achievable via the Internet
    » data will be preserved in the event of a fire or environmental catastrophe

Verification of Backup
• One thing to THINK that the data is being backed up
• Quite another to ensure that this has indeed occurred!
  – no reason to assume that the backup will be completely effective
  – plenty that could go wrong
• Data backup routine should check:
  – that the data has indeed been copied
  – that there are no errors
• Good backup software should make such checks automatically!
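
One common way to make both checks is to compare a checksum of each original file with a checksum of its copy. The sketch below assumes the backup is a plain directory copy. `verify_backup` and `sha256_of` are hypothetical helper names, and the demo files are created in a temporary directory purely for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from, or differ in, the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(str(rel))
    return problems


# Demo with made-up files: one good copy, one corrupted copy.
with tempfile.TemporaryDirectory() as tmp:
    src, bak = Path(tmp, "data"), Path(tmp, "backup")
    src.mkdir()
    bak.mkdir()
    (src / "a.txt").write_text("hello")
    (src / "b.txt").write_text("world")
    (bak / "a.txt").write_text("hello")
    (bak / "b.txt").write_text("w0rld")  # simulated corruption
    demo_problems = verify_backup(src, bak)
```

Here `demo_problems` lists only `b.txt`, the file whose copy does not match the original.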

Restoring Backed Up Data
• Should happen…
  – as part of a regular routine
  – just like the backing up itself…
• No good backing the data up to tape or disk if it can’t easily be recovered!
  – or can’t be copied back to the right place…
• Back up software should always be tested in “restore” mode as well…

Beyond Backup: “Thinking the unthinkable…”
• Humans are optimistic
  – we HOPE things won’t ever go wrong…
  – but they do!!!
• ANY network device could go wrong at any time
  – could affect network performance
  – could even bring the whole network to a halt…
  – with time, the business/organisation will be kaput
• Software can also fail
  – may go into an endless loop
  – may need to be restarted

A Standard for “Business Continuity Planning”
• BS 25999 (a British Standard, since superseded by the international standard ISO 22301):
  – taking “Murphy’s Law” contingency planning to all aspects of the organisation
    » recent UK e.g.: flooding
    » need to prepare for it so the business can continue…
• These days, a business’s most important asset is often its information
  – stored in digital format
  – copy needs to be kept in a different location

Fault Tolerance and Computer Systems
• All about availability
• Any organisation is now dependent on digital data
• Power cut… people stop work… most of what they do involves a computer
• Good fault tolerance is about minimising the chances of this happening…

Definition of “Fault Tolerant”?
• “A computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service”

Fault Tolerance: Role of the Network Operating System
• Each important hardware component on the network should have a backup that can take over in the event of a failure
• NOS should therefore
  – detect failures
  – enable a backup to automatically take over when the fault is detected…

Achieving Fault Tolerance
• ONE APPROACH…
  – carefully written software
    » software detects failure of other software
    » takes evasive action in real time
  – hardware has an embedded system that:
    » detects failure
    » rapidly swaps alternative hardware into action
• Makes sense for the operating system to do all of this…
  – detects both hardware and software failure
    » restarts program(s)
    » swaps in alternative pre-wired hardware
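
The "detect failure, then swap in the alternative" idea is often implemented with a heartbeat: the primary reports in regularly, and silence beyond a timeout triggers the switch. The sketch below is an assumption about how such a monitor might look, not how any particular NOS does it. `FailoverMonitor` is a hypothetical class, and times are passed in explicitly so the behaviour is easy to follow.

```python
class FailoverMonitor:
    """Switches from 'primary' to 'standby' when heartbeats stop arriving."""

    def __init__(self, timeout):
        self.timeout = timeout      # longest tolerated silence, in seconds
        self.active = "primary"
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Called whenever the primary reports that it is alive."""
        self.last_heartbeat = now

    def check(self, now):
        """Return the component that should be serving at time `now`."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.timeout:
            self.active = "standby"  # fault detected: backup takes over
        return self.active
```

For example, with `timeout=3.0` and a last heartbeat at time 1.0, a check at time 2.0 still returns "primary", but a check at time 5.0 returns "standby". The monitor stays on the standby afterwards, since fault tolerance is not restored until the failed component has been replaced.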

Concept of Data “Mirroring”
• Problem with periodic backup:
  – data copied the previous night
  – what if the system hard disk goes kaput in the middle of the next day?
• Copy of all data should additionally be stored “shorter term” on further media
  – easiest way is to have another disk in reserve
  – everything copied to system disk also copied to mirror

Disk Mirroring
• Increases boot/system disk fault tolerance under most conditions
• In its simplest form:
  – all data held on one disk
  – second disk is an exact copy of the first
• When anything is written to disk…
  – written simultaneously to both disks
[Diagram: the disk controller writes data to Disk A and the same data to Disk B]
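
A toy model of the write path may help. It assumes each disk is just a dictionary of numbered blocks. `MirroredPair` is a hypothetical class for illustration only; real mirroring is done by the disk controller or operating system.

```python
class MirroredPair:
    """Two in-memory 'disks': every write goes to both, reads survive one failure."""

    def __init__(self):
        self.disks = {"A": {}, "B": {}}
        self.failed = set()

    def write(self, block, data):
        """Write the same data to every disk that is still working."""
        for name, disk in self.disks.items():
            if name not in self.failed:
                disk[block] = data

    def fail(self, name):
        """Simulate a disk crash."""
        self.failed.add(name)

    def read(self, block):
        """Read from the first working disk."""
        for name, disk in self.disks.items():
            if name not in self.failed:
                return disk[block]
        raise IOError("both disks have failed")
```

If a block is written and Disk A then fails, `read` still returns the data from Disk B. Writes made after the failure reach only Disk B, which is why the mirror has to be re-established once the failed disk is replaced.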

Where even Mirroring alone is not enough…
• If the system crashes and will not reboot…
  – operating system doesn’t get reloaded
  – therefore the mirror never gets activated
    » and copied files cannot be read…

Recovering the system after a damaged Mirrored Boot Disk
• Boot program can only point to one disk at a time…
• If the boot disk crashes…
  – the system boot program will fail to access a disk at all next time it restarts…
• System needs an alternative boot up…
  – e.g. use a boot floppy or CD to restart the system…
• Boot program can then be modified to point to the backup, not the faulty disk

Remedial action after a broken mirror
• Just because the system is up and running again, doesn’t mean the emergency is over…
• Fault tolerance MUST be restored before ANYONE can relax
  – replacement disk must be added asap to replace the damaged one
  – the mirror must then be re-established
  – all the disk copying required to re-establish system fault tolerance may take some time…

Relative Merits of Mirroring (system availability)
• Advantage:
  – system keeps going as normal if a non-boot disk crashes
• Disadvantages:
  – disk write operations take longer
  – half of available disk space is used up (storage is used only 50% efficiently)

Hardware flaw with Mirroring
• Regardless of the boot disk problem, disk mirroring is STILL not entirely fault-tolerant!
  – both disks connected to the same hard disk controller
  – if the controller card goes down, neither disk will be accessible

Disk Duplexing
• Separate controller card for each disk
  – if one card goes down, only the disk connected to it is affected
• NOTE:
  – use of duplexing DOES NOT eradicate the potential rebooting problem caused by a damaged boot disk
  – needs the same solution as mirroring
[Diagram: Disk A on Controller A and Disk B on Controller B, both controllers on the motherboard]

Problem: Too Much Redundancy of Disk Space
• Redundancy = disk space used for backup / total disk space
• Both mirroring and duplexing:
  – Redundancy = 0.5 (50%)
  – rather high
  – half of available space tied up in backup!
• Solution: RAID (Redundant Array of Inexpensive Disks)
  – less redundancy
  – still full backup

What is RAID?
• A system of several disks where part of each disk is used to store system data, and the rest stores backup data
• If all the disks are linked together and just used for primary data (i.e. no backup):
  – the arrangement is known as a stripe set
  – also known as RAID 0 (i.e. zero fault tolerance)

Categories of RAID providing fault tolerance
• RAID 1 – mirroring or duplexing
• RAID 2 – backup using disks that do not have their own error checking
• RAID 3 – backup using disks with their own error checking
  – striped across disks at byte level
  – parity data stored on one drive
• RAID 4 – similar to RAID 3
  – but data striped in whole blocks, not per byte
  – poor data write performance

RAID 5 (the best!)
• Can use different numbers of disks (minimum three)
• Each disk divided into sections
• One parity section in each disk
• Data write faster than RAID 4
• Redundancy depends on number of disks used…

Example of RAID 5 (four disks)
• Each of the four disks divided into four sections
  – one section for parity in each
  – parity is spread across the disks, so writes do not always go to the same parity disk
  – data write therefore faster than RAID 4, read slower
  – Redundancy = ¼
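
Parity in RAID 5 is normally computed with XOR, which is what allows a lost section to be rebuilt from the surviving ones. The sketch below shows one stripe across four disks (three data blocks plus one parity block). The `xor_blocks` helper and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, with parity stored on the fourth disk.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# The disk holding data[1] fails: rebuild its block from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR-ing a value with itself cancels out, combining the two surviving data blocks with the parity block gives back exactly the missing block, so `rebuilt` equals `data[1]`.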

RAID 5 (five disks)
• RAID 5 using five disks (most popular), each divided into five sections
  – one section for parity as before
  – Redundancy = 1/5
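
The redundancy figures quoted for mirroring and RAID 5 can be reproduced with a small function. It assumes redundancy means the fraction of total disk space given over to backup or parity. The `redundancy` function and its scheme names are hypothetical.

```python
from fractions import Fraction


def redundancy(scheme, disks):
    """Fraction of total disk space used for backup/parity rather than primary data."""
    if scheme == "stripe":   # RAID 0: no backup at all
        return Fraction(0)
    if scheme == "mirror":   # RAID 1: second disk is a full copy of the first
        if disks != 2:
            raise ValueError("this sketch models mirroring with exactly two disks")
        return Fraction(1, 2)
    if scheme == "raid5":    # one section's worth of parity per disk
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return Fraction(1, disks)
    raise ValueError(f"unknown scheme: {scheme}")
```

So `redundancy("raid5", 4)` gives 1/4 and `redundancy("raid5", 5)` gives 1/5, against 1/2 for a mirror: adding disks to a RAID 5 set lowers the share of space tied up in backup.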

Hot Swapping
• Disks that can be removed and replaced without rebooting the system
• If a disk that belongs to a RAID system fails…
  – the system can continue
  – but fault tolerance is immediately lost
• Helpful and quicker to replace the disk, and re-establish RAID:
  – as soon as possible
  – without having to turn the power off and reboot

Fault Tolerance and Re-boot
• If a system crashes and/or is rebooted…
  – availability is temporarily lost
• Needs to be a reserve system (backup server) that will perform that system’s functions in the meantime
• Network Operating System needs to synchronise processes across systems to enable this to take place…

The Backup Server
• Essential for 100% availability
• Should be configured as a replacement for the main server
  – also needs to be a domain controller
  – must also have a copy of the users database, regularly synchronised with the main domain controller
  – also configured to be able to log users onto the network

Backup of Settings before Reconfiguration by Rebooting
• New hardware will be added to a server from time to time:
  – more memory?
  – extra hard disk?
  – new video, sound, or network card?
• Hot swapping may well NOT be supported
• Server will have to reboot and reconfigure

Backup of Settings before Reconfiguration by Rebooting
• If the new drivers are not correct:
  – system may not reboot properly
  – may be difficult to remove drivers
• In such circumstances, system needs a “rollback” feature, so the old hardware can be put back, as well as…
  – previous settings safely stored where they can be easily retrieved
  – previous settings restored as an option on boot-up

Keeping Servers Cool!
• Servers work hard (especially the disks…)
• Can get hot
  – will reduce MTBF (mean time between failures) of components
• Need good ventilation at all times…

Minimising Effects of Power Failure
• Power failure can ruin hardware
  – mains spikes can overheat components
  – sudden lack of power will lose data currently being processed
• Best to protect all hardware:
  – bottom line: surge protector
  – better: UPS (uninterruptible power supply)

The UPS
• Battery packs that can provide mains voltage after a power cut
  – for a few minutes (cheap but effective)
  – or half an hour (expensive, less down time)
• NOS needs to make sure the UPS automatically cuts in when voltage drops sharply
• Power continuation must include the backup domain controller, so synchronisation can occur
  – procedure of “graceful degradation”
    » allows processing to go to completion
    » allows new system settings to be written

The Fault Tolerant Network Operating System
• A Fault Tolerant system needs to have good control of hardware, backup hardware and software
• The NOS, and those who configure it, need to use fault tolerance effectively so an organisational network will
  – keep going… (accessibility)
  – do what is expected… (reliability, stability)

Network Operating Systems and Fault Tolerance
• Many features to make fault tolerance kick in automatically
• However, fault tolerance is only restored once the faulty component has been replaced and its replacement configured to work as the new backup…
• You’ll see how all this can be achieved in the practicals…