TF Course: Software (SW) Fault Tolerance Techniques

Slides: 79

Software failures
- A. Failures in "billing"-type software (hotel and airline reservations, telecom): mostly caused by erroneous inputs
  - Solutions: input-checking routines and protection mechanisms against erroneous data; otherwise, heavy financial losses
- B. Faults in real-time control systems, which can endanger human lives or cause major economic losses
  - Spectacular failures: the launch of Mariner 1 (1962), the destruction of the French meteorological satellite (1968), the 1970 Apollo mission, Ariane 5...

Software failures
- Complex SW is highly vulnerable to faults
- Current concerns about the consequences of SW failures:
  - Air traffic and flight control SW ("in flight" or "control on the ground") => Airbus and Boeing are market players
  - Control SW for nuclear systems or missile launch
  - Control SW for hazardous industrial processes
  - Control SW for ICU equipment in emergency hospitals
  - Control SW for banking and stock-exchange transactions, online payment, etc.
- SW fault-tolerance techniques were first developed for critical real-time systems
  - Collision-avoidance software for commercial aviation requires 10^40 possible states (US)

SW design problems
- The large number of states means that only a small part of the SW can be verified
  - Debugging and SW testing techniques are not economically feasible for complex systems
  - Formal methods offer high fault coverage, but designing and deploying them can be much harder than building the SW itself
    - A formal model of a complex system can be as complex as the code (or larger in lines of code)
  - Because verification is incomplete, several design faults remain in the SW

What to do about SW faults?
- Fault prevention
  - Use SW design methodologies to avoid introducing faults into the SW
- A collection of methods that aim to avoid "design" errors (which can appear during specification, design, coding, interpretation and implementation)
  - Targets internal errors (data and instructions) and external errors (interactions with other software)

Fault avoidance or prevention
- Techniques: "artistic"! => maintenance rises to 60-70% of the total software cost
  - Modularity: partitioning into modules with precise, clear functions, which makes evolution easy
    - Horizontal partitioning: the major SW functions are separated into independent structures that communicate through interfaces and are controlled by other functions at the execution and communication level
    - Vertical partitioning: control and computation are distributed in a top-down hierarchy
    - Top modules "control", bottom modules "compute"
  - Any partitioning method eases testing and maintenance and limits the propagation of side effects
  - Object-oriented programming: classes and modules are manipulated through simple functions (e.g. C++)

What to do about SW faults?
- Fault removal
  - Use testing, verification and SW analysis techniques to remove faults
  - Much research and development has gone into these methods, for both deterministic and random faults
  - Deterministic faults are activated by the SW inputs and do not depend on internal state

Testing and verification
- Reducing or eliminating the incidence of errors in software
  - "Proof of correctness": formal verification by mathematical induction
    - usually applied to small programs (a limitation!)
    - timing constraints are not tested (very limited use in real-time systems)
  - Conventional tests: checking the performance and functionality of a program
    - without "robustness tests" (input data outside specifications), "limit tests" (input data at boundary regions), or tests of all branches and loops

Conventional testing
- "Path testing": every possible path between an input and an output is traversed once
- "Branch testing": every path between a node and the output is traversed once
- "Functional testing": every elementary function is tested once
- "Special value testing": all values likely to cause errors are tested
- "Anomaly analysis": all software constructs likely to cause errors are tested
- "Interface analysis": all problems at the interfaces between modules are tested
  - Example: on 6 programs with 28 known errors, at least one error remained undetected whatever the type of test (a reduction by a factor of 10!)

Difficulties in testing and verification
- Neither prevention, nor exhaustive testing, nor formal verification can guarantee the reliability required of software used in a critical or real-time control application
- Conclusion => fault tolerance
  - Several techniques protect against software failures

Robustness
- Use of "workaround" input sequences, or SW robustness
- Robustness: the SW keeps working correctly despite invalid inputs
  - The software must handle input data (parameter passing, program interactions) that is out of range or in another format, without degrading the other functions that do not depend on these non-standard inputs
  - SW implementation:
    - Detect the fault, isolate its effect, and recover
    - Detection => check membership in the input space
    - Isolation => convert to an appropriate format, or run with predefined constants, so that the fault does not propagate through the whole system
    - Recovery => continue the program, suspend it, or as a last resort restart

Robustness
- Global actions when a robustness problem occurs (examples):
  - a new input is requested (when interacting with a human operator)
  - the last correct value of the same variable is used, or a predefined value
  - a flag is raised to notify the operator of the exceptional state and to help the other parts of the program handle it
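
The detect / isolate / recover steps for invalid inputs can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `RobustInput`, its fields and the numeric range are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobustInput:
    """Hypothetical wrapper around one numeric input (e.g. a sensor reading)."""
    low: float                      # lower bound of the valid input space
    high: float                     # upper bound of the valid input space
    default: float                  # predefined value used when nothing better exists
    last_good: Optional[float] = None
    exception_flag: bool = False    # raised to notify the rest of the program

    def read(self, raw: object) -> float:
        # Isolation: convert to the expected format before anything else uses it.
        try:
            value = float(raw)
        except (TypeError, ValueError):
            value = None
        # Detection: the value must belong to the valid input space.
        if value is not None and self.low <= value <= self.high:
            self.last_good = value
            self.exception_flag = False
            return value
        # Recovery: flag the exceptional state, then fall back to the last
        # correct value, or to the predefined value if none was seen yet.
        self.exception_flag = True
        return self.last_good if self.last_good is not None else self.default


sensor = RobustInput(low=0.0, high=100.0, default=20.0)
print(sensor.read("abc"), sensor.exception_flag)   # wrong format, default is used
print(sensor.read("42.5"), sensor.exception_flag)  # valid input
print(sensor.read(1e9), sensor.exception_flag)     # out of range, last good value is used
```

The rest of the program only ever sees values inside the valid range, so an invalid input cannot propagate; the flag lets other modules decide whether to ask the operator for a new input.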

System closure
- No interaction (exchange between modules) is allowed unless it is explicitly authorized
  - All restrictions must be explicitly lifted before a specific piece of data (or function) can be used
  - Goal: limit error propagation by working in a closed, bounded system in which all possible interactions are known in advance

System closure, or the temporal structuring of activity
- Exchanges between SW modules should be well structured:
  - An atomic action takes place between distinct modules, with no interaction with the other modules of the system for the duration of the activity
  - During an atomic action, no participating module shares information with the other modules
  - Advantage: critical structures are isolated
    - If an atomic action completes normally, its results are validated and released to the non-participating modules at the end
    - If a failure occurs during the atomic action, it is known a priori that it can only come from the isolated modules, so recovery techniques are applied to those modules first

Non-contamination
- Fault containment: avoiding incorrect outputs caused by internal faults (applicable to code fragments and to modules)
  - Write code that prevents faults from propagating to other modules
- Methods: watchdogs, hardware/software alarms (overflow and division-by-zero tests), capability checking
- Action, depending on the detected error:
  - rollback to the start of the affected module, or to the start of the main program
    - If the failure comes from erroneous but temporary data, the program should run correctly after the rollback
  - restart of the whole system

Dynamic fault tolerance
- Implemented in several steps:
  - Fault detection
  - Recovery (or correction) following detection

Watchdog and control flow checking (contd.)
- Watchdog for timing errors, hangs, silent failures, crash failures (timeout, etc.)
- Watchdog with control flow checking
- Basic principle
  - Analyze the program and extract control information
    - Branch-free intervals
    - Subroutine calls
  - Assign signatures to branch-free intervals and provide these signatures to the watchdog processor, which checks them

Control flow checking
- Signature checking
  - The compiler identifies branch-free intervals and generates signatures (such as a checksum) for these intervals
  - At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
  - The watchdog monitors the bus, generates its own signatures, and compares them with the signatures captured from the bus (compiled signatures)
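
Signature checking can be illustrated in software with a small simulation. This is only a sketch under stated assumptions: instructions are modelled as text strings, the signature is a CRC-32 checksum, and the names `signature`, `compile_signatures` and `Watchdog` are hypothetical. A real watchdog processor observes the bus in hardware.

```python
import zlib


def signature(instructions):
    """Checksum over the instructions of one branch-free interval."""
    return zlib.crc32("\n".join(instructions).encode("utf-8"))


def compile_signatures(blocks):
    """'Compile time': compute the reference signature of every interval."""
    return {name: signature(instrs) for name, instrs in blocks.items()}


class Watchdog:
    def __init__(self, reference):
        self.reference = reference  # compiled signatures, indexed by interval name

    def check(self, name, observed_instructions):
        """Recompute the signature from what was observed and compare it."""
        return signature(observed_instructions) == self.reference.get(name)


# Two branch-free intervals of a toy program (assumed instruction syntax).
blocks = {
    "B0": ["load r1, a", "add r1, 1"],
    "B1": ["store r1, a"],
}
watchdog = Watchdog(compile_signatures(blocks))

print(watchdog.check("B0", blocks["B0"]))                  # fault-free execution
print(watchdog.check("B0", ["load r1, a", "add r1, 2"]))   # corrupted instruction
```

A mismatch means either an instruction was corrupted or the program entered an interval it should not have reached, which is the control flow error the watchdog is looking for.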

Implementing detection strategies
- Ad hoc: designers with some experience rely on their own judgment to choose the types of checks and where to place them in the code
  - It is impossible to anticipate all the faults in a module
  - It is not possible to check everything: the execution delay would be too high
- Structured: build fault trees
  - Identify the general failure classes and the conditions that trigger them
  - A top-down approach that helps designers spot omissions, simplify the design, and see all interactions and consequences
  - By moving from the high functional level to a lower level that depends on other elements, the designer can refine the fault detection strategy

Medium Level SW Techniques
- Most SW faults are design faults activated by unexpected or untested inputs
  - Like intermittent HW faults, which appear under specific conditions
- A simple restart can bring the system back to normal operation
- The executing SW module runs with an acceptance test that checks its result
- If a fault is detected, a "retry" message is sent and the module restarts with data previously saved in memory

Medium Level SW Techniques: checkpoint + rollback

Medium Level SW Techniques
- These techniques are mostly dynamic: checkpoint and rollback
  - Detection mechanism: acceptance testing
    - Various types of acceptance tests are used to detect faults
      - the result of a program is subjected to a test
      - if the result passes the test, the program continues its execution
      - a failed test indicates a fault
  - Recovery mechanism: backward recovery
    - Return to some previous error-free state

Checkpoint and rollback
- General concept: save a fault-free state of the system and, when an error is detected, reload the fault-free state and re-execute
  - Save the system state at regular intervals
    - How often to save: the checkpoint interval
    - How much to save: as little as the PC and status flags for a single instruction, or as much as a log of all messages, the complete program and the associated data values at a given time
    - How long a delay between fault occurrence and its detection (error latency) is tolerable
  - Rollback recovery
    - Where do we go back to: damage assessment
    - Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
    - Restart the computation
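
The save / detect / reload / re-execute cycle can be sketched in a few lines. This is a minimal illustration assuming the whole state fits in a Python dictionary; `Checkpointed`, `run_step` and `make_flaky_deposit` are hypothetical names, and the "transient fault" is simulated.

```python
import copy


class Checkpointed:
    """Holds a mutable state plus one saved, fault-free copy of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def run_step(system, step, acceptance_test, max_retries=2):
    """Checkpoint, execute, test; on a failed test roll back and re-execute."""
    system.checkpoint()
    for _ in range(max_retries + 1):
        step(system.state)
        if acceptance_test(system.state):
            return True
        system.rollback()
    return False  # the fault is not transient: escalate to a higher level


def make_flaky_deposit(amount, transient_faults):
    """Simulated step that corrupts the state for its first few executions."""
    remaining = [transient_faults]

    def step(state):
        if remaining[0] > 0:
            remaining[0] -= 1
            state["balance"] = -1          # corrupted result
        else:
            state["balance"] += amount     # correct behaviour

    return step


system = Checkpointed({"balance": 100})
ok = run_step(system, make_flaky_deposit(50, 1), lambda s: s["balance"] >= 0)
print(ok, system.state)
```

The retry only helps when the fault is transient; when every attempt fails, the state is left at the last checkpoint and the caller has to try something else, such as an alternate version.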

Acceptance Test
- An acceptance test is most effective if it can be calculated in a simple way and if it is based on criteria that can be derived independently of the program application.
- Existing techniques include
  - timing checks
  - coding checks
  - reversal checks
  - reasonableness checks
  - structural checks
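
Three of these checks are easy to illustrate. The functions below are a sketch with hypothetical names and thresholds; the reversal check is shown for a square root, where the output can be squared and compared with the input.

```python
import math
import time


def reversal_check(x, y, tol=1e-9):
    """Reversal check for y = sqrt(x): squaring y must give back x."""
    return math.isclose(y * y, x, rel_tol=tol, abs_tol=tol)


def reasonableness_check(value, low, high):
    """Reasonableness check: the result must lie in a physically plausible range."""
    return low <= value <= high


def timing_check(func, deadline_s, *args):
    """Timing check: return the result and whether it arrived before the deadline."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) <= deadline_s


print(reversal_check(2.0, math.sqrt(2.0)))   # correct result passes
print(reversal_check(2.0, 1.5))              # wrong result is rejected
print(reasonableness_check(36.6, 30.0, 45.0))
print(timing_check(math.sqrt, 1.0, 4.0))
```

None of these checks needs to know how the result was computed, which is what makes them cheap and independent of the implementation under test.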

Fault Containment Techniques
- Fault containment in software can be achieved by modifying the structure of the system and by imposing a set of restrictions that define which actions are permissible within the system
- Techniques for fault containment:
  - modularization
  - partitioning
  - system closure
  - atomic actions

Fault Recovery
- Once a fault is detected and contained, a system attempts to recover from the faulty state and regain operational status
  - If fault detection and containment mechanisms are implemented properly, the effects of the faults are contained within a particular set of modules at the moment of fault detection
- Rollback is the most commonly used action
  - Return the system to a previous checkpoint
  - Ultimately, return to the beginning

Static vs Dynamic Checkpoints
- A static checkpoint takes a single snapshot of the system state at the beginning of the program execution and stores it in memory.
  - If a fault is detected, the system returns to this state and starts the execution from the beginning.
  - Fault detection checks are placed at the output of the module
- Dynamic checkpoints are created at various points during the execution
  - If a fault is detected, the system returns to the last checkpoint and continues the execution.
  - Fault detection checks need to be embedded in the code and executed before the checkpoints are created

Advantages/Drawbacks
- Advantages: conceptually simple
  - Independent of the damage caused by a fault
  - Applicable to unanticipated faults
  - General enough to be used at multiple levels in a system
- Drawback: some systems perform non-recoverable actions, which cannot be compensated by simply reloading the state and restarting the system
  - firing a missile
  - soldering a pair of wires
- Recovery from such actions can be done
  - by compensating for their consequences (undoing a solder)
  - by delaying their output until additional confirmation checks are completed (do a friend-or-foe confirmation before firing)

Higher Level Fault Tolerance Techniques based on C&R: Process Pairing
- Two identical versions of the software run on separate processors
- At first the primary processor is active.
  - It executes the program and sends the checkpoint information to the secondary processor, Processor 2.
- If a fault is detected, the primary processor is switched off. The secondary processor loads the last checkpoint as its starting state and continues the execution
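
Process pairing can be simulated in a single program. The sketch below assumes a toy computation (summing the integers 1..n) and a fault injected at a chosen step; `Processor`, `run_pair` and `fail_at` are hypothetical names, and the "checkpoint message" is simply a deep copy of the primary's state.

```python
import copy


class Processor:
    def __init__(self, name):
        self.name = name
        self.state = {"i": 0, "total": 0}
        self.alive = True

    def step(self):
        # One unit of work of the toy program: add the next integer.
        self.state["i"] += 1
        self.state["total"] += self.state["i"]


def run_pair(n_steps, fail_at=None):
    """Run n_steps on the primary; on a fault, the secondary takes over."""
    primary, secondary = Processor("P1"), Processor("P2")
    active = primary
    checkpoint = copy.deepcopy(primary.state)   # what the secondary has received
    while active.state["i"] < n_steps:
        if active is primary and fail_at is not None and active.state["i"] == fail_at:
            primary.alive = False                        # fault detected: switch off
            secondary.state = copy.deepcopy(checkpoint)  # load the last checkpoint
            active = secondary
            continue
        active.step()
        if active is primary:
            checkpoint = copy.deepcopy(primary.state)    # send checkpoint to P2
    return active.name, active.state["total"]


print(run_pair(5))             # fault-free run, completed by the primary
print(run_pair(5, fail_at=3))  # primary fails, the secondary completes the run
```

Both runs produce the same total, because the secondary resumes from the last state the primary checkpointed rather than from the beginning.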

High Level FT: Techniques using C&R: Multi Versions
- Multi-version techniques use two or more versions of the same software module, which satisfy design diversity requirements.
  - Different teams, different coding languages or different algorithms can be used to maximize the probability that the versions do not share common faults
- Recovery blocks: combine the checkpoint and restart approach with a standby sparing redundancy scheme

Recovery Blocks
- N different implementations of the same program
  - Only one of the versions is active
  - If an error is detected by the acceptance test, a retry signal is sent to the switch
  - The system is rolled back to the state stored in the checkpoint memory and the execution is switched to another module

Recovery Blocks
- As with cold and hot standby sparing, the different versions can be executed either serially or concurrently
  - Cold execution may require checkpoints to reload the state before the next version is executed (time penalty). Trying multiple versions serially may cost too much time, especially for a real-time system.
  - Hot redundancy requires n redundant hardware modules working in parallel, a communication network to connect them, and input and state consistency algorithms
- If all n versions have been tried and have failed, the module invokes the exception handler to tell the rest of the system that it could not complete its function
  - The recovery blocks technique depends heavily on design diversity
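
A serial (cold) recovery block can be sketched as below. The example is illustrative only: `recovery_block`, `primary`, `alternate` and `accept` are hypothetical names, the task is sorting a list, and the primary version contains a deliberately injected bug.

```python
import copy


def recovery_block(state, versions, acceptance_test):
    """Try each version in turn from the same checkpoint until one is accepted."""
    checkpoint = copy.deepcopy(state)
    for version in versions:
        candidate = copy.deepcopy(checkpoint)   # rollback: reload the saved state
        try:
            result = version(candidate)
        except Exception:
            continue                            # a crash counts as a failed version
        if acceptance_test(checkpoint, result):
            return result
    # All versions were tried and failed: signal the rest of the system.
    raise RuntimeError("all versions failed the acceptance test")


def primary(data):
    return sorted(set(data))        # injected design fault: drops duplicates


def alternate(data):
    return sorted(data)             # independent, simpler implementation


def accept(original, result):
    """Acceptance test: same length as the input and in non-decreasing order."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(original) and ordered


print(recovery_block([3, 1, 3, 2], [primary, alternate], accept))
```

The acceptance test here is cheaper than sorting itself, but it is application dependent: it has to be rewritten for every module protected this way.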

High Level SW FT: Data Diversity
- Ensures FT by a combination of checkpoints, recovery and "input workaround"
  - Useful if the system is sensitive to faults triggered by its inputs
- Re-expression of data means expressing the original inputs as different but equivalent inputs (format conversion, correction of values, translation of values, etc.)
- The program takes checkpoints
- Acceptance testing is performed, with recovery by rollback

Data Diversity
- Re-assign values to the inputs through variables (depending on the specifications)
  - Use close values if the inputs come from sensors
  - Re-map the data when it is a float or another specific data type (without altering the content)
[Figure: checkpoint memory, re-expressions 1 to n, switch, execution, error detection, output, retry]

Data Diversity
- Extensively used in applications based on
  - Information coming from sensors
  - Statistical computations
  - Real-number applications (signal processing, maths)
  - Text and keyboard data inputs
- Very dependent on the application
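
Data diversity can be sketched with a single program that fails on one particular input and a few re-expressions that replace the input by a close value. Everything here is an assumption made for illustration: the program (`sinc`), the perturbation size and the function names.

```python
import math


def sinc(x):
    """Program under protection: fails (division by zero) exactly at x == 0."""
    return math.sin(x) / x


# Re-expressions: the identity first, then close values (tolerable for sensor data).
RE_EXPRESSIONS = [
    lambda x: x,
    lambda x: x + 1e-9,
    lambda x: x - 1e-9,
]


def run_with_data_diversity(program, x, re_expressions, acceptance_test):
    """Retry the same program on re-expressed inputs until a result is accepted."""
    for re_express in re_expressions:
        try:
            y = program(re_express(x))
        except ArithmeticError:
            continue            # error detected: retry with the next re-expression
        if acceptance_test(y):
            return y
    raise RuntimeError("no re-expression produced an acceptable result")


def accept(y):
    return abs(y) <= 1.0 + 1e-12    # sin(x)/x never exceeds 1 in magnitude


print(run_with_data_diversity(sinc, 0.0, RE_EXPRESSIONS, accept))
print(run_with_data_diversity(sinc, 1.0, RE_EXPRESSIONS, accept))
```

Unlike recovery blocks, only one version of the program exists; the diversity is entirely in the data, which is why the approach only suits applications that tolerate slightly different inputs.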

High Level FT - N Version Programming
- Resembles N-modular hardware redundancy
  - N different software implementations of a module are executed concurrently.
  - The selection algorithm (voter) decides which of the answers is correct
    - a voter is application independent
    - this is an advantage over the recovery block fault detection mechanism, which requires application-dependent acceptance tests
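
A majority voter over N versions can be sketched as follows. The versions are run one after the other here for simplicity, whereas the technique runs them concurrently; the three toy versions of an absolute-value function and all names are hypothetical.

```python
from collections import Counter


def majority_vote(outputs, n_versions):
    """Return the output agreed on by a strict majority of the N versions, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]   # outputs must be hashable
    return value if count > n_versions / 2 else None


def n_version_execute(versions, x):
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass    # a crashed version casts no vote but still counts in N
    return majority_vote(outputs, len(versions))


def v1(x):
    return abs(x)


def v2(x):
    return x if x >= 0 else -x


def v3(x):
    return x        # injected design fault: wrong for negative inputs


print(n_version_execute([v1, v2, v3], -4))   # the faulty version is outvoted
print(n_version_execute([lambda x: 1, lambda x: 2, lambda x: 3], 0))   # no majority
```

The voter never looks at what the versions compute, only at whether their outputs agree, which is why it can be reused across applications as long as a single correct output exists.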

N Version Programming
- The most critical issue in multi-version software fault tolerance techniques is assuring independence between the different versions of the software through design diversity
  - Software systems are vulnerable to common design faults if they are developed by the same design team, by applying the same design rules and using the same software tools

N Version Programming
- Decisions to be made when developing a multi-version software system include
  - which modules are to be made redundant (usually the less reliable modules are chosen)
  - the level of redundancy (procedure, process, whole system)
  - the required number of redundant versions
  - the required diversity
  - diverse specification, algorithm, code, programming language, testing technique
    - rules of isolation between the development teams

N-version programming
- The decision is made by majority vote
  - The programs are independent, and each must deliver its outputs in an identical format
  - A very reliable and very complex acceptance program evaluates the outputs, analyses them, and passes the results to the next SW level
  - Problems of the voter (acceptance test):
    - setting the comparison frequency (few comparisons minimize the performance impact; too many comparisons mean long waits to synchronize the results)
    - synchronizing the outputs
    - running diagnostic routines in parallel to tell whether errors come from the hardware or from the software

N-version programming (cont.)
- Conditions for the voter (acceptance test)
  - Provide a watchdog for versions that may not execute correctly
  - It cannot be used when the software can produce several correct outputs
  - It also indicates what to do after the vote
  - It cannot be combined with data diversity, because arithmetic imprecision can become significant and the vote too difficult to implement
- N-versions: P_failure: 6.9 x 10^-3 => 3.7 x 10^-5
- 2-versions: Airbus, Boeing in real-time control
- 3-versions: with 2 specifications and 2 languages => nuclear reactor protection systems

N-versions
- Advantage: immediate masking of software faults
- Drawback: how to obtain the N versions
  - Design diversity is needed to avoid correlated faults, and it depends heavily on the specification
  - Version failures must be independent: several teams with different backgrounds develop the versions
  - Used for critical tasks

Other Techniques based on C&R: Distributed Systems
- Many concurrent processes exchange data through shared memory (multiprocessors) or messages (distributed systems)
  - If a process fails, it restarts execution from its last checkpoint, but the effects it had on the other processes must be cancelled ("undone")
  - The other processes should roll back consistently
  - Each process keeps a log, so it can do, undo and redo operations on each object, state, etc.
- Avoid orphan messages, the domino effect, and lost messages

Global HW/SW Fault Tolerance Strategies
- We will consider the case of building distributed checkpoint/rollback recovery over a lossy communication system
- Consistent system state
  - A state that may occur in a legal execution of a distributed computation
    - In other words, for every message that is recorded as received, the state of the sender shows that it was sent
  - Consistent global checkpoint
    - A set of N local checkpoints, one for each process, which together form a consistent system state
    - The key idea of this definition is that we can roll back to this state and re-compute from it to arrive at the present state
  - Recovery line
  - Most recent consistent global checkpoint

Concurrent processes
- Consistent checkpoints
  - X1, Y1, Z1 have to be taken so that no information is exchanged between a pair of processes, or with the outside, between successive checkpoints
  - This set is called the recovery line
  - (e.g. X2, Y2, Z2 are not consistent: m is lost if Y fails)
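
The consistency condition can be checked mechanically if each message carries the local event counts at which it was sent and received. The sketch below uses that model as an assumption; `Message`, `orphan_messages`, `in_transit_messages` and `is_consistent` are hypothetical names. A checkpoint value `c` for a process means that all of its events numbered up to `c` are part of the saved state.

```python
from typing import NamedTuple


class Message(NamedTuple):
    src: str
    send: int   # sender's local event count when the message was sent
    dst: str
    recv: int   # receiver's local event count when the message was received


def orphan_messages(checkpoint, messages):
    """Received before the receiver's checkpoint but sent after the sender's."""
    return [m for m in messages
            if m.recv <= checkpoint[m.dst] and m.send > checkpoint[m.src]]


def in_transit_messages(checkpoint, messages):
    """Sent before the sender's checkpoint but not yet received at the receiver's."""
    return [m for m in messages
            if m.send <= checkpoint[m.src] and m.recv > checkpoint[m.dst]]


def is_consistent(checkpoint, messages):
    """Every message recorded as received must also be recorded as sent."""
    return not orphan_messages(checkpoint, messages)


m = Message("X", 3, "Y", 5)
print(is_consistent({"X": 2, "Y": 6}, [m]))        # orphan message: not consistent
print(is_consistent({"X": 4, "Y": 4}, [m]))        # consistent ...
print(in_transit_messages({"X": 4, "Y": 4}, [m]))  # ... but m must be logged or it is lost
print(is_consistent({"X": 4, "Y": 6}, [m]))        # consistent, nothing in transit
```

The second case is the lost-message situation: the state is legal, but after a rollback the receiver will never see `m` again unless the sender kept it in a log.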

Consistent Checkpoints
- Coordinated checkpoints: a way to make checkpoints consistent
- An initiator Qi starts the coordination
  - Checkpoint saves: 2 phases
  - Rollback: 2 phases
- Drawback: the system is heavily loaded during the "save" and rollback steps

What does a distributed system network look like?
- System model
  - Units are represented as nodes
    - N processors/processes/memories/others
  - Interconnects are represented as links between nodes
    - Processors/processes interact with each other and with the outside world by sending and/or receiving messages
    - Messages are non-deterministic events
- Communication system protocol
  - Lossy (messages may get lost, duplicated, or re-ordered in the communication system; the most commonly used scenario)
  - Reliable (no messages are lost and they are always delivered in order, e.g. FIFO; a less frequently used assumption)

Failure Models in a Distributed System
- Failure models
  - Nodes may fail or go down: the corresponding unit is unable to interact with other units
  - Interconnects may fail or go down: no units can communicate using the failed or down link
- Objective of fault tolerance
  - Any pair of units must be able to interact in the presence of
    - node failures
    - link failures
  - Performance metrics
    - How many faults (node or link failures) can be tolerated (fault coverage)
    - Impact on the route length: the number of hops between pairs of nodes (the length of the shortest path between a pair of nodes)
      - Either the worst-case scenario or the impact on the average path length can be considered
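
The route-length metric can be computed with a breadth-first search that ignores failed nodes and links. The sketch assumes an undirected network given as a list of node pairs; the function name `hops` and the 4-node ring used as an example are hypothetical.

```python
from collections import deque


def hops(links, src, dst, failed_nodes=frozenset(), failed_links=frozenset()):
    """Shortest route length in hops, or None if src and dst are disconnected."""
    if src in failed_nodes or dst in failed_nodes:
        return None
    # Build the adjacency of the surviving network.
    adjacency = {}
    for a, b in links:
        if a in failed_nodes or b in failed_nodes:
            continue
        if (a, b) in failed_links or (b, a) in failed_links:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search from src.
    queue, dist = deque([src]), {src: 0}
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(hops(ring, 0, 1))                                          # fault-free
print(hops(ring, 0, 1, failed_links={(0, 1)}))                   # one link failure
print(hops(ring, 0, 1, failed_links={(0, 1)}, failed_nodes={3})) # disconnected
```

In this ring a single link failure is tolerated at the price of a longer route, while a second fault on the remaining path disconnects the pair.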

Achieving Fault Tolerance
- HW/SW at process/machine (node) level (covered in previous courses)
- At interconnect level
- Global HW/SW strategies
- At transaction level (communication protocol)
- At protocol level

Interconnect Level
- Building a fault-tolerant network
  - Based on multiple paths from source to destination
  - Spare nodes to replace failed ones
  - Several strategies
- Butterfly network
  - Non-fault-tolerant multi-stage network (butterfly network), typically built out of 2 x 2 switches: two inputs and two outputs
[Figure: 2 x 2 switch]

Butterfly network
- 2^k inputs and 2^k outputs
- k stages of 2^(k-1) switches each
- Connections follow a recursive pattern from input to output
- The butterfly is not fault-tolerant: there is only one path from any given input to a specific output
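
The sizes and the single-path property can be made concrete with a small model. The sketch assumes one common way of wiring a butterfly, in which stage s pairs the lines whose indices differ in bit k-1-s and the route is steered by the destination bits; the wiring on the original slide may differ, and the function names are hypothetical.

```python
def butterfly_size(k):
    """Dimensions of a butterfly built from 2 x 2 switches."""
    return {
        "inputs": 2 ** k,
        "outputs": 2 ** k,
        "stages": k,
        "switches_per_stage": 2 ** (k - 1),
    }


def route(k, src, dst):
    """Line index occupied after each stage when going from src to dst.

    At stage s the switch sets bit (k-1-s) of the line index to the
    corresponding bit of dst, so the whole path is fixed by (src, dst).
    """
    position, path = src, [src]
    for stage in range(k):
        bit = 1 << (k - 1 - stage)
        position = (position & ~bit) | (dst & bit)
        path.append(position)
    return path


print(butterfly_size(3))
print(route(3, 5, 2))   # the only path from input 5 to output 2
```

Because every stage leaves no choice, a failed switch on that path cuts the connection between the two endpoints, which is why extra stages or bypasses are needed for fault tolerance.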

Fault Tolerant Butterfly network
- Extra stage: stage 0 is duplicated at the input
- Bypass multiplexers around the switchboxes at the input and output stages: a failed switch can be bypassed by routing around it
- The network can remain connected despite the failure of up to one switchbox anywhere in the system

Where and how can it be used?
- Multi-stage interconnection network: connects N processors to N memory units in a shared memory architecture
- In the presence of faulty elements the system can still operate, possibly in a degraded mode
- The system's resilience as it degrades can be measured
  - Resilience measures:
    - Bandwidth
    - Average number of operational paths
    - Metrics of connectivity among processors and memories

Interstitial Mesh
- Each primary node has four spare nodes
- Each spare node is a spare for four primary nodes
- Higher level of fault tolerance, at the price of a higher redundancy overhead of almost 100%

Crossbar
- A regular crossbar is not fault-tolerant: the failure of any switchbox will disconnect certain pairs
- Input and output connections are augmented: each input can be sent to either of two rows and each output can be received on either of two columns
- If a switch becomes faulty, the row and column to which it belongs are replaced by the spare row and column

Protocols on Network
- Transactions and protocols
  - On reliable or unreliable links
  - With reliable or unreliable nodes

Transactional Model
- In distributed systems, one application may need to talk to multiple databases
- Applications are coded in a stylized way:
  - Begin transaction
  - Perform a series of read and update operations
  - Terminate by commit or abort
- Terminology
  - One application is the transaction manager
  - The transaction is the sequence of operations issued by the transaction manager while it runs
  - It schedules them in an interleaved but serializable order

Transactional Model
- Commit: an unconditional guarantee that the transaction will be completed; the effects of its actions (on a database, for example) will be permanent
- Abort: an unconditional guarantee that the transaction is backed out and that none of the effects of its actions will persist
- In distributed systems, several processes may need to coordinate to perform a task
  - Their actions should be atomic with respect to other processes executing at different sites

Transactional Model
- Atomic operations are the basic building blocks of distributed fault-tolerant systems
  - The effects of a process on the system (even under concurrency) should look like an undivided and uninterrupted operation
  - Atomicity is extended from the instruction level to a sequence of instructions or a group of processes that are executed atomically
  - They give the designer a means to specify which process interactions must be prevented, and to maintain the integrity of the system
- Characteristics of atomic actions
  - The process performing the action does not communicate with any other active process

Agreement
- Several clients try to reach an agreement to perform a task (commit or abort)
- B cannot differentiate between the two scenarios
- Agreement requires 3m+1 nodes to tolerate m Byzantine faults
[Figure: two scenarios in which nodes A, B and C exchange the values 0 and 1]
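The 3m+1 bound can be turned into simple arithmetic. The sketch below only encodes the bound stated on the slide; the helper names `min_nodes` and `max_traitors` are hypothetical.

```python
def min_nodes(m):
    """Smallest number of nodes needed to tolerate m Byzantine faults (3m+1 bound)."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return 3 * m + 1


def max_traitors(n):
    """Largest number of Byzantine faults that n nodes can tolerate under the same bound."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(n, "nodes tolerate", max_traitors(n), "Byzantine fault(s)")
```

With three nodes (A, B and C) no traitor can be tolerated, which matches the two indistinguishable scenarios seen by B; a fourth node is needed to tolerate one.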

Who is the traitor?
- One general (coordinator) and several lieutenants
  - Tolerance of a traitor general or a traitor lieutenant
- Only lieutenants => tolerance of 1 traitor

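Commit protocol (fault free) illustrated
- 1 coordinator, several cohorts
- 2 phases
[Figure: message sequence for one transaction. Coordinator: "ok to commit?" (commit request); cohorts: "ok with us" (reply from all); coordinator: "commit"; the coordinator waits for the acks before writing the final log]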
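The fault-free exchange can be sketched in a few lines. This is an in-memory illustration only, with no network, timeouts or crash recovery; the class `Cohort` and the function `two_phase_commit` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    can_commit: bool = True
    log: list = field(default_factory=list)

    def prepare(self):
        """Phase 1: answer the coordinator's "ok to commit?" request."""
        self.log.append("prepared" if self.can_commit else "refused")
        return self.can_commit

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision and acknowledge it."""
        self.log.append(decision)
        return "ack"


def two_phase_commit(cohorts, coordinator_log):
    # Phase 1: commit request sent to every cohort, replies collected from all.
    votes = [c.prepare() for c in cohorts]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the decision and wait for every ack.
    acks = [c.finish(decision) for c in cohorts]
    if all(a == "ack" for a in acks):
        coordinator_log.append(decision)  # final log entry, written after the acks
    return decision


if __name__ == "__main__":
    log = []
    print(two_phase_commit([Cohort("c1"), Cohort("c2")], log), log)
```

A single cohort voting "no" in phase 1 is enough to abort the transaction at every site, which is what makes the outcome atomic.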

Protocol-level fault tolerance
- Aim: transmit information from A to B through a network
- Parameters:
  - Correctness
  - Transmission (Tx) time
  - Delay
  - …
- … what is the fault model?

Reliable Protocols
- Transmission Control Protocol (TCP)
  - Correctness: packet checksum
  - End-to-end correctness: sequence number
  - Reordering
  - Congestion avoidance
  - Retransmission timeout

Packet (and header!) Checksum
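As an illustration of the checksum idea, here is a minimal sketch of the 16-bit ones' complement sum used by the Internet protocols (RFC 1071). It is a simplification: the real TCP checksum also covers a pseudo-header built from the IP addresses, which is left out here. The function name `internet_checksum` is an assumption of this example.

```python
def internet_checksum(data):
    """16-bit ones' complement checksum over a bytes object (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    checksum = internet_checksum(payload)
    # The receiver recomputes over payload + checksum and expects 0.
    print(hex(checksum), internet_checksum(payload + checksum.to_bytes(2, "big")))
```

A receiver that obtains a non-zero result knows the segment was corrupted in transit and discards it, leaving recovery to retransmission.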

Sequence Number

Reordering
- Different routes
- Resend of faulty packets
- Exploits the sequence number
  - Overhead: memory, latency
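Reordering by sequence number can be sketched as a small receive buffer. The example assumes one sequence number per packet (real TCP numbers bytes, not packets) and uses the hypothetical function name `reorder`.

```python
def reorder(packets, first_seq=0):
    """Deliver payloads in sequence order from (seq, payload) pairs
    that may arrive out of order or more than once."""
    pending = {}          # out-of-order packets held back: the memory overhead
    expected = first_seq  # next sequence number that may be delivered
    delivered = []
    for seq, payload in packets:
        if seq < expected or seq in pending:
            continue  # duplicate, e.g. caused by a resend
        pending[seq] = payload
        # Release every packet that is now contiguous with what was delivered.
        while expected in pending:
            delivered.append(pending.pop(expected))
            expected += 1
    return delivered


if __name__ == "__main__":
    arrivals = [(1, "b"), (0, "a"), (2, "c"), (1, "b")]
    print(reorder(arrivals))
```

Packet 1 has to wait in the buffer until packet 0 arrives, which is where the latency overhead comes from.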

Network dynamic fault models
- Congestion: too much traffic in a given node/segment

Congestion: architectural
- A node cannot handle the incoming packets:
  - Effect: packet drop
  - Reaction: buffering
  - Overhead: memory, latency

TCP and congestion
- Built-in algorithms to avoid congestion:
  - Congestion window
  - Slow start
  - Congestion avoidance
  - Fast retransmit
  - Fast recovery

TCP: Congestion Window
- Maximum number of outstanding packets
- Defined at each node
- If the number of outstanding packets (OP) exceeds the congestion window (CW), congestion is taking place

TCP: Slow start
- Modifies CW depending on the acknowledged packets (AP)
- CW grows exponentially with AP
  - Growth stops at a threshold
- Upon packet loss, wait for CW
  - Big CW: "gives time" to the congestion before resending
- OK for big transmissions, problematic for short-lived connections
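The growth of the congestion window can be sketched round by round. The model below counts CW in whole packets and one step per round trip, which is a simplification. After the threshold it switches to the linear growth of standard TCP congestion avoidance; that continuation is an addition to what the slide states. The function name `slow_start` and its parameters are hypothetical.

```python
def slow_start(ssthresh, rounds, initial_cw=1):
    """Congestion window (in packets) after each round trip, without losses."""
    cw = initial_cw
    history = [cw]
    for _ in range(rounds):
        if cw < ssthresh:
            # Slow start: every acknowledged packet adds one to CW,
            # so CW doubles per round trip until the threshold is reached.
            cw = min(cw * 2, ssthresh)
        else:
            # Past the threshold: linear growth (congestion avoidance).
            cw += 1
        history.append(cw)
    return history


if __name__ == "__main__":
    print(slow_start(ssthresh=8, rounds=5))
```

The first few round trips carry very little data, which is why short-lived connections suffer: they may finish before CW has grown to a useful size.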

TCP: Congestion Avoidance
- Combination of
  - Congestion Window (CW)
  - Round Trip Time (RTT)
  - Replicated (duplicate) ACKs
- Many algorithms exist:
  - TCP Tahoe and Reno
  - TCP Vegas
  - TCP New Reno
  - TCP Hybla
  - TCP BIC
  - TCP CUBIC
  - Compound TCP

TCP: RTT and Double Ack
[Figure: exchange between nodes A and B]

TCP: Fast retransmit
- Multiple ACKs: congestion in the network
  - Segment K has "n" ACKs (e.g. n = 4)
  - Segments > K will probably be dropped because of congestion
  - Resend segments > K before the timeout
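Detecting the repeated ACKs can be sketched as a counter over the incoming ACK stream. The example assumes packet-level ACK numbers and a threshold of three duplicates, i.e. four ACKs for the same segment in total, matching n = 4 above; the function name `fast_retransmit_points` is hypothetical.

```python
def fast_retransmit_points(acks, dup_threshold=3):
    """Return the ACK numbers at which a fast retransmit would be triggered."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                # Enough duplicates: resend now instead of waiting for the timeout.
                triggers.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return triggers


if __name__ == "__main__":
    print(fast_retransmit_points([1, 2, 3, 3, 3, 3, 4]))
```

A single duplicate is not enough to trigger a resend, because simple reordering in the network can also produce it.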

TCP: Fast recovery
- Combination of the previous approaches
  - When congestion is detected (e.g. multiple ACKs)
  - Reduce CW to the slow start threshold
  - Bigger CW -> more time to ease the congestion

More than TCP!
- TCP is OK for bulk data, but
  - Delay is not taken into account
  - It does not support graceful degradation
  - Poor performance for real-time applications
    - VoIP
    - Video streams
  - Application-specific protocols with variable QoS

Buffer Bloat (1): ideal path
[Figure: path between nodes A and B]

Buffer Bloat (2): REAL path
- ALL nodes are BUFFERED!
[Figure: path between nodes A and B]

Buffer Bloat (3)
- Hidden buffers
  - Higher DATA reliability
  - Big impact on transmission delay, even in optimal conditions
  - -> lower TIME reliability
- Solution … not there yet!
  - Might impact the performance of your embedded system …
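The delay cost of hidden buffers can be estimated with simple arithmetic: a full buffer adds its size divided by the rate of the outgoing link. The sketch below assumes every buffer on the path is full (worst case); the function names and the example figures are illustrative, not measurements.

```python
def queueing_delay_ms(buffer_bytes, link_bps):
    """Worst-case time (ms) a packet waits behind a full buffer on one link."""
    return buffer_bytes * 8 * 1000 / link_bps


def path_delay_ms(hops):
    """Sum of worst-case queueing delays over (buffer_bytes, link_bps) hops."""
    return sum(queueing_delay_ms(b, r) for b, r in hops)


if __name__ == "__main__":
    # Hypothetical path: a 1 MB buffer on a 10 Mbit/s link,
    # then a 256 kB buffer on a 100 Mbit/s link.
    hops = [(1_000_000, 10_000_000), (256_000, 100_000_000)]
    print(path_delay_ms(hops), "ms")
```

No packet is lost in this scenario, so data reliability looks perfect, yet the added delay can be far beyond what a VoIP or control application tolerates.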