Software Aging & Rejuvenation, DA-IICT Gandhinagar, January 9, 2006

  • Slides: 77
Software Aging & Rejuvenation
DA-IICT Gandhinagar, January 9, 2006
Kishor S. Trivedi
Dept. of Electrical & Computer Engineering, Duke University, Durham, NC 27708
kst@ee.duke.edu
www.ee.duke.edu/~kst
www.software-rejuvenation.com

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Cluster systems – SRN model
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions
Copyright 2005 by Kishor Trivedi

Motivation: Dependence on Computer Systems
  • Communication
  • Health & Medicine
  • Entertainment
  • Avionics
  • Banking

Downtime Costs per Hour
  • Brokerage operations: $6,450,000
  • Credit card authorization: $2,600,000
  • eBay (1 outage, 22 hours): $225,000
  • Amazon.com: $180,000
  • Package shipping services: $150,000
  • Home shopping channel: $113,000
  • Catalog sales center: $90,000
  • Airline reservation center: $89,000
  • Cellular service activation: $41,000
  • On-line network fees: $25,000
  • ATM service fees: $14,000
Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8: "... based on a survey done by Contingency Planning Research."

High Availability: Software is the problem (1)
  • Hardware fault tolerance, fault management, and reliability/availability modeling are relatively well developed
  • System outages are more often due to software faults
Software reliability is one of the weakest links in system reliability

High Availability: Software is the problem (2)
  • Fault avoidance through good software engineering practices is difficult for large/complex software systems
  • Impossible to fully test and verify that software is fault-free
  • Yet there are stringent requirements for failure-free operation
Software fault tolerance is a potential solution to improve software reliability in lieu of virtually impossible fault-free software

Software Fault Tolerance Techniques
  • Design diversity
      • N-version programming
      • Recovery block
      • N self-checking programming
  • Data diversity
      • N-copy programming

Software remains the problem
  • Design-diversity-based software fault tolerance is expensive
  • Data diversity may have limited applicability
  • Stringent requirements for failure-free operation

Software Fault Tolerance: New Thinking
  • Environment diversity
      • Checkpointing and rollback, retry
      • Replication of software modules (applications)
          • Helps in dealing with hardware transients
          • But also helps in dealing with software bugs
          • Does it help? If yes, why?
      • Proactive fault management (software rejuvenation)
  • Software rejuvenation is a cost-effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes. It allows proactive repairs to be carried out at the discretion of the user/administrator, e.g., in the middle of the night

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Cluster systems – SRN model
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions

Software Aging
  • "Software aging" phenomenon
      • Long-running software tends to show an increasing failure rate.
      • Not related to the application program becoming obsolete due to changing requirements/maintenance.
  • What constitutes aging?
      • Deterioration in the size of free OS resources
      • Accumulation of internal errors
  • Common examples
      • Memory leaks
      • Data corruption
      • Fragmentation
      • Round-off errors

Software Aging – Examples
  • Netscape, xrn, Windows 9x
  • LAX airport shutdown (Sep 12, 2004)
  • File system aging [Smith & Seltzer]
  • Crash/hang failures in general-purpose applications
  • Gradual service degradation in the AT&T transaction processing system [Avritzer et al.]
  • Error accumulation in Patriot missile system's software [Marshall]
  • Resource exhaustion in Apache [Li et al.]

Software Fault Classification
  • Bohrbugs: software bugs that are reproducible, easily found, and (often) fixed during the testing and debugging phase
  • Mandelbugs: software bugs that are hard to find and fix; they (often) remain in the software during the operational phase
      • These bugs may never be fixed, but if the operation is retried or the system is rebooted, the bugs may not manifest themselves as failures
      • Manifestation is non-deterministic and dependent on the software (or its environment) reaching very rare states
  • Yet another cause of software failures is resource exhaustion, e.g., memory leakage, swap space fragmentation
  • Software appears to "age" due to resource exhaustion

Software Fault Classification
[Diagram: Bohrbugs, Mandelbugs, aging-related bugs]

Software Fault Classification
[Diagram: software (OS, recovery s/w, applications) with Bohrbugs, Mandelbugs, and "aging"-related bugs. Phase labels: design/development, test/debug, operational. Remedy labels: des./data diversity, retry opn., restart app., reboot node.]

Environment Diversity: New Approach to S/W FT
  • Transient nature of software failures
      • [Gray] Bohrbugs and Mandelbugs (or Heisenbugs)
      • [Lee & Iyer] Tandem GUARDIAN – 70% transient faults
      • [Sullivan & Chillarege] IBM's system software – most failures caused by peak conditions in workload, timing and exception errors
  • Environmental diversity allows the use of time redundancy over expensive design diversity
      • Reactive in approach
          • [Adams] [Gray] [Siewiorek] Restart
          • [Jalote et al.] Rollback, rollforward
          • [Wang et al.] Progressive retry
          • [folklore] Occasional reboot, "switch off and on"
      • Proactive approach
          • Software rejuvenation

Software Rejuvenation: Definition
  • Proactive fault management technique aimed at postponing/preventing crash failures and/or performance degradation
  • Involves occasionally stopping the running software, "cleaning" its internal state and/or its environment, and restarting it
      • Rejuvenation of the environment, not of software
  • Counteracts the aging phenomenon
      • Frees up OS resources
      • Removes error accumulation
  • Common techniques for cleaning
      • Garbage collection, defragmentation, flushing kernel and file server tables, etc.

Software Rejuvenation – Examples
  • AT&T billing applications [Huang et al.]
  • JPL REE System
  • Patriot missile system software – switch off and on every 8 hours [Marshall]
  • On-board preventive maintenance for long-life deep space missions (NASA's X2000 Advanced Flight Systems Program) [Tai et al.]
  • IBM Director Software Rejuvenation (xSeries) [IBM & Duke researchers]
  • Microsoft IIS 5.0 process recycling tool
  • Process restart in Apache [Li et al.]

Software Rejuvenation – Trade-off
  • Advantages
      • Reduces costs of sudden aging-related failures.
      • Can be applied at the discretion of the user/administrator, e.g., in the middle of the night.
  • Disadvantages
      • Direct costs of carrying out rejuvenation
      • Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.)
  • Important research issue: find optimal times to perform rejuvenation!

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Cluster systems – SRN model
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions

Software Rejuvenation – Trade-off
  • Two approaches
      • Use analytical model to optimize rejuvenation schedule
          • Lucent Bell Labs [Huang et al., '95]
          • Duke [IEEE-TC '98, SIGMETRICS '96, ISSRE '95, PRDC '00, SIGMETRICS '01, Comp. J. '01, SRDS '02, DSN '02, ISSRE '02, DSN '03, IEEE-TR '05]
          • Others [IPDS '98, PNPM '99]
      • Use measurements of resource degradation to determine optimal rejuvenation schedule
          • Duke [ISSRE '98, ISSRE '99, IBMJRD '01, ISESE '02, IEEE-TPDS '05]

Analytic Models
  • Single node models
      • Condition-based
          • CTMC model
          • SMP model
      • Time-based
          • MRSPN model
  • Cluster systems
      • IBM cluster model (time-based, condition-based)
      • Motorola CMTS model
  • Degradation model to analytically show that the time to failure (TTF) has an increasing failure rate (IFR)

Failure rate
  • Preventive maintenance is useful only if the failure rate is increasing
  • If the time-to-failure distribution is exponential, then the failure rate is constant
  • Need to assume (and establish) that TTF is IFR

Analytic Models: Software Aging and Rejuvenation
  • A simple and useful model of increasing failure rate:
    [Diagram: robust state, then (aging) failure-probable state, then failed state]
  • Time to failure: hypo-exponential distribution
  • Increasing failure rate
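The increasing failure rate of this three-state model can be checked numerically. The sketch below is an illustration, not part of the original slides: it assumes the two sojourn times (robust to failure-probable, failure-probable to failed) are exponential with distinct rates, so the time to failure is a two-stage hypo-exponential. The names `aging_rate` and `failure_rate` are hypothetical, and the 240-hour and 720-hour means are borrowed from the cluster model parameters purely as sample values.

```python
import math


def hypoexp_hazard(t, aging_rate, failure_rate):
    """Hazard rate h(t) = f(t) / R(t) of a two-stage hypo-exponential TTF."""
    a, b = aging_rate, failure_rate
    if a == b:
        raise ValueError("this closed form needs two distinct rates")
    # Density and survival function of the sum of Exp(a) and Exp(b).
    density = a * b / (b - a) * (math.exp(-a * t) - math.exp(-b * t))
    survival = (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)
    return density / survival


# Sample values (assumed): mean 240 h to become failure-probable,
# then mean 720 h until failure.
for hours in (0, 10, 100, 1000):
    print(hours, hypoexp_hazard(hours, 1 / 240, 1 / 720))
```

The hazard starts at zero and rises toward the smaller of the two rates, which is the behaviour the slide labels "increasing failure rate".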

Analytic Models: CTMC model [Huang 95]
[Diagram: model w/o rejuvenation (robust state, failure-probable state, failed state); model with rejuvenation (adds a rejuvenation state reached from the failure-probable state)]
  • From this continuous-time Markov chain model, one can find a closed-form expression for the optimal rejuvenation trigger rate (r4)
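A minimal sketch of how such a four-state chain can be solved in closed form. The transition structure is an assumption read off the state names on the slide (robust to failure-probable, failure-probable to failed or to rejuvenation, and both failed and rejuvenation back to robust); only the trigger rate r4 is named in the slides, and every parameter name and sample value below is hypothetical.

```python
def rejuvenation_ctmc_steady_state(aging_rate, failure_rate, repair_rate,
                                   rejuv_trigger_rate, rejuv_completion_rate):
    """Steady-state probabilities of the assumed four-state chain.

    Each state is expressed relative to the failure-probable state using
    the balance equations, then the vector is normalised to sum to 1.
    """
    weights = {
        # inflow to failure-probable = outflow from it
        "robust": (failure_rate + rejuv_trigger_rate) / aging_rate,
        "failure_probable": 1.0,
        # inflow to failed = outflow from failed
        "failed": failure_rate / repair_rate,
        # inflow to rejuvenation = outflow from rejuvenation
        "rejuvenating": rejuv_trigger_rate / rejuv_completion_rate,
    }
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


# Rates per hour (assumed sample values).
without = rejuvenation_ctmc_steady_state(1 / 240, 1 / 720, 2.0, 0.0, 6.0)
with_r4 = rejuvenation_ctmc_steady_state(1 / 240, 1 / 720, 2.0, 0.01, 6.0)
print("P(failed) without rejuvenation:", without["failed"])
print("P(failed) with r4 = 0.01/h:   ", with_r4["failed"])
```

Weighting the "failed" and "rejuvenating" probabilities by their respective downtime costs gives a cost expression in r4 that can then be optimized.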

Analytic Models: CTMC model [Huang 95]
[Figure: CTMC model]

Analytic Models: Semi-Markov model [Dohi 00]
  • Relax the assumption of exponentially distributed sojourn times (time-independent transition rates)
  • Hence have a semi-Markov model
  • Can find closed-form expression for the optimal (deterministic) time to rejuvenation trigger

Analytic Models: Semi-Markov model [Dohi 00]
[Figure: semi-Markov model]

Analytic Models: MRSPN model [Garg 95]
  • If the degraded state can't be determined, allow the rejuvenation trigger clock to start in the robust state; we then obtain a Markov regenerative process

Analytic Models: MRSPN model [Garg 95]
  • Optimal (deterministic) time to rejuvenation trigger is determined numerically

Cluster Systems
  • Cluster system [Pfister]
      • Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself
  • Easier scaling to larger systems, high levels of availability/performance, and low management costs
  • No single point of failure
  • Node failures transparent to users
  • Graceful repairs, shutdowns, upgrades

Rejuvenation for Cluster Systems: Motivation
  • Rejuvenation using the fail-over mechanisms
  • Long-term benefits in terms of availability/performance
  • Continuous operation (possibly at a degraded level)
      • Practically zero downtime
      • Less disruptive and lower overhead than unplanned outages
      • Transparent to user/application
  • Most current industry initiatives are reactive
  • Two approaches
      • Simple time-based (periodic)
      • Condition-based (only from the "failure-impending" state)

Rejuvenation for Cluster Systems: SRN Models
  • Rejuvenation using the fail-over mechanisms in a rolling fashion
  • Modeling using SRNs (Stochastic Reward Nets)
  • Analysis for 2 rejuvenation policies
      • Simple time-based policy
          • All nodes rejuvenated successively at the end of each rejuvenation interval
      • Condition-based policy
          • Nodes rejuvenated only from the "failure-probable" state
  • Various configurations
      • a/b: cluster with a nodes that can tolerate at most b individual node failures, i.e., an (a-b)-out-of-a system
  • Model solution
      • SPNP (Stochastic Petri Net Package)
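To make the a/b notation concrete, the sketch below computes the probability that an (a-b)-out-of-a system is up. It is a deliberate simplification and not the SRN model from the slides: it assumes nodes are independent and each is up with the same probability, which ignores aging, repair dependencies, and rejuvenation. The function name and the 0.99 node availability are hypothetical.

```python
from math import comb


def cluster_up_probability(total_nodes, tolerated_failures, node_availability):
    """P(at least a-b of a independent, identical nodes are up)."""
    if not 0 <= tolerated_failures <= total_nodes:
        raise ValueError("tolerated_failures must be between 0 and total_nodes")
    required = total_nodes - tolerated_failures
    p = node_availability
    # Binomial sum over the number of nodes that are up.
    return sum(
        comb(total_nodes, up) * p ** up * (1 - p) ** (total_nodes - up)
        for up in range(required, total_nodes + 1)
    )


# The 8/1 configuration: 8 nodes, at most 1 node failure tolerated.
print(cluster_up_probability(8, 1, 0.99))
```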

SRN Model: Basic Cluster Model
[Figure: SRN of the basic cluster model]

SRN Model: Simple Time-Based Rejuvenation
[Figure: SRN with simple time-based rejuvenation]

Model Parameters

  Transition       Mean time
  Tfprob           240 hours
  Tnodefail        720 hours
  Tnoderepair      30 mins
  Tsysrepair       4 hours
  Trejuv           10 mins

  Cost             Value
  costnodefail     $5000/hour
  costnoderejuv    $250/hour

Measures Computed
  • Unavailability: (#Psysfail == 1) ? 1 : 0
  • Cost: #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail
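The two measures are reward functions evaluated on each marking of the net, where #P denotes the number of tokens in place P. A rough Python rendering follows, assuming a marking is represented as a dictionary from place name to token count. The dictionary representation and the `costsysfail` value are assumptions; the slides give no value for the system-failure cost.

```python
def unavailability_reward(marking):
    """1 when the system-failure place holds a token, otherwise 0."""
    return 1 if marking["Psysfail"] == 1 else 0


def cost_reward(marking, costrejuv, costnodefail, costsysfail):
    """Cost rate ($/hour) incurred while the net is in this marking."""
    return (marking["Prejuv"] * costrejuv
            + marking["Pnodefail"] * costnodefail
            + marking["Psysfail"] * costsysfail)


# Example marking: two nodes being rejuvenated, one node failed.
marking = {"Prejuv": 2, "Pnodefail": 1, "Psysfail": 0}
rate = cost_reward(marking, costrejuv=250, costnodefail=5000,
                   costsysfail=10000)  # costsysfail is a made-up value
print(unavailability_reward(marking), rate)
```

The expected unavailability and expected cost are then the probability-weighted averages of these rewards over all markings.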

Results: Simple Time-Based Rejuvenation
Effect of costnodefail/costrejuv for the 8/1 configuration
  • Cost of node failure is fixed
  • Decrease in cost ratio implies increase in cost of rejuvenation
  • Hence, decrease in cost ratio increases total expected cost
  • As the rejuvenation interval increases, rejuvenation is performed less frequently
  • As the rejuvenation interval tends to infinity, almost no rejuvenation is performed and all the plots tend to the same value

Results: Simple Time-Based Rejuvenation
[Figure: expected cost for various configurations]

Recap: Analysis of Rejuvenation for Cluster Systems
  • Huge benefit in terms of unavailability (UA) and cost improvement for systems with more than one spare
  • Simple time-based policy better than prediction-based for some cases
  • Condition-based policy much better for large node repair times and low node-failure coverage
  • Future work
      • Consider other performability measures
      • Explore non-ideal effects of common-mode failure and node-failure coverage

Application to CMTS Example [ISSRE 02]
  • Cable modem system and broadband access
      • Most popular & promising high-speed Internet access
      • Success lies in the widespread HFC cable networks and the industry standard DOCSIS
      • Cable modem termination system (CMTS) is the most complex and crucial component of the system
  • High availability requirement of CMTS
      • Existing approaches only provide hardware redundancy
      • Current systems cannot achieve five-nines availability
  • Our proposed approach and contributions
      • Propose software rejuvenation in CMTS cluster system
      • Construct analytic models, obtain numerical results, optimize rejuvenation parameters, and show the benefits

Basic model without rejuvenation
[Diagram of PCMTS and SCMTS submodels. Labels: HW & SW failures (both submodels); HW failure detection, switchover, repair, and giveback; SW failure detection, switchover, reboot, and giveback; HW failure detection and repair; SW failure detection and reboot.]

Model with rejuvenation
[Diagram of PCMTS and SCMTS submodels plus a timer. Labels: same as the basic system (both submodels); rejuvenation for the robust and "aged" nodes (both submodels); rejuvenation, switchover, and giveback; rejuvenation.]
  • Approximate deterministic timer interval by r-stage Erlang distribution
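The Erlang approximation works because an r-stage Erlang with a fixed mean becomes more concentrated around that mean as r grows. The sketch below is illustrative, with hypothetical names and a made-up 24-hour interval: it picks the per-stage rate so the mean equals the timer interval and reports the coefficient of variation, which is 1/sqrt(r).

```python
import math


def erlang_timer(interval, stages):
    """Parameters of an r-stage Erlang whose mean equals `interval`."""
    if stages < 1:
        raise ValueError("stages must be a positive integer")
    stage_rate = stages / interval        # each exponential stage has this rate
    cv = 1.0 / math.sqrt(stages)          # coefficient of variation
    return {
        "stage_rate": stage_rate,
        "mean": interval,
        "std_dev": interval * cv,
        "cv": cv,
    }


# A 24-hour rejuvenation timer (assumed value) with more and more stages.
for r in (1, 4, 16, 64):
    approx = erlang_timer(24.0, r)
    print(r, approx["stage_rate"], approx["std_dev"])
```

More stages give a sharper approximation of the deterministic interval, at the price of r extra states in the underlying Markov chain.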

Degradation model [DSN 03, IEEE-TR 05]
  • Explicitly connecting resource leaks with failure rate and hence aging

Problem Parameters
  • Total amount of resource
  • Resource request arrival rate
  • Resource release rate
  • Accumulated resource leak at time t
  • Number of processes in the system
  • Conditional probability that the system fails to honor the resource request in state k upon the arrival of a new request

Degradation Model
[Figure: degradation model]

Degradation Model (cont'd)
  • Failure rate
  • Conditional probability
      • Homogeneous CTMC (leakless)
      • Non-homogeneous CTMC (leak-present)

Degradation Analysis
  • Asymptotically constant failure rate (leakless)

Degradation Analysis
  • Monotonic degradation (leak-present)

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Cluster systems – SRN model
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions

Measurement-Based Approach
  • Objective
      • Periodically monitor and collect data on the attributes responsible for the "health" of the system
      • Quantify the effect of aging on system resources
          • Detection and validation of aging
          • Proposed metric – estimated time to exhaustion
  • Three approaches
      • Time-based (workload-independent) estimation [Garg 98]
      • Workload-based estimation [Vaidyanathan 99, TDSC 05]
      • ARMA/ARX models [Li 02]

Data Collection: Experimental Setup
  • SNMP-based resource monitoring tool
  • Data related to OS resource usage (memory, process table, file table, etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts, etc.) collected for over 3 months at 10-minute intervals

Time Plots: Non-parametric Regression Smoothing
[Plots: Real Memory Free; File Table Size]
  • Trend detection: seasonal Kendall test for trend

Time-Based Approach (Workload-independent): Estimated Time to Resource Exhaustion

Workload-based Approach: Motivation
  • Time-based approach for estimation of resource exhaustion
      • Non-parametric regression smoothing
      • Seasonal Kendall test for trend
      • Simple linear equation using Sen's slope estimate
      • Doesn't incorporate workload
  • Intuitive that rate of resource exhaustion depends on current system load
      • Strong correlation between workload and system reliability/availability
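The time-based estimate can be sketched in a few lines. Sen's slope is the median of all pairwise slopes, which makes it robust to outliers; dividing the remaining free resource by the rate of decline then gives a linear time-to-exhaustion estimate. This is only an illustration of the idea: the function names and the sample data are hypothetical, and the smoothing and trend-test steps listed above are omitted.

```python
from statistics import median


def sens_slope(times, values):
    """Median of the slopes between every pair of observations."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]
    ]
    if not slopes:
        raise ValueError("need at least two observations at distinct times")
    return median(slopes)


def time_to_exhaustion(current_free, slope):
    """Intervals until a free resource reaches zero; None if not declining."""
    if slope >= 0:
        return None
    return current_free / -slope


# Made-up samples of free memory (KB), one per 10-minute interval.
intervals = [0, 1, 2, 3, 4]
free_kb = [1000, 990, 985, 970, 960]
slope = sens_slope(intervals, free_kb)            # KB per 10 min
remaining = time_to_exhaustion(free_kb[-1], slope)
print(slope, remaining)
```

The result is expressed in sampling intervals; multiply by the interval length (10 minutes in the experimental setup) to convert it to wall-clock time.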

Workload Characterization: Cluster Analysis
  • Workload parameters
      • cpuContextSwitch, sysCall
      • pageIn, pageOut
  • Statistical cluster analysis
      • Hartigan's k-means algorithm
  • Clusters {1, 2, 3} and {4, 5} merged to get 8 clusters

Workload Characterization: Transition Probability Matrix
[Figure: transition probability matrix of the workload states]

Workload Characterization: Sojourn Time Distribution
[Figure: sojourn time distributions of the workload states]

Model Validation
  • Steady-state probabilities computed from the model match very closely with actual probabilities obtained from the observed data
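The validation step can be illustrated with a toy semi-Markov workload model. The sketch below is an assumption-laden example, not the authors' procedure: it estimates the embedded transition matrix and mean sojourn times from a sequence of (state, duration) visits, computes the model's steady-state probabilities, and compares them with the observed fraction of time spent in each state. State names and durations are invented.

```python
from collections import Counter, defaultdict


def embedded_transition_matrix(visits):
    """Row-normalised counts of jumps between successive workload states."""
    counts = defaultdict(Counter)
    for (src, _), (dst, _) in zip(visits, visits[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def mean_sojourn_times(visits):
    """Average observed holding time per state."""
    total, visit_count = Counter(), Counter()
    for state, duration in visits:
        total[state] += duration
        visit_count[state] += 1
    return {s: total[s] / visit_count[s] for s in total}


def semi_markov_steady_state(matrix, sojourn, iterations=2000):
    """Long-run fraction of time per state.

    Assumes every state has at least one observed outgoing jump.
    """
    states = sorted(matrix)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        # Lazy step (stay with probability 1/2) avoids oscillation when the
        # embedded chain is periodic; the stationary vector is unchanged.
        nxt = {s: 0.5 * pi[s] for s in states}
        for src in states:
            for dst, p in matrix[src].items():
                nxt[dst] += 0.5 * pi[src] * p
        pi = nxt
    # Weight the embedded-chain probabilities by the mean sojourn times.
    weight = {s: pi[s] * sojourn[s] for s in states}
    norm = sum(weight.values())
    return {s: w / norm for s, w in weight.items()}


# Invented trace: (workload state, minutes spent there).
visits = [("low", 30), ("high", 10), ("low", 50),
          ("high", 10), ("low", 40), ("high", 10)]
model = semi_markov_steady_state(embedded_transition_matrix(visits),
                                 mean_sojourn_times(visits))
total_time = sum(d for _, d in visits)
observed = {s: sum(d for st, d in visits if st == s) / total_time
            for s in model}
print(model, observed)
```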

Modeling Resource Usage: Reward Function
  • Reward function for each resource: Sen's slope estimate (in KB/10 min) for each resource at every workload state

Solution Method
  • Semi-Markov reward model
  • Markovized into Markov reward model
  • Solved using SHARPE (Symbolic Hierarchical Automated Reliability and Performance Estimator)
  • Measures obtained (reward rate of exhaustion)
      • Expected instantaneous reward rate
      • Expected reward rate at steady state
      • Expected accumulated reward at time t
      • Mean time to accumulate a given reward = mean "job completion time"
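Of these measures, the expected reward rate at steady state is the simplest to show: it is the sum over workload states of the steady-state probability times the per-state slope. The sketch below uses invented probabilities and slopes; it is not SHARPE output, and the final division is only a rough long-run approximation of the time to accumulate a given reward.

```python
def expected_steady_state_reward_rate(state_probabilities, reward_rates):
    """Sum of p_i * r_i over all workload states."""
    return sum(p * reward_rates[state]
               for state, p in state_probabilities.items())


# Hypothetical two-state workload model.
probabilities = {"low": 0.8, "high": 0.2}
slopes_kb_per_interval = {"low": 2.0, "high": 12.0}   # KB per 10 min

rate = expected_steady_state_reward_rate(probabilities, slopes_kb_per_interval)
# Rough long-run estimate of the intervals needed to consume 4000 KB.
intervals_to_exhaust = 4000 / rate
print(rate, intervals_to_exhaust)
```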

Results: Transient slope estimates
[Plots: slope for usedSwapSpace; slope for realMemoryFree]

Results: Estimates for slope and time to exhaustion
[Table: estimates of slope (KB/10 min) and time to exhaustion (days) for usedSwapSpace and realMemoryFree]
  • Workload-based approach: lower time to exhaustion than the time-based approach

Recap: Workload-based Approach
  • Developed measurement-based model which incorporates workload
  • Demonstrated relation between workload and resource usage
  • Estimates for slope and time to exhaustion
      • Not actual machine failure times
      • Need more accurate models
      • Dependencies between various resources
  • A step further towards predicting failures resulting from resource exhaustion
      • New/better preventive maintenance policies

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Transaction processing system
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions

IBM xSeries Software Rejuvenation Agent (SRA)
  • Implemented in a high-availability clustered environment
  • Monitors consumable resources, estimates time to exhaustion, and generates alerts if within the user notification horizon
  • IBM Director system management tool
      • Provides GUI to configure SRA
      • Acts upon alerts
  • Two versions
      • Periodic rejuvenation
      • Prediction-based rejuvenation

Rejuvenation in IBM Director: Periodic Rejuvenation Menu
[Screenshot: periodic rejuvenation menu]

Rejuvenation in IBM Director: Prediction-Based Rejuvenation
[Screenshot: prediction-based rejuvenation]

Rejuvenation Granularity
  • Level 1 rejuvenation
      • Restart service
      • Only when stoppage of service saves necessary states
  • Level 2 rejuvenation
      • OS reboot
      • Application failover and recovery by cluster management software

Outline
  • Introduction & Motivation
      • Software reliability and fault tolerance
  • Software Aging and Rejuvenation
  • Analytic Models
      • CTMC model
      • MRSPN model
      • SMP model
      • Cluster systems – SRN model
      • Degradation model
  • Measurement-based Models
      • Time-based
      • Time and workload-based
  • Software Rejuvenation in a Commercial Server
  • Summary and Conclusions

Summary: Approaches to Rejuvenation
  • Open-loop approach: no feedback from system
      • Elapsed time (periodic) [ISSRE 95, TOC 98, SIGMETRICS 01]
      • Elapsed time and load [TOC 98]
  • Closed-loop approach: feedback from the system (monitoring)
      • Offline
          • Time-based analysis [ISSRE 98, ISESE 02]
          • Time & workload-based analysis [ISSRE 99]
      • Online [SHAMAN 02, IBMJRD 01]
      • Failure data [HASE 00]

Summary: Granularity of Rejuvenation
[Diagram: cluster failover, cluster failback, node restart, application restart, process restart, selective rejuvenation]

Summary
  • Software aging not anecdotal – real-life scientific phenomenon
  • Rejuvenation implemented in several special-purpose applications and one general-purpose cluster system
  • Interesting problems for modeling community & practitioners of fault tolerance
  • Realistic models with non-exponential distributions can be solved by the current state of modeling methodologies
  • More information: www.software-rejuvenation.com

References
  • Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, In Proc. of the 25th IEEE Intl. Symp. on Fault Tolerant Computing (FTCS-25), Pasadena, CA, June 1995.
  • Analysis of Software Rejuvenation using Markov Regenerative Stochastic Petri Net, S. Garg, A. Puliafito, M. Telek and K. S. Trivedi, In Proc. of the Sixth IEEE Intl. Symp. on Software Reliability Engineering, Toulouse, France, October 1995.
  • Analysis of Preventive Maintenance in Transaction Based Software Systems, S. Garg, A. Puliafito, M. Telek and K. S. Trivedi, IEEE Trans. on Computers, 47(1), January 1998.
  • Analysis of Software Cost Models with Rejuvenation, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. of the IEEE Intl. Symp. on High Assurance Systems Engineering, HASE-2000, November 2000.
  • Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. of the 2000 Pacific Rim Intl. Symp. on Dependable Computing, PRDC 2000, Los Angeles, December 2000.
  • A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi, In Proc. of the Ninth IEEE Intl. Symp. on Software Reliability Engineering, Paderborn, Germany, November 1998.
  • A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems, K. Vaidyanathan and K. S. Trivedi, In Proc. of the Tenth IEEE Intl. Symp. on Software Reliability Engineering, Boca Raton, Florida, November 1999. Now in IEEE-TDSC, April-June 2005.

References (contd.)
  • Modeling and Analysis of Software Aging and Rejuvenation, K. S. Trivedi, K. Vaidyanathan and K. Goseva-Popstojanova, In Proc. of the 33rd Annual Simulation Symp., Washington D.C., April 2000.
  • Analysis and Implementation of Software Rejuvenation in Cluster Systems, K. Vaidyanathan, R. E. Harper, S. W. Hunter and K. S. Trivedi, In Proc. of the Joint Intl. Conf. on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, June 2001.
  • Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, Vol. 45, No. 2, March 2001.
  • An Approach to Estimation of Software Aging in a Web Server, L. Li, K. Vaidyanathan and K. S. Trivedi, In Proc. of the Intl. Symp. on Empirical Software Engineering, ISESE 2002, Nara, Japan, October 2002.
  • Analysis of Inspection-Based Preventive Maintenance in Operational Software Systems, K. Vaidyanathan, D. Selvamuthu and K. S. Trivedi, In Proc. of the Intl. Symp. on Reliable Distributed Systems, SRDS 2002, Osaka, Japan, October 2002.
  • Adaptive Software Rejuvenation: Degradation Model and Rejuvenation Scheme, Y. Bao, X. Sun and K. S. Trivedi, Proceedings of the International Conference on Dependable Systems and Networks, DSN 2003, pages 241-248, San Francisco, June 2003; to appear in IEEE-TR, Sept. 2005.
  • Probability & Statistics with Reliability, Queuing and Computer Science Applications (2nd ed.), K. S. Trivedi, John Wiley, 2001.

Collaborators
  • Sachin Garg, Ph.D. (Duke), now with Avaya Research
  • Antonio Puliafito, Messina, Italy
  • Kalyan V., Ph.D. (Duke), now with Sun Microsystems
  • Bharat Madan, Duke post-doc
  • Katerina Goseva-Popstojanova, former Duke post-doc, now with West Virginia University
  • Tadashi Dohi, former Duke visiting scientist, now with Hiroshima University, Japan
  • Yujuan Bao, current Duke student
  • Yun Liu, current Duke student
  • Steve Hunter, IBM, Research Triangle Park
  • Rick Harper, IBM Research
  • Kenny Gross, Sun Microsystems CTO Labs

Related Research
  • Recovery-Oriented Computing (ROC)
      • David Patterson (Berkeley)
      • Armando Fox (Stanford)
      • Improve availability by minimizing recovery/repair time
  • Recursive restarts
      • Restart tree

Thank You!