Architecting Fault-Tolerant Software Systems

Hasan Sözer


ISBN 978-90-365-2788-0

The increasing size and complexity of software systems makes it hard to prevent or remove all possible faults. Faults that remain in the system can eventually lead to a system failure. Fault tolerance techniques are introduced for enabling systems to recover and continue operation when they are subject to faults. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. In this thesis, we introduce methods and tools for the application of fault tolerance techniques to increase the reliability and availability of software systems.


Invitation

to the public defense of my thesis

Architecting Fault-Tolerant Software Systems

on Thursday, January 29, 2009 at 16:45 in Collegezaal 2 of the Spiegel building at the University of Twente.

At 16:30 I will give a brief introduction to the subject of my thesis.

The defense will be followed by a reception in the same building.


Chairman and secretary:
Prof. Dr. Ir. A.J. Mouthaan, University of Twente, The Netherlands

Promoter:
Prof. Dr. Ir. M. Akşit, University of Twente, The Netherlands

Assistant promoter:
Dr. Ir. B. Tekinerdoğan, Bilkent University, Turkey

Members:
Dr. Ir. J. Broenink, University of Twente, The Netherlands
Prof. Dr. Ir. A. van Gemund, Delft University of Technology, The Netherlands
Dr. R. de Lemos, University of Kent, United Kingdom
Prof. Dr. A. Romanovsky, Newcastle University, United Kingdom
Prof. Dr. Ir. G. Smit, University of Twente, The Netherlands

CTIT Ph.D. thesis series no. 09-135. Centre for Telematics and Information Technology (CTIT), P.O. Box 217 - 7500 AE Enschede, The Netherlands.

This work has been carried out as part of the Trader project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Dutch Government under the Bsik program. The work in this thesis has been carried out under the auspices of the research school IPA (Institute for Programming research and Algorithmics).

ISBN 978-90-365-2788-0

ISSN 1381-36-17 (CTIT Ph.D. thesis series no. 09-135)

IPA Dissertation Series 2009-05

Cover design by Hasan Sözer

Printed by PrintPartners Ipskamp, Enschede, The Netherlands

Copyright © 2009, Hasan Sözer, Enschede, The Netherlands


Architecting Fault-Tolerant Software Systems

DISSERTATION

to obtain

the degree of doctor at the University of Twente, on the authority of the rector magnificus,

Prof. Dr. H. Brinksma,

on account of the decision of the graduation committee, to be publicly defended

on Thursday the 29th of January 2009 at 16.45

by

Hasan Sözer

born on the 21st of August 1980 in Bursa, Turkey


Prof. Dr. Ir. M. Akşit (promoter)


Acknowledgements

When I was an M.Sc. student at Bilkent University, I met Bedir Tekinerdoğan, who was a visiting assistant professor there at that time. Towards the end of my M.Sc. studies, he notified me about the vacancy for a Ph.D. position at the University of Twente and recommended me for this position. First of all, I would like to thank him for the faith he had in me. Following my admission to this position, he became my daily supervisor and we have been working very closely ever since. I have always been impressed by his ability to abstract key points away from details and by his writing and presentation skills, which are based on a true empathy towards the intended audience. I would like to thank him for his contributions to my intellectual growth and for his continuous encouragement, which has been an important source of motivation for me.

I have carried out my Ph.D. studies in the software engineering group led by Mehmet Akşit. We have had regular meetings with him to discuss my progress and future research directions. In these meetings, I have sometimes been exposed to challenging criticism, but always with a positive, optimistic attitude and encouragement. Over the years, I have witnessed his ability to foresee pitfalls and I have become convinced of the accuracy of his predictions in research. I would like to thank him for his reliable guidance.

During my studies, I have also had the opportunity to work together with Hichem Boudali and Mariëlle Stoelinga from the formal methods group. I have learned a lot from them and an important part of this thesis (Section 5.10) presents the results of our collaboration. I would like to thank them for their contribution.

I would like to thank the members of my Ph.D. committee: Jan Broenink, Arjan van Gemund, Rogério de Lemos, Alexander Romanovsky, and Gerard Smit for spending their valuable time and energy to evaluate my work. Their useful comments enabled me to dramatically improve this thesis.


during our regular project meetings. In particular, David Watts, Jozef Hooman and Teun Hendriks have reviewed my work closely. Ben Pronk brought up the research direction on local recovery, which later happened to be the main focus of my work. He has also spent his valuable time to provide us TV domain knowledge together with Rob Golsteijn. Previously we had several discussions with Iulian Nitescu, Paul L. Janson and Pierre van de Laar on failure scenarios, fault/error/failure classes and recovery strategies. These discussions have also directly or indirectly contributed to this thesis.

The members of the software engineering group have provided me useful feedback during our regular seminars. I would like to thank them also for the comfortable working environment I have had. In particular, I thank my roommates over the years: Joost, Christian and Somayeh. In addition to the Dutch courses provided by the university, Joost has given me a ‘complementary’ course on Dutch language and Dutch culture. He has also read and corrected my official Dutch letters, which would have caused quite some trouble if they were not corrected. Christian and Somayeh have always been open to give their opinion about any issue I may bring up and help me if necessary.

I would like to thank Ellen Roberts-Tieke, Joke Lammerink, Elvira Dijkhuis, Hilda Ferweda and Nathalie van Zetten for their invaluable administrative support. To be able to finish this work, first of all I had to feel secure and comfortable in my social environment. In the following, I would like to extend my gratitude to people, who have provided me such an environment during the last four years.

When I first arrived in Enschede, Gürcan was one of the few people I knew at the university. He helped me a lot to get acquainted with the new environment. He provided me with a useful set of survival strategies to deal with never-ending official procedures. This set of strategies was later extended to cover survival in the mountains and during military service as well.

I have been sharing an apartment with Espen during the last three years. The life is a lot easier if you always have a reliable friend around to talk to. Espen is very effective in killing stress and boosting courage in any circumstance (almost like alcohol, but almost healthy at the same time). Besides Espen, I had the chance to meet with several other good friends while I was living at a student house (the infamous 399) in the campus. I am sure that we will keep in touch in the future, in one way or another.


new Turkish friends during my studies. They have become very close friends of mine and they have helped me not to feel so homesick. There are too many people in this group to list, so I will refer to them collectively as ‘öztwenteliler’. Selim and Emre are also in this group and I especially thank them for “supporting” me during my defense.

I would like to thank all the people who have contributed to Tusat. Similarly, I am grateful to people who have volunteered to work for making our life more social and enjoyable in the university, for example, members of Esn Twente over the years. Stichting Kleurrijke Dans has also made my life more colorful lately. In addition, I would like to thank all my friends who accompanied me on recreational trips and in various other activities during the last four years.

I have had endless love and support from my family throughout my life. I thank foremost my parents for always standing by me regardless of the geographic distance between us.


Abstract

The increasing size and complexity of software systems makes it hard to prevent or remove all possible faults. Faults that remain in the system can eventually lead to a system failure. Fault tolerance techniques are introduced for enabling systems to recover and continue operation when they are subject to faults. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. We consider the following problems in designing a fault-tolerant system. First, existing reliability analysis techniques generally do not prioritize potential failures from the end-user perspective and accordingly do not identify sensitivity points of a system. Second, existing architecture styles are not well-suited for specifying, communicating and analyzing design decisions that are particularly related to the fault-tolerant aspects of a system. Third, there are no adequate analysis techniques that evaluate the impact of fault tolerance techniques on the functional decomposition of software architecture. Fourth, realizing a fault-tolerant design usually requires a substantial development and maintenance effort.

To tackle the first problem, we propose a scenario-based software architecture reliability analysis method, called SARAH, that benefits from mature reliability engineering techniques (i.e. FMEA, FTA) to provide an early reliability analysis of the software architecture design. SARAH evaluates potential failures from the end-user perspective to identify sensitive points of a system without requiring an implementation.

As a new architectural style, we introduce Recovery Style for specifying fault-tolerant aspects of software architecture. Recovery Style is used for communicating and analyzing architectural design decisions and for supporting detailed design with respect to recovery.

As a solution for the third problem, we propose a systematic method for optimizing the decomposition of software architecture for local recovery, which is an effective fault tolerance technique to attain high system availability. To support the method, we have developed an integrated set of tools that employ optimization techniques, state-based analytical models (i.e. CTMCs) and dynamic analysis on the system.

The method enables i) modeling the design space of the possible decomposition alternatives, ii) reducing the design space with respect to domain and stakeholder constraints and iii) making the desired trade-off between availability and performance metrics.

To reduce the development and maintenance effort, we propose a framework, FLORA, that supports the decomposition and implementation of software architecture for local recovery. The framework provides reusable abstractions for defining recoverable units and for incorporating the necessary coordination and communication protocols for recovery.


Chapter 1 Introduction

A system is said to be reliable [3] if it can continue to provide the correct service, which implements the required system function. A failure occurs when the delivered service deviates from the correct service. The system state that leads to a failure is defined as an error and the cause of an error is called a fault [3]. It becomes harder to prevent or remove all possible faults in a system as the size and complexity of software increases. Moreover, the behavior of current systems is affected by an increasing number of external factors since they are generally integrated in networked environments, interacting with many systems and users. It may therefore not be economically and/or technically feasible to implement a fault-free system. As a consequence, the system needs to be able to tolerate faults to increase its reliability. Potential failures can be prevented by designing the system to be recoverable from errors regardless of the faults that cause them. This is the motivation for fault-tolerant design, which aims at enabling a system to continue operation in case of an error. In this way, the system can remain available to its users, possibly with reduced functionality and performance, rather than failing completely. In this thesis, we introduce methods and tools for the application of fault tolerance techniques to increase the reliability and availability of software systems.


1.1 Thesis Scope

The work presented in this thesis has been carried out as a part of the TRADER project [120]. The objective of the project is to develop methods and tools for ensuring reliability of digital television (DTV) sets. A number of important trends can be observed in the development of embedded systems like DTVs. First, due to the high industrial competition and the advances in hardware and software technology, there is a continuous demand for products with more functionality. Second, the implementation of functionality is shifting from hardware to software. Third, a product is no longer developed by a single manufacturer but is host to multiple parties. Finally, embedded systems are more and more integrated in networked environments that affect these systems in ways that might not have been foreseen during their construction. Altogether, these trends increase the size and complexity of software in embedded systems and as such make software faults a primary threat for reliability.

For a long period, reliability and fault-tolerant aspects of embedded systems have been mainly addressed at the hardware level or in source code. However, in the face of the current trends it has now been recognized that reliability analysis should focus more on software components. In addition, incorporation of some fault tolerance techniques should be considered at a higher abstraction level than source code. It must be ensured that the design of the software architecture supports the application of necessary fault tolerance techniques. Software architecture represents the gross-level structure of the system that directly influences the subsequent analysis, design and implementation. Hence, it is important to evaluate the impact of fault tolerance techniques on the software architecture. In this way, the quality of the system can be assessed before realizing the fault-tolerant design. This is essential to identify potential risks and avoid costly redesigns and reimplementations.

Different types of fault tolerance techniques are employed in different application domains depending on their requirements. For safety-critical systems, such as the ones used in nuclear power plants and airplanes, safety is the primary concern and any failure that can cause harm to people and the environment must be prevented. In that context, the additional cost of the fault-tolerant design due to the required hardware/software resources is a minor issue. In the TRADER project [120], we have focused on the consumer electronics domain, in particular, DTV systems. For such systems, which are probably less subject to catastrophic failures, the cost and the perception of the user turn out to be the primary concerns instead. These systems are very cost-sensitive and failures that are not directly perceived by the user can be accepted to some extent, whereas failures that can be directly observed by the user require special attention. In this context, the fault tolerance techniques to be applied must be evaluated with respect to their additional costs and their effectiveness based on the user perception.

1.2 Motivation

Traditional software fault tolerance techniques are mostly based on design diversity and replication because software failures are generally caused by design and coding faults, which are permanent [58]. So, the erroneous system state that is caused by such faults has also been assumed to be permanent. On the other hand, software systems are also exposed to so-called transient faults. These faults are mostly activated by timing issues and peak conditions in workload that could not have been anticipated before. Errors that are caused by such faults are likely to be resolved when the software is executed after a clean-up and initialization [58]. As a result, it is possible to design a system that can recover from a significant fraction of errors [16] without replication and design diversity and as such without requiring substantial hardware/software resources [58]. Many such fault tolerance techniques are available but developing a fault-tolerant system is not always trivial.
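The idea of resolving errors caused by transient faults through clean-up and re-initialization can be illustrated with a minimal sketch. The names `TransientError`, `FlakyComponent` and `run_with_restart` are hypothetical and do not come from the thesis; the component simulates a transient condition that disappears after two calls.

```python
class TransientError(Exception):
    """Hypothetical error type standing in for an error caused by a transient fault."""


class FlakyComponent:
    """Illustrative component: fails while a simulated peak condition lasts."""

    calls = 0  # shared counter that models the external condition, not component state

    def handle(self, request):
        FlakyComponent.calls += 1
        if FlakyComponent.calls <= 2:  # the transient condition holds for two calls
            raise TransientError("timing problem under peak load")
        return f"handled {request}"


def run_with_restart(make_component, request, max_restarts=3):
    """Retry a request on a freshly initialized component after each error."""
    restarts = 0
    while True:
        component = make_component()  # clean-up and initialization: fresh state
        try:
            return component.handle(request)
        except TransientError:
            if restarts == max_restarts:
                raise  # give up: the fault does not appear to be transient
            restarts += 1
```

No replica or diverse implementation is involved here: the same code is simply executed again from a clean state, which is why this kind of recovery needs few additional resources.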

First of all, we need to know which fault tolerance techniques to select and where in the system to apply the selected set of techniques. Due to the cost-sensitivity of consumer electronics products like DTVs, it is not feasible to design a system that can tolerate all the faults of each of its elements. Thus, we need to analyze potential failures and prioritize them based on user perception. Accordingly, we should identify sensitive elements of the system, whose failures might cause the most critical system failures. The set of fault tolerance techniques should then be selected based on the type of faults activated by the identified sensitive elements. Existing reliability analysis techniques generally do not prioritize potential failures from the end-user perspective and therefore do not identify the sensitive points of a system.

After we analyze the system and select the set of fault tolerance techniques accordingly, we should adapt the software architecture description to specify the fault-tolerant design. The software architecture of a system is usually described using more than one architectural view. Each view supports the modeling, understanding, communication and analysis of the software architecture for different concerns. This is because current software systems are too complex to represent all the concerns in one model. An analysis of the current practice for representing architectural views reveals that they focus mainly on functional concerns and are not well-suited for communicating and analyzing design decisions that are particularly related to the fault-tolerant aspects of a system. New architectural styles are required to be able to document the software architecture from the fault tolerance point of view.

Fault tolerance techniques may influence the decomposition of software architecture. Local recovery is such a fault tolerance technique, which aims at making the system ready for correct service as much as possible, and as such attaining high system availability [3]. For achieving local recovery the architecture needs to be decomposed into separate units (i.e. recoverable units) that can be recovered in isolation. Usually there are many alternative ways to decompose the system for local recovery. Increasing the number of recoverable units can provide higher availability. However, this will also introduce an additional performance overhead since more modules will be isolated from each other. On the other hand, keeping the modules together in one recoverable unit will increase the performance, but will result in a lower availability since the failure of one module will affect the others as well. As a result, for selecting a decomposition alternative we have to cope with a trade-off between availability and performance. There is no adequate integrated set of analysis techniques to directly support this trade-off analysis, which requires optimization techniques, construction and analysis of quality models, and analysis of the existing code base to automatically derive dependencies between modules of the system. We need the utilization and integration of several analysis techniques to optimize the decomposition of software architecture for recovery.
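To give a feeling for how many decomposition alternatives can exist, the sketch below enumerates every way of grouping a list of modules into recoverable units. It assumes that each module belongs to exactly one recoverable unit, so that alternatives correspond to set partitions; this modeling choice and the module names are illustrative assumptions.

```python
def partitions(modules):
    """Yield every way to split `modules` into non-empty groups (set partitions)."""
    if not modules:
        yield []
        return
    first, rest = modules[0], modules[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing group in turn ...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ... or give it a recoverable unit of its own.
        yield [[first]] + smaller


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for alternative in partitions(["A", "B", "C"]):
        print(alternative)
```

Under this assumption the number of alternatives grows very quickly with the number of modules (5 for three modules, 15 for four), which is why evaluating all alternatives by hand does not scale.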

The optimal decomposition for recovery is usually not aligned with the existing decomposition of the system. As a result, the realization of local recovery, while preserving the existing decomposition, is not trivial and requires a substantial development and maintenance effort [26]. Developers need to be supported for the implementation of the selected recovery design.

Accordingly, this thesis provides software architecture modeling, analysis and realization techniques to improve the reliability and availability of software systems by introducing fault tolerance techniques.

1.3 The Approach

In the following subsections, we summarize the approaches that we have taken for supporting the design and implementation of fault-tolerant software systems. The overall goal is to employ fault tolerance techniques that can mainly tolerate transient faults and as such to improve the reliability and availability of cost-sensitive systems from the user point of view.


1.3.1 Software architecture reliability analysis using failure scenarios

Our first approach aims at analyzing the potential failures and the sensitivity points at the software architecture design phase before the fault-tolerant design is implemented. Since implementing the software architecture is a costly process, it is important to predict the quality of the system and identify potential risks, before committing enormous organizational resources [31]. Similarly, it is of importance to analyze the hazards that can lead to failures and to analyze their impact on the reliability of the system before we select and implement fault tolerance techniques. For this purpose, we introduce a software architecture reliability analysis approach (SARAH) that benefits from mature reliability engineering techniques and scenario-based software architecture analysis to provide an early software reliability analysis. SARAH defines the notion of failure scenario model that is based on the Failure Modes and Effects Analysis method (FMEA) in the reliability engineering domain. The failure scenario model is applied to represent so-called failure scenarios that define a Fault Tree Set (FTS). FTS is used for providing a severity analysis for the overall software architecture and the individual architectural elements. Unlike conventional reliability analysis techniques, which prioritize failures based on criteria such as safety concerns, SARAH prioritizes failure scenarios based on severity from the end-user perspective. The analysis results can be used for identifying so-called architectural tactics [5] to improve the reliability. Hereby, architectural tactics form building blocks of design patterns for fault tolerance.
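The notion of prioritizing failure scenarios by user-perceived severity can be sketched as follows. This is only an illustration of the general idea, not the failure scenario model or the fault tree analysis that SARAH defines; the record fields, the severity scale and the example scenarios are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    """Hypothetical record: which element fails, how, and how badly the user perceives it."""
    element: str
    failure_mode: str
    user_severity: int  # assumed scale: 1 (hardly noticed) to 5 (very annoying)


def rank_elements(scenarios):
    """Sum user-perceived severity per architectural element, most sensitive first."""
    totals = defaultdict(int)
    for scenario in scenarios:
        totals[scenario.element] += scenario.user_severity
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Illustrative scenarios, invented for this example.
scenarios = [
    FailureScenario("Teletext", "page not refreshed", 2),
    FailureScenario("Audio", "sound drops out", 5),
    FailureScenario("Audio", "volume setting lost", 3),
]
ranking = rank_elements(scenarios)  # [("Audio", 8), ("Teletext", 2)]
```

Elements at the top of such a ranking are the candidates for which fault tolerance techniques are worth their additional cost.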

1.3.2 Architectural style for recovery

Once we have selected the appropriate fault tolerance techniques and related architectural tactics, they should be incorporated into the existing software architecture. Introduction of fault tolerance mechanisms usually requires dedicated architectural elements and relations that impact the software architecture decomposition. Our second approach aims at modeling the resulting decomposition explicitly by providing a practical and easy-to-use method to document the software architecture from a recovery point of view. For this purpose, we introduce the recovery style for modeling the structure of the system related to the recovery concern. It is used for communicating and analyzing architectural design decisions and supporting detailed design with respect to recovery. The recovery style considers recoverable units as first class architectural elements, which represent the units of isolation, error containment and recovery control. The style defines basic relations for coordination and application of recovery actions. As a further specialization of the recovery style, the local recovery style is provided, which is used for documenting a local recovery design including the decomposition of software architecture into recoverable units and the way that these units are controlled.

1.3.3 Quantitative analysis and optimization of software architecture decomposition for recovery

To introduce local recovery to the system, first we need to select a decomposition among many alternatives. We propose a systematic approach dedicated to optimizing the decomposition of software architecture for local recovery. To support the approach, we have developed an integrated set of tools that employ i) dynamic program analysis to estimate the performance overhead introduced by different decomposition alternatives, ii) state-based analytical models (i.e. CTMCs) to estimate the availability achieved by different decomposition alternatives, and iii) optimization techniques for automatic evaluation of decomposition alternatives with respect to performance and availability metrics. The approach enables the following.

• modeling the design space of the possible decomposition alternatives

• reducing the design space with respect to domain and stakeholder constraints

• making the desired trade-off between availability and performance metrics

With this approach, the designer can systematically evaluate and compare decomposition alternatives, and select an optimal decomposition.
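The trade-off step can be illustrated with a small sketch that filters out dominated alternatives and then picks the most available one within an overhead budget. The selection rule, the alternative names and all numbers below are invented for illustration; they are not results or tools from the thesis.

```python
def pareto_front(alternatives):
    """Keep alternatives that no other alternative beats on both metrics.

    `alternatives` maps a name to (availability, overhead); higher availability
    and lower overhead are better.
    """
    front = []
    for name, (avail, overhead) in alternatives.items():
        dominated = any(
            a >= avail and o <= overhead and (a > avail or o < overhead)
            for other, (a, o) in alternatives.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


def best_within_budget(alternatives, max_overhead):
    """Return the most available alternative whose overhead fits the budget, or None."""
    feasible = {n: v for n, v in alternatives.items() if v[1] <= max_overhead}
    if not feasible:
        return None
    return max(feasible, key=lambda n: feasible[n][0])


# Invented example values: (estimated availability, relative performance overhead).
alternatives = {
    "1 RU": (0.970, 0.00),
    "2 RUs": (0.985, 0.05),
    "3 RUs (a)": (0.990, 0.12),
    "3 RUs (b)": (0.984, 0.15),
}
```

With these example values, "3 RUs (b)" is dominated by "2 RUs", and a budget of 10% overhead selects "2 RUs".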

1.3.4 Framework for the realization of software architecture recovery design

After the optimal decomposition for recovery is selected, the software architecture should be partitioned accordingly. In addition, new supplementary architectural elements and relations should be implemented to enable local recovery. To reduce the resulting development and maintenance efforts we introduce a framework, FLORA, that supports the decomposition and implementation of software architecture for local recovery. The framework provides reusable abstractions for defining recoverable units and the necessary coordination and communication protocols for recovery. Using our framework, we have introduced local recovery to the open-source media player called MPlayer for several decomposition alternatives. We have then performed measurements on these implementations to validate the results of our analysis approaches.

1.4 Thesis Overview

The thesis is organized as follows.

Chapter 2 provides background information and a set of definitions that is used throughout this thesis. It introduces the basic concepts of reliability, fault tolerance and software architectures.

Chapter 3 presents the software architecture reliability analysis method (SARAH). SARAH is a scenario-based analysis method, which aims at providing an early evaluation and feedback at the architecture design phase. It utilizes mature reliability engineering techniques to prioritize failure scenarios from the user perspective and to identify sensitive elements of the architecture accordingly. The output of SARAH can be utilized as an input by the techniques that are introduced in Chapter 4 and Chapter 5. This chapter is a revised version of the work described in [111], [117], and [118].

Chapter 4 introduces a new architectural style, called Recovery Style, for modeling the structure of the software architecture that is related to the fault tolerance properties of a system. The style is used for communicating and analyzing architectural design decisions and supporting detailed design with respect to recovery. This style is used in Chapter 6 to represent the designs to be realized. It also supports the understanding of Chapter 5. This chapter is a revised version of the work described in [110].

Chapter 5 proposes a systematic approach dedicated to optimizing the decomposition of software architecture for local recovery. In this chapter, we explain several analysis techniques and tools that employ dynamic analysis, analytical models and optimization techniques. These are all integrated to support the approach.

Chapter 6 presents the framework FLORA that supports the decomposition and implementation of software architecture for local recovery. The framework provides reusable abstractions for defining recoverable units and the necessary coordination and communication protocols for recovery. This chapter is a revised and extended version of the work described in [112].

Chapter 7 provides our conclusions. The evaluations, discussions and related work for the particular contributions are provided in the corresponding chapters.


An overview of the main chapters is depicted in Figure 1.1. Hereby, the rounded rectangles represent the chapters of the thesis. The solid arrows represent the recommended reading order of these chapters. After reading Chapter 2, the reader can immediately start reading Chapter 3, 4 or 5. Chapter 4 should be read before Chapter 6. All the other chapters are self-contained. All the chapters provide complementary work and the works that are presented in Chapters 4 to 6 are directly related.

[Figure 1.1: overview of the main chapters and their recommended reading order]


Chapter 2 Background and Definitions

In our work, we utilize concepts and techniques from both the areas of dependability and software architectures. In this chapter, we provide background information on these two areas and we introduce a set of definitions that will be used throughout the thesis.

2.1 Dependability and Fault Tolerance

Dependability is the ability of a system to deliver service that can justifiably be trusted [3]. It is an integrative concept that encompasses several quality attributes including reliability, availability, safety, integrity and maintainability. A system is considered to be dependable if it can avoid failures (service failures) that are more frequent or more severe than is acceptable [3]. A failure occurs when the delivered service of a system deviates from the required system function [3]. An error is defined as the system state that is liable to lead to a failure and the cause of an error is called a fault [3]. Figure 2.1 depicts the fundamental chain of these concepts that leads to a failure. As an example, assume that a software developer allocates an insufficient amount of memory for an input buffer. This is the fault. At some point during the execution of the software, the size of the incoming data overflows this buffer. This is the error. As a result, the operating system kills the corresponding process and the user observes that the software crashes. This is the failure.

Figure 2.1: The fundamental chain of dependability threats leading to a failure


Figure 2.1 shows the simplest possible chain of dependability threats. Usually, there are multiple errors involved in the chain, where an error propagates to other errors and finally leads to a failure [3].

2.1.1 Dependability and Related Quality Attributes

Dependability encompasses the following quality attributes [3]:

• reliability: continuity of correct service.

• availability: readiness for correct service.

• safety: absence of catastrophic consequences on the user(s) and the environment.

• integrity: absence of improper system alterations.

• maintainability: ability to undergo modifications and repairs.

Depending on the application domain, different emphasis might be put on different attributes. In this thesis, we have considered reliability and availability, whereas safety, integrity and maintainability are out of the scope of our work. Reliability and availability are very important quality attributes in the context of fault-tolerant systems and they are closely related. Reliability is the ability of a system to perform its required functions under stated conditions for a specified period of time, that is, the ability of a system to function without a failure. Availability is the proportion of time during which a system is in a functioning condition. Ideally, a fault-tolerant system can recover from errors before any failure is observed by the user, which preserves reliability. However, this is not always possible in practice and the system can be unavailable during its recovery. After a fault is activated, a fault-tolerant system must become operational again as soon as possible to increase its availability.
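The difference between the two attributes can be made concrete with a small numerical sketch. It uses the common textbook formulas for steady-state availability, MTTF / (MTTF + MTTR), and for reliability under a constant failure rate, exp(-t / MTTF). The constant failure rate and the function names are assumptions made for this example, not definitions taken from the thesis.

```python
import math


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time the system is up (mean time to failure / repair)."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of running for time t without failure, assuming a constant failure rate."""
    return math.exp(-t / mttf)


if __name__ == "__main__":
    # Same failure behavior, but recovery takes 1 hour versus 0.1 hours.
    print(steady_state_availability(99.0, 1.0))
    print(steady_state_availability(99.0, 0.1))
    print(reliability(10.0, 99.0))
```

Shortening the recovery time raises availability while leaving reliability unchanged, which is why fast recovery matters even when failures cannot be avoided.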

Even if there are no faults activated, fault tolerance techniques introduce a performance overhead during the operational time of the system. The overhead can be caused, for instance, by monitoring of the system for error detection, collecting and logging system traces for diagnosis, saving data for recovery and wrapping system elements for isolation. For this reason, in addition to reliability and availability, we consider performance as a relevant quality attribute in this work although it is not a dependability attribute. Performance is defined as the degree to which a system accomplishes its designated functions within given constraints, such as speed and accuracy [60].

In traditional reliability and availability analysis, a system is assumed to be either up and running, or it is not. However, some fault-tolerant systems can also be partially available (also known as performance degradation). This fact is taken into account by performability [52], which combines reliability, availability and performance quality attributes. The quantification of performability rewards the system for the performance that is delivered not only during its normal operational time but also during partial failures and their recovery [52]. Essentially, it measures how well the system performs in the presence of failures over a specified period of time.
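The idea of rewarding delivered performance can be sketched as a reward-weighted sum over system states. This is a simplified illustration and not the formal performability model of [52]; the function name, the three states and all numbers are assumptions.

```python
from typing import Iterable, Tuple


def performability(states: Iterable[Tuple[float, float]]) -> float:
    """Reward-weighted average over (fraction of time, reward) pairs.

    A reward of 1.0 stands for full performance, 0.0 for no service.
    """
    states = list(states)
    total = sum(fraction for fraction, _ in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * reward for fraction, reward in states)


# Illustrative profile: fully operational, degraded, and down.
profile = [(0.95, 1.0), (0.04, 0.5), (0.01, 0.0)]
print(round(performability(profile), 4))
```

With these assumed numbers, a binary up/down analysis would report either 0.95 or 0.99, depending on whether the degraded state counts as down or up, whereas the reward-weighted measure (about 0.97) lies in between.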

2.1.2 Dependability Means

To prevent a failure, the chain of dependability threats as shown in Figure 2.1 must be broken. This is possible through i) preventing occurrence of faults, ii) removing existing faults, or iii) tolerating faults. In the last approach, we accept that faults may occur but we deal with their consequences before they lead to a failure, if possible. Error detection is the first necessary step for fault tolerance. In addition, detected errors must be recovered. Based on [3], Figure 2.2 depicts the dependability means and the features of fault tolerance as a simple feature diagram.

Figure 2.2: Basic features of dependability means and fault tolerance

The other nodes represent its features. The solid circles indicate mandatory features (i.e. the feature is required if its parent feature is selected). The empty circles indicate optional features. Mandatory features that are connected through an arc-decorated edge (see Figure 2.3) are alternatives to each other (i.e. exactly one of them is required if their parent feature is selected).

In addition to fault prevention, fault removal and fault tolerance, fault forecasting is also included in Figure 2.2 as a dependability means. Fault forecasting aims at evaluating the system behavior with respect to fault occurrence or activation [3]. Figure 2.2 also shows the two mandatory features of fault tolerance: error detection and recovery. Recovery has one mandatory feature, error handling, which eliminates errors from the system state [3]. The fault handling feature prevents faults from being activated again. This requires further features such as diagnosis, which reveals and localizes the cause(s) of error(s) [3]. Diagnosis can also enable a more effective error handling. If the cause of the error is localized, the recovery procedure can take actions concerning the associated components without impacting the other parts of the system and related system functionality.

2.1.3 Fault Tolerance and Error Handling

When faults manifest themselves during system operations, fault tolerance techniques provide the necessary mechanisms to detect and recover from errors, if possible, before they propagate and cause a system failure. Error recovery is generally defined as the action with which the system is set to a correct state from an erroneous state [3]. The domain of recovery is quite broad due to the different types of faults (e.g. transient, permanent) to be tolerated and the different requirements (e.g. cost-effectiveness, high performance, high availability) imposed by different types of systems (e.g. safety-critical systems, consumer electronics). Figure 2.3 shows a partial view of error handling features for fault tolerance. We have derived the features of the recovery domain through a domain analysis based on the corresponding literature [3, 29, 36, 58].

Figure 2.3: A partial view of error handling features for fault tolerance

As shown in Figure 2.3, error handling can be organized into three categories: compensation, backward recovery and forward recovery [3]. Compensation means that the system continues to operate without any loss of function or data in case of an error. This requires replication of system functionality and data. N-version programming is a compensation technique, where N independently developed, functionally equivalent versions of the software are executed in parallel. All the outputs of these versions are compared to determine the correct, or best, output, if one exists [81]. Backward recovery (i.e. rollback) puts the system in a previous state, which was known to be error free. The recovery blocks approach uses multiple versions of the software for backward recovery. After the execution of the first version, the output is tested. If the output is not acceptable, the state of the system is rolled back to the state before the first version was executed. Similarly, several versions are executed and tested sequentially until the output is acceptable [81]. The system fails if no acceptable output is obtained after all the versions are tried. Restarting a system is also an example of backward recovery, in which the system is put back to its initial state. Backward recovery can employ different features for saving data to be restored after recovery (i.e. check-pointing) or for saving messages and events to replay them after recovery (i.e. log-based recovery). In either case, a stable storage is required to store recovery-related information (data, messages etc.). Forward recovery (i.e. rollforward) puts the system in a new state to recover from an error. Exception handling is an example forward recovery technique, where the execution is transferred to the corresponding handler when an exception occurs. Graceful degradation [107] is a forward recovery approach that puts the system in a state with reduced functionality and performance.
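The two multi-version techniques described above can be sketched as follows. This is a minimal, sequential illustration under simplifying assumptions: the system state is a plain dictionary, the checkpoint is an in-memory deep copy rather than stable storage, the versions are executed one after another rather than in parallel, and the voter is a simple majority vote. All function names are hypothetical.

```python
import copy
from collections import Counter
from typing import Any, Callable, Dict, List


def recovery_block(state: Dict[str, Any],
                   alternates: List[Callable[[Dict[str, Any]], Any]],
                   acceptance_test: Callable[[Any], bool]) -> Any:
    """Backward recovery: try versions in turn, rolling back after each failure."""
    checkpoint = copy.deepcopy(state)          # save state before the first version
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass                               # treat an exception as a rejected output
        state.clear()                          # roll back to the checkpoint
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no alternate produced an acceptable output")


def n_version(versions: List[Callable[[Any], Any]], value: Any) -> Any:
    """Compensation: run all versions and return the majority output."""
    outputs = [version(value) for version in versions]
    winner, count = Counter(outputs).most_common(1)[0]
    if count > len(versions) // 2:
        return winner
    raise RuntimeError("no majority among version outputs")


def primary(state):    # faulty version: corrupts the state
    state["total"] = -1
    return state["total"]


def secondary(state):  # simpler, correct version
    state["total"] = sum(state["items"])
    return state["total"]


state = {"items": [1, 2, 3], "total": 0}
print(recovery_block(state, [primary, secondary], lambda r: r >= 0))
print(n_version([lambda x: x * x, lambda x: x ** 2, lambda x: x + x], 3))
```

The sketch also shows the difference in cost: the recovery block only executes further versions when the acceptance test fails, whereas N-version programming always executes every version.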

Different error handling features can be utilized based on the fault assumptions and system characteristics. The granularity of the error handling in recovery can differ as well. In the case of global recovery, the recovery mechanism can take actions on the system as a whole (e.g. restart the whole system). In the case of local recovery, erroneous parts can be isolated and recovered while the rest of the system is available. Thus, a system with local recovery can provide a higher system availability to its users in case of component failures.

2.2 Software Architecture Design and Analysis

A software architecture for a program or computing system consists of the structure or structures of that system, which comprise elements, the externally visible properties of those elements, and the relationships among them [5].

Software architecture represents a common abstraction of a system [5] and as such it forms a basis for mutual understanding and communication among architects, developers, system engineers and anybody who has an interest in the construction of the software system. As one of the earliest artifacts of the software development life cycle, software architecture embodies early design decisions, which impact the system’s detailed design, implementation, deployment and maintenance. Hence, it must be carefully documented and analyzed. Software architecture also promotes large-scale reuse by transferring architectural models across systems that exhibit common quality attributes and functional requirements [5].

In the following subsections, we introduce basic techniques and concepts that are used for i) describing architectures, ii) analyzing quality properties of an architecture and iii) achieving or supporting qualities in the architecture design.

2.2.1 Software Architecture Descriptions

An architecture description is a collection of documents to describe a system’s architecture [78]. The IEEE 1471 standard [78] is a recommended practice for architectural description of software-intensive systems. It introduces a set of concepts and relations among them as depicted in Figure 2.4 with a UML (Unified Modeling Language) [100] diagram. Hereby, the key concepts are marked with bold rectangles and two types of relations are defined: association and aggregation. Associations are labeled with a role and cardinality. For example, Figure 2.4 shows that a concern is important to (the role) 1 or more (the cardinality) stakeholders and a stakeholder has 1 or more concerns. Aggregations are identified with a diamond shape at the end of an edge and they represent part-whole relationships. For example, Figure 2.4 shows that a view is a part of an architectural description.

Figure 2.4: Basic concepts of architecture description (IEEE 1471 [78])

The people or organizations that are interested in the construction of the software system are called stakeholders. These might include, for instance, end users, architects, developers, system engineers and maintainers. A concern is an interest that pertains to the system’s development, its operation or any other aspects that are critical or otherwise important to one or more stakeholders. Stakeholders may have different, possibly conflicting, concerns that they wish the system to provide or optimize. These might include, for instance, certain run-time behavior, performance, reliability and evolvability. A view is a representation of the whole system from the perspective of a related set of concerns. A viewpoint is a specification of the conventions for constructing and using a view.

To summarize the key concepts as described in the IEEE 1471 framework:

• A system has an architecture.

• An architecture is described by one or more architectural descriptions.

• An architectural description selects one or more viewpoints.

• A viewpoint covers one or more concerns of stakeholders.

• A view conforms to a viewpoint and it consists of a set of models that represent one aspect of an entire system.

This conceptual framework provides a set of definitions for key terms and outlines the content requirements for describing a software architecture. However, it does not standardize or restrict how an architecture is designed and how its description is produced. There are several software design processes and architecture design methods proposed in the literature, such as the Rational Unified Process [64], Attribute-Driven Design [5] and Synthesis-Based Software Architecture Design [115]. Software architecture design processes and methods are out of the scope of this thesis.

Another issue that is not standardized by IEEE 1471 [78] is the notation or format that is used for describing architectures. UML [103] is an example standard notation for modeling object-oriented designs, which can be utilized for describing architectures as well. Similarly, several architecture description languages (ADLs) have been introduced as modeling notations to support architecture-based development. Both general-purpose and domain-specific ADLs have been proposed. Some ADLs are designed to have a simple, understandable, and possibly graphical syntax, but not necessarily formally defined semantics. Other ADLs encompass formal syntax and semantics, supported by powerful analysis tools, model checkers, parsers, compilers and code synthesis tools [82].

2.2.2 Software Architecture Analysis

Software architecture forms one of the key artifacts in the software development life cycle since it embodies early design decisions. Accordingly, it is important that the architecture design supports the required qualities of a software system. Software architecture analysis helps to predict the risks and the quality of a system before it is built, thereby reducing unnecessary maintenance costs. On the other hand, usually it is also necessary to evaluate the architecture of a legacy system if it is subject to major modification, porting or integration with other systems [5]. Software architecture constitutes an abstraction of the system, which makes it possible to suppress unnecessary details and focus only on the aspects relevant for analysis. Basically, there are two complementary software architecture analysis techniques: i) questioning techniques and ii) measuring techniques [5].

Questioning techniques use scenarios, questionnaires and check-lists to review how the architecture responds to various situations [19]. Most of the software architecture analysis methods that are based on questioning techniques use scenarios for evaluating architectures [31]. These methods take as input the architecture design and estimate the impact of predefined scenarios on it to identify the potential risks and the sensitivity points of the architecture. Questioning techniques sometimes employ measurements as well but these are mostly intuitive estimations relying on hypothetical models without formal and detailed semantics.

Measuring techniques use architectural metrics, simulations and static analysis of formal architectural models [82] to provide quantitative measures of qualities such as performance and availability. The type of analysis depends on the underlying semantic model of the ADL, where usually a quality model is applied, such as queuing networks [22].

In general, measuring techniques provide more objective results compared to questioning techniques. As a drawback, they require the presence of a working artifact (e.g. a prototype implementation, a model with enough semantics) for measurement. On the other hand, questioning techniques can be applied on hypothetical architectures much earlier in the life cycle [5]. However, (possibly quantitative) results of questioning techniques are inherently subjective. In this thesis, we explore both analysis approaches. In Chapter 3, we present SARAH, which is an analysis method based on questioning techniques. In Chapter 5, we present an analysis approach based on measuring techniques.

2.2.3 Architectural Tactics, Patterns and Styles

A software architect makes a wide range of design decisions while designing the software architecture of a system. Depending on the application domain, many of these design decisions are made to provide required functionalities. On the other hand, there are also several design decisions made for supporting a desired quality attribute (e.g. to use redundancy for providing fault tolerance and in turn to increase system dependability). Such architectural decisions are characterized as architectural tactics [4]. Architectural tactics are viewed as basic design decisions and building blocks of patterns and styles [5].

An architectural pattern is a description of element and relation types together with a set of constraints on how they may be used [5]. The term architectural style is also used for describing the same concept. Similar to Object-Oriented design patterns [41], architectural patterns/styles provide a common design vocabulary (e.g. clients and servers, pipes and filters, etc.) [12]. They capture recurring idioms, which constrain the design of the system to support certain qualities [109]. Many styles are also equipped with semantic models, analysis tools and methods that enable style-specific analysis and property checks.

Chapter 3

Scenario-Based Software Architecture Reliability Analysis

To select and apply appropriate fault tolerance techniques, we need to analyze potential system failures and identify architectural elements that cause system failures. In this chapter, we propose the Software Architecture Reliability Analysis Approach (SARAH), which prioritizes failure scenarios based on user perception and provides an early software reliability analysis of the architecture design. It is a scenario-based software architecture analysis method that benefits from mature reliability engineering techniques, namely FMEA and FTA.

The chapter is organized as follows. In the following two sections, we introduce background information on scenario-based software architecture analysis, FMEA and FTA. In section 3.3, we present SARAH and illustrate it for analyzing reliability of the software architecture of the next release of a Digital TV. We conclude the chapter after discussing lessons learned and related work in sections 3.4 and 3.5, respectively.

3.1 Scenario-Based Software Architecture Analysis

Scenario-based software architecture analysis methods take as input a model of the architecture design and measure the impact of predefined scenarios on it to identify the potential risks and the sensitivity points of the architecture [31]. Different analysis methods use different types of scenarios (e.g. usage scenarios [19], change scenarios [7]) depending on the quality attributes that they focus on. Some methods define scenarios just as brief descriptions, while other methods define them in a more structured way with annotations [19].

The Software Architecture Analysis Method (SAAM) can be considered the first scenario-based architecture analysis method. It is a simple, practical and mature method, which has been validated in various case studies [19]. Most of the other scenario-based analysis methods are proposed as extensions to SAAM or adopt, in some way, the concepts used in this method [31]. The basic activities of SAAM are illustrated with a UML [100] activity diagram in Figure 3.1. The filled circle is the starting point and the filled circle with a border is the ending point. The rounded rectangles represent activities and arrows (i.e. flows) represent transitions between activities. The beginning of parallel activities is denoted with a black bar with one flow going into it and several leaving it. In the following, SAAM activities as depicted in Figure 3.1 are explained.

• Describe architectures: The candidate architecture designs are described, which include the systems’ computation/data components and their relationships.

• Define scenarios: Scenarios are developed for stakeholders to illustrate the kinds of activities the system must support (usage scenarios) and the anticipated changes that will be made to the system over time (change scenarios).

• Classify/Prioritize scenarios: Scenarios are prioritized according to their importance as defined by the stakeholders.

• Individually evaluate indirect scenarios: Scenarios that can be directly supported by the architecture are called direct scenarios. Scenarios that require the redesign of the architecture are called indirect scenarios. The required changes for the architecture in case of indirect scenarios are attributed to the fact that the architecture has not been appropriately designed to meet the given requirements. For each indirect scenario the required changes to the architecture are listed and the cost for performing these changes is estimated.

• Assess scenario interaction: Determining scenario interaction is a process of identifying scenarios that affect a common set of components. Scenario interaction measures the extent to which the architecture supports an appropriate separation of concerns. Semantically close scenarios should interact at the same component. Semantically distinct scenarios that interact point out a wrong decomposition.

• Create overall evaluation: Each scenario and the scenario interactions are weighted in terms of their relative importance and this weighting determines an overall ranking.

Figure 3.1: SAAM activities [19]
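The scenario interaction step can be illustrated with a small sketch that groups scenarios by the components they affect. The scenario and component names below are invented for illustration, and the function is only one possible way to tabulate interactions; SAAM itself does not prescribe an implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set


def scenario_interactions(affected: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each component to the scenarios that interact at it.

    `affected` maps a scenario name to the set of components it requires
    changes in. Only components touched by more than one scenario are kept.
    """
    by_component: Dict[str, List[str]] = defaultdict(list)
    for scenario, components in affected.items():
        for component in components:
            by_component[component].append(scenario)
    return {component: sorted(scenarios)
            for component, scenarios in by_component.items()
            if len(scenarios) > 1}


# Hypothetical indirect scenarios and the components they affect.
affected = {
    "add new file format": {"Parser", "Storage"},
    "change UI theme": {"View"},
    "switch database": {"Storage"},
}
print(scenario_interactions(affected))
```

In this invented example, two semantically distinct scenarios interact at the Storage component, which an analyst would inspect as a possible sign of insufficient separation of concerns.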

SAAM was originally developed to analyze the modifiability of an architecture [19]. Later, numerous scenario-based architecture analysis methods have been developed, each focusing on a particular quality attribute or attributes [31]. For example, SAAMCS [75] has focused on analyzing complexity of an architecture, SAAMER [77] on reusability and evolution, ALPSM [8] on maintainability, ALMA [7] on modifiability, ESAAMI [84] on reuse from existing component libraries, and ASAAM [116] on identifying aspects for increasing maintainability.

Hereby, it is implicitly assumed that scenarios correspond to the particular quality attributes that need to be analyzed. Some methods such as SAEM [33] and ATAM [19] have considered the need for a specific quality model for deriving the corresponding scenarios. ATAM has also addressed the interactions among multiple quality attributes and trade-off issues emerging from these interactions.

In this chapter, we propose a reliability analysis method, which uses failure scenarios for analysis. We define a failure scenario model that is based on the established Failure Modes and Effects Analysis (FMEA) method in the reliability engineering domain, as explained in the next section.

3.2 FMEA and FTA

The Failure Modes and Effects Analysis (FMEA) method [99] is a well-known and mature reliability analysis method for eliciting and evaluating potential risks of a system systematically. The basic operations of the method are i) to question the ways that each component fails (failure modes) and ii) to consider the reaction of the system to these failures (effects). The analysis results are organized by means of a worksheet, which comprises information about each failure mode, its causes, its local and global effects (concerning other parts of the product and the environment) and the associated component. Failure Modes, Effects and Criticality Analysis (FMECA) extends FMEA with severity and probability assessments of failure occurrence. A simplified FMECA worksheet template is presented in Figure 3.2.

System: Car Engine        Date: 10-10-2000
Compiled by: J. Smith     Approved by: D. Green

ID | Item ID | Failure Mode     | Failure Causes | Failure Effects           | Severity Class
1  | CE5     | fails to operate | Motor shorted  | Motor overheats and burns | V
2  | …       | …                | …              | …                         | …

Figure 3.2: An example FMECA worksheet based on MIL-STD-1629A [30]

In FMECA, six attributes of a failure scenario are identified: failure id, related component, failure mode, failure cause, failure effect and severity. A failure mode is defined as the manner in which the element fails. A failure cause is the possible cause of a failure mode. A failure effect is the (undesirable) consequence of a failure mode. Severity is associated with the cost of repair.

FMEA and FMECA can be employed for risk assessment and for discovering potential single-point failures. Systematic analysis increases the insight in the system and the analysis results can be used for guiding the design, its evaluation and improvement. On the downside, the analysis is subjective [97]. Some component failure modes can be overlooked and some information (e.g. failure probability, severity) regarding the failure modes can be incorrectly estimated at early design phases. Since these techniques focus on individual components one at a time, combined effects and coordination failures can also be missed. In addition, the analysis requires considerable effort and time.

FMEA is usually applied together with Fault Tree Analysis (FTA) [34]. FTA is based on a graphical model, fault tree, which defines causal relationships between faults. An example fault tree can be seen in Figure 3.3.

Figure 3.3: An example fault tree

The top node (i.e. root) of the fault tree represents the system failure and the leaf nodes represent faults. Faults, which are assumed to be provided, are defined as undesirable system states or events that can lead to a system failure. The nodes of the fault tree are interconnected with logical connectors (e.g. AND, OR gates) that infer propagation and contribution of faults to the failure. Once the fault tree is constructed, it can be processed in a bottom-up manner to calculate the probability that a failure would take place. This calculation is done based on the probabilities of fault occurrences and interconnections between the faults and the failure [34]. Additionally, the tree can be processed in a top-down manner for diagnosis to determine the potential faults that may cause the failure.
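The bottom-up probability calculation and the top-down diagnosis described above can be sketched as follows. The sketch assumes that the leaf faults are statistically independent and that only AND and OR gates occur; the class names, the tree and the probabilities are illustrative assumptions and do not correspond to Figure 3.3.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Union


@dataclass
class Fault:
    """Leaf node: a basic fault with a given occurrence probability."""
    name: str
    probability: float


@dataclass
class Gate:
    """Inner node: combines its children with an AND or OR gate."""
    kind: str  # "AND" or "OR"
    children: List[Union["Gate", Fault]]


def failure_probability(node: Union[Gate, Fault]) -> float:
    """Bottom-up evaluation, assuming independent leaf faults."""
    if isinstance(node, Fault):
        return node.probability
    probabilities = [failure_probability(child) for child in node.children]
    if node.kind == "AND":   # all children must occur
        return prod(probabilities)
    if node.kind == "OR":    # at least one child occurs
        return 1.0 - prod(1.0 - p for p in probabilities)
    raise ValueError(f"unknown gate kind: {node.kind}")


def candidate_faults(node: Union[Gate, Fault]) -> List[str]:
    """Top-down traversal: list the leaf faults that may contribute to the failure."""
    if isinstance(node, Fault):
        return [node.name]
    return [name for child in node.children for name in candidate_faults(child)]


# Hypothetical tree: failure = (A AND B) OR C
tree = Gate("OR", [
    Gate("AND", [Fault("A", 0.1), Fault("B", 0.2)]),
    Fault("C", 0.05),
])
print(round(failure_probability(tree), 3))
print(candidate_faults(tree))
```

For the assumed tree, the AND gate yields 0.1 × 0.2 = 0.02 and the OR gate yields 1 − (1 − 0.02)(1 − 0.05) = 0.069.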

3.3 SARAH

We propose the Software Architecture Reliability Analysis (SARAH) approach that benefits from both reliability analysis and scenario-based software architecture analysis to provide an early reliability analysis of next product releases. SARAH defines the notion of a failure scenario model that is inspired by FMEA. Failure scenarios define potential component failures in the software system and they are used for deriving a fault tree set (FTS). Similar to a fault tree in FTA, FTS shows the causal and logical connections among the failure scenarios.

To a large extent SARAH integrates the best practices of the conventional and stable reliability analysis techniques with the scenario-based software architecture analysis approaches. Besides this, SARAH provides another distinguishing property by focusing on user perceived reliability. Conventional reliability analysis techniques prioritize failures according to how serious their consequences are with respect to safety. In SARAH, the prioritization and analysis of failure scenarios are based on user perception [28]. The structure of FTS and related analysis techniques are also adapted accordingly.

SARAH results in a failure analysis report that defines the sensitive elements of the architecture and provides information on the type of failures that might frequently happen. The reliability analysis forms the key input to identify architectural tactics [4] for adjusting the architecture and improving its dependability, which forms the last phase in SARAH. The approach is illustrated using an industrial case for analyzing user-perceived reliability of future releases of Digital TVs. In the following subsection, we present this industrial case, in which a Digital TV architecture is introduced. This example will be used throughout the remainder of the section, where the activities of SARAH are explained and illustrated. As such, while explaining the approach, we also discuss our experience and the obstacles in applying it.

3.3.1 Case Study: Digital TV

A conceptual architecture of Digital TV (DTV) is depicted in Figure 3.4, which will be referred to throughout the section. The design mainly comprises two layers. The bottom layer, namely the streaming layer, involves modules taking part in streaming of audio/video information. The upper layer consists of applications, utilities and modules that control the streaming process. In the following, we briefly explain some of the important modules that are part of the architecture. For brevity, the modules for decoding and processing audio/video signals are not explained here.

Figure 3.4: Conceptual Architecture of DTV

• Application Manager (AMR), located at the top middle of the figure, initiates and controls execution of both resident and downloaded applications in the system. It keeps track of application states and user modes, and redirects commands/information to specific applications or controllers accordingly.

• Audio Controller (AC), located at the bottom right of the figure, controls audio features like volume level, bass and treble based on commands received from AMR.

• Command Handler (CH), located at the top left of the figure, interprets externally received signals (i.e. through keypad or remote control) and sends corresponding commands to AMR.

• Communication Manager (CMR), located at the top left of the figure, employs protocols for providing communication with external devices.

• Conditional Access (CA), located at the bottom left of the figure, authorizes information that is presented to the user.

• Content Browser (CB), located at the middle of the figure, presents and provides navigation of content residing in a connected external device.

• Electronic Program Guide (EPG), located at the middle right of the figure, presents and provides navigation of the electronic program guide regarding a channel.

• Graphics Controller (GC), located at the bottom right of the figure, is responsible for generation of graphical images corresponding to user interface elements.

• Last State Manager (LSM), located at the middle of the figure, keeps track of the last state of user preferences such as volume level and selected program.

• Program Installer (PI), located at the middle of the figure, searches and registers programs together with channel information (i.e. frequency).

• Program Manager (PM), located at the middle left of the figure, tunes to a specific program based on commands received from AMR.

• Teletext (TXT), located at the middle of the figure, handles acquisition, interpretation and presentation of teletext pages.

• Video Controller (VC), located at the bottom middle of the figure, controls video features like scaling of the video frames based on commands received from AMR.

3.3.2 The Top-Level Process

For understanding and predicting quality requirements of the architectural design, Bachmann et al. [4] identify four important requirements: i) provide a specification of the quality attribute requirements, ii) enumerate the architectural decisions to achieve the quality requirements, iii) couple the architectural decisions to the quality attribute requirements, and iv) provide the means to compose the architectural decisions into a design. SARAH is in alignment with these key requirements. The focus in SARAH is the specification of the reliability quality attribute, the analysis of the architecture based on this specification and the identification of architectural tactics to adjust the architecture.

The steps of SARAH are presented as a UML activity diagram in Figure 3.5. The approach consists of three basic processes: i) Definition, ii) Analysis and iii) Adjustment. In the definition process the architecture, the failure domain model, the failure scenarios, the fault trees and the severity values for failures are defined. Based on this input, in the analysis process, an architectural level analysis and an architectural element level analysis are performed. The results are presented in the failure analysis report. The failure analysis report is used in the adjustment process to identify the architectural tactics and adapt the software architecture. In the following subsections the main steps of the method will be explained in detail using the industrial case study.

3.3.3 Software Architecture and Failure Scenario Definition

Describe the Architecture

Similar to existing software architecture analysis methods, SARAH starts with describing the software architecture. The description includes the architectural elements and their relationships. Currently, the method itself does not presume a particular architectural view [18] to be provided but in our project we have basically applied it to the module view. The architecture that we analyzed is depicted in Figure 3.4.

Develop Failure Scenarios

SARAH is a scenario-based architecture analysis method, that is, scenarios are the basic means to analyze the architecture. SARAH defines the concept of failure scenario to analyze the architecture with respect to reliability. A failure scenario defines a chain of dependability threats (i.e. fault, error and failure) for a component of the system. To specify the failure scenarios in a uniform and consistent manner, a failure scenario template, as defined in Table 3.1, is adopted.

Table 3.1: Template for Defining Failure Scenarios

FID      A numerical value to identify the failures (i.e. Failure ID)
AEID     An acronym defining the architectural element for which the failure scenario applies (i.e. Architectural Element ID)
Fault    The cause of the failure defining both the description of the cause and its features
Error    Description of the state of the element that leads to the failure together with its features
Failure  The description of the failure, its features, user/element(s) that are affected by the failure
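As a minimal sketch, the template of Table 3.1 can be represented as a simple record type. The class and field names below are illustrative assumptions rather than part of SARAH, and the example scenario is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Threat:
    """One link of the fault-error-failure chain: a description plus its features."""
    description: str
    features: dict = field(default_factory=dict)  # feature name -> value


@dataclass
class FailureScenario:
    """One failure scenario, following the fields of Table 3.1."""
    fid: int          # FID: numerical failure identifier
    aeid: str         # AEID: acronym of the architectural element
    fault: Threat     # cause of the failure
    error: Threat     # element state that leads to the failure
    failure: Threat   # the failure itself, including who or what is affected


# Hypothetical example: the acronym and descriptions are invented for illustration.
scenario = FailureScenario(
    fid=1,
    aeid="XYZ",
    fault=Threat("received corrupted data",
                 {"source": "external", "persistence": "transient"}),
    error=Threat("internal state is inconsistent",
                 {"detectability": "detectable"}),
    failure=Threat("element stops responding", {"target": "user"}),
)
```

Keeping the description and the features together for each of fault, error and failure mirrors the template, in which every row carries both.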

The template is inspired by FMEA [99]. For clarity, SARAH uses the terms fault, error and failure instead of the FMEA concepts failure cause, failure mode and failure effect, respectively. In SARAH, failure scenarios are derived in two steps. First, the relevant failure domain model is defined; then, failure scenarios are derived from this failure domain model. The following subsections describe these steps in detail.


Define Relevant Failure Domain Model

The failure scenario template can be adopted to derive scenarios in an ad hoc manner using free brainstorming sessions. However, it is not trivial to define fault classes, error types or failure modes. Hence, there is a high risk that several potential and relevant failure scenarios are missed or that irrelevant failure scenarios are included. To define the space of relevant failures, SARAH defines relevant domain models for faults, errors and failures using a systematic domain analysis process [2]. These domain models provide a first scoping of the potential scenarios. In fact, several researchers have already focused on modeling and classifying failures. Avizienis et al., for example, provide an overview of this related work and a comprehensive classification of faults, errors and failures [3]. The domain classification provided by Avizienis et al., however, is rather broad, and one can assume that for a given reliability analysis project not all the potential failures in this overall domain are relevant. Therefore, the given domain is further scoped by focusing only on the faults, errors and failures that are considered relevant for the actual project. Figure 3.6, for example, defines the derived domain model that is considered relevant for our project.

In Figure 3.6(a), a feature diagram is presented, where faults are identified according to their source, dimension and persistence. In SARAH, failure scenarios are defined per architectural element. For that reason, the source of the fault can be either i) internal to the element in consideration, ii) caused by other element(s) of the system that interact(s) with the element in consideration or iii) caused by entities external to the system. Faults can be caused by software or hardware, and can be transient or persistent. In Figure 3.6(b), the relevant features of an error are shown, which comprise the type of error together with its detectability and reversibility properties. Figure 3.6(c) defines the features of failures, which include type and target. The target of a failure defines who or what is affected by the failure. In this case, the target can be the user or other element(s) of the system.
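The feature structure described above can be sketched as a small type model. This is only an illustration: the fault features and failure targets follow the text, whereas the type names, the use of free-form strings for error and failure types, and the boolean encoding of detectability and reversibility are assumptions, since the concrete values appear only in Figure 3.6.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class FaultSource(Enum):
    INTERNAL = "internal to the element"
    OTHER_ELEMENT = "other element(s) of the system"
    EXTERNAL = "external to the system"


class FaultDimension(Enum):
    SOFTWARE = "software"
    HARDWARE = "hardware"


class FaultPersistence(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"


class FailureTarget(Enum):
    USER = "user"
    OTHER_ELEMENT = "other element(s) of the system"


@dataclass(frozen=True)
class FaultFeatures:
    source: FaultSource
    dimension: FaultDimension
    persistence: FaultPersistence


@dataclass(frozen=True)
class ErrorFeatures:
    error_type: str   # assumption: free text, concrete types are not listed in the text
    detectable: bool  # assumption: detectability encoded as yes/no
    reversible: bool  # assumption: reversibility encoded as yes/no


@dataclass(frozen=True)
class FailureFeatures:
    failure_type: str  # assumption: free text, concrete types are not listed in the text
    target: FailureTarget


# Every combination of the three fault features named in the text.
ALL_FAULT_FEATURES = [
    FaultFeatures(s, d, p)
    for s, d, p in product(FaultSource, FaultDimension, FaultPersistence)
]
```

Enumerating the combinations makes the size of the scoped fault space explicit: three sources, two dimensions and two persistence values give twelve possible fault feature sets.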

The failure domain model of Figure 3.6 has been derived after a thorough domain analysis and in cooperation with the domain experts in the project. In principle, for different project requirements one may come up with a slightly different domain model, but as we will show in the next sections this does not impact the steps in the analysis method itself. The key issue here is that failure scenarios are defined based on the FMEA model, in which their properties are represented by domain models that provide the scope for the project requirements.
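To illustrate how a domain model provides the scope for scenario properties, the following sketch checks a set of fault features against a scoped domain. The dictionary layout, the function name and the value strings are assumptions made for this example; only the three fault features and their alternatives are taken from the text.

```python
# Scoped fault domain: feature name -> allowed values (from the description of Figure 3.6(a)).
FAULT_DOMAIN = {
    "source": {"internal", "other element", "external"},
    "dimension": {"software", "hardware"},
    "persistence": {"transient", "persistent"},
}


def out_of_scope(features: dict, domain: dict) -> list:
    """Return a list of problems; an empty list means the features fit the domain."""
    problems = []
    # Check every given feature against the domain.
    for name, value in features.items():
        if name not in domain:
            problems.append(f"unknown feature: {name}")
        elif value not in domain[name]:
            problems.append(f"value {value!r} not allowed for {name}")
    # Check that no feature required by the domain is left unspecified.
    for name in domain:
        if name not in features:
            problems.append(f"missing feature: {name}")
    return problems


ok = out_of_scope(
    {"source": "internal", "dimension": "software", "persistence": "transient"},
    FAULT_DOMAIN,
)
bad = out_of_scope({"source": "cosmic", "dimension": "software"}, FAULT_DOMAIN)
```

A project with different requirements would swap in a different domain dictionary, while the check itself stays the same; this mirrors the observation that a different domain model does not change the steps of the method.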


[Figure 3.6: (a) Feature Diagram of Fault; (b) Feature Diagram of Error; (c) Feature Diagram of Failure]

