‘Optimizing a heuristic-based maintenance policy for a system that is modeled by a dynamic fault tree’


Casper van Raaij


Master’s Thesis Econometrics, Operations Research and Actuarial Studies, University of Groningen


January 20, 2020

Abstract

In this paper, a maintenance policy is obtained for a multi-component system, for which structural dependence between components is modeled by a dynamic fault tree. Fault trees have a very clear way of describing the failure behavior of multi-component systems and are therefore one of the most widely used methods in reliability analysis to represent system structures. In between inspections, the probability of system failure should not exceed a threshold value. Periodical inspections are done, during which a subset of all components is maintained in the most cost-effective way, and such that the constraint on system reliability is satisfied.

The maintenance policy is economically optimized over the inspection interval length using Monte Carlo simulation. The subset of components to maintain is selected by a heuristic that is based on the ratio between the increase in reliability as a result of maintenance and corresponding costs, referred to as the group improvement factor [19]. System reliability is expressed algebraically. For part of the components only failures can be observed, whereas for other components stochastic processes are used to model continuous deterioration (gamma process).


1 Introduction

In the past few years, maintenance of industrial systems has become an important tool for companies to minimize costs, maximize reliability and enhance other key performance indicators. Maintenance used to be seen as a necessary evil, but it has now become an intensively studied subject in scientific research [13]. In the literature, maintenance is essentially subdivided into two types: corrective maintenance, which is performed after a piece of equipment has failed, and preventive maintenance, which is done before failure. In general, it is better to include preventive maintenance actions in a maintenance policy, since preventive maintenance is often cheaper than corrective maintenance and increases the availability of the system. However, firms face a trade-off between a high preventive maintenance frequency, leading to high maintenance costs, and a low frequency, leading to a higher probability of a costly system failure. Finding the optimal policy in this sense is one of the main aims of research on maintenance optimization.

A lot has been published on maintenance scheduling for single-component systems, but in practice large systems consist of multiple components [19]. Optimizing the maintenance policy for a multi-component system can be a complex task [28]. There are more variables to optimize over, and there always exist various kinds of dependencies between components, which have to be taken into account. Structural dependence is often modelled by means of a fault tree, because it describes the failure behaviour of a multi-component system very clearly. The fault tree is one of the most widely used methods to represent system structure in reliability analysis and it is used in fields such as aerospace, nuclear power plants and medical devices [23]. As an illustration, consider the crash of an airplane, which can be seen as a system failure. It can be caused by multiple events: something happening to the pilot, failure of both engines, or the occurrence of a software error. In turn, each of these events can be caused by another underlying event, and so on. The way these underlying events propagate through the system, and whether or not they result in system failure, can be visualized by a fault tree.

When modelled by a fault tree, the components in the system are often assumed to have a constant failure rate [23]. In this way the probability of system failure can be estimated. However, it would be more realistic to assume that the failure rates of components depend on time and that preventive maintenance enhances system reliability. For these kinds of models a cost-efficient policy can be found, consisting of an optimal inspection interval and, at every inspection, a set of actions to perform (preventive or corrective maintenance) on the components [9,18,28]. This can also be done when the physical condition of a component is modeled by a deterioration process [19].


fail, this becomes less trivial [23]. The structural dependence of these kinds of systems is often modeled by a dynamic fault tree [23], for which optimizing the maintenance policy will be the main focus of this paper. The maintenance policy consists of periodic inspections, during which maintenance will be done on a subset of components in the most cost-effective way. Maintenance is done such that the probability of system failure over the next inspection interval will not exceed a threshold. Thus, a constraint is formulated on the reliability of the system over the subsequent inspection interval. For some components the condition can be measured; for other components only failures can be observed. The aim is to optimize the inspection interval. Since we require the system to attain a certain level of reliability per unit time, larger inspection intervals do not cause more system failures. However, since maintenance can only be done during inspections, downtime of the system increases with the length of the inspection interval, leading to higher maintenance costs. On the other hand, since inspecting the system very often can be costly, a very short inspection interval cannot be optimal either. Thus, a trade-off needs to be found.

During inspections, the subset of components to maintain is selected by a heuristic that is based on the ratio between the increase in reliability as a result of maintenance and corresponding costs. This ratio is referred to as the group improvement factor [19]. Furthermore, another way of selecting components is used, namely adding components to a group one by one. The latter heuristic is also based on the group improvement factor, but it does not consider all possible groups and therefore will be more scalable. The long-run cost rates resulting from both heuristics are evaluated using Monte Carlo simulation. For the components whose condition can be measured, stochastic processes are used to model continuous deterioration (gamma process). The lifetimes of the other components are modeled using a lifetime distribution. System reliability is expressed algebraically.

The remainder of this paper is organized as follows. In Section 2 the technicalities of fault trees will be explained. In Section 3 we discuss related literature. In Section 4 the problem to be solved will be presented. In Section 5 we present the methodology used to solve the problem (combining Monte Carlo simulation with a heuristic). In Section 6 we will describe the way reliability of the system is determined, which is then used in Section 7 where the heuristic is explained formally. The results are presented and interpreted in Section 8. In Section 9 concluding remarks and suggestions for further research are made.

2 Fault trees


between components that can be distinguished. Next, in Section 2.2, we discuss static fault trees and in Section 2.3 we discuss the equivalent block reliability diagram. Finally, in Section 2.4, dynamic fault trees are introduced, which add a specific time-dependence to static fault trees.

2.1 Types of dependencies

Multiple types of dependencies between components of a system can be distinguished. Economic dependence means that maintaining multiple components at once is cheaper than maintaining all components individually, i.e. a setup cost is incurred when performing maintenance. Stochastic dependence between components means that the failure rates of components are dependent. If there exists structural dependence from a performance point of view between components, the functioning of the system can depend on the functioning of the components in a nontrivial way. Some components are then required to be in operation for the system to operate, whereas others are not. Structural dependence can also exist from a technical point of view. This implies that when dismantling a system for maintenance on a component, it pays off to combine this with maintenance on other components [26].

2.2 Static fault trees


Figure 1: Symbols representing logic gates, all with two inputs. The AND gate is shown on the left and the OR gate on the right.

Figure 2: Symbols representing events in a fault tree. From left to right: basic event, intermediate event and top event.

2.3 Block reliability diagrams

The structure of systems represented by a static fault tree is equivalent to one represented by a block reliability diagram [23]. Hence, the two can easily be transformed into each other. A block reliability diagram thus serves the same purpose as a static fault tree, namely visualizing the structural dependence in a system. An example of a static fault tree and its equivalent block reliability diagram can be seen in Figure 3. In a block reliability diagram, if components are in a series configuration, failure of one of the components induces a system failure. Hence components in series are equivalent to basic events being inputs of an OR gate. The same goes for parallel configurations and AND gates.
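The correspondence between series/parallel configurations and OR/AND gates can be checked numerically. The following is a minimal Python sketch, assuming independent component failures; the function names are illustrative and do not come from the thesis.

```python
from math import prod

def or_gate_failure(p_fail):
    """Series structure / OR gate: fails if at least one input fails."""
    return 1.0 - prod(1.0 - p for p in p_fail)

def and_gate_failure(p_fail):
    """Parallel structure / AND gate: fails only if all inputs fail."""
    return prod(p_fail)

# Two components with failure probabilities 0.1 and 0.2 over some period.
p = [0.1, 0.2]
series_failure = or_gate_failure(p)      # 1 - 0.9 * 0.8 = 0.28
parallel_failure = and_gate_failure(p)   # 0.1 * 0.2 = 0.02
```

The series structure is far more likely to fail than the parallel one, which is why redundancy is modeled with AND gates.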

Figure 3: A static fault tree and its equivalent block reliability diagram (components 1 to 6).


Figure 4: Symbols representing dynamic gates, all with two inputs. The SPARE gate is shown on the left and the PAND gate on the right.

2.4 Dynamic fault trees

When system failure or the occurrence of an intermediate event depends on the order in which components fail, the static fault tree must be extended to a so-called dynamic fault tree [10]. This temporal extension can be modelled by the introduction of new logic gates, called dynamic gates. Examples are the Priority AND (PAND) and SPARE gates, see Figure 4. The intermediate event corresponding to the output of a PAND gate occurs if its inputs fail in a certain order, by convention from left to right. The SPARE gate has one active input and one or more spare inputs. If the active input fails, the spare takes over. When this component also fails and there is no other spare left, the gate fails. A component that is currently not active is called dormant. The failure rate of a dormant spare can be the same as that of the active one, negligible, or anywhere in between. This behaviour is quantified by the dormancy factor γ, a coefficient by which the failure rate of the dormant spare is multiplied. A hot spare has γ = 1 and hence the same failure rate as the active input, a cold spare has γ = 0 and hence never fails while dormant, and a warm spare has 0 < γ < 1. If the lifetime of a component is modeled by a deterioration process, a dormant component accumulates a fraction γ of the deterioration it would accumulate if it were active.
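The gate semantics described above can be sketched in Python as functions that map input failure times to a gate failure time. This is only an illustration under stated assumptions: the function names are hypothetical, the SPARE gate has a single spare, and dormancy is modeled by letting the spare age at rate γ while dormant (an age-scaling assumption that mirrors the deterioration-fraction description, not necessarily the thesis implementation).

```python
import math

def pand_failure_time(t_left, t_right):
    """PAND gate: fails when the right input fails, provided the left input failed first.

    Ties have probability zero for continuous lifetimes; they are counted as failures here.
    """
    return t_right if t_left <= t_right else math.inf

def spare_failure_time(t_active, spare_life, gamma):
    """SPARE gate with one spare.

    t_active:   failure time of the active input.
    spare_life: lifetime the spare would have if it were active from time 0.
    gamma:      dormancy factor in [0, 1]; the dormant spare ages at rate gamma
                (assumption, see the text above this block).
    """
    age_at_switch = gamma * t_active   # age accumulated while dormant
    if age_at_switch >= spare_life:
        # The spare already failed while dormant: the gate fails with the active input.
        return t_active
    # Otherwise the spare takes over and uses up its remaining life at full rate.
    return t_active + (spare_life - age_at_switch)
```

With γ = 0 the gate lifetime is the sum of both lifetimes (cold spare), and with γ = 1 it is their maximum (hot spare, equivalent to an AND gate).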

3 Literature

In this section we discuss related literature. First, in Section 3.1, an overview is given on the existing techniques for analysing fault trees, also known as fault tree analysis. Then, in Section 3.2, we describe fault maintenance trees which incorporate maintenance into fault tree analysis. In Section 3.3 we summarize articles on the optimization of maintenance in block reliability diagrams. Concluding the literature review, in Section 3.4, we explain what the contribution of this paper will be.

3.1 Fault tree analysis


to the structure of the tree, such as uncovering failure modes or minimal cut sets. The use of the term cut set can be misleading in this context. In graph theory a cut set is defined as the set of edges through which a cut is made, separating the nodes of the graph into two disjoint sets. However, in the context of fault trees, a cut set is a group of components which, if all components in it fail, induces system failure. In this study we use the latter definition. A cut set is minimal if no element can be removed from it without it ceasing to be a cut set [23]. The cut sets of a system are highly relevant since they indicate where the system is most vulnerable.

Quantitative techniques compute important numerical values of the tree, such as reliability and availability over time. If gates in a static fault tree do not share basic events, the reliability is easily determined by making use of the minimal cut sets of the system [19]. When static fault trees become more complex this becomes harder. A widely used approach is to convert the static fault tree to a binary decision diagram and use this to approximate the reliability [11].

For dynamic fault trees, calculating the reliability of the top event can be cumbersome. The standard method is to convert the problem into a continuous time Markov chain. Every configuration the system can possibly be in, i.e. which components have failed and, if important, in what order, is represented by a separate state and the transition probabilities between them are defined. A downside of this approach is that the number of states increases exponentially with the number of components. There exist various approaches to reduce the number of states: conversion of dynamic fault trees to Petri nets [6]; modularization, which is a hybrid approach using binary decision diagrams to solve static sub-trees and Markov chains to solve dynamic sub-trees [12]; and compositional analysis [4]. Furthermore, Markov chains can only be used if all components have an exponential lifetime distribution [22].

Portinale et al. [21] present a software tool which reliability engineers, who do not have to be familiar with underlying dynamic Bayesian networks, can use to find reliability measures of a system. A downside of dynamic Bayesian networks is the discretization in time, since this leads to approximate results. The smaller the discretization step, the more accurate the results, but the longer the computation time [23].


modules, which are sub-trees that might contain dynamic gates, they subsequently derive exact analytic expressions for the failure probability. Long et al. [15] derive the analytical expressions of failure probabilities of a PAND gate, but only for the special case where the parameters of the lifetime distributions of the basic events are the same.

Monte Carlo simulation is also used for reliability analysis. Durga Rao et al. [22] sample the time to failure of the components and their repair times from probability distributions. For a given mission time the behaviour of the system is then simulated. The maintenance policy is such that when a component fails a repair time is sampled and that is how long the component will be in the failed state. Characteristics of the system can be determined, of which the accuracy depends on the number of iterations. Computation time obviously increases with the number of iterations. Boudali et al. [5] present the simulation tool DFTSim, which can handle any lifetime distribution for the components and in which computation time is linear in the number of components.

3.2 Fault maintenance trees

Although fault tree analysis provides insight into the structure of multi-component systems and can be used to calculate various important quantities, in itself it is not capable of optimizing a maintenance policy [25]. It rather provides us with tools to do so. In this section we review papers that combine fault trees with maintenance interventions.


in these articles, the results are not the optimization of these policies but rather a comparison of a proposed policy with an existing one. For instance, an existing policy is compared with one having twice as many inspections in the same time span.

3.3 Optimization in block reliability diagrams

The following studies are devoted to maintenance optimization based on block reliability diagrams. Doostparast et al. [9] use simulated annealing, a heuristic algorithm, to improve maintenance schedules. The heuristic, originating from the field of materials science, seems to find good solutions within reasonable computation times. They choose to use a heuristic since the problem of exactly optimizing multi-component systems often suffers from combinatorial explosion, also known as the curse of dimensionality. That is, the solution space grows exponentially as the number of components increases. Upon inspection of the system, preventive or corrective maintenance is performed based on the predicted reliability of the system. The objective is to minimize total costs and the main constraint is the reliability of the system over a pre-specified time period.

Moghaddam and Usher [18] compare an exact algorithm with heuristic algorithms. They also perform a sensitivity analysis on parameters characterizing the lifetime distributions of the components and on the relative size of different types of costs. They consider two criteria: minimizing costs subject to a minimal reliability constraint and maximizing reliability subject to a maximal cost constraint. For the first model the objective value obtained by using the exact algorithm is ‘better’ than the one found using the heuristic algorithms for various types and sizes of systems. In the first model ‘better’ means lower since the objective is to minimize costs and in the second model it means higher since reliability is maximized. Although this result is expected, it is interesting that in the first model the difference in the objective value is 2 to 6% and remains constant as the number of components increases. For the second model a constant gap is not observed. The computation time of the exact algorithm increases exponentially with the number of components, due to the curse of dimensionality. This is not observed for the heuristic algorithms. For large multi-component systems the heuristic algorithms find nearly optimal solutions when minimizing costs subject to a reliability constraint. The interval between maintenance actions is pre-set and not optimized over. This is already mentioned as a topic for further research by Doostparast et al. [9].


could not be applied to all systems in the paper and a combination of analytical calculation and simulation referred to as hybrid simulation was done.

Nguyen et al. [19] use a heuristic-based approach where decisions on preventive maintenance are made based on a so-called group improvement factor (GIF). They model the lifetime of components by using a deterioration process. The reliability over the next inspection interval is calculated. If this reliability drops below a certain threshold, it is decided to perform preventive maintenance. Next, for every possible group of components on which preventive maintenance can be performed, the GIF is calculated. It contains information on both the costs of maintaining the corresponding group and the reliability over the next time period assuming the components in the group are maintained. The group with the optimal GIF is then maintained. This simulation is done for a large number of different inspection interval lengths and minimum reliability thresholds until convergence of the optimal cost rate. A sensitivity analysis is done on cost parameters and the heuristic is compared to an existing effective maintenance policy. It is observed that the proposed maintenance policy is more flexible and efficient for large complex systems, but for a simple series configuration the performance is slightly lower.
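The group selection step can be illustrated with a small exhaustive search. The Python sketch below treats the GIF simply as reliability gain divided by maintenance cost, as in the verbal description above; the exact definition in [19] may differ. The series-system reliability, the inclusion of a setup cost, and all names are assumptions made for illustration only.

```python
from itertools import combinations
from math import prod

def select_group(rel_now, rel_new, cost, setup_cost, r_required):
    """Pick the group with the largest reliability gain per unit cost.

    rel_now[i]: reliability of component i over the next interval if left alone.
    rel_new[i]: the same reliability if component i is maintained now.
    cost[i]:    preventive maintenance cost of component i.
    Toy assumption: a series system, so system reliability is a product.
    """
    n = len(rel_now)
    base = prod(rel_now)
    best_group, best_ratio = None, -1.0
    for size in range(1, n + 1):
        for group in combinations(range(n), size):
            after = prod(rel_new[i] if i in group else rel_now[i] for i in range(n))
            if after < r_required:
                continue  # this group does not restore the required reliability
            ratio = (after - base) / (setup_cost + sum(cost[i] for i in group))
            if ratio > best_ratio:
                best_group, best_ratio = group, ratio
    return best_group, best_ratio

group, ratio = select_group(
    rel_now=[0.90, 0.95, 0.99], rel_new=[0.99, 0.99, 0.999],
    cost=[10.0, 5.0, 20.0], setup_cost=2.0, r_required=0.90,
)
```

The loop visits all 2^N - 1 non-empty groups, which is why exhaustive evaluation becomes impractical as the number of components grows.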

3.4 Contribution

Although in fault tree analysis it is often assumed that the failure rates of components are constant, it would be more realistic to drop this assumption and add maintenance to the model. Using maintenance we can reduce the increasing failure rates of components and thereby improve the reliability of the system. Exactly optimizing the maintenance schedule for a system with complex structural dependencies sometimes might be possible, but computation time increases very quickly in the number of components. Due to practical considerations our approach will not be to aim for optimal solutions, but rather to develop a heuristic. Simulation will then be used to find the optimal inspection interval.

For a static fault tree, or equivalently a block reliability diagram, several heuristics exist for deciding what components to maintain during inspections. Most of these techniques are reliability based, in the sense that decisions are taken based on system reliability. This type of maintenance is called reliability centered maintenance and is said to be the current trend in asset management [24]. Extending aforementioned techniques such that they can be used to analyse dynamic fault trees is not trivial and has not been done as such. That is, we will develop a heuristic that, based on reliability and costs, finds a maintenance schedule for systems that are modeled using dynamic gates, such as SPARE and PAND gates.


a supporting heuristic. We will allow for both modeling approaches, which has not been treated yet. We will also test effectiveness of the heuristic by making changes to decision criteria.

4 Problem description

In this section the problem to be solved will be defined formally. First, in Section 4.1, we formally describe the type of systems that we will analyze. Thereafter, in Section 4.2, we define the costs associated with maintenance.

4.1 The system

We consider a system consisting of N structurally dependent components that deteriorate and ultimately fail. Inspections are done after every inspection interval of length T, so at times T_k = kT with k ∈ ℕ. We thus consider a periodic inspection interval T and we assume that all components will be inspected at each inspection. The condition observed during inspection is the actual condition of the component, i.e. inspections are perfect. We assume that an inspection is needed to detect failure of a component, i.e. failures are hidden. The same goes for system failure. Upon system failure components continue to deteriorate and age. Corrective maintenance is always done on failed components, returning them to the as-good-as-new state. We assume that maintenance is perfect and that it takes a negligible amount of time. There exists no stochastic dependence between the components.

At each inspection it must be determined which components to maintain. The reason for doing preventive maintenance is to prevent costly system failures. One way to do this is by formulating a restriction on system reliability. That is, we impose a minimum on the system reliability over the next inspection interval T. Let R_min be the minimal reliability one wants to preserve during one time unit. The minimal reliability over a time T is then (R_min)^T, i.e. the required reliability over a period depends exponentially on the length of that period. To see whether system reliability attains this threshold, we need to be able to evaluate it. We choose to express system reliability exactly, since the techniques to do so are available. This does limit the complexity of the systems that can be considered. Restrictions on the complexity of the dynamic fault trees, which we use to model the structural dependence between components, are:

i) Basic events must be unique, i.e. no repeated events.

ii) The inputs of dynamic gates must be basic events.


this becomes more cumbersome. Amari et al. [2] show that it is possible to derive analytical expressions for the lifetime densities and distributions of static sub-trees (being inputs of dynamic gates), but it is not treated here.

For some of the components condition information is available, and for some of them not. We thus consider the following two types of components:

• There are N_1 components of type 1, of which we can observe the condition. At every inspection, the additional amount of deterioration can be observed. When the cumulative deterioration exceeds a threshold, the component fails.

• There are N_2 components of type 2, each with a stochastic lifetime distribution. These components have a failure rate that depends on their effective age.

We choose to use a gamma process as deterioration process for type 1 components, since it is the most appropriate for monotonic and gradual deterioration [20]. We assume the deterioration to be linear in time and therefore consider a stationary gamma process, which is more widely used than the non-stationary gamma process. In a gamma process the additional deterioration during a certain period follows a gamma distribution, which has probability density function (pdf)

$$ f_{\alpha(t),\beta}(x) = \frac{1}{\Gamma(\alpha(t))\,\beta^{\alpha(t)}}\, x^{\alpha(t)-1} \exp\!\left(-\frac{x}{\beta}\right), \tag{1} $$

where x > 0, shape function α(t) = at in which a > 0 is the shape parameter and t is time, β > 0 is the scale parameter and Γ(·) denotes the gamma function. Note that the length of the inspection interval influences the pdf via the shape function. Furthermore, notice that (1) is the pdf of the additional deterioration and hence does not correspond to the lifetime density of a type 1 component. Van Noortwijk [20] gives the analytical expression for the lifetime distribution of a component, subject to deterioration modelled by a gamma process (type 1 component), as

$$ F_1(t) = \frac{\Gamma(\alpha(t),\, z\beta^{-1})}{\Gamma(\alpha(t))}, \tag{2} $$

where z is the failure threshold for the deterioration and Γ(a, x) denotes the upper incomplete gamma function defined by

$$ \Gamma(a, x) = \int_{x}^{\infty} \omega^{a-1} e^{-\omega}\, d\omega, \tag{3} $$


the lifetime density:

$$ \frac{\partial}{\partial t} F_1(t) = \frac{\partial}{\partial t}\left(\frac{\Gamma(\alpha(t),\, z\beta^{-1})}{\Gamma(\alpha(t))}\right) = \left.\frac{\partial}{\partial \bar{\alpha}}\left(\frac{\Gamma(\bar{\alpha},\, z\beta^{-1})}{\Gamma(\bar{\alpha})}\right)\right|_{\bar{\alpha}=\alpha(t)} \cdot \alpha'(t). $$

Substituting (3) and using the quotient rule we get that, under the assumption that the shape function α(t) is differentiable, which it is in our case, the lifetime density for type 1 components is

$$ f_1(t) = \frac{\alpha'(t)}{\Gamma(\alpha(t))} \int_{z\beta^{-1}}^{\infty} \bigl(\log(\omega) - \psi(\alpha(t))\bigr)\, \omega^{\alpha(t)-1} e^{-\omega}\, d\omega, \tag{4} $$

where ψ(x) = Γ'(x)/Γ(x), with x > 0, denotes the digamma function. Note that (4) and (2) do not depend on deterioration level x directly: the cumulative deterioration x is subtracted from the threshold z, on which they do depend. Hence, the lifetime density and distribution of a deteriorated component are calculated as the lifetime density and distribution of a new component with a lower threshold value. Every type 1 component has characterizing parameters α_i and β_i and failure threshold z_i, with i ∈ E_1, where E_1 is the set containing the identifying numbers of type 1 components. The number of elements in E_1 is given by |E_1| = N_1.
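As a numerical illustration of the lifetime distribution (2), the following Python sketch evaluates the regularized upper incomplete gamma function with a standard power series, using only the standard library. The function names and the optional argument x0 (deterioration already accumulated) are illustrative assumptions. For large arguments a library routine such as scipy.special.gammaincc would be a more robust choice.

```python
import math

def upper_regularized_gamma(a, x, tol=1e-12, max_terms=10_000):
    """Q(a, x) = Gamma(a, x) / Gamma(a), via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 1.0
    # P(a, x) = x^a e^{-x} / Gamma(a + 1) * sum_{n>=0} x^n / ((a+1)(a+2)...(a+n))
    term = 1.0
    total = 1.0
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = a * math.log(x) - x - math.lgamma(a + 1.0)
    p = math.exp(log_prefactor) * total
    return min(1.0, max(0.0, 1.0 - p))

def lifetime_cdf_type1(t, a, beta, z, x0=0.0):
    """Equation (2): probability that a gamma-process component has failed by time t.

    a, beta: shape parameter and scale parameter; z: failure threshold;
    x0: deterioration already accumulated, which lowers the effective threshold.
    """
    if t <= 0.0:
        return 0.0
    return upper_regularized_gamma(a * t, (z - x0) / beta)
```

For α(t) = 1 the increment is exponentially distributed, so the result can be checked against exp(−z/β) by hand.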

As the Weibull distribution is most widely used and provides a good description for many types of deterioration processes [14], we use it to model the lifetimes of type 2 components. The pdf of the Weibull distribution, and hence the lifetime density of type 2 components, is

$$ f_2(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} \exp\!\left(-(t/\lambda)^{k}\right) \tag{5} $$

and the cumulative distribution function (cdf), and hence the lifetime distribution of type 2 components, is

$$ F_2(t) = 1 - \exp\!\left(-(t/\lambda)^{k}\right). \tag{6} $$

Type 2 components have characterizing parameters λ_i and k_i, with i ∈ E_2, where E_2 is the set containing the identifying numbers of type 2 components. The number of elements in E_2 is given by |E_2| = N_2. For completeness, note that N = N_1 + N_2.
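The Weibull quantities for type 2 components are straightforward to compute. The Python sketch below evaluates (6), the conditional reliability of a component that has already reached a given effective age (a standard consequence of (6), not a formula quoted from the thesis), and draws a lifetime by inverse-transform sampling. Function names are illustrative.

```python
import math
import random

def weibull_cdf(t, lam, k):
    """Equation (6): probability that a type 2 component fails before age t."""
    return 1.0 - math.exp(-((t / lam) ** k))

def conditional_reliability(t, age, lam, k):
    """Probability of surviving a further t time units, given survival up to `age`."""
    return math.exp(-(((age + t) / lam) ** k) + (age / lam) ** k)

def sample_lifetime(lam, k, rng=random):
    """Inverse-transform sample: solve F2(t) = u for t, with u uniform on [0, 1)."""
    u = rng.random()
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)
```

For k > 1 the conditional reliability decreases with age, which is what makes preventive maintenance worthwhile. The standard library also offers random.weibullvariate(scale, shape) as a direct sampler.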

As said, structural dependence between the components is modeled by means of a fault tree. Formally, a fault tree can be defined as follows [8]:


Figure 5: Example of a dynamic fault tree. Squares correspond to failures of type 1 components and circles correspond to failures of type 2 components.

the basic events and one top event, G is the set of gates, A ⊆ (E × G) ∪ (G × E) is the set of arcs, and the function T : G → Γ describes the type of each gate, where Γ is the set containing all types of gates. A fault tree can also be seen as a directed acyclic graph whose vertices are the events and gates, G ∪ E, and whose directed edges are the arcs in A.

Thus, E contains the basic events, used to model the failures of type 1 and type 2 components, and the top event, used to model system failure. The set G contains all gates we use to model the structure of our system and A is a set of arcs describing which event is connected to which gate. As for dynamic gates, we will consider the SPARE and the PAND gate. In our fault trees we thus have that Γ = {AND, OR, SPARE, PAND}. Every element in E is used only once, i.e. there are no repeated events (i). Furthermore, for reasons of computational tractability, we assume that the inputs of dynamic gates are identical in the sense of their type (1 or 2). This is a weak assumption with regard to SPARE gates, since it is logical that spare parts are identical to the ones they are replacing. For PAND gates, however, it might restrict modeling options. An example of a dynamic fault tree satisfying the restrictions is given in Figure 5. In this example E_1 = {4, 5, 6, 7, 9, 10} and E_2 = {1, 2, 3, 8}.

4.2 Costs

Every time the system is inspected, a cost of c_i has to be paid. To account for the positive economic dependence between components, a setup cost c_s is incurred for performing maintenance. Since


to cluster maintenance operations. In practice, maintenance is often clustered for economic reasons. The maintenance costs can differ per component. The cost of corrective maintenance on component i is c_{c,i} and the cost of preventive maintenance on component i is c_{p,i}.

System failure has negative consequences for a firm. In general, the longer a system is unavailable, the larger these consequences. Hence a penalty cost per unit of time c_d is formulated. Since failures are hidden and we only observe the system upon inspection, we do not know exactly when the system has failed when we observe a system failure. System failure always occurs upon the failure of one critical component, so once we have simulated the moment of failure of this component we have simulated the moment of system failure. However, for reasons of computational tractability, the following approximation is made: upon observation of a system failure, we know that the system has failed somewhere between the previous and current inspection. Since we do not know exactly when, we assume it to be exactly in the middle of the interval. System downtime is then half of one inspection interval length T. We multiply it by the downtime cost per unit of time:

$$ C_{\text{down}} = \begin{cases} \dfrac{T}{2}\, c_d & \text{if system failure is observed,} \\ 0 & \text{if no system failure is observed.} \end{cases} \tag{7} $$

Note that this is an overestimation of the downtime costs because the system will be more likely to fail in the second half of the inspection interval. We will require the system to preserve a certain level of reliability per time unit, and hence the probability of system failure will not increase with the length of the inspection interval T. However, if the system does fail, the downtime costs will increase with T, as can be seen from (7). Of course there is an upper limit on T beyond which we are not able to preserve the level of reliability and the solution will not be feasible. A small inspection interval will lead to high inspection costs and therefore a trade-off needs to be found, resulting in an optimal value for T.

After every inspection interval all different types of costs are summed. We are interested in the long run cost rate, so we sum over all inspections during a sufficiently long time horizon and divide by its length. The problem to be solved is to minimize the long run cost rate over T , under a minimal reliability constraint.
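The cost accounting of this section can be summarized in a short Python sketch. It assumes that the setup cost is charged once per inspection at which at least one component is maintained; the function and argument names are illustrative and do not come from the thesis.

```python
def inspection_cost(T, c_insp, c_setup, c_down, corrective, preventive, system_failed):
    """Total cost booked at one inspection.

    corrective / preventive: lists of the per-component costs c_{c,i} and c_{p,i}
    of the components maintained at this inspection.
    """
    cost = c_insp
    if corrective or preventive:
        cost += c_setup                  # assumption: one setup cost per inspection with maintenance
    cost += sum(corrective) + sum(preventive)
    if system_failed:
        cost += 0.5 * T * c_down         # equation (7): downtime approximated by T / 2
    return cost

def long_run_cost_rate(costs_per_inspection, T):
    """Sum of all inspection costs divided by the length of the simulated horizon."""
    return sum(costs_per_inspection) / (len(costs_per_inspection) * T)

costs = [
    inspection_cost(2.0, 1.0, 3.0, 10.0, [20.0], [5.0], True),   # 1 + 3 + 25 + 10 = 39
    inspection_cost(2.0, 1.0, 3.0, 10.0, [], [], False),         # inspection only = 1
]
rate = long_run_cost_rate(costs, 2.0)                            # 40 / 4 = 10
```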

5 Methodology


concerning the simulation of the time to failure of dynamic gates are elaborated upon.

5.1 Monte Carlo simulation

Exactly optimizing the maintenance policy of a multi-component system has proved to be very difficult. Therefore, we will develop a heuristic to obtain a sensible policy and we will find the optimal inspection interval length using Monte Carlo simulation. At the beginning of a simulation, for type 2 components the remaining lifetimes are drawn from their Weibull distributions. Once their effective ages reach their respective random times to failure, the components fail. Upon failure or maintenance on a type 2 component its effective age is reset to 0 and a new remaining lifetime is drawn from its Weibull distribution. Type 1 components deteriorate according to a gamma process and hence we sample the additional deterioration during an inspection interval from its gamma distribution. If the deterioration level at the next inspection exceeds the failure threshold, we know that the component has failed during the previous inspection interval. Upon failure or maintenance on a type 1 component its deterioration level is reset to 0.
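The sampling steps described above can be sketched in Python with the standard library samplers random.gammavariate(shape, scale) and random.weibullvariate(scale, shape). The function names and the parameter values in the usage loop are illustrative assumptions, not the thesis implementation.

```python
import random

def step_type1(level, T, a, beta, z, rng):
    """Advance a type 1 component by one inspection interval of length T.

    The increment is gamma distributed with shape a * T and scale beta, as in (1).
    Returns the new deterioration level and whether the threshold z was exceeded.
    """
    level += rng.gammavariate(a * T, beta)
    return level, level > z

def step_type2(age, time_to_failure, T):
    """Advance a type 2 component: it has failed once its age reaches the drawn lifetime."""
    age += T
    return age, age >= time_to_failure

def new_type2_lifetime(lam, k, rng):
    """Draw a fresh Weibull lifetime (scale lam, shape k) after failure or maintenance."""
    return rng.weibullvariate(lam, k)

# Usage: one type 1 component observed at three inspections with T = 1.
rng = random.Random(0)
level, failed = 0.0, False
for _ in range(3):
    level, failed = step_type1(level, 1.0, a=2.0, beta=0.5, z=10.0, rng=rng)
    if failed:
        level = 0.0   # corrective maintenance: back to as good as new
```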

During an inspection, the heuristic (to be defined formally in Section 7) will select what components to maintain preventively. This decision is based on the improvement in reliability it causes and the costs associated with it. In order to calculate the improvement in reliability we need a way to determine system reliability. The procedure of evaluating system reliability algebraically is presented in Section 6.

For a sufficiently long time horizon the total costs will accumulate until the simulation stops. We then divide the total costs by the realized time to obtain the costs per unit time, or the long run cost rate. We will optimize over the inspection interval T, so the simulation is run for a range of values of T. To ensure a smooth graph of the cost rate versus T, the time horizon should be large enough. Of course there is a trade-off between accuracy and computation time involved here, as is the case in every simulation.
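To make the cost-rate computation concrete, the sketch below estimates the long run cost rate for a deliberately simplified toy system. It assumes a single type 2 component with a Weibull lifetime, no preventive maintenance, no downtime costs and no heuristic, so it only illustrates the accumulate-and-divide step and the sweep over T. The function name and cost arguments are hypothetical.

```python
import random


def cost_rate(T, horizon, scale, shape, c_insp, c_corr, seed=0):
    """Estimate the long run cost rate of a toy single-component system.

    Every T time units an inspection costs c_insp; a component found
    failed is replaced at cost c_corr (assumed toy cost structure).
    """
    if horizon < T:
        raise ValueError("horizon must cover at least one inspection")
    rng = random.Random(seed)
    time_to_failure = rng.weibullvariate(scale, shape)
    age, now, total = 0.0, 0.0, 0.0
    while now + T <= horizon:
        now += T
        age += T
        total += c_insp
        if age >= time_to_failure:
            total += c_corr  # corrective replacement at inspection
            age = 0.0
            time_to_failure = rng.weibullvariate(scale, shape)
    # Divide accumulated costs by the realized simulation time.
    return total / now


# Sweep over a range of inspection interval lengths.
rates = {T: cost_rate(float(T), 20000.0, 53.0, 2.2, 25.0, 300.0) for T in range(1, 11)}
```

In this toy setting the cost rate simply decreases with T because late detection carries no penalty; the trade-off discussed above only appears once downtime costs and the reliability constraint are included.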

5.2 Simulating time to failure


because we initially only simulate the deterioration level at the moment of the next inspection. Only if the active component turns out to have failed do we simulate the time to failure of this unit. For most gates in the model it is clear upon inspection whether or not they have failed. An OR gate has failed when one of its inputs has failed, an AND gate when all inputs have failed, and a SPARE gate when both the initially active component and the spare component have failed. However, when we observe that both inputs of a PAND gate have failed, we do not necessarily know whether the gate itself has failed. The gate only fails if its inputs fail in order from left to right. Again, for type 2 components we know the exact time to failure of both inputs, so the order of failure is known. For type 1 components we simulate the time to failure. If the simulated time to failure of the ‘left’ component is smaller than that of the ‘right’ component, we assume the PAND gate has failed.
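The gate rules above translate directly into comparisons of failure times. The following sketch assumes that each input is represented by its (possibly simulated) failure time, with `math.inf` for an input that has not failed; the function names are hypothetical.

```python
import math


def or_failed(times, now):
    """An OR gate has failed once any input has failed."""
    return any(t <= now for t in times)


def and_failed(times, now):
    """An AND gate has failed once all inputs have failed."""
    return all(t <= now for t in times)


def spare_failed(t_active, t_spare, now):
    """A SPARE gate has failed once both active and spare unit have failed."""
    return t_active <= now and t_spare <= now


def pand_failed(t_left, t_right, now):
    """A PAND gate fails only if both inputs failed, left before right."""
    return t_left <= now and t_right <= now and t_left < t_right


# Example: both inputs failed before the inspection at time 5,
# but in the wrong order, so the PAND gate has not failed.
print(pand_failed(4.0, 2.0, 5.0))
```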

6 Reliability analysis

Determining the reliability of fault trees is one of the most important aspects of fault tree analysis, since it quantifies the probability of failure of a system over time [23]. In particular, the heuristic bases its decisions on it. First, in Section 6.1, we will demonstrate how to compute the reliability of single components. Then, in Sections 6.2 and 6.3, the reliability of the outputs of dynamic gates will be calculated. In order to calculate the reliability of a SPARE gate we need the reliability of its dormant component, which will be given in Section 6.4. Using the structure function, which will be introduced in Section 6.5, we will then find system reliability.

6.1 Single components

For a single component, the probability that failure occurs within the next inspection interval T , given its age t, i.e. its conditional lifetime distribution, is given by

$$F(T \mid t) = \Pr(t_f \le t + T \mid t < t_f) = \frac{\Pr(t < t_f \le t + T)}{\Pr(t < t_f)} = \frac{F(t + T) - F(t)}{1 - F(t)} = 1 - \frac{R(t + T)}{R(t)}, \qquad (8)$$

where tf denotes the time to failure and R(t) = 1 − F (t), in which F (t) is its lifetime distribution.


T , given its age t, i.e. its conditional reliability, is

$$R(T \mid t) = 1 - F(T \mid t) = \frac{R(t + T)}{R(t)}. \qquad (9)$$

This expression holds for both types of components.
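As an illustration of (9), the sketch below evaluates the conditional reliability for a Weibull lifetime. The Weibull form with a scale and a shape parameter is an assumption made for this example, and the function names are hypothetical.

```python
import math


def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape) for a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))


def conditional_reliability(T, t, reliability):
    """Equation (9): R(T | t) = R(t + T) / R(t)."""
    return reliability(t + T) / reliability(t)


def R(t):
    # Example parameters: scale 53 and shape 2.2.
    return weibull_reliability(t, 53.0, 2.2)


# A new component versus one of age 30, over the next 6 time units.
r_new = conditional_reliability(6.0, 0.0, R)
r_old = conditional_reliability(6.0, 30.0, R)
```

With a shape parameter above 1 the failure rate increases with age, so `r_old` is smaller than `r_new`.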

6.2 SPARE gate

As will become clear from the introduction of the structure function in Section 6.5, in order to find system reliability we need the reliabilities of the dynamic gates. The lifetime distribution of a SPARE gate with initial active component A and spare component B, according to Merle et al. [17], is given by

$$F(t) = \int_0^t \left( \int_v^t f_{B_a}(u)\,du \right) f_A(v)\,dv + \int_0^t F_{B_d}(u)\, f_A(u)\,du, \qquad (10)$$

where $f_{B_a}(t)$ is the lifetime density of component B while being active, $F_{B_d}(t)$ is the lifetime distribution of component B while being dormant and $f_A(t)$ is the lifetime density of component A. Merle et al. [17] provide a similar expression to (10) for a SPARE gate with an arbitrary number of inputs. We integrate $f_{B_a}(t)$ from v to t. This makes sense because failure of component A means that component B becomes active. This changes its failure rate and hence its lifetime density (if $\gamma \neq 1$). Note that there is no distinction in activity or dormancy for component A. This follows from the fact that once component A has failed, it is correctively maintained at the next inspection and will then be active for the subsequent interval. Hence at the start of each inspection interval, when we calculate reliability, component A will always be active. Finally, note that (10) is a sum of two terms. The first term corresponds to the event that active component A fails before dormant component B, and component B thereafter. The second term corresponds to the event that dormant component B fails before active component A, and A thereafter. Since exactly one of these two failure orders occurs when the gate fails, the events are mutually exclusive and we can sum their probabilities to find the probability of failure of the gate. This is illustrated in Figure 6, in which both terms in (10) and their sum are plotted against time t. The two inputs are new type 1 components with parameters $\alpha_1 = 0.5$, $\alpha_2 = 0.5$, $\beta_1 = 1.2$, $\beta_2 = 1.1$, $z_1 = 15$, $z_2 = 15$ and $\gamma = 0.90$.

Figure 6: The probability of failure before t (vertical axis, from 0 to 1) against time t (horizontal axis, from 0 to 50). The solid line denotes the sum of the left term (dotted) and the right term (dashed) in (10).
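As a quick consistency check on (10), consider the limiting case γ = 1, in which the spare has the same failure rate whether dormant or active, so that both its dormant and its active lifetime distribution equal a common distribution, written F_B here purely for illustration. Evaluating the inner integral then gives

```latex
\begin{align*}
F(t) &= \int_0^t \bigl[F_B(t) - F_B(v)\bigr] f_A(v)\,dv + \int_0^t F_B(u)\, f_A(u)\,du \\
     &= F_B(t) \int_0^t f_A(v)\,dv \\
     &= F_A(t)\, F_B(t).
\end{align*}
```

This is the lifetime distribution of an AND gate with independent inputs, as one would expect for a spare that is never effectively dormant.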

(8) it is shown how to obtain a conditional lifetime distribution. To obtain the lifetime density of a component with age $t^*$, first consider the density of that component with age 0, say $f(t)$. Truncate this density from the left at $t = t^*$ and shift it left such that we obtain a density $f^*(t)$ which has at $t = 0$ the same value as $f(t)$ has at $t = t^*$, i.e.

$$f^*(t) = \begin{cases} 0 & \text{if } t < 0, \\ f(t + t^*) & \text{else.} \end{cases} \qquad (11)$$

In order to normalize this function such that the area under it equals 1, we must have

$$\int_{-\infty}^{\infty} \delta \cdot f^*(t)\,dt = 1 \;\Rightarrow\; \delta \int_0^{\infty} f(t + t^*)\,dt = 1 \quad \text{and hence} \quad \delta = \frac{1}{\int_0^{\infty} f(t + t^*)\,dt}. \qquad (12)$$


density of a component, conditional on its age $t^*$, is given by

$$f(t \mid t^*) = \begin{cases} 0 & \text{if } t < 0, \\ \delta f(t + t^*) & \text{else,} \end{cases} \qquad (13)$$

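where $f(t)$ is the unconditional lifetime density and $\delta$ is defined in (12). Thus, the lifetime distribution of a SPARE gate, conditional on the ages of its inputs, $t_A$ and $t_B$, is given by

$$F(t \mid t_A, t_B) = \int_0^t \left( \int_v^t f_{B_a}(u \mid t_B)\,du \right) f_A(v \mid t_A)\,dv + \int_0^t F_{B_d}(u \mid t_B)\, f_A(u \mid t_A)\,du. \qquad (14)$$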

This is a general expression for any type of lifetime density or distribution. The conditional reliability during the next inspection interval T is then given by R(T |tA, tB) = 1 − F (T | tA, tB).

6.3 PAND gate

For PAND gates with two inputs A and B we find that, according to Merle et al. [17], the lifetime distribution is

$$F(t) = \int_0^t f_B(u)\, F_A(u)\,du. \qquad (15)$$

Again, similar expressions can be derived for a PAND gate with an arbitrary number of inputs. The lifetime distribution of a PAND gate, conditional on the ages of its inputs, $t_A$ and $t_B$, can be obtained by replacing the lifetime density and distribution in (15) by their conditional equivalents. This results in

$$F(t \mid t_A, t_B) = \int_0^t f_B(u \mid t_B)\, F_A(u \mid t_A)\,du. \qquad (16)$$

Again, this is a general expression. The conditional reliability over the next inspection interval T can then be obtained by calculating R(T |tA, tB) = 1 − F (T |tA, tB).
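The integral in (15) usually has to be evaluated numerically. The sketch below does so with a plain trapezoidal rule and checks the result against the closed form that follows for exponential inputs. The exponential lifetimes, the rates and all function names are assumptions chosen for illustration only.

```python
import math


def pand_cdf(f_B, F_A, t, n=2000):
    """Approximate F(t) = int_0^t f_B(u) F_A(u) du (trapezoidal rule)."""
    if t <= 0.0:
        return 0.0
    h = t / n
    total = 0.5 * (f_B(0.0) * F_A(0.0) + f_B(t) * F_A(t))
    for i in range(1, n):
        u = i * h
        total += f_B(u) * F_A(u)
    return total * h


def exp_pdf(rate):
    """Density of an exponential lifetime with the given rate."""
    return lambda u: rate * math.exp(-rate * u)


def exp_cdf(rate):
    """Distribution function of an exponential lifetime."""
    return lambda u: 1.0 - math.exp(-rate * u)


def pand_cdf_exponential(a, b, t):
    """Closed form of (15) for exponential inputs: rate a for A, b for B."""
    return (1.0 - math.exp(-b * t)) - b / (a + b) * (1.0 - math.exp(-(a + b) * t))


numeric = pand_cdf(exp_pdf(0.05), exp_cdf(0.04), 20.0)
exact = pand_cdf_exponential(0.04, 0.05, 20.0)
```

The conditional version (16) can be evaluated with the same routine by passing the conditional density of B and the conditional distribution of A instead of the unconditional ones.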

6.4 Dormant lifetime distributions


From Barlow and Proschan [3] we have that the reliability of a component follows

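$$R(t) = \exp\left[ -\int_0^t r(x)\,dx \right], \qquad (17)$$

where $r(t)$ is its failure rate. Hence, the reliability of a dormant component is

$$R_\gamma(t) = \exp\left[ -\int_0^t \gamma \cdot r(x)\,dx \right] = \exp\left[ -\gamma \int_0^t r(x)\,dx \right] = \left( \exp\left[ -\int_0^t r(x)\,dx \right] \right)^{\gamma} = R(t)^{\gamma} = \bigl(1 - F(t)\bigr)^{\gamma}$$

and hence the lifetime distribution of a dormant component is

$$F_\gamma(t) = 1 - \bigl(1 - F(t)\bigr)^{\gamma}. \qquad (18)$$

Furthermore, the conditional lifetime distribution of a dormant component with age $t^*$ is

$$F_\gamma(t \mid t^*) = 1 - \bigl(1 - F(t \mid t^*)\bigr)^{\gamma}. \qquad (19)$$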
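As a worked example of (18), suppose the active lifetime is Weibull with scale λ and shape k, so that R(t) = exp[−(t/λ)^k]. This parametrization is an assumption made here for illustration. Then

```latex
\[
F_\gamma(t) = 1 - \exp\!\left[ -\gamma \left( \frac{t}{\lambda} \right)^{k} \right]
            = 1 - \exp\!\left[ -\left( \frac{t}{\lambda\,\gamma^{-1/k}} \right)^{k} \right],
\]
```

so the dormant lifetime is again Weibull with the same shape k and, for 0 < γ < 1, the larger scale λγ^(−1/k).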

6.5 System reliability and the structure function

We now have obtained expressions for the conditional reliability of all components and dynamic gates. We will use these to find the conditional system reliability via the structure function. The structure function is a binary function over time defined as

$$\phi(s(t)) = \begin{cases} 1 & \text{if the system is operating at time } t, \\ 0 & \text{if the system is down at time } t, \end{cases} \qquad (20)$$

where $s(t) = (s_1(t), s_2(t), \ldots, s_N(t))$ is a vector of states describing the N components in the system. Thus the structure function indicates whether the system operates or not. The state $s_i(t)$, for $i = 1, \ldots, N$, is also a binary function, defined as

$$s_i(t) = \begin{cases} 1 & \text{if component } i \text{ is operating at time } t, \\ 0 & \text{if component } i \text{ is down at time } t. \end{cases} \qquad (21)$$


The structural dependencies among components give rise to the analytic form of the structure function. For example, if a system consists of two components that are modeled to be the inputs of an OR gate, of which failure corresponds to system failure, the structure function is $\phi(s(t)) = s_1(t) \cdot s_2(t)$. This can be seen by observing that failure of one of the components will induce system failure. For a similar system but with an AND gate instead of an OR gate we have that $\phi(s(t)) = s_1(t) + s_2(t) - s_1(t) \cdot s_2(t)$. More generally, we can obtain an expression for the structure function of a static fault tree by using all minimal cut sets in the tree [19]. Once we have found all minimal cut sets we can find the structure function using

$$\phi(s(t)) = \prod_{j=1}^{n_c} \left[ 1 - \prod_{i \in M_j} \bigl(1 - s_i(t)\bigr) \right], \qquad (22)$$

where $n_c$ is the number of minimal cut sets and $M_j$ denotes minimal cut set j. Since minimal cut sets are not properly defined for dynamic fault trees, (22) is only valid for static fault trees. However, if we treat intermediate events, corresponding to the failures of dynamic gates, as basic events, we can find the minimal cut sets of this transformed ‘dynamic fault tree’ and thereby its structure function. For an example of this procedure see Figure 7. In Figure 7a a dynamic fault tree is depicted. In Figure 7b the intermediate events corresponding to failures of dynamic gates in Figure 7a are replaced by a special basic event. For dynamic gates with type 1 inputs this event is denoted by a double square and for dynamic gates with type 2 inputs it is denoted by a double circle. The numbers inside the double circle and square correspond to the basic events which were the inputs of the original dynamic gate. Since we can calculate the reliability of dynamic gates, we know the reliability of the imaginary components of which failure is represented by this special basic event.

Finding all minimal cut sets for large static fault trees containing complex structural dependencies can be a non-trivial task [23], but is outside the scope of this paper. Nguyen et al. [19] show that $E[\phi(s(t))] = R(t)$ and hence system reliability can be found by replacing the states of the components (and dynamic gates) by their reliabilities in (22):

$$R_{\text{system}}(t) = \prod_{j=1}^{n_c} \left[ 1 - \prod_{i \in M_j} \bigl(1 - R_i(t)\bigr) \right]. \qquad (23)$$
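Expressions (22) and (23) share the same product form, which the following sketch evaluates for given minimal cut sets. It assumes that states and reliabilities are stored in dictionaries keyed by component index; the function names are hypothetical.

```python
from math import prod


def structure_function(states, cut_sets):
    """Equation (22): returns 1 if the system operates, 0 if it is down."""
    return prod(1 - prod(1 - states[i] for i in cut) for cut in cut_sets)


def system_reliability(reliabilities, cut_sets):
    """Equation (23): states replaced by component reliabilities."""
    return prod(
        1.0 - prod(1.0 - reliabilities[i] for i in cut) for cut in cut_sets
    )


# Two components under an OR gate: each component is a minimal cut set.
or_cuts = [[1], [2]]
# Two components under an AND gate: one minimal cut set with both.
and_cuts = [[1, 2]]

r = {1: 0.9, 2: 0.8}
r_or = system_reliability(r, or_cuts)
r_and = system_reliability(r, and_cuts)
```

For the OR gate this reproduces the product of the two reliabilities, and for the AND gate one minus the product of the two failure probabilities, in line with the two-component structure functions given earlier.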

System reliability between $T_k$ and $T_{k+1}$, assuming we do not perform preventive maintenance, conditional on the ages of all components, is denoted as $R_{\text{system}}(T_{k+1} \mid T_k)$. It can be found

Figure 7: (a) Dynamic fault tree before transformation. (b) The same dynamic fault tree after transformation.

7 Heuristic

Nguyen et al. [19] introduced a heuristic which is based on a minimal reliability threshold, as will be explained in Section 7.1. Once system reliability drops below this threshold, maintenance will be performed based on the so-called group improvement factor, explained in Section 7.2. We build on this heuristic and extend it so that it can handle systems that are modelled using dynamic gates. Finally, in Section 7.3, changes to the decision criteria of the heuristic are made to test the effectiveness and performance of the heuristic.

7.1 Minimal reliability threshold

The heuristic works as follows. Upon inspection, corrective maintenance is performed on all failed components. Hence, for these components either the deterioration level or the effective age becomes 0. Subsequently, in order to decide what components to perform preventive maintenance on, conditional system reliability during the next inspection interval T is calculated. The components that have failed during the last inspection interval and have been correctively maintained will not be eligible for preventive maintenance, since their condition cannot be improved upon. To be clear, we calculate the conditional reliability of the system over the next T time units assuming that we do not perform preventive maintenance. Next we compare this value with a minimal reliability threshold we want to preserve, namely $(R_{\min})^T$. If the calculated reliability falls short of this threshold value, i.e. $R_{\text{system}}(T_{k+1} \mid T_k) < (R_{\min})^T$, we will perform preventive maintenance. If it does not, we move on to the next inspection.

It is worth noting that this does not imply strict economic optimization of the maintenance policy. The incentive to perform preventive maintenance does not arise directly from an economic point of view, but rather from a reliability point of view.

7.2 Group improvement factor

Let us consider we observe Rsystem(Tk+1|Tk) < (Rmin)T and hence we will perform preventive


where $R^{G_l}_{\text{system}}(T_{k+1} \mid T_k)$ is the reliability of the system that follows from performing preventive maintenance on group $G_l$ and $C^{G_l}$ denotes the costs of maintaining group $G_l$, in which $l = 1, 2, \ldots, 2^p - 1$ denotes the index of the group and p is the number of eligible components.

The GIF takes into account the reliability of the system if a certain group is maintained, compares this to the reliability of the system in the case it is not preventively maintained and scales it with the costs of this maintenance action. The most important aspects of the problem are thus taken into account in one simple factor. Preventive maintenance is done on the group with the maximal GIF, provided that the resulting system configuration has a reliability that exceeds the minimal reliability threshold $(R_{\min})^T$. That is, the optimal group to maintain at time $T_k$ is found by

$$G^* \in \arg\max_{G_l} \left\{ \mathrm{GIF}_{G_l}(T_k) : R^{G_l}_{\text{system}}(T_{k+1} \mid T_k) \ge (R_{\min})^T \right\}. \qquad (25)$$

The deterioration of all type 1 components in G∗ is set to 0 and the effective age of all type 2 components is set back to 0 time units. Subsequently we move on to the next inspection and repeat the process.
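The selection in (25) can be sketched as an exhaustive search over all non-empty groups of eligible components. The sketch assumes that the GIF of a group is the increase in system reliability divided by the cost of maintaining the group, in line with the description of the GIF as a ratio between reliability improvement and costs. The callbacks `reliability_if_maintained` and `cost` are hypothetical placeholders for the reliability evaluation of Section 6 and the cost model.

```python
from itertools import combinations


def select_group(eligible, reliability_if_maintained, cost, r_now, threshold):
    """Return the feasible group with maximal GIF, or None if none exists.

    eligible: list of component indices eligible for preventive maintenance
    reliability_if_maintained: maps a frozenset of indices to the system
        reliability over the next interval if that group is maintained
    cost: maps a frozenset of indices to the cost of maintaining it
    r_now: system reliability without preventive maintenance
    threshold: the minimal reliability threshold, (R_min) ** T
    """
    best_group, best_gif = None, float("-inf")
    for size in range(1, len(eligible) + 1):
        for members in combinations(eligible, size):
            group = frozenset(members)
            r_group = reliability_if_maintained(group)
            if r_group < threshold:
                continue  # infeasible: the constraint in (25) is violated
            # Assumed GIF: reliability gain per unit of maintenance cost.
            gif = (r_group - r_now) / cost(group)
            if gif > best_gif:
                best_group, best_gif = group, gif
    return best_group
```

With p eligible components the two loops visit all 2^p − 1 non-empty groups, which is why the search becomes expensive for larger systems.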

7.3 One by one selection

When using the GIF for decision making we scale the improvement in system reliability to the costs of the corresponding maintenance action. This implies that the maintenance action chosen is not always the cheapest way of restoring the system to the minimal reliability threshold. It only uses the threshold as a constraint for feasibility. In addition, the original heuristic considers all possible groups eligible for maintenance. If the number of components in a system increases, the solution space per inspection increases exponentially. To avoid this, we can compose the group to be maintained by adding components one by one until maintenance on it returns system reliability to the minimal reliability threshold.

To illustrate this procedure, consider again we observe Rsystem(Tk+1|Tk) < (Rmin)T and hence


Table 1: Parameters used to model the lifetime of the components in the small systems, resulting expected times to failure and maintenance costs.

Component  Type  αi   βi   zi  λi  ki   E[tf]  cp,i  cc,i
1          1     0.6  1.2  18  -   -    25     80    270
2          1     0.6  1.2  19  -   -    26.4   80    250
3          2     -    -    -   53  2.2  47     100   300

the new group to maintain and we will continue to add components to this group until we reach the reliability threshold.

This method will always consider fewer groups than the original heuristic and, more importantly, its solution space at every inspection will increase at most quadratically instead of exponentially with the number of components. We will refer to this selection method as one by one selection.
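One by one selection can be sketched as a greedy loop. The sketch assumes that in each step the component whose addition gives the current group the highest GIF is added, with the GIF taken as reliability gain divided by group cost, and that selection stops as soon as the threshold is reached. The callbacks are hypothetical placeholders, as is the returned evaluation counter.

```python
def one_by_one_selection(eligible, reliability_if_maintained, cost, r_now, threshold):
    """Greedily grow the maintenance group until the threshold is reached.

    Returns (group, evaluated), where group is None if even maintaining
    all eligible components does not reach the threshold and evaluated
    counts the candidate groups that were considered.
    """
    group = frozenset()
    remaining = set(eligible)
    evaluated = 0
    while remaining:
        best_c, best_gif, best_r = None, float("-inf"), None
        for c in sorted(remaining):
            candidate = group | {c}
            r_c = reliability_if_maintained(candidate)
            evaluated += 1
            # Assumed GIF of the candidate group.
            gif = (r_c - r_now) / cost(candidate)
            if gif > best_gif:
                best_c, best_gif, best_r = c, gif, r_c
        group = group | {best_c}
        remaining.remove(best_c)
        if best_r >= threshold:
            return group, evaluated
    return None, evaluated
```

In the worst case the loop evaluates N candidates in the first step, N − 1 in the second and so on, which gives N(N + 1)/2 groups in total for N eligible components.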

8 Results and interpretation

In this section the results obtained using the heuristic are presented and analysed. In order to see what the output of the heuristic is, three small and comparable systems are analysed in Section 8.1. Then, in Section 8.2, a larger system consisting of 5 components is analysed using both the original heuristic and one by one selection.

8.1 Three small systems

In order to evaluate the performance of the heuristic, we will analyse three small systems consisting of 3 components. The corresponding dynamic fault trees are shown in Figure 8. The only difference between the trees is the type of the gate. We used the same parameters to model the lifetimes of the components in all three systems and the same goes for maintenance costs. Parameters are chosen such that the expected times to failure of all components are comparable to each other. All these characteristics are displayed in Table 1. Other costs used in the analysis of the small systems are $c_s = 60$, $c_i = 25$, $c_d = 250$. The dormancy factor of the spare component in Figure 8b is $\gamma = \tfrac{1}{2}$.

Figure 8: (a) Dynamic fault tree with a PAND gate. (b) Dynamic fault tree with a SPARE gate. (c) Static fault tree with an AND gate.

Table 2: Maintenance schedule for the dynamic fault tree in Figure 8a. For this instance T = 6 and Rmin = 0.85.

Component  T1  T2  T3  T4  T5  T6  T7  T8  T9
1          -   -   PM  -   -   PM  -   -   -
2          -   -   -   CM  -   -
3          -   -   PM  -   -   -   -   CM  -

1 improves system reliability more than maintaining component 2, due to the PAND gate in the tree, and maintenance costs being equal. Note also that the system has failed once in the time interval considered. This can be seen by noting that at the eighth inspection component 3 is correctively maintained, implying it was in the failed state upon inspection. From Figure 8a it can be deduced that failure of component 3 induces system failure.

Since we are interested in the optimal inspection interval length T, plots of the long run cost rate against inspection interval T, for all three systems, are shown in Figure 9. The minimal reliability threshold in all three analyses is Rmin = 0.85. By visual inspection, for the system with the PAND gate, the minimum occurs somewhere between T = 3 and T = 4 with a cost rate between 39 and 40, see Figure 9a. For the system with the SPARE gate shorter simulations were done because they took significantly longer than simulations for the tree with the PAND gate. This difference in computation time is due to the difficulty of solving (10) compared to (15). The minimum seems to occur around T = 3 with a corresponding cost rate around 36, see Figure 9b.


Table 3: Parameters used to model the lifetime of the components in the larger system, resulting expected times to failure and maintenance costs.

Component  Type  αi   βi   zi  λi  ki   E[tf]  cp,i  cc,i
1          1     0.7  1.3  15  -   -    16.5   100   350
2          1     0.5  1.4  16  -   -    22.9   80    250
3          2     -    -    -   21  3.2  18.8   100   300
4          1     0.6  1.2  16  -   -    22.2   110   280
5          2     -    -    -   18  5    16.5   100   300

Table 4: Maintenance schedule for the larger dynamic fault tree in Figure 10. For this instance T = 5 and Rmin = 0.85.

Component  T1  T2  T3  T4  T5  T6  T7  T8  T9
1          -   -   -   CM  -   -   -   CM  -
2          -   -   PM  -   -   -   PM  -   -
3          -   PM  -   -   PM  -   -   PM  -
4          -   -   -   CM  -   -
5          -   -   PM  -   PM  -   -   -   CM

This failure mode might dominate the location of the minimum, whereas the inherent differences between the SPARE, PAND and AND gates do not.

8.2 Larger system

In this section a somewhat larger system consisting of 5 components is analysed. The dynamic fault tree modeling it is shown in Figure 10. The parameters, expected times to failure and maintenance costs of the components are in Table 3. Other costs used in the analysis of the larger system are cs= 50, ci = 50, cd= 500.

An example of an inspection schedule obtained using the original heuristic is shown in Table 4. Clustering is observed during the third and fifth inspection. Observe that maintaining component 2 is now preferred over maintaining component 1. This must be caused by the differences in maintenance costs and the improvement in reliability that results from maintenance on the individual components. The latter depends on the condition of the component and the parameters of its lifetime distribution. Although the costs of corrective maintenance on component 1 are relatively high, the heuristic chooses not to perform preventive maintenance on it. This is due to the fact that corrective maintenance costs are not part of decision making, since they are not in the group improvement factor. Failure of component 3 will lead to system failure and hence it is maintained regularly.

Figure 9 (a): Results for the small tree with the PAND gate, see Figure 8a. The time horizon was set to tmax = 40000.

Figure 9 (b): Results for the small tree with the SPARE gate, see Figure 8b. The time horizon was set to tmax = 10000.

Figure 9 (c): Results for the small tree with the AND gate, see Figure 8c. The time horizon was set to tmax = 40000.

Figure 10: Larger dynamic fault tree.

rate obtained by one by one selection is higher than the one obtained by the original heuristic. For small values of T the cost rates of both heuristics nearly coincide, and the difference between them increases for larger values of T. Apparently the solutions of both heuristics are very much alike for small values of T, but for larger values the solution of the original heuristic performs significantly better. The minimal cost rate for both heuristics lies around T = 3. The relative increase in cost rate using one by one selection at T = 3 is around 2%.

As said, the solution space of one by one selection increases only quadratically with the number of components, whereas for the original heuristic it increases exponentially. For the original heuristic the maximum number of possible groups is $2^N - 1$, whereas one by one selection considers at most $\tfrac{1}{2} N (N + 1)$. During simulation these numbers may be smaller because correctively maintained components are not eligible for preventive maintenance. Also, in practice one by one selection will very often consider far fewer groups because it stops selection once maintenance on the composed group can return the system to the minimal reliability threshold. From Figure 12 it can be seen that the difference in solution space is significant, certainly for large systems. For this particular case of a 5 component system the solution space can be reduced from 31 to 15, corresponding to a relative decrease of around 50%.
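For concreteness, with N = 5 eligible components the two bounds evaluate to

```latex
\[
2^{5} - 1 = 31, \qquad \tfrac{1}{2} \cdot 5 \cdot 6 = 15, \qquad \frac{31 - 15}{31} \approx 0.52,
\]
```

while for N = 10 the same formulas give 1023 groups for the original heuristic against at most 55 for one by one selection.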

9 Discussion and conclusion


Figure 11: Long run cost rate against inspection interval length T . Results of the original heuristic (filled circles) and one by one selection (empty circles). In this instance Rmin= 0.85 and the time horizon is tmax= 20000.


the maximal group improvement factor was maintained. By formulating a constraint, system reliability was always above a threshold value. We have economically optimized this maintenance policy over the inspection interval length using Monte Carlo simulation. For part of the components, lifetimes were modeled using a probability distribution, whereas for other components a stochastic process was used to model continuous deterioration.

Three variants of a three component system were analysed to evaluate the performance of the heuristic. Using visual inspection, global minima could be identified. Due to the small solution space of these systems also local extrema were observed. The strong reliability of SPARE and PAND gates, relative to an AND gate, was emphasized by comparison.

The developed heuristic was tested in terms of quality of the solution by comparing it to a less comprehensive heuristic called one by one selection. The latter heuristic is also based on the group improvement factor, but its solution space is smaller. For the parameters and costs used in a 5-component system, the relative cost increase was only 2%, while the solution space decreased by more than 50%. For larger systems the solution space will decrease even more drastically, whereas the effect on the cost increase is an interesting topic for further investigation, for instance by doing a sensitivity analysis.

In the model as presented some choices and assumptions were made for reasons of computational tractability. In further research on the subject, some of them are worth reconsidering, thereby increasing modeling power for practical situations. Importantly, it was chosen to work with inspections rather than with continuous monitoring. Ideally, conditions are monitored in real time, for instance by using sensors, and preventive maintenance is done just before a component breaks down. This requires the condition information to be very accurate and reliable, which is difficult in practice. Assuming condition information to be perfect upon inspection is also a strong assumption in practice. Since real time monitoring can reduce maintenance costs significantly, firms are putting a lot of effort into it. Furthermore, working with inspections also eased computations. If we were to observe the system constantly during simulations, computations would become more demanding.


However, in practice some unimportant components might be left in their failed state. In this case one could also reconsider placing such a component in the fault tree at all.

Exactly optimizing the maintenance policy of a dynamic fault tree is hard and hence reliability centered maintenance was done. In this way a sensible maintenance policy was obtained; however, it could not be compared to the exact optimal one. This problem is hard to overcome since it is inherent to exact optimization of large multi-component systems. The heuristic makes sure system reliability attains a minimal reliability threshold. However, on average, system reliability is higher than the threshold since system reliability is required to be equal to or higher than it. For firms it might therefore be better to let the system attain an average minimal system reliability. In this way maintenance costs can be cut while minimal system reliability remains the same.

The most difficult part of fault tree analysis is calculating the reliability of a fault tree. Algebraic calculation becomes increasingly difficult once dynamic gates have large sub-trees as inputs, because multiple integrals have to be solved. If lifetime densities and distributions have a complex analytical form, the difficulties increase further. Because of this it was assumed that inputs of dynamic gates are basic events and that these basic events are unique. There are alternatives to finding reliability algebraically, as stated in the literature section, but they each have their own shortcomings.


Bibliography

[1] A. Abate, C. Budde, N. Cauchi, K. Hoque, and M. Stoelinga. Assessment of maintenance policies for smart buildings: Application of formal methods to fault maintenance trees. In Proceedings of the European Conference of the PHM Society, volume 4. PHM society, Jul 2018.

[2] S. Amari, G. Dill, and E. Howald. A new approach to solve dynamic fault trees. In Annual Reliability and Maintainability Symposium, pages 374–379, Jan 2003.

[3] R. Barlow and F. Proschan. Mathematical Theory of Reliability. Wiley, New York, NY, USA, 1965.

[4] H. Boudali, P. Crouzen, and M. Stoelinga. A compositional semantics for dynamic fault trees in terms of interactive markov chains. In Automated Technology for Verification and Analysis, pages 441–456. Springer Berlin Heidelberg, 2007.

[5] H. Boudali, A. Nijmeijer, and M. Stoelinga. Dftsim: A simulation tool for extended dynamic fault trees. Mar 2009.

[6] D. Codetta-Raiteri. The conversion of dynamic fault trees to stochastic petri nets, as a case of graph transformation. Electronic Notes in Theoretical Computer Science, 127(2):45–60, 2005.

[7] D. Codetta-Raiteri. Integrating several formalisms in order to increase fault trees’ modeling power. Reliability Engineering & System Safety, 96, May 2011.

[8] D. Codetta-Raiteri, G. Franceschinis, M. Iacono, and V. Vittorini. Repairable fault tree for the automatic evaluation of repair policies. In International Conference on Dependable Systems and Networks, 2004, pages 659–668, Jun 2004.

[9] M. Doostparast, F. Kolahan, and D. Mahdi. A reliability-based approach to optimize preventive maintenance scheduling for coherent systems. Reliability Engineering & System Safety, 126:98–106, 2014.

[10] J. B. Dugan, S. J. Bavuso, and M. A. Boyd. Fault trees and sequence dependencies. In Annual Proceedings on Reliability and Maintainability Symposium, pages 286–293, Jan 1990.

[11] Y. Dutuit and A. Rauzy. Approximate estimation of system reliability via fault trees. Reliability Engineering & System Safety, 87(2):163–172, 2005.

[12] R. Gulati and J. B. Dugan. A modular approach for analyzing static and dynamic fault trees. In Annual Reliability and Maintainability Symposium, pages 57–63, Jan 1997.


[14] J. F. Lawless. Statistical Models and Methods for Lifetime Data. John Wiley & Sons Ltd, Wiley-Interscience, Hoboken, NJ, USA, 2nd edition, 2002.

[15] W. Long, Y. Sato, and M. Horigome. Quantification of sequential failure logic for fault tree analysis. Reliability Engineering & System Safety, 67(3):269–274, 2000.

[16] G. Merle, J. Roussel, J. Lesage, and A. Bobbio. Probabilistic algebraic analysis of fault trees with priority dynamic gates and repeated events. IEEE Transactions on Reliability, 59(1): 250–261, Mar 2010.

[17] G. Merle, J. Roussel, J. Lesage, and N. Vayatis. Analytical calculation of failure probabilities in dynamic fault trees including spare gates. Reliability, Risk and Safety, Sep 2010.

[18] K. S. Moghaddam and J. S. Usher. Sensitivity analysis and comparison of algorithms in preventive maintenance and replacement scheduling optimization models. Computers & Industrial Engineering, 61(1):64–75, 2011.

[19] K. Nguyen, P. Do, and A. Grall. Multi-level predictive maintenance for multi-component systems. Reliability Engineering & System Safety, 144, Dec 2015.

[20] J. van Noortwijk. A survey of the application of gamma processes in maintenance. Reliability Engineering & System Safety, 94(1):2–21, 2009.

[21] L. Portinale, D. Codetta-Raiteri, and S. Montani. Supporting reliability engineers in exploiting the power of dynamic bayesian networks. International Journal of Approximate Reasoning, 51(2):179–195, 2010.

[22] K. D. Rao, V. Gopika, V. S. Rao, H. Kushwaha, A. Verma, and A. Srividya. Dynamic fault tree analysis using monte carlo simulation in probabilistic safety assessment. Reliability Engineering & System Safety, 94(4):872–883, 2009.

[23] E. Ruijters and M. Stoelinga. Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools. Computer Science Review, 15-16:29–62, 2015.

[24] E. Ruijters, D. Guck, P. Drolenga, M. Peters, and M. Stoelinga. Maintenance analysis and optimization via statistical model checking: evaluating a train pneumatic compressor. Aug 2016.

[25] E. Ruijters, D. Guck, P. Drolenga, and M. Stoelinga. Fault maintenance trees: Reliability centered maintenance via statistical model checking. In Annual Reliability and Maintainability Symposium, pages 1–6, Jan 2016.

[26] L. Thomas. A survey of maintenance and replacement models for maintainability and reliability of multi-item systems. Reliability Engineering, 16(4):297–309, 1986.

[27] W. Vesely and N. Roberts. Fault Tree Handbook. Number v. 88.


Contents

1 Introduction
2 Fault trees
3 Literature
4 Problem description
5 Methodology
6 Reliability analysis
7 Heuristic
8 Results and interpretation
9 Discussion and conclusion
Bibliography