Improving predictive maintenance of turbofan engines with different attention mechanism models

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Albert Folch Garcia
12136018

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

July 4th, 2019

1st Supervisor: Maurits Bleeker, MSc.
2nd Supervisor: Vesa Muhonen, Ph.D.

Improving predictive maintenance of turbofan engines with different attention mechanism models

Albert Folch Garcia
albertfolchg@gmail.com
University of Amsterdam
The Netherlands

ABSTRACT

Turbofan engine failure in aeroplanes can be catastrophic, causing accidents and leading to loss of life, and it can be attributed to a variety of causes, such as wear from continued use. In this paper, we develop an approach which automates diagnostics by predicting the remaining number of operational cycles before engine failure. Our solution has the potential to guide better design choices and minimise engine failure risk, saving lives and millions of dollars for the industry. Different machine learning models have already been applied to the same datasets such as Convolutional Neural Networks (CNN) [1], Long Short-Term Memory networks (LSTM) [25] and Convolutional Long Short-Term Memory (CNN-LSTM) [13]. In this paper, we use different attention mechanisms applied to an LSTM network and compare these models to vanilla LSTMs. At the end of the study, we show how our models using attention mechanisms outperform LSTM with the same architecture and hyperparameters on the simplest datasets.

CCS CONCEPTS

• Computing methodologies → Neural networks; Machine learning approaches;

KEYWORDS

Attention mechanism, LSTM, Deep learning, Predictive maintenance, NASA, Time series

1 INTRODUCTION

In general, all machines have to be maintained to prevent failure. Maintenance reduces costs and downtime, given that failure in one machine can affect any other machine or process that depends on it. Historically, with reactive maintenance, machines were fixed only after breakdown [7]. This approach wasted time and money, however, as fixing machines once they are unusable is more costly than maintaining them in advance. Consequently, preventive maintenance emerged as a way to maintain machines conservatively to avoid failure. This approach, however, has the drawback that machines are maintained before they require it, thus reducing their time in production.

Increased availability of computational resources over time has led to more data collection and better machine learning algorithms. These improvements have enabled new algorithms that offer the possibility of doing predictive maintenance [21], using sensor data and the operational settings of machines to predict their remaining useful life (RUL). One of the machine learning models that has been successfully used on time series is the Long Short-Term Memory network (LSTM) [11]. The benefit of this model is that it can retain information over time. Although LSTMs are chiefly used for sequence-to-sequence data, they can also be used for time series data. Their architecture demands samples with a time dimension and an input dimension (also known as features). The time dimension corresponds to the window size of the sliding-window approach generally used.

Attention mechanism (or simply attention) by Bahdanau et al. [2] is another approach that has been successfully used in time series algorithms in other fields, such as neural machine translation. Sequence-to-sequence models without attention performed the same task with some difficulty because they compressed all information into one fixed vector (the last hidden state), and therefore some information was lost [17, 22]. Attention mechanisms, on the other hand, use all hidden states, and thus information loss is less likely.

Attention mechanisms also aim to imitate the way humans pay attention. An image-captioning task is an example: rather than paying attention to the entire image to generate a caption, people, like attention mechanism models, pay attention mostly to specific elements of the image. Another example is reading a page of a book: a person does not look at all the words at once but focuses on just a few words at a time.

In this study, we are interested in predicting the RUL of turbofan engines used in aeroplanes with attention mechanisms, using the C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) datasets with both operational settings and sensor data as our features. Details are given in Section 4.1.

Previous studies have already attempted to predict RUL on this dataset, but none used attention mechanisms. Hence, we study attention mechanisms on this dataset to investigate whether they can pay attention to relevant features in the same way as in the examples above. In other words, we study whether retaining relevant information using attention mechanisms can lead to better prediction of the RUL. More related work is discussed in Section 2.

Given the LSTM network and attention mechanisms, we are interested in whether the information needed to predict RUL correctly lies mostly within the last time steps of the LSTM's window or is spread out more broadly. In other words, to what extent can prediction of RUL be improved by the addition of an attention mechanism?

On the other hand, we are also interested in the insights that attention mechanisms can give on our predictive maintenance datasets: can attention show the origin of a failure?

This leads us to the research questions that follow:

• What are the pros and cons of LSTM networks with attention mechanism to predict failure within a time window?

• What performance level can be achieved for RUL prediction for turbofan engines within a time window using attention mechanisms with LSTM networks, compared to the analogous models (baselines) without attention?

Two architectures of attention mechanism are studied: the attention mechanism over the hidden states (or outputs) of the LSTM and the attention mechanism over the inputs. The motivation for the former is to investigate whether paying attention to all the outputs of the LSTM - rather than using just the last output, as the vanilla LSTM does - and thus learning more about previous steps the LSTM might have forgotten, helps to predict RUL. On the other hand, the motivation for using attention mechanisms over the inputs is to investigate whether attention can better predict RUL by learning to ignore the noise in the inputs and to boost those input features that are more informative. In conclusion, the goal of this research is to determine whether predictive maintenance can be performed better by using attention mechanisms.

The remaining part of this paper is structured as follows: in Section 2 we discuss related work on the same dataset, in Section 3 we explain the methods we researched as well as the methods we used directly, in Section 4 we describe our experimental setup, in Section 5 we show our results and in Section 6 we conclude our study.

2 RELATED WORK

Predictive maintenance using the C-MAPSS data has been approached by a variety of researchers with positive results. First, Babu et al. [1] used a Multilayer Perceptron (MLP), a Support Vector Regression (SVR), a Relevant Vector Regression (RVR) and especially an adapted Convolutional Neural Network (CNN) over the temporal dimension to predict RUL on two publicly available datasets, one of which was the C-MAPSS dataset.

Subsequently, a study by Zheng et al. [25] used an LSTM network to predict RUL on three datasets, the C-MAPSS among them. Zheng et al. [25] chose the LSTM approach rather than the CNN used by Babu et al. [1] to overcome the CNN's limitation of looking at each window independently. This study reported what were at that time state-of-the-art performances on all datasets tested. A similar study by Zhang et al. [24] used Bidirectional Long Short-Term Memory (BD-LSTM) networks and also reported the results of a Bidirectional Recurrent Neural Network (BD-RNN). In the end, these researchers obtained better results with the BD-LSTM network.

Work by Jayasinghe et al. [13] featured augmentation of the training dataset - to make it more similar to the testing set - and used a Convolutional Long Short-Term Memory (CNN-LSTM) network. They achieved the best performance, and therefore the best RUL prediction, of these four studies (see Table 5 for their results).

In our case, we chose as baseline the vanilla LSTM rather than the BD-LSTM because we have results for the former for all four datasets we are concerned with. We chose not to use the CNN-LSTM as baseline because it requires excessive training time, and time was a limiting factor in our study.

In conclusion, to our knowledge, no work has been done using attention mechanisms on the C-MAPSS dataset in the manner we propose: attention over the hidden states of the LSTM and over the inputs, both with different score functions.

3 METHODS

In this section we first give an overview of how LSTMs work and then delve into attention mechanisms, the method we research. Finally, we describe how Autoregressive Integrated Moving Average (ARIMA) models and Bayesian hyperparameter optimisation are used to support our attention mechanism models.

3.1 LSTM

Long Short-Term Memory [11] networks are a type of Recurrent Neural Network (RNN) created to overcome problems of the vanilla RNN. In particular, LSTM networks solve the problem of vanishing or exploding gradients of RNNs by introducing a forget gate (see Equation 1). In other words, they can theoretically keep important information for longer and forget what is unnecessary. The forget gate is used to update the cell state using the previous hidden state h_{t-1}.

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   (1)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   (2)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)   (3)
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t   (4)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   (5)
h_t = o_t * \tanh(C_t)   (6)

where f_t is the probability to forget the current cell state, i_t is the probability to keep the update \tilde{C}_t, x_t is the input, h refers to the hidden state, all C refer to the cell state, o_t is the output, and the W are the learnt weights of the forget (W_f), input (W_i), cell (W_C) and output (W_o) gates.

For our study, we created an LSTM baseline that replicates the architecture of Zheng et al. [25]. In that paper, the best model consisted of an LSTM with two layers followed by three feedforward networks (FFN), also known as dense layers. In between the layers, the authors also used dropout to prevent overfitting. Hence, our attention mechanisms operate on top of this LSTM baseline.
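For concreteness, the following is a minimal Keras sketch of a baseline with this shape (two LSTM layers and three dense layers, with dropout in between); the layer sizes, activations and dropout rates are placeholders rather than the exact values of Zheng et al. [25] or of our tuned models.

```python
# Minimal sketch of the baseline shape: 2 LSTM layers + 3 dense layers.
# Layer sizes and dropout rates below are illustrative placeholders.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

def build_baseline(window_size, n_features):
    model = Sequential()
    model.add(LSTM(64, return_sequences=True,
                   input_shape=(window_size, n_features)))
    model.add(Dropout(0.2))
    model.add(LSTM(64))                  # keeps only the last hidden state
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1))                  # predicted RUL (regression output)
    model.compile(loss='mse', optimizer='adam')
    return model
```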

3.2 Attention mechanism

Attention is primarily used in sequence-to-sequence tasks [2], but it can also be used with any model that uses features with time steps. In our case, we are presented with a many-to-one time series problem, thus the concept of encoder/decoder does not apply (only the encoder part does) as we do not have a sequence or multi-step target to predict.

In our research, we investigated two types of attention. The first is the most common attention mechanism architecture, which pays attention to the hidden states (or outputs) of the LSTM, but we also investigated applying the attention mechanism over the inputs.


Figure 1: A 2-layer LSTM network with attention mechanism over the hidden states (the outputs) of the LSTM. Score represents any of the attention functions described in the subsections of 3.2. Dropouts are not displayed.

3.2.1 Attention mechanism over hidden states. The first attention mechanism we employed (see Figure 1 for an overview) uses the hidden states of all time steps of the last layer of the model and produces weights between 0 and 1, called attention weights or alignments, that express how strongly a hidden state is aligned or related to the output. Concretely, we used this mechanism to pay attention to the outputs of the LSTM network. The vector of attention weights (\alpha \in R^T, where T is the total number of time steps in the window) is calculated using an attention score function, which returns an attention score s \in R^T whose form depends on the type of attention we want to use. These functions are explained at the end of the section. The inputs for all score functions are the last hidden state of the last layer of the LSTM network (represented as the vector h_t \in R^K, where t is the last time step and K is the number of hidden neurons) and all the hidden states of the same layer (H_T \in R^{T \times K}). The result of the score function (Equation 7) is called the attention score, and it can have different forms. These include additive attention (Bahdanau et al. [2]), self-multiplicative attention (Luong et al. [17]), self-additive attention (also Bahdanau et al. [2]) and key value attention (Daniluk et al. [8]). After all attention scores are calculated, we apply a softmax function to normalise the distribution (see Equation 8), where e is the mathematical constant, or Euler's number.

\text{attention score} = s = \mathrm{score}(h_t, H_T)   (7)

\alpha_t = \mathrm{softmax}(\mathrm{score}(h_t, H_T)) = \frac{e^{s_t}}{\sum_{i}^{T} e^{s_i}}   (8)

c = \sum_{i}^{T} \alpha_i \cdot H_{T,i}   (9)

a = \tanh(W \cdot [c; h_t])   (10)

The context vector (c \in R^K) is the representation of how important each hidden state is, and it is calculated by augmenting or reducing the effect of each hidden state with the trained attention weights (see Equation 9).

The attention vector (a \in R^K) is constructed by applying the context vector and the output of the last hidden time step to the matrix of trained weights W, which ensures that attention is learnt (see Equation 10).
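The computation of Equations 7-10 for a single sample can be sketched in NumPy as follows; the score function is left as a pluggable argument, and W_a stands for the trained weight matrix W of Equation 10, with an assumed shape of (K, 2K).

```python
# Sketch of Eqs. 7-10 (single sample); shapes follow the text above.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_over_hidden_states(H, h_t, score, W_a):
    # H: (T, K) all hidden states of the last LSTM layer
    # h_t: (K,) last hidden state, W_a: (K, 2K) trained weights
    s = score(h_t, H)                               # attention scores (T,)   (Eq. 7)
    alpha = softmax(s)                              # attention weights       (Eq. 8)
    c = alpha @ H                                   # context vector (K,)     (Eq. 9)
    a = np.tanh(W_a @ np.concatenate([c, h_t]))     # attention vector (K,)   (Eq. 10)
    return a, alpha
```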

Figure 2: A 2-layer LSTM network with attention mechanism over the inputs. Score represents any of the attention functions described in the subsections of 3.2. Dropouts are not displayed.

3.2.2 Attention mechanism over the inputs. Using the attention mechanism in this way (Figure 2) may allow us to learn which input features are more relevant in predicting the output. Thus, the neural network will learn to pay attention to the input features (X_T \in R^{T \times F}, where F is the number of features, also known as the input dimension) instead of the hidden states of the LSTM. Here, the attention score (s \in R^F) is instead calculated using the inputs and the last hidden state of the LSTM (see Equation 11). Furthermore, the context vector (c \in R^F) is also calculated using all inputs (see Equation 13). Finally, the attention vector formula (see Equation 10) only works when the context vector and the output of the last hidden state are concatenated, as these vectors have different dimensions.

\text{attention score} = s = \mathrm{score}(h_t, X_T)   (11)

\alpha_t = \mathrm{softmax}(\mathrm{score}(h_t, X_T)) = \frac{e^{s_t}}{\sum_{i}^{T} e^{s_i}}   (12)

c = \sum_{i}^{T} \alpha_i \cdot X_{T,i}   (13)
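Under one possible reading of Equations 11-13, with one attention weight per input feature, the same computation over the inputs could be sketched as follows; the reduction over the time axis and the shapes of the weights are assumptions, not a definitive reconstruction of our implementation.

```python
# Sketch of one reading of Eqs. 11-13 (single sample): per-feature weights.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_over_inputs(X, h_t, score, W_a):
    # X: (T, F) window of input features, h_t: (K,) last hidden state,
    # W_a: (K, F + K) trained weights (shape is an assumption)
    s = score(h_t, X)                               # per-feature scores (F,)  (Eq. 11)
    alpha = softmax(s)                              # weights over features    (Eq. 12)
    c = (alpha * X).sum(axis=0)                     # context vector (F,)      (Eq. 13)
    a = np.tanh(W_a @ np.concatenate([c, h_t]))     # concatenation only       (Eq. 10)
    return a, alpha
```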

The following subsections explain different types of attention and, therefore, different score functions. We were interested in trying different score functions to test whether performance differed across them. For all types, we denote U_T to represent either all hidden states H_T or all inputs X_T.

3.2.3 Additive attention. This attention version is based on the attention mechanism introduced by Bahdanau et al. [2]. The difference between our score function and theirs lies in the absence of the decoder hidden states, which we replace with h_t, and the addition of two learnt matrices. This version also adds the product of each matrix with the last hidden state or with all hidden states:

\mathrm{score}(h_t, U_T) = v \cdot \tanh(W_1 \cdot h_t + W_2 \cdot U_T)   (14)

where v, W_1 and W_2 are three learnt weight parameters of the attention model. In Equation 14, before using h_t we repeat it along the time-step dimension so that h_t and U_T have the same shape.
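A sketch of this score function, under assumed shapes for the learnt parameters (v of size D, W_1 of size D x K, and W_2 of size D x K, or D x F when U is the input window):

```python
# Sketch of the additive score (Eq. 14); parameter shapes are assumptions.
import numpy as np

def additive_score(h_t, U, v, W1, W2):
    # h_t: (K,), U: (T, K) or (T, F); h_t is broadcast over the T time steps
    return np.tanh(h_t @ W1.T + U @ W2.T) @ v      # scores of shape (T,)
```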

3.2.4 Self-multiplicative attention. This is a variation of the original multiplicative attention created by Luong et al. [17]. Concretely, it does not have decoder hidden states. Also, as it does not use the last hidden state, it is a type of self-attention. It is more efficient than additive attention because it has fewer operations to compute and fewer parameters to learn.

\mathrm{score}(U_T) = w \cdot U_T   (15)

where w is a learnt weight parameter of the attention mechanism model. In Equation 15 the signature of the score function is still the same, but as h_t is not used, it has been removed from the parameters.

3.2.5 Self-additive attention. This self-additive attention is similar to the attention mechanism formula by Bahdanau et al. [2] except for the absence of the hidden states of the decoder. In other words, this self-attention uses all hidden states but ignores the last hidden state used in additive attention. As in self-multiplicative attention, h_t has been removed from the score function (Equation 16) because it is not used.

\mathrm{score}(U_T) = v \cdot \tanh(W \cdot U_T)   (16)

where v and W are learnt weight parameters of the attention mechanism model.
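The two self-attention scores can be sketched together; w, v and W are hypothetical learnt parameters with assumed shapes, and U is either H_T or X_T:

```python
# Sketch of the self-attention scores (Eqs. 15 and 16); shapes are assumptions.
import numpy as np

def self_multiplicative_score(U, w):
    # U: (T, K) or (T, F), w: matching feature dimension
    return U @ w                           # scores of shape (T,)   (Eq. 15)

def self_additive_score(U, W, v):
    # W: (D, K) or (D, F), v: (D,)
    return np.tanh(U @ W.T) @ v            # scores of shape (T,)   (Eq. 16)
```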

3.2.6 Key value attention. Introduced by Daniluk et al. [8], this attention mechanism differs from the others in that it duplicates the hidden states into keys and values. While the keys are used to calculate M_T using Equation 18 (which is the score function if we also add the w of Equation 19), the values are used to compute the context vector and calculate the new hidden output. Moreover, k_t and v_t are analogous to h_t.

[K_T; V_T] = H_T   (17)

M_T = \tanh(W_1 [k_{t-T} \cdots k_{t-1}] + W_2 k_t)   (18)

\alpha_t = \mathrm{softmax}(w \cdot M_T)   (19)

c_t = [v_{t-T} \cdots v_{t-1}] \alpha_t   (20)

h^*_t = \tanh(W_3 c_t + W_4 v_t)   (21)

In Equations 18, 19 and 21, w, W_1, W_2, W_3 and W_4 are five learnt weight parameters of the attention mechanism model. However, we have used this type of attention only over the hidden states.
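A sketch of Equations 17-21 for a single sample; splitting the hidden state in half into keys and values, and the shapes of w and W_1 to W_4, are assumptions about how Equation 17 is realised rather than a definitive reconstruction.

```python
# Sketch of key-value attention (Eqs. 17-21); shapes are assumptions.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def key_value_attention(H, w, W1, W2, W3, W4):
    # H: (T, K) hidden states; keys/values assumed to be the two halves (T, K/2)
    keys, values = np.split(H, 2, axis=1)              # (Eq. 17)
    k_t, v_t = keys[-1], values[-1]                    # last time step
    M = np.tanh(keys @ W1.T + k_t @ W2.T)              # (T, D)      (Eq. 18)
    alpha = softmax(M @ w)                             # (T,)        (Eq. 19)
    c = alpha @ values                                 # (K/2,)      (Eq. 20)
    h_star = np.tanh(W3 @ c + W4 @ v_t)                # new output  (Eq. 21)
    return h_star, alpha
```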

3.3 ARIMA Models

An ARIMA model is a general statistical model for forecasting univariate time series. It has three parts, each defined by its own parameter, which we explain next.

The autoregressive (AR) part, or the order p of the model, indicates how many prior observations are used to predict the next output (see Equation 22). Using only this part is the same as a linear regression model with p variables.

X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t   (22)

The integrated (I) part, or the order d of the model, indicates how many prior observations are used to difference an observation from its d previous observations (see Equation 23). This part transforms the data to make it stationary by removing its trend and seasonality. In the following example d = 2, as only the past two observations are used:

y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}   (23)

The moving average (MA) part, or the order q of the model, indicates how many prior observations are used to calculate the average, which in fact becomes the next output (see Equation 24). The number of past observations is fixed, and therefore it represents the size of the window used.

X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}   (24)

All ε are white noise and all θ are the coefficients applied to the past error terms. We related the sequence length of the LSTM to the q parameter of the ARIMA model. The motivation for this is that the q parameter grows or shrinks in the same way the window size of the LSTM does. As we were interested only in the MA(q) part, we set the AR and I orders to 0. However, since with d = 0 the data are not transformed to be stationary by the I(d) part, we tested every feature individually for stationarity using the Dickey-Fuller test. As a result, we were confident (p-value < 0.05) that we could reject the null hypothesis and therefore assume that all the features were stationary.

The goal of using ARIMA models in our research was to narrow down the search space of the sequence length - also known as the time steps of the window, or just the window size - of the LSTM when using Bayesian hyperparameter search. In other words, we did not conduct research on ARIMA models themselves, but only used them as a complementary tool.
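As an illustration of how such an MA(q) grid could be run, the sketch below uses statsmodels; the import path assumes statsmodels >= 0.12, and the in-sample error measure is an assumption rather than the exact criterion we used.

```python
# Sketch of an MA(q) grid search over one feature of one flight trajectory.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def best_q_for_feature(series, q_values):
    errors = {}
    for q in q_values:
        try:
            fit = ARIMA(series, order=(0, 0, q)).fit()   # AR = 0, I = 0, MA = q
            # in-sample MSE of the fitted values (assumed error measure)
            errors[q] = np.mean((series - fit.fittedvalues) ** 2)
        except Exception:
            # failed fits (e.g. non-invertible / singular) are skipped,
            # in the spirit of the dashes in Table 2
            continue
    return min(errors, key=errors.get) if errors else None
```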

3.4 Bayesian Hyperparameter Optimisation

Bayesian hyperparameter optimisation is a type of hyperparameter optimisation that uses Bayesian statistics to choose the next hyperparameters to try. The difference between this approach and grid or randomised search is that the former makes an informed decision based on previous evaluations. Bayesian optimisation can be seen as an automated version of manually choosing the best hyperparameters at each step and, in addition, it typically yields better results.

Bayesian hyperparameter optimisation was not a subject of our research and was used as is. Our process for executing this optimisation can be summarised in three steps. First, we selected the best hyperparameters according to the maximum of a pre-chosen acquisition function and a surrogate function of our objective function. Secondly, we evaluated those hyperparameters with our objective function and saved the loss and the selected hyperparameters. Finally, we updated our surrogate function with the result of the objective function.

The objective function is the function that we want to optimise. It encapsulates the training process of our model which is performed using cross-validation.

The surrogate function is the function that we actually do optimise, as it can be optimised more quickly and simply than the objective function. Our surrogate function is the Tree-structured Parzen Estimators (TPE) [4].

The acquisition function - also known as the selection function - determines which hyperparameters are more likely to perform better in the next step. This function intrinsically incorporates the exploration versus exploitation trade-off. We selected expected improvement as the acquisition function, which is a common choice.

In summary, we first utilised ARIMA models to narrow down the window size hyperparameter space for the Bayesian hyperparameter optimisation. Then, we used the best hyperparameters found in our LSTM and in our LSTM with attention mechanism - either over the hidden states or over the inputs - with each score function.
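A sketch of this loop with Hyperopt (TPE as the surrogate); the search space mirrors the one described later in Section 5.2, and cross_val_loss is a hypothetical stand-in for the cross-validated training of our model.

```python
# Sketch of Bayesian hyperparameter optimisation with Hyperopt and TPE.
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

space = {
    'hidden_size': hp.choice('hidden_size', [2 ** i for i in range(2, 10)]),
    'dropout': hp.uniform('dropout', 0.0, 0.5),
    'merge_op': hp.choice('merge_op', ['concatenate', 'add']),
}

def cross_val_loss(params):
    # placeholder: in the thesis this trains the attention LSTM with
    # 3-fold cross-validation and returns the mean validation MSE
    return params['dropout']          # dummy value so the sketch runs end to end

def objective(params):
    return {'loss': cross_val_loss(params), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)
```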

4 EXPERIMENTAL SETUP

In this section, we describe the data we used in detail, present the exploratory data analysis (EDA) we performed, explain how we evaluated our models and, finally, describe the data preprocessing.

4.1 Data

We used a dataset from turbofan engines, generated by C-MAPSS and provided by NASA [19]. It contains data for four engines, with several time series per engine, as described in Table 1. As the data come from a simulated environment, noise has been purposefully added. All conditions of the engines are considered normal; that is, each engine has a particular initial wear that depends on the manufacturing. The four time series are distributed in train, test and RUL (in cycles) datasets, and each time series consists of a number of flight trajectories. A cycle is a series of processes that run in a loop in which a gas produces some work and is in a certain state, which varies depending on the amount of heat transmitted to it [10]. The train and test datasets report the three operational settings of the engine, the values from 21 sensors around the engine, the id or unit number of the engine in a given trajectory, and the cycle, which is the expression of time. For the train sets, the engines run until they fail, whereas for the test sets, the data cease at some unspecified number of cycles before the engine's failure. The RUL dataset gives how many cycles remain until each test engine's failure. The goal is to predict the RUL, or remaining cycles, for the test set and then compare the prediction to the RUL validation file.

The three operational settings [20] correspond to the altitude of the flight, the speed in terms of Mach number and the throttle resolver angle. The summaries [16] provided in Table 1 describe the conditions and fault modes of each dataset along with the number of trajectories in the train and test sets. The fourth and second datasets are the most complex. However, they include data from a larger number of trajectories.

Dataset | Train trajectories | Test trajectories | Conditions      | Fault modes
FD001   | 100                | 100               | ONE (Sea Level) | ONE (HPC Degradation)
FD002   | 260                | 259               | SIX             | ONE (HPC Degradation)
FD003   | 100                | 100               | ONE (Sea Level) | TWO (HPC Degradation, Fan Degradation)
FD004   | 248                | 249               | SIX             | TWO (HPC Degradation, Fan Degradation)

Table 1: Dataset summaries. The more conditions and fault modes, the more complex the dataset. HPC = high pressure compressor.

4.2 Exploratory Data Analysis

EDA is done to understand what kind of data is being worked with and how to proceed accordingly.

We carried out an analysis to determine whether each feature of the dataset was individually stationary (i.e., has neither trend nor seasonality), given that stationarity is required for the use of ARIMA models. To achieve this, we performed the Dickey-Fuller [9] statistical test and confirmed that all features were indeed stationary by rejecting its null hypothesis (p-value < 0.05): the data has a unit root and is non-stationary.
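A sketch of this per-feature check with the augmented Dickey-Fuller test from statsmodels; train_df is a hypothetical pandas DataFrame with one column per operational setting and sensor.

```python
# Sketch of the per-feature stationarity check with the Dickey-Fuller test.
from statsmodels.tsa.stattools import adfuller

def stationary_features(train_df, alpha=0.05):
    results = {}
    for col in train_df.columns:
        p_value = adfuller(train_df[col].dropna())[1]
        results[col] = p_value < alpha    # True: reject the unit-root null
    return results
```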

4.3 Evaluation

Our models were evaluated with the Root Mean Square Error (RMSE) between the real and the predicted RUL:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\widehat{\mathrm{RUL}}_i - \mathrm{RUL}_i)^2}

This evaluation was conducted individually per dataset so that we could compare our results fairly with previous papers. In other words, we calculated the RMSE of our predictions for each dataset using the testing set from that same dataset.

On the other hand, we used the Mean Square Error (MSE) as the loss function between the same predicted and ground truth RUL. Compared to the RMSE, the MSE makes no difference in terms of which model performs best, as it is just the square of the RMSE, but it makes the evolution of the loss over time clearer.

4.4 Preprocessing

Data preprocessing is crucial when it comes to using machine learning models, as it can have a significant effect on the results of the models [15].

4.4.1 Piece-wise RUL. First, we created the target variable RUL. For the training set, there is data for the trajectory until the point of engine failure. Therefore, we created a countdown (see the blue linear line in Figure 3) from the beginning of each trajectory until the cycle when the trajectory finishes. For the testing set, there are data for cycles only until some point in time before engine failure: the remaining cycles for each trajectory are in the RUL dataset (see Section 4.1).

Second, we set a threshold (see Figure 3) to limit the maximum RUL an engine could have in the training dataset. This was done to create a fair comparison with previous papers [13, 25]. The reason for this data transformation is that we assume that in a real-life scenario the engines start in a healthy state and remain so until a certain cycle, after which they start to degrade.

Figure 3: Target RUL transformation. The piece-wise line remains constant at the threshold we set (130) until it starts to degrade linearly.
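A sketch of this target construction with pandas; the column names 'id' and 'cycle' are assumptions about the layout of the training DataFrame.

```python
# Sketch of the piece-wise RUL target: countdown per trajectory, clipped at 130.
import pandas as pd

def add_piecewise_rul(train_df, threshold=130):
    max_cycle = train_df.groupby('id')['cycle'].transform('max')
    rul = max_cycle - train_df['cycle']           # linear countdown to failure
    train_df['RUL'] = rul.clip(upper=threshold)   # constant at 130, then linear
    return train_df
```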

4.4.2 Normalisation. We used normalisation, also known as feature scaling, to increase the performance of the model by scaling the data to a certain range. This helps the model to converge more quickly to a local or global minimum [18, 23].

We performed a twofold normalisation to scale the data to a range closer to [-1, 1], so that the data after the last tanh layer would resemble the testing data, and to create a fair environment in comparison with previous papers [13, 25].

There are many ways to normalise the training and testing datasets (e.g., Min-Max normalisation…), but for fair comparison we chose to normalise by z-score:

s_i' = \frac{s_i - \mu_i}{\sigma_i}   (25)

where s_i is the sensor data or the operational settings data (our features), \mu_i is the mean of the feature, \sigma_i is the standard deviation of the feature, and s_i' is the normalised sensor or operational setting value.

For each dataset of the four, we used the data of the training set per sensor, operational setting, and RUL and then applied z-score normalisation (see Equation 25) to each of these features individually. For features that had a standard deviation of 0, we used a standard deviation of 1 to avoid losing features. We did this to resemble what previous papers had done, in the interest of fair comparison.

To get a comparable RUL again, we had to invert the z-score normalisation of the output of the models - the RUL - by multiplying it by its standard deviation and adding its mean.

After all this preprocessing, we changed the training and testing datasets so that each sample had the shape of the number of features times the window size. In other words, we converted the data from 2D to 3D using a window approach to look back. For the testing dataset, however, we used the last T time steps of each flight, as we predicted RUL only for the last cycle. This data shape transformation is a necessary step when the first layer of a model consists of an LSTM, as in our case, as these models inherently need to work with time information.
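A sketch of the z-score step (with zero standard deviations replaced by one, as described above) and of the 2D-to-3D sliding-window reshape for a single trajectory; normalising the test set with the training statistics is an assumption.

```python
# Sketch of the normalisation (Eq. 25) and the sliding-window reshape.
import numpy as np

def zscore(train, test):
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)      # keep constant features
    return (train - mu) / sigma, (test - mu) / sigma

def to_windows(features, targets, window_size):
    # features: (n_cycles, n_features) of one trajectory -> X: (samples, T, F);
    # for the test set only the last window of each trajectory is used
    X, y = [], []
    for end in range(window_size, len(features) + 1):
        X.append(features[end - window_size:end])
        y.append(targets[end - 1])                # RUL at the window's last cycle
    return np.array(X), np.array(y)
```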

4.5 Implementation Details

For all experiments, we used a seed (42) so that all processes that have random behaviour could be reproduced. Furthermore, we programmed with the Keras [6] framework to build the neural networks and the attention mechanisms.

Regarding the Bayesian hyperparameter optimisation, we used the Python library Hyperopt [3] in conjunction with the NoSQL database MongoDB to run the optimisation in parallel and store the results. All experiments were executed in the AWS cloud environment.

5 RESULTS

This section aims to answer the aforementioned research questions and to comment on the results of the ARIMA models, the Bayesian hyperparameter optimisation and the attention models.

5.1 ARIMA Models

To find the best q of the ARIMA model, we ran a grid search of q over the first flight of the first dataset only (see Table 2), due to computational constraints.

The maximum q was restricted by the shortest flight in our dataset. This restriction is based on the training set only and is enforced so that features from previous cycles are not used to predict the RUL of the next flight (because of the adjacency of the data). For datasets 1, 2 and 4 the upper bound of the sequence length is 128 time steps, but for dataset 3 it is 145. Because of time constraints, we only ran ARIMA on the first dataset; therefore, the upper bound is 128 time steps.

F0  F1  F2  F3  F4  F5  F6  F7
27  21  -   -   29  23  23  -
F8  F9  F10 F11 F12 F13 F14 F15
-   19  23  27  -   36  9   20
F16 F17 F18 F19 F20 F21 F22 F23
21  35  -   31  -   -   24  19

Table 2: ARIMA q parameters (window sizes) with the lowest error for each feature F. The first three features are the operational settings and the rest are the sensors. Dashes mean there was no parameter q that produced a valid result, that is, a fit without invertibility or singular matrix errors.

As can be seen in Table 2, larger values of the window size (q) tended to have lower errors than narrower windows (see Figure 4). However, as some high values of q produced non-invertible MA coefficients, we did not obtain an upper bound for the search.

Figure 4: Mean over all features of the error between the predicted and real values for the different values of the q parameter.

Since we needed to set a threshold for the lower bound of the window size for the hyperparameter optimisation, we chose 20, as it seemed to be a good lower bound because of its low error (see Figure 4).

5.2 Bayesian Hyperparameter Optimisation

The Bayesian hyperparameter optimisation was done on the first dataset with window size 50, using the best attention mechanism found in our initial tests: key value attention. The window size of 50 was also chosen in previous papers. Due to time constraints, we used the resulting optimal hyperparameters for all the other models and datasets.

We defined the search space for each hyperparameter [hidden size ∈ {2^2, 2^3, ..., 2^9}, dropout ∼ U(0, 0.5), and both concatenation and addition operations to merge the context vector and the last hidden state] and the objective function. To make our model more robust, we used cross-validation with three folds.

Hyperparameter | Worst configuration | Best configuration
H 1st LSTM     | 4           | 256
H 2nd LSTM     | 512         | 64
H 1st FFN      | 32          | 4
H 2nd FFN      | 4           | 256
H Att.         | 4           | 8
D 1st LSTM     | 0.341       | 0.165
D 2nd LSTM     | 0.314       | 0.002
D 1st FFN      | 0.130       | 0.014
D 2nd FFN      | 0.292       | 0.367
Merge op.      | concatenate | add
Model          | RMSE        | RMSE
KeyValue       | 41.72       | 2.53

Table 3: Comparison of the key value attention model loss with the worst and best configuration of hyperparameters for the first dataset (FD001) and window size 50. H = hidden size, D = dropout probability, Att. = attention mechanism and Merge op. = the merging operation used to calculate the attention vector.

The Bayesian optimisation was executed 30 times to ensure we had enough observations, following the Central Limit Theorem (CLT). For all experiments, we ran our model for 50 epochs using the Adam optimiser [14] with an initial learning rate of 0.01. We also used ReduceLROnPlateau from Chollet [5] so that, if the validation loss did not decrease for 8 epochs, the learning rate was reduced by a factor of 0.1, down to a minimum learning rate of 0.0001.
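A sketch of this training setup in Keras; the validation split used to monitor the validation loss is an assumption, as are the names of the data arguments.

```python
# Sketch of the training setup: Adam (lr = 0.01) with ReduceLROnPlateau.
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau

def train(model, X_train, y_train):
    # MSE loss (Section 4.3) with the Adam optimiser and an initial lr of 0.01
    model.compile(loss='mse', optimizer=Adam(lr=0.01))
    # drop the learning rate by a factor of 0.1 if val_loss has not improved
    # for 8 epochs, down to a minimum of 0.0001
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                  patience=8, min_lr=0.0001)
    # the 10% validation split used for monitoring is an assumption
    return model.fit(X_train, y_train, epochs=50, validation_split=0.1,
                     callbacks=[reduce_lr])
```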

5.3 Advantages and Disadvantages of Attention LSTM

In this section we answer the first research question: What are the pros and cons of LSTM networks with attention mechanism to predict failure within a time window?

In general, the attention mechanism can be seen as an extra layer that sits on top of an LSTM. Hence, we end up with slightly more complex models that take longer to train, but we may get meaningful insights from the attention weights. Nevertheless, attention weights have to be used as guidance, as described by Jain and Wallace [12]. The effect of the attention weights can be seen as a process of filtering unnecessary information and increasing the feature importance of meaningful information. In the case of attention over the inputs, we can learn which input features the model learnt to pay attention to in order to predict RUL more accurately (see Figure 5b).

Notwithstanding what attention shows in Figure 5b, nothing can be affirmed regarding a relationship between high attention weights on some features and their variance, since features are highlighted independently of whether their variance is high or low.

Certainly, manufacturers could put more effort into monitoring those parts the model pays more attention to, which are the parts causing the failures.

5.4 Attention LSTM Performance

In this section we answer the second research question: What performance level can be achieved for RUL prediction for turbofan engines within a time window using attention mechanisms with LSTM networks compared to the analogous models (baselines) without attention?

Figure 5: Attention weights of both architectures: (a) over the hidden outputs of the LSTM (FD003) and (b) over the inputs (FD001).

As can be seen in Table 4, attention models outperform our LSTM baselines on all datasets, with the simplest datasets yielding the greatest differences. This may relate to the idea that a simple dataset has fewer intrinsic meanings per input. This is also what happens in natural language processing, where attention succeeds because a word has one meaning most of the time rather than many, which is also true in our case for the sensors and operational settings.

Regarding the performance of all models in the same table, key value attention performs best overall, in line with the comparison in the original paper by Daniluk et al. [8], but this may also be because the hyperparameter optimisation was executed using key value attention only.

In Table 5 we compare our results from Table 4 with the reported results of previous works. Here we see that our attention models with the optimal hyperparameters from Table 3 outperform all previous models on the datasets 1 and 3, the simplest ones. However, a decision as to which attention model to choose would have to be made in production, since key value attention is not the absolute winner in all datasets.

Model     | Attention over | FD001 RMSE | FD002 RMSE | FD003 RMSE | FD004 RMSE
LSTM      | -              | 16.77      | 31.76      | 20.68      | 32.16
Additive  | H              | 15.37      | 29.42      | 14.70      | 33.71
Additive  | I              | 15.09      | 32.50      | 14.62      | 34.24
Self-Mul. | H              | 15.11      | 32.98      | 15.12      | 34.39
Self-Mul. | I              | 15.05      | 30.71      | 14.60      | 34.20
Self-Add. | H              | 15.22      | 31.67      | 14.82      | 34.14
Self-Add. | I              | 15.29      | 32.32      | 14.21      | 33.89
KeyValue  | H              | 14.96      | 30.77      | 14.86      | 31.43

Table 4: Comparison of the testing-set loss (RMSE) of all models with the optimal hyperparameters. H = attention over the hidden outputs of the LSTM, I = attention over the inputs.

Model                                     | FD001 RMSE    | FD002 RMSE   | FD003 RMSE    | FD004 RMSE
MLP, Babu et al. [1]                      | 37.56         | 80.03        | 37.39         | 77.37
SVR, Babu et al. [1]                      | 20.96         | 42.00        | 21.05         | 45.35
RVR, Babu et al. [1]                      | 23.80         | 31.30        | 22.37         | 34.34
CNN, Babu et al. [1]                      | 18.45         | 30.29        | 19.82         | 29.16
LSTM, Zheng et al. [25]                   | 16.14         | 24.49        | 16.18         | 28.17
BD-RNN, Zhang et al. [24]                 | 20.04         | -            | -             | -
BD-LSTM, Zhang et al. [24]                | 15.42         | -            | -             | -
CNN-LSTM, Jayasinghe et al. [13]          | 23.57         | 20.45        | 21.17         | 21.03
Attention LSTM (score, attention over)    | 14.86 (KV, H) | 29.42 (A, H) | 14.21 (SA, I) | 31.43 (KV, H)

Table 5: Final results. Dashes stand for unreported results. MLP = Multilayer Perceptron, SVR = Support Vector Regression, RVR = Relevant Vector Regression, CNN = Convolutional Neural Network, BD-RNN = Bidirectional Recurrent Neural Network, BD-LSTM = Bidirectional Long Short-Term Memory, CNN-LSTM = Convolutional Long Short-Term Memory, KV = Key Value attention, A = Additive attention, SA = Self-Additive attention, H = attention over the hidden states of the LSTM, I = attention over the inputs.

To test whether the differences between the results of the LSTM and the results of each attention model are statistically significant, we performed an independent-samples t-test. This statistical test determines whether there is a significant difference between the means of two unrelated groups. The null hypothesis is therefore: the population means of the two independent samples are equal.
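A sketch of this test with SciPy, applied to two hypothetical arrays of per-run RMSE scores:

```python
# Sketch of the independent-samples t-test between two sets of RMSE scores.
from scipy.stats import ttest_ind

def significantly_different(rmse_lstm, rmse_attention, alpha=0.05):
    t_stat, p_value = ttest_ind(rmse_lstm, rmse_attention)
    return p_value < alpha    # True: reject the equal-means null hypothesis
```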

For datasets 1 and 3, we strongly rejected the null hypothesis and are therefore confident that the attention mechanism outperforms (p-value < 0.05) our baseline, the LSTM. For dataset 2, additive attention over the hidden outputs is the only model that significantly outperforms the LSTM (p-value = 0.022031). Finally, for dataset 4, key value attention, which is the only model that gives a lower RMSE in Table 4, does not significantly outperform the LSTM, as the null hypothesis cannot be rejected.

All things considered, the attention mechanism seems to be a valuable approach to consider in any time series problem.

6 CONCLUSION

In this paper, we have created an LSTM model as our baseline and developed two models with different attention mechanisms on top of it. Our first model pays attention to the hidden outputs of the LSTM; our second model applies attention over the inputs. For both architectures, we use different attention score functions. In the end, we see that LSTM models using attention mechanisms outperform the same LSTM models without attention mechanisms, at least for cases in which the input features are simpler. Of particular note, we obtain state-of-the-art results for datasets FD001 and FD003.

Further work could test our models on other publicly available datasets. Moreover, attention mechanisms could be applied to other baseline models.

REFERENCES

[1] Giduthuri Sateesh Babu, Peilin Zhao, and Xiao-Li Li. 2016. Deep convolutional neural network based regression approach for estimation of remaining useful life. In International Conference on Database Systems for Advanced Applications. Springer, 214–228.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[3] James Bergstra, Dan Yamins, and David D Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference. Citeseer, 13–20.

[4] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.

[5] François Chollet. 2019. Keras documentation. (2019). https://keras.io/callbacks/#reducelronplateau

[6] François Chollet and others. 2015. Keras. https://keras.io. (2015).

[7] Clia. 2019. Industrial maintenance: history and evolution. (May 2019). https://www.mobility-work.com/blog/history-of-maintenance-in-industry

[8] Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. arXiv preprint arXiv:1702.04521 (2017).

[9] David A Dickey and Wayne A Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 366a (1979), 427–431.

[10] Grc.nasa.gov. 2019. Turbine Engine Thermodynamic Cycle - Brayton Cycle. (2019). https://www.grc.nasa.gov/www/k-12/airplane/brayton.html

[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

[12] Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. CoRR abs/1902.10186 (2019). http://arxiv.org/abs/1902.10186

[13] Lahiru Jayasinghe, Tharaka Samarasinghe, Chau Yuen, and Shuzhi Sam Ge. 2018. Temporal Convolutional Memory Networks for Remaining Useful Life Estimation of Industrial Machinery. arXiv preprint arXiv:1810.05644 (2018).

[14] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[15] R Lacroix, F Salehi, XZ Yang, and KM Wade. 1997. Effects of data preprocessing on the performance of artificial neural networks for dairy yield prediction and cow culling classification. Transactions of the ASAE 40, 3 (1997), 839–846.

[16] LahiruJayasinghe. 2019. Deep learning approach for estimation of Remaining Useful Life (RUL) of an engine. (2019). https://github.com/LahiruJayasinghe/RUL-Net

[17] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).

[18] Andrew Ng. 2019. Gradient Descent in Practice I - Feature Scaling. (2019). https://www.coursera.org/lecture/machine-learning/gradient-descent-in-practice-i-feature-scaling-xx3Da

[19] A Saxena and K Goebel. 2008. Turbofan engine degradation simulation data set. NASA Ames Prognostics Data Repository (2008).

[20] Abhinav Saxena, Kai Goebel, Don Simon, and Neil Eklund. 2008. Damage propagation modeling for aircraft engine run-to-failure simulation. In 2008 International Conference on Prognostics and Health Management. IEEE, 1–9.

[21] Gian Antonio Susto, Andrea Schirru, Simone Pampuri, Seán McLoone, and Alessandro Beghi. 2014. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics 11, 3 (2014), 812–820.

[22] Lilian Weng. 2018. Attention? Attention! (Jun 2018). https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

[23] Wikipedia contributors. 2019. Feature scaling — Wikipedia, The Free Encyclopedia. (2019). https://en.wikipedia.org/w/index.php?title=Feature_scaling&oldid=899790585 [Online; accessed 5-June-2019].

[24] Jianjing Zhang, Peng Wang, Ruqiang Yan, and Robert X Gao. 2018. Long short-term memory for machine remaining life prediction. Journal of Manufacturing Systems 48 (2018), 78–86.

[25] Shuai Zheng, Kosta Ristovski, Ahmed Farahat, and Chetan Gupta. 2017. Long short-term memory network for remaining useful life estimation. In 2017 IEEE International Conference on Prognostics and Health Management (ICPHM). IEEE, 88–95.
