
MSc Artificial Intelligence
Master Thesis

Explaining predictions of black box models for multivariate time series

by

Fré Haver

10185453

August 8, 2019

36 ECTS January 2019 - August 2019

Supervisors:

Prof. Dr. H. Haned

S. Vreugdenhil BBA

Assessor:

Prof. Dr. M. de Rijke


Abstract

The expanding field of interpretable artificial intelligence has mostly been focused on image classification tasks. In real-world problems, the task at hand is more often based on tabular data and frequently involves regression rather than classification. One specific application of tabular data is the forecasting of time series, which introduces an extra challenge as there is temporal dependency between the different time steps.

Current methods are not suitable for temporal data as they assume independence between the features, an assumption that does not hold for time series. This work aims at filling this gap by introducing a new method: Local Explainer for Time Series (LETS), a post-hoc method that is specifically designed to work with multivariate time series.

LETS builds on the framework of LIME, but improves it by changing the way the neighborhood for the local model is constructed. Instead of individually perturbing features, LETS samples real instances from the training set to create a neighborhood. These instances are not sampled randomly but are selected based on how similar the motif of the given instance is to that of the training instance. To find these motifs, the instances are represented as SAX words and aligned using the Needleman-Wunsch algorithm.

We find that the neighborhoods created by LETS are in most cases more faithful to the model than the neighborhoods created by LIME or by randomly selecting training instances. The explanations created by LETS are in most cases more faithful to the model than those of LIME, but there is no clear difference between LETS and selecting real instances randomly. LETS is shown to be robust in general, although there can be instances for which this does not hold.

The results of the user study to determine how the explanations help users to trust and understand the model do not give a definitive answer, as the results found are not significantly different between the group that has seen explanations and the group that has not seen explanations.


Acknowledgments

First and foremost, I would like to thank my two supervisors, Hinda Haned and Simon Vreugdenhil, for their continuous guidance, support and enthusiasm. I have been very lucky with the amount of time and effort they have invested and this thesis would not have been possible without them. I would like to thank Hinda Haned for her constructive feedback that helped me to keep improving my work, and for always reassuring me that I was doing a good job. This was exactly what I needed to stay motivated and keep improving my work. I am also grateful to Simon Vreugdenhil for his constant enthusiasm about my work and for being a great sparring partner to bounce ideas off.

I would also like to thank my parents, sisters, and friends, who have been there throughout the whole course of my studies. They were able to lift my spirits when things did not go as planned and celebrated the successes with me. Both have made sure that my time as a student is an unforgettable period of my life.

Lastly, I would like to express my great appreciation to Thomas Somers. His never-ending support and encouragement have helped me to focus on my study and complete my Master’s.


Contents

1 Introduction
  1.1 On the importance of interpretability for industry practitioners
  1.2 On the importance of interpretability in machine learning
  1.3 Interpretability methods and current gaps

2 Related work
  2.1 Desiderata for black-box model explanations
  2.2 Scope of this work
    2.2.1 Additive methods
    2.2.2 Prototype methods
  2.3 Interpretability of forecasting models

3 Preliminaries
  3.1 Local Interpretable Model-agnostic Explanations
  3.2 Sequence representation and alignment
    3.2.1 Piecewise Aggregate Approximation
    3.2.2 Symbolic representation
  3.3 The Needleman-Wunsch sequence alignment algorithm

4 Method
  4.1 Neighborhood sampling and time series representation
  4.2 LETS overview
  4.3 Setting the parameters

5 Experimental design
  5.1 Data
  5.2 Neural models
    5.2.1 Forecasting model
    5.2.2 Autoencoder
    5.2.3 Evaluation metrics for the neural models
  5.3 Evaluation of LETS
    5.3.1 Quantitative evaluation
    5.3.2 Qualitative evaluation

6 Results
  6.1 Neighborhoods evaluation
  6.2 Evaluating the explanations
    6.2.1 Quantitative evaluation
    6.2.2 Qualitative evaluation

7 Conclusions & discussion

A Advantages accurate forecasting for Gall & Gall
B LIME example


1 | Introduction

Forecasting methods are used to make predictions in a range of tasks, such as the number of occupied beds in an acute hospital (Jones, Joy, and Pearson, 2002), tourism demand (Song and Li, 2008), electricity load (Hsu and C.-Y. Chen, 2003), and retail sales (Alon, Qi, and Sadowski, 2001). In the aforementioned tasks, accurate forecasts of time series are important for different reasons, but they have in common that the information of the forecasts gives the ability to anticipate the future. The growing interest in Machine Learning (ML) has also increased the use of ML models in forecasting (Song and Li (2008), Hsu and C.-Y. Chen (2003), Alon, Qi, and Sadowski (2001)). These models have two main advantages over earlier developed (statistical) models: i) they can model highly non-linear structures in the data, and ii) the explicit relationship between the input and output variables does not have to be specified (G. Zhang, Patuwo, and Hu, 1998).

Although one of the strengths of ML models lies in their ability to model complex non-linear dependencies, their complex design makes them incomprehensible to humans, so they can be perceived as “black-boxes”. Black-box models are especially problematic when the impact on individuals is important; for example, in loan applications, it is important to determine whether a system is producing unfair outcomes, and therefore, discriminating against certain demographic groups (Doshi-Velez and Kim, 2018).

With the increased application of black-box models also comes the need for methods that generate comprehensible explanations. Even though the interest in explainable artificial intelligence has grown substantially in the last five years (Figure 1.1), the urge for explaining complex models is not new. Andrews, Diederich, and Tickle already state in 1995 that explanation capability should become an integrated part of trained Neural Networks (NNs) and that it should even be mandatory for models that are used in safety-critical applications, such as airlines and power stations.

1.1 On the importance of interpretability for industry practitioners

The urge for explaining NNs and other ML models is not only present in safety-critical applications, but can also originate from a business need. This work has been supported by a company, Gall & Gall, whose motivation for offering this internship was two-fold: first, to develop more accurate forecasts for its online business, and second, to make sure that the developed models would be interpretable by their end-users. The advantages that accurate forecasting has for Gall & Gall are discussed more extensively in Appendix A. The current method used by Gall & Gall is an ARIMA model (Box and Jenkins, 1976) for the weekly number of transactions. These weekly predictions are then distributed over the different days of the week by using the distribution of previous weeks that have the same promotional themes.

Figure 1.1: Indexed number of searches on the topic of “Explainable Artificial Intelligence” worldwide on Google in the last 5 years. Source: https://trends.google.com

Gall & Gall is interested in the power that ML can bring to increase the accuracy of the daily order forecasts. However, the current model was chosen because of its simplicity and its interpretability by end-users. Therefore, the motivation for this work was to develop a method that can give interpretable explanations about ML methods based on time series data, while ensuring higher performance than the currently implemented method.

1.2 On the importance of interpretability in machine learning

In this work the terms interpretability and explanation are often used. Interpretability is here defined as the extent to which a user is able to obtain true insight into how outcomes are obtained (Ras, Gerven, and Haselager, 2018), and an explanation is one mechanism to help reach interpretability.

Besides safety-critical ML applications and needs from business applications, there are other reasons that a model should be interpretable. Samek, Wiegand, and Müller, 2017 list four arguments in favor of interpretability in artificial intelligence: 1) Verification of the system, which is important because previous research has shown that models can show unwanted behavior when deployed (Wolf, K. Miller, and Grodzinsky, 2017) and be biased (Lowry and Macpherson (1988), Bolukbasi et al. (2016)) because the input data for the model was already prejudiced; 2) Improvement of the system, which can be done by obtaining insights into the strengths and vulnerabilities of the model (Guo et al., 2018); 3) Learning from the system, as ML algorithms are efficient in learning patterns in data, it would be useful to humans to access these learned patterns; 4) Compliance to legislation, because current regulations in the European Union give people the right to an explanation of algorithmic decisions (Goodman and Flaxman, 2017). Another reason to build explanations is to increase the understanding and trust of a user in the model. This increase in trust is important, as humans lose confidence in models more quickly than in human forecasters, even when they have seen that the model outperforms a human forecaster (Dietvorst, Simmons, and Massey, 2015).


1.3 Interpretability methods and current gaps

Several methods have been developed to obtain more insights into the inner workings of ML models. These can be grouped in two categories: methods that rely on changing the design of a model, so that interpretability becomes one of its intrinsic properties (Alvarez-Melis and Jaakkola (2018b), Guo et al. (2018)), and methods that generate post-hoc explanations (Tan et al. (2018), Tolomei et al. (2017)). An advantage of this post-hoc approach is that it does not impact performance of the model (Došilović, Brčić, and Hlupić, 2018). The method presented in this work focuses on the second approach, which means that the original trained model is left untouched and a method is developed to generate explanations that give insight in the behavior of the model. Current methods for creating post-hoc explanations are focused on generating explanations on the global and the local levels. Global methods (Tan et al., 2018) are aimed at giving humans an understanding of the whole logic of a model, while local methods (Ribeiro, Singh, and Guestrin (2016), Lundberg and Lee (2017), Zafar and Khan (2019)) are aimed at explaining a particular outcome of the model (Doshi-Velez and Kim, 2017).

Most of the current local methods for generating post-hoc explanations have focused on image (Bach et al. (2015), Koh and Liang (2017)) and text classification (Arras et al. (2017)). To the best of our knowledge, only one work has focused on time series data (Munir et al., 2019); however, this method is still specific to classification tasks. The examples of forecasting tasks at the start of this section show that there are many applications in which models on time series are used. The fact that scientific research about interpretability in ML is mainly about image classification and hardly about temporal tabular data points to a gap in the existing state-of-the-art methods.

We can identify two additional significant limitations of current post-hoc local explanation methods: i) they are designed for classification tasks, and their application to a regression task would require discretizing the target variable, which might not always be relevant, and ii) they do not account for dependencies between different features, since features are perturbed individually (Ribeiro, Singh, and Guestrin, 2016, Lundberg and Lee, 2017). However, for temporal data it is inevitable that the different time steps are dependent.

Consider the example of a time series that has the following expression as its underlying structure: y_t = 0.7 * y_{t-1} + 0.3 * y_{t-2} + ε, where ε ~ N(0, 1). For y_0 = 50 and y_1 = 52, the graph is shown in Figure 1.2. If we do not know the underlying structure of the time series and we want to see how time step 52 (indicated with the red dot and striped line) has influenced the prediction of time step 54 (y_54 = 0.7 * y_53 + 0.3 * y_52 + ε), taking only the feature time step 52 into account does not make sense, as changing the value of y_52 would also directly change the value of y_53, since y_53 = 0.7 * y_52 + 0.3 * y_51 + ε. As such, the effect of increasing y_52 by one point on y_54 is not just 0.3, but more, as y_53 will also increase.

In this work, we introduce our approach, Local Explainer for Time Series (LETS), to generate local explanations for the predictions of a multivariate time series model, applied to a forecasting task. The LETS method builds on the framework of Local Interpretable Model-agnostic Explanations (LIME), but we modify the underlying method to account for the temporal nature of multivariate time series data. To position our work, we formulate our main research question as follows:


Figure 1.2: Example time series, where the data has been generated using y_t = 0.7 * y_{t-1} + 0.3 * y_{t-2} + ε and simulated for 100 time steps. The red dot and striped line are set at time step 52.

Main Research Question

Is it possible to generate faithful local explanations for models trained on multivariate time series data?

In this work, we develop LETS: a robust and faithful method that generates explanations for black-box models trained on multivariate time series data. The downstream goal for this interpretable ML system is to provide better forecasts for the daily number of orders of the online store of Gall & Gall, while providing explanations to the users about specific model outcomes. Interpretability is the right tool to achieve this goal because the improvement of the forecasts is achieved using black-box models, and explanations ensure that the model is trusted and used by its intended audience.

This thesis is structured as follows: Section 2 introduces the related work and Section 3 introduces the preliminaries required to introduce the LETS method. Our method LETS is described in Section 4. Section 5 describes the experimental setup, and Section 6 the results that have been gathered from quantitative and qualitative experiments. Finally, in Section 7, we present our conclusion and a discussion of future work.


2 | Related work

This section introduces methods that are related to our LETS method.

2.1 Desiderata for black-box model explanations

Explanations are not just important in the field of ML, but are, according to Lombrozo (2006), “central to our sense of understanding, and the currency in which we exchange beliefs”. For humans, an explanation is the answer to a why-question (T. Miller, 2018). In ML, this translates to a method that can answer ‘why’ questions, and if a method can answer more questions, this increases its explanatory power (Ras, Gerven, and Haselager, 2018). An explanation in ML can consist of a collection of features in the interpretable domain that have contributed to produce a decision (Montavon, Samek, and Müller, 2018).

Note that the definition of an explanation by Montavon, Samek, and Müller (2018) already includes one of the desiderata for an explanation, namely that it has to be interpretable. Interpretability can be defined as the ability of a system to explain its reasoning in understandable terms to a human (Doshi-Velez and Kim (2017), Doshi-Velez and Kim (2018)) and determines to what extent a user is able to obtain true insight into how outcomes are obtained (Ras, Gerven, and Haselager, 2018).

These desiderata are that the explanations should be: i) faithful to the underlying model (Ras, Gerven, and Haselager, 2018, Ribeiro, Singh, and Guestrin, 2016, Alvarez-Melis and Jaakkola, 2018b), which means that the explanation method should agree with the input-output mapping of the model, and ii) robust (Alvarez-Melis and Jaakkola, 2018b, Alvarez-Melis and Jaakkola, 2018a), which means that the explanations are consistent for similar examples and small changes to the input will not change its explanation.

2.2 Scope of this work

In the existing literature on explainable ML, there are different taxonomies to summarize current methods for generating explanations. J. Chen et al. (2018) group local post-hoc methods based on four properties: 1) if the method needs training, 2) how efficient it is, 3) if it is an additive method, and 4) if the method is model-agnostic. However, Guo et al. (2018) classify methods based on the approach of generating explanations, which they divide into: occlusion, gradient, perturbation, prototype based, and intrinsic methods.

As the focus of this work is on generating local post-hoc explanations, the global and intrinsic methods are disregarded. In addition, because the method is an additive method based on real samples, the existing local additive methods are discussed as well as the methods based on prototypes.


2.2.1 Additive methods

Lundberg and Lee (2017) state that additive feature attribution methods have an explanation model that is a linear function of binary variables:

g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i    (2.1)

where z' ∈ {0, 1}^M, M is the number of simplified input features and φ_i ∈ R.

Here the definition of Lundberg and Lee (2017) is stretched in that the variables do not have to be binary. This means that all the methods that were selected as additive methods by Lundberg and Lee are still additive methods under the new definition. Only the additive methods mentioned by Lundberg and Lee (2017) will be discussed next.

LRP Layer-wise Relevance Propagation (LRP) (Bach et al., 2015) interprets the predictions of deep networks by decomposing the output using redistribution rules to assign relevance scores to each input variable. Although the original work focuses on feed-forward classification tasks, the work has been extended to work on recurrent neural networks for sentiment analysis (Arras et al., 2017). However, sentiment analysis is still a classification task and the method depends on the output layer that is specific to classification tasks.

LIME Local Interpretable Model-agnostic explanations (LIME) explains the prediction of any classifier in an interpretable and faithful manner (Ribeiro, Singh, and Guestrin, 2016). For each output to explain, LIME creates a neighborhood by randomly perturbing the input instance’s features and then trains a linear model on this neighborhood, where the neighborhood instances are weighted according to how close they are to the instance to explain.

DeepLIFT DeepLIFT (Shrikumar et al., 2016) is an extension of LRP that introduces a reference instance. Instead of explaining the output of a model directly, the output of the model is explained as the difference between the output of the model and the output of some reference instance. DeepLIFT is equal to LRP if the reference activations of all neurons are set to zero (Lundberg and Lee, 2017).

SHAP SHapley Additive exPlanations (SHAP) returns values per feature that indicate how much change is expected in the model prediction when conditioning on that feature (Lundberg and Lee, 2017). The SHAP values of the different features explain how to get from a base value (where no features are known) to the current output.

Both LRP and DeepLIFT are categorized as gradient methods by Guo et al. (2018), while both LIME and SHAP are categorized as perturbing methods.

Criticism of existing additive models Even though LIME is a popular method, it has also been widely criticized. For instance, the choices of the loss function, the weighting kernel and the regularization term are made heuristically, and local accuracy is violated, which can lead to unintuitive behavior (Lundberg and Lee, 2017). In addition, the neighborhood of LIME is created by randomly perturbing the features, which can lead to different explanations for the same output and input instance (Zafar and Khan, 2019). Kindermans et al. (2017) show that the explanations of LRP are sensitive to input variance, in which the input is shifted.

2.2.2 Prototype methods

Only two of the developed prototype methods will be discussed here. For a more extensive overview of prototype methods we refer the reader to the related work section of Kim, Khanna, and Koyejo (2016).

Bien, Tibshirani, et al. (2011) develop a method for selecting prototypes for classification tasks. They create a neighborhood that consists of sub-neighborhoods. Each of these sub-neighborhoods aims to maximize the number of training points of one class it covers and to minimize the number of training points of other classes. The center of each of these sub-neighborhoods is then a prototype.

Koh and Liang (2017) use influence functions that trace the prediction of a model through the learning algorithm to its training data. They identify which training points are most responsible for a given prediction. By indicating the training points that are most responsible for a given prediction, influence functions give insight into how the model relies on training data.

Criticism of existing prototype methods Kim, Khanna, and Koyejo (2016) state that only showing training points that are positively contributing to a given prediction (prototypes) is not enough. A method that is based on showing examples from the training set should also include examples (criticisms) that explain what is not captured by the prototypes.

2.3 Interpretability of forecasting models

This section gives an overview of some models that are used in forecasting tasks. This does not only include models that have been specifically designed for time series but also models that have been developed for regression tasks in general. In Table 2.1 an overview is given of the different forecasting models with an indication of their interpretability. A more extensive overview of the methods is given below, where it is discussed how the label for the interpretability has been established.

Model                         Interpretable
OLS                           Yes
ARIMA                         Yes
Feedforward Neural Networks   No
Recurrent Neural Networks     No
Random Forest                 No
Hybrid Models                 No

Table 2.1: Overview of forecasting models and their interpretability.

OLS Ordinary Least Squares (OLS) was invented around 1800 and expresses the dependent variable as a linear combination of explanatory variables. The coefficients of the explanatory variables express the influence of each variable on the dependent variable. In OLS it is clear on a global level how each variable contributes to the output.

ARIMA For time series, a popular model is the Auto-Regressive Integrated Moving Average (ARIMA) model (Box and Jenkins, 1976). It expresses the dependent variable as a linear combination of previous observations, previous forecast errors, and explanatory variables. Once the model has been fitted to the data, the coefficients of the lagged observations, errors and explanatory variables are known, which makes the model globally interpretable.

Feedforward Neural Networks Any NN that does not have recurrent connections in its structure is a feedforward NN. One of the simplest forms can be found in a MultiLayer Perceptron (MLP). In 1998 this was the most used model for time series forecasting (G. Zhang, Patuwo, and Hu, 1998). The characteristic of these networks is that they have a nonlinear structure and are self-adaptive. There is no way to analyze the relationship between the input and the outputs (G. Zhang, Patuwo, and Hu, 1998), which makes them non-interpretable.

Recurrent Neural Networks Recurrent Neural Networks (RNNs) have recurrent connections that feed the hidden layer back into the network at the next time step. The model takes the input sequence one step at a time and remembers information about the previous time steps. One of the flaws of RNNs is that they suffer from exploding and vanishing gradients on long time dependencies (Bengio, Simard, Frasconi, et al., 1994). To overcome this problem, Long Short Term Memory (LSTM) models (Hochreiter and Schmidhuber, 1997) have additional gates to guide the backpropagation process. As the structures of these models are even more complex than feedforward NNs, RNNs are also not interpretable.

Random Forest Random Forest (RF) is a bagging method in which different decision trees are combined. In regression tasks, the average of all decision trees determines the output of the model (Breiman, 2001). The mechanism of the RF is not interpretable (Breiman, 2001).

Hybrid models Hybrid models are a combination of different models. In forecasting, hybrid models are formed by combining statistical methods with NNs. A few examples are given by the following combinations: univariate ARIMA-MLP (Khashei and Bijari (2010), Khashei and Bijari (2011), G. P. Zhang (2003)), univariate ARIMA-RNN (Aladag, Egrioglu, and Kadilar, 2009), multivariate ARIMA-MLP (Díaz-Robles et al., 2008), and seasonal ARIMA-MLP (Tseng, Yu, and Tzeng, 2002). As these hybrid models also have a NN in their structure, they are also not interpretable by design.


3 | Preliminaries

In this section, we give the necessary background to introduce our method.

3.1 Local Interpretable Model-agnostic Explanations

In the previous section, a general overview of the Local Interpretable Model-agnostic Explanations (LIME) method (Ribeiro, Singh, and Guestrin, 2016) was given. Because our method builds on LIME, we further discuss it in detail.

LIME's objective is to identify a locally faithful and interpretable explainer for any classifier. The predictive model that is being explained is denoted by f : R^d → R. In the case of image and text classification tasks, the original input to the model, x ∈ R^d, has to be altered to construct an interpretable representation, x' ∈ {0, 1}^d. In text classification, a model's input often consists of word embeddings, which are not interpretable to humans; as a result, the interpretable input is a binary vector that indicates the presence or absence of a word.

The explainer is denoted by g ∈ G, where G is a class of potentially interpretable models, such as decision trees or linear models. The explainer takes x' as an input and outputs the explanations. Because not every interpretable model has the same complexity, the complexity of the explainer can be measured by Ω(g). For decision trees this can be determined by the depth of a tree, while for linear models it can be the number of non-zero weights.

Once the explainer has been designed (this is a choice to be made by the model designer), the explainer requires data, called the neighborhood, to train a local model on. This neighborhood is generated by randomly perturbing the feature values of x' according to the feature distribution in the training set. This gives a perturbed sample z', which is transformed from the interpretable representation to the original input space, which gives z ∈ R^d. z is fed to the model to obtain the model prediction f(z) for this sample. This prediction is used as a label for the perturbed sample z'. The set of perturbed samples with their labels is then used as data to train the explainer.

In this explainer, the samples are weighted according to a proximity measure π_x(z). To evaluate how unfaithful the explainer g is in approximating the predictive model f in the locality π_x, a fidelity function is specified as L(f, g, π_x). To ensure both local fidelity and interpretability (by keeping the complexity of the explainer low), the explanation produced by LIME is obtained by the following:

ξ(x) = arg min_{g ∈ G} L(f, g, π_x) + Ω(g)

Even though the equation above can be used with different explanation families G, fidelity functions L and complexity measures Ω, Ribeiro, Singh, and Guestrin focus on sparse linear models as the explainer. The choice of G (sparse linear models) means that if the underlying model is highly non-linear even in the locality of the prediction, there may not be a faithful explanation. However, the faithfulness of an explanation can be estimated.

Ribeiro, Singh, and Guestrin state that their method is fairly robust to sampling noise because the samples are weighted by πx.
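The following sketch illustrates this objective for a regression model, using Gaussian feature perturbations, an exponential proximity kernel and a weighted ridge regression as the surrogate; the callable `black_box_predict` and the kernel width are our own placeholder assumptions, not part of LIME's reference implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_surrogate(x, black_box_predict, n_samples=500, kernel_width=0.75):
    """Fit a weighted linear surrogate g around instance x (minimal sketch).

    `black_box_predict` is a hypothetical callable mapping a 2-D array of
    instances to model outputs f(z).
    """
    d = x.shape[0]
    # Perturb the instance feature-wise (Gaussian noise as a stand-in for
    # sampling from the training feature distribution).
    Z = x + np.random.normal(scale=1.0, size=(n_samples, d))
    labels = black_box_predict(Z)                      # f(z) for each sample
    # Proximity kernel pi_x(z): exponential kernel on the Euclidean distance.
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Weighted ridge regression approximates arg min_g L(f, g, pi_x) + Omega(g).
    g = Ridge(alpha=1.0).fit(Z, labels, sample_weight=weights)
    return g.coef_                                     # feature contributions
```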

3.2 Sequence representation and alignment

Because time series data are different from the data for which local explainers such as LIME have been developed, it is necessary to use methods that enable time series representation as well as alignment. Time series representation ensures that the temporal nature of the data will be preserved. This is important, as the different time steps in temporal data should be considered as a whole and not as individual features. The alignment between two sequences is needed to find similar motifs in the data, as this ensures local fidelity.

In this work, we choose to work with a sequence representation that transforms the data into symbolic strings, SAX (Symbolic Aggregate approXimation), which works well on data mining tasks (Lin et al., 2007).

SAX reduces a one-dimensional time series of length n into a string of length w. The string representation depends on the alphabet size a, where a > 2. To obtain this string representation, the data is first transformed into the Piecewise Aggregate Approximation (PAA) representation (Keogh, Chakrabarti, et al. (2001), Yi and Faloutsos (2000)). After the data has been transformed into its PAA representation, it can be converted into a symbolic representation, the SAX word. Below these two steps are described in more detail.

3.2.1 Piecewise Aggregate Approximation

To reduce the time series from n dimensions to w dimensions, the mean and standard deviation of the time series are calculated. If the standard deviation of the time series is smaller than a predefined threshold ε, the PAA representation is simply a sequence of w zeros. If the standard deviation of the time series is bigger than ε, the data of length n is first normalized to have a mean of zero and a standard deviation of one. Performing this normalization ensures that time series can be compared, according to Keogh and Kasetty (2003). After the normalization, the data is divided into w equal sized “frames”. The mean value of the data in each frame is calculated and a vector of these values becomes the data-reduced representation.

3.2.2 Symbolic representation

Once the PAA representation of the data is known, it can be transformed into a discrete symbolic representation. We follow the work of Lin et al. (2007), who show that it is desirable that all symbols are produced with equal probability (Alberto Apostolico, Bock, and Lonardi (2003), Lonardi and Apostolico (2001)). Because the data has been normalized to have a mean of zero and standard deviation of one, it has a highly Gaussian distribution and can therefore be divided into a equal-sized areas under the Gaussian curve by determining “breakpoints” (Larsen, Marx, et al., 1986), where a is the alphabet size.

With the breakpoints set and the PAA representation available, the PAA coefficients can be converted into symbols by mapping all coefficients that are below the smallest breakpoint to the symbol “a”, all coefficients greater than or equal to the smallest breakpoint and less than the second smallest breakpoint to “b”, and so on.

Figure 3.1: Example of how the original time series is first transformed into the PAA representation and then into its symbolic representation. Image from Lin et al. (2007), “Experiencing SAX: a novel symbolic representation of time series”

An overview of the steps taken to convert the normalized time series into SAX words is shown in Figure 3.1. The blue line indicates the normalized time series, the horizontal bars the PAA representation, and the letters that these PAA coefficients are assigned to are indicated next to the horizontal bars.
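As a concrete illustration of the two steps above, the sketch below computes a PAA representation and maps its coefficients to letters using Gaussian breakpoints. It is a minimal re-implementation for clarity; the thesis itself relies on the saxpy package.

```python
import numpy as np
from scipy.stats import norm

def paa(series, w):
    """Piecewise Aggregate Approximation: reduce a 1-D series to w frame means."""
    frames = np.array_split(np.asarray(series, dtype=float), w)
    return np.array([frame.mean() for frame in frames])

def sax_word(series, w, a_alphabet, eps=0.001):
    """Convert a 1-D series to a SAX word of length w over a_alphabet symbols
    (minimal sketch, not the saxpy implementation used in the thesis)."""
    series = np.asarray(series, dtype=float)
    if series.std() < eps:
        coeffs = np.zeros(w)                       # low-variance series: all-zero PAA
    else:
        z = (series - series.mean()) / series.std()  # normalize to mean 0, std 1
        coeffs = paa(z, w)
    # Breakpoints split the standard normal into a_alphabet equiprobable areas.
    breakpoints = norm.ppf(np.linspace(0, 1, a_alphabet + 1)[1:-1])
    # Map each PAA coefficient to a letter: below the first breakpoint -> 'a', etc.
    symbols = np.searchsorted(breakpoints, coeffs)
    return "".join(chr(ord("a") + s) for s in symbols)

print(sax_word([1, 2, 3, 4, 5, 6, 7, 8, 9], w=3, a_alphabet=3))  # 'abc'
```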

3.3 The Needleman-Wunsch sequence alignment algorithm

Once the discrete representations of the time series have been established, they can be used to find similar motifs in the time series. Lin et al. (2007) prove that there is a lower bound for the Euclidean distance between two SAX words. However, this only gives a lower bound and does not distinguish between the case where an A motif is aligned with an A motif and the case where an A motif is aligned with a B motif. Therefore, some information about the sequences will be lost. One way to circumvent this is to use an alignment algorithm that preserves sequence information. In the field of biology, methods have been developed to find motifs in strings of DNA and proteins. One of those methods is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which was originally developed to find similarities in the amino acid sequences of two proteins.

Before the alignment procedure can start, three scores have to be determined. The first is the ‘match’ score, which applies when two aligned letters are the same. The ‘mismatch’ score applies when two aligned letters are not the same, and the gap penalty applies when a letter is aligned with a gap.

In the simplest form, the score for a match is one and the score for a mismatch is zero (Needleman and Wunsch, 1970). If the gap penalty is set to zero, all possible gaps would be allowed, and if the gap penalty is equal to the theoretical value for the maximum match between two proteins, it would be impossible to allow a gap (Needleman and Wunsch, 1970).

This sequence alignment algorithm works by following these steps (Likic, 2008):

1. Initialize the score and traceback matrix

If the length of sequence 1 is equal to l1 and the length of sequence 2 is equal to l2, the size of both the score matrix and the traceback matrix is (l1 + 1) × (l2 + 1). The first row and column of the score matrix are set to multiples of the negative gap penalty, so if the gap penalty is 10, the first row and column have the form [0, −10, −20, ...]. The first row and column of the traceback matrix are [‘done’, ‘left’, ‘left’, ...] and [‘done’, ‘up’, ‘up’, ...], respectively.

2. Calculate the scores and fill the traceback matrix

Now we can start filling the score and traceback matrices. This is done by starting in the upper left corner, in the cell that is still empty (in the second row and column). The score of this cell is calculated by looking at the cells that are positioned to the left, above, and on the upper-left diagonal of the cell. Three possible scores are calculated: one that takes the score of the cell to the left and adds a gap penalty, one that takes the score of the cell above and adds the gap penalty, and one that takes the score of the cell on the upper-left diagonal and adds a (mis)match score.

The maximum of these three possible scores is taken and put into the score matrix. The traceback matrix stores how this score was calculated, so either from the ‘left’ cell, the ‘up’ cell or the ‘diag’ cell. The cells of both the score and the traceback matrices are filled row by row.

3. Use the traceback matrix to find the alignment score

Once both matrices are full, the traceback matrix is used to find the path with the highest score by following the clues in the traceback matrix.

In Appendix C an example is shown of the alignment between two sequences using the Needleman-Wunsch algorithm.
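For reference, a minimal sketch of the score-matrix computation described above is given below. It returns only the optimal alignment score and omits the traceback step, and the default scores (match 1, mismatch 0, gap −10) are illustrative choices rather than the exact values used later in LETS.

```python
import numpy as np

def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-10):
    """Global alignment score of two strings via Needleman-Wunsch (minimal sketch)."""
    l1, l2 = len(seq1), len(seq2)
    score = np.zeros((l1 + 1, l2 + 1))
    # First row and column: multiples of the (negative) gap value.
    score[:, 0] = gap * np.arange(l1 + 1)
    score[0, :] = gap * np.arange(l2 + 1)
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            diag = score[i - 1, j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            up = score[i - 1, j] + gap      # gap in sequence 2
            left = score[i, j - 1] + gap    # gap in sequence 1
            score[i, j] = max(diag, up, left)
    return score[l1, l2]

print(needleman_wunsch("abca", "abda"))  # three matches and one mismatch -> 3.0
```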


4 | Method

In this section, we introduce our method: Local Explainer for Time Series (LETS), which aims at filling the gap that exists in current work on explaining the predictions of black-box models trained on multivariate time series. This method builds on the LIME framework (Ribeiro, Singh, and Guestrin, 2016), where the main contribution is in the way it constructs a neighborhood around a given instance in order to build a post-hoc local explanation.

4.1 Neighborhood sampling and time series representation

Our main contribution in this work is the way we select neighborhoods to generate local explanations. Before we further describe LETS, we illustrate how we select the neighbors of a given instance, which will be used to create a local explanation.

Neighborhoods are built using real instances from the data set. This way of creating the neighborhood ensures that, within each instance, the temporal dependencies between the different time steps are faithful to the data. To find which instances should be included in the neighborhood of a given instance, the instance is converted into a SAX representation. To do this, we first reduce the dimensionality of the data to create a one-dimensional time series, while still preserving the temporal nature of the data. The one-dimensional time series is converted to a SAX word, to allow motifs in the data to be found. Using the SAX representation of the multivariate time series, the neighborhood is created by finding instances that have the same motif as the given instance. These motifs are found by using the Needleman-Wunsch algorithm. The steps that LETS takes to go from a dataset with original representations to a neighborhood for a given instance are described in detail in the next section.

4.2 LETS overview


LETS Method overview

1. Reduce the dimensionality of the data
   With a data dimensionality reduction technique, reduce the data from a multivariate time series to a one-dimensional time series.

2. Represent each reduced instance as a SAX word
   With the parameters a_alphabet and ε known, convert each reduced instance into a SAX word.

3. Use the SAX representation to find a neighborhood
   Use the Needleman-Wunsch algorithm to find the instances in the training set that have the same motif as the instance for which the explanation has to be generated.

4. Use the identified neighborhood to build explanations using the LIME framework
   (a) Unfold each instance so the time steps are incorporated in the variables.
   (b) Create a second neighborhood frame that changes categorical features to binary features indicating whether the features in the neighborhood are the same as or different from the features in the instance of interest.
   (c) Select features to include in the linear model.
   (d) Use the selected features to fit a linear model.
   The coefficients of this linear model are used to generate a local explanation for the prediction of the black-box model.

Step 1: Reduce the dimension of the data

To find motifs in the data, the first step is dimensionality reduction, to make the data one-dimensional. The original dimension of an instance is window × n_features, and the new dimension of the instance will be window × 1. There are multiple dimensionality reduction techniques (Van Der Maaten, Postma, and Van den Herik, 2009), but in this work an autoencoder is chosen to reduce the data, as this also gives an idea of how much information is preserved during the reduction.

Step 2: Represent each reduced instance as a SAX word

Once the dimension of the data has been reduced, this reduced data can be transformed into a sequence of SAX words. To achieve this, two parameters have to be set:

1. a_alphabet: the alphabet size, which should be bigger than 2 (Lin et al., 2007).

2. ε: a threshold that determines whether the series should be standardized or not. If the standard deviation of the time series sequence is lower than ε, the sequence is not standardized; otherwise it is.

Here, both a_alphabet and ε are assumed to be known; at the end of the section it is discussed how these parameters can be set.


Figure 4.1: For an example where window = 3, a schematic overview of how instances 0, 1, and 2 are transformed from their original representation to the SAX representation via a reduced representation.

An overview of how the original instances of shape window × n_features are transformed into their SAX representations can be found in Figure 4.1. In this example window = 3, and it can be seen clearly that one instance consists of multiple time steps, so even when the instance has index 0, it contains time steps 0, 1, and 2. The instances are then reduced in dimension by passing them through the data dimension reduction method, after which the reduced representations are represented as SAX words. The steps to go from the reduced representation to the SAX words are discussed in Section 3.2. There it is discussed that the time series is reduced from length n to length w. In the implementation of LETS, the length of the encoded instance is the same as the length of the SAX word (n = w). This means that the PAA representation does not consist of the mean values of a frame, but of the single value in that frame, as there is only one value per frame. The implementation of the SAX representation is from the saxpy package of Senin et al. (2018).

Step 3: Use SAX representation to find neighborhoods

At this point, a SAX representation has been constructed for each instance. These SAX words are used to find a neighborhood of size n_neigh for an instance for which we want to explain the black-box model output.

Here the assumption is made that the neighborhood size (n_neigh) is known, and at the end of the section it is discussed how this parameter can be determined.

Assume that we have a specific instance in the validation set valid_instance_k, which has a SAX representation valid_sax_k. The alignment score is calculated between valid_sax_k and all SAX representations in the training set, denoted by train_sax_j for j ∈ [0, n_train]. This alignment score is calculated using the Needleman-Wunsch algorithm discussed in Section 3.3, with the adjustments that are discussed below. Then the training instances are sorted on their alignment score (in descending order) and the top n_neigh instances are selected to be the neighborhood of the given instance.

Adjustments to the Needleman-Wunsch algorithm

As stated in Needleman and Wunsch (1970), the simplest settings are 1 for a match and 0 for a mismatch, and the gap penalty can be between zero (allowing gaps at any position) and the maximum match between two letters (not allowing any gaps). However, for comparing SAX representations of time series it does not make much sense to say that the difference between two letters that are close to each other (a and b) is the same as the difference between two letters that are far apart (a and q), so the score for a mismatch has to be altered. This is done in the following manner: the score for a match is set to a_alphabet, and the score for a mismatch depends on how many steps the letters are apart. This gives a mismatch score of −1 for a and b, −2 for a and c (and for e and g), and so on. Also, the gap penalty is set to −10 * a_alphabet, to make sure no gaps occur. The scoring scheme is sketched below.
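A minimal sketch of this adjusted scoring, written as drop-in replacements for the match/mismatch and gap scores of the Needleman-Wunsch sketch in Section 3.3 (the function names are ours):

```python
def sax_pair_score(letter1, letter2, a_alphabet):
    """Adjusted (mis)match score for aligning two SAX letters."""
    if letter1 == letter2:
        return a_alphabet                       # match: reward equal to the alphabet size
    return -abs(ord(letter1) - ord(letter2))    # mismatch: penalty grows with letter distance

def sax_gap_penalty(a_alphabet):
    """Gap penalty chosen so large that gaps are never introduced in practice."""
    return -10 * a_alphabet
```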

Step 4: Use the neighborhood to build explanations using the LIME framework

Once the neighborhood is known, it can be used to build a linear model and use the coefficients of this linear model as an explanation.

Step 4.1: Unfold time steps of instances

In Figure 4.1 it is clear that an instance consists of multiple time steps. However, to fit the linear model, the variables of each time step have to become variables of their own. To account for this, the instance is unfolded so the time steps are not below each other, but next to each other, creating one long row.

The (unfolded) instances are used to generate the explanations by using the LIME framework. The framework as described in “Why should I trust you?: Explaining the predictions of any classifier” has been discussed in Section 3.1. However, as that paper focuses on text and image classification, it does not cover the case of regression on tabular data. Below, the steps are described that are taken in LIME to go from the neighborhood (which Ribeiro, Singh, and Guestrin obtained by perturbing the original instance) to the explanations.

Step 4.2: Change the categorical features

To know the effect of every variable in valid_instance_k, even if some categorical variables are equal to zero, a second neighborhood frame is created. In this second frame, the categorical features in the neighborhood are adjusted so that they are binary variables indicating if the feature value is the same as the feature value in valid_instance_k or not.


Step 4.3: Select features to include in the linear model

In the LIME framework, the user sets the number of features to include in the explanation. If this number is equal to or lower than six, the forward selection method as described in James et al. (2013) will be used, where the linear model is least squares (Ridge regression with λ = 0), and the samples are weighted according to their distance. This forward selection works in the following manner (James et al., 2013):

Start with a model without any variables, then fit p linear regressions and add the variable that resulted in the highest R². Add variables in this manner until some stopping rule is satisfied. In LIME, the stopping rule is the number of features that the user has specified. However, forward selection is a greedy approach, and might include variables in an early stage that later become redundant (James et al., 2013).

If the number of features to include is higher than six, the features are selected in a way that is denoted by ‘highest weights’ in the code of LIME. The ‘highest weights’ method works in the following way: first a least squares regression is performed (Ridge regression where λ is zero) using all the features and weighing the samples according to their distances. Then the coefficients of this regression are multiplied with the first row of scaled_data (which is the given instance where all categorical features are equal to one and the continuous features are values from the normal distribution). Then the features that correspond to the n_features highest values of these multiplied weights are returned as the features to use.

In LETS, to overcome the drawback of forward selection, backward selection is used. This is described by James et al. (2013) in the following way: Start with all variables in the model and remove the variable that has the highest p-value. Fit a model with the remaining variables and continue this procedure until some stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold. In LETS the threshold for the p-values is set to 0.001.

Step 4.4: Use selected features to fit linear model

The second frame of the neighborhood, with only the selected features left in the frame, is used to perform a Ridge regression where λ = 1. In LIME, the samples are weighted based on their proximity. The coefficients of this regression are returned as explanations.
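A minimal sketch of steps 4.3 and 4.4 combined is given below: backward elimination on p-values followed by a Ridge fit with λ = 1. The use of statsmodels for the p-values and the per-sample proximity weights are our own assumptions about how this could be implemented; they are not taken from the LETS code.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Ridge

def backward_select_and_explain(X_neigh, y_neigh, weights, p_threshold=0.001):
    """Backward feature elimination followed by a weighted Ridge fit (sketch).

    X_neigh : (n_neigh, n_vars) neighborhood in the second (binary/continuous) frame
    y_neigh : black-box predictions for the neighborhood instances
    weights : per-sample proximity weights (assumption)
    """
    selected = list(range(X_neigh.shape[1]))
    while len(selected) > 1:
        # Weighted least squares gives p-values for the remaining variables.
        wls = sm.WLS(y_neigh, sm.add_constant(X_neigh[:, selected]), weights=weights).fit()
        pvals = wls.pvalues[1:]               # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= p_threshold:       # stop once every variable is significant
            break
        selected.pop(worst)                   # drop the least significant variable

    # Ridge regression (lambda = 1) on the selected features; its coefficients
    # are returned as the explanation.
    ridge = Ridge(alpha=1.0).fit(X_neigh[:, selected], y_neigh, sample_weight=weights)
    return selected, ridge.coef_
```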

4.3 Setting the parameters

In the description of the method we have assumed that three parameters are known:

• a_alphabet is the alphabet size to use (should be bigger than 2 (Lin et al., 2007)).

• ε determines if the series should be standardized or not. If the standard deviation of the time series sequence is lower than ε, the sequence is not standardized and otherwise it is.

• n_neigh determines how many instances are in the neighborhood.

However, the values of these parameters have to be chosen in a manner that overall creates the best neighborhood for the instances. ε is set to 0.001, and for both a_alphabet and n_neigh a grid search is performed.

The a_alphabet parameter is set to range from 3 to 20 and the n_neigh range is set according to the size of the instance. Because a linear model needs at least as many observations as variables, the minimum neighborhood size is set to window × n_features + 1 and the maximum neighborhood size is set to the minimum neighborhood size + 50. This maximum is set to reduce the search space and to make sure that the neighborhood is a local representation instead of a global one.

Then, for each combination of a_alphabet and n_neigh, a neighborhood is created for all validation instance-output pairs, a simple linear model is fitted on the neighborhood, and the output of the instance is predicted using this simple linear model. The Mean Squared Error between the predicted output of the linear model and the true output is calculated for each validation instance, and the combination of a_alphabet and n_neigh that has on average the lowest error is picked.
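A minimal sketch of this grid search is shown below; `build_neighborhood` is a hypothetical helper standing in for steps 1-3 of LETS (dimensionality reduction, SAX conversion and Needleman-Wunsch selection), and the returned neighborhood is assumed to already be unfolded.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def grid_search_lets_params(build_neighborhood, validation_pairs, window, n_features):
    """Grid search over a_alphabet and n_neigh (minimal sketch)."""
    min_neigh = window * n_features + 1          # a linear model needs >= #variables samples
    best, best_mse = None, np.inf
    for a_alphabet, n_neigh in itertools.product(range(3, 21), range(min_neigh, min_neigh + 51)):
        errors = []
        for x_val, y_val in validation_pairs:    # (instance, true output) pairs
            X_neigh, y_neigh = build_neighborhood(x_val, a_alphabet, n_neigh)
            lin = LinearRegression().fit(X_neigh, y_neigh)
            errors.append((lin.predict(x_val.reshape(1, -1))[0] - y_val) ** 2)
        mse = float(np.mean(errors))
        if mse < best_mse:                       # keep the lowest average error
            best, best_mse = (a_alphabet, n_neigh), mse
    return best
```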


5 | Experimental design

5.1 Data

Gall.nl daily transactions The main data set used in our experimental design consists of the daily number of transactions for the online store gall.nl. This data has been collected from the 5th of January 2015, when the online store was launched, until the 30th of June 2019. This represents a total of 1638 observations. In addition to the transaction data, the data set has been expanded with explanatory variables from different data sources. Table 5.1 gives a summary of the categories of explanatory variables and how many variables each category contains.

Explanatory variables   Number of features
Calendar properties     19
Events                  24
General discount        2
Marketing efforts       8
One year lag            1
Themes                  33
Weather                 7
Total                   94

Table 5.1: Overview of the available explanatory variables and the number of features per category.

Important notes concerning the data: for both the general discount variables and the marketing efforts, no information about 2015 is available.

The data has been split up into train, validation and test set, according to the percentages and number of observations shown in Table 5.2.

Data set     Percentage   Observations
Train        80%          1311
Validation   10%          164
Test         10%          163
Total        100%         1638

Table 5.2: Overview of number of observations of the train, validation and test set

A visual representation of these splits can be found in Figure 5.1, where the normalized number of daily transactions is shown.


Figure 5.1: The normalized number of orders for the online store of Gall & Gall, split into the train, validation and test set.

Feature selection To select the best subset of features to train the predictive model, the Mutual Information based Feature Selection (MIFS) algorithm is used (Battiti, 1994). This algorithm does not only look at the Mutual Information (MI) between the target variable and the explanatory variables, but also at the MI between the different explanatory variables. The MI is calculated using the feature_selection.mutual_info_regression function from the sklearn package [1], which is based on the article of Ross (2014) for mutual information between discrete and continuous datasets. The MIFS algorithm selects variables until k variables are selected, where k is specified by the user. The parameter β also has to be set; it determines how influential the MI between the features in the selected set S and the candidate set F is. Battiti (1994) states that in practice a value for β between 0.5 and 1 is appropriate for classification tasks. For this project, β = 0.5. The MIFS algorithm will select a better subset of features than regular MI, since the MI between the features is used in the calculation to select non-redundant features (Chandrashekar and Sahin, 2014). To find the optimal subset of features, experiments have been performed with k in the range of (0, 60).
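A minimal sketch of the MIFS selection loop, under the assumption that the greedy criterion is MI(f; y) − β * Σ_{s ∈ S} MI(f; s) as in Battiti (1994); the helper name is ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mifs_select(X, y, k, beta=0.5):
    """Mutual Information based Feature Selection (Battiti, 1994); minimal sketch."""
    remaining = list(range(X.shape[1]))
    selected = []
    mi_target = mutual_info_regression(X, y)   # MI between each feature and the target
    while len(selected) < k and remaining:
        scores = []
        for f in remaining:
            redundancy = 0.0
            if selected:
                # MI between the candidate feature f and every already-selected feature.
                redundancy = mutual_info_regression(X[:, selected], X[:, f]).sum()
            scores.append(mi_target[f] - beta * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```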

Data scaling Before the training of the model, the data is normalized as this can avoid computational problems (Lapedes and Farber, 1988). Here the features are scaled individually from their original values to the range [0, 1] by applying the following formula:

x_normalized = (x_unnormalized − x_min) / (x_max − x_min)    (5.1)

where x_min and x_max are the minimum and maximum values of the original feature values.

5.2 Neural models

Both the predictive model to forecast the daily number of transactions of gall.nl, and the autoencoder used to reduce the dimensionality of the multivariate time series into a univariate time series are neural models.

[1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_


The structure of both models is discussed, as well as the way they are evaluated and how well they perform on data sets that consist of different numbers of features.

5.2.1 Forecasting model

With the feature subset that is selected by the MIFS algorithm described above, an RNN is built using PyTorch (Paszke et al., 2017). The output of the model is calculated by initializing an initial hidden state h^(0) and applying the following update equations from time step t = 1 to t = T:

h1^(t) = b^(h) + W^(hh) h^(t−1)    (5.2)

h2^(t) = b^(x) + W^(hx) x^(t)    (5.3)

h^(t) = tanh(h1^(t) + h2^(t))    (5.4)

For the last time step:

o^(T) = ReLU(b^(o) + W^(oh) h^(T))    (5.5)

The weights of the model are initialized according to the default values of PyTorch Linear layers, which is U(−√k, √k) for both the weights and the biases, where k = 1 / n_features. The model is trained using RMSprop (Tieleman and Hinton, 2012) with a learning rate of 0.001.

Because the weights are initialized in a random manner, some weight initializations may be a better starting point for the model than others. To account for the possibility that the performance of a model is due to its weight initialization, the model training is performed 10 times.

To make sure the network has a good generalization performance, an early stopping criterion is introduced (Bishop, 2006). This means that after every epoch of training on the training set, the model is evaluated on the validation set and the error on the validation set is stored. Every time the validation error is lower than the current lowest error, the training is continued, and every time the validation error is higher than the current lowest error, the training is continued for 5 more epochs.

There is no set rule on how to determine the optimal number of input nodes and hidden nodes (the number of output nodes depends on the task to learn) (G. Zhang, Patuwo, and Hu, 1998). Therefore a grid search has been performed where the number of input nodes ranges from 3 to 7 and the number of hidden nodes ranges from 1 to 10.
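A minimal PyTorch sketch of a model implementing update equations (5.2)-(5.5) is given below; the class and attribute names, as well as the example sizes, are our own choices and not taken from the thesis code.

```python
import torch
import torch.nn as nn

class ForecastRNN(nn.Module):
    """Minimal sketch of the forecasting RNN implementing equations (5.2)-(5.5)."""

    def __init__(self, n_features, n_hidden):
        super().__init__()
        # nn.Linear's default initialization is U(-sqrt(k), sqrt(k)) with k = 1 / in_features.
        self.hh = nn.Linear(n_hidden, n_hidden)    # W^(hh) and b^(h)
        self.hx = nn.Linear(n_features, n_hidden)  # W^(hx) and b^(x)
        self.oh = nn.Linear(n_hidden, 1)           # W^(oh) and b^(o)
        self.n_hidden = n_hidden

    def forward(self, x):
        # x has shape (batch, window, n_features)
        h = torch.zeros(x.size(0), self.n_hidden)              # h^(0)
        for t in range(x.size(1)):
            h = torch.tanh(self.hh(h) + self.hx(x[:, t, :]))   # (5.2)-(5.4)
        return torch.relu(self.oh(h))                          # (5.5), last time step only

# Training setup as described above (example sizes are assumptions).
model = ForecastRNN(n_features=5, n_hidden=8)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()
```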

5.2.2 Autoencoder

Recall that LETS requires the multivariate time series to be reduced to a univariate time series. We choose an autoencoder to perform this dimensionality reduction. Both the encoder and the decoder are RNN models from PyTorch, with an additional linear layer to obtain the desired dimensionality. For the encoder this means reducing the dimensionality to a one-dimensional vector, and for the decoder this means increasing the dimensionality to give the output the same dimensionality as the original input.
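A minimal sketch of such an RNN encoder-decoder is shown below; the exact layer sizes and the choice to keep one value per time step as the reduced representation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RNNAutoencoder(nn.Module):
    """Minimal sketch of the RNN encoder-decoder used to reduce a multivariate
    window to a one-dimensional time series (layer sizes are assumptions)."""

    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.encoder_rnn = nn.RNN(n_features, n_hidden, batch_first=True)
        self.to_code = nn.Linear(n_hidden, 1)            # one value per time step
        self.decoder_rnn = nn.RNN(1, n_hidden, batch_first=True)
        self.to_output = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        # x: (batch, window, n_features) -> code: (batch, window, 1)
        enc_out, _ = self.encoder_rnn(x)
        code = self.to_code(enc_out)
        dec_out, _ = self.decoder_rnn(code)
        return self.to_output(dec_out), code             # reconstruction and reduced series
```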


5.2.3 Evaluation metrics for the neural models

Both the forecasting model and the encoder-decoder are evaluated using the Mean Squared Error (MSE). The MSE can be computed using the following formula:

MSE(y, ŷ) = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)²    (5.6)

Here y are the true values and ŷ are the predicted values. This metric has the advantage that its quadratic form penalizes big errors more heavily than small errors. A drawback of this metric is that the error scales with the data, so a model trained on normalized data cannot be compared with a model trained on unnormalized data using this metric.

Figure 5.2: MSE on both the validation and test set of the forecasting model (RNN)

Evaluation of the forecasting model In Figure 5.2 the MSE of the forecasting model on both the validation and test set are shown for the number of features ranging from 0 to 60. Even though the feature selection process discussed in the previous section is developed to make sure only relevant and non-redundant features are included in every time step, it is clear that not all added features have a positive contribution to the accuracy of the predictions. For example, the features added as 39th and 44th have a negative effect on the performance of the model on both the validation and test set.

Evaluation of the autoencoder In Figure 5.3 the MSE of the autoencoder on both the validation and test sets is shown for the number of features ranging from 0 to 60. For every individual instance, the MSE is computed by comparing each element of the original instance to the same element in the reconstructed instance. The performance of the autoencoder depends on the number of features in the data set. There seems to be a trend in which the autoencoder has a higher error for instances with a larger number of features. This is not surprising, as reconstructing a 60-dimensional vector from a one-dimensional vector is a harder task than reconstructing a 30-dimensional vector. However, the autoencoder also struggles to reconstruct instances that have around 10 features. This weak performance on data sets with around 10 features is not investigated further in this work and is left for future research.


Figure 5.3: MSE on both the validation and test set of the autoencoder model.

5.3 Evaluation of LETS

After LETS has been implemented and has generated explanations, the quality of these explanations is evaluated. The evaluation metrics can be divided into quantitative evaluation, which expresses the quality of the explanations in an objective manner, and qualitative evaluation, which expresses the quality of the explanations in a subjective manner.

5.3.1 Quantitative evaluation

Measuring faithfulness Recall that one of the desiderata of explanations is that they are faithful to the predictive model. However, there are no straightforward ways to measure the faithfulness of explanations. Most of the work on this type of evaluation relates to image classification. For example, Guo et al. (2018) feed images to the model in which only the pixels that are marked as important by their explainer are kept, while the other pixels are set to black. If the model still predicts the same class, this means that the right features have been selected, and thus that the explainer is faithful to the predictive model. This evaluation metric is not applicable to our regression task.

We measure the faithfulness of the explainer to the predictive model in two ways. First, we want to know how well the neighborhood of the instance is capable of predicting the output of the model. This is done by fitting a linear model on the neighborhood and using this model to predict the output for the given instance. The MSE between the prediction of the linear model and the output of the predictive model is calculated as a measure of faithfulness. The second way to measure the faithfulness of the explainer is to construct the explanation by following the steps described in Section 4.2, and to use the prediction of the Ridge regression for the given instance in the same manner as for the neighborhood faithfulness. For both the neighborhood and the explanations, the size of the neighborhood and the number of features to select are the same.
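A sketch of both measurements for a single instance is given below. It assumes scikit-learn's LinearRegression and Ridge as the surrogate models, and hypothetical inputs (the neighborhood, the instance, the black-box prediction, and the indices of the selected features); in the experiments this squared error is averaged over many instances and over 20 trials.

```python
from sklearn.linear_model import LinearRegression, Ridge

def neighborhood_faithfulness(neighborhood_X, neighborhood_y, instance, black_box_pred):
    """First measure: fit a full linear model on the neighborhood and compare its
    prediction for the instance with the output of the predictive model."""
    surrogate = LinearRegression().fit(neighborhood_X, neighborhood_y)
    pred = surrogate.predict(instance.reshape(1, -1))[0]
    return float((pred - black_box_pred) ** 2)

def explanation_faithfulness(neighborhood_X, neighborhood_y, instance, black_box_pred, selected):
    """Second measure: fit the Ridge regression on the selected features only,
    mirroring the explanation construction of Section 4.2."""
    surrogate = Ridge().fit(neighborhood_X[:, selected], neighborhood_y)
    pred = surrogate.predict(instance[selected].reshape(1, -1))[0]
    return float((pred - black_box_pred) ** 2)
```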

We compare the faithfulness of the neighborhood and the explanations of LETS to LIME and to a method that randomly selects real instances from the data, called RAND. RAND can be interpreted as a first step to improve LIME (by ensuring that the temporal dependencies are respected), with the difference that the information about the patterns, detected by SAX and Needleman-Wunsch in LETS, is not used. We have set the neighborhood size for all methods equal to the optimal neighborhood size for the LETS method, and the number of features is 10 for all methods. In addition, the predictions are calculated over 20 trials, to account for the fact that LIME perturbs the instances randomly and therefore generates unstable explanations.

Measuring robustness Another desideratum is that explanations should be robust. We define the robustness of an explainer as follows: for similar instances, the explainer should generate similar explanations. Existing robustness measures, such as the one used in Alvarez-Melis and Jaakkola (2018a), "On the robustness of interpretability methods", are not suitable for the neighborhoods created by the LETS method, as that measure uses the Euclidean distance. The instances in the neighborhood of LETS can have values in the input and output that are quite far apart according to the Euclidean distance, but have been assigned to the neighborhood because they were detected as having the same pattern as the given instance.

We inspect the robustness of the generated explanations by selecting pairs of instances that have high alignment scores and pairs that have low alignment scores, and examining how much their explanations differ. If LETS is robust, we expect instances with a high alignment score to have explanations that are alike, and instances with a low alignment score to have explanations that are not alike.
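One simple way to quantify how alike two explanations are is to count the features they have in common. The sketch below assumes that an explanation is returned as a list of (feature, weight) pairs, which is an assumption about the output format rather than a guarantee.

```python
def feature_cooccurrence(explanation_a, explanation_b):
    """Number of features that occur in both explanations, where each explanation
    is a list of (feature_name, weight) pairs."""
    features_a = {name for name, _ in explanation_a}
    features_b = {name for name, _ in explanation_b}
    return len(features_a & features_b)

# With 10 features per explanation, a robust explainer should give a co-occurrence
# close to 10 for pairs with a high alignment score, and a clearly lower value for
# pairs with a low alignment score.
```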

These quantitative evaluation metrics lead to the following quantitative research questions:

Quantitative Research Questions

• To what extent are the neighborhoods a good representation of the instance to explain?
• How does SAX help in improving these neighborhoods?
• How faithful are the explanations to the black-box model?
• To what extent are the generated explanations robust?

5.3.2 Qualitative evaluation

There are a few ways in which humans can help evaluate explanations. For the task of image classification, the explanations can be evaluated by observing the feature maps that were created by a method and comparing them to the feature maps of other methods (Guo et al., 2018). This evaluation method is specific to the task of image classification, as it is quite straightforward for humans to see whether the feature maps cover the parts of the image that correspond to its class.

For the task of text classification, the explanations can be evaluated by presenting the top key words selected by the explainer and measuring whether the labels that humans assign based on these words align with the labels provided by the model (J. Chen et al., 2018). This method is specific to the task of classification and cannot be used to evaluate LETS.

These qualitative evaluation methods of J. Chen et al. (2018) and Guo et al. (2018) rely on the fact that the explanations are generated for classification models on a task that is comprehensible for humans. This is not the case for our problem; therefore, we design a user study with human subjects to assess the quality of the explanations.

User study design We follow the work of Lucic, Haned, and Rijke (2019) in the setup of our user study. In total, there are 30 participants, who are divided into a treatment group (16 participants) and a control group (14 participants). The setup of the experiment is as follows:

• All participants are shown some basic information. As most of the participants in both groups are not familiar with machine learning, some background about how ML models work is given.

• The general properties of the forecasting model are explained with some information about the global performance of the model.

• The participants are asked to answer the questions shown in Table 5.3, which they can answer by selecting one of four predefined options: Strongly disagree, Disagree, Agree, Strongly agree.

Based on the information about the model you have received so far, please indicate to what extent you agree with the following statements:

Q1 In my opinion, this model produces mostly reasonable forecasts.
Q2 I understand how the model makes forecasts.
Q3 I trust this model.
Q4 I would support using this model as a forecasting tool.

Table 5.3: Questions of the user study

• All participants are shown information about the input the model receives. This is a table that contains 11 features. The model with 11 features was chosen for two reasons. The first is that including more features would make the input quite large and difficult to comprehend for humans, and the second is that the participants of the survey are mostly people from the business who are more familiar with percentage errors than with MSE, and the model with 11 features has the lowest percentage error of all models with fewer than 15 features.

• Participants in the treatment group receive additional information about the explanations they will see. This consists of an example of an explanation, shown in Figure 5.4, with the following text: The explanation shows how much each variable contributed to this forecast. The green bars are for variables that contribute positively to the forecast (i.e. increase number of orders), while the red bars show variables that contribute negatively to the forecast (decrease number of orders). The horizontal axis is a measure of the influence of the variables.

Among all the variables we use in the model, the variables that contributed the most to the forecast in a positive manner (increasing the forecast) are the ones that have a green bar, and the variables that contributed negatively (decreasing the forecast) are displayed in red. For example: The fact that the number of orders last year (last year at time (x - 0)) was between 563 and 663 has increased the forecast of the number of orders today. The fact that there is no TV campaign today (x - 0) has decreased the forecast of the number of orders for today.


Figure 5.4: Example of an explanation shown to the participants in the treatment group. The green bars are for variables that contribute positively to the forecast (i.e. increase number of orders), while the red bars show variables that contribute negatively to the forecast (decrease number of orders). The horizontal axis is a measure of the influence of the variables.

• After the information about the model, the input and the explanation (the last only for the treatment group), the participants are shown 5 examples of the model making a prediction. The input, the number of orders predicted by the model, and the real number of orders are shown.

• The participants in the treatment group are also shown the explanation generated by LETS, which explains how each variable has contributed to the output of the model. Note that this is a local explanation, specific to that single prediction of the model.

• At the end, the questions in Table 5.3 are asked again to see how these examples (with explanations for the treatment group) have influenced the participants' trust in and opinion of the model.

With this user study, we aim to answer the following qualitative research questions:

Qualitative Research Questions

• Do explanations improve user trust in the model once they have seen it perform?
• Do explanations help the users to understand the workings of the model?


6 | Results

This section aims to answer the research questions that were introduced in the previous sections. Firstly, the quality of the neighborhoods used by LETS is inspected, and secondly the quality of the explanations generated by LETS is evaluated. Throughout this section the results on the validation set are colored in red, while the results on the test set are colored in green.

The quality of the neighborhoods and explanations generated by LETS is assessed through comparisons to the neighborhoods and explanations generated by LIME and to the ones generated by random instance selection, which we call RAND. This random instance selection can be seen as the first step to move away from LIME, by using real instances instead of random perturbations; the second step to move away from LIME is to select the instances that have the same pattern, as discovered by combining SAX and the Needleman-Wunsch algorithm.
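As a reminder of how the pattern detection works, the sketch below computes a Needleman-Wunsch global alignment score between two SAX words. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration and need not match the scheme used in LETS.

```python
import numpy as np

def needleman_wunsch_score(sax_a: str, sax_b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Global alignment score of two SAX words via dynamic programming."""
    n, m = len(sax_a), len(sax_b)
    score = np.zeros((n + 1, m + 1), dtype=int)
    score[:, 0] = gap * np.arange(n + 1)   # aligning a prefix against gaps only
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1, j - 1] + (match if sax_a[i - 1] == sax_b[j - 1] else mismatch)
            score[i, j] = max(diag, score[i - 1, j] + gap, score[i, j - 1] + gap)
    return int(score[n, m])

# Instances whose SAX words align well with the SAX word of the given instance
# (high score) are the ones selected into the LETS neighborhood.
print(needleman_wunsch_score("abbc", "abcc"))  # 2: three matches and one mismatch
```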

6.1 Neighborhoods evaluation

The neighborhoods of LETS, LIME and RAND are evaluated by fitting a simple linear model (OLS) on the neighborhood, which does not include the given instance. That instance is then used as input to the linear model, and the MSE between the prediction of the linear model and the true output of the predictive model is calculated. This measures the faithfulness of the neighborhood to the model.

In Figure 6.1 the faithfulness of the neighborhoods created by the different methods on the validation set can be seen. For all the numbers of features inspected here, the neighborhood created by LETS is better than the neighborhoods created by both RAND and LIME. To see if these differences are significant, we have performed a two-sided Welch's t-test. For the comparison between LETS and LIME this gives p-values for the different models that are all smaller than 1e-9, so these differences are all significant. For the comparison between LETS and RAND, the model with 20 features gives a p-value of 0.003, while for all other models the p-values are in the same range as those of the comparison between LETS and LIME. From this we conclude that LETS creates more faithful neighborhoods than RAND or LIME.
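The significance tests reported here can be reproduced with SciPy's ttest_ind using equal_var=False, which gives Welch's variant; the arrays below contain placeholder values standing in for the 20 per-trial MSE scores of two methods on one model.

```python
from scipy import stats

faithfulness_lets = [0.81, 0.77, 0.92, 0.85, 0.79]   # placeholder per-trial MSE values
faithfulness_lime = [1.43, 1.58, 1.49, 1.61, 1.52]   # placeholder per-trial MSE values

# Two-sided Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(faithfulness_lets, faithfulness_lime, equal_var=False)
print(p_value)  # a difference is called significant when the p-value is small (e.g. < 0.05)
```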

Figure 6.2 shows the faithfulness on the test set of the neighborhoods created by LETS, RAND and LIME. For the model that is trained on 10 features, LETS creates neighborhoods that are less faithful to the model than the other two methods; this difference is also significant. For the other models, the neighborhoods created by LETS are significantly more faithful to the model than those of LIME and RAND. Recall that it was shown in Section 5.2.3 (Figure 5.3) that the autoencoder has a slightly worse performance on the data set that has around 10 features compared to its performance on data sets that have fewer or more features.


Figure 6.1: Mean Squared Error of the linear model fitted on the neighborhood compared to the output of the predictive model on the validation set. The bars show the mean value over 20 trials, and the whiskers indicate the standard deviation.

Summary of the neighborhoods faithfulness evaluation

• On the validation set, where the SAX parameters have been established, the neighborhoods created by LETS are significantly more faithful than those created by RAND and LIME in all cases.

• On the test set, the neighborhoods created by LETS are more faithful to the model than RAND and LIME in most of the cases.

6.2 Evaluating the explanations

6.2.1 Quantitative evaluation

Evaluating the faithfulness of the explanations In the same manner as the neighborhoods have been compared for the different methods, the quality of the generated explanations is compared.

Figure 6.3 shows how faithful the sparse linear models are to the predictive black-box model on the validation set. For the model with 10 features, LIME significantly outperforms both LETS and RAND. For the models that have more features LIME is significantly less faithful to the model than LETS and RAND. However, the difference in faithfulness between LETS and RAND depends on the number of features, as RAND is more faithful to the model trained on 10 and 20 features, and there is no significant difference between the two for the bigger data sets.

The results on the test set, shown in Figure 6.4, have a similar pattern to the results on the validation set. The performance of LETS and RAND is quite stable across the different models, while the faithfulness of LIME decreases when the number of features increases.


Figure 6.2: Mean Squared Error of the linear model fitted on the neighborhood compared to the output of the predictive model on the test set. The bars show the mean value over 20 trials, and the whiskers indicate the standard deviation.

Figure 6.3: Mean Squared Error of the explanations on the validation set. The bars show the mean value over 20 trials, and the whiskers indicate the standard deviation.

Summary of the explanations faithfulness evaluation

• In most cases, both LETS and RAND produce significantly more faithful explanations than LIME.

• We have found no evidence that LETS creates more faithful explanations than RAND.

Robustness evaluation of the explanations By design, the neighborhoods generated by LETS for similar instances are also similar, because of the SAX representation and alignment methods. Therefore, the explanations will also be similar. We show that this holds for the model of 20 features, by calculating the alignment score between two instances and how many features co-occur in their explanations. As similar instances should generate similar explanations, the co-occurrence of features should be higher for instances
