

FACULTY OF MEDICINE

Self-criticism

Actor-critic Weight Sharing for Instance-wise Temporal Feature Attribution on Clinical Time Series Classification

Author: Lukas De Clercq* (MSc. student #11886188)
Tutor: Martijn Schut* (ass. Prof. Med. Informatics)
Mentor: Mihaela van der Schaar (Prof. ML & AI for Medicine)

Period: 20/05/2019 - 20/12/2019

Location: Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, CB3 0WA Cambridge, United Kingdom

Abstract

Intensive care units (ICU) provide us with numerous data streams, yet the models processing these have historically been limited to rudimentary scoring systems and simple regression on aggregate features. Recent advancements in deep learning provide us with a new avenue, but applications of these models in medicine are limited due to their “black box” nature. Improving their interpretability can both increase their uptake in critical environments and allow for improved feature engineering. We explore the early stages of this opportunity in terms of both predictive and interpretative performance. For the former, we improve a popular and competitive multivariate time series classification architecture and display its improved performance on an ICU mortality prediction task. For the latter, this architecture is subsequently used in conjunction with our proposed feature attribution technique, which draws on the state-of-the-art INVASE feature selector employing reinforcement learning. Among other improvements, we consolidate its three separate networks into one multitask network, sharing weights between actor and critic. We display its improved interpretative performance when compared to its predecessor and to popular model-specific and -agnostic feature attribution techniques. Additionally, we show how its reinforcement learning nature speeds up training by a significant margin. We dub the developed classification architecture and interpretability technique 1D-ResNet++ and INFEAT, respectively.

Samenvatting (NL)

Intensive care units (ICU) provide us with numerous data streams. Historically, however, the models processing these have been limited to rudimentary scoring systems and simple regression on aggregated features. Recent developments in the field of deep learning offer new possibilities, but the use of this type of model in medicine is limited by its “black box” nature. Improving their interpretability can both increase their uptake in critical environments and enable improved feature engineering for traditional models. We explore the early stages of this opportunity in terms of both predictive and interpretative performance. For the former, we improve a popular and competitive multivariate time series classification architecture and show its improved performance on an ICU mortality prediction task. For the latter, this architecture is subsequently used in combination with our proposed feature attribution technique, based on the state-of-the-art INVASE variable selection algorithm, which employs reinforcement learning. Among other improvements, we consolidate its three separate networks into one multitask network, in which parameters are shared between actor and critic. We demonstrate improved interpretative performance compared to its predecessor and other popular model-specific and -agnostic feature attribution techniques. In addition, we show how the two-stage procedure speeds up model training by a considerable margin. We name the developed classification architecture and interpretation technique 1D-ResNet++ and INFEAT, respectively.

Contents

1 Introduction
2 Related Work
  2.1 Time Series Classification
  2.2 Instance-wise Feature Attribution
  2.3 INVASE
  2.4 Weight Sharing - Multitask Learning
3 Proposed Methods
  3.1 Predictive: 1D-ResNet++
    3.1.1 Residual Convolutional Stack (RCS)
    3.1.2 Classifier Head (CLS)
  3.2 Interpretative: INFEAT
    3.2.1 Default Framework
    3.2.2 Channel-wide Threshold Sampling
    3.2.3 Selector Weight Sharing
    3.2.4 Validation Threshold Randomisation
4 Results & Analysis
  4.1 Dataset
  4.2 Predictive: 1D-ResNet++
  4.3 Interpretative: INFEAT
    4.3.1 Performance
    4.3.2 Speed
    4.3.3 Ablation Analysis
    4.3.4 Qualitative Analysis
5 Conclusion
  5.1 Limitations
  5.2 Future work
Bibliography
A INVASE on IHMC
B INFEAT Training Procedure
C Diagram for full weight-sharing INFEAT

List of Figures

1 1D-ResNet++ building block architectures
2 INFEAT architecture and two-stage training process
3 Selector head
4 Comparison of architectures and backpropagation depth
5 IHMC feature visualisation of the first-fold test set
6 Interpretative performance comparison
7 Ablation analysis
8 Qualitative analysis
9 Application of INVASE to IHMC
10 Selector weight-sharing version of Figure 2
11 INFEAT attribution examples

List of Tables

1 Comparison of instance-wise feature attribution techniques
2 Classifier performance comparison
3 Training and inference speed comparison


1 Introduction

In comparison to other wards, intensive care units consume a relatively large amount of resources due to the severity of their admitted patients’ state. With this resource- and labour-intensive process comes a trove of patient data: both (semi-)static features such as weight and sex, and measurements over time such as vital signs and lab values, regularly and irregularly acquired and nurse-validated (Saeed et al., 2011).

This data-intensive environment has given rise to a range of weighted scoring systems with different motivations, target groups, and applications, one category of which are the physiological assessment or “general severity” scores (Strand and Flaatten, 2008; Bouch and Thompson, 2008). These tend to consist of a score indicating patient state severity and a probability of a certain outcome, typically all-cause in-hospital mortality, calculated from this score (Lemeshow and Le, 1994). Prominent examples within this category include SAPS-II (Le Gall et al., 1993) and APACHE-III (Knaus et al., 1991), both of which use features measured across the first 24 hours of ICU admission to assess patient state severity. Processing of multivariate time series in these scoring systems has been limited to simple aggregate features (i.e. minima/maxima across the measurement range), which are very sensitive to outliers and erroneously validated data. More extensive manual feature engineering typically involves a large search space and domain knowledge. Furthermore, these scoring models make additive and linear relationship assumptions which have been shown to be incorrect, paving the way for non-linear approaches to ICU mortality modelling (Kim et al., 2011).

The recent publications of (anonymised) ICU datasets have opened up this playing field, allowing for the development and application of advanced statistical learning models to ICU data. Prominent examples of these databases include the Philips eICU-CRD (Pollard et al., 2018), AmsterdamUMCdb (Amsterdam Medical Data Science, 2019), and MIMIC-III (Johnson et al., 2016); the last of which has seen many applications of a range of non-linear models (Johnson et al., 2017), in part fueled by the development of a series of standardised tasks and benchmarks in the form of open-source abstractions by Harutyunyan et al. (2019). In line with the recent resurgence of deep learning, many of the competitors on these benchmarks use neural networks in various forms and architectures. Several of these have been shown to significantly outperform existing scoring models (Purushotham et al., 2017; Xia et al., 2019).

The accuracy and efficiency of these deep learning models continue to rise with the advent of new and improved architectures; however, the interpretability of the resulting models remains fairly rudimentary, leading them to be oft described as “black box” models. Applications of these types of models in “production” or live healthcare environments have been scarce, yet opportunities are rife. Development of humanly interpretable architectures or post-hoc explanations is likely to encourage the uptake and improve the use of these models in the medical domain (Miotto et al., 2017; Ahmad et al., 2018). Increased interpretability of “black box”-like models serves many purposes, such as:

• Identification of failure modes (e.g. Hoiem et al., 2012; Tao et al., 2018)

• Increasing (human) trust in the model (e.g. Selvaraju et al., 2017; Poursabzi-Sangdeh et al., 2018)

• Legal compliance (e.g. Goodman and Flaxman, 2017)

• Learning from a model’s reasoning (e.g. Zhu, 2013)

As mentioned, trust is essential for the direct uptake of these models, yet this is not the only possible approach. An improved insight into what features an advanced and strongly performing model deems relevant can also guide feature engineering for simpler, traditional models.

The concept “interpretability” is subject to different interpretations itself, its precise definition and merit appraised differently in each domain. One popular perspective on interpretability consists of developing architectures or post-hoc explanations unveiling which features contribute to the output of the classifier/regressor, which is known as feature attribution or feature-additive explanations (Oana-Maria et al., 2019). Compared to global feature attribution (i.e. for the model as a whole), instance-wise feature attribution, performed for each input or patient separately, and other “local” explanations are arguably more useful in a healthcare setting (Yoon et al., 2018). Furthermore, it allows for post-hoc examination of feature attributions on a subpopulation level by aggregating the attributions thereof. Regardless of what direction is taken with the further implementation of a new deep learning model in ICU settings, be it as a teacher for or directly as a predictive model, improving both predictive and interpretative performance are important, early steps in this process. In this paper we tackle both these fronts. For each we start with a state-of-the-art algorithm or framework and investigate how we can improve these, both in general and for our specific data modality (multivariate time series). These new, improved techniques are evaluated on various performance metrics with appropriately designed tasks, all derived from an ICU mortality prediction benchmark defined by Harutyunyan et al. (2019).

Contributions

• We consolidate advancements from computer vision and natural language processing to improve a popular, state-of-the-art multivariate time series classification architecture

• We present an actor-critic methodology for mapping activations back to the original input channels, allowing for feature attribution across both the temporal and feature dimension, and reinforcing these attributions by feeding the classifier the input sampled according to its generated attribution weights. We show that, incidentally, this procedure improves training speed.

• We provide a series of improvements to this method that greatly improve its interpretative performance on multivariate time series, outperforming competing methods

• We present an empirical method for evaluating feature attribution values on datasets without ground truth for said feature attributions.


2 Related Work

2.1 Time Series Classification

Although univariate time series classification has a long history and a wide range of approaches and benchmarks (Bagnall et al., 2017), its multivariate counterpart has been plagued by the curse of dimensionality, manifesting itself as intractably long training or inference durations (Fawaz et al., 2019a). Fortunately, deep neural networks have opened up new approaches in this domain, most notably convolutional neural networks (CNN). This class of sparsely connected neural networks uses a set of learned kernels to identify patterns irrespective of their location in the spatial/temporal dimension(s). Despite their propensity towards very local patterns or “texture” recognition, they are remarkably apt at image processing (Geirhos et al., 2018). Although they find their origin and main source of popularity in image processing, these algorithms have shown some successes on other data modalities. One-dimensional (temporal) CNN variants are steadily climbing in popularity in the field of physiological signal processing, displaying significant improvements in both speed and predictive performance over “traditional” algorithms (Faust et al., 2018).

Residual Network (ResNet) derivatives, a class of deep (multi-layer) CNNs, use skip connections between (groups of) convolutional layers to facilitate gradient flow, reducing the so-called “vanishing gradient” problem found in networks with many layers (He et al., 2016a). Architectures leveraging this deep residual learning have topped the charts on the ImageNet (Deng et al., 2009) classification benchmark since their conception. Notable examples include ResNeXt’s grouped convolutions (Xie et al., 2017), the Inception family (Szegedy et al., 2017), and the current state-of-the-art EfficientNet (Tan and Le, 2019). Recently, temporal ResNet derivatives have consistently beaten other algorithms on a wide range of time series classification benchmarks. For this paper we take the temporal residual network (1D-ResNet) defined by Wang et al. (2017b), which the recent review by Fawaz et al. (2019a) showed to be state-of-the-art on the majority of tested datasets, and improve it by incorporating techniques from other domains.

2.2 Instance-wise Feature Attribution

Various methods exist for instance-wise attribution in machine learning models. Many are tuned towards the specific model used, ranging from simple feature weights in logistic regression to impurity scores for random-tree-based learners (Louppe et al., 2013).

Tab. 1: Characteristics comparison of discussed instance-wise feature attribution techniques applicable to multivariate time series

Method                             Localization precision   Channel-wise   Architecture dependency
CAM (Zhou et al., 2016)            Coarse grained*          No             FCN-like only
GradCAM (Selvaraju et al., 2017)   Coarse grained*          No             CNN-like only
SHAP (Lundberg and Lee, 2017)      “Pixel”-wise             Yes            Any
INFEAT (ours)                      “Pixel”-wise             Yes            CNN-like only

The predictive model proposed and used in this paper is a CNN variant, specifically a fully convolutional network (FCN). These are a class of CNN that have no fully-connected layers, i.e. every operation is a filter. In combination with a global average (or mean-) pooling to map the output of these filters to class predictions, it acts as a powerful feature extractor while removing some of the black box aspects of CNNs (Lin et al., 2013). This global pooling layer causes the output of the last convolutional layer to contain easily accessible information on feature relevance localisation, as the contribution of each “pixel” or time step to the pooling operators depends on its amplitude at the output of said layer. This characteristic spurred the development of, and is leveraged in, class activation mapping (CAM), a model-specific instance-wise attribution technique. This technique generates a localisation of the relevant features by mapping class activations back to the previous convolutional layer and summing the resulting activations of each filter channel (Zhou et al., 2016). Depending on the convolutional architecture used, there may be a loss in localisation precision due to lack of appropriate padding, strided convolutions, and so forth, leading to a coarse-grained heatmap.
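For concreteness, a minimal sketch of this mapping for a temporal FCN with a plain global-average-pooling head (our illustration in PyTorch; the function and argument names are ours, and the ConcatPool adaptation we actually use is described in Section 4.3.1):

```python
import torch

def cam_1d(feature_maps, class_weights, class_idx):
    """Class activation map over the temporal axis of a 1D FCN.

    feature_maps:  (channels, time) output of the last convolutional layer
    class_weights: (classes, channels) weights of the final linear layer
    class_idx:     index of the class to explain
    """
    w = class_weights[class_idx]                   # (channels,)
    # Weight each filter channel by its contribution to the chosen class,
    # then sum over channels for a per-time-step relevance score.
    return (w[:, None] * feature_maps).sum(dim=0)  # (time,)
```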

GradCAM, a generalisation of CAM with less stringent requirements for CNNs (e.g. allowing fully-connected layers), was provided by Selvaraju et al. (2017). As the network architecture we use for evaluation (described in Section 3.1) qualifies for the default CAM methodology, we make no use of this extension.

Alternatives to these model-specific methods have gained significant attention recently (e.g. Štrumbelj and Kononenko, 2014; Ribeiro et al., 2016). Recent work by Lundberg and Lee (2017) provides a unified approach for a wide range of local feature attribution methods. This framework, dubbed SHAP (SHapley Additive exPlanations), provides an entirely model-agnostic local explainer (KernelSHAP) and a range of more model-specific explainers that speed up the generation of these Shapley values (i.e. feature attributions) by making model assumptions and/or approximations. For this paper we consider one of the latter: DeepSHAP, an approximative explainer based on Shrikumar et al. (2017)’s DeepLIFT.

2.3 INVASE

While borrowing ideas from CAM, our proposed feature attribution technique is ultimately derived from INVASE (Yoon et al., 2018), a reinforcement learning framework which leverages the Actor-Critic methodology (Peters et al., 2005) to perform state-of-the-art instance-wise feature selection, meaning that it generates a selection mask for each entry in the dataset rather than a single global one. This acts as both a regulariser and a means to model interpretability. While the latter is not the focus of the framework, its generated selection probabilities can be considered feature attribution weights. In brief, INVASE consists of three concurrently trained subnetworks: Baseline, Selector, and Predictor. The Baseline is a classifier architecture trained on full, “untampered” data to generate class probability predictions. The Predictor shares the same architecture but receives a subset of that data, where the feature censoring/selection mask is determined by a Bernoulli sampling of selection probabilities generated by the Selector network. The latter is constructed similarly to segmentation architectures, as its output and input have to share the same dimensions.

Both classifiers are trained on the cross-entropy loss versus the label. The divergence of these losses is subsequently used to train the Selector, driving it to discover a selection function able to identify the relevant features (for the classification task) for each data point. We elaborate on this “Selector loss” in Section 3.2.

The number of selected parameters does not need to be provided in advance, yet the framework does contain a “selection penalty” hyperparameter that steers this indirectly. When applying a temporally-adapted INVASE to our in-hospital-mortality use-case (as described in Section 1 and defined in Section 4.1), our selection ratio at Predictor convergence approximates the expected value of the Bernoulli sampling on a logit-standard distribution (0.5) for negligible selection penalties. Details of this experiment and its results can be found in Appendix A.

2.4 Weight Sharing - Multitask Learning

Weight sharing (equiv. parameter sharing) is the practice of sharing a section of layers/weights amongst different neural (sub)networks with compatible architectures and, typically, different tasks. The degree to which these are shared differs; Ruder (2017) defines two types: hard and soft sharing. The former refers to a class of networks with shared layers and task-specific heads/outputs, commonly known as the classic multitask network as defined by Caruana (1997). Its soft counterpart keeps these networks separate, but limits the divergence of their “shared” weights. Many methods exist for the latter, but as this paper focuses on hard weight sharing, these will not be elaborated on; refer to Ruder (2017) for an overview.

Nowadays, weight sharing is used in many settings involving “agents” or interacting neural networks. Under a shared latent space assumption, the shared layers are hypothesised to be able to operate on different domains and/or in function of different tasks. For instance, architectures employing two generative adversarial networks (GAN) trained on differing domains have used weight sharing to limit divergence among the generator/discriminator subnetworks of both GANs and force them to focus their capacity on the common ground between the domains to discover shared high-level representations (Liu and Tuzel, 2016; Liu et al., 2017).

Reinforcement learning has seen its fair share of parameter sharing as well, yielding significant performance improvements (Foerster et al., 2016; Gupta et al., 2017). We make note of the current practice of hard weight sharing of convolutional layers within Actor-Critic methodologies (e.g. Mnih et al., 2016; Wang et al., 2017a; Wu et al., 2017), although the practitioners do not elaborate on its use aside from the concept of actor and critic learning from a shared representation, essentially treating the shared layers as a learned pre-processing/feature extraction on a shared domain.


3 Proposed Methods

3.1 Predictive: 1D-ResNet++

Inspired by the recently displayed successes of deep residual learning, the architecture we propose (1D-ResNet++) is roughly based on Wang et al. (2017b)’s time series classification 1D-ResNet. We adopt a “building block” approach in its description to facilitate its incorporation in our proposed feature attribution framework presented in Section 3.2. Both blocks combined have the characteristics of an FCN, making it suitable for both classification and segmentation (Long et al., 2015).

[Fig. 1: 1D-ResNet++ building block architectures. (a) Residual convolutional stack: a 1×5 Conv (64) stem followed by residual sub-blocks of ReLU → BatchNorm → 1×3 Conv (64), twice per sub-block. (b) Classifier head: ConcatPool → FC (2) → SoftMax.]

3.1.1 Residual Convolutional Stack (RCS)

The first block, acting as a feature extractor, consists of one temporal convolutional layer ($k_1 = 5$), followed by five residual temporal convolutional sub-blocks of two temporal convolutional layers each ($k_2 = 3$), as shown in Figure 1a, giving it a receptive field of:

$$((k_1 - 1) + 5 \cdot 2(k_2 - 1)) + 1 = 25$$

This deviates from the original 1D-ResNet by a reduced width and block size, an increased depth, and changed kernel sizes, all in line with He et al. (2016a)’s findings in the development of the original two-dimensional ResNet and its successors. Furthermore, we incorporate a few proven techniques from computer vision:

1. The last batch normalization (BatchNorm) layer (Ioffe and Szegedy, 2015) in each residual block is initialised with weights γ = 0 to provide a true identity shortcut in the early stages of training (Goyal et al., 2017). The weights of other layers as well as all biases are initialised according to He et al. (2015).

2. The residual blocks are of the “full pre-activation” variant, which He et al. (2016b) have shown to be superior to the default configuration used in Wang et al. (2017b).

3. Note, however, that we switch the ReLU activation and BatchNorm layers around in order to reflect experimental findings such as in Zagoruyko and Komodakis (2016).

All vectors undergoing convolution operations are zero-padded symmetrically with length $(k - 1)/2$, with $k$ being the kernel size. This ensures that the time series length at the output of each convolution operation is equal to its input. A sketch of the resulting stack is given below.
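A minimal PyTorch sketch of the stack as described above (ours; class and layer names are illustrative, not the reference implementation):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Full pre-activation residual sub-block with the ReLU/BatchNorm order
    swapped, as described in point 3 above."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        pad = (k - 1) // 2                       # length-preserving padding
        self.bn1, self.bn2 = nn.BatchNorm1d(ch), nn.BatchNorm1d(ch)
        self.conv1 = nn.Conv1d(ch, ch, k, padding=pad)
        self.conv2 = nn.Conv1d(ch, ch, k, padding=pad)
        nn.init.zeros_(self.bn2.weight)          # gamma = 0: near-identity branch at init (point 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv1(self.bn1(self.relu(x)))
        out = self.conv2(self.bn2(self.relu(out)))
        return x + out                           # skip connection

class RCS(nn.Module):
    """Residual convolutional stack: one k=5 stem conv plus five residual
    sub-blocks, yielding the receptive field of 25 computed above."""
    def __init__(self, in_ch, ch=64):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, ch, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(5)])

    def forward(self, x):                        # x: (batch, channels, time)
        return self.blocks(self.stem(x))
```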

Adding one-dimensional Squeeze-Excite operators (Hu et al., 2018) provided no appreciable performance improvements, nor did adding CoordConv channels (Liu et al., 2018) to either the first, last, or all convolutional layers. We deem temporal convolutions on synchronised time series to be one of the few settings in which it makes contextual sense to provide absolute localisation capacities, yet observe no improvements from doing so.

3.1.2 Classifier Head (CLS)

The head, shown in Figure 1b, remains largely unchanged compared to its predecessor. The only difference is found in the pooling operator. We use Howard and Ruder (2018)’s ConcatPool, which simply performs a global mean- and maxpooling and concatenates their results, generating the mean and maximum value for each of the RCS’s 64 output channels. These are fed into a single fully connected layer (FC), with width equal to the number of output classes, with SoftMax applied to generate class probabilities.

Intuitively, a global maxpooling drives the convolutional filters to identify patterns for which only the highest amplitude encountered is relevant to the classification, whereas the meanpooling drives them to identify features for which both the quantities and amplitudes matter. A concatenation of both allows the convolutions to focus on either, yet note that both pooling operators will drive the convolutions to express relevancy across the temporal dimension as amplitude. We leverage this expression in Section 3.2.3. A sketch of this head follows below.
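A minimal sketch of this head (ours, continuing the RCS sketch above; names are illustrative):

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """ConcatPool (global mean- and max-pool concatenated) followed by a single
    fully connected layer and SoftMax."""
    def __init__(self, ch=64, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(2 * ch, n_classes)

    def forward(self, x):                        # x: (batch, ch, time)
        pooled = torch.cat([x.mean(dim=2), x.amax(dim=2)], dim=1)  # (batch, 2*ch)
        return torch.softmax(self.fc(pooled), dim=1)               # class probabilities
```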


3.2 Interpretative: INFEAT

3.2.1 Default Framework

Whereas INVASE chooses to focus on feature selection, we leverage its Actor-Critic dynamic to produce instance-wise feature attribution values. Architecturally, our main deviation is the number of subnetworks: we wish to train a single network able to operate on both masked and unmasked data. To this end, we only have one Classifier network acting as both Baseline and Predictor and one Selector network, as depicted in Figure 2. This de facto weight sharing between the classifiers allows us to condition a single classifier to maximise its performance across all selection ratios by comparing its performance on the full and masked data. In addition, this has the effect of limiting divergence between Baseline and Predictor losses, preventing the Selector loss derived thereof from exploding.

Schulman et al. (2017) state that such a shared network requires a combined loss for policy and value function. In spite of that, and contrary to INVASE’s fully concurrent training, we propose an architecture in which weights are shared among Actor and Critic, but alternatingly so, across different input domains and their respective losses (weighted equally). We calculate Baseline and Predictor losses on the classifier head output when the network is fed the full and masked time series (of the same batch) respectively. Figure 2 depicts the two-stage training procedure for this configuration. Although heavily inspired by it, this procedure is more elaborate than that of its feature-selecting predecessor.

While we make use of the 1D-ResNet++ defined in Section 3.1 for the Classifier, any classifying architecture can take its place. We adapt the same architecture to function as Selector by replacing its classifier head with a selection head (SEL), shown in Figure 3. This is a simple block that translates the number of output channels of the RCS to a number equal to that of the input data. This is followed by a sigmoid non-linearity to map these logit values to selection probabilities. The purpose of this block is to learn to map the RCS’s last layer’s activations back to the original input channels. Note that there are no pooling operations or strided convolutions in the RCS. Combined with the appropriate padding described in Section 3.1, this eliminates the need for convolution with fractional strides (“upconvolution”) in the Selector head to allow for fine-grained localisation.

[Fig. 2: INFEAT architecture and two-stage training process depicted for one sample: (a) Critic-stage, (b) Actor-stage. Dotted lines indicate backpropagation.]

[Fig. 3: Selector head: BatchNorm → 1×1 Conv (C channels) → +β_r → Sigmoid]

Critic-stage

Assume a multivariate time series dataset of $N$ entries across $G$ classes. Define the features as $X = \{x_{1,1,1}, \ldots, x_{N,C,T}\}$, consisting of $C$ equally and regularly sampled time series vectors (or “channels”) of length $T$, and the corresponding class labels as $Y = \{y_{1,1}, \ldots, y_{N,G}\}$.

The first “Critic” stage uses any data entry $x_n$ for both subnetworks. The Classifier’s output in this stage is used as the Baseline class probabilities $\hat{y}_n^{(bln)}$, from which we calculate the Baseline cross-entropy loss $\mathcal{L}_n^{(bln)} = H(y_n, \hat{y}_n^{(bln)})$.

The Selector, in its turn, generates a vector of selection probabilities $P_n = \{p_{1,1}, \ldots, p_{C,T} \in [0,1]\}$, identical in dimensions to $x_n$. This output is sampled to generate a binary selection mask $M_n = \{m_{1,1}, \ldots, m_{C,T} \in \{0,1\}\}$. During training this is performed using a Bernoulli sampling taking $P_n$ as the individual success probabilities. At test time a threshold is drawn at the expected value of said sampling at initialisation, i.e. 0.5, so that:

$$m_{c,t} = \begin{cases} 1 & \text{if } p_{c,t} \geq 0.5 \\ 0 & \text{if } p_{c,t} < 0.5 \end{cases}$$

While Yoon et al. (2018) do not elaborate on this, multiplication of z-standardised data with this mask is equivalent to replacing non-selected variables with the training set means.
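A minimal sketch of this sampling step (ours, in PyTorch):

```python
import torch

def sample_mask(p, training=True):
    """Binary selection mask M from selection probabilities P, shape (batch, C, T):
    independent Bernoulli draws during training, a fixed 0.5 threshold at test time."""
    if training:
        return torch.bernoulli(p)
    return (p >= 0.5).float()
```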


Actor-stage

The second stage feeds the masked data $\hat{x}_n = M_n \odot x_n$ to the Classifier, generating the Predictor class probabilities and the corresponding loss $\mathcal{L}_n^{(prd)} = H(y_n, \hat{y}_n^{(prd)})$.

The Selector loss $\mathcal{L}_n^{(sel)}$ is calculated, as in INVASE, from the Kullback-Leibler divergence between the full- and masked-data conditional distributions of the outcome. We redefine INVASE’s mono-dimensional (“tabular”) feature vector as a $C \times T$ matrix, resulting in the following loss function (for the full derivation thereof refer to Yoon et al. (2018)):

$$\mathcal{L}_n^{(sel)} = D_n H_n \tag{I}$$

$$D_n = \sum_{g=1}^{G} y_{n,g} \left( \hat{y}_{n,g}^{(bln)} - \hat{y}_{n,g}^{(prd)} \right) \tag{II}$$

$$H_n = \sum_{c=1}^{C} \sum_{t=1}^{T} m_{c,t} \log(p_{c,t}) + (1 - m_{c,t}) \log(1 - p_{c,t}) \tag{III}$$

Originally, INVASE’s Selector loss contains a second term, $\lambda \bar{M}_n$, representing a “selection penalty”, with $\lambda$ a hyperparameter used to tune the Selector’s performance. It prevents the Selector from simply selecting all variables in its strive to minimise divergence between the Predictor and Baseline losses. While INVASE’s selection probabilities may be indicative of feature importance, they are optimised for the selection ratio the networks converge to. We wish to condition for all selection ratios, so we opt to remove this selection penalty. Instead we shift the output of the convolutional layer in the Selector during training by a random logit bias $\beta_r$ (depicted in Figure 3):

$$\beta_r = \log \frac{b_r}{1 - b_r} \tag{IV}$$

with $b_r$ sampled uniformly from the interval $[0, 1]$.

This directly affects $\mathcal{L}^{(prd)}$ through the randomisation of each entry’s selection ratio, which in its turn affects $\mathcal{L}^{(sel)}$, thus allowing for optimisation of both Classifier and Selector across the entire selection ratio range.
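A minimal sketch of the resulting Selector loss (I)-(III) and random logit bias (IV) (ours, in PyTorch; tensor names and the small epsilon for numerical safety are our additions):

```python
import torch

def selector_loss(p, mask, y_bln, y_prd, y_true, eps=1e-8):
    """Selector loss per the equations above. Shapes: p and mask (B, C, T);
    y_bln, y_prd, y_true (B, G), with y_true one-hot."""
    d = (y_true * (y_bln - y_prd)).sum(dim=1)                   # (B,), term (II)
    h = (mask * torch.log(p + eps)
         + (1 - mask) * torch.log(1 - p + eps)).sum(dim=(1, 2)) # (B,), term (III)
    return (d * h).mean()                                       # batch mean of (I)

def random_logit_bias(batch_size):
    """beta_r = logit(b_r) with b_r ~ U(0, 1), added to the Selector's
    pre-sigmoid output during training (eq. IV)."""
    b = torch.rand(batch_size).clamp(1e-6, 1 - 1e-6)            # avoid infinities
    return torch.log(b / (1 - b))
```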

Across both stages, backpropagation for the Classifier is performed twice: on the losses of both the full and the masked data ($\mathcal{L}^{(bln)}$ and $\mathcal{L}^{(prd)}$ respectively). In essence, this procedure forces the entire network to re-evaluate and reinforce its decisions of what it finds relevant, in a form of what can be described as (semi-supervised) multitask attention focusing (Caruana, 1997).

We dub this default configuration INFEAT_d (Instance-wise Feature Attribution). With the additions described in the following sections it is referred to as INFEAT and embodies the full extent of our proposed algorithm.


[Fig. 4: Comparison of architectures and backpropagation depth for (a) the default and (b) the Selector weight sharing configuration (backpropagation indicated by blue for the Critic stage and red for the Actor stage)]

3.2.2 Channel-wide Threshold Sampling

In a temporal configuration, the Bernoulli sampling inherently generates noise across the temporal dimension of the masked input, as each sample is taken entirely independently of its neighbours. We hypothesise this to be detrimental to the Selector’s and Classifier’s learning, which is largely based on convolutions and thus reliant on contiguity. Therefore we introduce a channel-wide threshold sampling, in which we sample a vector $V = \{v_1, \ldots, v_C\}$ uniformly from the interval $[0, 1]$ for each entry that passes through the network. Generating the mask then follows logically as:

$$m_{c,t} = \begin{cases} 1 & \text{if } p_{c,t} \geq v_c \\ 0 & \text{if } p_{c,t} < v_c \end{cases}$$
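A minimal sketch of this sampling (ours):

```python
import torch

def channel_wide_mask(p):
    """One uniform threshold v_c per entry and channel, shared across the
    temporal dimension so masks stay contiguous in time. p: (batch, C, T)."""
    B, C, _ = p.shape
    v = torch.rand(B, C, 1, device=p.device)    # broadcasts over time steps
    return (p >= v).float()
```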

3.2.3 Selector Weight Sharing

Knowing that the output of the Classifier’s feature extractor (RCS) contains feature importance localisation information due to the global pooling operation, we make a shared latent space assumption: we hypothesise that allowing the Selector to learn from this representation instead of the raw input $X$ reduces the required complexity of the selection function to be learned. We propose tying the feature extractors (and thus the majority) of both subnetworks together, essentially creating a multitask network with hard parameter sharing. In our implementation, this design choice is facilitated by the RCS’s FCN-like architecture, which minimises information bottlenecks until the pooling operation and is apt at extracting features for both classification (label prediction) and segmentation (mapping $x_n$ to $P_n$). This configuration puts more restrictions on the network architecture, as it requires a Selector head to be able to tap into the Classifier’s feature extractor. For our implementation this is fulfilled by a single RCS with one Classifier head and one Selector head, for the Predictor/Baseline and Selector respectively. Figure 4 shows the differences in configuration and backpropagation between the regular and weight sharing INFEAT variants. The two-stage training procedure for this implementation is summarised in Appendix B. A Selector weight sharing version of Figure 2 is included in Appendix C.
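A minimal sketch of this shared configuration, reusing the RCS and ClassifierHead sketches from Section 3.1 (ours; the wiring and names are illustrative, not the reference code):

```python
import torch
import torch.nn as nn

class INFEATShared(nn.Module):
    """One shared RCS trunk with a Classifier head (CLS) and a Selector head
    (SEL: BatchNorm -> 1x1 conv back to the C input channels -> sigmoid)."""
    def __init__(self, in_ch, ch=64, n_classes=2):
        super().__init__()
        self.rcs = RCS(in_ch, ch)                  # shared feature extractor
        self.cls = ClassifierHead(ch, n_classes)
        self.sel_bn = nn.BatchNorm1d(ch)
        self.sel_conv = nn.Conv1d(ch, in_ch, kernel_size=1)

    def forward(self, x, beta_r=None):
        z = self.rcs(x)
        logits = self.sel_conv(self.sel_bn(z))
        if beta_r is not None:                     # random logit bias (IV), training only
            logits = logits + beta_r.view(-1, 1, 1)
        return self.cls(z), torch.sigmoid(logits)  # class probs, selection probs P
```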

3.2.4 Validation Threshold Randomisation

Early stopping (Morgan and Bourlard, 1990) for the model is performed on the Predictor loss of the validation set. In INVASE this is done with a selection threshold of 0.5, as mentioned in Section 3.2, yet for INFEAT we are interested in performance across the whole range. We replace it during validation by randomly sampling a threshold $u$ uniformly from $[0, 1]$ for each sample, so that:

$$m_{c,t} = \begin{cases} 1 & \text{if } p_{c,t} \geq u \\ 0 & \text{if } p_{c,t} < u \end{cases}$$

4 Results & Analysis

We evaluate both proposed algorithms in terms of performance and speed. At the core of each task is an identical time series classification task with a binary label, where predictive performance is expressed as area under the receiver operating characteristic curve (AUROC) and precision-recall curve (AUPRC). Performance confidence intervals (CI) are provided through 20-fold Monte Carlo cross-validation, maintaining identical folds across all experiments. In the case of overlapping CIs pertaining to our proposed algorithms, we establish statistical significance with a dependent t-test for paired samples.

4.1 Dataset

The benchmark used for all experiments is a variant of the MIMIC-III In-hospital Mortality Classification, a synchronised time series classification task with a binary mortality label, developed by Harutyunyan et al. (2019) and derived from the MIMIC-III intensive care database (Johnson et al., 2016). While we use all of the original benchmark’s 16 variables, we deviate from their preprocessing methodology by treating each variable as continuous for the sake of simplicity in this proof-of-concept. To comply with Teasdale and Jennett (1974)’s original definition of the Glasgow Coma Score (GCS), we include an additional “Intubation” variable to differentiate between minimal verbal GCSs due to intubation/tracheotomy (i.e. disability of the vocal cords) and those due to other causes. We denote this benchmark as IHMC.

In summary, each sample consists of 17 time series (or “channels”) with a time step of 1 hour, measured in the first 48 hours of an intensive care admission. A visualisation of these variables, along with a rank correlation analysis, is given in Figure 5. The dataset contains a total of N = 21,139 samples, of which 2,797 (13.23%) carry a positive label (i.e. all-cause death within that hospital admission). Inversely class-weighted cross-entropy (Huang et al., 2016) provided no appreciable performance improvements. We deem the class imbalance insignificant enough to not take further measures. A description of all variables can be found in Appendix D.

[Fig. 5: IHMC feature visualisation of the first-fold test set: (a) feature rank correlation, (b) z-standardised feature means over the first 48 hours, (c) label rank correlation, where (-) indicates p > 0.05. The 17 channels are GCS eye, GCS motor, GCS verbal, Temperature, Heart Rate, Sys. BP, Dias. BP, Mean BP, Resp. rate, FiO2, SpO2, pH, Glucose, Height, Weight, Abn. CRT, and Intubated.]

All variables are z-score standardised on-the-fly using training set channel means and standard deviations. The individual sequences are imputed using linear interpolation, and carry-backward/forward for the start/end extremities respectively. Channels without any measurement in the 48-hour range are imputed with “normal” channel values as defined by Harutyunyan et al. (2019); see Appendix D. We maintain a train-validation-test split ratio identical to the introductory paper (70-15-15).
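A minimal sketch of this per-channel imputation (ours, using pandas; the “normal” default values are those of Appendix D):

```python
import pandas as pd

def impute_channel(values, normal_value):
    """Linear interpolation inside the measured range, carry-backward/forward at
    the start/end extremities, and a 'normal' default for fully missing channels."""
    s = pd.Series(values)
    if s.notna().sum() == 0:
        return s.fillna(normal_value)
    return s.interpolate(method="linear").bfill().ffill()
```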

4.2 Predictive: 1D-ResNet++

1D-ResNet++ is evaluated in its role as classifier and compared against a commonly used logistic regression baseline with LASSO regularisation (Tibshirani, 1996) and its predecessors as defined by Wang et al. (2017b). Results thereof on the IHMC task can be found in Table 2. Our adaptations have significantly improved the model’s performance over the 1D-ResNet from which it is derived (p < 0.01 for both metrics) without any significant increase in training time (p = 0.66).

Tab. 2: Mean predictive performance and training time (95% CI) on the IHMC of our proposed architecture, its predecessors, and a logistic regression baseline.

Model         AUROC           AUPRC           Training time (s)
LR-LASSO      0.835 ± 0.005   0.471 ± 0.011   2.44 ± 0.32
1D-FCN*       0.842 ± 0.006   0.497 ± 0.013   19.58 ± 3.19
1D-ResNet*    0.843 ± 0.005   0.501 ± 0.012   33.86 ± 5.56
1D-ResNet++   0.850 ± 0.006   0.518 ± 0.016   35.15 ± 4.18

* Wang et al. (2017b)

4.3 Interpretative: INFEAT

4.3.1 Performance

INFEAT is compared against its predecessor INVASE and the feature attribution methodologies described in Section 2.2. As described in Section 3.2, this INFEAT implementation makes use of the 1D-ResNet++ building blocks. CAM and SHAP are run on a standalone 1D-ResNet++ classifier. While Long et al. (2015) do not specify how a combined pooling (such as the ConcatPool used in our 1D-ResNet++) should be handled in CAM, our implementation simply adds the linear layer’s weights of each channel’s mean and maximum, which are subsequently mapped back to the last convolutional layer’s output. For both CAM and SHAP, the magnitudes of the feature explanations of the model’s predicted class were taken (i.e. not the ground truth class labels). We use Lundberg and Lee (2017)’s DeepExplainer to approximate SHAP values, as non-approximative SHAP methods were intractably slow for our test set size and number of folds. SHAP is the only one of the tested methods that requires non-trivial computation besides a single forward pass to derive feature attributions.

We evaluate the feature attributions from a ranking perspective, for which we have designed the following task: for each entry, mask out the bottom r fraction of features as ranked by their generated attributions and assess model performance on this masked data using the default metrics (AUROC/AUPRC). We explore r in the entire [0, 1] interval. The philosophy behind this task is simple: if a model performs better when provided the top (1 − r) features as ranked by one set of feature attributions than by another, then those feature attributions are inherently more accurately ranked. A sketch of one evaluation point of this task is given below.
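A minimal sketch of this evaluation (ours; `model_fn`, argument names, and shapes are illustrative, and zeroed z-standardised features equal training-set means as noted in Section 3.2):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ranked_masking_auroc(model_fn, X, y, attributions, r):
    """Mask the bottom-r fraction of features per entry, ranked by attribution,
    and score the model on the masked data. X, attributions: (N, C, T) float
    arrays; y: (N,); model_fn returns positive-class probabilities."""
    N = X.shape[0]
    flat = attributions.reshape(N, -1)
    k = int(r * flat.shape[1])                          # features to drop per entry
    order = np.argsort(flat, axis=1)                    # ascending attribution
    mask = np.ones_like(flat)
    np.put_along_axis(mask, order[:, :k], 0.0, axis=1)  # zero the bottom-r features
    return roc_auc_score(y, model_fn(X * mask.reshape(X.shape)))
```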

[Fig. 6: Comparison of mean performance (95% CI) on the IHMC when selecting variables in order of ranked attribution (AUROC and AUPRC versus selection ratio for INFEAT, SHAP, INVASE, and CAM). Performance across the range of selection ratios displays ability to rank feature importance.]

The results of this task on the IHMC benchmark can be seen in Figure 6. Observe that both SHAP and INFEAT display a strong recognition of the most important features, indicated by a rapid climb and strong results for low selection ratios, yet SHAP underperforms in properly ranking the less important features. As we can see, INFEAT attains performance identical to the standalone 1D-ResNet++ classifier for a selection ratio of 1 (p = 0.08), whereas INVASE fails to do so. Across the entire range, INFEAT displays a monotonic increase and significantly outperforms its competitors, displaying the merit of our method.

4.3.2 Speed

Table 3 shows a comparison of training and inference speeds and displays the effect of adding INFEAT’s two-stage reinforcement procedure. While this was not our focus nor an intended result, we observe a significant speedup in training while still achieving predictive performance identical to a standalone 1D-ResNet++ classifier. Disabling INFEAT’s second backpropagation on $\mathcal{L}^{(prd)}$ decreased interpretative performance and increased training time to standalone 1D-ResNet++ levels, as is expected. Note that the added selection head does incur a slight inference penalty.

Tab. 3: Mean speed comparison (95% CI) of model training across 20 folds and inference for one entry. Observe that the addition of INFEAT’s reinforcement procedure yields faster convergence.

Model                  Training (s)    Inference (ms)
1D-ResNet++            35.15 ± 4.18    1.41 ± 0.00
INFEAT + 1D-ResNet++   28.60 ± 1.43    1.98 ± 0.00

4.3.3 Ablation Analysis

In order to assess the contributions of each addition we compare the model’s performance when these are added sequentially. The results are presented in Figure7. Note that our shared latent space assumption for Selector weight sharing holds up, as does our hypothesis of reduced noise across the temporal dimension facilitating the feature extractor’s training. Note that the increases in performance are paired with a tightening of the confidence intervals.

[Fig. 7: Ablation analysis mean performance (95% CI) across the range of selection ratios on the IHMC (AUROC and AUPRC versus selection ratio for INFEAT_d, + channel-wide threshold sampling, + weight sharing, and + test-time random threshold)]

4.3.4 Qualitative Analysis

While a close investigation of selected features on an instance- or subpopulation-wise level is reserved for future work, we can easily aggregate all test-set attributions. These are presented in Figure 8. Figure 8a gives us some indication of the model’s proclivity towards certain values at different points in time. For instance, cardiovascular vitals seem important in the very early stages of admission, whereas values potentially indicating a post-operative state, such as temperature and intubation, peak after two hours. Further investigation of the post-operative subpopulation’s attributions may result in better, subpopulation-level aggregate features. Figure 8b gives us an idea of the level of “individualisation” of our attribution, i.e. the variability thereof. This can be indicative of subpopulation dynamics. Finally, Figure 8c displays the rank correlation between the attributions and outcome labels. Note the strong signal amplification when compared to Figure 5c, indicative of proper functioning of the feature extraction and the Selector pathway. A selection of individual patients’ features and their attributions can be found in Appendix E.

[Fig. 8: Test set INFEAT attributions analysis: (a) median, (b) median absolute deviation, (c) label rank correlation, where (-) indicates p > 0.05]

5 Conclusion

This project explored the first stages of porting advancements in deep learning to an ICU setting. Both our methods, the predictive model and the explanations thereof, show significant improvement compared to their predecessors.

We deem our 1D-ResNet++ to be a great example of how techniques from other (deep learning) subdomains are valuable in improving neural network architectures; no novel techniques were introduced for this architecture, at most an adaptation to a one-dimensional setting for the techniques stemming from image processing, yet we display a significant improvement over a state-of-the-art architecture. This exemplifies not only the need for improved cross-domain cooperation but also the portability of these techniques to other data modalities.

Our feature attribution method, on the other hand, does introduce several novelties. Starting from the INVASE framework, we carefully engineered what Oana-Maria et al. (2019) refer to as a feature-additive explainer from a feature-selecting perspective by optimising across all selection ratios. Doing so improved predictive performance up to the level of the base network and eliminated the λ selection penalty hyperparameter, and with it the need for an expensive optimisation thereof. Subsequently, we improved the interpretative performance of this baseline by:

1. Developing a more appropriate sampling strategy by taking into account the data modality

2. Unveiling another weight sharing opportunity through examination of our convolutional architecture

3. Introducing a limited stochasticity to the early stopping criterion to facilitate finding an optimum balanced for all selection ratios

The contribution of each of these novelties displays the importance of considering the full picture, from the data one operates on to the training procedure of the model, when attempting to improve a baseline.

5.1 Limitations

We only evaluated 1D-ResNet++ on one benchmark of one dataset. Extrapolation of its performance to other multivariate time series classification tasks is speculative. It is our opinion that the poor availability of high-quality multivariate time series datasets presents issues; we find the majority of those present in Bagnall et al. (2018)’s archive to contain too few samples to support deep learning. It is to be noted that feature attributions can be fragile with respect to the available features in the dataset and the applied preprocessing (Lipton, 2016). In the case of highly correlated features this may manifest itself as instability in the model’s preference thereof under differently initialised training runs. Our methodology assumes no categorical values and treats those present as continuous for pre-processing, imputation, z-standardisation, and feature attribution. Non-standardised categorical values, especially dichotomous ones, may give a distorted and possibly misleading image of the number/ratio of selected variables and their attributed importance, as the selection may effectively be nothing more than a lossless re-encoding of the variable.

5.2 Future work

First and foremost, we envision further fine-tuning of our proposed model for ICU mortality prediction as a stronger model’s explanations are inherently more valuable than those of an underperforming one. This may entail inclusion of more variables (especially static variables such as sex), categorical “dummy” variables, more advanced (multiple) imputation, ...

Application of the INFEAT framework to other data modalities is yet to be tested. It is uncertain whether its methodology will transfer to image classification, where “pixel-wise” correlations as in Figure 5c are not as pronounced, but the successes of CAM and INVASE on this datatype do indicate potential.

Mapping back CAM-like class-weighted activations through the Selector, rather than the raw activations, was briefly experimented with but provided no improvements. It would be interesting to see why this failed and how the task influences this (e.g. multiclass settings).

At the time of writing, Fawaz et al. (2019b) have a pre-publication paper on a novel, strongly performing univariate time series classification CNN. They use a methodology similar to ours in terms of using developments from image processing to improve their baseline. Notably, it contains much larger kernel sizes than our architecture, enabling a larger receptive field and facilitating exploration of trends over time in addition to local “textures”/patterns. It is as yet uncertain how well it performs on multivariate time series, but we see opportunities for introducing several of the techniques we used in 1D-ResNet++ to it.

Finally, despite their significantly differing training procedures, our models with and without INFEAT attain a similar predictive performance. We are interested to see whether adding INFEAT influences which features are discovered and whether it increases adversarial robustness, similar to Noack et al. (2019)’s findings.

Acknowledgements

The authors would like to thank the University of Cambridge, its Faculty of Mathematics, and the Machine Learning & Artificial Intelligence for Medicine research group for accommodating and supporting this research. Special thanks to Jinsung Yoon for his valuable feedback on the early versions. This project was co-funded by the Erasmus+ Programme of the European Union.


Bibliography

Ahmad, M. A., C. Eckert, and A. Teredesai

2018. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Pp. 559–560. ACM. Amsterdam Medical Data Science

2019. AmsterdamUMCdb 1.0. Amsterdam UMC. Accessed 29/12/2019.

Bagnall, A., H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh

2018. The uea multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075. Bagnall, A., J. Lines, A. Bostrom, J. Large, and E. Keogh

2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31:606–660.

Bouch, D. C. and J. P. Thompson

2008. Severity scoring systems in the critically ill. Continuing Education in Anaesthesia, Critical Care & Pain, 8(5):181–185.

Caruana, R.

1997. Multitask learning. Machine learning, 28(1):41–75. Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei

2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, Pp. 248–255. IEEE.

Faust, O., Y. Hagiwara, T. J. Hong, O. S. Lih, and U. R. Acharya

2018. Deep learning for healthcare applications based on physiological signals: A review. Computer methods and programs in biomedicine, 161:1–13.

Fawaz, H. I., G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller

2019a. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963.

Fawaz, H. I., B. Lucas, G. Forestier, C. Pelletier, D. F. Schmidt, J. Weber, G. I. Webb, L. Idoumghar, P.-A. Muller, and F. Petitjean

2019b. Dreamtime: Finding alexnet for time series classification. arXiv preprint arXiv:1909.04939. Foerster, J., I. A. Assael, N. de Freitas, and S. Whiteson

2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, Pp. 2137–2145.

Geirhos, R., P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel

2018. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.

Goodman, B. and S. Flaxman

2017. European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine, 38(3):50–57.

Goyal, P., P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Gupta, J. K., M. Egorov, and M. Kochenderfer

2017. Cooperative multi-agent control using deep reinforcement learning. In International Confer-ence on Autonomous Agents and Multiagent Systems, Pp. 66–83. Springer.

Harutyunyan, H., H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan

(23)

He, K., X. Zhang, S. Ren, and J. Sun

2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, Pp. 1026–1034.

He, K., X. Zhang, S. Ren, and J. Sun

2016a. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, Pp. 770–778.

He, K., X. Zhang, S. Ren, and J. Sun

2016b. Identity mappings in deep residual networks. In European conference on computer vision, Pp. 630–645. Springer.

Hoiem, D., Y. Chodpathumwan, and Q. Dai

2012. Diagnosing error in object detectors. In European conference on computer vision, Pp. 340–353. Springer.

Howard, J. and S. Ruder

2018. Universal language model fine-tuning for text classification. In 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Pp. 328–339, Melbourne, Australia. Association for Computational Linguistics.

Hu, J., L. Shen, and G. Sun

2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Pp. 7132–7141.

Huang, C., Y. Li, C. Change Loy, and X. Tang

2016. Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, Pp. 5375–5384.

Ioffe, S. and C. Szegedy

2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Johnson, A. E., T. J. Pollard, and R. G. Mark

2017. Reproducibility in critical care: a mortality prediction case study. In Machine Learning for Healthcare Conference, Pp. 361–376.

Johnson, A. E., T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark

2016. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035. Kim, S., W. Kim, and R. W. Park

2011. A comparison of intensive care unit mortality prediction models through the use of data mining techniques. Healthcare informatics research, 17(4):232–243.

Knaus, W. A., D. P. Wagner, E. A. Draper, J. E. Zimmerman, M. Bergner, P. G. Bastos, C. A. Sirio, D. J. Murphy, T. Lotring, A. Damiano, et al.

1991. The apache iii prognostic system: risk prediction of hospital mortality for critically iii hospital-ized adults. Chest, 100(6):1619–1636.

Le Gall, J.-R., S. Lemeshow, and F. Saulnier

1993. A new simplified acute physiology score (saps ii) based on a european/north american multicenter study. Jama, 270(24):2957–2963.

Lemeshow, S. and J.-R. Le

1994. Modeling the severity of illness of icu patients: a systems update. Jama, 272(13):1049–1055. Lin, M., Q. Chen, and S. Yan

(24)

Liu, M.-Y., T. Breuel, and J. Kautz

2017. Unsupervised image-to-image translation networks. In Advances in neural information processing systems, Pp. 700–708.

Liu, M.-Y. and O. Tuzel

2016. Coupled generative adversarial networks. In Advances in neural information processing systems, Pp. 469–477.

Liu, R., J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski

2018. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, Pp. 9605–9616.

Long, J., E. Shelhamer, and T. Darrell

2015. Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition, Pp. 3431–3440.

Louppe, G., L. Wehenkel, A. Sutera, and P. Geurts

2013. Understanding variable importances in forests of randomized trees. In Advances in neural information processing systems, Pp. 431–439.

Lundberg, S. M. and S.-I. Lee

2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Pp. 4765–4774.

Miotto, R., F. Wang, S. Wang, X. Jiang, and J. T. Dudley

2017. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246.

Mnih, V., A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu

2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, Pp. 1928–1937.

Morgan, N. and H. Bourlard

1990. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in neural information processing systems, Pp. 630–637.

Noack, A., I. Ahern, D. Dou, and B. Li

2019. Does interpretability of neural networks imply adversarial robustness? arXiv preprint arXiv:1912.03430.

Camburu, O.-M., E. Giunchiglia, J. Foerster, T. Lukasiewicz, and P. Blunsom

2019. Can i trust the explainer? verifying post-hoc explanatory methods. arXiv preprint arXiv:1910.02065.

Peters, J., S. Vijayakumar, and S. Schaal

2005. Natural actor-critic. In European Conference on Machine Learning, Pp. 280–291. Springer.

Pollard, T. J., A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi

2018. The eicu collaborative research database, a freely available multi-center database for critical care research. Scientific data, 5.

Poursabzi-Sangdeh, F., D. G. Goldstein, J. M. Hofman, J. W. Vaughan, and H. Wallach

2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.

Purushotham, S., C. Meng, Z. Che, and Y. Liu

2017. Benchmark of deep learning models on large healthcare mimic datasets. arXiv preprint arXiv:1710.08531.

Ribeiro, M. T., S. Singh, and C. Guestrin

2016. Why should i trust you?: Explaining the predictions of any classifier. In 22nd ACM SIGKDD international conference on knowledge discovery and data mining, Pp. 1135–1144. ACM.


Ruder, S.

2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Saeed, M., M. Villarroel, A. T. Reisner, G. Clifford, L.-W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark

2011. Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Critical care medicine, 39(5):952.

Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov

2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra

2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Pp. 618–626.

Shrikumar, A., P. Greenside, and A. Kundaje

2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Pp. 3145–3153. JMLR.org.

Strand, K. and H. Flaatten

2008. Severity scoring in the icu: a review. Acta Anaesthesiologica Scandinavica, 52(4):467–478.

Štrumbelj, E. and I. Kononenko

2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems, 41(3):647–665.

Szegedy, C., S. Ioffe, V. Vanhoucke, and A. A. Alemi

2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.

Tan, M. and Q. V. Le

2019. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.

Tao, G., S. Ma, Y. Liu, and X. Zhang

2018. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, Pp. 7717–7728.

Teasdale, G. and B. Jennett

1974. Assessment of coma and impaired consciousness: a practical scale. The Lancet, 304(7872):81–84.

Tibshirani, R.

1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Wang, Z., V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas

2017a. Sample efficient actor-critic with experience replay. In 5th International Conference on Learning Representations (ICLR 2017).

Wang, Z., W. Yan, and T. Oates

2017b. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 international joint conference on neural networks (IJCNN), Pp. 1578–1585. IEEE.

Wu, Y., E. Mansimov, R. B. Grosse, S. Liao, and J. Ba

2017. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, Pp. 5279–5288.


Xie, S., R. Girshick, P. Dollár, Z. Tu, and K. He

2017. Aggregated residual transformations for deep neural networks. In IEEE conference on computer vision and pattern recognition, Pp. 1492–1500.

Yamamoto, Y., T. Tsuzuki, J. Akatsuka, M. Ueki, H. Morikawa, Y. Numata, T. Takahara, T. Tsuyuki, K. Tsutsumi, R. Nakazawa, et al.

2019. Automated acquisition of explainable knowledge from unannotated histopathology images. Nature Communications, 10(1):1–9.

Yoon, J., J. Jordon, and M. van der Schaar

2018. Invase: Instance-wise variable selection using neural networks. In 7th International Conference on Learning Representations (ICLR 2019).

Zagoruyko, S. and N. Komodakis

2016. Wide residual networks. arXiv preprint arXiv:1605.07146.

Zhou, B., A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba

2016. Learning deep features for discriminative localization. In IEEE conference on computer vision and pattern recognition, Pp. 2921–2929.

Zhu, J.

2013. Machine teaching for bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, Pp. 1905–1913.


A INVASE on IHMC

[Figure: three panels plotting Selection ratio (0.0–0.6), AUROC (0.60–0.85), and AUPRC (0.25–0.50) against the selection penalty λ, swept logarithmically from 10⁻⁴ to 10⁶.]

Fig. 9: Results of INVASE on the IHMC for a wide range of λ selection penalty values. Note that feature selection does not improve predictive performance, nor does the ratio of selected features ever rise significantly above the ratio expected at initialisation (0.5). We consider this a failure of the Selector to converge.

B INFEAT Training Procedure

Algorithm 1 Two-stage INFEAT Training (one sample)

Input: x_n, y_n, V
ŷ_n^(bln) = CLS(RCS(x_n))
P_n = PRD(RCS(x_n))
Backpropagate on L_n^(bln)
M_n = (P_n > V)
x̂_n = M_n × x_n
ŷ_n^(prd) = CLS(RCS(x̂_n))
Backpropagate on (L^(prd) + L^(sel))
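Since the typeset algorithm survives extraction only in outline, a minimal PyTorch sketch of one such two-stage step is given below, written for a flat feature vector (the thesis operates on [features × time] series, but the mechanics are the same). The module names mirror the algorithm; the Bernoulli-style mask via uniform thresholds V and the REINFORCE-style selector loss with advantage L^(bln) − L^(prd) plus a λ-weighted sparsity penalty follow the INVASE formulation and are assumptions about the implementation, not its exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class INFEATNet(nn.Module):
    """Hypothetical weight-sharing network: one trunk (RCS), two heads (CLS, SEL/PRD)."""
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.rcs = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, n_classes)                              # classifier head
        self.sel = nn.Sequential(nn.Linear(hidden, n_features), nn.Sigmoid())  # selector head

def train_step(net, opt, x, y, lam=0.1):
    """One two-stage INFEAT step on a batch (x: [B, F] floats, y: [B] class indices)."""
    # Stage 1: baseline prediction on the full input, backpropagate on L^(bln).
    loss_bln = F.cross_entropy(net.cls(net.rcs(x)), y)
    opt.zero_grad(); loss_bln.backward(); opt.step()

    # Stage 2: sample a mask from the selector, predict on the masked input.
    p = net.sel(net.rcs(x))                # P_n: selection probabilities
    v = torch.rand_like(p)                 # V: uniform thresholds
    m = (p > v).float()                    # M_n = (P_n > V), i.e. one Bernoulli(P_n) draw
    x_hat = m * x                          # x̂_n
    logits = net.cls(net.rcs(x_hat))
    loss_prd = F.cross_entropy(logits, y)  # L^(prd)

    # L^(sel): REINFORCE-style, advantage = baseline loss - predictor loss,
    # plus a λ-weighted sparsity penalty on the selection probabilities.
    with torch.no_grad():
        per_ex_prd = F.cross_entropy(logits, y, reduction='none')
        per_ex_bln = F.cross_entropy(net.cls(net.rcs(x)), y, reduction='none')
        advantage = per_ex_bln - per_ex_prd
    log_pi = (m * torch.log(p + 1e-8) + (1 - m) * torch.log(1 - p + 1e-8)).sum(dim=1)
    loss_sel = -(advantage * log_pi).mean() + lam * p.mean()

    opt.zero_grad(); (loss_prd + loss_sel).backward(); opt.step()
    return loss_bln.item(), loss_prd.item(), loss_sel.item()
```

Because actor and critic share one network, a single optimiser over `net.parameters()` suffices; `lam` plays the same role as the selection penalty λ swept in Fig. 9.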

C Diagram for full weight-sharing INFEAT

[Diagram: input x_n enters the shared RCS trunk; the CLS head produces the baseline prediction ŷ_n^(bln) (bln loss against y_n), while the SEL head produces selection probabilities P_n, from which the mask M_n is sampled. The masked input is fed through the same RCS and CLS to yield ŷ_n^(prd), driving the prd and sel losses.]
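Since the diagram itself survives only as label residue, the following hypothetical PyTorch rendering makes the weight sharing concrete for multivariate series of shape [batch, features, time]: both the baseline pass and the masked pass run through one and the same RCS instance, which is what ties actor (SEL) and critic (CLS) gradients to a single set of trunk weights. The small convolutional trunk is a stand-in; the thesis's actual trunk is the 1D-ResNet++ backbone.

```python
import torch
import torch.nn as nn

class SharedTrunkINFEAT(nn.Module):
    """Sketch of the full weight-sharing layout. Layer sizes are illustrative."""
    def __init__(self, n_features, n_classes, channels=32):
        super().__init__()
        self.rcs = nn.Sequential(                      # shared trunk over [B, F, T]
            nn.Conv1d(n_features, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.cls = nn.Linear(channels, n_classes)      # critic head -> ŷ
        self.sel = nn.Conv1d(channels, n_features, 1)  # actor head -> per-timestep logits

    def forward(self, x):
        # Baseline pass: x_n -> RCS -> CLS gives ŷ^(bln); SEL gives P_n.
        h = self.rcs(x)
        y_bln = self.cls(h.mean(dim=2))                # global average pool over time
        p = torch.sigmoid(self.sel(h))                 # P_n, shape [B, F, T]
        # Sample the mask and run the *same* trunk on the masked series.
        m = torch.bernoulli(p)
        y_prd = self.cls(self.rcs(m * x).mean(dim=2))  # ŷ^(prd)
        return y_bln, p, m, y_prd
```

A single forward call yields everything Algorithm 1 needs: ŷ^(bln) against y_n for the baseline loss, and (P_n, M_n, ŷ^(prd)) for the second backpropagation step.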


D IHMC Variable Description

Variable                         Unit          Missingness (%)   Mean     Normal value
---------------------------------------------------------------------------------------
Continuous
GCS eye response                 Scale (1-4)   69.08             3.28     4.00
GCS motor response               Scale (1-6)   69.21             5.32     6.00
GCS verbal response              Scale (1-5)   69.17             3.34     5.00
Temperature                      °C            67.38             36.92    36.88
Heart Rate                       Beats/min     7.60              86.31    85.74
Systolic blood pressure          mmHg          9.63              120.49   118.00
Diastolic blood pressure         mmHg          9.64              61.75    59.56
Mean blood pressure              mmHg          10.04             78.92    77.17
Respiratory rate                 Breaths/min   9.01              20.39    18.80
Fraction of inspired oxygen      FiO2          93.80             0.30     0.21
Oxygen saturation                SpO2          10.75             97.83    97.11
Blood acidity                    pH            86.88             7.18     7.38
Blood glucose level              mg/dL         74.05             139.62   130.73

(Semi-)static
Height                           cm            81.03             169.76   170.00
Weight                           kg            26.99             82.44    81.00

Dichotomous
Abnormal capillary refill rate                 99.65                      False
Intubation                                     69.17                      False

Tab. 4: Description of IHMC variables.

For a full description of features and dataset cohort statistics, refer to Harutyunyan et al. (2019). Means (after imputation) and missingness are calculated across all timesteps after resampling for continuous variables (i.e. on an hourly basis), and on a per-admission basis for their (semi-)static counterparts. Normal imputation values are those defined by Harutyunyan et al. (2019).
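As a rough sketch of that preprocessing (hourly resampling, last observation carried forward, then normal-value fallback — one plausible ordering), consider the pandas fragment below; the admission data, column names, and two-variable normal-value map are invented for illustration and do not reproduce the benchmark code of Harutyunyan et al. (2019).

```python
import pandas as pd

# Hypothetical chart-event extract for one admission: timestamped measurements.
events = pd.DataFrame(
    {"heart_rate": [88.0, None, 92.0], "sp_o2": [97.0, 96.0, None]},
    index=pd.to_datetime(["2019-01-01 00:10", "2019-01-01 01:45", "2019-01-01 03:20"]),
)

# Normal values used for imputation (subset of Tab. 4's last column).
NORMAL = {"heart_rate": 85.74, "sp_o2": 97.11}

# 1) Resample continuous variables to an hourly grid (mean within each hour).
hourly = events.resample("1h").mean()

# 2) Carry the last observation forward, then fall back to the normal value
#    where nothing has been observed yet.
imputed = hourly.ffill().fillna(value=NORMAL)
print(imputed)
```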


E IHMC INFEAT Attribution Examples

[Figure: four example admissions from the IHMC, arranged in a 2×2 grid. Each panel shows two aligned heatmaps over the first hours of stay (x-axis: Hours, ticks 0–40) and the 17 input variables (GCS eye, GCS motor, GCS verbal, Temperature, Heart Rate, Sys. BP, Dias. BP, Mean BP, Resp. rate, FiO2, SpO2, pH, Glucose, Height, Weight, Abn. CRT, Intubated): the inputs x_n (colour scale 0.0–1.0) and the INFEAT attributions P_n (colour scale −2 to 2).]
