As easy as APC: Leveraging self-supervised learning in the context of time series classification with varying levels of sparsity and severe class imbalance

MSc Artificial Intelligence

Master Thesis

by

Fiorella Wever

12101745

February 2, 2021

48 EC March 2020 - February 2021

Supervisors:

Victor García Satorras

Andy Keller

Laura Symul

Assessor:

Herke van Hoof

Abstract

Time series data with high levels of sparsity and strong class imbalance are a ubiquitous issue challenging our ability to perform meaningful and effective classification. While methods have been proposed to deal with these obstacles, they tackle each challenge separately; either dealing with missingness or correctly identifying the minority class(es). However, time series data often present both challenges simultaneously. It would therefore be beneficial to tackle them in conjunction.

In this work, we propose leveraging a self-supervised learning method, specifically Autoregressive Predictive Coding (APC), in order to learn relevant hidden representations of time series data in the context of both missingness and class imbalance. We apply APC using either a GRU or GRU-D encoder on two real-world datasets, and show that applying one-step-ahead prediction with APC improves the classification results in all settings. This demonstrates that APC can learn improved latent representations of the data compared to those from the original input space. These learned representations allow the classifier to better identify the minority class(es), and improve the performance even with different levels of sparsity present. Additionally, if the dataset contains truly informative missingness patterns, the synergy of using APC with a GRU-D encoder to capture these patterns can lead to substantial improvements. In fact, by applying GRU-D - APC, we achieve state-of-the-art results on a benchmark dataset.

Acknowledgements

I would first like to thank my supervisors, Laura, Andy and Victor, for their invaluable guidance and mentorship. Andy and Victor, thank you both for sharing your expertise, and for the many hours spent on zoom and emails. Your optimism and enthusiasm throughout the process made this all much more enjoyable to work on. Laura, I am very grateful for your mentorship, and I always greatly appreciate your advice, both academic and non-academic. I learn a lot from how you approach things always guided by a strong moral compass and I definitely hope we get to work together for a long time. I also want to thank Amanda, Charlie, Daniel, and Adam from the Clue team for all their help with getting me set up with their datasets and machines. A big thank you to Edge Water and everybody there, you all made starting my thesis with a fresh and inspired mind so much easier to do. Thanks to all the wonderful people I had a pleasure to meet during this master’s program, you all made studying so much more fun. A huge thank you from the bottom of my heart goes to my parents and my family. Thank you for always supporting me and allowing me to pursue my goals. You guys are awesome.

Finally, last but not least, thank you to my Cival for your unwavering support. You are my rock. I couldn’t have done this without you.

1 Introduction

Rare event prediction on time series data is known to be a difficult challenge on its own. [1] Adding varying levels of missing values to the mix only adds to that complexity. To top it off, when it comes to real-world time series data, these two challenges are ubiquitous and often presented simultaneously. [2] This makes it hard to properly learn from the time series data in order to perform well in a supervised classification setting.

Identifying rare events from sequential data is a crucial problem that affects many different fields, such as medicine, financial markets, and cyber security. [1] [3] [4] Rare event prediction is also referred to as imbalanced classification, where the dataset consists of a majority class that includes most of the examples, and one or more minority classes which consist of examples that are far more infrequent. [1] In these scenarios, the focus is often on correctly identifying the minority class, since this involves cases that are far more rare, and therefore often more interesting and highly relevant to the final task at hand. [1] For example, for tasks such as credit card fraud detection, predicting natural disasters, or predicting cancer, it is the less common cases that are the most crucial to identify. However, the challenge with training classification models on an imbalanced dataset is that they are prone to classify most examples into the majority class, resulting in poor predictive performance, specifically for the minority class. [3]

Missing values are also a common issue that affects many time-series data and can be due to unexpected accidents, equipment damage, irregular sampling, etc. [2] [5] The challenge, of course, is that the strength and quality of the information that we can learn from is affected by these missing data. Besides losing potentially important information with these missing values, the pattern of missingness can also be informative in and of itself. [2]

These obstacles in the datasets make it more difficult to properly learn the underlying, discriminative patterns from time series data needed to achieve high classification performances. While successes have been obtained to deal with these obstacles, these methods tackle each challenge separately; either dealing with properly learning in the context of missingness [6] [5] [2] or correctly identifying the minority class(es) in a class imbalance setting [7] [8] [9] [3]. However, time series data often contain both high levels of sparsity and severe class imbalance at the same time, and it would therefore be beneficial to tackle both of these challenges in conjunction.

Here we propose leveraging a self-supervised learning method, specifically Autoregressive Predictive Coding (APC) [10], in order to learn relevant hidden representations of time series data in the context of both missingness and class imbalance. The objective here is that these learned hidden representations are then used to improve performance on the final classification tasks.

Autoregressive Predictive Coding is a recently proposed autoregressive method that learns from sequential data by trying to predict information about a future frame that is n ≥ 1 steps ahead of the current one. [10] This differs from autoencoders in that this time shifting factor n allows the model to learn more general/global structures of the data rather than only local ones. [10]

We suggest that APC could be a potential method to alleviate the class imbalance problem, since APC can learn the discriminative features from the unlabeled data regardless of which class each sequence belongs to. The goal here is that by using this new set of learned features as input, the classifier will be better able to identify the minority class(es) and discriminate between these and the majority class.

To evaluate the impact of specific encoder architecture, we compare two different encoders to use in the APC framework: a baseline GRU [2] encoder that is fed hand-engineered missingness features, and a more complex, state-of-the-art GRU-D [2] encoder that uses trainable decays to learn the missingness patterns from the data. The objective here is to explore the synergistic effects of combining a model that can learn from the potentially informative missingness present in the data (GRU-D) together with APC.

To demonstrate this, we run experiments on two healthcare datasets. They both contain varying levels of sparsity and severe class imbalance. These two datasets and their classification objectives are:

• Clue: Clue by Biowink [11] is a menstrual cycle tracking app that allows their millions of users worldwide to self-track different aspects of their cycles. This includes bleeding patterns, psychological and/or physical symptoms, as well as contraceptive use over time. In collaboration with the Bill and Melinda Gates Foundation [12], we sought to predict the discontinuation of birth control methods over time.

• PhysioNet Challenge 2012 [13]: The PhysioNet dataset is a public, clinical dataset containing time series of ICU patients. These time series contain different demographic and physiological measurements from the first 48 hours of a patient’s admission to the ICU [6]. The classification objective is to predict patient in-hospital mortality, namely: whether patients will survive or die during their stay at the hospital.

1.1 Research Question

Summarizing the ideas just mentioned, our main research question is:

• Given that we have time series data with both varying levels of missing values and severe class imbalances, does including Autoregressive Predictive Coding ahead of a classifier improve the final classification by learning relevant latent representations of the data?

1.2 Contributions

Our contributions are:

• To the best of our knowledge, there has not been any research done using Autoregressive Predictive Coding outside of the speech domain [10] [14] [15], making this research project the first to apply APC on multivariate time series data.

• We propose to leverage APC as a method to tackle both missing values and class imbalance in conjunction. The performance of this framework was validated on two multivariate time series datasets in the medical domain: Physionet and Clue, and the results are promising for both datasets.

• We propose a novel approach to improve time series classification on imbalanced datasets that contain potentially informative missing patterns: APC with a GRU-D encoder. With this approach, we are able to achieve state-of-the-art performance on the Physionet Challenge 2012 [13].

• Clue: there have not been any studies done researching contraceptive use, discontinuation and switching that leverage deep learning methods using data acquired through menstrual cycle tracking apps, making the research section with Clue the first of its kind.

2 Literature Review

In this section, we introduce the reader to the background relevant to the main topics of this thesis. We begin by addressing the deep learning models commonly used for time series classification (Section 2.1); then we present some of the techniques that have been used to handle missing values in time series data (Section 2.2). We continue this section by detailing some of the most popular methods to deal with class imbalance (Section 2.3), and finally, we address representation learning for time-series data (Section 2.4). Note that the list of methods presented here is by no means exhaustive and is meant as a general outline to provide the reader more context.

2.1 Deep learning for time series classification

Both Clue and Physionet are multivariate time series datasets, which means that they have multiple time-dependent variables. These variables are not only dependent on their past measurements, but they can also be dependent on one another. [16]

When it comes to performing multivariate time series classification, Recurrent Neural Networks (RNNs) are a natural first choice, as they have been successfully used in many domains for processing sequential data, including medical time series data [17] [18]. RNNs are well-equipped to deal with sequential data because the feedback loops of their recurrent cells inherently address the temporal order as well as the temporal dependencies of the sequences. Two of the most popular types of RNNs are the Long Short-Term Memory cell (LSTM) [19] and the Gated Recurrent Unit cell (GRU) [20]. Both the LSTM and the GRU are improvements compared to the traditional RNN, which suffers from the vanishing gradient problem. [19]

The Long Short-Term Memory network includes a memory cell c, which can preserve long-term relationships [19]. The LSTM architecture makes use of three gates: an input gate i, a forget gate f, and an output gate o to control the flow of information into and out of the cell. The Gated Recurrent Unit (GRU) is inspired by the LSTM unit, but while the LSTM makes use of three gates, the GRU uses two: an update gate z and a reset gate r. [20] The update gate z controls how much information from the previous hidden state gets passed along to the current hidden state. [20] This has a similar functionality to the memory cell in the LSTM and helps the RNN to memorize long-term information. [20] Furthermore, the simpler design of the GRU allows it to use fewer training parameters compared to the LSTM, making it faster and more memory-efficient. [20] Based on the GRU model’s favorable characteristics, we will be implementing this method to learn from our time series data in order to perform the final classification tasks.
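To make the gating described above concrete, the following is a minimal sketch of a single GRU step in plain Python. It is illustrative only: the parameter names (Wz, Uz, bz, and so on), the dictionary layout, and the helper zero_params are our own assumptions, and the weights are set by hand rather than trained. We use the convention in which the update gate z weights the previous hidden state, in line with the description above; some formulations swap the roles of z and 1 - z.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero parameters for the z, r and candidate paths."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        p["U" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        p["b" + gate] = [0.0] * n_hidden
    return p


def gru_step(x, h_prev, p):
    """One GRU update: returns the new hidden state as a list."""
    # Update gate z: how much of the previous hidden state is kept.
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Reset gate r: how much of the previous state feeds the candidate.
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a) for a in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # Blend the previous state and the candidate state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


params = zero_params(n_in=2, n_hidden=2)
params["bz"] = [50.0, 50.0]  # saturate the update gate so that z is close to 1
h_new = gru_step([1.0, 2.0], [0.4, -0.2], params)
```

With the update gate saturated near 1, the hidden state is carried over almost unchanged, which is the mechanism that lets the GRU retain long-term information.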

However, as mentioned before, we are dealing with various challenges in our datasets that might make it more difficult to learn the underlying, discriminative patterns from the time series data needed to perform well in the supervised classification setting.

take advantage of all the underlying information that can be learned from the time series data. In this case, we turn to additional approaches that might be better suited to handle these situations, and in turn help improve the final classification tasks.

2.2 Handling missingness

One of the main challenges we are facing with these time series data is varying levels of missingness, as this affects both the quality and the strength of the information we can learn from. The causes for missing data are generally categorized into 3 groups [21]:

• Missing Completely At Random (MCAR): this is the most restrictive assumption, which suggests that the cause for the missing data is neither dependent on the observed data nor on the unobserved data.

• Missing At Random (MAR): this implies that the missing data can be explained by the observed data.

• Missing Not At Random (NMAR): this means that the missingness depends on the unobserved data, and this dependency remains even given the observed data. This has also been referred to as non-ignorable missing data. [21]
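The three mechanisms can be illustrated with a small simulation. This is a toy sketch under our own assumptions: the variable x is always observed, the variable y is the one that goes missing, and the probabilities 0.3, 0.6 and 0.1 are arbitrary values chosen for illustration. The function names are hypothetical.

```python
import random


def simulate_masks(x, y, seed=0):
    """Return three lists of booleans (True = y is missing), one per mechanism."""
    rng = random.Random(seed)
    # MCAR: the chance of missingness is constant and ignores all data.
    mcar = [rng.random() < 0.3 for _ in y]
    # MAR: the chance of missingness depends only on the observed variable x.
    mar = [rng.random() < (0.6 if xi > 0 else 0.1) for xi in x]
    # NMAR: the chance of missingness depends on the value of y itself.
    mnar = [rng.random() < (0.6 if yi > 0 else 0.1) for yi in y]
    return mcar, mar, mnar


def observed_mean(y, missing):
    """Mean of y over the entries that are still observed."""
    kept = [yi for yi, gone in zip(y, missing) if not gone]
    return sum(kept) / len(kept)


data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]
mcar, mar, mnar = simulate_masks(x, y)
true_mean = sum(y) / len(y)
mcar_mean = observed_mean(y, mcar)  # stays close to true_mean
mnar_mean = observed_mean(y, mnar)  # biased downward: large y values vanish more often
```

In this toy setting, the observed mean under MCAR stays close to the true mean, while under NMAR it is biased because the missingness itself carries information about the hidden values.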

Traditional methods for handling missing data often involve first filling in the missing values (data imputation), and then applying predictive models on the imputed data. [2]

Traditional data imputation methods

A traditional data imputation method that has widely been used is mean-imputation, which fills the missing values of a particular variable by the mean for that variable. [21] Mean-imputation is very simple and fast to implement, but it is also very limiting, since it disregards the correlations between the features, doesn’t work well with categorical features, and moreover, it assumes that the missing values are missing completely at random (MCAR). [21] Another basic imputation method is zero-filling, which, as the name suggests, replaces the missing values with zero. However, many studies have shown that zero-filling leads to sub-optimal performance in neural networks. [22] Further, missing values can also be filled in by using linear interpolation. [23] This technique assumes a linear relationship between data points and uses the non-missing values surrounding the missing values to compute the imputed values for these missing points.
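The three basic imputation strategies above can be sketched for a single variable as follows. This is a minimal illustration in plain Python, where None marks a missing value; the function names are our own, and we assume that at least one value in the series is observed. For gaps at the start or end of the series, where no surrounding pair exists, this sketch carries the nearest observed value outward, which is one possible choice among several.

```python
def mean_impute(series):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def zero_fill(series):
    """Replace missing values with zero."""
    return [0.0 if v is None else v for v in series]


def linear_interpolate(series):
    """Fill gaps on the straight line between the surrounding observed values."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    # Leading and trailing gaps: carry the nearest observed value outward.
    for i in range(known[0]):
        out[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        out[i] = series[known[-1]]
    return out


series = [None, 1.0, None, 3.0, None, None, 6.0, None]
by_mean = mean_impute(series)
by_zero = zero_fill(series)
by_line = linear_interpolate(series)
```

Note that all three functions treat the variable in isolation, which is exactly the limitation discussed above: correlations between variables and the pattern of missingness are ignored.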

Multiple imputation is an improvement to the aforementioned methods, since it can handle missing data by estimating and replacing missing values many times. [24] Essentially, the missing values are filled in with many different plausible values, which represent a measure of uncertainty in estimating what the missing values might be. [25] Particularly, the Multiple Imputation by Chained Equations (MICE) [26] has been widely adopted, which uses chained equations to create multiple imputations.

ues with the weighted average of its k nearest neighbors. [27] On the other hand, the EM algorithm iteratively fills in the missing values by using expectation-maximization. [28] The E-step (expectation) computes the expected values based on all the observed data. The M-step (maximization) replaces the missing values with the values computed in the E-step and then recomputes the new expected values. [28] This two-step iterative process continues until the changes in expected values become insignificant. [28]

Although there are thus many traditional data imputation methods available, the quality of the imputed data cannot be guaranteed and most of these methods have very strong assumptions about the missing values, such as that the data is missing completely at random (MCAR) or at random (MAR). [2] This can introduce a strong bias which can influence the outcome of the task, especially for data with high levels of missing values. [29] Furthermore, these imputation methods can also be computationally expensive. [2]

Besides losing potentially important information with the missing datapoints, the pattern of missingness can also be informative in and of itself since data is often not missing completely at random. [2] However, combining the imputation methods with prediction models often results in a two-step process where imputation and prediction models are separated, and this doesn’t allow the prediction model to properly explore the missingness patterns. [2]

Learning missingness patterns

This idea of informative missingness, that the missing values and patterns can incorporate potentially useful information about the target labels, has been shown by Che et al. (2018). From their experiments, they found that the value of the missing rate is correlated with the target labels, and this correlation is even higher for variables with a low missing rate. [2] This indicates that incorporating the missingness patterns into the time series prediction model could improve the final prediction. However, most state-of-the-art time-series models do not incorporate these missing patterns into the model itself.

From the handful of methods that do incorporate these missingness patterns, one example is the Bidirectional Recurrent Imputation for Time Series (BRITS) [5]. BRITS is a data imputation method which directly learns the missing values in a bidirectional recurrent way, without strong underlying assumptions about the data. [5] It is able to simultaneously impute the missing values and perform classification/regression within a joint neural graph. By doing so, the prediction model is able to explore these missingness patterns, leading to improved results.

Another method, the GRU-D, has been shown to achieve competitive results on irregularly-sampled time-series in the healthcare domain, such as Physionet. [2] GRU-D is a recurrent neural network (RNN) that builds upon the GRU model, but with trainable decays. These decay mechanisms are used to decay the input variables and the hidden states over time based on when a variable was last observed. Since the patterns of missing data can be potentially informative, but also quite complex, it is beneficial to learn these decay rates directly from the data rather than fixing them a priori.
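The decay mechanism can be sketched for a single variable as follows. This follows the formulation of Che et al. [2] as we understand it: the decay rate is an exponentiated negative rectifier of the time since the last observation, and a missing input is replaced by a blend of the last observed value and the empirical mean. The function names are our own, and the parameters w and b, which are trainable in GRU-D, are fixed by hand here purely for illustration.

```python
import math


def decay_rate(delta, w, b):
    """Decay in (0, 1]: it shrinks as the time since the last observation grows."""
    return math.exp(-max(0.0, w * delta + b))


def decayed_input(x, m, x_last, x_mean, gamma):
    """Return x if observed (m == 1), else decay from x_last toward x_mean."""
    if m == 1:
        return x
    return gamma * x_last + (1.0 - gamma) * x_mean


# A variable last observed at value 9.0, with empirical mean 5.0.
recent = decayed_input(None, 0, 9.0, 5.0, decay_rate(0.1, w=1.0, b=0.0))
stale = decayed_input(None, 0, 9.0, 5.0, decay_rate(10.0, w=1.0, b=0.0))
```

Shortly after an observation, the substituted value stays near the last observed one; after a long gap it approaches the mean. Learning w and b per variable is what allows the model to adapt this behaviour to the data.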

Both the BRITS and the GRU-D take advantage of two representations of informative missingness patterns: masking and time interval. [2] The masking vector specifies which variables are missing at time step t, and the time interval indicates how long it has been for each variable d since its last observation. However, unlike BRITS, GRU-D does not explicitly impute the missing values. Moreover, while BRITS only decays the hidden states, the GRU-D is able to decay the input variables as well. This allows for a more insightful decay analysis to study the impact of missingness per variable.
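The two representations can be computed from a raw series as sketched below. We assume here that values is a list over time steps of lists over variables, with None marking a missing entry, and that timestamps holds the observation time of each step; the function name and data layout are our own choices. The interval is set to zero at the first time step and accumulates across consecutive steps in which the variable was missing, following the definition in Che et al. [2] as we understand it.

```python
def mask_and_intervals(values, timestamps):
    """Return (masks, deltas), each a list over time of lists over variables."""
    n_vars = len(values[0])
    masks, deltas = [], []
    for t, row in enumerate(values):
        # Mask: 1 if the variable is observed at step t, 0 if it is missing.
        masks.append([0 if v is None else 1 for v in row])
        if t == 0:
            deltas.append([0.0] * n_vars)
            continue
        gap = timestamps[t] - timestamps[t - 1]
        # If the variable was observed at the previous step, the interval is
        # just the gap; otherwise the gap is added to the running interval.
        deltas.append([
            gap if masks[t - 1][d] == 1 else gap + deltas[t - 1][d]
            for d in range(n_vars)
        ])
    return masks, deltas


values = [[47, None], [49, 15], [None, 14], [40, None], [None, None], [43, None], [55, 15]]
timestamps = [0.0, 0.1, 0.6, 1.6, 2.2, 2.5, 3.1]
masks, deltas = mask_and_intervals(values, timestamps)
```

Both outputs have the same shape as the input series, so they can be fed to the model alongside the values themselves.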

Finally, two models based on ordinary differential equations (ODE), the ODE-RNN and the Latent-ODE, have also recently shown promising results on irregularly-sampled data. [6] These models have hidden-state dynamics that are determined by neural ordinary differential equations (Neural ODEs) and are learned over time. [6] The future hidden states depend on the time since the last observation, and these models make no strong assumptions about the dynamics of the time series [6], making them well suited for sparse and/or irregular data.

Also, neither of the ODE models nor GRU-D require discretizing observation times or imputing data as a preprocessing step, making them very desirable methods to deal with sparse and/or irregularly sampled data [6], since this allows them to learn more fine-grained missingness patterns when performing prediction tasks.

Given that our time series data has varying levels of sparsity as well as potentially informative missingness, we aim to learn the best representation of the data by using a model that is well-suited to handle and learn from its missingness patterns. Even though ODE-RNNs and Latent ODEs seem very promising, GRU-D has actually been shown to outperform these models on classification tasks with Physionet. [30] Based on the GRU-D model’s state-of-the-art performance on Physionet, its direct comparability to GRU, and its ease of implementation into our final framework, we decide to implement this method.

2.3 Class imbalance

Another challenge these datasets bring is class imbalance. This is referring to the case where the distribution of examples across the classes is highly skewed. In other words, the dataset consists of a majority class that includes most of the examples, and one or more minority classes. [1] The minority classes consist of examples that are far more infrequent, but often more crucial to identify as well. [1]

The challenge with classification on an imbalanced dataset is that the models are prone to classify most examples into the majority class, resulting in poor predictive performance, specifically for the minority class. [3] This is because most standard classifiers assume a relatively balanced class distribution and equal misclassification costs [31]. If one does nothing to deal with the severe class imbalance, the model will most likely overfit and almost exclusively predict the majority class, since this will lead to higher accuracy scores. This is misleading, and also suggests that accuracy is an unreliable measure of performance in severe class imbalance settings. Besides, this model will not be performing any meaningful classification and this will result in very poor generalization capability on the minority class. [4]
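A small worked example shows why accuracy is misleading here. The numbers are hypothetical: we assume 95 majority examples (label 0) and 5 minority examples (label 1), and a degenerate classifier that always predicts the majority class. The recall function assumes that at least one positive example is present.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the positive (minority) examples that were identified."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

The classifier reaches an accuracy of 0.95 while identifying none of the minority examples, so its recall on the class of interest is 0.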

There are several ways to deal with class imbalances. Some of the most common and widely adopted methods are data re-sampling methods, such as under-sampling the majority class [7] [8] or over-sampling the minority class [9], or using class weights [3]:

Data re-sampling methods:

• Under-sampling majority class

One way to handle imbalanced datasets is by undersampling the majority class in such a way that in the end the dataset is balanced [7] [8], i.e. making sure each class represents an equal fraction of the dataset. This is done by looking at the class that has the fewest data points; let’s call this number m. Then a sample is taken of the data, by randomly choosing m datapoints per class, so each class now has the same number of examples.

However, since we’re dealing with a severe class imbalance, this means that there are far fewer examples of the minority class(es) compared to the majority class, and so we end up losing a lot of data. This makes undersampling the majority class a less than ideal method for dealing with class imbalance.

• Over-sampling minority class

Another method of data re-sampling is by over-sampling the minority class in such a way that it will lead to a balanced dataset, in an attempt to increase the sensitivity of a classifier to the minority class [32]. However, a major drawback of over-sampling is that this will require a lot of duplicates from the minority class, which can cause the classifier to easily overfit to the minority class. [4]

To overcome this issue, the Synthetic Minority Oversampling Technique (SMOTE) [32] has been proposed as an improvement to the original oversampling method. SMOTE randomly generates synthetic examples of the minority class by selecting a random point along a line segment between a minority sample and a nearest neighbour. [4] [32] However, one drawback that remains is that by oversampling the minority class, we are artificially increasing the size of the dataset, which consequently worsens the computational burden of the learning model. [4]
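The re-sampling strategies above can be sketched as follows. This is a simplified illustration in plain Python with hypothetical function names. In particular, smote_point only shows the interpolation step of SMOTE; the nearest-neighbour search that selects the second point is omitted and the neighbour is assumed to be given.

```python
import random


def group_by_class(X, y):
    """Collect the examples of each class in a dictionary keyed by label."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return groups


def undersample(X, y, seed=0):
    """Randomly keep m examples per class, where m is the smallest class size."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    m = min(len(rows) for rows in groups.values())
    pairs = [(row, label) for label, rows in groups.items()
             for row in rng.sample(rows, m)]
    return [row for row, _ in pairs], [label for _, label in pairs]


def oversample(X, y, seed=0):
    """Duplicate random examples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    pairs = []
    for label, rows in groups.items():
        extra = rng.choices(rows, k=target - len(rows))
        pairs += [(row, label) for row in rows + extra]
    return [row for row, _ in pairs], [label for _, label in pairs]


def smote_point(sample, neighbour, rng):
    """Synthetic example at a random position on the segment sample -> neighbour."""
    lam = rng.random()
    return [s + lam * (nb - s) for s, nb in zip(sample, neighbour)]


X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_under, y_under = undersample(X, y)
X_over, y_over = oversample(X, y)
synthetic = smote_point([0.0, 0.0], [1.0, 2.0], random.Random(0))
```

In this toy dataset, undersampling keeps only 10 of the 100 examples, while oversampling grows the dataset to 190 examples, which illustrates the data loss and the added computational burden discussed above.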

Cost sensitive learning: class weights

An option that does not involve re-sampling the data is cost sensitive learning. Cost sensitive learning is essentially a way to make the classifier aware of the imbalance by incorporating the class weights into the cost/loss function. [33] It allows the backpropagation algorithm to be updated to weigh the misclassification errors in proportion to the importance of the class. This method is also referred to as weighted or cost-sensitive neural networks. The class weights are in proportion to the importance of each class: a larger weight will be used for the minority class, while a smaller weight will be assigned to the majority class. [33] The larger class weight for the minority class results in a larger penalty for misclassified examples from this class. This allows the model to pay more attention to samples from the minority class than the majority class in datasets with a severely skewed class distribution, such as ours.
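A class-weighted loss can be sketched as follows for the binary case. The inverse-frequency weighting used here (number of samples divided by the number of classes times the class count) is one common heuristic and is our own assumption for illustration; other weighting schemes are possible. The function names are hypothetical.

```python
import math


def class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy in which each example is scaled by its class weight."""
    total = 0.0
    for yi, pi in zip(y_true, p_pred):
        pi = min(max(pi, eps), 1.0 - eps)  # avoid log(0)
        loss = -(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
        total += weights[yi] * loss
    return total / len(y_true)


y = [0] * 9 + [1]
weights = class_weights(y)

# The same degree of error (predicted probability 0.1 for the true class)
# is penalised more heavily when it is made on a minority example.
minority_error = weighted_bce([1], [0.1], weights)
majority_error = weighted_bce([0], [0.9], weights)
```

With nine majority examples and one minority example, the minority class receives a weight of 5.0 and the majority class a weight of roughly 0.56, so a mistake on a minority example costs about nine times as much.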

Unsupervised feature selection

Finally, apart from data re-sampling and cost-sensitive learning, there have been some studies done using unsupervised learning as a feature selection tool for dealing with imbalanced classification. [4] These have either been used in combination with data re-sampling methods, or independently as the main method to deal with data imbalance directly. The new set of features learned in this un-/self-supervised manner will then be used as the new inputs to the classifier, rather than the samples from the original input space. [4] More details on these methods will be given in Section 3.

2.4 Learning latent representations from time series data

According to Bengio et al. (2013) the performance of machine learning models is generally dependent on how the data is represented. [34] This is because different representations can incorporate and hide the different factors that explain the variations behind the data. [34] Based on this, they propose that a good data representation is one that is able to extract these underlying factors of variation. Because of its importance, there has been a lot of research done in representation learning, so much so that it has become its own field within machine learning [34].

Since we are dealing with various challenges in our time series datasets, we aim to learn improved representations of the time series data in which the missing values and class imbalance become less of an obstruction to perform meaningful and effective classification.

One solution is to use semi-supervised learning. [35] Semi-supervised learning has gained great interest in machine learning, since it can learn from readily available unlabeled data in order to improve supervised learning tasks. [35] This has been shown to improve classification performance in many different domains, such as classifying images [36], Internet traffic [37], and sentiment from text [38]. Specifically, we would like to explore a semi-supervised approach that uses a combination of unsupervised pre-training and supervised classification. [39] Unsupervised pre-training is a special case of semi-supervised learning where the aim is to learn a good initialization point for the supervised setting instead of changing the supervised learning objective. [39] Unsupervised representation learning can detect the underlying and discriminative patterns and features from the data without the use of class labels. [35] This could allow us to learn meaningful and improved compressed representations of the data even when facing noisy labels, missing data points and class imbalance. These latent representations of the data can then be used for the final classification tasks. It has been shown that, even in cases where considerable supervised labels are available, learning good representations in an unsupervised way can provide a significant boost in performance. [39] Thus, here we use both unsupervised and supervised learning for our classification task.

Different methods for unsupervised representation learning have been proposed and used in the past to learn from time series data. [16] One such method is the autoencoder, which is an artificial neural network that consists of an encoder and a decoder. [40] The encoder essentially learns a compressed representation of the input data, while the decoder attempts to reconstruct the data from the compressed representation learned in the encoding phase. [40] Although autoencoders were traditionally used as a dimensionality reduction technique [16], they have also found success in learning features from time series data in an unsupervised way that could improve the supervised learning tasks. [40]

However, when it comes to learning useful representations from high-dimensional sequential data, some of the most recent and promising methods [41] [10] are based on the idea of predictive coding, which is an unsupervised learning method to predict future, missing, or contextual information [41]. One of these proposed methods is Contrastive Predictive Coding (CPC), which learns representations from the data by predicting the future in latent space. [41] CPC does this by using powerful autoregressive models in combination with a probabilistic contrastive loss, which encourages the model to learn information from the data that is the most discriminative from negative examples. [41] CPC has been shown to achieve strong or state-of-the-art results across a wide variety of domains. [41]

However, more recently, Autoregressive Predictive Coding (APC) [10] was proposed and shown to significantly outperform CPC. The main difference with CPC is that APC performs multi-step-ahead prediction in an autoregressive manner instead of a contrastive manner. Instead of learning information that is most discriminative between the target and negative samples, APC learns to encode information from the data that is most useful to predict the future observations, and is only allowed to discard information that is most common across the dataset. [10] Furthermore, a time shifting factor n ≥ 1 is introduced that tells the model to predict n steps ahead in the future from the current time step and encourages the model to learn more global structures of the data rather than local ones. [10] In general, APC is capable of learning representations that retain as much information about the original signals as possible [10], and thus seems like a good unsupervised representation learning method to learn from our time-series data. Eventually, this latent representation of the data can be used to perform the classification task.
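The role of the time shifting factor n can be sketched as follows: each frame at time t is paired with the frame at time t + n as its prediction target, and the model is penalised for the difference between its prediction and that target. In this sketch we assume an L1 loss, and persistence_predict is a hypothetical stand-in for the recurrent encoder that simply predicts that the frame does not change. All names and the toy sequence are our own.

```python
def shifted_pairs(sequence, n=1):
    """Split a sequence into inputs x_t and targets x_{t+n}."""
    if not 1 <= n < len(sequence):
        raise ValueError("n must satisfy 1 <= n < len(sequence)")
    return sequence[:-n], sequence[n:]


def l1_loss(predictions, targets):
    """Mean absolute error over all frames and all features."""
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count


def persistence_predict(inputs):
    """Hypothetical stand-in for the encoder: predict an unchanged frame."""
    return [list(frame) for frame in inputs]


sequence = [[0.0, 1.0], [1.0, 1.5], [2.0, 2.0], [3.0, 2.5]]
inputs, targets = shifted_pairs(sequence, n=2)
loss = l1_loss(persistence_predict(inputs), targets)
```

Because the targets lie n steps in the future, a model that only copies its current input incurs a loss that grows with n on a changing signal, so it has to capture longer-range structure to do well.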

3 Related Work

In this section, we present research that is most related to our approach. We start by addressing the successes that autoregressive models have had as a method of unsupervised learning (Section 3.1). Then, we touch on how unsupervised feature learning has previously been used as a way to handle class imbalance (Section 3.2); and finally, how it has been used to deal with missing values (Section 3.3). Again, these methods listed are not exhaustive, but serve to give the reader an understanding of the most relevant and related work, which could be compared to our implementation.

3.1 Autoregressive models for unsupervised learning

Autoregressive Predictive Coding (APC) is largely inspired by language models [10], so it should come as no surprise that most breakthroughs using autoregressive models for unsupervised learning have been in the language domain. Most recently, the Generative Pre-Training language models (GPT, GPT-2, and GPT-3) [39] [42] [43] have attracted much attention, as they have demonstrated substantial improvements in many NLP-related tasks. They follow a semi-supervised approach for language understanding tasks by first pre-training an unsupervised autoregressive model on a very large corpus of unlabeled text, and then fine-tuning on a specific supervised task. [43] Aside from fine-tuning, they also test the model’s performance in few-shot and zero-shot settings, showing competitive results. [43] In fact, GPT-3 [43] is one of the largest language models ever trained, with 175 billion parameters, and has also shown outstanding generative performance.

Furthermore, autoregressive models for unsupervised learning have also been successful in the field of computer vision, where they have recently been shown to outperform the current state of the art on unsupervised image segmentation, for example. [44]

Finally, when it comes to time series, there has been some research using autoregressive models for forecasting, including the more recently proposed DeepAR model [45], which has been shown to outperform state-of-the-art methods by 15 % in forecasting accuracy.

However, to our knowledge, there has been no research done using autoregressive models as a pre-training method to improve time series classification in the context of class imbalance and missing values.


3.2 Unsupervised feature learning for class imbalance

When it comes to unsupervised feature learning for class imbalance, several approaches have been introduced that combine feature selection with random re-sampling methods. [4] One such approach implements feature selection based on class decomposition. [46] This is done by splitting the majority class into smaller pseudo-sub-classes with relatively balanced sample sizes. Feature selection is then performed on both the pseudo-classes and the minority class. [46] However, this turns the original binary classification task into a more complex multi-class problem. [4] Another approach turns to support vector machines (SVM) in combination with SMOTE. [47] It uses a backward feature elimination process to select the features that are most relevant for discriminating between classes under imbalanced class conditions. [47] Nonetheless, these two approaches are essentially data re-sampling methods. [4] One approach that is not based on data re-sampling is dual auto-encoders features (DAF) [4]. DAF uses two stacked auto-encoders (one with a sigmoid activation function, and one with tanh) to learn different types of features from data with an imbalanced class distribution. [4] Ng et al. (2016) argue that if the feature set of the data provides a clear decision boundary, data re-sampling might not be necessary to deal with class imbalance. They propose DAF as an alternative to data re-sampling methods such as under-sampling, oversampling, and SMOTE, and indeed show that their method outperforms these methods with statistical significance. [4] According to them, this shows that a set of good learned features can yield better results than re-sampling in an imbalanced classification setting. [4]

Another interesting approach is using unsupervised learning to improve anomaly detection, which can be considered an extreme case of class imbalance [48]. In anomaly detection, the pattern of a normal process is learned and anything that does not follow this pattern is classified as an anomaly, or outlier. One simple yet powerful way to perform imbalanced classification for time series based on anomaly detection is to train an auto-encoder only on the samples of the majority class, and then classify the outliers (minority class) based on their high reconstruction loss. [49] This technique is often referred to as one-class classification. A drawback is that it becomes more difficult to classify samples when the reconstruction loss of the minority class overlaps with that of the majority class. Lubbering et al. (2020) recently proposed to overcome this limitation with an adversarial loss function that maximizes the loss of the minority class while minimizing the loss of the majority class, showing promising results. [49] Nevertheless, these methods based on anomaly detection are mainly promising for binary classification tasks, and are not as straightforward to implement in the multi-class setting.

Finally, to the best of our knowledge, no studies have been done specifically leveraging autoregressive predictive coding as an unsupervised pre-training step to tackle the class imbalance problem.


3.3 Unsupervised feature learning for time series with missing data

When addressing time series with missing data using unsupervised learning, most of the work is focused on data imputation. One example uses autoencoders to impute missing data in electronic health records [50]. Another study applies LSTM and a Denoising Autoencoder (DAE) for data imputation. [51] They use a bi-directional LSTM as the encoder to learn temporal information of the time series and use the DAE to learn correlation between the variables. They apply their method on several real-world datasets in the medical domain, and show that their approach outperforms previous imputation algorithms significantly.[51]

However, very little research has been done leveraging unsupervised learning to tackle time series with missing values without data imputation as the end goal. One proposed method is the Time Series Cluster Kernel (TCK), which computes the similarities between multivariate time series in an unsupervised way. [52] It can handle time series of varying lengths and can deal with missing data without resorting to imputation. [52] Specifically, this method takes advantage of the missing data handling properties of Gaussian mixture models (GMMs) extended with informative priors. [52] TCK applies an ensemble learning approach by combining many GMMs to form the final kernel. [53] The TCK method has many desirable properties, such as being robust to parameters and noise, and has been shown to outperform other methods on prediction tasks with missing data. Additionally, leveraging TCK, Bianchi et al. (2019) propose a Temporal Kernelized Autoencoder (TKAE) [29] to learn good representations from time series data with missing values. Specifically, through kernel alignment performed with TCK, this RNN-based autoencoder is able to learn compressed representations from the data, even with high levels of missingness. The encoder and decoder in the TKAE comprise a stack of multiple RNNs, which allows it to learn fixed-length compressed representations of variable-length multivariate time series. [29] However, TCK (and thus TKAE) has one main shortcoming that is important to highlight: it operates under the assumption that the data is missing at random (MAR). [29] Therefore, these methods do not take advantage of the informative missingness patterns that are often present in real-world applications. [53]

To overcome this, Mikalsen et al. (2019) proposed an improved kernel that is able to learn from these missingness patterns. [53] This new kernel, TCK$_{IM}$, incorporates a representation of the missingness patterns, masking, into the mixture models. However, although this kernel performs well in situations with limited or no labels available, one limitation is that it is not able to handle time series of variable length. Furthermore, unlike the other methods mentioned previously that learn missingness patterns, TCK$_{IM}$ only makes use of one representation of missingness patterns. It does not seem to leverage the time interval to take into account how long it has been since a variable was last observed. This might limit the granularity of the missingness patterns that the model can learn.

Therefore, our proposed self-supervised approach of using APC with a GRU-D encoder to learn representations is unique since it takes advantage of two representations of missingness patterns, masking and time interval, and can handle variable-length time series.


4 Data

We implement our methods for classifying time series on two different datasets. The first dataset is the Physionet Challenge 2012 [13]: a real-world clinical dataset that has previously been released and used by several research groups to tackle the same classification task. We use it as our benchmark dataset to test our methods. The second dataset is time series data from the menstrual cycle tracking app, Clue by Biowink. [11] Unlike Physionet, the Clue dataset is not publicly available.

These two datasets are well suited to answering our research question, since both contain varying levels of missing values as well as class imbalance. The main difference between the two classification tasks is that Physionet involves binary classification, while Clue involves multi-class classification. In this section, we discuss both datasets and their classification objectives. We specify the inputs and outputs for each dataset, and explain how we prepared the data and which filtering criteria were used. Finally, we also address the class imbalance and sparsity levels in each dataset.

4.1 Physionet

The PhysioNet 2012 Challenge dataset [13] [54] consists of 12,000 unique time series, each containing measurements from the first 48 hours of a patient’s admission to the ICU [6]. Each patient has 5 general descriptors, such as their age and gender, which are measured at the time of admission (see Table 1). Next to these general descriptors, 37 different measurements, such as their heart rate and cholesterol levels, are collected at irregular times, and each of these variables could be collected once, more than once or not at all. These time-series variables are specified in the Appendix Table 18. It is important to note that most of these variables are continuous variables.

The classification task that we will focus on with the Physionet data is:

In-hospital Mortality task : Predict whether a patient will survive (class 0) or die during their stay at the hospital (class 1) after the first 48 hours.

Since there are two output classes to predict, this is a binary classification task. The reason we use this dataset as a benchmark to evaluate our methods on is because it has been used in many research studies involving methods to deal with missing data [2] [6] [30]. Further, similar to the Clue dataset it contains varying levels of missing values as well as a large class imbalance, making it appropriate to answer our research question and to be approached using the same methodologies.

Table 1 shows the general descriptors. The categorical variables that are italicized, Gender and ICUType are expanded to one-hot encodings. Weight is both a general descriptor, measured at the patient’s admission, and a time-series variable.


Variable name | Unit of measure | Data type | Description

Age | years | numeric | The age of the patient at the moment of admission.

Gender | none | binary | The genotypical sex of the patient. 0: female, or 1: male

Height | cm | numeric | The height of the patient on unit admission.

ICUType | none | categorical | The type of ICU to which the patient has been admitted. 1: Coronary Care Unit, 2: Cardiac Surgery Recovery Unit, 3: Medical ICU, or 4: Surgical ICU

Weight | kg | numeric | The weight of the patient measured at admission.

Table 1: General descriptors in the Physionet dataset.

4.1.1 Data preparation & filtering

Since 12 patients did not have any time series measurements (only general descriptors), we removed them from the dataset, leaving a total of 11,988 time series instances.

Furthermore, the measurements are observed at irregular time steps on a minute-by-minute scale, but we aggregate the measurements to different time scales as a pre-processing step as is recommended per model. For the GRU model, we aggregate the measurements to a 1 hour resolution, as is done in most existing approaches using Physionet [2]. However, for the GRU-D we aggregate the measurements to a 1-decimal time resolution (e.g. 2.6 hours) as is done in the original GRU-D paper [2]. This allows the model to learn with more detail the time interval since a variable was last observed.
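The two time resolutions can be illustrated with a minimal binning sketch. It assumes that timestamps are given as integer minutes since admission, that measurements falling into the same bin are averaged, and that a 1-decimal hour resolution corresponds to 6-minute bins. The function name `aggregate` and these choices are hypothetical illustrations, not the exact pre-processing used here.

```python
import numpy as np

def aggregate(minutes, values, bin_minutes=60, horizon_minutes=48 * 60):
    """Average the measurements of one variable within fixed-width time bins.
    Bins without any measurement are returned as NaN (i.e. missing)."""
    minutes = np.asarray(minutes)
    values = np.asarray(values, dtype=float)
    n_bins = horizon_minutes // bin_minutes
    # Integer division avoids floating-point issues with 0.1-hour bins.
    idx = np.minimum(minutes // bin_minutes, n_bins - 1)
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    np.add.at(sums, idx, values)
    np.add.at(counts, idx, 1)
    out = np.full(n_bins, np.nan)
    observed = counts > 0
    out[observed] = sums[observed] / counts[observed]
    return out

# Heart rate observed 5, 50 and 156 minutes after admission.
minutes, heart_rate = [5, 50, 156], [80, 90, 100]
hourly = aggregate(minutes, heart_rate, bin_minutes=60)  # 48 bins of 1 hour
fine = aggregate(minutes, heart_rate, bin_minutes=6)     # 480 bins of 0.1 hours
```

In the hourly grid the first two measurements are merged into one value, whereas the finer grid keeps them apart (bins 0 and 8) and places the third at bin 26, i.e. 2.6 hours.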

4.1.2 Class imbalance & missing values

As mentioned, we are dealing with both class imbalance as well as varying levels of missing values. The class distribution is shown in Table 2. The imbalance here consists of the majority class, Survivor, making up 85.76 % of the dataset, while the minority class, Died in-hospital, makes up 14.24 %. Thus, we have a lot more patients surviving than dying in-hospital.

To characterize the sparsity of the dataset, we use the following two metrics: (i) the fraction of hours out of the 48 hours of data for which there are no measurements, and (ii) the average fraction of missing features per hour. The distributions of these data sparsity metrics are displayed in Figure 1. It can be inferred from Figure 1 that there are far more missing features per time step than completely missing time steps. Figure 2 shows examples of time series values for two patients: one with a high number of measurements, and one who has not been observed as often.
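As a minimal sketch, the two sparsity metrics can be computed per time series as follows. The sketch assumes that missing values are stored as NaN in an array of shape (time steps, features) and that metric (ii) is averaged over all time steps, including the completely missing ones. The function name `sparsity_metrics` is hypothetical.

```python
import numpy as np

def sparsity_metrics(x):
    """Return (fraction of time steps without any measurement,
    average fraction of missing features per time step)."""
    missing = np.isnan(x)
    frac_missing_steps = float(missing.all(axis=1).mean())
    avg_frac_missing_features = float(missing.mean(axis=1).mean())
    return frac_missing_steps, avg_frac_missing_features

# Toy series with 4 time steps and 2 features.
x = np.array([
    [1.0, np.nan],
    [np.nan, np.nan],  # completely missing time step
    [2.0, 3.0],
    [np.nan, 4.0],
])
steps, feats = sparsity_metrics(x)  # 0.25 and 0.5
```

Because a completely missing time step also counts as having all features missing, metric (ii) can never be smaller than metric (i) under this assumption.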

Output Label | Percentage % of total

Survivor | 85.760

Died in-hospital | 14.239

Table 2: Physionet: Class distribution


Figure 1: Sparsity in the Physionet dataset. Left: fraction of missing time steps per patient. Right: fraction of missing features per time step.

Figure 2: Examples of time series for two different ICU patients with different frequencies of observations. On the left, we have a time series with a high frequency of observations: the patient is observed every hour, so there are no missing time steps, only missing features. On the right is an example of a patient who is barely observed: there are a lot of missing time steps, as well as missing features.


4.2 Clue

For our task of predicting birth control discontinuation, we use a dataset obtained from the menstrual cycle tracking app Clue [11]. Clue users can self-track their menstrual cycles, including, but not limited to, their bleeding patterns, symptoms experienced, and contraception changes over time. Cycle symptoms tracked could be either psychological or physical, and they include the most common PMS (premenstrual syndrome) symptoms as well as contraceptive side-effects. [55] When it comes to tracking contraception, not only can users register changes in their birth control method, but some users also log their daily pill intake (taken, missed, double, late), which will help us quantify their compliance over time.

The Clue data set consists of 4 large tables:

• users: each user of the dataset is defined via a unique alphanumerical identifier (user id). The users table provides user-specific information such as weight, height, age, and country.

• cycles: containing unique cycle ids, linked to the user id, with cycle-specific characteristics such as cycle start date or cycle number, with cycle 1 being the first cycle tracked by the user in the app.

• tracking: including the full logging behaviour of the users by date and showing the category and type of the log.

• birth control: containing the birth control method(s) that a user specifies in their settings. Each user could either have none, 1 or several birth control methods if they have switched over time. This table specifies the type of birth control method as well as the intake regimen of the method.

Figure 3: Clue app interface: users can self-track their menses, maintain daily records of various psychological or physical symptoms, as well as contraceptive compliance and changes over time.


4.2.1 Background information & motivation

Unintended pregnancy is a major public health concern that can also have severe economic, social, and emotional consequences for the women, their families and society [56] [57]. Research exploring the underlying causes of unintended pregnancy shows that high rates of contraceptive discontinuation are a significant contributor [58]. Contraceptive discontinuation specifically refers to the event where someone ceases to use their current method of contraception without adopting another method [59]. It is estimated that the discontinuation of oral contraceptives (OCs) alone accounts for 20% of unintended pregnancies that occur each year in the United States. [58] However, if someone chooses to switch their birth control method but doesn’t make the switch fast enough or the switch happens improperly, this also results in a heightened risk of unintended pregnancy. [60]

In order to prevent unintended pregnancies, it is therefore critical to understand the dynamics underlying both discontinuation and switching. [58] In collaboration with the Bill and Melinda Gates Foundation [12], and the menstrual cycle tracking app Clue [11], this part of the study is focused on predicting the discontinuation and switching of birth control methods using data from cycles from women around the world [61].

A more extensive introduction, motivation, and literature review for the Clue study can be found in the Appendix Sections A.2 and A.3.

4.2.2 Clue Research Question

Specifically, we are interested in finding out the following:

Given that a user is currently on the birth control method pill combined alternating and that we have access to this user’s self-tracked menstrual cycle history, can we predict whether they will continue on this birth control method, switch to another birth control method or discontinue completely within a given time frame?

We are interested in users that are currently on the birth control method pill combined alternating, as it is the most popular birth control method used by Clue users, as well as in many places around the world. [62] [63] Here, pill combined refers to the hormonal oral contraceptive pill that contains a combination of synthetic versions of the hormones oestrogen and progesterone [64]. Alternating refers to the intake regimen of the pill, which in this case consists of a pack of 21 pills and then a 7 day break to fill the 28 day cycle.


4.2.3 Input

In this study, we will use both the users’ menstrual history (time series data ), as well as their demographic characteristics (static features), as specified below.

Static Features:

• $\mathrm{BMI} = \dfrac{\text{weight (kg)}}{[\text{height (m)}]^2}$

• Current age (years)

• Country

• Median cycle length of the 3 input cycles

• Variance cycle length of the 3 input cycles

• # of days on current birth control method

• Baseline OFF variables: average, minimum, and maximum values per category and symptom

Time-series features

For each user, 3 consecutive cycles will be used when forming the time series input data. For users that discontinue their birth control method or switch to another one, these will be the last 3 cycles completely on the birth control method pill combined alternating right before the switch or discontinuation happens. For users that are always on pill combined alternating, we will use their 2nd, 3rd, and 4th tracked cycle, as shown in Figure 5.

For these three cycles, the daily inputs are which birth control method the user is on, their birth control intake, and whether or not they’ve tracked any of the following categories: period, pain, energy, mental, emotion, social, medication, motivation, or productivity. The categories and the tracking choices for each of these categories are specified in Table 3. These are the features that we take into account when forming the time series input data, since they belong to some of the most common contraceptive side effects associated with discontinuation and switching. [59] [65] It is important to keep in mind that while Physionet contains mostly continuous variables, the time-series variables in Clue are categorical (either ordinal or nominal) (Table 3). We convert all the nominal categories to one-hot encoding.

Standardized cycles

Given that cycle lengths can vary between users, and also within a user even on an alternating pill regimen, we standardize cycles to 25 days each. We do this by only taking into account the first 7 days of the cycle and the last 18 days of the cycle, counting backwards from the last day of the cycle (days -18 to -1), as most of the variation in cycle length is explained by changes in the first half of the cycle (see Figure 4).
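The standardization step can be sketched as a simple slicing operation. The function name `standardize_cycle` is hypothetical, and raising an error for cycles shorter than 25 days is an assumption made for illustration, since the treatment of such cycles is not specified here.

```python
def standardize_cycle(days, n_first=7, n_last=18):
    """Keep the first `n_first` and the last `n_last` daily entries of a cycle."""
    if len(days) < n_first + n_last:
        raise ValueError("cycle is shorter than the standardized length")
    return list(days[:n_first]) + list(days[-n_last:])

# A 28-day cycle with days numbered 1..28: days 8 to 10 are dropped.
cycle = list(range(1, 29))
standardized = standardize_cycle(cycle)
assert len(standardized) == 25
```

For a 28-day cycle this keeps days 1 to 7 and days 11 to 28, so the retained days are aligned with both the start and the end of the cycle regardless of its length.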


Category | Data type | Possible values

Birth Control Method | nominal | ON, OFF, OTHER-hormonal, OTHER-non-hormonal

Birth Control Intake | nominal | taken, missed, late, double

Period | ordinal | Spotting, Light, Medium, Heavy

Pain | nominal | Cramps, Headache, Ovulation, Tender Breasts

Energy | ordinal | Exhausted, Low Energy, Energized, High Energy

Mental | nominal | Calm, Focused, Distracted, Stressed

Emotion | nominal | Happy, Sad, Sensitive Emotion, PMS

Social | nominal | Conflict, Withdrawn, Sociable, Supportive

Medication | binary | Pain Medication

Motivation | ordinal | Unmotivated, Motivated

Productivity | ordinal | Unproductive, Productive

Table 3: Time series categories for the Clue dataset

Figure 4: Standardize cycle to 25 days. The days indicated with red are the days that we take into account for our time-series input: the first 7 days and the last 18 days of a cycle, resulting in 25 days total.

4.2.4 Output

According to a study of oral contraceptive discontinuation and switching in 19 different countries, approximately 35 % of women who discontinue their contraceptive use because of dissatisfaction switched to another method of contraception within three months’ time. [66] Taking this into consideration, we are most interested in examining stable transitions in which a user either discontinues completely for at least 90 days or switches to a new contraceptive method and stays on this new method for at least 90 days. The 4 possible output labels are: ON, OFF, OTHER-hormonal, and OTHER-non-hormonal, and the birth control methods that belong to each class label are specified in Table 4.

ON: If a user never indicates a change in their settings and continues on the same birth control method, pill combined alternating, for the remaining time on the Clue app, their class label will remain ON.

OFF: If a user would switch their birth control method in the app to “None” for at least 90 days, then their output class label would be OFF. Since there are detrimental and lasting risks associated with discontinuing a contraceptive method without adopting a new method [56] [57] [58], we are very interested in understanding and predicting users that follow this track.


Figure 5: Example of the time series of 4 different users with different output birth control labels used for the classification task. The final output label is specified on the right.

OTHER (/SWITCH)

Furthermore, there will also be users that switch to a different birth control method and stay on this new method for at least 90 days. These users are divided into two groups:

• OTHER-hormonal: referring to users that switch from the pill combined alternating to another hormonal birth control method. Other hormonal methods are: IUD, implant, injection, vaginal ring, patch, pill combined continuous, pill minipil continuous (Table 4).

• OTHER-non-hormonal: referring to users that switch to a non-hormonal birth control method, such as condoms or the fertility awareness method (FAM).

The reason why we distinguish between these two separate OTHER categories is because we hypothesize that the reasons for switching might be different for users that switch to another hormonal birth control method compared to users that switch to a non-hormonal birth control method. We believe that the profiles of users going from ON to OTHER-hormonal may be different than those going from ON to OTHER-non-hormonal, in terms of the magnitude and/or types of side-effects they are experiencing as well as their compliance.


Birth Control Method | Class

Pill combined alternating | ON

None | OFF

IUD, Implant, Injection, Vaginal ring, Patch, Pill combined continuous, Pill minipil continuous | OTHER-hormonal

Condom, Fertility Awareness Method (FAM) | OTHER-non-hormonal

Table 4: Birth control classes

4.2.5 Data preparation & filtering

Here we specify some of the pre-processing done to prepare the Clue dataset for our tasks. We start by querying the data directly from the Clue database. To preserve user privacy in the final datasets, we first implement some general filters, such as removing tracking outliers (super trackers), converting some of the general descriptors to ranges, and applying K-anonymization [67]. Then, the filtering criteria we use to answer our research question are:

1. Keep all dates > January 1st 2016. We filter for these dates, since some internal Clue data processing changed after this date.

2. Keep users that have tracked at least 6 months and 6 cycles (both)

3. Filter out users who only track their period

4. Birth control method simplification & relabeling. We simplify birth control meth-ods and group them to the classes: ON, OFF, OTHER-hormonal, or OTHER-non-hormonal according to Table 4. Here, our main birth control method of interest is Pill combined alternating, specified as ON.

5. Users that have at least 3 cycles that are on the birth control method of interest (ON)

6. Detect stable transitions: Determine whether a user has switched to another birth control method, and has stayed on this new birth control method for at least 90 days. Alternatively, check whether a user has discontinued their method completely for at least 90 days. These are considered to be stable transitions. If the three cycles before this stable transition were completely ON, we keep this time series.

7. If a user has never switched and remained ON, we take their cycles 2, 3, and 4 as input cycles and the 90 days after. If a user does have a stable transition, we take their last 3 cycles on the birth control method ON before the transition as input cycles and the 90 days after.


9. Filter out input-output pairs where the user only tracks “period” in the 3 input cycles.
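The stable-transition criterion from step 6 can be sketched on a per-day sequence of class labels. This is a minimal illustration under stated assumptions: the function name `find_stable_transition` is hypothetical, each day is assumed to carry exactly one label from Table 4, and the additional check that the three preceding cycles were completely ON is omitted.

```python
def find_stable_transition(labels, min_days=90, on_label="ON"):
    """Return (day index, new label) of the first transition away from
    `on_label` that lasts at least `min_days` consecutive days, else None."""
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == on_label:
            i += 1
            continue
        # Find the end of the current run of an identical non-ON label.
        j = i
        while j < n and labels[j] == labels[i]:
            j += 1
        came_from_on = i > 0 and labels[i - 1] == on_label
        if came_from_on and j - i >= min_days:
            return i, labels[i]
        i = j
    return None

# A 10-day interruption is ignored; the later 95-day switch is stable.
history = ["ON"] * 30 + ["OFF"] * 10 + ["ON"] * 20 + ["OTHER-hormonal"] * 95
transition = find_stable_transition(history)  # (60, "OTHER-hormonal")
```

Users for whom the function returns None would keep the label ON under this sketch.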

After preparing the dataset by implementing the filtering criteria mentioned above, we end up with 63,988 unique time-series, each belonging to a unique user. Furthermore, we want to mention that unlike with Physionet, we do not switch between different time scales for the different models, since the Clue data is only available on a per-day basis.

4.2.6 Class imbalance & missing values

Again, we are dealing with an imbalanced class distribution, as specified in Table 5. Most of the users (91.52 %) remain on their birth control method pill combined alternating (ON) without ever stopping or switching. A small number of users (4.88 %) discontinue their method (OFF) for at least 90 days, and even fewer users switch either to another hormonal contraceptive method (2.22 %) or to a non-hormonal contraceptive method (1.38 %).

To characterize the sparsity of the dataset, we use the same two metrics as for the Physionet dataset, as shown in Figure 6. From Figure 6, we once again see that there are more missing features per time step than missing time steps. We note that the Clue dataset contains a higher level of sparsity than Physionet. From the first plot, it is also clear that there are two main clusters of users: a bigger cluster with users tracking most days & a smaller one with users only tracking a few days of their cycles. Finally, Figure 7 shows an example of a user that tracks very frequently, while Figure 8 displays a user that barely tracks, only keeping track of their period, birth control intake and birth control method.

Output Label | Percentage % of total

ON | 91.52

OFF | 4.88

OTHER-hormonal | 2.22

OTHER-non-hormonal | 1.38

Table 5: Clue: Class distribution


Figure 7: Example of the time series of the 3 input cycles for a user that tracks with high frequency.

Figure 8: Example of the time series of the 3 input cycles for a user that doesn’t track very often. There are a lot of missing time steps, as well as missing features.


5 Methodology

As mentioned before, we will be implementing two recurrent neural networks, GRU and GRU-D, as baselines to perform the classification tasks. Then we implement autoregressive predictive coding (APC), using either GRU or GRU-D as the encoder, and run experiments to compare their performances. In this section, each of the models is described in detail. We discuss the assumptions underlying the models and provide a theoretical outline of them.

5.1 Supervised classification

Starting with the supervised classification task, we train a GRU and GRU-D model with the class labels to learn from the time-series data and predict which class each input sequence belongs to. In the next few sections, we will use the following notations:

The multivariate time series with $D$ variables of length $T$ is specified as $X = (x_1, x_2, \ldots, x_T)^\top \in \mathbb{R}^{T \times D}$, where for each time step $t \in \{1, 2, \ldots, T\}$, $x_t \in \mathbb{R}^D$ is the $t$-th observation of all variables and $x_t^d$ refers to the measurement of the $d$-th variable of $x_t$.

5.1.1 GRU

The Gated Recurrent Unit (GRU) was proposed by Cho et al. (2014) [20] and is a type of recurrent neural network (RNN) that is capable of adaptively learning dependencies of different time scales [68]. The GRU is inspired by the LSTM, but while the LSTM makes use of 3 gates (input, forget, and output gate), the GRU uses only 2 gates (update and reset gate), making it much simpler and faster to train. Similar to the LSTM, the GRU uses its gating mechanisms to control the information that flows inside the unit, but without requiring a separate memory cell. [68] For the $j$-th hidden unit, the GRU has a reset gate $r_t^j$ and an update gate $z_t^j$ to control the hidden state $h_t^j$ at each time $t$.

The update functions of the GRU are:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{1}$$

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{2}$$

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b) \tag{3}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

where the matrices $W_z, W_r, W, U_z, U_r, U$ and the vectors $b_z, b_r, b$ are model parameters. [2] Here, $\odot$ is the element-wise multiplication and $\sigma$ is the element-wise sigmoid function.
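The four update functions translate directly into a single recurrent step. The following NumPy sketch is for illustration only: the parameter dictionary layout, the random initialization and the function names are assumptions, and no training is performed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (1)-(4)."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])        # reset gate, Eq. (1)
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])        # update gate, Eq. (2)
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b"])  # candidate state, Eq. (3)
    return (1.0 - z) * h_prev + z * h_tilde                           # new hidden state, Eq. (4)

rng = np.random.default_rng(0)
D, H = 3, 4  # input dimension and hidden size (arbitrary toy values)
params = {
    "W_r": 0.1 * rng.normal(size=(H, D)), "U_r": 0.1 * rng.normal(size=(H, H)), "b_r": np.zeros(H),
    "W_z": 0.1 * rng.normal(size=(H, D)), "U_z": 0.1 * rng.normal(size=(H, H)), "b_z": np.zeros(H),
    "W": 0.1 * rng.normal(size=(H, D)), "U": 0.1 * rng.normal(size=(H, H)), "b": np.zeros(H),
}

# Run the cell over a toy sequence of 5 time steps, starting from h_0 = 0.
h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h = gru_step(x_t, h, params)
```

With all parameters set to zero, both gates equal 0.5 and the candidate state is 0, so each step simply halves the previous hidden state, which is a quick sanity check of Eq. (4).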

Furthermore, we want to specify how we encode missing values in the GRU model. Here, we add a “missingness” flag next to each measurement $x_t^d$, set to 1 if the value is missing and 0 if the variable has been observed at that specific time step. This allows the model to learn which values were missing versus observed.
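A minimal sketch of this flag encoding is shown below. It assumes that missing values are stored as NaN and replaced by a placeholder value of 0 before being fed to the model; the function name `add_missing_flags` and the placeholder are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def add_missing_flags(x, fill_value=0.0):
    """Interleave each feature with a flag: 1 if missing, 0 if observed.
    Input shape (T, D), output shape (T, 2 * D)."""
    missing = np.isnan(x)
    out = np.empty((x.shape[0], 2 * x.shape[1]))
    out[:, 0::2] = np.where(missing, fill_value, x)  # measurements (placeholder if missing)
    out[:, 1::2] = missing.astype(float)             # missingness flags
    return out

x = np.array([[1.5, np.nan],
              [np.nan, 2.0]])
encoded = add_missing_flags(x)
# encoded == [[1.5, 0.0, 0.0, 1.0],
#             [0.0, 1.0, 2.0, 0.0]]
```

The flag lets the model distinguish a placeholder from a genuinely observed value of 0, at the cost of doubling the input dimension.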


5.1.2 GRU-D

As previously mentioned, missing values in time-series data are a complex challenge. If we are not properly handling these missing values, we are not only losing potentially important information due to sparse data, but we are also not learning the possibly informative pattern of missingness.[2] Since we want to learn this potential informativeness of the missing patterns as well, we turn to GRU-D [2], a variation of the GRU unit that can inherently handle missing values.

GRU-D stands for Gated Recurrent Unit (GRU) with trainable Decays [2] and is an RNN that takes advantage of two representations of informative missingness patterns: masking and time interval. The masking vector $m_t \in \{0, 1\}^D$ specifies which variables are missing at time step $t$, and the time interval $\delta_t^d \in \mathbb{R}$ indicates, for each variable $d$, how long it has been since its last observation. Furthermore, we denote the time stamp at which the $t$-th observation is made as $s_t \in \mathbb{R}$ and assume that the first measurement is observed at time stamp 0 (i.e., $s_1 = 0$). The masking vector and the time interval are then defined as:

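$$m_t^d = \begin{cases} 1, & \text{if } x_t^d \text{ is observed} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

$$\delta_t^d = \begin{cases} s_t - s_{t-1} + \delta_{t-1}^d, & t > 1,\ m_{t-1}^d = 0 \\ s_t - s_{t-1}, & t > 1,\ m_{t-1}^d = 1 \\ 0, & t = 1 \end{cases} \tag{6}$$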
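Equations (5) and (6) translate directly into a short loop over time steps. The sketch below is illustrative only; the function name `mask_and_intervals` and the convention of marking missing entries with NaN are assumptions.

```python
import numpy as np

def mask_and_intervals(x, s):
    """Compute the masking matrix (eq. 5) and time intervals (eq. 6).

    x: float array of shape (T, D), NaN marks a missing entry (assumption).
    s: array of shape (T,) with the time stamp of each observation, s[0] = 0.
    Returns (m, delta), both of shape (T, D).
    """
    m = (~np.isnan(x)).astype(float)  # 1 = observed, 0 = missing
    delta = np.zeros_like(m)          # delta at the first time step is 0
    for t in range(1, x.shape[0]):
        gap = s[t] - s[t - 1]
        # If the variable was missing at t-1, the previous interval is carried over;
        # if it was observed at t-1, the interval restarts at the current gap.
        delta[t] = gap + (1.0 - m[t - 1]) * delta[t - 1]
    return m, delta
```

Note that the interval at step $t$ depends on the mask at step $t-1$, so a variable that has just been observed still reports the time elapsed since its previous observation.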

In this case, the time series classification is done on the time series data $\mathcal{D}$:

$$\mathcal{D} = \{(X_n, s_n, M_n)\}_{n=1}^{N}, \quad \text{where } X_n = \left[ x_1^{(n)}, \ldots, x_{T_n}^{(n)} \right],\ s_n = \left[ s_1^{(n)}, \ldots, s_{T_n}^{(n)} \right],\ M_n = \left[ m_1^{(n)}, \ldots, m_{T_n}^{(n)} \right]$$

and we aim to predict the labels $l_n \in \{1, \ldots, L\}$.

Since we can incorporate these two representations of missingness, we can omit the “missingness” flags we used for the GRU. An example of the notation is shown in Figure 9, and the proposed GRU-D model architecture is displayed in Figure 10.

Before we continue specifying the details of the model architecture, it is important to take note of the assumptions that GRU-D makes about missing values. Che et al. [2] find that missing values in time-series data often have two important properties, especially in healthcare. First, the value of a missing variable tends to be close to a default value if the last observation of this variable happened a long time ago. This is based on the idea of homeostasis that is observed in the human body, but this tendency towards a relative equilibrium can also be found in many other fields. Second, the influence of the last observed value of a variable fades the longer this variable has been missing.


Figure 9: Example of measurement vectors $x_t$, time stamps $s_t$, masking $m_t$, and time interval $\delta_t$. Illustration from the paper Recurrent Neural Networks for Multivariate Time Series with Missing Values by Che et al. (2018) [2]

Figure 10: GRU-D model architecture. Illustration from the paper Recurrent Neural Networks for Multivariate Time Series with Missing Values by Che et al. (2018) [2]

In order to capture these two properties of missing values, GRU-D uses a decay mechanism for the input variables and the hidden states, controlled by decay rates. The idea here is that the decay rate is different for each variable, based on the underlying characteristics of the variable. This means that some variables may decay to their default value faster than others. Furthermore, since the patterns of missing data can be potentially informative, but also quite complex, it is beneficial to learn these decay rates directly from the data rather than fixing them a priori.

Therefore, we model a vector of decay rates $\gamma$ as:

$$\gamma_t = \exp\{-\max(0, W_\gamma \delta_t + b_\gamma)\} \tag{7}$$

where $W_\gamma$ and $b_\gamma$ are model parameters that are trained together with the other parameters of the GRU-D and $\delta_t$ is the vector of time intervals. This exponentiated negative rectifier is chosen to keep the decay rate decreasing monotonically in a range between 0 and 1.
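To make the behaviour of equation (7) concrete, the decay rate can be computed as in the sketch below. The function name `decay_rate` and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import numpy as np

def decay_rate(delta_t, W_gamma, b_gamma):
    """Decay rates of equation (7): exp(-max(0, W_gamma @ delta_t + b_gamma)).

    delta_t: time intervals of shape (D,).
    W_gamma: weight matrix of shape (K, D); b_gamma: bias of shape (K,).
    Returns a vector of shape (K,) with entries in (0, 1].
    """
    return np.exp(-np.maximum(0.0, W_gamma @ delta_t + b_gamma))
```

For example, with `W_gamma = np.eye(2)` and `b_gamma = np.zeros(2)`, a variable observed at the previous step with no elapsed time (interval 0) receives a decay rate of 1, while the rate shrinks towards 0 as the interval grows. The rectifier ensures that a negative pre-activation never produces a rate above 1.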

The GRU-D model uses two different trainable decay mechanisms:

1. Input decay $\gamma_x$: this decay term is used to decay a missing variable over time toward its empirical mean, instead of using the last observation as is.

$$\hat{x}_t^d = m_t^d x_t^d + (1 - m_t^d) \left( \gamma_{x_t}^d x_{t'}^d + (1 - \gamma_{x_t}^d) \tilde{x}^d \right) \tag{8}$$

where $x_{t'}^d$ is the last observation of the $d$-th variable ($t' < t$) and $\tilde{x}^d$ is the empirical mean of the $d$-th variable. In order to keep the decay rate of each variable independent from one another, $W_{\gamma_x}$ (equation 7) is constrained to be diagonal when decaying the input variable.

The empirical mean is given as:

$$\tilde{x}^d = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} m_{t,n}^d x_{t,n}^d}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} m_{t,n}^d} \tag{9}$$

where $N$ refers to the total number of time-series sequences in the dataset and $T_n$ is the length of each sequence $n$.

One important thing to note here is that, similar to the original GRU-D implementation, we pre-process all the input variables by normalizing them to have zero mean and unit standard deviation [2]. To do so, we compute the empirical mean and standard deviation on the training dataset and use these statistics to normalize the training, validation, and test sets. Consequently, the empirical mean is now $\tilde{x}^d = 0$, and the trainable decay scheme for the input (equation 8) becomes:

$$\hat{x}_t^d = m_t^d x_t^d + (1 - m_t^d) \gamma_{x_t}^d x_{t'}^d \tag{10}$$

Moreover, while Physionet has one mean per variable, the idea of a mean or default value is more complex for the Clue dataset. Since we are dealing with cycles, there are physical or psychological changes that occur on a cyclical basis. This means that the default value of a variable depends heavily on the day of the cycle. To take this into account, we use a multi-dimensional mean for the Clue dataset, with each variable having a different mean for each day of the cycle. For example, in the first 5 days we can probably expect a higher average of bleeding, while in the middle to end of the cycle this average should be close to zero. However, since users do not explicitly log a “non-bleeding” day, the missing values here give us important information about the cycle. Therefore, unlike with Physionet, we do not incorporate masking when computing the mean for Clue.

2. Hidden state decay $\gamma_h$: this mechanism is used to decay the hidden states of the GRU, instead of the input variables directly. This is done by decaying the previous hidden state $h_{t-1}$ before computing the new hidden state $h_t$:

$$\hat{h}_{t-1} = \gamma_{h_t} \odot h_{t-1} \tag{11}$$

Here we do not constrain $W_{\gamma_h}$ to be diagonal. The update functions of GRU-D are:

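$$r_t = \sigma(W_r \hat{x}_t + U_r \hat{h}_{t-1} + V_r m_t + b_r) \tag{12}$$

$$z_t = \sigma(W_z \hat{x}_t + U_z \hat{h}_{t-1} + V_z m_t + b_z) \tag{13}$$

$$\tilde{h}_t = \tanh(W \hat{x}_t + U (r_t \odot \hat{h}_{t-1}) + V m_t + b) \tag{14}$$

$$h_t = (1 - z_t) \odot \hat{h}_{t-1} + z_t \odot \tilde{h}_t \tag{15}$$

where the matrices $W_z$, $W_r$, $W$, $U_z$, $U_r$, $U$ and the vectors $b_z$, $b_r$, and $b$ are model parameters. The masking vectors $m_t$ are fed directly into the model, and $V_r$, $V_z$, and $V$ are the new parameters for them. As before, $\sigma$ indicates the element-wise sigmoid function and $\odot$ is the element-wise multiplication.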
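Putting the pieces together, one GRU-D step combines the masked empirical mean (equation 9), the two decay mechanisms (equations 7, 8, and 11), and the update functions (equations 12-15). The following NumPy sketch is illustrative and is not the implementation used in this work; the function names, the parameter dictionary, the NaN convention for missing inputs, and the storage of the diagonal of $W_{\gamma_x}$ as a vector `w_gx` are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def masked_mean(X, M):
    """Empirical mean of eq. (9). X, M: (N, T, D); entries with M == 0 are ignored.
    Assumes every variable is observed at least once."""
    return np.where(M == 1, X, 0.0).sum(axis=(0, 1)) / M.sum(axis=(0, 1))

def init_grud_params(D, H, seed=0):
    """Hypothetical initialization for illustration only."""
    rng = np.random.default_rng(seed)
    p = {"w_gx": np.ones(D), "b_gx": np.zeros(D),  # diagonal of W_{gamma_x}
         "W_gh": rng.normal(scale=0.1, size=(H, D)), "b_gh": np.zeros(H)}
    for g in ("_r", "_z", ""):
        p["W" + g] = rng.normal(scale=0.1, size=(H, D))
        p["U" + g] = rng.normal(scale=0.1, size=(H, H))
        p["V" + g] = rng.normal(scale=0.1, size=(H, D))
        p["b" + g] = np.zeros(H)
    return p

def grud_step(x_t, x_last, x_mean, m_t, delta_t, h_prev, p):
    """One GRU-D update. x_t may contain NaN where m_t == 0 (assumption).

    x_last: last observed value per variable; x_mean: empirical mean per variable.
    Returns the new hidden state and the decayed input x_hat.
    """
    # Decay rates, eq. (7); the input decay uses a diagonal weight matrix.
    gamma_x = np.exp(-np.maximum(0.0, p["w_gx"] * delta_t + p["b_gx"]))
    gamma_h = np.exp(-np.maximum(0.0, p["W_gh"] @ delta_t + p["b_gh"]))
    # Input decay, eq. (8): missing values drift from the last observation to the mean.
    x_obs = np.where(m_t == 1, x_t, 0.0)
    x_hat = m_t * x_obs + (1 - m_t) * (gamma_x * x_last + (1 - gamma_x) * x_mean)
    # Hidden state decay, eq. (11).
    h_hat = gamma_h * h_prev
    # Update functions, eqs. (12)-(15).
    r = sigmoid(p["W_r"] @ x_hat + p["U_r"] @ h_hat + p["V_r"] @ m_t + p["b_r"])
    z = sigmoid(p["W_z"] @ x_hat + p["U_z"] @ h_hat + p["V_z"] @ m_t + p["b_z"])
    h_tilde = np.tanh(p["W"] @ x_hat + p["U"] @ (r * h_hat) + p["V"] @ m_t + p["b"])
    return (1 - z) * h_hat + z * h_tilde, x_hat
```

With normalized inputs, `x_mean` is a zero vector and the input decay reduces to equation (10).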

Classification

Finally, in order to perform classification, we apply an additional linear layer with a softmax activation function to the final hidden state. This linear layer reduces the dimensionality from the hidden state size to the number of classes, and the softmax activation function allows the outputs to be interpreted as posterior probabilities of belonging to a certain class:

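$$p_k = \mathrm{softmax}(x_k) = \frac{\exp(x_k)}{\sum_{j=1}^{K} \exp(x_j)} \tag{16}$$

where $x_k$ is the $k$-th output of the linear layer and the output distribution sums to one, $\sum_{k=1}^{K} p_k = 1$, for $K$ classes.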
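The classification head can be sketched as a linear map followed by equation (16). The names `softmax` and `classify` and the weight shapes are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result of equation (16).

```python
import numpy as np

def softmax(logits):
    """Softmax of eq. (16) along the last axis, computed in a numerically stable way."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def classify(h_final, W_out, b_out):
    """Map the final hidden state (H,) to class probabilities (K,).

    W_out: (K, H) and b_out: (K,) are the parameters of the linear layer.
    """
    return softmax(W_out @ h_final + b_out)
```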

5.1.3 Objective

Cross-Entropy Loss:

When performing classification, the objective is to train the model in order to minimize the Cross-Entropy Loss. With multiclass classification, where the number of classes equals three or more, we minimize the Categorical Cross-Entropy Loss, which is given by the following function:

$$\mathcal{L}(y, p) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(p_{i,k}) \tag{17}$$

where $N$ is the number of observations, $K$ is the total number of classes, $y_{i,k}$ is the binary indicator of whether $k$ is the correct class for observation $i$, and $p_{i,k}$ is the predicted probability that observation $i$ is of class $k$. Minimizing this loss pushes the predicted probability of the correct class towards 1 for each observation. Since the Clue task involves multiclass classification, we train the models on this dataset by minimizing the categorical cross-entropy loss.

When it comes to binary classification, such as the Physionet mortality task, we minimize the Binary Cross-Entropy Loss. This can be computed with the same equation (17), simplified with K = 2.
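The relation between the two losses can be checked numerically. The sketch below is illustrative; the function names and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions and not part of equation (17).

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Eq. (17). y_onehot, p: arrays of shape (N, K); rows of p sum to 1."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def binary_cross_entropy(y, p1, eps=1e-12):
    """Eq. (17) with K = 2. y: labels in {0, 1}; p1: predicted probability of class 1."""
    p1 = np.clip(p1, eps, 1.0 - eps)
    return -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))
```

Writing the binary labels as one-hot vectors and the predictions as two-column probabilities, both functions return the same value, which illustrates that the binary loss is the $K = 2$ case of equation (17).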
