Academic year: 2021

Share "Mood prediction in e-health : comparing the predictive performance of traditional econometric methods and machine learning techniques"

Copied!
47
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


University of Amsterdam

Master’s Thesis

M.Sc. Econometrics

Mood Prediction in E-Health

Comparing the predictive performance of

traditional econometric methods and

machine learning techniques

Author: L.M. Becka
Student ID: 11084464
Supervision: Dr. M. Hoogendoorn, Dr. K.J. van Garderen, Dr. H. van Ophem

Faculty of Economics & Business


Statement of Originality

This document is written by Larissa Becka who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

This thesis analyzes and compares the predictive performance of two traditional econometric time series methods with two machine learning techniques on an application in mental healthcare. We evaluate one step ahead forecasts of the mood from an Autoregressive Integrated Moving Average model, Vector Autoregression, Support Vector Regression, and Recurrent Neural Networks. Using self-reported data from a study on electronic treatment methods for patients suffering from mental comorbidities, we find that no single model leads to the best result for all individuals. However, we show that the Autoregressive Integrated Moving Average model exhibits the most stable performance and we conclude that standard Recurrent Neural Networks perform worse than the other methods on this application. Finally, we demonstrate that all methods fail to predict more extreme variations from the mean.


Acknowledgement

My special thanks goes to Dr. Mark Hoogendoorn for his guidance, support and the provision of the data that made this thesis possible. I also want to express my gratitude to Dr. Kees Jan van Garderen and Dr. Hans van Ophem for their supervision and contribution to the project as well as their understanding for taking an unconventional approach.

I am deeply thankful to all the people who supported me during this year, and to Jan, with whom I greatly enjoyed working and discussing. Specifically, I want to thank Pieter and Joyce for their help, interest and great confidence. It is also a pleasure to thank my family for their patience, love and the opportunities they made possible over the entire period of my studies. Finally, my sincere gratitude goes to Joost, who has encouraged me to stretch my limits, and for the wonderful life we live together.

Contents

1 Introduction
2 Literature Review
  2.1 Mental Healthcare Costs
  2.2 Technology Based Treatment Methods
  2.3 Influencing Factors on Mental Health
  2.4 Classical Statistical Methods vs. Machine Learning Techniques
3 Data Set
  3.1 Data Description
  3.2 Missing Data Imputation
4 Models
  4.1 Autoregressive Integrated Moving Average Model
  4.2 Vector Autoregression
  4.3 Support Vector Machines
  4.4 Recurrent Neural Networks
5 Model Fitting
  5.1 Evaluation Criteria & Cross Validation
  5.2 Feature Engineering & Selection
  5.3 Kernel & Hyperparameter Choice
    5.3.1 Autoregressive Integrated Moving Average Model
    5.3.2 Vector Autoregression
    5.3.3 Support Vector Regression
    5.3.4 Recurrent Neural Network
6 Results & Comparison
  6.1 Predictive Performance
  6.2 Comparison
  6.3 Alternative Model Specifications
7 Discussion
  7.1 Selectivity
  7.2 Long-Term Forecasts
  7.3 Additional Factors, Features & Methods
8 Conclusion
Bibliography


Chapter 1

Introduction

As a global phenomenon, named the number one cause of ill health by the World Health Organization [1], psychological disorders such as depression are becoming increasingly prevalent. More than 300 million people were estimated to be suffering from depression in 2017 and this number is expected to rise constantly over the next decades. For far too many people the burden of their psychological disorder becomes unbearable, which in the worst case can lead to suicide. Besides the individual's own sorrow, psychological disorders also lead to major costs for society.

In recent years, a variety of domains have been transformed by technology, and the availability of increasingly more data in the healthcare sector inspires hope that technology based treatment methods could also lead to advancements in mental healthcare. In this thesis we contribute to the research around technology driven solutions in mental healthcare by evaluating the predictability of the mood of mental healthcare patients. In this context, we compare the predictive capabilities of different econometric time series methods and machine learning techniques and investigate their ability to accurately forecast mood.

The remainder of this thesis is structured as follows: chapter 2 dives into existing literature on the costs of mental healthcare, technology based treatment methods and influencing factors on mood. Chapter 3 gives an overview of the data set and discusses missing values. Chapter 4 outlines the theoretical foundation of the models considered and chapter 5 covers the parameter and feature choice. Chapter 6 evaluates the results of all models and includes a comparison of their predictive performance. In chapter 7 possible implications for future research and limitations of the approach are discussed. Chapter 8 concludes this thesis.

[1] WHO (2017)


Chapter 2

Literature Review

2.1 Mental Healthcare Costs

The costs associated with mental healthcare are difficult to estimate directly as they consist of different components. Even though expenses for treatments are directly observable, the resulting costs from unemployment are much more difficult to estimate. According to a study from the OECD (2012), the risk of unemployment is about 25 to 30 percentage points higher for individuals with severe mental disorders. Overall, the costs resulting from mental illnesses in Europe are estimated to be about 3.5% of the European GDP (OECD 2015), with mood disorders accounting for the highest share of these expenditures. Treatment costs specifically are a broadly discussed topic in the literature. Olesen, Gustavsson, Svensson, Wittchen & Jönsson (2012) mention several studies on the costs of brain disorders in Europe, and Shen (2013) shows that in the United States mental disorders have a significant and increasing effect not only on medical treatment costs but also on the probability that an individual opts for health insurance coverage.

2.2 Technology Based Treatment Methods

In addition to the financial implications of mental health treatments, another concern, mainly faced by developing countries, is the availability of access to treatment facilities. Recently, technological advancements and the rising availability of healthcare related data sparked the ambition to make use of technology in mental health treatments. These methods are expected to support the healthcare sector with more effective, cheaper and possibly preventive methods for mental health disorders. A more thorough discussion of the technological innovations in mental healthcare can be found in Hollis, Morriss, Martin, Amani, Cotton, Denis & Lewis (2015). Much like health trackers utilize indicators of physical condition such as blood pressure and heart rate to warn patient and physician of possible heart attacks or other diseases, ideally we could make use of mental health indicators to be informed about particularly severe episodes of emotional distress. To collect data on mental health indicators, ecological momentary assessments are used to capture the patient's own evaluation of his emotional state in a natural surrounding. During the assessment, the person regularly answers multiple questions on a smartphone. This system allows for a more continuous evaluation of the patient's situation and prevents recall bias, as is often observed for in-person treatment methods. However, compared to physical health indicators, factors related to the mood of a patient are often subjective and more difficult to obtain, which makes research on this topic scarce. LiKamWa, Liu, Lane & Zhong (2013) were among the first to use smartphone activity to infer the mood of its user. The goal of their study was to automate the process of sharing the emotional state of a user on social media platforms in Asia. A healthcare related study was conducted by Becker, Bremer, Funk, Asselbergs, Riper & Ruwaard (2016), who examined the predictability of mood. Based on smartphone logged data of healthy students, they capture cellphone activity and try to forecast the student's mood on the next day. Van Breda, Pastor, Hoogendoorn, Ruwaard, Asselbergs & Riper (2016) use the same data to compare the predictive capabilities of different methods (among them ARIMA, Support Vector Machines and Random Forests). Other efforts to make use of technology in this field include, for instance, Reece, Reagan, Lix, Dodds, Danforth & Langer (2016), who analyze posts from the social media platform Twitter to predict emerging depression and post-traumatic stress disorders.

2.3 Influencing Factors on Mental Health

Comparable to Van Breda et al. (2016), the aim of this thesis is to evaluate the predictive performance of different time series analysis techniques on mental healthcare data. However, the methods we use differ from those in Van Breda et al. (2016), as we additionally analyze Vector Autoregressions and Recurrent Neural Networks and exclude Random Forests. We make this choice in order to limit the number of methods which were not originally built for time series analysis. Another difference to Van Breda et al. (2016) (who only use data on smartphone activity) is that we can build on a rich data set comprising several influencing factors on mood, which we want to leverage in the predictions. One of these factors, which sometimes even contributes to the outbreak of psychological comorbidities, is sleep. Vandeputte & de Weerd (2003), for example, show that sleep disorders increase the risk of a depression. Another measurement that might influence the development of mental disorders is the patient's feeling of self-esteem, which Symister & Friend (2003) call a mediator between social support and depressive feelings. Generally, the social surrounding of an individual plays a significant role in the progress of mood disorders. Loneliness as a result of limited social contact can lead to an increase in depressive feelings (e.g. Hagerty & Williams (1999)). Furthermore, Lewinsohn & Graf (1973) find a correlation between the course of psychological diseases and the amount of pleasant activities as well as the extent to which participants find pleasure in the activity. Other common indicators of depressive episodes are rumination and worry about past as well as future events (e.g. Papageorgiou (2006)). Physical activity is also assumed to have a positive impact on mental health, according to Taylor, Sallis & Needle (1985). In three of the methods considered in this thesis we will exploit these influencing factors and their progression to establish more precise forecasts of the patient's mood.

2.4 Classical Statistical Methods vs. Machine Learning Techniques

Time series predictions are common in various applications and methods to acquire forecasts have been available for many years. The most traditional and widely used model is the Autoregressive Integrated Moving Average model (ARIMA), introduced as early as 1970 by Box and Jenkins (a revised version of their original work can be found in Box & Jenkins (1976)). While the ARIMA model takes only a single variable into account and is therefore univariate, an extension to the multivariate case can be found in Vector Autoregressions. Although both of these methods are still very popular nowadays, these classical statistical methods are subject to several limitations. For example, they rely on many assumptions about the functional form of the data and require tests to ensure properties such as stationarity. In contrast, more recently developed machine learning techniques such as Neural Networks and Support Vector Machines are not constrained by these assumptions and are able to take non-linear relationships into account. Often, these methods are non-parametric, which means they do not require the specification of a functional form. A challenge in the application of most machine learning techniques is given by the high training times and the difficulty of making the optimal decision on hyperparameters and input variables. As a result, comparisons of the effectiveness and reliability of these methods are of great interest in research on many practical applications. For instance, Ho, Xie & Goh (2002) compare an ARIMA model with Multi-Layer Feed Forward Neural Networks and Recurrent Neural Networks in the prediction of system failures. An application in finance can be found in Cao & Tay (2001). They show that Support Vector Machines outperform Multi-Layer Perceptrons in the prediction of the S&P 500 daily price index. A healthcare related research paper by Zhang, Zhang, Young & Li (2014) recently compared different techniques in their ability to forecast future epidemic disease events and concluded that no single model is superior.


Chapter 3

Data Set

3.1 Data Description

In this chapter we will provide an overview of the data set along with summary statistics and a discussion of missing values.

The data set considered in the following originates from the project E-Compared [1]. This EU initiative is concerned with the comparison of common treatment methods for mental health comorbidities with approaches combining internet, mobile and in-person treatment. It consists of a smartphone based ecological momentary assessment which includes seven categories, namely mood, worry, self-esteem, sleep, activities done, activities enjoyed and social contact. An overview of the variables and the associated questions is given in Table 3.1.

Description         Question
Mood                How is your mood?
Worry               How much do you worry about things at the moment?
Self Esteem         How good do you feel about yourself right now?
Sleep               How did you sleep tonight?
Activities done     To what extent have you carried out enjoyable activities today?
Enjoyed activities  How much have you enjoyed today's activities?
Social contact      How much have you been involved in social contact today?

Table 3.1: Ecological Momentary Assessment: Questions

The variable of interest is mood as the goal will be to obtain a reasonable prediction of an individual’s mood on the next day based on past behavior.

[1] E-COMPARED (2017)


Overall, 309 individuals took part in this study, with the duration of their participation varying across individuals. Therefore, the available data differs in its density and length per individual, as shown in Figure 3.1. The individuals were asked to report on all variables at least once a day and the ratings were measured in 0.1 increments on a Likert scale in the interval [1,10].

Figure 3.1: Number of days the individuals participated in the study.

In order to achieve reasonable results, we have to limit the individuals considered to those with a sufficient number of observations. We exclude all individuals with a time frame of less than 60 days. Note that the data will be divided into a training, validation and test set, so that two months are considered the shortest possible time frame to obtain reliable results. We also run statistical tests on the predictive performance of the models on the test set, which requires a sufficiently large number of observations in the test set. Furthermore, each variable is required to contain at least 30 non-missing observations and no more than 70% missing values. We also restrict the data to be measured in 0.1 increments. Establishing these requirements causes a possible selectivity issue, which will be discussed in detail in section 7.1. For individuals showing a long period of missing values at the beginning of the evaluation time frame, we limit the series to the rest of the period. These requirements reduce the number of participants in this study to only seven. An overview of the missing values and the number of observations in their time series can be found in Table 3.2.

Participant           1    2    3    4    5    6    7   Average
Mood                 1%  21%  11%  46%  10%   4%  14%    15%
Worry               29%  23%  13%  47%  10%   4%  14%    20%
Self Esteem         31%  23%  11%  47%  10%   4%  14%    20%
Sleep               70%  56%  28%  70%  34%  22%  50%    48%
Activities done     40%  41%  37%  61%  38%  18%  27%    37%
Enjoyed Activities  41%  38%  40%  61%  38%  19%  23%    37%
Social Contact      41%  40%  42%  61%  38%  18%  24%    38%
Total Observations  312  133  142  223  184   79   98   167

Table 3.2: Reported are the number of days considered for each individual and the percentage of days for which there is no measurement available.

Table 3.2 shows that the amount of missing values varies greatly between the individuals and the variables. We define a missing observation as a day for which no measurement was reported. The scarcest variable is sleep, for which there are on average 48% missing values. The variable of interest, mood, shows the fewest missing values. The most complete time series is given for individual 6, but this is also the participant with the fewest overall observations. An example of a boxplot for all variables is shown in Figure 3.2 for individual 3. We choose individual 3 as an example throughout this thesis, as this participant shows the median number of observations, a representative amount of missing values and a representative average mood rating. Summary statistics of all seven participants are displayed in Table 3.3. This table reveals that the reported measurements differ greatly between individuals, variables and dimensions.

3.2 Missing Data Imputation

As the time series of the remaining individuals also contain missing values, these are imputed using a Kalman filter as suggested by Jones (1980).

Figure 3.2: Boxplot of all variables for patient 3.

The Kalman filter operates on state space models, which can be described as:

y_t = Z α_t + ε_t   (3.1)

α_t = T α_{t−1} + R η_t   (3.2)

Here, y_t stands for the measurement, which is in this case given by the respective variable (mood, sleep, etc.) and which includes missing values, so that {y_t}_{t=1}^{T} is not fully observed and only a subset of it is available. α_t is the state variable and (ε_t, η_t)′ ∼ iid N(0, diag(Q, H)). The Kalman filter then finds the optimal estimate for α_t by computing the conditional mean and variance of the distribution of α_t given the previous time periods. For non-missing values the Kalman filter predicts the next observation of α_t based on the measurement variable and the state variable in period t. For missing values the Kalman filter instead relies only on the state variable to provide an estimate for α_t. Note that we only impute a value when it is missing. Assuming that α_t ∼ N(a_t, P_t), the prediction step used for a missing value is given by equations 3.3 and 3.4, where K_t denotes the Kalman gain:

a_{t+1} = T a_t   (3.3)

P_{t+1} = T P_t (T − K_t Z)′ + Q   (3.4)

Alternative methods for implementation could be to impute the mean of the first observed data point before and after the missing value, or to fully model the missing values based on the other variables. We opt for the Kalman filter as it takes the development over time into account and does not yet infer a relationship between the variables. Blocks of concurrently missing values in several variables would further increase the difficulty of modeling the missing values.
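To illustrate the idea, a heavily simplified univariate version of this imputation can be sketched for a local level model (y_t = α_t + ε_t, α_t = α_{t−1} + η_t). The function name and the fixed variances q and h are assumptions for illustration only; the thesis works with an estimated state space model rather than fixed parameters:

```python
# Minimal sketch of Kalman-filter imputation for a univariate local level
# model: y_t = alpha_t + eps_t, alpha_t = alpha_{t-1} + eta_t.
# q and h are the (assumed, fixed) state and measurement noise variances.

def kalman_impute(y, q=0.1, h=0.5):
    """Replace None entries in y with the filtered state estimate.

    Assumes at least one non-missing value is present.
    """
    a = next(v for v in y if v is not None)  # initialize state at first observation
    p = 1.0                                  # initial state variance (arbitrary)
    out = []
    for obs in y:
        p += q                  # prediction step: random-walk state, variance grows
        if obs is None:
            out.append(a)       # missing: fall back on the predicted state
        else:
            k = p / (p + h)     # Kalman gain
            a += k * (obs - a)  # measurement update of the state
            p *= (1 - k)
            out.append(obs)     # observed values are kept as-is (impute only if missing)
    return out
```

For a series like [5.0, None, 6.0] the gap is filled with the state estimate carried forward from the last observation, mirroring the behavior described above: for missing values only the state equation is used.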

Variable             Dimension    1     2     3     4     5     6     7    Mean
Mood                 Mean       4.37  6.22  6.03  6.06  4.62  5.27  5.00  5.37
                     Std        0.78  1.47  0.95  1.96  0.97  0.25  0.32  0.96
                     Min        1.8   2.8   4.0   1.0   2.3   4.5   4.0   2.91
                     Max        6.0   8.8   9.0  10.0   8.0   5.8   6.0   7.66
Social Contact       Mean       4.08  6.38  6.86  7.60  4.71  5.50  5.00  5.73
                     Std        1.56  1.62  1.43  1.57  1.57  0.34  0.47  1.22
                     Min        1.0   3.1   4.0   3.0   2.4   4.0   3.7   3.03
                     Max        6.5   9.3   9.5  10.0   8.6   6.5   6.2   8.09
Sleep                Mean       4.94  6.56  7.69  6.44  5.00  5.32  5.19  5.88
                     Std        1.21  1.84  1.38  2.04  1.00  0.45  0.41  1.19
                     Min        2.0   2.6   3.5   1.0   2.4   3.8   4.4   2.81
                     Max        8.0   8.9  10.0   9.5   7.6   6.7   6.2   8.13
Self Esteem          Mean       4.23  6.61  5.70  6.19  4.37  5.20  4.94  5.32
                     Std        0.89  1.37  0.80  1.96  0.96  0.32  0.34  0.95
                     Min        1.5   3.6   3.2   1.0   2.5   3.7   4.0   2.79
                     Max        6.0   8.8   7.7  10.0   7.7   5.9   6.2   7.47
Enjoyed Activities   Mean       4.19  6.69  7.28  6.21  5.43  5.44  5.00  5.75
                     Std        1.21  1.44  1.41  2.00  1.19  0.53  0.37  1.16
                     Min        1.0   3.6   4.3   1.0   3.0   2.5   4.0   2.77
                     Max        8.0   9.0  10.0   9.0   8.7   6.7   6.5   8.27
Pleasant Activities  Mean       3.31  5.86  7.07  6.46  5.37  5.43  4.96  5.49
                     Std        1.49  1.62  1.23  2.17  1.31  0.47  0.47  1.25
                     Min        1.0   2.7   4.0   1.0   3.1   3.0   3.9   2.67
                     Max        7.9   8.9  10.0  10.0   8.4   6.1   6.5   8.26
Worry                Mean       1.34  5.77  1.91  4.37  5.37  5.34  2.52  3.80
                     Std        0.57  1.45  1.11  2.48  1.17  0.36  0.51  1.09
                     Min        1.0   3.2   1.0   1.0   2.8   4.8   1.5   2.19
                     Max        4.0   8.8   5.0  10.0   8.0   7.0   3.7   6.64

Table 3.3: Summary statistics per participant and variable.


Chapter 4

Models

In this chapter we will discuss the theoretical foundation of all models, namely the Autoregressive Integrated Moving Average model (ARIMA), Vector Autoregressions (VAR), Support Vector Regression (SVR) and Recurrent Neural Networks (RNN).

The objective of all models described in the following is to provide one step ahead predictions of the patient's mood. To fulfill this task, a model is built for each individual separately. Mood, or the emotional state of a person, is a highly complex process and influencing factors can differ greatly between individuals. Van Breda, Hoogendoorn, Eiben, Andersson, Riper, Ruwaard & Vernmark (2017) argue that optimizing the chosen features for each patient individually leads to better results than requiring the input variables to be the same. Furthermore, in the context of mental comorbidities as considered in this thesis, the emotional state as well as the fluctuations in mood might depend on the specific disease of the patient. Figure 1 and Figure 2 of the appendix show correlation plots of all variables for patient 3 and patient 4 and clearly reveal the differences in the extent of correlation between mood and the influencing factors for the two individuals. We also inspect cross correlation plots of the mood and the influencing factors and conclude that the order of significant lags also differs strongly between patients. Based on these observations, building individual models to forecast mood seems more promising than trying to fit one model to all participants.



4.1 Autoregressive Integrated Moving Average Model

The first model applied to the data is the Autoregressive Integrated Moving Average model. The ARIMA model is commonly used in financial and other economic time series analysis. It consists of an autoregressive process of order p and a moving average process of order q. The autoregressive process can be understood as a linear regression on the variable's own lags, for which p defines the number of lags included. The process can be written as:

y_t = α + Σ_{p=1}^{P} φ_p y_{t−p} + ε_t   (4.1)

The variable of interest is y_t, α is a constant, φ is a parameter vector which needs to be estimated and ε_t is assumed to be white noise. The moving average process of order q can be described by a linear combination of the error terms as follows:

y_t = α + ε_t + Σ_{q=1}^{Q} θ_q ε_{t−q}   (4.2)

While any time series could theoretically be approximated by an AR(p) or MA(q) process alone, combining the two approaches leads to far fewer parameters that need to be estimated. In order to apply an ARIMA model to the time series data, there are some assumptions the time series needs to fulfill. A condition on stationarity requires the time series to have statistical properties which remain constant over time. This assumption can be tested using a Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test with the null hypothesis that the time series is stationary against the alternative of a unit root. In case the original time series includes a non-stationary trend term, it is possible to transform the series by taking first differences into a form in which it is difference stationary. This process is called integrated of order d, where d describes the order of differencing that needs to be applied in order to achieve stationarity. We confirm the results on the order of integration with an Augmented Dickey-Fuller test. The residual autocorrelations of a model with residuals e_t are given by:

r_k(e) = Σ_{t=k+1}^{n} e_t e_{t−k} / Σ_{t=1}^{n} e_t²   (4.3)

For a model with no autocorrelation, the residuals converge to the innovations. Residual autocorrelation can be tested, for example, using a Ljung-Box test, where we test the null hypothesis that the model is correctly specified. The Ljung-Box statistic asymptotically follows a χ² distribution with (m − p − q) degrees of freedom and, for n observations and m lags, is given as:

LB(m) = n Σ_{k=1}^{m} ((n + 2)/(n − k)) r_k²(e)   (4.4)

To evaluate whether the model shows constant variance we will visually inspect time plots of the data as well as of the model residuals. The residuals should also follow a normal distribution.
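Equations (4.3) and (4.4) translate directly into code. The sketch below computes only the statistic itself, not the full test with its χ² critical values; function names are illustrative:

```python
# Direct translation of equations (4.3) and (4.4): residual
# autocorrelations r_k(e) and the Ljung-Box statistic LB(m).

def acf_residuals(e, k):
    """Residual autocorrelation r_k(e) at lag k, equation (4.3)."""
    num = sum(e[t] * e[t - k] for t in range(k, len(e)))
    den = sum(v * v for v in e)
    return num / den

def ljung_box(e, m):
    """Ljung-Box statistic LB(m) over the first m lags, equation (4.4)."""
    n = len(e)
    return n * sum((n + 2) / (n - k) * acf_residuals(e, k) ** 2
                   for k in range(1, m + 1))
```

For strongly alternating residuals such as [1, −1, 1, −1], the lag-1 autocorrelation is −0.75 and the statistic is large, which would lead to rejecting the null of a correctly specified model once compared with the χ² critical value.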

4.2 Vector Autoregression

While the ARIMA model only considers the time series of the variable to be predicted, a method called Vector Autoregression also captures linear interdependencies among multiple time series. Vector Autoregressions were introduced by Sims (1980) and are commonly used in macroeconomic analysis as well as in the context of financial econometrics (e.g. Tsay (2005)). Essentially, each of the variables considered is described by a function of its own lags and the lags of the other variables in the model. When Y_t is a vector of time series variables and a VAR of order p is selected, the model is given by:

Y_t = c + Π_1 Y_{t−1} + Π_2 Y_{t−2} + ... + Π_p Y_{t−p} + ε_t,   t = 1, ..., T   (4.5)

The parameters Π of the model are obtained by OLS. Given that T periods were observed, a one step ahead prediction is described by:

Y_{T+1|T} = c + Π_1 Y_T + Π_2 Y_{T−1} + ... + Π_p Y_{T−p+1}   (4.6)

Comparable to the ARIMA model, non-stationarity of the data can also be an issue in a VAR. We again test for stationarity using a KPSS test and an Augmented Dickey-Fuller test. Furthermore, to handle non-stationarity in a VAR we need to know whether the series are also cointegrated. In case more than one of the time series considered shows non-stationarity, we use a Johansen trace test for cointegration. When no cointegration is present we can estimate the VAR after differencing the data. For cointegrated time series, Phillips (1986) showed that a VAR in levels can be used for forecasting tasks, as spurious regression does not pose an issue for the predictions.
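For illustration, a VAR(1) fitted by OLS with a one step ahead forecast as in equations (4.5) and (4.6) can be sketched as follows. This is a minimal example assuming NumPy is available; the thesis's actual lag order and implementation may differ:

```python
import numpy as np

# Sketch of a VAR(1) fitted by OLS, as in equations (4.5)-(4.6).
# Y is a (T x k) array; each row Y_t is regressed on [1, Y_{t-1}].

def fit_var1(Y):
    X = np.hstack([np.ones((len(Y) - 1, 1)), Y[:-1]])  # regressors [1, Y_{t-1}]
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)      # OLS coefficient matrix
    c, Pi = B[0], B[1:].T                              # intercept c and Pi_1
    return c, Pi

def forecast_var1(c, Pi, y_last):
    # one step ahead prediction: Y_{T+1|T} = c + Pi_1 Y_T
    return c + Pi @ y_last
```

On a noise-free series generated by y_t = 1 + 0.5 y_{t−1}, the fit recovers c = 1 and Π_1 = 0.5 exactly, and the forecast continues the recursion.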

4.3 Support Vector Machines

Support Vector Machines (SVM) find application in various areas where time series predictions are needed, such as financial market prediction, electricity demand forecasting and also medical data. SVMs are a supervised machine learning algorithm and can be applied to classification as well as regression tasks (Support Vector Regression). Compared to the other methods discussed here, Support Vector Machines do not automatically incorporate the time structure of the data. Instead, the model remains the same as for cross sectional data and only the input variables differ, as they include information on past observations.

One of their major strengths is that they have a unique solution and are not, like some Neural Networks, prone to merely finding a local minimum. Moreover, the SVR is a non-linear algorithm, so it can easily capture non-linear relationships between inputs and outputs. By the use of kernels, SVR maps the input variables into a higher dimensional feature space. Hereby, the optimal kernel to be used depends on the application. In its simplest form the relationship between inputs (x) and output (y) can be written as:

y = w′φ(x) + b   (4.7)

where w describes the weight vector, φ is the fixed non-linear feature space transformation and b is a bias parameter. One of the major ideas behind Support Vector Regression (SVR) as introduced by Cortes & Vapnik (1995) is the ε-insensitive error function, which ensures that SVRs do not penalize deviations of the prediction from the actual observation when they are smaller than ε. The minimization problem leading to an optimal solution for w and b therefore presents itself as:


minimize:    (1/2)||w||² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
subject to:  y_i − w′φ(x_i) − b ≤ ε + ξ_i
             w′φ(x_i) + b − y_i ≤ ε + ξ_i*
             ξ_i, ξ_i* ≥ 0   (4.8)

To guarantee that the constraints of the resulting optimization problem are feasible (Smola & Schölkopf (2004)), slack variables (ξ_i, ξ_i*) are introduced. C in equation 4.8 is the cost of constraint parameter, which guides the trade-off between the flatness of the solution and the extent to which deviations exceeding ε are penalized. If C is too high, there is a great chance of overfitting, while a low C leads to a smaller error penalty and therefore might cause a bad fit of the model. The choice of the hyperparameters ε and C as well as of the kernel for this application will be discussed in the following chapter. Figures 4.1 and 4.2 show a Support Vector Regression before and after the non-linear transformation and include a visualization of ε and the slack variables.
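The ε-insensitive idea can be made concrete in a few lines. This computes only the loss term, not the full constrained optimization in (4.8); the function name and default ε are illustrative:

```python
# The epsilon-insensitive error function underlying SVR: deviations
# smaller than eps cost nothing, larger ones cost only their excess.

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    return sum(max(abs(t - p) - eps, 0.0) for t, p in zip(y_true, y_pred))
```

For example, with ε = 0.1, a prediction off by 0.05 contributes nothing to the loss, while one off by 0.5 contributes 0.4; this tolerance band is what distinguishes SVR's objective from a plain absolute-error fit.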

Figure 4.1: SVR before transformation.
Figure 4.2: SVR after transformation.

4.4 Recurrent Neural Networks

Like the methods discussed earlier, Recurrent Neural Networks have been successfully applied to a diverse range of problems, including financial time series forecasting. They also find use in more complex tasks such as natural language processing. A Recurrent Neural Network consists of input, hidden and output layers, where the hidden layer nodes perform a non-linear transformation of the inputs, and the output node generates the predictions. The main difference to ordinary Feed Forward Neural Networks is the setup of the hidden layers, which are linked to previous observations and outputs of the Recurrent Neural Network. This allows the network to store information on the past. Being a type of dynamic network, Recurrent Neural Networks contain feedback elements, also called delay units, which are able to keep historic information in the network. This means that the output depends on the current input as well as past inputs, hidden states and previous outputs of the network.

Figure 4.3: Unrolled Recurrent Neural Network Source: Lewis (2017)

Each layer in the network consists of neurons which are activated by a specific activation function such as the tanh or sigmoid function. This activation func-tion outputs a value usually between 0 and 1 and must be differentiable. Along with it comes a threshold value which can be seen as the minimum value nec-essary to activate the neuron. The derivative of the activation function is then used to calculate the weights required for the network to learn. This is done by the application of stochastic gradient descent. Each of the neurons is associated with a function which weighs the inputs to create the output according to:

f(u) = Σ_{j=1}^{n} w_{ij} x_j + b_j    (4.9)

where the weights are given by w_{ij} and b_j is the bias parameter. To optimize the

weights, Recurrent Neural Networks use error backpropagation through time. For this purpose the network is "unfolded" as shown in Figure 4.3 and the backpropagation of the error is performed as in a Feed Forward Neural Network, so


that first the error at the output layer is calculated, followed by the error associated with the preceding layer, until the input layer is reached. In alignment with this process, the weights are then updated iteratively using stochastic gradient descent, in which the weights are updated in the direction of the gradient of a specified loss function. The size of the steps taken in the direction of the gradient is controlled by the learning rate and the momentum, two hyperparameters which need to be tuned.

We choose a standard RNN in this application as we believe that the long-term dependencies in the model are limited. Standard RNNs, however, suffer from a vanishing gradient problem, as the gradient of the loss function decays exponentially, so that long-term temporal dependencies are not taken into account. In Section 7.3 we will briefly discuss Long Short-Term Memory RNNs, which do not suffer from this problem.


Chapter 5

Model Fitting

In this chapter we give an overview of the evaluation procedure, the hyperparameter choice and the feature engineering process.

5.1 Evaluation Criteria & Cross Validation

To evaluate the predictive performance of the models we focus on the root mean squared error (RMSE) and the mean absolute error (MAE). While the RMSE puts a higher weight on larger errors, the MAE treats all errors equally regardless of their size. In case the two criteria are not aligned, we judge the results based on the RMSE. Obtaining extreme errors in the predictions means that our forecast is far off the actual value. For this application such extreme errors could lead to a psychologist missing the worst episodes, which poses a high risk to the patient. Penalizing larger errors therefore seems reasonable. Nevertheless, for comparative reasons and to gain an understanding of how big an error we can expect, we also report the MAE.
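For concreteness, the two criteria can be computed as follows (an illustrative Python sketch with made-up mood values; the thesis implementation itself is in R):

```python
# The two evaluation criteria as code (illustrative; mood values are made up).
import numpy as np

def rmse(actual, predicted):
    # Root mean squared error: larger errors weigh in quadratically.
    err = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(actual, predicted):
    # Mean absolute error: all errors count proportionally to their size.
    err = np.asarray(actual) - np.asarray(predicted)
    return float(np.mean(np.abs(err)))

actual = [6.0, 5.5, 7.0, 4.0]
predicted = [5.5, 5.5, 6.0, 6.0]
```

Note that the RMSE is always at least as large as the MAE; the gap between the two widens as the error distribution becomes more extreme.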

In order to judge the performance of the models independently from training, the data is divided into two sets. The training set consists of the first 80% of the data points and is used to estimate the model. The test set comprises the last 20% of the data points and is used to evaluate the performance of the model on data which were not used in training. To optimize the hyperparameters of the models we implement a cross validation method. At this point, note that the hyperparameters differ per model. In case of time series data, the observations used for training need to lie before


the observation to be predicted. Therefore, following Hyndman (2016), cross validation for time series data is performed on a rolling forecasting evaluation. We define a minimum training length equivalent to 60% of the data. For an individual with 100 observations in the time series we start by training a model on the first 60 observations and obtain the error on the 61st observation. This process gradually 'rolls' forward until observation 81 is reached, which is the first observation in the test set. We acknowledge that the term "cross validation" can be confusing here, as the data is not split into multiple folds at once. Based on the RMSE and MAE obtained during cross validation, the model with the lowest RMSE is chosen. The parameters of this model are then kept for training on the entire training set, which generates a new set of weights. These weights are subsequently used to make one step ahead forecasts for the test set. Finally, the different approaches are compared using the RMSE and MAE obtained for the test set. A visual example of the cross validation process is provided in Figure 5.1.

Figure 5.1: Cross-validation on rolling forecasting evaluation. Source: Hyndman (2016)
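The rolling forecasting evaluation can be written as a small helper (an illustrative Python sketch; the 80/20 split and 60% minimum training length follow the text, but the index handling is our own):

```python
# A minimal sketch of the rolling forecasting evaluation used for cross validation.
import numpy as np

def rolling_origin_splits(n_obs, min_train, test_frac=0.2):
    """Yield (train_indices, validation_index) pairs for one-step-ahead CV.

    Validation points run from `min_train` up to the start of the held-out
    test set, which covers the final `test_frac` share of the series.
    """
    test_start = int(n_obs * (1 - test_frac))
    for t in range(min_train, test_start):
        yield np.arange(t), t  # train on everything before day t, validate on day t

# Example from the text: 100 observations, minimum training length of 60.
splits = list(rolling_origin_splits(100, 60))
```

The first split trains on observations 1–60 and validates on observation 61; the window then rolls forward until the start of the test set.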

5.2 Feature Engineering & Selection

While ARIMA is a descriptive model which makes use of the univariate time series only, Vector Autoregression, Support Vector Regression and Recurrent Neural Networks are able to incorporate information from other predictors. In Vector Autoregression the regressors are clearly defined by the lags. For the supervised machine learning algorithms it is, however, necessary to create features which represent information on the past. The first set of features used in this application is based on the lagged values of all variables, including mood. Another set of features commonly used in time


series prediction are aggregations over several previous realizations (see e.g. Van Breda et al. (2017)). We follow this approach by creating the following features: the mean, the standard deviation, the minimum, the maximum, the sum and the trend, which is defined as the slope of a linear fit. To give an example of the information captured in these variables, the mean would provide information about the average activity over the last days. It might, for instance, be of importance how much social contact the respective person had throughout the last days. One day with little social interaction might not influence the mood much, but several days without contact with others might have a significant impact on the person's emotional state. Aggregating previous observations additionally helps to filter out noise in the data.

Van Breda et al. (2017) found that the optimal number of days n chosen as the aggregation interval differs per individual and lies between two and three days. Based on the same data set as considered here, they used genetic algorithms as well as random sampling and fixed windows to compare the predictive accuracy of all three methods of aggregation. Van Breda et al. (2017) also argue that the predictive performance increases when the aggregation interval is not fixed over all variables. Based on these findings, we create aggregated intervals of the last two and three days and then analyze the model performance for each of the combinations using cross validation. During this process we keep the interval per variable fixed, so that we aggregate over the mean, standard deviation, etc. using the same number of days. Within this process we also optimize the parameter settings. We choose this approach in order to align with previous research on this topic (Van Breda et al. (2017)), but we also provide a discussion of the results when including aggregated values over both two and three days in Section 6.3.
Finally, we normalize the inputs for both machine learning techniques to lie between 0 and 1.
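A hedged sketch of these aggregation features (illustrative Python with pandas; the column names and NaN handling are our own, and the thesis implementation is in R):

```python
# Rolling aggregations over the previous `window` days; each feature is
# shifted by one day so it only uses information available before the day
# being predicted.
import numpy as np
import pandas as pd

def slope(w):
    # Slope of a linear fit through the window values (the "trend" feature).
    if np.isnan(w).any():
        return np.nan
    return np.polyfit(np.arange(len(w)), w, 1)[0]

def aggregate_features(series, window):
    past = series.shift(1)  # exclude the current day
    roll = past.rolling(window)
    return pd.DataFrame({
        f"mean_{window}d": roll.mean(),
        f"std_{window}d": roll.std(),
        f"min_{window}d": roll.min(),
        f"max_{window}d": roll.max(),
        f"sum_{window}d": roll.sum(),
        f"trend_{window}d": roll.apply(slope, raw=True),
    })

mood = pd.Series([6.0, 5.0, 7.0, 6.5, 4.0])  # made-up mood ratings
feats = aggregate_features(mood, window=3)
```

The same helper would be applied to every variable, once with a two-day and once with a three-day window.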

5.3 Kernel & Hyperparameter Choice

5.3.1 Autoregressive Integrated Moving Average Model

The only parameters that need to be chosen for the implementation of the ARIMA model are the order of the autoregressive terms (p), the order of the moving average terms (q) and the order of differencing (d). To decide on the order of differencing we run a KPSS test and use first differences when the null hypothesis of stationarity is rejected. One way to find the optimal orders of p and q is to choose the combination which leads to the lowest value of a



Figure 5.2: ACF plot patient 3. Figure 5.3: PACF plot patient 3.

given information criterion (e.g. Heij et al. (2004)). Most commonly used are Akaike's Information Criterion (AIC), a corrected version of it, or the Bayesian Information Criterion (BIC). An alternative to this approach is to use cross validation and choose the model with the lowest RMSE on the validation set. In this thesis we follow the second approach because (1) the information criteria often suggest different choices and (2) we would like to establish a similar approach for choosing the hyperparameters of all models. However, we find that in most cases the model chosen by cross validation is also among the best models when considering the AIC.

To initialize the cross validation procedure for the ARIMA model, we define a grid of values for p and q for each patient separately. Both sequences start at 0 and increase in one step increments until the maximum is reached. We define the maximum as the highest number of significant autocorrelations, where we use partial autocorrelations for the order of q. Figure 5.2 shows the ACF plot for individual 3. Accordingly, we choose the maximum order considered in the grid for parameter p of individual 3 to be four. Likewise, the last significant partial autocorrelation is at lag 3. As will be shown in the following chapter, the final ARIMA model for individual 3 is first differenced, as the original time series is not stationary. The ACF and PACF plots are therefore also based on first differences.

5.3.2 Vector Autoregression

The only choice necessary for the Vector Autoregression is the number of lags to include in the model. To align this process with the other methods we also


use cross validation over a grid from 1 to 5 lags and choose the lag order with the smallest RMSE. Alternative setups will be discussed in Section 6.3.

5.3.3 Support Vector Regression

As already mentioned in the previous chapter, the choice of the kernel used for the SVR depends mainly on the application. The most commonly applied kernels are the linear kernel, the radial basis function (RBF) kernel and the sigmoid kernel. Previous studies on time series prediction problems (e.g. Müller, Smola, Rätsch, Schölkopf, Kohlmorgen & Vapnik (1999)) have shown very good results for the RBF kernel. Rüping (2001) gives an overview of the performance of different kernels for time series prediction and concludes that for time series problems comparable to the one encountered here, the RBF kernel leads to the most accurate results. The RBF kernel takes the form:

k_γ(x, x′) = exp(−γ ||x − x′||²),   γ > 0    (5.1)

The RBF kernel makes use of the Euclidean distance to evaluate the similarity of two observations. In case of a time series this translates into the RBF kernel comparing a time series window to another window of the same time series by analyzing their Euclidean distance (Rüping (2001)). The RBF kernel has only one hyperparameter, the similarity measure γ, which defines how close two observations need to be in order to be considered similar. To find the optimal γ, we use the built-in sigest function from the kernlab package in R. To choose the other hyperparameters of the SVR, namely the cost of constraint parameter C and ε, we define a grid of nine values between 0.1 and 0.9 for both parameters and choose the combination leading to the lowest RMSE in the cross validation process.
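The kernel in equation 5.1 is direct to compute; a minimal Python illustration (the thesis itself uses the kernlab package in R):

```python
# Direct computation of the RBF kernel from equation 5.1.
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    # k_gamma(x, x') = exp(-gamma * ||x - x'||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```

Identical observations yield a kernel value of 1, and the value decays towards 0 as the distance between the observations grows.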

5.3.4 Recurrent Neural Network

For the application of the Recurrent Neural Network we also need to optimize several parameter values. Next to the activation function, a cost function needs to be chosen, the number of hidden layers and nodes needs to be set, the number of epochs needs to be determined, and the learning rate as well as the momentum have to be optimized. Each of these choices is discussed in this subsection.



As activation function we use the sigmoid function:

f(u) = 1 / (1 + exp(−cu))    (5.2)

It maps a real valued number onto the interval [0,1] and it has the advantage that its derivative is computationally easy to calculate as it is given by:

∂f(u)/∂u = f(u)(1 − f(u))    (5.3)
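The derivative identity (for c = 1) is easy to verify numerically against a central finite difference (an illustrative Python check; the evaluation point is arbitrary):

```python
# Numerical check of the sigmoid derivative identity f'(u) = f(u) * (1 - f(u)).
import numpy as np

def sigmoid(u, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * u))

u, h = 0.7, 1e-6
analytic = sigmoid(u) * (1 - sigmoid(u))
numeric = (sigmoid(u + h) - sigmoid(u - h)) / (2 * h)  # central difference
```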

We add an L1 regularization term to our loss function because it leads to sparse weight vectors and should therefore automatically eliminate the influence of the most irrelevant inputs by driving their weights to zero. The number of epochs defines how many times the data is passed through the network. We set this number to 1000, which gives stable results as the epoch error converges for all individuals. Figure 5.4 shows an example of the epoch error for individual 3. We can clearly see a steep decrease of the error rate in the beginning, showing that the network is learning, and convergence towards the end. We obtain similar results for the other individuals. To define the number of nodes per layer, the learning rate and the momentum, we run a grid search using cross validation as described in the previous section. First tests show that a momentum between 0.2 and 0.3 combined with a learning rate between 0.05 and 0.5 gives reasonable results. We therefore define the grid as [0.2, 0.3] for the momentum and as [0.05, 0.1, 0.3, 0.5] for the learning rate. The number of hidden layers is set to two and we choose the number of hidden nodes in the first layer from [10, 30, 50] and in the second layer from [5, 10, 15].
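Enumerating this grid yields 72 candidate settings per individual (a minimal sketch; the rolling-CV evaluation of each candidate is omitted):

```python
# Enumeration of the RNN hyperparameter grid described in the text.
from itertools import product

momentum_grid = [0.2, 0.3]
learning_rate_grid = [0.05, 0.1, 0.3, 0.5]
layer1_nodes_grid = [10, 30, 50]
layer2_nodes_grid = [5, 10, 15]

candidates = list(product(momentum_grid, learning_rate_grid,
                          layer1_nodes_grid, layer2_nodes_grid))
```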

Figure 5.4: Epoch error of the final Recurrent Neural Network for patient 3.


Chapter 6

Results & Comparison

In this chapter we present the results of all models and compare their predictive accuracy using statistical tests. All models are also tested against a random walk benchmark, which predicts the mood of the patient as being equal to the patient's mood on the previous day.
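The benchmark is trivial to implement (a minimal Python sketch with made-up mood values):

```python
# Random walk benchmark: tomorrow's predicted mood equals today's observed mood.
import numpy as np

def random_walk_forecast(mood):
    mood = np.asarray(mood, dtype=float)
    return mood[:-1]  # one-step-ahead forecasts for days 2..T

mood = [6.0, 5.5, 7.0, 6.0]
preds = random_walk_forecast(mood)  # forecasts for the last three days
```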

6.1

Predictive Performance

Table 6.1 reports the RMSE and MAE for all patients and models. The ARIMA model shows the most consistent results and is the only model that always performs better than the benchmark. This result is also verified by statistical tests in the following section. While the ARIMA model outperforms the VAR for some individuals, the VAR model leads to a lower RMSE than the random walk benchmark for all except one individual. The SVR gives the best performance for three individuals but leads to a higher RMSE than the benchmark for most of the remaining patients. For four out of seven individuals the RNNs show the highest RMSE, and they never outperform all other models. Overall, the Recurrent Neural Networks clearly seem to perform worse than the other methods.

Table 6.2 displays the optimal hyperparameter choices for all models and shows that the hyperparameters differ strongly between individuals. The resulting orders of the ARIMA models make clear that the time series of some individuals exhibit non-stationarity and therefore have a differencing order d equal to 1. We can also see that there are individuals whose development of mood is best described by an AR(p) or MA(q) process alone. The lag order of the VAR finds its optimum between 1 and 3. This is not surprising, as a higher lag order leads to an even


higher increase in the number of parameters to be estimated. The hyperparameter choices for the SVR show that while the cost of constraint parameter differs over the full grid range, ε always lies between 0.1 and 0.3. The optimal parameters for the RNN are surprisingly stable for the number of nodes in the first hidden layer, the momentum and the learning rate. The highest differences can be found in the number of hidden nodes in the second layer. Overall, these differences underline the importance of fitting different models for each individual.

As an example, we display the actual values and the predictions of all methods on the test set for individual 3. A visual inspection of Figure 6.1 makes clear that all models fail to predict more extreme values. We observe a comparable outcome for all other individuals. For comparison, actual and fitted values for the training set of individual 3 are provided in Figure 3 of the appendix. Particularly the graphs for the VAR and SVR show a clearly better fit on the training set. Although we use cross validation for the RNN, it seems that we are not able to prevent overfitting for this model. We observe a similar pattern for most other individuals.

Patient  Type   Benchmark  ARIMA  VAR   SVR   RNN

1        RMSE   0.59       0.52   0.53  0.57  0.72
         MAE    0.45       0.42   0.41  0.45  0.60
2        RMSE   1.53       1.35   1.21  1.59  1.40
         MAE    1.30       1.06   0.93  1.29  1.16
3        RMSE   1.06       0.85   0.82  0.79  1.10
         MAE    0.83       0.64   0.65  0.57  0.91
4        RMSE   0.87       0.68   0.80  0.95  3.26
         MAE    0.56       0.46   0.56  0.72  3.02
5        RMSE   0.79       0.62   0.80  0.95  1.02
         MAE    0.65       0.51   0.65  0.84  0.87
6        RMSE   0.27       0.22   0.24  0.21  0.31
         MAE    0.23       0.17   0.19  0.17  0.26
7        RMSE   0.45       0.40   0.39  0.33  0.36
         MAE    0.37       0.32   0.31  0.25  0.26

Table 6.1: RMSE and MAE on the test set for all patients and models.



Patient  ARIMA    VAR    SVR         RNN
#        (p,d,q)  order  C    ε      layer 1  layer 2  learning rate  momentum

1        (6,0,2)  2      0.2  0.3    30       5        0.50           0.3
2        (4,0,1)  1      0.8  0.2    50       10       0.10           0.3
3        (0,1,2)  2      0.9  0.1    50       10       0.10           0.3
4        (1,1,0)  1      0.9  0.1    50       15       0.10           0.3
5        (4,1,4)  3      0.5  0.2    50       5        0.10           0.3
6        (1,0,0)  1      0.4  0.3    50       15       0.05           0.3
7        (2,1,2)  2      0.8  0.1    30       10       0.10           0.3

Table 6.2: Optimal hyperparameter choice based on CV.

6.2 Comparison

Comparing the overall errors of the models to each other does not provide any insight into whether there are significant differences between the models and the benchmark. Therefore, we run one sided Diebold-Mariano (DM) tests. This well-known test for comparing the predictive accuracy of two forecasts was introduced by Diebold & Mariano (2002) and is defined as:

DM = d̄ / √(2π f̂_d(0)/T)    (6.1)

where d̄ is the sample mean loss differential and we use the squared error as the loss function. T denotes the number of periods in the test set and f̂_d(0) is an estimate of the spectral density of the loss differential d_t at frequency zero. The Diebold-Mariano test has the advantage that it remains valid when the forecast errors have a non-zero mean or are serially and contemporaneously correlated.
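A hedged sketch of the one sided test for one step ahead forecasts under squared-error loss (for horizon h = 1 the spectral density estimate at frequency zero reduces to the sample variance of the loss differential; the toy forecasts below are our own illustration):

```python
# One sided Diebold-Mariano test sketch for one step ahead forecasts.
from math import erf, sqrt
import numpy as np

def dm_test(actual, pred1, pred2):
    """H0: forecast 1 is not better than forecast 2 (squared-error loss)."""
    actual, pred1, pred2 = map(np.asarray, (actual, pred1, pred2))
    d = (actual - pred2) ** 2 - (actual - pred1) ** 2  # positive favours forecast 1
    dm_stat = d.mean() / sqrt(d.var(ddof=1) / len(d))
    p_value = 0.5 * (1 - erf(dm_stat / sqrt(2)))  # one sided normal p-value
    return dm_stat, p_value

rng = np.random.default_rng(3)
actual = rng.normal(size=200)
good = actual + rng.normal(scale=0.1, size=200)  # accurate forecast
bad = actual + rng.normal(scale=1.0, size=200)   # noisy forecast
```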

The p-values of the one sided DM test with the null hypothesis that the benchmark predictions are better than or equal to the predictions of the model are given in Table 6.3. We find that the null hypothesis is always rejected at the 10% significance level for the best performing model. This indicates that for all individuals we have found at least one model which significantly outperforms the benchmark. The ARIMA model performs significantly better than the benchmark for five out of seven individuals. As expected, we are never able to reject H0 for the RNNs, showing that they never lead to a significantly better accuracy than the benchmark. For the VAR and the SVR the results are mixed.

Patient  ARIMA   VAR    SVR    RNN

1        0.04**  0.06*  0.49   0.84
2        0.16    0.09*  0.63   0.19
3        0.16    0.11   0.09*  0.35
4        0.08*   0.13   0.43   1.00
5        0.04**  0.64   0.81   0.92
6        0.06*   0.10*  0.06*  0.83
7        0.08*   0.06*  0.07*  0.13

Table 6.3: Reported are the p-values for the one sided DM test against the benchmark for each individual. Significance is indicated on the 10% significance level by *, on the 5% significance level by ** and on the 1% significance level by ***.

Patient  ARIMA  VAR     SVR      RNN

1        best   0.16    0.03**   0.02**
2        0.10   best    0.04**   0.41
3        0.47   0.74    best     0.06*
4        best   0.05*   0.08*    0.00***
5        best   0.03**  0.00***  0.02**
6        0.13   0.23    best     0.02**
7        0.17   0.18    best     0.13

Table 6.4: Reported are the p-values for the one sided DM test against the best model for each individual. Significance is indicated on the 10% significance level by *, on the 5% significance level by ** and on the 1% significance level by ***.

We additionally test whether the model with the lowest RMSE per individual performs significantly better than the other models. Table 6.4 shows the results of these tests. For patients 1, 4 and 5 we found that the ARIMA led to the lowest RMSE. Table 6.4 shows that for these individuals the ARIMA performs significantly better than almost all other methods. While we found that for individual 2 the VAR model had the highest accuracy, its predictions are not significantly better than those of the ARIMA and the RNN. For individuals 3, 6 and 7 we concluded that the SVR leads to the smallest RMSE. However, Table 6.4 shows that its predictive accuracy is not significantly higher than that of the ARIMA and the VAR.



Figure 6.1: Predicted (blue) and actual (red) values on the test set for patient 3.

6.3 Alternative Model Specifications

To test the robustness of our results, we also try different parameter choices. For ARIMA and VAR we evaluate whether choosing the order based on the AIC over the entire training set increases the performance. We find that it leads to slightly worse results than our approach of using cross validation and choosing the model whose lag structure yields the lowest RMSE over the validation set. We also test for serial correlation in the error terms of the final ARIMA model for all individuals, using a Ljung-Box test as described in Section 3.1. We fail to reject the H0 of independence in the error terms for all individuals.

We also check for serial correlation in the error terms of the VAR. As shown in Table 1 of the appendix, we reject H0 for individuals 1 and 4. Therefore


we re-estimate the VAR for these two patients while increasing the lag order by one. The resulting RMSE on the test set can also be found in Table 1 in the appendix. It becomes clear that for individual 1 the performance slightly decreases, while for individual 4 we find a strong increase in accuracy. Note that the AIC leads to the same lag structure as the cross validation procedure for almost all individuals. As we are only interested in forecasting future mood values and we include lags of mood for all individuals, we only use the VAR in levels, even for non-stationary time series. When non-stationarity is given, we test for cointegration. Based on these tests, we conclude that cointegration is present for at least some variables in all cases of non-stationary data. To analyze whether decreasing the number of variables in the VAR increases the model performance, we also run a VAR on all possible combinations of variables and optimize the lag structure for each combination. The results can be found in Table 2 in the appendix and show that for most individuals this procedure does not increase the performance on the test set. However, for individual 6 we find a (significant) increase in performance compared to the original VAR model. This could be caused by the limited number of observations for individual 6, as the VAR requires a large number of parameters to be estimated. Overall, we conclude that our results for the VAR are relatively robust to changes in the parameter setting.

The machine learning techniques are at least to some extent dependent on the input variables. To find out whether our approach of limiting the number of input variables by including only one level of aggregation per variable (two or three days) benefits the results, we also run the models including the full set of aggregated values. We find that the results are very close to those obtained for the original models, showing that optimizing over the time windows has neither positive nor negative effects, despite strongly increasing the time required for training.


Chapter 7

Discussion

In the following we will analyze potential issues and implications for further research.

7.1 Selectivity

Given the way the individuals report on their daily activities and the restriction of the data to certain criteria, selectivity might become an issue when generalizing the results obtained in this thesis. As there is a chance that the participants do not report their emotional state on days on which they feel particularly good or bad, the missing values might not be randomly distributed. We try to counter this effect by using a Kalman filter instead of simply imputing the mean; however, this might not capture the entire effect. Additionally, by choosing the individuals with a long enough evaluation period and a comparatively small number of missing values, our results might be restricted to a specific subset of individuals. This subset might include more stable participants or, generally, participants who have a milder form of the disease. Even a higher willingness to get better should be considered. The same issues might be caused by limiting the time series to a subset of the observation period, as this might be a particularly stable period. Given the potential selectivity issue, we caution against generalizing these results to all mental health patients and all time frames. Future research could focus on obtaining more complete data sets and on longer evaluation periods. Some of the methods often used in time series prediction, such as Long Short-Term Memory Recurrent Neural Networks, require a large number of observations in order to reach their full potential. The methods evaluated in this thesis might


also lead to a higher predictive performance for a more complete or longer time series.

7.2 Long-Term Forecasts

Predicting the emotional state of the participant for the next day is a valuable first step in the development of technology based mental healthcare solutions. However, in order to foster preventive care and make it possible for psychologists and patients to react before the mood swings, a long-term forecast is essential. Due to the limited amount of data and the higher computation times, we only analyze one day ahead predictions here and leave long-term predictions to future research. It should, however, be mentioned that the predictive performance, especially when it comes to more extreme values, is already limited for short-term predictions, as discussed earlier, and that long-term forecasts are usually of a lower accuracy.

7.3 Additional Factors, Features & Methods

This thesis gives an overview of the predictive capabilities of four different time series prediction methods for the mood of mental healthcare patients. While we believe that our set of methods covers the most common and suitable models given the structure of the data, there are plenty of other methods imaginable. For example, Long Short-Term Memory Recurrent Neural Networks have shown very good results for financial time series applications but require a very high volume of data. LSTM Recurrent Neural Networks would also solve the vanishing gradient problem which might be causing the relatively poor results obtained for the Recurrent Neural Networks. Other methods applied to time series are, for example, decision trees or more sophisticated random forests. Bayesian Vector Autoregression could be an alternative to the ordinary Vector Autoregression used here. In recent years, combinations of different methods have also been shown to be of a high predictive accuracy in time series predictions.

Given the high training times of the Support Vector Regression and the Recurrent Neural Networks, we furthermore limit the combinations of features we explore. Other combinations of features and longer time windows could filter


out more noise in the data. We focus on time windows of two and three days, which was shown by Van Breda et al. (2017) to be a good choice. However, due to the immense number of possible combinations, Van Breda et al. (2017) also concentrate on a time horizon of up to four days only.

Even though the study includes information on a broad set of variables, the literature identifies multiple other factors which influence the mood of an individual. The daily stress level and the amount of physical activity, as well as weather conditions and the mood of social contacts, are known to influence the emotional state of individuals suffering from a mental health condition. Including those might lead to an increase in the performance of the predictions. Unfortunately, we do not have information on any of these factors.


Chapter 8

Conclusion

The aim of this thesis was to compare the predictive performance of four time series methods when applied to the mood of patients suffering from a mental disorder. The analysis was based on an ecological momentary assessment from the project E-COMPARED, which collected the patients' self-rated measurements of their mood and six other influencing factors. We obtained the one-step ahead predictions of an Autoregressive Integrated Moving Average model, a Vector Autoregression, a Support Vector Regression and Recurrent Neural Networks, for which we tuned the parameters in a cross validation procedure.

A thorough evaluation of the results and related statistical tests shows that the model with the best results based on the RMSE differs per individual. We also find that there is always a model which significantly outperforms the benchmark. Furthermore, the ARIMA model, which relies solely on mood lags as input variables, is the only method that leads to an improved predictive accuracy for all individuals compared to the benchmark. Finally, we cannot deny that the standard Recurrent Neural Networks perform comparatively poorly and for some individuals lead to quite extreme errors in prediction. This might be caused by the vanishing gradient problem, and although we use a cross validation procedure, the RNNs seem prone to overfitting the training set.

A visual inspection of the resulting forecasts reveals a severe issue, namely that all models fail to predict more extreme values in mood. Optimizing the hyperparameters and lag order for each individual separately shows that they differ greatly between the patients. The number of missing values and the high variability in mood make accurate predictions difficult to obtain. Related to the missing values, we also discuss a possible selectivity issue which might limit our ability to generalize our results beyond this study.



While we believe that the need for technology driven mental healthcare treatments is given, we nevertheless caution against their premature use in real world applications due to their lack of reliability and the high stakes involved. For future research we recommend putting more emphasis on choosing the input variables for the Support Vector Regression and stress the need for more complete data sets. However, in a time when research institutions and large corporations such as Google or IBM have joined forces in the quest to improve technology driven healthcare solutions, we have no doubt that technology will also positively impact mental healthcare treatments. We hope that this thesis has contributed to the research by shedding some light on the differences in predictive performance of common time series methods when applied to mood predictions of mental healthcare patients.


Bibliography

Becker, D., Bremer, V., Funk, B., Asselbergs, J., Riper, H., & Ruwaard, J. (2016). How to predict mood? Delving into features of smartphone-based data. In AMCIS.

Box, G. & Jenkins, G. (1976). Time series analysis: Forecasting and control (Rev. ed.). Holden-Day.

Cao, L. & Tay, F. E. (2001). Financial forecasting using support vector ma-chines. Neural Computing & Applications, 10 (2), 184–192.

Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine learning, 20 (3), 273–297.

Diebold, F. X. & Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1), 134–144.

E-COMPARED (2017). Our mission. https://www.e-compared.eu/about-us/mission/. Accessed: 2017-06-08.

Grewal, M. (2011). Kalman filtering, (pp. 705–708). Springer.

Hagerty, B. M. & Williams, A. (1999). The effects of sense of belonging, social support, conflict, and loneliness on depression. Nursing research, 48 (4), 215– 219.

Heij, C., De Boer, P., Franses, P. H., Kloek, T., Van Dijk, H. K., et al. (2004). Econometric methods with applications in business and economics. OUP Oxford.

Ho, S., Xie, M., & Goh, T. (2002). A comparative study of neural network and box-jenkins arima modeling in time series prediction. Computers & Industrial Engineering, 42 (2), 371–375.


Hollis, C., Morriss, R., Martin, J., Amani, S., Cotton, R., Denis, M., & Lewis, S. (2015). Technological innovations in mental healthcare: harnessing the digital revolution. The British Journal of Psychiatry, 206(4), 263–265.

Hyndman, R. (2016). Cross-validation for time series. https://www.robjhyndman.com/hyndsight/tscv/. Accessed: 2017-06-18.

Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics, 22, 389–395.

Lewinsohn, P. M. & Graf, M. (1973). Pleasant activities and depression. Journal of consulting and clinical psychology, 41 (2), 261.

Lewis, N. D. (2017). Neural Networks for Time Series Forecasting with R: An Intuitive Step by Step Blueprint for Beginners. CreateSpace Independent Publishing Platform.

LiKamWa, R., Liu, Y., Lane, N. D., & Zhong, L. (2013). Moodscope: Build-ing a mood sensor from smartphone usage patterns. In ProceedBuild-ing of the 11th Annual International Conference on Mobile Systems, Applications, and Services, (pp. 389–402).

M¨uller, K.-R., Smola, A., R¨atsch, G., Sch¨olkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Using support vector machines for time series prediction.

OECD (2012). Sick on the job? – myths and realities about men-tal health and work. http://www.oecd.org/els/mental-health-and-work-9789264124523-en.htm. Accessed: 2016-07-05.

OECD (2015). Fit mind, fit job – from evidence to practice in mental health and work. http://www.oecd.org/els/fit-mind-fit-job-9789264228283-en.htm. Accessed: 2017-07-05.

Olesen, J., Gustavsson, A., Svensson, M., Wittchen, H.-U., & Jnsson, B. (2012). The economic cost of brain disorders in europe. European Journal of Neurol-ogy, 19 (1), 155–162.

Papageorgiou, C. (2006). Worry and rumination: Styles of persistent negative thinking in anxiety and depression. Worry and its psychological disorders: Theory, assessment and treatment, 1, 21–40.

(44)

BIBLIOGRAPHY 38 Phillips, P. C. (1986). Understanding spurious regressions in econometrics.

Journal of econometrics, 33 (3), 311–340.

Reece, A. G., Reagan, A. J., Lix, K. L. M., Dodds, P. S., Danforth, C. M., & Langer, E. J. (2016). Forecasting the onset and course of mental illness with twitter data. CoRR, abs/1608.07740.

R¨uping, S. (2001). Svm kernels for time series analysis. Technical report, Technical Report, SFB 475: Komplexit¨atsreduktion in Multivariaten Daten-strukturen, Universit¨at Dortmund.

Shen, C. (2013). Determinants of health care decisions: Insurance, utilization, and expenditures. The Review of Economics and Statistics, 95 (1), 142–153. Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48 (1), 1–48. Smola, A. J. & Sch¨olkopf, B. (2004). A tutorial on support vector regression.

Statistics and computing, 14 (3), 199–222.

Symister, P. & Friend, R. (2003). The influence of social support and prob-lematic support on optimism and depression in chronic illness: a prospective study evaluating self-esteem as a mediator. Health Psychology, 22 (2), 123. Taylor, C. B., Sallis, J. F., & Needle, R. (1985). The relation of physical activity

and exercise to mental health. Public health reports, 100 (2), 195.

Tsay, R. S. (2005). Analysis of financial time series, volume 543. John Wiley & Sons.

Van Breda, W., Hoogendoorn, M., Eiben, G., Andersson, G., Riper, H., Ruwaard, J., & Vernmark, K. (2017). A feature representation learning method for temporal datasets, (pp. 1–8). Institute of Electrical and Elec-tronics Engineers, Inc.

Van Breda, W., Pastor, J., Hoogendoorn, M., Ruwaard, J., Asselbergs, J., & Riper, H. (2016). Exploring and Comparing Machine Learning Approaches for Predicting Mood Over Time, (pp. 37–47). Springer International Publishing. Vandeputte, M. & de Weerd, A. (2003). Sleep disorders and depressive feelings: a global survey with the beck depression scale. Sleep Medicine, 4, 343 – 345. WHO (2017). Depression. http://www.who.int/mediacentre/factsheets/fs369/en/.

(45)

BIBLIOGRAPHY 39 Zhang, X., Zhang, T., Young, A. A., & Li, X. (2014). Applications and com-parisons of four time series models in epidemiological surveillance data. PLoS One, 9 (2), e88075.

(46)

Appendix

Figure 1: Correlation plot for patient 3.

Figure 2: Correlation plot for patient 4.

Patient            1      2      3      4      5      6      7
Lag by CV          2      1      2      1      3      1      2
Lag by AIC         2      1      1      1      3      1      2
P-value LB test    0.00   0.47   0.29   0.00   0.50   0.65   0.98
Increased lag      3      -      -      2      -      -      -
RMSE               0.57   -      -      0.69   -      -      -

Table 1: Reported are the original lag order of the VAR model, the lag order suggested by the AIC, and the p-value of the Ljung-Box test for serial correlation of the error terms given the original lag order. When we reject the null hypothesis of independent error terms, we increase the lag order by one and report the RMSE on the test set for the model with the increased lag order.
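The residual check behind Table 1 (run a Ljung-Box test on the VAR residuals, and raise the lag order when serial correlation is detected) can be sketched in pure Python. This is a minimal sketch built from the textbook definition of the Q statistic; the 5% chi-squared critical value for h = 5 lags is hardcoded as an assumption, and the VAR fitting step itself is not reproduced here.

```python
def autocorr(x, k):
    """Sample autocorrelation of the series x at lag k."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return ck / c0

def ljung_box_q(residuals, h):
    """Ljung-Box Q statistic over the first h residual autocorrelations."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorr(residuals, k) ** 2 / (n - k) for k in range(1, h + 1)
    )

# Under the null of independent errors, Q is approximately chi-squared
# with h degrees of freedom; the 5% critical value for h = 5 is 11.07.
CRIT_5PCT_H5 = 11.07

def needs_higher_lag(residuals, h=5):
    """True when the Ljung-Box test rejects independence of the residuals,
    i.e. when the VAR lag order should be increased by one."""
    return ljung_box_q(residuals, h) > CRIT_5PCT_H5

# A strongly autocorrelated residual series triggers a lag increase:
alternating = [(-1) ** t for t in range(100)]
print(needs_higher_lag(alternating))  # True: Q is about 494.7, far above 11.07
```

In practice one would pass the mood-equation residuals from the fitted VAR into `needs_higher_lag`; a library implementation such as `statsmodels.stats.diagnostic.acorr_ljungbox` additionally returns exact p-values.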

(47)


Patient   RMSE   MAE    Order   Variables
1         0.56   0.45   4       mood, social contact, self-esteem
2         1.44   1.09   5       mood, social contact, self-esteem, enjoyed activities
3         0.95   0.75   4       mood, enjoyed activities, worry
4         0.80   0.52   1       mood, sleep, self-esteem, enjoyed activities, activities done
5         0.80   0.65   5       mood, sleep, self-esteem, worry
6         0.18   0.14   3       mood, self-esteem, worry
7         0.39   0.29   2       mood, social contact, sleep, self-esteem, enjoyed activities, worry

Table 2: Reported are the results of a minimization of the RMSE over all combinations of variables within the cross-validation framework, and the RMSE and MAE on the test set for the resulting model.
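The exhaustive variable-subset search behind Table 2 can be sketched as follows. The `cv_rmse` scoring function is a hypothetical stand-in for cross-validated evaluation of the Support Vector Regression, which is not reproduced here; only the search over subsets (with mood always included as the target's own history) is shown.

```python
from itertools import combinations

def best_variable_subset(variables, cv_rmse):
    """Exhaustively score every subset of input variables that contains
    'mood' and return the subset with the lowest cross-validated RMSE."""
    candidates = [v for v in variables if v != "mood"]
    best_subset, best_rmse = ("mood",), cv_rmse(("mood",))
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            subset = ("mood",) + combo
            rmse = cv_rmse(subset)
            if rmse < best_rmse:
                best_subset, best_rmse = subset, rmse
    return best_subset, best_rmse

# Toy scores standing in for the cross-validated SVR errors:
toy_scores = {
    ("mood",): 0.90,
    ("mood", "self-esteem"): 0.60,
    ("mood", "self-esteem", "worry"): 0.55,
}
score = lambda subset: toy_scores.get(subset, 1.0)
subset, rmse = best_variable_subset(
    ["mood", "self-esteem", "worry", "sleep"], score
)
print(subset, rmse)  # ('mood', 'self-esteem', 'worry') 0.55
```

With six candidate variables per patient, the search evaluates at most 2^6 = 64 subsets, so exhaustive enumeration remains cheap relative to the cost of each cross-validated fit.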
