MSc Brain and Cognitive Sciences

Cognitive Neuroscience

Research Project II

Prediction of Cognitive Decline on a Continuous Scale

in Alzheimer's Disease

A Comparison of Different Model Classes

By

Enny H. van Beest

UvA ID: 10002255

10 June, 2015

36 ECTS

November 2014 – June 2015

Supervisor:

Examiner:

Dr. Kerstin Ritter-Hackmack

Dr. Guido van Wingen


Contents

1. Abstract
   1.1 Keywords
2. Introduction
   2.1 The origin of ADNI
   2.2 Biomarkers for Alzheimer's Disease
   2.3 Structural risk minimization versus Bayesian formulation
   2.4 Multimodal and multitask learning
   2.5 Goals of this study
      2.5.1 Total NP scores
      2.5.2 Cognitive and functional sub scores
      2.5.3 Hypotheses
3. Methods
   3.1 Subjects
   3.2 Input
   3.3 Output
   3.4 Models and implementation
   3.5 Statistics
   3.6 Sub scores
4. Results
   4.1 Total scores
      4.1.1 Using both correlation and mean squared error as accuracy measures
      4.1.2 Models based on regression and integrated modalities more accurate
      4.1.3 Computation time
      4.1.4 Using different modalities
      4.1.5 Integrating modalities shows low prediction error
      4.1.6 Relative second-level weight
      4.1.7 Using all modalities integrated gives best prediction
      4.1.8 Consistency of results
   4.2 Predicting sub scores
      4.2.1 Some sub scores can be predicted
      4.2.2 Neural correlates
5. Discussion
   5.2 Integrated multimodality input better than concatenated multimodality or unimodality input
   5.3 Sub scores can be predicted
   5.4 Clinical application and future directions
      5.4.1 Biases for feature weights
      5.4.2 Feature selection and imputation for regression
   5.5 Main conclusions
6. Abbreviations
7. References
8. Acknowledgements
9. Supplementary materials

1. Abstract

In this study, we compared different model classes on their performance in predicting cognitive decline in elderly single subjects after 6, 12 and 24 months, based on baseline data measured at 0 months. Data from healthy controls, mildly cognitively impaired individuals and Alzheimer's disease patients were obtained from the database of the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cognitive decline was measured by two continuous neuropsychological (NP) scores - the Mini Mental State Exam (MMSE) and the Alzheimer's Disease Assessment Scale (ADAS) - and one functional activity questionnaire score (FAQ). Baseline scores included MRI, PET, CSF proteins, gene information and NP scores. Three comparisons were made: first, Gaussian processes versus structural risk minimization; second, singletask learning versus multitask learning; and third, multimodality integration versus concatenation. An additional question was whether we need multiple modalities at all, or whether single modalities could predict NP scores just as accurately. Using all modalities combined, we found correlations between 0.77 and 0.90 between true and predicted scores for every model class. When comparing the different models, we found that 1) model classes with structural risk minimization performed better than those with Gaussian processes and 2) using multiple modalities in an integrated way gave higher prediction accuracies than using concatenation or single modalities. However, these effects were small (differences in correlation between models were all smaller than 0.1) and performance depended on the time point of prediction and on the clinical group. In our models we assumed that prediction of cognitive decline can be solved as a linear combination of input features, whereas several models describe biomarkers as developing dynamically over time.
We believe we can further improve our most accurate model - structural risk minimization with integrated modalities - by taking more complex weights into account.

1.1 Keywords

Alzheimer's Disease, Mild Cognitive Impairment, biological markers, magnetic resonance imaging, positron-emission tomography, Apolipoprotein-E, Artificial Intelligence, Cerebrospinal Fluid


2. Introduction

2.1 The origin of ADNI

Dementia is a general term for memory loss and the loss of other intellectual abilities to such a degree that daily life is affected; Alzheimer's Disease (AD) is its most common form. AD and other forms of dementia differ from the possible loss of cognitive abilities in normal aging or mild cognitive impairment (MCI), since in the latter two cases symptoms are not severe enough to impair quality of life. In 2014, over 5 million people suffered from AD in the United States alone (“Alzheimer’s Association | Latest Facts & Figures Report,” n.d.). The high number of patients, combined with the relatively high costs per patient, makes AD the most expensive condition in the United States (“Alzheimer’s Association | Latest Facts & Figures Report,” n.d.). Although no cure is available yet, both progression rate and symptom severity can be reduced using drug and non-drug treatment (“Alzheimer’s Association | Latest Treatment Options,” n.d.). To benefit optimally from such treatment, it is important to detect AD as early as possible. In 2004 the World Wide Alzheimer's Disease Neuroimaging Initiative (WW-ADNI) was created, with goals focused both on early diagnosis of AD and on predicting the course of the disease before the first symptoms are clinically visible (“ADNI | Background & Rationale,” n.d.). This is relevant, for example, for the diagnosis of MCI patients, who will often convert to AD when their symptoms are not caused by an underlying disease other than AD.

2.2 Biomarkers for Alzheimer's Disease

Different biomarkers have been found that can be used to classify AD and predict its course. The one that can be identified earliest, even before the onset of MCI as a pre-stage of AD, is an increased amount of amyloid-β (Aβ40) plaques in the diseased brain (“Alzheimer’s Association | Alzheimer's & Brain Research Milestones | Research Center,” n.d.). Another early biomarker is an increased amount of tau, a key component of the tangles that are observed in AD (Craig-Schapiro, Fagan, & Holtzman, 2009). When symptoms of cognitive decline start to become observable, alterations in brain structure can be measured using imaging techniques such as positron emission tomography (PET) and magnetic resonance imaging (MRI). As AD progresses, biomarkers in the brain become more and more abnormal, while scores on neuropsychological (NP) tests worsen. A different type of biomarker is the risk-factor gene apolipoprotein-E (APOE), which is only partly helpful in prediction and classification problems, since its value is constant over time and the gene is non-deterministic (“Alzheimer’s Association | Alzheimer's & Brain Research Milestones | Research Center,” n.d.).

2.3 Structural risk minimization versus Bayesian formulation

A relation between cognitive scores and structural changes in the brain was already shown in the early years of ADNI. Structural changes were quantified using voxel-based morphometry and region-of-interest methods on MRI data (Baxter, Sparks, & Johnson, 2006; Duchesne, Caroli, Geroldi, Collins, & Frisoni, 2009; Jack Jr et al., 2008). Later, machine learning methods were implemented to classify AD, MCI and healthy control (HC) individuals. Davatzikos, Fan, Wu, Shen, and Resnick (2008), for example, used high-dimensional pattern classification on MRI data to distinguish MCI patients from HCs, achieving an accuracy of 90%. Unfortunately, distinguishing MCI from HC individuals alone is not enough to predict AD or even to select participants for clinical trials in AD research, since the underlying problems causing MCI can differ from pre-stage AD pathology. Symptom severity is also diverse within a single group of individuals, which makes methods that predict clinical scores on a continuous scale more interesting than binary classification. Whereas many methods make use of a structural risk minimization (SRM) framework, Stonnington et al. (2010) used relevance vector regression (RVR) in a Bayesian framework to predict different cognitive measurements from whole-brain weighted MRI images. They found significant correlations between predicted and actual scores for those cognitive measurements that recruit broader brain areas (Stonnington et al., 2010). Wang, Fan, Bhatt, and Davatzikos (2010) compared this method with support vector regression (SVR), which implements an SRM framework. After using principal component analysis to reduce the dimensionality of white matter, gray matter and cerebrospinal fluid (CSF) features, RVR was consistently shown to predict cognitive measures more accurately, with higher correlations, than SVR (Wang et al., 2010). The cognitive measurements predicted in these studies, however, were those of three to six months after initial scanning. To have clinical relevance, we should be able to predict both cognitive and other symptom scores not only more accurately at present, but also further into the future. Another implementation of the Bayesian framework comes from Marquand, Brammer, Williams, and Doyle (2014), who performed ordinal regression using Gaussian processes (GP) to classify individuals as AD, MCIn (non-converting to AD in a specific time period), MCIc (converted to AD in a specified time period) and HC, made possible by treating progression from HC to AD as an ordinal problem. An advantage of this approach is that predictions can be made for single subjects as well (Doyle et al., 2014).

2.4 Multimodal and multitask learning

To increase accuracy, researchers started to use more of the available information by taking more than one biomarker at a time to predict or classify AD-related scores. One example is a study by Zhang, Wang, Zhou, Yuan, and Shen (2011), in which not only MRI, but also PET scans and Aβ40 and tau levels in the CSF were used to distinguish AD and MCI from HC individuals. A year later, the same authors combined multimodal learning with multitask learning (MTL), meaning that classification of AD, MCI and HC individuals and prediction of individual cognitive scores were done simultaneously (Zhang & Shen, 2012). It is believed that MTL increases accuracy, on the assumption that when similar tasks need to be performed by the same algorithm, information about one task can help solve the other (Argyriou, Evgeniou, & Pontil, 2007; Evgeniou, Micchelli, & Pontil, 2005; Evgeniou & Pontil, 2004). A side note is that in most cases of clinical classification the same input is used for these different tasks; therefore, when we talk about MTL here, this actually means multi-output learning (Alvarez, Rosasco, & Lawrence, 2011). In the design of Zhang and Shen (2012), called multi-modal multi-task learning (M3T), these two tasks (i.e. classification and prediction) were said to be learned simultaneously while using multimodal (multiple biomarker) data, as described in their previous work. Feature selection was indeed done using MTL. The actual classification and prediction, however, were done separately, using a support vector machine and SVR respectively (Zhang & Shen, 2012; Zhang et al., 2011). Another group used MTL in a temporal group lasso regularization paradigm to select a small number of features for every task, which in this case was predicting clinical scores at different time points (Jiayu Zhou, Yuan, Liu, & Ye, 2011). Recently it has been shown that MTL can also be implemented in GP.
Marquand, Brammer, Williams, and Doyle (2014) used this to decode multi-subject neuroimaging data for different (but related) tasks simultaneously.

2.5 Goals of this study

2.5.1 Total NP scores

Many more methods to predict cognitive decline are available. However, a direct comparison on which to base the choice of method for clinical settings is lacking. Moreover, prediction is often done only for the near future, while it is at least as relevant to predict cognitive functioning further into the future. In this study, both of these aspects are taken into account. We will test GP model classes against models using SRM with regularization (table 1), while also testing singletask learning (STL) against MTL (figure 1a). More specifically, all models will try to predict scores relevant for AD diagnosis 6, 12 and 24 months later. Prediction is based on input consisting of baseline NP, MRI, PET, CSF and APOE-ε4 scores. A general idea is that multimodal input predicts output more accurately than a single modality, hence model comparison here will mostly be done using multimodal input. However, it is unclear whether it is best to concatenate all modalities when predicting scores (i.e. determining model parameters for all features at once; the model does not distinguish different modalities), or to integrate modalities (i.e. different modalities have different second-level parameters on top of individual parameters per feature, figure 1b). This will be another comparison tested in our analyses. A further question is whether it is necessary to combine all these biomarkers at all. If we could predict just as accurately with only one of the modalities as with all of them combined, there is no need to increase model complexity. Another part of this study therefore asks which (combination of) modalities is optimal to predict NP scores.

2.5.2 Cognitive and functional sub scores

Usually, models for AD prediction are trained on more general NP scores, which are sums or averages of different subtest scores. In the Alzheimer's disease assessment scale (ADAS) questionnaire, for example, different domains include remembering a list of words or drawing a clock. A - to our knowledge - new question in AD prediction is whether models that can predict the general NP scores can also predict the different sub scores within those questionnaires. This might be relevant when one wishes to know the future abilities of a patient in a more specific way, for example when developing a personal treatment. It might also be relevant to know how functional abilities will develop over time: are we able to predict functioning in daily activities, like preparing a cup of coffee? A more exploratory part of this study tries to answer these questions.

2.5.3 Hypotheses

The underlying null-hypotheses of these questions are that 1) for predicting cognitive decline and functioning in daily life on a continuous scale, for different time points over a time course of two years, different model classes have the same performance. In other words, STL and MTL models perform equally, as do SRM and GP, and models using concatenated versus integrated modalities. 2) All (combinations of) modalities used in a model, integrated or concatenated, give equal performance. 3) Models are able to predict sub scores for AD. For these hypotheses, performance is measured by the accuracy of the model: the correlation between predicted and actual values, and the mean squared error (MSE).


| Training framework           | Gaussian processes               | Structural risk minimization            |
|------------------------------|----------------------------------|-----------------------------------------|
| Hyper-parameter optimization | Minimize function (GPML toolbox) | Grid search and nested cross-validation |
| Model parameters             | Mean and covariance functions    | 'Least-dirty' function to find weights  |

Table 1: two important differences between GP and SRM with regularization, regarding hyper-parameter optimization and the determination of model parameters.

Figure 1: schematic overview of contrasts. a) Input and labels are divided into 10 parts. For training, 9/10 of the data are used to learn the relation between input and target labels; the remaining 1/10 (test data) is used to measure the accuracy of the learned model when predicting scores from unseen input data. In the next fold, another 1/10 is used as test data. The difference between STL and MTL is that in STL training is done for every task separately, whereas in MTL training is done per NP category, but simultaneously for the different time points. b) The input of the model consists of the five modalities MRI, PET, CSF, APOE and NP, each with a different number of features. These modalities can either be concatenated (the model does not acknowledge different modalities) or integrated (the model acknowledges different modalities; a modality parameter α has to be determined).


3. Methods

3.1 Subjects

We retrieved the data for this study from the ADNI dataset (“ADNI | Access Data,” n.d.). This set contains values for AD biomarkers, NP scores and demographics for AD, MCI and HC individuals. AD patients had MMSE (Mini Mental State Exam) scores between 18 and 27, ADAS scores between 12 and 55 and CDR-box (Clinical Dementia Rating for different “boxes”) scores between 1 and 10, and in addition met the criteria of the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA). MCI individuals had MMSE scores between 23 and 30, ADAS scores between 3 and 40 and CDR-box scores between 0 and 5.5; HC individuals had MMSE scores between 24 and 30, ADAS scores between 0 and 24 and CDR-box scores between 0 and 1. See table 2 for more information about the participants.

| Group | Number | APOE-ε4     | MMSE bs      | ADAS-Cog bs  | CDR-box bs  | Aβ40          | T-tau         | PET (z-score)  | Hippocampal volume |
|-------|--------|-------------|--------------|--------------|-------------|---------------|---------------|----------------|--------------------|
| HC    | 101    | 0.24 ± 0.00 | 28.99 ± 0.01 | 8.95 ± 0.04  | 0.03 ± 0.00 | 215.73 ± 0.54 | 69.71 ± 0.33  | 76.22 ± 1.59   | 7488.16 ± 8.10     |
| MCIn  | 177    | 0.51 ± 0.00 | 28.05 ± 0.01 | 14.08 ± 0.03 | 1.33 ± 0.00 | 181.52 ± 0.31 | 82.61 ± 0.28  | 102.54 ± 2.05  | 7104.19 ± 6.35     |
| MCIc  | 27     | 1.04 ± 0.03 | 26.59 ± 0.07 | 21.83 ± 0.24 | 2.04 ± 0.04 | 162.12 ± 1.82 | 106.26 ± 1.73 | 622.19 ± 42.96 | 5860.81 ± 34.18    |
| AD    | 34     | 1.06 ± 0.02 | 23.44 ± 0.06 | 27.57 ± 0.19 | 4.19 ± 0.05 | 147.72 ± 1.18 | 128.31 ± 2.09 | 307.09 ± 10.83 | 5677.94 ± 25.33    |

Table 2: number or mean ± std for different features, separated by clinical group (bs = baseline).

3.2 Input

We used the following modalities, scored within the first three months of participation, as input for the predictive models: MRI, FDG-PET, CSF proteins, APOE genotype (number of ε4 alleles) and NP scores. Features for MRI were the volume averages of the hippocampus, the whole brain, the entorhinal cortex, the fusiform area and the midtemporal lobe, plus intracranial volume. For FDG-PET, features were the average value of the reference region (white matter, peripheral nervous system or cerebellum), the average value of frontal and associative cortices normalized to the reference region, the average value of the frontal cortex normalized to the reference region, and the number and sum of pixels with extreme values. CSF proteins consisted of Aβ40 and total-tau levels, and for neuropsychological scores we used ADAS, MMSE and CDR-box scores. All input came from the ADNI database and was processed in standardized ways as described on the ADNI website (“ADNI | Access Data,” n.d.).
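The distinction made later between concatenated and per-modality input can be sketched as follows. This is an illustrative Python/NumPy translation with made-up dimensions and random placeholder values, not the Matlab code or the real ADNI matrices used in this thesis:

```python
import numpy as np

# Hypothetical per-modality feature counts (illustrative only; the real
# feature sets are the ones described in the text above).
rng = np.random.default_rng(0)
n_subjects = 8
modalities = {
    "MRI": rng.normal(size=(n_subjects, 6)),   # six volumetric features
    "PET": rng.normal(size=(n_subjects, 5)),   # five FDG-PET features
    "CSF": rng.normal(size=(n_subjects, 2)),   # abeta and total-tau
    "APOE": rng.integers(0, 3, size=(n_subjects, 1)).astype(float),  # epsilon-4 count
    "NP": rng.normal(size=(n_subjects, 3)),    # ADAS, MMSE, CDR-box
}

# Concatenation: one flat feature matrix; the model sees no modality boundaries.
X_concat = np.hstack(list(modalities.values()))
print(X_concat.shape)  # (8, 17)
```

Keeping the per-modality blocks separate, as in the `modalities` dict, is what later allows integration (second-level modality parameters) instead of concatenation.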

3.3 Output

For the first two hypotheses, all models learned to predict ADAS and MMSE total scores at 6, 12 and 24 months after baseline, based on training output. For the last hypothesis, prediction was done for ADAS and functional activity questionnaire (FAQ) sub scores at the same time points. Performance of the different models was measured by the mean squared error (3) and the Pearson correlation between the predicted and the actual values of the test data in a 10-fold cross-validation setting (figure 1a).
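The evaluation loop can be sketched in Python. This is an illustrative NumPy version in which an ordinary least-squares model stands in for the GP and SRM models of this thesis; the function name, fold count and synthetic data are ours:

```python
import numpy as np

def kfold_r_mse(X, y, k=10, seed=0):
    """Illustrative k-fold evaluation: fit least squares on the training
    folds, score Pearson r and MSE on the held-out fold, average over folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rs, mses = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # least-squares fit with an intercept column
        A = np.column_stack([X[train], np.ones(len(train))])
        w = np.linalg.lstsq(A, y[train], rcond=None)[0]
        y_hat = np.column_stack([X[test], np.ones(len(test))]) @ w
        rs.append(np.corrcoef(y_hat, y[test])[0, 1])
        mses.append(np.mean((y_hat - y[test]) ** 2))
    return np.mean(rs), np.mean(mses)

# Synthetic check: a strongly linear target should give r near 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
r, mse = kfold_r_mse(X, y)  # r close to 1, mse close to the noise variance
```

Reporting both r and MSE per fold, as here, is what makes the later comparison between the two accuracy measures possible.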

3.4 Models and implementation

All model building and analyses were performed in Matlab 2013a. Our first comparison was between model classes implemented in GP and in SRM. Two important differences between these model classes are described in table 1. To implement the GP models, (a custom adaptation of) the GPML toolbox was used (Rasmussen & Nickisch, 2013; Rasmussen, 2006). For training a model to predict the labels of unseen data, we used a zero mean function and a linear covariance function (1):

k(x, x′) = xᵀx′, (1)

in which k(x, x′) is the covariance function and x is the input data structure; predictions are made from the posterior of the Gaussian likelihood function (Rasmussen & Nickisch, 2013). One advantage of the GPML toolbox is that it contains a minimization function to optimize hyper-parameters. For further details about this toolbox, see Rasmussen and Nickisch (2013). For the SRM models, we used the least-dirty function of the MALSAR toolbox to minimize the error between predicted and true output (2) (J. Zhou, Chen, & Ye, 2012). The least-dirty function performs better than conventional functions when input features are not necessarily equally important for different tasks. It assumes the resulting model is a sum of two components: a group-sparse component and an element-wise sparse component (Jalali, Ravikumar, Sanghavi, & Ruan, 2010). The error in SRM is minimized by

min over P, Q of ‖XW − Y‖²F + λ₁‖P‖₁,∞ + λ₂‖Q‖₁, with W = P + Q, (2)

in which W is the weight matrix, X the matrix with input data, Y the matrix with output data, λ₁ and λ₂ the regularization parameters, P the group-sparsity component and Q the element-wise component. The regularization parameters define how strongly the P and Q components each contribute to the model; they were determined with 10-fold cross-validation and grid search. Our second comparison was STL versus MTL. In STL, a separate training was done per task, whereas in MTL training was done simultaneously for multiple tasks (figure 1a). In our case only tasks from the same NP test (i.e. scores at different time points of one test) were learned simultaneously, hence we can only speak of partial MTL (pMTL). Some adaptations were made to the GPML toolbox to make it perform pMTL as well, using Marquand et al. (2014) as an example. STL could easily be implemented in MALSAR by using one output at a time. The third and last comparison was integrating versus concatenating modalities (figure 1b). For concatenation, all input features were given to the model as one big matrix, resulting in one covariance matrix in GP and one weight matrix in SRM for all features. One can think of this as determining all weights on a first level. To integrate modalities, a second level was introduced: on the first level, a covariance matrix (GP) or weights (SRM) were attributed per modality, containing all features of that modality; on the second level, modality parameters α were determined.
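The two-level integration idea can be illustrated with a toy two-modality example. Again this is an illustrative Python sketch with ordinary least squares standing in for the GP/SRM first-level models; the modality names, split sizes and coefficients are invented:

```python
import numpy as np

def fit_ols(X, y):
    # least squares with an intercept column
    A = np.column_stack([X, np.ones(len(X))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(X, w):
    return np.column_stack([X, np.ones(len(X))]) @ w

rng = np.random.default_rng(2)
n = 300
X_mri = rng.normal(size=(n, 4))   # toy "modality" 1
X_pet = rng.normal(size=(n, 3))   # toy "modality" 2
y = (X_mri @ np.array([1.0, 0.0, 2.0, 0.0])
     + X_pet @ np.array([0.0, -1.0, 1.0])
     + 0.1 * rng.normal(size=n))

train, val, test = np.arange(150), np.arange(150, 225), np.arange(225, 300)

# First level: separate weights per modality.
w_mri = fit_ols(X_mri[train], y[train])
w_pet = fit_ols(X_pet[train], y[train])

# Second level: modality parameters alpha, fitted on a validation split
# from the first-level predictions.
P_val = np.column_stack([predict_ols(X_mri[val], w_mri),
                         predict_ols(X_pet[val], w_pet)])
alpha = fit_ols(P_val, y[val])

# Integrated prediction on unseen data.
P_test = np.column_stack([predict_ols(X_mri[test], w_mri),
                          predict_ols(X_pet[test], w_pet)])
y_hat = predict_ols(P_test, alpha)
r = np.corrcoef(y_hat, y[test])[0, 1]  # high: the combination recovers both parts
```

Because each modality here carries part of the signal, neither first-level model alone predicts well, but the second-level combination does; this is the intuition behind letting α weight the modalities.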


3.5 Statistics

To properly compare the performance of different model classes, more than one performance measurement per model class had to be obtained. Because the pool of participants was limited after removing those with missing values in either the tasks or the baseline features, we used a bootstrap method to increase statistical validity (Efron & Tibshirani, 1993). In our case, after obtaining one original estimate (based on the average of a 10-fold cross-validation of the original data), we pseudo-randomly selected n participants with replacement from the total pool of N, 199 times. Pseudo, because we made sure the random selection was the same for every model class, to guarantee a proper comparison. For each of these repetitions, we initiated a 10-fold cross-validation in which parameter optimization was performed for every fold (using minimization for GP and grid search for SRM), after which the performance of the model was tested on an unseen part of the data. Performance measurements were the Pearson correlation (r) - Pearson because we could assume linearity - between the values predicted by the model and the actual values of the unseen data points, and the mean squared error (MSE), given by the following formula:

MSE = (1/n) Σₚ (Ŷₚ − Yₚ)², (3)

in which n is the number of participants in the fold, Ŷₚ is the predicted value and Yₚ the true value for participant p.

This procedure resulted in 200 MSE and r values (averaged over the 10 folds for every bootstrap repetition). MSE values were log-transformed and r values Fisher-transformed to create symmetric distributions. To test a main effect of STL versus MTL, we averaged the z-values of models using STL and of models using MTL, after which we subtracted the averaged z-values of MTL from those of STL (pairwise, since the same randomization was used for every model). If the (bootstrap-corrected; Efron & Tibshirani, 1993) confidence interval (CI) consisted entirely of either positive or negative numbers, there was a significant difference in model performance between STL and MTL. The alpha used was 0.05 divided by the number of multiple comparisons. The same procedure was used to test the main effect of SRM versus GP and of concatenating versus integrating modalities (see table 3).
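The paired comparison on Fisher-transformed correlations can be sketched as follows. This is an illustrative percentile version with fabricated numbers; the thesis itself uses the bias-corrected interval of Efron & Tibshirani (1993), and the function name and toy offset are ours:

```python
import numpy as np

def paired_ci(r_model1, r_model2, alpha=0.0083):
    """Percentile CI on the pairwise difference of Fisher-transformed
    correlations from the same (pseudo-random) bootstrap repetitions.
    A CI that excludes zero indicates a performance difference."""
    z1, z2 = np.arctanh(r_model1), np.arctanh(r_model2)
    diff = z1 - z2  # paired: entry i of both arrays uses the same resample
    lo, hi = np.quantile(diff, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy example: model 1 consistently slightly better than model 2.
rng = np.random.default_rng(3)
r2 = np.clip(rng.normal(0.80, 0.02, size=200), -0.99, 0.99)
r1 = np.clip(r2 + 0.03, -0.99, 0.99)
lo, hi = paired_ci(r1, r2)  # lo > 0: CI excludes zero
```

The pairing is the key design choice: because every model class sees the same resamples, differences between models are not diluted by resampling noise.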

| Model levels | Regularization, concatenated | Regularization, integrated | Gaussian processes, concatenated | Gaussian processes, integrated |
|--------------|------------------------------|----------------------------|----------------------------------|--------------------------------|
| STL          | Zsrc                         | Zsri                       | Zsgc                             | Zsgi                           |
| MTL          | Zmrc                         | Zmri                       | Zmgc                             | Zmgi                           |

Table 3: the 2x2x2 design used to compare models. Z values stand for the transformed MSE and r values. Main effects are tested as, e.g., for framework: (Zsrc + Zmrc + Zsri + Zmri) − (Zsgc + Zmgc + Zsgi + Zmgi), after which the bootstrap-corrected confidence interval is calculated.

To check for over- versus underestimation of the true values per group of participants, a prediction error (PE) was calculated for every repetition:

PE = Ŷ − Y, (4)

in which Ŷ is the predicted value and Y the true value. An average was first taken over the 200 repetitions per participant, after which group averages were calculated, as well as the standard error of the mean.
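The signed-error averaging can be sketched in a few lines. This is illustrative Python with fabricated numbers; the constant upward bias of 0.5 is invented purely to show how over-estimation surfaces in the group averages:

```python
import numpy as np

rng = np.random.default_rng(4)
true = np.array([20.0, 22.0, 25.0, 28.0, 30.0])      # true scores, 5 toy subjects
groups = np.array(["HC", "HC", "MCI", "AD", "AD"])

# 200 repetitions of predictions with noise and a constant upward bias of 0.5
preds = true + rng.normal(0.0, 1.0, size=(200, 5)) + 0.5

pe = preds - true                 # signed prediction error, PE = predicted - true
pe_subject = pe.mean(axis=0)      # first average over repetitions per participant

# then average per clinical group; positive means indicate over-estimation
group_means = {g: pe_subject[groups == g].mean() for g in np.unique(groups)}
```

Because PE is signed rather than squared, over- and underestimation are visible as positive and negative group means instead of cancelling into a single magnitude.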

3.6 Sub scores

Once one of the model classes was chosen as the best to predict total scores, this model class was used to learn to predict the sub scores of ADAS and FAQ. A 10-fold cross-validation was done, and r and MSE values were calculated. The r values described the linear relation between predicted and true sub scores, which could be tested for significance. Since MSE values depend on the scale of the task (e.g. some sub scores vary from 0 to 10, whereas others vary only from 1 to 3), MSE values were normalized according to formula (5):

MSEᵢᴺ = MSEᵢ / (max(Yᵢ) − min(Yᵢ))², (5)

in which MSEᵢᴺ is the normalized MSE value for task i, MSEᵢ the calculated MSE value and Yᵢ the vector of true values for task i. The resulting MSEᵢᴺ is a fraction of the maximum mean squared error possible on that scale. The lower the MSEᴺ and the higher the r values, the better the model was able to predict sub scores.
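A minimal sketch of this normalization, assuming the scale of a task is the squared range of its true scores (one reading of formula (5); an illustrative assumption, not the thesis code):

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    """MSE expressed as a fraction of the largest squared error the task's
    scale allows, taken here as the squared range of the true scores."""
    mse = np.mean((y_pred - y_true) ** 2)
    scale = (y_true.max() - y_true.min()) ** 2
    return mse / scale

y_true = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])   # a 0-10 sub score
y_pred = y_true + 1.0                                 # constant offset of 1
print(normalized_mse(y_true, y_pred))  # 1.0 / 100 = 0.01
```

This makes an error of 1 point on a 0-10 sub score (1% of the possible squared error) comparable to errors on a 1-3 sub score.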

The critical r value, the threshold above which correlations are significant, was determined according to formula (6):

r_c = t / √(t² + n − 2), (6)

in which r_c is the critical r value, t the quantile of the t-distribution with n − 2 degrees of freedom at alpha 0.05 divided by the number of multiple comparisons, and n the number of data points in the test set.

4. Results

4.1 Total scores

4.1.1 Using both correlation and mean squared error as accuracy measures

For every model, every fold in the cross-validation tried to predict NP scores based on baseline features. We calculated r and MSE values between the true and predicted scores for every fold, and averaged these accuracy measurements over the 10 folds to obtain 200 MSE and r values per model. In figure 2 we give examples of the true score versus the average +/- standard error of the mean (s.e.m.) predicted score per participant for the separate time points of ADAS (figure 2a,c) and MMSE (figure 2b,d), using MTL GP with concatenated modalities (figure 2a,b) and MTL SRM with integrated modalities (figure 2c,d). Group sizes differed per clinical group (HC: n = 101, MCIn: n = 177, MCIc: n = 27, AD: n = 34; table 2). The r values shown were averaged over the 200 bootstrap repetitions. Average r values were higher for ADAS (0.88 < r < 0.90) than for MMSE scores (0.77 < r < 0.82) and slightly increased over time. This suggests that our models can 1) predict ADAS scores more accurately than MMSE, and 2) predict scores better at later time points. However, high correlations between predicted and true scores do not guarantee high accuracy of a model, since correlation only measures the linear relationship between true and predicted values. This value can increase when data are more broadly distributed, and can hence be driven by higher variability of NP scores across patients later in time. In addition, correlation does not depend on how much the predicted scores differ from the true scores, as long as their relation is linear. Therefore MSE is a valuable complementary accuracy measure, based on the squared error (i.e. absolute difference) between true and predicted values. For more accurate model classes, r values are expected to increase while errors between true and predicted values decrease. Figure 3 shows the true scores at different time points (black line) in a separate window for every participant group (HC, MCIn, MCIc and AD from left to right) for ADAS (figure 3a) and MMSE scores (figure 3b). The scores predicted by the different models are shown in the same figures, showing whether the different models follow the pattern of changes in true scores over time. This seems to depend on the participant group as well as on the model class used.
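The point that correlation ignores additive bias while MSE does not can be seen in a two-line toy example (ours, not thesis data):

```python
import numpy as np

true = np.array([10.0, 14.0, 18.0, 22.0, 26.0])
shifted = true + 5.0          # perfectly linear but biased predictions

r = np.corrcoef(true, shifted)[0, 1]    # 1.0: perfect correlation
mse = np.mean((shifted - true) ** 2)    # 25.0: yet a large squared error
```

A model producing `shifted` would look flawless by r alone while being off by 5 points everywhere, which is exactly why both measures are reported.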

Figure 3c,d shows the prediction error (PE), to better visualize the difference between predicted and true values. For most models and participant groups, we observe an increase in PE over time, rather than the decrease expected from the higher correlation values in figure 2 (absolute mean ± std ADAS: 6 months 0.4844 ± 0.4667, 12 months 0.8708 ± 1.0254, 24 months 1.0823 ± 1.2422; MMSE: 6 months 0.1818 ± 0.1557, 12 months 0.4699 ± 0.3244, 24 months 0.7499 ± 0.6296). This increase in PE does not seem to be a consequence of the change in actual scores over time; the predictions for AD are quite accurate in most models at all time points (often less than 0.5 from the true value, figure 3c,d), while in this group a large change in scores over time is present (figure 3a,b). Having shown that both MSE and r values are important for drawing a conclusion about which models are more accurate than others, we averaged both MSE and r values over all participants. This led to distributions of one original estimate plus 199 bootstrap repetitions per model class at every time point. To make the distributions more symmetrical, a Fisher transform was applied to the r values and a log transform to the MSE values. In figure 4a,b (r values) and figure 4c,d (MSE values), the transformed accuracy-measure distributions are plotted per model. The different panels show the distributions at different time points for ADAS (figure 4a,c) and MMSE scores (figure 4b,d). Again, a general increase of both r and MSE values over time can be observed. However, we were interested in whether there is a difference in accuracy between model classes.

4.1.2 Models based on regression and integrated modalities more accurate

To test whether there were significant main effects of STL versus MTL, GP versus SRM, or concatenating versus integrating modalities, we summed the scores over the different model classes belonging to a certain group, as described in the methods section. Subsequently we subtracted scores pairwise, which we could do because we used a pseudo-random bootstrap. Figure 5a-c shows the mean +/- std Fisher-transformed r values per main effect. No significant difference between MTL and STL was found, based on a bootstrap-corrected CI (alpha = 0.0083, corrected for multiple comparisons; figure 5a,d). A bootstrap-corrected CI makes use of an original estimate (black line in figure 5a-c) to estimate a more accurate CI. For model class, SRM models produced slightly higher Fisher-transformed r values than GP models (GP − SRM gives CI: [−0.0536, −0.0098], figure 5b,d). A significant effect was also found for modality integration: models using integrated modalities produced slightly higher r values than those using concatenated modalities (integrated − concatenated gives CI: [0.0096, 0.0418], figure 5c,d).


Figure 2) True scores (x-axis) versus average +/- s.e.m. predicted scores (y-axis) per participant, averaged over 200 bootstrap repetitions. Different figures for different months (left 6, middle 12, right 24 months). Different colour/marker combinations for different groups (see legend). R gives the average Pearson correlation over 200 bootstrap repetitions. a) For ADAS when a MTL GP model class with concatenated modalities was used. b) For MMSE when a MTL GP model class with concatenated modalities was used. c) For ADAS when a MTL SRM model with integrated modalities was used. d) For MMSE when a MTL SRM model with integrated modalities was used.


Figure 3 a) sub-figures show for the different clinical groups the mean +/- s.e.m. (predicted) ADAS scores (y-label) at different time points of prediction (x-label). This is done for every model separately (legend), as well as for the true values. Predicted scores were averaged over bootstraps first, and thereafter over participants. b) same as figure a, but now for MMSE scores. c) sub-figures show for the different participant groups the mean +/- s.e.m. errors between true and predicted ADAS scores (y-label) at different time points (x-label). This is done for every model separately (legend). Errors were averaged over bootstraps first, and thereafter over participants. d) same as figure c, but now for MMSE scores.

The same statistical analyses were used for the MSE values (N.B. lower MSE values represent higher accuracy), giving similar results. There was no main effect for task learning (figure 5e,h). Regarding model framework, SRM models gave significantly lower MSE values than GP models (GP – SRM gives CI: [0.0110 0.1319], figure 5f,h). Models using modality integration resulted in lower MSE values than those using concatenated modalities (integration – concatenation gives CI: [-0.1173 -0.0223], figure 5g,h).


Even though the effects found were small, they were consistent across the two accuracy measures. Therefore we can say that SRM performs better than GP for predicting NP scores, and that integrated modalities give higher accuracies than concatenated modalities. There was no evidence for either STL or MTL resulting in better performance.

Figure 4 a,b) Box-plots of 200 Fisher-transformed Pearson correlation values produced by bootstrapping for the different models (x-axis). Different subplots for different tasks: a) ADAS and b) MMSE. c,d) Box-plots of 200 log-transformed MSE values produced by bootstrapping for the different models (x-axis). Different subplots for different tasks: c) ADAS and d) MMSE. In all figures, left plots are for 6 months, middle plots for 12 months, and right plots for 24 months. Red asterisks are outliers.


Figure 5) Horizontal black lines show original estimates (i.e. results when the original data were used instead of a bootstrap sample with replacement). Asterisks show significance. a-c) show mean +/- std Fisher-transformed r-values for all models. e-g) show mean +/- std log-transformed MSE values for all models. In a),e) models are grouped by how the different tasks were learned. In b),f) grouping was based on the model classes Gaussian processes (GP) and structural risk minimization (SRM). In c),g) on whether modality integration was used, compared to when all features were concatenated. In d),h) corrected confidence intervals (CI) are shown per main effect. These were calculated by subtracting scores in a pairwise manner between categories as shown in


Apart from these main effects, interactions might exist; different model class combinations (from the 2x2x2 design of main contrasts) can result in higher or lower accuracies than expected from the main effects alone. Therefore we performed another test to see whether one model performed better than all others. Analyses were done in the same way as described before: pairwise and with a bootstrap-corrected CI. For this analysis eight comparisons were needed (the number of models), resulting in alpha = 0.0063. Supplementary figure 1a shows for every model whether the log-transformed MSE values of another model, subtracted pairwise from this model, on average resulted in positive (bright colours), negative (dark colours) or equal (gray) values. Supplementary figure 1b shows the same for the Fisher-transformed r-values. Please note that for MSE values dark colours correspond to a more accurate model, whereas for r-values this applies to lighter colours.

STL GP and pMTL GP, both with concatenated modalities, pMTL GP with integrated modalities and pMTL SRM with concatenated modalities had higher log-transformed MSE values compared to the other models (in the same order, mean ± std difference: 0.0070 ± 0.0132, 0.0078 ± 0.0131, 0.0127 ± 0.0126 and 0.0255 ± 0.0099) and lower Fisher-transformed r-values (-0.0030 ± 0.0082, -0.0021 ± 0.0080, -0.0106 ± 0.0099 and -0.0012 ± 0.0078; lighter rows in supplementary figure 1a and darker rows in supplementary figure 1b). STL GP, STL SRM and pMTL SRM with integrated modalities, and STL SRM with concatenated modalities, on the other hand, were more accurate (mean ± std difference in log-transformed MSE values: -0.0166 ± 0.0121, -0.0142 ± 0.0124, -0.0108 ± 0.0129 and -0.0114 ± 0.0128; Fisher-transformed r-values: 0.0050 ± 0.0060, 0.0079 ± 0.0050, 0.0021 ± 0.0069, 0.0019 ± 0.0070). No evidence was found that any of the models was significantly better than all other models.

4.1.3 Computation time

Apart from accuracy, another important aspect in choosing an optimal model class is its computation time. For some model classes this was extremely long (figure 6). Running the STL SRM with integrated modalities in particular took long (73.58 ± 17.52 minutes) compared to the other models. STL GP with concatenated modalities, on the other hand, had a computation time of under a minute (0.76 ± 0.16 minutes). Taking both the significant main effects (integrated modalities and SRM give higher accuracies than concatenated modalities and GP, respectively) and the computation time into account, we chose to continue with the pMTL SRM model to answer whether using multiple modalities integrated predicts scores better than using multiple modalities concatenated, or than using only single modalities.


Figure 6) Distributions of computation time (minutes) for every model (200 runs). Highest computation time for pMTL GP with integrated modalities, lowest for STL GP with concatenated modalities.

4.1.4 Using different modalities

To test whether using all modalities, in contrast to one modality at a time, benefited the prediction of NP scores, we chose the pMTL SRM model class to predict NP scores at 6, 12 and 24 months using just one of the modalities. We used the same model class, both integrated and concatenated, to predict NP scores using all modalities, or all modalities except the NP baseline scores, to see what the biomarkers alone are capable of.

Figure 7a,b show the distributions of Fisher-transformed r-values and figure 7c,d the log-transformed MSE values for every task. Every plot contains distributions for different (combinations of) modalities. There appear to be three 'groups' of distributions: a low-performing group with average (over all tasks and months) r-values between 0.365 and 0.602 and average MSE values between 31.12 and 44.41 when single biomarkers are used (APOE, CSF, PET or MRI only); a mediocre-performing group when all biomarkers are used together (either concatenated or integrated), with average r-values between 0.6792 and 0.6872 and average MSE values between 24.43 and 25.48; and a high-performing group using NP scores only, or all biomarkers combined with NP scores either concatenated or integrated. This last group had average r-values between 0.832 and 0.845, and average MSE values between 11.01 and 12.08.


4.1.5 Integrating modalities shows low prediction error

In figure 8a,b the true NP scores and those predicted by the models using the best-performing modality combinations (i.e. those using NP baseline scores or multiple modalities combined as input) are plotted against the time point of prediction for the two clinically most relevant groups (left: MCIc, right: AD). To get a better idea of the PE per model, figure 8c,d visualizes the difference between predicted and actual scores over the different time points. Three things clearly stand out from these figures. Firstly, in general true scores could be predicted more accurately for AD than for MCIc. For ADAS, the best prediction for MCIc differed on average 1.13, 2.60 and 2.96 from the true values at 6, 12 and 24 months respectively, whereas for AD the difference for the best-performing model was lower (0.11, 0.25 and 0.12). The same pattern could be observed for MMSE (figure 8). Secondly, for AD patients, using NP scores as the only modality was just as accurate as using all modalities concatenated (the blue line overlaps with the green line). For MCIc participants, on the other hand, the model using NP only performed worst of these five (figure 8, all left windows). In addition, using NP baseline scores alongside other biomarkers in a model does not per se increase prediction accuracy compared to using only biomarkers. Thirdly, at 6 and 12 months there is not much difference in performance between concatenating and integrating modalities. At 24 months, however, model performance diverges, and integration gives a more accurate prediction than concatenation. This suggests that the attribution of weights to modalities on a second level contributes to a more accurate prediction at 24 months only.

4.1.6 Relative second-level weight

This last claim would mean that relative modality weights (i.e. the absolute weight of a modality as a percentage of the total absolute weight attributed to all modalities together) change over time. In figure 9 we can indeed observe some differences in relative modality weights over time. For example, PET scores contributed less to the prediction of ADAS scores than MRI at 6 and 12 months (PET 6.55 ± 0.19 and 4.21 ± 0.19 for 6 and 12 months respectively; MRI 13.78 ± 0.21 and 12.87 ± 0.21), but at 24 months the relative weight of PET was even higher than that of MRI (PET 10.89 ± 0.22, MRI 10.10 ± 0.23, figure 9a). Other changes in relative weights can be observed as well. However, the degree of contribution of specific modalities, reflected by their weights, does not seem to depend on time per se: for MMSE, the relative weights and their changes over time (figure 9b) differ from those for ADAS scores (figure 9a). The contribution of a modality can also depend on the number of participants from a certain clinical group. Recall from figure 8 that one modality (NP baseline scores) can be more helpful for predicting scores in one group (AD) than in another (MCIc), meaning that when there are more participants from the MCIc group, the weight of the NP modality will likely be smaller. This is confirmed by supplementary figure 2, which shows the same figure as figure 8, with the difference that scores were predicted using an equal number of participants per clinical group. More on this analysis with an equal number of participants will follow later.
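The relative second-level weights can be computed with a one-liner; a minimal sketch with hypothetical weight values (the modality order in the comment is only illustrative):

```python
import numpy as np

def relative_weights(second_level_w):
    # Absolute second-level weight of each modality, expressed as a
    # percentage of the summed absolute weights over all modalities.
    w = np.abs(np.asarray(second_level_w, dtype=float))
    return 100.0 * w / w.sum()

# Hypothetical second-level weights (e.g. MRI, PET, CSF, APOE, NP)
rel = relative_weights([0.8, -0.3, 0.5, 0.1, -0.3])  # percentages sum to 100
```

Taking absolute values before normalizing means that a modality with a strong negative weight still counts as a strong contributor.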


Figure 7 a,b) Box-plots of 200 Fisher-transformed Pearson correlation values produced by bootstrapping for the different models (i.e. using different modalities, or combinations of modalities, x-axis). a) ADAS and b) MMSE. c,d) Box-plots of 200 log-transformed MSE values produced by bootstrapping for the different models (i.e. using different modalities, or combinations of modalities, x-axis). c) ADAS and d) MMSE. In all figures, left plots are for 6 months, middle plots for 12 months, and right plots for 24 months. Red asterisks are outliers.


Figure 8 a) sub-figures show for the different clinical groups the mean +/- s.e.m. (predicted) ADAS scores (y-label) at different time points of prediction (x-label). This is done for every model separately (legend), as well as for the true values. Predicted scores were averaged over bootstraps first, and thereafter over participants. Models in this figure used different (combinations of) modalities. b) same as figure a, but now for MMSE scores. c) sub-figures show for the different participant groups the mean +/- s.e.m. errors between true and predicted ADAS scores (y-label, predicted – true) at different time points (x-label). This is done for every model separately (legend). Errors were averaged over bootstraps first, and thereafter over participants. Models in this figure used different (combinations of) modalities. d) same as figure c, but now for MMSE scores.


Figure 9) Shows, for the SRM MTL model that used all modalities integrated, the mean +/- s.e.m. relative second-level weights (y-label) (i.e. the percentage of total weight attribution, specifying how much a modality contributes to the prediction) found for the different time points (x-label). a) for ADAS, and b) for MMSE.

4.1.7 Using all modalities integrated gives best prediction

At first sight, the model using all modalities integrated seems more accurate than the models using all modalities concatenated or using fewer modalities. To test this, for every model we again subtracted the accuracy scores of every other model in a pairwise manner. Figure 10 gives an idea of the relative performance of models compared to other models (figure 10a shows differences in log-transformed MSE values, figure 10b differences in Fisher-transformed r-values). Here the three performance groups are clearly visible as blocks of different shades of gray: using single biomarkers to predict NP scores results in relatively high MSE values and low r-values. Of these biomarkers, MRI seems to have the highest accuracy. The mediocre-performing group using all biomarker modalities (no NP baseline scores) is at the bottom of these figures. The best-performing group, comprising NP baseline scores only and all modalities concatenated or integrated, shows lower MSE values compared to the other groups (dark gray to black colours in their rows in figure 10a) and higher r-values (light gray to white colours in their rows in figure 10b). Of these three high-performing models, the one with all modalities integrated had the lowest MSE values. We were able to confirm this result with a statistical test: using all modalities integrated (int MRI PET CSF APOE NP) gave significantly lower MSE values than all other models (CIs of other model values subtracted from this model: [-1.1119 -0.7151], [-1.2778 -0.9131], [-1.2513 -0.9130], [-1.3470 -0.9529], [-0.0934 -0.0186], [-0.0770 -0.0046], [-0.8630 -0.6402], [-0.8308 -0.6402]; corrected alpha = 0.0056). No model gave significantly higher correlation values than all other models.

4.1.8 Consistency of results

To exclude the possibility that our results were partly dependent on an unbalanced training data set, the same analysis was performed with an under-sampling of participants for some clinical groups, such that the number of participants in each group was equal (n = 27). For the main comparisons, this led to the finding that using integrated modalities in a model consistently resulted in higher accuracies than using concatenated modalities (correlation: CI int – cat: [0.0581 0.0635], supplementary figure 3c,d; MSE: CI int – cat: [-0.1789 -0.0654], supplementary figure 3g,h). The result that SRM is better than GP only remained significant in the correlations (CI GP – SRM: [-0.0829 -0.0410], supplementary figure 3b,d), but not in the MSE values (CI GP – SRM: [-0.0048 0.1651], supplementary figure 3f,h). A new finding in the correlations was that MTL was significantly more accurate than STL (CI MTL – STL: [0.0049 0.0279], supplementary figure 3a,d), which again was not observable in the MSE values (CI MTL – STL: [-0.0260 0.0129], supplementary figure 3g,h). Even though the MSE values did not support a significantly higher accuracy for SRM and integration, based on all results so far the best choice of model would be one using SRM with integrated modalities that learns to predict the different time points of one task simultaneously.
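The under-sampling step can be sketched as follows; the group labels and sizes below are hypothetical, with n = 27 per group as in the analysis:

```python
import numpy as np

def undersample_equal(labels, n_per_group, seed=0):
    # Randomly select the same number of participants from every
    # clinical group, under-sampling the larger groups.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx = []
    for g in np.unique(labels):
        members = np.flatnonzero(labels == g)
        idx.extend(rng.choice(members, size=n_per_group, replace=False))
    return np.sort(np.array(idx))

# Hypothetical group sizes; MCIc is already the smallest group
groups = np.array(["HC"] * 40 + ["MCIn"] * 35 + ["MCIc"] * 27 + ["AD"] * 30)
sel = undersample_equal(groups, n_per_group=27)  # 4 x 27 = 108 indices
```

Fixing the random seed keeps the under-sampled subset identical across the bootstrap repetitions, so any accuracy differences reflect the models rather than the sampling.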

Supplementary figure 4 shows for models using different (combinations) of modalities their

prediction at different time points, as well as the true values (black lines). Different sub plots of

supplementary figure 4a show predicted and true ADAS scores each for another clinical group, all

using n = 27 participants (under-sampled for most groups). Supplementary figure 4b shows the same for MMSE-scores. Consistent with the findings in the non under-sampled results, using all modalities integrated in a model gave significantly lower MSE values than all other (combinations) of modalities in a model.

4.2 Predicting sub scores

4.2.1 Some sub scores can be predicted

NP scores like ADAS and MMSE give a general overview of the cognitive functioning of a patient. They are sums or averages of scores attributed to different domains within cognitive functioning, like word recall, orientation, and so on. Sometimes, rather than a general overview, a more specific description of the future abilities of a patient is desired, for example to develop a more personalized treatment. In addition, it might be interesting to focus not only on cognitive abilities, but also on the difficulties patients experience in daily activities. To quantify the latter, the FAQ was developed as part of the AD test battery. Here we used the MTL SRM model class with integrated modalities to predict sub scores of both the ADAS (for cognitive functioning) and the FAQ (for daily activities). A list of sub scores with explanations can be found in supplementary table 1. Bars in figure 11a and b show the mean correlation between true and predicted values for ADAS and FAQ sub scores, respectively. The mean was calculated from the r-values obtained from every fold of the 10-fold cross-validation, represented by crosses in both figures. A critical r-value of rc = 0.5652 represents the threshold above which a correlation is significant, given an α of 0.05 divided by the number of comparisons and degrees of freedom n - 2. Some sub scores could be predicted well by the model, having correlation values above the critical r-value for every fold (Q1 and Q4, both word recall tasks), whereas other sub scores had correlation values that did not always reach the critical r-value (e.g. Q13, the number cancellation task, and most FAQ scores). Figure 11c and d show the mean and individual MSEN values for ADAS and FAQ respectively. MSEN values represent the observed MSE as a fraction of the maximum possible MSE, which is determined by the range of the scoring scale of the task. For ADAS sub scores (figure 11c) MSEN values were generally lower than for FAQ sub scores (figure 11d). Nevertheless, the mean squared deviation was never above 15% of the maximum possible deviation for either FAQ or ADAS.
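The critical r-value follows from the t distribution with df = n - 2; a minimal sketch using scipy, assuming a two-tailed test (the thesis does not state one- vs two-tailed, and the fold size used in the example is hypothetical):

```python
import numpy as np
from scipy import stats

def critical_r(n, alpha, n_comparisons=1):
    # Smallest |r| that is significant at the Bonferroni-corrected
    # level alpha / n_comparisons, with df = n - 2 (two-tailed, assumed).
    a = alpha / n_comparisons
    df = n - 2
    t_crit = stats.t.ppf(1 - a / 2, df)
    return t_crit / np.sqrt(t_crit ** 2 + df)

rc = critical_r(n=12, alpha=0.05)  # ~0.576 for df = 10
```

Dividing alpha by the number of comparisons raises the threshold, which is why rc here is well above the uncorrected value for the same df.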

Figure 10 a) shows the average of the log-transformed MSE values of model 'b' (x-axis) subtracted from model 'a' (y-axis). Models 'a' that show relatively light gray to white colours in their entire row and on average (last column) have higher MSE values, corresponding to a less accurate model. Models 'a' with relatively dark gray to black colours in their rows and on average (last column) have lower MSE values, corresponding to more accurate models. b) shows the average Fisher-transformed Pearson correlation of model 'b' (x-axis) subtracted from model 'a' (y-axis). Models 'a' that show relatively dark gray to black colours in their entire row and on average (last column) have lower r-values, corresponding to a less accurate model. Models 'a' with more light gray to white colours in their rows and on average (last column) have higher correlation values, corresponding to more accurate models.


Figure 11a,b) show the mean and individual (per fold) Pearson r-values (y-axis) for a) ADAS sub scores and b) FAQ sub scores at different time points (x-axis). Coloured bars represent different sub scores (see legend). The dashed line is the critical r-value (the r-value that would give a significant correlation based on α < 0.05 divided by the number of multiple comparisons). c,d) show the mean and individual (per fold) MSEN values (y-axis) for c) ADAS sub scores and d) FAQ sub scores at different time points (x-axis). MSEN values represent the observed MSE values as a fraction of the maximum possible MSE values, based on the scale of the sub score. Coloured bars and text written above the bars: see a).

4.2.2 Neural correlates

What could be the reason that some sub scores are predicted better than others? One explanation could be the relation between the features used as input and the sub scores we tried to predict. We therefore investigated this relation by calculating the correlation between the different sub scores and the input features. Results are shown in figure 12a (ADAS) and figure 12b (FAQ) for the different time points. Correlations between features and sub scores vary from -1 (blue) to 1 (red). The closer the correlation is to 0 (green), the weaker the relation between the feature and the sub score, and the less likely it is that this feature can contribute to a correct prediction of the sub score. In line with this, sub scores that showed more r-values above the critical r-value in figure 11 show a higher absolute average (last column of each window) in figure 12 than other sub scores. Features that had the highest correlations with sub scores, and thus most likely contributed most to correct prediction, were hippocampus, entorhinal and mid-temporal lobe volumes, some PET features and the baseline cognitive scores.
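Such a feature–sub score correlation matrix can be computed in a few lines; a sketch on hypothetical random data (the shapes and the link between features and sub scores are illustrative only):

```python
import numpy as np

def feature_subscore_correlations(X, Y):
    # Pearson correlation of every input feature (column of X) with
    # every sub score (column of Y); rows = sub scores, columns =
    # features, matching the layout of the correlation figures.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return (Yz.T @ Xz) / X.shape[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))                   # hypothetical features
Y = X[:, :2] + 0.5 * rng.normal(size=(50, 2))  # hypothetical sub scores
R = feature_subscore_correlations(X, Y)        # entries lie in [-1, 1]
```

Z-scoring both matrices first reduces the correlation to a single matrix product, which scales well to the full feature set.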


Figure 12) These figures show correlations between features (x-axis) and sub scores (y-axis) for different time points (left: 6 months, middle: 12 months, right: 24 months). Colours go from blue (r = -1) through green (r = 0) to red (r = 1), reflecting the scale from full negative correlation, through no correlation, to full positive correlation. a) for ADAS sub scores and b) for FAQ sub scores.

5. Discussion

5.1 The three different comparisons

In this study we first tried to find the model class that could most accurately predict NP scores not only in the near future, but also further ahead. This is important for correctly diagnosing patients, for both clinical and research purposes. Three comparisons were tested. The first was whether GP or SRM resulted in more accurate predictions. Using two complementary accuracy measures, averaged over two different NP scores, we found that models using SRM performed the tasks of predicting NP scores at three different time points slightly better than GP. This is in contrast with the earlier finding by Wang et al. (2010), who observed better performance for their model implemented in the Bayesian framework than for the one implemented in the SRM framework. However, their paper lacked a statistical test to prove this difference. The second comparison was between MTL and STL. In MTL, training occurs for multiple tasks simultaneously, thereby making use of shared information between the tasks, whereas in STL every task is learned separately. In the main analysis, in which all participants without missing data were used, no main effect in either of the two accuracy measures was found for STL or MTL being better. On the other hand, when we under-sampled the data such that we had an equal number of participants in all four clinical groups, MTL models gave significantly higher correlations than STL, more consistent with expectations from the literature (Argyriou et al., 2007; Evgeniou et al., 2005; Evgeniou & Pontil, 2004). Nevertheless, this effect was minimal, and absent in the MSE values. The last comparison was between integrating and concatenating modalities. Both the main analysis and the analysis with equal samples showed that integrating modalities is better than concatenating them, as it gave significantly higher correlations and significantly lower MSE values. The effects were generally small, but slightly larger when equal samples were used. These main effects, though small, showed that model classes using SRM are more accurate in predicting NP scores over a range of two years than model classes in a GP framework. In addition, modalities should be integrated rather than concatenated, to use the input information most efficiently. Whether STL or MTL model classes reach higher accuracies could not be resolved here. However, when fulfilling the requirements of the other main effects (using an SRM model class with integrated modalities), MTL was more favourable due to its shorter computation time. Therefore, we continued our analyses with this preferred model class (SRM, MTL and integrated modalities).

5.2 Integrated multimodality input better than concatenated multimodality or unimodality input

The next research question we investigated was whether the use of single modalities, instead of combinations of modalities, could achieve an equally accurate prediction. When only a single modality was used in a model, the NP baseline scores outperformed the other modalities. However, models that used all modalities in an integrated way gave significantly higher accuracies for predicting NP scores than all other (combinations of) modalities, again evidence that integration of modalities is more efficient than concatenation. We investigated where this difference came from and found that, whereas at 6 and 12 months concatenation and integration led to equal accuracies, at 24 months integration outperformed concatenation. Unfortunately we could not test whether this difference was significant per time point, due to a lack of statistical power.

5.3 Sub scores can be predicted

For the third hypothesis, the answer is a bit more complicated, since only a few of the sub scores could be predicted by our model. We suggested that a possible explanation lies in the nature of the input features. Some of the sub tasks could be linked to more input features than others, making accurate prediction more likely. We were able to show this: the average correlations between feature and sub score values were higher for those sub scores that could be predicted more accurately. Support for this idea can be found in the literature. Word recall tasks (such as Q1 and Q4) have been shown to correlate significantly with gray matter density in the left entorhinal cortex and left hippocampus (Schmidt-Wilcke, Poljansky, Hierlmeier, Hausner, & Ibach, 2009). We observed a strong correlation between features representing these areas and the word recall tasks as well. For our final hypothesis, we therefore conclude that our model can predict sub scores, under the condition that representative brain areas are used as input features. Unfortunately it is not always known which brain areas are representative for a given task. A broad input matrix and automatic task-specific feature selection might be a good solution.


5.4 Clinical application and future directions

ADNI was brought to life with specific goals in mind that emphasize the clinical relevance of early diagnosis and accurate prediction of the course of the disease. However, it has still been hard to directly apply findings from the machine learning field in clinical settings, because often only single aspects of this clinical application are taken into account. Here we compared model classes on their performance to point out which one should be used when it comes to real-life questions regarding long-term prediction in an AD framework. In addition, we used the most accurate model class to see whether we can predict more specific domains of cognitive and functional decline. We think this is relevant on a single-patient level, to make treatment more applicable to the patient's specific needs. Although 100% accuracy has not yet been achieved, and may never be achieved, we believe it is important to find a model as accurate as possible, using all the available sources we have, from multi-dimensional brain images to NP scores. Below we describe a few possible directions to achieve this.

5.4.1 Biases for feature weights

One of the unexpected findings in this study was the difference in prediction accuracy between MCIc and AD patients. A likely cause for this difference lies in the informative content of a feature with respect to the clinical group. In other words, one feature can contain a lot of information for certain clinical groups, but none at all for others. The weights or covariance functions attributed to a certain feature were equal for all participants in all our models. However, the difference we found between MCIc and AD tells us that a given feature (here, for example, the NP baseline scores) should not contribute equally to the prediction for all participants. How do we determine for which participants which feature should contribute more or less, if diagnostic labels are still unknown (as they normally are when a patient comes in and a prediction over a period of two years has to be made)? One possibility is to use the knowledge we have about how biomarkers and clinical scores become more and more visible in AD pathology. We could, for example, use the model of dynamic biomarkers described by Jack Jr et al. (2010) as prior information, and adapt the weights given to certain features accordingly. What this bias for weights should look like in a model, and how it should be determined for a given participant without the use of diagnostic labels, is an interesting question for further research.

5.4.2 Feature selection and imputation for regression

In order to reduce variability other than that arising from the models themselves, we made two decisions. Firstly, we removed all participants with missing values from our dataset. Unfortunately, by doing so, a lot of data useful for improving a model is thrown away, thereby decreasing the number of participants tremendously. Secondly, we used the same set of features for every model and every task, without implementing a feature selection method. The latter, however, could cause a reduction in accuracy because a model is forced to use futile features as well. In a future model, it would be good to increase performance by finding a way around these two problems. A guideline for this might be the method used by Thung, Wee, Yap, and Shen (2014), who implemented a feature selection and imputation approach in a classification framework. A similar approach could be implemented for regression to predict both average and sub scores on a continuous scale.


5.5 Main conclusions

Which model class can best predict cognitive decline over a time period of two years in single subjects based on baseline data? We found that this would be a model that uses SRM. Considering computation time, it is best trained in a MTL framework. In addition, it should use multiple modalities in an integrated way. Since we also found that accuracy depends on the clinical group a subject belongs to, it is important to improve this model further to achieve a high accuracy for subjects from every clinical group.

6. Abbreviations

Aβ Amyloid-β

AD Alzheimer's Disease (patients)

ADAS Alzheimer's disease assessment scale

APOE Apolipoprotein-E

CDR-box Clinical dementia rating for different boxes/domains

CI Confidence interval

CSF Cerebrospinal fluid

FAQ Functional activity questionnaire

GP Gaussian processes

HC Healthy control

MCI, -n, -c Mild cognitive impaired, non-converting, converting

M3T Multimodal multitask learning

MMSE Mini mental state exam

MRI Magnetic resonance imaging

MSE Mean squared error

MTL Multitask learning

NP Neuropsychological (scores)

PE Prediction error

PET Positron emission tomography

pMTL Partial multitask learning

pMTL GP cat Gaussian processes model class using partial multitask learning and concatenated modalities

pMTL GP int Gaussian processes model class using partial multitask learning and integrated modalities

pMTL SRM cat Structural risk minimization model class using partial multitask learning and concatenated modalities

pMTL SRM int Structural risk minimization model class using partial multitask learning and integrated modalities

r Pearson correlation

s.e.m. Standard error of the mean

SRM Structural risk minimization

std Standard deviation

STL Singletask learning

STL GP cat Gaussian processes model class using singletask learning and concatenated modalities

STL GP int Gaussian processes model class using singletask learning and integrated modalities

STL SRM cat Structural risk minimization model class using singletask learning and concatenated modalities

References
