How to integrate multimodal information to improve classification performance? Comparison of multimodal approaches to distinguish Alzheimer patients from healthy controls based on high-dimensional neuroimaging data.


Master’s Thesis Methodology and Statistics
Methodology and Statistics Unit, Institute of Psychology, Faculty of Social and Behavioral Sciences, Leiden University
Date: 17th of August 2020

Student number: 1239031

Supervisor: Dr. Tom F. Wilderjans

How to integrate multimodal information to improve classification performance?

Comparison of multimodal approaches to distinguish Alzheimer patients from healthy controls based on high-dimensional neuroimaging data.

Master’s thesis

Acknowledgement

The past months have been a meaningful and interesting time that has led to the completion of this master’s thesis. This would not have been possible without the people surrounding me.

First, I would like to thank my supervisor Tom Wilderjans. Tom was also my supervisor during my bachelor’s thesis, and I am grateful for all the advice and learning opportunities that he has given me. Over the years we have gotten to know each other, and I enjoyed our conversations about statistics and many other topics. I would also like to thank my boyfriend, Tim van Beelen, who has cooked, cleaned, and endlessly listened to related and less-related statistical topics that I needed to share. Your support has been critical for the completion of this thesis whilst I was writing it after work.

The contents of this thesis are a result of a collaborative environment in which I was stimulated to thrive. Thanks to Leiden University.

Abstract

An often-encountered challenge in neuroscience is the reliable prediction of a subject’s disease status based on brain data, for example, to distinguish Alzheimer’s disease (AD) patients from healthy control (HC) subjects. In this regard, functional Magnetic Resonance Imaging (fMRI) has been applied successfully to study the functional and structural differences between the brains of AD patients and HCs. However, when using fMRI data for AD classification, the complex multivariate nature of these data confronts researchers with two main statistical challenges that need to be addressed appropriately to improve AD classification performance: high dimensionality (i.e., the data containing many variables/voxels) and multimodality (i.e., the data consisting of different brain modalities, like structural and functional brain information). In this study, both statistical challenges are tackled. To address high dimensionality, dimension reduction is proposed. Regarding the multimodality challenge, a ‘concatenated strategy’, which consists of a simultaneous dimension reduction of the information present in all the modalities involved, is compared to a ‘separate strategy’, which reduces the data for each modality separately. Combined with these strategies, three common feature extraction methods are compared in terms of classification performance: (1) Principal Component Analysis (PCA), (2) Partial Least Squares Regression (PLS-R) and (3) Regularized Generalized Canonical Correlation Analysis (RGCCA). Testing these methods and strategies on multimodal data consisting of three different neuroimaging properties that are related to AD, it is found that the best classification accuracies are obtained with PLS-R (compared to PCA and RGCCA) and the concatenated strategy (compared to the separate strategy). PLS-R combined with the concatenated strategy, however, is outperformed by a whole-brain analysis applied to all data modalities simultaneously, and performs at the same level as a whole-brain analysis applied to one of the brain modalities used (i.e., the structural one).

Table of contents

Section 1. Introduction
1.1 Classification problems in neuroscience
1.2 Statistical challenges
1.3 Dimension reduction as a solution: concatenated versus separate strategy
1.3.1 Feature selection
1.3.2 Feature extraction
1.3.2.1 Principal Component Analysis (PCA)
1.3.2.2 Partial Least Squares Regression (PLS-R)
1.3.2.3 Canonical Correlation Analysis (CCA) and its extensions
1.3.3 Addressing multimodality: concatenated and separate strategy
1.4 Goal of the study, research questions and hypotheses

Section 2. Methods
2.1 Data
2.2 Six-step procedure
2.3 Validation approach to create a training and a test set (Step 1)
2.4 Multimodal feature extraction approaches (Step 2)
2.4.1 PCA
2.4.2 PLS-R
2.4.3 Regularized Generalized CCA (RGCCA)
2.4.4 Strategies for dealing with multimodality
2.4.5 Number of components S
2.5 Support Vector Machine (SVM) classifier (Step 3)
2.6 Deriving component scores for the test set (Step 4)
2.6.1 PCA
2.6.2 PLS-R
2.6.3 RGCCA
2.7 Computing classification accuracy (Step 6)

Section 3. Results
3.1 Classification accuracy of the multimodal feature extraction approaches as a function of the number of components S
3.1.1 Concatenated versus separate strategy
3.1.2 Comparison of classification accuracy between feature extraction methods
3.2 Multimodal feature extraction versus whole-brain analysis

Section 4. Discussion
4.1 Summary and discussion of the main results
4.1.1 Which is the best multimodal feature extraction approach?
4.1.1.1 What strategy yields better classification accuracies?
4.1.1.2 Comparison among the feature extraction methods
4.1.2 Multimodal feature extraction versus whole-brain analysis
4.2 Limitations of this study
4.2.1 Limited type and number of modalities
4.2.2 Restricted to linear feature extraction methods
4.2.3 Limited sample size
4.3 Suggestions for further research
4.3.1 Searching for the combination of modalities that is optimal for classification
4.3.3 Investigating the added value of non-linear feature extraction techniques
4.3.4 Using larger sample sizes

References

Table 5: Overview of Tables

Table 1. Description of the modality type, voxel size and number of variables for the data modalities used in this study
Table 2. Overview of the five multimodal feature extraction approaches considered in this study
Table 3. Mean AUC value for each feature extraction approach (rows), strategy for multimodality (columns) and combination thereof (cells), averaged across all values of 𝑆 and random splits
Table 4. Mean AUC value for whole-brain analysis, averaged across all random splits, when performed on all modalities simultaneously (i.e., multimodal) and on each modality separately
Table 5. Overview of Tables
Table 6. Overview of Figures

Table 6: Overview of Figures

Figure 1. Concatenated strategy (left) versus separate strategy (right)
Figure 2. SVM hyperplanes
Figure 3. Mean AUC value plotted against the number of components 𝑆 for the five studied multimodal feature extraction approaches

Section 1. Introduction

1.1 Classification problems in neuroscience

An often-encountered problem in neuroscience is the reliable prediction of a subject’s disease status (i.e., sick or not) based on brain data, and this for several brain disorders, such as Alzheimer’s disease (AD). AD is an irreversible, progressive brain disorder that slowly destroys memory and thinking skills and, eventually, the ability to carry out the simplest tasks. Not only is AD currently ranked among the top 10 leading causes of death in the United States, it is also the most common form of dementia in elderly people worldwide. The number of affected people is expected to double in the next 20 years, and 1 in 85 people is expected to be affected by 2050 (Ron et al., 2007). Identifying the disease at an earlier stage may be key to developing effective treatments for AD (Schouten et al., 2016). Crucial in this regard is the (early) classification of AD.

An important neuroimaging tool for the classification of AD (and other diseases) is functional Magnetic Resonance Imaging (fMRI). fMRI is particularly suitable for AD classification because of its sensitivity to structural and functional changes in the brain that are caused by AD (Schouten et al., 2016). fMRI has been applied successfully to study the functional and structural differences between the brains of AD patients and healthy controls. In particular, researchers have used fMRI images to discover differences between patients and healthy controls in functional connectivity (Gour et al., 2014; Binnewijzend et al., 2012) and in specific brain tissues, such as voxel-based grey matter (Ferreira et al., 2011), white matter (Li et al., 2012) and diffusion measures (Douaud et al., 2011).

When using fMRI data for AD classification, the complex multivariate nature of these data confronts researchers with two main statistical challenges that need to be appropriately addressed to improve AD classification performance. First, the number of variables (i.e., voxels) commonly used is much larger than the number of observations (i.e., time points or subjects), which is referred to as the curse of dimensionality (James, Witten, Hastie, & Tibshirani, 2015). A second challenge is the multimodality of the data, which implies that the data consist of multiple data sources (signals) that possibly provide complementary information about brain functioning. Only recently has research acknowledged that combining different modalities for AD classification can yield better results than resorting to the information from a single modality only.

The goal of this study is to tackle both statistical challenges. To address the first challenge, dimension reduction is proposed. Regarding the second challenge, we will compare a ‘concatenated strategy’, which consists of a simultaneous dimension reduction of the information present in all the modalities involved, to a ‘separate strategy’, which reduces the data for each modality separately. In the following subsections of Section 1, first, the two statistical challenges are outlined in more detail. Next, dimension reduction is presented as a way to tackle these two challenges and both the concatenated and the separate strategy are discussed. Section 1 ends with a presentation of the study goals, research questions and hypotheses.

1.2 Statistical challenges

To improve the current classification performance of AD using fMRI data, two statistical challenges need to be addressed properly. A first challenge is that in neuroimaging research the number of variables is often out of proportion compared to the number of observations. For example, fMRI data of a single participant consist of activation measures for a very large number of voxels (i.e., easily 200,000 or more) at a very limited number of time points (i.e., a few hundred). As a second example, in fMRI classification studies that try to discriminate healthy subjects from diseased patients, measurements of a small number of participants (i.e., 200 is already quite large) are taken for many voxels. This phenomenon of having many variables compared to the number of observations is also referred to as the small-n-large-p problem or the curse of dimensionality (James, Witten, Hastie, & Tibshirani, 2015), which implies that a lot of information is present for a small number of cases under study. When training a classifier in this situation, there is a substantial risk of overfitting, as the number of variables overwhelms the number of cases. This implies that the trained model is tailored too much to the oddities and random noise in the sample data, rather than reflecting the data characteristics that generalize to the population.
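
The overfitting risk can be illustrated with a small simulation (our own sketch, not data from this study): a linear SVM trained on pure noise with far more variables than observations fits the training labels almost perfectly, yet classifies new noise at chance level.

```python
# Illustration (not from the thesis): with many more variables than
# observations, a linear SVM can separate pure noise on the training
# set while performing at chance level on unseen data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 40, 5000                       # small n, large p
X_train = rng.normal(size=(n, p))     # pure noise "voxels"
y_train = rng.integers(0, 2, size=n)  # random class labels
X_test = rng.normal(size=(n, p))
y_test = rng.integers(0, 2, size=n)

clf = LinearSVC(max_iter=10000).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)  # close to 1.0: fits the noise
test_acc = clf.score(X_test, y_test)     # close to 0.5: chance level
print(train_acc, test_acc)
```

The gap between training and test accuracy is exactly the generalization failure described above.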

A commonly used solution to counter this challenge is dimension reduction. Indeed, complex multivariate high-dimensional data can be better understood by studying low-dimensional projections of these high-dimensional data (Härdle et al., 2007). As such, feature extraction (or selection) prior to fitting a classifier can be beneficial for classification performance. For example, Mwangi, Tian, and Soares (2014) show that using a feature extraction method (see further) before building a Support Vector Machine (SVM) classifier based on fMRI data can generally increase the classification accuracy for separating healthy controls from AD patients, in comparison to whole-brain analysis, in which SVM is applied to the original features/variables and no feature extraction is performed.
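
As a sketch of this comparison (on simulated data with assumed dimensions, not the study's fMRI data), a PCA-plus-SVM pipeline can be set against an SVM on the raw features:

```python
# Sketch (assumed setup, not the thesis code): feature extraction by PCA
# before an SVM, compared with a "whole-brain" SVM on the raw features.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# 100 subjects, 2000 "voxels", of which only a few carry signal
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=1)

whole_brain = SVC(kernel="linear")               # no reduction
extracted = make_pipeline(PCA(n_components=10),  # reduce first
                          SVC(kernel="linear"))

acc_whole = cross_val_score(whole_brain, X, y, cv=5).mean()
acc_extracted = cross_val_score(extracted, X, y, cv=5).mean()
print(acc_whole, acc_extracted)
```

Whether the reduced pipeline wins depends on the data; the point is the structure: extraction happens inside the pipeline, so it is refit on each training fold.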

A second important statistical challenge for AD classification based on neuroimaging data is prompted by the fact that information about different aspects of brain functioning can be collected, like, for example, features capturing functional and structural brain information. As such, typical brain data consist of several data sets, with each data set containing information about a different aspect or modality of brain functioning. This situation is known as multimodality. Multimodality refers to multiple data sources providing (possibly) complementary information about the topic under study (i.e., brain functioning). Many statistical methods, however, are not able to account for multimodality in a straightforward way. In this situation, researchers often adopt a unimodal approach, which implies the use of only a single neuroimaging measure or signal in the analysis (Tulay, Metin, Tarhan, & Arıkan, 2018). In recent years, however, it has become increasingly attractive in neuroimaging research to integrate multiple, possibly complementary, modalities in the analysis in order to improve the spatiotemporal resolution of the signal in a way that cannot be achieved by analyzing any modality individually (He & Liu, 2008). As different modalities may reflect distinct but possibly closely related aspects of underlying brain functioning, combining multiple modalities allows us to study the data from more than one perspective and to draw a more complete picture of brain functioning, which may enhance AD classification.

Until recently, it was not at all clear whether accounting for multimodality substantially improves classification accuracy, nor which modality or combination of modalities provides the best classification performance for separating AD patients from healthy controls. Some studies have suggested that the classification of AD may be further improved by combining multiple fMRI modalities (Mesrob et al., 2012; Sui et al., 2012), while another study found better classification results for a unimodal analysis strategy than for a multimodal one (Dyrba et al., 2015). Schouten et al. (2016) showed that combining measures from multiple modalities in a stepwise manner can improve classification performance significantly in comparison to a unimodal approach. However, it remains unclear which combination of modalities improves the classification of AD the most. More findings from similar research could clarify more precisely which measures yield the most powerful combination for classifying AD patients.

1.3 Dimension reduction as a solution: concatenated versus separate strategy

A dimension reduction technique can be applied to overcome the curse of dimensionality and at the same time to address the issue of multimodality. Performing dimension reduction prior to using a classifier, like, for example, the Support Vector Machine (SVM) classifier, can be done in two ways: by feature selection (Section 1.3.1) or by feature extraction (Section 1.3.2). Moreover, when having data with multiple modalities, feature selection/extraction can be performed using a concatenated or a separate strategy (Section 1.3.3).

1.3.1 Feature selection

Feature selection consists of selecting from the data a subset of relevant features/variables that can be used as input for training a classifier. Feature selection is assumed to be beneficial in neuroimaging research, as it may reduce the amount of noise that masks existing differences in brain functioning between subject groups, like healthy controls and AD patients. One assumption behind the use of a feature selection technique is that the data contain some features that are irrelevant for the classification, and possibly even detrimental for classification performance, and thus can be removed without a large loss of relevant information (Bermingham et al., 2015). Research, however, has demonstrated that applying a feature selection technique prior to classification does not always improve classification performance compared to whole-brain analysis. Chu et al. (2012) made clear that, when using an SVM classifier, feature selection methods do not necessarily improve classification accuracy. Indeed, classification performance was only ameliorated when a priori information was used to determine and select the important features. Furthermore, the authors showed the importance of the sample size used for the classification: when the sample size was large enough, a feature selection technique did not enhance classification accuracy in comparison to whole-brain analysis.

A disadvantage of feature selection is that a lot of information from the original data is discarded during the selection process, which risks losing vital information and may affect classification accuracy negatively. Another important drawback is that feature selection does not account for the relations (i.e., interactions) between features. Indeed, feature selection is typically a univariate technique (i.e., each feature is evaluated on its own), and when it is applied to neuroimaging data, a lot of multivariate information, in terms of interactions between features (possibly from multiple modalities), that could be relevant for discriminating patient groups is discarded.
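
A minimal illustration of this univariate character (simulated data with assumed dimensions): each feature is scored with an F-test on its own, after which only the top-k are kept, so interactions between voxels are ignored by construction.

```python
# Sketch of univariate feature selection (illustrative, not the thesis
# code): every voxel is scored separately with an F-test and only the
# k best-scoring voxels are retained.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 1000))   # 60 subjects, 1000 voxels
y = rng.integers(0, 2, size=60)   # AD (1) vs HC (0), random here

selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)            # (60, 50)
```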

1.3.2 Feature extraction

An alternative technique for dimension reduction is feature extraction. In feature extraction, new features are derived from the existing features/variables by, for example, computing a weighted sum of the original features. As such, original (correlated) features that more or less measure the same concept are combined into a new feature. Unlike with feature selection, only limited information is lost during feature extraction. This, however, depends on the number of new features that are extracted from the original features. For example, a single new feature will never explain all the variance in the original features (and will imply a substantial loss of information), but it will explain more variance than each original feature on its own.

An important advantage of feature extraction is that it is multivariate: it accounts for the relations between all the features, as the new features are combinations of the original ones (i.e., the new, more or less uncorrelated features, each consisting of a combination of correlated original features, represent distinct aspects of the original data). A disadvantage of feature extraction is that the new variables are sometimes difficult to interpret. Indeed, the meaning of a weighted sum of different variables that measure more or less similar concepts is not always straightforward to determine.

In this study, we will compare several feature extraction methods prior to applying a machine learning classifier. When using a classifier in neuroimaging research, the assumption is made that discriminative information is hidden in the associations between neuroimaging variables (e.g., activation levels of brain areas; Linden, 2012). Several machine learning classifiers have proven suitable for discriminating between AD patients and healthy controls. For example, Yang et al. (2011) used an SVM as a classifier to successfully distinguish between healthy controls, mild cognitive impairment, and AD patients based on structural MRI data containing activation levels from different brain areas. Furthermore, reliable individual classifications have been obtained using a similar machine learning algorithm on MRI measures of grey matter atrophy (Klöppel et al., 2008; Plant et al., 2010; Cuingnet et al., 2011), white matter integrity (Nir et al., 2014) and brain activity (Koch et al., 2012). In this study, we will use and compare three feature extraction methods: (1) Principal Component Analysis (PCA), (2) Partial Least Squares Regression (PLS-R) and (3) Canonical Correlation Analysis (CCA). Below, these three feature extraction methods are introduced and their advantages and disadvantages are discussed (a more technical description of these methods is given in Section 2.4).

1.3.2.1 Principal Component Analysis (PCA)

A well-known and often used method for feature extraction is Principal Component Analysis (PCA). In PCA, a small number of new uncorrelated features, known as components, are constructed that maximally explain the variance of the original features in the given data. PCA has been used frequently in neuroimaging studies and has yielded positive results. For example, Koutsouleris et al. (2009) applied PCA to data with information on whole-brain grey matter in order to successfully discriminate participants at risk of psychosis from healthy participants. In combination with SVM, Koutsouleris et al. (2009) reached classification accuracies up to 91%.
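
The mechanics can be sketched as follows (illustrative random data, not the study's): the component scores are uncorrelated weighted sums of the original features, ordered by the share of variance they explain.

```python
# Minimal PCA sketch (illustrative data): components are weighted sums
# of the original features, ordered by explained-variance share.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 400))    # 50 cases, 400 original features

pca = PCA(n_components=5).fit(X)
scores = pca.transform(X)         # new features (component scores)
print(scores.shape)               # (50, 5)
print(pca.explained_variance_ratio_)  # decreasing shares of variance
```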

PCA, however, has some drawbacks that may have serious consequences for neuroimaging practice. For instance, when using PCA for classification, it is assumed that there exists a strong relationship between the predictor variables and the response variable, an assumption that can never be guaranteed to hold. Indeed, although the derived components explain the largest possible amount of variance in the data, these components do not necessarily have large explanatory power for the response variable. In particular, the direction(s) in the data with the largest variance does not automatically coincide with the direction of the response variable. This is because PCA is an unsupervised learning method that does not take the response variable into account when constructing the components. In most practical applications, however, this assumption seems to hold and PCA yields good results regarding classification accuracy (Mwangi et al., 2014). A second drawback of PCA is that the new components, because they are weighted sums of many original features, tend by the Central Limit Theorem to follow a unimodal distribution that resembles a Gaussian. For classification, however, a predictor variable that discriminates well between groups is expected to have a multimodal rather than a unimodal distribution, with one peak for each group.

1.3.2.2 Partial Least Squares Regression (PLS-R)

In contrast to PCA, Partial Least Squares Regression (PLS-R) is a (semi-)supervised learning technique. Whereas PCA only searches for relations between the predictor variables, PLS-R also takes the response variable into account when deriving the new components (Krishnan et al., 2010). As a consequence, the new components explain a considerable amount of variance of the predictor variables and, at the same time, are maximally related to the response variable, resulting, in general, in components that are more relevant for classifying the response variable. PLS-R thus provides new features that are related to the response variable, which is not necessarily the case for PCA, and thereby overcomes this drawback of PCA.

An important advantage of PLS-R is that it models the correlational structure among the predictors and simultaneously accounts for the correlation between the predictors and the response variable. Naturally, this is very useful within the neuroimaging domain, in which data are multimodal and high-dimensional by nature. In this regard, Nguyen et al. (2002) found that for classifying human tumor samples based on microarray gene expressions, PLS proved superior to PCA.

PLS-R is mostly suited for continuous response variables, but it has also been proven useful for a binary outcome variable. This is relevant because in this study, which aims at improving the classification of AD, we have a binary response variable (i.e., AD or HC). PLS-R is a feature extraction method that has already been applied successfully within the neuroimaging domain. For example, Menzies et al. (2007) used PLS-R to derive latent MRI markers from structural MRI data. These markers were associated with performance on inhibitory control tasks, which is often used as a diagnostic tool for obsessive-compulsive disorder. Furthermore, it has been shown in the neuroimaging literature that PLS-R components derived from brain activity measured by, for example, MRI have predictive power for certain illnesses. McIntosh et al. (2004) demonstrated that PLS-R is an effective multivariate analysis tool for data containing both spatial and temporal information; PLS-R was able to discover distributed brain activity patterns that provide valuable information about the delay and duration of the responses. Therefore, PLS-R can be considered a useful method for analyzing neuroimaging data.

1.3.2.3 Canonical Correlation Analysis (CCA) and its extensions

Canonical Correlation Analysis (CCA) was originally developed by Hotelling (1935), who analyzed how arithmetic speed and arithmetic power are related to reading speed and reading power (Härdle et al., 2007). CCA is a tool for multivariate statistical analysis that aims at dimension reduction by discovering the (largest) associations between two sets of variables (although generalizations to multiple sets of variables have been proposed, see below). In our case, the different sets of variables represent different brain modalities, like, for example, functional and structural brain information. Contrary to PCA and PLS-R, which perform dimension reduction on data from a single modality, CCA reduces data from (at least) two modalities. In order to find the joint structure/relations among two multivariate samples, CCA searches for low-dimensional projections of the data (i.e., canonical variates), one for each data set (i.e., brain modality), that are maximally correlated with each other. Each canonical variate (i.e., low-dimensional projection) is a weighted sum of the original variables of its associated set of variables. The canonical variates are obtained by means of a joint covariance analysis of the data from the multiple modalities, which boils down to solving a generalized eigenvalue problem.

A typical feature of CCA, in contrast to both PCA and PLS, which are most often applied to a single dataset, is that CCA focuses on the relations between (features of) two different datasets rather than on the relations between variables within each dataset. Indeed, PCA fully focuses on the explained variance of the predictor variables (of a single modality), and PLS tries to explain the variance of both the predictor variables (for a single modality) and the response variable. CCA, on the contrary, operates on two modalities and searches for the largest relations between the variables of the two modalities, thereby ignoring the relations between the variables (i.e., the explained variance) within each dataset. As such, CCA is a truly multimodal approach, as it focuses on the correlations between datasets. This may be beneficial for uncovering unknown but complementary information from the data, which is particularly interesting for improving multimodal classification. However, it could also be considered a disadvantage that CCA takes neither the variance within the datasets (as PCA does) nor the relation between the predictors and the response variable (as PLS-R does) into account. Indeed, when extracting the relations between datasets, CCA does not make use of the information in the response variable.

For neuroimaging data, CCA can be applied to integrate different sources, such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG). In Correa et al. (2008), Canonical Correlation Analysis was successfully applied to fuse biomedical imaging modalities and improve the detection of associative networks in schizophrenia. In Sui et al. (2010), the differences between bipolar disorder and schizophrenia were successfully clarified by combining fMRI and diffusion tensor imaging data with a fusion method (CCA + ICA); it was shown that bipolar disorder and schizophrenia each exhibit several brain patterns that are distinct from each other and from healthy controls, although they also share abnormalities in prefrontal-thalamic white matter integrity and in frontal brain mechanisms.

Regularized Canonical Correlation Analysis (RCCA) is an extension of CCA that should be used when (some of) the data sets analyzed with CCA contain more variables than cases (i.e., the small-n-large-p or curse of dimensionality problem). In that case, the joint covariance analysis underlying CCA suffers from a singularity problem. To circumvent this singularity problem, RCCA adds a ridge penalty to the CCA problem. Regularized CCA is thought to be very suitable for neuroimaging data in which, in general, data modalities have a large number of variables compared to the number of cases (Nielsen, Hansen, & Strother, 1998). Moreover, the method of adding a ridge penalty is often used in image processing to improve the conditioning of covariance matrices when dealing with ill-posed problems (Leurgans, Moyeed, & Silverman, 1993).
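
A bare-bones sketch of the ridge idea (our own illustration, not the RGCCA software): adding a term `lam * I` to each covariance matrix keeps it invertible even when there are more variables than cases, which plain CCA cannot handle.

```python
# Ridge-regularized CCA sketch (illustrative implementation): the
# penalty lam * I makes the within-set covariance matrices invertible
# when p > n, so the canonical directions remain computable.
import numpy as np

def rcca_first_pair(X1, X2, lam=1.0):
    """First canonical weight pair with ridge-regularized covariances."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + lam * np.eye(X1.shape[1])  # regularized
    S22 = X2.T @ X2 / n + lam * np.eye(X2.shape[1])  # regularized
    S12 = X1.T @ X2 / n
    # Generalized eigenvalue problem for the canonical directions
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    vals, vecs = np.linalg.eig(M)
    w1 = np.real(vecs[:, np.argmax(np.real(vals))])
    w2 = np.linalg.solve(S22, S12.T @ w1)
    return w1, w2 / np.linalg.norm(w2)

rng = np.random.default_rng(6)
X1 = rng.normal(size=(30, 200))   # p > n: plain CCA would fail here
X2 = rng.normal(size=(30, 150))
w1, w2 = rcca_first_pair(X1, X2)
print(w1.shape, w2.shape)         # (200,) (150,)
```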

Regularized Generalized Canonical Correlation Analysis (RGCCA) is a generalization of RCCA to three or more sets of variables. It constitutes a general framework for many multi-block data analysis methods (Tenenhaus et al., 2011). RGCCA combines the power of multi-block data analysis (i.e., implying the maximization of well-defined optimization criteria) and the flexibility of PLS path modeling (i.e., the researcher decides which modalities are connected and which are not). In neuroscience research, RGCCA has been successfully applied to classify certain diseases. For example, Garali et al. (2018) showed that RGCCA is a suitable method for disease identification in spinocerebellar ataxia.

1.3.3 Addressing multimodality: concatenated and separate strategy

When the data consist of multiple modalities, feature extraction (or selection) can be applied using a separate or a concatenated strategy. In a separate strategy, the features of each modality are reduced separately to a limited number of new (derived) features. As such, only the relations between features within a single modality are used to derive the new features/components. In a concatenated strategy, on the contrary, data from multiple modalities are analyzed simultaneously, which implies that new features are derived based on the relations between features from the same and from different modalities. As opposed to a separate strategy, a concatenated strategy thus captures the links between the modalities; these multimodal links often contain important information for the classification. A concatenated strategy may also handle the noise in the data better than a separate strategy. Indeed, by analyzing the data from all modalities simultaneously, a concatenated approach potentially separates the noise hidden in the data from the true signal in a better way, which, as a consequence, may ameliorate classification accuracy.

CCA is by nature a concatenated strategy, as all modalities are analyzed simultaneously and CCA aims at finding the largest correlation(s) between the different data modalities. PCA and PLS-R, on the contrary, can both be applied in a separate (i.e., reducing the data in each modality separately) and in a concatenated fashion. Regarding the latter, the data from the multiple modalities first need to be concatenated into a large matrix, with as many columns as the total number of features in all modalities together. Next, PCA or PLS-R is applied to this concatenated matrix. As such, the interrelations between (features from) the multiple modalities are used to derive the new features (or components).

1.4 Goal of the study, research questions and hypotheses

The goal of this thesis is to find out, in the context of a neuroimaging study that tries to optimally classify AD patients (and discriminate them from HC's), which multimodal feature extraction approach for combining information from multiple modalities results in the best classification performance. In particular, this thesis aims to shed some light on which feature extraction method (i.e., PCA, PLS-R or CCA) in combination with which strategy to deal with multimodality (i.e., separate versus concatenated) results in the best classification accuracy. Furthermore, we will also investigate the effect of the number of extracted components on the classification performance and whether this effect differs between multimodal feature extraction approaches. Finally, this study aims at investigating whether multimodal feature extraction improves classification performance in comparison to an approach without any feature extraction (i.e., whole-brain analysis), which will be applied to all data modalities simultaneously and to each data modality separately.

Based on these goals, the current study addresses two main research questions. First, which multimodal feature extraction approach performs best in terms of classification performance, and how does this performance depend on the number of extracted components? Related sub-questions are: (a) Which strategy for dealing with multimodality yields better classification accuracies? (b) Is there a feature extraction method that clearly yields better results than the other methods? Second, does multimodal feature extraction outperform whole-brain analysis? A sub-question here is whether executing a whole-brain analysis on all data modalities simultaneously performs differently than applying whole-brain analysis to each data modality separately.

Regarding the first research question, we hypothesize that PLS-R will outperform both CCA and PCA, as PLS-R is the only method that takes the response variable into account when extracting features from the data. Moreover, we expect that the concatenated strategy will yield better results than the separate strategy, as the former, as opposed to the latter, is a truly multimodal strategy that takes the links between modalities into account. As such, we postulate that PLS-R in combination with the concatenated strategy will outperform all other multimodal feature extraction approaches. With respect to the effect of the number of components, we expect that a smaller number of components will decrease the classification performance in comparison to a larger number of components. Differences in classification performance between feature extraction methods will also decrease when the number of components increases. Regarding the second research question, we hypothesize, based on the literature, that a multimodal feature extraction approach will improve classification accuracy compared to a whole-brain analysis that does not involve any dimension reduction or feature extraction. We conjecture that this will especially happen when only data from a single modality are used in the whole-brain analysis, whereas we think that the differences will be smaller when applying a whole-brain analysis to the data from all modalities simultaneously.

In the remainder of this thesis, Section 2 gives a detailed description of the methodology that is used to compute the classification performance of the proposed multimodal feature extraction approaches. Analysis results are presented in Section 3. Section 4 is dedicated to a summary and discussion of the results and presents the conclusions of this study, also pointing out limitations of this study as well as avenues for future research.

Section 2. Methods

2.1 Data

To examine which multimodal feature extraction approach results in the best classification accuracy, and whether multimodal feature extraction outperforms whole-brain analysis, the dataset presented in Schouten et al. (2016) is used. This data set consists of structural and functional neuroimaging features that were collected from 250 participants, of which 77 (30.8%) suffered from AD and 173 (69.2%) were healthy controls (HC). Table 1 provides an overview of the features used in this study and their descriptions. The AD participants were scanned as part of a prospective registry of dementia at the University of Graz (PRODEM; see Seiler et al., 2012 for more information). Only those patients that received the diagnosis of AD and for whom fMRI or MRI scans were available were included in the study. The HC's neuroimaging data were taken from the Austrian Stroke Prevention Family study, a prospective single-center community-based follow-up study that aims at examining the frequency of vascular risk factors and their effects on cerebral morphology and function in healthy elderly (for more information, see Schouten et al., 2016).

From all the neuroimaging data modalities that were collected (see Schouten et al., 2016), three modalities were selected for this study, with one neuroimaging modality being structural (i.e., MR) and two modalities being functional (i.e., ALFF and EX). Grey matter values (MR), which refer to structural brain features, were used to distinguish between AD patients and HC's. These values indicate the percentage of each voxel (volume) that consists of grey matter. The loss of grey matter is known to be associated with AD (Yang et al., 2011; Frisoni et al., 2002). The correlation between each voxel's functional resting-state time course and the time course of the executive center network in the brain was a second feature used to distinguish between AD patients and HC's. This executive center in the brain is believed to be associated with brain abnormalities in people with AD. The last feature measures the amplitude of low-frequency fluctuations (ALFF) in the BOLD signal. This is a functional resting-state fMRI feature which quantifies the amplitude of regional spontaneous low-frequency fluctuations in brain activity measured when the subject is at rest. This is another fMRI property that has been shown to differ between HC's and people with mild AD (Qian et al., 2012).

Table 1. Description of the modality type, voxel size and number of variables for the data modalities used in this study

Keyword   Description (modality type)                                          Voxel size (mm)   Number of variables
VBM4      Percentage of grey matter of each voxel (structural)                 4 by 4 by 4       59,049
ALFF4     Amplitude of low-frequency fluctuation of each voxel (functional)    4 by 4 by 4       25,750
EX4       Correlation of each voxel with the executive center (functional)     4 by 4 by 4       25,759

2.2 Six-step procedure

The following six-step procedure was executed to determine the classification accuracy for each of the studied multimodal feature extraction approaches (see Section 2.4):

• Step 1: Adopt the validation approach to split the data into a training set (150 subjects) and a test set (100 subjects)

• Step 2: Apply a multimodal feature extraction approach to the training data (of the different modalities), including possible pre-processing steps (i.e., centering, normalization and selecting variables)

• Step 3: Use the scores on the 𝑆 derived variables/components from the training set (in Step 2) to train an SVM classifier

• Step 4: Derive 𝑆 components for the test set using the parameters of the multimodal feature extraction method (and possible pre-processing steps) that was applied to the training set (in Step 2)

• Step 5: Predict the class labels for the test set using the scores on the 𝑆 derived components for the test set (in Step 4) and the parameters of the SVM model applied to the training set (in Step 3)

• Step 6: For the test set, compare the predicted class labels (obtained in Step 5) with the observed class labels and compute the classification accuracy

To study the effect of the number of extracted components 𝑆 on the classification performance, this six-step procedure was repeated for a range of values of 𝑆 (see Section 2.4.5). Below, these six steps are discussed in more detail.
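To make the six steps concrete, the sketch below runs the pipeline once on synthetic data in Python (the thesis itself uses R). The 250 subjects, the 77/173 group sizes and the 150/100 split match the study; the number of features, the injected group signal and the use of scikit-learn's PCA and SVC are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for one modality: 250 subjects, 200 voxel features
# (hypothetical), with a group signal injected in the first 10 features.
n, p, S = 250, 200, 15
y = np.array([1] * 77 + [0] * 173)          # 77 AD, 173 HC, as in the data set
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.5                        # hypothetical group difference

# Step 1: validation approach, 150 training / 100 test subjects.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=150, random_state=0, stratify=y)

# Step 2: centre/scale with TRAINING statistics, then extract S components.
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
pca = PCA(n_components=S).fit((X_tr - mu) / sd)
A_tr = pca.transform((X_tr - mu) / sd)

# Step 3: train an SVM on the S component scores (C fixed here for brevity).
svm = SVC(kernel="linear", C=1.0).fit(A_tr, y_tr)

# Step 4: project the test set with the training loadings (and training mu/sd).
A_te = pca.transform((X_te - mu) / sd)

# Steps 5-6: predict the test labels and compute the classification accuracy.
acc = (svm.predict(A_te) == y_te).mean()
print(round(acc, 2))
```

With this strong artificial signal the accuracy is well above chance; with real neuroimaging data, of course, the signal is far weaker.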

2.3 Validation approach to create a training and a test set (Step 1)

In order to evaluate, for a particular data set, the performance of a machine learning classifier, the data set needs to be split at random into a training set and a test set (i.e., Step 1 of the above-mentioned six-step procedure) in such a way that each case from the data set belongs to only one of both sets. The latter is crucial, as no information from the test set may be used in the training phases (i.e., Steps 2 and 3) of the six-step procedure (see Section 2.2). An important advantage of applying this splitting method is that it enhances the reliability of the findings and allows generalizing the obtained classification accuracies towards the general population. To split the data into a training and a test set, we used the validation approach. In particular, at random, 60% of the cases (i.e., 150 of the 250 subjects) were allocated to the training set and 40% (100 subjects) to the test set. Applying Steps 2 to 6 of the six-step procedure to this training and test set results in a single estimate of the classification accuracy for this data set (for each particular combination of a multimodal feature extraction approach and a number of components 𝑆). To split the data randomly, the function Splits from the base-package (R Core Team, 2015) was used.

A serious drawback of the validation approach is that the obtained classification accuracy estimate can be very variable. Indeed, the estimate heavily depends on the particular (random) split of the cases into a training and a test set. Using another (random) split of the cases in a training and a test set may lead to a (very) different classification accuracy estimate. In order to stabilize this estimate –and, as such, reduce its variability and get a more reliable estimate– one can repeat the validation approach a (large) number of times and compute the average classification accuracy across the considered splits. In this study, the six-step approach is repeated 100 times, with each time using a different random split of the data into a training set of 150 cases (60%) and a test set of 100 cases (40%). The average of the obtained classification accuracy estimates is then taken as the final classification accuracy estimate and the variance in the estimates across the 100 splits is used to quantify the variability (i.e., standard error) in the classification accuracy estimate. In order to keep the computation time of the whole analysis (i.e., 100 splits, 5 multimodal feature extraction approaches and 49 different values for the number of components 𝑆) within reasonable limits, high-performance parallel computing infrastructure was used.
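The repeated-splits idea can be sketched as follows (Python stand-in for the R analysis): draw many random 150/100 splits, compute an accuracy for each, and report the mean and standard error. Ten splits and a nearest-centroid classifier replace the thesis's 100 splits and SVM purely to keep the illustration fast; the data are synthetic with hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (hypothetical sizes): 250 subjects, 50 features,
# with a group shift on the first 5 features.
n, p = 250, 50
y = np.array([1] * 77 + [0] * 173)
X = rng.normal(size=(n, p)) + 1.2 * y[:, None] * (np.arange(p) < 5)

# Repeat the validation approach over random 150/100 splits and collect
# one accuracy estimate per split.
accs = []
for _ in range(10):
    perm = rng.permutation(n)
    tr, te = perm[:150], perm[150:]
    # Nearest-centroid classifier as a lightweight stand-in for the SVM.
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1) <
            np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    accs.append((pred == y[te]).mean())

# Final estimate: mean accuracy across splits, with its standard error.
mean_acc = np.mean(accs)
se_acc = np.std(accs, ddof=1) / np.sqrt(len(accs))
print(round(mean_acc, 2), round(se_acc, 3))
```

Averaging over splits stabilizes the estimate exactly as described above: the per-split accuracies vary, but their mean has a much smaller standard error.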

2.4 Multimodal feature extraction approaches (Step 2)

First, the three considered feature extraction methods (i.e., PCA, PLS-R and CCA) are presented at a more technical level in Sections 2.4.1-2.4.3 (for a more intuitive description, see Section 1.3.2). Next, the two strategies to deal with multimodality (i.e., the separate and the concatenated strategy) are outlined (Section 2.4.4). Combining the three feature extraction methods with the two multimodality strategies results in five multimodal feature extraction approaches (Section 2.4.4), as CCA can by nature only be combined with the concatenated strategy. PCA and PLS-R, on the contrary, can be combined with both a separate and a concatenated strategy. Finally, the range of values for the number of components 𝑆 adopted in this study is discussed (Section 2.4.5).

2.4.1 PCA

Principal Components Analysis (PCA) is obtained by a decomposition of a data matrix 𝑿 (𝑛 cases by 𝑝 variables) as:

𝑿 ≈ 𝑨𝑩′ (1.1)

where 𝑨 is an 𝑛 by 𝑧 matrix containing the component scores, 𝑩 is a 𝑝 by 𝑧 matrix containing the loadings of the original features on the components, 𝑩′ denotes the transpose of matrix 𝑩 and 𝑧 indicates the number of extracted components. The matrices 𝑨 and 𝑩 can be obtained by an Eigenvalue Decomposition (ED) of 𝑿′𝑿 or by a Singular Value Decomposition (SVD) of 𝑿 (Ten Berge, 1993). The maximal number of extracted components equals min(𝑛, 𝑝). When 𝑧 = min(𝑛, 𝑝) components are extracted, there is no loss of information because the extracted components explain all the variance present in the data. When 𝑧 < min(𝑛, 𝑝) components are extracted, in general, there is a loss of information.

In this study, PCA was performed using the function prcomp (R Core Team, 2015). Before applying PCA, for the training set, the variables within each data modality were centered and standardized (i.e., z-scores). Moreover, variables that had no variance in the training set were removed from the training set (in Step 2) and the test set (in Step 4), as these variables have no discriminating power to differentiate between the AD patients and HC's. Next, PCA was performed on the pre-processed training data 𝑿_train^stan. In the separate strategy, PCA was applied to the data from each modality separately (see right panel of Figure 1). In the concatenated strategy, the data from all modalities were first concatenated into a large matrix (see left panel of Figure 1) before subjecting this large matrix to PCA. In particular, in PCA the data 𝑿_train^stan (from each modality separately or concatenated into a large matrix) are decomposed into component scores 𝑨_train^PCA and component loadings 𝑩_train^PCA as follows: 𝑿_train^stan = 𝑨_train^PCA (𝑩_train^PCA)′. Next, the first 𝑆 components from 𝑨_train^PCA were selected and used as inputs for training the SVM classifier (i.e., 𝑆/3 components per modality in the separate strategy).
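The decomposition above can be sketched in a few lines of numpy (a stand-in for R's prcomp; all sizes below are hypothetical): the SVD 𝑿 = 𝑼𝑫𝑽′ gives scores 𝑨 = 𝑼𝑫 and loadings 𝑩 = 𝑽, and keeping the first 𝑆 columns of 𝑨 yields the SVM inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, p, S = 150, 300, 6          # hypothetical sizes
X_train = rng.normal(size=(n_train, p))

# Standardize with training statistics (z-scores per variable).
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_stan = (X_train - mu) / sd

# PCA via the SVD X = U D V', so that A = U D (scores) and B = V (loadings).
U, D, Vt = np.linalg.svd(X_stan, full_matrices=False)
A_train = U * D                      # n_train x min(n, p) component scores
B_train = Vt.T                       # p x min(n, p) loadings

# With all min(n, p) components kept, X_stan is reconstructed without loss.
assert np.allclose(X_stan, A_train @ B_train.T)

# Keep only the first S components (largest variance) as SVM inputs.
A_S = A_train[:, :S]
print(A_S.shape)
```

The assertion mirrors the no-loss property stated above for 𝑧 = min(𝑛, 𝑝); truncating to 𝑆 columns is where the dimension reduction (and the loss of information) happens.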

2.4.2 PLS-R

Partial Least Squares Regression (PLS-R) refers to a regression of the form:

𝒀 = 𝑿𝑩 + 𝑬 (1.2)

where 𝒀 is an 𝑛 by 𝑚 (often 𝑚 = 1) response matrix(/vector), 𝑿 is an 𝑛 by 𝑝 predictor matrix, 𝑩 is a 𝑝 by 𝑚 coefficient matrix(/vector) and 𝑬 is an 𝑛 by 𝑚 noise matrix(/vector) with the same dimensions as 𝒀. To establish the model, a weight matrix 𝑾 (𝑝 by 𝑧) is estimated for 𝑿 as:

𝑻 = 𝑿𝑾 (1.3)

with 𝑾 containing the weight vectors for the columns (i.e., predictor variables) of 𝑿; this gives the corresponding score matrix 𝑻 (𝑛 by 𝑧), which contains the scores of the 𝑛 cases on the 𝑧 components. Using Ordinary Least Squares (OLS) regression for predicting 𝒀 based on 𝑻 results in the regression coefficient matrix 𝑸 (𝑧 by 𝑚). The entries in 𝑸 are the loadings in the decomposition of 𝒀 with 𝑻 being the scores:

𝒀 = 𝑻𝑸 + 𝑬

Defining 𝑩 = 𝑾𝑸 yields

𝒀 = 𝑻𝑸 + 𝑬 = 𝑿𝑾𝑸 + 𝑬 = 𝑿𝑩 + 𝑬

The aim of PLS-R is to find the 𝑾 matrix that yields features 𝑻 that at the same time explain 𝑿 and are related to 𝒀 as much as possible.

To estimate the PLS-R model, we used the plsr function from the pls package (Mevik, Wehrens, & Liland, 2013) in R. As was done for PCA, before performing PLS-R, the variables within each modality for the training set were centered and standardized, and variables with no variance (in the training set) were removed (also from the test set). Next, PLS-R was applied to the pre-processed training data 𝑿_train^stan (from each modality separately or concatenated into a large matrix), yielding the decomposition 𝒀_train = 𝑿_train^stan 𝑾_train^PLSR 𝑸_train^PLSR + 𝑬. The first 𝑆 components from 𝑨_train^PLSR = 𝑻_train^PLSR = 𝑿_train^stan 𝑾_train^PLSR were extracted and subjected to SVM (i.e., in the separate strategy).

2.4.3 Regularized Generalized CCA (RGCCA)

Canonical Correlation Analysis (CCA) for two data sets aims at finding the largest correlating canonical variates, one associated with each data set, that best describe the associations between both data sets. Consider two multivariate data sets, here represented as multivariate random variables 𝑋 and 𝑌. CCA looks for the canonical variates 𝑉_X = 𝑎′𝑋 and 𝑉_Y = 𝑏′𝑌 (i.e., linear combinations or weighted sums of 𝑋 and 𝑌), with 𝑎 and 𝑏 being weight vectors. During a CCA analysis, the vectors 𝑎 and 𝑏 are found that maximize the correlation between the canonical variates 𝑉_X and 𝑉_Y. In other words, CCA searches for the low-dimensional projections 𝑉_X and 𝑉_Y that maximize the correlation 𝐶𝑂𝑅(𝑉_X, 𝑉_Y) = 𝐶𝑂𝑅(𝑎′𝑋, 𝑏′𝑌) between the two canonical variates.

Regularized Canonical Correlation Analysis (RCCA) is a CCA-based technique that solves the singularity problem that occurs when CCA is applied to (two) data matrices in which the number of columns (variables) exceeds the number of rows (cases). In particular, when the data has more variables than cases, which is often encountered in neuroimaging data (Nielsen, Hansen, & Strother, 1998), the matrices 𝑿′𝑿 and/or 𝒀′𝒀 are ill-conditioned. As a consequence, it becomes impossible to compute the inverse of the cross-product matrices 𝑿′𝑿 and/or 𝒀′𝒀, which is an essential step in the joint covariance analysis underlying CCA, in a stable (unique) way. By regularizing the problem, which, for example, can be obtained by adding a penalty term to the diagonal of 𝑿′𝑿 and/or 𝒀′𝒀 (i.e., ridge regression), a unique and stable solution for the inverse of 𝑿′𝑿 and/or 𝒀′𝒀 is obtained.

Regularized Generalized Canonical Correlation Analysis (RGCCA) is a generalization of RCCA to three or more sets of variables. In other words, this generalized version of (R)CCA is suited for analyzing more than two sources of information. In particular, RGCCA simultaneously analyzes 𝐽 matrices 𝑿_1, 𝑿_2, ..., 𝑿_J that represent 𝐽 sets of variables observed on the same set of 𝑛 individuals (i.e., 𝐽 brain modalities measured for the same individuals). The matrices 𝑿_1, 𝑿_2, ..., 𝑿_J must have the same number of rows/cases but may –and usually will– have different numbers of columns/variables. The aim of RGCCA is to study the relationships between these 𝐽 blocks of variables (Tenenhaus & Guillemot, 2017). In particular, a canonical variate 𝑽_Xj = 𝑿_j 𝒂_j is sought for each data block 𝑿_j (𝑗 = 1, …, 𝐽) such that a generalized version of the correlation between these canonical variates 𝑽_Xj (𝑗 = 1, …, 𝐽) is optimized.
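The thesis applies RGCCA to three blocks via the RGCCA R package; as a self-contained illustration of the underlying regularized-CCA idea, the numpy sketch below solves the two-block case, with ridge terms added to 𝑿′𝑿 and 𝒀′𝒀 exactly as described for RCCA above. All sizes and the penalty value are hypothetical, and the two blocks are built to share one latent signal so that the first canonical correlation is high.

```python
import numpy as np

rng = np.random.default_rng(4)
n, px, py, lam = 100, 150, 120, 1.0   # hypothetical sizes; lam = ridge penalty

# Two blocks sharing one latent variable z (so a strong canonical pair exists).
z = rng.normal(size=n)
X = np.outer(z, rng.normal(size=px)) + rng.normal(size=(n, px))
Y = np.outer(z, rng.normal(size=py)) + rng.normal(size=(n, py))
X -= X.mean(axis=0); Y -= Y.mean(axis=0)

# Regularized CCA: the ridge terms lam*I make the cross-product matrices
# invertible even when the number of variables exceeds the number of cases.
Sxx = X.T @ X / n + lam * np.eye(px)
Syy = Y.T @ Y / n + lam * np.eye(py)
Sxy = X.T @ Y / n

# The leading eigenvector of Sxx^-1 Sxy Syy^-1 Syx gives the weight vector a;
# the matching weights for Y follow as b ~ Syy^-1 Syx a.
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
vals, vecs = np.linalg.eig(M)
a = np.real(vecs[:, np.argmax(np.real(vals))])
b = np.linalg.solve(Syy, Sxy.T @ a)

va, vb = X @ a, Y @ b                 # first pair of canonical variates
rho = np.corrcoef(va, vb)[0, 1]
print(round(rho, 2))
```

RGCCA extends this scheme to 𝐽 blocks by optimizing a (generalized) sum of correlations between the 𝐽 block variates instead of a single pairwise correlation.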

In this study, RGCCA was performed using the function rgcca from the package RGCCA (Tenenhaus, Tenenhaus, & Groenen, 2017). As was done for PCA and PLS-R, before performing RGCCA, the variables within each modality for the training set were centered and standardized, and variables with no variance (in the training set) were removed (also from the test set). Next, RGCCA was performed on the pre-processed training data 𝑿_train,j^stan (𝑗 = 1, …, 𝐽); the pre-processed data from all modalities were analyzed together. In particular, canonical variates 𝑽_train,Xj^CCA = 𝑿_train,j^stan 𝑨_train,j^CCA were sought for each data modality 𝑗 that maximized their joint correlation for the training data (𝑗 = 1, …, 𝐽). Next, the first 𝑆/3 components from each 𝑽_train,Xj^CCA were selected (and concatenated into a single matrix) and used as input for the SVM classifier (Step 3).

2.4.4 Strategies for dealing with multimodality

In this study, data belonging to three brain modalities (i.e., MR, ALFF and EX; see Table 1), which implies three separate sets of predictor variables that are all assessed on the same participants, will be jointly analyzed to improve the classification of AD. In Step 2 of the six-step procedure (see Section 2.2), the data (features) from the three modalities are reduced to a total of 𝑆 new features/components by means of PCA, PLS-R and CCA, herewith only using the training data (as obtained in Step 1 of the six-step procedure). For PCA and PLS-R, this dimension reduction step can be obtained through a separate and a concatenated strategy.

In the separate strategy, as can be seen in the right-hand panel of Figure 1, first, the data of each modality separately are reduced with either PCA or PLS-R and components are retained from the reduced data of each modality. Next, the components obtained for each of the three modalities are concatenated into a larger matrix with 𝑆 columns/components, which is further subjected to the SVM classifier (Step 3 of the six-step procedure). To illustrate this, imagine that PCA is applied to the training data of each of the three data modalities. As each data modality for the training set contains 150 cases, at most 150 PCA components can be retained from each modality. If we need 𝑆 = 15 components to train the SVM, the first 𝑆/3 = 5 PCA components, which correspond to the components with the largest eigenvalues, are selected from each data modality. After concatenating these 3 × 5 = 15 PCA components into a single matrix, this matrix is analyzed with SVM. The two multimodal feature extraction approaches using the separate strategy will in the remainder of this thesis be indicated by PCAsep and PLSRsep.

In the concatenated strategy, as can be seen in the left-hand panel of Figure 1, first, the data (features) from the three modalities are concatenated into a very large matrix. Note that here, as opposed to the separate strategy, the original features of the three modalities are concatenated, rather than the extracted PCA/PLS-R components. Next, PCA/PLS-R is applied to this very large concatenated matrix and 𝑆 components are directly extracted from it. The multimodal feature extraction approaches adopting the concatenated strategy will be termed PCAconc and PLSRconc. Because CCA can only be applied when at least two data sets for the same subjects are available, CCA inherently implies a concatenated strategy. Therefore, this multimodal feature extraction approach will simply be coined CCA. Note, however, that, although the CCA component scores are obtained in a simultaneous fashion, CCA yields component scores for each modality separately. Indeed, the CCA component scores for each modality are simultaneously estimated in such a way that they optimize the correlations between the three modalities. Therefore, just as for PCAsep and PLSRsep, the 𝑆 components for CCA are obtained by taking the first 𝑆/3 component score vectors from each modality, with these components corresponding to the largest canonical correlations. After concatenating these 𝑆 components into a matrix, SVM is applied to this concatenated matrix (Step 3).

In sum, the main difference between both strategies to deal with multimodality is that in the separate strategy, the feature extraction methods are performed for each modality separately and the modality specific components are then combined, while in the concatenated approach the modalities are combined prior to applying the feature extraction methods.
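The contrast between the two strategies boils down to where the concatenation happens, which can be sketched as follows (Python/scikit-learn stand-in; the three block sizes are hypothetical, not the voxel counts from Table 1):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n, S = 150, 15                          # 150 training subjects, S = 15 components
# Three synthetic modalities with different numbers of features (hypothetical).
blocks = [rng.normal(size=(n, p)) for p in (400, 250, 250)]

# Separate strategy: extract S/3 = 5 components per modality,
# then concatenate the component scores.
per_block = S // len(blocks)
A_sep = np.hstack([PCA(n_components=per_block).fit_transform(B) for B in blocks])

# Concatenated strategy: concatenate the raw features first,
# then extract S components from the large matrix.
X_conc = np.hstack(blocks)
A_conc = PCA(n_components=S).fit_transform(X_conc)

print(A_sep.shape, A_conc.shape)        # both yield n x S inputs for the SVM
```

Both routes end with an 𝑛 by 𝑆 score matrix for the classifier, but only the concatenated route lets a single component load on features from several modalities at once.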

By combining the three feature extraction methods with the two multimodality strategies, as can be seen in Table 2, five multimodal feature extraction approaches are obtained. The resulting 𝑆 components of each of these five multimodal feature extraction approaches are analyzed with SVM (Step 3 of the six-step procedure). For the whole-brain analysis, no preliminary dimension reduction by means of feature extraction is performed and the features of each modality separately, or the concatenated (original) features of all data modalities simultaneously, are directly subjected to an SVM classifier.

Table 2. Overview of the five multimodal feature extraction approaches considered in this study

                        PCA       PLS-R      CCA
Separate strategy       PCAsep    PLSRsep    ***
Concatenated strategy   PCAconc   PLSRconc   CCA

*** CCA cannot be combined with a separate strategy as it inherently is a concatenated strategy (i.e., it can only be applied when the data from all modalities are analyzed simultaneously).

2.4.5 Number of components 𝑆

To evaluate the effect of the number of extracted components 𝑆 on the classification performance, each of the five multimodal feature extraction approaches was tested for several values of 𝑆. More specifically, for each multimodal feature extraction approach, mean classification accuracy estimates (see Section 2.7) were computed for each considered value of 𝑆 by averaging the obtained estimates across the 100 splits of the data into a training and a test set. In this study, the following range of 𝑆 values is adopted, which increases in steps of three: 3, 6, 9, 12, 15, …, 𝑆_max, with 𝑆_max being equal to 147. Steps of three were chosen as the data consist of three modalities and the separate strategy (and CCA) demands that components are retained from each modality (i.e., for 𝑆 equal to 3, 6, 9, … the number of components extracted from each of the three modalities equaled 1, 2, 3, …). 𝑆_max was chosen to equal 147 as the maximum number of extracted components cannot be larger than the number of observations in the training set, which always contained 150 subjects in this study. Moreover, in some cases, the feature extraction techniques only allowed a lower maximum number of components to be extracted (e.g., 147 for PLS).
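This grid of 𝑆 values is easy to generate and check against the 49 values mentioned in Section 2.3:

```python
# The grid of component numbers used in the study: 3, 6, ..., 147 in steps of three.
S_values = list(range(3, 148, 3))
print(len(S_values), S_values[0], S_values[-1])
```

The grid indeed contains 49 values, matching the 49 settings per approach mentioned for the parallel computations.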

2.5 Support Vector Machine (SVM) classifier (Step 3)

After reducing the multimodal data to a limited number of features (Step 2), a Support Vector Machine (SVM) classifier is applied to these reduced (new) features. SVM, which was proposed by Vapnik (1996), has been used for classification purposes within the neuroimaging domain for a while now, as evidenced by several recent studies using SVM. For instance, Mourao-Miranda et al. (2005) and Kloppel et al. (2008) have shown that SVM is very suitable for classification in neuroimaging studies.

In the SVM learning process, given 𝑃 input features on 𝑛 training cases, the optimal (𝑃 − 1)-dimensional hyperplane is determined which optimally separates the training observations into groups based on the predictor variables (i.e., input features). In other words, SVM searches for hyperplanes that function as decision boundaries that can be used to optimally classify the observations. Data points falling on either side of the hyperplane are attributed to a different group, in our case AD or HC. Note that the dimensionality of the hyperplane depends on the number of input features. When the number of input features is 2, the hyperplane is just a line (i.e., a one-dimensional plane). When the number of input features is 3, the hyperplane becomes a two-dimensional plane. Note that it becomes difficult to imagine what the hyperplane looks like when the number of features exceeds 3 (James et al., 2013). The observations that have the smallest distance to the hyperplane are the support vectors, which determine the width of the hyperplane margin. These observations are indicated in Figure 2 (middle and right-hand panel) as the filled blue circles and red squares, which represent two different classes. The aim of the SVM algorithm is to find a maximal margin hyperplane, which is the hyperplane that has the largest distance to the nearest training observation (i.e., the support vectors) from each of the –two in this case– classes.

Figure 2. SVM hyperplanes. Two classes of observations are shown (blue circles and red squares), along with optimal (maximal) hyperplanes (green line) and margin (distance between dashed green lines). SVM searches for the data points (support vectors, indicated with a filled circle/square) such that the margin is maximized. Left panel: no optimal hyperplane. Middle panel: optimal hyperplane under large C. Right panel: optimal hyperplane under smaller C, where some observations fall on the other side of the margin (James et al., 2013, p. 346).

As a perfect separation into two groups is rare in empirical data, some observations will fall on the wrong side of the margin (see right-hand panel of Figure 2; unfilled circle and square). The number of these violations that is tolerated during the training of the SVM algorithm can be varied by changing the cost parameter C (James et al., 2013), with larger values of C implying a smaller number of tolerated violations and yielding a more complex model. As such, this meta-parameter of the SVM algorithm allows a tradeoff between the training error and the model complexity. A large C will decrease the training error but may cause overfitting. A small C is also known as a soft margin, which allows more misclassifications (i.e., a more parsimonious, less complex model) and will result in an increase in the training error. A large C is known as a hard margin and allows hardly any misclassifications.

When training the SVM algorithm, the optimal value for the cost parameter C needs to be determined or estimated. When the data at hand are very high-dimensional (e.g., whole-brain analysis), the influence of the cost parameter C is negligible. However, in this study, a feature extraction technique is applied before training the SVM algorithm, which implies that, instead of using the original high-dimensional data, only a few components (i.e., between 3 and 𝑆_max) extracted from these original data are used as predictor variables. As such, SVM is trained on a small number of input features, which necessitates an appropriate choice for C. Therefore, in this study, the cost parameter was tuned by means of 5-fold cross-validation using the tune.svm function from the e1071 package in R. Cross-validation is not used for the whole-brain analysis, as there the number of features used as input for SVM is large and the cost parameter C has almost no influence in this high-dimensional case.
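The tuning step can be sketched as follows with scikit-learn's grid search as a stand-in for e1071's tune.svm (the C grid, the synthetic component scores and their sizes are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(6)

# Synthetic component scores: 150 training subjects, 15 extracted components,
# with the first component carrying the class signal.
A_train = rng.normal(size=(150, 15))
y_train = (A_train[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# 5-fold cross-validation over a (hypothetical) grid of cost values C,
# mirroring the thesis's tune.svm call in R.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(A_train, y_train)
print(grid.best_params_["C"], round(grid.best_score_, 2))
```

The selected C is the one with the highest mean accuracy across the five folds; the SVM is then refit on the full training set with that value before predicting the test set.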

To predict the group labels of the test cases (Step 5) for the whole-brain analysis, the SVM model that was trained on the training set was applied to the test set by using the R function predict.svm from the e1071 package. Regarding the feature extraction methods, the component scores of the test data were derived (Step 4) and the predict.svm function was used to predict the group labels for the test set (Step 5) based on these component scores of the test data and the SVM model obtained for the training set (Step 3). This procedure was identical for all feature extraction methods, data sets (modalities) and 𝑆 values.

2.6 Deriving component scores for the test set (Step 4)

After obtaining the SVM model for the training set (Step 3), the component scores for the test set need to be calculated. As the component scores are linear combinations of the features, the component scores for the test set can be computed by linearly weighting the features of the cases in the test set. Importantly, the component weights/loadings obtained for the training set need to be used here, along with the parameters from any adopted pre-processing steps (e.g., centering, normalization, removing predictors with no variance). The latter implies that the features of the test set need to be centered and/or scaled with the means and/or variances computed for the same features in the training set. Moreover, the same variables that are removed from the training set should also be removed from the test set, even when these variables show some variance in the test set. Below, it is described for each feature extraction method separately how the component scores for the test set can be computed.
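Before turning to the individual methods, the shared projection logic can be sketched in numpy (all sizes hypothetical): every pre-processing parameter, including the set of dropped zero-variance variables, comes from the training set only.

```python
import numpy as np

rng = np.random.default_rng(7)
n_tr, n_te, p, S = 150, 100, 80, 5      # hypothetical sizes
X_tr = rng.normal(size=(n_tr, p))
X_te = rng.normal(size=(n_te, p))
X_tr[:, 0] = 0.0                        # a feature with no variance in training

# Pre-processing parameters come from the TRAINING set only:
keep = X_tr.std(axis=0) > 0             # drop zero-variance training features ...
X_tr, X_te = X_tr[:, keep], X_te[:, keep]   # ... from BOTH sets
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)

# Training loadings B from an SVD of the standardized training data.
U, D, Vt = np.linalg.svd((X_tr - mu) / sd, full_matrices=False)
B = Vt.T[:, :S]

# Test scores: standardize the test set with the TRAINING mu/sd, then
# weight with the training loadings: A_test = X_test_stan B_train.
A_test = ((X_te - mu) / sd) @ B
print(A_test.shape)
```

Reusing the training means, variances and loadings in this way is what keeps the test set untouched during the training phases, as required in Section 2.3.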

2.6.1 PCA

The PCA component scores for the test set, $\mathbf{A}_{test}^{PCA}$, were computed by applying the PCA loadings from the training set, $\mathbf{B}_{train}^{PCA}$, to the pre-processed test set $\mathbf{X}_{test}^{stan}$: $\mathbf{A}_{test}^{PCA} = \mathbf{X}_{test}^{stan} (\mathbf{B}_{train}^{PCA})^{T}$. The first 𝑆 components from $\mathbf{A}_{test}^{PCA}$ were then used to make predictions for the cases in the test set (i.e., in the separate strategy).
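Since the scores are linear combinations, the projection amounts to a matrix product of the standardized test data with the training loadings; a minimal pure-Python sketch with made-up orthonormal loadings (project_test_scores is an illustrative name):

```python
def project_test_scores(X_test_stan, B_train, S):
    """Compute test-set component scores A_test = X_test_stan * B_train,
    keeping only the first S components. B_train is given as one row of
    component weights per FEATURE (training-set loadings)."""
    return [[sum(x * B_train[j][s] for j, x in enumerate(row))
             for s in range(S)]
            for row in X_test_stan]
```

With two features and loadings $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$, the standardized test case $(1, 1)$ scores $\sqrt{2}$ on the first component and $0$ on the second.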

2.6.2 PLS-R

By applying the PLS-R weights from the training set, $\mathbf{W}_{train}^{PLS\text{-}R}$, to the pre-processed test set $\mathbf{X}_{test}^{stan}$, the PLS-R component scores for the test set $\mathbf{A}_{test}^{PLS\text{-}R}$ were obtained: $\mathbf{A}_{test}^{PLS\text{-}R} = \mathbf{T}_{test}^{PLS\text{-}R} = \mathbf{X}_{test}^{stan} \mathbf{W}_{train}^{PLS\text{-}R}$. The first 𝑆 components from $\mathbf{A}_{test}^{PLS\text{-}R}$ were then used for obtaining predictions for the cases in the test set.


2.6.3 RGCCA

The RGCCA canonical scores for the test set, $\mathbf{V}_{test}^{CCA,X_j}$, were computed for each modality by applying the modality-specific CCA loadings from the training set, $\mathbf{A}_{train}^{CCA,j}$, to the pre-processed test set $\mathbf{X}_{test}^{stan,j}$ ($j = 1, \dots, J$): $\mathbf{V}_{test}^{CCA,X_j} = \mathbf{X}_{test}^{stan,j} \mathbf{A}_{train}^{CCA,j}$. The first components from $\mathbf{V}_{test}^{CCA,X_j}$ for each modality were then (concatenated and) used to make predictions for the cases in the test set.
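A sketch of the per-modality projection followed by case-wise concatenation, under the assumption that each modality contributes its first 𝑆 score vectors (multimodal_test_scores is an illustrative name):

```python
def multimodal_test_scores(X_test_blocks, A_train_blocks, S):
    """For each modality j, project the standardized test block onto its
    training-set canonical loadings (one weight row per feature), keep the
    first S scores, then concatenate the scores across modalities per case."""
    n = len(X_test_blocks[0])
    combined = [[] for _ in range(n)]
    for X_block, A_block in zip(X_test_blocks, A_train_blocks):
        for i, row in enumerate(X_block):
            combined[i].extend(
                sum(x * A_block[j][s] for j, x in enumerate(row))
                for s in range(S))
    return combined
```

The concatenated scores then serve as the (low-dimensional) input features for the SVM, exactly as in the single-modality case.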

2.7 Computing classification accuracy (Step 6)

After computing the component scores for the test set (Step 4), predictions for the labels (i.e., AD or HC) of the cases in the test set were made (Step 5), using these test set component scores and the parameters of the SVM model that was fitted to the training set (Step 3). This procedure for estimating the group labels is the same for all five multimodal feature extraction approaches and all values of 𝑆 considered. Next, for the cases in the test set, the predicted labels are compared to the observed labels and a measure of classification accuracy is computed (Step 6). In this study, classification accuracy is assessed by constructing a Receiver Operating Characteristic (ROC) curve. A ROC curve shows how the performance of the SVM classifier changes when its discrimination threshold for allocating cases to groups varies. It is established by plotting the true positive rate (also called sensitivity) against the false positive rate (i.e., 1 minus the true negative rate, or 1 minus the specificity) for different discrimination thresholds. The true positive rate represents the proportion of actual positives that are correctly classified (e.g., the percentage of AD patients correctly classified as having AD) and the true negative rate represents the proportion of actual negatives that are correctly classified (e.g., the percentage of HCs correctly classified as being healthy).

In this study, the Area Under the ROC Curve (AUC) represents the probability that a randomly chosen AD patient is assigned a greater suspicion of being diseased by the classifier than a randomly chosen healthy person. The AUC is a suitable measure of classification accuracy because it accounts for differences in group sizes, which occur in the current study (i.e., 100 AD patients versus 150 HCs). When data are (severely) unbalanced, a classifier that assigns all cases to the majority class, which is a rather trivial classifier, will have a percentage agreement (substantially) larger than 50% (i.e., 60% in this study). By using AUC as a measure of classification accuracy, this unbalanced distribution of the cases across groups is accounted for, because ROC curves consider all possible thresholds (James et al., 2013).
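The probabilistic interpretation above suggests a direct way to compute the empirical AUC without tracing the full curve: compare every AD/HC score pair and count the fraction of pairs ranked correctly (equivalent to the Mann-Whitney U statistic; auc_score is a hypothetical helper, not the function from the AUC package, which derives the same quantity from the ROC curve):

```python
def auc_score(scores_pos, scores_neg):
    """Empirical AUC: probability that a randomly drawn positive case
    (e.g., AD) receives a higher classifier score than a randomly drawn
    negative case (e.g., HC), counting ties as one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Note that the trivial majority-class classifier, which scores every case identically, ties on every pair and therefore obtains an AUC of exactly 0.5, even though its percentage agreement would be 60% here.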

In this study, AUC estimates were obtained from the ROC curve by using the auc function from the AUC package (Ballings & van den Poel, 2013) in R. Note that, due to the 100 random splits of the data into a training and a test set, for each combination of a multimodal feature extraction approach and a value of the number of components 𝑆, 100 estimates (i.e., AUC values) of the classification accuracy were obtained. These 100 estimates were averaged (across the random splits) to obtain a single (stable) classification accuracy estimate, and the variance in these estimates was used to quantify the variability (i.e., standard error) of the classification performance.
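Aggregating the split-wise estimates can be sketched as follows (summarize_auc is a hypothetical helper; it reports the mean AUC and a standard error derived from the sample variance of the 100 estimates, which is one way to read the description above):

```python
def summarize_auc(auc_values):
    """Average AUC over the random train/test splits, plus the standard
    error of that mean (sample SD divided by sqrt of the number of splits)."""
    n = len(auc_values)
    mean = sum(auc_values) / n
    var = sum((a - mean) ** 2 for a in auc_values) / (n - 1)
    se = (var / n) ** 0.5
    return mean, se
```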

