Graph Convolutional Network-based model for patient classification using Neuroimaging

(1)

MSc Artificial Intelligence

Master Thesis

Graph Convolutional Network-based

model for patient classification using

Neuroimaging

by

Richard Olij

10833730

August 27, 2019

36 EC January 2019 - August 2019

Supervisors:

Devanshu Arya

Rajat Mani Thomas

Assessor & Examiner:

Marcel Worring

Amsterdam UMC

Department of Psychiatry

University of Amsterdam

Informatics Institute

(2)

Abstract

The master thesis is about classification of Autism Spectrum Disorder or Typical Developing subjects based on RS-fMRI data using a Graph Convolutional Net. The advantage is that such a net uses a graph which specifies the relations between subjects so that the model uses a weighted average of the features of those subjects to have stronger features. The dataset used is the ABIDE I-II dataset which has data for 21 patients aggregated from 38 different locations with different policies, which makes the usage of all data challenging. Current research achieves about 70% accuracy on 1) specific subsets of the available data, 2) instead of using every voxel as a feature, they divide the brain in a few important regions based on expert knowledge and 3) the relationships between patients are defined based on gender, age or the location. In this thesis a more general method is proposed combining a GCN, 3DCNN and VAE to achieve similar results on the more difficult data, without those human made decisions.

(3)

3.3.2 Variational Auto-Encoder . . . 8 4 Models 8 4.1 Baseline . . . 9 4.1.1 Functional data . . . 9 4.1.2 Structural data . . . 10 4.1.3 Method . . . 10 4.1.4 Results . . . 12 4.2 VAE Adjacency . . . 16 4.2.1 Functional data . . . 17 4.2.2 Structural data . . . 17 4.2.3 Method . . . 17 4.2.4 Results . . . 18 4.3 Combined Model . . . 18 4.3.1 Functional data . . . 21 4.3.2 Structural data . . . 21 4.3.3 Method . . . 21 4.3.4 Results . . . 21 5 Comparison 22 6 Discussion 24 7 Future work 25 8 Acknowledgement 25 Appendices 29

A ABIDE I-II Dataset 29

B Extra results baseline model 30

C Extra results VAE addition model 34

(4)

1 Introduction

Machine Learning (ML) has been a great success in processing images, video and audio with great amounts of data. However, in the field of neuroimaging it lags behind due to the lack of such vast amounts of data. Apart from this reason, the usage of ML algorithms in neuroimaging suffers from 1) the difficulty of scanning patients with brain pathologies, 2) tumors and lesions that change blood flow which is not related to neural activity, 3) usage of drugs or 4) lying subjects.

In order to train (deep) neural nets for image classification, large datasets are required. In com-puter vision, often millions of examples are used, e.g. the popular research who successfully used a Convolutional Neural Network (CNN) on the ImageNet dataset containing 15 million labeled 256 × 256 pixel images [Krizhevsky et al., 2012]. For classifying ASD patients, the most recent Autism Brain Imaging Data Exchange (ABIDE) dataset contains multi-million-voxels neuroim-ages for only 2100+ subjects (Section 2.2). The data consists of subjects with neuroim-ages ranging from 5 to 64 years, which is challenging data because the brain is developing significantly through time and thus the range of ages has a lot of variation [Morita et al., 2016]. Additionally, the data is aggregated from different sites that have different procedures. These factors makes it thus difficult to successfully apply ML approaches and, as explained in Section 2.3, most of the research only takes a small subset of the most suitable data based on human knowledge. In addition, Autism Spectrum Disorders (ASD) can be difficult to classify reliably due to rather subjective classification methods, as described in Section 2.1. This makes labels aggregated from different sources more difficult to use.

For analyzing neuroimages it has been shown that using the whole brain is often more useful than merely focusing on one region. However, using the whole brain rather than just a small portion causes dimensionality issues. Therefore, the brain is often described in terms of relationships be-tween certain regions, lowering the dimensionality, as described in Section 4.1.1. Unfortunately, this is based on human knowledge and various ways of dividing the brain in the regions result in significantly different training performances. Research tend to investigate only one of many variations of these sort of human knowledge based dimensionality reduction methods and thus the results are likely to be highly dependent on those decisions.

As described in Section 2, recent research try to improve accuracies on the ABIDE dataset using a Graph Convolutional Net (GCN). This kind of model use metadata besides the usual features. This is useful for data that is related to one another. Recent research on this topic using GCN defines these connections between subjects by metadata, such as gender and site, to determine similarity between patients. This is somewhat arbitrarily chosen and not well substantiated. Various Neuroimaging methods have different characteristics [Morita et al., 2016]; Electroen-cephalography (EEG) is a method where electrodes are placed on the scalp so that the electrical potential of each electrode can be measured. As it requires only cheap and small hardware, it measures deeper neural activity poorly. For Positron Emission Tomography (PET), a tracer substance is injected to the blood so that active brain areas, where blood is needed, can be detected by the detectors. This approach has a low resolution, and is not the safest method. Magnetic Resonance Imaging (MRI) uses a magnetic field to align spinning atomic nuclei, which disturbs the axis of rotation. It then observes the radio frequency signal generated as the nuclei return to their baseline status. The downside is that the patients have to lie still in order to have good images. The advantage is that it is relatively safe and can be used for all ages and

(5)

it has a high resolution of 4 × 4 × 4 mm voxels. A sequence of images can be taken as well to observe the functionality of the brain, i.e. 4-D functional Magnetic Resonance Imaging (fMRI). For this thesis fMRI data is used. More specifically, the Resting State fMRI (RS-fMRI) is used to see what the activity is when the patient is not performing a specific task.

In order to image either the structure or functionality of the brain, various Neuroimaging (or brain imaging) techniques are used. Structural imaging techniques capture the structure of the nervous system. Among others, it can show tumors, or loss of tissue that can indicate e.g. Alzheimer’s decease. Functional imaging techniques are used for diagnosing metabolic diseases, such as Autism. Therefore, ASD related research merely focuses on the Functional data as there should be no differences in brain structure for ASD subjects.

The contributions of this thesis are 1) providing a more flexible model that can use all the complex data instead of just less complex subsets, 2) comparing more variations of the hand based dimensionality reduction methods, 3) substituting these methods by deep learning based methods, 4) replacing the arbitrarily chosen connections between patients by using structural brain similarities and 5) instead of merely using the functional images, the unused structural images are used as well to create a model that only uses fMRI data and no additional information. Even with these drastic changes and increased complexity of the data, most of the steps improve the current state-of-the-art GCN methods for the ABIDE dataset. Code is available at https: //github.com/RichardOlij/AMC-UvA-GCN-Classification-Neuroimaging.

2 Related work

This section describes the developmental disorders that will be classified, the dataset that will be used and previous research that had similar objectives as this thesis.

2.1 Autism spectrum disorders

ASD is a range of developmental disorders in which social communication and interaction skills are underdeveloped and where the patient prefers repetitive patterns, due to poor adaptive func-tioning. Among others, it includes Asperger syndrome and Autism disorder. For approximately 1% to 2% of the population, some form of ASD is recognized [Baird et al., 2006, Gazzar et al., 2019, Harville et al., 2019, Hyde et al., 2019, Varma et al., 2019]. The numbers vary significantly since there is no quantifiable measurement for ASD and instead the levels of autism are based on questions about daily habits and aspects of their social life.

With the availability of more data, the interest in ML approaches for this field of study is growing. Various ML approaches such as Support Vector Machine (SVM) [Li et al., 2018, Bi et al., 2018, Jiao et al., 2010] which basically fits a hyperplane that maximises the margin between the classes, Decision Tree (DT) [Jiao et al., 2010] which yields more transparency due to clearly defined decisions, Random Forest (RF) [Li et al., 2018] which tend to perform better due to its ensemble nature at the cost of explainability and various (Deep) Neural Net (NN) approaches [Li et al., 2018, Jiao et al., 2010] that map the input to the output with non-linearity layers in between, have been introduced [Hyde et al., 2019]. All these methods have in common that they merely look at the given data in form of a feature vector. We believe that using extra (possibly meta) data that connects multiple subjects to one another can improve predictions, hence the useage of GCNs.

(6)

2.2 ABIDE datasets

The most recent Autism Brain Imaging Data Exchange (ABIDE) dataset contains both the ABIDE I and II datasets1. It contains structural T1-weighted and functional RS-fMRI brain images. The set is aggregated across 38 sites2 and consists of 2169 subjects of which are 1028 ASD diagnosed and 1141 typically developing (TD) as tabulated in Appendix A. The female to male ratio is 20/80, age ranges from 5 to 64 with the median of 13 ASD diagnosed subjects are 61% autistic, 25% Asperger and 14% PDD-NOS and the RS-fMRI image acquisition times range from 2 to 10 minutes [Gazzar et al., 2019]. For the experiments in this thesis, 47 out of the 2169 subjects were discarded due to poor data quality and thus 2122 subjects remain.

2.3 Previous research

Publicly available data becomes more available, but it is often obtained from different sites us-ing different scanner types and acquisition protocols [Abraham et al., 2017, Moradi et al., 2017]. Therefore, [Moradi et al., 2017] performs various experiments on a subset of the ABIDE-I dataset of only four sites (NYU, PITT, TRINITY, USM) which results in 156 subjects. [Abraham et al., 2017] reached 66.8% accuracy on a subset (871 out of 1112 subjects) of ABIDE-I dataset. The data was inspected by experts for incomplete brain coverage, movements, ghosting and other scanner artifacts. This subset of 871 subject will be referred to as Expert Knowledge Subset (EKS) data throughout this thesis. (This subset of the dataset can be found in Appendix A.) They used Dictionary Learning, which essentially is a simplified Auto-Encoder where the hidden layer is made sparse by using a L1 regularisation. The decoder part gets the sparse input and will only sum ’a few’ features in order to reconstruct, hence the analogy with a dictionary. [Parisot et al., 2018] reached 70.4% with the same subset of 871 subjects of the ABIDE-I dataset using a GCN. A GCN (as explained in more detail in Section 3.2) uses features of similar datapoints as an extension of its own feature vector. Among other measures, they used the gender and site information of subjects as the similarity definition.

The major drawback of these models is that they are based on a subset of the data which is based on expert knowledge, resulting in a model that is not generalizable for real world applications. The highest performing research [Parisot et al., 2018] will be the baseline for this thesis. They perform dimensionality reduction based on human knowledge and only mention a single version of that method even though there are many to choose from. Additionally they define similarity between patients, which is needed for the GCN approach, by the arbitrarily chosen metadata; age and site. We are tackling all of these shortcomings of the baseline model by comparing 9 dimensionality reduction approaches (referred to as atlasses later on in the thesis) and use (deep) neural network architectures to create generalizable features.

3 Approach overview

In this thesis various machine learning approaches will be used to takle the shortcomings of the baseline GCN model. This a network that uses relations between datasamples and can apply operations on such information. To lower the dimensionality of the data a Ridge classifier is used. To improve this method of lowering the dimensionality, a VAE is used to represent the high dimensional data in a lower dimension resulting in non-linear combinations of the features.

1_{Downloadable from http://fcon_1000.projects.nitrc.org/indi/abide/}

2_{Note that different research use different definition of sites. We use all differently defined sites from both}

(7)

Finally to reduce the dimensionality of the RS-fMRI data without using the atlasses, a 3D-CNN is used. These key-methods are briefly explained in the following sections.

3.1 3D Convolutional Network

A Fully Connected (FC) layer is a common building block, of which the output is the dot-product of the input vector and the weight matrix, the optional addition of the bias term, followed by some non-linearity. Every node influences every other node in the next layer since they are all connected, hence there is no information isolation. Since every node is connected to every node of the next layer, the complexity tends to get high quickly. For regular grid data, Convolutions that have multiple, but smaller kernels which are applied to multiple positions in the input, can be used. It only uses local information and ignores the distant features. This increases infor-mation isolation and every kernel can represent different features whereas a single FC layer can essentially do one. Since natural images tend to have high local covariance that diminishes with distance, Convolutional Neural Nets (CNN) are popular with image processing [Battaglia et al., 2018]. Applying convolutions on grid-data, such as images, video, speech and audio has been researched extensively the last decade. AlexNet [Krizhevsky et al., 2012] performed unseeingly good, as opposed to the state-of-the-art of that time, which gave new life to the already long existing CNN technique from [LeCun et al., 1990]. Through its convolutions it allows to extract highly meaningful statistical patterns in large-scale and high-dimensional datasets while sup-pressing irrelevant features. The improvement is that instead of handmade features, the model learns features using a general learning procedure [LeCun et al., 2015].

The RS-fMRI data is 3-D data and therefore the 2D CNNs as described above should be extend into the third dimension. This is rather simply done by transforming the 2D convolutions to 3D ones by simply making the filters 3D instead of 2D. This results in a model that performs the same, but in just an additional dimension. This is a popular approach for object recognition within videos, where the third dimension is the time dimension [Ji et al., 2012].

3.2 Graph Convolutional Neural Network

(3D) Convolution layers pays attention to local information in features (e.g. fMRI images) of a single subject. A Graph Convolutional layer pays attention to local information between subjects (e.g. same gender), and applies the same filter parameters over all locations in the graph. The Graph CNN (GCN) takes the features as nodes and the meta information from the population as edges [Parisot et al., 2018]. This is important because many entities in the world do not have a natural order even though they are related to one another, they are permutation invari-ant. Whereas for Natural Language Processing (NLP) the order of words within a sentence is essential to the meaning of the sentence, the order of a (weighted) summation does not matter even though every element is related [Battaglia et al., 2018]. This is used in a wide range of applications from user-product interaction [Berg et al., 2017], social networks [Arya and Wor-ring, 2018, Feng et al., 2019] to chemistry [You et al., 2018]. Even though CNNs have usually a deep architecture to perform better, GCNs at this point seems to work only in shallow architec-tures of mostly one or two layers [Parisot et al., 2018, LeCun et al., 2015, Kipf and Welling, 2016]. A basic GCN-layer has the form G = (V, E ), where the Graph consists of vertices and edges. The vertices are the feature vectors of the data X with the shape N × D for N datapoints with dimensionality D. The edges are relations the datasamples have with each other. It is represented

(8)

as (some form of) an N ×N adjacency matrix A (Adjacency matrices will be explained in Section 4.1.2). The output Z of each layer is N ×F dimensional where F is the amount of output features. Every layer can now be described as

H(l+1)= f (H(l), A)

with H(0)= X and H(L)= Z for L layers. In this thesis, there are N subjects with D dimensional (brain image) data. There are two layers L = 2, the hidden dimension dim(H(1)) = 16 and the final output predicts either ASD or TD.

This function

f (H(l), A) = σ(AH(l)W(l))

with non-linearity σ and the adjacency A multiplied with the current layers l input H(l)_{and its}

weights W(l)_{. The models in this thesis use ReLU as non-linearity σ.}

Depending on how the adjacency matrix A is defined, it is important to keep self-loop i.e. the diagonal with ones so that it otherwise only gets information of neighbors/similar samples, instead of itself as well. Matrix A should be normalized too, so that the scale of the outputs stay within optimal range of the activation function σ to prevent vanishing gradients.

[Kipf and Welling, 2016] proposes symmetric normalisation D−12AD−12 with diagonal node degree

matrix D. This results in

f (H(l), A) = σ( ˆD12_{A ˆ}ˆ_D12_H(l)_W(l)₎

with ˆA = A + I for self-looping and ˆD the diagonal node degree matrix of ˆA. Note that the exact model for this thesis is based on the work of [Parisot et al., 2018] where the adjacency A consists of three K = 3 chebyshev polynomial versions of the ’raw’ adjacency matrix and that therefore the above describe function f is applied three times with only a different A(k). This is then summed over the polynomial dimension, followed by the non-linerity σ. In the final layer, softmax is applied as well to get the predictions for both classes.

Even though the model is very basic Neural Networks, besides the essentially weighted sum of similar datapoints, training is different. First of all, the model cannot be trained in batches using the basic approach due to the dependencies with possibly any neighbor. Recent work showed ways to do batched training [Hamilton et al., 2017, Chen et al., 2018, 2017], which instead of taking all neighbours into account, only takes a fixed subset per sample or a subsample from all edges. Secondly, the labeled set, unlabeled set and if available the labeled set should be available and even be propagated forward through the model in order to determine the actual predictions of the labeled set samples, due to the adjacency connections. The unlabeled set samples should not be trained on and therefore instead of summing all individual losses before backpropagation, the losses are masked so that only labeled set samples are summed. This way, even though all data is used to determine losses, only the labeled set sample losses are used for backpropagation.

3.3 Feature reduction

The features for each subject are based on the corresponding RS-fMRI. Since the dimensionality of the raw data is 45 × 54 × 45 × n, where n is the varying number of images per session, which is typically in hundreds. Given that the brain is a spherical shape in this box the data contains many zeros and therefore the dimensionality of the features is in the order of millions per subject [Gazzar et al., 2019]. This high and varying dimensionality, is unsuitable for ML applications with so few subjects in the dataset. Therefore some dimensionality reduction methods are applied.

(9)

3.3.1 Ridge dimensionality reduction

[Parisot et al., 2018] reduced the dimensionality of the features using a Ridge regression classifier followed by Recursive Feature Elimination (RFE). It is assumed that the data is multicollinear, i.e. it has highly correlated features. The Ridge regression, which is similar to Ordinary Least Squares (OLS) with a regularisation parameter, will try to find a general minimum, whereafter the RFE will eliminate a fixed amount of features, given those predictions of the Ridge classifier. This is repeated until the desired amount of remaining features are left.

3.3.2 Variational Auto-Encoder

Besides the Ridge dimensionality reduction method, another method is introduced in this thesis as well. This method will not eliminate features, but instead create new features that are describ-ing the original data as good as possible. Since there are no labels for such a new representation, it should be done unsupervised. In order to reduce dimensionality in an unsupervised manner, an Auto-Encoder (AE) can be used. Since a vanilla AE merely maps the high dimensional input to a lower one, it does not generalize well and the representation does not have a meaning in the sense that inputs that are similar remain close to each other in the new representation. A Variational AE (VAE) penalizes representation that are far away from the center, forcing the representations to be close to each other based on a probability distribution. This way similar inputs should be closer to each other than dissimilar ones and thus the outputs of similar inputs remain similar. This is important since these representations are used as input for the learning models and thus should have relative meaning. Since it is from a distribution, it can be used as a generative model as well. However, in this thesis the VAE is merely used as a meaningful dimensionality reduction tool.

4 Models

This thesis describes three main experimental setups; 1) a baseline GCN model, 2) automated ad-jacency definition and 3) an automated RS-fMRI dimensionality reduction method. An overview of these three architectures are displayed in Figure 1. Note that the first two models preprocess the functional data to reduce dimensionality before using it in the graph denote by the blue dotted line, whereas the third model uses differently processed functional data, denoted by the red dotted line.

To keep the thesis organized, each models’ section is divided into four subsections 1) Functional data includes information about the procedure of obtaining RS-fMRI data to usable features 2) Structural data includes structural similarities methods, including metadata that is used for the baseline model, 3) the method which explains the way data has been used in the model in order to train, predict and evaluate and 4) the results where results are displayed and the conclusions about them and the model are made. After each model, the best configurations of that model are used as a starting point for the preceding model.

(10)

(a) Illustration baseline architecture. RS-fMRI data as well as the meta data specify the partially labeled graph. The 2 layer GCN predicts the remaining labels.

(b) Illustration VAE adjacency architecture. Met data is replaced by the T1-weighted structural data of which the dimensionality is reduced by the VAE.

(c) Illustration Combined architecture. Instead of lowering dimensionality of the RS-fMRI data by handcrafted ROIs, the 3DCNN lowers its dimensionality.

Figure 1: This figure demonstrates all three architectures that are used throughout this thesis.

4.1 Baseline

This baseline model is based on the [Parisot et al., 2018] model which is a GCN with an adjacency matrix based on site and gender. The architecture is illustrated in Figure 1a3.

4.1.1 Functional data

The data of the subjects consists of the RS-fMRI and meta data such as gender, site and age. As explained in Section 3.3, the dimensionality of the RS-fMRI data is too large to be handled in these experiments with so few subjects. The next sections explain how the dimensionalities are reduced, how the adjacency matrix has been defined and how the data has been split in order to train the structural part.

The dimensionality of the RS-fMRI data is in the order of millions, and varies per subject, as explained in Section 3.3. Using this to determine the connectome, that is the functional con-nectivity (wiring) of the brain, will be too complex to compute. The first step is to reduce the dimensionality of single MRI images. Instead of using the 45*54*45 voxels, the brain is divided into Regions of Interests (ROIs), referred to as an Atlas [Abraham et al., 2017]. Between 7 and 400 brain regions are manually defined and the activity of such a region is represented by the average of all its voxels. The specification of those regions are referred to as various

(11)

atlasses, of which there are 9 used in this thesis’s experiments. The available and compared at-lasses are AAL, cc 200, craddock 200, HO cort maxprob thr25-2mm, JAMA IC7, JAMA IC19, JAMA IC52, schaefer 100 and schaefer 400. For all atlasses, the same specifications, that are out of the scope of this thesis, are used. The tenth HO atlas did not have data for these specific settings and was therefore ignored. We have not got the exact same atlas as [Parisot et al., 2018] had in their experiments. They used some other variation on HO, but the exact settings are unknown to us and neither of our atlasses correspond dimensionality wise either.

Secondly, the GCN has no notion of time and thus the varying n amount timesteps cannot be used optimally. Therefore each sample has to be transformed into the fixed shape with no explicit time dimension. To do so, the activity of such regions will be measured over time, which results in a (1d) activity vector for each region. The correlation of all combinations of these regions (of that subject) are used to determine what regions have similar patterns. This results in a 2-D m2 _{dimensional symmetric similarity matrix of corresponding to these 7 to 400 brain regions.}

Since only one half is needed and the diagonal is all 1, this matrix will have (m2_{− m)/2 useful}

values where m is the amount of brain regions. This will be flattend to a single dimensional feature vector that is the input for the GCN. The dimensionality ranges from 21 to almost 80000 features for these different atlasses. Most atlasses still gives too high dimensionalities to train properly with only 2122 subjects (or even 871 subjects as other research does). Additionally, to calculate similarity as the extra sparsity property as explained in Section 4.1.2, a relatively low dimensionality is required to give meaningful results. Therefore, dimensionalities above 2000 will be reduced to 2000 using the Ridge dimensionality reduction approach as explained in Section 3.3.1.

4.1.2 Structural data

The strength of a GCN is to exploit relationships between the samples, i.e. edges and nodes re-spectively, in the graph analogy. These relationship graphs are represented by (sparse) adjacency matrices. An adjacency matrix can be defined based on the similar meta data such as gender, site or age. Creating edges based on site or gender is simply done by noting 1 when samples have the exact same site or gender and 0 otherwise. For age, there must be a limit or measure of how similar various age differences are. In our experiment this relation is set to 1 if the age difference is below a threshold and 0 otherwise. In addition, combinations of those relations can be used, note that the adjacency matrix then yields values greater than 1.

This ’raw’ adjacency matrix will be multiplied with a correlation matrix which is based on the correlation of the features to ’add some extra sparsity’. The adjacency matrix is normalized and the third order chebyshev polynomials of the adjacency matrix is determined and used in the GCN as adjacency matrices. This results in three + one (polynomials plus identity) amount of adjacency matrices. For each of these, training parameters are learned by the model, and the results of each of the four will be summed to a single matrix as already noted in Section 3.2.

4.1.3 Method

In the experiments, transductive learning is used. This means that there is a labeled and un-labeled set of the data which are both used to improve its performance. The eventual results used in this thesis are evaluated on the predictions of the unlabeled data with the actual

(12)

la-bels of those ’unlabeled’ data4_{. In order to use the same labeled and unlabeled sets for all}

experiments and all models, the subject-ids are fixed. Random stratified k-fold splits are used which splits the data in such a way that labels are balanced. This results in better learning due to less biased samples. To get representative results, k-fold is applied with k = 10. Given the 2122 samples, each unlabeled set and train set are roughly 200 and 1900 samples respectively. The code of the original [Parisot et al., 2018] experiments that was provided on github5was used as is. As mentioned in Section 4.1.1, the atlasses are different and therefore the results could not be exactly the same. As expected, results are similar to the original results provided in their paper. However, changing the way of selecting data, as explained in Section 4.1.3, did decrease the performance significantly as will be discussed in Section 4.1.4.

In addition to the provided TensorFlow code, a new PyTorch implementation was created for this thesis that has a clearer structure and that trains quicker. The PyTorch implementation will be used as results throughout this thesis and the TensorFlow implementation results will be reported in the appendices.

For the 9 atlasses, experiments were run for the NYU-only, EKS-only and ABIDE I-II combined datasets. As explained in Section 2, the data obtained from different sites can be a disadvantage due to the differences between sites. Therefore having more data that is obtained from more different sites, does not improve the model per se and therefore research tend to take a subset of the ABIDE dataset. Therefore, the hypothesis is that the largest single ABIDE-I site, NYU, performs better than ABIDE I(-II) dataset, though the results are not comparable as explained below. The EKS-only is motivated from [Abraham et al., 2017, Parisot et al., 2018], and all data from ABIDE I-II will be used to train the model too.

It must be noted that since the datasets are different, the evaluations are performed on other datasets as well, i.e. NYU-only dataset will only be trained and evaluated on NYU-only data, accordingly for the other two datasets. Therefore the results cannot be directly compared be-tween the datasets. The measurements for evaluating the models are Sensitivity, Specifity and Balanced Accuracy (average of sensitivity and specifity, it will be referred to as just Accuracy) and Area Under Curve (AUC) based on the Receiver Operating Characteristic (ROC). The lat-ter combination displays the relations between sensitivity and specifity for various thresholds insightfully. The main measure of comparison will AUC and accuracy secondarily. Sensitivity and specifity results can therefore be found in the according appendices.

The first experiment 1.a will be comparing the atlasses, implementations and datasets. After the best configurations are found, these will be compared on the adjacency polynomial order; this is 3 on default, and will be compared from 0 (identity, i.e. no relations between subjects) up to 4 in experiment 1.b. Finally, the best configuration will be used for the adjacency comparison where the default of gender and site, will be compared with only gender, site or age and combi-nations of those in experiment 1.c. As mentioned in Section 4.1.2, gender and site comparison is straightforward; only connect when they are the same. In the case of age this is done with a threshold; when the age difference is within a certain amount, the two subjects are connected. This will be tested with maximum age differences of 3 and 6 years, and the combination of no difference and 3 or 6 years combined. The latter definition adds a connection if the difference is within 3 or 6 years of difference, and doubles this when the age is exactly the same. In addition, 4_{Note that the data is unlabeled for the algorithm, but the labels of all data do exist in the complete dataset.} 5_{Code available at https://github.com/parisots/population-gcn.}

(13)

NYU EKS ABIDE I-II

Atlas Acc AUC Acc AUC Acc AUC

AAL 0.627 0.668 0.535 0.569 0.604 0.646 cc 200 0.604 0.643 0.598 0.638 0.605 0.661 craddock 200 0.597 0.643 0.572 0.617 0.586 0.628 HO ...-2mm 0.517 0.551 0.535 0.552 0.587 0.623 JAMA IC7 0.536 0.568 0.538 0.574 0.600 0.632 JAMA IC19 0.560 0.592 0.577 0.613 0.594 0.657 JAMA IC52 0.573 0.632 0.559 0.619 0.610 0.656 schaefer 100 0.618 0.647 0.548 0.568 0.608 0.645 schaefer 400 0.607 0.641 0.575 0.631 0.627 0.665

Table 1: Comparing the 9 atlasses for both the NYU-only, EKS-only and ABIDE I-II datasets with a maximum of 2000 features and adjacency based on gender and site with the extra sparsity feature applied. These results are based on the PyTorch implementation. The full tables can be found in Appendix B

the correlation only will be used as well. As mentioned, the correlation is supposed to make the adjacency matrix sparser, however, it might work on its own as well.

4.1.4 Results

For experiment 1.a Table 1 demonstrates the 9 different atlasses for each of the three datasets and both implementations of the model. One observation is that NYU-only and the ABIDE I-II dataset predictions perform similarly as good, whereas the EKS-only datasets performs signifi-cantly worse. Note that results for EKS using the original (TensorFlow) code and data splitting method, had about 70% accuracy, just as [Parisot et al., 2018]. Using the new splitting method, the best results are about 59% on the AAL, JAMA IC7 and scheafer 400 atlasses and about 63% AUC on JAMA IC19 and scheafer 400 (Can be found in Appendix B). Even though the results cannot be compared directly across different datasets, the results compared in this thesis will merely be for the NYU-only and ABIDE I-II datasets. There is no (obvious) correlation of better performing implementation, or an overall better atlas. The NYU-only dataset performs best for the schaefer 400 atlas using PyTorch and similarly as good on the AAL atlas using Ten-sorFlow implementation. Whereas the ABIDE I-II performs similarly as good for schaefer 400 and cc 200 atlasses for the PyTorch implementation and Jama IC19 for the TensorFlow imple-mentation. The AAL and schaefer 400 atlasses had 2000 features after dimensionality reduction was applied. The JAMA IC19 had 190 features, which were all used.

Since the accuracy of JAMA IC19 was not similarly as good as for the schaefer 400 and AAL atlasses, and due to the lower dimensionality of the JAMA IC19 where no dimensionality re-duction was performed on (which will be an important part of the next experiment), only the schaefer 400 and AAL atlasses will be used for further experiments. Therefore, only the PyTorch implementations will be shown in the main results whereas the TensorFlow implementation re-sults will be shown in the appendices.

For experiment 1.b Table 2 demonstrates various polynomial orders for the best models. It demonstrates that the NYU-only dataset performs better than the ABIDE I-II dataset and that for both the PyTorch implementations the default polynomial order of three is not optimal. A lower polynomial order give higher performances for both datasets. Both the NYU-only and

(14)

NYU ABIDE I-II

Order Acc AUC Acc AUC

0 0.641 0.678 0.623 0.665

1 0.591 0.681 0.619 0.668

2 0.617 0.644 0.617 0.667

3 0.627 0.668 0.627 0.665

4 0.644 0.659 0.618 0.649

Table 2: Comparing polynomial orders for the best performing configurations for NYU-only AAL data and ABIDE I-II schaefer 400 data. These results are based on the PyTorch implementation. The full table can be found in Appendix B.

ABIDE I-II datasets should be investigated further, the settings for the next experiments will be NYU-only dataset with AAL atlas and ABIDE I-II dataset with schaefer 400 dataset, both with first order polynomial.

For experiment 1.c Figures 2 and 3 show the heatmaps and corresponding histograms for the ABIDE I-II dataset on all data (including the unlabeled set) for various adjacency definitions. Note that the baseline adjacency matrices are multiplied with the correlation matrix, hence the decimal values and similar distributions. In Section 4.1.2, it was mentioned that according to [Parisot et al., 2018], the correlation matrix causes more sparsity, however, as demonstrated in Figure 2a, this does not cause real ’zeros’ sparsity, but merely for lower values. This is not a surprise since correlations hardly ever get completely zero. The figure also shows patterns that are similar to the site heatmap, demonstrating that the various sites do deliver different data as discussed in Section 2. Contrarily, the gender and age (difference ≤ 3) matrices are very dissim-ilar and do not seem to correspond to each other, i.e. all these definitions do give significantly different adjacency matrices and it is worth investigating them. Figure 3 demonstrates that those adjacency definitions are sparse as they bars corresponding to values between 0 and 0.1 tend to be a significant portion of all the values within the matrices. The gap between this bar and the following bars is because non-zero values have the distribution with the mean of about 0.6 due to the correlation matrix and the bar of 0 to 0.1 are all the zeros in the matrix. I.e. there exist no non-zero values lower than 0.2, hence the gap.

Table 3 demonstrates the performances of the best scoring models of the previous experiment for different definitions of the adjacency matrices. Note that for the NYU-only dataset, there is a site adjacency definition which is unintuitive since it only has one site. However, it does outperform other adjacency definitions and thus, the extra self-looping seems to be important. This is also backed up by the best performing zeroth order polynomial from the previous experiment, which is self looping only. Both the NYU, AAL and the ABIDE I-II schaefer 400 results can be slightly improved with another adjacency matrix. Using the unintuitive site only adjacency matrix for NYU-only, has a insignificantly higher AUC, but a significant higher accuracy. For the ABIDE I-II dataset, either the correlation only or age results in the highest AUCs whereas the age and site result in the highest accuracy. The best adjacency matrix definitions for this dataset seems to be either just the correlation of the features, or age with a maximum difference of 3 years with an optional additional boost for the exact same year. However, since they perform similarly and for simplicity of the next experiments, the best settings are chosen to be Age with a maximum difference of 3 years only.

(15)

(a) Correlation (b) Site

(c) Gender (d) Age ≤ 3

Figure 2: Different adjacency matrices. These are done on all data and since that includes all data from all folds and datasets, these matrices are not directly used for training. Correlation are on the craddock 200 atlas reduced to 2000 features. Note that the correlation is multiplied to the other three matrices as well.

(16)

(a) Correlation (b) Age ≤ 3

(c) Site (d) Gender

Figure 3: The corresponding histograms of the matrices in Figure 2. Note that the 0 to 0.1 bar is all the zero data and there exist no non-zero data lower than 0.2. The distribution of the non-zero data is similar to the correlation due to the multiplication with it.

NYU ABIDE I-II

Adjacency definition Acc AUC Acc AUC

Correlation 0.634 0.663 0.622 0.680 Gender 0.622 0.674 0.617 0.671 Site 0.646 0.684 0.621 0.675 Age (≤ 3 diff) 0.593 0.661 0.623 0.679 Age (≤ 6 diff) 0.596 0.647 0.624 0.674 Age (≤ 0-3 diff) 0.603 0.648 0.627 0.678 Age (≤ 0-6 diff) 0.596 0.643 0.611 0.661

Gender, Site (Default) 0.591 0.681 0.619 0.668

Gender, Age (≤ 3 diff) 0.622 0.660 0.617 0.666

Site, Age (≤ 3 diff) 0.575 0.650 0.630 0.669

Gender, Site, Age (≤ 3 diff) 0.613 0.658 0.617 0.667

Table 3: Comparing performances of various adjacency matrix definitions for the NYU-only data AAL data and ABIDE I-II schaefer 400 data, both with only 1 order polynomial. The full tables can be found in Appendix B.

(17)

(a) NYU, train set (b) NYU, unlabeled set

(c) ABIDE I-II, train set (d) ABIDE I-II, unlabeled set

Figure 4: These figures display the AUC curves of the best performing models for the NYU-only and ABIDE I-II datasets and both the train and unlabeled sets.

Figure 4 demonstrate the AUC curves of the best performing model for both datasets. It shows that it overfits heavily for both the NYU-only and the ABIDE I-II datasets. Whereas the varia-tion between folds for the ABIDE I-II data is rather small, for the NYU-only data the variavaria-tion is increased. This makes sense since the dataset itself is significantly smaller. Additionally, it demonstrates that there is a significant variation in results between the 10 different folds.

4.2 VAE Adjacency

To recall, the adjacency matrices are used so that the GCN can combine features of similar subjects to get more expressive. The current definition of similarity is based on similar age, gender or site, of which all variations result in significantly different adjacency matrices. In addition, the correlation of the (reduced) features used for more sparsity as described in Section 4.1.4, performed better as an adjacency matrix on its own, than other definitions did. Therefore similarity as basis for the adjacency matrix seems to be promising. To tackle the problem of using some arbitrarily chosen connections between subjects, the yet unused structural information of the neuroimages can be used to define similarity. Since the dimensionality of this information is too high to calculate the cosine similarity with in this work, the dimensionality will be reduced

(18)

using a VAE. The complete architecture is demonstrated in Figure 1b. 4.2.1 Functional data

The RS-fMRI data is prepared in the same way as explained in Section 4.1.1. However, it should be noted that the VAE dimensionality reduction as mentioned in Section 4.2.2, was used as functional data as well in some minor experiments not noted in this thesis. Since it is based on the structure of the brain and ASD is a functionality disorder, it did not yield a good model, as expected.

Instead of using the metadata such as age, gender and site of a subject to define its connections with, the actual similarities of patients’ brains are used. The previously unused Structural T1 weighted data cannot be used to classify ASD patients since ASD is a functional disorder, as explained in Section 1. However, it can be used to determine the similarities of the brains hard-ware between patients. This is intuitive since similar brain hardhard-ware, might have more similar patters in similar circumstances. And thus, when GCN gathers information of connected func-tional brain data, similar hardware might have similar funcfunc-tional data. As opposed to gathering functional data from subjects with similar arbitrary metadata which might cause more variation in the form of noise, the similar hardware subjects might create stronger features that are more general for corresponding hardware.

The main problem is the high dimensionality of the data. Taking some sort of similarity of such high dimensionalities does not make sense. Therefore, the dimensionality should be reduced in a way that it still contains relatable representations in which similar brains are still similar in the reduced dimensions. A VAE would be suitable for this since it, as described in Section 3.3.2, is forced to compress the new representation in a small area, instead of mapping it regardless of the related positions as a vanilla AE would do. In this work, a pretrained VAE was used to lower the dimensionality of the structural T1-weighted images6. Using this VAE, the dimensionality of the data of the subjects are reduced to 200 dimensions. The cosine similarity of all the subjects determines the new adjacency matrix. As one would notice, similarly to the correlation matrix for extra sparsity as mentioned in Section 4.1.4, the cosine similarity of the VAE data will not be a sparse matrix at all either. In the previous experiments this was not a problem since the adjacency definition had a sparse nature (e.g. same gender, site or age or not). However, when using the new adjacency definition, either with or without the extra sparsity matrix multiplied with it, will not be sparse.

Therefore an automatically determined threshold will be applied that is as high as possible for maximum sparsity, but still low enough so that every node in the graph is connected to at least one other node besides itself. This is done by setting the threshold to the minimum value of all the maximum values of each row (or column). This, instead of yet another hyperparameter, will be done completely automatically in a way that intuitively makes sense as well.

4.2.3 Method

The next experiments are performed on the same labeled and unlabeled sets as explained in Sec-tion 4.1.3 and only using best performing combinaSec-tions from the previous experiments. These 6_{The VAE was trained on the UK Biobank and provided by AMC. Information at https://imaging.ukbiobank.}

(19)

are 1) AAL atlas, NYU-only data and Site adjacency matrix and 2) schaefer 400 atlas on ABIDE I-II data using the age with a maximum of three years of difference.

Instead of the meta data, the structural data of the brain is used, reduced in dimensionality and used in the graph. To see the influence of a VAE embedding similarity instead of the baseline adjacency matrix, these two definitions will first be compared using the best performing model settings of the previous experiments. It is possible that the optimal amount of polynomials found in the previous experiments, might only work for the baseline adjacency matrix. Therefore the following experiment will compare polynomials 1 up to 4, similarly to previous experiments in this thesis. Additionally, the correlation matrix for extra sparsity might not have similar influence on the new adjacency definitions and therefore are compared as well as the automatic thresholding method.

4.2.4 Results

Table 4 demonstrates the performance comparison of all model settings for the baseline adjacency matrix, VAE based matrix for polynomial order 1 up to 4 and either with or without correlation sparsity and with or without automatic threshold. In many instances shown in the table, the VAE adjacency outperforms the baseline adjacency definition. Surprisingly, there seems to be no obvious correlation with the dataset, polynomial order, sparsity and automatic thresholding. However, the best performing method for NYU-only data on the AAL atlas is VAE without a threshold and without the sparse correlation matrix with second order polynomial. For the ABIDE I-II data on the schaefer 400 the usage of the VAE adjacency improved its performance, but many adjacency processing variations perform equally as good. Using the second order poly-nomial and the VAE without threshold and without sparse correlation matrix, it seems most similar to the NYU-only data which performs good.

Figures 5 and 6 demonstrate the VAE based adjacency matrices both unthresholded and auto-matically thresholded. At first sight the unthresholded adjacency matrix might look similar to the adjacency matrix based on correlation only as in Figure 2a. However, this is not the case; the darker vertical and horizontal lines are at different positions indicating differently correlated subjects. They both are comparably ’sparse’ in terms of similar amount of values between 0 and 0.5 as seen in the histograms in Figure 6a and 3a. Values from the VAE adjacency lie in a narrower distribution than for all the previous adjacency methods as can be seen in Figures 5 and 2. Using the automatically determined threshold, the VAE adjacency matrix really gets sparse, more so than in the cases of the sparse adjacency matrices based on age, gender or sites, which had a sparse nature already. This indicates that the new similarity is significantly different than the previous one and that the automatic thresholding method does a good job making the matrix actually sparse.

4.3 Combined Model

The final major drawback of the GCN architectures proposed by other research is the way of defining ROIs based on expert knowledge. Figure 1c Illustrates the final architecuture where these atlasses are substituted by the 3DCNN. Due to time constrains, no 3DCNN models will be extensively tested besides some partially trained models I have done before picking the current 3DCNN architecture. So, using that single architecture, the different summarisation methods will be compared for both the NYU-only and ABIDE I-II data on the 3DCNN with linear layer results, and both GCN implementation using 3DCNN data without the final linear layer.

(20)

NYU ABIDE I-II

Order Sparsity Threshold Acc AUC Acc AUC

1 Yes No 0.619 0.690 0.620 0.681 1 No No 0.619 0.666 0.626 0.679 1 Yes Yes 0.641 0.679 0.626 0.678 1 No Yes 0.623 0.654 0.626 0.682 2 Yes No 0.595 0.651 0.631 0.681 2 No No 0.582 0.629 0.630 0.681 2 Yes Yes 0.624 0.670 0.629 0.681 2 No Yes 0.611 0.625 0.632 0.677 3 Yes No 0.616 0.655 0.604 0.651 3 No No 0.654 0.689 0.601 0.654 3 Yes Yes 0.627 0.651 0.640 0.675 3 No Yes 0.647 0.655 0.612 0.659 4 Yes No 0.610 0.656 0.596 0.623 4 No No 0.599 0.638 0.593 0.628 4 Yes Yes 0.643 0.674 0.638 0.681 4 No Yes 0.644 0.676 0.620 0.671

Table 4: Comparing the polynomial order, additional sparisty and automatic threshold for both atlasses and for both the NYU-only and ABIDE I-II datasets. The full tables can be found in Appendix C.

(a) No threshold (b) Auto threshold

(21)

(a) No threshold (b) Auto threshold

Figure 6: The corresponding histograms of the matrices in Figure 5.

Figure 7: Illustration 3DCNN. Given the 45 × 54 × 45 dimensional summarisation, two layers of convolutions and max poolings are applied. The remaining 16 8 × 10 × 8 dimensional filters are flattend to yield a single 10240 dimensional feature vector which is trained using a single fully connected layer with dropout of 70%.

(22)

4.3.1 Functional data

Previously, the vector features are created by taking the correlation of the time dimension of the expert knowledge based ROIs as described in Section 4.1.1. That will be replaced in this final model. Instead, the time dimension will be summarized using any of the following 12 (sub) methods based on the description of [Lisa Salomons, 2019]; 1) Amplitude of Low Frequency Fluc-tuations (ALFF) measures flucFluc-tuations in oxygen levels [Zou et al., 2008], 2) Fractional ALFF (FALFF) is a noise reduction method for ALFF, 3) Autocorrelation (Autocorr) is the correlation of a delayed version of itself to find repeating patterns [Olszowy et al., 2019], 4) Degree Centrality are local network connectivity measures that counts either 4.1) Binarized or 4.2) Weighted di-rect connections [Zuo et al., 2011], 5) Local Functional Connectivity Density (LFCD) is another 5.1) Binarized or 5.2) Weighted local connectivity measurement [Tomasi and Volkow, 2010], 6) Eigenvector Centrality focusses more on the 6.1) Binarized or 6.2) Weighted global network connectivity [Zuo et al., 2011], 7) Entropy measures the irregularity of the brain which should be low given that the brain is not behaving randomly, 8) Regional Homogeneity (ReHo) mea-sures similarity of near neighbours over time [Zang et al., 2004], 9) Voxel-Mirrored Homotopic Connectivity (VMHC) computes the connectivity of each voxel and its counterpart in the other hemisphere [Zuo et al., 2010].

These methods are some standard forms of averaging all the MRI images of the fMRI into a single brain image. This single summarized brain image is then fed to a basic 3D-CNN model that is illustrated by Figure 7. The input is the 45 × 54 × 45 dimensional brain image. A convolution to 6 filters using kernel size of 5 is applied where after max pooling with kernel size 2 is used to lower its dimensionality. This results in 6 filters of dimensionality 20 × 25 × 20. Another convolution to 16 filters using kernel size of 5 is applied with another max pooling with kernel size 2 resulting in 16 8 × 10 × 8 filters so that after flatting them, exactly 10240 features will be left to train. This arbitrary chosen 3DCNN, has been chosen for its simplicity after trying a few similar configurations. The model is trained separately with a single fully connected layer with 70% dropout at the end to classify the ASD and TD labels. The intuition is that the flattend 10240 dimensional featurevector now contains a decent representation of the brain image that can be classified with a linear layer, and thus should be performing better when using it as the content part of the whole architecture. The dropout has been applied since similar models tends to overfit the data significantly.

The structural data is prepared in the same way as explained in Section 4.2.2.

4.3.3 Method

For each of the 12 summarisation methods and for both the NYU-only data and ABIDE I-II data, the experiment will be done for the CNN only model, where the final layer is a fully connected one that classifies and thus can be evaluated, and for both implementations. Unfortunately due to time constraints, not all experiments for the ABIDE I-II dataset could be performed. 4.3.4 Results

Tables 5 demonstrate the results achieved for the next experiments. For the NYU-only datset, the 3DCNN with just a linear layer outperformes the default baseline model slightly, but under-performs compared to the improved models. The ALFF summarisation shows most potential.

(23)

NYU ABIDE I-II

Polynomial Acc AUC Acc AUC

ALFF 0.623 0.683 0.588 0.635

Autocorr 0.486 0.561

Degree Centrality Binarized 0.583 0.637

Degree Centrality Weighted 0.570 0.622 0.603 0.643

Eigenvector Centrality Binarized 0.553 0.593 Eigenvector Centrality Weighted 0.527 0.570

Entropy 0.472 0.534 0.510 0.512 FALFF 0.541 0.592 LFCD Binarized 0.596 0.612 LFCD Weighted 0.589 0.595 0.580 0.606 ReHo 0.582 0.652 VMHC 0.568 0.590

Table 5: Training the CNN with linear layer using the NYU-only dataset with third order polynomial and the ABIDE I-II dataset with second order polynomial. The full tables can be found in Appendix D.

Even though not all data has been gathered for the ABIDE I-II, the most intentional ALFF summarisation technique underperformes significantly to any of the previous best models. The Degree Centrality Binarized technique performs slightly better, but still is not as good as other results so far.

The PyTorch implementation scores significantly worse than just the 3DCNN on it own as dis-played in Table 6. The performance is as good as random classification and thus needs further investigation. It does not seem to decrease its loss after a single epoch and it keeps predicting most samples with the same label. However, the TensorFlow model outperforms the PyTorch model significantly on the promising ALFF using NYU dataset with an accuracy of 0.630 and AUC of 0.688 as can be seen in Appendix D. The appendix also demsonstrates that for the other configurations, including the PyTorch implementation, the Sensitivity is high and thus the model is predicting mostly just one class.

For the ABIDE I-II dataset, the TensorFlow implementation also outperforms the PyTorch im-plementation. The same appendix also demonstrates that using the Degree Centrality Binarized technique, the accuracy is 0.614 and AUC is 0.658. These results are worse than other results obtained so far, but it does show that the same techniques seem to perform better for both 3DCNN only and combined models. Note that it does show that the combined model of both 3DCNN and GCN does outperform the 3DCNN with just a linear layer. Note that these embed-dings are trained and thus optimised for a linear layer and not for the GCN. Therefore training it end-to-end might improve performance since the 3DCNN could generate other features that work better with the shared features approach of the GCN.

5 Comparison

The baseline model was tested on 9 atlasses for NYU-only, EKS-only and ABIDE I-II datasets. The EKS-only data subset was merely used to compare results with other research using the same subset. Further experiments are done using the NYU-only dataset on the AAL atlas and

(24)

NYU ABIDE I-II

Summarisation Acc AUC Acc AUC

ALFF 0.500 0.543 0.500 0.507

Autocorr 0.493 0.483

Degree Centrality Binarized 0.500 0.531

Degree Centrality Weighted 0.500 0.527 0.500 0.516 Eigenvector Centrality Binarized 0.500 0.533

Eigenvector Centrality Weighted 0.500 0.512

Entropy 0.479 0.529 0.508 0.544 FALFF 0.500 0.499 LFCD Binarized 0.500 0.518 LFCD Weighted 0.500 0.522 0.500 0.526 ReHo 0.500 0.534 VMHC 0.500 0.524

Table 6: Training the GCN with 3DCNN input using the NYU-only dataset with third order polynomial and the ABIDE I-II dataset with second order polynomial. The full results can be found in Appendix D.

ABIDE I-II datasets using the schaefer 400 atlas. First order polynomials tend to have better results than the default (third order). Furthermore, for the adjacency matrix, extra self looping only seems to be the best configuration for NYU-only data. Both the site for the single sited dataset and polynomial order 0 result in a identity adjacency matrix, i.e. no connection with sim-ilar subjects and thus no GCN needed. For the ABIDE I-II dataset, either the correlation-only or various combinations of age outperform the default of gender and site. The best adjacency matrices are site-only and age with a maximum of three years for the NYU-only and ABIDE I-II datasets, respectively.

The second model replaces the arbitrarily defined adjacency matrix by the VAE embedded struc-tural information. This is done for multiple combinations of polynomial orders, additional spar-sity and thresholding. There seems to be no clear correlation and the best performing configura-tion is choosen to be third order polynomial for the NYU-only data and third order polynomial for the ABIDE I-II data. Both datasets perform good without thresholding and no extra sparsity. The final model uses the 3DCNN for dimensionality reduction. The 3DCNN on its own with a linear layer shows encouraging results given that it classifies only using a single linear layer. When using the full combination of the 3DCNN embeddings and the best performing GCN with VAE Adjacency matrix, the performance is slightly increased, but can be further investigated. Firstly, Table 7 demonstrates the results of the best configurations, starting with the best atlas found for the baseline with default adjacency matrix definition, dimensionality reduction, sparsity matrix addition and polynomial. Secondly, it represents the best results after optimizing the adjacency matrix definition and polynomials. Thirdly the VAE adjacency model without extra sparsity and thresholding is demonstrated to slightly improve the best scores so far. Fourthly, the 3DCNN on its own on the ALFF and Degree Centrality Binarized summarisation technique does show promising results. Finally, using the full combination of 3DCNN embedding of the RS-fMRI, VAE embedding of the structural images as adjacency matrix and TensorFlow implementation of GCN, resulted in a promising accuracy of 61.4% and an AUC of 65.8%.

(25)

NYU ABIDE I-II

Polynomial Acc AUC Acc AUC

Baseline model default 0.627 0.668 0.627 0.665 Baseline model optimized 0.646 0.684 0.623 0.678 VAE adjacency model 0.654 0.689 0.630 0.681

3DCNN alone 0.623 0.683 0.603 0.643

Full combined model 0.630 0.688 0.614 0.658

Table 7: A final comparison of the best results of each model using the PyTorch implementations.

6 Discussion

When using the baseline model and data splitting method of [Parisot et al., 2018], with different atlasses, the results were similar to theirs which is about 70% accuracy. When changing the data splitting method, the results decreased significantly to 60% accuracy and 63% AUC on their TensorFlow implementation for their EKS subset of ABIDE I. Using either the NYU-only or ABIDE I-II datasets, the performances got increased when tuning the order of polynomials and adjacency matrices. A lower polynomial order and Correlation only and Age based adjacency matrices tend to work well for the ABIDE I-II dataset, whereas just using no connections to other subjects works best for NYU-only data.

Replacing the arbitrarily chosen adjaceny matrix definitions by using the similarity of the unused brain structures improved the performance. This is important since wihtout human knowledge or metadata, the structure which is usually not used to classify functional disorders with, can now be used and outperform the human knowledge methods. Extra sparsity did not have an positive influence on the results and thus the most vanilla version of the VAE Adjacency matrix performs best.

Nine different atlasses were compared and they all yield significantly different results, which extends the research of [Parisot et al., 2018]. When replacing these expert knowledge based preprocessed RS-fMRI images with the very simple and not finetuned 3DCNN, the results are better than the untuned baseline for the NYU-only dataset, but slightly worse for the ABIDE I-II datasets. However, compared with the optimisations that were made throughout this thesis, the full combination does not perform as well as one could expect.

The VAE improved the model using training data that is separated from our ABIDE I-II dataset. When training a VAE using only the labeled set, the performance was significantly worse and it would not have worked as well as it did in this thesis. The same probably applies to the 3DCNN approach which is just trained for a short amount of time on the labeled set. When using a fully trained 3DCNN on some other dataset, it would most likely get significantly better features that could improve the final combined model. Therefore, given the minor dataset which it was trained on, it still performed well.

Some important note is that when rerunning some experiments, the results tend to vary. The code is seeded and no problems have been found until the last two weeks of this thesis were I acci-dentally retrained a model on another GPU and found results that differ a few percentages. Each model has been excessively trained with many variations (of which most are not even displayed in this thesis) and at least k -fold with k up to 32 (for leaving-one-site-out data splitting, which unfortunately could not made it in this thesis, but can be trained using my code). Therefore,

(26)

the models are trained on a number of epochs that is the same for all of them (150 epochs for GCN and 250 for 3DCNN), which is based on early learning curves, but has not been changed later on. The last model and thus not the best within these epochs have been taken. This is not the best approach given that most models converged at these number of epochs, but still vary in the unlabeled set losses. This possibly a reason that results change between runs.

7 Future work

The main improvement of this work will be the use of a fully trained 3DCNN to extract better features. The current one is based on little data and a very simple architecture out of roughly ten tested similar architectures. Secondly, this thesis gives an insight of many variations of models and data and just a hand full of them should be more extensively tested to ensure that multiple different runs yield similar results. Thirdly, this could be compared with a fully end-to-end model where the GCN-batch difficulty has been taken care of. The comparison can be between 1) fully pretrained VAE and 3DCNN on other data using the GCN with just the ABIDE I-II data and 2) with an end-to-end model based on the complex multi sited ABIDE I-II dataset only. As mentioned, sitewise comparison was intended for this thesis as well. Due to the many experiments, it took too many resources to gather enough results for this thesis. This should be done as well to see if only some sites might cause significant performance drops due to their different methods of getting the RS-fMRI data or the labeling of it.

8 Acknowledgement

I am grateful that Devanshu and Rajat have supported me during this thesis by explaining a lot about current research and how to approach the experiments and problems I encountered throughout the thesis period. In addition, I want to thank Rajat for getting me clean data to work with and for giving me the trained VAE embeddings and Devanshu for helping me finalising this thesis report. Finally, I want to thank Marcel as well for taking the time to be my examiner.

References

Alexandre Abraham, Michael P Milham, Adriana Di Martino, R Cameron Craddock, Dimitris Samaras, Bertrand Thirion, and Gael Varoquaux. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage, 147:736–745, 2017. Devanshu Arya and Marcel Worring. Exploiting relational information in social networks using

geometric deep learning on hypergraphs. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pages 117–125. ACM, 2018.

Gillian Baird, Emily Simonoff, Andrew Pickles, Susie Chandler, Tom Loucas, David Meldrum, and Tony Charman. Prevalence of disorders of the autism spectrum in a population cohort of children in south thames: the special needs and autism project (snap). The lancet, 368(9531): 210–215, 2006.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zam-baldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

(27)

Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.

Xia-an Bi, Yang Wang, Qing Shu, Qi Sun, and Qian Xu. Classification of autism spectrum disorder using random support vector machine cluster. Frontiers in genetics, 9:18, 2018. Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with

variance reduction. arXiv preprint arXiv:1710.10568, 2017.

Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural net-works. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3558–3565, 2019.

Ahmed El Gazzar, Leonardo Cerliani, Guido van Wingen, and Rajat Mani Thomas. Simple 1-d convolutional networks for resting-state fmri based classification in autism. arXiv preprint arXiv:1907.01288, 2019.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017. Terry Harville, Bobbie Rhodes-Clark, Sirish Chandra Bennuri, Leanna Delhey, John Slattery,

Rebecca Wynne, Shannon Rose, Stephen Gregory Kahler, and Richard Eugene Frye. Inheri-tance of hla-cw7 associated with autism spectrum disorder (asd). Frontiers in Psychiatry, 10: 612, 2019.

Kayleigh K Hyde, Marlena N Novack, Nicholas LaHaye, Chelsea Parlett-Pelleriti, Raymond Anden, Dennis R Dixon, and Erik Linstead. Applications of supervised machine learning in autism spectrum disorder research: a review. Review Journal of Autism and Developmental Disorders, 6(2):128–146, 2019.

Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1): 221–231, 2012.

Yun Jiao, Rong Chen, Xiaoyan Ke, Kangkang Chu, Zuhong Lu, and Edward H Herskovits. Predictive models of autism spectrum disorder based on brain regional cortical thickness. Neuroimage, 50(2):589–599, 2010.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-works. arXiv preprint arXiv:1609.02907, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

(28)

Xiaoxiao Li, Nicha C Dvornek, Xenophon Papademetris, Juntang Zhuang, Lawrence H Staib, Pamela Ventola, and James S Duncan. 2-channel convolutional 3d deep neural network (2cc3d) for fmri analysis: Asd classification and feature learning. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 1252–1255. IEEE, 2018.

Lisa Salomons. Plain and Residual 3D convolutional neural networks for Autism classification, 2019. Bsc. AI Thesis UvA Draft; accessed 1 August 2019.

Elaheh Moradi, Budhachandra Khundrakpam, John D Lewis, Alan C Evans, and Jussi Tohka. Predicting symptom severity in autism spectrum disorder based on cortical thickness measures in agglomerative data. Neuroimage, 144:128–141, 2017.

Tomoyo Morita, Minoru Asada, and Eiichi Naito. Contribution of neuroimaging studies to under-standing development of human cognitive brain functions. Frontiers in Human Neuroscience, 10:464, 2016.

Wiktor Olszowy, John Aston, Catarina Rua, and Guy B Williams. Accurate autocorrelation modeling substantially improves fmri reliability. Nature Communications, 10(1), Mar 2019. doi: 10.1101/323154. URL https://www.nature.com/articles/s41467-019-09230-w. Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrero, Ben Glocker,

and Daniel Rueckert. Disease prediction using graph convolutional networks: Application to autism spectrum disorder and alzheimer’s disease. Medical image analysis, 2018.

Dardo Tomasi and Nora D. Volkow. Functional connectivity density mapping. Proceedings of the National Academy of Sciences, 107(21):9885–9890, 2010. ISSN 0027-8424. doi: 10.1073/ pnas.1001414107. URL https://www.pnas.org/content/107/21/9885.

Maya Varma, Kelley Marie Paskov, Jae-Yoon Jung, Brianna Sierra Chrisman, Nate Tyler Stock-ham, Peter Yigitcan Washington, and Dennis Paul Wall. Outgroup machine learning approach identifies single nucleotide variants in noncoding dna associated with autism spectrum disorder. In PSB, pages 260–271. World Scientific, 2019.

Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional pol-icy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6410–6421, 2018.

Yufeng Zang, Tianzi Jiang, Yingli Lu, Yong He, and Lixia Tian. Regional homogeneity approach to fmri data analysis. NeuroImage, 22(1):394–400, May 2004. doi: 10.1016/j.neuroimage.2003. 12.030.

Qi-Hong Zou, Chao-Zhe Zhu, Yihong Yang, Xi-Nian Zuo, Xiang-Yu Long, Qing-Jiu Cao, Yu-Feng Wang, and Yu-Feng Zang. An improved approach to detection of amplitude of low-frequency fluctuation (alff) for resting-state fmri: Fractional alff. Journal of Neuroscience Methods, 172 (1):137–141, Jul 2008. doi: 10.1016/j.jneumeth.2008.04.012. URL https://www.ncbi.nlm. nih.gov/pmc/articles/PMC3902859/.

X.-N. Zuo, C. Kelly, A. Di Martino, M. Mennes, D. S. Margulies, S. Bangaru, R. Grzadzinski, A. C. Evans, Y.-F. Zang, F. X. Castellanos, and et al. Growing together and growing apart: Regional and sex differences in the lifespan developmental trajectories of functional homotopy. Journal of Neuroscience, 30(45):15034–15043, May 2010. doi: 10.1523/jneurosci.2612-10.2010.

(29)

Xi-Nian Zuo, Ross Ehmke, Maarten Mennes, Davide Imperati, F. Xavier Castellanos, Olaf Sporns, and Michael P. Milham. Network centrality in the human functional connectome. Cerebral Cortex, 22(8):1862–1875, Jul 2011. doi: 10.1093/cercor/bhr269.

(30)

Appendices

A

ABIDE I-II Dataset

Dataset EKS-only ABIDE I ABIDE II

Site Autistic Control Autistic Control Autistic Control

CALTECH 10 5 19 19 CMU 5 6 13 14 KKI 21 12 33 22 LEUVEN 1 14 14 15 14 LEUVEN 2 16 12 20 15 MAX MUN 27 19 33 24 NYU 98 74 105 79 OHSU 13 12 15 13 OLIN 14 14 16 20 PITT 26 24 27 29 SBL 14 12 15 15 SDSU 19 8 22 14 STANFORD 13 12 20 20 TRINITY 25 19 25 24 UCLA 1 27 37 32 41 UCLA 2 10 11 13 13 UM 1 52 34 55 55 UM 2 21 13 22 13 USM 24 43 43 58 YALE 19 22 29 28 BNI 1 29 29 EMC 1 27 25 ETH 1 24 13 GU 1 55 51 IP 1 34 22 IU 1 20 20 KKI 1 155 56 KUL 3 0 28 NYU 1 30 48 NYU 2 0 27 OHSU 1 54 37 OILH 2 35 24 SDSU 1 25 33 TCD 1 21 21 UCD 1 14 18 UCLA 1 16 16 USM 1 26 17 U MIA 1 15 13

Table 8: Distributions of ASD and TD patients per site and dataset. The EKS is a subset of ABIDE I. And ABIDE I-II refers to the ABIDE I and ABIDE II combined.

Graph Convolutional Network-based model for patient classification using Neuroimaging

MSc Artificial Intelligence

Master Thesis