UNIVERSITY OF AMSTERDAM

LITERATURE THESIS

Machine learning as an approach for predicting individual outcome in individuals at risk for psychosis

Author: I.A.M. BOETERS
Supervisor: Dr. T.B. ZIERMANS

Master Brain and Cognitive Sciences, Cognitive Science track
Institute for Interdisciplinary Studies

Abstract

Studying disease progression in populations that are at risk for psychosis provides insights that can improve early intervention and prevention. Thus far, no ultimate biomarker has revealed itself to predict the development of psychosis. Multivariate pattern recognition methods, like machine learning, are promising strategies that enable the prediction of disease progression on an individual level. Machine learning can be an objective measure that is able to deal with the complexity and heterogeneity of psychosis. Despite recent developments in machine learning for disease prediction, clinical applicability remains beyond reach. The aim of this study is to investigate the added value of machine learning thus far, and the limitations that hinder the development of a clinically applicable prediction tool. A systematic literature search, resulting in 18 prognostic studies, revealed that current models are able to predict disease outcome in at risk populations with an average accuracy of 80.4%, a sensitivity of 76.4% and a specificity of 84.9%. Some models even outperform clinical tools and disease estimates by clinical experts. However, lack of external validation, biases and heterogeneous methods hinder the development of clinical machine learning prediction tools. Future directions include the search for complex patterns in multi-modal studies, the prediction of non-arbitrary continuous disease outcome, the use of heterogeneous multi-site samples and a focus on external validation. Promising new programs like the PSYSCAN project, holding a large variety of multi-site data, will aid the further unfolding of machine learning to get a grip on the complex development of psychosis.


Contents

Abstract
1 Introduction
2 Machine learning
  2.1 The basics
  2.2 Validation
  2.3 Algorithm type
3 Overview and critical discussion of literature
  3.1 Literature search
  3.2 Data acquisition
  3.3 Results
  3.4 Model performance
  3.5 Revealed patterns by machine learning
  3.6 Heterogeneity in psychosis and methodological instruments
  3.7 Generalizability of machine learning models
  3.8 Objectivity
  3.9 Clinical applicability
4 Personal opinion and future directions
  4.1 Conclusion


1 Introduction

Psychosis has a heterogeneous nature, resulting in diverse phenotypes with different positive and negative symptom combinations. Positive symptoms refer to additive behaviours, for example delusions, hallucinations and thought disorder, while negative symptoms refer to lacking behaviours, for example social withdrawal, lack of motivation and poverty of speech. Negative symptoms, followed by positive symptoms, often precede the onset of psychosis and can therefore be used as indicators of disease progression (Yung & McGorry, 1996). Only part of the people with subthreshold negative and positive symptoms develop full-blown psychosis: around 30% over a follow-up duration of four years (Fusar-Poli et al., 2016). Because the prodrome of psychosis starts relatively early, there has been increasing interest in the prediction of psychosis.

Predicting disease progression holds the potential to intervene in a more targeted way, to intervene at an early stage and to ultimately prevent the onset of psychosis altogether. To achieve this, the mechanisms of the development of psychosis have been studied in at risk populations. Multiple clinical criteria can be used to identify at risk individuals: clinical high risk (CHR), ultra high risk (UHR) or at risk mental state (ARMS) for psychosis, which all rely on having at least one of the three following criteria: 1) Attenuated Positive Symptoms and/or 2) Brief Limited Intermittent Psychotic Symptoms and/or 3) genetic risk for psychosis in combination with a deterioration in functioning. Additionally, basic symptoms criteria, which consist of self-experienced abnormalities in cognition, can be used to identify individuals at risk. Individuals with genetic predispositions are also at risk, such as 22q11.2 deletion syndrome patients, with a prevalence of psychosis of 25% (Murphy et al., 1999). Several potential biomarkers have been revealed in psychotic populations that are already present in the at risk populations. Such prognostic biomarkers characterize at risk populations on the level of neuroanatomy, neurocognition and neurofunction, e.g. altered brain volumes (Fusar-Poli et al., 2011), neuronal oscillations (Uhlhaas & Singer, 2011), language (DeVylder et al., 2014) and cognitive functioning (Seidman et al., 2016).

As for predicting psychosis, different factors hinder the development of good clinical tools for individual prognostics. First of all, most studies revealed group differences, which cannot be translated to an individual level (Zarogianni et al., 2013). Secondly, heterogeneity on multiple levels makes it difficult to investigate the disease’s development. There is disagreement on the definition of psychosis: psychosis can refer to a symptom, such as in schizophrenia, bipolar disorder and depression, or to a disorder itself. Besides, instead of calling it a disorder, psychosis can be seen as a disproportionate expression of a “normal” mental state. Experts have developed different views on psychosis: some formulate psychosis as a neurodegenerative and lasting disease, others define it as an altered state of the mind. Still others see psychosis as a developmental disease, as recent work addressed the importance of developmental factors, such as an altered developmental course of the hippocampi (Bernard et al., 2015). Around the world, diagnosis is also handled by different clinical instruments, such as the DSM-5 or ICD-10. Within official classification systems like the DSM-5, psychosis is again divided into more than ten subtypes with specific symptom combinations. Patients can switch between multiple subtypes during psychosis.

The heterogeneous nature of psychosis, together with the inconsistency about its definition, makes individual prognosis difficult. A third factor that hinders the development of clinical tools has to do with defining transition to psychosis. The threshold for psychosis is chosen arbitrarily and differs between assessment tools (Fusar-Poli & Van Os, 2013). Transition rates are also highly variable, since at risk populations are identified by different clinical criteria: Attenuated Positive Symptoms, Brief Limited Intermittent Psychotic Symptoms, genetic risk and/or basic symptoms, with different risk ratios (Fusar-Poli et al., 2016). Over the years, detected transition rates have become lower, possibly because of looser inclusion criteria for at risk populations, earlier inclusion or more effective preventive treatment (Fusar-Poli et al., 2012). Still, one third of the non-transitioned group experiences sub-threshold symptoms (Beck et al., 2019). This makes it relevant to predict not only transition, but also broader outcomes, like functioning and quality of life.

One possible solution to overcome these problems is using multivariate pattern recognition methods, such as machine learning. Machine learning can offer a measure which is, in its most optimal form, objective, able to capture the heterogeneity and complexity of psychosis and able to make individual prognoses on disease outcome (Zarogianni et al., 2013). Many studies have used machine learning for these purposes, with some success. More studies have examined machine learning for diagnostic purposes than for prognostic purposes, which is still in its infancy (Janssen et al., 2018). Given the available computational power and access to wide ranges of data, machine learning is becoming an increasingly popular approach. One benefit of machine learning is making diagnoses outside of clinical borders, for example by using clustering methods to form disease subtypes in psychosis (Clementz et al., 2016). When well performed, machine learning is capable of making more objective interpretations of the data, especially when unsupervised learning takes place and no arbitrary choices are made, as with clustering algorithms (Schnack, 2019). Another benefit of machine learning is pattern extraction, for example using structural magnetic resonance imaging (sMRI) data to find shared altered brain areas (Koutsouleris et al., 2009; Das et al., 2018; Zarogianni et al., 2019). Although some relationships were already found as group differences by earlier research, machine learning holds the ability to make individual prognoses based on these biomarkers. It also has the ability to reveal complex patterns: some studies revealed new relations between biomarkers and clinical presentation which had remained unnoticed by experts in the field (de Wit et al., 2017; Koutsouleris et al., 2018). The models that predict whether individuals do or do not convert to psychosis are highly accurate, some with accuracies around 90% (Koutsouleris et al., 2009; Gothelf et al., 2011; Bedi et al., 2015; Rezaii et al., 2019). However, after years of research, clinical applicability of machine learning in psychosis has still not been reached. Reasons could be the lack of generalizability of the models, lack of good validation and small sample sizes.

The aim of this review is to critically discuss the state of the art of machine learning for individual prognosis in populations at risk for psychosis. This article starts with a brief introduction to the basics of machine learning, followed by a systematic literature search and a discussion of the selected studies. In this discussion the performance of current machine learning models, as well as generalizability, validation and type of outcome measure, are taken into account. Multiple prospects on clinical applicability can be derived from the discussion. These are listed, along with future directions, in the last part of the article.


2 Machine learning

2.1 The basics

Over the past few years there has been an increasing interest in machine learning as a tool to study psychiatric diseases. Machine learning is a statistical/mathematical method, originating from the field of artificial intelligence, for constructing practical models. As in statistics, machine learning models try to make predictions for a target population based on limited data from a smaller sample. To construct such a model, a mathematical algorithm is used to explore the data and extract patterns. During pattern extraction, i.e. the training of the model, the algorithm learns by updating its rules. Either supervised or unsupervised learning takes place. Supervised learning uses labelled input data to learn from. This is the case when predicting disease outcome, whereby the algorithm knows which subjects will have a good disease outcome and which a poor disease outcome. Such models make a prediction based on input data, following the current mathematical rules; afterwards these rules are updated based on whether the prediction was right or wrong, knowing the actual disease outcome. Unsupervised learning explores unlabelled data to find unknown patterns, for instance when clustering is performed. Apart from learning and updating the rules of the model, parameter optimization occurs during the training phase. Such a parameter (often called a hyperparameter) could for instance be the number of hidden layers in a multi-layer neural network; it is set based on the performance of models with differing numbers of hidden layers.
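To make the distinction concrete, the sketch below contrasts the two learning modes on synthetic data. It is a minimal illustration using the scikit-learn library; the biomarker matrix X and the outcome labels y are entirely hypothetical.

```python
# Minimal sketch (synthetic data) of supervised vs. unsupervised learning.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 subjects, 5 hypothetical biomarkers
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels: 1 = poor outcome, 0 = good outcome

# Supervised: the algorithm sees the outcome labels and updates its rules
# until it can reproduce them; it can then predict for new subjects.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: no labels are given; the algorithm searches for structure
# (here, two clusters) in the data itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters[:5])
```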

After the training phase, the model can be tested to get an estimate of model performance. In binary classification, model performance is usually expressed in the accuracy of the model, as well as its sensitivity in identifying true positives and its specificity in identifying true negatives. In cases where the outcome variable is continuous instead of categorical, regression can be used and performance is represented in the regression coefficient. Low performance when applying a model to external datasets can be a result of overfitting, underfitting or irreducible errors (Janssen et al., 2018). In a binary classification model that predicts good versus poor disease outcome, overfitting occurs when the constructed hyperplane between the two classes splits the training data perfectly. The rules learned by the model are then not generalizable, as they are based not only on disease predictors but also on the characteristics of the noise in the data. This will result in high scores on training sets, but misclassifications in independent datasets, thus a low score on test sets. High accuracy of models on training sets may be caused by overfitting, so one must be cautious when examining machine learning studies. Small samples are especially prone to overfitting, simply because individual noise can have more effect on the model construction. In cases of underfitting, the decision boundary does not explain the data well enough and is not specific enough. The pattern is not extracted from the training data and misclassifications in new data are inevitable; performance on both training and test sets will be low. Irreducible errors are a property of the data itself, as there will always be some misclassifications due to noise in the data. They do not depend on the methodological decisions of researchers.
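The overfitting pattern can be demonstrated directly. The sketch below (synthetic, hypothetical data) fits a flexible classifier to a small sample with more noise variables than subjects, a situation comparable to many sMRI studies, and shows the characteristic gap between training and test performance.

```python
# Sketch of overfitting: pure-noise labels, more features than subjects.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 50))        # 40 subjects, 50 noise features
y = rng.integers(0, 2, size=40)      # random labels: there is nothing real to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
model = SVC(kernel="rbf", C=100.0, gamma="scale").fit(X_tr, y_tr)

print("training accuracy:", model.score(X_tr, y_tr))  # typically close to 1.0
print("test accuracy:", model.score(X_te, y_te))      # typically around chance (0.5)
```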

2.2 Validation

For good validation, the machine learning model is usually tested on an independent dataset. To get a good estimate of generalizability, splitting the data only once for training and testing purposes is not enough: the resulting estimate probably only holds true for that specific train and test subdivision. To overcome this problem, cross-validation must be performed. Here, the available data is split repeatedly into training and test subgroups to examine model performance. The number of folds is usually denoted k, as in k-fold cross-validation; when each test fold contains a single subject (k equals the sample size), it is called leave-one-out cross-validation. Since parameter optimization must be based solely on the training data and not on the test data (otherwise the test set is no longer independent), nested k-fold cross-validation is preferred (Arlot & Celisse, 2010). Here, parameter optimization occurs within an inner k-fold cross-validation of the training data, which is therefore again split up into training and test sets, while the test set of the outer k-fold cross-validation remains independent (Janssen et al., 2018). Cross-validation is an accurate tool to validate studies with small sample sizes, although test and training groups must remain large enough to assess true generalizability. Besides cross-validation, the significance of a machine learning model can be statistically tested by permutation testing, which assesses whether the model assigns subjects to the correct class above chance level.
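As an illustration of both techniques, the sketch below runs a nested cross-validation and a permutation test with scikit-learn; the data are synthetic and the parameter grid is an arbitrary example.

```python
# Nested cross-validation and permutation testing (synthetic data).
import numpy as np
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     permutation_test_score)
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] > 0).astype(int)

# Inner loop: parameter optimization (here, the SVM's C) on training folds only.
inner = GridSearchCV(SVC(kernel="linear"), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)

# Outer loop: performance is estimated on folds never used for tuning.
nested_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", nested_scores.mean())

# Permutation test: labels are shuffled repeatedly; the real score is compared
# with the distribution of null scores to obtain a p-value.
score, perm_scores, p_value = permutation_test_score(
    SVC(kernel="linear"), X, y, cv=5, n_permutations=500)
print("permutation test p-value:", p_value)
```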

2.3 Algorithm type

Hundreds of different machine learning algorithms exist. The choice of algorithm depends on the problem and purpose of the study, as the ultimate model does not exist. However, when binary classification models instead of regression are used to describe the outcome variable, information is lost. Although regression algorithms are more difficult to train, as the increase in information holds the risk of overfitting the data, regression is preferred because it can capture the variance in functional outcomes in patient populations. Besides, the disease threshold for binary classification is almost always arbitrary, which is another reason a continuous dependent variable is favoured. Although most binary classifiers do learn in a continuous manner, and only the output is converted into a binary scale, direct use of regression is preferred (Janssen et al., 2018). The support vector machine (SVM) algorithm is particularly good at capturing patterns in high-dimensional data, which is why it is used in most sMRI studies. SVM is a linear type of algorithm, which means that the decision boundary for classification can only consist of one straight line (more generally, a flat hyperplane). Non-linear machine learning algorithms, like multi-layer neural networks, also allow high-dimensional data and can capture even more complex patterns. Such algorithms, however, are computationally complex and require around ten times more data, considering the high number of parameters used. A final consideration when evaluating machine learning studies is their exploratory design. When researchers decide to try multiple algorithms and/or multiple different validation types, for example, to select the best performing model, it must be mentioned that the study is exploratory and the results must be externally validated and replicated.
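The difference between a linear and a non-linear decision boundary can be made concrete as follows; this is a toy sketch on a synthetic, non-linearly separable dataset, not a model of any study discussed here.

```python
# Linear vs. non-linear classifier on data a straight line cannot separate well.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
mlp_acc = cross_val_score(
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    X, y, cv=5).mean()

# The non-linear network can bend its boundary around the two interleaved
# "moons" and typically scores higher here than the linear SVM.
print(f"linear SVM: {linear_acc:.2f}, neural network: {mlp_acc:.2f}")
```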

3 Overview and critical discussion of literature

3.1 Literature search

The search engines PubMed and Google Scholar were used for a systematic literature search. Search terms included different combinations of the following words: “machine learning”, “model”, “predict”, “prognosis”, “transition”, “outcome”, “psychosis” and “schizophrenia”. Only studies on predicting outcome in populations at risk for psychosis or schizophrenia with machine learning models were selected. Reference lists of the selected literature were considered as well. The included studies all mentioned performance accuracy, sensitivity and specificity of the model, or these results could be derived from the data. A total of 18 studies were selected from the literature to describe the current status of machine learning in prognostics in psychosis.

3.2 Data acquisition

Different data elements of the selected literature were obtained, as summarized in Table 1. The prognostic target, i.e. the outcome, was either transition to psychosis or functional outcome, with differing assessment tools used to measure it. The type of risk of the at risk population could be termed clinical high risk, ultra-high risk or at risk mental state, which rely on the same criteria, as mentioned above. Another type of risk could be a genetic predisposition, including individuals with familial high risk or 22q11.2 deletion syndrome. Studies with individuals that experienced a first episode of psychosis (FEP) were also included, since the rate of remission in these groups is 58% and the rate of recovery 38% after one year (Lally et al., 2017). Predicting the outcome in FEP individuals remains difficult, demonstrating the importance of better prognostic tools in this particular group. Studies were sorted based on the type of modality: (neuro)biological biomarkers, neurocognitive biomarkers, clinical biomarkers or multi-modal biomarkers. (Neuro)biological biomarkers consisted of sMRI measures, electroencephalogram (EEG) measures, blood plasma analytes and immunology. Neurocognitive biomarkers included speech, the Rey Auditory Verbal Learning Test and neurocognitive clinical data: IQ, processing speed, memory and executive functioning. Clinical data included all psychologically related measures: baseline clinical data, the Global Assessment of Functioning score (GAF), the Comprehensive Assessment of At-Risk Mental States, the Scale for the Assessment of Negative Symptoms, the Brief Psychiatric Rating Scale, the Rust Inventory of Schizotypal Cognitions, the Structured Interview of Prodromal Symptoms (SIPS) and the Scale of Prodromal Symptoms (SOPS). Multi-modal studies combined measures from multiple modalities.

Another acquired data element was the type of algorithm: SVM, Least Absolute Shrinkage and Selection Operator (LASSO), randomized trees, greedy algorithm, lowest common ancestor, convex hull, elastic-net regularized regression or support vector regression. The type of internal and external validation, if any, was also collected. Model performance is described in terms of the sensitivity of the model (SE), calculated as true positives divided by the sum of true positives and false negatives, the specificity of the model (SP), calculated as true negatives divided by the sum of true negatives and false positives, and the accuracy of the model, which is the sum of sensitivity and specificity divided by 2:

SE = True positives / (True positives + False negatives)   (Eq. 3.2.1)

SP = True negatives / (True negatives + False positives)   (Eq. 3.2.2)

Accuracy = (SE + SP) / 2   (Eq. 3.2.3)

In other words, sensitivity represents the correctly assigned positives, specificity represents the correctly assigned negatives and accuracy represents the performance on both. When multiple models are described in a study, the mean accuracy, sensitivity and specificity are calculated. For continuous outcome variables, the model performance is reported as the regression coefficient and corresponding p-value. Sample size was collected as well as the number of converters and the corresponding follow-up duration.
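A small worked example of Eq. 3.2.1-3.2.3 on a hypothetical confusion matrix also shows how this balanced accuracy differs from raw accuracy; the counts below are invented for illustration.

```python
# Hypothetical confusion matrix: of 20 true converters the model detects 16,
# of 30 non-converters it correctly clears 27.
tp, fn, tn, fp = 16, 4, 27, 3

sensitivity = tp / (tp + fn)                 # 0.80 (Eq. 3.2.1)
specificity = tn / (tn + fp)                 # 0.90 (Eq. 3.2.2)
accuracy = (sensitivity + specificity) / 2   # 0.85 (Eq. 3.2.3), balanced accuracy

# Raw accuracy would be (tp + tn) / (tp + fn + tn + fp) = 43/50 = 0.86,
# which weights the larger (non-converter) class more heavily.
```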

3.3 Results

The collected data is summarized in Table 1. A total of 2857 subjects participated in all studies. In 6/18 studies functioning was a prognostic target of the model and in 13/18 studies transition was a prognostic target, with one study examining both. The type of risk used as inclusion criterion was highly variable. Most models were based on (neuro)biological data, and four multi-modal studies were found. SVM was the most commonly used machine learning algorithm, in 50% of all studies. The choice of algorithm for the other studies is highly variable; only elastic-net regularized regression and LASSO regression were used twice. All but one study reported some kind of internal cross-validation; the one study that did not, used external validation. Five other studies used both internal and external validation on their constructed model. The mean accuracy of models with a binary outcome variable was 80.4% (SD = 9.6), the mean sensitivity 76.4% (SD = 12.8) and the mean specificity 84.9% (SD = 10.5). Over a mean follow-up duration of 3.1 years (SD = 1.8), the transition rate was 46.6% (SD = 0.2), although this is not comparable to the natural disease course, as many studies selected as many non-converted cases as converted cases. The two studies with continuous outcome variables revealed regression coefficients of 0.28 and 0.42 between biomarker and outcome variable, with p < 0.01 and p < 0.08 respectively. In the next sections, the results will be further discussed in terms of model performance, revealed patterns, heterogeneity in psychosis and instruments, generalizability and objectivity of the models, as well as possibilities for clinical application.

Table 1: Overview of the obtained data from prognostic machine learning studies

| Study | Prognostic target | Type of risk | Data modality | Algorithm | Internal validation | External validation | Accuracy / SE / SP (%) | Sample size | Converters | Follow-up |
|---|---|---|---|---|---|---|---|---|---|---|
| (Neuro)biological biomarkers | | | | | | | | | | |
| Koutsouleris et al. (2009) | Transition | ARMS | sMRI | SVM | 5-fold CV | 45 new HC | T: 88/87/89; NT: 89/78/91; HC: 88/82/94; T-NT: 82/83/80; E: 93/–/93 | 50 (17 HC) | 15 | 4 years |
| Gothelf et al. (2011) | Transition | 22q11.2DS | sMRI | SVM | LOOCV | NA | 96/90/100 | 19 | 9 | 5 years |
| Koutsouleris et al. (2015) | Transition | ARMS | sMRI | SVM | Repeated double CV | Holdout data (n = 7) | I: 80/76/85; E: 86/86/– | I: 66; E: 7 | I: 33; E: 0 | 4 years |
| Perkins et al. (2015) | Transition | CHR | Blood plasma analytes | Greedy algorithm | 5-fold CV | NA | 76/60/90 | 72 | 32 | 2 years |
| Ramyead et al. (2015) | Transition | ARMS | EEG | LASSO | 10x 10-fold nested CV | NA | 71/58/83 | 53 | 18 | 3 years |
| Kambeitz-Ilankovic et al. (2016) | Functioning (mGAF > 70) | ARMS | sMRI | SVM | Nested LOOCV | NA | 82/79/85 | 27 | 14 | 4 years |
| Das et al. (2018) | Transition | ARMS | sMRI | Randomized trees | 5-fold CV | NA | 82/66/97 | 79 | 16 | 20 months |
| Zarogianni et al. (2019) | Transition (PACE/BPRS) | ARMS | sMRI | SVM | Nested LOOCV | NA | 74/63/84 | 35 | 16 | 4 years |
| Neurocognitive biomarkers | | | | | | | | | | |
| Koutsouleris et al. (2012) | Transition | ARMS | Neurocognitive clinical data (IQ, processing speed, memory, EF) | SVM | Repeated double CV | NA | 77/80/75 | 35 | 15 | 4 years |
| Bedi et al. (2015) | Transition (SIPS/SOPS) | CHR | Speech | Convex hull | LOOCV | NA | 100/100/100 | 34 | 5 | 2.5 years |
| Rezaii et al. (2019) | Transition (SIPS positive > 6) | – | Speech | LCA | NA | Holdout data | I: 90/86/96 | 30 | 7 | 2 years |
| Clinical biomarkers | | | | | | | | | | |
| Mechelli et al. (2017) | Transition (CAARMS/BPRS); Functioning (SOFAS, binary: > 50 = good functioning; also continuous) | – | Clinical data | – | – | – | Binary: 63/63/63; continuous: r = 0.28, p < 0.01 | 48 | 48 | – |
| Leighton et al. (2019a) | Functioning (EET) and PANSS remission (binary) | FEP | Baseline clinical data | Elastic-net regularized regression | 5-fold repeated CV | Holdout data | 70/67/73 | I: 83; E: 79 | I: ±49%; E: ±54% | I: 2 years; E: 1 year |
| Leighton et al. (2019b) | Functioning (symptom remission, GAF, vocational, QoL; binary) | FEP | Baseline clinical data | Elastic-net regularized regression | Nested LOSOCV | 2 different cohorts | I: 68/66/70; E: 68/76/59 | I: 1024; E: 740 | ±44% | 1 year |
| Multi-modal biomarkers | | | | | | | | | | |
| Chan et al. (2015) | Transition | FEP | Immunology (neurobiological); CAARMS positive score (clinical) | LASSO regression | 10-fold CV | NA | 84/89/79 | 76 | 18 | 2 years |
| de Wit et al. (2017) | Functioning (mGAF and SIPS) | UHR (SIPS, SPI-A) | sMRI (neurobiological); SIPS disorganization (clinical) | SVM | LOOCV | NA | 82/69/94 | 41 | 17 | 6 years |
| de Wit et al. (2017) | Functioning (continuous) | | Subcortical volume (neurobiological) | SVR | | | r = 0.42, p < 0.008 | | | |
| Zarogianni et al. (2017) | Transition (ICD-10) | FR: ≥ 2 relatives with SZ | sMRI (neurobiological); memory (RAVLT, neurocognitive); schizotypy (RISC, clinical) | SVM | LOOCV | Holdout data (40 non-converters) | I: 94/100/88; E: 80/–/80 | 34 | 17 | ± 2.5 years |
| Koutsouleris et al. (2018) | Functioning (GF:Social and GF:Role < 7) | CHR (SIPS, SPI-A) | sMRI (neurobiological); functioning (clinical) | SVM | Nested LSOCV | NA (but multi-site) | 74/71/76 | 116 | 66 | 11 months |

Abbreviations: 22q11.2DS = 22q11.2 deletion syndrome, ARMS = at risk mental state, BPRS = Brief Psychiatric Rating Scale, CAARMS = Comprehensive Assessment of At-Risk Mental States, CHR = clinical high risk, CV = cross-validation, E = external, EEG = electroencephalogram, EET = employment, education or training, EF = executive functions, FEP = first-episode psychosis, FR = familial risk, GAF = Global Assessment of Functioning, GF:Social and GF:Role = Global Functioning: Social and Global Functioning: Role, HC = healthy controls, I = internal, ICD-10 = International Statistical Classification of Diseases and Related Health Problems, 10th revision, IQ = intelligence quotient, LASSO = Least Absolute Shrinkage and Selection Operator, LCA = lowest common ancestor, LOOCV = leave-one-out cross-validation, LOSOCV = leave-one-site-out cross-validation, LSOCV = leave-site-out cross-validation, mGAF = modified Global Assessment of Functioning, NT = not transitioned, PACE = Personal Assessment and Crisis Evaluation, PANSS = Positive and Negative Syndrome Scale, QoL = quality of life, RAVLT = Rey Auditory Verbal Learning Test, RISC = Rust Inventory of Schizotypal Cognitions, SANS = Scale for the Assessment of Negative Symptoms, SIPS = Structured Interview of Prodromal Symptoms, sMRI = structural magnetic resonance imaging, SOFAS = Social and Occupational Functioning Assessment Scale, SOPS = Scale of Prodromal Symptoms, SPI-A = Schizophrenia Proneness Instrument, adult version, SVM = support vector machine, SVR = support vector regression, SZ = schizophrenia, T = transitioned, T-NT = transitioned vs. not transitioned, UHR = ultra-high risk.

3.4 Model performance

The accuracy of the models lies around 80.4%, which is comparable to that of other psychiatric machine learning models (Janssen et al., 2018). Calculated accuracy rests on sensitivity and specificity performance. In the selected literature the mean sensitivity of the models (76.4%) was lower than their mean specificity (84.9%). In 67% of the selected studies the model had higher specificity than sensitivity, and this was the case in all models based on (neuro)biological biomarkers. The relative importance of both depends on the goal of the model, which is why some authors argue that either sensitivity or specificity is of more importance than the other. Sensitivity represents the detection of converters, while specificity represents the detection of non-converters. Detecting all individuals that will convert to psychosis, meaning high sensitivity, assures possible intervention for almost all psychotic individuals, whereas high performance on detecting non-converters, meaning high specificity, is essential to reduce unnecessary treatment. When the goal of the model is to identify the people with psychosis for medical treatment, one might argue that specificity is more important, given the unwanted side effects of psychiatric medication. When the goal is to find a biomarker to identify all psychotic individuals, sensitivity is more important. Overall, constructing a model with high sensitivity but very low specificity, or high specificity but very low sensitivity, will not improve the prediction of psychosis. A balanced model with both high sensitivity and specificity will favour both purposes, resulting in better prognosis. This is also the reason why accuracy is calculated by summing sensitivity and specificity and dividing by 2, see (Eq. 3.2.3), and not by dividing the number of correctly assigned subjects by the total number of subjects. This way sensitivity and specificity are equally important for model performance, regardless of the imbalance in the numbers of included converted and non-converted subjects. However, in studies with large imbalances it must be considered that model accuracy might not be a good representation of sensitivity and specificity.

The model with the highest accuracy, the one from Bedi and colleagues (2015), had a model accuracy, sensitivity and specificity of 100%. It must be noted that in their sample only 5 out of 34 individuals converted to psychosis, and this imbalance can lead to over-optimistic results. That being said, the model was based on speech features, as was the model from Rezaii and colleagues (2019); both studies reported accuracies higher than 90%, the latter being externally validated. Although only those two studies investigated speech as a biomarker, speech has the potential to be a valuable biomarker for predicting psychosis. When comparing other modalities, it seems that clinical models have a slightly lower performance than neurocognitive and (neuro)biological models. However, due to the small number of clinical studies, it is difficult to draw firm conclusions from the data. Furthermore, it must be noted that prior clinical knowledge is used in the studies to preselect the at risk population, thus clinical information is inherently part of the constructed models. It is thereby unknown how models would perform on data from an unselected population.

Considering the high prevalence of subthreshold symptoms in non-converted populations (Beck et al., 2019), it is also interesting to compare studies that predict functional versus transition outcome. Models that predicted functional outcome had a mean accuracy of 73.2% and models that predicted transition outcome had a mean accuracy of 82.5%. Models predicting transition thus perform somewhat better. This is remarkable, as one would expect the opposite, given that functional outcome is closer to true clinical behaviour. It might be due to the fact that only 6 models predicted functional outcome, half of which were based on clinical measures, and models in the clinical modality showed lower performances in the first place. Considerably more work will need to be done to improve models that predict functional outcome in all modalities.

A related question is whether machine learning models outperform clinicians in predicting psychosis outcome. The study of Bedi and colleagues (2015) claims that the performance based on speech features outperforms clinical assessment. When the model was fed clinical assessment data, specifically SIPS and SOPS data, the accuracy was 79%, sensitivity 40% and specificity 89%, compared to 100% on all facets for the model based on speech features. However, these were both machine learning models and do not represent the performance of clinical experts. One study did investigate the prognostic performance of clinicians on the same dataset: experts had an accuracy of 71%, a sensitivity of 50% and a specificity of 92%, compared to 74%, 71% and 76% respectively for the model’s performance (Koutsouleris et al., 2018). The model thus outperformed experts on accuracy and sensitivity, but not specificity. This model performance was, however, an average of two models; their best model had an accuracy of 83%, a sensitivity of 83% and a specificity of 82% (Koutsouleris et al., 2018). Still, picking the best model, which is done by a lot of studies, might not be representative of machine learning performance in general. More on this so-called publication bias in section 3.8 Objectivity.

3.5 Revealed patterns by machine learning

An important advantage of machine learning is its ability to capture complex patterns, whereas classical statistical testing is based on linear relations with no more than a couple of variables to explain the outcome variable. Complexity can be found in broad measures, for example in whole-brain measures of sMRI, where a combination of factors that are increased, decreased and/or altered can explain conversion to psychosis (Koutsouleris et al., 2009). It can also be found in more specific data, for example when using sMRI where a combination of all different brain areas and their grey and white matter volumes can be included (Gothelf et al., 2011; Koutsouleris et al., 2015; Kambeitz-Ilankovic et al., 2016; Zarogianni et al., 2019). Brain areas that show overlap in multiple studies are the (medial) prefrontal cortex, orbitofrontal cortex, cingulate cortex and superior temporal areas (Gothelf et al., 2011; Koutsouleris et al., 2015; Ramyead et al., 2015; Kambeitz-Ilankovic et al., 2016; Zarogianni et al., 2019).

Relationships between variables become clear more easily when using machine learning, and a lot of data can be fed to the model to find these complex patterns. The complex patterns revealed by the selected literature are reported in Table 2, sorted by type of modality. An interesting study used a computer analysis to extract latent content of words used by the at risk groups (Rezaii et al., 2019). They found that an increased tendency to indirectly talk about sounds is an indicator of conversion to psychosis, which can remain unnoticed by professionals. Some predictors listed in Table 2 were already reported in earlier prediction studies, but the constructed models are based on novel combinations of variables.

As an overall pattern, multi-modal studies found that adding variables from more modalities, thereby making the relationships more complex, favours the model’s accuracy. This is in line with the complex nature of psychosis and supports the idea that adding more facets of the illness will increase information, which will intuitively lead to better understanding and prediction by the model. Therefore multi-variable studies, where multiple variables within one modality are included, and multi-modal studies, where data is acquired using different methods, become even more valuable. In all multi-modal studies, models were built on data from a clinical modality in combination with data from the (neuro)biological modality (Chan et al., 2015; de Wit et al., 2017; Zarogianni et al., 2017; Koutsouleris et al., 2018). Diverse clinical measures were used, suggesting that multiple clinical measures can have added value for model construction. One study also included a neurocognitive feature, memory performance, resulting in a constructed model based on three different modalities (Zarogianni et al., 2017). This study also revealed one of the highest model performances: an accuracy of 94%, a sensitivity of 100% and a specificity of 88%. External validation was also used in this study, although only a group of non-converters was tested, which resulted in an external specificity of 80%. On average, multi-modal studies revealed an accuracy, sensitivity and specificity of around 82%. This does not appear to be substantially higher than the average of the other modalities, although it is difficult to draw firm conclusions given the small number of multi-modal studies. Methodological malpractices could have hindered further improvement in model performance, or the limits of machine learning could have been reached; more on this in the section Personal opinion and future directions.

3.6 Heterogeneity in psychosis and methodological instruments

Multivariate and multi-modal studies are also of great importance for studying the heterogeneity of psychosis, as will become clear in this section. Psychosis is known to have a heterogeneous etiology, whereby multiple biological pathways can lead to the disease (Deng & Dean, 2013). Besides, the disease course is heterogeneous: seemingly similar patients can have differing disease outcomes, which is specifically relevant for predictive research. Univariate approaches are unable to capture this heterogeneity, while multivariate studies can, and indeed show higher effect sizes in diagnostic psychosis studies (Schnack, 2019).

This underlying heterogeneity is important to take into account when interpreting machine learning studies in psychosis. An important choice that affects heterogeneity is the type of algorithm used: linear or non-linear. Linear algorithms can only separate converted from non-converted individuals by a single straight boundary. In the selected literature, linear SVM was chosen 50% of the time. The main reason in most studies is its reliable performance even when more variables than participants are included, which is especially relevant for sMRI studies. Combining multiple SVM classifiers can account for more complex decision boundaries, as can non-linear algorithms, which can compute almost any form of decision boundary. One prognostic study used the non-linear randomized trees algorithm, whereby a connectome is constructed with certain network properties (Das et al., 2018). The ability to capture heterogeneity is reflected in a model performance of 82%. The disadvantage of using such algorithms could be their interpretability, specifically when large trees are constructed. Other non-linear methods, such as artificial neural networks, have not been used in predictive psychosis research yet, possibly because they require a lot more data.

Another choice that affects the underlying heterogeneity of psychosis is the selected sample. Having a lot of inclusion criteria will lead to a small homogeneous sample. Finding a biomarker for the general disease is then impossible, as the revealed biomarker will only hold true for this small sample (Schnack, 2019). Besides, there is a great risk of overfitting, resulting in a biomarker that is unrelated to the disease. Selecting a sample with almost no inclusion criteria results in a large heterogeneous group. It is again difficult to find biomarkers in such samples: the biomarker has to be apparent in all individuals and might therefore represent secondary effects of suffering from psychosis and possibly other diseases (Schnack, 2019). The conflicting approach of either choosing a large heterogeneous or a small homogeneous sample is discussed in most of the selected studies. Choosing the larger group will result in better generalizability, but over-conservative measures. Choosing the smaller group will result in non-generalizable predictors and biomarkers that are too specific. For capturing heterogeneity, the larger sample with fewer inclusion criteria would be favourable. An option to solve the problem of over-conservativity and non-specific biomarkers is dividing the heterogeneous patient group into subgroups, either based on explicit criteria such as clinical characteristics, or by using clustering machine learning algorithms (Schnack, 2019). The last option is especially relevant when splitting criteria are based on complex interplays themselves. Studies show that homogenizing groups indeed improves model performance when classifying psychosis patients from controls (Schnack, 2019).


Table 2: Revealed biomarkers in prognostic machine learning studies

| Study | Data modality | Biomarkers | Outcome |
|---|---|---|---|
| (Neuro)biological biomarkers | | | |
| Koutsouleris et al. (2009) | sMRI | Whole-brain anatomical volumes: CSF, GM, total intracranial, ventricular, WM | ARMS-T/ARMS-NT/HC |
| Gothelf et al. (2011) | sMRI | GM and WM alterations: lesser left dPFC and dorsal cingulum; greater mPFC, right amygdala and orbitofrontal cortex | 22q11.2DS (BPRS score high/low) |
| Koutsouleris et al. (2015) | sMRI | Grey matter volume alterations in prefrontal, perisylvian and subcortical structures | ARMS |
| Perkins et al. (2015) | Blood plasma analysis | 15 plasma analytes reflecting inflammation, oxidative stress, hormones and metabolism | CHR-T/CHR-NT |
| Ramyead et al. (2015) | EEG | CSD in beta1/beta2/gamma activity (STG, IPL, precuneus); LPS in beta1/beta2/gamma | ARMS/ARMS-T |
| Kambeitz-Ilankovic et al. (2016) | sMRI | Cortical area reductions in superior temporal, inferior frontal and inferior parietal areas | ARMS (GAF poor/good outcome) |
| Das et al. (2018) | sMRI | Network properties of gyrification-based connectome: characteristic path length, clustering coefficient, small-world index, transitivity, average degree, assortativity | FEP and ARMS-T / ARMS-NT and HC |
| Zarogianni et al. (2019) | sMRI | GM alterations in the ARMS-T group: cerebellum, STpole, right ACC, right superior MFC, left orbitofrontal cortex, bilateral insula; GM alterations in the ARMS-NT group: right IPL, right MTL, right orbitofrontal cortex and left pallidum | ARMS-T/ARMS-NT |
| Neurocognitive biomarkers | | | |
| Koutsouleris et al. (2012) | Neurocognitive clinical data | Executive functions; learning disabilities (MWT-B, DST, TMT-B, RAVLT-DR and RAVLT-Ret) | ARMS-T/ARMS-NT |
| Bedi et al. (2015) | Language (free speech) | Semantic coherence; maximum phrase length; use of determiners | CHR-T/CHR-NT |
| Rezaii et al. (2019) | Language (interviews) | Low semantic density speech; increased tendency to talk about sound | Converter/non-converter |
| Clinical biomarkers | | | |
| Mechelli et al. (2017) | Clinical data | Transition: disorder of thought content; attenuated positive symptoms and functioning; symptoms based on BPRS and CAARMS. Functioning: attention disturbances; anhedonia-asociality; disorder of thought content | SOFAS > 50 good functioning; continuous outcome |
| Leighton et al. (2019a) | Clinical data | Baseline EET; living with spouse/children; higher PANSS suspiciousness, hostility and delusions scores; PANSS affective symptoms; PANSS passive social withdrawal symptoms | EET; symptom remission |
| Leighton et al. (2019b) | Clinical data | Symptom remission; social recovery; vocational recovery; quality of life | Clinical scores higher than median = transitioned |
| Multi-modal biomarkers | | | |
| Chan et al. (2015) | Immunoassay and clinical data | 26 analytes; CAARMS positive score | Converter/non-converter |
| de Wit et al. (2017) | sMRI and clinical data | Subcortical volume, gyrification, cortical thickness, surface area and cortical volume (binary classification); SIPS positive, negative and disorganization scales | mGAF; SIPS (negative/positive/disorganization); resilient/non-resilient |
| Zarogianni et al. (2017) | sMRI, memory and clinical data | GM volumes frontal, orbitofrontal, occipital, medial STL and cerebellum; memory test (RAVLT); self-completed schizotypy questionnaire | Converter/non-converter |
| Koutsouleris et al. (2018) | sMRI and clinical data | CHR-T: lower GM in mPFC and temporo-parieto-occipital cortex, higher GM in cerebellum and DLPFC; lower GM in medio-temporal cortex, higher GM in prefrontal-peri-Sylvian cortex; lower functioning | Good/poor functioning (GF:Social, GF:Role) |

Abbreviations: 22q11.2DS = 22q11.2 deletion syndrome, ACC = anterior cingulate cortex, ARMS = at risk mental state, ARMS-NT = at risk mental state, not transitioned, ARMS-T = at risk mental state, transitioned, BPRS = Brief Psychiatric Rating Scale, CAARMS = Comprehensive Assessment of At-Risk Mental States, CHR = clinical high risk, CHR-NT = clinical high risk, not transitioned, CHR-T = clinical high risk, transitioned, CSD = current-source density, CSF = cerebrospinal fluid, DLPFC = dorsolateral prefrontal cortex, dPFC = dorsal prefrontal cortex, DST = digit symbol test, EET = employment, education or training, FEP = first-episode psychosis, GAF = Global Assessment of Functioning, GM = grey matter, HC = healthy controls, IPL = inferior parietal lobe, LPS = lagged phase synchronization, MFC = medial frontal cortex, mGAF = modified Global Assessment of Functioning, mPFC = medial prefrontal cortex, MTL = medial temporal lobe, MWT-B = Mehrfach-Wortschatztest B, PANSS = Positive and Negative Syndrome Scale, RAVLT-DR = Rey Auditory Verbal Learning Test, delayed recall, RAVLT-Ret = Rey Auditory Verbal Learning Test, retention, SIPS = Structured Interview of Prodromal Symptoms, SOFAS = Social and Occupational Functioning Assessment Scale, STG = superior temporal gyrus, STL = superior temporal lobe, STpole = superior temporal pole, TMT-B = Trail-making test, part B, WM = white matter.

Unfortunately, none of the selected prediction studies applied clustering into subgroups before investigating predictive performance.
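As a purely illustrative sketch (synthetic data; none of the reviewed studies implemented this), such a cluster-then-predict pipeline could look as follows:

```python
# Cluster a heterogeneous sample first, then fit a predictive model per subgroup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))      # synthetic biomarker data
y = rng.integers(0, 2, size=200)    # synthetic outcome labels

# Unsupervised subgrouping (the outcome labels are not used here).
subgroup = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for k in range(2):
    mask = subgroup == k
    acc = cross_val_score(SVC(kernel="linear"), X[mask], y[mask], cv=5).mean()
    print(f"subgroup {k}: n = {mask.sum()}, CV accuracy = {acc:.2f}")
# With real, structured data one would compare these per-subgroup accuracies
# against a single model fitted to the whole heterogeneous sample.
```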

Multi-modal approaches help to grasp the underlying heterogeneity by taking into account the multiple modalities in which psychosis manifests itself. This is important as it can take into account interactions between different modalities. There are, for example, known gene-environment interactions, whereby a genetic predisposition will only result in the development of psychosis when environmentally triggered. This makes the results of the last four studies in Table 1, the multi-modal studies, even more important.

Besides the heterogeneity in the etiology and patient groups of psychosis itself, an important problem is methodological heterogeneity. This completely different form of heterogeneity is influenced by decisions of researchers in the field. In the current literature selection, there are differing types of risk groups with different risk ratios. For example, the study of Zarogianni and colleagues (2017) defined the at-risk population as individuals with at least two relatives with schizophrenia. The resulting conversion rate was too small to compare results with other studies, which is why the authors decided to exclude the individuals not showing any symptoms after the follow-up period. The resulting model had to classify subjects into the following two groups: high risk individuals that developed psychosis and high risk individuals with at least some symptoms. It remains difficult to compare such non-specific populations with other at risk populations. Besides, disease outcome in the selected literature is measured by various assessment tools, and even within the same assessment tool transition is defined by different thresholds. In the selected 18 studies, 8 different algorithms are used with 9 different types of cross-validation. These diverse methods make it complicated to compare prognostic studies and they hinder generalizability of predictors.

3.7 Generalizability of machine learning models

Homogenising methodological tools by setting up best practices is crucial to ensure generalizability in the future. Until then, well-performed cross-validation is necessary to tackle overfitting and to increase generalizability of models. As mentioned earlier, nested cross-validation is preferred, to keep test sets independent of parameter optimization. Only 5 out of 18 studies used this form of cross-validation. In 6 out of 18 studies external validation was used to assure generalizability. Future work is required to assess external validity and to replicate studies.

Generalizability is especially important given the heterogeneous presentation of psychosis across populations with sociodemographic differences, and given the use of different scanners. It is thus relevant to use multi-site samples, with patient cohorts from different institutes and continents. This captures heterogeneity on different levels and results in better generalizability of the models. The study by Koutsouleris and colleagues (2018) was one study that used multi-site samples to construct the model, with a model performance around 74%. Only one other study used multi-site data for both constructing the models and external validation, with a mean internal performance of 68% using leave-one-site-out cross-validation and a mean external performance of also 68% (Leighton et al., 2019b).
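A minimal sketch of such leave-one-site-out validation, assuming each subject carries a site label, is shown below; the data and site assignments are synthetic.

```python
# Leave-one-site-out cross-validation: each fold holds out one entire site.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))           # synthetic biomarker data
y = rng.integers(0, 2, size=150)        # synthetic outcome labels
site = rng.integers(0, 5, size=150)     # which of 5 hypothetical sites each subject came from

logo = LeaveOneGroupOut()
scores = cross_val_score(SVC(kernel="linear"), X, y, groups=site, cv=logo)
print("per-site accuracies:", scores)   # one score per held-out site
```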

Both for heterogeneity issues and for generalizability it is necessary to use less strict inclusion criteria, resulting in larger, naturalistic samples. Besides, within smaller samples there is a high risk of overfitting, which affects generalizability. The SVM algorithm, used in 50% of the studies, is of specific use in such cases, as it avoids overfitting in samples with high numbers of variables. The largest sample consisted of 1764 participants (Leighton et al., 2019b), while the overall median was 50 participants. The high accuracies of studies are possibly partly influenced by overfitting due to small sample sizes. A negative relation between sample size and accuracy was also found in the review of Sanfelici and colleagues (in press).


Another element that affects generalizability is the follow-up duration, which must be long enough to capture all conversions to psychosis. Follow-up duration in the selected literature ranges from 11 months to 7.5 years, although the mean duration of the prodromal phase in CHR is 21.6 months, with one third of individuals transitioning within 1 year, two thirds within 2 years and the remaining third after more than 2 years (Powers et al., 2020). Especially when at risk populations with lower risk ratios are investigated, the follow-up duration must be long enough. Gothelf and colleagues (2011), investigating 22q11.2 deletion syndrome in subjects who did not show signs of developing psychosis, indeed used a long follow-up duration of 5 years. However, the follow-up duration of 2.5 years in the familial high risk group might be too short (Zarogianni et al., 2017).

3.8 Objectivity

One of the main advantages of machine learning is its objectivity. This objectivity must be carefully considered when performing machine learning, as biases hinder it. An example is publication bias, whereby researchers have a sizable influence on model parameters to get the best model accuracies and only the best models are reported in the paper. Besides, machine learning only remains objective when no arbitrary choices by humans are made. Arbitrary choices that hinder objectivity are choices during the preselection procedure and the thresholds in binary classification problems (Fusar-Poli & Van Os, 2013). Recent work succeeded in investigating diagnosis of psychosis outside of clinical borders, where subgrouping is handled by the algorithm itself (Clementz et al., 2016). Given the fluctuations in disease transition (Fusar-Poli et al., 2012), there is a need for more objective methods. Therefore this section will mainly focus on the promising machine learning studies with continuous outcome variables, free of subjective judgements. Furthermore, such objective methods acknowledge the idea of a continuum from the “normal” mental state to the psychotic mental state.

Two studies in the selected literature predicted continuous disease outcome. The models of de Wit and colleagues (2017) yielded significant correlations for gyrification and subcortical volumes when the SIPS negative, SIPS disorganization and modified GAF scores were predicted, but not when the SIPS positive score was predicted, see Table 2. Interestingly, the weight of the posterior part of the corpus callosum significantly contributed to predicting the continuous scores. Not all combinations of predictors and outcomes resulted in significant correlation coefficients. In the study of Mechelli and colleagues (2017), three key clinical factors from a large variety of clinical measures (attention disturbances, anhedonia-asociality and disorder of thought content) were able to predict the continuous Social and Occupational Functioning Assessment Scale score, see Table 2. These studies confirm that, using machine learning, continuous functional outcome can be predicted in individuals at risk for psychosis.

3.9 Clinical applicability

There is a large gap between developing clinical tools and the current machine learning models. As earlier research described, important aspects of translation to clinical tools include real-world validation, clinical utility, feasibility, acceptability and safety (Mechelli & Vieira, 2020). Before machine learning models will be applicable, validation with real-world data is necessary, as well as more naturalistic samples in future research. Less strict inclusion criteria will lead to a better representation of the heterogeneous at risk groups, and thus to better generalizability and feasible applicability.

In terms of clinical utility, the authors refer to the necessary added value of machine learning over clinical assessment (Mechelli & Vieira, 2020). Current tools to predict psychosis include psychological interviews and clinical assessment. Individual risk models for psychosis in a clinical setting already exist: online individualized risk calculators that output a probability based on clinical information, such as age, ethnicity and clinical assessment scores (Cannon et al., 2016; Fusar-Poli et al., 2017). The predictors of these prognostic models are selected using regression methods. The predictive performance of such risk calculator models is comparable to machine learning models that are based on the same clinical predictors (Fusar-Poli et al., 2019). Thus, the added value of machine learning remains arguable, especially since the resulting prognostic predictors are more difficult to interpret. Machine learning can, however, be a robust method to examine high-dimensional data, like sMRI, and can be of use when there is a lack of prior clinical measures (Fusar-Poli et al., 2019). Besides, this review revealed that machine learning models based on the clinical modality have slightly lower performances than the other modalities, including the multi-modal approaches, which have not been compared to the current risk calculators. Regarding the reported accuracies, there is no clear consensus on the level of sensitivity and specificity needed for clinical applicability, although some argue that it must be higher than 80% for both (First et al., 2012). Considering only model performance, some of the models in the selected literature are good candidates to become clinically applicable. Yet, this estimate of 80% should probably apply to externally validated performance, given the risk of overfitting. Besides, it must be considered whether an accuracy of 80% is enough to fully rely on a machine learning tool. This depends on the purpose of the tool; for medical treatment, one should strive for higher accuracies. On the other hand, such high performances are not necessarily needed when difficult decisions are being made. For example, in “toss-up” decisions, where clinicians have a 50% chance of guessing right, a model with 70% sensitivity and specificity will already be of value.

Along with better prediction of disease outcome, for clinical utility one can also think of predicting treatment response and psychosis relapse. Although this was not within the scope of this review, researchers have attempted to predict response to psychiatric medication (Koutsouleris et al., 2016) as well as future mental healthcare consumption (Kwakernaak et al., 2020) using machine learning. The study of Koutsouleris and colleagues (2016), examining response to psychiatric medication, was a multi-site study with well-performed cross-validation, and revealed a mean sensitivity of 65.6% and a specificity of 73.9% for predicting poor versus good response outcome. The models were based on a wide range of clinical data using various assessment tools. Kwakernaak and colleagues (2020) found that models based on the number of psychotic episodes, total care needs, GAF disabilities, paid employment and GAF symptoms were able to predict binary (low/high) healthcare consumption over the next three years. The models had a mean sensitivity of 35% and a mean specificity of 86%. Since both studies had an exploratory approach, testing multiple algorithms, and the study on mental healthcare consumption reported low sensitivity, further research in these areas is needed. However, there is an unmistakable value of such research for the development of tools for clinical practice.

Regarding practicality, acceptability, safety and economic viability, data modalities such as EEG and baseline clinical data would be favourable, as they are low-cost and practical measures. Acceptability might be a problem for measures like sMRI, as patients with acute psychosis might not consent to them. One can imagine that, for example, patients with persecutory delusions might find it terrifying to lie in such scanners. Acceptability among clinicians is another obstacle when developing clinical tools: clinicians have to trust the abilities of the machine learning tool, weigh its output carefully and not misuse it. Ethical acceptance plays another important role: is it ethical to make decisions based on computer models? Are these machine learning tools fully objective, or could they lead to early stigmatization? Additional questions concern how decisions based on machine learning tools should be communicated in psychiatry (Martinez-Martin et al., 2018). These important questions will have to be considered before machine learning tools can be clinically applicable.


4 Personal opinion and future directions

This review discussed the possibilities of machine learning and its clinical applicability in predicting psychosis. The discussed literature revealed interesting patterns that can be used for further research in psychosis. Furthermore, current studies show quite high model performances, some even outperforming clinical estimations. However, lack of external validation, heterogeneous methods and biases hinder the development of clinical machine learning tools. One obstacle in current research is methodological heterogeneity, which can be addressed by homogenising methods and setting up guidelines for practicing machine learning in this field. Given the lack of internal and external validation, it is important to shift the focus towards external validation using independent datasets, and towards replication. Replication can be as valuable as, or even more valuable than, investigating new biomarkers. At all times, the objectivity of the methods used must be considered, in order to ultimately construct an objective predictor of psychosis. Although predictive machine learning is not clinically applicable yet, there are some main reasons why we would benefit from continuing the search for predictors of psychosis outcome using machine learning studies.
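To make the distinction between internal and external validation concrete, the sketch below contrasts k-fold cross-validation within a single discovery cohort with one evaluation on an independent cohort. All data here are random placeholders; in practice the two cohorts would be separate clinical samples, ideally collected at different sites.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical discovery cohort (site A) and independent external cohort (site B)
X_a, y_a = rng.normal(size=(120, 30)), rng.integers(0, 2, 120)
X_b, y_b = rng.normal(size=(80, 30)), rng.integers(0, 2, 80)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Internal validation: 5-fold cross-validation within the discovery cohort
internal = cross_val_score(model, X_a, y_a, cv=5, scoring="balanced_accuracy")

# External validation: fit once on site A, then test on the unseen site B
model.fit(X_a, y_a)
external = balanced_accuracy_score(y_b, model.predict(X_b))

print(f"internal CV: {internal.mean():.2f}  external: {external:.2f}")
```

With random placeholder data both scores hover around chance level (0.5); the point is the procedure: the external cohort is never touched during model fitting, so its score estimates how well the model generalizes beyond the discovery sample.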

First of all, the specific measures currently used in the field are not able to predict disease outcome well enough. Although these measures may capture the firing of groups of neurons through EEG, the exact volumes of brain areas through sMRI, or behaviour through dozens of assessed facets, no ultimate biomarker has revealed itself in predictive psychosis research. This problem reflects complexity at different levels. To begin with, brains can be described as complex systems with interacting elements at the level of neurotransmitters, neurons and neuronal pathways. Next, psychosis, like other psychiatric diseases, is known to have a complex nature and, likewise, a complex disease course. Linear thinking, the idea that a handful of factors contribute to a static disease, stands in the way of uncovering the actual developmental characteristics of psychosis. In line with this, psychosis might be seen as an altered state of the mind, strengthened by different factors that contribute to the disease. Defining these factors on a continuous scale alongside the “normal” state of the brain would be a preferable strategy. The amount of complexity in the brain and its projection onto psychiatric disorders calls for complex problem solving. Machine learning holds the possibility to capture complex interactions and build models upon them. Machine learning research shows that making the variables more complex, by adding variables from different modalities, indeed favours model performance, both in diagnostic studies and in the selected prognostic studies (Schnack, 2019).
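As an illustration of such multi-modal modelling, the sketch below concatenates placeholder feature blocks from three modalities (“early fusion”) and compares a single-modality model with a multi-modal one under cross-validation. All features are random stand-ins; any multi-modal gain would of course only emerge with genuinely informative measures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 100  # hypothetical number of at-risk individuals

# Placeholder features per modality (random values; real studies would use
# symptom scores, regional grey-matter volumes, spectral EEG power, etc.)
clinical = rng.normal(size=(n, 10))
smri = rng.normal(size=(n, 50))
eeg = rng.normal(size=(n, 20))
y = rng.integers(0, 2, n)  # transition vs. no transition at follow-up

# Early fusion: concatenate the modalities into one feature matrix
X_multimodal = np.hstack([clinical, smri, eeg])

for name, X in [("clinical only", clinical), ("multi-modal", X_multimodal)]:
    score = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                            cv=5, scoring="balanced_accuracy").mean()
    print(f"{name}: {score:.2f}")
```

Early fusion at the feature level is only one design choice; modalities can also be combined at the decision level, for instance by averaging the outputs of separate per-modality models.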

Secondly, the fact that machine learning is able to reveal patterns above chance suggests that the gained knowledge might add something to current psychological research. It is important not to set computational methods and psychological assessment against each other, as it is much more valuable to learn from the best parts of both. Machine learning might not substitute human assessment, but it can help clinicians in difficult decisions. Although pattern abstraction beyond one’s understanding might sound intimidating, machine learning will evolve as computers become more powerful, and collaboration between the disciplines is unavoidable. The progression of machine learning itself is a valid reason to continue machine learning research, as other possibilities may reveal themselves in the future. However, to fully benefit from machine learning and its ability to look for complex predictors, more multi-modal studies must be performed. Besides, disease outcome should be considered on a qualitative, continuous scale. Also, fewer inclusion criteria should be used, resulting in larger, heterogeneous and naturalistic samples, with the possibility to homogenise patients into subgroups using unsupervised clustering approaches (a minimal sketch follows below). Even if all this works out, and the models show the least over- or underfitting possible, irreducible error will still exist and hinder generalizability. Even if externally validated models with close to 100% accuracy become available, the heterogeneity of the patient population will cause misclassifications in a clinical setting. Thus, ethical questions will have to be discussed, and it remains questionable whether machine learning will ultimately replace the personal assessment of clinicians. In a more likely scenario, the tools will be of additional help to clinicians, accompanied by clear guidelines on how to communicate decisions that are partly based on machine learning. Fortunately, large multi-site studies have recently been set up, such as the PSYSCAN program (Tognin et al., 2020), with the goal of increasing the amount of available data, stratifying methods, externally validating models and ultimately bringing machine learning models into clinical practice. The program also aims for collaboration with other large databases, via the Harmonization of At Risk Multisite Observational Networks for Youth (HARMONY), which includes several programs located in Europe and North America (Tognin et al., 2020).
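As referenced above, a minimal sketch of such an unsupervised clustering approach, assuming hypothetical standardized baseline features and using k-means with a silhouette criterion (one reasonable choice among many), could look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Placeholder baseline features for a heterogeneous at-risk sample
X = StandardScaler().fit_transform(rng.normal(size=(150, 12)))

# Try several cluster counts and keep the most coherent solution
best_k, best_score = 2, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher = better-separated subgroups
    if score > best_score:
        best_k, best_score = k, score

print(f"most coherent solution: k = {best_k} (silhouette = {best_score:.2f})")
```

On random placeholder data the silhouette scores will be low; with real baseline measures, stable and well-separated clusters could serve as candidate patient subgroups for which separate prognostic models are built.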

4.1 Conclusion

This review selected 18 prognostic psychosis studies to investigate the added value of machine learning in the field. Machine learning lends itself to finding more complex patterns due to its ability to make predictions based on multiple variables from different modalities. The constructed models obtained quite high performances, which would in principle be clinically applicable. However, this study identified some major factors that hinder clinical applicability: heterogeneity in psychosis and measurements, objectivity of the models and low generalizability. Further research should focus on finding complex patterns in multi-modal studies, predicting non-arbitrary continuous disease outcome, using heterogeneous multi-site samples and performing cross-validation and external validation. First steps have already been taken by new programs such as the PSYSCAN project. With the further evolution of the machine learning field, continuing research in this area will ultimately provide a better understanding of the complex development of psychosis.


References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics surveys, 4, 40-79.

Beck, K., Andreou, C., Studerus, E., Heitz, U., Ittig, S., Leanza, L., et al. (2019). Clinical and functional long-term outcome of patients at clinical high risk (CHR) for psychosis without transition to psychosis: A systematic review. Schizophrenia research, 210, 39-47.

Bedi, G., Carrillo, F., Cecchi, G.A., Slezak, D.F., Sigman, M., Mota, N.B., et al. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. NPJ Schizophrenia, 1(1), 1-7.

Bernard, J.A., Orr, J.M., & Mittal, V.A. (2015). Abnormal hippocampal–thalamic white matter tract development and positive symptom course in individuals at ultra-high risk for psychosis. NPJ Schizophrenia, 1(1), 1-6.

Cannon, T.D., Yu, C., Addington, J., Bearden, C.E., Cadenhead, K.S., Cornblatt, B.A., et al. (2016). An individualized risk calculator for research in prodromal psychosis. American Journal of Psychiatry, 173(10), 980-988.

Chan, M.K., Krebs, M.O., Cox, D., Guest, P.C., Yolken, R.H., Rahmoune, H., et al. (2015). Development of a blood-based molecular biomarker test for identification of schizophrenia before disease onset. Translational psychiatry, 5(7), e601-e601.

Clementz, B.A., Sweeney, J.A., Hamm, J.P., Ivleva, E.I., Ethridge, L.E., Pearlson, G.D., et al. (2016). Identification of distinct psychosis biotypes using brain-based biomarkers. American Journal of Psychiatry, 173(4), 373-384.

Das, T., Borgwardt, S., Hauke, D.J., Harrisberger, F., Lang, U.E., Riecher-Rössler, A., et al. (2018). Disorganized gyrification network properties during the transition to psychosis. JAMA psychiatry, 75(6), 613-622.

Deng, C., & Dean, B. (2013). Mapping the pathophysiology of schizophrenia: interactions between multiple cellular pathways. Frontiers in cellular neuroscience, 7, 238.

DeVylder, J.E., Muchomba, F.M., Gill, K.E., Ben-David, S., Walder, D.J., Malaspina, D., et al. (2014). Symptom trajectories and psychosis onset in a clinical high-risk cohort: the relevance of subthreshold thought disorder. Schizophrenia research, 159(2-3), 278-283.


First, M., Botteron, K., Carter, C., Castellanos, F.X., Dickstein, D.P., Drevets, W.C., et al. (2012). Consensus Report of the APA Work Group on Neuroimaging Markers of Psychiatric Disorders. Available at: https://www.researchgate.net/publication/261507750. Accessed June 25, 2020.

Fusar-Poli, P., Bonoldi, I., Yung, A.R., Borgwardt, S., Kempton, M.J., Valmaggia, L., et al. (2012). Predicting psychosis: meta-analysis of transition outcomes in individuals at high clinical risk. Archives of general psychiatry, 69(3), 220-229.

Fusar-Poli, P., Borgwardt, S., Crescini, A., Deste, G., Kempton, M.J., Lawrie, S., et al. (2011). Neuroanatomy of vulnerability to psychosis: a voxel-based meta-analysis. Neuroscience & Biobehavioral Reviews, 35(5), 1175-1185.

Fusar-Poli, P., Cappucciati, M., Borgwardt, S., Woods, S.W., Addington, J., Nelson, B., et al. (2016). Heterogeneity of psychosis risk within individuals at clinical high risk: a meta-analytical stratification. JAMA psychiatry, 73(2), 113-120.

Fusar-Poli, P., & van Os, J. (2013). Lost in transition: setting the psychosis threshold in prodromal research. Acta Psychiatrica Scandinavica, 127(3), 248-252.

Fusar-Poli, P., Rutigliano, G., Stahl, D., Davies, C., Bonoldi, I., Reilly, T., et al. (2017). Development and validation of a clinically based risk calculator for the transdiagnostic prediction of psychosis. JAMA psychiatry, 74(5), 493-500.

Fusar-Poli, P., Stringer, D., Durieux, A.M., Rutigliano, G., Bonoldi, I., De Micheli, A., et al. (2019). Clinical-learning versus machine-learning for transdiagnostic prediction of psychosis onset in individuals at-risk. Translational Psychiatry, 9(1), 1-11.

Gothelf, D., Hoeft, F., Ueno, T., Sugiura, L., Lee, A.D., Thompson, P., et al. (2011). Developmental changes in multivariate neuroanatomical patterns that predict risk for psychosis in 22q11.2 deletion syndrome. Journal of psychiatric research, 45(3), 322-331.

Janssen, R.J., Mourão-Miranda, J., & Schnack, H.G. (2018). Making individual prognoses in psychiatry using neuroimaging and machine learning. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9), 798-808.

Kambeitz-Ilankovic, L., Meisenzahl, E.M., Cabral, C., von Saldern, S., Kambeitz, J., Falkai, P., et al. (2016). Prediction of outcome in the psychosis prodrome using neuroanatomical pattern classification. Schizophrenia research, 173(3), 159-165.

Koutsouleris, N., Davatzikos, C., Bottlender, R., Patschurek-Kliche, K., Scheuerecker, J., Decker, P., et al. (2012). Early recognition and disease prediction in the at-risk mental states for psychosis using neurocognitive pattern classification. Schizophrenia bulletin, 38(6), 1200-1215.

Koutsouleris, N., Kahn, R.S., Chekroud, A.M., Leucht, S., Falkai, P., Wobrock, T., et al. (2016). Multisite prediction of 4-week and 52-week treatment outcomes in patients with first-episode psychosis: a machine learning approach. The Lancet Psychiatry, 3(10), 935-946.



Koutsouleris, N., Kambeitz-Ilankovic, L., Ruhrmann, S., Rosen, M., Ruef, A., Dwyer, D.B., et al. (2018). Prediction models of functional outcomes for individuals in the clinical high-risk state for psychosis or with recent-onset depression: a multimodal, multisite machine learning analysis. JAMA psychiatry, 75(11), 1156-1172.

Koutsouleris, N., Meisenzahl, E.M., Davatzikos, C., Bottlender, R., Frodl, T., Scheuerecker, J., et al. (2009). Use of neuroanatomical pattern classification to identify subjects in at-risk mental states of psychosis and predict disease transition. Archives of general psychiatry, 66(7), 700-712.

Koutsouleris, N., Riecher-Rössler, A., Meisenzahl, E.M., Smieskova, R., Studerus, E., Kambeitz-Ilankovic, L., et al. (2015). Detecting the psychosis prodrome across high-risk populations using neuroanatomical biomarkers. Schizophrenia bulletin, 41(2), 471-482.

Kwakernaak, S., van Mens, K., Cahn, W., Janssen, R., & GROUP Investigators. (2020). Using machine learning to predict mental healthcare consumption in non-affective psychosis. Schizophrenia Research, 218, 166-172.

Lally, J., Ajnakina, O., Stubbs, B., Cullinane, M., Murphy, K.C., Gaughran, F., & Murray, R.M. (2017). Remission and recovery from first-episode psychosis in adults: systematic review and meta-analysis of long-term outcome studies. The British Journal of Psychiatry, 211(6), 350-358.

Leighton, S.P., Krishnadas, R., Chung, K., Blair, A., Brown, S., Clark, S., et al. (2019a). Predicting one-year outcome in first episode psychosis using machine learning. PLoS ONE, 14(3), e0212846.

Leighton, S.P., Upthegrove, R., Krishnadas, R., Benros, M.E., Broome, M.R., Gkoutos, G.V., et al. (2019b). Development and validation of multivariable prediction models of remission, recovery, and quality of life outcomes in people with first episode psychosis: a machine learning approach. The Lancet Digital Health, 1(6), e261-e270.

Martinez-Martin, N., Dunn, L.B., & Roberts, L.W. (2018). Is it ethical to use prognostic estimates from machine learning to treat psychosis? AMA journal of ethics, 20(9), E804.

Mechelli, A., Lin, A., Wood, S., McGorry, P., Amminger, P., Tognin, S., et al. (2017). Using clinical information to make individualized prognostic predictions in people at ultra high risk for psychosis. Schizophrenia research, 184, 32-38.

Mechelli, A., & Vieira, S. (2020). From models to tools: clinical translation of machine learning studies in psychosis. NPJ Schizophrenia, 6(1), 1-3.

Murphy, K.C., Jones, L.A., & Owen, M.J. (1999). High rates of schizophrenia in adults with velo-cardio-facial syndrome. Archives of general psychiatry, 56(10), 940-945.

Perkins, D.O., Jeffries, C.D., Addington, J., Bearden, C.E., Cadenhead, K.S., Cannon, T.D., et al. (2015). Towards a psychosis risk blood diagnostic for persons experiencing high-risk symptoms: preliminary results from the NAPLS project. Schizophrenia bulletin, 41(2), 419-428.
