A retrospective study: Predicting clinical improvement in motor function for Parkinson's Disease patients after bilateral Subthalamic Nucleus Deep Brain Stimulation based on preoperative clinical features


January 22, 2021

Karina van den Berg Student number 10744614

k.d.vandenberg@amsterdamumc.nl

Bachelor Thesis

Course code 5102BPWS0Y (30 European credits) Bachelor of Science

University of Amsterdam, The Netherlands Academic year: 2020-2021 Word count: ±3400

Direct supervisor

Dr. M. Beudel MD PhD m.beudel@amsterdamumc.nl Neurologist

Amsterdam UMC, location AMC


1 Abbreviations

The next list describes several abbreviations and definitions.

αSYN alpha-synuclein protein

Accuracy calculation: $\frac{TP + TN}{TP + TN + FP + FN}$

AUROC Area Under the Receiver Operating Characteristic curve

DBS Deep brain stimulation

LEDD Levodopa Equivalent Daily Dosage in milligrams

LR Levodopa response; preoperative difference between UPDRS part three off and on condition

Motor improvement calculation: $100 \cdot \frac{PRE_{off} - POST_{off/on}}{PRE_{off}}$

PD Parkinson’s disease

POST-AA/BB Postoperative assessment with or without medication (AA=off/on) and stimulation (BB=off/on)

PRE-off Preoperative off condition = assessment without dopaminergic medication (last intake 12 hours ago)

PRE-on Preoperative on condition = assessment ±1 hour after intake of dopaminergic medication

Precision calculation: $\frac{TP}{TP + FP}$

sd Standard deviation

Sensitivity calculation: $\frac{TP}{TP + FN}$

Specificity calculation: $\frac{TN}{TN + FP}$

STN Subthalamic nucleus

TP, FP, TN, FN True positive, false positive, true negative, false negative


2 Abstract

Background Deep Brain Stimulation (DBS) is the next step for Parkinson’s Disease patients who experience unpredictable motor fluctuations. One of the limitations of DBS is the lack of clarity on which patients would benefit most. This could be improved by using supervised machine learning (ML) to generate accurate predictions of motor improvement after DBS. This study aimed to develop a logistic regression model that predicts clinical improvement by identifying patients who have a high probability of becoming non-responders.

Methods A forward stepping logistic regression classifier (supervised learning) was used to determine which preoperative predictors could separate the two classes (non-responders = 44, responders = 179). Non-responders were defined as patients with less than 30% motor improvement on the Unified Parkinson’s Disease Rating Scale (UPDRS) part III. Sex, disease duration, Levodopa response (LR) and Levodopa Equivalent Daily Dosage (LEDD) were used to train the model. Model performance was evaluated using 5-fold cross-validation, accuracy, sensitivity, specificity and precision.

Results The supervised learning algorithm reached an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.67, an accuracy of 79.8%, a sensitivity of 2%, a specificity of 96% and a precision of 12.5%. The number of correctly classified non-responders (true positives) was smaller than the number of responders incorrectly classified as non-responders (false positives).

Conclusion The logistic regression model was not able to distinguish between the two patient groups. These results suggest that a different approach is needed to make clinically relevant predictions about individual outcome probabilities.


3 Introduction

Parkinson’s Disease (PD) is a progressive neurodegenerative hypokinetic movement disorder. Most patients experience their first symptoms between the ages of 50 and 70. Worldwide prevalence is 0.3% in the total population and 4% among the elderly (Hijdra et al., 2016). The Dutch National Institute for Public Health and the Environment (RIVM) estimated a 71% absolute increase in the number of patients with PD over 20 years due to the ageing population (Toekomst Verkenning Volksgezondheid, 2018). Classic motor symptoms are hypo-/bradykinesia, tremor, and rigidity. Furthermore, PD is accompanied by an array of non-motor symptoms (see figure 1 and table 1 (Hijdra et al., 2016)).

Figure 1: Parkinson’s Disease patients, sketch by R.W. Gowers 1886

| Motor symptoms | Non-motor symptoms |
| --- | --- |
| Hypo-/bradykinesia | Hyposmia |
| Tremor | Mood disorder |
| Rigidity | Anxiety |
| Postural imbalance | Hypophonia |
| Stooped posture | Micrographia |
| Masked facial expression | Obstipation |
| Shuffling | Dysphagia |
| Short-stepped gait | Cognitive disorders |

Table 1: Examples of (non)motor symptoms.

The exact pathophysiology is unknown, although recent theories state that the neuronal loss is induced via the gut-brain axis (Houser & Tansey, 2017). Aggregation of α-synuclein (αSYN) causes abnormal oscillatory activity in the basal ganglia motor pathway. This is most robust in the subthalamic nucleus (STN) and globus pallidus pars interna (GPi) (Hijdra et al., 2016; Herrington et al., 2016). The affected pathway of the neurotransmitter dopamine (DA) explains the development of PD-specific symptoms (see appendix 10.1 for a more in-depth explanation of the motor pathway).

Treatment is symptom-based. In the beginning, dopaminergic agonists and Levodopa (L-dopa) are most effective, but dyskinesia and motor fluctuations develop after 4-5 years (Williams et al., 2010; Weaver, 2009). The unpredictable fluctuations cause the typical ON/OFF time in advanced PD (see figure 2). In advanced stages, PD responds less, or not at all, to dopaminergic treatment.

Figure 2: Motor fluctuations

The yellow area represents the therapeutic window. Neural degeneration decreases the dopaminergic reservoir, resulting in a decreased buffer capacity. A smaller window leads to more OFF time; either hypokinetic or hyperkinetic. The blue pill represents the moment of medication. ©2020 Davis Phinney Foundation; Image courtesy of Boston Scientific


Fortunately, for patients who respond well to DA medication and are severely affected by these fluctuations, deep brain stimulation (DBS) is an option. The mechanism behind DBS remains unclear (Hamani et al., 2017; Herrington et al., 2016). Based on PD’s physiology and the understanding of DA networks, bilateral STN DBS is preferred over other placements to improve motor symptoms (Odekerken et al., 2013). STN DBS in combination with medical therapy improves motor functioning more than medication alone (Deuschl et al., 2006; Weaver, 2009; Williams et al., 2010).

Motor improvement can be measured via the Unified Parkinson’s Disease Rating Scale (UPDRS). This scale covers four aspects of PD: (I) mental activity, behaviour and mood, (II) motor experience of daily living, (III) motor function and (IV) motor complications. Motor improvement is particularly represented in part three (UPDRS-III, scale 0-132) (Rodriguez et al., 2007; Schulman et al., 2010). The UPDRS-III sums the individual motor scores and is assessed in six different conditions (see table 2). The difference between the two PRE conditions is referred to as the Levodopa response (LR). Motor improvement can be defined in two ways: either as the difference between the PRE-off and POST-off/on scores, or as the difference between the POST-off/off and POST-off/on scores. If there is a significant difference between the PRE-off and POST-off/off scores, one could say that favouring PRE-off over POST-off/off as the baseline minimises the influence of DBS on the assessments (i.e. microlesion or residual effects) (Zaidel et al., 2010). The literature is ambiguous about the cutoff, with reported average improvements between 30% and 66% (Bot et al., 2018; Kleiner-Fisman et al., 2006). However, not every patient reaches this objectively measured improvement.

Table 2: UPDRS-III motor scores

Preoperative: PRE-off, PRE-on

Postoperative: POST-off/off, POST-off/on, POST-on/off, POST-on/on

PRE = preoperative assessment, on or off dopaminergic medication. POST-AA/BB = Postoperative assessment with or without medication (AA=off/on) and stimulation (BB=off/on) .

The limitations of DBS come to light in the clinical setting: (a) its effect is varying and confined, (b) it is unclear which patients would benefit most, (c) surgery carries risk, (d) there are adverse effects, and (e) DBS does not stop PD, which remains progressive. Patient selection could be improved if a known combination of preoperative variables accurately predicted postoperative motor improvement.

Linear correlations are widely used to predict motor improvement after DBS. The LR is reported to correlate positively with motor improvement (Charles et al., 2002; Habets et al., 2020; Kleiner-Fisman et al., 2006; Rodriguez et al., 2007). Another preoperative variable (= clinical feature) is based on a subsection of the UPDRS-III, expressed as the sum of questions 9, 10, 11, 12 and 13 as a percentage of the total PRE-on score. This is described as the percentage of axial symptoms a patient experiences. These axial symptoms, such as freezing of gait and postural instability, correlate negatively with motor improvement (Charles et al., 2002; Habets et al., 2020; Welter et al., 2002). Whereas the UPDRS-III covers motor scores, the UPDRS-IV covers motor complications. One of its subsections asks about the number of hours a patient experiences motor complications. These ’hours off’ correlate negatively with motor improvement too (Charles et al., 2002; Welter et al., 2002).
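As a minimal sketch of the axial-symptom feature described here, the snippet below expresses the sum of UPDRS-III questions 9 to 13 as a percentage of the total PRE-on score. The function name, the dictionary layout and the example scores are hypothetical and are not taken from the Amsterdam UMC database.

```python
AXIAL_ITEMS = (9, 10, 11, 12, 13)  # UPDRS-III questions named in the text


def axial_percentage(item_scores):
    """Return the axial items as a percentage of the total PRE-on score.

    item_scores: dict mapping question number -> PRE-on item score
    (hypothetical layout, chosen for illustration only).
    """
    total = sum(item_scores.values())
    if total == 0:
        return 0.0  # nothing scored, so there is no axial share either
    axial = sum(item_scores.get(i, 0) for i in AXIAL_ITEMS)
    return 100.0 * axial / total


# Hypothetical patient: 20 points in total, 3 of them on axial questions
example = {q: 0 for q in range(1, 15)}
example.update({3: 6, 5: 5, 8: 6, 9: 1, 12: 2})
print(axial_percentage(example))  # 15.0
```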

All the previously mentioned papers have described the correlation for single predictive features. However, little is known about the combination of these features. Such combinations can be investigated using machine learning (ML). In supervised learning, an algorithm uses a training data set to create a model that combines weights and features in order to generate accurate predictions on test data (Wan et al., 2019).


Recently, Habets et al. (2020) published an ML prediction model that generates probabilities and differentiates between two classes categorised by motor response. This is important in clinical settings because it could lead to (a) improved patient selection, (b) adjusted patient expectations and (c) postoperative DBS modification if the expected motor improvement is not met. Their study is limited by a small sample size and a relatively difficult outcome parameter.

It is hypothesised that a similar model with a simpler outcome parameter can generate a probability score that accurately predicts whether a patient is going to be a non-responder. The individual outcome probability is considered accurate above 0.7 (Habets et al., 2020). The aim of this study is to develop a model to predict clinical improvement in motor function for PD patients after bilateral STN DBS based on preoperative clinical features.

4 Methods

4.1 Patients and procedures

The data used for the first screening comprised 314 patients who were treated with STN DBS in the Amsterdam UMC and entered in the Parkinson’s DBS database. For the current retrospective study, the selection criteria were: (1) clinically diagnosed idiopathic PD, (2) bilateral STN DBS, and (3) available pre- and postoperative clinical assessments. This resulted in a total of 223 included patients who had neurosurgery for bilateral STN DBS between 2015 and 2020 (see table 3 and appendix 10.2 for baseline characteristics). All patients met the selection criteria for STN DBS and signed informed consent. The DPIA (Data Protection Impact Assessment) had been approved. To summarise the full procedure: baseline assessments determined whether patients qualified for surgery. If approved, surgery was performed (details of the procedure are provided in Bot et al. (2018)). DBS was not yet turned on at that point. Postoperatively, initial programming was performed within two weeks to minimise the influence of lesions caused by the electrode (Koeglsperger et al., 2019). According to the Amsterdam UMC standard protocol, the stimulation amplitude was gradually increased in steps of 0.5 V/mA up to a maximum of 5 V/mA. The stimulation amplitude was set to the minimal value that gave maximum effect without side effects. Over the course of months, changes in DBS settings and medication dosage were made by a neurologist. The postoperative assessment was performed at the follow-up meeting 6 months after surgery.

4.2 Feature selection and outcome definition

Criteria for selecting predictive features were: existing evidence in the literature, availability in the Parkinson’s database and clinical relevance. This resulted in seven evaluated features: (1) sex, (2) age, (3) disease duration, (4) LR, (5) axial symptoms, (6) Levodopa Equivalent Daily Dose (LEDD), and (7) hours off (Charles et al., 2002; Farrokhi et al., 2020; Habets et al., 2020; Jaggi et al., 2004; Kleiner-Fisman et al., 2006; Rodriguez et al., 2007; Welter et al., 2002) (see table 4 for the description of each feature). Motor improvement after DBS was quantified as relative improvement (Zaidel et al., 2010). A minimal improvement of 30% is considered significant (Rodriguez et al., 2007). Therefore, patients with less than 30% improvement were defined as non-responders and patients with at least 30% improvement as responders. These two classes were used as the response variable in the prediction model. The calculation used for relative improvement is:

$$\text{Motor improvement} = 100 \cdot \frac{PRE_{off} - POST_{off/on}}{PRE_{off}}$$
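The outcome definition can be sketched in a few lines of Python. The function names and the example scores are hypothetical; treating an improvement of exactly 30% as a responder is an assumption, since the text only states the cutoff itself.

```python
def relative_improvement(pre_off, post_off_on):
    """Relative motor improvement in percent.

    Implements 100 * (PRE-off - POST-off/on) / PRE-off on UPDRS-III scores.
    """
    if pre_off <= 0:
        raise ValueError("PRE-off score must be positive")
    return 100.0 * (pre_off - post_off_on) / pre_off


def response_class(pre_off, post_off_on, cutoff=30.0):
    """Label a patient with the 30% cutoff (assumption: exactly 30% counts as responder)."""
    improvement = relative_improvement(pre_off, post_off_on)
    return "responder" if improvement >= cutoff else "non-responder"


# Hypothetical UPDRS-III scores, not taken from the database
print(relative_improvement(50, 30))  # 40.0
print(response_class(50, 30))        # responder
print(response_class(50, 40))        # non-responder
```

A negative result simply means that the POST-off/on score is worse than the PRE-off baseline.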


4.3 Statistical methods

The main analysis was performed for the two classes: non-responders (n=44) and responders (n=179). The differences between the groups’ features were analysed with Fisher’s exact test, Mann-Whitney U tests, or two-sided unpaired t-tests, provided that the underlying assumptions of each test were met. Means and standard deviations (sd) were derived from the Parkinson’s DBS database. A significance level of α=0.05 was used.
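For the categorical feature (sex), a two-sided Fisher’s exact test on a 2x2 table can be sketched with the standard library alone. This is an illustration of the test, not the software used for the thesis; the function name is hypothetical, and the female counts are derived from the group sizes and male counts given in the text (44 with 26 male, 179 with 119 male).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # the small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Rows: non-responders, responders; columns: male, female
p_sex = fisher_exact_two_sided(26, 18, 119, 60)
print(round(p_sex, 2))
```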

A general linear model was fitted to calculate the correlation between LR and motor improvement. Using this outcome, the Amsterdam UMC Parkinson’s database could be compared with other papers to assess whether it is similar to the databases of other research groups. Relative LR (%LR) is commonly used for correlations in previous papers; however, absolute LR (aLR) could identify significant relationships that are not visible when solely using %LR (Pieterman et al., 2018). Therefore, both absolute and relative LR and motor improvement are relevant.

4.4 Classification learner

To build a model that predicts motor improvement using ML, a supervised binary classification learner application was used in Matlab R2019a (Natick, MA). This paper used the logistic regression algorithm as the predictive model because it is easy to interpret and the database is relatively small. The model is capable of capturing more complex nonlinear relationships compared to linear models (Farrokhi et al., 2020). The classifier generates outcome probabilities for individual patients (Habets et al., 2020).

A forward stepping logistic regression was used to determine which feature contributed most to the outcome and whether a combination of features would improve the model. This method was used to estimate the individual probability of becoming a non-responder.
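The idea of forward stepping can be illustrated with a generic greedy selection loop. This is a sketch under stated assumptions, not the procedure of the Matlab application: `forward_selection` and `toy_score` are hypothetical names, and the toy scoring function with its made-up gains stands in for fitting a logistic regression and computing a cross-validated AUROC.

```python
def forward_selection(features, score, min_gain=0.0):
    """Greedy forward stepwise selection.

    features: list of candidate feature names.
    score:    callable taking a tuple of feature names and returning a
              performance value (for example a cross-validated AUROC).
    Returns the selected features and the score of the final model.
    """
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        # score every model that adds one more feature to the current set
        candidates = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score - best <= min_gain:
            break  # no candidate improves the model any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def toy_score(subset):
    """Hypothetical stand-in for model fitting: made-up gains per feature."""
    gains = {"LR": 0.14, "disease duration": 0.02, "sex": 0.01}
    return 0.5 + sum(gains.get(f, -0.01) for f in subset)


selected, best = forward_selection(
    ["sex", "age", "disease duration", "LR"], toy_score)
print(selected, round(best, 2))
```

With the toy gains, the loop first picks the strongest single feature and stops as soon as adding another feature no longer raises the score.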

It should be noted that when applying prediction models in a clinical setting, threshold settings are required for meaningful predictions (see subsection 4.4.2 Evaluation of Classifiers).

4.4.1 Data exploration and preparation

Prior to ML modelling, descriptive statistics were explored (table 3 and figures 6 and 7). Furthermore, linear relationships between features were analysed using a correlation heatmap. In this heatmap, a Pearson correlation is calculated for the numerical features. This correlation coefficient shows to what extent features are related (see figure 9 in appendix 10.4).
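The values behind such a heatmap are pairwise Pearson coefficients. The sketch below computes them with the standard library; the function names and the five example patients are hypothetical, and the code assumes that no column is constant (otherwise the denominator would be zero).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations; columns maps feature name -> values."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}


# Hypothetical values for five patients, for illustration only
columns = {
    "LR": [30, 45, 50, 62, 70],
    "improvement": [20, 35, 38, 55, 60],
    "age": [70, 66, 61, 64, 58],
}
matrix = correlation_matrix(columns)
print(round(matrix[("LR", "improvement")], 2))
```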

In the classification learner application, scatter- and parallel plots were investigated (not shown). At first sight, no predictors were identified that could separate the classes linearly.

4.4.2 Evaluation of Classifiers

To extract results from a model, evaluation metrics were used to analyse its performance. First, the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were derived from the model’s confusion matrix. Common statistical measures for performance are sensitivity and specificity.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{Specificity} = \frac{TN}{TN + FP}$$

It is important to define which of the two is considered more important. Sensitivity gives the percentage of non-responders that are correctly identified by the prediction model, whereas specificity gives the percentage of responders that are correctly identified. As this research focused on predicting patients who do not respond to DBS, a high sensitivity would be preferable.


The second evaluation metric is accuracy. Accuracy expresses the correctly classified predictions as a percentage of all predictions made by the model.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{and} \quad \text{Precision} = \frac{TP}{TP + FP}$$

It should be noted that the classes are not equally divided (non-responders = 44, responders = 179). If the model classifies all the responders correctly and all the non-responders incorrectly, there is still an accuracy of 80%. Precision can be used as a metric that is less dominated by the majority class, as the calculation uses only the patients that the model predicts to be positive.
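The effect of the class imbalance on accuracy can be checked with a short sketch. The function name is hypothetical; the counts describe a classifier that labels every one of the 223 patients as responder, which is the situation described in the paragraph above.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, accuracy and precision from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # precision is undefined when no patient is predicted positive
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Every patient is labelled responder (negative class): 44 misses, 179 hits
baseline = classification_metrics(tp=0, fn=44, fp=0, tn=179)
print(round(baseline["accuracy"], 3))  # 0.803
print(baseline["sensitivity"])         # 0.0
```

An accuracy of about 80% is therefore reachable without identifying a single non-responder, which is why accuracy alone is not informative here.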

Performance can additionally be assessed via the Receiver Operating Characteristic curve (ROC curve). For more information see appendix 10.3. Briefly, the ROC curve summarises the confusion matrices that every possible classification threshold generates in the logistic regression. The corresponding Area Under the Curve (AUC) represents the extent to which the model can separate the two classes in terms of the corresponding TP and FP rates. ROC and AUC together will be referred to as AUROC.
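One way to make the AUROC concrete is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way; the function name and the example scores are hypothetical, and this is not the routine used by the Matlab application.

```python
def auroc(scores, labels):
    """AUROC as the chance that a random positive outranks a random negative.

    scores: predicted probabilities of the positive class.
    labels: 1 for positive (non-responder), 0 for negative (responder).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three of the four positive-negative pairs are ordered correctly
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

A model that gives every patient the same score ends up at 0.5, the chance level mentioned in the text.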

This research used k-fold cross-validation to split the original data into 5 folds. In each iteration, the model is trained on four folds (80% of the data) and tested on the remaining fold (20%). This shows how accurately the trained model performs on unseen (testing) data in terms of underfitting, overfitting, and generalizability.
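The splitting step can be sketched as follows. The function name and the fixed random seed are hypothetical choices for illustration; the sketch only produces index lists and leaves model fitting out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 223 patients every test fold holds 44 or 45 patients
for train, test in k_fold_indices(223, k=5):
    print(len(train), len(test))
```

Every patient appears in exactly one test fold, so each prediction is made by a model that did not see that patient during training.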

To summarise, accuracy, precision, AUROC, sensitivity and specificity were calculated to evaluate the model’s performance. These metrics depend on whether the positive class (non-responders) and the negative class (responders) have been classified correctly (true) or incorrectly (false).

As mentioned before, clinical practice requires threshold settings. These can be based either on the individual outcome probability score or on sensitivity and specificity calculations. A probability score above a certain threshold is regarded as positive (non-responder) and a score below the threshold as negative (responder) (Habets et al., 2020). An AUROC of 0.5 corresponds to chance, and from 0.7 upwards a model can be considered moderately accurate for clinical use. For threshold setting in the logistic regression, it is preferable to lower the threshold to maximise the sensitivity (correctly classified non-responders, TP). Lange & Lippa (2017) recommend descriptors for sensitivity and specificity, ranging from very low (<10%) through moderate (40-59%) to very high (90-100%).
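The trade-off behind the threshold choice can be illustrated with a small sketch. The function names, the six predicted probabilities and the true labels are all hypothetical; the point is only that lowering the threshold raises sensitivity at the cost of specificity.

```python
def classify(probabilities, threshold=0.5):
    """Label 1 (non-responder) when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for binary labels (1 = non-responder)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities of being a non-responder
probs = [0.10, 0.15, 0.22, 0.35, 0.40, 0.55]
actual = [0, 0, 1, 0, 1, 1]

for threshold in (0.5, 0.3, 0.2):
    sens, spec = sensitivity_specificity(classify(probs, threshold), actual)
    print(threshold, round(sens, 2), round(spec, 2))
# 0.5 0.33 1.0
# 0.3 0.67 0.67
# 0.2 1.0 0.67
```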

5 Results

5.1 Clinical variables

A total of 223 STN-DBS PD patients were assessed. Mean time to the postoperative assessment was 7.8 months (±3.2 sd). Motor improvement divided the patients into two classes: non-responders (n=44, 26 male) and responders (n=179, 119 male). The study characteristics are described in table 3 (visualized in figures 6 and 7 in appendix 10.2).

To start with the non-responders, the average age at preoperative assessment was 65 years (±8.5 standard deviation (sd)) and the average disease duration was 9 years (±4.8 sd). The average Levodopa Equivalent Daily Dose (LEDD) was 1398 mg (±765.6 sd) and the Levodopa response (LR) was 50.6% (±18 sd). UPDRS-III scores were 46.6 (±14.8 sd) for PRE-off and 21.7 (±12 sd) for PRE-on, with 11.5% (±10.6 sd) axial symptoms. Postoperative UPDRS-III scores averaged 50.6 (±18 sd) for POST-off/off and 46.6 (±18.3 sd) for POST-off/on. Within this subgroup, a paired two-sided t-test showed a significant difference between PRE-off and PRE-on (p<0.001).


Table 3: Clinical characteristics and outcome parameters

| | Non-responders | Responders | P-value |
| --- | --- | --- | --- |
| Patients (male) | 44 (26) | 179 (119) | 0.38 ◦ |
| Mean age (years, sd) | 65 (±8.5) | 63 (±8.3) | 0.13 ¹ |
| Mean disease duration (years, sd) | 9 (±4.8) | 10 (±4.9) | 0.11 ¹ |
| LEDD (mean in milligram, sd) | 1398 (±765.6) | 1294 (±697) | 0.38 ¹ |
| LR (mean, sd) | 50.6% (±18) | 60.4% (±17.4) | 0.001 * ² |
| UPDRS-III preoperative scores | | | |
| PRE-off | 46.6 (±14.8) | 51.8 (±16.2) | 0.03 * ¹ |
| PRE-on | 21.7 (±12.0) | 21.1 (±11.3) | 0.73 ¹ |
| Axial symptoms (mean, sd) | 11.5% (±10.6) | 10.4% (±8.6) | 0.47 ¹ |
| UPDRS-III postoperative scores | | | |
| POST-off/off | 50.6 (±18.0) | 60.4 (±17.4) | 0.32 ¹ |
| POST-off/on | 46.6 (±18.3) | 44.1 (±16.5) | 0.29 ¹ |

sd = standard deviation, LEDD = Levodopa Equivalent Daily Dose, UPDRS = Unified Parkinson’s Disease Rating Scale, LR = levodopa response. * Significant difference between the two classes (p-value ≤ 0.05). ◦ Categorical Fisher exact test. ¹ t-test with equal variances, normal distribution, non-paired and two-sided. ² Kolmogorov-Smirnov test for normal distribution (p<0.005); Mann-Whitney U test.

Responders had an average age of 63 years (±8.3 sd) and an average disease duration of 10 years (±4.9 sd). The average LEDD was 1294 mg (±697 sd) and the LR was 60.4% (±17.4 sd). UPDRS-III scores were 51.8 (±16.2 sd) for PRE-off and 21.1 (±11.3 sd) for PRE-on, with 10.4% (±8.6 sd) axial symptoms. Postoperative UPDRS-III scores averaged 60.4 (±17.4 sd) for POST-off/off and 44.1 (±16.5 sd) for POST-off/on. Within this subgroup, a paired two-sided t-test showed a significant difference between PRE-off and PRE-on (p<0.001), and between PRE-off and POST-off/off (p<0.001) too.

LR (Mann-Whitney U, p=0.001) and PRE-off (unpaired two-sided t-test, p=0.03) were significantly different between the two classes. No differences were found between the classes in sex, age, disease duration, LEDD, PRE-on, POST-off/off, POST-off/on and hours off (UPDRS-IV).

Next, two linear analyses were performed to calculate the correlation between LR and motor improvement (see figure 3). The absolute numbers showed a higher correlation than the relative numbers (r=0.55 and r=0.38 respectively, both p<0.001).

Figure 3: Correlation between motor improvement and Levodopa response


5.2 Prediction model

For optimal prediction performance, a forward stepping logistic regression was estimated. First, all seven possible single features were evaluated (see 5). The Areas Under the Receiver Operating Characteristic curve (AUROC) were: sex = 0.51, age = 0.58, disease duration = 0.57, LR = 0.64, axial symptoms = 0.48, LEDD = 0.51 and hours off = 0.51. This means that sex, axial symptoms, LEDD and hours off discriminate between the classes no better than chance (tossing a coin). Because LR had the highest AUROC, all further calculations contained the LR feature. This resulted in a total of 68 calculations (including the single features). The three most promising combinations of features were: sex + disease duration + LR + LEDD (AUROC = 0.67), sex + age + disease duration + LR + LEDD (AUROC = 0.66) and sex + disease duration + LR + axial symptoms + LEDD (AUROC = 0.66). A minimum AUROC of 0.7 is necessary to make clinically relevant probability predictions.

For the final model, sex + disease duration + LR + LEDD were used as predictive features. This resulted in the AUROC, confusion matrix and performance evaluation table presented in figure 4. With an average AUROC of 0.67, the prediction model does not perform well. The AUROC represents the extent to which the model can separate the two classes. The model was trained to generate an individual patient’s probability of becoming a non-responder. In figure 4.B the confusion matrix shows the threshold of 0.006 for accepting the probability to be non-responder (TP / TP + FP + TN). This matrix is derived from the current classifier (red dot in figure 4.A). The matrix contains TP = 1, FN = 43, FP = 7 and TN = 172. This results in a sensitivity of 0.02 (1 out of 44) and a specificity of 0.96. Accuracy is 78.9% with 173 out of 223 correctly predicted PD patients and precision is 0.125. The model classified 215 out of 223 patients as responders.
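As a worked check, the formulas from section 4.4.2 can be applied to the confusion matrix counts exactly as printed above (TP = 1, FN = 43, FP = 7, TN = 172). This is a plain recalculation, not additional output from the model.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP+FN} = \frac{1}{1+43} \approx 0.023 \\
\text{Specificity} &= \frac{TN}{TN+FP} = \frac{172}{172+7} \approx 0.961 \\
\text{Precision}   &= \frac{TP}{TP+FP} = \frac{1}{1+7} = 0.125 \\
\text{Accuracy}    &= \frac{TP+TN}{TP+TN+FP+FN} = \frac{173}{223} \approx 0.776
\end{align*}
```

Sensitivity, specificity and precision agree with the reported values. With these counts the accuracy comes out at roughly 77.6%, slightly below the quoted percentage, so one of the printed figures may contain a transcription error.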

Figure 4: Logistic regression outcome variables

A: Area Under the Receiver Operating Characteristic curve (AUROC) is 0.67. The red dot is the current classifier. B: Confusion matrix based on the current classifier. C: Evaluation metrics. For the calculations used see section 4.4.2 Evaluation of Classifiers.


6 Discussion

This study aimed to predict motor improvement after STN-DBS based on preoperative clinical features using machine learning. The logistic regression model could not identify preoperative assessment features that predict non-responders well enough to be relevant in a clinical setting. The two classes (non-responders n=44, responders n=179) were defined by a minimum of 30% relative motor improvement scored via the Unified Parkinson’s Disease Rating Scale part three (UPDRS-III). Via forward stepping regression, four features were found that generated the most promising model to differentiate between the two classes. These were sex, disease duration, Levodopa response (LR) percentage and Levodopa Equivalent Daily Dose (LEDD, in mg). Previous papers support the association of these features with motor improvement (Farrokhi et al., 2020; Habets et al., 2020; Jaggi et al., 2004; Kleiner-Fisman et al., 2006; Rodriguez et al., 2007; Welter et al., 2002). Nevertheless, the predictive features were not capable of generating a model that could distinguish between the two classes. Both linear correlations between LR and motor improvement are in line with previous research (Zaidel et al., 2010; Pieterman et al., 2018). The absolute numbers showed a higher correlation than the relative numbers (r=0.55 and r=0.38 respectively, both p<0.001).

The supervised learning algorithm reached an AUROC of 0.67, an accuracy of 79.8%, a sensitivity of 2%, a specificity of 96% and a precision of 12.5%. The number of correctly classified non-responders (TP) was smaller than the number of responders incorrectly classified as non-responders (FP). Because the AUROC was below 0.7, the model could not generate clinically relevant probabilities to separate the two classes (Habets et al., 2020). The accuracy is misleadingly high because the model classifies almost every patient as a responder (215 out of 223), which is caused by the imbalance between the two classes. Precision is a more informative performance measure here, as it is less dominated by the majority class: the fraction only contains patients predicted to be in the positive class (non-responders). The model is not capable of correctly predicting the true class for non-responders. This is reflected in the very low sensitivity (2%) and very high specificity (96%). A high sensitivity would have been preferable, as the aim was to predict the patients who will not benefit from STN-DBS.

One of the explanations for the outcome could be found in the feature selection. The LR itself may already contain most of the other features’ information. With more advanced PD stages, older age, or less responsiveness to dopamine, the LR decreases (Pieterman et al., 2018; Charles et al., 2002). On the other hand, the outcome measurement is debatable too. The downside of using PRE-off as the baseline is that both LR and motor improvement contain the identical PRE-off score. This could lead to a more positive correlation, as the PRE-off entity correlates with itself (Zaidel et al., 2010). However, it is questionable whether POST-off/off and POST-off/on would capture the motor improvement from DBS better, because the baseline would then be assessed with implanted electrodes. Even though symptoms are said to return to the POST-off/off condition within seconds to minutes after the stimulation is stopped, there is evidence that some symptoms take hours to return fully to the off condition (Herrington et al., 2016).

The current paper has several limitations. First, there is selection bias in this model. More patients with advanced PD were assessed in the Amsterdam UMC, but only those selected for surgery were entered in the database. An experienced multidisciplinary team makes this selection. Since they select the fittest patients, it is not surprising that the two classes were imbalanced. The database is therefore not a fair representation of the advanced Parkinson’s population. Second, the clinical history of the patients is not reported in the database. The typical advanced PD patient is older and therefore more likely to have other medical conditions. The preoperative assessment does not contain information about other medical conditions that could have influenced the scores. Third, it is controversial to focus too much on the preoperative assessment. A patient could have the perfect probability score to become a responder, but the distance from the medial STN border is negatively correlated with motor improvement and thus highly relevant (Bot et al., 2018). Preoperative probabilities could help in the postoperative evaluation. If a certain probability could be expected, it could lead to more accurate electrode placement and better stimulus programming.

Future studies could add clinical value if they create better models with intra-operative data. If a supervised learning model could predict not only clinical improvement but also the degree of success of the outcome, more accurate patient expectations could be set.

The strength of this paper lies in its relatively large sample size and relevant clinical setting. Even though DBS is an accepted treatment for PD, large databases are lacking. The advantages of ML compared to traditional statistical techniques are (1) making estimations about more complex nonlinear relationships, (2) enhanced clinical decision making, and (3) making accurate outcome predictions at an individual level (Farrokhi et al., 2020).

7 Conclusion

While studies found linear correlations for single motor improvement features after STN-DBS, the current study did not find a prediction model based on a combination of preoperative predictive features that had clinical relevance. The supervised machine learning algorithm was not able to distinguish between two patient groups using logistic regression. The proportion of correctly classified non-responders was not greater than the incorrectly classified responders. These results suggest using a different approach to make clinically relevant predictions about individual outcome probabilities.

8 Acknowledgments

This retrospective research was written to complete a five-month research internship. My work at the neurology department of the Amsterdam UMC, location AMC, is summarised in this bachelor thesis. This internship was completed for the Bachelor of Science in Neuroscience at the University of Amsterdam. I want to thank Martijn Beudel, my direct supervisor, who not only taught me how to start doing research but also guided me along the way to this final paper. I have learned different aspects of doing research: from the literature search to the weeks of working in Matlab to create prediction models. Leaving aside that things would have been different if the corona pandemic had not limited my time to work in the hospital, I am grateful for the opportunity to work side-by-side with experienced doctors and researchers. Even when working from home, I know now more than ever that my path lies in the continuation of both my studies: neuroscience and medicine. Those two bachelors educated me in combining clinical practice with academic research to improve patient support.


9 References

Bot, M., Schuurman, P. R., Odekerken, V. J., Verhagen, R., Contarino, F. M., De Bie, R. M., & van den Munckhof, P. (2018). Deep brain stimulation for Parkinson’s disease: defining the optimal location within the subthalamic nucleus. Journal of Neurology, Neurosurgery, and Psychiatry, 89(5), 493–498.

Charles, P. D., Van Blercom, N., Krack, P., Lee, S. L., Xie, J., Besson, G., . . . Pollak, P. (2002). Predictors of effective bilateral subthalamic nucleus stimulation for PD. Neurology, 59(6), 932–934.

Deuschl, G., Schade-Brittinger, C., Krack, P., Volkmann, J., Schäfer, H., Bötzel, K., . . . Voges, J. (2006). A randomized trial of deep-brain stimulation for Parkinson’s disease. New England Journal of Medicine, 355(9), 896–908.

Farrokhi, F., Buchlak, Q. D., Sikora, M., Esmaili, N., Marsans, M., Mcleod, P., . . . Carlson, J. (2020). Original Article. , 325–338.

Habets, J. G., Janssen, M. L., Duits, A. A., Sijben, L. C., Mulders, A. E., de Greef, B., . . . Herff, C. (2020). Machine learning prediction of motor response after deep brain stimulation in Parkinson’s disease—proof of principle in a retrospective cohort. PeerJ, 8, 1–17.

Hamani, C., Florence, G., Heinsen, H., Plantinga, B. R., Temel, Y., Uludag, K., . . . Fonoff, E. T. (2017). Subthalamic nucleus deep brain stimulation: Basic concepts and novel perspectives. eNeuro, 4(5).

Herrington, T. M., Cheng, J. J., & Eskandar, E. N. (2016). Mechanisms of deep brain stimulation. Journal of Neurophysiology, 115(1), 19–38.

Hijdra, A., Koudstaal, P., & Roos, R. (2016). Neurologie (7th ed.). Houten: Bohn Stafleu van Loghum.

Houser, M. C., & Tansey, M. G. (2017). The gut-brain axis: Is intestinal inflammation a silent driver of Parkinson’s disease pathogenesis? Nature partner journals, 3(1).

Jaggi, J. L., Umemura, A., Hurtig, H. I., Siderowf, A. D., Colcher, A., Stern, M. B., & Baltuch, G. H. (2004). Bilateral stimulation of the subthalamic nucleus in Parkinson’s disease: Surgical efficacy and prediction of outcome. Stereotactic and Functional Neurosurgery, 82(2-3), 104–114.

Kleiner-Fisman, G., Herzog, J., Fisman, D. N., Tamma, F., Lyons, K. E., Pahwa, R., . . . Deuschl, G. (2006). Subthalamic nucleus deep brain stimulation: Summary and meta-analysis of outcomes. Movement Disorders, 21(SUPPL. 14), 290–304.

Koeglsperger, T., Palleis, C., Hell, F., Mehrkens, J. H., & Bötzel, K. (2019). Deep brain stimulation programming for movement disorders: Current concepts and evidence-based strategies. Frontiers in Neurology, 10(MAY), 1–20.

Lange, R. T., & Lippa, S. M. (2017). Sensitivity and specificity should never be interpreted in isolation without consideration of other clinical utility metrics. The Clinical Neuropsychologist, 31, 1015–1028.

Odekerken, V. J., van Laar, T., Staal, M. J., Mosch, A., Hoffmann, C. F., Nijssen, P. C., . . . de Bie, R. M. (2013). Subthalamic nucleus versus globus pallidus bilateral deep brain stimulation for advanced Parkinson’s disease (NSTAPS study): A randomised controlled trial. The Lancet Neurology, 12(1), 37–44.

Pieterman, M., Adams, S., & Jog, M. (2018). Method of levodopa response calculation determines strength of association with clinical factors in Parkinson disease. Frontiers in Neurology, 9(MAY).

Rodriguez, R. L., Fernandez, H. H., Haq, I., & Okun, M. S. (2007). Pearls in patient selection for deep brain stimulation. Neurologist, 13(5), 253–260.

Schulman, L., Gruber-Baldini, A., Anderson, K., Fishman, P., Reich, S., & Weiner, J. (2010). The Clinically Important Difference on the Unified Parkinson’s Disease Rating Scale. Arch Neurology, 67(1), 64–70.

Toekomst Verkenning Volksgezondheid. (2018). Ziekte van Parkinson | Cijfers & Context | Trends | Volksgezondheidenzorg.info. Retrieved from https://www.volksgezondheidenzorg.info/onderwerp/ziekte-van-parkinson/cijfers-context/trends#!node-toekomstige-trend-ziekte-van-parkinson-door-demografische-ontwikkelingen

Wan, K. R., Maszczyk, T., See, A. A. Q., Dauwels, J., & King, N. K. K. (2019). A review on microelectrode recording selection of features for machine learning in deep brain stimulation surgery for Parkinson’s disease. Clinical Neurophysiology, 130(1), 145–154.

Weaver, F. (2009). NIH Public Access Patients With Advanced Parkinson Disease :. JAMA, 301(1), 63.

Welter, M. L., Houeto, J. L., Tezenas Du Montcel, S., Mesnage, V., Bonnet, A. M., Pillon, B., . . . Agid, Y. (2002). Clinical predictive factors of subthalamic stimulation in Parkinson’s disease. Brain, 125(3), 575–583.

Williams, A., Gill, S., Varma, T., Jenkinson, C., Quinn, N., Mitchell, R., . . . Wheatley, K. (2010). Deep brain stimulation plus best medical therapy versus best medical therapy alone for advanced Parkinson’s disease (PD SURG trial): a randomised, open-label trial. The Lancet Neurology, 9(6), 581–591.

Zaidel, A., Bergman, H., Ritov, Y., & Md, Z. I. (2010). Levodopa and subthalamic deep brain stimulation responses are not congruent. Movement Disorders, 25(14), 2379–2386.


List of Tables

1 Examples of (non)motor symptoms

2 UPDRS-III motor scores

3 Clinical characteristics and outcome parameters

4 Description of preoperative features

5 Single feature and top three best combinations

Table 4: Description of preoperative features

| No. | Type | Feature | Description |
|---|---|---|---|
| 1 | n | Age | Age at moment of first clinical assessment |
| 2 | c | Sex | (1) Male, (2) Female |
| 3 | n | Disease duration | Time between PD onset and preoperative clinical assessment |
| 4 | n | LEDD | Levodopa Equivalent Daily Dose in milligrams |
| 5 | n | Levodopa response | Percentage UPDRS-III PRE-off - PRE-on |
| 6 | n | Percentage axial symptoms | Percentage out of total UPDRS-III score (total of subscores 9, 10, 11, 12 and 13) |
| 7 | c | Hours off | Categorical score UPDRS-IV; hours patient experiences motor complications: (0) None, (1) <25%, (2) 25-50%, (3) 50-75% and (4) >75% |

Summary of predictive features, either numeric (n) or categorical (c). PD = Parkinson’s Disease, UPDRS = Unified Parkinson’s Disease Rating Scale (part III motor function and part IV motor complications). PRE-off - PRE-on = preoperative off condition - preoperative on condition.

Table 5: Single feature and top three best combinations

| Feature | AUROC | Accuracy |
|---|---|---|
| 1. Sex | 0.51 | 80.3% |
| 2. Age | 0.58 | 80.3% |
| 3. Disease duration | 0.57 | 80.3% |
| 4. Levodopa response | 0.64 | 79.8% |
| 5. Axial symptoms | 0.48 | 80.3% |
| 6. LEDD | 0.51 | 80.3% |
| 7. Hours off | 0.49 | 79.8% |
| 1 + 3 + 4 + 6 | 0.67 | 79.8% |
| 1 + 2 + 3 + 4 + 6 | 0.66 | 81.2% |
| 1 + 3 + 4 + 5 + 6 | 0.66 | 79.8% |

Forward stepwise logistic regression used to estimate the probability of motor outcome (non-responder or responder) from the features. AUROC = Area Under the Receiver Operating Characteristic Curve, LEDD = Levodopa Equivalent Daily Dosage.
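As an illustration of the forward stepping idea behind Table 5, the sketch below shows a generic greedy forward selection in Python. It is not the Matlab procedure used in this thesis: the function names `forward_select` and `toy_score`, the feature labels, and all numbers are hypothetical, and the scoring function stands in for whatever performance measure (for example a cross-validated AUROC) a real analysis would supply.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    candidates: list of feature names.
    score: function mapping a tuple of feature names to a number
           (higher is better), e.g. a cross-validated AUROC.
    Returns the selected features and the score they reach.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(tuple(selected + [f])), f) for f in remaining]
        trial_score, trial_feature = max(trials)
        if trial_score <= best_score:
            break  # no extension improves the score, so stop
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score


# Toy scoring function with made-up numbers, for illustration only.
def toy_score(features):
    gains = {"levodopa_response": 0.10, "disease_duration": 0.03, "sex": 0.02}
    # Features without a listed gain slightly lower the toy score.
    return 0.5 + sum(gains.get(f, -0.01) for f in features)


chosen, final_score = forward_select(
    ["sex", "age", "disease_duration", "levodopa_response"], toy_score
)
```

In this toy run the feature with the largest individual gain is added first, and selection stops as soon as adding another feature no longer raises the score.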


10 Appendix

10.1 Motor pathway

Figure 5: The direct and indirect motor pathway. GPe = globus pallidus pars externa, GPi = globus pallidus pars interna, STN = subthalamic nucleus, SNc = substantia nigra pars compacta. D1 and D2 = type of dopamine receptor. The bold lines indicate the direct pathway and the dotted lines represent the indirect pathway. The color of the lines and the symbols represent activity (green / + = stimulate and red / - = inhibit).

Normally the substantia nigra (SN) communicates with the neurons of the basal ganglia through the neurotransmitter dopamine (DA). The substantia nigra pars compacta (SNc) can increase activity via dopamine receptor 1 (D1) in the direct pathway to the striatum, but can also decrease activity in the indirect pathway via dopamine receptor 2 (D2) (see Figure 5). In this way the SNc facilitates movement via DA (Herrington et al., 2016). Due to loss of DA input to the striatum, the direct pathway is underactive and the indirect pathway is overactive. This causes, for example, the typical symptoms of tremor, rigidity and hypokinesia.


10.2 Visualized characteristics and outcome parameters

Figure 6: Box plot visualizing patients’ characteristics

The line in the box plot represents the median and x the average. Box length is the interquartile range (IQR). Individual points represent outliers. See Table 3 for averages and standard deviations, N=223. Average distances did not differ between classes (non-responders=44 and responders=179).

The five plots are: (A) age in years with median 67, (B) disease duration in years with median 8, (C) LEDD = Levodopa Equivalent Daily Dosage in milligrams with median 1399.125, (D) axial symptoms in percentage with median 9.09 and (E) cumulative percentages of hours off divided into five categories, ranging from None to >75%.


Figure 7: Box plot visualizing UPDRS-III scores

N = 223, %LR = Levodopa response in percentage. Unified Parkinson’s Disease Rating Scale (UPDRS) part III scores the individual motor function. This is assessed in four different conditions. The left graph is preoperative and the right graph postoperative. The line in the box plot represents the median and x the average. Box length is the interquartile range (IQR). Individual points represent outliers. See Table 3 for averages and standard deviations. Preoperative medians are: off non-responder 46, off responder 50, on non-responder 20.5 and on responder 18. Postoperative medians are: off/off non-responder 53, off/off responder 62.5, off/on non-responder 46 and off/on responder 43.5. A significant difference was found between off non-responders and off responders (p=0.03) and between LR non-responders and LR responders (p<0.001) using, respectively, a two-sided Student's t-test (underlying assumptions met) and a Mann-Whitney U test (data not normally distributed, p<0.005). Within the two classes, there was a significant difference between the non-responders' off and on condition (p<0.001), the responders' off and on condition (p<0.001) and between preoperative responders in the off condition and postoperative responders in the off/off condition (p<0.001). All statistical calculations within the classes were paired two-sided Student's t-tests.


10.3 Receiver Operating Characteristic Curve

| True class | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Positive (non-responder) | a (TP) | b (FN) | a + b |
| Negative (responder) | c (FP) | d (TN) | c + d |
| Total | a + c | b + d | N |

The table above describes the underlying values in a confusion matrix. It can be used for different calculations to evaluate model performance. The small letters (a to d) can be replaced with the outcomes of any confusion matrix. N is the total number of patients, TP = true positive, FN = false negative, FP = false positive, and TN = true negative.
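The calculations that follow from the four cells can be sketched in a few lines of Python. This is a minimal illustration, not code from the study; the function name `confusion_metrics` is an assumption, and the example is a hypothetical classifier that labels every patient as responder, using the class sizes reported for this cohort (44 non-responders, 179 responders).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return common metrics from the four confusion-matrix cells (a, b, c, d)."""
    total = tp + fn + fp + tn
    nan = float("nan")  # used when a ratio has a zero denominator
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }


# Hypothetical classifier that predicts "responder" for all 223 patients:
# no non-responder (positive class) is detected, all responders are correct.
all_responder = confusion_metrics(tp=0, fn=44, fp=0, tn=179)
```

In this hypothetical case the accuracy is 179/223, roughly 80.3%, while the sensitivity is 0. This illustrates why accuracy should be read together with sensitivity and specificity when the classes are unbalanced.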

For this paper, the Receiver Operating Characteristic curve (ROC) is also used to visualize the performance of the logistic regression model. Each potential regression model has its own probability curve. This probability curve is S-shaped, ranging from 0 to 1. Above a certain threshold, every patient is classified as non-responder. If it is important for the prediction model to identify the true positives, a low threshold is preferable. However, a low threshold also increases the number of false positives. Because each threshold turns the probabilities into a different classification, the ROC summarizes all potential confusion matrices produced by every possible threshold. The x-axis is the false positive rate (1-specificity) and the y-axis is the true positive rate (sensitivity).

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{Specificity} = \frac{TN}{TN + FP}$$

The dotted line represents equal true positive and false positive rates. At the top right (coordinates 1,1), the model classifies all of the non-responders correctly, but it also incorrectly classifies all responders as non-responders. When the ROC shifts to the left (point A), the TP rate is greater than the rate of responders incorrectly classified as non-responders (FP). B is represents rate . Point C has no false positives, and at coordinates 0,0 there are no TP or FP.
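The threshold sweep described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Matlab code used for the thesis: the function names `roc_points` and `auroc` are hypothetical, the example data are made up, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, probabilities):
    """Return (FPR, TPR) pairs for every threshold, from strictest to loosest.

    labels: 1 = non-responder (positive class), 0 = responder.
    probabilities: predicted probability of being a non-responder.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every probability: nobody is positive
    for threshold in sorted(set(probabilities), reverse=True):
        pairs = list(zip(labels, probabilities))
        tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
        fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example data, not taken from the study.
labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
curve = roc_points(labels, probabilities)
area = auroc(curve)
```

Each point on the curve corresponds to one confusion matrix. An area of 0.5 corresponds to the dotted diagonal, and an area of 1 to a model that separates the two classes perfectly.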


10.4 Data exploration

Figure 9: Correlation matrix of numerical features

This figure shows the Pearson correlation in a heatmap for all numerical features. X-axis and Y-axis labels: (1) age, (2) disease duration, (3) Levodopa Equivalent Daily Dose (LEDD) in mg, (4) axial symptom score, (5) Levodopa response (LR). The values in the heatmap represent the correlation coefficients.
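For readers unfamiliar with the measure, the Pearson correlation coefficients behind such a heatmap can be sketched in plain Python. The function names `pearson` and `correlation_matrix` are assumptions for illustration, and the values below are made up rather than taken from the study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of feature name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for three numerical features of four fictional patients.
data = {
    "age": [60, 65, 70, 75],
    "disease_duration": [12, 10, 8, 6],
    "ledd": [900, 1400, 1100, 1600],
}
matrix = correlation_matrix(data)
```

The diagonal of the resulting matrix is 1 by definition, and the matrix is symmetric, so a heatmap only carries new information in one of its triangles.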
