Improving Early Diagnosis of Alzheimer’s Disease

The Predictive Power of Cognitive Tests with Increased Reliability and Spatial Map of Brain Atrophy

Research Master thesis: Nikki Lammers 5840899 August 2014

Department of Neurology, Academic Medical Centre; Department of Brain and Cognition, University of Amsterdam

Supervisor: Ben Schmand

ABSTRACT

Background: In Alzheimer’s disease (AD) research, early detection is highly relevant because interventional studies suggest that intervention is most promising at an early stage.

Objectives: To improve the detection of early AD by combining well-known predictors, such as structural brain volume and cognitive performance, with novel predictors such as elongated tests.

Methods: Data were collected in a longitudinal study in memory clinic patients (N=69) who had either mild cognitive impairment or as yet an uncertain diagnosis. Cognition was measured with a standard neuropsychological assessment (NPA) and elongated tests of memory and executive functions. MRI scans of the brain (3 Tesla) were analyzed automatically using Freesurfer software. To analyze a spatial map of structural brain regions we calculated the volumes of hippocampi, entorhinal cortex, inferior temporal gyri and anterior and posterior cingulate gyrus. After two and six years, patients were cognitively evaluated using cognitive screening instruments and clinical diagnosis. Different prediction models of conversion to dementia were analyzed using logistic regression.

Results: Standard NPA was able to correctly predict diagnosis at follow-up after six years in 85% of cases. Adding a spatial map to standard NPA improved the predictive power to 92% accuracy. Adding elongated tests to a standard NPA slightly increased the predictive power to an accuracy of 86.5%. The most accurate prediction model included basic NPA combined with elongated tests and spatial map, with an accuracy of 94%. Accuracy of all models, except for standard NPA, was higher for six-year follow-up than for two-year follow-up.

Conclusion: Our results complement prior research of early AD detection by demonstrating that when improved behavioral measurements and improved neuroimaging techniques are included in the same model, very high accuracy can be reached in predicting who will subsequently develop AD.

Keywords: Alzheimer Disease; Predictors; Neuropsychological assessment; Elongated tests; Spatial MRI; Longitudinal

INTRODUCTION

Alzheimer’s disease (AD) is a major public health issue. The estimated prevalence of people living with AD is 35.6 million worldwide (World Alzheimer report, 2010, cited by Weiner et al., 2012). It is expected that the prevalence of AD throughout the world will double in the next 20 years (Weiner et al., 2012). This increased prevalence of AD makes the development of disease-modifying treatments extremely important (Mueller et al., 2005). Although there are currently no disease-modifying treatments, many potential treatments are being tested in clinical trials (Roberson & Mucke, 2006; Cummings et al., 2014). It is expected that when treatment can delay the clinical onset of AD by one year, this would reduce the prevalence of AD in 2050 by 9 million cases (Brookmeyer et al., 2007). Most of these treatments are focusing on early initiation before brain tissue is irreversibly damaged (Mueller et al., 2005). To be able to apply these early treatments, we need accurate identification of early stages of the disease. Therefore, there is a clinical need for accurate diagnostic instruments to identify early Alzheimer disease, before clinical dementia is established.

The early stage of AD is often referred to as mild cognitive impairment (MCI; Jack et al., 2011; Petersen et al., 1999). Clinically, mild cognitive impairment is defined as an impairment in one or more cognitive domains, in particular memory, language or executive functioning, that is severe enough to be visible on test performance but insufficient to interfere substantially with daily life (DSM-IV). MCI is therefore a clinical diagnosis based on the patient’s history, in particular subjective cognitive complaints, and on the results of a neuropsychological assessment (NPA; Petersen et al., 1999). NPA includes tests of memory, verbal fluency and executive functioning, since these domains are affected most in MCI (Harrison et al., 2007; Scherder, 2011). In patients diagnosed with MCI on the basis of neuropsychological assessment, the rate of further cognitive decline and conversion to AD is 10-15% per year. After six years, approximately 80% show progression to AD (Petersen, 2007). Although these conversion rates are impressive, not all MCI patients deteriorate over time. Some of these patients have reversible impairment or remain stable (Larrieu et al., 2002). This indicates that MCI, as measured by neuropsychological tests, is a rather heterogeneous condition (Chetelat & Baron, 2003) and not a perfect predictor of AD.

The heterogeneity of MCI is due to several difficulties in the diagnosis. The first difficulty is that abnormal neuropsychological test results may not reflect cognitive symptoms due to brain damage, because the tests are not perfectly reliable (Lezak, 2012). For example, widely used memory tests such as word list learning are of proven validity but have a retest-reliability of .70-.80 (Lezak, 2012). The second difficulty is that the transition between normal cognitive aging and cognitive impairment due to degenerative cerebral disease is gradual (Selkoe, 2001; Morris, 2012). The third difficulty is the frequent co-occurrence of cognitive impairment and psychiatric symptoms. This co-occurrence varies from 25% in population-based studies to more than 60% in clinical samples (Steffens et al., 2006). Consequently, there is a frequent misclassification of pre-AD and functional psychiatric states in elderly patients. The fourth difficulty is that undetected, non-credible cognitive test performance can result in invalid MCI diagnoses (Rienstra et al., 2013).

These difficulties make it hard to clinically distinguish those MCI patients who will convert to AD from those who will not. As a result, researchers were urged to look for neuroimaging markers to improve the prediction (Nordberg, 2004; Nestor, Scheltens & Hodges, 2004; Weiner, 2012; Thurfjell et al., 2012; Chételat et al., 2012). One neuroimaging marker associated with AD decline is atrophy of the hippocampus (Weiner, 2012). Hippocampal atrophy is the best studied structural biomarker, as it is one of the earliest structures to degenerate in AD (Karow et al., 2010; Scherder, 2012). Hippocampal atrophy together with entorhinal atrophy correctly classified 70% of MCI patients who converted to AD within 12 months (Wolz et al., 2010). However, neuropsychological assessment is more responsive than hippocampal atrophy in patients with MCI and early AD ( Schmand et al., 2014) and the added predictive value of hippocampal and entorhinal cortex volume to cognitive tests is small (Devanand et al., 2007). Compared to single feature measurement, such as hippocampal atrophy, measurement of multiple features analyzed by spatial pattern analysis is far more accurate (Weiner, 2012; Davatzikos, 2008; Fan et al., 2008; Evans et al., 2010). Spatial pattern analysis is a sophisticated neuroimaging technique in which a spatial map is formed with a minimal set of brain regions that are considered to maximally differentiate between normal aging and AD (Davatzikos et al., 2008). An important principle of this technique is that it considers all voxels simultaneously instead of analyzing the image voxel by voxel (Lao et al., 2004). As atrophy in early stages of AD may be subtle and distributed over several brain regions, examining the voxels of these regions collectively is better able to discriminate between converters and non-converters (Lao et al., 2004). 
Previous studies suggest that a spatial map with structural patterns that discriminate best between AD and control subjects involves hippocampal volume, entorhinal volume, inferior temporal structures and anterior and posterior cingulate cortical thickness, with 85% accuracy (Walhovd et al., 2010; Davatzikos et al., 2008; Davatzikos et al., 2009). However, neuroimaging methods such as hippocampal volume measurement and spatial map analysis are not able to accurately mark the appearance of clinical symptoms of AD, as these symptoms are of behavioral nature. Even when there are clear signs of an active disease process, such as brain atrophy, this does not necessarily imply that the individual is clinically impaired (which is necessary for MCI diagnosis). Therefore, behavioral methods should always be included in establishing and predicting AD (Petersen et al., 1999).

In view of the increasing need for early AD diagnosis, and given the flaws in currently used methods, we conducted a longitudinal study in a sample of memory clinic patients to improve the prediction of conversion to AD in several ways. Our first aim was to improve the predictive power by adding a spatial map, as analyzed in previous studies of spatial pattern analysis (Davatzikos, 2008; Davatzikos, 2009), to a standard NPA. The spatial map contains hippocampal volume, entorhinal thickness, inferior temporal structures and anterior and posterior cingulate thickness. The standard NPA consists of memory tests, executive functioning tests and a fluency test. Given the promising research findings of spatial pattern analysis as a substitute for hippocampal atrophy alone, we expected that adding this spatial map of brain structures would increase the predictive power of NPA. Our second aim was to improve the predictive power of a standard NPA by increasing the reliability of the neuropsychological tests, using the classical method of test elongation, i.e. applying more trials of the same task (Lezak, 2012). The validity of a diagnosis is limited by the reliability of the diagnostic instruments used (Lezak, 2012). Thus, part of the false positive diagnoses can be due to measurement error. Therefore, we expected that elongated tests would increase the predictive power of NPA. Our third aim was to maximize the predictive power by combining the spatial map and elongated tests as an addition to a basic NPA. As both methods seem promising, we expected this combination would further increase the predictive power. Additionally, we examined the change in accuracy of our new models (with increased reliability and spatial map) over time by comparing the models at first follow-up after two years with the models at second follow-up after six years from baseline. We expected the accuracy to be higher for the first follow-up compared to the second follow-up, as it is plausible that some of the converters at second follow-up did not yet show any signs at baseline (false negatives). The conversion process from normal aging to AD is gradual (Selkoe, 2001), which increases the chance of false negatives as the time between predicting and diagnosing increases. As future treatments for AD are focusing on early initiation before brain tissue is irreversibly damaged (Mueller et al., 2005), the earlier our model can accurately predict conversion to AD, the more effective the treatment will be.

With an additional explorative analysis we ranked the elongated tests that are considered most suitable for group differentiation (AD conversion vs. no conversion).

METHODS

This study was part of the IDADO-project (Improving early Diagnosis of Alzheimer Disease and Other dementia’s), which takes place within the Academic Medical Centre (AMC) in Amsterdam, the Netherlands.

Participants

Between 2007 and 2009 170 patients were recruited from the neurological and geriatric outpatient clinics, memory clinics and day clinics of six general and psychiatric hospitals (Academic Medical Center, Slotervaart Hospital, Geriant Noord-Kennemerland, Medical Center Alkmaar and GGZ-Noord-Holland-Noord). Inclusion criteria were age between 50 and 85 years, presence of cognitive complaints and uncertain diagnosis, completion of baseline assessment and MRI scan and both first and second follow-up. Exclusion criteria were dementia as established by a dementia specialist according to the Diagnostic and Statistical Manual of Mental Disorders–IV criteria (American Psychiatric Association, 1994), other brain disease or systemic disease sufficient to cause the mental complaints, current substance abuse or addiction, a medical condition or handicap that prevented neuropsychological evaluation, pre-existent mental retardation, contra-indications for MRI scanning, insufficient knowledge of the Dutch language and non-credible test performance, which can invalidate MCI diagnosis (Rienstra et al., 2012). Psychiatric (co)-morbidity was not an exclusion criterion, as excluding psychiatric patients does not do sufficient justice to clinical reality. Written informed consent was obtained from all patients after the nature of the study was fully explained. The study was approved by the institutional review board of the AMC.

Procedures

At baseline (2007-2009), participants underwent an MRI scan, cognitive screening instruments (i.e. 7 minute screening, MMSE and CDR), symptom validity tests (Word Memory Test and Test of Memory Malingering), a psychiatric interview (MINI plus) and a comprehensive neuropsychological assessment. The cognitive screening instruments at baseline were used for excluding patients with dementia. Symptom validity tests were used for excluding patients with non-credible test performance. The MINI plus was used in order to establish psychiatric co-morbidity. Both MRI and NPA were used to predict conversion to AD. With the NPA we evaluated several cognitive domains, such as memory, fluency and executive functioning. For some of these tests elongation was used to increase the reliability (see below). Sixteen tests were used for the NPA, of which five were used in this study (see the materials section). The whole test battery took approximately 2-3 hours to complete. Tests were administered in a fixed order and every patient followed the same instructions. All tests were administered by a skilled clinician (e.g., neuropsychologist, neurologist, or master student of neuropsychology supervised by the neuropsychologist) and a clinical probability diagnosis was made by a neurologist.

At the two-year follow-up, we were interested in whether or not a patient had converted to AD and whether there was psychiatric co-morbidity. For this, only cognitive and psychiatric screening instruments were used. Although a comprehensive NPA was administered as well, it would be methodologically incorrect to determine the accuracy of a prediction model based on the same outcome variables as used in the prediction models, in our case neuropsychological tests and brain atrophy. Therefore, it was mandatory for the clinician to select a probability diagnosis based on their own diagnostic work-up including these screening instruments only. Tests of the NPA administered at first follow-up were left out of the analysis.

At the six-year follow-up, we were only interested in which patients converted to AD and which patients did not. For this, patients were invited for a cognitive and psychiatric screening only. The cognitive screening included the same screening instruments used at baseline and first follow-up. For the present study the psychiatric screening was left out of the analysis, since psychiatric co-morbidity at follow-up was irrelevant to our research question. Patients with normal results on the cognitive screening were diagnosed negative for AD by the neuropsychologist. Patients with abnormal results were seen by a neurologist for a definite diagnosis.

It is important to note that in some cases we first retrieved information from the patients’ general practitioner about their cognitive condition. Patients already diagnosed with AD by a neurologist or geriatrician during the interval of the follow-up, were not invited for a second follow-up. They were marked as AD positive.

Materials

Materials used were 1) tests for the comprehensive neuropsychological assessment, 2) tests for clinical diagnosis of conversion vs. non-conversion, and 3) MRI.

1: Comprehensive neuropsychological assessment

At baseline, neuropsychological tests sensitive to functional deficits characteristic of AD and with proven validity were used to predict conversion to AD (see Table 1). We used the Auditory Verbal Learning Test (AVLT) delayed recall, the Rivermead Behavioural Memory Test (RBMT) delayed recall of a prose passage, Trail Making Test B (TMT), Stroop color-word test card 3 and a letter fluency test. The AVLT and RBMT are sensitive to episodic memory deficits. Episodic memory tasks appear to have predictive power for indicating early AD, as episodic memory is associated with medial temporal lobe functioning. This structure is most vulnerable to degeneration in AD (Scherder, 2011). The TMT and Stroop test are both tests of executive functioning, i.e. attention and inhibition. Because of the severely affected frontohippocampal circuit in AD (Scherder, 2011), executive deficits are characteristic of AD. The letter fluency test is sensitive to the use of a strategy to guide the search for words and is most difficult for subjects who cannot develop strategies on their own.

The Stroop ColorWord test (Stroop, 1935; Hammes, 1973). Participants are requested to read color-words (card 1), name colors (card 2), and name the ink color of color-words when the words are printed in a non-matching color (card 3). Scores are the time needed to complete the items. Furthermore, an interference condition is calculated (card 3|2), which reflects the score on card 3 corrected for the time on card 2. Scores are expressed as T-scores corrected for age, education and gender (Schmand et al., 2012). Test-retest reliability is .80, .84, and .73 (respectively word, color, color-word; Straus, 2006).

The Trail Making Test (TMT; Reitan, 1992). Participants are asked to draw a line to connect ascending numbers scattered on a sheet of paper (part A) and to connect ascending numbers alternating with letters in alphabetical order (part B). Scores are time to completion in seconds. Furthermore, a score on part B corrected for the time on part A (TMT B|A) is calculated to reflect the ability to divide attention. Scores are expressed as T-scores corrected for age, education and gender (Schmand et al., 2012). Test-retest reliability is .84 (Straus, 2006).

Auditory Verbal Learning Test (AVLT; Rey, 1964; Deelman et al., 2006) reflects declarative memory. Fifteen one-syllable words are read five times. After each trial subjects try to recall as many words as they can. Twenty minutes after immediate recall, delayed recall and recognition follow. Scores are expressed as T-scores corrected for age, education and gender (Schmand et al., 2012). Test-retest reliability is .6-.8 (Straus, 2006; Deelman et al., 2006).

Rivermead Behavioural Memory Test (RBMT), subtest prose recall (Wilson, Cockburn & Baddeley, 1985; Deelman et al., 2006). Memory task in which newspaper items have to be reproduced. After 15 minutes delayed recall follows. Four parallel versions of the test were used. Scores are expressed as T-scores corrected for age, education and gender (Schmand et al., 2012). Test-retest reliability is .7-.8 (Schmand et al., 2012).

Letter Fluency (Benton & Hamsher, 1989; Schmand et al., 2008). Letter Fluency consists of three one-minute trials in which patients have to say as many words as they can think of that begin with a given letter of the alphabet, excluding proper nouns, numbers and the same words with a different suffix. The score, which is the sum of all acceptable words produced in the three trials, is expressed as a T-score corrected for age, education and gender (Schmand et al., 2012). The test-retest reliability is .8-.84 (Schmand et al., 2008).

Elongated tests. To improve the reliability of the neuropsychological tests, we used the classical method of test elongation (Nunnally, 1967). We expanded the basic test battery by applying more trials of the NPA tests. We administered three instead of one trial of the RBMT prose recall subtest, two instead of one list of words of the AVLT, two trials of the letter fluency task (i.e. six letters in total), and one extra trial of the TMT and Stroop test. For all extra tests a parallel version was used, except for the Stroop test.
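The expected gain from test elongation can be illustrated with the Spearman-Brown prophecy formula from classical test theory. The thesis does not state that this formula was used, so the sketch below is an assumption for illustration only; the function name `spearman_brown` is hypothetical, and the input reliabilities are simply the lower bounds of the ranges quoted in the test descriptions.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by `length_factor`
    (Spearman-Brown prophecy formula, assuming parallel trials)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Illustrative inputs only: lower bounds of the quoted test-retest ranges.
avlt_doubled = spearman_brown(0.6, 2)  # two word lists instead of one
rbmt_tripled = spearman_brown(0.7, 3)  # three prose-recall trials instead of one
print(round(avlt_doubled, 3), round(rbmt_tripled, 3))  # 0.75 0.875
```

Under these assumptions, doubling or tripling the number of parallel trials moves a reliability of .6-.7 to roughly .75-.88, which is the rationale for expecting fewer classification errors due to measurement noise.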

2: Clinical diagnosis, conversion vs. non-conversion

At the first and second follow-up, the following cognitive screening instruments were used to classify patients as positive or negative for AD.

The Mini Mental State examination (MMSE; Folstein et al., 1975) is used to screen for cognitive impairment and to assess the effects of therapeutic agents. It includes questions and problems concerning orientation, memory, attention, verbal comprehension and visuoconstructive abilities. The cut-off score is < 24; the maximum score is 30.

The Clinical Dementia Rating (CDR; Morris et al., 1993) is a semi-structured interview with the patient and informant for determining the presence and severity of dementia. It rates the subject’s cognitive performance in six domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care. Each domain and the global CDR are rated on five levels, where 0 indicates no dementia, and 0.5, 1, 2, and 3 indicate questionable, mild, moderate and severe dementia, respectively.

The 7-minute screen (Solomon et al., 1998) is a screening instrument consisting of four individual tests: orientation, memory, clock drawing and verbal fluency. Raw score is converted into a percentage reflecting the probability of having a dementia syndrome. It is a useful screening tool for discriminating patients with dementia from patients who are cognitively unimpaired (Meulen et al., 2004).

3: MRI scan

MRI scans were used for analyzing the spatial map of brain regions in predicting conversion to AD. We did not perform a spatial pattern analysis ourselves, but used the spatial map from previous studies of spatial pattern analysis (Davatzikos et al., 2005; Davatzikos et al., 2008; Weiner et al., 2012).

The MRI scans were made on a 3 Tesla scanner. The scan protocol consisted of axial T2-weighted and FLAIR sequences (3 mm slice thickness) and an isotropic 3D-T1-weighted sequence. Volumes of the hippocampi, entorhinal cortex, inferior temporal structures and anterior and posterior cingulate gyrus were calculated automatically using Freesurfer image analysis (http://surfer.nmr.mgh.harvard.edu/), version 4.5.0. All volumes in each individual were normalized using the total intracranial volume to control for variation in head size. Using the program 3D-Slicer, we examined whether Freesurfer correctly calculated the volumes of all images and whether scan quality was sufficient. Furthermore, structural axial T2-weighted MRI was used by the neurologist to exclude/visualize vascular causes of dementia and other brain diseases.
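The normalization for head size can be sketched as follows. The thesis does not specify the exact normalization method, so the sketch assumes the simplest variant, a ratio of regional volume to total intracranial volume; the function name, region keys and volumes are hypothetical and not taken from the study data.

```python
def normalize_by_icv(volumes_mm3: dict[str, float], icv_mm3: float) -> dict[str, float]:
    """Express each regional volume as a fraction of total intracranial volume.

    Assumption: simple proportional correction (volume / ICV).
    """
    if icv_mm3 <= 0:
        raise ValueError("intracranial volume must be positive")
    return {region: volume / icv_mm3 for region, volume in volumes_mm3.items()}


# Hypothetical volumes for one participant (not study data).
raw = {"hippocampus": 7000.0, "entorhinal": 3500.0}
normalized = normalize_by_icv(raw, icv_mm3=1_400_000.0)
print(normalized)
```

Other corrections, such as regressing regional volume on intracranial volume and using the residuals, are also common; the ratio is shown only because it is the most direct reading of the sentence above.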

Figure 1. Schematic overview of test materials used for the current study.

Baseline (predictors) (2007-2009): Stroop ColorWord test, TMT, AVLT, RBMT, Fluency, MRI scan

First follow-up (clinical diagnosis) (2009-2011): MMSE, CDR, 7 minute screen

Second follow-up (clinical diagnosis) (2014-2016): MMSE, CDR, 7 minute screen

Data Analysis

Statistical analyses were performed using the SPSS statistical software package version 22.

Demographic characteristics. We compared test results of patients with cognitive complaints who converted to AD (1) with patients with cognitive complaints who did not convert to AD (0). Group differences in age, education level and MMSE score at baseline were assessed using independent sample t tests and group differences in gender were analyzed using a chi-square test. Furthermore, group differences in clinical diagnosis at baseline were analyzed using independent sample t tests.

Neuropsychological tests and MRI. For each group, the mean T-score of all neuropsychological tests and the mean volumes of the brain regions at baseline were calculated. To limit the effect of overfitting, we included one variable per cognitive test: AVLT delayed, RBMT delayed, Stroop card 3, TMT B and letter fluency in our analyses. All variables were tested for normality using the Kolmogorov-Smirnov test. If data in all groups were normally distributed, Levene’s tests were conducted to further investigate whether the variances differed between groups. Depending on the normality of the data, independent t-tests or Mann-Whitney U tests were performed. Calculations were performed separately for first and second follow-up.

Main analysis. Using logistic regression analysis, we analyzed predictive power by global model fit, discrimination and calibration of five different prediction models. Model 1 consisted of spatial map MRI. Model 2 consisted of basic NPA (without test elongation). Model 3 consisted of step 1) basic NPA, step 2) spatial map MRI. Model 4 consisted of step 1) basic NPA, step 2) elongated tests. Model 5 consisted of step 1) Model 3 (basic NPA and spatial map MRI), step 2) elongated tests. See Figure 2 for a model overview.

Figure 2. Prediction models for conversion to AD. For all models, global model fit, discriminative ability and calibration were analyzed using logistic regression analysis, ROC curves and the Hosmer-Lemeshow test.

Model 1: Spatial map MRI. Model 2: Basic NPA. Model 3: Basic NPA + spatial map MRI. Model 4: Basic NPA + elongated tests. Model 5: Model 3 + elongated tests.

Global model fit: Global model fit of all models was analyzed with Nagelkerke R-square and -2 log likelihood. Using the model chi-square statistic we examined the log-likelihood changes to establish, first, whether spatial map MRI significantly improved model fit of Model 2 (hypothesis 1), second, whether test elongation significantly increased model fit of Model 2 (hypothesis 2), and third, whether test elongation increased model fit of Model 3 (hypothesis 3). Because of the ongoing debate concerning the best predictors (neuroimaging or neuropsychological tests) for early AD detection, it is interesting to make a direct comparison between the improved versions of both techniques. Therefore, we also made a direct comparison of model fit between NPA with test elongation (Model 4) and spatial map MRI alone (Model 1) using Nagelkerke R-square.
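The model chi-square comparison of nested models described above can be sketched as a likelihood ratio test: the drop in -2 log likelihood between the reduced and the extended model is referred to a chi-square distribution with as many degrees of freedom as predictors were added. This is a minimal standard-library sketch, not the SPSS procedure used in the thesis; the function names and the -2LL values are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(df + 2) = Q(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1).
    if df % 2 == 0:
        k, q = 2, math.exp(-half)
    else:
        k, q = 1, math.erfc(math.sqrt(half))
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(neg2ll_reduced: float, neg2ll_full: float, added_predictors: int):
    """Return the LR chi-square and its p-value for two nested logistic models."""
    lr = neg2ll_reduced - neg2ll_full
    return lr, chi2_sf(lr, added_predictors)


# Hypothetical -2LL values: basic NPA alone vs. basic NPA plus five spatial-map regions.
lr, p = likelihood_ratio_test(neg2ll_reduced=60.0, neg2ll_full=48.0, added_predictors=5)
print(f"LR chi-square = {lr:.1f}, p = {p:.4f}")
```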

Furthermore, we have calculated the adjusted R-square to evaluate how well our model generalizes. Whereas R-square tells us how much of the variance is accounted for by our models in the sample, adjusted R-square tells us how much of the variance would be accounted for if the model had been derived from the population (Field, 2009). Because SPSS cannot calculate the adjusted R-square for logistic regression, we calculated the adjusted R-square for all models with the following formula (Steyerberg, 2009):

$$R^2_{\text{adjusted}} = \frac{1 - e^{-(LR - df)/n}}{1 - e^{-(-2LL_0)/n}}$$

in which LR refers to the difference in -2 log likelihood (-2LL) of the model with and without the predictors, df are the degrees of freedom of the predictors in the model, n is the sample size, and LL_0 is the log likelihood of the null model (without predictors).
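As a worked illustration of the formula above, the following sketch computes Nagelkerke R-square and its adjusted variant from -2 log likelihood values. The function names and the numeric inputs are hypothetical; only the formula itself comes from the text.

```python
import math


def nagelkerke_r2(neg2ll_null: float, neg2ll_model: float, n: int) -> float:
    """Nagelkerke R-square from the -2LL of the null and fitted models."""
    lr = neg2ll_null - neg2ll_model
    return (1.0 - math.exp(-lr / n)) / (1.0 - math.exp(-neg2ll_null / n))


def adjusted_r2(neg2ll_null: float, neg2ll_model: float, df: int, n: int) -> float:
    """Adjusted R-square: the LR statistic is penalized by the predictor df."""
    lr = neg2ll_null - neg2ll_model
    return (1.0 - math.exp(-(lr - df) / n)) / (1.0 - math.exp(-neg2ll_null / n))


# Hypothetical values for a model with five predictors fitted on 69 patients.
r2 = nagelkerke_r2(neg2ll_null=93.9, neg2ll_model=40.0, n=69)
r2_adj = adjusted_r2(neg2ll_null=93.9, neg2ll_model=40.0, df=5, n=69)
print(round(r2, 3), round(r2_adj, 3))
```

With df set to 0 the two functions coincide, which makes the penalty explicit: each additional predictor has to earn at least one unit of LR chi-square before the adjusted value rises.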

Discrimination: Discrimination refers to how precisely the prediction model can discriminate those with a positive clinical outcome from those with a negative clinical outcome. Discrimination ability of the models was analyzed using sensitivity, specificity and accuracy with a traditional cut-off point set to 50%, as false positive and false negative classifications were equally important for our research question. However, in clinical practice false negative errors can be more important than false positive errors. For example, if future medication for AD can delay the onset of clinical symptoms, false negative classification would be a bigger problem than false positive classification (assuming that there are no substantial negative side-effects). Therefore, we also performed a ROC curve analysis to assess discrimination ability across cut-off points using the c statistic (i.e. area under the curve).
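The discrimination measures named above can be sketched in a few lines. The data are invented for illustration, the function names are hypothetical, and treating a predicted probability exactly equal to the cut-off as positive is an assumption.

```python
def classification_metrics(y_true, p_hat, cutoff=0.5):
    """Sensitivity, specificity and accuracy at a fixed probability cut-off."""
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= cutoff)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def c_statistic(y_true, p_hat):
    """Area under the ROC curve: share of converter/non-converter pairs in which
    the converter has the higher predicted probability (ties count one half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return score / (len(pos) * len(neg))


# Invented outcomes (1 = converted to AD) and predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 1, 0]
p = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
metrics = classification_metrics(y, p)
auc = c_statistic(y, p)
print(metrics, auc)
```

Unlike accuracy, the c statistic does not depend on the 50% cut-off, which is why it remains informative when false negatives and false positives carry different costs.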

Calibration: Calibration refers to the agreement between observed endpoints and predictions (Steyerberg, 2014). General indices of calibration were tested using the Hosmer-Lemeshow test and graphically plotted in a calibration plot. In this plot, the distribution of predicted probabilities of conversion to AD is shown on the x-axis and the actual outcome on the y-axis. Perfect predictions should be on the 45° line.
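The Hosmer-Lemeshow statistic compares observed and expected numbers of events within groups of patients sorted by predicted probability. The sketch below is a simplified illustration with equal-sized groups, not the SPSS implementation; the function name and the data are hypothetical. The resulting statistic is referred to a chi-square distribution with (groups - 2) degrees of freedom.

```python
def hosmer_lemeshow(y_true, p_hat, groups=10):
    """Return the Hosmer-Lemeshow chi-square statistic and its degrees of freedom."""
    pairs = sorted(zip(p_hat, y_true))  # sort patients by predicted probability
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(prob for prob, _ in chunk)     # expected number of events
        observed = sum(outcome for _, outcome in chunk)  # observed number of events
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic, groups - 2


# Invented data: two risk groups of ten patients each.
p_hat = [0.1] * 10 + [0.9] * 10
well_calibrated = [1] + [0] * 9 + [1] * 9 + [0]  # 1 of 10 and 9 of 10 events
overconfident_y = [0] * 10 + [1] * 10            # 0 of 10 and 10 of 10 events
stat_good, _ = hosmer_lemeshow(well_calibrated, p_hat, groups=2)
stat_bad, _ = hosmer_lemeshow(overconfident_y, p_hat, groups=2)
print(stat_good, stat_bad)
```

A small statistic (non-significant test) indicates that predicted and observed event rates agree, which corresponds to points lying close to the diagonal of the calibration plot.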

We conducted all analyses for both first and second follow-up, so that we could examine whether the predictive power of our models varies between two years and six years after baseline (hypothesis 4).

All models were tested for independent errors using the Durbin-Watson test. They were also tested for linearity of the logit by conducting a logistic regression with predictors that were the interaction between each predictor and the log of itself. Multicollinearity does not affect final model fit, but only the individual parameters of a regression model. As we were not interested in individual predictors, we did not check for multicollinearity.

Explorative analysis. We ranked the elongated tests that are considered most suitable for group differentiation (AD conversion vs. no conversion) using stepwise analysis (backwards).
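The linearity check mentioned above (each predictor interacting with its own logarithm) requires one extra column per predictor. A minimal sketch of constructing those columns is given below; the function name, the column suffix and the example values are hypothetical, and fitting the logistic regression itself is left to the statistics package.

```python
import math


def add_log_interaction_terms(rows):
    """For every predictor x in each row, add a column holding x * ln(x).

    A significant coefficient for such a term in the logistic regression
    suggests that the predictor is not linearly related to the logit.
    """
    extended_rows = []
    for row in rows:
        extended = dict(row)
        for name, value in row.items():
            if value <= 0:
                raise ValueError(f"{name} must be positive to take its logarithm")
            extended[f"{name}_x_ln"] = value * math.log(value)
        extended_rows.append(extended)
    return extended_rows


# Hypothetical T-scores for two patients.
patients = [{"avlt_delayed": 32.0, "tmt_b": 41.0}, {"avlt_delayed": 47.0, "tmt_b": 52.0}]
design = add_log_interaction_terms(patients)
print(sorted(design[0].keys()))
```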

Missing data. Some patients did not fully complete the elongated tests, which resulted in missing data. The data were not randomly missing, but were more often of patients with severe cognitive deficits. Consequently, excluding all missing data would limit the ecological validity of our study and introduce bias. Therefore, we used the method of Last Observation Carried Forward (LOCF) to impute the missing data. With this commonly used imputation method we replaced the missing data of sixteen patients by their last observed data. For instance, missing data caused by incomplete responses on the parallel version of the TMT were replaced by their responses on the first version of the TMT.
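The imputation rule described above can be sketched as follows. The function name and the scores are hypothetical; missing values are represented as `None`, and the first element stands for the first version of a test, the second for its parallel version.

```python
from typing import Optional, Sequence


def locf(observations: Sequence[Optional[float]]) -> list:
    """Last Observation Carried Forward: replace each missing value (None)
    with the most recent observed value; leading gaps stay missing."""
    filled = []
    last_seen = None
    for value in observations:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical TMT T-scores: first version completed, parallel version missing.
tmt_scores = [38.0, None]
print(locf(tmt_scores))  # [38.0, 38.0]
```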

For all tests, a p-value ≤ 0.05 was considered statistically significant.

RESULTS

Demographic characteristics

Seventy-three patients completed both first and second follow-up assessments and underwent an MRI scan. Four of the 73 patients were excluded, one because he was diagnosed with frontotemporal dementia (FTD), one because he was diagnosed with vascular dementia (VD), and two because of insufficient scan quality. The mean time between baseline and the first follow-up was two years and one month. The mean time between baseline and the second follow-up was six years and two months. At first follow-up 29 patients had converted to AD. At second follow-up the number of converters had increased to 40. Thus, the mean conversion rate was about 10% per year. Age at baseline was significantly higher for converters compared to non-converters, and the MMSE score at baseline was significantly lower for converters compared to non-converters. There was no significant difference in gender and education level between converters and non-converters. See Table 1 for demographic characteristics. Furthermore, there was a significant difference in clinical diagnosis at baseline between converters and non-converters. The percentage of MCI diagnoses was significantly higher for converters compared to non-converters. In contrast, the percentages of psychiatric and worried well diagnoses were significantly higher for non-converters compared to converters (see Table 2).

Table 1. Demographic characteristics of converters and non-converters at baseline. Patients were diagnosed after neurological consultation. Non-converters are patients who at follow-up were not diagnosed with AD. Converters are patients who at follow-up were diagnosed with AD. Values are expressed as mean (SD) unless otherwise indicated.

                                     First follow-up                           Second follow-up
                                     Converters   Non-converters   P           Converters   Non-converters   P
                                     (n = 30)     (n = 39)                     (n = 40)     (n = 29)
Age                                  71.6 (8.1)   63.8 (9.1)       < 0.001a    71.3 (7.9)   61.1 (8.4)       < 0.001a
Gender (m/f)                         17/13        20/19            0.70b       22/19        15/13            0.99b
Education level, ISCED (range 0-6)   4.0 (1.2)    4.0 (1.6)        0.90a       3.9 (1.3)    4.1 (1.6)        0.50a
MMSE score                           24.7 (0.4)   27.7 (0.3)       < 0.001a    25.4 (0.4)   28 (0.3)         < 0.001a

a t-test; b chi-square.

Table 2. Clinical diagnoses at baseline. Diagnosis was made by a neurologist, based on cognitive and psychiatric screening instruments, NPA results and MRI scan. Non-converters are patients who at follow-up were not diagnosed with AD. Converters are patients who had AD at follow-up.

                      First follow-up                        Second follow-up
                      Converters   Non-converters   P        Converters   Non-converters   P
                      (n = 30)     (n = 39)                  (n = 40)     (n = 29)
MCI                   82%          21%              <.001    74%          12%              <.001
Psychiatric           4%           32%              .01      6%           38%              .003
Psychiatric & MCI     4%           3%               .78      6%           0%               .91
Worried well          9%           38%              .01      10%          46%              .001


Furthermore, baseline volumes of the hippocampi, inferior temporal gyri and posterior and anterior cingulate gyrus significantly differed between converters and non-converters. The volume of the entorhinal cortex did not significantly differ between converters and non-converters at first follow-up, but did differ at second follow-up (Table 3a). There was a significant difference in T-score on all cognitive tests at baseline between converters and non-converters at first follow-up. Converters had an overall mean T-score of 31.1 (SD = 8.7) and non-converters had an overall mean T-score of 44.7 (SD = 8.7). At second follow-up, baseline T-scores significantly differed on all cognitive tests between converters and non-converters, except for letter fluency. Converters had an overall mean T-score of 32.9 (SD = 9.5) and non-converters had an overall mean T-score of 47.5 (SD = 6.1) at second follow-up (Table 3b).

Table 3a. Baseline cortical volume (mm³) of hippocampus, inferior temporal gyri, entorhinal cortex, anterior cingulate gyrus and posterior cingulate gyrus of converters and non-converters at follow-up. Patients were diagnosed after neurological consultation. Non-converters are patients who at follow-up were not diagnosed with AD. Converters are patients who at follow-up showed AD characteristics. Values are expressed as mean (SD).

                            First follow-up                               Second follow-up
                            Converters     Non-converters   P             Converters     Non-converters   P
                            (n = 30)       (n = 39)                       (n = 40)       (n = 29)
Hippocampus                 5341 (613)     6702 (828)       <.001a        5548 (757)     6933 (709)       <.001a
Inferior temporal gyri      10703 (2116)   12226 (2439)     .007a         10785 (2047)   12705 (2482)     .001a
Entorhinal cortex           1506 (395)     1633 (490)       .240a         1484 (416)     1715 (475)       .041a
Anterior cingulate gyrus    9776 (1546)    10588 (1458)     .030a         9838 (1424)    10815 (1534)     .010a
Posterior cingulate gyrus   8040 (1114)    8765 (1208)      .012b         8098 (1095)    8964 (1216)      .004b

a


Table 3b. Demographically corrected T-scores on neuropsychological tests at baseline of converters and non-converters. Patients were diagnosed after neurological consultation. Non-converters are patients who at follow-up were not diagnosed with AD. Converters are patients who at follow-up showed AD characteristics. Values are expressed as mean (SD). Independent t tests were performed for all variables.

                               First follow-up                             Second follow-up
                               Converters    Non-converters   P            Converters    Non-converters   P
                               (n = 30)      (n = 39)                      (n = 40)      (n = 29)
Mean T-score (a + b)           31.1 (8.7)    44.7 (8.7)       <.001        32.9 (9.5)    47.5 (6.1)       <.001
RBMT delayed, Story 1&2 (a)    26.9 (9.5)    42.5 (13.3)      <.001        28.8 (10.4)   46.0 (12.4)      .001
AVLT delayed, Word list 1 (a)  29.7 (12.6)   49.7 (9.2)       <.001        27.7 (12.7)   49.7 (9.2)       .001
Stroop 1, card 3 (a)           35.5 (12.2)   45.1 (10.9)      .001         36.7 (12.0)   47.1 (10.2)      .001
TMT-B 1 (a)                    25.0 (23.0)   43.3 (17.4)      .001         26.8 (22.9)   47.9 (12.7)      <.001
Fluency 1 (a)                  38.0 (8.8)    44.9 (8.9)       .002         40.2 (10.0)   44.5 (8.0)       .061
RBMT Story 3 (b)               29.3 (8.0)    42.0 (11.7)      <.001        30.7 (9.0)    46.6 (10.0)      <.001
AVLT delayed, Word list 2 (b)  25.9 (10.7)   40.3 (13.5)      <.001        27.0 (11.2)   44.9 (11.0)      <.001
Stroop 2, card 3 (b)           36.4 (12.4)   46.9 (10.8)      .002         38.0 (12.6)   48.3 (10.0)      <.001
TMT-B 2 (b)                    25.9 (25.9)   47.5 (16.1)      <.001        29.0 (24.9)   51.5 (11.9)      <.001
Fluency 2 (b)                  40.6 (8.8)    47.8 (8.6)       .001         42.3 (9.1)    48.2 (8.7)       .008

(a) Basic NPA; (b) Elongated tests.


We evaluated overall model fit, discrimination and calibration for all prediction models, with clinical outcome (conversion vs. no conversion) as the dependent variable. See Table 4 and Figure 3 for detailed information.

Models 1 and 2; spatial map MRI and basic NPA

First, we conducted a binary logistic regression analysis for spatial map MRI alone (Model 1). The brain regions used for pattern analysis included left and right hippocampus, inferior temporal gyri, entorhinal cortex and anterior and posterior cingulate gyrus. Conversion to AD (yes/no) was the dependent variable. Second, we conducted a similar binary logistic regression for the basic NPA (Model 2), with the individual predictors RBMT delayed recall, AVLT delayed recall, letter fluency, Stroop card 3 and TMT B. The mean model fit and discrimination of Model 1 were slightly better than those of Model 2. Although Model 1 had a better calibration than Model 2, both models had an adequate calibration (i.e. the dots did not significantly deviate from the perfect 45° line; see Figure 3). This indicates that the agreement between observed outcomes and predictions is high.

Model 3; NPA with addition of spatial map MRI

To determine whether adding spatial map MRI would increase the predictive power of the basic NPA we conducted a two-step logistic regression analysis with step 1) basic NPA and step 2) spatial map MRI. Results indicated that at both follow-ups the addition of spatial map MRI to basic NPA significantly increased the model fit and the discriminative ability. Also, at both follow-ups the calibration improved compared to Models 1 and 2, which is shown graphically by the calibration plot, where the dots are located closer to the 45° line than for Models 1 and 2.
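The significance of a step in such a two-step analysis is conventionally judged by the drop in -2 log-likelihood (-2LL), which is compared with a chi-square distribution whose degrees of freedom equal the number of predictors added. The sketch below illustrates that calculation using only the Python standard library. The -2LL values are the first follow-up figures reported in Table 4 (52.9 for basic NPA, 30.4 after adding spatial map MRI); the number of added MRI predictors (5) is an assumption made for illustration, and `chi2_sf` and `lr_test` are hypothetical helper names.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, h) = erfc(sqrt(h))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lr_test(neg2ll_step1: float, neg2ll_step2: float, added_predictors: int):
    """Return the change in -2LL between two nested models and its p value."""
    change = neg2ll_step1 - neg2ll_step2
    return change, chi2_sf(change, added_predictors)


# Step 1: basic NPA; step 2: NPA plus spatial map MRI (first follow-up).
# added_predictors=5 is an assumed value, not taken from the thesis.
change, p = lr_test(52.9, 30.4, added_predictors=5)
print(f"change in -2LL = {change:.1f}, p = {p:.5f}")
```

Under the assumed five degrees of freedom the p value falls below .001, which agrees with the "< 0.001" reported for this step; a different number of added predictors would change the exact value.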

Model 4; NPA with addition of elongated tests.

To examine whether the addition of elongated tests would increase the predictive power and model fit of the basic NPA, we conducted a two-step logistic regression analysis with step 1) basic NPA and step 2) elongated tests. Results indicated that at first follow-up the addition of elongated tests had no influence on the model fit, but did show a small improvement in discriminative ability compared to model 2. At second follow-up the elongated tests significantly increased the model fit and showed a small improvement of discriminative ability compared to model 2. The calibration plot shows for both follow-ups an improvement of the calibration compared to the calibration of model 2.

Model 5; spatial map MRI combined with elongated NPA.

To examine whether elongated tests can further improve the model fit and predictive power of model 3, we conducted a two-step logistic regression analysis with step 1) basic NPA plus spatial map MRI and step 2) elongated tests. At first follow-up model fit and discriminative ability were slightly increased by the addition of elongated tests. However, the change in model fit was not significant and the improvement in discrimination was rather small. At second follow-up adding elongated tests did significantly increase the model fit and discriminative ability. The calibration plot shows a close to perfect calibration (i.e. the dots are close to the perfect 45° line).

In sum, the results indicated that both model 3 and model 4 increased model fit and discrimination compared to basic NPA. Both models had adequate calibration. The model fit, discrimination and calibration were higher for model 3 than for model 4, indicating a higher additional effect for spatial map MRI than for elongated tests. Therefore, we chose model 3 as the new predictive model. However, when elongated tests were added to model 3, the model fit and discrimination of the model further increased at second follow-up and slightly at first follow-up. A combination of both spatial map MRI and elongated tests led to a model with highest predictive power. Also, as graphically shown by the calibration plot, the agreement between observed outcome and predictions (calibration) is highest for model 5. Therefore, model 5 is our full research model. Additionally, a straight comparison between the new NPA (with test elongation) and spatial map MRI indicates that predictive accuracy and model fit are slightly better for the new NPA than for spatial map MRI.


Table 4. Characteristics of model fit, discrimination and calibration for all models at first and second follow-up. Model fit is expressed as Nagelkerke R-square and -2 log-likelihood (-2LL), with p indicating the significance level of the log-likelihood change using the model chi-square statistic. Adjusted R-square indicates generalization of the model fit. Discrimination is expressed as sensitivity, specificity, predictive accuracy and the c statistic from the ROC curve. Calibration is analyzed using the Hosmer-Lemeshow test.

                     Global model fit                               Discrimination                                                  Calibration
                     R-square   -2LL   p of change   Adjusted       Sensitivity   Specificity   Accuracy   C stat               H-L test (p)
                                                     R-square

Model 1 (Spatial map MRI)
  First follow-up    0.68       45.3   -             .65            73.7          92.3          84.1       .94 (.88 - 1.0)      .48
  Second follow-up   0.66       46.2   -             .63            87.8          82.1          85.5       .93 (.86 - .99)      .90

Model 2 (Basic NPA)
  First follow-up    0.61       52.9   -             .55            86.7          84.6          85.5       .90 (.81 - .98)      .19
  Second follow-up   0.63       49.7   -             .58            85.4          82.1          84.1       .92 (.84 - .99)      .32

Model 3 (step 1 Basic NPA; step 2 Spatial map MRI)
  First follow-up    0.81       30.4   < 0.001       .75            90.0          92.3          91.3       .97 (.93 - 1.0)      .70
  Second follow-up   0.83       27.8   < 0.001       .76            90.1          96.4          92.8       .97 (.93 - 1.0)      .78

Model 4 (step 1 Basic NPA; step 2 Elongated tests)
  First follow-up    0.66       45.7   0.23          .56            82.8          86.8          85.1       .93 (.86 - 1.0)      .51
  Second follow-up   0.76       35.2   0.02          .66            87.5          88.9          88.1       .95 (.90 - 1.0)      .84

Model 5 (step 1 Model 3; step 2 Elongated tests)
  First follow-up    0.86       22.5   0.16          .75            93.1          92.1          92.5       .98 (.95 - 1.0)      .98
  Second follow-up   0.92       14.5   0.02          .82            92.5          96.3          94.0       .99 (.98 - 1.0)      .95
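As an illustration of how the Nagelkerke R-square relates to the -2LL values in Table 4, the sketch below recomputes it for Models 1 to 3 at first follow-up. It rests on assumptions that are not stated in the table: that these models were fitted on all 69 patients, that 30 of them were converters (the group size in the table headers), and that the null model is intercept-only. The function names are hypothetical.

```python
import math


def null_neg2ll(events: int, n: int) -> float:
    """-2LL of an intercept-only logistic model with `events` cases among n."""
    p = events / n
    return -2.0 * (events * math.log(p) + (n - events) * math.log(1.0 - p))


def nagelkerke_r2(neg2ll_model: float, neg2ll_null: float, n: int) -> float:
    """Cox-Snell R-square rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(-(neg2ll_null - neg2ll_model) / n)
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell / max_cox_snell


n_patients = 69      # assumed sample size for Models 1 to 3
n_converters = 30    # converters at first follow-up, per the table headers
d0 = null_neg2ll(n_converters, n_patients)

# -2LL values at first follow-up, as reported in Table 4.
for label, neg2ll in [("Model 1", 45.3), ("Model 2", 52.9), ("Model 3", 30.4)]:
    print(label, round(nagelkerke_r2(neg2ll, d0, n_patients), 2))
```

Under these assumptions the results round to 0.68, 0.61 and 0.81, the same values Table 4 reports for these three models. The models that include elongated tests appear to be based on slightly fewer patients, so the same calculation would need their own sample size and null deviance.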


Figure 3. Calibration plots of actual vs. predicted outcome for all models at first and second follow-up. Calibration is analyzed using the Hosmer-Lemeshow test. The predicted probabilities of conversion to AD are on the x-axis of the graph. The actual outcome is on the y-axis. Triangles indicate the observed frequencies by deciles of predicted probability. Perfect predictions are on the 45° line.

[Figure 3 panels: Model 1, Model 2, Model 3, Model 4 and Model 5, each shown for follow-up 1 and follow-up 2.]

Differences in model fit and predictive power between first and second follow-up.

For a better overview of the change in predictive power of our research models over time, we compared Nagelkerke R-square, predictive accuracy, sensitivity and specificity between first and second follow-up (Figure 4). For all three models the Nagelkerke R-square was higher at second follow-up than at first follow-up, indicating that the models have a better fit when looking at the clinical outcome at second follow-up. The overall predictive accuracy was also higher at second follow-up than at first follow-up. This indicates that all models predicted the clinical outcome more accurately when the time between prediction and diagnosis increased. Sensitivity did not change over time for model 3 and model 5, but did so for model 4. This indicates that the percentage of truly declining cases accurately classified by spatial map MRI was the same at first and second follow-up, whereas the percentage of truly declining cases accurately classified by elongated tests was higher at second follow-up than at first follow-up. Specificity was higher at second follow-up than at first follow-up for all three models, indicating that specificity increased as the time between prediction and diagnosis increased.

Figure 4. Difference in model fit and predictive power of the three main research models between first and second follow-up. Model fit is characterized by Nagelkerke R-square. Predictive power is characterized by overall accuracy, sensitivity and specificity. Model 3 is NPA plus spatial map MRI. Model 4 is NPA plus elongated tests. Model 5 is NPA plus spatial map MRI and elongated tests.

Stepwise analysis for elongated tests

In an explorative analysis, we ranked the elongated tests according to their suitability for group differentiation (AD conversion vs. no conversion) using backward stepwise logistic regression analysis. This analysis identified two elongated tests as best discriminating between conversion and non-conversion: Stroop card 3 (model chi-square = 4, df = 1, p = .044) and TMT B (model chi-square = 3, df = 1, p = .060). Adding more versions of RBMT, AVLT and letter fluency did not result in a significantly better model. Therefore, we re-executed all logistic regression analyses concerning elongated tests (model 4 and model 5) without the extra versions of RBMT, AVLT and letter fluency. Omitting the extra versions of these tests from the elongated NPA did not result in a significant change in model fit or predictive power of the research models. This indicates that just adding extra versions of TMT and Stroop to the NPA is sufficient to achieve the positive effect of test elongation.

DISCUSSION

Findings and explanations

In the present study, logistic regression models were developed to more accurately predict conversion to AD, combining well-known predictors such as structural brain volume and NPA with novel predictors such as elongated tests. First, our results suggest that adding a spatial map of brain regions to a basic NPA considerably improves the predictive power of the model. This new model explains significantly more of the variance than basic NPA alone; mean accuracy of the prediction rises from 85% for NPA alone to 92% for an NPA plus spatial map. Second, our results suggest that adding elongated tests to the basic NPA does not significantly improve the predictive power for first follow-up, but does so for second follow-up with an accuracy of 88%. Furthermore, our analysis suggests that the most accurate prediction model includes basic NPA combined with elongated tests and a spatial map of brain regions, with a mean model fit of .89 and an accuracy of 94% at second follow-up. Thus, the addition of the spatial map of brain regions and elongated tests to a basic NPA improves the predictive accuracy from 85% to 94%.

These findings are in line with our expectations described in the introduction, as both spatial map MRI and elongated tests increase the validity of the diagnostic assessment for pre-clinical AD (MCI). Spatial map MRI increases validity by distinguishing between cognitive disorders due to brain disease and those due to psychiatric syndromes. In case of psychiatric co-morbidity, neuropsychological methods may not render valid results, as the cognitive decline observed with neuropsychological tests is often not due to degenerative brain disease. Therefore, spatial map MRI partly explains other aspects of cognitive decline than NPA. Elongated tests increase the validity by improving the reliability of NPA. Because the validity of a diagnosis is limited by the reliability of the instruments used, part of the false positive diagnoses is probably due to measurement error, which can be decreased by elongation of tests. Because the tests used in the basic NPA already have reasonable reliability, we expected the effect of test elongation to be smaller than the effect of spatial map MRI. Combining these two methods will best distinguish those MCI patients who will convert to AD from those who will not, as the combination of the two methods increases both validity and reliability.
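One standard way to quantify how lengthening a test raises its reliability is the classical Spearman-Brown formula, and classical test theory also bounds the validity of a test by the square root of its reliability. The sketch below applies both. Whether these exact formulas underlie the reasoning above is not stated here, so this is an illustration only; the starting reliability of 0.70 is hypothetical, and the formula assumes that the added test version is parallel to the original.

```python
import math


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def validity_ceiling(reliability: float) -> float:
    """Upper bound on the correlation of a test with any criterion."""
    return math.sqrt(reliability)


single = 0.70                          # hypothetical reliability of one version
doubled = spearman_brown(single, 2.0)  # original plus one parallel version

print(f"reliability: {single:.2f} -> {doubled:.2f}")
print(f"validity ceiling: {validity_ceiling(single):.2f} -> {validity_ceiling(doubled):.2f}")
```

The gain shrinks as the starting reliability rises, which fits the expectation above that elongation adds less when the basic tests are already reasonably reliable.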

Furthermore, we predicted the accuracy of our prediction models to be higher for the first follow-up than for the second follow-up, as it is plausible that some of the converters at second follow-up did not show any signs at baseline yet. Contrary to this prediction, our results suggest that the accuracy of our models is higher for the second follow-up than for the first follow-up. A possible explanation for this unexpected finding is the concept of cognitive brain reserve (Stern, 2012; Scherder, 2011). Throughout life, people build up a cognitive brain reserve, for example by education, occupation and physical activity. This can result in an extensive compensation network (Grady et al., 2003), which limits the clinical consequences of brain atrophy. This makes it plausible that in the present study the pathology measured at baseline began to accumulate many years before the actual onset of clinical AD, and thus patients were able to perform well on the cognitive screening instruments at follow-up, despite brain atrophy. This can explain false negatives at follow-up. However, at a certain point brain pathology is so severe that clinical functioning cannot be maintained, regardless of the cognitive brain reserve patients have built up (Stern, 2012). This critical point may explain why the accuracy for second follow-up was higher than for first follow-up, as it is plausible that patients with AD pathology will reach this point of clinical dysfunction within the six-year follow-up. Hence, some of the patients with pathology at baseline who did not yet show AD symptoms at first follow-up did show symptoms at second follow-up. The false negatives became true positives and the accuracy increased.

Additionally, our explorative analysis suggests that adding extra versions of TMT B and Stroop card 3 to the NPA is sufficient to achieve the positive effect of test elongation. This means that adding extra versions of the RBMT delayed, AVLT delayed and fluency tests does not significantly increase the predictive power. This is not as expected, as Straurs (2006) found that AVLT and RBMT have lower reliability than Stroop and TMT and thus are likely to gain more reliability through test elongation. It is therefore possible that the increased accuracy by test elongation is not primarily caused by improved reliability, but by a learning effect. As patients without AD pathology are probably more sensitive to a learning effect than patients with AD pathology, the learning effect by elongation may better distinguish converters from non-converters. Previous research on memory tests in healthy persons (Frerichs & Tuokko, 2005) and in memory clinic patients (Tröster, Woods & Morgan, 2007) showed that there is no learning effect for the memory tests; in fact the opposite effect can be expected. Patients with a reasonably good memory often show proactive interference in these memory tasks. This is in line with our results, which demonstrated that in non-converters the score on memory tests, in particular on the AVLT, decreased when elongated tests were applied (see Table 3b for demographically corrected T-scores on neuropsychological tests). It is likely that this proactive interference negatively influenced the discrimination between converters and non-converters, resulting in the non-significant predictive value of the elongated memory tests. In contrast, a learning effect is very much possible for Stroop and TMT, as we did not use a parallel version for the Stroop, and the parallel version of the TMT is based on the same principle as the original version, on which performance can be improved by practice. The possible learning effect of TMT and Stroop and the proactive interference of memory tests may explain why the effect of elongation on predictive accuracy is stronger for Stroop and TMT than for memory tests. However, it should be noted that prediction is about estimation, and therefore non-significance does not mean that there is evidence for a non-effect. Especially in relatively small datasets, as is the case in our study, or when the predictors are well known from previous research, it is reasonable to include not only predictors that are significant, but also predictors with p values higher than .05 (Steyerberg, 2009). Because episodic memory is most vulnerable to degeneration due to AD (Scherder, 2011) and because of the relatively low reliability of these tests, we suggest not excluding AVLT and RBMT from the elongation method, despite their lack of significance.

Strengths and limitations of the study

Our study has a number of strengths. First, the inclusion criteria for our study were relatively broad, implying that the patients in our study were reasonably representative of the population of memory clinic patients with cognitive complaints but without clear dementia syndromes. This distinguishes our study from many previous studies, in which patients with psychiatric disorders are often excluded. Estimates of psychiatric co-morbidity in an early stage of AD vary from 25% in population-based studies to more than 60% in clinical samples (Steffens et al., 2006). In our study, about 50% of the patients had co-morbidity. Hence, excluding these patients from analysis does not do sufficient justice to clinical reality. It is a clinically relevant challenge to find valid ways to distinguish between mental symptoms that signify genuine cognitive disorders due to brain disease (with or without co-occurrence of psychiatric syndromes) and mental symptoms that stem primarily from psychiatric syndromes. The high accuracy of our new prediction models indicates that distinguishing between cognitive disorders due to AD and those due to psychiatric syndromes seems indeed possible. A second major strength of our study is that we excluded non-credible test performers. Patients with non-credible test results cause noise in the data, and therefore are a threat to the validity of an NPA (Bush et al., 2005). For example, if patients do not put enough effort into the tests or show extensive exaggeration, abnormal performance may not reflect cognitive impairments due to brain abnormalities (Rienstra et al., 2012). Especially when financial claims are involved, exaggeration of symptoms frequently occurs (Larrabee, 2000). Because it becomes uncertain whether the tests measure what they intend to measure, excluding these patients is necessary to prevent unjustified conclusions. A third strength, and perhaps an important contribution to the field, is that we improved the reliability of NPA through test elongation. We further introduced this improved behavioral method in combination with a spatial map of brain regions (based on spatial pattern analysis) and included both methods in a single prediction model.

Although our findings of new prediction models are promising, we acknowledge some limitations in our research. The first concerns the number of events in our study, i.e. the number of conversions to AD, which is low relative to the number of predictors in our models. A common rule of thumb is to require at least ten events per predictor (Steyerberg, 2009). The risk of including more predictors in the model is that the performance of the selected model could be overestimated, i.e. type 1 error and overfitting. Overfitting is mainly a problem in statistical selection methods, such as stepwise analysis, as the selection of predictors becomes unstable (Steyerberg & Vergouwe, 2014). However, our predictors are based on previous studies and clinical knowledge and thus have a theoretical underpinning rather than being the result of statistical selection methods. This makes it more plausible that, even though the model is overfitted, the predictors are not just specific to our sample, but a good representation of the population. This is supported by the plausible sizes of the adjusted R-squares we found for all our prediction models (see Table 4 of the Results section).
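To make the rule of thumb concrete, the short sketch below computes the events-per-predictor ratio for the basic NPA model, which has five individual predictors (RBMT delayed recall, AVLT delayed recall, letter fluency, Stroop card 3 and TMT B). The event counts are the numbers of converters given in the table headers (30 at first follow-up, 40 at second follow-up), and conversions are counted as the events, as in the paragraph above. The function names are hypothetical.

```python
def events_per_predictor(events: int, predictors: int) -> float:
    """Ratio of outcome events to candidate predictors in a model."""
    return events / predictors


def max_predictors(events: int, min_events_per_predictor: int = 10) -> int:
    """Largest number of predictors that still satisfies the rule of thumb."""
    return events // min_events_per_predictor


basic_npa_predictors = 5
for label, converters in [("first follow-up", 30), ("second follow-up", 40)]:
    ratio = events_per_predictor(converters, basic_npa_predictors)
    print(f"{label}: {ratio:.1f} events per predictor, "
          f"rule of thumb allows {max_predictors(converters)} predictors")
```

Even the smallest NPA model therefore falls short of ten events per predictor, and the combined models, which add MRI volumes and elongated tests, fall further below it.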


Second, it is debatable whether or not we used the right instruments for diagnosing the patients at the follow-ups. In our study, clinical diagnosis was established by a semi-structured interview and standardized screening instruments. However, in most memory clinics diagnosis is not only based on screening instruments, but also on a comprehensive NPA and brain MRI scan (as we used in our baseline diagnosis). The main reason why we chose these diagnostic instruments as a substitute for the comprehensive methods used in memory clinics is that it would be methodologically incorrect to determine the accuracy of a prediction model using an outcome based on the same variables as used in the prediction models (in our case neuropsychological tests and brain atrophy). This would have led to incorporation bias. As it was mandatory for the researcher to select a probability diagnosis based on the screening instruments, even in complex cases that would usually require additional examination, it is possible that some of the complex cases in our study would have been diagnosed differently in a clinical setting than in our study. However, although the MMSE, 7 Minute Screen and CDR measure a more global cognitive decline and some researchers suggest such scales are unreliable (Zamrini et al., 2004; Harrison et al., 2007), these instruments are generally well accepted for AD diagnosis when the disease is moderately severe and they are robust measures of cognitive decline (Meulen et al., 2004). Furthermore, when the performance of a patient was questionable, a neurologist further examined this patient. Therefore, we do not assume the clinical diagnosis of patients in our study to be very different from the clinical diagnosis of similar patients in memory clinics.

A third limitation is the possibility of observer-expectancy bias. Before testing the patients, researchers were informed about the patients’ performance and diagnosis in the previous assessments. This information might have led to expectations that patients were more (or less) capable than they actually were. For example, the researcher might have been more focused on abnormalities in patients diagnosed with MCI at baseline than in patients diagnosed as worried well. These expectations of the researcher could have unconsciously influenced the patients’ performance. Also, these expectations could have instigated the tendency to interpret information in a way that confirms the preconceptions (confirmation bias) and therefore biases the conclusion. Although we cannot rule out that some results might have been influenced by these biases, the use of standardized measurements has reduced the risk of confirmation bias. Improving the study design by blinding the researcher to the patients’ earlier performance would decrease the risk of biased conclusions and thereby increase the internal validity.


Furthermore, it is important to note that our imputation method may have led to an underestimation of the strength of the elongation effect. In our study we had various missing data points for the elongated tests, which we imputed by the method of last observation carried forward. This method assumes that a patient’s response remains constant over the parallel versions of the elongated tests. However, incompletion of the tests was more common in patients who were less able to do the tests or had become tired because of the effort spent during the NPA. Their responses probably would not have remained constant over the parallel versions but would have worsened. Because of this likely underestimation, we expect the effect of test elongation to be greater in circumstances where fewer data are missing.

Recommendation for future research

As is often the case, study results raise more questions than they answer, and more research is needed to answer these new questions. One such question, which arises from our finding that the accuracy of our model is higher for second follow-up than for first follow-up, is whether or not accuracy will increase even further in the longer term. It would be interesting to expand our study to a 10-year follow-up and evaluate the patients annually to obtain a better understanding of the change in accuracy over time. This would reveal how many years before clinical diagnosis our models can predict conversion to AD with reasonable accuracy. However, this brings up some ethical questions. For example, as long as there is no cure for AD, should cognitively unimpaired persons be informed about their 10-year prognosis? Is premature knowledge of the risk for AD that many years before actual diagnosis, without an available cure, not worse than ignorance?

Another question that arises is whether we can find (bio-)markers to predict the exact cognitive change of an individual patient. So far, research has focused on markers to predict conversion to AD and not so much on the prediction of the exact course of the cognitive changes. Therefore, the next step would be to use our predictors in order to predict the level of functioning and rate of cognitive decline over time. This may help identify inter- and intra-individual differences in the course of cognitive decline. However, understanding the course of decline by examining the predictors is very difficult, as there are many more risk factors that may influence the course of AD, such as genetic factors like Apolipoprotein E (APOE; Van der Flier and Scheltens, 2005), amyloid deposition in the brain (Jack & Holtzman, 2013; Forlenza, Diniz & Gattaz, 2010), cardiovascular disease (Stampfer, 2006; Van Exel et al., 2002), and several lifestyle patterns (Stern, 2012). Extending our study to longitudinal cognitive testing in combination with all these factors may help to accurately predict not just the onset of the disease but also its progression.

When effective treatments are found, additional research with more orientation on the ROC curve is needed. Although some attention was paid to the ROC curve, our main analysis was based on the 50% cut-off point as for our research questions false positive and false negative classification were equally important. However, if future medication for AD can delay the onset of clinical symptoms, false negative classification would be a bigger problem than false positive classification, assuming that there are no substantial negative side-effects. More research is needed in order to accurately specify the sensitivity and specificity for different cut-off points, so that clinicians can decide on an individual basis whether or not treatment is effective.
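The trade-off between sensitivity and specificity at different cut-off points can be illustrated with a small sketch. The predicted probabilities and outcomes below are invented for illustration and are not data from this study; `classification_metrics` is a hypothetical helper that classifies a patient as a converter when the predicted probability is at or above the cut-off.

```python
from typing import Sequence, Tuple


def classification_metrics(
    probabilities: Sequence[float], outcomes: Sequence[int], cutoff: float
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) at the given cut-off.

    outcomes: 1 = converted to AD, 0 = did not convert.
    """
    tp = fn = tn = fp = 0
    for prob, outcome in zip(probabilities, outcomes):
        predicted_converter = prob >= cutoff
        if outcome == 1:
            if predicted_converter:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_converter:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical predicted probabilities: four converters, four non-converters.
probs = [0.95, 0.80, 0.60, 0.40, 0.38, 0.70, 0.20, 0.10]
truth = [1, 1, 1, 1, 0, 0, 0, 0]

for cutoff in (0.50, 0.35):
    sens, spec, acc = classification_metrics(probs, truth, cutoff)
    print(f"cut-off {cutoff:.2f}: sensitivity {sens:.2f}, "
          f"specificity {spec:.2f}, accuracy {acc:.2f}")
```

In this invented example, lowering the cut-off from 0.50 to 0.35 removes the false negative at the cost of an extra false positive, while overall accuracy stays the same. That is the kind of shift that would matter if missing a future converter became more costly than treating a non-converter.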

In conclusion, our results complement prior research of early AD detection by demonstrating that when improved behavioral measurements and improved neuroimaging techniques are included in the same model, high accuracy can be reached in predicting who will subsequently develop AD. Additionally, we found the accuracy to be higher when the interval between baseline and follow-up increases, possibly due to cognitive brain reserve.

Improvement of early and differential diagnosis is of great importance, especially in view of emerging disease-modifying treatments. Early treatment of patients with a high likelihood of conversion to AD could result in preservation of their cognition and maintenance of functional independence (Gauthier et al., 2006). Even when no effective treatment is found, early detection of AD can help patients and caregivers to better anticipate future decline. However, ethical questions make it debatable how far before actual diagnosis early AD detection is desirable. Is early detection useful even in totally healthy persons, or just in persons in whom the first cognitive complaints can be observed?

ACKNOWLEDGEMENT

I thank prof. dr. B. Schmand for his enthusiastic supervision during the whole research project and for sharing his stories and thoughts regarding the politics of research, dr. P. de Groot for his patient support throughout my MRI analyses and for teaching me the ins and outs of Freesurfer and 3D-Slicer, and dr. S. van der Werf for his additional comments on my thesis proposal. Last but not least, I would like to thank all patients participating in the IDADO project, who made our study possible.

REFERENCES

American Psychiatric Association (2001). Beknopte Handleiding bij de Diagnostische Criteria van de DSM-IV-TR (Nederlandse Vertaling). Uitgeverij Harcourt Assessment, Amsterdam.

Brookmeyer, R., Johnson, E., Ziegler-Graham, K., Arrighi, H. M. (2007). Forecasting the global burden of Alzheimer’s disease. Alzheimer's & Dementia, 3, 186–191.

Bush, S. S., Ruff, R. M., Troster, A. I., Barth, J. T., Koffler, S. P., Pliskin, N. H., Silver, C. H. (2005). Symptom validity assessment: Practice issues and medical necessity: NAN Policy & Planning Committee. Archives of Clinical Neuropsychology, 20, 419-426.

Chételat, G., Villemagne, V. L., Pike, K. E., Ellis, K. A., Ames, D., Masters, C. L., & Rowe, C. C. (2012). Relationship between memory performance and β-amyloid deposition at different stages of Alzheimer's disease. Neurodegenerative Diseases, 10, 141-144.

Cummings, J. L., Morstorf, T., Zhong, K. (2014). Alzheimer's disease drug-development pipeline: few candidates, frequent failures. Alzheimer's Research & Therapy, 6, 1-7.

Davatzikos, C., Fan, Y., Wu, X., Shen, D., Resnick, S. M. (2008). Detection of prodromal Alzheimer’s disease via pattern classification of magnetic resonance imaging. Neurobiol Aging, 29, 514–523.

Deelman, B., Eling, P., de Haan, van Zomeren, E., et al. (2006). Klinische Neuropsychologie. Uitgeverij Boom, Amsterdam.

Devanand, D. P., Pradhaban, G., Liu, X., Khandji, A., De Santi, S., Segal, S., Rusinek, H., Pelton, G. H., Honig, L. S., Mayeux, R., Stern, Y., Tabert, M. H., de Leon, M. J. (2007). Hippocampal and entorhinal atrophy in mild cognitive impairment: Prediction of Alzheimer disease. Neurology, 68, 828-836.

Evans, M. C., Barnes, J., Nielsen, C., Kim, L. G., Clegg, S. L., Blair, M., Leung, K. K., Douiri, A., Boyes, R. G., Ourselin, S., Fox, N. C. (2010). Volume changes in Alzheimer’s disease and mild cognitive impairment: cognitive associations. Eur Radiol, 20, 674–682.
