Univariate normative comparisons for multilevel norm data with missing data


Author: Jacqueline Zadelaar
Project: Master thesis
Supervisor: prof. dr. H. Huizenga
Second assessor: dr. R. Grasman
Date: 09-09-2016
Word count: 7,055


Abstract

Normative comparison is a method for comparing a single test score to the scores of a norm group. It is regularly applied in neuropsychological assessment to evaluate whether an individual deviates on cognitive traits like memory or attention. Resampling methods, mainly one-step and step-down resampling, are advised for these comparisons, as they: 1) correct for multiple testing, while 2) maintaining high sensitivity (the power to detect true differences). A remaining problem is that normative comparison requires a norm group that: 1) is representative of the population that the individual would belong to were (s)he cognitively healthy, and 2) includes all tests of interest. Such data is most easily obtained by merging the data of several completed studies. Unfortunately, this leads to multilevel data with large portions of systematically missing data, which, if not accounted for, can result in inaccurate analyses. Therefore, the resampling methods currently used in normative comparison were expanded for use with this type of data. The expanded methods performed satisfactorily for multilevel data, but only when no missing data was involved. From this we must conclude that the expanded methods perform adequately for use in neuropsychological assessment, though only when the norm group dataset does not contain missing data.


Introduction

Normative comparison is a method of comparing test scores to the scores of a general population. It is often applied in neuropsychological assessment, with the goal to draw conclusions about an individual’s cognitive capacities, like memory or concentration. If an individual deviates sufficiently from the norm group, a group of healthy individuals, we may speak of “abnormality” (Harvey, 2012; Kendall, Marrs-Garcia, Nath & Sheldrick, 1999). As such conclusions may affect one’s academic, professional and personal life, assessment accuracy is vitally important. For example, a ‘healthy’ individual being falsely diagnosed with memory impairments could result in a waste of time and treatment resources, as well as personal suffering. Similarly, an undiagnosed condition may linger or even worsen over time, possibly with dire consequences for the individual and their surroundings (Harvey, 2012). As such, the focus of this paper will be on improving statistical methods for normative comparison.

Normative comparison entails comparing the t-statistic of a single test score (denoted as tnorm) to the distribution of t-statistics of a norm group’s test scores under the null-hypothesis (Huizenga, Agelink, Grasman, Muslimovic & Schmand, 2016). If the resulting p-value is lower than the chosen significance threshold, the assessed individual is concluded to differ from the norm on the test in question. In neuropsychological assessment it is common to administer multiple tests. As such, multiple normative comparisons need to be performed. This leads to an increased familywise error (FWE) rate, that is, an increased probability of at least one false positive¹ result occurring (Binder, 2009; Van der Laan, Dudoit & Pollard, 2004). This happens because each comparison has a certain chance of producing a false positive result, so when more than one comparison is executed, the chance of at least one false positive occurring grows, as illustrated by the FWE rate (FWER) formula²: $FWER = 1 - (1 - \alpha)^M$. Here α denotes the chosen significance threshold (for example, α = .05) and M the total number of comparisons. Notice that the FWER equals α when M = 1 and increases as M increases (Feise, 2002; Huizenga et al., 2016). In terms of neuropsychological assessment, this means that administering more tests to the individual increases the chance of at least one test falsely indicating deviation. To prevent this from happening, methods that correct for the increased FWER should be applied.
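The growth of the FWER with the number of comparisons can be illustrated with a few lines of Python. This is only a sketch of the formula above under its independence assumption; the function name `familywise_error_rate` is illustrative and not part of the thesis.

```python
def familywise_error_rate(alpha, n_comparisons):
    """FWER for independent comparisons: 1 - (1 - alpha)^M."""
    return 1 - (1 - alpha) ** n_comparisons


# With alpha = .05, a single comparison keeps the FWER at alpha,
# whereas nine comparisons raise it to roughly .37.
fwer_single = familywise_error_rate(0.05, 1)
fwer_nine = familywise_error_rate(0.05, 9)
```

For correlated tests the true rate would lie below these values, so the sketch gives an upper bound rather than an exact rate.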

This can be achieved with resampling procedures, or ‘permutation tests’, which are non-parametric significance tests. Recall that normative comparison entails comparing a single test score to the distribution of norm group test scores. In uncorrected normative comparison, this norm group distribution is simply that of the t-statistics of norm group test scores under the null-hypothesis. Using resampling procedures, however, a different kind of norm distribution is computed. Comparing the individual’s test scores to this newly created (or ‘resampled’) distribution reduces the FWER seen when multiple normative comparisons are performed using the uncorrected method. Moreover, because resampling procedures take into account between-comparison dependencies, power³ is maintained (Huizenga et al., 2016; Li & Dye, 2013). This provides an advantage over other methods like the Bonferroni correction, where the power rapidly decreases as the number of comparisons grows, especially when applied to more dependent (more highly correlated) tests (Bland & Altman, 1995; Moran, 2003; Narum, 2006). Other advantages are that resampling procedures do not depend on the normality assumption, and perform relatively well with small datasets (Li & Dye, 2013; Troendle, 1995). A downside is that the actual computations can be complex, but related complications can be prevented by implementing these procedures in user-friendly software (Huizenga et al., 2016).

¹ False positives, also referred to as “type I errors”, are cases wherein the null hypothesis is falsely rejected. In the case of neuropsychological assessment, false positives result in an untruthful conclusion that there is deviation from the norm.

² Note that this formula only applies if the tests are independent. When the tests are correlated, the FWE rate will be lower than the formula would indicate, though still exceeding the preset α.

³ Power, also known as the ‘true positive rate’, is a statistical measure that describes the extent to which a method can detect true deviation.

In summary, resampling procedures seem very suitable for multiple normative comparisons, as often conducted in neuropsychological assessment. Still, these methods face one important problem: requiring an appropriate norm group. After all, comparing an 80-year old male with a norm group of 20-year old women may well result in deviation(s) attributable to demographic differences rather than cognitive abnormalities. Moreover, the data needs to contain all tests that were administered to the individual, and needs to consist of raw data (Huizenga et al., 2016). Such a specific dataset will rarely be available, and collecting norm data separately for each tested individual would be very resource-consuming. Instead, one might merge already available datasets – the ‘healthy’ control groups of other studies – to create one dataset that meets all these demands. With data-sharing increasing in popularity, especially in social sciences (Asendorpf et al., 2013; King, 2011; Poline et al., 2012; Vines et al., 2014), this seems like an opportune solution. Yet, combining the data of several studies into one dataset would result in multilevel data; hierarchically structured data with some levels ‘nested’ (or embedded) within others (Steenbergen & Jones, 2002). Our norm data – as described above – would have two levels, namely a within-study level (the level of individual participants) and a between-study level (the level of different studies), with the former nested within the latter. This multilevel structure accounts for the expectation that test scores are influenced by both interpersonal differences (e.g. differences in intelligence or memory) and by differences between studies (e.g. sample or design). Ignoring this structure is treating the data as if all studies were identical – as if all data variance is solely attributable to individual differences. 
This may lead to underestimation of standard errors, resulting in overestimation of t-statistics and a subsequent increase in FWER (Steenbergen & Jones, 2002). As such, expanding resampling methods to account for multilevel structure is expected to improve analysis accuracy. Another problem of a merged norm dataset is that of missing data: because not every study includes every test of interest, some test scores will be missing for certain studies. These large portions of systematically missing data also need to be accounted for. In short, current methods of normative comparison need to be expanded for use with multilevel data with systematically missing data.

A multivariate method to analyze this kind of data⁴ has already been developed (Agelink van Rentergem, Murre, & Huizenga, submitted). Yet, while there are several advantages to multivariate approaches, like the capacity to analyze complex scenarios (Stevens, 2009), there are also complications. Whereas univariate comparisons simply investigate whether a single statistic deviates from a single distribution, multivariate comparisons investigate a more overall effect, incorporating between-test dependencies. As such, results can be difficult to interpret without a statistical background. Also, the exact source of overall deviation is hard to identify, while it is important in neuropsychological assessment to know exactly which tested trait(s) of an individual need special consideration (Stevens, 2009). This is why the goal of this paper is to expand univariate resampling procedures for (multiple) normative comparisons, so that they can be used in neuropsychological assessment with multilevel data containing (large) portions of missing data.

Two currently existing resampling procedures, the one-step and the step-down resampling procedure, that have already proved to be effective when applied for (multiple) normative comparison, are expanded for applications in data with a multilevel structure and large portions of systematically missing data, and then tested on accuracy. If applying these expanded methods to multilevel data produces: 1) FWERs no larger than the significance threshold of 0.05 and lower than those of non-expanded methods, while 2) maintaining power similar to (or better than) that of the non-expanded methods, the expanded methods are deemed a necessary and adequate improvement.

Methods

Operationalisation

Firstly, the non-expanded (currently available) resampling procedures were applied to four kinds of simulated data: 1) non-multilevel data without missing data, 2) non-multilevel data with missing data, 3) multilevel data without missing data, and 4) multilevel data with missing data. For each of these four datasets the FWER and power were estimated. These results served as a baseline. Next, the expanded resampling procedures were applied to the same four types of data. Again, the FWER and power were calculated. Performance of the expanded methods was assessed by: 1) examining the resulting FWER and power, and 2) comparing these results to those of the non-expanded methods.

Simulation characteristics

Data was simulated in R, with each dataset containing normative data (the norm group) and patient data (the evaluated individual). The normative data was simulated as if the data of N participants, from S studies, each containing some or all of the possible M tests, was merged. The data was made to resemble that of actual studies, containing test scores but also participant IDs, three demographic variables (age, gender, education level), and test score residuals. Test scores were computed using a multilevel linear regression model, based on Agelink van Rentergem et al. (submitted), using the demographic variables and residuals as input. The residuals were computed as random draws from a multilevel distribution, using the within-study and between-study covariance matrices as input (for the exact code, see appendix B). As the residuals contained all relevant information (the part of the test scores accounted for by individual differences and between-study differences), the residuals, not the test scores, were used in all analyses.

For multilevel data (S > 1), the residuals consisted of a within-study part, denoted as epsilon (ε), which differs for each participant and test, and a between-study part, denoted as nu (ν), which varies between different tests and different studies. As non-multilevel data consists of only one study (S = 1), it should have no between-study part (between-study variances and covariances were set at zero), meaning non-multilevel residuals consisted solely of epsilons. This study used the exact residuals (the ones used to simulate the data) in all analyses. In real-life situations, however, the residuals would have to be estimated from the data. The simulated data consisted of 9 tests, completed by 900 participants (M = 9; N = 900), both for non-multilevel and multilevel data, which consisted of 1 and 30 studies (S = 1 and S = 30) respectively. The number of data points was kept constant over non-multilevel and multilevel data, so as to optimize comparability of results. The participant-study ratio of the multilevel data was chosen because it corresponds to 30 participants per study, a regularly used sample size as dictated by the Central Limit Theorem (Field, 2009). The exact values were chosen because a small number of studies was expected to undermine power (Maas & Hox, 2005), while any more than 30 studies seemed unrealistic. The number of tests was a compromise between having a large number of tests, as is conventional in neuropsychological assessment, and practicality, as a larger number of tests increases simulation duration substantially and was not expected to alter results substantially.

The within- and between-study variances were based on Agelink van Rentergem et al. (submitted), where they were set at 25 and 5 respectively, resulting in a 5 to 1 ratio. We adopted a 5:2 ratio instead, so as to implement a more predominant multilevel structure (by having a relatively larger between-study variance), thus making the distinction between multilevel and non-multilevel data, and more specifically the influence of the current and the expanded resampling methods on them, more clear-cut. The exact values of the within- and between-study variances and covariances used to simulate the data are depicted in Tables 1 and 2. Notice that the total amount of variance (and covariance) was kept constant for non-multilevel and multilevel simulated data to ensure similarity between the two types of data, thus increasing comparability.

Table 1. The Within-study (left) and the Between-study (right) Variance-Covariance Matrices used to Simulate Non-multilevel Data.

          Within-study                          Between-study
          Test 1  Test 2  . . .  Test M         Test 1  Test 2  . . .  Test M
Test 1      20       8      8       8              0       0      0       0
Test 2       8      20      8       8              0       0      0       0
. . .        8       8     20       8              0       0      0       0
Test M       8       8      8      20              0       0      0       0

Note: The diagonal contains all test variances, the off-diagonal contains all the test covariances. Tests not depicted have the same diagonal and off-diagonal values as the rest.

Table 2. The Within-study (left) and the Between-study (right) Variance-Covariance Matrices used to Simulate Multilevel Data.

          Within-study                          Between-study
          Test 1  Test 2  . . .  Test M         Test 1  Test 2  . . .  Test M
Test 1      12       8      8       8              8       0      0       0
Test 2       8      12      8       8              0       8      0       0
. . .        8       8     12       8              0       0      8       0
Test M       8       8      8      12              0       0      0       8

Note: The diagonal contains all test variances, the off-diagonal contains all the test covariances. Tests not depicted have the same diagonal and off-diagonal values as the rest.
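To make the two-part residual structure concrete, the following Python sketch draws residuals with the covariance pattern of Table 2. It is not the thesis's R code (which is in appendix B) and only illustrates one possible approach. Because all within-study covariances are equal, the within-study part is generated here as a shared participant draw plus a test-specific draw, which is an assumption about how such a matrix can be reproduced. All function and variable names are hypothetical.

```python
import random
from math import sqrt


def simulate_residuals(n_studies, n_per_study, n_tests, rng,
                       within_var=12.0, within_cov=8.0, between_var=8.0):
    """Return (eps, nu, study_of): within-study residuals per participant,
    between-study residuals per study, and each participant's study index."""
    # Between-study part: independent across tests (off-diagonal values are 0).
    nu = [[rng.gauss(0.0, sqrt(between_var)) for _ in range(n_tests)]
          for _ in range(n_studies)]
    eps, study_of = [], []
    for study in range(n_studies):
        for _ in range(n_per_study):
            # One shared draw per participant produces the covariance between
            # tests; a test-specific draw supplies the remaining variance.
            shared = rng.gauss(0.0, sqrt(within_cov))
            eps.append([shared + rng.gauss(0.0, sqrt(within_var - within_cov))
                        for _ in range(n_tests)])
            study_of.append(study)
    return eps, nu, study_of


rng = random.Random(2016)
eps, nu, study_of = simulate_residuals(30, 30, 9, rng)
# Total residual per participant and test: epsilon plus the study's nu.
residuals = [[e + nu[s][j] for j, e in enumerate(row)]
             for row, s in zip(eps, study_of)]
```

The non-multilevel setting of Table 1 would correspond to calling the same function with one study of 900 participants, `within_var=20.0` and `between_var=0.0`.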


Both multilevel and non-multilevel data were computed with and without missing data. In the missing data condition, we assumed that every pair of tests co-occurred in a study at least once, because the covariances between tests cannot be estimated otherwise (Agelink van Rentergem et al., submitted). While this estimation complication does not affect this study, as the covariances are known (they are used to simulate the data), it helps to translate our results to situations with real data. Missing data was generated so that a substantial part, namely 50%, of the data was missing.

Patient data had the same format as the norm dataset, only for N = 1. Patients were either healthy (with the exact residuals as used in simulating the data) or deviant (two standard deviations below the true residuals). This made it possible to estimate both the FWER and power of different datasets and methods. As with the norm data, only the patient’s residuals were used in further computations. Both the norm data residuals and the patient data residuals were centered⁵, as is required for the resampling procedures applied in this study.

A total of 1000 datasets (each consisting of one norm dataset and one patient dataset) were computed for each type of data (non-multilevel vs. multilevel, no missing data vs. missing data). These large numbers enabled accurate estimation of FWER and power.

Normative Comparison

Recall that normative comparison entails comparing a single test score to the distribution of a norm group. This method is an extension of the one-sample t-test, which tests if a single score deviates from a norm group mean, as depicted in equation 1:

$$t_{test} = \frac{x^* - \bar{y}^*}{sd(y^*)/\sqrt{N}} \tag{1}$$

In this formula, $x^*$ indicates the single score; $\bar{y}^*$ is the mean of the N norm group scores $y_1, y_2, \dots, y_N$; $sd(y^*)/\sqrt{N}$ equals $se(y^*)$, the standard error of the mean of $y^*$; and $t_{test}$ is the resulting test statistic, which corresponds to a certain p-value. The * indicates that the scores are centered.

In normative comparison, however, the goal is not to assess whether a score deviates from the mean, but rather from the distribution of the norm group. This is accomplished by the addition of a scaling factor, see equation 2:

$$t_{norm} = \frac{x^* - \bar{y}^*}{sd(y^*)/\sqrt{N}} \cdot \frac{1}{\sqrt{N + 1}} \tag{2}$$

This scaling factor, which equals $1/\sqrt{N + 1}$, causes $(x^* - \bar{y}^*)$ to be divided by the standard deviation of the distribution, instead of the standard error of the mean of $y^*$. The resulting statistic, tnorm, is called the normative statistic, which is then compared to the Student’s t-distribution with N – 1 degrees of freedom. This corresponds to a p-value indicating whether or not there is deviation (Huizenga et al., 2016).
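The relation between equations 1 and 2 can be sketched in Python as follows. The function names and the example scores are made up for illustration. The p-value lookup in the t-distribution is omitted because the standard library provides no t-distribution.

```python
import statistics
from math import sqrt


def t_test_statistic(x, norm_scores):
    """Equation 1: distance of x from the norm mean in standard-error units.
    Centering x and the norm scores on the norm mean leaves this unchanged."""
    n = len(norm_scores)
    standard_error = statistics.stdev(norm_scores) / sqrt(n)
    return (x - statistics.mean(norm_scores)) / standard_error


def t_norm_statistic(x, norm_scores):
    """Equation 2: equation 1 multiplied by the scaling factor 1 / sqrt(N + 1)."""
    return t_test_statistic(x, norm_scores) / sqrt(len(norm_scores) + 1)


norm_scores = [48, 52, 50, 47, 53]   # hypothetical norm group
t_norm = t_norm_statistic(40, norm_scores)
```

Algebraically the result equals $(x - \bar{y}) / (sd(y)\sqrt{1 + 1/N})$, so for large N the statistic approaches the distance of x from the norm mean expressed in standard deviations.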

⁵ Centering causes the mean of a set of scores to equal zero. Centered scores are a condition for the resampling methods applied in this study (Huizenga et al., 2016). The centered scores are calculated as follows: $y_n^* = y_n - \bar{y}$ and $x^* = x - \bar{y}$. Note that this will cause each test mean to equal zero ($\bar{y}_1^* = \bar{y}_2^* = \dots = \bar{y}_M^* = 0$).


The significance threshold chosen in this study was α = .05, as is conventional in psychological research (Field, 2009). As we set all deviating patients to have scores below the norm, a one-sided significance test was used. This translates well to real situations in neuropsychological assessment, wherein the assessor is likely to have prior expectations about the direction of deviation. In the case of multiple testing, the normative comparison analysis as described above is performed separately for each of the M tests of interest. This means that multiple p-values are calculated, resulting in an increased FWER.

Resampling

To remedy the increased FWER associated with multiple testing, normative comparison may be done using resampling procedures. These procedures operate by ‘resampling’ the original norm data, resulting in a new dataset which consists of different scores that, like the original, satisfies the null-hypothesis⁶ (Yu Chong Ho, 2003). The resampling procedures used to correct for multiple normative comparisons apply a method called ‘sign-flipping’. This entails all scores of each norm group participant being randomly multiplied by either 1 or -1 (e.g. all test scores of participant 1 are multiplied by 1, all test scores of participant 2 are multiplied by -1, etc.), see Figure 1A.

Figure 1. A schematic of the sign-flipping method applied in this study. A vector containing values of 1 and -1 is computed randomly (the red and blue rectangles), and multiplied with the scores of the original dataset, resulting in a new dataset of resampled scores. The scores are color-coded from highest (light) to lowest (dark). A) Resampling methods not yet expanded for multilevel data. The residuals (ε) differ per participant, and thus all residuals of each participant are separately multiplied by either 1 or -1; all M test scores of each participant are multiplied by the same sign. B) Multilevel resampling. Multilevel data contains two residuals, the within-study part (ε) and the between-study part (ν), which are resampled separately; ε is resampled in the same manner as the non-multilevel variant; ν is resampled in blocks, with each score from the same study being multiplied by the same sign, either 1 or -1.

Repeating this process P times provides P different, fictitious (or ‘resampled’) norm group datasets. Each resampled dataset is then transformed into statistics by applying a one-sample t-test (equation 1) to each of the norm group scores, assessing whether each score differs significantly from zero ($\bar{y}^* = 0$). This results in P resampled datasets, each consisting of N t-statistics. Only the most favourable t-statistics of each resampled dataset are used to form the norm group distribution to which tnorm is compared. Comparing tnorm to the resampled norm group distribution entails calculating the proportion of absolute norm group t-statistics that are equal to or larger than the absolute value of tnorm. This proportion equals the p-value that will indicate whether or not there is deviation (Huizenga et al., 2016). Which t-statistics are selected to form the norm group distribution depends on the exact resampling procedure. This study focusses on two resampling procedures, the one-step and the step-down resampling procedure, both of which have already proven to reduce the increased FWER caused by multiple normative comparisons, while maintaining power (Huizenga et al., in press).

In the one-step resampling procedure, tnorm is not compared to the distribution of t-statistics under the null-hypothesis, as done in uncorrected normative comparison, but to the max-distribution: the distribution of the maximum absolute t-statistics under the null-hypothesis. This distribution is obtained by: 1) resampling the dataset so that the null-hypothesis is true for all M tests, 2) selecting the largest absolute t-statistic of that resample, 3) repeating this P times, and 4) combining all maximum t-statistics into one distribution (Huizenga et al., 2016; Li & Dye, 2013). Because this max-distribution only contains the most extreme t-statistics of the norm group, the chance of false positives decreases; if an individual’s scores are extreme when compared to the most extreme scores of a norm group, it seems more likely to reflect true deviation than if compared to all norm test scores. As such, this method reduces FWER.
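The four steps of the one-step procedure can be sketched in Python as below. This is a simplified illustration rather than the implementation used in the thesis. It assumes that each resample yields one t-statistic per test, computed from the sign-flipped norm scores against zero. The names `max_distribution` and `one_step_p_value` are hypothetical.

```python
import random
import statistics
from math import sqrt


def one_sample_t(scores):
    """t-statistic of the mean of `scores` against zero."""
    return statistics.mean(scores) / (statistics.stdev(scores) / sqrt(len(scores)))


def max_distribution(norm, n_resamples, rng):
    """norm: N x M centered norm scores. Returns one maximum |t| per resample."""
    n_tests = len(norm[0])
    maxima = []
    for _ in range(n_resamples):
        # Sign-flipping: one random sign per participant, applied to all tests.
        signs = [rng.choice((1, -1)) for _ in norm]
        flipped = [[sign * value for value in row]
                   for sign, row in zip(signs, norm)]
        t_stats = [one_sample_t([row[j] for row in flipped])
                   for j in range(n_tests)]
        maxima.append(max(abs(t) for t in t_stats))
    return maxima


def one_step_p_value(t_norm, maxima):
    """Proportion of resampled maxima at least as extreme as |t_norm|."""
    return sum(m >= abs(t_norm) for m in maxima) / len(maxima)


rng = random.Random(1)
norm = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]  # toy norm data
maxima = max_distribution(norm, n_resamples=200, rng=rng)
p_value = one_step_p_value(2.5, maxima)
```

The toy example uses 200 resamples to keep it fast; the study itself used P = 2000.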

The step-down procedure is similar. Like in one-step resampling, the largest absolute tnorm is compared to the max-distribution over all M tests. All other tnorm values, however, are compared to the max-distribution over all tests excluding those whose tnorm has a higher absolute value. So the second largest tnorm is compared to the max-distribution over all tests except the one corresponding to the largest tnorm, the third largest tnorm is compared to the max-distribution over all tests except the tests corresponding to the largest and second largest tnorm, and so on. Note that the p-value resulting from a lower absolute tnorm may not be lower than that of a higher tnorm, as it would be counter-intuitive for a less extreme result to indicate more significant deviation. So if a certain tnorm yields a p-value lower than that of a higher tnorm, its p-value is raised to the smallest value for which this rule still holds (Huizenga et al., 2016). For example, if the highest tnorm yields p = .05 and the second-highest tnorm yields p = .04, the second p-value will be set to .05. Like in the one-step procedure, step-down resampling reduces FWER by comparing an individual’s test statistics to the most extreme, instead of all, norm group statistics. Also, because each test statistic is compared to a different distribution, depending on its rank, high power is maintained.
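The ordering and the monotonicity rule of the step-down procedure can be illustrated with the sketch below. It assumes that the resampled statistics are already available as a P x M table (one row per resample, one column per test). The function name and the toy numbers are hypothetical.

```python
def step_down_p_values(observed, resampled):
    """observed: M normative statistics; resampled: P rows of M statistics
    obtained under the null-hypothesis. Returns M adjusted p-values."""
    n_tests = len(observed)
    # Handle tests from the largest to the smallest absolute statistic.
    order = sorted(range(n_tests), key=lambda j: abs(observed[j]), reverse=True)
    adjusted = [0.0] * n_tests
    running_max = 0.0
    for rank, j in enumerate(order):
        remaining = order[rank:]  # drop tests with a larger absolute statistic
        maxima = [max(abs(row[k]) for k in remaining) for row in resampled]
        p = sum(m >= abs(observed[j]) for m in maxima) / len(maxima)
        # Monotonicity: a less extreme statistic never gets a smaller p-value.
        running_max = max(running_max, p)
        adjusted[j] = running_max
    return adjusted


resampled = [[3.5, 0.1], [3.2, 0.2], [0.1, 0.1], [0.2, 0.3]]
p_values = step_down_p_values([3.0, 2.5], resampled)
```

In this toy example the second test on its own would receive a p-value of 0, but the rule raises it to the p-value of the first test (0.5).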

Both resampling procedures hinge on the assumption of exchangeability (Winkler, Ridgway, Webster, Smith, & Nichols, 2014) and require the dataset to consist of raw, centered scores (Huizenga et al., in press). In this study, the number of resamples was set at 2000 (P = 2000). This ensured smooth resampled distributions, which improves analysis accuracy.


To expand the methods discussed above for multilevel structured data, both normative comparison and the resampling procedures need to be adapted accordingly. The resampling procedures themselves need to be expanded because the residuals of multilevel data consist of two distinguishable parts (a within-study part and a between-study part) instead of one (a within-study part). As such, both parts need to be resampled separately, see Figure 1B. Notice that the epsilons are resampled in the same manner as with non-multilevel data, with all scores from the same participant being randomly multiplied by either 1 or -1. The nu’s, however, are resampled in blocks; as the nu’s depict the between-study differences, all scores from the same study are multiplied randomly by either 1 or -1.
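A minimal Python sketch of this two-part sign-flipping is given below. It assumes that the within-study residuals (one row per participant) and the between-study residuals (one row per study) are stored separately, together with each participant's study index. The data layout and all names are illustrative.

```python
import random


def sign_flip_multilevel(eps, nu, study_of, rng):
    """eps: N x M within-study residuals; nu: S x M between-study residuals;
    study_of[i]: study index of participant i.
    Returns the resampled N x M scores and the signs that were drawn."""
    participant_signs = [rng.choice((1, -1)) for _ in eps]  # one per participant
    study_signs = [rng.choice((1, -1)) for _ in nu]         # one per study (block)
    resampled = []
    for i, row in enumerate(eps):
        study = study_of[i]
        resampled.append([participant_signs[i] * e + study_signs[study] * nu[study][j]
                          for j, e in enumerate(row)])
    return resampled, participant_signs, study_signs


eps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
nu = [[10.0, 20.0], [30.0, 40.0]]
study_of = [0, 0, 1]  # participants 0 and 1 share study 0
resampled, participant_signs, study_signs = sign_flip_multilevel(
    eps, nu, study_of, random.Random(0))
```

Participants from the same study always receive the same sign on their ν part, while their ε parts are flipped independently.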

Next, normative comparison required adaptation; recall that ignoring the multilevel structure of data results in underestimation of standard errors, which needs to be corrected for. This may be achieved by calculating standard errors using the ‘effective sample size’ instead of the total sample size⁷. The formula for the effective sample size is given in equation 3:

$$N_{eff} = \frac{N}{1 + (N_{clus} - 1)\rho} \tag{3}$$

Here N is the total sample size (the number of participants per administered test over all studies), $N_{clus}$ is the number of participants per cluster (the number of people per study), and ρ the intraclass correlation (the ratio of the between-study variance to the total variance). Notice that: 1) for non-multilevel data, ρ equals zero, so Neff will equal N, meaning no correction will occur, and 2) as between-study differences increase, ρ increases also, resulting in a smaller Neff, which in turn increases the standard error (Hox, Moerbeek & van de Schoot, 2010). Replacing N with Neff in equations 1 and 2 results in new formulas to calculate the norm group t-statistics (ttest) and the normative statistics (tnorm), see equations 4 and 5 respectively:

$$t_{test} = \frac{x^* - \bar{y}^*}{sd(y^*)/\sqrt{N_{eff}}} \tag{4}$$

$$t_{norm} = \frac{x^* - \bar{y}^*}{sd(y^*)/\sqrt{N_{eff}}} \cdot \frac{1}{\sqrt{N_{eff} + 1}} \tag{5}$$
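As a worked illustration of equation 3, the sketch below computes the effective sample size for the multilevel setting without missing data, using the variances from Table 2 (within-study 12, between-study 8), 900 participants and 30 participants per study. The function names are illustrative.

```python
def intraclass_correlation(between_var, within_var):
    """Ratio of the between-study variance to the total variance."""
    return between_var / (between_var + within_var)


def effective_sample_size(n, n_cluster, rho):
    """Equation 3: N_eff = N / (1 + (N_clus - 1) * rho)."""
    return n / (1 + (n_cluster - 1) * rho)


rho = intraclass_correlation(between_var=8, within_var=12)   # 0.4
n_eff = effective_sample_size(n=900, n_cluster=30, rho=rho)  # about 71.4
n_eff_single_level = effective_sample_size(n=900, n_cluster=900, rho=0.0)  # 900
```

Under these settings the 900 participants carry roughly the information of 71 independent observations. This indicates how strongly the standard error in equations 4 and 5 grows relative to equations 1 and 2.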

Missing data, in turn, was accounted for by: 1) only incorporating the residuals of the non-missing data, and 2) excluding the participants with missing data on the test in question from N in all computations (including the computation of Neff). This should account for the values and number of missing data points, thus allowing R to complete all calculations regardless of which data points are missing.
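The per-test bookkeeping described here can be sketched as follows. The use of `None` as the missing-value marker and the function name are assumptions made for this illustration. The thesis's own implementation is in R.

```python
def available_scores(residuals, test_index):
    """Residuals of one test, skipping participants for whom it is missing."""
    return [row[test_index] for row in residuals if row[test_index] is not None]


# Toy data: test 1 (second column) was not administered to participants 0 and 2.
residuals = [[1.2, None], [-0.4, 0.9], [0.3, None], [-1.1, -0.9]]
n_per_test = [len(available_scores(residuals, j)) for j in range(2)]
```

Each test then uses its own N (here 4 and 2) in the formulas, including the computation of Neff.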

This handling of missing data was also incorporated in the non-expanded methods. It was considered an oversimplified approach (after all, it simply omits the missing data points from calculations) that would be insufficient to maintain accuracy, and a more complex procedure to account for missing data was planned for the expanded methods. Unfortunately, time constraints did not allow for such a procedure to be developed, leaving both the non-expanded and the expanded methods with the same way of handling missing data.

⁷ Recall that the standard error of $y^*$ equals $se(y^*) = sd(y^*)/\sqrt{N}$, the denominator in equations 1 and 2. Replacing N with Neff produces $se(y^*) = sd(y^*)/\sqrt{N_{eff}}$. When Neff is smaller than N, the standard error increases, thus correcting for the underestimation of standard errors resulting from multilevel data.

Data analysis

For both the current and the expanded one-step and the step-down procedure, for each type of dataset, the FWER and power were estimated. The FWER was defined as the proportion of healthy patient datasets that were incorrectly identified as deviant – meaning that significant deviation on at least one test (at least one false positive result) was found (Huizenga et al., 2016). The power was defined as the proportion of deviant patient datasets that were identified as such – meaning that significant deviation on at least one test was found (Malik, Turner, & Sohail, 2015; Parikh, Mathai, Parikh, Sekhar, & Thomas, 2008).
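Both definitions reduce to the same computation: the proportion of simulated patients flagged on at least one test. Applied to healthy patients this estimates the FWER; applied to deviant patients it estimates power. A minimal sketch with made-up p-values and a hypothetical function name:

```python
def rejection_rate(p_values_per_dataset, alpha=0.05):
    """Proportion of datasets with at least one p-value below alpha."""
    flagged = sum(1 for p_values in p_values_per_dataset if min(p_values) < alpha)
    return flagged / len(p_values_per_dataset)


# Four toy datasets with two tests each; two contain a significant result.
toy_p_values = [[0.20, 0.01], [0.50, 0.60], [0.04, 0.03], [0.90, 0.06]]
rate = rejection_rate(toy_p_values)
```

In the study, each such proportion would be computed over the 1000 simulated datasets per condition.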

Expectations

We expected the expanded resampling methods to produce FWERs: 1) not exceeding the significance threshold of 0.05, and 2) equal to or lower than those produced by the non-expanded methods, as not accounting for multilevel structures tends to increase the amount of false positive results (Steenbergen & Jones, 2002).

There were no expectations about the exact power values, as power varies substantially depending on factors such as test covariance and norm group size, making it hard to predict based solely on differences between testing procedures (Huizenga et al., 2016). It was anticipated that the expanded methods would produce power similar to or higher than that of the non-expanded methods.

Results

Primary analysis

Both the non-expanded and the expanded one-step and step-down resampling methods were applied to: 1) non-multilevel data without missing data, 2) non-multilevel data with missing data, 3) multilevel data without missing data, and 4) multilevel data with missing data, and the corresponding FWER and power were calculated, see Figure 2. Notice that only one set of results is plotted for both the one-step and the step-down resampling procedure. This is because the procedures produced identical results, leaving no need to visualize or interpret both separately.


Figure 2. The FWER and power obtained by applying both the non-expanded and the expanded resampling methods to four types of data, all with M = 9, N = 300. No Multi = non-multilevel data (S = 1). Multi = multilevel data (S = 30). No Miss = no missing data. Miss = 50% of the data is missing. A) The FWERs and confidence intervals. The dark grey line indicates the significance threshold of 0.05. B) The power values. No confidence intervals could be plotted here, as there was no exact expected value (like the 0.05 significance threshold corresponding to the FWER values).

Figure 2A portrays the results of the FWER analysis of both the non-expanded and the expanded methods. Notice that applying either the non-expanded or the expanded methods to non-multilevel data (regardless of missing data) produced almost identical results, all of which approximated the preset significance threshold of 0.05. When applied to multilevel data however, the expanded methods produced FWERs below the significance threshold, and substantially lower than those produced by the non-expanded methods. From this, we may conclude that – with respect to the FWERs – the expanded methods perform as intended, and outperform the non-expanded methods when applied to multilevel data8 with or without missing data.

Figure 2B portrays the results of the power analysis of both the non-expanded and the expanded methods. As in the FWER analysis, the results for both methods when applied to non-multilevel data were highly similar, with a power of approximately 0.79. The non-expanded methods maintained this power across all types of data, whereas the power of the expanded methods declined when applied to multilevel data. This suggested that the expanded methods cannot maintain power while reducing FWER. Closer inspection revealed that for multilevel data without missing data, the power was 0.691, meaning that approximately 70% of deviant participants were correctly identified. While lower than the power of the non-expanded methods, this was still considered to approximate it sufficiently, and to be an acceptable power in its own right. For multilevel data with missing data, however, the power dropped to 0.169, meaning that only 17% of deviant participants were recognized as deviant. This not only diverged substantially from the power of the non-expanded methods, but was also unacceptably low. From these results it could be concluded that the expanded methods performed satisfactorily for multilevel data, but not for multilevel data combined with missing data⁹.

Surprisingly, the non-expanded methods, when applied to multilevel data without missing data, did not produce an increased FWER. This contradicted the expectation that ignoring multilevel structure would lead to an increased FWER, and thus called into question the need to expand current methods for multilevel data in the first place. Additional analyses were planned to investigate this.

Another unexpected finding was the way in which missing data affected results. The addition of missing data barely affected the results of either the non-expanded or the expanded methods when applied to non-multilevel data, suggesting that the expansion for missing data performed well for non-multilevel data. However, with multilevel data, missing data caused strong fluctuations in results, both in the FWER and power. This suggested that missing data affects non-multilevel data and multilevel data differently, at least within the context of these analyses. Moreover, it appeared to affect the two methods differently, with the addition of missing data to multilevel data causing an increased FWER with the non-expanded methods10 but a strongly decreased FWER for the expanded methods. This suggests that the procedure to account for missing data (currently implemented in both methods) cannot adequately handle missing data when presented in multilevel data.

Post Hoc Analyses

Additional FWER and power analyses of the non-expanded and expanded methods were performed for non-multilevel and multilevel data without missing data, with varying numbers of studies and participants11 (see Table 3). These analyses were intended to further investigate the surprising finding of the non-elevated FWER for multilevel data, and to examine whether several results of the primary analysis generalize to datasets with different parameter settings. Firstly, the number of studies was increased while keeping the study-participant ratio constant. Secondly, for those same numbers of studies, the number of participants was increased, thus changing the ratio. The former was intended to investigate whether the primary analysis simply had a suboptimal number of studies, while the latter made it possible to check whether the number of datapoints per study was insufficient for the applied methods.

9 Exact results can be found in Table B, in Appendix A.

10 Here, the increased FWER seems logical, as a smaller number of datapoints may result in an underestimation of the standard error.

11 The number of tests was not varied because the resampling procedures used for normative comparison in this study were already proven to correct for the consequences of multiple testing. As such, there is no reason to believe that varying the number of tests would change the FWER or power results.


Table 3. The FWER and power of both methods applied to datasets with varying parameters.

S    N     M  Data            Method        FWER  POWER
5    150   9  Non-multilevel  Non-expanded  .060  .809
5    150   9  Multilevel      Non-expanded  .104  .878
5    150   9  Multilevel      Expanded      .001  .213
10   300   9  Non-multilevel  Non-expanded  .050  .782
10   300   9  Multilevel      Non-expanded  .091  .843
10   300   9  Multilevel      Expanded      .015  .540
10   900   9  Non-multilevel  Non-expanded  .041  .806
10   900   9  Multilevel      Non-expanded  .083  .861
10   900   9  Multilevel      Expanded      .010  .508
30   300   9  Non-multilevel  Non-expanded  .055  .801
30   300   9  Multilevel      Non-expanded  .068  .831
30   300   9  Multilevel      Expanded      .040  .727
30   900   9  Non-multilevel  Non-expanded  .047  .797
30   900   9  Multilevel      Non-expanded  .049  .806
30   900   9  Multilevel      Expanded      .022  .691
100  3000  9  Non-multilevel  Non-expanded  .050  .805
100  3000  9  Multilevel      Non-expanded  .060  .803
100  3000  9  Multilevel      Expanded      .041  .757

Note: Data = whether the simulated data was multilevel or non-multilevel. Method = whether the method expanded for multilevel data (Expanded) or the non-expanded method (Non-expanded) was applied. S = number of studies, N = number of participants, M = number of tests. None of these datasets contained missing data. The number of datasets, the number of resamples and the significance threshold were the same as in the primary analysis (X = 1000; P = 2000; α = 0.05).

Upon examining these results, several trends became apparent. Firstly, the non-expanded methods, when applied to multilevel data, consistently produced higher FWERs than both the non-expanded methods applied to non-multilevel data and the expanded methods applied to multilevel data, for all parameter settings except those used in the primary analysis. These results confirmed our initial expectation that ignoring the multilevel structure of data can lead to an increased FWER, and suggested that the non-elevated FWER found in the primary analysis was an unusual case. With that, the necessity of expanding current resampling methods for use with multilevel data was reaffirmed.

Secondly, the expanded methods, when applied to multilevel data, consistently produced FWERs: 1) below the significance threshold, and 2) lower than those produced by the non-expanded methods. A similar pattern could be seen in the primary analysis. This suggested that the expanded methods tend to overcorrect, resulting in excessively low FWERs. As a lower FWER comes with lower power (as seen in both the primary and post hoc analyses) this overcorrection may well be responsible for the unsatisfactory power found in multilevel data with missing data.


Finally, the FWER and power analyses of simulations with larger numbers of studies and/or participants produced results closer to the anticipated results than simulations with smaller numbers of studies and/or participants, with FWERs closer to (while still lower than) the significance threshold and higher power. This suggests that the FWER overcorrection hypothesized in the previous paragraph mainly occurs when the expanded methods are applied to smaller norm group datasets. The data did however suggest that it was mainly the increase in studies, not in participants, that caused this effect: differences in results were greater when the number of studies was changed and the number of participants kept constant than vice versa. Based on this, the conclusion that the expanded methods performed satisfactorily for multilevel data without missing data, as presented in the primary analysis, should be limited to situations where a large enough norm dataset (specifically, a large enough number of studies) is available.
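The trend described here can be retraced directly from the Table 3 rows for the expanded methods applied to multilevel data. The sketch below only re-tabulates those reported values; the object names (tab3, by_S, delta_N) are illustrative and do not occur in the thesis code.

```r
# Expanded methods applied to multilevel data, values copied from Table 3
tab3 <- data.frame(
  S     = c(5, 10, 10, 30, 30, 100),
  N     = c(150, 300, 900, 300, 900, 3000),
  FWER  = c(.001, .015, .010, .040, .022, .041),
  POWER = c(.213, .540, .508, .727, .691, .757)
)

# Average FWER and power per number of studies
by_S <- aggregate(cbind(FWER, POWER) ~ S, data = tab3, FUN = mean)
print(by_S)

# Change when N goes from 300 to 900 while the number of studies is fixed
n300 <- tab3[tab3$N == 300, ]
n900 <- tab3[tab3$N == 900, ]
delta_N <- data.frame(S      = n300$S,
                      dFWER  = n900$FWER  - n300$FWER,
                      dPOWER = n900$POWER - n300$POWER)
print(delta_N)
```

In this tabulation both the FWER and the power rise with the number of studies, while tripling the number of participants at a fixed number of studies does not raise either, which is the pattern the paragraph above describes.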

Conclusions & Discussion

Summary

This study aimed to expand existing resampling procedures, which have already proven effective in (multiple) normative comparison, for application to data with a multilevel structure and large portions of systematically missing data. Performance of the expanded methods was assessed through the FWER and power they produced when applied to multiple normative comparisons, and through how these results compared to those produced by the non-expanded methods.

The FWER analysis revealed that the expanded methods, when applied to multilevel data with and without missing data, produced FWER values: 1) lower than those produced by the non-expanded methods, and 2) below the preset significance threshold. These results suggested that the expanded methods outperform the non-expanded methods when applied to multilevel data. The power analysis showed that the expanded methods, when applied to multilevel data without missing data, approximated the power of the non-expanded methods, but that power decreased drastically when missing data was introduced. This suggested that the expanded methods perform satisfactorily when applied to complete multilevel data, while being inferior to the non-expanded methods when missing data is involved.

Discussion

Several findings require further interpretation. One such finding is that applying the non-expanded and the expanded methods to non-multilevel data produced almost identical results. In retrospect, this could have been expected, as the only differences between these two methods are that: 1) the expanded methods replace the total sample size with the effective sample size in all computations, and 2) the expanded methods resample both the within-study and the between-study portion of scores (epsilon and nu respectively), whereas the non-expanded methods only resample the within-study portion (epsilon). Non-multilevel data however: 1) has an effective sample size that equals the total sample size, and 2) contains no between-study portion. As such, non-multilevel data cancels out the differences between the two methods. This has positive implications for the expanded methods, as it suggests that, once they are perfected, they might be used to replace the non-expanded methods altogether.
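To make the first of these two points concrete, the sketch below computes an effective sample size using the common design-effect approximation neff = N / (1 + (n - 1) * ICC) for equally sized studies. This formula is an assumption chosen for illustration and may differ from the estimator used in the thesis code; the function name neff_design is hypothetical. The variances (8 between and 12 within for multilevel data, 0 and 20 for non-multilevel data) are the simulation settings from Appendix B.

```r
# Sketch: effective sample size under a design-effect approximation.
# Assumption: equally sized studies and a single intraclass correlation (ICC).
neff_design <- function(N, S, var_between, var_within) {
  n_per_study <- N / S                                   # participants per study
  icc <- var_between / (var_between + var_within)        # intraclass correlation
  N / (1 + (n_per_study - 1) * icc)                      # N divided by the design effect
}

neff_multi    <- neff_design(N = 900, S = 30, var_between = 8, var_within = 12)
neff_nonmulti <- neff_design(N = 900, S = 30, var_between = 0, var_within = 20)
print(c(multilevel = neff_multi, non_multilevel = neff_nonmulti))
```

With no between-study variance the ICC is zero and the effective sample size equals the total sample size of 900, so the substitution changes nothing for non-multilevel data. With the multilevel settings the same approximation gives roughly 71, which illustrates how strongly the substitution can widen the standard error.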

Another result requiring discussion is that the expanded methods, when applied to multilevel data, produced a low power, especially when missing data was present. This may be explained by the corresponding FWERs, which were also relatively low compared to the preset significance threshold, as a lower FWER is generally related to a lower power (Zhu, Zeng & Wang, 2010). The problem may be that the expanded methods overcorrect for multilevel data, reducing the FWER too extensively at the cost of power. This still leaves the question of what causes such overcorrection. The expanded methods did not overcorrect with non-multilevel data, which implies that the cause of overcorrection lies within the two additions that differentiate the expanded from the non-expanded methods, and which are canceled out in non-multilevel data (see previous discussion point): the effective sample size and the resampling of the between-study portion of the data. Where exactly the cause lies and how it may be remedied requires further investigation.
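The link between a stricter error rate and lower power can be illustrated with a deliberately simplified case: a single one-sided z-test of a score lying 2 standard deviations below the norm mean (the size of the deviation added in the power simulations in Appendix B), with known variance. This is an idealization assumed here for illustration, not the multivariate resampling procedure evaluated in this study. The function name power_z is hypothetical, and the thresholds 0.022 and 0.001 are simply the FWERs observed for the expanded methods, used as example values.

```r
# Sketch: power of a single one-sided z-test as a function of the threshold alpha.
# Assumption: known variance, true deviation of `delta` standard deviations.
power_z <- function(alpha, delta = 2) {
  crit <- qnorm(1 - alpha)     # critical value in standard deviation units
  pnorm(delta - crit)          # probability that the deviant score exceeds it
}

alpha <- c(0.05, 0.022, 0.001)
power <- power_z(alpha)
print(data.frame(alpha = alpha, power = round(power, 3)))
```

In this simplified setting power falls steadily as the threshold becomes stricter, which is the qualitative pattern seen for the expanded methods; the numbers themselves should not be read as predictions for the resampling procedures.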

Another interesting finding is that missing data seemed to affect the results differently for different types of data, as well as for different methods. Note that missing data did not seem to influence the results for non-multilevel data, suggesting that the currently applied procedures provide sufficient correction for missing data in non-multilevel data. Within multilevel data however, missing data had varying effects, causing an increased FWER with the non-expanded methods and a strongly decreased (strongly overcorrected) FWER with the expanded methods. Understanding how and why exactly missing data affects the analyses of multilevel data can not only provide a way to correct for it, but may also give insight into the mechanisms behind the previously mentioned FWER overcorrection. For now however, we may conclude that despite possible overcorrection, the expanded methods perform satisfactorily when applied to non-multilevel data, and to multilevel data without missing data.

Finally, results of the post hoc analyses suggested that the expanded methods perform better with multilevel data, in terms of both FWER and power, when applied to larger datasets, specifically datasets with more studies. This seems logical, as it is not uncommon for larger datasets to produce more accurate results. Similar results were found for the non-expanded methods, which showed less bias (a smaller, though still present, increase in FWER) when applied to multilevel data as the number of studies in the dataset grew. This suggests that the accuracy of these resampling methods when applied to multilevel data is at least partially dependent on the number of norm data studies. As such, applying these methods to norm data with a small number of studies should be expected to produce suboptimal results. This should be taken into account in future research and applications.

Final Conclusions

In conclusion, the expansion of current resampling methods for multiple normative comparisons, for use with a multilevel structured norm group dataset with missing data, was partially successful: the expanded methods were sufficiently accurate when applied to non-multilevel data with and without missing data, as well as to multilevel data without missing data. When applied to multilevel data with missing data however, accuracy disappointed in terms of power. Moreover, the expanded methods performed better when applied to larger norm group datasets. As such, the expanded methods were concluded to have limited advantages. We recommend their use for situations where the multilevel norm data is sufficiently large (approximately 30 studies or more) and contains no missing data, or when power has low priority.

While the aforementioned methods still require further work and extensive testing before they can be applied to real-life neuropsychological assessment, this study has at least succeeded in expanding the current methods for multilevel data. It has also unveiled some interesting aspects of these types of data, and formed a base for interpreting and managing related phenomena in future research.

Acknowledgements

Special appreciation goes out to H. Huizenga and J. Agelink van Rentergem for their contributions to this study, their active involvement in writing this report, and the construction of the R code.


Appendix A: Results of FWER and Power Analysis

Table A. The results of the FWER and power analysis of the non-expanded methods.

Data                  Results
Multilevel  Missing   FWER  POWER
No          No        .047  .797
No          Yes       .051  .770
Yes         No        .049  .802
Yes         Yes       .080  .773

Table B. The results of the FWER and power analysis of the expanded methods.

Data                  Results
Multilevel  Missing   FWER  POWER
No          No        .049  .802
No          Yes       .052  .773
Yes         No        .022  .691
Yes         Yes       .001  .169

Appendix B: The R Code

#COMPLETE CODE MULTILEVEL

#In this code, normative comparison EXPANDED FOR MULTILEVEL is performed using either the one-step or step-down procedure.
#The data is simulated. The number of tests per study, number of studies and total number of participants can be altered.
#The 3 demographic variables used to compute test scores are: age (continuous), gender (m/f), education level (1/2/3).

#Standard variables:
##ntest = number of tests
##npar = number of participants (total, over all studies)
##nstud = number of studies

#Different types of data can be chosen through the variables:
##1) multileveldata: does the data have a multilevel structure or not?
##2) multilevel: is the analysis multilevel or not? (analysis includes everything except data simulation)
##3) ndatmissing: does every study of the norm data contain every test or not?
##4) pdatmissing: has the patient made all the tests in the norm data or not?
##5) theory: in computing the test score residuals, use the ones used to simulate the data or use residuals estimated from data? (multilevel analysis uses theoretical residuals regardless)

#Other settings include:
##number_datasets: what is the number of datasets being simulated on which the results are based?
##nsamp: what is the number of resamples being done for each dataset?
##alpha: nominal p-value threshold (at what p-value does a test significantly deviate?)
##FWER: do you want the code to calculate the Familywise Error Rate or the power? TRUE = FWER. FALSE = Power

#DON'T FORGET TO NAME THE OUTPUT FILES DIFFERENTLY OR THEY'LL OVERWRITE EACH OTHER! See: name_output_file

rm(list=ls(all=TRUE))

setwd("C:/Users/Eigenaar/Documents/UvA/UvA RM jaar 2/Thesis")
getwd()

##PRE-CODE: load packages
library(matrixStats)
library(MASS)
library('micEcon')
library('miscTools')
library('dclone')
library('lme4')
library("nlme")
library("multilevel")
library('foreign')

#######################################################################################################

##STARTING VALUES & DATA

#################################################################### ###################################

#Settings:

ntest <- 9 #number of tests
npar <- 900 #number of participants (total)
nstud <- 30 #number of studies

#Type of data:
multileveldata <- TRUE #TRUE = multilevel data. FALSE = non-multilevel data.
multilevel <- TRUE #TRUE = multilevel analysis. FALSE = non-multilevel analysis.
ndatmissing <- TRUE #TRUE = normdata set contains 50% missing data (all overlap); FALSE = no missing normdata
pdatmissing <- FALSE #TRUE = patient didn't do all tests in the data
theory <- TRUE #TRUE = use true residuals (as used to simulate data); FALSE = use residuals estimated from data

#Wanted result:
number_datasets <- 1000 #number of datasets being simulated
nsamp <- 2000 #number of resamples
alpha <- 0.05 #nominal p-value threshold
FWER <- FALSE #TRUE = result produced is FWER. FALSE = result produced is power

#Name files:

file_name <- "Full Code (expanded for multilevel 35).R" #R-file that's been run

description <- "Resampling code expanded for multilevel (faster). Neff added, neff estimated in resampling; missing data fixed; By Joost." #Description of R-file

name_output_file <- "code 35 - multi multi missing - POWER (nstud=30, npar=900, ntest=9)" #name output file

#Used for spread checks:

spread_residual <- spread_epsilon <- spread_nu <- spread_se <- NULL #spread normdata

spread_resample <- spread_resamplese <- spread_resamplemean <- NULL #spread

spread_normstat <- spread_check_tboot <- spread_check_tmaxboot <- NULL

plopcheck <- NULL

#START DATASET LOOP
result_onestep <- result_stepdown <- result_uncorrected <- NULL #set objects for results

for(k in 1:number_datasets){ #note: make sure this loop uses a letter that isn't used WITHIN the loop

set.seed(k) #set the seed for every dataset (so different simulations are identical, thus comparable)

if( (k %% 2) == 0){ #if k is divisible by 2, then...
  print( paste( "number of datasets = ", k, sep = "" ) )} #...print

#################################################################### ###################################

#START DATA SIMULATION

#################################################################### ###################################

#START NORM DATA SIMULATION

#NORM DATA (every person has been in only 1 study, done every test)

ID <- rep(1:npar,each=(ntest))
study <- rep(rep(1:nstud,each=ntest), length(ID)/(nstud*ntest)) #Note: makes sure all studies occur equally often.
#If you want each study to occur a random number of times, try: study <- rep(sample(1:nstud,npar,T),each=ntest)
test <- rep(1:ntest,npar)
age <- rep(round(rnorm(npar,50,15), 0),each=ntest)
edu <- rep(sample(1:3,npar,T),each=ntest)
gender <- rep(sample(c(-1,1),npar,TRUE),each=ntest) #NOTE: the definition of gender is absent from this listing; this line is an assumption, by analogy with gender_P in the patient data

data1 <- as.data.frame(matrix(c(ID,study,test,age,gender,edu),length(ID)))
colnames(data1) <- c("ID","study","test","age","gender","edu")

#No indicator variables (ZA, ZB, etc.)
data3 <- data1

#Fixed (different coefficients per test):
int <- rep(20,ntest)
agepar <- rep(-.125,ntest)
edupar <- rep(1.25,ntest)
genderpar <- rep(0.5,ntest)

#Random (nu's and epsilons):

if(multileveldata==TRUE){ #if you want multilevel data
  #Between:
  cov.between <- matrix(0,ntest,ntest)
  diag(cov.between) <- 8
  v0s <- mvrnorm(n = nstud, rep(0, ntest),cov.between)
  #Within error:
  cov.within <- matrix(8,ntest,ntest)
  diag(cov.within) <- 12
  epsilon <- mvrnorm(n = npar, rep(0, ntest),cov.within)
}else{ #if you want non-multilevel data
  #Between:
  cov.between <- matrix(0,ntest,ntest)
  diag(cov.between) <- 0
  v0s <- mvrnorm(n = nstud, rep(0, ntest),cov.between)
  #Within error:
  cov.within <- matrix(8,ntest,ntest)
  diag(cov.within) <- 20
  epsilon <- mvrnorm(n = npar, rep(0, ntest),cov.within)
}

#if nstud=1, make sure v0s is also a matrix (so formula for calculating scores below still works):

if(nstud==1){

v0s <- t(as.matrix(v0s))}

#Data simulation (flexible for different number of tests/studies/participants)

data3$score <- NA

for(j in 1:nrow(data3)){

data3$score[j] <- (int[data3$test[j]] +

(agepar[data3$test[j]] * data3$age[j]) + (edupar[data3$test[j]] * data3$edu[j]) + (genderpar[data3$test[j]] * data3$gender[j]) +

v0s[data3$study[j], data3$test[j]]) + epsilon[data3$ID[j],data3$test[j]]}

#END NORM DATA SIMULATION

####################################################################
#START NORM DATA REMOVAL

if(ndatmissing == TRUE){ #if ndatmissing = TRUE, remove 50% of the data

#START AUTOMATIC NORM DATA REMOVAL
##(removing 50% of data, complete and incomplete overlap of tests)

overlap <- TRUE #Does every pair of tests co-occur in a study at least once?
#TRUE = all tests co-occur at least once
#FALSE = not all tests co-occur at least once

check <- FALSE #set condition for while-loop
while(check == FALSE){ #START WHILE LOOP

  #FILL IN 0's / 1's INTO MATRIX RANDOMLY
  testdat <- c(rep(1,(.5*ntest*nstud)),rep(0,.5*ntest*nstud)) #fill in random combination of 1's and 0's (50-50%)
  comb <- combn(1:ntest, 2, FUN = NULL, simplify = F) #all possible combinations of tests
  if(length(testdat)!=(ntest*nstud)){ #if there is an uneven number of tests x studies (so you can't have a 50-50% missing data proportion)
    testdat <- c(testdat,sample(c(0,1),1)) #add a random 1 or 0 to the one leftover value
  }
  tests <- matrix(sample(testdat,ntest*nstud),ntest,nstud) #rows: tests (1=included, 0=not included), columns: studies

  #CALCULATE CO-OCCURRENCE:
  a <- matrix(0,length(comb),nstud)
  b <- c <- d <- testsum <- NULL
  for(v in 1:length(comb)){
    for(j in 1:nstud){
      a[v,j] <- tests[comb[[v]][1],j] + tests[comb[[v]][2],j] #matrix testing co-occurrence of tests
      #Rows: test pairs (e.g. [1,1] is pair test 1 + test 2)
      #Columns: study
      #2 = two tests appear together in a study
      #0/1 = two tests do not appear together in a study
      b <- apply(a,1,max) == 2 #see if test pairs ever co-occur in the same study (TRUE/FALSE)
      c <- sum(b)/length(b) #find proportion of test pairs that co-occur in at least one study (proportion of TRUEs)
    }}

  #CHECK THAT EVERY TEST OCCURS AT LEAST ONCE (NO ROW CONTAINS ONLY ZEROS)
  testsum <- sum(apply(tests,1,sum) > 0) == ntest
  if(testsum == TRUE){d <- 1}else{d <- 0}

  #CHECK IF TESTS CO-OCCUR:
  if(overlap == TRUE){
    if(c == 1 & d == 1){check <- TRUE}}else{ #if all test pairs co-occur at least once + every test occurs at least once, end loop
    if(c < 1 & d == 1){check <- TRUE}} #if not all test pairs co-occur at least once + every test occurs at least once, end loop
  #Note: change the "1" in c < 1 if you want more specifics about the proportion of tests that don't co-occur

} #END WHILE LOOP

#remove these test-study combinations from the data (make them 0)
for(i in 1:ntest){
  for(j in 1:nstud){
    for(n in 1:nrow(data3)){
      if(data3$test[n] == i){
        if(data3$study[n] == j){ #full column name used instead of the partial match data3$stud
          data3$score[n] <- data3$score[n]*tests[i,j]
        }}else{data3$score[n] <- data3$score[n]}
    }}}

for(i in 1:nrow(data3)){
  if(data3$score[i] == 0){ #change 0's into NA's
    data3$score[i] <- data3$score[i] * NA}else{data3$score[i] <- data3$score[i]}}

}else{data3 <- data3} #if ndatmissing = FALSE, keep normdata the same (without missing)

#END NORM DATA REMOVAL
#########################################################################

##START PATIENT DATA SIMULATION

#Set variables:

ID_P <- rep((1),each=(ntest))
test_P <- 1:ntest
age_P <- rep(round(rnorm(1,50,25), 0),each=ntest)
edu_P <- rep(sample(1:3,1,T),each=ntest)
gender_P <- rep(sample(c(-1,1),1,TRUE),each=ntest)
data_P <- as.data.frame(matrix(c(ID_P,test_P,age_P,gender_P,edu_P),length(ID_P)))
colnames(data_P) <- c("ID","test","age","gender","edu")

##FIXED (all fixed variables are the same as norm data)
##RANDOM (not the same as norm data)
#Between error (per study, per test)
v0s_P <- mvrnorm(n = 1, rep(0, ntest),cov.between)
#Within error (per person, per test):
epsilon_P <- mvrnorm(n = 1, rep(0, ntest),cov.within)

#Simulate scores
data_P$score <- NA
for(j in 1:nrow(data_P)){
  for(i in 1:ntest){
    data_P$score[j] <- (int[data_P$test[j]] +
      (agepar[data_P$test[j]] * data_P$age[j]) +
      (edupar[data_P$test[j]] * data_P$edu[j]) +
      (genderpar[data_P$test[j]] * data_P$gender[j]) +
      v0s_P[data_P$test[j]]) + epsilon_P[data_P$test[j]]
  }}

#PATIENT DATA REMOVAL
x <- NULL
if(pdatmissing == TRUE){ #if pdatmissing is TRUE, remove datapoints
  x <- sample(1:length(data_P$ID),1,replace=TRUE)
  data_P$score[x] <- NA #change either
}else{data_P <- data_P}

#shouldn't this be the same for each of the 1000 simulations??

#END PATIENT DATA SIMULATION

#################################################################### ###################################

#END DATA SIMULATION

#################################################################### ###################################

#ONLY SELECT TESTS FROM NORMDATA THAT PATIENT DID

#rename normdata and patient data
normdat <- data3 #normdat = norm data
patdat <- data_P #patdat = patient data

patdat <- patdat[complete.cases(patdat),] #make sure patient data doesn't contain testscores on missing data (NA's)

if(sum(length(patdat$test) != length(unique(normdat$test))) == 1){ #if patient hasn't done all tests in norm data
  #Note: no need to check if it's the same tests, as the norm data should contain AT LEAST all tests that the patient made
  #So norm data can only contain more tests (in which case some unneeded ones will be removed)
  pattests <- patdat$test #what tests has the patient done?
  normtests <- unique(normdat$test) #what tests are in the norm data?
  difftests <- setdiff(normtests,pattests) #which tests are in the norm data that aren't in the patient data?
  for(i in 1:length(difftests)){
    p <- (normdat$test == difftests[i]) #what test didn't the patient do?
    normdat <- normdat[!c(p),] #remove these from norm data
  }}else{normdat <- normdat} #if patient did all tests, keep all tests

#################################################################### ###################################

#START CALCULATE RESIDUALS

if(multilevel == FALSE){ #if the data is NOT multilevel

  if(theory == FALSE){ #if you want to estimate residuals from data

    #NON-MULTILEVEL PRACTICAL (ESTIMATE FROM DATA)
    #NORM DATA RESIDUALS:
    plopmodel <- NULL
    for( i in unique(normdat$test)){
      plopmodel[[i]] <- lm(score ~ 1 + age + gender + edu, data=normdat[normdat$test == i & !is.na(normdat$score),], na.action=na.exclude)
      #select only test = i, i.e. fit the model separately for every test.
      #This is important because only then do you get a within- and between- part for every test.
      #It does not give exactly the same result as including all tests simultaneously, but the difference is negligible
      #is.na makes sure only the non-missing data is selected for the model
      normdat$pred[normdat$test == i & !is.na(normdat$score)] <- predict.lm(plopmodel[[i]], newdata = normdat[normdat$test == i & !is.na(normdat$score),]) #make column with predicted values of score (newdata now selects the same rows as the left-hand side)
      normdat$residual[normdat$test == i & !is.na(normdat$score)] <- residuals(plopmodel[[i]])
      #replace non-missing data of test i with the residual of that test (residual here is epsilon AND nu)
    }

    ##PATIENT DATA RESIDUALS:
    pred <- NULL
    for(i in unique(normdat$test)){
      pred[i] <- predict.lm(plopmodel[[i]], newdata = patdat[patdat$test == i,]) #fitted non-multilevel model used to predict patient scores
      patdat$residual[patdat$test == i] <- patdat$score[patdat$test == i] - pred[i]} #find residuals and add them in column "residual"
    #observed score - predicted score = residual (epsilon) for non-multilevel

  }else{ #if you want to use the non-multilevel true residuals (as used in data simulation)

    #NON-MULTILEVEL THEORETICAL (TRUE VALUES, USED TO SIMULATE DATA)
    ##NORM DATA RESIDUALS:
    for(i in 1:nrow(normdat)){
      if(is.na(normdat$score[i]) == FALSE){ #if data isn't missing (NOTE: this condition is absent from the listing; restored by analogy with the multilevel branch)
        normdat$residual[i] <- epsilon[normdat$ID[i],normdat$test[i]] +
          v0s[normdat$study[i],normdat$test[i]] #calculate residuals
        #residual = epsilon + nu
      }else{normdat$residual[i] <- NA}}

    ##PATIENT DATA RESIDUALS:
    for(i in 1:nrow(patdat)){
      patdat$residual[i] <- epsilon_P[i] + v0s_P[i]} #calculate patient residuals (between + within)

  }

  #end non-multilevel residual calculation

}else{ #if the data IS MULTILEVEL, then...
  for(i in 1:nrow(normdat)){
    if(is.na(normdat$score[i]) == FALSE){ #if data isn't missing
      normdat$epsilon[i] <- epsilon[normdat$ID[i],normdat$test[i]] #calculate epsilons (within)
      normdat$nu[i] <- v0s[normdat$study[i],normdat$test[i]] #calculate nu's (between)
      #Epsilon: row = ID, column = test number
      #Nu: row = study, column = test.
    }else{
      normdat$epsilon[i] <- NA #if score == NA, make epsilon NA
      normdat$nu[i] <- NA #if score == NA, make nu NA
    }}

  ##PATIENT DATA RESIDUALS:
  for(i in 1:nrow(patdat)){
    patdat$residual[i] <- epsilon_P[i] + v0s_P[i]} #calculate patient residuals (between + within)

  normdat$residual <- normdat$epsilon + normdat$nu #make residuals = epsilon + nu

} #End residual calculation

#NOTE: multilevel only has theoretical residuals!!!
#END CALCULATE RESIDUALS

#check:

spread_residual[[k]] <- normdat$residual #list with spread of residuals

spread_epsilon[[k]] <- normdat$epsilon #list with spread of epsilons

spread_nu[[k]] <- normdat$nu #list with spread of nus

#######################################################################################################
#ADD DEVIATION

if(FWER == FALSE){ #add deviation if you want to calculate power
  stdev <- 2* sqrt(diag(cov.between + cov.within)) #stdev = 2*standard deviation of data
  for(i in 1:length(patdat$ID)){ #add 2x st dev to every residual
    patdat$residual[i] <- patdat$residual[i] + ((-1) * stdev[i])} #make deviant scores lower
}else{patdat <- patdat} #don't add deviation if you want the FWER

#######################################################################################################
#CENTERING

normdat_uncentered <- normdat #Save uncentered normdata (just in case)

##CENTER NORM SCORES:

x <- y <- normmean <- normmean_eps <- normmean_nu <- NULL

if(multilevel == FALSE){ #if the data is non-multilevel

  #Center residuals:
  for(i in 1:length(patdat$test)){
    y <- NULL
    x <- patdat$test[i] #set x to the testnumber of interest (passes by all tests that patient did)
    y <- normdat$residual[normdat$test == x] #select all residuals of test in question
    normmean[i] <- mean(y, na.rm=TRUE) #calculate residual norm mean of the test in question
    normdat$residual[normdat$test == x] <- y - rep(normmean[i],npar)} #subtract mean from residual (= centering)
  #Check: mean(normdat$residual) == 0? That means centering worked!

}else{ #if the data is multilevel

  #Center residuals:
  for(i in 1:length(patdat$test)){ #NOTE: loop header and the line setting x are absent from the listing; restored by analogy with the non-multilevel branch
    y <- NULL
    x <- patdat$test[i] #set x to the testnumber of interest
    y <- normdat$residual[normdat$test == x] #select all residuals of test in question
    normmean[i] <- mean(y, na.rm=TRUE) #calculate residual norm mean of the test in question
    normdat$residual[normdat$test == x] <- y - rep(normmean[i],npar)} #subtract mean from residual (= centering)
  #Check: mean(normdat$residual) == 0? That means centering worked!

  #Center epsilons:
  for(i in 1:length(patdat$test)){ #NOTE: loop header and the line setting x are absent from the listing; restored as above
    y <- NULL
    x <- patdat$test[i] #set x to the testnumber of interest
    y <- normdat$epsilon[normdat$test == x] #select all epsilons of test in question
    normmean_eps[i] <- mean(y, na.rm=TRUE) #calculate epsilon norm mean of the test in question
    normdat$epsilon[normdat$test == x] <- y - rep(normmean_eps[i],npar) #subtract mean from epsilons (= centering)
    #Check: mean(normdat$epsilon) == 0? That means centering worked!

    #Center nu's:
    y <- NULL
    y <- normdat$nu[normdat$test == x] #select all nu's of test in question
    normmean_nu[i] <- mean(y, na.rm=TRUE) #calculate nu norm mean of the test in question
    normdat$nu[normdat$test == x] <- y - rep(normmean_nu[i],npar)} #subtract mean from nu's (= centering)
  #Check: mean(normdat$nu) == 0? That means centering worked!

}

#CENTER PATIENT SCORES:

patdat$residual <- patdat$residual - normmean

#################################################################### ###################################

#FIND STANDARD ERROR

#NON-MULTILEVEL######################################################
if(multilevel == FALSE){ #if data is non-multilevel

  ##Calculate sd and se
  x <- y <- normsd <- normse <- npartest <- NULL
  for(i in 1:length(patdat$test)){
    x <- patdat$test[i] #set x to the testnumber of interest (NOTE: absent from the listing; restored by analogy with the centering loop)
    y <- normdat$residual[normdat$test == x] #select all scores of test in question
    npartest[i] <- sum(!is.na(y)) #find number of participants for test in question
    normsd[i] <- sd(y, na.rm=TRUE) #find stdev for test in question
    normse[i] <- normsd[i]/sqrt(npartest[i])} #calculate sterror for test in question

#MULTILEVEL######################################################
}else{ #if data is multilevel

  #Calculate neff:
  epsilon_est <- matrix(normdat$epsilon, ncol = ntest, byrow = T )
  v0s_est <- matrix(normdat$nu, ncol = ntest, byrow = T )