
University of Groningen

Flexible regression-based norming of psychological tests

Voncken, Lieke

DOI:

10.33612/diss.124765653

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2020

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Voncken, L. (2020). Flexible regression-based norming of psychological tests. University of Groningen. https://doi.org/10.33612/diss.124765653

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.


Flexible regression-based norming

of psychological tests


© Flexible regression-based norming of psychological tests. Lieke Voncken, University of Groningen

ISBN: 978-94-034-2465-1 (print version)
ISBN: 978-94-034-2466-8 (electronic version)
Cover design: Desirée van Dooren

Printed by: Ipskamp Printing, Enschede

The research presented in this thesis was funded by the Dutch Research Council (NWO) within research programme ‘Graduate Programme 2013’ with project number 022.005.003.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, without permission of the author.


Flexible regression-based norming

of psychological tests

PhD thesis

to obtain the degree of doctor at the

Rijksuniversiteit Groningen

on the authority of the

rector magnificus prof. dr. C. Wijmenga

and in accordance with the decision of the College of Deans.

The public defence will take place on

Thursday 14 May 2020 at 16.15 hours

by

Lieke Voncken

born on 10 February 1992


Supervisors

Prof. dr. M.E. Timmerman
Prof. dr. C.J. Albers

Assessment committee

Prof. dr. L.A. van der Ark
Prof. dr. T.A.B. Snijders
Prof. dr. P.H.C. Eilers


Table of contents

1 Introduction 7

2 Model selection in continuous test norming with GAMLSS 21
  2.1 Introduction 22
  2.2 Method 30
  2.3 Results 37
  2.4 Discussion 54

3 Bias-variance trade-off in continuous test norming 59
  3.1 Introduction 60
  3.2 Simulation study 66
  3.3 Results 72
  3.4 Discussion 81

4 Improving confidence intervals for normed test scores: Include uncertainty due to sampling variability 85
  4.1 Introduction 86
  4.2 Method 93
  4.3 Results 98
  4.4 Discussion 112

5 Bayesian Gaussian distributional regression models for more efficient norm estimation 115
  5.1 Introduction 116
  5.2 Bayesian Gaussian distributional regression 117
  5.3 Simulation study 119
  5.4 Applications of Bayesian Gaussian norm estimation to the IDS-2 norm data 127
  5.5 Discussion 394

6 Discussion 135

Appendix A - Additional material for Chapter 4 155

Appendix B - Additional material for Chapter 5 156

Samenvatting (Dutch summary) 158

Curriculum Vitae 166

List of publications 168


Chapter 1

Introduction

Psychological tests are widely used to assess individuals in clinical, educational, and personnel contexts. Intelligence tests, developmental tests, personality tests, and neuropsychological tests are used for diagnosis, monitoring, assessment, and selection. Because the results of these tests are used to make important decisions about individuals, it is essential that the tests are of high quality (i.e., have high validity and reliability), and that the test scores can be meaningfully interpreted. Meaningful interpretation of a raw test score (e.g., the number of items correct on a test) is typically done via a reference point.

Flanagan (1939) distinguished four different reference points to interpret raw test scores (as cited in Mellenbergh, 2011, p. 346). First, a testee’s test score can be compared to his/her score on other (sub)tests. This is referred to as test-referenced test score interpretation. For instance, given a constant test length, a testee’s short-term memory test score can be compared to his/her long-term memory test score.

Second, the test score can be compared to the testee’s score on the same test on different occasions. This is referred to as occasion-referenced test score interpretation. For instance, a testee’s test score can be compared before treatment and after treatment.

Third, a testee’s test score can be compared to an external criterion or standard. This is referred to as criterion-referenced or standard-referenced test score interpretation. This type of interpretation is mainly used in achievement testing (Mellenbergh, 2011, p. 369). For instance, teachers might believe that students have mastered the test materials when they have obtained a test score of at least 80% of the maximum test score.

Fourth, a testee’s test score can be compared to the scores of other testees for the same test. This is referred to as norm-referenced test score interpretation. This type of interpretation makes sense for many psychological tests because one often wishes to compare the testee’s score to the scores of a reference population. For instance, intelligence test scores are typically interpreted relative to scores of the community population. The focus of this thesis is on norm-referenced test scores.


Norm-referenced test scores

Norm-referenced test scores – referred to as normed scores – are typically created by transforming the raw test scores to another scale. There are three types of normed scores: percentile-based, distribution preserving, and normalized normed scores (Mellenbergh, 2011, pp. 351–360). Percentile-based normed scores (e.g., percentiles, deciles, and stanines) are directly derived from the cumulative distribution of the raw test scores in the reference population. The population percentile is the percentage of people in the reference population with the same score or below. For instance, percentile 72 indicates that 72% of the people in the reference population obtained the same score or below.
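As a minimal sketch (with made-up reference data; the `percentile_of` helper is our own, not an established routine), an empirical percentile can be computed directly from a reference sample:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical reference sample: raw scores on a 60-item test
reference = rng.integers(0, 61, size=1000)

def percentile_of(raw, sample):
    """Percentage of the reference sample with the same score or below."""
    return 100.0 * np.mean(sample <= raw)

p = percentile_of(45, reference)  # percentile of a raw score of 45
```

Deciles and stanines are then simply coarser groupings of this same quantity.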

Distribution preserving normed scores have the same distribution as the raw test scores, and are obtained by linearly transforming the raw test score distribution to have a specific mean and standard deviation. Examples of distribution preserving normed scores are IQ scores (M = 100, SD = 15), Z scores (M = 0, SD = 1), Wechsler scores (M = 10, SD = 3), and T scores (M = 50, SD = 10). Normalized normed scores have a normal distribution with a specific mean and standard deviation, and are typically normalized versions of the distribution preserving normed scores mentioned above, such as normalized Z scores. Percentiles can be transformed to normalized Z scores via the inverse cumulative distribution function (CDF) of the normal distribution.
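The percentile-to-normalized-score mapping can be sketched with Python's standard library (the `normalize` helper and its argument names are ours):

```python
from statistics import NormalDist

def normalize(percentile, mean, sd):
    """Map a percentile to a normalized score with the given mean and SD."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return mean + sd * z

iq = normalize(50.0, 100, 15)   # normalized IQ scale: the median maps to 100
t = normalize(84.13, 50, 10)    # normalized T-score scale: ~1 SD above average
```

The same percentile can thus be expressed on any normalized scale by changing only the target mean and SD.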

Transforming the raw test scores to normed scores allows for easy interpretation of the test score. For instance, a percentile of 50, which equals a normalized IQ score of 100, implies that 50% of the reference group obtained the same score or below. A normalized IQ score of 115 means that the testee scored one SD above average compared to the reference population. It is generally easy to go from one normed score type to another, but it can be difficult to determine the transformation from raw test score to normed test score. This transformation depends on the raw test score distribution in the reference population.

Because the test score distribution from the reference population is unknown and it is practically impossible to collect test scores from everyone within the population, a reference sample is used to make inferences about the reference population. That is, we estimate the raw test score distribution on the sample, and generalize this to the population. As we will see later, there are different approaches to estimate this (conditional) raw score distribution.

Representative sample

The population for whom the test is designed is referred to as the target population. For instance, the target population of the Dutch Wechsler Intelligence Scale for Children-V (WISC-V-NL; Wechsler, 2018) consists of all Dutch-speaking children between 6 and 17 years old. The test scores of the WISC-V-NL are interpreted relative to the test scores of the community population of the same age. Hence, the reference population consists of Dutch-speaking people of a specific age within the range 6–17 years. Technically, the number of reference populations is infinite: There is one reference population for every exact age value in the range 6–17 years. The norm population is the combination of all reference populations. Note that the norm population does not have to be equivalent to the target population. For instance, the target population of a clinical test might consist of all people who are able to complete the test, while the norm population might consist of healthy people only. This would allow for comparison of the (possibly unhealthy) testee’s test score to the test scores of healthy people.

In the test construction phase, one wants to collect a representative sample of the norm population, which means that the sample reflects the characteristics in the population. It is important that this sample is representative with respect to characteristics that are related to the test scores. For instance, if highly educated people are overrepresented in the normative sample of an intelligence test, the test scores in the sample are likely to be higher than the test scores in the reference population, and this will result in normed scores that are too high. As a result, the mean test score in the population would be interpreted as a below-average test score because the mean test score in the sample was higher.

In practice, it is very difficult to collect a representative sample. It is unclear beforehand which characteristics are related to the test scores. Theoretically, the best way to deal with this is to randomly sample from the population, through which all members of the population have the same probability of being included in the sample. This is referred to as simple random sampling. The larger the random sample, the larger the probability that the sample is representative with regard to all relevant characteristics in the population. Unfortunately, this is usually impossible in practice.


Random sampling from the norm population requires a list of the full norm population. However, such lists are typically not (publicly) available. For clinical populations, these lists usually do not even exist. For general populations, many countries have a population register, but privacy regulations like the General Data Protection Regulation (European Parliament and Council of the European Union, 2016) generally prohibit sharing data without explicit consent of the data subject. Even if lists of the norm population were available, there would still be the problem of nonresponse.

An alternative sampling method is cluster sampling, in which random clusters (e.g., schools) are chosen. This only requires a list of all clusters in the norm population, rather than of all individuals. Unlike lists of individuals, lists of clusters may be publicly available, as is the case for, for example, schools or hospitals. It is possible to randomly select only the clusters (i.e., one-stage clustering), or to randomly select both the clusters and the individuals within the clusters (i.e., two-stage clustering). As in simple random sampling, there is still the problem of nonresponse. A disadvantage of this method is that the dependency between individuals within clusters makes it less efficient than simple random sampling.

A more efficient sampling method than simple random sampling and cluster sampling is stratified sampling, in which samples are randomly drawn from subpopulations (strata) that are related to the test score. For instance, if it is known that the test score is related to education level, random samples are drawn from the subpopulations corresponding to each education level, with sample sizes proportional to the sizes of the subpopulations. If the proportions in the sample do not match those in the population, the observations can be weighted accordingly. In this way, there is no overrepresentation of one of the subpopulations in the normative sample. Unfortunately, as discussed before, random sampling is infeasible in practice, and it is unknown beforehand which characteristics are related to the test scores.
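The weighting step can be sketched as simple poststratification (the education categories, proportions, and scores below are invented for illustration):

```python
import numpy as np

# Assumed population proportions of an education-level stratifier
pop_prop = {"low": 0.30, "mid": 0.45, "high": 0.25}

# A normative sample in which "high" is overrepresented (50% vs. 25%)
level = np.array(["low"] * 20 + ["mid"] * 30 + ["high"] * 50)
score = np.where(level == "low", 10, np.where(level == "mid", 20, 30))

# Weight each observation by population share / sample share of its stratum
weight = np.array([pop_prop[l] / np.mean(level == l) for l in level])

def weighted_percentile(raw, scores, w):
    """Empirical percentile with poststratification weights."""
    return 100.0 * np.sum(w * (scores <= raw)) / np.sum(w)

unweighted = weighted_percentile(20, score, np.ones_like(weight))
weighted = weighted_percentile(20, score, weight)  # corrects the overrepresentation
```

Because high scorers are overrepresented in this toy sample, the unweighted percentile of a raw score of 20 is too low; reweighting moves it up toward its population value.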

In practice, a non-random variant of stratified sampling – judgmental sampling (Mellenbergh, 2011, p. 351) – is typically used, in which the subpopulations are typically based on easy-to-measure characteristics, like age, sex, education level, and region. The downsides of this approach are that it is unknown whether enough subpopulations are included, and that the proportions in each of the subpopulations are typically only assessed in a univariate way. That is, it is not assessed whether the proportions of combinations of characteristics (e.g., highly educated, elderly males) match those in the population; the multivariate distribution is not assessed. Tacitly, it is thus assumed that the characteristics are independent – which more often than not will fail to hold.

Choice of reference population

The choice of reference population depends first and foremost on the desired interpretation of the test score. In clinical tests, one typically wants to compare the testee’s test score with the testee’s “healthy” test score to assess whether the testee is now unhealthy. As the score of the healthy version of the testee is not available, the test score is compared to the scores of healthy people who are as similar as possible to the testee. In practice, this means that the testee is compared to people who are similar on easy-to-measure characteristics, like age, sex, and education level. For instance, Van Breukelen and Vlaeyen (2005) considered age, sex, education level, marital status, pain duration, diagnosis, geographic region, and type of medical center.

In intelligence tests, on the other hand, one typically wants to compare the testee’s test score with the scores of a general population of people of the same age. Broader or narrower reference populations are typically uninformative for intelligence tests. For instance, if the reference population consists of all people between 5 and 40 years old, it would be unsurprising to see that someone of age 5 scores below average, because people generally score higher on intelligence tests as they get older. A narrower reference population would also be uninformative, because you are typically not interested in an interpretation like “You score exactly average for a 28-year-old woman, who wears glasses, is about 1.60m tall, and is aiming at obtaining her PhD at the University of Groningen on 14 May 2020”. Rather, it would be informative to know how you scored relative to – for instance – people from the general population, or other PhD students, of your age.

Interestingly, the first version of the Groninger Intelligence Test (Snijders & Verhage, 1962) provided age scores, sex scores, and achievement scores. This allowed the test user to choose the interpretation of the test scores: relative to people of the same age, relative to people of the same sex, or relative to the full norm population.

Once the reference population is chosen based on the desired interpretation, it has to be investigated whether the chosen reference characteristics are related to the test score. If the test scores are not related to the reference characteristics, there is no added value in interpreting the scores conditional on them. The test scores of intelligence and developmental tests are typically related to age. More specifically, the test scores typically increase strongly with age for young children, and this relationship diminishes or decreases from age 25–35 (Ferrer & McArdle, 2004; McArdle, Ferrer-Caja, Hamagami, & Woodcock, 2002).

Traditional and continuous norming

When the reference characteristics are measured with categorical variables (e.g., sex), the reference populations are defined for each category of the variable(s). Traditionally, the same approach was used for continuous variables (e.g., age) by discretizing them (e.g., Wechsler Intelligence Scale for Children-III, WISC-III; Wechsler, 1991). The empirical probability density function for a subgroup was used as an estimate of the raw score distribution of that subgroup. This traditional norming approach is problematic because it unrealistically assumes that the test score distribution is equal for everyone within a subgroup, and it can result in jumps in normed scores at the boundaries of the subgroups. It is assumed that the conditional test score distribution changes as a step function of the continuous variable(s), while theoretically it is more realistic that this relationship is smooth (Van Breukelen & Vlaeyen, 2005; Zachary & Gorsuch, 1985). A continuous function can be approximated by making the subgroups smaller, but this results in fewer observations per subgroup to estimate the raw score distribution.
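The jump at a subgroup boundary can be made concrete with simulated data (the age range, score model, and one-year group width are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(6, 12, 500)
score = 5 * age + rng.normal(0, 4, 500)  # scores increase smoothly with age

def traditional_percentile(raw, a, ages, scores, width=1.0):
    """Empirical percentile within the testee's discrete age group."""
    group = scores[np.floor(ages / width) == np.floor(a / width)]
    return 100.0 * np.mean(group <= raw)

# Two testees of virtually the same age and the same raw score receive
# clearly different percentiles, solely because of the group boundary at 9.
p_before = traditional_percentile(45.0, 8.99, age, score)
p_after = traditional_percentile(45.0, 9.01, age, score)
```

Here the first testee scores well above the mean of the 8-year group, while the second scores below the mean of the 9-year group, even though their true ability is essentially identical.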

These problems are solved in continuous test norming (Zachary & Gorsuch, 1985) – also referred to as regression-based norming – in which the raw score distribution is estimated as a continuous function of the reference characteristic(s) in a regression model. The continuous norming approach is more efficient than the traditional norming approach (Oosterhuis, Van der Ark, & Sijtsma, 2016) because all observations within the normative sample, rather than a subgroup, are used to estimate the raw score distribution. By using a regression model, information from the surrounding predictor values is used. In this way, the raw score distribution can be estimated for any predictor value within the predictor range, even if the specific predictor value itself is not observed in the normative sample. Still, it is important to have the full predictor range represented in the normative sample without large gaps. There are multiple sampling strategies for the predictor(s). For instance, the observations can be sampled in a uniform way across the predictor range, or more observations can be included around predictor ranges for which a stronger relationship between the predictor and the raw test score is expected. The normative sample does not have to be representative with respect to the chosen predictor(s). Rather, the reference sample has to be representative of the reference population, which is evaluated conditional on the predictor(s).

Continuous norming approaches

The continuous norming approaches can be divided into three types (Emons, 2019): inferential norming (Wechsler, 2008; Zachary & Gorsuch, 1985; Zhu & Chen, 2011), moments regression-based norming (Oosterhuis, 2017; Van Breukelen & Vlaeyen, 2005), and non-parametric norming (Lenhard, Lenhard, Suggate, & Segerer, 2018; Tellegen & Laros, 2014).

In inferential norming, moments of the raw test score distributions are computed for subgroups of the normative sample, and these moments are regressed on subgroup-level predictor(s). For instance, the mean and standard deviation of the full scale score can each be regressed on the mean age in a subgroup. Then, the mean and standard deviation of the conditional (normal) distribution of the raw test scores can be predicted for each predictor value in the predictor range, and these predictions are used to transform the raw test scores to standardized scores. This procedure involves smoothing of the curves of the moments as a function of the predictor following suggestions by experts. This can be done using subjective “hand smoothing” (e.g., Zhu & Chen, 2011) or using a statistical model (e.g., Zachary & Gorsuch, 1985). While Zachary and Gorsuch (1985) only modelled the mean and standard deviation of the raw test score, Zhu and Chen (2011) also modelled the skewness and kurtosis, which allowed for capturing non-normality. The advantages of inferential norming compared to traditional norming are that information about the moments from all predictor groups is used, and that the normed scores are smooth across the predictor range(s). The main disadvantage of this approach is that the moments are estimated per subgroup, which results in estimates that are less precise, less efficient, and dependent on the exact subgroups. In addition, it is problematic that hand smoothing expresses individual beliefs about theoretical relationships. We expect that even for experts it is difficult to theorize how moments other than the mean depend on the predictor(s).
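A stripped-down sketch of inferential norming with simulated data (the yearly subgroups and degree-1 polynomial smoothing are our own arbitrary choices, standing in for the expert-guided smoothing step):

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(6, 17, 1200)
score = 20 + 2 * age + rng.normal(0, 3 + 0.2 * age)  # SD grows with age

# Step 1: moments per yearly age subgroup
mids, means, sds = [], [], []
for lo in range(6, 17):
    g = score[(age >= lo) & (age < lo + 1)]
    mids.append(lo + 0.5)
    means.append(g.mean())
    sds.append(g.std(ddof=1))

# Step 2: smooth (regress) the subgroup moments on subgroup age
mean_fit = np.polynomial.Polynomial.fit(mids, means, deg=1)
sd_fit = np.polynomial.Polynomial.fit(mids, sds, deg=1)

# Step 3: predicted conditional mean and SD for any age, observed or not
mu, sd = mean_fit(8.89), sd_fit(8.89)
z = (45.0 - mu) / sd  # standardized score of raw score 45 at age 8.89
```

Note how the subgroup moments, not the individual observations, carry the information into the regression; this is exactly the source of the imprecision discussed above.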

In moments regression-based norming, moments of interest are regressed on predictor(s) for the individual raw test score data, rather than for subgroup data. Van Breukelen and Vlaeyen (2005) and Oosterhuis (2017) used a standard regression model to estimate the mean of the raw test score distribution conditional on the predictor(s). Categorical predictors were included as dummy variables, and continuous predictors were included as linear and – possibly – quadratic terms. Residuals were calculated for each test taker, and these were transformed to standardized residuals (e.g., Z scores). Hereby, it was assumed that the Z scores were normally distributed conditional on the predictor(s), with a constant variance.
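The individual-level variant can be sketched in a few lines (simulated, homoscedastic data; only a linear age term; the `z_score` helper is ours):

```python
import numpy as np

rng = np.random.default_rng(3)
age = rng.uniform(6, 17, 800)
score = 15 + 2.5 * age + rng.normal(0, 5, 800)  # hypothetical normative data

# Regress the raw scores on age, then standardize the residuals to Z scores
coefs = np.polyfit(age, score, deg=1)
resid = score - np.polyval(coefs, age)
sd = resid.std(ddof=2)  # residual SD, correcting for 2 estimated coefficients
z = resid / sd

def z_score(raw, a):
    """Normed Z score of a new testee, assuming the fitted model holds."""
    return (raw - np.polyval(coefs, a)) / sd

z_score(np.polyval(coefs, 10.0), 10.0)  # a testee at the conditional mean -> 0.0
```

The single residual SD is where the homoscedasticity assumption enters: every age receives the same spread.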

The main advantages of the moments regression-based approach are that no additional smoothing step is required, and that statistical criteria for model selection and model assessment are available. The disadvantage of the mean-regression-based approach by Van Breukelen and Vlaeyen (2005) and Oosterhuis (2017) is that homoscedasticity and normality of the residuals are assumed. Oosterhuis (2017) argued that test constructors do not have to investigate this normality assumption “...because for sample size > 50 the central limit theorem ensures that the [standard] regression model is robust against violations of this assumption” (p. 128). This argument is valid only when estimating distribution preserving normed scores, but not when estimating normalized normed scores, because it only pertains to the (implied) model parameters (e.g., regression coefficients, mean, standard deviation), rather than the score distribution itself. In practice, test publishers typically report normalized normed scores rather than distribution preserving normed scores (e.g., Grob & Hagmann-von Arx, 2018; Tellegen & Laros, 2014).

For distribution preserving normed scores, only the mean and standard deviation of the conditional raw test score distribution have to be estimated. Given the central limit theorem (CLT), the sampling distributions of these parameters are (approximately) normal if the sample size is large enough, regardless of the shape of the conditional raw test score distribution. Note that the standard regression model assumes homoscedasticity. If this assumption is too strict, one needs the Gaussian model to estimate the standard deviation conditional upon the predictor(s).
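If the standard deviation must itself depend on the predictor, both the mean and the log-SD can be modelled as functions of age. Below is a crude two-step sketch with simulated data (GAMLSS-type software instead estimates both equations jointly by maximum likelihood; all numbers here are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
age = rng.uniform(6, 17, n)
# Simulated heteroscedastic norms: the SD grows with age
score = 30 + 2 * age + rng.normal(0, 1, n) * np.exp(0.5 + 0.05 * age)

# Step 1: ordinary least squares for the conditional mean
a1, a0 = np.polyfit(age, score, 1)
resid = score - (a0 + a1 * age)

# Step 2: model log(SD) as linear in age. For Gaussian residuals,
# E[log resid^2] = 2*log(SD) + E[log chi2_1], with E[log chi2_1] ~= -1.2704,
# so regressing log(resid^2) on age recovers 2*b1 and 2*b0 - 1.2704.
slope, intercept = np.polyfit(age, np.log(resid ** 2), 1)
b1 = slope / 2
b0 = (intercept + 1.2704) / 2

def conditional_sd(a):
    """Estimated SD of the raw score at age a."""
    return np.exp(b0 + b1 * a)
```

The point is only that the scale, too, can be estimated as a smooth function of the predictor; here `conditional_sd(16)` exceeds `conditional_sd(7)`, reflecting the increasing spread.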


For normalized normed scores, one needs to estimate the conditional raw test score distribution itself. If this distribution clearly deviates from normality, a regression model built upon the normality assumption cannot be used. The CLT pertains to the model parameters, and does not imply that the conditional raw test scores themselves are (approximately) normally distributed if the sample size is large enough. For instance, floor and ceiling effects will result in skewness of the conditional raw test score distribution, regardless of the sample size used.

Van Breukelen and Vlaeyen (2005, p. 344) recommended using scale transformations or normed scores based on deciles in the presence of non-normality, and using the residual standard deviation per quartile of the predicted scores in the presence of heteroscedasticity. Unfortunately, these alternatives require that the shape of the raw score distribution is equal across (subgroups of) the predictor or predicted score range, and they do not allow for local changes in the scale or shape of the distribution. Oosterhuis (2017, p. 93) recommended using the traditional norming method or models with weaker assumptions in the presence of assumption violations. We believe that the best option is to use a continuous norming model with weaker assumptions in the presence of non-normality and heteroscedasticity.

In non-parametric norming, the relationship among the raw test scores, the normed scores, and age is modelled using Taylor polynomials (Lenhard, Lenhard, & Gary, 2019; Lenhard et al., 2018; Tellegen & Laros, 2014). Lenhard et al. (2019) describe their norming approach in three steps. First, the normed scores are estimated based on the empirical cumulative distribution function. All powers of these normed scores, the age variable, and their interactions are calculated up to a predetermined degree. Second, the raw test scores are regressed on all calculated polynomials of the normed scores and age. Finally, the significant terms of the polynomials from the second step are included in the final model. This model can be used to derive the normed scores for combinations of the raw test score and age. The advantage of this non-parametric approach is that it does not require assumptions about the conditional score distribution, and – thus – allows for modelling heteroscedasticity and non-normality. Tellegen and Laros (2014) argued that normality of raw test score distributions is rare, especially when subtests are designed for broad age ranges. Lenhard et al. (2019) argued that homoscedasticity and normality are only rarely fulfilled in psychometric tests, and that non-normality is common because many tests contain floor and ceiling effects in at least some age ranges. A disadvantage of this flexibility is that the resulting percentile curves can intersect, which is impossible from a theoretical point of view. In addition, this approach requires discretization of the continuous predictor variable to estimate the normed scores.
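Steps 1 and 2 can be roughly sketched as follows (simulated data; yearly groups for the preliminary ranks and a fixed polynomial degree replace the significance-based term selection of step 3):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
n = 600
age = rng.uniform(6, 17, n)
score = 10 + 3 * age + rng.normal(0, 4, n)

# Step 1: preliminary normed scores z from the empirical CDF within
# discrete age groups, normalized via the inverse normal CDF
z = np.empty(n)
for lo in range(6, 17):
    idx = (age >= lo) & (age < lo + 1)
    ranks = score[idx].argsort().argsort() + 1
    z[idx] = [NormalDist().inv_cdf(p) for p in (ranks - 0.5) / idx.sum()]

# Step 2: regress the raw score on powers of z and age and their
# interactions, up to a predetermined degree
deg = 3
X = np.column_stack([z ** i * age ** j
                     for i in range(deg + 1) for j in range(deg + 1)])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
r2 = 1 - np.sum((score - X @ coef) ** 2) / np.sum((score - score.mean()) ** 2)
```

A new testee's normed score would then be obtained by solving the fitted polynomial for z given the observed raw score and age; see Lenhard et al. (2019) for the full procedure, including the term selection of step 3.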

Distributional regression

In this thesis, we investigate a flexible moments regression-based norming approach, namely distributional regression (e.g., Rigby & Stasinopoulos, 2005; Umlauf, Klein, & Zeileis, 2018). This approach encompasses the approach by Van Breukelen and Vlaeyen (2005) and Oosterhuis (2017), and allows for many other distributions (Rigby, Stasinopoulos, Heller, & De Bastiani, 2019) and function types as well. In distributional regression, distributional characteristics (e.g., the mean, variance, skewness, and kurtosis) can be modelled as continuous functions of predictors, which allows for modelling heteroscedasticity and non-normality locally. Unlike in the non-parametric norming approach, the undesired intersecting percentile curves cannot occur. We use both a frequentist and a Bayesian framework for distributional regression, namely the generalized additive models for location, scale, and shape (GAMLSS; Rigby & Stasinopoulos, 2005), and the Bayesian additive models for location, scale, and shape (and beyond) (BAMLSS; Umlauf et al., 2018).

Figures 1 and 2 illustrate how GAMLSS can be used to arrive at normed scores conditional on age for the Dutch normative data of composite scale “IQ Screening” of the intelligence test IDS-2 (N = 1,566) (Grob, Hagmann-von Arx, Ruiter, Timmerman, & Visser, 2018). Note that the process of test norming with BAMLSS is similar, but additionally requires a prior distribution for each of the model parameters. In short, one has to select the distribution and function type to model the conditional raw test score distribution as a function of age. In this illustration, we have chosen the Box-Cox Power Exponential (BCPE; Rigby & Stasinopoulos, 2004) distribution, which has four distributional parameters: µ, σ, ν, and τ, for the median, scale, skewness, and kurtosis, respectively. Identity link functions were used for µ and ν, and log link functions were used for σ and τ. The log link function prevented negative parameter estimates of σ and τ. We selected a linear relationship between age and ln(σ), ν, and ln(τ), respectively, and µ was modelled as a smooth function of age using monotonically increasing P-splines (Eilers & Marx, 1996). Figure 1 shows the estimated relationship between each of the distributional parameters of the BCPE distribution and age.

[Figure 1: panels (a)–(d) show µ, σ, ν, and τ, respectively, as functions of age.]

Figure 1. Estimated relationship between each of the distributional parameters of the BCPE distribution (i.e., µ, σ, ν, and τ) and age for the Dutch normative data of composite scale “IQ Screening” of the IDS-2.

The estimated distributional parameters define the estimated conditional raw test score distribution for each age value in the age range. This information is used to transform raw test scores conditional on age to normed scores. Figure 2(a) shows the centile curves as derived from the estimated distributional parameters in Figure 1. The dots indicate the observations in the normative sample, the gray bands indicate percentile ranges, and the vertical line indicates the raw test score distribution evaluated at age 8.89. This precise age value is chosen to illustrate that the normed scores can be assessed for any age value in the age range. This conditional raw test score distribution corresponds to the probability density function (PDF) in Figure 2(b) and the cumulative distribution function (CDF) in Figure 2(c). Using the CDF, the raw test scores conditional on age can be transformed to percentiles. The dotted lines in panel (c) illustrate that the proportion of people of age 8.89 that obtained a raw test score of 30 or below equals 0.726, which means that the corresponding percentile is 72.6. This percentile of 72.6 equals a normalized Z score of 0.6 and a normalized IQ score of 109.
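The final conversion in this example can be verified with the standard-normal inverse CDF (standard-library Python; the numbers are those from the text):

```python
from statistics import NormalDist

cdf_value = 0.726                    # P(score <= 30 | age = 8.89) from the model
percentile = 100 * cdf_value         # percentile 72.6
z = NormalDist().inv_cdf(cdf_value)  # normalized Z score, ~0.6
iq = 100 + 15 * z                    # normalized IQ score, ~109
```

This reproduces the percentile of 72.6, Z score of about 0.6, and IQ score of about 109 reported above.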

[Figure 2: panel (a) shows the centile curves of test score against age, with shaded bands for the 0.4–2, 2–10, 10–25, 25–50, 50–75, 75–90, 90–98, and 98–99.6 percentile ranges; panels (b) and (c) show the PDF and CDF of the test score at age 8.89.]

Figure 2. Estimated centile curves (panel a), and for age 8.89 the corresponding PDF

(panel b) and CDF (panel c) for the Dutch normative data of composite scale “IQ Screening” of the IDS-2. The dots in panel (a) indicate the observations in the normative sample. The gray bands in panels (a) and (b) indicate percentile ranges. The dotted line in panel (c) indicates the cumulative density corresponding to raw test score 30.


Introduction

Aims and overview of this thesis

The availability of many different models makes norming with distributional regression flexible, but this flexibility also comes with challenges. The large number of available models makes model selection difficult. It is practically impossible to compare all possible models, which makes it very useful to have a well-performing automated model selection procedure. So far, it is unknown how well automated model selection procedures perform in the context of distributional regression. In addition, it is unknown how much model flexibility is optimal. A model that is flexible enough to fit the population generally has smaller bias, but also larger sampling variability, than more restricted model versions. Larger sampling variability results in more uncertainty in the parameter estimates, and thus also in the normed scores. This type of uncertainty in normed scores is typically ignored in practice. The sampling variability can be decreased by increasing the size of the normative sample, but this is costly and not always possible in practice.

In this thesis, we address the challenges related to model selection and sampling variability in test norming using distributional regression. In Chapter 2, an automated model selection procedure is developed for the flexible BCPE distribution, and its performance is compared to the performance of an existing procedure. In Chapter 3, the bias-variance trade-off in GAMLSS models is explored. It is investigated what the costs are of using a too strict model (i.e., bias) versus the costs of using a too flexible model (i.e., variance). This is important for guiding model selection. In Chapter 4, a procedure to create confidence intervals that express the uncertainty in normed scores due to sampling variability is investigated. In Chapter 5, it is investigated whether norm estimation can be made more efficient (i.e., requiring a smaller sample size to obtain the same norm precision) by using prior information via Bayesian Gaussian distributional regression. Finally, a general discussion is provided in Chapter 6.

To evaluate the procedures and models, we make extensive use of simulation studies. To make the simulation studies realistic and the conditions empirically relevant, we make use of empirical normative data of psychological tests. The Dutch/German normative data of the Snijders-Oomen non-verbal intelligence test 6-40 (SON-R 6-40; Tellegen & Laros, 2014) is used in Chapters 2 and 4. The Dutch normative data of the Cognitive Test Application (COTAPP; Rommelse et al., 2018) is used in Chapter 3. The Dutch normative data of the Ekman 60 Faces Test of the Facial Expressions of Emotion - Stimuli and Tests (FEEST; Voncken, Timmerman, Spikman, & Huitema, 2018) is used in Chapter 4. Finally, the German and Dutch normative data of the Intelligence and Developmental Scales 2 (IDS-2; Grob & Hagmann-von Arx, 2018; Grob et al., 2018) are used in Chapters 3 (German) and 5 (German and Dutch).¹

All computations in this thesis were performed in R (R Core Team, 2019). The R code used in this thesis is available from https://osf.io/52nzt/ (Chapter 2), https://osf.io/k6fzn/ (Chapter 3), https://osf.io/z62xm/ (Chapter 4), and https://osf.io/cjx3v/ (Chapter 5).

¹We thank Peter Tellegen and Jacob Laros for providing us with the SON-R 6-40 normative data, we thank Nanda Rommelse and the other authors of the COTAPP for providing us with the COTAPP normative data, we thank Joke Spikman for providing us with the FEEST normative data, and we thank Alexander Grob and the other authors of the German and Dutch IDS-2 for providing us with the IDS-2 normative data.


Chapter 2

Model selection in continuous test norming with GAMLSS

Abstract

To compute norms from reference group test scores, continuous norming is preferred over traditional norming. A suitable continuous norming approach for continuous data is the use of the BCPE model, which is part of the generalized additive models for location, scale, and shape (GAMLSS; Rigby & Stasinopoulos, 2005). Applying the BCPE model for test norming requires model selection, but it is unknown how well this can be done with an automatic selection procedure. In a simulation study, we compared the performance of two stepwise model selection procedures combined with four model-fit criteria (AIC, BIC, GAIC(3), cross-validation), varying data complexity, sampling design, and sample size in a fully crossed design. The new procedure combined with one of the GAIC criteria was the most efficient model selection procedure (i.e., required the smallest sample size). The advocated model selection procedure is illustrated with norming data of an intelligence test.

This chapter has been adapted from Voncken, L., Albers, C. J., & Timmerman, M. E. (2019). Model selection in continuous test norming with GAMLSS. Assessment, 26(7), 1329–1346. doi:10.1177/1073191117715113


Introduction

When using psychological tests in practice, normed test scores are required to achieve a sensible interpretation. Normed scores are derived from a raw test score distribution of a reference population. In practice, one resorts to collecting test scores among a representative sample of the reference population. Based on these test scores, the raw test score distribution is estimated. When multiple reference populations are involved for the same test, for example depending on sex or age, and the raw score distributions differ between these populations, different norms should be provided. Traditionally, the distributions, and thus the norms, were estimated separately for the different subgroups. For each subgroup, norm tables were created that convert the raw scores into normed scores.

The main limitation of traditional norming is that all demographic variables are treated as discrete, including those that are continuous in nature. This approach is built upon the assumption that the score distributions are the same for all continuous values within a subgroup. This assumption may be unrealistic in practice, yielding suboptimal norms. In the Wechsler Intelligence Scale for Children (third edition: WISC-III-NL; Kort et al., 2002; Wechsler, 1991), a difference in age of one day can lead to a difference of as much as 12 IQ points (Tellegen, 2004), when the test taker moves from one age subgroup to the next. Within traditional norming, a solution for this would be to increase the number of subgroups, but this leads to less precise norms as the sample size per subgroup decreases (Oosterhuis et al., 2016).

To overcome this limitation of discrete values, continuous norming was developed (Zachary & Gorsuch, 1985), which makes it possible to relate the demographic variables to the test scores on a continuous basis. A statistical model is built that describes the distribution of test scores conditional upon the relevant characteristics. In this way, the available information from the entire norm group is used in estimating the norms. This makes sense, as, for example, test scores of five- and seven-year-old children are informative for test scores of six-year-old children. As a result, continuous norming requires smaller samples than traditional norming (e.g., Bechger, Hemker, & Maris, 2009; Oosterhuis et al., 2016).

This appealing property is widely recognized, as witnessed by the fact that most, if not all, modern tests use continuous norming. Examples are the Wechsler Intelligence Scale for Children (fourth edition: WISC-IV; Wechsler, 2003), the Bayley-III (Bayley, 2006), the Wechsler Adult Intelligence Scale (fourth edition: WAIS-IV; Wechsler, 2008), and the Snijders-Oomen Non-verbal intelligence test for 6 to 40 years old individuals (SON-R 6-40; Tellegen & Laros, 2014).

Various continuous norming methods have been proposed. Zachary and Gorsuch (1985) used standard polynomial regression, with age as a predictor. In such a standard polynomial regression, the means may vary with age, while the distribution at any age is assumed to be normal with constant variance. The latter can be highly unrealistic, yielding serious forms of misfit, with associated improper norms (as discussed in, e.g., Van Breukelen & Vlaeyen, 2005). An alternative, which allows for varying means as well as varying variances, skewness, and possibly kurtosis, is inferential norming (Wechsler, 2008; Zhu & Chen, 2011). This procedure involves two steps. First, one models the raw score means, variances, skewness, and possibly kurtosis via polynomial regressions, using the relevant predictor(s), like age. Second, one aggregates the estimated values to generate an estimate of the raw score distributions, from which the norms are derived, typically using some smoothing by hand. This procedure was applied to the WAIS-IV test (Wechsler, 2008). The disadvantages of inferential norming are that the first step can be suboptimal in view of the second step, and that the second step involves subjective decisions.

An alternative that involves only one step is the use of the statistically well-founded generalized additive models for location, scale, and shape (GAMLSS; Rigby & Stasinopoulos, 2005). The GAMLSS framework includes many different distributions. Because of this useful flexibility, most recently applied continuous norming methods fall within the GAMLSS framework. An example is the Bayley-III test norming (Bayley, 2006; Cromwell et al., 2014; Van Baar, Steenis, Verhoeven, & Hessen, 2014), which involved a polynomial regression using a Box-Cox t distribution (rather than a normal distribution, as in standard regression).

In this paper, we focus on the use of the Box-Cox Power Exponential (BCPE) distribution. All instances of GAMLSS allow for differences in location, most of them also allow for differences in spread, and many of them allow for differences in skewness, or skewness and kurtosis. The BCPE distribution allows for differences in location, spread, skewness, and kurtosis. The application of the BCPE model to test norming requires model selection. This implies that one needs to establish the specific predictor terms to include in the model. It is very convenient if this can be done with an automated selection procedure. However, the performance of automated selection procedures in this context is unknown. Hence, in this study, we will compare the performance of two stepwise model selection procedures: one existing procedure and one that we developed ourselves, combined with various criteria, considering the very flexible BCPE distribution.

The remainder of this introduction is organized as follows. First, we will explain the GAMLSS and the BCPE distributions in more detail. Second, we will explain different model selection criteria and the stepwise model selection procedures. Finally, we will describe our specific research questions.

GAMLSS and BCPE

GAMLSS. The GAMLSS framework allows the modelling of any distribution in the exponential family. GAMLSS has been applied not only in psychological test norming (Bayley, 2006; Cromwell et al., 2014; Van Baar et al., 2014), but also in related areas, namely the development of growth charts (e.g., Borghi et al., 2006; WHO Multicentre Growth Reference Study Group, 2006) and lung function charts (e.g., Cole et al., 2009; Quanjer et al., 2012). Growth charts express references for measures such as length as a function of age. In lung function research, spirometry indices (i.e., measures of lung function) are modelled as a function of age.

BCPE. In this paper, we focus on the BCPE distribution (Rigby & Stasinopoulos, 2004). The BCPE is a highly flexible distribution, useful for variables on a continuous scale. The BCPE is able to model any type of kurtosis (lepto-, platy-, and mesokurtosis), while, for example, the BCT distribution does not allow for a platykurtic distribution. This flexibility is important, as Rigby and Stasinopoulos (2004) showed that both refraining from and incorrect modelling of skewness and kurtosis can lead to distorted fitted percentiles. In order to estimate the norms properly, it is important that the distributional parameters are estimated properly. The four parameters of the BCPE distribution relate to the location (µ, median), scale (σ, approximate coefficient of variation), skewness (ν, transformation to symmetry), and kurtosis (τ, power exponential parameter). The BCPE distribution simplifies to a normal distribution when ν = 1 and τ = 2 (Rigby & Stasinopoulos, 2004).

Automated model selection

The key in using the BCPE for continuous norming is to allow the four parameters of the BCPE distribution to vary with the continuous predictor, for example age. The relationship between these four parameters and age can take different forms. To capture this, we need a flexible model for each of the four parameters. In this study, we do this with polynomials, for example µ = β₀ + β₁·age + β₂·age². We make use of orthogonal polynomials to avoid multicollinearity between the various terms in the polynomial. As the number of possible models is infinitely large, it is important to have an automated approach, based on a particular selection criterion, that finds optimal choices for µ, σ, ν, and τ.
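The idea behind orthogonal polynomials can be illustrated with a small sketch (Python here, although the thesis uses R, where poly() provides such a basis): the raw powers of the predictor are orthogonalized by Gram-Schmidt, so the resulting basis columns are uncorrelated by construction. The function name orthogonal_poly is ours for illustration.

```python
def orthogonal_poly(x, degree):
    """Orthonormal polynomial basis for predictor x (columns for degrees 0..degree),
    built by Gram-Schmidt on the raw powers 1, x, x^2, ... -- similar in spirit to
    the basis returned by R's poly()."""
    cols = []
    for d in range(degree + 1):
        v = [xi ** d for xi in x]
        for u in cols:  # remove the components along the earlier (lower-degree) columns
            coef = sum(vi * ui for vi, ui in zip(v, u))  # u already has unit length
            v = [vi - coef * ui for vi, ui in zip(v, u)]
        norm = sum(vi * vi for vi in v) ** 0.5
        cols.append([vi / norm for vi in v])
    return cols
```

Unlike the raw powers age, age², age³, which are strongly correlated, these columns have pairwise inner products of zero, which keeps the coefficient estimates numerically stable.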

Selection criteria. Commonly used model selection criteria are the generalized Akaike information criterion (GAIC; Akaike, 1983; Rigby & Stasinopoulos, 2006) and cross-validation (Geisser, 1975; Stone, 1974). Both the GAIC and cross-validation try to prevent overfitting of the data, which increases the generalizability to other data of the same type (Hawkins, C., & Mills, 2003). After all, we do not want to accurately estimate the norms for the reference sample, but for the reference population. This prevention of overfitting is done indirectly by the GAIC, where the number of parameters is penalized, and directly by cross-validation, where the model's prediction of new, unseen data is evaluated.

The GAIC is given by

GAIC(p) = −2 ℓ̂_d + p · df,   (1)

where p indicates the penalty, ℓ̂_d the fitted log-likelihood of the data, and df the total effective degrees of freedom used in the model. The value p of the penalty determines the trade-off between the fit and the complexity of the model. The higher the penalty, the higher the penalization of the addition of parameters. In general, the fit increases when the degree of the polynomial is increased (i.e., parameters are added), but this increase in fit has to be in proportion with the resulting increase in complexity.

In our study, we look at three special cases of the GAIC, namely the Akaike information criterion (AIC; Akaike, 1974) where the penalty p = 2, the Bayesian information criterion (BIC; Schwarz, 1978) where the penalty p = ln(n) and n is equal to the sample size, and the GAIC(3) where the penalty p = 3. These three penalties were also used by Rigby and Stasinopoulos (e.g., 2004, 2006). According to Stasinopoulos and Rigby (2007), the penalty p = 3 appeared to be a reasonable compromise between the AIC and BIC. For each special case of the GAIC, we selected the model with the lowest criterion value.
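These criteria can be sketched in a few lines (a Python illustration; the actual model fitting in this chapter is done with the gamlss R package):

```python
import math

def gaic(loglik, df, penalty):
    """Generalized Akaike information criterion (Equation 1):
    -2 * fitted log-likelihood + penalty * effective degrees of freedom."""
    return -2.0 * loglik + penalty * df

# The three special cases used in this chapter (n = sample size):
def aic(loglik, df):
    return gaic(loglik, df, 2.0)

def bic(loglik, df, n):
    return gaic(loglik, df, math.log(n))

def gaic3(loglik, df):
    return gaic(loglik, df, 3.0)
```

For example, with a fitted log-likelihood of −100 and 5 effective degrees of freedom, AIC = 210 and GAIC(3) = 215; among competing models, the one with the lowest criterion value is selected.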

In K-fold cross-validation, the data are split into K equal parts. For each k = 1, ..., K, the kth part is removed from the data set, the model is fitted to the remaining K − 1 parts of the data (training set), and then predictions are made for the left-out kth part (validation set). The K parts are C_1, C_2, ..., C_K, where C_k refers to the observations in part k. The cross-validation estimate of the expected test error is

CV_(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,   (2)

where MSE_k = Σ_{i ∈ C_k} (y_i − ŷ_i)² / n_k, ŷ_i is the fitted value for observation i, obtained from the data without part k, and y_i is the observed value for observation i in part k.

There is a bias-variance trade-off associated with the choice of K. When K = n, the cross-validation estimator is approximately unbiased for the true (expected) prediction error. However, it can have a higher variance than for K < n, because the n training sets are very similar to one another (Hastie, Tibshirani, & Friedman, 2009). Multiple studies have found that K = 10 is a good compromise in this bias-variance trade-off (e.g., Breiman & Spector, 1992; Davison & Hinkley, 1997; Kohavi, 1995). We selected the model with the lowest global deviance.
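The K-fold estimator of Equation 2 can be sketched as follows (a Python illustration with a generic fit callback; the thesis itself fits gamlss models in R and evaluates the cross-validated global deviance rather than the MSE):

```python
import random

def kfold_cv(x, y, fit, K=10, seed=1):
    """K-fold cross-validation estimate of the expected test error:
    CV_(K) = sum over k of (n_k / n) * MSE_k.
    `fit(xs, ys)` must return a prediction function mapping x to a fitted value."""
    n = len(x)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]  # K roughly equal parts C_1, ..., C_K
    cv = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        mse_k = sum((y[i] - predict(x[i])) ** 2 for i in fold) / len(fold)
        cv += (len(fold) / n) * mse_k
    return cv
```

For instance, with a predictor that always returns the training mean, a constant response gives a cross-validation error of exactly zero, while any variation in the response gives a positive error.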

Model selection procedures. The gamlss R package (Stasinopoulos & Rigby, 2007) includes two stepwise model selection procedures – denoted in the package by ‘strategy A’ and ‘strategy B’ – that allow for the selection of all distribution parameters. Because these procedures may not be optimal, as motivated below, we developed an alternative stepwise model selection procedure.

As we employ orthogonal polynomials in our study, the model for a distributional parameter can be depicted as

Σ_{q_k = 0}^{Q_k} β_{q_k} x^{q_k},   (3)

where q_k indicates the degree of the polynomial (q_k = 0, 1, 2, ..., Q_k) for distributional parameter k ∈ {µ, σ, ν, τ}, β_{q_k} indicates the coefficient of the term, and x indicates the predictor (here: age).

In ‘strategy A’, the models for the distributional parameters are selected in a fixed order: first µ, followed by σ, then ν, and finally τ. That is why we will refer to this procedure as the fixed order procedure. To apply this model selection, one needs to select a scope of models, which includes a range of terms for consideration, and an initial model, M^0, needs to be selected. The initial model should be small, for example, including a linear term for µ (i.e., Q_µ = 1), and only intercepts for σ, ν, and τ (i.e., Q_σ = Q_ν = Q_τ = 0). For short, we denote this model as poly(X, 1, 0, 0, 0).

The selection of the parameters involves successive forward selection procedures for Q_µ, Q_σ, Q_ν, and Q_τ, followed by backward selection procedures for successively Q_ν, Q_σ, and Q_µ, to decide whether either the parameters selected in the forward selection (i.e., Q_k) or an intercept (i.e., Q_k = 0) should be included in the final model, given the chosen models for the other distributional parameters. This fixed order is in line with the idea of the authors of the gamlss R package that the parameter hierarchy (i.e., sequentially µ, σ, ν, τ) has to be respected (help function of gamlss R package version 5.0-1; Rigby & Stasinopoulos, 2005). The model selection of the fixed order procedure is based on one of the special cases of the GAIC.

‘Strategy B’ uses the same procedure as the fixed order procedure, but each term in the scope is fitted to all four distributional parameters, rather than to one distributional parameter at a time. As a result, the value of Q_k is the same for all four distributional parameters k. We will not evaluate this procedure because we believe it is too restricted; we see no reason to assume, for instance, that the polynomial degree of the relationship between age and the median score equals that between age and the kurtosis of the scores.

We believe that the fixed order procedure is not optimal, and we provide two arguments for this. First, in the fixed order procedure, the distributional parameters are modelled in a hierarchical order (first µ, then σ, followed by ν, and finally τ). We believe that it is logical to, for example, extend the model for τ before that of µ when this results in a better model fit. Second, in the backward elimination part of the fixed order procedure, it is only checked whether the parameters that are found in the forward selection for that particular distributional parameter are needed. So, if a fourth-degree polynomial of age is found for µ, it is only checked whether it is better to keep this polynomial or to use an intercept. However, we believe it would be better if other polynomial degrees were considered as well. That is why we developed a new stepwise selection procedure, which we term the free order procedure, that deals with these issues.

In the free order procedure, the models for the distributional parameters are selected in a relatively free order. The free order procedure always starts with model M^0 = poly(X, 1, 0, 0, 0). Subsequently, four forward models are fitted, denoted as M^1_Fµ, M^1_Fσ, M^1_Fν, and M^1_Fτ. In each of these models, for the distributional parameter k corresponding to model M_Fk, Q_k is increased by one. In addition, if Q_k > 0, a backward model is fitted for distributional parameter k. In the first step of this procedure, only a backward model for µ can be fitted, which is the intercept-only model, M^1_Bµ = poly(X, 0, 0, 0, 0).

The value of the specific model selection criterion is calculated for all fitted models (i.e., the initial model, the four forward models, and the backward model). The model with the best (i.e., lowest) criterion value is selected, which then becomes M_s. Then, the criterion value of model M_s is compared to those of the four forward models, M^{s+1}_Fµ, M^{s+1}_Fσ, M^{s+1}_Fν, and M^{s+1}_Fτ, and the, at most four, backward models, M^{s+1}_Bµ, M^{s+1}_Bσ, M^{s+1}_Bν, and M^{s+1}_Bτ. This process is repeated until model M_s has the best value of the chosen criterion.

The advantage of the free order procedure over the fixed order procedure is that the model selection is flexible: the order of the distributional parameters which are updated is not fixed beforehand, but depends on the model fit, and the backward selection allows going back one degree of the polynomial instead of choosing between a certain degree and the intercept.
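The search logic of the free order procedure can be sketched as follows (a Python illustration with a hypothetical scoring callback; the actual implementation fits gamlss models in R and scores them with one of the GAIC criteria or cross-validation):

```python
def free_order_search(score, q_max=9):
    """Free-order stepwise search over the polynomial degrees (Q_mu, Q_sigma, Q_nu, Q_tau).
    `score` maps a degree tuple to a criterion value (lower is better).
    Starts from poly(X, 1, 0, 0, 0) and, in each step, considers increasing any
    degree by one (forward) or decreasing any positive degree by one (backward)."""
    current = (1, 0, 0, 0)
    while True:
        candidates = []
        for k in range(4):
            if current[k] < q_max:  # forward model: Q_k increased by one
                up = list(current)
                up[k] += 1
                candidates.append(tuple(up))
            if current[k] > 0:      # backward model: Q_k decreased by one
                down = list(current)
                down[k] -= 1
                candidates.append(tuple(down))
        best = min(candidates, key=score)
        if score(best) >= score(current):  # no candidate improves the criterion
            return current
        current = best
```

Because any of the four degrees may change in any step, the order in which µ, σ, ν, and τ are extended is driven entirely by the criterion, not fixed beforehand.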

Research questions

The goal of this study is to compare the performance in estimating norms in the context of developmental and intelligence tests, using two stepwise model selection procedures (fixed order and free order) and four model selection criteria (AIC, BIC, GAIC(3), and cross-validation). The performance is assessed considering the difference in the population and model-implied distributions of scores. In our study, we systematically varied the population model, sample size, and sampling design. We included 9 different population models, varying in complexity of the relationship between age and the different distributional parameters. The sample size N was equal to 100, 500, or 1,000. The sampling design was uniform or weighted. In uniform sampling, we simulated age values equally spread across the age range. In weighted sampling, the number of people included with a certain age value depended on the change in median test score around that age value: the larger the change in test score, the more people with that age value were included. We applied uniform sampling across all population models, while we applied weighted sampling only to the simplest population model. For each condition, we generated 500 data sets. As a result, 500 (replications) × 9 (population models) × 3 (sample sizes) = 13,500 different data sets were obtained to which uniform sampling was applied, and 500 (replications) × 1 (population model) × 3 (sample sizes) = 1,500 different data sets were obtained to which weighted sampling was applied. Hence, the total number of different obtained data sets was 15,000. To each data set, we applied the two different stepwise model selection procedures (i.e., fixed order and free order procedure), combined with three GAIC model selection criteria (AIC, GAIC(3), BIC) and cross-validation.

Regarding the population model, we expected the simpler data conditions to outperform the more complex data conditions. Regarding the sampling design, we expected the weighted sampling to outperform the uniform sampling in the simplest data condition, because more information is expected to be available when more changes in distributions are to be estimated.

Regarding the sample size, we expected conditions with larger sample sizes to outperform those with smaller sample sizes, because with increasing sample size the probability that the sample represents the population increases, and thus the precision of the estimates would be higher. In addition, we expected this effect to be more pronounced for the weighted sampling conditions than for the uniform sampling conditions, as the number of observations is extremely small for some age ranges with weighted sampling (i.e., an interaction between sample size and sampling design).

Regarding the two stepwise model selection procedures, we expected that the free order procedure yields a better fitting model than the fixed order procedure, because the free order procedure is more flexible. We did not have clear expectations for the model selection criteria.

The RMSE can be split up into a bias and a variance component. We will briefly look at the effect of all conditions on bias and variance separately.
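This decomposition holds on the MSE scale (the RMSE is its square root): the mean squared error of repeated estimates equals the squared bias plus the variance. A minimal sketch (Python; the helper name is ours for illustration):

```python
def mse_decomposition(estimates, truth):
    """Split the mean squared error of repeated estimates of `truth` into
    squared bias plus variance: MSE = bias^2 + variance."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - truth               # systematic deviation from the truth
    variance = sum((e - mean_est) ** 2 for e in estimates) / n  # spread around the mean
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return mse, bias ** 2, variance
```

A stricter (more biased) model typically shifts error into the bias term, while a more flexible model shifts it into the variance term; the two components always add up to the MSE.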

Method

Data generation

To generate the simulated data, we used the BCPE distribution within the GAMLSS framework. The four distributional parameters are modelled using monotonic link functions. We have used the default link functions for the BCPE distribution, namely an identity link for µ and ν, and a log link for σ and τ (Stasinopoulos & Rigby, 2007). The log link makes sure that the values for the distributional parameters stay positive. Even though µ has to be positive, we have chosen to use the identity link for µ, as this leads to additive effects on µ, which makes the interpretation easier. As it turns out, all estimates for µ are positive with this identity link function as well, further reducing the need for a log link function.
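The role of the log link can be shown in a tiny sketch (Python; the function names are ours for illustration): the model works on the unconstrained scale η = log(σ), and back-transforming any real-valued linear predictor yields a strictly positive σ.

```python
import math

def log_link(sigma):
    """Link function: map a positive parameter value to the unconstrained scale."""
    return math.log(sigma)

def inverse_log_link(eta):
    """Inverse link: any real-valued linear predictor eta yields sigma = exp(eta) > 0."""
    return math.exp(eta)

# Even a very negative linear predictor stays positive after back-transformation:
sigmas = [inverse_log_link(eta) for eta in (-5.0, 0.0, 2.3)]
assert all(s > 0 for s in sigmas)
```

With an identity link, by contrast, the linear predictor is used directly as the parameter value, so effects on µ remain additive and easy to interpret.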

The penalized log-likelihood function can be maximized iteratively using the Rigby and Stasinopoulos (RS; Rigby & Stasinopoulos, 1996) algorithm, the Cole and Green (CG; Cole & Green, 1992) algorithm, or a combination of both (see Appendix B of Rigby & Stasinopoulos, 2005, for a detailed explanation of both algorithms). We have chosen to use the RS algorithm only, because it is more stable than the CG algorithm. To make sure that, on the one hand, the number of iterations was enough to reach convergence, but, on the other hand, the study remained feasible, we have set the maximum number of iterations equal to 10,000.

Population models. The population models differ in the relationship between age and the distributional parameters (i.e., µ, σ, ν, or τ). The models were chosen such that they differ in complexity of the distributions, and such that they are realistic representations of models relevant in the context of developmental and intelligence tests.

In all conditions, the median score µ depends on age. The reason for this is that we believe it is unrealistic to assume that this relationship does not exist in developmental and intelligence tests. Hence, we only varied the dependency of the other three distributional parameters on age. The population models result from a completely crossed 2 × 2 × 2 design, with σ_age (dependent, independent) × ν_age (dependent, independent) × τ_age (dependent, independent), plus a ninth model, with all parameters (µ, σ, ν, and τ) age dependent and µ_age more complex. We named the nine resulting population models ‘1000’, ‘1100’, ‘1010’, ‘1001’, ‘1110’, ‘1101’, ‘1011’, ‘1111’, and ‘complex’, where ‘1’ refers to age dependence and ‘0’ refers to age independence of the distributional parameters ‘µσντ’. In addition, ‘complex’ refers to the ninth model, in which there is age dependence for all distributional parameters and the age dependence for µ is more complex than in the other models.

The age dependence of the distributional parameters is expressed in Table 1 and visualized in Figure 3 for the various conditions. We have chosen these relationships because they resemble those found in the normative data of the intelligence test SON-R 6-40 (Tellegen & Laros, 2014). We made the values of ν range from −2 (age = 5) to 4 (age = 40), and the values of τ from 1 (age = 5) to about 3 (age = 40). Recall that the BCPE distribution simplifies to a normal distribution when ν = 1 and τ = 2 (Rigby & Stasinopoulos, 2004). Hence, the distribution ranges from positively skewed (ν < 1) to negatively skewed (ν > 1), expressing floor and ceiling effects, respectively. In addition, the distribution ranges from leptokurtic (τ < 2) to platykurtic (τ > 2).
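As a quick sanity check (a Python sketch; the thesis itself uses R, and the helper names nu and tau are ours), the age-dependent relations for ν and τ listed in Table 1, assuming the forms ν(age) = (6·age − 100)/35 and τ(age) = exp(0.0314·age − 0.1895), reproduce the endpoints quoted above:

```python
import math

def nu(age):
    """Skewness parameter as a linear function of age (Table 1 relation)."""
    return (6.0 * age - 100.0) / 35.0

def tau(age):
    """Kurtosis parameter as an exponential function of age (Table 1 relation)."""
    return math.exp(0.0314 * age - 0.1895)

# nu runs from -2 at age 5 to 4 at age 40; tau from about 1 at age 5 to about 3 at age 40.
```

At age 5 the distribution is thus positively skewed and close to mesokurtic, while at age 40 it is negatively skewed and platykurtic.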

Table 1

Relationship between age and the distributional parameters in the simple and complex data conditions

Parameter      Age dependence                                                 Age independence
µ (simple)     −220 (age + 2)^(−1.4) + 20                                     –
µ (complex)    (12/13) (−115/(age + 2) − (2/5) sin(age/2) + 6) + 159/10       –
σ              exp(−0.0029 (age − 23.5)² − 1.4) + 0.0001                      0.17
ν              (6 · age − 100) / 35                                           1
τ              exp(0.0314 · age − 0.1895)                                     2

Note. The parameter for age independence refers to the intercept.

[Figure 3 appears here: panels (a)–(d) plot the relationships between age and µ, σ, ν, and τ, respectively.]

Figure 3. The relationships between age and each of the distributional parameters. In panel (a), the solid line shows the simple relationship between age and µ (models 1 to 8), while the dashed line shows the complex relationship between age and µ (model 9). In panels (b), (c), and (d), the solid line shows the age dependence of the distributional parameter, while the dashed line shows the age independence (i.e., intercept). Note that µ always depends on age.

Examples of the resulting probability density functions (PDFs) for the ages 8, 22, and 35 in the population are presented in Figure 4. Panel (a) shows the PDFs for the simplest population model, ‘1000’, and panel (b) shows the PDFs for population model ‘1111’. The first model shows what age dependency for µ only looks like, and the latter model shows what age dependency for all four distributional parameters looks like (except for the more complex relationship for µ in model ‘complex’).

Model selection in continuous test norming with GAMLSS

Figure 4. The probability density functions (PDFs) for population models ‘1000’ (panel a) and ‘1111’ (panel b). The dotted lines represent age 8, the solid lines age 22, and the dashed lines age 35.


Sampling design. We used two different sampling designs for the age values: uniform sampling and weighted sampling. In uniform sampling, we generated N age values uniformly distributed from 5 to 40, which is the age interval relevant for the SON-R 6-40. In weighted sampling, we included more (simulated) people of a certain age when we expected more change in the median test score at that age. More specifically, we generated a weighted sample of age values based on the first derivative of the formula of µ, which depends on age. As we expected the positive effect of weighted sampling to be most pronounced in the simplest data condition, where only µ depends on age, we applied weighted sampling to this condition only.

The function of µ becomes almost flat for age values above 25. Hence, to avoid a sample consisting almost exclusively of age values below 25, the sample weights were not based solely on the first derivative of the formula of µ; we also added a constant, set equal to the mean of the first derivatives. As the population of age values, we generated N values uniformly distributed from 5 to 40. Using the weights explained above, we sampled (with replacement) N values from this population of age values. In the models with the simpler relationship between µ and age (models 1 to 8), the median and mean of the age values are 14.02 and 17.18, respectively. In the models with the more complex relationship between µ and age (model 9), the median and mean of the age values are 13.27 and 16.72, respectively. These sampled age values were used across all replications.
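The weighting scheme above can be sketched in a few lines of Python. This is an illustration, not the thesis code (which used R): the function names, the seed, the finite-difference step, and the specific µ formula passed in are all our own choices.

```python
import random

def first_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def weighted_age_sample(n, mu, lo=5.0, hi=40.0, seed=1):
    """Draw n ages with weights equal to the first derivative of mu
    plus a constant (the mean derivative), sampled with replacement
    from a uniform population of ages, as described in the text."""
    rng = random.Random(seed)
    population = [rng.uniform(lo, hi) for _ in range(n)]
    deriv = [first_derivative(mu, a) for a in population]
    const = sum(deriv) / len(deriv)   # keeps the flat region above 25 represented
    weights = [d + const for d in deriv]
    return rng.choices(population, weights=weights, k=n)
```

Because the derivative of µ is largest at young ages, the resulting sample is concentrated below age 25, while the added constant prevents the older ages from disappearing entirely; consequently the sample median falls below the sample mean, consistent with the reported values (14.02 vs. 17.18).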

Test score simulation. The test scores were randomly drawn from the Box-Cox power exponential distribution with the distributional parameters (i.e., µ, σ, ν, and τ) belonging to the given age value.
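Such draws can be generated from the BCPE definition in Rigby and Stasinopoulos (2004): a standard power exponential variable Z with power τ is drawn (here via a gamma variate) and back-transformed through the Box-Cox step Y = µ·(1 + νσZ)^{1/ν}. The sketch below is a stdlib-only illustration that ignores the truncation adjustment for Z (negligible for small σ); the thesis itself used the gamlss package in R.

```python
import math
import random

def rbcpe(mu, sigma, nu, tau, rng=random):
    """One draw from BCPE(mu, sigma, nu, tau), ignoring truncation.

    Z has power exponential density proportional to exp(-0.5 * |z/c|**tau),
    scaled so that Var(Z) = 1; then |Z/c|**tau / 2 follows a Gamma(1/tau)
    distribution, which yields a simple sampler.
    """
    c = math.sqrt(2.0 ** (-2.0 / tau) * math.gamma(1.0 / tau)
                  / math.gamma(3.0 / tau))
    w = rng.gammavariate(1.0 / tau, 1.0)
    z = rng.choice((-1.0, 1.0)) * c * (2.0 * w) ** (1.0 / tau)
    if abs(nu) < 1e-12:                      # log-type limit for nu = 0
        return mu * math.exp(sigma * z)
    return mu * max(1.0 + nu * sigma * z, 1e-12) ** (1.0 / nu)
```

With ν = 1 and τ = 2, Z is standard normal and Y = µ·(1 + σZ), recovering the normal special case noted earlier.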

Examples of percentiles (i.e., the 5th, 15th, 25th, 50th, 75th, 85th, and 95th) as a function of age under the simplest and most complex population models, with randomly drawn observations under uniform sampling (N = 500), are visualized in Figure 5.

Figure 5. The percentiles (shaded regions: 25–50th & 50–75th, 15–25th & 75–85th, and 5–15th & 85–95th) under (a) the simplest population model (1000) and (b) the most complex population model (complex), with uniform sampling and N = 500, and the randomly drawn observations under each model for one replication (black dots).
