University of Groningen Flexible regression-based norming of psychological tests Voncken, Lieke

DOI: 10.33612/diss.124765653

Document version: Publisher's PDF, also known as Version of record

Publication date: 2020

Citation for published version (APA):

Voncken, L. (2020). Flexible regression-based norming of psychological tests. University of Groningen. https://doi.org/10.33612/diss.124765653

6

Discussion

The aim of this thesis was to investigate the challenges related to model selection and sampling variability for test norming using distributional regression. In Chapter 2, we developed an automated model selection procedure to select the maximum polynomial degree of a predictor for each of the four distributional parameters of the BCPE distribution. We showed that this procedure in general performed better than an existing automated model selection procedure, especially in combination with one of the GAIC variants as the selection criterion. In Chapter 3, we investigated the costs of using a model that is too strict (i.e., bias) versus the costs of using a model that is too flexible (i.e., variance). In our simulation study, increasing flexibility resulted in a larger decrease in (squared) bias than the increase in variance, but using a model that was too flexible resulted in very poor normed score estimates when the test score distribution conditional on age was normal. We expect that these problems are specific to the skew Student t distribution, but this has to be investigated in future research. In Chapter 4, we investigated a procedure to estimate confidence intervals that express the uncertainty in normed scores due to sampling variability. The results showed that this procedure performed well, especially in combination with the percentile CI method. In Chapter 5, we investigated whether norm estimation can be made more efficient by using prior information via Bayesian Gaussian distributional regression. This proof of concept showed that normed scores could be estimated more efficiently, as long as the possible prior misspecification was not age-dependent. Based on the results in this thesis, we provide practical recommendations to test publishers and directions for future research.

We modelled nonlinear relationships between the distributional parameters and the predictors with orthogonal polynomials or P-splines (Eilers & Marx, 1996). Based on the results in this thesis and our practical norming experience (e.g., Rommelse et al., 2018; Voncken et al., 2018), we conclude that both polynomials and P-splines can result in good fit. For polynomials, the maximum polynomial degree of the predictor(s) has to be selected for each distributional parameter. The automated model selection procedure for this, which we developed in Chapter 2, can be used to select the maximum polynomial degree in the presence of one predictor. More research is needed to investigate an automated selection procedure of the maximum polynomial degree in the presence of multiple predictors. Until then, default automated model selection procedures within GAMLSS can be used, or – even though this can be very time-consuming – all possible models with a prespecified maximum polynomial degree can be compared. A possible problem with polynomial regression is that values of observed scores conditional on a certain predictor value might have a large and undesirable influence on the predicted score at a very different value of the predictor (Magee, 1998). P-splines do not have this problem, because they model the relationship between the distributional parameters and the predictors more locally. For P-splines, the most critical issue in model selection is the degree of the smoothing variance for each distributional parameter. We selected the degree of smoothing using the BIC in combination with visual diagnostics, but more research is needed to investigate the optimal selection of the smoothing variance for P-splines in test norming. An advantage of using P-splines rather than polynomials in test norming is that a monotonically increasing or decreasing relationship between a distributional parameter and predictor(s) can be forced. When the mean or median test score is theoretically expected to increase with age, the relationship between the location parameter and the predictor can be forced to be monotonically increasing – for raw test scores that increase with performance, e.g., for the number of items correct – or decreasing – for raw test scores that decrease with performance, e.g., for the number of items incorrect or for the response time. In this way, theoretical expectations can be incorporated and the number of estimated parameters can be restricted, which results in smaller sampling variability.
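To give a sense of why comparing all possible polynomial models can be very time-consuming, the following minimal sketch counts the candidate models in an exhaustive search. It assumes four distributional parameters (as for the BCPE distribution), a single predictor, and a polynomial degree that is chosen independently for each parameter between 0 and a prespecified maximum. The function name and the example maxima are hypothetical and are not taken from the thesis.

```python
from itertools import product


def count_candidate_models(max_degree, n_parameters=4):
    """Count the degree combinations in an exhaustive model search.

    Assumption: every distributional parameter gets its own polynomial
    degree between 0 and max_degree (inclusive), chosen independently.
    """
    degrees = range(max_degree + 1)
    # Each tuple of degrees (one per parameter) is one candidate model.
    return sum(1 for _ in product(degrees, repeat=n_parameters))


for max_degree in (2, 3, 5):
    print(max_degree, count_candidate_models(max_degree))
```

Under these assumptions the count equals (D + 1)^4 for a maximum degree D, so it grows from 81 models for D = 2 to 1,296 models for D = 5, each of which requires a separate model fit.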

In the simulation studies in this thesis, the distributional regression models could not always be estimated. This problem was largest for complex models in combination with small sample sizes (in these studies, for N equal to 100 or 500). We believe that these estimation problems are an indication of poor model fit and/or a sample size that is too small. This means that good model selection is very important and that small sample sizes should be avoided. In practice, one typically wants to estimate a model for only one normative data set, which allows for tailoring the model to the data. The wide range of distributions and function types available within distributional regression makes it likely that a well-fitting model can be found for every normative data set. In a simulation study, on the other hand, models are estimated for many generated data sets, and it is impossible to adjust the model to every data set.

We dealt with missingness in the empirical normative data by removing cases if their predictor values and/or test scores were missing. The percentage of missing data was typically very small. For instance, in the Dutch normative data of the IDS-2 (N = 1,663) (Grob et al., 2018), the age value was unknown for only two test takers and – for the 14 intelligence subtests – the number of missing raw test scores was relatively high for only one subtest (95 scores = 5.7%) and relatively low (ranging from 0 to 18 scores) for the other subtests. We expect that the data are missing at random because the data are collected in a controlled test setting, in which the tests are typically administered individually, and the items are not sensitive. We therefore do not expect that removing the cases with missing data introduced bias.

An important practical question is how large the normative sample minimally needs to be to obtain a minimum level of norm precision. Oosterhuis et al. (2016) provided sample size recommendations for regression-based norming with the standard linear regression model, assuming homoscedasticity. There are no clear sample size guidelines yet for models with nonlinearity, heteroscedasticity, and/or non-normality. In this thesis, we concluded in Chapter 2 that the normed scores were estimated sufficiently precisely when the sample size was 500 or 1,000, and in Chapter 4 that a sample size of about 1,000 was enough. Unfortunately, it is difficult to generalize these results to other norming situations. Factors that influence the required sample size are the chosen distribution, the number of predictors, the nature of the predictors, and the complexity of the chosen relationship between the distributional parameters and the predictor(s). The more complex the required model is, the larger the required sample size is. To obtain general sample size recommendations for test norming using distributional regression, the minimally required sample size has to be investigated for an extensive range of norming conditions. In the meantime, the sample size requirements by Oosterhuis et al. (2016) can be used as the lower bounds.

We have concluded in Chapter 3 that the costs of using a model that is too restricted are typically larger than the costs of using a model that is too flexible, as long as the skew Student t distribution was not used to model normal data. Besides investigating this bias-variance trade-off for GAMLSS models, it is also interesting to compare models across continuous norming approaches. Lenhard et al. (2019) already compared some models within the GAMLSS approach (i.e., the normal distribution family, Box-Cox family, and Sinh-arcsinh family) to a model within their non-parametric approach.† Lenhard et al. (2019) found that, in the presence of skewness, the non-parametric model in general had a better model fit (i.e., lower RMSE for T scores) than the considered GAMLSS models. For symmetric distributions, performance was similar at small sample sizes (up to 250 per group), and GAMLSS outperformed the non-parametric approach from 250 per group onwards. They conclude that the non-parametric approach outperformed the semi-parametric approach under most conditions. This is a surprising result, because based on theoretical reasons we would expect the statistical efficiency of the semi-parametric approach to be equal to or larger than that of the non-parametric approach. We believe that it is important to stress that this does not mean that the non-parametric approach outperforms GAMLSS models in general, because GAMLSS allows for many other models. Only a couple of distributions were considered in the model selection, and the default P-splines were used to model all distributional parameters as a function of age. In practice, one could select a different distribution and different functions.

To examine this issue a bit further, we considered a replication in the negative skewness condition with N_group = 50 of Lenhard et al. (2019). For this simulated data set, the non-parametric model (RMSE = 1.442) clearly outperformed the best of the chosen GAMLSS models (RMSE = 1.689). Visual inspection of the fit of the chosen GAMLSS model, which is based on the Box-Cox Cole Green (BCCGo) distribution, via centile curves indicated severe misfit and theoretically odd curves in the age range 0.5 to 2 years old (see Figure 28(a)). To remedy this severe misfit, we selected the beta-binomial (BB) distribution, which respects the discrete nature of the raw test scores and – unlike the Binomial distribution – allows for variation in the item difficulty across items and/or ability level across the test takers. The centile curves of the chosen BB model, with a monotonically increasing P-spline for μ and linear effect of age for σ, indicated better fit than the BCCGo model (see Figure 28(b)). This BB model (RMSE‡ = 1.221) clearly outperformed both the non-parametric (RMSE = 1.442) and BCCGo model (RMSE = 1.689).

† Note that Lenhard et al. (2019) refer to the GAMLSS and their non-parametric approach as parametric and semi-parametric, respectively, while we refer to these approaches as semi-parametric and non-parametric. The developers of GAMLSS refer to the GAMLSS models as semi-parametric regression type models because they require a parametric distribution assumption for the response variable, but allow for non-parametric smoothing functions to model the distributional parameters as a function of explanatory variables (Stasinopoulos & Rigby, 2007, p. 1).

‡ A continuity correction is used to transform the percentiles under the discrete BB distribution to the continuous T score scale.
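The text does not spell out which continuity correction was applied. As a minimal sketch of one common choice, the code below uses a mid-percentile correction (the probability of a lower score plus half the probability of the observed score) and converts the result to a T score with mean 50 and standard deviation 10. The correction itself, the function names, and the shape parameters `alpha` and `beta` (the standard beta-binomial parametrization, not the GAMLSS parametrization in μ and σ) are assumptions for illustration only.

```python
from math import exp, lgamma
from statistics import NormalDist


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) for a beta-binomial variable with n items."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))


def t_score(raw, n, alpha, beta):
    """T score of a raw score via an (assumed) mid-percentile correction."""
    below = sum(betabinom_pmf(k, n, alpha, beta) for k in range(raw))
    # Count only half of the probability mass at the observed score.
    mid_p = below + 0.5 * betabinom_pmf(raw, n, alpha, beta)
    return 50 + 10 * NormalDist().inv_cdf(mid_p)


# Hypothetical 20-item test with made-up shape parameters.
for raw in (5, 12, 18):
    print(raw, round(t_score(raw, 20, 6.0, 4.0), 1))
```

Without such a correction, the highest possible raw score would have a cumulative probability of exactly 1 and therefore no finite T score; the mid-percentile version keeps every raw score strictly inside the unit interval.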

[Figure 28: two scatter plots of test score (y-axis, 5 to 20) against age (x-axis, 1 to 7) with shaded centile bands. Panel (a): Centile curves of BCCGo model as selected by Lenhard et al. (2019). Panel (b): Centile curves of BB model as selected here. Legend (percentiles): 25–50th & 50–75th; 10–25th & 75–90th; 2–10th & 90–98th; 0.4–2nd & 98–99.6th.]

Figure 28. Estimated centile curves for the Box-Cox Cole Green model (BCCGo; panel a) and the beta-binomial model (BB; panel b) for simulated data of one replication with N_group = 50 and negative skewness of the simulation study of Lenhard et al. (2019). The dots indicate the observations in the simulated sample, and the gray bands indicate percentile ranges.

An alternative flexible norming approach is quantile regression-based norming, in which specific percentiles can be modelled as a linear combination of predictor(s), without assuming a distributional form. This can be useful when one wants to estimate a specific percentile (e.g., for a cut-off) rather than all percentiles. Rigby, Stasinopoulos, and Voudouris (2013) recommended combining quantile regression with GAMLSS when concentrating on the tail of the distribution only. Crompvoets, Keuning, and Emons (2020) compared quantile regression-based norming to traditional norming and mean regression-based norming in a simulation study. They concluded that quantile regression-based norming generally resulted in the most precise estimates, but that these estimates were biased for skewed distributions. More research is needed to investigate which specific continuous norming approach works best in which situation.
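Quantile regression estimates a percentile by minimizing an asymmetric absolute loss (often called the check or pinball loss) rather than by assuming a distribution. The sketch below illustrates that idea in its simplest form, without predictors and with made-up scores; it is not the procedure of the studies cited above, and the function names are hypothetical. It shows that the constant that minimizes this loss is a sample quantile.

```python
def pinball_loss(values, candidate, tau):
    """Mean check loss of a candidate value for quantile level tau."""
    total = 0.0
    for y in values:
        residual = y - candidate
        # Under-predictions are weighted by tau, over-predictions by 1 - tau.
        total += tau * residual if residual >= 0 else (tau - 1) * residual
    return total / len(values)


def best_constant(values, tau):
    """Observed value with the smallest check loss.

    The loss is piecewise linear and convex in the candidate, so a
    minimum is always attained at one of the observed values.
    """
    return min(sorted(values), key=lambda c: pinball_loss(values, c, tau))


scores = [12, 15, 9, 20, 17, 11, 14]   # made-up raw test scores
print(best_constant(scores, 0.5))      # median of the sample
print(best_constant(scores, 0.1))      # a low percentile
```

In quantile regression proper, the constant is replaced by a linear combination of predictors such as age, and the same loss is minimized over the regression coefficients, separately for each percentile of interest.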

We have shown in Chapter 5 that using prior information can make norm estimation more efficient. This was a proof of concept for the Gaussian model. Because we believe this model is too restricted for norming practice, future research is needed to investigate whether this works for other, more flexible, models as well. An important issue in using prior information is the extent to which we should trust our prior. Especially for small sample sizes, the weight of the prior is large. Even if we have theoretical reasons to assume that the population model underlying the normative data is similar for two countries, we need to make sure that the data are in line with this expectation.

Finally, we have recommendations for test publishers on how to report on test norming. In general, we noticed that test manuals provide only little information about the norming approach. That is why we recommend that test publishers provide more information in the test manual on the norming method used, including the model selection and the chosen norming model. This thesis has shown that the uncertainty in normed scores due to sampling variability is too large to ignore, and that ignoring it gives test users a false sense of precision. The fact that this is still ignored in practice is especially problematic when the normed scores are used for important decisions. That is why we strongly recommend that test developers report confidence intervals for both the uncertainty in normed scores due to test unreliability and the uncertainty in normed scores due to sampling variability. In the simulation studies of Chapters 2 and 4, we looked at the precision in norm estimates for specific age values and percentiles. In Chapter 2, we directly looked at the variance in the percentile estimates conditional on age, and in Chapter 4, we calculated the median interval length of the confidence intervals for combinations of percentiles and age values. The variance of sample proportions is equal to $n^{-1}p(1-p)$ (e.g., see Fleiss, Levin, & Paik, 2003, p. 141), which is in line with our observation that – given the sample size – the variance was higher for the median than for the 5th and 95th percentiles. In addition, we observed that the variance was larger for extreme predictor values than for less extreme predictor values. Because of the larger variance for extreme age values compared to less extreme age values, we recommend making the age range in the normative sample somewhat larger than the age range in the target population. In this way, the extreme age values of the target population are supported by more observations. Naturally, this is only possible when the test is suitable for testees outside the target age range.
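As a rough numerical illustration of the variance formula for a sample proportion, p(1 - p)/n, the sketch below evaluates it at the 5th, 50th, and 95th percentiles. The sample size of 1,000 is an arbitrary example value and the function name is hypothetical.

```python
def proportion_variance(p, n):
    """Variance of a sample proportion with true proportion p and size n."""
    return p * (1 - p) / n


n = 1000  # arbitrary example sample size
for p in (0.05, 0.50, 0.95):
    print(p, proportion_variance(p, n))
```

The variance is symmetric around p = 0.5 and largest there: for any fixed n it is 0.25/0.0475, or roughly 5.3 times, as large at the median as at the 5th or 95th percentile.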

Even though continuous test norming results in more accurate and more efficient normed score estimates than traditional test norming, a practical disadvantage of continuous test norming is that it is more difficult to arrive at normed scores and more difficult for test users to understand (Van Breukelen & Vlaeyen, 2005). For a detailed tutorial on how to use GAMLSS to arrive at normed test scores, including R code and example data, see Timmerman, Voncken, and Albers (2019). We recommend that test publishers provide visualizations like the centile curves in Figure 28 for each normed subtest. Even though the norming model itself can be complex, the resulting centile curves are easy to understand. The centile curves can be created for each predictor, and allow test users to inspect how the normed scores change as a function of the predictor and raw test score. They show the test developer which ranges of the normed scores are best supported by the normative data, and whether starting and stopping rules might need to be adjusted. An alternative might be to visualize the percentiles as a function of a predictor, with different lines indicating raw test scores. Figure 29 shows an example of this visualization for the same model as in Figure 28(b), after using a continuity correction. This visualization can show test developers whether more test items might be required for certain ranges of the predictor(s). Figure 29 clearly shows that more difficult items are needed for 6–7-year-olds because for those age values, an increase in raw test score of only one at the upper end of the raw score range results in a large increase in the corresponding estimated percentile. Because continuous test norming allows the user to determine normed scores for each raw test score conditional on each exact predictor value, the resulting fine-grained norm tables can become very large. That is why we recommend that test developers use a digital scoring form that incorporates the fine-grained norm tables. In combination with the centile curves, this allows test users to compute normed scores for raw scores conditional on precise predictor values without losing track of the general relationship between – on the one hand – the normed scores and – on the other hand – the raw test scores and predictor values.

[Figure 29: plot of percentile (y-axis, 0 to 100) against age (x-axis, 1 to 7).]

Figure 29. Percentiles – after continuity correction – as a function of age, with different lines indicating raw test scores, for the beta-binomial model as estimated on simulated data of Lenhard et al. (2019). The dots indicate the observations in the simulated sample.

In conclusion, continuous test norming with distributional regression offers great flexibility, which allows for accurate norm estimation. This flexibility comes with challenges, such as complicated model selection, possibly complex models that are difficult to understand, and increased sampling variability. Fortunately – as we have shown in this thesis – we can overcome those challenges with proper model selection, visualization of the normed scores, and efficient norm estimation.

References

Agelink van Rentergem, J. A., de Vent, N. R., Schmand, B. A., Murre, J. M. J., & Huizinga, H. M. (2018). Multivariate normative comparisons for neuropsychological assessment by a multilevel factor structure or multiple imputation approach. Psychological Assessment, 30(4), 436-449. doi:10.1037/pas0000489

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723. doi:10.1109/TAC.1974.1100705

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50(1), 277-291.

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (DSM-5). Arlington, VA: American Psychiatric Publishing.

Bayley, N. (2006). Bayley Scales of Infant and Toddler Development – Third Edition. San Antonio, TX: Harcourt Assessment, Inc.

Bechger, T., Hemker, B. T., & Maris, G. (2009). Over het gebruik van continue normering [On the use of continuous norming]. Arnhem, The Netherlands: Cito.

Borghi, E., De Onis, M., Garza, C., Van den Broeck, J., Frongillo, E. A., Grummer-Strawn, L., . . . Martines, J. C. (2006). Construction of the World Health Organization child growth standards: Selection of methods for attained growth curves. Statistics in Medicine, 25(2), 247-265. doi:10.1002/sim.2227

Breiman, L., & Spector, P. (1992). Submodel selection and evaluation in regression. The x-random case. International Statistical Review, 60(3), 291-319. doi:10.2307/1403680

Cole, T. J., & Green, P. J. (1992). Smoothing reference centile curves: The LMS method and penalized likelihood. Statistics in Medicine, 11(10), 1305-1319. doi:10.1002/sim.4780111005

Cole, T. J., Stanojevic, S., Stocks, J., Coates, A. L., Hankinson, J. L., & Wade, A. M. (2009). Age- and size-related reference ranges: A case study of spirometry through childhood and adulthood. Statistics in Medicine, 28(5), 880-898. doi:10.1002/sim.3504

DASS-21, STAI-X, STAI-Y, SRDS, and SRAS). Australian Psychologist, 46(1), 3-14. doi:10.1111/j.1742-9544.2010.00003.x

Crompvoets, E. A. V., Keuning, J., & Emons, W. H. M. (2020). Bias and precision of continuous norms obtained using quantile regression. Assessment. doi:10.1177/1073191120910201

Cromwell, E. A., Dube, Q., Cole, S. R., Chirambo, C., Dow, A. E., Heyderman, R. S., & Van Rie, A. (2014). Validity of US norms for the Bayley Scales of Infant Development – III in Malawian children. European Journal of Paediatric Neurology, 18(2), 223-230. doi:10.1016/j.ejpn.2013.11.011

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511802843

Death Penalty Information Center. (2015). Intellectual disability and the death penalty. Retrieved January 31, 2017, from http://www.deathpenaltyinfo.org/intellectual-disability-and-death-penalty

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans (Vol. 38). Philadelphia, PA: Society for Industrial and Applied Mathematics.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman and Hall.

Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89-102. doi:10.1214/ss/1038425655

Eilers, P. H. C., & Marx, B. D. (2010). Splines, knots, and penalties. Computational Statistics, 2(6), 637-653. doi:10.1002/wics.125

Emons, W. H. (2019). Sixty years of assessing the quality of psychological tests in the Netherlands: then, now, and in the future. Presented at the 34th IOPS Summer Conference in Utrecht University, The Netherlands.

Ernst, A. F., & Albers, C. J. (2017). Regression assumptions in clinical psychology research practice – a systematic review of common misconceptions. PeerJ, 5, [e3323]. doi:10.7717/peerj.3323

European Parliament and Council of the European Union. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Retrieved from https://data.europa.eu/eli/reg/2016/679/oj

Evers, A., Lucassen, W., Meijer, R. R., & Sijtsma, K. (2009). COTAN assessment system for the quality of tests. Amsterdam, The Netherlands: Nederlands Instituut van Psychologen.

Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). Regression: Models, Methods and Applications. Heidelberg, Germany: Springer.

Fernandez, C., & Steel, M. F. J. (1998). On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441), 359-371. doi:10.2307/2669632

Ferrer, E., & McArdle, J. J. (2004). An experimental analysis of dynamic hypotheses about cognitive abilities and achievement from childhood to early adulthood. Developmental Psychology, 40(6), 935-952. doi:10.1037/0012-1649.40.6.935

Flanagan, J. C. (1939). The cooperative achievement tests: A bulletin reporting the basic principles and procedures used in the development of their system of scaled scores. Oxford, England: Cooperative Test Service.

Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Hoboken, NJ: Wiley.

Frangos, C. C., & Schucany, W. R. (1990). Jackknife estimation of the bootstrap acceleration constant. Computational Statistics and Data Analysis, 9(3), 271-281. doi:10.1016/0167-9473(90)90109-U

Ganguli, M., Snitz, B. E., Lee, C.-W., Vanderbilt, J., Saxton, J. A., & Chang, C.-C. H. (2010). Age and education effects and norms on a cognitive test battery from a population-based cohort: The Monongahela – Youghiogheny Healthy Aging Team (MYHAT). Aging Mental Health, 14(1), 100-107. doi:10.1080/13607860903071014

Goretti, B., Niccolai, C., Hakiki, B., Sturchio, A., Falautano, M., Minacapelli, E., . . . Amato, M. P. (2014). The brief international cognitive assessment for multiple sclerosis (BICAMS): Normative values with gender, age and education corrections in the Italian population. BMC neurology, 14, 171-176. doi:10.1186/s12883-014-0171-6

Grob, A., & Hagmann-von Arx, P. (2018). IDS-2: Intelligence and Development Scales – 2. Bern, Switzerland: Hogrefe.

Grob, A., Hagmann-von Arx, P., Ruiter, S., Timmerman, M. E., & Visser, L. (2018). IDS-2: Intelligentie- en Ontwikkelingsschalen voor kinderen en jongeren [IDS-2: Intelligence and Development Scales for children and adolescents]. Amsterdam, The Netherlands: Hogrefe.

Grober, E., Mowrey, W., Katz, M., Derby, C., & Lipton, R. B. (2015). Conventional and robust norming in identifying preclinical dementia. Journal of Clinical and Experimental Neuropsychology, 37(10), 1098-1106. doi:10.1080/13803395.2015.1078779

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69(3), 635-640. doi:10.1093/biomet/69.3.635

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Data mining, inference, and prediction. New York, NY: Springer. doi:10.1007/978-0-387-84858-7

Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing model fit by cross-validation. Journal of Chemical Information and Computer Sciences, 43(2), 579-586. doi:10.1021/ci025626i

Heathcote, A., Popiel, S. J., & Mewhort, D. J. K. (1991). Analysis of response time distributions: An example using the stroop task. Psychological Bulletin, 109(2), 340-347. doi:10.1037/0033-2909.109.2.340

Higham, N. (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3), 329-343.

Kirsebom, B.-E., Espenes, R., Hessen, E., Waterloo, K., Johnsen, S. H., Gundersen, E., . . . Fladby, T. (2019). Demographically adjusted CERAD wordlist test norms in a Norwegian sample from 40 to 80 years. The Clinical Neuropsychologist, 33(1), 27-39. doi:10.1080/13854046.2019.1574902

Knol, D. L., & Ten Berge, J. M. F. (1989). Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 54(1), 53-61. doi:10.1007/BF02294448

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence, 2(12), 1137-1143. Retrieved from https://dl.acm.org/doi/10.5555/1643031.1643047

Kort, W., Schittekatte, M., Compaan, E. L., Bosmans, M., Bleichrodt, N., Vermeir, G., . . . Verhaeghe, P. (2002). WISC-III-NL. Handleiding [WISC-III-NL Manual]. London, England: The Psychological Corporation.

Lenhard, A., Lenhard, W., & Gary, S. (2019). Continuous norming of psychometric tests: A simulation study of parametric and semi-parametric approaches. PLoS ONE, 14(9). doi:10.1371/journal.pone.0222279

Lenhard, A., Lenhard, W., Suggate, S., & Segerer, R. (2018). A continuous solution to the norming problem. Assessment, 25(1), 112-125. doi:10.1177/1073191116656437

Llinàs-Reglà, J., Vilalta-Franch, J., López-Pouse, S., Calvó-Perxas, L., & Garre-Olmo, J. (2013). Demographically adjusted norms for Catalan older adults on the stroop color and word test. Archives of Clinical Neuropsychology, 28(3), 282-296. doi:10.1093/arclin/act003

Magee, L. (1998). Nonlocal behavior in polynomial regressions. The American Statistician, 52(1), 20-22. doi:10.1080/00031305.1998.10480531

McArdle, J. J., Ferrer-Caja, E., Hamagami, F., & Woodcock, R. W. (2002). Comparative longitudinal structural analyses of the growth and decline of multiple intellectual abilities over the life span. Developmental Psychology, 38(1), 115-142. doi:10.1037//0012-1649.38.1.115

questionnaires (Doctoral dissertation), Tilburg University, Tilburg, The

Netherlands. Retrieved fromhttps://pure.uvt.nl/ws/portalfiles/portal/

16257245/Oosterhuis_Regression_12_04_2017.pdf

Oosterhuis, H. E. M., Van der Ark, L. A., & Sijtsma, K. (2016). Sample size requirements for traditional and regression-based norms. Assessment, 23(2), 191-202.

doi:10.1177/1073191115580638

Oosterhuis, H. E. M., Van der Ark, L. A., & Sijtsma, K. (2017). Standard errors and confidence intervals of norms statistics for educational and psychological tests. Psychometrika, 82(3), 559-588. doi:10.1007/s11336-016-9535-8

Quanjer, P. H., Stanojevic, S., Cole, T. J., Baur, X., Hall, G. L., Culver, B. H., . . . the ERS Global Lung Function Initiative (2012). Multi-ethnic reference values for spirometry for the 3–95-yr age range: The global lung function 2012 equations. European Respiratory Journal, 40(6), 1324-1343. doi:10.1183/09031936.00080312

R Core Team. (2019). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/

Rigby, R. A., & Stasinopoulos, D. M. (1996). A semi-parametric additive model for variance heterogeneity. Statistics and Computing, 6(1), 57-65. doi:10.1007/BF00161574

Rigby, R. A., & Stasinopoulos, D. M. (2004). Smooth centile curves for skew and kurtotic data modelled using the Box–Cox power exponential distribution. Statistics in Medicine, 23(19), 3053-3076. doi:10.1002/sim.1861

Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54(3), 507-554. doi:10.1111/j.1467-9876.2005.00510.x

Rigby, R. A., & Stasinopoulos, D. M. (2006). Using the Box-Cox t distribution in GAMLSS to model skewness and kurtosis. Statistical Modelling, 6(3), 209-229.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., & De Bastiani, F. (2019). Distributions for modelling location, scale, and shape: Using GAMLSS in R. Boca Raton, FL: CRC/Chapman & Hall.

Rigby, R. A., Stasinopoulos, D. M., & Voudouris, V. (2013). Discussion: A comparison of gamlss with quantile regression. Statistical Modelling, 13(4), 335-348. doi:10.1177/1471082X13494316

Rommelse, N., Hartman, C., Brinkman, A., Slaats-Willemse, D., de Zeeuw, P., & Luman, M. (2018). COTAPP: Cognitieve taak applicatie handleiding [COTAPP: Cognitive test application manual]. Amsterdam, The Netherlands: Boom.

Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4), 735-757. doi:10.1198/106186002853

Schmider, E., Ziegler, M., Danay, E., Beyer, L., & Bühner, M. (2010). Is it really robust? Reinvestigating the robustness of ANOVA against violations of the normal distribution assumption. Methodology, 6(4), 147-151. doi:10.1027/1614-2241/a000016

Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464. doi:10.1214/aos/1176344136

Snijders, J. T., & Verhage, F. (1962). Groninger Intelligentie Test: Handleiding [Groninger Intelligence Test: Manual]. Amsterdam, The Netherlands: Swets & Zeitlinger.

Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1-46. doi:10.18637/jss.v023.i07

Stasinopoulos, D. M., Rigby, R. A., Heller, G. Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing – Using GAMLSS in R. Boca Raton, FL: CRC Press.

Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika, 61(3), 509-515. doi:10.1093/biomet/61.3.509

Tellegen, P. J. (2004). De aangepaste normen van de WISC-III-NL [The adjusted norms of the WISC-III-NL]. Retrieved from

Tellegen, P. J., & Laros, J. A. (2017). SON-R 2-8: Snijders-Oomen Niet-verbale intelligentietest: III. Normtabellen [SON-R 2-8: Snijders-Oomen Non-verbal intelligence test: III. Norm tables]. Amsterdam, The Netherlands: Hogrefe.

Timmerman, M. E., Voncken, L., & Albers, C. J. (2019, November 7). A tutorial on regression-based norming of psychological tests with GAMLSS. https://doi.org/10.31219/osf.io/mdc9u

Umlauf, N., Klein, N., & Zeileis, A. (2018). BAMLSS: Bayesian Additive Models for Location, Scale, and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612-627. doi:10.1080/10618600.2017.1407325

Van Baar, A. L., Steenis, L. J. P., Verhoeven, M., & Hessen, D. J. (2014). Bayley-III-NL, Technische handleiding [Bayley-III-NL, Technical manual]. Amsterdam, The Netherlands: Pearson Assessment and Information B.V.

Van Belle, G. (2003). Statistical rules of thumb (2nd ed.). Hoboken, NJ: Wiley.

Van Breukelen, G. J. P., & Vlaeyen, J. W. S. (2005). Norming clinical questionnaires with multiple regression: The pain cognition list. Psychological Assessment, 17(3), 336-344. doi:10.1037/1040-3590.17.3.336

Van Buuren, S., & Fredriks, M. (2001). Worm plot: a simple diagnostic device for modelling growth reference curves. Statistics in Medicine, 20(8), 1259-1277. doi:10.1002/sim.746

Van der Elst, W., Hoogenhout, E. M., Dixon, R. A., De Groot, R. H. M., & Jolles, J. (2011). The Dutch Memory Compensation Questionnaire: Psychometric properties and regression-based norms. Assessment, 18(4), 517-529. doi:10.1177/1073191110370116

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York, NY: Springer.

Voncken, L., Albers, C. J., & Timmerman, M. E. (2019a). Improving confidence intervals for normed test scores: Include uncertainty due to sampling variability. Behavior Research Methods, 51(2), 826-839. doi:10.3758/s13428-018-1122-8

test norming with GAMLSS. Assessment, 26(7), 1329-1346. doi:10.1177/1073191117715113

Voncken, L., Albers, C. J., & Timmerman, M. E. (2019, November 6). Bias-variance trade-off in continuous test norming. https://doi.org/10.31234/osf.io/cz8k3

Voncken, L., Kneib, T., Albers, C. J., Umlauf, N., & Timmerman, M. E. (2019, August 14). Bayesian Gaussian distributional regression models for more efficient norm estimation. https://doi.org/10.31234/osf.io/7j8ym

Voncken, L., Timmerman, M. E., Spikman, J. M., & Huitema, R. (2018). Beschrijving van de nieuwe, Nederlandse normering van de Ekman 60 Faces Test (EFT), onderdeel van de FEEST [Description of the new, Dutch norming of the Ekman 60 Faces Test (EFT), part of the FEEST]. Tijdschrift voor Neuropsychologie, 13(2), 143-151.

Wasserman, J. D., & Bracken, B. (2013). Fundamental psychometric considerations in assessment. In J. R. Graham, & J. A. Naglieri (Eds.), Handbook of psychology: Assessment psychology (Vol. 10, 2nd ed., pp. 50-80). Hoboken, NJ: Wiley.

Wechsler, D. (1991). Manual for the Wechsler Intelligence Scale for Children – Third Edition. San Antonio, TX: Psychological Corporation.

Wechsler, D. (2003). Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV). San Antonio, TX: Psychological Corporation.

Wechsler, D. (2008). Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV). San Antonio, TX: NCS Pearson.

Wechsler, D. (2018). WISC-V-NL: Wechsler Intelligence Scale for Children – Fifth Edition – Nederlandstalige bewerking. Technische handleiding [Dutch adaptation. Technical manual]. Amsterdam, The Netherlands: Pearson Benelux B.V.

WHO Multicentre Growth Reference Study Group. (2006). WHO child growth standards based on length/height, weight and age. Acta Paediatrica Supplement, Supplement 450, 76-85. doi:10.1080/08035320500495548

Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing (3rd ed.). Amsterdam, The Netherlands: Academic Press.

Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research &

FL: CRC/Chapman & Hall.

Würtz, D., Chalabi, Y., & Luksan, L. (2006). Parameter estimation of ARMA models with GARCH/APARCH errors. An R and SPlus software implementation. Journal of Statistical Software, 55(2), 28-33.

Young, A. W., Perrett, D., Calder, A., Sprengelmeyer, R., & Ekman, P. (2002). Facial expressions of emotion: Stimuli and tests (FEEST). Bury St. Edmunds, UK: Thames Valley Test Company.

Zachary, R. A., & Gorsuch, R. L. (1985). Continuous norming: Implications for the WAIS-R. Journal of Clinical Psychology, 41(1), 86-94. doi:10.1002/1097-4679(198501)41:1<86::AID-JCLP2270410115>3.0.CO;2-W

Zhu, J., & Chen, H.-Y. (2011). Utility of inferential norming with smaller sample sizes. Journal of Psychoeducational Assessment, 29(6), 570-580. doi:10.1177/0734282910396323

Zorginstituut Nederland. (2017). Toegang tot Wlz-zorg [Access to long-term care act]. Retrieved January 17, 2017, from https://www.zorginstituutnederland.nl/