Contributions to latent variable modeling in educational measurement


UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

Zwitser, R.J.

Publication date

2015

Document Version

Final published version


Citation for published version (APA):

Zwitser, R. J. (2015). Contributions to latent variable modeling in educational measurement.

General rights

It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).



Bibliography

Adams, R. (2011, 19 April). Comments on Kreiner 2011: Is the foundation under PISA solid? A critical look at the scaling model underlying international comparisons of student attainment. Retrieved from http://www.oecd.org/pisa/47681954.pdf

Adams, R., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21 (1), 1-23.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Andersen, E. B. (1973a). Conditional inference and models for measuring. (Unpublished doctoral dissertation). Mentalhygiejnisk Forskningsinstitut.

Andersen, E. B. (1973b). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.

Bechger, T. M., & Maris, G. (2014). A statistical test for differential item pair functioning. Psychometrika.

Bechger, T. M., Maris, G., & Verstralen, H. H. F. M. (2010). A different view on DIF (Measurement and Research Department Reports No. 2010-4). Cito.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (p. 395-479). Reading: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters. Psychometrika, 46, 443-460.


Bolsinova, M., Maris, G., & Hoijtink, H. (2012, July). Unmixing Rasch scales. Paper presented at the V European Congress of Methodology, Santiago de Compostela, Spain.

Brennan, R. L. (2006). Perspectives on the evaluation and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., chap. 1). Westport: Praeger Publishers.

Conover, W. J. (1999a). Practical nonparametric statistics (3rd ed.). New York: John Wiley & Sons.

Conover, W. J. (1999b). Statistics of the Kolmogorov-Smirnov type. In W. J. Conover (Ed.), Practical nonparametric statistics (3rd ed., p. 428-473). John Wiley & Sons.

Council of Europe. (2012). First European survey on language competences: Technical report. Retrieved from http://www.surveylang.org/

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.

Dieterich, C. (2013, March). In or out, DJIA companies reflect changing times. The Wall Street Journal. Retrieved from http://online.wsj.com/news/articles/SB10001424127887324678604578342113520798752

Doob, J. (1949). Heuristic approach to the Kolmogorov-Smirnov theorems. The Annals of Mathematical Statistics, 20, 393-403.

Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete designs. Psicológica, 32, 107-132.

Fischer, G. H. (1974). Einführung in die Theorie Psychologischer Tests. Bern: Verlag Hans Huber. (Introduction to the theory of psychological tests.)

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222, 309-368. doi: 10.1098/rsta.1922.0009

Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of Educational Statistics, 13, 45-52.

Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models (Unpublished doctoral dissertation). Arnhem: Cito.

Glas, C. A. W. (2000). Item calibration and parameter drift. In W. J. Van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (p. 183-199). Kluwer Academic Publishers.

Glas, C. A. W. (2010). Item parameter estimation and item fit analysis. In W. J. Van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (p. 269-288). Springer.

Glas, C. A. W., Wainer, H., & Bradlow, E. (2000). MML and EAP estimation in testlet-based adaptive testing. In W. J. Van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (p. 271-287). Kluwer Academic Publishers.

Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education, 11 (3), 319-330. doi: 10.1080/0969594042000304618

Grayson, D. A. (1988). Two-group classification in latent trait theory: Scores with monotone likelihood ratio. Psychometrika, 53, 383-392.

Guttman, L. (1950). The basis for scalogram analysis. In S. Stouffer, L. Guttman, E. Suchman, P. Lazarsfeld, S. Star, & J. Clausen (Eds.), Measurement and Prediction (Vol. 4, p. 60-90). Princeton, NJ: Princeton University Press.

Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62 (3), 331-347.

Hessen, D. J. (2005). Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis. Psychometrika, 70 (3), 497-516.

Holland, P., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Huynh, H. (1994). A new proof for monotone likelihood ratio for the sum of independent Bernoulli random variables. Psychometrika, 59, 77-79.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.

Kreiner, S. (2011). Is the foundation under PISA solid? A critical look at the scaling model underlying international comparisons of student attainment. (Tech. Rep.). Dept. of Biostatistics, University of Copenhagen.

Kreiner, S., & Christensen, K. B. (2007). Validity and objectivity in health-related scales: Analysis by graphical loglinear Rasch models. In M. Von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (p. 329-346). New York: Springer.

Kreiner, S., & Christensen, K. B. (2013). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika. doi: 10.1007/s11336-013-9347-z

Kubinger, K. D., Steinfeld, J., Reif, M., & Yanagida, T. (2012). Biased (conditional) parameter estimation of a Rasch model calibrated item pool administered according to a branched testing design. Psychological Test and Assessment Modeling, 52 (4), 450-460.

Le, L. T. (2007). Effects of item positions on their difficulty and discrimination: A study in PISA science data across test language and countries. Paper presented at the 72nd Annual Meeting of the Psychometric Society, Tokyo, Japan. Retrieved from http://research.acer.edu.au/pisa/2/

Linthorne, N. (2014, August). Wind assistance in the 100m sprint. Retrieved from http://www.brunel.ac.uk/~spstnpl/Publications/

Lord, F. M. (1971a). The self-scoring flexilevel test. Journal of Educational Measurement, 8 (3), 147-151.

Lord, F. M. (1971b). A theoretical study of two-stage testing. Psychometrika, 36, 227-242.

Lord, F. M. (1980). Application of item response theory to practical testing problems. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17 (3), 179-193.

Maris, G. (2008). A note on “Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis” Hessen, 2005. Psychometrika, 73 (1), 153-157.

Marsman, M., Maris, G., Bechger, T., & Glas, C. (2013a, July). Composition algorithms for conditional distributions. Paper presented at the 78th Annual Meeting of the Psychometric Society, Arnhem, The Netherlands.

Marsman, M., Maris, G., Bechger, T., & Glas, C. (2013b, October). A non-parametric estimator of latent variable distributions. Paper presented at the RCEC workshop on IRT and Educational Measurement, Enschede, The Netherlands.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement, 14 (3), 283-298.

Milgrom, P. R. (1981). Good news and bad news: Representation theorems and application. The Bell Journal of Economics, 12 (2), 380-391.

Mislevy, R. J. (1998). Implications of market-basket reporting for achievement-level setting. Applied Psychological Measurement, 11 (1), 49-63.

Mokken, R. (1971). A theory and procedure of scale analysis. The Hague: Mouton.

Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.

OECD. (2004). Learning for tomorrow's world: First results from PISA 2003. Retrieved from www.oecd.org/dataoecd/1/60/34002216.pdf

OECD. (2007). PISA 2006: Science competencies for tomorrow's world: Volume 1: Analysis.

OECD. (2009a). PISA 2006 technical report.

OECD. (2009b). PISA data analysis manual.

OECD. (2012). The policy impact of PISA: An exploration of the normative effects of international benchmarking in school system performance (OECD Education Working Paper No. 71). Organisation for Economic Co-operation and Development.

OECD. (2014). PISA 2012 results in focus: What 15-year-olds know and what they can do with what they know.

Oliveri, M. E., & Ercikan, K. (2011). Do different approaches to examining construct comparability in multilanguage assessments lead to similar conclusions? Applied Measurement in Education, 24 (4), 349-366. doi: 10.1080/08957347.2011.607063

Oliveri, M. E., & Von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53 (3), 315-333.

Oliveri, M. E., & Von Davier, M. (2014). Toward increasing fairness in score scale calibrations employed in international large-scale assessments. International Journal of Testing, 14 (1), 1-21. doi: 10.1080/15305058.2013.825265

Post, W. J. (1992). Nonparametric unfolding models. A latent structure approach. Leiden: DSWO Press.

R Development Core Team. (2013). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago, The University of Chicago Press)

Ross, S. M. (1996). Stochastic processes (2nd ed.). New York: John Wiley & Sons.

Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581-592.

Sandilands, D., Oliveri, M. E., Zumbo, B. D., & Ercikan, K. (2013). Investigating sources of differential item functioning in international large-scale assessments using a confirmatory approach. International Journal of Testing, 13 (2), 152-174. doi: 10.1080/15305058.2012.690140

Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42 (7), 1-52.

Sijtsma, K., & Hemker, B. T. (2000). A taxonomy of IRT models for ordering persons and items using simple sum scores. Journal of Educational and Behavioral Statistics, 25 (4), 391-415.

Sijtsma, K., & Molenaar, I. (2002). Introduction to nonparametric item response theory. Thousand Oaks, California: Sage Publications, Inc.

Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201-293.

Van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. New York: Springer.

Van der Linden, W. J., & Pashley, P. J. (2010). Item selection and ability estimation in adaptive testing. In W. J. Van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (p. 3-30). Springer.

Verhelst, N. D. (2012). Profile analysis: A closer look at the PISA 2000 reading data. Scandinavian Journal of Educational Research, 56 (3), 315-332. doi: 10.1080/00313831.2011.583937


Verhelst, N. D., & Glas, C. A. W. (1995). The one parameter logistic model: OPLM. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (p. 215-238). New York: Springer Verlag.

Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F. M. (1993). OPLM: One parameter logistic model. Computer program and manual. Arnhem: Cito.

Wainer, H., Bradlow, E., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. Van der Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and practice (p. 245-269). Kluwer Academic Publishers.

Warm, T. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.

Weiss, D. J. (Ed.). (1983). New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Zenisky, A., Hambleton, R. K., & Luecht, R. (2010). Multistage testing: Issues, designs and research. In W. J. Van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (p. 355-372). Springer.


References published chapters

Chapters 2, 3, and 4 have been submitted or accepted for publication as, respectively:

Zwitser, R.J. & Maris, G. (in press). Conditional Statistical Inference with Multistage Testing Designs. Psychometrika

Zwitser, R.J. & Maris, G. (conditionally accepted). Ordering Individuals with Sum Scores: the Introduction of the Nonparametric Rasch Model. Psychometrika

Zwitser, R.J., Glaser, S.S.F. & Maris, G. (submitted). Monitoring Countries in a Changing World. A New Look at DIF in International Surveys

Interest of co-authors:

• G. Maris is the supervisor of this thesis

• S.S.F. Glaser is a master’s student who has contributed to the analyses described in Section 4.4.3


Summary of ‘Contributions to Latent Variable Modeling in Educational Measurement’

One of the prominent questions in educational measurement is how to summarise scored item responses into a final score, such that the final score reflects the construct that is supposed to be measured. In answering this question, latent variable models play an important role. This thesis considers a couple of questions regarding the use of latent variable models in scoring item responses.

Chapter 1 is an introduction to the rest of the thesis. It is explained that there are different views on what a construct is. Furthermore, this chapter introduces some general terms and theory, after which a broad overview of the thesis is given.

The core of the thesis consists of Chapters 2 to 4, which describe three distinct research projects. The first project, described in Chapter 2, is about conditional likelihood inference from multistage testing designs. In adaptive testing, individual test takers are usually scored via estimates of person parameters. To obtain unbiased estimates, the item parameter estimates must also be unbiased. This chapter shows how to obtain item parameter estimates from multistage testing designs based on the conditional likelihood method. Besides this technical result, some more general issues related to adaptive testing, item parameter estimation, and model fit are discussed. It is explained that simple measurement models are more likely to fit data obtained from adaptive designs than data obtained from linear designs. This is illustrated with simulated data, as well as with real data taken from the Dutch Entreetoets.
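As a minimal sketch of the idea behind conditional likelihood inference, the Python fragment below checks numerically that, under a Rasch model in a complete (non-adaptive) design, the probability of a response pattern given its sum score does not depend on the person parameter. The item difficulties, the response pattern, and all function names are hypothetical and chosen for illustration only; the sketch does not reproduce the multistage result of Chapter 2, where the conditioning has to take the routing design into account.

```python
import math
from itertools import product


def elementary_symmetric(b):
    """Return [gamma_0, ..., gamma_n] for the values b_i = exp(-delta_i)."""
    gamma = [1.0] + [0.0] * len(b)
    for i, b_i in enumerate(b, start=1):
        # Update from high to low order so each item enters a term at most once.
        for s in range(i, 0, -1):
            gamma[s] += b_i * gamma[s - 1]
    return gamma


def pattern_prob(x, theta, delta):
    """P(X = x | theta) under the Rasch model."""
    p = 1.0
    for x_i, d_i in zip(x, delta):
        p_i = 1.0 / (1.0 + math.exp(-(theta - d_i)))
        p *= p_i if x_i == 1 else 1.0 - p_i
    return p


def conditional_prob_brute(x, theta, delta):
    """P(X = x | sum score, theta), by enumerating all patterns with that score."""
    s = sum(x)
    total = sum(
        pattern_prob(y, theta, delta)
        for y in product((0, 1), repeat=len(delta))
        if sum(y) == s
    )
    return pattern_prob(x, theta, delta) / total


def conditional_prob_cml(x, delta):
    """The same probability written without theta, via elementary symmetric functions."""
    b = [math.exp(-d) for d in delta]
    gamma = elementary_symmetric(b)
    numerator = math.prod(b_i for x_i, b_i in zip(x, b) if x_i == 1)
    return numerator / gamma[sum(x)]


delta = [-1.0, 0.0, 0.5, 1.5]  # hypothetical item difficulties
x = (1, 0, 1, 0)               # hypothetical response pattern

for theta in (-2.0, 0.0, 2.0):
    # Both columns should agree for every value of theta.
    print(theta, conditional_prob_brute(x, theta, delta), conditional_prob_cml(x, delta))
```

Because the conditional probability is free of the person parameter, item parameters can in principle be estimated from it without specifying an ability distribution, which is what makes the conditional approach attractive.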

In Chapter 3, the item response theory (IRT)-based justification of the use of the sum score is considered. Two IRT-models are well-known for their relationship between the sum score and the person parameter: the parametric Rasch Model (RM), in which the sum score is a sufficient statistic for the person parameter, and the nonparametric Monotone Homogeneity Model (MHM), in which the latent trait is stochastically ordered by the sum score. It is illustrated that there is a theoretical gap between the two: the RM enables scoring individuals by means of the sum score, while the MHM enables ordering groups by means of the sum score. To fill the gap, the concept of ordinal sufficiency is defined, and the nonparametric Rasch Model is introduced as a less restrictive nonparametric alternative that enables ordering individuals by means of the sum score.
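The two properties contrasted above can be written side by side. The notation is an assumption made for this illustration: X is the vector of dichotomous item scores, X_+ its sum, and Θ the latent trait; the fragment needs the amsmath package. The formal definition of ordinal sufficiency is given in Chapter 3 and is not reproduced here.

```latex
\begin{align}
  % Rasch model: given the sum score, the response pattern carries
  % no further information about the person parameter.
  P(X = x \mid X_+ = s, \theta) &= P(X = x \mid X_+ = s)
    && \text{(RM: sufficiency)} \\
  % Monotone homogeneity model: higher sum scores go with
  % stochastically larger values of the latent trait.
  P(\Theta > c \mid X_+ = s) &\le P(\Theta > c \mid X_+ = t)
    \quad \text{for all } c \text{ and all } s < t
    && \text{(MHM: stochastic ordering)}
\end{align}
```

The first line is a statement about every individual response pattern, whereas the second is a statement about the distribution of Θ within score groups, which is one way to read the distinction between scoring individuals and ordering groups.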

The final project, in Chapter 4, is about differential item functioning (DIF) in international surveys. Usually, DIF is considered a threat to validity, and a phenomenon that hinders the comparison of performances between countries. However, in the approach described in Chapter 4, DIF is not considered a threat, but an interesting survey outcome reflecting qualitative differences between countries. To obtain comparable scores in a context with DIF, it is proposed not to take the person parameter estimates as a basis for comparison. Instead, it is proposed to define the construct as a market basket of items, and to take (a summary statistic of) the item responses as the basis for comparisons. Since survey data are usually incomplete, the latent variable models, probably different models in different countries, are used to describe the distribution of the item responses in the market basket. This approach is illustrated with data from the PISA cycle of 2006.
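The market basket idea can be sketched as follows, under assumptions that are not taken from the thesis: each country has its own Rasch model with country-specific item difficulties (which is where DIF enters), ability is normally distributed within each country, and the summary statistic is the expected sum score on the basket. All numbers and names are hypothetical.

```python
import math


def rasch_prob(theta, delta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def expected_basket_score(deltas, mu, sigma, n_grid=201, width=6.0):
    """Expected sum score on the basket for theta ~ N(mu, sigma^2).

    The integral over theta is approximated on an equally spaced grid
    spanning mu +/- width * sigma, with normalised normal-density weights.
    """
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_grid - 1)
    weight_sum = 0.0
    score_sum = 0.0
    for k in range(n_grid):
        theta = lo + k * step
        w = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        weight_sum += w
        score_sum += w * sum(rasch_prob(theta, d) for d in deltas)
    return score_sum / weight_sum


# Hypothetical basket of four items; difficulties differ between countries (DIF).
country_a = {"deltas": [-1.0, 0.0, 0.5, 1.0], "mu": 0.0, "sigma": 1.0}
country_b = {"deltas": [-0.5, -0.5, 1.0, 0.5], "mu": 0.0, "sigma": 1.0}

score_a = expected_basket_score(**country_a)
score_b = expected_basket_score(**country_b)
print(score_a, score_b)
```

The comparison is made on the observable scale of the basket rather than on the latent scale, so it does not require the item parameters to be equal across countries.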

Chapter 5 is a general discussion. Three issues related to Chapters 2 to 4 are raised. The first is about an optimal adaptive test design for high-stakes testing. It is argued that this is not a computerized adaptive test (CAT) with an infinitely large and calibrated item bank. Instead, a multistage test can lead to more efficient results. The second is about what to do when the sum score is not ordinal sufficient for the person parameter. It is argued that, especially in high-stakes testing, one should look for a coarser statistic that is ordinal sufficient. The third issue is an elaboration on the positive aspects of DIF.


Samenvatting (Dutch summary) of ‘Contributions to Latent Variable Modeling in Educational Measurement’

A prominent question in educational measurement is how the scores on individual test items should be summarised into a final score, such that the final score represents the construct, that is, the ability to be measured. In answering this question, latent variable models play an important role. This thesis considers a number of issues concerning the use of latent variable models and the determination of final scores.

Chapter 1 is an introduction. It first describes that there are different views on what a construct actually is. It then introduces a number of general terms and the necessary theory. Finally, an overview of the rest of the thesis is given.

Chapters 2 to 4 form the core of the thesis. These chapters describe three separate research projects. The first project, described in Chapter 2, is about conditional likelihood¹ inference in multistage testing. In adaptive testing, scores are usually assigned via estimates of the person parameters. To obtain unbiased estimators, the item parameters must also be unbiased. This chapter shows how the item parameters can be estimated in multistage testing on the basis of the conditional likelihood method. Besides this technical result, a number of general themes relating to adaptive testing, item parameter estimation, and model fit are discussed. It is explained that simple measurement models are likely to fit data from adaptive tests better than data from linear tests. This is illustrated both with simulated data and with data from the Dutch Entreetoets.

¹ For some words in the Dutch summary the English term is used, because these terms are also used in Dutch jargon.

Chapter 3 is about justifying the use of the sum score by means of item response theory (IRT). Two IRT models are known for the relationship between the sum score and the person parameter: the parametric Rasch Model (RM), in which the sum score is a sufficient statistic for the person parameter, and the nonparametric Monotone Homogeneity Model (MHM), in which the latent trait is stochastically ordered by the sum score. Chapter 3 argues that the RM justifies scoring individuals on the basis of the sum score, whereas the MHM justifies ordering groups on the basis of the sum score. This leaves room for a third model. To be able to introduce this third model, the concept of ordinal sufficiency is first defined. The model that is subsequently introduced is the nonparametric Rasch model. This is a less restrictive model than the RM, with which ordering individuals on the basis of the sum score can be justified.

The final project, described in Chapter 4, is about differential item functioning (DIF) in international educational surveys. DIF is usually seen as a threat to validity and as something that complicates the comparison of countries' performances. However, in the method described in Chapter 4, DIF is not seen as a threat, but as an interesting result that reflects qualitative differences between countries. To arrive at comparable scores in a context with DIF, it is proposed not to base the comparison on the person parameters. Instead, it is proposed to define the construct as a large collection of test items (the market basket), and to base comparisons on a summary statistic on this market basket. Since survey data are usually incomplete, latent variable models, possibly different models in different countries, are used to estimate the distribution of the item scores in the market basket. This approach is illustrated with PISA data from 2006.

Chapter 5 is a general discussion. Three matters relating to Chapters 2 to 4 are discussed in more detail. The first is the question of what an optimal adaptive test is for high-stakes testing. It is argued that this is not a computerized adaptive test (CAT) with an infinitely large and calibrated item bank. Instead, a multistage test can lead to more efficient results. The second concerns what to do when the sum score is not ordinal sufficient for the person parameter. It is argued that, in the case of high-stakes testing, one should perhaps look for a coarser statistic than the sum score that is ordinal sufficient. The third is an elaboration on the positive aspects of DIF.


Acknowledgements

I was able to write this thesis during the time I was employed at Cito. I would like to thank Cito for all the facilities it offered. A number of people have made a special contribution over the past years. I would like to thank them personally here.

Gunter, I have greatly appreciated your supervision. Thanks to your creativity, approachability, availability, stubbornness, and clear style of explaining, I have learned an enormous amount over the past five years. Thank you for all your dedication and trust.

Anton, thank you for my place at POK, for your lessons in acting pragmatically and tactically, and for your encouragement to rewrite Chapter 4 completely. In the end, the chapter has become a good deal better for it.

Bas, thank you for your involvement in the broad sense. Already during my graduation internship you, as my office mate, were curious about what I was working on. And at a later stage, when the internship research developed into Chapter 3 of this thesis, your nuanced feedback on the slightly too unnuanced manuscript was a welcome addition.

Timo, thank you for the many moments at which you were available for questions, and for your feedback on various parts of the manuscript.

Han, thank you for your supervision in the final months. Partly thanks to your involvement, it proved possible to bring things to completion in a busy period.

Saskia, Matthieu, and Maarten, residents of the ‘prijzenkast’ (trophy cabinet), thank you for the many conversations, coffee breaks, silly jokes, and moral support. I cherish many fond memories of B5.46. I also thank the other colleagues at Cito, and in particular those at POK, for the good times, the pleasant atmosphere, and the constant willingness to think along.

Michiel and Matthieu, thank you for your friendship and involvement over the past years. I am glad that you are my paranymphs.

Mum and Dad, from childhood on you have responded positively to my curiosity. My student years involved some detours, but there too I was given the opportunity to find my way. I am happy with this thesis as the result. Thank you for all your encouragement.

Janet and Thijmen, I am glad that you are able to put work and science so well into perspective. Thank you for everything you give me.
