
Tilburg University

Statistical properties and practical use of classical test-score reliability methods

Oosterwijk, Pieter

Publication date: 2016

Document Version: Publisher's PDF, also known as Version of Record

Link to publication in Tilburg University Research Portal

Citation for published version (APA):

Oosterwijk, P. (2016). Statistical properties and practical use of classical test-score reliability methods. Ridderprint.


Statistical Properties and Practical Use of Classical Test-Score Reliability Methods


All rights reserved. No part of this dissertation may be reproduced, stored or transmitted in any form or by any means, without written permission of the copyright owner.

The cover illustration was created by Mira de Graaf. The LaTeX template was based on a design by Marjolein Fokkema. The acknowledgments were edited by Bep van Muilekom.

Statistical Properties and Practical Use of Classical Test-Score Reliability Methods

Doctoral dissertation

to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. E.H.L. Aarts, to be defended in public before a committee appointed by the doctorate board,

in the auditorium of the University on Friday, July 1, 2016, at 2:15 p.m.

by


Promotores: Prof. dr. K. Sijtsma and Prof. dr. L.A. van der Ark


Contents

1 Introduction
  1.1 Different Approaches to Reliability
  1.2 Statistical Framework of Classical Test Theory
  1.3 Reliability Estimation in Classical Test Theory
  1.4 Outline of the Dissertation

2 Using Confidence Intervals for Assessing Reliability of Real Tests
  2.1 Introduction
  2.2 Reliability and Estimation Methods
  2.3 Method
  2.4 Results
  2.5 Discussion

3 On the Precision of Reliability Estimates
  3.1 Introduction
  3.2 Classical Test Theory
  3.3 Method
  3.4 Results

4 Numerical Differences Between Guttman's Reliability Coefficients and the GLB
  4.1 Introduction
  4.2 Classical Test Theory
  4.3 Guttman's Reliability Coefficients and the GLB
  4.4 Method
  4.5 Results
  4.6 Discussion

5 Bias of Guttman's λ4, λ5, and λ6 Coefficients and the GLB
  5.1 Introduction
  5.2 Classical Test Theory, Reliability Methods
  5.3 Method
  5.4 Results
  5.5 Discussion

Summary

Bibliography

Samenvatting (Summary in Dutch)

1 Introduction

The assessment of psychological attributes by means of tests and questionnaires originated more than a century ago in psychology and educational measurement, but nowadays is omnipresent in a myriad of other disciplines. Multi-item attribute assessment applies not only to cognitive abilities, attitudes, and personality traits but also to attributes measured, for example, in political science (e.g., participation in political action), sociology (e.g., religiosity), medicine and health (e.g., anxiety), and nursing (e.g., pain experience). For the majority of attribute assessments, classical test theory (CTT; Lord & Novick, 1968) is the statistical framework used to analyze test and questionnaire data and ascertain the psychometric properties of measurement instruments. The fundamentals of the CTT framework were developed at the theoretical level by Spearman (1904) and at the applied level in intelligence testing by Binet and Simon (1908).

speakers) taking an intelligence test. Cronbach, Gleser, Nanda, and Rajaratnam (1972; also, Brennan, 2000) introduced a third approach to reliability, correcting for variance sources in test scores that do not contribute to reliable person ordering. For more discussion on the three approaches to reliability, see Sijtsma and Van der Ark (2015).

1.1 Different Approaches to Reliability

Central to CTT is the idea that all measurements are subject to random measurement error. This holds not only for the test score but also for the item scores on which the test score is based. The decomposition of an observable measurement value into true score and random measurement error is final; CTT does not make any assumptions about how the items constituting the test score are related to one another and to the true score, and there are no assumptions about the factorial composition of the true score. Reliability of measurement thus refers to test scores and nothing else. Assume the same test is administered twice to the same group of people under precisely the same administration conditions, including the mental condition of the examinees. The latter assumption says that examinees stay unaffected when tested once, so that the second administration takes place as if they were never tested before using the same test. Such independent test administrations are said to be parallel; the two sets of test scores for each individual only differ with respect to random measurement error. Reliability is defined as the correlation between parallel measurements obtained in a population of persons (Lord & Novick, 1968, p. 61), and can be shown to equal the proportion of true-score variance in the test scores.

than the test’s true score. In addition, one might include other factors to explain the test-score variance, such as group-specific factors that only subsets of items share but not all items. Variance that is unique to a specific item, also called item-specific variance, cannot be separated from the random error variance and is included in the unique variance that consists of item-specific variance and random error vari-ance. Usually the reliability of the test score is obtained using structural equation modelling and estimated by means of different variations of coefficient ω (McDo-nald, 1999; Revelle & Zinbarg, 2009; Sijtsma & van der Ark, 2015). Before one estimates reliability, one determines whether the hypothetical factor model fits the data. Badly fitting models produce low reliability because the common factor does not explain the data well. We assume that researchers and test users alike are in-terested in test scores and take the precise factorial composition of the item set for granted; hence, we concentrate on the CTT approach to reliability.

(13)

equal for each person’s true score.

Item response theory uses the individual’s standard error of his latent-variable value. The standard error depends on this value and thus varies across persons. This means that measurement precision varies across individuals. Specifically, mea-surement precision is greater and confidence intervals are shorter when someone’s latent-variable score matches the items’ difficulties better; then, items are statisti-cally more informative of people. If items are too extreme for an individual, ei-ther too easy (“8 + 7 = ..”) or too difficult (“e27−p19 =..”) when maximum performance is measured or trivial (e.g., “I sometimes feel uncertain”; notice that everybody feels uncertain now and then) or exotic (“I have seriously contempla-ted suicide”; notice that almost no one has such thoughts) when typical behavior is measured, people produce only extreme item scores and their measurement is imprecise. Imprecision is reflected by a large standard error and a long confidence interval for the latent-variable value. Item response theory is interesting but its study is beyond this dissertation, because most researchers still use CTT.

(14)

1.2 Statistical Framework of Classical Test Theory

In this section, some concepts of CTT that are used throughout this dissertation are introduced. Let a test contain J items indexed j (j = 1, . . . , J). Item scores are denoted X_j. The test score is defined as the sum of the item scores; X = Σ_{j=1}^{J} X_j. The true score is denoted T and the random measurement error is denoted E. Because for one person the expected value of E equals E(E) = 0, the expected value of X equals E(X) = T. The test score, X, equals the sum X = T + E, and because random measurement error correlates zero with the true score, it follows that test-score variance equals the sum of true-score variance and error variance, σ²_X = σ²_T + σ²_E. Because CTT applies to any measurement value, item score X_j also can be decomposed into X_j = T_j + E_j, with variances σ²_Xj = σ²_Tj + σ²_Ej. Furthermore, it is assumed that item error scores have zero correlation with item true scores and with the error scores of other items. Reliability is defined as the product-moment correlation between two parallel measurements X and X′, and is denoted ρ_XX′. Lord and Novick (1968, p. 61) define reliability as

  ρ_XX′ = σ²_T / σ²_X = 1 − σ²_E / σ²_X.   (1.1)

Thus, reliability also equals the ratio of true-score variance to test-score variance; hence, it equals the proportion of test-score variance that is true-score variance, also denoted ρ²_TX, which is another notation for reliability. Let Σ_X be the J × J inter-item covariance matrix, which can be written as the sum of the J × J item true-score covariance matrix Σ_T and the J × J diagonal item error-score matrix Σ_E; that is,

  Σ_X = Σ_T + Σ_E.   (1.2)

Furthermore, it may be noted that σ_TjTk = σ_XjXk; that is, the off-diagonal elements of Σ_T equal the observable inter-item covariances. For J = 3, we then have

  [ σ²_X1   σ_X1X2  σ_X1X3 ]     [ σ²_T1   σ_X1X2  σ_X1X3 ]     [ σ²_E1  0      0     ]
  [ σ_X2X1  σ²_X2   σ_X2X3 ]  =  [ σ_X2X1  σ²_T2   σ_X2X3 ]  +  [ 0      σ²_E2  0     ].   (1.3)
  [ σ_X3X1  σ_X3X2  σ²_X3  ]     [ σ_X3X1  σ_X3X2  σ²_T3  ]     [ 0      0      σ²_E3 ]

For the J = 3 example, with 1 denoting the unit vector, reliability in matrix form is equal to

  ρ_XX′ = (1′ Σ_T 1) / (1′ Σ_X 1),   (1.4)

or alternatively,

  ρ_XX′ = 1 − (1′ Σ_E 1) / (1′ Σ_X 1),   (1.5)

with Σ_T, Σ_E, and Σ_X the matrices shown in Equation 1.3.
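To make the matrix formulation concrete, the following minimal R sketch evaluates Equations 1.2, 1.4, and 1.5 for a hypothetical J = 3 example; the numerical values of Σ_T and Σ_E are invented for illustration only, since in practice only Σ_X is observable.

    # Hypothetical population matrices for J = 3 items (illustration only).
    Sigma_T <- matrix(0.30, 3, 3)             # item true-score covariances
    diag(Sigma_T) <- c(0.60, 0.70, 0.80)      # item true-score variances
    Sigma_E <- diag(c(0.40, 0.50, 0.60))      # diagonal matrix of item error variances

    Sigma_X <- Sigma_T + Sigma_E              # Equation 1.2
    one <- rep(1, 3)                          # unit vector

    rho_1 <- c(t(one) %*% Sigma_T %*% one) / c(t(one) %*% Sigma_X %*% one)       # Equation 1.4
    rho_2 <- 1 - c(t(one) %*% Sigma_E %*% one) / c(t(one) %*% Sigma_X %*% one)   # Equation 1.5
    all.equal(rho_1, rho_2)                   # both expressions yield the same reliability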

1.3 Reliability Estimation in Classical Test Theory

In Equation 1.1, true-score variance and error variance cannot be observed, so that reliability cannot be estimated directly. Hence, alternative methods have been developed to estimate reliability. Four methods are well known. Two methods use different administrations of tests and correlate the two sets of test scores in order to estimate reliability.

correlation between the two sets of test scores is an estimate of test-score reliability. Because it proves to be highly difficult to construct two truly parallel test versions, and also because it is somewhat awkward to construct two test versions when one intends to use only one test version in practice, the parallel-test method is rarely used for reliability estimation.

The second method is the test-retest method. This method uses the correlation between the test scores obtained from two administrations of the same test to the same group of people, and the correlation between the two sets of test scores estimates test-score reliability. This method suffers from several problems. For example, if the time interval between the two administrations is brief, people might remember their answers to the items administered the first time, so that the two administrations are not independent; this may artificially boost the correlation between the two administrations. If the time interval between the two administrations is long, the attribute measured may have developed further and the degree of development may vary between individuals. This might result in artificially low correlations, which again result from administrations not being independent. Evers (2009, 2010a; cited in Sijtsma, 2009) reported that this method usually produces lower reliability estimates than other methods.

The other two methods use one test administration. The split-half method correlates the total scores on two half tests, and uses the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910; also, see Lord & Novick, 1968, p. 84) to estimate the reliability of the test score on the complete test, assuming that the two test halves are parallel. The internal consistency method uses the inter-item covariance matrix Σ_X to estimate test-score reliability. The result is a lower bound to the reliability. Several methods have been proposed that use information from Σ_X, such as coefficient alpha (Cronbach, 1951; Guttman, 1945), coefficient lambda-2 (Guttman, 1945), and the greatest lower bound to the reliability (Bentler & Woodward, 1980; Jackson & Agunwamba, 1977; Ten Berge & Zegers, 1978).

Methods deviating from these four basic approaches have been proposed by Mokken (1971, pp. 142-147; also, see Sijtsma & Molenaar, 1987) in the context of nonparametric item response theory, Schulman and Haden (1975) for ordinal test scores, Gustafsson (1980) in the context of the 1-parameter logistic model, and Van der Ark, Van der Palm, and Sijtsma (2011; also, see Van der Palm, Van der Ark, & Sijtsma, 2014) using latent class analysis to approximate the structure in Σ_X as well as possible. These methods are interesting but they are beyond this dissertation.

1.4 Outline of the Dissertation

This dissertation addresses several issues concerning test-score reliability that have received little attention thus far.


approximately 20% of the initial quality assessments.

In Chapter 3, Monte Carlo simulation was used to investigate the influence of sample size, test length, number of ordered item scores, and test-score reliability on the spread of the sampling distributions of the reliability estimation methods coefficient α (Cronbach, 1951; Guttman, 1945), coefficient λ2 (Guttman, 1945), method MS (Molenaar & Sijtsma, 1988), and the greatest lower bound to the reliability (GLB; Bentler & Woodward, 1980). We found that both test length and sample size influenced the sampling distribution of all reliability estimation methods under investigation. The sampling variance of these methods was often large enough to threaten their practical usefulness for samples smaller than 500 observations. For tests with five or ten items, larger samples may be needed.

covariances.

In Chapter 5, Guttman’s (1945) coefficients λ4, λ5, and λ6, and the GLB were investigated, because unlike other reliability methods these methods tend to capita-lize on chance and thus may overestimate population reliability ρX X′. We

2 Using Confidence Intervals for Assessing Reliability of Real Tests


Abstract


2.1 Introduction

Scores on psychological tests and questionnaires are used for making high-stakes decisions about hiring applicants for a job or rejecting them, assigning or withholding a particular treatment, therapy, or training for a patient, accepting students at a school or rejecting them, enrolling students in a course or rejecting them, or passing or failing an exam. In these applications the stakes are high for the individuals and the organizations involved, and tests must satisfy several quality criteria so as to guarantee correct decisions. For example, the test score must be highly reliable and valid, and norms must be available to interpret individual test performance. Tests are also used in lower-stakes applications. Here they must also satisfy quality requirements, but not necessarily as high as those required for high-stakes decisions (Evers, Lucassen, Meijer, & Sijtsma, 2010a). For example, an inventory may be used to assess personal interests so as to help clarify the kind of follow-up education a high school student might pursue. Often, the inventory is only one of many data sources used next to, for example, school and parental advice. Another example is the use of the test score as a dependent variable in experiments (e.g., degree of anxiety) or an independent variable in linear explanatory models (e.g., as a predictor of therapy success).

test (Fan & Thompson, 2001). Rather, they incorrectly treat these estimates as if they are parameter values not liable to sampling error and conclude, for example, that a test has a reliability of .84, ignoring that a 95% CI equal to, say, (.74; .91) suggests the true reliability may be considerably higher or lower than .84. One may also consult Kelley and Cheng (2012), who argued that CIs may be more important than the sample reliability estimate, and Wilkinson and the Task Force on Statistical Inference (1999), who provided general guidelines for the use of statistics such as CIs in psychological research. Also, test assessment agencies tend to base their assessments of reliability on the estimate, thus ignoring sampling error (e.g., Evers et al., 2010a). This means that if they consider a reliability, denoted ρ, in excess of a criterion value of, say, c, to be "good", they make that decision if sample reliability r ≥ c, without statistically testing (e.g., based on CIs) whether r is significantly greater than c. In particular when sample sizes are small, ignoring the sampling evidence can readily lead to quality assessments that are too optimistic, thus providing test practitioners and their clients and patients with measurement instruments that might promise better psychometric quality than is realistic. This is a situation one would like to avoid; it is better to provide more realistic information sensitive to sampling uncertainty.


base for which we had to change the quality assessment when CIs were considered instead of point estimates.

Two remarks are in order. First, we chose to investigate the reliability rather than the standard error of measurement, although one might argue that the standard error of measurement rather than the reliability should be used for assessing the quality of decisions about individuals on the basis of test scores (e.g., Mellenbergh, 1996). Because the standard error of measurement is based directly on reliability and test-score variance, the choice for either one is arbitrary. Our choice was motivated by the fact that researchers routinely report reliability (e.g., AERA, APA, & NCME, 2014; Wilkinson and the Task Force on Statistical Inference, 1999), and test agencies assess reliability prior to the standard error of measurement, emphasizing reliability's pivotal position in measurement assessment. Second, we investigated the reliability of tests in a single data base. Based on a sample estimate of the test-score reliability, the COTAN data base classifies the tests' reliability as either insufficient, sufficient, or good. Although this classification is arbitrary to some extent and other data bases may employ different classifications, any classification suffices to make our point that taking uncertainty due to sample size into account makes a difference when assessing test quality.

discussed different ranges implying different interpretations for reliability. Evers et al. (2010a), on behalf of COTAN, recommended values in excess of .8 at the very least for reliable measurement of individuals if important decisions are to be made. Evers et al. (2013, pp. 43-52), on behalf of the European Federation of Psychologists' Associations (EFPA), provided four categories for reliability, the highest of which was labelled "Excellent" for high-stakes testing (r ≥ .9) and the next "Good" (.8 ≤ r < .9). Christensen (1997, pp. 217-219) discussed recommendations for reliability for dependent variables in experiments.

The need to study whether test constructors report CIs for test-score reliability and the influence of ignoring CIs for reliability on reliability classification was further underlined by a preliminary literature search we did, which revealed that CIs for reliability were reported not even once. This absence is all the more remarkable because methods for estimating standard errors and CIs for reliability, in particular for coefficient alpha (e.g., Cronbach, 1951; Guttman, 1945), have long been available (e.g., Feldt, 1965; Feldt, Woodruff, & Salih, 1987; Hakstian & Whalen, 1976; Kristof, 1963), and in addition several authorities have urged researchers to report CIs (e.g., AERA, APA, & NCME, 2014), but apparently so far this has had little success. Researchers, test constructors, and test users do report sample reliability as a quality index for test scores but without CIs or standard errors of measurement, and amply use the rules of thumb that textbook authors like Nunnally (1978) and assessment agencies like COTAN and EFPA provide.

use, both in psychology and education. COTAN is an active test assessment agency of great reputation that has assessed the quality of tests and questionnaires since 1959; also see Evers, Sijtsma, Meijer, and Lucassen (2010b) and Sijtsma (2012) for more information. Dutch government agencies and insurance companies require COTAN's approval of tests as a necessary condition for accepting requests for particular benefits and payments, respectively. We considered tests assessing various psychological attributes, so as to have sufficient coverage of the testing field. We also notice that the vast popularity of testing in the Netherlands guarantees some degree of generality of the results, although we realize that a sample of tests from a larger geographic region would have been more representative. Such a huge endeavor was considered beyond the scope of this study.


2.2 Reliability and Estimation Methods

Reliability can be estimated by means of methods from three different theoretical approaches, which are classical test theory (e.g., Lord & Novick, 1968; Novick, 1966; Novick & Lewis, 1967), factor analysis (e.g., Bollen, 1989; also, see Dunn, Baguley, & Brunsden, 2013; Graham, 2006), and generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; also, see Brennan, 2001a, 2001b; Shavelson & Webb, 1991). Classical test theory assumes that test scores suffer from random measurement error, factor analysis distinguishes random measurement error from systematic measurement error, and generalizability theory identifies the sources of variance in the test score that are immaterial to person ordering and excludes them from reliability, thus pinpointing the universe of generalization (Sijtsma & Van der Ark, 2015). In the context of this study, two remarks are in order.


Classical Test Theory and Definition of Reliability

Assume that a psychological test consists of J items indexed by j (j = 1, . . . , J). Let variable X_j denote the score on item j. The test score is the sum of the item scores X_j, defined as X+ = Σ_{j=1}^{J} X_j, with group variance σ²_X+. Classical test theory assumes that X+ is the sum of an unobservable true score T and an unobservable random measurement error E. For person i, the true score T_i is defined across hypothetical, independent replications of the test administration procedure and equals the expectation of the test score for person i; that is, T_i = E(X+i). In each replication, the measurement error equals the deviation score defined as E_i = X+i − T_i. Hence, for person i the model is written as X+i = T_i + E_i. The true-score variance in a group is denoted σ²_T, and the measurement-error variance in a group is denoted σ²_E. Because random measurement error E is assumed to be uncorrelated with true score T, the variance of the test score can be decomposed as σ²_X+ = σ²_T + σ²_E. For two tests with test scores X+ and X+′ to be called parallel, two conditions must hold. First, for each person i the true scores must be equal: T_i = T_i′. Second, the variances of the test scores in the group must be equal: σ²_X+ = σ²_X+′. The reliability of the test score is defined as the product-moment correlation of X and X′, denoted ρ_XX′. Lord and Novick (1968, p. 61) showed that ρ_XX′ equals the proportion of test-score variance that is true-score variance, and this proportion is the same for both parallel tests; that is,

  ρ_XX′ = σ²_T / σ²_X+ = σ²_T′ / σ²_X+′.   (2.1)

Reliability ranges from 0 (if σ²_T = 0) to 1 (if σ²_T = σ²_X+, meaning zero error

Kamphuis & Kleintjes, 2010), .97−1.00 (Schlichting & Lutje Spelberg, 2007), and .79−.97 (Bekebrede et al., 2010). Reliability estimation methods using the mean inter-item correlation in principle may provide negative values, but such values are extremely rare and never reported, as are values lower than, say, .6.

One may notice that reliability in Equation 2.1 cannot be computed in practice, because (1) parallel test scores X+ and X+′ are rarely available, so that ρ_XX′ cannot be estimated, and (2) both true-score variance σ²_T and true-score variance σ²_T′ are unobservable, so that neither of the two ratios can be estimated. In practical test research, usually one has data available from one test and one test administration, and several methods have been proposed to estimate reliability in this situation (e.g., Bentler & Woodward, 1980; Cronbach, 1951; Guttman, 1945; Lord & Novick, 1968; Ten Berge & Zegers, 1978; Zinbarg, Revelle, Yovel, & Li, 2005).

Two Reliability Methods


Split-Half Method

Sometimes, researchers construct two test versions intended to be parallel (e.g., Bleichrodt, Resing, & Zaal, 1993) and use the correlation between the two test scores to estimate Equation 2.1, but usually only one version is constructed, so that this parallel-tests method of estimating reliability cannot be used. Instead, some test constructors mimic the method by splitting their one test version into test halves that are as similar as possible, thus meant to be parallel. Similarity may be realized in the phase of test construction by including pairs of similar items in the test, where different items from the same pair belong to different test halves, or the items in an existing test may be distributed across two halves to obtain maximum similarity. Let us assume that two similar test halves are available and fixed; then, two formal situations may be distinguished.

First, assume the test halves are parallel. Then, the product-moment correlation between the half-test scores Y1 and Y2, denoted ρ_Y1Y2, by definition equals the reliability of the test score on a half test, ρ_YY′; that is, ρ_Y1Y2 = ρ_YY′. Then, the reliability of the test score on the complete test, ρ_XX′, can be computed by means of the Spearman-Brown prophecy formula (Lord & Novick, 1968, p. 84) adapted to doubling test length,

  ρ_XX′ = 2ρ_YY′ / (1 + ρ_YY′).   (2.2)

Second, in practice test halves are never parallel, so that Equation 2.2 produces an invalid result; that is, ρ_Y1Y2 ≠ ρ_YY′, and inserting ρ_Y1Y2 does not produce reliability ρ_XX′ but a value that one may denote SH, for which SH ≠ ρ_XX′.

Methods to compute a CI for SH are available (Charter, 2000; Fan & Thompson, 2001). Let r_Y1Y2 denote the sample correlation between the two test scores on the test halves. The method takes the sampling distribution of the product-moment correlation into account. First, the estimate of the correlation between the two test halves (r_Y1Y2) is obtained. Second, r_Y1Y2 is transformed using the Fisher-Z transformation (Fisher, 1915),

  Z = .5 ln[(1 + r_Y1Y2) / (1 − r_Y1Y2)].   (2.3)

Z is approximately normally distributed with mean equal to .5 ln[(1 + ρ_Y1Y2) / (1 − ρ_Y1Y2)] and standard error approximately equal to √(1/(N − 3)) (e.g., Hays, 1994, p. 649). Third, let α denote the nominal Type I error rate. Let ζ be the parameter corresponding to Z, let Z_α/2 be the lower bound, and let Z_1−α/2 be the upper bound of a (1 − α) × 100% CI for ζ. For a 95% CI, the lower bound equals Z_α/2 = Z − 1.96/√(N − 3) and the upper bound equals Z_1−α/2 = Z + 1.96/√(N − 3), so that the 95% CI equals (Z_α/2; Z_1−α/2) or, equivalently,

  (Z − 1.96/√(N − 3); Z + 1.96/√(N − 3)).   (2.4)

Fourth, the bounds of the CI can be transformed into bounds on the r_Y1Y2 scale using the inverse of Equation 2.3,

  r_Y1Y2 = (e^{2Z} − 1) / (e^{2Z} + 1).   (2.5)

Finally, after having obtained the bounds of a CI for ρ_Y1Y2, Equation 2.2 is used to transform the bounds into bounds on the SH scale. The resulting CI is asymmetrical.
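As an illustration, the steps in Equations 2.2-2.5 can be chained in a few lines of R; the function name and the example numbers below are ours and are not part of the original study.

    # Sketch: CI for the split-half value SH via the Fisher-Z transformation.
    splithalf_ci <- function(r_halves, N, conf = 0.95) {
      z  <- 0.5 * log((1 + r_halves) / (1 - r_halves))               # Equation 2.3
      zc <- qnorm(1 - (1 - conf) / 2)                                # 1.96 for a 95% CI
      z_bounds <- z + c(-1, 1) * zc / sqrt(N - 3)                    # Equation 2.4
      r_bounds <- (exp(2 * z_bounds) - 1) / (exp(2 * z_bounds) + 1)  # Equation 2.5
      2 * r_bounds / (1 + r_bounds)                                  # Spearman-Brown step, Equation 2.2
    }
    splithalf_ci(r_halves = .75, N = 150)   # returns an asymmetrical CI on the SH scale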

Coefficient Alpha

Let the covariance between items j and k be denoted σ_jk. Coefficient alpha is defined as

  alpha = [J / (J − 1)] × (ΣΣ_{j≠k} σ_jk) / σ²_X+.   (2.6)

Alpha is a lower bound to the reliability; alpha ≤ ρ_XX′ (Novick & Lewis, 1967). Several authors have derived standard errors for the sample estimate alpha-hat and CIs for alpha (e.g., Feldt, 1965; Kuijpers, Van der Ark, & Croon, 2013; Maydeu-Olivares, Coffman, & Hartmann, 2007; Van Zyl, Neudecker, & Nel, 2000). Using each of the standard errors these authors proposed to estimate CIs for alpha produces symmetrical intervals, whereas alpha is bounded from above by the value 1.

In this study, we used Feldt's method (Feldt, 1965) to obtain CIs for alpha. Feldt's method is convenient because it uses only alpha-hat, test length J, and sample size N.

A bootstrap approach to develop a CI requires the complete data matrix; also, see Kelley and Cheng (2012). We found that for real tests, counts of item-score patterns, complete inter-item covariances, or raw data are hardly ever made available to other researchers.

To compute the 95% CI for alpha, let the nominal Type I error rate be 0.05, and let F_a and F_b be critical values of an F distribution with N − 1 and (N − 1)(J − 1) degrees of freedom, such that P(F < F_a) = .025 and P(F < F_b) = .975. For example, using Hays (1994, pp. 1016-1022) for N = 100 and J = 10, one finds that F_a ≈ 0.7315 and F_b ≈ 1.3198. Feldt (1965) showed that the 95% CI for alpha is estimated by

  (1 − (1 − alpha-hat)/F_a ; 1 − (1 − alpha-hat)/F_b).   (2.7)
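For completeness, here is a minimal R version of Feldt's interval in Equation 2.7; qf() supplies the F-distribution quantiles, so the critical values mentioned above for N = 100 and J = 10 are approximately reproduced.

    # Sketch: Feldt's (1965) CI for coefficient alpha (Equation 2.7).
    feldt_ci <- function(alpha_hat, N, J, conf = 0.95) {
      p  <- (1 - conf) / 2
      Fa <- qf(p,     N - 1, (N - 1) * (J - 1))   # lower critical value, P(F < Fa) = .025
      Fb <- qf(1 - p, N - 1, (N - 1) * (J - 1))   # upper critical value, P(F < Fb) = .975
      c(lower = 1 - (1 - alpha_hat) / Fa, upper = 1 - (1 - alpha_hat) / Fb)
    }
    feldt_ci(alpha_hat = .84, N = 100, J = 10)    # only alpha-hat, J, and N are required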

Two remarks are in order. First, coefficient alpha has been heavily criticized (e.g., Cortina, 1993; Cronbach, 2004; Schmitt, 1996; Sijtsma, 2009) but continues to be the reliability method most regularly used in the face of many excellent alternative methods. Second, for any split of the test into two parts that together contain all J items but need not be of equal size, one can compute alpha for the two test parts; that is, for J = 2. Let us denote this alpha value by alpha(2) and alpha based on the J separate items by alpha(J). The mean of alpha(2) across all possible splits equals alpha(J) (Jackson & Agunwamba, 1977); that is, E[alpha(2)] = alpha(J). Lord

2.3 Method

We first describe the set of tests we used; then, the procedure for collecting the tests we re-assessed, and the composition of the subset of collected tests; and finally, the method of data analysis.

Test Population and Test Sample

COTAN assesses tests they know are used in Dutch and Belgian practice for testing individuals to obtain a diagnosis, give advice, or make a decision, and in addition COTAN assesses tests used in scientific research. Several tests are inaccessible to COTAN, because they are developed in institutions or by researchers and lead a "hidden life". These tests are never published and are used only within a closed circle of test practitioners or researchers. Because they are not exposed to external scrutiny by experts, feedback provided by a panel of peers such as COTAN's reviewers or reviewers for journals is absent.

Between January 1997 and January 2014, COTAN assessed the reliability of 520 tests, as well as several other criteria not relevant here, such as validity and norms. Raters commissioned by COTAN assessed the tests to have insufficient (138), sufficient (217), or good (165) reliability. They used standards given by an assessment system created by COTAN, of which Evers et al. (2010a) provided the most recent update. The tests were 309 person–situation tests, 153 person tests, and 18 situation tests. Forty tests may be placed in more than one category.

Collecting Tests and Composition of Test Subset

We distinguish test batteries and single tests. Test batteries, such as intelligence test batteries, consist of several subtests. Separate test scores are provided for each of the subtests and for the whole battery based on the separate subtest scores. True single tests exist independent of test batteries and measure a single attribute. Single tests are the collection of subtests and true single tests.

Test publishers provide COTAN with a copy of the test and all corresponding materials including the manual, only for the purpose of COTAN providing and publishing an assessment, but expressly not with the purpose of providing background information to third parties other than what is published. Hence, COTAN was unable to give us access to the more detailed background information COTAN uses for their assessments. Being unable to obtain the more detailed information, we retrieved information from two sources, which were the COTAN online data base (Egberink et al., 2009-2014) and test manuals available from the Pierson Revesz Library of the University of Amsterdam and the Library of Tilburg University.

Figure 2.1: Number of items in 1020 tests. Reliability estimates of four long subtests (J > 200) were omitted to obtain a readable histogram.

Table 2.1: Tests Included in the Analysis and Tests Excluded from the Analyses.

                      Included      Excluded      Total
Qualification
  Insufficient        18 (17.3%)    120 (83.7%)   138 (100%)
  Sufficient          57 (26.3%)    160 (73.7%)   217 (100%)
  Good                41 (24.8%)    124 (75.2%)   165 (100%)
  Total               116 (23.1%)   404 (76.9%)   520 (100%)
Test type
  Person-Situation    55 (17.8%)    254 (88.2%)   309 (100%)
  Person              34 (22.2%)    119 (77.8%)   153 (100%)
  Situation           12 (66.7%)    6 (33.3%)     18 (100%)
  Two types           15 (37.5%)    25 (62.5%)    40 (100%)
  Total               116 (22.3%)   404 (77.7%)   520 (100%)

Note: Included = tests for which at least one CI could be calculated; Excluded = tests for which the number of items, sample size, or reliability estimate could not be retrieved.

Table 2.1 shows for how many of the 520 tests COTAN assessed the number of items (J), sample size (N), and reliability estimate (r) were available. Tests for which all results relevant for the estimation of CIs were available were fewer than tests for which at least one relevant quantitative result was missing. For example, frequency "41" (Table 2.1; 3rd row, 1st column) should be interpreted as "For 41 tests that COTAN assessed as having 'Good' reliability, we could retrieve all the relevant results". These 41 tests entail both test batteries, which are counted once, also when they were assessed for different groups, and single tests. In total, 116 (22.31%) tests could be included in the analysis.

The 116 tests produced 1024 reliability estimates, 105 of which pertain to total scores on a test battery and 919 to single tests. Three (0.3%) reliability estimates were based on more than 200 items, but most estimates (74.71%) were based on at most 20 items (Figure 2.1). The largest sample size reported was 12,522 (Figure 2.2) (Helleman, Olivier, Schoonbrood, & Van Zessen, 2009). For 53 tests, sample size ranged from 6,294 to 12,522. These 53 sample sizes were omitted from Figure 2.2 to render the histogram better interpretable. More than half of the reliabilities were estimated from sample sizes smaller than 1,000. The right-hand panel in Figure 2.2 provides more detailed information about sample sizes between 0 and 1,000. Most (94.73%) reliability estimates varied between 0.6 and 0.95 (Figure 2.3). The left-hand panel shows the distribution of split-half reliability and the right-hand panel shows the distribution of coefficient alpha.

Figure 2.2: Sample sizes in 1024 tests. 53 tests based on large samples (N > 6,000) were omitted to obtain a readable histogram.

Figure 2.3: Split-half reliability (87 estimates) and coefficient alpha (937 estimates).

Table 2.2: Descriptive Statistics for Reliability Estimates, Separately for Test Batteries and Single Tests.

                      Test battery                        Single test
                      M     J      N        r_XX′         M     J      N        r_XX′
Qualification
  Insufficient        16    25.56  530.06   .71           142   11.94  844.28   .68
  Sufficient          34    34.65  550.50   .83           309   12.63  1611.60  .79
  Good                55    63.16  863.29   .90           468   19.10  1338.39  .88
Type of test
  Person-Situation    52    57.63  722.75   .85           483   12.68  1910.52  .80
  Person              41    41.56  627.10   .84           254   18.47  844.16   .83
  Situation           9     31.44  1103.11  .86           118   16.51  628.47   .82
  Multiple types      3     25.67  485.67   .89           64    23.91  513.83   .88
  Total               105   48.20  711.23   .85           919   15.71  1353.91  .82

Note: M = number of reliability estimates; J = mean number of items used for computing coefficient alpha; N = mean sample size used for computing the reliability estimate; r_XX′ = mean reported reliability estimate.

and 468 reliability values (3rd row, 5th column) based on single tests. The other results reported in Table 2.2 are mean values. Another example pertains to the 34 person tests in Table 2.1 (6th row, 1st column) that are split into 41 test battery reliability values (Table 2.2; 5th row, 1st column), also counting separate group results, and 254 single test reliability values (Table 2.2; 5th row, 5th column).

Reliability Assessment Rules

COTAN Rules

exclusive and exhaustive reliability intervals labelled "Insufficient" (I), "Sufficient" (S), and "Good" (G). Let r denote a reported, estimated reliability. Let c_IS denote the reliability value that separates "Insufficient" from "Sufficient", and c_SG the reliability value that separates "Sufficient" from "Good". Hence, the three regions are defined by (0; c_IS] for "Insufficient"; (c_IS; c_SG] for "Sufficient"; and (c_SG; 1) for "Good". COTAN assessments are formalized as follows: if r ∈ (0; c_IS], then assign "Insufficient"; if r ∈ (c_IS; c_SG], then assign "Sufficient"; and if r ∈ (c_SG; 1), then assign "Good".

Table 2.3: Intervals for Insufficient, Sufficient, and Good Reliability, for Three Test Applications.

Application                                      Insufficient   Sufficient   Good
Individual measurement (important decisions)     (0; .8]        (.8; .9]     (.9; 1)
Individual measurement                            (0; .7]        (.7; .8]     (.8; 1)
Group-level measurement                           (0; .6]        (.6; .7]     (.7; 1)

values of c_IS and c_SG are smaller. Table 2.3 shows the qualifications COTAN uses for the three different applications of test scores.

Confidence Intervals

Figure 2.4 presents a numerical example for c_IS = .7 and c_SG = .8 (for the less important decisions), and a test for which r = .82 is reported. Following its decision rules, COTAN decides: .82 ∈ (.8; 1); hence, assign "Good". It may be noted that for reliability assessment COTAN does not use statistical information contained in CIs for ρ_XX′.

Figure 2.4: Example of qualification of reliability and CI for reliability.

We took CIs into account in the following way. In Figure 2.4, assume that the CI has been estimated to be (.74; .86); then, we conclude: c_SG ∈ (.74; .86); hence, r is not significantly larger than c_SG and "Good" is ruled out, but "Insufficient" and "Sufficient" are open. Next, c_IS ∉ (.74; .86); hence, r is significantly larger than c_IS and "Sufficient" is assigned for this reliability value ("Insufficient" is ruled out). We considered 90% and 95% CIs, implying nominal one-sided Type I errors of 0.05 and 0.025, respectively, for the test that a reliability value is significantly greater than a lower threshold value.

Let (L; U) denote the estimated CI for the reliability. The decision rule we used that takes CIs into account is: (1) If r < c_IS, then assign "Insufficient"; (2) If c_IS ≤ r < c_SG, then establish whether c_IS ∈ (L; U); if so, then assign "Insufficient", else assign "Sufficient"; (3) If r ≥ c_SG, then establish whether c_IS ∈ (L; U); if so, then assign "Insufficient"; else establish whether c_SG ∈ (L; U); if so, then assign "Sufficient", else assign "Good".
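A minimal R sketch of this decision rule, assuming the CI bounds L and U have already been obtained (for example, with the Feldt or Fisher-Z procedures above); the function name is ours.

    # Sketch: CI-based quality assignment; r is the sample reliability, ci = c(L, U) its CI,
    # and c_is and c_sg are the cut-scores separating the three categories.
    assign_quality <- function(r, ci, c_is, c_sg) {
      in_ci <- function(cut) ci[1] < cut && cut < ci[2]
      if (r < c_is) return("Insufficient")
      if (r < c_sg) return(if (in_ci(c_is)) "Insufficient" else "Sufficient")
      if (in_ci(c_is)) return("Insufficient")
      if (in_ci(c_sg)) return("Sufficient")
      "Good"
    }
    assign_quality(r = .82, ci = c(.74, .86), c_is = .7, c_sg = .8)   # Figure 2.4 example: downgraded to "Sufficient"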

One may notice that the decision rule that takes CIs into account cannot upgrade a reliability value to a higher category, because it tests whether a sample reliability value is significantly larger than a cut-score; if yes, the original COTAN assignment is maintained, else it is downgraded. One could object that c_ab ∈ (L; U) does not rule out the possibility that ρ ≥ c_ab, and hence the higher of the two categories may just as well be assigned. A statistical argument that supports this line of reasoning is that P(ρ ≥ c_ab | r ≥ c_ab) > P(ρ < c_ab | r ≥ c_ab); that is, given that, for example, r = .82, one has more evidence that ρ > c_SG than that ρ ≤ c_SG, even if the evidence, based on one possibly small sample, is thin. We chose our somewhat conservative procedure to protect the test practitioner and his clients and patients from tests that provide less quality than the assessment promises.

2.4 Results

assessed to have "Good" reliability. For test type, Table 2.2 shows for test batteries that person–situation tests on average are the longest and situation tests are based on the largest samples. Mean reliability is almost equal between test types, but mean CI width is lowest for person–situation tests (Table 2.4).

Table 2.4: CI Width for Reliability Estimates, Separately for Test Batteries and Single Tests.

                      Test battery                         Single test
                      M     CI95%  Pr    CI90%  Pr          M     CI95%  Pr    CI90%  Pr
Qualification
  Insufficient        16    .10    NA    .08    NA          142   .09    NA    .07    NA
  Sufficient          34    .05    .68   .04    .74         309   .06    .65   .04    .70
  Good                55    .03    .82   .03    .82         468   .04    .75   .03    .78
Type of test
  Person-Situation    52    .04    .83   .03    .83         483   .07    .65   .04    .69
  Person              41    .06    .51   .05    .59         254   .05    .73   .04    .76
  Situation           9     .05    .89   .04    .89         118   .05    .68   .04    .73
  Multiple types      3     .04    1.0   .03    1.0         64    .04    .80   .03    .83
  Total               105   .05    .71   .04    .74         919   .06    .68   .04    .72

Note: M = number of reliability estimates; CI = mean CI width for reliability; Pr = proportion of reliability estimates for which the CI lower bound exceeds the minimally required reliability; NA = not available (the "Insufficient" assessment category does not have a lower boundary).

Table 2.4 shows the proportion of reliability estimates for which the lower bound of the 90% CIs and 95% CIs exceeds the minimally required reliability for the "Sufficient" and "Good" assessment categories and the different test types; that is, c_IS and c_SG. Across test batteries and single tests, in the "Good" category CI lower

CI lower bound exceeded c thresholds more often than for person tests. For single tests, differences were small and difficult to interpret. Person–situation tests were often based on larger samples than person tests and situation tests (Table 2.2) and in individual cases their CI lower bound exceeded the category lower bound c, but at the aggregate level this was obscured by person–situation tests (mostly educational tests) based on extremely large sample sizes.

Table 2.5: Turnover Results for Quality Assessment for Test Batteries.

                          Without CIs
                          I     S     G     Total
Using 90% CIs    I        16    9     0     25
                 S        0     25    10    35
                 G        0     0     45    45
Using 95% CIs    I        16    11    0     27
                 S        0     23    10    33
                 G        0     0     45    45
Total                     16    34    55    105

Note: Without CIs = qualification of the reliability estimates using COTAN standards; Using 90% CIs = qualification of the reliability estimates using 90% CIs; Using 95% CIs = qualification of the reliability estimates using 95% CIs; I = insufficient, S = sufficient, G = good.

Of the 34 reliability estimates that were initially classified as "Sufficient", 26% (90% CIs) and 32% (95% CIs) were classified as "Insufficient". Of the 55 reliability estimates initially classified as "Good", 18% (90% CIs) and 18% (95% CIs) were re-classified as "Sufficient", and in both cases none were re-classified as "Insufficient".

For single tests (Table 2.6), the results are the following. In total, out of 919 tests, 78.45% (90% CIs) and 75.41% (95% CIs) were not re-classified; of the 309 reliability estimates originally classified as "Sufficient", 30.4% (90% CIs) and 35.0% (95% CIs) were re-classified as "Insufficient"; and of the 468 reliability estimates originally classified as "Good", 21.6% (90% CIs) and 24.1% (95% CIs) were re-classified as "Sufficient", and 0.6% (90% CIs) and 1.1% (95% CIs) were re-classified as "Insufficient".

Table 2.6: Turnover Results for Quality Assessment for Single Tests.

                          Without CIs
                          I      S      G      Total
Using 90% CIs    I        142    94     3      239
                 S        0      215    101    316
                 G        0      0      364    364
Using 95% CIs    I        142    108    5      255
                 S        0      201    113    314
                 G        0      0      350    350
Total                     142    309    468    919

Note: Without CIs = qualification of the reliability estimates using COTAN standards; Using 90% CIs = qualification of the reliability estimates using 90% CIs; Using 95% CIs = qualification of the reliability estimates using 95% CIs; I = insufficient, S = sufficient, G = good.
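These re-classification percentages follow directly from the turnover counts in Tables 2.5 and 2.6; for example, in R:

    # Test batteries (Table 2.5), 90% CIs:
    round(100 * c(sufficient_to_insufficient = 9 / 34, good_to_sufficient = 10 / 55), 1)   # 26.5 and 18.2
    # Single tests (Table 2.6), 90% CIs:
    round(100 * c(not_reclassified = (142 + 215 + 364) / 919, good_to_insufficient = 3 / 468), 1)   # 78.5 and 0.6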

2.5 Discussion

from "Good" to "Insufficient" only happened with single tests and was rare. These results demonstrate that interpreting sample reliability values without taking CIs for population values into consideration may produce conclusions that are too optimistic. We hope this study is a wake-up call for anyone involved in test construction and test assessment not to treat sample results as parameters and to assess reliability using CIs.

For coefficient alpha, we used Feldt’s method because the method requires less statistical information about the test under consideration than alternative, possibly superior methods (e.g., Maydeu-Olivares et al., 2007). In our study, it was difficult to even collect the limited information for tests that we needed for the Feldt me-thod, and so using more demanding methods was out of the question. However, we recommend that researchers and test constructors not only start using CIs but do this by means of statistically better equipped methods. They might also consider using reliability methods other than coefficient alpha, for example, greater lower bounds as suggested by Guttman (1945) and Bentler and Woodward (1980); fac-tor analysis related methods such as coefficient omega (e.g., Zinbarg et al., 2005); and generalizability coefficients (Brennan, 2001a; Shavelson & Webb, 1991) if the data structure permits. Kelley and Cheng (2012) suggested a bootstrap method for obtaining CIs for methods other than split-half reliability and coefficient alpha.

3 On the Precision of Reliability Estimates

Abstract


3.1 Introduction

Test-score reliability is one of the most frequently reported measures for assessing measurement quality. The estimation of reliability requires that one has access to two test scores resulting from two independent administrations of the same test to the same sample of persons, but in real life usually only one test score is available for each person. This practical circumstance has necessitated the development of reliability estimation methods that use only one data set, collected in one test administration, consisting of the scores on the items in the test. Examples are coefficient alpha (Cronbach, 1951; Guttman, 1945, who denoted alpha lambda-3), coefficient lambda-2 (Guttman, 1945), the greatest lower bound to the reliability (GLB; Bentler & Woodward, 1980; Ten Berge, Snijders, & Zegers, 1981), and the Molenaar-Sijtsma method (method MS; Molenaar & Sijtsma, 1988; Sijtsma & Molenaar, 1987; Van der Ark, 2010; also, see Mokken, 1971, pp. 142-147). The amount of psychometric research done with respect to reliability estimation is impressive, and thus it is all the more remarkable that only little attention has been paid to the precision of the resulting estimates. This article studies the precision of estimates of coefficient alpha, coefficient lambda-2, GLB, and method MS.


which causes reliability to be lower than 1. For real tests, values between .80 and .95 are frequently considered acceptable for making decisions about individuals on the basis of the test score (e.g., Nunnally, 1978, p. 246).

In real-life test construction, administering the same test twice obviously would run into the problem that memory of the first administration would render the two test scores dependent, probably inflating their correlation. Hence, for each tested person the test constructor collects one test score, and this has necessitated the development of reliability estimation methods that take advantage of the data collected in one test administration so as to approximate the correlation between two independent test scores; coefficients alpha and lambda-2, the GLB, and method MS are useful examples. Coefficient alpha is by far the most frequently used reliability estimation method. Coefficient alpha is a lower bound to the reliability and may grossly underestimate the reliability, thus producing considerable negative estimation bias (e.g., Cortina, 1993; Schmitt, 1996; Sijtsma, 2009). In this article we investigated the bias of statistics of reliability estimation methods with respect to reliability. Bias is defined as the difference between the mean of the sample values of a reliability estimation method and ρ_XX′; for example, bias(α) = mean(α̂) − ρ_XX′.

obtain a certain width for Van Zyl's confidence interval for coefficient alpha. Remarkably, statistical results on estimation precision have been ignored almost completely in practical test research. Moreover, many test constructors and practitioners seem to consider sample values of test-score reliability as parameter values (Sijtsma, 2012), and thus seem inclined to ignore the influence of sampling fluctuation.

Except for some analytical results for coefficient alpha, for the other methods estimation imprecision seems to be unknown. Given the practical importance of the topic, the purpose of this study was to determine the precision of estimates of alpha, lambda-2, GLB, and method MS. We used two simulation studies to investigate the spread of the methods’ sampling distribution for three factors that are important in practical testing: sample size, number of items, and number of answer categories. In Study 1, in realistic conditions we investigated the effects of the three factors on the precision of the estimation methods. Study 2 was a follow-up study, in which we further investigated estimation precision for the most effective factor of Study 1, which was sample size, and we added population reliability as a second factor.

The paper is organized as follows. First, we discuss coefficients alpha and lambda-2, the GLB, and method MS. Second, we discuss the factors we used in the two simulation studies that assessed the sampling fluctuation of the four reliability estimation methods. Third, we discuss the results of the simulation studies and provide recommendations for researchers.

3.2 Classical Test Theory

number of m + 1 item scores, but results can readily be generalized to unequal numbers. If m = 1, the items are dichotomously scored, and if m > 1, the items are polytomously scored. The test score is defined as the sum of the item scores; that is, X = Σ_{j=1}^{J} X_j. In the population, the test score has a distribution with mean µ_X and variance σ²_X. Following Lord and Novick (1968, p. 34), a test score is the sum of the unobservable true score, denoted T, and an unobservable random measurement error, denoted E; that is, X = T + E. The true score is the person's expected test score across hypothetical test repetitions, and the measurement error is the difference between the observed test score and the true score in one administration; that is, E = X − T. In a group, the true score has a distribution with mean µ_T = µ_X and variance σ²_T. Assuming zero correlation between T and E, it follows that σ²_X = σ²_T + σ²_E. Two tests with test scores X and X′ are parallel if (1) for each person v, T_v = T_v′, and (2) in the group, σ²_X = σ²_X′. Test-score reliability is defined as the product-moment correlation between X and X′ or, equivalently, as

  ρ_XX′ = σ²_T / σ²_X = 1 − σ²_E / σ²_X.   (3.1)


Coefficient Alpha

Let σ_ij denote the covariance between items i and j. We use shorthand coefficient α for coefficient alpha. Then, coefficient α is defined as

  α = [J / (J − 1)] × (ΣΣ_{i≠j} σ_ij) / σ²_X.   (3.2)

Coefficient α is a lower bound to the reliability; that is, α ≤ ρ_XX′ (Guttman, 1945; Novick & Lewis, 1967). For sample estimate α̂, various authors have derived analytical standard errors (e.g., Feldt, 1965; Kuijpers et al., 2013; Maydeu-Olivares et al., 2007; Van Zyl et al., 2000).

Coefficient Lambda-2

Coefficient lambda-2, henceforth denoted coefficient λ2, is a greater lower bound to the reliability than coefficient α; that is, α ≤ λ2 ≤ ρ_XX′. Using the squared covariances σ²_ij, λ2 can be written as

  λ2 = [ ΣΣ_{i≠j} σ_ij + √( (J / (J − 1)) ΣΣ_{i≠j} σ²_ij ) ] / σ²_X.   (3.3)

As far as we know, a standard error for λ2 has not been derived analytically. Hence, to estimate confidence intervals for λ2, one has to rely on resampling methods, such as the nonparametric bootstrap (e.g., Efron & Tibshirani, 1993).
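For concreteness, coefficient α (Equation 3.2) and coefficient λ2 (Equation 3.3) can be computed from a sample inter-item covariance matrix S as sketched below; the function names are ours, and the bootstrap CI for λ2 assumes that the raw N × J item-score matrix (here called dat) is available.

    # Sketch: alpha (Equation 3.2) and lambda-2 (Equation 3.3) from a covariance matrix S.
    coefficient_alpha <- function(S) {
      J <- nrow(S); var_X <- sum(S)                    # sum of all entries of S equals sigma^2_X
      (J / (J - 1)) * (var_X - sum(diag(S))) / var_X   # off-diagonal sum divided by test-score variance
    }
    lambda2 <- function(S) {
      J <- nrow(S); var_X <- sum(S)
      off <- S; diag(off) <- 0                         # keep only the covariances sigma_ij with i != j
      (sum(off) + sqrt((J / (J - 1)) * sum(off^2))) / var_X
    }
    # Nonparametric bootstrap CI for lambda-2 (dat is an assumed N x J item-score matrix).
    boot_l2 <- replicate(1000, lambda2(cov(dat[sample(nrow(dat), replace = TRUE), ])))
    quantile(boot_l2, c(.025, .975))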

Greatest Lower Bound

Let Σ_X be the observable J × J inter-item covariance matrix, let Σ_T be the unobservable J × J covariance matrix of the true item scores, and let Σ_E be an unobservable J × J diagonal matrix containing the error variances of the items (σ²_E1, . . . , σ²_EJ) on the diagonal entries and zeroes, reflecting zero-correlating measurement errors, on the off-diagonal entries. Matrices Σ_T and Σ_E are positive semidefinite. Inter-item covariance matrix Σ_X can be decomposed as

  Σ_X = Σ_T + Σ_E.   (3.4)

Equation 3.4 is satisfied by many different combinations of matrices Σ_T and Σ_E. Let tr(Σ_E) = Σ_{j=1}^{J} σ²_Ej be the trace of Σ_E. The reliability can be written as

  1 − tr(Σ_E) / σ²_X.

The problem to be solved is the following. One searches for the nonnegative, diagonal matrix Σ̃_E that produces maximum tr(Σ̃_E) under the condition that Σ̃_T = Σ_X − Σ̃_E is positive semidefinite; substituting the resulting Σ̃_E for Σ_E, one obtains

  GLB = 1 − tr(Σ̃_E) / σ²_X.   (3.5)

Equation 3.5 provides the lowest possible reliability given Σ_X and the assumptions of classical test theory. It can be shown that GLB exceeds all other lower bounds; for example, α ≤ λ2 ≤ GLB ≤ ρ_XX′ (Jackson & Agunwamba, 1977), with equivalence


Method MS

Method MS is not an analytical lower bound to the reliability in the sense that it has been proven that MS ≤ ρ_XX′. Van der Ark, Van der Palm, and Sijtsma (2011) conducted simulation studies suggesting that method MS is approximately unbiased relative to ρ_XX′ when the assumptions of the double monotonicity model (e.g., Molenaar, 1997) hold for the data under consideration. Let π_{x(i)} = P(X_i ≥ x) be the probability that a randomly drawn person scores at least x on item i, and let π_{x(i),y(j)} = P(X_i ≥ x, X_j ≥ y) (i ≠ j) be the joint probability of scoring at least x on item i and at least y on another item j. Finally, let π_{x(i),y(i)} = P(X_i ≥ x, X_i ≥ y) be the probability of scoring at least x and at least y on item i in two independent replications of item i. If items are administered only once, the latter probability is not directly estimable. Reliability can be rewritten as

  ρ_XX′ = ΣΣ_{i≠j} Σ_x Σ_y [π_{x(i),y(j)} − π_{x(i)} π_{y(j)}] / σ²_X + Σ_i Σ_x Σ_y [π_{x(i),y(i)} − π_{x(i)} π_{y(i)}] / σ²_X   (3.6)

(Van der Ark et al., 2011). Method MS replaces π_{x(i),y(i)} by π̃_{x(i),y(i)}, which is a linear combination of univariate and bivariate cumulative probabilities (Molenaar & Sijtsma, 1988; Van der Ark, 2010). Hence, method MS equals

  MS = ΣΣ_{i≠j} Σ_x Σ_y [π_{x(i),y(j)} − π_{x(i)} π_{y(j)}] / σ²_X + Σ_i Σ_x Σ_y [π̃_{x(i),y(i)} − π_{x(i)} π_{y(i)}] / σ²_X.   (3.7)


3.3 Method

Based on our own judgement, we assumed that sample size N, number of items J, and number of item scores m + 1 were relevant design factors. We first discuss the results of a literature study from which we derived appropriate values for N, J, and m + 1; these served as factor levels in the two simulation studies. Next, we discuss the method of data generation and the factors and the dependent variables used in the simulation studies.

Literature Study

Three journals were scanned so as to obtain suggestions for factor levels in the simulation designs. The journals were Personality and Individual Differences (personality psychology), Assessment (applied clinical assessment), and Journal of Clinical Psychiatry (clinical psychiatry). For these three journals, we scanned more than 500 articles that appeared in 2011. First, for each article we checked whether it used test scores or composites of test scores based on at least three items that operationalized a psychological ability, trait, or attitude. This procedure resulted in 282 useful articles. Second, we screened each of the selected articles with respect to sample size, number of items, and number of answer categories per item. When an article reported results with respect to different subscales that together constituted an instrument, we included the data for each subscale and for the total scale in our investigation.


used items with 2 to 7 answer categories, with mode equal to 5 (Figure 3.1, bottom panel). These results were input for the simulation studies that we discuss next.

Monte-Carlo Study 1

We generated data from a population model with known test-score reliability. The test-score reliability of the population models had values typical of real tests, ranging from .50 to .95. The two-parameter logistic model (2PLM; Birnbaum, 1968) was used to generate dichotomous item scores. The graded response model (GRM; Samejima, 1969) was used to generate polytomous item scores. The 2PLM and the GRM have item discrimination parameters a_j (j = 1, . . . , J) and item location parameters b_jx (j = 1, . . . , J; x = 1, . . . , m). Persons have a score on latent variable θ, which represents their level on the scale of measurement. The probability that person v with measurement value θ_v has a score of at least x equals

  P(X_j ≥ x | θ_v) = exp[a_j(θ_v − b_jx)] / (1 + exp[a_j(θ_v − b_jx)]),  for j = 1, . . . , J; x = 1, . . . , m.   (3.8)

The 2PLM is obtained when m = 1; hence, it has one location parameter, so that in Equation 3.8 b_jx = b_j.

Item scores were generated as follows. First, θs were sampled from a standard normal distribution. Second, given a particular θ value, say, θ_v, and given particular choices of the item parameters, item scores were sampled from

  P(X_j = x | θ_v) = P(X_j ≥ x | θ_v) − P(X_j ≥ x + 1 | θ_v),   (3.9)

where P(X_j ≥ x | θ_v) and P(X_j ≥ x + 1 | θ_v) are defined by Equation 3.8. Test score X was obtained by means of X = Σ_{j=1}^{J} X_j. True score T was obtained as follows.

Figure 3.1: Histograms of sample size (top panel), number of items (central panel), and number of answer categories per item (bottom panel).

First, for each item, the expected item score given θ_v was computed by means of

  E(X_j | θ_v) = Σ_{x=1}^{m} P(X_j ≥ x | θ_v).

Second, the true score equals the expected conditional test score E(X | θ) and was computed by means of

  T_v = E(X | θ_v) = Σ_{j=1}^{J} E(X_j | θ_v).

Test-score reliability was computed by dividing the group variance of T by the group variance of X (Equation 3.1).
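A minimal R sketch of this data-generating scheme, under the stated assumptions (standard-normal θ, Equations 3.8 and 3.9); all function and object names are ours, and the reliability returned here is the within-group approximation of Equation 3.1 rather than the parameter value obtained by numerical integration. The example uses the basic five-item set of Study 1 described below (see Table 3.1).

    # Sketch: generate GRM (or, for m = 1, 2PLM) item scores and the implied true scores.
    # a: length-J discrimination parameters; b: J x m matrix of location parameters.
    simulate_test <- function(N, a, b) {
      J <- length(a); m <- ncol(b)
      theta <- rnorm(N)                                                 # standard normal latent values
      cum_p <- function(j) sapply(1:m, function(x) plogis(a[j] * (theta - b[j, x])))  # P(X_j >= x | theta), Eq. 3.8
      X <- sapply(1:J, function(j) rowSums(runif(N) < cum_p(j)))        # item scores 0, ..., m, consistent with Eq. 3.9
      T_true <- rowSums(sapply(1:J, function(j) rowSums(cum_p(j))))     # true scores: sum of E(X_j | theta)
      list(X = X, test_score = rowSums(X), reliability = var(T_true) / var(rowSums(X)))
    }
    # Example: the basic five-item set of Study 1 with five answer categories (m = 4).
    b_study1 <- rbind(c(-2.5, -1.7, -0.3, 0.5), c(-0.5, 0.3, 1.7, 2.5), c(-1.5, -0.7, 0.7, 1.5),
                      c(-2.5, -1.7, -0.3, 0.5), c(-0.5, 0.3, 1.7, 2.5))
    d <- simulate_test(N = 1000, a = c(.7, .7, 1.2, 1.7, 1.7), b = b_study1)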

The literature search results suggested using sample sizes N = 20, 50, 100, 500, 1000, 2000, 5000; test lengths J = 5, 10, 25, 50; and numbers of answer categories m + 1 = 2, 5. A set of five items (J = 5) was considered the basic set of items. For the basic set, discrimination parameters were chosen equal to a1 = a2 = .7, a3 = 1.2, and a4 = a5 = 1.7. Table 3.1 (upper panel) shows the location parameters for the basic set. Simulated data sets containing 10, 25, and 50 items consisted of multiples of the basic set.

We used reliability estimation method, sample size, test length, and number of answer categories as factors, and the precision of each of the sampling distributions of the reliability estimates produced by methods α, λ2, GLB, and MS as the dependent variable. The reliability estimates were denoted α̂, λ̂2, GLB, and MS. Estimates α̂, λ̂2, and MS were computed using the R package mokken (Van der Ark, 2012), and estimate GLB was computed using the R package psych (Revelle, 2015). Computations showed that under some population models, α̂ had a small negative bias

Table 3.1: Item Location Parameters in Study 1 and Study 2.

                      m = 4                              m = 1
Study   Item    b_j1    b_j2    b_j3    b_j4             b_j
1       1       −2.5    −1.7    −0.3     0.5             −1
        2       −0.5     0.3     1.7     2.5              1
        3       −1.5    −0.7     0.7     1.5              0
        4       −2.5    −1.7    −0.3     0.5             −1
        5       −0.5     0.3     1.7     2.5              1
2       1       −3.5    −2.5    −1.5    −0.5             −2
        2       −2.5    −1.5    −0.5     0.5             −1
        3       −1.5    −0.5     0.5     1.5              0
        4       −0.5     0.5     1.5     2.5              1
        5        0.5     1.5     2.5     3.5              2

As a measure of precision, we used the interpercentile range (IPR), which is the difference between the 97.5th percentile and the 2.5th percentile of the sampling distribution, and we used 5000 samples to determine the IPR. Ninety-five percent of all sample values fall in this range, so the range provides a good measure of the precision of a reliability estimation method.

Bias was the mean difference of the 5000 sample values of α̂, λ̂2, GLB, and method MS with respect to ρ_XX′; that is, bias(α) = mean(α̂) − ρ_XX′, and so on. Parameters for α and ρ_XX′ were obtained by means of numerical integration. We compared the four reliability estimation methods with respect to bias and precision.
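To illustrate how bias and the IPR can be obtained for one design cell, the following sketch reuses simulate_test() and coefficient_alpha() from the earlier sketches; rho_population stands for the parameter value (obtained, e.g., by numerical integration) and is an assumed placeholder, as is b_study1.

    # Sketch: sampling distribution of alpha-hat in one design cell (N = 100, basic 5-item set).
    alpha_hats <- replicate(5000, {
      d <- simulate_test(N = 100, a = c(.7, .7, 1.2, 1.7, 1.7), b = b_study1)
      coefficient_alpha(cov(d$X))
    })
    bias_alpha <- mean(alpha_hats) - rho_population             # bias(alpha) = mean of alpha-hat minus rho_XX'
    ipr_alpha  <- diff(quantile(alpha_hats, c(.025, .975)))     # IPR: 97.5th minus 2.5th percentile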

Monte-Carlo Study 2
