Standard errors of scalability coefficients in two-level Mokken scale analysis

Academic year: 2021

Share "Standard errors of scalability coefficients in two-level Mokken scale analysis"

Copied!
46
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Faculty of Social and Behavioural Sciences

Graduate School of Child Development and Education

Standard errors of scalability coefficients

in two-level Mokken scale analysis

Research Master Child Development and Education

Thesis 2

Student:

V.E.C. Koopman, 10382429

Supervisors:

Prof. dr. L.A. van der Ark

Dr. B.J.H. Zijlstra


Abstract

Tests in which subjects are assessed by multiple raters are often applied in educational and psychological research. Such a testing method results in a multilevel test structure. Two-level Mokken scale analysis is an ordinal scaling technique that accounts for this structure in test construction by using within- and between-rater scalability coefficients H. To enable hypothesis testing and confidence-interval construction for the H-coefficients, standard errors are necessary. In this paper, the standard errors of the two-level scalability coefficients were estimated by writing the scalability coefficients as a vector function of the data using a recursive exp-log notation, deriving the Jacobian, and applying the delta method. Computation of both the coefficients and their standard errors has been implemented in software and is presented with a data example.

Keywords: multilevel test construction, non-parametric item response theory, scalability coefficients, standard errors, two-level Mokken scale analysis.


Standard errors of scalability coefficients in two-level Mokken scale analysis

1 Introduction

In many research areas, measuring a latent trait of a subject is based on only one rater. This can be considered a one-level test structure. However, in educational and psychological research, tests regularly occur in which subjects are assessed by several raters, resulting in a multilevel test structure. The difference between one-level and multilevel measurement instruments is that in the former there is only one respondent, who is often the subject of study, whereas in the latter the respondents are nested within the subject of study (Snijders, 2001). An example of such a measurement instrument is the ICALT observation instrument, a multiple-item questionnaire measuring the teaching skills of trainee teachers throughout the Netherlands (Van der Grift, 2007). Typically, each trainee teacher is assessed by five instructors. The instructors' average test score is the measured value of the trainee teacher's teaching skills. The multilevel structure exists because the instructors are nested within the trainee teacher. Other examples include course evaluations at Dutch universities, in which courses are rated by multiple students using a standardized questionnaire, and evaluations of ecological settings, such as a neighbourhood, based on the inhabitants. In these examples it may be noted that the purpose is to scale the subjects (i.e., the trainee teachers, courses, neighbourhoods), but the subject is measured through the raters (i.e., instructors, students, inhabitants).

In multilevel test data, the first level reflects the raters who respond to the measurement. These raters are nested in the second level, the subjects. More specifically, the latent trait of the subject is measured through the raters. Hence, information is available at level 1 (the raters), but we are interested in scaling the subjects at level 2. If this multilevel structure is ignored when investigating the quality of a multilevel measurement instrument, this quality can be overestimated (Crisan, Van de Pol, & Van der Ark, in press; Crisan, 2015). For example, for a particular questionnaire on teacher support, Crisan et al. (in press) found that classic item analyses such as Cronbach's alpha and one-level Mokken scale analysis reflected a very reliable (α = .93) and strong scale (H = .58). However, after taking into account the multilevel structure of the data, they found that variation among pupil ratings within a class was high (intraclass correlation = .23). As a result, a larger class size would be required to obtain a reliable estimate, and the ratings were mostly affected by the pupils rather than the teacher. The main problem is that these types of measurement instruments intend to measure the trait level of the subjects, whereas classical item analysis and one-level Mokken scale analysis provide information on the raters.

Several models within the parametric item response theory (IRT) framework exist to account for the dependency between multiple raters. Examples include the hierarchical rater model (Patz, Junker, Johnson, & Mariano, 2002), the rater bundle model (Wilson & Hoskens, 2001), the IRT model for multiple raters (Verhelst & Verstralen, 2001), and the ecometric model (Raudenbush & Sampson, 1999). These models describe the effect of raters on the item responses, and are useful for assessing the quality of an existing questionnaire or developing a new measurement instrument. A different multilevel structure exists where subjects (level 1) are nested in classes (level 2); in that case, both the available information and the interest are at level 1. Although the class might affect the response, there is only one rater. Fox and Glas (2001), May (2006), and Vermunt (2003), among others, have provided scaling methods to deal with this latter multilevel structure. Since these models still assume one rater, they are considered beyond the scope of this paper.

Although parametric IRT models have attractive features (e.g., easily summarizing the data, placing respondents on a metric scale), the assumptions underlying these methods are restrictive, which can lead to unsatisfactory goodness-of-fit indices or only few remaining items. Less restrictive alternatives are the nonparametric IRT (NIRT) models (e.g., Mokken, 1971; Sijtsma & Molenaar, 2002). These models might be preferred over parametric models because of their flexible assumptions. For instance, they allow for ordinal measurement and, as a consequence, rank ordering respondents on their latent trait. This property can be particularly useful for ordering-based decisions, for example, selecting a top ten. Moreover, each parametric IRT model is a special case of a NIRT model, making the latter an appropriate starting point in test construction. Snijders (2001) proposed a two-level NIRT model for several raters. Specifically, he laid the foundation for a two-level method for non-parametric scaling of dichotomous items (i.e., items with two response categories), also referred to as two-level Mokken scale analysis. Recently, this model has been generalized to polytomous items, characterized by more than two response categories (Crisan et al., in press). Two-level NIRT models describe particular patterns of item responses, and the scaling technique results in estimated scalability coefficients that evaluate the consistency of item-response patterns within and between raters. Furthermore, the results enable ordering of the subjects according to their scores. Due to its potential in scaling multilevel test data, two-level Mokken scale analysis is the focus of this paper.

A viable scaling requires correct standard errors for the scalability coefficients. Computation of the standard errors leads to a more complete description of the coefficients, and hypotheses on the values of these coefficients can be tested. Consequently, ignoring the standard errors can result in incorrect inferences (Kuijpers, Van der Ark, & Croon, 2013). Mokken (1971) described standard-error computation and hypothesis testing for the scalability coefficients using all possible response patterns, although this method was limited to dichotomous items and to the H_ij coefficients. Alternatively, standard errors for all item types and scalability coefficients can be estimated by bootstrapping (e.g., Efron, 1979). Both methods quickly become computer-intensive with a larger number of items, which is a practical drawback. Under the assumption that the response patterns follow a multinomial distribution, Kuijpers et al. (2013) derived standard errors of the scalability coefficients by means of a marginal modelling framework. This method only uses the observed item-response patterns, and as a result the computational burden cannot exceed the lesser of the number of participants and the number of possible item-response patterns. Kuijpers, Van der Ark, Croon, and Sijtsma (2016) showed that the bias in the standard errors was negligible, and that the coverage of the 95% confidence intervals was often satisfactory. The extension of Mokken scale analysis to multilevel test data has resulted in different scalability coefficients. Currently, standard errors for these coefficients are not yet available, leading to the research question: How can standard errors of the scalability coefficients in two-level Mokken scale analysis be derived?

The remainder of this paper is organized as follows. Section 2 discusses two-level Mokken scale analysis. Section 3 describes the mathematical derivation of standard errors. Section 4 presents and demonstrates the implementation of the two-level scalability coefficients and their standard errors in software.


2 Two-level Mokken scale analysis

In multilevel test data, there are S subjects, indexed by s or t, which are rated by R_s raters each, indexed by r or p, who respond to I items, indexed by i or j. Each item has m + 1 response categories, scored 0, ..., m, indexed by x or y. Score X_sri is defined as the item response for subject s by rater r on item i. Subjects are generally scaled by the mean total score across raters:

X_{s++} = \frac{1}{R_s} \sum_{r=1}^{R_s} \sum_{i=1}^{I} X_{sri}.  (1)

A small dataset (Table 1) is used throughout to demonstrate the computations of the various components necessary for the scalability coefficients and their standard errors.

Table 1:
Data example for S = 3 subjects, with R_s = 5, 4, and 6 raters, respectively, who respond to I = 3 items having m + 1 = 3 response categories.

 s   Item a   Item b   Item c
 1      0        0        0
 1      0        0        1
 1      0        1        2
 1      1        1        1
 1      2        0        1
 2      1        0        1
 2      2        0        2
 2      2        1        1
 2      2        1        2
 3      0        1        0
 3      0        1        1
 3      0        1        0
 3      2        0        1
 3      2        2        0
 3      2        2        0
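As a quick check, Equation 1 applied to the Table 1 data can be sketched as follows; the dictionary layout of the data is an illustrative assumption, not notation from the thesis.

```python
# Sketch: mean total scores (Equation 1) for the Table 1 data.
# Layout {subject: [(item a, item b, item c) per rater]} is our assumption.
data = {
    1: [(0, 0, 0), (0, 0, 1), (0, 1, 2), (1, 1, 1), (2, 0, 1)],
    2: [(1, 0, 1), (2, 0, 2), (2, 1, 1), (2, 1, 2)],
    3: [(0, 1, 0), (0, 1, 1), (0, 1, 0), (2, 0, 1), (2, 2, 0), (2, 2, 0)],
}

def mean_total_score(ratings):
    # X_s++ = (1 / R_s) * sum over raters and items of X_sri
    return sum(sum(resp) for resp in ratings) / len(ratings)

scores = {s: mean_total_score(r) for s, r in data.items()}
```

For the three subjects this yields 2.0, 3.75, and 2.5, so subject 2 would be ranked highest on the scale.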


2.1 NIRT models

Let responses to items be driven by the latent trait of subject s, θ_s, and by the deviation of rater r within subject s, δ_sr. Each item i has item-step response function p_ix(θ_s + δ_sr) = P(X_sri ≥ x | θ_s, δ_sr), which is the probability of obtaining at least item score x given θ_s and δ_sr. For dichotomous items, this probability simplifies to P(X_sri = 1 | θ_s, δ_sr). Additionally, the item-step response function that depends only on θ_s is defined as p_ix(θ_s) = P(X_si ≥ x | θ_s) = E_δ[p_ix(θ_s + δ_sr)], where expectation E_δ refers to the distribution of δ_sr. The most commonly used NIRT model is the monotone homogeneity model. Both the one- and two-level applications of this model are based on three main assumptions (Snijders, 2001; Sijtsma & Molenaar, 2002). For two-level polytomous item scores, these assumptions are

• Unidimensionality: The latent trait θ is unidimensional.

• Local independence: Conditional on θ_s, the deviations of the R_s raters and the responses to the I items are outcomes of independent random variables:

P(X_{s11} = x_{s11}, \ldots, X_{sR_sI} = x_{sR_sI} \mid \theta_s) = \prod_{r=1}^{R_s} \prod_{i=1}^{I} P(X_{sri} = x_{sri} \mid \theta_s).  (2)

• Monotonicity: Each item-step response function p_ix(θ_s + δ_sr) is non-decreasing with respect to θ. Since p_ix(θ_s) is averaged with respect to δ_sr, its function is flatter than that of p_ix(θ_s + δ_sr). Consequently, if p_ix(θ_s + δ_sr) is non-decreasing, so is p_ix(θ_s). Item-step response functions are allowed to intersect, unless they belong to the same item.

A second NIRT model is the double monotonicity model, which has one additional assumption:

• Non-intersecting item response functions: An item response function is defined as the mean of the M item-step response functions per item,

E(X_{sri} \mid \theta_s, \delta_{sr}) = \frac{1}{M} \sum_{x=1}^{M} P(X_{sri} \geq x \mid \theta_s, \delta_{sr}).  (3)


If none of the item response functions intersect, the items have an invariant item ordering. While the first model is only appropriate to order persons, the second model justifies the ordering of both persons and items.
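The assumptions above do not prescribe a functional form. As a way to build intuition, the sketch below simulates from one parametric special case that satisfies them; the logistic item-step response functions, the normal distributions for θ_s and δ_sr, and all parameter values are illustrative assumptions, not part of the nonparametric model.

```python
import math
import random

random.seed(1)

def isrf(eta, beta):
    # An illustrative logistic item-step response function p_ix(theta + delta).
    return 1.0 / (1.0 + math.exp(-(eta - beta)))

def simulate(S=300, R=5, betas=((-1.0, 0.5), (-0.5, 1.0), (0.0, 1.5)), sd_delta=0.7):
    # betas[i] lists increasing step locations, so P(X >= 1) >= P(X >= 2)
    # and drawing one uniform per item yields a proper graded score.
    data = []
    for _ in range(S):
        theta = random.gauss(0.0, 1.0)           # subject trait theta_s
        raters = []
        for _ in range(R):
            delta = random.gauss(0.0, sd_delta)  # rater deviation delta_sr
            resp = []
            for steps in betas:
                u = random.random()              # one uniform per item
                resp.append(sum(u < isrf(theta + delta, b) for b in steps))
            raters.append(tuple(resp))
        data.append((theta, raters))
    return data

sim = simulate()
```

Because the item-step response functions are non-decreasing in θ, subjects with higher simulated trait values obtain higher mean total scores, which is the ordinal property the monotone homogeneity model is meant to deliver.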

2.2 Scalability coefficients

Key in Mokken scale analysis are the scalability coefficients H (e.g., Sijtsma & Molenaar, 2002; Mokken, 1971). For two-level data, Snijders (2001) distinguished within-rater (W) and between-rater (B) scalability coefficients to determine the consistency of item-response patterns within and between raters, respectively.

2.2.1 Guttman weights

Computation of the scalability coefficients requires Guttman weights. Let an item-step, indexed by g, be a boolean statement indicating whether the step has been passed (1 if X_sri ≥ x) or failed (0 if X_sri < x). Let the popularity of an item-step be the cumulative probability of scoring value x or higher on item i, that is, P(X_sri ≥ x). The item-steps are sorted in descending order of their popularity. In a perfect Guttman scale, no further item-steps are passed once an item-step is failed. As a result, a Guttman error is described as passing an item-step when a more popular item-step is failed. For example, if item 2 is less popular than item 1, a Guttman error is an item-score pattern (X_sr1 = 0, X_sr2 = 1). As an example, the item-steps for two items with 3 response categories might be

X_1 \geq 1,\; X_2 \geq 1,\; X_1 \geq 2,\; X_2 \geq 2.  (4)

Note that item-steps X_1 ≥ 0 and X_2 ≥ 0 are omitted, because P(X_sri ≥ 0) = 1 for all items.


Equation 4 is evaluated with regard to a particular item-score pattern (x, y) as values z^{xy}_g, collected in vector z^{xy}. For item-score pattern (0, 2), z^{02} = (0, 1, 0, 1): the second and fourth item-steps were passed, whereas the first and third item-steps were failed. Passing a less popular item-step while failing a more popular item-step is called a Guttman error. The weight of this error indicates the deviation from the perfect Guttman scale (Molenaar, 1991). For the item-steps (0, 1, 0, 1), the more popular first item-step (X_1 ≥ 1) was failed whereas the less popular second item-step (X_2 ≥ 1) was passed, and the more popular first (X_1 ≥ 1) and third item-steps (X_1 ≥ 2) were failed whereas the less popular fourth item-step (X_2 ≥ 2) was passed. In other words, the weight of item-score pattern (0, 2) is 3, since for three pairs of item-steps, (X_1 ≥ 1, X_2 ≥ 1), (X_1 ≥ 1, X_2 ≥ 2), and (X_1 ≥ 2, X_2 ≥ 2), the less popular item-step was passed and the more popular item-step was failed. In general, the Guttman weight w^{xy}_{ij} for score x on item i and score y on item j, with G item-steps, can be computed as

w_{ij}^{xy} = \sum_{h=2}^{G} \left\{ z_h^{xy} \times \left[ \sum_{g=1}^{h-1} \left(1 - z_g^{xy}\right) \right] \right\},  (5)

(e.g., Kuijpers et al., 2013).
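Equation 5 translates directly into a few lines of code; the sketch below (the function name is ours) counts, for each passed item-step, how many more popular item-steps were failed before it.

```python
def guttman_weight(z):
    """Equation 5: the weight is the number of item-step pairs (g, h), g < h,
    in which the more popular step g was failed (z[g] = 0) while the less
    popular step h was passed (z[h] = 1). The vector z must be ordered by
    decreasing item-step popularity."""
    return sum(z[h] * sum(1 - z[g] for g in range(h)) for h in range(1, len(z)))
```

For the running example, the item-score pattern (0, 2) has z^{02} = (0, 1, 0, 1) and weight 3, as derived above, while (0, 1) with z = (0, 1, 0, 0) has weight 1.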

2.2.2 Item-pair, item, and total scale coefficients

Scalability coefficients compare the sum of weighted observed Guttman errors F_ij to the sum of weighted expected Guttman errors under marginal independence of the items, E_ij, and their values can be used to assess how well the subjects can be ordered on the scale. For the K = I(I − 1)/2 item pairs there are the item-pair scalability coefficients H_ij^W and H_ij^B. These coefficients reflect the homogeneity of item pairs within and between the raters, respectively. For each of the I items, the scalability coefficients H_i^W and H_i^B exist. These coefficients reflect the homogeneity of items with respect to the rest of the scale. The homogeneity of the total scale is summarized by the scalability coefficients H^W and H^B. Of special interest is the ratio of the between to within coefficients, reflecting whether responses are driven by the trait value of the subject rather than by rater effects.

For any finite population, let f_ij^xy be the within-rater expected frequency of (X_sri = x, X_srj = y), and b_ij^xy the between-rater expected frequency of (X_sri = x, X_spj = y). Moreover, let e_ij^xy be the expected frequency under marginal independence of (X_sri = x, X_srj = y). The within-rater scalability coefficients for item-pairs, items, and the whole scale are defined as

H_{ij}^W = 1 - \frac{F_{ij}^W}{E_{ij}} = 1 - \frac{\sum_x \sum_y w_{ij}^{xy} f_{ij}^{xy}}{\sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}},  (6)

H_i^W = 1 - \frac{\sum_{j \neq i} F_{ij}^W}{\sum_{j \neq i} E_{ij}} = 1 - \frac{\sum_{j \neq i} \sum_x \sum_y w_{ij}^{xy} f_{ij}^{xy}}{\sum_{j \neq i} \sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}},  (7)

H^W = 1 - \frac{\sum\sum_{i \neq j} F_{ij}^W}{\sum\sum_{i \neq j} E_{ij}} = 1 - \frac{\sum\sum_{i \neq j} \sum_x \sum_y w_{ij}^{xy} f_{ij}^{xy}}{\sum\sum_{i \neq j} \sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}}.  (8)

The between-rater scalability coefficients for item-pairs, items, and the whole scale are defined as

H_{ij}^B = 1 - \frac{F_{ij}^B}{E_{ij}} = 1 - \frac{\sum_x \sum_y w_{ij}^{xy} b_{ij}^{xy}}{\sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}},  (9)

H_i^B = 1 - \frac{\sum_{j \neq i} F_{ij}^B}{\sum_{j \neq i} E_{ij}} = 1 - \frac{\sum_{j \neq i} \sum_x \sum_y w_{ij}^{xy} b_{ij}^{xy}}{\sum_{j \neq i} \sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}},  (10)

H^B = 1 - \frac{\sum\sum_{i \neq j} F_{ij}^B}{\sum\sum_{i \neq j} E_{ij}} = 1 - \frac{\sum\sum_{i \neq j} \sum_x \sum_y w_{ij}^{xy} b_{ij}^{xy}}{\sum\sum_{i \neq j} \sum_x \sum_y w_{ij}^{xy} e_{ij}^{xy}}.  (11)


One-level scalability coefficients can be considered a special case of these two-level coefficients, with only one rater per subject. Since there is only one rater, the between-rater coefficients are no longer relevant, nor are the ratios. Only the within-rater scalability coefficients remain, which are identical to the traditional H-coefficients (e.g., Crisan et al., in press; Snijders, 2001). When R_s is equal for all S subjects, the within-rater scalability coefficients are equal to the one-level coefficients as well. If there are no Guttman errors, H_ij = H_i = H = 1. If there are as many errors as expected under marginal independence, H_ij = H_i = H = 0. In addition, min(H_ij) ≤ min(H_i) ≤ H ≤ max(H_i) ≤ max(H_ij), where the within coefficients are expected to be greater than the between coefficients (Snijders, 2001; Sijtsma & Molenaar, 2002).

As a heuristic, Snijders (2001) proposed a threshold of 0.2 for the within-rater scalability coefficients of the items and the whole scale, where item-pair coefficients are allowed to be between 0.1 and 0.2. These thresholds are lower than those for one-level scalability coefficients, which is possible because the parallel measurements of multiple raters provide more information. For the between-rater scalability coefficients, Snijders proposed a value of at least 0.1, with possibly some exceptions for item-pair coefficients. Of special interest is the ratio H^B/H^W, indicating a reasonable scale for values over 0.3 and a strong scale for values over 0.6. These values are arbitrary and may be adjusted according to particular preferences or characteristics of the data.

2.3 Estimation of scalability coefficients

2.3.1 Estimating Guttman weights

To estimate the Guttman weights, it is assumed that the order of the item-steps in the sample is identical to the order of the item-steps in the population. This assumption might be violated, especially when the popularities of item-steps are identical in the population. Recently, Kuijpers et al. (2016) found that although identical item-step popularities resulted in lower H coefficients, these were still often unbiased, except for fewer response categories and smaller sample sizes (bias < 0.5). The order of the item-steps is determined by their popularities, that is, their cumulative probabilities P(X_sri ≥ x), which is essentially the same as their cumulative frequencies f(X_sri ≥ x). A simple estimate of the univariate frequency f(X_sri = x) is computed by summing over all R = \sum_{s=1}^{S} R_s raters,

\hat{f}_i^x = \sum_{r=1}^{R} 1(X_{sri} = x),  (12)

which is biased when there is a relation between θ_s and R_s (Snijders, 2001). Consequently, computing the relative frequency is preferred:

\hat{f}_i^x = \sum_{s=1}^{S} \frac{1}{R_s} \sum_{r=1}^{R_s} 1(X_{sri} = x).  (13)

As a result, the cumulative relative frequency \hat{f}_i(X_{sri} ≥ 0) equals S. Table 2 (based on Table 1) provides estimated univariate frequencies for items a and b in the lower panel. According to these estimates, the item-steps for this example are ordered as

X_a \geq 1,\; X_b \geq 1,\; X_a \geq 2,\; X_b \geq 2.  (14)

Applying Equation 5 to this item-step order results in the weights displayed in the last row of the first two panels. Because they are based on the same marginal frequencies, the weights are identical for the within- and between-rater coefficients. In case of ties in the item-step popularities, weights are computed for the different orderings and averaged (Molenaar, 1991).


Table 2:
Example of Multilevel Item-Score Frequencies of Items a and b from Table 1.

             Item-score pattern (x, y)
 s       (0,0) (0,1) (0,2) (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)   R_s
 1         2     1                 1           1                   5
 2                           1                 1     2             4
 3               3                             1           2       6
 f_ab^xy  .40   .70   .00   .25   .20   .00   .62   .50   .33
 w_ab^xy   0     1     3     0     0     1     1     0     0

             Item-score pattern (x, y)
 s       (0,0) (0,1) (0,2) (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)   R_s(R_s − 1)
 1         7     5           3     1           2     2             20
 2                           1     2           5     4             12
 3         3     6     6                       2     9     4       30
 b_ab^xy  .45   .45   .20   .23   .22   .00   .58   .73   .13
 w_ab^xy   0     1     3     0     0     1     1     0     0

                   Item a               Item b
 s              x=0   x=1   x=2      x=0   x=1   x=2     R_s
 1               3     1     1        3     2             5
 2                     1     3        2     2             4
 3               3           3        1     3     2       6
 \hat{f}_i^x    1.10  0.45  1.45     1.27  1.40  0.33
 \hat{f}(X_sri ≥ x)  3.00  1.90  1.45     3.00  1.73  0.33

Note: Frequencies of unobserved item-score patterns are left blank. The upper panel shows the bivariate frequencies within raters, the middle panel shows the bivariate frequencies between raters, and the lower panel shows the univariate frequencies.


2.3.2 Estimated scalability coefficients

Let \hat{f}_ij^xy be the estimated within-rater bivariate frequency of (X_sri = x, X_srj = y) and \hat{b}_ij^xy the estimated between-rater bivariate frequency of (X_sri = x, X_spj = y). Estimates of both the within and between bivariate frequencies are, similar to the univariate estimate, based on the relative frequencies. The estimate for the within-rater frequency is

\hat{f}_{ij}^{xy} = \sum_{s=1}^{S} \frac{1}{R_s} \sum_{r=1}^{R_s} 1(X_{sri} = x, X_{srj} = y),  (15)

see Table 2, upper panel. For the between-rater frequency this estimate is

\hat{b}_{ij}^{xy} = \sum_{s=1}^{S} \frac{1}{R_s(R_s - 1)} \sum_{r=1}^{R_s} \sum_{p \neq r} 1(X_{sri} = x, X_{spj} = y),  (16)

see Table 2, middle panel. Finally, the expected bivariate frequency under marginal independence, e_ij^xy, is estimated by

\hat{e}_{ij}^{xy} = \frac{\hat{f}_i^x \times \hat{f}_j^y}{S}.  (17)

Since the univariate frequency f_i^x is independent of r, the expected frequency \hat{e}_ij^xy is equal for the within- and between-rater coefficients. The estimated bivariate frequencies, together with the estimated weights, can be implemented in Equations 6 to 11 to obtain the estimated scalability coefficients \hat{H}.

Table 3 presents the estimated scalability coefficients for Table 1. The item-pair coefficient \hat{H}_ac reflects a value above the threshold of 0.1 in both the within- and between-rater columns. Potentially, items a and c could form a scale, but by including item b the three items appear unscalable, as reflected by the low and negative H_i- and H-values. In addition, the ratio \hat{H}^B/\hat{H}^W indicates that currently the responses are not driven by the trait level of the subject.

Table 3:
Estimated within- (W) and between-rater (B) scalability coefficients (\hat{H}) for the example data from Table 1.

                      W        B
 \hat{H}_ab         0.146   −0.059
 \hat{H}_ac         0.327    0.222
 \hat{H}_bc        −0.379   −0.365
 \hat{H}_a          0.238    0.084
 \hat{H}_b         −0.078   −0.189
 \hat{H}_c          0.031   −0.024
 \hat{H}            0.072   −0.037
 \hat{H}^B/\hat{H}^W       −0.506

To know whether it is meaningful to create a scale with items a and c, we need information on the precision of the estimates. This information is currently missing, and as a consequence, it cannot be tested whether the H_ac coefficients are significantly greater than the threshold values provided. Since inferences on the item-pairs, items, and scale are based on the values of the within- and between-rater scalability coefficients as well as the ratio between them, standard errors need to be derived for all of them.

3 Standard errors of scalability coefficients

Standard errors determine the precision of an estimate. For example, item i is considered to contribute to a scale when its H_i-coefficients are above the threshold of 0.2 for within and 0.1 for between coefficients (Snijders, 2001). Standard errors can be used to assess whether the H_i-coefficients in the population are likely to be above that threshold. Computing the standard errors for the H-coefficients is not straightforward. Initial methods (e.g., Mokken, 1971) required computations for all possible response patterns and could only be applied to dichotomous items and item-pair coefficients. An alternative method is the non-parametric bootstrap procedure (e.g., Efron, 1979; Van Onna, 2004). The bootstrap can be adjusted for multilevel data, resulting in a clustered bootstrap that resamples on the subject level (De Rooij, 2013). All these methods become computationally demanding when the number of items increases, and it is therefore convenient to develop less demanding methods to derive the standard errors.

For one-level Mokken scale analysis, Kuijpers et al. (2013) derived standard errors of the scalability coefficient estimates by (a) writing the scalability coefficients as a vector function of the data using a recursive exp-log notation (e.g., Bergsma, 1997), (b) deriving the matrix of first-order partial derivatives of the vector function, and (c) applying the delta method (e.g., Agresti, 2002, pp. 577-581; Sen & Singer, 1993, pp. 131-152). This method assumes that the item-response patterns are sampled from a multinomial distribution, which poses no additional limitations on the model, and can be used regardless of item type for up to 100 items and 100,000 respondents. The method is inspired by marginal modelling of categorical data and was first applied to H-coefficients by Van der Ark, Croon, and Sijtsma (2008; see also Bergsma, 1997).
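The three steps can be mimicked numerically. The sketch below applies the delta method to the exp-log mean function from Section 3.1.1, using a forward-difference Jacobian in place of the analytic derivation; that numerical shortcut is our simplification, whereas the paper derives the Jacobian in closed form.

```python
import numpy as np

def delta_method_cov(g, n):
    """Approximate cov(g(n)) for multinomial counts n via the delta method,
    with a numerical (forward-difference) Jacobian instead of an analytic one."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    p = n / N
    V = N * (np.diag(p) - np.outer(p, p))   # covariance of multinomial counts
    g0 = np.atleast_1d(g(n))
    J = np.zeros((g0.size, n.size))
    eps = 1e-6
    for k in range(n.size):
        d = np.zeros(n.size)
        d[k] = eps
        J[:, k] = (np.atleast_1d(g(n + d)) - g0) / eps
    return J @ V @ J.T                      # delta method: J cov(n) J^T

# g(n) = exp(A2 log(A1 n)): the means of two dichotomous items (Section 3.1.1)
A1 = np.array([[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 1.0]])
A2 = np.array([[1, 0, -1], [0, 1, -1.0]])
g = lambda n: np.exp(A2 @ np.log(A1 @ n))

cov = delta_method_cov(g, [1, 2, 3, 4])     # invented frequencies, N = 10
se = np.sqrt(np.diag(cov))
```

For the first mean, \hat{p} = 0.7, so the delta-method variance equals the binomial value 0.7 × 0.3 / 10 = 0.021 and the standard error is about 0.145, confirming the approach on a case with a known answer.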

3.1 Writing scalability coefficients as a vector function

Writing the scalability coefficients as a vector function of the data commences with collecting the raw frequencies of the observed item-score patterns in vector n. For a given test there are L = (m + 1)^I possible item-score patterns. Only the observed patterns need to be selected (for proof, see Kuijpers et al., 2013, pp. 54-55), ordered lexicographically from 00...0 to mm...m, with the last digit changing fastest. If a test consists


of 3 items (a, b, c), vector n potentially consists of the frequencies n_abc^000 to n_abc^mmm. In the two-level situation, this vector should be present for each subject to be scaled. Let L_s be the number of observed item-score patterns of subject s. Then vector n = (n_1, n_2, ..., n_S), with length L* = \sum_{s=1}^{S} L_s, which cannot exceed the lesser of R and S·L. For Table 1, the vector of observed item-score patterns is

n = \begin{pmatrix} n_1 \\ n_2 \\ n_3 \end{pmatrix}, with
n_1 = (n_{1abc}^{000}, n_{1abc}^{001}, n_{1abc}^{012}, n_{1abc}^{111}, n_{1abc}^{201})^T = (1, 1, 1, 1, 1)^T,
n_2 = (n_{2abc}^{101}, n_{2abc}^{202}, n_{2abc}^{211}, n_{2abc}^{212})^T = (1, 1, 1, 1)^T,
n_3 = (n_{3abc}^{010}, n_{3abc}^{011}, n_{3abc}^{201}, n_{3abc}^{220})^T = (2, 1, 1, 2)^T.  (18)
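Collecting vector n can be sketched as follows (the data layout is our assumption): counting the observed patterns per subject and sorting them lexicographically reproduces Equation 18.

```python
from collections import Counter

# Table 1 data: {subject: [(item a, item b, item c) per rater]} (layout is ours)
data = {
    1: [(0, 0, 0), (0, 0, 1), (0, 1, 2), (1, 1, 1), (2, 0, 1)],
    2: [(1, 0, 1), (2, 0, 2), (2, 1, 1), (2, 1, 2)],
    3: [(0, 1, 0), (0, 1, 1), (0, 1, 0), (2, 0, 1), (2, 2, 0), (2, 2, 0)],
}

n, patterns = [], []
for s in sorted(data):
    counts = Counter(data[s])
    for pat in sorted(counts):       # lexicographic order within each subject
        patterns.append((s, pat))
        n.append(counts[pat])
```

This yields n = (1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2) with L* = 13 observed patterns, as in Equation 18.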

Vector n is of key importance for further specification of the scalability coefficients, because the coefficients can be written as functions of n. Let the g-functions represent the scalability coefficients as a function of the data using the recursive exp-log notation. Let H_ij refer to a vector with all item-pair scalability coefficients, rewritten as g(n), H_i to a vector with all item scalability coefficients, rewritten as g†(n), and H to the total scale coefficient, rewritten as g‡(n), all present in both a within-rater (W) and a between-rater (B) version. The ratio between the two total scale scalability coefficients, H^B/H^W, is rewritten as g⋆(n):

H_{ij}^W = g^W(n),  (19)
H_{ij}^B = g^B(n),  (20)
H_i^W = g^{W\dagger}(n),  (21)
H_i^B = g^{B\dagger}(n),  (22)
H^W = g^{W\ddagger}(n),  (23)
H^B = g^{B\ddagger}(n),  (24)
H^B/H^W = g^{\star}(n).  (25)

3.1.1 The recursive exp-log notation

Many functions can be written in a recursive exp-log notation. Key in this notation are the design matrices A_1 to A_q and recursively applying the exponential and logarithmic functions to the data. Let exp(X) be the exponential function and log(X) the natural logarithmic function, evaluated element-wise over X. These functions are useful because the product of positive terms can be written as the exponent of the sum of the logarithms of those terms. As an example of the exp-log notation, let the frequencies of two dichotomous items (a, b) be stored in n = (n_ab^00, n_ab^01, n_ab^10, n_ab^11)^T. For item a, a simple calculation results in the mean \bar{X}_a = n_a^{1+}/N = (n_ab^10 + n_ab^11)/N, with N the total number of observations. The mean of item b can be computed similarly. These means can also be computed using the exp-log notation. Let

A_1 = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} \quad \text{and} \quad A_2 = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix},  (26)

then (\bar{X}_a, \bar{X}_b)^T = g(n) = \exp(A_2 \log(A_1 n)). First, A_1 n is

A_1 n = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} n_{ab}^{00} \\ n_{ab}^{01} \\ n_{ab}^{10} \\ n_{ab}^{11} \end{pmatrix} = \begin{pmatrix} \sum X_a \\ \sum X_b \\ N \end{pmatrix}.  (27)

Inserting the right-hand side of Equation 27 into \exp(A_2 \log(A_1 n)) results in

\exp\left( \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix} \log \begin{pmatrix} \sum X_a \\ \sum X_b \\ N \end{pmatrix} \right) = \begin{pmatrix} \bar{X}_a \\ \bar{X}_b \end{pmatrix}.  (28)


In Sections 3.1.2 to 3.1.5, the recursive exp-log notation is subsequently defined for the item-pair, item, and total scale scalability coefficients, and for the ratio of the total scale coefficients.

3.1.2 Item-pair scalability coefficients in exp-log notation

The recursive exp-log notation for the within-rater item-pair scalability coefficients H_ij^W (Equations 6 and 19) is

H_{ij}^W = g^W(n) = A_7 \exp(A_6 \log(A_5 \exp(A_4 \log(A_3 \exp(A_2 \log(A_1^W n)))))).  (29)

For I items and K item-pairs, there are B = K(m + 1)^2 bivariate and U = I(m + 1) univariate frequencies. The S(B + U + 1) × L* matrix A_1^W is required to create a vector with the bivariate and univariate frequencies and the number of raters per subject. Therefore, matrix A_1^W consists of sub-matrices B_s^W and U_s^W and vector r_s^W for each subject. In general, for three items, the raw bivariate frequency (X_sra = 0, X_srb = 0) for subject 1 is computed as n_{1abc}^{00+W} = n_{1abc}^{000} + n_{1abc}^{001} + n_{1abc}^{002}, and the raw univariate frequency (X_sra = 0) as n_{1abc}^{0++W} = n_{1abc}^{00+} + n_{1abc}^{01+} + n_{1abc}^{02+}. Entry (b, l) of the B × L_s sub-matrix B_s^W takes value 1 if the l-th observed item-score pattern contributes to the b-th bivariate frequency, and 0 otherwise. For subject 1 in Table 1 (see also Equation 18), the first row of B_1^W is (1, 1, 0, 0, 0). Entry (u, l) of the U × L_s sub-matrix U_s^W takes value 1 if the l-th item-score pattern contributes to the u-th univariate frequency, and 0 otherwise. The first row of U_1^W is (1, 1, 1, 0, 0). Per subject, the last row is essentially a transposed unit-vector r_s^W of length L_s. Vector r_1^W is therefore (1, 1, 1, 1, 1). Let ⊕ indicate the direct sum and let

A_1^W = \bigoplus_{s=1}^{S} \begin{pmatrix} B_s^W \\ U_s^W \\ r_s^W \end{pmatrix} = \begin{pmatrix} B_1^W & 0 & \cdots & 0 \\ U_1^W & 0 & \cdots & 0 \\ r_1^W & 0 & \cdots & 0 \\ 0 & B_2^W & \cdots & 0 \\ 0 & U_2^W & \cdots & 0 \\ 0 & r_2^W & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_S^W \\ 0 & 0 & \cdots & U_S^W \\ 0 & 0 & \cdots & r_S^W \end{pmatrix}.  (30)

When matrix A_1^W is multiplied with vector n, the result is vector g_1^W with the raw bivariate (n_{sij}^{xy+W}, in vector n_{sij}^W), univariate (n_{si}^{x++W}, in vector n_{si}^W), and rater frequencies R_s per subject:

g_1^W = \log(A_1^W n) = \log \begin{pmatrix} n_{1ab}^W \\ n_{1ac}^W \\ n_{1bc}^W \\ n_{1a}^W \\ n_{1b}^W \\ n_{1c}^W \\ R_1 \\ \vdots \\ n_{Sab}^W \\ n_{Sac}^W \\ n_{Sbc}^W \\ n_{Sa}^W \\ n_{Sb}^W \\ n_{Sc}^W \\ R_S \end{pmatrix}.  (31)

The second step is computing the relative frequencies by dividing the raw frequencies by Rs(Equation 15). This results in the S(B + U ) × S(B + U + 1)

matrix A2, with on the diagonal an identity matrix I and a negative unit-vector

(22)

A2= S M s=1  I(B+U ) −1(B+U )  =        I1 −11 0 0 · · · 0 0 0 0 I2 −12 · · · 0 0 . . . ... ... ... ... ... 0 0 0 0 · · · IS −1S        (32) Implementing gW

1 (Equation 31) in exp(A2log(A1n)) equals

gW2 = exp  A2g1W  =                                 f1ab f1ac f1bc f1a f1b f1c . . . fSab fSac fSbc fSa fSb fSc                                 , (33)

resulting in a vector with the estimated bivariate frequencies bfsijsxy (in fsij) and

univariate frequencies bfsx

si (in fsi).

The computed relative bivariate and univariate frequencies need to be aggregated over the subjects, and in addition, the number of subjects should be computed. All but the last row of the $(B+U+1) \times S(B+U)$ matrix $\mathbf{A}_3$ consist of $S$ side-by-side identity matrices of order $(B+U)$, collected in sub-matrix $\mathbf{C}$. The last row of $\mathbf{A}_3$ consists of $S$ transposed $\mathbf{h}_s$-vectors. Vector $\mathbf{h}_s$ concatenates an $(m+1)^2$ unit-vector and a $(B+U-(m+1)^2)$ zero-vector per subject

$$\mathbf{A}_3 = \begin{pmatrix} \mathbf{C} \\ \mathbf{h}^T \end{pmatrix} = \begin{pmatrix} \mathbf{I}_1 & \mathbf{I}_2 & \cdots & \mathbf{I}_S \\ \mathbf{h}_1^T & \mathbf{h}_2^T & \cdots & \mathbf{h}_S^T \end{pmatrix}. \qquad (34)$$

Implementing vector $\mathbf{g}_2^W$ (Equation 33) in $\log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n})))$ equals

$$\mathbf{g}_3^W = \log\left(\mathbf{A}_3 \mathbf{g}_2^W\right) = \log \begin{pmatrix} \mathbf{f}_{ab} \\ \mathbf{f}_{ac} \\ \mathbf{f}_{bc} \\ \mathbf{f}_a \\ \mathbf{f}_b \\ \mathbf{f}_c \\ S \end{pmatrix}, \qquad (35)$$

resulting in a vector containing the relative bivariate frequencies ($\hat{f}_{ij}^{xy}$ in vector $\mathbf{f}_{ij}$, Equation 15), the univariate frequencies ($\hat{f}_i^x$ in vector $\mathbf{f}_i$, Equation 13), and the total number of subjects $S$.

From this point on, design matrices $\mathbf{A}_4$ to $\mathbf{A}_7$ are comparable to design matrices $\mathbf{A}_2$ to $\mathbf{A}_5$, respectively, in Kuijpers et al. (2013). In order to compute the scalability coefficients, both the observed and the expected bivariate frequencies are necessary. The observed bivariate frequencies are present in $\mathbf{g}_3^W$, but the expected frequencies $e_{ij}^{xy}$ need to be computed (Equation 17). The $2B \times (B+U+1)$ matrix $\mathbf{A}_4$ consists of several sub-matrices. First, an identity matrix of order $B$, resulting in repetition of the observed bivariate frequencies. Second, the $B \times U$ matrix $\mathbf{P}$, which is an indicator matrix where entry $(b, u)$ takes value 1 if the $u$-th univariate frequency contributes to the $b$-th expected bivariate frequency, and 0 otherwise. This guarantees multiplication of the relevant univariate frequencies. The last column of $\mathbf{A}_4$ concatenates a zero-vector $\mathbf{o}$ and a negative unit-vector $-\mathbf{1}$, both of length $B$, ensuring division of the multiplied univariate frequencies by $S$. Implementing $\mathbf{g}_3^W$ (Equation 35) in $\exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n}))))$ results in a vector with the estimated observed and expected ($\hat{e}_{ij}^{xy}$, in vector $\mathbf{e}_{ij}$) bivariate frequencies

$$\mathbf{g}_4^W = \exp\left(\mathbf{A}_4 \mathbf{g}_3^W\right) = \exp\left(\begin{pmatrix} \mathbf{I}_B & \mathbf{0} & \mathbf{o} \\ \mathbf{0} & \mathbf{P} & -\mathbf{1}_B \end{pmatrix} \mathbf{g}_3^W\right) = \begin{pmatrix} \mathbf{f}_{ab} \\ \mathbf{f}_{ac} \\ \mathbf{f}_{bc} \\ \mathbf{e}_{ab} \\ \mathbf{e}_{ac} \\ \mathbf{e}_{bc} \end{pmatrix}. \qquad (36)$$

To determine which of the observed and expected bivariate frequencies from $\mathbf{g}_4^W$ are actual Guttman errors, a matrix needs to be created that multiplies the Guttman weights with the frequencies. This results in the $(2K+1) \times 2B$ matrix $\mathbf{A}_5$. This matrix consists of twice the sub-matrix $\mathbf{W}$, containing the Guttman weights, and one copy-vector ($\mathbf{c}_1^T$) of the first row of $\mathbf{W}$, necessary to construct the scalar 1 at a later stage (Equation 6). Sub-matrix $\mathbf{W}$ contains a row-vector $\mathbf{w}_{ij} = [w_{ij}^{00}, w_{ij}^{01}, \ldots, w_{ij}^{mm}]$ on each diagonal entry, which represents the $(m+1)^2$ Guttman weights for item-pair $(i, j)$

$$\mathbf{W} = \bigoplus_{i<j} \mathbf{w}_{ij} = \begin{pmatrix} \mathbf{w}_{12} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{w}_{13} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{w}_{14} & \cdots & \mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{w}_{I-1,I} \end{pmatrix}. \qquad (37)$$

Implementing vector $\mathbf{g}_4^W$ (Equation 36) in $\log(\mathbf{A}_5 \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n})))))$ results in a vector with observed and expected Guttman errors (notation as in Equation 6), where the first entry is repeated once

$$\mathbf{g}_5^W = \log\left(\mathbf{A}_5 \mathbf{g}_4^W\right) = \log\left(\begin{pmatrix} \mathbf{c}_1^T & \mathbf{0} \\ \mathbf{W} & \mathbf{0} \\ \mathbf{0} & \mathbf{W} \end{pmatrix} \mathbf{g}_4^W\right) = \log \begin{pmatrix} \mathbf{w}_{ab}\mathbf{f}_{ab} \\ \mathbf{w}_{ab}\mathbf{f}_{ab} \\ \mathbf{w}_{ac}\mathbf{f}_{ac} \\ \mathbf{w}_{bc}\mathbf{f}_{bc} \\ \mathbf{w}_{ab}\mathbf{e}_{ab} \\ \mathbf{w}_{ac}\mathbf{e}_{ac} \\ \mathbf{w}_{bc}\mathbf{e}_{bc} \end{pmatrix} = \log \begin{pmatrix} F_{ab}^W \\ F_{ab}^W \\ F_{ac}^W \\ F_{bc}^W \\ E_{ab} \\ E_{ac} \\ E_{bc} \end{pmatrix}. \qquad (38)$$

The $(K+1) \times (2K+1)$ matrix $\mathbf{A}_6$ is a concatenation of several values, zero-vectors, and identity matrices, necessary to compute the ratio between the observed and expected Guttman errors,

$$\mathbf{A}_6 = \begin{pmatrix} 1 & -1 & \mathbf{o}_{K-1}^T & \mathbf{o}_K^T \\ \mathbf{o}_K & \mathbf{I}_K & -\mathbf{I}_K \end{pmatrix}. \qquad (39)$$

Implementing vector $\mathbf{g}_5^W$ (Equation 38) in $\exp(\mathbf{A}_6 \log(\mathbf{A}_5 \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n}))))))$ results in a vector with the scalar 1 and the ratios of observed and expected Guttman errors for the $K$ item-pairs

$$\mathbf{g}_6^W = \exp\left(\mathbf{A}_6 \mathbf{g}_5^W\right) = \begin{pmatrix} 1 \\ F_{ab}^W/E_{ab} \\ F_{ac}^W/E_{ac} \\ F_{bc}^W/E_{bc} \end{pmatrix}. \qquad (40)$$

To compute the item-pair scalability coefficients, the ratios need to be subtracted from the scalar 1. The $K \times (K+1)$ matrix $\mathbf{A}_7$ consists of a unit vector and a negative identity matrix, both of order $K$. Implementing $\mathbf{g}_6^W$ in $\mathbf{A}_7 \exp(\mathbf{A}_6 \log(\mathbf{A}_5 \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n}))))))$ yields

$$\mathbf{g}^W(\mathbf{n}) = \mathbf{A}_7 \mathbf{g}_6^W = \left(\mathbf{1}_K \;\; -\mathbf{I}_K\right) \mathbf{g}_6^W = \begin{pmatrix} 1 - F_{ab}^W/E_{ab} \\ 1 - F_{ac}^W/E_{ac} \\ 1 - F_{bc}^W/E_{bc} \end{pmatrix} = \begin{pmatrix} H_{ab}^W \\ H_{ac}^W \\ H_{bc}^W \end{pmatrix}. \qquad (41)$$
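To see the final steps of this composition in action, the following sketch (Python, for illustration only; the frequency numbers are hypothetical, not taken from Table 1) applies matrices of the same shape as $\mathbf{A}_6$ and $\mathbf{A}_7$ for $K = 3$ item pairs and recovers $1 - F_{ij}^W/E_{ij}$:

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# Hypothetical observed (F) and expected (E) weighted Guttman errors
# for K = 3 item pairs -- illustrative numbers only.
F = [2.0, 1.0, 3.0]
E = [4.0, 5.0, 6.0]

# g5: log of (F_ab copy, F_ab, F_ac, F_bc, E_ab, E_ac, E_bc), as in Eq. 38.
g5 = [math.log(v) for v in [F[0]] + F + E]

# A6 (shape of Eq. 39): first row gives log F_ab - log F_ab = 0 -> scalar 1
# after exp; remaining rows give log F_ij - log E_ij -> the ratios.
A6 = [[1, -1, 0, 0,  0,  0,  0],
      [0,  1, 0, 0, -1,  0,  0],
      [0,  0, 1, 0,  0, -1,  0],
      [0,  0, 0, 1,  0,  0, -1]]
g6 = [math.exp(v) for v in matvec(A6, g5)]   # (1, F/E ratios)

# A7 (shape of Eq. 41): unit vector next to minus identity -> H = 1 - F/E.
A7 = [[1, -1,  0,  0],
      [1,  0, -1,  0],
      [1,  0,  0, -1]]
H = matvec(A7, g6)

print([round(h, 3) for h in H])  # [0.5, 0.8, 0.5]
```

The copied first entry of $\mathbf{g}_5$ is what makes the constant 1 available inside a purely multiplicative (exp-log) pipeline.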

For the between-rater scalability coefficients, all but the first design matrix are identical. Therefore, writing the between-rater item-pair scalability coefficients $H_{ij}^B$ (Equation 9 and 20) in recursive exp-log notation results in

$$H_{ij}^B = f(\mathbf{n}) = \mathbf{A}_7 \exp(\mathbf{A}_6 \log(\mathbf{A}_5 \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^B \mathbf{n})))))). \qquad (42)$$

The $S(B+U+1) \times L^*$ matrix $\mathbf{A}_1^B$ differs from $\mathbf{A}_1^W$, because it needs to select the bivariate item-score patterns for different raters within a subject. This results in sub-matrices $\mathbf{B}_s^B$ and $\mathbf{U}_s^B$ and vector $\mathbf{r}_s^B$ per subject. Sub-matrix $\mathbf{B}_s^B$ specifies how often an item-score pattern contributes to a particular bivariate between-rater frequency. As a result, entry $(b, l)$ reflects how many combinations of the $l$-th pattern with the other patterns (within the same subject) contribute to the $b$-th bivariate frequency. For example, in vector $\mathbf{n}$ (Equation 18), for subject 1, pattern $n_{1abc}^{000}$ contributes to the bivariate frequency $(X_{sra} = 0, X_{spb} = 0)$ when combined with patterns $n_{1abc}^{001}$ and $n_{1abc}^{201}$, since these patterns reflect value 0 on item $b$. Therefore, the first entry is 2. The first row of $\mathbf{B}_1^B$ is (2, 2, 3, 0, 0). Both the $\mathbf{U}_s^B$-matrix and the $\mathbf{r}_s^B$-vector are computed by multiplying $\mathbf{U}_s^W$ and $\mathbf{r}_s^W$ with $(R_s - 1)$. Consequently, for Table 1, the first row of $\mathbf{U}_1^B$ is (4, 4, 4, 0, 0), and vector $\mathbf{r}_1^B$ is (4, 4, 4, 4, 4). In general, matrix $\mathbf{A}_1^B$ is

$$\mathbf{A}_1^B = \bigoplus_{s=1}^{S} \begin{pmatrix} \mathbf{B}_s^B \\ \mathbf{U}_s^B \\ \mathbf{r}_s^B \end{pmatrix} = \begin{pmatrix} \mathbf{B}_1^B & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{U}_1^B & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{r}_1^B & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_2^B & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{U}_2^B & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{r}_2^B & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{B}_S^B \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{U}_S^B \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{r}_S^B \end{pmatrix} \qquad (43)$$

When matrix $\mathbf{A}_1^B$ is multiplied with the $\mathbf{n}$-vector, the result is vector $\mathbf{g}_1^B$ with the raw between-rater bivariate frequencies ($n_{sabc}^{xy+B}$, in vector $\mathbf{n}_{sij}^B$), univariate frequencies ($n_{sabc}^{x++B}$, in vector $\mathbf{n}_{si}^B$), and rater frequencies $R_s(R_s - 1)$ per subject

$$\mathbf{g}_1^B = \log\left(\mathbf{A}_1^B \mathbf{n}\right) = \log\left(\mathbf{A}_1^B \begin{pmatrix} \mathbf{n}_1 \\ \mathbf{n}_2 \\ \vdots \\ \mathbf{n}_S \end{pmatrix}\right) = \log \begin{pmatrix} \mathbf{n}_{1ab}^B \\ \mathbf{n}_{1ac}^B \\ \mathbf{n}_{1bc}^B \\ \mathbf{n}_{1a}^B \\ \mathbf{n}_{1b}^B \\ \mathbf{n}_{1c}^B \\ R_1(R_1 - 1) \\ \vdots \\ \mathbf{n}_{Sab}^B \\ \mathbf{n}_{Sac}^B \\ \mathbf{n}_{Sbc}^B \\ \mathbf{n}_{Sa}^B \\ \mathbf{n}_{Sb}^B \\ \mathbf{n}_{Sc}^B \\ R_S(R_S - 1) \end{pmatrix} \qquad (44)$$

Matrices $\mathbf{A}_2$ to $\mathbf{A}_7$ are identical for the between-rater coefficients. However, the resulting vectors apply to the between-rater frequencies. This results in

$$\mathbf{g}_2^B = \begin{pmatrix} \mathbf{b}_{1ab} \\ \mathbf{b}_{1ac} \\ \mathbf{b}_{1bc} \\ \mathbf{f}_{1a} \\ \mathbf{f}_{1b} \\ \mathbf{f}_{1c} \\ \vdots \\ \mathbf{b}_{Sab} \\ \mathbf{b}_{Sac} \\ \mathbf{b}_{Sbc} \\ \mathbf{f}_{Sa} \\ \mathbf{f}_{Sb} \\ \mathbf{f}_{Sc} \end{pmatrix}, \; \mathbf{g}_3^B = \begin{pmatrix} \mathbf{b}_{ab} \\ \mathbf{b}_{ac} \\ \mathbf{b}_{bc} \\ \mathbf{f}_a \\ \mathbf{f}_b \\ \mathbf{f}_c \\ S \end{pmatrix}, \; \mathbf{g}_4^B = \begin{pmatrix} \mathbf{b}_{ab} \\ \mathbf{b}_{ac} \\ \mathbf{b}_{bc} \\ \mathbf{e}_{ab} \\ \mathbf{e}_{ac} \\ \mathbf{e}_{bc} \end{pmatrix}, \; \mathbf{g}_5^B = \begin{pmatrix} F_{ab}^B \\ F_{ab}^B \\ F_{ac}^B \\ F_{bc}^B \\ E_{ab} \\ E_{ac} \\ E_{bc} \end{pmatrix}, \; \mathbf{g}_6^B = \begin{pmatrix} 1 \\ F_{ab}^B/E_{ab} \\ F_{ac}^B/E_{ac} \\ F_{bc}^B/E_{bc} \end{pmatrix}, \; \mathbf{g}^B(\mathbf{n}) = \begin{pmatrix} H_{ab}^B \\ H_{ac}^B \\ H_{bc}^B \end{pmatrix}. \qquad (45)$$

3.1.3 Item scalability coefficients in exp-log notation

The recursive exp-log notation for the within-rater item scalability coefficients (Equation 7 and 21) is

$$H_i^W = g^{W\dagger}(\mathbf{n}) = \mathbf{A}_7^\dagger \exp(\mathbf{A}_6^\dagger \log(\mathbf{A}_5^\dagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^W \mathbf{n})))))). \qquad (46)$$

As can be seen from Equation 46, adjustments need only be made to design matrices $\mathbf{A}_5$ to $\mathbf{A}_7$. The difference between the item coefficients and the item-pair coefficients is that the Guttman errors need to be summed over the item combinations per item $i$ (Equation 7). Therefore, the steps up to the computation of these Guttman errors are identical.

To get matrix $\mathbf{A}_5^\dagger$, matrix $\mathbf{A}_5$ (Equation 38) needs to be premultiplied with the $(2I+1) \times (2K+1)$ matrix $\mathbf{S}^\dagger$. Let row $\mathbf{c}_1^T$ be a copy of the first row of matrix $\mathbf{J}$, and let matrix $\mathbf{J} = (\mathbf{J}_1 \; \mathbf{J}_2 \ldots \mathbf{J}_{I-1})$. For $i = 1, 2, \ldots, I-1$, the $I \times (I-i)$ sub-matrix $\mathbf{J}_i$ is

$$\mathbf{J}_i = \begin{pmatrix} \mathbf{0}_{(i-1) \times (I-i)} \\ \mathbf{1}_{1 \times (I-i)}^T \\ \mathbf{I}_{I-i} \end{pmatrix}. \qquad (47)$$

Then matrix $\mathbf{S}^\dagger$ is

$$\mathbf{S}^\dagger = \begin{pmatrix} 0 & \mathbf{c}_1^T & \mathbf{0} \\ \mathbf{0} & \mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{J} \end{pmatrix}. \qquad (48)$$

Implementing $\mathbf{g}_4^W$ (Equation 36) in $\log(\mathbf{A}_5^\dagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^W \mathbf{n})))))$ results in the sums of the Guttman errors

$$\mathbf{g}_5^{W\dagger} = \log\left(\mathbf{A}_5^\dagger \mathbf{g}_4^W\right) = \log\left(\mathbf{S}^\dagger \mathbf{A}_5 \mathbf{g}_4^W\right) = \log \begin{pmatrix} \sum_{i \neq a} F_{ia}^W \\ \sum_{i \neq a} F_{ia}^W \\ \sum_{i \neq b} F_{ib}^W \\ \sum_{i \neq c} F_{ic}^W \\ \sum_{i \neq a} E_{ia} \\ \sum_{i \neq b} E_{ib} \\ \sum_{i \neq c} E_{ic} \end{pmatrix}. \qquad (49)$$

Matrices $\mathbf{A}_6^\dagger$ and $\mathbf{A}_7^\dagger$ only differ from $\mathbf{A}_6$ and $\mathbf{A}_7$ (Equations 40 and 41) in that the sub-matrices and vectors are of order $I$ rather than $K$. Implementing $\mathbf{g}_5^{W\dagger}$ results in

$$\mathbf{g}_6^{W\dagger} = \exp\left(\mathbf{A}_6^\dagger \mathbf{g}_5^{W\dagger}\right) = \exp\left(\begin{pmatrix} 1 & -1 & \mathbf{o}_{I-1}^T & \mathbf{o}_I^T \\ \mathbf{o}_I & \mathbf{I}_I & -\mathbf{I}_I \end{pmatrix} \log \begin{pmatrix} \sum_{i \neq a} F_{ia}^W \\ \sum_{i \neq a} F_{ia}^W \\ \sum_{i \neq b} F_{ib}^W \\ \sum_{i \neq c} F_{ic}^W \\ \sum_{i \neq a} E_{ia} \\ \sum_{i \neq b} E_{ib} \\ \sum_{i \neq c} E_{ic} \end{pmatrix}\right) = \begin{pmatrix} 1 \\ F_a^W/E_a \\ F_b^W/E_b \\ F_c^W/E_c \end{pmatrix}, \qquad (50)$$

and multiplying $\mathbf{A}_7^\dagger$ with $\mathbf{g}_6^{W\dagger}$ results in

$$g^{W\dagger}(\mathbf{n}) = \mathbf{A}_7^\dagger \mathbf{g}_6^{W\dagger} = \left(\mathbf{1}_I \;\; -\mathbf{I}_I\right) \begin{pmatrix} 1 \\ F_a^W/E_a \\ F_b^W/E_b \\ F_c^W/E_c \end{pmatrix} = \begin{pmatrix} 1 - F_a^W/E_a \\ 1 - F_b^W/E_b \\ 1 - F_c^W/E_c \end{pmatrix} = \begin{pmatrix} H_a^W \\ H_b^W \\ H_c^W \end{pmatrix}. \qquad (51)$$

The recursive exp-log notation for the between-rater item scalability coefficients (Equation 10 and 22) is

$$H_i^B = g^{B\dagger}(\mathbf{n}) = \mathbf{A}_7^\dagger \exp(\mathbf{A}_6^\dagger \log(\mathbf{A}_5^\dagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^B \mathbf{n})))))). \qquad (52)$$

No further differences in matrices $\mathbf{A}_5^\dagger$ to $\mathbf{A}_7^\dagger$ are specified, although the results apply to the between-rater frequencies

$$\mathbf{g}_5^{B\dagger} = \log \begin{pmatrix} \sum_{i \neq a} F_{ia}^B \\ \sum_{i \neq a} F_{ia}^B \\ \sum_{i \neq b} F_{ib}^B \\ \sum_{i \neq c} F_{ic}^B \\ \sum_{i \neq a} E_{ia} \\ \sum_{i \neq b} E_{ib} \\ \sum_{i \neq c} E_{ic} \end{pmatrix}, \quad \mathbf{g}_6^{B\dagger} = \begin{pmatrix} 1 \\ F_a^B/E_a \\ F_b^B/E_b \\ F_c^B/E_c \end{pmatrix}, \quad g^{B\dagger}(\mathbf{n}) = \begin{pmatrix} H_a^B \\ H_b^B \\ H_c^B \end{pmatrix}. \qquad (53)$$

3.1.4 Scale coefficient in exp-log notation

The recursive exp-log notation for the within-rater total scale scalability coefficient (Equation 8 and 23) is

$$H^W = g^{W\ddagger}(\mathbf{n}) = \mathbf{A}_7^\ddagger \exp(\mathbf{A}_6^\ddagger \log(\mathbf{A}_5^\ddagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^W \mathbf{n})))))). \qquad (54)$$

Similarly to the changes between the item-pair and item scalability coefficients, the differences in the functions to compute the total scale coefficient only affect matrices $\mathbf{A}_5$ to $\mathbf{A}_7$. To compute design matrix $\mathbf{A}_5^\ddagger$, matrix $\mathbf{A}_5$ needs to be premultiplied with the $3 \times (2K+1)$ matrix $\mathbf{S}^\ddagger$

$$\mathbf{S}^\ddagger = \begin{pmatrix} 0 & \mathbf{1}_K^T & \mathbf{o}_K^T \\ 0 & \mathbf{1}_K^T & \mathbf{o}_K^T \\ 0 & \mathbf{o}_K^T & \mathbf{1}_K^T \end{pmatrix} \qquad (55)$$

Implementing $\mathbf{g}_4^W$ in $\log(\mathbf{A}_5^\ddagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^W \mathbf{n})))))$ results in

$$\mathbf{g}_5^{W\ddagger} = \log\left(\mathbf{A}_5^\ddagger \mathbf{g}_4\right) = \log\left(\mathbf{S}^\ddagger \mathbf{A}_5 \mathbf{g}_4\right) = \log \begin{pmatrix} \sum\sum_{i \neq j} F_{ij}^W \\ \sum\sum_{i \neq j} F_{ij}^W \\ \sum\sum_{i \neq j} E_{ij} \end{pmatrix}. \qquad (56)$$

The $2 \times 3$ design matrix $\mathbf{A}_6^\ddagger$ consists of 1 and $-1$ values, and is constructed to divide the total observed by the total expected Guttman errors. Implementing $\mathbf{g}_5^{W\ddagger}$ (Equation 56) in $\exp(\mathbf{A}_6^\ddagger \log(\mathbf{A}_5^\ddagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^W \mathbf{n}))))))$ results in

$$\mathbf{g}_6^{W\ddagger} = \exp\left(\mathbf{A}_6^\ddagger \mathbf{g}_5^{W\ddagger}\right) = \exp\left(\begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \end{pmatrix} \log \begin{pmatrix} \sum\sum_{i \neq j} F_{ij} \\ \sum\sum_{i \neq j} F_{ij} \\ \sum\sum_{i \neq j} E_{ij} \end{pmatrix}\right) = \begin{pmatrix} 1 \\ F/E \end{pmatrix}. \qquad (57)$$

The $1 \times 2$ matrix $\mathbf{A}_7^\ddagger$ contains the values 1 and $-1$. Multiplying this matrix with $\mathbf{g}_6^{W\ddagger}$ results in

$$g^{W\ddagger}(\mathbf{n}) = \mathbf{A}_7^\ddagger \mathbf{g}_6^{W\ddagger} = \begin{pmatrix} 1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ F/E \end{pmatrix} = 1 - F/E = H^W. \qquad (58)$$

The recursive exp-log notation for the between-rater total scale scalability coefficient (Equation 11 and 24) is

$$H^B = g^{B\ddagger}(\mathbf{n}) = \mathbf{A}_7^\ddagger \exp(\mathbf{A}_6^\ddagger \log(\mathbf{A}_5^\ddagger \exp(\mathbf{A}_4 \log(\mathbf{A}_3 \exp(\mathbf{A}_2 \log(\mathbf{A}_1^B \mathbf{n})))))). \qquad (59)$$

Again, for the between-rater total scale coefficient only the resulting vectors differ:

$$\mathbf{g}_5^{B\ddagger} = \log \begin{pmatrix} \sum\sum_{i \neq j} F_{ij}^B \\ \sum\sum_{i \neq j} F_{ij}^B \\ \sum\sum_{i \neq j} E_{ij} \end{pmatrix}, \quad \mathbf{g}_6^{B\ddagger} = \begin{pmatrix} 1 \\ F^B/E \end{pmatrix}, \quad g^{B\ddagger}(\mathbf{n}) = H^B. \qquad (60)$$

3.1.5 Ratio HB/HW in exp-log notation

The ratio of the two total scale scalability coefficients, $H^B/H^W$, is written in the recursive exp-log notation as

$$H^B/H^W = g^\star(\mathbf{n}) = \exp(\mathbf{A}_8^\star \log(\mathbf{A}_7^\star \exp(\mathbf{A}_6^\star \log(\mathbf{A}_5^\star \exp(\mathbf{A}_4^\star \log(\mathbf{A}_3^\star \exp(\mathbf{A}_2^\star \log(\mathbf{A}_1^\star \mathbf{n})))))))). \qquad (61)$$

In general, most of these design matrices are a concatenation of the matrices used to compute $H^B$ and $H^W$:

$$\mathbf{A}_1^\star = \begin{pmatrix} \mathbf{A}_1^B \\ \mathbf{A}_1^W \end{pmatrix}, \qquad \mathbf{A}_i^\star = \mathbf{I}_2 \otimes \mathbf{A}_i = \begin{pmatrix} \mathbf{A}_i & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_i \end{pmatrix} \quad \text{for } i = 2, \ldots, 4, 5^\ddagger, \ldots, 7^\ddagger. \qquad (62)$$

In addition, the $1 \times 2$ matrix $\mathbf{A}_8^\star$ is defined as $(1 \; -1)$. The first part of Equation 61, from $\mathbf{A}_7^\star$ onwards, results in vector $\mathbf{g}_7^\star = (H^B \; H^W)^T$. Implementing this vector in $\exp(\mathbf{A}_8^\star \log(\mathbf{A}_7^\star \exp(\mathbf{A}_6^\star \log(\mathbf{A}_5^\star \exp(\mathbf{A}_4^\star \log(\mathbf{A}_3^\star \exp(\mathbf{A}_2^\star \log(\mathbf{A}_1^\star \mathbf{n}))))))))$ results in

$$g^\star(\mathbf{n}) = \exp\left(\mathbf{A}_8^\star \log(\mathbf{g}_7^\star)\right) = \exp\left(\begin{pmatrix} 1 & -1 \end{pmatrix} \log \begin{pmatrix} H^B \\ H^W \end{pmatrix}\right) = H^B/H^W. \qquad (63)$$

3.2 Deriving the matrix of first-order partial derivatives

Let the Jacobian of $g(\mathbf{n})$, the matrix of first-order partial derivatives with respect to $\mathbf{n}$, be $\mathbf{G} \equiv \mathbf{G}(\mathbf{n})$. For function $g_i(\mathbf{n})$ ($i = 0, 1, \ldots, q$), the Jacobian equals

$$\mathbf{G}_i = \frac{\partial \mathbf{g}_i(\mathbf{n})}{\partial \mathbf{n}^T}. \qquad (64)$$

The method of rewriting the scalability functions in recursive exp-log notation enables relatively straightforward computation of the Jacobian. The scalability coefficients in this notation are eight nested functions of $\mathbf{n}$

$$\underbrace{\mathbf{A}_7 \underbrace{\exp(\mathbf{A}_6 \underbrace{\log(\mathbf{A}_5 \underbrace{\exp(\mathbf{A}_4 \underbrace{\log(\mathbf{A}_3 \underbrace{\exp(\mathbf{A}_2 \underbrace{\log(\mathbf{A}_1^W \underbrace{\mathbf{n}}_{\mathbf{g}_0})}_{\mathbf{g}_1})}_{\mathbf{g}_2})}_{\mathbf{g}_3})}_{\mathbf{g}_4})}_{\mathbf{g}_5})}_{\mathbf{g}_6}}_{\mathbf{g}_7}. \qquad (65)$$

Due to the composition of this notation, it is possible to apply the chain rule to the function. The chain rule is used to differentiate a function of another function, for example $y = f(g(x))$ (e.g., Stewart, 2008). First, substitute $g(x)$ with $u$ to obtain $y = f(u)$. Then the derivative of $y$ is

$$\frac{dy}{dx} = \frac{dy}{du} \times \frac{du}{dx}. \qquad (66)$$

Let $\mathbf{D}(\mathbf{x})$ be a diagonal matrix with $\mathbf{x}$ on the diagonal, and let $\mathbf{D}^{-1}$ be the inverse of matrix $\mathbf{D}$. Applying the chain rule to functions $\mathbf{g}_0$ to $\mathbf{g}_7$ results in

$$\mathbf{G}_i = \begin{cases} \mathbf{I} & \text{if } i = 0 \\ \mathbf{D}^{-1}(\mathbf{A}_i \mathbf{g}_{i-1}) \, \mathbf{A}_i \, \mathbf{G}_{i-1} & \text{if } i \in \{1, 3, 5\} \\ \mathbf{D}\left(\exp(\mathbf{A}_i \mathbf{g}_{i-1})\right) \mathbf{A}_i \, \mathbf{G}_{i-1} & \text{if } i \in \{2, 4, 6\} \\ \mathbf{A}_i \, \mathbf{G}_{i-1} & \text{if } i = 7 \end{cases}. \qquad (67)$$
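The recursion can be checked numerically. The sketch below (Python; the two small matrices are made up and merely stand in for the actual design matrices) computes the Jacobian of $\exp(\mathbf{A}_2 \log(\mathbf{A}_1 \mathbf{n}))$ both via the recursion and via finite differences:

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Toy design matrices (illustrative only, not the thesis matrices).
A1 = [[1, 1, 0], [0, 1, 1]]
A2 = [[1, -1], [1, 1]]
n = [2.0, 3.0, 4.0]

def g(n):
    g1 = [math.log(v) for v in matvec(A1, n)]      # g1 = log(A1 n)
    return [math.exp(v) for v in matvec(A2, g1)]   # g2 = exp(A2 g1)

# Jacobian via the recursion of Equation 67:
# G1 = D^{-1}(A1 n) A1,  G2 = D(exp(A2 g1)) A2 G1.
t1 = matvec(A1, n)
G1 = [[A1[i][j] / t1[i] for j in range(3)] for i in range(2)]
g2 = g(n)
A2G1 = matmul(A2, G1)
G2 = [[g2[i] * A2G1[i][j] for j in range(3)] for i in range(2)]

# Finite-difference check of every column of G2.
eps = 1e-6
for j in range(3):
    nq = list(n)
    nq[j] += eps
    col = [(a - b) / eps for a, b in zip(g(nq), g(n))]
    assert all(abs(col[i] - G2[i][j]) < 1e-4 for i in range(2))
```

Because products and quotients become sums in the log domain, each layer of the composition contributes only a diagonal rescaling and one matrix product to the Jacobian, exactly as in the recursion.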

Computing both the design matrices and the matrices of partial derivatives for functions $\mathbf{g}_1$ to $\mathbf{g}_3$ can be computationally quite demanding as more items are used and more subjects are scaled. For example, using 10 four-category items to scale 10 subjects can result in an $\mathbf{A}_1$ matrix of order $7{,}610 \times 10{,}485{,}760$. However, these steps are not necessary, and both $\mathbf{A}_3\mathbf{g}_2$ and its derivative can be computed directly from the data. This is convenient, since the order of matrices $\mathbf{A}_1$ to $\mathbf{A}_3$ is multiplied by $S$, whereas the order of the other matrices only depends on $B$ or $U$ and $L^*$. As specified in Equation 35, $\mathbf{A}_3\mathbf{g}_2^W$ results in a vector with the estimated bivariate and univariate within-rater frequencies and the number of subjects $S$

$$\mathbf{A}_3 \mathbf{g}_2^W = \begin{pmatrix} \mathbf{f}_{ab} \\ \mathbf{f}_{ac} \\ \mathbf{f}_{bc} \\ \mathbf{f}_a \\ \mathbf{f}_b \\ \mathbf{f}_c \\ S \end{pmatrix}. \qquad (68)$$

Computing its Jacobian results in the $(B+U+1) \times L^*$ matrix $\mathbf{A}_3\mathbf{G}_2^W$, consisting of sub-matrices $\mathbf{G}_b^W$ and $\mathbf{G}_u^W$ and a zero-vector of length $L^*$. The $B \times L^*$ sub-matrix $\mathbf{G}_b^W$ relates the $B$ bivariate within-rater frequencies to the $L^*$ observed response patterns. Entry $(b, l)$ takes value $\hat{f}_{sij}^{xy}/n_{sij}^{xyW} - \hat{f}_{sij}^{xy}/R_s$ if in the $l$-th response pattern the score on item $i$ equals $x$ and the score on item $j$ equals $y$; value $-\hat{f}_{sij}^{xy}/R_s$ if in the $l$-th response pattern the score on item $i$ does not equal $x$ or the score on item $j$ does not equal $y$ (or both) and frequency $\hat{f}_{sij}^{xy} > 0$; and value 0 if $\hat{f}_{sij}^{xy} = 0$. The $U \times L^*$ sub-matrix $\mathbf{G}_u^W$ relates the $U$ univariate frequencies to the $L^*$ observed response patterns. Entry $(u, l)$ takes value $\hat{f}_{si}^{x}/n_{si}^{xW} - \hat{f}_{si}^{x}/R_s$ if in the $l$-th response pattern the score on item $i$ equals $x$; value $-\hat{f}_{si}^{x}/R_s$ if in the $l$-th response pattern the score on item $i$ does not equal $x$ and frequency $\hat{f}_{si}^{x} > 0$; and value 0 if $\hat{f}_{si}^{x} = 0$. The resulting matrix is

$$\mathbf{A}_3 \frac{\partial \mathbf{g}_2^W(\mathbf{n})}{\partial \mathbf{n}^T} = \begin{pmatrix} \mathbf{G}_b^W \\ \mathbf{G}_u^W \\ \mathbf{o}_{L^*}^T \end{pmatrix}. \qquad (69)$$

For between-rater scalability coefficients A3gB2 results in a vector with the

estimated bivariate and univariate frequencies between raters the number of subjects S

(34)

A3 gB2 =                bab bac bbc fa fb fc S                . (70)

Computing its Jacobian results in the $(B+U+1) \times L^*$ matrix $\mathbf{A}_3\mathbf{G}_2^B$, consisting of sub-matrices $\mathbf{G}_b^B$ and $\mathbf{G}_u^B$ and a zero-vector of length $L^*$. Let $a_{sbl}$ be element $(b, l)$ of matrix $\mathbf{B}_s^B$ (Equation 43). The $B \times L^*$ sub-matrix $\mathbf{G}_b^B$ relates the $B$ bivariate between-rater frequencies to the $L^*$ observed response patterns. Entry $(b, l)$ takes value $\hat{b}_{sij}^{xy} a_{sbl}/n_{sij}^{xyB} - \hat{b}_{sij}^{xy}/R_s$ if in the $l$-th response pattern the score on item $i$ equals $x$ and for another response pattern the score on item $j$ equals $y$; value $-\hat{b}_{sij}^{xy}/R_s$ if in the $l$-th pattern the score on item $i$ does not equal $x$ and frequency $\hat{b}_{sij}^{xy} > 0$; and value 0 if $\hat{b}_{sij}^{xy} = 0$. Let $a_{sul}$ be element $(u, l)$ of matrix $\mathbf{U}_s^B$ (Equation 43). The $U \times L^*$ sub-matrix $\mathbf{G}_u^B$ relates the $U$ univariate frequencies to the $L^*$ observed response patterns. Entry $(u, l)$ takes value $\hat{b}_{si}^{x} a_{sul}/n_{si}^{xB} - \hat{b}_{si}^{x}/R_s$ if in the $l$-th response pattern within that subject the score on item $i$ equals $x$; value $-\hat{b}_{si}^{x}/R_s$ if in the $l$-th response pattern the score on item $i$ does not equal $x$ and frequency $\hat{b}_{si}^{x} > 0$; and value 0 if $\hat{b}_{si}^{x} = 0$. The resulting matrix is

$$\mathbf{A}_3 \frac{\partial \mathbf{g}_2^B(\mathbf{n})}{\partial \mathbf{n}^T} = \begin{pmatrix} \mathbf{G}_b^B \\ \mathbf{G}_u^B \\ \mathbf{o}_{L^*}^T \end{pmatrix}. \qquad (71)$$

Omitting design matrices $\mathbf{A}_1$ to $\mathbf{A}_3$ considerably reduces the computational burden.

3.3 Applying the delta method

To compute standard errors of the scalability coefficients, the delta method was applied. The delta method approximates the variance of a transformation of a variable (e.g., Agresti, 2002; Sen & Singer, 1993). Let $\mathbf{V}_{\mathbf{n}}$ be the variance-covariance matrix of vector $\mathbf{n}$. Assuming $\mathbf{n}$ is sampled from a multinomial distribution, its variance-covariance matrix is $\mathbf{V}_{\mathbf{n}} = \mathbf{D}(\mathbf{n}) - \mathbf{n} N^{-1} \mathbf{n}^T$. According to the delta method, the variance-covariance matrix of the transformation $g(\mathbf{n})$ of vector $\mathbf{n}$ is approximated by

$$\mathbf{V}_{g(\mathbf{n})} \approx \mathbf{G} \mathbf{V}_{\mathbf{n}} \mathbf{G}^T = \mathbf{G}\left(\mathbf{D}(\mathbf{n}) - \mathbf{n} N^{-1} \mathbf{n}^T\right) \mathbf{G}^T = \mathbf{G}\mathbf{D}(\mathbf{n})\mathbf{G}^T - \mathbf{G}\mathbf{n} N^{-1} \mathbf{n}^T \mathbf{G}^T. \qquad (72)$$

All scalability functions are homogeneous functions of order 0. Therefore, $g(\mathbf{n})$ does not change when $\mathbf{n}$ is multiplied by the same constant $t$: $g(t\mathbf{n}) = g(\mathbf{n})$. Consequently, according to Euler's homogeneous function theorem (e.g., Weisstein, 2011), $\mathbf{Gn} = \mathbf{0}$, and Equation 72 simplifies to $\mathbf{G}\mathbf{D}(\mathbf{n})\mathbf{G}^T$.
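As a sanity check on this simplification, consider a deliberately simple 0-homogeneous function, $g(\mathbf{n}) = n_1/(n_1+n_2)$ (a Python sketch, not the scalability coefficients themselves): its gradient satisfies $\mathbf{Gn} = 0$, and $\mathbf{G}\mathbf{D}(\mathbf{n})\mathbf{G}^T$ reproduces the familiar binomial variance $p(1-p)/N$.

```python
import math

n = [30.0, 70.0]                 # hypothetical multinomial counts
N = sum(n)
p = n[0] / N                     # g(n) = n1/(n1 + n2), homogeneous of order 0

# Gradient of g(n): G = (n2/N^2, -n1/N^2).
G = [n[1] / N**2, -n[0] / N**2]

# Euler's theorem for 0-homogeneous functions: G n = 0.
assert abs(G[0] * n[0] + G[1] * n[1]) < 1e-12

# Delta-method variance G D(n) G^T (the G n N^{-1} n^T G^T term vanishes).
var = G[0]**2 * n[0] + G[1]**2 * n[1]
se = math.sqrt(var)

# Equals the textbook binomial standard error sqrt(p(1-p)/N).
assert abs(se - math.sqrt(p * (1 - p) / N)) < 1e-12
print(round(se, 4))  # 0.0458
```

For the scalability coefficients the same cancellation holds, which is why only $\mathbf{G}\mathbf{D}(\mathbf{n})\mathbf{G}^T$ needs to be computed in practice.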

For the data in Table 1, Table 4 shows the within- and between-rater coefficients and their standard errors. Although the values for $H_{ac}$ are above the threshold of 0.1, these estimates are uncertain, as is reflected by a standard error of 0.301. A larger sample is necessary to obtain more precise estimates of the $H$-coefficients. As is clear from this small data example, the standard error of the ratio is large in comparison to the other coefficients. This difference is expected to be present in other datasets as well. Section 4 presents a more extensive data example.
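Given a coefficient and its standard error, a Wald-type 95% confidence interval follows directly. For $H_{13}^W = 0.327$ with standard error 0.301 from Table 4 (sketch in Python):

```python
# Wald-type 95% confidence interval for H_13^W from Table 4.
H, se = 0.327, 0.301
z = 1.96                     # standard-normal critical value
ci = (round(H - z * se, 3), round(H + z * se, 3))
print(ci)  # (-0.263, 0.917)
```

The interval covers both the threshold and zero, which underlines how little this small sample reveals about the coefficient.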


Table 4:

Within- (W) and between-rater (B) scalability coefficients (H), including their standard errors (se) for the example data from Table 1.

          W (se)            B (se)
H12       0.146 (0.273)    -0.059 (0.254)
H13       0.327 (0.301)     0.222 (0.309)
H23      -0.379 (0.240)    -0.365 (0.272)
H1        0.238 (0.175)     0.084 (0.170)
H2       -0.078 (0.190)    -0.189 (0.179)
H3        0.031 (0.245)    -0.024 (0.279)
H         0.072 (0.164)    -0.037 (0.178)
HB/HW    -0.506 (2.852)

4 Data example in R

The discussed process of constructing the design matrices, calculating the two-level scalability coefficients, recursively applying the chain rule to retrieve the Jacobian, and applying the delta method has been programmed by the author in the software package R (R Core Team, 2013), and will be adopted in the R package mokken (Van der Ark, 2007, 2012). In this section, the programmed functions are explained and applied to a simulated dataset. The results of the delta method are compared to the results of a non-parametric bootstrap, a method considered robust for computing standard errors.

Computation of the two-level scalability coefficients and their standard errors has been implemented in the function MLcoefH(). The argument of this function is a data matrix with one subject column and one column per item, similar to the matrix in Table 1. By default the subject column is expected to be the first column, but this may be specified differently. The function returns a list with seven matrices, each with a set of scalability coefficients and their standard errors. As a necessary addition to the MLcoefH() function, a second function, MLweight(), was programmed to compute the multilevel weights. As explained in Section 2.2.1, these weights are computed using the relative frequencies rather than the raw frequencies; therefore, the existing weight() function in the mokken package (Van der Ark, 2012) would give inappropriate results.

As an example, a dataset was simulated in R (for the code, see the Appendix), containing $S = 20$ subjects who were rated on $I = 5$ items with $m + 1 = 4$ response categories by $R_s = 10$ raters each. For this dataset, the two-level scalability coefficients were computed and the standard errors were estimated in two ways: by applying the delta method and by a non-parametric bootstrap with $B = 1000$ bootstrap samples. Table 5 shows the results for all computations. According to the guidelines presented in Section 2.2.2, all item-pair and item coefficients had values above the thresholds of 0.2 and 0.1 for within- and between-rater coefficients, respectively, although the threshold values would fall within the 95% confidence interval for some coefficients. The total scale coefficient and the ratio indicate that the current set of items forms a strong scale.

Comparing the two methods of computing standard errors revealed that although they were strongly related ($r = .95$), there was a small but significant difference between the methods ($M_{\text{difference}} = -0.004$, $SE_{\text{difference}} = 0.001$). Specifically, the delta method resulted in a lower estimate of the standard errors than the bootstrap for 30 of the 33 coefficients (91%).


Table 5:

The scalability coefficients ($\hat{H}$) and standard errors as computed with the delta method (SEd) and the bootstrap (SEb) for the simulated data example.

          H      SEd    SEb
HW12    0.413  0.070  0.075
HW13    0.486  0.055  0.058
HW14    0.352  0.061  0.066
HW15    0.365  0.063  0.066
HW23    0.321  0.061  0.062
HW24    0.282  0.062  0.068
HW25    0.308  0.058  0.061
HW34    0.332  0.057  0.063
HW35    0.463  0.053  0.057
HW45    0.247  0.064  0.067
HW1     0.404  0.035  0.038
HW2     0.327  0.040  0.043
HW3     0.397  0.031  0.034
HW4     0.302  0.039  0.043
HW5     0.345  0.040  0.041
HW      0.354  0.028  0.031
HB12    0.270  0.041  0.052
HB13    0.282  0.046  0.048
HB14    0.290  0.042  0.046
HB15    0.249  0.043  0.055
HB23    0.240  0.049  0.045
HB24    0.229  0.034  0.049
HB25    0.250  0.041  0.044
HB34    0.216  0.043  0.050
HB35    0.269  0.049  0.051
HB45    0.224  0.048  0.051
HB1     0.273  0.028  0.034
HB2     0.246  0.025  0.035
HB3     0.250  0.027  0.032
HB4     0.238  0.028  0.036
HB5     0.248  0.039  0.038
HB      0.251  0.023  0.028
HB/HW   0.708  0.069  0.063


5 Discussion

The purpose of this paper was to investigate the delta method as a way to estimate the standard errors of two-level scalability coefficients (Snijders, 2001; Crisan et al., in press). The results revealed that the recursive exp-log notation and the delta method can be applied to estimate the scalability coefficients and their standard errors. In addition, the computations have been implemented in the R function MLcoefH(). The data example in Section 4 revealed a small but systematic underestimation of the standard errors computed by the delta method compared to the non-parametric bootstrap. In general, the bootstrap is known for its robustness, but it is unclear whether it provides better estimates in a multilevel situation, and whether the small difference is of practical importance.

Although the difference between the results of the delta method and the bootstrap was small, this does not imply that the estimates are unbiased. Bias of the standard errors of one-level scalability coefficients appeared to be negligible, and coverage of the 95% confidence intervals was close to .95 (Kuijpers et al., 2016). Since the within-rater scalability coefficients are comparable to the one-level coefficients, their bias and coverage are expected to be similar. However, the effects of conditions specific to the multilevel situation, such as the number of raters per subject and discrepancies in the number of raters across subjects, remain to be demonstrated. For the between-rater scalability coefficients and the ratio of the two total scale coefficients, it is less clear what to expect regarding bias and coverage. A next step for future research should therefore be to set up an extensive simulation study to systematically investigate the bias and variance of the scalability coefficients and standard errors as approximated with the delta method, and possibly compare this to the simple and the clustered bootstrap method, the latter being appropriate to account for a multilevel structure in the data.
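The clustered bootstrap mentioned here differs from the simple bootstrap in the resampling unit: whole subjects, with all their raters, are drawn with replacement. A minimal sketch (Python; the (subject, scores) row layout is hypothetical):

```python
import random

def cluster_bootstrap_sample(rows, rng):
    """Resample subjects with replacement, keeping each subject's
    raters together. Rows are (subject_id, item_scores) tuples."""
    subjects = sorted({s for s, _ in rows})
    drawn = [rng.choice(subjects) for _ in subjects]
    sample = []
    for new_id, s in enumerate(drawn):
        # Relabel so repeated draws of a subject stay distinct clusters.
        sample.extend((new_id, scores) for sid, scores in rows if sid == s)
    return sample

rng = random.Random(1)
rows = [(s, (s, r)) for s in range(3) for r in range(2)]  # 3 subjects, 2 raters
boot = cluster_bootstrap_sample(rows, rng)

# Same total number of rows; every resampled cluster keeps its 2 raters.
assert len(boot) == len(rows)
sizes = {}
for sid, _ in boot:
    sizes[sid] = sizes.get(sid, 0) + 1
assert all(v == 2 for v in sizes.values())
```

Relabeling the drawn subjects is essential: a subject drawn twice must enter the bootstrap sample as two independent clusters, not one cluster of double size.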


Although the computation time of the function has been reduced considerably by direct computation of $\mathbf{A}_3\mathbf{g}_2$ and its derivative $\mathbf{A}_3\mathbf{G}_2$, the function is still time-consuming when the number of items and subjects is large. Currently, all standard errors are computed when running the function MLcoefH(). Since the running time of the function increases with the number of items, an alternative strategy might be considered, such as a stepwise approach. For example, the scalability coefficients might initially be computed without the standard errors, and only if the coefficients themselves are above a particular threshold would their standard errors be computed as well. At the moment, an option to compute the coefficients without the standard errors is lacking, but it could be implemented in the function relatively easily.

Two of the main reasons to compute standard errors are hypothesis testing and the construction of confidence intervals. Hypothesis testing is an often-used, but also debated, technique. One issue relevant in two-level Mokken scale analysis is the number of tests performed at once. For three items, 15 tests are performed, but for 20 items this number already increases to 423. One possible solution to control the Type I error rate is applying a global test to a set of coefficients, for example, testing the null hypothesis that all values in vector $\mathbf{H}_{ij}^W$ are above the threshold of 0.2 (Van der Ark et al., 2008). Another solution is applying a correction to control the family-wise error rate, such as a Bonferroni or Holm correction, or, since the tests might be considered dependent, the Westfall and Young algorithm.
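Holm's step-down procedure, one of the corrections mentioned above, is easy to state procedurally: sort the p-values, compare the $j$-th smallest to $\alpha/(K - j + 1)$, and stop rejecting at the first failure. A sketch (Python, with illustrative p-values):

```python
def holm_reject(pvals, alpha=0.05):
    """Return a list of booleans: which hypotheses Holm's step-down
    procedure rejects at family-wise error rate alpha."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    reject = [False] * K
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (K - step):
            reject[i] = True
        else:
            break            # once one test fails, all larger p-values fail
    return reject

# Illustrative p-values for K = 4 tests.
print(holm_reject([0.001, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Holm's correction is uniformly more powerful than Bonferroni while giving the same family-wise error guarantee, which makes it a natural default for the many tests arising in two-level Mokken scale analysis.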

Currently, it is common practice to compute the Guttman errors using the full dataset. This implies two assumptions. The first assumption is that the item ordering in the sample is identical to the item ordering in the population. However, random sampling variation can result in different orderings, especially when item-steps are close in the population. Kuijpers et al. (2013) provided evidence by means of a simulation study that this assumption is appropriate, since the bias of the estimates for $H$ and their standard errors was negligible, even when item-steps were identical in the population (Kuijpers et al., 2016). However, in this simulation study the item ordering was equal for the entire population, whereas the estimates for $H$ and their standard errors might be biased when the item ordering depends on $\theta$. By estimating the item ordering on the entire dataset, it is implied that this ordering is equal at all levels of the latent trait $\theta$, described as an invariant item ordering (i.e., non-intersection of the item response functions). Therefore, the second assumption is that the item ordering is equal for all values of $\theta$. According to the monotone homogeneity model, non-intersection of the item response functions is not necessary to scale the subjects on the latent scale. However, computation of the Guttman errors requires an ordering of the items, which implies that the double monotonicity model should hold. Therefore, a method to investigate invariant item ordering in multilevel test data is required. Furthermore, bias in the estimated coefficients and standard errors when there is no invariant item ordering should be investigated.

Two-level Mokken scale analysis comprises more than the computation of the two-level scalability coefficients and their standard errors. For instance, since the analysis is based on statistical models, methods should be developed to determine whether the underlying assumptions are satisfied. Several methods exist in one-level Mokken scale analysis to investigate whether the data reflect the underlying assumptions, such as conditional association and manifest monotonicity. The two-level situation has adjusted assumptions and a different data structure, raising the question of whether and how current methods can accommodate this structure. In addition, scalability coefficients of item-pairs, items, and scales can be used to partition items into one or multiple scales. One-level Mokken scale analysis provides an automated item selection procedure for this purpose. Such a procedure should be developed for the two-level scalability coefficients as well, since it gives insight into how items can form a scale.

To conclude, this paper contributes to the field of psychometrics in two ways. First, standard errors of the two-level scalability coefficients have been derived. Second, the estimation of both the scalability coefficients and their standard errors has been implemented in software. As a result, the quality of existing multilevel measurement instruments can be assessed. In addition, the development of two-level Mokken scale analysis can progress, which is crucial for developing new measurement instruments designed for multilevel test situations.

Appendix: R-code

> # Data example throughout thesis

> dataThesis <- data.frame(s = rep(1:3, c(5, 4, 6)),

+ Item1 = c(0, 0, 0, 1, 2, 1, 2, 2, 2, 0, 0, 0, 2, 2, 2),
+ Item2 = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 2, 2),
+ Item3 = c(0, 1, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 1, 0, 0))
>

> # Simulate data for thesis
> set.seed(1008)
> S <- 20
> Rs <- 10
> m <- 3
> I <- 5
> X <- matrix(0, Rs * S, I + 1) # I = 5, Rs = 10, S = 20.
> colnames(X) <- c("Subject", paste("Item", 1:I, sep = " "))
> n <- I * Rs

> rows <- 1

> itemorder <- rep(sort(runif(I, 0, 0.2)), each = Rs)
> for(i in 1:S){

+ deltas <- runif(2, 0, 0.8)

+ theta <- rep(runif(Rs, min(deltas), max(deltas)), Rs)
+ X[rows:(rows + (Rs - 1)), 1] <- i

+ X[rows:(rows + (Rs - 1)), -1] <- rbinom(n, m, theta + itemorder)
+ rows <- rows + 10

+ }

> TLresult <- MLcoefH(X, nice.output = F)

> ## When running the following code, be aware that it can take a while.


> # user system elapsed
> # 820.17 0.11 822.67
>
> # simple bootstrap
> B <- 1000
> ht1 <- ht2 <- ht3 <- ht4 <- ht5 <- ht6 <- ht7 <- NULL
> count <- 0

> select <- matrix(sample(1:nrow(X), nrow(X) * B, replace = T), B)
> ptm <- proc.time() # Keep track of time

> for(i in 1:B){
+ count <- count + 1
+ Xt <- X[select[i, ], ]
+ Ht <- MLcoefH(Xt, nice.output = F)
+ ht1 <- cbind(ht1, Ht[[1]])
+ ht2 <- cbind(ht2, Ht[[3]])
+ ht3 <- cbind(ht3, Ht[[5]])
+ ht4 <- cbind(ht4, Ht[[7]])
+ ht5 <- cbind(ht5, Ht[[9]])
+ ht6 <- cbind(ht6, Ht[[11]])
+ ht7 <- cbind(ht7, Ht[[13]])
+ print(count)
+ }

> proc.time() - ptm # Elapsed time of the bootstrap
>

> MatDeltaBoot <- matrix(0, 2 * (10 + 5 + 1) + 1, 3)
> MatDeltaBoot[, 1] <- rbind(TLresult[[1]], TLresult[[3]], TLresult[[5]],
+ TLresult[[7]], TLresult[[9]], TLresult[[11]], TLresult[[13]])
> MatDeltaBoot[, 2] <- rbind(TLresult[[2]], TLresult[[4]], TLresult[[6]],
+ TLresult[[8]], TLresult[[10]], TLresult[[12]], TLresult[[14]])
> MatDeltaBoot[, 3] <- c(apply(ht1, 1, sd), apply(ht2, 1, sd), apply(ht3, 1, sd),
+ apply(ht4, 1, sd), apply(ht5, 1, sd), apply(ht6, 1, sd), apply(ht7, 1, sd))
> colnames(MatDeltaBoot) <- c("H", "SE Delta", "SE Boot")
> rownames(MatDeltaBoot) <- c(paste("$H^W_{", rep(1:4, 4:1),
+ c(2:5, 3:5, 4:5, 5), "}$", sep = ""),
+ paste("$H^W_{", 1:5, "}$", sep = ""), "$H^W$",
+ paste("$H^B_{", rep(1:4, 4:1), c(2:5, 3:5, 4:5, 5), "}$", sep = ""),
+ paste("$H^B_{", 1:5, "}$", sep = ""), "$H^B$", "$H^B/H^W$")
> round(MatDeltaBoot, 3)


References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York, NY: Wiley.

Bergsma, W. P. (1997). Marginal models for categorical data. Tilburg, The Netherlands: Tilburg University Press. Retrieved December 3, 2015, from http://stats.lse.ac.uk/bergsma/pdf/bergsma_phdthesis.pdf

Crisan, D. R. (2015). Scalability coefficients for two-level dichotomous and polytomous data: A simulation study and an application (Unpublished master's thesis). Tilburg University, Tilburg, The Netherlands.

Crisan, D. R., Van de Pol, J. E., & Van der Ark, L. A. (in press). Scalability coefficients for two-level polytomous item scores: An introduction and an application. In L. A. Van der Ark, D. M. Bolt, W.-C. Wang, & M. Wiberg (Eds.), Quantitative psychology research. New York, NY: Springer.

De Rooij, M. J. (2013). Standard regression models for repeated measurements and longitudinal data. Unpublished manuscript.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1-26. doi: 10.1007/978-1-4612-4380-9_41

Fox, J. P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271-288. doi: 10.1007/BF02294839

Kuijpers, R. E., Van der Ark, L. A., & Croon, M. A. (2013). Standard errors and confidence intervals for scalability coefficients in Mokken scale analysis using marginal models. Sociological Methodology, 43, 42-69. doi: 10.1177/0081175013481958

Kuijpers, R. E., Van der Ark, L. A., Croon, M. A., & Sijtsma, K. (2016). Bias in point estimates and standard errors of Mokken's scalability coefficients. Applied Psychological Measurement, 40, 331-345. doi: 10.1177/0146621616638500

May, H. (2006). A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education. Journal of Educational and Behavioral Statistics, 31, 63-79. doi: 10.3102/10769986031001063

Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague: Mouton.

Molenaar, I. W. (1991). A weighted Loevinger H-coefficient extending Mokken scaling to multicategory items. Kwantitatieve Methoden, 12(37), 97-117.

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27, 341-384. doi: 10.3102/10769986027004341

R Core Team. (2013). R: A language and environment for statistical computing [Computer software]. Retrieved from http://www.R-project.org/

Raudenbush, S. W., & Sampson, R. J. (1999). Ecometrics: Toward a science of assessing ecological settings, with application to the systematic social observation of neighborhoods. Sociological Methodology, 29, 1-41. doi: 10.1111/0081-1750.00059

Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics: An introduction with applications. London: Chapman & Hall.

Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage.

Snijders, T. A. B. (2001). Two-level non-parametric scaling for dichotomous data. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 319-338). New York, NY: Springer. doi: 10.1007/978-1-4613-0169-1_17
