Item analysis of single-peaked response data: the psychometric evaluation of bipolar measurement scales
Polak, M.G.


Citation

Polak, M. G. (2011, May 26). Item analysis of single-peaked response data : the psychometric evaluation of bipolar measurement scales. Optima, Rotterdam. Retrieved from https://hdl.handle.net/1887/17697

Version: Not Applicable (or Unknown)
License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/17697

Note: To cite this publication please use the final published version (if applicable).


Chapter 4

Diagnostics for Single-Peakedness of Item Responses with Ordered Conditional Means (OCM)¹

Abstract

In this chapter we propose a model-free diagnostic for single-peakedness (unimodality) of item responses. Presuming a unidimensional unfolding scale, it approximates the item response functions (IRFs) of all items by computing ordered conditional means (OCM) under the assumption that these functions are unimodal. The proposed diagnostic includes all items.

It can be used prior to or in combination with any (IRT) unfolding model.

The proposed OCM methodology is based on the criterion of irrelevance, a graphical, exploratory method for evaluating the “relevance” of dichotomous attitude items. We generalized this criterion to polytomous items and quantified the relevance by fitting a unimodal smoother. The resulting goodness of fit is used as a measure of scale fit. Item fit is determined by the goodness of fit “if item deleted”. The sampling behavior of the fit statistics was explored using data generated with a specific unfolding IRT model. Evidence is presented showing that this method identifies both poorly discriminating items and items with an irregular IRF. We give two applications of the diagnostic to data concerning personality development and attitude scaling.

4.1 Introduction

In the field of psychometrics the objective is to construct valid and reliable scales.

These scales are constructed to measure latent constructs, for example attitudes, and consist of a set of items, for example numerical attitude items. The quality of a set of items is judged based on the analysis of item responses, which are collected in random samples from the target population. Usually the interest lies in whether the variability in responses is caused by systematic differences between both persons and items. The goal of item analysis is to determine whether

¹This chapter has been submitted for publication as: Polak, M. G., De Rooij, M., & Heiser, W. J. (2010a). Diagnostics for Single-Peakedness of Item Responses with Ordered Conditional Means (OCM). Manuscript submitted for publication.


there are one or more underlying scales, and whether items need to be discarded or included. In this chapter, we will focus on the evaluation of unidimensional unfolding scales consisting of a set of ordered polytomous items.

The essence of an unfolding scale is that the probability of agreement with a certain item on this scale is inversely related to the distance between the position of the item on the latent continuum and the position of the respondent; the closer an item is located near the respondent’s position on the latent continuum, the more likely the respondent will agree with it. Since this yields a single-peaked response function (see Figure 4.1b discussed in Section 4.1.1), these items are often referred to as single-peaked items. Single-peaked items typically arise in fields of personality measurement (e.g., Chernyshenko, Stark, Drasgow, & Roberts, 2007; Weekers & Meijer, 2008), preference research (e.g., Ashby & Ennis, 2002) and attitude research (e.g., Andrich & Styles, 1998).

Methods for scale evaluation can be divided into two approaches: item analysis based on classical test theory (CTT) and item analysis based on item response theory (IRT), either parametric or nonparametric. Within the framework of CTT, scales are usually evaluated with factor analysis (FA) and Cronbach’s alpha (Cronbach, 1951). CTT is exclusively developed for the analysis of monotonic items. Numerous studies have shown that FA is not suited for analyzing data conforming to an unfolding model (e.g., Coombs & Kao, 1960; Ross & Cliff, 1964; McDonald, 1967; Davison, 1977; Van Schuur & Kiers, 1994; Andrich, 1996; Rost & Luo, 1997; Roberts, Laughlin, & Wedell, 1999; Maraun & Rossi, 2001). The main conclusions from this literature are, first, that FA of one-dimensional unfolding data results in a two-component solution, leading to erroneous conclusions about the dimensionality of the data; second, that at the extreme ends of the latent scale the respondents’ factor scores and items’ factor loadings are underestimated, resulting in a non-optimal item selection. Several authors explored the use of inter-item correlations for evaluating unfolding scales (i.e., single-peaked items) as well (e.g., Davison, 1977; Post, 1992; Van Schuur & Kiers, 1994; Polak, Heiser, & De Rooij, 2009). The correlation matrix turned out to be a useful diagnostic for indicating whether a set of items conforms to an unfolding model instead of a cumulative model. However, it was shown that correlations provide little information about the psychometric quality of individual unfolding items.

IRT, although originally developed for monotonic items, also provides models for single-peaked items, either nonparametric (e.g., the multiple unidimensional unfolding model (MUDFOLD; Van Schuur, 1984)) or parametric (e.g., the PARELLA model (Hoijtink, 1991), the hyperbolic cosine model (HCM; Andrich & Luo, 1993), the generalized graded unfolding model (GGUM; Roberts et al., 2000), and the multidimensional unfolding model (MUM; Javaras & Ripley, 2007)). These models are often referred to as unfolding IRT models (see Van Schuur & Post, 1998, for an introduction to this type of model). In Section 4.1.1 we discuss the diagnostics for item fit from the field of unfolding IRT, by discussing GGUM and MUDFOLD in more detail. We chose these models as representatives of, respectively, parametric and nonparametric unfolding IRT, since both models have been well-developed and provide user-friendly, Windows-based software (i.e., resp., GGUM2004; Roberts, Fang, Ciu, & Wang, 2006, and MUDFOLD 4.0; Van Schuur & Post, 1998).

However, one possible drawback of the unfolding IRT approach is that the diagnostics provided by these models are conditional on the model estimates.

Thus, in contrast to the CTT approach that always results in information for all items, this approach gives no information about the items that do not fit into the final scale, which makes it difficult to identify possible causes for misfit.

In this chapter, we propose a model-free methodology for scale and item evaluation of single-peaked items that can be used prior to, or in combination with, any of the existing unfolding models. The method gives a graphical display of the approximated item response function (IRF) for each item. Based on the observed scores and a user-defined item order, ordered conditional means are depicted in an item plot. The method assumes unidimensionality, an ordering of the items along the underlying dimension, and single-peaked IRFs, but does not impose any further constraints on the shape of the IRF. A smoother is used as nonparametric fitting procedure, which results in values of fit for all items and the scale as a whole. Ideally, each item plot shows a single-peaked pattern with the peak moving from left to right along the scale.

4.1.1 Unfolding IRT Models and Evaluation of Fit

In this section we discuss two unfolding IRT models and their item diagnostics.

We start with the parametric model GGUM, and then discuss the nonparametric model MUDFOLD.

Parametric Unfolding IRT According to GGUM2004

The GGUM is a parametric item response model that incorporates features such as variable item discrimination and variable threshold parameters for the response categories. The model assumptions are: existence of a latent trait (i.e., unidimensionality), local independence (both are also common assumptions of monotone IRT models), and symmetric (around the item location), bell-shaped IRFs. The GGUM allows for binary or graded responses. One premise of the GGUM is that for each person there are two subjective responses associated with each observable response. These subjective responses can be seen as two distinct reasons for a person’s response. For instance, when a person strongly disagrees with a certain item this could be for either of two reasons. If, on the underlying (bipolar) continuum, the item is located more to the right extreme than the person, the person disagrees from below the item. However, if the item is located more to the left extreme than the person, the person disagrees from above the item. The probability that a person will respond using a particular observable answer category is defined as the sum of the probabilities associated with the two corresponding subjective responses. Specifically, the model has the form:

P(Z_{ig} = z \mid \theta_i) = \frac{\exp\{\alpha_g[z(\theta_i - \delta_g) - \sum_{m=0}^{z} \tau_{gm}]\} + \exp\{\alpha_g[(S - z)(\theta_i - \delta_g) - \sum_{m=0}^{z} \tau_{gm}]\}}{\sum_{\omega=0}^{M} \left( \exp\{\alpha_g[\omega(\theta_i - \delta_g) - \sum_{m=0}^{\omega} \tau_{gm}]\} + \exp\{\alpha_g[(S - \omega)(\theta_i - \delta_g) - \sum_{m=0}^{\omega} \tau_{gm}]\} \right)}, \qquad (4.1)

where

Z_ig is the observed response of subject i (i = 1, ..., n) to item g (g = 1, ..., k), with Z_ig = z, where z = 0, 1, ..., M,
z = 0 indicating the strongest level of disagreement,
z = M indicating the strongest level of agreement,
S = 2M + 1,
θ_i is the location of person i,
δ_g is the location of item g (on the same metric as θ),
α_g is the discrimination of item g, and
τ_gm is the relative location of response category m within item g.
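Equation (4.1) is straightforward to evaluate directly. The sketch below is ours, not the GGUM2004 implementation; `tau` is assumed to contain the thresholds τ_g0, ..., τ_gM with τ_g0 = 0:

```python
import numpy as np

def ggum_prob(z, theta, delta, alpha, tau, M):
    """P(Z = z | theta) under the GGUM, eq. (4.1).

    theta: person location; delta: item location; alpha: item
    discrimination; tau: array of length M + 1 with tau[0] = 0.
    """
    S = 2 * M + 1

    def term(w):
        # Sum of the two 'subjective response' exponentials for category w.
        t = np.sum(tau[: w + 1])
        return (np.exp(alpha * (w * (theta - delta) - t))
                + np.exp(alpha * ((S - w) * (theta - delta) - t)))

    denom = sum(term(w) for w in range(M + 1))
    return term(z) / denom
```

By construction the probabilities over the M + 1 observable categories sum to one, since the denominator is the sum of the numerators.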

Within the framework of parametric IRT the methods for evaluating scales and items generally involve some method of determining the agreement between observed responses and the expected responses given the specific model. In GGUM this yields two types of diagnostics for item fit: a graphical representation of the IRFs (see Figure 4.1 for examples of the two diagnostic item plots provided by the GGUM software), and item fit statistics.


The first type of diagnostic item plot, the item fit plot (see Figure 4.1a), is based on a user-defined number of equal-sized fit groups, where subjects are grouped according to their estimated location θ̂_i. In the item fit plot, for each fit group, the average observed response (the black dots in Figure 4.1a) is plotted against the average value of θ̂_i. Additionally, the average expected value based on the model within each group is plotted against the average θ̂_i value; these points are connected with a straight line. The vertical lines around each expected value represent a pseudo-confidence interval, which equals plus or minus 2 times the square root of the average variance of an observed score for respondents in a given group. The user can judge the degree of item (mis)fit by a visual inspection of the discrepancy between the observed and expected values. In particular, the GGUM2004 user’s guide points out that many dots outside of the confidence intervals are indicative of item misfit. The second type of diagnostic item plot, the estimated IRF (see Figure 4.1b), is based on hypothetical values for θ_i and the estimated item parameters (δ̂_g, α̂_g, τ̂_gm). Here, the model equation described in (4.1) is used to compute the expected response for a large set of hypothetical values of θ_i. The GGUM2004 user’s guide recommends this plot to assess the manner in which the item functions across the latent continuum.

(a) Item fit plot (b) Estimated ICC

Figure 4.1: The two types of diagnostic item plots in GGUM for a midpoint item in a hypothetical 9-item scale.

Furthermore, GGUM reports item fit statistics: infit and outfit t-statistics and associated chi-square tests that are based on fit indices developed for other (cumulative) IRT models. As a rule of thumb, t-values exceeding 2.576 and chi-square probability values below 0.01 are recommended as indicative of item misfit. However, because the distribution of these statistics is not fully known, the GGUM2004 manual warns against an absolute interpretation of these statistics, and recommends regarding them as a measure of relative fit (see also DeMars, 2004).

Nonparametric Unfolding IRT According to MUDFOLD 4.0

Secondly, we discuss the item diagnostics offered by MUDFOLD. MUDFOLD is a nonparametric IRT model that represents the ordering of respondents and a selected subset of k items along the latent scale. The methods for selecting the optimal subset of items and assigning scale values to respondents are explained elsewhere (Van Schuur, 1992). MUDFOLD assumes unidimensionality, local independence, and single-peaked IRFs. In contrast to GGUM, the model does not assume a specific shape of the IRFs.

An important diagnostic in MUDFOLD is the conditional adjacency matrix (CAM²; Post, 1992). In the k × k CAM, the entry (g, h) is the proportion of respondents choosing item g among those who also choose item h. Each row g of the population CAM thus consists of the conditional probabilities of choosing item g given that a person also chooses item h. Row g is considered a rough estimate of the IRF of item g. Post (1992) showed that, for unfolding data, each row of the population CAM is a weakly unimodal function (unimodality property) and that the maxima of the rows are situated in a position that moves to the right as one moves downward in the matrix, except for possible inversions around the diagonal. The CAM was developed for binary response data. For graded responses the user must define a cutoff value in order to dichotomize the responses, after which the CAM can be computed.

For the CAM several fit statistics were developed. However, not much is known about the sampling behavior of these statistics in conditions other than those tested by Post (1992). Moreover, Post (1992) concluded that the fit statistics concerning single-peakedness of the IRFs were not adequate (in terms of Type I error and power). Like GGUM2004, MUDFOLD 4.0 offers two approaches to estimate the IRFs of the individual items. The first is based on the rows of the CAM, where the proportions in each row are plotted against the k item ranks. The second approach is based on an s × k score-group-by-responses matrix, where the rows define s approximately equal-sized score groups that are selected by the MUDFOLD algorithm. The entries of this matrix give the proportion of respondents within a score group that agree with each item. Each of the k columns of this matrix is interpreted as the (approximated) IRF of the corresponding item.

²Johnson (2006) provided an algorithm to use the CAM not only for scale evaluation, but also to estimate the rank order of items and respondents on the latent scale.

In conclusion, within the field of unfolding IRT, both exploratory graphical and statistical methods for evaluating item fit exist. Given that an optimal scale is found, it is possible to evaluate the fit of the items on this scale. However, further studies need to be done on the sampling behavior of the various fit statistics under various conditions of, for instance, item misfit. Moreover, it could be argued that a disadvantage of the unfolding IRT item diagnostics is that they depend on model convergence. GGUM and MUDFOLD may require a pre-selection of items in order to allow the model to converge and provide results. This pre-selection step may not be straightforward for applied researchers, as it requires methods other than those provided by the models themselves, and it makes it difficult to identify possible causes for item misfit. These drawbacks could be resolved by a model-free method, and it is the aim of this chapter to propose and evaluate such a method.

4.1.2 The Criterion of Irrelevance

One of the earliest quantitative methods for the evaluation of single-peaked dichotomous attitude items is the criterion of irrelevance (COI) by Thurstone and Chave (1929). The COI is a diagram for a certain item g depicting on the horizontal axis the position of all items on the scale and on the vertical axis the index of similarity (C_gh) of item g with any other item h on the scale. The index C_gh is defined as

C_{gh} = \frac{N(g,h)}{N(h)}, \qquad (4.2)

where
N(g, h) is the number of subjects in the sample choosing both items g and h, and
N(h) is the number of subjects in the sample choosing item h.

Thus, C_gh is the conditional probability of choosing item g given that a subject chooses item h. Ideally, the COI shows a single-peaked pattern, meaning that the probability of choosing item g increases for subjects who choose items located near item g on the latent scale. And vice versa, the greater the distance between item g and any other item, the smaller the probability of choosing both items. Thurstone and Chave’s method for determining item fit is based on the visual inspection of the COI; namely, the more the diagram of item g shows a single-peaked pattern, the more “relevant” item g is for discriminating between different attitudes. Equation (4.2) is exactly the definition of the row elements in Post’s (1992) conditional adjacency matrix.
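For binary response data, the index in (4.2) can be computed for all item pairs at once. A minimal sketch (function name ours):

```python
import numpy as np

def coi_index(Z):
    """Criterion-of-irrelevance similarity index, eq. (4.2), for binary
    responses: C[g, h] = N(g, h) / N(h).

    Z: n x k matrix of 0/1 responses, one row per subject.
    """
    Z = np.asarray(Z, float)
    joint = Z.T @ Z          # joint[g, h] = N(g, h): subjects choosing both
    n_h = Z.sum(axis=0)      # N(h): subjects choosing item h
    return joint / n_h       # divides each column h by N(h) via broadcasting
```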

Both the COI and the CAM have exclusively been developed for binary response items. We propose a generalization which is suited for graded response items as well. This method encompasses a diagram depicting ordered conditional means (OCM) for each item g. We will refer to this diagram as the OCM diagram. We propose a quantitative method for evaluating item fit based on the OCM diagram. For this purpose we fit a unimodal smoother to the points in the diagram. We combine the item fit indices into a measure of scale fit.

4.2 A New Diagnostic for Internal Consistency of Single-Peaked Items: Ordered Conditional Means (OCM)

Suppose we have a set of k polytomous items that are ordered on a latent scale.

We compute the entry (g, h) of the k × k OCM matrix C as the conditional mean response on item g for all subjects who choose item h. The order of the rows and columns of C is determined by the presumed item ordering on the latent scale. In the case of graded response items, we define endorsement as expressing the strongest level of agreement. This criterion selects a subset of respondents who are maximally homogeneous with respect to their position on the latent scale³. In the current chapter, we define element h in row g as

C_{gh} = \frac{1}{M}\, E(Z_{ig} \mid Z_{ih} = M), \qquad (4.3)

which is estimated by

\hat{c}_{gh} = \frac{1}{M}\, \frac{1}{N_h} \sum_{i \in L_h} Z_{ig}, \qquad (4.4)

where

³One could consider selecting a larger subset by allowing a user-defined cutoff value for agreement, for instance Z_ih ≥ M − 1 (cf. Van Schuur, 1992, p. 65). This procedure might be particularly useful for finding respondents that agree with items with extreme locations, as only few respondents tend to choose these items.


Z_ig is the observed response of subject i (i = 1, ..., n) to item g (g = 1, ..., k), with Z_ig = z, where z = 0, 1, ..., M,
z = 0 indicating the strongest level of disagreement,
z = M indicating the strongest level of agreement,
L_h is the subset of subjects in the sample with Z_ih = M, and
N_h is the number of subjects in subset L_h.

In equation (4.4) the conditional mean response on item g given Z_ih = M is multiplied by 1/M to make the entries of the OCM matrix comparable to (4.2) with respect to the maximum value of 1. An outcome of 1 indicates the highest level of similarity between items g and h. An outcome of 0 indicates the lowest level of similarity between items g and h, which is the case when all subjects who express the strongest level of agreement with item h express the strongest level of disagreement with item g. In the case of binary responses, (4.4) equals (4.2) and the OCM matrix equals the conditional adjacency matrix. Note that, in order for the above to apply, researchers must recode their data so that the lowest possible response is always 0.
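Given a response matrix with columns already in the presumed scale order and responses coded 0, ..., M, the OCM matrix of (4.4) could be computed as in the following sketch (function and variable names ours):

```python
import numpy as np

def ocm_matrix(Z, M):
    """Ordered conditional means, eq. (4.4): entry (g, h) is the mean
    response to item g among subjects with the maximum score M on item h,
    scaled by 1/M.

    Z: n x k matrix of responses coded 0..M, columns in presumed scale order.
    Columns that no subject endorses maximally are left as NaN.
    """
    Z = np.asarray(Z, float)
    n, k = Z.shape
    C = np.full((k, k), np.nan)
    for h in range(k):
        L_h = Z[:, h] == M              # subset L_h: maximal agreement with h
        if L_h.sum() > 0:
            C[:, h] = Z[L_h].mean(axis=0) / M
    return C
```

For M = 1 (binary items) this reduces to the conditional adjacency matrix, as noted above.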

Each row g in the OCM matrix contains the conditional mean response on row item g, given that the subjects express the highest level of agreement with column item h. Analogous to Post (1992), we regard each row as a rough estimate of the IRF of the corresponding item. Note that the k columns of the matrix represent k ordered subgroups L_h of persons choosing a particular column item h (i.e., showing the observed response Z_ih = M). Ideally, each row of the OCM matrix shows a single-peaked pattern with the peak moving from left to right as we move downward in the matrix. The more the row pattern deviates from unimodality, the less the item conforms to the assumptions of the unfolding model, which diminishes the internal consistency of the set of items.

The evaluation of the approximated IRFs is facilitated by a graphical display of the rows of the OCM matrix. These OCM diagrams are explained in the following paragraph.

4.2.1 The OCM Diagrams

We explain the OCM diagrams with the following example. Suppose we have a simulated dataset of 300 persons rating 9 items, v1 to v9, using a five-point scale ranging from 0 to 4. Probabilities were generated according to (4.1), with α_g = 1 and a constant interthreshold distance of .4 (these values are based on previous studies by Roberts et al., 2000), δ_g ranging from -3 to 3 on the latent scale, and θ_i sampled from a normal (0, 2²) distribution. For each subject the response category with the highest probability was selected, so that we generated a deterministic data structure. The OCM matrix for these data is displayed in Table 4.1.

Table 4.1: OCM matrix for simulated responses of 300 respondents on 9 items. For each column item h the size of the subgroup expressing maximum agreement with this item, N_h, is displayed between parentheses.

item  v1(26)  v2(33)  v3(54)  v4(70)  v5(86)  v6(84)  v7(56)  v8(46)  v9(36)
v1     1.00     .62     .17     .00     .00     .00     .00     .00     .00
v2      .68    1.00     .66     .21     .00     .00     .00     .00     .00
v3      .24     .81    1.00     .69     .17     .01     .00     .00     .00
v4      .01     .28     .77    1.00     .65     .25     .00     .00     .00
v5      .00     .01     .25     .72    1.00     .77     .21     .01     .00
v6      .00     .00     .00     .23     .77    1.00     .75     .26     .01
v7      .00     .00     .00     .01     .26     .65    1.00     .73     .30
v8      .00     .00     .00     .00     .00     .18     .72    1.00     .82
v9      .00     .00     .00     .00     .00     .01     .21     .68    1.00

Each row g in Table 4.1 is regarded as a rough estimate of the IRF of item g.

Note that the diagonal values are 1 by definition. In Table 4.1 it can be seen that item v1 has its peak at the left end of the scale, and that the conditional mean response decreases for subgroups choosing items that are more distant from this item. That is, item v1 has a monotonically decreasing IRF. In contrast, the IRF of item v9 (with its peak at the right end of the scale) is monotonically increasing.

The IRF of item v5, which is located at the midpoint of the scale, is single-peaked and approximately symmetrical.

Figure 4.2 depicts the OCM diagrams for the 9 items in the above example.

Each diagram corresponds to one row of the OCM matrix. The horizontal axis of each diagram represents the ordering of the items along the latent scale and the vertical axis represents the conditional mean response on item g defined by (4.4). The 9 columns of the OCM matrix define 9 subgroups of respondents (with a maximum score M on the respective column item) that are represented on the horizontal axis by the respective item rank numbers. Note that both the number of diagrams and the number of points in each diagram equal the number of items in the scale, nine in this example.


[Figure: a 3 × 3 grid of OCM diagrams, one per item; each panel plots the conditional mean response (vertical axis, 0 to 1) against the item rank numbers 1 to 9.]

Figure 4.2: OCM diagrams for simulated responses to 9 Likert items.

4.2.2 Unimodal Smoothing of the OCM Diagrams

We used a unimodal smoother defined by Eilers (2005) to determine the fit in each diagram. The smoother is based on a nonparametric function and yields the estimated values ê_gh in the OCM diagram of each item g. This approach assumes single-peakedness of the IRFs, unidimensionality, and a hypothesized order of the items on the latent scale. We chose the smoother defined by Eilers (2005) because it is the only algorithm (that we know of) that explicitly models a smooth global unimodal shape, whereas other nonparametric curve estimators (cf. Bro & Sidiropoulos, 1998; Gemperline & Cash, 2003) split the unimodal function into monotone left and right parts, and do not consider smoothness. We refer the reader to Eilers (2005) for further technical details, a Matlab routine, and recommended values for the number of knots and the multimodality penalty κ, which we also applied in the current study. We chose to set the roughness penalty, λ, to 100 (instead of the starting value of 0.1) to enhance smoothness. Given that the current application has relatively few data points compared to the original application of Eilers (2005), a stronger roughness penalty is desired to diminish the influence of individual data points.

Additionally, for the current chapter, we included weights for the points in each diagram. That is, since each point h in diagram g is based on a specific number of observations N_h (i.e., the number of subjects in the sample with Z_ih = M), we weighted each point h with N_h in the fitting procedure. Note that within each diagram g the weights N_h of the points h vary, but across diagrams the weights are equal. For instance, in each diagram the first point (h = 1) is weighted with N_1, that is, the number of subjects expressing the highest level of agreement with the item with rank number 1; this weight assigned to the first point is constant across all k diagrams (N_1 = 26 in Figure 4.2; see Table 4.1).
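Eilers’ (2005) penalized B-spline smoother is not reproduced here. As a simplified stand-in, the sketch below fits a weighted unimodal (up-then-down) step function by combining weighted isotonic regressions over every candidate peak position; unlike Eilers’ smoother it ignores smoothness, but it illustrates the weighted unimodal fitting step (all names ours):

```python
import numpy as np

def pava(y, w):
    """Weighted isotonic (non-decreasing) regression via
    pool-adjacent-violators."""
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); sizes.append(1)
        # Merge adjacent blocks while the monotonicity constraint is violated.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / wtot
            wts[-2] = wtot
            sizes[-2] += sizes[-1]
            vals.pop(); wts.pop(); sizes.pop()
    out = []
    for v, s in zip(vals, sizes):
        out.extend([v] * s)
    return np.array(out)

def unimodal_fit(y, w):
    """Best weighted unimodal fit: non-decreasing before some peak
    position p, non-increasing after it. Tries all split points."""
    y = np.asarray(y, float); w = np.asarray(w, float)
    best_fit, best_sse = None, np.inf
    for p in range(len(y) + 1):
        up = pava(y[:p], w[:p])
        # Decreasing fit = isotonic fit of the reversed tail, reversed back.
        down = pava(y[p:][::-1], w[p:][::-1])[::-1]
        fit = np.concatenate([up, down])
        sse = float(np.sum(w * (y - fit) ** 2))
        if sse < best_sse:
            best_fit, best_sse = fit, sse
    return best_fit, best_sse
```

Any concatenation of a non-decreasing run followed by a non-increasing run is unimodal, so scanning all split points yields the least-squares unimodal step fit.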

4.2.3 Two Measures of Fit for the OCM Diagrams

For each OCM diagram g we determine two measures of fit based on the quantities ĉ_gh defined in (4.4) and the predicted values ê_gh resulting from the unimodal smoother. The first measure of fit is the squared correlation (R²_g) between ĉ_gh and ê_gh, which is defined as

R_g^2 = \left( \frac{\sum_{h=1}^{k} N_h (\hat{c}_{gh} - \bar{\hat{c}}_{gh})(\hat{e}_{gh} - \bar{\hat{e}}_{gh})}{\sqrt{\sum_{h=1}^{k} N_h (\hat{c}_{gh} - \bar{\hat{c}}_{gh})^2}\, \sqrt{\sum_{h=1}^{k} N_h (\hat{e}_{gh} - \bar{\hat{e}}_{gh})^2}} \right)^{2}, \qquad (4.5)

where ĉ_gh is the mean observed response to item g in the subgroup L_h as defined in equation (4.4), ê_gh is the predicted response to item g in the subgroup L_h resulting from the unimodal smoother, the bars denote the grand means of ĉ_gh and ê_gh over all k subgroups, and N_h is the number of subjects in subset L_h.

We chose the above definition of R² as a measure of model fit because it is bounded between 0 and 1. Although this measure cannot be interpreted as the proportion of variance accounted for, as in linear regression, it yields a standardized measure of fit, where values closer to 1 are indicative of good fit. We prefer this measure to the usual R², that is, 1 minus the error sum of squares divided by the total sum of squares, since the latter can become negative for nonlinear regression.

Furthermore, we compute the root mean squared error (RMSE) to measure the accuracy of the model estimates, which is defined for diagram g as the square root of the (weighted) mean squared error (MSE_g):

RMSE_g = \sqrt{\frac{1}{N_+} \sum_{h=1}^{k} N_h (\hat{c}_{gh} - \hat{e}_{gh})^2}, \qquad (4.6)

where N_+ = \sum_{h=1}^{k} N_h.
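In code, (4.5) and (4.6) reduce to a few weighted sums. A sketch (names ours; following the definitions above, the grand means are taken unweighted over the k subgroups):

```python
import numpy as np

def diagram_fit(c_hat, e_hat, N):
    """Squared weighted correlation R^2_g, eq. (4.5), and RMSE_g,
    eq. (4.6), for one OCM diagram; N holds the subgroup sizes N_h."""
    c, e, N = (np.asarray(a, float) for a in (c_hat, e_hat, N))
    cbar, ebar = c.mean(), e.mean()        # grand means over the k subgroups
    num = np.sum(N * (c - cbar) * (e - ebar))
    den = np.sqrt(np.sum(N * (c - cbar) ** 2) * np.sum(N * (e - ebar) ** 2))
    r2 = (num / den) ** 2
    rmse = np.sqrt(np.sum(N * (c - e) ** 2) / N.sum())
    return r2, rmse
```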

To measure the quality of the scale as a whole, we take the average of the values of both measures of fit over all k diagrams. That is, we compute

R^2 = \frac{1}{k} \sum_{g=1}^{k} R_g^2, \qquad (4.7)

and

RMSE = \sqrt{\frac{1}{k} \sum_{g=1}^{k} MSE_g}. \qquad (4.8)

In general, the higher R² and the lower RMSE, the more the items of the scale show a single-peaked IRF.

4.2.4 Identifying Item Misfit Using the OCM Diagrams

As a measure of item fit, R² and RMSE “if item h deleted” are computed, denoted respectively R²_(−h) and RMSE_(−h). R²_(−h) and RMSE_(−h) are defined as R² and RMSE, but with item h discarded from the scale. To compute these statistics, both diagram h is deleted and point h is deleted in each of the remaining k − 1 diagrams. Note that point h in diagram g, after all, indicates the subset of subjects with Z_ih = M, which is non-existent after discarding item h from the scale. High values of R²_(−h) compared to R², and low values of RMSE_(−h) compared to RMSE, can be used to identify poorly fitting items. Rules of thumb will be presented for both statistics, indicating the minimum increase in R²_(−h) compared to R², and the minimum decrease in RMSE_(−h) compared to RMSE, that can be regarded as a substantial improvement. ΔR²_(−h) and ΔRMSE_(−h) denote the change of, respectively, R² and RMSE when item h is deleted.
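The “if item deleted” bookkeeping (drop diagram h, and drop point h from every remaining diagram) can be sketched as follows; the per-diagram fitting routine is passed in as a callable, and all names are ours:

```python
import numpy as np

def if_item_deleted(C, N, fit_diagram):
    """R^2_(-h) and RMSE_(-h) for every item h (Section 4.2.4).

    C: k x k OCM matrix; N: subgroup sizes N_h; fit_diagram: callable
    (y, w) -> (r2, mse) fitting one diagram, e.g. a unimodal smoother
    followed by eqs. (4.5)-(4.6).
    """
    k = C.shape[0]
    results = {}
    for h in range(k):
        keep = [g for g in range(k) if g != h]   # delete diagram h
        r2s, mses = [], []
        for g in keep:
            y = C[g, keep]                       # delete point h from diagram g
            w = N[keep]
            r2, mse = fit_diagram(y, w)
            r2s.append(r2); mses.append(mse)
        # Average over the remaining k - 1 diagrams, as in eqs. (4.7)-(4.8).
        results[h] = (np.mean(r2s), np.sqrt(np.mean(mses)))
    return results
```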

The following example illustrates the nonparametric fitting procedure and shows how it identifies an item with a deviant IRF. Consider the data from the first example again; however, now we introduce a deviant item to the scale. That is, we made the IRF of item v3 a mixture of the IRFs of an item with δ_3 and α = 1 and an item with δ_7 and α = 0.25. In this way the IRF of item v3 has a local maximum at the position of item v7. Figure 4.3 shows the OCM diagrams for this dataset, using the unimodal smoothing procedure to approximate the IRF of each item.

[Figure: a 3 × 3 grid of OCM diagrams, one per item; each panel plots the conditional mean response (vertical axis, 0 to 1) against the item rank numbers 1 to 9, with the fitted unimodal smoother overlaid.]

Figure 4.3: OCM diagrams for simulated responses to 9 items with item 3 as deviant item.

In Figure 4.3 the OCM diagram for item v3 clearly shows a deviating pattern: as expected, the dots show a local maximum around point 7. This means that item v3 is chosen by subjects who also choose item v7, which is located at a rather large distance from item v3. In practice, this would make item v3 difficult to scale, since it is chosen by subjects who are heterogeneous with respect to their location on the latent scale. Note, furthermore, that the subset L3 of subjects with Zi3 = 4 shows up as an influential point in the diagrams of items v6 through v9. For instance, in the diagram of item v8, the conditional mean response on item v8 of subset L3 (i.e., the third point in the diagram) is somewhat higher than expected given the pattern of the remaining points. This can be explained by the heterogeneity within this subset.

In conclusion, discarding item v3 from the scale should improve the average fit for two reasons: first, because the fit in the diagram of item v3 is poorer than in the other diagrams; second, because the fit of the remaining diagrams will improve due to the removal of the influential point corresponding to item v3.

The scale fit in terms of R2 and RMSE, the item fit in terms of R2(−h) and RMSE(−h), and the change statistics ∆R2(−h) and ∆RMSE(−h) for the above example are reported in Table 4.2.

Table 4.2: Unimodal fit for the 9-item scale displayed in Figure 4.3.

                       R2     RMSE   ∆R2(−h)   ∆RMSE(−h)
Scale fit             .972    .028
if item deleted: v1   .958    .030    -.014      .002
                 v2   .967    .031    -.005      .003
                 v3  1.000    .012     .028     -.016
                 v4   .973    .027     .002     -.001
                 v5   .982    .026     .010     -.002
                 v6   .968    .032    -.004      .004
                 v7   .974    .029     .002      .002
                 v8   .971    .030    -.001      .002
                 v9   .965    .030    -.007      .002

In Table 4.2 we see that the scale fit can be improved substantially by discarding item v3 (∆R2(−3) and ∆RMSE(−3) are .028 and -.016, respectively). In contrast, discarding any other item had a minimal or even negative effect on the scale fit.

In the following section we explore the sampling behavior of the fit statistics by means of a Monte Carlo simulation, in which we varied the subject distribution, the scale length, and the number and location of deviant items in the scale. We distinguish between two types of deviant items: poorly discriminating items and mixture items, which have an irregular IRF like the one in the above example. Evidence is presented showing that this method identifies both poorly discriminating items and irregular items.

4.3 Evaluation of the OCM diagnostics

In this section we evaluate the OCM methodology with respect to the identification of deviant items. We studied the sampling behavior of both the scale and the item diagnostics with respect to the following questions:

1. Which values of R2 and RMSE indicate good scale fit?

a. Does this depend on the subject distribution?

b. Does this depend on the scale length?

c. Is the scale fit indeed affected by including deviant items in the scale?

Does the effect of including deviant items depend on:

d. the type of deviation in their IRFs (non-discriminatory or mixture)?

e. the number of deviant items in the scale?

f. their location on the scale?

2. Which values of ∆R2(−h) and ∆RMSE(−h) are indicative of item misfit?

a. Does this depend on the subject distribution?

b. Does this depend on the scale length?

c. Is the item fit indeed different for deviant and regular items?

Does the degree of misfit of the deviant items depend on:

d. the type of deviation in their IRFs (non-discriminatory or mixture)?

e. the number of deviant items in the scale?

f. their location on the scale?

3a. How often do ∆R2(−h) and ∆RMSE(−h) indicate item misfit while the particular IRF satisfies the unimodality assumption (type I error)?

3b. How often do ∆R2(−h) and ∆RMSE(−h) fail to indicate item misfit while the particular IRF does not satisfy the unimodality assumption (type II error)?

4.3.1 Design of the Monte Carlo Simulation

We aimed at designing a simulation study that we think is realistic for applied research. For instance, we chose moderately deviant items that seem realistic in a practical dataset (given that the items are presumed to lie on an underlying bipolar continuum), instead of extremely deviant items (for instance, with a single-dipped IRF; cf. Post, 1992). Furthermore, we explicitly evaluate whether the number and/or location(s) of the deviant item(s) affect the functioning of the OCM diagnostics. These considerations resulted in the following choices:

1. The number of subjects is kept constant over all conditions: N = 300, which we think is realistic in practical research using questionnaires.

2. Two subject distributions: N(0, 2²) and uniform on (−4.5, 4.5).

3. Two conditions of scale length: 10 and 15 items.

4. All regular items have unimodal IRFs that belong to the same unimodal location-parameter family, defined by the GGUM (see Equation 4.1). The items differ only by location. For both the 10-item scale and the 15-item scale, the items are equidistant, with δg ranging from -3 to 3 and an inter-threshold distance of 0.4.

5. The deviant item(s) have one of the following two IRFs:

a. poorly discriminating (but not horizontal); the discrimination parameter of the deviant item is set to a much lower value than that of the remaining items. We used αg = 0.10 for the deviant item(s) and αg = 1 for the remaining items in the scale. The choice of .10 as the value of αg for a deviant item g is considered more realistic than a value of 0 (horizontal IRF).

b. irregularly shaped; in this condition the IRF of the deviant item g is a mixture of a regular IRF defined by the GGUM for the location δg with αg = 1 and a second IRF defined by the GGUM for δg±3 and αg±3 = .25. That is, the IRF has an extra (local) maximum, located at δg+3 if the deviant item g has a median rank number or lower, and at δg−3 in all other cases.

The choice of these parameter values was based on a visual inspection of the resulting IRFs. Since we define only one extra local maximum, we argue that this item is only moderately deviant. If it is possible to identify this type of IRF as deviant, then more extremely deviant items (e.g., with more local maxima) will most likely also be identified.

6. We define scales including zero, one, two, or three deviant items. Furthermore, we vary the location(s) of the deviant item(s): midpoint, intermediate, or extreme.
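To make the data-generating design concrete, the following sketch draws dichotomous responses for the 10-item, N(0, 2²) condition with one poorly discriminating item. The GGUM response function of Equation 4.1 is not reproduced in this excerpt, so a generic squared-exponential single-peaked IRF stands in for it; this is an illustrative assumption, not the chapter's generator.

```python
import numpy as np

rng = np.random.default_rng(1)

def single_peaked_irf(theta, delta, alpha):
    """Generic unimodal IRF: response probability decays with the squared
    distance between person location theta and item location delta.
    (Illustrative stand-in for the GGUM used in the chapter.)"""
    return np.exp(-alpha * (theta - delta) ** 2)

# 300 subjects drawn from N(0, 2^2), as in the simulation design
theta = rng.normal(0.0, 2.0, size=300)

# 10 equidistant item locations from -3 to 3; one deviant item with
# a much lower discrimination (alpha = 0.10 vs. 1 for regular items)
deltas = np.linspace(-3.0, 3.0, 10)
alphas = np.full(10, 1.0)
alphas[4] = 0.10                     # the poorly discriminating item

probs = single_peaked_irf(theta[:, None], deltas[None, :], alphas[None, :])
responses = rng.binomial(1, probs)   # 300 x 10 dichotomous data matrix
```

Replacing one column's probabilities by a mixture of two such IRFs would produce the second (irregular) type of deviant item described under 5b.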


Note that we have outcomes on the scale level as well as on the item level. Outcomes evaluated on the scale level are R2 and RMSE. Outcomes evaluated on the item level are the change statistics "if item h deleted", that is, ∆R2(−h) and ∆RMSE(−h). For each condition, 100 replications of datasets were independently generated according to the GGUM (see Equation 4.1).

The effect of the various factors on the measures of fit was assessed with two separate ANOVAs, both on the scale level and on the item level. We first performed a 3-factor ANOVA for each measure of fit to evaluate the effect of including deviant items in the scale, and whether this effect depended on the type of subject distribution or on scale length. Second, within the conditions with at least one deviant item, another 3-factor ANOVA was performed to evaluate whether there was an effect of the type of deviation in the IRF, the number of deviant items in the scale, and/or the location of the deviant items. Effect sizes were expressed as partial eta squared, ηp². According to Cohen (1988), a partial eta squared of .010 indicates a small effect, .059 a medium effect, and .138 a large effect.

4.3.2 Results of the Monte Carlo Simulation

In the following sections we describe the sampling behavior of the scale statistics, R2 and RMSE, and of the item statistics, ∆R2(−h) and ∆RMSE(−h), in the various conditions of our simulation study. Finally, the success and error rates of the OCM methodology are evaluated by introducing a rule of thumb for both measures of item fit, ∆R2(−h) and ∆RMSE(−h). Note that in the complete design we have a total of 7600 replications of each measure of fit on the scale level, and 95000 on the item level. On the item level we discarded 158 (0.166%) outcomes due to convergence problems (mainly for items at the extreme ends of the scale with long tails, which caused the regression to be undefined). The corresponding outcomes on the scale level were also discarded, which resulted in the elimination of 69 (0.907%) outcomes.

Scale Fit

The effect sizes for the two ANOVAs on both measures of scale fit are reported in Table 4.3 (first and second columns).

Table 4.3 (first half, first and second columns) shows that, as expected, both measures of scale fit are strongly affected by including deviant items in the scale. This effect seems constant over the conditions varying subject distribution (f1) and scale length (f2), considering the small effect sizes of f1 and f2 for both R2 and RMSE. The second half of Table 4.3 (first and second columns) shows a large effect of the number of deviant items in a scale, as well as of the type of IRF (non-discriminatory or mixture) of the deviant items.

Table 4.3: Effect sizes ηp², resulting from two separate 3-factor ANOVAs of each measure of scale fit, R2 and RMSE, and each measure of item fit, ∆R2(−h) and ∆RMSE(−h), for the various conditions of the Monte Carlo simulations.

                   Scale fit            Item fit
Factor           R2      RMSE     ∆R2(−h)   ∆RMSE(−h)
f1              .002     .000      .008       .001
f2              .001     .003      .057       .024
f3              .164     .280      .721       .485
f1*f2           .000     .000      .000       .000
f1*f3           .001     .000      .009       .001
f2*f3           .002     .016      .035       .033
f1*f2*f3        .000     .000      .000       .000
f3a             .277     .128      .116       .018
f3b             .664     .593      .015       .322
f3c             .038     .021      .008       .012
f3a*f3b         .034     .005      .006       .023
f3a*f3c         .001     .030      .000       .008
f3b*f3c         .089     .083      .027       .037
f3a*f3b*f3c     .026     .050      .013       .009

Note. The conditions varied in the simulation study were, respectively, f1: subject distribution, f2: scale length, f3: deviant or regular IRF, f3a: type of deviation, f3b: number of deviant items, f3c: location of the deviant item(s).

Table 4.4 shows the means of R2 and RMSE in the two conditions with a large effect size. The general pattern for both measures of scale fit can be interpreted as a diminishing decline as a function of the number of deviant items in the scale. Furthermore, it can be seen that, overall, the scale fit is better in those conditions where the deviant items were characterized by a mixture IRF. Apparently, the non-discriminatory items are more deviant than the mixture items, which might be explained by the fact that the mixture items' IRFs partly follow the same shape as those of the regular items. From Table 4.4 we deduce that values of R2 of .97 or above can be seen as an indication of excellent fit. Values of R2 between .90 and .97 arise when only 1 of the items is deviant. Values of R2 between .78 and .90 arise when 2 or 3 of the items are deviant. Finally, values below .78 point at a serious amount of item misfit. For RMSE we can take the value of .044 as an indication of good fit. Values of RMSE between .044 and .074 arise when only 1 of the items is deviant. When there are 2 or 3 deviant items in the scale, we expect values between .074 and .099. Finally, values of RMSE above .099 point at a serious amount of item misfit.

Table 4.4: Means (and standard deviations) of R2 and RMSE for the various levels of factor f3a (type of deviation) and f3, f3b (number of deviant items in the scale).

                                      f3, f3b
scale fit measure   f3a      0            1            2            3
R2                   1   .970 (.010)  .900 (.020)  .820 (.030)  .780 (.040)
                     2                .930 (.020)  .860 (.040)  .830 (.050)
RMSE                 1   .044 (.010)  .074 (.007)  .097 (.007)  .099 (.010)
                     2                .067 (.009)  .088 (.012)  .093 (.014)

Note. The levels of f3a (type of deviation) were (1) non-discriminatory IRF and (2) mixture IRF.

Item Fit

The effect sizes for the two ANOVAs on both measures of item fit are reported in Table 4.3 (third and fourth columns).

The first half of Table 4.3 shows large effect sizes for factor f3 (deviant vs. regular IRF) for both measures of item fit. These results indicate that, in general, both measures of item fit discriminate strongly between deviant and regular items. This effect seems constant over the conditions varying subject distribution (f1), considering the small effect sizes of f1 for both measures of item fit. Scale length (f2) appears to have a small to medium effect on both measures of item fit. This will be explored further in the following section.

The second half of Table 4.3 shows that for ∆R2(−h) there is a medium to large effect of the type of deviation, where we observed more extreme values of misfit for the non-discriminatory items than for the mixture items (which was also apparent in the scale fit). For ∆RMSE(−h) there is a large effect of the number of deviant items in the scale, where we observed less extreme values of misfit for the deviant items as the total number of deviant items in the scale increased.


Rules of Thumb for Interpreting Item Fit According to ∆R2(−h) and ∆RMSE(−h)

In this section we introduce cutoff values for ∆R2(−h) and ∆RMSE(−h): values of ∆R2(−h) above the first cutoff, and values of ∆RMSE(−h) below the second cutoff, are regarded as indicative of item misfit. We aim at cutoff values that are extreme enough to flag only substantial misfit (i.e., to prevent false positives), but at the same time close enough to zero to be sensitive to deviant items (i.e., to prevent false negatives).

In Table 4.5 we report the overall percentiles of ∆R2(−h) and ∆RMSE(−h) for the two levels of factor f3 (deviant and regular items). Note that for ∆R2(−h) positive values indicate misfit; thus, for ∆R2(−h) we aim at less than 20% of the deviant items below the cutoff (type II error) and less than 5% of the regular items above it (type I error). In contrast, for ∆RMSE(−h) negative values indicate misfit; thus, for ∆RMSE(−h) we aim at less than 20% of the deviant items above the cutoff (type II error) and less than 5% of the regular items below it (type I error).

From the first half of Table 4.5 it can be deduced that a cutoff value of approximately .025 for ∆R2(−h) results in an overall type II error rate between 5% and 10%, while between 97.5% and 99% of the regular items remain below this value (i.e., a type I error rate between 1% and 2.5%). From the second half of Table 4.5 it can be deduced that with a cutoff value of approximately -.005 for ∆RMSE(−h), between 80% and 90% of the deviant items are identified as deviant (i.e., a type II error rate between 10% and 20%). The overall type I error rate for this cutoff is between 1% and 2.5%.
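Applying these two cutoffs is a one-line screening step; a sketch (the function name is illustrative, and the numeric vectors are taken from the nine-item example of Table 4.2, where v3 is the deviant item):

```python
def flag_misfitting_items(delta_r2, delta_rmse,
                          r2_cutoff=0.025, rmse_cutoff=-0.005):
    """Flag item h as misfitting when Delta R2(-h) exceeds the R2 cutoff
    or Delta RMSE(-h) falls below the RMSE cutoff."""
    return [h for h, (dr2, drmse) in enumerate(zip(delta_r2, delta_rmse))
            if dr2 > r2_cutoff or drmse < rmse_cutoff]

# Change statistics for items v1..v9 from the simulated example (Table 4.2)
d_r2   = [-.014, -.005,  .028,  .002,  .010, -.004,  .002, -.001, -.007]
d_rmse = [ .002,  .003, -.016, -.001, -.002,  .004,  .002,  .002,  .002]
print(flag_misfitting_items(d_r2, d_rmse))   # [2] -> item v3
```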

Table 4.5: Percentiles of ∆R2(−h) and ∆RMSE(−h) for the two levels of factor f3 (1 = deviant item, 2 = regular item).

∆R2(−h)
f3      1     2.5     5     10     20     50     80     90     95    97.5    99
1     .013   .017   .022   .029   .040   .061   .084   .097   .108   .122   .142
2    -.026  -.021  -.017  -.013  -.009  -.002   .005   .011   .018   .024   .033

∆RMSE(−h)
f3      1     2.5     5     10     20     50     80     90     95    97.5    99
1    -.051  -.043  -.037  -.029  -.022  -.012  -.006  -.003  -.002   .000   .001
2    -.007  -.004  -.003  -.001   .001   .003   .005   .006   .007   .008   .010


In Tables 4.6 and 4.7 we report the type I error rates and success rates (i.e., 100 minus the type II error rates) for the cutoff criteria ∆R2(−h) > .025 and ∆RMSE(−h) < -.005, respectively, in various conditions of the Monte Carlo study. We selected distinctive conditions based on the ANOVA effect sizes (see Table 4.3), choosing those conditions with at least a medium-sized effect for either one of the measures of item fit.

The type I error rates are approximated, within the reported conditions, by the average percentage of replications in which ∆R2(−h) was > .025 and, respectively, ∆RMSE(−h) was < -.005, over all regular items. The type II error rates are approximated, within the reported conditions, by the average percentage of replications in which ∆R2(−h) was ≤ .025 and, respectively, ∆RMSE(−h) was ≥ -.005, over all deviant items. The statistical power of the respective rules of thumb is approximated by 100% minus the type II error rate.

Table 4.6: Type I error rates and success rates (power) of using ∆R2(−h) > .025 as the criterion for item misfit.

              10-item scale          15-item scale
f3a  f3b   Type I error   Power   Type I error   Power
1    1        6.2%        99.8%      0.4%        99.8%
     2        9.7%        99.7%      0.3%        99.6%
     3        5.9%        98.7%      0.4%        98.9%
2    1        3.0%        96.8%      0.1%        67.3%
     2        3.8%        97.7%      0.3%        78.3%
     3        5.5%        91.9%      0.1%        80.7%

Note. The levels of f3a (type of deviation) were (1) non-discriminatory IRF and (2) mixture IRF; f3b indicates the total number of deviant items in the scale.

In Tables 4.6 and 4.7 it can be seen that in most conditions the power of our methodology for identifying item misfit is high, while in most conditions the type I error rates remain acceptable, i.e., < 5%. Table 4.6 shows that the type I error rate is > 5% (but < 10%) in the three conditions with non-discriminatory items for the 10-item scales. Table 4.7 shows a type I error rate > 5% (but < 10%) only in the conditions with 1 non-discriminatory item. Inspection of the item distributions in these conditions indicated that the deviant item(s) diminished the fit of the regular items in closest proximity on the scale. That is, only for these items did the measures of fit exceed the cutoff scores, which caused the relatively high overall type I error rate. However, in these conditions the deviant item clearly stood out with, on average, values of misfit three times as large as those of the neighboring regular item. We therefore recommend an iterative procedure for deleting deviant items from the scale: delete only the item that shows the strongest degree of misfit, and repeat the fitting procedure for the remaining items. This procedure is illustrated in Section 4.4.2.

Table 4.7: Type I error rates and success rates (power) of using ∆RMSE(−h) < -.005 as the criterion for item misfit.

              10-item scale          15-item scale
f3a  f3b   Type I error   Power   Type I error   Power
1    1        8.8%        100%       0.2%        91.7%
     2        3.6%        99.1%      0.8%        91.2%
     3        0.1%        89.1%      0.3%        84.8%
2    1        5.1%        100%       0.3%        93.0%
     2        1.1%        95.6%      0.5%        91.8%
     3        1.0%        69.4%      0.1%        85.6%

Note. The levels of f3a (type of deviation) were (1) non-discriminatory IRF and (2) mixture IRF; f3b indicates the total number of deviant items in the scale.
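The recommended iterative deletion amounts to backward elimination; a sketch, where `compute_delta_r2` is a hypothetical callback standing in for refitting the OCM diagrams on the remaining items and returning their ∆R2(−h) values:

```python
def iterative_item_deletion(items, compute_delta_r2, cutoff=0.025):
    """Backward elimination: repeatedly delete the single item with the
    strongest misfit (largest Delta R2(-h)) and refit, until no remaining
    item exceeds the cutoff."""
    items = list(items)
    removed = []
    while len(items) > 1:
        deltas = compute_delta_r2(items)          # refit on remaining items
        worst = max(range(len(items)), key=lambda h: deltas[h])
        if deltas[worst] <= cutoff:
            break                                  # all items fit acceptably
        removed.append(items.pop(worst))
    return items, removed
```

Deleting only one item per pass matters because, as noted above, a deviant item inflates the misfit of its regular neighbors; refitting after each deletion lets those neighbors recover.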

In the following sections we show two applications of the OCM diagrams for evaluating unfolding scales and items. The first example, from the field of personality assessment, focuses on scale evaluation (with a given set of items). The second example, from the field of attitude research, focuses on scale construction (with the purpose of item selection).

4.4 Applications of the OCM Diagrams

4.4.1 The Developmental Profile: A Bipolar Scale for Personality Development

We analyzed data on personality development collected with the Developmental Profile (DP) (Abraham et al., 2001). The DP is an instrument for personality assessment consisting of nine subscales, referred to as developmental levels, each consisting of nine items, referred to as developmental lines. For the purpose of the current application we analyzed only eight subscales (i.e., developmental levels), because very few respondents had a maximum score on the highest developmental level (i.e., level 9, Generativity).

Each developmental level describes a central or specific aspect of behavior, characteristic of a specific phase in the development of psychosocial capacities.

The developmental levels in the DP are organized in a hierarchy, according to the degree to which they are associated with the severity of maladaptive psychosocial functioning. The lower six levels refer to maladaptive behavior; the upper two levels refer to adaptive behavior.

It is assumed that the eight developmental levels may be seen as separate (but not independent) subscales, each consisting of nine items (behavior patterns defined on nine developmental lines). The items are scored by a trained professional on the basis of a semi-structured interview. A four-point scale is used to indicate the degree to which each personality characteristic is present (0 = not present; 1 = present to a limited degree; 2 = clearly present; 3 = very clearly present). The developmental profile of an individual is defined as his total score on each of the eight developmental levels. These total scores are recoded in the following manner: 0 = 0; 1 through 3 = 1; 4 through 6 = 2; ≥ 7 = 3. Note that it is these recoded subscale scores (interpreted as polytomous item scores) that are the focus of the analysis presented below.
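The recoding rule can be sketched as follows (assuming the bands partition the 0-27 total-score range of a level as 0; 1-3; 4-6; ≥ 7):

```python
def recode_level_score(total):
    """Recode a developmental-level total score (0-27, nine items each
    scored 0-3) into the four-point scale used in the analyses."""
    if total == 0:
        return 0
    if total <= 3:
        return 1
    if total <= 6:
        return 2
    return 3

print([recode_level_score(t) for t in (0, 2, 5, 7)])   # [0, 1, 2, 3]
```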

Previous studies have shown that the 8 developmental levels (i.e., the subscales of the DP) are ordered on one underlying bipolar dimension ranging from maladaptive to adaptive psychosocial functioning (e.g., Polak, Van, Overeem-Seldenrijk, Heiser, & Abraham, 2010; Chapter 5 of this thesis). In the current chapter the aim is to investigate the hypothesis that the shape of the IRFs of the eight levels is single-peaked. From a developmental perspective it is presumed, first, that an individual's (total) score pattern shows a peak at the level that best characterizes his current level of functioning. A second presumption is that the individual's scores on the remaining levels decrease as a function of the distance between those levels and his "peak" level along the (adaptivity) dimension underlying the DP. As an individual develops, for instance in the course of therapy, his peak will shift up the hierarchy of the DP as he learns to replace maladaptive behavior with more adaptive behavior.

The current sample consisted of 736 respondents, who were classified as forensic inpatients (N = 24), inpatients (N = 450), outpatients (N = 163), or normal controls (N = 99).

Figure 4.4 shows the OCM diagrams for the 8 items representing the developmental levels of the DP. The scale fit in terms of R2 and RMSE, the item fit in terms of R2 and RMSE "if item deleted", and the change statistics ∆R2(−h) and ∆RMSE(−h) are reported in Table 4.8.

Table 4.8: Scale fit, scale fit "if item h deleted", and the corresponding change statistics ∆R2(−h) and ∆RMSE(−h), in terms of R2 and RMSE, for the eight levels of the Developmental Profile.

                       R2     RMSE   ∆R2(−h)   ∆RMSE(−h)
Scale fit             .785    .067
if item deleted: v1   .778    .069    -.006      .002
                 v2   .793    .067     .008      .000
                 v3   .834    .061     .050     -.006
                 v4   .829    .071     .044      .003
                 v5   .841    .071     .056      .003
                 v6   .897    .050     .113     -.018
                 v7   .749    .073    -.035      .006
                 v8   .780    .068    -.004      .001

[Figure 4.4: eight panels of OCM diagrams, one per developmental level, each plotting the conditional mean response (0 to 1) against the rank numbers 1 through 8.]

Figure 4.4: OCM diagrams for the levels of the Developmental Profile.
