Faculty of Social and Behavioural Sciences
Graduate School of Child Development and Education

Selecting anchor items and testing measurement bias using restricted factor analysis models with product indicators

Research Master Child Development and Education
Thesis 2

Student: L. Kolbe, 11111186
Supervisors: Dr. T. D. Jorgensen
             Prof. dr. F. J. Oort

June 30, 2017


Abstract

The present investigation involved testing measurement bias using restricted factor analysis (RFA) models. Two simulation studies were conducted to examine (1) the selection of anchor items, and (2) modeling latent interactions to test for nonuniform bias. RFA can be extended with latent moderated structural equations (LMS) to test for nonuniform bias. Although RFA/LMS is a powerful means to detect measurement bias, several studies observed severely inflated Type I error rates. Consequently, the aim of Study 1 was to compare two anchor-selection strategies that have been proposed to prevent inflated error rates: the rank-based strategy and iterative procedure. By means of a Monte Carlo simulation, both strategies were evaluated with regard to selecting anchor items. The rank-based strategy appeared to obtain bias-free anchor sets more frequently than the iterative procedure. Therefore, this anchor-selection strategy was applied to Study 2, in which LMS was compared with an alternative extension to test for nonuniform bias using RFA models: product indicators (PI). We investigated the use of the detection methods to test measurement bias with respect to a categorical variable. Data were generated under several conditions that varied according to sample size and magnitude of the uniform and nonuniform bias. The PI method turned out to yield lower familywise Type I error rates than LMS on average and to have higher power across all conditions, but differences were negligible. More research is necessary to determine the optimal use of the PI method in RFA models for the purpose of detecting measurement bias.


Contents

1 Introduction
  1.1 Selection of anchor items
  1.2 Detection of measurement bias
  1.3 Present study

2 Theoretical background
  2.1 Anchor-selection strategies
    2.1.1 Rank-based strategy
    2.1.2 Iterative procedure
  2.2 Restricted factor analysis models
    2.2.1 Latent moderated structural equations
    2.2.2 Product indicators

3 Study 1
  3.1 Method
    3.1.1 Data generation
    3.1.2 Analysis
  3.2 Results
    3.2.1 Anchor-selection accuracy
    3.2.2 Mean purity
  3.3 Conclusion

4 Study 2
  4.1 Method
    4.1.1 Analysis
  4.2 Results
    4.2.1 Power
    4.2.2 Familywise Type I error
  4.3 Conclusion

5 Discussion

References


1 Introduction

Over the past decades, measurement bias has become an important issue of investigation in behavioral and social science. Measurement bias entails that psychological instruments function differently across groups, irrespective of true differences on the construct that the scale or test was designed to measure. In research on measurement bias, great emphasis has been placed on methods for detecting measurement bias (Cheung & Rensvold, 1999; Marsh, 1994; Mellenbergh, 1989; Milfont & Fischer, 2010). Detection methods are regularly compared using simulated data, striving to find the best method available. Accordingly, the aim of the present study was to compare a common method for detecting measurement bias with an alternative method whose usage has been considered but not yet investigated (Woods & Grimm, 2011).

In the presence of measurement bias, observed differences in composite scores (e.g., scale means) might not represent true differences in the construct a scale is developed to measure. Measurement bias is formally defined as a violation of measurement invariance:

f1(X | T = t, V = v) = f2(X | T = t),    (1)

where X is a set of observed variables (e.g., item scores), T is the construct of interest measured by X, and V is a set of variables other than T that potentially violate measurement invariance. Function f1 is the conditional distribution of X given t and v, and f2 is the conditional distribution of X given t. If measurement invariance holds (i.e., f1 = f2), the measurement of T by X is invariant with respect to V. But if measurement invariance does not hold (i.e., f1 ≠ f2), the measurement of T by X is biased with respect to V. A distinction can be made between uniform and nonuniform bias: the extent of bias is constant across all levels of the construct T for uniform bias but varies with levels of T for nonuniform bias.

The definition of measurement bias does not depend on the measurement level of the variables. Rather, the definition is general in the sense that X, T, and V may be measured on a nominal, ordinal, interval, or ratio level, and may be latent or manifest. Most typically in research on measurement bias, T is an interval variable representing a latent trait aimed to be measured, and V is a nominal variable representing group membership. Variable X is often a dichotomous variable representing a correct or incorrect response to an item, but can also be a normally distributed or Likert-type scale variable.

Various methods have been proposed to test for measurement bias. In the context of structural equation modeling (SEM), a common method to detect measurement bias is restricted factor analysis (RFA; Oort, 1992, 1998). In RFA models, the potential violator V is added to a common factor model as an exogenous variable that covaries with T. Uniform bias can be assessed by testing the significance of direct effects of V on X. To test for nonuniform bias, Barendse, Oort, and Garst (2010) proposed to extend RFA with latent moderated structural equations (LMS). This allows for assessing nonuniform bias by testing the significance of interaction effects of T × V on X. Multiple-indicator multiple-cause (MIMIC) models (B. O. Muthén, 1989) are statistically equivalent to RFA models, but instead of a covariance between violator V and trait T, a causal effect of V on T is modeled. Although RFA extended with LMS generally has high power to detect measurement bias (Barendse et al., 2010, 2012; Woods & Grimm, 2011), several simulation studies (Barendse et al., 2010, 2012; Finch, 2005; Stark, Chernyshenko, & Drasgow, 2006; Woods & Grimm, 2011) observed severely inflated Type I error rates.

1.1 Selection of anchor items

Woods and Grimm (2011) argued that the inflated Type I error rates observed when using RFA extended with LMS might be caused by a contaminated subset of anchor items. Anchor items are items presumed to be bias-free and are used for an estimation of the trait on which members of different groups are matched. A common strategy is to use all items other than the studied item as anchors, which performs well when all anchor items are bias-free (Cohen, Kim, & Wollack, 1996; Kim & Cohen, 1998). However, this strategy leads to a contaminated subset of anchor items when some items other than the studied item are biased. The latter circumstance causes problems such as inaccurate item parameter estimates and an overestimation of the amount of measurement bias in the test data (W.-C. Wang, 2004). These problems may account for the frequently observed inflated Type I error rates (Barendse et al., 2010, 2012; Finch, 2005; Stark et al., 2006; Woods & Grimm, 2011).

With the aim of preventing inflated Type I error rates, Woods (2009) proposed a two-step procedure called the rank-based strategy to select anchor items. This procedure most often produces a bias-free set of anchor items, resulting in Type I error rates below the nominal level of significance and high power to detect measurement bias. As an alternative, Barendse et al. (2012) showed that applying RFA iteratively can control for inflated Type I error rates. Both the rank-based strategy and the iterative procedure solve the problem of a contaminated subset of anchor items in similar ways, but in opposite directions. The rank-based strategy selects items with the strongest evidence against bias as anchors, whereas the iterative procedure removes items with the strongest evidence of bias from the anchor set.

1.2 Detection of measurement bias

Using RFA models to detect measurement bias requires an extension for modeling latent interactions to detect nonuniform bias. RFA models are commonly combined with LMS, but alternative methods are also available. The use of product indicators (PI) has received a great deal of attention in the general context of modeling interactions among latent variables in SEM (Henseler & Chin, 2010; Lin, Wen, Marsh, & Lin, 2010; Little, Bovaird, & Widaman, 2006; Marsh, Wen, & Hau, 2004). Product terms are built by multiplying the indicators of the associated latent variables, and these products serve as indicators for the latent interaction variable. The PI method originates from Kenny and Judd (1984), but extensions have been proposed, among which are the mean-centered unconstrained approach (Marsh et al., 2004), the mean-centered constrained approach (Algina & Moulder, 2001), the orthogonalizing approach (Little et al., 2006), and the double-mean-centered approach (Lin et al., 2010). The PI method has never been studied in light of testing measurement bias, but could be used as an extension for RFA models to test nonuniform bias.

1.3 Present study

The aim of this investigation was to compare methods to model latent interactions in RFA models to test measurement bias with respect to a categorical violator. Prior to this comparison, we examined which anchor-selection strategy is most suitable when testing measurement bias using RFA models. Hence, Study 1 evaluated the following research question:

1. Which anchor-selection strategy results in the least contaminated set of anchor items when testing measurement bias using RFA models?

With the findings of the present study, guidelines may be provided on how to select anchor items when testing measurement bias using RFA models. In line with previous studies (M. Wang & Woods, 2017), the rank-based strategy was expected to frequently obtain a bias-free anchor set. Moreover, the risk of a contaminated anchor set depends on the length of the anchor set: shorter anchor sets generally display a lower risk of contamination than longer anchor sets (Kopf, Zeileis, & Strobl, 2015). As the iterative procedure allows for a longer anchor set, this procedure was expected to obtain bias-free anchor sets less often than the rank-based strategy. In addition, we also expected the iterative procedure to perform worse, because it starts from the assumption that all items can serve as anchors.

In Study 2, we investigated whether the observed inflated Type I error rates are better controlled by using PI than by using LMS to model latent interactions for the purpose of testing measurement bias with RFA. Moreover, we examined whether the use of PI has adequate power to detect measurement bias relative to LMS. The following two research questions were evaluated:

2. Are Type I error rates better controlled by using PI than by using LMS to test measurement bias with RFA models?

3. Relative to LMS, does the use of PI have adequate power to detect biased items?

This study might reveal which detection method minimizes the chance of inflated Type I errors when testing for measurement bias. Consistent with previous simulation studies (Barendse et al., 2010, 2012; Woods & Grimm, 2011), Type I error rates were expected to be inflated when using LMS to model latent interactions in RFA models. In general, the RFA method extended with LMS was expected to have high power to detect uniform bias, but less power to detect nonuniform bias (Barendse et al., 2010, 2012). Especially in conditions with a small sample size, nonuniform bias was hypothesized to be difficult to detect (Barendse et al., 2010). Because LMS is only implemented in a single SEM computer program (i.e., Mplus; L. K. Muthén & Muthén, 2012), knowing whether PI works at least as well as LMS could provide more researchers the opportunity to test for nonuniform bias using any SEM software package.


2 Theoretical background

2.1 Anchor-selection strategies

An anchor-selection strategy guides the decision about which particular items are used as anchor items when testing items for measurement bias. Anchor items are presumed to be bias-free and are used as an operationalization of the trait. In RFA models, anchor items are not regressed on V and T × V when testing items for measurement bias. At least one anchor item is required to define the trait on which the groups are compared. Several strategies for selecting or identifying anchor items have been proposed. Some strategies rely on prior knowledge of bias-free items or content experts' advice, whereas empirical strategies are based on preliminary item analysis. This study focused only on empirical anchor-selection strategies.

2.1.1 Rank-based strategy

The rank-based strategy proposed by Woods (2009) involves a two-step procedure, in which anchor items are selected in the first step and items are tested for measurement bias in the second step. This strategy was primarily developed for the item response theory (IRT) approach to selecting anchor items and testing for measurement bias, but it can be extended to other detection methods as well. In order to select anchor items, an omnibus test can be performed to examine measurement bias for one item at a time. In the context of RFA, the fit of a constrained model can be compared with the fit of several unconstrained models (one for each of the items). Figure 1 depicts a simple example of an RFA-interaction model in which uniform and nonuniform bias can be tested. For visual simplicity, the measurement models of V and T × V are excluded from Figure 1, but those details are discussed in Section 2.2.

Figure 1. An example of an RFA model for testing measurement bias. Dashed arrows represent effects that may be estimated to test for uniform and nonuniform bias.
Note. V = violator variable; T = latent trait; T × V = interaction factor; X_i (i = 1, 2, ..., k) = indicators of the latent trait T; ε_i (i = 1, 2, ..., k) = residual factors.

In the constrained model, none of the items is regressed on V and T × V (i.e., the effects represented by the dashed arrows are fixed at zero), whereas in the unconstrained model the corresponding item is regressed on V and T × V (i.e., the effect represented by the dashed arrow is freely estimated). The constraints can be tested with likelihood ratio (LR) tests. Alternatively, Lagrange multiplier (LM) tests can be calculated for each item's set of constraints, which are asymptotically equivalent to the LR test but only require fitting the constrained model. In this manner, uniform and nonuniform measurement bias are tested simultaneously. The LR and LM test statistics are distributed as chi-squared random variables with 2 df. After calculating a test statistic for each item's set of constraints, the items are ranked in ascending order based on their test statistics. The items with the smallest test statistics are selected as anchor items. The actual number of items selected as anchor items may be determined by factors such as test length and sample size. Woods (2009) suggested that the number of anchor items should be approximately 10-20% of the total number of items. The ratio of the test statistic to the number of free parameters can be used instead when a test contains discrete items with different numbers of response options.
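In practice, the selection step reduces to ranking the per-item omnibus statistics and keeping the smallest ones. The following R sketch illustrates this with a hypothetical vector of 2-df test statistics for a 10-item scale; only the ranking and selection logic is shown, not the model fitting that produces the statistics.

## Hypothetical omnibus test statistics (2 df), one per item
chisq_stats <- c(x1 = 1.2, x2 = 9.8, x3 = 14.1, x4 = 6.4, x5 = 8.7,
                 x6 = 0.9, x7 = 2.3, x8 = 1.7, x9 = 0.5, x10 = 3.0)

## Woods (2009): use roughly 10-20% of the items as anchors
n_anchors <- ceiling(0.20 * length(chisq_stats))

## Rank items by evidence for bias (ascending) and keep the smallest ones
anchor_items <- names(sort(chisq_stats))[seq_len(n_anchors)]
anchor_items  # for these hypothetical values: "x9" "x6"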

2.1.2 Iterative procedure

The iterative procedure was proposed by Barendse et al. (2012) as a detection method for measurement bias. However, their procedure can also be applied for the purpose of selecting anchor items (see Candell & Drasgow, 1988; Hidalgo-Montesinos & Lopez-Pina, 2002; Kopf et al., 2015). Similar to the rank-based strategy, this procedure involves comparing the fit of a constrained model with several unconstrained models. In the constrained model, none of the items is regressed on V and T × V (i.e., the effects represented by the dashed arrows in Figure 1 are fixed at zero), whereas in the unconstrained model the corresponding item is regressed on V and T × V (i.e., the effect represented by the dashed arrow of the studied item in Figure 1 is freely estimated). Thus, uniform and nonuniform bias are tested simultaneously.

In the first run of the iterative procedure, the item associated with the largest significant test statistic is considered biased. This measurement bias is taken into account in the second run by allowing regression of the biased item on V and T × V in the forthcoming constrained and unconstrained models, and the remaining items are tested for measurement bias. The procedure continues until none of the remaining items is associated with a significant test statistic, or until half of the items are considered biased. The remaining items considered bias-free after the final run are selected as anchor items.
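The same logic can be written as a loop. The sketch below is schematic R code under the assumption of a hypothetical helper, bias_statistic(), that fits the constrained and unconstrained RFA models for one studied item (freeing the bias parameters of all items already flagged) and returns the 2-df chi-squared difference statistic; it is not an existing function.

## Schematic sketch of iterative anchor selection (Barendse et al., 2012)
select_anchors_iterative <- function(items, data, alpha = 0.05) {
  flagged <- character(0)
  crit <- qchisq(1 - alpha, df = 2)
  repeat {
    remaining <- setdiff(items, flagged)
    ## bias_statistic() is a hypothetical helper (see lead-in)
    stats <- sapply(remaining, bias_statistic, data = data, flagged = flagged)
    ## stop when no remaining item is significant or half the items are flagged
    if (max(stats) <= crit || length(flagged) >= length(items) / 2) break
    flagged <- c(flagged, remaining[which.max(stats)])
  }
  setdiff(items, flagged)  # items still considered bias-free become anchors
}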


2.2 Restricted factor analysis models

In RFA models, the trait T can be modeled as a latent factor with multiple measures X as observed indicators. The possible violator V is added to the measurement model as an exogenous variable with a single indicator and is allowed to covary with T. Measurement bias can be examined by comparing the fit of an unconstrained model with several constrained models. In the unconstrained model, each of the studied items is regressed on V and T × V, except for the items in the anchor set. Each constrained model involves fixing the regression of the studied item onto V and T × V at zero. A significant difference in fit between the unconstrained and constrained model indicates that the item is biased with respect to V. Figure 2 illustrates an example of an RFA model to test for measurement bias using two anchor items, in which the violator V is modeled as a latent variable.

More specifically, the observed scores on items in RFA models with a grouping variable V are modeled as

x_j = τ + Λ t_j + b v_j + c t_j v_j + δ ε_j,    (2)

where x_j is a vector of observed item scores, t_j is the common factor T score, v_j is a dummy code for group membership V, and ε_j is a vector of the residual scores of subject j. Moreover, τ is a vector of intercepts, Λ is a vector of factor loadings on the common factor T, δ is a vector of residual factor loadings fixed at 1 for identification, and b and c are vectors of regression coefficients. A non-zero element in b or c indicates uniform bias or nonuniform bias, respectively. Thus, if the omnibus test for an item is significant, follow-up tests can be performed using the Wald z test for these regression coefficients. Uniform bias can be assessed by testing the significance of the effect of the potential violator V on the observed item scores X. Nonuniform bias can be assessed by testing the significance of the effect of the interaction between the common factor T and the potential violator V on the observed item scores X.

Figure 2. An example of an RFA model with an interaction factor. Dashed arrows represent effects that may be estimated to test for uniform and nonuniform bias. The indicators X_1 and X_2 serve as anchor items.
Note. V = violator variable; T = latent trait; T × V = interaction factor; X_i (i = 1, 2, ..., k) = indicators of the latent trait; λ_i (i = 1, 2, ..., k) = common factor loadings; b_i (i = 1, 2, ..., k) = regression coefficients of the effect of V on the indicators; c_i (i = 1, 2, ..., k) = regression coefficients of the effect of T × V on the indicators; ε_i (i = 1, 2, ..., k) = residual factors.

Below, two approaches to estimate interactions between the latent variable T and the violator variable V are described: LMS and PI.
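As a concrete illustration, the non-interaction part of Equation 2 can be written directly in lavaan model syntax. The sketch below is hypothetical: item names x1-x10 and the grouping indicator g are placeholders, x1 is the studied item, and only the uniform-bias parameter is shown; the interaction term T × V is added via LMS or PI as described in the next two subsections.

library(lavaan)

## Unconstrained RFA model for one studied item (uniform bias only)
rfa_model <- '
  T =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
  V =~ 1*g        # violator as a single-indicator latent variable
  g ~~ 0*g        # indicator residual variance fixed at zero
  T ~~ V          # violator covaries with the trait
  x1 ~ b1*V       # direct effect of V on the studied item (uniform bias)
'
## fit <- sem(rfa_model, data = dat)
## The constrained model fixes b1 to zero (x1 ~ 0*V); comparing the two fits
## with lavTestLRT() tests the studied item for uniform bias.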


2.2.1 Latent moderated structural equations

Barendse et al. (2010) proposed to extend RFA with LMS to test for nonuniform bias. The LMS approach to estimating interaction effects of latent variables is a distribution analytic approach that is available in Mplus (L. K. Muthén & Muthén, 2012). The raw data of the indicators are used for the estimation of the latent interaction effect, and no products of indicators are necessary. The LMS approach implements a maximum likelihood (ML) estimation procedure that was especially developed for the distributional properties of the model (Klein & Moosbrugger, 2000). The distributional characteristics of the nonnormally distributed joint indicator vector are explicitly taken into account. Accordingly, the joint distribution of the indicators is represented as a finite mixture of normal distributions. The mixture distribution function is analyzed and ML estimates are obtained by means of the expectation maximization algorithm (Dempster, Laird, & Rubin, 1977). The ML estimation procedure is tailored for the type of nonnormality implied by the interaction effects. The LMS approach assumes multivariate normality for all latent exogenous variables, that is, for the latent predictors and residual variables. When using LMS, the possible violator V is required to be modeled as a latent variable. In case of a dummy-coded grouping variable V, the maximum likelihood normality assumption is violated for the indicator of the single-indicator latent variable representing the grouping variable. This violation can be dealt with by using a robust ML estimator (see Woods & Grimm, 2011).

2.2.2 Product indicators

The use of PI to model interactions among latent variables originated with Kenny and Judd (1984). The PI method involves the specification of a measurement model for the latent interaction factor. In general, product terms are built by multiplying the indicators of the associated latent variables, and these products serve as indicators for the latent interaction factor. All indicators, including the product indicators, are assumed to be multivariate normally distributed if the ML estimation procedure is used. Because products of normal variables are not themselves normally distributed, this assumption is violated. Thus, a robust ML estimator is used to deal with this violation (see Marsh et al., 2004). Several variants of the PI method have been proposed, among which is the double-mean-centering strategy proposed by Lin et al. (2010). The double-mean-centering strategy is superior to other strategies, as it eliminates the need for a mean structure and does not involve a cumbersome estimation procedure. Although the orthogonalizing (Little et al., 2006) and double-mean-centering strategies perform equally well when all indicators are normally distributed, the double-mean-centering strategy performs better when the assumption of normality is violated. The double-mean-centering strategy involves mean centering the indicators of the latent variables of interest. The mean-centered indicators serve as indicators of the latent variables and are used to form the product indicators for the latent interaction factor. Then the product indicators are mean centered again and are used as indicators of the latent interaction factor. If the latent trait T has I indicators and the violator variable V has J indicators, then the latent interaction factor can have I × J product indicators, although matching schemes have been proposed to limit the number of product indicators (Marsh et al., 2004). Figure 3 shows an example of an RFA model with a latent interaction using the PI method. All possible cross products are used in this example (i.e., each indicator of T is multiplied by the indicator of V), and all T and V indicators are mean centered. The specification of a mean structure is not necessary when all indicators are mean centered and all product indicators are double-mean centered (Lin et al., 2010).

Figure 3. An example of an RFA model with an interaction factor using product indicators. Dashed arrows represent effects that may be estimated to test for uniform and nonuniform bias.
Note. RFA = restricted factor analysis; C = mean centered; V = violator variable; T = latent trait; T × V = interaction factor; G = indicator of the violator variable; X_i (i = 1, 2, ..., k) = indicators of the latent trait; ε_i (i = 1, 2, ..., k) = residual factors; δ_i (i = 1, 2, ..., k) = residual factors of the product indicators.
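A minimal sketch of how the double-mean-centered PI approach could be set up in R is given below. It assumes a data frame dat with trait indicators x1-x10 and a grouping indicator g (hypothetical names), and uses semTools::indProd() to build the product indicators; the product-indicator names x1.g, ..., x10.g follow that function's default naming and are likewise assumptions.

library(lavaan)
library(semTools)

## Build double-mean-centered product indicators (Lin et al., 2010)
dat_pi <- indProd(dat, var1 = paste0("x", 1:10), var2 = "g",
                  match = FALSE, meanC = TRUE, doubleMC = TRUE)

## RFA model with the latent interaction factor measured by the products
model_pi <- '
  T   =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
  V   =~ 1*g
  g  ~~ 0*g
  TxV =~ x1.g + x2.g + x3.g + x4.g + x5.g + x6.g + x7.g + x8.g + x9.g + x10.g
  T  ~~ V + TxV
  V  ~~ TxV
  x1  ~ V + TxV   # uniform and nonuniform bias parameters of the studied item
'
## fit_pi <- sem(model_pi, data = dat_pi, estimator = "MLR")  # robust ML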


3 Study 1

3.1 Method

In this study, the suitability of the rank-based strategy (Woods, 2009) and the iterative procedure (Barendse et al., 2012) for selecting anchor items was examined using simulated data. The suitability of these strategies was assessed for the RFA method extended with LMS and with PI. We generated 1000 replications to determine the anchor-selection accuracy and mean purity in each condition. Manipulated factors included the method to test for measurement bias (LMS and PI) and the reference and focal group sizes: (n_r, n_f) ∈ {(50, 50), (100, 100), (150, 150), (200, 200)}. Crossing these manipulated factors yielded 8 conditions.

3.1.1 Data generation

Data were generated for two groups under different sample sizes. A scale of k = 10 items was considered, of which two items were uniformly biased and two items were nonuniformly biased. Hence, 40% of the items were biased. This percentage of biased items was chosen because the aim was to investigate the performance of the anchor-selection strategies under conditions with a substantial degree of contamination in the anchor set. Item scores of subject j in group g were generated using the following model:

x_j = τ_g + Λ_g t_j + ε_j,    (3)

where x_j is a vector of 10 item scores of subject j, t_j is a subject's common factor score, and ε_j is a vector of 10 unique factor scores (residuals) for subject j. Moreover, τ_g is a vector containing 10 intercepts of group g, and Λ_g is a vector of 10 common factor loadings. Differences in the latent trait were simulated by drawing latent trait values for the reference group from a standard normal distribution and for the focal group from a normal distribution with a lower mean, t_f ∼ N(−0.5, 1), similar to Barendse et al. (2010). Residual factor scores were drawn from a standard normal distribution, ε_j ∼ N(0, 1).

The same magnitudes of uniform and nonuniform bias used by Barendse et al. (2010) were replicated. Uniform bias was introduced by imposing across-group differences in intercepts. All intercepts τ were equal to 0, except the intercepts of the second and third items in the focal group, which were chosen equal to 0.5 (small uniform bias) and 0.8 (large uniform bias), respectively. All common factor loadings λ were fixed at 0.8, except the factor loadings of the fourth and fifth items in the focal group, which were chosen equal to 0.55 (small nonuniform bias) and 0.3 (large nonuniform bias), respectively. The residual variances were set equal to the square root of 1 − λ_g².
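A compact R sketch of this data-generating model (Equation 3) is shown below. The design values are taken from the text; the residual scaling is an interpretation, treating sqrt(1 − λ²) as the residual standard deviation so that item variances equal 1 in the unbiased case, and the function and object names are hypothetical.

## Generate item scores for one group under Equation 3
gen_group <- function(n, tau, lambda, t_mean) {
  tj  <- rnorm(n, mean = t_mean, sd = 1)              # common factor scores
  eps <- matrix(rnorm(n * length(lambda)), nrow = n)  # standard-normal residuals
  x   <- matrix(tau, n, length(lambda), byrow = TRUE) +
         outer(tj, lambda) +
         sweep(eps, 2, sqrt(1 - lambda^2), "*")       # assumed residual scaling
  x   <- as.data.frame(x)
  names(x) <- paste0("x", seq_along(lambda))          # item names x1, ..., x10
  x
}

lambda_ref <- rep(0.8, 10)
lambda_foc <- replace(lambda_ref, 4:5, c(0.55, 0.30))  # nonuniform bias (items 4-5)
tau_ref    <- rep(0, 10)
tau_foc    <- replace(tau_ref, 2:3, c(0.5, 0.8))       # uniform bias (items 2-3)

dat <- rbind(cbind(gen_group(200, tau_ref, lambda_ref, t_mean =  0.0), g = 0),
             cbind(gen_group(200, tau_foc, lambda_foc, t_mean = -0.5), g = 1))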

3.1.2 Analysis

When LMS was used as the detection method, each item was tested for measurement bias by comparing the fit of a constrained model with the fit of an unconstrained model. In the constrained model, b and c are vectors containing zeros, whereas for the studied item the corresponding elements in b and c are freely estimated in the unconstrained model. A chi-squared difference test was performed to examine the difference in fit between the models, using α = .05 as the level of significance. When PI was used as the detection method, LM tests were used to determine the significance of the set of constraints for each item. In order to enable the estimation of the model parameters for both detection methods, group membership was modeled as a latent factor with a single indicator, with the residual variance fixed at zero and the factor loading fixed at unity. A significant test statistic indicates that the item is biased with respect to the violator V. Depending on the detection method under investigation, certain follow-up steps were taken to select items for the anchor set, as described below.


When the rank-based strategy was used to select anchor items, the items were ranked in ascending order based on their chi-squared statistics. The two items (20%) with the lowest chi-squared statistics were selected for the anchor set. With the iterative procedure, items were iteratively tested for measurement bias. After each run, the item associated with the largest significant chi-squared test statistic was considered biased, and this measurement bias was taken into account in the following run. The procedure continued until none of the remaining items was associated with a significant chi-squared statistic, or until half of the items were considered biased. The remaining items considered bias-free after the final run were selected for the anchor set. If the chi-squared statistic of one or more of the studied items could not be determined, for instance because of non-convergence problems, the procedure was ended and the items considered bias-free in the previous run were selected as anchor items.

The anchor-selection accuracy of the detection methods was evaluated for each condition. Anchor-selection accuracy represents the proportion of replications that yielded a bias-free anchor set. In addition to the anchor-selection accuracy, we evaluated the mean purity of the anchor set, that is, the average percentage of bias-free items in the anchor set. The models were fit with Mplus (Version 7; L. K. Muthén & Muthén, 2012) in the LMS conditions and lavaan (Rosseel, 2012) in the PI conditions, and results were analyzed with R (Version 3.3.2; R Core Team, 2016).
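The two outcome measures are straightforward to compute once the selected anchor sets are stored. A small sketch, assuming anchor_sets is a list with one character vector of selected anchor items per replication and that items 2-5 are the truly biased items in this design:

biased_items <- c("x2", "x3", "x4", "x5")  # truly biased items in this design

## Proportion of replications with a completely bias-free anchor set
accuracy <- mean(sapply(anchor_sets, function(a) !any(a %in% biased_items)))

## Average percentage of bias-free items among the selected anchors (mean purity)
purity <- mean(sapply(anchor_sets, function(a) mean(!(a %in% biased_items)))) * 100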

3.2 Results

Each of the 1000 replications was used to investigate the anchor-selection accuracy and mean purity of both anchor-selection strategies in all 8 conditions. After conducting the analysis for each of the conditions, we found that the LMS method in RFA models did not always produce valid results. On average across all conditions, convergence problems occurred in 336 replications using LMS, which is 33.60% of all replications in one condition. Convergence problems especially occurred in conditions with small sample sizes. On average among the cases with convergence problems, the chi-squared statistic could not be calculated for 25.95% of the items. Hence, these items could not be tested for measurement bias. When such problems occurred with the LMS method using the rank-based strategy, the results of that replication in that condition were not included in the analysis, because a decision about anchor items could not be made in practice. Similarly, when convergence problems occurred with the LMS method in the first run of the iterative procedure, the analysis was not conducted, because in practice a researcher would not be able to make a decision about anchor items in this situation. The PI method did not produce any convergence problems: each of the models converged for every single replication in all conditions using PI.

3.2.1 Anchor-selection accuracy

Table 1 shows the results of the selection of anchors for each of the conditions. The rank-based strategy obtained 84.89% to 97.60% bias-free anchor sets, for an overall anchor-selection accuracy of 93.74%. The LMS and PI methods did not differ substantially with regard to anchor-selection accuracy when using the rank-based strategy, but the percentages of bias-free anchor sets were consistently higher for the PI method. Moreover, the anchor-selection accuracy of the rank-based strategy increased with sample size, but even in conditions with a sample size of n = 50, percentages of bias-free anchor sets were high.

Table 1
Anchor-selection accuracy and mean purity of the anchor-selection strategies for each of the conditions in Study 1.

Method  Selection strategy  n    Anchor-selection accuracy  Mean purity
LMS     RB                  50   84.89%                     92.45%
LMS     RB                  100  94.21%                     97.10%
LMS     RB                  150  96.65%                     98.33%
LMS     RB                  200  97.06%                     98.53%
LMS     IP                  50    9.63%                     82.20%
LMS     IP                  100  27.93%                     87.39%
LMS     IP                  150  47.32%                     90.86%
LMS     IP                  200  67.36%                     94.46%
PI      RB                  50   87.10%                     93.55%
PI      RB                  100  95.00%                     97.50%
PI      RB                  150  97.40%                     98.70%
PI      RB                  200  97.60%                     98.80%
PI      IP                  50   10.10%                     83.50%
PI      IP                  100  33.60%                     89.85%
PI      IP                  150  55.00%                     93.40%
PI      IP                  200  70.00%                     95.59%

Note. LMS = latent moderated structures; PI = product indicators; RB = rank-based strategy; IP = iterative procedure. Anchor-selection accuracy = percentage of replications in which the anchor set did not contain any biased items. Mean purity = the average percentage of bias-free items out of all the selected anchor items.

Among all conditions, the iterative procedure obtained lower percentages of bias-free anchor sets than the rank-based strategy. Anchor-selection accuracy ranged from 9.63% to 70.00%, with an overall accuracy of 40.12%. Similar to the rank-based strategy, percentages of bias-free anchor sets were consistently higher for the PI method than for the LMS method. However, differences in anchor-selection accuracy were only small. For both detection methods, the anchor-selection accuracy of the iterative procedure increased with sample size. In conditions with a sample size of n = 50, a bias-free anchor set was selected in only 9.63% to 10.10% of all replications. Anchor-selection accuracy was substantially higher in conditions with a larger sample size.

3.2.2 Mean purity

Overall, the average percentages of bias-free items in the anchor set were reasonably high, with mean purity ranging from 82.20% to 98.80%. The rank-based strategy obtained an average mean purity of 96.87%. The LMS and PI methods did not differ much with regard to mean purity when using the rank-based strategy, but the percentages of bias-free items in the anchor set were consistently higher for the PI method. Mean purity was higher in conditions with larger sample sizes, but sample size did not affect mean purity substantially when using the rank-based strategy.

The iterative procedure obtained mean purities comparable to the rank-based strategy, with an average mean purity of 89.66%. In each of the conditions, the rank-based strategy had higher mean purities than the iterative procedure. Moreover, using the iterative procedure, the percentages of bias-free items in the anchor set were consistently higher for the PI method than for the LMS method. However, again, the differences between the two detection methods were small. Similar to the rank-based strategy, the mean purity slightly increased with sample size. The effect of sample size was stronger for the iterative procedure, which showed a 12% increase in mean purity compared to only a 5-6% increase for the rank-based strategy.

3.3 Conclusion

The results of the present simulation study illustrate that the rank-based strategy outperforms the iterative procedure when selecting anchor items to test measurement bias using RFA models. The rank-based strategy consistently yields a higher percentage of bias-free anchor sets and a higher percentage of bias-free items out of all items selected as anchors. Whereas the iterative procedure shows an unsatisfactorily low anchor-selection accuracy and mean purity for small sample sizes, the rank-based strategy performs well across all sample sizes. The considerably better performance of the rank-based strategy is observed regardless of the method used to model the latent interaction. The differences in mean purity between the two anchor-selection strategies are, however, less extreme. Given that the rank-based strategy more frequently obtains a bias-free anchor set and a higher mean purity, it is arguably more suitable than the iterative procedure for selecting anchor items to help prevent inflated Type I error rates when testing measurement bias using RFA models.


4 Study 2

4.1 Method

In Study 2, the appropriateness of the LMS and PI methods in RFA models to detect uniform and nonuniform bias was assessed by evaluating the Type I error rates and power of both methods. This study involved four conditions that varied according to the reference and focal group sizes: (n_r, n_f) ∈ {(50, 50), (100, 100), (150, 150), (200, 200)}. A total of 1000 replications were generated for each condition. Data generated for Study 1 were used again in Study 2 (see Section 3.1.1).

4.1.1 Analysis

When LMS was used as the detection method, each item was tested for measurement bias by comparing the fit of an unconstrained model with the fit of several constrained models. In the unconstrained model, all elements in b and c were freely estimated, except for the elements corresponding to the ninth and tenth items. These two bias-free items were used as anchor items and were not tested for measurement bias. For the eight remaining items a constrained model was fitted, in which the corresponding elements in b and c for the studied item were fixed at zero. When PI was used as the detection method, LM tests were conducted to determine the significance of the set of constraints for each item. In order to enable the estimation of the model parameters, group membership was modeled as a latent factor with a single indicator, with the residual variance fixed at zero and the factor loading fixed at unity. An item was flagged as biased with respect to violator V when the chi-squared statistic was significant using a criterion of α = .05. Because the fit of the unconstrained model was compared with the fit of eight constrained models, we also calculated results with a Bonferroni corrected alpha level (α = .05/8 = .00625).


Power and familywise Type I error rates were calculated for both methods to model latent interactions in RFA models across conditions. Power reflects the proportion of replications in which the truly biased items were correctly flagged as biased. The familywise Type I error rate represents the proportion of replications in which there was at least one Type I error, that is, one of the four bias-free items being incorrectly flagged as biased. Agresti-Coull confidence intervals (Agresti & Coull, 1998) around the error rates were calculated to evaluate the significance of inflation. Power was calculated for each type (uniform and nonuniform) and magnitude (small and large) of measurement bias separately. The models were fit with Mplus (Version 7; L. K. Muthén & Muthén, 2012) in the LMS conditions and lavaan (Rosseel, 2012) in the PI conditions, and results were analyzed with R (Version 3.3.2; R Core Team, 2016).
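For reference, the familywise error rate and its Agresti-Coull interval can be computed with a few lines of R. The sketch assumes a logical matrix flags (replications by items) of significant test results and a vector biasfree_items giving the columns of the four bias-free, non-anchor items; these names are hypothetical.

## At least one false positive among the bias-free items, per replication
fw_hits  <- apply(flags[, biasfree_items], 1, any)
fw_error <- mean(fw_hits)

## Agresti-Coull (1998) confidence interval for a proportion
agresti_coull <- function(x, n, level = 0.95) {
  z       <- qnorm(1 - (1 - level) / 2)
  n_tilde <- n + z^2
  p_tilde <- (x + z^2 / 2) / n_tilde
  p_tilde + c(-1, 1) * z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
}
agresti_coull(x = sum(fw_hits), n = length(fw_hits))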

4.2 Results

Each of the 1000 replications was used to investigate the power and familywise Type I error of both detection methods in all 4 conditions. After performing the analysis for each of the conditions, we again observed a number of replications with invalid results when using the LMS method. On average across all conditions, convergence problems occurred in 338 replications using LMS, which is 33.80% of all replications in one condition. For each of these replications, the problem involved a non-converging unconstrained model. Due to this complication, a chi-squared statistic could not be calculated for any of the items. The results of these replications were not included in the analysis, because in these situations items could not be tested for measurement bias in practice. The PI method did not produce any non-convergence problems: each of the models converged for every single replication in all conditions.


4.2.1 Power

Table 2 shows the power and familywise Type I error for the two detection methods across conditions. The LMS method had an overall power of .763 to detect measurement bias. Power increased with sample size for all types and magnitudes of bias, except for uniform bias with a large effect size, which approached a power of 1.000 regardless of sample size. Relative to uniform bias, nonuniform bias was more difficult to detect with the LMS method. The LMS method yielded especially low power for small nonuniform bias. For this type and magnitude of bias, power ranged from .140 to .592.

The PI method had an overall power of .818 to detect measurement bias. In each of the conditions and for each type and magnitude of bias, the PI method obtained a higher power than the LMS method. However, differences between the two detection methods were small. Similar to the LMS method, power increased with sample size and nonuniform bias was more difficult to detect than uniform bias. Power was especially low for detecting small nonuniform bias when using the PI method. With a sample size of n = 50, for example, small nonuniform bias was only detected in 14.60% of all replications. Power and familywise Type I error for the two detection methods using a corrected alpha level are shown in Table 3. Although power to detect uniform bias generally remained high, particularly low power to detect nonuniform bias was obtained for both detection methods. Especially in the condition with the smallest sample size, power decreased substantially after correcting the alpha level.

4.2.2 Familywise Type I error

Table 2
Power and familywise Type I error rates (including Agresti-Coull confidence intervals) for the LMS and PI method under each condition using an uncorrected alpha level (α = .05).

                    Power
Method  n    small UB  large UB  small NB  large NB   Familywise Type I error (95% CI)
LMS     50   .869      .992      .140      .541       .048 [.036, .063]
LMS     100  .951      .958      .300      .830       .043 [.032, .058]
LMS     150  .972      .969      .452      .826       .058 [.045, .074]
LMS     200  .996      .993      .592      .830       .062 [.049, .079]
PI      50   .979      1.000     .146      .624       .052 [.040, .068]
PI      100  .999      1.000     .306      .931       .037 [.027, .068]
PI      150  1.000     1.000     .488      .991       .061 [.048, .078]
PI      200  1.000     1.000     .630      1.000      .053 [.041, .069]

Note. LMS = latent moderated structures; PI = product indicators; UB = uniform bias; NB = nonuniform bias.

Familywise Type I error rates for the LMS method ranged from .043 to .062, with an average rate of .053 (see Table 2). In the two conditions with the smallest sample sizes, the error rates were below the nominal level of significance. Familywise Type I error rates were above α = .05 in the other conditions, but were not significantly larger than .05. Agresti-Coull confidence intervals for the error rates in the conditions with n = 150 and n = 200 included the nominal level of significance (95% CIs [.045, .074] and [.049, .079], respectively).

The PI method yielded familywise Type I error rates ranging from .037 to .061, with an overall rate of .051. In the condition with a sample size of n = 100, the error rate was below the nominal level of significance. In the remaining conditions, error rates were above α = .05. However, the error rates in these conditions were not significantly larger than .05: the Agresti-Coull confidence intervals for the familywise Type I error rates included the nominal level of significance (95% CIs [.040, .068], [.048, .078], and [.041, .069], respectively). The Bonferroni correction of the alpha level resulted in error rates below the nominal level of significance (see Table 3).

Table 3
Power and familywise Type I error rates (including Agresti-Coull confidence intervals) for the LMS and PI method under each condition using a Bonferroni corrected alpha level (α = .00625).

                    Power
Method  n    small UB  large UB  small NB  large NB   Familywise Type I error (95% CI)
LMS     50   .440      .834      .027      .218       .004 [.001, .012]
LMS     100  .947      .958      .101      .673       .000 [-.001, .005]
LMS     150  .972      .969      .185      .797       .003 [.001, .009]
LMS     200  .995      .993      .269      .818       .008 [.004, .016]
PI      50   .649      .999      .030      .285       .003 [.001, .009]
PI      100  .998      1.000     .101      .746       .001 [-.000, .006]
PI      150  1.000     1.000     .182      .961       .005 [.002, .012]
PI      200  1.000     1.000     .313      .994       .009 [.004, .017]

Note. LMS = latent moderated structures; PI = product indicators; UB = uniform bias; NB = nonuniform bias.

4.3 Conclusion

The results of this simulation study show that the PI method performs at least as well as the LMS method when testing for measurement bias using RFA models. The PI method consistently yields a higher power to detect either small or large, uniform or nonuniform bias, and overall results in lower familywise Type I error rates. Relative to LMS, the PI method is especially more powerful at detecting large nonuniformly biased items. However, the power of the two detection methods to detect other types or magnitudes of measurement bias is in close agreement. The error rates do not significantly differ from the nominal .05 in any of the conditions, but Bonferroni corrections were too conservative. Although the PI method yields a lower average familywise Type I error rate, its error rate is lower than that of the LMS method in only one condition; differences between the two methods are small. Given the higher power of PI compared to LMS, this detection method is preferred for testing measurement bias using RFA models.


5 Discussion

The present study concerned testing items for measurement bias using RFA models. In order to test for nonuniform bias, RFA requires an extension to model latent interactions. For example, RFA can be extended with LMS to test for nonuniform bias. Although this method is a powerful means to detect measurement bias, it regularly results in severely inflated Type I error rates (Barendse et al., 2010, 2012; Finch, 2005; Stark et al., 2006; Woods & Grimm, 2011). Therefore, the aim of this investigation was to compare LMS with PI, an alternative method to model latent interactions. We examined whether this method can control for the inflated error rates obtained with LMS. Woods and Grimm (2011) argued that the inflated Type I error rates of LMS might be caused by a contaminated set of anchor items. Hence, prior to the comparison between the two detection methods, we investigated which anchor-selection strategy is most suitable when testing measurement bias using RFA models.

The results of Study 1 suggested that Woods's (2009) rank-based strategy is more suitable than an iterative procedure of removing biased items (Barendse et al., 2012) for the purpose of selecting anchor items in RFA models. As hypothesized, the rank-based strategy consistently obtained a higher percentage of bias-free anchor sets and performed well across all sample sizes. In addition, the rank-based strategy resulted in a higher percentage of bias-free items in the selected anchor set. In general, the rank-based strategy resulted in roughly 10% greater purity in conditions with a smaller sample size, although the differences were less salient with larger sample sizes. The considerably better performance of the rank-based strategy was observed regardless of the method used to model the latent interaction. Given that the rank-based strategy more frequently obtains a bias-free anchor set, we think it is more suitable than the iterative procedure for selecting anchor items to test measurement bias using RFA models. These results are in line with previous studies (M. Wang & Woods, 2017; Woods, 2009), which showed that the rank-based strategy frequently obtains a bias-free anchor set. The most striking finding of the present study was perhaps the low percentage of bias-free anchor sets obtained with the iterative procedure. A possible explanation is that the iterative procedure allows for longer anchor sets, resulting in a higher risk of a contaminated anchor set. Shorter anchor sets generally display a lower degree of contamination (Kopf et al., 2015). As the iterative procedure can obtain longer anchor sets than the rank-based strategy, this might explain why it less often selects anchor sets that are bias-free. Furthermore, the iterative procedure starts with the assumption that all items can be used as anchors, and iteratively removes items that show evidence against this assumption. As expected, this manner of selecting anchor items performed noticeably worse in small-sample conditions.

In Study 2, we compared the LMS and PI methods to model latent interactions in RFA models. By means of a Monte Carlo simulation, the two detection methods were compared with respect to their power and familywise Type I error. The main conclusion is that the LMS and PI methods performed about equally well. Across all conditions, the PI method had a higher power to detect measurement bias than LMS, but differences were relatively small. Perhaps most remarkable were the familywise Type I error rates yielded by the two detection methods. Although previous studies observed Type I error rates that were seriously inflated (Barendse et al., 2010, 2012; Finch, 2005; Stark et al., 2006; Woods & Grimm, 2011), the error rates observed in this study were not as concerning. None were significantly higher than the nominal level of significance.

A possible explanation for these findings may be found in the use of anchor items. The majority of the studies that observed severely inflated Type I error rates used all items other than the studied item as anchors (Barendse et al., 2010, 2012; Finch, 2005). This strategy leads to a contaminated subset of anchor items when some items other than the studied item are biased. In turn, this can cause problems such as an overestimation of the amount of measurement bias in the test data (W.-C. Wang, 2004). Hence, using all-others-as-anchors may account for the frequently observed inflated Type I error rates. Results of this simulation study complement this line of thought, as it involved using only 20% of all items as anchor items and did not observe severely inflated Type I error rates. However, using a subset of items as anchors might not fully resolve the problems encountered when testing measurement bias using RFA models. Although Woods and Grimm (2011) selected a small subset of anchor items, seriously inflated Type I error rates were still observed. The difference with the present investigation is that their study considered categorical indicators of the common factor instead of continuous indicators. Because the Type I error rates observed in this study are not significantly larger than 5%, these findings might imply that LMS only has inflated error rates with categorical indicators. Given the prevalence of ordinal (e.g., Likert-type) indicators with few categories, rather than several categories that could be considered approximately continuous (Rhemtulla, Brosseau-Liard, & Savalei, 2012), future research is needed to investigate whether error rates are significantly more inflated with fewer categories. Woods and Grimm (2011) studied both binary and 5-category ordinal indicators, but they treated the 5-category indicators as ordinal using expectation maximization (EM MML) estimation. Perhaps error rates would have been lower if the 5-category items had been treated as (approximately) continuous, which Rhemtulla et al. (2012) showed results in negligible bias and near-nominal error rates when using normal-theory ML estimation with a robust correction for nonnormality.

Corresponding to findings of previous studies (Barendse et al., 2010, 2012), we found that nonuniform bias was more difficult to detect than uniform bias. Power to detect nonuniformly biased items was especially low in conditions with a small sample size. This finding is concerning to some extent, because the present investigation involved the best-case scenario of a bias-free anchor set. As opposed to simulation studies where the true bias of items is known, in practice there is most often no reliable prior knowledge about bias in the items of a scale. The analysis of Study 1 evaluated what researchers could do when faced with this lack of prior knowledge, and results showed that for a substantial proportion of replications the anchor set contained biased items.

The lower power encountered in this study might be due to the relatively short length of the anchor set. The power to detect measurement bias has been shown to increase with a larger anchor set (Kopf et al., 2015; Woods, 2009). Moreover, the consequences of a contaminated anchor set seem to depend on the proportion of biased items in the anchor set rather than on the risk of contamination itself (Kopf et al., 2015). This might indicate that the mean purity serves as a better indicator of Type I error rates. Because mean purity did not differ substantially between the two anchor-selection strategies, the iterative procedure might be preferable, because it allows for longer anchor sets. However, the iterative procedure may only have adequate power to remove biased items with sufficiently large sample sizes, in which case a multiple-group approach might be preferable because it requires fewer homogeneity assumptions. The present study is limited in the sense that the anchor-selection strategies were not evaluated with respect to their impact on testing measurement bias using the empirically selected anchor items. Future research is needed to carry out a more extensive investigation of the consequences of different anchor-selection strategies on power and Type I error in the context of RFA.

A limitation of the LMS method is the large proportion of replications with invalid results. In Study 1 and Study 2, convergence problems occurred in 33.60% and 33.80%, respectively, of all analyses concerning LMS. The analyses of the LMS method were not included for these replications because, in practice, a researcher would be unable to make a decision with this method. The PI method did not result in any convergence problems. All replications could thus be included to determine its power and Type I error. Consequently, the analysis of the PI method was based on more replications than the analysis of the LMS method. Hence, the comparison between the LMS and PI methods might have been invalid to a certain degree. The aspects on which the replications with convergence problems differed from other replications could be further investigated. In any case, the convergence problems point to an important practical limitation of the LMS method.

This study showed that the PI method to model latent interactions in RFA models performs at least as well as the LMS method for the purpose of testing measurement bias. Because LMS is only implemented in a single SEM computer program (i.e., Mplus; L. K. Muthén & Muthén, 2012), knowing that PI works at least as well as LMS provides more researchers the opportunity to test for nonuniform bias using any SEM software package. However, we want to emphasize that research on the use of PI in RFA models to test for measurement bias is still in its infancy. Several aspects of its usage are yet unclear, for example, which items should serve as product indicators for the interaction factor. There are various possibilities regarding the formation of product indicators, among others using only the studied item, only the anchor items, the anchor items and the studied item, or all items. Although this study showed promising results, more research is necessary to determine the optimal use of PI in RFA models to test for measurement bias.


References

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52, 119–126.

Algina, J., & Moulder, B. C. (2001). A note on estimating the Jöreskog-Yang model for latent variable interaction using LISREL 8.3. Structural Equation Modeling: A Multidisciplinary Journal, 8, 40–52.

Barendse, M. T., Oort, F. J., & Garst, G. J. A. (2010). Using restricted factor analysis with latent moderated structures to detect uniform and nonuniform measurement bias; a simulation study. Advances in Statistical Analysis, 94, 117–127.

Barendse, M. T., Oort, F. J., Werner, C. S., Ligtvoet, R., & Schermelleh-Engel, K. (2012). Measurement bias detection through factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 19, 561–579.

Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.

Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1– 27.

Cohen, A. S., Kim, S.-H., & Wollack, J. A. (1996). An investigation of the likelihood ratio test for detection of differential item functioning. Applied Psychological Measurement, 20, 15–26.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1–38.

Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278–295.

Henseler, J., & Chin, W. W. (2010). A comparison of approaches for the analysis of interaction effects between latent variables using partial least squares path modeling. Structural Equation Modeling: A Multidisciplinary Journal, 17, 82–109.

Hidalgo-Montesinos, M. D., & Lopez-Pina, J. A. (2002). Two-stage equating in differential item functioning detection under the graded response model with the Raju area measures and the Lord statistic. Educational and Psychological Measurement, 62, 32–44.

Kenny, D. A., & Judd, C. M. (1984). Estimating the nonlinear and interactive effects of latent variables. Psychological Bulletin, 96, 201–210.

Kim, S.-H., & Cohen, A. S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22, 345–355.

Klein, A., & Moosbrugger, H. (2000). Maximum likelihood estimation of latent interaction effects with the LMS method. Psychometrika, 65, 457–474.

Kopf, J., Zeileis, A., & Strobl, C. (2015). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39, 83–103.

Lin, G.-C., Wen, Z., Marsh, H. W., & Lin, H.-S. (2010). Structural equation models of latent interactions: Clarification of orthogonalizing and double-mean-centering strategies. Structural Equation Modeling: A Multidisciplinary Journal, 17, 374– 391.


Little, T. D., Bovaird, J. A., & Widaman, K. F. (2006). On the merits of orthogonalizing powered and product terms: Implications for modeling interactions among latent variables. Structural Equation Modeling: A Multidisciplinary Journal, 13, 497– 519.

Marsh, H. W. (1994). Confirmatory factor analysis models of factorial invariance: A multifaceted approach. Structural Equation Modeling: A Multidisciplinary Journal, 1, 5–34.

Marsh, H. W., Wen, Z., & Hau, K.-T. (2004). Structural equation models of latent interactions: Evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9, 275–300.

Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143.

Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3(1), 111–130.

Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54, 557–585.

Muthén, L. K., & Muthén, B. O. (2012). Mplus user's guide (Version 7). Los Angeles, CA: Muthén & Muthén.

Oort, F. J. (1992). Using restricted factor analysis to detect item bias. Methodika, 6, 150–160.

Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 5, 107–124.

R Core Team. (2016). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/

Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373.

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. Retrieved from http://www.jstatsoft.org/v48/i02/

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306.

Wang, M., & Woods, C. M. (2017). Anchor selection using the Wald test anchor-all-test-all procedure. Applied Psychological Measurement, 41, 17–29.

Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education, 72, 221–261.

Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33, 42–57.

Woods, C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35, 339–361.
