Measurement bias detection through factor analysis: a simulation study of interaction effects in restricted factor analysis


M. T. Barendse

Supervisors:

University of Amsterdam: Dr. F. J. Oort, Dr. R. Ligtvoet

Goethe University (Frankfurt): Dipl.-Psych. C. S. Werner, Prof. dr. K. Schermelleh-Engel


Preface

This study is about the detection of uniform and nonuniform bias in a simulation. The first ideas for this thesis arose during the Working Group Structural Equation Modeling meeting in Berlin (February 2009). The study is a cooperation between the University of Amsterdam and the Goethe University in Frankfurt. The actual study started in May 2009. As the detection of nonuniform bias entails the estimation of an interaction effect, I was very enthusiastic about the participation of the Goethe University, because Christina Werner and Karin Schermelleh-Engel are both experts on nonlinear interactions. On behalf of the University of Amsterdam, Frans Oort, an expert in bias detection, was involved during the entire process of this study, and Rudy Ligtvoet cooperated on the technical part of the study from January 2010.

This thesis has the format of an article. All tables that did not fit into the article and the M-plus scripts for bias detection were added in the appendix. I was advised and helped during every stage of this thesis:

• Writing a research proposal: Frans Oort, Christina Werner, and Karin Schermelleh-Engel

• Choosing simulation conditions: Frans Oort, Christina Werner, and Karin Schermelleh-Engel

• Conversion of population values (the dichotomous case population values converted to the continuous case, and the continuous case population values converted to the dichotomous case): Frans Oort

• R-scripts: Rudy Ligtvoet, Christina Werner, and Frans Oort

• Simulation study: Frans Oort and Rudy Ligtvoet

• Writing the thesis: Frans Oort and Rudy Ligtvoet

In June 2009 I visited the Goethe University (Frankfurt am Main; Institut für Psychologie; Psychologische Methodenlehre, Evaluation und Forschungsmethoden) to work on the thesis. This was a nice and instructive experience with pleasant memories. The first results of this study were presented during the Working Group Structural Equation Modeling meeting in Utrecht (February 2010). In July 2010 all the results were presented during the European Congress of Methodology (EAM-SMABS; Potsdam).

I want to thank all my supervisors for their advice and helpfulness. I especially want to thank Frans Oort for his guidance, patience, and encouragement. Support came also from my family and friends.


Abstract

Measurement bias can be detected through factor analysis. As an alternative to the traditional multigroup factor analysis (MGFA) to investigate uniform and nonuniform measurement bias, restricted factor analysis (RFA), extended with either latent moderated structural equations (LMS) or a random slope parameterization (RSP), can be applied. In a simulation study, the performance of the three methods MGFA, RFA/LMS, and RFA/RSP to detect measurement bias was compared under various conditions: type of bias (none, uniform, nonuniform, or both), type of violator (dichotomous or continuous variable), and relationship between the trait and the “biased trait” or “violator” (independent or dependent). In addition, estimation bias was assessed for each of the three methods, and two different procedures for bias detection were considered (single run or iterative).

Results show that the RFA/LMS and RFA/RSP methods had a high detection rate in all conditions, but the MGFA method showed lower detection rates in conditions with only uniform or only nonuniform bias combined with a continuous violator. In conditions with a high power to detect bias, the single run procedure had a high proportion of false positives. Bias detection in the iterative procedure reduced the proportion of false positives to the nominal level of significance. The accuracy of the parameter estimates did not vary much across detection methods, but the parameter estimates were less efficient in the MGFA method.


1 Introduction

Measurement bias research examines whether different (groups of) respondents show different response behavior to test items. This is important to improve test validity and to establish fairness in behavioral and social science research, in which subjective measures are used to measure constructs such as abilities, emotions, attitudes, and personality traits. Measurement bias may contaminate research, because in the presence of measurement bias observed differences in item and test scores do not necessarily reflect true differences between respondents. Measurement bias can be explained by an example in which mathematical ability is measured with a worded mathematical problem. Mathematical ability and verbal ability are likely to be related. This is no problem if the relation between the item score of the mathematical problem and verbal ability is fully explained by mathematical ability. However, bias occurs when verbal ability directly affects the item score, in which case a respondent’s ability to solve the mathematical problem depends not only on his or her mathematical ability, but also on his or her verbal ability. In the latter case, verbal ability is the variable with respect to which the test item is biased.

Unbiased means that the measures do not systematically depend on anything but the trait of interest (e.g., mathematical ability). Let X denote a set of observed variables (item scores) intended to measure trait value T, and V a set of potential violator variables (e.g., verbal ability). Unbiasedness can be formally defined as conditional independence (cf. Mellenbergh, 1989):

f1(X|T = t, V = v) = f2(X|T = t) (1)

Function f1 is the conditional distribution function of X given values T = t and V = v, and f2 is the conditional distribution function of X given T = t. Conditional independence holds if function f1 is equal to function f2. If conditional independence does not hold (i.e., f1 ≠ f2), the measurement of T by X is said to be biased with respect to V. In other words, respondents with equal standings on T have unequal expected values on X because V systematically affects response variables X. The conditional independence definition of Mellenbergh (1989) is very general: the variables X, T, and V can be measured on the nominal, ordinal, interval, or ratio level, they can be latent or manifest, and their relationships may be linear or nonlinear.

Oort (1991) demonstrated that measurement issues can be subsumed under violations of conditional independence. The formal definition of measurement bias can be applied to a linear model. The item score variable X is then described by

X = u + aT + bV + cT V + dE, (2)

where

T is the latent trait,

V is a potential violator,

E is the residual or error,

u is a vector of intercepts,

a is a vector of common factor loadings,

b and c are vectors of regression coefficients for V and for the product TV, and

d is a vector of residual factor loadings.

If there is bias, variable V explains variance in measurement X in addition to what is already explained by the concept of interest T. Thus, measurement X is biased with respect to variable V. Factor loading b indicates uniform bias, which is equally present for all levels of variable T. In the example mentioned above, this means that the worded mathematical problem measures verbal ability similarly across all levels of mathematical ability. If there is also an interaction effect of variable T and variable V on item score X, indicating nonuniform bias and represented by factor loading c, the extent of bias varies with levels of variable T. In the example of measurement bias, nonuniform bias is present if the effect of verbal ability on the item score varies with different levels of mathematical ability.

With group membership as a dichotomous violator (e.g., gender) we can use a binary coded dummy variable for V (V = 0 for Group 1 and V = 1 for Group 2) that shows the relationship between an approach for two groups of respondents with a dichotomous violator and a single group approach with a continuous violator. With a continuous violator the interaction is a function of random variables T and V . If we substitute V = 0 into Equation 2 we get

X = u + aT + dE, (3)

and if we substitute V = 1 we get

X = u + aT + b + cT + dE (4)

= (u + b) + (a + c)T + dE (5)

For Group 1, u is the intercept and a is the factor loading, whereas for Group 2 the intercept is (u + b) and the factor loading is (a + c). With a dichotomous violator, bias means that the factor loadings (i.e., nonuniform bias) and/or intercepts (i.e., uniform bias) are not equal across groups.
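The substitution in Equations 3 to 5 can be illustrated with a minimal Python sketch. The function name `group_parameters` and the numerical values are assumptions chosen for illustration only; they are not part of the study's scripts.

```python
def group_parameters(u, a, b, c, v):
    """Return (intercept, loading) of X on T for a fixed violator value v.

    Holding V constant at v in Equation 2 gives
    X = (u + b*v) + (a + c*v)*T + d*E.
    """
    return u + b * v, a + c * v


# Illustrative values only (assumed here for the example).
u, a, b, c = 0.0, 0.9, 0.3, 0.3

group1 = group_parameters(u, a, b, c, v=0)  # Group 1: V = 0
group2 = group_parameters(u, a, b, c, v=1)  # Group 2: V = 1

print("Group 1 intercept and loading:", group1)
print("Group 2 intercept and loading:", group2)
```

With V = 0 the intercept and loading reduce to u and a; with V = 1 they become u + b and a + c, which is the across-group difference that a multigroup analysis tests.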

The present study will investigate measurement bias detection by structural equation modelling (SEM) with latent variables. Most typically, the concept of interest T is then operationalized as a (latent) common factor with multiple measures X as (observed) indicators. Multigroup factor analysis (MGFA), with structured means as first described by Sörbom (1974), is a well-known bias detection method. Meredith (1993) defined weak measurement invariance and factorial invariance across populations defined by V in the MGFA method. Manifestations of bias are investigated by testing across-group constraints on intercepts to detect uniform bias and on factor loadings to detect nonuniform bias, as reviewed by Vandenberg and Lance (2000), and Schmitt and Kuljanin (2008).

As an alternative bias detection method, Oort (1992, 1998) introduced restricted factor analysis (RFA). Uniform measurement bias is indicated by direct effects of the V variable on the X variables. The RFA bias detection method is equivalent to the multiple indicator multiple cause (MIMIC) analysis; however, in MIMIC models the V variables have causal effects on the T variables (Muthén, 1989). The detection of nonuniform bias in the RFA method requires the estimation of the interaction term (c). The interaction effect on item scores implies that the item scores are not normally distributed. With the new possibilities of estimating interaction effects in structural equation models, the RFA method can be extended to detect nonuniform bias as well (Barendse, Oort, & Garst, 2010).


Product indicator approaches and distribution-analytic approaches are the two main approaches for the analysis of interaction effects, as described by Moosbrugger, Schermelleh-Engel, Kelava, and Klein (2009), and Schermelleh-Engel, Werner, Klein, and Moosbrugger (2010). The product indicator approach requires a measurement model for the nonlinear products of the observed variables. Based on Kenny and Judd (1984), other product indicator approaches were developed (see Schumacker & Marcoulides, 1998). The distribution-analytic approaches are recent approaches that estimate the interaction by taking the non-normal distribution into account (Moosbrugger, Schermelleh-Engel, Kelava, & Klein, 2009). The latent moderated structures (LMS) approach (Klein & Moosbrugger, 2000) and the quasi-maximum likelihood approach (Klein & Muthén, 2007) have been proposed as distribution-analytic approaches. In addition to these approaches, a random slope parameterization, as suggested by Muthén and Asparouhov (2003), can also be considered a distribution-analytic approach.

Earlier research showed that the well-established MGFA method can be used to detect uniform and nonuniform bias, although uniform bias was easier to detect than nonuniform bias (Barendse, Oort, & Garst, 2010; González-Romá, Hernández, & Gómez-Benito, 2006; Hernández & González-Romá, 2003; Meade & Lautenschlager, 2004). RFA and MIMIC methods proved to be effective in detecting uniform bias in dichotomous and continuous item responses (Gomez & Navas, 2002; Oort, 1998; Woods, 2009). Only two studies investigated the RFA/MIMIC method to detect uniform and nonuniform bias with a dichotomous violator (Barendse, Oort, & Garst, 2010; Woods & Grimm, 2010). Barendse, Oort, and Garst (2010) showed that the detection rate in the RFA/LMS method was about equal to the detection rate in the MGFA method. The proportions of true positives were lower in conditions where a small sample size was combined with small nonuniform bias. Between-group differences in the ability distribution did not systematically affect the proportions of true positives. Woods and Grimm (2010) applied the MIMIC/LMS method to detect uniform and nonuniform bias. An important contrast with the present study is that Woods and Grimm (2010) used different item response formats (five-point or dichotomous scales) and used one reference item in detecting bias. Nonuniform bias turned out to be harder to detect than uniform bias. A concern for the above studies in detecting bias was that the actual proportions of false positives (Type I errors) in specific conditions were higher than the nominal level of significance.

When investigating bias, an advantage of the RFA (or MIMIC) method over the MGFA method is that in RFA the violator can be continuous or discrete, observed or latent. With a continuous violator in the MGFA method, it is necessary to divide a sample into sub-samples by the violator (e.g., median split or theoretically based split). As a result, groups in the MGFA method become smaller and analyses may yield less precise parameter estimates. MacCallum, Zhang, Preacher, and Rucker (2002) describe the negative consequences of categorizing variables: loss of information, loss of effect size and power, or spurious statistical significance and overestimation of effect size. Consequently, the RFA method is believed to have more statistical power to detect measurement bias with a continuous violator. Another advantage of the RFA method over the MGFA method is the possibility to investigate bias with respect to multiple violator variables simultaneously.

As the RFA method offers many potential advantages for bias detection, we will further investigate this method under various conditions. The purpose of the present paper is to investigate to what extent the MGFA, RFA/LMS, and RFA/RSP methods correctly detect uniform and nonuniform measurement bias with respect to dichotomous and continuous violating variables. To our knowledge, this is the first study that uses the RSP method to detect nonuniform bias and that detects bias with respect to a continuous violator.

2 Method

In simulated data, measurement bias was assessed by the three methods MGFA, RFA/LMS, and RFA/RSP, which are described in more detail below. Factors that were varied in generating the data are: type of bias (no bias, uniform bias, nonuniform bias, both uniform and nonuniform bias), type of violator (continuous or dichotomous violator), and relationship between the trait and the violator (independent or dependent). In a fully crossed design, these three data generation factors yield 4 × 2 × 2 = 16 different conditions. For each condition, 500 data sets were generated. Each data set was analyzed with each of the three methods (MGFA, RFA/LMS, and RFA/RSP), with two different procedures (single run or iterative). The performance of each method in detecting measurement bias is evaluated by the proportions of true and false positives, and the estimation bias.

2.1 Data generation

Each data set consists of scores of 200 subjects on 6 items with continuous response scales. Scores were generated on the basis of a linear model, using the computer program R (R Development Core Team, 2010). The violator could be either dichotomous or continuous.

2.1.1 Data generated with a continuous violator

With a continuous violator, the item scores x of subject j are a linear function of T, V, and E scores:

x_j = a t_j + b v_j + c t_j v_j + d e_j, (6)

where

t_j is the subject’s score on the common trait,

v_j is the subject’s score on a potential violator,

e_j is the subject’s score on the residual or error,

u is a vector of intercepts (zero in all conditions with a continuous violator; see the notes to Table 1),

a is a vector of common factor loadings,

d is a vector of residual factor loadings, and

b and c are vectors containing the regression coefficients.

Subject parameters t_j, e_j, and v_j were drawn from a standard normal distribution with cor(t, e) = 0 and cor(v, e) = 0. The standardized and unstandardized population values for the eight data conditions generated with a continuous violator are given in Table 1. In all data generation conditions (1 to 8) the residual factor loading (d) and the common factor loading (a) were chosen to equal .90.

Table 1. Standardized and Unstandardized Population Values with respect to a Continuous Violator.

| Condition | Nr. | a | b | c | d | cor(T, V) | mean(T) | var(T) | mean(TV) | var(TV) | mean(X) | var(X) | a* | b* | c* | d* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Indep. T V, No Bias | 1 | .900 | .000 | .000 | .900 | .000 | .000 | 1.000 | .000 | 1.000 | .000 | 1.620 | .707 | .000 | .000 | .707 |
| Indep. T V, Uniform | 2 | .900 | .300 | .000 | .900 | .000 | .000 | 1.000 | .000 | 1.000 | .000 | 1.710 | .688 | .229 | .000 | .688 |
| Indep. T V, Nonuniform | 3 | .900 | .000 | .300 | .900 | .000 | .000 | 1.000 | .000 | 1.000 | .000 | 1.710 | .688 | .000 | .229 | .688 |
| Indep. T V, Both | 4 | .900 | .300 | .300 | .900 | .000 | .000 | 1.000 | .000 | 1.000 | .000 | 1.800 | .671 | .224 | .224 | .671 |
| Dep. T V, No Bias | 5 | .900 | .000 | .000 | .900 | .500 | .000 | 1.000 | .500 | 1.250 | .000 | 1.620 | .707 | .000 | .000 | .707 |
| Dep. T V, Uniform | 6 | .900 | .300 | .000 | .900 | .500 | .000 | 1.000 | .500 | 1.250 | .000 | 1.980 | .640 | .213 | .000 | .640 |
| Dep. T V, Nonuniform | 7 | .900 | .000 | .300 | .900 | .500 | .000 | 1.000 | .500 | 1.250 | .150 | 1.733 | .684 | .000 | .213 | .684 |
| Dep. T V, Both | 8 | .900 | .300 | .300 | .900 | .500 | .000 | 1.000 | .500 | 1.250 | .150 | 2.093 | .622 | .207 | .232 | .622 |

Notes: u = mean(V) = mean(E) = 0, var(V) = var(E) = 1; asterisks denote the standardized values.

2.1.2 Measurement bias with respect to the continuous violator

For the continuous violator, bias in the first item was introduced under Equation 6 by choosing non-zero values for b to indicate uniform bias and non-zero values for c to indicate nonuniform bias. Conditions 2 through 4 and 6 through 8 in Table 1 contained bias. In conditions with bias, the unstandardized indicators of uniform bias (b) and nonuniform bias (c) were chosen to equal .30, which implies that the effects of V and TV on X are one third of the effect of T. With no bias, the common variance and the residual variance are equally large. The standardized effects of V and TV indicate that the size of bias is between .10 and .30, which corresponds to effect sizes between small and medium (Cohen, 1988). To give an indication of the effect sizes: if both uniform and nonuniform bias are present and T and V are independent (Condition 4), the standardized proportions of item variance correspond to 45% variance of the trait, 5% uniform bias, 5% nonuniform bias, and 45% residual variance.
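The variance proportions quoted for the condition with both types of bias can be checked with a short Python sketch. It assumes independent standard normal T, V, and E, so that var(TV) = 1 and the four terms of Equation 6 are mutually uncorrelated; the function name `variance_shares` is a hypothetical helper, not part of the original R scripts.

```python
def variance_shares(a, b, c, d):
    """Item variance and its proportions for independent standard normal T, V, E.

    Under independence var(TV) = 1 and T, V, TV, and E are mutually
    uncorrelated, so var(X) = a^2 + b^2 + c^2 + d^2.
    """
    parts = {
        "trait": a ** 2,
        "uniform bias": b ** 2,
        "nonuniform bias": c ** 2,
        "residual": d ** 2,
    }
    total = sum(parts.values())
    shares = {name: value / total for name, value in parts.items()}
    return total, shares


# Unstandardized values of the condition with both uniform and nonuniform bias.
total, shares = variance_shares(a=0.9, b=0.3, c=0.3, d=0.9)

print("var(X):", round(total, 3))
for name, share in shares.items():
    print(name, round(share, 2))
```

The square roots of these proportions correspond to the standardized values a*, b*, c*, and d* reported for the same condition in Table 1.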

2.1.3 Dependency between trait and violator with respect to the continuous violator

The correlation between the trait and the violator with respect to a continuous violator was introduced by drawing t_j and v_j from a bivariate normal distribution with cor(T, V) = .50. In Table 1, conditions 5 through 8 are equal to conditions 1 through 4, except for the dependency between the trait of interest and the violator. A correlation of .50 is considered to represent a large effect size (Cohen, 1988). As a result of the correlation, the proportions of variance in the standardized part of the table (rows 5 to 8) do not add up to 1, because part of the variance is now explained by the correlation between T and V. Because the residual variance stays the same in conditions with uniform and/or nonuniform bias, the standardized effect of the factor loading and the indicators of bias will not increase in conditions with a correlation between T and V. In other words, the residual variance can be interpreted as the variance that is not influenced by measurement bias.
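The data generation for a continuous violator can be sketched as follows. The study itself used R; this Python version is only an illustration under stated assumptions: the function name `generate_items`, the seed handling, and the construction of v from t and an independent normal draw (one standard way to obtain a bivariate normal pair with a given correlation) are choices made here, not taken from the original scripts.

```python
import math
import random


def generate_items(n, a, b, c, d, rho, seed=None):
    """Generate n subjects with item scores following Equation 6.

    a, b, c, d are lists with one entry per item; rho is cor(T, V).
    Returns a list of (t, v, item_scores) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        # v has unit variance and correlates rho with t.
        v = rho * t + math.sqrt(1.0 - rho ** 2) * z
        scores = [
            a[i] * t + b[i] * v + c[i] * t * v + d[i] * rng.gauss(0.0, 1.0)
            for i in range(len(a))
        ]
        rows.append((t, v, scores))
    return rows


# Six items; only the first item is biased (both uniform and nonuniform bias).
a = [0.9] * 6
d = [0.9] * 6
b = [0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
c = [0.3, 0.0, 0.0, 0.0, 0.0, 0.0]

rows = generate_items(200, a, b, c, d, rho=0.5, seed=1)
print(len(rows), "subjects,", len(rows[0][2]), "items")
```

Setting `rho=0.0` gives the conditions in which trait and violator are independent.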

2.1.4 Data generated with respect to the dichotomous violator

In data generated with a dichotomous violator, the item score of subject j in group g was prescribed by

x_j = u_g + a_g t_j + d_g e_j, (7)

where

u_g is a vector of intercepts,

a_g is a vector of common factor loadings, and

d_g is a vector of residual factor loadings.

In considering a dichotomous violator in Equation 7, the violator values for the two groups of subjects were conveniently chosen as v_{g=1} = −1 and v_{g=2} = 1. The subject parameters t_j and e_j were sampled, and the residual factor loading (d) and the factor loading (a) were chosen, as in the case of a continuous violator, with cor(T, E) = 0. The standardized and unstandardized population values for the eight data conditions generated with a dichotomous violator are given in Table 2.

Table 2. Standardized and Unstandardized Population Values with respect to a Dichotomous Violator.

| Condition | Nr. | Group | u | a | d | mean(T) | var(T) | mean(X) | var(X) | u* | a* | d* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Indep. T V, No Bias | 1. | 1 | .000 | .900 | .900 | .000 | 1.000 | .000 | 1.620 | .000 | .707 | .707 |
| | | 2 | .000 | .900 | .900 | .000 | 1.000 | .000 | 1.620 | .000 | .707 | .707 |
| Indep. T V, Uniform | 2. | 1 | -.300 | .900 | .900 | .000 | 1.000 | -.300 | 1.620 | -.236 | .707 | .707 |
| | | 2 | .300 | .900 | .900 | .000 | 1.000 | .300 | 1.620 | .236 | .707 | .707 |
| Indep. T V, Nonuniform | 3. | 1 | .000 | .600 | .900 | .000 | 1.000 | .000 | 1.170 | .000 | .555 | .832 |
| | | 2 | .000 | 1.200 | .900 | .000 | 1.000 | .000 | 2.250 | .000 | .800 | .600 |
| Indep. T V, Both | 4. | 1 | -.300 | .600 | .900 | .000 | 1.000 | -.300 | 1.170 | -.277 | .555 | .832 |
| | | 2 | .300 | 1.200 | .900 | .000 | 1.000 | .300 | 2.250 | .200 | .800 | .600 |
| Dep. T V, No Bias | 5. | 1 | .000 | .900 | .900 | -.400 | 1.000 | -.360 | 1.620 | .000 | .707 | .707 |
| | | 2 | .000 | .900 | .900 | .400 | 1.000 | .360 | 1.620 | .000 | .707 | .707 |
| Dep. T V, Uniform | 6. | 1 | -.300 | .900 | .900 | -.400 | 1.000 | -.660 | 1.620 | -.236 | .707 | .707 |
| | | 2 | .300 | .900 | .900 | .400 | 1.000 | .660 | 1.620 | .236 | .707 | .707 |
| Dep. T V, Nonuniform | 7. | 1 | .000 | .600 | .900 | -.400 | 1.000 | -.240 | 1.170 | .000 | .555 | .832 |
| | | 2 | .000 | 1.200 | .900 | .400 | 1.000 | .480 | 2.250 | .000 | .800 | .600 |
| Dep. T V, Both | 8. | 1 | -.300 | .600 | .900 | -.400 | 1.000 | -.540 | 1.170 | -.277 | .555 | .832 |
| | | 2 | .300 | 1.200 | .900 | .400 | 1.000 | .780 | 2.250 | .200 | .800 | .600 |

Notes: mean(E) = 0, var(E) = 1; asterisks denote the standardized values.

2.1.5 Measurement bias with respect to the dichotomous violator

Bias in the first item was introduced under Equation 7 by choosing different values across the two groups of subjects for the intercepts u (to indicate uniform bias) and for the factor loadings a (to indicate nonuniform bias). All group differences caused by measurement bias favored Group 2. Conditions 2 through 4 of Table 2 contained bias. For the first item the intercept was chosen to equal 0 for both groups in conditions with no bias, or -.30 for Group 1 and .30 for Group 2 in conditions with uniform bias. In conditions with no bias the residuals explain an equal amount of variance. All common factor loadings a were chosen to equal .90, except for the factor loading of the first item in conditions with nonuniform bias. Here, the factor loadings a were chosen to equal .60 in Group 1 and 1.20 in Group 2. Due to nonuniform bias the total variance in Group 2 is larger than the total variance in Group 1. Uniform bias results in a difference of .37 between the standardized intercepts. This difference can be interpreted as a small to medium effect size (Cohen, 1988), which is similar to the effect of uniform bias in conditions with a continuous violator. In conditions with nonuniform bias the difference in proportions of explained variance is .34, which is rather large.
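The group-specific population values implied by Equation 7 can be sketched in Python. This is a minimal illustration, assuming that T and E are uncorrelated with var(E) = 1; the helper name `group_moments` is hypothetical.

```python
import math


def group_moments(u, a, d, mean_t=0.0, var_t=1.0):
    """Population mean, variance, and standardized loadings of one item
    in one group under Equation 7, assuming cor(T, E) = 0 and var(E) = 1."""
    mean_x = u + a * mean_t
    var_x = a ** 2 * var_t + d ** 2
    sd_x = math.sqrt(var_x)
    return {
        "mean_x": mean_x,
        "var_x": var_x,
        "a_std": a * math.sqrt(var_t) / sd_x,
        "d_std": d / sd_x,
    }


# First item, both types of bias, trait means of -.40 (Group 1) and .40 (Group 2).
group1 = group_moments(u=-0.3, a=0.6, d=0.9, mean_t=-0.4)
group2 = group_moments(u=0.3, a=1.2, d=0.9, mean_t=0.4)

for label, values in (("Group 1", group1), ("Group 2", group2)):
    print(label, {name: round(value, 3) for name, value in values.items()})
```

The resulting means, variances, and standardized loadings can be compared with the corresponding rows of Table 2.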

2.1.6 Dependency between trait and violator with respect to the dichotomous violator

For the dichotomous violator in Equation 7, dependency between the trait and the violator was introduced by choosing different means for the trait for both groups. In Table 2, conditions 5 through 8 are equal to conditions 1 through 4, except for the dependency between the trait of interest and the violator. For Group 1, the subjects’ trait values t_j were drawn from a normal distribution with mean -.40 and variance 1.00. For Group 2, the subjects’ trait values were drawn from a normal distribution with mean .40 and variance 1.00. This resulted in a mean group difference of .80. Similar to the correlation used with a continuous violator, the mean difference of .80 is considered large (Cohen, 1988).

2.2 Measurement bias detection

In a simulation study, the methods MGFA, RFA/LMS, and RFA/RSP were compared in detecting measurement bias.

2.2.1 MGFA method

The MGFA method sorts data based on a grouping variable (Equation 7). If the violator variable is a true dichotomy, the data analysis is straightforward. However, in data generated with a continuous violator we dichotomized V at the median. When V is dichotomized at its median in conditions with a correlation between T and V, the group means for T would be -.40 for Group 1 and .40 for Group 2. In order to enable the estimation of the model parameters through MGFA, a one-factor model is fitted to the separate variance-covariance matrices and mean vectors of both groups. The common factor mean and the common factor variance are fixed for Group 1 and free to be estimated in Group 2. The maximum likelihood estimation method is used to estimate all model parameters.
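The group means of about ±.40 after a median split can be reproduced with a small Python sketch. It assumes a standard bivariate normal T and V with correlation ρ, in which case E(T | V > median) = ρ·√(2/π). The function names are hypothetical and the simulation only illustrates the analytic value.

```python
import math
import random
import statistics


def expected_split_mean(rho):
    """E(T | V above its median) for standard bivariate normal T and V."""
    return rho * math.sqrt(2.0 / math.pi)


def simulated_split_means(n, rho, seed=None):
    """Mean of T in the lower and upper half of a median split on V."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        v = rho * t + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((t, v))
    median_v = statistics.median(v for _, v in pairs)
    low = [t for t, v in pairs if v <= median_v]
    high = [t for t, v in pairs if v > median_v]
    return statistics.fmean(low), statistics.fmean(high)


print("analytic upper-group mean:", round(expected_split_mean(0.5), 3))
print("simulated group means:", simulated_split_means(20000, 0.5, seed=3))
```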


Measurement bias in a specific item is detected by comparing the chi-square fit value of a null model, in which the intercepts and factor loadings are constrained to be equal across groups, with the chi-square fit value of an alternative model in which the factor loading and the intercept of the item tested for bias are free to be estimated in both groups. An across-group difference in intercepts indicates uniform bias (u_1 ≠ u_2) and an across-group difference in factor loadings indicates nonuniform bias (a_1 ≠ a_2). By using a global χ² test (with df = 2) to detect uniform and nonuniform bias, we avoid the influence of arbitrary scaling choices that may affect test results, the so-called constrained interaction (King-Kallimanis, Oort, & Garst, 2010).
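The global test with df = 2 can be illustrated with a short Python sketch. It relies on the fact that the upper tail probability of a chi-square distribution with 2 degrees of freedom is exp(−x/2). The function names and the fit values in the usage lines are assumptions for illustration, and the sketch applies to an ordinary (unscaled) chi-square difference; a scaling-corrected difference test would need an additional correction step that is not shown here.

```python
import math


def chi2_df2_p_value(statistic):
    """Upper tail probability of a chi-square variate with df = 2."""
    return math.exp(-max(statistic, 0.0) / 2.0)


def global_bias_test(chisq_null, chisq_alt, alphas=(0.05, 0.01, 0.001)):
    """Chi-square difference test (df = 2) of the null model against the
    alternative model with one item's intercept and loading set free."""
    difference = chisq_null - chisq_alt
    p_value = chi2_df2_p_value(difference)
    flags = {alpha: p_value < alpha for alpha in alphas}
    return difference, p_value, flags


# Hypothetical fit values, for illustration only.
difference, p_value, flags = global_bias_test(chisq_null=41.3, chisq_alt=30.1)
print(round(difference, 1), round(p_value, 4), flags)
```

In this illustration the item would be flagged at the 5% and 1% levels but not at the 0.1% level.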

2.2.2 RFA/LMS method

Application of the RFA method with continuous violator scores is straightforward. With data generated with a dichotomous violator the data of Group 1 and Group 2 were stacked. Equation 6 can be applied in both cases.

To estimate the interaction effect that indicates nonuniform bias in the RFA method, the LMS method was applied. The interaction effect of two normally distributed variables implies that the item scores are not normally distributed. Klein and Moosbrugger (2000) solved this problem by regarding the distribution of the item scores as a mixture of multiple conditional distributions. When conditioning on V, these conditional distributions of the item are normal. However, the mixture of multiple conditional distributions on V is not normally distributed. LMS implements full-information maximum likelihood estimation with the Expectation-Maximization (EM) algorithm. The standard errors of the LMS estimators are calculated from the Fisher information. A detailed technical description of the LMS method is given by Klein and Moosbrugger (2000) and Schermelleh-Engel, Klein, and Moosbrugger (1998). The LMS procedure proved to be consistent, unbiased, and efficient (Klein & Moosbrugger, 2000).

In the RFA/LMS method measurement bias is detected by comparing the fit of a null model in which both b and c are zero vectors with the fit of an alternative model in which the b and c elements of the first item are set free to be estimated. The RFA/LMS method, as implemented in M-plus (Muthén & Muthén, 2001), utilizes robust maximum likelihood estimation with a scaling correction to account for the violation of distributional assumptions (Satorra & Bentler, 2001).

The LMS method as implemented in M-plus is suited for estimating interaction effects of latent variables only. In order to enable the estimation of the model parameters through RFA/LMS, the violator is modelled as a latent variable with a single observed indicator. The residual variance was fixed at .01 to overcome identification problems. The factor loadings were fixed at √(var(V) − var(E)) to help convergence and/or estimation. To make sure the null model with all b and c fixed at zero converged to an admissible solution, we compared this model to a null model without b and c. The parameters of the RFA/LMS model can be estimated with M-plus; see Appendix A for an example script.

2.2.3 RFA/RSP method

For analyzing data with the RFA/RSP method, we performed the same operation as for the RFA/LMS method. As described by Muthén and Asparouhov (2003), the interaction term in the linear model is rewritten as a random slope, similar to multilevel models. We therefore refer to this technique as random slope parameterization (RSP).


Equation 6 can be rewritten using two equations involving a random slope variable s_j:

x_j = a t_j + b v_j + s_j v_j + d e_j, (8)

s_j = 0 + c t_j + 0_j. (9)

A random slope s_j is defined for the observed subject covariate v_j; this slope is the same as the subject factor t_j, apart from the scaling factor c. In Equation 9, s_j is a latent subject variable representing a random slope; its intercept is zero and 0_j denotes a residual fixed at zero, so that substituting Equation 9 into Equation 8 reproduces Equation 6. As in missing data analysis, the EM algorithm has been used to obtain maximum likelihood estimates for a random slope (as described by Raudenbush & Bryk, 2002). Similar to the RFA/LMS method, M-plus (Muthén & Muthén, 2001) utilizes robust maximum likelihood estimation with a scaling correction to account for the violation of distributional assumptions (Satorra & Bentler, 2001). The violator in the RFA/RSP method is standardized to facilitate interpretation. To detect uniform and nonuniform bias, the RFA/RSP method also applied the scaling-corrected global chi-square test. The parameters of the RFA/RSP model can be estimated with M-plus; see Appendix A for an example script.
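The equivalence of the random slope form and the interaction form can be checked numerically with a minimal Python sketch. The function names and the numerical values are illustrative assumptions; the sketch only verifies the algebra and says nothing about how M-plus estimates the model.

```python
def item_score_interaction(a, b, c, d, t, v, e):
    """Item score with an explicit product term (interaction form)."""
    return a * t + b * v + c * t * v + d * e


def item_score_random_slope(a, b, c, d, t, v, e):
    """Item score with the interaction written as a random slope on v."""
    s = 0.0 + c * t + 0.0  # zero intercept and zero residual for the slope
    return a * t + b * v + s * v + d * e


# Arbitrary illustrative values for one subject.
values = dict(a=0.9, b=0.3, c=0.3, d=0.9, t=1.2, v=-0.7, e=0.4)

x_interaction = item_score_interaction(**values)
x_random_slope = item_score_random_slope(**values)
print(x_interaction, x_random_slope)
```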

2.3 Analysis

We used the computer program M-plus (version 5.2; Muthén & Muthén, 2001) to apply the methods MGFA, RFA/LMS, and RFA/RSP to each of the 8000 data sets (500 replications in each of the 16 conditions). Subsequently, the outcomes of bias detection, the different procedures for bias detection, and the outcomes of the parameter estimates were assessed.

2.3.1 True positives and false positives

Measurement bias will be tested at the 5%, 1%, and 0.1% levels of significance. For each level of significance, we will look at the proportions of true positives and false positives. A true positive is a biased item that was correctly detected as biased, and a false positive is an unbiased item that was incorrectly detected as biased.

2.3.2 Single run procedure and iterative procedure

Bias will be detected with two different procedures, either in a single run or iteratively. In a single run, we ran the detection procedures only once for every data set and counted true positives and false positives. However, to reduce false positives it is better to conduct bias detection iteratively (Navas-Ara & Gómez-Benito, 2002; Oort, 1998). In the iterative procedure, the item with the largest bias after the first bias detection run is accounted for in the model, and then the bias detection procedure is rerun. This is repeated in subsequent runs until none of the remaining items shows bias. However, we stopped the iterative procedure after three bias detection runs.
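The iterative procedure can be sketched in Python as a loop around a model-fitting step. The callback `chisq_diff` is hypothetical: it stands in for fitting the null and alternative models (for example in M-plus) and returning the df = 2 chi-square difference for one item, given the items whose bias has already been accounted for. The toy callback in the usage lines is invented purely to show the control flow.

```python
import math


def iterative_detection(items, chisq_diff, alpha=0.05, max_runs=3):
    """Return the items flagged as biased by an iterative procedure.

    chisq_diff(item, freed) is a hypothetical callback returning the
    chi-square difference (df = 2) for `item` when the bias parameters
    of the items in `freed` are already estimated in the model.
    """
    critical = -2.0 * math.log(alpha)  # chi-square critical value for df = 2
    freed = []
    for _ in range(max_runs):
        candidates = [item for item in items if item not in freed]
        if not candidates:
            break
        stats = {item: chisq_diff(item, tuple(freed)) for item in candidates}
        worst = max(stats, key=stats.get)
        if stats[worst] <= critical:
            break  # no remaining item shows significant bias
        freed.append(worst)  # account for the item with the largest bias
    return freed


def toy_chisq_diff(item, freed):
    """Invented values: item 1 is biased and inflates the other tests
    until its bias is accounted for."""
    if item == 1:
        return 25.0
    return 1.5 if 1 in freed else 8.0


flagged = iterative_detection(items=[1, 2, 3, 4, 5, 6], chisq_diff=toy_chisq_diff)
print(flagged)
```

In this toy setting a single run would flag all six items at the 5% level, whereas the iterative loop flags only the first item.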

2.3.3 Estimation bias

To examine the accuracy of the parameter estimates, we compared the parameter estimates of the model that reflects the data to the true population values. The standard deviations of the parameter estimates give an indication of the efficiency of the parameter estimates. The estimated standard errors can be evaluated for bias by comparing the observed standard deviation of a parameter estimate to the mean estimated standard error. If there is no substantial bias for the standard error, the parameter estimates can be used for calculating confidence intervals and for testing hypotheses about the parameters.
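These three summaries can be computed per parameter and condition as in the following Python sketch. The function name `summarize_estimates` and the three replications in the usage lines are made up for illustration; the study itself used 500 replications per condition.

```python
import statistics


def summarize_estimates(estimates, standard_errors, true_value):
    """Summarize the replications of one parameter in one condition."""
    mean_estimate = statistics.fmean(estimates)
    sd_estimate = statistics.stdev(estimates)  # efficiency of the estimator
    mean_se = statistics.fmean(standard_errors)
    return {
        "estimation_bias": mean_estimate - true_value,
        "sd_of_estimates": sd_estimate,
        "mean_standard_error": mean_se,
        # A ratio near 1 suggests the standard errors are not biased.
        "se_ratio": mean_se / sd_estimate,
    }


# Made-up replications of the uniform bias parameter b (true value .30).
summary = summarize_estimates(
    estimates=[0.28, 0.30, 0.32],
    standard_errors=[0.02, 0.02, 0.02],
    true_value=0.30,
)
print({name: round(value, 3) for name, value in summary.items()})
```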

3 Results

After applying the MGFA, RFA/LMS, and RFA/RSP methods, we examined the outcomes of the single run procedure, the iterative procedure, and the parameter estimates. Non-admissible data sets in the different detection methods were replaced by new data sets. Data sets were considered non-admissible if there were estimation or convergence problems, as indicated by more restrictive models showing better fit than less restrictive models and/or, in the RFA/LMS method, if a null model without interaction effects did not yield results equivalent to those of a model with interaction effects fixed at zero. Only the RFA/LMS method yielded non-admissible data sets. Table 3 shows a larger number of non-admissible data sets in conditions with a dependency between the trait and the violator combined with a continuous violator. By replacing the non-admissible data sets with new data, all results were based on a total of 500 data sets in each condition.

Table 3. Number of Non-admissible Data Sets for the RFA/LMS Method.

                        Continuous Violator        Dichotomous Violator
                              Non-admissible             Non-admissible
Condition                Nr.  Data Sets             Nr.  Data Sets
Indep. T V  No Bias       1.    0                    9.    2
            Uniform       2.    0                   10.    1
            Nonuniform    3.    0                   11.    0
            Both          4.    1                   12.    1
Dep. T V    No Bias       5.    6                   13.    0
            Uniform       6.   17                   14.    0
            Nonuniform    7.   13                   15.    1
            Both          8.   22                   16.    0

3.1 Single run results

The results of item bias detection in the MGFA method, the RFA/LMS method, and the RFA/RSP method via the single run procedure are given in Table 4, Table 5, and Table 6, respectively. The first two columns describe the conditions. The next six columns of each table display conditions with a continuous violator and the last six columns display conditions with a dichotomous violator. For both types of violators, the condition number, the mean and the standard deviation of the chi-square difference tests for bias detection are given, together with the proportion of true positives or false positives at varying levels of significance (5%, 1%, and 0.1%). In conditions with bias the means, standard deviations, and proportions of true positives are calculated over 500 replications and the false positives over 2500 observations (500 replications × 5 items). We calculated means, standard deviations and proportion of false positives over 3000 observations (500 replications × 6 items) in conditions with no bias, because the values of the chi-square difference test did not differ across the relevant items.
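To make the construction of the table entries concrete, the sketch below computes the proportion of significant chi-square difference tests at the three levels of significance. It assumes a test with 2 degrees of freedom, for which the upper-tail probability is exp(-x/2) and the critical value is therefore -2 ln(α); the degrees of freedom, the function names, and the test values are assumptions made for illustration. In the study, each proportion would be computed over 500, 2500, or 3000 test values, depending on the condition.

```python
import math
from typing import Dict, Sequence

ALPHAS = (0.05, 0.01, 0.001)


def critical_value_df2(alpha: float) -> float:
    """Critical value of a chi-square distribution with 2 degrees of freedom.

    For df = 2 the upper-tail probability is exp(-x / 2), so the critical
    value equals -2 * ln(alpha): about 5.991, 9.210, and 13.816 for the
    5%, 1%, and 0.1% levels.
    """
    return -2.0 * math.log(alpha)


def proportion_positives(
    chi_square_values: Sequence[float],
    alphas: Sequence[float] = ALPHAS,
) -> Dict[float, float]:
    """Proportion of test values exceeding the critical value at each alpha."""
    n = len(chi_square_values)
    return {
        alpha: sum(value > critical_value_df2(alpha) for value in chi_square_values) / n
        for alpha in alphas
    }


# Made-up chi-square difference values for four replications.
proportions = proportion_positives([1.2, 6.5, 10.0, 15.0])
print(proportions)
```

For a biased item these proportions are true positive rates; for an unbiased item they are false positive rates.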


3.1.1 MGFA single run procedure

Table 4 displays the results of the MGFA single run procedure. In conditions with bias, the means of the chi-square test values for biased items are higher than those for the unbiased items. When testing at the 5% level of significance, the proportions of true positives in conditions with a continuous violator (Condition 2 through 4 and Condition 6 through 8) varied widely (64.6% to 96.2% correct). Lower proportions of true positives were found in conditions with only uniform bias combined with a dependency between T and V and in conditions with nonuniform bias without a dependency between T and V. In conditions with a dichotomous violator (Condition 10 through 12 and Condition 14 through 16) the proportions were high, regardless of the type of bias or the dependency between T and V (93.2% to 99.8% correct). In conditions with uniform bias (Condition 2, 6, 10, and 14) the proportions of true positives in the MGFA method ranged from 64.6% to 98.0%. The MGFA method performed worse in detecting uniform bias only in conditions with a continuous violator and a dependency between T and V. The proportions of true positives in conditions with nonuniform bias (Condition 3, 7, 11, and 15) ranged from 65.6% to 94.4%. The lower proportions of true positives were found in conditions with a continuous violator without a dependency between T and V. Nonuniform bias was easier to detect in conditions with a dependency between T and V combined with a continuous violator. As nonuniform bias is the interaction between T and V, the dependency and the bias might amplify each other. With uniform and nonuniform bias (Condition 4, 8, 12, and 16) the detection rates were very high (92.4% to 99.8% correct). A dependency between T and V did not affect the overall detection rates (with a dependency between T and V, Condition 6 through 8 and Condition 14 through 16: 64.6% to 99.4% correct; without a dependency between T and V, Condition 2 through 4 and Condition 10 through 12: 65.6% to 99.8% correct). As mentioned earlier, in conditions with a dependency between T and V combined with a continuous violator, uniform bias was harder to detect and nonuniform bias was easier to detect.

The proportions of false positives that were found when we tested at the 5% level of significance were generally a little higher than the nominal level of significance, especially in conditions with uniform bias and in conditions with uniform and nonuniform bias (Condition 2, 4, 6, 8, 10, 12, 14, and 16). Testing at lower levels of significance alleviates this problem, although the actual proportions of false positives are still higher than the nominal level of significance. However, lowering the level of significance negatively affects the proportions of true positives.
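To judge whether an observed proportion of false positives is "a little higher" than the nominal level, it can help to compare the difference with the sampling error of a proportion. The sketch below uses the binomial standard error and, as a simplifying assumption, treats all pooled tests as independent, which does not hold exactly because tests on different items of the same dataset are related. The example uses the proportion .057 over 3000 observations reported for Condition 1 in Table 4.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent tests."""
    return math.sqrt(p * (1.0 - p) / n)


NOMINAL_ALPHA = 0.05
OBSERVED = 0.057        # Condition 1, MGFA single run, 5% level (Table 4)
N_OBSERVATIONS = 3000   # 500 replications x 6 items

# Standard error under the nominal level, roughly .004.
se_null = binomial_standard_error(NOMINAL_ALPHA, N_OBSERVATIONS)
# Distance between the observed and nominal proportion in standard errors.
z_score = (OBSERVED - NOMINAL_ALPHA) / se_null
print(round(se_null, 4), round(z_score, 2))
```

Under these assumptions, a proportion of .057 lies less than two standard errors above the nominal level, whereas proportions above .10, as found for unbiased items in some conditions with bias, lie far outside that range.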


Table 4. Outcomes Single Run Procedure MGFA Method.

                        Continuous Violator                            Dichotomous Violator
                                              Positives at α                                 Positives at α
Condition                Nr.       χ2   sd(χ2)   .050   .010   .001     Nr.       χ2   sd(χ2)   .050   .010   .001
Indep. T V  No Bias       1.    2.063    2.054   .057   .011   .001      9.    2.106    2.148   .060   .015   .001
            Uniform       2.   12.497    6.516   .848   .644   .380     10.   20.137    8.342   .980   .930   .746
                                2.380    2.288   .078   .020   .002            2.801    2.687   .115   .036   .004
            Nonuniform    3.    9.261    6.060   .656   .416   .192     11.   15.591    7.222   .932   .816   .566
                                2.353    2.319   .076   .021   .002            2.546    2.475   .091   .025   .004
            Both          4.   19.253    8.739   .962   .902   .710     12.   31.285   10.864   .998   .998   .970
                                2.669    2.626   .109   .029   .005            3.079    3.006   .140   .049   .010
Dep. T V    No Bias       5.    2.046    2.063   .058   .013   .001     13.    2.113    2.081   .061   .011   .000
            Uniform       6.    9.003    5.762   .646   .406   .198     14.   16.511    7.347   .946   .850   .602
                                2.362    2.278   .074   .020   .000            2.862    2.753   .120   .037   .006
            Nonuniform    7.   10.248    6.038   .738   .508   .234     15.   16.245    7.357   .944   .842   .594
                                2.324    2.300   .075   .019   .003            2.582    2.524   .095   .027   .006
            Both          8.   16.464    8.193   .924   .806   .592     16.   30.587   10.917   .994   .986   .958
                                2.861    2.894   .124   .039   .007            3.545    3.033   .196   .056   .006

Notes: for each condition with bias, the first row refers to the biased item (proportions of true positives) and the second row to the unbiased items (proportions of false positives). In conditions without bias, all proportions are proportions of false positives.


3.1.2 RFA/LMS single run procedure

Table 5 gives the results for the single run procedure in the RFA/LMS method. In conditions with bias, the means of the chi-square difference test values for biased items are higher than those for the unbiased items. Outliers affected the mean chi-square values and the standard deviations in three conditions (see the note to Table 5). When testing at the 5% level of significance, the proportions of true positives in conditions with a continuous violator (Condition 2 through 4 and Condition 6 through 8) showed a high detection rate regardless of the type of bias or the dependency between T and V (89% to 100% correct). The lowest proportions of true positives were found in conditions with only uniform bias combined with a dependency between T and V and in conditions with nonuniform bias without a dependency between T and V. In conditions with a dichotomous violator (Condition 10 through 12 and Condition 14 through 16) bias was not hard to detect (94.2% to 100% correct). In conditions with uniform bias (Condition 2, 6, 10, and 14) the proportions of true positives were high (89% to 98.2% correct). With nonuniform bias (Condition 3, 7, 11, and 15) the RFA/LMS method worked very well; the detection rates were as high as with uniform bias (89.2% to 96.8% correct). Nonuniform bias was easier to detect if there was a dependency between T and V in conditions with a continuous violator. With uniform and nonuniform bias (Condition 4, 8, 12, and 16) the detection rates were very high (99.6% to 100% correct). A dependency between T and V did not affect the overall detection rates (with a dependency between T and V, Condition 6 through 8 and Condition 14 through 16: 89% to 99.6% correct; without a dependency between T and V, Condition 2 through 4 and Condition 10 through 12: 89.2% to 100% correct).

When testing at the 5% level of significance, the proportions of false positives were generally a little larger than the nominal level of significance, especially in conditions with uniform bias and in conditions with a combination of uniform and nonuniform bias (Condition 2, 4, 6, 8, 10, 12, 14, and 16).


Table 5. Outcomes Single Run Procedure RFA/LMS Method.

                        Continuous Violator                            Dichotomous Violator
                                              Positives at α                                 Positives at α
Condition                Nr.       χ2   sd(χ2)   .050   .010   .001     Nr.       χ2   sd(χ2)   .050   .010   .001
Indep. T V  No Bias       1.    2.209    2.237   .070   .017   .002      9.    2.152    2.246   .064   .040   .009
            Uniform       2.   22.344   10.069   .976   .926   .794     10.   21.036    9.121   .982   .932   .768
                                2.905    2.780   .126   .039   .005            2.849    2.756   .124   .038   .005
            Nonuniform    3.   18.559   16.506   .892   .768   .544     11.   19.498    9.434   .956   .010   .000
                                2.423    2.612   .082   .027   .006            2.343    2.353   .074   .020   .004
            Both          4.   39.311   21.964  1.000   .984   .960     12.   37.846   14.976  1.000   .996   .984
                                3.040    3.149   .136   .051   .013            2.880    2.893   .123   .042   .009
Dep. T V    No Bias       5.    2.230    2.348   .070   .015   .004     13.    2.164    2.177   .062   .016   .001
            Uniform       6.   15.702    9.561   .890   .730   .502     14.   17.060    8.091   .942   .848   .604
                                2.928    2.893   .132   .044   .008            2.895    2.811   .126   .042   .006
            Nonuniform    7.   25.740   26.262   .968   .900   .744     15.   19.995    9.624   .964   .898   .716
                                2.547    2.655   .091   .029   .006            2.364    2.397   .076   .020   .004
            Both          8.   42.599   66.751   .996   .988   .942     16.   35.566   13.919   .996   .990   .970
                                3.258    3.338   .158   .058   .144            3.154    2.877   .151   .042   .006

Notes: for each condition with bias, the first row refers to the biased item (proportions of true positives) and the second row to the unbiased items (proportions of false positives). In conditions without bias, all proportions are proportions of false positives. After removing the χ2 outliers, the mean and standard deviation for Condition 3, Condition 7, and Condition 8 were 17.891 (mean) with 11.761 (sd), 24.455 (mean) with 16.406 (sd), and 38.349 (mean) with 25.964 (sd), respectively.


3.1.3 RFA/RSP single run procedure

Table 6 shows the results for the single run procedure in the RFA/RSP method. Again, in conditions with bias, the means of the chi-square difference test values for biased items are higher than those for the unbiased items. The mean and the standard deviation of the chi-square values in three conditions were also affected by outliers (see the note to Table 6). Testing at the 5% level of significance indicates that the proportions of true positives in conditions with a continuous violator (Condition 2 through 4 and Condition 6 through 8) were high (89% to 100% correct). In conditions with a dichotomous violator (Condition 10 through 12 and Condition 14 through 16), bias was easy to detect (94.2% to 100% correct). With uniform bias (Condition 2, 6, 10, and 14) the proportions of true positives were high (89.2% to 98.2% correct). The detection rates of nonuniform bias (Condition 3, 7, 11, and 15) were also high (89% to 96.8% correct). Once more, nonuniform bias was easier to detect if there was a dependency between T and V in conditions with a continuous violator. In conditions with both uniform and nonuniform bias (Condition 4, 8, 12, and 16) the detection rates were also very high (99.6% to 100% correct). A dependency between T and V did not systematically influence the proportions of true positives (with a dependency between T and V, Condition 6 through 8 and Condition 14 through 16: 89.2% to 99.6% correct; without a dependency between T and V, Condition 2 through 4 and Condition 10 through 12: 89% to 100% correct).

The proportions of false positives that were found when we tested at the 5% level of significance were generally a little higher than the nominal level of significance, especially in conditions with only uniform bias and in conditions with uniform and nonuniform bias (Condition 2, 4, 6, 8, 10, 12, 14, and 16).


Table 6. Outcomes Single Run Procedure RFA/RSP Method.

                        Continuous Violator                            Dichotomous Violator
                                              Positives at α                                 Positives at α
Condition                Nr.       χ2   sd(χ2)   .050   .010   .001     Nr.       χ2   sd(χ2)   .050   .010   .001
Indep. T V  No Bias       1.    2.208    2.237   .070   .017   .002      9.    2.152    2.245   .064   .016   .003
            Uniform       2.   22.339   10.065   .976   .926   .794     10.   21.033    9.114   .982   .000   .000
                                2.905    2.780   .126   .039   .005            2.173    2.211   .064   .017   .002
            Nonuniform    3.   18.518   16.399   .890   .768   .544     11.   19.465    9.407   .956   .898   .698
                                2.422    2.611   .082   .216   .132            2.342    2.352   .074   .020   .004
            Both          4.   39.227   21.804  1.000   .984   .962     12.   37.769   14.917  1.000   .996   .984
                                3.039    3.147   .136   .051   .013            2.879    2.892   .123   .042   .009
Dep. T V    No Bias       5.    2.231    2.346   .071   .015   .004     13.    2.163    2.176   .062   .016   .001
            Uniform       6.   15.819    9.592   .892   .738   .504     14.   17.065    8.086   .942   .848   .608
                                2.915    2.884   .132   .042   .008            2.893    2.809   .126   .042   .006
            Nonuniform    7.   25.618   25.245   .968   .900   .742     15.   19.969    9.606   .962   .900   .718
                                2.549    2.656   .090   .029   .006            2.363    2.395   .075   .020   .004
            Both          8.   42.481   65.462   .996   .988   .942     16.   35.512   13.879   .996   .990   .970
                                3.245    3.331   .156   .057   .018            3.152    2.873   .151   .043   .006

Notes: for each condition with bias, the first row refers to the biased item (proportions of true positives) and the second row to the unbiased items (proportions of false positives). In conditions without bias, all proportions are proportions of false positives. After removing the χ2 outliers, the mean and standard deviation for Condition 3, Condition 7, and Condition 8 were 17.858 (mean) with 11.725 (sd), 24.407 (mean) with 16.337 (sd), and 38.419 (mean) with 25.924 (sd), respectively.


3.1.4 MGFA, RFA/LMS, and RFA/RSP methods

In general, the standard deviations of the chi-square difference test values are higher with the RFA/RSP and RFA/LMS methods than with the MGFA method. The proportions of true positives in conditions with a dichotomous violator were very high with all methods, regardless of the type of bias or the dependency between T and V. In conditions with a continuous violator, the proportions of correctly identified items in the RFA/LMS method and the RFA/RSP method were very high, but the MGFA method showed lower proportions of true positives. Overall, uniform bias had high proportions of true positives in all methods, with the exception of the MGFA method with a continuous violator combined with a dependency between T and V. However, uniform bias in conditions with a dependency between T and V turned out to be a little harder to detect in all methods. In all conditions with nonuniform bias, the RFA/LMS method and the RFA/RSP method had a very high detection rate. The proportions of true positives in the MGFA method were lower. Nonuniform bias is easier to detect if the dependency between T and V and the bias amplify each other. In conditions with uniform and nonuniform bias, the detection rates of all methods were very high. For all methods, a dependency between T and V did not affect the proportion of true positives systematically. The detection rates in the RFA/LMS method and the RFA/RSP method were remarkably similar. However, individual datasets did produce different results.

The actual proportion of false positives in all methods was a little higher than the nominal level of significance, especially in conditions with only uniform bias and in conditions with the combination of uniform and nonuniform bias. For the MGFA method it is remarkable that the proportions of false positives in conditions with uniform combined with nonuniform bias and a dichotomous violator were a little higher than in the RFA/RSP method and RFA/LMS method. In all methods testing at lower levels of significance reduced the number of false positives, but lowering the level of significance somewhat negatively affected the proportions of true positives.

3.2 Iterative procedure

The results of the iterative item bias detection procedure in the MGFA method and the RFA/RSP method are given in Table 7 and Table 8. As the RFA/LMS method performed similarly to the RFA/RSP method, we decided to exclude the RFA/LMS method from the results section. In the iterative procedure only the item with the largest chi-square difference test value (between the null model and the alternative model), if significant, is marked as biased and accounted for in the model, whereupon the model is re-analysed. This procedure is stopped after three items are identified as biased. Ideally, the first item is marked as biased and the bias detection stops after the second bias detection run. Table 7 and Table 8 give the final bias detection rates after performing the iterative procedure. The first two columns in both tables describe the condition. The next four columns display conditions with a continuous violator and the last four display conditions with a dichotomous violator. For each condition the proportions of items detected as biased at varying levels of significance (5%, 1%, and 0.1%) are given.


3.2.1 MGFA iterative procedure

From the results in Table 7, it appears that the iterative procedure yields about the same bias detection pattern as described for the MGFA single run procedure. The proportion of true positives is lower in the iterative procedure only if items falsely marked as biased have higher chi-square difference test values than items correctly marked as biased. Lowering the level of significance affects the proportions of true positives negatively. After three bias detection runs at a 5% level of significance, only 2.6% or less of the datasets still indicated a biased item. This occurs in conditions with uniform bias in combination with nonuniform bias (Condition 4, 8, 12, and 16). At lower levels of significance, the number of remaining biased items after three bias detection runs is negligible.

The iterative procedure displays a lower percentage of false positives than the single run procedure. This is especially the case if the power to detect bias is large. The percentage of false positives is about equal to the nominal level of significance.

Table 7. Iterative Procedure MGFA Method.

                        Continuous Violator          Dichotomous Violator
                             Positives at α =             Positives at α =
Condition                Nr.   .050   .010   .001     Nr.   .050   .010   .001
Indep. T V  No Bias       1.   .053   .010   .001      9.   .054   .014   .001
            Uniform       2.   .854   .686   .424     10.   .976   .924   .746
                               .055   .016   .001           .056   .012   .002
            Nonuniform    3.   .636   .410   .192     11.   .912   .804   .566
                               .054   .016   .002           .047   .015   .002
            Both          4.   .948   .706   .706     12.   .996   .996   .970
                               .051   .016   .002           .055   .012   .002
Dep. T V    No Bias       5.   .053   .013   .001     13.   .057   .011   .000
            Uniform       6.   .614   .398   .200     14.   .924   .846   .600
                               .049   .012   .000           .057   .014   .002
            Nonuniform    7.   .712   .502   .230     15.   .934   .834   .592
                               .053   .012   .002           .048   .016   .002
            Both          8.   .912   .800   .590     16.   .990   .984   .958
                               .059   .016   .003           .050   .014   .002

Notes: for each condition with bias, the first row gives the proportions of true positives and the second row the proportions of false positives. In conditions without bias, all proportions are proportions of false positives.

3.2.2 RFA/RSP iterative procedure

Table 8 displays the results of the iterative bias detection procedure for the RFA/RSP method. It appears that the iterative procedure yields about the same bias detection pattern as described for the RFA/RSP single run procedure. This pattern was also found with the MGFA method. Again, if items falsely marked as biased in the iterative procedure have higher chi-square difference test values than items correctly marked as biased, the proportion of true positives is lower in the iterative procedure. Similar to the MGFA iterative procedure results, lowering the level of significance affects the proportions of true positives negatively. After three bias detection runs at a 5% level of significance, 4.4% or less of the datasets still indicated a biased item. This occurs in conditions with a continuous violator. At lower levels of significance, the number of remaining biased items after three bias detection runs is negligible.

Compared to the single run procedure, the iterative procedure contains a smaller proportion of false positives, especially in cases with high power to detect bias. Both the MGFA iterative procedure and the RFA/RSP iterative procedure reduced the proportions of false positives to the nominal level of significance.

Table 8. Iterative Procedure RFA/RSP Method.

                        Continuous Violator          Dichotomous Violator
                             Positives at α =             Positives at α =
Condition                Nr.   .050   .010   .001     Nr.   .050   .010   .001
Indep. T V  No Bias       1.   .064   .018   .003      9.   .059   .016   .003
            Uniform       2.   .970   .920   .794     10.   .974   .924   .766
                               .064   .018   .005           .061   .012   .003
            Nonuniform    3.   .886   .768   .544     11.   .952   .896   .698
                               .061   .020   .004           .057   .014   .002
            Both          4.   .996   .960   .960     12.   .998   .992   .984
                               .070   .020   .005           .057   .016   .002
Dep. T V    No Bias       5.   .069   .016   .004     13.   .059   .016   .001
            Uniform       6.   .860   .720   .500     14.   .916   .836   .606
                               .070   .024   .005           .061   .018   .002
            Nonuniform    7.   .964   .898   .740     15.   .958   .896   .718
                               .076   .019   .003           .054   .015   .004
            Both          8.   .996   .982   .942     16.   .996   .990   .970
                               .069   .026   .004           .056   .012   .003

Notes: for each condition with bias, the first row gives the proportions of true positives and the second row the proportions of false positives. In conditions without bias, all proportions are proportions of false positives.

3.3 Estimation bias; accuracy and efficiency

We examined the parameter estimates by focusing on their accuracy and efficiency. The accuracy was evaluated by comparing the means of the parameter estimates of the model with the true population values. The efficiency of the parameter estimates was examined by calculating the standard deviations of the parameter estimates.

3.3.1 MGFA estimation bias

Table 9 displays the results of the parameter estimates in the MGFA method. In the evaluation of the accuracy and efficiency of the parameter estimates, we concentrated on the mean of parameters that indicate bias. The first two columns describe for each type of violator the dependency between T and V and the type of bias. The remaining columns display the mean estimation bias, the standard deviations, and the mean standard errors of the intercept (u) and the factor loading (b) in Group 1 and in Group 2.

The estimation bias of the parameters for uniform bias (u) and nonuniform bias (b), calculated over groups, varies from .000 to .018 (mean .007) and from .000 to .040 (mean .010), respectively. The factor loadings that indicate nonuniform bias in conditions with a continuous violator (Condition 3, 4, 7, and 8) are less accurate than the other estimated parameters. The standard deviations, which indicate the efficiency of the parameter estimates, range from .104 to .214 (mean .144) for parameters that indicate uniform bias and from .099 to .163 (mean .132) for parameters that indicate nonuniform bias. As desired, a comparison of the mean standard errors and the standard deviations shows that they are roughly equal.


Table 9. Estimation Bias MGFA Method.

                            Estimation Group 1      Estimation Group 2      Estimation Group 1      Estimation Group 2
Condition                Nr. Bias u1  sd(u1)  se(u1) Bias u2  sd(u2)  se(u2) Bias b1  sd(b1)  se(b1) Bias b2  sd(b2)  se(b2)
Continuous
Indep. T V  No Bias       1.    .004    .127    .126    .002    .134    .140    .008    .117    .117    .001    .136    .135
            Uniform       2.    .008    .132    .127    .002    .141    .141    .008    .120    .119    .009    .136    .135
            Nonunif.      3.    .001    .112    .113    .012    .169    .160    .015    .112    .110    .040    .162    .150
            Both          4.    .005    .114    .115    .004    .160    .162    .028    .120    .112    .028    .161    .153
Dep. T V    No Bias       5.    .005    .123    .121    .003    .170    .167    .005    .116    .116    .003    .141    .135
            Uniform       6.    .015    .130    .127    .001    .182    .175    .000    .126    .120    .005    .144    .142
            Nonunif.      7.    .013    .108    .109    .009    .192    .190    .025    .112    .110    .023    .158    .154
            Both          8.    .018    .118    .112    .015    .214    .198    .029    .109    .112    .035    .163    .160
Dichotomous
Indep. T V  No Bias       9.    .006    .121    .126    .001    .142    .140    .004    .116    .117    .000    .141    .134
            Uniform      10.    .004    .135    .126    .007    .147    .140    .006    .125    .117    .008    .144    .134
            Nonunif.     11.    .009    .104    .107    .001    .164    .169    .000    .099    .107    .007    .159    .157
            Both         12.    .009    .114    .108    .000    .160    .168    .003    .110    .107    .010    .162    .156
Dep. T V    No Bias      13.    .000    .130    .127    .014    .172    .166    .001    .121    .118    .003    .135    .134
            Uniform      14.    .003    .128    .126    .007    .158    .166    .006    .122    .117    .000    .141    .134
            Nonunif.     15.    .000    .108    .108    .022    .204    .196    .003    .108    .107    .008    .159    .159
            Both         16.    .007    .110    .107    .002    .192    .195    .000    .110    .107    .007    .158    .158


3.3.2 RFA/LMS estimation bias

Table 10 gives the results of the parameter estimates in the RFA/LMS method. To evaluate the accuracy and efficiency of the parameter estimates, we focused on mean estimation bias, mean standard errors, and standard deviations of parameters that indicate bias. The first two columns of Table 10 describe for each type of violator the dependency between T and V and the type of bias. The remaining columns display mean estimation bias, standard deviations, and mean standard errors of the indicator of uniform bias (b), and the indicator of nonuniform bias (c). Outliers in the parameter estimates did not substantively influence the results.

The estimation bias of the parameters that indicate uniform bias (b) and nonuniform bias (c) ranges from .000 to .006 (mean .003) and from .000 to .039 (mean .010), respectively. The parameter estimates of the interaction effect that indicates nonuniform bias in conditions with a continuous violator (Condition 3, 4, 7, and 8) are less accurate than the other estimated parameters. This might be due to the estimation bias in conditions with a continuous violator and without a dependency between T and V (Condition 3 and 4). The standard deviations of the parameter estimates range from .072 to .087 (mean .078) for parameters that indicate uniform bias (b) and from .064 to .084 (mean .075) for parameters that indicate nonuniform bias (c). The mean standard errors and the standard deviations are about equal to each other.

Table 10. Estimation Bias RFA/LMS Method.

Condition                Nr.  Bias b   sd(b)   se(b)  Bias c   sd(c)   se(c)
Continuous
Indep. T V  No Bias       1.    .000    .072    .069    .004    .076    .072
            Uniform       2.    .005    .072    .069    .002    .076    .072
            Nonuniform    3.    .003    .075    .072    .036    .081    .073
            Both          4.    .002    .078    .071    .036    .076    .073
Dep. T V    No Bias       5.    .000    .086    .084    .004    .069    .063
            Uniform       6.    .004    .085    .084    .001    .069    .063
            Nonuniform    7.    .003    .087    .085    .031    .064    .065
            Both          8.    .003    .082    .084    .039    .067    .065
Dichotomous
Indep. T V  No Bias       9.    .003    .073    .070    .001    .075    .073
            Uniform      10.    .003    .072    .070    .001    .080    .073
            Nonuniform   11.    .004    .074    .073    .003    .071    .074
            Both         12.    .005    .073    .073    .004    .075    .074
Dep. T V    No Bias      13.    .005    .080    .077    .002    .080    .078
            Uniform      14.    .002    .074    .076    .000    .082    .078
            Nonuniform   15.    .006    .078    .080    .004    .081    .080
            Both         16.    .001    .084    .079    .000    .084    .079

3.3.3 RFA/RSP estimation bias

The results of the parameter estimates in the RFA/RSP method are given in Table 11. In analyzing the efficiency and accuracy, we concentrated on the mean estimation bias, the standard deviations, and the mean standard errors of the parameters that indicate bias. Analogous to the RFA/LMS and the MGFA method, the first two columns of Table 11 describe for each type of violator the dependency between T and V and the type of bias, respectively. The other columns show the mean estimation bias, the standard deviation, and the mean standard error of the indicator of uniform bias (b) and the indicator of nonuniform bias (c). Outliers in the parameter estimates did not substantively influence the results.

The estimation bias of the parameters that indicate uniform bias and nonuniform bias varies from .000 to .006 (mean .003) and from .000 to .039 (mean .011), respectively. Similar to the RFA/LMS method, the parameter estimates of the interaction effect that indicates nonuniform bias in conditions with a continuous violator (Condition 3, 4, 7, and 8) are less accurate than the other estimated parameters. The standard deviations of the parameter estimates (i.e., efficiency) range from .071 to .086 (mean .077) for parameters that indicate uniform bias and from .064 to .084 (mean .075) for parameters that indicate nonuniform bias. A comparison of the mean standard errors and the standard deviations indicates that they are about equal to each other. Overall, the parameter estimates in the RFA/LMS method and the RFA/RSP method were similar. However, individual datasets did produce different parameter estimates.

Table 11. Estimation Bias RFA/RSP Method.

Condition                Nr.  Bias b   sd(b)   se(b)  Bias c   sd(c)   se(c)
Continuous
Indep. T V  No Bias       1.    .000    .072    .069    .004    .076    .072
            Uniform       2.    .004    .071    .069    .002    .075    .072
            Nonuniform    3.    .003    .075    .071    .036    .081    .073
            Both          4.    .003    .078    .071    .036    .076    .073
Dep. T V    No Bias       5.    .000    .085    .083    .004    .069    .062
            Uniform       6.    .006    .084    .083    .001    .069    .063
            Nonuniform    7.    .003    .086    .084    .031    .064    .065
            Both          8.    .000    .082    .084    .039    .067    .065
Dichotomous
Indep. T V  No Bias       9.    .003    .073    .069    .001    .075    .073
            Uniform      10.    .002    .071    .069    .001    .079    .073
            Nonuniform   11.    .004    .074    .073    .004    .071    .074
            Both         12.    .006    .073    .072    .004    .075    .074
Dep. T V    No Bias      13.    .005    .080    .076    .002    .080    .078
            Uniform      14.    .004    .074    .076    .000    .082    .078
            Nonuniform   15.    .006    .078    .080    .004    .081    .080
            Both         16.    .001    .083    .079    .000    .084    .079

4 Discussion

The goal of this paper was to describe and evaluate the MGFA, RFA/LMS, and RFA/RSP methods for detecting both uniform and nonuniform bias in different conditions. As expected, with a continuous violator and a single run procedure, bias detection rates were higher with the RFA/RSP and RFA/LMS methods. The MGFA method showed difficulties in detecting bias, because it was necessary to divide the sample into sub-samples, which is associated with a smaller sample size and loss of information. Even if the power of the MGFA method is increased by starting with across-group constraints on all factor loadings and intercepts, thus limiting the number of parameters to be estimated, this problem is evident. With a dichotomous


rates in conditions with a continuous violator were very high in the RFA/LMS method and the RFA/RSP method, but the MGFA method showed lower detection rates. In conditions with uniform bias and nonuniform bias, the RFA/LMS method and the RFA/RSP method have a very high detection rate. However, the proportions of true positives for the MGFA method with a continuous violator were lower. In conditions with both uniform and nonuniform bias, the detection rates of all methods were very high. A dependency between the trait of interest and the violator is likely to occur in practice. We expected that bias would be harder to detect with a dependency between the trait and the violator. Overall, results show that the proportions of true positives were not affected, yet there were some interaction effects of this factor and the type of bias. Although other studies investigated bias detection under other conditions, always with dichotomous violators, we can make some comparisons. Barendse, Oort, and Garst (2010) compared the MGFA method with the RFA/LMS method and found, in agreement with the present study, that both methods performed about equally well. Uniform bias turned out to be easy to detect in other RFA or MIMIC studies as well (Barendse, Oort, & Garst, 2010; Oort, 1998; Woods, 2009; Woods & Grimm, 2010), but it was a little harder to detect nonuniform bias with the RFA/LMS method or MIMIC/LMS method (Barendse, Oort, & Garst, 2010; Woods & Grimm, 2010). Resembling the MGFA results in the present study, uniform bias was easy to detect with normally distributed item responses or five-category item responses, but it was a little harder to detect nonuniform bias with the MGFA method (González-Romá, Hernández, & Gómez-Benito, 2006; Hernández & González-Romá, 2003; Meade & Lautenschlager, 2004). Similar to the study of Barendse, Oort, and Garst (2010), the proportions of true positives were not influenced by a dependency between the trait and the violator.

Results in the iterative procedure indicated that the proportions of true positives were about equal to the proportions of true positives in the single run procedure. In contrast to the studies of Oort (1998) and Navas-Ara and Gómez-Benito (2002), the iterative procedure did not improve the proportions of true positives. This might be due to specific simulation conditions (such as more biased items, scale of the item responses, and number of items) or to the use of the modification indices and expected parameter change instead of comparing the fit of the null model to an alternative model.

From the single run procedure results it appears that the proportions of false positives in most conditions were about equal to the nominal level of significance. Similar to other RFA/MIMIC studies and MGFA studies, the proportions of false positives in certain conditions were inflated (Barendse, Oort & Garst, 2010; Gomez & Navas, 2002; González-Romá, Hernández, & Gómez-Benito, 2006; Hernández & González-Romá, 2003; Meade & Lautenschlager, 2004; Oort, 1998; Woods, 2009; Woods & Grimm, 2010). If the power to detect bias is high, the proportion of false positives is likely to be higher than the nominal level of significance. Oort (1998) and Navas-Ara and Gómez-Benito (2002) demonstrated that the iterative procedure decreased the proportion of false positives. These studies removed the biased item, whereupon they reran the model. In this study, we chose to account for the bias and rerun the detection procedure. The removal of the biased item seems more appropriate in test construction, while accounting for the biased item seems more appropriate when using standard tests in applied research. Although laborious, the iterative procedure reduced the proportion of false positives in this study to the nominal level of significance.

to detect bias. The parameter estimates of the RFA/LMS method and the RFA/RSP method are more efficient than the parameter estimates of the MGFA method. A problem with the MGFA method is the relatively small sample size available for estimating the parameters. The mean standard errors and the standard deviations of the parameter estimates in all methods were about equal. Thus, the mean standard errors of the parameter estimates in all methods seem accurate and can be used for calculating confidence intervals and for testing hypotheses about the parameters.
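The two comparisons in this paragraph can be made concrete with a small Python sketch. It assumes that, for one parameter, each replication of a simulation yields an estimate and an estimated standard error; the function names and all numbers below are hypothetical and are not results from this study.

```python
from statistics import mean, stdev, variance


def se_accuracy_ratio(estimates, standard_errors):
    """Mean estimated standard error divided by the empirical SD of the estimates.

    A ratio close to 1 suggests that the standard errors are accurate.
    """
    return mean(standard_errors) / stdev(estimates)


def relative_efficiency(estimates_a, estimates_b):
    """Sampling variance of method B divided by that of method A.

    A value above 1 means method A is the more efficient of the two.
    """
    return variance(estimates_b) / variance(estimates_a)


# Hypothetical estimates of one parameter over five replications.
method_a = [0.48, 0.52, 0.50, 0.46, 0.54]
method_a_se = [0.030, 0.032, 0.031, 0.033, 0.032]
method_b = [0.44, 0.56, 0.50, 0.38, 0.62]

ratio = se_accuracy_ratio(method_a, method_a_se)
efficiency = relative_efficiency(method_a, method_b)
```

With these made-up numbers the ratio is close to 1 (accurate standard errors for method A) and the relative efficiency is about 9 (method A estimates vary far less than method B estimates).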

On three points the RFA method is superior to the MGFA method in detecting bias. First, the RFA method can investigate bias with respect to group membership without dividing the sample into sub-samples. Second, the RFA method can investigate bias with respect to variables other than group membership without loss of information. Thus, a violator in the RFA method can be any variable, continuous or discrete, observed or latent. Third, the RFA method can detect bias with respect to multiple variables simultaneously, which is particularly important because in practice there may be many violators of the measurement model.

In practice, it is important to reflect on theoretical and empirical considerations to rule out potential violators. The researcher has to decide which potential violators are important enough to include in the investigation of measurement bias. If not all potential violators are available, we can still detect bias with respect to other variables (such as group membership) that are associated with the biasing variables.

In this study, we combined the RFA method with LMS and RSP to estimate interaction effects, because both are implemented in M-plus and readily available. The LMS method is intended for estimating interaction effects of latent variables only. We therefore investigated bias with respect to an observed variable by introducing a latent variable with a single indicator, a fixed factor loading, and a fixed residual variance. With the single indicator, the RFA/LMS method performed very well, at least as well as the RFA/RSP method and better than the MGFA method. However, in practice, it may be more suitable to use the RFA/RSP method if the potential violator is observed.
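The single-indicator device can be written out as a pair of equations. The notation is assumed for this sketch and is not taken from the text: $v$ is the observed violator, $\eta$ the latent variable that stands in for it, $\xi$ the trait, and $x_j$ the response to item $j$. The values at which the loading and the residual variance are fixed (1 and 0 here) are illustrative assumptions, since the text states only that both are fixed.

```latex
\begin{align}
  % Observed violator as the single indicator of a latent variable
  v   &= \lambda_v \eta + \delta_v,
      \qquad \lambda_v = 1 \ \text{(fixed)},
      \quad \operatorname{Var}(\delta_v) = 0 \ \text{(fixed)} \\
  % Item j with a uniform bias term (b_j) and a nonuniform bias term (c_j)
  x_j &= \tau_j + \lambda_j \xi + b_j \eta + c_j \, \xi \eta + \varepsilon_j
\end{align}
```

Under these assumptions, item $j$ is unbiased with respect to the violator when $b_j = c_j = 0$. A nonzero $b_j$ corresponds to uniform bias, and a nonzero $c_j$, the coefficient of the latent interaction term, corresponds to nonuniform bias.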

At present, no other study has investigated the behavior of the RFA/LMS method with a latent violator. The opportunity to detect bias with a latent violator can be further explored with simulated data. Future simulation studies of measurement bias detection could also include other analytic approaches or product indicator approaches to estimate the nonlinear effect and to detect nonuniform bias. These alternatives include the promising new quasi-maximum likelihood approach (Klein & Muthén, 2007). However, a possible disadvantage of this approach (as currently implemented by Klein, 2007) is that it is necessary to remove the biased item in the iterative procedure, because only one item at a time can be investigated for bias. Finally, to better represent the data encountered in substantive research, future research could also examine the behavior of different bias detection methods with multiple violator variables, multiple biased items, and discrete item responses.

References

Barendse, M. T., Oort, F., & Garst, G. J. A. (2010). Using restricted factor analysis with latent moderated structures to detect uniform and nonuniform measurement bias; a simulation study. Advances in Statistical Analysis, 94, 117-127.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale, NJ: Erlbaum.

González-Romá, V., Hernández, A., & Gómez-Benito, J. (2006). Power and Type I error of the mean and covariance structure analysis model for detecting differential item functioning in graded response items. Multivariate Behavioral Research, 41, 29-53.

Hernández, A., & González-Romá, V. (2003). Evaluating the multiple-group mean and covariance structure model for the detection of differential item functioning in polytomous ordered items. Psicothema, 15, 322-327.

Kenny, D., & Judd, C. M. (1984). Estimating the nonlinear and interactive effects of latent variables. Psychological Bulletin, 96, 201-210.

King-Kallimanis, B. L., Oort, F. J., & Garst, G. J. A. (2010). Using structural equation modeling to detect measurement bias and response shift in longitudinal data. Advances in Statistical Analysis, 94, 139-156.

Klein, A. G. (2007). QuasiML 3.10 - Quick reference manual. Unpublished Manuscript, University of Western Ontario.

Klein, A. G., & Moosbrugger, H. (2000). Maximum likelihood estimation of latent interaction effects with the LMS method. Psychometrika, 65, 457-474.

Klein, A. G., & Muthén, B. O. (2007). Quasi maximum likelihood estimation of structural equation models with multiple interaction and quadratic effects. Multivariate Behavioral Research, 42, 647-673.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.

Meade, A. W., & Lautenschlager, G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11, 60-72.

Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127-143.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543.

Moosbrugger, H., Schermelleh-Engel, K., Kelava, A., & Klein, A. G. (2009). Testing multiple nonlinear effects in structural equation modelling: A comparison of alternative estimation approaches. In T. Teo & M. S. Khine (Eds.), Structural Equation Modeling in Educational Research: Concepts and Applications (pp. 103-136). Rotterdam: Sense Publishers.

Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54, 557-585.

Muthén, B. O., & Asparouhov, T. (2003). Modeling interactions between latent and observed continuous variables using Maximum-Likelihood estimation in M-plus (Mplus Web Notes No. 6). Retrieved August 23, 2010, from http://www.statmodel.com/download/webnotes/webnote6.pdf

Muthén, B. O., & Muthén, L. K. (2001). M-plus user's guide: Statistical analysis with latent variables. Los Angeles, CA: Muthén & Muthén.

Navas-Ara, M. J., & Gómez-Benito, J. (2002). Effects of ability scale purification on identification of DIF. European Journal of Psychological Assessment, 18, 9-15.

Oort, F. J. (1991). Theory of violators: Assessing unidimensionality of psychological
