
DOI: 10.1080/15305058.2011.602810

A Generalized Logistic Regression Procedure to Detect Differential Item Functioning Among Multiple Groups

David Magis

Department of Mathematics, University of Liège, and K. U. Leuven, Belgium

Gilles Raîche and Sébastien Béland

Department of Education, Université du Québec à Montréal, Canada

Paul Gérard

Department of Mathematics, University of Liège, Belgium

We present an extension of the logistic regression procedure for the identification of dichotomous differential item functioning (DIF) in the presence of more than two groups of respondents. Starting from the usual framework of a single focal group, we propose a general approach to estimate the item response functions in each group and to test for the presence of uniform DIF, nonuniform DIF, or both. This generalized procedure is compared to other existing DIF methods for multiple groups with a real data set on language skill assessment. Emphasis is put on the flexibility, completeness, and computational ease of the generalized method.

Keywords: differential item functioning, logistic regression, multiple groups

INTRODUCTION

An important research area in psychometrics is the identification of differential item functioning (DIF). In the context of dichotomous responses, an item is said to function differently (or, in short, to be DIF) if respondents with the same ability level but from different groups of examinees have different probabilities of answering this item correctly. DIF is an unwanted phenomenon that can lead to biased measurements of ability (Ackerman, 1992). The identification of DIF items is therefore a crucial issue for valid psychometric and educational measurement. In this article, we restrict our purpose to dichotomously scored items.

The authors wish to thank Prof. Stephen G. Sireci, editor, Prof. Rob R. Meijer, co-editor, and two anonymous reviewers for their helpful comments. This research was funded by a grant "Chargé de recherches" of the National Funds for Scientific Research (FNRS), Belgium, the Research Funds of the K. U. Leuven, and a grant from the Social Sciences Research Council of Canada (SSRC).

Correspondence should be sent to David Magis, PhD, University of Liège, Department of Mathematics (B37), Grande Traverse 12, 4000 Liège, Belgium. E-mail: david.magis@ulg.ac.be

Downloaded by [David Magis] at 07:48 04 November 2011

The DIF detection methods can be classified according to two main factors: the methodological approach (based on item response theory (IRT) models or not) and the type of DIF effect (uniform or nonuniform). Methods based on IRT require the fitting of item response models, for instance the logistic model with one or more parameters. Non-IRT methods, on the other hand, rely on statistical procedures that assess the presence of DIF from observed test scores, without requiring an IRT solution. Moreover, an item exhibits uniform DIF if the interaction between the item responses and the groups of respondents is independent of the ability level. Nonuniform (or crossing) DIF is characterized by an item-group interaction that can vary along the ability scale (Clauser & Mazor, 1998; Hanson, 1998).

The best-known IRT-based methods are Lord's χ2 test (Lord, 1980), Raju's area method (Raju, 1990), and the likelihood-ratio test (Thissen, Steinberg, & Wainer, 1988). Among the non-IRT methods, the Mantel-Haenszel method (Holland & Thayer, 1988), the SIBTEST method (Shealy & Stout, 1993), and the logistic regression procedure (Swaminathan & Rogers, 1990) are the most commonly used. For a recent review of DIF detection methods, see Clauser and Mazor (1998), Osterlind and Everson (2009), Penfield and Camilli (2007), and Penfield and Lam (2001).

The aforementioned methods are designed to compare two groups of respondents: the reference group and the focal group. However, practical situations might require the detection of items that function differently across three or more groups. For instance, one could be interested in comparing item performance between different classrooms within a school or between schools within a country. At an international level, DIF investigations could be performed in international studies or surveys such as the Program for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS), large-scale assessment studies involving far more than two groups (countries) of respondents. Another possible application of multiple-group DIF detection methods is the comparison of item performance on a common test administered repeatedly across years. For example, entrance or admission tests that are administered every year to students entering higher education could be screened for DIF across the years of administration. The practical example analyzed in this article is of that kind.

Until now, however, the extension of DIF detection methods to multiple groups has received very little attention. When dealing with more than one focal group, the common practical approach is to perform pairwise comparisons between the reference group and each focal group (see, e.g., Angoff & Sharon, 1974; Ellis & Kimmel, 1992; Schmitt & Dorans, 1990; Zwick & Ercikan, 1989). But because of the simultaneous testing of multiple hypotheses, this approach inflates the Type I error rate, and ad hoc corrections such as the Bonferroni adjustment of the nominal significance level must be used.

To date, two methods have been specifically extended for testing DIF among multiple groups: the Mantel-Haenszel method (Penfield, 2001; see also Fidalgo & Madeira, 2008; Fidalgo & Scalon, 2010) and Lord's χ2 test (Kim, Cohen, & Park, 1995). The so-called generalized Mantel-Haenszel method focuses on uniform DIF among multiple groups, while the generalized Lord's test can detect both types of effects. To our knowledge, the logistic regression procedure has not been formally extended to more than two groups. Millsap and Everson (1993, p. 305) mentioned that this method can be generalized to more than two groups, and Van den Noortgate and De Boeck (2005) introduced logistic mixed models that can accommodate several groups of respondents. Kanjee (2007) recommended merging all focal groups into a single one before applying the usual logistic regression procedure. This approach avoids pairwise comparisons and subsequently controls for Type I error inflation. In addition, the power of the method increases because the pooled focal group has a larger size. However, merging all focal groups into a single one distorts the structure of the data set, which can lead to the misidentification of items as DIF or non-DIF. In particular, a DIF effect between one focal group and the reference group could go undetected when all focal groups are merged together, especially if the sample size of that focal group is small relative to the other focal groups.

The purpose of this article is to present an extension of the logistic regression procedure to the case of multiple focal groups. Emphasis is put on the usefulness and simplicity of the method, as well as its flexibility in modelling DIF effects and testing within subgroups of respondents. For each group of examinees a response probability curve is fitted, and DIF can be statistically assessed by an appropriate comparison of group-specific parameters. No merging of the focal groups is required, and both types of DIF effect can be tested. This constitutes an improvement over the generalized Mantel-Haenszel method, and it can be seen as the non-IRT counterpart of the generalized Lord's test with the two-parameter logistic (2PL) model. To keep the DIF terminology consistent, we further refer to this extension as the generalized logistic regression procedure for DIF detection.

The article is organized as follows. In Section 2 we present the generalized logistic regression procedure by introducing the notation of the logistic model and highlighting the relationships between the model parameters and the DIF effects. Two methods for testing the null hypothesis of no DIF are discussed in Section 3: the Wald test and the likelihood ratio test. These are well-established methods (Agresti, 2002), and their application in this DIF framework is displayed in detail. The generalized logistic regression procedure is then illustrated in Section 4 by analyzing a real example about language skill assessment in Quebec colleges. The results from this method are presented and compared to those from the generalized Mantel-Haenszel method and the generalized Lord's test.

GENERALIZED LOGISTIC REGRESSION

The generalized logistic regression is an extension of Swaminathan and Rogers' (1990) approach for the usual case of DIF between two groups. We focus on one item of interest, and all other test items are assumed to be DIF free (this is sometimes referred to as the all-other anchor items setting).

Let πig be the probability that respondent i from group g answers the item correctly, where g = 0 for the reference group, and g = 1, 2, . . . , F for the first, second, . . . , last focal group, respectively. Moreover, let Si be the test score of respondent i, which acts as the matching variable and as a proxy for the respondent's ability. With these notations, the generalized logistic regression DIF model takes the following form:

logit(πig) = log[πig/(1 − πig)] = αg + βg Si,   (1)

where αg and βg are the intercept and the slope parameters of group g (g = 0, 1, . . . , F), respectively.

A completely equivalent form of model (1) is obtained by introducing common, group-independent intercept and slope parameters α and β, so that αg and βg become group-specific deviations from them:

logit(πig) = α + β Si + αg + βg Si,   (2)

where α and β are the common intercept and slope parameters for all groups.

Because model (2) is overparametrized with respect to model (1), group-specific parameters must be constrained to avoid identification issues. Setting all reference group parameters to zero is the most obvious constraint; that is, α0 = β0 = 0.

Hence, model (2) can be rewritten as

logit(πig) = α + β Si                  if g = 0,
logit(πig) = (α + αg) + (β + βg) Si    if g ≠ 0.   (3)

The intercept and the slope parameters are respectively equal to α and β in the reference group and to (α + αg) and (β + βg) in focal group g (g = 1, . . . , F). In the following we make use of the parameterization of model (3).
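As an illustration, model (3) can be fitted by maximum likelihood with any statistics package. The following minimal Python sketch (our own illustrative code, not the difR implementation) builds the design matrix under the constraint α0 = β0 = 0 and runs a Newton-Raphson (Fisher scoring) fit; the simulated data, group sizes, and parameter values are assumptions for demonstration only.

```python
import numpy as np

def design_matrix(S, g, F):
    # Columns ordered as tau = (alpha, alpha_1..alpha_F, beta, beta_1..beta_F);
    # the reference group g = 0 gets no dummy, which encodes alpha_0 = beta_0 = 0.
    n = len(S)
    G = np.column_stack([(g == f).astype(float) for f in range(1, F + 1)])
    return np.column_stack([np.ones(n), G, S, G * S[:, None]])

def fit_logistic(X, y, n_iter=25):
    # Newton-Raphson for logit(pi) = X @ tau.
    tau = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ tau))
        H = X.T @ ((p * (1.0 - p))[:, None] * X)   # Fisher information
        tau += np.linalg.solve(H, X.T @ (y - p))   # score step
    return tau, np.linalg.inv(H)                   # tau-hat and covariance Sigma

# Illustrative data: F = 3 focal groups, uniform DIF in focal group 1 only.
rng = np.random.default_rng(1)
n, F = 2000, 3
g = rng.integers(0, F + 1, size=n)
S = rng.integers(0, 16, size=n).astype(float)
eta = -2.0 + 0.25 * S - 0.8 * (g == 1)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
tau_hat, Sigma = fit_logistic(design_matrix(S, g, F), y)
```

The estimate vector tau_hat has length 2F + 2 and is ordered exactly as the parameterization of model (3), which makes the contrast matrices of the next section straightforward to apply.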

The tested item exhibits DIF if the response probability πig varies across the groups of examinees or, equivalently, if there is some interaction between the item responses and the group membership. According to model (3), this occurs if at least one of the group parameters αg and βg is different from zero. If all group-specific parameters equal zero, no DIF effect is present. Furthermore, the presence of nonuniform DIF is assessed by a significant difference in the slopes of the logistic response curves, that is, when at least one slope parameter βg is different from zero, whatever the values of the intercept parameters. Finally, uniform DIF is present if the item-group interaction does not depend on the matching variable, that is, if at least one intercept parameter αg is different from zero, given that all slope parameters βg are equal to zero. In sum, three types of DIF effects can be tested: uniform DIF (UDIF), nonuniform DIF (NUDIF), and both types of DIF effects together (DIF). Each framework is characterized by the following null hypothesis:

H0 : α1 = · · · = αF = β1 = · · · = βF = 0   (DIF)   (4)

H0 : β1 = · · · = βF = 0   (NUDIF)   (5)

H0 : α1 = · · · = αF = 0 | β1 = · · · = βF = 0   (UDIF)   (6)

The alternative hypotheses are such that at least one of the tested parameters in the null hypothesis is different from zero. The null hypothesis (6) states that the focal group-specific intercept parameters α1, . . . , αF are all equal to zero, given that all group-specific slope parameters β1, . . . , βF are equal to zero. This implies that the hypothesis of absence of uniform DIF effect is tested on the basis of the simpler logistic model

logit(πig) = α + βSi + αg. (7)

The null hypotheses (4) and (5), however, must be assessed with the full model (3) since the group-specific slope parameters βg must also be tested.

As the number of groups increases, small discrepancies between the groups' logistic curves will probably lead to flagging the item as DIF, so it is expected that at least one item will often be identified as DIF with this method. This is an argument against the standard approach of testing the usual null hypotheses of absence of DIF. One solution consists in switching to the Bayesian paradigm, allowing for prior information regarding the potential DIF status of the items and computing posterior probabilities and credible intervals for the model parameters. This approach is nevertheless not pursued in this article, as emphasis is put on a direct extension of the usual logistic regression method based on maximum likelihood estimation and testing of the model parameters, as explained in the next section.


DIF IDENTIFICATION

Statistical assessment of DIF is done as follows. First, set τ as the vector of model parameters:

τ = (α, α1, . . . , αF, β, β1, . . . , βF)T in the DIF and NUDIF frameworks,
τ = (α, α1, . . . , αF, β)T in the UDIF framework.   (8)

The vector τ can be estimated by maximum likelihood (Agresti, 1990). Let τ̂ be the maximum likelihood estimate of τ. There are two reasons for focusing on simple maximum likelihood estimation. First, the estimates τ̂ are unique (Wedderburn, 1976) and asymptotically multivariate normally distributed (Bock, 1975; Rao, 1973) with mean vector τ and covariance matrix Σ, where Σ is the inverse of Fisher's information matrix (Nelder & Wedderburn, 1972). In short, τ̂ ≈ MVN(τ, Σ). Second, the likelihood equations can be solved with an iterative optimization routine, for instance the Newton-Raphson or Fisher scoring method (Agresti, 1996, p. 94).

Under the maximum likelihood framework, the null hypotheses (4) to (6) can be tested by several methods. This article restricts attention to two well-known approaches: the Wald test and the likelihood ratio test.

Wald Test

The null hypotheses (4) to (6) can be written in a common matrix form H0: Cτ = 0, where τ is given by (8), 0 is a vector of zeros, and C is an appropriate contrast matrix. The alternative hypothesis also takes a simple form: HA: Cτ ≠ 0.

To write the matrix C properly in each framework, set 0n×m as the n-by-m matrix of zeros and In as the identity matrix of dimension n. Then, in the DIF framework, C is the (2F)-by-(2F + 2) matrix

C = [ 0F×1   IF     0F×1   0F×F
      0F×1   0F×F   0F×1   IF   ].   (9)

In the NUDIF framework, C is the F-by-(2F + 2) matrix

C = [ 0F×(F+2)   IF ],   (10)

and in the UDIF framework, C is the F-by-(F + 2) matrix

C = [ 0F×1   IF   0F×1 ].   (11)


The forms (9) to (11) of the matrix C are straightforward extensions of the contrast matrices used in the simple situation of a single focal group. For instance, in the DIF framework with one focal group, τ reduces to (α, α1, β, β1)T and the null hypothesis (4) to H0: α1 = β1 = 0. The appropriate contrast matrix C is then

C = [ 0 1 0 0
      0 0 0 1 ]   (12)

(Swaminathan & Rogers, 1990, p. 365), which is the particular case of (9) when F equals one.

The rank of C is equal to p = 2F in the DIF framework and p = F in the UDIF and NUDIF frameworks. Since τ̂ is asymptotically multivariate normally distributed, the p-dimensional vector Cτ̂ is also asymptotically multivariate normally distributed, with mean vector Cτ and covariance matrix CΣCT (Johnson & Wichern, 1998, p. 165). It follows that the one-dimensional variable

(Cτ̂ − Cτ)T (CΣCT)−1 (Cτ̂ − Cτ)   (13)

has an asymptotic chi-squared distribution with p degrees of freedom (Rao, 1973, p. 188). Thus, under the null hypothesis H0: Cτ = 0, the test statistic

Q = (Cτ̂)T (CΣCT)−1 (Cτ̂)   (14)

has an asymptotic chi-squared distribution with p degrees of freedom. When the value of Q exceeds the corresponding cut-score of the chi-squared distribution, the null hypothesis is rejected and the presence of DIF is statistically assessed.

Swaminathan and Rogers (1990) referred to the statistic (14) as χ2 (equation 14, p. 365), but we use Q instead to avoid confusion with the chi-squared distribution. This method is referred to as the Wald test since it is derived from the asymptotic normality of the maximum likelihood estimates of the model parameters (Wald, 1939). For this reason, Q is further referred to as the Wald statistic.
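In code, the Wald statistic (14) requires only the estimates τ̂, their covariance matrix Σ, and the contrast matrix C. The following Python sketch (an illustration with our own function names, not the difR implementation) builds the DIF contrast matrix (9) and computes Q.

```python
import numpy as np

def contrast_matrix_dif(F):
    # Matrix (9): selects alpha_1..alpha_F and beta_1..beta_F from
    # tau = (alpha, alpha_1..alpha_F, beta, beta_1..beta_F).
    C = np.zeros((2 * F, 2 * F + 2))
    C[:F, 1:F + 1] = np.eye(F)      # rows testing the intercepts
    C[F:, F + 2:] = np.eye(F)       # rows testing the slopes
    return C

def wald_statistic(tau_hat, Sigma, C):
    # Q = (C tau)' (C Sigma C')^{-1} (C tau); chi-square with rank(C) df under H0.
    v = C @ tau_hat
    return float(v @ np.linalg.solve(C @ Sigma @ C.T, v))
```

With F = 1, contrast_matrix_dif reproduces the two-row matrix (12) of Swaminathan and Rogers (1990).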

Another asset of the Wald test is that it can be used to test for DIF among a subset of the groups of examinees. This is particularly useful when one wants to determine where the differential functioning comes from. Subsets of groups of examinees can be specified with an appropriate contrast matrix and by using the output of the logistic regression model fitted to all groups of examinees. For instance, under the DIF framework, the 4-by-(2F + 2) contrast matrix

C = [ 02×1   I2     02×(F−2)   02×1   02×2   02×(F−2)
      02×1   02×2   02×(F−2)   02×1   I2     02×(F−2) ]   (15)

can be used to test whether the item functions differently between the reference group and the first two focal groups (assuming that the number of focal groups F is at least equal to three).
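Such subset contrasts are mechanical to construct: for each focal group g under scrutiny, add one row selecting αg and one row selecting βg. A small Python helper (our own illustrative code) generalizes matrix (15) to any subset of focal groups:

```python
import numpy as np

def contrast_matrix_subset(groups, F):
    # One alpha-row and one beta-row per focal group g in `groups`, with
    # tau = (alpha, alpha_1..alpha_F, beta, beta_1..beta_F):
    # alpha_g sits at 0-based index g, beta_g at index F + 1 + g.
    k = len(groups)
    C = np.zeros((2 * k, 2 * F + 2))
    for r, g in enumerate(groups):
        C[r, g] = 1.0
        C[k + r, F + 1 + g] = 1.0
    return C
```

For F = 3, contrast_matrix_subset([1, 2], 3) reproduces the structure of matrix (15), testing the reference group against the first two focal groups.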

Likelihood Ratio Test

The likelihood ratio test compares two nested models: one referring to the null hypothesis and one to the alternative hypothesis. The most suitable model is retained by comparing their maximized likelihood values. This method was introduced by Wilks (1938) in the general context of comparing composite hypotheses (see also Agresti, 1990; McCullagh & Nelder, 1989).

More precisely, let M0 and M1 be the logistic models used to represent the null and the alternative hypotheses, respectively. The model M0, referred to as the null model, is equal to

M0 ≡ logit(πig) = α + β Si + αg   in the NUDIF framework,
M0 ≡ logit(πig) = α + β Si        in the DIF and UDIF frameworks,   (16)

while the model M1, called the alternative model, is given by

M1 ≡ logit(πig) = α + β Si + αg + βg Si   in the DIF and NUDIF frameworks,
M1 ≡ logit(πig) = α + β Si + αg           in the UDIF framework.   (17)

The null hypothesis of absence of the tested DIF effect is retained when the model M0 is preferred to the model M1, so the tested hypotheses can be rewritten in this context as H0: model M0 is preferred to model M1, versus H1: model M1 is preferred to model M0.

Once the maximum likelihood parameter estimates are available, the corresponding maximized likelihoods are computed, say L0 for model M0 and L1 for model M1. Wilks (1938) introduced the lambda statistic Λ:

Λ = −2 log(L0 / L1)   (18)

and showed that, under the null hypothesis, this Λ statistic has an asymptotic chi-squared distribution with as many degrees of freedom as the difference in the number of parameters between the models M0 and M1 (see also Agresti, 1990). The Λ statistic is often called the likelihood ratio statistic. In this framework, large values of Λ indicate the presence of the tested DIF effect.

Because there are F group-specific intercept and F group-specific slope parameters, the degrees of freedom of the asymptotic null distribution of Λ equal 2F in the DIF framework and F in both the UDIF and NUDIF frameworks. Thus, the Wald statistic Q and the likelihood ratio statistic Λ share the same asymptotic distribution. The tests are therefore asymptotically equivalent for detecting DIF effects among the items and would return the same results with sufficiently large sample sizes.
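Concretely, Λ is twice the gap between the two maximized log-likelihoods, Λ = −2 log(L0/L1) = 2(ℓ1 − ℓ0). The sketch below (illustrative Python assuming numpy; the simulated data and effect sizes are our own) fits M0 and M1 of the DIF framework by Newton-Raphson and forms Λ.

```python
import numpy as np

def max_loglik(X, y, n_iter=25):
    # Newton-Raphson ML fit of logit(pi) = X @ tau; returns the maximized log-likelihood.
    tau = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ tau))
        H = X.T @ ((p * (1.0 - p))[:, None] * X)
        tau += np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ tau))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def lrt_dif(S, g, y, F):
    # DIF framework: M0 = common intercept and slope, M1 = model (3); 2F df under H0.
    n = len(S)
    G = np.column_stack([(g == f).astype(float) for f in range(1, F + 1)])
    X0 = np.column_stack([np.ones(n), S])                      # null model M0
    X1 = np.column_stack([np.ones(n), G, S, G * S[:, None]])   # alternative M1
    return 2.0 * (max_loglik(X1, y) - max_loglik(X0, y))

# Illustrative data with a strong uniform DIF effect in focal group 1.
rng = np.random.default_rng(2)
n, F = 4000, 3
g = rng.integers(0, F + 1, size=n)
S = rng.integers(0, 16, size=n).astype(float)
eta = -2.0 + 0.25 * S + 2.0 * (g == 1)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
lam = lrt_dif(S, g, y, F)
```

Since M0 is nested in M1, Λ is nonnegative; here the large simulated effect yields a value far beyond the 5% chi-squared cut-score with 2F = 6 degrees of freedom.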

It is important to notice that this test is not directly related to the so-called likelihood ratio test (LRT) method for DIF identification introduced by Thissen, Steinberg, and Wainer (1988). Although the basic idea is common to both approaches (that is, the statistical comparison of two nested models by means of the likelihood ratio statistic), the latter is used with nested IRT models, while in this context we focus on logistic models. The present LRT method serves as a statistical tool for testing for the presence of DIF with multiple-group logistic models.

Wald or LRT?

Two methods are available to test for the presence of DIF among multiple groups.

Applying both will generally return similar results, especially when the samples are large, because of the asymptotic equivalence of the two statistics (Cox & Hinkley, 1974). However, the likelihood ratio test is more reliable than the Wald test with smaller samples (Agresti, 2002), notably because the Wald test suffers from poor estimation of the standard errors of the parameters with smaller samples, while the likelihood ratio test is unaffected. Note, however, that a discrepancy between the two tests could also be due to a bad fit of the logistic model to the data, for instance as a consequence of improper model selection.

On the other hand, the Wald test is suitable for performing subtesting between some groups of respondents, as explained previously. The likelihood ratio test, however, cannot perform such specific comparisons. One may therefore recommend starting by comparing the results of both statistical tests; in case of acceptable agreement between their results, subsequent comparisons can be performed with the Wald test. A great discrepancy between the outputs of the methods indicates that the samples are not large enough to ensure the asymptotic validity of the conclusions, and great care should be taken with their interpretation.

Practical Implementation

The generalized logistic regression procedure can be implemented with any suitable software that performs logistic regression modelling, such as SAS or STATISTICA. For the purpose of this article, however, the method has been implemented within the R package difR (Magis et al., 2010), together with many other DIF methods for two or more groups. The results of the data set analysis in the next section were obtained from this implementation. The R code can be obtained freely from the first author.

AN EXAMPLE

We illustrate the usefulness and flexibility of the generalized logistic regression procedure by analyzing an example about the assessment of English-as-a-second-language skills. After a short description of the data set, we proceed to a complete DIF analysis, first with this method and then with other multiple-group DIF methods.

Although the data set is restricted to the assessment of Quebec students entering college, this example illustrates how the method can easily be adapted to international testing studies, across times of administration, countries, or both.

The TCALS-II Data Set

The TCALS-II test is administered to Canadian French-speaking students prior to entering college education in the Quebec province, to assess their aptitude in English as a second language. The primary goal of this test is to evaluate the English level of the students in order to assign them to classes of appropriate difficulty level (Laurier et al., 1998; Raîche, 2002). The test consists of 85 multiple-choice items, divided into eight subgroups, and is identical for all French-speaking colleges of the Quebec province (Canada). In this study we focus on the items for which the students have to answer questions related to the reading of short English texts. There are 15 such items (referred to as items 1 to 15), and we consider this subset of items for further analysis.

In order to define the different groups of examinees, we focus on the results of the TCALS-II test from students entering the College of Outaouais (Gatineau, Quebec, Canada). The groups of examinees are defined by the different years in which the test was administered. Four years were selected: 1998, 2000, 2002, and 2004. The year 1998 corresponds to the very first year the test was administered and is selected as the reference year. The other years of administration (2000, 2002, and 2004) are the focal groups. Although the TCALS-II test was administered every year from 1998 onwards, the data from several years of administration are no longer available (in particular, years 1999 and 2001), so we restricted the focal groups to every second year of administration. Moreover, the generalized logistic regression can handle any number of groups of respondents, but for practical illustrative purposes we restricted ourselves to four groups of TCALS-II administration.

The sample sizes range from 1277 to 1547 and are relatively large. The average scores on the 15 items range from 10.14 to 10.65, and the standard deviations of these scores range from 3.36 to 3.68. Moreover, the skewness coefficients of the scores are all negative, ranging from −0.60 to −0.67, which indicates an asymmetric distribution of the scores with a larger proportion of high scores than low scores (Raîche, 2002). This is partly due to the fact that students from Outaouais are more often bilingual than those in the other regions of the Quebec province (mainly French speaking), because of the closeness of the Ontario province (mainly English speaking). Finally, Laurier and colleagues (1998) established that the full TCALS-II questionnaire exhibits a high reliability level, with a Cronbach's α of 0.96, as well as the unidimensionality of the questionnaire (see also Raîche, 2002, for another dimensionality analysis with similar conclusions).

DIF Analysis

Our interest is to discover whether some items perform differently over the successive years of administration. Because the test is identical from year to year, this problem can be investigated by a DIF analysis of these items.

We start with an analysis of both types of DIF effect on the whole set of 15 items, using the Wald test and the likelihood ratio test separately. For the Wald test, the model (3) is fitted and the null hypothesis (4) is tested by means of the contrast matrix

C = [ 0 1 0 0 0 0 0 0
      0 0 1 0 0 0 0 0
      0 0 0 1 0 0 0 0
      0 0 0 0 0 1 0 0
      0 0 0 0 0 0 1 0
      0 0 0 0 0 0 0 1 ],   (19)

that is, the matrix C given by (9) with F = 3. For the likelihood ratio test, both models (16) and (17) are fitted and compared by means of the Λ statistic (18). To reduce the impact of DIF items on the set of anchor (DIF-free) items, the process of item purification described by Candell and Drasgow (1988) is performed within each test. That is, items flagged as DIF are removed from the test score computation and the DIF analysis is re-run; this step is repeated until two successive iterations return the same classification of items as DIF or non-DIF (see also Clauser & Mazor, 1998). The significance level was set to 5% and was not adjusted (by means of a Bonferroni correction) at that step. This avoids missing some items that are potentially functioning differently and is in line with other approaches for extending DIF methods to more than two groups (Kim et al., 1995; Penfield, 2001).
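The purification loop itself is simple to express in code. The following Python sketch is our own schematic rendering of the Candell-Drasgow procedure (not the difR code): it takes any per-item flagging routine with the stated signature and iterates until the classification stabilizes.

```python
import numpy as np

def purify(responses, flag_items, max_iter=10):
    # responses: n-by-J 0/1 matrix; flag_items(responses, scores) must return a
    # boolean DIF flag per item, computed with `scores` as the matching variable.
    n_items = responses.shape[1]
    flags = np.zeros(n_items, dtype=bool)
    for _ in range(max_iter):
        # Anchor score: sum over the items currently classified as DIF free.
        scores = responses[:, ~flags].sum(axis=1)
        new_flags = flag_items(responses, scores)
        if np.array_equal(new_flags, flags):   # two successive iterations agree
            return new_flags
        flags = new_flags
    return flags                               # no convergence within max_iter
```

In the TCALS-II analysis, flag_items would be the Wald or likelihood ratio test at the 5% level; here any rule with the same signature can be plugged in.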

The Wald test and the likelihood ratio test required respectively five and four iterations of the item purification process to reach convergence of the results.

Table 1 summarizes the test statistics and related p-values for the 15 items.


TABLE 1

Wald and Likelihood Ratio DIF Statistics, TCALS-II Data Set

Wald Test Likelihood Ratio Test

Item Statistic p-value Statistic p-value

1 18.054 0.006 18.594 0.005

2 9.627 0.141 8.147 0.228

3 7.612 0.268 6.757 0.344

4 10.765 0.096 10.740 0.097

5 2.808 0.832 2.526 0.866

6 11.861 0.065 12.555 0.051

7 3.678 0.720 3.690 0.719

8 35.126 <0.001 36.518 <0.001

9 1.343 0.969 2.231 0.897

10 26.345 <0.001 29.500 <0.001

11 21.654 0.001 22.495 0.001

12 12.443 0.053 12.656 0.049

13 10.049 0.123 9.694 0.138

14 3.043 0.803 2.800 0.833

15 2.830 0.830 3.251 0.777

Note: An item is flagged as DIF at the 0.05 significance level.

The p-values are computed on the basis of the chi-squared distribution with six degrees of freedom, the rank of the matrix (19).

Four items are flagged as DIF with very high confidence: items 1, 8, 10, and 11. Items 6 and 12 are borderline. The remaining nine items are not detected as functioning differently. Both tests provide the same classification of the items as DIF or non-DIF, except for item 12, for which the p-values are equal to 0.053 (Wald test) and 0.049 (likelihood ratio test). Figure 1 displays the fitted response probabilities for the four items that exhibit a highly significant DIF effect.

These curves are also displayed on a logit scale in Figure 2, to better distinguish the groups. The fitted curves are based on the results of the Wald test. The likelihood ratio test provides very similar curves, which are therefore not displayed here.

For items 1, 8, and 11, the DIF effect mainly occurs between the reference group (year 1998) on the one hand and the three focal groups on the other hand. The focal groups have very close probability curves, while the reference group has larger probabilities over almost the whole score range. For item 10, the focal group 2000 has slightly larger probabilities and the focal group 2002 slightly lower probabilities overall. The reference group and the focal group 2004 have similar probability curves. In addition, the response probability curves are roughly parallel for items 1 and 11, which is an indicator of the presence of uniform DIF only. By contrast,


FIGURE 1

Item probability curves for items 1 (top left), 8 (top right), 10 (bottom left), and 11 (bottom right), based on the results of the Wald test.

one can observe different slopes in the probability curves for items 8 and 10, so that nonuniform DIF is most probably present for those items. These findings are identical whenever either the Wald test or the likelihood ratio test is considered.

Table 2 provides the model parameter estimates and the standard errors for the four items flagged as DIF. The parameter estimates allow us to describe the graphical display of the response probabilities in Figures 1 and 2. For items 1 and 11, the slope parameters β00, β02, and β04 for the focal groups 2000, 2002, and 2004, respectively, are close to zero, while the intercept parameters α00, α02, and α04 take negative and very close values. For item 8, the intercepts are all close and negative but the slopes are positive, so the three probability curves are steeper and right-shifted with respect to the reference group curve. Finally, the intercept


FIGURE 2

Logits of item probability curves for items 1 (top left), 8 (top right), 10 (bottom left), and 11 (bottom right), based on the results of the Wald test.

and slope parameters for item 10 are either positive or negative, with the intercept and slope of opposite signs within each focal group. Consequently, and as noted from Figure 1, nonuniform DIF is most probably present for these two items. Note, however, that some of these parameters are not statistically significant when taken individually. The previous discussion is therefore purely descriptive and in line with Figures 1 and 2, as expected. A more formal investigation is provided below.

The slight differences in model parameter estimates between the two tests are due to different final results after item purification, because item 12 is classified differently as DIF or non-DIF by the two tests. Recall, however, that item 12 returns very borderline p-values, which explains this apparent contradiction.
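As a rough illustration of how such group-specific intercepts and slopes can be estimated, the sketch below fits a multiple-group logistic regression model (intercept, score, focal-group dummies, and score-by-group interactions) by Newton-Raphson on simulated data. All names, the simulation design, and the injected DIF effect are illustrative assumptions, not the authors' code or data.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Fit a logistic regression by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)       # guard against overflow
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p) + 1e-10              # IRLS weights
        H = X.T @ (X * W[:, None])             # observed information X'WX
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n, G = 8000, 4                                  # reference group + 3 focal groups
score = rng.normal(25.0, 5.0, n)                # matching variable (test score)
group = rng.integers(0, G, n)                   # 0 = reference group

# Design: intercept, score, focal-group dummies, score-by-dummy interactions
D = np.column_stack([(group == g).astype(float) for g in range(1, G)])
X = np.column_stack([np.ones(n), score, D, D * score[:, None]])

# Simulate a uniform DIF effect on the first focal group (intercept shift -0.5)
true = np.array([-2.0, 0.10, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true))).astype(float)

est = fit_logistic(X, y)                        # recovers `true` approximately
```

With large groups the estimates land close to the generating values, which mirrors the remark later in the article that larger group sizes improve the estimation of the model parameters.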


TABLE 2

Group-Specific Parameter Estimates and Standard Errors (in Parentheses) for the Four Items with a Significant DIF Effect, for Both Tests. Subscripts 00, 02, and 04 Refer Respectively to the Years 2000, 2002, and 2004 of the Focal Groups

Test  Item  α00           α02           α04           β00           β02           β04
Wald  1     −0.51 (0.35)  −0.59 (0.37)  −0.47 (0.37)   0.02 (0.05)   0.01 (0.05)   0.02 (0.05)
      8     −1.20 (0.31)  −1.25 (0.34)  −1.16 (0.33)   0.14 (0.04)   0.13 (0.04)   0.09 (0.04)
      10     1.75 (0.69)   0.47 (0.76)  −1.37 (0.85)  −0.16 (0.07)  −0.05 (0.08)   0.12 (0.09)
      11    −0.31 (0.40)  −0.69 (0.43)  −0.81 (0.44)  −0.01 (0.05)   0.04 (0.05)   0.04 (0.05)
LRT   1     −0.49 (0.34)  −0.72 (0.37)  −0.59 (0.37)   0.03 (0.05)   0.03 (0.06)   0.03 (0.06)
      8     −1.19 (0.32)  −1.31 (0.34)  −1.19 (0.33)   0.15 (0.05)   0.15 (0.05)   0.10 (0.05)
      10     1.74 (0.66)   0.26 (0.73)  −1.51 (0.83)  −0.18 (0.07)  −0.03 (0.08)   0.15 (0.09)
      11    −0.22 (0.40)  −0.64 (0.43)  −0.91 (0.45)  −0.02 (0.05)   0.04 (0.06)   0.06 (0.06)

Subset Comparisons

In order to illustrate how the Wald test can perform subset comparisons, all possible triplets of groups of examinees were compared for the four items with significant DIF effect. For instance, the comparison between the reference group and the focal groups 2000 and 2004 was achieved with the contrast matrix

$$
C = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}. \quad (20)
$$
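Given an estimated parameter vector with covariance matrix V, the Wald statistic for a set of linear constraints C θ = 0 is W = (Cθ̂)ᵀ(C V Cᵀ)⁻¹(Cθ̂), referred to a χ² distribution with as many degrees of freedom as rows of C. A minimal sketch follows; the parameter estimates and (diagonal) covariance matrix are purely hypothetical placeholders for one item, not values from the article.

```python
import numpy as np
from scipy.stats import chi2

def wald_contrast(theta, V, C):
    """Wald statistic and p-value for H0: C @ theta = 0."""
    d = C @ theta
    W = float(d @ np.linalg.solve(C @ V @ C.T, d))
    return W, chi2.sf(W, df=C.shape[0])

# Hypothetical 8-parameter estimate and diagonal covariance for one item
theta = np.array([-2.0, 0.1, -0.5, -0.6, -0.4, 0.02, 0.01, 0.03])
V = np.diag([0.1, 0.001, 0.12, 0.13, 0.13, 0.002, 0.002, 0.002])

# A 4 x 8 contrast picking out the parameters of two focal groups
C = np.zeros((4, 8))
C[0, 1] = C[1, 2] = C[2, 5] = C[3, 6] = 1.0

W, p = wald_contrast(theta, V, C)
```

Any subset comparison reduces to choosing which rows to place in C, which is why the same routine serves for full-group and triplet tests.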

As four triplets of groups of examinees are considered in a multiple comparison scheme, the significance level was adjusted. Three methods were considered: the usual Bonferroni adjustment, the Šidák correction (Šidák, 1967), and the Holm-Bonferroni method (Holm, 1979; see also Abdi, 2007, and Shaffer, 1995). With four comparisons, the adjusted Bonferroni and Šidák significance levels equal 0.0125 and 0.0127, respectively. With the Holm-Bonferroni method, the four p-values are sorted in increasing order and compared respectively to the increasing significance levels 0.0125, 0.0167, 0.0250, and 0.050.
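The three adjustments can be sketched in a few lines. The Holm routine below implements the standard step-down rule (compare the k-th smallest p-value to α/(m − k + 1) and stop at the first non-rejection); the p-values fed to it are the four triplet tests reported for item 10 in Table 3.

```python
alpha, m = 0.05, 4

bonferroni = alpha / m                        # 0.0125
sidak = 1.0 - (1.0 - alpha) ** (1.0 / m)      # ~0.0127

def holm_reject(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: compare sorted p-values to alpha/(m - k)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - k):
            reject[i] = True
        else:
            break                             # stop at the first non-rejection
    return reject

# p-values of the four triplet tests for item 10 (Table 3)
flags = holm_reject([0.022, 0.001, 0.148, 0.001])
```

The first triplet (p = 0.022) is rejected by Holm but not by Bonferroni or Šidák, which reproduces the behavior discussed for item 10.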

Table 3 contains the Wald statistics and the associated p-values for each item and each triplet of groups of examinees. First, the adjusted Bonferroni and Šidák significance levels are so close that they lead to the same conclusions, so they are not distinguished further. Moreover, the Holm-Bonferroni method returns the same set of significant p-values in every case, except for item 10, for which the triplet of years 1998, 2000, and 2002 is significant only under this method.


TABLE 3

Subtests of DIF Among All Possible Triples of Groups of Examinees, for the Four Items with Significant DIF Effect

Item     Groups       Statistic   p-value
Item 1   (98-00-02)    17.770      0.001*
         (98-00-04)    10.466      0.033
         (98-02-04)    17.549      0.002*
         (00-02-04)     2.746      0.601
Item 8   (98-00-02)    21.497     <0.001*
         (98-00-04)    32.284     <0.001*
         (98-02-04)    30.938     <0.001*
         (00-02-04)     9.102      0.059
Item 10  (98-00-02)    11.444      0.022**
         (98-00-04)    26.081     <0.001*
         (98-02-04)     6.775      0.148
         (00-02-04)    24.569     <0.001*
Item 11  (98-00-02)    15.833      0.003*
         (98-00-04)    20.228     <0.001*
         (98-02-04)    19.622      0.001*
         (00-02-04)     2.390      0.664

*Significant at α level 0.0125 (Bonferroni) and 0.0127 (Šidák). **Significant with the Holm-Bonferroni method only.

For each item, at least one triplet of groups yields a nonsignificant p-value, which indicates that the significant DIF effect mostly stems from the fourth group, the one not included in the nonsignificant triplet. For items 1, 8, and 11, the nonsignificant triplet consists of the three focal groups. That is, the presence of the reference group in any triplet leads to a significant difference between the probability curves. For item 8, however, the p-value is somewhat borderline (p = 0.059), and for item 1, another triplet (without the focal group 2002) is also nonsignificant, but its p-value (0.033) is also borderline. For item 10, the nonsignificant triplet consists of the reference group and the focal groups 2002 and 2004. That is, the focal group 2000 behaves differently from the three other groups. Finally, all other triplets provide highly significant results.

These results are in agreement with Figures 1 and 2. It was noticed that for items 1, 8, and 11, the reference group tends to have higher probabilities of answering the item correctly than the other groups, while the three focal groups have quite close probability curves. For item 8, however, the discrepancy between the probability curves of the focal groups is more visible, which explains the borderline p-value of the corresponding Wald test. Finally, the focal group 2000 has larger probabilities for item 10, while the three other groups are much closer in terms of probability curves.

Uniform and Nonuniform DIF Effects

So far, both DIF effects have been tested simultaneously, in an omnibus approach. For the four items with an established significant DIF effect, a separate analysis of each effect is now conducted: nonuniform DIF is tested first, and uniform DIF is investigated in a second step for those items without nonuniform DIF.
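This two-step scheme can be expressed as likelihood-ratio tests between nested logistic models: dropping the three group-by-score interaction terms from the full model tests nonuniform DIF, and dropping the three group intercepts from the no-interaction model tests uniform DIF. The maximized log-likelihood values below are hypothetical placeholders, not values from the article.

```python
from scipy.stats import chi2

def lr_test(ll_reduced, ll_full, df):
    """Likelihood-ratio statistic and p-value for nested models."""
    G2 = 2.0 * (ll_full - ll_reduced)
    return G2, chi2.sf(G2, df)

# Hypothetical maximized log-likelihoods for one item with four groups
ll_full    = -612.4   # group intercepts + group slopes (both DIF effects)
ll_uniform = -613.2   # group intercepts only (no interactions)
ll_null    = -622.9   # no group terms at all

# Step 1: nonuniform DIF -- test the three interaction parameters
G2_nu, p_nu = lr_test(ll_uniform, ll_full, df=3)

# Step 2: if step 1 is nonsignificant, test the three group intercepts
if p_nu > 0.05:
    G2_u, p_u = lr_test(ll_null, ll_uniform, df=3)
```

With these placeholder values the interaction terms are not significant while the group intercepts are, i.e., the uniform-DIF-only pattern reported for items 1 and 11.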


TABLE 4

Results of the Tests of Nonuniform and Uniform DIF for the Four Items with Significant DIF Effect

                 Wald Test              Likelihood Ratio Test
Effect   Item    Statistic   p-value    Statistic   p-value
NUDIF    1         0.537      0.911       0.531      0.912
         8*        8.736      0.033       8.658      0.034
         10*      23.673     <0.001      24.590     <0.001
         11        2.129      0.546       2.121      0.548
UDIF     1*       17.628      0.001      18.076     <0.001
         11*      19.810     <0.001      20.164     <0.001

*Item is flagged as DIF at significance level 0.05.

The results of this two-step analysis, both with the Wald test and the likelihood ratio test, are summarized in Table 4. Both tests provide very close results, so they will not be distinguished further.

It appears from the top part of Table 4 that items 8 and 10 exhibit nonuniform DIF, with significant p-values for the former and highly significant p-values for the latter. Items 1 and 11, on the other hand, are not affected by nonuniform DIF and are therefore retained for an investigation of uniform DIF. From the bottom part of Table 4, one concludes that these two items have a highly significant uniform DIF effect. Thus, as expected, each item has a significant DIF effect, either uniform or nonuniform. This is consistent with Figure 1, where the probability curves are rather parallel for items 1 and 11, while there is some variability in the slopes for items 8 and 10.

Other Methods

Finally, an investigation of DIF was performed by using the two usual methods for DIF identification in multiple groups: the generalized Mantel-Haenszel method (further referred to as the GMH method) and the generalized Lord’s χ2 test.

Lord’s test was performed using both the 1PL model and the 2PL model. Item purification was performed with each method. Table 5 lists the DIF statistics and related p-values.

Both the GMH method and Lord’s test with the 1PL model focus on uniform DIF detection. It appears from Table 5 that they identify items 1, 8, 10, and 11 with a significant uniform DIF effect. It is expected that these methods return very close results as they are designed to detect uniform DIF, assuming that nonuniform DIF is absent (Holland and Thayer, 1988, highlighted this particular relationship).


TABLE 5

Identification of DIF Items Using the Generalized Mantel-Haenszel (GMH) Method and the Generalized Lord’s Chi-square Test Under the 1PL and the 2PL Models

        GMH                   Lord's Test (1PL)      Lord's Test (2PL)
Item    Statistic  p-value    Statistic  p-value     Statistic  p-value
1       17.233*    0.001      14.402*    0.002        3.528     0.740
2        3.281     0.350       3.480     0.323       26.247*   <0.001
3        6.810     0.078       5.133     0.162       16.526*    0.011
4        5.032     0.169       4.527     0.210       12.403     0.054
5        2.403     0.493       1.503     0.682        2.779     0.836
6        4.436     0.218       3.527     0.317        9.929     0.128
7        0.797     0.850       2.010     0.570        1.579     0.954
8       20.865*   <0.001      19.616*   <0.001       24.379*   <0.001
9        1.010     0.799       0.977     0.807        1.531     0.957
10      13.326*    0.004      10.809*    0.013       17.913*    0.006
11      18.280*   <0.001      14.510*    0.002        6.410     0.379
12       6.429     0.093       3.887     0.274        3.184     0.785
13       7.375     0.061       7.153     0.067        6.834     0.336
14       0.279     0.964       0.269     0.966        3.367     0.762
15       1.251     0.741       0.237     0.971        4.999     0.544

*Item is flagged as DIF at significance level 0.05.

In addition, items flagged as DIF are identical to those obtained from the logistic regression procedure.

With Lord's test and the 2PL model, however, items 2, 3, 8, and 10 are flagged as DIF. Items 8 and 10 are still detected, but items 2 and 3 are now flagged as having a significant DIF effect, while items 1 and 11 are no longer identified as such. These differences in conclusions can be explained as follows. First, items 2 and 3 exhibit statistically significant differences in item discriminations under the 2PL model, which is impossible to detect with the 1PL model (or the GMH method). Note that these differences in item discrimination were also detected by the generalized logistic regression method, although they were not judged statistically significant. Second, for items 1 and 11 the DIF effect is mostly uniform. When switching from the 1PL to the 2PL model, the item discriminations become much larger, although there is no statistical difference between the groups of respondents, and simultaneously the item difficulties get closer, so that no DIF effect is detected.

In sum, items 8 and 10 tend to exhibit nonuniform DIF, while items 1 and 11 are more affected by uniform DIF. Items 2 and 3 were not identified by the logistic regression procedure, although the generalized Lord's test with the 2PL model indicates the possible presence of a nonuniform DIF effect.


DISCUSSION

This article focused on the identification of DIF in the presence of more than two groups of respondents. The logistic regression procedure finds a natural extension in this multiple-group framework. Simultaneous statistical inference permits identification of the items that exhibit a significant DIF effect; either uniform, nonuniform, or both effects can be tested. In addition, two statistical approaches for testing DIF, the Wald test and the likelihood ratio test, are available. The Wald test is also appropriate for subset comparisons between groups of examinees, by means of appropriate contrast matrices. Finally, the practical working of the method was illustrated by the analysis of a real example on language skill assessment.

It was highlighted that several conclusions can be drawn from the fitted logistic curves in each group, which reinforces the usefulness of the generalized logistic regression procedure in this context. In addition, the conclusions were in line with those of other methods for multiple-group DIF detection, which further supports the present approach.

The generalized logistic regression procedure was presented under the usual approach of several focal groups that are compared to a single reference group. To this end, the identification constraints on the logistic regression model parameters were naturally defined with respect to this reference group. One main asset of the method, however, is that it also applies when no clear reference group can be set up. For example, in international studies such as PISA or TIMSS, it is not straightforward to designate a reference country for DIF investigations. The present approach allows all countries to be compared jointly, through specific comparisons of model parameters; the way these parameters are constrained does not influence the conclusions themselves but only the parameterization of the logistic models. In addition, the larger the group sizes, the better the estimation of the model parameters, and hence the better the statistical inference for DIF identification.

This method is therefore naturally suitable for large-scale assessment studies or international surveys.

The main goal of this article was to highlight the fact that the logistic regression procedure can be easily extended to multiple-group DIF testing. Although sometimes suggested, this has apparently never been practically achieved. However, this is only the first step in the process of proposing a novel methodology for multiple-group DIF. Indeed, the practical efficiency of this method should be carefully checked, notably with respect to the control of the Type I error and by evaluating the empirical power of detecting DIF items. In addition, Monte Carlo comparisons with other multiple-group DIF methods should be performed. The practical example pointed out some similarities between the generalized versions of the Mantel-Haenszel and logistic regression methods, but also some discrepancies with the generalized Lord's test under the 2PL model. It therefore makes sense to investigate further along these lines. For example, from previous studies in the case of a


single focal group (e.g., Rogers & Swaminathan, 1993), one might expect that the generalized logistic regression and the generalized Mantel-Haenszel methods will perform similarly in the presence of uniform DIF, while the former will be more adequate for nonuniform DIF identification.

Finally, identifying the items that may exhibit DIF between several groups is an important step, but providing some measure of effect size to evaluate the magnitude of DIF is another important task. In the usual framework of a single focal group, Jodoin and Gierl (2001) and Zumbo and Thomas (1997) proposed to consider the difference ΔR² between Nagelkerke's R² coefficients (Nagelkerke, 1991) of the models M0 and M1. This measure of effect size could be directly extended to the multiple-group framework. Its relative usefulness, however, should be assessed, since it is known (Hidalgo & Lopez-Pina, 2004) that Nagelkerke's R² coefficient tends to underestimate the true DIF effect size. Other approaches might involve the computation of the area between the logistic curves, although this is not easy to apply with more than two groups of respondents.
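Nagelkerke's R² rescales the Cox-Snell R² so that its maximum is 1, and the ΔR² effect size is the difference between the R² values of the two compared models. A minimal sketch, with hypothetical log-likelihoods (the M0/M1 labels simply follow the text, with M0 the model including the group terms):

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R^2: Cox-Snell R^2 rescaled to a maximum of 1."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - math.exp(2.0 * ll_null / n))

n = 1000                     # hypothetical number of examinees
ll_intercept = -650.0        # intercept-only model (baseline for R^2)
ll_M1 = -610.0               # matching score only
ll_M0 = -595.0               # matching score + group-specific terms

delta_r2 = nagelkerke_r2(ll_intercept, ll_M0, n) - nagelkerke_r2(ll_intercept, ll_M1, n)
```

The same computation carries over unchanged to the multiple-group case, since only the group terms of M0 change.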

In our opinion, selecting an appropriate measure of DIF effect size is a central issue and is worth studying carefully. This could actually counterbalance the issue mentioned at the end of Section 2: items could be flagged as DIF more often as more groups are compared simultaneously. Flagging more items is a drawback, but subsequently computing effect size measures could correct for the potential increase of the Type I error (that is, flagging as DIF some items that are not functioning differently). In addition, as suggested earlier, this method would certainly benefit from a Bayesian approach, by computing posterior DIF probabilities and related effect size estimates. One avenue of potential interest is the evaluation of informative hypotheses using either hypothesis testing (Hoijtink, 1998) or the Bayes factor (Hoijtink, Klugkist, & Boelen, 2008). Although the Bayesian paradigm was recently introduced in DIF research (Bolt & Cohen, 2005; Frederickx, Tuerlinckx, De Boeck, & Magis, 2010), it has apparently not yet been applied in this multiple-group context and would be of interest for the future.

REFERENCES

Abdi, H. (2007). Bonferroni and Šidák corrections for multiple comparisons. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics. Thousand Oaks, CA: Sage.

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.

Agresti, A. (1990). Categorical data analysis. New York: Wiley.

Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.

Angoff, W. H., & Sharon, A. T. (1974). The evaluation of differences in test performance of two or more groups. Educational and Psychological Measurement, 34, 807–816.

Bock, R. D. (1975). Multivariate statistical methods. New York: McGraw-Hill.


Bolt, D. M., & Cohen, A. S. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133–148.

Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differential item functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Ellis, B. B., & Kimmel, H. D. (1992). Identification of unique cultural response patterns by means of item response theory. Journal of Applied Psychology, 77, 177–184.

Fidalgo, A. M., & Madeira, J. M. (2008). Generalized Mantel-Haenszel methods for differential item functioning detection. Educational and Psychological Measurement, 68, 940–958.

Fidalgo, A. M., & Scalon, J. D. (2010). Using generalized Mantel-Haenszel statistics to assess DIF among multiple groups. Journal of Psychoeducational Assessment, 28, 60–69.

Frederickx, S., Tuerlinckx, F., De Boeck, P., & Magis, D. (2010). RIM: A random item mixture model to detect differential item functioning. Journal of Educational Measurement, 47, 432–457.

Hanson, B. A. (1998). Uniform DIF and DIF defined by differences in item response functions. Journal of Educational and Behavioral Statistics, 23, 244–253.

Hidalgo, M. D., & Lopez-Pina, J. A. (2004). Differential item functioning detection and effect size: a comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64, 903–915.

Hoijtink, H. (1998). Constrained latent class analysis using the Gibbs sampler and posterior predictive p-values: Applications to educational testing. Statistica Sinica, 8, 691–712.

Hoijtink, H., Klugkist, I., & Boelen, P. A. (2008). Bayesian evaluation of informative hypotheses. New York: Springer.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Holm, S. (1979). A simple sequentially rejective multiple testing procedure. Scandinavian Journal of Statistics, 6, 65–70.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.

Johnson, R. A., & Wichern, D. W. (1998). Applied multivariate statistical analysis (4th ed.). Upper Saddle River, NJ: Prentice-Hall.

Kanjee, A. (2007). Using logistic regression to detect bias when multiple groups are tested. South African Journal of Psychology, 37, 47–61.

Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261–276.

Laurier, M. D., Froio, L., Paero, C., & Fournier, M. (1998). L'élaboration d'un test provincial pour le classement des étudiants en anglais langue seconde au collégial [The elaboration of a provincial test to classify students in English, as a second language, in colleges]. Québec, QC: Direction générale de l'enseignement collégial, ministère de l'Éducation du Québec.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Magis, D., B´eland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.

McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.

Millsap, R. E., & Everson, H. T. (1993). Methodology review: statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.


Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.

Nelder, J., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society (Series A), 135, 370–384.

Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (2nd ed.). Thousand Oaks, CA: Sage.

Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: a comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235–259.

Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics 26: Psychometrics (pp. 125–167). Amsterdam, The Netherlands: Elsevier.

Penfield, R. D., & Lam, T. C. M. (2001). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5–15.

Raîche, G. (2002). Le dépistage du sous-classement aux tests de classement en anglais, langue seconde, au collégial [The detection of under-classification at English, as a second language, test in college]. Gatineau, QC: Collège de l'Outaouais.

Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.

Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116.

Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67–81.

Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561–584.

Shealy, R.T., & Stout, W. (1993). A model based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.

Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626–633.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147–170). Hillsdale, NJ: Lawrence Erlbaum Associates.

Van den Noortgate, W., & De Boeck, P. (2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30, 443–464.

Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics, 10, 299–326.

Wedderburn, R. W. M. (1976). On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63, 27–32.

Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9, 60–62.

Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Prince George, BC, Canada: University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.

Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55–66.

