DIF and PISA


S. Sjoerd F. Glaser

University of Amsterdam

Abstract

PISA is a worldwide study that ranks the scholastic performance of 15-year-old students. A key assumption of the models used is that item parameters are equal for all countries. The opposite of this claim, that some item parameters differ between countries, is known as differential item functioning, or DIF. Most current methods see DIF as a nuisance, and it is either removed or corrected for. In this thesis, we oppose this view and argue that DIF provides useful information relevant to policy makers and is both too complex and too abundant to remove. We will attempt to interpret DIF and discuss a new method to compare countries while fully allowing for DIF. We will conclude that the new method is appropriate for comparing countries and that DIF can be interpreted substantively, but that doing so requires more knowledge of the items and the test administration.

Keywords: PISA, DIF, IRT, Identification, Construct related, Comparing manifest variables

Introduction

PISA is a worldwide study by the OECD of 15-year-old school pupils' scholastic performance in mathematics, science, and reading, conducted in member and non-member nations. It was first performed in 2000 and has been repeated every three years since. PISA is not the only study of this kind but stands in a tradition of international school studies, undertaken since the late 1950s by the International Association for the Evaluation of Educational Achievement (IEA). Much of PISA's methodology follows the example of the Trends in International Mathematics and Science Study (TIMSS), which started in 1995.

The results of PISA are taken seriously by both policy makers and the OECD itself (OECD, 2012b). The effects on policy differ greatly between countries. An extreme example is the disappointing results of German students in 2000, which led to an overhaul of the German education system, while the good results of Finland caused major international interest in its system (Grek, 2009). Whether international educational studies actually contribute to the improvement of education depends on the quality of the information they provide to policy makers and hence on the methodology that is used.

S. Sjoerd F. Glaser, Department of Psychological Methods, University of Amsterdam.

Correspondence concerning this master thesis should be addressed to S. Sjoerd F. Glaser, University of Amsterdam, Department of Psychological Methods, Weesperplein 4, 1018 XA Amsterdam, The Netherlands. E-mail: sjoerd.glaser@hotmail.com

Children participating in PISA take several tests designed to measure specific abilities and measurements of these abilities are obtained from the item responses by means of an Item Response Theory (IRT) model. Comparing measurements across nations is not straightforward because children from different nations speak different languages, differ in the education they receive, and in cultural background. When defined in terms of an IRT model, measurements are comparable when the same model fits the data from each nation (e.g., Bechger, Wittenboer, Hox, & de Glopper, 1999). When this is not the case, we say that there is differential item functioning or DIF (e.g., Holland & Wainer, 1983). DIF means that items show different characteristics when administered to subjects from different groups. Because DIF hinders the estimation and comparison of abilities, test designers and analysts try to eliminate DIF from surveys.

The goal of this thesis is to investigate the following twofold claim in the context of PISA: 1) DIF has been eliminated from PISA and 2) elimination is the proper method of handling DIF.

To investigate these claims, we will first explain measurement, IRT, and DIF in a general framework. This sets the stage for a more specific explanation of the PISA analyses and of how PISA handles DIF in the second chapter. In the third chapter, a general introduction to DIF is given, together with two new general views on DIF. In the fourth chapter, these views are applied to theoretically evaluate the current methods of comparing countries with DIF, of which elimination is one. A data-driven evaluation of elimination is given in the fifth chapter, together with a recommendation of which method to use. Finally, we will try to interpret DIF in the sixth chapter. We will end with a discussion.

Preliminaries

To analyze the results, PISA uses an extension of the Rasch model. In this section, we will first give a general introduction to the Rasch model. Afterwards, we will discuss several extensions that are used within PISA and in our own analyses.

The Rasch model

A cornerstone IRT model for dichotomous items is the Rasch model, proposed by Rasch (1960). In his derivation, Rasch started from a few basic assumptions. First, the probability of answering an item correctly depends only on the ability of a student, ξ_j > 0, and the difficulty of an item, δ_i > 0. As both consist of a single number, the Rasch model employs a unidimensionality assumption. According to Rasch, ability is always greater than 0, because no student can know less than nothing.

Now, suppose:

P(X_{ij} = 1) = P(X_{hk} = 1). \qquad (1)

That is, the probability of person j answering item i correctly is the same as the probability of person k answering item h correctly. Furthermore, suppose this is not because the items and students are all equally able or difficult, but because student j is twice as able as student k, ξ_j = 2ξ_k, while item i is twice as difficult as item h, δ_i = 2δ_h. Therefore, Equation (1) implies:

\frac{\xi_j}{\delta_i} = \frac{\xi_k}{\delta_h} \iff P(X_{ij} = 1) = P(X_{hk} = 1). \qquad (2)

The next step is to derive a model to link the ratio on the left to the probability on the right. Because Rasch defined ability to be higher than 0, the ratio ranges from 0 to ∞. The probability on the right side of the equation obviously ranges from 0 to 1. As a final restriction, he assumed that the probability strictly increases as the ratio increases, i.e. monotonicity. Taking these restrictions into account, Rasch derived the following model:

P(X_{ij} = 1) = \frac{\xi_j/\delta_i}{1 + \xi_j/\delta_i} = \frac{\xi_j}{\xi_j + \delta_i}. \qquad (3)

Note that in this model, ξ_j = δ_i implies P_{ij} = 0.5. If we rewrite the parameters as ξ_j = e^{θ_j} and δ_i = e^{β_i}, the result is the most common parametrization:

P(X_{ij} = 1 \mid \theta_j, \beta_i) = \frac{e^{\theta_j - \beta_i}}{1 + e^{\theta_j - \beta_i}}, \qquad (4)

where θ_j is the latent trait or ability of person j and β_i is the difficulty of item i. Note that in this parametrization, ability and difficulty range from −∞ to ∞. The difficulty parameter can, just as δ_i in Equation (3), be interpreted as the ability required to have a 50% chance of answering correctly. See Figure 1 for the curves of a few different items. Note that, because the probability of a response depends only on the fixed ability and item difficulty, every response is an independent Bernoulli trial with the probability of a correct answer specified by Equation (4). In other words, the responses are conditionally independent:

X_{ij} \perp X_{hj} \mid \theta_j \quad \forall\, i, h.
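As a concrete illustration of Equation (4) and of this conditional independence, the following sketch in R (the language used for our own analyses later on) computes Rasch probabilities for a handful of made-up abilities and difficulties and draws the responses as independent Bernoulli trials; the function name p_rasch and all parameter values are ours, chosen only for illustration.

    # Rasch probability of a correct response, Equation (4)
    p_rasch <- function(theta, beta) {
      exp(theta - beta) / (1 + exp(theta - beta))
    }

    set.seed(1)
    theta <- c(-1, 0, 2)   # hypothetical abilities of three students
    beta  <- c(-2, 0, 2)   # hypothetical difficulties of three items

    # matrix of correct-response probabilities (students in rows, items in columns)
    P <- outer(theta, beta, p_rasch)

    # given theta and beta, each response is an independent Bernoulli draw
    X <- matrix(rbinom(length(P), size = 1, prob = P), nrow = nrow(P))
    P
    X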

The Rasch model is a member of the exponential family of distributions. A convenient property of an exponential family model is the existence of sufficient statistics. To estimate the ability of a student who takes n items, the sum over all items, \sum_{i=1}^{n} x_{ij} = x_{+j}, is sufficient. Similarly, to estimate the difficulty of an item, only the sum over the N students taking the test, \sum_{j=1}^{N} x_{ij} = x_{i+}, is sufficient (Andersen, 1970).

Furthermore, the Rasch model provides a method to compare the ability and difficulty parameters on the same latent scale. The probability of a correct answer depends only on the difference between the two parameters, and for this difference to make sense, they have to be measured on the same scale. For example, see the latent scale in Figure 2. All students have a higher ability than β_1, so they all have a higher than 50% chance of answering item 1 correctly. The bigger the difference between ability and difficulty, the higher the chance. For item 2, students 1 and 2 have a lower than 50% chance of answering correctly, although for student 2 only minimally. Finally, note that student 3 will probably answer both items correctly. In that case, it is impossible to estimate the ability with maximum likelihood, as the estimate diverges to infinity.

Besides the two assumptions mentioned above, the Rasch model imposes the assumption of equal discrimination of all items. This implies that all items are related equally strongly to the latent trait, which is visible in Figure 1: all curves have the same steepness.

Figure 1. Chance of answering correctly depending on latent trait (θ) and item difficulty (β) according to the Rasch model, for β = −2, 0, and 2.

Figure 2. Latent scale with two items (β_1, β_2) and three students (θ_1, θ_2, θ_3) located on the line.

Extensions of the Rasch model: Partial credit model

In the original Rasch model, all items in the test have two scoring categories, 0 and 1. However, many items in real tests have three (or more) scoring categories: 0, 1, and 2. Masters (1982) extended the Rasch model so that it can handle items with multiple scoring categories. Within this model, the categories are successive, implying that to obtain a score of l, a score of l − 1 must be obtained first. For example, consider the question √(125/5). The first step is to simplify the fraction, which yields √25 and results in a score of 1. The second step is to take the square root, and doing so successfully results in a score of 2. A score of 2 cannot be obtained without first obtaining a score of 1.

Masters (1982) realized that the Rasch model already specifies the probability to reach score 1, given that a student has obtained score 0, the category everybody reaches by default. This can be seen because the original Rasch model can be rewritten as:

P(X_{ij} = 1 \mid X_{ij} = 1 \lor X_{ij} = 0) = \frac{P_{ij1}}{P_{ij0} + P_{ij1}} = \frac{\exp(\theta_j - \beta_i)}{1 + \exp(\theta_j - \beta_i)}, \qquad (5)

where P_{ij1} and P_{ij0} are the probabilities that person j reaches a score of 1 and 0, respectively, on item i.

on item i. The probability to reach category l is implemented in exactly the same way by conditioning on the probability of reaching category l − 1:

P(X_{ij} = l \mid X_{ij} = l \lor X_{ij} = l - 1) = \frac{P_{ijl}}{P_{ij,l-1} + P_{ijl}} = \frac{\exp(\theta_j - \beta_{il})}{1 + \exp(\theta_j - \beta_{il})}, \qquad (6)

where P(X_{ij} = l) is the probability of person j answering in category l on item i and β_{il} is the subsequent difficulty of category l after category l − 1 has already been reached. The probability of reaching category l, unconditional on the previous categories, can then be derived as (Masters, 1982):

P(X_{ij} = l \mid \theta_j, \beta_i) = \frac{\exp(l\theta_j - \sum_{k=0}^{l} \beta_{ik})}{\sum_{k=0}^{K_i} \exp(k\theta_j - \sum_{m=0}^{k} \beta_{im})}, \qquad (7)

where K_i is the maximum scoring category of item i and β_{i0} = 0. An example of the PCM for one item with four scoring categories is shown in Figure 3. The interpretation of the βs in the PCM is similar, but not identical, to that in the Rasch model. Within the Rasch model, the β parameter is the ability at which the probability of answering correctly equals the probability of answering incorrectly (see Figure 1). Similarly, within the PCM, β_{il} is the ability at which the probability of answering in category l equals the probability of answering in category l − 1. The difference with the Rasch model is that in the PCM these probabilities are not equal to .5, as the other categories have a probability higher than 0. If the difficulties are, as in the example, strictly increasing, i.e. β_{il} > β_{i,l-1} for all l, students with an ability between β_{i,l-1} and β_{il} are most likely to answer in category l − 1.
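To make Equation (7) concrete, a minimal R sketch that computes the PCM category probabilities for a single item; p_pcm is our own helper name and the difficulties are the (arbitrary) values used for Figure 3.

    # PCM category probabilities for one item, Equation (7)
    # beta: subsequent category difficulties (beta_i1, ..., beta_iK); beta_i0 = 0
    p_pcm <- function(theta, beta) {
      beta0 <- c(0, beta)                          # prepend beta_i0 = 0
      K     <- length(beta0) - 1                   # maximum scoring category
      num   <- exp((0:K) * theta - cumsum(beta0))  # numerator for categories 0..K
      num / sum(num)
    }

    # item with four scoring categories (0, 1, 2, 3), as in Figure 3
    p_pcm(theta = 0.5, beta = c(-2, 0.25, 1.5))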

Extensions of the Rasch model: The PISA model

In this form of the Rasch model and the PCM, the abilities and items are fixed effects: every student and every item has a specific value and no distribution is assumed. As mentioned, this means that every response is an independent Bernoulli trial with the probability of a correct answer specified by Equations (4) and (7).

Figure 3. Chance of answering in category l (P_l) according to the PCM, depending on latent trait (θ) and subsequent difficulties β_1 = −2, β_2 = 0.25, and β_3 = 1.5.

A downside of the fixed effect model is that with more students taking the test, the number of ability parameters increases at the same rate as the number of sufficient statistics, i.e. the sum scores. This results in inconsistent estimates of the ability parameters (Andersen, 1970). To remedy the estimation problem, PISA uses a variant of the PCM with a random effect for the ability:

P(\mathbf{x}; \beta \mid \theta) = \int \prod_i \frac{\exp(l\theta - \sum_{k=0}^{l} \beta_{ik})}{\sum_{k=0}^{K_i} \exp(k\theta - \sum_{m=0}^{k} \beta_{im})}\, g(\theta; p)\, d\theta = \int \prod_i f_i(\theta)\, g(\theta; p)\, d\theta, \qquad (8)

where P(\mathbf{x}; \beta \mid \theta) is the probability of response vector x given the vector of item difficulties β, marginalized over the distribution of θ, g(θ; p), with distribution parameters p. For instance, if g is a normal distribution, p consists of µ and σ². The random effect model reduces the number of ability parameters to the number of distribution parameters, in this case two.
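A minimal sketch of the marginal probability in Equation (8) for the dichotomous Rasch special case, assuming a normal g(θ; µ, σ): the product of item response probabilities is integrated over the ability distribution with R's integrate(). The function marginal_prob and all parameter values are invented for illustration.

    # Marginal probability of a response vector x under a random-effect Rasch model,
    # a dichotomous special case of Equation (8)
    marginal_prob <- function(x, beta, mu = 0, sigma = 1) {
      integrand <- function(theta) {
        sapply(theta, function(t) {
          p <- exp(t - beta) / (1 + exp(t - beta))   # Equation (4) for each item
          prod(ifelse(x == 1, p, 1 - p)) * dnorm(t, mu, sigma)
        })
      }
      integrate(integrand, lower = -Inf, upper = Inf)$value
    }

    # probability of answering the first two of three items correctly
    marginal_prob(x = c(1, 1, 0), beta = c(-1, 0, 1))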

The second restriction PISA relaxes is the unidimensionality assumption, because PISA wants to measure multiple abilities in one test. Instead of a scalar, every student has a vector of ability variables, each of which can be compared separately. Note that scores on different abilities cannot be compared with each other. This results in the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) (Adams, Wilson, & Wang, 1997; OECD, 2012):

P(\mathbf{x}; A, B, \beta \mid \theta) = \int \prod_i \frac{\exp(\mathbf{b}_{ik}'\boldsymbol{\theta} + \mathbf{a}_{ik}'\boldsymbol{\beta})}{\sum_{k=1}^{K_i} \exp(\mathbf{b}_{ik}'\boldsymbol{\theta} + \mathbf{a}_{ik}'\boldsymbol{\beta})}\, g(\boldsymbol{\theta}; p)\, d\boldsymbol{\theta}. \qquad (9)

Figure 4. Multivariate Rasch model with two latent abilities (θ_1, θ_2) and six items (x_1, …, x_6); the loadings of the latent traits on the items are specified beforehand.

In this model, an item has K_i scoring categories (for example 0, 1, and 2). P(x; A, B, β | θ) is the probability of a response vector x given the design matrix A, the scoring matrix B, the item difficulties β, and the distribution of θ. With m latent variables, b_{ik} is a vector of length m that relates the latent variables to the response categories. B is called a scoring matrix because it relates the categories to a specific score; e.g., if b_{ik} consists only of 0's, that category is equal to a score of 0, making it a wrong answer. In the same sense, the vector a'_{ik} of length p relates the p item difficulties to the item categories.

Examples of the MRCMLM

The MRCMLM is a very general model that can be specified to encompass many existing (and not yet existing) IRT models. Within PISA, it is only specified as either the multidimensional Rasch model or the multidimensional PCM. We will give two examples of how the A and B matrices can be specified to obtain these two models.

Consider a test with two latent abilities and six questions, related as in Figure 4. The single-headed arrows define how the abilities are related to the items. The double-headed arrow between the two abilities represents the latent correlation between them. If item four is a partial credit item with three scoring categories (0, 1, 2) and all others are dichotomous, the vector β consists of seven elements, β_1 through β_7. With these specifications, the B and A matrices for item 1 are (Wilson & Wang, 1995):

B = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \qquad A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \qquad (10)

If we substitute these in Equation (9), and pretend it is a fixed effect model, the original Rasch model is obtained:

P(X_{11} = 1; A, B, \beta \mid \theta) = \frac{1}{1 + \exp(\theta_1 + \beta_1)}, \qquad P(X_{12} = 1; A, B, \beta \mid \theta) = \frac{\exp(\theta_1 + \beta_1)}{1 + \exp(\theta_1 + \beta_1)}. \qquad (11)

Note that in this example, β_1 is linked to item 1, but this is completely arbitrary and need not be the case.

The PCM is also a subclass of the MRCMLM, and to show how it can be modeled in the MRCMLM, item 4 of the previous example can be specified by the B and A matrices:

B = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 2 \end{pmatrix} \qquad A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{pmatrix}. \qquad (12)

Substituting these into Equation (9), and again pretending it is a fixed effect model, results in the following probabilities for the different scoring categories:

P(X_{41} = 1; A, B, \beta \mid \theta) = \frac{1}{Z}, \qquad P(X_{42} = 1; A, B, \beta \mid \theta) = \frac{\exp(\theta_2 + \beta_4)}{Z}, \qquad P(X_{43} = 1; A, B, \beta \mid \theta) = \frac{\exp(2\theta_2 + \beta_4 + \beta_5)}{Z}, \qquad (13)

with Z being the sum of the numerators:

Z = 1 + \exp(\theta_2 + \beta_4) + \exp(2\theta_2 + \beta_4 + \beta_5). \qquad (14)
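The probabilities in Equation (13) can be evaluated directly from the B and A matrices of Equation (12). A small R sketch, treating the model as a fixed effect model as in the text; the numeric values of θ and β are made up.

    # Category probabilities for item 4 from its scoring (B) and design (A) matrices,
    # as in Equations (12) and (13); rows correspond to categories 0, 1, 2
    B <- matrix(c(0, 0,
                  0, 1,
                  0, 2), nrow = 3, byrow = TRUE)
    A <- matrix(0, nrow = 3, ncol = 7)
    A[2, 4] <- 1            # category 1 involves beta_4
    A[3, c(4, 5)] <- 1      # category 2 involves beta_4 and beta_5

    theta <- c(0.3, -0.5)                             # hypothetical (theta_1, theta_2)
    beta  <- c(0.1, -0.4, 0.8, 0.2, 0.6, -0.3, 0.5)   # hypothetical beta_1 ... beta_7

    num <- exp(as.vector(B %*% theta + A %*% beta))   # exp(b'theta + a'beta) per category
    num / sum(num)                                    # probabilities; Z = sum(num)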

Specification of the MRCMLM

The MRCMLM requires a specification of the ability distribution. Within PISA a marginal multivariate normal distribution is assumed (OECD, 2012):

g(\boldsymbol{\theta}_n; \mathbf{W}_n, \gamma, \Sigma) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} \exp\!\left[-\tfrac{1}{2}(\boldsymbol{\theta}_n - \gamma^T \mathbf{W}_n)^T \Sigma^{-1} (\boldsymbol{\theta}_n - \gamma^T \mathbf{W}_n)\right], \qquad (15)

where θ_n is the m-variate ability vector of person n, conditioned on the fixed variable vector W_n of length u, and γ is a u × m matrix of regression coefficients of W_n on the ability means. All available variables that PISA measures, like gender, school type, parental variables, regional type, etc. (see OECD, 2012, Annex B for the full list), are used in W_n as control variables. This distribution is substituted into Equation (9) to complete the model.

Figure 5. Chance of answering correctly depending on latent trait (θ), item difficulty (β), and discrimination (α) according to the OPLM, for (α, β) = (1, −2), (5, 0), and (0.7, 2).

Extensions of the Rasch model: 2-PL and OPLM model

Another possible extension of the Rasch model is to relax the assumption of equal discrimination. Conceptually, this means that one item measures the latent trait better, and is therefore more strongly linked to it, than another item. Within PISA, relaxing this assumption greatly improves the model fit (Zwitser, Glaser, & Maris, Submitted), making the estimates of both abilities and item difficulties less biased. Due to the improved fit, we will relax this assumption in all of our analyses.

Unequal discrimination is achieved by adding an α parameter for every item. Combined with the PCM, this leads to the two-parameter logistic (2-PL) model:

P(X_{ij} = l \mid \alpha_i, \beta_i, \theta_j) = \frac{\exp(\alpha_i(l\theta_j - \sum_{k=0}^{l} \beta_{ik}))}{\sum_{k=0}^{K_i} \exp(\alpha_i(k\theta_j - \sum_{m=0}^{k} \beta_{im}))}. \qquad (16)

Here, α_i is a discrimination parameter and all other parameters are defined as previously. A few examples of dichotomous items are shown in Figure 5. The differing steepness of the curves indicates the unequal discrimination of the items. Note that the interpretation of β is still the ability required to have a 50% chance of a correct answer.
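For a dichotomous item, Equation (16) reduces to the familiar two-parameter logistic curve. A short R sketch using the (α, β) combinations of Figure 5; p_2pl is our own helper name.

    # 2-PL probability of a correct response to a dichotomous item
    p_2pl <- function(theta, alpha, beta) {
      exp(alpha * (theta - beta)) / (1 + exp(alpha * (theta - beta)))
    }

    theta <- seq(-4, 4, by = 0.1)
    # three items with different discriminations and difficulties, as in Figure 5
    p1 <- p_2pl(theta, alpha = 1,   beta = -2)
    p2 <- p_2pl(theta, alpha = 5,   beta =  0)
    p3 <- p_2pl(theta, alpha = 0.7, beta =  2)

    matplot(theta, cbind(p1, p2, p3), type = "l", lty = 1,
            xlab = expression(theta), ylab = "Chance of answering correctly")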

In our analysis, we will use a special case of this model, the One Parameter Logistic Model (OPLM) (Verhelst, Glas, & Verstralen, 1993). The only difference is that within the OPLM the αs are not parameters of the model, but fixed, known natural numbers supplied beforehand. This has consequences for fitting the model, as explained later.

The consequence of relaxing the assumption of equal discrimination, however, is the loss of specific objectivity. Specific objectivity means that comparing the parameters of two items leads to the same result independent of the participants measured, and it was one of the cornerstone assumptions in Rasch's original derivation. A simple example from Figure 5 illustrates why this no longer holds with unequal discrimination. If we take a sample of students with an ability around −1, item 2 will have a lower probability of a correct response, and thus a higher difficulty estimate, than item 3. However, with a sample with an ability around 0, item 2 will have a higher probability of a correct response, and thus a lower difficulty estimate, than item 3. The comparison of the item difficulties thus depends on the abilities of the students. It is possible to retain specific objectivity by extending the definition to more complex comparison functions involving three items (Irtel, 1995), but this is beyond the scope of this thesis.

Identification

As stated above, IRT models provide a method to measure the latent trait and item difficulties on the same scale, exemplified in Figure 2. However, as it stands, this scale is unidentified because it lacks an origin. In other words, only the difference between ability and difficulty determines the probability of a correct response, and both can be shifted by the same arbitrary number to reach exactly the same result. The identification problem can be resolved in the unidimensional Rasch model and PCM by imposing a constraint on any item difficulty, ability, or summary statistic of these parameters, usually by setting one to zero. Once one of these has been fixed, the scale is defined and all other parameters fall into place.
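The lack of an origin is easy to verify numerically: shifting all abilities and difficulties by the same arbitrary constant leaves every probability of Equation (4) unchanged. A minimal R sketch with invented values:

    p_rasch <- function(theta, beta) exp(theta - beta) / (1 + exp(theta - beta))

    theta <- c(-1, 0, 2)
    beta  <- c(-0.5, 1)
    shift <- 3                        # any arbitrary constant

    P_original <- outer(theta, beta, p_rasch)
    P_shifted  <- outer(theta + shift, beta + shift, p_rasch)

    all.equal(P_original, P_shifted)  # TRUE: the scale has no origin of its own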

Identifying the MRCMLM

Within the MRCMLM, identification can be achieved with the A matrix. We will give examples of two common options from the psychometric literature (Wilson & Wang, 1995). First, to set an item difficulty to zero, it is sufficient to set the A matrix for any one item to zero, or, in the case of the PCM, to set the first β of any item to zero. Second, to set the average, or equivalently the sum, of all difficulties to zero, any one item difficulty can be set equal to minus the sum of the others. In a three-item dichotomous example, this results in:

A = \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ -1 & -1 \end{pmatrix}. \qquad (17)

Note that this reduces the number of columns from three to two because there are only two free parameters.

The MRCMLM is a very general model that encompasses almost all known Rasch models (Adams et al., 1997). To handle all possible models, a complex algorithm has been developed for identification (Volodin & Adams, 2002). Within PISA, only a multidimensional PCM is used, with each item being influenced by only one latent dimension. In this simple case, it is sufficient to use the same identification techniques as for the normal Rasch model for each scale separately (Liu, Wilson, & Paek, 2008).

Identifying the OPLM

For the OPLM, two restrictions are required. Not only can the scale be shifted, it can also be stretched: the same probabilities for all items can be obtained by multiplying all distances between the θ_j and β_i parameters by some number and dividing the α_i's by that same number. The common solution is to set the geometric mean of the α parameters to a fixed number, but other options, like fixing the distance between a specific item and ability, are possible.

With all these options for identification, the question remains which to choose. From a modelling perspective the choice is arbitrary: every option results in equal probabilities and equal model fit. However, the interpretation may differ. To take a simple example, when fitting multiple models in different subsamples and comparing the results, constraining the average ability in every subsample is silly because the comparison of abilities is lost. We will discuss this issue in much more depth in the section about DIF. We will first explain the data handling and model fitting procedures of both PISA and our own analyses.

Comment 1: Identification and physics

Rasch recognized the problem of interchanging the difficulty and ability parameters and related it to the definition of force, mass, and acceleration in physics as described by the physicist Maxwell in 1876. The acceleration A_{vk} of object O_v with mass M_v after applying force F_k had been found by Newton as:

A_{vk} = \frac{F_k}{M_v}.

At the time, mass and force were not yet clearly defined, but acceleration was. At that point, the model had the same conceptual issues as the Rasch model. The acceleration and the responses to an item are clearly defined and measured, yet both are the result of a stochastic process. Acceleration is subject to measurement error, either through imprecise measurement tools or through incomplete model specification, as some small force could have been omitted. The IRT model specifies a probability of answering correctly, with the actual response being a Bernoulli variable.

To take the comparison further, in both models different parametrizations of the variables can be applied to reach the same results: θ and β can be shifted by the same arbitrary number to reach the same probability, and M and F can be multiplied by the same arbitrary number to reach the same acceleration. Note that in both cases it is still possible to make comparative statements: if force A results in twice the acceleration of force B on the same object, force A is twice as strong as force B. In the same sense, if student C has twice the odds of answering correctly as student D on the same item, student C is twice as able as student D.

The parametrization problem can be resolved by setting a unit parameter in any of the models to an arbitrary constant. In physics, this led to the standard international unit (SI) of mass, the well known kilogram. In educational measurement no such standard unit has been defined. As a result, a new arbitrary constant is chosen in every test and analysis.

PISA and modelling procedures

We will now explain the PISA survey and analysis method, as evaluating this methodology is the main goal of this thesis. We will discuss the analyses to the best of our knowledge; however, some questions will remain. This thesis will not cover the item creation and sampling methods; more information about these topics can be found elsewhere (OECD, 2012, chapters 2 through 8).

PISA survey

The goal of PISA is to assess 15-year-old students' "preparedness for life" (OECD, 2010), which is concretized in three domains: reading, mathematics, and science. The PISA survey is administered every three years, and in each edition one of the domains is the focus domain. The focus domain is measured by more items and broken down into different subscales for further analysis. We will explain the 2009 edition in detail; the other editions are structured similarly.

The complete 2009 PISA survey consists of 515,958 students across 74 countries and 219 items (OECD, 2012). These 219 questions were divided over 21 booklets with about 55 items in each booklet, resulting in an intentional missingness design. All three domains were measured by approximately 40 questions, with most items being reused over the years. This resulted in 35 mathematics, 53 science, and 131 reading items, with reading as the focus domain.

The main model of PISA uses three latent variables, one linked to the items of each main domain: mathematics, reading, and science. To further analyze the focus domain, new models are created that link the reading items to separate subskills. First, the domain is divided into three subskills: access and retrieve, integrate and interpret, and reflect and evaluate. In the second division, the items are split into continuous-text and non-continuous-text items. Finally, they are divided into digital and paper items.

Analysis

To estimate abilities, PISA uses a three-step analysis procedure: checking assumptions, estimating item parameters, and estimating abilities.

National calibration

In step one, one of the four possible latent variable models (it is unclear which, but probably the main model) is fit in every country separately. For identification, the ability averages on all scales are set to zero. These models are used to perform three checks of model fit (OECD, 2012):

Weighted Mean Square (MNSQ). The first test is whether an item fits the model. This is done by calculating the weighted MNSQ (Wright & Masters, 1982, p. 100) in every country:

MNSQ_i = \frac{\sum_j (X_{ij} - E_{ij})^2}{\sum_j W_{ij}}, \qquad W_{ij} = \sum_{k=0}^{K_i} (k - E_{ij})^2 \pi_{ijk}, \qquad (18)

where E_{ij} is the expected score and W_{ij} the variance of the score of person j on item i according to the model, and π_{ijk} is the model probability that person j responds in category k of item i. If the data are consistent with the Rasch model, the MNSQ has an expected value of 1. Items with an MNSQ higher than 1.2 or lower than 0.8 are flagged.
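A sketch of Equation (18) in R for dichotomous items, where the expected score is the Rasch probability E_ij and the variance is W_ij = E_ij(1 − E_ij). The data are simulated, so this only illustrates the computation and the 0.8/1.2 flagging rule, not PISA's actual implementation.

    set.seed(2)
    n_pers <- 500; n_item <- 10
    theta <- rnorm(n_pers)                        # simulated abilities
    beta  <- seq(-2, 2, length.out = n_item)      # simulated difficulties

    E <- 1 / (1 + exp(-outer(theta, beta, "-")))  # E_ij = P(X_ij = 1), persons x items
    X <- matrix(rbinom(length(E), 1, E), nrow = n_pers)
    W <- E * (1 - E)                              # W_ij for a dichotomous item

    # weighted MNSQ (infit) per item, Equation (18): sums run over persons j
    mnsq <- colSums((X - E)^2) / colSums(W)
    flagged <- mnsq > 1.2 | mnsq < 0.8
    round(mnsq, 2)
    which(flagged)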


Discrimination coefficients. The second test is to check whether the different categories structurally discriminate between the students. An item is flagged in a country if one of the following conditions is fulfilled:

• The correlation between the right category as a dummy variable and the total percentage correct is lower than -0.05.

• The correlation between any of the wrong categories as a dummy variable and the total percentage correct is larger than 0.05.

• For partial credit items: the ability of students answering in category l is lower than that of those answering in category l − 1.

• The correlation between the proportion of the maximum score of an item and the total percentage correct is lower than 0.20. For a dichotomous item, this correlation is equal to the one in the first requirement.

Equal item difficulties. The final check is whether the item difficulties are equal in all countries. This is checked by calculating a confidence interval for every item in the international calibration (see the next step). An item is flagged if the national point estimate falls outside this confidence interval.

From these three measures, it is assessed whether an item "has poor psychometric qualities" (OECD, 2012, p. 132), although the exact requirements are unclear. If an item performs poorly in more than ten countries, it is deleted from the database altogether. If it performs poorly in fewer countries, it is set to "not administered" in those countries, essentially making the item missing there.

International calibration

After checking and controlling for the psychometric qualities, international item difficulties are estimated by fitting the model of Equation (9), without any control variables in the multivariate ability distribution of Equation (15), on an international subsample. This subsample consisted of 500 randomly drawn students from every OECD country. For identification, the average item difficulty was set to zero.

Ability estimation

After establishing the item difficulties, the model is fit once more in each country, with the item difficulties fixed at the values from the previous step, to estimate the latent ability distribution of each country separately using plausible values (Mislevy, Beaton, Kaplan, & Sheehan, 1992; OECD, 2012). Plausible values are a method to draw from the posterior distribution of ability, given the item responses and known conditioning variables. Countries are compared on the average of these plausible values.
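A minimal sketch of the idea behind plausible values, not of PISA's actual machinery: for a single student, the posterior of θ given the responses and a normal population distribution is approximated on a grid, and plausible values are drawn from it. Item difficulties, responses, and the population distribution are all hypothetical.

    set.seed(3)
    beta <- c(-1.5, -0.5, 0, 0.5, 1.5)   # fixed item difficulties (hypothetical)
    x    <- c(1, 1, 0, 1, 0)             # observed responses of one student

    grid <- seq(-6, 6, length.out = 601)  # grid over the ability scale
    lik  <- sapply(grid, function(t) {
      p <- 1 / (1 + exp(-(t - beta)))
      prod(ifelse(x == 1, p, 1 - p))      # likelihood of the responses
    })
    post <- lik * dnorm(grid, mean = 0, sd = 1)   # prior: population distribution
    post <- post / sum(post)

    # five plausible values: draws from the (discretized) posterior of theta
    pv <- sample(grid, size = 5, replace = TRUE, prob = post)
    pv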

To test the assumptions of PISA and illustrate technical points, we will frequently use data from the PISA surveys of the 2006, 2009, and 2012 editions. Therefore, we will first describe our handling of the PISA data1.

1The data were downloaded from https://pisa2006.acer.edu.au/downloads.php, https://pisa2009.acer.edu.


Handling PISA data

The following data cleaning was performed. First, all students with more than 50% missing answers within their booklet were removed (5315 students). Then, Liechtenstein was excluded due to the small sample size.

Missing data was handled depending on the type of missingness. PISA distinguishes three types of missingness:

• Invalid response, double or missing responses

• Not reached, consecutive invalid responses at the end of a booklet

• Not-administered, either an item not administered to a person, or an item removed from the dataset due to poor psychometric qualities.

Invalid and not reached items were treated as wrongly answered items. The handling of not administered items needs some more explanation.

Not-administered items were handled by PISA not as wrongly answered, but as literally not administered. This means that separate booklets were specified without the not-administered items. PISA itself reports some items that were not administered (OECD, 2012, table 12.8). However, this table is not complete, and removing these items from the data set did not remove all not-administered items. The remaining items were removed from the booklets of a country if they were not administered to more than 5% of the students in that country; otherwise they were coded as wrong.

Fitting the OPLM

After handling the missingness, we fitted the OPLM separately for each main domain (mathematics, reading, and science) in each country with the OPLM program (Verhelst et al., 1993). Because the α parameters are fixed in the OPLM, fitting it is a two-step procedure. First, a random two thirds of the data was used to estimate the integer α parameters using the OPCAT procedure. For identification, the geometric mean of all αs was set to 4. Second, the remaining one third of the data was used to estimate the βs using the OPCML procedure, with the αs from the previous step entered as fixed values. To identify the model, the first item difficulty was set to 0. All other data preparation and analyses were done with the statistical language and program R (R Core Team, 2014).

About DIF

Plausible values are a consistent estimator of ability, but only when the IRT model is correctly specified (Marsman, Maris, Bechger, & Glas, 2013). By employing the same model in each country, PISA commits itself to the very stringent assumption that all item difficulties are equal across countries. The opposite of this claim, i.e. that item difficulties differ between countries, is known in the psychometric literature as differential item functioning, or DIF (e.g., Holland & Wainer, 1983).

PISA tries to control for DIF by removing items in countries where the difficulties fall outside the international confidence interval. Its methodology adheres to the standard view of DIF: DIF is considered a nuisance, and "DIF items" are detected and removed, not just because DIF harms the comparability of measurements, but also because of the belief that it would be "unfair" to compare groups when some items are easier for one group than for the other.

Zwitser et al. (Submitted) question this practice and point out that it is inappropriate to remove DIF items when DIF is construct related. Construct related DIF is informally defined as DIF resulting from any aspect relevant to what the test measures. For example, if country A has long division in its curriculum and country B does not, long division items will be harder for country B than for country A and this will show up as DIF. More generally, when DIF is construct related it stems from differences in the educational systems and general culture, that are of interest to policy makers and other educational agents. Removing DIF items means that we miss an opportunity to see these interesting qualitative differences.

The opposite is construct unrelated DIF, DIF resulting from anything not relevant to what the test measures. Examples include translation issues, wrong supply of tools (a calculator was available when this was not allowed), but also differences in language ability when the goal is to only measure mathematics. Construct unrelated DIF is mostly of interest to test designers, as it should be limited as much as possible. The difference between the two is not clear-cut; the final example could also be construct related, if "extracting relevant numerical information from text" would be considered part of mathematics. This shows that the difference between construct related and unrelated may be small and requires clear test goals and extensive knowledge about the domains measured.

By trying to remove DIF, PISA implicitly assumes that all DIF is construct unrelated. We will investigate whether they succeeded in removing DIF and whether this claim of construct unrelated DIF is appropriate. However, current approaches to analyzing DIF are flawed because the identification is not taken into account. We will explain a new test developed by Bechger and Maris (2014) that removes this flaw.

DIF and identification

The current analyses of DIF fail to take the identification into account, while it may actually influence which items exhibit DIF. Most existing procedures focus on individual items and test the hypothesis that an item has no DIF between two groups, that is, β_{i,1} = β_{i,2}, with β_{i,g} being the difficulty parameter of item i in group g. This null hypothesis is assumed to be independent of the identification, but the estimated item parameters are not. More specifically, if the origin is set on β_r, then β̂_i is not an estimate of β_i but of the relative difficulty β_i − β_r, and thus depends on the identification. This leads to the predicament that an item can and cannot have DIF depending on how we identify the model.

For example, take a test of three items taken by two groups. Suppose we fit the Rasch model separately in each group, resulting in the two latent scales, without an origin, shown in Figure 6. Clearly, DIF is present, because the difficulties are not all equal. However, which items show DIF depends on the identification, as explained in Figure 7. If both groups are identified by setting the origin on β_1, the third item exhibits DIF. On the other hand, if they are identified by setting the origin on β_3, the first two items exhibit DIF.

Bechger and Maris (2014) show that the relative difficulties between items are independent of the normalization: that is, β_i − β_j = (β_i − β_r) − (β_j − β_r), and β_r drops out. This principle is clear in the example of Figure 7: the distances between the items are equal in both identifications.

Figure 6. A latent scale for two groups without an origin.

Figure 7. The effect of identification on DIF: (a) identification on item 1; (b) identification on item 3.

Furthermore, the distance between β_1 and β_2 is equal in both groups, which means these items form a cluster with no DIF between them, i.e. no DIF within the cluster. The distance between β_3 and β_1 or β_2 does differ between the groups, implying that there is DIF between item 3 and the cluster of items 1 and 2. This is the core of what we will call the relative view on DIF: DIF is a relative concept that exists between item clusters, with no DIF within a cluster but DIF between clusters.

The common approach, which we call the anchor viewpoint, is a special case of this relative view. Implicitly or explicitly, one of the items or clusters is treated as an "anchor set" that is said to have no DIF. Within this view, DIF is usually seen as a rare and construct unrelated issue. The origin of the model is based on the anchor set, and all other items can be compared to it. In our three-item example of Figure 7, if items 1 and 2 were to form a cluster in all group pairs, we could call it the anchor cluster and compare item 3 to it. Note that the anchor view is not necessarily wrong, but it carries a choice. The use of an anchor cluster could be justified by finding a set of items in the data that is arguably equally difficult in all countries.

Handling DIF

Which view one takes further influences how to handle DIF. Broadly speaking, two methods of handling have been proposed in the literature. We will discuss both of these methods as well as a new radical approach proposed by Zwitser et al. (Submitted).

Eliminating DIF

In one option, DIF is seen as a construct unrelated nuisance, and the researcher tries to remove items that have DIF from the test altogether. This is the route PISA takes, by removing items in a country that fall outside the international confidence interval, and it is one of the methods advocated by Kreiner (2010). Whether this option is tenable depends both on the nature of DIF and on whether it is possible to find an anchor cluster. This will be the topic of a later section.

Models taking DIF into account

Contrasted with removing DIF, several researchers have suggested using IRT models that allow for DIF by letting item parameters differ between subpopulations (Goldstein, 2004; Kreiner, 2010; Kreiner & Christensen, 2014; Oliveri & von Davier, 2011, 2014).

However, just as the arbitrary identification influences which items show DIF in the anchor view, it also influences the ability estimates when DIF is present. Zwitser et al. (Submitted) showed this with an example from the 2009 PISA data. When comparing Poland and the Netherlands, two DIF clusters were uncovered. When the model was identified on one cluster, the Netherlands had a higher average ability, but when the model was identified on the other, Poland had a higher average ability. The arbitrary identification influenced the ability distributions.

On a more abstract level, comparing countries with parameters that differ between subpopulations is problematic because different parameters lead to different latent ability scales. To summarize the full data, a measurement model like the Rasch model provides a function f to link the data X to the ability Θ, that is:

\Theta = f(X). \qquad (19)

If the same model is used in each country, Θ_1 = f(X_1) and Θ_2 = f(X_2) are measured on the same scale and are thus comparable. However, if a model with different parameters for different subpopulations is used, the data are linked differently to the ability in each country, i.e.:

\Theta_1 = f_1(X_1), \qquad \Theta_2 = f_2(X_2). \qquad (20)

Because the linkage functions differ, Θ_1 and Θ_2 are scores on different scales, and their comparison is no longer meaningful.

Comparing countries on manifest variables

Zwitser et al. (Submitted) argued that DIF is mostly construct related and should, as opposed to being eliminated, be taken into account when comparing countries. To do so, they proposed a radically new approach to comparing countries.

The original latent variable approach was developed because a test was supposed to measure something more abstract than its items: the latent ability. In this view, a population of items exists that all measure the latent ability. It does not matter which of these items are chosen (when equal discrimination is assumed) because "comparisons between individuals become independent of which particular instruments - tests or items or other stimuli - have been used" (Rasch, 1960).

Zwitser et al. (Submitted) question this measurement view by stating that a domain is always defined by the items used to measure it, a view in line with the third, market-basket approach of Mislevy (1998). If no items about long division are sampled from the population of items for a particular test, students' long division skills will not be measured. Therefore, when all stakeholders define a set of items that operationalizes the ability (OECD, 2012, chapter 2), the ability has become those items.

This makes the primary goal of the statistical analysis to estimate the distribution of X, the full matrix of responses of all students in the population to all items. One option is to administer the full item set to a random sample of the population. However, for practical reasons, usually a stratified sample is drawn and an incomplete design is administered. The stratified sample is taken into account with student and replicate weights (OECD, 2012, chapters 4 and 8). The goal of the IRT model is to obtain an unbiased estimate, x, of the full data.

They propose to use IRT to literally fill in the incomplete data using plausible responses. Plausible responses are "samples from the item response distribution according to the measurement model, the estimated item parameters for the particular country, and the plausible values for the person parameters in that country" (Zwitser et al., Submitted). The full matrix of observed and plausible responses is an estimate of X. This method has the advantage that different IRT models can be used for different countries or other subpopulations, thus allowing for DIF.

The next step is how to compare students and countries on this estimate. Students and countries are only comparable if they are measured by the same yardstick; that is, x_{ij} and x_{ik} are only comparable if we assume that students j and k respond to the same item i. Because responses to the same item are by assumption comparable, and a domain is defined by its items, Zwitser et al. (Submitted) propose to compare two students directly by comparing any summary statistic over all items of a domain, such as x_{+j} and x_{+k}. They coin the term plausible scores when the sum is used, but note that no statistic is inherently better than another. These summary statistics can be aggregated within countries to compare countries.
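A toy R sketch of this procedure for a dichotomous Rasch model: items that were not administered to a student are filled in with draws from the item response distribution at one plausible value of that student's ability, and the plausible score is the sum over the complete item set. The item parameters and the plausible value stand in for the country-specific estimates.

    set.seed(4)
    beta <- c(-1, -0.5, 0, 0.5, 1, 1.5)    # country-specific difficulties (hypothetical)
    x    <- c(1, NA, 0, NA, 1, NA)         # NA = item not in this student's booklet
    theta_pv <- 0.4                        # one plausible value for this student

    p <- 1 / (1 + exp(-(theta_pv - beta))) # response probabilities under the country model
    x_filled <- ifelse(is.na(x), rbinom(length(x), 1, p), x)  # plausible responses

    plausible_score <- sum(x_filled)       # summary statistic over the complete item set
    plausible_score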

Zwitser et al. (Submitted) investigated some properties of these plausible scores to show their strengths. First, even if the model does not fit perfectly, the generated plausible score distribution closely resembles the true score distribution. Second, they applied this method to create a league table of the 2006 PISA reading items. Their league table closely resembles the original league table, showing that PISA's current methodology is relatively robust to DIF.

Is elimination possible?

How to compare countries thus boils down to a choice between eliminating DIF and comparing manifest variables. Which to choose depends on whether an anchor cluster exists and on whether DIF is construct related or construct unrelated. In this section, we will discuss whether PISA succeeded in eliminating DIF and, if it failed, whether elimination is still feasible. This requires a DIF test according to the relative view, as developed by Bechger and Maris (2014). Their DIF test consists of two steps: first, testing whether any DIF is present between two groups; second, identifying how the DIF is structured. The added value of their DIF test lies in the structure of the DIF, i.e. identifying clusters according to the relative view. Establishing whether any DIF is present at all is the same under both views; in both identifications of Figure 7, DIF is present.

To create a statistical test, Bechger and Maris (2014) collect all pairwise parameter differences for group g in a matrix \hat{R}^g, with entries \hat{R}^g_{ij} = β_{i,g} − β_{j,g}. Now, take the difference between these matrices for an overall DIF test between two groups: \Delta\hat{R} = \hat{R}^1 − \hat{R}^2. If no DIF is present, all entries of \hat{R}^1 and \hat{R}^2 are equal. If any DIF is present, some entries differ. So the hypotheses of the test are:

H_0: \Delta R = 0, \qquad H_1: \Delta R \neq 0. \qquad (21)

The matrix \Delta R can be fully recovered from any of its columns. Specifically,

\Delta R_{ij} = \beta_{i,1} - \beta_{j,1} - (\beta_{i,2} - \beta_{j,2}) = \beta_{i,1} - \beta_{r,1} - [\beta_{j,1} - \beta_{r,1}] - (\beta_{i,2} - \beta_{r,2} - [\beta_{j,2} - \beta_{r,2}]) = \Delta R_{ir} - \Delta R_{jr}. \qquad (22)

Therefore, any column can be used to test the null hypothesis. Under the null hypothesis, and assuming unbiased, asymptotically normal estimates of the item difficulties, δ, the rth column of \Delta\hat{R} with item r taken out, has a known multivariate normal distribution with mean 0 and covariance matrix Σ. Thus, the squared Mahalanobis distance

\chi^2_{\Delta R} = \hat{\delta}^T \Sigma^{-1} \hat{\delta} \qquad (23)

follows a chi-squared distribution with k − 1 degrees of freedom, with k being the number of item parameters. \chi^2_{\Delta R} does not depend on the column chosen. The observed \hat{\chi}^2_{\Delta R} can be compared to this distribution to obtain a p-value.
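A sketch of this overall test in R, under the assumption that we have, for each group, the estimated item difficulties and the covariance matrix of those estimates from separate calibrations, and that the two samples are independent. The function overall_dif_test is our own construction of Equations (22) and (23); the inputs below are simulated only to make it runnable.

    # Overall DIF test of Equation (23), given per-group estimates and their covariances
    overall_dif_test <- function(beta1, beta2, Sigma1, Sigma2, r = 1) {
      k <- length(beta1)
      C <- diag(k)[-r, , drop = FALSE]   # contrast matrix: rows are e_i - e_r, i != r
      C[, r] <- -1
      delta <- C %*% (beta1 - beta2)     # rth column of Delta-R-hat, item r removed
      V     <- C %*% Sigma1 %*% t(C) + C %*% Sigma2 %*% t(C)  # independent samples assumed
      chi2  <- drop(t(delta) %*% solve(V) %*% delta)
      c(chi2 = chi2, df = k - 1, p = pchisq(chi2, df = k - 1, lower.tail = FALSE))
    }

    # simulated example: 20 items, no true DIF
    set.seed(5)
    k      <- 20
    beta   <- rnorm(k)
    Sigma  <- diag(0.02, k)                  # pretend estimation covariance
    beta1  <- beta + rnorm(k, sd = sqrt(0.02))
    beta2  <- beta + rnorm(k, sd = sqrt(0.02))
    overall_dif_test(beta1, beta2, Sigma, Sigma)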

Table 1

Overall DIF test applied to reading items of the 2009 edition. Pairs with the five highest and lowest χ² values are shown.

Country pair                           χ²      df    p
Australia - New Zealand                271     108   3.84 ∗ 10^−16
Trinidad and Tobago - United States    351     77    5.47 ∗ 10^−37
Ireland - Trinidad and Tobago          362     76    3.10 ∗ 10^−39
Malta - Singapore                      379     76    3.75 ∗ 10^−42
Malta - New Zealand                    379     77    7.54 ∗ 10^−42
...
Mexico - Spain                         5002    77    < 2.23 ∗ 10^−308
Jordan - Mexico                        5069    108   < 2.23 ∗ 10^−308
Canada - Mexico                        5180    77    < 2.23 ∗ 10^−308
Azerbaijan - Mexico                    5473    108   < 2.23 ∗ 10^−308
Italy - Mexico                         6087    77    < 2.23 ∗ 10^−308

Results of the overall DIF test

To test for overall DIF within PISA, we applied the test to each of the 18,396 country-pair, domain, and year combinations. The results showed that DIF was abundant. More than 99.9% of all combinations resulted in a p-value smaller than 1 ∗ 10^−9, making them significant at α = 0.05 even after applying a Bonferroni adjustment. Almost 18% resulted in a p-value less than 2.23 ∗ 10^−308, the smallest value bigger than 0 representable in R. As an example, the country combinations with the lowest and highest χ² values for the 2009 reading items are displayed in Table 1. This shows that PISA's efforts to remove DIF have failed: DIF is present everywhere. This agrees with bitter experience and previous research showing that DIF is an inexorable finding in educational measurement, especially in international surveys like PISA (e.g. Grisay, Gonzalez, & Monseur, 2009; Kreiner, 2010; Kreiner & Christensen, 2014; Yildirim & Berberoglu, 2009).

Although DIF is present everywhere, it could still be possible to identify an anchor cluster and thereby make elimination possible: if one big cluster exists that generalizes over countries, this cluster could be used as the anchor. We will use the 2012 science items for all pairs among Florida, Massachusetts, the United States (without Florida, Massachusetts, and Connecticut), Great Britain, Russia, and Perm (a federal subject of Russia) to investigate this option. From here on, Florida and Massachusetts will be referred to as the separate states, and the remaining United States sample as the US. The results of the general DIF test are shown in Table 2. Again, DIF is present everywhere.

Specific DIF test to find item clusters

To find the item clusters, return to the matrix ΔR. We construct \hat{D}, the standardization of ΔR:

\hat{D}_{ij} = \frac{\hat{R}^1_{ij} - \hat{R}^2_{ij}}{\sqrt{\operatorname{Var}(\hat{R}^1_{ij} - \hat{R}^2_{ij})}} = \frac{\hat{R}^1_{ij} - \hat{R}^2_{ij}}{\sqrt{\operatorname{Var}(\hat{R}^1_{ij}) + \operatorname{Var}(\hat{R}^2_{ij})}}, \qquad (24)

such that every entry of \hat{D} follows a standard normal distribution under the null hypothesis H_0^{i,j}: β_{i,1} − β_{j,1} = β_{i,2} − β_{j,2}. With specific hypotheses about the structure of DIF, these values can easily be tested.

Table 2

Overall DIF for the 56 science items of 2012 for all country pair combinations of the United States (US), Florida (Flor), Massachusetts (Mass), Great Britain (GB), Perm (a federal subject of Russia), and Russia (Russ). Sample sizes are shown in brackets; df = 55 for all pairs. Each entry gives χ² with the p-value in parentheses.

             Flor (1895)       Mass (1718)        GB (12627)           Perm (1760)            Russ (5204)
US (4948)    175 (1.9∗10^−14)  170 (9.6∗10^−14)   1254 (3.8∗10^−226)   1570 (3.9∗10^−292)     2396 (< 2.23∗10^−308)
Flor (1895)                    151 (6.7∗10^−11)   572 (4.6∗10^−87)     1020 (1.1∗10^−177)     1381 (1.2∗10^−252)
Mass (1718)                                       611 (6.7∗10^−95)     862 (2.5∗10^−145)      1124 (3.1∗10^−199)
GB (12627)                                                             2189 (< 2.23∗10^−308)  4223 (< 2.23∗10^−308)
Perm (1760)                                                                                   293 (3.3∗10^−34)
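Given the two groups' difficulty estimates and the covariance matrices of those estimates, \hat{D} of Equation (24) can be computed directly; a sketch in R with simulated inputs, which also draws the kind of heatmap discussed next. The helper dif_matrix and all inputs are ours.

    # Standardized pairwise DIF matrix D-hat, Equation (24)
    dif_matrix <- function(beta1, beta2, Sigma1, Sigma2) {
      R1 <- outer(beta1, beta1, "-")              # R^1_ij = beta_i1 - beta_j1
      R2 <- outer(beta2, beta2, "-")
      v1 <- outer(diag(Sigma1), diag(Sigma1), "+") - 2 * Sigma1  # Var(beta_i1 - beta_j1)
      v2 <- outer(diag(Sigma2), diag(Sigma2), "+") - 2 * Sigma2
      D  <- (R1 - R2) / sqrt(v1 + v2)
      diag(D) <- 0
      D
    }

    set.seed(6)
    k <- 15
    beta1 <- rnorm(k)
    beta2 <- beta1 + c(rep(0, 10), rep(0.8, 5))   # DIF built into the last 5 items
    Sigma <- diag(0.02, k)
    D <- dif_matrix(beta1, beta2, Sigma, Sigma)

    2 * pnorm(-abs(D[1, 2]))   # p-value for a specific pairwise hypothesis H0^{1,2}
    image(D)                   # simple heatmap of D-hat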

In our PISA example, we have no hypotheses about how DIF is structured in a country pair. Therefore, we need an exploratory method to identify DIF clusters. Bechger and Maris (2014) propose to plot a heatmap of \hat{D} to visualize and exploratively interpret DIF. If we apply these principles to the three items of Figures 6 and 7, \hat{D} could look like:

\hat{D} = \begin{pmatrix} 0 & 0.81 & 3.53 \\ -0.81 & 0 & 3.79 \\ -3.53 & -3.79 & 0 \end{pmatrix}, \qquad (25)

resulting in the following heatmap:

Figure 8. Heatmap example of three items.

From the heatmap, it is clear that items 1 and 2 are similar and form a DIF cluster. Item 3 is separate from these two and forms its own cluster. Bechger and Maris (2014) already note that this method is rather informal, due to its visual rather than statistical nature, and impractical for large numbers of items and/or groups. Every possible group pair has its own heatmap, complicating detailed analysis.

Figure 9. DIF between Florida and Perm for the 2012 science questions: (a) unsorted items, (b) legend, (c) items sorted by cluster. The unsorted heatmap is depicted in (a). In (c), the items are ordered by cluster, making the interpretation much easier.

If many items need to be compared, the different clusters may be hard to interpret from the raw heatmap. Take the heatmap for the 2012 science items between Florida and Perm, as in Figure 9(a). It is clear that DIF is present, but which items form a DIF cluster is hard to tell from the plot alone. To find the specific clusters, we used a k-medoid cluster analysis, a robust version of the popular k-means method (Kaufman & Rousseeuw, 2009). Which cluster analysis to use is heavily debated (e.g. Fraley & Raftery, 1998), but this is outside the scope of this thesis; we chose this method because it resulted in interpretable findings. Almost every cluster analysis requires two inputs: a dissimilarity or distance matrix, and the number of clusters to extract. The matrix \hat{D} is literally a matrix of distances between the items and can be used directly. The number of clusters to extract was based on the average silhouette width as proposed by Rousseeuw (1987). We refer to the literature for the statistical details of these methods.
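A sketch of this clustering step, assuming the R package cluster (which implements the k-medoid method of Kaufman and Rousseeuw as pam()) is available, and taking the absolute values of the standardized differences in \hat{D} as dissimilarities. The matrix D below is a crude stand-in for the \hat{D} of a real country pair.

    library(cluster)   # pam() and silhouette information (Kaufman & Rousseeuw)

    set.seed(7)
    # crude stand-in for D-hat: 20 items, DIF built into the last 10
    beta1 <- rnorm(20)
    beta2 <- beta1 + rep(c(0, 1), each = 10)
    D <- (outer(beta1, beta1, "-") - outer(beta2, beta2, "-")) / sqrt(0.08)  # rough standardization

    d <- as.dist(abs(D))    # absolute standardized differences as dissimilarities

    # choose the number of clusters by the average silhouette width (Rousseeuw, 1987)
    ks      <- 2:8
    avg_sil <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
    k_best  <- ks[which.max(avg_sil)]

    fit <- pam(d, k = k_best, diss = TRUE)
    fit$clustering          # cluster membership; the heatmap can be sorted by this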

Based on the average silhouette width, seven clusters were extracted. The items can be sorted by these clusters, making the heatmap much easier to interpret, as in Figure 9(c). As expected, within each cluster little DIF is present, but between the clusters much DIF is present. It is notable that none of the clusters is much bigger than the others. Therefore, anchoring on any of them is undesirable; even if the biggest cluster of 14 items is used as an anchor, the other 42 items will all have DIF. Other country pairs show similarly small clusters. This is the first blow to the assumption of one big anchor cluster: choosing any of these clusters as an anchor, few items are left for comparison. However, it could still be possible that some of these clusters generalize to other countries, making elimination a harsh, but still tenable, option.

Generalization of the clusters

To check the generalization of the clusters, we can sort the items in the heatmap for a different country pair according to the clusters of the original pair. The results can be found in Figure 10. It is clear that no cluster generalizes over all country pairs. Especially in the Florida-US (Figure 10(a)) and Perm-Russia (Figure 10(f)) pairs, hardly any of the DIF structure is maintained. This result deals the final blow to the possibility of eliminating DIF. DIF is not a rare issue that influences a few items while most items are DIF-free, as is usually assumed in the DIF literature. The many small clusters show that DIF is actually a big issue that, even in the anchor view, influences most items. Furthermore, the clusters do not generalize over country pairs in this small example, let alone in all 2,080 pairs of 2012. Therefore, we believe that elimination is not possible and that comparing on manifest variables is the only tenable option when comparing groups.

DIF test to generate hypotheses

Besides providing evidence that eliminating DIF is impossible, the relative DIF test can be used to generate hypotheses about qualitative differences between countries. The results of the general DIF test can tell something about the general similarity between two countries. Looking back at Table 1, the pair Australia - New Zealand has the least DIF, showing that these countries are relatively similar, as we might have expected. It is noticeable that Mexico dominates the bottom of the table. This does not necessarily mean that Mexico actually has the most DIF, as the p-values are not a measure of effect size but depend on the sample size. Mexico has the highest number of participating students, increasing the power of the DIF test. The top 100 country pairs with the most DIF are mostly populated by Mexico, Spain, and Brazil, which are all in the top 5 of sample size. Similarly, countries in the top 100 pairs with the least DIF (Miranda-Venezuela, Malta, and Trinidad and Tobago) have relatively small sample sizes.

Similar conclusions can be drawn from Table 2. The US and the two separate states have much lower χ² values, and thus less DIF among themselves, than when they are compared to the other three countries. Note that the United States has more DIF with every other group than the separate states do, but this is, again, probably a sample size issue. Perm and Russia have little DIF between them, though more than among the states. Great Britain is somewhere in between: it has high DIF with Russia and Perm and lower DIF with the US and its states, but the amount of DIF is still quite high in all cases.

From these results, given the small amount of DIF, we can conclude that the states are quite similar in the difficulty of the items, with Britain being less similar. This shows that DIF does not merely arise from translation errors, as they share the same language, but probably from their educational systems. The same goes for Perm and Russia, as they share the same Soviet past. However, they are less similar than the states, showing that there are considerable differences within the former Soviet Union. Great Britain is more similar to the states than to Perm and Russia, but considerable differences remain in all cases. This, too, is expected, because it has a similar culture and language to the US and its states, but a different educational system.

(24)

(a) Florida - US (b) Perm - US

(c) Florida - Great Britain (d) Perm - Great Britain

(e) Florida - Russia (f) Perm - Russia

Figure 10 . Mapping the DIF clusters of Florida - Perm to other pairs with the US, Great

Britain, and Russia. Country pairs with Florida are on the left, pairs with Perm on the right.


Interpreting DIF clusters

Besides these general interpretations, the clusters found by the DIF test can be used to generate specific hypotheses about DIF. Looking back at the clusters between Florida and Perm in Figure 9(c), one cluster consists of only one item. This single-item cluster is preserved in all pairs with Perm (Figures 10(b), 10(d) and 10(f)), and in none of the pairs with Florida (Figures 10(a), 10(c) and 10(e)). Especially in the Perm - Russia pair, the item stands out due to its huge DIF. This strongly indicates that this case of DIF is Perm specific. Because the item is so Perm specific, it could be a case of construct unrelated DIF, but such a claim requires more extensive knowledge about the test administration, which is outside the scope of this thesis. Still, this does show the usefulness of the DIF test for generating hypotheses. It is impossible to substantively examine all items for construct unrelated DIF, yet the test identifies a specific item that warrants further investigation.

The tests can also be used to identify possible construct related DIF. From Figure 10 it is clear that the clusters are preserved well when Florida is compared to Russia, although some clusters may be merged. The clusters are not preserved when Florida is compared to the US, Great Britain, or Massachusetts (not pictured to save space). This implies that, compared to Perm, Florida's item structure is similar to that of the US, Massachusetts, and, to a lesser extent, Great Britain. This is partly as expected, as Florida, Massachusetts, and the US belong to the same country. It is interesting that Great Britain is somewhat similar too, as it has a totally different educational system from the states.

Similar results seem to hold for Perm. When the US and Massachusetts are compared to Perm, most clusters are reasonably well preserved, although clusters four, five, and six may be shuffled. For Great Britain, the clusters are also preserved reasonably well, but more shuffling is required than for the US. When Perm is compared to Russia, nothing is preserved besides the single item. This implies that Perm is more similar to Russia than to the US and its states.

As opposed to the single-item cluster, the bigger clusters could be a source of construct related DIF and may carry information about differences between countries. For example, if one cluster consisted of (almost) all items about health, and those items were easier in Florida than in Perm, this would suggest that, compared to students from Perm, students from Florida know relatively more about health than about the other topics. A detailed analysis of the items is impossible because the items are confidential. However, PISA reports some characteristics of its items (OECD, 2014, Annex A):

• Application area: health, frontiers, natural resources, or environment

• Competency required: using scientific evidence, identifying scientific issues, or explaining phenomena scientifically

• Context: personal, social, or global

• Language the item was created in: English, Dutch, German, French, Norwegian, Japanese, or Korean


To substantively analyze the DIF clusters, we examined whether the clusters were related to any of these item characteristics. We created contingency tables with the distributions of the item characteristics over the clusters. An example with scientific competence is shown in Table 3. To test statistical dependence, one would normally use a χ2-test. However, most expected cell frequencies are smaller than 5, which makes the chi-square distribution under the null hypothesis unreliable (Lewis & Burke, 1949). Therefore, we used a Monte Carlo simulation based on 10^5 simulated tables to estimate how often one would find an equal or larger χ2 value under the null hypothesis of independence (Hope, 1968). The result can be interpreted as a p-value.

Table 3

Distribution of items over the clusters of the Florida - Perm country pair by scientific competence: Explaining phenomena scientifically, Identifying scientific issues, and Using scientific evidence

Cluster   Explaining   Identifying   Using
1                  6             2       4
2                  1             0       0
3                  3             2       3
4                  3             2       3
5                  5             2       2
6                  3             4       7
7                  1             1       2

When we apply the above procedure to test the dependence between the Florida - Perm clusters and each characteristic separately, all p-values are larger than .30, making a dependence between the clusters and the characteristics unlikely. To complete the example, we checked whether the clusters of any of the country pairs were linked to the item characteristics. When linking the clusters of the 14 country pairs to the four characteristics, only 3 out of 56 comparisons showed a significant dependence at the .05 level, indicating that there is probably no structural dependence between the characteristics and the clusters.
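A minimal sketch of this Monte Carlo procedure, applied to the counts in Table 3, is given below. It is an illustrative re-implementation rather than the exact code used for the reported analyses: permuting one set of labels keeps both margins of the simulated tables fixed, which is one common way to implement such a test, but the simulation scheme and random seed used for the reported p-values may differ.

```python
import numpy as np

# Counts from Table 3: rows are the seven Florida - Perm clusters,
# columns the scientific competences (Explaining, Identifying, Using).
table = np.array([[6, 2, 4],
                  [1, 0, 0],
                  [3, 2, 3],
                  [3, 2, 3],
                  [5, 2, 2],
                  [3, 4, 7],
                  [1, 1, 2]])

def chi2_stat(tab):
    """Pearson chi-square statistic of a contingency table."""
    rows = tab.sum(axis=1, keepdims=True)
    cols = tab.sum(axis=0, keepdims=True)
    expected = rows @ cols / tab.sum()
    return ((tab - expected) ** 2 / expected).sum()

# Expand the table into one (cluster, competence) label per item, so that
# permuting the competence labels simulates independence with fixed margins.
clusters = np.repeat(np.arange(table.shape[0]), table.sum(axis=1))
competences = np.concatenate([np.repeat(np.arange(table.shape[1]), row)
                              for row in table])

observed = chi2_stat(table)
rng = np.random.default_rng(1)
n_sim = 100_000          # number of simulated tables, as in the analysis above
hits = 0
for _ in range(n_sim):
    sim = np.zeros_like(table)
    np.add.at(sim, (clusters, rng.permutation(competences)), 1)
    if chi2_stat(sim) >= observed:
        hits += 1

p_value = (hits + 1) / (n_sim + 1)   # Monte Carlo p-value
print(f"chi2 = {observed:.2f}, Monte Carlo p = {p_value:.3f}")
```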

This implies that the clusters are formed on a more complex basis than the item characteristics PISA supplies. Yet again, further analysis of the clusters requires more knowledge of the items, and interpreting these differences would require knowledge about the educational systems in these countries.

Discussion

In this thesis, we evaluated PISA's methodology with a focus on construct related DIF. Taking the relative view on DIF, we found DIF to be present everywhere, and comparing countries on manifest variables seems the only way to effectively analyze countries while taking DIF into account. Furthermore, the relative DIF test can be used to analyze and interpret DIF, but more substantive knowledge about the items and the educational systems is required to fully interpret the complex structure of DIF.

A natural suggestion for future psychometric research is to develop better ways to present the results from the relative DIF tests, where better means "more effective in showing patterns of interest to substantive experts". The complexity of DIF proved difficult to handle when applying the relative DIF test on a large scale. If the goal is to interpret DIF in all possible country pairs, the current visual analysis is too cumbersome; no one can interpret all 1540 heatmaps, let alone their generalization to other country pairs. A possible solution would be to develop a more statistical approach to finding clusters that are preserved in multiple country pairs; one simple possibility is sketched below. Currently, we visually concluded that one item is Perm specific, but an automatic analysis would make this much easier. Similarly, an analysis that identifies countries that can be lumped together, like the US and its separate states, would also make the analysis easier.
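One way such an automatic analysis could look is sketched below. It quantifies how well the item-cluster assignment of one country pair is preserved in another with the Adjusted Rand Index; this particular index, and the randomly generated example data, are our own choices for illustration and were not used elsewhere in this thesis.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def cluster_agreement(assignments):
    """Adjusted Rand Index between the item-cluster assignments of every
    two country pairs; values near 1 indicate well-preserved clusters,
    values near 0 no more agreement than chance."""
    return {(a, b): adjusted_rand_score(assignments[a], assignments[b])
            for a, b in combinations(assignments, 2)}

# Illustrative, randomly generated assignments for three country pairs;
# in a real analysis these would come from the model-based clustering
# of the relative DIF matrices.
rng = np.random.default_rng(0)
n_items = 56
assignments = {pair: rng.integers(1, 8, n_items)
               for pair in ("Florida-Perm", "Florida-Russia", "Florida-US")}

for pair, ari in cluster_agreement(assignments).items():
    print(pair, round(ari, 2))
```

Screening all country pairs with such an index would highlight the pairs whose heatmaps are actually worth inspecting visually.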

One possibility is to consider the analysis of DIF from the viewpoint of network psychometrics (e.g., Schmittmann et al., 2013). Briefly, DIF corresponds to dependencies between item responses and context variables like nation or gender. Network psychometrics is aimed at modeling and visualizing pairwise relations between variables and provides tools that should be instrumental in analyzing DIF.
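As a toy illustration of what such a network representation could look like (not an implementation of any specific network-psychometric model), one could treat items as nodes, use the absolute pairwise DIF statistics as edge weights, and inspect the layout of the resulting graph; the data below are randomly generated placeholders.

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_items = 20                               # small illustrative example
dif = np.abs(rng.normal(size=(n_items, n_items)))
dif = (dif + dif.T) / 2                    # symmetric pairwise DIF "strengths"

# Build a graph with items as nodes and only the strongest DIF relations as edges.
graph = nx.Graph()
graph.add_nodes_from(range(n_items))
threshold = np.quantile(dif[np.triu_indices(n_items, k=1)], 0.9)
for i in range(n_items):
    for j in range(i + 1, n_items):
        if dif[i, j] >= threshold:
            graph.add_edge(i, j, weight=dif[i, j])

# Items that end up close together in the layout share a similar DIF pattern.
pos = nx.spring_layout(graph, weight="weight", seed=1)
nx.draw(graph, pos, with_labels=True, node_size=300)
plt.show()
```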

PISA plays an increasingly important role in educational policies worldwide. Currently, specific differences between countries are mostly theoretically driven: subscales are identified first and countries are compared on these. This thesis provides a data driven approach to analyzing differences between countries. When meaningful clusters are found, they reveal substantive differences between countries beyond the specific subscales identified by the test makers. DIF is not a nuisance but an informative concept, relevant to anyone interested in educational differences.

References

Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied psychological measurement, 21 (1), 1–23.

Andersen, E. B. (1970). Asymptotic properties of conditional maximum-likelihood estimators. Journal of the Royal Statistical Society. Series B (Methodological), 283–301.

Bechger, T. M. & Maris, G. (2014). A statistical test for differential item-pair functioning. Psychometrika, 1–24.

Bechger, T. M., Wittenboer, G., Hox, J. J., & de Glopper, C. (1999). The validity of comparative educational studies. Educational Measurement: Issues and Practice, 18 (3), 18–26.

Fraley, C. & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer journal, 41 (8), 578–588.

Goldstein, H. (2004). International comparisons of student attainment: Some issues arising from the PISA study. Assessment in Education: Principles, Policy & Practice, 11 (3), 319–330.

Grisay, A., Gonzalez, E., & Monseur, C. (2009). Equivalence of item difficulties across national versions of the PIRLS and PISA reading assessments. IERI monograph series: Issues and methodologies in large-scale assessments, 2, 63–84.

Holland, P. W. & Wainer, H. (1983). Differential item functioning. Routledge.

Hope, A. C. A. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal Statistical Society. Series B (Methodological), 582–598.

Irtel, H. (1995). An extension of the concept of specific objectivity. Psychometrika, 60 (1), 115–118.
