
Trends in PISA: A Market Basket Approach

Master Thesis, Research Master Psychology

Tara Cohen (10349855), University of Amsterdam
Supervised by: Robert Zwitser & Gunter Maris


Abstract

PISA is an international student survey that is used to compare countries all over the world and to track changes in student ability in math, science, and reading in each country over time. In 2015, some unexpected jumps occurred in PISA trends; Finland, for example, declined markedly on all three subjects. This study aimed to investigate whether the trends in PISA scores are reliable by comparing them to trends in percentage correct scores and trends in market basket scores. PISA scores are influenced by methodological choices and are difficult to interpret, whereas the market basket score and the percentage correct score rely less heavily on these methodological choices and are easy to interpret. I found that PISA trends differ from percentage correct trends and market basket trends. This means that countries should interpret trends in PISA scores with care.


Table of contents

1. PISA: Tale of an Educational Survey
  The Birth of a PISA Score
  PISA, a Survey with a Mission
2. Comparing Countries
  Equating Scales
  DIF isn’t bad; it’s Interesting!
  The Market Basket Approach: an Alternative?
3. Tracking Trends
  Problem 1: Choosing Link Items
  Problem 2: Scale Drift
  Problem 3: Interpretability
  What About That Market Basket Approach?
4. This Study
  Research Question 1: How does the PISA trend compare to the Percentage Correct trend?
  Research Question 2: How does the PISA trend compare to the Market Basket trend?
  Research Question 3: How do the PISA trend and the Market Basket trend compare to the trend expected based on the link items?
5. Methods
  Sample Characteristics
  Excluded Items
  Filling the Market Basket
  Exploratory Analyses
6. Results
  Research Question 1
  Research Question 2
  Research Question 3
7. Discussion
  Limitations
  Implications
  Recommendations
8. References
Appendixes
  Appendix 1: Excluded Items
  Appendix 2: R-Code
  Appendix 3: Link Items
  Appendix 4: Score Tables


PISA: Tale of an Educational Survey

Finland's dramatic drop in the 2012 and 2015 PISA rankings stirred up controversy all over the world, with people wondering how such a praised education system could drop so much so quickly. The Programme for International Student Assessment (PISA) is a survey designed to evaluate education systems all over the world by testing 15-year-old students in each participating country on three subscales: science, reading, and math. The participating countries are ranked based on their scores, and PISA is administered every three years (OECD, 2016). Changes like the drop in Finland are interpreted as changes in student ability caused by the education system; in Finland's case, the egalitarian education system (The Guardian, 2013) and uninspiring curriculum (The Economist, 2013) were blamed for the 2012 drop in PISA scores. Is the education system in Finland to blame for the drop in performance on PISA, or are there alternative causes for changes in PISA scores?

The Birth of a PISA Score

The way PISA scores come to be is a process in which methodological models and choices play a role. Every three years, PISA distributes a survey of over a hundred items to all participating countries. PISA uses a block design in which not every child answers every question. Instead, children get chunks, or blocks, of questions that are selected from the pool of available items (Figure 1 shows a simplified example of a block design).

After all questionnaires are administered and scored, PISA uses a partial credit IRT model without discrimination parameters to estimate item difficulties and ability scores for each child on all three subjects (math, reading, science). In 2015, PISA's model changed to a generalized partial credit model, and in that year country-specific item parameters were also introduced for the first time. These changes came with a change in the institution responsible for calculating PISA scores, from ACER to ETS. At the time of writing, the PISA 2015 technical report is not out yet, so the specifics of the 2015 PISA analysis will only become known later this year.
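For illustration, here is a minimal sketch in R of the two model families, using the same TAM package that is used later in this thesis. The response matrix resp is an assumption, and this is only a sketch of the model change, not PISA's actual estimation pipeline (which in 2015 also involved country-specific item parameters):

library(TAM)

# resp: a hypothetical matrix of item scores (0/1/2), one row per student.

# Partial credit model without discrimination parameters (pre-2015 approach).
pcm <- tam.mml(resp, irtmodel = "PCM")

# Generalized partial credit model, which adds an item discrimination
# (slope) parameter per item (the 2015 approach).
gpcm <- tam.mml.2pl(resp, irtmodel = "GPCM")

head(pcm$xsi)  # item step difficulties
head(gpcm$B)   # item slopes estimated by the GPCM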

To link the scales of different cycles together, a few items are administered in multiple years. An overview of these link items that appear in multiple cycles can be found in Appendix 3.

PISA, a Survey with a Mission

In my opinion, PISA has two main purposes. The first is to compare the education systems of all participating countries, so countries can tell how their education systems perform in relation to others. The second purpose is to track the progress of the education system in each individual country over the cycles (in this thesis I will refer to changes within countries over the PISA cycles as trends). PISA scores should therefore not only represent a country's position on a world scale within one cycle, but should also represent changes within a country compared to cycles in the past. This leaves us with a question: do PISA scores fulfil both of these purposes?

Comparing Countries

Since the 2015 cycle of PISA was the first cycle in which Differential Item Functioning (DIF) corrections were made, it is interesting to dig a bit further into the implications of correcting for DIF in a survey like PISA. DIF occurs when two participants with the same ability have a different chance of answering an item correctly because of the topic or phrasing of that item (Rogers, 2005; Holland & Wainer, 1993). In a survey like PISA, in which many countries, and therefore cultures, participate, it is important to think about how (and whether) to correct for DIF.

Figure 1: Simplified example of a block design. Gray = missing data; white = administered items.

Equating Scales

Some researchers propose to correct for DIF by using models that add subpopulation-specific item parameters (Kreiner & Christensen, 2014; Oliveri & Von Davier, 2011; Oliveri & Von Davier, 2014). This means that the item difficulty for children in different countries is estimated separately. A challenge that arises when applying this approach is equating the scales of all the countries that have different item parameters.

Two methods for equating scales are the mean-mean method (Loyd & Hoover, 1980; Oliveri & Von Davier, 2011; Oliveri & Von Davier, 2014) and the use of a set of “anchor items”. A problem with these approaches is that they rely on a choice of items. There is no purely scientific way to decide which items are DIF items, which means that this decision is often based on arbitrary cutoffs (Bechger & Maris, 2015). A different cutoff, or a different method of detecting DIF, could therefore mark different items as DIF and non-DIF items. This poses a problem because the choice of items can potentially change the outcomes of a test. In the anchor item approach, for example, a different choice of anchor items could potentially lead to a different ranking of countries (Zwitser et al., 2016; Bechger & Maris, 2015); an example of this effect was shown by Zwitser et al. (2016). When a different selection of DIF items can change the results of the PISA ranking, the score is not merely a reflection of ability, but also of a methodological choice. This is one reason why correcting for DIF might not be so desirable, and why an alternative method that avoids DIF corrections would be interesting to investigate.
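To make this arbitrariness concrete, here is a minimal R sketch of mean-mean linking; the item names and difficulty values are invented for illustration:

# Hypothetical Rasch difficulties for the same three anchor items in two cycles.
b_2012 <- c(item1 = -0.50, item2 = 0.20, item3 = 1.10)
b_2015 <- c(item1 = -0.30, item2 = 0.55, item3 = 1.15)

# Mean-mean linking: shift the 2015 scale so that the anchor items have
# the same mean difficulty as in 2012.
shift <- mean(b_2012) - mean(b_2015)               # -0.200
theta_2015_linked <- function(theta) theta + shift

# Dropping item2 from the anchor set changes the shift, and with it every
# linked score: the choice of anchor items drives the outcome.
shift_alt <- mean(b_2012[-2]) - mean(b_2015[-2])   # -0.125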

DIF isn’t bad; it’s Interesting!

Another problem with correcting for DIF is that DIF contains potentially interesting information. When correcting for DIF, the expectation is that the correction removes differences in translations and culture, and not differences that have to do with the ability measured in the survey. This suggestion, that DIF is mainly caused by language and cultural inequivalence (Gillis, Polesel & Wu, 2016), was questioned by studies showing that a large part of DIF items are marked as such because of differences in curricula (Yildirim & Berberoĝlu, 2009; Huang, Wilson & Wang, 2016). Huang, Wilson and Wang (2016), for example, found that in a comparison between Chinese and American students only 15% of the DIF was rooted in cultural and language differences, as opposed to 25% of the DIF that was caused by different curricula. This suggests that instead of correcting for a difference in culture and language, DIF corrections could actually be “correcting for” topics that are missing or overly represented in different curricula. Interpreting DIF, instead of correcting for it, might give interesting insights into where a curriculum is lacking or exceptionally strong.

The Market Basket Approach: an Alternative?

In their 2016 study, Zwitser et al. used a different approach to compare, or rank, the countries on the reading literacy scale in the 2006 cycle of PISA: a market basket approach (Zwitser et al., 2016; Mislevy, 1998; NCES, 1997; Bock, 1996). In this approach, the collection of items in the test represents the measured construct, as opposed to the collection of items representing a latent variable, as in PISA and other IRT-scored tests. So all the math items together are the construct math; they do not represent some unmeasurable latent variable “math”.

Figure 2 shows a schematic representation of these two models. The market basket approach only requires the answers, or estimated answers, of each participant on all items in order to calculate any sort of summary statistic, like a sum or mean score for each participant in each country (Zwitser et al., 2016; NCES, 1997). The advantages of a market basket approach over a latent variable model are discussed further under the heading “What About That Market Basket Approach?” later in this introduction.

Figure 2: Market basket vs. Latent variable model
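To make the computation concrete, here is a minimal R sketch of a market basket score; the names are hypothetical: filled is a completed (observed plus imputed) score matrix with a country column, and max_score holds each item's maximum attainable score:

# Market basket score: mean percentage of the maximum attainable score,
# computed over ALL items in the basket for every child.
market_basket <- function(filled, max_score) {
  items <- setdiff(names(filled), "country")
  pct <- rowSums(filled[, items]) / sum(max_score) * 100
  tapply(pct, filled$country, mean)  # one summary score per country
}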

Zwitser et al. (2016) found that changing the method of scoring from PISA's latent variable approach to the market basket approach did not alter the general ranking of the countries. This suggests that the PISA score is well equipped to compare countries within each cycle, or at least as well equipped as the market basket approach. The first purpose of PISA (comparing and ranking the education systems of participating countries) is thereby fulfilled by the PISA scores.


Tracking Trends

Changes in PISA scores over the cycles are assumed to be caused by changes in the ability of students in participating countries, and consequently, by the education systems in place. If this is the case, conclusions about the education systems of countries, as mentioned in the introduction, are completely justified. There are, however, alternative explanations for changes in trends in a survey like PISA. These mainly have to do with linking the scales of different PISA cycles so that they can meaningfully be compared. As noted above, PISA provides a reliable score for comparing countries within each PISA cycle. The question that remains is whether PISA scores are well equipped to track trends in education systems within countries over the cycles.

IRT scales as used by PISA are estimated, arbitrary scales. There is no natural zero point, and the range depends solely on how the scale is transformed or estimated. On this type of scale, each point within one cycle is relative to the others, but not automatically relative to scores in other cycles. To solve this problem, PISA includes items that appear in multiple PISA cycles. These items are used to “link” the cycles together (Gillis, Polesel & Wu, 2016; Urbach, 2013). In this way, PISA tries to equate the scales of each cycle so that scores in different cycles become directly comparable. Three problems arise in this approach.

Problem 1: Choosing Link Items

The first problem with this approach is that the scores and scales can change based on which link items are chosen. The scores become dependent on the choice of link items, instead of on ability alone. This effect has been found in PISA by Monseur and Berezner (2007): discarding one particular link item when linking the 2000 and 2003 PISA cycles for Japan changed Japan's score by a total of 10 points.


Problem 2: Scale Drift

A second problem with scaling the PISA cycles is scale drift. A scale has drifted if the meaning of the scores on the scale changes over time; more concretely, a score that was high in an earlier cycle could now be relatively low, or the other way around (Haberman & Dorans, 2009). IRT scales in student assessments tend to drift over the years (Taherbhai & Seo, 2013; Petersen, Cook & Stocking, 1983). The SAT, a test administered to students in the USA, for example, had to be re-centered in 1995 and showed signs of drifting between 1994 and 2001 (Liu et al., 2009).

The reason that scales drift is a breach of the invariance assumption that is made when linking scales (Liu et al., 2009; Wanjohi et al., 2013; Deng & Melican, 2010; Michaelides, 2010). Most methods for linking scales rely on the assumption that the difficulty of the link items, the population, and the curricula stay the same in each cycle. This, however, is often not the case. In each cycle of PISA, new countries participate, which changes the overall population. This means that the assumptions of linking scales are not met and that the linked scales could have drifted over time.

Scale drift can be a problem even if the drift in each new cycle is small. The accumulated error could make scores no longer comparable when looking at long-term changes (Livingston, 2004).

The linking method also affects the amount of drift (Liu, Curley & Low, 2009; Kim & Cohen, 1998; Wanjohi, van Rijn & von Davier, 2013). With the institute responsible for producing PISA scores having changed, it is possible that the amount of drift is not stable over the cycles: changes in methods could cause different amounts of drift. All this means that drifting scales seem unavoidable when trying to link PISA cycles together.


Problem 3: Interpretability

A more general objection to the PISA scaling and estimation method is that the scores by themselves are uninterpretable. Saying that the Netherlands scored 513 on reading in 2003 is difficult to interpret when the score is not related to the scores of other countries within that cycle or to scores in other cycles, even when the content of the items in the survey is available to you.

These three problems suggest that a method that does not depend on linking the scales of PISA cycles together would not only be more interpretable, but might also yield different results.

What About That Market Basket Approach?

As Zwitser et al. (2016) suggested, a market basket approach could be a good alternative to the latent variable model that PISA uses to estimate its scores. The main advantage of this type of scoring is that the scores do not depend on link or anchor items. This means that there is no arbitrary choice involved that can change the scores based on which items are administered as link items or which items are treated as non-DIF anchor items. In this way, the market basket approach avoids unwarranted changes in scores, both when comparing countries and when tracking trends within a country, due to an arbitrary choice of anchor items.

Another advantage of the market basket approach is that the scales and scores are interpretable. It provides a percentage correct and mean score, which in combination with the content of the test can be easily interpreted by informed professionals and policy makers. If there are changes in score, policy makers could get direct insight into the causes by investigating the items in each cycle and how children responded to those items.


As mentioned before, Zwitser et al. (2016) showed that the ranking of countries does not differ from the PISA ranking when the market basket method is used. Now it would be interesting to see whether trends in PISA within countries differ when the market basket approach is used.

This Study

In this study I investigate whether PISA trends between the cycles, per country, are reflections of differing ability levels or of methodological choices and artifacts. I do this by comparing PISA scores to other scoring approaches based on the same data. If these trends differ from the PISA trends, this could either be because the other approaches aren't equipped to track trends in PISA data, or because the PISA approach isn't equipped to do so. Either way, without further research, results like this would create doubt about whether PISA scores can be meaningfully compared over the cycles.

Research Question 1

How does the PISA trend compare to the Percentage correct trend?

I will compare the PISA scores to two other scores. First, a comparison is made between the PISA scores and percentage correct scores: the percentage correct on the items that children were actually administered. Due to the block design PISA uses, this means that each child's score is based on a subset of all the items. This score is interesting because it shows the scores without any methodological interference; it is the most direct representation of what children actually did.
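A sketch of this computation in R, under the assumption that Dat is a score matrix with NA for items a child was not administered and max_score is a (hypothetical) vector of item maxima:

# Percentage correct, computed over only the administered items.
admin <- !is.na(Dat)                           # which items each child made
max_attained <- as.vector(admin %*% max_score) # maximum attainable per child
pct_correct <- rowSums(Dat, na.rm = TRUE) / max_attained * 100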

A limitation of this type of scoring is that it does not account for differences in item difficulty between the administered items. A child who made 20 easy items might score higher than a child who made 20 difficult items, even though they have the same ability. These scores, however, are not used to assess individuals in this study; since only the mean score per country per year is considered, the aforementioned limitation is less of a concern.

Research Question 2

How does the PISA trend compare to the Market Basket trend?

The next score I will compare the PISA scores to is the market basket score as introduced in Zwitser et al. (2016). This score is interesting because it does not produce arbitrary estimated ability scores and it does not link years together through anchor items. It also improves on the percentage correct score by taking into account not only the items that were administered, but all items in the full survey for each child.

A limitation of the market basket approach is the assumption that the basket of items is equal to the ability that is measured. This assumption could be seen as unrealistic. In math, for example, where an almost endless number of items is available, PISA surveys contain only about 20-50 items in each cycle. One could argue that this item pool is too small to be seen as the full math ability. However, any survey or test is based on a smaller pool of items than the full range of items that could test a certain subject or ability. What is important to note here is that the amount of available information does not change based on the framework with which the data are analyzed; it is merely the assumptions and the way of reporting that change. Furthermore, when in the market basket approach a topic is (for example) lacking or overly represented in the items, policy makers can easily find out how that changed their scores, because of the transparent way the market basket scores are made up. This would not be so easily possible in a latent variable model.


Research Question 3

How do the PISA trend and the Market Basket trend compare to the trend expected based on the link items?

I will also zoom in on the link items that were administered in 2012 as well as 2015 and see whether the trend expected there is similar to the PISA trend and the market basket trend.

Link items are interesting because they stay the same over the years, whereas the other items are newly written for each cycle. This means that, when comparing market basket scores over the cycles, the changes in items could be causing changes in scores. This makes it interesting to investigate the changes on the link items in addition to the differences between PISA and market basket scoring.

It would also be interesting to see whether PISA follows the same trend as the link items do. If it does not, this would suggest that the PISA scores have separated from the one consistent set of scores in the survey: the link items.

For this study I have chosen four countries: Finland, due to its spectacular drop in score and ranking in 2015; Montenegro, which has markedly increased its scores since 2012 on all three subscales; Latvia, which has shown almost no change in 2015 compared to 2012; and the Netherlands, which has decreased its scores slightly. Since the 2015 PISA scores show specific changes in these four countries, I will focus mainly on the trend between 2012 and 2015.

Methods

Sample Characteristics

In this study I analyze all available PISA data for the countries mentioned in the introduction. PISA is administered every three years, so each country has multiple years of measurements. For Finland, the Netherlands and Latvia there are 6 PISA cycles available, from 2000 to 2015. For Montenegro there are 4 cycles available, from 2006 to 2015. PISA participants are 15 years old and are usually in school. The participating schools are randomly chosen in each country so that each level of education and different school sizes are evenly represented, though small schools and schools for disabled children are often excluded. The test is only used for scientific and governance purposes; it is not used for selection or high-stakes testing. The test takes between one and two hours and is administered at school. The content of the test is decided on by specialists in each relevant field (OECD, 2016). There are about 4,000-5,000 participants per country in each year.

Excluded Items

Not all items were administered in every country. To avoid sampling responses for items that were not administered to any children in a country, all items that were not administered in each of the four countries in this study were deleted from the data. The names of the excluded items can be found in Appendix 1. In 2015 there is particularly little overlap between countries, with 98 excluded items. Because of the many non-overlapping, deleted items in 2015, 2075 participants were excluded; these participants had not made any of the items that were left after deleting the excluded items. The number of excluded cases is fairly evenly distributed among the countries. The excluded cases can be found in Table 1.

Table 1

Number of excluded cases per country, per year

             2015  2012  2009  2006  2003  2000
Finland       566    34     3     0     2     0
Latvia        452     0     1     6     7     2
Montenegro    575    26     1     0     -     -
Netherlands   482    11     0     4     2     0
Total        2075    80     5    10    14     2


Filling the Market Basket

As mentioned, this study uses the market basket approach as proposed by Zwitser et al. (2016) to analyze the PISA data for the chosen countries, for all years of participation. As discussed above, not every participant answers every question, so the first step of the analysis is to fill in a plausible response for each missing data point. To analyze the data, I estimate a partial credit IRT model for each year and each country separately; this means that I estimate 23 models.

The data were imputed by first estimating item difficulties and thetas based on the items that were administered. Besides items with binomial scores, PISA also contains items with partial credit options. Therefore, the model used for estimating the thetas was a partial credit IRT model, fit with the R package TAM (Robitzsch, Kiefer & Wu, 2017). This package uses marginal maximum likelihood (MML) estimation to estimate the item difficulties and thetas from the data. The formulas below show the model written as the probability of scoring a 0, a 1, or a 2, where the δ's are item step difficulties and θ is a child's estimated ability.

$$p_0 = \Pr(X = 0) = \frac{1}{1 + e^{\theta-\delta_1} + e^{2\theta-(\delta_1+\delta_2)}}$$

$$p_1 = \Pr(X = 1) = \frac{e^{\theta-\delta_1}}{1 + e^{\theta-\delta_1} + e^{2\theta-(\delta_1+\delta_2)}}$$

$$p_2 = \Pr(X = 2) = \frac{e^{2\theta-(\delta_1+\delta_2)}}{1 + e^{\theta-\delta_1} + e^{2\theta-(\delta_1+\delta_2)}}$$
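As a worked example with hypothetical values $\theta = 0.5$, $\delta_1 = 0$ and $\delta_2 = 1$: $e^{\theta - \delta_1} = e^{0.5} \approx 1.65$ and $e^{2\theta - (\delta_1 + \delta_2)} = e^{0} = 1$, so the denominator is $\approx 3.65$ and the response probabilities are $p_0 \approx 0.27$, $p_1 \approx 0.45$ and $p_2 \approx 0.27$.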

After the estimates were obtained they were used to sample possible responses for each item for each participant using the formulas as seen above. The R-code used for this process can be found in Appendix 2.


To check the validity of the estimates, I made plots of the sum scores on the administered items against the sum scores of the sampled data, including only the items that were administered to each child. The sum scores of the estimates correlated highly with the sum scores on the items that were actually administered; all correlations were larger than .90. The scale of the sum scores differs per year and depends partly on the number of items administered in that year. The distribution of the sum of all estimated responses for each year can be found in appendix 3.
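A sketch of this check in R, with hypothetical names (Dat is the observed score matrix with NA for non-administered items, FilledDat the sampled completion produced in the previous step):

# Compare observed sum scores with the sum scores of the sampled responses,
# restricted to the items each child was actually administered.
obs <- rowSums(Dat, na.rm = TRUE)
sam <- rowSums(FilledDat * !is.na(Dat))  # zero out non-administered items
cor(obs, sam)   # the correlations reported here all exceeded .90
plot(obs, sam)  # the kind of plot summarized in Figure 3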


Figure 3: Sum scores of estimates plotted against the sum scores of the completed items. Only items that were administered were included.

Exploratory analyses

In the proposal of this study, I only specified how I was going to estimate the market basket scores. The rest of the analysis unfolded as the study progressed.

In this study I will compare three types of scores. First, the original PISA scores, given to each country, for each subject (math, science & reading), in each year. The second score is the percentage correct score: the percentage correct that children obtained on the items they were administered, which is only a portion of the total items for each child because of the block design PISA uses. The third score is the market basket score: a percentage correct based on the estimated responses of each child on every question in each cycle of PISA.

Results

I will discuss the results by addressing each research question from the introduction. All scores (PISA score, percentage correct score & market basket score) for each country (on each subject, in each cycle) can be found in Appendix 4. I have selected some interesting findings from these results to present in the following section.

Research Question 1

How does the PISA trend compare to the Percentage correct trend?

When comparing the PISA scores and the percentage correct scores, I found a positive correlation of 0.82 (Figure 4). But when I zoomed in on specific trends within countries, I found that 7 out of 12 PISA scores in 2015 (4 countries x 3 subjects) changed in the opposite direction of the percentage correct score compared to 2012. All changes in scores can be found in Tables 2, 3 and 4. For example, the Netherlands scored 13 points lower on science in 2015 than it did in 2012, but the percentage that children got correct barely changed, with a 0.1% increase. The only changes that did not differ in 2015 compared to 2012 were those for Montenegro, which increased in both the PISA score and the percentage correct score.

Figure 4: PISA scores compared to the percentage correct of the items administered to the children in each country, split on subject.

As mentioned above, the trends in PISA scores differed from trends in percentage correct scores. Figure 5 shows some examples of differences and similarities in trends. On the left of this figure, you can see that Finnish children had a higher percentage correct in math and science in 2015 than in 2012, but the PISA score shows a decline in both fields. The bottom right plot shows that the Netherlands has a slightly fluctuating trend around 60% correct for science; the PISA scores, however, show a completely different picture, namely one of decline. The top right plot shows Montenegro, which improved in reading in each cycle in both the percentage correct and the PISA scores.

Figure 5: A visual comparison of trends in PISA score & percentage correct score

Research Question 2

How does the PISA trend compare to the Market Basket trend?

When comparing the trends of the market basket approach and the PISA score, there were some inconsistencies. In 2015 compared to 2012, 7 out of 12 scores went in opposite directions. Finland, for example, had an 8 point decline in its PISA score for math, while the market basket score showed an improvement of 3.3%.


Figure 6: PISA score compared to market basket score in Latvia (all years, all subjects). Note: the bigger the point, the later the PISA cycle.

Now I will go through some examples of inconsistencies between PISA and market basket trends. Figure 6 makes clear that, in the market basket approach, Latvia scores similarly in all PISA cycles. This is inconsistent with the PISA scores, where the scores in the 2000 cycle were a lot lower than in the other cycles. In the next plots (Figure 7), we find that Finland scored better in 2015 than it did in 2012 when looking at the market basket scores in math and science; the PISA scores, however, are a lot lower in 2015 than in 2012. The same pattern emerges in the Netherlands, where, compared to the market basket scores, the 2015 PISA scores on all subjects seem to have been underestimated. In science, for example, the Netherlands scores 57.7% correct in the market basket, similar to the scores in 2012 and 2009, though the 2015 PISA score is more than 10 points lower than in those two years. Another noticeable occurrence is that in 2003 and 2006 the PISA scores of the Netherlands were higher than in other years, while the Netherlands scored lower on the market basket (Figure 8).


Figure 7: Finland math and science scores seem underestimated by PISA in 2015, and in 2006 Finland’s score seems to have been overestimated.

Figure 8: Science scores in the Netherlands: PISA and market basket score compared.

A country where the PISA trend and the market basket trend are similar is Montenegro, which increased its scores in math and reading in each cycle in which it participated (Figure 9). A strange result in Montenegro's market basket scores is its science score in 2006, which is unexpectedly high at around 60% correct, compared to the 35%-40% correct in the other years.


For research question 1, the PISA scores were compared to the percentage correct scores. If you compare the percentage correct scores to the market basket scores, the trends are extremely similar. Figure 10 shows that the market basket score correlates very strongly with the percentage correct score on the items that were administered to each child (r = 0.99). This means that the market basket score stays closer to what the children actually filled in on the survey than the PISA scores do (Figure 4).

Figure 10: percentage correct compared to market basket score.

Research Question 3

How do the PISA trend and the Market Basket trend compare to the trend expected based on the link items?

Since the countries I chose (besides Latvia) showed big changes in PISA scores in 2015 compared to 2012, the way these countries scored on the link items that were in both the 2012 and the 2015 cycle of PISA could give interesting insight into the expected changes in trends. In the three plots below (Figure 11), the percentage correct on only the anchor items in 2012 and 2015 is compared. Here you see Finland improving slightly on all subjects, Latvia decreasing slightly, Montenegro increasing, and the Netherlands decreasing slightly.

In 9 out of the 12 changes in PISA scores between 2012 and 2015, the PISA changes are similar to the changes expected from the anchor items. Finland is the main exception here, with an expected increase in score on all three subjects based on the anchor items that is not reflected in the PISA scores.

The changes in percentage correct on all items between 2012 and 2015 also differ from the changes in percentage correct on only the anchor items: they differ in 7 out of a total of 12 observations (again 4 countries x 3 subjects).

Table 2.
Differences in Science scores

              Δ PISA   Δ % correct   Δ MB   Δ Anchors
Finland          -14           4.5    1.1         2.3
Latvia           -12          -0.7    0.6        -1.2
Montenegro         1           5.3    5.3         0.6
Netherlands      -13           0.4    0.1        -1.1

Table 3.
Differences in Reading scores

              Δ PISA   Δ % correct   Δ MB   Δ Anchors
Finland            2           2.5    0.0         0.2
Latvia            -1           1.6    0.6        -1.7
Montenegro         5           3.0    2.9         2.3
Netherlands       -8          -0.8   -1.3        -1.4

Table 4.
Differences in Math Scores

              Δ PISA   Δ % correct   Δ MB   Δ Anchors
Finland           -8           6.5    3.4         0.4
Latvia            -9           3.1    3.8        -1.5
Montenegro         8           9.8    8.3         3.4


Figure 11: Anchor items compared in 2012 and 2015

Discussion

PISA scores are used to compare education systems between countries (comparing countries) and within countries (tracking trends). In this study, the main focus was on this second goal, mainly because some unexplained jumps in scores occurred in the 2015 PISA cycle. This study aimed to investigate the validity of trends in PISA by comparing them to trends in percentage correct scores and trends in market basket scores.

I found that PISA trends differ from percentage correct trends and market basket trends. In the case of Finland, for example, the drop in PISA scores in 2015 was not reproduced in the percentage correct score or in the market basket score. The differences in trends were not only differences in the magnitude of score changes, but also differences in their direction.


The PISA trend in Finland was particularly surprising in light of the anchor items, which, for the most part, show small, positive changes in scores. Based on the anchor items, one would expect Finland to perform better in 2015 than in 2012, and certainly not to drop sharply in score in 2015 compared to 2012.

These results lead me to conclude that PISA scores could be influenced by methodological choices or “artifacts”; to establish this conclusively, and to find out what exactly caused the changes in scores, more research is needed. Scale drift, for example, could have caused changes in scales and scores. The changes could also be caused by new estimation or linking methods that yield different scales or scores.

Limitations

A limitation of this study is that, to execute the market basket approach, some items had to be excluded because they were not administered in each of the four countries in this study. Especially many items were excluded in 2015: almost a hundred. The deleted items could potentially cause changes in the overall scores. The market basket scores, however, closely resemble the percentage correct scores, for which no methodological adjustments were made.

Another limitation of this study is that I could not reproduce the PISA trends. This could be because I did not have the data for every country, while PISA uses one model for all countries together. Not being able to reproduce the PISA scores made it difficult to dig further into the exact reasons why the market basket trends and the PISA trends differ; we are therefore left guessing for now.

Implications

The results of this study mean that PISA trends could be influenced by changes in methodology and by scale drift, and are therefore to be interpreted with care until further research is done. PISA produces the same results as the market basket approach when comparing countries within each cycle (Zwitser et al., 2016), but there are big differences between the market basket trends and the PISA trends over the cycles. These differences being so big could mean, for example, that a Dutch science PISA score of 522 in 2012 and of 509 in 2015 does not necessarily imply that the children actually got worse at science. For now, it would therefore be unadvisable to make big changes to education systems based on the PISA trends alone.

Recommendations

To find out what exactly caused the discrepancies between the trends in PISA and in the market basket approach, further research is needed. This research could focus on more countries than just the four in this paper, to see whether strange patterns appear in multiple countries. There could also be further investigation into the role of the deleted items in the market basket approach, especially because there were so many deleted items in 2015. A study focusing on scale drift could give insight into whether drift occurred throughout the PISA cycles. Lastly, it would be interesting to investigate the effect of the 2015 changes in method on the scales and scores.

This study also applied the market basket approach in a new situation. As Zwitser et al. (2016) showed, the market basket approach yields similar results when comparing countries within a single PISA cycle. This study showed that the market basket score trends relate more strongly to trends in the answers on administered items (the percentage correct score) than the PISA score trends currently do. This could mean that a market basket approach is more suitable for comparing cycles in a survey such as PISA.


For the market basket approach to be used, it is important that all questions are administered to at least some children in each country and that a new way of reporting is developed. The market basket approach leans partly on policy makers looking at the results of their country in combination with the items (or the topics of the items) to gain insight into the education system over the years and compared to other countries. To achieve and facilitate this active interpretation of the results, I suggest that the results be reported in an interactive way that allows policy makers to investigate them freely.

This study showed how the market basket approach can be used to investigate trends in international surveys. It also showed that the trends found in PISA might not be that reliable; especially the 2015 cycle of PISA shows many discrepancies with the market basket scores and percentage correct scores. This means that The Guardian and The Economist might have been too quick to judge the Finnish teachers and uninspiring curriculum.


References

Bechger, T. M., & Maris, G. (2015). A statistical test for differential item pair functioning. Psychometrika, 80(2), 317–340. doi:10.1007/s11336-014-9408-y.

Bock, R. D. (1996). Domain-referenced reporting in large-scale educational assessments. Commissioned paper to the National Academy of Education for the Capstone Report of the NAE Technical Review Panel on State-NAEP Assessment, Washington, DC.

Deng, H., & Melican, G. (2010). An investigation of scale drift for arithmetic assessment of ACCUPLACER.

Gillis, S., Polesel, J., & Wu, M. (2016). PISA Data: Raising concerns with its use in policy settings. The Australian Educational Researcher, 43, 131-146.

Grisay, A., Gonzalez, E., & Monseur, C. (2009). Equivalence of item difficulties across national versions of the PIRLS and PISA reading assessments. IERI monograph series: Issues and methodologies in large-scale assessments, 2, 63-84.

Haberman, S., & Dorans, N. J. (2009). Scale consistency, drift, stability: Definitions, distinctions and principles. Paper presented in J. Liu & S. Haberman (Chairs), Inconsistency of scaling function: Scale drift or sound equating.

Holland, P., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Huang, X., Wilson, M., & Wang, L. (2016). Exploring plausible causes of differential item functioning in the PISA science assessment: Language, curriculum or culture. Educational Psychology, 36, 378-390.

Kim, S. H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22(2), 131-143.

Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210–231. doi:10.1007/s11336-013-9347-z.


Liu, J., Curley, E., & Low, A. (2009). A scale drift study. ETS Research Report Series, 2009


Livingston, S. A. (2014). Equating test scores (without IRT). Educational testing service.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.

Mazzeo, J., Kulick, E., Tay-Lim, B., & Perie, M. (2006). Technical report for the 2000 market-basket study in mathematics. ETS-NAEP report, 06-T01.

Michaelides, M. P. (2010). A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating. Frontiers in Psychology, 1.

Mislevy, R. J. (1998). Implications of market-basket reporting for achievement-level setting. Applied Psychological Measurement, 11, 49–63.

Monseur, C., & Berezner, A. (2007). The computation of equating errors in international surveys in education. Journal of Applied Measurement, 8(3), 323–335.

NCES. (1997, October). NAEP reconfigured: An integrated redesign of the national assessment of educational progress (Tech. Rep. No. 97-31). National Center For Educational Statistics. Retrieved from http://nces.ed.gov/pubs97/9731.

OECD (2016), PISA 2015 Results (Volume I): Excellence and Equity in Education, OECD Publishing, Paris. DOI: http://dx.doi.org/10.1787/9789264266490-en

Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53, 315-333.

Oliveri, M. E., & Von Davier, M. (2014). Toward increasing fairness in score scale calibrations employed in international large-scale assessments. International Journal of Testing, 14, 1–21. doi:10.1080/15305058.2013.825265

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8(2), 137-156.

Robitzsch, A., Kiefer, T., & Wu, M. (2017). TAM: Test analysis modules. R package. Retrieved from https://cran.r-project.org/web/packages/TAM/TAM.pdf

Rogers, H. J. (2005). Differential item functioning. Encyclopedia of statistics in behavioral science.


Taherbhai, H., & Seo, D. (2013). The Philosophical Aspects of IRT Equating: Modeling Drift to Evaluate Cohort Growth in Large‐Scale Assessments. Educational Measurement: Issues and Practice, 32(1), 2-14.

Urbach, D. (2013). An investigation of Australian OECD PISA trend results. In M. Prenzel, M. Kobarg, K. Schöps & S. Rönnebeck (Eds.), Research on PISA: Research outcomes of the PISA conference 2009. Springer. doi:10.1007/978-94-007-4458-5.

Wanjohi, R. G., van Rijn, P. W., & von Davier, A. A. (2013). A state space approach to modeling irt and population parameters from a long series of test administrations. In New developments in quantitative psychology (pp. 115-132). Springer, New York, NY.

Yildirim, H. H., & Berberoĝlu, G. (2009). Judgmental and statistical DIF analyses of the PISA-2003 mathematics literacy items. International Journal of Testing, 9(2), 108-121.

Zwitser R. J., Glaser, S. S. F., & Maris, G. (2016). Monitoring Countries in a Changing World: A New Look at DIF in International Surveys. Psychometrika, 1-23.


Appendix 1

These are the items that were deleted because they were not administered in one or more of the four countries in this study.

2000 (28 items): M300Q1AT, M300Q1BT, M300Q1CT, M302Q01T, M302Q02, M302Q03, M303Q01T, M305Q01, M307Q01, M307Q02, M309Q01T, M309Q02T, M309Q03T, M309Q04T, S301Q01T, S301Q02T, S301Q03A, S301Q04, S302Q01, S302Q02, S304Q01, S304Q02, S305Q01T, S305Q02, S305Q03T, R076Q05, R100Q05, R119Q08

2003 (4 items): M144Q01T, M464Q01T, M704Q01T, M704Q02T

2006: none

2009 (31 items): R227Q02T, R403Q01, R403Q02, R403Q03, R403Q04, R417Q03, R417Q04, R417Q06, R417Q08, R429Q08, R429Q09, R429Q11, R433Q01, R433Q02, R433Q05, R433Q07, R435Q01, R435Q02, R435Q05, R435Q08T, R445Q01, R445Q03, R445Q04, R445Q06, R462Q02, R462Q04, R462Q05, R465Q01, R465Q02, R465Q05, R465Q06

2012 (26 items): M934Q01, M934Q02, M936Q01, M936Q02, M939Q01, M939Q02, M942Q01, M942Q02, M942Q03, M948Q01, M948Q02, M948Q03, M957Q01, M957Q02, M957Q03, M961Q02, M961Q03, M961Q05, M967Q01, M967Q03T, M985Q01, M985Q02, M985Q03, M991Q01, M991Q02D, R420Q10

2015 (98 items): R445Q03, R445Q04, R445Q06, R462Q04, R435Q02, R435Q01, R435Q08, R219Q01, R220Q02B, M192Q01, M948Q0, M948Q02, M948Q03, M936Q01, M961Q03, M939Q01, M939Q02, M967Q01, M967Q03, S252Q01, S252Q02, S252Q03, S327Q01, S456Q01, S456Q02, S133Q01, S133Q03, S133Q04, S627Q01, S627Q03, S627Q04, S635Q01, S635Q02, S635Q04, S603Q01, S603Q03, S603Q04, S603Q05, S602Q01, S602Q02, S602Q04, S607Q01, S607Q02, S646Q01, S646Q02, S646Q03, S608Q01, S608Q02, S608Q03, S605Q01, S605Q02, S605Q03, S649Q01, S649Q03, S649Q04, S634Q01, S634Q02, S634Q04, S620Q01, S620Q02, S638Q01, S638Q02, S638Q04, S625Q02, S625Q03, S615Q07, S615Q01, S615Q02, S615Q05, S604Q02, S645Q01, S645Q03, S657Q01, S657Q02, S657Q03, S656Q01, S656Q04, S643Q01, S643Q02, S643Q04, S629Q02, S629Q04, S648Q02, S648Q03, S641Q01, S641Q02, S641Q03, S641Q04, S637Q02, S601Q01, S601Q02, S601Q04, S610Q02, S610Q04, S626Q01, S626Q02, S626Q03, S466Q05


Appendix 2

The R-code below was used to estimate the missing data for each participant. The function sample.response() takes a theta and the step parameters of one item and samples a single response. The function estimate() uses sample.response() to sample a response for each participant on each item; it also returns the estimates of the parameters and the plausible values (θ) for each participant. To run this code you need the TAM package (Robitzsch, Kiefer & Wu, 2017).

library(TAM)

# Sample a single response for one item under the partial credit model,
# given an ability (theta) and the item's step parameters (pars).
sample.response <- function(theta, pars) {
  tmp  <- sapply(1:length(pars), function(m) exp(m * theta - sum(pars[1:m])))
  tmp2 <- c(1, tmp)                      # include the term for category 0
  pr   <- tmp2 / sum(tmp2)               # normalize to response probabilities
  response <- sample(0:length(pars), 1, prob = pr)  # draw one category
  return(response)
}

# Fit a partial credit model, draw one plausible value per participant, and
# sample a plausible response for every participant on every item.
estimate <- function(Dat) {
  model <- tam.mml(Dat, irtmodel = "PCM")

  # Collect the step parameters (xsi) of each item into one matrix.
  pars <- matrix(NA, ncol(Dat), 5)
  colname <- colnames(Dat)
  rownames(pars) <- colname
  for (i in 1:ncol(Dat)) {
    idx <- grep(colname[i], rownames(model$xsi))
    pars[i, 1:length(idx)] <- model$xsi[idx, "xsi"]
  }

  # One plausible value (theta) per participant.
  pv <- tam.pv(model, nplausible = 1)

  # Sample a response for each participant on each item.
  FilledDat <- Dat
  for (i in 1:ncol(FilledDat)) {
    sampleIt <- function(x) {
      sample.response(x, pars[colname[i], !is.na(pars[colname[i], ])])
    }
    FilledDat[, i] <- apply(as.data.frame(pv$pv[, 2]), 1, sampleIt)
  }

  return(list(FilledDat = FilledDat, Theta = pv$pv[, 2], Parameters = pars))
}
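A hypothetical usage sketch (the data object math2012 is invented; it would be the score matrix of one country-cycle, with NA for items a child was not administered):

res <- estimate(math2012)
str(res$FilledDat)  # completed response matrix (observed + sampled scores)
head(res$Theta)     # one plausible value per participant
res$Parameters      # estimated step parameters per item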


Appendix 3

These are the items that reappear in multiple years for the four chosen countries (gray = administered). There are four items that appear in each year.

2000 2003 2006 2009 2012 2015 M155Q01 1 1 1 1 1 1 R220Q01 1 1 1 1 1 1 R220Q04 1 1 1 1 1 1 S256Q01 1 1 1 1 1 1 M034Q01T 1 1 1 1 1 M155Q04T 1 1 1 1 1 M192Q01T 1 1 1 1 1 M273Q01T 1 1 1 1 1 R220Q02B 1 1 1 1 1 S269Q01 1 1 1 1 1 S269Q04T 1 1 1 1 1 R055Q01 1 1 1 1 1 R067Q01 1 1 1 1 1 R102Q07 1 1 1 1 1 R104Q01 1 1 1 1 1 R104Q02 1 1 1 1 1 R104Q05 1 1 1 1 1 R111Q01 1 1 1 1 1 R220Q05 1 1 1 1 1 R220Q06 1 1 1 1 1 R227Q01 1 1 1 1 1 R055Q02 1 1 1 1 R055Q03 1 1 1 1 R055Q05 1 1 1 1 R067Q04 1 1 1 1 R067Q05 1 1 1 1 R102Q04A 1 1 1 1 R102Q05 1 1 1 1 R111Q02B 1 1 1 1


R111Q06B 1 1 1 1 R219Q02 1 1 1 1 R227Q03 1 1 1 1 R227Q06 1 1 1 1 M033Q01 1 1 1 1 1 S268Q01 1 1 1 1 S268Q06 1 1 1 1 M155Q02T 1 1 1 M155Q03T 1 1 1 R219Q01E 1 1 1 R219Q01T 1 1 1 R227Q02T 1 1 1 S114Q03T 1 1 1 S114Q04T 1 1 1 S114Q05T 1 1 1 S131Q02T 1 1 1 S131Q04T 1 1 1 S213Q01T 1 1 1 S213Q02 1 1 1 S268Q02T 1 1 1 S269Q03T 1 1 1 R083Q01 1 1 1 R083Q02 1 1 1 R083Q03 1 1 1 R083Q04 1 1 1 R101Q01 1 1 1 R101Q02 1 1 1 R101Q03 1 1 1 R101Q04 1 1 1 R101Q05 1 1 1 R245Q01 1 1 1 R245Q02 1 1 1


M411Q01 1 1 1 1 1 M411Q02 1 1 1 1 1 M423Q01 1 1 1 1 1 M442Q02 1 1 1 1 1 M446Q01 1 1 1 1 1 M447Q01 1 1 1 1 1 M474Q01 1 1 1 1 1 M496Q02 1 1 1 1 1 M559Q01 1 1 1 1 1 M564Q01 1 1 1 1 1 M564Q02 1 1 1 1 1 M571Q01 1 1 1 1 1 M800Q01 1 1 1 1 1 M828Q03 1 1 1 1 1 S326Q03 1 1 1 1 1 M406Q01 1 1 1 1 M406Q02 1 1 1 1 M408Q01T 1 1 1 1 M420Q01T 1 1 1 1 M446Q02 1 1 1 1 M496Q01T 1 1 1 1 M603Q01T 1 1 1 1 M803Q01T 1 1 1 1 M828Q01 1 1 1 1 M828Q02 1 1 1 1 S326Q01 1 1 1 1 S326Q02 1 1 1 1 S326Q04T 1 1 1 1 M603Q02T 1 1 1 M305Q01 1 1 1 1 S304Q02 1 1 1 S408Q01 1 1 1 1


S408Q05 1 1 1 1 S413Q05 1 1 1 1 S413Q06 1 1 1 1 S415Q02 1 1 1 1 S425Q02 1 1 1 1 S425Q05 1 1 1 1 S428Q01 1 1 1 1 S428Q03 1 1 1 1 S438Q02 1 1 1 1 S465Q02 1 1 1 1 S465Q04 1 1 1 1 S478Q01 1 1 1 1 S498Q03 1 1 1 1 S521Q02 1 1 1 1 S521Q06 1 1 1 1 M464Q01T 1 1 1 S408Q03 1 1 1 S408Q04T 1 1 1 S413Q04T 1 1 1 S415Q07T 1 1 1 S415Q08T 1 1 1 S425Q03 1 1 1 S425Q04 1 1 1 S428Q05 1 1 1 S438Q01T 1 1 1 S465Q01 1 1 1 S466Q01T 1 1 1 S466Q05 1 1 1 S466Q07T 1 1 1 S478Q02T 1 1 1 S478Q03T 1 1 1 S498Q02T 1 1 1


S498Q04 1 1 1 S514Q02 1 1 1 S514Q03 1 1 1 S514Q04 1 1 1 S519Q01 1 1 1 S519Q02T 1 1 1 S519Q03 1 1 1 S527Q01T 1 1 1 S527Q03T 1 1 1 S527Q04T 1 1 1 R404Q03 1 1 1 R404Q06 1 1 1 R412Q01 1 1 1 R412Q05 1 1 1 R424Q03 1 1 1 R424Q07 1 1 1 R437Q01 1 1 1 R437Q06 1 1 1 R446Q03 1 1 1 R453Q01 1 1 1 R455Q04 1 1 1 R456Q01 1 1 1 R466Q06 1 1 1


Appendix 4

Table 1
Math scores in PISA cycles

        Finland                Latvia                 Montenegro             Netherlands
Year    PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct
2000    536   58.4    60.8     463   49.6    51.4     -      -      -        -*    59.3    62.5
2003    544   61.0    61.5     483   51.2    52.6     -      -      -        538   57.6    58.6
2006    548   57.5    57.5     486   43.6    44.8     399   25.9    25.9     531   54.1    54.7
2009    541   54.4    53.6     482   41.5    42.5     403   25.1    25.1     526   50.7    52.5
2012    519   50.4    48.0     491   43.9    44.8     410   27.7    27.2     523   50.9    50.1
2015    511   53.7    54.5     482   47.7    47.9     418   36.0    37.0     512   52.6    52.6

Notes: * The Netherlands did not have enough observations for PISA scores to be estimated.


Table 2
Reading scores in PISA cycles

        Finland                Latvia                 Montenegro             Netherlands
Year    PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct
2000    546   71.0    71.5     463   49.6    51.4     -      -      -        -*    65.6    67.7
2003    543   69.1    69.0     483   51.2    52.6     -      -      -        513   61.6    62.2
2006    547   67.7    67.9     486   43.6    44.8     392   33.6    33.2     507   59.7    59.8
2009    536   67.0    66.2     482   41.5    42.5     408   39.6    40.4     508   59.9    62.5
2012    524   64.5    61.8     491   43.9    44.8     422   44.1    44.1     511   61.6    61.2
2015    526   64.5    64.3     482   47.7    47.9     427   46.9    47.1     503   60.3    60.4

Notes: * The Netherlands did not have enough observations for PISA scores to be estimated.


Table 3
Science scores in PISA cycles

        Finland                Latvia                 Montenegro             Netherlands
Year    PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct  PISA   MB   % Correct
2000    538   57.5    59.5     460   47.7    49.8     -      -      -        -*    53.0    56.1
2003    548   61.9    61.1     489   52.3    53.0     -      -      -        524   54.9    55.6
2006    563   58.1    58.9     490   58.4    57.8     412   59.5    55.1     525   54.5    55.8
2009    554   64.6    63.2     494   52.1    52.9     401   33.7    32.9     522   57.5    59.3
2012    545   62.2    59.3     502   53.3    54.3     410   35.0    34.4     522   57.5    56.6
2015    531   63.3    63.7     490   53.9    53.6     411   40.2    39.7     509   57.6    57.1

Notes: * The Netherlands did not have enough observations for PISA scores to be estimated.
