
Using comparative judgement for the assessment of written texts in primary education

An exploratory study into the validity, reliability and efficiency of using comparative judgement for the assessment of written texts in primary education and the effect of spelling errors in this assessment

Name: Carlijn van Herpt
Student number: s4598393
MA Thesis General Linguistics

Date: 11 March 2020

Primary supervisor: prof. dr. J.J.M. Schoonen
Secondary supervisor: dr. F.W.P. van der Slik


Acknowledgements

Before you lies my master's thesis ‘Using comparative judgement for the assessment of written texts in primary education’. With this thesis, I will finish the Master's in General Linguistics at Radboud University. Firstly, I would like to thank my primary supervisor, Rob Schoonen, for his guidance over the last couple of months. In addition, I would like to thank all the primary schools, teachers and pupils that were willing to participate in the writing task. Of course, I am also grateful to all the assessors that participated in the assessment of the texts. Finally, I would like to thank all of my colleagues at Cito for their help in writing this thesis. I would especially like to thank my supervisor, Pauline Roumans, for all her guidance, help and enthusiasm.

March 2020
Carlijn van Herpt


Table of contents

Acknowledgements
Table of contents
Abstract
1. Introduction
1.1 Aims of writing proficiency in education
1.2 Assessment of writing proficiency
1.3 Scoring methods
1.3.1 Primary trait method
1.3.2 Analytical scoring
1.3.3 Holistic scoring
1.3.4 Scale rating
1.3.5 Which scoring method to use
1.4 Comparative judgement
1.4.1 How does it work?
1.4.2 Efficiency
1.4.3 Reliability
1.4.4 Validity
1.4.5 Comparative judgement in primary school
1.4.6 Limitations of comparative judgement
1.5 Effect of spelling errors on assessment
1.5.1 No effects of spelling errors
1.5.2 Negative effect of spelling errors
1.6 This study
1.6.1 Hypothesis
1.6.2 Relevance
2. Methodology
2.1 The assignment
2.2 Participants
2.2.1 Schools
2.2.2 Assessors
2.3 Procedures
2.3.1 Format of texts
2.3.2 Manipulation of texts
2.3.3 Assessment of the texts
2.3.4 Comparative judgement software
2.4 Design
2.5 Measurements and analyses of validity, reliability and efficiency
3. Results
3.1 Effects of spelling errors and validity of the assessment
3.2 Efficiency and reliability
3.3 Ability scores
3.4 Evaluation forms
3.5 Feedback from assessors
4. Discussion
4.1 Interpretation of the results
4.1.1 Effect of spelling errors and validity of the assessment
4.1.3 Efficiency of the assessment
4.1.4 Focus of attention in the assessment
4.2 General discussion
4.3 Limitations of the assessment
4.4 Recommendations
5. Conclusion
6. References
7. Appendices
Appendix 1 – Components of writing proficiency
Appendix 2 – Writing task Verstoppertje
Appendix 3 – Manipulated texts
Appendix 4 – Assessment instruction
Appendix 5 – Evaluation form of the assessment
Appendix 6 – Text LT8011
Appendix 7 – Text LT8014


Abstract

This study investigated whether comparative judgement is a reliable, valid and efficient scoring method for the assessment of pupils' writing proficiency in Cito's Centrale Eindtoets (CET), and to what extent assessors are influenced by different components of writing proficiency, such as spelling errors, in the comparative judgement process. A total of 67 written texts were assessed by 23 assessors in the comparative judgement program Comproved. Ten of the written texts were manipulated for spelling errors; the scores of these texts with spelling errors were compared to the scores of the same texts without spelling errors. As in previous studies, the assessment turned out to be reliable, quick and easy to perform. A medium-sized effect of spelling errors on the judgement of the texts was found: texts with spelling errors received significantly lower ability scores than the same texts without spelling errors. This finding, together with the finding that assessors based their decisions on aspects of writing proficiency, supports the argument that the assessment was valid. However, more directed instruction would be advisable for further assessments with comparative judgement. The reliability coefficient in the current assessment was not high enough for high-stakes assessment such as the CET, but comparative judgement might still be used for performance assessment within the classroom, because it is a reliable, valid and efficient scoring method. Further research is needed to investigate the possibility of using comparative judgement for national high-stakes assessment.


1. Introduction

In the Netherlands, all children in the last grade of primary school (“groep 8”), aged around eleven and twelve years old, are obliged to participate in a final assessment. This high-stakes assessment helps predict the best type of secondary education for pupils. The final assessment that has been made available by the government is the Centrale Eindtoets (Central Final Test of Primary Education; CET). The College voor Toetsen en Examens (CvTE)1 is responsible for the execution of the CET. The CET is produced by Cito, commissioned by CvTE. The CET comprises three subjects: mathematics, language and (optionally) world orientation. The language section of the CET consists of Reading, Grammar, Spelling, and Writing.

Writing is an important aspect of all learning, not only in primary school but also in higher education. Unfortunately, writing proficiency is currently not assessed in its optimal format in the CET. At present, students' writing proficiency is tested in the CET by means of revision questions on different aspects of writing, in which students have to read a written text and answer questions about how the text can be improved. In this scenario, however, the writing performance one wants to measure does not resemble the task that the pupil has to complete, namely answering revision questions. This impairs the validity of the assessment (Kane, Crook & Cohen, 2005). Offering open-ended writing tasks instead of revision questions might be a more valid way of assessing pupils' writing proficiency. The present study is an exploratory study into the possibilities of constructing and assessing an open-ended writing task for the purpose of the CET; to this end, an open-ended writing task was designed.

1.1 Aims of writing proficiency in education

The Referentiekader Taal en Rekenen (Frame of Reference; Expertgroep Doorlopende Leerlijnen Taal en Rekenen, 2009) includes guidelines about what pupils of certain ages should know and be able to do when it comes to Dutch language and mathematics. It defines different reference levels of knowledge and skills that describe the intended outcomes of education in the Netherlands, distinguishing between a fundamental level (F) and a target level (S). The fundamental level for pupils at the end of primary school, the pupils of concern for the CET, is level 1F and the target level is level 2F. The reference levels are based on the Kerndoelen primair onderwijs2 (Ministerie van Onderwijs, Cultuur en Wetenschap, 2006) and the Common European Framework of Reference3. Teachers can use these reference levels to monitor and evaluate the progress of their pupils.

When it comes to writing proficiency, the Frame of Reference distinguishes six characteristics of the task performance:

• coherence of the text;
• alignment with the purpose of the text;
• alignment with the audience of the text;
• use of words and vocabulary;
• spelling, punctuation and grammar;
• readability of the text.

1 College voor Toetsen en Examens is an administrative body responsible for the quality and level of central assessment in the Netherlands on behalf of the government.

2 The Kerndoelen primair onderwijs (core objectives of primary education) are generally formulated objectives that describe what the education of children should focus on.


Coherence is about the structure of the text, use of conjunctions and about whether the goal of the text is met. Alignment with the purpose of the text is about whether the pupil has understood the purpose of the writing task and executed this correctly. Alignment with the audience of the text is about adjusting the language use, for example formal or informal language use, vocabulary and tone of the text to the audience. The components use of words and vocabulary and spelling, punctuation and grammar speak for themselves. The Frame of Reference distinguishes separate descriptions for the characteristic spelling, punctuation and grammar, which will not be discussed in detail here. The final component, readability of the text, is about basic text conventions, such as using a title and headers, and the lay-out of a text. A more detailed description of each characteristic according to the Frame of Reference at level 1F and level 2F can be found in Appendix 1.

1.2 Assessment of writing proficiency

A valid task measures accurately what it is intended to measure (Hughes, 2002). So, a valid writing task measures writing proficiency. There are different forms of validity (Bachman, 1990; Messick, 1989). Content validity refers to the extent to which a task represents the domain that is being assessed. Construct validity refers to the degree to which a task measures the construct it is intended to measure. Criterion-related validity is divided into concurrent validity and predictive validity. Concurrent validity is the correlation between the scores on the current task and the scores on a different task that is supposed to measure the same construct, in this case writing proficiency. Predictive validity refers to the extent to which a task can predict future scores of the candidate based on the current task. As the main purpose of the CET is to give students advice about the best type of secondary education, predicting the extent to which students can cope with the level of this type of secondary education is of great importance.

Another form of validity is face validity, which comes down to whether a task measures what it seems to measure. However, there are ongoing discussions about face validity not being considered a true measure of validity, because it is not about what the measurement actually measures, but about what it seems to measure (Bachman, 1990; Mosier, 1947). Nevertheless, face validity is important in the CET. Teachers, parents and pupils expect writing proficiency to be assessed via writing a text. This is, however, not the way it is currently assessed. This damages the face validity of the assessment of writing proficiency in the CET. An open-ended writing task might increase the face validity of the assessment of writing proficiency.

Performance assessment should not only be valid, but also reliable (Bachman & Palmer, 1996). Reliability refers to the extent to which the same scores can be obtained when the task is performed a second time; in other words, it refers to the consistency of a measure (Bachman & Palmer, 1996; Hughes, 2002). The reliability of a test can be expressed by a reliability coefficient, for example Cronbach's alpha or the greatest lower bound. The preferred level of reliability of an assessment depends on its purpose. According to the guidelines of the COTAN4 (Evers, Lucassen, Meijer, & Sijtsma, 2010), a reliability coefficient of .70 or higher is sufficient for low-stakes assessment, but high-stakes assessment, where selection is the primary goal, should have a reliability of .90 or higher.
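To make the notion of a reliability coefficient concrete, the sketch below shows one common way of computing Cronbach's alpha from an item-score matrix. It is an illustration only, not part of the analyses in this thesis, and the example data are invented.

```python
# Illustrative sketch only (not part of the thesis' analyses): Cronbach's alpha
# for a score matrix of shape (candidates x items), e.g. texts rated on several criteria.
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (candidates x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Invented example: five candidates rated on three items.
example = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(round(cronbach_alpha(example), 2))  # about 0.96 for these highly consistent ratings
```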

Crucial for the reliability and validity of performance assessment are so-called judgement effects. Several judgement effects can arise during the assessment of written texts (Meuffels, 2004). One of them is the halo effect, which means that an assessor's judgement is influenced by irrelevant aspects of either the student or the text; the assessor could, for example, be influenced by a student's handwriting. A second effect that can arise is a norm shift: assessors might adjust their judgements during the scoring process. They can, for example, be stricter for the first sample of texts and less strict as the scoring process progresses. Whether a text is scored at the beginning or at the end of the scoring process may in that case affect its score. Another effect that can occur during a scoring process is that a written text of average quality is scored after a range of excellent written texts. The average text will then likely receive a lower score than it deserves, because it was preceded by excellent texts. This is called the sequence effect. Some of these judgement effects are more likely to occur in specific scoring methods than in others.

4 The COTAN (Commissie Testaangelegenheden Nederland) is a central committee of the Dutch Institute of Psychologists that aims to promote the quality of tests and test use.

Another issue in performance assessment is that assessors can differ in their judgements about texts (Andrich, 1978; Bloxham & Price, 2015), even when clear scoring criteria are given (Hanlon, Jefferson, & Molan, 2005). These differences occur because assessors value different aspects of a text differently (Brooks, 2012).

1.3 Scoring methods

There are different scoring methods for performance assessment, each with its own strengths and weaknesses. The most appropriate assessment form depends on the main aim of the assessment. In this section, several different scoring methods will be discussed.

1.3.1 Primary trait method

One of the possible scoring methods for the assessment of written texts is the primary trait method. Primary trait scoring can be used if the aim of the assessment is to assess specific components of writing. In this method, the focus of the scoring, as the name suggests, is on the primary trait of the assessment. This primary trait can be one or more aspects of writing that are crucial for successfully writing the text (Davis, 2018). Such primary trait factors can include grammar, content, spelling, text length or other aspects, depending on the purpose of a study or assessment. Primary trait scoring allows students to focus primarily on one or several aspects of the writing process and it allows teachers to provide their students with explicit feedback on specific components of writing. However, a disadvantage of primary trait scoring is that the development of detailed scoring guidelines for every writing task is very labour intensive and time consuming (Weigle, 2002). As a result, primary trait scoring is not a frequently used scoring method and is mostly used for assessments in which the aim is to collect information about specific writing skills of students. An example of this is classroom testing of writing in primary school classes.

1.3.2 Analytical scoring

Probably the most widely used and best-known method of scoring students' texts is rating texts on several different aspects of writing by means of a scoring rubric. This is known as analytical scoring. Unlike primary trait scoring, the focus of the scoring in analytical scoring is divided across these different aspects of writing (Weigle, 2002). These can for instance include the content, structure, coherence or language use in the text, once again depending on the aim of the assessment. Analytical scoring can provide diagnostic information about students' performance on the aspects of writing included in the scoring rubric. This is the reason why analytical scoring is often preferred over primary trait scoring, which only provides this detailed information about one or several aspects of writing, and over holistic scoring, in which a text is scored based on an overall impression of the text.


Another benefit of analytical scoring over holistic scoring is that the scoring criteria ensure that assessors pay attention to specific aspects of writing proficiency which they might otherwise have missed (Hughes, 2002). Furthermore, the reliability of analytical scoring can be higher than that of holistic scoring, because with analytical scoring a text is given a separate score for a number of aspects of the task (Hughes, 2002). A major disadvantage of analytical scoring is that it is more time consuming than holistic scoring (Hughes, 2002; Weigle, 2002). In addition, according to Hughes, dividing the focus of the assessment over different aspects of a text might divert attention from the overall effect of the text.

1.3.3 Holistic scoring

As mentioned, in holistic scoring a text is judged as a whole (Huot, 1990). A major benefit of scoring based on an overall impression of a text is that it is faster than analytical scoring. Another advantage of holistic scoring is that it can be more valid than analytical scoring (White, 1984, 1985). According to White, holistic scoring is closest to the true reaction of a reader. White argues that in the case of analytical scoring, the fact that the attention is divided over different aspects of the text distracts from the intention of the text as a whole. However, it is not always easy to interpret the scores of holistic scoring, because it is not certain what assessors based their decisions on and whether their criteria are alike (Weigle, 2002).

If the aim of an assessment is to gain diagnostic information about students' learning outcomes, holistic scoring is probably not the best scoring method to use (Bacha, 2001). In Bacha's study, the results of holistic scoring were compared to those of analytical scoring to gain insight into the assessment of essay writing by students of English as a foreign language (EFL). Although the reliability of holistic scoring was high, Bacha found that it did not provide enough information about the different components of students' writing proficiency. She concluded that assessors should focus more on students' language use and vocabulary in the texts and that a combination of holistic and analytical scoring would be necessary to achieve this. These findings match the recommendation of Harsch and Martin (2012), who recommended using an approach combining holistic and analytical scoring for the assessment of writing proficiency in order to guarantee the quality of the assessment and to increase its reliability at the same time. A combination of holistic and analytical scoring might thus be a good scoring methodology in cases of second language learning.

1.3.4 Scale rating

A fourth scoring method is scale rating. In this method, several texts of different qualities are selected as anchor texts and constructed into a scale. In the actual scoring process, every text has to be compared to the anchors on the scale, and the texts are given a score based on where they fall on the scale. Feenstra (2014) studied the inter-rater reliability, generalizability and construct validity of writing scores obtained via anchored analytical assessment. A total of 620 pupils participated in her study. Three aspects of writing were rated: content, structure and correctness (of syntax, spelling and punctuation). Assessors scored the written texts either according to analytical questions or according to analytical questions with anchor essays. The addition of anchor essays turned out to lead to higher inter-rater reliability when the assessors focused on assessing text structure. In addition, the use of anchor essays, in comparison to an analytical scoring procedure, led to more generalizable scores and improved the construct validity of the writing assessment. Feenstra concluded that anchor essays are a useful addition to analytical scoring.


In addition, a study by Pollman, Prenger and De Glopper (2012) showed that assessors experience scale rating as a rapid and simple scoring method once they have familiarised themselves with the anchor texts and the differences between them. Results of the same study also showed that this scoring method can lead to high reliability. Moreover, the occurrence of sequence effects or norm shifting declines, because every text has to be compared to the anchor texts (Pollman et al., 2012). However, a disadvantage of scale rating is that selecting the anchor texts is very time consuming and labour intensive.

1.3.5 Which scoring method to use

It can be concluded that each scoring method has its own strengths and weaknesses. Which scoring method is preferred depends partly on the purpose and the circumstances of the assessment. If it is necessary to gain diagnostic information about students' learning outcomes, for instance in the case of second language learners, analytical scoring is probably the best method to use. However, if it is necessary to obtain test scores quickly, holistic scoring might be the most appropriate scoring method, because it is less time consuming. Something that applies to each of the described scoring methods is that it is always important to provide the assessors with training. Furthermore, the reliability of the assessment can be increased if each text is scored more than once by different trained assessors (Hughes, 2002).

At present, the advantages and disadvantages of these scoring methods are not an issue of importance in the assessment of the CET, because all test items in the CET need to be assessed automatically. Therefore, all test items, including the revision questions that assess writing proficiency, are multiple choice questions. However, if open-ended writing tasks were to be added to the writing proficiency part of the CET in order to increase the validity, it might not be possible to continue using this automatic form of assessment. Using automatic scoring systems for open-ended writing tasks can be difficult, because multiple answers are correct, depending on how a student interpreted the writing task. This makes it hard to predict what will be written in the text, and therefore human scorers are needed to score the texts (Brooks, 2012). Which scoring method best suits the assessment of an open-ended writing task designed for the CET is yet to be decided.

1.4 Comparative judgement

A possible scoring method for open-ended writing tasks in the CET that has not yet been discussed is comparative judgement. This scoring method can overcome judgement effects like the sequence effect and norm shift and is said to be less time consuming and labour intensive than other scoring methods, such as analytical scoring (Coertjens, Lesterhuis, Verhavert, Van Gasse, & De Maeyer, 2017). The use of comparative judgement in educational testing was first proposed by Pollitt (2004). According to Pollitt, comparative judgement could solve some of the problems that arise in the analytical scoring of students' written texts. Comparative judgement is based on making a holistic judgement about the written text as a whole, instead of an absolute judgement about several aspects of the written text, as is the case in analytical scoring. Comparative judgement can therefore be especially helpful in the scoring of written texts, because unlike math tests, where an answer is either correct or incorrect, there are many aspects of a written text that influence its quality.


1.4.1 How does it work?

The idea behind comparative judgement is that an assessor decides which of two objects, in this case texts, is the better one. The algorithm behind comparative judgement is based on randomness. Text pairs are compiled by preferring texts that have been compared the fewest times and that have not yet appeared together; if more than one text qualifies, the texts are chosen randomly (Verhavert, Bouwer, Donche, & De Maeyer, 2019). Furthermore, each assessor sees different text pairs. A first assessor might, for instance, see text 1 and text 2 as the first text pair and text 3 and text 4 as the second text pair, while a second assessor sees text 1 and text 3 as the first pair and text 2 and text 4 as the second pair. After deciding which text in a pair is the best, the assessor receives two different texts, again compiled randomly, and decides which of these is the better one. Several assessors continue to do this, in such a way that the same text is rated by different assessors.
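To make this pairing rule concrete, the sketch below implements the selection logic as described above: prefer texts that have been compared the fewest times, never repeat a pair, and break ties at random. It is a simplified illustration and not Comproved's actual algorithm; in particular, it ignores the constraint that different assessors see different pairs.

```python
# Simplified sketch of the pairing rule described above (not Comproved's actual code).
import random
from itertools import combinations

def next_pair(texts, comparison_counts, seen_pairs):
    """Pick the next pair of text ids to present for comparison."""
    candidates = [p for p in combinations(texts, 2) if frozenset(p) not in seen_pairs]
    if not candidates:
        return None  # every possible pair has already been judged

    def pair_count(p):
        return comparison_counts[p[0]] + comparison_counts[p[1]]

    # Prefer the pair whose two texts have been compared the fewest times so far,
    # and break ties at random.
    fewest = min(pair_count(p) for p in candidates)
    return random.choice([p for p in candidates if pair_count(p) == fewest])

texts = ["T1", "T2", "T3", "T4"]
counts = {t: 0 for t in texts}
seen = set()
for _ in range(3):
    pair = next_pair(texts, counts, seen)
    seen.add(frozenset(pair))
    for t in pair:
        counts[t] += 1
    print(pair)  # e.g. ('T1', 'T3'), then a pair of texts not yet compared, etc.
```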

The results are then analysed with the Bradley-Terry-Luce model (Bradley & Terry, 1952; Luce, 1959), which combines the judgements of the assessors and constructs a ranking scale of the texts (Lesterhuis, Verhavert, Coertjens, Donche, & De Maeyer, 2017). By carrying out a number of comparisons of several texts, the ability scores of the texts can be estimated by means of maximum likelihood estimation. In addition, the statistical analysis also provides the reliability coefficient for the overall assessment.

The Bradley-Terry-Luce model can be written as:

\[
p(i > j) = \frac{e^{\theta_i - \theta_j}}{1 + e^{\theta_i - \theta_j}}
\]

In this equation, i stands for text 1 and j stands for text 2; p(i > j) is the probability that text 1 will be chosen as the better text over text 2. θ_i and θ_j represent the estimated ability scores of texts i and j. When the distance on the scale between two texts is larger, it is more likely that the text with the higher ability score will be judged the better one of the pair.
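As an illustration of how ability scores can be estimated from the judgements, the sketch below fits the Bradley-Terry-Luce model by maximum likelihood on a handful of invented judgements. It is a minimal sketch under these assumptions, not the estimation routine used by Comproved or in this thesis.

```python
# Minimal sketch (not Comproved's estimation routine): fit Bradley-Terry-Luce
# ability scores by maximum likelihood from invented pairwise judgements.
import numpy as np
from scipy.optimize import minimize

# Each judgement is (winner_index, loser_index) for four hypothetical texts.
judgements = [(0, 1), (1, 0), (0, 2), (2, 1), (1, 3), (3, 2), (0, 3), (2, 3), (1, 2)]
n_texts = 4

def neg_log_likelihood(theta):
    nll = 0.0
    for winner, loser in judgements:
        diff = theta[winner] - theta[loser]
        nll -= diff - np.log1p(np.exp(diff))  # -log p(winner > loser)
    return nll

result = minimize(neg_log_likelihood, x0=np.zeros(n_texts))
theta = result.x - result.x.mean()  # the scale is arbitrary, so centre it at zero
print(np.round(theta, 2))           # higher theta = higher estimated text quality

# Probability that text 0 would beat text 3 under the fitted model:
diff = theta[0] - theta[3]
print(round(float(np.exp(diff) / (1 + np.exp(diff))), 2))
```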

1.4.2 Efficiency

Assessment via comparative judgement is faster than scoring based on a criteria list (Coertjens et al., 2017) and comparative judgements are based on the intuitions of assessors (Crisp, 2013; Greatorex, 2007). Moreover, most assessors experience it as an easy task, because they only have to decide which of two texts they find the better one (Pollitt, 2012). Pollitt (2004) estimated that scoring one text with traditional scoring methods takes assessors as much time as making ten comparisons by means of comparative judgement. Laming (2003) explained that people are used to making decisions based on comparisons, because they do it all the time, either consciously or subconsciously. Comparative judgement is therefore easy to perform and can be based on intuitions. Furthermore, it has been shown that assessment of written texts by means of comparative judgement requires only little training (Heldsinger & Humphry, 2010; Steedle & Ferrara, 2016).

The number of assessors and the number of comparisons needed to achieve a reliable performance assessment depend on the goal of the assessment. According to Verhavert, Bouwer, Donche and De Maeyer (2017), for a reliability coefficient of at least .70, the average number of comparisons per text is twelve, with at least nine and at most twenty comparisons per text. To reach a higher reliability coefficient of at least .80, the average number of comparisons per text is seventeen, with a minimum of thirteen and a maximum of 25 comparisons per text. These numbers contradict the numbers provided by Verhavert et al. (2019). According to their analyses, between ten and fourteen comparisons per text are needed to reach a reliability of .70, around nineteen or twenty comparisons per representation are needed to reach a reliability of .80, and 26 to 37 comparisons are needed to reach a reliability of .90.
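As an illustration (these figures are not taken from the cited studies), applying the first guideline of an average of twelve comparisons per text to an assessment of the size used in the present study, 67 texts judged by 23 assessors, and noting that each comparison involves two texts, gives:

\[
\frac{67 \times 12}{2} = 402 \ \text{pairwise judgements in total}, \qquad \frac{402}{23} \approx 17.5 \ \text{judgements per assessor}.
\]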

1.4.3 Reliability

Comparative judgement is associated with high levels of reliability (Bramley, 2015). The high number of assessors necessary promotes the generalizability of comparative judgement. A higher number of assessors could lead to more differences in the scoring, because different assessors might score differently. However, Bramley (2007) showed that assessors turn out to be similarly strict in their judgements and therefore states that comparative judgement is highly reliable. In addition, comparative judgement can overcome existing differences between individual assessors, because the focus is not on the individual assessor, but on the text that needs to be scored. Pollitt (2012) added that in the process of comparative judgement, scored tasks are ranked, whereby differences between individual assessors are also overcome.

How reliable comparative judgement is as a scoring method depends on the characteristics of the assessment. Verhavert et al. (2019) performed a meta-analysis to study how the reliability of comparative judgement is influenced by assessment characteristics. In their study, they looked at six different characteristics of comparative judgement that could influence the level of reliability according to their opinion: the number of comparisons that have to be made, the number of assessors, the format of the assessment task, if and how feedback is provided to the students, how many representations the assessment task should measure, and the expertise of the assessors. Verhavert et al. looked at these assessment characteristics from 49 different comparative judgement assessments in D-PAC, a software program designed for performance assessment via comparative judgement, currently known as Comproved.

For the first characteristic, not only the number of comparisons was taken into account, but also the number of representations. For the second characteristic, the number of representations per assessor was included in the analyses. The different formats of the representation were: texts, images, audio or video, and portfolios. The feedback that was provided to the students could be categorized as comparative feedback, in which assessors elaborated briefly on their decision, as pro-con feedback, in which assessors commented on strengths and weaknesses of the representation, or as no feedback. The variable expertise of the assessors consisted of five groups: expert assessors; people who were experienced in the field or who had received training for the assessment; peer assessors; students who had carried out the assessment task themselves; and novice assessors, who had no experience in the (domain of the) assessment and did not receive any training.

The relationship between these variables and the reliability of an assessment was analysed with regression analyses. The results showed that only the number of comparisons per representation affected the reliability level. Contrary to what one would expect, there was no effect of the total number of comparisons, the number of representations the task should measure, the number of assessors or the feedback type on the reliability of the assessment. Due to the small number of assessments in the representation formats images, audio or video and portfolios, no conclusions could be drawn about the effect of the representation format on the reliability. There was also no effect of the expertise of the assessors on the reliability of the assessment. Nonetheless, the level of expertise did influence the amount of effort that assessors had to put into the assessment: novices should make more comparisons per representation than experts or peers to reach a maximum level of reliability. According to Verhavert et al. (2019), the fact that experts and peers are more familiar with the representations than novices and know what to focus on in the assessment might explain this finding. It is therefore still recommended to use experts or peers as assessors, but Verhavert et al. conclude that the expertise of the assessors does not limit the maximum reliability level of an assessment.

Another study that concluded that comparative judgement was a reliable scoring method was performed by Steedle and Ferrara (2016). The aim of their study was to compare comparative judgement to rubric scoring in terms of validity, reliability and duration. English essays written by students around sixteen and seventeen years old were graded. Two assessors scored the essays according to rubric scoring and nine assessors scored the essays according to comparative judgement. Steedle and Ferrara found a reliability coefficient of .90 for the comparative judgement. They also looked at the relationship between the reliability level and the number of judgements that had to be made. It turned out that the reliability coefficient still remained above or around .80 when the number of judgements was reduced by up to 50 percent. These numbers show that the reliability of an assessment can be increased by increasing the number of assessors or the number of judgements, as was also stated by Bramley (2007).

1.4.4 Validity

In her doctoral research, Lesterhuis (2018) looked at the validity of comparative judgement for assessing academic writing. Although she focussed on secondary and higher education, and not on primary education as in the current study, the results of her research might still include valuable information about the usability of comparative judgement for the assessment of written texts. The results of Lesterhuis' study on the validity of comparative judgements for assessing the academic writing skills of pre-master students at a Belgian university showed that judges differ in their focus and in the broadness of their judgements, but that they do not differ significantly in their use of expertise. In addition, it turned out that judges base their decision on very different aspects of a text. The judges of the written texts in this study indicated what they based their decision on and why they chose one text over another. Based on these explanations, Lesterhuis found that most judges focus first on the structure and the style of the essay in their decision-making. However, the second most chosen aspect on which judges base their decisions differed a lot among the judges. Nevertheless, judges still gave more or less similar scores to the written texts. According to Pula and Huot (1993), the differences in the aspects of a text on which judges base their decisions are caused by background differences between the judges.

Lesterhuis (2018) found evidence that comparative judgement allows differences in conceptualisation between judges, because judges varied in the focus, broadness and use of expertise while judging the students’ essays. Overall, from this first study, Lesterhuis could conclude that comparative judgement offers valid scores in the assessment of academic writing. Nevertheless, she argues that more validation studies are necessary.

In a second study, Lesterhuis (2018) studied what aspects of argumentative writing in a written text were important for teachers in secondary education to decide which argumentative text is the better one. Again, the results of the study showed that assessors differ quite a lot in the aspects of a text on which they base their decision. The most important components for the assessment of the texts turned out to be organisation and argumentation. From this study, Lesterhuis concludes that comparative judgement is a valid way of assessing text quality.


1.4.5 Comparative judgement in primary school

Comparative judgement has also been practised in primary education. Heldsinger and Humphry (2010) conducted a study in a private primary school in Perth, Australia, in which they asked teachers of the school to rate narrative texts written by pupils from school years one to seven. The scoring method that they used in their study was pairwise comparison, which is similar to comparative judgement: scorers have to compare two texts to each other and decide which text is the better one. Twenty teachers rated 30 texts. The teachers received training in which the requirements of the writing task were explained to them and in which the components of writing that were most important according to the teachers were discussed. Each text was compared with between 64 and 74 other texts. The Person separation index (PSI), a value between 0 and 1 indicating the internal reliability of the assessment, was .982. The higher the value of PSI, the greater the internal reliability. Heldsinger and Humphry thus concluded that they had established a very high internal reliability.

The validity of the same assessment was determined by an experienced assessor who assessed the performance independently of the pairwise comparisons by means of a scoring rubric. There was a correlation of r = .921 between both scorings, from which Heldsinger and Humphry (2010) could conclude that there was a high concurrent validity of the assessment. Overall, it turned out that the pairwise comparison method was a reliable and valid method to assess the texts written by primary school pupils.

In England, a large scale comparative judgement assessment of written texts in primary school was conducted in 2017-18 by Wheadon, Barmby, Christodoulou and Henderson (2019) by means of the program No More Marking. The aim of this assessment was to support primary schools in assessing writing proficiency of pupils aged between five and eleven years old. A total of 85 schools participated in all judgement sessions. For most of the schools, all pupils took part in the assessment. All teachers of the schools were encouraged to enrol as assessor. The assessors did not receive any training before the assessment. The results of the national assessment in No More Marking showed that the assessments were very reliable. The median reliability was 0.91 for Group 1 (pupils aged five years old) and 0.85 for Group 6 (pupils aged eleven years old). Wheadon et al. (2019) argued that the assessment was also valid, because the writing skills of the pupils improved with age; children in lower grades received lower writing scores than children in higher grades. Next to this, the assessment was valid because writing scores of similar year groups turned out to be relatively stable from one year to another.

Regarding the efficiency of the assessment, the assessment was performed quickly. The assessors were able to assess 66 texts within an hour for the texts written by the youngest pupils, assuming that ten comparisons per text had to be made to reach an appropriate level of reliability. For the texts written by the oldest pupils, assessors needed more time. They could assess sixteen texts within an hour. All texts were no more than three A4 sides.

Wheadon et al. (2019) concluded that this assessment was highly reliable, valid and efficient. Nonetheless, one disadvantage of this form of assessment is that teachers might be able to recognize texts written by their own pupils and may be affected by this in their scoring of the texts, resulting in a scoring bias. A solution to this problem is to make sure that teachers are not provided with texts written by their own pupils, but Wheadon et al. state that this might make the assessment demotivating for the teachers.

A final note given by Wheadon et al. (2019) is that one should be careful about using comparative judgement for high-stakes assessment. According to Wheadon et al., some biases might still occur in the case of high-stakes assessment, although the model should prevent obvious biases. It is hard to predict which biases might occur in high-stakes assessment, so Wheadon et al. conclude that potential distortions should be taken into account if comparative judgement is used for high-stakes assessment.


1.4.6 Limitations of comparative judgement

Like any form of assessment, comparative judgement has its limitations. One drawback is that especially the judgement of complex texts can be quite time consuming, due to its high cognitive requirements (Heldsinger & Humphry, 2013). McMahon and Jones (2014) even found that comparative judgement was more time consuming than analytical scoring. They studied the use of comparative judgement in comparison to analytical scoring in the assessment of a school chemistry experiment and found that the comparative judgement took around fourteen hours, while the analytical scoring procedure took around three hours. It should be noted, however, that the circumstances of this assessment are quite different from the assessments that are of interest for the current study, namely the assessment of written texts.

McMahon and Jones themselves admit that other studies have found contradictory results. According to Tarricone and Newhouse (2016), the assessment set out by McMahon and Jones was not suitable for comparative judgement in the first place, because instead of descriptive answers, students were expected to give list answers to most questions of the test. As a result, assessors had to focus on each individual answer instead of being able to judge the test as a whole. This might explain why comparative judgement was more time consuming than analytical scoring in their study.

Another limitation of comparative judgement is that it might entail high costs, because many assessors are needed to reach high reliability levels and each text has to be assessed multiple times. Furthermore, these assessors need to be trained, which can cost both money and time. However, several studies show that the required training for assessors in comparative judgement is less than for rubric scoring. In the study by Steedle and Ferrara (2016), for the rubric scoring, at least two professional assessors scored the essays on the organization, focus and coherence of the text, the elaboration of the ideas, the use of voice and English conventions in the essay. Their training, in which they practiced and discussed their assessments of the essays, took between eight and twelve hours. In addition, nine English teachers participated in the comparative judgement of the essays. In their training, comparative judgement was explained and essay quality was discussed. Assessors also had to practice assessing the texts and their decisions were discussed. This training took approximately three to four hours.

An earlier study has shown that the training requirements could be even less. The twenty assessors in the study by Heldsinger and Humphry (2010) only needed half an hour training for the comparative judgement of 30 narrative texts written by primary school pupils, while a training for the assessment by means of a scoring rubric, to be able to validate the same assessment, took an entire day.

Steedle and Ferrara (2016) concluded that the actual scoring of each essay might take longer in comparative judgement than in the case of rubric scoring, but that comparative judgement can be considered to be more efficient due to its reduced training requirements. In any case, the costs of comparative judgement in comparison to the costs of other scoring methods depend on the circumstances of an assessment.

Another downside of comparative judgement is that students often do not receive feedback, because teachers do not report what they base their decisions on. Teachers only indicate whether they find one text better than another, which makes it difficult to determine what the teachers value in writing proficiency. As a consequence, students do not receive feedback on how to improve their writing proficiency (Heldsinger & Humphry, 2013). Of course, this is not an issue in the CET, as the goal of this test is summative: unlike formative tests, it is not intended to provide students with feedback.

A final remark against comparative judgement is that holistic judgement might be very complex for some assessors. Indeed, assessors seem to differ in how complex they experience comparative judgement assessment to be (Van Daal et al., 2017). Inaccurate decisions in comparative judgement assessment relate to a higher experienced complexity than accurate decisions.

1.5 Effect of spelling errors on assessment

As described, a possible pitfall of many scoring methods is that judgement effects might arise during the scoring process. Scorers can, for example, subconsciously be influenced by factors that are irrelevant to the scoring. Spelling errors in a written text can also affect its scoring, as will be discussed in the next section. It would be interesting to see whether spelling errors also influence the scoring of a written text in the process of comparative judgement.

1.5.1 No effects of spelling errors

Results of studies that looked at the effect of spelling errors in written texts are somewhat diverse. One study that concluded that spelling errors in written texts do not have a negative effect on the appreciation of the texts was conducted by Kloet, Renkema and Van Wijk (2003). In a first experiment, they presented students with two different letters, each in four versions. The first version contained no errors at all, the second version contained five spelling errors, the third version contained five marking errors, such as a wrong use of connectives, and the fourth and final version contained five spelling errors and five marking errors. Participants were given one condition of the letters and were asked to read them and answer some questions regarding the appreciation, image and persuasiveness of the text. Afterwards, participants had to read the letters again and indicate whether they thought there were any mistakes in them. The results showed that for only one of the two letters, marking errors negatively affected the comprehensibility of the text. There were no effects of spelling errors and no effects of marking errors on the image and persuasiveness of the texts.

In order to investigate what could have caused these findings, the experiment was repeated. Both younger and older adults took part in this second experiment and the error density was raised. Results of this experiment showed that texts with both spelling and marking errors were perceived as less comprehensible than texts with no errors or with only one type of error. Young men also reported that the writer of the text containing both types of errors was the least reliable.

Overall, Kloet et al. (2003) concluded that the effect of language errors in a text is not very strong. However, readers do rate a text with marking errors as less comprehensible than a text with spelling errors and a text with both types of errors as even less comprehensible than a text with only marking errors. According to Kloet et al., spelling errors do not influence the appreciation of a text.

These findings were supported by the results of Corten, De Cock, De Wachter and Smets (2012), who tested the effect of language errors in a written news report on the credibility of the writer. Language errors in the text negatively affected the credibility of the writer for students from the humanities faculty, but not for students from the engineering sciences. An explanation for this difference might be that students from the faculty of humanities are more sensitive to language. However, the text that the students read was about biomedicine, and engineering students might have been better able to base their opinion about the credibility of the writer on the content of the text itself.

Russell and Tao (2004) looked at the effect of spelling errors on the assessment of written texts. They did this by presenting assessors with texts in three different formats: texts were either hand-written, typed, or typed with the occurring spelling errors corrected. Results showed that typed texts in which spelling errors were not corrected received lower scores than typed texts in which spelling errors were corrected. However, the differences in scores were not significant. Hand-written texts received higher scores than typed texts in which spelling errors were corrected, but once again, this difference was not significant.

1.5.2 Negative effect of spelling errors

Contrary to the above findings, several earlier studies have found that spelling errors in texts can have a negative effect on the perception of a writer’s writing proficiency or credibility (e.g. Chesney & Su, 2010; Kreiner, Schnakenberg, Green, & McLin, 2002; Queen & Boland, 2015).

This was, for instance, studied by Appelman and Schmierbach (2018). In their study, students had to read manipulated news articles with either zero or ten errors and were asked to judge the news article on its quality, credibility and informativeness. The errors that they included in a first study were incorrect spelling, incorrect subject-verb agreement, confused homonyms and double negatives. They based their choice of error types for this study on the knowledge that copy editors are trained to avoid these errors in publications. For a second study, however, they asked journalists to rank order different errors, and the errors that were perceived as the most problematic were included in the manipulated news articles. These errors were subject-verb agreement errors, incorrect spelling of proper nouns, confused homophones, confusion of possessive and plural cases of nouns, spelling errors in common nouns and errors in pronoun-antecedent agreement. The results of this study showed that readers perceive a news article containing errors as of lower quality, less credible and less informative than the same news article without errors. However, for this effect to arise, the number of errors needs to be relatively large.

Not only can spelling errors influence the judgement of a text itself, they can also influence the perception of the writer of a text, for instance in the attitude towards a job applicant, as was studied by Martin-Lacroux and Lacroux (2017). They investigated whether applicants with application letters and résumés containing spelling errors were more likely to be rejected than applicants with letters and résumés containing no spelling errors. In their study, they created fictitious application letters and résumés and asked recruiters whether they would invite or reject the applicant based on his or her letter and résumé. Half of the letters contained grammatical and lexical errors and the other half contained no spelling errors. The letters with spelling errors contained either five or ten errors. The letters also varied in the applicant's level of professional expertise for the job. Spelling errors turned out to impact the decision to reject or invite an applicant more than professional experience did, especially when applicants had a high amount of work experience. It did not matter whether the application contained five or ten spelling errors; no significant differences were found between the rejection rates of letters containing five errors and letters containing ten errors. Spelling errors can thus be a real deal-breaker. Still, when the work experience was low and a letter contained no spelling errors, this latter factor was not enough to invite the applicant. The spelling criterion was taken into account, but not as much as the work experience criterion.

It has also been shown that essays containing grammar5 or spelling errors receive lower scores than the same essays without grammar or spelling errors. This was studied by Marshall and Powers (1969). In their study, they collected students' essays of quality A and had teachers adapt the content of the essays in such a way that the essays would now be of quality B. Furthermore, gross writing errors were corrected in the essays. From this essay form, different forms of the same essay were designed, which all contained a different number of errors. The researchers distinguished three error forms: no errors, spelling errors and grammar errors. Teachers had to score the (manipulated) essays and were instructed to focus only on the content of the essay while doing this. However, results of the scoring process showed that the essay forms containing spelling or grammar errors scored significantly lower than the essay forms containing no errors. Teachers are thus influenced by grammar and spelling errors in students' essays, even when they are instructed to focus on content alone.

5 The type of errors labelled as grammar errors by Marshall and Powers (1969) will further be referred to as spelling errors in the context of this thesis.

Findings about the effect of spelling errors in texts are thus diverse. Jansen and Janssen (2016) looked into previous studies to find out what could have caused the different findings. One explanation that they provide is that the text genre might affect the effect of spelling errors in texts. To investigate this possible explanation, they provided participants with two texts: an application letter and a sponsor letter. Six errors were manipulated into the texts, all of them Dutch d/t errors, which means that a required –t ending is missing in the second or third person singular of verbs with a stem ending in –d. The results of the experiment showed that there were no significant interaction effects between the errors and the genre of the letters, from which they could conclude that the genre did not influence the effect of spelling errors in texts. Still, spelling errors negatively affected the appreciation of a letter.

Jansen and Janssen (2016) also found that the negative effect of spelling errors only arises if readers of the text actually notice the error itself. This might seem very straightforward, but it has been reported by earlier studies as well (Jansen & De Roo, 2012; Raedts & Roozen, 2015). Similar results were obtained by Harm (2008), who found that what matters for the judgement of the credibility of a text is whether readers can actually find a language error or not. Participants in her study appreciated a text in which they perceived language errors less than texts in which they did not perceive language errors. According to her findings, the type of error did not influence the extent of the effect. However, this contradicts the findings of De Schutter (1982), who concluded that spelling errors are perceived as being worse than other language errors.

Results on the effect of language or spelling errors in written texts on the scoring of these texts thus seem to be diverse. The effect of spelling errors in combination with comparative judgement as a scoring method has not been looked into before, and it would therefore be interesting to study the effect of spelling errors in a written text on the scoring of that text when the texts are scored by comparative judgement.

1.6 This study

Based on the previously discussed literature, comparative judgement seems to be a promising scoring method for performance assessment via open-ended writing tasks in the CET, due to its high reliability, validity and efficiency. The current study will investigate whether this is indeed the case. According to Wheadon et al. (2019), one should be careful with high-stakes use of comparative judgement, because unintended biases might occur. This study will serve as an exploratory study into the possibilities of assessing an open-ended writing task in the CET by means of comparative judgement.

Furthermore, as there are not many previous studies on comparative judgement involving written texts from eleven- or twelve-year-olds, the CET's age group, it would also be interesting to study which components of writing proficiency are most important to the assessors in their judgements about the written texts. It might be the case that assessors also base their decisions on spelling, as spelling can play a great role in the opinion about a text (e.g. Jansen & Janssen, 2016; Marshall & Powers, 1969). In fact, the assessment is expected to be valid when assessors actually focus on spelling in their judgements of the texts. Therefore, the effect of spelling errors on the judgement of written texts will also be looked into in the current study. In order to investigate this, the following research question and sub-questions were formulated.

Research question: Is comparative judgement a valid, reliable and efficient scoring method for the assessment of pupils' writing proficiency in the Centrale Eindtoets, and to what extent are assessors influenced by different components of writing proficiency, such as spelling errors, in this comparative judgement process?

Sub-question 1: To what extent are assessors participating in comparative judgement influenced by spelling errors in the written texts that they score, and what does this say about the validity of the assessment?

Sub-question 2: Is comparative judgement a reliable scoring method for the assessment of writing proficiency in an open-ended writing task?

Sub-question 3: Is comparative judgement an efficient scoring method for the assessment of writing proficiency in an open-ended writing task?

Sub-question 4: On which components of writing proficiency do assessors base their decisions in comparative judgement?

1.6.1 Hypothesis

Based on the results of previous studies, comparative judgement is expected to be a reliable scoring method (Bramley, 2007; Heldsinger & Humphry, 2010; Pollitt, 2012; Verhavert et al., 2019). In addition, it is expected to be quick (Coertjens et al., 2017), easy to use (Pollitt, 2004, 2012) and to require little training (Heldsinger & Humphry, 2010), which would make it an efficient scoring method for the assessment of writing proficiency with open-ended writing tasks. Assessors are expected to base their judgements on components of writing proficiency, as was the case in the study by Lesterhuis (2018), which would support the validity of the assessment. Since Huot (1990) and Lesterhuis found that assessors attend to very different aspects of a text depending on their background, the assessors in the current study are also expected to vary in the components of writing proficiency on which they base their decisions.

Lesterhuis found that the majority of assessors first focussed on the structure and style of an essay in their rating. The structure of an essay resembles the component 'coherence of the text' in the Frame of Reference (Expertgroep Doorlopende Leerlijnen Taal en Rekenen, 2009), so based on Lesterhuis' results, coherence is expected to be of great importance to the assessors in the present study as well. Lesterhuis does not explicitly define what she means by the style of an essay, nor which components of writing proficiency in the Frame of Reference correspond to it. Style could, for instance, refer to the layout of a text, but also to formal or informal language use. If so, it can be argued that the style of a text is partly covered by the 'alignment with audience' and 'readability' components of writing proficiency. Nevertheless, it is difficult to formulate expectations about these components, as it remains unclear what Lesterhuis means by the style of an essay.

Despite the diverse findings about the effect of spelling errors on the judgement of texts, assessors are expected to be influenced by spelling errors in the written texts while scoring them. In line with the findings of Marshall and Powers (1969), Jansen and Janssen (2016) and others, texts containing no spelling errors are expected to receive higher scores than texts containing spelling errors. The extent to which assessors are influenced by spelling errors in their judgement of a text will also provide an indication of the validity of the assessment.


1.6.2 Relevance

This research question is relevant to the field because it focuses on whether comparative judgement can be used for scoring open-ended writing tasks written by primary school pupils. If this study shows that comparative judgement can indeed be a valid, reliable and efficient scoring method, Cito might be able to apply the methodology and perhaps even extend it to other assessment contexts, such as secondary school final exams. The results of this study will thus add to the information available on comparative judgement in the assessment of primary school pupils' writing proficiency. Cito, and specifically the CET, can use this information to improve the way in which writing proficiency in the CET is currently assessed.

2. Methodology

2.1 The assignment

An open-ended writing task was designed for the purpose of this study, based on the Frame of Reference for Language (Expertgroep Doorlopende Leerlijnen Taal en Rekenen, 2009). The writing task was called Verstoppertje (hide and seek) and required pupils to write an e-mail to friends from their holiday in which they explained the rules of hide and seek. The instruction of the task read as follows:

“Op vakantie speel je verstoppertje met een paar kinderen op de camping. Je komt erachter dat iedereen verstoppertje met andere spelregels speelt. De andere kinderen vinden verstoppertje spelen volgens jouw spelregels het leukst en vragen jou om de regels op te schrijven. Als je thuis bent, stuur je de kinderen een mail waarin je jouw spelregels uitlegt.”

“During your holiday, you’re playing hide and seek with some children at the campsite. You find out that everyone plays hide and seek by different rules. The other children like playing hide and seek by your rules best and ask you to write your rules down. When you get home, you send the children an e-mail in which you explain your rules.”

The pupils had to include the following components in their e-mail: a salutation, an introduction stating what the e-mail is about, the number of people the game can be played with, where the game can be played, what the aim of the game is, when the game ends, and a closing. They were instructed to use approximately 150 words. The Verstoppertje task can be found in Appendix 2.

2.2 Participants

2.2.1 Schools

Five primary schools participated in this study, with a total of 89 pupils. Two schools participated at the end of the academic year 2018-2019 and three schools at the beginning of the academic year 2019-2020. At four of these schools, the writing task was administered in the final grade of primary school (“groep 8”); at one school, it was administered in the second-to-last grade (“groep 7”). However, this class participated in the final weeks of the academic year 2018-2019, which means that these children would move up to “groep 8” within a few weeks’ time. Four schools are located in the province of Gelderland and one in the province of Noord-Holland.

Teachers were sent the required number of writing tasks for their pupils and received instructions on how to administer the task. They were also sent a PowerPoint presentation containing the writing task and a short video of children playing hide and seek, which they were asked to show to their class before handing out the writing tasks in order to activate pupils’ prior knowledge. Pupils then completed the writing task individually and were told that they had plenty of time, so there was no need to rush. On average, it took pupils 30 to 45 minutes to write their text.

Teachers who participated in this study with their class also received an evaluation form containing questions about their experience with the writing task. They indicated that the writing task connected very well with the children’s world of experience and rated the difficulty of the task as appropriate. However, some teachers reported that some children found the instructions of the task difficult.

2.2.2 Assessors

Verhavert et al. (2019) found that in comparative judgement the reliability of the assessment is not influenced by the assessors’ level of expertise; it was therefore not considered necessary for the assessors in this study to be experienced primary school teachers. In total, 23 assessors, all female, assessed the texts. The assessors were primary school teachers, pabo (primary school teacher training) students, linguists or employees of the language department of Cito; all had a background in primary education or linguistics. They were invited to participate via a letter spread through social networks and were told that the total assessment process would take no longer than four hours. Assessors who were not employees of Cito received compensation for participating in the assessment.

2.3 Procedures

2.3.1 Format of texts

All texts were handwritten but were converted into typed form before they were scored, so that assessors’ judgements could not be influenced by a pupil’s handwriting. A study by Mogey, Paterson, Burk and Purcell (2010) showed that it does not matter for essay scores whether essays are handwritten or typed: the scores of handwritten essays by students of the University of Edinburgh differed slightly from those of typed essays on the same topic, but these differences were not significant, and the quality of the essays in both forms was similar.

Russell and Tao (2004) also studied differences between the assessment of computer-printed and handwritten texts, building on earlier research by Powers, Fowles, Farnum and Ramsey (1994). Like Powers et al., Russell and Tao found an effect of the mode in which a text was assessed: handwritten texts received higher scores than typed texts. However, they also found that this effect could be undone by presenting the typed text in a cursive font, so that it resembled handwriting. Because of these findings, all handwritten writing tasks in the present study were typed in a cursive font. The layout of the texts was not changed, so, for example, an unnecessary blank space in the handwritten version was reproduced in the typed version. To prevent incorrect punctuation in many of the texts from affecting their scores, the punctuation of all texts was adapted so that every new sentence started with a capital letter and ended with a full stop. In addition, all personal names were replaced by ‘xxx’. The font of all texts was Arial 12, similar to the font used in the CET. Beneath each text, a small table with the word count of the text was added, so that assessors could easily see whether the text met the required length of the assignment without having to count the words themselves.
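In the present study this preparation was done by hand. Purely as an illustrative sketch, assuming hypothetical function and variable names and simplified rules, the normalisation steps described above could be expressed in code as follows:

import re

def normalise_text(text, personal_names):
    # Replace every listed personal name by 'xxx', as was done for the typed texts.
    for name in personal_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "xxx", text)

    # Make every sentence start with a capital letter and end with a full stop.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    sentences = [s[0].upper() + s[1:] for s in sentences]
    normalised = ". ".join(sentences) + "."

    # Return the word count that is shown in the small table beneath each text.
    return normalised, len(normalised.split())

text, count = normalise_text("hoi Daan. ik leg je de spelregels van verstoppertje uit", ["Daan"])
# text == "Hoi xxx. Ik leg je de spelregels van verstoppertje uit.", count == 10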

2.3.2 Manipulation of texts

To study the effect of spelling errors on the scoring of the texts, ten texts were manipulated for spelling errors. These texts could be presented to assessors either in a form containing spelling errors or in an adapted form in which all spelling errors were corrected. The error version consisted of the spelling errors the pupils had made themselves, supplemented where necessary with extra errors, so that each manipulated text contained between six and eight spelling errors, of which at least two were misspelled verbs and at least two were misspelled nouns. The added errors were based on the spelling categories of the CET. Only texts of approximately 150 words, which thus met the required length of the assignment, were selected for manipulation. The manipulated versions of the texts can be found in Appendix 3. Examples of spelling errors in the texts are given in Examples 1 and 2.

Example 1

Without error: Ik vind zelf het leukst om verstoppertje in het bos te doen.
(I enjoy playing hide and seek in the forest the most.)

With error: *Ik vindt zelf het leukst om verstoppertje in het bos te doen.
(*I enjoys playing hide and seek in the forest the most.)

Example 2

Without error: Je hebt twintig seconden de tijd om je te verstoppen.
(You have twenty seconds to hide.)

With error: Je hebt twintig *seconde de tijd om je te verstoppen.
(You have twenty *second to hide.)

The positions of the errors in the texts varied: some occurred in the first few sentences, others in the last few sentences, and others in between. In this way, an attempt was made to avoid an effect of the position of an error in the text, although such an effect could not be ruled out entirely in this study.
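The manipulation itself was also carried out by hand. As a minimal sketch only, with hypothetical class and function names, the constraints described above (six to eight errors per manipulated text, at least two misspelled verbs and at least two misspelled nouns) could be recorded and checked as follows:

from dataclasses import dataclass

@dataclass
class SpellingError:
    sentence_index: int   # rough position of the error in the text
    category: str         # e.g. "verb" or "noun", following the CET spelling categories
    with_error: str       # misspelled form, e.g. "vindt"
    corrected: str        # correct form, e.g. "vind"

@dataclass
class ManipulatedText:
    text_id: str
    corrected_version: str    # version in which all spelling errors are corrected
    error_version: str        # version containing the manipulated spelling errors
    errors: list

def meets_manipulation_criteria(item):
    # Check the constraints used in this study: 6-8 errors in total,
    # of which at least two misspelled verbs and at least two misspelled nouns.
    n_verbs = sum(e.category == "verb" for e in item.errors)
    n_nouns = sum(e.category == "noun" for e in item.errors)
    return 6 <= len(item.errors) <= 8 and n_verbs >= 2 and n_nouns >= 2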

2.3.3 Assessment of the texts

Twenty-two of the 89 texts had to be discarded, leaving 67 texts. Texts were discarded when they did not meet any of the criteria of the task, for example a text that was not about hide and seek at all. Texts containing fewer than 100 words were also excluded from the assessment. Furthermore, one text was discarded because it contained more than 25 spelling errors, which made it too difficult to comprehend.

Assessors took part in an assessment session in which they first received a short instruction on how to use the assessment program Comproved and then assessed texts for around an hour. After a short break, they compared texts for another half hour. The assessors were asked to rate the remaining texts at home within two weeks’ time and received
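Comproved’s internal estimation procedure is not detailed here. In the comparative judgement literature, an ability score per text is typically estimated from the pairwise judgements with the Bradley-Terry model; purely as an illustrative sketch under that assumption, with hypothetical names and a simple gradient-ascent estimator:

import math

def estimate_abilities(comparisons, n_texts, n_iter=2000, lr=0.01):
    # comparisons is a list of (winner, loser) index pairs.
    # Under the Bradley-Terry model, P(i beats j) = 1 / (1 + exp(b_j - b_i));
    # the abilities b are estimated by gradient ascent on the log-likelihood.
    abilities = [0.0] * n_texts
    for _ in range(n_iter):
        grad = [0.0] * n_texts
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(abilities[loser] - abilities[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        abilities = [b + lr * g for b, g in zip(abilities, grad)]
        mean = sum(abilities) / n_texts   # centre the scale on zero
        abilities = [b - mean for b in abilities]
    return abilities

# Toy example with three texts: text 2 wins most of its comparisons
# and therefore receives the highest estimated ability score.
judgements = [(2, 0), (2, 1), (1, 0), (0, 1), (2, 0), (1, 2)]
print(estimate_abilities(judgements, n_texts=3))

The ability scores used in this study come from the assessment environment itself; the sketch above is only meant to clarify the general principle behind such scores.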
