
Scoring rules and visualization of time pressure in Math Garden

Name: Simone Plak
Student ID: 10204725
Date: 09-07-2015
Assignment: Bachelor thesis
Supervisor: Han van der Maas
Word count paper: 7882
Word count abstract: 194


Abstract. Math Garden is a web-based computer adaptive application where children can practice their arithmetic skills. The arithmetic ability scores in Math Garden are based on the SRT scoring rule, which incorporates response time and accuracy and includes a correction for guessing. Due to this penalty for incorrect responses, SRT scoring can evoke the use of strategies, but the alternative CISRT scoring rule may promote guessing. Furthermore, Math Garden received some complaints about the visualization of time pressure in the games. To address both concerns, the present study examined the effect of scoring rules and visualization of time pressure on the arithmetic performance of 196 children in grades one to six. It is concluded that scoring rule and visualization of time pressure do not affect the arithmetic performance of the test-taker, and that the type of scoring rule implemented in the task does not affect the response strategies employed by the test-takers. Furthermore, both SRT and CISRT scoring, with or without visualization of time pressure, showed high convergent validity. Based on the results of the present study, the type of scoring rule and the visualized time pressure in Math Garden should not be changed until further research indicates otherwise.

1. Introduction

Math Garden is a web-based computer adaptive application where children can practice their arithmetic skills (Straatemeier, Van der Maas, & Klinkenberg, 2009). Fourteen thousand fifty-nine schools currently use Math Garden, and over a hundred million items have been completed (retrieved from www.rekentuin.nl on 02-07-2015). Math Garden has proved successful in improving the arithmetic performance and decreasing the math anxiety of its users (Jansen et al., 2013). The arithmetic ability scores in Math Garden are based on a scoring rule that includes the accuracy of the response as well as the response time (Klinkenberg, 2014). To prevent extensive guessing, this scoring rule includes a punishment for incorrect answers. However, since including a correction for guessing can induce the use of strategies and add construct-irrelevant error variance (Budescu & Bar-Hillel, 1993; Michael, 1968; Budescu & Bo, 2014), one could question whether this scoring rule leads to the best performance of the children and to the most accurate representation of the children’s true arithmetic ability. Furthermore, Math Garden received some complaints from parents and teachers about the visualization of time pressure in the games.

The research discussed in this paper addresses these two problems by comparing the scoring rule currently used in Math Garden and the current visualization of time pressure with an alternative scoring rule that does not include a correction for guessing and with no visualization of time pressure. The following sections discuss the history of scoring rules, the speed-accuracy trade-off problem, and the current situation of Math Garden, including the problems that have arisen.

1.1 History of scoring rules

1.1.1 Number-right scoring

Multiple-choice tests are widely used in educational and psychological measurement because of their many advantages, such as the ease of administration and scoring, the ability to test varied content, and the objectivity of the scoring procedure (Kurz, 1999; Budescu & Bo, 2014; Bar-Hillel, Budescu, & Attali, 2005). The most frequently used scoring rule for multiple-choice tests is the number-right scoring rule, in which the test score is the sum of the number of items answered correctly (Budescu & Bar-Hillel, 1993; Burton, 2002). This scoring rule is currently employed by the American College Testing (ACT), as well as by the Graduate Record Examinations (GRE) general exams (Budescu & Bar-Hillel, 1993). Since an incorrect and an omitted response have the same effect on the final score (both earn zero points), the test-taker benefits from answering all the items of a test. The advantage of this scoring rule is that the test-taker easily understands that the optimal strategy is to answer all items, and therefore instructions are easily given. However, the number-right scoring rule does have some constraints.

The weaknesses of number-right scoring include decreased reliability and validity due to construct-irrelevant error variance added by guessing, and failure to capture the test-takers’ level of partial knowledge. The major goal of testing is to estimate the test-taker’s true ability as closely as possible from the responses to a test. However, from an answer sheet obtained under number-right scoring it is not possible to distinguish an item answered correctly by guessing from a correct answer due to the test-taker’s knowledge. Hence, guessing interferes with this major goal of testing because it affects the test scores obtained under number-right scoring. Moreover, guessing is often associated with poor educational practice (Budescu & Bar-Hillel, 1993). Zimmerman and Williams (2003) investigated the influence of chance success due to guessing on the reliability of multiple-choice tests. They distinguished error variance due to guessing from error variance due to other sources, and concluded that even if guessing were the only source of error variance, it would still have a strong influence on the reliability of the test. Furthermore, due to the dichotomous scoring procedure, the test scores do not capture the test-takers’ level of partial knowledge (Budescu & Bo, 2014).

1.1.2 Formula scoring

To reduce the effects of guessing, different scoring procedures have been proposed. Abundant guessing can be prevented through negative marking. A popular scoring rule that includes a penalty for incorrect answers is formula scoring (Lord, 1975; Budescu & Bar-Hillel, 1993). This rule was first suggested by Thurstone (1919), and it has been used to score the Scholastic Aptitude Test (SAT) since 1953, the GRE subject exams, and medical schools’ progress tests (Budescu & Bar-Hillel, 1993; Bar-Hillel et al., 2005; McHarg et al., 2005). When formula scoring is implemented, respondents can either choose to answer or to skip an item. The basic property of this scoring rule is that one’s expected score is the same whether one guesses the answer to an item at random or omits it (Budescu & Bar-Hillel, 1993; Lord, 1975). Respondents gain one point for correctly answered items, zero points for skipped items, and they lose c points for an incorrect answer. The penalty c is generally equal to 1/(A-1), where A is the number of answer options (Lord, 1975). This results in a total formula score of R – W/(A-1), where R is the number of right answers and W is the number of wrong answers.
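To make the arithmetic concrete, here is a minimal sketch of formula scoring in Python; the function and variable names are illustrative only and are not taken from any of the cited sources.

```python
def formula_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Formula score R - W/(A-1); omitted items contribute nothing."""
    return num_right - num_wrong / (num_options - 1)

# Example: 30 right, 8 wrong, 2 omitted on a 40-item test with 4 options per item.
# A random guess has expected score (1/4)*1 + (3/4)*(-1/3) = 0, the same as omitting.
print(formula_score(30, 8, 4))  # 30 - 8/3 = 27.33...
```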

Lord (1975) argues that when clear instructions are given and respondents behave according to these instructions, the only difference between a multiple-choice test obtained under formula scoring directions and the same test obtained under number-right scoring directions is that omitted responses on the formula scoring answer sheet are replaced by random guesses on the number-right scoring answer sheet. Therefore, if there are any omitted responses, the total test score obtained under number-right scoring will differ from the total test score obtained under formula scoring only because of the lucky or unlucky guesses that replace these omitted responses. This implies that if there are any omitted responses, the reliability of the formula score will always be higher than the reliability of the number-right score (Lord, 1975). Studies on medical students indeed found that results obtained under formula scoring were more reliable than results obtained under number-right scoring (Burton, 2002; McHarg et al., 2005; Muijtjens, Van Mameren, Hoogenboom, Evers, & Van der Vleuten, 1999). Moreover, a study on psychology students also concluded that formula scoring results in more reliable test scores than number-right scoring (Alnabhan, 2002).

Although these studies conclude that formula scoring results in more reliable test scores, they also mention some limitations of this rule. Burton (2002) and Muijtjens et al. (1999) both acknowledge an effect of the cautiousness of some students on reliability. Muijtjens et al. (1999) found that students who were less willing to guess obtained lower scores than students who were more willing to guess. Furthermore, McHarg et al. (2005) argued that formula scoring is more reliable than number-right scoring, but that this may only be the case for the medical schools’ progress tests specifically. Including the option to skip an item may be appropriate for a longitudinal test like the progress test, because many items are too difficult for the ability level of many respondents (McHarg et al., 2005). Moreover, it can be an educational objective for a medical doctor to learn to recognize what he does not know.

Including a penalty for incorrect answers can reduce the effects of guessing and hence increase test reliability, but it also presents respondents with a decision problem. Whenever test-takers face an item they are not sure they can answer correctly, they have to decide whether or not to guess. Which response strategy a respondent prefers depends on many factors on which individuals can differ greatly (Budescu & Bar-Hillel, 1993). One reason people differ in the response strategy they employ is that they differ in their risk preferences. Kahneman and Tversky (1979) found that people underrate outcomes that are merely probable in comparison with outcomes that are obtained with certainty. This tendency is called the certainty effect and contributes to risk aversion. Risk averse respondents will prefer the sure thing (omitting) over the gamble (guessing), even when this strategy causes the expected value to decrease (Kahneman & Tversky, 1979). Thus, individual differences in risk-taking behavior affect the test scores obtained under formula scoring (Budescu & Bar-Hillel, 1993). Hence, when number-right scoring is used to score a test, the test-maker cannot distinguish a guessed correct answer from a known correct answer, but when formula scoring is applied, this limitation is replaced by the inability to distinguish between an omission deriving from risk aversion and one deriving from ignorance. So, both guessing and correcting for guessing tend to add error variance to the test score and hence reduce the reliability of the test (Michael, 1968; Diamond & Evans, 1973).

Furthermore, if the only difference between an answer sheet obtained under formula scoring and one obtained under number-right scoring is that omitted responses on the former are replaced by random guesses on the latter, as argued by Lord (1975), people would have to be very aware of what they do and do not know, and clear instructions should be given. However, people generally do not know how much they know; they are mostly miscalibrated (Budescu & Bar-Hillel, 1993). For example, Fischhoff, Slovic, and Lichtenstein (1977) found that when people express 100% certainty, they are right only about 70%–80% of the time, and when people assign zero probability to options, these options were actually correct 20%–30% of the time. Moreover, Bar-Hillel et al. (2005) argue that respondents often omit items in formula-scored tests when they are in doubt about the correct answer, which is a sub-optimal strategy. Furthermore, because of individual differences it is nearly impossible to give recommendations on how to respond to test items that will be fair and beneficial to all (Budescu & Bar-Hillel, 1993).

In addition, formula scoring is described as a solution for guessing, when in fact the expected value of omitting and of random guessing is equal. When the penalty c is equal to 1/(A-1), it does not matter whether the respondent guesses at random or omits the item. But when the respondent possesses some partial knowledge, it is always more beneficial to guess (Klinkenberg, 2014). To really discourage guessing, the penalty needs to be larger than 1/(A-1). However, a larger penalty will not change the fact that knowing how to use this rule implies an added skill in which people can systematically differ (Klinkenberg, 2014).
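To make this concrete, a small illustrative calculation (not taken from the cited sources) of the expected formula score of a random guess, with and without partial knowledge; the function name and arguments are assumptions for the example only.

```python
def expected_guess_score(num_options, num_eliminated=0, penalty=None):
    """Expected formula score of a random guess among the options still considered.
    With penalty c = 1/(A-1) and no partial knowledge the expectation is zero,
    so guessing and omitting are equivalent; eliminating even one option makes
    guessing strictly better than omitting."""
    A = num_options
    c = 1 / (A - 1) if penalty is None else penalty
    k = A - num_eliminated  # options the respondent still considers plausible
    return (1 / k) * 1 + ((k - 1) / k) * (-c)

print(expected_guess_score(4))                    # 0.0   -> guessing equals omitting
print(expected_guess_score(4, num_eliminated=1))  # 0.111 -> guessing beats omitting
```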

However, Espinosa and Gardeazabal (2010) reach a different conclusion. Based on simulated data, they concluded that the optimal penalty is relatively high for perfectly rational students, but also when students are not fully rational. They conclude that even though including a penalty discriminates against risk averse students, this effect is small compared with the measurement error that it prevents (Espinosa & Gardeazabal, 2010). Budescu and Bo (2014) discuss some criticism of this study. They argue that Espinosa and Gardeazabal (2010) ignore the fact that highly risk averse respondents tend to omit items that they might be able to answer correctly by using their partial knowledge. Moreover, they extended the model of Espinosa and Gardeazabal (2010) by including loss aversion and a particular form of miscalibration to accomplish a more realistic and complete model. Based on this proposed model, Budescu and Bo (2014) conclude that there are no penalties that can be universally implemented without making strong assumptions about the non-cognitive factors in the population of test-takers.

Because of all the disadvantages of formula scoring mentioned above, Budescu and Bar-Hillel (1993), Downing (2003), Bar-Hillel et al. (2005), and Budescu and Bo (2014) all endorse number-right scoring.

1.1.3 Methods for awarding credit for partial knowledge

Besides being affected by guessing, the test scores obtained under number-right scoring also fail to take into account the respondents’ partial knowledge. One method that awards credit for partial knowledge is confidence weighting. When confidence weighting is applied to a test, the test-takers are instructed to indicate what they think is the correct answer to the question and are also asked to indicate how certain they are of their chosen answer (Kurz, 1999). A correct answer that is given confidently will receive more credit than a correct answer given with less confidence. Advocates of confidence weighting argue that the confidence scoring method results in higher reliability of test scores and a lower standard error of measurement than number-right scoring or formula scoring (Michael, 1968; Pugh & Brunza, 1975).

Alnabhan (2002) employed an alternative partial knowledge scoring procedure, and also concluded that this partial knowledge method results in more reliable test scores than the number-right method. The partial knowledge method used by Alnabhan (2002) allowed the respondent to choose more than one answer option per question. The number of alternatives selected by the respondent influenced the number of points the respondent gained for a correct response and lost for an incorrect response. When respondents selected more than one alternative, and the correct response was included in their selection, they received less credit than when they selected only the correct answer option. Correspondingly, when respondents selected more than one alternative, and the correct response was not included in their selection, the penalty was higher than when they selected just one incorrect alternative (Alnabhan, 2002). Alnabhan (2002) concluded that the partial knowledge method produces larger test reliability values and higher validity coefficients than the number-right method.

Some studies suggest that the different scoring procedures described so far have not met the expectation of solving the problems of number-right scoring. Kurz (1999) provided an overview of different scoring procedures that were introduced to correct for guessing and to acknowledge partial mastery. Three different methods for correcting for guessing and five methods that take partial knowledge into account were discussed. Kurz (1999) concluded that all these methods lack support for the theoretical rationale behind their formulas and continued to recommend number-right scoring until a better method is found. Lesage, Valcke, and Sabbe (2013) also provided an overview of conventional and non-conventional scoring methods. They discussed concerns about number-right scoring and formula scoring and gave an overview of six alternative scoring procedures. The alternative methods they discussed were not optimal solutions to the problems of the conventional scoring methods, and they therefore concluded that there is a growing need to evaluate alternatives.

1.2 Speed-accuracy trade-off

The different scoring procedures mentioned above address the concerns of number-right scoring – encouraging guessing and failure to credit partial knowledge – but do not provide a solution for the speed-accuracy trade-off problem. This problem concerns the balance between speed and accuracy (Klinkenberg, 2014). When two people answer the same number of items correctly, it is evident that the respondent who answered faster should receive a higher test score than the other respondent. However, when one respondent answered more items correctly but another respondent answered faster, one could question who should receive the higher test score.

There are different ways to incorporate response time (RT) into a test. One could (1) measure the time it takes to complete a task or subtask, or (2) measure the number of subtasks or items completed within a certain time limit. An example of option 1 is the Block Design test of the WAIS-III, and an example of option 2 is the TempoTest Automatiseren (TTA; De Vos, 2010). Both test designs share that the speed-accuracy trade-off is left to the test-taker, and therefore individual differences in chosen strategies will affect the test scores. For example, when a test measures the number of items completed within a certain time limit, the respondent can choose to answer the items in the order in which they are presented, or the respondent can answer all the easy items first and use the remaining time to complete the items that he feels are more difficult. These two different strategies can affect the test scores. Furthermore, both designs fix either the response time or the accuracy in order to measure the other. In the first option the number of items is fixed, so that it is possible to measure the time the respondent takes to complete the task. In the second option the time is fixed, so that it is possible to count the number of completed items.


These two options both treat speed and accuracy as one process; it is not possible to distinguish between the two. Van der Maas and Wagenmakers (2005) argue that a complete evaluation of a respondent’s behavior on a test should entail a combination of information from response latency and response accuracy. Computerized tests are becoming more popular and make it possible to measure both accuracy and RT, since every item of a test can have its own response time limit. A difficult item can in this case be compensated for by spending more time.

A scoring rule that provides a solution for the speed-accuracy trade-off problem is the scoring rule introduced by Van der Maas and Wagenmakers (2005). Their scoring rule, called the Correct Item Summed Residual Time (CISRT) scoring rule, can be applied to a test with a maximum completion time per item. The CISRT score depends on the accuracy of the response and on the time remaining of the maximum time allowed. The accuracy of a response (Xpi) is scored one for a correct response and zero for an incorrect response. The remaining time is equal to the maximum time allowed (d) minus the amount of time passed when the response is given (Tpi). The CISRT score per item is obtained by multiplying the remaining time (d – Tpi) by the accuracy (Xpi) of the response (Van der Maas & Wagenmakers, 2005). Thus, the total CISRT score is equal to the sum of the residual times of the correctly answered items: Σ Xpi(d – Tpi). Hence, a fast correct response results in a higher score than a slow correct response, and an incorrect response always results in zero points.
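As a rough illustration (not code from Math Garden or the cited papers), the CISRT rule can be sketched as follows, assuming a fixed time limit d per item; variable names are illustrative.

```python
def cisrt_item_score(correct: bool, rt: float, d: float = 20.0) -> float:
    """CISRT item score Xpi * (d - Tpi): residual time for correct answers, zero otherwise."""
    x = 1 if correct else 0
    return x * (d - rt)

# Total CISRT score: sum of residual times over correctly answered items.
responses = [(True, 7.0), (False, 13.0), (True, 18.0)]  # (correct, response time in seconds)
print(sum(cisrt_item_score(c, t) for c, t in responses))  # 13 + 0 + 2 = 15
```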

Van der Maas and Wagenmakers (2005) compared the CISRT scoring rule with simple response accuracy on the Amsterdam Chess Test. To assess the validity of the Amsterdam Chess Test, they correlated the scores obtained under the different scoring procedures with the scores of the Elo chess rating system. For high Elo ratings, the scores obtained under CISRT scoring had a larger correlation with the Elo ratings than the scores obtained under simple response accuracy scoring. Hence, they illustrated the advantage of the CISRT scoring rule over simple response accuracy on the easier items of the test.

Although the CISRT scoring rule takes RT into account, this rule shares a weakness with number-right scoring: like number-right scoring, the CISRT scoring rule may promote guessing. Therefore, Maris and Van der Maas (2012) proposed a scoring rule, called the Signed Residual Time (SRT) scoring rule, which avoids this drawback by implementing a penalty for incorrect responses. The rule is symmetric in that fast incorrect responses are penalized more heavily than slow incorrect responses. Hence, for a correct response a test-taker gains the remaining time as score, and for an incorrect response the test-taker loses the remaining time (Maris & Van der Maas, 2012). The remaining time (d – Tpi) is now multiplied by (2Xpi – 1), which results in a total SRT score of Σ (2Xpi – 1)(d – Tpi). Figure 1 shows a graphical illustration of the SRT scoring rule applied to a test with a maximum completion time of 20 seconds per item.

Figure 1. SRT scoring rule for a time limit of 20 seconds.

For example, a correct response given after seven seconds results in a score of 13, and an incorrect response given after 13 seconds results in a score of -7.
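The same sketch extended to the SRT rule, using the 20-second limit of Figure 1; again an illustration only, not the Math Garden implementation.

```python
def srt_item_score(correct: bool, rt: float, d: float = 20.0) -> float:
    """SRT item score (2*Xpi - 1) * (d - Tpi): incorrect answers lose the residual time."""
    x = 1 if correct else 0
    return (2 * x - 1) * (d - rt)

print(srt_item_score(True, 7.0))    # correct after 7 s    -> 13
print(srt_item_score(False, 13.0))  # incorrect after 13 s -> -7
```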

Klinkenberg (2014) assessed the validity and reliability of the ability estimations generated with this scoring rule. Data from Math Garden, the CORUS chess event 2008, and the ‘statistiekfabriek’ were analyzed. To get an indication of the convergent validity, the three data sets were compared with available external measures. These correlations with external criteria were significant for all three sources. The reliability of Math Garden scores obtained under the SRT scoring rule proved to be significantly higher than the reliability of the number-right scores. Hence, Klinkenberg (2014) concluded the SRT scoring rule to result in valid and reliable estimations of ability. However, since SRT scoring includes a penalty for incorrect responses, the rule does share some drawbacks with formula scoring.

1.3 In practice: Math Garden

The SRT scoring rule described in the previous section is implemented in a computer adaptive practice and tracking system called Math Garden (Klinkenberg, 2014). Math Garden is an online environment where children can practice their arithmetic skills (Straatemeier, Van der Maas, & Klinkenberg, 2009). It offers many different games, which all cover a different aspect of arithmetic. The games in Math Garden are adaptive, which means that the difficulty of the items is adapted to the arithmetic ability of the child (Straatemeier et al., 2009). This adaptive system is based on the Elo rating system that is used to measure the ability of chess players. The Elo rating of a chess player is an indicator of his or her chess ability. After winning a chess game the player’s rating will increase, and after losing the player’s rating will decrease. The strength of this increase or decrease depends on the difference in the ratings of the two players. In Math Garden the same rating system is used, but instead of playing against other contestants, the children play against the items (Straatemeier et al., 2009). Thus, both the children and the items are rated. For example, whenever a child answers a difficult item correctly, the child’s ability rating will increase and the item’s difficulty rating will decrease. Based on this ability rating and the item difficulty ratings, the probability that the child answers a given item correctly can be estimated. To make sure the children stay motivated to keep practicing, Math Garden administers items on which the child has a 75% chance of answering correctly. Thus, due to this rating system, the children’s abilities and the item difficulties are constantly updated.
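As a rough sketch of such an Elo-style update (the K factor, the logistic expected-score formula, and all names below are standard Elo conventions assumed for illustration; the exact update used in Math Garden is not specified in this text):

```python
import math

def expected_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct response under a logistic (Elo-like) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def elo_update(ability: float, difficulty: float, correct: bool, k: float = 0.2):
    """Raise the child's rating and lower the item's rating after a correct answer,
    and vice versa after an incorrect one; surprising outcomes give larger updates."""
    e = expected_correct(ability, difficulty)
    s = 1.0 if correct else 0.0
    return ability + k * (s - e), difficulty - k * (s - e)

# Example: a child answers an item that is slightly too difficult, and answers correctly.
print(elo_update(ability=0.0, difficulty=0.5, correct=True))
```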

However, a child’s ability score is not only based on the accuracy of the given responses, but also on the response times. The SRT scoring rule described in the previous section is employed by Math Garden. Figure 2 shows how this rule is implemented in Math Garden. The time that the child has left to respond decreases along with the number of available coins at the bottom of the screen. When a child answers an item correctly, the available coins turn green and are added to the total number of earned coins at the bottom right of the screen. Correspondingly, when a child answers an item incorrectly, the available coins turn red and are subtracted from the total score. By visualizing the scoring rule, the children can directly see the result of their response. After they finish a game, the coins they earned are added to their total number of earned coins, with which they can buy virtual prizes.

Figure 2. Math Garden addition game.

Since the SRT scoring rule that is currently implemented in Math Garden can evoke the use of strategies, while the alternative CISRT scoring rule may promote guessing, it is relevant to investigate which scoring rule leads to the best performance of the children and to the most accurate representation of the children’s true arithmetic ability. Furthermore, although most of the games in Math Garden consist of multiple-choice items, some games consist of open-ended items. Since extensive guessing is not a real problem when items do not provide alternative response options, the CISRT scoring rule seems the most obvious choice for these types of questions.

Besides the concerns about the most optimal scoring rule, Math Garden received some complaints from parents and teachers about the way the scoring rule is implemented in the games. The time pressure that is visualized by the number of available coins at the bottom of the screen could distract the children from solving the item at hand and could induce stress. The parents and teachers argued that this mainly affects children with low arithmetic ability. Although this visualized time pressure can induce stress, removing the coins also implies removing the indicator of the remaining time. Not knowing how much time is left to respond can also be unpleasant. To find out more about these complaints, a pre-test was done at two different primary schools in the Netherlands. The pre-test showed that many children found the visualized time pressure distracting. Of all the children that were questioned, 87% in one primary school and 53% in the other school indicated that they preferred playing the games in Math Garden without the coins present. (See Appendix 1 for the full report on the pre-test, including a detailed description of the procedure used.) These results show the relevance of investigating the influence of the visualized time pressure on the arithmetic performance and experience of the children.

To answer these questions about the optimal scoring rule and the effect of the visualized time pressure on performance and test validity, an experiment was conducted at four primary schools in the Netherlands. The experiment included a non-adaptive computer task, similar to the addition game in Math Garden, that compared the CISRT scoring rule with the SRT scoring rule and the presence with the absence of visualization of time pressure. This resulted in a computer task with four different parts. To assess the convergent validity of this computer task, the scores on these four parts of the task were compared with the scores on the TempoTest Automatiseren (TTA; De Vos, 2010). Furthermore, a structured interview was held afterwards to find out whether the scoring rules were understood and which preferences the children had regarding the scoring rules and the visualization of time pressure.

2. Method

2.1 Participants

One hundred ninety-six children in grades one to six from four different primary schools in the Netherlands participated in this research. The children ranged in age from 7 to 14 years and were familiar with the games in Math Garden. All participants took part in the computer task and the TTA, but for practical reasons only one hundred forty participants were interviewed. The participants were given a stamp for their participation.

2.2 Materials

2.2.1 Computer task

An arithmetic computer task was developed using Python software. The environment of the computer task was intended to be similar to the environment of Math Garden. The visualization of time pressure in this task was similar to that of Math Garden in that the time the participant had left to respond decreased along with the number of available coins at the bottom of the screen. A minor difference between Math Garden and the computer task used in this experiment is that in Math Garden the total number of earned coins is always present during a game, while in this computer task the total number of earned coins was only visible for two seconds in between the items of the task. Figure 3 shows the environment of the computer task.

Figure 3. Screen shots of the computer task. 3A: Screen shot of one item when the visualized time pressure is present. 3B: Screen shot of the total amount of earned coins.

To investigate the influence of scoring rules and visualization of time pressure on arithmetic ability, either the SRT scoring rule or the CISRT scoring rule was implemented, and the coins were either visible or not. This caused the task to consist of four different parts, shown in Table 1. All respondents completed all parts of the computer task in random order. Each part consisted of twenty different open-ended addition items of approximately the same difficulty, for which the participant had twenty seconds to respond. Because of its many users, the addition items in Math Garden all have a stable difficulty estimate. The items for this computer task were selected based on these difficulty estimates. Each part of the task started with the least difficult item and worked up to the most difficult item. For example, one easy item on the test was “3+1”, and a difficult item was “433+349”. The participants responded by typing the answer on a keyboard followed by pressing the enter key. Whenever the participants did not know the answer and did not want to guess, they could omit the item by just pressing enter. Furthermore, a correct response was followed by a high tone, while an incorrect response was followed by a low tone. For each item the accuracy, the response time, and the SRT or CISRT score were saved.

Table 1. The four different parts of the computer task.

                            Visualization of time pressure
                            Coins present    Coins absent
Scoring rule    SRT         Part 1           Part 2
                CISRT       Part 3           Part 4

2.2.2 External criterion

To be able to get an indication of the convergent validity of the different parts of the computer task, the TempoTest Automatiseren (TTA; De Vos, 2010) was also administered. Since the computer task only consisted of addition items, the participants completed only the addition part of the TTA, which consisted of 50 items. The participants were instructed to complete as many items as possible in one minute. The score on the TTA was equal to the number of correct responses.


2.2.3 Structured interview

A structured interview was held to investigate whether the participants understood and noticed the difference between the four parts of the computer task, to find out whether children prefer visualization of time pressure or not, and to investigate the strategies used by the participants. Whether the participants noticed the difference in scoring rules and in visualization of time pressure was coded zero for not noticing and one for noticing. All other questions had four answer alternatives: “no”, “a little”, “very much”, “do not know”. The interviewers chose the alternative that was most applicable to the participant’s response. For practical reasons, not all participants were interviewed.

2.2.4 Evaluation

The participants answered two evaluation questions after each part of the computer task. The first item, “I thought this game was”, consisted of three answer options: “fun”, “a little bit fun”, “not fun”. The second item, “The game went”, consisted of these three alternatives: “well”, “a little well”, “not well”. Since the participants were all native Dutch speakers, the questions were administered in Dutch.

2.3 Procedure

At each of the four participating schools, the experiment took place in a separate classroom. During regular school hours, children participated in groups of five to ten children at a time. First, the participants were asked to complete a questionnaire about their demographic characteristics. Second, the TTA was administered, preceded by vocal instructions to the group. Thereafter, vocal instructions on the computer task were given. The differences between the four parts of the task were explained, and before the participants started the computer task, the experimenters made sure all questions about the task were answered. Subsequently, the participants completed the four parts of the computer task in random order. After each of the four parts of the task, the participants answered the two evaluation questions. At the end, children were randomly selected to participate in the structured interview. This interview took place in a different classroom or part of the school than the previous part of the experiment and lasted about five minutes. After the participants finished the experiment, they were allowed to choose a stamp as a reward for their participation. In total, the experiment lasted about 35 minutes.

3. Results

3.1 Participants

In total, 196 children participated in this research. The data of one participant was removed because the z-scores were below -3.28, which is considered an outlier according to Field (2009). This resulted in a total of 195 participants. Due to computer failures, the computer task data of 21 participants was missing. For four participants, all data except the computer task data was missing. Of these 191 remaining participants, 91 were boys and 100 were girls, ranging in age from 7 to 14 years (M = 10.02, SD = 1.58).


3.2 Convergent validity

To get an indication of the convergent validity of the different parts of the computer task, the scores obtained under the SRT and CISRT scoring rules were compared with the scores on the TTA. The correlations with the TTA proved significant for all four parts of the computer task, ranging from .651 to .748, all with p < .001. Figure 4 shows these correlations.

Figure 4. Correlations between the four parts of the computer task (SRT and CISRT scores) and the TTA.

3.3 Computer task

The computer task consisted of four different parts, shown in Table 1. For all parts, the number of correct responses, the number of omitted responses, the mean log response time, and the sum scores of the SRT (parts 1 and 2) or CISRT (parts 3 and 4) rule were calculated. The scores obtained under SRT scoring cannot simply be compared with the scores obtained under CISRT scoring, because SRT scoring includes a penalty for incorrect responses and CISRT scoring does not. Therefore, the scores obtained under CISRT scoring (parts 3 and 4) were converted to SRT scores. Table 2 presents descriptive statistics of the accuracy scores, the omitted responses, and the log response times. Table 3 presents descriptive statistics of the SRT and CISRT scores.
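Since both rules are functions of the same per-item accuracy and residual time, the conversion can be done from the raw response data; a minimal sketch under the 20-second limit, with illustrative names:

```python
def srt_sum_score(responses, d: float = 20.0) -> float:
    """Recompute the SRT sum score from raw (correct, response time) pairs.
    CISRT awards X*(d - T) per item; SRT awards (2X - 1)*(d - T), i.e. the CISRT
    score minus the residual time of every incorrect response."""
    return sum((2 * (1 if c else 0) - 1) * (d - t) for c, t in responses)

responses = [(True, 5.0), (False, 12.0), (True, 10.0)]
print(srt_sum_score(responses))  # 15 - 8 + 10 = 17
```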

Table 2. The means (M) and standard deviations (SD) of the accuracy scores, the omitted responses, and the log response times (N=174).

                        Accuracy        Omission        LogRT
                        M      SD       M      SD       M      SD
SRT     Coins           14.98  3.74     2.47   3.45     1.91   0.30
        No coins        15.07  3.82     2.64   3.60     1.94   0.30
CISRT   Coins           15.30  3.80     2.35   3.34     1.95   0.30
        No coins        14.85  3.94     3.07   3.80     1.88   0.31

Table 3. The means (M) and standard deviations (SD) of the SRT and CISRT scores (N=174).

                        SRT/CISRT       SRT (converted)
                        M               M        SD
SRT     Coins           183.03          183.03   64.91
        No coins        186.80          186.80   68.39
CISRT   Coins           206.26          186.82   61.63
        No coins        204.88          187.77   62.24

Note. The means and standard deviations on the first two rows of this table are identical because SRT scoring was already calculated in the first two parts of the computer task.

3.3.1 Arithmetic performance

To investigate the implications of scoring rules and visualization of time pressure for arithmetic ability, a repeated measures ANOVA was carried out on the accuracy scores. A Shapiro-Wilk test of normality on the standardized residuals of the accuracy scores proved significant for all four parts of the computer task. The Shapiro-Wilk test statistic ranged from .898 to .928, all with p < .001. Figure 5 shows the QQ-plots of the standardized residuals. All four QQ-plots show a negatively skewed distribution and confirm the non-normality of the residuals. Although the residuals are not normally distributed, it is often argued that ANOVA is a robust test, which means that the F statistic is accurate even when the assumption of normality is violated (Glass, Peckham, & Sanders, 1972; Field, 2009). Therefore, the analysis was continued.

The analysis revealed no significant main effect of scoring rule (F(1,173)=0.094, p=.760) or of visualization of time pressure (F(1,173)=1.805, p=.181) on the accuracy scores. The interaction between scoring rule and visualization of time pressure, however, proved significant, F(1,173)=4.730, p=.031. This means that the effect of the scoring rule on accuracy differs depending on whether the coins are present or absent. When CISRT scoring was employed, the participants had a higher score when the coins were present than when they were absent. In contrast, when SRT scoring was implemented, the participants scored slightly higher when the coins were absent than when they were present. Figure 6 demonstrates this interaction effect. The participants had the highest scores when CISRT scoring was employed and the coins were visible. From these results it can be concluded that not correcting for guessing, combined with visible time pressure, results in the best accuracy.
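For reference, a minimal sketch of how such a 2×2 repeated measures ANOVA could be run in Python with statsmodels; the file name, column names, and data layout are assumptions for illustration, not the analysis script used in this study.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format data: one row per participant x condition, with columns
# pid, scoring ('SRT' or 'CISRT'), coins ('present' or 'absent'), accuracy (0-20).
df = pd.read_csv("accuracy_long.csv")

res = AnovaRM(df, depvar="accuracy", subject="pid",
              within=["scoring", "coins"]).fit()
print(res)  # F tests for the two main effects and their interaction
```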

Figure 5. QQ-plots of the standardized residuals of the accuracy scores for each part of the computer task.

Figure 6. The influence of scoring rule and visualization of time pressure on the accuracy scores. The interaction effect between scoring rule and visualization of time pressure on accuracy proved significant.


Besides the accuracy of the responses, the response times were also measured and analyzed. A Shapiro-Wilk test of normality showed the standardized residuals of the log response times to be normally distributed for all four parts of the computer task. The four QQ-plots shown in Figure 7 confirm this. The results of the repeated measures ANOVA on the log response times demonstrated no main effect of scoring rule (F(1,171)=0.531, p=.467), a marginally significant main effect of visualization of time pressure (F(1,171)=3.047, p=.083), and a significant interaction effect between scoring rule and visualization of time pressure on log response times (F(1,171)=13.892, p < .001). Figure 8 displays these effects. The fact that a similar interaction effect is found for the log response times as for the accuracy scores indicates that the difference in accuracy between the four parts of the computer task can be explained by the difference in response times. When CISRT scoring was implemented and the coins were present, the participants answered more items correctly than in the other parts of the computer task, but they also took more time to respond.

More noteworthy is the marginally significant main effect of visualization of time pressure. Overall, the participants responded faster when the coins were absent than when they were present. The slower responses when the coins were present could indicate that the participants were distracted from the arithmetic problem by the decreasing coins at the bottom of the screen. Since this main effect of visualization of time pressure is only marginally significant, no firm conclusions can be drawn until follow-up research confirms this effect.

Figure 7. QQ-plots of the standardized residuals of the log response times for each part of the computer task.

Figure 8. The influence of scoring rule and visualization of time pressure on log response times. The interaction effect between scoring rule and visualization of time pressure on log response times proved significant.

Above, the analyses of the accuracy scores and the response times were discussed. However, we are mainly interested in the performance scores that combine both speed and accuracy: the CISRT and SRT scores. As mentioned above, the CISRT scores were converted to SRT scores to be able to compare the four parts of the task. A repeated measures ANOVA was conducted on these scores. A Shapiro-Wilk test of normality on the standardized residuals of the SRT scores proved significant for all four parts of the computer task. The Shapiro-Wilk test statistic ranged from .966 to .983, with p-values ranging from < .001 to .029. The QQ-plots shown in Figure 9, however, show the scores on the four parts of the task to be approximately normally distributed. The analysis was therefore continued.

The analysis revealed no main effect of scoring rule (F(1,173)=0.881, p=.349) or of visualization of time pressure (F(1,173)=0.828, p=.364) on the SRT scores. No interaction effect was found either (F(1,173)=0.309, p=.579). Figure 10 shows the influence of scoring rule and visualization of time pressure on the SRT scores. The interaction effects found on the accuracy scores and on the response times cancel each other out when both accuracy and speed are included in one score. Since none of the effects on the SRT scores proved significant, it can be concluded that scoring rule and visualization of time pressure have no implications for the arithmetic performance of the test-taker.

Figure 9. QQ-plots of the standardized residuals of the SRT scores for each part of the computer task.

Figure 10. The influence of scoring rule and visualization of time pressure on the SRT scores. No significant effects were found.


3.3.2 Response strategies

To investigate whether children employed different answering strategies in response to the difference in scoring rules, a repeated measures ANOVA was conducted on the omission scores. A Shapiro-Wilk test of normality on the standardized residuals of the omission scores proved significant for all four parts of the computer task. The Shapiro-Wilk test statistic ranged from .729 to .792, all with p < .001. Figure 11 shows the QQ-plots of the standardized residuals. All four QQ-plots show a positively skewed distribution and confirm the non-normality of the residuals. As previously mentioned, it is often argued that ANOVA is a robust test; therefore, the analysis was continued.

The analysis revealed no main effect of scoring rule on the omission scores (F(1,173)=0.881, p=.349). This means that the type of scoring rule implemented in the task has no influence on the number of omitted items. Furthermore, the analysis did demonstrate a significant main effect of visualization of time pressure on the omission scores (F(1,173)=15.090, p < .001). The participants omitted more items when the coins were absent than when they were present. The analysis also revealed a significant interaction between scoring rule and visualization of time pressure on the omission scores (F(1,173)=7.919, p=.005). When the coins were absent as opposed to present, the number of omitted items increased more under CISRT scoring than under SRT scoring. These effects are shown in Figure 12. It is concluded that the type of scoring rule implemented in the task does not affect the response strategies employed by the test-takers.

Figure 11. QQ-plots of the standardized residuals of the omission scores for each part of the computer task.

Figure 12. The influence of scoring rule and visualization of time pressure on the number of omitted items. The main effect of visualization of time pressure and the interaction effect between scoring rule and visualization of time pressure on omission scores proved significant.

3.4 Structured interview

In the structured interview it was assessed whether the participants understood and noticed the difference between the four parts of the computer task. Out of the 141 participants that were interviewed, 74 (52.48%) understood and noticed the difference between the two scoring rules. Moreover, 121 of these 141 participants (85.82%) noticed the difference in visualization of time pressure. The fact that no significant effect of scoring rule on performance was found could be due to this large number of participants not noticing the difference in scoring rules. To investigate whether this is the case, a repeated measures ANOVA was conducted on the SRT scores of only the participants that understood the difference in scoring rules (computer data was available for 68 of these 74 participants). A Shapiro-Wilk test of normality showed the standardized residuals of the SRT scores to be normally distributed for the first three parts of the computer task; the test was significant for the fourth part (p=.042). The analysis was continued.

The analysis revealed no significant main effect of scoring rule on the SRT scores. Moreover, the analysis did not demonstrate a main effect of visualization of time pressure or an interaction effect on the SRT scores either. To further explore the influence of the different scoring rules, a Wilcoxon signed-rank test (the non-parametric variant of the paired t-test) was conducted to detect a difference between the first two parts of the task (where SRT scoring was implemented) and the last two parts of the task (where CISRT scoring was implemented). No significant difference between the scoring rules was found. It can be concluded that the type of scoring rule implemented in a game does not influence the arithmetic performance of the test-taker.
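A minimal sketch of such a paired Wilcoxon comparison with SciPy; the toy numbers below are made up purely to show the call, not data from this study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-participant scores: mean SRT score over the two SRT parts versus
# the two (converted) CISRT parts; illustrative values only.
srt_scores = np.array([180.0, 150.5, 210.0, 95.0, 175.5])
cisrt_scores = np.array([185.0, 148.0, 205.5, 99.0, 170.0])

stat, p = wilcoxon(srt_scores, cisrt_scores)
print(stat, p)
```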

The participants who understood and noticed the difference between the two scoring rules were asked about their opinions and behavior in response to the SRT scoring rule. They were asked whether they felt more motivated because of the penalty for incorrect responses, whether being punished for incorrect answers bothered them, and whether they would guess less when incorrect responses are punished (self-evidently, these were not the exact words used by the experimenters). Table 4 presents the results of this part of the structured interview. The difference in motivation was marginally significant (χ2(2)=5.931, p=.052), with most participants feeling very motivated due to the penalty for incorrect responses. The difference in being bothered by this punishment proved significant (χ2(2)=19.810, p<.001), with most participants not bothered by the punishment. The difference in guessing behavior also proved significant (χ2(2)=19.810, p<.001), with most respondents indicating that they did not guess less due to the punishment for incorrect answers. Noteworthy is that most respondents indicated not having guessed at all during the whole computer task.

Table 4. Results of the structured interview on what the participants thought of including a punishment for incorrect responses (SRT scoring rule). The number of participants (N) and percentages (%) per answer alternative are given.

                  Motivating        Bothersome        Guess less
                  N      %          N      %          N      %
Very much         28     37.84      9      12.16      2      2.70
A little          16     21.62      17     22.97      13     17.57
No                14     18.89      37     50.00      40     54.05
Do not know       16     21.62      11     14.86      19     25.68
Total             74     100        74     100        74     100

Furthermore, all the interviewed participants were also asked what they thought of the visualized time pressure. The participants were asked whether they experienced stress due to the presence of the coins, whether they thought the coins were distracting, and what they thought of lacking an indication of the remaining time when the coins are absent (self-evidently, these were not the exact words used by the experimenters). Table 5 presents the results of this part of the structured interview. The difference in how much stress the participants indicated experiencing due to the presence of the coins proved significant (χ2(2)=18.071, p<.001), with most respondents not experiencing any stress. The difference in how distracting the participants thought the coins were was also significant (χ2(2)=10.380, p=.006), with most respondents indicating not being distracted by the coins. Moreover, the difference in their preference for an indication of the remaining time also proved significant (χ2(2)=26.035, p<.001), with most participants not being bothered by the absence of an indication of the remaining time when the coins are not visible.

Table 5. Results of the structured interview on what the participants thought of the visualized time pressure (coins). The number of participants (N) and percentages (%) per answer alternative are given.

                  Stressful         Distracting       Prefers indication of remaining time
                  N      %          N      %          N      %
Very much         22     15.60      31     21.99      28     19.86
A little          33     23.40      33     23.40      22     15.60
No                58     41.13      57     40.43      63     44.68
Do not know       28     19.86      20     14.18      28     19.86
Total             141    100        141    100        141    100

Out of the 141 interviewed participants, only 31 indicated finding the coins distracting, and only 22 indicated experiencing stress due to the coins. This could explain the non-significant effect of visualization of time pressure on arithmetic performance. Therefore, a Wilcoxon signed-rank test was conducted on the SRT scores of the participants that indicated experiencing stress due to the visualization of time pressure (computer data was available for 21 of these 22 participants). The test revealed no significant difference in performance between the parts of the computer task with coins and the parts without coins.

3.5 Evaluation

The participants answered two evaluation questions after each part of the computer task. Not all participants completed all questions. Tables 6 and 7 show the number of participants per answer alternative for all questions. No difference was found in the evaluation of the four different parts of the computer task; none of the Chi-squared tests were significant.

Table 6. Results of the evaluation question “I thought this game was …” filled in by the participants after each part of the computer task. The number of participants per answer alternative and the statistic of the Chi-squared test are given.

                    Part 1    Part 2    Part 3    Part 4    χ2
Fun                 71        65        66        70        0.382
A little bit fun    82        82        80        82        0.037
Not fun             16        20        16        14        1.152

Table 7. Results of the evaluation question “The game went …” filled in by the participants after each part of the computer task. The number of participants per answer alternative and the statistic of the Chi-squared test are given.

                    Part 1    Part 2    Part 3    Part 4    χ2
Well                97        89        93        98        0.539
A little well       53        61        52        59        1.044
Not well            15        15        15        8         2.774
Total               165       165       160       165

4. Discussion

This study examined the effect of two scoring rules and of visualization of time pressure on arithmetic performance and test validity. Both SRT scoring and CISRT scoring, with or without visualization of time pressure, show high convergent validity. As for performance, when only the accuracy or only the latency of the response was taken into account, an interaction effect of scoring rule and visualization of time pressure was found. The results also showed a small effect of visualization of time pressure on response latency. The participants were most accurate when CISRT scoring was implemented and visualization of time pressure was present. However, because the participants also took more time to give a response under the same conditions, these effects disappeared when accuracy and speed were combined into one score. Moreover, even when only the participants that understood and noticed the difference in scoring rules were included in the analysis, no effect of scoring rule was found. In addition, when only the participants that indicated experiencing stress due to the visualization of time pressure were included, no effect of visualization of time pressure was found. Therefore, it is concluded that scoring rule and visualization of time pressure do not affect arithmetic performance.

Since the literature suggests differences in response strategy between SRT and CISRT scoring, the omission rates were examined. The penalty for incorrect answers included in SRT scoring reduces the incentive for guessing. However, since the items of the computer task used in this experiment do not provide alternative response options, extensive guessing is unlikely to occur. The results show no effect for scoring rule on the number of omitted items. Moreover, in the structured interview most participants indicated no difference in guessing behavior between the two scoring rules. Therefore, it is concluded that the type of scoring rule implemented in the task does not affect response strategies employed by the test-takers.

Since it is important that children enjoy practicing their arithmetic skills in Math Garden, the participants were asked about their experience of the computer task. Most participants found the SRT scoring rule motivating and indicated not being bothered by the punishment for incorrect answers. Furthermore, the two scoring rules, with or without visualization of time pressure, were evaluated as equally fun, and the participants believed they performed equally well on all parts of the task.

A remarkable result of the present study is that many participants did not understand the difference in scoring rules. This could be the result of a couple of limitations of the experiment. The instructions were given to up to ten participants at once, and each part of the computer task consisted of only 20 items. Moreover, the feedback loop in this experiment, which ensures that participants directly see the result of their response, was not very clear. In Math Garden, as explained in section 1.3, the consequence of a response is made very clear, and the total amount of earned coins is continuously displayed throughout the game. In the computer task of the present study, the total score was only displayed for about two seconds in between the items of a task. These three concerns could have contributed to almost half of the participants not fully understanding the difference in scoring rules.

The fact that no difference in response strategies was found may be explained by the use of open-ended questions in the computer task. Not having any alternatives to choose from makes extensive guessing unlikely to occur. Further research could include multiple-choice questions to assess differences in response strategies under SRT and CISRT scoring. Moreover, since only addition was assessed in the present study, follow-up research could also include other arithmetic operations such as subtraction, multiplication, and division.

In the present study, we deliberately chose to include only children that were familiar with Math Garden. However, this could have influenced the results, because all participants were accustomed to SRT scoring with visualized time pressure. Furthermore, the coins in the present study had no value and no consequences were attached to the obtained results. Further research could investigate the arithmetic performance and preferences of participants that are not familiar with Math Garden in a high-stakes setting.

Based on the results of the present study, the type of scoring rule and the visualized time pressure in Math Garden should not be changed until further research addressing the limitations discussed above indicates otherwise.


References

Alnabhan, M. (2002). An empirical investigation of the effects of three methods of handling guessing and risk taking on the psychometric indices of a test. Social Behavior and Personality: an international journal, 30, 645-652.

Bar-Hillel, M., Budescu, D., & Attali, Y. (2005). Scoring and keying multiple choice tests: A case study in irrationality. Mind & Society, 4, 3-12.

Budescu, D., & Bar-Hillel, M. (1993). To guess or not to guess: A decision-theoretic view of formula scoring. Journal of Educational Measurement, 30, 277-291.

Budescu, D. V., & Bo, Y. (2014). Analyzing test-taking behavior: Decision theory meets psychometric theory. Psychometrika, 1-18.

Burton, R. F. (2002). Misinformation, partial knowledge and guessing in true/false tests. Medical Education, 36, 805-811.

De Vos, T. (2010). Manual Tempotoets Automatiseren. Amsterdam: Boom Testuitgevers.

Diamond, J., & Evans, W. (1973). The correction for guessing. Review of Educational Research, 181-191.

Downing, S. M. (2003). Guessing on selected-response examinations. Medical Education, 37, 670-671.

Espinosa, M. P., & Gardeazabal, J. (2010). Optimal correction for guessing in multiple-choice tests. Journal of Mathematical Psychology, 54, 415-425.

Field, A. (2009). Discovering statistics using SPSS. Sage.

Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness of extreme confidence. Journal of Experimental Psychology: Human perception and performance, 3, 552.

Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of educational research, 237-288.

Jansen, B. R. J., Louwerse, J., Straatemeier, M., Van der Ven, S. H. G., Klinkenberg, S., & Van der Maas, H. L. J. (2013). The influence of practicing maths with a computer-adaptive program on math anxiety, perceived math competence, and math performance. Learning and Individual Differences, 24, 190-197.

Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica: Journal of the Econometric Society, 263-291.

Klinkenberg, S. (2014). High Speed High Stakes Scoring Rule. In Computer Assisted Assessment. Research into E-Assessment (pp. 114-126). Springer International Publishing.

Kurz, T. B. (1999). A review of scoring algorithms for multiple-choice tests.

Lesage, E., Valcke, M., & Sabbe, E. (2013). Scoring methods for multiple choice assessment in higher education: Is it still a matter of number right scoring or negative marking? Studies in Educational Evaluation, 39, 188-193.

Lord, F. M. (1975). Formula scoring and number-right scoring. Journal of Educational Measurement, 12, 7-11.

Maris, G., & Van der Maas, H. (2012). Speed-accuracy response models: Scoring rules based on response time and accuracy. Psychometrika, 77, 615-633.

McHarg, J., Bradley, P., Chamberlain, S., Ricketts, C., Searle, J., & McLachlan, J. C. (2005). Assessment of progress tests. Medical Education, 39, 221-227.

Michael, J. J. (1968). The reliability of a multiple-choice examination under various test-taking instructions. Journal of Educational Measurement, 5, 307-314.

Muijtjens, A. M. M., Van Mameren, H., Hoogenboom, R. J. I., Evers, J. L. H., & Van der Vleuten, C. P. M. (1999). The effect of a 'don't know' option on test scores: Number-right and formula scoring compared. Medical Education, 33, 267-275.

Pugh, R. C., & Brunza, J. J. (1975). Effects of a confidence weighted scoring system on measures of test reliability and validity. Educational and Psychological Measurement, 35, 73-78.

Straatemeier, M., van der Maas, H. L., & Jansen, B. Combining computerized adaptive practice and monitoring: the possibilities of self-organizing adaptive learning tools.

Straatemeier, M., van der Maas, H., & Klinkenberg, S. (2009). Werken in de Rekentuin: spelenderwijs oefenen en meten. Willem Bartjens, 28, 4-7.

Thurstone, L. L. (1919). A scoring method for mental tests. Psychological Bulletin, 16, 235-240.

Van Der Maas, H. L., & Wagenmakers, E. J. (2005). A psychometric analysis of chess expertise. The American journal of psychology, 29-60.

Zimmerman, D. W., & Williams, R. H. (2003). A new look at the influence of guessing on the reliability of multiple-choice tests. Applied Psychological Measurement, 27, 357-371.


Appendix 1: Pre-test report (originally in Dutch)

Pre-test: Experiences with Rekentuin

To find out which problems children experience with Rekentuin, a pre-test was conducted at two primary schools in the Netherlands. The pre-test focused mainly on the effect of the coins that count down on the screen during an item. In addition, more general experiences with Rekentuin were asked about, and at both schools a teacher was interviewed. Because these findings may be of interest to Oefenweb, they are also included in this report.

The pre-test took place at the Willibrordusschool in Diessen and the Paus Joannesschool in Zaandam. Both primary schools have been using Rekentuin for several years and are therefore familiar with the program. Our findings are discussed per school below.

Willibrordusschool in Diessen (13-04-2015)

Procedure

The pre-test took place in the school's computer room. From each class (groups three through eight), two to three children had been selected by the teachers. The teachers had been instructed by Ben Hagenberg, an ICT employee of the school, to select at least one strong and one weak arithmetic performer from their class. The fifteen children in total worked with Rekentuin for 30 minutes, and in the meantime the researchers went past the children one by one to ask some questions about Rekentuin and to watch while they played a game. The researchers asked the children what they think of Rekentuin, what they find particularly fun and less fun, whether they also play Rekentuin at home, how they experience the coins in the game, and whether they understand the scoring rule used in Rekentuin.

After these questions, the children played random Rekentuin games, with the coins that are visible on the screen during an item covered with a small piece of black cardboard. Only the coins were covered, not the total score. The children played one or more games while the researchers asked the questions and placed the cardboard at the next child. The researchers then returned to the child to ask how playing without being able to see the coins had been.

Findings: Children

The general tendency was that children like Rekentuin. The children indicated that they especially enjoy the mole game, the kangaroo game, their second garden, the possibility to play different games, winning coins, and buying prizes. Children who said they liked Rekentuin less often gave as a reason that they find the games boring. Especially the children from groups seven and eight, who had been playing the games since group three, indicated that Rekentuin was not challenging enough. Finally, children with lower arithmetic ability indicated that they do not like Rekentuin that much, because they do not get access to all the games. Furthermore, all children understood the scoring rule used in Rekentuin.

When the children were asked how they experienced the coins in the game, about half of the children found the coins 'fine' and the other half experienced the coins as 'stressful'. Children who experienced the coins as stressful indicated that they can no longer concentrate well on solving the problem and that they sometimes forget their answer when they see the coins draining away on the screen. After the children had played games with the coins covered, they were asked about their experiences. 87% of the children indicated that they preferred playing without the coins in view. They said they experienced less stress, could concentrate better on solving the problem, and had the impression that they made fewer errors than with the coins in view. When these children were asked whether they would like the coins to be removed from view permanently, they thought this was a good idea. Two of the fifteen children indicated that they actually found it difficult without the coins in view, because they no longer knew how much time they had left for an item. These two children had no strong preference for playing with or without the coins in view.

Findings: Teacher

When the children had finished playing, Ben Hagenberg was interviewed about his experiences with Rekentuin. Ben Hagenberg is an ICT employee of the school and also teaches a class several afternoons a week.

Ben indicated that children with low ability also suffer more from the coins. He thought it would be a good idea to give children the option to turn the coins on or off. In addition, he indicated that there should be more options to adjust the pace for weak arithmetic performers; as an example, he mentioned the option to give a child 30 seconds per item instead of 20. Furthermore, Ben indicated that he did not see children's arithmetic ability improve through the use of Rekentuin, although the children have become more motivated by using it. He noted, however, that this does not apply to a large number of children from groups seven and eight, because they have been working with Rekentuin since group three and have therefore lost their interest in the games.

Findings for Oefenweb

Below, a number of findings and remarks of the teacher that may be useful for Oefenweb are listed point by point.

• Rekentuin is focused too much on children who already have high arithmetic ability.

• Teachers are informed too little about Rekentuin. Among other things, they often do not understand that the program is adaptive and do not know that they can influence which games the children play. Because of this lack of knowledge, many teachers are reluctant to use Rekentuin.

• Ben Hagenberg also indicated that children would probably appreciate it if they received more information about Rekentuin within the game itself: information about how the garden works, what happens to the flowers when you do not play, and how you can get more games in your garden.

• Ben would like to see a link from Rekentuin results to the student monitoring system (in their case ParnasSys). The results of Cito and Wereld in Getallen are communicated to parents through this system, and Ben thinks it would be a good idea to add the children's Rekentuin results to it.

• In addition, Ben indicated that the prizes that can be bought are not appealing to all ages. He also finds the prizes less suitable for boys ("What are boys supposed to do with jewels?"), while girls might prefer to buy clothes and dress up dolls. For children from groups seven and eight he would like to see somewhat more grown-up prizes.

RKBS Paus Joannesschool in Zaandam (14-04-2015)

Procedure

From classes four through eight, three pupils each were selected to answer questions about Rekentuin in the staff room and to play a Rekentuin game on an iPad. Apart from the difference in research location and the use of an iPad instead of a computer, the procedure was identical to the procedure followed at the Willibrordusschool.

Findings: Children

Most children enjoy playing in Rekentuin. Children like that there are always new games to play. The children who like Rekentuin somewhat less, and therefore play less, find it annoying that not all games are available to them and indicate that they find this demotivating. The children from groups seven and eight sometimes indicated that they find the game boring. All children understood the scoring rule.

When the children were asked what they think of the coins in the game, about one third of the children turned out to experience the coins as disturbing or distracting. After the children had played games with the coins covered, they were asked about their experiences. The reactions were mixed. Eight of the fifteen children indicated that they preferred playing without the coins in view. They said they were less distracted, could concentrate better, did not forget their answer, experienced less stress, and guessed less than when the coins were in view. For four of the fifteen children, the presence or absence of the coins made no difference. Three of the fifteen children indicated that they found it unpleasant when they could not see the coins; the uncertainty about the remaining time caused them stress, and they said they would like to see how many coins they earn or lose.

Findings: Teacher

After all fifteen children had participated in the pre-test, Ethlyne Hart was interviewed about her experiences with Rekentuin. Ethlyne Hart is the school's arithmetic coordinator and also the teacher of group 6.

Ethlyne indicated that she does not notice the counting-down coins causing stress among pupils; in her experience, the coins are generally motivating. Rekentuin is used by most teachers only as work for children who have finished their regular tasks, or as something the children may play when they are already at school before the lesson starts. Ethlyne indicated that she sees a small improvement in children's arithmetic ability after they practice a lot with Rekentuin. Children also become more motivated when they practice with Rekentuin regularly. Children who have more difficulty with arithmetic often need extra encouragement to go and play in Rekentuin.

Findings for Oefenweb

Below, a number of findings and remarks of the teacher that may be useful for Oefenweb are listed point by point.

• Ethlyne indicated that for a number of children a different group is shown than the one they are in, but that this is an error of the teachers themselves, as they enter in the system which group a child is in. This shows that the teachers are not fully familiar with all settings of Rekentuin.

• In addition, Ethlyne turned out not to be familiar with the adaptive aspect of Rekentuin; she indicated that she did not fully understand it and that she does not use Rekentuin to obtain information about her class.

• The teachers at the school made little or no use of the back end. An ICT employee of the school only prints an overview every week showing how much each class uses Rekentuin. Ethlyne did not know that there are many other ways to retrieve information in Rekentuin about a child's level and progress. The number of hours a child plays in Rekentuin is, however, translated into an unsatisfactory, satisfactory, or good mark on the child's report card.

• Finally, Ethlyne indicated that the appearance of Rekentuin could be updated. The game has now existed for 6 years, so according to her it was time to renew the layout.

