Implications of Unidimensionality when Measuring Math Ability


Vincent Schreuder

May 2018


1 Abstract

The learning goals for math ability in primary school education have attracted attention, since a large group of children seems to be under-performing. An alternative explanation could be that these learning goals are set too high. To offer an alternative to the current theory-driven learning goals, this paper explores the possibility of formulating new learning goals from actual data. With the help of Math Garden, a scientific and educational tool used by thousands of children in the Netherlands, a large high-frequency longitudinal data set was collected. In this paper, the tool is introduced, several analyses and visualizations are used to explore the unidimensionality of the item response theory model used by Math Garden, and the patterns between correlating items are examined to see whether this data-oriented approach holds merit.


2 Introduction

The ability to do math is a standard part of the curriculum of primary schools. Together with language skills, it is seen as one of the core subjects of a child’s education. In the last decade, however, there have been multiple reports about declining math abilities in young children (Huygen, 2018; Redmond, 2018; La Rose, 2013). Learning basic mathematical skills evidently remains necessary: higher math ability is associated with higher working memory abilities (Ashcraft and Krause, 2007), higher levels of employment (Finnie and Meng, 2006) and proficiency in human judgment and decision making (Reyna and Brainerd, 2007). At the same time, there seems to be a discrepancy between what is expected of primary school children regarding math ability and their actual performance. The National Expertise Centre for Learningplan-development, the association responsible for the development of learning goals in the Netherlands, sets targets for each year in primary education (Noteboom et al., 2017). A report issued by the Dutch government found, however, that only 46% of the children actually reach these targets (College voor Toetsen en Examens, 2016). It therefore seems important to look at how these learning goals are formulated and at new developments that can help in better understanding math ability and perhaps in creating more feasible learning goals.

To create a better understanding of what children should be able to learn at different points in their lives and in the corresponding school years, it is first necessary to get a better understanding of the cognitive development regarding math ability. Molenaar (2004) argued that psychology should be seen as an idiographic science and that understanding psychological processes can only be achieved by looking at variations within a person. In line with this idea, the microgenetic theory (Siegler and Crowley, 1991) states that cognitive development should be measured through high-frequency observations of single cases for the duration of the process. While the theory is plausible, applying it has always been troublesome, since obtaining large longitudinal data sets of children is difficult. Fortunately, with the arrival of Computerized Adaptive Testing (CAT) in general and the creation of Math Garden specifically (Klinkenberg et al., 2011; Straatemeier, 2017), these data sets are now more readily available (Hofman et al., 2018). To clarify the origins of the data and the novelty of such tools, we present an overview of Math Garden and its scientific and didactic implications in the following section.

2.1 Math Garden

Math Garden is an online computer adaptive testing and learning program. Originally built as a scientific measurement tool, it has now become a successful practice program used by almost 300,000 children in the Netherlands. Players each have their own online garden, which represents the different games they can play and also their progress in these games; the bigger the plants, the better the player. After starting a game, the player gets ten consecutive math problems, each with a time limit of 20 seconds. Based on the accuracy and the response time of the answer, the player either wins or loses a certain amount of coins for each problem. This scoring rule is designed to punish fast incorrect responses and reward fast correct responses. This gives the players a simple understanding of what is expected of them and also counteracts random answering. The rule is called the High Speed High Stakes (HSHS) scoring rule (Maris and van der Maas, 2012) and is defined as follows:

$$S_{pi} = (2X_{pi} - 1)(d - T_{pi})$$

Where X represents the accuracy of the answer (1 if correct, 0 if incorrect) and T the response time. The parameter d corresponds to the time limit and is set to 20 seconds by default. The actual score of the player is compared to the estimated score, and the difference determines whether the rating of the player goes up or down. Each player and each item has a rating, which expresses how able the player is or how difficult the item is, respectively. The score is estimated via the formula below.

$$E(S \mid \theta, \beta) = d\,\frac{e^{2d(\theta - \beta)} + 1}{e^{2d(\theta - \beta)} - 1} - \frac{1}{\theta - \beta}$$

Where θ represents the ability of the player and β the difficulty of the item. After each item, these parameters are updated based on the difference between the estimated score and the actual score, following an Elo updating scheme (Klinkenberg et al., 2011):

$$\theta_p = \theta_p + K_p\,(S_{pi} - E(S)_{pi})$$

$$\beta_i = \beta_i - K_i\,(S_{pi} - E(S)_{pi})$$

Where K represents the weight given to the update. When a player has just started, the weight is set relatively high, so that the first items answered have a large impact on the player's rating. The player is matched with items in such a way that the player is expected to answer 60, 75 or 90% of them correctly, depending on the difficulty level that has been chosen (Jansen et al., 2013). The difficulty therefore increases as the player gets better and decreases if the player needs to practice a certain subject a little more. Abilities are calculated separately for each game and its corresponding math domain.
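The scoring rule, the expected score and the Elo update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Math Garden implementation: the function names are hypothetical, the K values in the usage note are arbitrary, and the small-difference branch uses the series approximation d²(θ − β)/3 to avoid dividing by zero when θ = β.

```python
import math


def hshs_score(correct: bool, response_time: float, d: float = 20.0) -> float:
    """HSHS score S = (2X - 1)(d - T), with X = 1 for a correct answer."""
    x = 1 if correct else 0
    return (2 * x - 1) * (d - response_time)


def expected_score(theta: float, beta: float, d: float = 20.0) -> float:
    """E(S | theta, beta) = d * (e^{2dz} + 1) / (e^{2dz} - 1) - 1/z, z = theta - beta."""
    z = theta - beta
    if abs(z) < 1e-4:
        # Assumption: series approximation near z = 0 (the exact limit is 0).
        return d * d * z / 3.0
    # (e^{2dz} + 1) / (e^{2dz} - 1) equals 1 / tanh(d * z), which avoids overflow.
    return d / math.tanh(d * z) - 1.0 / z


def elo_update(theta, beta, score, k_player, k_item, d=20.0):
    """Move the player rating up and the item rating down by the prediction error."""
    error = score - expected_score(theta, beta, d)
    return theta + k_player * error, beta - k_item * error
```

For example, a correct answer after 5 seconds scores `hshs_score(True, 5.0)`, i.e. 15, while the same response time on an incorrect answer scores −15. Calling `elo_update(0.0, 0.0, 15.0, 0.01, 0.01)` then raises the player rating and lowers the item rating by the same amount.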

The novelty of Math Garden and its scientific value come from the fact that it is used by a large number of children and that a large percentage of these users play with high frequency and over a long period. Math Garden therefore provides a data set that is suitable for analyses both between persons and within persons. Since its beginning, it has been used in multiple studies on cognitive development. Jansen et al. (2014), for example, used data from Math Garden to suggest that pattern recognition plays an important role in subitizing numbers for young children. In another study using data from Math Garden, van der Ven et al. (2013) found that visuospatial working memory and math ability were related, especially for the domains of subtraction and addition.

There are, however, also some possible drawbacks to the model that Math Garden uses for estimating math ability. In a recent study, Hofman et al. (2018) looked at, among other things, the assumption of unidimensionality for this model. This assumption states that, for the model to actually predict math ability, math ability must be the only thing being measured, and math ability must be a construct that can be described by a single variable. Checking this assumption proved harder than usual, since the data from Math Garden are not amenable to factor analysis, which is the usual method for testing the number of underlying factors. The reason is that not every player in Math Garden answers every item, which results in a lot of missing data. Hofman et al. (2018) therefore used an array of different analyses, which together provide some evidence that the assumption of unidimensionality does not hold. One of these analyses concerned item clusters, which are the main focus of this article.

2.2 Item Clusters

Hofman et al. (2018) analyzed the data using an extended version of the model that is currently used by Math Garden. This model is called the networked-Elo model (Pelánek, 2016) and it differentiates between a global skill (θglobal) and a set of local skills (θlocal). For this analysis, the global skill represented the skill on the math domain (e.g. addition or multiplication) and the local skill represented the skill on each item (or cluster of items). The estimated score was determined using the following formula:

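$$P(x = 1 \mid \theta_{\mathrm{global}|p}, \theta_{\mathrm{local}|p}, \beta_i) = \frac{1}{1 + e^{-(w_1 \theta_{\mathrm{global}|p} + w_2 \theta_{\mathrm{local}|p} - \beta_i)}}$$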

For this particular study, the weights of both skills (w1 and w2) were set to .5. The response time was not included in the estimation and the parameters were updated in the following manner:

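$$\theta_{\mathrm{global}|\mathrm{new}} = \theta_{\mathrm{global}|\mathrm{old}} + K_p\,(S - E(S))$$

$$\beta_{\mathrm{new}|i} = \beta_{\mathrm{old}|i} - K_i\,(S - E(S))$$

$$\theta_{\mathrm{local}|\mathrm{new}} = \theta_{\mathrm{local}|\mathrm{old}} + K_p\,(S - E(S))$$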

The weights Kp and Ki were set to 0.25 and 0.01 respectively. After using the data to estimate the model, a hierarchical clustering algorithm was used to visualize the correlations between the local skill estimates and to inspect whether certain item clusters were present. See Figure 1 for the heatmap of these clusters.

The heatmap clearly shows recognizable patterns of items. These clusters and patterns represent items and math problems that have high positive or negative correlations. A high correlation means that a correct response on one item is either positively or negatively related to a correct response on the other item. These patterns could therefore be interpreted as math problems that make use of the same kind of strategy and are thus developed at the same stage in the cognitive development of a child. Before these kinds of conclusions can be drawn, however, further analysis is needed. To that end, we discuss two analyses that will hopefully further substantiate the findings. The analyses have the following goals:

1. Optimize the weights of w1 and w2

2. Optimize the definition of θlocal by estimating the optimal number of clusters

Hofman et al. (2018) set the weights of w1 and w2 to 0.5 for convenience's sake, but the relation between these two parameters could give a lot of information about the credibility of the one-parameter model used by Math Garden. The more weight is given to the global skill (w1), the more the assumption of unidimensionality, and therefore also the one-parameter model, is substantiated. In the analysis, eleven models with different weight relations will be compared using cross validation.

The definition of the local skill is also important here. In theory, the local skill gives information about which items correlate highly with each other and are therefore likely to be solved using the same cognitive processes. Optimizing the local skill is therefore important to learn as much as possible about which items are similar with respect to math skill. On top of that, if the local skill is not optimized, we cannot know for certain that the assumption of unidimensionality holds, because the alternative may just be weakly defined and therefore insignificant. To optimize this definition, the results of two definitions will be compared. First, local skill will be defined as the skill of the player on a certain item. Then, local skill will be defined as the skill on a cluster of items. The clustering will be based on a cluster algorithm that searches for the number of clusters that best describes the correlation matrix of the different items. Afterwards, these clusters and the corresponding items will be used to estimate the model and to optimize the relation between local and global skill again.
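The networked-Elo prediction and update described in this section can be sketched as follows. This is a minimal illustration, not the code used by Hofman et al. (2018) or in this paper: the function names are hypothetical, the default weights (0.5) and K values (0.25 and 0.01) are the ones quoted above, and the weight grid mirrors the eleven models that are compared.

```python
import math


def predict_correct(theta_global, theta_local, beta, w1=0.5, w2=0.5):
    """P(x = 1) = 1 / (1 + exp(-(w1 * theta_global + w2 * theta_local - beta)))."""
    return 1.0 / (1.0 + math.exp(-(w1 * theta_global + w2 * theta_local - beta)))


def networked_elo_update(theta_global, theta_local, beta, correct,
                         w1=0.5, w2=0.5, k_player=0.25, k_item=0.01):
    """Update both skills with K_p and the item difficulty with K_i."""
    error = (1.0 if correct else 0.0) - predict_correct(
        theta_global, theta_local, beta, w1, w2)
    return (theta_global + k_player * error,
            theta_local + k_player * error,
            beta - k_item * error)


# Eleven (w1, w2) pairs from (1.0, 0.0) down to (0.0, 1.0) in steps of 0.1.
WEIGHT_GRID = [(round(1 - i / 10, 1), round(i / 10, 1)) for i in range(11)]
```

With w1 = 1 and w2 = 0 the local skill drops out of the prediction, which is why that model corresponds to the unidimensional case.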

3 Method

In this section, the procedure of the different analyses will be discussed. For the analyses, a data set generated from Math Garden was used. The data set consists of the 200 most played items of players who completed at least 20 sessions of 15 responses between ’2014-09-01’ and ’2017-06-01’. It comprises 5,144 players and 2,708,027 responses. These responses consisted only of the item accuracies; response time was not taken into account. The data set was used to estimate the different models and update the parameters over time.

3.1 First analysis: Weight estimation

For the first analysis, the goal was to optimize the relation between the weights of the global skill and the local skill. The first step was to create a function that could handle the data and be used to estimate the model. The problem with analyzing data sets of this size is that the usual methods in the statistical analysis program R are quite slow. To speed up this process, the function was written in the programming language C++ and linked to R using the R package Rcpp (Eddelbuettel et al., 2018). At first this gave some problems because C++ handles memory differently from R. These problems were quickly overcome, however, and the runtime decreased from half an hour to two seconds.

[Figure 1: Heatmap of the correlations between the local skill estimates per item (Hofman et al., 2018). Axis labels: the addition items. Colour scale: Cor, with ticks at −0.6, −0.3, 0.0 and 0.3.]

The estimation process was done via cross validation. Cross validation is a statistical method used to see how well a model can predict the data. One part of the data set is used to train the model and another part is used to validate the result. The validation is done to make sure that the model is not only applicable to the data that has been used to train it, but can also be used to predict new data. For the cross validation, the data set needed to be split into the training data, which would be used to estimate the model, and the test data, which would be used to validate the model. The test data set was made by sampling random responses from the actual data set. For each user, five responses were randomly selected, after which the response which chronologically had been given last was selected for the test data set. The last response was chosen to reduce the learning effect, which increases after each answer on the same item. After the test data was chosen from the data set, the rest of the data was used as the training data for the model. For each model, a ten-fold cross validation was done, meaning that the process of creating new test data was done ten times. Each time, the test data set consisted of 25,720 observations.
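The test-set selection described above can be sketched as follows. The description leaves some room for interpretation, so this sketch makes an explicit assumption: for each user, five items are sampled and the chronologically last response on each of them goes to the test set. That reading yields five test responses per user, which matches 25,720 = 5 × 5,144. The function name and the tuple layout `(user, item, timestamp, correct)` are hypothetical.

```python
import random
from collections import defaultdict


def split_train_test(responses, n_items=5, seed=0):
    """Split (user, item, timestamp, correct) tuples into train and test lists."""
    rng = random.Random(seed)

    # Index of the chronologically last response for every (user, item) pair.
    last_idx = {}
    for idx, (user, item, timestamp, _correct) in enumerate(responses):
        key = (user, item)
        if key not in last_idx or timestamp > responses[last_idx[key]][2]:
            last_idx[key] = idx

    items_by_user = defaultdict(list)
    for user, item in last_idx:
        items_by_user[user].append(item)

    # Assumption: sample n_items items per user, keep the last response on each.
    test_idx = set()
    for user, items in items_by_user.items():
        for item in rng.sample(sorted(items), min(n_items, len(items))):
            test_idx.add(last_idx[(user, item)])

    train = [r for i, r in enumerate(responses) if i not in test_idx]
    test = [responses[i] for i in sorted(test_idx)]
    return train, test
```

Repeating the call with ten different seeds would give the ten test sets of the ten-fold procedure, with all remaining responses serving as training data each time.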

The models that were compared differed only in their weights of w1 and w2. In Table 1, the different weight relations can be seen. In total, eleven different models were estimated.

Table 1: Overview of the different models

w1    w2
1.0   0
0.9   0.1
0.8   0.2
0.7   0.3
0.6   0.4
0.5   0.5
0.4   0.6
0.3   0.7
0.2   0.8
0.1   0.9
0     1.0

After estimating the models, some informative statistics were calculated for the comparison against the test data. The most important were the Root Mean Squared Error (RMSE) and the bias. These are both effective measures of how the estimated scores of a model compare to the actual scores (Hyndman and Koehler, 2006).

$$\mathrm{RMSE} = \sqrt{\operatorname{mean}\left((S - E)^2\right)}$$

$$\mathrm{Bias} = \operatorname{mean}(S - E)$$

On top of that, the Area Under the Curve (AUC) and the proportion correct were calculated. The proportion correct pertains to how often the estimated score reflected the true answer if estimations above 0.5 were counted as correct answers and below 0.5 were counted as wrong answers. All of these statistics were calculated ten times for each model after which the mean was taken.
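The four fit statistics can be sketched as follows, where `actual` holds the observed accuracies (0 or 1) and `predicted` the estimated scores. The function names are hypothetical and the code is an illustration rather than the implementation used here. Two details are assumptions: a prediction of exactly 0.5 is counted as predicting an incorrect answer, and the AUC is computed by direct pairwise comparison, which is only practical for small samples.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and estimated scores."""
    return math.sqrt(
        sum((s - e) ** 2 for s, e in zip(actual, predicted)) / len(actual))


def bias(actual, predicted):
    """Mean signed difference S - E; positive means the model underestimates."""
    return sum(s - e for s, e in zip(actual, predicted)) / len(actual)


def proportion_correct(actual, predicted, threshold=0.5):
    """Share of responses where thresholding the estimate matches the outcome."""
    hits = sum((e > threshold) == (s == 1) for s, e in zip(actual, predicted))
    return hits / len(actual)


def auc(actual, predicted):
    """Probability that a correct response gets a higher estimate than an
    incorrect one, counting ties as one half."""
    pos = [e for s, e in zip(actual, predicted) if s == 1]
    neg = [e for s, e in zip(actual, predicted) if s == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the procedure described above, each statistic would be computed once per test set and then averaged over the ten test sets.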

3.2 Second analysis: Cluster estimation

For the second analysis, the goal was to optimize the definition of θlocal by estimating the optimal number of clusters. The clusters represent different groups of items which correlate highly. To estimate the optimal number, a function called Mclust from the R package mclust (Fraley et al., 2012) was used. First, the missing data was set to null, to make sure it would not interfere with the process. Then the correlation matrix of the θlocal estimated per item was made. Afterwards, the distances between the correlations were calculated using the Manhattan metric. The Manhattan distance between two sets of correlations is the sum of their absolute differences. This metric was used to create more distinction between the correlations and to make the clusters easier to interpret later on.

After the correlation matrix was calculated, the Mclust function fitted models with different numbers of clusters to see which one provided the best description of the data. The models fitted ranged from one to thirty clusters. The best model was selected using the Bayesian Information Criterion (BIC).
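The distance step can be illustrated with a short sketch. It assumes that the Manhattan distance is taken between the rows of the item correlation matrix, so that two items are close when they correlate similarly with all other items; the function name is hypothetical. The mixture-model fitting and BIC comparison performed by Mclust are not reproduced here.

```python
def manhattan_distance_matrix(corr):
    """Pairwise Manhattan distances between the rows of a correlation matrix.

    corr is a square list of lists; entry [i][j] of the result is the sum of
    absolute differences between row i and row j.
    """
    n = len(corr)
    return [
        [sum(abs(a - b) for a, b in zip(corr[i], corr[j])) for j in range(n)]
        for i in range(n)
    ]
```

For a three-item matrix with rows (1.0, 0.5, −0.2) and (0.5, 1.0, 0.1), the distance between the first two items is 0.5 + 0.5 + 0.3 = 1.3.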


In a next step, the clusters defined by the best model were used to estimate the different weights of w1 and w2 again, to see whether the results differed from the first analysis. For this estimation, the number of different weight combinations was set much higher, because the first analysis proved to be less detailed than expected. The number of fitted models was therefore increased tenfold. After the estimation, the fit statistics were calculated and compared for the different models.

4 Results

4.1 First analysis: Weight estimation

For the first analysis, the fit statistics were calculated for all models after the ten-fold cross validation. Table 2 gives an overview of these statistics. It is important to understand that the model with a w1 of 1.0 represents the model in which the assumption of unidimensionality holds.

Table 2: Overview of different estimation statistics

model  w1   w2   RMSE    bias    AUC     prop.cor.
1      1.0  0    0.4171  0.046   0.5824  0.6833
2      0.9  0.1  0.418   0.0477  0.5826  0.6843
3      0.8  0.2  0.419   0.0497  0.5833  0.6854
4      0.7  0.3  0.4203  0.052   0.5838  0.6864
5      0.6  0.4  0.4219  0.0547  0.5842  0.687
6      0.5  0.5  0.4238  0.058   0.5842  0.687
7      0.4  0.6  0.4263  0.0621  0.5837  0.6859
8      0.3  0.7  0.4296  0.0677  0.5835  0.6845
9      0.2  0.8  0.4346  0.0761  0.5836  0.6822
10     0.1  0.9  0.4436  0.0917  0.5826  0.6755
11     0    1.0  0.471   0.1441  0.5724  0.6245

From Table 2 it seems evident that the unidimensional model works best. The RMSE and the bias are lowest for the model with w1 = 1, which means that this model approximates the true scores best. The AUC and the proportion correct are not the highest for this model, but they differ only very slightly from the models that score best on these aspects (AUC 0.5824 versus 0.5842; proportion correct 0.6833 versus 0.687). In Figure 2 and Figure 3, the RMSE and the bias are graphed against the different values of w1, which represent the eleven models. The light grey lines represent the different cross validations and the black line in the middle represents the mean of these ten cross validations. The straight blue line depicts the mean value of the graphed statistic for the unidimensional model, to distinguish it clearly from the other models and because it is the most relevant model for our analysis.

Both Figure 2 and 3 and Table 2 show that the model with w1 = 1 fits the best. To further substantiate this finding however, it is important to look at the θlocal and how this parameter can be optimized.

4.2 Second analysis: Cluster estimation

For the second analysis, we first used the mclust package to estimate models with different numbers of clusters and compared them using the BIC. In total, 30 different models were compared, with each model increasing the number of clusters by one. The model with 21 clusters was found to have the lowest BIC. This model was then used to again estimate the different weights of w1 and w2, but now defining θlocal as 21 clusters instead of an ability per item. Because 110 different models were fitted for this estimation, no overview of the different statistics will be given. In Figures 4 and 5, however, the RMSE and the bias are graphed for all the different values of w1.

The first observation to note is the difference between Figure 4 and Figure 2, representing the difference between the first and the second analysis. The second analysis shows that non-unidimensional models fit better. The best model seems to have a w1 of around 0.7, which is very different from the first analysis. This does not seem to hold for the bias, however, as Figures 5 and 3 are equivalent. It can be reasoned that the bias is likely to be bigger when the clusters are taken into account, and the RMSE therefore seems the most important statistic, since it determines the absolute fit.

Figure 2: Graph of RMSE for all different weights in the first analysis

Figure 4: Graph of RMSE for all different weights in the second analysis

4.3 Correlations between higher-order Cluster skills

After completing the two analyses above, we had the information needed to visualize the data and see whether certain patterns could be identified.

Firstly, the clusters identified by the second analysis were used to order the correlation matrix of θlocal defined per item. These clusters were then visualized as a heatmap with the different items as labels. The result of this visualization can be seen in Figure 6. The colour green represents strong positive correlations between items and the colour red strong negative correlations. When compared to Figure 1, created by Hofman et al. (2018), a few things stand out. First of all, both heatmaps clearly show similar distinguishable patterns (both are greener in the lower left and upper right). On closer inspection, however, the items forming the clusters in Figure 6 do not appear to be related in substance. The items making up the clusters of Figure 1, in contrast, are connected content-wise. There is a cluster of items with small solutions that correlates negatively with items with large solutions, and a small cluster of items that involve adding zero can also be identified (Hofman et al., 2018). Unfortunately, this is not the case for the heatmap that we visualized. The possible reasons for this outcome will be discussed later in this paper.

Secondly, the clusters found in the second analysis were visualized in a heatmap to see how the different clusters related to each other. Figure 7 represents the heatmap of the correlation matrix of the θlocal defined as skill per cluster, ordered hierarchically using the hierarchical clustering method that was also used by Hofman et al. (2018). The green colour represents strong positive correlations between clusters, while the red colour represents strong negative correlations. While the same pattern as in Figure 6 can be observed in this heatmap, there again does not seem to be a clear connection in content. Upon closer inspection of highly correlating clusters, it was difficult to see what linked their items together and what makes these clusters correlate highly with each other. This will also be considered in more depth in the discussion section.

5 Discussion

In the current paper, data from Math Garden was used to examine the assumption of unidimensionality which underlies the model of Math Garden and to explore the dimensionality of this data set. In the first analysis, it was found that when using the networked model of Pelánek (2016) and defining θlocal as a skill for each item, the unidimensional model fits the best. Since it is not very likely that children use a different skill for each separate item, a second analysis was done. In the second analysis, it was found that when θlocal was defined as a skill per cluster of items, the optimal number of clusters would be twenty-one. Moreover, when this new definition of θlocal was used to again look at the models with different weights of w1 and w2, the unidimensional model did not fit best. Instead, the best model had a w1 of around 0.7, with a difference in RMSE of around 0.002. While the difference might not be very big, this does add to the evidence that math ability is not a solely unidimensional skill. On top of these analyses, the clusters were visualized in heatmaps, which were then compared to the heatmap generated by Hofman et al. (2018). While the heatmaps did show the same patterns, the items within our clusters did not seem to be connected content-wise.

This brings us to the first issue with our analyses. The goal of this paper was, ultimately, to learn more about the cognitive development of math ability and to apply this knowledge to the learning goals that are being set in primary education. While we have learned from the Math Garden data that there are indeed multiple other factors which influence the performance of children on math items, the fact that the clusters are difficult to interpret content-wise also makes it difficult to apply this knowledge to actual learning goals.

[Figure 6: Heatmap of the correlation matrix of θlocal defined per item, ordered by the clusters from the second analysis. Axis labels: the addition items. Colour scale: Cor, with ticks at −0.2, 0.0, 0.2 and 0.4.]

[Figure 7: Heatmap of the correlation matrix of θlocal defined as skill per cluster, ordered hierarchically. Axis labels: cluster numbers 1 to 21. Colour scale: Cor, with ticks at −0.4, 0.0 and 0.4.]

It could be suggested that this is a clear indication that we have not found anything other than random fluctuations in the data. With the amount of data we used, the statistical power is very high and even the smallest differences get magnified. While this is generally a good thing, it could also mean that the difference between the models is just caused by a random fluctuation or by a variable that has nothing to do with math ability. To rule out this possibility, it would be best to repeat the cluster analyses, but with clusters made of items representing the learning goals as they are set now. The expectation is that this would give a result similar to what we have found now. On top of that, it would be interesting to see how the clusters behave if a different clustering method is used, for example the hierarchical clustering method used by Hofman et al. (2018). The content of the clusters might then turn out differently.

Another issue with the findings of the second analysis is that twenty-one clusters is quite a lot, if we think about what these clusters mean. It means that each player has twenty-one different skills for the domain of addition. A model with twenty-one different parameters, while formally the best model, is difficult to interpret. In psychology we generally try to fit models that, on the one hand, describe the data adequately but, on the other hand, are also easy to interpret. Ultimately these models are made with the goal of learning about the cognitive development of children, and therefore easy interpretability is a big plus. While twenty-one clusters might give a better model fit, it becomes more difficult to identify the content of the clusters as their number increases, which is evident from the fact that we cannot interpret the content of our current clusters. Therefore, an argument could be made that twenty-one clusters is too many and that it would be better to have, say, ten item clusters. In future analyses, it would be interesting to see the effect of defining θlocal as ten item clusters and to look at which weights would be most efficient then. The expectation here is that the RMSE will be fairly even across values of w1.
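The proposed comparison of weights can be sketched as a sweep over w1 that records the RMSE for each value. The sketch rests on several assumptions that are not taken from the analyses above: that w1 is the weight on the global ability and 1 - w1 the weight on the local (cluster) ability, that the weighted ability enters a logistic function together with an item difficulty, and that the toy responses and all function names are purely hypothetical.

```python
import math


def predicted_probability(theta_global, theta_local, beta, w1):
    """Probability of a correct answer under an assumed weighted logistic model."""
    ability = w1 * theta_global + (1.0 - w1) * theta_local
    return 1.0 / (1.0 + math.exp(-(ability - beta)))


def rmse(observed, predicted):
    """Root mean squared error between observed accuracies and predictions."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy responses: (theta_global, theta_local, item difficulty, accuracy).
responses = [
    (0.5, 1.0, 0.0, 1),
    (0.5, -0.5, 0.2, 0),
    (-0.3, 0.4, -0.1, 1),
    (1.2, 0.8, 1.0, 1),
    (-0.8, -1.0, 0.0, 0),
]


def rmse_for_weight(w1):
    """RMSE over all toy responses for one value of w1."""
    observed = [x for _, _, _, x in responses]
    predicted = [predicted_probability(g, l, b, w1) for g, l, b, _ in responses]
    return rmse(observed, predicted)


# Sweep w1 from 0.0 to 1.0 in steps of 0.1 and pick the weight with the lowest RMSE.
rmse_per_w1 = {step / 10: rmse_for_weight(step / 10) for step in range(11)}
best_w1 = min(rmse_per_w1, key=rmse_per_w1.get)
print(best_w1, rmse_per_w1[best_w1])
```

If the expectation stated above holds, the values in rmse_per_w1 would differ only slightly from one another when ten clusters are used.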

Lastly, there is a possible issue with the data used in the model estimation. Only the accuracy of the answers was used to train and validate the model. In the HSHT model (Maris and van der Maas, 2012), however, which is the model that Math Garden uses, response time is also taken into account. It makes up an important part of the score a person gets for answering an item. Response time was left out here mainly for convenience. On top of that, the chances of getting better results from a model with multiple dimensions would probably only increase if response time were included, since two variables would then already be used to estimate a score. Therefore the approach taken here may well be the best for looking at unidimensionality. In the future, however, it would be best to conduct further research into this estimation process and the optimization of the estimated score.
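The difference between the two kinds of scores can be made concrete with a minimal sketch. It assumes that the scoring rule of Maris and van der Maas (2012) takes the form S = (2x - 1)(d - t), with accuracy x, time limit d and response time t. The function names and the time limit of 20 seconds are illustrative choices, not values taken from the analyses above.

```python
def accuracy_only_score(correct: bool) -> int:
    """Score used when only accuracy is modelled: 1 for correct, 0 for incorrect."""
    return 1 if correct else 0


def speed_accuracy_score(correct: bool, response_time: float, time_limit: float) -> float:
    """Score under the assumed rule S = (2x - 1)(d - t).

    Fast correct answers earn the most points, fast incorrect answers lose the
    most points, and any answer given exactly at the time limit scores zero.
    """
    if not 0 <= response_time <= time_limit:
        raise ValueError("response_time must lie between 0 and time_limit")
    sign = 1 if correct else -1
    return sign * (time_limit - response_time)


time_limit = 20.0  # illustrative time limit in seconds
for correct, response_time in [(True, 5.0), (True, 18.0), (False, 5.0)]:
    print(
        correct,
        response_time,
        accuracy_only_score(correct),
        speed_accuracy_score(correct, response_time, time_limit),
    )
```

Two correct answers that are indistinguishable on accuracy alone receive clearly different scores once response time is included, which shows how much information the accuracy-only analysis leaves unused.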

The findings of this thesis could be used to further explore the cognitive development of math ability and to examine the learning goals currently used in primary school education. If the method of identifying clusters within certain domains is improved, it could be used to identify sets of problems that rely on the same strategy. Once identified, these sets could be compared to the learning goals for primary school children, to see whether the findings substantiate the theory-driven learning goals now in use or whether a new way of formulating these learning goals would be better. At Oefenweb, the quest of exploring what children can learn and should learn at certain ages will continue.

References

Ashcraft, M. H. and Krause, J. A. (2007). Working memory, math performance, and math anxiety. Psychonomic Bulletin & Review, 14(2):243–248.

College voor Toetsen en Examens (2016). Rapportage referentieniveaus 2015-2016. Technical Report November.

Eddelbuettel, D., Francois, R., Allaire, J., Ushey, K., Kou, Q., Russell, N., Bates, D., and Chambers, J. (2018). Rcpp: Seamless R and C++ Integration. R package.

Finnie, R. and Meng, R. (2006). The Importance of Functional Literacy: Reading and Math Skills and Labour Market Outcomes of High School Drop-outs. Statistics Canada, Analytical Studies Branch, (11):1–21.

Fraley, C., Raftery, A. E., Murphy, T. B., and Scrucca, L. (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report 597, University of Washington, pages 1–50.

Hofman, A. D., Jansen, B. R. J., De Mooij, S. M. M., Stevenson, C. E., and Van Der Maas, H. L. J. (2018). A Solution to the Measurement Problem in the Idiographic Approach Using Computer Adaptive Practicing.

Huygen, M. (2018). Rekenen wordt minder, taal gaat niet vooruit.

Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688.

Jansen, B. R., Hofman, A. D., Straatemeier, M., van Bers, B. M., Raijmakers, M. E., and van der Maas, H. L. (2014). The role of pattern recognition in children’s exact enumeration of small numbers. British Journal of Developmental Psychology, 32(2):178–194.

Jansen, B. R. J., Louwerse, J., Straatemeier, M., Van der Ven, S. H. G., Klinkenberg, S., and Van der Maas, H. L. J. (2013). The influence of experiencing success in math on math anxiety, perceived math competence, and math performance. Learning and Individual Differences, 24:190–197.

Klinkenberg, S., Straatemeier, M., and Van Der Maas, H. L. J. (2011). Computer adaptive practice of Maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57:1813–1824.

La Rose, L. (2013). Canadian students math science scores dip.

Maris, G. and van der Maas, H. (2012). Speed-Accuracy Response Models: Scoring Rules based on Response Time and Accuracy. Psychometrika, 77(4):615–633.

Molenaar, P. C. M. (2004). A Manifesto on Psychology as Idiographic Science: Bringing the Person Back Into Scientific Psychology, This Time Forever. Measurement: Interdisciplinary Research & Perspective, 2(4):201–218.

Noteboom, A., Aartsen, A., and Lit, S. (2017). Tussendoelen rekenen-wiskunde voor het primair onderwijs, Uitwerking van rekendoelen voor groep 2 tot en met 8 op weg naar streefniveau 1S.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. Computers & Education, 98:169–179.

Redmond, A. (2018). Primary pupils maths skills dropping alarmingly, report finds.

Reyna, V. F. and Brainerd, C. J. (2007). The importance of mathematics in health and human judgment: Numeracy, risk communication, and medical decision making. Learning and Individual Differences.

Siegler, R. S. and Crowley, K. (1991). The microgenetic method: A direct means for studying cognitive development.

Straatemeier, M. (2017). Math Garden: A new educational and scientific instrument.

van der Ven, S. H., van der Maas, H. L., Straatemeier, M., and Jansen, B. R. (2013). Visuospatial working memory and mathematical ability at different ages throughout primary school. Learning and Individual Differences, 27:182–192.