How to analyze eye-movement patterns? : validation and development of successor representation approach

Academic year: 2021



How to analyze Eye-movement patterns? Validation and development of Successor Representation approach

Šimon Kucharský, Maartje Raijmakers, and Ingmar Visser University of Amsterdam


Author Note

This is an internship report of Šimon Kucharský under the supervision of Ingmar Visser and Maartje Raijmakers.


Abstract

Analyzing eye-movement patterns has been one of the grand goals of eye-tracking research. However, numerous problems arise from the immense complexity of visual behavior, making it difficult to make sense of individual eye-movement recordings. A relatively recent method inspired by a reinforcement learning technique, the so-called Successor Representation Scanpath Analysis, was developed to relate performance on Raven's Progressive Matrices to cognitive strategies (manifesting through eye-movement patterns). We seek to use this technique on a task with a different structure for which there is a theoretical expectation of different strategies: the Deductive Mastermind game. After mixed initial results, we conducted a simulation study to analyze the performance of the method and found several problems relating to its stability and its tendency to over-fit our data. In addition, we discuss possible conceptual problems with the current method. To address these, we slightly changed the method and showed through simulations that the new approach yields better stability and clearer results. The method was then applied to the real data to show how it performs. Some initial findings related to the Deductive Mastermind game are discussed. We conclude with several recommendations for future development to make the successor representation approach even better.

Keywords: eye-movement patterns, cognitive strategies, Raven’s Progressive Matrices, Mastermind Game, successor representation


How to analyze Eye-movement patterns? Validation and development of Successor Representation approach

Introduction

Eye-tracking (ET) can provide us with rich data about how people process information and solve problems at hand. One of the most valuable aspects of ET is that we can examine eye-movement patterns (the ordered sequence of fixations on different areas of the stimulus), which offers an excellent opportunity to gain insight into information processing during problem solving without much interference.

However, the analysis of eye-movement patterns is not straightforward: as the number of fixations and areas of interest increases, the space of possible patterns grows exponentially. Even though a number of techniques for eye-movement analysis exist (see Boots, 2016), they are usually hard to implement. In this respect, researchers sometimes settle for simpler analyses as a proxy for eye-movement patterns (such as total fixation duration on specific areas, the number of transitions between different areas, or comparing times to first fixation on different areas; see for example Curie et al., 2016; Loesche, Wiley, & Hasselhorn, 2015; Vakil & Lifshitz-Zehavi, 2012). Although such an approach might still answer some research questions, it necessarily discards invaluable information from the data (the order of fixations), which might make some questions unanswerable.

An example is a study by Vigneau, Caissie, and Bors (2006). They tried to identify multiple strategies people use to solve Raven's Progressive Matrices (RPM), termed constructive matching and response elimination. The former is a systematic strategy of evaluating the matrices to deduce the only correct solution, which is then found in the response area. In contrast, response elimination is a strategy of trying out different responses and evaluating whether they are suitable or not. This strategy manifests as toggling between the matrices and the response area. Even though people who went back and forth between the matrices and the responses more frequently generally had lower scores on the task, this still does not provide evidence for the true nature of those strategies. For example, without examining the sequences over time for the participants with a lower toggling frequency, we cannot conclusively show that their viewing pattern is really systematic.

An interesting development toward solving this problem came with the idea of the successor representation (SR). Hayes, Petrov, and Sederberg (2011, 2015; henceforth HPS) developed a method building on the idea of the SR and used it on data from the RPM to show that it can retrieve representations of the two strategies described above. Recently, the method was also applied to free-viewing tasks (Hayes & Henderson, 2017). The next section discusses in more depth the idea of the Successor Representation and the approach HPS developed.

Successor Representation

The Successor Representation (Dayan, 1993) is used in the Reinforcement Learning paradigm (Sutton & Barto, 1998). It is closely related to temporal difference learning, where we build a representation of the environment in terms of expected temporally discounted rewards. The basic updating rule, known as the TD(0) algorithm, is:

V(S_t) ← V(S_t) + α[R_{t+1} + γ V(S_{t+1}) − V(S_t)]

This means that when the agent makes a transition from state S_t to S_{t+1} and receives a reward R_{t+1}, we update the value of state S_t by the reward scaled by the learning parameter α and by the expected value of the successive state S_{t+1} scaled by the temporal discount parameter γ. In that way, by giving the learning agent enough sampling experience, we achieve a representation of the states in the environment such that every state has a reward value that contains not only the information about the rewards the agent received directly after transitioning to another state, but also the rewards it received after transitioning from the successive states. As described by White (1995, p. 51), instead of building a representation of the expected state values, we can build an N × N occupancy matrix (where N is the number of states), given the transition matrix P:

X_{ij} = [(I − γP)^{−1}]_{ij}

which gives the expected discounted number of future visits to all states given the current state i. This occupancy matrix will be referred to as the Successor Representation matrix.
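As a concrete illustration, the occupancy matrix can be computed in closed form. The following NumPy sketch (the three-state chain and its transition probabilities are invented for illustration) builds X = (I − γP)^−1 and checks that, for a column-stochastic P, each column sums to 1/(1 − γ):

```python
import numpy as np

def occupancy_matrix(P, gamma):
    """Closed-form occupancy (SR) matrix X = (I - gamma * P)^(-1)."""
    return np.linalg.inv(np.eye(P.shape[0]) - gamma * P)

# Made-up column-stochastic transition matrix (each column sums to 1)
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.3],
              [0.4, 0.4, 0.4]])
gamma = 0.6
X = occupancy_matrix(P, gamma)

# Column i holds the expected discounted number of future visits to
# each state, given that the agent currently occupies state i.
print(X.sum(axis=0))  # each column sums to 1 / (1 - gamma) = 2.5
```

The closed form assumes a fully known, stationary transition matrix; the scanpath variant discussed next instead estimates the matrix incrementally from a single observed sequence.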

Successor Representation Scanpath Analysis

Hayes et al. (2011) came up with the exciting idea of using the successor representation to describe the eye-movement sequence over the fixated areas of interest (AOIs). They slightly modified the temporal difference algorithm to create a successor representation of individual eye-movement patterns. The updating formula they used is:

M_i ← M_i + α(I_j + γM_j − M_i)

This means that when we observe a transition from state i to state j, we update the sender column M_i by the first-order transition (the term I_j, the j-th column of the identity matrix) and by the current representation of the successive state (M_j) scaled by γ. The only difference between the successor representation used by HPS and the occupancy matrix defined by Dayan (1993) and White (1995) is that the resulting SR matrix is based on the power series of the transition matrix (Hayes et al., 2011):

M_{ij} = [P(I − γP)^{−1}]_{ij}

so when the γ parameter is set to 0, the successor representation tracks the transition matrix instead of the identity matrix, provided it converges.
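The updating rule can be sketched in a few lines of Python. In this illustrative version (the AOI sequence and parameter values are invented), each observed transition i → j updates the sender column M[:, i] with the indicator column I[:, j] and the discounted receiver column M[:, j]:

```python
import numpy as np

def srsa_matrix(seq, n_states, alpha, gamma):
    """Incremental SR matrix for one fixation sequence (sketch).

    seq holds 0-based AOI indices; every consecutive pair (i, j) is one
    transition, applied as M_i <- M_i + alpha * (I_j + gamma*M_j - M_i).
    """
    M = np.zeros((n_states, n_states))
    I = np.eye(n_states)
    for i, j in zip(seq, seq[1:]):
        M[:, i] += alpha * (I[:, j] + gamma * M[:, j] - M[:, i])
    return M

# Hypothetical sequence over 6 AOIs (five conjecture rows + response area)
M = srsa_matrix([0, 1, 2, 5, 2, 1, 0, 5], n_states=6, alpha=0.3, gamma=0.5)
```

Because each update is a convex-like mixture of non-negative terms, the resulting matrix stays non-negative; with α close to 1 the earliest transitions are almost entirely overwritten, which becomes relevant in the discussion below.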

To implement this idea, HPS collected data from people solving 28 Raven's Progressive Matrices items. Their procedure for extracting the different strategies (which should manifest in different SRs) is the following:

1. Given some parameters α and γ, create an SR matrix for each participant (first create one SR matrix per item, yielding 28 matrices per participant, then average them over the items, giving one average SR matrix per participant).


2. Reshape the individual matrices such that each cell of the matrix is one column and participants are the rows. Standardize the columns. Using eigenvalue decomposition (standard orthogonal PCA), compute the projections of the standardized average SR matrices on the first 20 components and correlate them with the total score on the RPM.

3. Select the two components most correlated with the score and use them in a linear model predicting the total score.

4. Using the Nelder-Mead algorithm, optimize the parameters α and γ so that the R² of the linear model is maximized.
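Steps 2 and 3 can be sketched as follows. The data here are random stand-ins (24 participants, 36 flattened SR cells, and a made-up score vector), intended only to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sub, n_cells = 24, 36                  # participants x flattened 6x6 SR cells
sr = rng.normal(size=(n_sub, n_cells))   # stand-in for the average SR matrices
score = rng.normal(size=n_sub)           # stand-in for the total scores

# Step 2: standardize the cells and project on the principal components
Z = (sr - sr.mean(0)) / sr.std(0)
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigval)[::-1]         # eigh returns ascending eigenvalues
proj = Z @ eigvec[:, order[:20]]         # projections on the first 20 components

# Step 3: pick the two components most correlated with the score,
# then use them in a linear model predicting the score
r = np.array([abs(np.corrcoef(proj[:, k], score)[0, 1]) for k in range(20)])
best = np.argsort(r)[::-1][:2]
X = np.column_stack([np.ones(n_sub), proj[:, best]])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
r2 = 1 - ((score - X @ beta) ** 2).sum() / ((score - score.mean()) ** 2).sum()
```

Step 4 then wraps this whole computation in an optimizer over (α, γ), since the SR matrices in step 1 depend on those parameters.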

Using this method, they were able to a) extract two components of the SR matrices that were in line with what we would expect under constructive matching and response elimination, and b) predict with high accuracy the score a person obtained given their average SR (R² after cross-validation was 0.41).

Because of these promising results, we decided to use the method on a different task for which there is also a theoretical expectation of different solving strategies: the so-called Mastermind game (Gierasimczuk, Van der Maas, & Raijmakers, 2013).

Mastermind game

The Mastermind game is originally a game for two players: the code-maker, who chooses a sequence of colored pegs on a game board, and the code-breaker, whose task is to guess that sequence. To arrive at the right sequence, the code-breaker can place colored pegs on the board (a conjecture), and the code-maker provides feedback about whether any peg has the right color in the right place, or the right color in the wrong place. By repeating this step, the code-breaker has to converge on the right solution. The task is based on experimentation and inference from experience, making it a task of logical and scientific reasoning (Strom & Barolo, 2011).

Gierasimczuk et al. (2013) introduced a simplified version, the Deductive Mastermind (DMM), in which the code-breaker does not actively place the colors on the board but is instead presented with multiple lines of conjectures and their corresponding feedback. This version reduces the task from an inferential game to a logical-reasoning task.

One particular version of the DMM was implemented as part of a logical-reasoning training system for primary schools called Math Garden (Rekentuin.nl or MathsGarden.com). This version includes 331 items with different lengths of the conjectures (from 1 to 5) and numbers of colors (from 2 to 5). Compared to the classic version of Mastermind, the feedback can take three values: green feedback (g) indicates that a color is in the right place, orange (o) that a color is in the solution but currently in the wrong position, and red (r) that a color is not in the solution.

[insert Figure 1 here]
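One possible reading of this feedback rule is sketched below. The color names are invented, and the handling of repeated colors is simplified, so treat this as an illustration rather than the exact Math Garden rule:

```python
def dmm_feedback(conjecture, solution):
    """Per-peg feedback: 'g' = right color in the right place; 'o' = color
    occurs in the solution but at another position; 'r' = color absent.
    Simplified sketch that ignores subtleties with duplicate colors."""
    pegs = []
    for pos, color in enumerate(conjecture):
        if solution[pos] == color:
            pegs.append("g")
        elif color in solution:
            pegs.append("o")
        else:
            pegs.append("r")
    return "".join(pegs)

# Two swapped colors yield the oo feedback: such a row can be solved
# in a single step by switching the two colors.
print(dmm_feedback(["red", "blue"], ["blue", "red"]))  # 'oo'
```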

Gierasimczuk et al. (2013) analyzed data collected with the children's version of the Math Garden system (counting over 37,000 unique players solving more than 4.8 million items) and discovered that the player ratings and the item difficulties have trimodal and bimodal distributions, respectively. The authors then formulated a logical analysis (the Analytical Tableau Method) of the items to show that different items require a varying number of steps and branches to arrive at the right solution. When the number of branches was used to predict item difficulty, it explained an additional 41% of the variance compared to the basic model including features that are obvious predictors of item difficulty (the number of colors in the item, the number of conjectures, and whether or not all the colors are present in the item). These results suggest that the number of steps needed to arrive at the only right solution plays a role in how easy an item is for people to solve. To answer the question of what causes the trimodal mixture of the player ratings, we need to go deeper into the logical analysis. An important feature of the DMM is that one can solve the items in multiple ways, depending on the order in which the feedback is processed. But some ways are more efficient than others: processing less informative feedback first creates a need for branching and for storing intermediate results and possible solutions in working memory, and consequently leads to more errors than focusing on the most informative feedback first, which requires fewer steps, fewer evaluations, and a lower memory load. One possible explanation for the variability of the player ratings might therefore be the strategy players choose for solving the items, as the order of the processed feedback influences the number of steps and branches needed to arrive at the right solution, which we already know plays a role in item difficulty (meaning the items can have different levels of difficulty depending on the strategy the solver uses).

[insert Figure 2]

To test the idea that people use different strategies and that this relates to differences in performance, Truțescu (2016) conducted an eye-tracking study. In total, 24 university students solved 32 DMM items in two blocks. The two blocks consisted of logically identical items but with different colors. Truțescu (2016) then discovered that not only do people get better over blocks, but the relative proportion of time spent on the more informative feedback increases in the second block, and the time spent on the informative feedback correlates with performance. This might be an indication that people get better at the task because they adopt more efficient strategies. However, other explanations remain viable: for example, this pattern could emerge not because the participants acquire more efficient strategies per se, but because they merely learn about the informativeness of various combinations of feedback. The time spent on the most informative feedback might then also increase within one basic strategy (evaluating the item row by row) simply because the participants start to pay more attention to the informative feedback when they arrive at it. This is why we need to analyze the data with a method that can directly test whether there are different strategies, for which the Successor Representation Scanpath Analysis (SRSA) is a very interesting contender.

Analysis of items with oo feedback at the third row

Methods

For this purpose, we reanalyzed the data collected by Truțescu (2016). First, we were interested in whether the method is able to retrieve the predicted strategies. The difference between the RPM and the DMM is that while for the former the two strategies (constructive matching and response elimination) are the same regardless of the item, the different strategies in the DMM depend on the type of conjectures and on the feedback and its order. Because of this, we grouped the items based on their feedback and selected the items with the oo (orange-orange) feedback at the third row (four items in total, two in each block) for the initial analysis. This choice was based on the notion that the easiest way to solve these items requires just one step: switching the colors present on the third row with the oo feedback. This strategy should manifest as fixating the third row and then proceeding to the response area. The top-to-bottom strategy would be similar across all items and should manifest as fixating one row after another. On these items, we had data from 24 participants at our disposal for secondary analysis of the study conducted by Truțescu (2016). The fixation coordinates were recoded to the AOI (area of interest) level. We defined 6 areas, 5 of them corresponding to the five lines of conjectures and the 6th to the response area. Each fixation was assigned to one of the defined areas based on its coordinates.
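The recoding step can be sketched as a simple point-in-rectangle lookup. The AOI bounding boxes below are hypothetical, as the actual screen layout is not part of this report:

```python
def assign_aoi(x, y, aois):
    """Return the name of the first AOI whose bounding box contains the
    fixation at (x, y), or None if it falls outside all of them."""
    for name, x0, y0, x1, y1 in aois:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

# Hypothetical layout: five stacked conjecture rows plus a response area.
aois = [("row%d" % (i + 1), 100, 50 + 80 * i, 500, 130 + 80 * i)
        for i in range(5)] + [("response", 600, 50, 900, 450)]

print(assign_aoi(250, 170, aois))  # 'row2'
print(assign_aoi(700, 200, aois))  # 'response'
```

In practice the AOI rectangles would be read from the stimulus definition, and fixations landing between areas need an explicit policy (here they are simply dropped as None).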

The eye-movement sequence lengths (number of saccades) ranged from 10 to 256, with a median of 30 and a mean of about 42. Overall, there was a high proportion of repeated fixations within the same AOIs (a median of about 39% of the total number of saccades in a sequence). Because of this, we decided to analyze the data including the repeated fixations, as discarding them would make some sequences very short (ranging from 3 to 85 transitions, with a median of 13 and a mean of about 16), leading to very sparse SR matrices. This means that, contrary to the previous work by HPS (who removed the repeated fixations from the sequences), our representations will also contain first-order transitions on the matrix diagonal, which could help if the strategies differ in dwelling on certain AOIs, but could harm if the strategies are defined mostly by transition patterns.

Following the procedure outlined in previous work on SRSA (Hayes & Henderson, 2017; Hayes et al., 2011, 2015), we used a two-loop procedure to analyze the eye-movement patterns. In the inner loop, we constructed the SR matrix for each sequence, leading to 96 (4 items × 24 participants) 6 × 6 SR matrices. We then averaged them across the items, resulting in one average SR matrix per participant. The matrices were reshaped into vectors and merged into one matrix of 24 participants by 36 SR matrix cells. Using this, we computed the correlation matrix of the SR matrix cells and applied eigenvector decomposition (principal component analysis). We then computed the projections of the participants on all principal components (eigenvectors) and correlated them with the total score (min = 14, max = 32, mean = 27.04, sd = 7.35). The two most correlated components were then used in a linear model predicting the scores.

The outer loop searched for the optimal parameters α and γ (which are used in the SR matrix construction) such that the resulting proportion of explained variance (R²) of the linear model is the highest. However, contrary to the previously reported method (optimizing the parameters using the Nelder-Mead algorithm), we computed the R² for all parameter pairs ranging from 0 to 1 in steps of 0.05. The pair of parameters leading to the highest R² was then used as the starting values for the L-BFGS-B optimization algorithm (Byrd, Lu, Nocedal, & Zhu, 1995) instead of the Nelder-Mead algorithm. This allowed us to set constraints on the parameter space, ranging from 0 to 1 for both parameters. The approximation of the joint parameter space increased the likelihood of converging at the global optimum.
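The two-stage search can be sketched as follows. The objective here is a made-up smooth stand-in for the negated R² (the real objective rebuilds the SR matrices and refits the PCA regression at every evaluation):

```python
import numpy as np
from scipy.optimize import minimize

def neg_r2(params):
    """Toy stand-in for the negated R^2 surface, peaking at (0.4, 0.7)."""
    a, g = params
    return -np.exp(-((a - 0.4) ** 2 + (g - 0.7) ** 2))

# Stage 1: coarse grid over [0, 1] x [0, 1] in steps of 0.05
grid = np.arange(0.0, 1.0001, 0.05)
start = min(((a, g) for a in grid for g in grid), key=neg_r2)

# Stage 2: refine from the best grid point with bound-constrained L-BFGS-B
res = minimize(neg_r2, start, method="L-BFGS-B", bounds=[(0, 1), (0, 1)])
print(res.x)  # close to (0.4, 0.7)
```

The grid stage guards against the multimodality of the real objective, while the bounded refinement keeps both parameters within their conceptually admissible range.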

To see how the optimal fit generalizes, we performed leave-one-out cross-validation (LOOCV). As previously described (Hayes & Henderson, 2017; Hayes et al., 2011, 2015), cross-validation is necessary because the method might over-fit, capturing idiosyncratic patterns within the current sample. The higher the cross-validation fit to the data, the more generalizable the results are beyond the current sample. The cross-validation was performed as follows. For each individual participant, we made a prediction of the DMM score given the data from the other participants. We split the data in two parts, the training set and the testing set. The training set contains all participants apart from the one for which we are currently making the prediction. The procedure described above was applied to the training set to find the optimal parameter values, principal components, and regression coefficients. Then, we constructed the average SR matrix for the one participant (the testing set) using the optimal α and γ from the training set. The projections on the two extracted principal components were computed for the SR matrix of the one participant and scaled by the regression coefficients. The sum of those numbers is the prediction for the left-out participant. This was repeated leaving out one participant after another, making individual predictions for all of them.
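The leave-one-out scheme can be sketched generically. Here, ordinary least squares on random stand-in data replaces the full inner procedure (which would also re-optimize α and γ and re-extract the components within each training set):

```python
import numpy as np

def loocv_predictions(X, y, fit, predict):
    """Generic LOO loop: refit on all-but-one, predict the held-out case."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        model = fit(X[train], y[train])
        preds[i] = predict(model, X[i:i + 1])
    return preds

# Toy demonstration: least squares stands in for the inner SRSA fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=24)
fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict = lambda beta, Xnew: (Xnew @ beta).item()
preds = loocv_predictions(X, y, fit, predict)
r2_cv = 1 - ((y - preds) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

Crucially, everything that is tuned to the data, including the parameter search itself, must sit inside the `fit` step; otherwise information leaks from the held-out case into the training folds.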

Results and discussion

Using the procedure described in the methods section¹, we reached a high fit of the model (R² = 0.73, optimal α = 0.93, γ = 0.66). However, the interpretability of the two extracted components was low, as the first component (positively related to performance) could probably capture variability within both strategies (assuming there are any): it had positive values on the cells representing switching between the third row and the response area, and negative values on almost all other cells (see Figure 3). The second component captured variance that could not be easily explained; one possible explanation is that it only captured random noise. The predictive matrix (composed of the sum of the two components scaled by the regression coefficients) suggests that transitioning from the third line to the response is positively related to performance, but so are transitions from row 1 to 2 and from row 2 to 3 (suggesting that the top-to-bottom strategy could also be efficient). Cross-validation proved difficult, as the R² dropped to 0.01. This was because of one outlying participant for whom the cross-validation predicted an impossible value. For all other participants the cross-validated predictions were not so poor, as the resulting R² without that one participant dropped only to 0.25.

[insert Figure 3]

A problematic part of the optimal solution is that α was very high, which means the SR forgets quickly (the first transitions have almost no relevance compared to the transitions towards the end of the sequence). The poor cross-validation and the obstacles in interpreting the extracted components raise the question of whether the data do not contain multiple strategies, or whether the method is not reliable for all types of data.

A further limitation of our initial analysis is that the analyses performed by Truțescu (2016) suggested that people could adopt more efficient strategies over blocks. Because of this, our ultimate goal would be not to average over all items with similar structure, but to assess the items independently. That might mean the representations would be very unstable and sparse, so fitting the model might be difficult. To gain insight into whether this is the case, we conducted a simulation study of the method's ability to perform well on our DMM data.

¹All analyses in R can be found at https://osf.io/cgv29. The code for SRSA is accessible as an R package.

[insert Figure 4]

Simulation study

Methods

In order to investigate the stability of the method, we conducted a simulation study. We analyzed all scan-paths on the four items selected in the analysis above and wrote simulation functions that mimic the two different strategies (top-to-bottom and systematic). The simulated patterns were matched to the real data with respect to several criteria (see the complete description in the supplemental materials, https://osf.io/25nkz/ and https://osf.io/9fc32/). In this way, we were able to simulate an arbitrary number of participants using one or the other strategy, with some variability in the patterns within each strategy. Examples of the simulated patterns can be seen in Figure 5.

[insert Figure 5]

We were interested in the method's performance with respect to three varying features of the simulated data:

1. Sample size (n). We varied the total number of participants in the simulated studies, using 20, 60, and 100.

2. Proportion of strategies (p). We varied the proportion of participants using each strategy, simulating 25%, 50%, and 75% of people using the top-to-bottom strategy.

3. Difference in performance (d). We varied the difference in performance between the two strategies; in terms of Cohen's d, the difference was 0, 0.5, or 1.
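The design can be enumerated directly. The score model below (standard-normal scores shifted by d for the top-to-bottom group) is our own simplifying assumption for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# 3 x 3 x 3 = 27 conditions; with 200 replications each, 5,400 studies
conditions = list(itertools.product([20, 60, 100],       # sample size n
                                    [0.25, 0.50, 0.75],  # proportion p
                                    [0.0, 0.5, 1.0]))    # Cohen's d

def simulate_scores(n, p, d):
    """Group labels and scores; the top-to-bottom group is shifted by d
    standard deviations (a simplifying assumption for illustration)."""
    strategy = rng.random(n) < p  # True = top-to-bottom strategy
    scores = rng.normal(loc=d * strategy, scale=1.0)
    return strategy, scores

strategy, scores = simulate_scores(n=100, p=0.5, d=1.0)
```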


We simulated 200 datasets for each combination of parameters (totaling 5,400 simulated studies) and applied the SRSA method to each dataset². The main questions regarding the method's behavior were whether the optimal α and γ parameters, the extracted components, and the predictive matrix are stable across studies, and what the method's fit to the data is (how severe the over-fitting is). We also investigated whether we could classify the participants into the right group (i.e., strategy) based on their projections on the two PCA components given their SR matrices.

Results

Alpha and Gamma. One important aspect of the optimization of the α and γ parameters is whether they converge towards similar values across the simulations. Especially with our simulation strategy, where we sample individual fixation sequences from two relatively distinct strategies, one would expect the optimal values to cluster together rather than be scattered across the whole parameter space. If they are scattered, that could mean the optimal values do not relate to the description of the strategies per se, but rather to specific variation within the sampled datasets.

However, our simulation showed exactly this: the parameter values vary wildly between replications. Figure 6 plots the two parameters against each other; there is no clear pattern suggesting any dependency between the optimal values and the three varying features of the data (suggesting the parameters do not stabilize with, for example, increasing sample size). The fact that the parameter values often got stuck at the boundaries of 0 and 1 also suggests problems with the model, which is discussed later.

[insert Figure 6]

Stability of components. With regard to the stability of the components, we encountered a complication in the form of label switching (the components came in a different order in different simulations). To be able to explore the stability, we needed to classify the two components into the right categories (describing the top-to-bottom strategy or the systematic strategy). To do this, for each simulation variation (out of the total of 27), we reshaped the eigenvectors into a 400 × 36 matrix (200 simulations × two components as rows, 36 cells of the SR matrix as columns) and computed k-means clustering estimating 1 to 20 clusters. If the components capture systematic features of the fixation patterns of the two strategies, we should see a large drop in the unexplained variance in the scree plot when we estimate two clusters. As Figure 7 shows, this drop was present only in some simulation variations (mainly those with a large difference in score between the strategies and with larger sample sizes). For small sample sizes (n = 20), the scree plot showed no systematic variation, as the unexplained variance drops slowly and gradually.

²To make sure our implementation of the SRSA algorithm is correct, we simulated one dataset and sent it to one of the authors of SRSA, Taylor Hayes, who sent back his results. We got the same optimal parameter values, the same R², and similar components.

[insert Figure 7]

Under the scenarios where there is a clear bump in the scree plot, the method is able to retrieve the systematic strategy quite convincingly (the average correlation between the replicated components was about 0.81 for the most stable scenario). Figure 8 shows the stability and the means of the two components under the most favorable scenario (n = 100, d = 1, p = 0.5). This indicates that the method has some potential for retrieving the right parameters, although it is probably too unstable for the data we possess. Even with far more data than we have at our disposal, it is unable to retrieve a strategy that has more variation (it is clear that the second component captures just noise).

[insert Figure 8]

Stability of the predictive matrix. Although the extracted components might not be very stable, a crucial part of the SRSA method is the predictive matrix, which is a linear combination of the component loadings (eigenvectors) scaled by the regression coefficients. Because there is only one predictive matrix per simulation, it is quite straightforward to check how the method performs. If the predictive matrix is unstable, the method might over-fit sample-specific variation and consequently be unable to generalize to the patterns at the level of the population from which the sample is drawn. Inspection of the replicated predictive matrices showed that their stability is very poor. Figure 9 shows the stability for the most stable simulation scenario. It is also interesting that in some replications some of the cells received very high values (more than 1,000,000), which could be caused by the SR matrices being very sparse or by high collinearity between some cells. Either way, this could partially explain why it is hard to cross-validate our results: the predictive matrix might change wildly and assign very high values to some cells, leading to a poor prediction for the case that was left out of the cross-validation training set.

[insert Figure 9]

Over-fitting. We also explored how the method performs in terms of explained variance in the total scores. While it is natural that optimizing the parameters to achieve the highest R² over-fits the dataset (this is why HPS use cross-validation), our main point in evaluating the over-fitting is by what margin the method over-fits in this particular example. This matters because the fit in our real data drops sharply after cross-validation, which might be because 1) the method tends to over-fit the dataset and then drops to the "real" level of explained variance, or 2) the method over-fits only slightly and the cross-validation tends to under-fit because of the instability of the components and the predictive matrix. Our question is therefore whether it is possible to obtain such a high fit (R²) even with no difference in overall performance between the strategies. As can be seen in Figure 10, the method over-fits by a huge margin, especially for small sample sizes. In fact, the distribution of the R² seems to react only to the sample size, not to the actual variability to be explained by the differences between the strategy groups. For the large sample size (n = 100), however, the R² seems to be unbiased, although it is not clear whether this reflects the actual variability in the data or whether the method just happens to fit the variance at this level for this sample size.

[insert Figure 10]

Strategy classification. An implicit assumption of the SRSA is that people using different strategies should project onto the two components differently. We were therefore curious whether one could correctly classify people into one or the other strategy based on their projections on the two PCA components. However, this was infeasible given the instability of the components, which basically resulted in approximately the same distribution of projections for both groups (strategies) on the non-systematic component. When the second component (capturing the systematic strategy) stabilizes (with a large sample size), some separation was visible, as people using the systematic strategy tended to score slightly higher on that component. This suggests that classifying people into groups based on the projections might be possible, but only if there is enough data to stabilize the results of the SRSA.

Discussion

The present simulation showed that our current data might contain too little information for the SRSA to stabilize. The method performed poorly with small samples, even though it improved with larger samples in terms of the stability of one of the components and the fit to the data. There are two main reasons for these results. First, in the simulation we did not average across multiple items, but simulated and analyzed just one item per participant. Second, the eye-movement sequences might be too short for the construction of the SR matrices, which could then be very sparse. Both facts could mean the SR matrices contain just too little information for a meaningful analysis, even though, given our method of simulating the data, we know the two patterns are present in the samples. This suggests the method has limitations for data and analysis purposes similar to ours. Because of these results, we decided to shift our focus to how we can adjust the method so that it describes DMM data better. In the next section, we discuss potential reasons why the method performs poorly on our data and propose adjustments that allow us to use it on less informative data than those from the RPM.

Generalizability of the SRSA

To summarize, we applied the method to real DMM data on items with similar structure. Although we achieved very good predictions of the total score, the results dropped significantly after cross-validation. The interpretability of the components was mixed, and high α values could mean that only the last few transitions were target relevant. This contradicted our expectation: since the two predicted strategies can end with very similar sequences, their most important distinguishing feature is the order in which information is processed first.

We simulated data mimicking the two strategies on the DMM items with oo feedback at the third row. Again, the stability of the results was very low, suggesting the method cannot discriminate between the two strategies efficiently and over-fits by a huge margin. Performance did grow, however, with increasing sample size and with the difference in performance between the two strategies. This could explain why we achieved such a good fit with just 4 items in the real data and why the results did not cross-validate well. These results suggest that the current implementation of the Successor Representation idea is not widely applicable, at least not to data from the Deductive Mastermind.

Conceptual problems

Our DMM analyses and simulations revealed several problems with the method proposed by HPS. First, the value of γ tends towards 1 in the optimization process, which is conceptually impossible: when the SR matrix converges, the sums of its columns equal 1/(1 − γ), which approaches infinity as γ approaches 1. Conceptually, with γ = 1 we stretch the horizon of the successive states to an infinite number of future transitions. The only reason the column sums do not run towards infinity is that the SR matrices are not converged - which would be possible only if we decreased α with each transition (note that convergence is usually slow; it takes more than 100 transitions to obtain a converged matrix). This relates to a further problem: the interpretation of non-converged matrices is not clear. While the converged SR (or occupancy) matrix has a clear interpretation (the expected discounted number of visits to other states), non-converged matrices do not have this property. Similarly problematic are cases where α is set to 1, although this value is conceptually possible. The resulting SR matrices then contain just the last transition from each state (leading to a very sparse matrix), which contradicts the idea of the SR as a method for analyzing whole eye-movement sequences.
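This property of the converged SR can be checked numerically. A minimal sketch (numpy assumed; the 3-state column-stochastic transition matrix is invented for illustration):

```python
import numpy as np

def converged_sr(P, gamma):
    """Converged successor representation M = P (I - gamma P)^(-1)."""
    n = P.shape[0]
    return P @ np.linalg.inv(np.eye(n) - gamma * P)

# A small column-stochastic transition matrix (each column sums to 1).
P = np.array([[0.1, 0.5, 0.3],
              [0.6, 0.2, 0.3],
              [0.3, 0.3, 0.4]])

for gamma in (0.5, 0.9, 0.99):
    M = converged_sr(P, gamma)
    # The column sums of the converged SR equal 1 / (1 - gamma),
    # so they blow up as gamma approaches 1.
    print(gamma, M.sum(axis=0))
```

For γ = 0.5 every column sums to 2; for γ = 0.99 every column already sums to 100, illustrating why a converged SR with γ near 1 is degenerate.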


Second, the method is built entirely around predicting an external target measure. This approach clearly assumes that the strategies relate to performance. We will argue later that this should be treated as a separate empirical question. It is therefore desirable to establish a method that finds distinct strategies independently of the target measure and only then, using this information, tries to (dis)confirm the hypothesis that they relate to it. It is questionable whether the current approach of reducing the space by PCA and relating the components to the target measure is sensible for the DMM data, for the following reasons:

1. The method might capitalize on chance. With infinitely many possible combinations of the learning rate and discounting parameter, there is a wide variety of possible representations of the eye-movement patterns. By then selecting some number of components (those that correlate most strongly with the target measure), we end up with a very flexible approach that could potentially fit anything.

2. Relatedly, the selection of the number of components is somewhat arbitrary. One approach is to select the number of components that together capture some percentage of the variance, but that might lead to a large number of components and hence a very difficult interpretation. Another approach (mentioned by HPS) is to select only components that significantly relate to the target measure. This, however, aggravates the problem of capitalizing on chance, and it should be corrected for the inflation of the Type-I error rate due to multiple comparisons. Then again, a p-value cannot provide information about the absence of a correlation, and thus cannot serve as a criterion for which components to include. The number of components used would depend on the number of participants in the dataset and on the strength of the relationship between strategy and criterion (that is, on statistical power). This could lead to paradoxical scenarios: even when people use some strategy and the PCA captures their variability correctly, the method would not include the corresponding component simply for lack of power to detect the correlation. Deciding on the number of components based on significance could therefore suffer both from false positives (due to the inflated Type-I error rate) and from false negatives (when the effect of the eye-movement pattern on the external criterion is small, or few participants use one of the strategies). A further problem is that the significance computation would be done within the loop maximizing the R²: the number of components used might vary between the different combinations of parameters, which is also difficult to interpret on a conceptual level.

3. The PCA does not ensure that separate components describe separate strategies. Instead, it is possible (as we have seen in some of our results) that one component captures the variability of more than one strategy. Although rotating the components might resolve this issue, the other problems would persist.

4. Standardization distorts the interpretability of the SR matrices. One of the most important properties of the (converged) SR is that its values correspond to the expected number of visits to other states. When we standardize the values across participants, the matrices lose this property. The method can also be fragile to new observations (as we saw in the cross-validation of the real data), which can lead to unrealistic predictions.

5. Most importantly, PCA is not a technique for assigning people to distinct groups. Our view is that the cognitive strategies used in solving the RPM or DMM are qualitatively different and do not range on a continuous scale. Rather than projecting participants onto distinct components, it would be better to classify them into different groups with a technique suited for this purpose.

The points made about the performance of the SRSA might relate to the specific scenario we evaluated, to how we simulated the data, and to the structure of the task itself, so our findings are rather specific to this particular task (though our conceptual critique would persist even if the method showed better results). It is possible that the SRSA gains stability when averaging across multiple items; the original application achieved this by averaging 28 Raven's Matrices for each participant. However, a central finding in the work of Truțescu (2016) on the DMM is that people learn more efficient strategies with experience. The items also differ in the type of feedback, for which the optimal strategies manifest in different patterns. Averaging across all items is therefore neither an option nor desirable: even if the systematic strategy were the same for different items, we would not be able to capture the learning of strategies over the course of the task. We nevertheless firmly believe that different strategies exist, at least for some items in the DMM (as described in the supplemental material, the different patterns of the two strategies are clearly visible upon visual inspection of the raw fixation data on those 4 items). The question is whether we can adjust the method so that it behaves better with more limited data and also meets our assumptions. In the next section, we describe these adjustments and show how the new implementation works on simulated data and on the original DMM example.

Changing the method: using the k-means clustering

Our primary goal in adjusting the method is to simplify it and to produce more stable results for less informative data. Moreover, the changes should overcome the problems of the original approach outlined above. First, the new method should be able to retrieve different strategies regardless of whether those strategies relate to performance. Second, the strategies should be represented as distinct groups whose members have similar eye-movement patterns, satisfying our assumption that the systematic and top-to-bottom strategies are categorical in nature.

To achieve this, we adapted the method by replacing the PCA with k-means clustering (Hartigan & Hartigan, 1975). K-means clustering assigns the set of observations to k clusters, each summarized by its mean (the cluster center), by computing the Euclidean distance of each observation from the centers. This approach has several advantages over PCA. As a classification technique, it assumes distinct groups of observations that are more similar to each other than to observations from other groups. This is in line with the underlying assumption that the different strategies are categorical in nature, while still allowing for individual variation within the clusters. The method also provides a natural way to select the number of extracted strategies: in a scree plot, the within-cluster sum of squares divided by the total variance declines rapidly for as long as identifiable distinct profiles remain, after which the proportion of unexplained variance stabilizes and declines only slowly. The rule of thumb is therefore to select the number of clusters up to that point, because successive cluster means are usually very similar to those already extracted.
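The elbow rule just described can be sketched as follows (scikit-learn assumed; the data are random placeholders standing in for the n × 36 reshaped SR matrices, with two artificial "strategy" groups built in):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder data: two artificial "strategy" groups in 36 dimensions.
X = np.vstack([rng.normal(0.0, 0.2, size=(20, 36)),
               rng.normal(1.0, 0.2, size=(20, 36))])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()
ratios = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Within-cluster sum of squares divided by the total variance;
    # the "elbow" in this curve suggests the number of strategies.
    ratios.append(km.inertia_ / total_ss)

# For two well-separated groups, the drop from k=1 to k=2 dominates
# and the curve flattens afterwards.
print([round(r, 3) for r in ratios])
```

Plotting `ratios` against k gives the scree plot used throughout this section.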

Another simplification we attempted was to remove the α parameter. In the previous version of the method, non-converged matrices are used to represent the eye-movement sequences. This has two implications. As described earlier, a non-converged SR matrix does not have the interpretation of the converged matrix. Rather, the information it contains depends on the α parameter, which in the original reinforcement-learning literature has no particular meaning for the interpretation of the matrix: it is used for learning the SR as new experience comes in, and its value controls how quickly the matrix converges. To obtain a converged matrix, one must either have a sequence long enough to converge to the SR matrix, or use a different method - for example, presenting the sequence to the updating rule repeatedly until the matrix no longer changes (Sutton, 1988). Note also that with this type of learning, the α parameter should not be a fixed value, but should decrease with the length of the sequence (Dayan & Sejnowski, 1994)³. However, if we used this approach, the converged matrix could instead be computed analytically, as described earlier:

M_ij = [P(I − γP)⁻¹]_ij

This shows that once the SR converges, it no longer depends on the value of α. A further implication is that the resulting representation is just a power series of the first-order transition matrix P (in other words, we construct the representation as if there were no higher-order dependencies). This might be a problem, as the information the representation contains is then clearly the same as in the transition matrix. To investigate whether increasing the γ parameter helps in spite of this, we conducted a simulation study.

³In order to converge, the rule for decreasing α has to satisfy two conditions: ∑ₜ αₜ = ∞ and ∑ₜ αₜ² < ∞. Such a series can be obtained, for example, with αₜ = 1/t or αₜ = 1/t^(2/3).


Simulation study

Methods

To investigate the performance of the modified version of the SRSA, we made slight changes to the algorithm and applied it to the previously simulated data reported above:

1. We constructed the SR matrices analytically, without sequential updating. This ensures the matrices take the form of properly converged SR matrices. To investigate the influence of γ, we used values from 0 to 0.9.

2. Instead of PCA, we employed the k-means algorithm, estimating 1 to 10 clusters. This allowed us to see whether the scree plot breaks sharply after the true number of strategies (two).

3. Using the solution with 2 clusters, we investigated whether the cluster assignment is accurate, that is, whether participants are assigned to the right clusters.

4. By inspecting the cluster centers, we checked whether they correctly represent the true strategies. As in the previous simulation study, we also investigated how stable these representations are across different simulations.

5. With the cluster assignment, we computed the difference between the clusters on the external target (performance) and compared it to the true value. Ideally, we should obtain an unbiased estimate of the true difference, with decreasing variance as the sample size increases.

We proceeded as follows. Given some value of γ, we created the SR matrix for each participant using the formula M = P(I − γP)⁻¹, where P is the transition matrix. As in the original method, the individual 6 × 6 matrices were reshaped into one n × 36 matrix (where n is the number of simulated participants). This matrix was used as the input for the k-means algorithm. We saved the cluster means, the proportion of the total within-cluster sum of squares to the total variance, and the vector of cluster assignments. This procedure was applied to all 5,400 simulated studies, using the 10 different values of γ and estimating 1 to 10 clusters.

To inspect the stability of the cluster means when using the correct number of clusters (two), we had to solve label switching (the 2-cluster solution sometimes yielded the systematic cluster in the first place and sometimes in the second). Because we knew the true strategy membership of each participant, we solved this problem by assuming that the method has better than 50% classification accuracy. We built a 2 × 2 contingency table from the vector of true memberships and the vector of computed cluster memberships. If the sum on the diagonal was greater than the sum off the diagonal, we took the first cluster to represent the top-to-bottom strategy and the second the systematic strategy; otherwise, we assumed the cluster labels were switched.
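The contingency-table fix for label switching can be sketched as follows (numpy assumed; the example labels are illustrative):

```python
import numpy as np

def align_labels(true_labels, cluster_labels):
    """Relabel a 2-cluster assignment so it agrees with the true
    membership whenever accuracy would otherwise fall below 50%."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    # 2x2 contingency table of true strategy vs. assigned cluster.
    table = np.zeros((2, 2), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        table[t, c] += 1
    # If the off-diagonal dominates, the cluster labels are switched.
    if np.trace(table) < table[0, 1] + table[1, 0]:
        cluster_labels = 1 - cluster_labels
    return cluster_labels

true = [0, 0, 0, 1, 1, 1]
assigned = [1, 1, 0, 0, 0, 0]          # labels came out switched
fixed = align_labels(true, assigned)
print(fixed.tolist())                  # -> [0, 0, 1, 1, 1, 1]
```

Note that this trick relies on knowing the true membership, so it is only available in simulation studies, not on real data.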

Results

Extracting the right number of strategies. A rule of thumb for selecting the number of clusters is to inspect a scree plot and find the point at which the variance left unexplained by the clusters stops decreasing rapidly. Our simulation shows that, using this rule, we would have selected the right number of clusters (two) most of the time. An exact proportion cannot be given, as cluster selection rests on subjective judgement and evaluating 5,400 studies individually was infeasible; naturally, the evaluation is also biased by our prior expectation of the two strategies we know are in the simulated data. Several remarks are in order, however. The separation became clearer with increasing sample size, which is not surprising: with more people in the sample, the cluster centers and the total variation become more stable. Performance was slightly worse when the distribution of strategies in the sample was uneven: for some simulations the scree plot suggested extracting three strategies, and for others it declined continuously, which would leave a researcher believing either in more strategies than are actually present, or in no patterns in the data at all. A positive finding, on the other hand, is that the scree plots show the same pattern for most individual simulations across all values of γ. Overall, then, the performance of the method is mixed, and the scree plots do not always lead to the correct solution.

Stability of strategy representation. The stability and interpretability of the strategy representation by the centers of the two clusters outperforms the original method. In every replication of every simulation scenario, the two clusters have very similar values (the average pairwise correlation between replications in any scenario is about 0.8-0.95 for both clusters). The stability increases slightly with sample size, but it is quite high even for small samples (the lowest mean correlation was 0.71 for the top-to-bottom strategy when only 5 people out of 20 used it, and 0.80 for the systematic strategy when only 5 people out of 20 used it). Moreover, the two clusters correctly retrieve the true nature of the simulated data: in each replication, one cluster clearly shows a diagonal pattern (top-to-bottom strategy) and the other has high values in the cells representing transitions from the third line to the response (systematic strategy), see Figure 11. This indicates that with k-means clustering we can represent the patterns present in the data correctly and with rather high stability.

[insert Figure 11]

Classification accuracy. For all simulations, we inspected the accuracy of the participant assignment into groups, given that the correct number of clusters was selected.

The total accuracy of assigning participants to their true strategies was mostly about 0.85-0.95. There was no significant gain from using different values of γ, although there appeared to be a very slight increase in accuracy for values around 0.6. This suggests that the converged SR matrices do not help significantly to distinguish between the strategies compared to the transition matrix alone.

[insert table 1]

A closer look at the classification accuracy (separately for different sample sizes and proportions of people using one strategy or the other) shows that accuracy increases with sample size and is highest when the proportion of people using the two strategies is 50/50. The accuracy of correctly identifying people using the systematic strategy is overall higher than for the top-to-bottom strategy. This is in line with the top-to-bottom strategy being more diffuse and more variable, which makes individual cases more difficult to classify correctly.

[insert Figure 12]

Accuracy of the difference on the external criterion. The previous parts of the simulation analysis concerned 1) whether we can identify the correct number of strategies, 2) how stable the strategy representation is, and 3) the accuracy of classifying participants into the right groups. In terms of the research goals, this deals with strategy discovery. The second part of the usual research questions is: do the strategies relate to some criterion? This could be overall task performance (strategy x is more efficient than y), experience (people adopt strategy x as the task progresses), or maturity (older children prefer the more complex strategy). Importantly, whether the strategies relate to such a criterion is itself an empirical question, separate from strategy discovery and assignment: one can easily argue that some tasks afford different strategies that have no particular relation to any other criterion. In this part, we investigate whether we can correctly estimate the difference between the strategies on the criterion, given that we use the correct number of clusters and assign the participants to them. With the classification into the two strategies, we computed the difference in the simulated performance in terms of Cohen's d. Figure 13 shows the results. Compared with the simulated true difference, the method performs rather well in most cases, estimating the difference very close to the true value. The estimates improve slightly with increasing sample size; however, the method tends to underestimate the true difference when the proportion of strategies in the sample is uneven. The poorest performance occurred when only a few people used the systematic strategy. This may be because, in this scenario, the proportion of correctly classified people in the top-to-bottom strategy can be rather low (with a large number of them classified into the systematic strategy), which attenuates the true difference between the two strategies.
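Cohen's d for two independent groups with a pooled standard deviation can be computed as in the following minimal sketch (the performance scores are invented for illustration):

```python
import math

def cohens_d(x, y):
    """Cohen's d for two independent groups, pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx = sum(x) / nx
    my = sum(y) / ny
    # Unbiased sample variances of the two groups.
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled

# Hypothetical performance scores for the two assigned clusters.
systematic = [28, 30, 27, 29, 31]
top_to_bottom = [25, 27, 24, 26, 28]
print(round(cohens_d(systematic, top_to_bottom), 2))  # -> 1.9
```

In the simulation, this value (computed from the recovered cluster assignment) is compared against the d implied by the true group membership.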


Discussion

The simulations with the new method showed improved performance in strategy retrieval. Specifically, we obtained very similar clusters across all simulations, with higher stability even for smaller samples. Most importantly, we retrieved representations very similar to those describing the true simulated patterns (compare Figure 5 and Figure 11), and this was achieved by analyzing individual sequences without averaging across items. On the other hand, we did not observe a significant impact of the γ parameter on the overall classification accuracy. The classification accuracy itself was high, although it decreased when the strategies in the sample were represented by uneven numbers of participants. This affects the estimation of the differences between clusters on the external criterion, as it can lead to biased results when the numbers of people using each strategy are imbalanced (though in most cases the estimation is unbiased). A problematic part of our findings is that the scree plots are not very stable and not always informative, so the current method would not be very useful for strategy-discovery work without theoretical predictions of which (or at least how many) patterns to expect.

Mastermind data

Methods

Because of the promising results of the simulation study, we also analyzed the real DMM data using the k-means clustering method. Instead of averaging across all 4 items, we averaged only the two items per block. This can tell us whether strategy use changes between the two blocks. However, we cannot analyze transitions between strategies on individual items, because the items were randomized for each participant within each block.

To conduct the analysis, we arbitrarily set γ to 0.5. We constructed the converged matrices for each eye-movement sequence and averaged the two items within each block (so each participant had two SR matrices). The matrices were then reshaped so that the 24 participants formed the rows and the 36 matrix cells the columns. This served as the input for the k-means algorithm, estimating 1 to 10 clusters. The scree plots of both blocks were not very informative: for the first block there was a clear break at the 2-cluster solution, but also at the 6-cluster solution, and for the second block the scree plot decreased continuously. For the sake of brevity, we report the results of the two-cluster solutions for both blocks, as that was our initial expectation.

Results

The two clusters clearly represented the expected strategies in both blocks: one showed high values along the diagonal (top-to-bottom strategy) and the other high values indicating toggling between the third feedback and the response (systematic strategy), see Figure 14.

[insert Figure 14]

In the first block, 12 people were assigned to the top-to-bottom strategy and the other 12 to the systematic strategy. In the second block, 11 people were classified into the top-to-bottom strategy and 13 into the systematic strategy; 3 participants who used the top-to-bottom strategy in the first block used the systematic strategy in the second, and 2 participants who used the systematic strategy in the first block used the top-to-bottom strategy in the second. Although this finding shows an effect of adopting the more efficient strategy in the predicted direction, it is far from distinguishable from random fluctuation and classification error. As a follow-up, we examined whether the strategies led to different performance (testing the hypothesis that the systematic strategy is indeed more efficient). The differences in overall performance were very small and not significant, although in the predicted direction (Block 1: μ_systematic = 27.9, σ_systematic = 3.7; μ_top-to-bottom = 26.2, σ_top-to-bottom = 5.8. Block 2: μ_systematic = 27.7, σ_systematic = 3.3; μ_top-to-bottom = 26.3, σ_top-to-bottom = 6.3).

[insert Figure 15]

Discussion

In this last analysis, we showed that with k-means clustering the method retrieves the expected SR matrices from the real data. The subsequent results, however, are not very informative about the theoretical implications of the strategies in the DMM: neither the transition patterns between strategies nor the performance differences between them on the total score were conclusive, although both were in the predicted direction. This may be because the analysed data are very limited. First, the four items considered are the easiest ones (the average accuracies were about 0.92, 0.73, 1, and 0.88 for the 4 items); they require only one step to solve and are easy to solve even with the top-to-bottom strategy. Second, people might switch between strategies on the remaining items; the fact that someone used one particular strategy on two (or four) similar items does not mean they used it on all items. Because we tested the differences on the total score, the variability induced by (hypothesized) strategy differences would create too much noise if people switch between strategies on items with different structure - noise that cannot be explained by the information from the 4 items alone. A more suitable approach would therefore be to analyze the data item by item, identify the strategies, assign participants to them, and build a model predicting the external criterion (success on the individual trial, solving speed, etc.) using the strategies on individual items as predictors. For now, this example shows that classifying people into different groups might be feasible.

General Discussion

In this work, we have shown that the original method for analyzing eye-movement patterns, the SRSA, might not be suitable for wide application. After the initial analysis of four DMM items showed a high fit that dropped significantly after cross-validation, and extracted strategies that were difficult to interpret, we turned our attention to the stability of the method given our data. Using simulations, we arrived at the conclusion that the method performs poorly at retrieving the "true" representations of the strategies in our data (as indicated by the instability of the extracted components). Similarly, the predictive matrix did not stabilize across the simulated data, suggesting the predictions are highly dependent on idiosyncratic patterns within the current sample. This conclusion is also supported by the results showing that the method over-fits the data, with the margin of over-fitting depending only on the sample size. This indicates the method should not be used for the purposes we aimed for.

An important difference between our analysis and the original approach is that in the simulation study we did not average SR matrices across multiple items, which could have caused the instability of our results. Averaging the matrices, however, hides the possibility that participants switch strategies between items. The expected profile of the systematic strategy also depends on the structure of the items, which precludes averaging in the DMM data, where at least the predicted systematic strategy should manifest as different sequences for different types of items.

Given those findings, we adjusted the SRSA so that it better reflects our assumptions about the DMM data. We center the method on classifying subjects into groups with similar patterns (k-means clustering) instead of predicting the criterion from the fixation patterns. Using simulations, we investigated the performance of the new method and found that it produces more stable results, which are also easier to interpret. The classification accuracy also tends to be high even for small samples, although it can decrease when the distribution of strategies in the sample is uneven. The new method also uses converged SR matrices, and our conclusion is that the γ parameter then has no significant influence on the results. A problem remains in deciding how many clusters to estimate: in many of our simulations the scree plot was uninformative or even misleading, and the analysis of the real data likewise produced scree plots that did not clearly indicate only two clusters. The problem of choosing the number of strategies therefore persists, along with a new problem of how to choose the best γ parameter.

In relation to the DMM, the new method suggested that there are possibly two different strategies, at least for solving the easiest items with the oo feedback. For those items, there is no strong evidence of a tendency to use the systematic strategy more often in the second block, which would have indicated strategy learning. The results also do not provide convincing evidence that the systematic strategy leads to better performance (although both effects were in the predicted direction). However, this analysis was mainly exploratory and provisional; the main point is that the new method makes it possible to describe different strategies and assign people to them. Future analysis of the whole dataset should focus on applying the method to items individually and assigning people to different strategies on those particular items. That will allow us to investigate the overall patterns of strategy use, switching between strategies depending on the type of feedback and progress through the task, and, finally, to answer the question whether the systematic strategy leads to better performance. The method could potentially enable us to incorporate strategy shifting as a function of item difficulty and solver ability into one comprehensive model (Bethell-Fox, Lohman, & Snow, 1984).

Future development

Our critique of the stability of the results mainly concerns how valid the interpretation of the output is for the purpose of strategy discovery, description, and assignment. The proponents of SRSA have warned repeatedly (Hayes & Henderson, 2017; Hayes et al., 2011, 2015) that the method over-fits and that cross-validation is therefore crucial. Our simulations did not take this into account, so we do not have meaningful information about the stability and precision of the cross-validated results. However, we would argue that for the purpose of strategy discovery the output of the original method is not appropriate, as it cannot classify people into groups with similar patterns; its only target is to explain variance in some external variable given the fixation sequences. The variant of the method using PCA could be useful if the research goal is to predict some outcome from the eye movements. This points more in the direction of machine learning, with an emphasis on prediction rather than explanation, which could be a useful way of moving forward (Yarkoni & Westfall, 2016). To move towards this goal, the method should incorporate regularization techniques. At present, however, the method uses completely unconstrained estimation of the optimal parameter values and regression coefficients, and the cross-validation only seeks to confirm that some of the information captured in the training sample generalizes to new observations. This could be adjusted so that the parameter values, the number of components, and the regression weights are estimated such that the cross-validated prediction is optimal. Possible adjustments in this direction include, for example, searching for a pair of parameters α and γ that is fixed across the cross-validation folds, and, instead of standard linear regression on an arbitrary number of components, using regularized regression (preferably L1-penalized (Tibshirani, 1996), though L2 penalization might also prove useful) with the penalization weight optimized to achieve the highest cross-validated variance explained in the external target. In that way, the method would by design search only for features in the data that generalize to a new sample. A remaining challenge would be to deal with label switching and with differences in the extracted components and regression coefficients between the individual cross-validation folds. One step forward was already reported in a recent article using k-means clustering on the cross-validated components (Hayes & Henderson, 2017), but more work needs to be done to optimize the method further (e.g., implementing the regularized regression).
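As an illustration of this kind of adjustment, the following Python sketch combines PCA with an L1-penalized regression whose penalty weight is tuned by an inner cross-validation, evaluated by an outer cross-validation loop. The data here are synthetic stand-ins for flattened SR matrices; all names and dimensions are our own assumptions, not the original SRSA implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Toy stand-in for real data: 60 participants, each described by a
# flattened 6x6 SR matrix (36 features), plus an external score.
n, n_aoi = 60, 6
X = rng.normal(size=(n, n_aoi * n_aoi))
scores = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=n)  # synthetic target

# Dimension reduction first, as in the PCA variant of SRSA.
components = PCA(n_components=10).fit_transform(X)

# L1-penalized regression; the penalty weight is chosen by an inner
# cross-validation loop inside LassoCV itself.
model = LassoCV(cv=5)

# Outer cross-validation: every prediction is made on held-out folds,
# so the resulting R^2 estimates how well the features generalize.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
predicted = cross_val_predict(model, components, scores, cv=cv)
r2 = 1 - np.sum((scores - predicted) ** 2) / np.sum((scores - scores.mean()) ** 2)
```

With real SRSA data, `X` would hold the flattened successor matrices and `scores` the behavioural target; the point of the sketch is only that both the penalty weight and the final prediction are validated out-of-sample by design.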

As for the new method we developed, several loose ends call for improvement. The method using k-means clustering shows promise in terms of classification into different groups. However, using the converged matrices did not yield a significant gain in accuracy for different values of γ. This is disappointing, albeit not surprising, as the converged matrices use only the information that is already present in the first order transition matrices. A possible solution is to reintroduce the parameter α into the construction of the SR matrices. That would, however, bring back the problem of interpreting the not yet converged matrices, and more work should be devoted to describing the effect of different values of α for sequences of different lengths. One possible approach could make use of a dynamic parameter α, which would take smaller values as the length of the transition sequence increases. It should be possible to find a rule ensuring that transitions across the whole sequence carry the same weight, regardless of the length of the sequence. In our view, such an approach could make the interpretation much easier.
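To make the idea of a dynamic α concrete, the sketch below builds an SR matrix with the temporal-difference update used in SRSA (Hayes et al., 2011, following Dayan, 1993), but with a decaying schedule α_t = 1/t. The schedule, the AOI coding, and all variable names are our own illustrative assumptions, not a tested rule.

```python
import numpy as np

def sr_matrix(sequence, n_aoi, gamma):
    """Successor representation of a fixation sequence over n_aoi AOIs,
    built with the TD update M[i] <- M[i] + alpha*(onehot(j) + gamma*M[j] - M[i]),
    where alpha decays as 1/t so that late transitions in a long sequence
    do not dominate the earlier ones (an illustrative assumption)."""
    M = np.zeros((n_aoi, n_aoi))
    for t, (i, j) in enumerate(zip(sequence[:-1], sequence[1:]), start=1):
        alpha = 1.0 / t  # dynamic learning rate: shrinks with sequence position
        onehot = np.zeros(n_aoi)
        onehot[j] = 1.0
        M[i] += alpha * (onehot + gamma * M[j] - M[i])
    return M

# The same scan pattern, repeated to different lengths:
M_short = sr_matrix([0, 1, 2] * 2, n_aoi=3, gamma=0.3)
M_long = sr_matrix([0, 1, 2] * 10, n_aoi=3, gamma=0.3)
```

Comparing `M_short` and `M_long` is one way to check how sensitive a candidate schedule is to sequence length, which is exactly the interpretation problem raised above.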


A further loose end is the choice of the parameter values, so that the analysis does not rely on an arbitrary choice (as was done in our last example analysis). We propose optimizing the parameter(s) such that, when k-means clustering is applied to the SR matrices, the within-cluster variance is smallest. A drawback is that the method becomes computationally very demanding, given that the parameter values would need to be optimized for each number of clusters under investigation. To reduce this complexity, a suitable approach could be to optimize the parameter(s) only for the 2-cluster solution and reuse those values for solutions with more clusters. Either way, this method would produce scree plots on which the final number of clusters to extract could be based. The approach would thus deal with the most problematic parts of our method: which parameter values to choose and how many clusters (groups of fixation sequences) to use for the subsequent analysis. However, it is not guaranteed that this will lead to satisfactory performance, that is, whether the chosen number of clusters is the right one, whether the method reaches good classification accuracy, and whether it is unbiased in estimating relations of the clusters to external variables. It is therefore desirable to conduct a validation study, similar to the one we reported here for the two methods (PCA and k-means SRSA), to see how such a method performs. Another modification might be to use k-medoids clustering instead of k-means, as it should be more robust against noise and outliers (Kaufman & Rousseeuw, 1987). For the time being, however, it is probably too soon to develop this approach, at least as long as the method using k-means produces interpretable, stable, accurate, and unbiased results. Hopefully, future development will enrich the analysis toolkit of researchers who investigate eye-movement patterns to test models that predict different strategies in cognitive tasks.
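The proposed parameter search could be sketched as a simple grid search: for each candidate γ, compute the (converged) SR matrices, run k-means, and keep the γ with the smallest within-cluster sum of squares. The synthetic transition matrices, the closed form T(I - γT)^(-1) for the converged SR, and the use of scikit-learn's k-means are all illustrative assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_seq, n_aoi = 40, 4

# Toy first-order transition matrices (rows sum to 1) for 40 sequences;
# the first 20 get a distinct structure so that two clusters exist.
T = rng.dirichlet(np.ones(n_aoi), size=(n_seq, n_aoi))
T[:20, 0, :] = rng.dirichlet(np.array([10.0, 1, 1, 1]), size=20)

def converged_sr(t, gamma):
    # Closed-form converged SR, M = T (I - gamma*T)^(-1); assumed here
    # for illustration. gamma < 1 keeps the inverse well-defined.
    identity = np.eye(t.shape[-1])
    return t @ np.linalg.inv(identity - gamma * t)

def inertia(gamma, k=2):
    # Within-cluster sum of squares of k-means on flattened SR matrices.
    X = np.stack([converged_sr(t, gamma).ravel() for t in T])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Grid search for the gamma minimizing within-cluster variance for the
# 2-cluster solution; those values would then be reused for k > 2.
grid = np.linspace(0.0, 0.9, 10)
best_gamma = min(grid, key=inertia)
```

Optimizing only the 2-cluster solution, as in the sketch, keeps the cost at one k-means run per grid point; the resulting `inertia` values over k would supply the scree plot mentioned above.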


References

Bethell-Fox, C. E., Lohman, D. F., & Snow, R. E. (1984). Adaptive reasoning: Componential and eye movement analysis of geometric analogy performance. Intelligence, 8, 205–238.

Boots, M. (2016). An overview of methods to analyse visual scan patterns using a top-down or bottom-up approach.

Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 1190–1208.

Curie, A., Brun, A., Cheylus, A., Reboul, A., Nazir, T., Bussy, G., ... David, A., et al. (2016). A novel analog reasoning paradigm: New insights in intellectually disabled patients. PLoS ONE, 11, e0149717.

Dayan, P. (1993). Improving generalization for temporal difference learning: the successor representation. Neural Computation, 5, 613–624.

Dayan, P. & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301.

Gierasimczuk, N., Van der Maas, H. L., & Raijmakers, M. E. (2013). An analytic tableaux model for deductive Mastermind empirically tested with a massively used online learning system. Journal of Logic, Language, and Information, 297–314. doi:10.1007/s10849-013-9177-5

Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.

Hayes, T. R. & Henderson, J. M. (2017). Scan patterns during real-world scene viewing predict individual differences in cognitive capacity. Journal of Vision, 17(5), 23. doi:10.1167/17.5.23

Hayes, T. R., Petrov, A. A., & Sederberg, P. B. (2011). A novel method for analyzing sequential eye movements reveals strategic influence on Raven's Advanced Progressive Matrices. Journal of Vision, 11(10), 10. doi:10.1167/11.10.10

Hayes, T. R., Petrov, A. A., & Sederberg, P. B. (2015). Do we really become smarter when our fluid-intelligence test scores improve? Intelligence, 48, 1–14.


Kaufman, L. & Rousseeuw, P. (1987). Clustering by means of medoids. North-Holland.

Loesche, P., Wiley, J., & Hasselhorn, M. (2015). How knowing the rules affects solving the Raven Advanced Progressive Matrices test. Intelligence, 48, 58–75.

Strom, A. R. & Barolo, S. (2011). Using the game of Mastermind to teach, practice, and discuss scientific reasoning skills. PLoS Biology, 9, e1000578. doi:10.1371/journal.pbio.1000578

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 267–288.

Truțescu, G.-O. (2016). Logical reasoning in a deductive version of the Mastermind game.

Vakil, E. & Lifshitz-Zehavi, H. (2012). Solving the Raven Progressive Matrices by adults with intellectual disability with/without Down syndrome: Different cognitive patterns as indicated by eye-movements. Research in Developmental Disabilities, 33, 645–654. doi:10.1016/j.ridd.2011.11.009

Vigneau, F., Caissie, A. F., & Bors, D. A. (2006). Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence, 34, 261–272. doi:10.1016/j.intell.2005.11.003

White, L. M. (1995). Temporal difference learning: Eligibility traces and the successor representation for actions.

Yarkoni, T. & Westfall, J. (2016). Choosing prediction over explanation in psychology: Lessons from machine learning. Unpublished manuscript. Retrieved from http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf


Table 1
The aggregate total assignment accuracy across all simulation scenarios, summarized for different values of γ. For γ = 0, the representation is equal to the first order transition matrix.

γ             0          0.1        0.2        0.3        0.4        0.5        0.6        0.7        0.8        0.9
Median        0.88       0.90       0.90       0.92       0.92       0.93       0.93       0.92       0.90       0.88
Mean          0.83       0.84       0.85       0.86       0.87       0.87       0.88       0.87       0.87       0.85
SD            0.14       0.14       0.14       0.14       0.13       0.13       0.13       0.12       0.12       0.12
IQR           0.70–0.95  0.73–0.95  0.76–0.95  0.79–0.95  0.80–0.96  0.81–0.96  0.85–0.96  0.85–0.95  0.83–0.95  0.80–0.93


Figure 1. A simple example item of the DMM provided by Gierasimczuk, Van der Maas, and Raijmakers (2013). The item consists of two conjectures with feedback: the first indicates that one of the sunflowers is in the right place but the other is not in the solution; the second indicates that the sunflower and the tulip are in the solution, but in different places. The solution to this item is then tulip–sunflower.


Figure 2. The possible feedbacks used in the work of Truțescu (2016). The top-left (oo) is considered the most informative feedback, as it does not require branching; in fact, a single step, switching the colors, is sufficient to reach the solution. The other feedbacks are less informative, as they require either at least one additional piece of feedback to combine information with (rr, top-right) or branching (gr and or, bottom).


Figure 3. Predictive matrix and extracted components from the 4 DMM items. The first component has a positive relation to the total score (1.05), the second a negative one (-2.34).


Figure 4. Predicted versus observed scores given by the SRSA method. The prediction on the full dataset is shown on the left, the cross-validated prediction on the right (with an inset excluding one


Figure 5. Simulated strategies. SR representation of the two strategies for α = 0.5 and γ = 0.3, averaged across 1000 replications. A random sample of the two patterns is plotted analogously to the original layout (the 9 panels on the left half represent the top-to-bottom strategy, the 9 on the right the systematic strategy).


Figure 6. The optimized parameters α and γ over 5400 simulations. The values are scattered across the whole parameter space and show no dependency structure on any of the varied specifics of the data.
