
Investigating the Importance of Reaction Times in Reinforcement Learning Tasks

Varvara Mathiopoulou Steven Miletić

MSc Brain & Cognitive Sciences

Integrative Model-based Cognitive Neuroscience Research Unit, Amsterdam, Netherlands


Abstract

Reinforcement learning (RL) models have long been used to describe processes of error-driven learning. However, RL models do not account for participants' response times, a component that has proven highly informative in the decision-making literature. Here we investigate whether reaction times are important in learning; we report two experiments in which a speed-accuracy tradeoff manipulation was introduced during an instrumental learning task. The classical RL model proved insufficient to capture the full range of behavioral data, so a combined Reinforcement Learning Diffusion Decision Model was examined. The results show that a joint model accounts for both choice patterns and reaction times, and is better suited to explaining choice accuracy and reaction times together. The current study contributes to recent advances in cognitive neuroscience, where integrative models are increasingly used to explain latent learning processes.

Keywords: Q-learning, reinforcement learning, diffusion decision model, speed-accuracy tradeoff


Investigating the Importance of Reaction Times in Reinforcement Learning Tasks

Learning from trial-and-error is a fundamental cognitive capacity (Thorndike, 1927). The field of reinforcement learning aims to describe the cognitive processes that allow us to learn from errors as well as from the positive outcomes of our choices (Daw, 2014; Daw & Tobler, 2014; Dayan & Niv, 2008; Sutton & Barto, 2018). In experimental settings, learning is studied by placing participants in an environment where they face various choice possibilities and have to discover the option that yields the highest reward. In a reinforcement learning task, the participant is offered several response options, each with a pre-set probability of giving a reward, and needs to discover by trial-and-error which options maximize the cumulative reward (Sutton & Barto, 2018). Using such studies, the investigation of the cognitive processes of learning has progressed greatly (Frank, Seeberger, & O'Reilly, 2004; Gershman & Daw, 2017; Schönberg, Daw, Joel, & O'Doherty, 2007), as has our understanding of the neural underpinnings of these processes (Frank, Moustafa, Haughey, Curran, & Hutchison, 2007; Glimcher, 2011; Holroyd & Coles, 2002; Matsumoto & Hikosaka, 2009).

Reinforcement Learning models (Rescorla & Wagner, 1972) assume that people choose between uncertain options based on an internal estimation of the reward expectation, which is updated based on the prediction error (Cohen & Ranganath, 2007). Reinforcement Learning models consist of two components: the update rule and the choice rule (Daw & Tobler, 2014). The update rule (e.g., Q-learning, TD-learning, SARSA) formulates how exactly the expected values are updated based on prediction errors. The choice rule is a function that maps the expected value of the option to the probability of choice. The typical choice rule is Softmax, which determines the probability that the learner will choose one option over the other.

Standard reinforcement learning models, by using Softmax, focus only on choice patterns during learning, but do not account for the response times associated with these choices. This may be a legacy of the origin of reinforcement learning models, which were initially formulated for software agents (Sutton & Barto, 2018). However, human response times change as a consequence of learning (Dutilh, Vandekerckhove, Tuerlinckx, & Wagenmakers, 2009; Liu & Watanabe, 2012; Logan, 1992; Petrov, Van Horn, & Ratcliff, 2011; Ratcliff, Thapar, & McKoon, 2006; Sewell, Jach, Boag, & Van Heer, 2019). Similarly, in the decision-making literature, response times are very informative about the choice process, and formally modelling response times (together with choices) has offered great progress in understanding behavior under time pressure (Forstmann et al., 2008), urgency (Boehm, Hawkins, Brown, van Rijn, & Wagenmakers, 2016; Hawkins, Forstmann, Wagenmakers, Ratcliff, & Brown, 2015), fast and slow errors (van Maanen, Katsimpokis, & van Campen, 2018), and time estimation ability (Miletić & van Maanen, 2019).

The advantage of decision-making models over reinforcement learning models, namely that they account for both choice patterns and response times (Ratcliff, 1978; Ratcliff & McKoon, 2008), led some researchers to focus on disentangling the latent processes of decision making in reinforcement learning tasks (Fontanesi, Gluth, Spektor, & Rieskamp, 2019; Pedersen, Frank, & Biele, 2017; Sewell et al., 2019). More specifically, recent work has combined Reinforcement Learning and the Diffusion Decision Model (DDM) of Ratcliff (1978). Frank et al. (2015) fitted the diffusion model to a probabilistic reinforcement learning task and showed that decision patterns and response times can be accounted for simultaneously. Pedersen et al. (2017) were the first to use the DDM as the choice function of a reinforcement learning model and argued that the DDM predicts choice patterns and RTs of adults with ADHD. More recently, Sewell et al. (2019), assuming that the drift rate captures the strength of associations in learning, examined how it is affected in a classical reinforcement learning task. Analyzing the learning data with the diffusion model, they showed that drift rate variation can account for changes in performance across difficulty levels during the task. Finally, Fontanesi et al. (2019) showed that a combined RL-DDM could account for a magnitude effect in a learning task, which is one of the classical findings in decision making.

To contribute to the investigation of decision-making processes during reinforcement learning, the current study explores the significance of response times during an RL task. To do so, we introduced a classical manipulation from the decision-making literature, the speed-accuracy tradeoff (SAT).

Speed-accuracy tradeoff

It has long been known (Garrett, 1922) that when making a decision, people can voluntarily trade off response speed against response accuracy: people can willingly respond faster at the cost of making more errors, or make fewer errors at the cost of being slower (Bogacz, Wagenmakers, Forstmann, & Nieuwenhuis, 2010; Forstmann et al., 2010; Heitz, 2014; Heitz & Schall, 2012; Liu & Watanabe, 2012; van Maanen et al., 2018). In SAT experimental settings, the participant is instructed either to be as fast as possible or to focus on being accurate. Importantly, in the speed condition there is often a stricter deadline, so that participants are cued whenever their responses are too slow.

One of the major advantages of decision-making models such as the DDM (Ratcliff, 1978) is that they offer a natural account for how changes in response caution lead to changes in response speed and accuracy. In brief, these models propose that people make decisions by gradually accumulating evidence for each choice option, until a threshold level of evidence is reached and a decision is made. Adjustments in the SAT are captured well by changes in the threshold parameter. Increasing the threshold leads to slower (since more evidence needs to be collected) but more accurate (since each choice is made based on more evidence) decisions (Forstmann, Ratcliff, & Wagenmakers, 2016; Heitz, 2014; M. Mulder, Van Maanen, & Forstmann, 2014; Ratcliff, Smith, Brown, & McKoon, 2016).

As mentioned before, unlike the DDM, RL models do not predict changes in reaction times during the choice process. Here, we test whether an SAT manipulation has the classical behavioral effects, as well as if and how RL models can capture these behavioral effects in RTs, accuracy, or both. We also investigate whether an RT manipulation affects the learning process, as defined solely by the parameters of a reinforcement learning model.

In the following sections we report the results of two experiments with different speed-stress manipulations. For each experiment we report the methodology used, the descriptive statistics, statistical tests, and model inspection. After the first experiment there is an interim discussion of the results. Following the second experiment, we formulate a general discussion of the findings. Finally, the reader can find some limitations of our study and suggestions for future research.

Experiment 1

Method

Participants. The study was approved by the ethics committee of the Psychology Department, University of Amsterdam. 18 students participated in the study (13 women, mean age 20.17 [SD 1.58], 17 right handed). All participants were students from the Department of Psychology, University of Amsterdam and gave written informed consent prior to beginning the experiment. Participants were compensated with research credits.

Learning Task Description. The main goal of the reinforcement learning task was for the participants to earn as many points as possible by making the optimal choice among the two choice alternatives being presented (Fontanesi et al., 2019; Frank et al., 2015; Pedersen et al., 2017; Sewell et al., 2019). The stimuli were abstract symbols presented in fixed pairs. The participant needed to learn, after some exploration and trial-and-error, which option had the larger probability of paying off. Participants collected points throughout the experiment, which were redeemed for research credits afterwards.

Each trial (Fig. 1a, 1b) started with a fixation cross (0.3 s), after which the participant was presented with a pair of symbols and needed to choose the one that would most likely give a positive outcome. They were instructed either to be as quick as possible or as confident as possible when making a choice. Although in classical decision-making tasks with speed-accuracy tradeoff manipulations (Heitz, 2014) participants are instructed to be accurate, in the current experiment there is no externally defined accuracy, only an optimal versus a suboptimal choice. Therefore, participants were encouraged to make confident choices. Once a decision was made, the chosen symbol was highlighted, and the participant received the outcome of the choice ("100" in green letters / "0" in red letters) and the reward (100 points in green letters / 0 points in red letters). The outcome refers to the choice's pay-off, whereas the reward refers to the number of points the participant earns. If the decision was made too late, meaning that the choice was not made within the deadline, participants received feedback in red letters ("Too slow!") and the reward was 0 even if the outcome was 100. Showing the outcome regardless of a missed deadline was necessary, since in the learning paradigm participants learn from the feedback they receive and adjust their expected values for each choice accordingly.


Figure 1. Illustration of various trials in the SAT block. Panel (a) presents one trial timeline in the SAT manipulation where the participant receives a positive outcome. First, there was a fixation cross for 0.3 seconds, followed by the cue 'Be fast!' or 'Be confident!' in red letters for 0.8 seconds. After this, the two abstract symbols appeared on the screen for 2 seconds in the confidence condition and for a shorter interval in the speed condition. The choice of the participant was highlighted and feedback was given for 1 second, including the outcome of the choice and the reward. Panel (b) shows one trial timeline in the SAT manipulation where the chosen option has outcome 0. Panel (c) presents one trial timeline in the SAT manipulation in Experiment 2, where the participant made no choice in time, resulting in losing 100 points. Additionally, the participant received the feedback 'Too slow!' in red letters.

Design. The task consisted of three types of blocks. First, all participants completed a block that determined the difficulty level for the rest of the experiment (Miletić & van Maanen, 2019; M. J. Mulder, Wagenmakers, Ratcliff, Boekel, & Forstmann, 2012; Winkel, Keuken, van Maanen, Wagenmakers, & Forstmann, 2014). There was no SAT manipulation in this block. The participant was presented, in random order, with four sets of stimuli, assigned reward probabilities of 0.8-0.2, 0.7-0.3, 0.65-0.35 and 0.60-0.40. For the rest of the experiment, the probabilities of the outcomes were determined by weighting the reward probabilities by the corresponding accuracy levels in the second half of this 'calibration block'. This probability level was kept throughout the reinforcement learning task for both speed and confidence pairs. Individual performance in this block also determined the deadline in the speed condition (Miletić & van Maanen, 2019). On each trial, the deadline was sampled from an exponential distribution with rate parameter 0.5, shifted by the 80th percentile of the participant's response time distribution in the calibration block. In the confidence condition, the deadline remained at 2 seconds.
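As an illustration, this deadline sampling could be implemented as in the R sketch below. The function and variable names (e.g., calibration_rts) are ours, and we assume that "rate parameter 0.5" refers to the rate argument of the exponential distribution; the original task code may have used a different parameterization.

```r
# Sketch of the speed-condition deadline sampling (assumed implementation).
# calibration_rts: a participant's response times (in seconds) from the calibration block.
sample_speed_deadline <- function(calibration_rts, rate = 0.5, percentile = 0.80) {
  shift <- quantile(calibration_rts, probs = percentile, names = FALSE)  # 80th percentile RT
  shift + rexp(1, rate = rate)  # shifted-exponential deadline for a single trial
}
```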

In the next two blocks, the participant was introduced to the SAT manipulation. In one of the two, the Miniblocks condition, the speed cue alternated with the confidence cue across blocks of 8 trials. This small number of trials was chosen so that the participant would not learn the probabilities too early in the procedure. In the second experimental block, the Trialwise condition, cuing was trial-by-trial; the setting was the same, with four new stimulus sets, but the speed or confidence instruction alternated randomly across trials. The comparison between Miniblocks and Trialwise cues was included to investigate whether the speed instruction per se influences behavior, or whether the speed instruction combined with the participant's increased error rate (most likely to occur in a Miniblocks design) affects the learning process (Forstmann et al., 2008). The order of the two types of blocks was counterbalanced across participants.

Procedure. Participants arrived at the lab and, after reading an information brochure with more details regarding the nature of the task and the compensation system, they signed an informed consent form. Afterwards, they were seated in an individual cubicle and proceeded with the main experimental procedure. The experiment lasted about 45 minutes.


First, they received some instructions on how to perform the task. More specifically, the participant was introduced to the notion of probability by playing with some slot machines. Feedback was given throughout the instruction phase in order to establish a good understanding of the task. For the rest of the experiment, choices were represented by abstract symbols. After the instructions, there were 70 practice trials (without the SAT manipulation). Next, participants completed the calibration block (a standard RL task without any further manipulation) with four pairs of stimuli and 50 trials per set, a total of 200 trials. Its function was to determine the individual probability level and deadline. Afterwards, the SAT manipulation was introduced with the two RL block types. In each block participants performed 77 trials per stimulus set, a total of 308 trials per block. After the task was completed, participants were debriefed by the experimenter regarding the objective of the study, and they were awarded the research credits.

Data Analysis. We analyzed the data first by calculating mean reaction times and accuracy. Accuracy was coded as 0 or 1 depending on whether the participant chose the option that had higher probability to give a reward. Accuracy was 1 for late responses, if the choice was the optimal one (Heitz, 2014).

Furthermore, we fitted the Q-learning model with the Softmax function as the choice rule (Dayan & Watkins, 1992; Sutton & Barto, 2018). The Q-learning rule determines the value (the 'quality') of the choice on each trial and updates it every time there is a difference between the expected and the actual reward. It is calculated by the following equation:

Q_{s,t} = Q_{s,t-1} + \alpha\,(r_{t-1} - Q_{s,t-1})     (1)

where Q_{s,t} is the updated expected value of choice s on trial t, Q_{s,t-1} is the expected value of choice s on the previous trial, α is the learning rate, and (r_{t-1} - Q_{s,t-1}) is the prediction error, with r_{t-1} the reward given on the previous trial. For simplicity, we ignore additional mechanisms such as delay discounting (Watkins & Dayan, 1992) and eligibility traces (Singh & Sutton, 1996). The learning rate determines the size of the update step and varies between 0 and 1. A larger learning rate indicates that the most recent rewards are weighted more strongly, whereas learners with a lower learning rate also take earlier trials of the task into account (Daw & Tobler, 2014).
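To make the update rule concrete, a minimal R sketch of Eq. 1 is given below; the function and variable names are illustrative and not taken from the original analysis code.

```r
# Sketch of the Q-learning update (Eq. 1).
# q:      current expected value of the chosen option
# reward: reward obtained for the chosen option (r in Eq. 1, assumed coded as 0 or 1)
# alpha:  learning rate in [0, 1]
update_q <- function(q, reward, alpha) {
  q + alpha * (reward - q)  # move the expected value toward the obtained reward
}
```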

The choice rule of the learning model is Softmax:

P(s) = \frac{\exp(\beta Q_s)}{\sum_i \exp(\beta Q_i)}     (2)

where P(s) is the probability of choosing option s, β is the inverse temperature parameter, and Q_s is the expected value of choice s. Eq. 2 determines the mapping between expected values and choice probabilities. The influence of the inverse temperature parameter β is illustrated in Figure 2 (assuming two choice options). For an equal value difference, a higher β leads to a higher probability of choosing the option with the higher value.

Figure 2. The influence of the parameter β on the choice probabilities in Softmax. The x axis represents the difference in expected value between Choice A and Choice B. The y axis shows the probability of choosing A. The parameter β determines the probability of choosing options A and B for any given value difference. P(A) increases monotonically with increases in β (within the observed range of β of roughly 1 to 50).

For this reason, the parameter β has been interpreted in the literature as the sensitivity to reward (Pedersen et al., 2017), under the assumption that learners with a higher value of β are more likely to choose the option that had the higher expected value on the previous trial because it paid off. Alternatively, it has also been interpreted in terms of the exploration-exploitation tradeoff, since participants with lower values of β are more likely to choose options with lower expected values, thereby 'exploring' the environment (Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006).
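A minimal R sketch of the Softmax choice rule (Eq. 2) for two options is shown below; again, the names are illustrative only.

```r
# Sketch of the Softmax choice rule (Eq. 2).
# q:    vector of expected values, e.g. c(Q_A, Q_B)
# beta: inverse temperature parameter
softmax_choice_prob <- function(q, beta) {
  exp(beta * q) / sum(exp(beta * q))  # probability of choosing each option
}

# Example: with Q_A = 0.8, Q_B = 0.2 and beta = 5, option A is chosen with probability ~0.95.
softmax_choice_prob(c(0.8, 0.2), beta = 5)
```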

We fitted seven models per participant that differed in which parameters varied with cue (speed/confidence) and block type (Miniblocks/Trialwise). See Table 2 for the varying parameters in each model. Models 1-4 assumed no differences between block types, whereas Models 5-7 assumed that block type had an effect on the learning rate and/or the sensitivity to reward. Further, Models 2, 4, 5, and 7 assumed that the SAT instruction influenced the sensitivity to reward, and Models 3, 6, and 7 assumed an influence of the SAT instruction on the learning rate. Model fitting was performed using maximum likelihood estimation, and parameter optimization was done using the Differential Evolution algorithm (Storn & Price, 1997), as implemented in the package DEoptim (Ardia, Mullen, Peterson, & Ulrich, 2007) for the R programming language (R Core Team, 2017).
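The fitting procedure can be sketched as follows. This is a simplified illustration under our own assumptions (initial Q-values of 0.5, rewards coded 0/1, parameter bounds of [0, 1] for α and [0, 50] for β), not the authors' actual code.

```r
# Sketch of maximum likelihood estimation of the basic Q-learning/Softmax model (Model 1)
# for a single stimulus set, using differential evolution (DEoptim).
library(DEoptim)

# choices: vector of chosen options (1 or 2); rewards: vector of obtained rewards (0 or 1).
neg_log_likelihood <- function(pars, choices, rewards) {
  alpha <- pars[1]
  beta  <- pars[2]
  q  <- c(0.5, 0.5)  # assumed initial expected values
  ll <- 0
  for (t in seq_along(choices)) {
    p  <- exp(beta * q) / sum(exp(beta * q))  # Softmax choice probabilities (Eq. 2)
    ll <- ll + log(p[choices[t]])
    q[choices[t]] <- q[choices[t]] + alpha * (rewards[t] - q[choices[t]])  # Q-update (Eq. 1)
  }
  -ll  # DEoptim minimizes, so return the negative log-likelihood
}

# fit <- DEoptim(neg_log_likelihood, lower = c(0, 0), upper = c(1, 50),
#                choices = choices, rewards = rewards)
```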

Model comparison was done using the Bayesian Information Criterion (BIC; Burnham & Anderson, 2004; Schwarz, 1978; Wagenmakers & Farrell, 2004). The BIC is calculated by the following equation:

BIC = \log(n)\,k - 2\log(\hat{L})     (3)

where n is the number of observations (trials), k is the number of parameters, and \hat{L} is the maximized likelihood. Lower BIC values mean a better fit of the model, balanced for complexity. The BIC weight is calculated as follows:

w_i(\mathrm{BIC}) = \frac{\exp\{-\frac{1}{2}\Delta_i(\mathrm{BIC})\}}{\sum_{k=1}^{K}\exp\{-\frac{1}{2}\Delta_k(\mathrm{BIC})\}}     (4)

where \Delta_i(\mathrm{BIC}) = \mathrm{BIC}_i - \min \mathrm{BIC}.
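For illustration, Eqs. 3 and 4 can be computed as in the R sketch below; the function name and inputs are hypothetical.

```r
# Sketch of BIC (Eq. 3) and BIC weights (Eq. 4) across a set of fitted models.
# log_lik: vector of maximized log-likelihoods; k: vector of parameter counts; n: number of trials.
bic_weights <- function(log_lik, k, n) {
  bic   <- log(n) * k - 2 * log_lik      # Eq. 3
  delta <- bic - min(bic)                # difference from the best (lowest-BIC) model
  w     <- exp(-0.5 * delta)
  list(bic = bic, weights = w / sum(w))  # Eq. 4: normalized weights
}
```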

Results & Discussion

Results. Out of the 18 participants, three were excluded because they failed to learn the optimal choices (overall accuracies < 0.55; Fontanesi et al., 2019). Mean accuracy of the excluded participants was 0.51 [SD = 0.09]. The results therefore concern the remaining 15 participants. Non-response trials were removed (32 observations, 0.26% of trials), as well as responses faster than 100 ms (8 observations, 0.07% of all trials). Furthermore, for the SAT trials, we removed within each participant the data from stimulus sets for which accuracy was below 0.55; these were 13 stimulus sets in total, with at most two sets per block per participant. This was done to remove the stimulus sets that the participant did not manage to learn. Mean accuracy in the excluded stimulus sets was 0.43 [SD = 0.13]. These accounted for 1001 observations, 8.18% of all trials.

In Table 1 we provide descriptive statistics of Experiment 1. Accuracy refers to whether the participant chose the option with the higher probability of paying off. The main question is whether accuracy under speed stress was lower than under confidence stress and, accordingly, whether reaction times were lower in the speed condition. Figure 3 shows the mean accuracy and mean response times for each condition (panels 3a and 3b). Error bars indicate the standard errors of the mean. Panel 3c shows the response time distributions of the calibration block and panel 3d the speed versus confidence RT distributions. The mean deadline was 1.05 seconds [SD = 0.23]. However, just 9.94% of responses in the speed condition exceeded this deadline, and 10.97% of responses in the confidence condition exceeded the same deadline. Notably, participants gave much faster responses in the confidence condition than in the calibration block. This indicates that there was a general speed-up which was much larger than the between-condition (speed-confidence) difference in response speeds.

Table 1

Descriptive Statistics of Experiment 1 (mean (SD); RTs in seconds)

                        Trialwise                  Miniblocks                 Overall
           Calibration  Speed        Confidence    Speed        Confidence    Confidence   Speed
Accuracy   0.72 (0.10)  0.81 (0.09)  0.81 (0.10)   0.75 (0.12)  0.77 (0.12)   0.79 (0.10)  0.78 (0.09)
RTs        0.84 (0.14)  0.60 (0.12)  0.64 (0.13)   0.63 (0.13)  0.68 (0.13)   0.66 (0.12)  0.62 (0.12)



Figure 3. Illustrations of descriptive statistics in Experiment 1. Panel 3a presents the mean accuracy and panel 3b the mean response times per condition: Confidence Miniblocks, Speed Miniblocks, Confidence Trialwise, Speed Trialwise. Error bars indicate the SEM. Panel 3c shows the density RT distributions of the calibration block pooled across all participants. The dashed line represents the 80th percentile of the distribution, based on which the individual deadlines for the speed condition were sampled. Panel 3d shows the density RT distribution of the confidence condition (dashed line), the speed condition (solid line) and the mean speed deadline (vertical dashed line).


Differences in mean accuracy between the speed and confidence conditions were statistically significant in Miniblocks, t(14) = -2.62, p = .02. The differences in mean response times were significant both between the two Miniblocks conditions, t(14) = -5.06, p < .001, and between the two Trialwise conditions, t(14) = -3.43, p = .004. To investigate possible effects of block type or instruction on participants' accuracy or response times, we performed a two-way repeated measures ANOVA. There was a significant main effect of block type, F(1,14) = 19.42, p < .001, and a main effect of cue, F(1,14) = 26.69, p < .001, on RTs. Also, there was a main effect of cue on accuracy, F(1,14) = 13.87, p = .002.
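The repeated measures ANOVA described above could be run as in the R sketch below. The data layout (one row per participant, cue, and block type, with column names such as mean_rt) is our assumption; random data are used purely for illustration.

```r
# Sketch of the two-way repeated measures ANOVA on mean RTs (assumed data layout).
# One row per participant x cue x block type; here filled with random data for illustration only.
d <- expand.grid(participant = factor(1:15),
                 cue   = c("speed", "confidence"),
                 block = c("Miniblocks", "Trialwise"))
d$mean_rt <- rnorm(nrow(d), mean = 0.65, sd = 0.10)

# Within-subject factors cue and block, with participant as the error stratum.
fit <- aov(mean_rt ~ cue * block + Error(participant / (cue * block)), data = d)
summary(fit)
```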

To further disentangle the effect of speed stress on learning, we proceeded with model fitting. The Q-learning model with the Softmax function as the choice rule was fitted per stimulus set and participant. Six more stimulus sets in total were excluded based on visual inspection of the model fit; these accounted for 462 trials, 5.68% of the SAT trials. We then fitted seven models for each participant and performed model comparison. Table 2 summarizes the results. The winning model was selected based on three ranking criteria: first, the best model per participant (the one with the lowest BIC value, or equivalently the largest BIC weight); second, the number of participants for which each model had the highest rank; and third, the number of participants for which each model was among the top three. According to all three criteria, Model 1 won, in which neither the learning rate α nor the parameter β is allowed to vary. Figure 4 illustrates the quality of fit of this model.

However, based on the rankings, Model 3 (varying α between speed and confidence) and Model 6 (varying α between speed and confidence and between the two types of SAT blocks) were also considered. Exploratory analysis showed that the differences in α were not statistically significant for Model 3, t(14) = 0.875, p = .397, nor for Model 6: Trialwise, t(14) = -1.153, p = .268; Miniblocks, t(14) = -0.545, p = .594.

Discussion. In Experiment 1, participants performed a standard learning task, with a speed-accuracy trade-off manipulation. We explored both how speed stress affected the decision process as expressed by the observed data (accuracy and response times) and the learning process, as defined by the learning parameters α and β.


Table 2

Model Comparison for Experiment 1

Model  Parameters                                          k  Mean BIC  Mean BIC weight  n best  n Top3
1      α, β                                                2  429.446   0.486            9       14
2      Different β in SAT                                  3  434.383   0.090            1       8
3      Different α in SAT                                  3  432.622   0.171            2       11
4      Both α and β varying in SAT                         4  437.490   0.012            0       2
5      β varying in both SAT blocks and conditions         5  440.084   0.004            0       1
6      α varying in both SAT blocks and conditions         5  432.476   0.235            3       6
7      α and β varying in both SAT blocks and conditions   8  444.888   0.002            0       3

Notes: Model comparison for Experiment 1. The first column gives the model number, from the simplest to the most complex model. The column 'Parameters' describes which parameters vary in each model. k is the number of parameters per model, followed by the mean BIC values and the mean BIC weights. n best is the number of participants for which the model ranked first, and n Top3 is the number of participants for which the model was among the top three.


Figure 4. Model fits. Both panels present model fits for the SAT blocks. The x axis represents the trials, binned in groups of 10, for all eight stimulus sets, that is, for both the Miniblocks and Trialwise SAT blocks. The y axis represents the probability of choosing A, given choices A and B. In panel 4a, the black solid line is the observed data and the dashed line the simulated data. Panel 4b shows the same results but separated into confidence trials (observed data blue solid line, simulated data blue dashed line) and speed trials (observed data red solid line, simulated data red dashed line).


Differences in mean response times were statistically significant in the Trialwise block, and both accuracy and mean response times differed significantly in the Miniblocks condition. Overall, participants sped up and had lower accuracy when under speed stress, a finding consistent with the behavioral changes reported in the decision-making and SAT literature (Forstmann et al., 2008; Heitz, 2014).

To further explore how the SAT might have affected the learning process, we proceeded with model fitting and model comparison. This revealed that, regardless of speed or confidence stress, participants' learning process as defined by the learning rate and the sensitivity to reward was not altered. These results appear inconsistent: the SAT manipulation produced a behavioral difference in performance that was not reflected in the model comparison. There are two possible explanations for this discrepancy.

First, the behavioral effect was possibly not large enough. As can be seen in Fig. 3d, participants produced drastically faster responses than in the calibration block. This is an indication that the deadline was too lenient for many participants, so they might not have perceived sufficient speed stress on the majority of trials. Most importantly, participants sped up their responses substantially in the confidence condition as well, rather than adjusting their behavior to the confidence/speed instructions. This finding indicates that the sampled deadline possibly did not induce the intended speed stress during the choice process.

Second, these results might indicate that the model type was not appropriate to capture the full range of behavioral patterns under the SAT manipulation. Since RTs are not accounted for by Softmax, the standard RL model might not be sufficient for the data collected. To further explore these two possibilities, we conducted a second experiment with a new SAT manipulation.


Experiment 2

Method

Participants. The study was approved by the ethics committee of the Psychology Department, University of Amsterdam. A total of 35 students participated in the experiment (27 women, mean age 20.54 [SD 2.5], 33 right handed). All participants were students from the Department of Psychology and were compensated with research credits. Participants gave written informed consent prior to beginning the experiment.

Reinforcement Learning Task. The reinforcement learning task was the same as in Experiment 1. However, there were two alterations intended to increase speed stress. First, too-slow responses, namely choices made after the deadline, were penalized with -100 points. The penalty was introduced so that participants would prioritize being on time as well as collecting many points. Second, the deadline in the speed condition was now sampled based on the 65th percentile (rather than the 80th percentile) of the participant's response time distribution in the calibration block.

Cognitive Modelling. Since in Experiment 1 we observed some behavioral differences due to the SAT manipulation that were not reflected in the model comparison, in Experiment 2 we fitted both the standard RL model described above and a combined Reinforcement Learning Diffusion Decision Model (RLDDM). The Diffusion Decision Model (Ratcliff, 1978) assumes that the decision maker gradually accumulates noisy evidence until a threshold is reached and a decision is made. The DDM typically includes four main parameters: the drift rate v, which represents the rate of evidence accumulation; the threshold (b) of the choice; the non-decision time (t0), which captures the time needed for non-decision processes such as the perceptual encoding of the stimulus; and the starting point (z). Here we use three parameters of the DDM: the drift rate v, the threshold b, and the non-decision time t0. In the combined RLDDM we assume that the drift rate v is a linear function of the difference in Q-values of the two choices, scaled by a factor m:


v_t = m\,(Q_{(s,a_1),t} - Q_{(s,a_2),t})     (5)

We fitted four models of the combined RLDDM, allowing the threshold (b), the learning rate (α), both, or none of them to vary between speed and confidence. Model comparison was again performed with the Bayesian Information Criterion.
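To illustrate how Eq. 5 links learning to the decision process, the R sketch below simulates a single RLDDM trial with a simple Euler scheme. This is our own illustrative implementation (unbiased starting point at b/2, within-trial noise of 1), not the model code used for fitting.

```r
# Sketch of simulating a single RLDDM trial (assumed implementation, not the authors' code).
# The drift rate follows Eq. 5: v = m * (Q1 - Q2); evidence starts at b/2 and accumulates
# towards 0 or b with Gaussian within-trial noise, using Euler steps of size dt.
simulate_rlddm_trial <- function(q1, q2, m, b, t0, dt = 0.001, s = 1) {
  v <- m * (q1 - q2)                    # drift rate from the Q-value difference (Eq. 5)
  x <- b / 2                            # unbiased starting point midway between boundaries
  t <- 0
  while (x > 0 && x < b) {
    x <- x + v * dt + s * sqrt(dt) * rnorm(1)   # noisy evidence accumulation
    t <- t + dt
  }
  list(choice = ifelse(x >= b, 1, 2),   # upper boundary = option 1, lower boundary = option 2
       rt = t0 + t)                     # response time = decision time plus non-decision time
}
```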

Results & Discussion

Results. Out of the 35 participants, we excluded six based on poor accuracy in the SAT blocks; the results therefore concern the remaining 29 participants. Mean accuracy of the excluded participants was 0.53 [SD = 0.05]. Non-response trials (28 observations, 0.12% of all trials) as well as responses faster than 100 ms (97 observations, 0.47% of all trials) were removed from further analysis. Twenty stimulus sets from the SAT blocks, spread over various participants, were removed because accuracy was below 0.55 (in no case were more than two stimulus sets removed within the same type of SAT block for a participant). Mean accuracy of the excluded stimulus sets was 0.41 [SD = 0.12]. These accounted for 1,540 observations, 6.54% of all trials. In Table 3 we present the descriptive statistics for Experiment 2. The mean deadline in the speed condition was 0.77 s [SD = 0.14].

Table 3

Descriptive Statistics of Experiment 2 (mean (SD); RTs in seconds)

                        Trialwise                  Miniblocks                 Total
           Calibration  Speed        Confidence    Speed        Confidence    Confidence   Speed
Accuracy   0.75 (0.12)  0.79 (0.10)  0.82 (0.07)   0.77 (0.08)  0.80 (0.09)   0.81 (0.06)  0.78 (0.07)
RTs        0.66 (0.08)  0.47 (0.08)  0.53 (0.07)   0.49 (0.08)  0.58 (0.08)   0.55 (0.07)  0.48 (0.08)

Differences in accuracy between confidence and speed were statistically significant in Miniblocks, t(28) = 3.02, p = .005, as well as in Trialwise, t(28) = 5.03, p < .001. The differences in response times were also significant both between the two Miniblocks conditions, t(28) = 5.22, p < .001, and between the two Trialwise conditions, t(28) = 5.41, p < .001. There was also a statistically significant difference in overall accuracy between confidence and speed, t(28) = 5.85, p < .001, and similarly for response times, t(28) = 5.6, p < .001. In Fig. 5 we present the mean accuracy (panel a) and response times (panel b), as well as the RT distributions of the calibration block (panel c) and the SAT blocks (panel d). A two-way repeated measures ANOVA revealed a statistically significant interaction of cue and block type on RTs, F(1,28) = 10.73, p = .002. Additionally, there was a main effect of cue on accuracy, F(1,28) = 33.84, p < .001.

We fitted the same seven models per participant as described for Experiment 1 and performed model comparison. As shown in Table 4, the winning model was again Model 1, in which neither the learning rate α nor the parameter β varies.

Table 4

Model Comparison for Experiment 2

Model  Parameters                                          k  Mean BIC  Mean BIC weight  n best  n Top3
1      α, β                                                2  516.037   0.393            14      24
2      Different β in SAT                                  3  519.178   0.164            5       14
3      Different α in SAT                                  3  517.983   0.175            4       19
4      Both α and β varying in SAT                         4  521.020   0.052            1       9
5      β varying in both SAT blocks and conditions         5  523.487   0.075            3       8
6      α varying in both SAT blocks and conditions         5  521.950   0.122            2       10
7      α and β varying in both SAT blocks and conditions   8  531.316   0.019            0       3



Figure 5. Plots of descriptive statistics in Experiment 2. Panels 5a and 5b present the mean accuracy and response times, respectively, per condition: Confidence Miniblocks, Speed Miniblocks, Confidence Trialwise, Speed Trialwise. Error bars indicate the SEM. Panel 5c shows the mean density RT distribution of the calibration block. The dashed line represents the 65th percentile of the distribution, based on which the individual deadlines for the speed condition were generated. Panel 5d shows the density RT distribution of the confidence condition (dashed line) and the speed condition (solid line), and the mean speed deadline at 0.77 s.



Figure 6. Model fits for Experiment 2. Both panels present model fits for the SAT blocks. In panel 6a, the black solid line is the observed data and the dashed line the simulated data. Panel 6b shows the same results but separated into confidence trials (observed data blue solid line, simulated data blue dashed line) and speed trials (observed data red solid line, simulated data red dashed line).

Based on the other ranking criteria, namely the BIC weights and the number of participants for which each model was among the top three, Model 2 (mean BICw = 0.164, n Top3 = 14), Model 3 (mean BICw = 0.175, n Top3 = 19) and Model 6 (mean BICw = 0.122, n Top3 = 10) were also explored. In Model 2, the difference in β between speed and confidence was statistically significant, t(28) = -3.66, p = .001; mean β was 15.89 [SD = 14.81] in the confidence condition and 12.82 [SD = 13.32] in the speed condition. The differences in learning rates between conditions (Models 3 and 6) were not statistically significant: Model 3, t(28) = -0.308, p = .761; Model 6, Miniblocks, t(28) = 0.095, p = .925, and Trialwise, t(28) = 0.118, p = .907.

To investigate the second hypothesis raised by Experiment 1, namely that the Q-learning model with Softmax as the choice rule is not sufficient to capture the behavioral changes under speed stress, we fitted a combined RLDDM. The models were fitted only to the data of the Miniblocks conditions, since the effect in this block type was larger than in Trialwise. The results of the model comparison are presented in Table 5. Interestingly, the winning model is RLDDM 1, with both the lowest mean BIC value and the highest number of participants for which it was the preferred model. This model allows the threshold to vary between the speed and confidence conditions. The mean threshold b was 1.47 [SD = 0.22] in the confidence condition and 1.21 [SD = 0.24] in the speed condition; the difference in thresholds between conditions in RLDDM 1 was statistically significant, t(28) = -8.05, p < .001. Model fits are presented in Fig. 7. As shown in the graphs, there is an important misfit in the RT distribution fits, with an overestimation of the skewness.

Table 5

Model Comparison between RLDDM models

Model    Parameters         Interpretation                                            k  Mean BIC  Mean BIC weight  n best
RLDDM 1  b1, b2, v, α, t0   Threshold varying per condition                           5  29.002    0.421            12
RLDDM 2  α1, α2, b, v, t0   Learning rate varying per condition                       5  40.198    0.169            5
RLDDM 3  α, b, v, t0        No parameter varying                                      4  40.049    0.178            6
RLDDM 4  α1, α2, ...        Both learning rate and threshold varying per condition



Figure 7 . Model evaluations of combined RLDDM 1 fitted to the Miniblocks trials. Panel 7a shows mean RTs for correct (red lines) and incorrect responses (black lines). Observed data are plotted in dotted lines and simulated data in solid lines. Panel 7b shows the SAT trials binned in 5 blocks in the x-axis and mean RTs in y-axis. Panel 7c shows mean accuracy in the SAT trials binned in 5 blocks (simulated data are plotted with a solid line). In all plots we observe a misfit in skewness.


General Discussion

In the current study we investigated a classic phenomenon of the decision-making literature, the speed-accuracy tradeoff. The main question was whether its effects are present in a learning task and, more specifically, whether reaction times and accuracy, which have repeatedly been reported to be affected by an SAT manipulation, would be informative in a learning paradigm. Reaction times in particular are a component of the decision process that is neglected by learning models.

In Experiment 1, participants completed a learning task in which they were cued either for fast or for confident responses. Behavioral results revealed that, especially in the Miniblocks type of SAT, both accuracy and RTs were indeed significantly lower under speed stress. Nevertheless, after fitting the Q-learning model as the update rule with Softmax as the choice rule, model comparison indicated a preference for the model in which neither the learning rate nor the sensitivity to reward varies between conditions.

This led us to Experiment 2, in which we aimed for a reduced deadline in the speed condition, given that participants in Experiment 1 had managed to speed up their responses under both speed and confidence stress. Apart from the Q-learning model with Softmax, we also fitted a combined RLDDM, in which the learning rate and the threshold were allowed to vary between conditions. The main question was whether the model comparison results between the different RLDDM models would be in accordance with the expected behavioral results. Indeed, the joint RLDDM seems to capture changes in threshold under an SAT manipulation.

Reaction times and accuracy were significantly affected under speed stress, but even with the new SAT manipulation the effect was not as large as in other studies (e.g., Forstmann et al., 2008). A possible reason for this is the nature of the task: since participants were compensated with research credits, they might have prioritized accuracy over any other instruction, because this would lead to higher rewards. In a typical decision-making task, such a reward is not present and differences in accuracy are more prominent. Comparison among models with a varying learning rate α and sensitivity to reward β again favored the model in which neither parameter differed between confidence and speed. However, Softmax leaves a large range of the data unexplained with respect to reaction times.

By fitting a combined RLDDM, we also aimed to investigate whether the SAT manipulation has an effect on reaction times as well as on accuracy. Indeed, model comparison revealed that the preferred RLDDM was the one that allowed the threshold b to vary between speed and confidence (with a constant learning rate α), a result consistent with the decision-making literature (Forstmann et al., 2008; Heitz, 2014). What is more, the quality of fit of the winning model showed that the RLDDM captures the changes in mean RT and accuracy over the course of the experiment. Thus, it seems that a combined model is able to account for both the SAT and the learning effect. However, it should be noted that there is a misfit between the data and the joint model, especially regarding the skewness; the RLDDM predicts more right-skewed distributions (skewness > 1.5) than the observed data (median skewness = 1.07 [IQR = 0.72]).
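For reference, the skewness comparison could be computed as in the R sketch below, using the third standardized moment of each RT distribution; the exact estimator used in the analysis is not specified in the text, so this is an assumption.

```r
# Sketch of a simple sample-skewness measure for an RT distribution.
# rts: vector of response times for one participant/condition (observed or model-simulated).
rt_skewness <- function(rts) {
  m <- mean(rts)
  mean((rts - m)^3) / sd(rts)^3  # third standardized moment
}
```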

This misspecification suggests that the model misses a crucial component of learning and the choice process. One potential reason is that non-decision time variability (st0) was not included as a free parameter in the model. Due to the short overall RTs, any variability in t0 has a relatively large influence on the final RT distributions; therefore, including st0 would potentially reduce the predicted skewness. Other ways this misfit could be addressed would be by investigating potential urgency effects in the task (Boehm et al., 2016; Hawkins et al., 2015), by fitting a race model (e.g., the LBA; Brown & Heathcote, 2005), or by including a memory component that is not captured in the current joint model.

Some limitations of the study need to be noted. First, although the intention of the reduced deadline in Experiment 2 was to increase speed stress, participants still produced very fast choices compared to the calibration block, under both speed and confidence stress. A suggestion would be a deadline sampled from a Gaussian distribution with its mean at the 65th percentile of the RT distribution in the calibration block, since the expected value of such a Gaussian is lower than that of the currently used shifted exponential distribution. Second, and interestingly, participants did not alter their behavior between the speed and confidence conditions, and made very fast choices regardless of the instruction. For future studies it is suggested that participants also be rewarded for being fast when under speed stress, so that fast responses are prioritized as rewarding as well. Third, the increased skewness of the distributions indicates possible urgency effects; a further investigation of how urgency affects learning could therefore be conducted.

Exploring the importance of reaction times in learning tasks seems very promising. Future studies could investigate how speed stress affects learning with a time-dependent reward, where participants lose points as time passes, and could focus on how the difference in reaction times between the SAT conditions could be made larger. A future study could also explore trial-by-trial effects, such as the post-prediction-error slowing effect, with a trial-by-trial variation of the threshold (Dutilh et al., 2012). Finally, it seems important to investigate how a joint RLDDM could be adapted to fit RT distributions better.

The current study complements previous research in combined modeling of decision making and reinforcement learning. The discussion of a joint model that has already started in the literature (Fontanesi et al., 2019; Pedersen et al., 2017; Sewell et al., 2019) is very promising for exploring latent processes in decision making and learning, and could result in a new, integrative theoretical framework in cognitive science.


References

Ardia, D., Mullen, K., Peterson, B., & Ulrich, J. (2007). DEoptim: Differential evolution optimization in R. R package version 1–3.

Boehm, U., Hawkins, G. E., Brown, S., van Rijn, H., & Wagenmakers, E.-J. (2016). Of monkeys and men: Impatience in perceptual decision-making. Psychonomic bulletin & review, 23 (3), 738–749.

Bogacz, R., Wagenmakers, E.-J., Forstmann, B. U., & Nieuwenhuis, S. (2010). The neural basis of the speed–accuracy tradeoff. Trends in neurosciences, 33 (1), 10–16.

Brown, S., & Heathcote, A. (2005). A ballistic model of choice response time. Psychological review, 112 (1), 117.

Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33 (2), 261–304.

Cohen, M. X., & Ranganath, C. (2007). Reinforcement learning signals predict future decisions. Journal of Neuroscience, 27 (2), 371–378.

Daw, N. D. (2014). Advanced reinforcement learning. In Neuroeconomics (pp. 299–320). Elsevier.

Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441 (7095), 876.

Daw, N. D., & Tobler, P. N. (2014). Value learning through reinforcement: the basics of dopamine and reinforcement learning. In Neuroeconomics (pp. 283–298). Elsevier.

Dayan, P., & Niv, Y. (2008). Reinforcement learning: the good, the bad and the ugly. Current opinion in neurobiology, 18 (2), 185–196.

Dayan, P., & Watkins, C. (1992). Q-learning. Machine learning, 8 (3), 279–292.

Dutilh, G., Vandekerckhove, J., Forstmann, B. U., Keuleers, E., Brysbaert, M., & Wagenmakers, E.-J. (2012). Testing theories of post-error slowing. Attention, Perception, & Psychophysics, 74 (2), 454–465.

Dutilh, G., Vandekerckhove, J., Tuerlinckx, F., & Wagenmakers, E.-J. (2009). A diffusion model decomposition of the practice effect. Psychonomic Bulletin & Review, 16 (6), 1026–1036.

Fontanesi, L., Gluth, S., Spektor, M. S., & Rieskamp, J. (2019, Mar 28). A reinforcement learning diffusion decision model for value-based decisions. Psychonomic Bulletin & Review. doi: 10.3758/s13423-018-1554-2

Forstmann, B. U., Anwander, A., Schäfer, A., Neumann, J., Brown, S., Wagenmakers, E.-J., . . . Turner, R. (2010). Cortico-striatal connections predict control over speed and accuracy in perceptual decision making. Proceedings of the National Academy of Sciences, 107 (36), 15916–15920.

Forstmann, B. U., Dutilh, G., Brown, S., Neumann, J., Von Cramon, D. Y., Ridderinkhof, K. R., & Wagenmakers, E.-J. (2008). Striatum and pre-sma facilitate decision-making under time pressure. Proceedings of the National Academy of Sciences, 105 (45), 17538–17542.

Forstmann, B. U., Ratcliff, R., & Wagenmakers, E.-J. (2016). Sequential sampling models in cognitive neuroscience: Advantages, applications, and extensions. Annual review of psychology, 67 , 641–666.

Frank, M. J., Gagne, C., Nyhus, E., Masters, S., Wiecki, T. V., Cavanagh, J. F., & Badre, D. (2015). fMRI and EEG predictors of dynamic decision parameters during human reinforcement learning. Journal of Neuroscience, 35 (2), 485–494.

Frank, M. J., Moustafa, A. A., Haughey, H. M., Curran, T., & Hutchison, K. E. (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences, 104 (41), 16311–16316.

Frank, M. J., Seeberger, L. C., & O'Reilly, R. C. (2004). By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306 (5703), 1940–1943.

Garrett, H. E. (1922). A study of the relation of accuracy to speed (Vol. 8). Columbia university.

Gershman, S. J., & Daw, N. D. (2017). Reinforcement learning and episodic memory in humans and animals: an integrative framework. Annual review of psychology, 68 , 101–128.


Glimcher, P. W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108 (Supplement 3), 15647–15654.

Hawkins, G. E., Forstmann, B. U., Wagenmakers, E.-J., Ratcliff, R., & Brown, S. D. (2015). Revisiting the evidence for collapsing boundaries and urgency signals in perceptual decision-making. Journal of Neuroscience, 35 (6), 2476–2484.

Heitz, R. P. (2014). The speed-accuracy tradeoff: history, physiology, methodology, and behavior. Frontiers in neuroscience, 8 , 150.

Heitz, R. P., & Schall, J. D. (2012). Neural mechanisms of speed-accuracy tradeoff. Neuron, 76 (3), 616–628.

Holroyd, C. B., & Coles, M. G. (2002). The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological review, 109 (4), 679.

Liu, C. C., & Watanabe, T. (2012). Accounting for speed–accuracy tradeoff in perceptual learning. Vision research, 61 , 107–114.

Logan, G. D. (1992). Shapes of reaction-time distributions and shapes of learning curves: A test of the instance theory of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18 (5), 883.

Matsumoto, M., & Hikosaka, O. (2009). Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature, 459 (7248), 837.

Miletić, S., & van Maanen, L. (2019). Caution in decision-making under time pressure is mediated by timing ability. Cognitive psychology, 110, 16–29.

Mulder, M., Van Maanen, L., & Forstmann, B. (2014). Perceptual decision neurosciences – a model-based review. Neuroscience, 277, 872–884.

Mulder, M. J., Wagenmakers, E.-J., Ratcliff, R., Boekel, W., & Forstmann, B. U. (2012). Bias in the brain: a diffusion model analysis of prior probability and potential payoff. Journal of Neuroscience, 32 (7), 2335–2343.

Pedersen, M. L., Frank, M. J., & Biele, G. (2017). The drift diffusion model as the choice rule in reinforcement learning. Psychonomic bulletin & review, 24 (4), 1234–1251.


Petrov, A. A., Van Horn, N. M., & Ratcliff, R. (2011). Dissociable perceptual-learning mechanisms revealed by diffusion-model analysis. Psychonomic bulletin & review, 18 (3), 490–497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological review, 85 (2), 59.

Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: theory and data for two-choice decision tasks. Neural computation, 20 (4), 873–922.

Ratcliff, R., Smith, P. L., Brown, S. D., & McKoon, G. (2016). Diffusion decision model: Current issues and history. Trends in cognitive sciences, 20 (4), 260–281.

Ratcliff, R., Thapar, A., & McKoon, G. (2006). Aging, practice, and perceptual tasks: A diffusion model analysis. Psychology and aging, 21 (2), 353.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations on the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (p. 64-99). New York: Appleton-Century-Crofts.

Schönberg, T., Daw, N. D., Joel, D., & O'Doherty, J. P. (2007). Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience, 27 (47), 12860–12867.

Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6 (2), 461–464.

Sewell, D. K., Jach, H. K., Boag, R. J., & Van Heer, C. A. (2019). Combining error-driven models of associative learning with evidence accumulation models of decision-making. Psychonomic bulletin & review, 1–26.

Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine learning, 22 (1-3), 123–158.

Storn, R., & Price, K. (1997). Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11 (4), 341–359.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.


Thorndike, E. L. (1927). The law of effect. The American journal of psychology, 39 (1/4), 212–222.

van Maanen, L., Katsimpokis, D., & van Campen, A. D. (2018). Fast and slow errors: Logistic regression to identify patterns in accuracy–response time relationships. Behavior research methods, 1–12.

Wagenmakers, E.-J., & Farrell, S. (2004). Aic model selection using akaike weights. Psychonomic bulletin & review, 11 (1), 192–196.

Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4), 279–292.

Winkel, J., Keuken, M. C., van Maanen, L., Wagenmakers, E.-J., & Forstmann, B. U. (2014). Early evidence affects later decisions: Why evidence accumulation is required to explain response time data. Psychonomic bulletin & review, 21 (3), 777–784.
