
Radboud University Nijmegen

Neural Network Models of Reversal Learning in Nonhuman Primates

Author:

Xiaoxuan Lei

Supervisors:

Prof. Dr. Paul Tiesinga

Prof. Dr. Marcel van Gerven

Second reader:

Dr. Thilo Womelsdorf


Contents

1 Introduction

2 Methods
2.1 Behavioral task
2.2 Deep reinforcement learning model
2.3 A simple reinforcement learning model

3 Results
3.1 Model performance
3.1.1 RL training phase
3.1.2 Meta-learning phase
3.1.3 RL model fitting performance
3.2 Network representation
3.3 Transfer learning
3.4 Feature by feature learning strategy
3.5 The effect of network size
3.6 Meta-learning in the RL context

4 Discussion

5 Conclusion

References


Abstract

Primate decision making involves the activity of multiple brain regions, whose distinct roles can be dissected using appropriate reversal-learning tasks in combination with electrophysiological recordings in the regions of interest. In a reversal-learning task, the subject has to pick from among two presented objects the one that contains the current target feature, which is switched at a random trial to a new feature. We constructed a recurrent network based on Long Short-Term Memory (LSTM) units, trained it with reinforcement learning algorithms to perform several variations of the reversal-learning task, including both probabilistic and deterministic reward schedules, and evaluated the model's choices and the emerging stimulus/target representation.

The models produce meta-learning curves that are similar to those obtained in animal experiments, with the time to reach criterion performance increasing with the number of feature dimensions present in the objects and with task difficulty. We further found that the training of the network proceeded via a sequence of rapid increases in performance, each increase reflecting the learning of a new feature dimension as a possible target, followed by plateaus with slow changes in performance. The model's learning performance within a block was reflected in the current-target discriminability of the neural population activity, which, when processed by dimension-reduction procedures followed by fitting support vector machine (SVM) classifiers to decode the target, had a similar time course.

The model's meta-learning behavior could be fitted with Rescorla-Wagner (RW) type models that weigh positive prediction errors (η+) differently from negative ones (η−). For tasks that had a target reversal, η+ decreased and η− increased compared to tasks without a reversal. This explains the resulting model's higher sensitivity to unrewarded trials and slower learning of the current target. The probabilistic reward version of the task was fit with even lower values of these weights compared with the task with deterministic rewards, consistent with expectation.

We evaluated to what extent the models generalized by testing them on tasks on which they were not trained. When the model was trained on tasks with either probabilistic rewards or with a reversal, it performed well on a task with deterministic rewards or without a reversal, respectively, but for the reverse case, when the training task was exchanged with the test task, this was not the case. We also tested the robustness of the model by either providing or omitting multiple rewards in a row irrespective of the choice. The model trained on deterministic rewards and no reversal was robust against this, as it ceased learning and would persevere in choosing the same target; the other models were less robust.


Taken together, we developed a model that can learn a reversal-learning task with probabilistic rewards and that behaves similarly to human and non-human primates in experiments, as it can be described by an RW model. It makes a prediction for the population activity during learning that can be compared to neural activity recorded in subjects performing this task, which is invaluable in guiding future theoretical and experimental studies.


1 Introduction

During reversal learning, one needs to adjust the action policy to obtain the maximally possible rewards. The switch from "exploitation" to "exploration" to find the new rule is called "rule acquisition" ([Oemisch et al., 2019]). Previous studies have shown that neural activity in different brain regions is altered and correlated throughout the rule-learning process ([Oemisch et al., 2019]). However, how and to what extent this cognitive flexibility is implemented and employed in the primate brain remains largely unknown ([Miller, 2000]).

Several brain regions have been proposed to play central roles in such adaptive value-based decision-making tasks, including the dorsolateral prefrontal cortex (dlPFC), orbitofrontal cortex (OFC), anterior cingulate cortex (ACC) together with surrounding areas, and the medial prefrontal cortex (mPFC) ([Alexander and Brown, 2010], [Deco et al., 2013], [Heilbronner and Hayden, 2016]). How to characterize each region's functional specialization and their interactions is of great interest. Experimental studies that probe a wide range of spatial and temporal resolutions, from single-unit electrophysiological recordings to functional magnetic resonance imaging (fMRI) and lesion studies, have revealed a variety of effects that cannot be easily reconciled under a single framework ([Behrens et al., 2007], [Alexander and Brown, 2010]).

mPFC has been shown to be involved in tracking and holding the representation of many task-related variables, such as conflict, error likelihood, volatility, and reward ([Silvetti et al., 2011]). Hayden and Platt ([Hayden and Platt, 2010]) discovered that a majority of dACC neurons were sensitive to both action (i.e., saccade direction in their experimental context) and reward. After analyzing those factors' effects on the dACC signal magnitude, they claimed that dACC neurons multiplex information about reward and action. Costa et al. and Jang et al. ([Costa et al., 2015], [Jang et al., 2015]) found, by performing a Bayesian analysis on a two-arm bandit reversal learning task, that the amygdala and mPFC play an important role in establishing an initial belief representation of the environment's volatility, i.e., whether the rewarding paradigm is stable. Minxha et al. ([Minxha et al., 2020]) investigated the functional interaction between mPFC and the hippocampus and amygdala (HA) and suggested that mPFC actively engages in the dynamic memory retrieval process during a task.

Along with experimental studies, many models have been proposed to provide an appropriate interpretation of these results. In the response selection model (RSM) of ACC ([Heilbronner and Hayden, 2016]), ACC is treated as an adaptive calculator of conflict, as opposed to the static view of conflict theory ([Van Veen et al., 2004]), that decides the distribution of control among several motor controllers.


Another model of interest was proposed by Shahnazian and Holroyd ([Shahnazian and Holroyd, 2018]), who built a recurrent neural network model to test the hypothesis that ACC is involved in the execution of extended, goal-directed action sequences.

Recent modeling work attempts to offer a universal framework that can account for as many effects as possible. One perspective is to emphasize mPFC's representation of error and action prediction. Silvetti and Verguts ([Silvetti et al., 2011]) proposed a reward value and prediction model (RVPM) that encodes both cues and errors. Alexander and Brown ([Alexander and Brown, 2010]) proposed treating mPFC as composed of two interacting components: one learns to predict the distribution of action outcomes and the other detects the response error, which in turn is used for the prediction update. They implemented a simple model (the Predicted Response-Outcome model, PRO, [Alexander and Brown, 2011]) based on a generalized reinforcement learning algorithm that attempts to learn the response-outcome mapping instead of the usual stimulus-response mapping. This model accounts for an unprecedented range of experimental observations of mPFC. Further, they proposed the Hierarchical Error Representation model (HER, [Alexander and Brown, 2015]) to describe the interaction between ACC and dlPFC by adopting the idea of hierarchical reinforcement learning (HRL, [Botvinick et al., 2009]), which solves complex tasks by divide and conquer. Multidimensional error signals and outcome likelihoods generated by ACC and expected errors maintained in dlPFC interact to reciprocally update their estimates. HER showed human-level performance in structured tasks and even beat many deep learning (DL) models.

Another interesting idea is to explain ACC's function from the effortful control perspective ([Vassena et al., 2017]). Heilbronner and Hayden ([Heilbronner and Hayden, 2016]) summarized the advantages and disadvantages of three major views of dACC (as a monitor, a controller, or an economic structure) and suggested treating dACC as a storage buffer that uses context information to guide behavior, constituting a core part of the input-output transformation process. Shenhav et al. ([Shenhav et al., 2013]) unified a variety of functions into a single underlying process, the allocation of control, and presented a normative model to evaluate the expected value of control (EVC). This model combines three critical factors: the expected payoff from a controlled process, the amount of control that must be invested to achieve that payoff, and the cost in terms of cognitive effort.

Studies and reviews of OFC also propose a variety of roles, although it has usually been treated as an area that inhibits previous responses that are no longer appropriate. Deficits in reversal learning were observed after manipulations of dopamine and lesions of the OFC ([Jang et al., 2015]). This indicates that OFC plays an active role during the adaptive learning process. From the aforementioned response-inhibitor view, reversal learning can be studied as the process of learning to inhibit previously rewarded actions. Rudebeck and Murray ([Rudebeck and Murray, 2014]) argue that OFC can also provide predictions about specific outcomes associated with stimuli and actions based on current internal states. Murray and Rudebeck ([Murray and Rudebeck, 2018]) proposed that OFC and the ventrolateral prefrontal cortex (vlPFC) are critical in mediating the credit assignment process, which is one of the two important components that contribute to primate decision making. Niv ([Niv, 2019]) stated that OFC represents task states and deploys them for decision making and learning elsewhere in the brain.

Building canonical models that reconcile the distinct functions revealed in experimental studies and reproduce primates' behavior and neural responses is the ultimate goal of modeling studies. Three important criteria can be used to compare the wide range of models: prediction of behavior, interpretability in terms of cognitive variables (values, reward prediction errors), and the prediction accuracy of the underlying neural activity. Deep reinforcement learning (DRL) models have undergone rapid development in the past decade and provide a new approach to address these three issues ([Botvinick et al., 2019], [Hassabis et al., 2017], [Kriegeskorte, 2015], [Yamins and DiCarlo, 2016]), especially from the viewpoint of generating brain-like neural responses. In this study, we model several variations of a reversal learning task with recurrent neural networks, analyze the model behavior to gain insight into the learning process, and investigate the representation of the rewarded target and the presented objects at the population level.

2 Methods

2.1 Behavioral task

We modified a meta-learning model developed by [Wang et al., 2018] to model a variety of reversal-learning tasks in order to characterize the neural networks' representational capacity and adaptation during learning. On each trial, the agent is presented with two objects. Each object has multiple (n_fd = 1, 2, 3) feature dimensions and each dimension has two value options, represented as 1 or 0. Only one value of a selected dimension is associated with the highest probability of obtaining rewards, and only one of the objects possesses that value. The agent needs to identify the correct feature dimension and the appropriate feature dimension value, i.e., the target, and choose the corresponding object, which in experiments is often read out as a button press or a saccade ([Oemisch et al., 2019]). Within a block, the target generally remains the same. However, for reversal-learning tasks, there is a switch in target at a random trial somewhere around the middle of the block. The tasks in which this does not occur are referred to as non-reversal. We keep the length of the block equal to 50 trials. The reversal can only occur beyond the 20th trial, to allow for reaching the performance criterion for the original target even in the more difficult tasks. The goal for the agent is to gain as many rewards as possible within a block.

The above task is adapted from the monkey behavior task in [Oemisch et al., 2019], where they used color, location, and motion as the three dimensions, each having two feature values, while only the color dimension was associated with rewards (Figure 1b). We abstracted the task setting and treated all features equivalently, that is to say, all features can be linked to rewards.

We also used a probabilistic version of the task. For the deterministic task, the choice of the correct object is always rewarded and the choice of the incorrect object, the one not containing the target feature, is never rewarded. For the probabilistic task, the choice of the correct object leads to a reward with probability p, whereas choosing the incorrect object is rewarded on a fraction 1 − p of the trials. Hence, the probabilistic task requires the agent not only to identify the correct target in each block based on the choice-reward outcomes of preceding trials, but also to hold its belief while facing consecutive unrewarded trials paired with correct actions. The lower p is, the more difficult the task. Pilot studies indicate that the closer p is to one half, the more difficult the task and the longer it takes to identify the correct target. We fixed p to 0.8, as this struck the appropriate balance between being challenging and reaching criterion performance within blocks of 50 trials.
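As an illustration, the following is a minimal sketch in Python of a block generator consistent with the description above, assuming a block length of 50 trials and a reversal window starting at trial 20; the function and variable names are ours and are not taken from the actual implementation.

import numpy as np

def make_block(n_fd=3, reversal=True, probabilistic=True, p=0.8,
               n_trials=50, reversal_window=(20, 30), rng=None):
    # Generate one block: a target (feature dimension, value), an optional
    # reversal trial, and a pair of complementary objects per trial.
    rng = np.random.default_rng() if rng is None else rng
    target = (rng.integers(n_fd), rng.integers(2))
    switch_trial = rng.integers(*reversal_window) if reversal else n_trials
    trials = []
    for t in range(n_trials):
        if t == switch_trial:
            while True:  # draw a new target different from the current one
                new = (rng.integers(n_fd), rng.integers(2))
                if new != target:
                    target = new
                    break
        obj_a = rng.integers(2, size=n_fd)   # first object
        obj_b = 1 - obj_a                    # complementary second object
        correct = 0 if obj_a[target[0]] == target[1] else 1
        trials.append((obj_a, obj_b, correct))
    def reward(choice, correct):
        # deterministic: reward iff correct; probabilistic: reward with
        # probability p if correct and 1 - p otherwise
        if not probabilistic:
            return int(choice == correct)
        return int(rng.random() < (p if choice == correct else 1 - p))
    return trials, reward, switch_trial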

2.2 Deep reinforcement learning model

The neural network (NN) is composed of an input layer, a recurrent layer of long short-term memory (LSTM) cells, and an output linear layer ([Wang et al., 2018], see Figure 2a). The current observation (o_t), the action and reward from the previous trial (a_{t−1}, r_{t−1}), and the trial index (t) are fed into the NN. The input layer is connected all-to-all with the fully connected LSTM units, which maintain past information by incorporating multiple gating functions (forget, input, and output gates, each a weighted sum passed through a sigmoid nonlinearity, Figure 2b). The output layer is either passed through a softmax function to select the action with the highest probability (a_t), or through a linear unit estimating the state value (V_t(s_t; θ_v), where θ_v is the current parameter set of the model).

The NN's weights are updated by the Asynchronous Advantage Actor-Critic (A3C, [Wang et al., 2016]) algorithm, a model-free reinforcement learning (RL) strategy that uses asynchronous multi-thread updates for network optimization. For our application the asynchronous update was not necessary, so we used only a single thread (the Advantage Actor-Critic, A2C, [Wang et al., 2016]). The objective function is the gradient of a weighted sum of the policy loss, the state-value estimation loss, and an entropy regularization term. The advantage at trial t, δ_t(s_t; θ_v), is defined as the temporal difference of the discounted (discounting factor γ) n-step return. We used the gradient descent (GD) algorithm with learning rate α to train the network. The gradient is the sum of the gradients of the individual objective function components, defined as follows:

∇L = ∇L_π + ∇L_v + ∇L_ent    (1)

∇L_π = (∂ log π(a_t | s_t; θ) / ∂θ) · δ_t(s_t; θ_v)    (2)

∇L_v = β_v δ_t (∂V / ∂θ_v)    (3)

∇L_ent = β_e ∂H(π(a_t | s_t; θ)) / ∂θ    (4)

δ_t(s_t; θ_v) = R_t − V(s_t; θ_v)    (5)

R_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v)    (6)

Here π is the policy with parameters θ, R_t is the discounted future outcome on trial t, β_v and β_e are the relative weights in the objective function of the value and entropy terms, respectively, and k is the number of trials to the end point of the block, bounded from above by the maximum trial number within a block. We implemented the above NN model with Tensorflow 1.13.1 and Python 3.6.10.
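To make Eqs. (1)-(6) concrete, below is a framework-agnostic sketch (written with NumPy for readability) of how the n-step return, advantage, and scalar loss terms could be computed for one segment of trials; in the actual TensorFlow implementation the gradients of these scalars would be taken by automatic differentiation. The squared-error form of the value term and the particular values of γ, β_v and β_e below are assumptions, not taken from the thesis.

import numpy as np

def a2c_loss(log_probs, action_probs, values, rewards, bootstrap_value,
             gamma=0.2, beta_v=0.05, beta_e=0.05):
    # log_probs: log pi(a_t|s_t) of the chosen actions, shape (T,)
    # action_probs: full action distributions pi(.|s_t), shape (T, n_actions)
    # values: V(s_t; theta_v), shape (T,); rewards: r_t, shape (T,)
    # bootstrap_value: V(s_{t+k}; theta_v) at the end of the segment
    log_probs, values = np.asarray(log_probs), np.asarray(values)
    rewards, action_probs = np.asarray(rewards), np.asarray(action_probs)
    T = len(rewards)
    returns = np.zeros(T)
    R = bootstrap_value
    for t in reversed(range(T)):            # Eq. 6: R_t = r_t + gamma * R_{t+1}
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - values           # Eq. 5: delta_t = R_t - V(s_t)
    policy_loss = -np.sum(log_probs * advantages)   # policy term (Eq. 2)
    value_loss = beta_v * np.sum(advantages ** 2)   # value term, squared-error form (Eq. 3)
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-12))
    return policy_loss + value_loss - beta_e * entropy  # entropy bonus (Eq. 4)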

During training, the network learns to perform a reinforcement learning algorithm on the stimulus-action-reward sequence in a block: the agent learns to recognize the rewarding targets and updates its hidden state encoded in the LSTM units. We refer to this reinforcement learning with fixed synaptic weights as meta-learning, in contrast with the training phase during which the network weights change. This model incorporates the hypothesis that the reward prediction error (RPE) encoded in the activity of dopamine (DA) neurons, driving synaptic learning, and prefrontal cortex (PFC) based learning, rooted in neural activity representations, combine to form a dual RL system in primates ([Wang et al., 2018]).

We abstracted the concrete feature descriptions into equivalent numerical values. Three encoding methods, binary encoding (B), redundant binary encoding (RB), and one-hot encoding (O), were adopted, as depicted in Figure 2. We classified model-relevant parameters into three categories: network size, RL parameters (Table 1), and task settings (Table 2). Four types of tasks were modeled: deterministic non-reversal (DNR), deterministic reversal (DR), probabilistic non-reversal (PNR) and probabilistic reversal (PR). Intuitively speaking, we expect the task difficulty to increase with the number of features involved and with the introduction of reversals and probabilistic rewards.

Network size    Learning rate (α)    Discounting factor (γ)
4               10^−2                0
8               10^−3                0.2
16              10^−4                0.4
48                                   0.6
128                                  0.8
256                                  1.0

Table 1: Model parameter settings. The table lists all the NN-related parameter choices that have been explored and evaluated in this study; each column represents independent choices. After a grid search for the optimal parameters on an example task (DR), we chose the learning rate α = 10^−3 for all models. We also found that the optimal γ for the discounting of future rewards is much smaller than 1; γ = 0.2 led to the fastest convergence and more stable behavior in pilot runs.

is reversal           is probabilistic                                  n features    Encoding
Non-reversal (NR)     Non-probabilistic (NP)                            1             Binary (b)
Reversal (R)          Probabilistic with reward probability 0.8 (P)     2             Redundant binary (rb)
                                                                        3             One-hot (o)

Table 2: Task settings. The table lists all the choices that can be combined to define a task. In addition, we also designed several off-policy tasks to test model performance (see Section xx). The naming convention is the combination of the condition abbreviations. For example, the probabilistic non-reversal model with redundant binary encoding, 3 features, and a network size of 48 is denoted as PNR-rb-f3-lstm48.

2.3 A simple reinforcement learning model

To understand the working mechanism of the NN model, we built a simple RL model and tried to characterize the network's behavior in terms of the RL model's parameters. The RL model is trained using the behavioral data we obtained from the test datasets of the NN model.

Each feature (target) is associated with a value which is used to guide the action choice. The model should be able to update the feature values on each trial until the value of the target feature is maximal.


Hence, we assigned each feature an initial value, which is updated on each trial t by the received reward r_t:

v_{i,t+1} = v_{i,t} + η+ max(r_t − v_{i,t}, 0) + η− min(r_t − v_{i,t}, 0)    (7)

if feature i was in the chosen object, with η+ and η− the parameters that quantify the relative importance of rewarded and unrewarded trials, respectively, and r_t the reward sequence obtained from the NN model. Meanwhile, we also have

v_{i,t+1} = α v_{i,t}    (8)

if it was not in the chosen object. Here, α is the decay parameter.

The choice of action j is then made using the softmax function, which yields the probability of each possible action:

p_{j,t} = exp(β Σ_{i∈O_j} v_{i,t}) / Σ_k exp(β Σ_{i∈O_k} v_{i,t})    (9)

where the sum over k iterates over all presented objects and β is the softmax parameter for action selection. The larger β is, the more weight the feature values carry in determining the choice.

We can describe the choices in a block with these equations, starting from a particular set of initial values for v_i. At t = 0 there is no information about the target; hence, we set v_{i,0} = 1/n as the initial condition for our simulation.
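A minimal sketch of how Eqs. (7)-(9) could be implemented is given below; the parameter values shown are placeholders rather than fitted values, and objects are represented as lists of the feature (target) indices they contain, which is our own convention.

import numpy as np

def rl_step(v, chosen_features, other_features, reward,
            eta_plus=0.3, eta_minus=0.3, alpha=0.95):
    # Eq. 7: update the values of the features contained in the chosen object;
    # Eq. 8: decay the values of the features that were not chosen.
    v = v.copy()
    for i in chosen_features:
        pe = reward - v[i]
        v[i] += eta_plus * max(pe, 0.0) + eta_minus * min(pe, 0.0)
    for i in other_features:
        v[i] *= alpha
    return v

def choice_probs(v, objects, beta=5.0):
    # Eq. 9: softmax over the summed feature values of each presented object.
    scores = np.array([beta * sum(v[i] for i in obj) for obj in objects])
    scores -= scores.max()                  # numerical stability
    e = np.exp(scores)
    return e / e.sum()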

We use the stimulus-choice-reward sequence acquired from our NN model to find the optimal parameter set θ = {η+, η−, β, α} that minimizes the negative log-likelihood of all choice probabilities (NLP):

L(θ) = − Σ_t log p_{c_t,t}    (10)

where c_t is the actual choice made on trial t. We used MATLAB's fminsearch function to find the optimal parameter set.
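As an illustration, a Python analogue of this fitting step is sketched below, reusing the rl_step and choice_probs helpers from the sketch above and replacing MATLAB's fminsearch with SciPy's Nelder-Mead minimizer; the data layout (one tuple of trials, choices and rewards per block, with each trial listing the feature indices of the two objects) is an assumption on our part.

import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(params, blocks, n_targets):
    # Eq. 10: summed -log probability of the NN model's actual choices.
    eta_plus, eta_minus, beta, alpha = params
    nlp = 0.0
    for trials, choices, rewards in blocks:
        v = np.full(n_targets, 1.0 / n_targets)     # v_{i,0} = 1/n
        for (feat_a, feat_b), c, r in zip(trials, choices, rewards):
            p = choice_probs(v, [feat_a, feat_b], beta)
            nlp -= np.log(p[c] + 1e-12)
            chosen, other = (feat_a, feat_b) if c == 0 else (feat_b, feat_a)
            v = rl_step(v, chosen, other, r, eta_plus, eta_minus, alpha)
    return nlp

# result = minimize(negative_log_likelihood, x0=[0.3, 0.3, 5.0, 0.95],
#                   args=(blocks, n_targets), method="Nelder-Mead")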


3 Results

3.1 Model performance

3.1.1 RL training phase

The model trained by the RL algorithm learns to perform reinforcement learning. The learning of the synaptic weights during the training phase is referred to as learning (selected training results: Figure S1), whereas learning what the target is during a block is referred to as meta-learning (Figure 4).

First, we discuss the learning, during which the recurrent network weights as well as the weights for the policy and value functions are adjusted in order to optimize the objective function. The expectation is that the more difficult a problem is, the longer it takes to converge to a good solution, i.e., to find an optimal weight configuration. We therefore determined the number of iterations needed to converge to a performance that exceeded 90% of the saturated performance, which usually corresponded to the maximal achievable performance while accounting for an initial meta-learning phase during which performance was not optimal. The model for deterministic tasks should reach 100% performance after learning what the target is, hence the block average would be around 95%. For a probabilistic model, with probability p of getting a reward for a correct choice, the performance of the model cannot exceed p × 100%.
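A minimal sketch of this convergence criterion is given below, assuming that the saturated performance is estimated as the mean over a late-training window; the window length is our choice and is not specified in the text.

import numpy as np

def episodes_to_criterion(performance, frac=0.9, saturation_window=100):
    # performance: block-averaged reward rate per training episode
    performance = np.asarray(performance, dtype=float)
    saturated = performance[-saturation_window:].mean()   # late-training plateau
    above = np.flatnonzero(performance >= frac * saturated)
    return int(above[0]) if above.size else None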

The time to criterion depends on the initial weight configuration; we therefore repeated the simulations 10 times (Figure 3). The convergence, expressed in episodes, showed relatively large variation across repeats (mean = 4198.2, std = 464.8563 for D-NR-rb-f3-lstm48; mean = 5515.1, std = 1526.2418 for D-R-rb-f3-lstm48; each episode corresponds to 200 blocks and each block contains 50 trials). In some cases the model did not reach the expected maximum performance; these runs were not included in the average. When we analyzed the non-converged runs, the performance often saturated at intermediate values, corresponding to the model successfully learning only a fraction of the targets (see Section 3.4). A typical example is shown in Figure 3: the orange curves correspond to a D-NR model and demonstrate that the convergence is primarily determined by how long the network is stuck on an intermediate plateau, the duration of which showed appreciable variability. The same features are visible in the corresponding D-R model (green curves), which was also temporarily stuck at multiple intermediate plateaus.

As mentioned before, we considered three different stimulus representations as inputs to the model, four different tasks, network sizes ranging from 16 to 256, and 1 to 4 features (meaning 2 to 16 possible targets); we also performed additional training for certain task types with networks as small as 4 LSTM units and a larger number of features (up to 8). We discuss each of them in turn. In general, the more features, the longer the time to convergence (Figure 4). This effect was strongest for one-hot encoding and weakest for redundant binary encoding. The convergence slowed down across tasks, with D-NR being the fastest, followed by D-R and P-NR, and with P-R being the slowest. As more features correspond to a more difficult task, and probabilistic rewards and reversals also make the task more difficult, these results match our expectation.

Network size can also influence the convergence. First, there may be a minimum size necessary for the network to perform the task, which could also mean that for network sizes close to that minimum, the number of reachable weight configurations is smaller and convergence may take longer. Consistent with this hypothesis, the convergence speed increases with network size. Note that even though the number of iterations for large networks is smaller, the computational load is higher, since there are more parameters to be estimated and the running of the network is slower. These network-size constraints are most visible for tasks with 2 or more feature dimensions (Figure 4).

3.1.2 Meta-learning phase

We next considered how the meta-learning speed is affected by the four factors, specifically task, encoding, number of features, and network size. The speed is quantified by the number of trials needed to first reach the criterion performance of 90% of the maximum achievable rate (examples of the meta-learning curve for D-R tasks: Figure 4). We first considered the initial learning period, i.e., before a reversal when studying a reversal task. We find that this learning is not affected by whether it was a reversal or non-reversal task (Figure 4). Interestingly, during a reversal task, learning the new target after the reversal takes longer than learning it just after block onset. Most likely this is because there is a current target that is no longer the target, so the corresponding value has to decrease first before the value of the new target can increase (the concept of value is based on the interpretation of the results in terms of our RW-type model fit to the model's behavior). Hence the extra learning time is due to unlearning the old target. Probabilistic tasks are more difficult than deterministic tasks, so meta-learning takes longer. One reason is that the unrewarded correct choices confuse the model and thereby delay learning.

When the number of features, and hence the number of potential targets, increases, meta-learning takes longer. The criterion point varies approximately linearly with the number of feature dimensions, which means it is not linear in the number of targets, suggesting that features are learned independently rather than object identities (see further discussion in Section 3.4). Very small networks (n < 16) are much slower than large networks (n ≥ 48), for which the criterion times are similar.


3.1.3 RL model fitting performance

We obtained sets of RL models with low NLP values (Equation 10), indicating that the fit was good (Figure S2). When choices are made randomly, the choice probability is 0.5 and the NLP per trial is around 0.69 (−log 0.5). Hence, when the NLP is much lower than this value, the RL model accounts for the choices of the NN model; it further demonstrates that the RL model also makes decisions with a preference for the correct target.

With the optimal parameter sets, we generated the RL model's action trajectory using exactly the same stimulus-reward sequences as those used for the NN model's evaluation. We characterized the RL model's performance from two perspectives. The first is how well it performed on the task, which can be visualized by plotting the meta-learning curve as we did for the NN model. One thing should be noted: for the meta-learning curve of the RL model, we plot the reward the agent should have obtained based on its action choice, instead of the actual reward, which might be omitted due to task settings such as the reward probability. The second criterion is how well it captures the NN model's behavior, which is calculated as the fraction of trials on which the RL and NN models made the same choices.

The two models have a similar meta-learning trajectory. For PR tasks, the meta-learning curve of the RL model (Figure 5) is around 1.25 times the NN model's trajectory, which is not surprising given that we plot the choice probability without correcting for reward omissions. It indicates that the RL model is able to perform the task well. In the similarity-level analysis, we calculated the fraction of blocks that share the same action choice on each trial; the D-R model reaches around 100% after saturation, while for the P-R model it is only 75%. Meanwhile, there is an obvious lagging effect: after the reversal, the fraction of trials on which the same actions were chosen drops and recovers a few trials later for both the D-R and P-R models. Therefore, those two models made different action choices on unrewarded trials even though the averaged performances are similar.

We also plotted the target value estimation curves for the DR and PR models (see the example block in Figure 6). The RL model has an accurate value estimate and always assigns the largest value to the rewarding target. As expected, probabilistic tasks require a longer time for target updating. After the reversal trial, the value of the original target decreases while that of the reversal target increases exponentially within a few trials, and these two value curves intersect at a certain trial t∗. This means that the RL model will first stick to the original target and then switch to the new target after trial t∗. Together with the similarity-level analysis, we conclude that the NN model adopts a different updating strategy during those periods.

3.2 Network representation

Better model performance indicates that the neural population activity has built a representation of task-relevant information. One of the most prominent attributes is the identity of the target within a block, which has to be inferred by the network. To visualize the gradual changes in representation, we performed Principal Component Analysis (PCA) on the neural activity data and analyzed the population responses grouped by blocks with the same target index. We expect the responses representing the same target to cluster together in terms of PCA loadings, both as training progresses and as meta-learning progresses during the block.

We extracted data from the phase of converged meta-learning (the last 10 trials of a block for NR tasks, when performance consistently exceeded the criterion) for six intermediate models that were sampled from the plateaus of the RL training curve. For each model, we colored each data point according to the target of its corresponding block. Initially, the representation was limited to a line confined to a small square in phase space, after which it spread out, and at a later stage in training the point clouds also partially unmixed so that colors (targets) segregated (Figure 7). This means that the target is easier to decode from the population activity because the inter-cluster distances increase. To quantify this observation, we trained a support vector machine (SVM) classifier (acting on the first 2 PC loadings) to determine how the target classification accuracy changes. The average accuracy shows a clear increasing tendency, which agrees with our assumption (Figure 7). One interesting point is that the fastest-rising phase is different for each target, which suggests that different targets are acquired during different training phases.
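A minimal sketch of this decoding analysis (PCA projection followed by a linear SVM with cross-validation, using scikit-learn) is given below; the array shapes and the use of a linear kernel and 5-fold cross-validation are assumptions on our side.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def target_decoding_accuracy(activity, targets, n_components=2):
    # activity: (n_samples, n_units) LSTM states, e.g. the last 10 trials of
    # each block; targets: (n_samples,) target index of the corresponding block
    loadings = PCA(n_components=n_components).fit_transform(activity)
    clf = SVC(kernel="linear")
    return cross_val_score(clf, loadings, targets, cv=5).mean()

# Example with random stand-in data (600 samples, 48 LSTM units, 6 targets):
rng = np.random.default_rng(0)
acc = target_decoding_accuracy(rng.normal(size=(600, 48)),
                               rng.integers(6, size=600))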

For the evolution of the representation during meta-learning, we carried out a similar analysis on a well-performing DR model by dividing the block into 10 consecutive, non-overlapping intervals according to the trial index. The 2-D projected data do not expand to fully cover the square; rather, they show a gradual formation of target clusters as the model better captures the identity of the rewarding target (Figure 8a). During the trials after the reversal (trials 20 to 30), the clusters become more mixed, after which another set of separable clusters emerges at the end of the block. A different SVM classifier (acting on the first 3 PC loadings) was also trained and demonstrates a clear drop in accuracy during the reversal stage (Figure 8b). This analysis supports our findings regarding the meta-learning time to reach criterion: pre-reversal learning is faster than post-reversal learning.


3.3 Transfer learning

To get an impression of the models' meta-learning strategy, we tested their performance on a different set of tasks than those on which the model was initially trained, or by adding off-policy periods during which the reward was independent of the choice. For example, the DNR model tested on DR tasks can tell us whether the model adopts a similar learning strategy pre- and post-reversal. For off-policy tests, we added an off-policy period to the original task, during which the agent receives or does not receive rewards regardless of its actions. We analyzed the influence of three types of unrewarded trials on the overall performance and meta-learning trajectory of a well-trained model:

• After the reversal, during which the agent is exploring to find the new target. During this period, the agent needs to switch its target belief.

• Rewarded incorrect choices or unrewarded correct choices that occur in tasks with a probabilistic reward schedule. Ideally, the agent should preserve its target belief.

• Manually inserted intervals of consecutive rewards or consecutive omissions of rewards independent of the choice of the model. The rewards or absence of rewards in this case differs from that for the probabilistic reward schedule, because in the latter it is still correlated with the correctness of the choice. Ideally, the agent should also hold its target belief.

We made the following observations (Figure 5, S3):

• The DNR model cannot detect a reversal but recovers well from an unrewarded period.

• The DR model does not perform as well as the probabilistic model on PNR tasks, but recovers well from the unrewarded period.

• The PNR model performs well on the DNR task, but fails on reversal tasks.

• The PR model goes through an extended relearning period after an off-policy unrewarded/rewarded period.

The simplest conclusion we can draw from the above observations is that the more complex models are compatible with the simpler models: reversal models work well on non-reversal tasks, and probabilistic models work well on deterministic tasks. This is consistent with our expectation, since we can always treat deterministic tasks (reward probability p = 1) and non-reversal tasks (reversal happening at the end of the block) as special cases of probabilistic and reversal tasks, respectively.

Going a step further, our model is capable of telling the difference between rewards omitted because of the probabilistic reward schedule and those omitted because of a target switch. That is to say, the model treats various kinds of unrewarded trials in different ways. If all of the models used a consistent RL updating rule, i.e., fixed updating/action-selection parameters, all types of unrewarded trials should lead to the same behavior, since from the model's perspective they have an identical trial history. Yet various learning behaviors emerge: some of the models choose to hold the target belief while others adapt their choices faster or more slowly. Hence, different learning strategies are adopted by different models. To start with, probabilistic and reversal models are rather perseverant regarding their target belief, given that their meta-learning period is significantly longer than for the other models (Figure 4).

To analyze the model's behavior at a finer scale, we defined a "confusion matrix", calculated as the probability of making the correct choice after a trial with a rewarded correct choice (RC), an unrewarded correct choice (UC), a rewarded incorrect choice (RI), or an unrewarded incorrect choice (UI). A block is divided into the learning phase (the trials after the switch of the target during which task accuracy is increasing) and the stable phase (during which task accuracy remains at a high level). The probability of making the correct decision after UC and RI trials during the stable phase is much higher than in the learning phase, demonstrating that the P model learns to identify the wrongly (un)rewarded trials (Figure 9).
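A minimal sketch of this outcome-conditioned measure is given below, taking boolean per-trial arrays of choice correctness and reward delivery for one block; the data layout is our assumption.

import numpy as np

def outcome_conditioned_accuracy(correct, rewarded):
    # Probability of a correct choice on trial t+1, conditioned on the
    # outcome type (RC, UC, RI, UI) of trial t.
    correct = np.asarray(correct, dtype=bool)
    rewarded = np.asarray(rewarded, dtype=bool)
    outcome_types = {"RC": (True, True), "UC": (True, False),
                     "RI": (False, True), "UI": (False, False)}
    result = {}
    for name, (c, r) in outcome_types.items():
        prev = (correct[:-1] == c) & (rewarded[:-1] == r)   # trials of this type
        result[name] = correct[1:][prev].mean() if prev.any() else np.nan
    return result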

We further tested how D-R models generalize to P-NR tasks. There is a tendency toward meta-learning, yielding a performance on P-NR that was similar to the performance of the model trained on P-R tasks, but with a slightly lower saturation level. We characterized the D-R model's action bias using the confusion matrix and found that it did not deal well with either kind of wrongly (un)rewarded case: the probability of making correct decisions decreased for trials after UC but increased for trials after RI (Figure 9). Hence, we argue that the D model is more sensitive to unrewarded trials than the P models.

Different from the model proposed by Wang et al. ([Wang et al., 2018]), we fed the trial index as an additional input to the network. Our initial guess was that, because the reversal always happens within the middle part of the block (trials 20 to 30), the network develops a prior about the reversal timing. Right before the potential reversal period, the NN model could adapt its learning strategy to be more sensitive to unrewarded trials. In the meta-learning curves, this would imply shorter re-learning periods for reversals that happen between trials 20 and 30. However, the results differ from this expectation. We made two modifications to the DR task: randomized trial indices were fed into the model, and we used a wider range for the reversal window (from trial 15 to 45). The model performs well under both conditions and neither of them leads to a longer reversal learning time. Therefore, either there is no reversal prior represented in the NN model or that information is not used.


We also tested whether the RL model has a transfer learning capability similar to that of the NN model. We tested a PNR model on a DR task. As we can see from Figure S5a, both models fail to recognize the second target right after the reversal, whereas the RL model seems to slowly pick up the new piece of information by the end of the block. This can be verified in Figure S5b, where the behavioral similarity level between the NN and RL models drops at the corresponding trials. We made the assumption that NR models cease to learn after the initial pre-reversal period. Based on the results shown here, it is possible that the NR model does not stop learning, but continues to learn with a significantly lower learning rate. We tried to verify this hypothesis by testing the existing NN models on tasks with longer blocks. However, the NN model's performance shows an unexpected drop after trial 50, which used to be the end point of the task. Whether the NN model stops learning therefore remains an open question.

We also tested what happens to the RL model when it faces an unexpected unrewarded period. Apart from the meta-learning curve mentioned above, we plotted the rewards the RL model should have earned, instead of the actual zero reward, during the unrewarded period in Figure S6. The performance is at about chance level during the unrewarded period, which means the RL model unlearns all the target information. After the off-policy period, the RL model recovers faster than the NN model.

3.4 Feature by feature learning strategy

There are two possible strategies for the model to capture the task rule during the training phase: learning by objects, or learning by features (pairs of targets, in our binary feature-value setting). To determine which strategy the model used, we plotted the meta-learning curve for intermediate models, conditioned on either targets or stimuli (Figure S8).

The model's improvement across training episodes proceeded via jumps in the fraction of rewarded trials followed by plateaus on which model performance changed little. We considered models corresponding to three intermediate plateaus, with the final model achieving the highest performance. We averaged the reward outcome across blocks with either the same target (a-d) or the same presented stimuli (e-h). In the latter case, the trials that are averaged over come from different blocks. These conditioned averages allow us to determine whether the learning proceeds across targets or across stimuli. The sequence from (a) to (d) shows that each plateau corresponds to a subset of targets for which performance is high and a subset for which performance is low. These targets come in pairs for which the feature dimension is the same. The performance increases because more, and in the end all, targets will have high performance. In contrast, panels (e-h) show that there is no difference in performance across stimuli; rather, the average across stimuli increases with each plateau. This demonstrates that the model learns by features instead of objects in the training phase. Similarly, overtrained or unstable NNs whose performance rapidly collapsed lost information feature by feature (Figure S8).

In addition, we calculated the representational similarity matrix (RSM, [Diedrichsen and Kriegeskorte, 2017], [Kriegeskorte et al., 2008]), i.e., the correlation between neural activity patterns for all pairs of targets. Pairs of targets belonging to the same feature dimension show a higher similarity level compared with the average similarity level of pairs of targets across feature dimensions (Figure 10). This similarity decreases when the model arrives at a plateau for which the performance for the feature dimension of these targets has jumped to a high value.
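A minimal sketch of the RSM computation is given below, assuming that the activity pattern of a target is taken as the mean population activity over the samples labeled with that target.

import numpy as np

def representational_similarity_matrix(activity, targets):
    # activity: (n_samples, n_units); targets: (n_samples,) target labels.
    activity, targets = np.asarray(activity), np.asarray(targets)
    labels = np.unique(targets)
    patterns = np.stack([activity[targets == t].mean(axis=0) for t in labels])
    return labels, np.corrcoef(patterns)   # (n_targets, n_targets) correlation matrix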

Based on the behavioral performance, we assume that meta-learning does not utilize object-by-object learning but instead acquires all object information related to the target simultaneously. We performed t-distributed Stochastic Neighbor Embedding (tSNE, [Maaten and Hinton, 2008], [Tang et al., 2016]) on data filtered by block target, visualized by stage and colored according to the stimulus index (Figure 11). We found that those clusters are formed from the beginning of the block and move within the manifold generated by the target. This shows no temporal order for different objects during meta-learning, which agrees with our assumption.

3.5 The effect of network size

Evaluating the performance on tasks with various difficulty levels, we found that there exists a lower threshold for the number of model neurons required to guarantee sufficient task representation capacity. For example, for D-NR-b tasks, a network of 8 neurons is sufficient for a 1-feature task, whereas at least 16 neurons are needed for a 2-feature task. For an 8-feature task, a 128-neuron network is not guaranteed to reach an optimal level of performance. However, the neurons in large networks are not fully used. We performed PCA on neural activity data acquired from D-R-rb-lstm256 models with feature numbers ranging from 1 to 4. Those four models all reached adequate performance. We analysed how many PCA components are needed to explain at least 75% of the variance (Figure 12; this threshold guarantees that the clusters are separable); the results are 3, 10, 10 and 12 when the number of feature dimensions is 1, 2, 3 and 4, respectively. If the network were fully optimized, a model with a small number of neurons (i.e., the same number as the number of PCA components) should be able to perform well. However, our observations show that this is not the case.

The fraction of active neurons drops with increasing network size, and neurons that encode redundant information emerge as well. We define a neuron's target distinguishability level as a measure of how well the neuron can tell two targets apart. It is calculated as the variance of the target tuning curve. The distribution of neurons' target distinguishability levels has a prominent peak at 0, which indicates that a large fraction of neurons cannot distinguish any of the targets. Therefore, we define that if the target tuning curve's variance falls into an ε-interval around zero (ε = 10^−4 as a limit value), the corresponding neuron does not contribute to target classification. The fraction of non-contributing neurons in the whole network is referred to as the model's sparsity level. We calculated this measure for the D-R-rb-f2 model with network sizes of 8, 12, 16, 48, 128 and 256 and found that the bigger the network, the higher the sparsity (Figure S10). This demonstrates the existence of redundant neurons.
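A minimal sketch of the target distinguishability and sparsity measures defined above; taking the target tuning curve as the mean activity per target is our reading of the definition.

import numpy as np

def sparsity_level(activity, targets, eps=1e-4):
    # activity: (n_samples, n_units); targets: (n_samples,) target labels.
    activity, targets = np.asarray(activity), np.asarray(targets)
    labels = np.unique(targets)
    # target tuning curve: mean activity per target, for each neuron
    tuning = np.stack([activity[targets == t].mean(axis=0) for t in labels])
    distinguishability = tuning.var(axis=0)        # variance across targets, per neuron
    sparsity = np.mean(distinguishability < eps)   # fraction of non-contributing neurons
    return sparsity, distinguishability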

We further considered the network activity at the single-neuron level. As mentioned before, the trial index as an additional input does not seem to contribute to guiding the action choice. We plotted neuron activity versus trial index for small and large networks and found that no single neuron encodes the trial index in small networks, whereas in large networks several neurons show a clear correlation (Figure S11). This result further validates our statement on the existence of redundant neurons.

3.6 Meta-learning in the RL context

We analyzed how the RL model parameters change with the task settings and NN model sizes and plotted their relationship in Figure 13. There are no prominent changes in the optimal values of η+, η− and α with either the network size or the number of features, whereas β increases quickly with the size of the network. A higher β means the model has more confidence in its current belief. In contrast, β shows a non-monotonic trend as the number of features increases. We expected to observe a nearly flat curve for β, but our model size is not sufficient to solve tasks with a number of features as high as 8. A higher feature number leads to a lower saturation criterion, which in turn might result in β's inconsistent behavior.

η+ and η− are another pair of parameters that help us understand the models' meta-learning strategy. The larger the η, the more the model uses the corresponding rewarded/unrewarded trial experience to update the feature values. Hence, with a smaller η+, the P model adapts its target belief more slowly than the D models. With a larger η−, the P model changes the feature values more when it encounters unrewarded trials (Figure 13). This agrees with our expectation, given that the P model has to learn from unrewarded correct choices.


4 Discussion

Human decision making in specific cognitive tasks can be characterized by reinforcement learning models, providing concepts such as value and reward prediction error. The circuits underlying this behavior that encode these quantities cover multiple brain regions and raise the issue of how the labor is divided amongst these areas. To address this issue, we adapted a meta-learning neural network model based on LSTM units to learn several variations of a reversal learning task, in order to study the mechanism by which networks accumulate information about the rewarded target and forget this information when the target is switched.

We studied the factors that affect the model's saturation speed, both in the learning and the meta-learning phase. The saturation speed is characterized by how long it takes for the model to be trained on a particular task (the learning) and how long it takes for those models to acquire the correct target (the meta-learning). We assessed whether the model was capable of generalization, i.e., whether a network trained on one task could perform well on a different task, and we determined how well we could characterize the performance using a standard RW learning model. Finally, we studied the nature of the representation of the stimuli and of the estimate of the current target in the activity of the network neurons. We here summarize the main results of the thesis.

We trained the model on four tasks: each block could have the same target for the entire period (NR) or the target could be switched at a random trial (R), and the rewards could be deterministic (D), where a correct choice was always rewarded, or probabilistic (P), where a correct choice was rewarded in 80% of the cases and an incorrect choice in 20% of the cases. A task can thus be represented by one of the following four combinations: D-NR, D-R, P-NR, P-R.

There were always two objects presented. We further varied the number of feature dimensions n_fd in the objects between 1 and 4, with two values per dimension. This leads to 2^n_fd possible stimuli and 2·n_fd targets. The two objects are complementary: they always have opposite values. The stimulus configuration (the two objects) can be represented in two ways: as a binary vector of length n_fd, with each bit representing the value of that feature dimension (0 or 1), referred to as "b"; or via a vector of length 2^n_fd, using one-hot coding, where the single one in this vector indicates the stimulus, referred to as "o". As the second object has exactly the opposite binary representation, one can replace each 1 by a zero and vice versa to obtain it from the first object. It is also possible to present the two objects in a redundant way as two binary vectors; this is the third way of presenting the stimuli, referred to as "rb". A network can be represented by the following code: the type of stimulus (b, o, rb) followed by the number of features (f) and the number of neurons (lstm).

A complete description of a model is then given by a combination of task and network; for example, D-NR-rb-f2-lstm48 stands for a deterministic reward schedule (D), blocks without a midway target switch (NR), a redundant binary input (rb), two feature dimensions (f2), and 48 neurons (lstm48).
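A minimal sketch of the three stimulus encodings for a single presented object is given below (the second object is its complement); the exact ordering and layout of the vectors is an assumption on our part, not taken from the thesis implementation.

import numpy as np

def encode(obj, scheme="rb"):
    # obj: binary vector of feature values, length n_fd
    obj = np.asarray(obj, dtype=int)
    if scheme == "b":                       # binary: the first object itself
        return obj
    if scheme == "rb":                      # redundant binary: both objects
        return np.concatenate([obj, 1 - obj])
    if scheme == "o":                       # one-hot over the 2**n_fd possible stimuli
        onehot = np.zeros(2 ** len(obj), dtype=int)
        onehot[int("".join(map(str, obj)), 2)] = 1
        return onehot
    raise ValueError(scheme)

# encode([1, 0, 1], "b") -> [1 0 1]; "rb" -> [1 0 1 0 1 0]; "o" -> a one at index 5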

We found that the learning speed, the number of training episodes needed for the model to reach 90% of the theoretically maximal performance, is a proxy for the task difficulty. It took longer to learn a reversal than a non-reversal task. It also took longer to learn a probabilistic than a deterministic task. The symbolic ranking is P-R > P-NR > D-R > D-NR.

P-R is the most difficult task because it has two learning phases in which two types of evidence need to be balanced. First, when the block starts, unrewarded correct choices (UC) have to be distinguished from unrewarded incorrect choices (UI), and rewarded correct choices (RC) from rewarded incorrect choices (RI). Due to the probabilistic nature, UC and RI occur in the minority of trials. When interpreted in the context of our simple RL model, the former would correspond to value changes induced by η−, which should be lower to filter out UC from UI, whereas the latter would correspond to η+, which should be lower to filter out RI from RC. The second learning phase is detecting the reversal: an incidental UC occurring at low rates should not be mistaken for a UI later in the block, as the latter would be a sign of a reversal, which means that η− has to be lower. Both of these trade-offs slow the meta-learning rate (represented by the η values), and it also takes a longer time to train the network.

Transfer learning refers to using a network trained on one task to solve a different task. Here we consider the transfer of meta-learning. We find that a D-NR model cannot perform D-R tasks, since after the reversal it continues to choose the object with the old target. In contrast, when a P-NR model is tested on D-R tasks, the performance drops to chance upon reversal and then increases very slowly, much more slowly than for a P-R model. This shows that a P model has to be more flexible to solve the P task, but that this flexibility also helps with reversals for which it was not trained. This could mean that it adopts a strategy similar to RW models of reinforcement learning. However, our results show that the fit of the RW model to the network trained on the P task is not as good as for the NP task.

We analyzed the neural activity in the network in a number of ways to determine how the task variables were represented. We reduced the dimensionality of the neural state using either PCA or tSNE and applied a supervised learning procedure, SVM, to see how well we could decode either the current target or the first object. We found that the first 12 PCA components captured the entire network activity well in terms of explained variance (80%, for tasks with three feature dimensions), and also performed as well as the full network activity for decoding target and stimulus identity. The number of PCA components did not vary strongly with network size; rather, it varied with the number of features that the network had to represent.

The target information becomes stronger over meta-learning trials; this can be seen in the tSNE plots where points are labeled according to the current target. For the earliest chunk, the minimum distance between points representing different targets is smaller than for later chunks. The same is borne out by the classification error.

The meta-learning performance of the model is in a similar ballpark to that of human and non-human primate subjects, but does not match it exactly. The training can take millions of episodes, which is not realistic compared to the performance of subjects, also in the way it depends on the number of feature dimensions, i.e., the additional training period for adding feature dimensions. This can be understood mechanistically: the training proceeds in bouts where performance increases rapidly, after which a plateau in the performance is reached, during which performance does not improve for a relatively large number of episodes. Each plateau was associated with learning to represent an additional feature dimension. The network needs to learn both a new representation and the task. In subjects, it is likely that part of such a representation already exists, and perhaps similar tasks have also been learned previously. Hence, the learning proceeds from an already partially trained network and should be considered transfer learning. This may proceed significantly faster than de novo learning. In addition, such an existing network may also constrain how stimuli/targets can be represented by the neurons, whereas the current model is not constrained by any biological data. It is a topic for further study to achieve a better match between the model and subjects for the training and meta-learning times.

The number of neurons necessary for close-to-optimal performance could be limited by the number of possible stimuli (exponential in the number of feature dimensions) or the number of targets (linear in the number of feature dimensions) ([Farashahi et al., 2017]). When it is the latter, we could say that the model chooses an efficient strategy. We find that the number of PCA components necessary to represent the network's activity patterns increases with the number of features. This would imply that if the network uses the same strategy, it should not be possible to learn the task when the number of neurons is below the number of required PCA components. Currently, we have not reached a consistent conclusion regarding how many neurons are necessary to learn a task with a specific number of features. We will further explore this by determining for each network when less than 20% of the weight matrix initial conditions lead to close-to-expected optimal performance.


We show that the network is trained by consecutively increasing the number of feature dimensions it can represent as a target feature. This would suggest that learning is constrained by the number of targets, and perhaps not by the number of possible objects. One additional way to check this is as follows. First, train the model on a subset of targets and then test it on all targets. If the model learns target by target, the performance should be lower for the untrained targets. Second, train the model using only a subset of possible stimuli that covers all the target dimensions (i.e., a stimulus for each of the targets), and test with all stimuli. When the performance is lower for untrained stimuli, it indicates that the network needs to represent all stimuli explicitly. Third, we can apply the same analysis to the meta-learning stage. We let the model reach criterion performance using only a subset of stimuli, and subsequently we test the performance on all stimuli. When the performance is reduced for unseen stimuli, this means that each stimulus has to be learned explicitly. We plan to perform this evaluation in future studies.

5 Conclusion

In summary, we have presented a deep learning model for a reversal-learning task and characterized how the training time, as well as the meta-learning time to criterion, depend on the various properties of the task and network. This model can be interpreted in terms of the RW model and its parameters and can therefore be compared to the behavior of actual subjects. Furthermore, it makes predictions for the neural activity and its changes with meta-learning. We believe this is invaluable for understanding behavioral and electrophysiological measurements of this task, which will be the focus of our future investigations.

References

[Alexander and Brown, 2010] Alexander, W. H. and Brown, J. W. (2010). Computational models of performance monitoring and cognitive control. Topics in cognitive science, 2(4):658–677.

[Alexander and Brown, 2011] Alexander, W. H. and Brown, J. W. (2011). Medial prefrontal cortex as an action-outcome predictor. Nature neuroscience, 14(10):1338–1344.

[Alexander and Brown, 2015] Alexander, W. H. and Brown, J. W. (2015). Hierarchical error representation: a computational model of anterior cingulate and dorsolateral prefrontal cortex. Neural Computation, 27(11):2354–2410.

[Behrens et al., 2007] Behrens, T. E., Woolrich, M. W., Walton, M. E., and Rushworth, M. F. (2007). Learning the value of information in an uncertain world. Nature neuroscience, 10(9):1214–1221.


[Botvinick et al., 2019] Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences, 23(5):408–422.

[Botvinick et al., 2009] Botvinick, M. M., Niv, Y., and Barto, A. G. (2009). Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition, 113(3):262–280.

[Costa et al., 2015] Costa, V. D., Tran, V. L., Turchi, J., and Averbeck, B. B. (2015). Reversal learning and dopamine: a bayesian perspective. Journal of Neuroscience, 35(6):2407–2416.

[Deco et al., 2013] Deco, G., Rolls, E. T., Albantakis, L., and Romo, R. (2013). Brain mechanisms for perceptual and reward-related decision-making. Progress in Neurobiology, 103:194–213.

[Diedrichsen and Kriegeskorte, 2017] Diedrichsen, J. and Kriegeskorte, N. (2017). Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. PLoS computational biology, 13(4):e1005508.

[Farashahi et al., 2017] Farashahi, S., Rowe, K., Aslami, Z., Lee, D., and Soltani, A. (2017). Feature-based learning improves adaptability without compromising precision. Nature communications, 8(1):1–16.

[Hassabis et al., 2017] Hassabis, D., Kumaran, D., Summerfield, C., and Botvinick, M. (2017). Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258.

[Hayden and Platt, 2010] Hayden, B. Y. and Platt, M. L. (2010). Neurons in anterior cingulate cortex multiplex information about reward and action. Journal of Neuroscience, 30(9):3339–3346.

[Heilbronner and Hayden, 2016] Heilbronner, S. R. and Hayden, B. Y. (2016). Dorsal anterior cingulate cortex: a bottom-up view. Annual review of neuroscience, 39:149–170.

[Jang et al., 2015] Jang, A. I., Costa, V. D., Rudebeck, P. H., Chudasama, Y., Murray, E. A., and Averbeck, B. B. (2015). The role of frontal cortical and medial-temporal lobe brain areas in learning a bayesian prior belief on reversals. Journal of Neuroscience, 35(33):11751–11760.

[Kriegeskorte, 2015] Kriegeskorte, N. (2015). Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual review of vision science, 1:417–446.

[Kriegeskorte et al., 2008] Kriegeskorte, N., Mur, M., and Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 2:4.


[Maaten and Hinton, 2008] Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605.

[Miller, 2000] Miller, E. K. (2000). The prefrontal cortex and cognitive control. Nature reviews neuroscience, 1(1):59–65.

[Minxha et al., 2020] Minxha, J., Adolphs, R., Fusi, S., Mamelak, A. N., and Rutishauser, U. (2020). Flexible recruitment of memory-based choice representations by the human medial frontal cortex. Science, 368(6498).

[Murray and Rudebeck, 2018] Murray, E. A. and Rudebeck, P. H. (2018). Specializations for reward-guided decision-making in the primate ventral prefrontal cortex. Nature Reviews Neuroscience, 19(7):404–417.

[Niv, 2019] Niv, Y. (2019). Learning task-state representations. Nature neuroscience, 22(10):1544–1553.

[Oemisch et al., 2019] Oemisch, M., Westendorff, S., Azimi, M., Hassani, S. A., Ardid, S., Tiesinga, P., and Womelsdorf, T. (2019). Feature-specific prediction errors and surprise across macaque fronto-striatal circuits. Nature communications, 10(1):1–15.

[Olah, 2015] Olah, C. (2015). Understanding LSTM networks.

[Rudebeck and Murray, 2014] Rudebeck, P. H. and Murray, E. A. (2014). The orbitofrontal oracle: cortical mechanisms for the prediction and evaluation of specific behavioral outcomes. Neuron, 84(6):1143–1156.

[Shahnazian and Holroyd, 2018] Shahnazian, D. and Holroyd, C. B. (2018). Distributed representations of action sequences in anterior cingulate cortex: A recurrent neural network approach. Psychonomic Bulletin & Review, 25(1):302–321.

[Shenhav et al., 2013] Shenhav, A., Botvinick, M. M., and Cohen, J. D. (2013). The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–240.

[Silvetti et al., 2011] Silvetti, M., Seurinck, R., and Verguts, T. (2011). Value and prediction error in medial frontal cortex: integrating the single-unit and systems levels of analysis. Frontiers in human neuroscience, 5:75.

[Tang et al., 2016] Tang, J., Liu, J., Zhang, M., and Mei, Q. (2016). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web, pages 287–297.


[Van Veen et al., 2004] Van Veen, V., Holroyd, C. B., Cohen, J. D., Stenger, V. A., and Carter, C. S. (2004). Errors without conflict: implications for performance monitoring theories of anterior cingulate cortex. Brain and cognition, 56(2):267–276.

[Vassena et al., 2017] Vassena, E., Holroyd, C. B., and Alexander, W. H. (2017). Computational models of anterior cingulate cortex: At the crossroads between prediction and effort. Frontiers in neuroscience, 11:316.

[Wang et al., 2018] Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., and Botvinick, M. (2018). Prefrontal cortex as a meta-reinforcement learning system. Nature neuroscience, 21(6):860–868.

[Wang et al., 2016] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.

[Yamins and DiCarlo, 2016] Yamins, D. L. and DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356–365.


(a) reversal learning task

(b) experimental task

Figure 1: Behavioral Tasks. a The task modeled in this study. Each feature-value combination is a possible target and only one target is rewarded within a block. n feature dimensions with m values each create n × m possible targets, and each of them can be selected as the rewarded target. The target reversal happens in one of the trials around the middle of the block. The upper-right part of the figure shows the three types of tasks, which lead to either deterministic or probabilistic rewards. b There were 3 feature dimensions, color, location and motion, each with two values. The right-hand side of the panel depicts the reversal. Only the color dimension is associated with rewards. Figure adapted from [Oemisch et al., 2019].
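
For concreteness, the following sketch implements one block of the task as we read the caption: all feature-value combinations are candidate targets, the rewarded target switches once around the middle of the block, and p_reward controls whether the schedule is deterministic or probabilistic. The object construction, the reversal window of ±5 trials and the probabilistic schedule are our simplifying assumptions, not the exact task code.

import random

def run_block(n_features=3, n_values=2, n_trials=50, p_reward=1.0, agent=None):
    # All feature-value combinations are possible targets (n_features * n_values of them).
    targets = [(f, v) for f in range(n_features) for v in range(n_values)]
    target = random.choice(targets)
    # One reversal at a random trial around the middle of the block.
    reversal_trial = random.randint(n_trials // 2 - 5, n_trials // 2 + 5)
    history = []
    for t in range(n_trials):
        if t == reversal_trial:
            target = random.choice([x for x in targets if x != target])
        # Simplification: the two presented objects differ in every feature dimension.
        obj_a = tuple(random.randrange(n_values) for _ in range(n_features))
        obj_b = tuple((x + 1) % n_values for x in obj_a)
        choice = agent(obj_a, obj_b, history) if agent else random.choice([obj_a, obj_b])
        correct = choice[target[0]] == target[1]
        # One possible probabilistic schedule: the reward matches correctness with
        # probability p_reward and is flipped otherwise (p_reward = 1 is deterministic).
        rewarded = correct if random.random() < p_reward else not correct
        history.append((obj_a, obj_b, choice, rewarded))
    return history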


(a) neural network structure

(b) gating structures within a LSTM unit

(c) encoding methods

Figure 2: Network Architecture. a Figure modified from [Wang et al., 2018]. The network is composed of an encoding layer, an LSTM layer and an output layer. Compared with the model from [Wang et al., 2018], the trial index is provided as an additional input. Meta-learning is achieved by feeding in the previous reward and action. b The gating mechanism of LSTM units. Figure taken from [Olah, 2015]. c An example of an environment with three stimulus feature dimensions, each with two values. For binary and redundant binary encoding, shapes represent feature dimensions and colors represent feature values (dim = 2, grey stands for 0 and yellow for 1). Redundant binary encoding consists of the binary encoding plus its complement, which reduces the need for abstraction and is more similar to the experimental stimulus array. One-hot encoding is equivalent to the corresponding binary encoding. For example, binary encoding {0, 1, 0}, redundant binary encoding {0, 1, 0, 1, 0, 1} and one-hot encoding {0, 0, 0, 0, 0, 1, 0, 0} are equivalent to each other. See the main text.
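
The three encodings in panel (c) can be written out directly; the sketch below reproduces the caption's own example, with the one-hot index counted from the right inferred from that example. The helper names are ours, not the thesis code.

def redundant_binary(bits):
    # Binary code followed by its complement, e.g. [0, 1, 0] -> [0, 1, 0, 1, 0, 1].
    return bits + [1 - b for b in bits]

def one_hot(bits):
    # Treat the binary code as an index into the 2**n possible stimuli,
    # e.g. [0, 1, 0] -> index 2 -> [0, 0, 0, 0, 0, 1, 0, 0] (the 1 counted from the right).
    index = int("".join(map(str, bits)), 2)
    vec = [0] * (2 ** len(bits))
    vec[-(index + 1)] = 1
    return vec

assert redundant_binary([0, 1, 0]) == [0, 1, 0, 1, 0, 1]
assert one_hot([0, 1, 0]) == [0, 0, 0, 0, 0, 1, 0, 0]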



Figure 3: Variability and stability. Variability of the convergence rate across different initial conditions for the network weights, and the effect of the discounting factor γ on convergence speed. a shows the variability of the convergence speed of example models; the variance is not large compared to its mean. Orange represents NPNR-rb-f3-lstm48 and green NPR-rb-f3-lstm48. b shows curves for different γ values. Models were trained on NRP-f2-lstm48 tasks with different encoding methods. The x-axis is the number of trained blocks divided by 200. Curves were smoothed with Savitzky–Golay filters.
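
The smoothing mentioned in the caption can be reproduced with SciPy's Savitzky–Golay filter; the window length and polynomial order below are arbitrary choices, not the values used in the thesis.

import numpy as np
from scipy.signal import savgol_filter

rewards_per_block = np.random.rand(1000)   # stand-in for a raw learning curve
smoothed = savgol_filter(rewards_per_block, window_length=51, polyorder=3)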



Figure 4: Time to criterion during the meta-learning phase. a An example of a reversal-aligned meta-learning curve. Different colors indicate different targets (see the legend). Model: DR-rb-f2-lstm48. b There is no significant difference between R and NR models. Model: DR and DNR, number of features nfd ∈ {1, 2, 3, 4}, size of the network ∈ {16, 48, 128, 256}. c The probabilistic model is harder to learn than the deterministic one. Model: NR × {D, P}. d It takes longer to identify the post-reversal target compared with pre-reversal learning. Model: DR. e The time needed to reach criterion increases with the number of features. The dashed line is the mean value of each target's convergence time. Model: DNR with binary encoding (B), size of the network = 128. f Larger networks have relatively better representations and converge faster. Model: DR, redundant binary (rb) encoding, number of features nfd = 2, size of the



Figure 5: RL models' fitting performance. a,b Meta-learning curves for the NN and RL models (left panel: D-R; right panel: P-R). The data are acquired by feeding the stimulus-action-reward data from the corresponding NN model to the RL model with the optimal parameter set. The curves for the NN models show the actual rewards obtained by the agent, whereas for the RL model they show the reward the agent would have obtained under a deterministic setting (the predicted reward probability). Hence, the RL model's performance is around 1.25 times the NN model's performance in the P-R task. The figure shows that the two models have similar mean accuracy trajectories. c,d Action-similarity analysis. Each data point shows the fraction of blocks in which the two models make the same action choice at a given trial. The similarity level drops after each target switch, which means the RL model uses a different learning strategy for updating targets. In addition, the P-R model has a lower similarity level than the D-R model. Task settings: rb-f2-lstm48.
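
The action-similarity measure in panels (c,d) amounts to a per-trial agreement fraction; a sketch, assuming the choices of both models are stored as arrays of shape (n_blocks, n_trials):

import numpy as np

def action_similarity(nn_actions, rl_actions):
    # nn_actions, rl_actions: integer choice arrays of shape (n_blocks, n_trials).
    # Returns, for each within-block trial index, the fraction of blocks in which
    # the two models made the same choice.
    return (np.asarray(nn_actions) == np.asarray(rl_actions)).mean(axis=0)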



Figure 6: Target values over trials for the RL model. Smooth curves are the values associated with each target as the RL model learns. Star symbols indicate the current target; after saturation, it is always the target with the maximal value, which means that the model can indeed recognize the target and make correct choices. As expected, it takes the P-R model longer to converge. During re-learning of the target, the value of the original target decreases and the value of the new target increases rapidly. These two curves intersect at trial t∗, after which the model makes decisions based on which object possesses the new target. Task settings: rb-f2-lstm48.
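
A hedged sketch of the kind of Rescorla-Wagner-style update these value curves reflect, with separate learning rates for positive and negative prediction errors and a softmax choice with inverse temperature β; the exact update rule, and which feature values are updated on a given trial, may differ from the RL model described in the Methods.

import numpy as np

def softmax_choice(value_a, value_b, beta, rng):
    # Probability of choosing object A under a two-option softmax with inverse temperature beta.
    p_a = 1.0 / (1.0 + np.exp(-beta * (value_a - value_b)))
    return 0 if rng.random() < p_a else 1

def rw_update(values, chosen_features, reward, eta_plus, eta_minus):
    # values: dict mapping (feature_dimension, feature_value) -> learned value.
    # chosen_features: the feature-value pairs of the chosen object.
    for feat in chosen_features:
        delta = reward - values[feat]                 # prediction error for this target
        eta = eta_plus if delta > 0 else eta_minus    # asymmetric learning rates
        values[feat] += eta * delta
    return values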



Figure 7: Adaptation of network representation during training. a Target-representation clusters formed by the first two principal components of the neural activity of six intermediate models. Intermediate models are extracted from plateau periods during training. Model: PNR-rb-f3-lstm48. b Zoomed-in version for the first intermediate model, for which no clusters can be visually identified yet. c SVM classification accuracy for the intermediate models (PCA-2D). Scatter dots correspond to the different targets and the dashed line is their mean value (note that there are outliers that move the mean away from the center of the visible points).



Figure 8: Adaptation of network representation during meta-learning. a Target-representation clusters formed by the first two principal components of the neural activity. Blocks are divided into 10 equal-sized non-overlapping parts. Model: DR-rb-f3-lstm48. b SVM classification accuracy for the meta-learning phase (PCA-3D). Scatter dots correspond to the different targets and the dashed line is their mean value.
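
The decoding analysis in panel (b) can be sketched as PCA followed by a linear SVM; the variable names and the use of 5-fold cross-validation are our assumptions, not necessarily the exact pipeline of the thesis.

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def target_decoding_accuracy(activity, targets, n_components=3):
    # activity: array of shape (n_trials, n_neurons); targets: label per trial.
    reduced = PCA(n_components=n_components).fit_transform(activity)
    clf = SVC(kernel="linear")
    return cross_val_score(clf, reduced, targets, cv=5).mean()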



Figure 9: Confusion Matrix for model choices. The "confusion matrix" shows the probability of making the correct choice after a trial with a rewarded correct choice (RC), unrewarded correct choice (UC), rewarded incorrect choice (RI), or unrewarded incorrect choice (UI). A block is divided into the learning phase (the trials shortly after the switch of the target, during which task accuracy keeps increasing; a and c) and the stable phase (during which task accuracy remains at a high level; b and d). For the probabilistic models (c and d), the probability of making the correct decision after UC and RI trials during the stable phase is much higher than during the learning phase, demonstrating that the P model learns to identify the wrongly (un)rewarded trials. The D-R model (a and b) does not deal well with either kind of wrongly rewarded case: the probability of making a correct decision decreases after UC trials but increases after RI trials. Task settings: rb-f3-lstm48.
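
The quantity plotted here is a conditional accuracy: the probability that the next choice is correct given the outcome category of the previous trial. A sketch, assuming boolean per-trial arrays and ignoring block boundaries for brevity:

import numpy as np

def outcome_conditioned_accuracy(correct, rewarded):
    # correct, rewarded: boolean arrays over the trials of one phase (learning or stable).
    correct, rewarded = np.asarray(correct), np.asarray(rewarded)
    labels = {(True, True): "RC", (True, False): "UC",
              (False, True): "RI", (False, False): "UI"}
    result = {}
    for (is_correct, is_rewarded), name in labels.items():
        prev = (correct[:-1] == is_correct) & (rewarded[:-1] == is_rewarded)
        result[name] = correct[1:][prev].mean() if prev.any() else np.nan
    return result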



Figure 10: RSM along the training direction. a Representational similarity matrix (RSM) for four intermediate models. Each square shows the representational similarity level between two targets. b The feature-representation similarity drops faster than the average target-representation similarity. For the 3-feature task, the feature-representation similarity is calculated as the similarity level between targets 1 and 2, 3 and 4, and 5 and 6. Model: DR-rb-f3-lstm48.
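
A sketch of how such an RSM can be computed from the hidden-state activity, assuming the mean activity pattern per target is compared with Pearson correlation (whether the thesis uses correlation or another similarity measure is our assumption):

import numpy as np

def target_rsm(activity, target_ids):
    # activity: array of shape (n_trials, n_neurons); target_ids: target label per trial.
    activity, target_ids = np.asarray(activity), np.asarray(target_ids)
    targets = np.unique(target_ids)
    mean_patterns = np.stack([activity[target_ids == t].mean(axis=0) for t in targets])
    return np.corrcoef(mean_patterns)   # (n_targets, n_targets) similarity matrix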



Figure 11: Feature-by-feature learning: representation. a,b tSNE clusters with points colored by stimulus and target identity, respectively. c,d SVM classification accuracy for stimulus and target, respectively. Classifiers were trained on the tSNE data. The x-axis indicates the meta-learning phase. Model: DR-rb-f3-lstm48.


(a) Explained variance by PCA components

Figure 12: Explained variance of PCA components. We performed PCA on neural activity data acquired from D-R-rb-lstm256 models with the number of features ranging from 1 to 4. All four models reach satisfactory performance. We calculated the cumulative explained variance of the first n PCA components and plotted it for each number of features. Model: D-R-rb-lstm256.


Figure 13: RL model parameter analysis. Ranges of η−, η+ and β fitted to the different models. 10 RL


Supplementary Material

(a) D-NR-rb (b) D-NR-o

Figure S1: Model Convergence. Model learning curves for different encoding methods. We show results for two networks, (a) D-NR-rb and (b) D-NR-o. Each panel contains curves for different numbers of neurons and feature dimensions, as indicated in the legend of panel (a). The speed of convergence decreased when more features were included and increased for larger networks. The y-axis is the number of rewards obtained in a 50-trial block and the x-axis is the number of trained blocks divided by 200. Data were smoothed with Savitzky–Golay filters.
