Distinct effects of reinforcement schedules on sensitivity to outcome devaluation and overtraining in mice


Marieke Schreuder (10069941)

Supervised by Isabell Ehmer and Ingo Willuhn

July 6th, 2015


Table of Contents

Abstract
1 Introduction
1.1 Distinguishing habits from goal-directed actions
1.2 Promoting habits or goal-directed actions
1.3 Neurobiological underpinnings
1.4 Design
2.1 Methods - experiment I
2.2 Methods - experiment II
2.3 Analysis
3.1 Results - experiment I
3.2 Results - experiment II
4 Discussion
4.1 Overtraining
4.2 Impact of motivation on outcome devaluation
4.3 Limitations
4.4 Conclusions
4.5 Future directions
References
5 Appendix
5.1 Normalization
5.2 Experiment I
5.3 Experiment II


Abstract

Habits, which predominantly rely on stimulus-response associations, are typically contrasted with goal-directed actions, in which behavior is thought to result primarily from action-outcome (A-O) associations. The two behavioral phenomena are distinguished by demonstrating differential sensitivity to outcome devaluation and to alterations in the perceived causal relation between actions and their outcomes (A-O contingency). This paradigm exploits the fact that, unlike habits, goal-directed actions depend on both the incentive value of outcomes and the A-O contingency. Earlier studies that adopted this approach reported that habit formation can be promoted by exposing animals to instrumental training with relatively low A-O contingencies, produced by overtraining or random interval (RI) schedules of reinforcement. In contrast, goal-directed behavior is facilitated by providing relatively few instrumental training sessions or random ratio (RR) schedules of reinforcement. The present study aimed to differentiate between habits and goal-directed actions by exposing mice to RI and RR schedules of reinforcement and subsequently assessing their sensitivity to outcome devaluation and A-O contingency reversal. Results from our first experiment confirmed findings from earlier studies by demonstrating (1) sensitivity to outcome devaluation in RR-trained but not RI-trained mice and (2) no differential responding following outcome devaluation or A-O contingency reversal after prolonged instrumental training. In a second experiment, we re-established that overtraining renders behavior independent of its outcome value. Moreover, we observed that the habit-promoting effects of overtraining might be more pronounced in RR-trained than in RI-trained animals. The present study extends earlier findings by proposing an optimal number of instrumental training sessions for distinguishing habits from goal-directed actions and by addressing the dual role of motivation in outcome devaluation procedures.

Keywords: habits, goal-directed actions, reinforcement schedule, overtraining, outcome devaluation


1 Introduction

As early as 1890, it was acknowledged that habits are essentially ‘nothing but a reflex discharge’ which results from ‘the psychical principles of association’ and lacks ‘any consciously formed purpose’ (James, 1890). Today, habits are still conceptualized as behaviors that are predominantly evoked by stimulus-response (S-R) associations (Dickinson, 1985). The automated, inflexible nature of habits is often contrasted with goal-directed behavior. Unlike habits, goal-directed actions are thought to result primarily from explicit knowledge of action-outcome (A-O) associations (Balleine & O’Doherty, 2010). Flexibly shifting between habits and goal-directed actions allows individuals to perform well-learned actions without conscious deliberation, thereby facilitating efficient allocation of limited cognitive resources, while still actively pursuing goals when required. Failure to alternate between habits and goal-directed actions, on the other hand, has been hypothesized to underlie several psychiatric disorders, including obsessive-compulsive disorder (OCD), Tourette’s syndrome and addiction (Gillan et al., 2011; Gremel & Costa, 2013b; Fineberg et al., 2010; Wiltgen et al., 2012).

1.1 Distinguishing habits from goal-directed actions

Earlier studies investigated the distinction between habits and goal-directed actions by manipulating the perceived causality between responses and outcomes, typically referred to as action-outcome (A-O) contingency, during instrumental training and subsequently assessing the animals’ sensitivity to outcome devaluation and A-O contingency reversal (Balleine & O’Doherty, 2010; Dickinson, 1985). This approach exploits the fact that, in contrast to habits, goal-directed actions are relatively sensitive to shifts in outcome value or A-O contingency (Dickinson, 1985). In other words, if behavior is driven by an internal representation of its consequences, reducing the incentive value of those consequences is expected to extinguish responding. Similarly, degrading the causal relation between actions and their outcomes is expected to affect animals only if they maintain an internal representation of A-O contingencies. Two prerequisites for goal-directed behavior therefore consist of (1) reduced responding following outcome devaluation and (2) reduced responding following A-O contingency degradation (Dickinson, 1985; Yin & Knowlton, 2006). In habitual animals, in contrast, devaluation of the outcome or degradation of the A-O contingency should have no influence on responding. Before discussing the findings of studies that applied this theoretical framework to distinguish habits from goal-directed actions, we first outline the approaches by which reward devaluation and A-O contingency alterations were accomplished in previous studies. Thereafter, we illustrate the strategies employed for promoting either habits or goal-directed actions, as well as the neurobiological correlates of both behaviors.

1.1.1 Outcome devaluation

Outcome devaluation is typically achieved by either pairing the outcome with an illness inducer or feeding subjects to satiety with the outcome (Balleine & O’Doherty, 2010). The latter approach is based on the observation that the hedonic value of a stimulus depends on an individual’s physiological state, a phenomenon referred to as alliesthesia (Berridge, 2001). In other words, hunger increases the hedonic quality of food, whereas satiation decreases it. Earlier studies further indicated that alliesthesia is stimulus-specific, meaning that satiation with one type of food decreases the value of only that food (Balleine & Dickinson, 1985; Berridge, 2001; Valentin et al., 2007; Tricomi et al., 2009).

The effects of outcome devaluation are commonly tested in an extinction paradigm, requiring the animal to rely solely on acquired representations of rewards without actually experiencing the reward (Berridge, 2001; Niv et al., 2006; Tricomi et al., 2009). For instance, Balleine and Dickinson (1998) trained rats to press different levers in order to obtain two rewards. Whereas one lever yielded salt polycose solution, the other was associated with sour polycose solution. After several instrumental training sessions, one of the outcomes was devalued by allowing ad libitum consumption for 1 h. Note that this procedure ensures devaluation of one reward, whereas the value of the other reward remains unchanged. Subsequently, rats underwent a 10-min extinction test, in which lever presses were not rewarded. Results indicated that during extinction, rats performed fewer presses on the lever associated with the devalued outcome relative to the lever that previously produced the non-devalued outcome. Thus, rats maintained specific internal representations of the A-O contingencies, which was interpreted as indicative of goal-directed behavior.

1.1.2 Omission

Besides reduced responding following outcome devaluation, an additional requirement of goal-directed behavior is its sensitivity to alterations in A-O contingency (Yin & Knowlton, 2006). In tasks implementing reversed A-O contingencies, typically referred to as omission tests, animals are required to refrain from responding in order to obtain rewards. Withholding the response acquired during instrumental training for a predefined time interval (e.g. 20 s) results in reward delivery, whereas performing the response resets the timer. While habitual animals are expected to maintain a relatively high response frequency when exposed to omission, goal-directed strategies are characterized by declining response frequencies. Correspondingly, several studies demonstrated increased responding under reversed A-O contingencies in animals previously shown to be insensitive to outcome devaluation, compared to goal-directed animals (DeRusso et al., 2010; Yin et al., 2006).

1.2 Promoting habits or goal-directed actions

The paradigms used for promoting habits or goal-directed actions rely on differential A-O contingency exposure. Specifically, a relatively low A-O contingency, produced through overtraining or random interval (RI) schedules of reinforcement, facilitates habit formation (Adams, 1982; Dickinson, 1985; Yin & Knowlton, 2006). In contrast, higher A-O contingencies, which are experienced early in training or under random ratio (RR) schedules of reinforcement, evoke goal-directed behavior (Dickinson, 1985; Yin & Knowlton, 2006).

The impact of the amount of training on A-O contingencies derives from the fact that during initial training, animals typically experience large variability in both response rate and reward rate. With prolonged training, however, animals show a stable, high-frequency response pattern with little variability over time, resulting in a weak experienced A-O contingency (Dickinson, 1985; Schwabe & Wolf, 2011; Yin & Knowlton, 2006). The distinct effects of the number of training sessions on behavior were first reported by Adams (1982), who demonstrated that food-deprived rats that had repeatedly been exposed to a setting in which a lever press results in reward delivery would eventually press the lever even when the outcome was no longer perceived as pleasurable. In contrast, rats that were exposed to relatively few instrumental training sessions adjusted their behavior according to alterations in outcome value, suggesting that their behavior was goal-directed. Similar findings have been demonstrated in humans and mice (Balleine & O’Doherty, 2010; DeRusso et al., 2010; Valentin et al., 2007). It is noteworthy that the number of training sessions did not affect response rate or motivation, excluding the possibility that these factors confounded the results (Adams, 1982; Gremel & Costa, 2013b; Wiltgen et al., 2012).

Besides through extensive training, reduced A-O contingencies can be produced by adopting an RI schedule of reinforcement, in which the first response after an interval averaging X seconds is reinforced (Dickinson, 1985). RR schedules of reinforcement, in contrast, reinforce an animal after on average X responses, thereby maintaining a relatively high A-O contingency (Dickinson, 1985). The differential effects of reinforcement schedules on A-O contingencies can be explained by considering the correlation between response rate and reward delivery (Dickinson, 1985; Yin & Knowlton, 2006). The rate at which subjects respond in RR schedules is directly proportional to reward delivery, implying that the optimal strategy is a high response frequency. In contrast, RI schedules by definition impose a ceiling on the rate at which rewards can be obtained (Yin & Knowlton, 2006). As only responses after a certain time interval are reinforced, the optimal strategy in RI schedules is regular (as opposed to frequent) responding. Correspondingly, several studies reported reduced response frequencies in RI-trained compared to RR-trained animals (Gremel & Costa, 2013b; Hilário et al., 2007).
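This reward-rate logic can be illustrated with a small simulation. The sketch below is hypothetical and for intuition only (it is not the task code used in this study, and it simplifies by assuming evenly spaced responses): under an RR schedule, rewards scale with the number of responses, whereas under an RI schedule they plateau near duration / mean interval.

```python
import random

def simulate_session(schedule, response_rate, duration=3600,
                     mean_interval=60, ratio=20):
    """Rewards earned in one session of steady responding.

    schedule: 'RR' or 'RI'; response_rate: responses per second
    (responses are assumed evenly spaced, a simplification).
    """
    if schedule == 'RR':
        # RR: every response is reinforced with probability 1/ratio
        # (RR20 -> p = 0.05), so rewards scale with response count.
        n_responses = int(duration * response_rate)
        return sum(random.random() < 1.0 / ratio for _ in range(n_responses))
    # RI: a reward is 'armed' after an exponentially distributed delay
    # (mean = mean_interval); the next response collects it. Reward rate
    # is therefore capped near duration / mean_interval, no matter how
    # fast the animal responds.
    rewards, t = 0, 0.0
    next_arm = random.expovariate(1.0 / mean_interval)
    gap = 1.0 / response_rate
    while t < duration:
        t += gap
        if t >= next_arm:
            rewards += 1
            next_arm = t + random.expovariate(1.0 / mean_interval)
    return rewards
```

Doubling the response rate roughly doubles RR earnings while leaving RI earnings near the same ceiling, which is why frequent responding pays off under RR but regular responding suffices under RI.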

Previous studies indicated that humans as well as mice and rats respond to RI schedules of reinforcement by adopting habit-like behavior, which is reflected in persistent responding even when the outcome is devalued or the A-O contingency is reversed (Balleine & O’Doherty, 2010; Gremel & Costa, 2013a, 2013b; Hilário et al., 2007; Tricomi et al., 2009; Wiltgen et al., 2012; Yin et al., 2004). RR schedules, in contrast, promote goal-directed strategies. For instance, Gremel and Costa (2013a, 2013b) exposed mice to two contexts in which they were trained to perform lever presses on either an RR or an RI schedule of reinforcement. Results indicated that RR-trained animals adapted their behavior to altered outcome values, while RI-trained animals behaved habitually. Similar findings were obtained in earlier studies (Hilário et al., 2007; Wiltgen et al., 2012; Yin et al., 2004).

1.3 Neurobiological underpinnings

The ability to differentiate between habits and goal-directed actions by varying the experienced A-O contingency during training has led numerous studies to investigate the neurobiological framework underlying the transition from goal-directed to habitual behavior in rodents and humans. These studies consistently demonstrated a differential involvement of distinct corticostriatal brain circuits in habitual vs. goal-directed animals. Specifically, the dorsomedial striatum (DMS) and the prelimbic (PLC; rodents) or prefrontal (PFC; humans) cortex appear to be crucial for adopting goal-directed strategies, while the dorsolateral striatum (DLS), along with the infralimbic cortex (ILC), is associated with habit formation. In healthy individuals, these networks are thought to orchestrate behavior in parallel (e.g. Smith & Graybiel, 2014; Yin & Knowlton, 2006). Correspondingly, several studies proposed that the balance between DMS and DLS activity determines the nature of behavior. The gradual transition from goal-directed to habitual behavior thus coincides with increased DLS and ILC involvement together with reduced DMS and PLC involvement. This was supported by Gremel and Costa (2013b), who demonstrated a positive correlation between the firing rate modulation of neuronal ensembles in the DMS and outcome devaluation sensitivity, along with a trend for an inverse relation between DLS firing rate modulation and goal-directedness in mice. Moreover, humans show increased activity in the posterior putamen, the primate homologue of the rodent DLS, in late relative to early training phases, which was interpreted as evidence for an ‘increasing contribution of the DLS to governing behavior as the S-R habit develops’ (Tricomi et al., 2009).

Disrupting the balance between DMS and DLS networks by impairing the functioning of either circuit predisposes individuals to rely on the other, parallel circuit. Both lesions and neuropharmacological inhibition of the DMS or the PLC therefore favor habits over goal-directed actions (Balleine & O’Doherty, 2010; Berridge, 2001; Gremel & Costa, 2013a, 2013b; Schwabe & Wolf, 2011; Yin et al., 2004). Similarly, disruption of DLS or ILC functioning through excitotoxic lesions or muscimol injection has been related to increased sensitivity to outcome devaluation and A-O contingency alterations in rodents (Balleine & O’Doherty, 2010; Gremel & Costa, 2013a, 2013b; Schwabe & Wolf, 2011; Smith & Graybiel, 2014; Yin et al., 2004, 2006). Besides directly targeting corticostriatal loops through lesions or inhibition, activity in these areas can be modulated by manipulating the dopaminergic inputs from the ventral tegmental area and the substantia nigra. Indeed, intact dopaminergic transmission appears to be crucial for shifting from goal-directed to habitual behavior (Balleine & O’Doherty, 2010; Berridge, 2001; Hilário et al., 2007; Smith & Graybiel, 2014; Yin & Knowlton, 2006). In conclusion, the distinction between habits and goal-directed actions observed at the behavioral level is maintained at the neurobiological level, implying that the maladaptive habit formation potentially underlying OCD, Tourette’s syndrome and addiction might result from abnormalities in specific corticostriatal circuits (Gillan et al., 2011; Fineberg et al., 2010).

1.4 Design

Despite the apparent consistency of previous studies that investigated the distinction between habits and goal-directed actions, a number of issues concerning the relation between reinforcement schedules and outcome devaluation sensitivity have remained unresolved. For instance, several studies suggested that the difference between RR- and RI-trained animals is driven by increased response rates in the non-devalued state rather than decreased response rates in the devalued state, which is surprising given the supposed goal-directedness of the former group (Box 1; Dickinson et al., 1985; Gremel & Costa, 2013b; Hilário et al., 2007; Wiltgen et al., 2012). In other words, the absence of between-group differences in measures of performance during the devalued state arguably complicates labelling the behavior of the group with increased responding in the non-devalued state as ‘goal-directed’ (DeRusso et al., 2010). The supposed goal-directedness of RR-trained animals is further confounded by studies suggesting enhanced response frequencies during extinction relative to the response frequencies recorded during the most recent instrumental training (Hilário et al., 2007). Indeed, if the behavior of RR-trained animals is directed towards an internally represented goal, the absence of that goal (in extinction) would be expected to reduce rather than increase response rates.

Additionally, several studies reported no outcome devaluation sensitivity in RR-trained mice (Lederle et al., 2011) as well as significant outcome devaluation-induced reductions in response rates following RI training (Balleine & Dickinson, 1998; Colwill & Rescorla, 1985; DeRusso et al., 2010). Finally, the number of RR and RI training sessions required to establish goal-directed and habitual behavior without risking overtraining is still uncertain. Whereas some studies exposed mice to numerous short training sessions (14x 30 min; Wiltgen et al., 2012), others suggested that relatively few longer sessions suffice to distinguish the effects of RR and RI reinforcement schedules (4x 90 min; Hilário et al., 2007). Moreover, earlier studies differed with respect to the reinforcers used and the measure of performance analyzed (Gremel & Costa, 2013a, 2013b; Hilário et al., 2007; Wiltgen et al., 2012). Therefore, the present study aimed to replicate the findings obtained by earlier studies (specifically: Gremel & Costa, 2013b) and to elucidate which parameters are required for observing a differentiation between reinforcement schedule (RI vs. RR) and behavior (habits vs. goal-directed actions). This might contribute to the development of a reliable behavioral framework for establishing habits and goal-directed actions.

Box 1. Differential responding between RI- and RR-trained animals following outcome devaluation

Numerous studies that distinguished habits from goal-directed actions based on outcome devaluation sensitivity demonstrated results comparable to the simulated data depicted in Fig. 1. As illustrated, the differential sensitivity to outcome devaluation appears to be driven by a relative increase in response frequency in the non-devalued state for RR-trained animals. Although this finding might seem counterintuitive (for instance, see Niv et al., 2006), several studies suggested that goal-directed actions might be less affected by shifts in primary motivation (induced by satiety) than habits (Balleine & O’Doherty, 2010; Niv et al., 2006; Wiltgen et al., 2012). Satiety, produced by ad libitum access to either the reward (devalued state) or a different food (non-devalued state), therefore affects RI-trained animals to a greater extent than RR-trained animals.

Niv et al. (2006) explained the above dissociation by introducing two distinct pathways via which motivation regulates behavior, namely (1) a directing effect, which is outcome-specific and facilitates goal-directed behavior, and (2) an energizing effect, which is outcome-independent and underlies habitual behavior. Correspondingly, Dickinson and Balleine (2000, p. 197) noted that ‘goal-directed actions are controlled by desires for outcomes that are not directly modulated by the relevant states of the motivational system.’ Thus, directing motivational effects (conceptually equivalent to the cognitive system proposed by Dickinson and Balleine (2000, pp. 185-204)) might drive increased responding in the non-devalued state in RR-trained animals. In contrast, RI-trained animals might be predominantly driven by energizing aspects of motivation (equivalent to the motivational system proposed by Dickinson and Balleine (2000)), and therefore show reduced response frequencies irrespective of state.

Fig. 1| Differential sensitivity to outcome devaluation as reported by earlier studies (hypothetical data; responses in arbitrary units for the devalued and non-devalued state in RI- and RR-trained animals).

2.1 Methods - experiment I

2.1.1 Mice

C57Bl/6J male mice (N=8) were ordered from Harlan Laboratories (Boxmeer, the Netherlands) at two months of age. Upon arrival, mice were housed individually and exposed to a reversed light-dark cycle. All cages (20 x 36 x 14 cm) were transparent and equipped with sawdust, nesting material, and a plastic tube. Mice had ad libitum access to water and food in their home cage. After two weeks of acclimatization, mice were handled and weighed daily for five weeks. Subsequently, mice were food-deprived in order to reduce their weight to 85% of free-fed weight. Dietary restriction ended after completion of the experiments. This study was approved by the Animal Experimentation Committee (DEC) KNAW (Amsterdam).


2.1.2 Behavioral procedures

All animals underwent two training sessions per day (described below; Fig. 2), after which they received 1.5-3.0 g of home cage chow. Training and testing were conducted in four sound-attenuating, light-resistant operant chambers (21.6 x 17.8 x 12.7 cm, Med Associates Inc.). Each chamber was equipped with a single nosepoke hole located left or right of a central food magazine. The opposite wall contained a house light (3 W, 24 V), which signaled the start and completion of training sessions and tests (i.e. outcome devaluation, omission). During operant training, animals were rewarded with Bio-Serv dustless precision pellets (formula F0071; 20 mg per reward) or sucrose solution (20%; 20-30 µl per reward). Nosepoke hole position and reward type were counterbalanced across conditions (RI vs. RR) and kept constant across operant boxes. The duration of each training session, the number of nosepokes performed and the number of head entries into the food magazine were recorded using Med-PC IV.

2.1.3 Magazine and continuous reinforcement training

Prior to RI or RR training, animals underwent a 30-min magazine training, during which the nosepoke hole was absent. A reward was delivered at random intervals averaging 60 s. Next, nosepoke holes were installed in each operant box and mice completed three continuous reinforcement training sessions, in which each nosepoke was rewarded. Sessions ended after 60 min or when animals had obtained the maximum number of rewards, which progressively increased from 5 to 15 to 30 (Fig. 2).

2.1.4 Random interval training

RI training commenced with one day of RI30 training followed by two days of RI60 training (Fig. 2). Sessions ended after 90 min or when animals had received 30 rewards. During RI training, each nosepoke after a fixed interval of 3 s (RI30) or 6 s (RI60) had a probability of 0.1 of being reinforced (Fleshler & Hoffman, 1962). Importantly, the probability of obtaining a reward remained constant over time, implying that the most efficient strategy would be to perform one nosepoke every 3 (RI30) or 6 (RI60) seconds. Immediately after each training session, mice were placed in their home cage and received the reward they had not obtained during training. For instance, mice that were reinforced with pellets (contingent reward; CR) received 0.9 ml sucrose solution (non-contingent reward; NCR) in their home cage. Vice versa, when the CR was sucrose solution, the NCR consisted of 20 pellets (i.e. 0.4 g). The NCR was provided in a plastic cup attached to the home cage, which was removed after NCR consumption. During NCR consumption, which typically lasted 5-15 min, animals did not have access to water.
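The reinforcement check just described can be sketched as follows. This is an illustrative reading of the text, not the actual Med-PC program; implementations following Fleshler and Hoffman (1962) may instead draw the schedule's arming times probabilistically, but the version below matches the description given here.

```python
import random

class RandomIntervalSchedule:
    """Sketch of the RI check as described: once the fixed sub-interval
    (3 s for RI30, 6 s for RI60) has elapsed since the last reward,
    each nosepoke is reinforced with probability p = 0.1. An animal
    poking once per sub-interval therefore needs on average ten
    eligible pokes per reward (~30 s for RI30, ~60 s for RI60)."""

    def __init__(self, sub_interval, p=0.1):
        self.sub_interval = sub_interval
        self.p = p
        self.last_reward = 0.0  # session start counts as the last reward

    def nosepoke(self, t):
        """Return True if the nosepoke at session time t (s) is reinforced."""
        if t - self.last_reward < self.sub_interval:
            return False            # sub-interval not yet elapsed
        if random.random() < self.p:
            self.last_reward = t    # reward delivered; interval restarts
            return True
        return False
```

Setting p = 1.0 makes the schedule deterministic, which is convenient for checking the interval logic in isolation.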

2.1.5 Random ratio training

Animals in the RR group completed two RR10 training sessions before progressing to RR20 trainings (Fig. 2). Again, trainings ended after 90 min or after 30 rewards were obtained. During RR trainings, each poke was rewarded with a probability of 0.1 (RR10) or 0.05 (RR20). Thus, the rate rather than the timing of nosepokes determined reward delivery. Upon completion of each training session, animals received an NCR in their home cage.


2.1.6 Outcome devaluation test

Outcome devaluation experiments consisted of two 5-min extinction tests, in which nosepokes were not rewarded, each preceded by 1 h of ad libitum access to the NCR (non-devalued state) or the CR (devalued state). Pre-feeding was performed in a new transparent cage (20 x 36 x 14 cm) containing sawdust; water bottles were removed. Tests were conducted on two consecutive days, and the order of states (non-devalued vs. devalued) was counterbalanced across conditions (RI vs. RR) and CR type (pellets vs. sucrose solution).

2.1.7 Omission test

Omission tests were performed to investigate the effects of A-O contingency reversal in RI and RR-trained animals. Specifically, mice were rewarded for each interval of 20 s in which they refrained from nosepoking. Failure to withhold responses reset the timer, thereby delaying reward delivery. Similar to reinforcement training, mice could receive a maximum of 30 rewards. Omission tests were preceded by two RI60/RR20 trainings and were repeated twice (Fig. 2).
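The omission contingency just described (one reward for every 20 s of withholding, pokes resetting the timer, capped at 30 rewards) can be sketched as follows. This is an illustrative reconstruction, not the actual experiment code:

```python
def omission_rewards(poke_times, duration, hold=20.0, max_rewards=30):
    """Rewards earned in an omission test: one reward for every full
    `hold`-second stretch without a nosepoke; each poke resets the timer.

    poke_times: nosepoke times (s) within the session;
    duration: session length (s)."""
    rewards, timer_start = 0, 0.0
    for t in sorted(poke_times) + [duration]:
        # Every completed `hold` interval since the last poke earns a reward.
        rewards += int((t - timer_start) // hold)
        if rewards >= max_rewards:
            return max_rewards
        timer_start = t  # the poke (or session end) resets the timer
    return rewards
```

For example, a mouse that never pokes during a 100-s session earns five rewards, whereas pokes at 10 s and 25 s leave only the rewards collectable after the final poke.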

2.2 Methods - experiment II

The second experiment aimed to validate the results obtained in experiment I. The experimental designs were similar, although a few adaptations were made in order to re-establish the relation between reinforcement training schedule and sensitivity to outcome devaluation.

2.2.1 Mice

Eight male C57Bl/6J mice were ordered from Harlan Laboratories and housed as described earlier. Following one week of habituation, mice were handled and food-restricted to 85% of their free-fed weight, which was maintained for the duration of the experiment.

2.2.2 Adaptations

The context and design of experiment II were identical to those of experiment I, with one important exception: a performance threshold was introduced for the transition from RI30/RR10 to RI60/RR20 training. Animals that obtained fewer than 21 rewards (70%) in the second RI30/RR10 training session underwent additional training on these schedules before transitioning to RI60/RR20 training. This performance criterion was applied to prevent between-group differences in the acquisition of the reinforcement schedules and is in agreement with the recommendations of Rossi and Yin (2013).

Based on the results obtained in experiment I, a second adaptation concerned the number of training sessions preceding the first devaluation test. As illustrated in Fig. 2, animals underwent 16 rather than four RI60/RR20 training sessions before experiencing outcome devaluation. This first outcome devaluation test was followed by four additional training sessions and a second outcome devaluation test. Thereafter, animals were subjected to two RI60/RR20 training sessions and an omission test.


Fig. 2| Schematic overview of experimental procedures for experiment I (left) and II (right).

Numbers denote the number of training sessions provided; animals underwent two training sessions per day. Session durations: magazine training (30 min); continuous reinforcement training (60 min or 5/15/30 rewards); random interval training (90 min or 30 rewards); random ratio training (90 min or 30 rewards).

*Note: animals progressed to RI60/RR20 training only after they had obtained at least 21 rewards in RI30/RR10 training.


2.3 Analysis

To verify similar training performance between animals assigned to RI and RR training, repeated-measures analyses of variance (ANOVAs) with time (training session) as a within-subjects factor and condition (RI vs. RR training) as a between-subjects factor were conducted. Several indicators of performance were included as dependent variables, including the number of nosepokes, nosepoke rate, magazine entries and rewards earned. In case the assumption of sphericity was violated, Greenhouse-Geisser or Huynh-Feldt corrections were applied according to the recommendations of Field (2009, pp. 460-461). Additionally, approximate normality of the dependent variable across conditions was verified by visual inspection of frequency distributions. Interaction or main effects were considered significant at p<0.05 and further examined in post hoc tests. For the latter, α was set to match a family-wise error rate (FWER) of 0.10 (see formula below, relating the FWER to the per-comparison α across N tests). This relatively liberal FWER was considered appropriate given the stringent nature of this correction, first introduced by Šidák, in the presence of multiple dependent tests (Abdi, 2007).

FWER = 1 − (1 − α)^N
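Solving this formula for the per-comparison α gives α = 1 − (1 − FWER)^(1/N). A minimal sketch of the computation (illustrative only; the analyses themselves were run in SPSS):

```python
def sidak_alpha(fwer, n_tests):
    """Per-comparison alpha that keeps the family-wise error rate at
    `fwer` across n_tests tests under the Sidak correction:
        FWER = 1 - (1 - alpha)**n_tests
        =>  alpha = 1 - (1 - fwer)**(1 / n_tests)
    """
    return 1.0 - (1.0 - fwer) ** (1.0 / n_tests)

# e.g. with the FWER of 0.10 used here and two post hoc tests:
# sidak_alpha(0.10, 2) ≈ 0.0513
```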

In order to avoid confounding effects of training performance on behavioral measures during extinction or omission, data were normalized whenever repeated-measures ANOVAs suggested between-group differences in training performance. This normalization ensured that differences between RI- and RR-trained animals during extinction or omission could not be attributed to baseline differences in acquisition. Similar to Hilário et al. (2007), we normalized the number of nosepokes (norm.NP) by expressing the number of nosepokes (NP) as a percentage of the number of nosepokes measured during the most recent reinforcement training (NPtraining; see formula below). Note that this normalization differs from the normalization applied by Gremel and Costa (2013b), who transformed the number of responses according to the total number of responses performed during the devalued and non-devalued states. As we believe this conversion might violate the assumption of independence held by t-tests (Appendix, ‘Normalization’), we adopted the normalization introduced by Hilário et al. (2007). In cases where the normalized number of nosepokes was selected as a dependent variable, analyses on the non-normalized number of nosepokes are reported in the Appendix, and vice versa.

norm.NP = (NP / NPtraining) × 100
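The normalization amounts to a one-line helper (an illustrative sketch; `np_test` and `np_training` are hypothetical variable names):

```python
def normalize_nosepokes(np_test, np_training):
    """Nosepokes during extinction/omission as a percentage of the
    nosepokes in the most recent reinforcement training
    (after Hilario et al., 2007)."""
    if np_training <= 0:
        raise ValueError("np_training must be positive")
    return 100.0 * np_test / np_training
```

Values above 100% indicate more responding during the test than during the most recent training session, as reported for extinction by Hilário et al. (2007).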

Analyses of outcome devaluation tests consisted of repeated-measures ANOVAs with state (devalued vs. non-devalued) as a within-subjects factor and condition (RI vs. RR training) as a between-subjects factor. Finally, results of omission tests were analyzed by testing the interaction and main effects of time (first vs. second omission test) and condition (RI vs. RR training) on the number of nosepokes performed during omission tests. Data were analyzed using SPSS (version 22) and visualized with GraphPad Prism (version 6).


3.1 Results - experiment I

3.1.1 Outcome devaluation I

Corresponding to protocols described earlier (Gremel & Costa, 2013b; Rossi & Yin, 2013), mice (N=8) underwent two RI30/RR10 trainings followed by four RI60/RR20 trainings prior to the first outcome devaluation test. A Greenhouse-Geisser corrected repeated-measures ANOVA indicated a trend-significant main effect of time (F(1.47,8.83)=4.01; p=0.07; partial η2=0.40) on the number of nosepokes, suggesting that animals performed more nosepokes as trainings proceeded (Fig. 3a). Importantly, we found no time*condition interaction effect (F(1.47,8.83)=0.48; p=0.58; partial η2=0.07) nor a main effect of condition (F(1,6)=0.15; p=0.72; partial η2=0.02) on the number of nosepokes prior to the first outcome devaluation test. Other indicators of performance, including nosepoke rate and the numbers of magazine entries and rewards obtained, also showed no significant time*condition interaction effect or main effect of condition (Appendix, Fig. A1, table 1.1).

As our data indicated no between-group differences in acquisition parameters, we compared the non-normalized number of nosepokes during extinction between RI- and RR-trained animals. Extinction was conducted in a devalued state (following 1 h of ad libitum access to the CR) and a non-devalued state (following 1 h of ad libitum access to the NCR). As illustrated in Fig. 3b, results indicated no state*condition interaction effect (F(1,6)=2.12; p=0.20; partial η2=0.26) nor a main effect of state (F(1,6)=2.95; p=0.14; partial η2=0.33) or condition (F(1,6)=2.77; p=0.15; partial η2=0.32) on the number of nosepokes performed. Similarly, there were no interaction or main effects on the normalized number of nosepokes (Appendix, Fig. A1d, table 1.2).

Fig. 3| Number of nosepokes performed during RI/RR training (a) and outcome devaluation I (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; #: trend

3.1.2 Outcome devaluation II

After the first outcome devaluation tests, animals were subjected to four additional RI60/RR20 trainings. Repeated-measures ANOVA suggested no time*condition interaction effect on the number of nosepokes performed during these trainings (F(3,18)=0.91; p=0.46; partial η2=0.13) nor a main effect of condition (F(1,6)=0.01; p=0.92; partial η2<0.01), suggesting that the changes in training performance over time were similar between RI- and RR-trained animals (Fig. 4a). The main effect of time on the number of nosepokes was significant at trend level (F(3,18)=2.57; p=0.09; partial η2=0.30). Although the RI and RR groups performed similarly when the number of nosepokes, nosepoke rate or magazine entries were analyzed, results indicated a significant time*condition interaction effect on the number of rewards obtained (Appendix, Fig. A2c, table 2.1). Post hoc tests suggested that RR-trained animals earned more rewards than RI-trained animals during the first training session (t(6)=2.35, p=0.06; RI: mean 11.75, std. 12.53; RR: mean 28.00, std. 2.83), but not in subsequent trainings. Note that this difference was not significant at α=0.03, which was applied to ensure a FWER of 0.10.

During outcome devaluation tests, a repeated-measures ANOVA indicated a significant state*condition interaction effect (F(1,6)=20.13, p=0.004, partial η2=0.77) as well as main effects of state (F(1,6)=29.35, p=0.002, partial η2=0.83) and condition (F(1,6)=8.25, p=0.03, partial η2=0.58) on the number of nosepokes performed. These effects remained after data were normalized according to performance during the last training session (Appendix, Fig. A2d, table 2.2). Post hoc tests confirmed that animals assigned to RR trainings performed significantly fewer nosepokes in the devalued (mean 12.25, std. 10.05) compared to the non-devalued state (mean 281.00, std. 108.54; t(3)=5.27, p=0.01; Fig. 4b). RI-trained animals, in contrast, showed no difference between the number of nosepokes in devalued (mean 20.75, std. 35.07) versus non-devalued states (mean 46.00, std. 72.23; t(3)=1.35, p=0.27). Similar results were obtained when the normalized number of nosepokes was analyzed (Appendix, Fig. A2d, table 2.2).

Fig. 4| Number of nosepokes performed during RI/RR training (a) and outcome devaluation II (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; #: trend; *: significant at α=0.05

3.1.3 Outcome devaluation III

According to the literature, overtraining promotes the transition from goal-directed to habitual behavior (e.g. Adams, 1982). We therefore provided eight additional training sessions with the aim to investigate whether the results obtained from outcome devaluation II would persist after prolonged RI/RR training. Although we found no main effects of time (F(7,42)=1.24, p=0.30, partial η2=0.17) or condition (F(1,6)=0.67, p=0.45, partial η2=0.10) on the number of nosepokes, results indicated a trend for a time*condition interaction effect (F(7,42)=1.97, p=0.08, partial η2=0.25). However, between-group differences in the number of nosepokes during individual trainings were absent in post hoc tests (Fig. 5a). When other performance measures were selected as dependent variables, results indicated a main effect of time (nosepoke rate) and main effects of condition (magazine entries, rewards), respectively (Appendix, Fig. A3, table 3.1).

As potential between-group differences in the number of nosepokes across training sessions were not established in post hoc tests, the non-normalized number of nosepokes was selected as the dependent variable for evaluating the effects of outcome devaluation in both groups. A repeated-measures ANOVA revealed a significant state*condition interaction effect (F(1,6)=38.82, p=0.001, partial η2=0.87) as well as significant main effects of state (F(1,6)=41.33, p=0.001, partial η2=0.87) and condition (F(1,6)=25.12, p=0.002, partial η2=0.81) on the number of nosepokes performed during extinction. Similar to results from the second outcome devaluation, data revealed an increased number of nosepokes in the non-devalued (mean 201.00, std. 66.16) compared to the devalued state (mean 24.50, std. 34.89) for RR-trained animals (t(3)=6.49, p=0.007; Fig. 5b), but not for RI-trained animals (non-devalued: mean 28.50, std. 29.69; devalued: mean 15.25, std. 2.36; t(3)=0.92, p=0.42). Correcting for individual differences in performance during the most recent RI/RR training did not affect these results (Appendix, Fig. A3d, table 3.2).

Fig. 5| Number of nosepokes performed during RI/RR training (a) and outcome devaluation III (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; **: significant at α=0.01

3.1.4 Outcome devaluation IV

As the additional training sessions provided prior to devaluation test III did not neutralize the between-group differences in outcome devaluation sensitivity established in outcome devaluation II, the overtraining procedure was repeated. Thus, animals underwent eight additional trainings in order to test the hypothesis that overtraining triggers habit formation. During these trainings, animals in the RR and RI groups performed a similar number of nosepokes across trainings (time*condition interaction effect: F(7,42)=1.75, p=0.12, partial η2=0.23; main effect time: F(7,42)=1.66, p=0.15, partial η2=0.22; main effect condition: F(1,6)=0.34, p=0.58, partial η2=0.05; Fig. 6a). Other indicators of performance suggested that, similar to trainings preceding devaluation test III, animals in the RR group performed fewer magazine entries and responded at a higher rate compared to RI-trained animals (Appendix, Fig. A4, table 4.1). The absence of time*condition interaction effects or main effects of time on all parameters suggests that animals were no longer (differentially) improving their performance over time. In other words, performance was stable, which likely reflects overtraining.

After RI/RR trainings, animals were exposed to outcome devaluation test IV, which consisted of ad libitum access to the contingent (devalued) or non-contingent (non-devalued) reward followed by a 5-min extinction test. Unlike the results obtained from outcome devaluations II and III, data revealed no state*condition interaction effect on the number of nosepokes (F(1,6)=0.50, p=0.51, partial η2=0.08; Fig. 6b) nor main effects of state (F(1,6)=1.79, p=0.23, partial η2=0.23) or condition (F(1,6)=0.29, p=0.61, partial η2=0.05). This was further supported in analyses of the normalized number of nosepokes (Appendix, Fig. A4d, table 4.2).

Fig. 6| Number of nosepokes performed during RI/RR training (a) and outcome devaluation IV (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio

3.1.5 Outcome devaluation V

To confirm that the absence of a state*condition interaction effect on the number of nosepokes during devaluation test IV resulted from habit formation, animals were exposed to four RI/RR trainings followed by a final outcome devaluation test. A Greenhouse-Geisser corrected repeated-measures ANOVA indicated no time*condition interaction effect on the number of nosepokes during training (F(1.27,7.62)=1.08, p=0.35, partial η2=0.15). Moreover, there were no main effects of time (F(1.27,7.62)=0.43, p=0.58, partial η2=0.07) or condition (F(1,6)=0.27, p=0.62, partial η2=0.04) on the number of nosepokes (Fig. 7a). In contrast, other performance measures, including nosepoke rate, magazine entries and rewards obtained, differed significantly between RI- and RR-trained animals (Appendix, Fig. A5, table 5.1). These differences were also observed in analyses of trainings preceding previous outcome devaluations (Appendix, Fig. A3, table 3.1; Fig. A4, table 4.1).

Similar to the previous outcome devaluation, RI- and RR-trained animals showed no differences in the number of nosepokes performed during extinction in devalued versus non-devalued states (state*condition interaction effect: F(1,6)=0.51, p=0.50, partial η2=0.08; main effect condition: F(1,6)=0.11, p=0.75, partial η2=0.02). However, results indicated a trend for a main effect of state, which was confirmed in a post hoc test. That is, RR-trained animals performed significantly fewer nosepokes compared to RI-trained animals in the devalued state (t(6)=2.94, p=0.03; RI: mean 5.00, std. 2.31; RR: mean 1.50, std. 0.58; Fig. 7b), but not in the non-devalued state (t(6)=0.19, p=0.86; RI: mean 10.50, std. 7.33; RR: mean 11.75, std. 10.87). The main effect of state (i.e. devalued vs. non-devalued) was not detected when the number of nosepokes was normalized to performance during the most recent training (Appendix, Fig. A5d, table 5.2).

Fig. 7| Number of nosepokes performed during RI/RR training (a) and outcome devaluation V (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; *: significant at α=0.05

3.1.6 Omission

Prior to omission tests, during which A-O contingency was reversed, animals underwent two RI/RR trainings to reinstate nosepoke behavior following outcome devaluation V. Analyses of performance during these trainings revealed no time*condition interaction effect (F(1,6)=0.02, p=0.88, partial η2<0.01) nor main effects (time: F(1,6)=0.02, p=0.90, partial η2<0.01; condition: F(1,6)=0.87, p=0.39, partial η2=0.13) on the number of nosepokes performed or on nosepoke rate (Fig. 8a; Appendix, Fig. A6, table 6.1). In contrast, the numbers of magazine entries and rewards obtained differed significantly between groups (Appendix, Fig. A6, table 6.1).

Repeated-measures ANOVAs yielded no time*condition interaction effect (F(1,6)=1.35, p=0.29, partial η2=0.18) nor main effects (time: F(1,6)=0.97, p=0.36, partial η2=0.14; condition: F(1,6)=2.56, p=0.16, partial η2=0.30) on the number of nosepokes (Fig. 8b) or the normalized number of nosepokes performed during omission (Appendix, Fig. A6d, table 6.2).


Fig. 8| Number of nosepokes performed during RI/RR training (a) and omission (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio

3.2 Results - experiment II

3.2.1 Outcome devaluation I

In order to replicate the findings obtained in experiment I, the applied protocol was repeated with a second group of C57Bl/6j mice (N=8). Apart from minor adaptations (see Methods), the experimental design in experiment II was identical to that of experiment I. Notably, the implementation of a 70% performance criterion resulted in a total of five RI30 or three RR10 trainings.

Prior to the first outcome devaluation test, animals underwent 16 RI/RR trainings. As illustrated in Fig. 9a, the acquisition of nosepoke behavior across trainings differed between RI- and RR-trained mice. This was confirmed by a repeated-measures ANOVA, which revealed a trend for a time*condition interaction effect on the number of nosepokes performed (F(15,90)=1.65, p=0.08, partial η2=0.22). Furthermore, results indicated significant main effects of time (F(15,90)=1.78, p=0.05, partial η2=0.23) and condition (F(1,6)=10.45, p=0.02, partial η2=0.64) on the number of nosepokes. Similarly, we found a time*condition interaction effect on nosepoke rate and main effects of time on both nosepoke rate and magazine entries (Appendix, Fig. A8, table 8.1). It is noteworthy that despite the higher number of nosepokes performed by RI-trained animals, this group also nosepoked at a lower rate, perhaps reflecting a between-group difference in what constitutes the most efficient response strategy under RI versus RR schedules. In contrast to other measures of performance, the number of rewards obtained during training was stable over time and similar across groups (Appendix, Fig. A8c, table 8.1).

In order to further examine between-group differences in the number of nosepokes performed, post hoc tests with α=0.007 were conducted. Results suggested that starting from the sixth RI/RR training, RI-trained animals performed more nosepokes compared to RR-trained animals (p<0.05, uncorrected). This difference was significant at the corrected α at both the 13th training (t(6)=4.34, p=0.005; RI: mean 1097.00, std. 212.59; RR: mean 595.25, std. 91.07) and the last training (t(6)=5.47, p=0.002; RI: mean 1157.75, std. 194.92; RR: mean 534.75, std. 117.81). Given these between-group differences in performance, the number of nosepokes was normalized to the most recent RI/RR training in subsequent analyses.

Analyses of the normalized number of nosepokes performed during extinction revealed no state*condition interaction effect (F(1,6)=0.18, p=0.69, partial η2=0.03; Fig. 9b) nor a main effect of state (F(1,6)=0.69, p=0.44, partial η2=0.10). We found a trend for a main effect of condition on the normalized number of nosepokes (F(1,6)=4.22, p=0.09, partial η2=0.41), but this was not confirmed in post hoc analyses (devalued state: t(6)=0.92, p=0.39; RI: mean 3.76, std. 3.76; RR: mean 10.43, std. 13.95; non-devalued state: t(3.02)=1.36, p=0.27; RI: mean 6.52, std. 1.14; RR: mean 18.86, std. 18.08). Finally, between-group differences were absent when the non-normalized number of nosepokes was analyzed (Appendix, Fig. A8d, table 8.2).

Fig. 9| Number of nosepokes performed during RI/RR training (a) and outcome devaluation I (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; *: significant at α=0.05

3.2.2 Outcome devaluation II

In order to examine the effects of prolonged training, animals were exposed to four additional trainings following devaluation test I. Moreover, this procedure allowed us to verify whether prior experience with the devaluation procedure might account for the differences observed between experiment I (outcome devaluation III) and experiment II (outcome devaluation I). Repeated-measures ANOVAs1 revealed a trend for a time*condition interaction effect on the number of nosepokes (F(2,12)=3.15, p=0.08, partial η2=0.34). Furthermore, results indicated significant main effects of time (F(2,12)=4.08, p=0.05, partial η2=0.41) and condition (F(1,6)=120.20, p<0.001, partial η2=0.95) on the number of nosepokes (Fig. 10a). Subsequent post hoc tests with α=0.03 confirmed that RI-trained animals performed more nosepokes than RR-trained animals during training 18 (t(6)=4.06, p=0.01; RI: mean 1172.25, std. 209.03; RR: mean 650.25, std. 149.35) and training 20 (t(6)=5.87, p=0.001; RI: mean 1458.50, std. 305.24; RR: mean 551.25, std. 50.05). Similar findings were obtained when the nosepoke rate was selected as a dependent variable (Appendix, Fig. A9a, table 9.1). In line with the analyses of performance measures during the first 16 trainings, magazine entries linearly decreased over time for both groups, whereas the number of rewards obtained remained stable over time and across groups (Appendix, Fig. A9, table 9.1).

1 Due to technical issues, the number of nosepokes, nosepoke rate and magazine entries were not recorded for the 19th training. As a consequence, analyses of these measures included three instead of four trainings.

For the second outcome devaluation test, results indicated no state*condition interaction effect (F(1,6)=0.50, p=0.51, partial η2=0.08) nor main effects of state (F(1,6)=1.89, p=0.22, partial η2=0.24) or condition (F(1,6)=0.50, p=0.51, partial η2=0.08) on the normalized number of nosepokes (Fig. 10b). Similar results were obtained in analyses of non-normalized data (Appendix, Fig. A9d, table 9.2).

Fig. 10| Number of nosepokes performed during RI/RR training (a) and outcome devaluation II (b). Bars represent the standard error of the mean (SEM). RI: random interval; RR: random ratio; *: significant at α=0.05; ***: significant at α=0.005

3.2.3 Omission

The between-group differences in the number of nosepokes performed during RI/RR trainings prior to outcome devaluation II persisted during the trainings preceding omission tests (time*condition interaction effect: F(1,6)=1.06, p=0.34, partial η2=0.15; main effect time: F(1,6)=1.97, p=0.21, partial η2=0.25; main effect condition: F(1,6)=8.96, p=0.02, partial η2=0.60; Fig. 11a). Post hoc tests revealed that RI-trained mice performed significantly more nosepokes compared to RR-trained mice on the first training (t(6)=3.00; p=0.02; RI: mean 1027.75, std. 276.83; RR: mean 583.75, std. 104.04) but not on the final training (t(6)=1.22; p=0.27; RI: mean 746.50, std. 316.45; RR: mean 540.50, std. 116.56). Corresponding to earlier analyses of training performance, results suggested that RI-trained animals responded at a lower rate than RR-trained animals during the final, but not the first, training (Appendix, Fig. A10a, table 10.1). Furthermore, there was a time*condition interaction effect on the number of magazine entries, which was not confirmed in post hoc tests (Appendix, Fig. A10b, table 10.1).

Given the substantial difference in nosepoking behavior between the RI and RR groups, which was established in the analysis of earlier trainings as well, the number of nosepokes performed during omission tests was normalized according to individual performance during the final training. Results revealed significant main effects of time (F(1,6)=8.56, p=0.03; partial η2=0.59) and condition (F(1,6)=5.97; p=0.05; partial η2=0.50) on the normalized number of nosepokes (Fig. 11b). We found no time*condition interaction effect (F(1,6)=3.72, p=0.10, partial η2=0.38), although this effect was present along with main effects when the non-normalized number of nosepokes was selected as a dependent variable (Appendix, Fig. A10d, table 10.2). Post hoc tests suggested that during both omission tests, RR-trained animals tended to perform more nosepokes compared to RI-trained animals (first omission test: RI mean 59.97, std. 45.38; RR mean 236.47, std. 152.36; second omission test: RI mean 23.11, std. 19.14; RR mean 57.17, std. 11.51). Notably, the between-group differences were most pronounced in the second omission test (first omission test: t(6)=2.22, p=0.07; second omission test: t(6)=3.05, p=0.02). Furthermore, post hoc analyses revealed that the RR group, but not the RI group, showed a trend for reduced responding in the second compared to the first omission test, which might result from a floor effect in the latter (RI: t(3)=1.99, p=0.14; RR: t(3)=2.51, p=0.09). Similar findings were obtained in analyses of the non-normalized number of nosepokes (Appendix, Fig. A10d, table 10.2).

Fig. 11| Number of nosepokes performed during RI/RR training (a) and omission (b). Bars represent the standard error of the mean (SEM).


4 Discussion

With experiment I, we aimed to replicate the finding that variability in the experienced contingency between actions and their outcomes, operationalized by exposing animals to RI and RR trainings, accounts for differences in the sensitivity to outcome devaluation (Gremel & Costa, 2013b). In other words, we investigated whether RI trainings promote habit formation, while RR trainings facilitate goal-directed actions in mice. Although we were unable to confirm a dissociation between training (RI vs. RR) and behavior (habitual vs. goal-directed) after four RI60/RR20 trainings, our results suggested substantial between-group differences in sensitivity to outcome devaluation following eight trainings (outcome devaluation II), which were even more pronounced after 16 trainings (outcome devaluation III). Conducting eight additional trainings rendered previously goal-directed animals (i.e. RR-trained animals) indifferent to outcome devaluation, perhaps reflecting the habit-evoking effects of overtraining (Adams, 1982; Dickinson, 1985; Colwill & Rescorla, 1985). This finding was validated in outcome devaluation V, prior to which animals had received a total of 28 trainings. Finally, the hypothesized habitual nature of both groups was observed in omission tests, which revealed no between-group differences in responsiveness to A-O contingency reversal.

Experiment II was carried out to validate the results obtained in the first experiment. Given the distinct pattern of results obtained in outcome devaluation III in experiment I, animals were exposed to 16 consecutive trainings, after which outcome devaluation responsivity was assessed. Contrary to experiment I, results indicated no effect of training on the number of nosepokes performed in devalued versus non-devalued states, and this pattern persisted in a second outcome devaluation procedure. Although outcome devaluation tests implied that RR training was unable to evoke goal-directed behavior, omission tests suggested differential responsiveness to A-O contingency reversal in the two groups. Remarkably, RR-trained animals performed more (normalized) nosepokes during omission compared to RI-trained animals, suggesting that the former group behaved habitually rather than in a goal-directed manner. Indeed, similar to unresponsiveness to outcome devaluation, insensitivity to A-O contingency reversal, as indicated by persistent responding during omission, has previously been related to habitual behavioral strategies (Yin et al., 2006). In conclusion, the second experiment suggested that prolonged instrumental training promoted habitual behavior in RR-trained mice, characterized by insensitivity to outcome devaluation and omission. Although RI-trained mice were relatively sensitive to contingency reversal, it is unlikely that these animals behaved according to goal-directed strategies, since outcome devaluation did not affect their response frequencies.

4.1 Overtraining

Despite our efforts to minimize the differences in training parameters between experiments I and II, we were unable to confirm the distinction in sensitivity to outcome devaluation between RI- and RR-trained animals reported in experiment I (outcome devaluation III). A plausible explanation for this concerns the different training schedules applied in the two experiments. Whereas animals in experiment I received 16 trainings interrupted by two outcome devaluation tests, animals in experiment II were trained consecutively. Additionally, experiment II introduced a performance criterion prior to RI60/RR20 training, which resulted in the implementation of three extra RI30 or one extra RR10 training(s). Together, these modifications might have simulated overtraining, which promotes habit formation regardless of reinforcement schedule (Yin & Knowlton, 2006).

The substantial effects of prolonged consecutive training on pre-outcome devaluation status are illustrated by the differences in performance measures assessed prior to outcome devaluation III (experiment I; Appendix, Fig. A7.1, table 7.1) and outcome devaluation I (experiment II; Appendix, Fig. A8.1, table 8.1). Although RR-trained animals differed moderately in nosepoke rate prior to outcome devaluation III (experiment I) and outcome devaluation I (experiment II), the absolute number of nosepokes performed by animals assigned to RR reinforcement schedules in both experiments was remarkably similar (experiment I: mean 551.16, std. 65.23; experiment II: mean 604.16, std. 21.23; Appendix, Fig. A7, Fig. A8). However, animals in experiment II immediately succeeded at obtaining the maximum number of rewards, whereas the RR group in experiment I accomplished this only after eight instrumental trainings. Consequently, the first group might have experienced a relatively solid A-O contingency (due to considerable variability in performance measures over time) compared to the RR animals in experiment II. The reduced A-O contingency in the latter group might have facilitated the transition from goal-directed to habitual strategies, which would explain the relative insensitivity to outcome devaluation in RR-trained animals in experiment II (Dickinson, 1985; Yin & Knowlton, 2006). In conclusion, the discrepancies between experiments I and II are likely attributable to overtraining in the second experiment.

Possible overtraining effects in RR-trained animals in the second experiment were further substantiated by a relative resistance to A-O contingency reversal in this group. Interestingly, although both experiments showed a similar pattern of results in omission tests, only the second experiment confirmed the substantial increase in responding during omission in the RR relative to the RI group. Note that the absence of outcome devaluation sensitivity does not support the inference that the latter group acted in a goal-directed manner. Instead, these results suggest that overtraining might differentially promote habit formation in RR- and RI-trained animals. Specifically, due to the limited variability in performance measures across trainings experienced by animals assigned to RR trainings, the reduction in A-O contingency resulting from overtraining might have been more pronounced in this group compared to the RI group (Dickinson, 1985; Yin & Knowlton, 2006). Consequently, the extensive training that preceded omission tests in both experiments might have facilitated habit formation in the RR group to a greater extent, resulting in relatively limited sensitivity to A-O contingency reversal.
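The differential erosion of A-O contingency under interval versus ratio schedules can be made concrete with a simple deterministic approximation of expected reward rates (a sketch only; the RI60/RR20 parameters match the schedules used here, but the steady-state approximation and the response rates are our assumptions):

```python
def rr_reward_rate(resp_per_min, ratio=20):
    """Random ratio: on average one reward per `ratio` responses, so the
    expected reward rate scales linearly with response rate."""
    return resp_per_min / ratio

def ri_reward_rate(resp_per_min, interval_s=60):
    """Random interval (mean `interval_s` seconds): a reward becomes
    available on average every `interval_s` seconds and is collected by
    the next response, so the expected reward rate saturates as
    responding speeds up."""
    return 60.0 / (interval_s + 60.0 / resp_per_min)

# Doubling responding from 20 to 40 per minute doubles rewards under
# RR20 but barely changes them under RI60.
for rate in (10, 20, 40):
    print(rate, rr_reward_rate(rate), round(ri_reward_rate(rate), 2))
```

Under RR, each additional response carries a constant expected payoff, preserving a strong experienced action-outcome correlation; under RI, the marginal payoff of additional responses approaches zero, which is the standard account of why interval schedules favor S-R (habitual) control.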

4.2 Impact of motivation on outcome devaluation

Remarkably, the results of outcome devaluation tests II and III (experiment I) show a pattern similar to earlier findings (Box 1; Dickinson, 1985; Gremel & Costa, 2013b; Hilário et al., 2007; Wiltgen et al., 2012). This not only affirms the notion that alterations in primary motivation affect habits less than goal-directed actions, but also exposes the complexity of interactions between instrumental learning, outcome devaluation and motivational processes. For instance, Berridge (2001) noted that motivation incorporates both internal drives (i.e. the physical state of hunger) and external incentives (i.e. the sensation of consuming food). Notably, behavior is affected by alterations in motivation only if both properties are satisfied (e.g. eating food and contiguously reducing hunger). In other words, drive reduction in the absence of external incentives, accomplished by feeding hungry animals through a gastric tube, is insufficient for reducing the incentive properties of food (Berridge, 2001). This suggests that the outcome devaluation procedure applied in the current study would be ineffective if (1) animals were pre-fed with the reward (‘devalued’) or NCR (‘non-devalued’), which was subsequently removed from their stomach via a gastric fistula, or (2) animals received the reward or NCR directly into their stomach prior to extinction. In other words, such manipulations should lead to similar response rates in the ‘devalued’ and ‘non-devalued’ states in RR-trained animals, resulting from an inability to effectively update their internal representation of the reward value.

In contrast to the notion that drive reduction and external incentives must coincide in order to affect behavior, Niv et al. (2006; Box 1) proposed that different motivational components might independently affect behavior. Whereas energizing aspects of motivation encompass drive reduction and might predominantly motivate habitual animals, directing motivational processes influence behavior by updating the incentive value of stimuli, thus favoring goal-directed behavior. Notably, the latter process requires incentive learning or a ‘consciousness component’ linking the cognitive and the motivational system (Balleine & Dickinson, 1998; Dickinson & Balleine, 2000; Niv et al., 2006). In other words, the encoded value of an incentive can be updated according to the motivational state only when the animal experiences the incentive, which is analogous to the ‘external incentive’ proposed by Berridge (2000; Balleine & Dickinson, 1998). This explains why RR-trained animals typically maintain a relatively high response frequency in the non-devalued compared to the devalued condition. Interestingly, inflated responding in the non-devalued state might be further enhanced through ‘contrast effects’ (Balleine & Dickinson, 1998; Dickinson & Balleine, 2000). That is, satiation on a specific food not only reduces the rewarding properties of that food, but also increases the palatability of other foods. In contrast to RR-trained animals, RI-trained animals might primarily be driven by energizing motivational processes, and therefore refrain from responding irrespective of the nature of the food provided prior to extinction. Future studies could further elucidate and disentangle the differential effects of motivation on goal-directed actions and habits by including different A-O pairings or by comparing response rates during extinction in satiated and hungry states (section 4.5).

4.3 Limitations

4.3.1 Amount of instrumental training

Although the findings of the present study were in line with the widely accepted theory that (1) RI schedules of reinforcement promote habit formation, whereas RR schedules facilitate goal-directed actions, and (2) overtraining results in habitual behavior (e.g. Dickinson, 1985; Yin & Knowlton, 2006), several limitations should be acknowledged. First and foremost, we were unable to replicate the finding that RI and RR schedules differentially promote outcome devaluation sensitivity after four instrumental trainings (Gremel & Costa, 2013b). Instead, we established reinforcement schedule-dependent effects of outcome devaluation after eight to sixteen trainings, corresponding to four or eight days of instrumental training. Note that these outcome devaluation tests were interrupted by two to four days with extinction tests (outcome devaluation I and II), which might have prevented habit formation as a result of overtraining. Correspondingly, experiment II suggested that eight uninterrupted days of reinforcement training might have promoted habit formation in both groups as a result of overtraining.

As previous findings (not reported here) prompted us to conduct two instrumental trainings per day rather than one, we were required to match either the number of trainings or the number of training days reported by Gremel and Costa (2013b), who exposed animals to one training per day. Interestingly, only the latter approach allowed us to replicate the results obtained earlier, suggesting that a minimum number of training days (i.e. four), rather than a minimum number of trainings, is required for distinguishing habits from goal-directed actions through applying different reinforcement schedules. Although outcome value-dependent responding was demonstrated after only three days of RR20 training (Hilário et al., 2007), others indicated that three days of instrumental training are insufficient for validating differential responsivity in devalued compared to non-devalued states (Lederle et al., 2011). Thus, four consecutive days of RR training might be optimal for promoting goal-directed behavior (Gremel & Costa, 2013b; experiment I).

In order to prevent habit-promoting effects of overtraining in RR-trained animals, the number of training days should not exceed eight (outcome devaluation I, experiment II). However, given the marked insensitivity to omission following eight uninterrupted days of RR training established in the second experiment, this number might overestimate the onset of overtraining. Moreover, it is noteworthy that the risk of overtraining the animals assigned to the RR group depends on specific characteristics of the reinforcement schedule. For instance, Wiltgen et al. (2012) reported outcome devaluation sensitivity after ten consecutive days of RR20 training. Notably, this study conducted two relatively mild RR20 trainings (max. 30 min or 20 rewards) per day. This might have neutralized the effects of overtraining by delaying reduction of A-O contingency, thus maintaining goal-directed strategies even after prolonged training (DeRusso et al., 2010). In conclusion, given reinforcement schedules similar to the schedules described here, four to six days of instrumental training might be optimal for distinguishing goal-directed actions from habits.

4.3.2 Response manipulandum

A second limitation of the present study is that our experiments required animals to perform nosepokes rather than lever presses (Gremel & Costa, 2013b). Although the response manipulandum does not influence CRF performance, a systematic evaluation of the impact of several experimental design parameters on behavioral measures suggested that this variable might account for differential responding in trainings using progressive ratio schedules of reinforcement (Haluk & Wickman, 2010). Although it remains uncertain whether the relatively low response rates in nosepoke relative to lever press designs persist when animals are subjected to RI or RR trainings, this finding implies that response frequencies recorded during training as well as testing (i.e. outcome devaluation, omission) might not be comparable across studies implementing different response manipulanda. Nevertheless, the use of nosepoke holes rather than levers should not influence results obtained during devaluation, since relative (as opposed to absolute) measures of responding are analyzed.

4.3.3 Non-contingent reward delivery

Despite the consistent differentiation between RI vs. RR reinforcement schedules and habits vs. goal-directed actions reported in earlier studies (e.g. Dickinson, 1985; Gremel & Costa, 2013a, 2013b; Hilário et al., 2007; Wiltgen et al., 2012), it should be noted that personal communication as well as earlier experiments (not reported here) suggested that the paradigm applied here might be less robust than previous studies imply. For instance, experience taught that providing an NCR immediately after instrumental training (Methods, 2.1.4-5) might be crucial for observing differential effects of RI vs. RR training on outcome devaluation sensitivity. However, others successfully distinguished habits from goal-directed actions without post-training NCR consumption (DeRusso et al., 2010; Wiltgen et al., 2012). The necessity of providing NCR following instrumental trainings and the precise mechanism by which this procedure affects behavior thus remain ambiguous. Hypothetically, it is possible that familiarization with the NCR is necessary for sufficient consumption during outcome devaluation (non-devalued state). Alternatively, exposure to the NCR might consolidate the internal representation of the association between nosepoking and receiving a reward (i.e. the contingent reward). The latter hypothesis relies on the assumption that judgments of A-O contingency might be based on relative instead of absolute contingencies (Yin & Knowlton, 2006). Briefly, in instrumental trainings with two distinct actions (A1 and A2) and outcomes (O1 and O2), the experienced A-O contingency remains relatively high despite RI schedules of reinforcement (Yin & Knowlton, 2006). This might be explained by the fact that A1 consistently generates O1 and vice versa (P(O1|A1) > P(O2|A1), since the latter equals 0). In conclusion, the finding that RI and RR reinforcement schedules differentially affect outcome devaluation sensitivity might be less robust than frequently implied.
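The relative-contingency argument can be illustrated with a toy simulation (hypothetical numbers, not data from the present experiments). Under an RI schedule, a reward is "set up" at random intervals and collected by the next response, so the absolute contingency P(O1|A1) is low; yet since A1 never produces O2, A1 still predicts O1 better than O2.

```python
import random

random.seed(0)

def simulate_ri(n_presses=1000, mean_interval=30.0, press_rate=2.0):
    """Toy RI schedule: a reward becomes available on average every
    `mean_interval` seconds; the next press collects it. Presses occur
    at `press_rate` per second. Returns the fraction of presses that
    were rewarded, i.e. the empirical absolute contingency P(O1|A1)."""
    rewarded = 0
    reward_available = False
    t = 0.0
    next_setup = random.expovariate(1.0 / mean_interval)
    for _ in range(n_presses):
        t += random.expovariate(press_rate)  # time until the next press
        if t >= next_setup:
            reward_available = True
        if reward_available:
            rewarded += 1
            reward_available = False
            next_setup = t + random.expovariate(1.0 / mean_interval)
    return rewarded / n_presses

p_o1_a1 = simulate_ri()  # low absolute contingency under RI (only a few
                         # percent of presses are rewarded at these settings)
p_o2_a1 = 0.0            # A1 never yields the other outcome

# Relative contingency stays positive despite the sparse reinforcement.
print(p_o1_a1, p_o1_a1 > p_o2_a1)
```

The specific parameter values are illustrative assumptions; the point is only that the ordering P(O1|A1) > P(O2|A1) survives however sparse the RI reinforcement becomes.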

4.3.4 Water availability

Another inconsistency across studies is the availability of water during training and outcome devaluation. Whereas Hilário et al. (2007) reported that water bottles were removed prior to instrumental training, others suggested that water bottles were continuously present (Gremel & Costa, 2013a, 2013b). Additionally, we learned from personal communication that the availability of water during specific satiety-induced outcome devaluation influences the extent to which RR-trained animals differentiate between the devalued and the non-devalued state. This might be explained by the fact that water deprivation enhances the incentive value of sucrose solution relative to pellets (Niv et al., 2006). In other words, removal of water bottles during the outcome devaluation procedure might bias animals that naturally prefer pellets over sucrose towards sucrose consumption.

4.3.5 Establishing habits: statistical and methodological considerations

A final limitation of the paradigm employed in the present studies concerns the widespread presumption that habits are established by demonstrating insensitivity to outcome devaluation (e.g. Rossi & Yin, 2012). Notably, this insensitivity is typically validated by an inability to reject the null hypothesis (H0, which states that there is no difference in responsivity between the devalued and the non-devalued state). As denoted by numerous statisticians, this approach erroneously assumes that not rejecting H0 justifies the inference that H0 is true (Cohen, 1990; Fisher, 1937, p. 19; Gill, 1999). For instance, Gill (1999) noted that ‘failing to reject the null hypothesis does not rule out an infinite number of other competing research hypotheses’ such that it ‘provides almost no information about the state of the world’.
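This point can be made concrete with a small power simulation (hypothetical numbers, not data from this study): with a true but modest devaluation effect and a small sample, a significance test usually fails to reject H0 even though H0 is false, so a non-significant devaluation effect cannot by itself establish habitual behavior.

```python
import math
import random

random.seed(1)

def one_experiment(n=8, true_diff=0.5, sigma=1.0):
    """Simulate one devaluation test: n animals per condition, with
    normalized responding in the non-devalued state truly exceeding the
    devalued state by `true_diff` (in units of sigma). Returns True if a
    two-sample z-test (known sigma, for simplicity) rejects H0 at .05."""
    valued = [random.gauss(true_diff, sigma) for _ in range(n)]
    devalued = [random.gauss(0.0, sigma) for _ in range(n)]
    diff = sum(valued) / n - sum(devalued) / n
    se = sigma * math.sqrt(2.0 / n)  # standard error of the difference
    return abs(diff / se) > 1.96     # two-sided test at alpha = .05

n_sim = 5000
power = sum(one_experiment() for _ in range(n_sim)) / n_sim
# Typically around 0.17 at these settings: H0 is false in every simulated
# experiment, yet the large majority fail to reject it.
print(round(power, 2))
```

The sample size, effect size, and test are illustrative assumptions; the qualitative conclusion, that non-rejection of H0 is compatible with a genuine outcome-devaluation effect, holds across any underpowered design.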

Strictly speaking, the finding that responsivity in devalued relative to non-devalued states is not significantly different should thus be interpreted as a non-informative result rather than convincing evidence for the proposed habitual behavior of animals. From this viewpoint, it appears impossible to reliably confirm habit formation based on outcome devaluation or contingency degradation/reversal tests. Goal-directed actions, in contrast, can be established based on rejection of H0. Given the broad consensus that behavior can be conceptualized as a dichotomous phenomenon, either habitual or goal-directed, the

References
