THE REINFORCED BRAIN
Daniel Lindh, Amsterdam Brain and Cognitive Sciences, University of Amsterdam

INTRODUCTION
One of the most essential attributes enabling the successful survival of an organism is the capacity to pursue rewarding states while simultaneously avoiding harmful situations. The term “reward” describes the positive value that an individual ascribes to an object, behavioral act, or internal physical state. The evolutionary purpose of reward representation in the brain seems clear: to reinforce advantageous behavior. Being able to classify and internally represent, over long periods of time, which actions or situations are beneficial is crucial for evolutionary success. Learning, in its most primitive form, can thus be seen as the ability to use knowledge of past experiences, together with current events, to predict future states.
There is a vast corpus of reward literature investigating the role of different neuromodulators in reward processing. Since the 1980s, the most investigated neurotransmitter in this literature has been dopamine. The dopaminergic reward system can be a fragile structure, in which erroneously learned contingencies can lead to severe substance abuse, as in cocaine (Carlezon Jr. et al., 1998; Sora et al., 1998) or morphine (Hnasko, Sotak, & Palmiter, 2005) addiction. Other medical disorders associated with reward processing dysfunction, such as schizophrenia (Nestor et al., 2014; Simon et al., 2015), major depressive disorder (Ubl et al., 2015) and gambling addiction (Dymond et al., 2014), also highlight the importance of a properly functioning reward system.
In order to form a comprehensive understanding of how reward is implemented in the brain, several levels of explanation must be utilized. Firstly, we need to understand which brain structures are involved, how they interact and what kinds of processes they are involved in. Secondly, we need to consider the role attention plays in both high- and low-order levels of reward processing. Finally, we must consider models of reinforcement in order to assess mechanistic predictions. Because disentangling reward from other concomitant processes is extremely difficult in both animal and human research, in the current review I focus on reinforcement learning in sensory processing. Sensory processing, apart from being more straightforward than higher cognition, acts as a good model of how the brain implements reward learning in all types of cognition. Here, I will discuss perceptual learning, reward-modulated sensory processing, models, and the role of attention.
THE REWARDED BRAIN
A stimulus in itself does not intrinsically contain reward value; organisms assign different values to stimuli based on their current internal states and as a function of their previous experiences. Accordingly, reward is implemented in the brain via various neuronal reward signals, existing only as information encoded within and between neurons. In single-neuron recordings, one of the most influential reward-related findings has come from dopamine neurons in the substantia nigra and ventral tegmental area (Fiorillo, Tobler, & Schultz, 2003; Schultz, Dayan, & Montague, 1997; Schultz, 1986). These findings have later been interpreted as reflecting the prediction error in reinforcement learning, computed as the discrepancy between expectation and outcome. Signs of this type of computation have also been found in several diverse structures, such as the striatum (McClure, Berns, & Montague, 2003), anterior cingulate cortex (Hayden, Heilbronner, Pearson, & Platt, 2011) and frontal cortex (Ramnani, Elliott, Athwal, & Passingham, 2004). Multiple lines of evidence support the idea that these neurons construct and distribute information about rewarding events (Glimcher, 2010; Koob, 1992; Lak et al., 2014). More specifically, they convey the signed valence, meaning they code for the motivational, or reward-related, value of upcoming events. Furthermore, the highly interconnected structure of electrical synapses between these dopamine neurons (Vandecasteele, 2005) prohibits individual neurons from firing alone (Komendantov & Canavier, 2002). This is pertinent, because it enforces a high degree of neural collaboration in order to reach a threshold high enough for an actual output from the midbrain. This also means that recordings of only a few dopamine neurons give a good estimation of what the rest of the neural population is up to. Schultz and colleagues were the first to investigate the role of dopamine neurons in motivational processing. In one of their pioneering studies, Schultz, Apicella, and Ljungberg (1993) trained monkeys to execute three different tasks while performing extracellular recordings of single dopamine neurons located in the left substantia nigra. In the first experiment, a spatial choice task, the monkeys were trained to press a lever indicated by a light. A liquid reward was administered 500 ms after correct lever touch, making it possible to differentiate between the motor movement and the reward encoding. In the second experiment, a cue to initiate movement was presented 1 second after the onset of the instruction cue, which now served only as a preparatory signal. The final experiment, a delayed response task, was set up similarly to the instructed spatial task; however, the initiation cue was presented 2.5–3.5 seconds after the instruction cue, forcing the monkey to keep a representation of the target lever in working memory in the interim. The combined results of these experiments showed that dopamine neurons respond to stimuli crucial for performing a behavioral task and learning. Specifically, early on dopamine neurons responded to the reward, but once the association was learned, the response shifted to the onset of the instruction cue. There was no ongoing activity between the instruction cue and the initiation cue, implying that these neurons do not encode working memory for the prospective action. A series of follow-up studies were then conducted to map out the very nature of these signals. Notably, Fiorillo, Tobler, and Schultz (2003) reported that this shift from reward towards cue-related response scaled with the certainty of the association between the two. That is, cues with larger uncertainty of reward exhibit a stronger signal at the time of reward, but a weaker signal at the time of the cue, reflecting the uncertainty of the whole event.
So it seemed abundantly clear that dopamine neurons within these subcortical structures actually signal some type of reward prediction error (i.e. the discrepancy between expected and actual reward).
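To make this computation concrete, the prediction error can be written as δ = R − V, where V is the expected and R the obtained reward. Below is a minimal sketch of this idea in Python; the reward values, learning rate, and trial structure are illustrative assumptions, not parameters taken from any of the studies above.

```python
# Minimal sketch of a reward prediction error (RPE); all values are
# hypothetical illustrations, not fitted to neural data.

def prediction_error(obtained: float, expected: float) -> float:
    """Signed discrepancy between actual and expected reward."""
    return obtained - expected

expected = 0.0                    # before learning, the cue predicts nothing
for trial in range(6):
    reward = 1.0                  # the cue is reliably followed by reward
    delta = prediction_error(reward, expected)
    expected += 0.3 * delta       # expectation gradually absorbs the surprise
    print(f"trial {trial}: delta = {delta:.2f}")
# delta shrinks trial by trial: once the reward is fully predicted, it no
# longer generates an error, mirroring the cue-ward shift described above.
```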
In fact, expression of reward prediction errors largely overlaps with regions coding for expected reward and with motor areas corresponding to the recently chosen action (Doherty et al., 2004; Padoa-Schioppa & Assad, 2006; Palminteri, Boraud, Lafargue, Dubois, & Pessiglione, 2009). This corroborates the notion that the function of prediction errors is to work as a teaching signal, improving future reward predictions and movement selections in the relevant brain circuits.
Furthermore, dopamine neurons in the midbrain also seem to code motivational properties. Indeed, Satoh, Nakai, Sato, and Kimura (2003), using invasive recordings in the monkey midbrain, showed that dopamine neurons in the substantia nigra and ventral tegmental area seem to have at least three distinctive types of functions. Firstly, dopamine neurons responding to a conditioned stimulus (CS) encode motivational engagement at the start of the trial. More specifically, Satoh and colleagues showed that expected reward correlated highly with reaction time (RT). This implies that the dopamine neurons coded the actual motivation of a future action, assuming higher motivation would lead to shorter RTs. Secondly, perfectly in line with earlier findings (Fiorillo et al., 2003; Schultz et al., 1997), they also showed that dopamine neurons accurately encode the reward prediction error for a positive reinforcer. Thirdly, while the precise coding of reward prediction errors was learned by the monkey over the course of the experiment, the motivational signal in dopamine firing rates was constant throughout. A related influential idea is that slow tonic dopamine reflects overall motivation/satiety (Salamone & Correa, 2012) while fast phasic dopamine signaling supports learning (Satoh et al., 2003). For example, Hamid et al. (2015) measured dopamine release from the nucleus accumbens over several different timescales. They showed that motivational vigor and reward rate co-varied with minute-to-minute dopamine, while at the same time second-by-second dopamine release coded for an estimate of the temporally discounted future reward. These findings suggest that dopamine conveys one single decision variable that signals the value of work. Although most research has focused on the dopaminergic system when it comes to reward-related predictions, it is possible that dopamine shares this mechanism with other neuromodulators. For example, recent findings have shown tonic serotonin in the raphe nuclei to also reflect motivation, whereas phasic serotonin reflects reward anticipation and prediction errors (Li et al., 2016).
Despite the complex nature of the dopaminergic system in the midbrain, it is not in itself sufficient to account for the full range of processes involved in learning. For example, dopamine neurons in the midbrain have a baseline firing rate of around 4–5 Hz, and a firing rate of up to 30 Hz elicited by a positive reward experience (Montague, Dayan, & Sejnowski, 1996). Knowing this, and assuming a linear coding for negative and positive reinforcers, it is improbable that these neurons code for the whole reward spectrum. This opens up the exciting idea of different regions being responsible for the coding of negative reinforcers, and one such candidate is the habenula (Hb) (Benarroch, 2015). The habenula is located in the dorsomedial portion of the thalamus, where it forms an essential connection between the forebrain and brainstem monoaminergic nuclei. The Hb comprises two subdivisions, lateral and medial, which differ mainly in their neurochemical characteristics and connectivity. Considering that the lateral Hb exerts an inhibitory modulation both on dopaminergic neurons of the substantia nigra pars compacta (SNc) and on the ventral tegmental area (VTA), it is not hard to conceive why this nucleus is of interest when considering the regulation of learning and reward. Interestingly, it has been shown that the lateral Hb is a primary source of negative reward-related signals to DA neurons. Matsumoto and Hikosaka (2007) recorded from neurons in the Hb while monkeys performed a visually guided saccade task. Many neurons in the lateral Hb exhibited a phasic response to no-reward-predicting targets, and inhibition for reward-predicting targets, showing the opposite effect to that found earlier in midbrain dopamine neurons (Fiorillo et al., 2003; Schultz et al., 1997; Schultz, 2015). Furthermore, electrical stimulation of the lateral Hb prompted a strong inhibition of midbrain dopamine neurons through GABAergic connections mediated through the rostromedial tegmental nucleus, providing a plausible mechanism for the findings of suppressed activity in VTA and substantia nigra in the absence of a predicted reward (Schultz, 1986).
Studies recording single cells have also found neurons that seem to code for action-values (Lau & Glimcher, 2008; Samejima, Ueda, Doya, & Kimura, 2005; Tai, Lee, Benavidez, Bonci, & Wilbrecht, 2012). Action-values are an important concept in reinforcement learning, referring to the assignment of probable future values to a variety of possible actions, later used to make a decision about the most favorable option in a certain situation (i.e. Q-values, see Sutton & Barto, 1998). For example, Samejima et al. (2005) trained monkeys to turn a lever to either the left or the right. By manipulating the probability of high reward for left versus right choices, the authors could show that certain neurons in the striatum coded both for a preferred direction and for the action-value of that specific direction. It has also been shown that the values of predictive visual cues, chosen with either the left or right hand, are represented in the contralateral ventral prefrontal cortex (Palminteri et al., 2009). Considering the striatum’s known role in motor-action control (Cui et al., 2014), it is not surprising that one attractive notion is that action-values are predominantly represented in motor-related areas, such as motor cortex, supplementary motor cortex and the supplementary eye fields (for saccades) (Hunt, Woolrich, Rushworth, & Behrens, 2013; Wunderlich, Rangel, & O’Doherty, 2009). This is very intuitive, seeing how these areas also plan the actions to be made, leading to more efficient processing attained through learning. Nevertheless, it could be the case that these values are represented more ubiquitously than most of these studies report. FitzGerald, Friston, and Dolan (2012) showed action-specific signals in ventromedial PFC, putamen, insula, thalamus and hippocampus using multivariate Bayes (MVB) (Friston et al., 2008). Similarly, Vickery, Chun, and Lee (2011) could decode the feedback (win/loss) in all 43 regions of interest across the whole cortex using multivoxel pattern analysis (MVPA) (Hanke et al., 2009). A similarity between these approaches is that they can extract information from dispersed neuronal populations that are particular to certain processes, and do not require focal, spatially coherent activations. Meanwhile, conventional univariate fMRI analyses, which assume that a larger blood-oxygen-level-dependent (BOLD) response equals more reward processing, are inherently insensitive to these types of signatures.
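Returning to the action-value concept itself, the following is a hedged sketch of Q-value learning in a two-lever task, loosely in the spirit of Samejima et al. (2005); the reward probabilities, learning rate and choice temperature are arbitrary assumptions chosen for the example.

```python
import math
import random

# Toy action-value (Q-value) learner for a two-lever task; contingencies
# and parameters are illustrative assumptions only.
q = {"left": 0.0, "right": 0.0}          # one action-value per lever
p_reward = {"left": 0.2, "right": 0.8}   # hypothetical reward probabilities
alpha, tau = 0.1, 0.2                    # learning rate, choice temperature

def softmax_choice(values, tau):
    """Choose an action with probability proportional to exp(Q / tau)."""
    actions = list(values)
    weights = [math.exp(values[a] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

for _ in range(2000):
    action = softmax_choice(q, tau)
    reward = 1.0 if random.random() < p_reward[action] else 0.0
    q[action] += alpha * (reward - q[action])   # update the chosen action only

print(q)  # the Q-values approach the underlying reward probabilities
```

Note that each lever carries its own value, updated only when chosen; this separation of direction preference from direction-specific value is what the single-cell dissociations described above rest on.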
For accurate updating of action-values, an agent also needs dedicated structures that implement credit assignment. One candidate area for such a computation is the lateral orbitofrontal cortex (OFC). During the interval between decision and reward, lateral OFC neurons are relatively quiet (compared to dorsolateral PFC neurons). During feedback, however, lateral OFC neurons become relatively more active, reflecting the current choice responsible for the outcome (Tsujimoto, Genovesio, & Wise, 2009). Tsujimoto and colleagues propose a pivotal role for the lateral OFC in reactivating relevant choice representations, assisting Hebbian learning. In addition to reactivating choice representations, OFC neurons are also able to preserve neural representations of rewards over an extended period of time, despite the presentation of distracting reward outcomes (Lara, Kennerley, & Wallis, 2009). This putative function of the lateral OFC is further substantiated by research showing that lesions to the lateral OFC impaired monkeys’ ability to make value-related decisions between objects and to update action-values based on current feedback (Rudebeck & Murray, 2011). In contrast, the removal of ventromedial PFC (vmPFC) showed no such impairment (Rudebeck & Murray, 2011). Instead, BOLD activation in vmPFC is known to correlate with the predictive value of future outcomes (Kable & Glimcher, 2007; Plassmann, O’Doherty, & Rangel, 2007; Tom, Fox, Poldrack, & Trepel, 2007), as well as the subjective value at the time of reward (Sescousse, Redoute, & Dreher, 2010). It has been proposed that this signal reflects a comparison between different possible options in the value domain (Boorman, Behrens, Woolrich, & Rushworth, 2009; FitzGerald, Seymour, & Dolan, 2009). Contrary to vmPFC, anterior PFC appears to encode the value of choices that were not selected (Boorman et al., 2009; Rushworth, Noonan, Boorman, Walton, & Behrens, 2011). Specifically, when feedback is presented about the value of the alternative choice participants could have made, activation in anterior PFC reflects the prediction error of this counterfactual choice (Boorman, Behrens, & Rushworth, 2011). Additionally, this counterfactual prediction error has recently been observed in the human striatum (Kishida et al., 2015). Kishida and colleagues estimated sub-second dopamine fluctuations through fast-scan cyclic voltammetry (Kishida et al., 2011) in the striatum of Parkinson patients while they performed a sequential investment game. Dopamine fluctuations did not only reflect the reward prediction error, but rather a combination of the reward prediction error and the counterfactual reward. In fact, earlier studies have revealed that humans use both counterfactual information (feedback relating to choices that were not made) and reward prediction errors over choices that were actually made to influence their upcoming decisions (Chiu, Lohrenz, & Montague, 2008; Lohrenz, McCabe, Camerer, & Montague, 2007). To contrast the functions of vmPFC and anterior PFC even further, Daw, O’Doherty, Dayan, Seymour, and Dolan (2006) reported that exploitative choices of high-value options were associated with vmPFC, whereas anterior PFC seemed to process lower values during exploration. Reconciling this finding with earlier findings of anterior PFC function (Boorman et al., 2009, 2011) suggests that the anterior PFC signal during exploration (Daw et al., 2006) either reflects a high probability of switching to another alternative, or the high value of the discarded options while exploring.
In the literature, there seems to be a lot of overlap across a variety of reward processes. Both ends of the signed-valence spectrum are coded by similar neurochemicals, within similar structures. However, processes like the representation of action-values and the assignment of action-values differ in their temporal engagement, as well as in their probable relevant structures. Nevertheless, it is possible they have more in common than perceived at first sight. As pointed out above, many fMRI studies assume that external variables influence neurobiological measurements in a linear manner. Specifically, they assume that the BOLD-response scales up with more reward. This has been conjectured based on psychometric-neurometric experiments where, for example, a monkey’s self-report of perceived motion direction was predicted by higher activity in motion area MT (Treue & Martínez Trujillo, 1999), or where objective stimulus intensity was related by a power function to both the subjective intensity of the stimulus and the BOLD response (Polonsky, Blake, Braun, & Heeger, 2000). However, this assumption is not necessarily true, and methods like MVPA and MVB are probably more sensitive in picking up the subject-specific signals associated with different reward processes. Another distinction, which might explain incongruences in results found between single-cell studies and fMRI studies, is the nature of what they are measuring. Single-cell studies usually report the spike rates of cells, whereas fMRI studies report BOLD, which is believed to reflect not spike rates but rather local field potentials (LFP) (Logothetis, Pauls, Augath, Trinath, & Oeltermann, 2001). While spike rates correlate with neural output, LFPs are associated with subthreshold activity as well as incoming input into the area (Logothetis, Pauls, Augath, Trinath, & Oeltermann, 2001; Logothetis & Wandell, 2004; Logothetis, 2003). Another caveat is that investigating reward is not straightforward; one main limitation is the temporal and spatial overlap reward has with attention in the brain (Maunsell, 2004). Stănişor, van der Togt, Pennartz, and Roelfsema (2013) reported such a finding, in which monkeys were trained in a curve-tracing task while neurons were recorded from V1. The curve-tracing task allowed the researchers to manipulate attention and reward representation by means of distractors, and a comparison between the two showed that the effects of relative value had a similar timing and magnitude as the effects of selective attention. The authors argue that their findings support the view that studies that examine attentional processes on one hand, and reward on the other, actually investigate the same selection processes. Researchers usually train monkeys in attention paradigms by using reward. For example, the monkeys might get rewarded for one (attended) stimulus, but not for the other (unattended) stimulus (Stănişor et al., 2013), meaning that the original aim of investigating attention is now contaminated by reward processing as well.
THE PERCEPTUAL BRAIN
Considering the vast, fluctuating landscape of information surrounding us at all times, the brain’s ability to predict and quickly structure incoming information is an extraordinary feat. The classical view of the perceptual system as almost purely driven by bottom-up processes has been heavily challenged in recent years. In addition to bottom-up input, the visual cortex also receives large amounts of feedback from higher-order cortical areas (Harris & Mrsic-Flogel, 2013; Muckli & Petro, 2013). Thus, a notion that has gained traction in recent decades is predictive coding (Hohwy, 2014; Lee & Mumford, 2003; Rao & Ballard, 1999). Predictive coding states that top-level areas continuously send predictions to early sensory processing areas in a hierarchical manner, which has been shown to enable faster processing of incoming stimuli (O’Brien & Raymond, 2012). For example, in the visual cortex, prior predictions evoke a preparatory neural template of the expected incoming stimuli, with a BOLD-response that closely resembles the BOLD-response caused by the stimuli themselves (Kok, Failing, & de Lange, 2014a). A clear example of predictions affecting our perception can be found in binocular rivalry. Binocular rivalry is a perceptual phenomenon that was described as far back as 1593 by Giambattista Della Porta (Hohwy, 2014). In binocular rivalry, each eye receives different visual input and, instead of the two images fusing, one eye becomes dominant, resulting in one clear percept. Eye dominance alternates every few seconds, sometimes with periods of patchy transitions. Findings during binocular rivalry strengthen the supposition that the brain is engaged in high-level inferential work. First presenting the same image to both eyes and then changing the input for one of the eyes can prime which eye initially dominates perception. The eye receiving the same image as before is more likely to be dominant than the eye whose input has been switched (Mitchell, Stoner, & Reynolds, 2004). In 1928, Emilio Diaz-Caneja (Hohwy, 2014) cut two images in half and presented a combination of the two halves to each eye. Interestingly, even then perception did not follow the eye of origin; instead, people perceived a complete picture by combining the half from one eye with the corresponding half from the other eye (Hohwy, 2014). This is an impressive achievement by the brain. The sophisticated inferences made by the brain, as shown in binocular rivalry findings, suggest that even high-level conceptual notions reach deep down into our low-level machinery, affecting our perceptual awareness to a much larger extent than earlier believed.
According to a contemporary theoretical framework, the main goal of the brain is to predict future states, and thus minimize surprise, in order to effectively process and interact with the world (Friston, 2009, 2010). A vast number of findings over the past decades have shown the predictive nature of the brain, ranging across domains such as vision (Kok, Failing, & de Lange, 2014b; Kok, Jehee, & de Lange, 2012), audition (Cohen, Elger, & Ranganath, 2007), self-recognition/the embodied self (Apps & Tsakiris, 2014; Seth, 2013) and somatosensory perception (Allen et al., 2015). Furthermore, higher-order functions have been implicated in computing probabilities and predictions, such as action preparatory activity (Bestmann et al., 2008), memory (Kumaran & Maguire, 2009), and cognitive control (Pezzulo, 2012). So what could be the advantage of using prediction errors? The short answer, again, is to aid learning. The main reason the notion of prediction errors is so fascinating is how intuitively and ingeniously it can describe and emulate learning in computational models by error correction (Sutton & Barto, 1998). It is intuitive in the sense that if we constantly update our model of the world based on the size of the error, we will gradually increase the precision of our future predictions. Of course, there are several bottlenecks in the processing of incoming sensory information that hinder us from having a perfect model of the world at all times, but given the limited amount of data we can process, the usage of prediction errors gives us a fairly good estimate. However, despite being an attractive notion from both a computational and an empirical viewpoint, some inconsistent results can be used to argue against predictive coding. Rao and Ballard (1999) proposed that an intelligent system would not be surprised by predictable stimuli, and that only the unexpected input features are put forward to the next stage of processing. This can be contended to be problematic, seeing that predictions sometimes rather seem to boost sensory processing (Chaumon, Drouet, & Tallon-Baudry, 2008; Doherty, Rao, Mesulam, & Nobre, 2005). However, it has been reasoned that these findings are confounded by attention (Kok, Rahnev, Jehee, Lau, & De Lange, 2012). In fact, in further elaborations of the predictive coding model it has been proposed that attention increases the weights of certain sensory evidence (Friston, 2009; Kok, Rahnev, et al., 2012; Rao, 2005), leading to higher precision of pertinent incoming information.
There are, in principle, two types of prediction errors that have been discussed in the literature (Den Ouden, Kok, & de Lange, 2012). I have already discussed the first one, the motivational prediction error, expressing the degree of surprise caused by a particular rewarding scenario (Fiorillo et al., 2003; Lak, Stauffer, & Schultz, 2014; Schultz et al., 1997; Schultz, 1986). The second type of prediction error, theoretically more recent, is related to perception. In perception, predictive coding is believed to work in a hierarchical manner, where each subsequent level of processing predicts the previous level. Here, only the information of deviant, non-predicted, bottom-up sensory evidence is passed along to the next level of analysis (Rao & Ballard, 1999). Consequently, prediction errors are believed to be abundant all over the brain and a highly integrated and essential part of all levels of learning. In fact, it has been proposed that the main goal of the brain is to strive to minimize the amount of surprise by constantly updating its internal model of the world (Friston, 2009, 2010). There is a vast amount of empirical evidence for predictive coding processes in sensory processing areas (Allen et al., 2015; Clark, 2013; Jack & Hacker, 2014; Kok et al., 2014a; Kok, Rahnev, et al., 2012; Rauss, Schwartz, & Pourtois, 2011; Shipp, Adams, & Friston, 2013). One of the most robust paradigms to induce prediction errors is the oddball task, where a divergent stimulus is presented after a sequence of repetitive stimuli. This elicits a large, probability-scaled neural response, originating from the sensory areas (Akatsuka, Wasaka, Nakata, Kida, & Kakigi, 2007; Stagg, Hindley, Tales, & Butler, 2004). Perceptual prediction errors can also be distinguished from associated concepts like adaptation and attention. This can be shown in omission paradigms, where a predicted stimulus is withheld yet still yields a large neural response in the relevant sensory areas (Den Ouden, Friston, Daw, McIntosh, & Stephan, 2009; Todorovic, Ede, Maris, & Lange, 2011). Another way of investigating perceptual prediction errors is by using illusions. For example, using Kanizsa illusions (where Pac-Man-shaped inducers are aligned such that their edges form a shape that triggers the percept of an illusory contour), Kok and De Lange (2014) showed that the illusory perception of a shape comes with an elevated BOLD-response in visual regions where bottom-up sensory evidence is absent but part of the shape is expected (reflecting a prediction error). At the same time, they found that regions that received bottom-up evidence predicted by the perceived shape showed an attenuated BOLD-response, consistent with how prediction errors are assumed to propagate in the visual hierarchy (Rao & Ballard, 1999). In a follow-up study using ultra high-field fMRI (7T), Kok et al. (2016) were able to separate the cortex into three different parts: deep, middle and superficial layers. Since feedback and feedforward connections are largely segregated in the visual cortex (Rockland & Pandya, 1976), with feedback mostly encompassing the deep and superficial layers and feedforward connections encompassing the middle layers, Kok and colleagues predicted a layer-separated response depending on top-down or bottom-up influence. Indeed, the authors found that bottom-up stimuli activated all three layers almost equally, whereas top-down signals (“predictions”) showed higher activation in the deep layer. Again, this clearly follows earlier findings that expected stimuli are attenuated while unexpected bottom-up signals are enhanced. Yet another illusion that shows the extraordinary explanatory power of predictive coding is the McGurk effect (McGurk & Macdonald, 1976). The McGurk effect is a multisensory perceptual phenomenon that displays a collaborative interaction between auditory and visual areas in the processing of speech. For example, when an auditory stimulus, for example “Ba”, is paired with the visual input of someone saying “Ga”, subjects report that they perceive a syllable in between (like “Da”). In line with predictive coding accounts, the more predictive a visual stimulus is of the subsequent spoken syllable, the stronger the response in the superior temporal sulcus when this prediction is violated (Arnal et al., 2009). Both of these illusions demonstrate the hallmark of a prediction error, as described by Rao and Ballard (1999).
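The core mechanics of this hierarchical scheme can be captured in a few lines of code. The sketch below is a toy, two-level rendering of the Rao and Ballard (1999) idea under simplifying assumptions: a known linear generative matrix W stands in for the top-down model, and an error-driven update stands in for recurrent message passing; none of the specifics are tied to the studies cited above.

```python
import numpy as np

# Toy two-level predictive coding loop in the spirit of Rao & Ballard (1999).
# The generative matrix, input, and step size are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))          # how 3 hidden causes generate 8 inputs
true_cause = np.array([1.0, -0.5, 0.3])
sensory_input = W @ true_cause       # what the lower level actually receives

estimate = np.zeros(3)               # higher level's belief about the causes
for _ in range(200):
    prediction = W @ estimate            # top-down prediction sent downward
    error = sensory_input - prediction   # bottom-up residual (prediction error)
    estimate += 0.05 * (W.T @ error)     # only the error drives the update

print(np.round(estimate, 2))         # approaches true_cause
print(round(float(np.linalg.norm(sensory_input - W @ estimate)), 4))
# Once the input is fully "explained away", the forwarded error is near zero:
# the attenuation signature described in the imaging studies above.
```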
As briefly discussed earlier, an additional vital aspect of learning and reward is attention. Attention is believed to play a pivotal role by increasing the weight of prediction errors (Friston, 2009; Kok, Rahnev, et al., 2012), which in turn putatively increases the processing speed of sensory evidence and directs learning. Reward in itself seems to be able to modulate attention, such that it increases performance in visual tasks as a function of incentive value (Engelmann, Damaraju, Padmala, & Pessoa, 2009). In addition to improving performance, monetary reward also concomitantly boosts the BOLD-response in task-related perceptual and cognitive regions, together with reward-related regions (Engelmann et al., 2009; Pochon et al., 2002; Small, 2005). Moreover, dopamine-related areas like the striatum, in addition to reward prediction (Schultz, 1986), have also been implicated in coding for incremental, attention-capturing saliency (Zink, Pagnoni, Chappelow, Martin-Skurski, & Berns, 2006). Together with its role in action initiation (Cui et al., 2014; Shiflett & Balleine, 2011), this finding suggests a facilitating role for subcortical dopamine in reallocating attentional resources. Consequently, Pessoa (2009) proposed that an enhanced interaction between subcortical reward-related areas and perceptual and cognitive regions reallocates attention and improves performance, consequently promoting successful reward-seeking behavior. This is interesting because these findings suggest that the allocation of attentional resources by the midbrain provides a link between the motivational dopaminergic prediction errors and the modality-specific predictions in the cortex. However, because of the similar nature of attention and reward (Stănişor et al., 2013), attention also poses a methodological problem for researchers. The main caveat of reward research is the difficulty of disentangling neural reward and attentional signals, which causes many reward studies to be confounded by attention (Maunsell, 2004). Thus, one vital question becomes whether or not learning can occur without attention. Watanabe, Náñez, and Sasaki (2001) trained their participants in a letter task, and at the same time presented moving dots with a subthreshold coherence level. They showed that motion direction discrimination later on was improved selectively for the direction that had been presented during the letter task. It was later shown that this task-irrelevant visual perceptual learning was contingent on the task-irrelevant feature being presented subthreshold (Tsushima, Seitz, & Watanabe, 2008). Watanabe and colleagues argued that if the task-irrelevant feature were presented above the threshold of detection, it would be considered a distractor and attention would attenuate the irrelevant feature, prohibiting any task-irrelevant learning (Sasaki, Nanez, & Watanabe, 2010; Seitz, Kim, & Watanabe, 2009; Seitz & Watanabe, 2005; Watanabe & Sasaki, 2015). Persichetti, Aguirre, and Thompson-Schill (2015) reported a slightly different finding: they first taught participants to associate novel shapes with different monetary rewards, and the participants later completed an unrelated, but demanding, perceptual task using the same shapes. Curiously, Persichetti and colleagues showed that shapes earlier associated with high reward evoked an increased BOLD-response in visual cortex despite the fact that attention was drawn away from the associated value of each shape.
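To make the proposed weighting role of attention concrete, the toy snippet below treats attention as a precision (gain) term multiplying the prediction error before it drives learning, in the spirit of Friston (2009) and Kok, Rahnev, et al. (2012); the gain and learning-rate values are arbitrary assumptions.

```python
# Toy precision-weighted learning step; attention scales the error's gain.
# All parameter values are illustrative assumptions.

def weighted_update(value, outcome, alpha, precision):
    """One learning step in which attention (precision) multiplies the error."""
    delta = outcome - value          # ordinary prediction error
    return value + alpha * precision * delta

v_attended, v_unattended = 0.0, 0.0
for _ in range(10):
    v_attended = weighted_update(v_attended, 1.0, alpha=0.2, precision=1.0)
    v_unattended = weighted_update(v_unattended, 1.0, alpha=0.2, precision=0.2)

print(round(v_attended, 2), round(v_unattended, 2))
# Same error signal, different gain: the attended estimate converges much
# faster, which is one way to read "attention directs learning".
```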
Predictions and prediction errors are becoming increasingly popular as an explanatory framework for a wide range of neuropathological diseases, perceptual experiences, and higher cognitive functions. The importance of predictions becomes clear when considering what happens when things go awry. For example, erroneous prediction errors have been proposed to underlie adolescent risk-taking (Cohen et al., 2010) and high-level dysfunctions (Simon et al., 2015; van Boxtel & Lu, 2013). One such dysfunction is psychosis (Corlett, Honey, & Fletcher, 2007; Corlett et al., 2007; Corlett & Fletcher, 2015; Yamashita & Tani, 2012), in which individuals stereotypically report disrupted perceptual experiences, such as brighter colors and louder sounds, and in which attention assigns inappropriate significance to these experiences, leading to delusions. These experiences are all congruent with an inability to explain away incoming stimuli due to erroneous prediction errors. Another, perhaps even more surprising, disorder proposed to be caused by weak prediction errors is autism (Pellicano & Burr, 2012; Sinha et al., 2014; van Boxtel & Lu, 2013). In the sensory systems of people with autism spectrum disorder, weak prediction errors putatively lead to a perpetual shower of new “surprises”, causing an increase in the sensory input the brain has to process.
THE MODELLED BRAIN
A hallmark of true understanding of a mechanism is the ability to reproduce the process from the ground up, exhibiting the same properties when exposed to the same situations. Models of the brain can be used to demonstrate understanding of the underlying principles, helping us to further develop tools to investigate other processes of the brain and to predict the outcomes of situations by simulation. One of the pioneers of modeling the visual system was David Marr. He advocated viewing the brain’s visual organization as a pure information processing system, proposing the notion that one must understand it at three distinct, complementary levels of analysis (known as Marr’s Tri-Level Hypothesis) (McClamrock, 1991). The first level is the computational level: what is the function of the system? What types of problems does it need to solve and overcome? And why is it doing these things? The second level is the algorithmic level: how are these functions represented in the brain, and what kinds of processes are used to manipulate the representations? The third level is the level of implementation: how are these functions realized in the brain? That is, which neural structures and neural activities implement the algorithms and processes that solve the problems of the system? These levels are not specific to the visual system, but can be applied as a general rule to understand the whole brain. On top of these levels, Tomaso Poggio proposed the level of learning (Poggio, 2012): the level at which the system learns how to process information in an adequate manner, without needing to be preprogrammed for the specific task. This is exactly where sufficient and necessary models of reinforcement learning should be implemented.
Models of reinforcement learning traditionally arise from psychology and computational science, where researchers tried to understand the brain by either testing behavior or constructing artificial intelligence. More than a hundred years ago, Ivan Pavlov observed, in his famous experiment on the salivating dog, that when the ring of a bell is consistently paired with food, dogs eventually start to salivate once the bell is rung (Rescorla & Solomon, 1967). This is known as classical conditioning, in which an innate response (salivating) to a potent stimulus (food) comes to be prompted by a previously neutral stimulus (the sound of a bell). The first to mathematically formalize this learning process were Bush and Mosteller (1951), who proposed that the probability of the dogs salivating could be expressed as an iterative equation:
A_{next trial} = A_{last trial} + α(R_{current trial} − A_{last trial})
Here, A_{next trial} is computed by taking the value of A from the last trial and adding the discrepancy between the current (actual) reward and the last trial’s (expected) value, i.e. the prediction error, multiplied by some learning rate α between 0 and 1. When α is equal to 1, A is always updated so that it is equal to R from the last trial. In fact, as long as 0 < α < 1, the value of A will always converge to the value of R; however, the smaller the learning rate, the slower this convergence will be. So, in effect, what the Bush and Mosteller equation does is compute a reward average over previous trials, in which recent trials carry more weight. The learning rate dictates the decay of this weight across previous trials, with a higher learning rate rendering the predictive process less influenced by older trials. The importance of the Bush and Mosteller equation is non-trivial; it was the first to utilize an iterative error-based rule for reinforcement learning, forming the keystone for most future models. In an extension of the Bush and Mosteller rule, Rescorla and Wagner (Sutton & Barto, 1998) built a learning model that was used to investigate associative connections when two predictive cues were paired with the same event. Their model has become so prominent that many now fallaciously attribute the Bush and Mosteller equation to Rescorla and Wagner (Glimcher, 2010).
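A direct implementation of the update makes the convergence behavior above easy to verify. The sketch below is simply the equation rendered in Python with an arbitrary constant reward sequence; the parameter values are illustrative.

```python
# The Bush and Mosteller update, applied trial by trial.
# Reward sequence and learning rates are illustrative assumptions.

def bush_mosteller(rewards, alpha, a0=0.0):
    """Return the value estimate A after each trial."""
    a, history = a0, []
    for r in rewards:
        a = a + alpha * (r - a)   # A_next = A_last + alpha * (R_current - A_last)
        history.append(a)
    return history

rewards = [1.0] * 20              # constant reward of 1 on every trial
for alpha in (0.1, 0.5, 1.0):
    print(alpha, round(bush_mosteller(rewards, alpha)[-1], 3))
# With alpha = 1.0, A jumps straight to R; smaller learning rates converge
# more slowly because older trials retain more weight.
```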
With time, two key issues with these models emerged (Sutton & Barto, 1998). First, they all treated time as discrete epochs, where learning happens at the end of each epoch (or trial). However, in the real world time is continuous, meaning that several things within a trial can carry meaning for the end result. The second issue concerns the linking of sequential cues. For example, the earlier models were good at using the conditioned cue to predict the value of the trial, while not incorporating the notion that the later appearance of the reward was non-informative. Sutton and Barto (1998) maintained that one of the main problems with the earlier models was that the definition of the problem the models were trying to solve was incorrect, thus violating the first level of Marr’s Tri-Level Hypothesis. The goal is not to learn the value of previous events, which is what Sutton and Barto stated the Bush and Mosteller rule actually did; it is to try to predict future events. In their temporal difference-learning model (Sutton & Barto, 1998; Sutton, 1988), the prediction error is computed by taking the difference between the prediction of all future rewards and any information that leads to an alteration of beliefs. This information is not constrained to direct unconditioned reward only, but also includes signals that are predictive of upcoming rewards. This is a critical difference from having a prediction error computed from the difference between past events and just the current reward, as in the Bush and Mosteller class of learning models. Furthermore, learning did not happen after each epoch; because time was represented as a series of minimally discrete moments, the predictive model was updated whenever salient and relevant events occurred. In addition, for each time step this model carries not only a prediction for reward at that very moment, but also the predicted sum of the discounted reward for all subsequent moments. To illustrate this, imagine a situation where time is divided into discrete time points. At any point in time, a reward has an equally low probability, meaning that a reward at any time will yield a large prediction error response. Now, imagine that a tone starts to occur just before every reward. The first time this happens, the tone carries no information about the subsequent reward, which is at this point still surprising. Gradually over time, the tone comes to completely predict the reward, and the actual reward no longer carries any additional information. The prediction error now starts to occur together with the tone, because of the unpredictability of the tone’s timing. Temporal difference-learning models achieve this by assigning each obtained reward not just to the value function for the current moment in time but also to previous time increments. So, models of reinforcement learning (Sutton & Barto, 1998) can be described as following three steps: (1) the organism estimates the value of each action, (2) an action is selected based on a comparison between several action-values, and (3) action-values are updated, based on the prediction error generated by the current event. Temporal difference-learning models have gained traction in neuroscience in large part because of the work of Schultz et al. (1986), who recorded dopamine neurons in the midbrain of monkeys engaging in a reinforcement-learning task. The monkey was seated in front of two levers, each lever having one corresponding light cue.
After the illumination of the light cue, the monkey received a juice reward if he pressed the correct lever. In the early stages of the task, the neurons were silent at the start cue but responded strongly whenever the monkey received a juice reward. As the monkey continued to perform the task, and the learning process evolved, both the behavior and the activity of the neurons changed. The monkeys started pressing the correct levers more often, and the neurons stopped firing at the time of the reward, responding instead to the onset of the light cue.
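This shift falls out of the temporal-difference update directly. Below is a hedged simulation sketch of that dynamic, with an assumed trial structure (an unpredictable tone followed by a reward after a fixed delay) and arbitrary learning parameters; it is meant to illustrate the mechanism, not to reproduce the Schultz et al. recordings.

```python
import numpy as np

# Temporal-difference (TD) learning over discrete time steps within a trial
# (after Sutton & Barto, 1998). Trial structure and parameters are
# illustrative assumptions.
alpha, gamma = 0.2, 1.0
reward_delay = 2                 # reward arrives 2 steps after the tone
v = np.zeros(reward_delay + 2)   # value of each time step since tone onset

def run_trial(v):
    """One trial: an unpredictable tone, then a reward after a fixed delay."""
    deltas = []
    # The tone's onset cannot be predicted (pre-tone value is 0), so the
    # tone itself elicits an error equal to its learned value.
    deltas.append(gamma * v[0] - 0.0)
    for t in range(len(v) - 1):
        r = 1.0 if t == reward_delay else 0.0
        delta = r + gamma * v[t + 1] - v[t]   # TD prediction error
        v[t] += alpha * delta
        deltas.append(delta)
    return deltas

first = run_trial(v)
for _ in range(500):
    last = run_trial(v)

print(np.round(first, 2))  # error sits at the reward step, none at the tone
print(np.round(last, 2))   # error has migrated to tone onset; the reward
                           # itself is fully predicted and elicits none
```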