
Information processing during spatial navigation: Combining computational models with invasive and non-invasive brain imaging

Christoffer J. Gahnstrom, University of Amsterdam, NL (christoffer.gahnstrom@gmail.com)

Hugo J. Spiers, University College London, UK (Supervisor)

Michael X. Cohen, Donders Institute/UvA, NL (Co-assessor)

Abstract

The human brain can be thought of, and investigated as, an information processing entity. In doing so, we allow for the consideration of a vast variety of computational models that can help us better understand the brain in health and in disease. In recent years, one branch of computational models, called reinforcement learning, has shown itself particularly suited to this goal. This literature thesis provides an overview of different algorithms of reinforcement learning that have improved our understanding of neural mechanisms, particularly during spatial navigation. This thesis will also provide suggestions for the use of these computational models to do interactionist neuroscience, combining invasive research in rodents with non-invasive research in humans. Furthermore, research into spatial navigation has suggested that a particular form of reinforcement learning, called the successor representation, might be a candidate for a general mechanism of cognition.


Contents

1 Introduction

2 Encoding of spatial information in rodents
   2.1 Encoding during spatial navigation
      2.1.1 Path integration and Landmark Navigation
      2.1.2 Goal-directed and Habitual Behaviour

3 Encoding of spatial information in humans
   3.1 Path integration and landmark navigation
   3.2 Goal-directed and habitual behaviour

4 Modelling spatial navigation using reinforcement learning
   4.1 Model-based reinforcement learning
   4.2 Temporal difference learning
   4.3 Successor Representation
   4.4 Hierarchical Reinforcement Learning

5 Combining computational models with rodent and human brain imaging
   5.1 Examples of combining computational models and brain imaging
   5.2 The new experimental paradigm
   5.3 Potential pitfalls

6 Conclusions

1 Introduction

“Models allow the exploration of the implications of ideas that cannot be fully explored by thought alone.”

- McClelland (2009), TiCS

There is a longstanding lack in our understanding of generalizable mechanisms underlying cognition. At the same time, the subfields of neuroscience investigating rodents invasively, and humans non-invasively, are seriously divided (Badre et al., 2015). This literature thesis aims to provide solutions to both of these problems through the description of interactionist neuroscience and computational modelling. Specifically, the new research paradigm outlined in this thesis suggests combining spatial navigation tasks across species with models of reinforcement learning. However, spatial navigation is just one of many fields to which this new emerging paradigm may be applied in order to investigate information processing during cognition (Behrens et al., 2018).

Groups of researchers from fields such as psychology, computer science, linguistics, artificial intelligence, and neuroscience have searched for generalizable cognitive mechanisms since the 1950s. After several decades it became clear to some scientists that none of these disciplines can model the mind alone, and they suggested a joint effort to combine the most promising research ideas surrounding cognition (Johnson-Laird, 1980). By 1975 the term cognitive science had been coined, and it subsequently became an established research field (Frankish and Ramsey, 2012). However, the research focus in cognitive science has been very diverse over the years and has often entailed the development of theoretical models from a top-down perspective, which ignores many of the biophysical details of the human brain (McClelland, 2009). Recently, cognitive science, neuroscience and artificial intelligence have become increasingly integrated fields, partly due to technical advances in computing power, as well as advances in invasive animal brain recordings and non-invasive human brain recordings (Kriegeskorte and Douglas, 2018; Hassabis et al., 2017).

One commonly held abstraction principle in cognitive science is to investigate the brain from the perspective of information processing (Piccinini and Scarantino, 2011). Using this abstraction principle, David Marr operationalized an analysis approach to the brain that remains hugely influential more than 35 years after his seminal work on the cerebellum, cortex, archicortex (hippocampus), and vision (Marr, 1969, 1970; Marr et al., 1991; Marr, 1982; Vaina and Passingham, 2017). Marr suggested a framework of hierarchical levels of analysis, where the aim is to develop models of the brain that are testable and falsifiable on the basis of their so-called computational, algorithmic, and biophysical predictions (Poggio, 2012). Today, there is a range of candidate computational models of cognition to consider, which often provide different levels of explanation. Ultimately, one goal is to bridge across these levels via the experimental testing of hypotheses derived from different models (Gerstner et al., 2012).

There are many promising experimental paradigms which allow for the study of cognition, addressing questions about the mental processes involved in attention, memory, learning, language, decision-making, and perception. This thesis will focus on spatial navigation as a test-bed for computational hypotheses underlying cognition for the following reasons: i) navigation is ecologically sound, as humans and animals alike navigate through their respective environments every day; ii) the neurobiological representation of spatial information is well established from single-unit recordings in the hippocampus and surrounding brain regions; iii) navigating through space requires all the aforementioned divisions of cognitive processes except language (unless instructed); iv) navigation tasks are generalizable across species. The goal of this literature thesis is to evaluate a new way of investigating the brain, which intimately links different levels of understanding, from low-level neural circuits to the overarching computations carried out by the brain. The structure of this thesis is as follows: first, an overview of findings from the rodent and human spatial navigation literature will be presented; second, different reinforcement learning algorithms of navigation will be discussed; third, the last section will present suggestions for a new paradigm combining invasive recordings in animal models, non-invasive recordings in humans, and computational models of reinforcement learning.

2 Encoding of spatial information in rodents

Early behavioural studies on rats navigating through mazes gave rise to the idea that mammals represent an internal map of the environment (Tolman, 1948). Tolman's experiments showed that rodents extrapolate beyond simple stimulus-response mappings during navigation, and he hypothesized that they create a so-called cognitive map. This theoretical map is what allowed the rodents to choose previously unseen and more efficient paths (i.e., shortcuts) in a learned environment. Interestingly, Tolman viewed this cognitive map as a generalized mechanism of cognition, one which is used for rational behaviour (Tolman, 1948; Behrens et al., 2018). He further suggested that behavioural changes caused by disease or brain injury result from damage to, or narrowing of, these cognitive maps. A few decades later, cells which fire in specific spatial locations of the environment were discovered using single-unit recordings in the hippocampus of rodents (O'Keefe and Dostrovsky, 1971; O'Keefe and Conway, 1978). Each of these so-called place cells has an associated place field, a spatial region in the 2D environment where the cell fires most often. The accuracy of these cells is such that only a few need to be analysed to accurately determine the location of freely moving animals from the recordings alone (Moser et al., 2017). The idea emerged that hippocampal place cells could be an instantiation of the cognitive map that Tolman had hypothesized all those years before. Place cells have also been discovered in bats (Ulanovsky and Moss, 2007), in mice (Harvey et al., 2009), and, in combination with a view field in the form of spatial view cells, in non-human primates (Rolls, 1999).

Following the place cell discovery, further sets of cells were discovered which advanced our understanding of the encoded spatial information critical for navigation. One of these cell types was found in the dorsal presubiculum and anterodorsal thalamic nucleus and encodes the current heading direction of the animal, irrespective of other behaviour (Ranck Jr, 1984; Taube, 2007; McNaughton et al., 1991). The previously identified place cells encode allocentric space, i.e., spatial information which is bound to the environment and not to the organism. These head direction cells, however, encode egocentric information believed to be important for navigation in conjunction with the allocentric place cells, for instance for path integration (McNaughton et al., 1991). Path integration is a central aspect of spatial navigation: it is the process by which information about current speed, position, and heading direction allows the organism to precisely calculate its current and future position (McNaughton et al., 2006). One well-known issue of path integration is the accumulation of sensory errors due to poor accuracy of heading direction and/or speed, resulting, for instance, in people walking in circles despite trying to walk in a straight line (Souman et al., 2009). In the early 2000s a new class of spatially selective cells, named grid cells, was identified in the medial entorhinal cortex, a brain region adjacent to the hippocampus, one synapse away (Fyhn et al., 2004; Hafting et al., 2005), and later also in the pre- and parasubiculum (Boccara et al., 2010). In contrast to place cells, these cells have multiple firing fields that span the whole explorable environment in a hexagonal pattern, and they appear to encode an allocentric coordinate system (Moser et al., 2017). Grid cells also adapt to changes in environmental cues, suggesting a role in anchoring during navigation: when cue cards positioned on the inner walls of the box environment were rotated, the firing patterns of the grid cells rotated by the same amount (Hafting et al., 2005). In contrast to place cells, the firing fields of grid cells remain constant across changing environments (Bostock et al., 1991; Fyhn et al., 2007). Grid cells have also been identified in bats, non-human primates, and mice (Yartsev et al., 2011; Killian et al., 2012; Fyhn et al., 2008).

Even more recently, cells encoding a boundary vector were discovered in the subiculum of freely moving rats (Lever et al., 2009) and in the entorhinal cortex (Savelli et al., 2008; Solstad et al., 2008), amongst other regions (Grieves and Jeffery, 2017). These cells fire when the animal is positioned at a preferred distance from an environmental boundary. Interestingly, these cells were hypothesized to exist as inputs to place cells, based on the influence that changes in environmental boundaries have on place field patterns, and they were also predicted by computational models (Barry et al., 2006).

Neurons in the medial entorhinal cortex have also been identified that specifically encode the speed of freely moving rats (Kropff et al., 2015). These cells appear to be solely responsible for encoding speed, unlike many other conjunctive cells that encode speed along with grid fields or heading direction (Wills et al., 2012).


2.1 Encoding during spatial navigation

The last section covered the encoding of spatial information in the hippocampus, striatum, perirhinal and entorhinal cortices, and subiculum of freely moving animals (see the detailed review by Grieves and Jeffery (2017)). However, the question remains how this encoded spatial information helps with navigation. Navigation research can be divided into four distinct domains. The first two are path integration and landmark navigation. Path integration, or dead reckoning, is a process whereby an agent uses self-motion cues, like changes in velocity and egocentric heading direction, to keep track of its position over time (Mittelstaedt and Mittelstaedt, 1980; Gallistel and King, 2011). Landmark navigation is navigation through the use of external cues, like the distance between objects in the environment (Yoder et al., 2011).

The other two research domains are goal-directed and habitual behaviour (Dolan and Dayan, 2013). Goal-directed behaviour is considered a deliberate process requiring future planning based on information beyond immediate sensory cues (Pezzulo et al., 2014). Habitual behaviour constitutes previously learned stimulus-response associations, which may be dissociated from current value outcomes (Dickinson, 1985). The focus of this thesis is to suggest a new research paradigm to investigate generalizable mechanisms of cognition, and these four categories are all important to explore with that aim in mind.

2.1.1 Path integration and Landmark Navigation

Research into navigation has identified that movement-dependent species commonly make use of path integration, including insects like bees and ants (Collett and Collett, 2000; Cheeseman et al., 2014). Path integration was also used by sailors in the absence of informative visual landmarks. The ability to track one's location is incredibly important since it allows the organism to return to its starting position and to store locations of interest while exploring the environment (Collett and Graham, 2004).

The underlying neural mechanisms of path integration have been explored in rats, where head direction cells are regarded as a crucial component (McNaughton et al., 2006). Head direction cells fire in response to a specific angular direction irrespective of the current environment (Etienne and Jeffery, 2004), and therefore operate collectively as an internal compass. The activity of these cells also relies on vestibular signals and generates accurate directional information in featureless environments, even in darkness (for reviews on head direction cells see Taube (2007); Stackman and Taube (1997)). One recent study found that optogenetic inhibition of head direction cells induced directional errors during path integration in the absence of visual cues (Butler et al., 2017).

The second crucial component of path integration is an accurate representation of current position. This means the ability to translate self-motion cues into vector displacement signals. Spiders use proprioception, bees use optic flow, and ants use ground distance covered (Moller and Görner, 1994; Etienne and Jeffery, 2004). The discovery of grid cells provides the organism with a set of cells representing self-motion information in terms of displacement, and the medial entorhinal cortex has therefore been proposed as a location for the path integrator (McNaughton et al., 2006). Another striking finding in entorhinal cortex is the presence of head direction cells along with conjunctive cells that encode the 2D grid field as well as direction and velocity (Sargolini et al., 2006). Given the close proximity and interaction of these classes of neurons, the entorhinal cortex is suggested as the neural substrate which computes path integration (McNaughton et al., 2006).

The spatial encoding of place cells is suggested to be the result of both path integration and landmark cues (Etienne et al., 2004). Gothard et al. (1996) recorded place cells while rats ran along a linear track. The position representation of the place fields was the direct result of path integration while travelling outward from the starting position, while at a certain distance from the goal location the place fields reflected landmark navigation from visual cues. The incorporation of visual landmark cues is essential for accurate navigation since, as mentioned above, path integration accumulates substantial errors over a short period of time (Cheung, 2014). External visual cues have therefore been suggested to reset the path integrator after the accumulation of errors (Etienne et al., 2004). When it comes to landmark navigation, two key findings show its importance with respect to spatial information encoding in the hippocampal formation. One early study found that rotating the visual cues in the rat's environment resulted in an equal rotation in the orientation of place fields (Bostock et al., 1991; Muller and Kubie, 1987). Similarly, if the boundaries of the rat's environment are stretched, then the place fields also stretch (O'Keefe and Burgess, 1996). The angular position of place fields is not controlled by large objects that have been rotated when these are placed centrally in the environment. However, the same objects do control the angular position of place fields when placed (and subsequently rotated) against the walls of the environment (Cressant et al., 1997).

2.1.2 Goal-directed and Habitual Behaviour

A longstanding question in navigation, and decision-making in general, has been the extent to which actions are chains of stimulus-response mappings, or whether actions depend on an abstract representation of a goal. The idea is that this goal representation allows the agent to act beyond current or past reinforcement of actions, for example by planning out and executing a shortcut (Tolman, 1948).

Another important question is how goal locations are represented in the brain. Studies have shown that place cells will shift their place fields over time to encode reward at goal locations (Komorowski et al., 2009; Dupret et al., 2010; Hok et al., 2007). Using optogenetic stimulation, one study artificially activated dopaminergic neurons projecting to place cells after spatial learning and found an increase in place cell reactivation (McNamara et al., 2014). Moreover, in the Morris water maze task¹, place fields were found to be over-represented around the goal location (Hollup et al., 2001). These findings of goal location encoding in place cells demonstrate that they encode more than a collective spatial code of the environment.

Rats exhibit vicarious trial and error (VTE) behaviour, which means that they halt at a decision point and appear to be considering their options (Tolman, 1938; Redish, 2016). It was recently discovered that during this behaviour, place cells generate brief sequences of spatial trajectories which predict subsequent behaviour, even in novel environments (Pfeiffer and Foster, 2013). These generated sequences occur during sharp-wave ripples (SWRs), sudden high-frequency events in the local field potential during inactive wakefulness and sleep (Buzsáki, 2015).

One established finding of place cells in the hippocampus and grid cells in entorhinal cortex is that they fire at progressively earlier phases of extracellular theta oscillations (6-10 Hz) during active behaviour (Buzsáki, 2005; Barry and Burgess, 2014). This phenomenon of phase precession occurs while the animal is moving through the place and/or grid field of the recorded neuron (Hafting et al., 2008; O'Keefe and Recce, 1993). The precession of preferred cell firing can change independently of changes in firing rate, suggesting its role as an additional information-carrying signal (Huxter et al., 2003).

Pronounced theta oscillations in the hippocampus have been established since the earliest recordings of local field potentials (O'Keefe and Conway, 1978). Within each extracellular theta oscillation, populations of place cells have also been found to fire in forward succession of each other, referred to as a theta sequence (Foster and Wilson, 2007). Theta sequences have recently been found to encode goal-related information: Wikenheiser and Redish (2015) measured hippocampal place fields of rats during a foraging task and found that the theta sequences reflected the subsequent path trajectories of the rats up until goal locations. Moreover, the sequences were longer during longer behavioural trajectories.

During sleep or resting, place cells have been found to fire in a sequence that predicts the following place cell activity while running on a novel linear track (Dragoi and Tonegawa, 2011). This phenomenon of preplay suggests that the current state of the hippocampus preconfigures the future encoding of state representations (Dragoi and Tonegawa, 2013). The activity of place cells in hippocampus also encodes trajectories to regions that have never before been visited (Ólafsdóttir et al., 2015): rodents were shown rewards positioned at unreachable regions of the environment, and the firing of place cells reflected trajectories both towards and away from these reward locations. Another well-established finding in hippocampus is that of replay. During sleep, place cells fire in a pattern sequence which re-enacts the firing patterns from previous active behaviour (Wilson and McNaughton, 1994; Skaggs and McNaughton, 1996). Moreover, these replay periods were found to also be present shortly after spatial navigation while the animal was awake but not moving (Foster and Wilson, 2006).

¹ The Morris water maze task is an example of spatial one-trial learning. After some familiarity with the task, rats need only one trial to subsequently swim directly to the target location, irrespective of where in the pool of water the rat is placed. The goal of temporal difference learning models of this task was to provide a mechanism for goal-directed behaviour based on place cell information.

3 Encoding of spatial information in humans

Research into understanding the human brain has drastically increased in the past few decades with the advent of non-invasive brain imaging. It suddenly became possible to map out cognitive functions associated with specific brain areas. The primary imaging modality is without a doubt functional magnetic resonance imaging (fMRI). This research technique is successfully used to detect fluctuations in levels of blood oxygenation as a function of increased metabolic demands (Poldrack et al., 2011). This section will review experiments using non-invasive fMRI during virtual reality (VR) navigation of environments. This setup still allows navigational processes to be engaged despite participants being constrained and immobile (Epstein et al., 2017).

The earliest study using fMRI during virtual navigation investigated brain activity correlated with maze exploration and subsequent recall of its topographical structure. Only the parahippocampal region was systematically engaged across most participants both during learning and recall (Aguirre et al., 1996).

One interesting line of research has focused on expert navigators in the form of taxi drivers (Woollett and Maguire, 2011). The first study found that the posterior hippocampus of London taxi drivers was larger than in control groups, while the reverse held for the anterior hippocampus (Maguire et al., 2000). That the size of the hippocampus is plastic suggests that hippocampal anatomy adapts to the requirements of every-day spatial navigation. Moreover, the level of taxi-related expertise correlated with grey matter volume in posterior hippocampus as compared to bus drivers and controls (Maguire et al., 2003, 2006).

3.1 Path integration and landmark navigation

Rodents have been demonstrated to use both path integration and landmarks when navigating. However, humans and non-human primates are more reliant on vision, which could result in less reliance on path integration and more reliance on landmark navigation (Ekstrom, 2015). For instance, recordings of neurons in the hippocampus of non-human primates found spatial view cells that are selectively active when gazing at specific surfaces (e.g. walls) in the environment, and these have not been seen in rodents (Rolls, 1999).

Nonetheless, research on human path integration has identified several brain regions that are important for accurate performance. When humans virtually navigate along a triangular path in the absence of landmarks and are asked to point towards the starting point after traversing two of the three legs, activity in hippocampus predicts the level of accuracy (Wolbers et al., 2007). A more detailed investigation used 1st person, 3rd person, and survey-level perspectives in a virtual environment while participants navigated towards goal locations (Sherrill et al., 2013). Both 1st and 3rd person perspectives involved hippocampal activity when the goal location was successfully reached, as well as retrosplenial and parietal areas. However, only the hippocampus was associated with 1st person self-motion cues for goal proximity.

Individual differences in path integration have also been found to be associated with grey matter density in retrosplenial cortex, hippocampus, and medial prefrontal cortex while tracking changes in location (Chrastil et al., 2017). Activity in hippocampus and retrosplenial cortex has also been associated with encoding the vector signal to the goal location using path integration (Chrastil et al., 2015).

There has been substantially more research into the neural underpinnings of landmark navigation (Epstein and Vass, 2014). Many studies have identified the parahippocampal cortex to encode landmark objects (Aminoff et al., 2013), which has resulted in it being referred to as the parahippocampal place area (Epstein and Kanwisher, 1998). The retrosplenial cortex is also often implicated in landmark encoding and is suggested to be responsible for anchoring landmarks into allocentric maps of the environment (Marchette et al., 2014; Epstein, 2008).

One study identified a further dissociation between parahippocampus and retrosplenial cortex during the viewing of photographs depicting the inside and the outside of buildings with which participants were familiar (Marchette et al., 2015). Photographs of the same building in both conditions evoked activity in parahippocampus, retrosplenial cortex and the occipital place area. However, only the parahippocampus showed activity discriminating between the different buildings, implicating its role in landmark encoding. The retrosplenial cortex instead encoded information auxiliary to the specific landmark, such as the faces that were associated with each building as a mnemonic device.

The parahippocampal place area (PPA) is also implicated in integrating the previous context in which scenes were shown. Turk-Browne et al. (2012) used the technique of fMRI adaptation, which exploits the fact that successive presentation of the same stimulus will reduce the measured BOLD response, while presenting different stimuli will not reduce activity (Bandettini, 2014). They found that the PPA exhibited more adaptation when a scene was presented in a temporal context, indicating that this region also takes into account the temporal sequence in which landmarks are encoded.

Recent studies have found that hippocampal activity encodes distance information between landmarks (Auger and Maguire, 2018; Baumann et al., 2012). Employing the aforementioned fMRI-adaptation technique, participants were tasked with viewing photographs of familiar landmarks (Morgan et al., 2011). The left hippocampus responded in proportion to the real-world distance between successively presented landmarks. They also found the fMRI-adaptation effect in parahippocampus and retrosplenial cortex during successive presentations of landmarks. The parahippocampus has also been found to be selective for landmark objects during a virtual route-following task (Janzen and Van Turennout, 2004). More specifically, the encoding in parahippocampus was dependent on the navigational significance of the object.

The striatum has also been implicated during landmark navigation. For instance, Doeller et al. (2008) measured fMRI activity while participants virtually navigated around an enclosed arena. Landmark objects were presented one at a time during a learning phase. Subsequently, a picture of a previously shown object was presented as a cue, and the participant had to navigate to the remembered position where it had been located. The right striatum was associated with the encoding of landmark object positions, while the right posterior hippocampus was associated with the encoding of boundary locations. The striatum has also been implicated in another fMRI study where participants were tasked with virtually navigating a radial arm maze (Iaria et al., 2003). Here, participants who used a non-spatial strategy, numbering the arms of the maze, showed associated activity in the caudate nucleus.

3.2 Goal-directed and habitual behaviour

The division between goal-directed (map-based) and habitual (route-based) behaviour in humans has often been supported by behavioural studies in humans and rodents (Iglói et al., 2009; Tolman, 1948). This division is also thought to recruit separate neural circuits (Wolbers and Hegarty, 2010). One study found that participants who used spatial landmarks when navigating had increased hippocampal activity, while participants who used a habit-based (non-spatial) strategy had increased caudate nucleus activity (Iaria et al., 2003). The same dissociation was found with participants virtually navigating around a town (Hartley et al., 2003), and during a route-recognition task with early-stage Huntington's disease patients (Voermans et al., 2004).

Another common finding is that the hippocampus encodes the path distance to a goal, while the entorhinal cortex encodes the Euclidean distance to a goal. Howard et al. (2014) designed a task in which participants navigated real-world routes and were later shown movies of the same routes, having to give correct turns at different decision points. The task was structured so that the Euclidean distance and path distance to the goal location were varied separately. Posterior hippocampus was found to be associated with path distance and with changes in path distance during detours, while anterior hippocampus and entorhinal cortex were associated with Euclidean distance while the movies were shown and when faced with new goal locations, respectively. The posterior parietal cortex reflected the egocentric angle to the goal direction. Another set of experiments further investigated brain regions involved in goal proximity and detour planning. In a task where participants first viewed the position of a figure in a virtual environment and later used mental navigation to calculate the nearest route to the previously shown figure from a random starting position, the hippocampus, medial prefrontal cortex and parahippocampus tracked goal proximity (Viard et al., 2011). Moreover, a subset of detour trials was associated with medial and ventromedial prefrontal cortex activity. Earlier studies have also identified medial prefrontal cortex, subiculum, and entorhinal cortex as associated with goal proximity during virtual navigation (Spiers and Maguire, 2007). Recently, the retrosplenial cortex was found to encode goal proximity when navigating familiar virtual environments, while the hippocampus encoded goal proximity in novel virtual environments (Patai et al., 2017).

An important finding in the hippocampal formation of rodents was the existence of head direction cells. A recent study investigated the representation of goal direction during navigation of a virtual environment (Chadwick et al., 2015). Brain activity in entorhinal and subicular regions was associated with goal directions. Moreover, a given goal direction elicited the same activity pattern in entorhinal/subicular regions when participants were simply facing that direction.

The finding that place cells encode future trajectories to goal locations (Wikenheiser and Redish, 2015) has also been investigated in humans using fMRI and virtual navigation. Brown et al. (2016) designed a task where participants learned goal locations associated with fractal images positioned along a circular maze. They were later prompted with one of the fractals and required to plan out a route to the encoded location and navigate there. The hippocampus encoded the location of future goals during the planning stage, and the same hippocampal encoding was present after navigation to that goal. Another recent study using virtual maze navigation also found hippocampal regions to reflect goal-specific planning (Kaplan et al., 2017).

One study investigated future planning (Horner et al., 2016) using the process of measuring grid cell-like activity in entorhinal cortex in humans with fMRI (Doeller et al., 2010). They identified the same signal during virtual navigation and during imagined navigation, suggesting its role in route planning. Entorhinal grid cell-like activity also extends beyond navigation and represents locations in visual space (Julian et al., 2018).

4 Modelling spatial navigation using reinforcement learning

As shown in the previous section, many aspects of allocentric and egocentric spatial information are encoded in the hippocampus and neighbouring brain areas. However, the question remains how this encoded information enables the computational processes necessary for successful navigation of the environment. Looking back to the levels-of-analysis approach suggested by Marr (1982): what are the computational goals used in navigation? Moreover, what are the algorithms of information processing for those goals? And finally, what is the biophysical implementation of the algorithms? This section will focus on reinforcement learning algorithms currently suggested to be implemented by the brain during navigation. These algorithms cover two of Marr's levels: the computational goal (reward optimization) and the algorithmic level (Niv, 2009). In the final section, I will suggest new ways to investigate these algorithms (and others) by combining research methods across computational modelling, rodent research, and human neuroimaging.

The distinction between goal-directed and habitual behaviour goes back to the early days of behavioural research, where it was investigated as stimulus-response mappings and Tolman's cognitive map hypothesis, respectively (Dolan and Dayan, 2013). In more recent years, these distinctions have crossed over into computational modelling of behaviour in the form of model-based algorithms for goal-directed behaviour and model-free algorithms for habitual behaviour (Daw et al., 2005; Foster et al., 2000). Moreover, this division is demonstrated in both rodents and humans, though the two systems may be interacting rather than competing (Balleine and O'Doherty, 2010). Model-based algorithms use experience to construct an internal model of the environment, while model-free algorithms learn the value of the stimulus-response mappings of available states and actions, or policies (Dayan and Niv, 2008). Importantly, these different classes of algorithms can be used to test specific hypotheses about neural activity and lead us to identify neural mechanisms of behaviour (Daw et al., 2011). Again, the aim of this thesis is to provide the theoretical grounds for investigating generalizable mechanisms of cognition, and the framework of reinforcement learning is central to this approach.

4.1 Model-based reinforcement learning

Reinforcement learning encompasses a class of models with the computational goal of maximizing total future reward through optimization of the mapping between situation and action (Sutton et al., 1998). For example, deciding on which meal to choose at a restaurant is influenced by the previously experienced positive and negative reward associated with each of the meal options. The optimization procedure depends on the specific reinforcement learning algorithm used, but involves interacting with the environment through trial-and-error (Gershman and Daw, 2017).


First, we need to take a step back and define a general case for the kind of processes that are involved in navigation and are approachable by reinforcement learning models. All reinforcement learning models consist of three basic elements: the action a ∈ A from the set of possible actions in the current state; the current state s ∈ S from the set of all possible states; and the goal in terms of rewards r received after each action a (Sutton et al., 1998, 1999). The mapping between actions and states that optimizes future reward is the optimal policy, π*.

In order for reinforcement learning models to be tractable, it is necessary that the state signal possess the Markov property. This means that all necessary information for future states, s' ∈ S, is captured by the present state, s (Dayan and Niv, 2008). For example, in a game of chess the history of previous actions does not need to be explicitly represented, since all previous actions are captured by the current state, i.e. the current configuration of chess pieces. The Markov property simply allows the next state and expected reward to be predicted based solely on the current state and action, ignoring all past actions and states (Sutton et al., 1998). Even in cases where the Markov property does not fully hold, it can still be useful as an approximation.

Another important concept in reinforcement learning is the discount factor γ (always a value γ ∈ [0, 1]), which ensures a finite learning process. In simple terms, this refers to the decrease in value of a reward as a function of time. For example, imagine being given the choice between a reward of £4 immediately or at a variable time t in the future. The experienced value of the reward is effectively down-weighted as a function of increasing t. This phenomenon is an established finding in humans and animals, where it is known as exponential discounting (Kirby and Maraković, 1996).
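As a minimal illustration, the Python snippet below computes the discounted present value of a fixed reward at increasing delays; the reward magnitude and discount factor are arbitrary choices for demonstration, not values from the literature.

```python
# Illustration of exponential discounting: the present value of a fixed
# reward shrinks geometrically with its delay t (in time steps).
gamma = 0.9    # discount factor, gamma in [0, 1]
reward = 4.0   # e.g. a reward of 4 pounds

for t in range(6):
    present_value = (gamma ** t) * reward
    print(f"delay t={t}: discounted value = {present_value:.2f}")
```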

The general division of reinforcement learning is between model-based and model-free algorithms, where the temporal-difference learning algorithm described below is an example of the latter. Model-based algorithms explicitly estimate the reward probabilities, R(s, a), along with the state transition probabilities, i.e. the probability of the future state s' given the current state-action pair, P(s'|s, a). These two joint probabilities are referred to as the one-step model of action a and are sufficient to infer the Markov Decision Process (MDP). Model-based reinforcement learning directly estimates these probabilities as well as the state-value function V^π(s) of a given policy:

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \left[ R(s,a) + \sum_{s'} P(s' \mid s,a)\, \gamma V^{\pi}(s') \right] \qquad (1)$$

The above is the Bellman equation (state-value function) for a given policy, which captures the recursive nature of model-based reinforcement learning. Using the dynamic programming algorithm of value iteration, one can approximate the optimal policy by iterating and updating the state-value function for every future time step (Daw and Dayan, 2014):

$$V^{*}(s) = \max_{a} \left[ R(s,a) + \sum_{s'} P(s' \mid s,a)\, \gamma V^{*}(s') \right] \qquad (2)$$

The state-value function is a separate equation for every state s and policy. The recursive nature of Bellman equations means that, given a Bellman equation for every state in the set of states, performing value iteration will converge on the optimal policy and state-value function (Sutton et al., 1999). Each value iteration can also be viewed as forward planning, where the internal model of the environment is used to simulate future experience and update the value functions and current policy (Sutton et al., 1998).
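To make the procedure concrete, here is a minimal Python sketch of value iteration on a toy MDP. The corridor environment, its transition and reward structure, and the convergence threshold are all invented for illustration; only the Bellman update itself follows equation 2.

```python
import numpy as np

# Toy MDP: 4 states in a row; action 0 moves left, action 1 moves right.
# Stepping into the last state yields reward 1. All values are illustrative.
n_states, n_actions, gamma = 4, 2, 0.9

# P[a, s, s'] = transition probability; R[s, a] = expected immediate reward.
P = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_states, n_actions))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0             # move left (stays at boundary)
    P[1, s, min(s + 1, n_states - 1)] = 1.0  # move right
R[n_states - 2, 1] = 1.0                     # stepping into the goal state pays 1

# Value iteration: repeatedly apply the Bellman optimality update (equation 2).
V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * np.einsum("asn,n->sa", P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop once values converge
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged values
print("V* =", np.round(V, 3), "policy =", policy)
```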

One common example of model-based behaviour is latent learning, where exposure to a reward-free environment leads to better subsequent performance (Tolman, 1948). In the RL framework, latent learning occurs when the agent learns about the available states and state transitions, P(s'|s, a), of an environment. In a model-free setting, such latent learning would not exist, since there is no reward to form reward prediction errors. However, this latent learning would inform the internal model of a model-based algorithm.

Latent learning has been used to investigate the neural underpinnings of model-based learning using fMRI. Gläscher et al. (2010) used the internal model of state transitions to compute state-prediction errors when the internal model conflicted with experience. They found that the size of state-prediction errors was reflected in parietal and dorsolateral prefrontal cortex. These effects were also evident during a latent learning portion of the task, in the absence of reward. During a later portion of the task, when reward was present, reward prediction errors could be computed using model-free algorithms, and these were found to be associated with activity in the ventral striatum.

A similar investigation has also been made during a spatial maze task (Simon and Daw, 2011). Different model-based and model-free algorithms were fitted to participant behaviour, which was best explained by the former. The trial-by-trial prediction of model-based behaviour was associated with activity in the striatum, which overlaps with findings of model-free correlates (Daw et al., 2005). They also found correlates of model-based predictions in the medial temporal lobe and frontal cortex, which is more in line with previous findings (Gläscher et al., 2010).

4.2 Temporal difference learning

Model-free reinforcement learning algorithms, as in the case of temporal-difference (TD) learning, infer stimulus-response mappings from past experience. Model-based algorithms are generally much more flexible but also much more computationally demanding, because every value estimation update must be computed across the whole state space (Gershman, 2018; Momennejad et al., 2017). Model-free algorithms only update the value estimates locally.

One early application of reinforcement learning to spatial navigation used the algorithm of temporal-difference learning (Barto et al., 1983; Foster et al., 2000). TD learning entails that stimulus-response mappings are updated according to the difference between a predicted reward and the actual reward received after a particular state-action sequence. The basic idea is related to the Rescorla-Wagner model (equation 3) of Pavlovian conditioning, which states that learning occurs when the agent is surprised (Niv, 2009):

$$V_{CS_i} \leftarrow V_{CS_i} + \alpha \left( \lambda_{US} - \sum_{j} V_{CS_j} \right) \qquad (3)$$

where α is the learning rate parameter, CS denotes the conditioned stimuli, US denotes the unconditioned stimulus, and λ_US is the value of the US.

The key difference is that TD learning depends on the specific temporal timing of rewards starting from every possible state, instead of only predicting the directly succeeding reward (Niv, 2009). This difference also means that the TD algorithm constructs an inferred Markov decision process underlying the observed states and rewards based on interaction between the agent and environment (Sutton et al., 1998).

The model-based value function in equation 2 is defined in terms of the available states, but for model-free algorithms it is common to instead define the value function in terms of state-action pairs, Q(s, a). This increases the number of equations by the number of available actions for every state:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\, \delta \qquad (4)$$

where the prediction error δ is defined by:

$$\delta = R(s,a) + \gamma\, Q(s', a') - Q(s,a) \qquad (5)$$

The use of state-action pairs instead of the state-value function allows the optimal policy to be estimated without the explicit use of the state transitions P(s'|s, a) or the reward probabilities R(s, a), as in the model-based equation (Barto and Mahadevan, 2003).
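A minimal sketch of this state-action update is given below. The chain environment, the epsilon-greedy action selection, and all parameter values are illustrative assumptions; the update inside the loop implements the SARSA-style form of equations 4 and 5.

```python
import random

# Minimal chain environment for illustration: states 0..4, start at 0;
# action 0 = left, action 1 = right; reaching state 4 pays 1 and ends.
def step(s, a):
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, 4)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward, s_next == 4

alpha, gamma, eps = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}

def epsilon_greedy(s):
    # Mostly greedy action choice, with occasional random exploration.
    if random.random() < eps:
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(s, a)])

for _ in range(500):  # episodes
    s, a = 0, epsilon_greedy(0)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)
        # Equations 4-5: delta is the reward prediction error.
        delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
        Q[(s, a)] += alpha * delta
        s, a = s_next, a_next

print({s: round(max(Q[(s, 0)], Q[(s, 1)]), 2) for s in range(5)})
```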

The TD learning algorithm is a success story of computational modelling in neuroscience. The finding that dopamine neurons in the basal ganglia spike during operant conditioning was difficult to understand, as these neurons only seemed to respond to reward under certain circumstances. It was found that these circumstances could be elegantly captured by the defining feature of the TD learning algorithm, namely the state-dependent prediction error (Schultz et al., 1997; Bayer et al., 2007). In general, an unexpected reward will elicit a phasic dopamine response, while a reward following a conditioned stimulus will not. However, when an expected reward fails to follow a conditioned stimulus, there is a decrease in the dopamine response, assumed to reflect a negative prediction error (Schultz, 2015).


However, application of the TD learning algorithm to more complex behaviour such as navigation has proved less successful. For instance, when applied to hippocampal place cell activity during the Morris water maze (Morris, 1984), the TD algorithm was unable to capture rat behaviour when the goal location was changed and, importantly, was unable to show one-trial learning (Foster et al., 2000). This lack of generalizability is due to the inflexible nature of the TD algorithm, which can be thought of as overfitting to the task when the goal location is stationary.

4.3 Successor Representation

Investigations into value-based decision making have clearly demonstrated that a mix of model-free and model-based algorithms describes both behaviour and neural findings (Dolan and Dayan, 2013). A recent suggestion is that the hippocampus employs a reinforcement learning algorithm called the successor representation, which incorporates both model-free and model-based properties (Stachenfeld et al., 2017). The successor representation is originally an extension of model-free temporal difference learning, introduced in an attempt to improve behavioural flexibility during navigation (Dayan, 1993). The successor representation has also been shown to capture human behaviour during value-based sequential decision making (Momennejad et al., 2017). The value function of the successor representation consists of a future reward estimation R and a matrix M, which contains the estimated future state visits:

$$V^{\pi}(s) = \sum_{s'} M(s,s') \sum_{a} \pi(a, s')\, R(s', a) \qquad (6)$$

The expected value of the matrix M is defined as the cumulative discounted future state occupancies:

$$M^{\pi}(s,s') = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}(s_t = s') \;\middle|\; s_0 = s \right] \qquad (7)$$

where $\mathbb{1}(s_t = s')$ equals 1 if the state occupied at time t is s', and 0 otherwise.

The successor representation is learned similarly to TD learning, as defined in equation 5, re-written here as:

$$M^{\pi}_{t+1}(s_t, s') = M^{\pi}_{t}(s_t, s') + \eta \left[ \mathbb{1}(s_t = s') + \gamma\, M^{\pi}_{t}(s_{t+1}, s') - M^{\pi}_{t}(s_t, s') \right] \qquad (8)$$

where η is the learning rate parameter for estimating successor states. Instead of the reward prediction error of TD learning, the above equation computes a successor prediction error using the same approach (Stachenfeld et al., 2017).
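The following sketch illustrates the successor-matrix update of equation 8 on a toy random walk; the ring environment and random policy are illustrative assumptions, and the final lines show how values could then be composed from M and a reward vector in the spirit of equation 6.

```python
import numpy as np

# Successor representation learned online during a random walk on a
# 5-state ring. The environment is illustrative; the point is the update
# rule (eq. 8): M accumulates discounted expected future state occupancies.
n_states, gamma, eta = 5, 0.9, 0.1
M = np.zeros((n_states, n_states))
rng = np.random.default_rng(0)

s = 0
for _ in range(20000):
    s_next = (s + rng.choice([-1, 1])) % n_states  # random walk step
    onehot = np.eye(n_states)[s]                   # 1(s_t = s') as a vector over s'
    # TD update of the successor matrix (successor prediction error):
    M[s] += eta * (onehot + gamma * M[s_next] - M[s])
    s = s_next

# Given a reward vector over states, values follow as V = M @ R (cf. eq. 6,
# simplified here to state-based rewards for illustration).
R = np.zeros(n_states)
R[3] = 1.0
print(np.round(M @ R, 2))
```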

Russek et al. (2017) implemented three different reinforcement learning algorithms based on the successor representation architecture and simulated their performance on spatial navigation tasks. Their goal was to illustrate how these algorithms could solve problems that have been assumed to necessitate model-based computations, such as making a correct detour in the face of sudden obstacles (Spiers and Gilbert, 2015). They successfully reproduced such results using the successor representation instead of model-based algorithms. For instance, latent learning is possible since it can update the successor representation matrix M^π(s, s'). These simulations provide a framework to test neural mechanisms of navigation.

Stachenfeld et al. (2017) demonstrated theoretically that many findings in rodent place cells can be attributed to a successor representation, including reward encoding and the stretching of place fields on one-way linear tracks. The ability of the successor representation to capture the population code of place cells, and to solve spatial navigation problems like shortcuts and detours, makes it a very exciting class of algorithms to investigate as a generalizable mechanism of cognition (Russek et al., 2017; Stachenfeld et al., 2017; Gershman, 2018).

4.4 Hierarchical Reinforcement Learning

The reinforcement learning literature contains another class of algorithms that fall under hierarchical reinforcement learning (HRL). These take into account the hierarchical nature of behaviour and of the environment with which we interact (Botvinick et al., 2009). Take the example of walking to the store to buy groceries. According to HRL, you divide the overall task into multiple subtasks that each represent a set of more basic actions. The basic actions in this instance could be walking right, left, or straight. An abstracted subtask could be a certain route A or a different route B, where the optimal route might depend on factors like traffic due to the time of day. Moreover, the task of walking to the store could itself be embedded hierarchically in a larger overall task, such as making dinner. This hierarchical structure means you are less bound to sequences of basic actions, and can instead choose between the abstracted subtasks (Rasmussen et al., 2017).

The standard reinforcement learning framework, like many models, suffers from the curse of dimensionality (Barto and Mahadevan, 2003). This means that the number of free parameters grows exponentially as a function of the size of the environment space, eventually making the number of necessary computations intractable and biologically implausible. The goal of HRL is to implement temporal abstraction, where high-level computations are only necessary at important decision points. This reduces the overall computational requirements, while simultaneously improving performance and providing a testable framework for hierarchical behaviour.

There are several important quantities that need to be accounted for in HRL, for instance time delays that extend beyond the single time-step of Markov Decision Processes (MDPs); decision processes with such delays are referred to as Semi-Markov Decision Processes (SMDPs) (Sutton et al., 1999). In the case of model-based reinforcement learning, the Bellman value function defined in equation 2 would be expressed as (Barto and Mahadevan, 2003):

$$V(s) = \max_{a} \left[ R(s,a) + \sum_{s',\, \tau} P(s', \tau \mid s,a)\, \gamma^{\tau}\, V(s') \right] \qquad (9)$$

where τ represents the number of time steps of the abstracted task. This new equation includes $P(s', \tau \mid s,a)$, which can be read as the joint probability of the future successor state and time delay given the current state-action pair.

An abstracted task is by definition any task extending beyond one time-step, and each is associated with a pseudo-reward. The abstracted tasks are also known as options and have several other quantities associated with them: the policy of the option (the probability mapping between states and actions), the termination condition (when the option ends), and the initiation set (where the option may start) (Sutton et al., 1999).
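These three quantities can be captured in a small data structure; the sketch below is a minimal illustration under the assumption of discrete integer states and actions, not an implementation from the HRL literature.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option in the sense of Sutton et al. (1999), bundling the three
    quantities named above. Integer states/actions and the deterministic
    policy are illustrative assumptions for this sketch."""
    initiation_set: Set[int]                  # states where the option may start
    policy: Callable[[int], int]              # maps a state to a primitive action
    termination_prob: Callable[[int], float]  # beta(s): chance the option ends in s

# Hypothetical example: a "walk to the end of the corridor" option on states 0..4.
go_right = Option(
    initiation_set={0, 1, 2, 3},
    policy=lambda s: 1,                       # always take the "right" action
    termination_prob=lambda s: 1.0 if s == 4 else 0.0,
)
print(go_right.termination_prob(4))  # terminates with certainty at state 4
```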

There is evidence that suggests that planning behaviour is hierarchical. In one study, participants planned routes through learned hypothetical subway networks (Balaguer et al., 2016). Reaction times were associated with a hierarchical representation of the subway. Simultaneous fMRI recordings revealed that dorsomedial prefrontal cortex (dmPFC) and bilateral anterior premotor cortex also encoded the hierarchical representation. However, only dmPFC also encoded the current state.

Another study used hierarchical reinforcement learning models to compute model-free reward prediction errors at the subtask level while participants played a package delivery game. Across three neuroimaging studies, subtask pseudo-reward predictions were reflected in several brain regions including anterior cingulate cortex, habenula, and amygdala (Ribas-Fernandes et al., 2011). They also identified an electroencephalography (EEG) event-related potential negativity signature of the subtask pseudo-reward.

Rasmussen et al. (2017) investigated the biophysical plausibility of model-free hierarchical reinforcement learning using detailed neural simulation. They used the neural engineering framework, which simulates spiking activity using leaky integrate-and-fire neurons. They found that the simulated neural model could perform many standard RL and HRL tasks, providing an important framework for future HRL research.

5 Combining computational models with rodent and human brain imaging

In the search for generalizable mechanisms of cognition, there is an impressive range of computational models one could consider. These range from attractor networks, to neural networks, to sequential sampling, to drift-diffusion, to Hebbian models, to integrate-and-fire models, and many more (Dayan and Abbott, 2001; Gerstner et al., 2012). This thesis argues the case for the use of reinforcement learning when investigating underlying mechanisms of navigation, and suggests that the successor representation might be a candidate for a generalizable mechanism of cognition (Russek et al., 2017; Momennejad et al., 2017; Behrens et al., 2018).

In order to test the generalizability of the successor representation, or any other algorithm that could apply to several cognitive domains, there is a new emerging experimental framework that explicitly investigates hypotheses across different levels of analysis. We are at an interesting point in time where advances in human neuroimaging, rodent invasive recording technology, and computational techniques can simultaneously be utilized to understand the information processing capabilities of the human brain. This section will first provide some research examples that approach the new paradigm, and lastly explore some of its future promises and pitfalls.

5.1 Examples of combining computational models and brain imaging

There are many examples of interdisciplinary research between neural data and computational modelling, especially in decision making and navigation. However, these studies have never explicitly and simultaneously spanned all three research modalities; rather, there are excellent examples of every pairwise combination of computational modelling, rodent invasive recordings, and human non-invasive recordings.

One recent experiment tested model-free vs. model-based spatial navigation behaviour, and the neural correlates thereof, using virtual navigation and fMRI (Anggraini et al., 2018). Trial-specific model-based behaviour was associated with parahippocampal and medial temporal lobe activity, while model-free behaviour was associated with striatal and ventromedial prefrontal cortex activity. Importantly, a hybrid model incorporating both model versions best explained overall behaviour. The question remains whether the successor representation would have outperformed these models in accounting for behaviour and neural activity.

In the decision-making field, there is a long history of investigating the relationship between choice behaviour and reaction time. The drift-diffusion computational model has shown remarkable accuracy in predicting these two quantities and their relationship with the speed-accuracy trade-off (Bogacz et al., 2006). Recently, studies have investigated predictions of these models in terms of evidence accumulation leading up to a choice in rodents (Brunton et al., 2013; Scott et al., 2017; Akrami et al., 2018). The strong evidence for evidence-accumulation signals in several brain regions suggests that this may be another form of generalizable mechanism of cognition. There have been suggestions that perceptual decision-making and reinforcement learning are more similar than often assumed, and that an integrative approach could be very insightful (Summerfield and Tsetsos, 2012).

One recent study used knowledge about rodent single-unit activity to record grid cell-like signals non-invasively in humans using fMRI (Doeller et al., 2010). The current proposal is to use similar approaches, but combined with computational models, to understand neural mechanisms in both rodents and humans.

These approaches are commonly restricted to one species. For instance, in the case of perceptual decision making, many model-based approaches use neuroimaging and human subjects (Forstmann et al., 2016), or model-based approaches with invasive rodent recordings (Scott et al., 2017), but very rarely, if ever, all three combined. I suggest that spatial cognition is an ideal domain for testing computational hypotheses using the interactionist neuroscience framework (Badre et al., 2015).

There have been studies in perceptual decision-making that use computational modelling and behavioural data from both rats and humans to study optimal evidence accumulation (Brunton et al., 2013) and working memory (Fassihi et al., 2014). However, combining these approaches with translational human neuroimaging techniques remains an underdeveloped frontier.

Deep reinforcement learning algorithms have been shown to capture properties of spiking neural activity during spatial navigation, with grid-cell-like units emerging in intermediate layers of the network (Banino et al., 2018).


5.2 The new experimental paradigm

The first step of the new approach is quite straightforward: develop experimental paradigms that generalize across species. Using the same task will enable us to investigate every level of analysis proposed by Marr and to explore the similarities and differences between species. For instance, a certain task might be best explained by the same algorithm in both humans and rodents, yielding detailed information on how this algorithm is implemented both at the circuit level and at the population level of brain activity.

The second step is to apply the recent developments from model-based cognitive neuroscience (Palmeri et al., 2017) that have been very successful in uncovering the computational mechanisms of perceptual decision-making across species (Hanks and Summerfield, 2017). The following points are central to this new paradigm:

• Develop generative models that can simulate the key behavior of interest during the experimental task.

• Fit the behavioral data to the generative models by adjusting the free parameters in each model (see the sketch after this list).

• Use the fitted model to make predictions about the underlying neural mechanisms.

• Correlate the predictions with the collected neural data.

• Compare a range of alternative models to increase confidence that an accurate model of behavior has been found.

• Use the neural data, along with behavior, to decide among the candidate models; different models might predict the same behavior but different neural implementations.
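As a concrete illustration of the fitting and model-comparison steps, the following is a minimal sketch, assuming a simple Q-learning model of binary choice fitted by maximum likelihood; the function names, starting values, and bounds are illustrative rather than taken from any of the cited studies.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(params, choices, rewards, n_actions=2):
        """Negative log-likelihood of observed choices under a simple
        Q-learning model with learning rate alpha and softmax inverse
        temperature beta (a stand-in for any generative model)."""
        alpha, beta = params
        Q = np.zeros(n_actions)
        nll = 0.0
        for a, r in zip(choices, rewards):
            z = beta * Q - np.max(beta * Q)     # stabilized softmax
            p = np.exp(z) / np.exp(z).sum()
            nll -= np.log(p[a] + 1e-12)
            Q[a] += alpha * (r - Q[a])          # value update
        return nll

    def fit_model(choices, rewards):
        """Fit (alpha, beta) by maximum likelihood; return the estimates
        and the AIC used for model comparison."""
        res = minimize(neg_log_likelihood, x0=[0.3, 2.0],
                       args=(choices, rewards),
                       bounds=[(1e-3, 1.0), (1e-3, 20.0)])
        return res.x, 2 * len(res.x) + 2 * res.fun

    # Each candidate model (model-based, model-free, successor
    # representation, ...) gets its own likelihood function; the model
    # with the lowest AIC, and whose internal variables best track the
    # neural data, is preferred.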

The combination of computational models with measurements of neural activity spans all three of Marr's levels (Marr, 1982): the models define the goal of the computation and the algorithm that carries it out, while the neural data constrain how the algorithm is implemented.

Another exciting recent advancement is the use of computational models that can infer laminar-specific information processing from local field potentials as well as from whole-head, population-level dynamics (Sherman et al., 2016). Biophysical plausibility is a crucial aspect of such models; however, there is always a balance to strike between model complexity and simplicity, and biophysically plausible models therefore tend to be used only in simulation (Rasmussen et al., 2017).

The question remains how to further apply this model-based framework to navigation research. We have seen that spatial information is encoded in the brain, and that reinforcement learning models and algorithms can reproduce, through simulation, rodent behaviour and neural activity during navigation. One next step is to run across-species experiments within the new framework, expanding upon human-only experiments (Anggraini et al., 2018). Another promising avenue for researching generalizable mechanisms of cognition is to combine the computational models used in sequential sampling with those used in reinforcement learning (Summerfield and Tsetsos, 2012). There is interesting evidence implicating parietal cortex in state-prediction errors in model-based RL, and parietal cortex is likewise often shown to reflect evidence accumulation leading up to a choice (O’Doherty et al., 2017).

One possibility is to use computational models which incorporate population-level neural oscillatory dynamics as measured in both rodents and humans during spatial navigation. Recent research has demonstrated that it is possible to image areas such as the hippocampus using non-invasive MEG (Pu et al., 2018; Meyer, 2016). Also collecting local field potentials from the rodent hippocampus during navigation could provide important insights into the neural mechanisms and algorithms used by both species.
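As one illustration of such a shared analysis, the following is a minimal sketch that computes relative theta-band power with Welch's method, as could be applied to both a rodent LFP trace and a human MEG virtual-sensor trace; the band limits and synthetic signals are purely illustrative (rodent theta typically runs faster than human theta).

    import numpy as np
    from scipy.signal import welch

    def relative_band_power(x, fs, band):
        """Fraction of total spectral power inside a frequency band,
        estimated with Welch's method."""
        f, pxx = welch(x, fs=fs, nperseg=int(2 * fs))
        in_band = (f >= band[0]) & (f <= band[1])
        return pxx[in_band].sum() / pxx.sum()

    # Example with synthetic traces; band limits are illustrative and
    # species-specific.
    fs = 1000.0
    t = np.arange(0, 10, 1 / fs)
    rng = np.random.default_rng(2)
    rodent_lfp = np.sin(2 * np.pi * 8 * t) + 0.5 * rng.standard_normal(t.size)
    human_meg = np.sin(2 * np.pi * 5 * t) + 0.5 * rng.standard_normal(t.size)
    print(relative_band_power(rodent_lfp, fs, band=(6, 10)))  # rodent theta
    print(relative_band_power(human_meg, fs, band=(4, 8)))    # human theta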

5.3 Potential pitfalls

One problem of doing between-species translational neuroscience is that the signals recorded are very different in origin, due to the different methods used. The most common human non-invasive imaging technology is blood-oxygen-level-dependent functional magnetic resonance imaging (BOLD fMRI), which, as the name implies, measures changes in blood oxygenation at millimetre resolution. The relationship between neural activity and the BOLD signal is still not fully understood, with some research indicating that non-neural cell types also influence the signal (Gagnon et al., 2015; Turner, 2016). The invasive and non-invasive recordings that are most similar across species are population-level neural oscillations. These oscillations are surprisingly similar across species and can be measured non-invasively using MEG/EEG in humans (Baillet, 2017). In general, it would be easier to align the imaging technology used in both species, but the scope of this thesis is rodent invasive and human non-invasive recordings. It is, however, possible to use human single-unit recordings, as has been done in spatial navigation tasks (Ekstrom et al., 2003).

6 Conclusions

Information necessary for spatial navigation is encoded and processed in a wide variety of brain regions in humans and rodents (Epstein et al., 2017; Grieves and Jeffery, 2017). The specific (or algorithmic) ways in which this spatial information is processed, manipulated, and transmitted throughout cortical and subcortical brain regions, and finally used in behaviour, remain largely unknown. By formulating our hypotheses about the information processing underlying spatial navigation as testable computational models, we may be able to shed light on the hierarchical spatial and temporal scales at which the brain operates.

One danger of the division between human and non-human research is that the chasm separating the two will keep widening, with the fields developing separate vocabularies and practices that ultimately hinder progress for neuroscience as a whole (Badre et al., 2015). We are currently witnessing massive growth in invasive rodent recording technology, similar to the explosion of human neuroimaging developments in the 1990s and early 2000s. It is important to combine these advances with human neuroscience in order to accelerate our understanding of the basic mechanisms underlying cognition, and of how they subsequently go awry in neurological disorders and diseases. Perhaps we will discover that Tolman was correct in that neurological disorders result from a disruption of cognitive map-like mental structures (Tolman, 1948).

7 Bibliography

Aguirre, G. K., Detre, J. A., Alsop, D. C. and D’Esposito, M. (1996), ‘The parahippocampus subserves topographical learning in man’, Cereb. Cortex 6(6), 823–829.

Akrami, A., Kopec, C. D., Diamond, M. E. and Brody, C. D. (2018), ‘Posterior parietal cortex represents sensory history and mediates its effects on behaviour’, Nature 554(7692), 368–372.

Aminoff, E. M., Kveraga, K. and Bar, M. (2013), ‘The role of the parahippocampal cortex in cognition’, Trends Cogn. Sci. 17(8), 379–390.

Anggraini, D., Glasauer, S. and Wunderlich, K. (2018), ‘Neural signatures of reinforcement learning correlate with strategy adoption during spatial navigation’, Sci. Rep. 8(1), 10110.

Auger, S. D. and Maguire, E. A. (2018), ‘Dissociating Landmark Stability from Orienting Value Using Functional Magnetic Resonance Imaging’, J. Cogn. Neurosci. 30(5), 698–713.

Badre, D., Frank, M. J. and Moore, C. I. (2015), ‘Interactionist Neuroscience’, Neuron 88(5), 855–860.

Baillet, S. (2017), ‘Magnetoencephalography for brain electrophysiology and imaging’, Nat. Neurosci. 20(3), 327–339.

Balaguer, J., Spiers, H., Hassabis, D. and Summerfield, C. (2016), ‘Neural Mechanisms of Hierarchical Planning in a Virtual Subway Network’, Neuron 90(4), 893–903.

Balleine, B. W. and O’Doherty, J. P. (2010), ‘Human and rodent homologies in action control: Corticostriatal determinants of goal-directed and habitual action’, Neuropsychopharmacology 35(1), 48–69.

Bandettini, P. A. (2014), ‘Neuronal or Hemodynamic? Grappling with the Functional MRI Signal’, Brain Connect. 4(7), 487–498.

Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Pritzel, A., Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, H., Kavukcuoglu, K., Hassabis, D., Hadsell, R. and Kumaran, D. (2018), ‘Vector-based navigation using grid-like representations in artificial agents’, Nature 557(7705), 429–433.


Barry, C. and Burgess, N. (2014), ‘Neural mechanisms of self-location’, Curr. Biol. 24(8), R330–R339.

Barry, C., Lever, C., Hayman, R., Hartley, T., Burton, S., O’Keefe, J., Jeffery, K. and Burgess, N. (2006), ‘The boundary vector cell model of place cell firing and spatial memory’, Rev. Neurosci. 17(1-2), 71–97.

Barto, A. G. and Mahadevan, S. (2003), ‘Recent advances in hierarchical reinforcement learning’, Discrete Event Dynamic Systems 13(1-2), 41–77.

Barto, A. G., Sutton, R. S. and Anderson, C. W. (1983), ‘Neuronlike adaptive elements that can solve difficult learning control problems’, IEEE Trans. Syst. Man. Cybern. SMC-13(5), 834–846.

Baumann, O., Chan, E. and Mattingley, J. B. (2012), ‘Distinct neural networks underlie encoding of categorical versus coordinate spatial relations during active navigation’, Neuroimage 60(3), 1630–1637.

Bayer, H. M., Lau, B. and Glimcher, P. W. (2007), ‘Statistics of Midbrain Dopamine Neuron Spike Trains in the Awake Primate’, J. Neurophysiol. 98(3), 1428–1439.

Behrens, T. E., Muller, T. H., Whittington, J. C., Mark, S., Baram, A., Stachenfeld, K. L. and Kurth-Nelson, Z. (2018), ‘What is a cognitive map? Organising knowledge for flexible behaviour’, bioRxiv.

Boccara, C. N., Sargolini, F., Thoresen, V. H., Solstad, T., Witter, M. P., Moser, E. I. and Moser, M. B. (2010), ‘Grid cells in pre- and parasubiculum’, Nat. Neurosci. 13(8), 987–994.

Bogacz, R., Brown, E., Moehlis, J., Holmes, P. and Cohen, J. D. (2006), ‘The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks’, Psychol. Rev. 113(4), 700–765.

Bostock, E., Muller, R. U. and Kubie, J. L. (1991), ‘Experience-dependent modifications of hippocampal place cell firing’, Hippocampus 1(2), 193–205.

Botvinick, M. M., Niv, Y. and Barto, A. C. (2009), ‘Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective’, Cognition 113(3), 262–280.

Brown, T. I., Carr, V. A., LaRocque, K. F., Favila, S. E., Gordon, A. M., Bowles, B., Bailenson, J. N. and Wagner, A. D. (2016), ‘Prospective representation of navigational goals in the human hippocampus’, Science 352(6291), 1323–1326.

Brunton, B. W., Botvinick, M. M. and Brody, C. D. (2013), ‘Rats and Humans Can Optimally Accumulate Evidence for Decision-Making’, Science 340(6128), 95–98.

Butler, W. N., Smith, K. S., van der Meer, M. A. and Taube, J. S. (2017), ‘The Head-Direction Signal Plays a Functional Role as a Neural Compass during Navigation’, Curr. Biol. 27(15), 2406.

Buzsáki, G. (2005), ‘Theta rhythm of navigation: Link between path integration and landmark navigation, episodic and semantic memory’, Hippocampus 15(7), 827–840.

Buzsáki, G. (2015), ‘Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning’, Hippocampus 25(10), 1073–1188.

Chadwick, M. J., Jolly, A. E. J., Amos, D. P., Hassabis, D. and Spiers, H. J. (2015), ‘A goal direction signal in the human entorhinal/subicular region’, Curr. Biol. 25(1), 87–92.

Cheeseman, J. F., Millar, C. D., Greggers, U., Lehmann, K., Pawley, M. D. M., Gallistel, C. R., Warman, G. R. and Menzel, R. (2014), ‘Way-finding in displaced clock-shifted bees proves bees use a cognitive map’, Proc. Natl. Acad. Sci. 111(24), 8949–8954.

Cheung, A. (2014), ‘Animal path integration: A model of positional uncertainty along tortuous paths’, J. Theor. Biol. 341, 17–33.

Chrastil, E. R., Sherrill, K. R., Aselcioglu, I., Hasselmo, M. E. and Stern, C. E. (2017), ‘Individual Differences in Human Path Integration Abilities Correlate with Gray Matter Volume in Retrosplenial Cortex, Hippocampus, and Medial Prefrontal Cortex’, eNeuro 4(2), ENEURO.0346-16.2017.

Chrastil, E. R., Sherrill, K. R., Hasselmo, M. E. and Stern, C. E. (2015), ‘There and Back Again: Hippocampus and Retrosplenial Cortex Track Homing Distance during Human Path Integration’, J. Neurosci. 35(46), 15442–15452.


Collett, T. S. and Graham, P. (2004), ‘Animal navigation: Path integration, visual landmarks and cognitive maps’, Curr. Biol. 14(12), 475–477.

Cressant, A., Muller, R. U. and Poucet, B. (1997), ‘Failure of centrally placed objects to control the firing fields of hippocampal place cells’, J. Neurosci. 17(7), 2531–2542.

Daw, N. D. and Dayan, P. (2014), ‘The algorithmic anatomy of model-based evaluation’, Philos. Trans. R. Soc. B Biol. Sci. 369(1655), 20130478.

Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. and Dolan, R. J. (2011), ‘Model-Based Influences on Humans’ Choices and Striatal Prediction Errors’, Neuron 69(6), 1204–1215.

Daw, N. D., Niv, Y. and Dayan, P. (2005), ‘Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control’, Nat. Neurosci. 8(12), 1704–1711.

Dayan, P. (1993), ‘Improving Generalization for Temporal Difference Learning: The Successor Representation’, Neural Comput. 5(4), 613–624.

Dayan, P. and Abbott, L. F. (2001), Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems, MIT Press.

Dayan, P. and Niv, Y. (2008), ‘Reinforcement learning: The Good, The Bad and The Ugly’, Curr. Opin. Neurobiol. 18(2), 185–196.

Dickinson, A. (1985), ‘Actions and Habits: The Development of Behavioural Autonomy’, Philos. Trans. R. Soc. B Biol. Sci. 308(1135), 67–78.

Doeller, C. F., Barry, C. and Burgess, N. (2010), ‘Evidence for grid cells in a human memory network’, Nature 463(7281), 657–661.

Doeller, C. F., King, J. A. and Burgess, N. (2008), ‘Parallel striatal and hippocampal systems for landmarks and boundaries in spatial memory’, Proc. Natl. Acad. Sci. 105(15), 5915–5920.

Dolan, R. J. and Dayan, P. (2013), ‘Goals and habits in the brain’, Neuron 80(2), 312–325.

Dragoi, G. and Tonegawa, S. (2011), ‘Preplay of future place cell sequences by hippocampal cellular assemblies’, Nature 469(7330), 397–401.

Dragoi, G. and Tonegawa, S. (2013), ‘Selection of preconfigured cell assemblies for representation of novel spatial experiences’, Philos. Trans. R. Soc. B Biol. Sci. 369(1635), 20120522.

Dupret, D., O’Neill, J., Pleydell-Bouverie, B. and Csicsvari, J. (2010), ‘The reorganization and reactivation of hippocampal maps predict spatial memory performance’, Nat. Neurosci. 13(8), 995–1002.

Ekstrom, A. D. (2015), ‘Why vision is important to how we navigate’, Hippocampus 25(6), 731–735.

Ekstrom, A. D., Kahana, M. J., Caplan, J. B., Fields, T. A., Isham, E. A., Newman, E. L. and Fried, I. (2003), ‘Cellular networks underlying human spatial navigation’, Nature 425(6954), 184–188.

Epstein, R. A. (2008), ‘Parahippocampal and retrosplenial contributions to human spatial navigation’, Trends Cogn. Sci. 12(10), 388–396.

Epstein, R. A., Patai, E. Z., Julian, J. B. and Spiers, H. J. (2017), ‘The cognitive map in humans: spatial navigation and beyond’, Nat. Neurosci. 20(11), 1504–1513.

Epstein, R. A. and Vass, L. K. (2014), ‘Neural systems for landmark-based wayfinding in humans’, Philos. Trans. R. Soc. B Biol. Sci. 369(1635).

Epstein, R. and Kanwisher, N. (1998), ‘A cortical representation of the local visual environment’, Nature 392(6676), 598–601.

Etienne, A. S. and Jeffery, K. J. (2004), ‘Path integration in mammals’, Hippocampus 14(2), 180–192.

Etienne, A. S., Maurer, R., Boulens, V., Levy, A. and Rowe, T. (2004), ‘Resetting the path integrator: a basic condition for route-based navigation’, J. Exp. Biol. 207(9), 1491–1508.

Fassihi, A., Akrami, A., Esmaeili, V. and Diamond, M. E. (2014), ‘Tactile perception and working memory in rats and humans’, Proc. Natl. Acad. Sci. 111(6), 2331–2336.

Forstmann, B. U., Ratcliff, R. and Wagenmakers, E.-J. (2016), ‘Sequential Sampling Models in Cognitive Neuroscience: Advantages, Applications, and Extensions’, Annu. Rev. Psychol. 67, 641–666.
