Continual learning in humans and neuroscience-inspired AI

Lucas Weber,∗ Elia Bruni,† and Dieuwke Hupkes‡
University of Amsterdam

(Dated: June 28, 2018)

Abstract

The field of Artificial Intelligence (AI) research is more prosperous than ever. However, current research and applications still aim to optimise single-task performance rather than algorithms that are able to generalize over multiple tasks and reuse prior knowledge for new challenges. This makes current systems data-hungry, computationally expensive and inflexible. A main obstacle on the way towards more flexible and generalizing algorithms is the phenomenon of catastrophic forgetting in connectionist networks. While the generalization abilities of artificial neural networks suffer from catastrophic forgetting, biological neural networks seem to be relatively unaffected by it. In this literature review we aim to understand which mechanisms implemented in the human nervous system are necessary to overcome catastrophic forgetting, and review to what extent these mechanisms are already realized in AI systems. Our review is guided by Marr's levels of analysis and comes to the conclusion that an integration of the partial solutions already realized in AI may overcome catastrophic forgetting more completely than prior solutions.

Keywords: neuroscience, interdisciplinary, catastrophic forgetting, connectionist

∗ lucas.weber@student.uva.nl
† E.Bruni@uva.nl


CONTENTS

1. Introduction

2. Machine learning literature on catastrophic forgetting

3. Neuroscientific literature on catastrophic forgetting

3.1. Neuroscientific framework

3.2. Neuroscientific Theory and Evidence

3.2.1. Complementary Learning Systems Theory

3.2.2. Selective Constraints of neuroplasticity

3.2.3. Neurogenesis within the hippocampus

4. Integration of neuroscientific insight into machine learning

4.1. Using complementary learning systems: from the DQN-model to deep generative replay

4.2. Constraining weight plasticity within the network

5. Discussion


1. INTRODUCTION

Reports in mass media and popular science in recent years come thick and fast with delineations of how artificial intelligence (AI) research is breaking through over and over again, leaving the lay reader expecting the machine revolution around the next corner. As usual, news reports on current developments in science are tremendously exaggerated: no company is currently working on the creation of Skynet, nor will your boss be fired tomorrow because machine learning made his job obsolete. However, the current enthusiasm is not completely unfounded, since state-of-the-art algorithms have made great leaps in their capabilities in recent years, surpassing human-level performance in important tasks.

This success is mainly carried by so-called artificial neural networks (ANNs). ANNs are simplified computational models of biological brains comprising graph-like structures. Nodes of artificial neural networks correspond to biological neurons and are organized layer-wise. From layer to layer the network computes non-linear transformations by weighting the output from the previous layer(s) and applying a non-linearity (e.g. setting negative values to zero, Glorot et al. 2011). Especially popular are deep neural networks (DNNs), which stack greater numbers of neural layers on top of each other, making up large architectures with several million optimizable parameters. Examples of benchmark-shifting neural network architectures come, amongst others, from the domain of computer vision (e.g. AlexNet, Krizhevsky et al. 2012) or control of artificial environments (e.g. DQN, Mnih et al. 2015). In other domains, like speech recognition (Graves et al. 2013; Hinton et al. 2012), machine learning has made great progress, even though it does not yet beat human expertise. Advances in these areas are especially impressive, since these domains of learning largely depend on unstructured data, which traditionally posed the more difficult form of data for algorithms to learn from. Opposed to structured data, where every datapoint has a specific inherent meaning and is organized in a predefined manner (e.g. lists of housing prices mapped to the number of rooms in a house), in learning from unstructured data conclusions have to be drawn from datapoints which only get their meaning through their context. As an illustrative example of unstructured learning we can consider multiple pixels in a picture making up a pattern that looks like a dog. While a single brown pixel on the tip of the dog's nose has no meaning without its surroundings, the pattern of multiple pixels together is meaningful. Another example from speech recognition might be sound frequencies that have to be combined in certain patterns to make up comprehensible speech. While humans are normally extraordinarily good at finding regularities in, and making sense of, this kind of data, AI was traditionally better at inferring from structured information. The advance of machines in these domains is reason for excitement, since most data in the world is of unstructured nature, giving AI a greater scope of possible application. On top of that, it may make interactions with humans more intuitive once the data utilized by both agents becomes more coherent: computer speech recognition and computer vision might play a crucial role in handling increasingly complex technology, as opposed to classical, highly structured interfaces.
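To make the layer-wise computation described above concrete, here is a minimal NumPy sketch (shapes and values are illustrative assumptions, not taken from any particular architecture) of a single layer weighting its input and applying a ReLU non-linearity, i.e. setting negative values to zero:

```python
import numpy as np

def relu(x):
    # non-linearity: set negative values to zero (Glorot et al. 2011)
    return np.maximum(x, 0.0)

def layer_forward(x, W, b):
    # weight the output of the previous layer and apply the non-linearity
    return relu(x @ W + b)

# illustrative shapes: 4 input units, 3 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))   # output of the previous layer
W = rng.normal(size=(4, 3))   # connection weights (trainable parameters)
b = np.zeros(3)               # biases
h = layer_forward(x, W, b)    # activation of the next layer
print(h.shape)                # (1, 3)
```

Stacking many such layers, each with its own weights, is what yields the deep architectures with millions of optimizable parameters discussed in the text.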

However, under closer inspection, these recent advances may lose some of their gloriousness. Looking for the reasons behind their success, one finds that they are, to a not insignificant part, based on two factors: the availability of big amounts of data and computational power. The increased distribution and usage of mobile information technology (statista.com 2018a,b) and the accompanying surge in produced digital data (Kitchin 2014) made it possible to build large-scale online databases (e.g. ImageNet, Jia Deng et al. 2009). These databases provide researchers with millions of labeled training examples, making it possible to train larger architectures approximating more complex functions (e.g. by building deeper networks, Cabestany et al. 2005). Training these large architectures with excessive amounts of data comes at a great computational cost, which brings us to the second reason for the current surge in machine learning: Moore's law (Moore 1965) still holds and results in a previously unmatched amount of computational power available to researchers. The combination of this explosion in available data and computational power enabled the training of larger and larger models, resulting in better and better performances.

An instance of this development is the previously mentioned AlexNet (Krizhevsky et al. 2012), which caught the AI community by surprise by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 (Russakovsky et al. 2015), almost halving the error rate of all its competitors in the process. While the technique it used (convolutional neural networks [CNNs]) had been widely known in the AI community since the introduction of LeNet by Lecun & Bengio in 1995, the refinement of this technique, and especially its usage in a computationally expensive DNN trained on millions of labeled images, made this success possible. To illustrate the dynamics that the development of increasing architecture size has taken, we briefly mention state-of-the-art architectures like deep residual networks (ResNet) (He et al. 2015), which comprise up to 152 layers of convolutional computations.

If increasing network size works so well, why would we want to change anything about it? Why do we dismiss the media's reports about the rise of machine intelligence? There are multiple ways to answer this question.

First of all, although it still holds at the moment, Moore's law is expected to end in the near future (see e.g. Kumar 2012, Waldrop 2016): engineers are approaching the physical limitations of possible transistor size, which may decelerate the growth of the computational efficiency of processing units in the near future. With its dependence on ever-growing amounts of available computational power, the approach of building bigger and bigger systems to yield better performances is then likely to decelerate the advance of AI as well. If the field of AI does not want to rely on a revolution in electrical engineering and in the way we do computations within the near future, it is well advised to avoid getting too heavily invested in developing brute-force systems (e.g. deep ResNets). The optimization objective during the development of new architectures should not only be the decrease of error rates, but also how data- and computationally efficiently they obtain their results.

Second, the current approach leads to algorithms whose capabilities are limited to the very specific domain they are trained for. Within their domain they are, without complete retraining of the system (Lake et al. 2016), unable to adapt to new environments or changes in task demands. Even though DNNs have solved the problem of finding patterns within a task, finding regularities on a larger scope (between different tasks) is still a mostly unsolved issue. This reveals a lack of self-organized transfer learning and larger-scale generalization. These, however, are attributes necessary to achieve what psychologists term general intelligence or the g-factor (Spearman 1904), the ultimate goal in the creation of AI. In psychometrics, g describes the positive correlation of an individual's performances in different cognitive tasks (a transfer of g to machine intelligence is given by Legg & Hutter 2007). This generalization of cognitive abilities over multiple domains and tasks, central to psychologists' definition of intelligence for over a century, is missing in current AI. On the contrary, the current brute-force, data-hungry models perform extraordinarily well, but are restricted to their predefined, very narrow domain. Stating that today's systems are intelligent is therefore per definitionem false.


However, building a system comprising real general intelligence is an immensely complex task. Luckily, researchers are able to draw inspiration from the most sophisticated cognitive processor currently known: the human brain. Having emerged from selective pressure over thousands of years (Wynn 1988), the human mind is the most adaptive and productive computational agent we know. To illustrate the extraordinary capabilities of human intelligence compared to contemporary AI, we would like to cite Lake et al. (2016), who give a comprehensible example on control in Atari 2600 video games: when both humans without noteworthy experience in one of the games and a contemporary deep reinforcement learning algorithm (DQN) (Mnih et al. 2013) learn to play Atari video games, their learning curves differ tremendously. When the DQN is trained on the equivalent of 924 hours of unique playing time and additionally revisits these 924 hours of play eight times, it still only reaches 19% of the performance of a human player who played the very same game for 2 hours. This illustrates how much more efficiently humans make use of the data they are given. Even though subsequent, enhanced variants of the same algorithm (DQN+ and DQN++, Wang et al. 2015) were able to achieve up to 83% and even 98% of the human player's performance, their learning curves are still far from being as steep. This is especially apparent in the early phase of learning: while humans demonstrate particularly large performance gains in the initial phase of learning, DQN++ needs more time to show improvements. After being trained for only two hours, like its human competitor, DQN++ only reaches 3.5% of human performance.

How is this possible? As explained above, g, as it is found in humans, requires the ability to generalize and transfer knowledge from prior tasks in order to apply it to new challenges posed by the environment. Most machine learning agents currently lack this ability.

To be able to transfer knowledge from one domain to another, a cognitive agent needs to be able to learn sequentially from a myriad of different experiences over a lifetime and integrate their commonalities. This sequential learning task appears very natural to humans, but it is in fact of great difficulty for other cognitive agents. Learning opportunities often appear unanticipated, only briefly, and temporally separated from each other. To nevertheless make sense of this apparent mess of inputs, Lake et al. (2016) formulated guiding principles that are likely to be central to how humans are thought to learn. Lake et al. (2016) highlight three principles that are necessary for efficient, generalizing sequential learning: (1) compositionality, (2) learning-to-learn and (3) causality. We will shortly introduce these principles here and relate them to the core issue of this paper, the obstacle of catastrophic forgetting in sequentially learning systems, which we will explain in more depth in part 2.

(1) Compositionality is the idea that concepts are built out of more primitive building blocks. While these concepts can be decomposed into their elements, they can themselves also serve as the building blocks of even more complex concepts. These more complex concepts can then be recombined again, and so on. As an anecdotal illustration we can consider computer programming: basic functions can be combined to build more complex functions, which in turn can be recombined to make up even more sophisticated functions. In this way functions stack up from machine code to high-level programming languages or sophisticated computer programs. To be able to reach this level of complexity, the lower-level concepts need to be of general form and be shared among as many higher-level concepts as possible.
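The programming analogy can be made literal with a tiny, purely illustrative Python sketch: primitive functions are composed into a more complex concept, which is then itself reused as a building block of a yet higher-level one.

```python
def compose(f, g):
    # build a new concept out of two existing ones
    return lambda x: g(f(x))

# primitive building blocks (arbitrary toy examples)
double = lambda x: 2 * x
increment = lambda x: x + 1

# a more complex concept built from the primitives ...
double_then_increment = compose(double, increment)

# ... which is itself reused as a building block of an even higher-level concept
repeat_twice = compose(double_then_increment, double_then_increment)

print(double_then_increment(3))  # 3*2+1 = 7
print(repeat_twice(3))           # (3*2+1)*2+1 = 15
```

The lower-level pieces stay general and are shared by every higher-level concept built on top of them, which is exactly the property the text asks of compositional representations.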

Compositionality naturally connects to (2) learning-to-learn, first introduced by Harlow (1949): when humans are confronted with situations that go beyond the data they have encountered so far, they are able to infer, based on previously learned concepts, what is most reasonable in the new situation and thereby try to deal with the new circumstances. Since concepts are often (partially) shared between different tasks, learning will go faster and need less data. While very similar to transfer learning, an idea already very popular in current AI, learning-to-learn places greater emphasis on being grounded in the aforementioned compositionality. Transfer learning describes how parts of learned concepts are taken and utilized to solve other tasks. It usually takes place when two similar tasks are trained in a row, and it is already partly realized in deep learning CNNs through feature sharing between tasks; this feature sharing, however, operates on a very small scale. Learning-to-learn, on the other hand, is defined somewhat differently. It intends to take transfer learning to a higher, more human-like level, by not only extracting shareable features and letting them loosely coexist next to each other, but also relating these features (or concepts) to each other in a causal way.

This leads us to Lake's third principle. (3) Causality refers to knowledge about how the observed data comes about. Systems that feature causality therefore not only concentrate on the final product, but also on the process by which it is created. In general, causal models are generative (as opposed to purely discriminative models). Causality gives generative models the possibility to grasp how concepts relate to each other, making generative models that embrace causality usually better at capturing regularities in the environment. Causality comprises knowledge of state-to-state transitions in the environment and therefore goes naturally hand in hand with sequential learning. When knowledge about state-to-state transitions is learned as well, the system is able to relate concepts to each other and determine how they usually interact. By inverting the idea of causality, a cognitive agent can infer and reason about the causes of its current situation.

While making these points, Lake et al. (2016) refer to their implementation of the very same ideas in Lake et al. (2015). Their generative model (called Bayesian Programme Learning [BPL]) recognizes and categorizes different characters from different alphabets by combining a set of primitives according to the aforementioned principles. In doing so, it reaches super-human performance in one-shot learning of new character concepts, demonstrating its ability to learn new concepts from sparse data with little computational power. While promoting important ideas and yielding promising results, BPL has the problem that it needs a lot of top-down, knowledge-based hand-crafting to obtain its impressive performance. Further, its task is limited to a very specific, simple domain. This is the opposite of the greater generalization abilities, the idea and ultimate goal, it intends to promote. This hand-crafting includes providing the generative model with the primitives it can use to create its more complex characters. While it is practicable to provide a generative model with appropriate primitives for a relatively simple character recognition task, it becomes more difficult to do so for models learning more complicated functions. McClelland et al. (2010) named this problem more eloquently as the need for 'considerable initial knowledge about the hypothesis-space, space of possible concepts and structures for related concepts' that is inherent to generative, probabilistic models. The idea of top-down design of model architecture becomes less feasible when we aim for a more general AI agent. The principal ideas, compositionality, causality and learning-to-learn, are however to be considered fundamental to building more intelligent AI systems. The implementation, though, has to take another route: it has to be driven by an emergent approach that is able to add the needed complexity to the model. The previously mentioned ANN architectures offer the needed emergent complexity. One way to harness top-down ideas while sticking to an emergent, connectionist framework is to create modular architectures, holding top-down inspiration in the functionality of their modules and their interactions, while keeping the benefits of emerging complex structure within the individual components (Marblestone et al. 2016).

An example of the value of approaches integrating emergent structure with top-down guiding knowledge is the so-called long short-term memory (LSTM) (Hochreiter & Schmidhuber 1997), which has enjoyed ever-growing popularity since its introduction. LSTMs are related to the function of human working memory (Baddeley & Hitch 1974). Similar to human working memory, LSTMs can hold small portions of information that will be needed later on, by providing a temporary memory buffer that can store, retrieve or erase its contents as needed. This modular addition to classical recurrent neural network architectures (RNNs) allows a great increase in performance in sequential-behavior tasks that rely on the use of information over a larger number of timesteps. Building on this idea, subsequent algorithms (e.g. memory networks, Weston et al. 2014) that are constructed even closer to the biological archetype (e.g. by dividing memory and control functions) yield even better performance without adding excessive amounts of trainable parameters, by refining the modular structure of the architecture.
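As a concrete, simplified illustration of such a modular memory, the sketch below runs PyTorch's standard LSTM over a toy sequence; the gated cell state is what lets the network store, retrieve or erase information across timesteps (all sizes are arbitrary choices for illustration, not a model of working memory).

```python
import torch
import torch.nn as nn

# an LSTM keeps a gated cell state that can store, retrieve or erase
# information across timesteps, loosely analogous to a working-memory buffer
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 20, 8)      # one sequence of 20 timesteps, 8 features each
output, (h_n, c_n) = lstm(x)   # c_n is the final content of the memory cells

print(output.shape)  # torch.Size([1, 20, 16]) -- hidden state at every timestep
print(c_n.shape)     # torch.Size([1, 1, 16])  -- final cell ("memory") state
```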

In the transfer of Lake et al.'s (2016) principles to such a modular, emergent approach, however, lies an old, well-known problem. To learn sequentially with the prospect of achieving true intelligence, one needs to avoid the common problem of catastrophic forgetting. A system that forgets catastrophically will not be able to utilize Lake's principles of learning-to-learn, compositionality and causality. However, catastrophic forgetting is inherent to classical emergent connectionist approaches.

In the following part 2 of this review we will explain catastrophic forgetting in more detail and expand on early ideas in the AI community to resolve the problem. In the subsequent part 3, we will look at how humans handle the problem of catastrophic forgetting. Since we have already argued that human cognitive agents are able to learn sequentially and implement Lake's principles, we should be able to find mechanisms by which catastrophic forgetting is prevented in biological neural networks. While doing so, we will locate the ideas we find on Marr's (1982) levels of analysis, to make it easier to put them into context. Thereafter, we will present evidence from the individual disciplines that is likely to be useful in the construction of new AI architectures. In part 4 we will present machine learning algorithms in which those ideas are already implemented. In part 5 we will discuss how to possibly integrate the aforementioned ideas.


2. MACHINE LEARNING LITERATURE ON CATASTROPHIC FORGETTING

The phenomenon of catastrophic forgetting, also known as catastrophic interference, was first brought up by McCloskey & Cohen (1989). It describes the interference of a newly learned task with previously learned tasks in classical, sequentially trained connectionist networks. The reason why these networks are prone to interference is that, when a task A is learned by the network, information regarding this task is not saved locally, but in a distributed manner, spread over many nodes in the network (see parallel distributed processing (PDP), Rumelhart et al. 1986). When a second task B is trained afterwards, the network will use the very same connections to learn task B that were beforehand used to memorize task A. Training on task B thereby overwrites the pattern for task A within the weight distribution. What happens when the knowledge representations of two tasks interfere with each other is easiest to understand when we consider the learned solution to a task in weight space. The weight space is a multidimensional space in which every parameter of the network represents one dimension; it represents all combinations of weight values that a network can possibly adopt. What happens in weight space when we train the network on two different tasks (A and B) one after the other? While being trained on task A, the weight configuration of the system will slowly migrate through weight space and finally converge on a weight combination that solves task A satisfactorily well. When the network is subsequently trained on the second task B, the weight configuration will migrate through weight space towards a solution of task B. During this second training phase it neglects previously learned information, veering away from the solution of task A and thereby consequentially causing catastrophic forgetting. To prevent this from happening, the network needs to find a solution within the weight space for task B that also poses a solution to task A. Since networks are usually overparameterized, it is very likely that there are multiple points in weight space (certain weight combinations) that yield overlapping solutions for both tasks (Kirkpatrick et al. 2017). If the learning algorithm can be constrained in a way that it finds a solution residing in such an overlap area, task B can be learned without interfering with task A.
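The weight-space picture can be reproduced in a few lines. The toy PyTorch sketch below (our own illustration; tasks, data and hyperparameters are arbitrary) trains one small network first on a task A and then on a task B; after the second phase, the loss on task A typically rises sharply, because the shared weights have migrated towards a solution for task B.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# two toy regression tasks sharing the same network weights (illustrative data)
xA, yA = torch.randn(256, 10), torch.randn(256, 1)
xB, yB = torch.randn(256, 10), torch.randn(256, 1)

def train(x, y, steps=500):
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()

train(xA, yA)                         # phase 1: converge on a solution for task A
loss_A_before = loss_fn(net(xA), yA).item()
train(xB, yB)                         # phase 2: migrate towards a solution for task B
loss_A_after = loss_fn(net(xA), yA).item()
print(loss_A_before, loss_A_after)    # loss on task A typically increases markedly
```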

Catastrophic forgetting is an extreme case of the stability-plasticity dilemma (Carpenter & Grossberg 1987). The stability-plasticity dilemma is concerned with how the global rate of learning (or plasticity) in a network influences the stability or instability of distributed knowledge representations. In a parallel distributed system a certain amount of plasticity is necessary to integrate new knowledge into the system: when plasticity is too low, a so-called entrenchment effect can be observed. In entrenchment, the rate of change of the connection weights is too low to cause any noteworthy adaptations when the network is confronted with new information. While new information will not erase prior knowledge, the network is also no longer able to adapt to new information. However, if there is too much plasticity, prior knowledge will constantly be overwritten and catastrophic forgetting will occur. Thus, an optimal learner has to keep its connections partially plastic to be able to integrate new knowledge, while at the same time constraining plasticity selectively so as not to overwrite prior knowledge.

Figure 1: In this example a two-weight system has found a solution for task A. Subsequently it is trained on a second task B. During training the system is unconstrained by its prior knowledge and therefore neglects task A while migrating through weight space towards a solution of task B. This results in catastrophic forgetting of task A. To not forget catastrophically, the system needs to migrate towards the overlapping area at the top.


Overcoming catastrophic forgetting is thus central to sequential learning in connectionist networks, and thereby to the development of continually learning and generalizing systems. This is why, over the years, different solutions to the problem have been proposed. One of the first suggested solutions came from French (1992). He argues that reducing overlap between different representations is key to avoiding catastrophic forgetting. Almost all subsequent solutions follow this line of thinking. French introduced the technique of activation 'sharpening', in which he increased activations of nodes that are already high and decreased activations of nodes that are already low, making the activation pattern for a certain representation more sharply separated and thereby disentangling it from the activation patterns of other representations. He called the outcome 'semi-distributed representations'. While his approach was partly successful in reducing catastrophic forgetting, it may reduce the ability of the network to generalize. Generalization relies on the emergence of more abstract features that contribute to different tasks and not only to a single one. Since it is harder for the network to change prior knowledge representations when sharpening is applied, it won't create such abstract features. Instead, it is more likely to find solutions for different tasks separately from each other and store them in parallel in different parts of the network. In another approach, Brousse & Smolensky (1989) and McRae & Hetherington (1993) stated that humans do not learn from scratch, but can base new representations on large pretrained networks. They describe that when tasks are highly internally structured (like, e.g., speaking a language), new data samples will have the same regularities as previous data and therefore will not elicit drastically different activations and changes in weights. However, according to French (1994) and Sharkey & Sharkey (1995) this idea suffers from an inability to generalize as well: according to Brousse & Smolensky (1989), only highly internally structured tasks should be learned sequentially, meaning that new data must have the same regularities as previous data. Differing tasks will most likely have different internal structure. Therefore, generalizing knowledge from one domain to another, as is necessary for general intelligence, won't be possible.
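A rough sketch of the sharpening idea, based on our reading of French (1992) rather than his exact procedure, is to push the k most active hidden units towards 1 and all others towards 0, yielding less overlapping, 'semi-distributed' activation patterns:

```python
import numpy as np

def sharpen(activations, k=2, factor=0.5):
    # nudge the k highest activations towards 1 and all others towards 0,
    # producing a 'semi-distributed' (less overlapping) hidden representation
    sharpened = activations.copy()
    top = np.argsort(activations)[-k:]
    mask = np.zeros_like(activations, dtype=bool)
    mask[top] = True
    sharpened[mask] += factor * (1.0 - sharpened[mask])   # move towards 1
    sharpened[~mask] -= factor * sharpened[~mask]         # move towards 0
    return sharpened

hidden = np.array([0.1, 0.8, 0.3, 0.7, 0.2])
print(sharpen(hidden))   # [0.05 0.9  0.15 0.85 0.1 ]
```

As the text notes, the price of this disentanglement is that fewer hidden units are shared between tasks, which limits the emergence of abstract, reusable features.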

3. NEUROSCIENTIFIC LITERATURE ON CATASTROPHIC FORGETTING

In the previous section we depicted the problem that catastrophic forgetting poses for AI systems in becoming better sequential learners and in utilizing causality, compositionality and learning-to-learn, and thereby in becoming truly generalizing, intelligent agents. Almost thirty years have passed since the first description of the problem, but it has still not been resolved by the AI research community. We also depicted the early attempts to resolve the issue, which mostly sought a solution through dexterous mathematical insight; these, however, were not able to completely sort out the problem, or brought other issues with them (e.g. a loss of the ability to discriminate properly when relying on pretraining). Considering this apparent persistence of catastrophic forgetting against prevailing practice, broadening our perspective to other disciplines (studying human information processing) for inspiration, as suggested in the introduction, might be a good idea. In the following sections we will carve out the most promising fields of human-related research in which to find a solution, and present the available evidence relating to the problem of catastrophic forgetting.

3.1. Neuroscientific framework

So, how do humans prevent catastrophic forgetting in their biological neural networks? To be able to harness scientific insight about this question, and perhaps finally answer it, we need to consider not only the discipline of neuroscience in its broadest sense, but also theory from the cognitive sciences. Together, they span the field of Brain and Cognitive Science, a strongly interdisciplinary endeavor. Ranging from loose cognitive-psychological theory to deterministic molecular-biological mechanisms, the field is not straightforward to comprehend from an outside view. To make it more palatable, we will present Marr's (1982) influential framework of levels of analysis, which guides the discipline up to this day. Based on Marr's levels it becomes easier to organize the research discussed here in a meaningful way and to appreciate the idea that there is most likely not a single approach, but multiple approaches that conjointly may give us a satisfying and complete solution.

Marr introduces three levels of analysis: computational, algorithmic and implementational. He describes the computational level as the level at which we state the problem we would like to address, while providing no answer on how to solve it. In the Brain and Cognitive Sciences it is best represented by the discipline of cognitive psychology and the cognitive sciences. It offers modularized cognitive concepts whose interconnections are loose and unformalized, providing only few concrete mechanisms by which they come about and interact. It helps us state relevant questions (e.g. what kind of modules do humans need to store short-term memories, manipulate them and integrate them into prior knowledge?). We mostly obtain knowledge on this level through highly controlled, quantified psychological experiments and deductive reasoning. Approaches that are solely inspired by the computational level of analysis and not informed by the other two are top-down oriented, like Lake et al.'s BPL mentioned in the introduction.

Figure 2: The three levels of Marr, (a) computational, (b) algorithmic and (c) implementational, are exemplified by the process of human vision on the right-hand side. Every lower level is a realization of its higher levels. Every higher level can be realized in different ways on lower levels. In our example of vision, the algorithm at the algorithmic level can not only be realized in vivo in biological neural networks, but also in silico using artificial neural networks.

The algorithmic level helps to find solutions to the problems stated on the computational level, holding the concrete mechanisms by which they may be solved. This level provides the bridge between computation and implementation. It corresponds to the discipline of cognitive neuroscience, which locates cognitive concepts within different brain areas and thereby matches them to neural substrate. Researchers do this by utilizing evidence from, e.g., neuroanatomy, neuroimaging or electrophysiology. By finding correlations between the use of certain cognitive resources and brain activity, the algorithmic level builds a bridge between ideas about high-level concepts like cognition and the underlying neural 'hardware'.

On the implementational level we define how the aforementioned mechanisms or algorithms are realized (i.e. the physical substrate that the mechanisms are performed on). This physical substrate may be in silico, through transistors on a microchip, or in vivo, through populations of neurons and their interactions. While some substrates may be more suitable than others, in principle all theory from the higher levels may be implemented in multiple different substrates. In regard to humans, this level is best represented by molecular and behavioral neuroscience, showing us how exactly single neurons function and how they can be affected through different kinds of stimulation (e.g. neuromodulation or in vivo electrical stimulation). Having evolved from ideas on the implementational level of analysis, emergent connectionist networks are a very successful approach in state-of-the-art AI (e.g. in the form of feedforward networks and recurrent networks).

Thus it appears that in the Brain and Cognitive Sciences we have the same approaches to knowledge acquisition as we have modeling approaches in AI: a top-down, knowledge-guided approach represented by cognitive science/psychology and a bottom-up, emergent approach represented by molecular neurobiology. Both approaches try to inform an intermediate level of understanding. This intermediate level yields the mechanisms by which neural and cognitive processing works.

In Marr's framework, every lower level should be considered a realization of its predecessor, meaning that, for example, the algorithmic level realizes the problem stated on the computational level. It is worth mentioning that this does not mean that insights from lower-level research cannot inform higher-level theories (for example, the discovery of grid cells in the human cortex changed the way we think about spatial memory and cognition [Moser et al., 2008]). Choosing the right level of analysis at which to conduct research has been a controversial subject for a long time. The protracted debate about the supposed superiority of one approach over another has seen no single winner. The opposite is the case: holding on to a single framework has not proven to be fruitful, and it is by now a wide consensus that a complete theory should be informed by all levels of analysis. As systems in artificial intelligence grow more and more sophisticated, this idea will become increasingly important in that discipline as well. For illustrative purposes, we would like to give two brief examples of contemporary modeling approaches ignoring this notion.

Our first example is Bayesian Programme Learning (BPL), which we already mentioned in the introduction. BPL is an algorithm inspired by cognitive science only and therefore resides on the computational level. It categorizes different handwritten characters from different alphabets by combining a set of primitives. These primitives are possible pen strokes that, when combined, make up a character. However, these primitives are fairly simple and few in number, which is why it is straightforward to provide an appropriate hypothesis space (set of primitives) for the Bayesian inference. As soon as the task becomes more complex, the primitives will have to become more abstract and greater in number. Finding such an appropriate hypothesis space and hand-crafting it into the system is not trivial. Therefore, a purely top-down oriented approach is not able to capture complexity as it is found in human information processing.

On the other hand, even though they were successful in the past and still are, the purely emergent connectionist networks guided by insights from the implementational level, like standard feedforward networks, pose the problems stated in the introduction: they lack the ability to generalize broadly and to learn sequentially. Focusing on the implementational level can produce powerful networks that perform extraordinarily well on single tasks, but these will most likely not be able to learn a task flexibly, adapt to changing task requirements or transfer knowledge from one domain to another. Thus it will not satisfy our quest for real general intelligence. A demonstration of this is how the idea of pure feedforward neural networks has recently been led ad absurdum by creating incredibly deep networks (e.g. ResNet-152). Those networks may yield better performances, but they need an unreasonably large amount of training, consuming computational power and data on an exaggerated scale, and are only possible by using clever hacks in the network architecture (skip connections in the case of ResNet). From a biological perspective these models lack all plausibility. When considering human vision, which is commonly modeled by these kinds of networks, such an excessive number of layers in the feedforward sweep of processing would lead to exaggerated perceptual delays in humans, since stage-to-stage processing time in the ventral visual stream is approximated at 10 ms per neural population (Panzeri et al. 2001). A biologically implemented ResNet would therefore be no match for human object recognition (which is estimated at 120 ms) in terms of efficiency. It thus seems unnecessary to maintain large numbers of expensive computational layers to reach sufficient performance for object recognition. In accordance with this, Serre et al. (2007) suggest that the depth of the human ventral visual pathway may be estimated at only about 10 processing stages.
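This biological-plausibility argument can be condensed into a back-of-the-envelope estimate using the figures cited above:

```latex
\text{feasible depth} \approx
\frac{\text{recognition time}}{\text{per-stage delay}}
= \frac{120\,\mathrm{ms}}{10\,\mathrm{ms\ per\ stage}}
\approx 12\ \text{stages}
\ \ll\ 152\ \text{layers (ResNet-152)}
```

which is consistent with Serre et al.'s estimate of roughly 10 processing stages and two orders of magnitude away from the depth of current very deep feedforward architectures.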

An algorithm providing a generally intelligent solution should therefore be informed by all levels of analysis, thereby providing flexibility paired with complexity. We will keep this in mind and see how it may also apply to a solution to the problem of catastrophic forgetting.

3.2. Neuroscientific Theory and Evidence

So far we have laid out the problem of catastrophic forgetting in machine learning and isolated the levels of description on which we are searching for a solution in neuroscience. Now we will look at evidence from the brain and cognitive sciences that we might profit from. Interestingly, even though humans do suffer from retroactive interference of newly acquired information with older memories (Barnes & Underwood 1959), this interference is never catastrophic. Since artificial neural networks are thought to function in a way similar to biological networks, the human neural system must have implemented countermeasures to overcome this sequential learning problem that have not been adopted by their highly simplified artificial counterparts. There are different ideas about what these countermeasures are.

3.2.1. Complementary Learning Systems Theory

The first idea we want to present here is a cognitive neuroscientific theory, which is informed by molecular neuroscience as well as by cognitive scientific ideas. Rooted in the ideas of Marr (1970, 1971) and Tulving (1985), the so-called complementary learning systems theory (CLS) was first formalized by McClelland et al. (1995). CLS might give an account of how catastrophic forgetting is avoided in humans. The theory proposes that human learning functions via two separate memory systems. The first system is the neocortex, which, as the name implies, is a very recent evolutionary acquisition shared only among mammals. It is responsible for all higher cognitive functions of mammals (Lodato & Arlotta 2015). To achieve the complexity of higher cognitive functions, the neocortex has to integrate ambiguous information over long time spans. It does this by slowly estimating the statistics of the individual's environment. The functionality of modern deep ANNs is mainly inspired by the neocortex. Due to their similarity in architecture, the neocortex and deep ANNs share a large set of properties, like their large capacity and their slow, statistical way of learning. Since the neocortex is a statistical learner, it integrates general knowledge (i.e. semantic knowledge) about the world that is no longer connected to the specific learning experience (i.e. it stores no episodic memory).

The storage of episodic information is achieved by the second of the two memory systems in CLS theory. It is located in the medial temporal lobe structure of the hippocampus. This system is thought to be a fast learner with very limited storage capacity. The hippocampus' main objective is to store episodic memories and preprocess them for later integration into the statistically learning neocortex. To be able to store specific events, the hippocampus has to orthogonalize incoming activation patterns to make them distinct from previous experiences, a process called pattern separation. Further, the hippocampus extracts regularities from these distinct experiences and then trains the neocortex in an interleaved fashion. This interleaved memory replay helps the neocortex in the learning process by reactivating cortical connections central to the memory. Replay is essential, since the slowly learning neocortex will hardly learn from a single exposure to an experience. Additionally, interleaving different memories and replaying them can also be a means to prevent catastrophic forgetting in the neocortex (McCloskey & Cohen 1989). This is similar to optimizing multiple tasks in parallel during the training of ANNs, where interleaving examples of different tasks can also help to overcome catastrophic forgetting. In support of the assumption that the hippocampus' interleaved replay is important to avoid catastrophic forgetting, McClelland et al. (1995) argue that in lower mammals lacking the hippocampal-neocortical division (and therewith complementary learning systems), catastrophic forgetting might actually take place. It is still an open question whether that is actually the case (French 1999). As mentioned before, the architecture and functionality of current ANNs can be seen as analogous to the human neocortex. The second memory system, the hippocampus, has no counterpart in most current AI systems. Since it seems to be important in preventing catastrophic forgetting, however, it might be a worthwhile additional module. To better understand how such a module might work, we should consider the inner dynamics of the hippocampus in more detail (see also figure 3).

The hippocampal circuitry consists mainly of the trisynaptic pathway or loop (TSP) and the monosynaptic pathway (MSP). The TSP is made up of the entorhinal cortex (ERC), the dentate gyrus (DG), CA3 and CA1. These neural populations are connected through forward connections, while CA3 additionally has recurrent autoconnections. The TSP is responsible for the encoding of new information and for pattern separation (the orthogonalization of single experiences) (Schapiro et al. 2016). The encoded and orthogonalized information is then stored in CA3 (Tulving, 1985). In its mechanism, this episodic memory buffer is similar to Hopfield networks (Wiskott et al. 2006; see Amit 1989 for an introduction to Hopfield networks as a neural circuit model).

Figure 3: The entorhinal cortex (EC) serves as both the input and the output module of the episodic memory buffer of the hippocampus. The trisynaptic pathway (green) comprises the dentate gyrus (DG) and cornu ammonis 3 and 1 (CA3, CA1). The DG orthogonalizes the input from the EC so that it can be stored in CA3 without overlap with prior experiences. The monosynaptic pathway (red) comprises CA1 and EC(output). CA1 extracts statistical regularities from the episodic memory buffer in CA3, which are necessary to subsequently train the slow, statistically learning neocortex.

The MSP, on the other hand, consisting of the ERC and CA1, is trained by CA3 in a statistical manner, similar to the neocortex in CLS theory. The generalized knowledge representation in the MSP is then used to train the neocortex via repetitive, interleaved memory replay. Replay predominantly takes place during low-activity phases (e.g. during slow-wave sleep, Stickgold 2005).
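The computational core of this replay scheme is easy to mimic in an ANN. The toy sketch below (a generic rehearsal setup under our own assumptions, not a model of the hippocampal circuit) interleaves stored examples of an earlier task with every batch of a new task, which is one common way to keep a slow statistical learner from overwriting old knowledge.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

# a small "episodic buffer" holding examples of the previously learned task A
buffer_x, buffer_y = torch.randn(64, 10), torch.randn(64, 1)
# a stream of new task-B experiences
new_x, new_y = torch.randn(256, 10), torch.randn(256, 1)

for step in range(0, 256, 32):
    xb, yb = new_x[step:step + 32], new_y[step:step + 32]
    idx = torch.randint(0, len(buffer_x), (32,))   # sample old memories
    x = torch.cat([xb, buffer_x[idx]])             # interleave old and new
    y = torch.cat([yb, buffer_y[idx]])
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()
```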

So far, CLS theory seems like a reasonable and parsimonious solution to our problem. However, there are three reasons that make it unlikely that the hippocampal-neocortical division, and therewith episodic memory replay, is the only mechanism contributing to the prevention of catastrophic forgetting in humans. Firstly, there are no reports of catastrophic forgetting in higher mammals, even when the hippocampal system is damaged. Lesion studies in animal models or case studies in humans with lesions due to strokes in the hippocampus should lead to conditions similar to catastrophic forgetting in neural network models: the biological brain would no longer be able to interleave its new learning experiences with older experiences. A lesioned hippocampus, however, leads to a related but different condition: medial temporal lobe amnesia (MTL amnesia) (Squire et al. 2004; Squire et al. 1991). In MTL amnesia, individuals suffer from a loss of episodic memory, what we described above as orthogonal, pattern-separated memories stored in the CA3 of the hippocampus. At the same time, patients have relatively unimpaired generalized semantic memory (Race et al. 2013). Since their general semantic memory is unimpaired, they seem not to suffer from catastrophic forgetting. This appears to speak against a central role of complementary learning systems in the prevention of catastrophic forgetting in humans, because the lack or malfunction of the hippocampus would leave the neocortex exposed to new experiences that are not interleaved with prior knowledge. However, one might argue that, lacking the memory replay unit, the neocortex of MTL amnesia patients is only affected by a new experience once, namely at the time at which the event actually takes place, as opposed to the exposure through multiple replays in healthy individuals. It may be that the neocortex is simply not stimulated enough to exhibit fundamental changes to its connections. While it is still stimulated by the original experience, there is no replay of this memory, which is essential for the slowly learning neocortex to efficiently alter its connectivity patterns. This would 'freeze' the knowledge stored in the neocortex, making it inaccessible for new information, but at the same time preventing it from losing older semantic knowledge. This is indeed the case in MTL-amnesic patients: while retrospective knowledge, manifested before the loss of the hippocampus, is relatively unimpaired, the acquisition of new knowledge almost comes to a standstill, with new factual information being learned only after long time intervals and many repetitions (Bayley & Squire 2002).

Another compelling objection to an exclusive role of the hippocampal system as a countermeasure to catastrophic forgetting is that all experiences an individual was ever confronted with would have to be saved within the capacity-limited hippocampal system and constantly be replayed, interleaved with new experience. We know that the hippocampus has relatively limited capacity. Additionally, the amount of memories that would need to be replayed would grow linearly with lifetime. Thus, at a certain age memory replay would become unfeasible. A solution to this issue would be so-called pseudorehearsal, introduced by Robins (1995). Pseudorehearsal works without access to all prior training data (in our case the memories of a lifetime), but creates its own training examples (pseudoitems) by passing random binary input into the network and using the output as new training examples for interleaved training. Pseudoitems created in this way are described by Robins as a kind of 'map' that is able to reproduce the original weight distribution. As an anecdotal side note from the authors, this intuitively makes sense, since sleep is the time during which memory replay is thought to predominantly take place (Stickgold 2005), and sleep co-occurs with the subjective experience of dreams, which often resemble a commingling of recent experiences and odd intermixes of past memories.
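In code, pseudorehearsal can be sketched as follows (a loose illustration of Robins' idea in a toy setting of our own choosing): random binary inputs are passed through the already-trained network, and the resulting input-output pairs stand in for stored memories during interleaved training.

```python
import torch
import torch.nn as nn

def make_pseudoitems(net, n_items=64, n_inputs=10):
    # pass random binary input through the already-trained network and record
    # its outputs; these pseudoitems act as a 'map' of the old weight configuration
    with torch.no_grad():
        pseudo_x = torch.randint(0, 2, (n_items, n_inputs)).float()
        pseudo_y = net(pseudo_x)
    return pseudo_x, pseudo_y

# illustrative network standing in for a learner trained on earlier tasks
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
pseudo_x, pseudo_y = make_pseudoitems(net)
# pseudo_x/pseudo_y can now be interleaved with new-task batches, exactly like
# stored memories in ordinary rehearsal, without keeping any original data
print(pseudo_x.shape, pseudo_y.shape)  # torch.Size([64, 10]) torch.Size([64, 1])
```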

Lastly, how do we keep the hippocampus itself free of catastrophic forgetting? It is itself a connectionist network and should suffer from the same problem of catastrophic forgetting that it is trying to avoid in the neocortex. An additional mechanism would be necessary to protect the hippocampus from suffering from catastrophic forgetting itself (Wiskott et al. 2006). Excitingly, such a mechanism actually exists and is laid out by the research of Wiskott et al. (2006). Adult neurogenesis, the generation of new neurons out of neural stem cells in the DG of the hippocampus, might be a countermeasure against catastrophic forgetting within the hippocampus itself. The DG of the hippocampus is one of only two regions within the brain that is capable of neurogenesis (Eriksson et al. 1998). We will consider the idea of neurogenesis in more depth in part 3.2.3 of this review.

3.2.2. Selective Constraints of neuroplasticity

The next neuroscientific insight we would like to present here is located on the molecular level. In the human central nervous system, a change of plasticity (in other words, the readiness of a synapse to change its connection strength) is able to either render connections (in ANNs: weights) in a network changeable or fix their status quo. By selectively increasing or decreasing plasticity, the brain is able to learn new tasks while conserving old skills and knowledge (Yang et al. 2009). Important here is that changes in plasticity are selective, which makes it possible to learn a new task while not overwriting existing skills. This is opposed to what happens in most ANNs during training: while the learning rate is often dynamic, meaning that it changes during the learning progress, it is applied globally over all connections and not adapted separately in different regions.
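A crude machine-learning analogue of such selectivity, offered purely for illustration and not as a claim about biology, is to modulate the effective learning rate per connection rather than globally, e.g. by scaling each weight's gradient with its own plasticity value before the update.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(10, 10)

# per-connection plasticity: 1.0 = fully plastic, 0.0 = frozen (illustrative values)
plasticity = torch.rand_like(layer.weight)

x, y = torch.randn(32, 10), torch.randn(32, 10)
F.mse_loss(layer(x), y).backward()

with torch.no_grad():
    # scale each weight's update by its own plasticity value instead of applying
    # one global learning rate uniformly to every connection
    layer.weight -= 0.05 * plasticity * layer.weight.grad
    layer.bias -= 0.05 * layer.bias.grad
```

Connections with plasticity near zero barely move and so retain what was learned before, while highly plastic connections remain free to encode the new task.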

When talking about plasticity changes in the human nervous system on a cellular level, dendritic spines are considered essential. In an interneuronal connection, spines are little protrusions on the post-synaptic neuron, which are formed at synaptic connections between neurons. Changes of connection strength between two neurons are due to morphological changes (remodulation) of the dendritic spines of the post-synaptic neuron (Yang et al. 2009). Additionally, increased plasticity can also lead to the formation of new spines, and therewith the formation of new interneural connections, or to the elimination of existing spines, and therewith the loss of connections. These changes, formations and eliminations of dendritic spines are the means to change the connection strength between two neurons and are thus the basis for learning.

To better understand how selective neuroplasticity comes about, we will take a look at the molecular basis of spine remodulation. The changes in the morphology of spines depend on N-methyl-D-aspartate (NMDA) receptor activity. Opposed to other excitatory receptors (e.g. AMPA in figure 4), NMDA receptors not only allow sodium (Na+) to enter the neuron, but also allow the influx of calcium ions (Ca2+). Ca2+ influx renders the cell morphology changeable and thus increases plasticity. If connection plasticity is elevated, connection strength can either be increased (potentiation) (Bliss & Lømo 1973) or decreased (depotentiation) (Ito 1989) as a consequence of activity. The kind of change occurring depends on the time interval between the synaptic activity and the Ca2+ spike. Thus, modulating the activity of NMDA receptors, and consequently the Ca2+ influx, changes the plasticity of the neural connection.

One means of the brain to influence the activity of NMDA receptors is the hormone somatostatin (SST). According to Pittaluga et al. (2000), the hormone is able to increase NMDA activity by releasing the Mg2+ block from the receptor. Normally, the Mg2+ block is only released due to high activity of the neuron and, consequently, its strong depolarization. Only the removal of the Mg2+ block enables Ca2+ to pass through the membrane into the cell. SST is distributed via interneurons, which are neurons that connect different neural circuits with each other without taking a primary function within either of them. To make learning without forgetting possible, it is important that SST-related plasticity changes can be selective for certain branches of a neuron. This is indeed the case: changes do not occur over all connections (i.e. branches) the neuron maintains, but only in branches that are relevant for the task that the cognitive agent is engaged in (Cichon & Gan 2015). When SST release is disrupted (e.g. in SST-interneuron-deleted mice), SST no longer influences NMDA receptor activity and the branch-specific plasticity of spine morphology is lost. As a result, the same branches show similar synaptic changes during the learning of different tasks, causing subsequent tasks to 'erase' memories of preceding tasks. This happens because new tasks alter the synapse strengths established for preceding tasks, which corresponds to what we know from ANNs with uniformly high plasticity. Cichon & Gan (2015) also provide evidence for this on the behavioural level: the aforementioned SST-interneuron-deleted mice do indeed exhibit catastrophic forgetting when learning two different tasks sequentially. This gives us causal evidence that selective changes in neuroplasticity can serve as a countermeasure to catastrophic forgetting when they are branch-specific.

Figure 4: (a) The NMDA receptor's Mg2+ block hampers calcium (Ca2+) ions from flowing into the neuron. In this case only sodium (Na+) ions will enter the neuron via the AMPA receptor on the right side of the dendritic spine, which may cause the cell to fire, but won't lead to a strengthening of the synaptic connection. (b) If the Mg2+ block is removed, via strong depolarization of the neuron (a high rate of activity) or somatostatin (SST) in the synaptic cleft, Ca2+ will enter the cell, resulting in changes of the cell metabolism. (c) The changes in cell metabolism due to the Ca2+ influx result in additional AMPA receptors being integrated into the neuron's membrane. A higher density of AMPA receptors increases the rate of Na+ ion influx upon stimulation, which makes the neuron more likely to fire.

A single SST interneuron directly targets a single other neuron, which we refer to as a homosynaptic interaction. Not all neural interaction is homosynaptic: there are other substances, referred to as neuromodulators, acting in a heterosynaptic fashion. Generally speaking, heterosynaptic neuromodulation means that a neurotransmitter released by a neuron does not only affect a single target neuron, but a whole population of neurons in close proximity. This happens when the neuromodulator is not only released into the targeted synaptic cleft, but also 'spills over' into the extracellular space (ECS), where it can diffuse and target other, previously uninvolved neurons nearby. Next to spillover, some neuromodulators are directly released into the ECS, for example classical neurotransmitters like dopamine (Descarries et al. 1996) and serotonin (De-Miguel & Trueta 2005). Additionally, there are also highly diffusible gaseous substances like nitric oxide (NO), carbon monoxide (CO) and hydrogen sulfide (H2S) (Wang 2002), which, since they are highly diffusible, have a greater area of effect. By diffusing through the ECS, neuromodulators are able to alter the plasticity of adjacent neurons as well, making them more prone to change their connection strength (or, vice versa, making them more stable). This localized change of plasticity might be a means to avoid catastrophic forgetting by rendering currently relevant parts of the neural network changeable while keeping the rest of the network stable, thereby antagonizing interference with old information. While the branch-specific, SST-induced changes to neuroplasticity act on a rather fine-grained level, neuromodulators are able to render larger portions of cortex more plastic; both, however, are able to influence memory interference.

3.2.3. Neurogenesis within the hippocampus

A third mechanism in the human nervous system that might be able to prevent catastrophic forgetting was already mentioned in our section about CLS and is located inside the hippocampus. As depicted before, an episodic memory unit that temporarily stores experiences in order to replay them and thereby train a slow statistical learner like the neocortex will itself suffer from catastrophic forgetting. Interestingly, there is another mechanism within the episodic memory buffer of the hippocampus that circumvents this problem. Prominently, the DG of the hippocampus is one of two cortical areas capable of adult neurogenesis (Altman & Das 1965, Gould & Gross 2002, Kempermann et al. 2004). Neurogenesis describes the constant production of nervous cells from neural stem cells. These newly generated neurons in the DG differ from older cells in that they exhibit a greater degree of synaptic plasticity (Schmidt-Hieber et al. 2004), form new connections to other neurons more easily (Gould & Gross 2002) and show greater mortality (apoptosis) (Eriksson et al. 1998). These properties draw the picture of a nervous cell that can easily be integrated into an existing neural circuit, but may also be easily obliterated when it does not prove useful. The extent of neurogenesis and cell survival is decreased by age (Altman & Das 1965) and aversive, stressful experiences (Gould & Tanapat 1999), and increased by diet (Lee et al. 2000), physical activity (van Praag et al. 1999) and enriched environments (Kempermann et al. 1998). But how do new neurons help to tackle catastrophic forgetting? French (1991) suggested that, within a large network, sparsity of the representations in the hidden layers is a means to reduce catastrophic forgetting, since representations will be localized and therefore not interfere with each other. This strategy, however, has the side effect of reducing generalization, since solutions are simply stored in parallel and nothing forces the network to generalize. Wiskott et al. (2006) complement this idea by suggesting that newly generated neurons open up the opportunity to learn new feature representations while, through the reduction of plasticity in old neurons, the possibility to remember older feature representations is preserved. In early life, people encounter many new environments with many new features, so the need for the DG to adapt is large. In later life, new environments mostly consist of recombinations of known features, which reduces the need to create new feature representations. This offers an intuitive account for the reduction of neurogenesis over an individual’s lifetime. The same goes for enriched environments: in a complex environment the capability to adapt and learn is more important than it is in an impoverished one, and higher rates of neurogenesis make this possible.
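To make French’s sparsity argument concrete, the following toy sketch (our own illustration, not code from any of the reviewed papers) compares how many hidden units two random binary representations share when they are dense versus sparse; fewer shared units means fewer shared weights, and therefore less interference when one representation is updated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 1000  # size of a hypothetical hidden layer

def random_code(active_fraction):
    """Random binary hidden representation with a given fraction of active units."""
    code = np.zeros(n_units)
    active = rng.choice(n_units, size=int(active_fraction * n_units), replace=False)
    code[active] = 1.0
    return code

for frac in (0.5, 0.05):  # dense code vs. sparse code
    a, b = random_code(frac), random_code(frac)
    shared = int(np.sum(a * b))  # number of units used by both representations
    print(f"active fraction {frac:.2f}: {shared} shared units")
```

With half of the units active, the two codes overlap heavily; with only 5% active, they barely overlap, which is the intuition behind sparsity reducing interference, and also behind why it reduces generalization.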

4. INTEGRATION OF NEUROSCIENTIFIC INSIGHT INTO MACHINE LEARNING

Having illustrated different approaches from the various disciplines of the Brain and Cognitive Sciences, we now take a look at the extent to which these ideas are already implemented in contemporary AI systems.

4.1. Using complementary learning systems: from the DQN-model to deep generative replay

The aforementioned CLS-theory has probably received the most attention in recent years when it comes to tackling the problem of catastrophic forgetting in connectionist networks. In their very influential approach to training a neural network architecture to control Atari 2600 games, Mnih et al. (2013, 2015) introduced a memory replay unit in the deep Q-network (DQN). In their attempt to make a DNN architecture learn from less data, Mnih et al. introduced a separate memory unit that saves prior experiences and replays them randomly, such that each experience is revisited up to eight times after it was initially encountered. Even though this idea contains much of what we think might help to overcome catastrophic forgetting, and corresponds functionally somewhat with the episodic memory buffer in CA3 of the hippocampus that we are also referring to, it exhibits some conceptual flaws that limit its capacity as a means against catastrophic forgetting (to be fair, overcoming catastrophic forgetting was not Mnih et al.’s intention here). As we explained in part 3.2.1, an episodic memory buffer that saves all recent experiences in a one-to-one manner, as in the DQN, is not feasible. An AI system with real general intelligence needs to learn continually over a long time period. An episodic memory buffer like that of the DQN would become exorbitantly large over time, and replaying memories sampled uniformly at random (every memory being equally likely) becomes a computationally heavy task. To reduce the amount of replay, Schaul et al. (2016) suggested prioritizing experiences that are still informative for learning (in practice those with a large temporal-difference error, i.e. those whose reward outcome was most surprising) over experiences that are less important for success. By doing so their model outperforms a uniformly sampling system given the same amount of training. This selective, reward-guided replay is biologically plausible (see Atherton et al. 2015; Hattori 2014). Even though this constraining of memory sampling is a step in the right direction, reducing the amount of memory that has to be replayed, over an agent’s lifetime still far too many memories would need to be stored.
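As a rough illustration of the kind of replay unit discussed above, the sketch below implements a minimal experience-replay buffer with uniform sampling. It is a simplified stand-in rather than the actual DQN or prioritized-replay code, and the capacity and batch size are arbitrary choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal episodic memory: stores transitions and replays random minibatches."""

    def __init__(self, capacity=100_000):
        # Oldest experiences are silently dropped once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling: every stored experience is equally likely to be replayed.
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))

# A learner would interleave acting and replaying, e.g.:
# buffer.store(s, a, r, s_next, done)
# batch = buffer.sample()   # train the network on this batch of old experiences
```

Prioritized replay, as in Schaul et al. (2016), would replace the uniform `sample` with sampling proportional to each transition’s temporal-difference error, but the buffer still grows with the agent’s lifetime, which is the limitation discussed here.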

In part 3.2.1 of this paper we presented Robins’ (1995) idea of pseudo-pattern replay. In pseudo-pattern replay there is no need to save actual memories in a one-to-one fashion; instead, the patterns that newly acquired memories are interleaved with are generated from the network’s prior weight distribution. Mocanu et al. (2016) pick up on this idea and describe the Online Contrastive Learning with Generative Replay (OCLGR) model, which uses generative Restricted Boltzmann Machines (gRBMs) to store past experiences. By storing past experiences in gRBMs, the need to save them explicitly (in a one-to-one fashion) becomes obsolete. Using this idea, OCLGR outperforms regular experience-replay models and adds biological plausibility to the approach by substantially reducing the memory requirements. Generative replay is applicable to all common types of machine learning (reinforcement, supervised and unsupervised learning). However, Mocanu et al. do not evaluate their model on its ability to cope with catastrophic forgetting. Just recently, Shin et al. (2017) tested to what extent a generative replay model may help to overcome catastrophic forgetting on the MNIST dataset (LeCun et al. 2010). Their results imply that generative replay is compatible with other contemporary countermeasures (e.g. elastic weight consolidation [EWC], Kirkpatrick et al. 2017; learning without forgetting [LwF], Li & Hoiem 2017). Additionally, they state that their approach is superior to weight-constraining approaches like EWC and LwF, since there is no trade-off between performance on the old and the new task. We will explain weight-constraining approaches in the following part 4.2 in more detail.
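The core loop of generative replay can be summarised in a few lines. The sketch below is a schematic outline under the assumption that we have some trainable generator (e.g. a gRBM, or a deep generative model as in Shin et al.’s setup) and a task solver; all object interfaces and names are placeholders of ours, not the authors’ code.

```python
import random

def train_with_generative_replay(solver, generator, new_task_data,
                                 old_generator=None, old_solver=None,
                                 replay_fraction=0.5):
    """Schematic generative-replay loop: earlier tasks are rehearsed through generated
    pseudo-samples instead of stored memories. All object APIs are placeholders."""
    for real_batch in new_task_data:                 # real_batch: list of (input, target) pairs
        batch = list(real_batch)
        if old_generator is not None and old_solver is not None:
            n_pseudo = int(len(real_batch) * replay_fraction)
            pseudo_inputs = old_generator.sample(n_pseudo)      # assumed generator interface
            pseudo_targets = old_solver.predict(pseudo_inputs)  # frozen old solver labels them
            batch += list(zip(pseudo_inputs, pseudo_targets))
        random.shuffle(batch)                        # interleave old and new experiences
        solver.train_step(batch)                     # learn the new task while rehearsing old ones
        generator.train_step(batch)                  # generator learns to reproduce old + new data
    return solver, generator
```

After each task, frozen copies of the solver and generator take over the roles of `old_solver` and `old_generator` for the next task, so no raw experiences ever need to be stored.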

4.2. Constraining weight plasticity within the network

Another approach to tackle catastrophic forgetting in connectionist networks is the selective constraining of weights. As explained in section 2, catastrophic forgetting in neural networks arises when the plasticity of connections needed for a first task A remains high during the training of a subsequent task B. When plasticity is high, the information for task A will be forgotten, since the weights holding this information adapt to task B. On the other hand, when the plasticity of the weights is constrained globally, the network loses its ability to learn (see the plasticity-stability dilemma; Carpenter & Grossberg 1987, cited in Gerstner & Kistler 2002). Current approaches to prevent catastrophic forgetting therefore try to constrain weights selectively, such that weights necessary for task A are protected during learning of task B and vice versa. Several models implement this, based on the previously mentioned neurobiological insights.

The first of these implementations is elastic weight consolidation (EWC) (Kirkpatrick et al. 2017). EWC is inspired by ideas at the molecular neurobiological level: SST-expressing interneurons are able to selectively constrain plasticity on certain branches of a cortical neuron, while leaving it intact for other branches of the very same neuron (Yang et al. 2009). The branch-wise constraints are functionally related: when a branch is necessary for task A, its plasticity is unconstrained during learning of task A and constrained during task B, and vice versa. Kirkpatrick et al. (2017) take this idea of selectively constrained plasticity and apply it to DNNs. However, while taking biology as an inspiration, they do not try to model the underlying mechanics; instead they use a Bayesian approximation to determine the importance of individual connections (i.e. weights) for the current task. When the network is trained on a following task, the algorithm uses this approximation to determine how important the different weights were for the previously solved tasks and puts constraints on them, so that they remain relatively unchanged during backpropagation and subsequent weight updating. The trajectory through weight space changes accordingly (see figure 5).
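Concretely, the Bayesian approximation reduces to a quadratic penalty that anchors each weight to its value after task A, scaled by a diagonal approximation of the Fisher information, which estimates how important that weight was: L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_A,i)². The sketch below illustrates this penalty; the array names and the value of λ are our own simplification, not the authors’ implementation.

```python
import numpy as np

def estimate_fisher_diag(per_example_grads):
    """Diagonal Fisher approximation: mean of squared log-likelihood gradients,
    computed on data from task A after training on it."""
    return np.mean(np.square(per_example_grads), axis=0)

def ewc_penalty(weights, anchor_weights, fisher_diag, lam=1000.0):
    """Quadratic EWC penalty: pulls each weight back towards its post-task-A value,
    in proportion to how important (high-Fisher) that weight was for task A."""
    return 0.5 * lam * np.sum(fisher_diag * (weights - anchor_weights) ** 2)

# While training on task B, the total objective would be:
#   total_loss = task_b_loss(weights) + ewc_penalty(weights, weights_after_task_a, fisher_diag)
```

Weights with near-zero Fisher values remain free to change for task B, which is the selective constraining described above.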


Figure 5: Similar to Figure 1, a two-weight system is trained sequentially on two different tasks (A and B). When trained on task B in an unconstrained manner, the system migrates towards a solution of task B while neglecting prior knowledge of task A (see bottom trajectory). If the weights are constrained by elastic weight consolidation, the system is constantly ’pulled back’ towards the solution of task A while migrating through weight space. Ultimately, it converges on a weight combination that solves both tasks satisfactorily (if such a solution exists).

Velez & Clune (2017) take their inspiration from the way neuromodulation in the human cortex is thought to work. Part 3.2.2 explained how neuromodulators spread locally within the human cortex and affect the plasticity of neural connections in their range. Velez & Clune (2017) translate this by locally modulating the plasticity of connection weights through the spread of an ’artificial neuromodulator’ within their ANNs. This artificial neuromodulator selectively increases the plasticity of the weights around a diffusion node. For different tasks, different diffusion nodes are activated during training, so the network creates local functional clusters while the rest of the network remains relatively unaffected. Their model has only been validated on a very small, primitive network; verification of its capabilities on large-scale state-of-the-art architectures remains to be provided. The main difference between these two weight-constraining approaches is that, on the one hand, the diffusion-based implementation of Velez and Clune has greater biological plausibility than Kirkpatrick et al.’s EWC. On the other hand, EWC directly targets the functionality of certain weights, while the diffusion-based approach lets functionality emerge within predetermined local clusters around the diffusion nodes. In this regard EWC might be superior to the diffusion implementation, since the functional diffusion clusters are not as flexible in their scope as the EWC constraints are. The scope of the EWC weight constraints is determined only by how many weights the task actually needs and is not handcrafted as in the diffusion-based model. This makes EWC the more elegant and flexible solution. In both weight-constraining approaches the functional structure of the network is relatively fixed: once a weight combination that maximizes the performance on one task is found, it is protected against change. This makes the network overall less flexible and limits its capacity to a certain number of tasks. It is also possible that the network does not find generalized solutions, since the approach minimizes overlap between the representations of different tasks, not allowing the more parsimonious, flexible solution that might be found when both tasks are optimized in parallel.
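To illustrate the diffusion idea, the sketch below gates per-unit learning rates by the distance of each hidden unit to a task-specific diffusion node, so that only weights close to the active node remain highly plastic. This is our schematic reconstruction of the general mechanism, with made-up unit coordinates and decay constants, not Velez & Clune’s actual implementation.

```python
import numpy as np

def modulated_learning_rates(unit_positions, diffusion_node, base_lr=0.1, decay=2.0):
    """Learning rate of each unit decays with its distance from the active diffusion node,
    so only weights near that node remain highly plastic during the current task."""
    distances = np.linalg.norm(unit_positions - diffusion_node, axis=1)
    return base_lr * np.exp(-decay * distances)

# Hypothetical 2-D layout of 20 hidden units and one diffusion node per task.
rng = np.random.default_rng(0)
units = rng.random((20, 2))
task_a_node = np.array([0.2, 0.8])
task_b_node = np.array([0.8, 0.2])
lr_task_a = modulated_learning_rates(units, task_a_node)  # cluster near (0.2, 0.8) stays plastic
lr_task_b = modulated_learning_rates(units, task_b_node)  # a different cluster serves task B
```

The handcrafted part criticised in the discussion corresponds to the fixed node positions and the decay constant: they determine the size and location of the functional clusters before any data are seen.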

5. DISCUSSION

In this literature review we took a closer look at current developments in AI research. We found the field prospering, especially in recent years: in several important machine-learning applications, performance benchmarks have been pushed to reach or even surpass human-level performance. However, recent performance achievements were often built on large amounts of data and computational power, and the resulting systems often lack the ability to generalize the acquired skills to related or slightly changed tasks. We argued that this ability is central to acquiring true intelligence. One obstacle on the way to more generalizing, intelligent systems is catastrophic forgetting in connectionist networks. Years of approaching the problem with purely mathematical insight did not resolve the issue satisfactorily. As a consequence, researchers turned to neuroscience to draw inspiration from human cognitive agents, who do not suffer from catastrophic forgetting. We introduced Marr’s (1982) levels of analysis to help us better understand the neuroscientific research we encounter. Here we emphasized that, just like in neuroscientific research, where complete theories are always informed by all levels of analysis, an algorithm for a truly intelligent neuroscience-inspired AI system should likewise be informed by all levels of analysis. Hereafter, we brought up complementary learning systems (CLS) theory and constrained neuroplasticity, two of the main ideas in contemporary neuroscience about how catastrophic forgetting is avoided in humans. In addition, we briefly explained how neurogenesis in the hippocampus might help prevent catastrophic forgetting as well. Finally, we surveyed recent implementations of these ideas in AI and the advantages and shortcomings of the different approaches. In doing so, we presented different realisations of CLS with an emphasis on the most promising and recent ’deep generative replay’ approach (Shin et al. 2017), which utilizes two complementary learning systems, just like humans do, to interleave current and past experiences.

We further depicted two realizations of the molecular neuroscientific idea of selectively constrained neuroplasticity: the elastic weight consolidation (EWC) of Kirkpatrick et al. (2017) and the diffusion-based approach of Velez & Clune (2017).

Both of the main approaches using constrained neuroplasticity have their shortcomings. EWC is only able to change plasticity in one direction: from plastic to stable. As soon as the network’s capacity is reached and the system is saturated, no new information can be learned and a blackout catastrophe, a phenomenon known from saturated Hopfield networks (Amit 1989), may occur. A blackout catastrophe renders the information in the network unretrievable. To learn continually, a cognitive agent has to be able to selectively forget information to prevent such a blackout.
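For orientation, the classical statistical-mechanics analysis of Hopfield networks with Hebbian storage (the result usually attributed to Amit, Gutfreund and Sompolinsky and summarised in Amit 1989) places the critical storage load at roughly P_max ≈ 0.138 N patterns for a network of N units; beyond this load, retrieval of essentially all stored patterns breaks down at once, which is the blackout catastrophe referred to above.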

On the other hand, in the diffusion-based approach the number and size of the functional clusters are handcrafted into the system. Handcrafting limits the range of applicability of the system. To obtain a model that can function without these top-down decisions, the model would have to learn the scale of the clusters from data. The question remains which mechanism might be able to supervise the assignment and size of diffusion nodes within an ANN. One possible candidate for this might be the human basal ganglia (Alexander & Crutcher 1990). The basal ganglia (especially the ventral striatum) are central to human reward prediction and processing (e.g. Schultz et al. 1992). The basal ganglia appear to maintain neuromodulatory projections to the cortex (Alcaro et al. 2007, Graybiel 1990), and, as stated in part 3.2.2, dopamine, whose neurons are relatively rare and mainly located in the basal ganglia (Bjoerklund & Dunnett 2007), can serve as a potent neuromodulator (Descarries et al. 1996). This might provide a natural connection to currently popular reinforcement learning algorithms in AI (like the previously mentioned DQN, Mnih et al. 2013; for an introduction to reinforcement learning see Sutton & Barto 1998).

As converging evidence in neuroscience shows, catastrophic forgetting in humans seems to be overcome not by a single mechanism but by multiple mechanisms on different levels. While there is direct causal evidence in animal models for the importance of constraints on branch-specific neuroplasticity for the avoidance of catastrophic forgetting (Cichon & Gan 2015), there is also broad evidence for the relevance of complementary learning systems in human learning and the interleaved fashion in which the hippocampus trains the slowly learning neocortex (Kumaran et al. 2016). To overcome catastrophic forgetting in a way equivalent to humans, developers of connectionist AI systems might need to integrate these separately developed frameworks into an all-embracing solution. Such a solution will not be straightforward, but the prospect of understanding and preventing catastrophic forgetting in a more complete fashion might be worth the while. In our outlook, we see importance in the connection of reinforcement processing and selective changes in neuroplasticity, which might be facilitated through flexibly acting neuromodulatory nodes. An additional episodic memory replay unit that creates pseudo-patterns to replay recent experiences in an interleaved manner can not only help to consolidate recent memories within the neocortex (as intended by DQNs), but also bind the network to earlier learned tasks and prevent catastrophic forgetting. A next step might be to create a memory replay unit that more closely resembles the inner dynamics of the human hippocampus. A potential candidate for matching artificial memory replay units more closely to the human hippocampus would be the REMERGE model (Kumaran & McClelland 2012), which models hippocampal encoding, memory orthogonalization and retrieval in a down-scaled fashion. When the episodic memory replay unit becomes more complex by more closely resembling its biological archetype (the hippocampus), it might be necessary to also mimic the internal hippocampal process of adult neurogenesis, to protect the new module from forgetting catastrophically as well.

Even though the envisioned integration of the different neuroscience-inspired solutions to catastrophic forgetting poses a big challenge, the prospects it has to offer are compelling enough to shoulder the effort. Deeper analysis of the ideas collected here is necessary to find
