Evolving plasticity for autonomous learning under changing environmental conditions


A. Yaman

a.yaman@tue.nl

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, 5612 AP, the Netherlands

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea

G. Iacca

giovanni.iacca@unitn.it

Department of Information Engineering and Computer Science, University of Trento, Trento, 38122, Italy

D. C. Mocanu

d.c.mocanu@tue.nl

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, 5612 AP, the Netherlands

Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede, 7522NB, the Netherlands

M. Coler

m.coler@rug.nl

Campus Fryslân, University of Groningen, Leeuwarden, 8911 AE, the Netherlands

G. Fletcher

g.h.l.fletcher@tue.nl

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, 5612 AP, the Netherlands

M. Pechenizkiy

m.pechenizkiy@tue.nl

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, 5612 AP, the Netherlands

Abstract

A fundamental aspect of learning in biological neural networks is the plasticity property, which allows them to modify their configurations during their lifetime. Hebbian learning is a biologically plausible mechanism for modeling the plasticity property in artificial neural networks (ANNs), based on the local interactions of neurons. However, the emergence of a coherent global learning behavior from Hebbian-based local plasticity rules is not very well understood. In this work, we employ genetic algorithms to discover local learning rules that produce autonomous learning. We use discrete plasticity rules to perform synaptic changes based on binary neuron activations. The simplicity and discrete nature of the evolved plasticity rules can provide insights into the learning behavior of the networks. We demonstrate the learning and adaptation capabilities of the ANNs modified by the evolved plasticity rules on two separate tasks (foraging and prey-predator) in continuous learning settings. Our results show that the evolved plasticity rules are highly efficient at adapting the ANNs to tasks under changing environmental conditions.

Keywords

Hebbian learning, synaptic plasticity, continuous learning, evolving plastic networks, evolution of learning.

1 Introduction

In the past few decades, a broad area of research in nature-inspired hardware and software design (De Castro, 2006; Sipper et al., 1997) has been stimulated by the study of the evolutionary, developmental and learning processes that allowed biological organisms to adapt to their environment. In particular, artificial neural networks (ANNs) have proved to be a successful, yet simplified, formalization of the information processing capability of biological neural networks (BNNs) (Rumelhart et al., 1986).

Inspired by the evolutionary process of biological systems, the research field known as Neuroevolution (NE) employs evolutionary computing (EC) approaches to optimize ANNs (Floreano et al., 2008; Yao, 1999). Adopting the terminology from biology, a population of individuals is represented as genotypes, which encode the parameters (i.e. topology, weights and/or the learning approaches) of the ANNs. Biologically inspired operators, namely selection, crossover and mutation, are iteratively applied to generate new individuals and find ANNs that are better adapted to the task at hand (Goldberg, 1989).

One key aspect in NE is the encoding of the ANNs. This, in turn, influences the so-called genotype-phenotype mapping, i.e. the way a given genotype is used to build a certain phenotype. In direct encoding, the parameters of the networks (mainly weights and/or topology) are directly encoded into the genotype of the individuals and optimized to solve the task (Yaman et al., 2018; Mocanu et al., 2018). However, these parameters usually remain fixed during the network's lifetime, such that it cannot adapt if the environment changes. In indirect encoding, on the other hand, some kinds of rules for the network's development and/or training are optimized (Mouret and Tonelli, 2014; Nolfi et al., 1994). For instance, the plasticity property observed in BNNs has been modelled in various works (Yaman et al., 2019; Soltoggio et al., 2018, 2008; Kowaliw et al., 2014) to obtain evolving plastic artificial neural networks (EPANNs).

The traditional plasticity model is based on a biologically plausible mechanism known as Hebbian learning (Hebb, 1949; Kuriscak et al., 2015), which performs synaptic adjustments using plasticity rules based on the local activations of neurons. One important limitation of the basic form of Hebbian learning is its instability, due to the possibly indefinite increase/decrease of the synaptic weights (Vasilkoski et al., 2011). To overcome this limitation, several variants of Hebbian learning have been proposed which stabilize the learning process in various ways (Brown et al., 1990; Vasilkoski et al., 2011; Sejnowski and Tesauro, 1989). Nevertheless, these improved plasticity rules may still require further optimization to properly capture the dynamics needed for adjusting the network parameters to a given task.

A number of works used EC to optimize plasticity in ANNs to achieve lifetime learning (Coleman and Blair, 2012). Some of these works optimize the parameters of the Hebbian learning rules (Floreano and Urzelai, 2000; Niv et al., 2002). Some other works replace the Hebbian rules with evolving complex functions, such as using an additional ANN which determines the synaptic changes (Orchard and Wang, 2016; Risi and Stanley, 2010). Others use a mechanism known as neuromodulation, in which the ANNs include specialized neurons that are used to signal the synaptic changes between other neurons (Runarsson and Jonsson, 2000; Soltoggio et al., 2008).

Despite the relatively vast literature on the use of EC to discover and optimize plasticity models, we believe that those previous models are either too simple to obtain actual adaptation, or too complex to understand the evolved learning behavior. In fact, although complex plasticity models can provide a solution to certain learning problems, their complexity may prevent gaining insights into the learning behavior of the networks.¹

¹This is the case of some of the existing works where the initial synaptic weights and/or the connectivity of the networks is evolved in addition to the plasticity rules (Orchard and Wang, 2016; Soltoggio et al., 2008). As for the initial synaptic weights, evolving them obviously increases the number of parameters. Furthermore, this aspect can be decoupled from the evolution of plasticity rules per se. Likewise, evolving the connectivity of the networks may overfit the networks to a certain task, making it difficult to evolve an actually adaptive behavior.

In this work, we use binary neuron activations to reduce the number of pairwise activation states of neurons, and evolve discrete plasticity rules to perform synaptic changes based on pairwise neuron activation states and reinforcement signals. An important advantage of our model is that, due to the discrete nature of these rules, it is possible to interpret the learning behavior of the networks. We use networks consisting of one hidden layer, where we introduce local weight competition to allow self-organized adaptation of synaptic weights. We demonstrate the lifetime learning and adaptation capabilities of plastic ANNs on two separate tasks, namely a foraging and a prey-predator scenario. In both scenarios, starting from a randomly initialized ANN, an agent is required to learn to perform certain actions during its lifetime, in a continuous learning setting. We then show that: (1) the evolved synaptic plasticity (ESP) rules we obtain with our model are capable of producing stable learning and adaptation capabilities; (2) the ESP rules are intelligible, as they can be easily interpreted and linked to the task at hand.² Finally, we compare our results with the Hill Climbing (HC) algorithm (De Castro, 2006).

²As will become clear in Section 3, we distinguish two levels of complexity, which are intertwined: the first level concerns the task, the second concerns the optimization of the ANNs using the ESP rules. Even in simpler tasks, optimizing the parameters of the ANNs can be challenging for plasticity rules, since the process is affected by the size of the network: larger networks involve a larger number of parameters and pairwise interactions, which can make them harder to optimize.

The rest of this paper is organized as follows: in Section 2, we provide background knowledge on Hebbian learning and introduce our approach to represent and evolve Hebbian-based synaptic plasticity rules; in Section 3, we discuss our experimental setup and introduce the foraging and prey-predator tasks; in Section 4, we provide the results of the proposed approach and compare it with the Hill Climbing algorithm; finally, in Section 5 we summarize our conclusions and discuss future work.

2 Evolution of Plasticity Rules

In its general form, Hebbian learning is formulated as:

$$w_{i,j}(t + 1) = w_{i,j}(t) + \Delta w_{i,j} \quad (1)$$

$$\Delta w_{i,j} = f(a_i, a_j, m) = \eta \cdot a_i \cdot a_j \cdot m \quad (2)$$

where the synaptic efficiency $w_{i,j}$ at time $t + 1$ is updated by the change $\Delta w_{i,j}$, which is a function of the pre- and post-synaptic activations, $a_i$ and $a_j$, and a modulatory signal, $m$. This function is considered the same for all the weights in the network, i.e. $f(a_i, a_j, m)$ does not depend on $i$ and $j$.

The plain Hebbian rule, given in Equation 2, strengthens the synaptic efficiency $w_{i,j}$ when the signs of $a_i$ and $a_j$ are positively correlated, weakens it when the signs are negatively correlated, and keeps it stable when at least one of the two activations is zero (Brown et al., 1990). The modulatory signal, $m$, is used to determine the sign of the Hebbian rule. When $m$ is positive, plain Hebbian learning is performed. However, when $m$ is negative, the sign of the learning rule is reversed (this is also known as anti-Hebbian learning), i.e. the synaptic efficiency is strengthened if the activations of the neurons are negatively correlated, and weakened if they are positively correlated. Additionally, a constant $\eta$ is used as a learning rate to scale the magnitude of the synaptic change.

In this work, we use the step function (see Appendix A) to binarize the neuron activations, i.e. we apply Equations 1-2 with binary values for $a_i$ and $a_j$. In addition to this, we set the modulatory signal, $m$, as:

$$m = \begin{cases} +1, & \text{if desired output (reward);} \\ -1, & \text{if undesired output (punishment);} \\ 0, & \text{otherwise (neutral).} \end{cases} \quad (3)$$

In other words, we use $m$ as a reinforcement signal to indicate how the network is performing with respect to the task. Therefore, $m = 1$ ($m = -1$) indicates that the current configuration of the network is producing a desired (undesired) behavior, and Hebbian (anti-Hebbian) learning should be used to increase (reduce) the synaptic weights and promote (avoid) producing this behavior in the same situation. The desired and undesired outcomes are defined by a reward function which depends on the task, according to the desired/undesired associations between sensory inputs and behavioral outputs. The reward functions we used in our experiments are discussed in Section 3. Furthermore, for each neuron $i$ we scale the weights after each synaptic change (i.e., after Equations 1-2), as follows:

$$w'_{i,j}(t+1) = \frac{w_{i,j}(t+1)}{\|\mathbf{w}_i(t+1)\|_2} \quad (4)$$

where the vector $\mathbf{w}_i(t+1)$ encodes all the incoming weights to a post-synaptic neuron $i$. This normalization process prevents indefinite synaptic growth, and helps connections to specialize by introducing local synaptic competition, a concept observed also in biological neural networks (El-Boustani et al., 2018).
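As a concrete illustration of Equations 1-4, the following minimal sketch applies one modulated Hebbian update to the incoming weights of a single post-synaptic neuron and then rescales them; it uses the plain rule of Equation 2 rather than the evolved discrete rules introduced below, and all names (e.g. `hebbian_update`) are ours, not the authors'.

```python
import numpy as np

def hebbian_update(w_in, pre, post, m, eta=0.01):
    """One modulated Hebbian step (Eqs. 1-2) followed by L2 scaling (Eq. 4).

    w_in : incoming weight vector of one post-synaptic neuron
    pre  : binary activations of the pre-synaptic neurons (same length as w_in)
    post : binary activation of the post-synaptic neuron (0 or 1)
    m    : modulatory/reinforcement signal in {-1, 0, +1}
    """
    delta = eta * pre * post * m          # Eq. 2 with binary activations
    w_new = w_in + delta                  # Eq. 1
    norm = np.linalg.norm(w_new)          # Eq. 4: local synaptic competition
    return w_new / norm if norm > 0 else w_new

# Minimal usage example with made-up values.
w = np.random.uniform(-1, 1, size=4)
pre = np.array([1, 0, 1, 1])
w = hebbian_update(w, pre, post=1, m=-1, eta=0.05)
```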

To optimize the plasticity rules $\Delta w_{i,j} = f(a_i, a_j, m)$, we employ a Genetic Algorithm (GA) (Goldberg, 1989). Figure 1 illustrates the genotype of the individuals, where we encode the learning rate $\eta \in [0, 1)$, and one of three possible outcomes $\{-1, 0, 1\}$ (corresponding to decrease, stable, and increase, respectively) for each plasticity rule. Since we use binary activations for neurons, there are 4 possible activation combinations of $a_i$ and $a_j$. Furthermore, we take into account only positive and negative reinforcement signals, ignoring the case $m = 0$, since the synaptic change is performed only when $m = +1$ or $m = -1$. Consequently, there are 8 possible input combinations, and therefore 8 possible plasticity rules defined by $f(a_i, a_j, m)$. The size of the search space of the plasticity rules is then $3^8$, excluding the real-valued learning rate $\eta$.
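For illustration, a genotype of this form can be decoded into a lookup table from $(a_i, a_j, m)$ to a synaptic change in $\{-1, 0, 1\}$; the sketch below follows the layout of Figure 1, but the exact ordering of the eight discrete entries is our assumption, not code from the paper.

```python
import numpy as np

def decode_individual(x):
    """Decode a 9-dimensional genotype: x[0] is the learning rate eta in [0, 1),
    x[1:9] are the discrete outcomes in {-1, 0, 1} for the 8 states of (a_i, a_j, m).
    The (a_i, a_j, m) ordering below is assumed for illustration."""
    eta = float(x[0])
    states = [(ai, aj, m) for m in (-1, 1) for ai in (0, 1) for aj in (0, 1)]
    rule_table = {state: int(outcome) for state, outcome in zip(states, x[1:])}
    return eta, rule_table

# Example: a random individual, as in the GA initialization.
rng = np.random.default_rng(42)
individual = np.concatenate([rng.uniform(0, 1, size=1),
                             rng.choice([-1, 0, 1], size=8)])
eta, rule_table = decode_individual(individual)
print(eta, rule_table[(1, 0, -1)])   # outcome applied when a_i=1, a_j=0, m=-1
```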

In the initialization step, we randomly initialize a population of 9-dimensional individual vectors $\mathbf{x}_k$, where $k = 1, \ldots, N$ (in our experiments, $N$ was set to 30), to encode the synaptic update rules. Each dimension of the individuals is uniformly sampled within its domain, depending on its data type (real-valued or discrete).

The evaluation process of an individual starts with the initialization of an agent with a random ANN configuration. Here, we use fixed-topology, fully connected feed-forward neural networks, with connection weights sampled (as real values) from a uniform distribution in $[-1, 1]$. The agent is allowed to interact with the environment by performing actions based on the output of its controlling ANN. After each action step, the weights of the ANN are changed based on a synaptic update table. This table is constructed by converting the vector representation of the individual plasticity rules, and specifies how synaptic weights are modified based on the pre-synaptic, post-synaptic and reinforcement signals (see Figure 1). The evaluation process of the GA is based on a continuous learning setting, where the weights of the ANNs are updated constantly.

Figure 1: Genotype representation and agent-environment interaction. The genotypes of the individuals encode the learning rate and the synaptic update outcomes for the 8 possible states of $a_i$, $a_j$ and $m$. The agent (depicted in red) interacts, for a given number of steps, with a (task-specific) environment. An artificial neural network is used to control the actions of the agent. The initial weights of the ANN are randomly initialized, and are changed after each action step based on the ESP rules and a reinforcement signal received from the environment.

We use an elitist roulette wheel selection operator to determine the individuals for the next generation: the top 10 best individuals are copied to the next generation without any change; the remaining individuals are generated from fitness-proportionate selected parents using uniform crossover with a probability of 0.5. After crossover, we perturb these individuals by applying a Gaussian mutation $\mathcal{N}(0, 0.1)$ to the real-valued component $\eta$, with probability 1.0, and re-sampling the 8 discrete components with a probability of 0.15. The evolutionary process is executed until there is no more improvement in terms of best fitness for 30 generations.
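A hedged sketch of this generational step is given below (elitism of the top 10, fitness-proportionate selection, uniform crossover with probability 0.5, Gaussian mutation of $\eta$, and re-sampling of the discrete genes with probability 0.15); the helper is illustrative and not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_generation(pop, fitness, n_elites=10, p_resample=0.15):
    """pop: (N, 9) array of genotypes; fitness: (N,) array (higher is better)."""
    order = np.argsort(fitness)[::-1]
    new_pop = [pop[i].copy() for i in order[:n_elites]]          # elitism

    shifted = fitness - fitness.min() + 1e-9                     # make weights positive
    probs = shifted / shifted.sum()                              # roulette wheel
    while len(new_pop) < len(pop):
        p1, p2 = pop[rng.choice(len(pop), size=2, p=probs)]
        mask = rng.random(9) < 0.5                               # uniform crossover
        child = np.where(mask, p1, p2)
        child[0] = np.clip(child[0] + rng.normal(0, 0.1), 0, 1 - 1e-9)  # mutate eta
        resample = rng.random(8) < p_resample                    # re-sample discrete genes
        child[1:] = np.where(resample, rng.choice([-1, 0, 1], size=8), child[1:])
        new_pop.append(child)
    return np.array(new_pop)
```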

3 Experimental Setup

We test the learning and adaptation capabilities of the plastic ANNs with ESP rules on two agent-based tasks within reinforcement learning settings: a foraging and a prey-predator scenario. We discuss the specifics of these tasks in the following sections.

3.1 Foraging Task

In the foraging task, inspired by Soltoggio and Stanley (2012), an artificial agent is required to learn to navigate within an enclosed environment and collect/avoid the correct types of items in the environment.

A visualization of the simulated environment is provided in Figure 2a. The environment consists of 100 × 100 grid cells enclosed by a wall. The agent, shown in red, has a direction to indicate its orientation on the grid. Two types of food items (50 green and 50 blue) are randomly placed at fixed locations on the grid. The agent has sensors that can take inputs from the nearest cell on the left, in front, and on the right.

Figure 2: (a) Simulation environment used in the foraging task experiments. The locations of the agent, the two types of items and the wall are depicted in red, green, blue and black, respectively. (b) Agent's position/direction (in red) and sensory inputs from the left, front and right cells (in gray) within a foraging environment, and the ANN controller, with 1 hidden layer, 6 inputs (2 per cell) and 3 outputs (Left/Straight/Right).

There are four possible states for each cell (empty, wall, green, blue), therefore we represent the sensor reading of each cell with two bits, as [(0, 0), (1, 1), (1, 0), (0, 1)]. The agent performs one of three possible actions (“Left”, “Straight”, “Right”) based on the output of its ANN. The output neuron with the maximum activation value is selected to be the action of the agent. In the cases of “Left” and “Right” the agent’s direction is changed accordingly, and the agent is moved one cell in the corresponding direction. In the case of “Straight”, the direction of the agent is kept constant and the agent is moved one cell along its original direction.
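The two-bit encoding and the arg-max action selection described above can be written as a short helper; the sketch below is illustrative (the assignment of the left/front/right cells to input positions is an assumption, not taken from the paper).

```python
import numpy as np

# Two-bit encoding for a cell's content, in the order given in the text.
CELL_CODE = {"empty": (0, 0), "wall": (1, 1), "green": (1, 0), "blue": (0, 1)}
ACTIONS = ["Left", "Straight", "Right"]

def encode_sensors(left, front, right):
    """Return the 6-dimensional binary input vector for the ANN."""
    bits = CELL_CODE[left] + CELL_CODE[front] + CELL_CODE[right]
    return np.array(bits)

def select_action(output_activations):
    """Pick the action of the output neuron with the maximum activation."""
    return ACTIONS[int(np.argmax(output_activations))]

x = encode_sensors("empty", "green", "wall")   # e.g. [0, 0, 1, 0, 1, 1]
```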

To test the adaptation abilities of the networks, we define two reward functions that we refer to as two "seasons": summer and winter (see Appendix C.1 for the complete set of associations between behaviors and reinforcement signals). In both seasons, the agent is expected to explore the environment and avoid collisions with the walls. During the summer season, the agent is expected to learn to collect green items and avoid blue items, while during the winter season this requirement is reversed. We calculate the agent's average performance score, per season, by subtracting the number of incorrectly collected items from the number of correctly collected items. We start an experiment from the summer season. Then we switch the season every 5000 action steps, for a total of four seasons (summer, winter, summer, winter). It is important to note that these seasonal changes only cause the reinforcement associations to change; they do not affect the network configuration. We perform this process for five independent trials, with each trial starting with a randomly initialized ANN. The fitness value of an individual plasticity rule is then given by:

$$\mathit{fitness} = \frac{1}{s \cdot t} \sum_{k=1}^{s} \sum_{l=1}^{t} (c_{k,l} - i_{k,l}) \quad (5)$$

where $t = 5$ and $s = 4$ are the number of trials and seasons, respectively, and $c_{k,l}$ and $i_{k,l}$ are the number of correctly and incorrectly collected items in season $k$ and trial $l$. When an item is collected, a new item of the same type is randomly placed on an unoccupied cell.

The architecture of the agent is illustrated in Figure 2b. We use fully connected feed-forward networks with one hidden layer, for a total of 6 input, 20 hidden and 3 output neurons.⁴ One additional bias neuron is added to the input and hidden layers.

⁴We also performed additional experiments with networks consisting of various numbers of hidden neurons.

For all experiments, the agents are set to pick a random action with a probability of 0.02, regardless of the actual output of their ANNs. Random behaviors are introduced to avoid getting stuck in a behavioral cycle. Such random behaviors, though, are not taken into account for the synaptic changes, because they are not based on the output of the network.
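Schematically, the lifetime evaluation of one individual on the foraging task (four 5000-step seasons, 2% random actions that trigger no synaptic update, and the per-season score used in Equation 5) could be organized as follows; `env`, `ann` and their methods are hypothetical stand-ins for components the paper does not specify as code.

```python
import numpy as np

def evaluate_lifetime(env, ann, rule_table, eta, rng,
                      n_seasons=4, steps_per_season=5000, p_random=0.02):
    """Return the average per-season score (correct minus incorrect items)."""
    season_scores = []
    for season in range(n_seasons):
        env.set_season("summer" if season % 2 == 0 else "winter")
        correct = incorrect = 0
        for _ in range(steps_per_season):
            x = env.observe()                       # 6-bit sensory input
            out = ann.forward(x)                    # binary hidden/output activations
            if rng.random() < p_random:
                action, learned = rng.integers(3), False   # random exploratory action
            else:
                action, learned = int(np.argmax(out)), True
            m, got_correct, got_incorrect = env.step(action)  # reinforcement signal
            correct += got_correct
            incorrect += got_incorrect
            if learned and m != 0:                  # no update for random actions
                ann.apply_esp_rule(rule_table, eta, m)
            # Weights are L2-normalized per post-synaptic neuron inside apply_esp_rule.
        season_scores.append(correct - incorrect)
    return float(np.mean(season_scores))
```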

3.2 Prey-predator Task

The prey-predator task includes two types of agents, referred to as preys and predators, that try to avoid and catch, respectively, the agent controlled by the ANN. The agent, starting from a randomly initialized ANN, is required to learn to catch preys and escape from predators during its lifetime, based on the ESP rules and the reinforcement signals.

Also in this case, the environment consists of 100×100 grid cells enclosed by a wall. We introduce 10 mobile preys and 10 predators, both controlled by hand-coded rules. In each action step, they all move to a randomly selected neighboring cell. When the agent is in close proximity (determined by a certain threshold based on Euclidean distance), preys move to a randomly selected neighboring cell with a higher probability of maximizing their distance to the agent; on the contrary, predators move to a randomly selected neighboring cell with a higher probability of minimizing their distance to the agent. If the agent moves to the same cell occupied by a prey, we randomly relocate the prey to another unoccupied cell and count this event as a "collected" point for the agent. If a predator moves to the same cell occupied by the agent, we keep the predator's and the agent's locations, and count this event as a "caught" point for the agent. As in the previous scenario, we perform this process for five independent trials, with each trial starting with a randomly initialized ANN.

In this case, we aim to find plasticity rules that train the agents to maximize the number of "collected" points and minimize the number of "caught" points. We combine these two objectives into one equation, as follows:

$$\mathit{fitness} = \frac{1}{t} \sum_{k=1}^{t} (c_k - \alpha \cdot i_k) \quad (6)$$

where $t = 5$ is the number of trials, and $c_k$ and $i_k$ are the number of "collected" and "caught" points in trial $k$, respectively. We use $\alpha$ to increase the weight of the "caught" points, because $c_k$ tends to be larger than $i_k$.

We use a similar network structure to that described in Section 3.1. However, since this task may require a larger field of vision to keep track of the movement of preys and predators, we increase the field of vision of the agent to include all the cells in a 9×5 rectangle in front of the agent (4 cells distance on the left, on the right and in front). Therefore, the agent is able to see 44 cells in total, excluding its own location. Since we encode each cell using two bits, we have 88 inputs to the network. We use one hidden layer with 50 neurons and 3 output neurons.
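For intuition about the input size, the snippet below enumerates the cells of the assumed 9×5 agent-relative window (4 cells to each side, the agent's own row plus the 4 rows ahead), excludes the agent's own cell, and recovers the 44 × 2 = 88 inputs; the exact window geometry is our reading of the text, not code from the paper.

```python
# Agent-relative offsets: dx in [-4, 4] (left/right), dy in [0, 4] (own row and ahead).
visible_cells = [(dx, dy) for dy in range(0, 5) for dx in range(-4, 5)
                 if not (dx == 0 and dy == 0)]     # exclude the agent's own cell
assert len(visible_cells) == 44
n_inputs = 2 * len(visible_cells)                  # two bits per cell
print(n_inputs)                                    # 88
```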


In this case, we define reinforcement signals based on the behavior of the agent with respect to the closest object (either a wall, a prey or a predator). If the closest object in the visual range of the agent is a prey and the agent avoids it by choosing an action that increases its distance to it, then we provide a punishment. Otherwise, if the agent performs an action that reduces its distance to the prey, then we provide a reward. Similar signals are associated with predators and walls. The complete behavior-reinforcement signal associations can be found in Appendix C.2.
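A minimal sketch of this reward logic (see Table 7 in Appendix C.2) is shown below, assuming a helper that reports the closest visible object and whether the chosen action moves the agent towards it or away from it; the function and argument names are illustrative.

```python
def prey_predator_reinforcement(closest, action_effect, season):
    """closest: 'wall', 'green', 'blue' or None; action_effect: 'avoid' or 'towards'.
    Returns the modulatory signal m following the associations in Table 7."""
    if closest is None:
        return 0   # simplified: the 'nothing in range' exploration rows are omitted here
    if closest == "wall":
        return 1 if action_effect == "avoid" else -1
    # In summer, green agents are preys and blue agents are predators; in winter the roles switch.
    is_prey = (closest == "green") == (season == "summer")
    if is_prey:
        return 1 if action_effect == "towards" else -1
    return 1 if action_effect == "avoid" else -1
```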

Similarly to the foraging task, we employ two seasons in which we switch the roles of the prey and predator agents. During the summer season, preys are shown in green and predators are shown in blue; during the winter season, preys are shown in blue and predators are shown in green. We switch the reinforcement signals in each season accordingly. In total, we run two seasons (summer, winter), each consisting of 4000 action steps.

4 Experimental Results

We present now the results of the ESP rules on the two tasks described above, and compare them with the Hill Climbing (HC) algorithm (De Castro, 2006). The HC is an offline optimization approach where a single individual, encoding the connections of an ANN, is evaluated on the task without any lifetime learning capability, and optimized iteratively for a certain number of iterations (see Appendix B for further details).

4.1 Foraging Task

In Table 1, a comparison of the average fitness results of the best agents controlled/trained by various algorithms is given. In the table, "Perfect Agent" (PA) refers to the results of an agent controlled by hand-coded rules (i.e., not controlled by an ANN). In this case, the agent has "perfect knowledge" about the problem from the beginning of the trial (therefore, there is no lifetime learning).

Table 1: Foraging task: average fitness results of the agents controlled/trained by different algorithms. The details of rules ID:1 and ID:18 can be found in Tables 2 and 3.

| Algorithm | Fitness | Std | Learning Type |
|---|---|---|---|
| Perfect Agent (PA) | 67 | 8.28 | Hand-coded |
| Hill Climbing (HC) | 59 | 19.70 | Offline optimization |
| Evolved Synaptic Plasticity (rule ID:1) | 50 | 9.91 | Lifetime learning |
| Discrete Hebbian/anti-Hebbian (rule ID:18) | 0.2 | 6.13 | Lifetime learning |

We collected a total of 300 ESP rules by taking the top 10 best performing rules from 30 independent GA runs. The overall result of the best performing ESP rule is indicated as "rule ID:1" in Table 1. The performance of the ESP rule is lower relative to the PA and HC. This is due to lifetime learning: even though the agents are tested for the same number of action steps, the agent with the ESP rule is required to learn the task during its lifetime. Moreover, the mistakes it makes during this process are included in the fitness evaluation.

The complete collection of ESP rules found by the GA results in 15 distinct rules (out of $3^8$ possible rules), i.e. rules that differ only in their discrete parts (as shown in Figure 1, the discrete parts of the rules consist of 8-dimensional ternary strings, without considering the specific value of the learning rate). The results of these 15 distinct rules are given in Table 2, where the rules are ranked based on their median fitness values (a lower rule ID means better). In the second column of the table, we show how many rule instances we found for each of these rule types. Slightly more than half of the ESP rules (164 out of 300) converged to the first rule type (rule ID:1). The remaining columns, labelled "Median", "Std", "Max", "Min", "η Mean" and "η Std", show the median, standard deviation, max and min values of the fitness, and the average and standard deviation of the learning rates for each distinct rule, respectively. The discrete parts of the rules are shown in the remaining eight columns of each row. Values $\{-1, 0, 1\}$ in each cell indicate what kind of change is performed ({decrease, stable, increase}, respectively). The column labels indicate the activation states of the pre- and post-synaptic neurons ($a_i$ and $a_j$), encoded as 2 bits, and $m = -1$ and $m = 1$ indicate the reinforcement signals. For instance, the best performing rule (ID:1) performs the following synaptic changes: when $m = -1$ and $a_i = 1$, $a_j = 0$, increase (1); when $m = -1$ and $a_i = 1$, $a_j = 1$, decrease (−1); otherwise, the weight is kept stable (0).

For comparison, Table 3 provides the results of three plasticity rules defined by hand. In particular, the rule ID:16 was defined by taking the best performing ESP rule (ID:1) and replacing the case $m = -1$ and $a_i = 1$, $a_j = 0$ with stable (0). After this change, the rule performs synaptic changes only when pre- and post-synaptic neurons are active and the network produces an undesired outcome ($m = -1$). As for the other two rules, the rule ID:18 (also shown in the last row of Table 1) performs Hebbian/anti-Hebbian learning, as it increases/decreases the synaptic weights between neurons when they are both active and the network produces a desired/undesired outcome. Instead, the rule ID:17 performs synaptic changes on two additional activation combinations w.r.t. rule ID:18, in order to facilitate the creation of new connections.
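To make the interpretation concrete, the best evolved rule (ID:1) and the hand-coded discrete Hebbian/anti-Hebbian rule (ID:18) can be written as lookup tables over $(a_i, a_j, m)$; the dictionary form below is ours, while the values are taken directly from Tables 2 and 3.

```python
# Synaptic change for each (a_i, a_j, m) state; values taken from Tables 2 and 3.
RULE_ID1 = {            # best evolved rule: changes only on punishment (m = -1)
    (0, 0, -1): 0, (0, 1, -1): 0, (1, 0, -1): +1, (1, 1, -1): -1,
    (0, 0, +1): 0, (0, 1, +1): 0, (1, 0, +1): 0,  (1, 1, +1): 0,
}
RULE_ID18 = {           # discrete Hebbian/anti-Hebbian: changes only when both neurons fire
    (0, 0, -1): 0, (0, 1, -1): 0, (1, 0, -1): 0,  (1, 1, -1): -1,
    (0, 0, +1): 0, (0, 1, +1): 0, (1, 0, +1): 0,  (1, 1, +1): +1,
}
```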

From Table 3, we can see that the performance obtained by the rule ID:16 is significantly better than that of the other two rules (ID:17 and ID:18). However, this performance is worse than that of the original ESP rule ID:1. Surprisingly though, the GA did not find the rule ID:16, as shown in the distinct rule list given in Table 2, even though it performs better than most of the other rules. This may be due to the convergence of the evolutionary process to the rules that perform better, specifically the rules ID:1, ID:2 and ID:3. As for the rules ID:17 and ID:18, we observe that even though the rule ID:17 performs better than the rule ID:18, its performance is not better than that of the worst ESP rule (ID:15).

We also observe that the ESP rules that perform synaptic changes when $m = 1$ are worse than the ones that perform synaptic changes only when $m = -1$ (ESP rules ID:1, which represents more than half of the total 300 rules, ID:2 and ID:3). This may be due to the design of the reward function for this task. The reward function provides rewards based on the desirable behavior of the network. However, if the plasticity rule continues to perform synaptic changes when the network has already learnt the task, then these changes can disrupt the weights, causing the network to "forget". The issue of forgetting already learnt knowledge/skills in ANNs, due to the acquisition of new knowledge/skills, is usually referred to as "catastrophic forgetting" (Parisi et al., 2019).

Table 2: Foraging task: results of the distinct ESP rules found by the GA, ranked by their median fitness. The column "Rules" shows the number of ESP rules found for each distinct rule; "Median", "Std", "Max", "Min" show their median, standard deviation, max and min fitness, respectively; "η Mean", "η Std" show their average learning rate and its standard deviation, respectively. The next four columns indicate the plasticity rules ({decrease, stable, increase}, as {−1, 0, 1}) in the case of m = −1. The last four columns indicate the plasticity rules in the case of m = 1. The 2-bit headers of the last 8 columns indicate the binary states of the pre- and post-synaptic neuron activations $a_i$ and $a_j$.

| ID | Rules | Median | Std | Max | Min | η Mean | η Std | m=−1: 00 | 01 | 10 | 11 | m=1: 00 | 01 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 164 | 48.63 | 0.83 | 49.96 | 42.95 | 0.0375 | 0.008 | 0 | 0 | 1 | -1 | 0 | 0 | 0 | 0 |
| 2 | 23 | 44.23 | 1.05 | 45.88 | 41.55 | 0.0167 | 0.004 | -1 | 1 | 1 | -1 | 0 | 0 | 0 | 0 |
| 3 | 3 | 42.35 | 2.65 | 46.45 | 41.48 | 0.0192 | 0.003 | 1 | 0 | 1 | -1 | 0 | 0 | 0 | 0 |
| 4 | 19 | 28.35 | 0.51 | 29.05 | 27.22 | 0.0488 | 0.009 | -1 | 1 | 1 | -1 | 1 | 0 | -1 | 0 |
| 5 | 1 | 27.28 | 0 | 27.28 | 27.28 | 0.0182 | 0 | -1 | -1 | 1 | 0 | 1 | 0 | -1 | 0 |
| 6 | 9 | 26.70 | 1.48 | 27.91 | 22.80 | 0.0118 | 0.003 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 |
| 7 | 16 | 26.54 | 0.89 | 28.13 | 24.65 | 0.0092 | 0.002 | 0 | 1 | 1 | -1 | -1 | -1 | 1 | 1 |
| 8 | 2 | 25.91 | 0.07 | 25.97 | 25.86 | 0.0096 | 0.0008 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 |
| 9 | 1 | 23.55 | 0 | 23.55 | 23.55 | 0.0273 | 0 | -1 | 1 | 1 | -1 | 0 | -1 | 0 | 1 |
| 10 | 2 | 22.11 | 1.26 | 23 | 21.21 | 0.0052 | 0.003 | 1 | 1 | 1 | -1 | 0 | -1 | 0 | 1 |
| 11 | 10 | 20.96 | 0.33 | 21.59 | 20.45 | 0.0198 | 0.003 | 0 | 1 | 1 | -1 | 1 | -1 | -1 | -1 |
| 12 | 20 | 20.63 | 0.40 | 21.34 | 19.81 | 0.061 | 0.022 | -1 | 1 | 1 | 0 | 1 | -1 | -1 | -1 |
| 13 | 1 | 12.41 | 0 | 12.41 | 12.41 | 0.0799 | 0 | 0 | 0 | 0 | -1 | 1 | 1 | 0 | -1 |
| 14 | 19 | 10.30 | 0.55 | 10.87 | 8.36 | 0.0662 | 0.018 | 0 | 0 | 0 | -1 | 0 | 1 | 1 | -1 |
| 15 | 10 | 8.82 | 0.54 | 9.59 | 7.97 | 0.0301 | 0.009 | 1 | 1 | -1 | -1 | 1 | 0 | -1 | 1 |

Table 3: Foraging task: results (fitness values, standard deviations of the fitness values, number of correctly and incorrectly collected items and their standard deviations, and number of wall hits and its standard deviation) of three rules defined by hand, over 100 trials. Columns encoded as 2 bits represent the activation states of the pre- and post-synaptic neurons.

| ID | Fitness | Std Fit. | Correct | Std Cor. | Incor. | Std Inc. | Wall | Std Wall | η | m=−1: 00 | 01 | 10 | 11 | m=1: 00 | 01 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 41.68 | 18.26 | 47.63 | 14.36 | 5.95 | 7.17 | 14.36 | 54.59 | 0.04 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 |
| 17 | 5.6 | 9.28 | 25.2 | 6.88 | 19.7 | 5.28 | 120 | 140.2 | 0.01 | 0 | 0 | 1 | -1 | 0 | -1 | 0 | 1 |
| 18 | 0.2 | 6.13 | 18.3 | 4.48 | 18.1 | 4.27 | 602.6 | 117.5 | 0.01 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 |

Figure 3: Foraging task: number of green and blue items collected with the two ESP rules ID:1 (a) and ID:6 (b); see Table 2 for details.

Finally, in Figure 3 we report the number of collected items during a lifetime learning process of agents with the ESP rules ID:1 and ID:6. Every 5000 steps, the season is changed. We observe that the agent with rule ID:6 is not as efficient as the agent with rule ID:1 at adapting to the seasonal changes. In fact, the agent with rule ID:1 learns to avoid the incorrect item faster than the agent with rule ID:6. We present additional results of the agents during the foraging task in Appendix D.2.⁵

⁵To gain further insight into these results, we have visually inspected the behavior of the agent with ESP rule ID:1 and η = 0.0375 during a foraging task with four seasons, each consisting of 3000 action steps (we reduced the season duration for visualization purposes). The video (available at https://youtu.be/9jy6yTFKgT4) shows that the agent is capable of efficiently adapting to the environmental conditions imposed by each season. Even though the desired behavior with respect to the wall should be constant across different seasons, the agent makes a few mistakes at the beginning of each season by hitting the wall. This may be due to the change of the synaptic weights that affect the behavior of the agent with respect to the wall. We provide the weights and their change after each seasonal change in Appendix D.4.

4.2 Prey-predator Task

The average results of the best agents controlled/trained using HC and the best ESP rule on the prey-predator task are shown in Table 4. In this case we collected 70 best performing ESP rules, running 7 independent runs of the GA. We show these rules in Table 5 (summarized based on their discrete parts, as we did for the foraging task in Table 2).

Table 4: Prey-predator task: average fitness results of the agents controlled/trained by Hill Climbing and the best ESP rule (ID:1 in Table 5).

| Algorithm | Fitness | Std | Learning Type |
|---|---|---|---|
| Hill Climbing (HC) | 41.9 | 16.43 | Offline optimization |
| Evolved Synaptic Plasticity (rule ID:1) | 31.88 | 22.88 | Lifetime learning |

Table 5: Prey-predator task: results of the distinct ESP rules found by the GA, ranked by their median fitness. Columns are labelled as in Table 2.

| ID | Rules | Median | Std | Max | Min | η Mean | η Std | m=−1: 00 | 01 | 10 | 11 | m=1: 00 | 01 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 24.20 | 4.39 | 29.46 | 16.12 | 0.42 | 0.16 | 0 | -1 | 1 | 0 | 1 | -1 | 0 | 0 |
| 2 | 35 | 23.84 | 3.06 | 31.88 | 19.21 | 0.55 | 0.06 | 0 | -1 | 1 | 0 | -1 | 0 | -1 | -1 |
| 3 | 1 | 21.57 | 0 | 21.57 | 21.57 | 0.98 | 0 | 1 | -1 | 1 | -1 | 0 | -1 | -1 | -1 |
| 4 | 14 | 16.54 | 2.28 | 20.42 | 13.42 | 0.69 | 0.16 | 1 | 0 | 1 | -1 | -1 | 0 | -1 | -1 |
| 5 | 7 | 10.65 | 2.98 | 15.26 | 6.56 | 0.56 | 0.08 | 1 | 0 | 1 | -1 | 1 | -1 | -1 | 0 |
| 6 | 3 | 10.61 | 1.93 | 11.93 | 8.39 | 0.65 | 0.02 | 1 | 0 | 1 | -1 | 0 | -1 | -1 | 0 |


Similarly to the foraging task, the results of the ESP rule are worse than those of HC, due to lifetime learning. On the other hand, we note that in this case the ESP rules converge to larger learning rates (one order of magnitude higher compared to the foraging task). This is probably due to the increased stochasticity of this task: since preys and predators move randomly, it is difficult for the agent to learn their behavior patterns. Therefore, the agent needs to adjust its behavior very quickly.

Also in this case, we have visually inspected the behavior of the agent. In particular, we focused on the two best ESP rules (ID:1 and ID:2 in Table 5, respectively with η = 0.3550 and η = 0.4377).⁶ Quite interestingly, we observed that the behaviors obtained by the two best ESP rules are quite different, even though they have similar results in terms of median fitness. The behavior obtained with rule ID:1 is in fact similar to that obtained with HC: the agent moves straight until it encounters an object. On the other hand, the plasticity rule ID:2 appears to have evolved in a way to perform synaptic changes that modify the behavior of the agent to move diagonally. This behavior appears to be advantageous against the movements of the preys and predators.

⁶A video recording of the behaviors of the agents controlled/trained by HC and the two ESP rules is available online.

5 Conclusions

The plasticity property of biological and artificial neural networks enables learning by modifying the networks’ configurations. These modifications take place at individual synapse/neuron level, based on local interactions between neurons.

In this work, we proposed an evolutionary approach to optimize/discover synaptic plasticity rules to produce autonomous learning under changing environmental conditions. We represented the plasticity rules in a discrete form, performing changes based on pairwise binary activations of neurons. Most of the works in the literature consider evolving complex functions to perform synaptic changes. However, the discrete representation used in this work can reduce the search space of possible rules and allow for interpretation. This may provide insights into the learning behavior of the networks for certain learning scenarios. Accordingly, we presented the ESP rules discovered in this work and discussed their behaviors.

We evaluated the proposed algorithm on agent-based foraging and prey-predator tasks. We focused specifically on the adaptation capabilities of the ANNs in the cases where the environmental conditions change. To demonstrate this, we defined two seasons that are associated with different reinforcement signals, and measured the lifetime adaptation capability of the networks during these seasonal changes.

We collected the best performing ESP rules after running the GAs multiple times. These rules converged into several types and differed between the foraging and prey-predator tasks. For instance, in the case of the foraging task, the best ESP rule performed synaptic changes only when the network produced an undesired output (negative reinforcement signal). This is likely due to the reward functions we used in the experimentation, which were designed to provide constant reward/punishment while the networks produced desired/undesired outcomes. Intuitively, after the networks learn to perform the task successfully, continuing to perform synaptic changes may cause degradation in the synaptic weights and result in forgetting.

In the case of the prey-predator task, the best ESP rules were more complex than in the foraging task. In contrast to the foraging task, the best ESP rules tend to perform frequent synaptic changes, to adapt to the stochasticity of the prey-predator task.


To set an upper bound for the performance of the ANNs with ESP rules, we performed a set of separate experiments using hand-coded rule-based agents and ANN controllers optimized using the HC algorithm. The comparison with these algorithms on the foraging and prey-predator tasks showed that the agents trained with ESP rules could perform the tasks very well (reaching about 74% of the performance of HC), considering continuous learning versus offline optimization.

In future work, we aim to investigate the scalability of the ESP rules to larger networks and various learning tasks. Furthermore, as encountered in this work, we are also particularly interested in investigating approaches to avoid catastrophic forgetting in continuous learning scenarios.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No: 665347.

References

Brown, T. H., Kairiss, E. W., and Keenan, C. L. (1990). Hebbian synapses: biophysical mechanisms and algorithms. Annual review of neuroscience, 13(1):475–511.

Coleman, O. J. and Blair, A. D. (2012). Evolving plastic neural networks for online learning: review and future directions. In Australasian Joint Conference on Artificial Intelligence, pages 326–337. Springer.

De Castro, L. N. (2006). Fundamentals of natural computing: basic concepts, algorithms, and applications. CRC Press.

El-Boustani, S., Ip, J. P., Breton-Provencher, V., Knott, G. W., Okuno, H., Bito, H., and Sur, M. (2018). Locally coordinated synaptic plasticity of visual cortex neurons in vivo. Science, 360(6395):1349–1354.

Floreano, D., Dürr, P., and Mattiussi, C. (2008). Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62.

Floreano, D. and Urzelai, J. (2000). Evolutionary robots with on-line self-organization and behavioral fitness. Neural Networks, 13(4-5):431–443.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition.

Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory.

Kowaliw, T., Bredeche, N., Chevallier, S., and Doursat, R. (2014). Artificial neurogenesis: An introduction and selective review. In Growing Adaptive Machines, pages 1–60. Springer.

Kuriscak, E., Marsalek, P., Stroffek, J., and Toth, P. G. (2015). Biological context of Hebb learning in artificial neural networks, a review. Neurocomputing, 152:27–35.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383.


Mouret, J.-B. and Tonelli, P. (2014). Artificial evolution of plastic neural networks: a few key concepts. In Growing Adaptive Machines, pages 251–261. Springer.

Niv, Y., Joel, D., Meilijson, I., and Ruppin, E. (2002). Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors. Adaptive Behavior, 10(1):5–24.

Nolfi, S., Miglino, O., and Parisi, D. (1994). Phenotypic plasticity in evolving neural networks. In From Perception to Action Conference, 1994., Proceedings, pages 146–157. IEEE.

Orchard, J. and Wang, L. (2016). The evolution of a generalized neural learning rule. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 4688–4694. IEEE.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks.

Risi, S. and Stanley, K. O. (2010). Indirectly encoding neural plasticity as a pattern of local rules. In International Conference on Simulation of Adaptive Behavior, pages 533– 543. Springer.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533.

Runarsson, T. P. and Jonsson, M. T. (2000). Evolution and design of distributed learning rules. In Combinations of Evolutionary Computation and Neural Networks, 2000 IEEE Symposium on, pages 59–63. IEEE.

Sejnowski, T. J. and Tesauro, G. (1989). The Hebb rule for synaptic plasticity: algorithms and implementations. In Neural models of plasticity, pages 94–103. Elsevier.

Sipper, M., Sanchez, E., Mange, D., Tomassini, M., Pérez-Uribe, A., and Stauffer, A. (1997). A phylogenetic, ontogenetic, and epigenetic view of bio-inspired hardware systems. IEEE Transactions on Evolutionary Computation, 1(1):83–97.

Soltoggio, A., Bullinaria, J. A., Mattiussi, C., Dürr, P., and Floreano, D. (2008). Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In International Conference on Artificial Life (Alife XI), pages 569–576. MIT Press.

Soltoggio, A. and Stanley, K. O. (2012). From modulated Hebbian plasticity to simple behavior learning through noise and weight saturation. Neural Networks, 34:28–41.

Soltoggio, A., Stanley, K. O., and Risi, S. (2018). Born to learn: The inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks.

Vasilkoski, Z., Ames, H., Chandler, B., Gorchetchnikov, A., Léveillé, J., Livitz, G., Mingolla, E., and Versace, M. (2011). Review of stability properties of neural plasticity rules for implementation on memristive neuromorphic hardware. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2563–2569. IEEE.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83.


Yaman, A., Iacca, G., Mocanu, D. C., Fletcher, G., and Pechenizkiy, M. (2019). Learning with delayed synaptic plasticity. In Genetic and Evolutionary Computation Conference, 13-17 July 2019, Prague, Czech Republic.

Yaman, A., Mocanu, D. C., Iacca, G., Fletcher, G., and Pechenizkiy, M. (2018). Limited evaluation cooperative co-evolutionary differential evolution for large-scale neuroevolution. In Genetic and Evolutionary Computation Conference, 15-19 July 2018, Kyoto, Japan.

Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447.

A Artificial Neural Network Model

ANNs consist of a number of artificial neurons arranged within a certain connectivity pattern, where a directed connection between two neurons is referred to as a synapse, a neuron with an outgoing synapse is referred to as a pre-synaptic neuron, and a neuron with an incoming synapse is referred to as a post-synaptic neuron.

In our experiments, we use fully connected feed-forward ANNs (see Figure 2b), in which the activation of each post-synaptic neuron, $a_j$, is calculated as:

$$a_j = \psi\left(\sum_{i=0} w_{i,j} \cdot a_i\right) \quad (7)$$

where $a_i$ is the activation of the $i$-th pre-synaptic neuron, $w_{i,j}$ is the synaptic efficiency (weight) from pre-synaptic neuron $i$ to post-synaptic neuron $j$, $a_0$ is a bias, which usually takes a constant value of 1, and $\psi(\cdot)$ is the activation function, which in our case is set to a step activation function given by:

$$\psi(x) = \begin{cases} 1, & \text{if } x > 0; \\ 0, & \text{otherwise.} \end{cases} \quad (8)$$

The step function reduces the possible activation states of each neuron to two possibilities: active (1) and passive (0). As discussed in the main text, this binary neuron model facilitates the interpretability of the results.
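A minimal sketch of this binary forward pass (Equations 7-8) is given below; the layer sizes match the foraging network of Section 3.1, and the random weight matrices are placeholders.

```python
import numpy as np

def step(x):
    """Step activation (Eq. 8): 1 if x > 0, else 0."""
    return (x > 0).astype(int)

def forward(x, w_hidden, w_out):
    """Binary forward pass (Eq. 7) with a constant bias input of 1 per layer."""
    h = step(np.append(x, 1) @ w_hidden)          # hidden activations
    return step(np.append(h, 1) @ w_out)          # output activations

rng = np.random.default_rng(0)
w_hidden = rng.uniform(-1, 1, size=(6 + 1, 20))   # 6 inputs + bias -> 20 hidden
w_out = rng.uniform(-1, 1, size=(20 + 1, 3))      # 20 hidden + bias -> 3 outputs
y = forward(np.array([0, 0, 1, 0, 1, 1]), w_hidden, w_out)
```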

B Hill Climbing Algorithm

In the case of the HC algorithm, we employ an ANN with the same architecture used in the ESP experiments, whose weights are randomly initialized in [−1, 1]. We evaluate the ANN on the first season of the task, to find the initial best network and its corresponding fitness value. We then iteratively generate a candidate network by randomly perturbing all the weights of the best network using a Gaussian mutation with 0 mean and 0.1 standard deviation. The newly created network is evaluated on the current season of the task, and replaces the best network if its fitness value is better. We repeat this process iteratively for all the seasons of the task (in our experiments, 4 alternating summer/winter seasons for the foraging task, and 2 summer/winter seasons for the prey-predator task). In both tasks, we consider in this case 1000 action steps per season. When a seasonal change happens, we keep the best network and continue the optimization procedure as specified. The final fitness result is the average of 100 independent HC optimization processes.
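A schematic of this procedure, with a hypothetical `evaluate` function standing in for the task simulation, might look as follows; it is an illustration of the description above, not the authors' code.

```python
import numpy as np

def hill_climb(weights, evaluate, seasons, iters_per_season=1000, sigma=0.1, rng=None):
    """weights: flat vector of ANN weights; evaluate(weights, season) -> fitness."""
    if rng is None:
        rng = np.random.default_rng()
    best_w = weights.copy()
    for season in seasons:                            # e.g. ["summer", "winter", ...]
        best_f = evaluate(best_w, season)             # re-evaluate the kept network
        for _ in range(iters_per_season):
            cand_w = best_w + rng.normal(0, sigma, size=best_w.shape)  # perturb all weights
            cand_f = evaluate(cand_w, season)
            if cand_f > best_f:                       # keep the candidate if it improves
                best_w, best_f = cand_w, cand_f
    return best_w, best_f
```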

C Behavior-Reinforcement Signal Associations

C.1 Foraging Task

The complete list of sensory states, behaviors and reinforcement signal associations that we used in the foraging task is provided in Table 6. In the table, the columns labelled as "Sensor" and "Behavior" show the sensor states and actions of the network, respectively. The reward functions corresponding to the two seasons, "Summer" and "Winter", are also shown. In the following, green items, blue items and walls are referred to as "G", "B" and "W", respectively.

It should be noted that, for some sensory states, multiple reward function associations may be triggered. In these cases, the associations that concern the behaviors to collect/avoid items are given priority. For instance, when the sensory input of the agent indicates that there are "W straight" and "G on left", and the agent goes right, we activate the reward association ID:12, rather than ID:3. We should also note that the reward functions described here do not specify the reinforcement signal outcomes for all possible sensor-behavior combinations. Indeed, in total there are 192 possible sensor-behavior associations (resulting from 3 possible behavior outcomes for each of the $2^6$ possible sensor states). We assume that all the sensor-behavior associations that are not shown in the table do not provide any reinforcement signal.

The first two reward associations are defined to encourage the agent to explore the environment. Otherwise, the agent may get stuck in a small area, for example by performing only actions such as going left or right when there is nothing present. The sensor state labelled as "nothing" refers to an input to the network equal to [0, 0, 0, 0, 0, 0].

The reward associations from ID:3 to ID:8 specify the reinforcement signals for the behaviors with respect to the wall, and are the same in both summer and winter seasons. It is expected that the agent learns to avoid the wall. Therefore, we define positive reward signals for the states where there is a wall and the agent picks a behavior that avoids a collision (IDs: 3, 5, 7). Conversely, we define punishment signals for the sensor-behavior associations where the agent collides with the wall (IDs: 4, 6, 8). For instance, the sensor state of the association given in ID:3 refers to the input to the network [0, 0, 1, 1, 0, 0], and provides a reward if the agent decides to go left or right (this behavior is desired in both summer and winter seasons).

The reward associations from ID:9 to ID:14, and from ID:15 to ID:20, define the reinforcement signals associated with the "G" and "B" items, respectively. The reinforcement signals are reversed between the two seasons. In summer, the agent is expected to collect "G" items and avoid "B" items, whereas in winter the agent is expected to collect "B" items and avoid "G" items.

Table 6: Foraging task: associations of the sensor and behavior states to the reinforcement signals in the two seasons.

| ID | Sensor | Behavior | Summer | Winter |
|---|---|---|---|---|
| 1 | nothing | Straight | 1 | 1 |
| 2 | nothing | Left or Right | -1 | -1 |
| 3 | W straight | Left or Right | 1 | 1 |
| 4 | W straight | Straight | -1 | -1 |
| 5 | W on left | Right | 1 | 1 |
| 6 | W on left | Left or Straight | -1 | -1 |
| 7 | W on right | Left | 1 | 1 |
| 8 | W on right | Right or Straight | -1 | -1 |
| 9 | G straight | Straight | 1 | -1 |
| 10 | G straight | Left or Right | -1 | 0 |
| 11 | G on left | Left | 1 | -1 |
| 12 | G on left | Straight or Right | -1 | 0 |
| 13 | G on right | Right | 1 | -1 |
| 14 | G on right | Straight or Left | -1 | 0 |
| 15 | B straight | Straight | -1 | 1 |
| 16 | B straight | Left or Right | 0 | -1 |
| 17 | B on left | Left | -1 | 1 |
| 18 | B on left | Straight or Right | 0 | -1 |
| 19 | B on right | Right | -1 | 1 |
| 20 | B on right | Straight or Left | 0 | -1 |
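The wall- and item-related rows of Table 6 can be expressed compactly in code; the sketch below is illustrative (the exploration rows 1-2 and the priority handling for mixed sensory states are omitted, and all names are ours, not the paper's).

```python
def foraging_reinforcement(sensed, position, action, season):
    """sensed: 'W', 'G' or 'B'; position: 'left', 'straight', 'right';
    action: 'Left', 'Straight', 'Right'. Returns m for the wall/item rows of Table 6."""
    if sensed == "W":                          # rows 3-8: identical in both seasons
        good = {"straight": ("Left", "Right"),
                "left": ("Right",),
                "right": ("Left",)}[position]
        return 1 if action in good else -1
    # Rows 9-20: an item is collected by moving onto its cell.
    collects = {"straight": "Straight", "left": "Left", "right": "Right"}[position] == action
    wanted = "G" if season == "summer" else "B"
    if sensed == wanted:
        return 1 if collects else -1
    return -1 if collects else 0
```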


C.2 Prey-predator Task

The complete list of sensory states, behaviors and reinforcement signal associations that we used in the prey-predator task is provided in Table 7. The header labelled as “Closest” refers to the object that is the closest to the agent (in its visual range), measured using Euclidean distance. The behavior labelled as “Avoid” refers to the situation where the agent selects an action that maximizes its distance to the closest object. On the contrary, the behavior labelled as “Move towards” refers to the situation where the agent selects an action that minimizes its distance to the closest object.

The first four reward associations are similar to those used in the foraging task. In particular, the first two associations (ID:1 and ID:2) encourage exploration, while the reward associations related to the wall (ID:3 and ID:4) encourage (punish) behaviors that avoid (move towards) the wall.

As for the behavior with respect to the preys and predators, reflected in the reward associations from ID:5 to ID:8, during the summer season the agent is required to avoid blue agents (predators) and collect green agents (preys). During the winter season, the prey-predator roles are switched. Therefore, the agent is required to avoid green agents (predators) and collect blue agents (preys).

Table 7: Prey-predator task: associations of the sensor and behavior states to the reinforcement signals in the two seasons.

| ID | Closest | Behavior | Summer | Winter |
|---|---|---|---|---|
| 1 | nothing | Straight | 1 | 1 |
| 2 | nothing | Right or Left | -1 | -1 |
| 3 | W | Avoid | 1 | 1 |
| 4 | W | Move towards | -1 | -1 |
| 5 | G | Avoid | -1 | 1 |
| 6 | G | Move towards | 1 | -1 |
| 7 | B | Avoid | 1 | -1 |
| 8 | B | Move towards | -1 | 1 |

D Additional Results on the Foraging Task

D.1 Detailed Results of the Hill Climbing Algorithm

Figure 4 shows the average fitness value of 100 runs of the optimization process with the HC algorithm. Every 1000 iterations, the season is switched. In the first 1000 iterations, HC is able to find an agent that performs the task efficiently, with a fitness value of around 55. When the season is switched to winter (1001st iteration), the agent still performs according to the summer season, thus achieving a fitness score of −55 (since the expected behavior is reversed). After about 1000 iterations, HC is able to find an agent that performs the task in the new season efficiently. We observe a similar behavior in the following seasonal changes.

In Table 8, we report the mean and standard deviations of the fitness values, the collected number of green and blue items, and the wall hits at the 10th, 500th and 1000th iteration for each season. The values for Summer1, Winter1, Summer2 and Winter2 presented in the table correspond, respectively, to the iteration ranges 0-1000, 1001-2000, 2001-3000 and 3001-4000 on the x-axis of Figure 4.

We observe that it takes about 1000 iterations for the networks to reach, on average, a fitness value above 55 at the end of each season. Since the networks are well optimized for the task, a sudden decrease in their performance at the beginning of each seasonal change is observed. Still, after 10 iterations in all seasons the agents achieve a fitness value of around 9. We see clear increasing and decreasing trends in the number of collected items depending on the season, as the number of iterations increases. Moreover, the number of wall hits decreases, although at the end of each season it is still relatively high.

Figure 4: Foraging task: average fitness value of 100 runs of the offline optimization process of the agents using the HC algorithm. The process starts with a randomly initialized ANN, starting from the summer season. Every 1000 iterations, the season is changed, for a total of 4 seasons (see Table 8).

Table 8: Foraging task: average and standard deviations of the results of 100 runs of the offline optimization procedure of the agents using the Hill Climbing algorithm.

| Result | Iteration | Summer1 | Winter1 | Summer2 | Winter2 |
|---|---|---|---|---|---|
| Fitness | 10th | 9.20 ± 10.12 | 9.79 ± 11.83 | 8.74 ± 10.24 | 10.41 ± 12.04 |
| Fitness | 500th | 48.34 ± 24.82 | 47.42 ± 19.15 | 49.59 ± 18.02 | 49.94 ± 17.04 |
| Fitness | 1000th | 55.81 ± 23.04 | 55.08 ± 20.00 | 58.21 ± 17.08 | 56.95 ± 18.35 |
| G | 10th | 11.61 ± 12.37 | 5.02 ± 7.63 | 13.32 ± 12.22 | 4.48 ± 6.46 |
| G | 500th | 49.36 ± 24.48 | 1.58 ± 4.87 | 51.71 ± 17.43 | 1.59 ± 4.62 |
| G | 1000th | 56.56 ± 22.88 | 0.85 ± 3.29 | 59.65 ± 16.89 | 1.38 ± 4.09 |
| B | 10th | 2.41 ± 5.59 | 14.81 ± 14.40 | 4.58 ± 6.66 | 14.89 ± 14.05 |
| B | 500th | 1.02 ± 3.23 | 49.00 ± 19.06 | 2.12 ± 4.73 | 51.53 ± 16.33 |
| B | 1000th | 0.75 ± 1.43 | 55.93 ± 19.88 | 1.44 ± 3.47 | 58.33 ± 17.48 |
| W | 10th | 614.09 ± 919.95 | 693.78 ± 854.64 | 770.47 ± 1101.77 | 702.46 ± 862.67 |
| W | 500th | 170.26 ± 463.02 | 157.15 ± 446.59 | 146.47 ± 417.37 | 183.05 ± 485.90 |
| W | 1000th | 114.92 ± 375.96 | 88.30 ± 351.09 | 115.72 ± 398.26 | 140.97 ± 439.38 |

D.2 Performance of the Agents During the Task

Figure 5 shows the results of some selected ESP rules during a single run of a foraging task over 4 seasons. The overall process lasts 20000 action steps in total, consisting of 4 seasons of 5000 action steps each. The foraging task starts with the summer season and switches to the other season every 5000 action steps. Measurements were sub-sampled every 20 action steps to allow a better visualization. The figures given in the first column show the cumulative number of items of both types collected throughout the process. After each synaptic change, we separately test the network on the same task without continuous learning (i.e., fixing the weights), to show how each synaptic change affects the performance of the network. These results are shown in the figures given in the second column.

Figure 5: Foraging task: results of three selected ESP rules during a single run of a foraging task over 4 seasons. Each row of the figures provides the results of the best ESP rules ID:1, ID:2 and ID:6, respectively. Figures in the first column show cumulative results on the number of items collected, whereas figures in the second column show the number of items collected when the agent is tested independently for 5000 action steps using the configuration of its ANN at the time of the measurement.

Table 9: Foraging task: results (mean ± std over 100 trials) of the best ESP rule (ID:1) for each season.

| Season | Fitness | G | B | W |
|---|---|---|---|---|
| Summer1 | 50.81 ± 10 | 53.37 ± 9.5 | 2.56 ± 1.8 | 5.53 ± 3.5 |
| Winter1 | 48 ± 9.2 | 4.13 ± 2.2 | 52.1 ± 8.4 | 3.6 ± 3.6 |
| Summer2 | 50.37 ± 10.2 | 53.71 ± 9.64 | 3.3 ± 1.7 | 3.18 ± 3.4 |
| Winter2 | 50.61 ± 10.2 | 3.7 ± 1.9 | 54.33 ± 9.5 | 4.7 ± 6.3 |


Figure 6: Foraging task: distribution of fitness values of the best ESP rule ID:1 over 4 seasons ((a) Summer1, (b) Winter1, (c) Summer2, (d) Winter2). The x- and y-axes of each subfigure show the fitness value and the number of evaluations, respectively.

More specifically, Figures 5a and 5b show the results of ESP rule ID:1, Figures 5c and 5d those of ESP rule ID:2, and Figures 5e and 5f those of ESP rule ID:6.

In the case of the ESP rules ID:1 and ID:2, the agent quickly learns to collect the correct type of items in each season. After a seasonal change, the number of items that were correct in the previous season stabilizes, while the number of items that were incorrect in the previous season starts to increase, since the two item types swap roles. We observe in Figures 5b and 5d that there are distinct and stable trends for each season. The noise in the measurements is due to the stochasticity of the process.

In the case of the ESP rule ID:6, the agent keeps collecting both kinds of items; however, the number of correctly collected items is larger than the number of incorrectly collected items in each season. We can observe in Figure 5f that the testing performance is not stable throughout the process. The fluctuations suggest that the configuration of the ANN frequently shifts between learning and forgetting.

Table 9 and Figure 6 show, respectively, the average results and the distribution of the fitness of the best performing ESP rule (ID:1) in each season, over 100 trials. The agent achieves an average fitness value of about 50 in all the seasons, except for the first winter season where it achieves a fitness value of 48.


Table 10: Foraging task: results of the best performing ESP rule with validation (mean ± std).

Season    Fitness        G              B              W
Summer    61.0 ± 8.62    61.0 ± 8.62    0 ± 0          0 ± 0
Winter    63.0 ± 8.19    0 ± 0          63.0 ± 8.19    0 ± 0

Table 10 shows the results of the best performing ESP rule (ID:1) with validation. In validation, the agent is tested independently every 20 action steps (the configuration of its ANN is fixed and tested separately on the same task without using continuous learning), and its fitness value at the end of the testing is stored.

Finally, we use the Wilcoxon rank-sum test (Wilcoxon, 1945) to assess the statistical significance of the differences between the results produced by the different algorithms. The null hypothesis, i.e. that the means of the results produced by two algorithms (thus, their agents' behavior) are the same, is rejected if the p-value is smaller than α = 0.05. We perform pairwise comparisons of three sets of fitness values for the summer and winter seasons, obtained from three agents evaluated over 100 evaluations. More specifically, we compare the results of the hand-coded rule-based agent (see the "Perfect Agent" described in Section 4.1) with those of the agents that use the best performing ESP rule (ID:1) in continuous learning settings, with and without validation. Based on the pairwise comparisons, the hand-coded rule-based agent is significantly better than the other two agents with and without validation, with significance levels of 1.9 · 10^−28 and 8.2 · 10^−5, respectively. Furthermore, the results of the agent with validation are significantly better than those without validation, with a significance level of 9.9 · 10^−14.
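A minimal sketch of this test procedure is given below, assuming the 100 per-evaluation fitness values of each agent are already available as arrays; the three arrays here are random placeholders, not the paper's data, and the agent names are only labels.

```python
# Sketch of the pairwise Wilcoxon rank-sum comparisons described above.
# The three fitness arrays are random placeholders, not the paper's results.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
fitness = {
    "perfect agent": rng.uniform(0, 100, size=100),
    "ESP (ID:1) with validation": rng.uniform(0, 100, size=100),
    "ESP (ID:1) without validation": rng.uniform(0, 100, size=100),
}

alpha = 0.05
names = list(fitness)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        stat, p = ranksums(fitness[names[i]], fitness[names[j]])  # two-sided test
        verdict = "reject H0" if p < alpha else "fail to reject H0"
        print(f"{names[i]} vs. {names[j]}: p = {p:.2e} -> {verdict}")
```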

D.3 Sensitivity Analysis to the Number of Hidden Neurons

We performed a sensitivity analysis of our results with respect to the number of hidden neurons. In Figure 7, we show the average results of 100 trials, each performed using the best performing ESP rule (ID:1), with various numbers of hidden neurons. We tested networks with 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 hidden neurons. The results show a 13-point increase in the average fitness value when the number of hidden neurons is increased from 5 to 20 (the latter is the value used in all the experiments reported in the paper). There is also a slight upward trend when more than 20 hidden neurons are used, which results in a further 3-point average fitness gain when the number of hidden neurons is increased from 20 to 50. Moreover, the standard deviations are slightly reduced as the number of hidden neurons increases, showing that the use of more hidden neurons tends to make the agent's behavior more consistent across different trials.
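The sweep itself can be organized as in the short sketch below; run_trial is a hypothetical placeholder for one full foraging trial with the chosen number of hidden neurons (here it simply returns a random value), so only the loop structure and the aggregation over 100 trials reflect the procedure described above.

```python
# Sketch of the hidden-layer sensitivity sweep described above. run_trial is a
# hypothetical placeholder for one full foraging trial; it returns random values.
import numpy as np

rng = np.random.default_rng(2)

def run_trial(n_hidden: int) -> float:
    # Placeholder: a real trial would build an ANN with n_hidden hidden neurons
    # and run it on the foraging task with the best ESP rule (ID:1).
    return float(rng.uniform(0, 100))

for n_hidden in range(5, 55, 5):                         # 5, 10, ..., 50
    scores = [run_trial(n_hidden) for _ in range(100)]   # 100 trials per setting
    print(f"{n_hidden:2d} hidden neurons: "
          f"{np.mean(scores):5.1f} ± {np.std(scores):.1f}")
```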

D.4 Change of the Synaptic Weights After Seasonal Changes

Figures 8 and 9 show the actual values of the connection weights between the input and hidden layers, and between the hidden and output layers, in matrix form. Each column and row in the figures corresponds to a neuron in the input, hidden or output layer. In particular, Figures 8a and 9a show the randomly sampled initial connection weights between the input and hidden layers and between the hidden and output layers. We performed three seasonal changes in the following order: summer, winter, summer, and used the best performing ESP rule (ID:1) to perform the synaptic changes during the seasons. The remaining figures show the connection weights after each consecutive season.

The connection weights between the input and hidden layers appear to be distributed in the range [−1, 1]; on the other hand, the connection weights between the hidden and output layers tend to be distributed within the range [0, 1] at the end of each season. This may be due to the activation function of the output neurons, where only the neuron with the maximum activation value is allowed to fire.
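To make the last point concrete, the toy sketch below shows a winner-take-all output layer with binary firing, consistent with the description above; the sizes and weight values are arbitrary illustrations, not taken from the paper.

```python
# Toy illustration of a winner-take-all output layer: only the output neuron
# with the maximum net input fires. Sizes and values are arbitrary.
import numpy as np

def output_layer(w_ho, hidden):
    net = w_ho @ np.append(hidden, 1.0)   # last column of w_ho acts as the bias
    fired = np.zeros_like(net)
    fired[np.argmax(net)] = 1.0           # winner-take-all: exactly one neuron fires
    return fired

hidden = np.array([1.0, 0.0, 1.0])        # binary hidden activations (toy size)
w_ho = np.array([[0.2, 0.1, 0.4, 0.05],
                 [0.3, 0.2, 0.1, 0.10],
                 [0.1, 0.5, 0.2, 0.30]])
print(output_layer(w_ho, hidden))         # -> [1. 0. 0.]
```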

Figure 7: Foraging task: average fitness values of the ANNs with various numbers of hidden neurons.


Figure 8: Foraging task: heat map of the intensities of the connection weights between the input and hidden neurons during a single run using the best ESP rule (ID:1). The x- and y-axes show the input and hidden neuron indices respectively (the 7th column shows the biases). Each connection on the heat map is color coded based on its intensity in [−1, 1]. Panels: (a) randomly sampled initial connection weights; (b) connection weights after Summer1; (c) connection weights after Winter1; (d) connection weights after Summer2.


Figure 9: Foraging task: heat map of the intensities of the connection weights between the hidden and output neurons during a single run using the best ESP rule (ID:1). The x- and y-axes show the hidden and output neuron indices respectively (the 21st column shows the biases). Each connection on the heat map is color coded based on its intensity in [−1, 1]. Panels: (a) randomly sampled initial connection weights; (b) connection weights after Summer1; (c) connection weights after Winter1; (d) connection weights after Summer2.
