Using indirect encoding in neural networks to solve supervised machine learning tasks


MSc Artificial Intelligence

Track: Intelligent Systems

Master Thesis

Using indirect encoding in neural networks to solve supervised machine learning tasks

by

Timothy F. Doolan

Student ID: 5651735

April 20, 2015

Supervisor: Dr. Shimon Whiteson

Assessors: Dr. Leo Dorst, Dr. Maarten van Someren


Abstract

HyperNEAT has been one of the most successful indirect encoding methods in the past few years, mainly due to its ability to directly exploit geometric regularity in the input substrate, resulting in good performance on a wide variety of problems. In this thesis we explore the possibility of transferring HyperNEAT's ability to exploit geometric regularity to the supervised domain. To this end we develop a novel algorithm, HyperBackprop, and show that it works well on a supervised version of the boxes problem. We also create a more challenging version of the boxes problem and compare the performance of HyperBackprop with backpropagation and roving eye.


Introduction

Artificial Neural Networks (ANNs) [11] are computational models inspired by the structure of animal brains and are used to approximate unknown functions that can depend on a large number of inputs. The model is defined as a set of nodes (representing the neurons) connected to each other by some weight value (the strength of the connection). The structure and strength of these connections define the relationship between the input and output nodes. Modifying these attributes in a systematic way can allow the network to ‘learn’ to produce desired outputs. ANNs have been successfully applied to a wide variety of fields in AI, such as speech recognition [10] and computer vision (especially character recognition) [15]. There are two common ways to adapt ANNs to a target problem: supervised methods and neuroevolutionary methods.

The most common supervised method for ANNs is backpropagation [17]. As with all supervised methods it is trained using input vectors and their corresponding output vectors for a given task, which are assumed to be related by some underlying function. It uses gradient descent [17] to incrementally modify the connection weights so that the computed outputs more closely resemble the provided outputs for each corresponding input, thereby hopefully capturing the intended function.

Besides supervised methods, neuroevolutionary methods are also frequently used to train ANNs. Neuroevolutionary methods use a genetic algorithm [9] to find suitable weights and in some cases also node topologies to solve a task. In genetic algorithms the phenotype, in this case the ANN, is encoded in a representation called the genotype. Populations of different random genotypes can be generated and the resulting ANNs can be evaluated on some task. Note that this does not require an exact solution to be known for each input as with supervised methods, but only that produced outputs can be ordered by some measure, called the fitness function. Parts of better performing genotypes can be randomly mutated or crossed over, producing new genotype offspring, which can then again be selected, until a satisfactory solution to the task in question is found.


high dimensional problems can easily become intractable [1, 3]. Indirect encodings [2, 12] are a representational approach commonly used to limit the search space for these computationally intensive problems. They contrast with direct encodings, where there is a direct mapping from the genotype to phenotype, i.e. evolution is in principle applied to the problem solution directly. An indirect encoding is generative in nature, meaning a set of rules or a grammar is applied to generate the phenotype from the usually much smaller genotype, as is also the case in biological evolution. This generative approach encourages both the use of regularity and the reuse of local solutions in solving the problem [12], and also allows for highly complex solutions to be expressed and computed in a tractable way [22].

Indirect encodings have been particularly successful in simulating biological processes [18, 2, 12], but applying them to practical problems can be tough. The main challenge is the choice of encoding representation, as that is what determines the dimensionality of the search space. One of the most successful indirect encoding methods in the past few years has been HyperNEAT [20], which uses a Compositional Pattern Producing Network (CPPN) [19] as its encoding scheme. The CPPN is used to compute the weights for a much larger phenotype ANN, called the substrate. The topology and weights of the CPPN are evolved using NeuroEvolution of Augmenting Topologies (NEAT) [21], with the fitness of the resulting substrate defining the fitness of the CPPN.

One of the main benefits of HyperNEAT is that the substrate weights are computed directly from the Cartesian coordinates of the input and output nodes using the CPPN, meaning geometric regularities in the problem, such as symmetry, repetition and repetition with variation, can be exploited naturally [20]. HyperNEAT has been successfully applied to a wide variety of problems: translation-invariant object detection [20], a food gathering controller [20], a quadruped robot movement controller [4], a multi-agent controller with a single encoding [5], an octopus arm controller [27] and a Go action-selector [8]. Note that [24] has shown some of these problems require relatively simple forms of regularity, with the solution stemming more from the way the substrate is generated than the complexity of the CPPN itself.

The focus of this research is to explore methods to train Artificial Neural Networks on supervised data with some form of geometric regularity. In particular we try to see if the way the geometric regularity of a problem is exploited by HyperNEAT can be carried over to the supervised domain in any way. To this end we attempt to create an algorithm that can use backpropagation to train a large ANN and then compress the resulting set of weights using a much smaller ANN while preserving any geometric regularity in the solution. This thesis develops a specific instance of such a supervised indirect-encoding algorithm, compares it against several benchmarks and analyses the results.

Comparisons are done on a supervised version of the visual discrimination task from [20], where the substrate should indicate the center of a 3x3 square in a large 2-dimensional input field. The main challenge of this supervised indirect-encoding algorithm is finding a means of constraining the substrate weights to the space that can be expressed by the CPPN. This is because HyperNEAT and other evolutionary methods with an indirect-encoding are inherently limited to the weight-space that can be represented by the encoding, while when training the substrate directly there is no such restriction.

Our approach alternates backpropagation on the substrate and on the CPPN, which leads us to develop a new algorithm, called HyperBackprop (HB). It compresses intermediate substrate states during backpropagation using the CPPN and then continues training from the best approximation produced by the CPPN. Backpropagation on the substrate weights requires a fixed topology to be set, so we start using a hand-coded topology that is known to be capable of representing the solution, but later on we discuss how to include a more flexible topology. We show that HB outperforms regular backpropagation when using a small subset of the training data, which demonstrates that the method is able to generalize over input geometry. However, initially those results were only produced in a small percentage of the experiments run. We tested a number of different modifications to the algorithm to try and improve this behavior, most importantly using a much larger CPPN topology, which resulted in a big boost in performance.

Next we consider a more suitable benchmark to this problem, the Roving Eye (RE), which moves a smaller ANN across the input substrate, making it ideal to deal with a repeated problem. In order to try and differentiate between HB and RE, we develop a more complex version of the boxes problem, where the task changes based on the location of the larger box. We then analyse the performance of backpropagation, RE and HB on this location dependent boxes problem and show that HB does outperform both benchmarks. This demonstrates that HB is capable of successfully exploiting more complex forms of regularity.

In summary we hypothesize that geometric regularity inherent in some tasks can be exploited using supervised learning. To test this we develop a supervised algorithm that enforces geometric regularity on ANNs and analyze its results on a supervised version of the boxes problem, including a more complex variation. We confirm our hypothesis by showing that HB is capable of capturing the underlying structure in both these problems.


Chapter 2

Background

In this chapter we give some background on Backpropagation, NEAT and HyperNEAT. In particular we will show HyperNEAT’s ability to exploit geometric regularity.

2.1 Backpropagation

Backpropagation is the most common supervised learning method for ANNs. It is a generalization of the delta rule for single layer networks, which extends it to multi-layer feedforward networks. In single layer networks the output for some node $j$ is computed by multiplying the inputs ($x_i$, the $i$-th input) with the corresponding weights ($w_{ji}$, the weight from input $i$ to node $j$). This produces the node input ($net_j = \sum_i x_i w_{ji}$), which is then fed into the activation function ($f(x)$, usually the hyperbolic tangent $\frac{1 - e^{-2x}}{1 + e^{-2x}}$) to compute the node output ($y_j = f(net_j)$). The delta rule is a gradient descent method which minimizes the error function ($\frac{1}{2}\sum_j (t_j - y_j)^2$, where $t_j$ is the target value for node $j$) for single layer networks, computing the updates for each weight ($\Delta w_{ji}$) based on the difference between the produced output ($y_j$) and the intended target ($t_j$). The computed difference for each weight is multiplied by a learning rate ($\eta$, usually 0.9), to allow the gradient descent to make small steps in the direction that will minimize the error, eventually leading to convergence.

$$\Delta w_{ji} = \eta (t_j - y_j) f'(net_j) x_i$$

In the case of backpropagation, the update is computed by $\Delta w_{ji} = \eta \delta_j x_i$, where $\delta_j$ is the error term for node $j$. In the case of an output node, $\delta_j = (t_j - y_j) f'(net_j)$, resulting in the original delta rule. For hidden nodes the error can be iteratively computed for each layer, starting at the output layer and working back using each next layer’s $\delta$ values, propagating the error signal backwards through the network:

$$\delta_j = f'(net_j) \sum_k \delta_k w_{kj}$$

Backpropagation uses the chain rule to decompose the error gradient per network layer, so that each weight update follows the direction that decreases the overall error. Note that due to the non-linear nature of the multi-layer network, the minimum found is not guaranteed to be the global minimum, meaning it is possible to get stuck in a local minimum (where no weight update will decrease the error without first increasing it). However, it turns out that for many practical problems this is not an issue [17]. A common improvement is to compute the update per training example (online update), instead of the average over all the training samples. While this may decrease the theoretical chance of convergence (the algorithm follows the gradient of individual examples instead of the average gradient over all examples), it does significantly speed up convergence in most cases, as many updates are performed for one loop through the training examples, and again in most practical cases this does not hinder convergence.
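The update rules of this section can be sketched in plain Python for a network with a single hidden layer and tanh activations. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names, the nested-list weight layout and the learning rate of 0.1 are all chosen for the example only.

```python
import math

def tanh_prime(y):
    # Derivative of tanh written in terms of the node output y = tanh(net).
    return 1.0 - y * y

def forward(x, w_hidden, w_out):
    # w_hidden[j][i]: weight from input i to hidden node j.
    # w_out[k][j]: weight from hidden node j to output node k.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, out

def online_update(x, t, w_hidden, w_out, eta=0.1):
    """One online backpropagation update for a single example (x, t)."""
    hidden, out = forward(x, w_hidden, w_out)
    # Output nodes: delta_k = (t_k - y_k) f'(net_k)
    delta_out = [(tk - yk) * tanh_prime(yk) for tk, yk in zip(t, out)]
    # Hidden nodes: delta_j = f'(net_j) * sum_k delta_k w_kj
    # (computed before any weight is changed).
    delta_hidden = [
        tanh_prime(hj) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, hj in enumerate(hidden)
    ]
    # Weight updates: delta_w_ji = eta * delta_j * (input to that weight)
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * x[i]

def sum_squared_error(data, w_hidden, w_out):
    # Error function 1/2 * sum_j (t_j - y_j)^2, summed over all examples.
    total = 0.0
    for x, t in data:
        _, out = forward(x, w_hidden, w_out)
        total += 0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, out))
    return total
```

Calling `online_update` once per example inside a loop over the training set gives the online variant described above; accumulating the weight changes over all examples before applying them would give the averaged (batch) variant.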

2.2 NEAT

NeuroEvolution of Augmenting Topologies (NEAT) is a neuroevolutionary method that attempts to both optimize and complexify solutions by evolving both the weights and topology of an ANN simultaneously [21]. It does so by starting with a uniform population of fully connected ANNs without hidden nodes and then performing either simple weight mutations or structural mutations. Structural mutations can either be made by adding a connection between unconnected nodes or by adding a node in between a connection. In the latter case the original connection is removed, the new node is added, the weight of the connection leading into this new node is set to 1 and the connection leading out is set to the same weight as the old connection. Examples of both can be seen in Figure 2.1. NEAT also uses historical markings, which are created by assigning a global innovation number to each new mutation, to improve the overall fitness of the population in two ways. Firstly, they are used to align the genotypes of any two ANNs for crossover by combining the new structural elements based on the placement of any shared ancestral mutations. Secondly, they can be used to compute the topological distance between two ANNs, which means the population can be grouped according to some distance threshold, allowing rarer individuals to be assigned a higher fitness.
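The add-node mutation and the role of innovation numbers can be illustrated with a short Python sketch. The `ConnectionGene` class, the `add_node` function and the global counter are hypothetical names introduced for this example. The sketch marks the split connection as disabled rather than deleting it, which is how the original NEAT formulation handles the removal; the effect on the network is the same.

```python
import itertools
from dataclasses import dataclass

@dataclass
class ConnectionGene:
    source: int
    target: int
    weight: float
    innovation: int       # historical marking of this structural element
    enabled: bool = True

# Global innovation counter shared by the whole population (assumed layout).
innovation_counter = itertools.count(1)

def add_node(genes, index, new_node_id):
    """Split the connection genes[index] by placing new_node_id in between."""
    old = genes[index]
    old.enabled = False   # the original connection no longer takes part
    # Incoming connection gets weight 1, outgoing keeps the old weight,
    # so the mutation initially changes the network behaviour very little.
    genes.append(ConnectionGene(old.source, new_node_id, 1.0,
                                next(innovation_counter)))
    genes.append(ConnectionGene(new_node_id, old.target, old.weight,
                                next(innovation_counter)))
    return genes
```

Because every new gene receives a unique, increasing innovation number, two genotypes can later be aligned gene by gene on those numbers during crossover.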

Figure 2.1: This figure shows the types of structural mutations NEAT can make. In the case of adding a connection, a connection is added between two previously unconnected nodes. In the case of adding a node, an existing connection is removed and replaced by two connections with an additional node in between.

2.3 HyperNEAT

HyperNEAT uses NEAT to evolve a Compositional Pattern Producing Network [19] that defines a function $f : \mathbb{R}^4 \to \mathbb{R}$, which serves as an indirect encoding for the weights of a larger substrate network. The substrate has a 2-dimensional array of input nodes and a fully connected output layer of equal size, the weights of which are set by the CPPN. The CPPN computes the weight of a connection based on the Cartesian coordinates of the input and output node of the connection, $f(x_{in}, y_{in}, x_{out}, y_{out}) \to$ connection weight, as seen in Figure 2.2. In this figure we see a connection in the substrate being defined by the input layer coordinates $X_I, Y_I$ and the output layer coordinates $X_O, Y_O$ serving as the input for the CPPN. The CPPN’s resulting output is set as the weight of that connection in the substrate. This is repeated for each possible connection in the substrate, meaning all substrate connections are defined by the CPPN, a much smaller ANN. The fitness of a CPPN is computed by applying the resulting substrate to the task.

This method of encoding has two main benefits. The first is that the CPPN defines the relationship between the substrate weights and thus is not limited to a particular substrate size. This means that the substrate can be rescaled without requiring further evolution. The second is that it generates the substrate from a much lower complexity CPPN, thereby enforcing some measure of regularity in the resulting substrate weights. This enforced regularity can be used to naturally exploit the inherent geometric regularities in certain problems [20]. These regularities include symmetry, repetition and repetition with variations, for example allowing smaller substrate patterns to be repeated across the substrate.
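How a substrate is generated from a coordinate function, and why it can be rescaled freely, can be sketched as follows. The `toy_cppn` below is a hand-written stand-in for an evolved CPPN, and the mapping of grid indices to coordinates in [-1, 1] is an assumption made for this example; neither is taken from the experiments in this thesis.

```python
import math

def toy_cppn(x_in, y_in, x_out, y_out):
    # Hypothetical stand-in for an evolved CPPN: a Gaussian of the offset
    # between input and output node, so the same local weight pattern is
    # repeated at every location of the substrate.
    dx, dy = x_out - x_in, y_out - y_in
    return math.exp(-(dx * dx + dy * dy) / 0.1)

def node_coordinates(size):
    # Assumed convention: grid indices are spread evenly over [-1, 1].
    return [-1.0 + 2.0 * i / (size - 1) for i in range(size)]

def generate_substrate(cppn, size):
    """Query the CPPN once for every input-output connection."""
    c = node_coordinates(size)
    return {
        ((xi, yi), (xo, yo)): cppn(c[xi], c[yi], c[xo], c[yo])
        for xi in range(size) for yi in range(size)
        for xo in range(size) for yo in range(size)
    }
```

Calling `generate_substrate(toy_cppn, 11)` yields one weight per connection of an 11x11 substrate, and calling it with a different `size` rescales the substrate without changing the CPPN.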


Figure 2.2: This figure shows the relationship between the substrate and the CPPN in HyperNEAT. The Cartesian coordinates of the input and output nodes serve as the input for the CPPN, which computes the weight value for the corresponding connection.


Chapter 3

Related Work

Although currently no supervised method for a large geometrically regular ANN exists to the author’s knowledge, some existing methods do attempt to address related problems.

Soft weight sharing [16] uses a mixture model of Gaussians to cluster the weights according to value. It adds an extra error term to the backpropagation for the error of the mixture model and the number of clusters required, which coerces the weights to adopt similar values. However, it does not consider any spatial relationship between the weights, only the similarity of their values and thus does not make use of geometric regularity as HyperNEAT does.

Auto-encoders have a hidden layer with fewer nodes than the input layer, which are then used to reconstruct the input as well as possible in the output layer (the hidden layer serves as a kind of PCA). These auto-encoders can be stacked, with each hidden layer being the input to the next layer to form a deep architecture. Each layer can be trained separately and unsupervised, because the input is equal to the desired output [6]. This unsupervised step introduces a regularity in the deep network (especially important in the lower layers) that helps the supervised training converge more reliably and to better results. While this method does exploit the regularity in a problem, it does so by repeatedly reducing the complexity of the input, making it more suitable for classification than regression problems.

CoSyNe [13] is an evolutionary method that uses Discrete Cosine Transform (DCT) coefficients as an indirect encoding for the weights of a neural network. It applies the inverse DCT to generate the weight matrix for the (recurrent) network and evaluates the results it produces. The fewer coefficients used, the smaller the search space and the more spatially correlated the weights become, until with one coefficient all weights become equal. While modeling the weights as a sum of cosines might not produce exactly the same notion of regularity as HyperNEAT’s CPPN does, it is an interesting alternative indirect-encoding to consider.
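The effect of truncating the coefficient vector can be seen in a one-dimensional sketch. The function below is an illustrative, unnormalised inverse DCT written for this example (scaling factors are omitted and the name `weights_from_coefficients` is hypothetical); it is not the implementation from [13], which works on weight matrices.

```python
import math

def weights_from_coefficients(coefficients, n_weights):
    """Expand a short list of cosine coefficients into n_weights weights."""
    return [
        sum(c * math.cos(math.pi / n_weights * (n + 0.5) * k)
            for k, c in enumerate(coefficients))
        for n in range(n_weights)
    ]
```

With a single coefficient only the constant term (k = 0) remains, so every weight takes the same value; each extra coefficient adds one higher-frequency cosine and lets neighbouring weights differ more.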

Convolution networks [14] perform convolutions of multiple small features across the input space and subsample the results of each feature. Then the process is repeated on each of the subsampled feature results and the final set of outputs is used as an input for an ANN. The produced features are the result of a regular pattern applied across the entire 2D input space. However, the final output is that of a small ANN, instead of an output layer equal to the input layer, making it more suitable as a classifier.

NEATQ [25] is a reinforcement learning algorithm that combines both evolutionary and supervised learning. It uses NEAT to evolve a function approximator for a value function that Q-learning uses to select the optimal action at each state. While evaluating the policy produced by the network, it also computes the value function update for each state it visits and uses the update as a target for backpropagation on the function approximation network. Thus the method refines the evolved solution by using the Q-learning update rule to provide additional supervised information for the network. In contrast, we aim to take properties of an evolutionary method and translate them to the supervised domain.

HybrID [4] aims to combine the advantages of direct and indirect encodings. It first evolves a solution from an indirect encoding using HyperNEAT and then at a certain point switches to a direct encoding of the substrate using fixed topology NEAT, which only modifies the existing weights without modifying the topology. This allows the method to start off by evolving some regular solution and then refining the solution using specific weight mutations. This addresses an inherent issue in indirect encoding, namely the inability to make specific localised adjustments to the solution.


Chapter 4

Problem Description

4.1 The boxes problem

The question immediately arises what, if any, problems might be suited to benefit from supervised training of a large, semi-regular neural network. One such problem might be pattern recognition in images, where the relative relationship between pixels is more important than the location in the image. We will look at a simplified version of this problem that has a straightforward geometric pattern. It is a modified version of the visual discrimination task from the original HyperNEAT paper, which is described below.

4.1.1 Description

As described by [20], the boxes problem is an 11x11 input layer, with a fully connected output layer of the same dimension. On the input layer a 3x3 and a 1x1 square are presented at distinct locations, where presence of a square is indicated by a 1.0 input and absence by 0.0. The goal is to locate the center of the larger square in the visual field, represented by the highest activation in the output layer; the smaller square serves as a distractor. HyperNEAT uses a fitness function which is inversely proportional to the distance between the highest activation and the actual center of the larger square.

HyperNEAT also added two additional inputs for the CPPN on this problem, dx and dy, indicating the difference between the x-coordinates and between the y-coordinates of the input and output node respectively. We will also include these additions in our CPPN inputs.

4.1.2 Producing supervised targets for the boxes problem

The boxes problem as described by Stanley in [20] is only evaluated using a fitness function which measures the distance between the highest activation and the actual center, meaning all other activation values are ignored. In contrast, a supervised method requires a complete target output, i.e. an intended value for every output node for each set of inputs. The simple solution would be for all outputs to be 0, except for the center, with a value of 1. However, the substrate-CPPN design does not allow for hidden layers in the substrate, meaning the outputs can only be a linear combination of the inputs, which such outputs clearly would not be. Thus in order to create a target output that can actually be learned, we must ensure that it is a linear combination of inputs. The easiest way to do this is to use an existing substrate solution to compute them.

Figure 4.1: The hand coded CPPN topology and weights used to compute the substrate with which to generate the training targets. This is to ensure that the substrate solution can be expressed by the CPPN. The nodes are B=Bias; L=Linear; G=Gaussian and T=Tanh. The inputs are (ix, iy) the coordinates of the substrate input node; (ox, oy) the coordinates of the substrate output node and (dx, dy) the distances between them.

Additionally, the CPPN topology should be capable of representing the substrate weights needed to produce these targets, meaning the substrate weight pattern must be compressible in a much lower complexity ANN. In order to address both these problems, we will hand-code both the topology and weights for the CPPN and use this to generate a solution substrate. This solution substrate can then be used to compute the corresponding output for each input. For an 11x11 grid, where the small box is 1x1 and the large box is 3x3 (presence of a box is indicated by 1.0, absence by 0.0), this results in 9072 input combinations for which the output can be computed and stored. A subset of these will be used for training, while the rest is used for testing. This dataset should ensure that the CPPN topology is capable of representing the weights needed for the substrate to produce the required output. The CPPN used to compute the substrate is shown in Figure 4.1 and some examples of substrate input-output pairs are shown in Figure 4.2.
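The figure of 9072 input combinations can be checked by enumeration. The sketch below assumes that the small box may occupy any cell not covered by the large box; this non-overlap rule is an assumption made here because it reproduces the stated count (81 placements of the large box times 112 free cells). The function names are hypothetical.

```python
from itertools import product

GRID = 11

def enumerate_inputs(grid=GRID):
    """Yield (large_top_left, small_position) pairs as (x, y) tuples."""
    # The 3x3 box fits at 9 x 9 = 81 top-left positions on an 11x11 grid.
    for lx, ly in product(range(grid - 2), repeat=2):
        large_cells = {(lx + i, ly + j) for i in range(3) for j in range(3)}
        for sx, sy in product(range(grid), repeat=2):
            if (sx, sy) not in large_cells:   # assumption: no overlap
                yield (lx, ly), (sx, sy)

def render(large, small, grid=GRID):
    """Build the input field: 1.0 where a box is present, 0.0 elsewhere."""
    field = [[0.0] * grid for _ in range(grid)]
    lx, ly = large
    for i in range(3):
        for j in range(3):
            field[ly + j][lx + i] = 1.0
    field[small[1]][small[0]] = 1.0
    return field
```

Each rendered field would then be passed through the solution substrate to obtain its stored target output.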

4.2 Location dependent boxes problem

Given our objective is to transfer HyperNEAT’s ability to exploit geometric regularity to the supervised domain, our benchmark problems should cover as wide a range of the associated attributes as possible. To this end we will also consider a more complicated version of the boxes problem, testing the repetition with variation property of HyperNEAT. This property allows HyperNEAT to not only deal with simple repetition in the substrate, but also to vary this repetition as a function of the substrate location. It will be interesting to see if our new supervised method will also inherit this property.

Figure 4.2: Two substrate inputs and their corresponding generated outputs. The output values are scaled in grey from 0.0 to 1.0.

In order to test for this property, we will adapt the boxes problem by changing the intended output based on the location of the larger box, e.g. if the box is in the top left of the substrate, then the top left of the box should produce the highest output (instead of the centre). This applies for all 9 possible substrate quadrants, as shown in Figure 4.3. The outputs for this task are created by taking the outputs produced for the original boxes problem (as described in 4.1.2, where the highest output is always in the centre) and shifting it in the appropriate direction based on the input box location. Figure 4.4 shows a sample input-output pair to which the shift has been applied. We call this variation the location dependent boxes problem.
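The construction of the shifted targets can be sketched as follows. The exact region boundaries are defined by Figure 4.3 and are not reproduced here, so this sketch assumes that the nine possible centre coordinates along each axis (1 to 9) are split into three equal bands and that cells vacated by the shift are filled with 0.0. Both choices, and all function names, are assumptions made for illustration.

```python
def region_shift(center, grid=11):
    """Return -1, 0 or +1 depending on which band the box centre falls in."""
    # Centres of a 3x3 box range over 1 .. grid - 2 along each axis.
    band = (grid - 2) / 3.0
    index = max(0, min(2, int((center - 1) // band)))
    return index - 1

def shift_output(output, dx, dy, fill=0.0):
    """Shift a square 2D list by (dx, dy), padding vacated cells with fill."""
    size = len(output)
    shifted = [[fill] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            nx, ny = x + dx, y + dy
            if 0 <= nx < size and 0 <= ny < size:
                shifted[ny][nx] = output[y][x]
    return shifted

def location_dependent_target(output, large_center):
    """Shift the original boxes output according to the large box location."""
    cx, cy = large_center
    return shift_output(output, region_shift(cx), region_shift(cy))
```

For a large box centred in the top left region both shifts are -1, which moves the peak from the centre of the box to its top left cell; in the middle region the output is left unchanged.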


Figure 4.3: This image shows the division of the 9 quadrants in the 11x11 substrate. The overlay in each quadrant marks the position with the highest output for the 3x3 target box using a black square.

Figure 4.4: This image shows a data sample for the location dependent boxes problem, where the original output has been shifted along both the x and y axes, based on the location of the large box.


Chapter 5

HyperBackprop

We will develop a supervised algorithm that attempts to replicate HyperNEAT’s property of exploiting geometric regularity. I.e., given a supervised task repeated across a large 2d input array, the algorithm should be able to generalize the solution for the entire input area using as few examples as possible. In this chapter we will focus on describing the algorithm while highlighting the design choices made during the development. In chapter 6 we will show the actual experimental results validating these choices.

The algorithm we are proposing is based on the substrate-CPPN model, as this is what gives HyperNEAT its ability to exploit geometric regularity. This means the algorithm uses two ANNs, a substrate and a CPPN, with the CPPN regulating the weights of the substrate. Since both the substrate and the CPPN are ANNs, supervised learning will be done using backpropagation.

HyperNEAT evolves both the topology and the weights of the CPPN. Our supervised algorithm will start with just learning the weights of a fixed topology CPPN in order to avoid the added complexity of adapting the CPPN topology while learning the weights. In section 5.3 we will discuss an extension of the algorithm to include finding a CPPN topology suitable for the problem.

5.1 Interleaving substrate and CPPN training

The open question in our suggested approach is how to combine the learning of the substrate and the CPPN in a meaningful way, i.e. how to alternate or combine updates to the substrate and the CPPN. Herein lie two competing issues:

The first is a direct result of the type of problem we are considering; not all substrate weights will receive meaningful updates during training. As the aim is to be able to generalize the solution across the input area, only a subset of all possible task locations will be presented with relevant training data, meaning not all weights will receive updates related to solving the repeated task. The role of the CPPN is to model the regular solution to the task and to repeat that model across the entire substrate. The assumption is that the weights receiving relevant updates should contain a pattern, while the other weights remain random. The more similar each of the substrate sub-solutions is, the easier it becomes for the CPPN to extract that pattern. This means that with more training iterations on the substrate, the local pattern will become more regular and thus improve the quality of the overall solution.

The second is a result of the fact that the CPPN has a much lower complexity than the substrate, severely limiting the types of patterns the CPPN can represent. The substrate has 14641 ($= 11^2 \times 11^2$) adjustable weights, while the CPPN has 23 ($= 7 \times 3 + 2 \times 1$, see Figure 4.1) weights when fully connected, meaning that the set of substrate weight combinations that can be expressed/learned by the CPPN is very restricted. This restriction requires some regularity or consistent coherence between the values of the weights in order to be expressed by the much simpler CPPN. While the substrate starts at a point that can be represented by the CPPN, each substrate training iteration will move the weights towards a less regular configuration due to the distribution of the training examples. This less regular configuration will be harder for the CPPN to properly model, implying that too many substrate training iterations can be detrimental to the quality of the solution.

Both of these aspects require that the CPPN regulates the solution to which the substrate converges, but a balance must be struck between training the substrate for the local solutions and using the CPPN to model the solution for the entire substrate. To achieve this balance, the substrate backpropagation will be interleaved with computing a best CPPN approximation of that intermediate state, after which substrate training continues from the resultant approximation. This should ensure that the substrate weights remain close to a combination of weights that can be represented by the CPPN, while overall slowly converging to a regular solution to the problem.

The outline for the abstract algorithm so far can be seen in Algorithm 1. Note that it is still an abstract algorithm, because none of the loop termination conditions (outer condition, substrate condition and cppn condition) have been defined. First the training data is created by randomly selecting a percentage of samples from the dataset. Then the CPPN is instantiated randomly using the same topology as the one used to create the training data. This ensures that the CPPN is capable of representing a regular solution to the problem. Next the substrate is generated using the new CPPN, i.e. all substrate weights are computed using the CPPN.

The CPPN and substrate are trained alternately; first the substrate receives multiple backpropagation updates using the training data until the substrate condition has been met and then the CPPN is trained using those newly computed substrate weights until the cppn condition has been met. The CPPN state after training, i.e. its best approximation of the new substrate weights, is then used to generate the new corresponding substrate. These interleaved training cycles for substrate approximation are repeated until the outer condition has been met, presumably meaning the CPPN generated substrate produces a low enough error on the provided training data.


Algorithm 1 HyperBackprop (abstract version)

    training ← selectRandomSamples()
    cppn ← generateCppn()
    substrate ← generateSubstrate(cppn)
    while outer condition do
        while substrate condition do
            substrate ← backpropagation(substrate, training)
        end while
        while cppn condition do
            cppn ← backpropagation(cppn, substrate)
        end while
        substrate ← generateSubstrate(cppn)
    end while
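The control flow of Algorithm 1 can be made concrete with a small Python sketch in which fixed iteration counts stand in for the three undefined loop conditions. Everything below the `hyper_backprop` function is a deliberately tiny toy instantiation invented for this example: the "CPPN" is a single number that every substrate weight is set to, and the training data maps a few weight indices directly to target values instead of providing input-output examples. It is meant only to show the interleaving, not the networks used in this thesis.

```python
def hyper_backprop(cppn, training, generate_substrate, train_substrate,
                   train_cppn, outer_cycles, substrate_steps, cppn_steps):
    """Alternate substrate training and CPPN approximation."""
    substrate = generate_substrate(cppn)
    for _ in range(outer_cycles):             # stands in for: outer condition
        for _ in range(substrate_steps):      # stands in for: substrate condition
            substrate = train_substrate(substrate, training)
        for _ in range(cppn_steps):           # stands in for: cppn condition
            cppn = train_cppn(cppn, substrate)
        # Continue from the CPPN's best approximation of the substrate.
        substrate = generate_substrate(cppn)
    return cppn, substrate

# --- toy instantiation (hypothetical, for illustration only) ---
N_WEIGHTS = 10

def generate_substrate(cppn):
    # The one-parameter toy CPPN assigns the same value to every weight.
    return [cppn] * N_WEIGHTS

def train_substrate(substrate, training, rate=0.5):
    # Only the weights that appear in the training data receive updates.
    new = list(substrate)
    for i, target in training.items():
        new[i] += rate * (target - new[i])
    return new

def train_cppn(cppn, substrate, rate=0.5):
    # Gradient step on the mean squared error between the CPPN output
    # and the current substrate weights.
    mean_weight = sum(substrate) / len(substrate)
    return cppn + rate * (mean_weight - cppn)
```

In this toy setting only four of the ten weights are ever trained directly, yet after repeated cycles the regenerated substrate carries the learned value at all ten positions, because the untrained weights are overwritten by the CPPN approximation at the end of every cycle.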

5.2 Practical implementation

This section documents the implementation of the abstract version of HyperBackprop as described in the previous section and discusses refinements made while observing its behavior when running. The resulting version can be seen in Algorithm 2.

5.2.1 Substrate training

As stated in the previous section, substrate training should not be continued to convergence. The problem then becomes defining an alternative stopping criterion for substrate training, such that the resulting set of weights is regular enough to be approximated by the CPPN.

A solution to this that might seem logical is to use a validation set to determine when the substrate has been overfitted to the training data. However, constructing a validation scheme for the proposed interleaved approach results in a more complex and far more computationally intensive algorithm. This is because it would require computing a new CPPN approximation for each individual backpropagation step in order to determine at which point the validation score starts to deteriorate. This requires substantially more computation than computing this approximation just once for a complete backpropagation cycle.

The prohibitive complexity of this validation scheme prompts us to limit our scope to showing that the interleaving approach works as a concept in its own right. To do this we will simply introduce a new parameter to set the number of repetitions of the backpropagation steps in each substrate cycle. Experiments have shown that a value of 250 for this parameter works well, as confirmed by the results in chapter 6.

5.2.2 CPPN training

When the substrate iterations are completed, the CPPN is then trained on the newly computed substrate weights. The fact that the substrate is fully connected (i.e. that each output node is connected to all input nodes), together with the local nature of the problem (e.g. that only very few of those input connections are required to solve the boxes problem for an individual output), means that many of the weights in the substrate should not contribute in any significant way to the output.

This results in many weights converging to a value close to zero, which are easy for the CPPN to model. The consequence of this is that the real error during training must then be measured on those few relevant weights that do contribute. However, since we have no proper definition for relevance, our only measure can be the MSE over all the weights of the substrate. This means that the actual MSE during training is inherently very small, leaving no proper measure for convergence (it only works as a measure of relative improvement). As a result, the only way to terminate the CPPN training in any consistent manner is to again introduce a new parameter defining the number of training iterations. Experiments have shown that a value of 5000 for this parameter works well, as confirmed by the results in chapter 6.

5.2.3 Initialisation and termination conditions

Initialisation

The initial randomly generated CPPN can produce a substrate with such a large training error that, in combination with the computationally intensive nature of the algorithm, the problem effectively becomes intractable. To remedy this, CPPNs that produce such a substrate are simply regenerated until a reasonable substrate error is found. Experiments have shown that a threshold of 2.0 for this error works well.

Termination

Testing has also shown that the error for the outer loop (the error of the substrate produced by the best CPPN approximation) may at times decrease very little between successive iterations, giving the impression of convergence. However, in many of those cases further iterations produce a significant improvement in the error (see figure 6.2 in the next section for some example results). This means that traditional measures of convergence are not adequate in this case, and we therefore introduce a new parameter: the number of outer loop iterations. Experiments have shown that a value of 250 for this parameter works well.

Algorithm 2 HyperBackprop

1: training ← selectRandomSamples()

2: repeat

3: cppn ← generateCppn()

4: substrate ← generateSubstrate(cppn)

5: until evaluate(substrate, training) < 2

6: for 1 . . . 250 do

7: for 1 . . . 250 do

8: substrate ← backpropagation(substrate, training)

9: end for

10: for 1 . . . 5000 do

11: cppn ← backpropagation(cppn, substrate)

12: end for

13: substrate ← generateSubstrate(cppn)

14: end for
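As an illustration only, Algorithm 2 can be sketched in Python with NumPy roughly as follows. This is a minimal sketch under stated assumptions, not the implementation used for the experiments: the substrate is taken to be a single linear layer without bias, the CPPN has one hidden layer of tanh nodes, node coordinates are scaled to [-1, 1], and all helper names (coords, cppn_step, substrate_step, and so on) are hypothetical. The default loop counts, threshold and learning rates mirror the values quoted in the text; the learning rates in particular may need to be lowered for this linear stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def coords(side):
    """One row (x1, y1, x2, y2, bias) per connection; row index = output * n + input."""
    axis = np.linspace(-1.0, 1.0, side)
    pts = np.array([(x, y) for y in axis for x in axis])
    n = len(pts)
    out_idx, in_idx = np.divmod(np.arange(n * n), n)
    return np.hstack([pts[in_idx], pts[out_idx], np.ones((n * n, 1))])

def generate_cppn(n_hidden=4, scale=0.3):
    # Random initial weights in [-scale, scale]; tanh hidden nodes are an assumption.
    return {"w1": rng.uniform(-scale, scale, (5, n_hidden)),
            "w2": rng.uniform(-scale, scale, (n_hidden + 1, 1))}

def cppn_forward(cppn, q):
    h = np.tanh(q @ cppn["w1"])
    h1 = np.hstack([h, np.ones((len(h), 1))])      # append a bias unit
    return h, h1, (h1 @ cppn["w2"]).ravel()

def generate_substrate(cppn, q, n):
    # Query the CPPN once per connection and reshape to (outputs, inputs).
    return cppn_forward(cppn, q)[2].reshape(n, n)

def evaluate(substrate, x, t):
    return float(np.mean((x @ substrate.T - t) ** 2))

def substrate_step(substrate, x, t, lr):
    # One batch gradient step on half the MSE of a linear substrate (assumed).
    err = x @ substrate.T - t
    return substrate - lr * (err.T @ x) / len(x)

def cppn_step(cppn, q, target, lr):
    # One batch gradient step fitting the CPPN output to the substrate weights.
    h, h1, out = cppn_forward(cppn, q)
    err = (out - target.ravel())[:, None] / len(q)
    grad_w2 = h1.T @ err
    grad_w1 = q.T @ ((err @ cppn["w2"][:-1].T) * (1.0 - h ** 2))
    return {"w1": cppn["w1"] - lr * grad_w1, "w2": cppn["w2"] - lr * grad_w2}

def hyperbackprop(x, t, side, outer=250, sub_steps=250, cppn_steps=5000,
                  init_threshold=2.0, lr_sub=0.9, lr_cppn=0.3):
    n, q = side * side, coords(side)
    while True:                                     # lines 2-5: regenerate bad starts
        cppn = generate_cppn()
        substrate = generate_substrate(cppn, q, n)
        if evaluate(substrate, x, t) < init_threshold:
            break
    for _ in range(outer):                          # line 6
        for _ in range(sub_steps):                  # lines 7-9
            substrate = substrate_step(substrate, x, t, lr_sub)
        for _ in range(cppn_steps):                 # lines 10-12
            cppn = cppn_step(cppn, q, substrate, lr_cppn)
        substrate = generate_substrate(cppn, q, n)  # line 13
    return cppn, substrate
```

A call such as hyperbackprop(x, t, side=11) would correspond to the 11 by 11 substrate used later; for a quick check, much smaller loop counts and a smaller side can be passed in.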

5.3 CPPN Topology

A critical part of the algorithm design is the choice of CPPN topology. The described algorithm is limited to a fixed CPPN topology that has to be designed by hand for each problem. This means a lot of expert knowledge of the problem might be required to make a good estimate of the required CPPN complexity. The solution to this would be to automate finding the topology for the CPPN, as that would eliminate the need for hand designing the CPPN. Increasing the complexity of the CPPN should allow for more complex forms of regularity, while also remaining small enough to enforce that regularity in the substrate.

5.3.1 Cascade Correlation

One of the most common methods for automated topology finding in supervised ANNs is cascade correlation [7]. Cascade correlation starts with a network without any hidden nodes and incrementally adds them until a suitable solution is found. It adds a candidate node to the network by connecting it to each input node and to all existing hidden nodes, and then training only those input connections to maximize the correlation between the node's output and the error on the network output nodes. It uses a gradient ascent rule to maximize the correlation, very similar to backprop's weight updates to minimize the error. Those new input weights are then frozen and only the weights connecting to the outputs, including those of the new node, are retrained using the backpropagation rule. If the network error is still unsatisfactory, a new candidate node is added and the process is repeated.
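The quantity a candidate node is trained to maximize can be sketched as follows. This is a hedged illustration of the score as it is usually stated for cascade correlation (the summed magnitude of the covariance between the candidate's output and the residual error at each network output, taken over the training patterns); the function name and array layout are assumptions made for this example.

```python
import numpy as np

def candidate_score(values, errors):
    """Summed |covariance| between a candidate node's output and the output errors.

    values: shape (patterns,), the candidate's output for each training pattern.
    errors: shape (patterns, outputs), the residual error at each network output.
    """
    v = values - values.mean()
    e = errors - errors.mean(axis=0)
    return float(np.abs(v @ e).sum())
```

A candidate whose output does not vary across patterns scores zero, which is why the candidate's input weights are trained by gradient ascent on this score before being frozen.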

Combining cascade correlation with the proposed HB algorithm would require approximating the intermediate substrate solutions for each outer loop step. Cascade correlation freezes the added input connections, but there is no guarantee the resulting weights will be useful in modeling any of the subsequent substrate states. Moreover, the intermediate solutions are likely to be less regular (i.e. requiring more hidden nodes to model) than the final/intended solution. Because of these limitations, it is not useful to reuse the weights and topology created in a previous step, meaning that for each step the entire cascade correlation process needs to be repeated. Consequently the substrate is approximated by a new random minimal network at each outer step instead of continuing from the previous approximation. This results in a large part of the advantage of the interleaving approach being lost, as the CPPN does not start learning from a state close to the intended new state, but instead starts with a new network at every step.

The conclusion of all this is that cascade correlation as a method of creating the CPPN topology is not a viable option for the HB algorithm.

5.3.2 Varying the starting CPPN topology

The complications that arise from combining the HB algorithm with cascade correlation show the inherent issues in combining the interleaved approach with any method that incrementally complexifies the topology. Therefore, the only alternative is to start with a CPPN of sufficient complexity to represent the solution. Consequently, the HyperBackprop algorithm will remain a fixed topology method, as described in Algorithm 2.

In section 6.4 we show the results of running the HB algorithm with several different fixed CPPN topologies while using the same data. We conclude there that CPPN topologies resulting in a correct solution are flexible enough to allow for any reasonable approximation of the complexity to work.

Chapter 6

Validating the Algorithm

The goal of this chapter is to validate the HyperBackprop (HB) algorithm, which should enforce geometric regularity while approximating supervised data across a large input space. The aim is to extract and repeat the regular pattern in the data while not all input locations receive relevant data. In order to evaluate the algorithm, we must use a random subset of data that is small enough to ensure that not all input locations have a useful error signal, i.e. that backpropagation on the training subset would result in a much higher error on the test set. Therefore we will first analyse the performance of backprop on different subset sizes, to establish when the sample size is small enough to observe a potential gain from HyperBackprop.

6.1 Backpropagation results

The centred boxes problem described in section 4.1.2 contains a large number of redundant examples (the same target boxes with many different distractors), meaning only a small percentage of the training examples should be needed to achieve decent performance on all remaining test examples. There are 112 distractor locations for each possible large box position, meaning that in an ideal distribution less than 1% of the training data would be needed to include all of them.

We ran backpropagation until convergence with 100 random selections of training data. Convergence is defined as the MSE below 0.05, computed between the target and the resulting output for each output node, for each training example. Table 6.1 shows the results when 5% of the data is used for training, where performance on the test set only differs a couple of percent from the training result. Table 6.2 shows that when 1% of the data is used, there is a substantial difference between the training and test results. This means that with 5% of the data used, (nearly) all output nodes receive enough relevant updates to converge to a solution, implying that all of the substrate locations are covered by the training data. In contrast, the 1% random training samples result in less than perfect cover, meaning backpropagation can never result in a correct solution for all test cases.

                 average   std dev   minimum
Iterations       2892.4    23.14     2833
Training error   0.0424    0.00066   0.4108
Test error       0.0752    0.00168   0.7155

Table 6.1: Backpropagation on the boxes problem, 5% random sample of data

                 average   std dev   minimum
Iterations       2477.6    38.21     2377
Training error   0.0215    0.00023   0.0210
Test error       0.2155    0.03501   0.1579

Table 6.2: Backpropagation on the boxes problem, 1% random sample of data

Our aim for HyperBackprop thus becomes to converge to a set of weights that solve the problem at every location using only 1% random training data, in which case not all locations will have a relevant backprop error signal.

6.2 HyperBackprop results

In this section we show the results of running the HB algorithm on the centred boxes problem using different sizes of random samples of training data. An experiment consists of the HB algorithm being run on a random sampling from the data. The experiments will be repeated 1000 times with different samples for the chosen sample size.

The benchmark we are comparing against is the backpropagation algorithm, the results of which are shown in section 6.1. Given these results, very little further improvement can be expected using HB on a 5% sample size, as the backprop test error is already very close to the training error. On the other hand, the 1% sample size result can clearly be improved upon, which is the objective of our HB algorithm.

6.2.1 Parameters and improvements

Parameters and initialisation

The substrate has a 2D input array of 11 by 11 nodes and is fully connected to an output layer of the same size. The weights of the substrate are set by querying the generated CPPN for each connection using the Cartesian coordinates of the input and the output nodes.
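The querying step can be illustrated with a short sketch. Here toy_cppn is a hypothetical stand-in for a trained CPPN (a Gaussian of the distance between the two nodes), and the scaling of node coordinates to [-1, 1] is an assumption; only the 11 by 11 layout and the four coordinate inputs come from the text.

```python
import numpy as np

def toy_cppn(x1, y1, x2, y2):
    """Hypothetical CPPN stand-in: the weight depends only on the input-output offset."""
    return np.exp(-((x1 - x2) ** 2 + (y1 - y2) ** 2) / 0.1)

def generate_substrate(cppn, side=11):
    axis = np.linspace(-1.0, 1.0, side)            # assumed coordinate range
    ys, xs = np.meshgrid(axis, axis, indexing="ij")
    x, y = xs.ravel(), ys.ravel()                  # coordinates of the side*side nodes
    # Broadcasting queries every (input, output) pair at once:
    # rows index output nodes, columns index input nodes.
    return cppn(x[None, :], y[None, :], x[:, None], y[:, None])

weights = generate_substrate(toy_cppn)             # shape (121, 121)
```

Because this stand-in depends only on the offset between the two nodes, the same local weight pattern is repeated at every substrate location, which is the kind of regularity a low-complexity CPPN is meant to enforce.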

In section 5.3.2 we concluded that a generalised approach to generating a CPPN topology is not feasible. Consequently, we will start by validating the algorithm with the same topology as the one used to generate the training data. This will ensure that the CPPN is capable of representing the substrate solution. The CPPN weights are set randomly in the range [−0.3, 0.3]. In section 6.4 we will explore the effects of different CPPN topologies on the results.

The learning rate is 0.9 for substrate backpropagation and 0.3 for CPPN backpropagation.

Update type

While running a number of experiments with the proposed algorithm, we observed the CPPN consistently diverging. This appears to be the result of trying to compress the large number of substrate weights into the low complexity CPPN using individual updates. The CPPN then oscillates between models of subsets of weights, many of which have a value close to zero, instead of capturing the overarching structure. Our solution is to replace the online update with the average batch update over all the training examples, resulting in a more reliable update [26]. While this severely increases the computational load for a single update, experiments show it does allow the MSE of the CPPN to improve with each update.
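The difference between the two update types can be sketched on a simple linear model, used here purely as a stand-in for the CPPN; the function names and the model are assumptions made for this illustration.

```python
import numpy as np

def online_epoch(w, inputs, targets, lr):
    """Online update: the weights change after every single example."""
    for x, t in zip(inputs, targets):
        w = w - lr * (x @ w - t) * x
    return w

def batch_epoch(w, inputs, targets, lr):
    """Batch update: average the gradient over all examples, then change the weights once."""
    errors = inputs @ w - targets
    gradient = (errors[:, None] * inputs).mean(axis=0)
    return w - lr * gradient
```

In the online version each example pulls the weights towards its own target, so a model that is too small to fit all examples can keep oscillating between them. The batch version makes a single step in the averaged direction per pass, at the cost of visiting every example before any weight changes.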

6.2.2 General distribution of outcomes

Figure 6.1 shows the distribution of the test error for different sample sizes. Note that this is clearly distinct from the backpropagation results, which were all tightly grouped around the same value, e.g. a test error of 0.2155 for 1% training data (see Table 6.2). In contrast, most of the HB experiments failed to produce error values close to the results backprop produced and many even fell outside the histogram range.

Nevertheless, for the 1% case there are 8 experiments that ended up with a test error below 0.1, compared to backprop's best result of 0.1579, showing that HB can provide a real improvement in some cases. This difference is even more pronounced for the 0.5% and 0.25% sample sizes, where there are far fewer training examples than possible task locations. While these better results are only seen in a small percentage of experiments, they do show the potential improvement the HB algorithm can make. Our next task then becomes to increase this percentage.

Although the final training errors vary a lot, it is important to note that the corresponding test errors are always very close in value. This is due to the fact that the training errors are produced by a substrate generated by the CPPN, which has a very low complexity. The consequence of this is that if a substrate solves the task at a number of different locations, then most likely the CPPN has found the general substrate weight pattern for the task and repeated it across the substrate.

Figure 6.1: These figures show the distribution of the test errors for each of the different percentages used to randomly sample the training data. Each of these are taken over 1000 runs, of which only about 15% ended up in the [0.0, 0.9] range.

This means that given a task with geometric regularity and a CPPN with low enough complexity, the training error actually becomes a good predictor of the test error, irrespective of the training sample size. Table 6.3 shows this relationship quite clearly, with a small difference between the two for all training percentages, although a higher percentage does obviously result in a lower discrepancy.

training data %   average   std dev
5%                0.0113    0.00009
1%                0.0274    0.00048
0.5%              0.0453    0.00186
0.25%             0.0856    0.00690

Table 6.3: This table shows how small the difference is between the final training error and the corresponding test error for different percentages of randomly sampled training data. The differences are shown as the average and standard deviation, which are calculated over the same 1000 runs used for the rest of the results.

In the next section we will analyse some individual results, to see why so few experiments produce results better than backpropagation.

6.2.3 Analysis of individual results

In the following figures the MSE on the training data is plotted against the outer iteration count for an individual experiment. Figure 6.2 shows some typical successful experiments and figure 6.3 shows the typical failures. These graphs show the MSE on two different substrates for each outer step. The green line represents the MSE of the substrate generated by the CPPN's current best approximation (see line 13 in Algorithm 2) and thus how close we are to solving the underlying regular problem, as discussed above. The blue line is the MSE after the backpropagation on the substrate, which is then used as the target for the CPPN approximation.

Figure 6.2: These graphs show typical cases that resulted in a low MSE for the substrate produced by the CPPN, i.e. the more successful results.

Some interesting observations can be made based on these graphs. In many of these cases we can see the backpropagation resulting in a large improvement in the MSE, very little of which is then carried over to the CPPN representation. This means the resulting solution could not be compressed efficiently into the much lower complexity CPPN for these particular configurations. This situation can persist for many iterations, leading to the plateau-like features with little to no improvement, as seen in the experiments in figure 6.3. However, these plateaus do not necessarily denote a local minimum, as significant improvement can be made at later iterations, see figure 6.2 bottom.

Figure 6.3: These graphs show typical cases that resulted in very little improvement in the MSE for the substrate produced by the CPPN, i.e. the less successful results.

6.3 Adding momentum

In order to deal with the need for such long run times and in particular the plateau issue, where there is little to no improvement over a large number of iterations, we add a momentum factor to the weight updates. We apply momentum to both the CPPN and the substrate. This momentum factor is multiplied by the weight update of the previous iteration and added to the current iteration weight update. This factor allows successive updates in the same direction to accumulate, significantly speeding up convergence and filtering out small oscillations [17].
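The momentum rule described above amounts to update(t) = -lr * gradient(t) + momentum * update(t-1). A minimal sketch, with hypothetical function and variable names:

```python
import numpy as np

def momentum_step(weights, gradient, previous_update, lr, momentum):
    """Apply one weight update with momentum and return the new weights and the update."""
    update = -lr * gradient + momentum * previous_update
    return weights + update, update

# Usage: with a constant gradient the step size keeps growing,
# approaching -lr / (1 - momentum) times the gradient.
w = np.zeros(1)
previous = np.zeros(1)
updates = []
for _ in range(3):
    w, previous = momentum_step(w, np.ones(1), previous, lr=0.1, momentum=0.9)
    updates.append(previous[0])
```

With gradients that keep changing sign the successive terms largely cancel instead, which is the filtering effect on small oscillations mentioned above.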

Figure 6.4 shows the results for different momentum factors. It clearly shows an improvement for a larger momentum factor, with an increased number of the experiments ending up in the low test error range. However, backprop performance on a 1% training sample results in an average test error of 0.2155, see Table 6.2, which means that only the experiments with a test error below that can be considered an improvement. For a momentum factor of 0.9, only 50 out of 1000 experiments result in a lower test error, i.e. 5% of the cases. This means there is still plenty of room for improvement.

Figure 6.4: These figures show the distribution of test errors for different momentum factors. Each of them was computed with a 1% training sample, repeated for 1000 runs; note that the momentum factor 0.0 graph (top left) is identical to the 1% graph from Figure 6.1 (top right).

6.4 Topology

While the current method does show improvement over backpropagation in a small number of cases, in the majority of the experiments the CPPN fails to capture the improvements made on the substrate. Our hypothesis is that this is due to the fact that the CPPN topology has the absolute minimal complexity needed to store the compressed substrate solution, which requires a single specific configuration of weights to be found. The rigidity imposed by this CPPN topology severely limits the ability of the CPPN to deal with variations in the substrate weights, caused by locally varying proximity to the regular solution.

Consequently, in this section we will experiment with the effect of several different fixed topologies for the CPPN. Increasing the complexity of the CPPN should allow for multiple possible solutions, while remaining small enough to enforce regularity in the substrate. To test this we will vary the number and type of nodes used in the hidden layer of the CPPN. The types will either be all sigmoid or all Gaussian. The number of nodes will vary between 1 and 4, arranged in a single layer, fully connected. See Figure 4.1 for the original 2 Gaussian node topology. The experiments are done with a 1% training sample and using a momentum factor of 0.9.

Figure 6.5 shows the results with different numbers of Gaussian nodes and Figure 6.6 shows the results for different numbers of sigmoid nodes. Note that these figures are drawn on a different scale from the previous histograms, as they include all of the experiments instead of just the distribution over the [0.0, 0.9] range.

Figure 6.5: These figures show the distribution of the test errors for each of the number of hidden Gaussian nodes. Each of these are taken over 1000 runs.

Overall we can clearly see that the experiments with sigmoid nodes tend to perform better than their Gaussian counterparts. This is particularly interesting since the data is generated with Gaussian nodes, but approximating that data is apparently still easier with sigmoid nodes. The other clear trend is that more hidden nodes improve the performance tremendously. One hidden node fails to produce even a single experiment with an MSE below 1.2, indicating that the CPPN complexity is too low to capture the intended solution, although this is not surprising given that the data is generated using 2 hidden nodes. With each subsequent hidden node increase, an increase in performance can be seen as well. The best result is produced with 4 sigmoids, resulting in 96.9% of the experiments outperforming the average backpropagation result.

These results raise the question of what further additional hidden nodes would achieve. To this end we also ran experiments using 10 and 20 hidden sigmoid nodes, which can be seen in figure 6.7. The 10 hidden nodes still seem to produce a small improvement over the 4 hidden nodes, with slightly more runs ending up in both the [0.0, 0.05] and the [0.05, 0.1] bins. The 20 hidden nodes seem to perform a little worse, with almost no runs ending up in the [0.0, 0.05] bin. A possible explanation for this might be that the additional complexity limits the amount of regularization that is applied by the CPPN, meaning the locally found solutions are not generalized as effectively.

Figure 6.6: These figures show the distribution of the test errors for each of the number of hidden sigmoid nodes. Each of these are taken over 1000 runs.

All this indicates that it is definitely important to utilize enough complexity to represent the solution, while the overestimation of the complexity must be quite large before it has an impact. The fact that the method seems to benefit from the overestimation confirms our hypothesis that a minimally adequate complex CPPN is limiting to the performance. Even with a large overestimation, as in the case of 20 hidden nodes, little detrimental effect can be observed. This is most likely due to the fact that even with increases in CPPN complexity, the substrate complexity is still that much larger, thus still constraining the CPPN to a regular solution.

Figure 6.7: These 2 figures show the distribution of the test errors for 10 and 20 hidden sigmoid nodes. Each of these are taken over 1000 runs.

Chapter 7

Benchmarking

Given the positive outcome from the previous chapter, a logical next step would be to benchmark the algorithm. Backpropagation computes the results separately for each output, without considering the regularity of the problem, making it a good baseline comparison.

As concluded in the related work, we are not aware of any other algorithms that are capable of exploiting geometric regularity in supervised data. In order to allow a comparison of HyperBackprop’s ability to exploit regularity, we will develop a simple benchmark specifically for the boxes problem. This method will be based on a roving eye and is described in the next section.

Our benchmark will then consist of comparing the performance of HyperBackprop against backpropagation and the roving eye variation on both the centred and location dependent versions of the boxes problem.

7.1 Roving Eye

Because the centred boxes problem is an almost perfectly regular task repeated across the substrate, a simplified roving eye [23] seems like an obvious solution. The described roving eye uses a small ANN to learn to play Go. It has a 3x3 input and its outputs dictate whether the eye moves in one of the four directions or plays a stone at its current position. The size of the input means it can generalize positions on the board, even across different board sizes. We can use this same method of generalization on the boxes problem, by simply having the eye compute a single output for each possible input position, in the same way a convolutional network moves a single feature across the input space. The downside of such a method is that the size of the network input, and thus the regularity, must be known in advance. However, because the same network moves across the entire input space, regularity in the problem can easily be exploited.

For this roving eye version we will use a 3x3 input layer, 2 hidden layers of 9 nodes each and a single output node to move across the substrate. The roving eye is trained using backpropagation until convergence, defined as the training MSE dropping below 0.005. Because of the level of regularity in the centred boxes problem, we expect this benchmark to outperform HB. Firstly, HB must make the additional step of encoding the solution in the CPPN, instead of directly utilizing the regularity like the roving eye's movements do. Secondly, HB makes no assumptions about the size of the regularity, as each output node in the substrate is connected to all input nodes, which makes the pattern to be encoded that much more complex.

When the method was first tested, the roving eye simply produced 0.0 for every output, as the vast majority of the outputs of each example are indeed 0.0. The training data was therefore modified by removing all the 0.0 outputs in order to focus on the actual pattern, which produced the results shown in the next section.
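How such an eye sweeps across the substrate can be sketched as follows. This is a hedged illustration rather than the benchmark code: the network is untrained, the sigmoid activations, the bias units, the initial weight range and the zero padding at the substrate border are all assumptions, and the names make_eye and rove are hypothetical. Only the 3x3 input, the two hidden layers of 9 nodes and the single output come from the text.

```python
import numpy as np

def make_eye(rng, sizes=(9, 9, 9, 1)):
    """Small feed-forward network: 3x3 input, two hidden layers of 9 nodes, one output."""
    weights = [rng.uniform(-0.3, 0.3, (m + 1, n))      # +1 row for a bias unit
               for m, n in zip(sizes[:-1], sizes[1:])]

    def eye(patch):
        a = patch
        for w in weights:
            a = 1.0 / (1.0 + np.exp(-(np.append(a, 1.0) @ w)))  # sigmoid layer
        return a[0]

    return eye

def rove(eye, image, window=3):
    """Apply the same small network at every position of the input substrate."""
    pad = window // 2
    padded = np.pad(image, pad)                        # zero padding at the border (assumed)
    out = np.empty_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + window, c:c + window]
            out[r, c] = eye(patch.ravel())
    return out

eye = make_eye(np.random.default_rng(0))
output = rove(eye, np.zeros((11, 11)))                 # one output per input position
```

Because the same weights are reused at every position, anything learned at one location transfers to all others, but the window size has to be chosen in advance.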

7.2 Centred boxes problem

In the previous chapter we tested the performance of HB on the centred boxes problem in order to validate the algorithm and compared it to the backpropagation baseline. Backpropagation produced an average test error of 0.2155 using 1% training data, see Table 6.2. The final version of HyperBackprop, using a momentum factor of 0.9 and a CPPN with 20 sigmoid nodes as a hidden layer, resulted in most experiments having a test error between 0.05 and 0.1, see Figure 6.7. We will compare this to the results produced by our roving eye version. Given the completely regular nature of the problem, we expect the roving eye to perform even better than HB.

Figure 7.1: This figure shows the distribution of test errors using the roving eye on 1% training data taken over 1000 runs.

The distribution of results for the roving eye with 1% training data can be seen in Figure 7.1. It is clear that the roving eye easily solves this repeated task, to which the small network moving across the substrate is ideally suited. This results in it outperforming the more complex HB method as hypothesized.

Figure 7.2 compares the intended targets with the produced outputs of the roving eye and HyperBackprop for an individual boxes configuration. We can clearly see that both HB and RE outputs match the intended targets; however the roving eye output seems to be a little more regular, while the HB output seems a little closer to the intended values.

Figure 7.2: These images show produced outputs for HB and roving eye. The top left shows the input and top right shows the intended target. The bottom left shows the HB output after convergence and the bottom right shows the roving eye output after convergence

7.3 Location dependent boxes problem

7.3.1 Backpropagation

First we will analyse the performance of backpropagation on the location dependent boxes problem in order to provide a baseline. Figure 7.3 shows the distribution of test errors for 5% and 1% random training samples. It clearly shows that with a 5% sample the test error is small, indicating that (almost) all substrate locations had relevant training data, i.e. all possible large box locations were present in the training data. With a 1% sample, the distribution becomes much more varied, centering around a test error of 0.3, indicating significantly worse performance.

Figure 7.3: Distribution of the results of backpropagation on the location de-pendent boxes problem using a 5% and 1% random training data sample, taken over 1000 runs.

If we look at Figure 7.4 we can see that in the case of a 5% random sample there is little variance between the two produced outputs (top). In contrast, with the 1% sample there is a large discrepancy between the substrate locations that did receive relevant training data (bottom-left) and those that did not (bottom-right).

These results confirm that for the location dependent boxes problem with a 1% random sample as training data, some form of regularization is required in order to solve the problem at all substrate locations, just as for the regular boxes problem.

Figure 7.4: The top of this image shows 2 outputs produced on the location dependent boxes problem by backpropagation using a 5% random sample as training data, where the accuracy varies very little between substrate location. The bottom of this image shows 2 outputs produced by using a 1% random sample as training data, where the accuracy does vary a great deal between substrate locations. This clearly shows that with a 1% sample, not all substrate locations receive relevant training data.

7.3.2 Roving Eye

Figure 7.5 shows the distribution of test errors produced by RE on the location dependent boxes problem using a 5x5 input. Note that the simple 5x5 input produces test errors so high they fall outside the default histogram range, thus requiring the histogram to be significantly rescaled. This is not surprising, however, because this input does not provide enough information to actually solve the modified problem, as the intended target changes with the current location of the eye. Examples where the box is located in the bottom left quadrant require a different output from examples where the box is in the top right quadrant, because each has a different 'centre' location, as seen in Figure 4.3. However, the input will be identical as the eye does not consider its current location, thus creating examples with identical input which should produce different output.

Figure 7.5: This image shows the distribution of results for Roving Eye on the location dependent boxes problem, taken over 1000 runs. The left image uses a normal 5x5 input and the right uses the 5x5 input and 2 inputs indicating the current eye location.

In order to enable the roving eye to actually solve the problem, we also created a version of the roving eye that takes 2 additional inputs, indicating the current x and y location of the eye on the input substrate. The results of this new version can be seen on the right side in Figure 7.5. This version clearly performs much better, with all the test errors ending up inside the default distribution. These results are also slightly better than the backpropagation results using 1% training data, showing that the roving eye is in fact capable of some generalisation towards unseen examples, making it an interesting benchmark to compare against on this modified task.
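The extended input can be sketched as follows: each 5x5 patch is concatenated with the current x and y location of the eye, giving 27 inputs per position. The scaling of the location to [0, 1], the zero padding at the border and the name eye_inputs are assumptions made for this illustration.

```python
import numpy as np

def eye_inputs(image, window=5):
    """Build one input vector per eye position: the flattened patch plus the eye's x and y."""
    pad = window // 2
    padded = np.pad(image, pad)                        # zero padding at the border (assumed)
    rows, cols = image.shape
    inputs = []
    for r in range(rows):
        for c in range(cols):
            patch = padded[r:r + window, c:c + window].ravel()
            location = [c / (cols - 1), r / (rows - 1)]  # x, y scaled to [0, 1] (assumed)
            inputs.append(np.concatenate([patch, location]))
    return np.array(inputs)

inputs = eye_inputs(np.zeros((11, 11)))                # shape (121, 27)
```

Two positions that see identical patches now still receive different input vectors, so the network can in principle produce the location dependent targets.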

7.3.3 HyperBackprop

Figure 7.6: This image shows the distribution of results for HyperBackprop on the location dependent boxes problem, taken over 1000 runs. The left image uses a 5% random sample as training data and the right image uses a 1% sample.

Figure 7.6 shows the distribution of test errors produced by HB on the location dependent boxes problem. While with a 5% sample the performance is slightly worse than backpropagation, the majority of the test errors is still below 0.15. With a 1% sample the HB test error only increases a little with respect to the 5% sample, resulting in much better performance than backpropagation or even roving eye with this sample size. This clearly demonstrates HB’s ability to generalize over the input substrate, even for problems more complex than simple repetition, as in the case of this location dependent boxes problem. Figure 7.7 shows the outputs produced by both roving eye and HB on the location dependent boxes problem, again showing that the outputs produced by HB match the target more closely than those produced by roving eye.

Figure 7.7: These images show produced outputs for HB and roving eye. The top left shows the input and top right shows the intended target. The bottom left shows the HB output after convergence and the bottom right shows the roving eye output after convergence.

Chapter 8

Conclusion and Discussion

In this thesis we examined the possibility of taking HyperNEAT’s ability to exploit the geometric regularity in a problem (namely symmetry, repetition and repetition with variation across the input space) and translating it to the supervised domain. To this end we developed a novel algorithm, HyperBackprop, based on HyperNEAT’s Substrate-CPPN model and analysed its performance on a supervised task. We also created a supervised version of the boxes problem (originally used to test HyperNEAT) and modified it into a more complex task, where the desired pattern becomes a function of the target location.

Applying HyperNEAT’s model for exploiting geometric regularity to a supervised setting does indeed prove to be advantageous under certain conditions. The model is able to generalize the task across the input substrate, as shown by its improvement over backpropagation’s results when training on a small sample of data. The model is also able to exploit more complex patterns than simple repetition, as shown by its performance on the location dependent boxes problem. Its ability to infer a geometric relationship between the tasks and extrapolate it even allows it to outperform our modified roving eye approach, indicating that the method is well suited to extracting regularity from varying patterns in a small amount of training data.

It would also be worthwhile to actually benchmark the HB method against HyperNEAT, to see how this supervised version compares with the original evolutionary approach. Given a supervised task like the location dependent boxes problem, HyperNEAT would most likely perform poorly, because the fitness function would have to be the MSE on the training data, requiring a very exact match instead of a more general fitness function. However, it would still be interesting to actually run this comparison and verify this hypothesis.
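To make the hypothesised fitness function concrete, the following sketch shows one way an MSE on the training data could be turned into a fitness value to maximise. This comparison was not run in this thesis; the function names and the mapping 1 / (1 + MSE) are assumptions chosen only for illustration.

```python
def mse(outputs, targets):
    """Mean squared error over all output values of all training examples."""
    total, count = 0.0, 0
    for out_row, tgt_row in zip(outputs, targets):
        for o, t in zip(out_row, tgt_row):
            total += (o - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no output values to compare")
    return total / count


def mse_fitness(outputs, targets):
    """Map the MSE to a fitness in (0, 1]; higher is better.

    The mapping 1 / (1 + MSE) is an assumed choice, not taken from the thesis.
    """
    return 1.0 / (1.0 + mse(outputs, targets))
```

Under such a function an individual only scores well when every output value is close to its target, which is the exact-match requirement referred to above.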

A potential improvement to the HB method would be to create better stopping criteria for the CPPN and substrate loop. Currently they are simply repeated for a variable number of iterations, which creates extra parameters to tune when running the algorithm. Classical stopping criteria are problematic for this method, as outlined in section 5.2, but reducing the number of parameters would improve the usability of the algorithm.
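The following skeleton illustrates where these extra parameters come from. It is only a structural sketch under assumptions: `substrate_step` and `cppn_step` are hypothetical callables standing in for one training update each, and the presence of an outer iteration count is assumed rather than taken from the implementation.

```python
def run_interleaved(substrate_step, cppn_step,
                    outer_iters, substrate_iters, cppn_iters):
    """Alternate substrate and CPPN training for fixed iteration counts.

    substrate_step, cppn_step: hypothetical callables, one update per call.
    outer_iters, substrate_iters, cppn_iters: the counts that must be tuned
    by hand when no stopping criterion is available.
    """
    for _ in range(outer_iters):
        # Train the substrate for a fixed number of updates.
        for _ in range(substrate_iters):
            substrate_step()
        # Then train the CPPN for a fixed number of updates.
        for _ in range(cppn_iters):
            cppn_step()
```

A stopping criterion for either inner loop would replace the corresponding count with a condition evaluated during training, removing one parameter from the interface.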


Bibliography

[1] Angeline, P. J. Morphogenic evolutionary computations: Introduction, issues and examples. In Evolutionary Programming IV: The Fourth Annual Conference on Evolutionary Programming (1995), J. R. McDonnell, R. G. Reynolds, and D. B. Fogel, Eds., MIT Press, pp. 387–401.

[2] Astor, J. C., and Adami, C. A developmental model for the evolution of artificial neural networks. Artif. Life 6, 3 (June 2000), 189–218.

[3] Bentley, P. Three ways to grow designs: A comparison of embryogenies for an evolutionary design problem. In Proceedings of the Genetic and Evolutionary Computation Conference (1999), Morgan Kaufmann, pp. 35–43.

[4] Clune, J., Beckmann, B. E., Ofria, C., and Pennock, R. T. Evolving coordinated quadruped gaits with the HyperNEAT generative encoding. In Proceedings of the Eleventh Conference on Congress on Evolutionary Computation (Piscataway, NJ, USA, 2009), CEC’09, IEEE Press, pp. 2764–2771.

[5] D’Ambrosio, D. B., Lehman, J., Risi, S., and Stanley, K. O. Evolving policy geometry for scalable multiagent learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1 - Volume 1 (Richland, SC, 2010), AAMAS ’10, International Foundation for Autonomous Agents and Multiagent Systems, pp. 731–738.

[6] Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS) (2009), pp. 153–160.

[7] Fahlman, S. E., and Lebiere, C. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2 (1990), Morgan Kaufmann, pp. 524–532.

[8] Gauci, J., and Stanley, K. O. Indirect encoding of neural networks for scalable Go. In Proceedings of the 11th International Conference on Parallel Problem Solving from Nature: Part I (Berlin, Heidelberg, 2010), PPSN’10, Springer-Verlag, pp. 354–363.

[9] Goldberg, D. E. Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[10] Graves, A., Mohamed, A., and Hinton, G. E. Speech recognition with deep recurrent neural networks. CoRR abs/1303.5778 (2013).

[11] Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (1982), 2554–2558.

[12] Hornby, G. S., and Pollack, J. B. Creating high-level components with a generative representation for body-brain evolution. Artif. Life 8, 3 (Aug. 2002), 223–246.

[13] Koutnik, J., Gomez, F., and Schmidhuber, J. Evolving neural networks in compressed weight space. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (New York, NY, USA, 2010), GECCO ’10, ACM, pp. 619–626.

[14] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE (1998), pp. 2278–2324.

[15] Matan, O., Kiang, R. K., Stenard, C. E., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., and LeCun, Y. Handwritten character recognition using neural network architectures. In Proc. of the 4th US Postal Service Advanced Technology Conference (Washington D.C., November 1990).

[16] Nowlan, S. J., and Hinton, G. E. Simplifying neural networks by soft weight-sharing. Neural Comput. 4, 4 (July 1992), 473–493.

[17] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA, USA, 1986, ch. Learning Internal Representations by Error Propagation, pp. 318–362.

[18] Sims, K. Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 1994), SIGGRAPH ’94, ACM, pp. 15–22.

[19] Stanley, K. O. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines 8, 2 (June 2007), 131–162.


Contents

Chapter 1 Introduction

Chapter 2 Background
2.1 Backpropagation
2.2 NEAT
2.3 HyperNEAT

Chapter 3 Related Work

Chapter 4 Problem Description
4.1 The boxes problem
4.1.1 Description
4.1.2 Producing supervised targets for the boxes problem
4.2 Location dependent boxes problem

Chapter 5 HyperBackprop
5.1 Interleaving substrate and CPPN training
5.2 Practical implementation
5.2.1 Substrate training
5.2.2 CPPN training
5.2.3 Initialisation and termination conditions
5.3 CPPN Topology
5.3.1 Cascade Correlation
5.3.2 Varying the starting CPPN topology

Chapter 6 Validating the Algorithm
6.1 Backpropagation results
6.2 HyperBackprop results
6.2.1 Parameters and improvements
6.2.2 General distribution of outcomes
6.2.3 Analysis of individual results
6.3 Adding momentum
6.4 Topology

Chapter 7 Benchmarking
7.1 Roving Eye
7.2 Centred boxes problem
7.3 Location dependent boxes problem