
Faculty Economics and Business, Amsterdam School of Economics

December 21st 2017 Group 1

Mr. N. van Giersbergen 2017-2018

Bachelor Thesis Econometrics Periods 1 and 2

NEURAL NETWORKS EVOLVED

The use of differential evolution for optimizing artificial neural networks

Mark den Hollander

10979603

Abstract

In this thesis several techniques for optimizing neural networks are discussed. The aim is to evaluate the performance of differential evolution compared to traditional gradient based methods, by analyzing an empirical data set. An elaborate analysis concerning efficiency, robustness and effectiveness is performed in order to compare the methods. It turns out that, despite the large explorative capabilities of differential evolution, the evolutionary algorithms are still outperformed by the traditional gradient based methods.


Declaration of authenticity

I, Mark Vincent den Hollander, hereby declare that I have written this thesis myself and I take complete responsibility for all content. I declare that this thesis only contains original content and that I have not used any sources but those stated in the bibliography. The Faculty Economics and Business is only responsible for the guidance of this thesis and not for the content.


Contents

1 Introduction
2 Theoretical Background
2.1 Neural network models
2.2 ANN common issues
2.3 Gradient based optimization
2.4 Evolutionary optimization
2.5 Prediction of telemarketing sales
3 Data and research structure
4 Research and results
4.1 Network structure and parameters
4.2 Performance of the models
4.3 Comparison and discussion
5 Conclusion
References

1 Introduction

Of all data ever created in human history, 90 percent has been generated in the last two years (Dragland, 2013). In order to evaluate and understand these data, sophisticated models should be implemented. The use of artificial neural networks (ANN) has recently become a popular method for evaluating such large amounts of data. Neural network models are mathematical simplifications of the functioning of the human brain (Basheer & Hajmeer, 2000). They mimic the functioning of neurons by assigning weights to input values, which produce output values after applying a certain activation function. Basheer and Hajmeer (2000) prefer the use of ANN models, due to their robustness and their capacity to process non-linear relations.

Estimating a neural network model can be seen as an optimization problem regarding the values of the weights given to the inputs. Because of the non-linear nature of these models, numerical optimization methods should be used. The backpropagation algorithm is the most frequently used method for optimizing neural networks, due to its adaptability and relatively simple application (Basheer & Hajmeer, 2000). Other popular methods are based on the Levenberg-Marquardt and gradient descent algorithms (Ilonen, Kamarainen, & Lampinen, 2003).

However, all the aforementioned algorithms are based on the gradient of the objective function. This requires the neural network model to be differentiable, which imposes a rather large restriction on the model. To solve this problem, the use of evolutionary algorithms is proposed by Ilonen et al. (2003). They suggest that differential evolution (DE) can be used for optimization; it maintains a population of candidate solutions and keeps updating their elements until better solutions appear. The nature of gradient based methods and differential evolution is therefore inherently different.

The performances of gradient based algorithms and differential evolution are investigated in several studies. One of those studies is performed by Piotrowski (2014), who compares gradient based methods with the differential evolution algorithm. He concludes that the DE algorithm is outperformed by almost all gradient based algorithms and that differential evolution should therefore not be preferred. Cantú-Paz and Kamath (2005), however, conclude that the two approaches perform comparably, and Ilonen et al. (2003) also claim that differential evolution provides results comparable to gradient based methods. Still, they mention several advantages of differential evolution, such as the absence of major restrictions. Yet, Maniezzo (1994) suggests that using evolutionary algorithms might even be better than classical algorithms, due to the fact that evolutionary algorithms can find the global optimum more often.

Existing literature provides little consensus regarding the performance of differential evolution. This raises the question of whether this algorithm can be a viable option for optimizing neural networks. The aim of this thesis is therefore to find out to what extent evolutionary algorithms can be used for optimizing neural network models when compared to gradient based methods.

The comparison between evolutionary and gradient based algorithms is performed by analyzing empirical data from a car insurance telemarketing campaign. A neural network model is presented and optimized by applying several gradient based methods alongside some evolutionary algorithms. By comparing the results, a conclusion about the performance of both approaches is drawn.

The rest of this thesis is organized as follows. The second section shows the recent advancements in the field of neural networks and differential evolution, along with the main problems encountered with those models. Thereafter, the data are described and the research structure for comparing the different algorithms is discussed. In the fourth section the results are shown and a detailed analysis is performed. Finally, the conclusions of this research are provided in the last section of this thesis.

2 Theoretical Background

To better understand neural network models, a brief overview of their functioning is provided in this section. Additionally, the main problems encountered when working with neural networks are discussed, along with their advantages and disadvantages. Furthermore, the most frequently used gradient based methods and evolutionary algorithms are explained. Lastly, a study concerning a telemarketing campaign is discussed and an overview of possible variables is given.


Figure 1: Graphical representation of a neural network with one hidden layer

Note. Reprinted from Neural networks and deep learning, by Michael Nielsen, retrieved from http://neuralnetworksanddeeplearning.com/chap1.html

2.1 Neural network models

A neural network consists of several layers of neurons which are connected by synapses (Günther & Fritsch, 2010). The layers of neurons between the input and output layer are called hidden layers, because their values are not directly observed. To measure the effect of each synapse, a weight is assigned to each connection. Usually a constant term is added to each layer, to control for bias effects (Hastie, Tibshirani, & Friedman, 2002). All signals are then bundled and processed by an activation function. The functioning of a neural network with one hidden layer and one output is graphically shown in Figure 1. Günther and Fritsch (2010) compute the output of a neural network with one hidden layer as follows:

\hat{y} = \sigma\left( v_0 + \sum_j v_j \cdot \sigma\left( w_0 + \sum_i w_i \cdot x_i \right) \right), \qquad (1)

where w_i denotes the weights assigned to the synapses between the input and hidden layer and v_j the weights assigned to the connections between the hidden and the output layer. The activation function σ(v) is a transformation that models the firing intensity of the neuron (Basheer & Hajmeer, 2000). The set of all parameters is denoted as θ, a convention also used in this thesis.

The parameters being optimized are the weights connected to the synapses. In order to optimize the parameters, an error function R(θ) is minimized. Usually the mean squared error (equation 2) or the cross-entropy (equation 3) is used to measure the difference between the predicted and expected value of the output:

R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2)

R(\theta) = -\sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right], \qquad (3)

where \hat{y}_i denotes the predicted value for y_i. The optimization stops when the absolute value of the change in the error function |∆R(θ)| falls below a certain threshold or a maximum number of iterations has been reached (Günther & Fritsch, 2010).
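To make equations (1) to (3) concrete, the following is a minimal sketch in Python/NumPy (purely illustrative; the thesis itself works with R packages) of the forward pass of a one-hidden-layer network and the cross-entropy error written as a loss to be minimized:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, w0, w, v0, v):
    """Equation (1): one hidden layer, sigmoid activations, single output.
    w has shape (n_hidden, n_inputs), v has shape (n_hidden,)."""
    hidden = sigmoid(w0 + w @ x)
    return sigmoid(v0 + v @ hidden)

def cross_entropy(y, y_hat, eps=1e-12):
    """Equation (3); clipping avoids log(0) for extreme predictions."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))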

2.2 ANN common issues

For developing a neural network, different factors should be taken into account. Therefore the most common issues concerning the modelling of neural networks are discussed in this section.

Partitioning of the data

The first thing to consider is the partitioning of the data. An ANN requires a partitioning in a training, validation and test set (Basheer & Hajmeer, 2000). The training set is a subset of the data used to update the values of the weights. The validation set is used for calibration of the model and to determine the hyperparameters. The test set contains the data that are used to analyze and evaluate the performance of the neural network. Although there are no mathematical models for determining the division, several rules of thumb are proposed. Basheer and Hajmeer (2000) suggest using a partitioning of 65 percent for the training set, 25 percent for the test set and 10 percent for the validation set. Piotrowski (2014), on the other hand, uses an equal partitioning of one third for the training, validation and test set.

Overfitting

A common problem during training is the occurrence of overfitting. This means that the training data are fitted not only to the signal but to the noise as well, which can lead to rather poor outcomes on the test data (Piotrowski, 2014). According to Hastie et al. (2002), overfitting occurs when the network contains too many weights. They suggest using a regularization method called weight decay, which adds an extra term to the error function, i.e. R(θ) + λJ(θ), imposing a penalty on larger values of the weights. A frequently used weight decay term, called Ridge regularization, is

J(\theta) = \sum_i \theta_i^2, \qquad (4)

with θ_i the weights in the model, excluding the bias terms. The value of λ affects the rate of the weight decay and is usually chosen by validation. Other regularization methods include the application of an early stopping criterion, as proposed by Piotrowski (2014).
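As a small illustrative sketch (not the thesis code), the penalized error can be written as the cross-entropy plus the Ridge term of equation (4); the default λ = 0.002 below is the value chosen later in this thesis.

import numpy as np

def penalized_error(y, y_hat, theta, lam=0.002):
    """R(theta) + lambda * J(theta); `theta` is assumed to exclude the bias terms."""
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)
    ce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return ce + lam * np.sum(theta ** 2)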

Normalization and initialization

Another characteristic that affects the performance of neural networks is the scale of the input variables. Hastie et al. (2002) suggest that normalizing the input values ensures that all input variables are treated equally. Typically an input variable z_i is normalized by subtracting its mean and dividing by its sample standard deviation:

x_i = \frac{z_i - \bar{z}}{s_z} \qquad (5)

The initialization of the weights should also be taken into account. Basheer and Hajmeer (2000) note that these values can affect the convergence of the optimization algorithm. They suggest that the weights are typically initialized in a small range with mean zero. Piotrowski (2014) considers several initialization ranges and concludes that starting weights in the interval [−1, 1], with search bounds between [−1000, 1000], lead to the best performance.
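A short sketch of these two steps (illustrative Python/NumPy; the uniform [−1, 1] range is the one mentioned above, everything else is a plain implementation choice):

import numpy as np

def normalize(z):
    """Equation (5): center each input variable and divide by its sample s.d."""
    return (z - z.mean(axis=0)) / z.std(axis=0, ddof=1)

def init_weights(n_params, low=-1.0, high=1.0, seed=None):
    """Initialize all weights uniformly in [low, high]."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=n_params)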

Choice of the transfer function

Every neural network model requires a transfer function σ(v) to obtain usable output values. Leshno, Lin, Pinkus and Schocken (1993) provide a mathematical proof about the effect of the activation function. They conclude that every bounded continuous activation function, as long as it is not a polynomial, can capture the non-linear relations of the neural network. Hastie et al. (2002), however, prefer the use of the sigmoid function σ(x) = 1/(1 + e^{−x}).

Hidden layers and nodes

The number of hidden layers and hidden nodes should also be taken into account. Basheer and Hajmeer (2000) assume that the determination of these values might be the most important aspect of modelling the neural network. They argue that one hidden layer is sufficient for most non-linear functions. When optimizing functions with discontinuities, on the other hand, they suggest that two hidden layers might be needed to obtain accurate results. The number of nodes in each hidden layer affects the flexibility of the model and its capability to capture non-linear relations. This choice is often difficult, according to Basheer and Hajmeer (2000), because a model with too few nodes is incapable of estimating complex data and too many nodes can lead to overfitting. Hastie et al. (2002) suggest, however, that it is better to overestimate the number of nodes needed, because the problem of overfitting can be solved by applying appropriate regularization methods. Therefore, they propose using 5 to 100 hidden nodes, increasing with the number of input and output variables available.

Multiple minima

Hastie et al. (2002) mention the effect of an error function with many local minima. This might lead to stagnation of the optimization algorithm, which means that the global minimum is not found. One solution to this problem is the use of a variety of different starting values, or averaging over the different approximations. However, the solution found is largely dependent on the optimization method used. Therefore, several different optimization algorithms are discussed in the next section.

2.3 Gradient based optimization

In order to find the optimal values of the weights, the error function R(θ) is minimized. Due to the non-linear nature of this function, numerical techniques are required for optimization. The most frequently used gradient based optimization techniques are discussed in this section.

Backpropagation

The backpropagation (BP) algorithm is the most frequently used method, due to its flexibility and adaptability (Basheer & Hajmeer, 2000). It is based on the gradient of the error function, ∂R(θ)/∂θ, which can be computed rather efficiently using the chain rule, because each hidden unit only receives and sends information to connected units (Hastie et al., 2002). Günther and Fritsch (2010) propose using a learning rate η that affects the convergence of the parameters. The weights are updated according to the following rule:

\theta_k^{(h+1)} = \theta_k^{(h)} - \eta \cdot \frac{\partial R(\theta_k^{(h)})}{\partial \theta_k^{(h)}} \qquad (6)

They also propose an adaptation to this method called resilient backpropagation (RBP). For this algorithm the value of the learning rate is dependent on the sign of the partial derivative. Here η_k is increased if the partial derivative keeps its sign, and decreased if it changes sign.

This results in the following adjustment rule:

\theta_k^{(h+1)} = \theta_k^{(h)} - \eta_k^{(h)} \cdot \mathrm{sgn}\left( \frac{\partial R(\theta_k^{(h)})}{\partial \theta_k^{(h)}} \right) \qquad (7)

Hastie et al. (2002) mention, however, that backpropagation can have a rather slow convergence rate and is therefore not always the method of choice for large neural networks. They also discourage the use of second order techniques such as Newton-Raphson, because computing the second order derivative of the error function is often rather complex.
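As an illustration of update rules (6) and (7), the sketch below (illustrative Python, not the thesis code) shows one plain gradient step and one resilient, sign-based step; the gradient function grad_R and the step-size multipliers are assumptions on my part, not values taken from the thesis.

import numpy as np

def bp_step(theta, grad_R, eta=0.05):
    """Equation (6): gradient descent with a fixed learning rate."""
    return theta - eta * grad_R(theta)

def rbp_step(theta, grad_R, eta_k, prev_grad, eta_up=1.2, eta_down=0.5):
    """Equation (7): each weight has its own step size eta_k, which grows while
    the partial derivative keeps its sign and shrinks when the sign flips."""
    g = grad_R(theta)
    same_sign = np.sign(g) == np.sign(prev_grad)
    eta_k = np.where(same_sign, eta_k * eta_up, eta_k * eta_down)
    theta = theta - eta_k * np.sign(g)
    return theta, eta_k, g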

2.4 Evolutionary optimization

When the number of parameters increases, gradient based algorithms tend to have excessively large computation times, according to Ilonen et al. (2003). In order to solve this problem, they propose using differential evolution (DE). Differential evolution minimizes the error function R(W) by optimizing the set of parameters W. These parameters are optimized by maintaining a population of candidate solutions

P_G = (W_{1,G}, \dots, W_{NP,G}), \quad G = 0, \dots, G_{\max}, \qquad (8)

where G denotes the generation of the population and NP the number of candidate solutions. Each vector containing D weights is defined as follows:

W_{i,G} = (w_{1,i,G}, \dots, w_{D,i,G}), \quad i = 1, \dots, NP, \; G = 0, \dots, G_{\max} \qquad (9)

In order to obtain better parameters, the weights are adapted by a certain mutation scheme. Ilonen et al. (2003) adopt a mutation strategy denoted by DE/rand/1. The mutation v_{j,i,G+1} is computed by

v_{j,i,G+1} = w_{j,r_3,G} + F \cdot (w_{j,r_1,G} - w_{j,r_2,G}), \qquad (10)

and the trial vector is then obtained by crossover:

u_{j,i,G+1} = \begin{cases} v_{j,i,G+1} & \text{if } \mathrm{rand}_j[0,1] \le CR \\ w_{j,i,G} & \text{otherwise,} \end{cases} \qquad (11)

where i = 1, ..., NP, j = 1, ..., D and r_1, r_2, r_3 ∈ {1, ..., NP} are randomly selected, mutually distinct integers. Ilonen et al. (2003) mention that CR and F are control parameters that affect the convergence speed and the robustness of the model. Whenever one of the values of u_{j,i,G+1} falls outside the search bounds, a rebounding technique is applied (Piotrowski, 2014). Lastly, a candidate is carried over to the next generation only if its value of the error function is no larger than that of its predecessor:

W_{i,G+1} = \begin{cases} U_{i,G+1} & \text{if } R(U_{i,G+1}) \le R(W_{i,G}) \\ W_{i,G} & \text{otherwise,} \end{cases} \qquad (12)

where U_{i,G+1} = (u_{1,i,G+1}, \dots, u_{D,i,G+1}). Thus, every candidate in the next generation is at least as good as its predecessor. This process is repeated until the absolute value of the change in the error function |∆R(W)| falls below a certain threshold or the maximum number of generations is reached.
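The following sketch puts equations (8) to (12) together in a minimal DE/rand/1 loop (illustrative Python, not the thesis implementation, which uses the "DEoptim" R package; the default control values, the clipping used for bound handling and the "at least one mutated element" crossover detail are my assumptions):

import numpy as np

def de_rand_1(error, D, NP, F=0.6, CR=0.6, bounds=(-100, 100), gmax=2000, seed=None):
    """Minimize error(W) for a weight vector W of length D with DE/rand/1."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    pop = rng.uniform(-1, 1, size=(NP, D))          # initial weights in [-1, 1]
    errs = np.array([error(w) for w in pop])
    for _ in range(gmax):
        for i in range(NP):
            r1, r2, r3 = rng.choice([k for k in range(NP) if k != i], 3, replace=False)
            v = pop[r3] + F * (pop[r1] - pop[r2])   # mutation, equation (10)
            v = np.clip(v, low, high)               # simple bound handling (thesis: rebounding)
            cross = rng.random(D) <= CR
            cross[rng.integers(D)] = True           # standard DE detail, not stated in (11)
            u = np.where(cross, v, pop[i])          # crossover, equation (11)
            e = error(u)
            if e <= errs[i]:                        # selection, equation (12)
                pop[i], errs[i] = u, e
    return pop[np.argmin(errs)], errs.min()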

Several other methods for computing the mutation value v_{j,i,G+1} are proposed by Piotrowski (2014). He suggests using the DE/best/1 or DE/current-to-best/1 method, respectively:

v_{j,i,G+1} = w_{j,best,G} + F \cdot (w_{j,r_1,G} - w_{j,r_2,G}) \qquad (13)

v_{j,i,G+1} = w_{j,i,G} + F \cdot (w_{j,best,G} - w_{j,i,G}) + F \cdot (w_{j,r_1,G} - w_{j,r_2,G}) \qquad (14)

Here w_{j,best,G} denotes the best individual in the population so far.
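Written against the earlier sketch, the two alternative schemes are small variations on the mutation step (illustrative Python; pop is the current population, best the index of its best member, both assumptions of this sketch):

def mutate_best_1(pop, best, r1, r2, F=0.6):
    """DE/best/1, equation (13)."""
    return pop[best] + F * (pop[r1] - pop[r2])

def mutate_current_to_best_1(pop, i, best, r1, r2, F=0.6):
    """DE/current-to-best/1, equation (14)."""
    return pop[i] + F * (pop[best] - pop[i]) + F * (pop[r1] - pop[r2])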

The choice of the crossover (CR) and scale parameter (F) is often regarded as a difficult task (Neri & Tirronen, 2010). Piotrowski (2014) mentions that taking values of F in [0, 1] or [0.4, 0.9] might be a good choice. Neri and Tirronen (2010) argue that taking a randomized scale factor F ~ Unif(0.5, 1) might be a better solution, because this compensates for the largely deterministic nature of the DE algorithm. For the value of CR they argue that setting CR ∈ [0.8, 1] is a reasonable option. Another option might be to implement a self-adapting algorithm for the parameters CR and F (Neri & Tirronen, 2010). These self-adaptive algorithms require a relatively large computational time, however (Piotrowski, 2014). Apart from the crossover and scale factor, the population size NP can also affect the performance of the model. Piotrowski (2014) argues that a reasonable choice for the population size is NP = 5D. Neri and Tirronen (2010) suggest, however, that a population size of NP = 10D or even NP ≤ D can also turn out to be optimal. According to Ilonen et al. (2003), the choice of the control parameters is mainly a problem-dependent task and should therefore be made according to the model at hand.

Several studies compare the performances of gradient based methods and differential evolution for optimizing neural networks. Most of them, such as Ilonen et al. (2003) and Cantú-Paz and Kamath (2005), conclude that the performances are roughly comparable. Piotrowski (2014) even concludes that differential evolution results in worse performance than gradient based methods, and he therefore prefers gradient based methods over differential evolution.

Nevertheless, there are several advantages of using differential evolution instead of gradient based methods. Cantú-Paz and Kamath (2005) address the fact that the differential evolution algorithm does not require the gradient of the error function. Therefore, non-differentiable activation functions can be used. Cantú-Paz and Kamath (2005) also mention the better capabilities of the differential evolution algorithm for finding a global optimum. They explain that gradient based methods can get stuck in a local optimum, while evolutionary algorithms can almost always find the global optimum. This advantage is also addressed by Piotrowski (2014), who speaks of great explorative capabilities. The downside of this characteristic,

however, is that exploration alone does not guarantee precise convergence. Piotrowski (2014) suggests that optimization algorithms should contain a proper balance between exploration and exploitation capabilities. Cantú-Paz and Kamath (2005) give a solution for finding this balance. They suggest applying an evolutionary algorithm for a quick convergence towards the global optimum and then refining the parameters using a gradient based method.

2.5 Prediction of telemarketing sales

The performance comparison of gradient based methods and differential evolution is carried out using empirical data concerning cold calls by an insurance company. In order to better understand these data and to provide a proper analysis, the most important aspects of this problem are discussed in this section.

An analysis of telemarketing sales is conducted by Moro, Cortez and Rita (2015). They focus on historical data of a life insurance marketing campaign and use a neural network to predict the results of future campaigns. Their suggestion is to use customer aspects that are connected to recency, frequency and monetary (RFM) characteristics. Recency characteristics are variables that describe the amount of time that has passed since a client was contacted, frequency characteristics describe the number of times a client has been contacted and monetary characteristics describe the financial state of the client. Moro et al. (2015) also propose some client related variables such as age, gender and educational level. Their predictions turn out to be rather precise and they therefore conclude that a neural network incorporating RFM characteristics can be of great value for telemarketing operators.

Considering the information Moro et al. (2015) provide, a neural network might be a reasonable choice for predicting car insurance sales. Therefore, an analysis concerning the performance of gradient based methods and differential evolution can be performed using these data. The data set used for the analysis is discussed in the next section, along with an explanation of the research.

3 Data and research structure

An analysis of the different optimization techniques is performed using an empirical data set. The details of this data set are discussed in this section. Additionally, the research structure for comparing the optimization methods is provided.

The data set analyzed in this thesis originates from a telemarketing campaign by a car insurance company. The outcomes and characteristics of this campaign were recorded, resulting in a total of 5000 observations. The dependent variable of interest is whether the client bought car insurance, represented by the binary variable CarInsurance. However, only 4000 of the 5000 observations contain a value for CarInsurance, so only those 4000 observations are used in the model. Information concerning the consumer characteristics (e.g. age, education) and previous telemarketing campaigns (e.g. amount of time passed and number of times a client has previously been contacted) is also available. No other variables in the data set show any missing values, so no further adaptation of the data set is needed.

As Moro et al. (2015) mention, RFM characteristics can be of great importance to predict the probability of telemarketing sales. Therefore, several variables depicting these characteristics are used for modelling the inputs of the neural network. Additionally, variables describing the individual characteristics of the consumers are included in the model. For a complete list of variables and descriptions see Table 1. Furthermore, the descriptive statistics of all variables are presented in Table 2.

Table 1: Telemarketing features

Variable class       Variable name   Description
Dependent variable   CarInsurance    Binary: 1 if client subscribed, 0 otherwise
Consumer related     Age             Age of the client
                     Single          Binary: 1 if client is single, 0 otherwise
                     Married         Binary: 1 if client is married, 0 otherwise
                     Divorced        Binary: 1 if client is divorced, 0 otherwise
                     EducLow         Binary: 1 if highest education is primary school, 0 otherwise
                     EducMid         Binary: 1 if highest education is high school, 0 otherwise
                     EducHigh        Binary: 1 if highest education is college, 0 otherwise
                     EducOther       Binary: 1 if education falls in none of those categories, 0 otherwise
Recency              CallDuration    Duration of the marketing call in minutes
                     DaysPassed      Number of days since client was last contacted for previous campaign
Frequency            NoOfContacts    Number of contacts with client during this campaign
                     PrevAttempts    Number of contacts with client before this campaign
                     PrevSuccess     Binary: 1 if previous attempt was successful, 0 otherwise
                     PrevFail        Binary: 1 if previous attempt was unsuccessful, 0 otherwise
                     PrevNone        Binary: 1 if no previous outcome, 0 otherwise

Table 2: Descriptive statistics of all variables

               all obs.             training             validation           test
               mean       s.d.      mean       s.d.      mean       s.d.      mean       s.d.
CarInsurance   0.401      0.490     0.403      0.491     0.410      0.492     0.390      0.488
Age            41.215     11.550    41.243     11.635    41.484     11.394    40.937     11.493
Single         0.303      0.460     0.296      0.457     0.310      0.463     0.313      0.464
Married        0.576      0.494     0.580      0.494     0.579      0.494     0.565      0.496
Divorced       0.121      0.326     0.124      0.329     0.111      0.315     0.122      0.327
EducLow        0.140      0.347     0.143      0.350     0.148      0.355     0.129      0.335
EducMid        0.497      0.500     0.499      0.500     0.496      0.500     0.494      0.500
EducHigh       0.321      0.467     0.321      0.467     0.305      0.461     0.331      0.471
EducOther      0.042      0.201     0.037      0.189     0.051      0.221     0.046      0.210
CallDuration   5.847      5.704     5.922      5.928     5.821      5.216     5.705      5.576
DaysPassed     49.467     106.331   48.878     105.744   50.256     103.918   50.131     109.577
NoOfContacts   2.607      3.064     2.617      3.009     2.425      2.571     2.732      3.510
PrevAttempts   0.718      2.079     0.699      1.858     0.716      2.591     0.759      2.080
PrevSuccess    0.082      0.274     0.084      0.278     0.076      0.266     0.080      0.271
PrevFail       0.109      0.312     0.105      0.307     0.126      0.332     0.105      0.307
PrevNone       0.809      0.393     0.811      0.392     0.798      0.402     0.815      0.388
Default        0.015      0.120     0.014      0.118     0.018      0.131     0.013      0.113
Balance        1532.937   3511.452  1538.753   3773.515  1514.728   2901.997  1534.710   3358.344
HHinsurance    0.493      0.500     0.492      0.500     0.505      0.500     0.485      0.500
CarLoan        0.133      0.340     0.132      0.338     0.140      0.347     0.130      0.336
Number of obs. 4000                 2200                 800                  1000

To model the neural network, the data set is first partitioned. For this study a partitioning of 2200 observations in the training set, 800 in the validation set and 1000 observations in the test set is chosen. The error function is computed by means of equation (3), and a regularization term λJ(θ) = λ Σ_i θ_i² is added to the error function in order to control for overfitting. Because the dependent variable can only take the values 0 and 1, the model can be seen as a classification problem. This means that the value of the output is interpreted as the probability that a client subscribes to the car insurance. Therefore, a sigmoid is used as transfer function. Additionally, all continuous input values are normalized using equation (5) and the weights are initialized in the interval [−1, 1]. For this problem one hidden layer is used, based on the assumption that the error function contains no discontinuities (Basheer & Hajmeer, 2000). This means that equation (1) is used to compute the output value. The number of hidden nodes depends on the data and is chosen by comparing the value of the error function for several numbers of nodes.
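A minimal sketch of this preprocessing step (illustrative Python with pandas; the file name and the assumption that the rows are already in random order are mine, not taken from the thesis):

import pandas as pd

# Hypothetical file name; the thesis does not specify one.
df = pd.read_csv("carInsurance.csv")

# Keep only the 4000 rows with an observed outcome.
df = df[df["CarInsurance"].notna()]

# 2200 / 800 / 1000 split into training, validation and test sets.
train = df.iloc[:2200]
valid = df.iloc[2200:3000]
test = df.iloc[3000:4000]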

The weights of the neural network are obtained by means of several optimization techniques. Firstly, the backpropagation and resilient backpropagation algorithms are used to obtain values of the weights. The weights are also obtained by using differential evolution, with the DE/rand/1, DE/best/1 and DE/current-to-best/1 mutation schemes respectively. Additionally, a logit regression model is used as a benchmark. The efficiency, robustness and effectiveness of these algorithms are then compared to evaluate each algorithm.

To summarize, the results of a telemarketing campaign are predicted using a neural network with one hidden layer (equation 1). The weights of this network are obtained by optimizing the error function (equation 3). This is done using a logit model, two gradient based methods (equations 6 and 7) and a differential evolution algorithm with three different mutation schemes (equations 10, 13 and 14).

4 Research and results

This section reports the most important results of the various optimization methods. The research is split into two consecutive parts. The first part consists of an analysis of the structure of the neural network, i.e. the number of hidden nodes and the weight decay. Thereafter, optimal values for the important parameters of the DE algorithms are chosen, based on optimization for 10 different starting weights. In the second part the weights are optimized by the different techniques and the outcome of the telemarketing campaign is predicted. These predictions are used to compare the optimization techniques. The results for the gradient based methods are obtained using the "h2o" package and the DE results using the "DEoptim" package in R.

4.1 Network structure and parameters

In order to perform an equivalent analysis of all optimization techniques, the optimal structure of the neural network must be determined first. This is done by analyzing the test error of the backpropagation method with a learning rate equal to 0.05. Furthermore, the optimal parameters of the evolutionary algorithms are obtained by analyzing the test errors of the DE/rand/1 algorithm. The test error used is the log loss, which is equal to the mean value of the cross entropy (equation 3). For each neural network the weights are optimized based on the observations in the training set, and the structure of the model is optimized based on the observations in the validation set. For estimating the model with DE, the weights are initialized within the interval [−1, 1] and relatively wide search bounds of [−100, 100] are used. Figure 2a shows the value of the log loss for different numbers of hidden nodes in the hidden layer. The mean value of the test error drops somewhat when using three or more nodes, so a structure with multiple nodes might be beneficial. However, for more than ten hidden nodes the variance of the log loss becomes rather large, which indicates the appearance of overfitting. A correction for this effect is implemented by adding a weight decay term to the error function (equation 4). To investigate the effect of the weight decay parameter λ, the log loss for various values of this parameter is shown in Figure 2b, using a model with 15 hidden nodes. For λ = 0.002 the mean of the log loss is lowest, so this value is used in the model. Figure 2c shows the log loss for different numbers of hidden nodes when applying weight decay with λ = 0.002. The log loss reaches a minimum for 6 hidden nodes, so this number is chosen as optimal. This means that a neural network using 6 hidden nodes and a regularization parameter of λ = 0.002 might be an optimal model for comparing the optimization techniques. This structure is therefore used for further research.

Moreover, the effects of the scale factor (F), crossover parameter (CR) and population size (NP) are examined. This is done using differential evolution with the DE/rand/1 mutation scheme. Due to the rather large computation time of this algorithm, the maximum number of iterations is set to one hundred. In order to optimize CR, the scale factor is set to F = 0.8 and NP = 10D, which are assumed to be reasonable choices (Neri & Tirronen, 2010). Figure 2d shows the log loss for several values of CR. The mean log loss is minimal for CR = 0.6, so this value is taken as optimal. The effect of the scale factor can be seen in Figure 2e. This figure shows the log loss for varying values of the scale factor, using the optimal crossover value and once again setting NP = 10D. It can be seen that the test error is minimized when setting the scale factor equal to F = 0.6. This value is therefore taken as optimal in the model. Lastly, the population size is taken into account. Once again the log loss is calculated for values of the population size varying from 5D to 20D, using the optimal values of F and CR. The results are shown in Figure 2f, which indicates that taking NP = 15D might be optimal.

To summarize, the optimization techniques are compared using a neural network consisting of one hidden layer with 6 hidden nodes. This neural network is optimized using Ridge regularization (equation 4) with tuning parameter λ = 0.002. For the optimization by differential evolution a population size of NP = 15D is used, and the scale factor and crossover parameters are both set to F = CR = 0.6.

Figure 2: (a) BP without weight decay; (b) BP with 15 nodes and weight decay; (c) BP with weight decay; (d) DE for varying crossover values; (e) DE for varying scale factor values; (f) DE for varying population size

4.2 Performance of the models

With the optimal parameter values chosen in the previous section, the performance of the various optimization techniques can be compared. This is achieved by computing the log loss of the model for ten different starting weights, using observations from the validation set. Using the weights with the smallest test error, predictions are then made for the test set. In order to provide a measure of the predictive quality of the models, the receiver operating characteristic (ROC) curve is used. This curve is obtained by plotting the true positive rate (TPR), also known as sensitivity, against the false positive rate (FPR), which is equal to 1 − specificity. The predictive quality is then determined by computing the area under the curve (AUC) of the ROC. The values of the AUC are determined using the "pROC" and "SDMTools" packages in R. Additionally, an analysis of the convergence characteristics of the different mutation schemes for differential evolution is provided.
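As a reminder of how these quantities are computed (the thesis uses the "pROC" and "SDMTools" R packages; the sketch below uses scikit-learn instead, with placeholder data, purely for illustration):

import numpy as np
from sklearn.metrics import log_loss, roc_curve, roc_auc_score

# Placeholder data standing in for the test-set outcomes and model predictions.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=1000)
p_hat = np.clip(y_test * 0.6 + rng.uniform(0, 0.4, size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, p_hat)   # points of the ROC curve (TPR vs FPR)
auc = roc_auc_score(y_test, p_hat)                # area under the ROC curve
test_error = log_loss(y_test, p_hat)              # log loss = mean cross-entropy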

Table 3: Performance results

            Logit    BP       RBP      DE/rand/1  DE/best/1  DE/ctb/1
Log Loss    0.4629   0.4258   0.4212   0.4276     0.4285     0.4265
AUC         0.8737   0.87755  0.8823   0.8769     0.8764     0.87751
Mean Prob.  0.3821   0.3927   0.3998   0.3975     0.3893     0.3976

Figure 3: (a) ROC for gradient optimization; (b) ROC for evolutionary optimization


Table 3 displays the performance values of the various optimization techniques. The values in the table are computed by selecting the weights with the smallest test error on the validation set and then predicting the outcomes in the test set. For these results ten different starting weights are used for the gradient based methods and three for the differential evolution algorithms. For the latter, fewer starting weights are used due to the large computation times of differential evolution. Although this might cause the results of differential evolution to be less effective, it is assumed that the effect is relatively small due to the global optimizing properties of differential evolution. Additionally, a logit regression with weight decay parameter λ = 0.002 is used as a benchmark model. Table 3 shows a test error for the logit model of 0.4629, which is larger than for all other models. This confirms the belief that the use of a neural network provides better results than a standard non-linear regression model. For the neural network models all test error values lie within a range of 0.01 from each other, with the smallest value of 0.4212 for the RBP model. Interestingly, all DE methods give a larger test error than the gradient based methods. Additionally, the mean predictions are shown in Table 3. The test set displays a mean value of ȳ = 0.39, which is rather close to the predictions of all models.

When taking the AUC values into account, some interesting remarks can be made. Considering the neural network models, the RBP has the largest AUC (0.8823) and DE/best/1 the smallest (0.8764). Furthermore, it can be noted that the AUC of BP and RBP is larger than all AUC values for differential evolution. Nevertheless, each model results in an AUC value larger than 0.875, which indicates that all models have rather strong predictive capabilities. This property can also be observed in Figure 3, which shows the ROC curves for the gradient based methods and differential evolution separately. Figure 3a shows that the RBP method indeed has better predictive properties than the logit and BP models. The ROC curves in Figure 3b, however, lie rather close to each other, which indicates that the differential evolution algorithms provide comparable prediction performances.

Due to the fact that the predictive capabilities lie rather close to each other, a Delong test is performed to test whether the AUC values differ significantly. The Delong test comparing the BP and RBP models results in a p-value of 0.244, which indicates that the null hypothesis of equal AUC values is not rejected at the 5 percent significance level. Pairwise tests among the DE/rand/1, DE/best/1 and DE/current-to-best/1 algorithms likewise give p-values above the 5 percent level. This means that none of the differential evolution methods perform significantly differently when testing at the 5 percent significance level. Because the RBP and DE/current-to-best/1 models display the highest AUC value within each category, the difference between these values is also tested. The Delong test results in a p-value of 0.07786, which indicates that these values are not significantly different at the 5 percent significance level. Therefore, it might be concluded that the optimization methods have comparable predictive capabilities.

Figure 4: Box plots of the Log Loss value for the optimal solution based on the validation set

In Figure 4 the variation of the test error is shown for all optimization methods, by means of several box plots. The logit model displays the largest test error and the RBP model the smallest, which corresponds to Table 3. However, the variation of the gradient based methods is much larger than that of the differential evolution algorithms. This indicates that differential evolution finds the global optimum more often than the gradient based algorithms, which regularly end up in a local optimum. This also explains the smaller test errors of the gradient based methods in Table 3, because that table only displays the model with the smallest log loss value. Interestingly, the logit model also shows a rather large variation in test error, despite the fact that the logit model should always converge to the same optimum. This can be explained by the fact that regularization as in equation (4) is used, which is usually not included in a standard logit regression.


Figure 5: Convergence of the test error for several differential evolution algorithms

Although the box plots in Figure 4 indicate that the differential evolution methods find the global optimum more often, the time it takes to obtain this result is much larger. 2000 iterations have been performed, with each iteration taking several seconds to update all weights. This results in a computation time that is much larger than that of the gradient based methods. Figure 5 shows the value of the test error at each iteration. All three algorithms converge to the same global optimum, but the convergence rates are rather different. For the first 500 iterations the changes in the test error are largest. After 1000 iterations the log loss only changes marginally, and the algorithms converge to the global optimum around 2000 iterations. Remarkably, however, the test error of the model using the DE/rand/1 algorithm consistently lies above that of the other two DE algorithms after 300 iterations. This is possibly a consequence of the larger stochastic properties of the DE/rand/1 mutation scheme. Furthermore, it can be stated that the DE/best/1 algorithm has the highest convergence rate for the first 300 iterations. Thereafter, the DE/best/1 and DE/current-to-best/1 algorithms reach an equivalent convergence rate.

In short, five different optimization techniques are used to optimize the neural network for predicting telemarketing sales. The performances are displayed based on the log loss and the AUC values, along with a description of the differential evolution convergence. A detailed analysis and comparison of the various optimization techniques is provided in the next section.

4.3 Comparison and discussion

The performance of the optimization techniques is evaluated based on three main characteristics. Firstly, the efficiency is discussed by looking at the computation times of each optimization method. Secondly, the robustness of the algorithms is considered, by evaluating how often the global optimum is found. Lastly, an analysis of the effectiveness is provided.

Figure 6: Computation times for the various optimization techniques: (a) computation time for gradient optimization in seconds; (b) computation time for differential evolution in hours

Figure 6 shows several box plots of the duration of the optimization process. The computation times of the backpropagation model and the resilient backpropagation model in Figure 6a display a rather large difference. The mean time for the RBP model is approximately three times larger than that of the backpropagation model. This is mainly due to the larger complexity of resilient backpropagation. This difference, however, is negligible when compared to the computation times of the differential evolution algorithms. These algorithms display a computation time of approximately 8 hours, several thousand times that of the gradient based methods. This illustrates one of the largest disadvantages of using differential evolution for networks with many parameters: the computation times become unreasonably large. Thus, in terms of efficiency the gradient based methods should unequivocally be preferred to the differential evolution algorithms.

For evaluating the robustness of the models, the variation of the test error as displayed in Figure 4 is taken into account. The gradient based methods show a much larger variation in log loss than the evolutionary algorithms. This implies that differential evolution is able to find the global optimum more often than the gradient based methods. This should be no surprise, however, since gradient based methods are local optimizers (Piotrowski, 2014). Therefore, one might prefer differential evolution over gradient based methods, due to its larger robustness. Nevertheless, the robustness of backpropagation and resilient backpropagation can often be greatly increased by optimizing the model several times and using the results with minimal test error. As can be seen in Table 3, this method results in even smaller test errors than the differential evolution algorithms. Besides, the computation time of the differential evolution models is unreasonably large compared to the gradient based methods. So although the gradient based methods are less robust than the DE methods, this need not be a problem given the substantial difference in computation times.

The effectiveness also plays an important role in the performance evaluation of the optimization methods. Effectiveness is measured by computing the log loss and the AUC value. Table 3 shows these values for the various gradient based and differential evolution algorithms, based on the model with the smallest test error after several optimizations. It can be seen that the backpropagation and resilient backpropagation methods perform better than the evolutionary algorithms. The same observation holds for the AUC values, which are all higher for the gradient based methods. Still, the relative difference in log loss and AUC between the gradient based and differential evolution methods is rather small, with the gradient based methods only performing slightly better than differential evolution. Several Delong tests also indicate that the AUC values of the models are not significantly different. Therefore, it can be stated that the gradient based methods and DE algorithms perform comparably in terms of effectiveness.

The gradient based methods are evaluated together with the differential evolution algorithms and compared in terms of efficiency, robustness and effectiveness. In terms of efficiency, the gradient based methods perform markedly better than the evolutionary methods, due to the large computation times of the differential evolution algorithm. On the other hand, the differential evolution algorithms do perform better when considering the robustness. Yet, the unreasonable computation times of differential evolution render this advantage negligible. Lastly, the analysis of the effectiveness gives reason to believe that the gradient based and evolutionary methods are comparable in effectiveness. All things considered, it can be stated that the gradient based methods still perform better than differential evolution overall. The larger robustness of differential evolution does not provide an advantage that outweighs the efficiency of backpropagation and resilient backpropagation. Yet, differential evolution might still prove beneficial for certain problems. A possible solution might be to apply differential evolution to find the global optimum and then use gradient based methods to reach a precise solution. Therefore, further research in this direction is suggested.

5 Conclusion

In this thesis the performance of various techniques for optimizing a neural network was analyzed and discussed. The aim was to find out to what extent differential evolution can compete with traditional gradient based optimization methods. Using data from a telemarketing campaign by a car insurance company, a model was obtained for predicting the outcome of this campaign. In order to provide a correct performance comparison, five different optimization algorithms were used. Firstly, the backpropagation and resilient backpropagation models were applied as gradient based optimizers. Next, the differential evolution algorithm was utilized, using three different mutation schemes. The results of these methods were used to provide an elaborate comparison of the algorithms.

It should be noted that the structure of the network played an important role in the optimization. The gradient based methods were rather prone to overfitting, which implied the need for regularization of the parameters. The results also indicated that the performance of differential evolution was highly dependent on the values of the hyperparameters. It turned out that too large values of the crossover and scale factor parameters increased the test error significantly. Furthermore, the choice of population size played an important role for differential evolution. Several population sizes were analyzed and the best results were obtained when taking the population size equal to fifteen times the number of parameters.

Moreover, the predictive performances of the methods were analyzed. The results indicated that there were no distinct differences in predictive capabilities between gradient based methods and differential evolution. The gradient based methods displayed only marginally better predictive characteristics. Therefore, it can be concluded that both approaches can provide adequate solutions for neural network models.

However, the optimization results revealed several remarkable characteristics. The hypothesis about the exploitative capabilities of gradient based methods and the explorative capabilities of differential evolution turned out to be correct. Differential evolution algorithms displayed a rather small variation in test error, which indicated that the global optimum was often found by these algorithms. The gradient based methods showed a much larger variation in test errors, due to their local optimization characteristics. Therefore, one might prefer the use of differential evolution when searching for a global optimum.

Although the differential evolution algorithms have large explorative capabilities, this characteristic also has a major downside. It turned out that the optimization time needed to find the global optimum was remarkably larger than the optimization time of the gradient based methods. The evolutionary algorithms displayed a computation time equal to approximately 3000 times that of the gradient based methods. The main consequence of this problem is that it renders the benefits of the explorative capabilities almost negligible. After all, the global optimum can also be found by taking the best solution among several gradient optimizations. Thus, the largest drawback of differential evolution is its unreasonably slow optimization. This should be one of the main topics of improvement for further research.

In conclusion, it turns out that differential evolution is often still outperformed by traditional gradient based methods. The small advantages of differential evolution do not outweigh the simplicity and efficiency of the gradient methods. Nevertheless, the idea of using differential evolution for optimizing neural networks should not be discarded altogether. The large explorative capabilities of differential evolution offer many interesting options for optimization purposes. A viable alternative could be using differential evolution to find the global optimum and refining the parameters using gradient based methods. Therefore, it is suggested that further research into this subject is needed in order to improve the use of the differential evolution algorithm.


References

Basheer, I., & Hajmeer, M. (2000). Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods, 43(1), 3–31.

Cantú-Paz, E., & Kamath, C. (2005). An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(5), 915–927.

Dragland, A. (2013, May 22). Big data, for better or worse: 90% of world's data generated over last two years. Retrieved October 16, 2017, from https://www.sciencedaily.com/releases/2013/05/130522085217.htm

Günther, F., & Fritsch, S. (2010). neuralnet: Training of neural networks. The R Journal, 2(1), 30–38.

Hastie, T., Tibshirani, R., & Friedman, J. (2002). The elements of statistical learning: Data mining, inference, and prediction. Biometrics.

Ilonen, J., Kamarainen, J.-K., & Lampinen, J. (2003). Differential evolution training algorithm for feed-forward neural networks. Neural Processing Letters, 17(1), 93–105.

Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861–867.

Maniezzo, V. (1994). Genetic evolution of the topology and weight distribution of neural networks. IEEE Transactions on Neural Networks, 5(1), 39–53.

Moro, S., Cortez, P., & Rita, P. (2015). Using customer lifetime value and neural networks to improve the prediction of bank deposit subscription in telemarketing campaigns. Neural Computing and Applications, 26(1), 131–139.

Neri, F., & Tirronen, V. (2010). Recent advances in differential evolution: a survey and experimental analysis. Artificial Intelligence Review, 33, 61–106.

Piotrowski, A. P. (2014). Differential evolution algorithms applied to neural network training suffer from stagnation. Applied Soft Computing, 21, 382–406.
