
RETROSPECTIVE FORECASTS OF THE 2016 U.S.

PRIMARY ELECTIONS

An empirical comparison of evolutionary

and gradient-based neural network training

with applications in political forecasting

Lennert Jansen

10488952

Abstract

This thesis concerns an empirical comparison between differential evolution and gradient-based optimisation methods applied to artificial neural network training. The gradient-based methods outperform differential evolution, and a logistic regression model is used as a benchmark. The results, however, suggest that a more elaborate network architecture is required to capture the non-linearities in the data to the fullest extent.

Under the supervision of Dr. N.P.A. van Giersbergen
Faculty of Economics and Business
University of Amsterdam
December 2017


Declaration of authenticity

I, Lennert Jan Jansen, hereby declare that this thesis has been written by me, and only me. I am fully accountable for its contents. The contents of this thesis are authentic and based on sources, all of which are documented in the bibliography. The Faculty of Economics and Business of the University of Amsterdam can only be held responsible for providing the supervision, not for the contents of this thesis.


Contents

Abstract

1 Introduction
2 Theoretical study on neural network training
  2.1 Artificial Neural Networks, their architecture and implementation
    2.1.1 Underlying structure of multi-layer perceptron
    2.1.2 Mathematical implementation of an MLP model
    2.1.3 Activation functions
  2.2 Performance measures
    2.2.1 Loss functions
    2.2.2 The receiver operating characteristic curve and the area under the curve
  2.3 Important issues in neural network training
    2.3.1 Dimensionality reduction: principal component analysis
    2.3.2 Training, validation and test samples
    2.3.3 Starting values of model parameters and scaling of the inputs
    2.3.4 Overfitting
    2.3.5 Number of hidden layers and units
    2.3.6 State space
  2.4 Gradient-based methods
    2.4.1 Backpropagation and resilient backpropagation
  2.5 Differential Evolution and several mutation schemes
  2.6 The logistic regression model
  2.7 No consensus and political forecasting using demographic data
3 Data and Research Methods
  3.1 The US primary elections
  3.2 Description of the data set
  3.3 Description of the proposed model and research methods
4 Results
  4.1 Pre-processing data with principal component analysis
  4.2 Determining the model architecture
    4.2.1 Finding the number of hidden units
    4.2.2 Determining the optimal weight decay parameter
  4.3 Differential Evolution
    4.3.1 Tuning: Finding optimal values of CR and F
    4.3.2 Training the weight parameters using three mutation schemes
    4.3.3 Test performance of Differential Evolution
  4.4 Gradient-based methods
    4.4.1 Backpropagation and resilient backpropagation
    4.4.2 Logistic regression model
  4.5 Empirical comparison of gradient-based and DE training algorithms
    4.5.1 Logistic regression benchmark versus an ANN
    4.5.2 Efficiency in terms of required computation time
    4.5.3 Robustness
    4.5.4 Effectiveness
    4.5.5 Statistical comparison of AUC
5 Conclusions and topics of discussion for further research
References
A Appendix

List of Figures

1 Diagram of a single-layered perceptron
2 Box plots of test errors for varying numbers of hidden units
3 Box plot of test errors evaluated for varying values of the weight decay parameter
4 Box plots of test errors for varying values of CR (a) and F (b)
5 Log loss training performance of the three DE mutation schemes (maximum iterations = 3000)
6 ROC curves of DE mutation schemes
7 ROC curve for backpropagation models
8 ROC curves of best performing DE, gradient and logit model
9 Box plots of observed run time per class of algorithm
10 Spread of obtained optima per algorithm


1 Introduction

In 2012 Geoffrey E. Hinton and two of his graduate students at the University of Toronto stunned the world of artificial intelligence (AI) by creating a large (deep) neural network that could recognise and classify images with performances that overshadowed any previous or contemporary work on the subject (Krizhevsky, Sutskever, & Hinton, 2012). Their research has been cited over 6000 times, and solidified the Artificial Neural Network's relevance as a fundamental and unassailable modelling tool in AI. This influential paper reassured many people of the power of applying ANNs to large, complex problems. Image recognition, however, is but one of the countless applications of ANNs that has made this computing system the focus of a multitude of studies.

ANNs are a class of computing systems developed in the fields of statistics and artificial intelligence, inspired by the large networks of neurons and axons that constitute animal nervous systems. They are essentially functions which receive input variables, and then model the output variable as a nonlinear function of these inputs (Hastie, Tibshirani, & Friedman, 2009). The input and output variables are connected by one or more layers of nodes or neurons which are all interconnected by weight parameters. These parameters ultimately determine an ANN's performance. The optimisation, or so-called training, of ANN weights is therefore an important task.

The number of network parameters increases dramatically as the number of layers and neurons grows. Due to its large-scale complexity and real-world applicability, optimising ANN parameters can be a difficult, yet rewarding task. This serves as a motive for the current research. Furthermore, the non-linearity of these models renders analytical optimisation methods useless, so researchers turn to numerical methods for training ANNs. Gradient-based optimisation algorithms such as backpropagation, resilient backpropagation, and the Levenberg-Marquardt method have been heavily researched. They are known to converge and find optima rather rapidly. However, gradient-based methods require differentiability of the objective function, and thus have limited applicability.

In recent years, genetic algorithms and evolutionary strategies have gained more attention in research on ANN training, and have been found to be promising stochastic global optimisation methods (Ilonen & Kamarainen, 2003). A subcategory of genetic algorithms, Differential Evolution (DE) also draws inspiration from a field of biology, namely natural selection. The method involves a randomly initialised population of starting solutions, which are then mutated to form a new generation of offspring solutions. Of each parent-offspring pair, only the superior solution survives, and the process is repeated until a globally superior solution is obtained. DE's simple implementation and lack of a differentiability restriction are its main advantages, whereas typically slow and sometimes premature convergence are its most common drawbacks. This suggests that DE could be a worthy adversary for gradient-based algorithms.

The aim of this paper is to effectively compare ANN training performances when different classes of algorithms are applied to certain empirical data. Or, more specifically: to what extent does the class of training algorithm affect the optimisation of Artificial Neural Network parameters applied to cross-sectional data? This comparison shall be made by applying both gradient- and DE-based training algorithms to ANNs implemented for predicting election results based on demographic data.

In Section 2 a theoretical background is sketched by discussing relevant recent developments, as well as a more detailed explanation of ANNs, important issues to take into account, DE, and gradient-based methods. In Section 3, the data set considered for this thesis is described, along with the applied research model and methods. The fourth section contains an overview, interpretation and discussion of the most important results. Finally, in Section 5 a conclusion and discussion topics for further research are summarised.


2 Theoretical study on neural network training

2.1 Artificial Neural Networks, their architecture and implementation

2.1.1 Underlying structure of multi-layer perceptron

As mentioned in Section 1, Neural Network models are useful for classification of non-linearly separable patterns and approximation of arbitrary continuous functions. This thesis concerns the training of multi-layer perceptron (MLP) neural networks, arguably the most popular type of ANN (Piotrowski, 2014). An MLP's underlying structure can be represented as a directed graph in which vertices and edges represent the model's nodes (or neurons or units) and connections, respectively (Günther & Fritsch, 2010). The neurons are organised in an input layer, one or more hidden layers and one output layer. Units between the input and output layers are referred to as hidden because they are not directly observed. The arrows of the edges determine the direction in which data is fed and passed through the model. MLP neural networks wherein the edges do not form a cycle are known as feedforward neural networks (FNN) (Hastie et al., 2009). This thesis solely concerns feedforward neural networks. Figure 1 depicts a simple single-layered, feedforward MLP (more specifically a single-layered perceptron), consisting of four input nodes, five hidden nodes and a single output node.

Figure 1: Diagram of a single-layered perceptron

2.1.2 Mathematical implementation of an MLP model

A feedforward MLP neural network consisting of a single hidden layer, composed of m hidden units calculates the following function:

\hat{y} = \sigma\Big(v_0 + \sum_{j=1}^{m} v_j \,\sigma\big(w_{0j} + \sum_{i} w_{ij} x_i\big)\Big), (1)

where \hat{y} denotes the predicted output value of the dependent variable, \sigma(\cdot) represents the sigmoid activation function, w_{ij} represents the weight parameters associated with the connections between the input and hidden layer, v_j represents those associated with the connections between the hidden layer and the output, and x_i represents the input variables.
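To make the mapping in equation (1) concrete, the sketch below implements the forward pass of such a single-hidden-layer MLP in R. The function and variable names (mlp_forward, W, w0, v, v0) are illustrative and not taken from the thesis code.

```r
# Minimal sketch of the forward pass in equation (1) for one observation.
# x: numeric input vector (length p); W: p x m matrix of input-to-hidden
# weights; w0: length-m vector of hidden biases; v: length-m vector of
# hidden-to-output weights; v0: scalar output bias.
sigmoid <- function(z) 1 / (1 + exp(-z))

mlp_forward <- function(x, W, w0, v, v0) {
  h <- sigmoid(w0 + as.vector(t(W) %*% x))  # hidden-layer activations
  sigmoid(v0 + sum(v * h))                  # predicted probability y-hat
}

# Illustrative call with random weights (p = 4 inputs, m = 5 hidden units)
set.seed(1)
p <- 4; m <- 5
y_hat <- mlp_forward(x  = rnorm(p),
                     W  = matrix(runif(p * m, -1, 1), p, m),
                     w0 = runif(m, -1, 1),
                     v  = runif(m, -1, 1),
                     v0 = runif(1, -1, 1))
```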

2.1.3 Activation functions

The appropriate activation function is research-dependent. An activation function determines the firing rate of an MLP-neuron by performing some fixed mathematical operation on the given input. Several activation functions may be encountered in the literature. The current paper concerns the commonly used sigmoid function:

\sigma(v) = \frac{1}{1 + e^{-v}}. (2)

The sigmoid function translates its inputs to the unit interval [0, 1]. Other commonly used activation functions include the hyperbolic tangent and the Rectified Linear Unit (ReLU) (Krizhevsky et al., 2012). Maas, Hannun, and Ng (2013) even suggest that the ReLU activation function significantly improves convergence speed and model performance. It is, however, not considered in this thesis due to the differentiability restriction imposed upon the model by the gradient-based methods.

2.2 Performance measures

2.2.1 Loss functions

An error (or objective) function E(w) is used to determine the performance of an MLP neural network model. This paper applies the logarithmic loss function (log loss) as objective function.

LL(w) = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big]. (3)

In (3), \hat{y}_i denotes the predicted output value of the i-th observation, and y_i the observed output value. The goal is to minimise the loss as a function of the parameter weights w, and thus optimise the weights to maximise performance. The logarithmic loss function is commonly used for classification problems. The Mean Squared Error (or Residual Sum of Squares), a commonly used objective function in econometric analysis, is also often used for ANNs. However, given the nature of the problem considered in this study, the logarithmic loss function is more suited.
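As an illustration of equation (3), a minimal R implementation of the logarithmic loss could look as follows; the clipping constant eps is an added safeguard against log(0) and is not part of the thesis specification.

```r
# Sketch of the logarithmic loss in equation (3); y is a 0/1 vector of
# observed outcomes and y_hat a vector of predicted probabilities.
log_loss <- function(y, y_hat, eps = 1e-15) {
  y_hat <- pmin(pmax(y_hat, eps), 1 - eps)   # guard against log(0)
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

log_loss(c(1, 0, 1), c(0.9, 0.2, 0.6))  # example: approximately 0.28
```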

2.2.2 The receiver operating characteristic curve and the area under the curve

In statistical learning, the receiver operating characteristic (ROC) curve is a commonly used technique for assessing a model's classifying abilities (Hastie et al., 2009). It plots the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 - specificity) for varying thresholds of the classifier. The corresponding Area Under the Curve (AUC) is a unique value between zero and one associated with an ROC curve. The AUC denotes a model's quality in terms of its ability to correctly classify an observation of input features. An AUC of 1 implies perfect classification, whereas a value of 0.5 suggests the model is no more effective than randomly guessing an outcome per observation. The desired shape of an ROC curve is a smoothed upper-left-sided triangle, preferably encapsulating the majority of the northwestern half of the square-shaped graph. In summary, ROC curves and their respective AUCs are useful and visually intuitive performance measures for model comparison.
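A minimal sketch of how an ROC curve and its AUC can be obtained with the pROC R package (which the thesis uses later for the AUC comparisons); the vectors y_test and p_hat are made up for illustration.

```r
# Sketch of an ROC curve and AUC with the pROC package; y_test is a 0/1
# vector of observed outcomes, p_hat the predicted probabilities.
library(pROC)

y_test <- c(1, 0, 1, 1, 0, 0, 1, 0)
p_hat  <- c(0.8, 0.3, 0.6, 0.9, 0.4, 0.2, 0.55, 0.7)

roc_obj <- roc(response = y_test, predictor = p_hat)
auc(roc_obj)        # area under the curve
plot(roc_obj)       # sensitivity versus 1 - specificity
```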

2.3 Important issues in neural network training

Due to the large-scale complexity of most neural network models, training such models can be a challenging task. Despite the seemingly endless number of problems one can encounter in doing so, some common issues are well known and described in the relevant literature (Hastie et al., 2009). This subsection discusses some of these important issues and proposed methods of avoiding them.

2.3.1 Dimensionality reduction: principal component analysis

Principal component analysis (PCA) allows us to summarise a given data-set into a smaller set of sufficiently representative variables. This is especially useful when confronted with computational limitations as a consequence of a large body of highly correlated input variables (James, Witten, Hastie, & Tibshirani, 2006). PCA is the process of transforming the highly correlated variables of a data-set into a collection of pairwise uncorrelated linear combinations, known as principal components. Many heuristic as well as statistically supported methods exist for determining the minimum number of meaningful principal components required for effective analysis. Cangelosi and Goriely (2007) discuss and investigate a number of such methods, including: the (modified) broken stick model, the Kaiser-Guttman test, cross-validation, bootstrapping, Bartlett's test for equality of eigenvalues and the cumulative percentage of total variance. Aside from its obvious computation-speed related benefits, PCA also offers an opportunity to decrease multicollinearity in a data-set. Heij, de Boer, Franses, Kloek, and van Dijk (2004) define multicollinearity as the occurrence of high levels of correlation between input variables, inhibiting the accurate estimation of isolated effects in a model. Multicollinearity is a notorious problem in econometric analysis and carries many undesired side-effects with it, including variance inflation and loss of interpretability of diagnostics.
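A sketch of this pre-processing step with the base R prcomp function from the "stats" package, retaining components up to a chosen cumulative-variance threshold; the placeholder matrix X stands in for the standardised input variables.

```r
# Sketch of PCA-based dimensionality reduction with the base "stats" package,
# keeping enough components to explain a chosen share of total variance.
X <- matrix(rnorm(200 * 10), 200, 10)            # placeholder input matrix

pca     <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative share of variance
k       <- which(cum_var >= 0.99)[1]             # e.g. a 99% threshold
Z       <- pca$x[, 1:k]                          # component scores used as inputs
```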


2.3.2 Training, validation and test samples

Another issue that arises when training ANNs is the partitioning of the data-set. In AI, it is common practice to divide a data-set into a training, validation and test set (Looney, 1996). The training data are used to fit the network parameters, the validation data are used to tune the hyper-parameters associated with the model, and the model's final performance is measured on the test data. Although no single superior ratio of training, validation and test samples exists, certain rules of thumb apply. Looney (1996), for instance, suggests a 60 percent training sample, a 15 percent validation sample and a 25 percent test sample.
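A minimal sketch of such a 60/15/25 partition in R, assuming observations can be assigned to the three samples at random; the data frame df is a placeholder.

```r
# Sketch of a roughly 60/15/25 train/validation/test split, as suggested by
# Looney (1996); "df" is a placeholder data frame.
set.seed(42)
df <- data.frame(x = rnorm(1000), y = rbinom(1000, 1, 0.5))

idx   <- sample(c("train", "valid", "test"), size = nrow(df),
                replace = TRUE, prob = c(0.60, 0.15, 0.25))
train <- df[idx == "train", ]
valid <- df[idx == "valid", ]
test  <- df[idx == "test", ]
```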

2.3.3 Starting values of model parameters and scaling of the inputs

Generally, starting values of the weight parameters are chosen to be random values near zero (Hastie et al., 2009). Note that the first-order derivative of the sigmoid function is roughly constant near zero, meaning that the function is approximately linear in that region. Thus, when training begins, the model is nearly linear. Non-linearity increases as the iterative process of modifying the weights carries on; in doing so, individual units localise to directions that introduce non-linearity where it is needed. Hastie et al. (2009) state that a proper initialisation of weights is zero-centred within a narrow interval. Piotrowski (2014), for instance, finds superior performance of training algorithms for ANNs when weights are initialised randomly within [-1, 1].

Prior to neural network training, the data-set must be transformed such that the independent and dependent variables exhibit particular distributional properties (Olden & Jackson, 2002). The dependent variable is often translated in accordance with the activation function; for the sigmoid and ReLU transfer functions this implies the interval [0, 1]. This interval also makes it possible to interpret the predicted output as a probability. Finally, Hastie et al. (2009) as well as Olden and Jackson (2002) suggest that the input variables need to be normalised with mean equal to zero and standard deviation equal to one.

2.3.4 Overfitting

Large neural network models are at risk of being over-parameterised. Over-parameterised models with too many weights will overfit to the training data. Overfitting is essentially the occurrence of a statistical model that contains more parameters than is justified by the observed data. As a result, the training results will fit too closely to the training data, whilst losing predictive power for future (test) data. This is an important issue in ANN training and machine learning in general. Luckily, methods of avoiding this problem and maintaining robustness exist. Piotrowski (2014) suggests a variant of early stopping, where training is prematurely ended before the model has a chance to overfit. Hastie et al. (2009) suggest a more explicit regularisation method: weight decay. Here, a penalty is added to the error function, R(w) + \lambda J(w), where

J(w) = \sum_i w_i^2 \quad \text{or} \quad J(w) = \sum_i \frac{w_i^2}{1 + w_i^2}, (4)

and λ > 0 is a tuning parameter which is generally estimated by means of cross-validation.
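As an illustration, the penalised objective R(w) + λJ(w) with the quadratic penalty from equation (4) could be written in R as below; predict_fn is a hypothetical function returning the model's predicted probabilities for a weight vector w, and is not part of the thesis code.

```r
# Sketch of the weight-decay penalised objective: log loss (equation (3))
# plus lambda * sum(w^2) from equation (4). "predict_fn" is a placeholder
# model function mapping a weight vector w and inputs X to probabilities.
penalised_loss <- function(w, y, X, predict_fn, lambda) {
  y_hat <- predict_fn(w, X)
  base  <- -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
  base + lambda * sum(w^2)     # weight decay shrinks weights towards zero
}
```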

2.3.5 Number of hidden layers and units

An obvious challenge that one might run into when working with ANNs is the architecture of the model itself. By architecture, the number of hidden layers and the number of units per layer is meant. In general, it is better to have too many hidden units than too few (Hastie et al., 2009): if appropriate regularisation is applied, the danger of overfitting can be avoided. A number of hidden units between 5 and 100 is typically used. Finding the optimal number is a matter of starting at the lower bound of this interval and measuring the performance as the number of hidden units increases. Choosing the optimal number of hidden layers requires sufficient background information on the data and experimentation. A single hidden layer, however, is postulated to be sufficient for most arbitrary continuous optimisation problems (Basheer, 2000).

2.3.6 State space

The objective function is nonconvex, and possesses many local optima. The danger of falling into stagnation or repeatedly finding local optima arises, resulting in final solutions that are heavily dependent on the starting positions of the optimisation algorithms, according to Hastie et al. (2009). The authors state that training algorithms with explorative capabilities typically have less trouble exploring the different regions of the state space than their exploitative contemporaries.

A solution to the problem of multiple local optima is to apply many randomised starting configurations when training a model and to choose the solution with the lowest error. Hastie et al. (2009) argue that a better approach would be to use the average of the predictions over the collection of networks as the final prediction. Finally, an approach called 'bagging' could be used, where randomly perturbed versions of the training data are used to uncover different optima.

2.4 Gradient-based methods

2.4.1 Backpropagation and resilient backpropagation

The training of an MLP network model is essentially an optimisation problem with E(w) as objective function. Given differentiability of the objective function, optimisation methods using the gradient may be applied. Gradient-based methods are a generally accepted and heavily researched class of training algorithms. These numerical methods are used due to the nonlinear nature of ANNs. They are known to converge rapidly for small MLP models (Ilonen & Kamarainen, 2003). This subsection serves as an overview of the gradient-based optimisation methods considered in this paper.

Backpropagation and a variant thereof, resilient backpropagation, modify the weight parameters of an ANN to find a local minimum of the error function. The gradient of the error function with respect to the weights is calculated, and the weights are modified in the opposite direction of the partial derivatives until a root of the gradient is found (Günther & Fritsch, 2010). Traditional backpropagation adjusts the weights by the following rule:

w_k^{(t+1)} = w_k^{(t)} - \eta \cdot \frac{\partial E^{(t)}}{\partial w_k^{(t)}}, (5)

where t indexes the iteration step and k the weights. Here, in traditional backpropagation, the magnitude of the partial derivative is multiplied by a global learning rate \eta that is assumed to be appropriate for the entire model.

Resilient backpropagation, on the other hand, implements a separate learning rate per weight, which can be modified during training. Resilient backpropagation iterates in the following manner:

w_k^{(t+1)} = w_k^{(t)} - \eta_k^{(t)} \cdot \mathrm{sign}\!\left(\frac{\partial E^{(t)}}{\partial w_k^{(t)}}\right), (6)

in order to guarantee an equal influence of the learning rate over the entire network. The advantages of backpropagation and resilient backpropagation are their simple and local nature; both methods apply only first-order partial derivatives. Backpropagation methods tend to slow down as the complexity of the gradients increases, so methods implementing the Hessian of the error function are not recommended, due to unreasonably large computation time (Hastie et al., 2009).
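A minimal R sketch of the two update rules: equation (5) with a fixed global learning rate, and a simplified sign-based step in the spirit of equation (6). The adaptation of the per-weight step sizes η_k, which resilient backpropagation performs based on sign changes of the gradient, is omitted here.

```r
# Sketch of the update rules (5) and (6); "grad" is assumed to be the
# gradient of the error function with respect to the weight vector w.
backprop_step <- function(w, grad, eta = 0.05) {
  w - eta * grad                 # fixed global learning rate, equation (5)
}

rprop_step <- function(w, grad, eta_k) {
  w - eta_k * sign(grad)         # per-weight step sizes, in the spirit of (6)
}
```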

2.5 Differential Evolution and several mutation schemes

Gradient-based methods are known to perform poorly when the model size increases. Due to this inconvenience, researchers search for more robust global optimisation methods. Ilonen and Kamarainen (2003), Piotrowski (2014) and Wang, Zeng, and Chen (2015) suggest DE-based methods as a solution. The basic principle of DE is analogous to that of natural selection. DE operates on a population of candidate solutions, P_G. The population consists of NP real-valued vectors of model parameters, W_{i,G}, where i indexes the population member and G the generation to which the population belongs (Ilonen & Kamarainen, 2003):

P_G = (W_{1,G}, \ldots, W_{NP,G}), \quad G = 0, \ldots, G_{max} (7)

W_{i,G} = (w_{1,i,G}, \ldots, w_{D,i,G}), \quad i = 1, \ldots, NP, \quad G = 0, \ldots, G_{max} (8)

Following the initialisation of the first population of starting solutions, P_G, vectors in the current population are randomly sampled and combined to create candidate vectors for the subsequent generation. These candidates are so-called trial vectors, U_{i,G+1} = (u_{1,i,G+1}, \ldots, u_{D,i,G+1}), which together form the trial population P'_{G+1}. The trial vectors are generated by the following mutation scheme, deemed DE/rand/1 (Piotrowski, 2014):

v_{j,i,G+1} = w_{j,r_3,G} + F \cdot (w_{j,r_1,G} - w_{j,r_2,G})

u_{j,i,G+1} = \begin{cases} v_{j,i,G+1}, & \text{if } rand_j[0,1) \leq CR \\ w_{j,i,G}, & \text{otherwise} \end{cases} (9)

where i = 1, \ldots, NP; j = 1, \ldots, D; r_1, r_2, r_3 \in \{1, \ldots, NP\} are randomly selected with r_1 \neq r_2 \neq r_3 \neq i; CR \in [0, 1]; and F \in (0, 1+]. In this mutation scheme, CR is a real-valued crossover factor that controls the probability that a trial vector's element will come from the mutated vector in lieu of the current vector, and F is a scaling parameter. Ilonen and Kamarainen (2003) point out that the control parameters, F and CR, affect the convergence speed and robustness of the search process. Additionally, they conclude that their optimal values depend on the error function features and the population size. Thus, the optimality of the control parameters depends on the specific task.

The population for the next generation, P_{G+1}, is either selected from the current population P_G or from the offspring population. Selection from the offspring population follows the rule

W_{i,G+1} = \begin{cases} U_{i,G+1}, & \text{if } E(U_{i,G+1}) \leq E(W_{i,G}) \\ W_{i,G}, & \text{otherwise.} \end{cases}

Aside from DE/rand/1, two other mutation strategies are considered for this study, namely DE/local-to-best/1 (equation (10)) and DE/current-to-best/1 (equation (11)) (Piotrowski, 2014; Mullen, Ardia, Gil, Windover, & Cline, 2011):

v_{j,i,G+1} = w_{j,old,G} + F \cdot (w_{j,best,G+1} - w_{j,old,G}) + F \cdot (w_{j,r_1,G} - w_{j,r_2,G}) (10)

and

v_{j,i,G+1} = w_{j,i,G} + F \cdot (w_{j,best,G} - w_{j,i,G}) + F \cdot (w_{j,r_1,G} - w_{j,r_2,G}), (11)

where w_{j,best,G+1} and w_{j,old,G} represent the best member and the i-th member, respectively, of the previous population.

The DE methods described above have three user-dependent control parameters: the scale factor (F), the crossover factor (CR), and the population size (NP). The proper choice of these parameters, as well as the choice of mutation strategy, is an important task when implementing DE (Piotrowski, 2014).

In summary, DE is a simple and efficient global optimiser, suitable for a wide range of nonlinear continuous optimisation problems, regardless of differentiability (Neri & Tirronen, 2010). Due to DE's stochastic nature, it tends to explore the state space in a more global sense than its gradient-based competition. However, despite these advantages, DE methods seem to rarely outperform gradient-based methods significantly (Ilonen & Kamarainen, 2003). Some studies accuse DE methods of falling into stagnation too easily (Piotrowski, 2014).
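The sketch below implements one generation of DE/rand/1 in R, written directly from equation (9) and the selection rule above. It is a simplified illustration: for instance, it omits the common safeguard of forcing at least one element of each trial vector to come from the mutant vector.

```r
# Sketch of one DE/rand/1 generation: mutation, binomial crossover and greedy
# selection. P is an NP x D matrix of parameter vectors; E() is an error
# function taking a single parameter vector.
de_rand_1_generation <- function(P, E, F = 0.5, CR = 0.8) {
  NP <- nrow(P); D <- ncol(P)
  P_next <- P
  for (i in 1:NP) {
    r <- sample(setdiff(1:NP, i), 3)              # r1 != r2 != r3 != i
    v <- P[r[3], ] + F * (P[r[1], ] - P[r[2], ])  # mutant vector
    cross <- runif(D) <= CR
    u <- ifelse(cross, v, P[i, ])                 # trial vector (crossover)
    if (E(u) <= E(P[i, ])) P_next[i, ] <- u       # greedy selection
  }
  P_next
}

# Illustrative usage on a simple sphere function
sphere <- function(w) sum(w^2)
P0 <- matrix(runif(50 * 5, -10, 10), 50, 5)
P1 <- de_rand_1_generation(P0, sphere)
```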


2.6 The logistic regression model

To investigate whether or not a simpler non-linear regression model is sufficient to capture the necessary patterns in the data and achieve the desired predictive capabilities, a qualitative response model is considered. The logistic regression model, or logit model, is one of the more popular binary response models in econometric analysis. Hayashi (2000) defines this qualitative response model as:

f(y_i = 1 \mid x_i; \theta) = \Lambda(x_i'\theta)

f(y_i = 0 \mid x_i; \theta) = 1 - \Lambda(x_i'\theta)

where \Lambda denotes the cumulative distribution function of the logistic distribution:

\Lambda(v) \equiv \frac{1}{1 + e^{-v}}. (12)

Note that the sigmoid activation function is indeed identical to the cumulative distribution function of the logistic distribution. The logit model is therefore mathematically equivalent to a single-layered feedforward ANN consisting of a single neuron. In practice, binary response models are often based on either the standard normal density (probit model) or the logistic density (Heij et al., 2004). This study solely considers the logit model in order to make a fair assessment of the effectiveness of the ANN structure, holding all population assumptions constant.
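Because the logit model coincides with a single-neuron feedforward network, it can be fitted directly with base R's glm; the variable names (trump_win, PC1, PC2) are illustrative placeholders for the outcome and principal-component inputs, not names from the thesis data.

```r
# Sketch of the logistic regression benchmark with base R; "train" stands in
# for a data frame with a binary outcome and principal-component predictors.
train <- data.frame(trump_win = rbinom(200, 1, 0.5),
                    PC1 = rnorm(200), PC2 = rnorm(200))

logit_fit <- glm(trump_win ~ ., data = train, family = binomial(link = "logit"))
p_hat     <- predict(logit_fit, newdata = train, type = "response")  # probabilities
```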

2.7 No consensus and political forecasting using demographic data

It must be noted from the previous subsections that, in the literature mentioned, no consensus has been reached regarding the superiority of either gradient-based or DE methods. Ilonen and Kamarainen (2003), for instance, argue that DE presents no distinct advantages in terms of solution quality when compared to gradient-based methods. To recall a few comparisons: DE methods are efficient global optimisers that place no restrictions on the error function and possess explorative capabilities, while gradient-based methods perform extremely well for small neural networks and find local optima more precisely due to their exploitative nature. Nevertheless, the literature studied in preparation for this thesis has not led to the belief that one of these classes of algorithms is significantly superior to the other, and it is this lack of agreement that served as the motive for this study.

A data set containing demographic features of U.S. counties, as well as election results per county is considered. The demographic data is split up, and the separate parts are used for training, validation and testing. Section 3 describes the data set thoroughly, along with the empirical methodology applied in this thesis.

3 Data and Research Methods

3.1 The US primary elections

The current subsection describes the data and research methods used for this study. The data-set considered for this thesis contains data relevant to the 2016 U.S. presidential election, including up-to-date primary results. Primary elections, or primaries, are elections that narrow down the field of candidates of major political parties or alliances in preparation for general elections. Primaries are common practice in the United States of America, and determine which candidates will represent the Democratic and Republican parties. Primaries are held per county. The primary election process can be summarised as follows. Ballots are counted, and county winners are determined. Whoever wins the most counties in a state wins that state, and the party candidate with the most states won is elected as party representative for the general election. In 2016, Hillary Clinton was the Democratic party representative, and Donald Trump that of the Republican party. The aforementioned process of electing party representatives is an oversimplification of the actual process, neglecting steps irrelevant to this study. Furthermore, it must be noted that some states hold caucuses in lieu of primaries to elect party candidates.

3.2 Description of the data set

The data-set in its final form was retrieved from Kaggle, an open platform for predictive modelling and data analysis. The main original data source is Cable News Network (CNN). Two main bodies of information constitute the data-set: a file containing demographic characteristics of 3143 counties, and the primary election results per county. The county features file contains numeric demographic data concerning population estimates and growth, population density, age and gender distributions, percentages of ethnic minorities, rate of home-ownership, living standards, financial status, marital status, completed levels of education, average distance to work, employment, etc. The original matrix of input variables consisted of a total of 51 different numerical variables on the aforementioned subjects. Before any form of data analysis or preparation, two population-based variables, "PST040210" and "PST120214", are dropped due to the presence of highly similar variables, leaving a total of 49 input variables to start with. See the appendix for an exhaustive list of the input variables' names and respective descriptions.

The file for the primary election results contains a matrix with 3143 rows, denoting every county in the data-set. For every county available, this file describes the fraction of votes won per candidate, the total number of votes per candidate, the winning candidate and his or her political party (i.e., Democratic or Republican). The research described in this thesis, however, solely requires a binary variable (denoting whether or not the Republican candidate, Donald Trump, wins) as the primary election result, as mentioned in the following subsection.

3.3 Description of the proposed model and research methods

Before any modifying operations are performed on the data, the main body of data is partitioned into a 60% training sample, a 15% validation sample, and a 25% test set, as proposed by Looney (1996). All numeric input variables are then normalised. Furthermore, for the sake of simplicity and current relevance, this study focuses solely on predicting whether Donald Trump wins a county based on demographic features. That is, the observed output value is translated to the value 1, denoting Trump's triumph, and 0 otherwise. This binary translation of the observed output defines the current study as a classification problem. Thus, the sigmoid activation function, equation (2), is applied. Performance is measured by means of the logarithmic loss function, equation (3). This research considers a single-layered perceptron, as a single hidden layer is proposed to be sufficient for mapping an arbitrary continuous function by Basheer (2000). The choice of the number of hidden units is guided by experiment: a number of hidden units roughly between 5 and 100 is chosen, starting at 5 and evaluating the cross-entropy (log loss) whilst increasing the number of units (Hastie et al., 2009).

In accordance with the research question proposed in Section 1 and the subject matter in Section 2, the model is trained using gradient-based and DE training algorithms on the training data-set. The validation set is used to tune the hyper-parameters before training commences. Finally, the test sample results are compared to distinguish the effects that the class of training algorithm may or may not have on the fitting of the model. Two gradient-based training methods are considered, namely backpropagation and resilient backpropagation. Three different DE mutation schemes are investigated: DE/rand/1, DE/local-to-best/1 and DE/current-to-best/1. A logistic regression model is used as a benchmark (because it is the most basic and general relevant model formulation) in order to evaluate the necessity of a network structure of greater complexity than that of a single hidden node. It must be noted that all gradient-based methods are researched using the R package "h2o". As for the evolutionary algorithms, the packages "DEoptim" and "parallel" are used. See the appendix for a list of all R packages used for this thesis.

4 Results

4.1 Pre-processing data with principal component analysis

As mentioned in Section 2, to avoid intolerable computation time, the principal component analysis tool found in the "stats" R package is used to summarise the original data set into a smaller set of linear combinations of the input features. Additionally, a model's predictive capabilities can be improved by filtering for multicollinearity in the input variables. A simple analysis of the observed correlations in the original data-set reveals 44 out of 49 instances where the (absolute value of the) correlation between variables exceeds 0.90. This suggests that some variables can be linearly combined or even omitted, whilst preserving the necessary information contained in the input data.

A trade-off is made between computational speed and loss of information. In order to empirically assess this trade-off, the cumulative percentage of total variance method is used, as suggested by Cangelosi and Goriely (2007). For minimal loss of information, a threshold of 99% is selected to be the cumulative percentage of total variance explained by the components. Thirty-three principal components are found to be sufficient to reach this threshold. See Table 1.

Principal components   Cumulative percentage of variance
...                    ...
30                     98.44
31                     98.68
32                     98.90
33                     99.10

Table 1: Segment of the table displaying the number of principal components required to exceed 99% of total variance explained

It might sound counter-intuitive that 33 principal components explain 99% of the total variance in a data-set where 44 out of 49 variables are highly correlated. One must keep in mind that 99% is a relatively high threshold for PCA. Many studies suggest 95% to be sufficient to avoid loss of crucial information (which is achieved by 22 components in this data-set) (James et al., 2006).

The computational benefits of PCA are best described by comparing the pre-PCA average iteration time of roughly 35 seconds to the post-PCA average iteration time of approximately 20 seconds for the three considered DE algorithms (assuming the network structure and level of regularisation presented in the following subsection).

4.2 Determining the model architecture

4.2.1 Finding the number of hidden units

A backpropagation model with a learning rate of η = 0.05 is used throughout the current study as a general reference point. This model is used to determine the optimal number of hidden units, m, in the single hidden layer. During the process of determining the optimal number of hidden neurons, the log loss performance is evaluated for an increasing m and different values of the weight decay parameter λ. More specifically, model performance is assessed for λ = 0, 0.0001, 0.001, 0.01 and 0.02. Due to hardware limitations, a maximum of 20 hidden units is evaluated during this process, starting at m = 5 (Hastie et al., 2009). See Figure 2 for box plots for some of the estimation outputs.

Figure 2: Box plots of test errors (log loss) for varying numbers of hidden units (rate = 0.05). (a) No weight decay (l2 = 0.00); (b) weight decay parameter = 0.01.

Figure 2(b) shows no clear pattern in the observed log loss performances. One could argue that the evaluated sizes of the hidden layer do not suffice in terms of flexibility or capturing the non-linear relationships in the data set. Hastie et al. (2009) mention that, generally speaking, a model with too many hidden neurons is preferred over one with too few. For lack of a better model, an estimated number of 14 hidden units is obtained by finding the mean minimal value of the logarithmic loss errors for the models evaluated at varying values of the weight decay parameter.

4.2.2 Determining the optimal weight decay parameter

Analogous to the process of determining the optimal number of hidden units in the network, the optimal value of the l2 regularisation (weight decay) parameter is obtained by evaluating the log loss performance at varying levels of weight decay, for a fixed number of m = 20 hidden units (the maximum number of hidden units evaluated in this study). See Figure 3.

Figure 3: Box plots of test errors (log loss) evaluated for varying values of the weight decay parameter (m = 20).

Figure 3 displays a pattern in which a clear log loss-minimising value of λ can be observed. Figure 3 also clearly depicts how, for increasing values of λ, the observed variance of the performance function decreases as the network weights are regularised towards zero with increasing force. The optimal value is obtained by averaging the values of λ belonging to the minimal mean and median test errors. This calculation results in an estimated value of λ = 0.006.

4.3 Differential Evolution

4.3.1 Tuning: Finding optimal values of CR and F

As mentioned in Section 2.5, the parameters CR and F heavily influence the robustness and convergence speed of a DE algorithm (Ardia, Boudt, Carl, Mullen, & Peterson, 2011). Higher values of CR tend to speed up convergence (if convergence occurs) and lower values create more robust models (Price, Storn, & Lampinen, 2005). The tuning of F has similar effects in terms of a trade-off between convergence speed and robustness, although Ardia et al. (2011) mention that DE is much more sensitive to the choice of F than to the choice of CR.

Using the validation set, the optimal values of the DE parameters CR and F are determined by running the most basic of the three mutation schemes, DE/rand/1, ten times for subsequently varying values of CR and F (holding the other DE parameter constant). Put differently, every box plot in Figure 4 represents the spread of observed log loss performance per 10 optimisation runs. In order to maintain reasonable computation time and avoid over-fitting, each optimisation was implemented with a relatively small maximum of 100 iterations. It must be kept in mind that the validation set is used for this process: being four times smaller than the training set, over-fitting occurs more rapidly when optimising over the validation set.

The optimal values of CR and F are determined by finding the validation error minimising parameter values. These optimal DE-parameters are then chosen for the optimal model. The validation set is used for this process in the hopes of obtaining a more generalised model. Price et al. (2005) suggest the values of the crossover constant, CR = 0.9, and differential weighting factor, F = 0.9, for generally desired performance. Therefore, CR is held fixed at CR = 0.9 as F varies, and vice versa.

Figure 4: Box plots of test errors for varying values of CR (a, fixed F = 0.8) and F (b, fixed CR = 0.9)

This holds for Figure 4(a) more so than for Figure 4(b). The validation error minimising values of CR and F are estimated at CR = 0.8 and F = 0.5.

4.3.2 Training the weight parameters using three mutation schemes

Following the fine tuning of the model structure and regularisation, the actual neural network training begins. The network weights, θ, are randomly initialised within the interval [−1, 1], following a uniform distribution. The upper and lower bounds are set relatively wide (compared to the scaled input variables), at [−10, 10] (Piotrowski, 2014). The size of the starting population, NP, is set equal to ten times the dimensionality, D. NP = 10D is suggested as a reasonable choice of population size by Storn and Price (1997) and Ardia et al. (2011). A process of determining the optimal population size, however, is not considered in this thesis due to hardware limitations. Finally, training performance is measured by the log loss function implemented with appropriate weight decay, as described in Section 2.

The three mutation schemes, DE/rand/1, DE/local-to-best/1 and DE/current-to-best/1 are applied to the specified network using the training data set. The algorithms run until either convergence or the predetermined maximum number of iterations has been reached. See Figure 5 for the results of the training process.
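A sketch of how such a training run could be set up with the DEoptim package used in the thesis, following the settings described above (bounds [−10, 10], NP = 10D, the tuned CR = 0.8 and F = 0.5, and a maximum of 3000 iterations). The weight count D, the placeholder objective nn_logloss and the strategy code are assumptions rather than the author's code; the strategy numbering follows the DEoptim documentation (Mullen et al., 2011), under which the other two mutation schemes would correspond to different strategy codes. With these population sizes the run is deliberately long.

```r
# Sketch of a DE training run with the DEoptim package.
# nn_logloss() is a stand-in objective; in the thesis it would return the
# regularised log loss of the 14-hidden-unit network for a weight vector w.
library(DEoptim)

D <- 14 * (33 + 1) + (14 + 1)          # assumed weight count: 33 inputs, 14 hidden units
nn_logloss <- function(w) sum(w^2)     # placeholder objective for illustration

ctrl <- DEoptim.control(NP = 10 * D, itermax = 3000, CR = 0.8, F = 0.5,
                        strategy = 1,  # classical DE/rand/1 per the DEoptim docs
                        initialpop = matrix(runif(10 * D * D, -1, 1),
                                            nrow = 10 * D))  # weights in [-1, 1]

fit    <- DEoptim(fn = nn_logloss, lower = rep(-10, D), upper = rep(10, D),
                  control = ctrl)
best_w <- fit$optim$bestmem            # trained weight vector
```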

Figure 5: Log loss training performance of the three DE mutation schemes (m = 14, p = 33, λ = 0.006, NP = 10D; maximum iterations = 3000)

The training process depicted above requires roughly 27 hours of run-time. Figure 5 reveals that none of the mutation schemes converge before reaching the maximum of 3000 iterations, as all three lines continue to descend. As seen later on in this section, the gradient-based algorithms require significantly less computation time, and reach convergence, whilst applied to the exact same network. This is in accordance with the findings of Piotrowski (2014) and Ilonen and Kamarainen (2003), who argue that gradient-based methods are more suited to smaller networks.

Judging by the line graphs above, all three of the algorithms seem to be matched quite equally as they remain within each other’s vicinity for the entirety of the process. Despite very similar performance, DE/current-to-best/1 achieves superior training performance with a log loss value of 0.415849 (L2 regularisation included).

4.3.3 Test performance of Differential Evolution

When the optimal DE models are implemented and applied to the test set, the three DE mutation schemes, DE/rand/1, DE/local-to-best/1, and DE/current-to-best/1, reach log loss values of approximately 0.645, 0.782, and 0.725 respectively. Aside from the performance in terms of the loss function, the predictive and generalising capabilities of the obtained optimal models are tested by means of receiver operating characteristic curves and their respective AUC values. As for the predictive performances, the observed areas under the curve are 0.736, 0.683 and 0.687, respectively. The somewhat classical DE implementation, DE/rand/1, appears victorious when tested on an unseen body of data, being able to accurately predict the primary election outcome of a US county a little under 74% of the time (based on the available input data). Table 2 summarises these results, along with those of the gradient-based methods, whose results are discussed in the following subsection. Figure 6 contains the relevant ROC curves belonging to the three considered mutation schemes.

Figure 6: ROC curves of DE mutation schemes

4.4 Gradient-based methods

As mentioned in Section 2, the gradient-based algorithms considered for this thesis are backpropagation and its adaptive modification, resilient backpropagation. This subsection also discusses the implementation and results of the logistic regression model applied to the same classification problem as the gradient-based and DE methods.

4.4.1 Backpropagation and resilient backpropagation

The considered backpropagation model is implemented with a learning rate of η = 0.05, as suggested by Cook (2016). All other model specifications are identical to those implemented for the evolutionary algorithms in the preceding subsection. Both gradient-based models are run 10 times and the best performing implementation of these models is selected. Figure 10 depicts the box plots associated with all ten of these optimisations per considered algorithm. For the best observed backpropagation model, the Area Under the Curve (AUC) is estimated at 0.806. That of the best performing resilient backpropagation model is calculated to be 0.787. The log loss performances for the backpropagation and resilient backpropagation models equal 0.437 and 0.443, respectively. A summary of these performance measures can be seen in Table 2. Figure 7 depicts the ROC curves for both the resilient backpropagation and backpropagation models.

Figure 7: ROC curve for backpropagation models

The non-adaptive backpropagation model achieves superior loss function values as well as predictive capabilities. This is an unexpected result, as the studied literature postulates that resilient backpropagation tends to achieve better testing results than its non-adaptive counterpart, due to its greater convergence speed (Günther & Fritsch, 2010). However, it must be repeated that only the results associated with the best performing models are presented above. When viewing Figure 10, it becomes clear that resilient backpropagation has a lower variance in terms of the observed test errors, whereas backpropagation's worst performing run is the worst among all the gradient-based models. This suggests that resilient backpropagation, on average, finds better solutions more often than regular backpropagation.

4.4.2 Logistic regression model

The logit model is trained, validated and tested using the same general backpropagation model described earlier in this section. L2 regularisation is applied during the training process, with the same value for the weight decay parameter λ. Analogous to the process of determining the best performing model for the gradient-based models above, the logit model is optimised ten times and the best test results are considered. Figure 10 displays box plots of the results belonging to these ten runs. Table 2 contains the best log loss and AUC values achieved by the logit model.

In terms of test performance, the logit model lies between the best DE and the best backpropagation model. With a log loss test error of approximately 0.473 and an AUC value of roughly 0.763, the logit model performs slightly worse than the best backpropagation algorithm and slightly better than the best DE algorithm. This result is aptly displayed in Figure 8, where it can be seen that the receiver operating characteristic curve of the logit model lies within the confines of those of the backpropagation and DE/rand/1 algorithm, at almost every threshold. A more thorough discussion of the comparisons between these methods is presented in the following subsection.

4.5 Empirical comparison of gradient-based and DE training algorithms

This subsection serves as a final assessment of the obtained results. To effectively do so, aspects such as robustness, effectiveness and efficiency of the algorithms are discussed (Ilonen & Kamarainen, 2003). The known local and global characteristics of the considered training algorithms are also evaluated based on the obtained results. Finally, computation time is also taken into account.

4.5.1 Logistic regression benchmark versus an ANN

Based on the ROC curves displayed in Figure 8, one could argue that a network structure of greater complexity than a logit model is indeed suitable for the application considered in this thesis. Despite the slightly poorer performance of the best DE model, better solutions seem to be achieved for this specific classification problem when a larger neural network is implemented (the backpropagation-trained ANN models achieve superior test performance).

As mentioned at the beginning of this section, due to the restrictiveness of the available hardware, a larger model was not considered. Thus, the full range of non-linearities is not explored to the extent that it could possibly require. However, given the circumstances and the obtained ANN structure, the logit model deserves preference over the ANN models trained by DE, especially considering the enormous difference in computation time needed by the DE algorithms to still produce poorer test results than those of the logit model.

Figure 8: ROC curves of best performing DE, gradient and logit model

4.5.2 Efficiency in terms of required computation time

As a measure of an algorithm’s efficiency, this study takes into account the duration of an optimisation run. See Figure 9. It goes without saying that the considered gradient-based methods are much more favourable based on this measure. Whereas the training process alone of each DE mutation scheme requires roughly 9 hours to run without convergence even occurring, the gradient-based methods require ap-proximately 1.3 seconds on average to find a fully optimised solution. Based on Figure 9 (a), RBP seems to have a significant advantage over BP in terms of ef-ficiency. This is probably caused by RBP’s adaptive learning rate, which ensures more accurate determination of an optima’s location than backpropagation’s fixed learning rate.

On another note, some fundamental benefits of DE, such as lack of differentiability-restraint, are not considered in this research. For instance, Maas et al. (2013) sug-gest that implementation of the rectifier linear unit activation function speeds up convergence and could even lead to better solutions.

Figure 9: Box plots of observed run time per class of algorithm. (a) Run time required for the entire optimisation process by the gradient-based methods (seconds); (b) run time required for training alone by the DE algorithms (hours).

4.5.3 Robustness

The robustness of a training algorithm is defined by Ilonen and Kamarainen (2003) and James et al. (2006) as a method's tendency to find optimal solutions for most cases and not just for specific ones. In other words, a very robust algorithm is less sensitive to small alterations in the input data, and has better generalisation properties.

Gradient-based optimisation methods are well known for their local and exploitative nature (Piotrowski, 2014). They are known to perform very well for small networks, but easily trap themselves in local optima (Ilonen & Kamarainen, 2003). The findings presented in Figure 10, as well as the presented convergence speeds, seem to be in accordance with the preceding statements. Both the backpropagation and the resilient backpropagation models display relatively large variation in terms of the obtained optima per run. The spread of the solutions found by the DE algorithms is not inspected, due to the intolerable amount of computation time it would consume. Thus, no statements can be made based on the results regarding the postulated explorative and global characteristics associated with DE (Piotrowski, 2014).

One could argue that allowing the local optimisers inspected in this study to fully explore the problem space by starting at a large number of random points would still be preferred over a slow converging global optimisation method, such as Differential Evolution.

Hybridised adaptive methods combining DE and gradient-based methods present an interesting solution to the local versus global trade-off that seems to be a recurring issue in this study. Wang et al. (2015) and Zhang, Zhang, Lok, and Lyu (2007) propose effective hybridised methods. This topic, however, is left as a point of interest for further research.

Figure 10: Spread of obtained optima (test error, log loss) per algorithm: logit, BP and RBP.

4.5.4 Effectiveness

Although training errors are generally optimistic indications of an algorithm's performance, they can still be viewed as a relevant measure of a method's capability to find optima. The considered differential evolution algorithms' training performances measured by the log loss function are very similar to those of the gradient-based methods, as can be seen in Table 3. DE's training performance even goes as far as beating the logit model. Figure 5, however, depicts the slow convergence during training associated with the DE methods. As mentioned in Section 4.5.2, the extremely long computation time causes the benefits of DE to be lost in favour of a faster method such as the gradient-based methods or the logit model.

It must also be noted that there seems to be a rather large discrepancy between the training and test performances associated with the DE algorithms. This suggests that overfitting occurred with much more severity during DE's training process than during that of the gradient-based methods, thus diminishing the generalisation properties of the models trained by the DE methods.

4.5.5 Statistical comparison of AUC

The R package "pROC" provides a non-parametric approach to comparing the areas under two ROC curves, namely DeLong's test for two correlated ROC curves (DeLong, DeLong, & Clarke-Pearson, 1988). The test's null hypothesis assumes that the true difference in AUC of two ROC curves is equal to zero. This null hypothesis is tested against a two-sided alternative, at a 5% level. All models (including the logit benchmark) are compared using this method. The observed p-values are displayed in the appendix in A3. The most important result of DeLong's method is the lack of significant difference within the classes of algorithms. That is, (according to this test) the gradient-based methods do not differ significantly from each other, and the same holds for the evolutionary algorithms. The evolutionary algorithms, however, differ from the gradient-based methods with very high statistical significance. And finally, the logit model differs significantly from all considered models.
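A minimal sketch of DeLong's test with pROC; the outcome and probability vectors are made up for illustration, and in the thesis they would be the test-set outcomes and the predictions of two competing models.

```r
# Sketch of DeLong's test for two correlated ROC curves with the pROC package;
# y_test holds observed outcomes, p_bp and p_de the predicted probabilities of
# two models on the same test set.
library(pROC)

y_test <- c(1, 0, 1, 1, 0, 0, 1, 0)
p_bp   <- c(0.9, 0.2, 0.7, 0.8, 0.3, 0.4, 0.6, 0.1)
p_de   <- c(0.7, 0.4, 0.6, 0.9, 0.5, 0.3, 0.5, 0.2)

roc_bp <- roc(y_test, p_bp)
roc_de <- roc(y_test, p_de)
roc.test(roc_bp, roc_de, method = "delong")   # H0: equal AUCs
```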

        BP         RBP        DE/r/1     DE/ltb/1   DE/ctb/1   Logit
LL      0.4367584  0.4438994  0.6454445  0.7824770  0.7247856  0.4726035
AUC     0.8064238  0.7869822  0.7363636  0.6826240  0.6874929  0.7627625

Table 2: Test performance: log loss and Area Under the Curve values of the best performing models

        BP         RBP        DE/r/1     DE/ltb/1   DE/ctb/1   Logit
LL      0.4019847  0.4076516  0.4202274  0.4210533  0.415849   0.4502343

Table 3: Training performance: log loss values of the best performing models

5 Conclusions and topics of discussion for further research

The aim of this study was to effectively compare the ANN training performances of Differential Evolution and gradient-based methods, applied to forecasting U.S. primary election results based on demographic data. Based on the experiment results, a comparison was made and assessed in terms of an algorithm's efficiency (required computation time), effectiveness (ability to find optima) and robustness (generalisation properties).

Principal component analysis proved to be an effective method of dimensionality reduction, substantially shrinking the required computation time with minimal loss of information. The optimal ANN structure was determined as a single-layer perceptron with 14 hidden nodes, regularised using weight decay (L2 regularisation). As a consequence of computational limitations, however, a larger network was not considered, even though a greater number of hidden neurons was suspected to be necessary to fully capture the non-linearities in the data.

In the training phase, the three DE algorithms performed on par with the two gradient-based methods. DE/current-to-best/1 and backpropagation achieved superior performance in their respective classes. DE, however, displayed intolerably long computation times and still failed to converge within that large amount of time. To assess whether a neural network was needed rather than a simple non-linear regression model, a logit model was also estimated. The test errors of the logit model fell between those of the slightly superior backpropagation model and the slightly inferior DE/rand/1 model. The suspected need for a larger model, combined with backpropagation's better performance, led to the conclusion that a neural network was indeed suited to the considered classification problem. Given the circumstances, however, a logit model was sufficient to achieve the desired performance for the current study.


A large discrepancy between DE's training performance and its ability to generalise to unseen data was believed to be caused by significant overfitting and a lack of convergence. When evaluating predictive capabilities, the gradient-based methods as well as the logit model heavily outperformed the considered DE methods, with the exception of DE/rand/1, which achieved only slightly poorer test performance than the gradient-based algorithms and the logit model.

The gradient-based methods' reputation as local optimisers was confirmed by the relatively large variance observed in the optima they found. Based on the empirical results, no statement could be made regarding DE's postulated global search characteristics.

In conclusion, gradient-based methods for ANN training, as well as a simple logit model, have proved capable of achieving the desired forecasting performance in terms of efficiency and effectiveness. Differential evolution's main advantages were not explored to their full extent, as its main drawback, slow convergence, inhibited many experiments. A more complex neural network architecture and the implementation of hybridised methods, combining the explorative and exploitative characteristics of DE and gradient-based methods respectively, are left as topics for further research.


References

Ardia, D., Boudt, K., Carl, P., Mullen, K. M., & Peterson, B. G. (2011). Differential Evolution with DEoptim. The R Journal, 3(1), 27–34.

Basheer, I. A. (2000). Selection of methodology for neural network modeling of constitutive hystereses behavior of soils. Computer-Aided Civil and Infrastructure Engineering, 15(6), 445–463. doi: 10.1111/0885-9507.00206

Cangelosi, R., & Goriely, A. (2007). Component retention in principal component analysis with application to cDNA microarray data. Biology Direct, 2, 2. doi: 10.1186/1745-6150-2-2

Cook, D. (2016). Practical Machine Learning with H2O (1st ed.; N. Tache & D. Futato, Eds.). O'Reilly.

DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845.

Günther, F., & Fritsch, S. (2010). neuralnet: Training of neural networks. The R Journal, 2(1), 30–38.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. doi: 10.1007/b94608

Hayashi, F. (2000). Econometrics (1st ed.). Princeton University Press.

Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press.

Ilonen, J., & Kamarainen, J.-K. (2003). Differential evolution training algorithm for feed-forward neural networks. Neural Processing Letters, 17, 93–105.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2006). An Introduction to Statistical Learning (Vol. 102). Springer. Retrieved from http://books.google.com/books?id=9tv0taI8l6YC

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1–9.

Looney, C. G. (1996). Advances in feedforward neural networks: Demystifying knowledge acquiring black boxes. IEEE Transactions on Knowledge and Data Engineering, 8(2), 211–226.

Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models (Vol. 28; Tech. Rep.). Palo Alto, CA: Stanford University.

Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2), 431–441.

Mullen, K., Ardia, D., Gil, D., Windover, D., & Cline, J. (2011). DEoptim: An R package for global optimization by differential evolution. Journal of Statistical Software, 40(6), 1–26. Retrieved from http://www.jstatsoft.org/v40/i06/ doi: 10.18637/jss.v040.i06

Neri, F., & Tirronen, V. (2010). Recent advances in differential evolution: A survey and experimental analysis. Artificial Intelligence Review, 33, 61–106. doi: 10.1007/s10462-009-9137-2

Olden, J. D., & Jackson, D. A. (2002). Illuminating the "black box": A randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154(1-2), 135–150. doi: 10.1016/S0304-3800(02)00064-9

Piotrowski, A. P. (2014). Differential Evolution algorithms applied to neural network training suffer from stagnation. Applied Soft Computing Journal, 21, 382–406. doi: 10.1016/j.asoc.2014.03.039

Price, K. V., Storn, R. M., & Lampinen, J. A. (2005). Differential Evolution: A Practical Approach to Global Optimization. Springer. doi: 10.1007/3-540-31306-0

Storn, R., & Price, K. (1997). Differential Evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4), 341–359.

Wang, L., Zeng, Y., & Chen, T. (2015). Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Systems with Applications, 42(2), 855–863. doi: 10.1016/j.eswa.2014.08.018

Weisstein, E. W. (2017). Levenberg–Marquardt method. Retrieved 2017-11-01, from http://mathworld.wolfram.com/Levenberg-MarquardtMethod.html

Zhang, J.-R., Zhang, J., Lok, T.-M., & Lyu, M. R. (2007). A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation, 185(2), 1026–1037. doi: 10.1016/j.amc.2006.07.025


A Appendix

Variable name   Description

PST045214   "Population 2014 estimate"
PST040210   "Population 2010 (April 1) estimates base"
PST120214   "Population percent change - April 1 2010 to July 1 2014"
POP010210   "Population 2010"
AGE135214   "Persons under 5 years percent 2014"
AGE295214   "Persons under 18 years percent 2014"
AGE775214   "Persons 65 years and over percent 2014"
SEX255214   "Female persons percent 2014"
RHI125214   "White alone percent 2014"
RHI225214   "Black or African American alone percent 2014"
RHI325214   "American Indian and Alaska Native alone percent 2014"
RHI425214   "Asian alone percent 2014"
RHI525214   "Native Hawaiian and Other Pacific Islander alone percent 2014"
RHI625214   "Two or More Races percent 2014"
RHI725214   "Hispanic or Latino percent 2014"
RHI825214   "White alone not Hispanic or Latino percent 2014"
POP715213   "Living in same house 1 year & over percent 2009-2013"
POP645213   "Foreign born persons percent 2009-2013"
POP815213   "Language other than English spoken at home pct age 5+ 2009-2013"
EDU635213   "High school graduate or higher percent of persons age 25+ 2009-2013"
EDU685213   "Bachelor's degree or higher percent of persons age 25+ 2009-2013"
VET605213   "Veterans 2009-2013"
LFE305213   "Mean travel time to work (minutes) workers age 16+ 2009-2013"
HSG010214   "Housing units 2014"
HSG096213   "Housing units in multi-unit structures percent 2009-2013"
HSG495213   "Median value of owner-occupied housing units 2009-2013"
HSD410213   "Households 2009-2013"
HSD310213   "Persons per household 2009-2013"
INC910213   "Per capita money income in past 12 months (2013 dollars) 2009-2013"
INC110213   "Median household income 2009-2013"
PVY020213   "Persons below poverty level percent 2009-2013"
BZA010213   "Private nonfarm establishments 2013"
BZA110213   "Private nonfarm employment 2013"
BZA115213   "Private nonfarm employment percent change 2012-2013"
NES010213   "Nonemployer establishments 2013"
SBO001207   "Total number of firms 2007"
SBO315207   "Black-owned firms percent 2007"
SBO115207   "American Indian- and Alaska Native-owned firms percent 2007"
SBO215207   "Asian-owned firms percent 2007"
SBO515207   "Native Hawaiian- and Other Pacific Islander-owned firms percent 2007"
SBO415207   "Hispanic-owned firms percent 2007"
SBO015207   "Women-owned firms percent 2007"
MAN450207   "Manufacturers shipments 2007 ($1000)"
WTN220207   "Merchant wholesaler sales 2007 ($1000)"
RTN130207   "Retail sales 2007 ($1000)"
RTN131207   "Retail sales per capita 2007"
AFN120207   "Accommodation and food services sales 2007 ($1000)"
BPS030214   "Building permits 2014"
LND110210   "Land area in square miles 2010"
POP060210   "Population per square mile 2010"

Table 4: A1: Exhaustive list of original input variable names and descriptions.


R-package   Application                                    Application mentioned in

dplyr       Matrix operations during data preparation      Section 3
DEoptim     Differential Evolution                         Section 4.3
parallel    Differential Evolution                         Section 4.3
h2o         Gradient-based methods, logistic regression    Section 4.2; 4.4
pROC        ROC curve, AUC value, DeLong test              Section 4.4; 4.5
stats       Principal Component Analysis                   Section 4.1

Table 5: A2: List of R packages used in this thesis.

            BP        RBP       DE/rand/1  DE/ltb/1  DE/ctb/1  Logit
BP          1         0.101     2.81e-07   3.35e-08  3.77e-07  3.82e-04
RBP         0.101     1         2.49e-05   7.17e-06  1.20e-05  0.045
DE/rand/1   2.81e-07  2.49e-05  1          0.96       0.99      1.45e-03
DE/ltb/1    3.35e-08  7.17e-06  0.96       1          0.97      1.23e-03
DE/ctb/1    3.77e-07  1.20e-05  0.99       0.97       1         1.36e-03
Logit       3.82e-04  0.045     1.45e-03   1.23e-03   1.36e-03  1

Table 6: A3: P-values associated with DeLong’s non-parametric test for comparing AUC values of all ROC curves obtained in this thesis.
