
STUDIES ON APPLICATIONS OF NEURAL NETWORKS IN MODELING SPARSE DATASETS AND IN THE ANALYSIS OF DYNAMICS OF CA3 IN HIPPOCAMPUS

by

Babak Keshavarz Hedayati

Bachelor of Science, Khaje Nasir Toosi University, 2007

Master of Science, Iran University of Science and Technology, 2010

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Electrical and Computer Engineering

© Babak Keshavarz Hedayati, 2019

University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.


Supervisory Committee

Prof. Nikitas Dimopoulos, Supervisor

Department of Electrical and Computer Engineering

Dr. Kin Li, Departmental Member

Department of Electrical and Computer Engineering

Prof. Arif Babul, Outside Member


Abstract

Neural networks are an important tool in the field of data science as well as in the study of the very structures they were inspired by, i.e. the human nervous system. In this dissertation, we studied the application of neural networks in data modeling as well as their role in studying the properties of various structures in the nervous system. This dissertation has two foci: one relates to developing methods that help improve generalization in data models; the other is to study the possible effects of structure on function.

As the first focus of this dissertation, we proposed a set of heuristics that improve the generalization capability of neural network models in regression and classification problems. To do so, we explored applying a priori information in the form of regularization of the behavior of the models. We used smoothness and self-consistency as the two regularized attributes that were enforced on the behavior of the neural networks in our model. We used our proposed heuristics to improve the performance of neural network ensembles in regression problems (more specifically, in quantitative structure–activity relationship (QSAR) modeling problems). We demonstrated that these heuristics result in significant improvements in the performance of the models we used. In addition, we developed an anomaly detection method to identify and exclude the outliers among unknown cases presented to the model. This was to ensure that the data model only made predictions about the outcomes of unknown cases that were within its domain of applicability. This filtering resulted in further improvement of the performance of the model in our experiments.

Furthermore, and through some modifications, we extended the application of our proposed heuristics to classification problems. We evaluated the performance of the resulting classification models over several datasets and demonstrated that the regularizations we employed in our heuristics had a positive effect on the performance of the data model across various classification problems as well.

In the second part of this dissertation, we focused on studying the relationship between structure and functionality in the nervous system; more specifically, whether or not structure implies functionality. In studying these possible effects, we elected to study CA3b in the Hippocampus. For this reason, we used the current related literature to derive a physiologically plausible model of CA3b. To make our proposed model as close as possible to its counterpart in the nervous system, we used large-scale neural simulations, in excess of 45,000 neurons, in our experiments. We used the collective firings of all the neurons in our proposed structure to produce a time-series signal. We considered this time-series signal, which represents the overall output of the structure should it be monitored by an EEG probe, as the output of the structure. In our simulations, the structure produced and maintained a low-frequency rhythm. We believe that this rhythm is similar to the Theta rhythm which occurs naturally in CA3b.

We used the fundamental frequency of this rhythm in our experiments to quantify the effects of modifications to the structure. That is, we modified various properties of our CA3b model and measured the changes in the fundamental frequency of the signal.

We conducted various experiments on the structural properties (the length of the axons of the neurons, the density of connections around the neurons, etc.) of the simulated CA3b structure. Our results show that the structure was very resilient to such modifications.

Finally, we studied the effects of lesions in such a resilient structure. For these experiments, we introduced two types of lesions: many lesions of small radius and a few lesions with large radii. We then increased the severity of these lesions by increasing the number of lesions in the case of the former and increasing the radius of the lesions in the case of the latter.

Our results showed that many small lesions in the structure have a more pronounced effect on the fundamental frequency than a few lesions with large radii.


Table of Contents

Supervisory Committee . . . ii
Abstract . . . iii
Acronyms . . . xi
Glossary . . . xii
Acknowledgements . . . xiii
Dedication . . . xiv
Introduction . . . 1

Chapter One: Sparse Data modeling using Neural Networks . . . 6

1.1 Introduction . . . 6

1.2 Artificial Neural Network Architectures and their Training . . . 9

1.2.1 Perceptron . . . 9

1.2.2 Hopfield Network . . . 10

1.2.3 Multilayer Perceptron networks (MLPs) . . . 12

1.2.4 Universal Approximation Theorem . . . 14

1.2.5 Radial Basis Functions . . . 15

1.2.6 Deep Neural Networks . . . 17

1.3 Preparing Data to Train Neural Networks . . . 17

1.4 Training of Neural Networks . . . 20

1.5 Selection of Training and Validation Sets . . . 21

1.6 Ensemble Learning . . . 22

1.6.1 Bagging . . . 22

1.6.2 Boosting . . . 22

1.7 Regularization . . . 23

1.8 The Proposed Heuristics . . . 23

Chapter Two: Our Proposed Heuristic for Regression Problems . . . 25

2.1 Introduction . . . 25

2.2 Terminology of the datasets used . . . 28


2.2.2 Levenberg-Marquardt Algorithm . . . 29

2.2.3 Bayesian Regularization Backpropagation Algorithm . . . 29

2.3 Guided Selection Mechanism . . . 30

2.4 Post-Training Regularization . . . 30

2.5 Sensitivity Heuristic . . . 31

2.6 Dynamic Selection Heuristics . . . 32

2.6.1 Dynamic Selection Heuristic (Initial Estimate Method) . . . . 32

2.6.2 Dynamic Selection Heuristic II (Inclusion of the Statistical Tests) . . . 34

2.7 Case Study . . . 35

2.7.1 QSAR . . . 35

2.7.2 Data Used . . . 36

2.7.3 Results of Post-Training Regularization . . . 37

2.7.4 Filtering of Outliers . . . 40

2.8 Additional Studies on the Performance of the Proposed Heuristics . . 42

2.8.1 The Case of Aldose Reductase Inhibitor’s Activity . . . 42

2.8.2 Neural Network-based Classification of Concussed and Control groups from EEG . . . 43

Chapter Three: Classification . . . 48

3.1 Introduction . . . 48

3.2 Our Proposed Classification Heuristics . . . 50

3.2.1 Classification Guided Selection Mechanism . . . 51

3.2.2 Classification Sensitivity Heuristic . . . 51

3.2.3 Classification Dynamic Selection Heuristic . . . 54

3.3 Experiments . . . 56

3.3.1 Data Sets . . . 56

3.3.2 Results . . . 56

3.4 Conclusions . . . 57

Chapter Four: Dynamics of Neuronal Networks . . . 58

4.1 Neuron Physiology . . . 59

4.1.1 Neuronal Behaviour . . . 60

4.2 Mathematical Neuron Models . . . 63

4.2.1 Leaky Integrate and Fire Neuron Model . . . 63

4.2.2 Hodgkin Huxley Neuron Model . . . 63

4.2.3 Second Order Neuron models . . . 64

4.2.4 Izhikevich’s neuron model . . . 65

4.3 Comparison between the neuron models . . . 65

4.4 Connectivity of Neurons . . . 66

4.5 Hippocampus . . . 66

4.6 Hippocampal Morphology . . . 67

4.6.1 Hippocampal Anatomy . . . 68

4.6.2 Types and quantities of neurons in area CA3 of rat Hippocampus . . . 69
4.6.3 Pyramidal Cells . . . 70


4.6.5 Basket Cells . . . 72

4.6.6 Axo-axonic Cells . . . 72

4.6.7 Summary of the CA3b architecture . . . 72

4.6.8 Theta Rhythm . . . 73

4.7 A Model of CA3b in Hippocampus . . . 74

4.7.1 Synaptic Strengths . . . 76

4.8 Brain Sim and NEST . . . 77

Chapter Five: Simulation and Analysis of Structural Properties of CA3b in Hippocampus . . . 78

5.1 Introduction . . . 78

5.2 The Experiments . . . 78

5.2.1 Experiment One: Concentration of excitatory Pyramidal Cell synapses . . . 80

5.2.2 Experiment Two: Concentration of inhibitory Interneuron synapses . . . 82
5.2.3 Experiment Three: Effects of Changes in Weight of Connections of Pyramidal Cells . . . 83

5.2.4 Experiment Four: Effects of Changes in the length of the structure . . . 84
5.2.5 Experiment Five: Effects of Changes in the Variance of the Weight of Connections of Pyramidal Cells . . . 85

5.2.6 Experiment Six: Effects of Inserting External DC Input to the System . . . 86

5.2.7 Mode changes effected by the structural changes . . . 87

5.3 Conclusions . . . 88

Chapter Six: The Effects of Lesions on the Behavior of Neurons in CA3b . . . 90
6.1 Introduction . . . 90

6.2 Experiments . . . 92

6.2.1 The Base Model . . . 93

6.2.2 Lesions: Increase in the Number and/or the Size of the Lesions . . . 93
6.3 Conclusion . . . 95

Chapter Seven: Future Work . . . 96

References . . . 97

Appendices . . . 111

Appendix A: The Guided Selection Mechanism . . . 111

Appendix B: The Regression Sensitivity Heuristic . . . 113


List of Figures

Figure 1.1 A Hopfield Network . . . 12

Figure 1.2 An MLP Network . . . 13

Figure 1.3 Gaussian radial basis function with c=0 and r=1 . . . 15

Figure 1.4 Multiquadratic radial basis function with c=0 and r=1 . . . 16

Figure 1.5 A Deep Neural Network with 3 Hidden Layers . . . 17

Figure 1.6 UnPCAed Data . . . 19

Figure 1.7 PCAed Data . . . 19

Figure 2.8 The spectrum of the responses of the LMA model (the vertical red line is the compound activity target): Basic model (a), after Sensitivity heuristic (b) and after Dynamic Selection Heuristic II (c) . . . 38

Figure 2.9 The spectrum of the responses of the Bayesian Regularization model (the vertical red line is the compound activity target): Basic model (a), after Sensitivity heuristic (b) and after Dynamic Selection Heuristic II (c) . . . 39

Figure 2.10 Removing the high frequency noise and the bias from the EEG signal using a band-pass filter . . . 44

Figure 2.11 Extracting the statistical characteristics of a preprocessed EEG channel: the signal is divided into 4-second segments. For each of these segments, the variance of the signal is calculated . . . 45

Figure 4.12 Anatomy of a Biological Neuron . . . 60

Figure 4.13 Different behaviours of a biological Neuron . . . 61

Figure 4.14 The rat Hippocampus structure. The top of the figure shows the outline of the rat's brain in a rostral (i.e. front) to caudal (i.e. back) direction. The two halves of the Hippocampus are easily discerned while the various regions are marked: DG: Dentate Gyrus, CA1-3: Cornu Ammonis regions 1 to 3. Obtained using Brain Explorer 2 (http://mouse.brain-map.org/static/brainexplorer) . . . 68

Figure 4.15 The Hippocampus, modified from the original by Santiago Ramón y Cajal (1852−1934), via Wikimedia Commons . . . 69

Figure 4.16 The standard view of the entorhinal hippocampal network. MEC: Medial Entorhinal Cortex, LEC: Lateral Entorhinal Cortex, DG: Dentate Gyrus, CA1-3: Cornu Ammonis regions 1 to 3 . . . 70

Figure 4.17 The connectivity of the various classes of neurons in CA3b. Lines marked with "+" ("-") represent excitation (inhibition) . . . 73


Figure 5.18 The firing patterns of neurons in time for the first 250 ms. Each point on the y axis corresponds to one neuron and the x axis is time in ms. The pattern is stable throughout the total simulation (10 s) . . . 80
Figure 5.19 Fourier transform of the firing patterns shown in Figure 5.18 . . . 81
Figure 5.20 Changes in the fundamental frequency component as an effect of change of the Pyramidal cell connection distribution parameters . . . 82
Figure 5.21 Changes in the fundamental frequency component as an effect of change of the interneuron connection distribution parameters when the parameters of the connection distribution of the Pyramidals are at maximum (τ = σx = 1.4 mm) and at minimum (τ = 0.18 mm, σx = 0.6 mm) . . . 83
Figure 5.22 Changes in the fundamental frequency component as an effect of change in the weights of Pyramidal cell connections to interneurons . . . 84
Figure 5.23 Changes in the fundamental frequency as an effect of change in the length of the pyramidal layer . . . 85
Figure 5.24 Changes in the fundamental frequency as an effect of change in the variance of weights of Pyramidal cells. The distribution is uniform with a mean of 30 . . . 86
Figure 5.25 Changes in the fundamental frequency component as an effect of DC current injection to the system . . . 87
Figure 5.26 Firings of the first 250 ms when the Pyramidal connection distribution is exponential and τ is at its minimum, i.e. τ = 0.18 mm . . . 88
Figure 5.27 Firings of the first 250 ms when the Pyramidal connection distribution is Gaussian and σx is at its minimum, i.e. σx = 0.6 mm . . . 88

Figure 6.28 Firing patterns of pyramidal cells (a), a close-up of the progression of the firings in the structure (b) and their frequency spectrum (c) . . . 93
Figure 6.29 Sensitivity of the fundamental frequency to the maximum possible weights of the connections . . . 94
Figure 6.30 The fundamental frequency versus the ratio of the dead neurons in the population: increase in the number of lesions with a constant radius of 5 units (left) and increase in the radius of 10 lesions (right) . . . 95


List of Tables

Table 2.1 Comparison between the models for α dataset . . . 40

Table 2.2 Comparison between the models for γ dataset . . . 41

Table 2.3 Filter Performance . . . 42

Table 2.4 The Average Norm Values for each Method . . . 43

Table 2.5 Summary of the Results . . . 46

Table 3.6 Data Sets used . . . 56

Table 3.7 Accuracy (and standard deviations) of our proposed methods compared to [113] . . . 57

Table 4.8 Morphology parameters of Neurons in CA3b . . . 73

Table 4.9 The number of axonic targets of a Neuron . . . 73

Table 4.10 The number of Source Cells a Neuron receives input from . . . 73

Table 4.11 The sigmas and masks used in structuring the network- baseline model . . . 75

Table 4.12 The masks used in structuring the network to simulate axonal connectivity . . . 75

Table 4.13 The uniform random distribution parameters used in structuring the network . . . 76

Table 4.14 The parameters of the CA3b model . . . 76

Table 6.15 The parameters of the CA3b model modified for Hodgkin Huxley's neuron model . . . 92


Acronyms

ANN Artificial Neural Network
BRA Bayesian Regularization Algorithm
CA1 Cornu Ammonis Region 1
CA3 Cornu Ammonis Region 3
DG Dentate Gyrus
EEG Electroencephalogram
FFT Fast Fourier Transform
HH Hodgkin Huxley
IE Initial Estimate
LMA Levenberg-Marquardt Algorithm
MCS Multiple Classifier Systems
MSE Mean Squared Error
NN Neural Network
PCA Principal Component Analysis


Glossary

Bayesian Regularization In the Bayesian Regularization algorithm, large weights of the connections are penalized during the training process. vi, viii, 26, 27, 28, 29, 30, 36, 37, 39

blind set The blind set evaluates the generalization ability of the trained model. 28, 33, 35, 36, 37, 40, 41, 42, 43, 45, 50, 53, 54, 55, 56

generalization Generalization is the ability of a model to make reliable predictions on the data that was not used to build the model in the first place [5]. iii, 2, 3, 4, 7, 15, 20, 23, 26, 28, 30, 31, 36, 39, 40, 48, 51

regularization Regularization is the act of using apriori information to restrict the domain of possible solutions available. iii, v, vi, 2, 4, 23, 24, 25, 26, 27, 28, 30, 31, 34, 35, 36, 37, 38, 39, 40, 50

test set The test set is used to apply post-training regularization to the model. 7, 18, 21, 28, 31, 42, 45, 49, 50, 52, 53, 54

training set The training set consists of a group of observations that are used in train-ing the model. xii, 7, 11, 14, 15, 16, 20, 21, 22, 23, 26, 27, 30, 31, 32, 36, 37, 40, 42, 45, 46, 49, 50, 51, 52, 53, 56, 57, 111, 112, 113

validation set In the process of training, the observations in the validation set are used as a control group so the model is not over-trained. 5, 7, 20, 21, 26, 27, 28, 30, 37, 39, 40, 41, 42, 43, 49, 51, 111, 113


Acknowledgements

I would like to express my sincere gratitude to my supervisor, Prof. Nikitas Dimopoulos, whose vast knowledge and immense patience were instrumental throughout my research.


Dedication

To my parents who are the light of my life. And to Mina.


Introduction

In this dissertation, we divide our focus between two distinct questions. The first question is one of data analytics: how do we develop models that generalize well in the absence of dense data? The second question is one of computational neuroscience: does structure imply function?

To answer these questions, we use abstractions of the elementary apparatus of the central nervous system: the neuron. To answer the former, we focus mostly on determining the weights of the connections between neurons in a simple structure, while the neuron models used in this structure are very simple abstractions (only an integrator followed by a nonlinearity) of physiological neurons.

To answer the latter, we first employ neuron models that approximate the behavior of their counterparts in the nervous system more accurately. Then we mostly focus on the structure, particularly on how to make it resemble the structures found in living species as closely as possible. Here, we are not interested in the weights of the connections between the neurons; the argument is that the interconnection strengths vary between individuals and are adjusted as time passes, yet the gross properties of the structures persist across individuals and often across species.

The purpose of this introduction is to provide definitions and a terminology framework that will be used in the rest of the dissertation to discuss both of the questions presented. In this chapter, we first start with a brief introduction to data analytics and the significance of neural networks as a powerful tool in this field. Then, we proceed by introducing computational neuroscience and how neural networks are employed as structural models of the nervous system.

In the age of information, data is a highly sought-after commodity. By definition, "data is a set of values of qualitative or quantitative variables" [1]. Data has two forms: structured and unstructured. Unstructured data can be viewed as sets of values with unknown or ambiguous relationships to other sets. Unstructured data must be transformed into structured data before it can be analyzed. Structured data is highly organized information that can be easily imported into a database and thus used in data processing algorithms.

Data may include either implicit or explicit relations. It is desirable to discover, define and study these relations and extract useful information from the findings. This leads to data science, an interdisciplinary field that is concerned with extracting information from data. The tools used in data science may be statistical methods, learning systems such as neural networks, etc. Using each of these tools, a model of the data is created which represents an "understanding" of the implicit/explicit relations contained within the data.

A (parametrized) model is a map from a domain to a range described through parametrized analytical functions. Training is the process of selecting the parameters (or even the functions themselves) such that certain criteria are met (e.g. minimizing the difference between the expected responses and the model responses).

Robust modeling, i.e. constructing models that generalize well, has been the subject of much research. Generalization is the ability of a model to make reliable predictions on data that was not used to build the model in the first place [5]. In generalization, by stripping a concept of its unnecessary conditions and criteria, we extend its modeling capabilities to cover a broader spectrum of events of a particular nature [5]. Generalization can be most useful in applications where exploring all the possible circumstances is difficult or even impossible.

Many methods have been developed to create models with high generalization capabilities. Two methods to improve generalization are the selection of appropriate model types and the regularization of these models.

Regularization techniques use a priori information to restrict the domain of possible solutions available in developing data models (more on this in section 2.1). Early stopping [119], weight decay [129] and Bayesian regularization [110] are prime examples of regularization.
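To make the weight-decay idea concrete, the following sketch (a hypothetical minimal example in NumPy, not taken from this dissertation) adds an L2 penalty on the weights of a linear model to the mean squared error and solves for the penalized minimizer in closed form:

```python
import numpy as np

def mse_with_weight_decay(w, X, y, lam):
    """Mean squared error plus an L2 (weight decay) penalty.

    Large weights are penalized, shrinking the space of admissible
    models -- the a priori assumption being that smoother mappings
    generalize better on sparse data.
    """
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def fit_ridge(X, y, lam):
    """Closed-form minimizer of the penalized loss (ridge regression).

    Setting the gradient of the loss to zero gives
    (X'X + n*lam*I) w = X'y, where n is the number of observations.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
```

Increasing `lam` trades training accuracy for smaller weights and, typically, better generalization on sparse data; early stopping and Bayesian regularization pursue the same goal by other means.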

Over the years, to achieve more accurate representations of the data, many classes of models have been developed. Linear data modeling [143] is the simplest form of modeling, in which a linear relationship between the independent variable(s) and the dependent variable(s) is established. Although simple and fast to deduce, linear regression falls short in capturing more complex relationships in the data. Non-linear methods [10], [22] are effective alternatives to linear regression methods in modeling more complex and non-linear relationships in data.

Non-linear modeling is a term applied to a broad spectrum of data models. Although different in the way they are represented, all of these models share, to various extents, the ability to model non-linear relationships in the data. Multivariate adaptive regression splines [50], Support Vector Machines [158] and feedforward neural networks are prime examples of this class of models.

Multivariate adaptive regression splines (MARS) [50] is a modeling technique that represents the non-linear relationship between the inputs and the outputs in a dataset by breaking the inputs into intervals and mapping the input/output relationship in each interval by polynomial equations. It does not require a priori information and can be applied to many problems [89].

Support Vector Machines (SVMs) [158] are another type of nonlinear model, mainly used in classification problems. An SVM creates one (or several) hyperplane(s) in the data space that effectively separate two (or several) categories of exemplars. In theory, for two different categories in a dataset, many hyperplanes exist that can divide the two categories. An SVM determines the dimension(s) in which the two categories have the highest distance between them and places the separating hyperplane across those dimensions [158]. As popular as SVMs are, they have several disadvantages. One of their major disadvantages is that their size is not fixed. As mentioned before, SVMs classify the data using hyperplanes; however, the number of these hyperplanes for a specific dataset is not known before the learning process of an SVM is complete. Therefore, the structure of an SVM cannot be predetermined. The other main disadvantage is that the process of training them is not as straightforward as that of some other similarly capable models, such as neural networks.
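The maximum-margin idea can be illustrated with a toy linear SVM trained by sub-gradient descent on the regularized hinge loss (a Pegasos-style sketch; the function name and all parameter settings below are illustrative assumptions, not the methods of this dissertation):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Sub-gradient descent on the regularized hinge loss
        lam/2 * ||w||^2 + mean(max(0, 1 - y * (X.w + b))).

    Labels y must be in {-1, +1}.  Minimizing ||w|| while keeping the
    margins y*(X.w + b) >= 1 is what places the hyperplane at maximum
    distance from both classes.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                       # inside margin: push outward
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                # satisfied: only shrink w
                w = (1 - lr * lam) * w
    return w, b
```

Points with margin below 1 pull the hyperplane toward separating them, while the shrinkage term keeps ||w|| (and hence the margin width) under control.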

Nowadays, neural networks are an essential part of modeling complex datasets. Neural networks estimate the outcome of an event or observation based on the values of descriptors of that event or observation. An artificial neuron is the building block of every neural network. Its behavior was inspired by the physiological neurons that are the essence of the nervous system in animals. Essentially, a neuron receives data from one or several inputs, processes the input data (using a weighted summation and a transfer function) and outputs one or several values.
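The weighted summation followed by a transfer function described above can be written in a few lines (a generic sketch; the choice of `tanh` as the transfer function is an assumption for illustration):

```python
import numpy as np

def neuron(inputs, weights, bias, transfer=np.tanh):
    """A single artificial neuron: the weighted sum of the inputs,
    plus a bias, passed through a non-linear transfer function."""
    return transfer(np.dot(inputs, weights) + bias)
```

Connecting many such units, and adjusting their weights during training, is what turns this simple element into a network capable of modeling complex relationships.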

An Artificial Neural Network is a group of neurons in which several basic non-linear models (artificial neurons) are connected to each other. The way neurons are connected to each other determines the structure and functionality of the neural network. The information a neural network contains is encoded in the weights of its interconnections. It should be noted that, throughout this dissertation, we use the terms neural networks and artificial neural networks interchangeably.

Basically, a neural network model assigns a value or a category to an event or object based on a mathematical description of the said event or object. This mathematical description is in the form of a relation, function or mapping. The properties that form this mathematical description of the object or event are called descriptors.

In data analytics, assigning a value to an input (the description of an object or event) is called regression, and assigning a category is called classification. Among the regression problems that can benefit from neural network modeling are estimating the quality of wine [26], daily parcel delivery demand forecasting [44] and air quality [160], to name a few. Some widely used classification problems for which a neural network model can be useful are breast cancer diagnosis [3], liver disorders [29], etc.

Deep learning [51] is a field in data science that uses neural networks with several hidden layers (compared to only one hidden layer in standard neural networks) [12]. In deep neural networks, layer by layer, the input data is transformed and abstracted so the relationships that govern it can be learned. Although deep learning is a powerful method, it relies on large and dense datasets to succeed. In the datasets we focus on, the luxury of abundance is non-existent.

In understanding the relationships contained within a dataset, the more data available, the better these relationships can be understood and modeled. However, extracting useful information from the available data is a difficult task. In many cases, the rules that govern the data are very complex. In addition, the datasets that represent these problems are relatively small. Typically, in these problems, the number of observations is far lower than what is required by conventional machine learning methods (which use simple structures or do not benefit from additional a priori information in their training processes). This in turn leads to decreased generalization capability of these techniques.

To mitigate these problems, developing heuristics that create models with exceptional generalization capabilities for sparse datasets is of high importance. In chapters 1, 2 and 3, we present our set of heuristics for both regression and classification problems that demonstrate high generalization capabilities over sparse datasets.


We now turn to computational neuroscience to answer how structure dictates function. Computational neuroscience is a field in neuroscience that utilizes mathematical neuron models and abstractions of structures of the brain in analyzing and understanding the inner workings, dynamics and principles that exist within the nervous system [34]. Through modeling and simulation, invaluable insights into the inner workings of these structures have been achieved. To study the collective behavior and dynamics of physiological neuronal structures, spiking neurons replace the traditional simplified neuron models that are used in data analytics.

In neural networks with spiking neurons (spiking neural networks), spikes are the entities that drive neurons as well as the entities that neurons produce as output. The frequency of these spike productions determines the computational properties of the network and is the subject of many studies. Replacing standard neurons with spiking neurons creates a much closer resemblance between spiking neural networks and their physiological counterparts. The spiking neural networks that are used to study the behavior of neuronal structures are called neuronal network models.
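For illustration, the simplest spiking model discussed later (the leaky integrate-and-fire neuron of section 4.2.1) can be sketched as follows; all parameter values (membrane time constant, rest, threshold and reset potentials) are illustrative assumptions, not those of the CA3b model:

```python
import numpy as np

def simulate_lif(I, dt=0.1, tau=10.0, v_rest=-65.0,
                 v_thresh=-50.0, v_reset=-65.0):
    """Leaky integrate-and-fire neuron driven by an input current I
    (one value per time step, in effective mV of drive).

    The membrane potential leaks toward v_rest, integrates the input,
    and emits a spike (then resets) whenever it crosses v_thresh.
    Returns the spike times in ms (forward-Euler integration).
    """
    v = v_rest
    spikes = []
    for step, i_in in enumerate(I):
        v += dt / tau * (v_rest - v + i_in)   # leak plus input
        if v >= v_thresh:                     # threshold crossing
            spikes.append(step * dt)
            v = v_reset                       # reset after the spike
    return spikes
```

A stronger drive shortens the time to threshold, so the spike rate grows with the input; it is this rate, across a population of such units, that carries the collective rhythms studied in chapters 5 and 6.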

In this dissertation, one of our main goals is to understand the influence of structure on the functioning of these systems/networks and to study the effects of altering the structure (e.g. destroying parts of it) on the observed function/behavior. For this purpose, using computational neuroscience, we have developed a detailed model of CA3b in the Hippocampus which enabled us to study the structural properties of this part of the human brain. The model we derived closely adheres to the known quantitative properties of CA3b.

In the following chapters, we investigate the applications of neural networks and computational neuroscience models. In the first chapter, we discuss neural networks, their current trends and their shortcomings in detail. Then, we present various techniques to overcome some of these shortcomings in the modeling of data.

In the second chapter, we present a technique we used to develop regression neural network models for sparse datasets. Our technique is a multi-stage method in which, using the collective behavior of neural networks, satisfactory results are obtained. Then, we move on to present and discuss our contributions to this set of heuristics. Finally, we present the results of using these heuristics on various datasets.

In the third chapter, we discuss a set of heuristics we have developed to create and train a neural network based model for classification problems. This approach is similar to that of the previous chapter in that the regularization introduced improves the generalization capabilities of the classifier. At the end of this chapter, we present, compare and discuss the performance of this classification technique on several datasets.

In chapters 4, 5 and 6, we move on to the field of computational neuroscience and address the usage of neural networks in the simulation of structures in the mammalian nervous system. In these chapters, we present an analysis of the behavior of such networks and how various changes in their structure affect them. Unlike modeling datasets, the sole purpose of simulating these networks is to analyze their structural properties and dynamic behavior. In chapter 4, physiological neurons and their basic behavior are first introduced. Then, some of the most common physiological neuron models, their underlying mathematics and subsequently their differences are explained.

In the fifth chapter, we present a physiologically plausible model of CA3b in the Hippocampus that we have derived from the current related literature. Later in this chapter, we demonstrate and discuss its structural properties.

Finally, in chapter 6, we use this model of CA3b to analyze the collective behavior of neurons and how lesions of various types affect them.

In chapters 1, 2 and 3, all mentions of neural networks refer to artificial neural networks, whereas in chapters 4, 5 and 6 the term neural network refers to spiking neural networks as used in computational neuroscience.

To summarize, our main contributions are:

• Developing a dynamic selection algorithm that we use alongside the already available heuristics to improve the overall performance of ensembles of neural networks in regression problems (chapter 2).

• Developing a full set of classification heuristics (chapter 3) that

– Divides the available dataset into training, testing and validation sets.

– Analyzes the overall sensitivity of populations of neural networks and only keeps the neural networks that are deemed “more knowledgeable”.

– Dynamically selects which groups of neural networks should participate in the classification of the unknown elements.

• Deriving a physiologically plausible model of CA3b based on the biomedical literature (chapter 5).

• Studying the effects of lesions on the properties and the dynamics of the CA3b structure (chapter 6).


Chapter One: Sparse Data Modeling using Neural Networks

1.1 Introduction

Artificial neural networks (ANNs) are a very popular non-linear solution for establishing complex relationships within datasets. In an artificial neural network, simplified models of the behavior of physiological neurons are used. These simplified models are connected in specific ways to form the artificial neural network. Artificial neural networks are capable of collecting information from input data and using this information to solve highly complex problems [65]. One interesting behavior of neural networks is the ability to map independent inputs to dependent outputs. This behavior is a property of a class of neural networks called feed-forward neural networks. Recurrent networks (another class of neural networks) can also model dynamical systems, where the state and the input determine the next state and the output. Of interest is the ability of these networks to change their behavior by adjusting their parameters (to learn). In this dissertation, we will focus on neural networks that have no state, i.e. non-recurrent networks.

The type of the problem and the application of the neural network determine the nature of its output. If the outputs in the data are real numbers that can assume any value in a predetermined interval, the problem is of regression type. On the other hand, if the outputs determine the similarity between different inputs and assign them to the same category, a classification problem is at hand. Therefore, the type of the dataset, and whether the outputs of observations are categorical or assume real values, determines the nature of the problem and, as a direct consequence, the nature of the model: datasets with continuous dependent variables are of regression type and datasets with discontinuous dependent variables are of classification type.

The connectivity and the number of layers in a neural network determine its complexity and application. If there is no feedback connection between any of the layers, the neural network is called a multi-layer perceptron. If there are connection loops in the structure, the network is a recurrent neural network [65]. A deep neural network [52] is a multi-layer perceptron that has many hidden layers, compared to one or two hidden layers in a shallow multi-layer perceptron.

A single-layer neural network has a very limited modeling capability. However, by adding more layers to its structure and creating a multi-layer neural network, it transforms into a universal approximator [31] [72]. A multi-layer neural network consists of at least three layers. The first layer, the first point of contact between the neural network and the input data, is the input layer. The hidden layer processes the data received from the input layer and transfers the result to the output layer. The output layer scales the data it receives from the hidden layer. The output of a neural network is retrieved from its output layer.

Multi-layer perceptron neural networks (MLPs) are the most commonly used type of artificial neural network in solving regression and classification problems. Apart from being a universal approximator, an MLP does not contain any loops (hence, in training such networks, convergence to a solution is easily achieved [138]). This property, alongside its widely studied learning algorithms, has given rise to its immense popularity as the tool of choice in classification problems.

The start of neural networks research dates back to 1943, when Warren McCulloch and Walter Pitts wrote a paper on how the network of neurons of the brain can create and analyze highly complex patterns. The model they proposed in “A logical calculus of the ideas immanent in nervous activity” made a huge impact on the science of artificial neural networks [114]. In 1949, in his famous book “The Organization of Behavior”, Donald Hebb supported the work of McCulloch and Pitts and added some attributes to it [67]. The idea of adjusting the influence of the connections between the neurons based on the level of activity the connections see was first described in this book.

Neural networks are considered a class of complex function approximators. One of the drawbacks of artificial neural networks is the fact that ANNs do not lead to unique solutions, and different ANNs can offer the same degree of accuracy for a problem; i.e. sometimes significantly different models result in the same response. This, on one hand, makes the choice of structure easier and, on the other hand, leads to non-specific rules for constructing the network. The existence of multiple solutions, in turn, makes their analysis difficult [149].

Training a neural network is the process of fitting an artificial neural network to a set of observations from the dataset so it can model the behavior of that dataset. An observation in this context is a set of qualitative and/or quantitative values that describe one sample from the data space.

In the process of training a model, the data is divided into three different groups: the training, testing and validation sets. The training set consists of a group of observations that are used in training the neural networks. In other words, using the chosen training method, the weights of the connections in the network are adjusted in such a way that the output of the network for the training set is as close as possible to the output specified by the training set. In the process of training, the observations in the validation set are used as a control group so that the network is not over-trained (more on this in chapter 2). The test set evaluates the accuracy of training, the generalization ability of the model and, ultimately, the goodness of the model. The responses of the network to the observations in the test set are a measure of the generalization capability of the model, i.e. how well the general characteristics of the data have been learned by


the network, even though the network has not encountered this set during the training phase. Although artificial neural networks can produce highly capable models, they still have their own limitations. Because of their complexity, they can only be viewed as black boxes that, given an input, produce an output. That is, unlike models based on understanding the nature of the relationship described by the data, a trained artificial neural network cannot reveal the nature of the relationship except as an input/output correspondence.

A great deal of research has been dedicated to the use of neural networks in modeling datasets with continuous or discontinuous outputs (regression or classification models). In the following, we present examples of both regression and classification problems that can be modeled using proper neural network models:

In drug discovery, some of the properties of a compound can be predicted based on its structure (a compound is the result of a chemical reaction and consists of two or more chemical elements [19]). Establishing the biological efficacy of drugs is a costly and time-consuming process because of the biological experimentation needed to establish whether a particular chemical compound (molecule) is biologically active or not. This hardship emphasizes the need to model the biological activity of these chemical compounds and predict their behavior (at least in the early research stages). Although chemical synthesis is expensive, discovering the biological activity is even more expensive. In pharmacology, the modeling of the activity of molecules is based on quantitative structure–activity relationship (QSAR) modeling. QSAR models the biological activity given the structure of a molecule; these models can be in the form of neural networks or any other model, such as linear regression. In QSAR modeling, the basic rationale is that for every compound there is a set of descriptors that define the molecular properties of that compound. The QSAR associates molecular properties, as expressed by the chosen descriptors, with biological activity [122].

The biological activities are measured in vitro and under carefully controlled lab conditions [6] [7]. These activities are usually expressed in pIC50, which is the negative of the logarithm of IC50. IC50 is a measure that indicates how much of an inhibitor is needed to inhibit a biochemical or biological process by half.
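As a small illustration (not part of the original QSAR workflow), the pIC50 transform just described is a one-line computation; the function name and the choice of molar units here are our own assumptions:

```python
import math

def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 expressed in mol/L."""
    return -math.log10(ic50_molar)

# An inhibitor with IC50 = 1 micromolar (1e-6 M) has a pIC50 of about 6;
# a smaller IC50 (a more potent inhibitor) yields a larger pIC50.
print(pic50(1e-6))
print(pic50(1e-9))
```

Because of the negative logarithm, the more potent the inhibitor (the smaller its IC50), the larger its pIC50 value.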

Image classification is one of the most popular applications of neural networks. In image classification problems, the objects in the images are distinguished by the model and assigned to their appropriate classes. In [92], a deep neural network consisting of 650,000 neurons in five layers demonstrated the classification capability of deep neural network modeling by classifying 1.2 million images into 1,000 different categories with exceptional results.

In medical sciences, neural network modeling has helped researchers classify several hard-to-distinguish types of cancer into specific categories using the gene signatures of such cells [88].

The problem of “Glass Identification” [42], identifying the type of glass shards in a crime scene, is a classification problem. In this problem, the goal is to determine the origin of the glass pieces found in a crime scene based on their oxide content (whether these pieces were part of a bottle, car window, etc).

Another classification problem is the task of distinguishing between three types of iris plants based on the lengths and widths of the sepals and petals. This problem is used as a prominent benchmark to evaluate the performance of neural network models. The


problem was first discussed in [46] and since then, has appeared in many other research literature as well.

The examples presented above are some of the many applications of neural network models. Neural networks act as a set of nonlinear equations that map the inputs to the outputs, which gives them the ability to model highly complex data behavior. The main advantage of neural networks is that they are universal approximators. In many cases, neural network models outperform standard linear and non-linear models [151]. In the following sections, a brief introduction to neural networks and how they are used is presented.

1.2 Artificial Neural Network Architectures and their Training

One of the main issues in using artificial neural networks is how to train them. Throughout the years, several methods to train mathematical models have been developed, based on the application, data limitations and other restrictions. It should be noted that the application of these learning approaches is not restricted to neural network models. The three most frequently used learning paradigms are supervised, unsupervised and reinforcement learning [136]. In supervised learning [64], using datasets that consist of input vectors and their corresponding output vectors, the network adjusts itself in such a way that the response of the system to the inputs is as close as possible to the output vectors.

In unsupervised learning [64], as the name implies, the network does not adjust itself to a desired output. In fact no outputs are provided (no dependent variable exists). Here, the goal is to infer a function that describes the structure of the data.

In reinforcement learning [148], instead of presenting the network with the desired output, the network is allowed to interact with its surrounding environment based on the input it receives. These interactions can lead to rewards or punishments, based on which the network adjusts itself. Reinforcement learning can be seen as a variation of supervised training in which, instead of exact values, a less precise quality goal is presented. This goal may be defined indirectly.

It should be noted that the flow of information in these neural networks is only in one direction, i.e. given the output vector, its corresponding input cannot be identified by the network. Therefore, the function learned may not be invertible.

In all of the datasets we use in our experiments, the dependent variable exists. Therefore, supervised learning is our paradigm of choice. Some of the mathematical models mentioned earlier are:

1.2.1 Perceptron

The perceptron is the simplest form of a neural network structure. It is a single-layer network that, when trained, can classify its inputs into True (1) and False (0). The output of a perceptron is calculated as [40]:


\[
f(x) =
\begin{cases}
1 & \text{if } W X + b > \theta \\
0 & \text{if } W X + b \le \theta
\end{cases}
\tag{1.1}
\]

where \(f\) is the output that determines the class of the input, \(W\) is the weight vector, \(X\) is the input vector, \(b\) is the bias (a constant independent of the input) and \(\theta\) is the threshold. \(\theta\) can be assumed to be zero without loss of generality (the bias or shift of the origin can absorb the threshold).

Although initially the perceptron was believed to be a capable classifier, it was proven that a perceptron can only classify linearly separable datasets [118]. That is, in the n-dimensional space of the dataset, there should be a hyperplane of (n − 1) dimensions that can separate the true and false categories.

In order to train a perceptron (determining the values of weight vector W), we should rewrite equation 1.1.

Now assume our dataset is composed of two categories. The outputs of the vectors in the first category are assigned True and the outputs of the vectors in the second category are assigned False. Denote the dataset as \(F = F^+ \cup F^-\) (\(F^+\) consisting of all the vectors in the true category and \(F^-\) consisting of all the vectors in the false category) and let \(P_1, P_2, \ldots\) be the vectors chosen from the \(F\) space to be used in the training of the perceptron. Assume that \(W_0\) is the weight vector at iteration 0 (it is initialized to small random numbers) and \(W_k\) is the weight vector at iteration \(k\). In each iteration, choose a pattern from the dataset, such as \(P_i\), and apply it to the perceptron as input. If it is classified correctly (if the predicted output matches the expected one), the weight vector does not change from the \(k\)th iteration to the \((k+1)\)th and the next iteration starts. However, in the case of misclassification, if \(P_i \in F^+\), then \(W_{k+1} = W_k + P_i\), and if \(P_i \in F^-\), then \(W_{k+1} = W_k - P_i\). Hence, if there is a solution (the classification problem is linearly separable), the training method converges [14].

1.2.2 Hopfield Network

A Hopfield network is a recurrent neural network introduced in 1982 by John Hopfield [71], in which each neuron assumes a state in the binary space (it can be either in state 0 or state 1). In this network, all neurons are connected to all neurons and the weights are symmetrical. That is, every pair of neurons is connected with identical weights. Typically, the following two rules govern the connections in a Hopfield network: \(w_{ii} = 0\) and \(w_{ij} = w_{ji}\), in which \(w_{ij}\) is the weight of the connection from the \(j\)th neuron to the \(i\)th neuron in the network.

Denote the two states of each neuron in this network as \(V_i^1\) and \(V_i^0\), which correspond to state 1 (the firing state of the neuron) and state 0 (the non-firing state of the neuron) respectively. Given a network consisting of \(N\) mutually connected neurons whose connection weights can assume any value, the state of each neuron in this network changes through the following equations [40]:

\[
V_i \to V_i^0 \quad \text{if} \quad \sum_j w_{ij} V_j + I_i < \theta_i \tag{1.2}
\]

and

\[
V_i \to V_i^1 \quad \text{if} \quad \sum_j w_{ij} V_j + I_i > \theta_i \tag{1.3}
\]

where \(w_{ij}\) is the weight of the connection from the \(j\)th neuron to the \(i\)th neuron, \(I\) is the input vector and \(\theta_i\) is the threshold. It should be noted that Hopfield networks are asynchronous. That is, each neuron runs on its own independent clock and therefore state changes are not synchronized. Similar to the perceptron, \(\theta\) can be zero. Denote the state of a Hopfield network as an ordered array of the states of its neurons \(V_1, V_2, V_3, \ldots\) and

\[
E = -\frac{1}{2} \sum_{i \ne j} w_{ij} V_i V_j - \sum_i I_i V_i + \sum_i \theta_i V_i \tag{1.4}
\]

as the energy for each state of the network. The energy (Lyapunov) function allows us to quantify the dynamics of neural networks. The energy function in equation (1.4) is finite, since \(V_i\) and \(V_j\) are either 0 or 1 and the weights are finite themselves.

Thus, the change in the energy of the network as the result of a change in the state of a neuron can be calculated as

\[
\Delta E = \frac{\partial E}{\partial V_i} \Delta V_i = \Big( -\frac{1}{2} \sum_{j \ne i} w_{ij} V_j - \frac{1}{2} \sum_{j \ne i} w_{ji} V_j - I_i + \theta_i \Big) \Delta V_i \tag{1.5}
\]

and, because \(w_{ij} = w_{ji}\),

\[
\Delta E = \frac{\partial E}{\partial V_i} \Delta V_i = \Big( -\sum_{j \ne i} w_{ij} V_j - I_i + \theta_i \Big) \Delta V_i \tag{1.6}
\]

Through equations (1.2)–(1.3), it can be seen that in (1.6) the first product factor can be positive or negative, but the second product factor (\(\Delta V_i\)) is always of the opposite sign. Therefore, \(\Delta E\) is always a negative value. In other words, the transition rules in a Hopfield network always lead to a decrease in the energy of the network. It follows that the energy of the network decreases at every change and, because all the components in (1.4) are bounded (the total energy is finite), the transitions guide the system into a local energy minimum, in which case the network “settles” and there are no more transitions. When the system settles, it is in a steady state. In this case, the vector \(V\), in which \(v_i^s\) is the state of neuron \(i\), is considered the output of the system. Fig. 1.1 shows a Hopfield network made out of 3 neurons.

Before using a Hopfield network, it should be trained. Because in artificial neural networks data is retained in the weights of the connections, training means designating the weights of the connections for specific patterns. Since a Hopfield network acts as an associative memory, these patterns correspond to the states of the neurons of the network. Training ensures that the stable states of the network correspond to the set of patterns in the training set. In a Hopfield neural network, assigning the weights of the connections is done using

\[
w_{ij} =
\begin{cases}
\sum_{s=1}^{p} (2 v_i^s - 1)(2 v_j^s - 1) & \text{if } i \ne j \\
0 & \text{if } i = j
\end{cases}
\tag{1.7}
\]


Figure 1.1: A Hopfield Network

where \(v_i^s\) refers to the state of neuron \(i\) in the \(s\)th pattern and \(p\) is the total number of patterns to be stored in the network. The weights thus determined ensure that the patterns are stable states of the network (as per the previous discussion). Of course, these patterns need to satisfy the sparse coding and orthogonality conditions [40].

In a trained Hopfield network, when an input is applied to the network, the states of the neurons start to change. Applying an input to the network is done by assigning the initial state of each neuron in the network. When the input is applied, the state of the network begins to evolve and its energy eventually settles in a local minimum closest to the state at which the input was applied. Different inputs result in different minima, and since each stored pattern is seen by the network as a local energy minimum, the network simply settles into one of these local minima; when a partial input is received, the network will settle into the closest stored pattern. When there are no more transitions in the network, the states of all the neurons are considered the output of the system.
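A minimal sketch of this store-and-recall behavior, assuming binary {0, 1} states, zero thresholds and the Hebbian weights of equation (1.7); the example pattern and helper names are illustrative only:

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian rule of eq. (1.7): w_ij = sum_s (2v_i - 1)(2v_j - 1), w_ii = 0."""
    V = 2 * np.array(patterns) - 1  # map {0, 1} states to {-1, +1}
    W = V.T @ V
    np.fill_diagonal(W, 0)
    return W

def recall(W, state, max_sweeps=100):
    """Asynchronous updates (theta = 0) until the network settles."""
    v = np.array(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(v)):
            new = 1 if W[i] @ v > 0 else 0
            if new != v[i]:
                v[i] = new
                changed = True
        if not changed:  # a local energy minimum has been reached
            break
    return v

pattern = [1, 0, 1, 0, 1, 0]
W = store_patterns([pattern])
noisy = [1, 0, 1, 0, 1, 1]  # the stored pattern with one bit flipped
print(recall(W, noisy).tolist())  # recovers [1, 0, 1, 0, 1, 0]
```

Starting from the corrupted state, the asynchronous updates lower the energy at every flip until the network settles into the stored pattern, illustrating the associative-memory behavior described above.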

1.2.3 Multilayer Perceptron Networks (MLPs)

Perhaps the most commonly used architecture in neural networks is the feed-forward multilayer perceptron network (MLP). In a feed-forward MLP, neurons are organized into different layers and each neuron is connected only to the neurons in the next layer (towards the output). This means there are no loops in this network (neurons are not connected to themselves or to the neurons in previous layers). A multilayer perceptron consists of three or more layers: the input layer, which receives the input from the outside and injects it into the system; the output layer, which transfers the data from the network to the outside; and the hidden layer, in which most of the computation takes place. The hidden layer itself may consist of several layers of neurons, but usually there is just one layer of neurons in it. It should be noted that there is still no general method to calculate the number of neurons in the hidden layer, and for each problem this number should be determined through trial and error. In figure 1.2, a typical three-layer MLP is shown.


Figure 1.2: An MLP Network

According to the universal approximation theorem, such networks can approximate any given function. In the next section, this theorem is briefly introduced. To train MLP networks, the most popular method is the backpropagation algorithm (generalized delta rule). In backpropagation, the error propagates back from the output layer to the input layer and the weights of the connections are adjusted to minimize the error. Of course, this adjustment of the weights is done only in the training phase of the network; after that, the weights remain constant.

In the following paragraphs, a brief description of how the backpropagation algorithm (BPA) trains an arbitrary MLP network, such as that of figure 1.2, is given. For the sake of simplicity, the network will be trained on only one pattern. Denote by \(O_i^o\) the output of the \(i\)th neuron in the output layer, by \(O_i^h\) the output of the \(i\)th neuron in the hidden layer and by \(O_i^i\) the output of the \(i\)th neuron in the input layer. The activation function for all the neurons in the hidden layer is sigmoidal, whereas the activation function for the neurons in the input and output layers is linear. To adapt the weights between the hidden \(h\) and the output \(o\) layers, we have

\[
\Delta w_{kj}(n+1) = \eta \, \delta_k^{ho} O_j^h + \alpha \, \Delta w_{kj}(n) \tag{1.8}
\]

where \(n\) is the iteration number, \(w_{kj}\) is the weight of the \(j\)th input of the \(k\)th neuron in the output layer, \(\eta\) is the learning rate (a number between 0 and 1), \(\alpha\) is a factor that determines the amount of effect the previous adjustment has and \(\delta_k^{ho}\) is the error signal between the hidden and output layers, which can be calculated by

\[
\delta_k^{ho} = O_k^o (1 - O_k^o)(t_k - O_k^o) \tag{1.9}
\]

where \(t_k\) is the desired output of the \(k\)th neuron in the output layer. Using (1.8), the weights are adjusted by

\[
w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n+1) \tag{1.10}
\]

As for the adaptation of the weights between the input and hidden layers, \(\delta\) is calculated differently. Thus, equation 1.11 replaces equation 1.9 in equation 1.8 [43]:


\[
\delta_j^{ih} = O_j^h (1 - O_j^h) \sum_k \delta_k^{ho} w_{kj} \tag{1.11}
\]

where \(O_j^h = \sigma(s_j)\), \(s_j = \sum_m w_{jm} O_m^i\), \(m = 1, \ldots, M\) is the index of neurons in the input layer and \(k = 1, \ldots, K\) is the index of neurons in the output layer.
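A NumPy sketch of one-pattern training with these update rules may help fix the indices. For concreteness, the output layer here is also taken as sigmoidal so that the \(\delta\) of equation (1.9) applies directly; the layer sizes, the pattern, the target and the constants are arbitrary choices of ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
M, H, K = 3, 4, 2                     # input, hidden and output layer sizes
W_h = rng.normal(0, 0.5, (H, M))      # input -> hidden weights
W_o = rng.normal(0, 0.5, (K, H))      # hidden -> output weights
eta, alpha = 0.3, 0.5                 # learning rate and momentum factor
dW_o = np.zeros_like(W_o)             # previous adjustments (momentum terms)
dW_h = np.zeros_like(W_h)

x = np.array([0.2, -0.5, 0.8])        # the single training pattern
t = np.array([0.9, 0.1])              # its desired output

for _ in range(2000):
    O_h = sigmoid(W_h @ x)                          # hidden-layer outputs
    O_o = sigmoid(W_o @ O_h)                        # output-layer outputs
    d_o = O_o * (1 - O_o) * (t - O_o)               # error signal, eq. (1.9)
    d_h = O_h * (1 - O_h) * (W_o.T @ d_o)           # back-propagated, eq. (1.11)
    dW_o = eta * np.outer(d_o, O_h) + alpha * dW_o  # eq. (1.8)
    dW_h = eta * np.outer(d_h, x) + alpha * dW_h
    W_o += dW_o                                     # eq. (1.10)
    W_h += dW_h

print(sigmoid(W_o @ sigmoid(W_h @ x)))  # close to the target t
```

After enough iterations, the network output for the training pattern approaches the target, which is all that single-pattern training can demonstrate.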

1.2.4 Universal Approximation Theorem

The universal approximation theorem states that a multilayer feedforward neural network with one hidden layer and a finite number of neurons, with an activation function that conforms to the following rules [31] [72]:

- nonconstant
- bounded
- monotonically increasing

can approximate any continuous function on a compact subset of \(\mathbb{R}^n\). An activation function is basically a function that maps the input values of a neuron to a specific range. The sigmoid and hyperbolic tangent functions are two of the most commonly used activation functions. Activation functions determine the level of non-linearity in neurons. Therefore, if \(\sigma\) is a function with the mentioned attributes (nonconstant, bounded and monotonically increasing), \(I_m = [0,1]^m\) and \(C(I_m)\) is the space of the continuous functions on \(I_m\), then for all \(f \in C(I_m)\), all \(x \in I_m\) and all \(\varepsilon > 0\), there exist real constants \(a_i, b_i \in \mathbb{R}\), real vectors \(w_i \in \mathbb{R}^m\) and a finite number \(N\) such that

\[
F(x) = \sum_{i=1}^{N} a_i \sigma(w_i^T x + b_i) \tag{1.12}
\]

and

\[
|F(x) - f(x)| < \varepsilon \tag{1.13}
\]

where \(f(x)\) is independent of \(\sigma\) and \(F(x)\) is the output of the neural network. Equation (1.13) states that \(F(x)\) approximates \(f(x)\) with the desired accuracy for all \(x \in I_m\) or, equivalently, that functions of the form \(F(x)\) are dense in \(C(I_m)\). It should be noted that many properties of the interpolating function \(F\) are independent of the form of the activation function \(\sigma\) [132].

Although the theorem states that, given a sufficient number of neurons in the hidden layer (\(N\) in 1.12), any continuous function can be approximated, it fails to determine the number of neurons in the hidden layer. Therefore, the number of neurons in the hidden layer is determined through trial and error. The number of neurons in the input layer is the number of descriptors that are available for each observation. In other words, for every descriptor in the multidimensional feature space, there exists an input neuron uniquely affected by the values of its corresponding descriptor. If \(f\) is scalar, then a single neuron is at the output; if \(f\) is a vector function, then many neurons are at the output.

For all the problems that we have been working on, we use the following method to determine the number of neurons in the hidden layer. First, several neural networks with different hidden layer sizes are trained. Then the training set is applied to these networks and their responses to the training set are computed. The mean of the errors for each hidden layer size is plotted, and the knee point of the diagram determines the number of neurons in the hidden layer. For smaller sizes, the network would not be able to learn the training set well. For a hidden layer size bigger than the knee point, the network would lose its ability to generalize.

Figure 1.3: Gaussian radial basis function with c=0 and r=1
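The knee of the error-versus-size curve can also be located automatically. The sketch below uses the common maximum-distance-to-chord heuristic, which is our own choice and not necessarily the method used in this work, applied to hypothetical mean-error values:

```python
import numpy as np

def knee_point(sizes, errors):
    """Pick the 'knee' of the error-vs-size curve: the point with the
    largest perpendicular distance to the line joining the endpoints."""
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(errors, dtype=float)
    p0 = np.array([x[0], y[0]])
    p1 = np.array([x[-1], y[-1]])
    d = (p1 - p0) / np.linalg.norm(p1 - p0)   # unit vector along the chord
    vec = np.stack([x, y], axis=1) - p0
    dist = np.abs(vec[:, 0] * d[1] - vec[:, 1] * d[0])  # 2-D cross product
    return sizes[int(np.argmax(dist))]

# Hypothetical mean training errors for candidate hidden-layer sizes
sizes = [2, 4, 6, 8, 10, 12, 14]
errors = [0.40, 0.18, 0.08, 0.05, 0.045, 0.043, 0.042]
print(knee_point(sizes, errors))  # 6
```

Beyond the returned size, the error curve is nearly flat, so adding more hidden neurons would only hurt generalization, as argued above.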

1.2.5 Radial Basis Functions

Radial basis functions (RBFs) are a class of functions in which the value of the function monotonically increases or decreases as the input gets more “distant” from a point called the center [132]. A common form of RBF is the Gaussian, which for a scalar input has the form \(f(x) = \exp\!\big(-\frac{(x-c)^2}{r^2}\big)\), where \(c\) is the center and \(r\) is the radius. For \(c = 0\) and \(r = 1\), this function is shown in figure 1.3. As can be seen, this Gaussian function decreases monotonically as the distance from the center increases. Another form of RBF, called the multiquadratic RBF, increases as the distance from the center increases. It is defined as \(h(x) = \frac{\sqrt{r^2 + (x-c)^2}}{r}\) (figure 1.4).

Similar to MLP networks, radial basis networks are a group of artificial neural networks that are commonly used for function approximation and system control. Some of their characteristics are good generalization ability, high tolerance to input noise [169] and quick training of the network. They use the same feedforward architecture: they are multi-layered, the connections are non-recurrent (the direction is always from the input to the output) and, under the “mild” conditions of the universal approximation theorem (continuous almost everywhere, locally essentially bounded, not a polynomial; monotonicity is not required), it can be proven that three-layer neural networks can still approximate any continuous function [108].

Figure 1.4: Multiquadratic radial basis function with c=0 and r=1

A radial basis network essentially compares the input against the prototypes its hidden-layer neurons have learned and assigns a similarity value between the input and each of these prototypes. If the input and a prototype are the same, their similarity value is one. As the distance between the input and the prototype increases, the similarity value decreases exponentially. The lower bound for the similarity value is 0.

The nature of radial basis neurons demands that, in the training process, the centers and radii of the RBFs be determined. k-means clustering can be used to estimate initial values for the centers and radii of the network: to calculate the c and r values, all of the observations in the training set are clustered. The center of each cluster is assigned as the center of an RBF, and the average distance between all of the points in that cluster and its center is used as the radius value for the RBF that represents that cluster. The number of clusters k (i.e. the number of neurons in the hidden layer) determines the complexity of the decision boundary (higher complexity with higher values of k).

In order to train the weights, first, for every data point in the training set, the outputs of the RBFs are calculated. Then, for each output node, using a gradient-descent-based training algorithm, the weights of the connections from the RBFs to that node are calculated [144].
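A toy sketch of this two-stage procedure: a crude k-means pass places the centers and radii, and the output weights are then fitted linearly (here by least squares rather than the gradient descent mentioned above; all names and data are illustrative).

```python
import numpy as np

def fit_rbf(X, y, k, seed=0):
    """Toy RBF network: cluster the training set to place the Gaussian
    centers and radii, then solve for the output-layer weights."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(20):  # crude k-means iterations
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # radius = average distance of a cluster's points to its center
    radii = np.array([np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
                      if np.any(labels == j) else 1.0
                      for j in range(k)]) + 1e-6
    def hidden(Xq):  # Gaussian similarity values of every input to every center
        d2 = ((Xq[:, None] - centers[None]) ** 2).sum(-1)
        return np.exp(-d2 / radii ** 2)
    w, *_ = np.linalg.lstsq(hidden(X), y, rcond=None)
    return lambda Xq: hidden(Xq) @ w

X = np.linspace(-3, 3, 60)[:, None]  # 1-D training inputs
y = np.sin(X[:, 0])
model = fit_rbf(X, y, k=8)
print(float(np.linalg.norm(model(X) - y)))  # residual of the fit
```

Because the output layer is linear in the RBF activations, the weight fit is a simple linear problem once the centers and radii are fixed, which is what makes these networks quick to train.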

The activation function of RBFs leads to a network that performs well in detecting new patterns (novelty detection); however, it results in poor extrapolation capabilities [45].


1.2.6 Deep Neural Networks

Deep neural networks are MLPs that have two or more hidden layers. Compared to these networks, perceptrons and MLPs, with no and one hidden layer respectively, are considered shallow. The deeper the neural network, the more complex the decisions it can make [92]. A deep neural network with three hidden layers is shown in figure 1.5. In deep neural networks, each layer is trained based on the outputs of the previous layer, i.e. it is trained on a distinct feature compared to the previous layer.

Deep neural networks can be used as classifiers that are able to infer complex relationships in the datasets. This ability has led to the popularity of deep neural networks in a wide array of classification problems, such as speech recognition [32] and image classification [83].

The complexity in the decision making of deep neural networks comes at a price, though. In order to use these networks effectively, large datasets are required. In addition, training these networks requires high computational power [92] [57].

Figure 1.5: A Deep Neural Network with 3 Hidden Layers

1.3 Preparing Data to Train Neural Networks

Before we start the training process, we perform the following two steps: data normalization and data dimension reduction.

Data normalization, or feature scaling, is the act of scaling the ranges of all of the independent variables (descriptors) in the dataset to the same scale. Without normalization, in the process of training, the objective function will be unequally sensitive to the various descriptors in the dataset, i.e. the training process may learn the descriptors with the larger ranges better than the other descriptors. To eliminate this risk, before any analysis of the data space, the data should be normalized. Two main types of data normalization are:

Rescaling: In rescaling, the range of values of each independent variable in the dataset is mapped to the [0, 1] interval (or [-1, 1]). The general equation for this mapping is:

\[
x' = \frac{x - \min(X)}{\max(X) - \min(X)} \tag{1.14}
\]

where \(x'\) is the rescaled value of \(x\), and \(\min(X)\) and \(\max(X)\) are the lowest and highest available values of descriptor \(X\) respectively.

Standardization (z-score): Here, instead of a set interval, the mean and variance of each descriptor are set to zero and 1 respectively. The equation for this transformation is:

\[
x' = \frac{x - \bar{X}}{\sigma} \tag{1.15}
\]

where \(x'\) is the rescaled value of \(x\), and \(\bar{X}\) and \(\sigma\) are the mean and standard deviation of the values of descriptor \(X\) respectively.
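Both normalizations are one-line operations per column; a small NumPy sketch (the toy matrix, with its two descriptors on very different scales, is illustrative):

```python
import numpy as np

def rescale(X):
    """Map each descriptor (column) to the [0, 1] interval."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def zscore(X):
    """Give each descriptor zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two descriptors on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(rescale(X))              # both columns now span [0, 1]
print(zscore(X).mean(axis=0))  # ~0 for both columns
print(zscore(X).std(axis=0))   # 1 for both columns
```

After either transformation, no descriptor dominates the objective function simply because of its raw range, which is the point of the normalization step.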

After the data is normalized, we reduce the dimensionality of the data. In the majority of datasets, each exemplar (observation) in the dataset is described by many independent variables (descriptors). Since each descriptor represents a dimension in the data space, reducing this dimensionality (the number of descriptors) may reduce the exploration space drastically. This dimension reduction is especially important for datasets where the number of descriptors is disproportionately larger than the number of exemplars (for instance, a dataset with 80 exemplars and 140 descriptors).

In dimension reduction, some loss of information is almost always inevitable. However, if all of the descriptors with high information content are retained through the process, the loss of information can be minimized. There are several methods to identify and exclude the non-relevant descriptors from the data space.

The most popular dimension-reduction method is Principal Component Analysis [58], which reshapes the space in such a way that the descriptors (dimensions) that matter the most (contain the most information) have the greatest effect on shaping the new space. This reshaping is an orthogonal transformation that converts a set of correlated variables into a linearly uncorrelated set. The first transformed variable has the highest variance, and the variance decreases for each consecutive variable. Therefore, the first variable carries the most information in the transformed dataset. By dividing the variance of each variable by the sum of all of the variables' variances, we can determine the percentage of the dataset's total information that each variable contains.
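As an illustration of these variance shares (the two-descriptor synthetic dataset below is an assumption made for demonstration, not data from this work), they can be computed from the singular values of the centred data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up dataset: the second descriptor is a noisy copy of the first,
# so the two are strongly correlated.
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

# Centre the data; the squared singular values of the centred matrix are
# proportional to the variances of the principal components.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(explained)  # the first component carries almost all of the variance
```

Because the two descriptors are nearly dependent, the first principal component accounts for almost the entire variance, which is exactly the situation PCA exploits.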

The biggest disadvantage of Principal Component Analysis is that, in the process of transforming the space, the information is placed into a new set of dimensions (descriptors). Therefore, after the reduction of dimensions, there is no way to determine which of the original descriptors contributed more information to the problem, nor which of the original descriptors was made redundant because it was dependent on others. For instance, in figure 1.6, neither the x nor the y (horizontal and vertical) axis can be used alone to represent the distinction between the data points. But in the PCAed space (figure 1.7), x′ can be chosen to represent the data points with enough distinction (the image of all the data points on x′ is still distinguishable). This example shows a simple use of PCA. Obviously, the new dimension x′ is a combination of x and y. It can be seen that the new dimensions x′ and y′ are the result of a spatial rotation of the former axes. However, determining whether dimension x or y contributes more to the formation of x′ is not possible, especially when the dataset has many dimensions.

It should be noted that, in reducing the dimensionality, the choice of the number of descriptors can affect how well the model predicts the test set. If reducing the number of descriptors results in a significant loss of information, the model may perform poorly [112]; therefore, special care must be taken in compressing the data space for each case.

Figure 1.6: UnPCAed Data
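One common way to exercise this care is to keep the smallest number of leading components whose cumulative share of the variance reaches a chosen threshold. The sketch below assumes the per-component variances are already available; the variances and the 0.85 threshold are made-up numbers for illustration:

```python
import numpy as np

def n_components_for(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches the given threshold."""
    ratios = np.asarray(variances, dtype=float)
    ratios = ratios / ratios.sum()
    cumulative = np.cumsum(ratios)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical per-component variances from a PCA of a 5-descriptor dataset.
print(n_components_for([5.0, 3.0, 1.0, 0.5, 0.5], threshold=0.85))  # → 3
```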

1.4 Training of Neural Networks

Although briefly explained in section 1.2, the process of training neural networks requires some considerations. If a network is applied only to the input that was used to train it (i.e. the network is only used to recognize its training set), its ability to predict turns into mere recognition. Neural networks are used to classify or predict the outcome of new events based on the training they receive from the available dataset. This ability is called generalization [5]. The goal of training here is to improve this generalization capability of ANNs.

As discussed in section 1.2, training methods are divided into three major groups: supervised, unsupervised and reinforced. Since in event prediction or classification problems the goal is to match the output of the ANN to the desired target, the focus of this section is on the characteristics of the supervised training method in neural networks.

As mentioned before, the most commonly used class of neural networks for data prediction is the MLP, for the training of which the backpropagation technique is the most popular. In the training process, the error is the difference between the dependent variable produced by the network and its actual value; the training process attempts to decrease this error. However, as the generalization ability of the network is the goal here, special care must be taken not to overtrain the network, which would lead to overfitting of the data. Overfitting means that the model's ability to generalize (and predict newly introduced data points) has decreased. This decrease is a result of the model specializing only on the data points used in its training. In most problems, this decrease in generalization is not desirable.

To solve the overtraining problem, the performance of the neural network on the validation set is used to estimate its generalization ability. During training, selecting neural networks that have low error on the training and validation sets alike is an indication that a network will be able to correctly predict unknown exemplars, at least those in the validation set.

How validation sets are employed in the training process depends on the method of training. For example, we may train many neural networks and then use the validation set to select the most desirable ones, or we may use the validation set after each training iteration (epoch) to decide whether to continue the training.

In the latter case, each step of the training process decreases the training error. If this decrease in training error results in an increase in the validation error, this indicates an increase in the network's specialization and a decrease in its generalization ability. In this case, the training process must be stopped immediately.

In another case of overtraining, the error on the training set decreases while the error on the validation set remains the same. Here too the process must be stopped, because the training is not improving the generalization ability of the network (i.e. lowering the error on the validation set) [149].
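Both stopping rules amount to monitoring the validation error after each epoch and halting once it stops improving. A minimal, generic sketch of such an early-stopping check follows; the error sequence is made up, and `patience` and `min_delta` are hypothetical tuning parameters rather than values used in this work:

```python
def train_with_early_stopping(val_errors, patience=3, min_delta=1e-4):
    """Scan per-epoch validation errors and report where training stops.

    `val_errors` stands in for the validation error measured after each
    epoch of an actual training run. Training stops once the validation
    error has failed to improve by at least `min_delta` for `patience`
    consecutive epochs. Returns (stop_epoch, best_epoch).
    """
    best_err = float("inf")
    best_epoch = 0
    stalled = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_epoch, stalled = err, epoch, 0
        else:
            stalled += 1
            if stalled >= patience:
                return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Validation error improves, then plateaus and rises: training stops early,
# and the network saved at the best epoch is the one kept.
errors = [0.9, 0.7, 0.5, 0.45, 0.45, 0.46, 0.47, 0.48]
print(train_with_early_stopping(errors, patience=3))  # → (6, 3)
```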


If the difference between the error on the training set and the error on the validation set is very large, the network is most likely overtrained and is merely recognizing the points in its training set. The test set is the ultimate measure of how well the network can generalize, because the cases in the test set are not shown to the network during training.

1.5 Selection of Training and Validation Sets

As mentioned before, each training process requires a training and a validation set. However, dividing the dataset into these two requires some considerations. If the validation set only includes easy cases (cases that are very similar to the ones in the training set), the error of the network on the validation set may be misleading and may result in over-training of the network. On the other hand, if the cases are too difficult (cases whose descriptors are very different from those in the training set, or whose outcome differs from that of their spatially close counterparts in the training set), the error of the network on the validation set will not decrease, and this leads to an early termination of the training process. Another drawback of putting the difficult cases in the validation set is that data essential to the training is lost and the network will not have the chance to learn these cases.

Random selection is the most commonly used method of dividing the data into training, validation and test sets. However, in order to make an effective division (where the training and validation sets have comparable proportions of easy and difficult cases), a large enough dataset is required. Although the 2:1:1 ratio is commonly used, depending on the dataset, altering this ratio can lead to better results.
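A random 2:1:1 split can be sketched as follows (the dataset size of 80 exemplars is a made-up example):

```python
import numpy as np

def split_2_1_1(n_exemplars, seed=0):
    """Randomly partition exemplar indices into training, validation
    and test sets in the commonly used 2:1:1 ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_exemplars)
    n_train = n_exemplars // 2
    n_val = n_exemplars // 4
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_2_1_1(80)
print(len(train_idx), len(val_idx), len(test_idx))  # → 40 20 20
```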

Another technique is to categorize the data into groups (using techniques such as K-means clustering [146]) and then to randomly pick representatives for the test and validation sets from each group. In K-means clustering [146], the goal is to categorize n observations into k categories (clusters) such that each observation belongs to the cluster with the closest center. Using this technique, it can be ensured that representatives from all over the data space are present in both the training and validation sets [149]. If N is the number of clusters generated, one can construct a validation set of cardinality N by including one exemplar from each of the N clusters [149] [112]. The same process can be carried out to construct the test set.
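This cluster-then-sample idea can be sketched with a tiny, self-contained implementation of Lloyd's algorithm; the three-blob dataset, k = 3, and the single random initialization are all assumptions for illustration (a production version would use a library implementation with multiple restarts):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns cluster centres and labels."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return centres, d.argmin(axis=1)

def one_per_cluster(X, k, seed=0):
    """Pick, from each non-empty cluster, the exemplar closest to its
    centre: a validation set with representatives from all over the
    data space."""
    centres, labels = kmeans(X, k, seed=seed)
    picks = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        d = np.linalg.norm(X[members] - centres[j], axis=1)
        picks.append(int(members[d.argmin()]))
    return np.array(picks)

rng = np.random.default_rng(1)
# Three well-separated blobs of 20 points each (made-up data).
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ([0, 0], [5, 5], [10, 0])])
val_idx = one_per_cluster(X, k=3)
print(val_idx)
```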

In addition, by using the random forest technique [99], the validation set would consist only of exemplars without which the network maintains its regression ability within reasonable limits. These methods are especially effective when the dataset is very sparse and the luxury of random selection is simply not available.

The other effective method is called guided selection [100]. In this method, the validation set is created from the exemplars that can be predicted using the information in the rest of the dataset. The goal of guided selection is to effectively divide the data used in the training of the neural networks into training and validation sets. In this method, first the individual exemplars that, when used in the training set, do not significantly increase the accuracy of the model are identified. Then various combinations of these exemplars are formed into groups of two using an informed heuristic. The size of each group is increased until the desired criterion is reached (appendix 6.3). Guided selection ensures that, in the process of dividing the exemplars into training and
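Appendix 6.3 details the actual heuristic; as a rough, hypothetical sketch of the underlying idea (move to the validation set the exemplars that the rest of the data already predicts well), the code below uses a 1-nearest-neighbour regressor as a stand-in for the trained network. The data, the stand-in model, and `n_val` are all assumptions, not the dissertation's method:

```python
import numpy as np

def nn_predict(train_X, train_y, query_X):
    """Toy stand-in model: 1-nearest-neighbour regression."""
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[d.argmin(axis=1)]

def guided_validation_set(X, y, n_val):
    """Greedy sketch: repeatedly move to the validation set the exemplar
    that the remaining training exemplars predict with the smallest
    error, so the training set keeps the hard-to-predict cases."""
    train = list(range(len(X)))
    val = []
    for _ in range(n_val):
        best_i, best_err = None, None
        for i in train:
            rest = [j for j in train if j != i]
            pred = nn_predict(X[rest], y[rest], X[[i]])
            err = abs(pred[0] - y[i])  # how well the rest predicts exemplar i
            if best_err is None or err < best_err:
                best_i, best_err = i, err
        train.remove(best_i)
        val.append(best_i)
    return train, val

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))   # made-up descriptors
y = X[:, 0] + X[:, 1]          # made-up dependent variable
train, val = guided_validation_set(X, y, n_val=3)
print(len(train), len(val))  # → 9 3
```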
