
MSc Artificial Intelligence

Track: Machine Learning

Master Thesis

Feature synthesis with deep learning

for data science challenges

by

Cong-Nguyen Tran

10867481

July 4, 2016

42 ECTS, January 2016 - July 2016

Supervisor:

Dr Evangelos Kanoulas

Assessor:

Dr Efstratios Gavves


Abstract

In order to extract meaningful insights for many real-world problems, data scientists often have to manually construct features from data. However, most of the time machine learning experts have more ideas to try than time and computational power permit. The relational nature of most datasets makes applying traditional machine learning techniques highly difficult. The goal of this study is to employ deep learning to automatically derive useful high-level features from raw data, with limited preprocessing and human intervention in the process. Experiments with deep neural networks are carried out on a number of large datasets coming from real-world applications. Comparisons with state-of-the-art results are also performed to evaluate the robustness of the system.


Contents

1 Introduction
1.1 General data science process
1.2 Problem and motivation
1.3 Related work
1.4 Contributions

2 Artificial neural networks
2.1 History
2.2 Multilayer Perceptron
2.3 Deep learning with neural networks
2.3.1 Deep neural networks
2.3.2 Dropout layers
2.3.3 Rectified linear units
2.3.4 Learning rate decay

3 Problem definition
3.1 Overview
3.2 Research questions

4 Experimental setup
4.1 Datasets
4.1.1 KDD Cup 2014 - Predicting Excitement at DonorsChoose.org
4.1.2 IJCAI 2015 - Repeat Buyers Prediction after Sales Promotion
4.1.3 Interaction history from Pegasystems
4.2 Evaluation metric

5 Methodology
5.1 RQ1: How can the raw data be processed into a format readable by neural networks?
5.2 RQ2: What is the effect of preprocessing on the feature synthesis process?
5.2.1 Missing values
5.2.2 Data representation
5.3 RQ3: How effective is it to train a neural network as a feature extractor on raw data with minimal engineering effort?

6 Results and Analysis
6.1 Overview
6.2 Relational data handling
6.3 Missing values treatment
6.4 Preliminary classification with neural network
6.5 The impact of data representation
6.6 Feature synthesis with neural networks
6.6.1 KDD Cup 2014
6.6.2 IJCAI 2015
6.6.3 Pegasystems data

7 Conclusion

List of Figures

1-1 A typical data science work-flow.

1-2 A small subset of the KDD Cup 2014 dataset. Some column names are modified for readability. Shortened identities such as projectid or item_number are followed by an ellipsis.

2-1 An example of the logistic sigmoid activation function.

2-2 A neural network with one hidden layer. Each variable (input, hidden and output) is represented by a node, and network parameters are illustrated as links connecting the nodes in two successive layers. Bias factors are represented by an extra input variable $x_0$ and hidden variable $z_0$. The arrow ends of the edges show the flow of information throughout the network.

2-3 A fully-connected neural network with two hidden layers. The variables $x$'s represent the input, where $x_0$ is set to 1 for the bias term, and the $y$'s are the network outputs. There are two hidden layers, where the last nodes represent the respective layer's bias factors.

2-4 An example of the ReLU activation function.

2-5 Derivative of the logistic sigmoid function.

2-6 Derivative of the ReLU activation function.

4-1 Table structure of the IJCAI 2015 Repeat Buyers Prediction dataset.

4-2 The distribution of customer-merchant pairs with recorded outcome label across all six months in the IJCAI 2015 dataset.

4-3 The number of samples in the Pegasystems dataset, distributed across all recorded months when the offers took place.

5-1 An example of a one-to-many relation that can be found in many real-world datasets. Here, each customer recorded in the Customers table can have multiple transactions, each saved in the Transactions table.

5-2 Example of the proposed data joining scheme. Red bars represent samples in the primary table $p$, gray bars dependent entries concatenated to the primaries in $p$, and white bars zeros.

5-3 An example of the proposed data join approach on three sample projects from Figure 1-2 with their respective resources from the KDD Cup 2014 dataset. Some columns in the resources table are omitted for readability. Resource attributes are filled with zero when their corresponding project no longer has any additional resource. The projects_att column stands for all the columns available in the projects table.

5-4 The process of determining the optimal settings for the neural networks: (a) a one-layer neural network is trained multiple times using 5-fold cross validation; the blue nodes with question marks are the ones whose optimal number we want to determine; (b) the previously determined optimal number of hidden units for the one-layer network, together with its trained weights, is used for the first layer of a two-layer network, and the optimal settings for the second layer are identified in a similar manner with 5-fold cross validation.

6-1 The effect of $k$ on the prediction performance of a number of classifiers on the KDD Cup 2014 dataset. The dashed line represents an AUC of 0.5, which is the score for random guessing.

6-2 The effect of the number of hidden nodes on classification performance of a network with only one hidden layer. Scores are obtained by performing stratified 5-fold cross validation and compared with various available baselines. The solid black lines represent the score for random guessing.


List of Tables

5.1 Two example categorical attributes taken from the projects table of the KDD Cup 2014 dataset.

5.2 Two categorical attributes School Metro and Poverty Level after OHC.

5.3 An example of transforming a categorical attribute using RMO. The first column Original represents the original raw data. The third column contains the label for each sample, and the Transformed column is the Location column after transformation.

6.1 Performance comparison of multiple classifiers with different input data handling schemes on the KDD Cup 2014 dataset. Results are obtained using feature synthesis from a 3-layer neural network. Here $p$-baseline means only the primary projects table is used, Database joins stands for the databases' conventional left and right joins, and Proposed join shows results obtained by using our approach. Classification metric is AUC. Bold scores are the best each classifier could achieve.

6.2 AUC score of Logistic Regression using four different treatments of missing values, on both the KDD Cup 2014 and IJCAI 2015 datasets.

6.3 AUC scores for a two-layer neural network classifier on the three datasets, using different representation strategies for categorical attributes.

6.4 A comparison of multiple classifiers against the baselines on the KDD Cup 2014 dataset. Both raw data input and synthesized features with neural networks are considered. Classification metric is AUC. Bold scores are the best each classifier could achieve.

6.5 A comparison of the effect of features extracted by neural networks on three classifiers when evaluated using the IJCAI 2015 dataset. Both raw data input and synthesized features with neural networks are considered. Classification metric is AUC. Bold scores represent the best each classifier achieved in this experiment.

6.6 A performance comparison of multiple classifiers against the baseline used by Pega on the Pegasystems dataset. Both raw data input and synthesized features with neural networks are considered. Classification metric is AUC. Bold AUC scores stand for the best result each classifier achieved.


Chapter 1

Introduction

1.1 General data science process

As the amount of data available in industry and academia increases at a rapid rate [33], it is fruitful to develop tools and techniques that aid scientists and practitioners in extracting valuable insights from data [26]. Machine learning algorithms form an indispensable part of data analytics processes, and it is thus desirable to have methods that are robust, generalizable and scalable.

[Figure 1-1: a flow diagram of the phases: raw data with complex relational structure, preprocessing, feature engineering, model selection, parameter tuning, model evaluation.]


Figure 1-1 illustrates the various phases of an analytics process. Working through these stages often requires domain knowledge of the problem at hand. For example, when trying to predict the number of potential bicycle rentals, it is generally beneficial to include information about the weather and season for each particular timespan. An overview of the typical steps involved in each phase is as follows:

1. Raw data with complex relational structure: For most data science problems, the input data has a highly complex structure with multiple tables or data frames connected with each other through relations. Figure 1-2 shows a small number of rows on a few columns of two tables of the KDD Cup 2014 dataset [1]. The identity (ID) field projectid acts as the connecting point for most of the tables in this case.

Projects table:

| projectid | school_city | resource   | students |
|-----------|-------------|------------|----------|
| 316e...   | Selma Rd    | Books      | 32       |
| 90de...   | Dalas St    | Books      | 22       |
| 3294...   | Colton      | Technology | 17       |
| bb18...   | Brooklyn    | Books      | 12       |
| 2476...   | Los Angeles | Other      | 24       |

Resources table:

| id      | type       | vendor | vendor_name | price  | item_number | amount | projectid |
|---------|------------|--------|-------------|--------|-------------|--------|-----------|
| 71d5... | Books      | 7      | AKJ Books   | 5.1    | 97805...    | 33     | 316e...   |
| 6e48... | Books      | 27     | Amazon      | 4.2    | 00644...    | 33     | 316e...   |
| 0d8b... | Books      | 7      | AKJ Books   | 5.83   | 97803...    | 1      | 90de...   |
| d2d3... | Technology | 82     | Best Buy    | 379.99 | BB124...    | 1      | 3294...   |
| 6377... | Other      | 178    | Quill.com   | 62.99  | 901-34...   | 1      | 2476...   |
| 9a36... | Other      | 178    | Quill.com   | 278.99 | 901-23...   | 1      | 2476...   |

(The original figure also annotates the primary key, foreign key, relation, attributes, degree and cardinality of the two tables.)

Figure 1-2: A small subset of the KDD Cup 2014 dataset. Some column names are modified for readability. Shortened identities such as projectid or item_number are followed by an ellipsis.

2. Preprocessing: Most real-world datasets contain noise, such as missing attributes or incorrectly labeled samples. Initial analysis is therefore necessary to correct the errors in the data.

3. Feature engineering: This is one of the most important steps for a robust predictive pipeline. Domain knowledge is primarily employed in this phase to construct new features from available attributes, mainly through aggregations. Attributes that are not helpful for the current task are also removed during this stage.

4. Model selection: Each machine learning algorithm excels at a certain kind of problem and data. Based on the dataset and the current task, a decision is made on which type of learning technique and approach to use.

5. Parameter tuning: After the machine learning model has been formed in the previous step, its parameters are optimized to give the best possible prediction performance. A number of techniques, such as cross validation, are available to assist in the parameter selection process [27].

6. Model evaluation: A hold-out dataset is typically used to, once again, assess the performance of the selected model and its corresponding parameter set in terms of generalizability.

Depending on the task, the different phases outlined in Figure 1-1 can be combined and are repeated until a model with reasonable performance is found.

1.2 Problem and motivation

Artificial intelligence (AI) is surpassing humans in an increasing number of problems, such as visual recognition [20], the identification and recovery of programming errors in software [16], and the game of Go against human champions [44]. However, the amount of time devoted to answering the questions posed during each phase of the data science process in Figure 1-1 can be significant, and both intuition and knowledge are required to come up with a robust machine learning model. It is therefore desirable to have techniques that help expedite data science processes for people without a deep understanding of the domain posed by the data.

Feature engineering plays a crucial role in determining the outcome of machine learning models. This is the step that transforms the complex relational structure of raw data often found in practice, such as the one in Figure 1-2, to a format readable by machine learning algorithms. In most cases, having well-formed inputs can lead to a drastic improvement on prediction accuracy. However, it is one of the most difficult phases in the data science process because, while there is an extensive literature with in-depth analyses on different machine learning algorithms ([28], [8], [6], [42], [10]), feature engineering is heavily domain-dependent. Recommended practices (such as the choice of aggregate, and how to represent some of the attributes) for one task might, in fact, have detrimental effects on another seemingly similar problem.

Moreover, there is no direct way to monitor the effectiveness of our choices until we reach the last step depicted in Figure 1-1, namely model evaluation. However, the feedback we get from evaluating the resulting model (typically in the form of prediction accuracy) represents the effectiveness of the whole pipeline. If the outcome does not reach an acceptable level, scientists typically have to look into almost all the prior phases to identify which one is deteriorating performance. Consequently, a few mistakes in feature engineering can lead to fatal yet latent problems.

The goal of this thesis is to tackle this challenge by proposing an approach to systematically extract high-level features from raw data with the use of deep learning methods. These techniques can derive quality features directly from their inputs without much human intervention [30], something conventional machine learning algorithms cannot do. This striking characteristic of deep learning serves as the underlying motivation for applying it to the feature engineering endeavor in data science processes.

1.3 Related work

In domains with natural data, such as computer vision, the idea of using deep learning to synthesize features has already been applied. For example, deep convolutional networks learn features from images that, when visualized, show a strong resemblance to the original objects of interest [49]. Overfeat [43], a robust feature extractor and image recognizer, revolves around the ability of deep learning techniques to generate high-quality features from natural input data. In the field of Natural Language Processing, deep learning has also proven powerful for feature synthesis. Word2vec [36] learns the linguistic context of words in a large text corpus: it produces a high-dimensional vector for each word, and words with similar meanings lie close to each other. In these particular fields, deep learning has seen many successes.

However, to the best of our knowledge, prior to our study there has been only one work in the literature that tries to automate the feature engineering phase of general data science challenges. The resulting system is called the Data Science Machine (DSM) [26]. DSM constructs features by traversing the relational tables in a dataset: through a set of base functions such as sum, count and mean, attributes in one table are aggregated and propagated to another table in a recursive and exhaustive manner. The system demonstrates its robustness by surpassing a majority of contestants in prestigious data science challenges. However, our work explores a number of aspects not considered by the authors of DSM:


1. While the features are automatically generated, the set of base functions is still manually chosen, which means that highly non-linear correlations can be missed. The approach proposed in this thesis avoids this problem by using deep learning techniques to automatically learn complex non-linear functions.

2. Even though the DSM achieves high competition scores, there is no in-depth analysis of the contribution of each individual component of the pipeline.

3. It is not mentioned whether the proposed feature construction process can improve the prediction results of classifiers other than random forest, which is by itself a powerful algorithm for this kind of task [8]. In this study, we evaluate our feature synthesis approach by assessing its quantitative contribution to the performance of many different classifiers.

1.4 Contributions

The main contributions of this work are:

1. Propose an automatic feature synthesis scheme by the use of deep learning techniques with minimal preprocessing.

2. Analyze the impact of the synthesized features on the prediction accuracy of the data science pipeline.

The thesis is organized as follows: Chapter 1 provides a brief introduction to the problem. Chapter 2 goes through artificial neural networks and related deep learning techniques. Chapter 3 discusses the problem and its challenges in greater depth, and explicitly states the research questions to be answered. Chapter 4 explains the datasets and the metric used for the experiments. Chapter 5 describes the proposed pipeline and how it is used to solve the stated problem. Chapter 6 reports the experimental results and the corresponding analysis. Finally, a conclusion is given in Chapter 7.


Chapter 2

Artificial neural networks

2.1 History

Artificial neural networks (or neural networks for short) were originally developed with the goal of replicating how information is handled in biological systems [41], [40], [47], [34]. The Perceptron algorithm introduced by Rosenblatt in 1957 [39] marks one of the earliest known instances of neural networks. While the original Perceptron algorithm set the stage for subsequent work in the field, it is unable to estimate non-linear functions. This major drawback became the focus of many researchers during the late 1970's and early 1980's, when scientists discovered that by stacking many Perceptrons together into multiple layers, non-linear functions could be approximated.

A lack of computational power and of sufficient data were the main reasons the research focus shifted away from neural networks before the late 2000's. These two problems were later tackled thanks to the introduction of affordable parallel computing hardware, such as graphics processors, and the increasing amount of labeled data. The most notable achievements have been obtained with deeper networks in various domains where natural data is present, such as speech recognition [21], language processing [35] and computer vision [11], [30]. The corresponding field of study is often referred to as Deep Learning.

One key characteristic of deep neural networks is that they have many parameters and, as a result, can approximate highly complex, non-linear functions. Furthermore, neural networks are flexible and can be adapted quickly to specific domains. These features make neural networks ideal for problems with an abundance of data and complex relationships. The word "Deep" in Deep Learning most commonly refers to neural network models having more than two layers. Most problems with natural data, such as image recognition and speech processing, are tackled effectively with neural networks that are deep in terms of the number of layers. Deep Learning can also be seen as a collection of neural-network-based algorithms that are able to learn features automatically from raw data, with limited human intervention.

Throughout the evolution of Deep Learning, primarily from 2006 onward, many clever techniques have been developed to drastically improve network performance. Some of these include dropout layers and rectified linear units (ReLUs), introduced to address difficulties faced when training deep neural networks, such as training speed and generalization.

2.2 Multilayer Perceptron

This section explores the Multilayer Perceptron, the type of neural network most applicable to data science problems. We consider the mathematical form of the network. The calculation of its parameters (commonly known as weights) using maximum likelihood can be carried out with the error backpropagation technique, which allows for efficient optimization of the network parameters [41].

The building blocks of neural networks are linear combinations of the input signal $\mathbf{x} = [x_1 \ldots x_D]$:

$$\hat{y}(\mathbf{x}, \mathbf{w}) = f\left(\sum_{i=1}^{D} w_i x_i\right) \tag{2.1}$$

where $f(\cdot)$ is typically a nonlinear activation function in the case of classification problems, $D$ is the input dimension and $\mathbf{w}$ is the coefficient vector. This model can be extended by applying parametric basis functions of the same form as (2.1) to the input signal $\mathbf{x}$. The result is a neural network model in which non-linear transformations of linear combinations of the inputs serve as the inputs to the subsequent layer.

More precisely, let us consider $M$ linear combinations of $\mathbf{x}$:

$$c_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \tag{2.2}$$

where $j = 1, \ldots, M$ indexes the linear combinations and the superscript denotes the network layer number. The special parameters $w_{j0}^{(1)}$ in (2.2) are the network biases. Each of the resulting activations $c_j$ then goes through a non-linear, differentiable activation function $f(\cdot)$:

$$z_j = f(c_j) \tag{2.3}$$

The $z_j$'s are generally called the hidden units of the network. Early work on artificial neural networks typically used the logistic sigmoid and tanh as the non-linear activation function $f(\cdot)$. The sigmoid function takes the following form:

$$\sigma(x) \equiv \frac{1}{1 + \exp(-x)} \tag{2.4}$$

Figure 2-1 illustrates the logistic sigmoid function. Another popular activation function, the tanh function, is defined as follows:

$$\tanh(x) \equiv \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{2.5}$$

The resulting quantities in equation (2.3) now serve as inputs to the next network layer:

$$c_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \tag{2.6}$$

where $k = 1, \ldots, K$ indexes the linear combinations as before, and $K$ is the number of nodes in the third layer. The $w_{k0}^{(2)}$ serve as biases for this second layer, and the linear combinations $c_k$ constitute the second layer of the network.

The process above is repeated a number of times corresponding to the desired network depth.

Figure 2-1: An example of the logistic sigmoid activation function.


The final output layer has its inputs transformed by a suitable non-linear activation function to yield the outputs $\hat{y}$. The network depth, the number of hidden units in each layer and the choice of activation functions are all decided based on the type of problem and the data available. For instance, for two-class classification, a sigmoid function can be used for the output activation:

$$\hat{y}_k = \sigma(c_k) \tag{2.7}$$

In the case of multi-class classification, a softmax function, which is a generalized version of the sigmoid, can be used:

$$\hat{y}_k = \frac{\exp(c_k)}{\sum_j \exp(c_j)} \tag{2.8}$$

where the sum over $j$ runs over the exponentials of all output activations.

For a neural network with one hidden layer and sigmoid activations, the phases above can be combined to give the following form:

$$\hat{y}_k(\mathbf{x}, \mathbf{w}) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)} \, \sigma\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right) \tag{2.9}$$

where the vector $\mathbf{w}$ stores all the network parameters and $\mathbf{x}$ represents the input vector. The neural network in this case models a non-linear mapping from a set of input signals $\{x_i\}$ to a set of desired output values $\{y_k\}$. Because a composition of linear combinations is itself just another linear combination, it is the non-linear activation functions that enable the network to capture highly complex features from the data.


[Figure 2-2: a diagram of a network with input nodes $x_0, x_1, x_2, x_3, \ldots, x_D$, hidden nodes $z_0, z_1, \ldots, z_M$, output nodes $\hat{y}_1, \ldots, \hat{y}_K$, and weights $w^{(1)}$ and $w^{(2)}$ on the connections.]

Figure 2-2: A neural network with one hidden layer. Each variable (input, hidden and output) is represented by a node, and network parameters are illustrated as links connecting the nodes in two successive layers. Bias factors are represented by an extra input variable $x_0$ and hidden variable $z_0$. The arrow ends of the edges show the flow of information throughout the network.

Figure 2-2 shows the fully-connected network with one hidden layer as described throughout this section. The process of evaluating expression (2.9) is known as forward evaluation or forward propagation of information through the network.

The bias terms $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ in (2.9) can be absorbed by introducing an extra input variable $x_0 = 1$ for the input layer, and similarly for the hidden variables, e.g. $z_0 = 1$ in our example. The calculation of the linear combinations in (2.2) therefore becomes:

$$c_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i \tag{2.10}$$

and similarly for the bias in the hidden layer in equation (2.6):

$$c_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j \tag{2.11}$$

The network formula in (2.9) then simplifies to:

$$\hat{y}_k(\mathbf{x}, \mathbf{w}) = \sigma\left(\sum_{j=0}^{M} w_{kj}^{(2)} \, \sigma\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right) \tag{2.12}$$


The network equation (2.12) is composed of two successive similar steps, each ending with a non-linear activation function (the sigmoid $\sigma(\cdot)$ in our example). Each of these steps functions similarly to the Perceptron model [39], which is why this type of neural network is called a Multilayer Perceptron. The Perceptron algorithm uses a step function for its non-linear activation, whereas the neural network model explained in this section uses differentiable non-linear activation functions for the hidden layers. This choice allows for efficient network training and lies at the heart of the neural network learning framework.

The network structure with one hidden layer depicted in Figure 2-2 was for a long time the most widely used in practice. The architecture can be adapted quickly to a particular kind of problem by adjusting the number of hidden layers, as well as the number of hidden units in each layer. Different activation functions can also be combined to produce the best possible predictive power. Regarding the number of network layers, there is no consensus among scientists on how to count this value: the network in our example can be treated as a three-layer network (counting the input, hidden and output layers), as a one-layer network (because there is only one hidden layer), or as having two layers of adjustable parameters.

The Multilayer Perceptron architecture closely mirrors its mathematical formulation, and it can therefore be generalized to more sophisticated structures. However, the flow of information must be maintained; for instance, there must not be any directed cycle within the network, so that outputs can be computed directly from inputs.

Feed-forward neural networks have very high approximation power, assuming the network parameters are chosen reasonably. This characteristic has been central to a number of studies on neural networks [12], [45], [23], [29], [38]. A network with one hidden layer, given a sufficient number of hidden units, can approximate any continuous function over a fixed input domain.
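To make the forward propagation of (2.12) concrete, here is a minimal NumPy sketch (our illustration, not code from the thesis), with the biases absorbed as $x_0 = z_0 = 1$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    """Forward pass of the one-hidden-layer network in (2.12).

    x  : input vector of length D (without the bias component)
    W1 : (M, D + 1) first-layer weights; column 0 holds the biases
    W2 : (K, M + 1) second-layer weights; column 0 holds the biases
    """
    x = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias
    z = sigmoid(W1 @ x)              # hidden units z_1 .. z_M
    z = np.concatenate(([1.0], z))   # prepend z_0 = 1 for the bias
    return sigmoid(W2 @ z)           # outputs y_1 .. y_K

# Example with D = 3 inputs, M = 4 hidden units and K = 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 4))
W2 = rng.normal(scale=0.1, size=(2, 5))
print(forward(np.array([0.5, -1.2, 3.0]), W1, W2))
```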

2.3 Deep learning with neural networks

2.3.1 Deep neural networks

In terms of computation, deep networks are favored over shallow ones when the problem is complex and exhibits non-linear correlations: with the same number of connections, deep neural networks possess much better feature extraction power. In certain domains, the number of hidden units and connections in a shallow network can be up to ten times that of a deep neural network, only to achieve similar prediction capabilities [3], [25]. A floating point operation needs to be carried out for each network connection involved in computation, and it is therefore much more demanding to optimize the huge number of parameters in shallow networks for desirable results.

In Section 2.2 we saw an example of a shallow network in Figure 2-2. A deep neural network with a similar structure, but with two hidden layers instead of one, is illustrated in Figure 2-3.

[Figure 2-3: a diagram of a network with input nodes $x_0, x_1, x_2, x_3, \ldots, x_D$, two hidden layers, and output nodes $\hat{y}_1, \ldots, \hat{y}_K$.]

Figure 2-3: A fully-connected neural network with two hidden layers. The variables $x$'s represent the input, where $x_0$ is set to 1 for the bias term, and the $y$'s are the network outputs. There are two hidden layers, where the last nodes represent the respective layer's bias factors.

As we can see from the example network in Figure 2-3, to calculate the values of the output layer the optimization process has to use the values of the first hidden layer's neurons multiple times: to reach the output, we have to go through the second hidden layer, and each neuron in that layer is computed from the neurons of the previous layer. The values of the nodes in the first layer are therefore effectively re-used throughout the training procedure.

On the other hand, the flatter structure of a shallow network, such as the one in Figure 2-2, has no reusable neurons: every new non-linearity requires another hidden unit, which makes computation extremely expensive. Consequently, despite the name, deep neural networks are in fact much more compact and efficient than their shallow counterparts [4], making them ideal for high-level feature extraction.


2.3.2 Dropout layers

Neural networks, especially deep networks, have many parameters to optimize. One major difficulty arising from this high complexity is that the amount of training data is often insufficient to produce a robust model. As a result, we typically either overfit a complex network or use a shallow architecture, which loses a significant amount of predictive power. Overfitting happens when the trained model fits itself to the noise in the training data rather than capturing the underlying representation. One indicator of this phenomenon is the difference between the performance on the training set and that on a separate validation set: when we overfit a model, its training accuracy is almost perfect whereas results on the validation set are substantially worse.

A number of techniques can be employed to combat overfitting:

1. Add a regularization term to the loss function to control the magnitude of the network parameters.

2. Pre-train the weights in an unsupervised manner before using them to initialize the network.

3. Include dropout layers [22] in the network architecture.

A dropout layer works by selecting a number of nodes from the previous layer and removing their values by overwriting them with zeros, effectively keeping those nodes out of the current round of optimization. A predefined probability parameter, $p_{drop}$, controls the proportion of nodes to be dropped at each iteration.

For prediction, the incoming connections to a layer whose nodes were dropped with proportion $p_{drop}$ during training have their weights re-scaled by the retention probability:

$$\mathbf{w}_j^{(prediction)} = (1 - p_{drop})\,\mathbf{w}_j \tag{2.13}$$

where $j$ indicates the layer that had some of its nodes dropped during training.

By using dropout layers, the network is forced to extract insights from the data more efficiently and cannot easily fit to the noise in the training samples. The resulting model is therefore more robust and has better prediction power.
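As a minimal NumPy sketch of this train/predict asymmetry (our illustration; the thesis provides no code):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5  # proportion of nodes dropped at each iteration

def dropout_train(z, p_drop, rng):
    """During training, zero out a random subset of the activations."""
    keep_mask = rng.random(z.shape) >= p_drop
    return z * keep_mask

def rescale_for_prediction(W, p_drop):
    """At prediction time, re-scale the weights by the retention probability."""
    return (1.0 - p_drop) * W

z = np.array([0.3, 1.2, 0.7, 2.1])
print(dropout_train(z, p_drop, rng))  # some entries zeroed at random
```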

2.3.3 Rectified linear units

As briefly mentioned in Section 2.1, the ReLU is one of the techniques recently developed to make network training more efficient than with the logistic sigmoid described in Section 2.2 [37]. The function has the following simple form:

$$r(x) = \max(0, x) \tag{2.14}$$

One major difference of ReLUs compared to the logistic sigmoid is that, when used as the activation function for the hidden layers, the nodes in these layers can produce arbitrarily large values.

Figure 2-4 demonstrates this property: the output of the ReLU function equals the input $x$ whenever $x > 0$. In the case of the sigmoid, as we can see from Figure 2-1, all outputs are squashed to values less than 1.


Figure 2-4: An example of the ReLU activation function.

Neural networks, especially deep ones, using ReLUs as the activation function are reported to have much more effective approximation capability, as the loss function can generally reach a much lower value during training than with the logistic sigmoid.

The reason ReLUs perform better than traditional activation functions lies in the error backpropagation process. To compute the error terms for one layer, we require the derivative of the activation function, $f'(\cdot)$, and the errors for the layer above must already have been computed. In our current example, the activation function is the sigmoid in (2.4), whose derivative has the following form:

$$\sigma'(x) = \sigma(x)\,[1 - \sigma(x)] \tag{2.15}$$

which is similar in form to the derivative of the softmax function in (2.8).


Figure 2-5: Derivative of the logistic sigmoid function.

Figure 2-5 plots the derivative of the sigmoid function. As we can see, the problem with the sigmoid is that its derivative evaluates to zero or near-zero values almost everywhere, except when $x$ is very close to zero. For deep networks with a large number of layers, the propagated errors therefore shrink dramatically as we move down the network layers. This vanishing-gradient issue makes training with stochastic gradient descent highly difficult.

The derivative of the ReLU, on the other hand, is non-zero whenever $x$ is positive, as illustrated in Figure 2-6. An extension of the ReLU, called the parametric rectified linear unit (PReLU) [20], has been proposed based on this observation. It has the following functional form:

$$pr(x) = \max(x, ax) \tag{2.16}$$

where $a$ is a control coefficient.

The derivative of the PReLU is non-zero over the whole input domain except at the single point $x = 0$. This property of ReLUs and PReLU leads to a more balanced error propagation in the lower network layers.
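For concreteness, a short sketch of the two activations and their derivatives (our illustration; the PReLU coefficient $a = 0.25$ is an assumed value):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # r(x) = max(0, x), eq. (2.14)

def relu_grad(x):
    return (x > 0).astype(float)     # 1 for x > 0, 0 otherwise

def prelu(x, a=0.25):
    return np.maximum(x, a * x)      # pr(x) = max(x, ax), eq. (2.16)

def prelu_grad(x, a=0.25):
    return np.where(x > 0, 1.0, a)   # non-zero everywhere except x = 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_grad(x))
print(prelu(x), prelu_grad(x))
```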

2.3.4 Learning rate decay

When training deep neural networks, it is desirable to gradually decrease the learning rate $\alpha$ during training. The underlying motivation for this technique comes from the observation that with high learning rates, stochastic gradient descent optimizes the network parameters aggressively, making them change values too fast for the loss function to reach a good local minimum.

Figure 2-6: Derivative of the ReLU activation function.

It is difficult to devise an optimal strategy for learning rate decay: decreasing it too quickly causes the optimization to stall, and the parameters will be unable to reach a desirable position. In contrast, reducing the learning rate too conservatively wastes computational resources in the early stages of training, because with large jumps in the weights there will not be much improvement.

There are a number of popular strategies for effectively decaying the learning rate:

1. Based on the inverse of time: the learning rate is adjusted after each iteration as follows:

$$\alpha = \frac{\alpha_0}{1 + \theta t} \tag{2.17}$$

where $\alpha_0$ is the initial learning rate, $\theta$ is a control hyperparameter and $t$ is the iteration number.

2. Based on the exponential of time: similar to the previous approach, but the decay formula has a different form:

$$\alpha = \alpha_0 \exp(-\theta t) \tag{2.18}$$

where $\alpha_0$, $\theta$ and $t$ are as in (2.17).

3. Based on the number of training rounds: decrease the learning rate after every few passes over the whole dataset (a pass is also known as an epoch). How to determine a suitable multiplier $0 < \eta < 1$ for the learning rate is typically highly problem-dependent. In practice, it is advisable to monitor the error on a validation set during training and adjust the learning rate accordingly. Basing the adjustment on the number of epochs is generally the preferred strategy, because scaling in this case is much more intuitive than trying to determine a suitable $\theta$ for the other two schemes.
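The three schedules can be sketched as follows (our illustration; the hyperparameter values are assumed, not taken from the thesis):

```python
import numpy as np

alpha0 = 0.1      # initial learning rate
theta = 0.01      # decay hyperparameter for schedules (2.17) and (2.18)
eta = 0.5         # per-interval multiplier for epoch-based decay
interval = 10     # decay the epoch-based schedule every 10 epochs

def inverse_time_decay(t):
    return alpha0 / (1.0 + theta * t)             # eq. (2.17)

def exponential_decay(t):
    return alpha0 * np.exp(-theta * t)            # eq. (2.18)

def epoch_decay(epoch):
    return alpha0 * eta ** (epoch // interval)    # multiply by eta every few epochs

for t in (0, 100, 1000):
    print(t, inverse_time_decay(t), exponential_decay(t))
print(epoch_decay(25))  # two decays applied: 0.1 * 0.5**2 = 0.025
```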


Chapter 3

Problem definition

3.1 Overview

By employing an adequate number of hidden layers and units, a neural network can learn the underlying representation of the data effectively. The last layer of the network, the output layer, is typically a non-linear differentiable function such as the traditional softmax function [9], [30]. The choice of output connection depends on the task as well as on the overall network architecture. For instance, LeNet-5 [31] uses a Gaussian connection in the output layer to solve the document recognition problem. A linear support vector machine has also been used as the output classifier, giving a small but consistent improvement over the conventional fully-connected layer [46].

If we treat the output layer as a simple classifier, all the previous hidden layers can be viewed as a feature extraction process. It means that we can use the learned network parameters, except those belonging to the output layer, to automatically pick up high-level representations of the input signals. For the vision domain, this idea has been successfully adopted to carry out transfer learning [14], where a learned network architecture is incrementally trained for another related task. The advantage of this approach is that the network will be exposed to much more relevant data than training purely on a single, possibly small dataset. The transferability of features extracted from a deep neural network has also been previously explored [48]. This striking characteristic forms the foundation for Overfeat [43], a flexible image feature extractor for visual recognition based on a convolutional neural network trained on the ImageNet 1K dataset [13].
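A minimal sketch of this "discard the output layer" idea (our illustration using scikit-learn's MLPClassifier, not the thesis's actual pipeline): after training, forward-propagate through the hidden layers only and hand the activations to a separate classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
net.fit(X, y)

def extract_features(net, X):
    """Propagate X through the hidden layers only, skipping the output layer."""
    h = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        h = np.maximum(0.0, h @ W + b)  # MLPClassifier's default ReLU activation
    return h

features = extract_features(net, X)
clf = LogisticRegression(max_iter=1000).fit(features, y)
print(features.shape, clf.score(features, y))
```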

This thesis aims to explore the effectiveness of trained neural networks when used as high-level feature extractors for relational datasets. While this idea has been successfully applied to solve domain-specific problems, such as object recognition in computer vision, its general applicability has not been extensively studied. As Section 1.2 points out, feature engineering consumes a large amount of time and often demands domain-specific knowledge. By exploiting the ability of neural networks to capture complex non-linearities within the data without much human intervention, the feature engineering effort can be significantly reduced.

3.2 Research questions

Based on the shortcomings of the current literature mentioned in Section 3.1, this thesis addresses the following research questions (RQs):

RQ1: How can the raw data be processed into a format readable by neural networks?

Given an initial dataset with a relational structure such as the one illustrated by Figure 1-2, the way to aggregate over these tables is a major part of feature engineering. A generalizable method to handle this complex structure is therefore desired.

RQ2: What is the effect of preprocessing on the feature synthesis process?

While deep neural networks are able to act as feature extractors, the overall performance is still dependent on the representation of the input data. Consequently, we need an effective and computationally inexpensive method to preprocess the data.

RQ3: How effective is it to train a neural network as a feature extractor on raw data with minimal engineering effort?

Given a preprocessed frame of data, the quality of the features extracted by the deep neural network architecture needs to be evaluated. Furthermore, the classification improvement contributed by the synthesized features is also explored.


Chapter 4

Experimental setup

4.1 Datasets

Three challenging datasets are used to evaluate the effectiveness of the proposed pipeline, as well as to answer the research questions stated in Section 3.2. These datasets are: (1) the KDD Cup 2014 [1], (2) the IJCAI 2015 Repeat Buyers Prediction competition [2] and (3) customer interaction history from Pegasystems.

These datasets are selected because of the challenges they pose, including:

– Challenge 1 - Skewness: The dataset is highly skewed, as the number of samples belonging to one class is significantly higher than that of the other class.

– Challenge 2 - High dimensionality: Even though each single instance in one of the main tables has a reasonably low number of attributes, these attributes can make the feature space explode depending on how they are encoded.

– Challenge 3 - Large number of samples: Each of these datasets contains a large number of data points. It means efficient processing is required to make the system perform in a reasonable amount of time.

– Challenge 4 - Relational structure: All three datasets contain a number of tables, connected with each other through a relational structure. Traditional machine learning methods accept fixed-length vectors as inputs, and as a result, careful handling of this complex structure is required.


4.1.1 KDD Cup 2014 - Predicting Excitement at DonorsChoose.org

The primary dataset selected for experiments is the one used for the KDD Cup 2014 [1]. It contains projects posted by primary and high school teachers with the goal of raising funds for materials needed for the education of their students. These projects are hosted on DonorsChoose.org, an online charity platform that facilitates school donations to aid students in need.

The training set consists of projects posted prior to 01/01/2014, and the test set consists of projects posted from this date onwards.

The dataset consists of the following tables:

∙ projects: This table contains information about all projects and the teachers who posted them. Each project is identified by a projectid field. Multiple projects can be grouped according to teacher_acctid, the ID of the teacher, and schoolid, the ID of the school the teacher belongs to. Further information is also available, such as the school location (state, zip code, city, country, etc.), the teacher's prefix and the focus subjects. Finally, the number of students affected by the project and the amount of money needed are also present. In total, there are 664 098 projects in the whole dataset, each with 35 attributes.

∙ resources: Information about each type of resource requested by the projects is stored in this table. Available fields include the name of the requested material, its unit price, the vendor name and the number of units needed. Each resource has a resourceid and projectid so that different resources, or the same resource requested by two different projects, can be distinguished. There are 3 667 217 records in this table, each with 9 fields.

∙ donations: Donations for projects in the training set are saved in this table. It records the amount of money for each donation, who the donor is, the corresponding payment method and a message from the donor. The ID fields for each donation are donationid which is unique for each donation, projectid, the ID of the donated project, and donor_acctid which is a unique value representing the donor. There are 3 097 989 donations recorded for the training set, with 21 attributes each. Donations for the test set are not available.

∙ essays: Each project is posted along with a descriptive essay from its teacher. The title and a short description of the project, as well as the full essay, are stored in this table. Furthermore, a need statement directly specifying the needed materials is also available. Each essay is identified through projectid and teacher_acctid. There are 664 098 essays stored for the whole dataset, each with 6 attributes (two of which contain long texts describing the project).

∙ outcomes: This table has the prediction outcomes for projects in the training set. Out of the 619 326 entries available, only 36 710 are exciting projects, which accounts for 5.93% of the training projects. Besides projectid and the classification outcome column, 10 more fields specifying the results of the excitement criteria are also available. Examples of these fields are the proportion of original donation messages and the numbers of donors acquired and not acquired by the posting teacher.

The classification challenge posed by the competition organizer is to predict whether a particular project is exciting from a business perspective. Note that not having an "exciting" label does not mean a project is not meaningful for students and teachers.

This dataset also allows results directly comparable with DSM's [26], as the training and test sets are pre-determined by the competition organizer and remain available at all times.

4.1.2 IJCAI 2015 - Repeat Buyers Prediction after Sales Promotion

The second dataset comes from the International Joint Conference on Artificial Intelligence 2015 (IJCAI 2015). The data consists of customers' shopping and behaviour logs during the six months before and on November 11th, the "Double 11" sale day. The goal is to predict the probability that a customer will come back to buy something else from the same merchant within six months after the sale date. This dataset is sampled from real transactions on Tmall.com, the biggest Chinese business-to-consumer online platform.

[Figure 4-1: a diagram of three linked tables: Outcome, User-Merchant action log and Basic user profile.]

Figure 4-1: Table structure of the IJCAI 2015 Repeat Buyers Prediction dataset.

Figure 4-1 represents the structure of the IJCAI dataset. An overview of each table is as follows:


∙ User-Merchant action log: This table consists of the transaction history between consumers and sellers. There are approximately 55 million recorded samples. Each row has a defining ID pair, user_id and seller_id, as well as the ID of the purchased item, the time when the transaction takes place and the action invoked by the consumer (such as click or add-to-favorites). There are 7 attribute columns in this table.

∙ Basic user profile: This table holds the IDs of 424 170 customers, along with their age ranges and genders. These are the only 3 columns recorded.

∙ Outcome: The outcome table for training. It holds a pair of consumer-merchant IDs and the binary outcome, which is 1 if the customer becomes a repeat buyer and 0 otherwise. There are therefore 3 fields in this table.

Compared to the KDD Cup 2014 dataset in Section 4.1.1, the IJCAI 2015 data has a much simpler structure. However, the huge number of recorded transactions poses its own challenges.

[Figure 4-2, reconstructed as a table:]

| Month     | Negative   | Positive  |
|-----------|------------|-----------|
| May       | 10,580     | 2,565     |
| June      | 29,231     | 6,850     |
| July      | 22,068     | 4,458     |
| August    | 27,868     | 6,531     |
| September | 44,503     | 9,095     |
| October   | 1.45 · 10⁵ | 24,069    |
| November  | 2.27 · 10⁶ | 2.2 · 10⁵ |

Figure 4-2: The distribution of customer-merchant pairs with recorded outcome label across all six months in the IJCAI 2015 dataset.

Figure 4-2 shows the number of labelled consumer-merchant pairs per month. As we can see, the data is highly skewed towards negative samples, meaning most customers are not repeat buyers. An additional challenge is that, because the interactions of customers and sellers center on the Double 11 sale date, a huge number of samples is recorded in November. In particular, on November 11th there are 139 173 positive and 1 513 857 negative samples, collectively making up 58.53% of the dataset. The whole of November accounts for 88.23% of the data.

The data described above is based on the training set. As the competition had ended at the time of writing this thesis, there is no way to directly assess our pipeline against the work of other groups on this dataset. Given these challenges, we split the provided training data into two parts:

1. Training set: consists of all samples from May to September, plus half the samples of November 11th and the rest of November. The data from November 11th is sampled with the same positive-negative ratio. The training set accounts for 64.76% of the original dataset.

2. Test set: all October samples, together with half the data of November 11th. The size of the test set equals 35.24% of that of the original data.

There are a number of reasons why we opted for this choice of splitting:

1. The whole of October, which makes up a sizeable portion of the dataset excluding November, is in the test set, meaning the learning algorithm must capture any temporal effect in the data to perform effectively. There is no particular big sale day recorded in October; as a result, any trend occurring outside Double 11 should also be present in October.

2. Reserving half of the November 11th data points for testing means an effective 50% training / 50% test split for a large percentage of the data. To perform well in this scenario, the prediction pipeline needs to be extremely generalizable.

The splitting choice described above leads to a large test set; only learning algorithms with strong generalization will perform well in this case.
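A sketch of this temporal split in pandas (our illustration; the column names month, day and label are placeholders, not the competition's actual schema):

```python
import pandas as pd

df = pd.read_csv("labelled_pairs.csv")  # hypothetical file of labelled pairs

nov11 = df[(df["month"] == 11) & (df["day"] == 11)]
# Split the November 11th burst 50/50, stratified on the label so that
# both halves keep the same positive-negative ratio.
nov11_test = nov11.groupby("label").sample(frac=0.5, random_state=0)
nov11_train = nov11.drop(nov11_test.index)

train = pd.concat([
    df[df["month"].between(5, 9)],                  # May through September
    df[(df["month"] == 11) & (df["day"] != 11)],    # the rest of November
    nov11_train,
])
test = pd.concat([df[df["month"] == 10], nov11_test])  # October + half of Nov 11
```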

4.1.3 Interaction history from Pegasystems

This dataset consists of the interaction history between Pegasystems' clients and their customers. The data comes in a single table, where all features are anonymized for confidentiality. There are 43 055 samples in this dataset, each with 21 attribute fields after cleaning out zero-variance and alias columns.

The dataset represents product offers from Pega's clients to their customers, with the customer responses recorded as either positive or negative. These responses are the target we want to predict. Compared to the KDD Cup 2014 and IJCAI 2015 datasets, the one from Pegasystems poses different challenges:

∙ Depending on how we encode the categorical attributes of the data, we can end up with a high-dimensional dataset with a relatively small number of data points.

∙ All fields are anonymized, meaning it is highly difficult to manually employ domain knowledge.

∙ Some fields have a large number of distinct values. Without a sound choice of representation, these fields can lead to overfitting with certain models.

There is no separate test set provided with this dataset. Therefore, we use a splitting scheme similar to the one for the IJCAI 2015 dataset in Section 4.1.2.

[Figure 4-3, reconstructed as a table:]

| Month   | Negative | Positive |
|---------|----------|----------|
| 8/2012  | 10,703   | 31,325   |
| 9/2012  | 42       | 186      |
| 10/2012 | 20       | 116      |
| 11/2012 | 10       | 27       |
| 12/2012 | 34       | 161      |
| 1/2013  | 10       | 334      |
| 2/2013  | 8        | 79       |

Figure 4-3: The number of samples in the Pegasystems dataset, distributed across all recorded months when the offers took place.

Figure 4-3 shows how the samples are distributed across the recorded months. Similarly to Figure 4-2 for IJCAI 2015, the Pegasystems dataset is highly concentrated, with most of the data coming from August 2012.

Based on this distribution, we put 70% of the data into the training set and the remaining 30% into the test set. The splitting strategy is similar to that of the IJCAI 2015 dataset in Section 4.1.2: the whole of 2013 and approximately the first one-third of the August 2012 data make up the test set, and the rest of the 2012 data forms the training set. This way of splitting allows us to learn both the monthly trends and the August burst present in the data.

4.2 Evaluation metric

Prediction performance of the classifiers is evaluated using the area under the receiver operating characteristic (ROC) curve (AUC). This is the evaluation metric proposed for the KDD Cup 2014 and IJCAI 2015 competitions, and a suitable one for cases where the data labels are skewed [7]. By using AUC to assess performance, we are able to overcome Challenge 1 mentioned in Section 4.1.

A number of preliminary processing steps are also performed and assessed to give an understanding of their impact on the performance of deep networks. Furthermore, the improvements on classifiers contributed by synthesized features are recorded and, in the case of the KDD Cup 2014 dataset, compared against DSM [26].
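As a quick illustration of why AUC suits skewed labels: it depends only on how the scores rank the samples, not on the class proportions. A toy example with scikit-learn (values made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Nine negatives and a single positive; the positive gets the highest score.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.4, 0.1, 0.2, 0.9])
print(roc_auc_score(y_true, y_score))  # 1.0: the positive outranks every negative
```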


Chapter 5

Methodology

This chapter discusses the phases of the data science work-flow (Figure 1-1) in which the proposed method plays a role. As we go through each step of the pipeline, all challenges mentioned in Section 4.1 are addressed. The goal is to provide an abstract view of the method and of the kinds of problems and data it is applicable to.

5.1 RQ1: How can the raw data be processed into a format readable by neural networks?

This section describes our method for handling the relational structure found in most datasets, and consequently answers RQ1. It corresponds to the first phase of the data science process in Section 1.1.

For any prediction problem, there is a central table of interest. This table typically represents the entity whose future status we want to predict. For instance, the projects table of the KDD Cup 2014 dataset fits this role, because the goal is to estimate the probability that a project is attractive to businesses. Let us call this the primary table and denote it by $p$.

Having just $p$ is sufficient to train a baseline machine learning model for prediction, assuming the target labels for the data are available. We call this the $p$-baseline from here onwards. Other tables having relations (e.g. one-to-many, one-to-one) with $p$ typically hold additional useful information. Let us refer to these tables as $p$'s direct neighbors. There are two cases:

1. $p$ can be joined with its neighboring tables without information loss. Generally, one-to-one relations, and many-to-one relations where $p$ takes the "many" end of the connection, fall into this category.

2. Joining $p$ with its neighbors leads to either information loss or the samples of interest in $p$ being repeated a number of times. One-to-many relations, where $p$ takes the "one" end of the link, and many-to-many relations belong to this case.

In the first case, we can join $p$ with its neighbors to have all relevant data available for training. In contrast, with a one-to-many or many-to-many relation, directly joining $p$ and one of its neighbors causes the resulting table to have more than one entry per object of interest, each with only one dependent sample from the neighbor attached to it.

Figure 5-1 shows an example of a one-to-many relation between two tables, Customers and Transactions. If the primary table of interest $p$ is Transactions, joining it with the Customers table gives us all the information we need for classification; this is an example of the first case described above. On the other hand, if $p$ is Customers, which controls the "one" end of the connection, directly joining the two tables is not beneficial, as the resulting table will contain much duplicated customer information.

[Figure 5-1: a diagram of two tables, Customers (customerID, age, nationality, gender) and Transactions (customerID, productID, timestamp, action), linked by a 1-to-$n$ relation.]

Figure 5-1: An example of a one-to-many relation that can be found in many real-world datasets. Here, each customer recorded in the Customers table can have multiple transactions, each saved in the Transactions table.

The authors of DSM [26] propose to compute only aggregation statistics of the neighboring tables, through appropriate functions such as SUM, MEAN and COUNT, and to propagate the results back to the primary table. However, there are a number of disadvantages to this approach:

1. The aggregate functions are manually defined and kept fixed, which means that once the data reaches a certain level of complexity, this approach will deteriorate.

2. Fixed aggregates can miss highly non-linear correlations within the data.

We propose a simple approach similar to left and right outer joins, but instead of exploding the sample space by duplicating the samples in the primary table, each entry appears only once in the resulting table, with a number of its neighboring entries concatenated to it. In general, more recent samples from the neighbors are preferred, because they have more impact on future events.


Figure 5-2: Example of the proposed data joining scheme. Red bars represent samples in the primary table 𝑝, gray bars represent dependent entries concatenated to the primaries in 𝑝, and white bars represent zero padding.

Figure 5-2 illustrates the proposed idea by showing an example of 4 data points in the table resulting from joining 𝑝 with one of its neighbors. Information from the neighboring table, represented as gray bars, is concatenated to the red entries in 𝑝. However, since not all entries in 𝑝 have the same number of dependents in the neighbor, zeros are padded at the end where needed. The relational structure of the data, mentioned in Section 4.1 as Challenge 4, is handled effectively through the proposed approach.

projectid   projects_att   item_number   vendor   price   item_number   vendor   price
316e...     ...            97805...      7        5.1     00644...      27       4.2
90de...     ...            97803...      7        5.83    0             0        0
bb18...     ...            0             0        0       0             0        0

Figure 5-3: An example of the proposed data join approach on three sample projects from Figure 1-2 with their respective resources from the KDD Cup 2014 dataset. Some columns in the resources table are omitted for readability. Resource attributes are filled with zeros when their corresponding project no longer has any additional resource. The projects_att column stands for all the columns available in the projects table.

Figure 5-3 shows an example result of the proposed join approach on the shortened projects and resources tables, represented in Figure 1-2, of the KDD Cup 2014 dataset. Our hypothesis is that by recursively adding more information to 𝑝, prediction performance can be significantly improved.
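To make the scheme concrete, the following is a minimal pandas sketch of the padding join. The timestamp column, the max_children cut-off and the helper name are hypothetical, and the neighbor's attributes are assumed to be numeric (e.g. already encoded), so that zero padding is meaningful.

```python
import pandas as pd

def join_with_padding(primary, neighbor, key, max_children=2):
    """Join `primary` with a one-to-many `neighbor`: concatenate up to
    `max_children` dependent rows per primary entry, zero-padding the rest."""
    # Keep only the most recent dependents per key; newer entries are
    # assumed to carry more signal about future events.
    recent = neighbor.sort_values("timestamp", ascending=False) \
                     .groupby(key).head(max_children).copy()
    recent = recent.drop(columns="timestamp")

    # Rank dependents within each primary entry: 0, 1, ..., max_children - 1.
    recent["rank"] = recent.groupby(key).cumcount()

    # Pivot so each dependent occupies its own block of columns.
    wide = recent.pivot(index=key, columns="rank")
    wide.columns = [f"{col}_{r}" for col, r in wide.columns]
    wide = wide.reset_index()

    # Left join back onto the primary table and pad missing blocks with zeros.
    return primary.merge(wide, on=key, how="left").fillna(0)

# Hypothetical usage on the KDD Cup 2014 tables:
# joined = join_with_padding(projects, resources, key="projectid")
```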


5.2 RQ2: What is the effect of preprocessing on the feature synthesis process?

Before any further computation is performed, the data goes through a number of preprocessing steps to produce high-quality input vectors for the neural networks. This section discusses a number of preprocessing techniques for further evaluation. This part of the pipeline explores the impact of preprocessing, the second stage of the data science work-flow depicted in Figure 1-1 of Section 1.1, and answers the second research question.

5.2.1 Missing values

In real-world cases, datasets often contain missing values in one or more of their tables. Cleaning these is one of the first steps, as empty values are potentially harmful for classification performance if not properly treated. Four approaches are evaluated for this phase (a code sketch of all four follows the list below):

– Entries removal: If an entry in any of the tables contains missing fields, it is removed from the dataset before any further computation is done. This is the simplest approach, but it leads to information loss. For setting up the initial baseline, this method serves well due to its ease of implementation.

– Mean imputation: Each missing value is estimated by taking the average of the attribute column that the value belongs to. For example, when the number of affected students for a project is missing, it is filled in with the average number of students affected across all projects. This approach does not work well with categorical variables.

– Grouped mean imputation: Similar to mean imputation, but estimation is based on the attribute average grouped by ID fields, rather than on the whole column. In the previous example, the missing number of affected students is equal to the average number of affected children that the posting teacher had in the past.

– Last value carried forward: The last value available in the same column is copied over to the missing field. Grouping is also performed so as to only produce values similar to what the same teacher already had.
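Below is a minimal pandas sketch of the four strategies. The column names students_reached, teacher_id and date_posted are hypothetical placeholders; each block illustrates one alternative, and they are not meant to be applied in sequence.

```python
import pandas as pd

# `df` is a hypothetical DataFrame with a numeric column "students_reached",
# a grouping key "teacher_id" and a chronological column "date_posted".

# 1. Entries removal: drop any row containing a missing field.
cleaned = df.dropna()

# 2. Mean imputation: fill with the column-wide average.
filled = df["students_reached"].fillna(df["students_reached"].mean())

# 3. Grouped mean imputation: fill with the average within the same teacher.
filled = df.groupby("teacher_id")["students_reached"] \
           .transform(lambda s: s.fillna(s.mean()))

# 4. Last value carried forward: copy the previous value within the group,
#    after sorting chronologically so that "last" is well defined.
ordered = df.sort_values("date_posted")
filled = ordered.groupby("teacher_id")["students_reached"].ffill()
```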

5.2.2 Data representation

Next in the preprocessing step is to decide how to represent some special types of attributes. This section primarily focuses on the handling of categorical attributes, since they comprise a large portion of all three datasets. Another notable mention is free text data, which is only available in the KDD Cup 2014 dataset. Its features are represented as term frequency-inverse document frequency (tf-idf) vectors [32].
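As an illustration, such a tf-idf representation can be produced with scikit-learn; the essay strings and the max_features cap below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical free-text fields, e.g. project essays.
essays = [
    "My students need art supplies for our painting class.",
    "Help our classroom buy new books for reading time.",
]

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
text_features = vectorizer.fit_transform(essays)  # sparse matrix, one row per essay
```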

One Hot Encoding

One Hot Encoding (OHC) [19] is used to transform categorical variables. OHC works by creating a new attribute column for each distinct value that the original categorical variable can take.

School Metro   Focus Subject
Urban          Mathematics
Rural          Music
Urban          Nutrition
Rural          Mathematics

Table 5.1: Two example categorical attributes taken from the projects table of the KDD Cup 2014 dataset.

Table 5.1 shows an example of two categorical attributes before any form of encoding. In the raw data, each of these categories can be represented as a string as shown in this example, or as a single integer.

School Metro (Urban)   School Metro (Rural)   Focus Subject (Maths.)   Focus Subject (Music)   Focus Subject (Nutrition)   Focus Subject (Sports)
1                      0                      1                        0                       0                           0
0                      1                      0                        1                       0                           0
1                      0                      0                        0                       1                           0
0                      1                      1                        0                       0                           0

Table 5.2: The two categorical attributes School Metro and Focus Subject after OHC.

After performing OHC, the resulting attributes of the example in Table 5.1 are shown in Table 5.2. For each record row, only the column corresponding to the original value of its categorical attribute is turned on (set to 1), and all other fields are zeros.

OHC has the advantage that the encoded features have the same importance from the perspective of a learning algorithm. For example, if we encode the Focus Subject values Mathematics, Music, Nutrition and Sports of the previous example as 0, 1, 2 and 3 respectively, then a classifier might perceive Mathematics and Music to be closer than Music and Sports. This effect is undesirable in general, and OHC helps learning algorithms avoid this problem.

On the other hand, OHC also introduces a significant number of features, making the dimensionality of the data explode. For instance, the field school_zip of the projects table in the KDD Cup 2014 dataset has 16 619 distinct values. Using OHC in this case means 16 618 additional columns are introduced.
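A minimal pandas sketch of OHC on the attributes of Table 5.1 is shown below; pd.get_dummies is one common way to perform the encoding, and the lower-cased column names are illustrative.

```python
import pandas as pd

# The example attributes from Table 5.1.
df = pd.DataFrame({
    "school_metro": ["Urban", "Rural", "Urban", "Rural"],
    "focus_subject": ["Mathematics", "Music", "Nutrition", "Mathematics"],
})

# One new binary column per distinct value of each encoded attribute.
encoded = pd.get_dummies(df, columns=["school_metro", "focus_subject"])
# Columns: school_metro_Rural, school_metro_Urban,
# focus_subject_Mathematics, focus_subject_Music, focus_subject_Nutrition
```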

Replacement with mean outcome

A major disadvantage of OHC is that the size of the resulting feature space depends on the number of distinct values a categorical attribute can take. Another approach to transforming these attributes is Replacement with Mean Outcome (RMO). It works by producing a mapping from each categorical attribute value to a real number: the expected ratio of positive samples given that value.

If we consider a training sample $\mathbf{x}_n$, its categorical attribute $x_{nk}$ is transformed as follows:

$$x_{nk}^{(new)} = p(y_n = 1 \mid x_{nk} = c_k) \qquad (5.1)$$

where $n = 1, \dots, N$ indexes the training samples (table rows), $k = 1, \dots, K$ indexes the individual attributes (table columns), $y_n$ is the label of sample $\mathbf{x}_n$, and $c_k$ is the value of $x_{nk}$ before the transformation.

Original     Transformed   Label
Amsterdam    1             1
Rotterdam    0.667         0
Utrecht      0             0
Rotterdam    0.667         1
Amsterdam    1             1
Rotterdam    0.667         1

Table 5.3: An example of transforming a categorical location attribute using RMO. The first column, Original, contains the raw data; the Label column holds the label for each sample; and the Transformed column contains the attribute after transformation.

Table 5.3 shows an example of RMO. Each value of the categorical attribute is replaced with the expected outcome of the labels. For instance, since the only sample having Utrecht as its location has 0 as its label, its new value is 0. For Rotterdam, two-thirds of its samples have a positive label, so the value for the mapping is 0.667 (2/3). All Amsterdam samples have a positive label, and consequently receive 1 as the target value for their transformation.
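Below is a minimal pandas sketch of RMO reproducing Table 5.3; the helper name is hypothetical. Note that, since the mapping is built from the labels, it should be estimated on training data only, to avoid leaking label information into validation or test sets.

```python
import pandas as pd

def rmo_encode(train, column, label):
    """Map each category to the mean of the binary label observed for it
    in the training set, i.e. an estimate of Equation 5.1."""
    mapping = train.groupby(column)[label].mean()
    return train[column].map(mapping)

# The example from Table 5.3.
train = pd.DataFrame({
    "location": ["Amsterdam", "Rotterdam", "Utrecht",
                 "Rotterdam", "Amsterdam", "Rotterdam"],
    "label": [1, 0, 0, 1, 1, 1],
})
train["location_rmo"] = rmo_encode(train, "location", "label")
# Amsterdam -> 1.0, Rotterdam -> 0.667, Utrecht -> 0.0
```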


5.3 RQ3: How effective is it to train a neural network as a feature extractor on raw data with minimal engineering effort?

We propose to use neural networks as a means of synthesizing high-level features directly from raw data, effectively removing the manual effort needed for feature engineering. As mentioned in Section 1.2, this is one of the most time-consuming and important parts of the data science process. This part of the proposed feature synthesis method aims to answer RQ3 and tackle the challenges involved.

Figure 5-4: The process of determining the optimal settings for the neural networks: (a) a one-layer neural network is trained multiple times using 5-fold cross validation. The blue nodes with question marks are those whose optimal number we want to determine; (b) the previously determined optimal number of hidden units for the one-layer network, together with its trained weights, is used for the first layer of a two-layer network. Optimal settings for the second layer are then identified in a similar manner with 5-fold cross validation.

Neural networks in general can have many parameters to optimize, and can scale very well to big datasets [30]. As a result, using a neural network with a carefully chosen depth and hyperparameters means Challenge 2 and Challenge 3 mentioned in Section 4.1 can be overcome. In this section, we describe the process by which our neural networks are trained and their hyperparameters optimized. A layer-wise optimization scheme is used to select the optimal number of hidden units for each layer; layer-wise training of deep networks has been shown to be very successful as a means of initializing the weights more efficiently [5]. In our experiments, other hyperparameters of interest, such as the learning rate 𝛼, the decay parameter 𝜂 described in Section 2.3.4, and the regularization terms, are lightly tuned by performing grid searches over a few potentially suitable values.

Neural networks with up to three hidden layers are trained on each of the three datasets. The number of hidden nodes in each layer is chosen using the following scheme:

1. We use 5-fold cross validation to select the best number of hidden units for a one-layer network. This process is illustrated in Figure 5-4a.

2. Once the best number of hidden units and their corresponding weights have been determined from the previous step, a two-layer neural network is initialized, where the first layer uses these optimal settings. Figure 5-4b shows the second phase of parameter search.

3. The two steps above are repeated once more for the third neural network layer.

Figure 5-4 demonstrates the process of determining the network settings most suitable for each of the classification tasks. By tuning the number of hidden units one layer at a time, computational cost is greatly reduced, while a model with an acceptable level of performance can still be achieved; a sketch of this greedy search follows.
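The following is a minimal scikit-learn sketch of the greedy layer-wise size selection, assuming preprocessed arrays X and y and hypothetical candidate sizes, with ROC AUC as the example scoring metric. Unlike our actual setup, which reuses the trained weights of the already-fixed layers, this sketch re-trains the whole network for each candidate; it only illustrates the search procedure.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Greedy layer-wise selection of hidden layer sizes, one layer at a time.
# X, y: preprocessed feature matrix and binary labels (assumed available).
candidates = [64, 128, 256, 512]  # hypothetical candidate layer sizes
layers = []
for depth in range(3):  # networks with up to three hidden layers
    best_size, best_auc = None, -1.0
    for size in candidates:
        clf = MLPClassifier(hidden_layer_sizes=tuple(layers + [size]),
                            max_iter=50, random_state=0)
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        if auc > best_auc:
            best_size, best_auc = size, auc
    layers.append(best_size)  # freeze this layer's size before going deeper

print(layers)  # e.g. [256, 128, 64]
```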


Chapter 6

Results and Analysis

6.1 Overview

In this chapter, we go through and evaluate the proposed pipeline in more depth. The general work-flow is still the same as that of conventional data science, depicted in Figure 1-1. However, feature engineering is automated through the use of deep neural networks. A number of aspects of other stages are also discussed.

6.2 Relational data handling

RQ1: How can the raw data be processed into a format readable by neural networks?

Two out of the three datasets used for evaluation, KDD Cup 2014 and IJCAI 2015, have relational structures, and effectively handling the relations within the original raw data is the first step towards a robust classification pipeline. It should be noted that even though neural networks are mentioned in this question, the problem applies to almost all machine learning methods that require a fixed-length input vector.

In the case of the IJCAI 2015 data, the primary table User-Merchant action log holds the “many” end of the one-to-many relation. Therefore, by joining it with the Basic user profile table, no information loss is incurred, and no further processing is needed for this step. The Pegasystems dataset is available in a single table; hence, it can be used directly in the next step without performing any join or concatenation.

The KDD Cup 2014 dataset, on the other hand, requires special treatment. The reason is that the primary table projects has a one-to-many relationship with the resources table.
