Machine Learning Techniques to Predict Demand in Retail - ANN and MARS

(1)

Machine Learning Techniques to Predict Demand in

Retail – ANN and MARS

KATYA ALA-VEDRA STEPAN

11362901

MASTER INFORMATION STUDIES

BUSINESS INFORMATION SYSTEMS

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

June 16, 2017

1

st

_Supervisor

₂

nd

_Supervisor

(2)

Abstract. Online markets have made it difficult for retailers to deal efficiently with

the management of inventory as demand is uncertain, especially in industries were weather conditions need to be taken into consideration. Therefore, a comparison between two widely used machine learning techniques, Artificial Neural Networks and Multivariate Adaptive Regression Splines, has been made in this study to seek for the optimal method to make demand predictions. The paper shows how little training time and information is needed to make an accurate forecast in the case study presented.

Keywords. Predictive Models, Machine Learning, Neural Networks, Multivariate

Adaptive Regression Splines

Introduction

In the eyes of the consumer, online markets can be thought of as having a big supply of products, which in reality are bound to physical stores and production lines. Though, this perception of big supply has given the advantage to those consumers to easily look around within the web, to find exactly what they are looking for, in the right color, at an adequate shipping rate and above all, at the right price. This liberty that the customer has at its fingertips makes it somewhat impossible for certain retailers to keep up due to rapid changes in trends within the market. Therefore, organizations are forced to constantly look for best practices in order to optimize their operations and remain competitive (Šustrová, 2016).

To alleviate this issue, companies must deal with the challenging task of managing inventory in an efficient matter (Sachs, 2015). To a large extent, inventory can represent one of the most crucial functions within manufacturers and retailers since it tends to have great impact on their performance as a whole (Šustrová, 2016) Hence, the optimal level of stock needs to be determined. Though, one of the main dilemmas concerning inventory management relates to stochastic demand (Raa & Aghezzaf, 2005). Without any knowledge or indication of what is going to be bought, it is difficult for merchants to plan ahead and replenish stock in an efficient, accurate and timely matter, which understandably calls for a need of higher inventory levels and therefore generates more costs. Thus, it is essential to find the right predictive method to achieve efficiency (Dhini, Surjandari, Riefqi, & Puspasari, 2015).

A popular method such as Linear Regression has been recommended and used for decades in the industry to make numeric forecasts (Witten, Frank, & Hall, 2011). Unfortunately, one of the core assumptions of this method requires linearity between dependent and independent variables (Francis, 2003), which is often not the case in reality. On the other hand, machine learning techniques have allowed organizations to solve some of the demand uncertainty issues (Bertsimas, Kallus, & Hussain, 2016). Specifically, Artificial Neural Networks (ANNs) or simply Neural Networks (NNs), have gained popularity throughout the years due to their ability to solve complex problems where there is a high degree of uncertainty (Šustrová, 2016). Although providing fast and accurate results in some cases (Radzi, Haron, & Jahori, 2006), they tend to be difficult to interpret (Tso & Yau, 2007) and hard to setup (Forte, 2015) Most importantly, they can be very computational and time expensive compared to more traditional models (Han, Pool, Tran, & Dally, 2015).

(3)

It has been argued that the method of choice to make predictions depends entirely on the data at hand. A comparable method to ANNs is Multivariate Adaptive Regression Splines (MARS). In Abraham & Steinberg (2001) these two methods were compared and results showed that MARS outperformed the NN model, whereas in Francis (2003), though while using a small dataset, the opposite was the case. Hence, as much information as possible should be gathered in order to find an appropriate method as based on this, the method might differ. Bertsimas et al. (2016) highlights the use of information generated through a company´s systems, as it can be remarkably useful for retail planning and thus useful for predictions. Furthermore, external factors which are not always considered such as weather conditions, could also be quite beneficial to decide on order quantities (Sachs, 2015).

1. Problem Statement

As mentioned earlier, in order for a company to reduce inventory costs and achieve efficient planning of their operations, it is essential to find an accurate predictive model to forecast demand. Many traditional methods make a linearity assumption between dependent and independent variables, which is usually not present in real-life data. Artificial Neural Networks on the other hand have proven to be an excellent alternative when nonlinear data is present (Forte, 2015). A similar method which is able to deal with nonlinearities is Multiple Adaptive Regression Splines (Francis, 2003). Since the accuracy of methods depend on the underlying somewhat complex relationships found in the data and specially since ANNs tend to be difficult to setup, a comparison has to be made to find the most appropriate alternative. Additionally, certain industries should consider external factors such as weather conditions since this could help explain some of the variation in demand (Sachs, 2015).

1.1. Research Question

The problem described above lead to the following research question and sub-questions: 1) Is an Artificial Neural Network performing significantly better compared to

Multivariate Adaptive Regression Splines when predicting seasonal demand in retail?

a) How is the accuracy of an Artificial Neural Network variating when changing

configurations of hidden nodes?

b) Which variables are performing best for each respective method? c) Is the weather a good predictor of future sales given the industry?

2. Literature Review

2.1. Predictive Modeling

Researchers and practitioners of predictive modeling can agree that the boundaries between the fields of statistics, machine learning, business analytics and other related field are to some extent unclear. Though, there is a clear overlap between concepts, techniques, algorithms used and ultimately they share the same goal, which is to build

(4)

models in order to predict a desired outcome by using data (Forte, 2015). These fields also make a clear distinction between two approaches known as supervised and

unsupervised learning (Cios, Pedrycz, Swiniarski, & Kurgan, 2010). Supervised learning

refers to the learning of outputs based on corresponding inputs, also known as responses or dependent variables and predictors or independent variables respectively. After the learning process is completed, unseen data is then used to predict an outcome which is expected to be nearly the same as the actual outputs (Hastie, Tibshirani, & Friedman, 2009). Unsupervised learning on the other hand, does not have the corresponding outputs at hand. Depending on the industry, data collection of the latter approach is considered to be easier and less expensive (Cios et al., 2010). Common methods in an unsupervised environment involve principal component analysis (James, Witten, Hastie, & Tibshirani, 2000), association rules analysis and cluster analysis (Hastie et al., 2009), while in supervised learning, there is a wide range of known methods such as Bayesian methods, Regression (Cios et al., 2010), Support Vector Machines, Artificial Neural Networks, among many others (Hastie et al., 2009).

Data outputs in this study are known, making the issue at hand a supervised one. Furthermore, as there is no previous knowledge on the actual relationships of variables and also as the data presents nonlinearities, two supervised methods have been narrowed down as they are able to handle these characteristics (Francis, 2003). First, Artificial Neural Networks will be the main focus of this study as they have been known for their predictive power (Francis, 2003; Radzi et al., 2006), Secondly, Multivariate Adaptive Regression Splines as it can handle some of the limitations presented by ANNs and thus considered to be a fair competitor (Francis, 2003).

2.2. Artificial Neural Networks

Although Artificial Neural Networks have been around since the 1940s, though they were not as popular at first (Forte, 2015). An illustration of this was the development of the first practical network, called the Perceptron Algorithm, which showed its potential to recognize patterns but unfortunately, it was only able to solve a minimum group of problems (Esmaeili, Osanloo, Rashidinejad, Aghajani Bazzazi, & Taji, 2012). Later in time nevertheless, they have been found extremely useful to do forecasts, especially when there was no knowledge on the relationship between variables (Tso & Yau, 2007). A variety of successful cases can be found in different domains such as in the financial industry (Kaastra & Boyd, 1996), retail industry (Šustrová, 2016), healthcare industry (Mazurowski et al., 2008), among others.

Compared to traditional linear and logistic regression, ANNs are a nonlinear approach, that can solve classification and regression problems, regardless of their complexity (Forte, 2015). Moreover, in order to accomplish a task, ANNs can mimic the structure of the nervous system by passing on signals through their nodes (“the neurons”), which remarkably simulate the learning process in the human brain (Esmaeili et al., 2012). A generic example of an ANN with a simple architecture is represented by Figure 1 (A), which would be known as the perceptron as only one input and output layer are involved (Almeida et al., 2010). As described by (Forte, 2015), a NN takes simultaneous inputs (Xn) and assigns them weights (Wn). The result is later represented as a linear weighted sum (Σ), which is then checked by the activation function (Act. f(x)) to verify if the output reached the threshold or bias (b1). If the minimum required was reached, an output is generated (Y). In a bigger NN as shown in Figure 2 (B), the output

(5)

of the activation function is then considered input by the next neuron in line, represented as a hidden neuron (Hn), and the process repeats until the last layer is reached.

Figure 1. Generic Example ANN.

Other popular configurations to increase the degree of complexity where hidden layers are included, are known as multilayer networks. Depending on the flow of information, these networks can be further classified as feed-forward or recurrent networks, where in the first case, as the name describes, the information passes from the input, through the hidden layer until it reaches the output in a forward fashion, while in the latter, information from the output layer can be used as input in the previous layer via feedback connections (Almeida et al., 2010).

Depending on the data problem at hand to be solved, certain parameters can be taken into consideration to make the ANN more suitable. For example, the activation function to check for the threshold as mentioned earlier, can have a hyperbolic tangent or logistic function (Forte, 2015). The logistic, also known as sigmoid, is usually the recommended activation function among studies (Hastie et al., 2009; Tobergte & Curtis, 2013). Additionally, this activation function can be either applied or not to the output layer. If applied, the actual outcome from the network will not be linear. Thus, as this study deals with a regression problem, it is recommended to have a linear output (Forte, 2015).

Furthermore, another important parameter to consider is the training algorithm that is used to update the weights of a NN. Typically, an ANN takes an untrained set to then compute an outcome based on the corresponding independent variables; afterwards, the error between the actual and predicted output is determined which will help adjust the initial weights to minimize the errors (Almeida et al., 2010). One traditional approach to do this is based on gradient descent (Hastie et al., 2009). To determine this error, the sum-of squared errors or cross-entropy can be used, being the first recommended for regression type problems and the second for classification problems (Hastie et al., 2009). Though, until the first efficient training algorithm was developed in the late 1980s, known as the backpropagation (BP), ANNs were rarely used in real-life problems (Tobergte & Curtis, 2013). Even though it is considered to be very popular and has been extensively used (Almeida et al., 2010), it can be very slow and thus not considered to be the first choice (Hastie et al., 2009). Further disadvantages besides speed, include the tendency to overfit the training dataset (Tobergte & Curtis, 2013). An improvement of this method known as the resilient backpropagation, solves some of the drawbacks of BP. It uses only the sign of the partial derivatives to update the weights instead of the full weights generated by the gradient descent (Günther & Fritsch, 2010). As this is considered to be one of the fastest learning algorithms (Almeida et al., 2010), it will be used for the training of the ANNs in this study.

(6)

A clear shortcoming of NNs compared to a more traditional method such as regression is that the individual weights do not provide any useful information on their own, as opposed to the coefficients found in regression (Tso & Yau, 2007) , which are easy to interpret. By being black boxes, it tends to be difficult for users to understand its results fully (Francis, 2003). Another disadvantage involves the effort it takes to train an ANN in relation to its size; which means that the bigger the NN, the more computer power is needed (Forte, 2015). Though there are no clear guidelines regarding what is the appropriate number of hidden layers and neurons to choose. Based on experimentation and previous knowledge, the range of hidden layers could be set between 5 to 100, increasing as more input variables are considered (Hastie et al., 2009). Higher complexity, meaning more hidden layers, could help the NN to better learn the training data but unfortunately this could come at the later risk of poor generalization. This is an issue known as overfitting, which means that when unseen data is introduced to the network, results might not be too accurate as the model has learned the training set too well (Tobergte & Curtis, 2013).

2.3. Multivariate Adaptive Regression Splines

An approach comparable to polynomial regression is known as Multivariate Adaptive Regression Splines (Francis, 2003), which is designed to deal with nonlinear regression problems and was first proposed by Friedman (1991) in the search of making regression modeling more flexible, especially when high dimensional data was involved. MARS in its most basic form, splits the data into ranges, where in each range it is possible to fit a linear regression with different slopes (Francis, 2003). The idea is to break the data into as many simple ranges as needed in order to recreate any type of shape, which is usually found in complex problems (Abraham & Steinberg, 2001). An example of this from Lu, Lee, & Lian (2012, p. 585) can be seen in Figure 2 (A) where k1 and k2 represent the splits. More specifically, a MARS model first creates a pool of candidate functions based on the independent values, known as basis functions, that can be either reflected pairs as can be seen in Figure 2 (B) taken from an example in Hastie et. al (2009, p. 322), or as the product of those. The data points where a range ends and another one starts is known as a knot (MARS Chronicles).

As cited by Lu et al. (2012), the process starts with the constant function and then basis functions that minimize the sum of squares residual error will be added until either the error is too small to carry on with the process or until reaching a basis function threshold (Friedman, 1991; Lu et al., 2012). At first, this process will tend to overfit the data, which is why the next phase in MARS is to delete functions that contribute the least to the model. This is usually done by using generalized cross-validation (Hastie et. al, 2009).

(7)

Figure 2. Generic Example MARS and Basis Functions.

As MARS is based on regression, the method does not have one of the main limitations of Neural Networks, which is its black box nature; in general, making it more appealing for management to understand its results (Francis, 2003). Moreover, through the basis functions, MARS has the ability to classify and select input variables that have the highest impact on the output variable (Lu et. al, 2012). Unfortunately, this is not a characteristic of ANNs as variables need to be selected beforehand based on domain knowledge, or by implementing an additional selection method specially when high dimensional data is involved (Fernando, Maier, Dandy, & May, 2005). Selecting too many variables or too few can jeopardize the predictive accuracy and generalization of a NN and thus making it a crucial task when building the model (Lu et. al, 2012).

3. Case Study

To answer the research question and sub-questions, a case study in the Netherlands has been selected. The company is an online retailer based in the Amsterdam, which is dealing with rapid expansion and limited stock capacity and therefore is in need of a predictive model to deal efficiently with the quantities ordered. Based on the requirements of this organization, online and offline sales data from their top selling category have been collected to make a weekly demand forecast. Additionally, since in their industry it is common to have fluctuating demand due to weather and seasonality dependencies, this has been also considered.

3.1. Data description

Variables for this study were mainly chosen based on company knowledge of the industry. Past weekly sales in terms of quantities were gathered from the company´s system starting week 1 until week 29 in 2016, the starting point and the highest peak of the year respectively, and week 1 until week 20 in 2017. This involved the selection of online and offline sales in the Netherlands for products belonging to the most relevant category as mentioned earlier. This variable can be now referred to as Quantity. As the company is opening stores relatively quickly throughout the country, a store count per week was used to create three new variables. Besides, information about the total number of weekly online visits was also included in the dataset as this was found to have a high correlation with the quantities ordered. As weather conditions are sought to be important for the industry, a number of variables were also collected from the Koninklijk

(8)

Nederlands Meteorologisch Instituut (KNMI, n.d.), specifically from De Bilt weather station as this is a relatively centered location in the country. These variables included maximum temperature, minimum temperature, hours of rain and hours of sun, which were averaged on a weekly basis to match the sales records. Lastly, interaction terms where created using all weather related variables with the store count information, as some of those variables together, also had a high correlation with the weekly sales data. A more detailed description of the variables can be found in Appendix A.

4. Methodology

Once the information was gathered, the data was prepared and tested using different ANNs configurations and later compared to the results from MARS. To develop the predictive models, prepare and analyze the data; Microsoft Excel 2016 and RStudio version 1.0.136 were used on a personal computer with Intel® Core™ i5-6200U CPU 2.30GHz 2.40 GHz, 64 bits Windows 10 and 8.0 GB installed memory (RAM). In order to select the best models and to evaluate their performance, several standard accuracy measurements were computed. First, two absolute measures, the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) (Zhang & Qi, 2005) for model selection and evaluation. Finally, the Mean Squared Error (MSE), the coefficient of determination R2 (Šustrová, 2016), and Pearson Correlation Coefficient have also been added accordingly as extra evaluation criterion.

For model selection more specifically, a typical approach is to split a portion of the data for testing. However, when the dataset is too small, this might come at the risk of losing valuable information for training (Francis, 2003). An alternative to this is to use cross-validation instead (e.g. using k-fold), which is one of the most utilized methods to evaluate models (Hastie et al., 2009). Though since the data at hand involved time series, this would mean that future data would eventually be used to predict the past (Bergmeir, Hyndman, & Koo, 2015). Also, when using k-folds, random splits are suggested (Witten et al., 2011). Therefore, as recommended by (Bergmeir et al., 2015) another method called time series cross-validation was used instead only for the training of the Neural Network models, as the training was programmed manually and thus was possible to control. To start, the data from 2016 was previously sorted from oldest to newest. The method involved the creation of several sets, in the first one, a split of training and testing was done using the 70/30 rule. In the second set, one more observation was added to the training set which was taken from the testing set. This process was repeated until the testing set only contained the last observation available. On the other hand, the package used to train the MARS model, which will be elaborated in the corresponding section, already included a cross-validation option, which will also be tested.

4.1.1. Parameters and Variable Selection

There are several packages designed to train ANNs in RStudio such as neuralnet (Fritsch & Guenther, 2016), nnet (Venables & Ripley, 2002), and RSNNS (Bergmeir & Benitez, 2012). Though, based on its straight-forward implementation (Tobergte & Curtis, 2013), the neuralnet package, created by Stefan Fritsch and Frauke Guenther has been chosen to create all ANNs models. In this package, it is necessary to further specify certain

(9)

parameters. As argued in the literature review, the resilient backpropagation with weight backtracking (rprop+) has been selected to train the NNsand the activation function has been set to sigmoid. Also, as this is a regression type problem, the linear output has been set to true and the error function has been chosen to be the sum-of-squared errors. Furthermore, the number of repetitions to train each dataset, has been set to 10, as more did not improve the networks´ performance. Finally, as there is no recommended architecture for the NNs, based on trial and error, 3 hidden layers have been selected to create different models. For each hidden layer, a loop was developed to create models with nodes ranging from 1 to 10 and thus, creating 1000 models per run.

In order to categorize the variables based on their importance, a similar approach as done by Esmaeili et. al (2012) was conducted. Simple regression was performed where each input variable was tested against Quantity independently. The scatterplots for each regression can be found in Appendix B, including its respective R2 and trend line. As opposed to Esmaeli et. al (2012), variables were not discarded but rather sorted from the highest R2 to the lowest. A loop was programmed so that each time a new dataset was created while deleting the next least important variable available, based on its R2. This approach is comparable to a sensitivity test as mentioned by Francis (2003), where less important variables are dropped one at a time to see if and how each variable degrades the results. Every new dataset was used to create a training and validation set considering the different architectures as mentioned above. The deletion process to select the number of variables was conducted until only variables with an R2 of at least 0.40 were left. These variables were: WebTraffic, IntStoreTemp0, StoreCount0, IntStoreSun0, IntStoreTemp1, AvgNumStores, IntStoreSun1 and IntStoreMin0 as can be seen in Appendix B.

4.1.2. Data Transformation

As the ranges from each variable in the dataset were relatively different and to achieve better results within a Neural Network, all variables had to be normalized prior training to have the same scale (Tobergte & Curtis, 2013). To some extent, the technique guarantees that all independent variables are treated with the same importance (Hastie et. al, 2009). This was done using the minimum – maximum normalization (0 – 1), as recommended by Šustrová (2016). After the training of each model was conducted, results were scaled back so they can be compared to the actual output.

4.2.1. Parameters and Variable Selection

As with ANNs, different packages have been developed for RStudio to build MARS models such as mda (Hastie, 2016) and earth (Milborrow, 2017a). Due to its more recent launching date and since it’s based on the mda package, earth has been selected to create the models. For this package, it is also required to specify certain parameters. For the pruning pass, to delete variables that are not relevant to the model based on the lowest residual sum-of-squares (Milborrow, 2017a), the backward method is tested as it is the standard developed for MARS and degree has been set to 2, to allow interaction terms to be created.Furthermore, a comparison is made by using cross-validation as a selection method instead with the number of folds set to 5 as it is one of the recommendations in Milborrow (2017b).

(10)

5. Results and Discussion

5.1.1. Model Selection

Due to the extensive results, 1000 models per run, only the top 30 best results per dataset based on their RMSE have been included in Appendix C. Each individual table includes the variables that were involved in the training of each model and also, all computed accuracy measurements such as the MAE, RMSE, MSE, R2 and Pearson Correlation. The tables also give an indication of how the measurements change given different configuration of nodes, though there is no clear trend with regards to the higher number of nodes compared to a decrease in the errors, as it was expected. This could likely have to do with the inner-workings of the ANN models and therefore it is out of scope for this study. The importance of variable selection in the creation of an ANN, can be seen in Figure 3 (A), where by deleting less important variables, a smaller RMSE can be achieved. The number of variables is represented by the x-axis, while the RMSE is represented by the y-axis. Similarly, the Box-and-Whisker plot in Figure 3(B) shows how the maximum, minimum and median RMSE gradually decrease as variables are removed. A summary of the best results per dataset, can be found in Table 1, in which the optimal architecture was found to be 3 hidden layers containing 8-7-1 nodes respectively. With this model structure, it was possible to achieve a RMSE of 40.21 and an R2 of 88.63%. Further accuracy measurements can also be found in the table, including the variables that accomplished those results.

(A) (B)

Figure 3. RMSE – Best Results.

Variables with best results: IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0,

StoreCount0, IntStoreTemp0, WebTraffic

Architecture MAE RMSE MSE RSquared Pearson

7-5-3 47.7328 54.9066 3566.7954 0.7362 0.8486 6-9-6 37.9908 44.1932 2326.9081 0.8030 0.8908 5-7-7 44.5371 51.9428 3299.7424 0.6902 0.8081 10-3-8 42.5358 50.2075 2985.2530 0.8180 0.9017 30 35 40 45 50 55 60 1 2 3 4 5 6 7 8 0 100 200 300 400 1 2 3 4 5 6 7 8

(11)

6-5-4 36.2470 42.8492 2088.3134 0.8292 0.9087 4-6-2 33.7994 43.2119 2156.8137 0.8887 0.9415 9-8-2 34.2193 42.0242 2356.7606 0.8013 0.8883

8-7-1 34.3809 40.2125 1842.2496 0.8863 0.9406 Table 1. Summary of ANNs Results.

After identifying the best performing variables, it can be said that the approach used by Esmaeli et. al (2012), would have provided the best results for this dataset from the start. Though as one of the aims of this study was to analyze and show the variable selection significance, it was important to include it. Lastly, it can be concluded that including weather related variables had a positive impact in the prediction of sales given this industry as some of these variables were still included in the model that performed best.

5.1.2. Results

Based on the best training performance from Table 1, the trained network with a structure of 8-7-1 as can be seen in Figure 4, was used to make predictions with the test set. Figure 5 shows the predictions from the first 20 weeks of the year, represented by the x-axis, together with the respective actual quantities. This shows to some extent, how the ANN has captured the behavior from last year, to predict this year´s sales based on IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0 and WebTraffic. Furthermore, the predictions reached a MAE of 137.67, a RMSE of 192.49, MSE of 37048.36, a R2 of 75.18% and a Pearson Correlation of 86.71%.

(12)

Figure 5. ANN - Prediction vs Actuals

5.2.1. Model Selection

When running the MARS model using the backward method, from 22 predictors, 12 terms were created in total from which 4 of those were selected for the final model. Figure 6 shows how the Generalized RSquared (GRSq) reaches a peak and then decreases as more terms are included than the optimal. Similarly, it can be seen the number of predictors used to create those terms. In this model, the predictors used to create the terms were IntStoreSun0 and WebTraffic. The terms, respective cuts and coefficients can be found in Table 2. This model generated a GRsq of 85.92% and a R2 of 91.31%, which represents the overall performance of the training set.

Figure 6. RStudio Output for Model Selection – Backward Method

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(13)

Selected terms Coefficients

(Intercept) 236.1901 h(8.55714-IntStoreSun0) -17.4371 h(WebTraffic-2926.6) 0.3787 h(WebTraffic-3252.25) -0.2997

Table 2. MARS Selected Terms.

To contrast, the MARS model was created again but this time with cross-validation as a selection method. Out of 22 predictors, this model created 11 terms from which 3 were selected for the final model. This produced an average GRSq of 83.72%, a R2 of 89.02% and a Mean-Out-of-Fold of 67.28%, which is used to select the number of terms. An overview of the results from each fold can be found in Table 3, with the respective splits per fold. The terms, cuts and coefficients can be found in Table 4. Furthermore, similarly to Figure 6, Figure 7 displays the model selection using cross-validation. The right horizontal black dotted line shows the number of terms selected by GCV as seen earlier, while the horizontal left lighter dotted line shows the terms selected by cross-validation, representing the mean out-of-fold Rsq. This figure also includes light grey thinner lines, which represent the results per individual fold.

Fold CVRsq In-Fold % Out-of-Fold %

1 0.233 77% 21% 2 0.626 77% 21% 3 - 0. 074 77% 21% 4 0.673 77% 21% 5 0.829 83% 17%

Table 3. MARS Cross-Validation Results

Selected terms Coefficients

(Intercept) 301.9167 h(8.55714-IntStoreSun0) -24.4431 h(WebTraffic-2926.6) 0.1078

(14)

Figure 7. RStudio Output for Model Selection – Cross-Validation

5.2.2. Results

Based on the best model, where no cross-validation was involved, predictions have been made using the test set, which can be seen in Figure 8. Once again, having the x-axis represent the weeks of the year. The model at hand was able to achieve a MAE of 55.82, a RMSE of 71.07, a MSE of 5050.97, a R2 of 94.22% and a Pearson Correlation of 97.07% using IntStoreSun0 and WebTraffic as predictors to create the terms as described earlier.

Figure 8. MARS – Prediction vs Actuals

5.3. Comparison

As can be seen from the results and to answer the research question, the Artificial Neural Network with the best performance did not perform significantly better than the Multivariate Adaptive Regression Spline model as the latter method achieved better results in all accuracy measurements used. Figure 9 illustrates the best results from both methods in comparison with the actuals. A summary of the results from both methods, can be found in Table 5.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Actuals Predictions

(15)

Figure 9. ANN vs MARS

Accuracy Measurements ANN MARS

MAE 137.67 55.82

RMSE 192.49 71.07

MSE 37048.36 5050.97

R2 75.18% 94.22%

Pearson Correlation 86.71% 97.07%

Table 5. Results ANN vs MARS.

6. Conclusion

As it has been discussed throughout this research, finding the right method to predict sales in an industry where demand is unknown and seasonality should be considered, can be a challenge. Many methods have been proposed in literature, but ultimately this still depends on the industry, and more importantly on the data at hand. This study shows that the Multivariate Adaptive Regression Splines model, with almost no training time and little information such as weekly WebTraffic and an interaction variable created with the average number of stores and sun hours (IntStoreSun0), it was possible to create a highly accurate predictive model. Furthermore, it can also be noted that in this specific industry where weather conditions are extremely important, it is key to take them into consideration when creating a predictive model as this gives an indication of the seasonality.

Artificial Neural Networks have indeed proven in certain industries their ability to learn from data to predict and model the most complex relationships. Though, this might come at the cost of extensive knowledge of the underlying functionalities of the hidden nodes and layers of this method or simply by having extremely good data at hand, which

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(16)

is not the case in real life. On the other hand, with knowledge of simple methods such as regression, it was possible to understand fully the aim of MARS and how it achieved those results. Ultimately, making it easier for management to understand the results as mentioned earlier.

As a final note, it can be said that when nonlinearities are present and also there is no knowledge on the existing relationships of the data, no single method might exist that fits all datasets in the best way. Ideally, it can be recommended based on time and knowledge, to try out several methods that use different approaches to see which one is providing the most optimal results for that specific dataset.

6.1. Limitations and Further Research

Given the computer power used to create, train and test the models, it is clear that the way the Neural Networks were trained, with such a complexity, would take up a large amount of memory and time, especially given the number of models with different configurations as done in this study or even worse when high dimensionality would have been included as mentioned earlier in the literature review. Such a complexity was necessary in order to test the different variations of hidden nodes as there are no clear rules with regards to these. Due to time limitations, it was possible to test the node configurations with three hidden layers only. Therefore, it is recommended for future research when dealing with a similar case study, to see how the architecture can be further optimized to improve overall training time and more importantly, accuracy. Thus, the question regarding the appropriate architecture of a Neural Network given a specific dataset remains open.

Moreover, another limitation when building the Neural Network models was the creation of variables as in-depth variable engineering was out of the scope of this study. This also leads to the selection of variables, in which if more variables would have been created, high dimensionality could have been an issue as well. Therefore, as not all datasets might be suitable for Artificial Neural Networks as mentioned by Francis (2003), it can be recommended that depending on the dataset available, it might be useful to research thoroughly on techniques that focus more on variable engineering and thus the use of selection techniques such as Random Forest, Genetic Algorithm and many others might be needed in parallel.

Lastly, as this study started with the discussion of inventory management, it can be recommended for future research to include an analysis of the actual inventory cost reductions when implementing the selected MARS model. As mentioned earlier, an accurate predictive model as the one presented in this study could lead to a significant reduction of overall costs in a company and thus it could be used to implement an efficient strategy to replenish stock in a timely matter.

References

Abraham, A., & Steinberg, D. (2001). Is neural network a reliable forecaster on earth? A MARS query! Lecture

Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2085 LNCS(PART 2), 679–686. https://doi.org/10.1007/3-540-45723-2_82

Almeida, C., Baugh, C. M., Lacey, C. G., Frenk, C. S., Granato, G. L., Silva, L., & Bressan, A. (2010). Modelling the dusty universe - I. Introducing the artificial neural network and first applications to luminosity and colour distributions. Monthly Notices of the Royal Astronomical Society, 402(1), 544– 564. https://doi.org/10.1111/j.1365-2966.2009.15920.x

(17)

Bergmeir, C., & Benitez, J. M. (2012). Neural Networks in R Using the Stuttgart Neural Network Simulator: RSNSS. Journal of Statistical Software, (46(7)), 1–26. Retrieved from http://www.jstatsoft.org/v46/i07/

Bergmeir, C., Hyndman, R. J., & Koo, B. (2015). A Note on the Validity of Cross-Validation for Evaluating Time Series Prediction. Monash University, Working Paper 10/15, (April).

Bertsimas, D., Kallus, N., & Hussain, A. (2016). Inventory Management in the Era of Big Data. Production &

Operations Management, 25(12), 2006–2009.

Cios, K. J., Pedrycz, W., Swiniarski, R. W., & Kurgan, L. A. (2010). Data Mining: A Knowledge Discover

Approach. https://doi.org/http://dx.doi.org/10.1016/B978-0-08-044894-7.01318-X

Dhini, A., Surjandari, I., Riefqi, M., & Puspasari, M. A. (2015). Forecasting analysis of consumer goods demand using neural networks and ARIMA. International Journal of Technology, 6(5), 872–880. https://doi.org/10.14716/ijtech.v6i5.1882

Esmaeili, M., Osanloo, M., Rashidinejad, F., Aghajani Bazzazi, A., & Taji, M. (2012). Multiple regression, ANN and ANFIS models for prediction of backbreak in the open pit blasting. Engineering with

Computers. https://doi.org/10.1007/s00366-012-0298-2

Fernando, T. M. K. G., Maier, H. R., Dandy, G. C., & May, R. (2005). Efficient Selection of Inputs for Artificial Neural Network Models. Modsim 2005: International Congress on Modelling and Simulation:

Advances and Applications for Management and Decision Making: Advances and Applications for Management and Decision Making, 1806–1812.

Forte, R. M. (2015). Mastering predictive analytics with R. Packt Publishing Ltd.

Francis, L. (2003). Martian Chronicles: Is MARS Better than Neural Networks? Casualty Actuarial Society

Forum, 269–306. Retrieved from http://francisanalytics.com/Resources/CAS_Papers/Martian

Chronicals.pdf

Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67. https://doi.org/10.1214/09-AOAS284

Fritsch, S., & Guenther, F. (2016). neuralnet: Training of Neural Networks. R package version 1.33, https://CRAN.R-project.org/package=neuralnet.

Günther, F., & Fritsch, S. (2010). neuralnet: Training of Neural Networks. The R Journal, 2(1), 30–38. https://doi.org/10.1109/SP.2010.25

Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. Advances in Neural Information Processing Systems, 1135–1143.

Hastie, T. (2016). mda S original by Trevor Hastie & Robert Tibshirani. Original R port by Friedrich Leisch, Kurt Hornik and Brian D. Ripley. mda: Mixture and Flexible Discriminant Analysis. R package version 0.4.9. Retrieved from https://cran.r-project.org/package=mda

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Elements, 1, 337–387. https://doi.org/10.1007/b94608

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2000). An introduction to Statistical Learning. Current

medicinal chemistry (Vol. 7). https://doi.org/10.1007/978-1-4614-7138-7

Kaastra, I., & Boyd, M. (1996). Designing a neural network for forecasting financial and economic time series.

Neurocomputing, 10(3), 215–236. https://doi.org/10.1016/0925-2312(95)00039-9

KNMI. (n.d.). Daggegevens van het weer in Nederland. Retrieved May 30, 2017, from https://www.knmi.nl/nederland-nu/klimatologie/daggegevens

Lu, C. J., Lee, T. S., & Lian, C. M. (2012). Sales forecasting for computer wholesalers: A comparison of multivariate adaptive regression splines and artificial neural networks. Decision Support Systems,

54(1), 584–596. https://doi.org/10.1016/j.dss.2012.08.006

Mazurowski, M. A., Habas, P. A., Zurada, J. M., Lo, J. Y., Baker, J. A., & Tourassi, G. D. (2008). Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks, 21(2–3), 427–436. https://doi.org/10.1016/j.neunet.2007.12.031

Milborrow, S. (2017a). Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller´s Fortran utilities with Thomas Lumley´s leaps wrapper. earth: Multivariate Adaptive Regression Splines. R package version 4.5.0. Retrieved from https://cran.r-project.org/package=earth

Milborrow, S. (2017b). Notes on the earth package.

Raa, B., & Aghezzaf, E. H. (2005). A robust dynamic planning strategy for lot-sizing problems with stochastic demands. In Journal of Intelligent Manufacturing. https://doi.org/10.1007/s10845-004-5889-3 Radzi, N., Haron, H., & Jahori, T. (2006). Lot sizing using neural network approach. Second IMT-GT Regional,

(April), 1–8. Retrieved from http://math.usm.my/research/onlineproc/CS11.pdf

Sachs, A. L. (2015). Retail analytics: Integrated forecasting and inventory management for perishable products in retailing. Lecture Notes in Economics and Mathematical Systems. https://doi.org/10.1007/978-3-319-13305-8

(18)

55.

Tobergte, D. R., & Curtis, S. (2013). Machine learning with R. Journal of Chemical Information and Modeling (Vol. 53). https://doi.org/10.1017/CBO9781107415324.004

Tso, G. K. F., & Yau, K. K. W. (2007). Predicting electricity energy consumption: A comparison of regression analysis, decision tree and neural networks. Energy. https://doi.org/10.1016/j.energy.2006.11.010 Venables, W. N., & Ripley, B. D. (2002). nnet. In Modern Applied Statistics with S. (Fourth). New York:

Springer.

Witten, I. H., Frank, E., & Hall, M. a. (2011). Data Mining: Practical Machine Learning Tools and Techniques

(Google eBook). Complementary literature None. https://doi.org/0120884070, 9780120884070

Zhang, G. P., & Qi, M. (2005). Neural network forecasting for seasonal and trend time series. European

Journal of Operational Research, 160(2), 501–514. https://doi.org/10.1016/j.ejor.2003.08.037

Appendix

A – Variable Descriptions

Variables Description

Quantity Number of products sold online and offline in the Netherlands WebTraffic Number of website visits per week

AvgNumStores Number of open stores until that week StoreCount0 Number of open stores that are between 0 and 0.5 year old StoreCount05 Number of open stores that are between > 0.5 and 1 year old

StoreCount1 Number of open stores that are older than 1 year MaxTemp Average maximum temperature of that week AvgTmin Average minimum temperature of that week AvgHRain Average rain of that week

AvgSun Average sun of that week

IntStoreTemp0 Interaction variable between StoreCount0 and MaxTemp IntStoreTemp05 Interaction variable between StoreCount05 and MaxTemp

IntStoreTemp1 Interaction variable between StoreCount1 and MaxTemp IntStoreMin0 Interaction variable between StoreCount0 and AvgTmin IntStoreMin05 Interaction variable between StoreCount05 and AvgTmin

IntStoreMin1 Interaction variable between StoreCount1 and AvgTmin IntStoreRain0 Interaction variable between StoreCount0 and AvgHRain IntStoreRain05 Interaction variable between StoreCount05 and AvgHRain

IntStoreRain1 Interaction variable between StoreCount1 and AvgHRain IntStoreSun0 Interaction variable between StoreCount0 and AvgSun IntStoreSun05 Interaction variable between StoreCount05 and AvgSun

IntStoreSun1 Interaction variable between StoreCount1 and AvgSun

(19)

R² = 0.8757 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 WebTraffic R² = 0.6304 -100 0 100 200 300 400 500 600 700 IntStoreTemp0 R² = 0.6252 -5 0 5 10 15 20 25 30 35 StoreCount0

(20)

R² = 0.6241 -50.00 0.00 50.00 100.00 150.00 200.00 IntStoreSun0 R² = 0.5249 0 10 20 30 40 50 60 70 80 90 100 IntStoreTemp1 R² = 0.4981 0 5 10 15 20 25 30 35 40 AvgNumStores

(21)

R² = 0.4272 0.00 5.00 10.00 15.00 20.00 25.00 30.00 IntStoreSun1 R² = 0.415 -50.00 0.00 50.00 100.00 150.00 200.00 IntStoreMin0 R² = 0.2398 0 5 10 15 20 25 30 MaxTemp

(22)

R² = 0.2366 0.00 2.00 4.00 6.00 8.00 10.00 12.00 14.00 AvgSun R² = 0.1737 -15.00 -10.00 -5.00 0.00 5.00 10.00 15.00 20.00 25.00 30.00 35.00 IntStoreMin1 R² = 0.162 0.00 20.00 40.00 60.00 80.00 100.00 120.00 IntStoreRain0

(23)

R² = 0.1204 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 AvgHRain R² = 0.091 -10.00 -5.00 0.00 5.00 10.00 15.00 20.00 AvgTmin R² = 0.0216 0.00 2.00 4.00 6.00 8.00 10.00 12.00 14.00 16.00 IntStoreRain1

(24)

C – Results ANNs

IntStoreRain1, AvgTmin, AvgHRain, IntStoreRain0, IntStoreMin1, AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

Architecture MAE RMSE MSE RSquared Pearson

7-5-3 47.7328 54.9066 3566.7954 0.7362 0.8486 3-4-5 47.9219 57.1702 3505.9653 0.7245 0.8382 6-7-7 47.1411 58.0718 3563.7471 0.7066 0.8286 5-8-5 49.7966 59.8462 4778.5995 0.5966 0.6326 9-6-7 47.6577 59.8781 4031.2206 0.6401 0.5297 6-5-6 53.2902 60.0889 3839.1476 0.7000 0.7832 3-9-5 52.0491 61.2222 4457.1087 0.5963 0.6978 5-5-4 53.1823 61.5231 4713.9176 0.7092 0.6112 5-6-3 52.8170 61.7975 4492.4222 0.8594 0.9242 9-7-3 55.1016 62.2478 4516.5600 0.7024 0.6961 5-10-1 51.6412 62.2514 4441.6455 0.7026 0.8262 9-7-2 54.0380 62.2741 4920.8763 0.7188 0.6014 7-7-7 55.4395 63.2137 4471.5690 0.7996 0.8820 3-2-9 52.7785 63.3887 4898.9820 0.7050 0.8334 2-4-6 56.8964 63.5025 4710.0793 0.6663 0.8004 5-9-9 57.7169 63.6064 4664.2850 0.7325 0.7473 7-10-4 53.6220 63.8308 4370.4650 0.7303 0.8457 10-9-4 52.5911 64.0427 5053.2473 0.6427 0.7813 4-4-4 51.0995 64.1126 5098.8288 0.6002 0.7519 9-10-9 52.6064 64.3117 4830.2088 0.6321 0.6618 6-8-6 55.6761 64.4916 5082.9450 0.7408 0.8482 10-2-8 52.8752 65.0716 5000.0532 0.7991 0.6686 3-6-4 56.7322 65.2166 4491.6640 0.6354 0.7504 7-7-4 54.6741 65.2526 4538.3073 0.6908 0.8063 9-4-10 53.0895 65.4749 5764.4612 0.6259 0.6597 4-9-3 52.3148 65.6222 4877.7630 0.6462 0.7370 8-4-2 56.3965 65.6699 5322.8917 0.6596 0.6763 6-4-9 51.1592 65.8550 5586.6278 0.5489 0.6028 9-4-9 53.7337 65.9819 4681.7742 0.6010 0.7227 7-5-6 57.2915 66.0021 4559.3356 0.6442 0.7889

(25)

AvgTmin, AvgHRain, IntStoreRain0, IntStoreMin1, AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic Architecture MAE RMSE MSE RSquared Pearson

6-9-6 37.9908 44.1932 2326.9081 0.8030 0.8908 6-5-3 44.1427 52.5129 3034.0466 0.7329 0.8310 8-9-2 49.3124 57.8591 3852.6182 0.7505 0.8607 5-6-8 49.6659 58.5845 3960.5239 0.7492 0.8579 4-8-4 50.3153 59.6711 4376.5168 0.7000 0.8187 6-3-9 48.0173 59.7203 4042.4597 0.7245 0.7630 6-10-9 51.8899 60.0190 4157.2074 0.8477 0.9186 9-5-6 48.1977 60.0516 4467.5959 0.5558 0.7099 5-9-4 52.0629 61.2464 4295.9445 0.7634 0.8663 4-9-10 50.5271 61.4501 5415.4350 0.6756 0.6964 6-8-1 55.1778 61.5365 4427.1561 0.6594 0.7816 10-4-4 53.0186 62.0236 4423.6980 0.6902 0.8217 10-5-7 55.0143 62.1855 5494.8651 0.7088 0.6841 7-3-8 53.6552 62.4301 5025.0241 0.6463 0.7668 4-5-9 52.7937 62.6559 5272.5285 0.6916 0.7321 10-2-7 54.3109 63.1347 4619.7491 0.7371 0.8501 8-7-6 53.8878 63.3706 5204.4453 0.7073 0.8303 6-10-3 53.5690 63.8643 5156.5882 0.6383 0.6195 9-6-3 54.8703 64.0025 4351.0174 0.5407 0.6933 8-4-2 55.1810 64.2128 5039.5302 0.6656 0.7887 3-8-6 55.5628 64.4745 5417.9146 0.7018 0.6544 9-6-7 57.4176 64.5127 4676.6316 0.7860 0.8819 10-4-8 59.4272 64.7283 6024.2895 0.7570 0.7976 8-9-5 57.1777 64.7712 4932.5012 0.7517 0.8581 9-7-3 55.8182 65.0960 4608.4428 0.6628 0.8007 9-2-9 55.0923 65.2405 4690.0686 0.5569 0.5496 7-4-9 58.7884 65.2747 4918.5888 0.5854 0.6519 10-7-4 50.5775 65.3986 5166.4527 0.5854 0.6542 8-5-9 57.1400 65.9199 4735.7943 0.6287 0.5321 2-9-2 57.8037 66.0255 5418.4911 0.6963 0.8232

(26)

AvgHRain, IntStoreRain0, IntStoreMin1, AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

Architecture MAE RMSE MSE RSquared Pearson 5-7-7 44.5371 51.9428 3299.7424 0.6902 0.8081 9-3-8 47.5176 54.2760 3990.8700 0.8001 0.8459 7-9-8 48.2846 57.3690 3782.1718 0.6522 0.7727 6-2-5 49.4164 57.4591 4237.1041 0.6922 0.8216 6-3-3 50.0213 57.4752 3533.0088 0.6665 0.8044 5-2-7 46.9096 57.5523 4291.0126 0.7150 0.8330 6-5-2 49.5145 57.5525 3875.9941 0.5394 0.6851 7-10-4 47.7149 58.5154 3911.1993 0.7102 0.8311 8-7-5 49.3948 58.5264 3984.6283 0.6869 0.8065 4-7-5 50.9513 59.8799 4990.9113 0.7161 0.5884 10-3-3 50.3268 60.6289 4347.2006 0.6852 0.8188 6-4-6 52.8206 60.7199 4217.1875 0.7727 0.8699 5-6-8 51.6430 60.8218 4499.7453 0.7432 0.7544 9-7-9 52.2293 61.1995 4226.8649 0.6823 0.8096 10-7-3 53.4128 61.3462 4990.9393 0.6643 0.7222 8-6-5 52.9696 61.4275 4282.7685 0.7084 0.8292 5-4-5 53.9101 61.5461 4774.8059 0.7130 0.8213 6-7-5 53.0274 61.7724 4238.2612 0.6099 0.7654 10-7-4 52.0298 63.0150 4504.7292 0.6639 0.7687 3-7-3 54.8940 63.2439 5451.0234 0.7177 0.8362 10-9-7 51.0490 63.4606 4603.7436 0.6817 0.6624 4-1-3 50.0351 63.5552 4362.2933 0.7358 0.8528 3-6-6 55.6378 63.7131 5286.3284 0.6629 0.7177 3-9-2 52.8711 63.7429 4811.6817 0.6273 0.7241 9-7-6 54.7982 64.0777 4566.0665 0.8036 0.8923 3-6-5 52.5662 64.2929 4581.1003 0.6555 0.7929 3-9-1 50.1823 64.3306 4461.7114 0.6005 0.7185 8-6-9 53.9896 64.7619 5295.2853 0.6078 0.7288 5-2-1 53.6944 64.9374 4652.5500 0.6796 0.8069 10-3-4 55.7968 65.0031 5045.3986 0.7368 0.8417

(27)

IntStoreRain0, IntStoreMin1, AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

(28)

IntStoreMin1, AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

(29)

AvgSun, MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

(30)

MaxTemp, IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic

(31)

IntStoreMin0, IntStoreSun1, AvgNumStores, IntStoreTemp1, IntStoreSun0, StoreCount0, IntStoreTemp0, WebTraffic