Investigating how an Artificial Neural Network can help to forecast the demand for sales

June 18, 2018

A.J. Hoogenboom 10754342

Abstract

The aim of this thesis is to show the impact of new data sources and big data analytics on the economic practitioner's toolkit. Many businesses that maintain physical stores must predict their sales figures in advance in order to strategically plan and respond to the market. The sales figure for a given store is a combination of many underlying factors such as time, promotions, concurrency, competition and weather. Combined with the data surplus, such a complex setting provides a good opportunity to apply machine learning algorithms to create models for predicting sales. We predict the sales volume of a product by unit size and region in a small city in the Netherlands, using an Artificial Neural Network and a Multiple Linear Regression. The conclusion is that the Artificial Neural Network outperformed the Multiple Linear Regression. The results could be improved by introducing more features and more complex models to the problem.


Statement of Originality

This document is written by Arjan Hoogenboom who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Acknowledgement

I would like to thank Samir Selimi and Miriam Maan for offering me the chance to work on a real dataset and for having confidence in me. I would in particular like to thank professor Dirk Damsma for supporting me during this internship and helping me find an interesting thesis topic. Finally, I would like to thank my roommates, friends and family for their support. Their confidence in me has been a constant in my life, and their love and patience have been the foundations of this work.


Contents

1 Introduction
2 Literature review
2.1 Sales forecasting with the use of machine learning
2.2 Successful examples
2.3 Conclusion
3 Framework
3.1 Machine learning
3.2 Artificial Neural Network
3.3 Conclusion
4 Dataset
4.1 Input Variables
4.2 Weather-Grade
4.3 Product Distinguish
4.4 Conclusion
5 Results
5.1 Data
5.2 Results Multiple Linear Regression
5.3 Results Artificial Neural Network
5.4 Analysis of Results
5.5 Conclusion
6 Discussion and Conclusion
7 References
8 Appendix
8.1 Artificial Neural Network
8.2 Back-Propagation Neural Network Model
8.3 Model Evaluation Functions
9 Output Models
9.1 Regression Models
9.1.1 Regression Model
9.1.2 Reduced Regression Model

1 Introduction

This era of big data gives us many new applications and techniques to gain insights. Combined with the rise of Internet of Things technology (Storey & Song, 2017), a network of connected devices, systems and services within the existing internet infrastructure, the available data is becoming so huge that we can speak of a big data surplus (Thompson & Rogers, 2017). The Internet of Things represents one of the main markets for big data applications. Because of the high variety of connected objects, the applications of IoT are continuously evolving. Nowadays, various big data applications support logistic enterprises. It is possible, for instance, to track the positions of various sensors, wireless adapters and GPS devices. Such data-driven applications enable companies not only to supervise and manage employees, but also to optimize delivery and the complete logistics process, by exploiting and combining various information, including past logistic experiences. Businesses are looking to learn more about the available data and about ways of turning this new information into profit (Smedlund et al., 2018).

Another effect that comes with big datasets is that one can end up with more potential predictors than classical regression methods can handle. Unlike traditional data, the term Big Data refers to large, growing data sets that include heterogeneous formats: structured, unstructured and semi-structured data. Big Data has a complex nature that requires powerful technology and advanced algorithms. When humans make decisions, the process is often biased or limited by our inability to process information overload. Data and analytics can change all that by bringing in more data points from new sources, breaking down information asymmetries, and adding automated algorithms to make the process instantaneous. As the sources of data grow richer and more diverse, there are many ways to use the resulting insights to make decisions faster, more accurate, more consistent, and more transparent. Another feature of large datasets is that we can now allow more flexible relationships than a regular linear model. Companies are becoming more and more interested in the world of machine learning (Carrasquilla & Melko, 2017) to get these essential business insights and, in particular, its forecasting possibilities (Mullainathan & Spiess, 2017).

To get the best insights from the available data, a new type of industry has emerged: companies that define themselves as Machine Learning as a Service (MLaaS) providers (Li et al., 2017). Examples of MLaaS companies in the world of retail are Blue Yonder, SAS, Skytree and Prime. All these companies use high-performance algorithms and automated models to deliver more accurate predictive models than the regular methods of predicting. Firms can hire MLaaS companies to gain more insights into the data they gather. Blue Yonder even states that retailers can cut their out-of-stock rate by as much as 80% and see their revenue and profits improve by more than 5% just by implementing an MLaaS strategy into their business plan (https://www.blue-yonder.com/en).

This thesis looks into the world of machine learning and Artificial Neural Networks. We formulated a research question to ask whether we could achieve a high accuracy on the forecast of a particular product, just like the MLaaS companies. To answer this question, we set up a machine learning model to investigate the possibility of predicting the sales of one product at one store of a Dutch supermarket chain. We use a machine learning technique called an Artificial Neural Network, combined with internal and external data. To judge whether the Artificial Neural Network performs well enough, the forecasting literature offers a number of measures of accuracy, each with its own advantages and limitations (Makridakis et al., 2008).

To answer the question whether machine learning helps to gain better insights than the more standard approach of a non-linear regression, some related work in the field of machine learning sales forecasting is discussed in the next section. After that section follows a look at the framework used in this thesis. The next section is about the dataset, its origin and its features. The fifth section contains a general view of the models used and the results. The sixth section is a discussion of what we can learn by looking at the results and the aim of the project, followed by the conclusion and the way forward, where the results are summarized and possibilities for further work are presented. In the appendix, the mathematical formulas of the Artificial Neural Network are described.

2 Literature review

Big data brings big opportunities and transformative potential for various sectors. However, it also presents unprecedented challenges in harnessing such large, increasing volumes of data. Advanced data analysis is required to understand the relationships among features. In order to look at the possibilities of machine learning, it is important to look at the existing literature about sales forecasting with the use of machine learning. This section is divided into two parts, starting with a look into the world of forecasting with machine learning in the retail business. Following this subsection, a few successful examples of sales prediction with the use of machine learning are shown.

2.1 Sales forecasting with the use of machine learning

The most commonly used machine learning-based algorithms for predicting sales include ordinary least squares, logistic regression, discriminant analysis, association rules, decision trees and some basic query tools (Magnini et al., 2003). Still, there is more to be gained in the world of retail. Fisher and Raman (2017) state that innovation is needed if a chain wants to survive. The Internet of Things era will require retailers to think creatively about offering new services and developing new business models. Insights and big data will help retailers execute their current business model even better. Fisher and Raman mention that with the use of big data, retailers can optimize store assortments, offer dynamic pricing and optimize order fulfilment. They also state that with these insights, retailers can find stores that do not do well and close them. Their main message is that the potential is huge and the pilots are very promising, but the scale-up is very challenging.

A trend that is shaping the world of data is the rise of online reviewing systems. Online reviews not only provide consumers with rich information; retail businesses gain from this information as well (Ramanathan et al., 2017). User-generated reviews reduce consumers' uncertainty about purchases and make it easier to generate a model that can forecast sales based on consumer behaviour. Machine learning and Artificial Intelligence are used to efficiently scrape and use data from the internet. After being collected from the world wide web, the data is processed by a machine learning algorithm to generate a variable that can be included in a forecasting model (Fan et al., 2017).

2.2 Successful examples

There have not been many machine learning papers studying similar problems, but predicting sales and recommendations in e-commerce has been a hot topic. Research on the internet shows many examples where a company gains by using machine learning algorithms. The MLaaS companies are booming, as noted in recent research by Li et al. (2017). Companies like Blue Yonder, Prime and a lot of start-ups are looking to meet the demand from retail chains for developing and researching the possibilities of machine learning. Freek Aertsen (2017), academic director at TIAS Business School, states that machine learning algorithms should take over the task of manual forecasting. He mentions that well-programmed systems outperform human decision-making, but to get there, managers should start having faith in the systems. He explains that the role of demand planners will not be eliminated; instead of 100 demand planners, you could cut back to 30 (Stad, 2018).

A successful example of using machine learning for sales prediction is the case of Bakkersland (Rensen, 2018). That article mentions the main promise of big data: "As you have more information from the past, you can predict better". That is what GoDataDriven, an MLaaS company, has done. In order to predict consumer demand per day per retailer, Bakkersland asked GoDataDriven to develop a predictive sales planning model. This machine learning model optimizes the availability of fresh bread products while at the same time minimizing left-overs, based on historical data. The sales forecast enables Bakkersland to produce more efficiently and plan further ahead (Go Data Driven, n.d.).

The site of GoDataDriven lists several successful Dutch examples of forecasting done for Marktplaats, NPO and a big e-retailer. The site of Blue Yonder also mentions examples of using machine learning to forecast demand. At dm, a German drugstore chain, Blue Yonder was asked to forecast the demand for products and solve staffing problems. With the use of machine learning algorithms, they were able to forecast sales more accurately and optimize the workflow, satisfying employees and customers alike (Blue Yonder, n.d.).


There are more examples of machine learning being used to forecast sales, but a lot of the research is not public (Shibata et al., 2016). The reason is that companies do not want to let the world know what kind of algorithms they are using (Ebay, n.d.). Sometimes an algorithm becomes public after first having been implemented in the company itself, as with Spotify (https://github.com/spotify/luigi), Facebook (https://github.com/facebook/prophet) and Airbnb (http://airbnb.io/). More successful examples will already exist, but remain unknown to the public and to the scientific world due to Intellectual Property laws.

2.3 Conclusion

This chapter has shown that the adoption of advanced demand forecasting and supply planning offers great new possibilities. We live in an age where data comes in abundance; by using self-learning algorithms from the field of machine learning, this data can be turned into knowledge. Instead of requiring humans to manually derive rules and build models by analyzing large amounts of data, machine learning offers a more efficient alternative for capturing the knowledge in data, gradually improving the performance of predictive models and enabling data-driven decisions. Machine learning is playing an ever greater role in our everyday life and makes more robust demand forecasting possible. Taking this into account, the next section looks into the world of machine learning and how an Artificial Neural Network can be used to forecast sales.

3 Framework

In this research, we use an Artificial Neural Network to forecast sales. In order to understand how a machine learning model is created, a general overview of machine learning is provided first, followed by an introduction to the topic of Artificial Neural Networks.

3.1 Machine learning

Machine learning is an application of artificial intelligence (Alpaydin, 2014) that builds systems with the ability to automatically learn and adapt from experience without being explicitly programmed. Most researchers in the field of machine learning focus on the development of computer programs that can access big data pools and learn from them. The process of learning is best described as a computer looking for patterns in the data and making better decisions in the future based on the examples that the user provides. The primary goal of a machine learning algorithm is to adjust its actions automatically, without human intervention or assistance.


Machine learning can make use of three distinct types of algorithms (Alpaydin, 2014), namely supervised learning, unsupervised learning and reinforcement learning. Supervised learning is the data mining task of inferring a function from labelled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and the desired output value. A supervised learning algorithm analyses the training data and produces an inferred function, which is used for mapping new examples. An ideal scenario for the algorithm is to correctly determine the labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

In the data mining and data science world, the problem of an unsupervised learning task is to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to test a potential solution. These tasks are called unsupervised because, unlike in supervised learning, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Reinforcement learning is a process in which a machine can be seen as an agent who learns by taking actions, with the goal of maximizing his cumulative reward. Reinforcement learning is used to find the best actions to take now in order to reach some future goal. This type of problem is common in games but can also be useful for solving dynamic optimization and control theory problems, exactly the type of issues that come up when modelling complex systems in fields such as engineering and economics.

Every class in the world of machine learning has its own primary task, as seen in figure 1. For example, if you have a market forecasting problem and labelled data, it is recommended to use a supervised learning regression method.

So far we have seen the three main categories in which the right machine learning algorithm can be found, but it is still unclear how a general machine learning model operates. The general assumption of a machine learning process is that it has to examine the original dataset and works with the assumption that you do not need all the data (Alpaydin, 2014). The first step, the data cleaning step, divides the original dataset into an analytical dataset and an unused dataset.


Figure 1. Machine Learning Algorithm Classification (Geekstyle, 2017).

After the data is properly pre-processed, you split the data randomly into a training and a test set. The choices in dividing the dataset depend on the problem you want to apply a machine learning model to. Typically, you find an 80% / 20% division of the dataset (Gutierrez-Osuna, 2005). The training set is then divided again by 80% and 20% to generate a training and a validation set. The training set is used to train the machine learning algorithm and the validation set is used to test the algorithm model. If an algorithm does not give the desired insights, you can try another algorithm or tweak the model further. When you have achieved the desired level of accuracy, you run the algorithm on the test set. At this stage, you have a predictive model that can be used. The total process is shown in figure 2.
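As an illustration of this splitting procedure, the sketch below uses scikit-learn; the generated dataset is a hypothetical stand-in and the random_state values are arbitrary.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a cleaned dataset with features and a sales column.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((820, 61)))
X, y = data.iloc[:, :-1], data.iloc[:, -1]

# First split: 80% training data, 20% held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: the training data is divided 80% / 20% again,
# yielding the training and validation sets described above.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)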


Figure 2. Machine Learning Model.

3.2 Artificial Neural Network

The first step in the search for the right machine learning algorithm is to look at the task of the project and the available data. In this case, there is labelled data and a market forecasting problem. As mentioned, the use of an Artificial Neural Network is the best option here. It gives good results on accuracy but induces a decrease in speed (Brown, 2004). An Artificial Neural Network is less sensitive to error term assumptions and can tolerate noise, chaotic components and heavy tails better than most other methods (Masters, 1993). Other advantages include robustness and adaptability, due to the large number of interconnected processing elements that can be trained to learn new patterns (Kaastra & Boyd, 1996).

Mostly due to the big increase in computing power, Artificial Neural Networks have gained a lot of popularity in the machine learning community (Liu et al., 2017). This form of deep learning has major advantages due to its flexible nonlinear modelling capability (Zhang, 2003). The model does not need to be in a particular model form; rather, the model adapts to the features presented by the data. These abilities, and the fact that the Artificial Neural Network is a forecasting method based on simple mathematical nodes modelled on the brain, make it the ideal candidate for forecasting the demand for a product.

A neural network is very similar to an Ordinary Least Squares (OLS) regression, because both attempt to minimize the sum of squared errors. The number of input nodes is equal to the number of independent variables in a regression, while the output nodes represent the dependent variable(s). Linear regression models may be viewed as a feedforward neural network with no hidden layers and one output neuron with a linear transfer function. The weights connecting the input neurons to the single output neuron are analogous to the coefficients in a linear least squares regression. Networks with one hidden layer resemble nonlinear regression models, with the weights representing regression curve parameters.
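To make this analogy concrete, here is a minimal sketch in Keras (the library used for the experiments later in this thesis) of a linear regression expressed as a network without hidden layers; the 60-feature input dimension is taken from the dataset in section 5, while the optimizer choice is an assumption.

from keras.models import Sequential
from keras.layers import Dense

# No hidden layers, one output neuron with a linear transfer function:
# the learned weights play the role of the OLS regression coefficients.
linear_net = Sequential()
linear_net.add(Dense(1, activation="linear", input_dim=60))
linear_net.compile(optimizer="sgd", loss="mean_squared_error")  # minimizes squared errors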

In this thesis, a back-propagation neural network is used to produce the sales forecast. Back-propagation is the technique most commonly used in training deep neural networks (Nielsen, 2015) and has proven to be one of the best and simplest techniques (Rojas, 1996). It has the advantages of accuracy and versatility, despite the disadvantages of being time-consuming and complex. Our goal with back-propagation is to update each of the weights in the network so that they bring the actual output closer to the target output, minimizing the error for each output neuron and for the network as a whole.

A back-propagation neural network is a collection of input and processing units known as nodes, or in biological terms, neurons. Every node is fully interconnected by connection strengths called weights, which store the knowledge of the network. All of these nodes are arranged in layers: an input layer, hidden layers and an output layer. These layers are stacked on top of each other, building a pipeline from the input nodes to the output nodes. The number of nodes in the input layer is equal to the number of data items used from the dataset. The output layer in this project is a single node, which forecasts the sales of the product given the input features. An example of a neural network can be seen in figure 3.

Figure 3. Deep Neural Network.

A back-propagation neural network uses a cost function to measure the correctness of the model's output. The cost function takes the output of the model as its input for the back-propagation and calculates the difference with the expected output (Nielsen, 2015). The difference between the expected and the given output is the cost of the model. By using back-propagation, the cost function calculates the difference between the neural network's output and its expected output after one complete iteration of the model. The method of back-propagation is then used to calculate the gradient needed to update the weights of the network. The term back-propagation comes from the fact that the calculation of the gradient proceeds backwards through the network, with the gradient of the final layer of weights being calculated first and the gradient of the first layer of weights being calculated last. Partial computations of the gradient from one layer are reused in the computation of the gradient for the earlier layer. This backwards flow of the error information allows for efficient computation of the gradient at each layer, compared to an approach that calculates the gradient of each layer separately.

Step 1: Neural Network Paradigm
    Number of hidden layers
    Number of hidden nodes
    Number of input nodes
    Number of output nodes
    Transfer function
Step 2: Evaluation criteria
Step 3: Neural Network Training
    Number of training epochs
    Size of the batch
Step 4: Implementation

Table 1. Four steps in designing a neural network model.
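In formula form, each weight update in gradient descent with back-propagation can be written as

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial C}{\partial w_{ij}}$$

where $C$ is the cost function, $w_{ij}$ a connection weight and $\eta$ the learning rate (a standard ingredient of the method that is not named explicitly in the text); every weight is moved against its own contribution to the gradient of the cost.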

Table 1 describes the four steps in designing a neural network. In designing the back-propagation neural network, the number of input and output nodes is straightforward: the number of input nodes equals the number of variables you put into the network and the number of output nodes equals what you want to predict, in this project one. The hardest step in designing the network is choosing the number of hidden layers and nodes. In theory, a neural network with just one hidden layer and a sufficient number of hidden nodes is able to approximate any continuous function. The literature shows that neural networks with just one or two hidden layers are widely used and do very well (Nielsen, 2015). Adding more hidden layers will increase not only the computation time but also the chance of overfitting the model.

In machine learning, the term overfitting is sometimes replaced by overtraining; an overtrained model is a statistical model that has more parameters than can be justified by the data (Alpaydin, 2014). An overtrained model does not identify the general patterns but simply memorizes the individual points. Figure 4 illustrates the problems of overfitting and under-fitting. Therefore, to successfully build a neural network, you first start with two hidden layers and then decide whether it is better to continue with an extra layer or to be satisfied with the current architecture.

Figure 4. Problem Of Overfitting.

The hidden layers contain a number of hidden nodes. There is no basic recipe or rule of thumb for deciding how many hidden nodes you need when designing an Artificial Neural Network. Selecting the 'best' number of hidden neurons involves experimentation and is mostly based on weighing the best results against computation time. From my own experience on datasets found on https://www.kaggle.com, the best approach is to start with a number of hidden nodes equal to the number of input variables. This was my baseline in the experiment to achieve the pre-defined goal of accuracy. This approach is time-consuming but, based on my own experience, in the end the best way.

After designing the layers of the network, you have to decide how the layers connect with each other through so-called transfer or activation functions (Rojas, 1996). Transfer functions are mathematical formulas that determine the output of a processing neuron. The most commonly used transfer functions are sigmoid (S-shaped), arc-tangent and linear functions, but others such as the hyperbolic tangent, unit-step, Gaussian and ramping functions have also been proposed (Kriesel, 2007). The transfer function is implemented to prevent the output from reaching very large values, which can delay the training process and thereby increase the training length. Just as with configuring the right number of hidden nodes, the process of finding the right transfer function consists of applying one of them and checking for the pre-defined level of accuracy. Between the hidden layers, a nonlinear transfer function is used; if you use a linear transfer function between the hidden nodes, you end up with only linearly separable solutions. With a back-propagation type of Artificial Neural Network, you look for a transfer function that is differentiable, smooth, monotonic and bounded. I decided to start with a hyperbolic tangent sigmoid transfer function and, if necessary, to try a logistic transfer function (Bagnasco et al., 2015).
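For reference, the two candidate transfer functions mentioned here are the hyperbolic tangent and the logistic function:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

Both are differentiable, smooth, monotonic and bounded, to $(-1, 1)$ and $(0, 1)$ respectively, which is exactly the set of criteria listed above.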


3.3 Conclusion

The aim of machine learning is to discover insights and make intelligent decisions. Machine learning is an application of artificial intelligence which can be split into three main categories: supervised, unsupervised and reinforcement learning. Every category has its own primary task, as seen in figure 1. For our project, we use a supervised learning regression method, because we have a market forecasting problem and are provided with labelled data. After the data is properly pre-processed, we have a training and a test set; we train our models on the training set and test them on the test set. As mentioned, an Artificial Neural Network is the chosen option for our supervised learning regression method. In this thesis, a back-propagation neural network is used to determine the sales forecast. With back-propagation, you adjust each weight in the network in proportion to how much it contributes to the overall error. If you iteratively reduce each weight's error, you eventually obtain a series of weights that produce good predictions. The main problem you face while designing an Artificial Neural Network is overfitting. Therefore, to successfully build a neural network, you first start small and then decide whether it is better to add an extra hidden layer and more nodes or to be satisfied with the current architecture. After designing the layers of the network, you have to decide how the layers connect with each other through so-called transfer or activation functions. The transfer function is implemented to prevent the output from reaching very large values, which can delay the training process and thereby increase the training length. Just as with configuring the right number of hidden nodes, finding the right transfer function consists of applying one and checking for the pre-defined level of accuracy.

4 Dataset

Data preprocessing is a key step in making a model. In order to get the desired accuracy level, some of the data has to be converted into workable variables. In this section, we look at the input variables, a way to measure the weather, and how to single out the right product to forecast.

4.1 Input Variables

One of the most important steps in the Artificial Neural Network building process is the determination of significant input variables. Usually, not all potential input variables are equally informative, since some may be correlated, noisy or have no significant relationship with the output variable. Input variables can be selected on prior knowledge, where the model's structure is determined from causes to effects and conclusions are drawn deductively using empirical or analytical approaches before the unknown model limits are estimated. Data-driven approaches, on the other hand, are usually assumed to be able to find out by themselves which model inputs are critical. This means that Artificial Neural Network practitioners often present many inputs to the Artificial Neural Network and reduce their number during training (Bowden et al., 2005). Looking at the articles mentioned in the sections above, you can see examples in which the available data is divided into internal and external data. This division is also used in this project and is shown in table 2. This separation is the starting point for selecting the variables used in this model.

Internal Data                    Feature
Point-Of-Sale (POS) data
  Target-product                 Sales transaction records
Promotions
  Target-product                 Campaign trade calendar
  Cannibalization-product(s)     Campaign trade calendar
Location                         Population, number of competitors

External Data                    Feature
Competitor's price
  Target-product                 Retailer market price
  Promotions                     Retailer market price
Day of the week                  Calendar
Holiday                          Holiday calendar
Events                           Public events calendar
Weather                          Temperatures, precipitation, humidity

Table 2. Division of the data into internal and external data.

Most of the internal and external data is provided by the retail chain. After fetching the data, we need to do some feature engineering on the main features to use them in the model. Looking at the internal data, we convert the variables that are needed in this model. The Point-Of-Sale data and promotions data are quite straightforward: they are converted into an article identification (label variable) and the frequency (ordinal categorical variable) with which the article has been bought on a particular date identification (label variable). Combining the date id (label variable) with the article identification shows whether the article was on promotion (dichotomous categorical variable). Location data, also provided by the supermarket chain, indicates the location and number of competitors (discrete numerical variable). It is obvious, however, that in order for a marketing unit to decide the specific region that it finds economical for distribution purposes, it first has to assess the demand in the entire potential trading area (Merino & Ramirez-Nafarrate, 2016). In addition, no matter what kind of cost variable is considered, such as delivery or promotion, it is very likely that the cost of the service under consideration will not turn out to be a very satisfactory determinant of any precisely bounded trading area. Using this intelligence, a radius of two kilometers is used as a starting point for including competitors; this radius could be adjusted during testing.

Looking at the external data needed for this model, you again notice some straightforward variables. After deciding which article I wanted to predict the demand for, I looked up the competitors' prices (continuous numerical variable) of the product on the same date identification. The day of the week is an ordinal categorical variable, and the same holds for the holiday calendar. The retail chain also provided me with an enormous public event calendar, in which a lot of public events are mentioned. In order to decrease the training time, a correlation between all the available events and the frequency of the output is computed, to reduce the set of events to those that will be included in the model and to determine which events will be excluded.
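A minimal sketch of such a correlation-based event filter, assuming one 0/1 dummy column per event; the data construction and the 0.1 cut-off are assumptions for illustration.

import numpy as np
import pandas as pd

# Hypothetical stand-ins: a dummy column per public event and the daily sales frequency.
rng = np.random.default_rng(0)
events = pd.DataFrame(rng.integers(0, 2, size=(820, 50)),
                      columns=[f"event_{i}" for i in range(50)])
sales = pd.Series(rng.integers(0, 100, size=820), name="Frequency")

# Correlate every event with the output frequency and keep only the strongest events.
correlations = events.corrwith(sales).abs()
keep = correlations[correlations > 0.1].index  # the threshold is an assumption
events_reduced = events[keep]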

4.2 Weather-Grade

The hardest part of the process of converting the data into variables is the weather variable. The weather is an important variable because it has a direct influence on the number of customers who are willing to go shopping (Granger, 1979). Research also shows that weather conditions can have a direct influence on consumers with respect to what product they buy, when, where and in what quantity (Parnaudeau & Bertrand, 2018). Weather conditions such as icy roads, excessively hot or cold temperatures, or heavy rain can discourage customers from going out or shopping. Other weather conditions can simply delay or totally remove demand, or create demand that would not have existed otherwise.

So, to predict the demand for a product, a conversion of the weather into a grade is needed. The Belgian Hugo Poppe was the first to introduce the idea of grading the weather (Koninklijk Nederlands Meteorologisch Instituut, n.d.). In this model, we applied our own algorithm, in which we tried to find a normal distribution within a Dutch grading system. Figure 5 shows the plot of the weather-grade.


Figure 5. Distribution of the weather-grade.

The plot shows a nice distribution of the weather-grade (M = 6.00, SD = 1.50). The weather-grade is normally distributed, with a skewness of -0.09 and a kurtosis of -0.55. A few outliers were produced because the algorithm could not correctly handle the extreme weather caused by high rainfall, strong wind and high temperatures.
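A short sketch of how such summary statistics can be computed, with a randomly generated stand-in for the actual weather-grade series:

import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical stand-in for the computed daily weather-grades.
rng = np.random.default_rng(0)
weather_grade = rng.normal(6.00, 1.50, size=820)

print("M  =", weather_grade.mean())
print("SD =", weather_grade.std(ddof=1))
print("Skewness =", skew(weather_grade))
print("Kurtosis =", kurtosis(weather_grade))  # Fisher definition: 0 for a normal distribution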

4.3 Product Distinguish

In order to find the right product for a demand forecast, we ask six main business questions. We looked for a product with high volatility (one), high margin (two) and high frequency (three). Without these three arguments, the need for forecasting a particular product is low; with them, we can find a business opportunity that would be beneficial for the company in the long run. The other three arguments are high association (four), high price elasticity (five) and whether we can show a promo uplift (six). These last three also indicate a potential business improvement, but are less significant than the first three arguments.

4.4 Conclusion

In order to build a model, we need to use the variables provided. The data provided by the retail chain is quite extensive, so we need to identify the data that is necessary for this algorithm. We saw that we can divide the data into internal and external data. After the conversion into usable variables, we can include them in the model. Most of the variables are quite straightforward, except the weather variable. The weather is an important variable because it has a direct influence on the number of customers who are willing to go shopping (Granger, 1979); we had to convert it into a grading system. The product to forecast can be found by asking six main business questions, which indicate a business opportunity.

5 Results

This section first gives an overview of the data and its features, followed by the results of the Multiple Linear Regression and the Artificial Neural Network. Before we delve into the results, we have to understand the metric that is used to optimize and evaluate the models. We trained a Multiple Linear Regression as our baseline model; it was chosen as the first attempt because of its simplicity and has the same input as the Artificial Neural Network. We compared the results of the Artificial Neural Network against this baseline. Since an Artificial Neural Network includes non-linearity and can have a complex structure, we hope that the neural network can uncover more information than our baseline model. We also report other, more generally used metrics. However, it is important to stress that from a profit-oriented perspective the Root Mean Squared Error is the most important evaluation criterion for a regression model.
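For $n$ test observations with actual sales $y_i$ and forecasts $\hat{y}_i$, the metrics reported below are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MDAE} = \operatorname{median}_i\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$

The squaring inside the RMSE punishes large misses disproportionately, which is why it is the preferred criterion when big forecast errors are the costly ones.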

5.1 Data

The final dataset consists of 820 days with 62 features and 1 numerical output. Table 3 shows the names of the variables used in this model. In the final dataset, I included the prices of three competitors who are within a range of 10 km of the target supermarket. The prices of six other products have been included in the model to capture any cannibalization and substitution effects. In this context, a cannibalization product refers to a reduction in the sales volume of one product as a result of another product by the same producer.


Variables                      Name
Point-Of-Sale (POS) data
  Target-product               Frequency, Price
  Cannibalization-product(s)   Price
Promotions
  Target-product               No_sales
  Cannibalization-product(s)   No_sales
Competitors
  Prices of the product        Competitor_1, Competitor_2, Competitor_3
Date elements                  DAGNR_WEEK, AANTAL_DAGEN_TOT_HALLOWEEN, BEVRIJDINGSDAG_VOORUIT, HALLOWEEN_VOORUIT, KONINGSDAG_VOORUIT, VAKANTIE_MEI_MIDDEN_VOORUIT, DODENHERDENKING_VOORUIT, HALLOWEEN_ACHTERUIT, VAKANTIE_HERFST_MIDDEN_VOORUIT, VAKANTIE_HERFST_NOORD_VOORUIT, VAKANTIE_HERFST_ZUID_VOORUIT, AANTAL_DAGEN_TOT_VOETBAL_EVENEMENT, AANTAL_DAGEN_TOT_KONINGSDAG, AANTAL_DAGEN_TOT_VAK_MEI_MIDDEN, BETAALDAG_SALARIS_VOORUIT, BEVRIJDINGSDAG_ACHTERUIT, DODENHERDENKING_ACHTERUIT, KONINGSDAG_ACHTERUIT
Weather                        Weather-grade

Table 3. Names of the variables used in the final dataset.

One general insight into the dataset is that we can see a spike in the sales of the product when it is discounted. This effect is visible in figure 6, which shows two graphs plotted against the date: one contains the number of sales and the other indicates whether the product was on sale. The figure also shows that the sales of the product are stable over the whole dataset, which suggests that there is no seasonal trend in the target product. To know this for sure, we have to check whether there is autocorrelation between the lagged frequencies of the product sold. The autocorrelation coefficients are plotted to show the autocorrelation function. The plot is shown in figure 7 and has 150 lags. When data has a trend, the autocorrelations for the lags tend to be large and positive, because observations nearby in time are also nearby in size, so the autocorrelation function tends to have positive values that slowly decrease as the lags increase. The partial autocorrelations are also shown; they measure the relationship between each lag after removing the effect of the intermediate lags. The first partial autocorrelation is identical to the first autocorrelation, because there is nothing between them to remove.

Figure 6. Plot of the number of times the target product was sold, combined with an indication of sale, against the date.

Figure 7. Autocorrelation and Partial Autocorrelation of the error terms.

One of the classical assumptions of the Ordinary Least Squares (OLS) procedure is that the observations of the error term are independent: each error term observation must not be correlated with the error term observation next to it. If this assumption is violated and the error term observations are correlated, autocorrelation is present. Autocorrelation is a common problem in time-series regressions, and therefore in demand forecasting, because it is based on historical data. When autocorrelation is present, the error term observations follow a pattern, and such patterns tell us that something is wrong. Looking at figure 7, we see a dark area which indicates whether the correlations are significantly different from zero; a value outside of the dark area indicates a form of autocorrelation. Because the values stay inside the dark area, we can state that the data has no seasonal trend. If the data were autocorrelated, the forecast would be inefficient, because there would be more information in the data that could be included in the equation.
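A sketch of how such a plot can be produced; the thesis does not name the plotting routine, so the statsmodels functions and the stand-in residuals below are assumptions.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical stand-in for the regression error terms over the 820 days.
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 0.69, size=820)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(residuals, lags=150, ax=ax1)   # the shaded band marks the significance area
plot_pacf(residuals, lags=150, ax=ax2)
plt.show()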

5.2 Results Multiple Linear Regression

Multiple regression analysis was used as a baseline model against the Artificial Neural Network, to test whether we can predict the sales volume of a product by unit size. I used the package statsmodels.regression.linear_model.OLS in Python 3.6 to run the regressions. Because of the sales spikes in the dataset due to discounts, a logarithmic transformation of the dependent variable was needed. I ran one regression on all the variables and one regression with only the independent variables that had a high significance (p<.05). The results of the full regression indicated that the independent variables explained 88.8% of the variance (r2=.888, F(60,760)=100.8, p<.000). The reduced regression also explained a significant proportion of the variance (r2=.875, F(10,810)=568.3, p<.000). The output of the two regression models is included in the chapter Output Models at the end of this thesis.
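A minimal sketch of this baseline; the thesis confirms the OLS class and the log transformation, while the generated data stands in for the real dataset.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-ins for the 820-day dataset with 60 regressors.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((820, 60)), columns=[f"x{i}" for i in range(60)])
sales = pd.Series(rng.integers(1, 100, size=820))

# Log-transform the dependent variable because of the discount-driven spikes.
y = np.log(sales)
full_model = sm.OLS(y, sm.add_constant(X)).fit()
print(full_model.summary())

# The reduced regression keeps only the regressors with p < .05.
significant = full_model.pvalues[full_model.pvalues < 0.05].index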

Using Occam's razor, we should continue with the model that has fewer assumptions when the two models perform similarly. By this logic, we take the reduced model as our baseline model. Figure 7 already showed that there is no autocorrelation of the error terms in the regression. Figure 8 shows the distribution of the error terms. The error terms ranged from -1.91 to 1.57 (M = .00, SD = .69) and are normally distributed, with a skewness of -.44 and a kurtosis of .08.


Figure 8. Distribution of the error term of the reduced regression.

5.3 Results Artificial Neural Network

The underlying premise of an Artificial Neural Network is that multiple layers of generalized linear models combine to produce non-linear outputs. In order to select the optimal form of the Artificial Neural Network, we have to look at many different model configurations. I used the package Keras in Python 3.6 to build the Artificial Neural Network. Table 4 shows the different numbers of layers and hidden nodes. Each model was run for 1000 epochs with a batch size of 4, an input dimension of 60 and an output of 1 feature. As mentioned in the framework section, I chose an 80/20 split: 80% of the dataset is used to train the Artificial Neural Network and 20% to test and evaluate the network. Every type of model was run five times and the mean of the results is reported in table 4. The table shows the accuracy of the models on the training set in percentages and the Root Mean Squared Error of the models on the test set. A 30/30 division means that the first and second layers each contained 30 hidden nodes.
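As a sketch, the 120/120/120 configuration from table 4 might look as follows in Keras; the text confirms Keras, 1000 epochs, batch size 4 and the tanh starting activation, while the optimizer and loss are assumptions.

from keras.models import Sequential
from keras.layers import Dense

# Three hidden layers of 120 nodes each, 60 input features, one output node.
model = Sequential()
model.add(Dense(120, activation="tanh", input_dim=60))
model.add(Dense(120, activation="tanh"))
model.add(Dense(120, activation="tanh"))
model.add(Dense(1, activation="linear"))
model.compile(optimizer="adam", loss="mean_squared_error")  # optimizer/loss assumed

# Training as described in the text: 1000 epochs with a batch size of 4.
# model.fit(X_train, y_train, epochs=1000, batch_size=4)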


Layers  Nodes         Accuracy Training (%)  Test set (RMSE)
0       0             15.40                  13.46
1       30            36.13                  13.46
1       60            38.41                  16.51
1       120           39.02                  13.57
1       240           43.29                  14.85
1       480           39.94                  15.36
2       30/30         57.77                  13.19
2       30/60         52.44                  7.80
2       60/30         54.27                  9.38
2       60/60         67.53                  10.30
2       60/120        67.99                  12.16
2       120/60        58.08                  9.44
2       120/120       59.60                  10.27
2       120/240       71.80                  10.27
2       240/120       71.80                  11.25
2       240/240       78.66                  7.58
2       240/480       75.76                  10.76
2       480/240       76.52                  9.70
2       480/480       75.30                  10.53
3       30/30/30      47.26                  8.10
3       30/60/30      58.38                  7.30
3       30/30/60      58.08                  11.52
3       60/30/30      60.67                  7.38
3       60/30/60      62.35                  10.12
3       60/60/30      65.40                  7.51
3       60/60/60      68.45                  8.05
3       60/60/120     66.16                  16.05
3       60/120/60     77.59                  9.25
3       60/120/120    58.23                  8.03
3       120/120/120   82.93                  6.43
3       120/120/60    79.42                  7.04
3       120/60/120    79.88                  9.58
3       120/60/60     66.77                  17.60
3       120/240/240   85.98                  7.39
3       120/120/240   83.08                  9.07
3       120/240/120   80.95                  13.36
3       240/240/240   73.78                  9.37
3       240/120/120   72.26                  8.03
3       240/120/240   72.41                  14.99
3       240/240/120   84.15                  13.95
3       240/240/480   86.43                  9.70
3       240/480/240   74.09                  9.15
3       240/480/480   86.43                  9.15
3       480/480/480   84.60                  9.50
3       480/480/240   85.21                  11.40
3       480/240/240   84.76                  8.28
3       480/240/480   68.29                  8.41

Table 4. Results of the neural networks on the training set and the test set.

5.4 Analysis of Results

In order to compare the models, we calculate the Mean Absolute Error, the Median Absolute Error and the Root Mean Squared Error for all the models. We trained two Artificial Neural Networks twenty times each and took the average of the model evaluation functions. I chose a network with 3 layers of 120 nodes as one of the final models because of its low Root Mean Squared Error. The second Artificial Neural Network I chose was a network with one layer of 240 nodes and two layers of 480, because of its high accuracy on the training set. Table 5 shows the results of the models.
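The model evaluation functions are listed in the appendix; scikit-learn equivalents give a compact sketch (the arrays are hypothetical stand-ins for the test targets and forecasts).

import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error

y_test = np.array([52.0, 47.0, 61.0, 55.0])        # hypothetical actual sales
predictions = np.array([50.0, 49.0, 58.0, 57.0])   # hypothetical forecasts

mae = mean_absolute_error(y_test, predictions)
mdae = median_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(mae, mdae, rmse)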


Test Accuracy

Model                         MAE    MDAE   RMSE
Multiple Linear Regression
  Reduced Regression model    5.86   2.56   15.97
Artificial Neural Network
  120/120/120                 4.62   2.56   10.83
  240/480/480                 4.85   2.47   11.98

Table 5. Results of the models.

We notice that the Artificial Neural Networks outperformed the Multiple Linear Regression: lower values of the Mean Absolute Error, Median Absolute Error and Root Mean Squared Error are preferable, which indicates that the networks outperform the regression. The network with three layers of 120 nodes is slightly better than the network with the 240/480/480 construction. Looking at processing time, the network with fewer nodes is also far quicker than the network with more hidden nodes.

5.5 Conclusion

This part has shown the results of the different models. When we use a Multiple Linear Regression model, we make some assumptions about the variables. We saw that the data showed a clear correlation between sales and discounts, which is a good indication that the data preprocessing went well. Figure 7 showed that the data is not autocorrelated, and other assumptions, such as linearity, random sampling of observations and no multicollinearity, are not violated either. The regression with all the variables included could explain 88.8% of the variance; the regression with only the significant variables explained 87.5% of the variance. Looking at the overall scores of all the presented models, based on Root Mean Squared Error the three-layer networks outperformed the other types, having the lowest Root Mean Squared Error on the test set and the highest accuracy on the training set. In the end, an Artificial Neural Network with three layers of 120 nodes outperformed the other models based on the evaluation on the test set.


6 Discussion and Conclusion

The aim of this thesis was to investigate the impact of new data sources and big data analytics on the economic practitioner's toolkit. In this age of data surplus, we have to look for more efficient ways of using the available data. The standard methods of an economist cannot easily handle the amount of data available. Just as in evolution, the economist has to evolve and expand his toolkit by using different methods. The literature review has shown that the adoption of advanced demand forecasting and supply planning offers great new possibilities. We live in an age where data comes in abundance; by using self-learning algorithms from the field of machine learning, this data can be turned into knowledge. This section summarizes the results and shows the challenges ahead for improving the accuracy of forecasting.

This research demonstrates the effectiveness of an Artificial Neural Network in predicting the sales of a product. Specifically, we investigated the performance of this model compared with a standard model, a Multiple Linear Regression. Given the results of the experiments, the Artificial Neural Network performed better in this experiment: it is clear from table 5 that the Artificial Neural Network had a lower Root Mean Squared Error. But we are not near the maximum potential. Although the results are acceptable for this case, there is room for improvement. This experiment only considered a single Artificial Neural Network architecture. A deeper Artificial Neural Network could increase the accuracy but would be more time-consuming to train. Accuracy could also decrease as the number of hidden units increases, showing that overfitting can occur quickly. More work needs to be done to determine whether deeper networks can improve model performance. Since we have a relatively small dataset for training a neural network, we could include a higher percentage of our examples as training examples instead of using an 80-20 train-test split. Thus, if we had more time and resources, we would have adjusted our algorithms to more heavily penalize false negatives in any future investigations.

An issue of big data is that the data can be overwhelming. Selecting the right features for a machine learning model is hard, especially with an Artificial Neural Network. In a simple regression method, you can easily identify the variables suitable for the model by looking at their significance. But with an Artificial Neural Network, you sometimes encounter something like "a black box" (Olden & Jackson, 2002), which means in this context that the models are not transparent. We do not know how the machine looks at the features: it looks for correlations in the given dataset, but a correlation is not necessarily a causal relationship between the features. Consider the classic example of ice cream sales and deaths by drowning in the sea in summer: a machine learning model could indicate a correlation between the two features. As practitioners of economic theory, we have to be careful when including features, and most of the time it is a trial-and-error path on which researchers have to follow their own methods to optimize their models.


Concerning further research, the following suggestions can be made. Future work on this topic should add more machine learning algorithms to the economic toolkit. In this thesis we introduced an Artificial Neural Network, but there are more models that would be perfectly suitable for this type of forecasting. Introducing Decision Tree Learning and Support Vector Machines could yield the same or even better results than an Artificial Neural Network. Introducing Stacking, or Stacked Generalization (Wolpert, 1992), could also increase the accuracy of the model. Stacking is a method of using a high-level model to combine lower-level models to achieve a higher level of accuracy, as sketched below.
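A sketch of what such a stacked model could look like with scikit-learn; the choice of base learners and the ridge combiner are assumptions for illustration.

from sklearn.ensemble import StackingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge

# Lower-level models combined by a high-level (final) estimator.
stack = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor()), ("svm", SVR())],
    final_estimator=Ridge(),
)
# stack.fit(X_train, y_train); stack.predict(X_test)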


7 References

Alpaydin, E. (2014). Introduction to machine learning. MIT press.

Aertsen, F. (2017, October 11). Weg met planners; machines nemen het over! [Away with planners; machines are taking over!]. Retrieved March 1, 2018, from http://www.logistiek.nl/supply-chain/nieuws/2017/10/weg-met-planner-de-machine-neemt-het-101159045

Bagnasco, A., Fresi, F., Saviozzi, M., Silvestro, F., & Vinci, A. (2015). Electrical consumption forecasting in hospital facilities: An application case. Energy and Buildings, 103, 261-270.

Blue Yonder. (n.d.). Successful Demand Forecasting at dm. Retrieved March 15, 2018, from https://www.blue-yonder.com/sites/default/files/by-en-case-study-dm.pdf

Bowden, G. J., Dandy, G. C., & Maier, H. R. (2005). Input determination for neural network models in water resources applications. Part 1—background and methodology. Journal of Hydrology, 301(1-4), 75-92.

Brown, A. (2004). Machine learning algorithms cheat sheet. Retrieved March 1, 2018, from https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Machine-learning-algorithms-cheat-sheet-blog-post/td-p/363568

Carrasquilla, J., & Melko, R. G. (2017). Machine learning phases of matter. Nature Physics, 13(5), 431.

Ebay. (n.d.). Machine Learning and Data Science. Retrieved March 15, 2018, from http://labs.ebay.com/research-areas/research-machine-learning-and-data-science

Fan, Z. P., Che, Y. J., & Chen, Z. Y. (2017). Product sales forecasting using online reviews and historical sales data: A method combining the Bass model and sentiment analysis. Journal of Business Research, 74, 90-100.

Fisher, M., & Raman, A. (2017). Using data and big data in retailing. Production and Operations Management.

Geekstyle, A. A. (2017, February 13). Business Intelligence and its relationship with big data, data analytics and data science. Retrieved March 1, 2018, from https://www.linkedin.com/pulse/business-intelligence-its-relationship-big-data-geekstyle/

Go Data Driven. (n.d.). Predicting demand for fresh bread in supermarkets. Retrieved March 15, 2018, from https://godatadriven.com/casestudy-bakkersland

Granger, C. W. (1979). Seasonality: causation, interpretation, and implica-tions. In Seasonal analysis of economic time series (pp. 33-56). NBER.

Gutierrez-Osuna, R. (2005). Introduction to pattern analysis. Texas: Texas A&M University.

Huang, H., & Liu, Q. (2017). Intelligent Retail Forecasting System for New Clothing Products Considering Stock-out. Fibres & Textiles in Eastern Europe.

Kaastra, I., & Boyd, M. (1996). Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10(3), 215-236.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

Kitapci, O., Tosun, Ö., Tuna, M. F., & Turk, T. (2017). The Use of Artificial Neural Networks (ANN) in Forecasting Housing Prices in Ankara, Turkey. Journal of Marketing and Consumer Behaviour in Emerging Markets, (1(5)), 4-14.

Koninklijk Nederlands Meteorologisch Instituut. (n.d.). Uitleg over weercijfers [Explanation of weather grades]. Retrieved March 15, 2018, from https://www.knmi.nl/kennis-en-datacentrum/uitleg/weercijfers

Kriesel, D. (2007). A brief introduction on neural networks.

Li, L. E., Chen, E., Hermann, J., Zhang, P., & Wang, L. (2017, July). Scaling machine learning as a service. International Conference on Predictive Applications and APIs (pp. 14-29).

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing, 234, 11-26.

Magnini, V. P., Honeycutt Jr, E. D., & Hodge, S. K. (2003). Data mining for hotel firms: Use and limitations. Cornell Hotel and Restaurant Administration Quarterly, 44(2), 94-105.

Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (2008). Forecasting methods and applications. John Wiley & Sons.

Masters, T. (1993). Practical neural network recipes in C++. Morgan Kaufmann.

Merino, M., & Ramirez-Nafarrate, A. (2016). Estimation of retail sales under competitive location in Mexico. Journal of Business Research, 69(2), 445-451.

Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), 87-106.

Nielsen, M. A. (2015). Neural networks and deep learning. Determination Press.

Olden, J. D., & Jackson, D. A. (2002). Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecological modelling, 154 (1-2), 135-150.

Parnaudeau, M., & Bertrand, J. L. (2018). The contribution of weather variability to economic sectors. Applied Economics, 1-18.

Ramanathan, U., Subramanian, N., & Parrott, G. (2017). Role of social media in retail network operations and marketing to enhance customer satisfaction. International Journal of Operations & Production Management, 37(1), 105-123.

Rensen, E. (2018, February 6). Data voorspellen vraag naar brood. Retrieved March 1, 2018, from http://www.foodmagazine.nl/achtergrond/artikel/2018/2/data-voorspellen-vraag-naar-brood-1013912

Rojas, R. (1996). The backpropagation algorithm. In Neural networks (pp. 149-182). Springer, Berlin, Heidelberg.

Shibata, M., Inoue, K., Ohtsuka, Y., Fukuyo, K., & Takahashi, M. (2016, September). Towards an efficient R&D theme prediction with machine learning. Management of Engineering and Technology (PICMET), 2016 Portland International Conference (pp. 1935-1941).

Smedlund, A., Ikävalko, H., & Turkama, P. (2018, January). Firm Strategies in Open Internet of Things Business Ecosystems: Framework and Case Study. Proceedings of the 51st Hawaii International Conference on System Sciences.

Stad, H. (2018, March 2). Systemen zijn veel beter in planning dan mensen. Retrieved March 4, 2018, from http://www.logistiek.nl/warehousing/nieuws/2018/3/162520-101162520

Storey, V. C., & Song, I. Y. (2017). Big data technologies and Management: What conceptual modeling can do. Data & Knowledge Engineering, 108, 50-67.

Thompson, J., & Rogers, S. (2017). Analytics: How to Win with Intelligence. Technics Publications.


Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5 (2), 241-259.

Yucesan, M., Gul, M., & Celik, E. (2017). Application of Artificial Neural Networks Using Bayesian Training Rule in Sales Forecasting for Furniture Industry. Wood Industry/Drvna Industrija, 68(3).

Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159-175.


8 Appendix

The appendix has been included to show the more mathematical steps of an Artificial Neural Network and is based on Introduction to Machine Learning by Alpaydin (2014) and the CS229: Machine Learning lecture notes by Ng, a course at Stanford University. It is important to include some general explanation of how an Artificial Neural Network is mathematically constructed. Matthew Lai, a research engineer at Google DeepMind, gave the following answer to the question of whether we really need to understand the math:

You can use them as black boxes to a certain extent, but if your application is pushing the limitations of neural networks, you’ll need the maths to be able to customize and optimize them to better fit your needs.

For the main text of this thesis, the math behind an Artificial Neural Network is therefore out of scope and is included in this Appendix instead. The section starts with the math behind an Artificial Neural Network, followed by a textual explanation of back-propagation, and ends with the mathematical formulas used to evaluate the models.

8.1 Artificial Neural Network

Artificial Neural Networks are models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. They consist of an input layer of nodes, one or more hidden layers and an output layer of nodes. There are different types of neural networks, but the general function of a neural network is described as follows:

$$y_j = f\Big(\sum_i W_{ij} x_{ij}\Big)$$

where $y_j$ is the output of node $j$, $f(\cdot)$ is the transfer function, $W_{ij}$ is the connection weight between node $j$ and node $i$ in the lower layer, and $x_{ij}$ is the input signal from node $i$ in the lower layer to node $j$. Our goal is to feed some input $x$ into a function $f(x)$ that outputs $y$. Formally, $f : x \to y$. One of the simplest possible neural networks is to define $f(x)$ as a single “neuron” in the network, where $f(x) = \max(ax + b, 0)$ for some coefficients $a, b$. What $f(x)$ does is return a single value: $ax + b$ or zero, whichever is greater. A more complex neural network may take the single neuron described above and “stack” such neurons together, so that one neuron passes its output as an input into the next neuron, resulting in a more complex function.
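A minimal sketch of such a neuron and a two-neuron stack in Python (the coefficients are illustrative assumptions, not values from the thesis):

import numpy as np

def neuron(x, a, b):
    # A single "neuron": an affine map a*x + b followed by max(., 0).
    return np.maximum(a * x + b, 0.0)

def stacked(x):
    # Stacking: the output of the first neuron is the input of the second.
    h = neuron(x, a=2.0, b=-1.0)
    return neuron(h, a=-0.5, b=3.0)

print(stacked(1.5))  # neuron(1.5) = 2.0, then max(-0.5*2.0 + 3.0, 0) = 2.0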

Building neural networks is analogous to building with Lego bricks: you take individual bricks and stack them together to build complex structures. The same applies to neural networks: you take individual neurons and stack them together to create complex neural networks. Part of the magic of a neural network is that all you need are the input features $x$ and the output $y$, while the neural network figures out everything in the middle by itself. The process of a neural network learning these intermediate features is called end-to-end learning.

Formally, the input to a neural network is a set of input features $x_1, x_2, x_3$. We connect these three features to three neurons; these three “internal” neurons are called hidden units. The goal of the neural network is to automatically find relevant features such that the features predict the output. The only thing we must offer the neural network is a sufficient number of training examples $(x^{(i)}, y^{(i)})$. Often, the neural network will discover complex features which are very useful for predicting the output but which may be difficult for a human to understand, since they do not have a “common” meaning. This is why some people refer to neural networks as a black box, as it can be difficult to interpret the features they have invented. In Artificial Neural Network modelling, the historical data from the given series serve as the input data and the output is the forecasted data.

The first hidden unit takes the inputs $x_1, x_2, x_3$ and outputs a value denoted by $a_1$. We use the letter $a$ since it refers to the neuron's “activation” value. Let $a^{[1]}_1$ denote the output value of the first hidden unit in the first hidden layer. We use zero-indexing to refer to the layer numbers: in a three-layer Artificial Neural Network, the input layer is layer 0, the first hidden layer is layer 1 and the output layer is layer 2. Again, more complex neural networks may have more hidden layers. Given this mathematical notation, the output of layer 2 is $a^{[2]}_1$. More generally, $a = g(z)$, where $g(z)$ is some activation function.

Example activation functions include:

$$g(z) = \frac{1}{1 + e^{-z}} \quad \text{(sigmoid)}$$
$$g(z) = \max(z, 0) \quad \text{(ReLU)}$$
$$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \quad \text{(tanh)}$$
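These three activation functions translate directly into code; a sketch in Python with NumPy:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # g(z) = max(z, 0), applied element-wise
    return np.maximum(z, 0.0)

def tanh(z):
    # g(z) = (e^z - e^(-z)) / (e^z + e^(-z))
    return np.tanh(z)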

In general, g(z) is a non-linear function. Returning to our neural network from before, the first hidden unit in the first hidden layer will do the following computation:

$$z^{[1]}_1 = W^{[1]\,T}_1 x + b^{[1]}_1 \quad \text{and} \quad a^{[1]}_1 = g(z^{[1]}_1)$$

where $W^{[1]}$ is a matrix of parameters, $W^{[1]}_1$ refers to the first row of this matrix, and $b$ is the bias, which acts as a replacement for the threshold value. Moving on, the output layer performs the computation:

$$z^{[2]}_1 = W^{[2]\,T}_1 a^{[1]} + b^{[2]}_1 \quad \text{and} \quad a^{[2]}_1 = g(z^{[2]}_1)$$


where

$$a^{[1]} = \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix}$$

With the use of matrix algebra, we can stack all the computations of the hidden layer and evaluate them at once:

$$\begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \end{bmatrix} = \begin{bmatrix} W^{[1]\,T}_1 \\ W^{[1]\,T}_2 \\ W^{[1]\,T}_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \end{bmatrix}$$

Expressing this as a single matrix equation:

$$z^{[1]} = W^{[1]} x + b^{[1]}$$

To compute the output layer's activations (i.e., the neural network output):

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \quad \text{and} \quad a^{[2]} = g(z^{[2]})$$
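A minimal sketch of this forward pass in Python (the layer sizes, random weights and choice of tanh are illustrative assumptions, not values from the thesis):

import numpy as np

def forward(x, W1, b1, W2, b2, g=np.tanh):
    # Hidden layer: z[1] = W[1] x + b[1] and a[1] = g(z[1])
    a1 = g(W1 @ x + b1)
    # Output layer: z[2] = W[2] a[1] + b[2] and a[2] = g(z[2])
    return g(W2 @ a1 + b2)

# Illustrative shapes: 3 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
y_hat = forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)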

Finally, given the output of the network $a^{[N]}$, which we will more simply denote as $\hat{y}$, we measure the loss $L(a^{[N]}, y) = L(\hat{y}, y)$, where $L$ is the loss function.

For example, for a real-valued regression you can use the functions:

$$L(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2 \quad \text{(squared error)}$$
$$L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2 \quad \text{(mean squared error)}$$
$$L(\hat{y}, y) = \log\cosh(\hat{y} - y) \quad \text{(log-cosh error)}$$
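These loss functions are equally simple in code; a sketch in Python:

import numpy as np

def squared_error(y_hat, y):
    # L = 0.5 * (y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2

def mean_squared_error(y_hat, y):
    # L = (1/n) * sum_i (y_hat_i - y_i)^2
    return np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2)

def log_cosh_error(y_hat, y):
    # L = log(cosh(y_hat - y)), a smooth alternative to the absolute error
    return np.mean(np.log(np.cosh(np.asarray(y_hat) - np.asarray(y))))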

In order to minimize a cost function, you can use gradient descent as an optimization algorithm to find the values of the parameters (coefficients) of a function $f$. As the model passes through the epochs, it gradually moves towards a minimum where further tweaks to the parameters produce little or zero change in the loss; this state is called convergence. Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.


Figure A. Gradient descent.
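A minimal gradient descent sketch in Python (the learning rate, tolerance and example function are illustrative assumptions, not values from the thesis):

import numpy as np

def gradient_descent(grad, w0, lr=0.1, epochs=100, tol=1e-8):
    # Repeatedly step against the gradient until the updates become negligible.
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        step = lr * grad(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # parameters barely change: convergence
            break
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0])
print(w_min)  # close to 3.0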

8.2 Back-Propagation Neural Network Model

A back-propagation neural network is one of the most powerful neural network types. It has the same structure as the multilayer neural network and uses the back-propagation learning algorithm, one of the most common and popular methods of training neural networks. In the learning process, back-propagation algorithms use the gradient descent method to optimize the learning process. The technique is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers. The structure of a back-propagation neural network is shown in the figure below.


Figure B. Back-propagation network.

The back-propagation neural network is a feedforward, supervised neural network that uses the back-propagation algorithm for learning. Back-propagation is a supervised learning algorithm and is mainly used by multi-layer neural networks to change the weights connected to the network's hidden neuron layer(s). The back-propagation algorithm uses the computed output error to change the weight values in a backward direction. To obtain this error, a forward-propagation phase must have been completed first. While propagating in the forward direction, the neurons are activated using the activation function. It is then calculated how much the weights must be adjusted to make the error as small as possible, and this process continues until some target level of error has been achieved.
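A minimal sketch of one back-propagation step for a network with one sigmoid hidden layer, a linear output and a squared-error loss (an illustrative configuration, not the exact architecture used in the thesis):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward-propagation phase: activate the neurons layer by layer.
    a1 = sigmoid(W1 @ x + b1)
    y_hat = W2 @ a1 + b2              # linear output for regression
    # Backward phase: compute the output error and distribute it back.
    d2 = y_hat - y                    # dL/dz2 for L = 0.5 * (y_hat - y)^2
    d1 = (W2.T @ d2) * a1 * (1 - a1)  # dL/dz1 via the chain rule
    # Adjust the weights in the direction that reduces the error.
    W2 -= lr * np.outer(d2, a1)
    b2 -= lr * d2
    W1 -= lr * np.outer(d1, x)
    b1 -= lr * d1
    return W1, b1, W2, b2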

8.3 Model Evaluation Functions

The purpose of error measures is to get a clear and robust summary of the error distribution. It is common practice to calculate error measures by first applying a loss function (usually eliminating the sign of the individual errors) and then computing an average. Let $y_i$ denote the $i$th observation and $\hat{y}_i$ a forecast of $y_i$. The forecast error is simply $e_i = y_i - \hat{y}_i$, which is on the same scale as the data. Accuracy measures that are based on $e_i$ are therefore scale-dependent and cannot be used to make comparisons between series that are on different scales.


Scale-Dependent Measures

Name                      Abbreviation   Function
Mean Absolute Error       MAE            $\frac{1}{n}\sum_{i=1}^{n}|e_i|$
Median Absolute Error     MDAE           $\mathrm{median}(|e_i|)$
Mean Squared Error        MSE            $\frac{1}{n}\sum_{i=1}^{n}e_i^2$
Root Mean Squared Error   RMSE           $\sqrt{\frac{1}{n}\sum_{i=1}^{n}e_i^2}$
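These four measures translate directly into code; a sketch in Python:

import numpy as np

def error_measures(y, y_hat):
    # Forecast errors e_i = y_i - y_hat_i, on the same scale as the data.
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return {
        "MAE": np.mean(np.abs(e)),
        "MDAE": np.median(np.abs(e)),
        "MSE": np.mean(e ** 2),
        "RMSE": np.sqrt(np.mean(e ** 2)),
    }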


9 Output Models

9.1 Regression Models

9.1.1 Regression Model

Dep. Variable:    FREQUENCY           R-squared:           0.888
Model:            OLS                 Adj. R-squared:      0.880
Method:           Least Squares       F-statistic:         100.8
Date:             Wed, 09 May 2018    Prob (F-statistic):  1.01e-319
Time:             13:48:31            Log-Likelihood:      -833.24
No. Observations: 820                 AIC:                 1786.
Df Residuals:     760                 BIC:                 2069.
Df Model:         60

                                        coef   std err        t   P>|t|    [0.025    0.975]
No_Sales                             -1.3222     0.155   -8.504   0.000    -1.627    -1.017
Prijs                                -4.5497     2.203   -2.066   0.039    -8.874    -0.226
Competitor_1                          0.0330     0.264    0.125   0.900    -0.484     0.550
Competitor_3                          0.2695     0.353    0.763   0.445    -0.423     0.962
Competitor_2                         -0.0317     0.380   -0.083   0.934    -0.778     0.715
DAGNR_WEEK                            0.0489     0.013    3.820   0.000     0.024     0.074
AANTAL_DAGEN_TOT_HALLOWEEN            0.0006     0.001    0.459   0.646    -0.002     0.003
BEVRIJDINGSDAG_VOORUIT               -0.0016     0.001   -1.121   0.263    -0.004     0.001
DODENHERDENKING_VOORUIT               0.0208     0.017    1.227   0.220    -0.012     0.054
HALLOWEEN_ACHTERUIT                   0.0007     0.002    0.276   0.782    -0.004     0.005
KONINGSDAG_VOORUIT                    0.0043     0.002    2.771   0.006     0.001     0.007
VAKANTIE_MEI_MIDDEN_VOORUIT          -0.0005     0.001   -0.589   0.556    -0.002     0.001
AANTAL_DAGEN_TOT_KONINGSDAG           0.0031     0.001    2.772   0.006     0.001     0.005
AANTAL_DAGEN_TOT_VAK_MEI_MIDDEN      -0.0030     0.002   -1.385   0.167    -0.007     0.001
AANTAL_DAGEN_TOT_VOETBAL_EVENEMENT  -3.96e-05  9.04e-05   -0.438   0.661    -0.000     0.000
BETAALDAG_SALARIS_VOORUIT             0.0004     0.003    0.134   0.893    -0.006     0.006
BEVRIJDINGSDAG_ACHTERUIT              0.0199     0.017    1.172   0.241    -0.013     0.053
DODENHERDENKING_ACHTERUIT            -0.0003     0.002   -0.202   0.840    -0.004     0.003
HALLOWEEN_VOORUIT                     0.0010     0.001    0.696   0.486    -0.002     0.004
KONINGSDAG_ACHTERUIT                  0.0043     0.001    3.121   0.002     0.002     0.007
VAKANTIE_HERFST_MIDDEN_VOORUIT       -0.0039     0.003   -1.268   0.205    -0.010     0.002
VAKANTIE_HERFST_NOORD_VOORUIT         0.0007     0.003    0.220   0.826    -0.005     0.007
VAKANTIE_HERFST_ZUID_VOORUIT          0.0018     0.003    0.610   0.542    -0.004     0.008
WEATHER-GRADE                         0.0100     0.018    0.553   0.580    -0.025     0.045
No_Sales_Artikel_27283               -0.0142     0.270   -0.053   0.958    -0.544     0.515
Prijs_Artikel_27283                   3.9547     6.384    0.619   0.536    -8.578    16.487
Competitor_1_Artikel_27283            0.6709     0.599    1.119   0.263    -0.506     1.848
Competitor_3_Artikel_27283            4.6027     5.528    0.833   0.405    -6.250    15.455
Competitor_2_Artikel_27283           -0.7724     0.697   -1.108   0.268    -2.141     0.596
No_Sales_Artikel_28993                0.0307     0.107    0.285   0.776    -0.180     0.242
Prijs_Artikel_28993                  -2.6071     3.834   -0.680   0.497   -10.134     4.920
Competitor_1_Artikel_28993            0.1545     0.105    1.478   0.140    -0.051     0.360
Competitor_3_Artikel_28993           -2.3084     3.142   -0.735   0.463    -8.477     3.860
Competitor_2_Artikel_28993            0.0618     0.130    0.476   0.634    -0.193     0.317
No_Sales_Artikel_41131                0.0690     0.124    0.557   0.578    -0.174     0.312
Prijs_Artikel_41131                  -2.2670     1.842   -1.230   0.219    -5.884     1.350
Competitor_1_Artikel_41131           -0.4548     0.184   -2.476   0.014    -0.815    -0.094
Competitor_3_Artikel_41131           -0.8054     0.291   -2.765   0.006    -1.377    -0.234
Competitor_2_Artikel_41131            1.4051     0.436    3.226   0.001     0.550     2.260
No_Sales_Artikel_49028               -0.3789     0.141   -2.693   0.007    -0.655    -0.103
Prijs_Artikel_49028                   2.4357     2.558    0.952   0.341    -2.585     7.457
Competitor_1_Artikel_49028            0.0267     0.176    0.151   0.880    -0.320     0.373
Competitor_3_Artikel_49028           -0.2208     0.279   -0.792   0.428    -0.768     0.326
Competitor_2_Artikel_49028            0.3301     0.275    1.201   0.230    -0.210     0.870
No_Sales_Artikel_57296               -0.0882     0.159   -0.555   0.579    -0.401     0.224
Prijs_Artikel_57296                   0.2179     0.968    0.225   0.822    -1.683     2.118
Competitor_1_Artikel_57296           -0.0691     0.241   -0.287   0.774    -0.542     0.403
Competitor_3_Artikel_57296           -0.1391     0.317   -0.439   0.661    -0.761     0.483
Competitor_2_Artikel_57296           -0.0154     0.247   -0.062   0.950    -0.500     0.469
No_Sales_Artikel_73813                0.5290     0.519    1.019   0.309    -0.490     1.548
Prijs_Artikel_73813                  -2.4844     2.621   -0.948   0.344    -7.630     2.661
Competitor_1_Artikel_73813           -0.2190     0.280   -0.783   0.434    -0.768     0.330
Competitor_3_Artikel_73813           -0.6066     0.530   -1.145   0.253    -1.647     0.433
Competitor_2_Artikel_73813            0.2199     0.417    0.527   0.598    -0.599     1.039
Competitor_1_Huismerk_350gr          -0.4032     0.287   -1.403   0.161    -0.967     0.161
Competitor_3_Huismerk_350gr          -0.2576     1.923   -0.134   0.893    -4.033     3.518
Competitor_2_Huismerk_350gr           0.8128     1.016    0.800   0.424    -1.182     2.808
Competitor_1_Huismerk_600gr           0.1524     0.258    0.591   0.555    -0.354     0.659
Competitor_3_Huismerk_600gr           1.4354     2.019    0.711   0.477    -2.528     5.399
Competitor_2_Huismerk_600gr          -0.1918     0.832   -0.231   0.818    -1.825     1.441

Omnibus:        56.307   Durbin-Watson:      1.689
Prob(Omnibus):   0.000   Jarque-Bera (JB):  66.296
Skew:           -0.668   Prob(JB):        4.02e-15
Kurtosis:        3.397   Cond. No.        3.79e+05
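For reference, a summary table of this form can be produced with the statsmodels OLS routine. The sketch below uses synthetic stand-in data and a small subset of the regressor names from the table above; the dataframe and its contents are assumptions, not the thesis code:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the prepared sales dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["No_Sales", "Prijs", "DAGNR_WEEK"])
df["FREQUENCY"] = rng.normal(size=100)

X = df[["No_Sales", "Prijs", "DAGNR_WEEK"]]  # no constant, as in the summary above
y = df["FREQUENCY"]
results = sm.OLS(y, X).fit()
print(results.summary())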
