
Integrated demand forecasting and optimisation using recurrent neural networks: An exploratory study

T. L. Breeuwsma

University of Groningen

Supervisors: Prof. dr. R.H. Teunter & A.H. Schrotenboer

April 3, 2019

Abstract

In this paper, I explore the possibilities of using a recurrent neural network (RNN) as an integrated demand forecasting and optimisation model. Specifically, this model is trained using a custom loss function which combines the overage and underage costs to forecast the order-up-to level directly. These order-up-to levels are subsequently evaluated in a newsvendor setting. To assess the performance of the RNNs, three different baseline models are introduced: (1) the empirical quantile model, (2) exponential smoothing with MSE and (3) a feed-forward neural network. All models are evaluated using a real-life dataset containing sales data from a grocery store. The results of the RNNs are promising. For the considered products, the RNN models show the lowest total costs across different cost ratios. All considered, this paper presents sufficient evidence to further explore the possibilities of RNNs for integrating demand forecasting and optimisation in inventory control.

JEL classification: C01, C02, C14, C22, C44, C45

Keywords: newsvendor problem, recurrent neural network, demand forecasting, integrated estimation and optimisation, order-up-to levels

Corresponding author: Thomas Lieuwe Breeuwsma (s2197960), Master Student


Contents

1 Introduction
2 Literature review
  2.1 Classical newsvendor problem
  2.2 Empirical quantile
  2.3 Separated estimation & optimisation
  2.4 Integrated estimation & optimisation
  2.5 Multi-feature newsvendor
3 Problem Formulation
  3.1 Classical newsvendor problem
  3.2 Research design
  3.3 Methodology
    3.3.1 Empirical quantile
    3.3.2 Exponential smoothing
    3.3.3 Feed-forward neural network
    3.3.4 Recurrent neural network
    3.3.5 Custom loss functions
    3.3.6 Hyperparameter optimisation for FFNN and RNN
  3.4 Evaluation
4 Data
  4.1 Preparing the dataset
  4.2 Preprocessing features
5 Results
  5.0.1 Intermediate results
  5.1 Single product performance
    5.1.1 Product with a high seasonal component
    5.1.2 Product with a high residual component
  5.2 Generalisability
  5.3 Product with limited data
6 Conclusion
  6.1 Limitations and further research
Appendices
  A Empirical distribution function
  B Additional results Product 9
  C Seasonality Product 1
  D Seasonality Product 6
  E Service levels and fill rates Product 9
  F Service levels and fill rates Product 6
  G RMSE Product 9

List of Figures

1 Recurrent & feed-forward neural networks with three nodes
2 Fitted polynomial loss function
3 Historical sales of the set of 10 products from the Ta-Feng dataset
4 Additive model decomposition plot of Product 9
5 Forecasting capabilities of RNNs with and without the feature 'weekday'
6 Historical sales Product 9
7 Total costs of predictions for Product 9
8 Order-up-to level S predictions for Product 9
9 Historical sales Product 6
10 Order-up-to level S predictions for Product 6
11 Order-up-to level S predictions using generalised model
12 Historical sales of Products 1 and 9
13 Order-up-to level S predictions for a product with limited historical data
14 Empirical distribution function for Product 9
15 Forecasting capabilities of RNNs with and without the feature 'weekday'
16 Seasonal decomposition Product 1
17 Seasonal decomposition Product 6

List of Tables

1 Total costs for the RNN with and without the additional feature 'weekday'
2 Total costs on Product 9
3 Total costs for Product 6
4 Total cost of the generalised models
5 Total costs for the RNN model trained on a product from the same product category
6 Total number of items purchased
7 Service levels of Product 9
8 Fill rates of Product 9
9 Service levels of Product 6
10 Fill rates of Product 6
11 RMSE Product 9

1 Introduction

When determining inventory levels, the inventory control literature presumes that demand processes follow theoretical distributions. Moreover, it is generally assumed that these distribution functions and their parameters are known. In practice, however, demand distributions are unknown and have to be fitted using a limited amount of historical data. Recent developments show that machine learning techniques, such as artificial neural networks, do not rely on these assumptions; they instead assume that the true demand process is embedded in the historical data. Due to these developments, it is now possible to determine inventory levels without determining demand distributions first. This research paper therefore argues that better inventory levels can be determined by exploring the possibilities of machine learning techniques. More specifically, I explore the possibility of using a recurrent neural network (RNN) to directly determine inventory levels without any assumptions about the distribution of demand.

A natural starting point for this exploration is the classical newsvendor problem, in which a newsboy has to determine how many copies of a newspaper he wants to buy, accounting for the different costs associated with a stock-out and an overstock. First, the newsvendor problem is a rather simple one-period model that assumes uncertainty in demand. Second, it is known to be solved optimally when the demand process is known. In practice, the probability distribution is approximated using historical sales data. Once an acceptable probability distribution has been found, the order-up-to level is determined using the optimality equations. Because probability distributions have to be fitted before order-up-to levels are estimated in this two-step approach, the analytical solution may not be optimal. By integrating the approximation of the demand process with the optimisation of the inventory level, a better inventory level may be predicted.

In this paper, this integration is performed using different RNNs with custom loss functions which combine the underage and overage costs. This way, the RNN forecasts the order-up-to level directly, without relying on theoretical distribution functions. Subsequently, these order-up-to levels are evaluated using the newsvendor costs and compared with more traditional methods such as simple exponential smoothing, empirical quantile estimation and the current state-of-the-art feed-forward neural networks (FFNN).

Historical sales data from a selection of products from the Ta-Feng dataset[1] are used to evaluate the models presented. The Ta-Feng dataset is a publicly available dataset which contains transaction data from a Taiwanese grocery store. The data ranges from November 2000 to February 2001. Three months of transaction data are used to train the models and one month is used to test the predictive power of these models.

[1] The Ta-Feng dataset can be found on:

The findings in this paper are promising. The proposed RNN shows the lowest total cost for all products considered, implying that integrated estimation and optimisation using a RNN may lead to better inventory levels. Moreover, plots reveal that the RNN can reproduce the seasonal and trend patterns of the demand reasonably well. I show that RNNs can be used for products for which limited amounts of data are available, by training on historical data from products in the same product category. Last, the custom polynomial loss function introduced in this paper performs relatively well, while eliminating previous theoretical barriers.

The main contribution of this paper is to integrate forecasting and safety stock optimisation using a RNN, taking into account the uncertainty in demand. This RNN is trained using sales data with the feature 'weekday'. A custom loss function is defined based on the underage and overage costs incurred when running out of stock and throwing away stock, respectively. These models are then compared with more traditional forecasting and optimisation practices such as exponential smoothing and empirical quantile estimation. The theoretical value of the RNN lies in the fact that it could replicate the true demand process when sufficient historical data is available. Furthermore, the model eliminates assumptions about theoretical distributions and the specific need for features to determine inventory levels.

The remainder of this paper is structured as follows. Section 2 provides an overview of existing literature on forecasting and optimising inventory policies for the newsvendor problem. Section 3 presents the problem formulation and methodology used in this study, together with a brief introduction to RNNs. The real-world dataset is introduced in Section 4, after which the forecasting abilities and policies of the different approaches are evaluated in Section 5. Finally, I present the main findings of this paper together with the limitations and ideas for further research in Section 6.

2 Literature review

possibilities of this model by considering the newsvendor problem.

2.1 Classical newsvendor problem

During this research, I consider the classical newsvendor problem. The newsvendor problem is a simple problem, with two cost parameters, which lends itself well to exploratory research. Also, the model can easily be extended to incorporate a more realistic setting. The classical single-period newsvendor problem assumes a setting in which a newsboy has to determine how many copies of a newspaper he wants to order before the beginning of the period. In this setting, demand is stochastic and there are penalty costs associated with ordering too much (overage costs) and ordering too few (underage costs) in relation to the actual demand (Axsäter, 2015). The problem is to minimise costs by ordering an amount that strikes a balance between the underage and overage costs under stochastic demand. In this model, the considered product is often perishable; e.g. outdated copies of a newspaper cannot be sold anymore. The newsvendor problem can be solved analytically when certain assumptions are made about the distribution of the demand. Under these conditions, the optimal inventory level is essentially a quantile of the theoretical demand distribution. In real-life situations, however, the demand distribution and the corresponding parameters are unknown and have to be estimated based on historical data and possible future expectations.

In the literature, there are several approaches to solve the newsvendor problem when demand is unknown. In the following paragraphs, I briefly introduce the most used approaches, which can roughly be divided into the following categories: empirical quantile estimation, separated estimation & optimisation and integrated estimation & optimisation. Furthermore, I introduce relevant additions to the newsvendor problem. Note that all of these approaches acknowledge, implicitly or explicitly, that an optimal inventory level consists of two parts: a forecasted demand and additional units, called safety stock, to account for the uncertainty in demand. The estimate-as-is approach, however, does not include this safety stock and is known to be outperformed by the separated estimation and optimisation approach. Therefore, this estimate-as-is approach is not considered in this research.

2.2 Empirical quantile

The EQ method avoids having to fit a distribution to the data, and therefore it is relatively easy to implement. However, it values observations from the beginning of the period as much as recent observations and therefore neglects trends and historical patterns. Taylor (2000) proposed a solution to this by using a neural network to find the quantile regression value directly without having to fit a distribution first. This approach is subsequently used by multiple authors, for example, Cannon (2011) for daily precipitation forecasting, El-Telbany (2014) for drug activity prediction and Xu, Liu, Jiang, and Yu (2016) for VaR predictions.

2.3 Separated estimation & optimisation

An approach that incorporates the importance of safety stock is the separated estimation and optimisation (SEO) method. First, a forecast is produced and used to fit a distribution. Then the optimisation step is performed by taking a quantile of the distribution function. The forecasting step is essentially the same as in the forecast-as-is solution procedure, where the inventory level is set equal to the expected demand, except that a dispersion parameter is estimated as well. This dispersion parameter, for example the standard deviation of the normal distribution, can be estimated from the historical data. Sometimes, however, the error term from a regression is taken as the dispersion parameter, neglecting the compounding of the data-estimation error with the model-optimality error (Oroojlooyjadid, Snyder, and Takáč, 2016).

The main issue with this approach is that a theoretical distribution has to be fitted to the forecast, whereas forecasts usually do not exhibit a theoretical form. A simplification can be made by using a point estimate together with a dispersion parameter, after which the newsvendor problem can be solved analytically using a theoretical distribution. However, when this distribution is fitted, as in realistic cases, the SEO approach is suboptimal (Rudin and Vahn, 2014).

2.4 Integrated estimation & optimisation

Oroojlooyjadid et al. (2016) consider a neural network with a custom loss function. This neural network is a feed-forward network (FFNN), which differs from RNNs in that there is no feedback cycle between nodes. A FFNN is the simplest form of neural network, and training the network can be thought of as building a function which maps certain inputs (i.e. features) to an output. A loss function is used to train the network and measures the inconsistency between the prediction and the real value. A custom loss function, as Oroojlooyjadid et al. (2016) consider, incorporates the underage and overage costs to make better predictions. Zhang and Gao (2017) extend the work of Oroojlooyjadid et al. (2016) by building the custom loss function as a combination of two rectified linear units (ReLUs). Although an improvement over the loss function used by Oroojlooyjadid et al. (2016), these ReLUs are still non-differentiable at 0 and therefore infeasible.

2.5 Multi-feature newsvendor

An increasing part of recent literature includes multiple features to solve the newsvendor problem. The use of features, when available, has been shown to increase the accuracy of the forecasts. Observations can be clustered based on features (e.g. holidays, sales, weekday, weather forecast). Traditional methods like SEO have been used to forecast based on features (Turken, Tan, Vakharia, Wang, Wang, and Yenipazarli, 2012). However, due to the vast amounts of data, deep neural networks and other machine learning techniques have been used to incorporate features. Yu, Qi, and Zhao (2013) propose a support vector machine (SVM) to forecast newspaper and magazine sales using 32 features. Ali and Yaman (2013) use SVMs to model a sizeable retail dataset to predict sales with over a thousand variables.

The use of vast amounts of features in neural networks has shown excellent performance. Nevertheless, these features are not always available when forecasting. This research, for example, uses a dataset in which, apart from historical sales, only the feature 'weekday' is available.

3 Problem Formulation

3.1 Classical newsvendor problem

The newsvendor model assumes that a company is selling a perishable good. The company needs to place an order before demand is observed. Let $c_u$ and $c_o$ be the underage and overage costs respectively, $y$ the order quantity and $d$ the random demand. The objective is to minimise the expected costs given by the following function:

$$\min_{y \geq 0} \; C(y) = \mathbb{E}_d\left[ c_u (d - y)^+ + c_o (y - d)^+ \right], \quad (1)$$

where $(x)^+ = \max(x, 0)$.

In case of a known cumulative distribution function $F$, the newsvendor problem can be solved analytically. The optimal order quantity $q^*$ is then given by:

$$q^* = \inf \left\{ y : F(y) \geq a \right\}, \quad (2)$$

where

$$a = \frac{c_u}{c_u + c_o} \quad (3)$$

is the fraction of the underage costs over the sum of the underage and overage costs (Axsäter, 2015).
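As a brief worked example of Equations (2) and (3), consider underage costs $c_u = 2$ and overage costs $c_o = 1$, one of the cost structures used later in this paper. The critical fractile is

$$a = \frac{c_u}{c_u + c_o} = \frac{2}{2 + 1} = \frac{2}{3},$$

so the optimal order quantity is the smallest $y$ for which $F(y) \geq 2/3$: the newsvendor deliberately stocks above the median because a stock-out is twice as expensive as an unsold item.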

3.2 Research design

In this research paper, a grocery store with perishable items is mimicked. Since I consider the classical newsvendor problem, orders are placed for a single time period, which is set to a single day.


to be thrown away.

Different underage and overage costs should lead to different order-up-to levels. For example, when uncertain about future demand, one would want to order at least enough products to meet demand when the underage costs are a multiple of the overage costs. I therefore consider different cost ratios based on the underage costs $c_u$ and overage costs $c_o$. It is assumed that $c_u \geq c_o$, i.e. a stock-out is more expensive than an unsold item.

3.3 Methodology

I will elaborate on the methods used in this paper. A selection of methods will be used to compare the best performing policies. These methods will be trained and validated on training data, and their performance will be compared using test data.

3.3.1 Empirical quantile

The first approach is the empirical quantile (EQ). For the application of this method, all observations from the training set are sorted in ascending order. The empirical distribution function $F_n(y)$ is then obtained by:

$$F_n(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{x_i \leq y}, \quad (4)$$

where $x_i$ denotes the $i$th observed demand and $n$ the number of observations. An example of an empirical distribution function can be found in Appendix A. The $a$th quantile (Equations 2 and 3) is selected as the base stock level. This approach recognises that the optimal inventory level is a combination of the expected demand plus safety stock and uses the optimal solution in case of a known distribution.

Drawbacks of this approach are that all observations are valued equally, even in cases with, e.g. trend or seasonality, and that it handles the empirical distribution as if it is the exact distribution.
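To make the procedure concrete, the following minimal Python sketch (names and structure are my own, not the thesis code) sorts the training demand and returns the smallest observation whose empirical CDF value reaches the critical fractile $a$:

```python
import numpy as np

def empirical_quantile_level(demand_history, c_u, c_o):
    """Base stock level as the a-th empirical quantile (Equations 2-4)."""
    a = c_u / (c_u + c_o)                     # critical fractile
    x = np.sort(np.asarray(demand_history))   # ascending observed demands
    n = len(x)
    # smallest y with F_n(y) >= a, i.e. the ceil(a*n)-th order statistic
    idx = max(int(np.ceil(a * n)) - 1, 0)
    return x[idx]

# Example: with c_u = c_o the level is roughly the median demand.
print(empirical_quantile_level([8, 5, 10, 7, 12, 6, 9], c_u=1, c_o=1))
```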

3.3.2 Exponential smoothing


There are two variants of the Holt-Winters (HW) exponential smoothing model: the additive and the multiplicative HW model. The additive model is used when the seasonal component is independent of the average of the time series. The multiplicative model is used when the amplitude of the seasonal component depends on the average of the time series (Kalekar, 2004). Considering our products, presented in Section 4, the additive model is most suitable, also because the multiplicative model is not able to model days with zero demand. The additive HW model assumes time series are represented by:

$$d_t = a + b t + S_t + \epsilon_t, \quad (5)$$

where $d_t$ is the demand in period $t$, $a$ is the average demand, $b$ the linear trend component, $S_t$ the seasonal factor and $\epsilon_t$ the random error component (Kalekar, 2004). Let the length of the season be $L$ periods. The seasonal factors are then defined such that they sum to zero over the length of the season, i.e. $\sum_{1 \leq t \leq L} S_t = 0$. The value of the forecast for period $t$, denoted $y_t$, is given by:

$$y_t = \bar{R}_{t-1} + \bar{G}_{t-1} + \bar{S}_{t-L}. \quad (6)$$

Herein, the parameters $\bar{R}_t$ and $\bar{G}_t$ are updated every period, and $\bar{S}_t$ was last updated $L$ periods ago.[2]

[2] For procedures regarding the updating of parameters I refer to Kalekar (2004).

To obtain a base stock policy, an order-up-to level S has to be derived from the sales forecast. A theoretical distribution function and a dispersion parameter are needed. I choose to rely upon the normal distribution which is often used in theory and practice (Prak, Teunter, and Syntetos, 2017). Mean-squared error (MSE) is used as a dispersion parameter.
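This two-step (SEO) procedure can be sketched compactly with the additive Holt-Winters implementation from statsmodels; the library choice and function below are my assumptions, not the thesis's actual code:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def es_order_up_to(train, c_u, c_o, season_length=7, horizon=28):
    """Additive Holt-Winters forecast plus a normal safety stock."""
    fit = ExponentialSmoothing(train, trend="add", seasonal="add",
                               seasonal_periods=season_length).fit()
    forecast = fit.forecast(horizon)
    # in-sample MSE as the dispersion parameter of the normal distribution
    mse = np.mean((np.asarray(train) - fit.fittedvalues) ** 2)
    a = c_u / (c_u + c_o)
    # order-up-to level S = forecast + z_a * sqrt(MSE), cf. Equations (2)-(3)
    return forecast + norm.ppf(a) * np.sqrt(mse)
```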

3.3.3 Feed-forward neural network

This paper builds upon Oroojlooyjadid et al. (2016) in using an integrated estimation and optimisation approach that uses a feed-forward neural network (FFNN) with a custom loss function. In this approach, the base stock level is forecasted using a loss function which integrates the underage and overage costs of the newsvendor problem. Through this integration, there is no need to estimate a sales forecast and a theoretical distribution. The neural network learns to predict the base stock level using the training set. Essentially, the FFNN constructs a function from a training set of data. In addition to demand data, the FFNN can also take additional features as input.
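A minimal Keras sketch of such a network follows; the layer sizes are illustrative (the actual sizes are chosen by the hyperparameter search of Section 3.3.6), and newsvendor_loss is the custom loss sketched in Section 3.3.5:

```python
from tensorflow import keras

def build_ffnn(n_features, c_u, c_o):
    """Feed-forward network mapping features directly to a base stock level."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),   # e.g. one-hot weekday features
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),              # predicted order-up-to level
    ])
    model.compile(optimizer="adam", loss=newsvendor_loss(c_u, c_o))
    return model
```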

3.3.4 Recurrent neural network

In this paper, I propose a recurrent neural network.[3] RNNs benefit from the fact that they can identify patterns through sequences. These temporal patterns can be identified by making use of an internal memory, a feature that simple neural networks like the ones considered by Oroojlooyjadid et al. (2016) do not possess. This internal memory is the reason that RNNs are widely used for text classification, financial time series and weather forecasts (Xingjian, Chen, Wang, Yeung, Wong, and Woo, 2015; Bao, Yue, and Rao, 2017). For the RNN model considered in this paper I use Long Short-Term Memory (LSTM) layers. This type of layer architecture is known to possess the ability to remember longer temporal patterns (Gers, Schmidhuber, and Cummins, 1999).

[3] This recurrent neural network is implemented in Python using the Keras and TensorFlow libraries. Keras is an interface written in Python to interact with the TensorFlow library. See: www.tensorflow.org and www.keras.io.

Figure 1: Recurrent & feed-forward neural networks with three nodes: The RNN (LHS) shows a loop to the node itself. It therefore produces an output and feeds that output back as an input for the next optimisation iteration. The feed-forward neural network (RHS) produces an output and immediately forgets this output on the next optimisation iteration.

As can be seen in Figure 1,[4] the nodes of the RNN, shown on the left-hand side, have two inputs: the past and the present. For the FFNN nodes, shown on the right-hand side, only the present is given as an input. This is an important distinction, because it means that RNN models can replicate temporal patterns: trends identified in temporal inputs can subsequently be continued in the output. For example, in the sequence {1, 2, 3, 4} the RNN can identify that the next obvious integer in the sequence is {5}.

To make this distinction even more apparent, consider the feature inputs for a forecast for a random Tuesday. For a FFNN network, as considered by Oroojlooyjadid et al. (2016), these can, for example, be: {Day of the week to predict: Tuesday, Expected precipitation: 50 mm, Sale: No, Holiday: No}. Essentially, some of the features are already predicted or forecasted. An input sequence for the RNN model considered in this paper can, for example, be: {Historical sales t - 3: 8, Historical sales t - 2: 5, Historical sales t - 1: 10}. I.e. the RNN processes a sequence of input data to predict an output, whereas the FFNN model performs like a function, mapping input to output.

[4] Original figure posted on Towards Data Science by Niklas Donges:
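A corresponding LSTM sketch is given below; again the layer size is illustrative, and the inputs are windows of τ past periods together with the one-hot weekday features of Section 4.2:

```python
from tensorflow import keras

def build_rnn(tau, n_features, c_u, c_o):
    """LSTM mapping a window of tau past periods to one order-up-to level."""
    model = keras.Sequential([
        keras.Input(shape=(tau, n_features)),  # sequence input
        keras.layers.LSTM(32),                 # internal memory over the window
        keras.layers.Dense(1),                 # predicted order-up-to level
    ])
    model.compile(optimizer="adam", loss=newsvendor_loss(c_u, c_o))
    return model
```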

3.3.5 Custom loss functions

Two custom loss functions will be used, both for the FFNN and the RNN. The first custom loss function is equal to the cost function of the newsvendor problem and was first used as a loss function for neural networks by Oroojlooyjadid et al. (2016):

$$E_t = \min_{y_t} \left[ c_u (d_t - y_t)^+ + c_o (y_t - d_t)^+ \right], \quad (7)$$

where $y_t$ is the decision variable at time $t$ and $d_t$ are the historical sales at time $t$, $t = 1, \ldots, n$, with $n$ the number of historical demand observations. $E_t$ is the loss of period $t$ and $E = \frac{1}{n} \sum_{t=1}^{n} E_t$ is the total loss. Note that a historical dataset is used to train the RNN using an optimiser and the above given loss function. This training set consists of the number of sales for multiple days (periods). A main issue, however, is the fact that the cost function is not differentiable at 0, whereas back-propagation requires a differentiable loss function.

The second custom loss function is a polynomial fitted to the cost function of the newsvendor problem. Keeping in mind Runge's phenomenon (Runge, 1901), a higher-degree polynomial does not always improve accuracy, because oscillation can happen at the edges of the interval. Therefore, a polynomial with a degree just high enough to produce a smooth function over the range of considered demand values is used. This loss function overcomes the fact that the cost function of the newsvendor is not differentiable at $y_t = d_t$. In Figure 2 an example of a fitted polynomial can be found with underage costs $c_u = 2$, overage costs $c_o = 1$ and a degree of 10.
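Both losses can be sketched in TensorFlow/Keras as below. The newsvendor loss is a direct transcription of Equation (7); the polynomial loss fits a degree-10 polynomial (the degree shown in Figure 2) to the cost over a scaled error range, where the exact fitting grid is my assumption:

```python
import numpy as np
import tensorflow as tf

def newsvendor_loss(c_u, c_o):
    """Equation (7): mean of c_u*(d - y)^+ + c_o*(y - d)^+."""
    def loss(d_true, y_pred):
        underage = c_u * tf.maximum(d_true - y_pred, 0.0)
        overage = c_o * tf.maximum(y_pred - d_true, 0.0)
        return tf.reduce_mean(underage + overage)
    return loss

def polynomial_loss(c_u, c_o, degree=10, grid=np.linspace(-1.0, 1.0, 1001)):
    """Smooth polynomial surrogate of the newsvendor cost on scaled errors."""
    cost = c_u * np.maximum(-grid, 0.0) + c_o * np.maximum(grid, 0.0)
    coeffs = np.polyfit(grid, cost, degree)  # highest power first
    def loss(d_true, y_pred):
        e = y_pred - d_true
        value = tf.zeros_like(e)
        for c in coeffs:                     # Horner evaluation of the polynomial
            value = value * e + float(c)
        return tf.reduce_mean(value)
    return loss
```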

3.3.6 Hyperparameter optimisation for FFNN and RNN


Figure 2: Fitted polynomial loss function: The fitted polynomial loss function has a degree of 10, with underage costs cu = 2 and overage costs co = 1.

The number of experiments grows exponentially with the number of hyperparameters to optimise.

The optimisation is done using the Hyperopt[5] package, which uses, among others, random hyperparameter search and a Tree-structured Parzen Estimator (Bergstra, Bardenet, Bengio, and Kégl, 2011). Random hyperparameter search is used since an extensive grid search is too computationally costly. Every recurrent and feed-forward neural network is optimised using 15 different evaluations for every cost ratio. For each evaluation, the Hyperopt algorithm selects the most promising set of hyperparameters based on previous evaluations. In the end, the best combination of hyperparameters, based on the training set, is used to estimate the test set.
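A sketch of the search loop with Hyperopt follows; the search space below is hypothetical (the thesis's exact ranges are not given), and train_and_validate is a stand-in for fitting a network and returning its validation loss:

```python
from hyperopt import Trials, fmin, hp, tpe

space = {  # hypothetical ranges, not the thesis's actual space
    "units": hp.quniform("units", 8, 128, 8),
    "learning_rate": hp.loguniform("learning_rate", -7, -2),
    "batch_size": hp.choice("batch_size", [8, 16, 32]),
}

def objective(params):
    # train_and_validate is a stand-in: fit a model with these
    # hyperparameters on the training set and return the validation loss.
    return train_and_validate(units=int(params["units"]),
                              learning_rate=params["learning_rate"],
                              batch_size=params["batch_size"])

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=15, trials=trials)  # 15 evaluations per cost ratio
```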

3.4 Evaluation

The evaluation of the different methods given above will be performed using the best performing set of hyperparameters on the training set. Subsequently, these optimised models will be evaluated on the test data. The corresponding total cost over the test period will be given by the cost function of the newsvendor problem:

$$C(y) = c_u (d - y)^+ + c_o (y - d)^+, \quad (8)$$

where $c_u$ and $c_o$ denote the underage and overage costs respectively, $d$ denotes the actual demand and $y$ denotes the predicted demand. Different ratios of underage and overage costs will be evaluated.

[5] The Hyperopt package is a Python library for optimising over awkward search spaces with

Moreover, I consider service levels, fill rates and root-mean-square errors (RMSE) corresponding to the order-up-to levels predicted by the different models. I denote the total number of periods by $N$, the inventory on-hand by $y$ and the observed demand by $d$. The service level denotes the percentage of periods in which all demand is satisfied, defined by:

$$\text{Service level} := \frac{\sum_{t=1}^{N} \mathbb{1}_{y_t \geq d_t}}{N} \cdot 100. \quad (9)$$

The fill rate is the percentage of demand that can be satisfied given the inventory on-hand (Axsäter, 2015), given by:

$$\text{Fill rate} := \frac{\min(y, d)}{d} \cdot 100. \quad (10)$$

The RMSE is a metric used to quantify the difference between the observed demand and the predicted values; it is a single metric corresponding to the average magnitude of the error in the predictions. The RMSE is given by:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (d_t - y_t)^2}. \quad (11)$$
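All four metrics are straightforward to compute from the predicted order-up-to levels y and the realised demand d over the test period; a small sketch, where the aggregated form of the fill rate is one common reading of Equation (10):

```python
import numpy as np

def evaluate(y, d, c_u, c_o):
    """Total cost, service level, fill rate and RMSE (Equations 8-11)."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    cost = np.sum(c_u * np.maximum(d - y, 0) + c_o * np.maximum(y - d, 0))
    service_level = np.mean(y >= d) * 100                    # Equation (9)
    fill_rate = np.sum(np.minimum(y, d)) / np.sum(d) * 100   # Equation (10)
    rmse = np.sqrt(np.mean((d - y) ** 2))                    # Equation (11)
    return cost, service_level, fill_rate, rmse
```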

4 Data

An important observation is that the dataset contains sales data (or censored demand). Sales data is only equal to demand when there is sufficient inventory to accommodate all demand. In order to obtain demand data for a product, one could choose to order more than enough inventory in an exploration phase. Although this translates to more initial costs, a business could benefit long-term by the improvement in forecasts and the corresponding cost benefits. I opted not to alter the data, but to use the original sales data because this type of data is most likely to be available in practice. For the remainder of this paper I will refer to the sales data as demand.

The 23812 products from the Ta-Feng dataset are filtered using a number of constraints. Note that the four months correspond to a total of 120 days. First, a total demand of at least 600 items has to be sold, which corresponds to an average of 5 products a day. Second, the products that are considered are purchased on at least 115 of the 120 days. This ensures that enough sales data is present for the models to train upon. The remaining 10 products can be found in Figure 3. Note that only the product IDs are known.

Figure 3: Historical sales of the set of 10 products from the Ta-Feng dataset. The 120 days depicted above correspond to November 2000 to February 2001.

In Figure 4, the observed demand, trend, seasonal component and residual information of Product 9 are depicted. We can see that there is a clear seasonal component ranging over a period of seven days. Likewise, the residual component contains quite some information, meaning the product sales are not entirely trend and seasonal based. Although this is common, some of the information in the residual component can be related to the fact that the demand is censored.

Figure 4: Additive model decomposition plot of Product 9. In this plot the observed demand is decomposed into trend, seasonality and residual information. We can see a clear weekly seasonal pattern.

After the possibilities of the considered models are shown using one single product, I try to generalise the model. The model is subsequently trained and tested on the ten products given in Figure 3.

4.1 Preparing the dataset

Before the data can be used for the (recurrent) neural networks, a few preparations have to be made, which are discussed in this subsection. First, the dataset is adapted in such a way that it is a supervised learning problem. Second, the dataset is split into a training set and a test set. Finally, both sets are scaled based on the training set.

Supervised learning is a machine learning branch in which the problem is constructed with known outputs. Or, as Riedmiller (1994) puts it: "A teacher provides training examples of an arbitrary mapping which the network is to learn". The counterpart, unsupervised learning, generally performs clustering or feature extraction based on the input data (Jordan and Rumelhart, 1992). The dataset is transformed into a supervised learning problem by separating it into an input sequence and an intended output. This operation is best demonstrated by an example. Suppose we have a dataset containing five values: (1, 2, 3, 4, 5). A possible supervised learning problem would consist of an input sequence (1, 2, 3, 4) with an output of (5). An additional parameter, the number of lag values τ to use, can be introduced. With, e.g., lag value τ = 3 this dataset would transform to (1, 2, 3) with corresponding output (4) and (2, 3, 4) with corresponding output (5). This lag value is used to generate more training data for the neural network to train on. Moreover, the lag value can be chosen such that it reflects the seasonal period, in this case a week, τ = 7. This could help the network achieve better performance with fewer training epochs and counters overfitting.
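The windowing operation from the example can be written as a short helper; this is a sketch, not the thesis code:

```python
import numpy as np

def to_supervised(series, tau=7):
    """Map a series to (input window of length tau, next value) pairs.

    With tau=3, (1, 2, 3, 4, 5) becomes X=[(1,2,3), (2,3,4)], y=[4, 5].
    """
    series = np.asarray(series)
    X = [series[i:i + tau] for i in range(len(series) - tau)]
    y = [series[i + tau] for i in range(len(series) - tau)]
    return np.array(X), np.array(y)
```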

The dataset is split into a training set and a test set. The training set is used for fitting the model to the data, whereas the test set is used for the evaluation of performance. A common ratio for the training and test set is 80%/20%. I choose to train the model on the first three months (92 days) and test on the last month (28 days). Next, the training set is split into a training set and a validation set. For each training run, using a different set of hyperparameters, this validation set is randomly chosen using 10% of the data.

The validation set is used, as the name suggests, to validate the model. After each training run with a different set of hyperparameters, the performance is evaluated using the validation set. There is, however, an important distinction between the validation set and the test set. If performance were evaluated on the test set after each training run, the model would indirectly be fitted to the test set. This would result in a better performance on this test set. However, the test set is thought of as unknown (future) data. Validating on the test set would therefore be "unfair" to the other models considered and would skew the results.

Finally, the features are scaled so that they all contribute proportionally to the prediction (e.g. when one feature ranges from 20 to 100 and another feature ranges from 1 to 2, this would skew the proportional impact). Another reason, and one more interesting for this paper, is that back propagation through time[6] (BPTT) converges faster when scaling is involved (Ioffe and Szegedy, 2015). The scaler is fitted to the training set and later applied to both the training and test set. This is, again, in order to make sure no information from the test set is leaked into the training set and the model. Note that the polynomial loss function has to be fitted to scaled values as well, for otherwise (local) minima created through imperfections can be exploited.

[6] Back propagation through time is the learning algorithm that is used to compute the weights for the RNN (Werbos et al., 1990).
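In scikit-learn terms (an assumed but conventional implementation, with train_features and test_features as stand-ins for the feature matrices), the no-leakage rule amounts to fitting the scaler on the training set only:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training set only; reuse the same transform for the test
# set so no information from the test period leaks into the model.
train_scaled = scaler.fit_transform(train_features)
test_scaled = scaler.transform(test_features)
```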

4.2 Preprocessing features

The inputs for the FFNN models consist of the historical sales data and the feature 'weekday'. Neural networks do not support categorical values containing words. Therefore, categorical features are preprocessed using a method called one-hot encoding. It transforms a column of categorical values into multiple columns, where every column corresponds to a categorical value. Every column is binary valued: an observation has a 1 in the column of its categorical value and a 0 in the other columns. For example, an observation corresponding to a Monday has a 1 in the 'Monday' column and 0 in the other columns.
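With pandas, for example, this transformation is a one-liner; the frame below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday"]})
one_hot = pd.get_dummies(df["weekday"], prefix="weekday")
# Produces binary columns weekday_Monday, weekday_Tuesday, ...;
# a Monday row has 1 in weekday_Monday and 0 elsewhere.
```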

5 Results

This section reports and elaborates on the results generated by the four models considered: the empirical quantile (EQ) model, the exponential smoothing (ES) model, the feed-forward neural network (FFNN) and the recurrent neural network (RNN). The costs reported in this section are calculated using Equation (8).

5.0.1 Intermediate results

During testing and optimisation, I found that the RNN did not identify the seasonality pattern in the historical dataset. We can find the intermediate results in Figure 5, where the RNN both with and without the additional feature is depicted. For completeness, a cost structure of (cu/co) = (1/1) is used on the data of Product 9. Since we are able to observe the seasonal pattern without a seasonal decomposition (see observed sales in Figure 4), there was room for improvement. Hence, I tried to give the RNN model additional information by providing the day of the week as an additional feature. This feature was not given beforehand because I expected that the RNN would recognise this behaviour throughout the data. This feature is preprocessed in the same manner as the 'weekday' feature for the FFNN.

As we can see in Table 1, the performance of the RNN model increased substantially by providing the additional feature. The RNN is able to identify the seasonality in the data. Henceforth, the RNN models considered in this paper will use the feature ’weekday’ as an additional input. Note that a time indicating feature like ’weekday’ can always be reproduced from the variable ’dates’ in a historical dataset, and therefore places no additional constraints on the availability of the data.

Figure 5: Forecasting capabilities of RNNs with and without the feature 'weekday'. Note that the historical sales are given as 'Actual' and a cost structure of (cu/co) = (1/1) is used.

Table 1: Total costs for the RNN with and without the additional feature 'weekday', given different cost structures. The product considered is Product 9.

             Cost structure (cu/co)
Model        1/1     2/1     3/1     5/1     10/1
RNNwithout   218.30  274.84  320.03  386.35  429.13

5.1 Single product performance

In this section, I consider the performance of all considered models on two separate products with substantially different characteristics. The first product showed the highest seasonal component of the products. The second product showed less of a seasonal component while having a high residual component, as well as a clear downward trend.

5.1.1 Product with a high seasonal component

Product 9 showed the highest seasonal component of the ten products considered. It is therefore interesting to see how well the models can reproduce the seasonality in their forecasts. The historical sales for Product 9 during the period of 120 days can be found in Figure 6. We can see that the amplitude of the demand peaks is approximately the same over the length of the time series.

Figure 6: Historical sales of Product 9 given a period of four months (November 2000 to February 2001).

In Table 2 the total costs for each method and cost structure (cu/co) are given. The RNN models clearly show the best performance. This superiority in performance is even more clear in Figure 7. Both RNN models show an almost identical output for all cost ratios. However, the RNN model trained using the original cost function of the newsvendor problem shows a more stable performance in comparison with the RNN trained using a polynomial loss function. This unstable performance of the RNNpoly might be a result of: (1) the minimum of the polynomial loss function not being at zero, but shifted a little to the left, as we can see in Figure 2; or (2) the cost function not being fitted accurately enough, which may result in an unsmooth loss function. In other words, the loss function might contain several local minima.

Table 2: Total costs for all the considered methods tested on Product 9, given different cost structures.

           Cost structure (cu/co)
Model      (1/1)   (2/1)   (3/1)   (5/1)   (10/1)
EQ         210.00  290.00  364.00  446.00  553.00
ES         179.32  271.48  310.74  361.07  450.30
FFNN       160.01  231.92  709.17  767.92  947.72
FFNNpoly   179.98  683.86  754.03  812.90  896.43
RNN        150.96  222.52  266.90  324.59  388.87
RNNpoly    164.01  286.22  267.76  310.33  408.14

Moreover, we see that the FFNN models produce relatively inaccurate results when the cost ratio increases. This might relate to their inability to develop patterns over time using the information that is fed while training these models. Through this inability, they can hardly develop new patterns when the relative underage costs increase, in comparison with the RNN models. In Figure 8 the forecasted order-up-to levels are given for the FFNN and the RNN model. Herein, we can see that the FFNN produces the same pattern four times. This relates to the use of a single feature, which limits the model to predicting little more than a distinction between weekdays. The models considered by Oroojlooyjadid et al. (2016) do not show this behaviour, which is a result of the larger number of features provided.

The ES model shows a very stable and robust performance over all cost ratios, as can be seen in Table 2. This is related to the fact that a static inventory policy is considered, i.e. an order-up-to level is determined based on the training set and remains static during the entire test period. We see, as expected, that the performance of the ES model is consistently better than that of the EQ model. Also, the ES model is able to replicate the seasonality of Product 9 quite well, as can be seen in Figure 15 in Appendix B.

Figure 7: Total costs of predictions for Product 9 given different cost structures.

Figure 8: Order-up-to level S predictions for Product 9 by the RNN and FFNN using a cost structure of (cu/co) = (3/1).

5.1.2 Product with a high residual component

In this section, we briefly examine the performance of the EQ, ES, FFNN and RNN models on another product from the set. As we can see in Figure 9, the amplitude of the shocks in the historical sales of Product 6 differs over time. Moreover, the seasonal component is not as clearly present as in the previous section, and the residual information is relatively high. Therefore, I expect the performance of the models to be worse on this time series.


Figure 9: Historical sales of Product 6 given a period of four months (November 2000 to February 2001).


Table 3: Total costs for all the considered methods tested on Product 6, given different cost structures.

         Cost structure (cu/co)
Model    (1/1)   (2/1)   (3/1)   (5/1)    (10/1)
EQ       161.00  275.00  465.00  745.00   1221.00
ES       162.33  309.67  450.20  645.32    890.28
FFNN     337.12  407.09  715.34  770.56    767.12
RNN      140.04  227.26  224.47  397.55    715.43

Figure 10: Order-up-to level S predictions using the EQ, ES, FFNN and RNN models given a cost structure of (cu/co) = (1/1) for Product 6.

5.2 Generalisability

In Table 4, we can see the results of the RNN which is trained on the ten products, denoted RNNgen, and the other models, which were trained and evaluated on Product 9. As expected, the RNN model outperforms the RNNgen. Moreover, all models dominate the generalised model on the lower cost ratios. On the higher cost ratios, only the FFNN performs worse.

Table 4: Total costs for all the considered methods tested on Product 9. The RNNgen model is a generalised model, trained on ten products, whereas the other models are solely trained on Product 9. Different cost structures are considered.

          Cost structure (cu/co)
Model     (1/1)   (2/1)   (3/1)   (5/1)   (10/1)
EQ        210.00  290.00  364.00  446.00  553.00
ES        179.32  271.48  310.74  361.07  450.30
FFNN      160.01  231.92  709.17  767.92  947.72
RNN       150.96  222.52  266.90  324.59  388.87
RNNgen    248.33  341.86  473.51  812.94  811.42

A more in-depth view of the predictions of the generalised model can be found in Figure 11. Based on a cost structure of (cu/co) = (2/1), we see that predictions of the generalised model show somewhat similar behaviour to the actual sales over time. However, in general, all predictions are below the actual demand. This could be a result of the fact that the average product in the set has fewer sales than Product 9.

The ten products that were considered and trained upon were not necessarily from the same product category, nor did all the time series behave in a similar manner. To account for this, the model was optimised using a wider range of nodes and a possible second LSTM layer. The results indicate that: (1) a deeper model with more layers has to be used, (2) more information in the form of features has to be added, or (3) products have to be grouped based on similar demand behaviour. The latter option might lead to grouping products based on product categories.


Figure 11: Order-up-to level S predictions using the generalised RNN model and RNN model. The generalised RNN model is trained using ten products, whereas the RNN model is trained on Product 9. Both models are used to predict the demand of Product 9. A cost structure of (cu/co) = (2/1) is used.

5.3 Product with limited data

In this section, I mimic a newly launched product, called Product 1, for which very limited data is available. This newly launched product is in the same product category as Product 9, which is extensively covered in Section 5.1.1. The historical sales of both products can be found in Figure 12. As we can see, Product 1 contains some of the seasonality that Product 9 includes as well. This might be because both products are from the same product category. However, there is also a clear difference between the two products in terms of sales per day.


Figure 12: Historical sales of Products 1 and 9 given a period of four months (November 2000 to February 2001).

I use the historical data from Product 9 to train a RNN model. Note that Product 1 was the only product in the same product category as Product 9, which is the reason it was picked. Subsequently, this RNN model is used to predict the order-up-to level of this new product. The limited available historical data of Product 1 consists of one week of data. A benchmark used in this section is the empirical quantile model, based on this one week of available data of Product 1. Furthermore, for completeness, I also include the results from the RNN model when full historical data is available, which correspond to the results from Section 5.1.1.


In Table 5 we can see the total costs for the three models. The RNN model which uses training data from Product 9 is given by RNN1, whereas the RNN model with limited historical data is given by RNN2. As expected, the RNN with full historical data (RNN1) outperforms the other two models. However, we can see that using historical data from another product, as is done by RNN2, significantly improves performance in comparison with the EQ model. Moreover, in Figure 13 we can see that RNN2 predicts somewhat the same peaks in demand as RNN1.

Figure 13: Order-up-to level S predictions for Product 9 with limited historical data using the EQ and RNN models. The RNN1 model represents the RNN model trained on the training set of Product 9. The RNN2 model represents the predictions made using a RNN model trained on the data of Product 1, a product from the same product category as Product 9. Furthermore, only a week of historical data is available for the RNN2 and EQ models. A cost structure of (cu/co) = (1/1) is used.

6 Conclusion

In this paper, I explored the use of a recurrent neural network as an integrated demand forecasting and optimisation model in a single-period, single-product setting with underage and overage costs. To assess the performance of the RNN models, three different baseline models are introduced: (1) the empirical quantile model, (2) a separated estimation and optimisation model using exponential smoothing and (3) an integrated estimation & optimisation model using feed-forward neural networks. All models are evaluated using products from the Ta-Feng dataset, which contains four months of data for products from a grocery store. The models are trained on the first three months and tested using the last month.

The results of the recurrent neural networks are promising. For both separately considered products, we saw that the RNN models produced the lowest total costs across cost ratios. I found that the RNN model was able to predict the demand for Product 9, which shows a clear seasonal pattern, reasonably well. However, the demand forecasts for Product 6, with a less predictable pattern and a high residual component in the seasonal decomposition plot, still showed some room for improvement. A single RNN trained to predict multiple (unrelated) products showed weak performance in comparison with the models trained on a single product. The performance of the RNN model trained on one product and tested on a different product from the same category showed promising results. Furthermore, I confirmed that the polynomial loss function is a promising alternative to the cost function. I conclude by stating that the recurrent neural network is a promising tool for integrated forecasting and estimation of order-up-to levels. However, a RNN is not a panacea for demand forecasting.

6.1 Limitations and further research

The first limitation of this research is the fact that historical sales data are used as if they were actual demand data. To infer the real forecasting abilities of the models, (uncensored) demand data should be used. During this research, it was unclear whether sudden drops in sales occurred because demand was low or because the store ran out of stock. Second, four months of historical data may be too little to assess the predictive capabilities of recurrent neural networks, not only because recurrent networks are known to be able to handle large amounts of data, but also because yearly seasonal patterns can only be learned from at least a full year of data. Moreover, in this research, only one feature was available, since the data was originally on a transaction level. In some research papers, thousands of features are considered, which could increase the performance of the RNN models significantly and would therefore be interesting to study. Besides, because of the exploratory character of this research, only a few time series and products are considered. Further research is therefore needed to assess the ability of the RNN model in inventory control in general.

Finally, training generalised models on products from the same product category or with similar characteristics may lead to an increase in performance. Also, the influence of lead times, which increase the uncertainty of the demand predictions, would be an interesting subject on which to assess the performance of RNNs.

References

Ali, Özden Gür and Kübra Yaman (2013). Selecting rows and columns for training support vector regression models with large retail datasets. European Journal of Operational Research 226 (3), 471–480.

Axsäter, Sven (2015). Inventory control, Volume 225. Springer.

Bao, Wei, Jun Yue, and Yulei Rao (2017). A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS one 12 (7), e0180944.

Bergstra, James S, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554.

Bertsimas, Dimitris and Aurélie Thiele (2005). A data-driven approach to newsvendor problems. Working Paper, Massachusetts Institute of Technology.

Cannon, Alex J (2011). Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Computers & Geosciences 37 (9), 1277–1284.

El-Telbany, Mohammed E (2014). What quantile regression neural networks tell us about prediction of drug activities. In Computer Engineering Conference (ICENCO), 2014 10th International, pp. 76–80. IEEE.

Gers, Felix A, Jürgen Schmidhuber, and Fred Cummins (1999). Learning to forget: Continual prediction with LSTM.

Ioffe, Sergey and Christian Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jordan, Michael I and David E Rumelhart (1992). Forward models: Supervised learning with a distal teacher. Cognitive science 16 (3), 307–354.


Prak, Dennis, Ruud Teunter, and Aris Syntetos (2017). On the calculation of safety stocks when demand is forecasted. European Journal of Operational Research 256 (2), 454–461.

Riedmiller, Martin (1994). Advanced supervised learning in multi-layer perceptrons-from backpropagation to adaptive learning algorithms. Computer standards and interfaces 16 (3), 265–278.

Rudin, Cynthia and Gah-Yi Vahn (2014). The big data newsvendor: Practical insights from machine learning.

Runge, Carl (1901). Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik 46 (224-243), 20.

Taylor, James W (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting 19 (4), 299–311.

Turken, Nazli, Yinliang Tan, Asoo J Vakharia, Lan Wang, Ruoxuan Wang, and Arda Yenipazarli (2012). The multi-product newsvendor problem: Review, extensions, and directions for future research. In Handbook of newsvendor problems, pp. 3–39. Springer.

Wang, Pengfei, Jiafeng Guo, Yanyan Lan, Jun Xu, Shengxian Wan, and Xueqi Cheng (2015). Learning hierarchical representation model for next basket recommendation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 403–412. ACM.

Werbos, Paul J et al. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), 1550–1560.

Xingjian, SHI, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo (2015). Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810.

Xu, Qifa, Xi Liu, Cuixia Jiang, and Keming Yu (2016). Quantile autoregression neural network model with applications to evaluating value at risk. Applied Soft Computing 49, 1–12.

Yu, Feng, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan (2016). A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 729–732. ACM.


Zhang, Yanfei and Junbin Gao (2017). Assessing the performance of deep learning algorithms for newsvendor problem. In International Conference on Neural Information Processing, pp. 912–921. Springer.

Appendices

A Empirical distribution function

B Additional results Product 9

Table 6: Total number of items purchased using each method and different cost structures. The actual total demand over the period of four weeks was 604.

           Cost structure (cu/co)
Model      (1/1)   (2/1)   (3/1)   (5/1)   (10/1)
EQ         560.00  672.00  784.00  924.00  1036.00
ES         574.41  721.92  805.41  905.73  1031.68
FFNN       603.82  634.65  717.64  776.39   956.19
FFNNpoly   635.35  692.33  762.50  821.36   904.90
RNN        546.00  630.62  679.70  773.84   858.45
RNNpoly    620.41  617.79  715.14  782.43   867.61


C Seasonality Product 1

D Seasonality Product 6

Figure 17: Seasonal decomposition Product 6

E Service levels and fill rates Product 9

Table 8: Fill rates of Product 9

           Cost structure (cu/co)
Model      1/1    2/1    3/1    5/1    10/1
EQ         85.99  92.77  95.73  98.22  99.10
ES         86.96  93.80  96.64  98.86  99.79
FFNN       90.70  91.98  95.54  97.39  99.70
FFNNpoly   92.09  94.92  97.30  98.33  99.06
RNN        86.98  92.52  94.34  97.38  98.79
RNNpoly    91.79  92.17  95.65  97.66  98.60

F Service levels and fill rates Product 6

Table 9: Service levels of Product 6

         Cost structure (cu/co)
Model    (1/1)  (2/1)   (3/1)   (5/1)   (10/1)
EQ       78.57  96.43   100.00  100.00  100.00
ES       46.43  85.71   100.00  100.00  100.00
FFNN     78.57  100.00   46.43  100.00   96.43
RNN      50.00  71.43    71.43  100.00   92.86


G RMSE Product 9

Table 11: RMSE Product 9

           Cost structure (cu/co)
Model      1/1    2/1    3/1     5/1     10/1
EQ         87.64  91.07  126.50  215.79  323.21
ES         66.45  83.08  117.08  181.46  298.65
FFNN       50.03  47.31   73.20   92.24  206.90
FFNNpoly   61.06  73.95   77.53  109.96  163.30
RNN        49.52  49.82   56.24   91.97  127.83
RNNpoly    54.08  46.83   60.63   85.51  136.68

H RMSE Product 6
