
Sales forecasting of luxury products using LSTM neural networks and ARIMA models


University of Groningen

Master's Thesis Econometrics, Operational Research and Actuarial Studies
Supervisor: Prof. Dr. R.J.M. Alessie


Arend Jansen

6 January 2021

Abstract

Neural networks are becoming an increasingly popular tool for time series forecasting. This thesis, written in co-operation with HTG, compares the forecasting accuracy of ARIMA models with long short-term memory (LSTM) neural networks, in combination with Seasonal and Trend decomposition using Loess (STL), for 10 luxury perfume and cosmetic products. Four models are compared: (S)ARIMA, ARIMA with STL, LSTM networks and LSTM networks with STL. Seven years of data (2012 - 2018) are used to fit the models and 18 months of data (January 2019 - June 2020) are used to compare forecasts. The root mean-squared error (RMSE) and symmetric mean absolute percent error (SMAPE) are used as forecasting accuracy metrics. The (S)ARIMA models generally provide the most accurate forecasts, followed by ARIMA with STL. While the LSTM networks in some cases provide accurate forecasts, their forecasts are often poor and inconsistent.


Contents

1 Introduction
2 Forecast accuracy metrics
3 Data
  3.1 Aggregating the data
  3.2 Data visualization
4 ARIMA modeling
  4.1 Stationarity
  4.2 Fitting (S)ARIMA models
5 LSTM neural networks
  5.1 Neural networks
  5.2 Recurrent neural networks


1 Introduction

Predicting sales for products has always been a topic of interest for firms. Sales forecasting is rapidly becoming one of the most crucial aspects of planning for companies (Mentzer and Cox Jr., 1984). On the one hand, having insufficient inventory means forgoing potential profits, while on the other hand, having excess inventory results in unnecessary storage costs. Therefore it is of critical importance for firms to be able to adequately forecast the demand for their products. Moreover, forecasting becomes exceedingly difficult when demand is highly volatile and seasonal.

This thesis is written in co-operation with HTG, a Dutch wholesale company predominantly trading in luxury goods such as perfume, cosmetics and liquor. For these products (mainly perfume) forecasts are made in order to predict the number of sales in the coming months. Of particular importance for HTG are forecasts four to six months ahead, mostly because of the lead time required to restock their inventory.

As of today, more than 70 different time series techniques for forecasting sales and demand exist (Mentzer and Moon, 2005). Traditionally, ARIMA and Holt-Winters models have been widely used in forecasting time series data (Rahman and Ahmar, 2017). These models are often chosen because of their ability to model trend and seasonal fluctuations in time series data (Alon et al., 2001). The accuracy of these models has often been compared and debated in the scientific literature, for example in Veiga et al. (2014) and Makatjane and Moroke (2016).

A relatively recent development is the use of artificial neural networks (ANNs) in time series forecasting. Artificial neural networks have been used for forecasting demand in a number of fields (Mahbub et al., 2013). A major advantage of neural networks is their flexible nonlinear modeling capability (Zhang, 2003). Moreover, because ANNs are non-parametric, they make few a priori assumptions about the model, which means they are less susceptible to misspecification (Khashei and Bijari, 2011).

A recurrent neural network (RNN) is a special class of neural networks that can be applied to time series problems and other sequential data. The long short-term memory (LSTM) neural network is a special case of an RNN, first developed by Hochreiter and Schmidhuber (1997). Unlike regular RNNs, LSTM networks suffer from fewer computational problems, which allows them to better capture long-term dependencies (Qin et al., 2017). In various studies, such as Sagheer and Kotb (2019) and Abbasimehr et al. (2020), LSTM networks have been shown to produce more accurate forecasts than more conventional networks.


In this thesis the forecasting accuracy of LSTM networks is compared to that of ARIMA and seasonal ARIMA models, which have already been used to forecast sales in the past. The monthly number of sales of the top 10 most sold products will be forecast and the forecast accuracy of each model will be evaluated.

This thesis attempts to contribute to the existing literature that compares ARIMA models to LSTM networks, such as Yu et al. (2018), Palkar et al. (2020) and Elmasdotter and Nyströmer (2018). Contrary to the aforementioned literature, which focuses on forecasting retail sales with short forecasting windows of days or weeks, this thesis focuses on larger sales volumes and on a much longer forecasting window of six months. Hence this thesis contributes to the literature by applying these models to more sizable time series data and a longer forecasting window. Moreover, as far as we know, there exists little literature on making forecasts for large sales of perfume or cosmetic products.

One of the main goals of this thesis is to investigate whether a complex nonlinear LSTM network is more accurate in its forecasts than more traditional time series methods. The intuition behind this hypothesis is that a LSTM network might be capable of learning complex relationships between the features of the data that regular statistical models fail to capture. This thesis compares the accuracy of one-step ahead forecasts as well as multi-step ahead forecasts. The main metric used to compare the accuracy of the forecasts is the root mean-squared error (RMSE); the reasons for using this metric are discussed in section 2.

The sales data used in this thesis will be temporally divided into a train and a test data set. The train data set consists of the first part of the time series (2012 - 2018) and will be used to fit the employed models, while the test data set consists of the most recent part of the time series (January 2019 - June 2020) and will be used to evaluate the forecasting accuracy of said models. A total of 10 products will be used as a benchmark to compare the forecast accuracy of the models in this thesis. The products selected for review are the 10 products with the highest total number of units sold in our data.

In addition to the new models, decomposition through STL will also be applied to the data. STL is a method to decompose time series into seasonal and nonseasonal components. Following this decomposition, the seasonal and nonseasonal components can be forecast using separate models.

In total, four different kinds of forecasting methods will be compared in this thesis: (S)ARIMA, ARIMA with STL, LSTM neural networks and LSTM neural networks with STL. The remainder of this thesis is structured as follows: first a review of the forecast evaluation metrics is provided (section 2), followed by a discussion of the data employed in this thesis (section 3). Next, ARIMA models and time series stationarity are discussed (section 4), followed by neural networks and LSTMs (section 5). Then the forecasting procedures are explained in detail (section 6). Finally, results are provided and discussed (section 7) and conclusions are delivered (section 8). Additionally, extra tables are provided (section 9) as well as an Appendix (section 10).

2 Forecast accuracy metrics

In order to compare the accuracy of forecasts made by various models, the forecast error is considered. The forecast error e_t at time t is the difference between the observed value y_t and the predicted value ŷ_t, so e_t = y_t - ŷ_t. There exist a number of different metrics to compare forecast accuracy; the most commonly used metrics are reviewed in the remainder of this section. Suppose we have a sample of forecasts ŷ_t and a sample of observed values y_t for t = (1, ..., T).

The root mean-squared error (RMSE) is given by

RMSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_t^2}.  (1)

The mean absolute error (MAE) is given by

MAE = \frac{1}{T}\sum_{t=1}^{T} |e_t|.  (2)

The mean absolute percent error (MAPE) is given by

MAPE = \frac{1}{T}\sum_{t=1}^{T} \left| \frac{e_t}{y_t} \right|.  (3)

The symmetric mean absolute percent error (SMAPE) is given by

SMAPE = \frac{1}{T}\sum_{t=1}^{T} \frac{|e_t|}{(|y_t| + |\hat{y}_t|)/2}.  (4)

And finally, the mean absolute scaled error (MASE) is given by

MASE = \frac{1}{T}\sum_{t=1}^{T} \frac{|e_t|}{\frac{1}{T-1}\sum_{i=2}^{T} |y_i - y_{i-1}|}.  (5)

The RMSE and MAE are scale-dependent errors, meaning that they cannot be used to compare time series of a different scale or magnitude; these errors therefore cannot be used to compare the forecast accuracy across different products. The MAPE, SMAPE and MASE are scale-free, which means they can be used to compare forecast accuracy between different products.

The MAPE has a number of disadvantages. First of all, it cannot be used when any y_t = 0 occurs in the data, as this leads to a division by zero. Moreover, it has an extremely skewed distribution when the observed value y_t is close to zero (Hyndman and Koehler, 2006). Furthermore, the MAPE penalizes negative errors more than positive ones, resulting in a preference for forecasts that are too low over forecasts that are too high compared to the observed value. For these reasons it was decided not to include the MAPE in our forecast evaluation.

The SMAPE is a percentage-based, scale-free error like the MAPE, although it should not be interpreted as an exact percentage. It can take on a value between 0 and 2, with a lower value indicating a better forecast. Similarly to the MAPE, the SMAPE also punishes negative errors more than positive ones. The SMAPE, however, does not suffer from the division-by-zero problem of the MAPE and is therefore much easier to use. The SMAPE was therefore used as a secondary accuracy metric to evaluate our forecasts.

The MASE gives an indication of the performance of our forecast relative to a 'naive' forecast. 'Naive' in this context means that the forecast is equal to the last observed value. If the value of the MASE is greater than 1, the forecast was worse than a naive forecast; if the MASE is smaller than 1, the forecast was better than a naive forecast. Because the MASE cannot be selected as a loss metric for training a neural network, it was not utilized.


The RMSE penalizes large forecast errors more heavily than the MAE does. Because the nature of this research is forecasting sales containing large outliers, we believe that the RMSE is a more suitable forecast accuracy metric than the MAE. Thus it was decided to use the RMSE as the primary forecasting accuracy metric, with the SMAPE being used as a secondary metric.
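As a small illustration of how the two chosen metrics are computed, a minimal Python/NumPy sketch of equations (1) and (4) is given below (illustrative code with made-up numbers, not part of the thesis itself):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean-squared error of forecasts y_hat against observations y, equation (1)."""
    e = np.asarray(y, float) - np.asarray(y_hat, float)
    return np.sqrt(np.mean(e ** 2))

def smape(y, y_hat):
    """Symmetric MAPE, equation (4); takes values in [0, 2], lower is better."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

# Example usage with hypothetical monthly sales
y_true = np.array([1200, 900, 1500, 700])
y_pred = np.array([1000, 950, 1400, 900])
print(rmse(y_true, y_pred), smape(y_true, y_pred))
```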

3 Data

The data that will be used in this thesis is private data collected by HTG. The data of interest is contained in two tables. The first table contains information on the sales made by the company and the second table contains information on the stock level of their products.

The first table contains all the completed order lines (sales) for 30 distinct products between January 2012 and June 2020. While HTG trades in significantly more products, these were the products that were most traded and thus of most importance to accurately forecast. Each order line has an anonymous product id and debtor id to illustrate which product was sold and who was the purchaser. Furthermore, for each order line the price at which the product was sold is also available, as well as to what address the order was sent. Table 1 provides a short description of the variables in this table.

Table 1: Data on the completed sales between January 2012 and June 2020.

Variable        Description
product id      id number of the product (anonymous)
debtor id       id number of the debtor of the order (anonymous)
quantity        number of units that were sold
price           base price at which the product was sold, in euros
date            invoice date at which the order was received
afleveradresnr  type of address that the order was sent to

The second table contains the daily inventory level for every product between January 2012 and June 2020. It lists the stock level together with a start and end date for that stock level. All start and end dates in this table are consecutive, meaning that the stock level of a product is available for every date between April 2016 and June 2020. In Table 2 a short description of the variables in this table is given.

Table 2: Data on the inventory level of each product between January 2012 and June 2020.

Variable    Description
product id  id number of the product (anonymous)
stock       stock level
start       start date of the given stock level
end         end date of the given stock level

3.1 Aggregating the data


First, the order lines related to dropshipping were excluded. Dropshipping here means that a debtor orders a product at HTG whenever they themselves make a sale. These order lines were left out because they are small (often consisting of a single unit), happen on a regular basis and hence are quite predictable, whereas this thesis focuses on predicting large and extreme sales. Exclusion of these order lines was achieved by leaving out orders with a certain afleveradresnr value in Table 1.

After leaving out the order lines related to dropshipping, we decided on temporally aggregating the data. The data in Table 1 was collected on a daily level, often with multiple observed order lines for a given day. In order to obtain a valid time series it was decided to aggregate the data over time. For this temporal aggregation we considered daily, weekly and monthly epochs. Aggregating the data by month is considered the best option, since it is most important for HTG to make accurate forecasts roughly four to six months ahead. Although aggregating by week results in more available observations in the time series, it is much more difficult to make accurate forecasts six months ahead in terms of weeks rather than months. Therefore it was decided to aggregate the data by month.

After aggregation and some editing/grouping of the data through SQL the final data set used further on in this thesis is described in Table 3.

Table 3: Variables in the final aggregated data set.

Variable   Description
product id id number of the product (anonymous)
quantity   number of units sold in a month
avg stock  average stock level during the month
avg price  average price per unit sold in the month
year       current year in the data
month      current month in the data

Finally, after aggregating the data we decided which products would be used as benchmarks for forecasting accuracy. It was decided to use the 10 products with the highest total quantity of units sold out of the 30 total products.
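For illustration, a minimal pandas sketch of the monthly aggregation described above is given below. The thesis performs this step in SQL; the tiny in-memory table, the column names and the afleveradresnr code marking dropshipping are all hypothetical placeholders.

```python
import pandas as pd

# Tiny illustrative order-line table with the columns of Table 1
orders = pd.DataFrame({
    "product_id": [12915, 12915, 12917, 12915],
    "quantity":   [300, 50, 120, 700],
    "price":      [19.5, 20.0, 35.0, 18.9],
    "date":       pd.to_datetime(["2012-01-03", "2012-01-20", "2012-01-05", "2012-02-11"]),
    "afleveradresnr": [1, 2, 1, 1],   # 2 = dropshipping (placeholder code)
})

# Exclude dropshipping order lines, then aggregate to one row per product per month
monthly = (
    orders[orders["afleveradresnr"] != 2]
    .assign(year=lambda d: d["date"].dt.year, month=lambda d: d["date"].dt.month)
    .groupby(["product_id", "year", "month"], as_index=False)
    .agg(quantity=("quantity", "sum"), avg_price=("price", "mean"))
)
print(monthly)
```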

3.2 Data visualization


Figure 1: Plots of the monthly number of sales between January 2012 and June 2020 for the 10 products that will be evaluated.


Table 4: Summary statistics of the top 10 most sold products.

Id     Total quantity sold  Number of order lines  Average monthly sales  Maximum monthly sales  Monthly sales std. dev.
12915  143239               5422                   1404                   12150                  2481
12917  78648                5139                   771                    5723                   1102
12922  86095                9890                   844                    5879                   1164
12925  108211               9323                   1060                   7933                   1475
12927  108311               5801                   1061                   11856                  1733
12929  72003                2159                   706                    6725                   1330
12934  127314               8697                   1248                   13095                  1768
13330  102616               9580                   1006                   6797                   1090
14257  91811                5613                   900                    7464                   1137
15241  77173                3632                   757                    7330                   1209

This table contains the total number of units sold, the number of sales that were made, the average monthly number of units sold, the largest number of units sold in a month and the standard deviation of the monthly number of units sold for the 10 products between January 2012 and June 2020.

4 ARIMA modeling

In this section the mathematical framework of ARIMA modeling will be discussed. ARIMA models are statistical models for time series data that use past information to predict future values. We begin by analyzing the simple autoregressive model and then move on to more complicated ARIMA structures.

First, we introduce the lag operator L. For a given process X_t, the lag operator L^p for p = (1, ..., t) defines the p-lagged value of the process. Hence L^p X_t = X_{t-p}.

One of the most basic variants of a time series model is the autoregressive (AR) model. In the autoregressive model the current value of a time series X_t is explained by its previous values (X_{t-1}, ..., X_{t-p}). The autoregressive model of order p can be written as

X_t = \mu + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t,  (6)

where \varepsilon_t \sim WN(0, \sigma^2) (WN stands for white noise). In terms of the lag operator the AR(p) model can be written as

\Bigl(1 - \sum_{i=1}^{p} \phi_i L^i\Bigr) X_t = \mu + \varepsilon_t.  (7)

The moving average (MA) model is a model where the current value of a time series X_t is modeled as a weighted sum of its past error terms (\varepsilon_{t-1}, ..., \varepsilon_{t-q}). The moving average model of order q can be written as

X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},  (8)

where \varepsilon_t \sim WN(0, \sigma^2). In terms of the lag operator the MA(q) model can be written as

X_t = \mu + \Bigl(1 + \sum_{j=1}^{q} \theta_j L^j\Bigr)\varepsilon_t.  (9)


Combining the AR(p) and MA(q) models yields the ARMA(p,q) model, which can be written as

X_t = \mu + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},  (10)

where \varepsilon_t \sim WN(0, \sigma^2). In terms of the lag operator the ARMA(p,q) model can be written as

\Phi(L) X_t = \mu + \Theta(L)\varepsilon_t,  (11)

where

\Phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \qquad \Theta(L) = 1 + \sum_{j=1}^{q} \theta_j L^j.  (12)

Once an ARMA(p,q) model has been fitted, it can be used for recursive forecasting. Forecasts are made using the model's fitted coefficients and past observations and errors of the time series. A one-step ahead forecast of an ARMA(p,q) model is given by

\hat{X}_{t+1} = \mu + \sum_{i=0}^{p-1} \hat{\phi}_{i+1} X_{t-i} + \sum_{j=0}^{q-1} \hat{\theta}_{j+1} \varepsilon_{t-j}.  (13)

From the one-step ahead forecast in equation (13), a two-steps ahead recursive forecast can be calculated using the one-step forecast as

\hat{X}_{t+2} = \mu + \hat{\phi}_1 \hat{X}_{t+1} + \sum_{i=0}^{p-2} \hat{\phi}_{i+2} X_{t-i} + \sum_{j=0}^{q-2} \hat{\theta}_{j+2} \varepsilon_{t-j}.  (14)

Equations for higher-order recursive forecasts can be derived in the same way as equations (13) and (14).

The ARIMA model is a popular time series model first presented by Box and Jenkins (1970). The ARIMA model extends the ARMA model by incorporating an order of integration. The ARIMA model is defined by three parameters: p, d and q. The parameter p defines the order of the autoregressive part of the model, q defines the order of the moving average part and d defines the integration order. Given a time series X_t, where t is an integer index, the ARIMA(p,d,q) model is given by

\Phi(L)(1 - L)^d X_t = \mu + \Theta(L)\varepsilon_t,  (15)

where

\Phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \qquad \Theta(L) = 1 + \sum_{j=1}^{q} \theta_j L^j.  (16)

In order to forecast an ARIMA model using equation (15), we can write Y_t = (1 - L)^d X_t, so that we obtain

\Phi(L) Y_t = \mu + \Theta(L)\varepsilon_t,  (17)

which is an ARMA model in Y_t; this model can be forecast as described above, after which the forecasts of Y_t can be transformed back into forecasts of X_t.


The seasonal ARIMA (SARIMA) model is an extension of the regular ARIMA model. It is a two-part model with a seasonal part and a nonseasonal part, which are combined multiplicatively. The order of a SARIMA model is defined as (p, d, q) × (P, D, Q)_m, where (p, d, q) are the orders of the nonseasonal part of the model, (P, D, Q) are the orders of the seasonal part and m is the seasonal frequency of the time series (m = 12 in the case of monthly data).

In terms of the lag operator the SARIMA (p, d, q) × (P, D, Q)_m model can be expressed as

\Phi(L)\Psi(L^m)(1 - L)^d (1 - L^m)^D Y_t = \mu + \Theta(L)\Omega(L^m)\varepsilon_t,  (18)

where

\Phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \quad \Psi(L^m) = 1 - \sum_{i=1}^{P} \psi_i L^{m i}, \quad \Theta(L) = 1 + \sum_{j=1}^{q} \theta_j L^j, \quad \Omega(L^m) = 1 + \sum_{j=1}^{Q} \omega_j L^{m j}.  (19)

The SARIMA model of equation (18) looks quite complicated, but in reality its structure does not deviate much from that of a regular ARIMA model. For example, a SARIMA (1, 0, 1) × (1, 0, 0)_{12} model has the following equation:

Y_t = \mu + \phi_1 Y_{t-1} + \psi_1 Y_{t-12} - \phi_1\psi_1 Y_{t-13} + \varepsilon_t + \theta_1 \varepsilon_{t-1}.  (20)

Forecasting with a SARIMA model is similar to forecasting with a regular ARIMA model, using the model's fitted coefficients and past observations and residuals of the time series.
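To make the recursive forecasts of equations (13), (14) and (20) concrete, a small NumPy sketch of a six-step ahead forecast for the SARIMA (1, 0, 1) × (1, 0, 0)_{12} example is given below, with made-up coefficient values and synthetic data (illustrative only; the thesis estimates its models in R).

```python
# Toy recursive forecasting with the SARIMA(1,0,1)x(1,0,0)_12 model of equation (20),
# using hypothetical fitted coefficients and a synthetic series.
import numpy as np

mu, phi1, psi1, theta1 = 50.0, 0.6, 0.4, 0.3      # made-up fitted coefficients
rng = np.random.default_rng(7)
y = list(1000 + rng.normal(0, 50, 84))            # observed series (synthetic)
eps = list(rng.normal(0, 50, 84))                 # in-sample residuals (synthetic)

forecasts = []
for h in range(6):                                # six-step ahead recursive forecast
    y_lag1 = forecasts[-1] if h >= 1 else y[-1]   # reuse earlier forecasts, as in eq. (14)
    y_lag12 = y[len(y) - 12 + h] if h < 12 else forecasts[h - 12]
    y_lag13 = y[len(y) - 13 + h] if h < 13 else forecasts[h - 13]
    eps_lag1 = eps[-1] if h == 0 else 0.0         # future errors have expectation zero
    forecasts.append(mu + phi1 * y_lag1 + psi1 * y_lag12
                     - phi1 * psi1 * y_lag13 + theta1 * eps_lag1)

print(forecasts)
```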

4.1 Stationarity

Classical time series models such as ARIMA rely on the assumption of stationarity, meaning that the statistical properties of a time series such as the mean, variance and autocovariance are constant over time. The assumption of stationarity in a time series can be violated in the presence of a trend or seasonality in the data. Nonstationary time series data can often be made stationary by applying transformations to the time series.

Two common variants of nonstationary processes which can easily be converted into stationary ones are trend-stationary and difference-stationary processes. A trend-stationary process is a process which becomes stationary after removing a deterministic time trend, while a difference-stationary time series is a time series whose first differences constitute a stationary process. Furthermore, a level-stationary time series is a time series which is stationary around a constant but non-zero mean.


The KPSS test is based on the idea that a time series can be decomposed into a deterministic trend, a random walk and a stationary error component (Kwiatkowski et al., 1992). The KPSS test for trend-stationarity is based on equation (21):

y_t = \zeta t + r_t + \varepsilon_t,  (21)

where r_t = r_{t-1} + u_t, u_t is i.i.d. (0, \sigma_u^2) and \varepsilon_t is a stationary error.

Alternatively, ζ = 0 in equation (21) can be specified to test for level-stationarity instead of trend-stationarity. In the KPSS test the null hypothesis H_0: σ_u² = 0 is tested against the alternative hypothesis of a unit root, H_a: σ_u² > 0. Under the null hypothesis the time series is trend- or level-stationary, while under the alternative hypothesis it is not. Equation (21) is estimated for the time series by OLS, its residuals e_t are calculated and the variance estimate σ̂² is computed. The test uses a one-sided Lagrange multiplier (LM) test statistic, defined in equation (22). At α = 0.05 the critical value for the level-stationary variant is 0.463 and the critical value for the trend-stationary variant is 0.153. If the KPSS test statistic of equation (22) is greater than these critical values, the null hypothesis is rejected at the α = 0.05 significance level.

KPSS_{LM} = \frac{\sum_{t=1}^{T} S_t^2}{\hat{\sigma}^2},  (22)

where S_t = \sum_{i=1}^{t} e_i for t = (1, ..., T).

The Augmented Dickey-Fuller (ADF) test is used to test for the presence of a unit root in time series data. There are three versions of the ADF test, defined in equation (23):

\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{j=1}^{p} \delta_j \Delta y_{t-j} + \varepsilon_t
\Delta y_t = \alpha + \gamma y_{t-1} + \sum_{j=1}^{p} \delta_j \Delta y_{t-j} + \varepsilon_t  (23)
\Delta y_t = \gamma y_{t-1} + \sum_{j=1}^{p} \delta_j \Delta y_{t-j} + \varepsilon_t.

The main differences between the versions of the ADF test in equation (23) are the presence of a constant α and a time trend βt. Furthermore, p is the number of lags of the dependent variable included in the test. In the ADF test the null hypothesis H_0: γ = 0 of a unit root is tested against the alternative hypothesis H_a: γ < 0 of level- or trend-stationarity. The test statistic of the ADF test is defined in equation (24):

ADF_{\tau} = \frac{\hat{\gamma}}{S.E.(\hat{\gamma})}.  (24)
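For illustration, both tests are available in Python's statsmodels package; a minimal sketch on a synthetic series is given below (the thesis itself performs these tests in R).

```python
# Minimal illustration of the KPSS and ADF stationarity tests with statsmodels
# on a synthetic random-walk-like series.
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=102)) + 50                  # synthetic nonstationary series

adf_stat, adf_pvalue, *_ = adfuller(y, regression="c")    # H0: unit root
kpss_stat, kpss_pvalue, *_ = kpss(y, regression="c", nlags="auto")  # H0: level-stationary

print(f"ADF statistic {adf_stat:.2f} (p = {adf_pvalue:.3f})")
print(f"KPSS statistic {kpss_stat:.2f} (p = {kpss_pvalue:.3f})")
```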

Next, some transformation methods for nonstationary time series will be discussed.

First, differencing the data (\Delta y_t = y_t - y_{t-1}) is a common technique used to stabilize the mean of the time series or to make a difference-stationary time series stationary.

Box-Cox transformations are frequently used to stabilize the variance of a time series. A Box-Cox transformation of variable y with transformation parameter λ is defined in equation (25):

y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(y) & \text{if } \lambda = 0. \end{cases}  (25)


A further approach to handling nonstationary, seasonal time series is decomposition, as described in Theodosiou (2011). The idea behind this method is that a time series can be decomposed into three components: a trend-cycle component, a seasonal component and a noise component. Additive decomposition of a time series y_t is described in equation (26):

y_t = T_t + S_t + E_t,  (26)

where T_t is the trend-cycle component, S_t is the seasonal component and E_t is the noise component.

The idea behind decomposition-based forecasting is to forecast the seasonal component S_t and the nonseasonal component A_t = T_t + E_t separately, and then to add the forecasts of the two components together to obtain the forecast of the actual time series. The nonseasonal component can be forecast by a regular time series model such as ARIMA. The seasonal component is usually forecast with the seasonal naive method, which is quite simple: the h-periods ahead forecast \hat{Y}_{t+h} of a time series Y_t is given in equation (27):

\hat{Y}_{t+h} = Y_{t+h-m},  (27)

where m is the number of seasonal periods in the data (12 in the case of monthly data).

Common methods of decomposition include classical time series decomposition and 'Seasonal and Trend decomposition using Loess' (STL), developed by Cleveland et al. (1990). STL is a more robust form of decomposition and is able to handle outliers well. Although a full description of the STL algorithm is beyond the scope of this thesis, a general and simplified description will be provided. The STL algorithm is a convergence algorithm that repeats four steps until convergence is achieved. These four steps are explained below:

1. The extraction of noise from the time series by applying bilinear filtering.

2. Seasonal differencing of the data and removal of the trend component by solving a least absolute deviations regression of the deseasonalized series.

3. Calculating the seasonal component by means of non-local seasonal filtering.

4. Final adjusting to make the decomposition unique.
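A minimal Python sketch of an STL decomposition and a seasonal naive forecast of the seasonal component, using statsmodels on a synthetic monthly series, is given below (illustrative; the thesis uses the R implementation).

```python
# Illustrative STL decomposition in Python; the series is synthetic monthly data with period 12.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2012-01-01", periods=84, freq="MS")
y = pd.Series(1000 + 30 * np.arange(84)
              + 200 * np.sin(2 * np.pi * idx.month / 12)
              + np.random.default_rng(1).normal(0, 50, 84), index=idx)

res = STL(y, period=12, robust=True).fit()
seasonal = res.seasonal                      # S_t
nonseasonal = res.trend + res.resid          # A_t = T_t + E_t, to be forecast with e.g. ARIMA

# Seasonal naive forecast of the seasonal component: repeat last year's values (eq. 27)
seasonal_forecast = seasonal.iloc[-12:].to_numpy()[:6]
print(seasonal_forecast)
```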

4.2 Fitting (S)ARIMA models

The fitting and estimation of (S)ARIMA models in this thesis was done through the auto.arima() function available in the forecast package in R, developed by Hyndman and Khandakar (2008). This function aims to find the best possible ARIMA or SARIMA model for the data it receives, based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The ARIMA and SARIMA models are fitted by the auto.arima() function through maximum likelihood estimation. The auto.arima() function roughly applies the following steps in order to find a suitable (S)ARIMA model:

1. First, a Box-Cox transformation is applied to the data if this has been specified in the auto.arima() function. The λ parameter of the Box-Cox transformation can be specified manually, or it can be generated automatically through the BoxCox.lambda() function, which finds the λ parameter that minimizes the variation between subsets of the data. It is also possible to skip this step and not apply any Box-Cox transformation at all. Furthermore, a bias adjustment to the fitted values and forecasts can be made when a Box-Cox transformation is applied. This bias adjustment can be necessary due to Jensen's inequality: for λ = 0 (log) we have that log(E(x)) ≥ E(log(x)), hence our back-transformed forecast values will be smaller than they should be, because median forecasts are being made instead of mean forecasts. To circumvent this problem a bias adjustment can be applied, which multiplies the back-transformed forecasts by a correction factor.

2. Next, the auto.arima() function determines whether the data needs to be differenced (and seasonally differenced in the case of SARIMA models) in order to make it stationary. This is determined by applying KPSS and ADF tests repeatedly. The integration orders (d, D) of the (S)ARIMA models that are considered further on are determined by the number of times the data was differenced to make it stationary.

3. Finally, the best-order ARIMA or SARIMA model is obtained. To find the best model the algorithm estimates many different (p, q) orders of ARIMA models and (p, q, P, Q) orders of SARIMA models and chooses the model with the lowest AIC or BIC value. This can be done with a stepwise algorithm that limits the search space over potential models, or exhaustively by searching over all possible models up to a certain predefined maximum order. The stepwise algorithm is considerably faster but less thorough in its search. It is still possible for the auto.arima() function to choose a regular ARIMA model over a SARIMA model if that results in lower AIC/BIC values. Alternatively, it is also possible to disable seasonal models and only search for regular ARIMA models.

A notable advantage of the auto.arima() function is that, even though the function transforms the data through differencing and Box-Cox transformations before fitting a model, it automatically converts forecasts back to the original scale, so there is no need to rescale the data after making a forecast.
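As an illustration of this workflow in Python, a rough analogue of the auto.arima() search can be sketched with the pmdarima package; this is an assumption made for illustration only and is not the software used in the thesis.

```python
# Rough Python analogue of the auto.arima() workflow using pmdarima
# (illustrative only; the thesis itself uses the R forecast package).
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(2)
train = 1000 + 200 * np.sin(2 * np.pi * np.arange(84) / 12) + rng.normal(0, 80, 84)

model = pm.auto_arima(
    train,
    seasonal=True, m=12,            # allow SARIMA terms with yearly seasonality
    stepwise=False,                 # exhaustive search instead of the stepwise algorithm
    information_criterion="aic",    # model selection criterion
    suppress_warnings=True,
)
print(model.summary())
forecast_6 = model.predict(n_periods=6)   # recursive six-step ahead forecast
```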

5 LSTM neural networks

As mentioned, a long short-term memory (LSTM) network is a special form of a Recurrent Neural Network (RNN) first developed by Hochreiter and Schmidhuber (1997). In order to explain a LSTM network we will first consider a basic neural network.

5.1 Neural networks

Neural networks are mathematical models inspired by the functioning of biological neurons (Hill et al., 1996). At the most basic level a neural network takes an input value which goes through various connected neurons and produces an output value.

A neural network is a machine learning algorithm which can apply either supervised or unsupervised learning. Supervised learning means that the desired (observed) output value is known a priori and can be compared to the predicted output value of a neural network. The difference between the observed and predicted output values can then be used to update the neural network and find the optimal network parameters. In unsupervised learning the desired output value is not observed and thus cannot be compared to the predicted output value of a neural network. This thesis focuses on supervised learning, where the desired output value (the observed future number of sales) is compared with the future sales predicted by a neural network.

In order to understand a neural network, first a few concepts about its structure need to be specified.

Neurons: neurons are at the foundation of a neural network. Each layer in the network consists of a number of neurons (also sometimes called units or nodes). Figure 2 shows a visual example of how a neuron operates.

A neuron receives a vector of inputs x = (x_1, ..., x_n) from the previous layer and multiplies these inputs element-wise with their respective weights w = (w_1, ..., w_n). Next, the neuron sums all these values and adds a bias value b. Finally, the neuron applies a linear or nonlinear activation function f(·). The final output of the neuron in this example then becomes y = f(\sum_{i=1}^{n} w_i x_i + b).


Figure 2: A visualisation of a neuron. Adopted from Cingillioglu (2019).

Each neuron is connected to all neurons in the previous and subsequent layers through weights, which are parameters estimated by the neural network. Exceptions to this are the neurons from the input and output layers. Each neuron from the input layer is only connected to all the neurons from the next layer. Similarly, each neuron from the output layer is only connected to all the neurons from the previous layer.

Layers: a neural network consists of a number of sequential layers. A neural network always consists of at least two layers: an input and an output layer. The input layer is the first layer of the neural network that receives the input data, and has one neuron per input variable. The input layer usually does not do anything with the input other than pass it onto the next layer, and usually contains no bias value. The output layer is the last layer of a neural network that produces its output. In between the input and output layers any number of hidden layers can be specified.

In Figure 3 a visualization of the layers in a simple neural network can be seen. This neural network contains three layers: an input layer with three neurons, one hidden layer with four neurons and an output layer with a single neuron. The layers in this neural network are connected through their weights (shown as the black arrows). This neural network takes in three input values and produces one output value.


Activation functions: activation functions in neurons are applied at the end in order to scale the output of a neuron to a desired value, usually in the range of (0,1) or (-1,1), before it is passed on to a subsequent layer. Commonly chosen activation functions are: sigmoid, tanh, rectified linear unit (relu) and linear, defined in equations (28), (29), (30) and (31):

\sigma(x) = \frac{1}{1 + e^{-x}},  (28)
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},  (29)
\mathrm{relu}(x) = \max(0, x),  (30)
f(x) = x.  (31)

Weights: a neuron is connected to the neurons in the previous and subsequent layers through weights. A weight with a large (positive) value indicates a strong influence of information between two neurons, while a negative weight means a negative influence of information between two neurons.

Bias: bias is an additional parameter that each neuron can have in addition to its weights. A bias value shifts the output of the activation function of a neuron by adding an extra constant. This allows for more accurate fitting of a neuron to the input data. Even though it has the name ’bias’ it should be interpreted as a constant value added to a neuron before the input is passed to its activation function.

Loss functions: the loss function of a neural network is the function that is minimized in or-der to find the optimal parameters of a neural network (the weights and biases). It compares the output of a neural network with the desired output value. The loss function is minimized by adapting the values of the weights and biases in a neural network. Since we are dealing with a regression type problem, commonly chosen loss functions include: mean squared error, mean absolute error and mean absolute percent error.

Training: training of a neural network refers to adapting the weights and biases of a neural network so that the output of the neural network more closely resembles the desired output. The most well-known classic algorithm used for training neural networks is the gradient descent algorithm. The gradient descent algorithm takes the gradient (the vector of partial derivatives) of the loss function with respect to the parameters of the neural network (the weights and biases). The gradient is equal to the change in the loss function in response to a change in the neural network parameters. The algorithm most commonly used for calculating this gradient is called backpropagation. Backpropagation computes the gradients of the loss function analytically by applying the chain rule sequentially from the output layer towards the input layer of a neural network (Hughes et al., 2018). To visualize gradient descent, a mathematical example is provided:

Let w be a vector of neural network weights, define L(w) as the loss function, let n be the current iteration number, let ∇L(w) be the gradient of L(w) and finally let η be the learning rate of the neural network. Then the weights at the next iteration are calculated through gradient descent as follows:

w_{n+1} = w_n - \eta \nabla L(w_n).  (32)


Learning rate: the learning rate parameter η in equation (32) specifies how much the parameters (weights and biases) of a neural network are changed each time they are updated using an optimization algorithm like gradient descent (or another training algorithm). The learning rate should be chosen carefully: a learning rate that is too low will result in very slow optimization, while a learning rate that is too high might result in the loss function not converging to a global minimum. The default learning rate is usually 0.01. Additionally, adaptive learning rates can be utilized, which change the learning rate over time.

Samples: samples refer to the distinct input-output pairs available in our data set to feed to the neural network. To obtain samples for a neural network we usually have to transform the time series data into a supervised learning problem.

Epochs: the number of epochs refers to how many times the entire training data set (all training samples) is fed to a neural network in order to train it. One epoch means all the samples in our training data set are fed once to the neural network.

Batch-size: batch-size refers to the number of samples from the training data set that are fed to the network before the loss functions is calculated for those samples and the neural network weights are updated through the gradient descent algorithm. There are three common batch-size methods. Stochastic gradient descent (SGD) involves calculating the loss function and updating the neural network weights after each individual sample from the training data set is fed to the neural network. Batch gradient descent (BGD) is a method that calculates the loss function of each individual training sample but only updates the network after the entire training data set has been fed to the neural network. Finally, mini-batch gradient descent (or mini-batch) is a method where the neural network weights are updated after a predefined number of samples have been fed to train the neural network. The size of a mini-batch lies somewhere in between one (SGD) and the total number of samples in the training data set (BGD).

Overfitting: overfitting of a neural network refers to the phenomenon that a neural network starts mimicking the training data set too well and as a result performs poorly when applied to the test data set or other out of sample data. This is a common problem in neural networks that can be addressed in a number of ways. Common solutions to overfitting include: adding a dropout layer, using validation data and keeping the learning rate low.

Dropout: dropout is a technique that can be applied to neural networks to prevent overfitting. Dropout randomly removes neurons and their weight connections from the hidden layers of a neural network during training (Srivastava et al., 2014). The neurons in the hidden layer are dropped stochastically with a certain specified probability.
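To make equation (32) concrete, a minimal NumPy sketch of gradient descent for a single linear neuron with a mean-squared-error loss is given below (purely illustrative; the data and learning rate are made up).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))                  # 100 samples, 2 input features
y = X @ np.array([1.5, -0.7]) + 0.3            # targets generated by a known linear rule

w, b = np.zeros(2), 0.0                        # parameters of a single linear neuron
eta = 0.01                                     # learning rate

for epoch in range(200):
    y_hat = X @ w + b                          # forward pass
    grad_w = 2 / len(y) * X.T @ (y_hat - y)    # dL/dw for the MSE loss
    grad_b = 2 / len(y) * np.sum(y_hat - y)    # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b  # gradient descent update, equation (32)

print(w, b)   # should approach [1.5, -0.7] and 0.3
```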

5.2 Recurrent neural networks

A recurrent neural network is a special type of neural network made for handling sequential data, such as text documents or time series. In a recurrent neural network the neuron has an internal hidden state that remembers information about previous entries. In order to explain recurrent neural networks more accurately a visual example is provided.

Consider a recurrent neuron; in Figure 4 a visual representation of this neuron can be observed. Next, consider a time series x = (x_0, ..., x_t). Initially, x_0 is passed to the recurrent neuron. The recurrent neuron then outputs a value y_0, but it also passes a hidden state h_0 to itself that contains information about the inputs it has previously received. Next, x_1 is given as input and the recurrent neuron calculates h_1 given h_0 and x_1. It then calculates the output y_1 and passes the hidden state h_1 to itself again. This process continues for the entire sequence x. Through this process a form of memory of previous inputs is built up.


Figure 4: A visualisation of a recurrent neuron.

In this figure x_t is the input, y_t is the output, h_t is the hidden state and RNN is the recurrent neuron. On the left the standard representation can be seen and on the right the unfolded representation is shown.

Backpropagation of recurrent neural networks is done through a special form of backpropagation called backpropagation through time (BPTT). BPTT is used on the unfolded version of the recurrent neural network to find the gradients. Each time step in the RNN is then treated as an additional sequential layer in the unfolded RNN.

Vanishing and exploding gradients are problems frequently encountered when using recurrent neural networks. The vanishing gradient problem refers to the fact that the gradient of the loss function with respect to the recurrent weights is multiplied with the same value over and over again for each step back in time, resulting in a gradient that quickly converges to zero after a couple of time steps. The exploding gradient problem refers to the opposite effect, where the gradient becomes increasingly large. Because of the vanishing gradient problem a recurrent neural network often cannot accurately model long term dependencies in time series data. A long short-term memory neural network is a type of recurrent neural network that provides a solution to the vanishing gradient problem.

5.3 LSTM

A long short-term memory network is a special type of recurrent neural network first developed by Hochreiter and Schmidhuber (1997). A LSTM network is a neural network containing at least one hidden LSTM layer. LSTM networks do not suffer from the vanishing gradient problem like the standard recurrent neural network and are therefore better able to model long term dependencies. LSTM layers use a more complex form of neurons often also named ’cells’ or ’units’. A more thorough description of a LSTM cell is given in the remainder of this section.

In Figure 5 a visual representation of a single LSTM unit/cell can be observed. The most important part of a LSTM cell is its recurrent cell state C_t. In a LSTM unit the cell state C_{t-1} from the previous time step is updated and passed on to the new cell state C_t. The output h_t from the LSTM cell is also partially based on this cell state C_t. Other inputs that are given to the LSTM cell are the input data x_t and the output of the cell at the previous time step, h_{t-1}. A LSTM cell consists of four gates which handle the cell state C_t of the cell. These gates will be discussed below.

First, let n be equal to the number of cells in the LSTM layer and let m be equal to the number of features (variables) in the input data. Moreover, let t = (1, ..., T) denote the time steps in the input data. Define x_t as an (m × 1) vector of inputs and h_t as an (n × 1) vector of outputs.


Figure 5: A visualisation of a LSTM cell. Adopted from Soni (2018).

Let (W_f, W_i, W_g, W_o) ∈ ℝ^{n×m} be the matrices containing the input weights of the forget, input, cell and output gates, and let (U_f, U_i, U_g, U_o) ∈ ℝ^{n×n} be the matrices containing the weights of the recurrent connections of these four gates. Finally, let (b_f, b_i, b_g, b_o) ∈ ℝ^{n×1} be the vectors containing the bias values of the four gates. These weights and biases are initially set to random values between (-1,1) when the LSTM network is created. In total, one LSTM layer has 4(nm + n² + n) parameters (weights and biases). Now, using these matrices and vectors, the four gates will be discussed:

Forget gate: the forget gate decides what information will be kept or deleted from the previous cell state C_{t-1}. It takes x_t and h_{t-1} as inputs and uses the sigmoid function defined in equation (28) to output a number between (0,1) for every value in the cell state vector C_{t-1}. An output of 0 means that a value from the previous cell state is completely dismissed, while an output of 1 means that it is not dismissed at all. The formula for the forget gate is given in equation (33):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f).  (33)

Input gate: the input gate decides which values in the cell state C_t will be updated with new information. It takes x_t and h_{t-1} as inputs and uses the sigmoid function defined in equation (28) to output a number between (0,1) for every value in the cell state vector C_t. The formula for the input gate is described in equation (34):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i).  (34)

Cell gate: the cell gate develops potential new values g_t to update the cell state C_t. It takes x_t and h_{t-1} as inputs and uses the tanh function defined in equation (29) to output numbers between (-1,1) for every value in the cell state vector C_t. The cell gate function is described in equation (35):

g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g).  (35)

After i_t and g_t are calculated, they are multiplied using the Hadamard product (⊙), and f_t and C_{t-1} are likewise multiplied using the Hadamard product. Following this, the new cell state C_t is updated. Initially, at t = 0, the cell state is set to 0. The formula for C_t is described in equation (36):

C_t = f_t \odot C_{t-1} + i_t \odot g_t.  (36)

Output gate: this gate decides what becomes the actual output of the LSTM cell. The output gate consists of two functions. First, the output gate decides which part of the cell state C_t will be output, through a function o_t given in equation (37):

o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o).  (37)

Next, the output gate determines the actual output of the LSTM cell by multiplying o_t and a tanh function of the cell state C_t using the Hadamard product. The value of the output h_t is given in equation (38):

h_t = \tanh(C_t) \odot o_t.  (38)

In our network, only the output of the LSTM layer at the final time step (h_T) is passed on to the next layer of the neural network.

After the LSTM layer, an output layer of p neurons can be specified. Define y as a (p × 1) vector of outputs. Furthermore, define W_y ∈ ℝ^{p×n} as the matrix containing the weights connecting the hidden LSTM layer and the output layer, and let b_y be a (p × 1) vector of bias values. Then the output vector y returned by the output layer is given by

y = g(W_y h_T + b_y),  (39)

where g(·) is the activation function of the output layer (for commonly chosen activation functions see equations (28)-(31)).

Similar to a recurrent neural network, a LSTM is trained by using backpropagation through time on the unfolded version of a LSTM. A more detailed description of the backpropagation equations of a LSTM network is given in Appendix A.
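A toy NumPy forward pass of a single LSTM layer, implementing equations (33)-(38) with randomly initialized weights, may help to see how the gates interact (illustrative only; the thesis builds its networks in Keras).

```python
# Toy forward pass of one LSTM layer, equations (33)-(38), with random weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m, T = 4, 6, 12                       # cells, input features, time steps
rng = np.random.default_rng(4)
Wf, Wi, Wg, Wo = (rng.uniform(-1, 1, (n, m)) for _ in range(4))
Uf, Ui, Ug, Uo = (rng.uniform(-1, 1, (n, n)) for _ in range(4))
bf, bi, bg, bo = (np.zeros(n) for _ in range(4))

x = rng.normal(size=(T, m))              # one input sample: 12 time steps, 6 features
h, C = np.zeros(n), np.zeros(n)          # initial output and cell state

for t in range(T):
    f = sigmoid(Wf @ x[t] + Uf @ h + bf)       # forget gate, eq. (33)
    i = sigmoid(Wi @ x[t] + Ui @ h + bi)       # input gate, eq. (34)
    g = np.tanh(Wg @ x[t] + Ug @ h + bg)       # cell gate, eq. (35)
    C = f * C + i * g                          # cell state update, eq. (36)
    o = sigmoid(Wo @ x[t] + Uo @ h + bo)       # output gate, eq. (37)
    h = np.tanh(C) * o                         # cell output, eq. (38)

print(h)   # h_T: the output passed on to the next layer
```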

5.4 LSTM hyperparameter selection

Now that the basics of neural networks and LSTM layers have been discussed, it is time to construct our LSTM network. The LSTM networks in this thesis were fit in Keras, a machine learning library in Python developed by Chollet et al. (2015). In Keras, LSTM networks can be constructed and trained, and predictions can be made, using built-in customizable functions.

First, we consider the way time series data is fed to a LSTM. Initially, all time series data is scaled to the (0,1) range before it is fed to the LSTM. This is done because neural networks in general work better with scaled input than with raw input; feeding raw, non-scaled data to a neural network often leads to explosive behaviour and convergence problems. After predictions are made, the output is scaled back from the (0,1) range to the true scale. For this scaling a min-max scaler is fitted. For a vector x = (x_1, ..., x_n) the min-max scaler formula is provided in equation (40); after fitting this scaler and scaling the data, a reverse transformation can be applied to go from scaled values back to the original values.

x_{i,scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}.  (40)


Next, the time series data has to be transformed into input-output samples. An input sample is characterized by the number of features and the lookback: the lookback defines how many time steps of the time series are used as input. The output sample consists of a single variable (the feature we want to predict) and the number of steps we want to predict ahead. Samples are created from a time series using the window method. To illustrate the window method a short example will be provided.

Consider a time series x_t for t = (1, ..., T). From this time series we want to create samples to train our LSTM with 1 feature, a lookback of 3 and 2-steps ahead predictions. Using the window method we can transform the time series x_t into LSTM input-output samples. The samples in this example become:

(x_1, x_2, x_3) → (x_4, x_5)
(x_2, x_3, x_4) → (x_5, x_6)
(x_3, x_4, x_5) → (x_6, x_7)
...
(x_{T-4}, x_{T-3}, x_{T-2}) → (x_{T-1}, x_T)

The LSTM is then fed an input sample, for example (x_1, x_2, x_3), and using this input it outputs a prediction (x̂_4, x̂_5). The output of the LSTM (x̂_4, x̂_5) can then be compared to the corresponding output sample (x_4, x_5) (and the error can be calculated), after which the neural network can be trained. After training, the LSTM can be provided any 3-length sequence of inputs for making predictions. For instance, (x_{T-2}, x_{T-1}, x_T) could be given to the LSTM and it will try to predict (x̂_{T+1}, x̂_{T+2}).

Hence, each sample is a small 'window' of the full time series. From a time series of T observations, a total of (T − N_lookback − N_output) samples can be constructed, where N_lookback is the lookback length and N_output is the number of predictions we want to make.

A sample can also contain data from multiple variables (features). In that case the input sample is given as a matrix, with the number of rows equal to the lookback and the number of columns equal to the number of features. An example of such a sample is provided below.

Consider a time series consisting of 5 different variables (features). Let x_{t,i} be the value of variable i = (1, ..., 5) at time step t = (1, ..., T). Suppose we want to use a lookback of 4 and want to predict variable i = 1 3-steps ahead; then a visual representation of an input-output sample pair generated by the window method is given below:

\begin{pmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} \\
x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} & x_{2,5} \\
x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} & x_{3,5} \\
x_{4,1} & x_{4,2} & x_{4,3} & x_{4,4} & x_{4,5}
\end{pmatrix}
\rightarrow (x_{5,1}, x_{6,1}, x_{7,1})
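A small NumPy sketch of the window method is given below; the helper function make_windows and the exact sample-count convention (which may differ by one from the formula above) are our own illustrative choices, not the thesis's preprocessing code.

```python
# Sketch of the window method: turn a (T x n_features) array into LSTM input-output samples.
import numpy as np

def make_windows(series, lookback, n_out, target_col=0):
    """series: array of shape (T, n_features); returns (X, y) sample arrays."""
    X, y = [], []
    for start in range(len(series) - lookback - n_out + 1):
        X.append(series[start:start + lookback, :])                              # lookback rows, all features
        y.append(series[start + lookback:start + lookback + n_out, target_col])  # future target values
    return np.array(X), np.array(y)

# Example: 102 months, 6 features, lookback 12, 6-step ahead target
data = np.random.default_rng(5).random((102, 6))
X, y = make_windows(data, lookback=12, n_out=6)
print(X.shape, y.shape)   # (85, 12, 6) (85, 6) under this convention
```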

Next, the hyperparameters of the LSTM network are considered. A neural network is largely defined by its hyperparameters: they determine the shape of the network and need to be specified by the user. The right choice for some of these hyperparameters is quite obvious, while for others there is no clear answer or rule to apply at all. For hyperparameters where the choice is less obvious, either experiments or a grid search over all possible values is required, as there often does not exist an efficient algorithm for determining their optimal values. The following hyperparameters needed to be decided on for our LSTM network:


2. The number of cells per hidden LSTM layer: in general it is recommended to keep the number of cells in an LSTM layer below a certain number, so that the number of parameters in the neural network does not greatly surpass the number of available training samples. Eckhardt (2018) provides a general rule-of-thumb formula for the maximum number of cells in equation (41):

N_{LSTM} = \frac{\text{Number of training samples}}{\alpha (N_{input} + N_{output})},  (41)

where N_{LSTM} is the number of cells in the hidden LSTM layer, N_{input} and N_{output} are the number of neurons in the input and output layer and α is a constant between 2 and 10. Based on this formula it was decided to use 20 cells in the hidden LSTM layer. Additionally, a dropout with intensity 0.1 was added to this hidden LSTM layer to prevent overfitting. This dropout randomly sets 10% of the outgoing weight connections of the hidden LSTM layer to zero during training.

3. Number of data features and lookback: the number of features indicates the number of variables we feed to the LSTM, and the lookback indicates how many months of previous data are contained in each input sample. A lookback of 12 was chosen so that each sample contains an entire year (12 months), to account for possible seasonality. A total of 6 features were used as input; an overview of the chosen features can be found in Table 5.

Table 5: Features selected for LSTM input.

Variable       Description
quantity       monthly quantity sold
year           year
comb quantity  combined number of sales of all thirty products
month sin      sine coordinate of the month
month cos      cosine coordinate of the month
avg price      average selling price per unit per month

The reason the month is expressed through sine and cosine values in Table 5, rather than through the numbers 1-12, is the scaling. For a LSTM the input data has to be scaled to the (0,1) range, but scaling the month number in that way is misleading: the neural network would treat December as 'larger' than January (December would be given a value of 1 and January a value of 1/12), even though December and January are only one month apart. For this reason it was decided to convert the month number into two values, a sine and a cosine coordinate, described in equation (42):

Y = \sin\left(2\pi \cdot \frac{\text{month}}{12}\right), \qquad X = \cos\left(2\pi \cdot \frac{\text{month}}{12}\right).  (42)

In the end, the input part of each sample is a (12 × 6) matrix containing 12 months of sequential data for 6 different variables. The corresponding output of the sample will be a (6 × 1) vector with the quantity sold in the next 6 months.


6. The loss function used: this one is obvious. Since the root mean-squared error is used as the main forecasting accuracy metric, a mean-squared error loss function is used.

7. The optimization algorithm used: there are a number of optimization algorithms available to train our LSTM. The most well-known algorithm is stochastic gradient descent, discussed earlier in this thesis. Adaptive moment estimation (ADAM) is another popular gradient-based stochastic updating algorithm, developed by Kingma and Ba (2015). The advantages of ADAM over basic stochastic gradient descent are that ADAM is robust and requires little memory (Kingma and Ba, 2015). Hence it was decided to use the Adam optimizer to train the LSTM network.

8. The activation function used in each layer: once again we decided to keep it simple; in the hidden LSTM layer we used the tanh activation function, while in the output layer a linear activation function was used.

9. The recurrent function used in each layer: we opted to use the sigmoid function as the recurrent activation function in the hidden LSTM layer.

10. Number of epochs used in training: the number of epochs to use for training is a tricky parameter to adjust. While more epochs generally means the LSTM is trained for longer, resulting in better predictions, a higher number of epochs also results in more computing time. Generally, it is best to keep increasing the number of epochs until the fit on our train and test data starts to diminish. In this thesis 100 epochs are used for each product to train the LSTM, in order to limit excessive computing times.

11. Batch-size used in training: a batch size of 8 was used to speed up computations.
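Putting these choices together, a simplified Keras sketch of a network with the hyperparameters described above (20 LSTM cells, dropout 0.1, a 12-month lookback with 6 features, a 6-step linear output, MSE loss, the Adam optimizer, 100 epochs and batch size 8) is given below. This is an illustration under those assumptions, not the exact code used in the thesis.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(20, activation="tanh", recurrent_activation="sigmoid",
                input_shape=(12, 6)),        # lookback of 12 months, 6 features
    layers.Dropout(0.1),                     # drop 10% of the LSTM layer's outgoing connections
    layers.Dense(6, activation="linear"),    # quantities for the next 6 months
])
model.compile(optimizer=keras.optimizers.Adam(), loss="mse")

# Hypothetical (0,1)-scaled training samples produced by the window method
X_train = np.random.random((67, 12, 6))
y_train = np.random.random((67, 6))
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)
```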

6 Forecasting procedures

As mentioned, to measure the forecast accuracy of each product the total data set was split into a training part and a testing part for each product. In this section we define X = (X_1, ..., X_84) as the training data set and Y = (Y_1, ..., Y_18) as the test data set for a given product. The training data contains the quantities (monthly number of units sold) from the year 2012 up to and including 2018, and the test data set contains the monthly quantities from 2019 and 2020. Y is a direct continuation of X, hence the first observation Y_1 of Y follows directly from the last observation X_84 in X. The forecast accuracy is evaluated on the test data Y.

Because it is computationally very expensive to update the LSTM model when new data becomes available, it was decided to fit our models only on the training data: we do not refit our models for every new forecast once more data becomes available. In order to keep the comparison fair, it was decided to likewise fit the ARIMA models on the training data only. For each observation Y_i in Y a six-step ahead forecast is made. We denote Ŷ_{i,j} as the forecast for Y_i that was made j months prior, with j = (1, ..., 6).

First, we will go over the estimation and forecasting procedure for the (S)ARIMA model without STL.

Step 1. Take the training data set X. Use the KPSS and ADF test to determine if the time series is already stationary. If the time series is not stationary, a Box-Cox transformation is applied with λ = 0 to X.


Step 2. The best (S)ARIMA model is fitted on X using the auto.arima() function in R, with the stepwise algorithm set to false so that all possible models are searched. The auto.arima() function automatically finds the optimal order of the ARIMA or SARIMA model that minimizes the AIC and BIC criteria. Moreover, if a Box-Cox transformation is used, we specify in the auto.arima() function that a bias adjustment to the forecasts needs to be made.

Step 3. Now we start making forecasts. First, we take X and recursively forecast 6 months ahead to obtain forecasts Ŷ_{j,j} for i = 1 and j = (1, ..., 6), using the forecast.arima() function in R. This function applies a six-step ahead recursive forecast to the data with the (S)ARIMA model that was estimated over the training data in step 2. For i = 2 and onward we apply a recursive strategy. First, Y_{i-1} is appended to X. Then Ŷ_{i+j-1,j} for j = (1, ..., 6) is forecast using X. Thus each time we append the latest observation of our test data to X and make another set of forecasts using the (S)ARIMA model fitted in step 2.

Step 4. Finally, we compare all our 1-6 step ahead forecasts Ŷ_{i,j} to the actual observations Y_i and evaluate the accuracy of each j-step ahead forecast.
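A rough Python sketch of this rolling forecasting procedure is given below, using a fixed-order SARIMAX from statsmodels as a stand-in for the R auto.arima/forecast workflow (synthetic data, illustrative model order).

```python
# Rolling six-step ahead (S)ARIMA forecasting, fitted once on the training data only.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(6)
series = 1000 + 200 * np.sin(2 * np.pi * np.arange(102) / 12) + rng.normal(0, 80, 102)
train, test = series[:84], series[84:]          # 2012-2018 vs. 2019-2020 analogue

# Step 2: fit once on the training data (order chosen for illustration)
fit = SARIMAX(train, order=(1, 0, 1), seasonal_order=(1, 0, 0, 12)).fit(disp=False)

# Steps 3-4: append each new test observation without re-estimating, forecast 6 ahead
forecasts = [fit.forecast(steps=6)]
for i in range(1, len(test)):
    fit = fit.append(test[i - 1:i], refit=False)   # add Y_{i-1} to the information set
    forecasts.append(fit.forecast(steps=6))

one_step = np.array([f[0] for f in forecasts])
rmse_1step = np.sqrt(np.mean((test - one_step) ** 2))
print(f"one-step ahead RMSE: {rmse_1step:.1f}")
```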

The procedure for making forecasts for the ARIMA model with STL is similar, and performed through the following steps:

Step 1. Take the training data set X. Apply STL to decompose the training data observations X_i into a seasonal component X_i^S and a nonseasonal component X_i^A, so X_i = X_i^S + X_i^A and X = X^A + X^S. This results in a month-specific seasonal value S_m based on the training data. Next, we turn to the test data set Y. For each observation Y_i in Y we subtract the corresponding month-specific seasonal value S_m to obtain a seasonal component Y_i^S and a nonseasonal component Y_i^A, so Y_i = Y_i^S + Y_i^A and Y = Y^A + Y^S, with Y_i^S = S_m for the corresponding month.

Step 2. For the nonseasonal component X^A, we fit the best ARIMA model (using the auto.arima() function); we do not apply a Box-Cox transformation to the data this time. The reason no Box-Cox transformation can be applied is that the nonseasonal component might contain negative values.

Step 3. Now it is time to make forecasts. First, we take the nonseasonal training data X^A and recursively forecast 6 months ahead to obtain forecasts Ŷ^A_{j,j} for i = 1 and j = (1, ..., 6) using the forecast.Arima() function in R. This function applies a six-step ahead recursive forecast to the data with the ARIMA model that was estimated in step 2. Then we take X^S and forecast 6 months ahead to obtain forecasts Ŷ^S_{j,j} for i = 1 and j = (1, ..., 6) using the seasonal naive method.

For i = 2 and onward we apply a recursive strategy. First, Y^A_{i-1} is appended to X^A and Y^S_{i-1} is appended to X^S. Then we forecast Ŷ^A_{i+j-1,j} for j = (1, ..., 6) using X^A and our fitted ARIMA model. Finally, we forecast Ŷ^S_{i+j-1,j} for j = (1, ..., 6) using the seasonal naive method and X^S. Thus each time we append the latest nonseasonal and seasonal observation of our test data to X^A and X^S and make another set of forecasts.

Step 4. For each (i, j), we combine Ŷ^S_{i,j} and Ŷ^A_{i,j} to obtain our actual forecast Ŷ_{i,j}.

Step 5. Finally, we compare all our 1-6 month(s) ahead forecasts Ŷ_{i,j} to the actual observations Y_i and evaluate the accuracy of each j-step ahead forecast.
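The sketch below illustrates steps 1-4 of this STL-based procedure for the first forecast round (i = 1). It is a sketch under stated assumptions rather than the thesis's exact code: train is assumed to be a monthly ts object, and s.window = "periodic" is an assumed STL setting that yields one fixed seasonal value S_m per calendar month.

```r
library(forecast)

# Step 1: STL decomposition of the training series into seasonal and nonseasonal parts.
dec    <- stl(train, s.window = "periodic")
x_seas <- dec$time.series[, "seasonal"]   # X^S: month-specific seasonal values S_m
x_adj  <- seasadj(dec)                    # X^A = X - X^S (trend + remainder)

# Step 2: best ARIMA model on the nonseasonal part, no Box-Cox transformation.
fit_adj <- auto.arima(x_adj, stepwise = FALSE)

# Step 3 (i = 1): ARIMA forecast of the nonseasonal part, seasonal naive for X^S.
fc_adj  <- forecast(fit_adj, h = 6)$mean  # Ŷ^A_{j,j}, j = 1, ..., 6
fc_seas <- snaive(x_seas, h = 6)$mean     # Ŷ^S_{j,j}, j = 1, ..., 6

# Step 4: recombine the components into the actual forecasts Ŷ_{j,j}.
fc <- fc_adj + fc_seas
```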

The estimation and forecast procedure for the LSTM model is significantly more complicated. For the LSTM network, the training and test data sets need to contain additional features, such as average price and year as extra columns.

Step 1. Add additional features of the data as extra columns to X and Y. Fit a (0,1) min-max scaler on all the columns of the training data X, and then scale all the columns of the training data set X and the test data set Y using this scaler.

Step 2. Append Y to X by row in a new matrix Z. Then transform Z to a supervised learning problem, so we obtain a total of 87 samples. The first 67 samples (those based on data from X) will be our training samples, while the last 13 samples (those based on data from Y) will be our test samples.

Step 3. Define the LSTM network and train it using the training samples from step 2.

Step 4. Feed the test samples to the neural network and save its predicted scaled output Ȳ_{i,j} for j = (1, ..., 6) and i = (1, ..., 14).

Step 5. Rescale the predicted output Ȳ_{i,j} from the test samples back to the original scale to obtain Ŷ_{i,j}. Finally, compare the 1-6 month(s) ahead forecasts Ŷ_{i,j} to the actual observations Y_i and evaluate the accuracy of each j-step ahead forecast.
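The min-max scaling of step 1 and its inversion in step 5 can be sketched as follows. The function and object names (fit_scaler, apply_scaler, unscale, and the data frames X and Y with quantity, average price and year columns) are hypothetical; the key point is that the scaler parameters are fitted on the training data only and reused for the test data.

```r
# Fit per-column (min, max) parameters on the training data only.
fit_scaler <- function(df) lapply(df, range)

# Scale every column of a data frame to (0,1) with previously fitted parameters.
apply_scaler <- function(df, sc) {
  as.data.frame(Map(function(col, r) (col - r[1]) / (r[2] - r[1]), df, sc))
}

sc       <- fit_scaler(X)          # X: training data frame (quantity, price, year, ...)
X_scaled <- apply_scaler(X, sc)
Y_scaled <- apply_scaler(Y, sc)    # test data scaled with the training-data scaler

# Step 5: invert the scaling of a predicted quantity column to recover Ŷ_{i,j}.
unscale <- function(x, r) x * (r[2] - r[1]) + r[1]
```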

Finally, forecasting using LSTM with STL is performed through the following steps:

Step 1. Take the training data set X. Apply STL to decompose the training data observations X_i into a seasonal component X_i^S and a nonseasonal component X_i^A, so X_i = X_i^S + X_i^A and X = X^S + X^A. This results in a month-specific seasonal value S_m based on the training data. Next, we turn to the test data set Y. From each observation Y_i in Y we subtract the corresponding month-specific seasonal value S_m to obtain a seasonal component Y_i^S and a nonseasonal component Y_i^A, so Y_i = Y_i^S + Y_i^A and Y = Y^S + Y^A with Y_i^S = S_m for the corresponding month.

Step 2. Add additional features of the data as extra columns to X^A and Y^A. Fit a (0,1) min-max scaler on all the columns of the training data X^A, and then scale all the columns of the training data set X^A and the test data set Y^A using this scaler.

Step 3. Append Y^A to X^A by row in a new matrix Z. Then transform Z to a supervised learning problem, so we obtain a total of 87 samples. The first 67 samples (those based on data from X^A) will be our training samples, while the last 13 samples (those based on data from Y^A) will be our test samples.

Step 4. Define the LSTM network and train it using training samples from step 3.

Step 5. Feed the test samples to the neural network and save its predicted scaled output Ȳ^A_{i,j} for j = (1, ..., 6) and i = (1, ..., 14).

Step 6. Rescale the predicted output Ȳ^A_{i,j} back to its original scale to obtain Ŷ^A_{i,j}.

Step 7. Estimate Ŷ^S_{i,j} for i = (1, ..., 14) and j = (1, ..., 6) using the seasonal naive method.

Step 8. For each (i, j) we combine Ŷ^S_{i,j} and Ŷ^A_{i,j} to obtain our actual forecast Ŷ_{i,j}.

Step 9. Finally, we compare all our 1-6 month(s) ahead forecasts Ŷ_{i,j} to the actual observations Y_i and evaluate the accuracy of each j-step ahead forecast.
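Steps 6-8 amount to inverting the quantity scaling on the LSTM output and adding back the seasonal naive forecast of the seasonal component. The sketch below shows this for one forecast round; pred_scaled, r_q and x_seas are hypothetical placeholder inputs standing in for the objects produced in the earlier steps.

```r
library(forecast)

pred_scaled <- c(0.41, 0.44, 0.39, 0.37, 0.43, 0.46)  # scaled LSTM output for Y^A
r_q         <- c(500, 12000)                          # (min, max) of quantity in training data
x_seas <- ts(rep(c(-900, -600, -300, 0, 150, 300, 250, 100, -50, 200, 450, 400), 7),
             start = c(2012, 1), frequency = 12)      # placeholder STL seasonal component

fc_adj   <- pred_scaled * (r_q[2] - r_q[1]) + r_q[1]  # step 6: Ŷ^A_{i,j}
fc_seas  <- as.numeric(snaive(x_seas, h = 6)$mean)    # step 7: Ŷ^S_{i,j}
fc_total <- fc_adj + fc_seas                          # step 8: Ŷ_{i,j}
```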

The hyperparameter specifications were the same for both the plain LSTM and the LSTM with STL; the only difference was that the LSTM with STL was trained on the seasonally adjusted part of the sales time series, while the plain LSTM was trained on the full sales time series.

Moreover, it is notable that only two LSTM networks (the plain LSTM and the LSTM with STL) were trained, using sales data from all thirty products available, rather than a separate LSTM network for each product. This was done because the number of observations in the training data (67 samples per product) was too small to accurately train an LSTM for each product individually. While it would be preferable to train a separate LSTM for each product, there would simply be too little data available for training and the results would become very inconsistent. Hence it was decided to train a single LSTM on thirty products, so that 30 × 67 = 2010 samples in total were available to train the LSTM networks. This was achieved by feeding the training samples of each product sequentially to the LSTM network; once all training samples were fed, this counted as a single epoch. In total 100 of these epochs were used to train the LSTM networks (see the sketch below).
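A sketch of this pooled training scheme is given below: one "epoch" is one sequential pass over the samples of all thirty products. The list names train_x_list and train_y_list (one array of samples per product) are assumptions, and model stands for a compiled keras LSTM such as the one sketched earlier.

```r
library(keras)

# One outer iteration = one epoch in the sense used above: every product's samples
# are fed to the network once, in sequence, before the next epoch starts.
for (epoch in 1:100) {
  for (p in seq_along(train_x_list)) {
    model %>% fit(train_x_list[[p]], train_y_list[[p]],
                  epochs = 1, batch_size = 8,
                  shuffle = FALSE, verbose = 0)
  }
}
```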

7 Results

The RMSE values of the 1-6 month(s) ahead forecasts of each model are reported in Tables 6 - 9, while the average RMSE values across the 1-6 months forecasts for each product are found in Table 17.

As an alternative metric, the symmetric mean absolute percent errors (SMAPE) were calculated for all 1-6 month(s) ahead forecasts. These SMAPE values can be found in Tables 18 - 23 in section 9 at the end of this thesis.

For reference, the RMSE values of the naive forecast are also displayed in Table 16. The naive forecast simply adopts the number of sales in the current month as all 1-6 month(s) ahead forecast values. This forecast is called 'naive' because it assumes that all future sales numbers will be roughly equal to the number of sales in the current month. The naive forecast of y_t is specified as

ŷ_{t+i} = y_t for i = (1, ..., 6). (43)
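In code, equation (43) is simply a repetition of the last observed value; the sketch below assumes train is the monthly sales ts of a product.

```r
# Naive benchmark: every 1-6 month(s) ahead forecast equals the latest observed quantity.
naive_forecast <- function(y_t, h = 6) rep(y_t, h)

naive_forecast(tail(train, 1))
```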

Comparing Tables 6 - 9, Table 17 and Tables 18 - 23, it can be observed that the (S)ARIMA models generally performed best, having the lowest RMSE values in 24 out of 60 cases, and their RMSE values are often significantly lower than those of the other models. Moreover, for two of the considered products (12915, 15241) the SARIMA model provides the lowest RMSE across all forecast horizons. In terms of average RMSE across the 1-6 month(s) forecasts the (S)ARIMA models also performed best for 5 out of 10 products (12915, 12925, 12934, 14257, 15241). In terms of the SMAPE the (S)ARIMA models were also the clear winner, possessing the lowest SMAPE values in 34 out of 60 cases. Also interesting is that in some cases a regular ARIMA model was chosen over a SARIMA model, which suggests that for some products the seasonal influence is small or negligible compared to others.

The second best performing model was ARIMA with STL, providing the lowest RMSE forecasts in 16 out of 60 cases, and the lowest average RMSE across the 1-6 month(s) forecasts for three products (12917, 12922, 13330). In terms of SMAPE the ARIMA with STL was also the second best performing model, possessing the lowest SMAPE in 14 out of 60 cases.

The third best performing model was the LSTM combined with STL. This model provided the lowest RMSE value in 13 out of 60 cases, significantly outperforming the plain LSTM model. In terms of SMAPE, however, the LSTM with STL was the worst performing model, possessing the lowest SMAPE in only 6 out of 60 cases.

The plain LSTM model was the worst performing model in terms of RMSE, providing the lowest RMSE forecasts in merely 7 out of 60 cases. In terms of SMAPE the plain LSTM was the third best performing model, possessing the lowest SMAPE in 9 out of 60 cases.

It is interesting to note that the LSTM forecasts in Tables 8 and 9 either have the lowest RMSE or have considerably larger RMSE values than the ARIMA models; for some products the LSTM networks perform similarly to the naive forecasts in Table 16. The ARIMA-based models in Tables 6 and 7 are more consistent: even when their forecasts do not have the lowest RMSE, they remain relatively close to it.


Table 6: RMSE values of 1-6 months ahead forecasts using (S)ARIMA.

Product id   (S)ARIMA order        1 month   2 months   3 months   4 months   5 months   6 months
12915        (0,1,1)×(1,0,0)_12    3359      3718       3994       3861       3710       3561
12917        (0,1,2)               1846      1547       1967       1679       1906       1571
12922        (1,0,0)×(1,0,1)_12    1503      1762       1954       2025       2095       2120
12925        (1,1,1)×(2,0,0)_12    1458      1540       1507       1411       1363       1343
12927        (1,1,1)×(1,0,0)_12    3126      3364       3321       3115       3203       3154
12929        (0,1,3)               2500      2890       3229       3779       4047       4312
12934        (0,1,3)×(1,0,0)_12    2230      2891       3051       3437       3237       3560
13330        (0,1,1)               1777      1878       2011       2053       2056       2034
14257        (0,1,1)               1911      2187       2392       2695       2887       2967
15241        (0,1,1)×(1,0,0)_12    2021      2041       2048       2195       2089       2374

The first column specifies the product id, while the second column specifies the order of the (S)ARIMA model that was estimated. A yellow coloured square indicates that the RMSE was the lowest among all four models considered. A green coloured square indicates that the RMSE was the lowest among the models considered and at least 5% lower than the second lowest RMSE.

Table 7: RMSE values of 1-6 months ahead forecasts using ARIMA with STL.

Product id   ARIMA order   1 month   2 months   3 months   4 months   5 months   6 months
12915        (3,1,0)       3893      4303       4536       4482       3940       4089
12917        (5,1,0)       1578      1462       1791       1664       1718       1601
12922        (0,1,1)       1219      1406       1363       1499       1498       1598
12925        (0,1,3)       1393      1485       1492       1446       1411       1402
12927        (5,1,0)       3442      3697       3270       3123       3065       3077
12929        (0,1,1)       1389      1388       1304       1321       1286       1221
12934        (0,1,3)       3408      3380       3340       3553       3460       3633
13330        (0,1,1)       1751      1849       1955       1989       1963       1918
14257        (4,1,0)       2063      2316       2633       2835       3001       2944
15241        (0,1,5)       2103      2448       2305       2528       2382       2470

The first column specifies the product id, while the second column specifies the order of the ARIMA model that was estimated. A yellow coloured square indicates that the RMSE was the lowest among all four models considered. A green coloured square indicates that the RMSE was the lowest among the models considered and at least 5% lower than the second lowest RMSE.

Table 8: RMSE values of 1-6 months ahead forecasts using the LSTM network.

Product id   1 month   2 months   3 months   4 months   5 months   6 months
12915        4812      4972       4978       4923       4931       5388
12917        1659      2073       2222       2470       2174       2146
12922        1705      1703       1686       1396       1631       1539
12925        1664      1651       1719       2038       1657       1503
12927        3196      3287       3074       3177       3093       3111
12929        1646      1838       1719       1581       1405       1467
12934        4089      4273       4003       4072       3747       3566
13330        1767      1859       1925       2199       2120       1723
14257        2326      2607       2677       2617       2671       2673
15241        2613      2959       2933       2721       2772       2742


Table 9: RMSE values of 1-6 months ahead forecasts using the LSTM network with STL.

Product id   1 month   2 months   3 months   4 months   5 months   6 months
12915        4219      4655       4426       4374       4581       5212
12917        1677      2199       2359       2535       2359       2111
12922        1446      1542       1476       1609       1708       1692
12925        1183      1332       1272       1458       1617       1632
12927        3110      3220       3040       3218       3229       3278
12929        1278      1253       1264       1257       1211       1223
12934        3632      3611       3478       3364       3351       3502
13330        2027      2353       2532       2527       2457       2350
14257        2309      2614       2736       2903       2963       2982
15241        2461      2684       2699       2891       2824       2847

The first column specifies the product id. A yellow coloured square indicates that the RMSE was the lowest among all four models considered. A green coloured square indicates that the RMSE was the lowest among the models considered and that the RMSE was at least 5% lower than the second lowest RMSE.

8 Conclusions

Four types of models were utilized in this thesis: regular (S)ARIMA models, ARIMA with STL, LSTM networks and LSTM networks with STL. When making monthly forecasts for the years 2019 and 2020, the (S)ARIMA models without decomposition generally produce the most accurate forecasts based on the RMSE. The plain LSTM network generally performs poorly, and while the LSTM network with STL is able to provide the lowest RMSE forecasts for some products, it is highly inconsistent and needs to be refined further before it can be applied to actual forecasting. The reason our LSTM performs poorly compared to previous research such as Elmasdotter and Nyströmer (2018) and Siami-Namini and Namin (2018) could be the low number of observations available in our monthly data. Previous research in which LSTM networks had more success in forecasting often used higher frequency data in the form of weeks or days and shorter forecasting windows, which allowed their LSTM networks to train on more observations than the LSTM networks in this thesis. Thus, when dealing with low frequency data such as monthly observations, it might be better to stick with parametric methods such as ARIMA based models as opposed to non-parametric methods such as neural networks.

In our case, it is important to consider that the models in this thesis are only fit on the training data (2012-2018) and are not refit every time another forecast is made when new data becomes available. In a real-world scenario, however, it would certainly be more logical to update the models every month when new data becomes available and new forecasts need to be made. The main reason for not refitting the models was the lack of required computational resources: while refitting ARIMA and SARIMA models every time new data becomes available can be done quite easily, having to retrain the LSTM networks for every new forecast is enormously computationally cumbersome. In order to keep the comparison between the forecasts of these models fair it was thus decided to only fit the models on the training data; no refitting was done after the models were initially fit. Further research could compare the models with additional refitting.

Another relevant consideration is the design and specification of the LSTM networks themselves. The LSTM networks that were used for forecasting were relatively simple, consisting of only one hidden layer in addition to an input and output layer. We believe many improvements could be made to the LSTM networks, possibly including: changing the hyperparameters of the LSTM networks, choosing a longer lookback window, using different data features as inputs, adding more hidden layers to the LSTM and combining LSTM hidden layers with other kinds of hidden layers (which are beyond the scope of this thesis).

Another possible extension is that multiple separate LSTM networks could be created, each dedicated to a specific forecasting window. One LSTM network would be trained to only produce one month ahead forecasts, another separate LSTM network could be trained to produce two months ahead forecasts, and so on.

Moreover, more time series transformations could be applied before the data is fed to the LSTM networks. In this thesis it was found that using time series decomposition in combination with an LSTM network generally produced superior results compared to a plain LSTM network. Apart from decomposition and scaling the data to the (0,1) range, no further transformations were applied. Further transforming the data by differencing and/or Box-Cox transformations (to make it stationary before feeding it to the LSTM) could possibly produce superior forecasts.
