Forecasting shipped orders with long short-term memory recurrent neural networks

Roel van der Burght (10998187)
Bachelor thesis, Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Tim Bakker MSc.
Amsterdam Machine Learning Lab
Faculty of Science, University of Amsterdam
Science Park 904

Contents

1 Introduction
2 Problem statement
  2.1 Introduction
  2.2 Data set
    2.2.1 Target variables
    2.2.2 Features
    2.2.3 Seasonality
    2.2.4 Data preparation
3 Method
  3.1 LSTM
  3.2 Lookback
  3.3 Evaluation
  3.4 Research question
  3.5 Implementation
4 Results and Evaluation
  4.1 Results
    4.1.1 Basic model
    4.1.2 Extension of the basic model
    4.1.3 Encoder-Decoder
    4.1.4 Multiple-output encoder-decoder
  4.2 Evaluation
5 Conclusion
6 Discussion


Abstract

Any company involved in the shipping of goods to its customers benefits from an accurate forecast of the number of shipments expected in the future. This research is a case study looking to improve upon the forecasting technique in place at Greetz, the company commissioning this project. Long short-term memory recurrent neural networks have been used to this end, and different model architectures have been explored. The conclusion drawn from the results is that, using this method, no improvements over the techniques in place at Greetz were made.

Chapter 1

Introduction

Since the early days of machine learning there have been ideas of widespread usage of the developed techniques. With the coming of the information age, bringing cheap computing power and readily available data storage, the traditionally limiting factors for applying machine learning outside of the academic environment belong to the past. This can be seen in the corporate environment, where a shift towards data-driven business practice is taking place [7]. This shift is not limited to big tech giants like Google and Facebook; in today's world smaller companies have most of the necessary means to collect, store and analyse their data to support data-driven decisions. However, due to limiting factors such as personnel management and costs, it is not to be expected that smaller companies will be able to collect and store all data which can be used for decision-making purposes. This raises the question of how best to approach data-driven decision making, taking into account the comparatively small amount of data a typical small company can deliver.

While the range of problems in which data-driven decision making can help is endless, the most obvious starting point for any company involved in the selling of products is an estimation of sales. Having a good estimate of future sales has many use cases, for example in the planning of personnel and the projection of profits. The question remains how well sales can be forecast with the available data: many factors potentially influence sales, and most smaller companies will not have the means to monitor them all.

This research is a case study into the above stated problem. Data from a single company was used in an effort to build a model that forecasts the number of products shipped per day. The precise use case and requirements of the model are introduced in chapter 2, together with a description of the data set used. The proposed model is introduced in chapter 3. The results are presented in chapter 4, and the conclusion drawn from them is given in chapter 5. In chapter 6 the shortcomings of this research are discussed.

Chapter 2

Problem statement

2.1 Introduction

This project was commissioned by Greetz, a company specialized in the online selling of gifts and gift cards. A substantial part of the daily operations at Greetz consists of packaging and sorting orders, so as to prepare them for shipment by postal services. Because this is such a core part of the business procedure and for the most part a manual process, there is a demand for an accurate forecast of the number of items that need to be shipped on a given day. Several business practices such as personnel planning and revenue estimation would benefit from an accurate forecasting mechanism.

Greetz sells a diverse assortment of products, such as postcards, flower bouquets, books and chocolate. Because factors such as price, size and shelf life differ between these product categories, the ideal forecasting mechanism would differentiate between them.

The preferred length of the forecast, i.e. the number of days into the future for which the forecast is made, is set at 45, as this length is the most convenient for the company. The problem statement as formulated by Greetz is thus as follows: is it possible to predict the number of items shipped per day for the coming 45 days and, if so, is it possible to predict items shipped per category? In the following section the data used for the forecasts is explored. The approach to solving this problem is explained in further detail in chapter 3. A small word on naming conventions: in this research sent items and shipped items are used interchangeably, both referring to the variable to be predicted.

2.2 Data set

The data provided is a selection of variables thought to be of influence on the number of shipped items, together with the records of sent items for the past years. The data is regarded as a collection of multiple time series, where all the series have daily entries. The first date in the data set is January 1st 2010, the last April 1st 2019. The data set can be split into two groups: target variables and features. Both categories will be explained in more detail in the coming subsections. It is important to note that not all features and target variables have entries as far back as 2010, simply because the company did not keep track of them at that time. This is one of the reasons why part of the provided data is omitted from the data set used, which will be elaborated upon in section 2.2.4.

2.2.1 Target variables

The target variables consisted of daily entries for the number of items shipped per category and the total items shipped, which is the sum of the individual categories. Table 2.1 shows the first ten entries for 2018. These are the values that are to be forecast in this research. The eventual goal was to forecast all individual product categories, but to start, the total number of items shipped was forecast.

| Total | Balloons | Beverages | Cards type1 | Cards type2 | Chocolate | Flowers | Other gifts | Cards total | Gifts total |
|-------|----------|-----------|-------------|-------------|-----------|---------|-------------|-------------|-------------|
| 2537  | 4        | 1         | 2525        | 0           | 1         | 2       | 0           | 2525        | 8           |
| 41853 | 442      | 225       | 34476       | 4569        | 544       | 940     | 46          | 39045       | 2197        |
| 29753 | 649      | 166       | 21590       | 5983        | 378       | 578     | 25          | 27573       | 1796        |
| 23271 | 676      | 159       | 17746       | 3171        | 347       | 691     | 22          | 20917       | 1895        |
| 24013 | 633      | 173       | 18611       | 2984        | 402       | 738     | 37          | 21595       | 1983        |
| 3066  | 327      | 93        | 1906        | 0           | 174       | 373     | 10          | 1906        | 977         |
| 4096  | 193      | 78        | 3215        | 0           | 145       | 289     | 23          | 3215        | 728         |
| 42022 | 657      | 198       | 32436       | 7014        | 456       | 733     | 37          | 39450       | 2081        |
| 23656 | 656      | 161       | 17293       | 3933        | 442       | 691     | 38          | 21226       | 1988        |
| 25492 | 661      | 137       | 19683       | 3482        | 397       | 643     | 27          | 23165       | 1865        |

Table 2.1: Target variables

| Feature                 | Description                                                                 |
|-------------------------|-----------------------------------------------------------------------------|
| Weekday                 | Day of the week                                                             |
| Temp                    | Average temperature                                                         |
| Sunshine                | Minutes of sunshine                                                         |
| Rain                    | Minutes of rain                                                             |
| Voucher orders          | Number of items sold using a discount code                                  |
| Traffic                 | Number of visits to the website                                             |
| SEA cost                | Money spent on search engine advertisement                                  |
| SEA clicks              | Number of clicks on advertisements                                          |
| Emails accepted         | Number of emails sent to mailing list addresses                             |
| Emails opened           | Number of sent emails that are opened                                       |
| Spc day positive run up | Days before a positive special day on which the positive effect is noticed  |
| Spc day positive        | Positive special days                                                       |
| Spc day negative        | Negative special days                                                       |
| Target variables        | Number of items sent                                                        |

Table 2.2: Features and their descriptions

2.2.2 Features

The features consisted of several variables thought to be of influence on the target variables. Table 2.2 gives a short description of the features used. The reasoning behind the inclusion of these features will be given below.

• Weekday

The day of the week is a good indicator of the number of shipped items. The target variables show high weekly seasonality, as will be explained in further detail in section 2.2.3.

• Temp/Sunshine/Rain

The inclusion of these variables is based upon the experience of Greetz employees. Within the company it is believed that there is a correlation between weather and sales: on days with particularly bad weather there seem to be more sales than on days where the weather is good. With the inclusion of these variables it is expected that the model used can capture this trend.

• Voucher orders

Greetz uses sales and discounts as a way to increase the number of sold products. Including the number of items that are checked out using a discount code could help the model explain a spike in shipped items on days that such a discount is widely used. One caveat is that no distinction is made between the categories to which the discount applies. This could be an issue when the model forecasts multiple categories.

• Traffic

The number of website visits is likely to be of substantial influence on the number of sold items, and therefore in some way correlated to the shipping numbers.

• SEA cost/SEA clicks/Emails accepted/Emails opened

The amount of advertisement done and the number of website visits this generates are likely to be correlated to the number of shipped items. These variables are also likely to be correlated with the traffic variable. The choice of model must be robust to this kind of correlation between features.

• Special day features

Special days are yearly recurring days on which a significant increase or decrease in sent and sold items is seen. The special days that are of positive influence on the target variables are represented in the column “Spc day positive” and days of negative influence are found in the column “Spc day negative”. Examples of positive days are Christmas and Valentine's Day, an example of a negative day is the Remembrance of the Dead (Dodenherdenking). For some positive special days an increase in sent items is seen not only on the day itself, but in the days leading up to it as well. This information is encoded in the column “Spc day positive run up”. The length of the run up is based on the expert knowledge of Greetz employees and not learned or extracted from the data. The information about the special days is encoded by simply adding a string containing the name of the relevant day to the data entry.

• Target variables

The values of the past target variables are also added to the features. Many time series problems are approached using only the previous target values as input [2], so it is possible that adding them here improves performance.


Many of the mentioned features are intuitively expected not to correlate directly to the number of sent items, but more so to the number of sold items. The difference is subtle but worth expanding upon. Greetz gives customers the opportunity to buy items and set the date of delivery on a day of their choosing. This means that the number of shipped items on a given day is the sum of the number of sold items for that day with the delivery date set on that day, and the number of sold items for past days with the delivery date set on the given day.

For all features regarding advertisement and marketing a case can be made that they are expected to correlate to the number of sold items; Greetz chooses to invest time and money into marketing and advertisement for exactly this reason. However, because of the possibility to set a custom delivery date, an increase or decrease in the number of sold items does not necessarily result in a corresponding shift in shipped items. This is a shortcoming of the available data, but as this research is focused on making the best use of the data at hand the inclusion is justified. There is also a strong correlation between the two variables: a Spearman correlation test reports a score of 0.78, meaning the two variables are highly positively correlated.
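As an illustration, such a rank correlation could be checked as in the minimal sketch below. The numbers are made up and the column names sold and shipped are assumptions for this example, not the actual column names in the Greetz data set.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy daily values: items sold versus items shipped on the same days.
df = pd.DataFrame({
    "sold":    [2500, 2600, 2400, 2700, 3100, 1200, 1500],
    "shipped": [2537, 2498, 2610, 2655, 2980, 1406, 1620],
})

# Spearman's rho measures how well the two series move together in rank order.
rho, p_value = spearmanr(df["sold"], df["shipped"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```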

It is also expected that there is some correlation among the features, specifically between the different forms of advertisement and web traffic. The chosen modelling technique must therefore be robust to this kind of correlation in the input features.

2.2.3

Seasonality

Seasonality is the periodic fluctuation of a time series. Many time series of economic nature display some form of seasonality. In the case of retail data seasonality of some form is expected on a daily, weekly and yearly basis. One can imagine a breakfast restaurant experiencing recurring peak hours in the morning, furniture stores doing especially well on the weekends, and most retailers in the western world seeing an increase in sales during the Christmas period.

Figure 2.1 shows the total daily sent items starting from 2013. The most distinct yearly recurring peaks can be seen during the Christmas and Valentine's Day periods. Other annual busy periods are seen around Mother's Day and Father's Day. These holidays, however, are not on a fixed day each year and thus do not fall into a well defined seasonal period. Seasonality is also apparent on a weekly basis. To illustrate this, the daily sent items in the year 2018 are shown in figure 2.2. This weekly seasonality is caused by differing delivery prices during the weekends. Packages sent during the weekend cost more, therefore many customers do not choose a delivery date in the weekend, resulting in lower send volumes on Saturday and Sunday. These customers are also responsible for the recurring peaks on Mondays, as this is the first day after the weekend on which the prices for delivery are normal.


Figure 2.1: Total items shipped from 2013 through April first 2019
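A simple way to make these weekly and yearly patterns visible is to average the sent-items series per weekday and per calendar month. The sketch below uses a synthetic stand-in series, since the actual data is not public; with the real series the Monday peak, weekend dip and December and February peaks would show up directly in these means.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily total sent-items series.
dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
rng = np.random.default_rng(0)
sent = pd.Series(rng.integers(2000, 40000, len(dates)), index=dates)

# Average volume per day of the week (0 = Monday, 6 = Sunday).
weekly_profile = sent.groupby(sent.index.dayofweek).mean()

# Average volume per calendar month, to expose yearly peaks.
monthly_profile = sent.groupby(sent.index.month).mean()
print(weekly_profile, monthly_profile, sep="\n")
```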

2.2.4 Data preparation

To prepare the data for processing several steps have been taken. First the data was rearranged into a csv file in similar fashion to the tables shown in the last section. The data was delivered in two xlsx files, both containing sheets for all the individual features and one for the target variables, with one file containing the entries from 2010 until the end of 2016 and the other containing the remaining years. These files were first combined by appending the later years' worth of data to the first, using LibreOffice Calc. The next step was to combine all the sheets containing feature columns into one csv file.

Some difficulty arose from the fact that there was no common format used for the dates. Each feature was saved in a separate sheet, marked with its own date column. Some dates were of the form dd/mm/yyyy while others were of the form mm/dd/yyyy, some spelled out the names of the months in full while others used the first three letters, some used slashes as separators while others used dashes. A custom Python function was written that took all cases into account and changed all dates to a dd-mm-yy format.

After all the dates were changed, the sheets in the xlsx files were loaded into individual pandas DataFrames. The dates were set as the indexes for these DataFrames, which opens up the use of the pandas merge function to merge the DataFrames together. The merge function joins columns on indexes, meaning that rows are joined on matching indexes. The use of this function with the set indexes is preferred over simply joining the columns or adding them by hand in a program like LibreOffice Calc or Microsoft Excel, because of the way the data was stored by Greetz. When a value in the data should be zero, say when no voucher orders were processed on a given day, no entry was made for that day, instead of an entry with the value zero. This meant that entry i of feature column x does not necessarily correspond to the same date as entry i of feature column y, so simply appending the columns would result in mismatched dates and is thus not desired. The only features that had consistent entries for each date since the start of the data set were the features regarding the weather. Using the pandas merge to add the feature columns to the weather columns made sure the ordering of the features was correct. After this the DataFrame was saved as a csv file, and the special days columns were added by hand using LibreOffice Calc. This data was not stored by Greetz and had to be added manually; doing this programmatically is not feasible because of the irregular intervals for special days like Mother's Day and Father's Day, and the effort it would take to take leap years into account.

For the final preprocessing steps the csv file was again loaded into a pandas DataFrame. The missing entries were replaced with zeros. The reason for doing this instead of another method, mean filling for instance, is that the missing values actually meant that the value for that variable should be zero. The categorical features were one-hot encoded (the “Date” column being omitted). The final step undertaken before the data was ready to be used by the model was normalization, for which the MinMaxScaler module from the sklearn library was used. A small but important footnote regarding normalization is that it should be done after the data is split, for otherwise information about the test set might be present in the train set. Practically this means that the scaler is first fitted to the training set and then used to transform the values in both train and test set. The “Date” column is removed from the data, as it bears no use in forecasting. The target variables were not scaled.
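A condensed sketch of these steps is shown below. The actual preprocessing script is not available, so the toy "sheets", the column names and the delegation of date parsing to pandas (rather than the custom function described above) are assumptions of this sketch.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalise_date(raw):
    """Bring the mixed date formats described above to a single dd-mm-yy string."""
    # dayfirst=True covers the dd/mm/yyyy entries; pandas also accepts month
    # names written out in full or abbreviated, and "/" or "-" as separators.
    return pd.to_datetime(raw, dayfirst=True).strftime("%d-%m-%y")

# Two toy feature "sheets": the weather sheet has an entry for every day,
# the voucher sheet skips days on which no voucher orders were processed.
weather = pd.DataFrame({"Date": ["01/01/2018", "02-01-2018", "3 January 2018"],
                        "Temp": [4.1, 5.0, 3.2]})
vouchers = pd.DataFrame({"Date": ["01/01/2018", "3 January 2018"],
                         "Voucher orders": [12, 7]})

frames = []
for sheet in (weather, vouchers):
    sheet = sheet.copy()
    sheet["Date"] = sheet["Date"].apply(normalise_date)
    frames.append(sheet.set_index("Date"))

# Joining on the date index keeps rows aligned even when a sheet skipped days;
# a missing voucher entry genuinely means "zero orders", hence fillna(0).
data = frames[0].join(frames[1], how="outer").fillna(0)

# One-hot encode categorical columns (here a weekday column as an example).
data["Weekday"] = ["Mon", "Tue", "Wed"]
data = pd.get_dummies(data, columns=["Weekday"])

# Fit the scaler on the training portion only, then transform both splits.
split = 2
train, test = data.iloc[:split], data.iloc[split:]
scaler = MinMaxScaler().fit(train)
train = pd.DataFrame(scaler.transform(train), index=train.index, columns=train.columns)
test = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)
print(train)
```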

The data from before 2013 was not used in this research and thus dropped. There are two reasons for this, the first being that many features were not being stored during that period. The second reason is based on the expert knowledge of Greetz employees. They have seen the company change over the years, and stated that the way the company operates, as well as the number of orders it processes, is not comparable to how it was before 2013. Because of this change the model might learn wrong relations from that part of the data set.

Chapter 3

Method

3.1 LSTM

Long short-term memory recurrent neural networks (LSTM RNNs) have been used to great effect to learn sequence based problems like translation [10], traffic forecasting [20], image captioning [13] and predicting student dropout [11]. The common denominator among these examples is that they all incorporate some form of sequential input and/or output data. The fact that LSTM RNNs can handle this kind of problem makes them well suited for the time series forecasting problem at hand, as has been shown in similar research [15] [14] [18]. Furthermore the inherent flexibility in dealing with multivariate input data in neural networks in general, and LSTM RNNs in this particular case, makes them well suited with regards to the data set available.

The use of the LSTM variant of recurrent neural networks is preferred over the vanilla version because it proves to be a solution to the vanishing and exploding gradient problems [1]. These problems arise when, during backpropagation, the gradient gets too small in the case of a vanishing gradient, or too large in the case of an exploding gradient. This can happen because of the repeated multiplication of the gradient with the activation functions in each layer. A gradient that is too small cannot update the weights enough, whereas a too large gradient overshoots the update, both limiting the learning capabilities of the network. Due to these problems vanilla RNNs have trouble learning long term dependencies, which becomes apparent when the input sequence is long. To overcome this problem LSTM RNNs incorporate memory cells. Memory cells are a combination of the following four units: an input gate, a forget gate, an output gate and a self-recurrent neuron. The input gate determines to what extent an incoming signal can change the state of the memory cell. The output gate controls to what extent the outgoing signal can change the next memory cell. The forget gate controls the amount of information retained from the previous state of the memory cell. These gates are matrix operation functions, each with its own set of weights and biases. They take two input signals, the input features and the past cell's hidden state, multiply each by a weight matrix, add the results together with a bias vector, and pass the sum through an activation function.


Figure 3.1: Memory cell of an LSTM

Figure 3.1 shows a memory cell to illustrate the working of these gates in tandem. The meaning of the mathematical symbols is as follows:

1. $x_t$ is the input vector to the memory cell at time $t$.
2. $W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$, $U_o$ and $V_o$ are weight matrices.
3. $b_i$, $b_f$, $b_c$ and $b_o$ are bias vectors.
4. $h_t$ is the hidden state of the memory cell at time $t$.
5. $i_t$ and $\tilde{C}_t$ are the values of the input gate and the candidate state of the memory cell at time $t$, respectively, which can be formulated as:

   $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$   (3.1)
   $\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$   (3.2)

6. $f_t$ and $C_t$ are the values of the forget gate and the state of the memory cell at time $t$, respectively, which can be calculated by:

   $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$   (3.3)
   $C_t = i_t * \tilde{C}_t + f_t * C_{t-1}$   (3.4)

7. $o_t$ and $h_t$ are the values of the output gate and the hidden state of the memory cell at time $t$, respectively, which can be formulated as:

   $o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)$   (3.5)
   $h_t = o_t * \tanh(C_t)$   (3.6)

Figure 3.1 and description are presented as found in [18].
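To make the gate computations concrete, the update of a single memory cell can be written out directly from equations 3.1–3.6. The sketch below is a plain numpy transcription with toy dimensions and random weights; it is illustrative only and not the implementation used in this research.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, C_prev, params):
    """One LSTM memory-cell update following equations 3.1-3.6."""
    W_i, U_i, b_i = params["i"]        # input gate
    W_f, U_f, b_f = params["f"]        # forget gate
    W_c, U_c, b_c = params["c"]        # candidate state
    W_o, U_o, V_o, b_o = params["o"]   # output gate (with V_o acting on the cell state)

    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)               # (3.1)
    C_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)           # (3.2)
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)               # (3.3)
    C_t = i_t * C_tilde + f_t * C_prev                          # (3.4)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + V_o @ C_t + b_o)   # (3.5)
    h_t = o_t * np.tanh(C_t)                                    # (3.6)
    return h_t, C_t

# Toy dimensions: 5 input features, 3 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
params = {
    "i": (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)), np.zeros(n_hid)),
    # Forget-gate bias initialised to 1, as discussed in section 3.2.
    "f": (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)), np.ones(n_hid)),
    "c": (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)), np.zeros(n_hid)),
    "o": (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)),
          rng.standard_normal((n_hid, n_hid)), np.zeros(n_hid)),
}
h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_cell_step(rng.standard_normal(n_in), h, C, params)
print(h, C)
```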

The reason why the LSTM architecture is a good solution to the vanishing gradient problem is found in the way the cell state signal $C$ is propagated through the network. Because the update term for this value, $C_t = i_t * \tilde{C}_t + f_t * C_{t-1}$, is an addition rather than a multiplication, the signal does not vanish easily. The addition results in a derivative used during backpropagation that consists of a summation rather than a multiplication; because the terms are summed instead of multiplied, multiple near-zero terms will not result in a vanishing gradient. Another important part of the solution is the fact that the gates can learn to what extent the signal needs to be modified. The weights and biases of the gate units are updated in similar fashion to regular weights in neural networks. This means that the gates can learn the best configuration by backpropagation. If a long term sequence needs to be learned, the forget and input gates of the cells can be adjusted so as to let the signal pass. The combination of the gates and the additive cell state update combats the vanishing gradient and gives the LSTM architecture its long-term memory.

In this research several network architectures have been tried and tested, which will be expanded upon in section 4.1. In that section, values for the number of memory cells and hidden units used in the LSTM layers will be reported. In this research, memory cells refer to the “block” containing the gates; hidden units refers to the dimensionality of the hidden state of a memory cell.

3.2 Lookback

In this research lookback is defined as the number of days used as model input. For example: a lookback of 7 means the features for the past 7 days will be used to train the model. As is discussed in section 2.2.3 the target variables show high seasonality on a weekly and yearly basis. Different configurations for the lookback might have an effect on the way the model captures this seasonality, and thus on the overall performance. When training the models, lookback values of 7, 14, 45 and 370 have been tried. The values 7 and 14 were used in an effort to capture weekly trends. The value 370 was used to try and capture yearly trends; 370 was used as opposed to 365 to account for leap years. With this relatively high lookback, and corresponding long input sequence, the bias vectors of the forget gates in the memory cells, $b_f$, were initialized with the value 1. This is done so that the forget gates are “open” to allow for better gradient flow, which enables the LSTM to learn long range dependencies better [12].

Something to take note of when increasing the lookback is that doing so decreases the available training data. With a lookback of size N, the first N available data entries will be used as input for the prediction of days N+1 to N+45. These first N days cannot be used as targets during training, because the input sequence for these days would range further back than the start date of the data set.
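The sliding-window construction described above could look as follows. This is a sketch under stated assumptions: the function name and the toy arrays are illustrative and not taken from the thesis code.

```python
import numpy as np

def make_windows(features, targets, lookback=45, horizon=45):
    """Slice a daily series into (lookback, n_features) inputs and length-`horizon` targets.

    Day t is predicted from the features of the lookback days before it, so the
    first `lookback` days of the data set cannot serve as training targets.
    """
    X, y = [], []
    for start in range(len(features) - lookback - horizon + 1):
        X.append(features[start:start + lookback])
        y.append(targets[start + lookback:start + lookback + horizon])
    return np.array(X), np.array(y)

# Toy data: 400 days, 10 features, one target series.
feats = np.random.rand(400, 10)
sent = np.random.rand(400)
X, y = make_windows(feats, sent, lookback=45, horizon=45)
print(X.shape, y.shape)   # (311, 45, 10) (311, 45)
```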

3.3 Evaluation

The evaluation of the models in this research is two-fold. First the models will be evaluated using the commonly used metrics of mean squared error and mean absolute error. While the mean squared error would be sufficient on its own, the mean absolute error will also be reported, as it is easier to interpret with regard to the actual data. Equations 3.7 and 3.8 show the formulas for the calculation of the mean squared error and the mean absolute error respectively, where $n$ is the total number of data points in the set, $y_i$ are the target values and $\hat{y}_i$ are the values as predicted by the model.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (3.7)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (3.8)$$
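In code, these two metrics amount to the following minimal numpy sketch; the example values are taken from table 2.1 purely for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, equation 3.7."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error, equation 3.8."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([2537, 41853, 29753], dtype=float)
y_pred = np.array([3000, 39000, 31000], dtype=float)
print(mse(y_true, y_pred), mae(y_true, y_pred))
```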

Aside from these metrics the scores will also be compared to the forecasting done by Greetz, which is treated as a baseline for the model. The company's employees make a forecast for each of the individual product categories (i.e. cards, flowers, chocolate, etc.) each month. The forecasting process is not automated and is done without the use of advanced statistical analysis or machine learning methods. To calculate the forecast for a period, the send volumes of the same period in the previous year are multiplied by the expected growth for this year. Afterwards an experienced employee evaluates the prediction and adjusts it where necessary. It is thus mostly based on the expert knowledge of the employee. Neither the send volumes of years further back, nor the features from the data set used in this research are taken into account in the process. The results of this forecasting mechanism, when evaluated using the metrics discussed, are presented in table 3.1. There were no separate forecasts available for the different types of cards, so only the evaluation for the total sent cards is given.

The results of the forecast made by Greetz were only available for the year 2018, therefore only the results of the models for the same period will be used for comparison. However, the model is trained to forecast only 45 days ahead. To get a forecast for a year, the output for 2018 was taken in intervals of 45 days, and these were combined to get a forecast for the entire year. This has some problems, namely that the yearly forecast starting one day earlier or later might look completely different. It is thus likely that the model's forecast for 2018 is not the best possible. This would be problematic if the model were to be used on a daily basis, with previous forecasts updated in accordance with the current one to find the most likely. The use case of this model however is to be used once every 45 days, thus the fact that the used forecast is a somewhat arbitrarily selected sequence is in line with how the model would be used.

| Metric | Total    | Cards total | Gifts total | Balloons | Beverage | Cakes | Chocolate | Flowers | Other gifts |
|--------|----------|-------------|-------------|----------|----------|-------|-----------|---------|-------------|
| MSE    | 64577098 | 55090685    | 172220      | 147431   | 18599    | 5450  | 109017    | 64575   | 467742      |
| MAE    | 4584     | 4176        | 353         | 191      | 72       | 46    | 187       | 169     | 566         |

Table 3.1: Evaluation of the forecasting method used by Greetz

3.4 Research question

As stated before, this research is a case study looking to find the best way to forecast the number of items shipped per day given the available data. The length of the forecast should be 45 days and, if possible, it should differentiate between the product categories. The forecasts will be compared to the forecasts done by Greetz. The objective of this research is to improve forecasting accuracy over the forecasting technique utilized by Greetz.

The question this research will try to answer is thus as follows: can the forecasting accuracy of the method utilized by Greetz be improved upon by a model using the available data? This was tested for the total sent items and the individual product categories. Long short-term memory recurrent neural networks were used as the modelling method.

3.5 Implementation

In this research Python 3 (version 3.6.8) was used as the main tool for modelling and preprocessing. For preprocessing the pandas library (version 0.24.2) was used, as well as some “manual” operations on csv files using LibreOffice Calc. The keras library (version 2.2.4) was chosen to implement the neural networks, making use of the tensorflow backend. Keras was chosen over other libraries because of its relatively gradual learning curve. An added perk of using keras is the possibility of using a GPU to train the models, instead of the computer's CPU. This turned out to speed up computations significantly, with the learning process being completed up to 20 times faster. Specifically this was achieved by using the CuDNNLSTM class instead of the regular LSTM class present in the keras library. Training was done on a Dell XPS 15 laptop using an Nvidia GTX 1050 GPU.

Chapter 4

Results and Evaluation

In this research several model architectures have been tested. In section 4.1 the models used will be described and their results will be shown, together with the hyper-parameters used. While the mean absolute error is reported, the mean squared error is used as the loss function during training. The evaluation of the forecast over 2018 is also reported, denoted as MSE 2018 and MAE 2018 for mean squared error and mean absolute error respectively. In section 4.2 the models will be evaluated and compared to the results of the forecasting mechanism utilized by Greetz.

The reported models were trained using a 70-30 train-test split. Of the training data, 20% was used for validation. All LSTM layers of the reported models used a sigmoid as the activation function for the gates and a hyperbolic tangent as the activation function for the outputs of the memory cells. All fully connected layers used a ReLU activation function.

4.1 Results

4.1.1 Basic model

The first model tried used a very basic architecture, consisting of a single LSTM layer connected to a fully connected hidden layer, which in turn was connected to the output layer. The LSTM layer output came only from the last memory cell, meaning that the LSTM encoded the input sequence into a single fixed-length vector. The numbers of hidden units used in the LSTM and fully connected layers are 64 and 32 respectively. The output layer is a single fully connected layer which outputs a vector of length 45, corresponding to the 45 days of output.
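A sketch of what this architecture might look like in Keras (the library named in section 3.5, using its 2.2.4-style API) is given below. The layer sizes follow the text; the number of input features, the use of the plain LSTM layer instead of the CuDNNLSTM variant, and the optimizer settings copied from table 4.1 are assumptions of this sketch, not the exact training script.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

LOOKBACK, N_FEATURES, HORIZON = 370, 40, 45  # feature count is an assumption

model = Sequential([
    # Single LSTM layer; only the last time step's output is passed on.
    LSTM(64, input_shape=(LOOKBACK, N_FEATURES)),
    Dense(32, activation="relu"),
    Dense(HORIZON),          # one output value per forecast day
])
model.compile(optimizer=Adam(lr=0.008, decay=5e-5), loss="mse", metrics=["mae"])
model.summary()
```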

This model was not expected to achieve a high level of performance, but it is useful as a baseline metric. The results achieved with this model and the used hyper-parameters are shown in table 4.1. The forecast for 2018 is given in figure 4.1.

| Parameter           | Value   | Evaluation metric | Score     |
|---------------------|---------|-------------------|-----------|
| Lookback            | 370     | MSE train         | 163780000 |
| Optimizer           | Adam    | MAE train         | 6443      |
| Learning rate       | 0.008   | MSE test          | 800028177 |
| Learning rate decay | 0.00005 | MAE test          | 20551     |
|                     |         | MSE 2018          | 695034929 |
|                     |         | MAE 2018          | 18721     |

Table 4.1: Hyper-parameter configuration and results for the basic model

Figure 4.1: Prediction of the basic model of the total items shipped in 2018

While experimenting with the different hyper-parameter configurations it was noticed that the models with the best validation loss all reached a plateau around the same value. This could mean that adding more complexity in the form of extra layers can result in better scores. When comparing the train and test loss reported in table 4.1 it is clear that the model is overfitted to the training data. Because of this, and the fact that the addition of extra complexity is likely to increase the overfitting, regularization measures were added in the next model.

4.1.2 Extension of the basic model

The extended version of the basic model consisted of two stacked LSTM layers followed by 4 fully connected hidden layers. Stacking LSTM layers has been shown to result in better performance when compared to a single layered architecture [5] [4]. As with deep feed forward neural networks, the extra layers offer the possibility to represent the input in higher levels of abstraction. This is hypothesised to make the learning of the temporal structure between successive time steps easier [6].

Both LSTM layers used 128 hidden units. The first fully connected layer following the LSTM used 64 hidden units, and each following layer used half the number of hidden units of its predecessor. To prevent overfitting, dropout was added as a regularization method. Applying dropout to LSTM layers [9] and fully connected feed forward layers [3] has been shown to reduce overfitting and improve model performance. Dropout was added to all layers of the network. The amount of dropout used was found experimentally to give the best results, just as with the other hyper-parameters. The hyper-parameters used and the results are given in table 4.2; the forecast for 2018 is presented in figure 4.2.
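For reference, the described stack might be expressed in Keras roughly as follows; the feature count, the exact placement of dropout (the text only says it was added to all layers) and the optimizer settings taken from table 4.2 are assumptions of this sketch.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import Adam

LOOKBACK, N_FEATURES, HORIZON = 45, 40, 45  # feature count is an assumption

model = Sequential([
    # Two stacked LSTM layers; the first must return the full sequence.
    LSTM(128, return_sequences=True, input_shape=(LOOKBACK, N_FEATURES)),
    Dropout(0.1),
    LSTM(128),
    Dropout(0.1),
    # Four fully connected hidden layers, halving the units each time.
    Dense(64, activation="relu"), Dropout(0.1),
    Dense(32, activation="relu"), Dropout(0.1),
    Dense(16, activation="relu"), Dropout(0.1),
    Dense(8, activation="relu"), Dropout(0.1),
    Dense(HORIZON),
])
model.compile(optimizer=Adam(lr=0.008, decay=5e-5), loss="mse", metrics=["mae"])
```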

| Parameter           | Value   | Evaluation metric | Score     |
|---------------------|---------|-------------------|-----------|
| Lookback            | 45      | MSE train         | 289910000 |
| Optimizer           | Adam    | MAE train         | 10689     |
| Learning rate       | 0.008   | MSE test          | 373011364 |
| Learning rate decay | 0.00005 | MAE test          | 9615      |
| Dropout             | 10%     | MSE 2018          | 264957700 |
|                     |         | MAE 2018          | 9366      |

Table 4.2: Hyper-parameter configuration and results for the extended basic model

The results are noticeably better than the results obtained with the basic model. The results on the test set have improved significantly. The training loss has increased and the test loss has decreased compared to the basic model. This is a result of the addition of dropout regularization. By adding dropout the model is able to generalize better to unseen data. Looking at figure 4.2 it can be noted that the model somewhat captures the weekly seasonality. It is also clear that this model does not predict the outliers around the special days particularly well.


Figure 4.2: Forecast of the extended model of the total items shipped in 2018

4.1.3 Encoder-Decoder

The next model tried was an encoder-decoder using LSTM recurrent layers. An encoder-decoder recurrent neural network is a specific type of architecture consisting of a recurrent layer which transforms the input to a vector of fixed length (the encoder) and another recurrent layer which transforms this vector into a sequence of outputs (the decoder). Encoder-decoders have proven to achieve good results in several types of sequence modelling problems, such as machine translation [8] and time series prediction [16].

A key difference in the way an encoder-decoder produces its output, as opposed to the previous models, is that the output is computed sequentially. Where the previous models' output layer produced a vector of length 45 in one step, the output of an encoder-decoder is given by the decoding LSTM layer. The decoder has the same number of memory cells as the length of the output sequence, where each cell gives the output for one day of the forecast.

In the implementation used, three fully connected layers were stacked on top of the outputs of each memory cell. Each of these layers used half the number of hidden units of its predecessor, starting at 64. Both encoder and decoder made use of a single LSTM layer, as tests with multiple layers resulted in worse performance. Each LSTM layer used 128 hidden units. Dropout was added to the input of the decoder, the output of the encoder and the hidden layers following the encoder. Additional regularization was added in the form of L1L2 recurrent regularization. The values of the hyper-parameters used and the results are presented in table 4.3; the forecast for 2018 is shown in figure 4.3.
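One common way to build such an encoder-decoder in Keras is with a RepeatVector between the two LSTM layers and TimeDistributed dense layers on the decoder outputs. The sketch below follows that pattern with the sizes and regularization values from the text and table 4.3; the feature count and the exact dropout placement are assumptions, and the targets would need shape (45, 1).

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, RepeatVector, TimeDistributed
from keras.optimizers import Adam
from keras.regularizers import l1_l2

LOOKBACK, N_FEATURES, HORIZON = 45, 40, 45  # feature count is an assumption

model = Sequential([
    # Encoder: compresses the input sequence into a fixed-length vector.
    LSTM(128, recurrent_regularizer=l1_l2(0.2, 0.2),
         input_shape=(LOOKBACK, N_FEATURES)),
    Dropout(0.4),
    # Repeat the encoding once per forecast day and decode it step by step.
    RepeatVector(HORIZON),
    LSTM(128, return_sequences=True, recurrent_regularizer=l1_l2(0.2, 0.2)),
    Dropout(0.4),
    # Fully connected stack applied to every decoder time step.
    TimeDistributed(Dense(64, activation="relu")),
    TimeDistributed(Dense(32, activation="relu")),
    TimeDistributed(Dense(16, activation="relu")),
    TimeDistributed(Dense(1)),   # one value per forecast day
])
model.compile(optimizer=Adam(lr=0.001), loss="mse", metrics=["mae"])
```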

| Parameter           | Value | Evaluation metric | Score     |
|---------------------|-------|-------------------|-----------|
| Lookback            | 45    | MSE test          | 261616786 |
| Optimizer           | Adam  | MAE test          | 7620      |
| Learning rate       | 0.001 | MSE train         | 157190000 |
| Learning rate decay | 0     | MAE train         | 6588      |
| Dropout             | 40%   | MSE 2018          | 266678033 |
| L1L2 regularization | 0.2   | MAE 2018          | 7586      |

Table 4.3: Hyper-parameter configuration and results for the encoder-decoder model

Figure 4.3: Forecast of the Encoder-Decoder model of the total items shipped in 2018

When comparing the results with the previous model it is clear that the encoder-decoder performs better, however not by much. From the forecast in figure 4.3 it can be noted that, like the previous model, the encoder-decoder somewhat captures the weekly cycle of shipped items. It also seems to perform marginally better in the forecast of the outliers, especially around Christmas. The addition of the recurrent L1L2 regularizer and the relatively high dropout percentage seem to have effectively reduced overfitting, bringing the train and test loss closer together than in the previous models.

4.1.4 Multiple-output encoder-decoder

Using the encoder-decoder architecture, an attempt was made at forecasting the individual product categories. This was done by connecting multiple decoding LSTM layers to the encoder, each responsible for the output of one product category. This was first tried using 8 decoding layers, one for every product category. The results were lacking, with the training loss not decreasing below a threshold comparable to mean estimation. Forecasting all individual product categories was not possible using this method. To at least partially differentiate between the product categories, an attempt was made to differentiate between the cards and gifts categories. Using two decoding LSTM layers corresponding to these output variables produced better results, as the reduced model complexity allowed for better learning.

As with the previous model, the outputs of each memory cell of the decoding layers were fed through a stack of fully connected layers. The best results were achieved with 4 layers, with the first containing 128 hidden nodes and each successive layer having half the number of its predecessor. Dropout was applied to each of these layers, as well as to the inputs of the decoders and the output of the encoder. Again, L1L2 regularization was added as a recurrent regularizer to all LSTM layers. The hyper-parameters used and the results are presented in table 4.4, where the loss for the individual product categories is also shown. The forecasts for 2018 are given in figures 4.4 and 4.5, for cards and gifts respectively.
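With the Keras functional API, such a two-branch decoder might look as follows; as before, the feature count, dropout placement and other unspecified details are assumptions of this sketch rather than the thesis implementation.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Dropout, RepeatVector, TimeDistributed
from keras.optimizers import Adam
from keras.regularizers import l1_l2

LOOKBACK, N_FEATURES, HORIZON = 45, 40, 45  # feature count is an assumption

inputs = Input(shape=(LOOKBACK, N_FEATURES))
encoded = LSTM(128, recurrent_regularizer=l1_l2(0.2, 0.2))(inputs)
encoded = Dropout(0.2)(encoded)
repeated = RepeatVector(HORIZON)(encoded)

def decoder_branch(name):
    """One decoding LSTM plus dense stack per product category."""
    x = LSTM(128, return_sequences=True,
             recurrent_regularizer=l1_l2(0.2, 0.2))(repeated)
    x = Dropout(0.2)(x)
    for units in (128, 64, 32, 16):   # dense stack, halving the units
        x = TimeDistributed(Dense(units, activation="relu"))(x)
        x = Dropout(0.2)(x)
    return TimeDistributed(Dense(1), name=name)(x)

model = Model(inputs=inputs,
              outputs=[decoder_branch("cards"), decoder_branch("gifts")])
model.compile(optimizer=Adam(lr=0.001), loss="mse", metrics=["mae"])
```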

| Parameter           | Value | Evaluation metric | Value     | Evaluation metric | Value     | Evaluation metric | Value   |
|---------------------|-------|-------------------|-----------|-------------------|-----------|-------------------|---------|
| Lookback            | 45    | MSE train         | 41165032  | MSE train cards   | 38952930  | MSE train gifts   | 2212102 |
| Optimizer           | Adam  | MAE train         | 3357      | MAE train cards   | 2842      | MAE train gifts   | 515     |
| Learning rate       | 0.001 | MSE test          | 304225241 | MSE test cards    | 294701999 | MSE test gifts    | 9523242 |
| Learning rate decay | 0     | MAE test          | 9946      | MAE test cards    | 8544      | MAE test gifts    | 1402    |
| Dropout             | 20%   | MSE 2018          | 196315808 | MSE 2018 cards    | 188074118 | MSE 2018 gifts    | 8241690 |
|                     |       | MAE 2018          | 8869      | MAE 2018 cards    | 7521      | MAE 2018 gifts    | 1348    |

Table 4.4: Hyper-parameter configuration and results for the encoder-decoder with multiple outputs


Figure 4.4: Forecast of the Encoder-Decoder model of the cards shipped in 2018


From table 4.4 it becomes clear that the model is overfitted to the training data, with the training loss being significantly lower than the test loss. This is also true for the individual product categories. Adding more dropout or increasing the value of the L1L2 regularization was tried to combat this; it did however not decrease the test error. Regarding the forecasts in figures 4.4 and 4.5 it can be concluded that the forecast for the sent cards performs reasonably well, capturing some of the weekly seasonality and the recurring yearly peaks around Christmas. The forecast for the gifts category does not seem to approximate the real values very well: it is consistently too low and does not give a good approximation of the weekly seasonality.

4.2 Evaluation

When regarding the results and forecast plots of the best models it can be noted that the forecasts are a rough approximation of the true values. The weekly seasonality is taken into account, however there is still a lot of room for improvement. The models fail to give accurate predictions of the yearly peaks during Christmas and Valentine's Day. The hypothesis that a lookback of 370 days could help during these periods does not hold, as the best models all use a lookback of 45 days, save the basic model. The train and test loss of the model forecasting the total number of items shipped are within reasonable range of each other, meaning that this model generalizes well to unseen data. This is not so much the case for the model with multiple outputs, even though regularization measures have been applied.

To put the results in perspective they are compared against the forecasting mechanism utilized by Greetz. Table 4.5 shows the evaluation of the forecast by Greetz over 2018, the evaluation of the models' forecasts over 2018 and the rounded percentage difference between the two. Plots for the forecasts of the total shipped items, cards and gifts categories are given in figures 4.6, 4.7 and 4.8.

Figure 4.6: Forecast by Greetz of the total items shipped in 2018


Figure 4.8: Forecast by Greetz of the gifts shipped in 2018

| Evaluation metric    | Greetz forecast value | Model forecast value | Percentage difference |
|----------------------|-----------------------|----------------------|-----------------------|
| MSE total sent items | 64577098              | 266678033            | 413%                  |
| MAE total sent items | 4584                  | 7586                 | 165%                  |
| MSE cards            | 55090684              | 188074118            | 341%                  |
| MAE cards            | 4175                  | 7521                 | 180%                  |
| MSE gifts            | 1872933               | 8241690              | 440%                  |
| MAE gifts            | 914                   | 1348                 | 147%                  |

Table 4.5: Evaluation of the forecasts by Greetz and by the models over 2018

From table 4.5 it becomes evident that the models perform worse than the forecasts done by Greetz. From the plots of the forecasts it becomes clear that the Greetz forecast performs significantly better during the peak seasons. Its results for “regular” periods are also notably better, but not by the same margins. This is confirmed by the higher percentage differences for the mean squared error compared to the mean absolute error.

Chapter 5

Conclusion

The objective of this study was to increase forecasting accuracy over the forecasting method utilized by Greetz, making the best use of the data available to the company. To this end long short-term memory recurrent neural networks have been used as the forecasting technique. The results of different model architectures have been reported and the best were compared to the results of the method used by Greetz, to evaluate whether the accuracy was an improvement. This was done for the product categories cards and gifts, and the total products sent.

The results show that with the method used it is possible to approximate the total items and cards categories reasonably well, and the gifts category less so. It must however be concluded that the models developed in this research do not perform better than the baseline forecasting method from Greetz, on any of the product categories. Furthermore it was found impossible to produce any reasonable forecasts for all product categories individually with the method used in this research.

Chapter 6

Discussion

In this chapter further improvements to the method used and possible different approaches to the problem will be discussed.

The optimization process of the neural networks was done in a trial and error fashion. The outcomes of different architectures or hyper-parameter configurations were analyzed after training, and based on the results new parameters and architectures were tried. Testing this way does not systematically cover all possible configurations of parameters, meaning there are options that have not been checked. A grid search approach would not have this problem, and therefore would be a better approach to optimization (a minimal sketch is given below). Aside from optimizing the used models further, different types of architectures could also have led to better performance. RNNs with attention mechanisms have been shown to produce good results in time series prediction [19]. Furthermore, convolutional networks with residual causal skip connections have reached state of the art performance in time series problems [17]. The inclusion of these techniques could not be realized in this research, but is worth exploring in further work.
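Such a grid search could be organised along the following lines; the parameter grid and the train_and_evaluate helper are hypothetical placeholders, not part of the thesis code.

```python
from itertools import product

def train_and_evaluate(lookback, lr, dropout, hidden):
    """Placeholder: train one model with this configuration and return its validation MSE."""
    # In a real grid search this would build, fit and evaluate an LSTM model;
    # a dummy score is returned here so the sketch runs on its own.
    return (lookback - 45) ** 2 + lr + dropout + hidden

grid = {
    "lookback": [7, 14, 45, 370],
    "lr":       [0.001, 0.008],
    "dropout":  [0.1, 0.2, 0.4],
    "hidden":   [64, 128],
}

best_score, best_config = float("inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**config)
    if score < best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```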

It is also possible that the use of neural networks altogether is not the best choice for this particular problem. Neural networks are powerful, however they need many training instances in order to get the best results. The number of training instances (with the lookback parameter set to 45) was 1253, which is not particularly many. It might well be the case that more analytical approaches, like ARIMA models or exponential smoothing, perform better on this particular problem.

To conclude this discussion, a shortcoming in the way the data was preprocessed will be discussed. The way the special days are represented in the input data is suboptimal. Each positive special day is marked (by the name of the special day in the relevant feature columns), as well as the days leading up to the special day in which an increase in shipped orders is seen. When a forecast is made over a period with a special day in it, the input features do not contain any information about the fact that a special day falls in that period, unless the timeframe of the input features falls within the days leading up to the special day. To clarify: Greetz sees an increase in shipments a week before Valentine's Day, meaning that the 7 days leading up to Valentine's Day are marked with the value “Valentinesday” in the feature column “Spc day positive run up”. When the model forecast starts on a day before these 7 days, the input holds no information about the fact that Valentine's Day will occur during the period to be forecast. This is an oversight of this research, which was noticed too late to adjust, and it is a possible explanation for the poor model performance on the special day periods. One possible solution would be to add a feature with information about what day of the year it is, in similar fashion to how the days of the week are represented. This way, information about the distance to the special days is encoded in the model input.
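A hypothetical sketch of this fix is shown below: a day-of-year column, or an explicit countdown to a fixed special day, added with pandas. The column names are illustrative.

```python
import pandas as pd

# Encode the day of the year so the model can infer how far away fixed
# special days such as Valentine's Day are.
df = pd.DataFrame({"Date": pd.date_range("2018-01-01", "2018-12-31", freq="D")})
df["day_of_year"] = df["Date"].dt.dayofyear

# Alternatively, an explicit countdown feature per special day.
valentines = pd.Timestamp("2018-02-14")
df["days_until_valentines"] = (valentines - df["Date"]).dt.days.clip(lower=0)
print(df.head())
```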


Bibliography

[1] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780.

[2] Johann Du Preez and Stephen F. Witt. “Univariate versus multivariate time series forecasting: an application to international tourism demand”. In: International Journal of Forecasting 19.3 (2003), pp. 435–451.

[3] Geoffrey E. Hinton et al. “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580 (2012).

[4] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. “Speech Recognition with Deep Recurrent Neural Networks”. In: CoRR abs/1303.5778 (2013). arXiv: 1303.5778. URL: http://arxiv.org/abs/1303.5778.

[5] Michiel Hermans and Benjamin Schrauwen. “Training and Analysing Deep Recurrent Neural Networks”. In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 190–198. URL: http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf.

[6] Razvan Pascanu et al. “How to construct deep recurrent neural networks”. In: arXiv preprint arXiv:1312.6026 (2013).

[7] Foster Provost and Tom Fawcett. “Data science and its relationship to big data and data-driven decision making”. In: Big Data 1.1 (2013), pp. 51–59.

[8] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”. In: CoRR abs/1406.1078 (2014). arXiv: 1406.1078. URL: http://arxiv.org/abs/1406.1078.

[9] Vu Pham et al. “Dropout improves recurrent neural networks for handwriting recognition”. In: 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 285–290.

[10] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks”. In: Advances in Neural Information Processing Systems. 2014, pp. 3104–3112.

[11] Mi Fei and Dit-Yan Yeung. “Temporal models for predicting student dropout in massive open online courses”. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015, pp. 256–263.

[12] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. “An empirical exploration of recurrent network architectures”. In: International Conference on Machine Learning. 2015, pp. 2342–2350.

[13] Andrej Karpathy and Li Fei-Fei. “Deep visual-semantic alignments for generating image descriptions”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3128–3137.

[14] Zachary C. Lipton et al. “Learning to diagnose with LSTM recurrent neural networks”. In: arXiv preprint arXiv:1511.03677 (2015).

[15] Ryo Akita et al. “Deep learning for stock prediction using numerical and textual information”. In: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS). IEEE, 2016, pp. 1–6.

[16] Pankaj Malhotra et al. “LSTM-based encoder-decoder for multi-sensor anomaly detection”. In: arXiv preprint arXiv:1607.00148 (2016).

[17] Aaron van den Oord et al. “WaveNet: A generative model for raw audio”. In: arXiv preprint arXiv:1609.03499 (2016).

[18] Wei Bao, Jun Yue, and Yulei Rao. “A deep learning framework for financial time series using stacked autoencoders and long-short term memory”. In: PLoS ONE 12.7 (2017), e0180944.

[19] Yao Qin et al. “A dual-stage attention-based recurrent neural network for time series prediction”. In: arXiv preprint arXiv:1704.02971 (2017).

[20] Zheng Zhao et al. “LSTM network: a deep learning approach for short-term traffic forecast”. In: IET Intelligent Transport Systems 11.2 (2017), pp. 68–75.
